datatools-dev/docs/DEVELOPER.md
Michael 38011872e1 docs(i18n): document language packs across user, dev, and marketing docs
README + USER-GUIDE describe the sidebar picker and current coverage
(home + shared chrome, per-tool bodies pending). DEVELOPER gains a
how-to for adding packs and keys with the parity-test guarantee.
TECHNICAL §10b records the in-house-JSON architecture and locks in the
no-gettext decision (also logged in DECISIONS). REQUIREMENTS reflects
the new interface surface and updated test count. COPY.md adds a
"Language claim" slot so landing/email work can pick it up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:16:24 +00:00

Developer Guide

Architecture, data flow, extension points.

Architecture

CLI (src/cli*.py)         GUI (src/gui/app.py + pages/)
     │                          │
     └──────────┐    ┌──────────┘
                ▼    ▼
            ┌────────────────┐
            │   src/core/    │
            └────────────────┘

Core/UI rule: business logic in core/ only. CLI + GUI translate user input → core call → display result.

Module map

| Module | Public surface |
|---|---|
| i18n | t(key, lang=None, **fmt), current_language(), set_language(), render_language_selector(), LANGUAGES |
| core.dedup | deduplicate(), MatchStrategy, ColumnMatchStrategy, Algorithm, SurvivorRule, DeduplicationResult, MatchResult, build_default_strategies() |
| core.normalizers | normalize_email/phone/name/address/string, NormalizerType, get_normalizer() |
| core.io | read_file(), write_file(), list_sheets(), detect_encoding/delimiter/header_row, repair_bytes() |
| core.config | DeduplicationConfig.from_file/to_file/to_strategies/to_survivor_rule |
| core.analyze | analyze(), Finding, findings_by_tool(), _NULL_LIKE |
| core.fixes | @register("fix_id") decorator, get_fix(), available_actions() |
| core.normalize | auto_fix(), apply_decisions(), NormalizationResult, is_normalized() |
| core.text_clean | clean_dataframe(), CleanOptions, CleanResult, smart_title_case() |
| core.format_standardize | standardize_dataframe(), StandardizeOptions, StandardizeResult, FieldType, per-cell standardize_*() |
| core.errors | DataToolsError hierarchy, ensure_dataframe(), ensure_choice(), wrap_file_read/write(), format_for_user() |
| core._constants | US_STATE_NAMES, US_STATE_CODES, USPS_EXPANSIONS, USPS_COMPRESSIONS |

Data flow — Deduplicator

read_file()                       # auto-detect encoding, delimiter, header
   ▼ DataFrame
build_default_strategies()        # if no explicit strategies
   ▼                              # strong keys (email, phone) → standalone OR
                                  # weak keys (name, address) → AND with strong
_apply_normalizations()           # add _norm_* shadow columns
   ▼
_find_match_groups()              # O(n²) pair compare, OR strategies, union-find
   ▼
[review_callback()]               # optional interactive review
   ▼
_select_survivor()                # per group: first/last/most-complete/most-recent
   ▼
[_merge_group()]                  # optional: fill blanks from losers
   ▼
DeduplicationResult               # deduplicated_df, removed_df, match_groups, log
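
The grouping step can be pictured with a small stand-alone sketch. It is not the code in core/dedup.py: find_match_groups, same_email_or_phone, and the dict-based rows are hypothetical stand-ins, assuming only what the flow above states (pairwise comparison, OR across strategies, union-find to merge transitive matches).

```python
from itertools import combinations


def find_match_groups(rows, is_match):
    """Group row indices whose pairs satisfy is_match, transitively."""
    parent = list(range(len(rows)))

    def find(i):
        # Walk to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # O(n^2): every pair of rows is compared once.
    for a, b in combinations(range(len(rows)), 2):
        if is_match(rows[a], rows[b]):
            parent[find(b)] = find(a)

    groups = {}
    for i in range(len(rows)):
        groups.setdefault(find(i), []).append(i)
    # Only groups with more than one member are duplicates.
    return [members for members in groups.values() if len(members) > 1]


def same_email_or_phone(x, y):
    # OR across strategies: one matching strong key is enough.
    return any(x[k] and x[k] == y[k] for k in ("email", "phone"))


rows = [
    {"email": "a@x.com", "phone": "111"},
    {"email": "a@x.com", "phone": "222"},
    {"email": "b@x.com", "phone": "222"},
    {"email": "c@x.com", "phone": "333"},
]
groups = find_match_groups(rows, same_email_or_phone)
```

Rows 0 and 2 share no key directly, yet they land in the same group because each matches row 1. That transitive merge is what union-find contributes.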

Extension recipes

Add a normalizer

  1. Add function to core/normalizers.py:
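    import re
    from typing import Optional

    def normalize_company(value: Optional[str]) -> str:
        """Casefold a company name and strip common legal suffixes."""
        if not value or not isinstance(value, str):
            return ""
        name = value.strip().casefold()
        for sfx in ("inc", "llc", "corp", "ltd", "co"):
            # Drop a trailing suffix, with or without a final period.
            name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
        return name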
    
  2. Register: add COMPANY = "company" to NormalizerType + entry in _NORMALIZER_MAP.
  3. Auto-detect (optional): add a _COLUMN_TYPE_PATTERNS row in core/dedup.py.
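
A rough sketch of how step 2 might look once wired up. The real shapes of NormalizerType, _NORMALIZER_MAP, and get_normalizer() in core/normalizers.py are not shown in this guide, so the enum base class, the map layout, and the lookup body below are assumptions for illustration only.

```python
import re
from enum import Enum
from typing import Callable, Dict, Optional


def normalize_email(value: Optional[str]) -> str:
    # Simplified stand-in for the existing normalizer.
    return value.strip().casefold() if isinstance(value, str) else ""


def normalize_company(value: Optional[str]) -> str:
    if not value or not isinstance(value, str):
        return ""
    name = value.strip().casefold()
    for sfx in ("inc", "llc", "corp", "ltd", "co"):
        name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
    return name


class NormalizerType(str, Enum):
    EMAIL = "email"
    COMPANY = "company"  # the new member


# Assumed layout: one entry per enum member.
_NORMALIZER_MAP: Dict[NormalizerType, Callable[[Optional[str]], str]] = {
    NormalizerType.EMAIL: normalize_email,
    NormalizerType.COMPANY: normalize_company,
}


def get_normalizer(kind: str) -> Callable[[Optional[str]], str]:
    # Accepts the string value; raises ValueError for unknown kinds.
    return _NORMALIZER_MAP[NormalizerType(kind)]


cleaned = get_normalizer("company")("Acme Inc.")
```

With the map entry in place, callers that resolve normalizers by name pick up the new type without further changes.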

Add a fuzzy algorithm

  1. Add value to Algorithm enum in core/dedup.py.
  2. Add case in _compute_similarity().
  3. Document the value in CLI help text.
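
As an illustration of steps 1 and 2, here is a minimal stand-alone sketch. The enum members, the TOKEN_SET algorithm, and the body of compute_similarity are hypothetical; the actual Algorithm enum and _compute_similarity() in core/dedup.py may be organized differently.

```python
from difflib import SequenceMatcher
from enum import Enum


class Algorithm(Enum):
    EXACT = "exact"
    RATIO = "ratio"
    TOKEN_SET = "token_set"  # step 1: the new (hypothetical) value


def compute_similarity(a: str, b: str, algorithm: Algorithm) -> float:
    """Return a similarity score in [0.0, 1.0]."""
    if algorithm is Algorithm.EXACT:
        return 1.0 if a == b else 0.0
    if algorithm is Algorithm.RATIO:
        return SequenceMatcher(None, a, b).ratio()
    if algorithm is Algorithm.TOKEN_SET:
        # Step 2: the new case. Jaccard overlap of whitespace tokens,
        # so word order does not matter.
        ta, tb = set(a.split()), set(b.split())
        if not ta and not tb:
            return 1.0
        return len(ta & tb) / len(ta | tb)
    raise AssertionError(f"unhandled Algorithm: {algorithm!r}")


score = compute_similarity("john a smith", "smith john", Algorithm.TOKEN_SET)
```

Ending the dispatcher with an AssertionError means a forgotten case fails loudly instead of silently scoring zero.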

Add a survivor rule

  1. Add value to SurvivorRule enum.
  2. Add branch in _select_survivor().
  3. Add CLI mapping.
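
The same recipe, sketched on plain dict rows rather than DataFrames. The enum string values, the LONGEST_NAME rule, and the select_survivor signature are assumptions for illustration; _select_survivor() in core/dedup.py is not reproduced here.

```python
from enum import Enum


class SurvivorRule(Enum):
    FIRST = "first"
    LAST = "last"
    MOST_COMPLETE = "most-complete"
    LONGEST_NAME = "longest-name"  # step 1: hypothetical new rule


def _filled_count(row: dict) -> int:
    # Number of non-blank cells in the row.
    return sum(1 for v in row.values() if v not in (None, ""))


def select_survivor(group: list, rule: SurvivorRule) -> dict:
    if rule is SurvivorRule.FIRST:
        return group[0]
    if rule is SurvivorRule.LAST:
        return group[-1]
    if rule is SurvivorRule.MOST_COMPLETE:
        return max(group, key=_filled_count)
    if rule is SurvivorRule.LONGEST_NAME:
        # Step 2: the new branch.
        return max(group, key=lambda row: len(row.get("name") or ""))
    raise AssertionError(f"unhandled SurvivorRule: {rule!r}")


group = [
    {"name": "Al", "email": ""},
    {"name": "Alan", "email": "a@x.com"},
]
survivor = select_survivor(group, SurvivorRule.LONGEST_NAME)
```

Because max() returns the first maximal element, ties resolve to the earliest row in the group.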

Add a fix + detector (analyzer/gate)

  1. Detector in core/analyze.py: add _detect_<thing>(df) -> list[Finding], hook into the main analyze() pipeline. Emit Finding with a unique fix_action id.
  2. Fix in core/fixes.py:
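    @register("fix_id")
    def my_fix(df, payload=None) -> tuple[pd.DataFrame, int]:
        out_df = df.copy()    # work on a copy of the input frame
        cells_changed = 0     # count of cells the fix rewrites
        # ... apply the fix to out_df and update cells_changed ...
        return out_df, cells_changed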
    
  3. Constant in core/analyze.py: add FIX_<NAME> = "fix_id" so the detector and fix can reference it.

No other call sites change. Gate auto-discovers it via the registry.
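
Why no other call sites change: a decorator-based registry lets consumers look fixes up by id. The sketch below shows the general pattern on plain dict rows. The internals of core/fixes.py are not shown in this guide, so _FIXES, the duplicate-id check, and the strip_whitespace fix are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Rows = List[dict]
FixFn = Callable[..., Tuple[Rows, int]]

_FIXES: Dict[str, FixFn] = {}  # assumed module-level registry


def register(fix_id: str):
    """Decorator that files a fix function under its id."""
    def decorator(fn: FixFn) -> FixFn:
        if fix_id in _FIXES:
            raise ValueError(f"duplicate fix id: {fix_id}")
        _FIXES[fix_id] = fn
        return fn
    return decorator


def get_fix(fix_id: str) -> FixFn:
    return _FIXES[fix_id]


def available_actions() -> List[str]:
    return sorted(_FIXES)


FIX_STRIP_WHITESPACE = "strip_whitespace"  # shared by detector and fix


@register(FIX_STRIP_WHITESPACE)
def strip_whitespace(rows: Rows, payload=None) -> Tuple[Rows, int]:
    out, changed = [], 0
    for row in rows:
        new_row = {}
        for key, value in row.items():
            cleaned = value.strip() if isinstance(value, str) else value
            changed += cleaned != value  # True counts as 1
            new_row[key] = cleaned
        out.append(new_row)
    return out, changed


fixed, cells_changed = get_fix(FIX_STRIP_WHITESPACE)([{"a": " x ", "b": "y"}])
```

Importing the module is enough to register the fix; anything that iterates available_actions() sees it immediately.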

i18n — language packs

The GUI's user-facing strings live in src/i18n/packs/<code>.json, keyed by ISO-639-1 code. English (en.json) is canonical; missing keys in other packs fall back to English, and missing keys in English fall back to the literal dotted key so a typo is visible rather than silent.

Look up a string in code:

import streamlit as st
from src.i18n import t

st.button(t("upload.run_button"))
st.warning(t("gate.warning", name=filename))   # {name} interpolated via str.format

t() reads the active language from st.session_state["ui_lang"]. Outside a Streamlit run (tests, scripts) it falls back to English.
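
The fallback chain described above (active pack, then English, then the literal key) can be sketched without Streamlit. This is an illustration, not the code in src/i18n: the in-memory PACKS dict, the _lookup helper, and the explicit lang default are assumptions, and the real t() reads the language from session state instead.

```python
from typing import Optional

# Hypothetical in-memory packs; the real ones are JSON files on disk.
PACKS = {
    "en": {
        "upload": {"run_button": "Run"},
        "gate": {"warning": "Check {name} first"},
    },
    "fr": {
        "upload": {"run_button": "Lancer"},
    },
}


def _lookup(pack: dict, dotted_key: str) -> Optional[str]:
    """Walk a nested pack along a dotted key; None if any part is missing."""
    node = pack
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node if isinstance(node, str) else None


def t(key: str, lang: Optional[str] = None, **fmt) -> str:
    lang = lang or "en"
    template = _lookup(PACKS.get(lang, {}), key)
    if template is None:
        template = _lookup(PACKS["en"], key)  # fall back to English
    if template is None:
        return key  # a typo shows up as the literal dotted key
    return template.format(**fmt) if fmt else template


label = t("upload.run_button", lang="fr")
```

Here t("gate.warning", lang="fr", name="a.csv") falls through to the English template, and t("gate.typo") returns the string "gate.typo" unchanged.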

Add a new language:

  1. Copy src/i18n/packs/en.json to src/i18n/packs/<code>.json and translate values in place. Keep the key tree identical.
  2. Add a one-line entry to LANGUAGES in src/i18n/__init__.py: {"code": "fr", "label": "Français"}. The sidebar picker auto-renders.
  3. Run pytest tests/test_lang_packs.py — the parity test fails until every key from en.json exists in the new pack (and orphan keys not in English are also flagged).
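
For orientation, a parity check of this kind usually flattens both key trees and compares the sets. The sketch below is an assumption about the approach, not the contents of tests/test_lang_packs.py; flatten_keys and parity_report are hypothetical names.

```python
def flatten_keys(tree: dict, prefix: str = "") -> set:
    """Return every leaf of a nested pack as a dotted key."""
    keys = set()
    for name, value in tree.items():
        dotted = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            keys |= flatten_keys(value, dotted)
        else:
            keys.add(dotted)
    return keys


def parity_report(canonical: dict, candidate: dict) -> dict:
    canon, other = flatten_keys(canonical), flatten_keys(candidate)
    return {
        "missing": sorted(canon - other),  # in en.json, absent from the pack
        "orphan": sorted(other - canon),   # in the pack, absent from en.json
    }


en = {"home": {"title": "Home", "intro": "Welcome"}, "upload": {"run_button": "Run"}}
fr = {"home": {"title": "Accueil"}, "upload": {"run_button": "Lancer", "extra": "?"}}
report = parity_report(en, fr)
```

A test built on this would assert that both lists are empty for every registered pack.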

Add a new key:

  1. Add it to en.json first (canonical pack).
  2. Add it to every other registered pack in the same commit. The parity test enforces this.
  3. Use the dotted key at the call site: t("section.subsection.key") or t("section.key", name=value) for placeholder interpolation.

Authoring rules:

  • Keys live under semantic sections (home.*, upload.*, findings.*, tools.<id>.name). Don't nest by language or by tool unless the string is genuinely tool-specific.
  • Use {named} placeholders (not positional {0}) so translators see what's being interpolated.
  • Strings can contain Streamlit markdown (**bold**) — pass through st.markdown / st.caption as usual.
  • Do not put strings inside the farewell-overlay JS payload without going through _js_html_safe() in src/gui/components/_legacy.py; the helper escapes both the JS string terminator and HTML special chars. The test TestFarewellEscape pins that contract.
  • The sidebar picker is mounted by hide_streamlit_chrome(), so every page that calls that helper automatically gets the picker. Pages that don't call it (rare) can call render_language_selector() directly.

Add a format-standardizer field type

  1. Add value to FieldType enum in core/format_standardize.py.
  2. Add per-cell standardize_<x>(value, *, …) returning (new_value, changed).
  3. Add option fields to StandardizeOptions (with defaults that preserve existing behavior).
  4. Wire into _apply_field_type() dispatcher (the else branch raises AssertionError — every enum value needs a branch).
  5. Add validation entry in StandardizeOptions.from_dict() for any new enum-shaped option.
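
Steps 1, 2, and 4 in miniature. Everything here is a hypothetical stand-in: the ZIP field type, the keep_plus4 option, and the simplified apply_field_type signature are not taken from core/format_standardize.py. The sketch follows only the stated contract, a per-cell function returning (new_value, changed) and a dispatcher whose else branch raises AssertionError.

```python
import re
from enum import Enum
from typing import Tuple


class FieldType(Enum):
    PHONE = "phone"
    ZIP = "zip"  # step 1: hypothetical new value


def standardize_phone(value: str) -> Tuple[str, bool]:
    digits = re.sub(r"\D", "", value)
    new = f"({digits[:3]}) {digits[3:6]}-{digits[6:]}" if len(digits) == 10 else value
    return new, new != value


def standardize_zip(value: str, *, keep_plus4: bool = True) -> Tuple[str, bool]:
    # Step 2: per-cell function returning (new_value, changed).
    digits = re.sub(r"\D", "", value)
    if len(digits) == 9 and keep_plus4:
        new = f"{digits[:5]}-{digits[5:]}"
    elif len(digits) in (5, 9):
        new = digits[:5]
    else:
        new = value  # unrecognised input is left untouched
    return new, new != value


def apply_field_type(value: str, field_type: FieldType) -> Tuple[str, bool]:
    # Step 4: every enum value needs its own branch.
    if field_type is FieldType.PHONE:
        return standardize_phone(value)
    elif field_type is FieldType.ZIP:
        return standardize_zip(value)
    else:
        raise AssertionError(f"unhandled FieldType: {field_type!r}")


result = apply_field_type("123456789", FieldType.ZIP)
```

Returning the changed flag alongside the value lets the caller count modified cells without re-comparing columns.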

Errors

Use core/errors.py instead of raw ValueError / OSError:

| Pattern | Use |
|---|---|
| Bad arg, wrong type, missing column | InputValidationError |
| Bad config / options file | ConfigError |
| File parses but isn't what we expected | FileFormatError |
| File I/O failure (perms, missing, disk full) | FileAccessError |
| Internal invariant broken (unreachable branch) | AssertionError |

Helpers:

  • ensure_dataframe(value, function="my_func") at every public entry that takes a df.
  • ensure_choice(value, name="mode", choices=[...]) at every entry that takes a literal.
  • wrap_file_read(path, "operation", exc) / wrap_file_write(...) when wrapping OSError.

GUI / CLI handlers: use format_for_user(exc, context="...") to render.

All DataToolsError subclasses extend stdlib ValueError or OSError so existing handlers still catch them.
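
A compact sketch of how such a hierarchy can stay compatible with existing handlers. Which stdlib base each class uses and the body of ensure_choice are assumptions made for illustration; core/errors.py is the authority.

```python
class DataToolsError(Exception):
    """Common root, so callers can catch everything from the toolkit."""


class InputValidationError(DataToolsError, ValueError):
    pass


class ConfigError(DataToolsError, ValueError):
    pass


class FileFormatError(DataToolsError, ValueError):
    pass


class FileAccessError(DataToolsError, OSError):
    pass


def ensure_choice(value, *, name, choices):
    """Return value if it is an allowed literal, else raise."""
    if value not in choices:
        raise InputValidationError(
            f"{name} must be one of {list(choices)!r}, got {value!r}"
        )
    return value


# A legacy handler written against ValueError still works.
try:
    ensure_choice("fastest", name="mode", choices=["first", "last"])
    caught = None
except ValueError as exc:
    caught = exc
```

The multiple inheritance is what keeps old `except ValueError` and `except OSError` blocks working while new code catches DataToolsError.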

Tests

# All
pytest -q
# By module
pytest tests/test_dedup.py
# Slow / integration tests only (-m selects by marker)
pytest -m slow
# Single test
pytest tests/test_dedup.py::TestExactMatch::test_basic

Test layout:

tests/
├── conftest.py                        # fixtures
├── test_dedup.py · test_normalizers.py · test_io.py · test_config.py
├── test_analyze.py · test_normalize.py · test_text_clean.py
├── test_format_standardize.py
├── test_format_standardize_corpus.py  # 199-row buyer corpus
├── test_audit_fixes.py · test_errors.py · test_fixes_unit.py
├── test_corpus.py · test_encodings_corpus.py · test_fixtures_sweep.py
└── test_cli.py · test_cli_*.py · test_e2e.py · test_install.py

Fixture corpora: test-cases/text-cleaner-corpus/ (21 files) · test-cases/encodings-corpus/ (31 files) · test-cases/format-cleaner-corpus/ (7 files + spec).

Known limitations

  • Dedup is O(n²) — no blocking. Works to ~50k rows. Future: partition by first letter / ZIP prefix.
  • Single-threaded — could benefit from multiprocessing.
  • Memory-bound — entire file loaded into pandas. Streaming reads exist but not integrated with dedup engine.
  • No multi-sheet dedup — each Excel sheet processed independently.
  • Phonenumbers minimum-length — international numbers without country codes fall back to digits-only.
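
The blocking idea mentioned under the O(n²) limitation can be sketched as follows. It is not implemented in the codebase; candidate_pairs and the ZIP-prefix block key are hypothetical and show only the general technique.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(rows, block_key):
    """Yield index pairs only for rows that share a block key."""
    blocks = defaultdict(list)
    for idx, row in enumerate(rows):
        blocks[block_key(row)].append(idx)
    for members in blocks.values():
        # Pairwise comparison is confined to each block.
        yield from combinations(members, 2)


rows = [
    {"name": "Smith", "zip": "90210"},
    {"name": "Smyth", "zip": "90211"},
    {"name": "Jones", "zip": "10001"},
    {"name": "Jonas", "zip": "10002"},
]
pairs = list(candidate_pairs(rows, lambda r: r["zip"][:3]))
```

Four rows produce two candidate pairs instead of six. The trade-off is recall: duplicates that fall into different blocks are never compared, so the block key has to be chosen with care.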