Michael 070e3c9f06 docs(gui): document the new GUI test layer
REQUIREMENTS §16 updates the test count (1777 → 1916) and breaks out
the GUI subset. DEVELOPER's Tests section gains the 'gui' marker
recipes and the new tests/gui/ tree under test layout, plus a short
'GUI test layer' explainer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:13:40 +00:00

Developer Guide

Architecture, data flow, extension points.

Architecture

CLI (src/cli*.py)         GUI (src/gui/app.py + pages/)
     │                          │
     └──────────┐    ┌──────────┘
                ▼    ▼
            ┌────────────────┐
            │   src/core/    │
            └────────────────┘

Core/UI rule: business logic in core/ only. CLI + GUI translate user input → core call → display result.

Module map

Module                   Public surface
i18n                     t(key, lang=None, **fmt), current_language(), set_language(), render_language_selector(), LANGUAGES
core.dedup               deduplicate(), MatchStrategy, ColumnMatchStrategy, Algorithm, SurvivorRule, DeduplicationResult, MatchResult, build_default_strategies()
core.normalizers         normalize_email/phone/name/address/string, NormalizerType, get_normalizer()
core.io                  read_file(), write_file(), list_sheets(), detect_encoding/delimiter/header_row, repair_bytes()
core.config              DeduplicationConfig.from_file/to_file/to_strategies/to_survivor_rule
core.analyze             analyze(), Finding, findings_by_tool(), _NULL_LIKE
core.fixes               @register("fix_id") decorator, get_fix(), available_actions()
core.normalize           auto_fix(), apply_decisions(), NormalizationResult, is_normalized()
core.text_clean          clean_dataframe(), CleanOptions, CleanResult, smart_title_case()
core.format_standardize  standardize_dataframe(), StandardizeOptions, StandardizeResult, FieldType, per-cell standardize_*()
core.errors              DataToolsError hierarchy, ensure_dataframe(), ensure_choice(), wrap_file_read/write(), format_for_user()
core._constants          US_STATE_NAMES, US_STATE_CODES, USPS_EXPANSIONS, USPS_COMPRESSIONS

Data flow — Deduplicator

read_file()                       # auto-detect encoding, delimiter, header
   ▼ DataFrame
build_default_strategies()        # if no explicit strategies
   ▼                              # strong keys (email, phone) → standalone OR
                                  # weak keys (name, address) → AND with strong
_apply_normalizations()           # add _norm_* shadow columns
   ▼
_find_match_groups()              # O(n²) pair compare, OR strategies, union-find
   ▼
[review_callback()]               # optional interactive review
   ▼
_select_survivor()                # per group: first/last/most-complete/most-recent
   ▼
[_merge_group()]                  # optional: fill blanks from losers
   ▼
DeduplicationResult               # deduplicated_df, removed_df, match_groups, log
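The grouping step in the flow above (pairwise compare, OR across strategies, union-find) can be sketched in isolation. This is a minimal illustration, not the project's _find_match_groups(): the row layout, the Strategy predicate type, and the helper names same_email / same_phone are assumptions made for the example.

```python
from itertools import combinations
from typing import Callable

Record = dict[str, str]
Strategy = Callable[[Record, Record], bool]  # hypothetical: True when two rows match


def find_match_groups(rows: list[Record], strategies: list[Strategy]) -> list[list[int]]:
    """Group row indices whose pairs match under ANY strategy (OR semantics)."""
    parent = list(range(len(rows)))

    def find(i: int) -> int:
        # Path halving keeps the trees shallow.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # O(n^2): every unordered pair is compared once.
    for a, b in combinations(range(len(rows)), 2):
        if any(match(rows[a], rows[b]) for match in strategies):
            union(a, b)

    groups: dict[int, list[int]] = {}
    for i in range(len(rows)):
        groups.setdefault(find(i), []).append(i)
    # Only groups with more than one member are duplicates.
    return [members for members in groups.values() if len(members) > 1]


def same_email(a: Record, b: Record) -> bool:
    return bool(a["email"]) and a["email"] == b["email"]


def same_phone(a: Record, b: Record) -> bool:
    return bool(a["phone"]) and a["phone"] == b["phone"]


rows = [
    {"email": "a@x.com", "phone": "111"},
    {"email": "a@x.com", "phone": "222"},
    {"email": "b@x.com", "phone": "222"},
    {"email": "c@x.com", "phone": "333"},
]
groups = find_match_groups(rows, [same_email, same_phone])
```

Rows 0 and 1 share an email and rows 1 and 2 share a phone, so union-find merges all three into one group even though rows 0 and 2 never match each other directly.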

Extension recipes

Add a normalizer

  1. Add function to core/normalizers.py:
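    import re                      # module-level imports, if not already present
    from typing import Optional

    def normalize_company(value: Optional[str]) -> str:
        # Empty or non-string input normalizes to "".
        if not value or not isinstance(value, str):
            return ""
        name = value.strip().casefold()
        # Drop a trailing legal suffix, e.g. "acme inc." -> "acme".
        for sfx in ("inc", "llc", "corp", "ltd", "co"):
            name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
        return name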
    
  2. Register: add COMPANY = "company" to NormalizerType + entry in _NORMALIZER_MAP.
  3. Auto-detect (optional): add a _COLUMN_TYPE_PATTERNS row in core/dedup.py.

Add a fuzzy algorithm

  1. Add value to Algorithm enum in core/dedup.py.
  2. Add case in _compute_similarity().
  3. Document the value in CLI help text.
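As a rough sketch of steps 1 and 2, the block below adds a hypothetical TOKEN_SET member and its case. The enum members other than EXACT, the 0 to 100 score scale, and the use of difflib are assumptions; the real _compute_similarity() may rely on a different library.

```python
from difflib import SequenceMatcher
from enum import Enum


class Algorithm(Enum):  # stand-in; the real enum has its own members
    EXACT = "exact"
    RATIO = "ratio"
    TOKEN_SET = "token_set"  # step 1: hypothetical new value


def compute_similarity(a: str, b: str, algorithm: Algorithm) -> float:
    """Return a score from 0 to 100; every enum member needs its own case."""
    if algorithm is Algorithm.EXACT:
        return 100.0 if a == b else 0.0
    if algorithm is Algorithm.RATIO:
        return 100.0 * SequenceMatcher(None, a, b).ratio()
    if algorithm is Algorithm.TOKEN_SET:
        # step 2: Jaccard overlap of whitespace tokens, order-insensitive
        ta, tb = set(a.split()), set(b.split())
        if not ta and not tb:
            return 100.0
        return 100.0 * len(ta & tb) / len(ta | tb)
    raise AssertionError(f"unhandled algorithm: {algorithm!r}")
```

Ending the dispatch with an AssertionError means a new enum value added without its case fails loudly on first use.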

Add a survivor rule

  1. Add value to SurvivorRule enum.
  2. Add branch in _select_survivor().
  3. Add CLI mapping.
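The three steps might look like this in miniature. LONGEST_NAME is a made-up rule, rows are plain dicts rather than DataFrame rows, and the most-recent rule is left out for brevity; none of this is the project's actual _select_survivor().

```python
from enum import Enum
from typing import Optional

Row = dict[str, Optional[str]]


class SurvivorRule(Enum):  # stand-in for the real enum
    FIRST = "first"
    LAST = "last"
    MOST_COMPLETE = "most-complete"
    LONGEST_NAME = "longest-name"  # step 1: hypothetical new rule


def _filled(row: Row) -> int:
    # Count cells that are neither None nor empty.
    return sum(1 for v in row.values() if v not in (None, ""))


def select_survivor(group: list[Row], rule: SurvivorRule) -> int:
    """Return the index of the row to keep from one match group."""
    if rule is SurvivorRule.FIRST:
        return 0
    if rule is SurvivorRule.LAST:
        return len(group) - 1
    if rule is SurvivorRule.MOST_COMPLETE:
        return max(range(len(group)), key=lambda i: _filled(group[i]))
    if rule is SurvivorRule.LONGEST_NAME:
        # step 2: new branch; max() keeps the earliest row on ties
        return max(range(len(group)), key=lambda i: len(group[i].get("name") or ""))
    raise AssertionError(f"unhandled rule: {rule!r}")


# Step 3: the CLI maps a flag value to the enum member.
CLI_SURVIVOR_CHOICES = {rule.value: rule for rule in SurvivorRule}
```

Deriving the CLI choices from the enum keeps the flag values and the rules from drifting apart; whether the real CLI does this is not stated in this guide.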

Add a fix + detector (analyzer/gate)

  1. Detector in core/analyze.py: add _detect_<thing>(df) -> list[Finding], hook into the main analyze() pipeline. Emit Finding with a unique fix_action id.
  2. Fix in core/fixes.py:
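    @register("fix_id")
    def my_fix(df, payload=None) -> tuple[pd.DataFrame, int]:
        out_df = df.copy()    # work on a copy of the input frame
        cells_changed = 0     # number of cells the fix modified
        # ... apply the fix to out_df and update cells_changed ...
        return out_df, cells_changed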
    
  3. Constant in core/analyze.py: add FIX_<NAME> = "fix_id" so the detector and fix can reference it.

No other call sites change. Gate auto-discovers it via the registry.
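Auto-discovery works because the decorator records each fix in a shared mapping at import time. The sketch below shows that pattern under simplifying assumptions: rows are plain dicts instead of a DataFrame, the payload argument is omitted, and FIX_TRIM_WHITESPACE is a hypothetical id. It is not the code in core/fixes.py.

```python
from typing import Callable

Table = list[dict[str, str]]  # plain rows stand in for a DataFrame
FixFn = Callable[[Table], tuple[Table, int]]

_REGISTRY: dict[str, FixFn] = {}


def register(fix_id: str) -> Callable[[FixFn], FixFn]:
    """Decorator: record the function under fix_id and return it unchanged."""
    def decorator(fn: FixFn) -> FixFn:
        if fix_id in _REGISTRY:
            raise ValueError(f"duplicate fix id: {fix_id}")
        _REGISTRY[fix_id] = fn
        return fn
    return decorator


def get_fix(fix_id: str) -> FixFn:
    return _REGISTRY[fix_id]


def available_actions() -> list[str]:
    return sorted(_REGISTRY)


FIX_TRIM_WHITESPACE = "trim_whitespace"  # hypothetical id shared by detector and fix


@register(FIX_TRIM_WHITESPACE)
def trim_whitespace(rows: Table) -> tuple[Table, int]:
    changed = 0
    out: Table = []
    for row in rows:
        new_row = {k: v.strip() for k, v in row.items()}
        changed += sum(1 for k in row if row[k] != new_row[k])
        out.append(new_row)
    return out, changed
```

A caller that only knows the id from a Finding can then run get_fix(finding_id)(rows) without importing the fix function itself.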

i18n — language packs

The GUI's user-facing strings live in src/i18n/packs/<code>.json, keyed by ISO-639-1 code. English (en.json) is canonical; missing keys in other packs fall back to English, and missing keys in English fall back to the literal dotted key so a typo is visible rather than silent.

Look up a string in code:

from src.i18n import t
st.button(t("upload.run_button"))
st.warning(t("gate.warning", name=filename))   # {name} interpolated via str.format

t() reads the active language from st.session_state["ui_lang"]. Outside a Streamlit run (tests, scripts) it falls back to English.
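The fallback chain (active pack, then English, then the literal key) can be sketched without Streamlit. The in-memory PACKS dict and the _lookup helper are assumptions for the example; the real t() loads JSON packs and reads the language from session state instead of taking it only as an argument.

```python
from typing import Any, Optional

# Tiny in-memory packs; the real ones are JSON files under src/i18n/packs/.
PACKS: dict[str, dict[str, Any]] = {
    "en": {"upload": {"run_button": "Run"}, "gate": {"warning": "Check {name} first"}},
    "es": {"upload": {"run_button": "Ejecutar"}},
}


def _lookup(pack: dict[str, Any], dotted: str) -> Optional[str]:
    """Walk a nested pack along a dotted key; None when any part is missing."""
    node: Any = pack
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node if isinstance(node, str) else None


def t(key: str, lang: Optional[str] = None, **fmt: Any) -> str:
    """Active pack, then English, then the literal dotted key."""
    text = _lookup(PACKS.get(lang or "en", {}), key)
    if text is None:
        text = _lookup(PACKS["en"], key)
    if text is None:
        return key  # a typo stays visible in the UI
    return text.format(**fmt) if fmt else text
```

With these packs, t("gate.warning", "es", name="a.csv") falls back to the English template because the Spanish pack lacks that key.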

Add a new language:

  1. Copy src/i18n/packs/en.json to src/i18n/packs/<code>.json and translate values in place. Keep the key tree identical.
  2. Add a one-line entry to LANGUAGES in src/i18n/__init__.py: {"code": "fr", "label": "Français"}. The sidebar picker auto-renders.
  3. Run pytest tests/test_lang_packs.py — the parity test fails until every key from en.json exists in the new pack (and orphan keys not in English are also flagged).

Add a new key:

  1. Add it to en.json first (canonical pack).
  2. Add it to every other registered pack in the same commit. The parity test enforces this.
  3. Use the dotted key at the call site: t("section.subsection.key") or t("section.key", name=value) for placeholder interpolation.
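The parity rule (every English key present, no orphan keys) reduces to a set comparison over flattened dotted keys. The helper names and sample packs below are illustrative assumptions and are not taken from tests/test_lang_packs.py.

```python
from typing import Any


def flatten_keys(pack: dict[str, Any], prefix: str = "") -> set[str]:
    """Collect dotted paths to every leaf value in a nested pack."""
    keys: set[str] = set()
    for name, value in pack.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            keys |= flatten_keys(value, path)
        else:
            keys.add(path)
    return keys


def parity_report(canonical: dict[str, Any], other: dict[str, Any]) -> tuple[set[str], set[str]]:
    """Return (missing, orphan) keys of `other` relative to the canonical pack."""
    want, have = flatten_keys(canonical), flatten_keys(other)
    return want - have, have - want


en = {"home": {"title": "Home"}, "upload": {"run_button": "Run", "hint": "Pick a file"}}
fr = {"home": {"title": "Accueil"}, "upload": {"run_button": "Lancer", "old_key": "?"}}
missing, orphan = parity_report(en, fr)
```

Here the French pack is missing upload.hint and carries the orphan upload.old_key, so a parity test built on this check would fail on both counts.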

Authoring rules:

  • Keys live under semantic sections (home.*, upload.*, findings.*, tools.<id>.name). Don't nest by language or by tool unless the string is genuinely tool-specific.
  • Use {named} placeholders (not positional {0}) so translators see what's being interpolated.
  • Strings can contain Streamlit markdown (**bold**) — pass through st.markdown / st.caption as usual.
  • Do not put strings inside the farewell-overlay JS payload without going through _js_html_safe() in src/gui/components/_legacy.py; the helper escapes both the JS string terminator and HTML special chars. The test TestFarewellEscape pins that contract.
  • The sidebar picker is mounted by hide_streamlit_chrome(), so every page that calls that helper automatically gets the picker. Pages that don't call it (rare) can call render_language_selector() directly.

Add a format-standardizer field type

  1. Add value to FieldType enum in core/format_standardize.py.
  2. Add per-cell standardize_<x>(value, *, …) returning (new_value, changed).
  3. Add option fields to StandardizeOptions (with defaults that preserve existing behavior).
  4. Wire into _apply_field_type() dispatcher (the else branch raises AssertionError — every enum value needs a branch).
  5. Add validation entry in StandardizeOptions.from_dict() for any new enum-shaped option.
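Steps 1, 2, and 4 fit together roughly as follows. ZIP is a hypothetical new member, and the function bodies are simplified stand-ins; only the (new_value, changed) return shape and the AssertionError in the else branch come from the steps above.

```python
from enum import Enum


class FieldType(Enum):  # stand-in for the real enum
    PHONE = "phone"
    ZIP = "zip"  # step 1: hypothetical new member


def standardize_phone(value: str) -> tuple[str, bool]:
    digits = "".join(ch for ch in value if ch.isdigit())
    return digits, digits != value


def standardize_zip(value: str, *, pad_to: int = 5) -> tuple[str, bool]:
    # step 2: per-cell function returning (new_value, changed)
    stripped = value.strip()
    new = stripped.zfill(pad_to) if stripped.isdigit() else value
    return new, new != value


def apply_field_type(value: str, field_type: FieldType) -> tuple[str, bool]:
    # step 4: one branch per enum member, loud failure otherwise
    if field_type is FieldType.PHONE:
        return standardize_phone(value)
    elif field_type is FieldType.ZIP:
        return standardize_zip(value)
    else:
        raise AssertionError(f"no branch for {field_type!r}")
```

Returning the changed flag alongside the value lets the caller count modified cells without comparing old and new values a second time.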

Errors

Use core/errors.py instead of raw ValueError / OSError:

Pattern                                         Use
Bad arg, wrong type, missing column             InputValidationError
Bad config / options file                       ConfigError
File parses but isn't what we expected          FileFormatError
File I/O failure (perms, missing, disk full)    FileAccessError
Internal invariant broken (unreachable branch)  AssertionError

Helpers:

  • ensure_dataframe(value, function="my_func") at every public entry that takes a df.
  • ensure_choice(value, name="mode", choices=[...]) at every entry that takes a literal.
  • wrap_file_read(path, "operation", exc) / wrap_file_write(...) when wrapping OSError.

GUI / CLI handlers: use format_for_user(exc, context="...") to render.

All DataToolsError subclasses extend stdlib ValueError or OSError so existing handlers still catch them.
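The backward-compatibility claim rests on multiple inheritance from the stdlib exception types. The sketch below mirrors the names used in this section, but the class bodies, the message formats, and the return value of ensure_choice() are assumptions rather than the contents of core/errors.py.

```python
from typing import Iterable


class DataToolsError(Exception):
    """Base class for the sketch; the real hierarchy lives in core/errors.py."""


class InputValidationError(DataToolsError, ValueError):
    pass


class FileAccessError(DataToolsError, OSError):
    pass


def ensure_choice(value: str, *, name: str, choices: Iterable[str]) -> str:
    allowed = list(choices)
    if value not in allowed:
        raise InputValidationError(f"{name} must be one of {allowed}, got {value!r}")
    return value


def format_for_user(exc: Exception, *, context: str) -> str:
    return f"{context}: {exc}"


def legacy_caller(mode: str) -> str:
    try:
        ensure_choice(mode, name="mode", choices=["first", "last"])
    except ValueError as exc:  # a pre-existing handler still catches the new type
        return format_for_user(exc, context="dedup")
    return "ok"
```

Code written before the hierarchy existed keeps working, while newer handlers can catch DataToolsError to cover every project-specific failure at once.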

Tests

# All (core + CLI + GUI)
pytest -q
# Quick loop — skip the GUI layer
pytest -q -m 'not gui'
# Only the GUI tests
pytest -q -m gui
# By module
pytest tests/test_dedup.py
# Only the slow / integration tests
pytest -m slow
# Single test
pytest tests/test_dedup.py::TestExactMatch::test_basic

Test layout:

tests/
├── conftest.py                        # core/CLI fixtures
├── test_dedup.py · test_normalizers.py · test_io.py · test_config.py
├── test_analyze.py · test_normalize.py · test_text_clean.py
├── test_format_standardize.py
├── test_format_standardize_corpus.py  # 199-row buyer corpus
├── test_audit_fixes.py · test_errors.py · test_fixes_unit.py
├── test_corpus.py · test_encodings_corpus.py · test_fixtures_sweep.py
├── test_cli.py · test_cli_*.py · test_e2e.py · test_install.py
├── test_perf_regressions.py           # shape pins for the perf wins
└── gui/                               # Streamlit AppTest-driven tests
    ├── conftest.py                    # AppTest fixtures + helpers
    ├── _findings_panel_harness.py     # isolated component test page
    ├── test_smoke.py                  # every page renders in EN + ES
    ├── test_chrome.py                 # language selector, hide_chrome
    ├── test_gate.py                   # require_normalization_gate
    ├── test_workflows.py              # happy path per Ready tool
    ├── test_dedup_review.py           # match-group card interactions
    ├── test_advanced_panels.py        # config_panel widgets
    ├── test_errors.py                 # malformed-upload error paths
    └── test_findings_panel.py         # analyzer findings rendering

GUI test layer

GUI tests drive pages with streamlit.testing.v1.AppTest — in-process, no browser, no display. They pre-populate st.session_state with stashed-upload bytes (via the stash_upload() helper in tests/gui/conftest.py) and either click buttons via app.button[i].click().run() or assert on the session_state after the run.

The gui marker is registered in pytest.ini. A plain pytest run executes everything; pytest -m 'not gui' skips the GUI tests for a faster core-only loop. Coming-Soon stubs are pinned by the smoke tests, so a regression such as an import error or a missing widget shows up immediately.

Fixture corpora: test-cases/text-cleaner-corpus/ (21 files) · test-cases/encodings-corpus/ (31 files) · test-cases/format-cleaner-corpus/ (7 files + spec).

Known limitations

  • Dedup pair-compare is O(n²) for fuzzy strategies. Exact-only strategies (every column uses Algorithm.EXACT at threshold 100) now route through an O(n) groupby fast path automatically — no API change. Fuzzy strategies can opt into prefix blocking via deduplicate(..., blocking_columns=[...], blocking_prefix_len=1) to partition pairs by a cheap key (trades recall for speed).
  • Threading is opt-in for format_standardize: StandardizeOptions.parallel_columns > 1 uses a thread pool. On CPython 3.12 the GIL keeps the speedup roughly neutral; the scaffolding is in place for free-threaded Python 3.13+.
  • Memory-bound — entire file loaded into pandas. Streaming reads exist but not integrated with the dedup engine.
  • No multi-sheet dedup — each Excel sheet processed independently.
  • Phonenumbers minimum-length — international numbers without country codes fall back to digits-only.
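The recall-for-speed trade of prefix blocking, mentioned in the first limitation above, is easiest to see in a toy version. The function below is an illustrative assumption about how blocking partitions candidate pairs; it is not the implementation behind deduplicate(..., blocking_columns=[...]).

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterator

Row = dict[str, str]


def blocked_pairs(
    rows: list[Row], blocking_columns: list[str], prefix_len: int = 1
) -> Iterator[tuple[int, int]]:
    """Yield index pairs that share a cheap key; other pairs are never compared."""
    blocks: dict[tuple[str, ...], list[int]] = defaultdict(list)
    for i, row in enumerate(rows):
        key = tuple(row.get(col, "")[:prefix_len].casefold() for col in blocking_columns)
        blocks[key].append(i)
    for members in blocks.values():
        yield from combinations(members, 2)


rows = [{"name": "Alice"}, {"name": "alicia"}, {"name": "Bob"}, {"name": "Elise"}]
pairs = sorted(blocked_pairs(rows, ["name"]))
```

Four rows would need six comparisons without blocking; with a one-character prefix on name only the Alice/alicia pair is compared. The cost is that "Alice" and "Elise" are never scored against each other, which is the recall loss the limitation describes.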