# Developer Guide

Architecture, data flow, extension points.

## Architecture

```
CLI (src/cli*.py)          GUI (src/gui/app.py + pages/)
       │                          │
       └──────────┐    ┌──────────┘
                  ▼    ▼
            ┌────────────────┐
            │   src/core/    │
            └────────────────┘
```

**Core/UI rule**: business logic in `core/` only. CLI + GUI translate user input → core call → display result.

## Module map

| Module | Public surface |
|--------|----------------|
| `i18n` | `t(key, lang=None, **fmt)`, `current_language()`, `set_language()`, `render_language_selector()`, `LANGUAGES` |
| `core.dedup` | `deduplicate()`, `MatchStrategy`, `ColumnMatchStrategy`, `Algorithm`, `SurvivorRule`, `DeduplicationResult`, `MatchResult`, `build_default_strategies()` |
| `core.normalizers` | `normalize_email/phone/name/address/string`, `NormalizerType`, `get_normalizer()` |
| `core.io` | `read_file()`, `write_file()`, `list_sheets()`, `detect_encoding/delimiter/header_row`, `repair_bytes()` |
| `core.config` | `DeduplicationConfig.from_file/to_file/to_strategies/to_survivor_rule` |
| `core.analyze` | `analyze()`, `Finding`, `findings_by_tool()`, `_NULL_LIKE` |
| `core.fixes` | `@register("fix_id")` decorator, `get_fix()`, `available_actions()` |
| `core.normalize` | `auto_fix()`, `apply_decisions()`, `NormalizationResult`, `is_normalized()` |
| `core.text_clean` | `clean_dataframe()`, `CleanOptions`, `CleanResult`, `smart_title_case()` |
| `core.format_standardize` | `standardize_dataframe()`, `StandardizeOptions`, `StandardizeResult`, `FieldType`, per-cell `standardize_*()` |
| `core.errors` | `DataToolsError` hierarchy, `ensure_dataframe()`, `ensure_choice()`, `wrap_file_read/write()`, `format_for_user()` |
| `core._constants` | `US_STATE_NAMES`, `US_STATE_CODES`, `USPS_EXPANSIONS`, `USPS_COMPRESSIONS` |

## Data flow — Find Duplicates

```
read_file()                    # auto-detect encoding, delimiter, header
    ▼ DataFrame
build_default_strategies()     # if no explicit strategies
    ▼                          #   strong keys (email, phone) → standalone OR
                               #   weak keys (name, address) → AND with strong
_apply_normalizations()        # add _norm_* shadow columns
    ▼
_find_match_groups()           # O(n²) pair compare, OR strategies, union-find
    ▼
[review_callback()]            # optional interactive review
    ▼
_select_survivor()             # per group: first/last/most-complete/most-recent
    ▼
[_merge_group()]               # optional: fill blanks from losers
    ▼
DeduplicationResult            # deduplicated_df, removed_df, match_groups, log
```

## Extension recipes

### Add a normalizer

1. Add function to `core/normalizers.py`:

   ```python
   import re
   from typing import Optional

   def normalize_company(value: Optional[str]) -> str:
       if not value or not isinstance(value, str):
           return ""
       name = value.strip().casefold()
       # Drop each known legal suffix (optionally dotted) from the end of the name.
       for sfx in ("inc", "llc", "corp", "ltd", "co"):
           name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
       return name
   ```

2. Register: add `COMPANY = "company"` to `NormalizerType` + entry in `_NORMALIZER_MAP`.
3. Auto-detect (optional): add a `_COLUMN_TYPE_PATTERNS` row in `core/dedup.py`.

### Add a fuzzy algorithm

1. Add value to `Algorithm` enum in `core/dedup.py`.
2. Add case in `_compute_similarity()`.
3. Document the value in CLI help text.

### Add a survivor rule

1. Add value to `SurvivorRule` enum.
2. Add branch in `_select_survivor()`.
3. Add CLI mapping.

### Add a fix + detector (analyzer/gate)

1. **Detector** in `core/analyze.py`: add `_detect_(df) -> list[Finding]` and hook it into the main `analyze()` pipeline. Emit a `Finding` with a unique `fix_action` id.
2. **Fix** in `core/fixes.py`:

   ```python
   @register("fix_id")
   def my_fix(df, payload=None) -> tuple[pd.DataFrame, int]:
       # ...
       return out_df, cells_changed
   ```

3. **Constant** in `core/analyze.py`: add `FIX_ = "fix_id"` so the detector and fix can reference it.

No other call sites change. The gate auto-discovers the fix via the registry.

### Tool page header — `render_tool_header(tool_id)`

Every tool page renders its title block via `render_tool_header(tool_id)` in `src/gui/components/_legacy.py` — do not call `st.title()` + `st.caption()` directly.
The helper renders:

- `tools..page_title` as the page title (left column).
- A **Help** popover button right of the title (icon `:material/help_outline:`, label from `help.button_label`). Clicking opens an `st.popover` containing the tool's `help_md` markdown body.
- `tools..page_caption` as the caption below.

All copy is i18n-driven; editors can tweak help text without touching Python. If a tool is missing its `help_md` key, the popover falls back to `help.missing_body`.

**`help_md` structure** (markdown, stored as a single string with `\n` line breaks in JSON):

```
**When to use**
- bullet 1
- bullet 2

**Steps**
1. numbered step
2. numbered step

**Examples**
- example 1
- example 2

**Tip** one-sentence pro tip.
```

Keep it short — the popover is intentionally compact. Mirror the structure across every tool so the muscle memory transfers.

### i18n — language packs

The GUI's user-facing strings live in `src/i18n/packs/.json`, keyed by ISO-639-1 code. English (`en.json`) is canonical; missing keys in other packs fall back to English, and missing keys in English fall back to the literal dotted key so a typo is visible rather than silent.

**Look up a string in code:**

```python
import streamlit as st
from src.i18n import t

st.button(t("upload.run_button"))
st.warning(t("gate.warning", name=filename))   # {name} interpolated via str.format
```

`t()` reads the active language from `st.session_state["ui_lang"]`. Outside a Streamlit run (tests, scripts) it falls back to English.

**Add a new language:**

1. Copy `src/i18n/packs/en.json` to `src/i18n/packs/.json` and translate values in place. Keep the key tree identical.
2. Add a one-line entry to `LANGUAGES` in `src/i18n/__init__.py`: `{"code": "fr", "label": "Français"}`. The sidebar picker auto-renders.
3. Run `pytest tests/test_lang_packs.py` — the parity test fails until every key from `en.json` exists in the new pack (and orphan keys not in English are also flagged).

**Add a new key:**

1. Add it to `en.json` first (canonical pack).
2. Add it to every other registered pack in the same commit. The parity test enforces this.
3. Use the dotted key at the call site: `t("section.subsection.key")` or `t("section.key", name=value)` for placeholder interpolation.

**Authoring rules:**

- Keys live under semantic sections (`home.*`, `upload.*`, `findings.*`, `help.*`, `tools..name`). Don't nest by language or by tool unless the string is genuinely tool-specific.
- Per-tool header copy lives under `tools..{page_title, page_caption, help_md}`. `page_caption` is the one-line subtitle under the title; `help_md` is the popover body (see *Tool page header* above). Top-level `help.button_label` / `help.missing_body` are shared across every tool.
- Use `{named}` placeholders (not positional `{0}`) so translators see what's being interpolated.
- Strings can contain Streamlit markdown (`**bold**`) — pass through `st.markdown` / `st.caption` as usual.
- Do **not** put strings inside the farewell-overlay JS payload without going through `_js_html_safe()` in `src/gui/components/_legacy.py`; the helper escapes both the JS string terminator and HTML special chars. The test `TestFarewellEscape` pins that contract.
- The sidebar picker is mounted by `hide_streamlit_chrome()`, so every page that calls that helper automatically gets the picker. Pages that don't call it (rare) can call `render_language_selector()` directly.

### Licensing

The license layer lives at ``src/license/``. The public API:

```python
from src.license import (
    get_manager, require_feature, current_state,
    FeatureFlag, Tier, License,
)

mgr = get_manager()
if not mgr.is_valid():
    raise RuntimeError("Not licensed")

require_feature(FeatureFlag.DEDUPLICATOR)
```

**Storage**: ``~/.datatools/license.json`` (override via ``DATATOOLS_LICENSE_PATH``). Signed with Ed25519 (asymmetric) — the seller's private key signs; the buyer's binary verifies with the embedded public key.
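The storage rule above (default path, environment override) could be sketched roughly as follows. This is an illustration only, not the code in ``src/license/``: the function name `resolve_license_path` and the `env` parameter are hypothetical, introduced so the lookup can be exercised without touching the real environment.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Default location named in this guide.
DEFAULT_LICENSE_PATH = Path.home() / ".datatools" / "license.json"


def resolve_license_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the license file path, honoring DATATOOLS_LICENSE_PATH if set.

    Hypothetical helper: `env` defaults to os.environ and exists only so
    tests can pass a plain dict.
    """
    env = os.environ if env is None else env
    override = env.get("DATATOOLS_LICENSE_PATH")
    if override:
        return Path(override).expanduser()
    return DEFAULT_LICENSE_PATH


custom = resolve_license_path({"DATATOOLS_LICENSE_PATH": "/tmp/lic.json"})
fallback = resolve_license_path({})
```

Passing the environment in as a mapping keeps the override testable in isolation, which is the same reason a test fixture might point the variable at a temporary directory.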

**Key material**:

| Variable | Who has it | Where it's used |
|---|---|---|
| ``DATATOOLS_LICENSE_PRIVKEY`` | Seller only | ``scripts/generate_license.py`` (mint a buyer's blob), ``scripts/generate_keypair.py`` writes a fresh one |
| ``DATATOOLS_LICENSE_PUBKEY`` | Every shipped binary | Verification at activation time; set at build time via PyInstaller env |

If neither env var is set, ``src.license.crypto`` falls back to the deterministic dev keypair in ``src/license/_dev_keypair.py``. The dev key is in source on purpose (so tests work without secrets), but a frozen build that's still using it is a build-config bug — ``assert_production_safe()`` refuses to start such a binary.

**First-time setup for shipped builds**:

1. ``python scripts/generate_keypair.py --output prod-keys.env`` — creates a fresh keypair.
2. Stash ``DATATOOLS_LICENSE_PRIVKEY`` somewhere safe (password manager / KMS). Lose it and you can't issue renewals without reshipping a new build with a new public key.
3. Configure the PyInstaller build env with ``DATATOOLS_LICENSE_PUBKEY=`` so the shipped binary verifies against the production key.
4. Mint buyer licenses with ``DATATOOLS_LICENSE_PRIVKEY= python scripts/generate_license.py ...``.

**Dev bypass**: ``DATATOOLS_DEV_MODE=1`` short-circuits every check. The test suite's autouse fixture sets this so existing tests don't need their own license fixtures. Tests that need the real check explicitly use ``isolated_license_path`` / ``activated_license_manager`` / ``unactivated_license_manager``.

**Adding a feature flag**:

1. Add the enum value to ``FeatureFlag`` in ``src/license/schema.py``.
2. Add it to the relevant tier's set in ``FEATURES_BY_TIER`` in ``src/license/features.py``.
3. Gate at the call site: ``require_feature(FeatureFlag.YOUR_FLAG)``.

**Adding a new tier**:

1. Add the enum value to ``Tier``.
2. Add a row to ``FEATURES_BY_TIER`` listing the unlocked flags.
3. Add ``license.tier_`` translation keys to every i18n pack.
4. The activation flow, sidebar status badge, feature gate, and home grid lock badge all pick up the new tier automatically.

**Worked example — the Lite tier**:

```python
# src/license/schema.py
class Tier(str, Enum):
    LITE = "lite"   # new
    CORE = "core"
    # ... remaining tiers unchanged

# src/license/features.py
FEATURES_BY_TIER = {
    # ... other tiers
    Tier.LITE: frozenset({
        FeatureFlag.DEDUPLICATOR,
        FeatureFlag.TEXT_CLEANER,
        FeatureFlag.FORMAT_STANDARDIZER,
    }),
    Tier.CORE: _all(),
    # ... other tiers
}
```

Then in en.json/es.json add ``license.tier_lite``. That's it — the existing ``require_feature_or_render_upgrade`` (GUI) and ``guard(feature=...)`` (CLI) calls in every tool page/CLI route a Lite user into the upgrade prompt for any tool the tier doesn't unlock. The home grid's lock badge fires off the same feature lookup.

**Minting a license** (creator-only):

```bash
DATATOOLS_LICENSE_PRIVKEY= \
python scripts/generate_license.py \
    --name "Jane Doe" --email jane@example.com \
    --tier core --years 1
```

The script prints a ``DTLIC1:`` blob to stdout — deliver this in the Gumroad / purchase email. The buyer pastes it into the activation page or runs ``python -m src.license_cli activate --name ...``.

### Add a format-standardizer field type

1. Add value to `FieldType` enum in `core/format_standardize.py`.
2. Add per-cell `standardize_(value, *, …)` returning `(new_value, changed)`.
3. Add option fields to `StandardizeOptions` (with defaults that preserve existing behavior).
4. Wire into `_apply_field_type()` dispatcher (the `else` branch raises `AssertionError` — every enum value needs a branch).
5. Add validation entry in `StandardizeOptions.from_dict()` for any new enum-shaped option.
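The recipe above can be sketched in miniature. Everything here is a hypothetical stand-in, not the code in `core/format_standardize.py`: the two-member `FieldType`, the `standardize_zip` function, its `keep_plus4` option, and the `apply_field_type` dispatcher are invented to show the `(new_value, changed)` contract and why every enum value needs its own branch. `PHONE` is deliberately left without a branch so the `AssertionError` path is visible.

```python
from enum import Enum
from typing import Tuple


class FieldType(str, Enum):  # hypothetical stand-in for the real enum
    ZIP = "zip"
    PHONE = "phone"  # intentionally not wired into the dispatcher below


def standardize_zip(value: str, *, keep_plus4: bool = True) -> Tuple[str, bool]:
    """Return (new_value, changed) for a US ZIP code cell."""
    digits = "".join(ch for ch in value if ch.isdigit())
    if len(digits) == 9:
        new = f"{digits[:5]}-{digits[5:]}" if keep_plus4 else digits[:5]
    elif len(digits) == 5:
        new = digits
    else:
        # Unrecognized shape: leave the cell untouched and report no change.
        return value, False
    return new, new != value


def apply_field_type(field_type: FieldType, value: str) -> Tuple[str, bool]:
    """Dispatch one cell to its per-type standardizer."""
    if field_type is FieldType.ZIP:
        return standardize_zip(value)
    else:
        # A new enum value without a branch lands here instead of passing silently.
        raise AssertionError(f"unhandled FieldType: {field_type!r}")


result = apply_field_type(FieldType.ZIP, "12345 6789")
```

Returning the `changed` flag alongside the value lets a caller count modified cells without comparing frames afterwards, and the default `keep_plus4=True` shows how an option can be added while leaving the default behavior stable.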

## Errors

Use `core/errors.py` instead of raw `ValueError` / `OSError`:

| Pattern | Use |
|---------|-----|
| Bad arg, wrong type, missing column | `InputValidationError` |
| Bad config / options file | `ConfigError` |
| File parses but isn't what we expected | `FileFormatError` |
| File I/O failure (perms, missing, disk full) | `FileAccessError` |
| Internal invariant broken (unreachable branch) | `AssertionError` |

Helpers:

- `ensure_dataframe(value, function="my_func")` at every public entry that takes a df.
- `ensure_choice(value, name="mode", choices=[...])` at every entry that takes a literal.
- `wrap_file_read(path, "operation", exc)` / `wrap_file_write(...)` when wrapping `OSError`.

GUI / CLI handlers: use `format_for_user(exc, context="...")` to render.

All `DataToolsError` subclasses extend stdlib `ValueError` or `OSError` so existing handlers still catch them.

## PDF Extractor — bundled Tesseract

Frozen builds (installer / AppImage) ship Tesseract OCR inside the bundle so scanned PDFs work without a separate system install. Source / `pip` developer environments still resolve Tesseract from `PATH`.

**Runtime layout (frozen bundles)**:

| Resource | Path |
|---|---|
| Tesseract binary | `Path(sys._MEIPASS) / "tesseract" / "tesseract"` (Linux/macOS), `…/tesseract/tesseract.exe` (Windows) |
| Tessdata directory | `Path(sys._MEIPASS) / "tesseract" / "tessdata"` |
| English model | `Path(sys._MEIPASS) / "tesseract" / "tessdata" / "eng.traineddata"` |

**Discovery order** (PDF Extractor runtime):

1. `DATATOOLS_TESSERACT_BIN` env var (override — explicit path to a `tesseract` binary).
2. Bundled path under `sys._MEIPASS` (frozen bundles only — falls through to step 3 otherwise).
3. `tesseract` on `PATH` (developer setups, source checkouts).
4. Windows well-known locations (`C:\Program Files\Tesseract-OCR\tesseract.exe`, etc.).
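The four-step discovery order above might look roughly like this in code. It is a sketch under stated assumptions rather than the PDF Extractor's actual resolver: `find_tesseract`, the `env` parameter, and the single-entry `WINDOWS_CANDIDATES` tuple are hypothetical, and the sketch returns the override path without checking that it exists, which the real code may or may not do.

```python
import os
import shutil
import sys
from pathlib import Path
from typing import Mapping, Optional

# Only the one well-known location named in this guide; the real list may be longer.
WINDOWS_CANDIDATES = (Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe"),)


def find_tesseract(env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Resolve a Tesseract binary following the documented discovery order."""
    env = os.environ if env is None else env

    # 1. Explicit override wins (assumption: returned as-is, unvalidated).
    override = env.get("DATATOOLS_TESSERACT_BIN")
    if override:
        return Path(override)

    # 2. Bundled copy; PyInstaller sets sys._MEIPASS only in frozen builds.
    meipass = getattr(sys, "_MEIPASS", None)
    if meipass:
        exe = "tesseract.exe" if sys.platform == "win32" else "tesseract"
        bundled = Path(meipass) / "tesseract" / exe
        if bundled.is_file():
            return bundled

    # 3. Whatever `tesseract` resolves to on PATH.
    on_path = shutil.which("tesseract", path=env.get("PATH"))
    if on_path:
        return Path(on_path)

    # 4. Windows well-known install locations.
    if sys.platform == "win32":
        for candidate in WINDOWS_CANDIDATES:
            if candidate.is_file():
                return candidate
    return None


resolved = find_tesseract({"DATATOOLS_TESSERACT_BIN": "/opt/ocr/tesseract", "PATH": ""})
```

Checking the override first means a developer can point a frozen build at a different binary for debugging without rebuilding the bundle.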

**Where the bytes come from**:

- **Tessdata** is vendored at `build/vendor/tessdata/eng.traineddata` — the "best" English model from [tessdata_best](https://github.com/tesseract-ocr/tessdata_best). PyInstaller's spec copies it into `tesseract/tessdata/` inside the bundle.
- **Tesseract binary** is fetched at build time by `build/tesseract.py` — per-platform download URLs are pinned in that module. The current pin is **Tesseract 5.5.0**. CI (`.github/workflows/build.yml`) imports `fetch_tessdata` + `fetch_tesseract_for_platform` and runs them before PyInstaller.

**To update Tesseract**:

1. Bump the version pin + the per-platform fetch URLs in `build/tesseract.py`.
2. If upstream changed the `eng.traineddata` schema, refresh `build/vendor/tessdata/eng.traineddata` from `tessdata_best` at the matching tag.
3. Push a `v*` tag so CI rebuilds all three platforms, then smoke-test a scanned-PDF run through the PDF Extractor before publishing the release.
4. Update `LICENSE_TESSERACT.txt` at the repo root if the upstream license terms change (Tesseract is Apache-2.0 today).

## Tests

```bash
# All (core + CLI + GUI)
pytest -q

# Quick loop — skip the GUI layer
pytest -q -m 'not gui'

# Only the GUI tests
pytest -q -m gui

# By module
pytest tests/test_dedup.py

# Slow / integration tests (only those marked `slow`)
pytest -m slow

# Single test
pytest tests/test_dedup.py::TestExactMatch::test_basic
```

Test layout:

```
tests/
├── conftest.py                          # core/CLI fixtures
├── test_dedup.py · test_normalizers.py · test_io.py · test_config.py
├── test_analyze.py · test_normalize.py · test_text_clean.py
├── test_format_standardize.py
├── test_format_standardize_corpus.py    # 199-row buyer corpus
├── test_pipeline.py                     # pipeline engine: adapters, run, validate, serialize
├── test_cli_pipeline.py                 # pipeline CLI: recommend/apply/strict/audit
├── test_audit_fixes.py · test_errors.py · test_fixes_unit.py
├── test_corpus.py · test_encodings_corpus.py · test_fixtures_sweep.py
├── test_cli.py · test_cli_*.py · test_e2e.py · test_install.py
├── test_perf_regressions.py             # shape pins for the perf wins
└── gui/                                 # Streamlit AppTest-driven tests
    ├── conftest.py                      # AppTest fixtures + helpers
    ├── _findings_panel_harness.py       # isolated component test page
    ├── test_smoke.py                    # every page renders in EN + ES
    ├── test_chrome.py                   # language selector, hide_chrome
    ├── test_gate.py                     # require_normalization_gate
    ├── test_workflows.py                # happy path per Ready tool
    ├── test_dedup_review.py             # match-group card interactions
    ├── test_advanced_panels.py          # config_panel widgets
    ├── test_pipeline_builder.py         # module-card builder: cards, reorder, JSON, run
    ├── test_pipeline_phrasing.py        # step_phrase/step_status + name bridge (pure fns)
    ├── test_errors.py                   # malformed-upload error paths
    └── test_findings_panel.py           # analyzer findings rendering
```

### Pipeline (Automated Workflows) coverage

The pipeline feature is pinned end to end across four files (~115 tests):

- `test_pipeline.py` (core engine): every adapter's summary numbers, step data-flow, error stop/continue, empty/single-column/all-disabled edges, dict + file serialization round-trips, `recommended_pipeline(include=…)`, soft-dependency validation.
- `test_cli_pipeline.py` (CLI): `--recommend`, dry-run-by-default, `--apply` output + audit JSON, `--steps`, `--strict`, `--continue-on-error`, arg validation, save→load round-trip.
- `test_pipeline_builder.py` (the visual builder via AppTest): card seeding, toggle, reorder ▲/▼, add/remove, restore-recommended, Advanced JSON import/export, per-tool Configure panels emitting the right option dicts.
- `test_pipeline_phrasing.py`: the plain-English `step_phrase`/`step_status` helpers and the adapter-key→friendly-name bridge, tested as pure functions.

### GUI test layer

GUI tests drive pages with `streamlit.testing.v1.AppTest` — in-process, no browser, no display. They pre-populate `st.session_state` with stashed-upload bytes (via the `stash_upload()` helper in `tests/gui/conftest.py`) and either click buttons via `app.button[i].click().run()` or assert on the `session_state` after the run.

The `gui` marker is registered in `pytest.ini`. Default `pytest` runs everything; `pytest -m 'not gui'` skips the GUI tests for a faster core-only loop.

Coming-Soon stubs are pinned by the smoke tests so a regression ("import error", "missing widget") shows up immediately.

Fixture corpora: `test-cases/text-cleaner-corpus/` (21 files) · `test-cases/encodings-corpus/` (31 files) · `test-cases/format-cleaner-corpus/` (7 files + spec).

## Known limitations

- **Dedup pair-compare is O(n²)** for fuzzy strategies. Exact-only strategies (every column uses `Algorithm.EXACT` at threshold 100) now route through an O(n) groupby fast path automatically — no API change. Fuzzy strategies can opt into prefix blocking via `deduplicate(..., blocking_columns=[...], blocking_prefix_len=1)` to partition pairs by a cheap key (trades recall for speed).
- **Threading is opt-in for format_standardize** — `StandardizeOptions.parallel_columns > 1` uses a thread pool. On CPython 3.12 the GIL caps the win at roughly neutral; the scaffolding is in place for free-threaded Python 3.13+.
- **Memory-bound** — the entire file is loaded into pandas. Streaming reads exist but are not integrated with the dedup engine.
- **No multi-sheet dedup** — each Excel sheet is processed independently.
- **Phonenumbers minimum-length** — international numbers without country codes fall back to digits-only.
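
To make the recall-for-speed trade-off of prefix blocking concrete, here is a minimal, self-contained sketch of the general technique. It is not the implementation behind `deduplicate(..., blocking_columns=..., blocking_prefix_len=...)`: the `blocked_pairs` helper and the plain-dict rows are hypothetical, chosen so the example runs without pandas.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterator, List, Sequence, Tuple


def blocked_pairs(
    rows: Sequence[Dict[str, str]],
    blocking_columns: Sequence[str],
    prefix_len: int = 1,
) -> Iterator[Tuple[int, int]]:
    """Yield only the row-index pairs whose blocking keys match."""
    blocks: Dict[Tuple[str, ...], List[int]] = defaultdict(list)
    for idx, row in enumerate(rows):
        # Cheap key: the first `prefix_len` characters of each blocking column.
        key = tuple(
            str(row.get(col, "")).strip().casefold()[:prefix_len]
            for col in blocking_columns
        )
        blocks[key].append(idx)
    # Pairs are generated inside each block, never across blocks.
    for members in blocks.values():
        yield from combinations(members, 2)


rows = [
    {"name": "Alice Smith"},
    {"name": "alice smyth"},
    {"name": "Bob Jones"},
    {"name": "Elise Smith"},
]
candidate_pairs = sorted(blocked_pairs(rows, ["name"]))
all_pairs = list(combinations(range(len(rows)), 2))
# candidate_pairs == [(0, 1)], versus 6 pairs without blocking
```

Only one of the six possible pairs reaches the expensive fuzzy comparison. The cost is visible in the same data: "Alice Smith" and "Elise Smith" land in different blocks and are never compared, which is the recall loss the limitation above refers to.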