datatools-dev/docs/DEVELOPER.md
Michael fd9606c67b build: drop the local Python release method, return to CI-only installer builds
Removes the single-command Python packaging method (build/make_release.py
+ build/build_portable_zip.py + build/macos/build_zip.sh) and the portable
.zip artifacts it produced. Release builds go back to the original GitHub
Actions process: the CI matrix builds one installer per platform (.dmg /
.exe / .AppImage) on tag push and attaches them to a GitHub Release.

Tesseract OCR bundling is preserved: the fetch helpers the workflow depends
on (fetch_tessdata, fetch_tesseract_for_platform) are extracted into a
standalone build/tesseract.py, which build.yml now imports.

Docs (README, build/README, DEVELOPER, TECHNICAL, USER-GUIDE, vendor README,
es translations) updated to drop the portable-zip flavor and point at the
new module.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 17:47:36 +00:00

Developer Guide

Architecture, data flow, extension points.

Architecture

CLI (src/cli*.py)         GUI (src/gui/app.py + pages/)
     │                          │
     └──────────┐    ┌──────────┘
                ▼    ▼
            ┌────────────────┐
            │   src/core/    │
            └────────────────┘

Core/UI rule: business logic in core/ only. CLI + GUI translate user input → core call → display result.

Module map

| Module | Public surface |
| --- | --- |
| i18n | t(key, lang=None, **fmt), current_language(), set_language(), render_language_selector(), LANGUAGES |
| core.dedup | deduplicate(), MatchStrategy, ColumnMatchStrategy, Algorithm, SurvivorRule, DeduplicationResult, MatchResult, build_default_strategies() |
| core.normalizers | normalize_email/phone/name/address/string, NormalizerType, get_normalizer() |
| core.io | read_file(), write_file(), list_sheets(), detect_encoding/delimiter/header_row, repair_bytes() |
| core.config | DeduplicationConfig.from_file/to_file/to_strategies/to_survivor_rule |
| core.analyze | analyze(), Finding, findings_by_tool(), _NULL_LIKE |
| core.fixes | @register("fix_id") decorator, get_fix(), available_actions() |
| core.normalize | auto_fix(), apply_decisions(), NormalizationResult, is_normalized() |
| core.text_clean | clean_dataframe(), CleanOptions, CleanResult, smart_title_case() |
| core.format_standardize | standardize_dataframe(), StandardizeOptions, StandardizeResult, FieldType, per-cell standardize_*() |
| core.errors | DataToolsError hierarchy, ensure_dataframe(), ensure_choice(), wrap_file_read/write(), format_for_user() |
| core._constants | US_STATE_NAMES, US_STATE_CODES, USPS_EXPANSIONS, USPS_COMPRESSIONS |

Data flow — Find Duplicates

read_file()                       # auto-detect encoding, delimiter, header
   ▼ DataFrame
build_default_strategies()        # if no explicit strategies
   ▼                              # strong keys (email, phone) → standalone OR
                                  # weak keys (name, address) → AND with strong
_apply_normalizations()           # add _norm_* shadow columns
   ▼
_find_match_groups()              # O(n²) pair compare, OR strategies, union-find
   ▼
[review_callback()]               # optional interactive review
   ▼
_select_survivor()                # per group: first/last/most-complete/most-recent
   ▼
[_merge_group()]                  # optional: fill blanks from losers
   ▼
DeduplicationResult               # deduplicated_df, removed_df, match_groups, log
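
The grouping step in the flow above (pair compare, OR-ed strategies, union-find) can be illustrated with a small standalone sketch. This is not the project's _find_match_groups(); the name group_matches, the predicate-style strategies, and the sample rows are assumptions made for illustration only.

```python
from itertools import combinations
from typing import Callable, Sequence


def group_matches(
    rows: Sequence[dict],
    strategies: Sequence[Callable[[dict, dict], bool]],
) -> list[list[int]]:
    """Group row indices that match under ANY strategy (hypothetical helper)."""
    parent = list(range(len(rows)))

    def find(i: int) -> int:
        # Path halving keeps the trees shallow.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # O(n^2) pair compare: the strategies are OR-ed together.
    for a, b in combinations(range(len(rows)), 2):
        if any(match(rows[a], rows[b]) for match in strategies):
            parent[find(a)] = find(b)

    groups: dict[int, list[int]] = {}
    for i in range(len(rows)):
        groups.setdefault(find(i), []).append(i)
    # Only groups with more than one member are duplicate groups.
    return [members for members in groups.values() if len(members) > 1]


def same_email(r: dict, s: dict) -> bool:
    return r["email"] == s["email"]


def same_phone(r: dict, s: dict) -> bool:
    return r["phone"] == s["phone"]


rows = [
    {"email": "a@x.com", "phone": "111"},
    {"email": "a@x.com", "phone": "222"},
    {"email": "b@x.com", "phone": "222"},
    {"email": "c@x.com", "phone": "333"},
]
groups = group_matches(rows, [same_email, same_phone])
```

Rows 0 and 1 share an email and rows 1 and 2 share a phone, so union-find merges all three into one group even though rows 0 and 2 never match directly.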

Extension recipes

Add a normalizer

  1. Add function to core/normalizers.py:
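    import re
    from typing import Optional

    def normalize_company(value: Optional[str]) -> str:
        # Non-strings and empty values normalize to "".
        if not value or not isinstance(value, str):
            return ""
        name = value.strip().casefold()
        # Strip a trailing legal suffix, e.g. "acme inc." -> "acme".
        for sfx in ("inc", "llc", "corp", "ltd", "co"):
            name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
        return name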
    
  2. Register: add COMPANY = "company" to NormalizerType + entry in _NORMALIZER_MAP.
  3. Auto-detect (optional): add a _COLUMN_TYPE_PATTERNS row in core/dedup.py.
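
Step 2 of the recipe above might look roughly like the following. The real shapes of NormalizerType and _NORMALIZER_MAP are not shown in this guide, so the enum base class, the dict-of-callables layout, and the one-line email normalizer are all assumptions.

```python
import re
from enum import Enum
from typing import Callable, Optional


def normalize_company(value: Optional[str]) -> str:
    """Compact stand-in for the function added in step 1."""
    if not value or not isinstance(value, str):
        return ""
    name = value.strip().casefold()
    return re.sub(r"\b(inc|llc|corp|ltd|co)\.?\s*$", "", name).strip()


class NormalizerType(str, Enum):  # assumed shape; the real enum has more members
    EMAIL = "email"
    COMPANY = "company"  # step 2: new member


# Assumed layout of the lookup table: enum member -> callable.
_NORMALIZER_MAP: dict[NormalizerType, Callable[[Optional[str]], str]] = {
    NormalizerType.EMAIL: lambda v: v.strip().casefold() if isinstance(v, str) else "",
    NormalizerType.COMPANY: normalize_company,  # step 2: new entry
}


def get_normalizer(kind: NormalizerType) -> Callable[[Optional[str]], str]:
    return _NORMALIZER_MAP[kind]


result = get_normalizer(NormalizerType("company"))("Acme Corp.")
```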

Add a fuzzy algorithm

  1. Add value to Algorithm enum in core/dedup.py.
  2. Add case in _compute_similarity().
  3. Document the value in CLI help text.

Add a survivor rule

  1. Add value to SurvivorRule enum.
  2. Add branch in _select_survivor().
  3. Add CLI mapping.

Add a fix + detector (analyzer/gate)

  1. Detector in core/analyze.py: add _detect_<thing>(df) -> list[Finding], hook into the main analyze() pipeline. Emit Finding with a unique fix_action id.
  2. Fix in core/fixes.py:
    @register("fix_id")
    def my_fix(df, payload=None) -> tuple[pd.DataFrame, int]:
        # ...
        return out_df, cells_changed
    
  3. Constant in core/analyze.py: add FIX_<NAME> = "fix_id" so the detector and fix can reference it.

No other call sites change. Gate auto-discovers it via the registry.

Tool page header — render_tool_header(tool_id)

Every tool page renders its title block via render_tool_header(tool_id) in src/gui/components/_legacy.py — do not call st.title() + st.caption() directly. The helper renders:

  • tools.<id>.page_title as the page title (left column).
  • A Help popover button right of the title (icon :material/help_outline:, label from help.button_label). Clicking opens an st.popover containing the markdown body.
  • tools.<id>.page_caption as the caption below.

All copy is i18n-driven; editors can tweak help text without touching Python. If a tool is missing its help_md key, the popover falls back to help.missing_body.

help_md structure (markdown, stored as a single string with \n line breaks in JSON):

**When to use**
- bullet 1
- bullet 2

**Steps**
1. numbered step
2. numbered step

**Examples**
- example 1
- example 2

**Tip** one-sentence pro tip.

Keep it short — the popover is intentionally compact. Mirror the structure across every tool so the muscle memory transfers.

i18n — language packs

The GUI's user-facing strings live in src/i18n/packs/<code>.json, keyed by ISO-639-1 code. English (en.json) is canonical; missing keys in other packs fall back to English, and missing keys in English fall back to the literal dotted key so a typo is visible rather than silent.

Look up a string in code:

from src.i18n import t
st.button(t("upload.run_button"))
st.warning(t("gate.warning", name=filename))   # {name} interpolated via str.format

t() reads the active language from st.session_state["ui_lang"]. Outside a Streamlit run (tests, scripts) it falls back to English.
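
The fallback chain described above (active pack, then English, then the literal dotted key) can be sketched as follows. The in-memory PACKS dict and the _lookup helper are hypothetical; the real packs are JSON files and the real t() reads the language from Streamlit session state.

```python
from typing import Any, Optional

# Hypothetical in-memory packs standing in for src/i18n/packs/<code>.json.
PACKS: dict[str, dict[str, Any]] = {
    "en": {"upload": {"run_button": "Run"}, "gate": {"warning": "Check {name} first"}},
    "es": {"upload": {"run_button": "Ejecutar"}},
}


def _lookup(pack: dict[str, Any], dotted_key: str) -> Optional[str]:
    """Walk the nested dict along the dotted key; None if any part is missing."""
    node: Any = pack
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node if isinstance(node, str) else None


def t(key: str, lang: Optional[str] = None, **fmt: Any) -> str:
    text = _lookup(PACKS.get(lang or "en", {}), key)
    if text is None:
        text = _lookup(PACKS["en"], key)  # fall back to the canonical pack
    if text is None:
        return key  # a typo stays visible as the literal dotted key
    return text.format(**fmt) if fmt else text


spanish = t("upload.run_button", lang="es")
fallback = t("gate.warning", lang="es", name="a.csv")
missing = t("no.such.key")
```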

Add a new language:

  1. Copy src/i18n/packs/en.json to src/i18n/packs/<code>.json and translate values in place. Keep the key tree identical.
  2. Add a one-line entry to LANGUAGES in src/i18n/__init__.py: {"code": "fr", "label": "Français"}. The sidebar picker auto-renders.
  3. Run pytest tests/test_lang_packs.py — the parity test fails until every key from en.json exists in the new pack (and orphan keys not in English are also flagged).
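
The parity check in step 3 boils down to comparing flattened key sets. The sketch below shows one way to compute missing and orphan keys; it is not the contents of tests/test_lang_packs.py, and the helper names and sample packs are assumptions.

```python
from typing import Any


def flatten_keys(pack: dict[str, Any], prefix: str = "") -> set[str]:
    """Return every leaf key of a nested pack as a dotted path."""
    keys: set[str] = set()
    for name, value in pack.items():
        dotted = f"{prefix}{name}"
        if isinstance(value, dict):
            keys |= flatten_keys(value, prefix=f"{dotted}.")
        else:
            keys.add(dotted)
    return keys


def parity_report(canonical: dict[str, Any], other: dict[str, Any]) -> tuple[set[str], set[str]]:
    """Return (keys missing from other, orphan keys not in canonical)."""
    base, candidate = flatten_keys(canonical), flatten_keys(other)
    return base - candidate, candidate - base


en = {"upload": {"run_button": "Run"}, "gate": {"warning": "Check {name} first"}}
fr = {"upload": {"run_button": "Lancer", "extra": "En trop"}}
missing_keys, orphan_keys = parity_report(en, fr)
```

A new pack passes once both sets are empty.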

Add a new key:

  1. Add it to en.json first (canonical pack).
  2. Add it to every other registered pack in the same commit. The parity test enforces this.
  3. Use the dotted key at the call site: t("section.subsection.key") or t("section.key", name=value) for placeholder interpolation.

Authoring rules:

  • Keys live under semantic sections (home.*, upload.*, findings.*, help.*, tools.<id>.name). Don't nest by language or by tool unless the string is genuinely tool-specific.
  • Per-tool header copy lives under tools.<id>.{page_title, page_caption, help_md}. page_caption is the one-line subtitle under the title; help_md is the popover body (see Tool page header above). Top-level help.button_label / help.missing_body are shared across every tool.
  • Use {named} placeholders (not positional {0}) so translators see what's being interpolated.
  • Strings can contain Streamlit markdown (**bold**) — pass through st.markdown / st.caption as usual.
  • Do not put strings inside the farewell-overlay JS payload without going through _js_html_safe() in src/gui/components/_legacy.py; the helper escapes both the JS string terminator and HTML special chars. The test TestFarewellEscape pins that contract.
  • The sidebar picker is mounted by hide_streamlit_chrome(), so every page that calls that helper automatically gets the picker. Pages that don't call it (rare) can call render_language_selector() directly.

Licensing

The license layer lives at src/license/. The public API:

from src.license import (
    get_manager, require_feature, current_state,
    FeatureFlag, Tier, License,
)

mgr = get_manager()
if not mgr.is_valid():
    raise RuntimeError("Not licensed")
require_feature(FeatureFlag.DEDUPLICATOR)

Storage: ~/.datatools/license.json (override via DATATOOLS_LICENSE_PATH). Signed with Ed25519 (asymmetric) — the seller's private key signs; the buyer's binary verifies with the embedded public key.

Key material:

| Variable | Who has it | Where it's used |
| --- | --- | --- |
| DATATOOLS_LICENSE_PRIVKEY | Seller only | scripts/generate_license.py (mint a buyer's blob), scripts/generate_keypair.py writes a fresh one |
| DATATOOLS_LICENSE_PUBKEY | Every shipped binary | Verification at activation time; set at build time via PyInstaller env |

If neither env var is set, src.license.crypto falls back to the deterministic dev keypair in src/license/_dev_keypair.py. The dev key is in source on purpose (so tests work without secrets), but a frozen build that still uses it has a build-config bug, and assert_production_safe() refuses to start such a binary.

First-time setup for shipped builds:

  1. python scripts/generate_keypair.py --output prod-keys.env — creates a fresh keypair.
  2. Stash DATATOOLS_LICENSE_PRIVKEY somewhere safe (password manager / KMS). Lose it and you can't issue renewals without reshipping a new build with a new public key.
  3. Configure the PyInstaller build env with DATATOOLS_LICENSE_PUBKEY=<hex> so the shipped binary verifies against the production key.
  4. Mint buyer licenses with DATATOOLS_LICENSE_PRIVKEY=<hex> python scripts/generate_license.py ....

Dev bypass: DATATOOLS_DEV_MODE=1 short-circuits every check. The test suite's autouse fixture sets this so existing tests don't need their own license fixtures. Tests that need the real check explicitly use isolated_license_path / activated_license_manager / unactivated_license_manager.

Adding a feature flag:

  1. Add the enum value to FeatureFlag in src/license/schema.py.
  2. Add it to the relevant tier's set in FEATURES_BY_TIER in src/license/features.py.
  3. Gate at the call site: require_feature(FeatureFlag.YOUR_FLAG).

Adding a new tier:

  1. Add the enum value to Tier.
  2. Add a row to FEATURES_BY_TIER listing the unlocked flags.
  3. Add license.tier_<name> translation keys to every i18n pack.
  4. The activation flow, sidebar status badge, feature gate, and home grid lock badge all pick up the new tier automatically.

Worked example — the Lite tier:

# src/license/schema.py
class Tier(str, Enum):
    LITE = "lite"          # new
    CORE = "core"
    ...

# src/license/features.py
FEATURES_BY_TIER = {
    ...
    Tier.LITE: frozenset({
        FeatureFlag.DEDUPLICATOR,
        FeatureFlag.TEXT_CLEANER,
        FeatureFlag.FORMAT_STANDARDIZER,
    }),
    Tier.CORE: _all(),
    ...
}

Then in en.json/es.json add license.tier_lite. That's it — the existing require_feature_or_render_upgrade (GUI) and guard(feature=...) (CLI) calls in every tool page/CLI route a Lite user into the upgrade prompt for any tool the tier doesn't unlock. The home grid's lock badge fires off the same feature lookup.
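
To make the lookup concrete, here is a self-contained sketch of a tier-to-feature check. It is not the code in src/license/: the PDF_EXTRACTOR flag, the FeatureLockedError class, and the explicit tier argument (the real require_feature takes only the flag) are assumptions.

```python
from enum import Enum


class FeatureFlag(str, Enum):
    DEDUPLICATOR = "deduplicator"
    TEXT_CLEANER = "text_cleaner"
    FORMAT_STANDARDIZER = "format_standardizer"
    PDF_EXTRACTOR = "pdf_extractor"  # hypothetical flag outside the Lite set


class Tier(str, Enum):
    LITE = "lite"
    CORE = "core"


def _all() -> frozenset:
    return frozenset(FeatureFlag)


FEATURES_BY_TIER = {
    Tier.LITE: frozenset({
        FeatureFlag.DEDUPLICATOR,
        FeatureFlag.TEXT_CLEANER,
        FeatureFlag.FORMAT_STANDARDIZER,
    }),
    Tier.CORE: _all(),
}


class FeatureLockedError(PermissionError):  # hypothetical error type
    pass


def require_feature(flag: FeatureFlag, tier: Tier) -> None:
    """Raise if the tier does not unlock the flag (tier passed explicitly here)."""
    if flag not in FEATURES_BY_TIER[tier]:
        raise FeatureLockedError(f"{flag.value} is not included in the {tier.value} tier")


require_feature(FeatureFlag.DEDUPLICATOR, Tier.LITE)  # unlocked, returns quietly

locked = False
try:
    require_feature(FeatureFlag.PDF_EXTRACTOR, Tier.LITE)
except FeatureLockedError:
    locked = True  # a GUI would render the upgrade prompt at this point
```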

Minting a license (creator-only):

DATATOOLS_LICENSE_PRIVKEY=<hex> \
    python scripts/generate_license.py \
        --name "Jane Doe" --email jane@example.com \
        --tier core --years 1

The script prints a DTLIC1: blob to stdout — deliver this in the Gumroad / purchase email. The buyer pastes it into the activation page or runs python -m src.license_cli activate <blob> --name ....

Add a format-standardizer field type

  1. Add value to FieldType enum in core/format_standardize.py.
  2. Add per-cell standardize_<x>(value, *, …) returning (new_value, changed).
  3. Add option fields to StandardizeOptions (with defaults that preserve existing behavior).
  4. Wire into _apply_field_type() dispatcher (the else branch raises AssertionError — every enum value needs a branch).
  5. Add validation entry in StandardizeOptions.from_dict() for any new enum-shaped option.

Errors

Use core/errors.py instead of raw ValueError / OSError:

| Pattern | Use |
| --- | --- |
| Bad arg, wrong type, missing column | InputValidationError |
| Bad config / options file | ConfigError |
| File parses but isn't what we expected | FileFormatError |
| File I/O failure (perms, missing, disk full) | FileAccessError |
| Internal invariant broken (unreachable branch) | AssertionError |

Helpers:

  • ensure_dataframe(value, function="my_func") at every public entry that takes a df.
  • ensure_choice(value, name="mode", choices=[...]) at every entry that takes a literal.
  • wrap_file_read(path, "operation", exc) / wrap_file_write(...) when wrapping OSError.

GUI / CLI handlers: use format_for_user(exc, context="...") to render.

All DataToolsError subclasses extend stdlib ValueError or OSError so existing handlers still catch them.
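
The backward-compatibility point can be shown with a small sketch. The class bodies and the ensure_choice implementation here are assumptions, not the code in core/errors.py; only the class names and the ValueError / OSError ancestry come from this guide.

```python
class DataToolsError(Exception):
    """Assumed common base class for the hierarchy."""


class InputValidationError(DataToolsError, ValueError):
    pass


class FileAccessError(DataToolsError, OSError):
    pass


def ensure_choice(value, *, name: str, choices):
    """Hypothetical body: reject values outside the allowed set."""
    if value not in choices:
        raise InputValidationError(
            f"{name} must be one of {list(choices)}, got {value!r}"
        )
    return value


# A handler written before the hierarchy existed still catches the new error.
try:
    ensure_choice("fastest", name="mode", choices=["first", "last"])
except ValueError as exc:
    caught = type(exc).__name__
```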

PDF Extractor — bundled Tesseract

Frozen builds (installer / AppImage) ship Tesseract OCR inside the bundle so scanned PDFs work without a separate system install. Source / pip developer environments still resolve Tesseract from PATH.

Runtime layout (frozen bundles):

| Resource | Path |
| --- | --- |
| Tesseract binary | Path(sys._MEIPASS) / "tesseract" / "tesseract" (Linux/macOS), …/tesseract/tesseract.exe (Windows) |
| Tessdata directory | Path(sys._MEIPASS) / "tesseract" / "tessdata" |
| English model | Path(sys._MEIPASS) / "tesseract" / "tessdata" / "eng.traineddata" |

Discovery order (PDF Extractor runtime):

  1. DATATOOLS_TESSERACT_BIN env var (override — explicit path to a tesseract binary).
  2. Bundled path under sys._MEIPASS (frozen bundles only — falls through to step 3 otherwise).
  3. tesseract on PATH (developer setups, source checkouts).
  4. Windows well-known locations (C:\Program Files\Tesseract-OCR\tesseract.exe, etc.).
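
The discovery order above could be implemented along these lines. This is a sketch, not the PDF Extractor's actual resolver: the function name find_tesseract, returning the override without validating it, and the single Windows candidate are assumptions.

```python
import os
import shutil
import sys
from pathlib import Path
from typing import Optional

# Only the one well-known location named in this guide; the real list may be longer.
_WINDOWS_CANDIDATES = (Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe"),)


def find_tesseract() -> Optional[Path]:
    # 1. Explicit override wins (returned as-is in this sketch).
    override = os.environ.get("DATATOOLS_TESSERACT_BIN")
    if override:
        return Path(override)

    # 2. Bundled binary; sys._MEIPASS only exists inside a PyInstaller bundle.
    bundle_root = getattr(sys, "_MEIPASS", None)
    if bundle_root:
        exe = "tesseract.exe" if sys.platform == "win32" else "tesseract"
        bundled = Path(bundle_root) / "tesseract" / exe
        if bundled.is_file():
            return bundled

    # 3. Whatever is on PATH (developer setups, source checkouts).
    on_path = shutil.which("tesseract")
    if on_path:
        return Path(on_path)

    # 4. Windows well-known locations.
    if sys.platform == "win32":
        for candidate in _WINDOWS_CANDIDATES:
            if candidate.is_file():
                return candidate
    return None


os.environ["DATATOOLS_TESSERACT_BIN"] = "/opt/custom/tesseract"
resolved = find_tesseract()
```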

Where the bytes come from:

  • Tessdata is vendored at build/vendor/tessdata/eng.traineddata — the "best" English model from tessdata_best. PyInstaller's spec copies it into tesseract/tessdata/ inside the bundle.
  • Tesseract binary is fetched at build time by build/tesseract.py — per-platform download URLs are pinned in that module. The current pin is Tesseract 5.5.0. CI (.github/workflows/build.yml) imports fetch_tessdata + fetch_tesseract_for_platform and runs them before PyInstaller.

To update Tesseract:

  1. Bump the version pin + the per-platform fetch URLs in build/tesseract.py.
  2. If upstream changed the eng.traineddata schema, refresh build/vendor/tessdata/eng.traineddata from tessdata_best at the matching tag.
  3. Push a v* tag so CI rebuilds all three platforms, then smoke-test a scanned-PDF run through the PDF Extractor before publishing the release.
  4. Update LICENSE_TESSERACT.txt at the repo root if the upstream license terms change (Tesseract is Apache-2.0 today).

Tests

# All (core + CLI + GUI)
pytest -q
# Quick loop — skip the GUI layer
pytest -q -m 'not gui'
# Only the GUI tests
pytest -q -m gui
# By module
pytest tests/test_dedup.py
# Only the slow / integration tests
pytest -m slow
# Single test
pytest tests/test_dedup.py::TestExactMatch::test_basic

Test layout:

tests/
├── conftest.py                        # core/CLI fixtures
├── test_dedup.py · test_normalizers.py · test_io.py · test_config.py
├── test_analyze.py · test_normalize.py · test_text_clean.py
├── test_format_standardize.py
├── test_format_standardize_corpus.py  # 199-row buyer corpus
├── test_audit_fixes.py · test_errors.py · test_fixes_unit.py
├── test_corpus.py · test_encodings_corpus.py · test_fixtures_sweep.py
├── test_cli.py · test_cli_*.py · test_e2e.py · test_install.py
├── test_perf_regressions.py           # shape pins for the perf wins
└── gui/                               # Streamlit AppTest-driven tests
    ├── conftest.py                    # AppTest fixtures + helpers
    ├── _findings_panel_harness.py     # isolated component test page
    ├── test_smoke.py                  # every page renders in EN + ES
    ├── test_chrome.py                 # language selector, hide_chrome
    ├── test_gate.py                   # require_normalization_gate
    ├── test_workflows.py              # happy path per Ready tool
    ├── test_dedup_review.py           # match-group card interactions
    ├── test_advanced_panels.py        # config_panel widgets
    ├── test_errors.py                 # malformed-upload error paths
    └── test_findings_panel.py         # analyzer findings rendering

GUI test layer

GUI tests drive pages with streamlit.testing.v1.AppTest — in-process, no browser, no display. They pre-populate st.session_state with stashed-upload bytes (via the stash_upload() helper in tests/gui/conftest.py) and either click buttons via app.button[i].click().run() or assert on the session_state after the run.

The gui marker is registered in pytest.ini. A default pytest run executes everything; pytest -m 'not gui' skips the GUI tests for a faster core-only loop. Coming-Soon stubs are pinned by the smoke tests so a regression ("import error", "missing widget") shows up immediately.

Fixture corpora: test-cases/text-cleaner-corpus/ (21 files) · test-cases/encodings-corpus/ (31 files) · test-cases/format-cleaner-corpus/ (7 files + spec).

Known limitations

  • Dedup pair-compare is O(n²) for fuzzy strategies. Exact-only strategies (every column uses Algorithm.EXACT at threshold 100) now route through an O(n) groupby fast path automatically — no API change. Fuzzy strategies can opt into prefix blocking via deduplicate(..., blocking_columns=[...], blocking_prefix_len=1) to partition pairs by a cheap key (trades recall for speed).
  • Threading is opt-in for format_standardize: StandardizeOptions.parallel_columns > 1 uses a thread pool. On CPython 3.12 the GIL caps the win at roughly neutral; the scaffolding is in place for free-threaded Python 3.13+.
  • Memory-bound — entire file loaded into pandas. Streaming reads exist but not integrated with the dedup engine.
  • No multi-sheet dedup — each Excel sheet processed independently.
  • Phonenumbers minimum-length — international numbers without country codes fall back to digits-only.
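
To illustrate the prefix-blocking trade-off from the first limitation, the sketch below partitions rows by a cheap key and only generates candidate pairs inside each block. It is not the dedup engine's implementation; blocked_pairs and the sample rows are hypothetical, and the parameter names merely mirror blocking_columns and blocking_prefix_len.

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterator, Sequence


def blocked_pairs(
    rows: Sequence[dict],
    blocking_columns: Sequence[str],
    prefix_len: int = 1,
) -> Iterator[tuple[int, int]]:
    """Yield index pairs that share the same prefix key on every blocking column."""
    blocks: dict[tuple, list[int]] = defaultdict(list)
    for idx, row in enumerate(rows):
        key = tuple(
            str(row.get(col, ""))[:prefix_len].casefold() for col in blocking_columns
        )
        blocks[key].append(idx)
    for members in blocks.values():
        # Pair compare stays quadratic, but only within each block.
        yield from combinations(members, 2)


rows = [
    {"name": "Smith"},
    {"name": "Smyth"},
    {"name": "Jones"},
    {"name": "Johnson"},
]
pairs = list(blocked_pairs(rows, ["name"], prefix_len=1))
all_pairs = list(combinations(range(len(rows)), 2))
```

Here blocking cuts six candidate pairs down to two. The recall cost is that two records whose names differ in the first letter (for example a typo in the first character) are never compared.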