datatools-dev/docs/DEVELOPER.md
Michael e534fb4989 sec(license): Ed25519 sigs + production-safe tripwire
Two coupled hardening upgrades.

1. Asymmetric signatures (HMAC → Ed25519)

The previous HMAC scheme used a symmetric secret that any motivated
reverse engineer could pull out of the shipped binary and use to
mint blobs for any tier / name / email. With Ed25519, the binary
ships only the public verification key; the signing key never
leaves the seller's environment, so binary compromise no longer
yields forgery.

- src/license/crypto.py rewritten around
  cryptography.hazmat.primitives.asymmetric.ed25519. Same public
  API surface (sign/verify/encode_blob/decode_blob), same canonical
  JSON encoding — drop-in for the manager / cli / GUI layers.
- DATATOOLS_LICENSE_PRIVKEY (seller-side) and
  DATATOOLS_LICENSE_PUBKEY (build-time) env vars supply the keys;
  the in-source dev keypair (src/license/_dev_keypair.py)
  deterministically derives from a seed phrase for repro builds and
  tests.
- Blob prefix bumped DTLIC1: → DTLIC2:. Decoding a DTLIC1 blob
  surfaces a clear "old format" error rather than a confusing
  signature mismatch.
- scripts/generate_keypair.py mints fresh production keypairs for
  the seller (run once, stash the private key offline). Adds
  cryptography>=41,<46 to requirements.txt (was an undeclared
  transitive dep).

2. Production-safe tripwire

assert_production_safe() refuses to boot a frozen / shipped build
when either:

- DATATOOLS_DEV_MODE=1 is set (would unconditionally bypass every
  license check — fine in source/test but catastrophic in a buyer
  install).
- The active verification key is still the embedded dev key (the
  build pipeline forgot to set DATATOOLS_LICENSE_PUBKEY).

No-op in source / pytest runs (sys.frozen is unset) so test
fixtures and dev workflows keep working without ceremony. Called
from src/cli_license_guard.guard() and from hide_streamlit_chrome
— so it fires on every CLI invocation and every GUI page load.

Tests: 49 license-layer unit tests (was 40); added Ed25519
wrong-key rejection, dev-keypair seed pin, blob v2 prefix, v1
rejection with clear message, and four production-safe scenarios
(no-op in source, fires on DEV_MODE in frozen, fires on dev key in
frozen, passes in frozen with prod pubkey). Total: 2024 → 2033.

Docs (REQUIREMENTS §17a, DEVELOPER licensing recipe, DECISIONS
§9b + decision log) updated with the new threat-model write-up,
key-storage workflow, and tripwire behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:34:48 +00:00

Developer Guide

Architecture, data flow, extension points.

Architecture

CLI (src/cli*.py)         GUI (src/gui/app.py + pages/)
     │                          │
     └──────────┐    ┌──────────┘
                ▼    ▼
            ┌────────────────┐
            │   src/core/    │
            └────────────────┘

Core/UI rule: business logic in core/ only. CLI + GUI translate user input → core call → display result.

Module map

| Module | Public surface |
|---|---|
| i18n | t(key, lang=None, **fmt), current_language(), set_language(), render_language_selector(), LANGUAGES |
| core.dedup | deduplicate(), MatchStrategy, ColumnMatchStrategy, Algorithm, SurvivorRule, DeduplicationResult, MatchResult, build_default_strategies() |
| core.normalizers | normalize_email/phone/name/address/string, NormalizerType, get_normalizer() |
| core.io | read_file(), write_file(), list_sheets(), detect_encoding/delimiter/header_row, repair_bytes() |
| core.config | DeduplicationConfig.from_file/to_file/to_strategies/to_survivor_rule |
| core.analyze | analyze(), Finding, findings_by_tool(), _NULL_LIKE |
| core.fixes | @register("fix_id") decorator, get_fix(), available_actions() |
| core.normalize | auto_fix(), apply_decisions(), NormalizationResult, is_normalized() |
| core.text_clean | clean_dataframe(), CleanOptions, CleanResult, smart_title_case() |
| core.format_standardize | standardize_dataframe(), StandardizeOptions, StandardizeResult, FieldType, per-cell standardize_*() |
| core.errors | DataToolsError hierarchy, ensure_dataframe(), ensure_choice(), wrap_file_read/write(), format_for_user() |
| core._constants | US_STATE_NAMES, US_STATE_CODES, USPS_EXPANSIONS, USPS_COMPRESSIONS |

Data flow — Deduplicator

read_file()                       # auto-detect encoding, delimiter, header
   ▼ DataFrame
build_default_strategies()        # if no explicit strategies
   ▼                              # strong keys (email, phone) → standalone OR
                                  # weak keys (name, address) → AND with strong
_apply_normalizations()           # add _norm_* shadow columns
   ▼
_find_match_groups()              # O(n²) pair compare, OR strategies, union-find
   ▼
[review_callback()]               # optional interactive review
   ▼
_select_survivor()                # per group: first/last/most-complete/most-recent
   ▼
[_merge_group()]                  # optional: fill blanks from losers
   ▼
DeduplicationResult               # deduplicated_df, removed_df, match_groups, log
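
The grouping step in the flow above can be sketched in isolation. This is a minimal illustration, not the code in core/dedup.py: rows are plain dicts rather than a DataFrame, and find_match_groups, same_email and same_phone are hypothetical names. It shows how an O(n²) pair comparison with OR-combined strategies feeds a union-find so that transitive matches land in one group.

```python
from itertools import combinations
from typing import Callable, Sequence


def find_match_groups(
    rows: Sequence[dict],
    strategies: Sequence[Callable[[dict, dict], bool]],
) -> list[list[int]]:
    """Group row indices that match under ANY strategy (OR semantics)."""
    parent = list(range(len(rows)))

    def find(i: int) -> int:
        # Path halving keeps the trees shallow.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # O(n^2): every pair is tested against every strategy.
    for i, j in combinations(range(len(rows)), 2):
        if any(match(rows[i], rows[j]) for match in strategies):
            union(i, j)

    groups: dict[int, list[int]] = {}
    for i in range(len(rows)):
        groups.setdefault(find(i), []).append(i)
    # Only groups with two or more rows are duplicate sets.
    return [members for members in groups.values() if len(members) > 1]


def same_email(a: dict, b: dict) -> bool:
    return bool(a["email"]) and a["email"] == b["email"]


def same_phone(a: dict, b: dict) -> bool:
    return bool(a["phone"]) and a["phone"] == b["phone"]


rows = [
    {"email": "a@x.com", "phone": "111"},
    {"email": "a@x.com", "phone": "222"},
    {"email": "b@x.com", "phone": "222"},
    {"email": "c@x.com", "phone": "333"},
]
groups = find_match_groups(rows, [same_email, same_phone])
```

Rows 0 and 1 share an email and rows 1 and 2 share a phone, so all three end up in one group even though rows 0 and 2 never match directly.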

Extension recipes

Add a normalizer

  1. Add function to core/normalizers.py:
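    import re
    from typing import Optional

    def normalize_company(value: Optional[str]) -> str:
        """Lower-case a company name and strip common legal suffixes."""
        if not value or not isinstance(value, str):
            return ""
        name = value.strip().casefold()
        for sfx in ("inc", "llc", "corp", "ltd", "co"):
            # Drop a trailing suffix such as "inc" or "inc." (whole word only).
            name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
        return name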
    
  2. Register: add COMPANY = "company" to NormalizerType + entry in _NORMALIZER_MAP.
  3. Auto-detect (optional): add a _COLUMN_TYPE_PATTERNS row in core/dedup.py.

Add a fuzzy algorithm

  1. Add value to Algorithm enum in core/dedup.py.
  2. Add case in _compute_similarity().
  3. Document the value in CLI help text.
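
As a rough sketch of steps 1 and 2, the block below adds a hypothetical RATIO algorithm backed by the standard library's difflib. The 0 to 100 score scale is an assumption inferred from the "threshold 100" wording used elsewhere in this guide, and compute_similarity is a simplified stand-in for _compute_similarity().

```python
from difflib import SequenceMatcher
from enum import Enum


class Algorithm(str, Enum):
    EXACT = "exact"
    RATIO = "ratio"  # hypothetical new value


def compute_similarity(a: str, b: str, algorithm: Algorithm) -> float:
    """Return a similarity score between 0 and 100 (assumed scale)."""
    if algorithm is Algorithm.EXACT:
        return 100.0 if a == b else 0.0
    if algorithm is Algorithm.RATIO:
        # SequenceMatcher.ratio() is 2 * matches / total length, in [0, 1].
        return 100.0 * SequenceMatcher(None, a, b).ratio()
    # Every enum value needs its own branch.
    raise AssertionError(f"unhandled algorithm: {algorithm!r}")


score = compute_similarity("abcd", "abcf", Algorithm.RATIO)
```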

Add a survivor rule

  1. Add value to SurvivorRule enum.
  2. Add branch in _select_survivor().
  3. Add CLI mapping.
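
A survivor rule is a function from a match group to the row that is kept. The sketch below is an assumption-laden illustration: the enum values are guessed from the first/last/most-complete wording in the data-flow section, and a group is a list of dicts rather than DataFrame rows.

```python
from enum import Enum


class SurvivorRule(str, Enum):
    FIRST = "first"
    LAST = "last"
    MOST_COMPLETE = "most-complete"


def select_survivor(group: list[dict], rule: SurvivorRule) -> int:
    """Return the position within `group` of the row to keep."""
    if rule is SurvivorRule.FIRST:
        return 0
    if rule is SurvivorRule.LAST:
        return len(group) - 1
    if rule is SurvivorRule.MOST_COMPLETE:
        def filled(row: dict) -> int:
            return sum(1 for v in row.values() if v not in (None, ""))
        # max() returns the first maximal item, so ties keep the earliest row.
        return max(range(len(group)), key=lambda i: filled(group[i]))
    raise AssertionError(f"unhandled survivor rule: {rule!r}")


group = [
    {"name": "Jane", "email": "", "phone": None},
    {"name": "Jane Doe", "email": "jane@example.com", "phone": ""},
    {"name": "J. Doe", "email": "jane@example.com", "phone": "555-0100"},
]
keep = select_survivor(group, SurvivorRule.MOST_COMPLETE)
```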

Add a fix + detector (analyzer/gate)

  1. Detector in core/analyze.py: add _detect_<thing>(df) -> list[Finding], hook into the main analyze() pipeline. Emit Finding with a unique fix_action id.
  2. Fix in core/fixes.py:
    @register("fix_id")
    def my_fix(df, payload=None) -> tuple[pd.DataFrame, int]:
        # ...
        return out_df, cells_changed
    
  3. Constant in core/analyze.py: add FIX_<NAME> = "fix_id" so the detector and fix can reference it.

No other call sites change. Gate auto-discovers it via the registry.
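
The auto-discovery works because the decorator records each fix in a shared table at import time. A minimal sketch of that pattern follows; the _FIXES dict, the duplicate-id check and the strip_whitespace fix are illustrative assumptions, and rows are lists of dicts here instead of a pandas DataFrame to keep the example dependency-free.

```python
from typing import Callable

_FIXES: dict[str, Callable] = {}  # assumed registry layout


def register(fix_id: str) -> Callable[[Callable], Callable]:
    """Decorator that records a fix function under `fix_id`."""
    def decorator(func: Callable) -> Callable:
        if fix_id in _FIXES:
            raise ValueError(f"duplicate fix id: {fix_id}")
        _FIXES[fix_id] = func
        return func
    return decorator


def get_fix(fix_id: str) -> Callable:
    return _FIXES[fix_id]


def available_actions() -> list[str]:
    return sorted(_FIXES)


# The detector and the fix share this constant.
FIX_STRIP_WHITESPACE = "strip_whitespace"


@register(FIX_STRIP_WHITESPACE)
def strip_whitespace(rows: list[dict], payload=None) -> tuple[list[dict], int]:
    """Trim string cells; return the new rows and the number of cells changed."""
    out_rows, cells_changed = [], 0
    for row in rows:
        new_row = {}
        for key, value in row.items():
            cleaned = value.strip() if isinstance(value, str) else value
            cells_changed += cleaned != value
            new_row[key] = cleaned
        out_rows.append(new_row)
    return out_rows, cells_changed


fixed_rows, changed = get_fix(FIX_STRIP_WHITESPACE)([{"a": " x "}, {"a": "y"}])
```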

i18n — language packs

The GUI's user-facing strings live in src/i18n/packs/<code>.json, keyed by ISO-639-1 code. English (en.json) is canonical; missing keys in other packs fall back to English, and missing keys in English fall back to the literal dotted key so a typo is visible rather than silent.

Look up a string in code:

from src.i18n import t
st.button(t("upload.run_button"))
st.warning(t("gate.warning", name=filename))   # {name} interpolated via str.format

t() reads the active language from st.session_state["ui_lang"]. Outside a Streamlit run (tests, scripts) it falls back to English.
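
The fallback chain (active pack, then English, then the literal key) can be sketched without Streamlit. The in-memory PACKS dict and the _lookup helper are assumptions for illustration; the real t() loads JSON packs and reads the language from session state.

```python
from typing import Optional

# Hypothetical in-memory packs; the real ones live in src/i18n/packs/<code>.json.
PACKS = {
    "en": {
        "upload": {"run_button": "Run"},
        "gate": {"warning": "{name} needs cleanup"},
    },
    "es": {
        "upload": {"run_button": "Ejecutar"},
    },
}


def _lookup(pack: dict, dotted_key: str) -> Optional[str]:
    """Walk a nested dict along a dotted key; None if any part is missing."""
    node = pack
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node if isinstance(node, str) else None


def t(key: str, lang: Optional[str] = None, **fmt) -> str:
    lang = lang or "en"
    text = _lookup(PACKS.get(lang, {}), key)
    if text is None:
        text = _lookup(PACKS["en"], key)  # fall back to the canonical pack
    if text is None:
        return key  # a typo stays visible as the literal dotted key
    return text.format(**fmt) if fmt else text
```

With these packs, t("upload.run_button", "es") returns the Spanish string, t("gate.warning", "es", name="a.csv") falls back to the English template, and t("nope.key") returns "nope.key".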

Add a new language:

  1. Copy src/i18n/packs/en.json to src/i18n/packs/<code>.json and translate values in place. Keep the key tree identical.
  2. Add a one-line entry to LANGUAGES in src/i18n/__init__.py: {"code": "fr", "label": "Français"}. The sidebar picker auto-renders.
  3. Run pytest tests/test_lang_packs.py — the parity test fails until every key from en.json exists in the new pack (and orphan keys not in English are also flagged).
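
The parity check in step 3 amounts to comparing flattened key sets. The sketch below shows the idea only; flatten_keys and parity_report are hypothetical names, and the actual test in tests/test_lang_packs.py may be organised differently.

```python
def flatten_keys(pack: dict, prefix: str = "") -> set[str]:
    """Return every leaf of a nested pack as a dotted key."""
    keys: set[str] = set()
    for name, value in pack.items():
        dotted = f"{prefix}{name}"
        if isinstance(value, dict):
            keys |= flatten_keys(value, dotted + ".")
        else:
            keys.add(dotted)
    return keys


def parity_report(canonical: dict, other: dict) -> tuple[list[str], list[str]]:
    """Return (keys missing from `other`, orphan keys not in `canonical`)."""
    canon_keys, other_keys = flatten_keys(canonical), flatten_keys(other)
    return sorted(canon_keys - other_keys), sorted(other_keys - canon_keys)


en = {"home": {"title": "Home"}, "upload": {"run_button": "Run"}}
fr = {"home": {"title": "Accueil"}, "extra": {"x": "y"}}
missing, orphans = parity_report(en, fr)
```

Here missing is ["upload.run_button"] and orphans is ["extra.x"]; a parity test would fail until both lists are empty.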

Add a new key:

  1. Add it to en.json first (canonical pack).
  2. Add it to every other registered pack in the same commit. The parity test enforces this.
  3. Use the dotted key at the call site: t("section.subsection.key") or t("section.key", name=value) for placeholder interpolation.

Authoring rules:

  • Keys live under semantic sections (home.*, upload.*, findings.*, tools.<id>.name). Don't nest by language or by tool unless the string is genuinely tool-specific.
  • Use {named} placeholders (not positional {0}) so translators see what's being interpolated.
  • Strings can contain Streamlit markdown (**bold**) — pass through st.markdown / st.caption as usual.
  • Do not put strings inside the farewell-overlay JS payload without going through _js_html_safe() in src/gui/components/_legacy.py; the helper escapes both the JS string terminator and HTML special chars. The test TestFarewellEscape pins that contract.
  • The sidebar picker is mounted by hide_streamlit_chrome(), so every page that calls that helper automatically gets the picker. Pages that don't call it (rare) can call render_language_selector() directly.

Licensing

The license layer lives at src/license/. The public API:

from src.license import (
    get_manager, require_feature, current_state,
    FeatureFlag, Tier, License,
)

mgr = get_manager()
if not mgr.is_valid():
    raise RuntimeError("Not licensed")
require_feature(FeatureFlag.DEDUPLICATOR)

Storage: ~/.datatools/license.json (override via DATATOOLS_LICENSE_PATH). Signed with Ed25519 (asymmetric) — the seller's private key signs; the buyer's binary verifies with the embedded public key.

Key material:

| Variable | Who has it | Where it's used |
|---|---|---|
| DATATOOLS_LICENSE_PRIVKEY | Seller only | scripts/generate_license.py (mint a buyer's blob); scripts/generate_keypair.py writes a fresh one |
| DATATOOLS_LICENSE_PUBKEY | Every shipped binary | Verification at activation time; set at build time via PyInstaller env |

If neither env var is set, src.license.crypto falls back to the deterministic dev keypair in src/license/_dev_keypair.py. The dev key is in source on purpose (so tests work without secrets), but a frozen build that is still using it is a build-config bug, and assert_production_safe() refuses to start such a binary.
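
The tripwire logic can be sketched as below. This is an assumption-heavy outline rather than the shipped function: the parameters, the UnsafeBuildError class and the placeholder DEV_PUBKEY_HEX are invented for illustration, and the frozen/env arguments exist only so the behaviour can be exercised without building a binary.

```python
import os
import sys
from typing import Mapping, Optional

DEV_PUBKEY_HEX = "00" * 32  # placeholder, not the real dev key


class UnsafeBuildError(RuntimeError):
    """Raised when a frozen build is configured like a dev checkout."""


def assert_production_safe(
    active_pubkey_hex: str,
    *,
    frozen: Optional[bool] = None,
    env: Optional[Mapping[str, str]] = None,
) -> None:
    if frozen is None:
        # PyInstaller sets sys.frozen on bundled apps; it is unset in source runs.
        frozen = bool(getattr(sys, "frozen", False))
    if env is None:
        env = os.environ
    if not frozen:
        return  # no-op in source / pytest runs
    if env.get("DATATOOLS_DEV_MODE") == "1":
        raise UnsafeBuildError("DATATOOLS_DEV_MODE=1 is set in a frozen build")
    if active_pubkey_hex == DEV_PUBKEY_HEX:
        raise UnsafeBuildError("frozen build still verifies against the dev key")
```

The four cases worth covering are: a source run (no-op), a frozen run with DEV_MODE set (raises), a frozen run with the dev key (raises), and a frozen run with a production key (passes).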

First-time setup for shipped builds:

  1. python scripts/generate_keypair.py --output prod-keys.env — creates a fresh keypair.
  2. Stash DATATOOLS_LICENSE_PRIVKEY somewhere safe (password manager / KMS). Lose it and you can't issue renewals without reshipping a new build with a new public key.
  3. Configure the PyInstaller build env with DATATOOLS_LICENSE_PUBKEY=<hex> so the shipped binary verifies against the production key.
  4. Mint buyer licenses with DATATOOLS_LICENSE_PRIVKEY=<hex> python scripts/generate_license.py ....

Dev bypass: DATATOOLS_DEV_MODE=1 short-circuits every check. The test suite's autouse fixture sets this so existing tests don't need their own license fixtures. Tests that need the real check explicitly use isolated_license_path / activated_license_manager / unactivated_license_manager.

Adding a feature flag:

  1. Add the enum value to FeatureFlag in src/license/schema.py.
  2. Add it to the relevant tier's set in FEATURES_BY_TIER in src/license/features.py.
  3. Gate at the call site: require_feature(FeatureFlag.YOUR_FLAG).

Adding a new tier:

  1. Add the enum value to Tier.
  2. Add a row to FEATURES_BY_TIER listing the unlocked flags.
  3. Add license.tier_<name> translation keys to every i18n pack.
  4. The activation flow, sidebar status badge, feature gate, and home grid lock badge all pick up the new tier automatically.

Worked example — the Lite tier:

# src/license/schema.py
class Tier(str, Enum):
    LITE = "lite"          # new
    CORE = "core"
    ...

# src/license/features.py
FEATURES_BY_TIER = {
    ...
    Tier.LITE: frozenset({
        FeatureFlag.DEDUPLICATOR,
        FeatureFlag.TEXT_CLEANER,
        FeatureFlag.FORMAT_STANDARDIZER,
    }),
    Tier.CORE: _all(),
    ...
}

Then add license.tier_lite to en.json and es.json. Nothing else changes: the require_feature_or_render_upgrade (GUI) and guard(feature=...) (CLI) calls that already exist in every tool page and CLI entry point route a Lite user to the upgrade prompt for any tool the tier doesn't unlock. The home grid's lock badge is driven by the same feature lookup.
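
The feature lookup behind those gates can be pictured as a set-membership test. In the sketch below, require_feature_for, FeatureLockedError and the ANALYZER flag are hypothetical; the real require_feature() takes only the flag and reads the tier from the license manager.

```python
from enum import Enum


class FeatureFlag(str, Enum):
    DEDUPLICATOR = "deduplicator"
    TEXT_CLEANER = "text_cleaner"
    FORMAT_STANDARDIZER = "format_standardizer"
    ANALYZER = "analyzer"  # hypothetical flag that Lite does not unlock


class Tier(str, Enum):
    LITE = "lite"
    CORE = "core"


FEATURES_BY_TIER = {
    Tier.LITE: frozenset({
        FeatureFlag.DEDUPLICATOR,
        FeatureFlag.TEXT_CLEANER,
        FeatureFlag.FORMAT_STANDARDIZER,
    }),
    Tier.CORE: frozenset(FeatureFlag),  # every flag
}


class FeatureLockedError(PermissionError):
    """Hypothetical error a gate could turn into an upgrade prompt."""


def require_feature_for(tier: Tier, flag: FeatureFlag) -> None:
    if flag not in FEATURES_BY_TIER.get(tier, frozenset()):
        raise FeatureLockedError(f"{flag.value} is not included in the {tier.value} tier")
```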

Minting a license (creator-only):

DATATOOLS_LICENSE_PRIVKEY=<hex> \
    python scripts/generate_license.py \
        --name "Jane Doe" --email jane@example.com \
        --tier core --years 1

The script prints a DTLIC2: blob to stdout. Deliver this in the Gumroad / purchase email. The buyer pastes it into the activation page or runs python -m src.license_cli activate <blob> --name ....
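
The commit notes for the Ed25519 migration say the blob prefix moved from DTLIC1: to DTLIC2: and that decoding an old blob reports a clear "old format" error. The sketch below illustrates only that prefix handling and a canonical JSON encoding; the body layout (payload plus hex signature, base64url-wrapped) is an assumption, and signature verification is deliberately left out.

```python
import base64
import json

PREFIX_V1 = "DTLIC1:"
PREFIX_V2 = "DTLIC2:"


def canonical_json(obj: dict) -> bytes:
    # Sorted keys and fixed separators give a stable byte string to sign.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def encode_blob(payload: dict, signature: bytes) -> str:
    body = {"payload": payload, "sig": signature.hex()}  # assumed layout
    return PREFIX_V2 + base64.urlsafe_b64encode(canonical_json(body)).decode("ascii")


def decode_blob(blob: str) -> tuple[dict, bytes]:
    blob = blob.strip()
    if blob.startswith(PREFIX_V1):
        # Fail early with a clear message instead of a signature mismatch.
        raise ValueError("old format (DTLIC1) license blob; a DTLIC2 blob is required")
    if not blob.startswith(PREFIX_V2):
        raise ValueError("not a license blob")
    body = json.loads(base64.urlsafe_b64decode(blob[len(PREFIX_V2):]))
    return body["payload"], bytes.fromhex(body["sig"])


blob = encode_blob({"tier": "core", "name": "Jane Doe"}, b"\x01\x02")
payload, sig = decode_blob(blob)
```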

Add a format-standardizer field type

  1. Add value to FieldType enum in core/format_standardize.py.
  2. Add per-cell standardize_<x>(value, *, …) returning (new_value, changed).
  3. Add option fields to StandardizeOptions (with defaults that preserve existing behavior).
  4. Wire into _apply_field_type() dispatcher (the else branch raises AssertionError — every enum value needs a branch).
  5. Add validation entry in StandardizeOptions.from_dict() for any new enum-shaped option.
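
Steps 1 to 4 can be sketched end to end for a hypothetical ZIP field type. Everything named here beyond FieldType, StandardizeOptions and the (new_value, changed) return shape is an assumption: the ZIP type itself, the zip_keep_plus4 option, and this simplified apply_field_type dispatcher.

```python
from dataclasses import dataclass
from enum import Enum


class FieldType(str, Enum):
    TEXT = "text"
    ZIP = "zip"  # hypothetical new field type


@dataclass
class StandardizeOptions:
    # Hypothetical option; the default keeps the +4 part, so nothing is lost.
    zip_keep_plus4: bool = True


def standardize_zip(value: str, *, keep_plus4: bool = True) -> tuple[str, bool]:
    """Return (new_value, changed) for a US ZIP code held in one cell."""
    digits = "".join(ch for ch in value if ch.isdigit())
    if len(digits) == 9:
        new_value = f"{digits[:5]}-{digits[5:]}" if keep_plus4 else digits[:5]
    elif len(digits) == 5:
        new_value = digits
    else:
        new_value = value  # unrecognised shape: leave the cell alone
    return new_value, new_value != value


def apply_field_type(
    value: str, field_type: FieldType, options: StandardizeOptions
) -> tuple[str, bool]:
    if field_type is FieldType.TEXT:
        stripped = value.strip()
        return stripped, stripped != value
    if field_type is FieldType.ZIP:
        return standardize_zip(value, keep_plus4=options.zip_keep_plus4)
    # Every enum value needs a branch; reaching this line is a bug.
    raise AssertionError(f"unhandled field type: {field_type!r}")


result = apply_field_type("123456789", FieldType.ZIP, StandardizeOptions())
```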

Errors

Use core/errors.py instead of raw ValueError / OSError:

| Pattern | Use |
|---|---|
| Bad arg, wrong type, missing column | InputValidationError |
| Bad config / options file | ConfigError |
| File parses but isn't what we expected | FileFormatError |
| File I/O failure (perms, missing, disk full) | FileAccessError |
| Internal invariant broken (unreachable branch) | AssertionError |

Helpers:

  • ensure_dataframe(value, function="my_func") at every public entry that takes a df.
  • ensure_choice(value, name="mode", choices=[...]) at every entry that takes a literal.
  • wrap_file_read(path, "operation", exc) / wrap_file_write(...) when wrapping OSError.

GUI / CLI handlers: use format_for_user(exc, context="...") to render.

All DataToolsError subclasses extend stdlib ValueError or OSError so existing handlers still catch them.
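
The compatibility claim relies on multiple inheritance: each concrete error derives from both the project base class and a stdlib exception. A minimal sketch, assuming a plain Exception-based DataToolsError and a simplified ensure_choice (the real helper may differ in message and return value):

```python
class DataToolsError(Exception):
    """Assumed common base for the project's errors."""


class InputValidationError(DataToolsError, ValueError):
    pass


class FileAccessError(DataToolsError, OSError):
    pass


def ensure_choice(value, *, name: str, choices) -> None:
    if value not in choices:
        raise InputValidationError(
            f"{name} must be one of {sorted(choices)}, got {value!r}"
        )


# A legacy handler that only knows about ValueError still catches the new type.
try:
    ensure_choice("fastest", name="mode", choices=["exact", "fuzzy"])
    caught_by_legacy_handler = False
except ValueError:
    caught_by_legacy_handler = True
```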

Tests

# All (core + CLI + GUI)
pytest -q
# Quick loop — skip the GUI layer
pytest -q -m 'not gui'
# Only the GUI tests
pytest -q -m gui
# By module
pytest tests/test_dedup.py
# Include slow / integration
pytest -m slow
# Single test
pytest tests/test_dedup.py::TestExactMatch::test_basic

Test layout:

tests/
├── conftest.py                        # core/CLI fixtures
├── test_dedup.py · test_normalizers.py · test_io.py · test_config.py
├── test_analyze.py · test_normalize.py · test_text_clean.py
├── test_format_standardize.py
├── test_format_standardize_corpus.py  # 199-row buyer corpus
├── test_audit_fixes.py · test_errors.py · test_fixes_unit.py
├── test_corpus.py · test_encodings_corpus.py · test_fixtures_sweep.py
├── test_cli.py · test_cli_*.py · test_e2e.py · test_install.py
├── test_perf_regressions.py           # shape pins for the perf wins
└── gui/                               # Streamlit AppTest-driven tests
    ├── conftest.py                    # AppTest fixtures + helpers
    ├── _findings_panel_harness.py     # isolated component test page
    ├── test_smoke.py                  # every page renders in EN + ES
    ├── test_chrome.py                 # language selector, hide_chrome
    ├── test_gate.py                   # require_normalization_gate
    ├── test_workflows.py              # happy path per Ready tool
    ├── test_dedup_review.py           # match-group card interactions
    ├── test_advanced_panels.py        # config_panel widgets
    ├── test_errors.py                 # malformed-upload error paths
    └── test_findings_panel.py         # analyzer findings rendering

GUI test layer

GUI tests drive pages with streamlit.testing.v1.AppTest — in-process, no browser, no display. They pre-populate st.session_state with stashed-upload bytes (via the stash_upload() helper in tests/gui/conftest.py) and either click buttons via app.button[i].click().run() or assert on the session_state after the run.

The gui marker is registered in pytest.ini. A plain pytest run executes everything; pytest -m 'not gui' skips the GUI tests for a faster core-only loop. Coming-Soon stubs are pinned by the smoke tests so a regression ("import error", "missing widget") shows up immediately.

Fixture corpora: test-cases/text-cleaner-corpus/ (21 files) · test-cases/encodings-corpus/ (31 files) · test-cases/format-cleaner-corpus/ (7 files + spec).

Known limitations

  • Dedup pair-compare is O(n²) for fuzzy strategies. Exact-only strategies (every column uses Algorithm.EXACT at threshold 100) now route through an O(n) groupby fast path automatically — no API change. Fuzzy strategies can opt into prefix blocking via deduplicate(..., blocking_columns=[...], blocking_prefix_len=1) to partition pairs by a cheap key (trades recall for speed).
  • Threading is opt-in for format_standardize: StandardizeOptions.parallel_columns > 1 uses a thread pool. On CPython 3.12 the GIL keeps the speedup roughly neutral; the scaffolding is in place for free-threaded Python 3.13+.
  • Memory-bound: the entire file is loaded into pandas. Streaming reads exist but are not integrated with the dedup engine.
  • No multi-sheet dedup — each Excel sheet processed independently.
  • Phonenumbers minimum-length — international numbers without country codes fall back to digits-only.
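
The prefix blocking mentioned in the first limitation can be illustrated with a small candidate-pair generator. This is a sketch of the general technique, not the code behind deduplicate(): candidate_pairs is a hypothetical helper, rows are dicts, and case-folding the prefix is an added assumption.

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterator, Sequence


def candidate_pairs(
    rows: Sequence[dict],
    blocking_columns: Sequence[str],
    blocking_prefix_len: int = 1,
) -> Iterator[tuple[int, int]]:
    """Yield only the index pairs whose rows share a blocking key."""
    blocks: dict[tuple, list[int]] = defaultdict(list)
    for idx, row in enumerate(rows):
        key = tuple(
            str(row.get(col, ""))[:blocking_prefix_len].casefold()
            for col in blocking_columns
        )
        blocks[key].append(idx)
    # Pairs are compared within a block only, never across blocks.
    for members in blocks.values():
        yield from combinations(members, 2)


rows = [{"name": "Smith"}, {"name": "Smyth"}, {"name": "Jones"}, {"name": "smith"}]
pairs = list(candidate_pairs(rows, ["name"]))
```

Four rows would normally need six comparisons; blocking on the first letter leaves three. The recall cost is that a pair such as "Smith" and "Psmith" is never compared.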