datatools-dev/docs/DEVELOPER.md

Developer Guide

Architecture, data flow, extension points.

Architecture

CLI (src/cli*.py)         GUI (src/gui/app.py + pages/)
     │                          │
     └──────────┐    ┌──────────┘
                ▼    ▼
            ┌────────────────┐
            │   src/core/    │
            └────────────────┘

Core/UI rule: business logic in core/ only. CLI + GUI translate user input → core call → display result.

Module map

| Module | Public surface |
| --- | --- |
| `i18n` | `t(key, lang=None, **fmt)`, `current_language()`, `set_language()`, `render_language_selector()`, `LANGUAGES` |
| `core.dedup` | `deduplicate()`, `MatchStrategy`, `ColumnMatchStrategy`, `Algorithm`, `SurvivorRule`, `DeduplicationResult`, `MatchResult`, `build_default_strategies()` |
| `core.normalizers` | `normalize_email/phone/name/address/string`, `NormalizerType`, `get_normalizer()` |
| `core.io` | `read_file()`, `write_file()`, `list_sheets()`, `detect_encoding/delimiter/header_row`, `repair_bytes()` |
| `core.config` | `DeduplicationConfig.from_file/to_file/to_strategies/to_survivor_rule` |
| `core.analyze` | `analyze()`, `Finding`, `findings_by_tool()`, `_NULL_LIKE` |
| `core.fixes` | `@register("fix_id")` decorator, `get_fix()`, `available_actions()` |
| `core.normalize` | `auto_fix()`, `apply_decisions()`, `NormalizationResult`, `is_normalized()` |
| `core.text_clean` | `clean_dataframe()`, `CleanOptions`, `CleanResult`, `smart_title_case()` |
| `core.format_standardize` | `standardize_dataframe()`, `StandardizeOptions`, `StandardizeResult`, `FieldType`, per-cell `standardize_*()` |
| `core.errors` | `DataToolsError` hierarchy, `ensure_dataframe()`, `ensure_choice()`, `wrap_file_read/write()`, `format_for_user()` |
| `core._constants` | `US_STATE_NAMES`, `US_STATE_CODES`, `USPS_EXPANSIONS`, `USPS_COMPRESSIONS` |

Data flow — Deduplicator

read_file()                       # auto-detect encoding, delimiter, header
   ▼ DataFrame
build_default_strategies()        # if no explicit strategies
   ▼                              # strong keys (email, phone) → standalone OR
                                  # weak keys (name, address) → AND with strong
_apply_normalizations()           # add _norm_* shadow columns
   ▼
_find_match_groups()              # O(n²) pair compare, OR strategies, union-find
   ▼
[review_callback()]               # optional interactive review
   ▼
_select_survivor()                # per group: first/last/most-complete/most-recent
   ▼
[_merge_group()]                  # optional: fill blanks from losers
   ▼
DeduplicationResult               # deduplicated_df, removed_df, match_groups, log
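
The pair-compare and union-find step in the flow above can be pictured with a small sketch. The names `group_matches` and `is_match` are hypothetical and are not the engine's internals; plain dicts stand in for DataFrame rows. It only illustrates how OR-ed pair matches collapse into transitive groups.

```python
from itertools import combinations
from typing import Callable, Sequence


def group_matches(rows: Sequence[dict], is_match: Callable[[dict, dict], bool]) -> list[list[int]]:
    """Group row indices whose pairs match, using union-find (illustrative only)."""
    parent = list(range(len(rows)))

    def find(i: int) -> int:
        # Path halving keeps the trees shallow.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # O(n^2) pair compare: every pair is tested exactly once.
    for a, b in combinations(range(len(rows)), 2):
        if is_match(rows[a], rows[b]):
            parent[find(b)] = find(a)

    groups: dict[int, list[int]] = {}
    for i in range(len(rows)):
        groups.setdefault(find(i), []).append(i)
    # Only groups with more than one member are duplicate groups.
    return [g for g in groups.values() if len(g) > 1]


rows = [
    {"email": "a@x.com", "phone": "1"},
    {"email": "a@x.com", "phone": "2"},
    {"email": "b@x.com", "phone": "2"},
    {"email": "c@x.com", "phone": "3"},
]
# OR of two strategies: same email, or same phone.
same = lambda r, s: r["email"] == s["email"] or r["phone"] == s["phone"]
print(group_matches(rows, same))
```

Rows 0 and 2 never match directly, but they end up in the same group through row 1, which is the reason for using union-find rather than a simple pairwise list.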

Extension recipes

Add a normalizer

  1. Add function to core/normalizers.py:
    def normalize_company(value: Optional[str]) -> str:
        if not value or not isinstance(value, str): return ""
        name = value.strip().casefold()
        for sfx in ("inc", "llc", "corp", "ltd", "co"):
            name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
        return name
    
  2. Register: add COMPANY = "company" to NormalizerType + entry in _NORMALIZER_MAP.
  3. Auto-detect (optional): add a _COLUMN_TYPE_PATTERNS row in core/dedup.py.
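
As a rough end-to-end sketch of steps 1 and 2, the block below wires the example normalizer into a registry. The shapes of `NormalizerType`, `_NORMALIZER_MAP`, and `get_normalizer()` are assumptions made for illustration; the real definitions in core/normalizers.py have more members and may differ.

```python
import re
from enum import Enum
from typing import Callable, Optional


def normalize_company(value: Optional[str]) -> str:
    if not value or not isinstance(value, str):
        return ""
    name = value.strip().casefold()
    # Strip one trailing legal suffix at a time, with an optional period.
    for sfx in ("inc", "llc", "corp", "ltd", "co"):
        name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
    return name


class NormalizerType(str, Enum):  # assumed shape, single member for brevity
    COMPANY = "company"


# Assumed layout: enum member -> normalizer callable.
_NORMALIZER_MAP: dict[NormalizerType, Callable[[Optional[str]], str]] = {
    NormalizerType.COMPANY: normalize_company,
}


def get_normalizer(kind: NormalizerType) -> Callable[[Optional[str]], str]:
    return _NORMALIZER_MAP[kind]


print(get_normalizer(NormalizerType("company"))("  Acme Widgets LLC "))
```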

Add a fuzzy algorithm

  1. Add value to Algorithm enum in core/dedup.py.
  2. Add case in _compute_similarity().
  3. Document the value in CLI help text.

Add a survivor rule

  1. Add value to SurvivorRule enum.
  2. Add branch in _select_survivor().
  3. Add CLI mapping.

Add a fix + detector (analyzer/gate)

  1. Detector in core/analyze.py: add _detect_<thing>(df) -> list[Finding], hook into the main analyze() pipeline. Emit Finding with a unique fix_action id.
  2. Fix in core/fixes.py:
    @register("fix_id")
    def my_fix(df, payload=None) -> tuple[pd.DataFrame, int]:
        # ...
        return out_df, cells_changed
    
  3. Constant in core/analyze.py: add FIX_<NAME> = "fix_id" so the detector and fix can reference it.

No other call sites change. Gate auto-discovers it via the registry.
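
To make the auto-discovery concrete, here is a hedged sketch of how a decorator-based registry of this kind typically works. `_REGISTRY`, the duplicate-id check, and the `strip_whitespace` fix are hypothetical; rows are plain dicts rather than a DataFrame to keep the example dependency-free.

```python
from typing import Any, Callable

# Assumed storage: fix id -> fix function.
_REGISTRY: dict[str, Callable[..., tuple[Any, int]]] = {}


def register(fix_id: str):
    """Decorator that records a fix function under its id."""
    def decorator(fn):
        if fix_id in _REGISTRY:
            raise ValueError(f"duplicate fix id: {fix_id}")
        _REGISTRY[fix_id] = fn
        return fn
    return decorator


def get_fix(fix_id: str):
    return _REGISTRY[fix_id]


FIX_STRIP_WHITESPACE = "strip_whitespace"  # hypothetical id shared by detector and fix


@register(FIX_STRIP_WHITESPACE)
def strip_whitespace(rows, payload=None):
    changed = 0
    out = []
    for row in rows:
        new = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        changed += sum(1 for k in row if new[k] != row[k])
        out.append(new)
    return out, changed  # (fixed data, cells changed)
```

Because the decorator runs at import time, any caller that looks fixes up by id sees the new one without further wiring.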

i18n — language packs

The GUI's user-facing strings live in src/i18n/packs/<code>.json, keyed by ISO-639-1 code. English (en.json) is canonical; missing keys in other packs fall back to English, and missing keys in English fall back to the literal dotted key so a typo is visible rather than silent.

Look up a string in code:

from src.i18n import t
st.button(t("upload.run_button"))
st.warning(t("gate.warning", name=filename))   # {name} interpolated via str.format

t() reads the active language from st.session_state["ui_lang"]. Outside a Streamlit run (tests, scripts) it falls back to English.
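
The fallback chain described above (active pack, then English, then the literal key) can be sketched as follows. This is a simplification: `PACKS` is a hypothetical in-memory stand-in for the JSON files, and `lang` is passed explicitly instead of being read from `st.session_state`.

```python
from typing import Any, Optional

PACKS = {  # hypothetical in-memory packs; the real ones are JSON files
    "en": {"upload": {"run_button": "Run"}, "gate": {"warning": "{name} needs review"}},
    "es": {"upload": {"run_button": "Ejecutar"}},
}


def _lookup(pack: dict, key: str) -> Optional[str]:
    """Walk a dotted key through nested dicts; None when any part is missing."""
    node: Any = pack
    for part in key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node if isinstance(node, str) else None


def t(key: str, lang: str = "en", **fmt) -> str:
    text = _lookup(PACKS.get(lang, {}), key)
    if text is None:
        text = _lookup(PACKS["en"], key)  # fall back to the canonical pack
    if text is None:
        return key  # a typo stays visible as the literal dotted key
    return text.format(**fmt) if fmt else text
```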

Add a new language:

  1. Copy src/i18n/packs/en.json to src/i18n/packs/<code>.json and translate values in place. Keep the key tree identical.
  2. Add a one-line entry to LANGUAGES in src/i18n/__init__.py: {"code": "fr", "label": "Français"}. The sidebar picker auto-renders.
  3. Run pytest tests/test_lang_packs.py — the parity test fails until every key from en.json exists in the new pack (and orphan keys not in English are also flagged).
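
The parity check in step 3 amounts to comparing flattened key sets. The sketch below is not the code in tests/test_lang_packs.py; `flatten_keys` and `parity_report` are hypothetical helpers that show the idea.

```python
def flatten_keys(pack: dict, prefix: str = "") -> set[str]:
    """Collect dotted keys for every leaf value in a nested pack."""
    keys: set[str] = set()
    for name, value in pack.items():
        dotted = f"{prefix}{name}"
        if isinstance(value, dict):
            keys |= flatten_keys(value, prefix=f"{dotted}.")
        else:
            keys.add(dotted)
    return keys


def parity_report(canonical: dict, other: dict) -> tuple[set[str], set[str]]:
    """Return (keys missing from `other`, orphan keys not in `canonical`)."""
    base, candidate = flatten_keys(canonical), flatten_keys(other)
    return base - candidate, candidate - base


en = {"home": {"title": "Home", "sub": "Tools"}}
fr = {"home": {"title": "Accueil"}, "extra": "y"}
missing, orphans = parity_report(en, fr)
print(missing, orphans)
```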

Add a new key:

  1. Add it to en.json first (canonical pack).
  2. Add it to every other registered pack in the same commit. The parity test enforces this.
  3. Use the dotted key at the call site: t("section.subsection.key") or t("section.key", name=value) for placeholder interpolation.

Authoring rules:

  • Keys live under semantic sections (home.*, upload.*, findings.*, tools.<id>.name). Don't nest by language or by tool unless the string is genuinely tool-specific.
  • Use {named} placeholders (not positional {0}) so translators see what's being interpolated.
  • Strings can contain Streamlit markdown (**bold**) — pass through st.markdown / st.caption as usual.
  • Do not put strings inside the farewell-overlay JS payload without going through _js_html_safe() in src/gui/components/_legacy.py; the helper escapes both the JS string terminator and HTML special chars. The test TestFarewellEscape pins that contract.
  • The sidebar picker is mounted by hide_streamlit_chrome(), so every page that calls that helper automatically gets the picker. Pages that don't call it (rare) can call render_language_selector() directly.

Licensing

The license layer lives at src/license/. The public API:

from src.license import (
    get_manager, require_feature, current_state,
    FeatureFlag, Tier, License,
)

mgr = get_manager()
if not mgr.is_valid():
    raise RuntimeError("Not licensed")
require_feature(FeatureFlag.DEDUPLICATOR)

Storage: ~/.datatools/license.json (override via DATATOOLS_LICENSE_PATH). Signed locally with HMAC-SHA256 using a secret read from DATATOOLS_LICENSE_SECRET (build-time replace; the in-repo default is a development placeholder).
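
For orientation, signing and verifying a payload with HMAC-SHA256 looks roughly like this with the standard library. The payload fields, the canonical JSON encoding, and the function names are assumptions; the actual license file layout and the `DTLIC1:` blob format are not shown here.

```python
import hashlib
import hmac
import json


def sign_payload(payload: dict, secret: bytes) -> str:
    # Canonical JSON (sorted keys, no extra spaces) so the same payload
    # always produces the same bytes, and therefore the same signature.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_payload(payload: dict, signature: str, secret: bytes) -> bool:
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(sign_payload(payload, secret), signature)


secret = b"development-placeholder"  # hypothetical value
license_payload = {"name": "Jane Doe", "tier": "lite"}
sig = sign_payload(license_payload, secret)
print(verify_payload(license_payload, sig, secret))
```

Editing any field after signing, for example changing the tier, makes verification fail unless the editor also knows the secret.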

Dev bypass: DATATOOLS_DEV_MODE=1 short-circuits every check. The test suite's autouse fixture sets this so existing tests don't need their own license fixtures. Tests that need the real check explicitly use isolated_license_path / activated_license_manager / unactivated_license_manager.

Adding a feature flag:

  1. Add the enum value to FeatureFlag in src/license/schema.py.
  2. Add it to the relevant tier's set in FEATURES_BY_TIER in src/license/features.py.
  3. Gate at the call site: require_feature(FeatureFlag.YOUR_FLAG).

Adding a new tier:

  1. Add the enum value to Tier.
  2. Add a row to FEATURES_BY_TIER listing the unlocked flags.
  3. Add license.tier_<name> translation keys to every i18n pack.
  4. The activation flow, sidebar status badge, feature gate, and home grid lock badge all pick up the new tier automatically.

Worked example — the Lite tier:

# src/license/schema.py
class Tier(str, Enum):
    LITE = "lite"          # new
    CORE = "core"
    ...

# src/license/features.py
FEATURES_BY_TIER = {
    ...
    Tier.LITE: frozenset({
        FeatureFlag.DEDUPLICATOR,
        FeatureFlag.TEXT_CLEANER,
        FeatureFlag.FORMAT_STANDARDIZER,
    }),
    Tier.CORE: _all(),
    ...
}

Then in en.json/es.json add license.tier_lite. That's it — the existing require_feature_or_render_upgrade (GUI) and guard(feature=...) (CLI) calls in every tool page/CLI route a Lite user into the upgrade prompt for any tool the tier doesn't unlock. The home grid's lock badge fires off the same feature lookup.
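
The lookup behind the gate can be sketched in a self-contained form. Several details are assumptions made for illustration: `FUZZY_MATCHER` is a hypothetical Core-only flag, `FeatureLockedError` is a hypothetical exception, and the tier is passed in explicitly, whereas the real `require_feature()` takes only the flag and reads the active license itself.

```python
from enum import Enum


class FeatureFlag(str, Enum):
    DEDUPLICATOR = "deduplicator"
    TEXT_CLEANER = "text_cleaner"
    FORMAT_STANDARDIZER = "format_standardizer"
    FUZZY_MATCHER = "fuzzy_matcher"  # hypothetical Core-only flag


class Tier(str, Enum):
    LITE = "lite"
    CORE = "core"


FEATURES_BY_TIER = {
    Tier.LITE: frozenset({
        FeatureFlag.DEDUPLICATOR,
        FeatureFlag.TEXT_CLEANER,
        FeatureFlag.FORMAT_STANDARDIZER,
    }),
    Tier.CORE: frozenset(FeatureFlag),  # every flag
}


class FeatureLockedError(PermissionError):  # hypothetical exception type
    pass


def require_feature(flag: FeatureFlag, tier: Tier) -> None:
    """Raise when the tier does not unlock the flag (simplified signature)."""
    if flag not in FEATURES_BY_TIER[tier]:
        raise FeatureLockedError(f"{flag.value} is not included in the {tier.value} tier")
```

Keeping the tier-to-flags mapping in one table is what lets the gate, the status badge, and the home grid agree without per-tier branches.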

Minting a license (creator-only):

DATATOOLS_LICENSE_SECRET=<shipping-secret> \
    python scripts/generate_license.py \
        --name "Jane Doe" --email jane@example.com \
        --tier core --years 1

The script prints a DTLIC1: blob to stdout — deliver this in the Gumroad / purchase email. The buyer pastes it into the activation page or runs python -m src.license_cli activate <blob> --name ....

Add a format-standardizer field type

  1. Add value to FieldType enum in core/format_standardize.py.
  2. Add per-cell standardize_<x>(value, *, …) returning (new_value, changed).
  3. Add option fields to StandardizeOptions (with defaults that preserve existing behavior).
  4. Wire into _apply_field_type() dispatcher (the else branch raises AssertionError — every enum value needs a branch).
  5. Add validation entry in StandardizeOptions.from_dict() for any new enum-shaped option.
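
Step 2's `(new_value, changed)` contract might look like the sketch below. `standardize_zip` and its `keep_plus4` option are hypothetical and not part of the current module; the point is the return shape and the rule that unrecognized input is left untouched.

```python
from typing import Optional


def standardize_zip(value: Optional[str], *, keep_plus4: bool = True) -> tuple[Optional[str], bool]:
    """Hypothetical per-cell standardizer for US ZIP codes."""
    if value is None or not isinstance(value, str):
        return value, False
    digits = "".join(ch for ch in value if ch.isdigit())
    if len(digits) == 5:
        new = digits
    elif len(digits) == 9:
        new = f"{digits[:5]}-{digits[5:]}" if keep_plus4 else digits[:5]
    else:
        return value, False  # unrecognized shape: leave the cell alone
    # `changed` is True only when the output differs from the input.
    return new, new != value
```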

Errors

Use core/errors.py instead of raw ValueError / OSError:

| Pattern | Use |
| --- | --- |
| Bad arg, wrong type, missing column | `InputValidationError` |
| Bad config / options file | `ConfigError` |
| File parses but isn't what we expected | `FileFormatError` |
| File I/O failure (perms, missing, disk full) | `FileAccessError` |
| Internal invariant broken (unreachable branch) | `AssertionError` |

Helpers:

  • ensure_dataframe(value, function="my_func") at every public entry that takes a df.
  • ensure_choice(value, name="mode", choices=[...]) at every entry that takes a literal.
  • wrap_file_read(path, "operation", exc) / wrap_file_write(...) when wrapping OSError.

GUI / CLI handlers: use format_for_user(exc, context="...") to render.

All DataToolsError subclasses extend stdlib ValueError or OSError so existing handlers still catch them.
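
One way to get that compatibility is multiple inheritance from the project root and the matching stdlib type. The class bodies and the return value of `ensure_choice()` below are assumptions; only the class names and the keyword arguments come from this guide.

```python
class DataToolsError(Exception):
    """Assumed root of the hierarchy."""


class InputValidationError(DataToolsError, ValueError):
    pass


class FileAccessError(DataToolsError, OSError):
    pass


def ensure_choice(value, *, name, choices):
    """Validate a literal argument; returning the value is an assumption."""
    if value not in choices:
        raise InputValidationError(
            f"{name} must be one of {sorted(choices)}, got {value!r}"
        )
    return value


try:
    ensure_choice("fastest", name="mode", choices=["first", "last"])
except ValueError as exc:  # a pre-existing ValueError handler still catches it
    print(type(exc).__name__, exc)
```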

Tests

# All (core + CLI + GUI)
pytest -q
# Quick loop — skip the GUI layer
pytest -q -m 'not gui'
# Only the GUI tests
pytest -q -m gui
# By module
pytest tests/test_dedup.py
# Only the slow / integration tests
pytest -m slow
# Single test
pytest tests/test_dedup.py::TestExactMatch::test_basic

Test layout:

tests/
├── conftest.py                        # core/CLI fixtures
├── test_dedup.py · test_normalizers.py · test_io.py · test_config.py
├── test_analyze.py · test_normalize.py · test_text_clean.py
├── test_format_standardize.py
├── test_format_standardize_corpus.py  # 199-row buyer corpus
├── test_audit_fixes.py · test_errors.py · test_fixes_unit.py
├── test_corpus.py · test_encodings_corpus.py · test_fixtures_sweep.py
├── test_cli.py · test_cli_*.py · test_e2e.py · test_install.py
├── test_perf_regressions.py           # shape pins for the perf wins
└── gui/                               # Streamlit AppTest-driven tests
    ├── conftest.py                    # AppTest fixtures + helpers
    ├── _findings_panel_harness.py     # isolated component test page
    ├── test_smoke.py                  # every page renders in EN + ES
    ├── test_chrome.py                 # language selector, hide_chrome
    ├── test_gate.py                   # require_normalization_gate
    ├── test_workflows.py              # happy path per Ready tool
    ├── test_dedup_review.py           # match-group card interactions
    ├── test_advanced_panels.py        # config_panel widgets
    ├── test_errors.py                 # malformed-upload error paths
    └── test_findings_panel.py         # analyzer findings rendering

GUI test layer

GUI tests drive pages with streamlit.testing.v1.AppTest — in-process, no browser, no display. They pre-populate st.session_state with stashed-upload bytes (via the stash_upload() helper in tests/gui/conftest.py) and either click buttons via app.button[i].click().run() or assert on the session_state after the run.

The gui marker is registered in pytest.ini. A default pytest run executes everything; pytest -m 'not gui' skips the GUI tests for a faster core-only loop. Coming-Soon stubs are pinned by the smoke tests so a regression ("import error", "missing widget") shows up immediately.

Fixture corpora: test-cases/text-cleaner-corpus/ (21 files) · test-cases/encodings-corpus/ (31 files) · test-cases/format-cleaner-corpus/ (7 files + spec).

Known limitations

  • Dedup pair-compare is O(n²) for fuzzy strategies. Exact-only strategies (every column uses Algorithm.EXACT at threshold 100) now route through an O(n) groupby fast path automatically — no API change. Fuzzy strategies can opt into prefix blocking via deduplicate(..., blocking_columns=[...], blocking_prefix_len=1) to partition pairs by a cheap key (trades recall for speed).
  • Threading is opt-in for format_standardize: StandardizeOptions.parallel_columns > 1 uses a thread pool. On CPython 3.12 the GIL caps the win at roughly neutral; the scaffolding is in place for free-threaded Python 3.13+.
  • Memory-bound — entire file loaded into pandas. Streaming reads exist but not integrated with the dedup engine.
  • No multi-sheet dedup — each Excel sheet processed independently.
  • Phonenumbers minimum-length — international numbers without country codes fall back to digits-only.
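
The prefix-blocking trade-off from the first limitation can be illustrated with a short sketch. `blocked_pairs` is a hypothetical helper, not the implementation behind `blocking_columns`; it shows how partitioning by a cheap key shrinks the number of pairs that reach the fuzzy comparison.

```python
from collections import defaultdict
from itertools import combinations


def blocked_pairs(values: list[str], prefix_len: int = 1):
    """Yield index pairs that share the same lowercase prefix (illustrative only)."""
    blocks = defaultdict(list)
    for idx, value in enumerate(values):
        blocks[value.strip().lower()[:prefix_len]].append(idx)
    # Pairs are only generated inside a block, never across blocks.
    for members in blocks.values():
        yield from combinations(members, 2)


names = ["Smith", "Smyth", "Jones", "Johns", "Schmidt"]
print(list(blocked_pairs(names)))
```

With five names the unblocked comparison would test 10 pairs; blocking on the first letter tests 4. The cost is recall: values such as "Catherine" and "Katherine" land in different blocks and are never compared.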