Developer Guide
Architecture, data flow, extension points.
Architecture
CLI (src/cli*.py)        GUI (src/gui/app.py + pages/)
        │                          │
        └──────────┐    ┌──────────┘
                   ▼    ▼
             ┌────────────────┐
             │   src/core/    │
             └────────────────┘
Core/UI rule: business logic in core/ only. CLI + GUI translate user input → core call → display result.
Module map
| Module | Public surface |
|---|---|
| i18n | t(key, lang=None, **fmt), current_language(), set_language(), render_language_selector(), LANGUAGES |
| core.dedup | deduplicate(), MatchStrategy, ColumnMatchStrategy, Algorithm, SurvivorRule, DeduplicationResult, MatchResult, build_default_strategies() |
| core.normalizers | normalize_email/phone/name/address/string, NormalizerType, get_normalizer() |
| core.io | read_file(), write_file(), list_sheets(), detect_encoding/delimiter/header_row, repair_bytes() |
| core.config | DeduplicationConfig.from_file/to_file/to_strategies/to_survivor_rule |
| core.analyze | analyze(), Finding, findings_by_tool(), _NULL_LIKE |
| core.fixes | @register("fix_id") decorator, get_fix(), available_actions() |
| core.normalize | auto_fix(), apply_decisions(), NormalizationResult, is_normalized() |
| core.text_clean | clean_dataframe(), CleanOptions, CleanResult, smart_title_case() |
| core.format_standardize | standardize_dataframe(), StandardizeOptions, StandardizeResult, FieldType, per-cell standardize_*() |
| core.errors | DataToolsError hierarchy, ensure_dataframe(), ensure_choice(), wrap_file_read/write(), format_for_user() |
| core._constants | US_STATE_NAMES, US_STATE_CODES, USPS_EXPANSIONS, USPS_COMPRESSIONS |
Data flow — Find Duplicates
read_file() # auto-detect encoding, delimiter, header
▼ DataFrame
build_default_strategies() # if no explicit strategies
▼ # strong keys (email, phone) → standalone OR
# weak keys (name, address) → AND with strong
_apply_normalizations() # add _norm_* shadow columns
▼
_find_match_groups() # O(n²) pair compare, OR strategies, union-find
▼
[review_callback()] # optional interactive review
▼
_select_survivor() # per group: first/last/most-complete/most-recent
▼
[_merge_group()] # optional: fill blanks from losers
▼
DeduplicationResult # deduplicated_df, removed_df, match_groups, log
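The grouping step in the middle of that flow (pair compare, OR across strategies, union-find) can be sketched in isolation. This is a minimal illustration, not the code in core/dedup.py: group_matches, its plain-dict rows, and the is_match predicate are hypothetical stand-ins for _find_match_groups() and the configured strategies.

```python
from itertools import combinations


def group_matches(records, is_match):
    """Group record indices whose pairs match, directly or transitively."""
    parent = list(range(len(records)))

    def find(i):
        # Walk to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # O(n^2) pair compare: every pair is tested exactly once.
    for a, b in combinations(range(len(records)), 2):
        if is_match(records[a], records[b]):
            parent[find(b)] = find(a)  # union the two groups

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    # Only groups with more than one member are duplicate groups.
    return sorted(g for g in groups.values() if len(g) > 1)


rows = [
    {"email": "a@x.com", "phone": "111"},
    {"email": "a@x.com", "phone": "222"},
    {"email": "b@x.com", "phone": "222"},
    {"email": "c@x.com", "phone": "333"},
]


def same(r, s):
    # OR across two strong keys: a shared email or a shared phone is a match.
    return r["email"] == s["email"] or r["phone"] == s["phone"]


match_groups = group_matches(rows, same)
```

Rows 0 and 2 never match each other directly; they end up in the same group because each matches row 1. That transitive merge is what union-find buys over a plain list of matching pairs.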
Extension recipes
Add a normalizer
- Add function to core/normalizers.py:

      import re
      from typing import Optional

      def normalize_company(value: Optional[str]) -> str:
          """Casefold a company name and strip common legal suffixes."""
          if not value or not isinstance(value, str):
              return ""
          name = value.strip().casefold()
          for sfx in ("inc", "llc", "corp", "ltd", "co"):
              # Strip this suffix if it ends the name, with or without a period.
              name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
          return name

- Register: add COMPANY = "company" to NormalizerType + entry in _NORMALIZER_MAP.
- Auto-detect (optional): add a _COLUMN_TYPE_PATTERNS row in core/dedup.py.
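A rough picture of the registration step, assuming NormalizerType is a string enum and _NORMALIZER_MAP is a plain dict from enum member to function. Both shapes are assumptions for illustration; check core/normalizers.py for the real definitions. normalize_company is repeated so the block runs on its own.

```python
import re
from enum import Enum
from typing import Callable, Dict, Optional


def normalize_company(value: Optional[str]) -> str:
    if not value or not isinstance(value, str):
        return ""
    name = value.strip().casefold()
    for sfx in ("inc", "llc", "corp", "ltd", "co"):
        name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
    return name


class NormalizerType(str, Enum):
    """Stand-in for the real enum; only the new member is shown."""

    COMPANY = "company"


# Assumed shape: one entry per enum member.
_NORMALIZER_MAP: Dict[NormalizerType, Callable[[Optional[str]], str]] = {
    NormalizerType.COMPANY: normalize_company,
}


def get_normalizer(kind) -> Callable[[Optional[str]], str]:
    # Accepts either the enum member or its string value.
    return _NORMALIZER_MAP[NormalizerType(kind)]


cleaned = get_normalizer("company")("  Globex LLC ")
```

Here cleaned is "globex", so two rows spelled "Globex LLC" and "globex" compare equal on the normalized value.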
Add a fuzzy algorithm
- Add value to Algorithm enum in core/dedup.py.
- Add case in _compute_similarity().
- Document the value in CLI help text.
Add a survivor rule
- Add value to SurvivorRule enum.
- Add branch in _select_survivor().
- Add CLI mapping.
Add a fix + detector (analyzer/gate)
- Detector in core/analyze.py: add _detect_<thing>(df) -> list[Finding], hook into the main analyze() pipeline. Emit a Finding with a unique fix_action id.
- Fix in core/fixes.py:

      @register("fix_id")
      def my_fix(df, payload=None) -> tuple[pd.DataFrame, int]:
          # ...
          return out_df, cells_changed

- Constant in core/analyze.py: add FIX_<NAME> = "fix_id" so the detector and fix can reference it.
No other call sites change. Gate auto-discovers it via the registry.
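Why no other call site changes: a decorator-based registry lets the gate look fixes up by id. The sketch below shows the pattern only. The registry internals, the duplicate-id check, and the strip_whitespace fix are hypothetical, and rows are plain dicts instead of a DataFrame so the block needs no pandas.

```python
from typing import Callable, Dict, List, Tuple

Rows = List[dict]  # stand-in for pd.DataFrame in this sketch
FixFn = Callable[..., Tuple[Rows, int]]

_REGISTRY: Dict[str, FixFn] = {}


def register(fix_id: str):
    """Decorator that files a fix function under its id."""

    def decorator(fn: FixFn) -> FixFn:
        if fix_id in _REGISTRY:
            raise ValueError(f"duplicate fix id: {fix_id}")
        _REGISTRY[fix_id] = fn
        return fn

    return decorator


def get_fix(fix_id: str) -> FixFn:
    return _REGISTRY[fix_id]


def available_actions() -> List[str]:
    return sorted(_REGISTRY)


FIX_STRIP_WHITESPACE = "strip_whitespace"  # hypothetical id


@register(FIX_STRIP_WHITESPACE)
def strip_whitespace(rows: Rows, payload=None) -> Tuple[Rows, int]:
    out, cells_changed = [], 0
    for row in rows:
        new_row = {}
        for key, value in row.items():
            if isinstance(value, str) and value != value.strip():
                value = value.strip()
                cells_changed += 1
            new_row[key] = value
        out.append(new_row)
    return out, cells_changed


fixed, changed = get_fix(FIX_STRIP_WHITESPACE)([{"name": " Ann "}, {"name": "Bob"}])
```

A detector only needs to emit the id string; whoever applies the finding resolves it through get_fix().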
Tool page header — render_tool_header(tool_id)
Every tool page renders its title block via render_tool_header(tool_id) in src/gui/components/_legacy.py — do not call st.title() + st.caption() directly. The helper renders:
- tools.<id>.page_title as the page title (left column).
- A Help popover button right of the title (icon :material/help_outline:, label from help.button_label). Clicking opens an st.popover containing the markdown body.
- tools.<id>.page_caption as the caption below.
All copy is i18n-driven; editors can tweak help text without touching Python. If a tool is missing its help_md key, the popover falls back to help.missing_body.
help_md structure (markdown, stored as a single string with \n line breaks in JSON):
**When to use**
- bullet 1
- bullet 2
**Steps**
1. numbered step
2. numbered step
**Examples**
- example 1
- example 2
**Tip** one-sentence pro tip.
Keep it short — the popover is intentionally compact. Mirror the structure across every tool so the muscle memory transfers.
i18n — language packs
The GUI's user-facing strings live in src/i18n/packs/<code>.json, keyed by ISO-639-1 code. English (en.json) is canonical; missing keys in other packs fall back to English, and missing keys in English fall back to the literal dotted key so a typo is visible rather than silent.
Look up a string in code:
from src.i18n import t
st.button(t("upload.run_button"))
st.warning(t("gate.warning", name=filename)) # {name} interpolated via str.format
t() reads the active language from st.session_state["ui_lang"]. Outside a Streamlit run (tests, scripts) it falls back to English.
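The fallback chain (active pack, then English, then the literal dotted key) can be sketched as follows. The in-memory PACKS dict and the _lookup helper are assumptions for illustration. The real t() loads JSON packs and reads the language from session state, which this sketch leaves out.

```python
from typing import Optional

# Tiny in-memory packs; the real ones live in src/i18n/packs/<code>.json.
PACKS = {
    "en": {"upload": {"run_button": "Run", "hint": "Pick a file, {name}"}},
    "es": {"upload": {"run_button": "Ejecutar"}},
}


def _lookup(pack: dict, dotted_key: str) -> Optional[str]:
    node = pack
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node if isinstance(node, str) else None


def t(key: str, lang: Optional[str] = None, **fmt) -> str:
    text = _lookup(PACKS.get(lang or "en", {}), key)
    if text is None:
        text = _lookup(PACKS["en"], key)  # missing in the pack: use English
    if text is None:
        return key  # missing in English too: show the dotted key
    return text.format(**fmt) if fmt else text


translated = t("upload.run_button", lang="es")
fallback = t("upload.hint", lang="es", name="Ana")
typo = t("upload.run_buton", lang="es")
```

The last call returns the key itself, which is what makes a typo visible on the page instead of rendering as an empty label.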
Add a new language:
- Copy src/i18n/packs/en.json to src/i18n/packs/<code>.json and translate values in place. Keep the key tree identical.
- Add a one-line entry to LANGUAGES in src/i18n/__init__.py: {"code": "fr", "label": "Français"}. The sidebar picker auto-renders.
- Run pytest tests/test_lang_packs.py. The parity test fails until every key from en.json exists in the new pack (and orphan keys not in English are also flagged).
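What the parity test checks can be approximated as a set comparison over flattened dotted keys. This is a sketch of the idea, not the contents of tests/test_lang_packs.py. flatten_keys and parity_report are hypothetical names.

```python
def flatten_keys(tree: dict, prefix: str = "") -> set:
    """Collect every leaf as a dotted key, e.g. 'upload.run_button'."""
    keys = set()
    for name, value in tree.items():
        dotted = f"{prefix}{name}"
        if isinstance(value, dict):
            keys |= flatten_keys(value, dotted + ".")
        else:
            keys.add(dotted)
    return keys


def parity_report(canonical: dict, other: dict) -> dict:
    base, candidate = flatten_keys(canonical), flatten_keys(other)
    return {
        "missing": sorted(base - candidate),  # in English, not in the pack
        "orphan": sorted(candidate - base),   # in the pack, not in English
    }


en = {"home": {"title": "Home"}, "upload": {"run_button": "Run", "hint": "Pick a file"}}
fr = {"home": {"title": "Accueil"}, "upload": {"run_button": "Lancer", "old_key": "x"}}

report = parity_report(en, fr)
```

Both lists must be empty for a pack to pass: a missing key would silently fall back to English, and an orphan key is dead text that nobody will maintain.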
Add a new key:
- Add it to en.json first (canonical pack).
- Add it to every other registered pack in the same commit. The parity test enforces this.
- Use the dotted key at the call site: t("section.subsection.key") or t("section.key", name=value) for placeholder interpolation.
Authoring rules:
- Keys live under semantic sections (home.*, upload.*, findings.*, help.*, tools.<id>.name). Don't nest by language or by tool unless the string is genuinely tool-specific.
- Per-tool header copy lives under tools.<id>.{page_title, page_caption, help_md}. page_caption is the one-line subtitle under the title; help_md is the popover body (see Tool page header above). Top-level help.button_label / help.missing_body are shared across every tool.
- Use {named} placeholders (not positional {0}) so translators see what's being interpolated.
- Strings can contain Streamlit markdown (**bold**); pass through st.markdown / st.caption as usual.
- Do not put strings inside the farewell-overlay JS payload without going through _js_html_safe() in src/gui/components/_legacy.py; the helper escapes both the JS string terminator and HTML special chars. The test TestFarewellEscape pins that contract.
- The sidebar picker is mounted by hide_streamlit_chrome(), so every page that calls that helper automatically gets the picker. Pages that don't call it (rare) can call render_language_selector() directly.
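The named-placeholder rule is easy to lint with the standard library. The helper below is a hypothetical sketch (nothing in the guide says the repo ships one); it uses string.Formatter to list any positional fields in a pack string.

```python
from string import Formatter


def positional_placeholders(text: str) -> list:
    """Return positional fields such as {0} or {} found in text."""
    bad = []
    for _literal, field, _spec, _conversion in Formatter().parse(text):
        if field is None:
            continue  # trailing literal text, no placeholder
        # Reduce "user.name" or "items[0]" to the root name before checking.
        root = field.split(".")[0].split("[")[0]
        if root == "" or root.isdigit():
            bad.append("{" + field + "}")
    return bad


ok = positional_placeholders("Saved {name} to **{folder}**")
flagged = positional_placeholders("Saved {0} of {}")
```

Running this over every value in a pack and asserting an empty result would catch positional fields before a translator sees them.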
Licensing
The license layer lives at src/license/. The public API:
from src.license import (
get_manager, require_feature, current_state,
FeatureFlag, Tier, License,
)
mgr = get_manager()
if not mgr.is_valid():
raise RuntimeError("Not licensed")
require_feature(FeatureFlag.DEDUPLICATOR)
Storage: ~/.datatools/license.json (override via
DATATOOLS_LICENSE_PATH). Signed with Ed25519 (asymmetric) — the
seller's private key signs; the buyer's binary verifies with the
embedded public key.
Key material:
| Variable | Who has it | Where it's used |
|---|---|---|
| DATATOOLS_LICENSE_PRIVKEY | Seller only | scripts/generate_license.py (mint a buyer's blob), scripts/generate_keypair.py writes a fresh one |
| DATATOOLS_LICENSE_PUBKEY | Every shipped binary | Verification at activation time; set at build time via PyInstaller env |
If neither env var is set, src.license.crypto falls back to the
deterministic dev keypair in src/license/_dev_keypair.py. The
dev key is in source on purpose (so tests work without secrets),
but a frozen build that's still using it is a build-config bug:
assert_production_safe() refuses to start such a binary.
First-time setup for shipped builds:
- Run python scripts/generate_keypair.py --output prod-keys.env to create a fresh keypair.
- Stash DATATOOLS_LICENSE_PRIVKEY somewhere safe (password manager / KMS). Lose it and you can't issue renewals without reshipping a new build with a new public key.
- Configure the PyInstaller build env with DATATOOLS_LICENSE_PUBKEY=<hex> so the shipped binary verifies against the production key.
- Mint buyer licenses with DATATOOLS_LICENSE_PRIVKEY=<hex> python scripts/generate_license.py ....
Dev bypass: DATATOOLS_DEV_MODE=1 short-circuits every check.
The test suite's autouse fixture sets this so existing tests don't
need their own license fixtures. Tests that need the real check
explicitly use isolated_license_path /
activated_license_manager / unactivated_license_manager.
Adding a feature flag:
- Add the enum value to FeatureFlag in src/license/schema.py.
- Add it to the relevant tier's set in FEATURES_BY_TIER in src/license/features.py.
- Gate at the call site: require_feature(FeatureFlag.YOUR_FLAG).
Adding a new tier:
- Add the enum value to Tier.
- Add a row to FEATURES_BY_TIER listing the unlocked flags.
- Add license.tier_<name> translation keys to every i18n pack.
- The activation flow, sidebar status badge, feature gate, and home grid lock badge all pick up the new tier automatically.
Worked example — the Lite tier:
# src/license/schema.py
class Tier(str, Enum):
LITE = "lite" # new
CORE = "core"
...
# src/license/features.py
FEATURES_BY_TIER = {
...
Tier.LITE: frozenset({
FeatureFlag.DEDUPLICATOR,
FeatureFlag.TEXT_CLEANER,
FeatureFlag.FORMAT_STANDARDIZER,
}),
Tier.CORE: _all(),
...
}
Then in en.json/es.json add license.tier_lite. That's it — the
existing require_feature_or_render_upgrade (GUI) and
guard(feature=...) (CLI) calls in every tool page/CLI route a
Lite user into the upgrade prompt for any tool the tier doesn't
unlock. The home grid's lock badge fires off the same feature
lookup.
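The lookup behind the gate amounts to a set membership test against FEATURES_BY_TIER. The sketch below is self-contained and partly hypothetical: the enum string values, the PDF_EXTRACTOR flag, FeatureLockedError, and check_feature (which takes the tier explicitly instead of reading it from the license manager) are all assumptions.

```python
from enum import Enum


class FeatureFlag(str, Enum):
    DEDUPLICATOR = "deduplicator"
    TEXT_CLEANER = "text_cleaner"
    FORMAT_STANDARDIZER = "format_standardizer"
    PDF_EXTRACTOR = "pdf_extractor"  # hypothetical flag outside Lite


class Tier(str, Enum):
    LITE = "lite"
    CORE = "core"


def _all() -> frozenset:
    return frozenset(FeatureFlag)


FEATURES_BY_TIER = {
    Tier.LITE: frozenset({
        FeatureFlag.DEDUPLICATOR,
        FeatureFlag.TEXT_CLEANER,
        FeatureFlag.FORMAT_STANDARDIZER,
    }),
    Tier.CORE: _all(),
}


class FeatureLockedError(PermissionError):
    """Hypothetical error raised when a tier does not unlock a flag."""


def check_feature(flag: FeatureFlag, tier: Tier) -> None:
    if flag not in FEATURES_BY_TIER[tier]:
        raise FeatureLockedError(f"{flag.value} is not included in the {tier.value} tier")


check_feature(FeatureFlag.DEDUPLICATOR, Tier.LITE)  # unlocked: returns quietly
lite_locked = sorted(f.value for f in _all() - FEATURES_BY_TIER[Tier.LITE])
```

Because every surface asks the same table, adding a tier is a data change: one enum member and one row.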
Minting a license (creator-only):
DATATOOLS_LICENSE_PRIVKEY=<hex> \
python scripts/generate_license.py \
--name "Jane Doe" --email jane@example.com \
--tier core --years 1
The script prints a DTLIC1: blob to stdout — deliver this in the
Gumroad / purchase email. The buyer pastes it into the activation
page or runs python -m src.license_cli activate <blob> --name ....
Add a format-standardizer field type
- Add value to FieldType enum in core/format_standardize.py.
- Add per-cell standardize_<x>(value, *, …) returning (new_value, changed).
- Add option fields to StandardizeOptions (with defaults that preserve existing behavior).
- Wire into _apply_field_type() dispatcher (the else branch raises AssertionError, so every enum value needs a branch).
- Add validation entry in StandardizeOptions.from_dict() for any new enum-shaped option.
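A per-cell function following the (new_value, changed) contract might look like the sketch below. The ZIP field type, standardize_zip, and its keep_plus4 option are hypothetical examples, not existing members of core/format_standardize.py.

```python
import re
from typing import Any, Tuple

# Five digits, optionally followed by a four-digit extension.
_ZIP_RE = re.compile(r"^\s*(\d{5})(?:[-\s]?(\d{4}))?\s*$")


def standardize_zip(value: Any, *, keep_plus4: bool = True) -> Tuple[Any, bool]:
    """Normalize a US ZIP code to 12345 or 12345-6789."""
    if not isinstance(value, str):
        return value, False  # non-strings pass through untouched
    match = _ZIP_RE.match(value)
    if match is None:
        return value, False  # unrecognized input is left as-is
    five, plus4 = match.groups()
    new_value = f"{five}-{plus4}" if (plus4 and keep_plus4) else five
    return new_value, new_value != value


fixed = standardize_zip(" 021391234 ")
untouched = standardize_zip("02139")
trimmed = standardize_zip("02139-1234", keep_plus4=False)
```

Returning the changed flag from the cell function lets the dataframe-level caller count edits without comparing frames afterwards, and the keyword-only option with a default keeps existing callers unaffected.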
Errors
Use core/errors.py instead of raw ValueError / OSError:
| Pattern | Use |
|---|---|
| Bad arg, wrong type, missing column | InputValidationError |
| Bad config / options file | ConfigError |
| File parses but isn't what we expected | FileFormatError |
| File I/O failure (perms, missing, disk full) | FileAccessError |
| Internal invariant broken (unreachable branch) | AssertionError |
Helpers:
- ensure_dataframe(value, function="my_func") at every public entry that takes a df.
- ensure_choice(value, name="mode", choices=[...]) at every entry that takes a literal.
- wrap_file_read(path, "operation", exc) / wrap_file_write(...) when wrapping OSError.
GUI / CLI handlers: use format_for_user(exc, context="...") to render.
All DataToolsError subclasses extend stdlib ValueError or OSError so existing handlers still catch them.
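The backward-compatibility point comes from multiple inheritance: each error derives from both the project base class and a stdlib exception. The class bodies, the ensure_choice return value, and the message wording below are assumptions; only the names and the call signature come from this guide.

```python
class DataToolsError(Exception):
    """Project-wide base class."""


class InputValidationError(DataToolsError, ValueError):
    """Bad arg, wrong type, missing column."""


class FileAccessError(DataToolsError, OSError):
    """File I/O failure."""


def ensure_choice(value, *, name, choices):
    # Assumed behavior: reject unknown literals, hand valid ones back.
    if value not in choices:
        raise InputValidationError(
            f"{name} must be one of {sorted(choices)}, got {value!r}"
        )
    return value


try:
    ensure_choice("fastest", name="mode", choices=["first", "last"])
except ValueError as exc:  # a pre-existing handler still catches it
    caught = exc
```

New code can catch DataToolsError to handle everything the tools raise, while older except ValueError / except OSError blocks keep working unchanged.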
PDF Extractor — bundled Tesseract
Frozen builds (installer / AppImage) ship Tesseract OCR inside the bundle so scanned PDFs work without a separate system install. Source / pip developer environments still resolve Tesseract from PATH.
Runtime layout (frozen bundles):
| Resource | Path |
|---|---|
| Tesseract binary | Path(sys._MEIPASS) / "tesseract" / "tesseract" (Linux/macOS), …/tesseract/tesseract.exe (Windows) |
| Tessdata directory | Path(sys._MEIPASS) / "tesseract" / "tessdata" |
| English model | Path(sys._MEIPASS) / "tesseract" / "tessdata" / "eng.traineddata" |
Discovery order (PDF Extractor runtime):
1. DATATOOLS_TESSERACT_BIN env var (override: explicit path to a tesseract binary).
2. Bundled path under sys._MEIPASS (frozen bundles only; falls through to step 3 otherwise).
3. tesseract on PATH (developer setups, source checkouts).
4. Windows well-known locations (C:\Program Files\Tesseract-OCR\tesseract.exe, etc.).
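The discovery order can be expressed as a chain of early returns. This is a sketch of the order only. find_tesseract and its env parameter are hypothetical, the real resolver may validate the override path, and only the one Windows location named above is checked.

```python
import os
import shutil
import sys
from pathlib import Path
from typing import Mapping, Optional


def find_tesseract(env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    env = os.environ if env is None else env

    # 1. Explicit override wins.
    override = env.get("DATATOOLS_TESSERACT_BIN")
    if override:
        return Path(override)

    # 2. Bundled binary; sys._MEIPASS only exists in PyInstaller builds.
    bundle_root = getattr(sys, "_MEIPASS", None)
    if bundle_root:
        exe = "tesseract.exe" if os.name == "nt" else "tesseract"
        bundled = Path(bundle_root) / "tesseract" / exe
        if bundled.is_file():
            return bundled

    # 3. Whatever is on PATH.
    on_path = shutil.which("tesseract", path=env.get("PATH"))
    if on_path:
        return Path(on_path)

    # 4. Windows well-known install location.
    if os.name == "nt":
        candidate = Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe")
        if candidate.is_file():
            return candidate

    return None


forced = find_tesseract({"DATATOOLS_TESSERACT_BIN": "/opt/tess/tesseract"})
```

Passing the environment in as a mapping keeps the function easy to test without patching os.environ.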
Where the bytes come from:
- Tessdata is vendored at build/vendor/tessdata/eng.traineddata: the "best" English model from tessdata_best. PyInstaller's spec copies it into tesseract/tessdata/ inside the bundle.
- Tesseract binary is fetched at build time by build/tesseract.py; per-platform download URLs are pinned in that module. The current pin is Tesseract 5.5.0. CI (.github/workflows/build.yml) imports fetch_tessdata + fetch_tesseract_for_platform and runs them before PyInstaller.
To update Tesseract:
- Bump the version pin + the per-platform fetch URLs in build/tesseract.py.
- If upstream changed the eng.traineddata schema, refresh build/vendor/tessdata/eng.traineddata from tessdata_best at the matching tag.
- Push a v* tag so CI rebuilds all three platforms, then smoke-test a scanned-PDF run through the PDF Extractor before publishing the release.
- Update LICENSE_TESSERACT.txt at the repo root if the upstream license terms change (Tesseract is Apache-2.0 today).
Tests
# All (core + CLI + GUI)
pytest -q
# Quick loop — skip the GUI layer
pytest -q -m 'not gui'
# Only the GUI tests
pytest -q -m gui
# By module
pytest tests/test_dedup.py
# Include slow / integration
pytest -m slow
# Single test
pytest tests/test_dedup.py::TestExactMatch::test_basic
Test layout:
tests/
├── conftest.py # core/CLI fixtures
├── test_dedup.py · test_normalizers.py · test_io.py · test_config.py
├── test_analyze.py · test_normalize.py · test_text_clean.py
├── test_format_standardize.py
├── test_format_standardize_corpus.py # 199-row buyer corpus
├── test_pipeline.py # pipeline engine: adapters, run, validate, serialize
├── test_cli_pipeline.py # pipeline CLI: recommend/apply/strict/audit
├── test_audit_fixes.py · test_errors.py · test_fixes_unit.py
├── test_corpus.py · test_encodings_corpus.py · test_fixtures_sweep.py
├── test_cli.py · test_cli_*.py · test_e2e.py · test_install.py
├── test_perf_regressions.py # shape pins for the perf wins
└── gui/ # Streamlit AppTest-driven tests
├── conftest.py # AppTest fixtures + helpers
├── _findings_panel_harness.py # isolated component test page
├── test_smoke.py # every page renders in EN + ES
├── test_chrome.py # language selector, hide_chrome
├── test_gate.py # require_normalization_gate
├── test_workflows.py # happy path per Ready tool
├── test_dedup_review.py # match-group card interactions
├── test_advanced_panels.py # config_panel widgets
├── test_pipeline_builder.py # module-card builder: cards, reorder, JSON, run
├── test_pipeline_phrasing.py # step_phrase/step_status + name bridge (pure fns)
├── test_errors.py # malformed-upload error paths
└── test_findings_panel.py # analyzer findings rendering
Pipeline (Automated Workflows) coverage
The pipeline feature is pinned end to end across four files (~115 tests):
test_pipeline.py (core engine — every adapter's summary numbers, step
data-flow, error stop/continue, empty/single-column/all-disabled edges,
dict + file serialization round-trips, recommended_pipeline(include=…),
soft-dependency validation), test_cli_pipeline.py (CLI — --recommend,
dry-run-by-default, --apply output + audit JSON, --steps, --strict,
--continue-on-error, arg validation, save→load round-trip),
test_pipeline_builder.py (the visual builder via AppTest — card seeding,
toggle, reorder ▲/▼, add/remove, restore-recommended, Advanced JSON
import/export, per-tool Configure panels emitting the right option dicts),
and test_pipeline_phrasing.py (the plain-English step_phrase/step_status
helpers and the adapter-key→friendly-name bridge as pure functions).
GUI test layer
GUI tests drive pages with streamlit.testing.v1.AppTest —
in-process, no browser, no display. They pre-populate
st.session_state with stashed-upload bytes (via the
stash_upload() helper in tests/gui/conftest.py) and either click
buttons via app.button[i].click().run() or assert on the
session_state after the run.
Marker registered in pytest.ini. Default pytest runs everything;
pytest -m 'not gui' skips them for a faster core-only loop.
Coming-Soon stubs are pinned by the smoke tests so a regression
("import error", "missing widget") shows up immediately.
Fixture corpora: test-cases/text-cleaner-corpus/ (21 files) · test-cases/encodings-corpus/ (31 files) · test-cases/format-cleaner-corpus/ (7 files + spec).
Known limitations
- Dedup pair-compare is O(n²) for fuzzy strategies. Exact-only strategies (every column uses Algorithm.EXACT at threshold 100) now route through an O(n) groupby fast path automatically, with no API change. Fuzzy strategies can opt into prefix blocking via deduplicate(..., blocking_columns=[...], blocking_prefix_len=1) to partition pairs by a cheap key (trades recall for speed).
- Threading is opt-in for format_standardize: StandardizeOptions.parallel_columns > 1 uses a thread pool. On CPython 3.12 the GIL caps the win at roughly neutral; the scaffolding is in place for free-threaded Python 3.13+.
- Memory-bound: the entire file is loaded into pandas. Streaming reads exist but are not integrated with the dedup engine.
- No multi-sheet dedup — each Excel sheet processed independently.
- Phonenumbers minimum-length — international numbers without country codes fall back to digits-only.
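To make the prefix-blocking trade-off concrete, here is a minimal sketch of partitioning candidate pairs by a cheap key. The parameter names mirror deduplicate(); how the real engine builds its block key (casefolding, stripping, handling blanks) is an assumption.

```python
from collections import defaultdict
from itertools import combinations


def blocked_pairs(rows, blocking_columns, blocking_prefix_len=1):
    """Yield index pairs, but only within rows that share a block key."""
    blocks = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(
            str(row.get(col, "")).strip().casefold()[:blocking_prefix_len]
            for col in blocking_columns
        )
        blocks[key].append(index)
    for members in blocks.values():
        yield from combinations(members, 2)


rows = [
    {"name": "Ann Lee"},
    {"name": "anne lee"},
    {"name": "Bob Ray"},
    {"name": "Hann Lee"},  # typo in the first letter
]

pairs = sorted(blocked_pairs(rows, ["name"]))
all_pairs = list(combinations(range(len(rows)), 2))
```

Blocking cuts six candidate pairs down to one, but "Hann Lee" is never compared against "Ann Lee" because the typo lands in the prefix. That lost comparison is the recall cost of blocking.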