Developer Guide
Architecture, data flow, extension points.
Architecture
CLI (src/cli*.py)        GUI (src/gui/app.py + pages/)
        │                        │
        └──────────┐  ┌──────────┘
                   ▼  ▼
            ┌────────────────┐
            │   src/core/    │
            └────────────────┘
Core/UI rule: business logic in core/ only. CLI + GUI translate user input → core call → display result.
Module map
| Module | Public surface |
|---|---|
| i18n | t(key, lang=None, **fmt), current_language(), set_language(), render_language_selector(), LANGUAGES |
| core.dedup | deduplicate(), MatchStrategy, ColumnMatchStrategy, Algorithm, SurvivorRule, DeduplicationResult, MatchResult, build_default_strategies() |
| core.normalizers | normalize_email/phone/name/address/string, NormalizerType, get_normalizer() |
| core.io | read_file(), write_file(), list_sheets(), detect_encoding/delimiter/header_row, repair_bytes() |
| core.config | DeduplicationConfig.from_file/to_file/to_strategies/to_survivor_rule |
| core.analyze | analyze(), Finding, findings_by_tool(), _NULL_LIKE |
| core.fixes | @register("fix_id") decorator, get_fix(), available_actions() |
| core.normalize | auto_fix(), apply_decisions(), NormalizationResult, is_normalized() |
| core.text_clean | clean_dataframe(), CleanOptions, CleanResult, smart_title_case() |
| core.format_standardize | standardize_dataframe(), StandardizeOptions, StandardizeResult, FieldType, per-cell standardize_*() |
| core.errors | DataToolsError hierarchy, ensure_dataframe(), ensure_choice(), wrap_file_read/write(), format_for_user() |
| core._constants | US_STATE_NAMES, US_STATE_CODES, USPS_EXPANSIONS, USPS_COMPRESSIONS |
Data flow — Deduplicator
read_file() # auto-detect encoding, delimiter, header
▼ DataFrame
build_default_strategies() # if no explicit strategies
▼ # strong keys (email, phone) → standalone OR
# weak keys (name, address) → AND with strong
_apply_normalizations() # add _norm_* shadow columns
▼
_find_match_groups() # O(n²) pair compare, OR strategies, union-find
▼
[review_callback()] # optional interactive review
▼
_select_survivor() # per group: first/last/most-complete/most-recent
▼
[_merge_group()] # optional: fill blanks from losers
▼
DeduplicationResult # deduplicated_df, removed_df, match_groups, log
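The grouping step can be illustrated with a small, self-contained sketch. It is not the code in core/dedup.py: rows are plain dicts rather than a DataFrame, and find_match_groups, same_email and same_phone are hypothetical names chosen for the example. It shows the O(n²) pair comparison, the OR across strategies, and union-find merging so that transitive matches land in one group.

```python
from itertools import combinations
from typing import Callable, Sequence


def find_match_groups(
    rows: Sequence[dict],
    strategies: Sequence[Callable[[dict, dict], bool]],
) -> list[list[int]]:
    """Group row indices that match under ANY strategy (OR semantics)."""
    parent = list(range(len(rows)))

    def find(i: int) -> int:
        # Walk to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a: int, b: int) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller index as root so groups are stable.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    # O(n^2): every unordered pair is compared once.
    for i, j in combinations(range(len(rows)), 2):
        if any(match(rows[i], rows[j]) for match in strategies):
            union(i, j)

    groups: dict[int, list[int]] = {}
    for i in range(len(rows)):
        groups.setdefault(find(i), []).append(i)
    # Only groups with more than one member are duplicates.
    return [members for members in groups.values() if len(members) > 1]


def same_email(a: dict, b: dict) -> bool:
    return bool(a.get("email")) and a.get("email") == b.get("email")


def same_phone(a: dict, b: dict) -> bool:
    return bool(a.get("phone")) and a.get("phone") == b.get("phone")


rows = [
    {"email": "a@x.com", "phone": "111"},
    {"email": "a@x.com", "phone": "222"},
    {"email": "b@x.com", "phone": "222"},
    {"email": "c@x.com", "phone": "333"},
]
groups = find_match_groups(rows, [same_email, same_phone])
print(groups)  # [[0, 1, 2]]
```

Rows 0 and 2 share neither email nor phone, yet they end up in the same group because each matches row 1. That transitive closure is what union-find provides.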
Extension recipes
Add a normalizer
- Add function to core/normalizers.py:

      import re
      from typing import Optional

      def normalize_company(value: Optional[str]) -> str:
          # Non-string or empty input normalizes to "".
          if not value or not isinstance(value, str):
              return ""
          name = value.strip().casefold()
          # Strip one trailing legal suffix per pass, e.g. "Acme Corp." -> "acme".
          for sfx in ("inc", "llc", "corp", "ltd", "co"):
              name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
          return name
- Register: add COMPANY = "company" to NormalizerType + entry in _NORMALIZER_MAP.
- Auto-detect (optional): add a _COLUMN_TYPE_PATTERNS row in core/dedup.py.
Add a fuzzy algorithm
- Add value to Algorithm enum in core/dedup.py.
- Add case in _compute_similarity().
- Document the value in CLI help text.
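As a rough sketch of the first two steps, the example below adds a token-set algorithm to a stand-in enum and gives it a branch in a stand-in similarity function. The enum members (EXACT, RATIO, TOKEN_SET) and the compute_similarity signature are assumptions made for illustration; the real Algorithm enum and _compute_similarity() may look different.

```python
from difflib import SequenceMatcher
from enum import Enum


class Algorithm(Enum):
    # Hypothetical members; the real enum lives in core/dedup.py.
    EXACT = "exact"
    RATIO = "ratio"
    TOKEN_SET = "token_set"  # the newly added value


def compute_similarity(a: str, b: str, algorithm: Algorithm) -> float:
    """Return a similarity score in [0.0, 1.0]."""
    if algorithm is Algorithm.EXACT:
        return 1.0 if a == b else 0.0
    if algorithm is Algorithm.RATIO:
        return SequenceMatcher(None, a, b).ratio()
    if algorithm is Algorithm.TOKEN_SET:
        # Jaccard overlap of whitespace tokens; word order is ignored.
        tokens_a, tokens_b = set(a.split()), set(b.split())
        if not tokens_a and not tokens_b:
            return 1.0
        return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
    # Unreachable once every enum value has a branch.
    raise AssertionError(f"unhandled algorithm: {algorithm!r}")


score = compute_similarity("john smith", "smith john", Algorithm.TOKEN_SET)
```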
Add a survivor rule
- Add value to SurvivorRule enum.
- Add branch in _select_survivor().
- Add CLI mapping.
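A minimal sketch of what such a branch might look like, assuming a group is a list of dict rows and the function returns the index of the row to keep. The member names and the select_survivor signature are illustrative assumptions, not the actual core/dedup.py code.

```python
from enum import Enum


class SurvivorRule(Enum):
    # Hypothetical member names for the example.
    FIRST = "first"
    LAST = "last"
    MOST_COMPLETE = "most_complete"


def select_survivor(group: list[dict], rule: SurvivorRule) -> int:
    """Return the index (within the group) of the row to keep."""
    if rule is SurvivorRule.FIRST:
        return 0
    if rule is SurvivorRule.LAST:
        return len(group) - 1
    if rule is SurvivorRule.MOST_COMPLETE:
        def filled(row: dict) -> int:
            # Count cells that hold a real value.
            return sum(1 for v in row.values() if v not in (None, ""))
        # max() keeps the earliest row on ties.
        return max(range(len(group)), key=lambda i: filled(group[i]))
    raise AssertionError(f"unhandled survivor rule: {rule!r}")


group = [
    {"name": "Ann", "email": ""},
    {"name": "Ann", "email": "ann@x.com"},
    {"name": "Ann", "email": "ann@y.com"},
]
keep = select_survivor(group, SurvivorRule.MOST_COMPLETE)
```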
Add a fix + detector (analyzer/gate)
- Detector in core/analyze.py: add _detect_<thing>(df) -> list[Finding], hook into the main analyze() pipeline. Emit Finding with a unique fix_action id.
- Fix in core/fixes.py:

      @register("fix_id")
      def my_fix(df, payload=None) -> tuple[pd.DataFrame, int]:
          # ...
          return out_df, cells_changed
- Constant in core/analyze.py: add FIX_<NAME> = "fix_id" so the detector and fix can reference it.
No other call sites change. Gate auto-discovers it via the registry.
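The registry pattern behind this can be sketched as follows. This is an assumption about how a decorator-based registry of this shape typically works, not the contents of core/fixes.py. To keep the example dependency-free, rows are a list of dicts standing in for a DataFrame, and FIX_STRIP_WHITESPACE is a hypothetical fix id.

```python
from typing import Callable, Optional

Rows = list[dict]  # stand-in for pd.DataFrame in this sketch
FixFn = Callable[[Rows, Optional[dict]], tuple[Rows, int]]

_REGISTRY: dict[str, FixFn] = {}


def register(fix_id: str) -> Callable[[FixFn], FixFn]:
    """Decorator that files a fix function under its id."""
    def decorator(fn: FixFn) -> FixFn:
        if fix_id in _REGISTRY:
            raise ValueError(f"duplicate fix id: {fix_id}")
        _REGISTRY[fix_id] = fn
        return fn
    return decorator


def get_fix(fix_id: str) -> FixFn:
    return _REGISTRY[fix_id]


def available_actions() -> list[str]:
    return sorted(_REGISTRY)


FIX_STRIP_WHITESPACE = "strip_whitespace"  # hypothetical id


@register(FIX_STRIP_WHITESPACE)
def strip_whitespace(rows: Rows, payload: Optional[dict] = None) -> tuple[Rows, int]:
    out, changed = [], 0
    for row in rows:
        new_row = {}
        for key, value in row.items():
            cleaned = value.strip() if isinstance(value, str) else value
            changed += cleaned != value  # bool counts as 0 or 1
            new_row[key] = cleaned
        out.append(new_row)
    return out, changed


fixed, cells_changed = get_fix(FIX_STRIP_WHITESPACE)([{"a": " x ", "b": "y"}])
```

Because callers look fixes up by id, a new fix only needs the decorator and the shared constant; nothing that consumes the registry has to be edited.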
i18n — language packs
The GUI's user-facing strings live in src/i18n/packs/<code>.json, keyed by ISO-639-1 code. English (en.json) is canonical; missing keys in other packs fall back to English, and missing keys in English fall back to the literal dotted key so a typo is visible rather than silent.
Look up a string in code:
from src.i18n import t
st.button(t("upload.run_button"))
st.warning(t("gate.warning", name=filename)) # {name} interpolated via str.format
t() reads the active language from st.session_state["ui_lang"]. Outside a Streamlit run (tests, scripts) it falls back to English.
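The fallback chain described above (active pack, then English, then the literal key) can be sketched without Streamlit. The PACKS dict, the _lookup helper and the sample strings are made up for the example; the real t() loads JSON packs and reads the language from session state.

```python
from typing import Any, Optional

# In-memory stand-ins for src/i18n/packs/<code>.json (sample strings only).
PACKS = {
    "en": {
        "upload": {"run_button": "Run"},
        "gate": {"warning": "Check {name} first"},
    },
    "fr": {
        "upload": {"run_button": "Lancer"},
    },
}


def _lookup(pack: dict, dotted_key: str) -> Optional[str]:
    """Walk a nested dict by dotted key; None if any part is missing."""
    node: Any = pack
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node if isinstance(node, str) else None


def t(key: str, lang: Optional[str] = None, **fmt: Any) -> str:
    lang = lang or "en"  # assumed default outside a Streamlit run
    template = _lookup(PACKS.get(lang, {}), key)
    if template is None:
        template = _lookup(PACKS["en"], key)  # fall back to English
    if template is None:
        return key  # a typo shows up as the literal dotted key
    return template.format(**fmt) if fmt else template


print(t("upload.run_button", lang="fr"))              # Lancer
print(t("gate.warning", lang="fr", name="data.csv"))  # Check data.csv first
print(t("upload.missing"))                            # upload.missing
```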
Add a new language:
- Copy src/i18n/packs/en.json to src/i18n/packs/<code>.json and translate values in place. Keep the key tree identical.
- Add a one-line entry to LANGUAGES in src/i18n/__init__.py: {"code": "fr", "label": "Français"}. The sidebar picker auto-renders.
- Run pytest tests/test_lang_packs.py. The parity test fails until every key from en.json exists in the new pack (orphan keys not in English are also flagged).
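The parity check amounts to comparing flattened key sets in both directions. The sketch below is one way such a test could be written; flatten_keys and parity_report are hypothetical helpers, and the actual tests/test_lang_packs.py may be structured differently.

```python
def flatten_keys(pack: dict, prefix: str = "") -> set[str]:
    """Collect dotted keys for every leaf string in a nested pack."""
    keys: set[str] = set()
    for name, value in pack.items():
        dotted = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            keys |= flatten_keys(value, dotted)
        else:
            keys.add(dotted)
    return keys


def parity_report(canonical: dict, other: dict) -> tuple[set[str], set[str]]:
    """Return (keys missing from other, orphan keys not in canonical)."""
    base, candidate = flatten_keys(canonical), flatten_keys(other)
    return base - candidate, candidate - base


en = {"upload": {"run_button": "Run"}, "gate": {"warning": "Check {name}"}}
fr = {"upload": {"run_button": "Lancer"}, "home": {"extra": "Bonus"}}

missing, orphans = parity_report(en, fr)
print(missing)  # {'gate.warning'}
print(orphans)  # {'home.extra'}
```

A test would then assert that both sets are empty for every registered pack.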
Add a new key:
- Add it to en.json first (canonical pack).
- Add it to every other registered pack in the same commit. The parity test enforces this.
- Use the dotted key at the call site: t("section.subsection.key") or t("section.key", name=value) for placeholder interpolation.
Authoring rules:
- Keys live under semantic sections (home.*, upload.*, findings.*, tools.<id>.name). Don't nest by language or by tool unless the string is genuinely tool-specific.
- Use {named} placeholders (not positional {0}) so translators see what's being interpolated.
- Strings can contain Streamlit markdown (**bold**); pass through st.markdown / st.caption as usual.
- Do not put strings inside the farewell-overlay JS payload without going through _js_html_safe() in src/gui/components/_legacy.py; the helper escapes both the JS string terminator and HTML special chars. The test TestFarewellEscape pins that contract.
- The sidebar picker is mounted by hide_streamlit_chrome(), so every page that calls that helper automatically gets the picker. Pages that don't call it (rare) can call render_language_selector() directly.
Add a format-standardizer field type
- Add value to FieldType enum in core/format_standardize.py.
- Add per-cell standardize_<x>(value, *, …) returning (new_value, changed).
- Add option fields to StandardizeOptions (with defaults that preserve existing behavior).
- Wire into _apply_field_type() dispatcher (the else branch raises AssertionError, so every enum value needs a branch).
- Add validation entry in StandardizeOptions.from_dict() for any new enum-shaped option.
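The steps above can be sketched end to end with stand-in types. Everything here is illustrative: the ZIP field type, the zip_pad_width option and the function bodies are assumptions for the example, and the real FieldType, StandardizeOptions and _apply_field_type() carry more fields and branches.

```python
from dataclasses import dataclass
from enum import Enum


class FieldType(Enum):
    PHONE = "phone"
    ZIP = "zip"  # hypothetical new value


@dataclass
class StandardizeOptions:
    # Hypothetical option; the default keeps existing behavior unchanged.
    zip_pad_width: int = 5


def standardize_phone(value: str) -> tuple[str, bool]:
    digits = "".join(ch for ch in value if ch.isdigit())
    return digits, digits != value


def standardize_zip(value: str, *, pad_width: int = 5) -> tuple[str, bool]:
    """Left-pad all-digit ZIP codes; leave anything else untouched."""
    stripped = value.strip()
    new_value = stripped.zfill(pad_width) if stripped.isdigit() else value
    return new_value, new_value != value


def apply_field_type(
    value: str, field_type: FieldType, options: StandardizeOptions
) -> tuple[str, bool]:
    if field_type is FieldType.PHONE:
        return standardize_phone(value)
    elif field_type is FieldType.ZIP:
        return standardize_zip(value, pad_width=options.zip_pad_width)
    else:
        # Forgetting a branch for a new enum value fails loudly here.
        raise AssertionError(f"unhandled field type: {field_type!r}")


result = apply_field_type("2134", FieldType.ZIP, StandardizeOptions())
print(result)  # ('02134', True)
```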
Errors
Use core/errors.py instead of raw ValueError / OSError:
| Pattern | Use |
|---|---|
| Bad arg, wrong type, missing column | InputValidationError |
| Bad config / options file | ConfigError |
| File parses but isn't what we expected | FileFormatError |
| File I/O failure (perms, missing, disk full) | FileAccessError |
| Internal invariant broken (unreachable branch) | AssertionError |
Helpers:
- ensure_dataframe(value, function="my_func") at every public entry that takes a df.
- ensure_choice(value, name="mode", choices=[...]) at every entry that takes a literal.
- wrap_file_read(path, "operation", exc) / wrap_file_write(...) when wrapping OSError.
GUI / CLI handlers: use format_for_user(exc, context="...") to render.
All DataToolsError subclasses extend stdlib ValueError or OSError so existing handlers still catch them.
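One common way to get that backwards compatibility is multiple inheritance from the project base class and the matching stdlib exception. The sketch below assumes that approach; the exact base classes in core/errors.py are not shown in this guide, so treat the class bodies as illustrative.

```python
class DataToolsError(Exception):
    """Assumed root of the project hierarchy."""


class InputValidationError(DataToolsError, ValueError):
    """Bad argument, wrong type, or missing column."""


class FileAccessError(DataToolsError, OSError):
    """File I/O failure such as permissions or a missing file."""


def legacy_handler(path: str) -> str:
    # Pre-existing code that only knows about stdlib exceptions.
    try:
        raise FileAccessError(f"cannot read {path}")
    except OSError as exc:
        return f"caught: {exc}"


message = legacy_handler("data.csv")
print(message)  # caught: cannot read data.csv
```

New code can catch DataToolsError to handle every project error in one place, while older `except ValueError` / `except OSError` blocks keep working.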
Tests
# All
pytest -q
# By module
pytest tests/test_dedup.py
# Include slow / integration
pytest -m slow
# Single test
pytest tests/test_dedup.py::TestExactMatch::test_basic
Test layout:
tests/
├── conftest.py # fixtures
├── test_dedup.py · test_normalizers.py · test_io.py · test_config.py
├── test_analyze.py · test_normalize.py · test_text_clean.py
├── test_format_standardize.py
├── test_format_standardize_corpus.py # 199-row buyer corpus
├── test_audit_fixes.py · test_errors.py · test_fixes_unit.py
├── test_corpus.py · test_encodings_corpus.py · test_fixtures_sweep.py
└── test_cli.py · test_cli_*.py · test_e2e.py · test_install.py
Fixture corpora: test-cases/text-cleaner-corpus/ (21 files) · test-cases/encodings-corpus/ (31 files) · test-cases/format-cleaner-corpus/ (7 files + spec).
Known limitations
- Dedup is O(n²): no blocking. Works to ~50k rows. Future: partition by first letter / ZIP prefix.
- Single-threaded: could benefit from multiprocessing.
- Memory-bound: entire file loaded into pandas. Streaming reads exist but not integrated with dedup engine.
- No multi-sheet dedup: each Excel sheet processed independently.
- phonenumbers minimum length: international numbers without country codes fall back to digits-only.
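For the blocking idea mentioned in the first limitation, a minimal sketch might look like this. It is not implemented in the engine; candidate_pairs and zip_prefix are hypothetical names, and rows are plain dicts for brevity.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Iterator, Sequence


def candidate_pairs(
    rows: Sequence[dict], block_key: Callable[[dict], str]
) -> Iterator[tuple[int, int]]:
    """Yield index pairs only for rows that share a block key."""
    blocks: dict[str, list[int]] = defaultdict(list)
    for idx, row in enumerate(rows):
        blocks[block_key(row)].append(idx)
    for members in blocks.values():
        yield from combinations(members, 2)


def zip_prefix(row: dict) -> str:
    # First three characters of the ZIP; empty string if absent.
    return (row.get("zip") or "")[:3]


rows = [
    {"zip": "02134"},
    {"zip": "02139"},
    {"zip": "94110"},
    {"zip": "94103"},
    {"zip": "10001"},
]
pairs = list(candidate_pairs(rows, zip_prefix))
print(pairs)  # [(0, 1), (2, 3)]
```

Five rows would need 10 comparisons without blocking and need 2 here. The trade-off is that duplicates falling into different blocks (for example, a mistyped ZIP) are never compared.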