Refactors all 10 docs (README, USER-GUIDE, CLI-REFERENCE, REQUIREMENTS, TECHNICAL, DEVELOPER, BUSINESS, DECISIONS, RECOVERY, docs/README) from prose-heavy to bullet-heavy + table-heavy. Same information density, significantly less reading load.

Net: 2600 → 1652 lines (~37% reduction) WHILE adding the new content that landed since v1.6:
- Format Standardizer (3rd Ready tool)
- 199-row buyer corpus
- src/core/errors.py structured hierarchy + ensure_dataframe / ensure_choice / wrap_file_read|write / format_for_user helpers
- src/core/_constants.py shared USPS/state lookup tables
- Cross-tool audit fixes (NaN matching, removed_df schema, validation, enum-bounds checks, forward-compat config)
- Per-domain error_policy across format standardizers
- Inconsistent-date-format detector
- Excel header-row auto-detection + write_file delimiter param

Per-doc changes:
- README.md (175 → 71): 9-tool table at top, status column, 3 CLI entry points listed, dropped repeated marketing prose.
- docs/README.md (38 → 27): pure index: buyer-facing vs creator-only split + version footer.
- USER-GUIDE.md (208 → 118): tool table replaces script descriptions, troubleshooting compressed to bullets, gate explanation tightened.
- CLI-REFERENCE.md (451 → 235): collapsed flag tables, removed redundant intro text, kept full recipes section.
- REQUIREMENTS.md (146 → 129): 18 numbered sections (was 17), added §18 Error Handling, formatting tightened to single-line entries.
- TECHNICAL.md (570 → 350): collapsed §3 build pipeline tables, merged redundant §3.5-3.7 OS sections, added §7 (Error handling) + §11.3 (Format Standardizer spec) + §11.4-11.7 (analyzer / gate / Review page / repair_bytes promoted from §10.2.x sub-numbering).
- DEVELOPER.md (285 → 161): module map table replaces per-file prose, extension recipes condensed, new §Errors covers when to use each hierarchy class.
- BUSINESS.md (278 → 225): collapsed prose to tables (use cases, competitive landscape, costs, risks); honest-status updated.
- DECISIONS.md (269 → 189): scoring rubric + GUI matrix preserved, decision log compressed to single-line entries, added v1.6 entries (Format Standardizer Ready, errors module).
- RECOVERY.md (180 → 147): rebuild steps as numbered + tabular, external dependencies as one table, recovery priorities tightened.

No information removed; redundancy compressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6.8 KiB
Developer Guide
Architecture, data flow, extension points.
Architecture
CLI (src/cli*.py)        GUI (src/gui/app.py + pages/)
       │                          │
       └──────────┐    ┌──────────┘
                  ▼    ▼
            ┌────────────────┐
            │   src/core/    │
            └────────────────┘
Core/UI rule: business logic in core/ only. CLI + GUI translate user input → core call → display result.
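A minimal sketch of the Core/UI rule, using stand-in names rather than the project's real entry points: `deduplicate_values()` plays the role of a core function and `main()` plays the role of a CLI wrapper. Both names and the `dedup-sketch` program name are hypothetical. The wrapper only parses input, calls core, and formats the result.

```python
import argparse
from typing import List


def deduplicate_values(values: List[str]) -> List[str]:
    """Stand-in for a core/ function: all business logic lives here."""
    seen = set()
    kept = []
    for value in values:
        key = value.strip().casefold()
        if key not in seen:
            seen.add(key)
            kept.append(value)
    return kept


def main(argv: List[str]) -> str:
    """Stand-in for a CLI wrapper: parse input, call core, format output."""
    parser = argparse.ArgumentParser(prog="dedup-sketch")  # hypothetical name
    parser.add_argument("values", nargs="+")
    args = parser.parse_args(argv)
    kept = deduplicate_values(args.values)  # the only call into "core"
    return f"kept {len(kept)} of {len(args.values)}: {', '.join(kept)}"


output = main(["Ann", "ann ", "Bob"])
print(output)
```

A GUI page would call the same `deduplicate_values()` and render the list in a widget, so the matching rule is written once.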
Module map
| Module | Public surface |
|---|---|
| core.dedup | deduplicate(), MatchStrategy, ColumnMatchStrategy, Algorithm, SurvivorRule, DeduplicationResult, MatchResult, build_default_strategies() |
| core.normalizers | normalize_email/phone/name/address/string, NormalizerType, get_normalizer() |
| core.io | read_file(), write_file(), list_sheets(), detect_encoding/delimiter/header_row, repair_bytes() |
| core.config | DeduplicationConfig.from_file/to_file/to_strategies/to_survivor_rule |
| core.analyze | analyze(), Finding, findings_by_tool(), _NULL_LIKE |
| core.fixes | @register("fix_id") decorator, get_fix(), available_actions() |
| core.normalize | auto_fix(), apply_decisions(), NormalizationResult, is_normalized() |
| core.text_clean | clean_dataframe(), CleanOptions, CleanResult, smart_title_case() |
| core.format_standardize | standardize_dataframe(), StandardizeOptions, StandardizeResult, FieldType, per-cell standardize_*() |
| core.errors | DataToolsError hierarchy, ensure_dataframe(), ensure_choice(), wrap_file_read/write(), format_for_user() |
| core._constants | US_STATE_NAMES, US_STATE_CODES, USPS_EXPANSIONS, USPS_COMPRESSIONS |
Data flow — Deduplicator
read_file() # auto-detect encoding, delimiter, header
▼ DataFrame
build_default_strategies() # if no explicit strategies
▼ # strong keys (email, phone) → standalone OR
# weak keys (name, address) → AND with strong
_apply_normalizations() # add _norm_* shadow columns
▼
_find_match_groups() # O(n²) pair compare, OR strategies, union-find
▼
[review_callback()] # optional interactive review
▼
_select_survivor() # per group: first/last/most-complete/most-recent
▼
[_merge_group()] # optional: fill blanks from losers
▼
DeduplicationResult # deduplicated_df, removed_df, match_groups, log
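The grouping step (pairwise compare, OR across strategies, union-find) can be pictured with a stand-alone sketch. This is not the code in core/dedup.py: `find_match_groups()`, the dict-based record layout, and the two example strategies are assumptions made for illustration, and plain dicts stand in for DataFrame rows.

```python
from itertools import combinations
from typing import Callable, Dict, List

Record = Dict[str, str]
Strategy = Callable[[Record, Record], bool]


def find_match_groups(records: List[Record], strategies: List[Strategy]) -> List[List[int]]:
    """Group record indices that match under ANY strategy, transitively."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        # Path halving keeps lookups cheap.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a: int, b: int) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    # O(n^2): every pair of records is compared once.
    for i, j in combinations(range(len(records)), 2):
        if any(strategy(records[i], records[j]) for strategy in strategies):
            union(i, j)

    groups: Dict[int, List[int]] = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    # Singletons are not duplicates, so they are dropped.
    return [members for members in groups.values() if len(members) > 1]


def same_email(a: Record, b: Record) -> bool:
    return bool(a["email"]) and a["email"] == b["email"]


def same_phone(a: Record, b: Record) -> bool:
    return bool(a["phone"]) and a["phone"] == b["phone"]


rows = [
    {"email": "a@x.com", "phone": "111"},
    {"email": "a@x.com", "phone": "222"},
    {"email": "", "phone": "222"},
    {"email": "z@x.com", "phone": "999"},
]
match_groups = find_match_groups(rows, [same_email, same_phone])
print(match_groups)
```

Rows 0 and 1 share an email, rows 1 and 2 share a phone, so union-find puts all three in one group even though rows 0 and 2 never match directly.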
Extension recipes
Add a normalizer
- Add function to core/normalizers.py:

      import re
      from typing import Optional

      def normalize_company(value: Optional[str]) -> str:
          if not value or not isinstance(value, str):
              return ""
          name = value.strip().casefold()
          # Strip one trailing legal suffix per pass, with optional period.
          for sfx in ("inc", "llc", "corp", "ltd", "co"):
              name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
          return name

- Register: add COMPANY = "company" to NormalizerType + entry in _NORMALIZER_MAP.
- Auto-detect (optional): add a _COLUMN_TYPE_PATTERNS row in core/dedup.py.
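A runnable sketch of the function and registration steps together. `normalize_company()` is repeated from the recipe so the block runs on its own; the shapes of `NormalizerType`, `_NORMALIZER_MAP`, and `get_normalizer()` are assumptions here, since only their names appear in the module map.

```python
import re
from enum import Enum
from typing import Callable, Dict, Optional


def normalize_company(value: Optional[str]) -> str:
    if not value or not isinstance(value, str):
        return ""
    name = value.strip().casefold()
    for sfx in ("inc", "llc", "corp", "ltd", "co"):
        name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
    return name


class NormalizerType(Enum):  # assumed shape; the real enum has more members
    COMPANY = "company"


# Assumed shape: enum member -> normalizer function.
_NORMALIZER_MAP: Dict[NormalizerType, Callable[[Optional[str]], str]] = {
    NormalizerType.COMPANY: normalize_company,
}


def get_normalizer(kind: NormalizerType) -> Callable[[Optional[str]], str]:
    return _NORMALIZER_MAP[kind]


normalize = get_normalizer(NormalizerType.COMPANY)
results = [normalize(v) for v in ("Acme Inc.", "  ACME LLC ", "Acme", None)]
print(results)
```

All three spellings collapse to the same key, and non-string input collapses to an empty string.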
Add a fuzzy algorithm
- Add value to Algorithm enum in core/dedup.py.
- Add case in _compute_similarity().
- Document the value in CLI help text.
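The enum-plus-case pattern might look like the sketch below. The member names (`EXACT`, `RATIO`, `TOKEN_SET`) and the public `compute_similarity()` signature are hypothetical; the real `Algorithm` values and `_compute_similarity()` may differ. `difflib.SequenceMatcher` from the standard library stands in for whatever fuzzy library the project uses.

```python
from difflib import SequenceMatcher
from enum import Enum


class Algorithm(Enum):  # hypothetical members
    EXACT = "exact"
    RATIO = "ratio"
    TOKEN_SET = "token_set"  # the newly added value


def compute_similarity(a: str, b: str, algorithm: Algorithm) -> float:
    """Return a similarity score in [0.0, 1.0]."""
    if algorithm is Algorithm.EXACT:
        return 1.0 if a == b else 0.0
    if algorithm is Algorithm.RATIO:
        return SequenceMatcher(None, a, b).ratio()
    if algorithm is Algorithm.TOKEN_SET:
        # Jaccard overlap of whitespace-separated tokens; ignores word order.
        tokens_a, tokens_b = set(a.split()), set(b.split())
        if not tokens_a and not tokens_b:
            return 1.0
        return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
    # A new enum value without a case fails loudly instead of scoring 0.
    raise AssertionError(f"unhandled algorithm: {algorithm!r}")


score = compute_similarity("john smith", "smith john", Algorithm.TOKEN_SET)
print(score)
```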
Add a survivor rule
- Add value to SurvivorRule enum.
- Add branch in _select_survivor().
- Add CLI mapping.
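A sketch of what a survivor-rule branch could look like, assuming rows are plain dicts and the rule returns an index into the match group. The enum members and the `select_survivor()` signature are illustrative, not the actual `_select_survivor()` implementation.

```python
from enum import Enum
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


class SurvivorRule(Enum):  # hypothetical members
    FIRST = "first"
    LAST = "last"
    MOST_COMPLETE = "most-complete"


def filled_count(row: Row) -> int:
    """Number of cells that are neither None nor empty."""
    return sum(1 for value in row.values() if value not in (None, ""))


def select_survivor(group: List[Row], rule: SurvivorRule) -> int:
    """Return the position within the group of the row to keep."""
    if rule is SurvivorRule.FIRST:
        return 0
    if rule is SurvivorRule.LAST:
        return len(group) - 1
    if rule is SurvivorRule.MOST_COMPLETE:
        # max() returns the first maximal item, so ties keep the earliest row.
        return max(range(len(group)), key=lambda i: filled_count(group[i]))
    raise AssertionError(f"unhandled survivor rule: {rule!r}")


group = [
    {"name": "Ann", "email": "", "phone": None},
    {"name": "Ann", "email": "a@x.com", "phone": "111"},
    {"name": "Ann", "email": "a@x.com", "phone": ""},
]
keep = select_survivor(group, SurvivorRule.MOST_COMPLETE)
print(keep)
```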
Add a fix + detector (analyzer/gate)
- Detector in core/analyze.py: add _detect_<thing>(df) -> list[Finding], hook into the main analyze() pipeline. Emit Finding with a unique fix_action id.
- Fix in core/fixes.py:

      @register("fix_id")
      def my_fix(df, payload=None) -> tuple[pd.DataFrame, int]:
          # ...
          return out_df, cells_changed

- Constant in core/analyze.py: add FIX_<NAME> = "fix_id" so the detector and fix can reference it.
No other call sites change. Gate auto-discovers it via the registry.
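Auto-discovery through a registry can be sketched as follows. This is an illustration of the decorator pattern only: the internals of `register()`, `get_fix()`, and `available_actions()` are assumed, `FIX_STRIP_WHITESPACE` is a made-up id, and a list of dicts stands in for the pandas DataFrame the real fixes take.

```python
from typing import Callable, Dict, List, Optional, Tuple

Table = List[Dict[str, str]]
FixFn = Callable[..., Tuple[Table, int]]

_REGISTRY: Dict[str, FixFn] = {}


def register(fix_id: str) -> Callable[[FixFn], FixFn]:
    """Decorator: store the function under fix_id at import time."""
    def decorator(fn: FixFn) -> FixFn:
        if fix_id in _REGISTRY:
            raise ValueError(f"duplicate fix id: {fix_id!r}")
        _REGISTRY[fix_id] = fn
        return fn
    return decorator


def get_fix(fix_id: str) -> FixFn:
    return _REGISTRY[fix_id]


def available_actions() -> List[str]:
    return sorted(_REGISTRY)


FIX_STRIP_WHITESPACE = "strip_whitespace"  # hypothetical id shared by detector and fix


@register(FIX_STRIP_WHITESPACE)
def strip_whitespace(table: Table, payload: Optional[dict] = None) -> Tuple[Table, int]:
    out, cells_changed = [], 0
    for row in table:
        new_row = {key: value.strip() for key, value in row.items()}
        cells_changed += sum(1 for key in row if row[key] != new_row[key])
        out.append(new_row)
    return out, cells_changed


fixed, cells_changed = get_fix(FIX_STRIP_WHITESPACE)([{"name": " Ann ", "city": "Oslo"}])
print(available_actions(), fixed, cells_changed)
```

Because callers look fixes up by id, adding a new decorated function is enough for it to appear in `available_actions()`.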
Add a format-standardizer field type
- Add value to FieldType enum in core/format_standardize.py.
- Add per-cell standardize_<x>(value, *, …) returning (new_value, changed).
- Add option fields to StandardizeOptions (with defaults that preserve existing behavior).
- Wire into _apply_field_type() dispatcher (the else branch raises AssertionError, so every enum value needs a branch).
- Add validation entry in StandardizeOptions.from_dict() for any new enum-shaped option.
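The steps above might fit together as in this sketch. Everything specific is hypothetical: the `ZIP` field type, the `zip_keep_plus4` option, and the bodies of the per-cell functions are invented to show the shape of the pattern (enum value, per-cell function returning `(new_value, changed)`, option with a default, dispatcher branch).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class FieldType(Enum):  # hypothetical members
    PHONE = "phone"
    ZIP = "zip"  # the newly added value


@dataclass
class StandardizeOptions:
    zip_keep_plus4: bool = True  # hypothetical option with a default


def standardize_phone(value: str) -> Tuple[str, bool]:
    digits = "".join(ch for ch in value if ch.isdigit())
    return digits, digits != value


def standardize_zip(value: str, *, keep_plus4: bool = True) -> Tuple[str, bool]:
    digits = "".join(ch for ch in value if ch.isdigit())
    if len(digits) == 9:
        new_value = f"{digits[:5]}-{digits[5:]}" if keep_plus4 else digits[:5]
    elif len(digits) == 5:
        new_value = digits
    else:
        new_value = value  # not recognizable as a ZIP: leave untouched
    return new_value, new_value != value


def apply_field_type(value: str, field_type: FieldType, options: StandardizeOptions) -> Tuple[str, bool]:
    if field_type is FieldType.PHONE:
        return standardize_phone(value)
    elif field_type is FieldType.ZIP:
        return standardize_zip(value, keep_plus4=options.zip_keep_plus4)
    else:
        # Unreachable unless an enum value was added without a branch.
        raise AssertionError(f"unhandled field type: {field_type!r}")


result = apply_field_type("12345 6789", FieldType.ZIP, StandardizeOptions())
print(result)
```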
Errors
Use core/errors.py instead of raw ValueError / OSError:
| Pattern | Use |
|---|---|
| Bad arg, wrong type, missing column | InputValidationError |
| Bad config / options file | ConfigError |
| File parses but isn't what we expected | FileFormatError |
| File I/O failure (perms, missing, disk full) | FileAccessError |
| Internal invariant broken (unreachable branch) | AssertionError |
Helpers:
- ensure_dataframe(value, function="my_func") at every public entry that takes a df.
- ensure_choice(value, name="mode", choices=[...]) at every entry that takes a literal.
- wrap_file_read(path, "operation", exc) / wrap_file_write(...) when wrapping OSError.
GUI / CLI handlers: use format_for_user(exc, context="...") to render.
All DataToolsError subclasses extend stdlib ValueError or OSError so existing handlers still catch them.
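A sketch of how such a hierarchy keeps legacy handlers working. The class names come from the table above, but the base-class layout, the message formats, and the assumption that `wrap_file_read()` returns an exception for the caller to raise are guesses, not the contents of core/errors.py.

```python
from typing import Sequence


class DataToolsError(Exception):
    """Assumed common base for the project's errors."""


class InputValidationError(DataToolsError, ValueError):
    pass


class FileAccessError(DataToolsError, OSError):
    pass


def ensure_choice(value: str, *, name: str, choices: Sequence[str]) -> str:
    if value not in choices:
        raise InputValidationError(
            f"{name} must be one of {list(choices)!r}, got {value!r}"
        )
    return value


def wrap_file_read(path: str, operation: str, exc: OSError) -> FileAccessError:
    # Assumption: returns the wrapped error; the caller raises it.
    return FileAccessError(f"could not {operation} {path!r}: {exc}")


def format_for_user(exc: Exception, *, context: str) -> str:
    return f"{context}: {exc}"


# A handler written for plain ValueError still catches the new class.
try:
    ensure_choice("fuzzy", name="mode", choices=["exact", "strict"])
except ValueError as exc:
    message = format_for_user(exc, context="dedup")

# Likewise for OSError handlers around file access.
try:
    try:
        with open("missing-dir/input.csv") as handle:
            handle.read()
    except OSError as exc:
        raise wrap_file_read("missing-dir/input.csv", "read", exc) from exc
except OSError as exc:
    caught_type = type(exc).__name__

print(message)
print(caught_type)
```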
Tests
# All
pytest -q
# By module
pytest tests/test_dedup.py
# Include slow / integration
pytest -m slow
# Single test
pytest tests/test_dedup.py::TestExactMatch::test_basic
Test layout:
tests/
├── conftest.py # fixtures
├── test_dedup.py · test_normalizers.py · test_io.py · test_config.py
├── test_analyze.py · test_normalize.py · test_text_clean.py
├── test_format_standardize.py
├── test_format_standardize_corpus.py # 199-row buyer corpus
├── test_audit_fixes.py · test_errors.py · test_fixes_unit.py
├── test_corpus.py · test_encodings_corpus.py · test_fixtures_sweep.py
└── test_cli.py · test_cli_*.py · test_e2e.py · test_install.py
Fixture corpora: test-cases/text-cleaner-corpus/ (21 files) · test-cases/encodings-corpus/ (31 files) · test-cases/format-cleaner-corpus/ (7 files + spec).
Known limitations
- Dedup is O(n²): no blocking. Works to ~50k rows. Future: partition by first letter / ZIP prefix.
- Single-threaded: could benefit from multiprocessing.
- Memory-bound: entire file loaded into pandas. Streaming reads exist but are not integrated with the dedup engine.
- No multi-sheet dedup: each Excel sheet is processed independently.
- phonenumbers minimum length: international numbers without country codes fall back to digits-only.
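The blocking idea listed as future work for the O(n²) limitation could be sketched like this. Nothing here exists in the codebase: `candidate_pairs()` and `zip_prefix()` are hypothetical names, and the three-character prefix is an arbitrary choice for the example.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Dict, Iterator, List, Tuple

Record = Dict[str, str]


def candidate_pairs(
    records: List[Record], block_key: Callable[[Record], str]
) -> Iterator[Tuple[int, int]]:
    """Yield index pairs only for records that share a block key."""
    blocks: Dict[str, List[int]] = defaultdict(list)
    for index, record in enumerate(records):
        blocks[block_key(record)].append(index)
    for members in blocks.values():
        yield from combinations(members, 2)


def zip_prefix(record: Record) -> str:
    return record.get("zip", "")[:3]


records = [
    {"zip": "10001"}, {"zip": "10002"}, {"zip": "10003"},
    {"zip": "94105"}, {"zip": "94107"}, {"zip": "60601"},
]
pairs = list(candidate_pairs(records, zip_prefix))
all_pairs = list(combinations(range(len(records)), 2))
print(len(pairs), "of", len(all_pairs), "pairs compared")
```

The trade-off is recall: records that land in different blocks are never compared, so a typo in the blocking column hides a true duplicate.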