Files
datatools-dev/docs/DEVELOPER.md
Michael abb720997e docs: tight, scannable rewrite — every item earns its place
Refactors all 10 docs (README, USER-GUIDE, CLI-REFERENCE, REQUIREMENTS,
TECHNICAL, DEVELOPER, BUSINESS, DECISIONS, RECOVERY, docs/README) from
prose-heavy to bullet-heavy + table-heavy. Same information density,
significantly less reading load.

Net: 2600 → 1652 lines (~37% reduction) WHILE adding the new content
that landed since v1.6:

- Format Standardizer (3rd Ready tool)
- 199-row buyer corpus
- src/core/errors.py structured hierarchy + ensure_dataframe /
  ensure_choice / wrap_file_read|write / format_for_user helpers
- src/core/_constants.py shared USPS/state lookup tables
- Cross-tool audit fixes (NaN matching, removed_df schema, validation,
  enum-bounds checks, forward-compat config)
- Per-domain error_policy across format standardizers
- Inconsistent-date-format detector
- Excel header-row auto-detection + write_file delimiter param

Per-doc changes:

- README.md (175 → 71): 9-tool table at top, status column, 3 CLI
  entry points listed, dropped repeated marketing prose.
- docs/README.md (38 → 27): pure index — buyer-facing vs creator-only
  split + version footer.
- USER-GUIDE.md (208 → 118): tool table replaces script descriptions,
  troubleshooting compressed to bullets, gate explanation tightened.
- CLI-REFERENCE.md (451 → 235): collapsed flag tables, removed
  redundant intro text, kept full recipes section.
- REQUIREMENTS.md (146 → 129): 18 numbered sections (was 17), added
  §18 Error Handling, formatting tightened to single-line entries.
- TECHNICAL.md (570 → 350): collapsed §3 build pipeline tables, merged
  redundant §3.5-3.7 OS sections, added §7 (Error handling) +
  §11.3 (Format Standardizer spec) + §11.4-11.7 (analyzer / gate /
  Review page / repair_bytes promoted from §10.2.x sub-numbering).
- DEVELOPER.md (285 → 161): module map table replaces per-file prose,
  extension recipes condensed, new §Errors covers when to use each
  hierarchy class.
- BUSINESS.md (278 → 225): collapsed prose to tables (use cases,
  competitive landscape, costs, risks); honest-status updated.
- DECISIONS.md (269 → 189): scoring rubric + GUI matrix preserved,
  decision log compressed to single-line entries, added v1.6 entries
  (Format Standardizer Ready, errors module).
- RECOVERY.md (180 → 147): rebuild steps as numbered + tabular,
  external dependencies as one table, recovery priorities tightened.

No information removed; redundancy compressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:49:29 +00:00

6.8 KiB

Developer Guide

Architecture, data flow, extension points.

Architecture

CLI (src/cli*.py)         GUI (src/gui/app.py + pages/)
     │                          │
     └──────────┐    ┌──────────┘
                ▼    ▼
            ┌────────────────┐
            │   src/core/    │
            └────────────────┘

Core/UI rule: business logic in core/ only. CLI + GUI translate user input → core call → display result.

Module map

Module Public surface
core.dedup deduplicate(), MatchStrategy, ColumnMatchStrategy, Algorithm, SurvivorRule, DeduplicationResult, MatchResult, build_default_strategies()
core.normalizers normalize_email/phone/name/address/string, NormalizerType, get_normalizer()
core.io read_file(), write_file(), list_sheets(), detect_encoding/delimiter/header_row, repair_bytes()
core.config DeduplicationConfig.from_file/to_file/to_strategies/to_survivor_rule
core.analyze analyze(), Finding, findings_by_tool(), _NULL_LIKE
core.fixes @register("fix_id") decorator, get_fix(), available_actions()
core.normalize auto_fix(), apply_decisions(), NormalizationResult, is_normalized()
core.text_clean clean_dataframe(), CleanOptions, CleanResult, smart_title_case()
core.format_standardize standardize_dataframe(), StandardizeOptions, StandardizeResult, FieldType, per-cell standardize_*()
core.errors DataToolsError hierarchy, ensure_dataframe(), ensure_choice(), wrap_file_read/write(), format_for_user()
core._constants US_STATE_NAMES, US_STATE_CODES, USPS_EXPANSIONS, USPS_COMPRESSIONS

Data flow — Deduplicator

read_file()                       # auto-detect encoding, delimiter, header
   ▼ DataFrame
build_default_strategies()        # if no explicit strategies
   ▼                              # strong keys (email, phone) → standalone OR
                                  # weak keys (name, address) → AND with strong
_apply_normalizations()           # add _norm_* shadow columns
   ▼
_find_match_groups()              # O(n²) pair compare, OR strategies, union-find
   ▼
[review_callback()]               # optional interactive review
   ▼
_select_survivor()                # per group: first/last/most-complete/most-recent
   ▼
[_merge_group()]                  # optional: fill blanks from losers
   ▼
DeduplicationResult               # deduplicated_df, removed_df, match_groups, log

Extension recipes

Add a normalizer

  1. Add function to core/normalizers.py:
    def normalize_company(value: Optional[str]) -> str:
        if not value or not isinstance(value, str): return ""
        name = value.strip().casefold()
        for sfx in ("inc", "llc", "corp", "ltd", "co"):
            name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
        return name
    
  2. Register: add COMPANY = "company" to NormalizerType + entry in _NORMALIZER_MAP.
  3. Auto-detect (optional): add a _COLUMN_TYPE_PATTERNS row in core/dedup.py.

Add a fuzzy algorithm

  1. Add value to Algorithm enum in core/dedup.py.
  2. Add case in _compute_similarity().
  3. Document the value in CLI help text.

Add a survivor rule

  1. Add value to SurvivorRule enum.
  2. Add branch in _select_survivor().
  3. Add CLI mapping.

Add a fix + detector (analyzer/gate)

  1. Detector in core/analyze.py: add _detect_<thing>(df) -> list[Finding], hook into the main analyze() pipeline. Emit Finding with a unique fix_action id.
  2. Fix in core/fixes.py:
    @register("fix_id")
    def my_fix(df, payload=None) -> tuple[pd.DataFrame, int]:
        # ...
        return out_df, cells_changed
    
  3. Constant in core/analyze.py: add FIX_<NAME> = "fix_id" so the detector and fix can reference it.

No other call sites change. Gate auto-discovers it via the registry.

Add a format-standardizer field type

  1. Add value to FieldType enum in core/format_standardize.py.
  2. Add per-cell standardize_<x>(value, *, …) returning (new_value, changed).
  3. Add option fields to StandardizeOptions (with defaults that preserve existing behavior).
  4. Wire into _apply_field_type() dispatcher (the else branch raises AssertionError — every enum value needs a branch).
  5. Add validation entry in StandardizeOptions.from_dict() for any new enum-shaped option.

Errors

Use core/errors.py instead of raw ValueError / OSError:

Pattern Use
Bad arg, wrong type, missing column InputValidationError
Bad config / options file ConfigError
File parses but isn't what we expected FileFormatError
File I/O failure (perms, missing, disk full) FileAccessError
Internal invariant broken (unreachable branch) AssertionError

Helpers:

  • ensure_dataframe(value, function="my_func") at every public entry that takes a df.
  • ensure_choice(value, name="mode", choices=[...]) at every entry that takes a literal.
  • wrap_file_read(path, "operation", exc) / wrap_file_write(...) when wrapping OSError.

GUI / CLI handlers: use format_for_user(exc, context="...") to render.

All DataToolsError subclasses extend stdlib ValueError or OSError so existing handlers still catch them.

Tests

# All
pytest -q
# By module
pytest tests/test_dedup.py
# Include slow / integration
pytest -m slow
# Single test
pytest tests/test_dedup.py::TestExactMatch::test_basic

Test layout:

tests/
├── conftest.py                        # fixtures
├── test_dedup.py · test_normalizers.py · test_io.py · test_config.py
├── test_analyze.py · test_normalize.py · test_text_clean.py
├── test_format_standardize.py
├── test_format_standardize_corpus.py  # 199-row buyer corpus
├── test_audit_fixes.py · test_errors.py · test_fixes_unit.py
├── test_corpus.py · test_encodings_corpus.py · test_fixtures_sweep.py
└── test_cli.py · test_cli_*.py · test_e2e.py · test_install.py

Fixture corpora: test-cases/text-cleaner-corpus/ (21 files) · test-cases/encodings-corpus/ (31 files) · test-cases/format-cleaner-corpus/ (7 files + spec).

Known limitations

  • Dedup is O(n²) — no blocking. Works to ~50k rows. Future: partition by first letter / ZIP prefix.
  • Single-threaded — could benefit from multiprocessing.
  • Memory-bound — entire file loaded into pandas. Streaming reads exist but not integrated with dedup engine.
  • No multi-sheet dedup — each Excel sheet processed independently.
  • Phonenumbers minimum-length — international numbers without country codes fall back to digits-only.