# Developer Guide

Architecture, data flow, extension points.

## Architecture

```
CLI (src/cli*.py)        GUI (src/gui/app.py + pages/)
        │                          │
        └──────────┐    ┌──────────┘
                   ▼    ▼
             ┌────────────────┐
             │   src/core/    │
             └────────────────┘
```

**Core/UI rule**: business logic in `core/` only. CLI + GUI translate user input → core call → display result.

## Module map

| Module | Public surface |
|--------|----------------|
| `core.dedup` | `deduplicate()`, `MatchStrategy`, `ColumnMatchStrategy`, `Algorithm`, `SurvivorRule`, `DeduplicationResult`, `MatchResult`, `build_default_strategies()` |
| `core.normalizers` | `normalize_email/phone/name/address/string`, `NormalizerType`, `get_normalizer()` |
| `core.io` | `read_file()`, `write_file()`, `list_sheets()`, `detect_encoding/delimiter/header_row`, `repair_bytes()` |
| `core.config` | `DeduplicationConfig.from_file/to_file/to_strategies/to_survivor_rule` |
| `core.analyze` | `analyze()`, `Finding`, `findings_by_tool()`, `_NULL_LIKE` |
| `core.fixes` | `@register("fix_id")` decorator, `get_fix()`, `available_actions()` |
| `core.normalize` | `auto_fix()`, `apply_decisions()`, `NormalizationResult`, `is_normalized()` |
| `core.text_clean` | `clean_dataframe()`, `CleanOptions`, `CleanResult`, `smart_title_case()` |
| `core.format_standardize` | `standardize_dataframe()`, `StandardizeOptions`, `StandardizeResult`, `FieldType`, per-cell `standardize_*()` |
| `core.errors` | `DataToolsError` hierarchy, `ensure_dataframe()`, `ensure_choice()`, `wrap_file_read/write()`, `format_for_user()` |
| `core._constants` | `US_STATE_NAMES`, `US_STATE_CODES`, `USPS_EXPANSIONS`, `USPS_COMPRESSIONS` |

## Data flow: Deduplicator

```
read_file()                   # auto-detect encoding, delimiter, header
    ▼ DataFrame
build_default_strategies()    # if no explicit strategies
    ▼                         # strong keys (email, phone) → standalone OR
                              # weak keys (name, address) → AND with strong
_apply_normalizations()       # add _norm_* shadow columns
    ▼
_find_match_groups()          # O(n²) pair compare, OR strategies, union-find
    ▼
[review_callback()]           # optional interactive review
    ▼
_select_survivor()            # per group: first/last/most-complete/most-recent
    ▼
[_merge_group()]              # optional: fill blanks from losers
    ▼
DeduplicationResult           # deduplicated_df, removed_df, match_groups, log
```

## Extension recipes

### Add a normalizer

1. Add function to `core/normalizers.py`:

   ```python
   import re  # skip these imports if the module already has them
   from typing import Optional

   def normalize_company(value: Optional[str]) -> str:
       if not value or not isinstance(value, str):
           return ""
       name = value.strip().casefold()
       # Drop a trailing legal suffix, e.g. "acme inc." -> "acme"
       for sfx in ("inc", "llc", "corp", "ltd", "co"):
           name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
       return name
   ```

2. Register: add `COMPANY = "company"` to `NormalizerType` + entry in `_NORMALIZER_MAP`.
3. Auto-detect (optional): add a `_COLUMN_TYPE_PATTERNS` row in `core/dedup.py`.

### Add a fuzzy algorithm

1. Add value to `Algorithm` enum in `core/dedup.py`.
2. Add case in `_compute_similarity()`.
3. Document the value in CLI help text.

### Add a survivor rule

1. Add value to `SurvivorRule` enum.
2. Add branch in `_select_survivor()`.
3. Add CLI mapping.

### Add a fix + detector (analyzer/gate)

1. **Detector** in `core/analyze.py`: add `_detect_(df) -> list[Finding]`, hook into the main `analyze()` pipeline. Emit a `Finding` with a unique `fix_action` id.
2. **Fix** in `core/fixes.py`:

   ```python
   @register("fix_id")
   def my_fix(df, payload=None) -> tuple[pd.DataFrame, int]:
       out_df = df.copy()   # work on a copy
       cells_changed = 0    # count of modified cells
       # ... apply the fix to out_df and update cells_changed
       return out_df, cells_changed
   ```

3. **Constant** in `core/analyze.py`: add `FIX_ = "fix_id"` so the detector and fix can reference it.

No other call sites change. Gate auto-discovers it via the registry.

### Add a format-standardizer field type

1. Add value to `FieldType` enum in `core/format_standardize.py`.
2. Add per-cell `standardize_(value, *, …)` returning `(new_value, changed)`.
3. Add option fields to `StandardizeOptions` (with defaults that preserve existing behavior).
4. Wire into `_apply_field_type()` dispatcher (the `else` branch raises `AssertionError`, so every enum value needs a branch).
5. Add validation entry in `StandardizeOptions.from_dict()` for any new enum-shaped option.

## Errors

Use `core/errors.py` instead of raw `ValueError` / `OSError`:

| Pattern | Use |
|---------|-----|
| Bad arg, wrong type, missing column | `InputValidationError` |
| Bad config / options file | `ConfigError` |
| File parses but isn't what we expected | `FileFormatError` |
| File I/O failure (perms, missing, disk full) | `FileAccessError` |
| Internal invariant broken (unreachable branch) | `AssertionError` |

Helpers:

- `ensure_dataframe(value, function="my_func")` at every public entry that takes a df.
- `ensure_choice(value, name="mode", choices=[...])` at every entry that takes a literal.
- `wrap_file_read(path, "operation", exc)` / `wrap_file_write(...)` when wrapping `OSError`.

GUI / CLI handlers: use `format_for_user(exc, context="...")` to render.

All `DataToolsError` subclasses extend stdlib `ValueError` or `OSError` so existing handlers still catch them.

## Tests

```bash
# All
pytest -q

# By module
pytest tests/test_dedup.py

# Slow / integration (selects tests marked `slow`)
pytest -m slow

# Single test
pytest tests/test_dedup.py::TestExactMatch::test_basic
```

Test layout:

```
tests/
├── conftest.py                          # fixtures
├── test_dedup.py · test_normalizers.py · test_io.py · test_config.py
├── test_analyze.py · test_normalize.py · test_text_clean.py
├── test_format_standardize.py
├── test_format_standardize_corpus.py    # 199-row buyer corpus
├── test_audit_fixes.py · test_errors.py · test_fixes_unit.py
├── test_corpus.py · test_encodings_corpus.py · test_fixtures_sweep.py
└── test_cli.py · test_cli_*.py · test_e2e.py · test_install.py
```

Fixture corpora: `test-cases/text-cleaner-corpus/` (21 files) · `test-cases/encodings-corpus/` (31 files) · `test-cases/format-cleaner-corpus/` (7 files + spec).

## Known limitations

- **Dedup is O(n²)**: no blocking. Works to ~50k rows. Future: partition by first letter / ZIP prefix.
- **Single-threaded**: could benefit from `multiprocessing`.
- **Memory-bound**: the entire file is loaded into pandas. Streaming reads exist but are not integrated with the dedup engine.
- **No multi-sheet dedup**: each Excel sheet is processed independently.
- **Phonenumbers minimum-length**: international numbers without country codes fall back to digits-only.
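
The partitioning idea noted under the O(n²) limitation can be illustrated with a small sketch. It is not the project's engine: `block_key`, `find_groups_blocked`, and the `is_match` callback are hypothetical names, and records are plain dicts rather than DataFrame rows. It shows only the general technique of comparing pairs within a block and merging matches with union-find.

```python
from collections import defaultdict
from itertools import combinations


def block_key(record: dict) -> str:
    """Hypothetical blocking key: first letter of the name, lowercased."""
    name = (record.get("name") or "").strip().lower()
    return name[:1]


def find_groups_blocked(records: list[dict], is_match) -> list[list[int]]:
    """Return groups of matching record indices, comparing only within blocks."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Partition record indices by blocking key.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)

    # Pairwise comparison is quadratic per block, not over the whole input.
    for members in blocks.values():
        for a, b in combinations(members, 2):
            if is_match(records[a], records[b]):
                ra, rb = find(a), find(b)
                if ra != rb:
                    parent[max(ra, rb)] = min(ra, rb)

    # Collect indices by root; keep only groups with duplicates.
    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return [g for g in groups.values() if len(g) > 1]


rows = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "ann lee", "email": "ANN@example.com"},
    {"name": "Bob Ray", "email": "bob@example.com"},
]
same_email = lambda a, b: a["email"].lower() == b["email"].lower()
groups = find_groups_blocked(rows, same_email)
print(groups)  # [[0, 1]]
```

The trade-off is recall: two records that would match but land in different blocks (for example, a name entered under a nickname with a different first letter) are never compared, so the choice of blocking key matters.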