datatools-dev/docs/DEVELOPER.md

# Developer Guide

Architecture, data flow, extension points.

## Architecture

```
CLI (src/cli*.py)         GUI (src/gui/app.py + pages/)
     │                          │
     └──────────┐    ┌──────────┘
                ▼    ▼
            ┌────────────────┐
            │   src/core/    │
            └────────────────┘
```

**Core/UI rule**: business logic in `core/` only. CLI + GUI translate user input → core call → display result.

## Module map

| Module | Public surface |
|--------|----------------|
| `core.dedup` | `deduplicate()`, `MatchStrategy`, `ColumnMatchStrategy`, `Algorithm`, `SurvivorRule`, `DeduplicationResult`, `MatchResult`, `build_default_strategies()` |
| `core.normalizers` | `normalize_email/phone/name/address/string`, `NormalizerType`, `get_normalizer()` |
| `core.io` | `read_file()`, `write_file()`, `list_sheets()`, `detect_encoding/delimiter/header_row`, `repair_bytes()` |
| `core.config` | `DeduplicationConfig.from_file/to_file/to_strategies/to_survivor_rule` |
| `core.analyze` | `analyze()`, `Finding`, `findings_by_tool()`, `_NULL_LIKE` |
| `core.fixes` | `@register("fix_id")` decorator, `get_fix()`, `available_actions()` |
| `core.normalize` | `auto_fix()`, `apply_decisions()`, `NormalizationResult`, `is_normalized()` |
| `core.text_clean` | `clean_dataframe()`, `CleanOptions`, `CleanResult`, `smart_title_case()` |
| `core.format_standardize` | `standardize_dataframe()`, `StandardizeOptions`, `StandardizeResult`, `FieldType`, per-cell `standardize_*()` |
| `core.errors` | `DataToolsError` hierarchy, `ensure_dataframe()`, `ensure_choice()`, `wrap_file_read/write()`, `format_for_user()` |
| `core._constants` | `US_STATE_NAMES`, `US_STATE_CODES`, `USPS_EXPANSIONS`, `USPS_COMPRESSIONS` |

## Data flow — Deduplicator

```
read_file()                       # auto-detect encoding, delimiter, header
   ▼ DataFrame
build_default_strategies()        # if no explicit strategies
   ▼                              # strong keys (email, phone) → standalone OR
                                  # weak keys (name, address) → AND with strong
_apply_normalizations()           # add _norm_* shadow columns
   ▼
_find_match_groups()              # O(n²) pair compare, OR strategies, union-find
   ▼
[review_callback()]               # optional interactive review
   ▼
_select_survivor()                # per group: first/last/most-complete/most-recent
   ▼
[_merge_group()]                  # optional: fill blanks from losers
   ▼
DeduplicationResult               # deduplicated_df, removed_df, match_groups, log
```

## Extension recipes

### Add a normalizer

1. Add function to `core/normalizers.py`:
   ```python
   def normalize_company(value: Optional[str]) -> str:
       if not value or not isinstance(value, str): return ""
       name = value.strip().casefold()
       for sfx in ("inc", "llc", "corp", "ltd", "co"):
           name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
       return name
   ```
2. Register: add `COMPANY = "company"` to `NormalizerType` + entry in `_NORMALIZER_MAP`.
3. Auto-detect (optional): add a `_COLUMN_TYPE_PATTERNS` row in `core/dedup.py`.

### Add a fuzzy algorithm

1. Add value to `Algorithm` enum in `core/dedup.py`.
2. Add case in `_compute_similarity()`.
3. Document the value in CLI help text.

### Add a survivor rule

1. Add value to `SurvivorRule` enum.
2. Add branch in `_select_survivor()`.
3. Add CLI mapping.

### Add a fix + detector (analyzer/gate)

1. **Detector** in `core/analyze.py`: add `_detect_<thing>(df) -> list[Finding]`, hook into the main `analyze()` pipeline. Emit Finding with a unique `fix_action` id.
2. **Fix** in `core/fixes.py`:
   ```python
   @register("fix_id")
   def my_fix(df, payload=None) -> tuple[pd.DataFrame, int]:
       # ...
       return out_df, cells_changed
   ```
3. **Constant** in `core/analyze.py`: add `FIX_<NAME> = "fix_id"` so the detector and fix can reference it.

No other call sites change. Gate auto-discovers it via the registry.

### Add a format-standardizer field type

1. Add value to `FieldType` enum in `core/format_standardize.py`.
2. Add per-cell `standardize_<x>(value, *, …)` returning `(new_value, changed)`.
3. Add option fields to `StandardizeOptions` (with defaults that preserve existing behavior).
4. Wire into `_apply_field_type()` dispatcher (the `else` branch raises `AssertionError` — every enum value needs a branch).
5. Add validation entry in `StandardizeOptions.from_dict()` for any new enum-shaped option.

## Errors

Use `core/errors.py` instead of raw `ValueError` / `OSError`:

| Pattern | Use |
|---------|-----|
| Bad arg, wrong type, missing column | `InputValidationError` |
| Bad config / options file | `ConfigError` |
| File parses but isn't what we expected | `FileFormatError` |
| File I/O failure (perms, missing, disk full) | `FileAccessError` |
| Internal invariant broken (unreachable branch) | `AssertionError` |

Helpers:
- `ensure_dataframe(value, function="my_func")` at every public entry that takes a df.
- `ensure_choice(value, name="mode", choices=[...])` at every entry that takes a literal.
- `wrap_file_read(path, "operation", exc)` / `wrap_file_write(...)` when wrapping `OSError`.

GUI / CLI handlers: use `format_for_user(exc, context="...")` to render.

All `DataToolsError` subclasses extend stdlib `ValueError` or `OSError` so existing handlers still catch them.

## Tests

```bash
# All
pytest -q
# By module
pytest tests/test_dedup.py
# Include slow / integration
pytest -m slow
# Single test
pytest tests/test_dedup.py::TestExactMatch::test_basic
```

Test layout:
```
tests/
├── conftest.py                        # fixtures
├── test_dedup.py · test_normalizers.py · test_io.py · test_config.py
├── test_analyze.py · test_normalize.py · test_text_clean.py
├── test_format_standardize.py
├── test_format_standardize_corpus.py  # 199-row buyer corpus
├── test_audit_fixes.py · test_errors.py · test_fixes_unit.py
├── test_corpus.py · test_encodings_corpus.py · test_fixtures_sweep.py
└── test_cli.py · test_cli_*.py · test_e2e.py · test_install.py
```

Fixture corpora: `test-cases/text-cleaner-corpus/` (21 files) · `test-cases/encodings-corpus/` (31 files) · `test-cases/format-cleaner-corpus/` (7 files + spec).

## Known limitations

- **Dedup is O(n²)** — no blocking. Works to ~50k rows. Future: partition by first letter / ZIP prefix.
- **Single-threaded** — could benefit from `multiprocessing`.
- **Memory-bound** — entire file loaded into pandas. Streaming reads exist but not integrated with dedup engine.
- **No multi-sheet dedup** — each Excel sheet processed independently.
- **Phonenumbers minimum-length** — international numbers without country codes fall back to digits-only.