datatools-dev/docs/DEVELOPER.md
Michael abb720997e docs: tight, scannable rewrite — every item earns its place
Refactors all 10 docs (README, USER-GUIDE, CLI-REFERENCE, REQUIREMENTS,
TECHNICAL, DEVELOPER, BUSINESS, DECISIONS, RECOVERY, docs/README) from
prose-heavy to bullet-heavy + table-heavy. Same information density,
significantly less reading load.

Net: 2600 → 1652 lines (~37% reduction) WHILE adding the new content
that landed since v1.6:

- Format Standardizer (3rd Ready tool)
- 199-row buyer corpus
- src/core/errors.py structured hierarchy + ensure_dataframe /
  ensure_choice / wrap_file_read|write / format_for_user helpers
- src/core/_constants.py shared USPS/state lookup tables
- Cross-tool audit fixes (NaN matching, removed_df schema, validation,
  enum-bounds checks, forward-compat config)
- Per-domain error_policy across format standardizers
- Inconsistent-date-format detector
- Excel header-row auto-detection + write_file delimiter param

Per-doc changes:

- README.md (175 → 71): 9-tool table at top, status column, 3 CLI
  entry points listed, dropped repeated marketing prose.
- docs/README.md (38 → 27): pure index — buyer-facing vs creator-only
  split + version footer.
- USER-GUIDE.md (208 → 118): tool table replaces script descriptions,
  troubleshooting compressed to bullets, gate explanation tightened.
- CLI-REFERENCE.md (451 → 235): collapsed flag tables, removed
  redundant intro text, kept full recipes section.
- REQUIREMENTS.md (146 → 129): 18 numbered sections (was 17), added
  §18 Error Handling, formatting tightened to single-line entries.
- TECHNICAL.md (570 → 350): collapsed §3 build pipeline tables, merged
  redundant §3.5-3.7 OS sections, added §7 (Error handling) +
  §11.3 (Format Standardizer spec) + §11.4-11.7 (analyzer / gate /
  Review page / repair_bytes promoted from §10.2.x sub-numbering).
- DEVELOPER.md (285 → 161): module map table replaces per-file prose,
  extension recipes condensed, new §Errors covers when to use each
  hierarchy class.
- BUSINESS.md (278 → 225): collapsed prose to tables (use cases,
  competitive landscape, costs, risks); honest-status updated.
- DECISIONS.md (269 → 189): scoring rubric + GUI matrix preserved,
  decision log compressed to single-line entries, added v1.6 entries
  (Format Standardizer Ready, errors module).
- RECOVERY.md (180 → 147): rebuild steps as numbered + tabular,
  external dependencies as one table, recovery priorities tightened.

No information removed; redundancy compressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:49:29 +00:00


# Developer Guide
Architecture, data flow, extension points.
## Architecture
```
CLI (src/cli*.py)        GUI (src/gui/app.py + pages/)
        │                        │
        └──────────┐  ┌──────────┘
                   ▼  ▼
            ┌────────────────┐
            │   src/core/    │
            └────────────────┘
```
**Core/UI rule**: business logic in `core/` only. CLI + GUI translate user input → core call → display result.
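
As an illustration of the rule above, here is a minimal sketch of a thin front-end. Everything in it is a stand-in: `dedupe_rows`, `run_cli`, and the `dedupe-sketch` program name are hypothetical and are not part of the real `src/core/` or `src/cli*.py` modules.

```python
import argparse


# Stand-in for a core function (hypothetical; real logic lives in src/core/).
def dedupe_rows(rows: list[str]) -> list[str]:
    """Return rows with exact duplicates removed, keeping first-seen order."""
    seen: set[str] = set()
    kept: list[str] = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            kept.append(row)
    return kept


# Thin CLI layer: parse user input -> call core -> format the result.
# It contains no matching logic of its own.
def run_cli(argv: list[str]) -> str:
    parser = argparse.ArgumentParser(prog="dedupe-sketch")
    parser.add_argument("rows", nargs="+")
    args = parser.parse_args(argv)
    kept = dedupe_rows(args.rows)
    return f"kept {len(kept)} of {len(args.rows)} rows"
```

A GUI page would follow the same shape: collect widget values, call the same core function, render the return value.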
## Module map
| Module | Public surface |
|--------|----------------|
| `core.dedup` | `deduplicate()`, `MatchStrategy`, `ColumnMatchStrategy`, `Algorithm`, `SurvivorRule`, `DeduplicationResult`, `MatchResult`, `build_default_strategies()` |
| `core.normalizers` | `normalize_email/phone/name/address/string`, `NormalizerType`, `get_normalizer()` |
| `core.io` | `read_file()`, `write_file()`, `list_sheets()`, `detect_encoding/delimiter/header_row`, `repair_bytes()` |
| `core.config` | `DeduplicationConfig.from_file/to_file/to_strategies/to_survivor_rule` |
| `core.analyze` | `analyze()`, `Finding`, `findings_by_tool()`, `_NULL_LIKE` |
| `core.fixes` | `@register("fix_id")` decorator, `get_fix()`, `available_actions()` |
| `core.normalize` | `auto_fix()`, `apply_decisions()`, `NormalizationResult`, `is_normalized()` |
| `core.text_clean` | `clean_dataframe()`, `CleanOptions`, `CleanResult`, `smart_title_case()` |
| `core.format_standardize` | `standardize_dataframe()`, `StandardizeOptions`, `StandardizeResult`, `FieldType`, per-cell `standardize_*()` |
| `core.errors` | `DataToolsError` hierarchy, `ensure_dataframe()`, `ensure_choice()`, `wrap_file_read/write()`, `format_for_user()` |
| `core._constants` | `US_STATE_NAMES`, `US_STATE_CODES`, `USPS_EXPANSIONS`, `USPS_COMPRESSIONS` |
## Data flow — Deduplicator
```
read_file()                   # auto-detect encoding, delimiter, header
    ▼ DataFrame
build_default_strategies()    # if no explicit strategies
    ▼                         # strong keys (email, phone) → standalone OR
                              # weak keys (name, address) → AND with strong
_apply_normalizations()       # add _norm_* shadow columns
_find_match_groups()          # O(n²) pair compare, OR strategies, union-find
[review_callback()]           # optional interactive review
_select_survivor()            # per group: first/last/most-complete/most-recent
[_merge_group()]              # optional: fill blanks from losers
DeduplicationResult           # deduplicated_df, removed_df, match_groups, log
```
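
The grouping step in the flow above (pairwise comparison, strategies OR-ed together, union-find) can be sketched as follows. This is an illustration under stated assumptions, not the code in `core/dedup.py`: records are plain dicts rather than DataFrame rows, and `find_match_groups`, `same_email`, and `same_phone` are hypothetical names.

```python
from itertools import combinations
from typing import Callable

Record = dict[str, str]
Strategy = Callable[[Record, Record], bool]


def find_match_groups(records: list[Record], strategies: list[Strategy]) -> list[list[int]]:
    """Group record indices that are linked, directly or transitively, by any strategy."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # lowest index becomes the root

    # O(n^2): every pair is compared; a pair matches if ANY strategy matches.
    for i, j in combinations(range(len(records)), 2):
        if any(strategy(records[i], records[j]) for strategy in strategies):
            union(i, j)

    groups: dict[int, list[int]] = {}
    for idx in range(len(records)):
        groups.setdefault(find(idx), []).append(idx)
    return [members for members in groups.values() if len(members) > 1]


def same_email(a: Record, b: Record) -> bool:
    return bool(a["email"]) and a["email"] == b["email"]


def same_phone(a: Record, b: Record) -> bool:
    return bool(a["phone"]) and a["phone"] == b["phone"]


records = [
    {"email": "a@x.com", "phone": "111"},
    {"email": "a@x.com", "phone": "222"},
    {"email": "", "phone": "222"},
    {"email": "z@x.com", "phone": "999"},
]
groups = find_match_groups(records, [same_email, same_phone])
```

Rows 0 and 1 share an email and rows 1 and 2 share a phone, so union-find merges all three into one group even though rows 0 and 2 never match directly.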
## Extension recipes
### Add a normalizer
1. Add function to `core/normalizers.py`:
```python
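import re
from typing import Optional

def normalize_company(value: Optional[str]) -> str:
    """Casefold a company name and drop trailing legal suffixes ("Acme Inc." -> "acme")."""
    if not value or not isinstance(value, str):
        return ""
    name = value.strip().casefold()
    for sfx in ("inc", "llc", "corp", "ltd", "co"):
        name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
    return name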
```
2. Register: add `COMPANY = "company"` to `NormalizerType` + entry in `_NORMALIZER_MAP`.
3. Auto-detect (optional): add a `_COLUMN_TYPE_PATTERNS` row in `core/dedup.py`.
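
The registration in step 2 might look roughly like the sketch below. The real shapes of `NormalizerType`, `_NORMALIZER_MAP`, and `get_normalizer()` are not shown in this guide, so the enum members, the dict layout, and the `normalize_string` body here are assumptions made for illustration only.

```python
import re
from enum import Enum
from typing import Callable, Optional


def normalize_string(value: Optional[str]) -> str:
    # Assumed baseline normalizer: trim and casefold.
    return value.strip().casefold() if isinstance(value, str) else ""


def normalize_company(value: Optional[str]) -> str:
    if not value or not isinstance(value, str):
        return ""
    name = value.strip().casefold()
    for sfx in ("inc", "llc", "corp", "ltd", "co"):
        name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
    return name


class NormalizerType(Enum):
    STRING = "string"
    COMPANY = "company"  # the newly added member


# Assumed layout: one enum member -> one normalizer function.
_NORMALIZER_MAP: dict[NormalizerType, Callable[[Optional[str]], str]] = {
    NormalizerType.STRING: normalize_string,
    NormalizerType.COMPANY: normalize_company,
}


def get_normalizer(kind: NormalizerType) -> Callable[[Optional[str]], str]:
    return _NORMALIZER_MAP[kind]
```

With a lookup of this shape, a config value such as `"company"` resolves through `NormalizerType("company")` without any other call site changing.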
### Add a fuzzy algorithm
1. Add value to `Algorithm` enum in `core/dedup.py`.
2. Add case in `_compute_similarity()`.
3. Document the value in CLI help text.
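
A minimal sketch of steps 1 and 2, assuming similarity is reported on a 0 to 100 scale and that dispatch is a chain of enum checks. The enum members, the scale, and the token-set (Jaccard) scorer are all hypothetical; the real `_compute_similarity()` may differ.

```python
from difflib import SequenceMatcher
from enum import Enum


class Algorithm(Enum):
    EXACT = "exact"
    RATIO = "ratio"
    TOKEN_SET = "token_set"  # hypothetical new value (step 1)


def compute_similarity(a: str, b: str, algorithm: Algorithm) -> float:
    """Return a similarity score between 0.0 and 100.0."""
    if algorithm is Algorithm.EXACT:
        return 100.0 if a == b else 0.0
    if algorithm is Algorithm.RATIO:
        return 100.0 * SequenceMatcher(None, a, b).ratio()
    if algorithm is Algorithm.TOKEN_SET:  # new case (step 2)
        tokens_a, tokens_b = set(a.split()), set(b.split())
        if not tokens_a and not tokens_b:
            return 100.0
        # Jaccard overlap: shared tokens / all distinct tokens.
        return 100.0 * len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
    raise AssertionError(f"unhandled Algorithm: {algorithm!r}")
```

Ending the chain with an explicit failure means a forgotten branch surfaces in tests instead of silently scoring zero.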
### Add a survivor rule
1. Add value to `SurvivorRule` enum.
2. Add branch in `_select_survivor()`.
3. Add CLI mapping.
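
The enum-plus-branch pattern for survivor rules could be sketched like this. The group is modelled as a list of dicts rather than DataFrame rows, and `LONGEST_NAME` is a made-up rule used only to show where a new branch goes; the real `_select_survivor()` signature is not documented here.

```python
from enum import Enum


class SurvivorRule(Enum):
    FIRST = "first"
    LAST = "last"
    MOST_COMPLETE = "most-complete"
    LONGEST_NAME = "longest-name"  # hypothetical new rule (step 1)


def select_survivor(group: list[dict[str, str]], rule: SurvivorRule) -> int:
    """Return the index of the record to keep within one match group."""
    if rule is SurvivorRule.FIRST:
        return 0
    if rule is SurvivorRule.LAST:
        return len(group) - 1
    if rule is SurvivorRule.MOST_COMPLETE:
        # Most non-empty fields wins; max() keeps the earliest record on ties.
        return max(range(len(group)), key=lambda i: sum(1 for v in group[i].values() if v))
    if rule is SurvivorRule.LONGEST_NAME:  # new branch (step 2)
        return max(range(len(group)), key=lambda i: len(group[i].get("name", "")))
    raise AssertionError(f"unhandled SurvivorRule: {rule!r}")


group = [
    {"name": "Al", "email": ""},
    {"name": "Alan", "email": "a@x.com"},
]
```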
### Add a fix + detector (analyzer/gate)
1. **Detector** in `core/analyze.py`: add `_detect_<thing>(df) -> list[Finding]`, hook into the main `analyze()` pipeline. Emit Finding with a unique `fix_action` id.
2. **Fix** in `core/fixes.py`:
```python
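@register("fix_id")
def my_fix(df: pd.DataFrame, payload=None) -> tuple[pd.DataFrame, int]:
    out_df = df.copy()   # work on a copy of the input frame
    cells_changed = 0    # count every cell the fix rewrites
    # ... apply the fix to out_df, updating cells_changed ...
    return out_df, cells_changed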
```
3. **Constant** in `core/analyze.py`: add `FIX_<NAME> = "fix_id"` so the detector and fix can reference it.

No other call sites change. The gate auto-discovers the new fix via the registry.
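
To show why no other call sites change, here is a rough sketch of a decorator-based registry. It is not the code in `core/fixes.py`: the `_FIXES` dict, the duplicate-id check, and the `strip_whitespace` fix are assumptions, and rows are plain dicts instead of a DataFrame so the example runs with the standard library alone.

```python
from typing import Callable, Optional

Rows = list[dict[str, str]]
Fix = Callable[[Rows, Optional[dict]], tuple[Rows, int]]

_FIXES: dict[str, Fix] = {}  # assumed registry storage


def register(fix_id: str) -> Callable[[Fix], Fix]:
    """Decorator that files a fix function under its id at import time."""
    def decorator(func: Fix) -> Fix:
        if fix_id in _FIXES:
            raise ValueError(f"duplicate fix id: {fix_id}")
        _FIXES[fix_id] = func
        return func
    return decorator


def get_fix(fix_id: str) -> Fix:
    return _FIXES[fix_id]


FIX_STRIP_WHITESPACE = "strip_whitespace"  # shared by detector and fix


@register(FIX_STRIP_WHITESPACE)
def strip_whitespace(rows: Rows, payload: Optional[dict] = None) -> tuple[Rows, int]:
    """Trim every cell; return the new rows and the number of cells changed."""
    out: Rows = []
    changed = 0
    for row in rows:
        new_row = {key: value.strip() for key, value in row.items()}
        changed += sum(1 for key in row if row[key] != new_row[key])
        out.append(new_row)
    return out, changed


fixed, cells_changed = get_fix(FIX_STRIP_WHITESPACE)([{"name": " Al ", "city": "Rome"}], None)
```

A caller only needs the id carried by a `Finding`; the lookup finds whatever function was registered under it.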
### Add a format-standardizer field type
1. Add value to `FieldType` enum in `core/format_standardize.py`.
2. Add per-cell `standardize_<x>(value, *, …)` returning `(new_value, changed)`.
3. Add option fields to `StandardizeOptions` (with defaults that preserve existing behavior).
4. Wire into `_apply_field_type()` dispatcher (the `else` branch raises `AssertionError` — every enum value needs a branch).
5. Add validation entry in `StandardizeOptions.from_dict()` for any new enum-shaped option.
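
Steps 1, 2, and 4 can be pictured with the sketch below. The `TEXT` and `ZIP` members, `standardize_zip`, its `keep_plus4` option, and the dispatcher body are hypothetical; only the `(new_value, changed)` return shape and the `AssertionError` fallback come from the steps above.

```python
from enum import Enum


class FieldType(Enum):
    TEXT = "text"
    ZIP = "zip"  # hypothetical new field type (step 1)


def standardize_text(value: str) -> tuple[str, bool]:
    new_value = value.strip()
    return new_value, new_value != value


def standardize_zip(value: str, *, keep_plus4: bool = True) -> tuple[str, bool]:
    """Per-cell standardizer (step 2): returns (new_value, changed)."""
    digits = "".join(ch for ch in value if ch.isdigit())
    if len(digits) == 9:
        new_value = f"{digits[:5]}-{digits[5:]}" if keep_plus4 else digits[:5]
    elif len(digits) == 5:
        new_value = digits
    else:
        return value, False  # unrecognized shape: leave the cell untouched
    return new_value, new_value != value


def apply_field_type(value: str, field_type: FieldType) -> tuple[str, bool]:
    """Dispatcher (step 4): every enum member needs its own branch."""
    if field_type is FieldType.TEXT:
        return standardize_text(value)
    elif field_type is FieldType.ZIP:
        return standardize_zip(value)
    else:
        raise AssertionError(f"unhandled FieldType: {field_type!r}")
```

Returning the `changed` flag per cell lets the caller count modifications without re-comparing values.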
## Errors
Use `core/errors.py` instead of raw `ValueError` / `OSError`:
| Pattern | Use |
|---------|-----|
| Bad arg, wrong type, missing column | `InputValidationError` |
| Bad config / options file | `ConfigError` |
| File parses but isn't what we expected | `FileFormatError` |
| File I/O failure (perms, missing, disk full) | `FileAccessError` |
| Internal invariant broken (unreachable branch) | `AssertionError` |

Helpers:
- `ensure_dataframe(value, function="my_func")` at every public entry that takes a df.
- `ensure_choice(value, name="mode", choices=[...])` at every entry that takes a literal.
- `wrap_file_read(path, "operation", exc)` / `wrap_file_write(...)` when wrapping `OSError`.

GUI / CLI handlers: use `format_for_user(exc, context="...")` to render.

All `DataToolsError` subclasses extend stdlib `ValueError` or `OSError` so existing handlers still catch them.
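
The backward-compatibility point can be illustrated with a small sketch. The class bodies and the `ensure_choice` implementation below are assumptions modelled on the names in this section, not a copy of `core/errors.py`.

```python
class DataToolsError(Exception):
    """Common base so callers can catch every project error at once."""


class InputValidationError(DataToolsError, ValueError):
    """Bad argument, wrong type, or missing column."""


class FileAccessError(DataToolsError, OSError):
    """File I/O failure such as permissions or a missing path."""


def ensure_choice(value: str, *, name: str, choices: list[str]) -> str:
    """Return value unchanged, or raise if it is not one of the allowed literals."""
    if value not in choices:
        raise InputValidationError(
            f"{name} must be one of {sorted(choices)}, got {value!r}"
        )
    return value


# A handler written before the hierarchy existed still works:
try:
    ensure_choice("fuzzy", name="mode", choices=["exact", "loose"])
except ValueError as exc:
    caught = exc
```

Because each class also inherits from the matching stdlib exception, both `except ValueError` and `except DataToolsError` catch the same error.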
## Tests
```bash
# All
pytest -q
# By module
pytest tests/test_dedup.py
# Slow / integration tests only (selects the `slow` marker)
pytest -m slow
# Single test
pytest tests/test_dedup.py::TestExactMatch::test_basic
```
Test layout:
```
tests/
├── conftest.py # fixtures
├── test_dedup.py · test_normalizers.py · test_io.py · test_config.py
├── test_analyze.py · test_normalize.py · test_text_clean.py
├── test_format_standardize.py
├── test_format_standardize_corpus.py # 199-row buyer corpus
├── test_audit_fixes.py · test_errors.py · test_fixes_unit.py
├── test_corpus.py · test_encodings_corpus.py · test_fixtures_sweep.py
└── test_cli.py · test_cli_*.py · test_e2e.py · test_install.py
```
Fixture corpora: `test-cases/text-cleaner-corpus/` (21 files) · `test-cases/encodings-corpus/` (31 files) · `test-cases/format-cleaner-corpus/` (7 files + spec).
## Known limitations
- **Dedup is O(n²)** — no blocking. Works to ~50k rows. Future: partition by first letter / ZIP prefix.
- **Single-threaded** — could benefit from `multiprocessing`.
- **Memory-bound** — entire file loaded into pandas. Streaming reads exist but not integrated with dedup engine.
- **No multi-sheet dedup** — each Excel sheet processed independently.
- **Phonenumbers minimum-length** — international numbers without country codes fall back to digits-only.
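
The blocking idea mentioned in the first limitation could look roughly like this. It is a sketch of a possible future approach, not existing code: `candidate_pairs` and `zip_prefix` are hypothetical names, and records are plain dicts.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Iterator

Record = dict[str, str]


def candidate_pairs(
    records: list[Record], block_key: Callable[[Record], str]
) -> Iterator[tuple[int, int]]:
    """Yield index pairs only for records that share a block key."""
    blocks: dict[str, list[int]] = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[block_key(record)].append(idx)
    for members in blocks.values():
        # Pairwise comparison is still quadratic, but only within each block.
        yield from combinations(members, 2)


def zip_prefix(record: Record) -> str:
    return record.get("zip", "")[:3]


records = [
    {"zip": "94107"},
    {"zip": "94110"},
    {"zip": "10001"},
    {"zip": "10002"},
]
pairs = list(candidate_pairs(records, zip_prefix))
```

Four records would need six comparisons without blocking and need two here. The trade-off is that duplicates whose block keys differ (for example, a mistyped ZIP) are never compared.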