docs: tight, scannable rewrite — every item earns its place

Refactors all 10 docs (README, USER-GUIDE, CLI-REFERENCE, REQUIREMENTS, TECHNICAL, DEVELOPER, BUSINESS, DECISIONS, RECOVERY, docs/README) from prose-heavy to bullet-heavy + table-heavy. Same information density, significantly less reading load.

Net: 2600 → 1652 lines (~37% reduction) WHILE adding the new content that landed since v1.6:
- Format Standardizer (3rd Ready tool)
- 199-row buyer corpus
- src/core/errors.py structured hierarchy + ensure_dataframe / ensure_choice / wrap_file_read|write / format_for_user helpers
- src/core/_constants.py shared USPS/state lookup tables
- Cross-tool audit fixes (NaN matching, removed_df schema, validation, enum-bounds checks, forward-compat config)
- Per-domain error_policy across format standardizers
- Inconsistent-date-format detector
- Excel header-row auto-detection + write_file delimiter param

Per-doc changes:
- README.md (175 → 71): 9-tool table at top, status column, 3 CLI entry points listed, dropped repeated marketing prose.
- docs/README.md (38 → 27): pure index with buyer-facing vs creator-only split + version footer.
- USER-GUIDE.md (208 → 118): tool table replaces script descriptions, troubleshooting compressed to bullets, gate explanation tightened.
- CLI-REFERENCE.md (451 → 235): collapsed flag tables, removed redundant intro text, kept full recipes section.
- REQUIREMENTS.md (146 → 129): 18 numbered sections (was 17), added §18 Error Handling, formatting tightened to single-line entries.
- TECHNICAL.md (570 → 350): collapsed §3 build pipeline tables, merged redundant §3.5-3.7 OS sections, added §7 (Error handling) + §11.3 (Format Standardizer spec) + §11.4-11.7 (analyzer / gate / Review page / repair_bytes promoted from §10.2.x sub-numbering).
- DEVELOPER.md (285 → 161): module map table replaces per-file prose, extension recipes condensed, new §Errors covers when to use each hierarchy class.
- BUSINESS.md (278 → 225): collapsed prose to tables (use cases, competitive landscape, costs, risks); honest-status updated.
- DECISIONS.md (269 → 189): scoring rubric + GUI matrix preserved, decision log compressed to single-line entries, added v1.6 entries (Format Standardizer Ready, errors module).
- RECOVERY.md (180 → 147): rebuild steps as numbered + tabular, external dependencies as one table, recovery priorities tightened.

No information removed; redundancy compressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:49:29 +00:00
parent 26b9771625
commit abb720997e
10 changed files with 1105 additions and 2053 deletions
--- a/docs/DEVELOPER.md
+++ b/docs/DEVELOPER.md
@@ -1,285 +1,161 @@
 # Developer Guide

-Architecture, data flow, and extension guide for the DataTools Deduplicator.
+Architecture, data flow, extension points.

 ## Architecture

 ```
-CLI (src/cli.py)                 GUI (src/gui/app.py)
-     │                                │
-     │  flags → strategies            │  widgets → strategies
-     │  _interactive_review()         │  match_group_card()
-     │  tqdm progress bar             │  st.progress()
-     │                                │
-     └──────────┐    ┌────────────────┘
-                │    │
+CLI (src/cli*.py)         GUI (src/gui/app.py + pages/)
+     │                          │
+     └──────────┐    ┌──────────┘
                ▼    ▼
-          ┌─────────────────┐
-          │   core.dedup     │
-          │  deduplicate()   │
-          └────────┬────────┘
-                   │
-      ┌────────────┼────────────┐
-      ▼            ▼            ▼
- core.io      core.normalizers  core.config
- read/write   normalize_*()     save/load JSON
+            ┌────────────────┐
+            │   src/core/    │
+            └────────────────┘
 ```

-**Key principle:** All business logic lives in `src/core/`. The CLI and GUI are thin wrappers that translate user input into `deduplicate()` arguments and display the `DeduplicationResult`.
+**Core/UI rule**: business logic in `core/` only. CLI + GUI translate user input → core call → display result.

-## File-by-File Reference
+## Module map

-### src/core/dedup.py — Deduplication Engine
+| Module | Public surface |
+|--------|----------------|
+| `core.dedup` | `deduplicate()`, `MatchStrategy`, `ColumnMatchStrategy`, `Algorithm`, `SurvivorRule`, `DeduplicationResult`, `MatchResult`, `build_default_strategies()` |
+| `core.normalizers` | `normalize_email/phone/name/address/string`, `NormalizerType`, `get_normalizer()` |
+| `core.io` | `read_file()`, `write_file()`, `list_sheets()`, `detect_encoding/delimiter/header_row`, `repair_bytes()` |
+| `core.config` | `DeduplicationConfig.from_file/to_file/to_strategies/to_survivor_rule` |
+| `core.analyze` | `analyze()`, `Finding`, `findings_by_tool()`, `_NULL_LIKE` |
+| `core.fixes` | `@register("fix_id")` decorator, `get_fix()`, `available_actions()` |
+| `core.normalize` | `auto_fix()`, `apply_decisions()`, `NormalizationResult`, `is_normalized()` |
+| `core.text_clean` | `clean_dataframe()`, `CleanOptions`, `CleanResult`, `smart_title_case()` |
+| `core.format_standardize` | `standardize_dataframe()`, `StandardizeOptions`, `StandardizeResult`, `FieldType`, per-cell `standardize_*()` |
+| `core.errors` | `DataToolsError` hierarchy, `ensure_dataframe()`, `ensure_choice()`, `wrap_file_read/write()`, `format_for_user()` |
+| `core._constants` | `US_STATE_NAMES`, `US_STATE_CODES`, `USPS_EXPANSIONS`, `USPS_COMPRESSIONS` |

-The central module. Contains:
-
- **Enums:** `Algorithm` (4 fuzzy algorithms), `SurvivorRule` (4 selection rules)
- **Data classes:** `ColumnMatchStrategy`, `MatchStrategy`, `MatchResult`, `DeduplicationResult`
- **`deduplicate()`** — main entry point. Takes a DataFrame + optional strategies/rules, returns a `DeduplicationResult` with deduplicated DataFrame, removed rows, match groups, and log entries.
- **`build_default_strategies()`** — scans column names with regex patterns to auto-detect email, phone, name, and address columns. Builds strong/weak key strategies with appropriate algorithms and normalizers.
- **`_UnionFind`** — disjoint-set data structure for transitive closure. If A matches B and B matches C, all three end up in one group.
- **`_find_match_groups()`** — O(n^2) pairwise comparison. For each pair, tries all strategies (OR semantics). Feeds matches into union-find. Returns match groups with confidence scores.
- **`_select_survivor()`** — picks the row to keep based on the survivor rule.
- **`_merge_group()`** — fills blank fields in the survivor from loser rows.
-
-### src/core/normalizers.py — Text Normalization
-
-Five normalizer functions, each `str → str`, idempotent, None-safe:
-
- **`normalize_email()`** — lowercase, strip Gmail dots, strip `+tag` suffixes
- **`normalize_phone()`** — parse with `phonenumbers` to E.164; fallback to digits-only
- **`normalize_name()`** — strip title prefixes (Dr., Mr.) and suffixes (Jr., PhD), case-fold
- **`normalize_address()`** — USPS abbreviations (Street→St, Avenue→Ave), case-fold
- **`normalize_string()`** — trim, collapse whitespace, case-fold
-
-The `get_normalizer()` registry function maps `NormalizerType` enum values to functions.
-
-### src/core/io.py — File I/O
-
-Auto-detection stack:
-
-1. **`detect_encoding()`** — checks BOM, then uses `charset-normalizer` heuristics
-2. **`detect_delimiter()`** — uses `csv.Sniffer` on first 20 lines
-3. **`detect_header_row()`** — finds first row where all cells look like column names
-
-Main functions:
- **`read_file()`** — reads CSV/TSV/Excel with full auto-detection. Returns a DataFrame.
- **`write_file()`** — writes DataFrame to CSV or Excel. Uses `utf-8-sig` by default for Windows Excel compatibility.
- **`list_sheets()`** — returns sheet names from an Excel workbook.
-
-### src/core/config.py — Configuration Profiles
-
-Save/load deduplication settings as JSON:
-
- **`DeduplicationConfig`** — flat dataclass with all settings: strategies, survivor rule, merge flag, algorithm, threshold, normalizer map.
- **`.to_file()` / `.from_file()`** — JSON serialization
- **`.to_strategies()`** — converts config back to `MatchStrategy` objects for the engine
- **`.to_survivor_rule()`** — converts string to `SurvivorRule` enum
-
-### src/cli.py — Command-Line Interface
-
-Typer-based CLI with 17 options. Key responsibilities:
-
- Parse flags into strategies, survivor rule, and other config
- Set up logging (timestamped log files in `logs/`)
- Column name validation with fuzzy suggestions on typos
- `_interactive_review()` — side-by-side row display with y/n/s prompts
- Progress bar via `tqdm` for files > 10,000 rows
- Output formatting and file writing
-
-### src/gui/app.py — Streamlit GUI
-
-Single-page layout:
- File upload with instant preview and configurable delimiter (comma, tab, semicolon, pipe, or custom)
- Advanced options expander (column selection, fuzzy, normalizers, survivor rule, merge, config profiles)
- Find Duplicates button → runs `deduplicate()` with `progress_callback`
- Interactive review via `st.data_editor` with inline checkboxes and column dropdowns
- Batch actions: Accept All, Reject All, Clear Decisions
- Apply review decisions and download cleaned results
- Download buttons for deduplicated CSV, removed rows, and match groups report
-
-### src/gui/components.py — Reusable GUI Widgets
-
- **`match_group_card()`** — expandable card with `st.data_editor`: inline Keep checkboxes per row, `SelectboxColumn` dropdowns for differing columns, and a live surviving rows preview
- **`config_panel()`** — the advanced options expander, returns settings dict with strategies, survivor rule, merge flag
- **`results_summary()`** — summary metrics and download buttons
- **`apply_review_decisions()`** — builds final DataFrames from user review decisions (merge, split, or keep-all per group) with column override support
-
-## Data Flow
+## Data flow — Deduplicator

 ```
-Input File
-    │
-    ▼
-read_file()          ← auto-detect encoding, delimiter, header
-    │
-    ▼
-DataFrame
-    │
-    ▼
-build_default_strategies()   ← (if no explicit strategies)
-    │                            scan column names → regex patterns
-    │                            strong keys: email, phone (standalone OR)
-    │                            weak keys: name, address (AND with strong)
-    ▼
-_apply_normalizations()      ← add _norm_* shadow columns
-    │                            normalize_email(), normalize_phone(), etc.
-    ▼
-_find_match_groups()         ← O(n²) pairwise comparison
-    │                            for each pair: try all strategies (OR)
-    │                            _compute_similarity() per column
-    │                            union-find for transitive closure
-    ▼
-[review_callback()]          ← optional: interactive review per group
-    │                            True=accept, False=reject, None=skip
-    ▼
-_select_survivor()           ← per group: first/last/most-complete/most-recent
-    │
-    ▼
-[_merge_group()]             ← optional: fill blanks from losers
-    │
-    ▼
-DeduplicationResult
-    ├── deduplicated_df      ← cleaned DataFrame (shadow cols dropped)
-    ├── removed_df           ← rows that were removed
-    ├── match_groups         ← list of MatchResult with confidence, columns
-    └── log_entries          ← human-readable audit log
+read_file()                       # auto-detect encoding, delimiter, header
+   ▼ DataFrame
+build_default_strategies()        # if no explicit strategies
+   ▼                              # strong keys (email, phone) → standalone OR
+                                  # weak keys (name, address) → AND with strong
+_apply_normalizations()           # add _norm_* shadow columns
+   ▼
+_find_match_groups()              # O(n²) pair compare, OR strategies, union-find
+   ▼
+[review_callback()]               # optional interactive review
+   ▼
+_select_survivor()                # per group: first/last/most-complete/most-recent
+   ▼
+[_merge_group()]                  # optional: fill blanks from losers
+   ▼
+DeduplicationResult               # deduplicated_df, removed_df, match_groups, log
 ```
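The union-find step in the flow above may be easier to follow with a small example. This is a minimal sketch under assumptions: `UnionFind` and `group_pairs` are hypothetical names, not the project's `_UnionFind` or `_find_match_groups()`, and rows are represented by plain integer indices. It shows how pairwise matches become transitive groups: if row 0 matches row 1 and row 1 matches row 3, all three land in one group.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set with path halving (illustrative, not the project's class)."""

    def __init__(self, size: int) -> None:
        self.parent = list(range(size))

    def find(self, item: int) -> int:
        # Walk up to the root, shortening the path along the way.
        while self.parent[item] != item:
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a: int, b: int) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def group_pairs(n_rows: int, matched_pairs: list[tuple[int, int]]) -> list[list[int]]:
    """Turn pairwise matches into groups of two or more row indices."""
    uf = UnionFind(n_rows)
    for a, b in matched_pairs:
        uf.union(a, b)
    groups: dict[int, list[int]] = defaultdict(list)
    for row in range(n_rows):
        groups[uf.find(row)].append(row)
    # Singletons are not duplicates, so only multi-row groups are returned.
    return sorted(g for g in groups.values() if len(g) > 1)


groups = group_pairs(5, [(0, 1), (1, 3)])
print(groups)
```

Rows 2 and 4 never appear in a matched pair, so they stay out of the result.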

-## How to Add a Normalizer
+## Extension recipes

-1. **Add the function** in `src/core/normalizers.py`:
+### Add a normalizer

-```python
-def normalize_company(value: Optional[str]) -> str:
-    """Strip legal suffixes (Inc, LLC, Corp), case-fold."""
-    if not value or not isinstance(value, str):
-        return ""
-    name = value.strip().casefold()
-    # Strip common suffixes
-    for suffix in ("inc", "llc", "corp", "ltd", "co"):
-        name = re.sub(rf"\b{suffix}\.?\s*$", "", name).strip()
-    return name
-```
+1. Add function to `core/normalizers.py`:
+   ```python
+   def normalize_company(value: Optional[str]) -> str:
+       if not value or not isinstance(value, str): return ""
+       name = value.strip().casefold()
+       for sfx in ("inc", "llc", "corp", "ltd", "co"):
+           name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
+       return name
+   ```
+2. Register: add `COMPANY = "company"` to `NormalizerType` + entry in `_NORMALIZER_MAP`.
+3. Auto-detect (optional): add a `_COLUMN_TYPE_PATTERNS` row in `core/dedup.py`.

-2. **Register it** in the same file:
+### Add a fuzzy algorithm

-```python
-class NormalizerType(str, Enum):
-    # ... existing types ...
-    COMPANY = "company"      # ← add enum value
+1. Add value to `Algorithm` enum in `core/dedup.py`.
+2. Add case in `_compute_similarity()`.
+3. Document the value in CLI help text.

-_NORMALIZER_MAP: dict[NormalizerType, Callable[[str], str]] = {
-    # ... existing entries ...
-    NormalizerType.COMPANY: normalize_company,   # ← add mapping
-}
-```
+### Add a survivor rule

-3. **Add auto-detection pattern** in `src/core/dedup.py` (optional):
+1. Add value to `SurvivorRule` enum.
+2. Add branch in `_select_survivor()`.
+3. Add CLI mapping.

-```python
-_COLUMN_TYPE_PATTERNS = [
-    # ... existing patterns ...
-    (re.compile(r"company|organization|org_name", re.I),
-     NormalizerType.COMPANY, Algorithm.TOKEN_SET_RATIO, 85.0, False),
-]
-```
+### Add a fix + detector (analyzer/gate)

-## How to Add a Matching Algorithm
+1. **Detector** in `core/analyze.py`: add `_detect_<thing>(df) -> list[Finding]`, hook into the main `analyze()` pipeline. Emit Finding with a unique `fix_action` id.
+2. **Fix** in `core/fixes.py`:
+   ```python
+   @register("fix_id")
+   def my_fix(df, payload=None) -> tuple[pd.DataFrame, int]:
+       # ...
+       return out_df, cells_changed
+   ```
+3. **Constant** in `core/analyze.py`: add `FIX_<NAME> = "fix_id"` so the detector and fix can reference it.

-1. **Add the enum value** in `src/core/dedup.py`:
+No other call sites change. Gate auto-discovers it via the registry.
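One way a decorator registry like this could work is sketched below. The function names `register`, `get_fix`, and `available_actions` come from the module map; their bodies, the `_REGISTRY` dict, and the `strip_whitespace` fix are assumptions made for illustration. To stay stdlib-only, a list of dicts stands in for the DataFrame.

```python
from typing import Callable, Optional

Rows = list[dict[str, str]]          # stand-in for a DataFrame (assumption)
FixFn = Callable[[Rows, Optional[dict]], tuple[Rows, int]]

_REGISTRY: dict[str, FixFn] = {}     # hypothetical storage name


def register(fix_id: str) -> Callable[[FixFn], FixFn]:
    """Decorator that files a fix function under its id."""
    def decorator(func: FixFn) -> FixFn:
        if fix_id in _REGISTRY:
            raise ValueError(f"duplicate fix id: {fix_id}")
        _REGISTRY[fix_id] = func
        return func
    return decorator


def get_fix(fix_id: str) -> FixFn:
    """Look up a registered fix by id."""
    return _REGISTRY[fix_id]


def available_actions() -> list[str]:
    """List every registered fix id in sorted order."""
    return sorted(_REGISTRY)


@register("strip_whitespace")        # hypothetical fix id
def strip_whitespace(rows: Rows, payload: Optional[dict] = None) -> tuple[Rows, int]:
    """Trim every cell; return the new rows and the number of cells changed."""
    changed = 0
    out: Rows = []
    for row in rows:
        new_row = {key: value.strip() for key, value in row.items()}
        changed += sum(1 for key in row if row[key] != new_row[key])
        out.append(new_row)
    return out, changed


fixed, cells_changed = get_fix("strip_whitespace")(
    [{"name": " Ann ", "city": "Oslo"}], None
)
```

Because registration happens at import time, a caller that only knows a `fix_action` id can resolve the function without a hand-maintained dispatch table.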

-```python
-class Algorithm(str, Enum):
-    # ... existing values ...
-    SOUNDEX = "soundex"
-```
+### Add a format-standardizer field type

-2. **Add the computation** in `_compute_similarity()`:
+1. Add value to `FieldType` enum in `core/format_standardize.py`.
+2. Add per-cell `standardize_<x>(value, *, …)` returning `(new_value, changed)`.
+3. Add option fields to `StandardizeOptions` (with defaults that preserve existing behavior).
+4. Wire into `_apply_field_type()` dispatcher (the `else` branch raises `AssertionError` — every enum value needs a branch).
+5. Add validation entry in `StandardizeOptions.from_dict()` for any new enum-shaped option.

-```python
-def _compute_similarity(val_a: str, val_b: str, algorithm: Algorithm) -> float:
-    # ... existing cases ...
-    if algorithm == Algorithm.SOUNDEX:
-        return 100.0 if _soundex(val_a) == _soundex(val_b) else 0.0
-```
+## Errors

-3. **Add the CLI flag value** in `src/cli.py` help text for `--algorithm`.
+Use `core/errors.py` instead of raw `ValueError` / `OSError`:

-## How to Add a Survivor Strategy
+| Pattern | Use |
+|---------|-----|
+| Bad arg, wrong type, missing column | `InputValidationError` |
+| Bad config / options file | `ConfigError` |
+| File parses but isn't what we expected | `FileFormatError` |
+| File I/O failure (perms, missing, disk full) | `FileAccessError` |
+| Internal invariant broken (unreachable branch) | `AssertionError` |

-1. **Add the enum value** in `src/core/dedup.py`:
+Helpers:
+- `ensure_dataframe(value, function="my_func")` at every public entry that takes a df.
+- `ensure_choice(value, name="mode", choices=[...])` at every entry that takes a literal.
+- `wrap_file_read(path, "operation", exc)` / `wrap_file_write(...)` when wrapping `OSError`.

-```python
-class SurvivorRule(str, Enum):
-    # ... existing values ...
-    KEEP_LONGEST = "longest"
-```
+GUI / CLI handlers: use `format_for_user(exc, context="...")` to render.

-2. **Add the logic** in `_select_survivor()`:
+All `DataToolsError` subclasses extend stdlib `ValueError` or `OSError` so existing handlers still catch them.
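The compatibility claim above could be realized with multiple inheritance, as in this sketch. The class and helper names come from the table and helper list; the base-class layout, the message wording, and the helper body are assumptions, not the contents of `core/errors.py`.

```python
class DataToolsError(Exception):
    """Common base so callers can catch every project error at once (assumed layout)."""


class InputValidationError(DataToolsError, ValueError):
    """Bad argument, wrong type, or missing column."""


class FileAccessError(DataToolsError, OSError):
    """File I/O failure such as permissions or a missing path."""


def ensure_choice(value, *, name: str, choices: list) -> None:
    """Raise InputValidationError unless value is one of choices."""
    if value not in choices:
        raise InputValidationError(
            f"{name} must be one of {choices!r}, got {value!r}"
        )


try:
    ensure_choice("fastest", name="mode", choices=["first", "last"])
except ValueError as exc:    # a handler written for ValueError still catches it
    caught = exc
```

A handler that wants project errors only can catch `DataToolsError` instead, while older `except ValueError` and `except OSError` blocks keep working unchanged.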

-```python
-if rule == SurvivorRule.KEEP_LONGEST:
-    return max(indices, key=lambda i: len(str(df.iloc[i].to_dict())))
-```
-
-3. **Add to the CLI** survivor map in `src/cli.py`.
-
-## Testing
-
-### Run Tests
+## Tests

 ```bash
-# All tests
-pytest tests/ -q
-
-# Specific module
-pytest tests/test_dedup.py -q
-pytest tests/test_normalizers.py -q
-pytest tests/test_io.py -q
-pytest tests/test_config.py -q
-pytest tests/test_cli.py -q
-
-# Verbose with output
-pytest tests/ -v
-
-# Stop on first failure
-pytest tests/ -x
+# All
+pytest -q
+# By module
+pytest tests/test_dedup.py
+# Only slow / integration tests (marker: slow)
+pytest -m slow
+# Single test
+pytest tests/test_dedup.py::TestExactMatch::test_basic
 ```

-### Test Structure
-
+Test layout:
 ```
 tests/
-├── conftest.py          # Shared fixtures
-│   ├── sample_csv_path  # Path to samples/messy_sales.csv
-│   ├── sample_df        # Loaded sample CSV as DataFrame
-│   ├── simple_df        # Small 5-row DataFrame with obvious duplicates
-│   ├── merge_df         # DataFrame with partial records
-│   └── tmp_csv          # Temporary CSV from simple_df
-├── test_dedup.py        # Engine tests: similarity, union-find, pairs, integration
-├── test_normalizers.py  # Normalizer tests: all 5 types with edge cases
-├── test_io.py           # I/O tests: encoding, delimiter, header, read/write
-├── test_config.py       # Config tests: serialization round-trip
-└── test_cli.py          # CLI tests: argument parsing, file handling
+├── conftest.py                        # fixtures
+├── test_dedup.py · test_normalizers.py · test_io.py · test_config.py
+├── test_analyze.py · test_normalize.py · test_text_clean.py
+├── test_format_standardize.py
+├── test_format_standardize_corpus.py  # 199-row buyer corpus
+├── test_audit_fixes.py · test_errors.py · test_fixes_unit.py
+├── test_corpus.py · test_encodings_corpus.py · test_fixtures_sweep.py
+└── test_cli.py · test_cli_*.py · test_e2e.py · test_install.py
 ```

-### Writing Tests
+Fixture corpora: `test-cases/text-cleaner-corpus/` (21 files) · `test-cases/encodings-corpus/` (31 files) · `test-cases/format-cleaner-corpus/` (7 files + spec).

-Follow existing patterns. Tests use pytest fixtures from `conftest.py`:
+## Known limitations

-```python
-def test_my_feature(simple_df):
-    """Test description."""
-    result = deduplicate(simple_df, ...)
-    assert len(result.match_groups) == expected
-    assert result.deduplicated_df.shape[0] == expected_rows
-```
-
-## Known Limitations
-
- **O(n^2) pairwise comparison** — no blocking or indexing. Works well up to ~50,000 rows. Beyond that, performance degrades quadratically. Future optimization: add blocking (partition by first letter, zip code prefix, etc.) to reduce comparison space.
- **No multi-sheet dedup** — each Excel sheet is processed independently. Cross-sheet deduplication is not supported.
- **Phone normalization requires valid-length numbers** — the `phonenumbers` library rejects numbers that are too short or too long for the detected region. Fallback is digits-only, which may produce false negatives for international numbers without country codes.
- **Single-threaded** — no parallel comparison. Could benefit from `multiprocessing` for large files.
- **Memory-bound** — entire file is loaded into a pandas DataFrame. Files larger than available RAM will fail. Chunked reading exists but is not integrated with the dedup engine.
+- **Dedup is O(n²)** — no blocking. Works to ~50k rows. Future: partition by first letter / ZIP prefix.
+- **Single-threaded** — could benefit from `multiprocessing`.
+- **Memory-bound** — entire file loaded into pandas. Streaming reads exist but not integrated with dedup engine.
+- **No multi-sheet dedup** — each Excel sheet processed independently.
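+- **Phone length validation**: `phonenumbers` rejects numbers that are too short or too long for the detected region, so they fall back to digits-only; international numbers without country codes may then fail to match.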
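The blocking idea named as future work in the first limitation can be sketched as follows. This is an illustration under assumptions, not project code: `candidate_pairs`, its parameters, and the sample rows are hypothetical, and rows are plain dicts rather than a DataFrame. Only rows sharing a key prefix are paired, which shrinks the comparison space from all pairs to pairs within each block.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(rows: list[dict[str, str]], block_key: str, prefix_len: int = 3):
    """Yield index pairs only for rows whose block_key shares a prefix."""
    blocks: dict[str, list[int]] = defaultdict(list)
    for index, row in enumerate(rows):
        prefix = row.get(block_key, "").strip().casefold()[:prefix_len]
        if prefix:                       # rows with a blank key are skipped in this sketch
            blocks[prefix].append(index)
    for members in blocks.values():
        yield from combinations(members, 2)


rows = [
    {"name": "Ann Lee", "zip": "10001"},
    {"name": "Anne Lee", "zip": "10002"},
    {"name": "Bob Roy", "zip": "94105"},
    {"name": "Rob Roy", "zip": "94107"},
]
pairs = sorted(candidate_pairs(rows, "zip"))
print(pairs)
```

Four rows would need six comparisons without blocking; here only two pairs are produced. The trade-off is recall: duplicates whose block key differs within the prefix, or is blank, are never compared.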