feat: add documentation, Streamlit GUI, and full source tree

- Rewrite README.md with project overview, quick-start, and CLI summary
- Add docs/CLI-REFERENCE.md with full flag reference and 8 recipe sections
- Add docs/DEVELOPER.md with architecture, data flow, and extension guides
- Rewrite src/core/__init__.py with public API exports and module docstring
- Add Streamlit GUI (src/gui/) with file upload, advanced options,
  interactive match group review with side-by-side diff, and download buttons
- Add .gitignore, requirements.txt, all source code, tests, and sample data
- Add streamlit to requirements.txt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 23:06:39 +00:00
parent 0613dc420c
commit b871ab24fc
47 changed files with 4413 additions and 2 deletions
--- a/docs/DEVELOPER.md
+++ b/docs/DEVELOPER.md
@@ -0,0 +1,282 @@
+# Developer Guide
+
+Architecture, data flow, and extension guide for the DataTools Deduplicator.
+
+## Architecture
+
+```
+CLI (src/cli.py)                 GUI (src/gui/app.py)
+     │                                │
+     │  flags → strategies            │  widgets → strategies
+     │  _interactive_review()         │  match_group_card()
+     │  tqdm progress bar             │  st.progress()
+     │                                │
+     └──────────┐    ┌────────────────┘
+                │    │
+                ▼    ▼
+          ┌─────────────────┐
+          │   core.dedup     │
+          │  deduplicate()   │
+          └────────┬────────┘
+                   │
+      ┌────────────┼────────────┐
+      ▼            ▼            ▼
+ core.io      core.normalizers  core.config
+ read/write   normalize_*()     save/load JSON
+```
+
+**Key principle:** All business logic lives in `src/core/`. The CLI and GUI are thin wrappers that translate user input into `deduplicate()` arguments and display the `DeduplicationResult`.
+
+## File-by-File Reference
+
+### src/core/dedup.py — Deduplication Engine
+
+The central module. Contains:
+
+- **Enums:** `Algorithm` (4 fuzzy algorithms), `SurvivorRule` (4 selection rules)
+- **Data classes:** `ColumnMatchStrategy`, `MatchStrategy`, `MatchResult`, `DeduplicationResult`
+- **`deduplicate()`** — main entry point. Takes a DataFrame + optional strategies/rules, returns a `DeduplicationResult` with deduplicated DataFrame, removed rows, match groups, and log entries.
+- **`build_default_strategies()`** — scans column names with regex patterns to auto-detect email, phone, name, and address columns. Builds strong/weak key strategies with appropriate algorithms and normalizers.
+- **`_UnionFind`** — disjoint-set data structure for transitive closure. If A matches B and B matches C, all three end up in one group.
+- **`_find_match_groups()`** — O(n^2) pairwise comparison. For each pair, tries all strategies (OR semantics). Feeds matches into union-find. Returns match groups with confidence scores.
+- **`_select_survivor()`** — picks the row to keep based on the survivor rule.
+- **`_merge_group()`** — fills blank fields in the survivor from loser rows.
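
The transitive-closure behaviour described for `_UnionFind` can be illustrated with a minimal disjoint-set sketch. This is not the project's implementation: the class name `UnionFind` and the methods `find`, `union`, and `groups` are assumptions chosen for illustration, and the real class may differ (for example, by using union by rank).

```python
from collections import defaultdict


class UnionFind:
    """Illustrative disjoint-set (hypothetical; not the project's _UnionFind)."""

    def __init__(self, size: int) -> None:
        self.parent = list(range(size))

    def find(self, x: int) -> int:
        # Walk to the root, halving the path as we go.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def groups(self) -> list[list[int]]:
        # Collect members by root; keep only groups with more than one row.
        by_root: dict[int, list[int]] = defaultdict(list)
        for i in range(len(self.parent)):
            by_root[self.find(i)].append(i)
        return [members for members in by_root.values() if len(members) > 1]


uf = UnionFind(5)
uf.union(0, 1)  # row 0 matches row 1
uf.union(1, 2)  # row 1 matches row 2, so rows 0, 1, 2 form one group
match_groups = uf.groups()
```

Rows 3 and 4 never matched anything, so they stay out of `match_groups`. Rows 0 and 2 share a group even though they were never compared directly.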
+
+### src/core/normalizers.py — Text Normalization
+
+Five normalizer functions, each `Optional[str] → str`, idempotent, None-safe:
+
+- **`normalize_email()`** — lowercase, strip Gmail dots, strip `+tag` suffixes
+- **`normalize_phone()`** — parse with `phonenumbers` to E.164; fallback to digits-only
+- **`normalize_name()`** — strip title prefixes (Dr., Mr.) and suffixes (Jr., PhD), case-fold
+- **`normalize_address()`** — USPS abbreviations (Street→St, Avenue→Ave), case-fold
+- **`normalize_string()`** — trim, collapse whitespace, case-fold
+
+The `get_normalizer()` registry function maps `NormalizerType` enum values to functions.
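
As a rough illustration of the three steps listed for `normalize_email()`, here is a sketch under stated assumptions. The function name `normalize_email_sketch` is hypothetical, and treating only `gmail.com` as dot-insensitive is an assumption; the real function may handle more domains or edge cases.

```python
from typing import Optional


def normalize_email_sketch(value: Optional[str]) -> str:
    """Hypothetical re-creation of the steps listed for normalize_email()."""
    if not value or not isinstance(value, str):
        return ""  # None-safe: non-strings and blanks normalize to ""
    email = value.strip().lower()
    local, sep, domain = email.partition("@")
    if not sep:
        return email  # not an address; leave it lowercased
    local = local.split("+", 1)[0]  # drop the "+tag" suffix
    if domain == "gmail.com":  # assumption: only gmail.com ignores dots
        local = local.replace(".", "")
    return f"{local}@{domain}"


once = normalize_email_sketch("  John.Doe+news@Gmail.com ")
twice = normalize_email_sketch(once)  # idempotent: a second pass changes nothing
```

Idempotence matters because normalized values may be fed back through the same function, for instance when a previously cleaned file is deduplicated again.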
+
+### src/core/io.py — File I/O
+
+Auto-detection stack:
+
+1. **`detect_encoding()`** — checks BOM, then uses `charset-normalizer` heuristics
+2. **`detect_delimiter()`** — uses `csv.Sniffer` on first 20 lines
+3. **`detect_header_row()`** — finds first row where all cells look like column names
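
The first two detection steps can be sketched with the standard library alone. The helper names `sniff_bom` and `sniff_delimiter`, the candidate delimiter set, and the comma fallback are assumptions for illustration; the `charset-normalizer` fallback used by `detect_encoding()` is not reproduced here.

```python
import codecs
import csv
from typing import Optional

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw: bytes) -> Optional[str]:
    """Return a codec name if raw starts with a known BOM, else None."""
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            return codec
    return None  # the real detect_encoding() would fall back to heuristics here


def sniff_delimiter(text: str, max_lines: int = 20) -> str:
    """Guess the delimiter from the first max_lines lines."""
    sample = "\n".join(text.splitlines()[:max_lines])
    try:
        return csv.Sniffer().sniff(sample, delimiters=",;\t|").delimiter
    except csv.Error:
        return ","  # assumption: fall back to a comma when sniffing fails


raw = codecs.BOM_UTF8 + "name;email\nAnn;ann@example.com\nBob;bob@example.com\n".encode("utf-8")
encoding = sniff_bom(raw)
delimiter = sniff_delimiter(raw.decode(encoding or "utf-8"))
```

Decoding with `utf-8-sig` strips the BOM, so the first column name does not pick up an invisible prefix.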
+
+Main functions:
+- **`read_file()`** — reads CSV/TSV/Excel with full auto-detection. Returns a DataFrame.
+- **`write_file()`** — writes DataFrame to CSV or Excel. Uses `utf-8-sig` by default for Windows Excel compatibility.
+- **`list_sheets()`** — returns sheet names from an Excel workbook.
+
+### src/core/config.py — Configuration Profiles
+
+Save/load deduplication settings as JSON:
+
+- **`DeduplicationConfig`** — flat dataclass with all settings: strategies, survivor rule, merge flag, algorithm, threshold, normalizer map.
+- **`.to_file()` / `.from_file()`** — JSON serialization
+- **`.to_strategies()`** — converts config back to `MatchStrategy` objects for the engine
+- **`.to_survivor_rule()`** — converts string to `SurvivorRule` enum
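
The JSON round-trip can be sketched with a small dataclass. `ConfigSketch` is a hypothetical stand-in for `DeduplicationConfig`: its field names and default values are assumptions, and only the `to_file()` / `from_file()` pattern is meant to carry over.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for DeduplicationConfig; field names are assumed."""

    survivor_rule: str = "first"
    merge: bool = False
    algorithm: str = "token_set_ratio"
    threshold: float = 85.0
    normalizers: dict[str, str] = field(default_factory=dict)

    def to_file(self, path: Path) -> None:
        # A flat dataclass maps directly onto a JSON object.
        path.write_text(json.dumps(asdict(self), indent=2), encoding="utf-8")

    @classmethod
    def from_file(cls, path: Path) -> "ConfigSketch":
        return cls(**json.loads(path.read_text(encoding="utf-8")))


with tempfile.TemporaryDirectory() as tmp:
    profile = Path(tmp) / "profile.json"
    original = ConfigSketch(merge=True, normalizers={"Email": "email"})
    original.to_file(profile)
    restored = ConfigSketch.from_file(profile)
```

Storing enum-like settings as plain strings keeps the JSON readable; converting them back to enums is then a separate step, which is the role the guide assigns to `.to_survivor_rule()`.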
+
+### src/cli.py — Command-Line Interface
+
+Typer-based CLI with 17 options. Key responsibilities:
+
+- Parse flags into strategies, survivor rule, and other config
+- Set up logging (timestamped log files in `logs/`)
+- Column name validation with fuzzy suggestions on typos
+- `_interactive_review()` — side-by-side row display with y/n/s prompts
+- Progress bar via `tqdm` for files with more than 10,000 rows
+- Output formatting and file writing
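
Column name validation with typo suggestions could look roughly like the following. This is a sketch using the standard library's `difflib`; the function name `suggest_columns`, the cutoff of 0.6, and the case-insensitive comparison are assumptions, and the CLI may use a different matching method.

```python
import difflib


def suggest_columns(requested: list[str], available: list[str]) -> dict[str, list[str]]:
    """Map each unknown column name to close matches among the real columns."""
    lowered = {name.lower(): name for name in available}
    suggestions: dict[str, list[str]] = {}
    for name in requested:
        if name in available:
            continue  # exact match: nothing to report
        close = difflib.get_close_matches(name.lower(), list(lowered), n=3, cutoff=0.6)
        suggestions[name] = [lowered[match] for match in close]
    return suggestions


unknown = suggest_columns(["emial", "Phone"], ["Email", "Phone", "Full Name"])
```

Here `Phone` is valid and is skipped, while the typo `emial` is mapped to the real column `Email`.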
+
+### src/gui/app.py — Streamlit GUI
+
+Single-page layout:
+- File upload with instant preview
+- Advanced options expander (column selection, fuzzy, normalizers, survivor rule, merge, config profiles)
+- Find Duplicates button → runs `deduplicate()` with `progress_callback`
+- Interactive review: expandable match group cards with merge/keep/skip buttons
+- Download buttons for deduplicated CSV, removed rows, and match groups report
+
+### src/gui/components.py — Reusable GUI Widgets
+
+- **`match_group_card()`** — expandable card showing side-by-side row comparison with diff highlighting
+- **`config_panel()`** — the advanced options expander, returns a `DeduplicationConfig`
+- **`results_summary()`** — summary stats and download buttons
+
+## Data Flow
+
+```
+Input File
+    │
+    ▼
+read_file()          ← auto-detect encoding, delimiter, header
+    │
+    ▼
+DataFrame
+    │
+    ▼
+build_default_strategies()   ← (if no explicit strategies)
+    │                            scan column names → regex patterns
+    │                            strong keys: email, phone (standalone OR)
+    │                            weak keys: name, address (AND with strong)
+    ▼
+_apply_normalizations()      ← add _norm_* shadow columns
+    │                            normalize_email(), normalize_phone(), etc.
+    ▼
+_find_match_groups()         ← O(n²) pairwise comparison
+    │                            for each pair: try all strategies (OR)
+    │                            _compute_similarity() per column
+    │                            union-find for transitive closure
+    ▼
+[review_callback()]          ← optional: interactive review per group
+    │                            True=accept, False=reject, None=skip
+    ▼
+_select_survivor()           ← per group: first/last/most-complete/most-recent
+    │
+    ▼
+[_merge_group()]             ← optional: fill blanks from losers
+    │
+    ▼
+DeduplicationResult
+    ├── deduplicated_df      ← cleaned DataFrame (shadow cols dropped)
+    ├── removed_df           ← rows that were removed
+    ├── match_groups         ← list of MatchResult with confidence, columns
+    └── log_entries          ← human-readable audit log
+```
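
The `_select_survivor()` and `_merge_group()` steps in the flow above can be illustrated on plain dictionaries. This sketch assumes the most-complete rule counts non-blank fields and that merging fills blanks from losers in row order; the engine itself works on DataFrame rows, and the names `filled_count`, `select_most_complete`, and `merge_group` are hypothetical.

```python
from typing import Optional

Row = dict[str, Optional[str]]


def filled_count(row: Row) -> int:
    """Number of non-blank fields in a row."""
    return sum(1 for value in row.values() if value not in (None, ""))


def select_most_complete(rows: list[Row]) -> int:
    """Index of the row with the most filled fields; ties go to the earliest row."""
    return max(range(len(rows)), key=lambda i: filled_count(rows[i]))


def merge_group(rows: list[Row], survivor: int) -> Row:
    """Copy the survivor, filling its blank fields from the other rows in order."""
    merged = dict(rows[survivor])
    for i, row in enumerate(rows):
        if i == survivor:
            continue
        for column, value in row.items():
            if merged.get(column) in (None, "") and value not in (None, ""):
                merged[column] = value
    return merged


group = [
    {"name": "Ann Lee", "email": "ann@example.com", "phone": "", "city": "Austin"},
    {"name": "Ann Lee", "email": "", "phone": "+15550100", "city": ""},
]
survivor = select_most_complete(group)
merged = merge_group(group, survivor)
```

The survivor keeps every value it already has; merging only fills gaps, so no existing field is overwritten.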
+
+## How to Add a Normalizer
+
+1. **Add the function** in `src/core/normalizers.py`:
+
+```python
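+# Needed at the top of normalizers.py if not already imported.
+import re
+from typing import Optional
+
+# A trailing legal suffix, with optional leading comma and trailing period.
+_COMPANY_SUFFIX_RE = re.compile(r"[\s,]+(?:inc|llc|corp|ltd|co)\.?$")
+
+def normalize_company(value: Optional[str]) -> str:
+    """Strip legal suffixes (Inc, LLC, Corp), case-fold."""
+    if not value or not isinstance(value, str):
+        return ""
+    name = value.strip().casefold()
+    # Strip repeatedly so stacked suffixes ("Acme Co., Inc.") are all removed.
+    while (stripped := _COMPANY_SUFFIX_RE.sub("", name)) != name:
+        name = stripped
+    return name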
+```
+
+2. **Register it** in the same file:
+
+```python
+class NormalizerType(str, Enum):
+    # ... existing types ...
+    COMPANY = "company"      # ← add enum value
+
+_NORMALIZER_MAP: dict[NormalizerType, Callable[[str], str]] = {
+    # ... existing entries ...
+    NormalizerType.COMPANY: normalize_company,   # ← add mapping
+}
+```
+
+3. **Add auto-detection pattern** in `src/core/dedup.py` (optional):
+
+```python
+_COLUMN_TYPE_PATTERNS = [
+    # ... existing patterns ...
+    (re.compile(r"company|organization|org_name", re.I),
+     NormalizerType.COMPANY, Algorithm.TOKEN_SET_RATIO, 85.0, False),
+]
+```
+
+## How to Add a Matching Algorithm
+
+1. **Add the enum value** in `src/core/dedup.py`:
+
+```python
+class Algorithm(str, Enum):
+    # ... existing values ...
+    SOUNDEX = "soundex"
+```
+
+2. **Add the computation** in `_compute_similarity()`:
+
+```python
+def _compute_similarity(val_a: str, val_b: str, algorithm: Algorithm) -> float:
+    # ... existing cases ...
+    if algorithm == Algorithm.SOUNDEX:
+        return 100.0 if _soundex(val_a) == _soundex(val_b) else 0.0
+```
+
+3. **Add the CLI flag value** in `src/cli.py` help text for `--algorithm`.
+
+## How to Add a Survivor Strategy
+
+1. **Add the enum value** in `src/core/dedup.py`:
+
+```python
+class SurvivorRule(str, Enum):
+    # ... existing values ...
+    KEEP_LONGEST = "longest"
+```
+
+2. **Add the logic** in `_select_survivor()`:
+
+```python
+if rule == SurvivorRule.KEEP_LONGEST:
+    # Total characters across non-null cells; ties go to the first entry in indices.
+    return max(indices, key=lambda i: sum(len(str(v)) for v in df.iloc[i].dropna()))
+```
+
+3. **Add to the CLI** survivor map in `src/cli.py`.
+
+## Testing
+
+### Run Tests
+
+```bash
+# All tests
+pytest tests/ -q
+
+# Specific module
+pytest tests/test_dedup.py -q
+pytest tests/test_normalizers.py -q
+pytest tests/test_io.py -q
+pytest tests/test_config.py -q
+pytest tests/test_cli.py -q
+
+# Verbose with output
+pytest tests/ -v
+
+# Stop on first failure
+pytest tests/ -x
+```
+
+### Test Structure
+
+```
+tests/
+├── conftest.py          # Shared fixtures
+│   ├── sample_csv_path  # Path to samples/messy_sales.csv
+│   ├── sample_df        # Loaded sample CSV as DataFrame
+│   ├── simple_df        # Small 5-row DataFrame with obvious duplicates
+│   ├── merge_df         # DataFrame with partial records
+│   └── tmp_csv          # Temporary CSV from simple_df
+├── test_dedup.py        # Engine tests: similarity, union-find, pairs, integration
+├── test_normalizers.py  # Normalizer tests: all 5 types with edge cases
+├── test_io.py           # I/O tests: encoding, delimiter, header, read/write
+├── test_config.py       # Config tests: serialization round-trip
+└── test_cli.py          # CLI tests: argument parsing, file handling
+```
+
+### Writing Tests
+
+Follow existing patterns. Tests use pytest fixtures from `conftest.py`:
+
+```python
+def test_my_feature(simple_df):
+    """Test description."""
+    result = deduplicate(simple_df, ...)
+    assert len(result.match_groups) == expected
+    assert result.deduplicated_df.shape[0] == expected_rows
+```
+
+## Known Limitations
+
+- **O(n^2) pairwise comparison** — no blocking or indexing. Works well up to ~50,000 rows. Beyond that, performance degrades quadratically. Future optimization: add blocking (partition by first letter, zip code prefix, etc.) to reduce comparison space.
+- **No multi-sheet dedup** — each Excel sheet is processed independently. Cross-sheet deduplication is not supported.
+- **Phone normalization requires valid-length numbers** — the `phonenumbers` library rejects numbers that are too short or too long for the detected region. Fallback is digits-only, which may produce false negatives for international numbers without country codes.
+- **Single-threaded** — no parallel comparison. Could benefit from `multiprocessing` for large files.
+- **Memory-bound** — entire file is loaded into a pandas DataFrame. Files larger than available RAM will fail. Chunked reading exists but is not integrated with the dedup engine.
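
To make the blocking idea from the first limitation concrete: 50,000 rows produce 50,000 × 49,999 / 2, about 1.25 billion, candidate pairs. Blocking compares rows only inside partitions that share a cheap key. The sketch below is one way this could look; `blocked_pairs` and the first-letter key are hypothetical and not part of the current engine.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Iterator


def blocked_pairs(
    values: list[str], block_key: Callable[[str], str]
) -> Iterator[tuple[int, int]]:
    """Yield index pairs only for rows that share a blocking key."""
    blocks: dict[str, list[int]] = defaultdict(list)
    for index, value in enumerate(values):
        blocks[block_key(value)].append(index)
    for members in blocks.values():
        yield from combinations(members, 2)


names = ["ann lee", "anne lee", "bob ray", "bob rae", "cy poe"]
pairs = list(blocked_pairs(names, block_key=lambda name: name[:1]))
all_pairs = len(names) * (len(names) - 1) // 2  # pairs without blocking
```

The trade-off is recall: two duplicates whose blocking keys differ (for example, a typo in the first letter) are never compared, so the key should be chosen from the most reliable part of the value.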