# Developer Guide

Architecture, data flow, and extension guide for the DataTools Deduplicator.

## Architecture

```
CLI (src/cli.py)               GUI (src/gui/app.py)
    │                              │
    │  flags → strategies          │  widgets → strategies
    │  _interactive_review()       │  match_group_card()
    │  tqdm progress bar           │  st.progress()
    │                              │
    └──────────┐  ┌────────────────┘
               │  │
               ▼  ▼
       ┌─────────────────┐
       │   core.dedup    │
       │  deduplicate()  │
       └────────┬────────┘
                │
   ┌────────────┼────────────┐
   ▼            ▼            ▼
core.io  core.normalizers  core.config
read/write  normalize_*()  save/load JSON
```

**Key principle:** All business logic lives in `src/core/`. The CLI and GUI are thin wrappers that translate user input into `deduplicate()` arguments and display the `DeduplicationResult`.

## File-by-File Reference

### src/core/dedup.py: Deduplication Engine

The central module. Contains:

- **Enums:** `Algorithm` (4 fuzzy algorithms), `SurvivorRule` (4 selection rules)
- **Data classes:** `ColumnMatchStrategy`, `MatchStrategy`, `MatchResult`, `DeduplicationResult`
- **`deduplicate()`**: main entry point. Takes a DataFrame + optional strategies/rules, returns a `DeduplicationResult` with deduplicated DataFrame, removed rows, match groups, and log entries.
- **`build_default_strategies()`**: scans column names with regex patterns to auto-detect email, phone, name, and address columns. Builds strong/weak key strategies with appropriate algorithms and normalizers.
- **`_UnionFind`**: disjoint-set data structure for transitive closure. If A matches B and B matches C, all three end up in one group.
- **`_find_match_groups()`**: O(n^2) pairwise comparison. For each pair, tries all strategies (OR semantics). Feeds matches into union-find. Returns match groups with confidence scores.
- **`_select_survivor()`**: picks the row to keep based on the survivor rule.
- **`_merge_group()`**: fills blank fields in the survivor from loser rows.
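
The transitive-closure behaviour described for `_UnionFind` can be illustrated with a minimal disjoint-set sketch. This is not the project's implementation: the class name `UnionFind` and its `find()`, `union()`, and `groups()` methods are assumptions made for illustration, and rows are identified by plain integer positions.

```python
from collections import defaultdict


class UnionFind:
    """Minimal disjoint-set (hypothetical; not the project's _UnionFind)."""

    def __init__(self, size: int) -> None:
        # Every row starts out as the root of its own one-element set.
        self.parent = list(range(size))

    def find(self, item: int) -> int:
        # Walk up to the root, shortening the path along the way.
        while self.parent[item] != item:
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a: int, b: int) -> None:
        # Merge the sets containing a and b (no-op if already merged).
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def groups(self) -> list[list[int]]:
        # Collect rows by root and keep only sets with at least two rows.
        members: dict[int, list[int]] = defaultdict(list)
        for item in range(len(self.parent)):
            members[self.find(item)].append(item)
        return [rows for rows in members.values() if len(rows) > 1]


uf = UnionFind(5)
uf.union(0, 1)  # row 0 matches row 1
uf.union(1, 2)  # row 1 matches row 2
match_groups = uf.groups()
print(match_groups)  # [[0, 1, 2]]
```

Rows 0 and 2 were never compared directly, yet they land in the same group because both are linked to row 1. Rows 3 and 4 matched nothing and are left out.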

### src/core/normalizers.py: Text Normalization

Five normalizer functions, each `str → str`, idempotent, None-safe:

- **`normalize_email()`**: lowercase, strip Gmail dots, strip `+tag` suffixes
- **`normalize_phone()`**: parse with `phonenumbers` to E.164; fallback to digits-only
- **`normalize_name()`**: strip title prefixes (Dr., Mr.) and suffixes (Jr., PhD), case-fold
- **`normalize_address()`**: USPS abbreviations (Street→St, Avenue→Ave), case-fold
- **`normalize_string()`**: trim, collapse whitespace, case-fold

The `get_normalizer()` registry function maps `NormalizerType` enum values to functions.

### src/core/io.py: File I/O

Auto-detection stack:

1. **`detect_encoding()`**: checks BOM, then uses `charset-normalizer` heuristics
2. **`detect_delimiter()`**: uses `csv.Sniffer` on first 20 lines
3. **`detect_header_row()`**: finds first row where all cells look like column names

Main functions:

- **`read_file()`**: reads CSV/TSV/Excel with full auto-detection. Returns a DataFrame.
- **`write_file()`**: writes DataFrame to CSV or Excel. Uses `utf-8-sig` by default for Windows Excel compatibility.
- **`list_sheets()`**: returns sheet names from an Excel workbook.

### src/core/config.py: Configuration Profiles

Save/load deduplication settings as JSON:

- **`DeduplicationConfig`**: flat dataclass with all settings: strategies, survivor rule, merge flag, algorithm, threshold, normalizer map.
- **`.to_file()` / `.from_file()`**: JSON serialization
- **`.to_strategies()`**: converts config back to `MatchStrategy` objects for the engine
- **`.to_survivor_rule()`**: converts string to `SurvivorRule` enum

### src/cli.py: Command-Line Interface

Typer-based CLI with 17 options.
Key responsibilities:

- Parse flags into strategies, survivor rule, and other config
- Set up logging (timestamped log files in `logs/`)
- Column name validation with fuzzy suggestions on typos
- `_interactive_review()`: side-by-side row display with y/n/s prompts
- Progress bar via `tqdm` for files > 10,000 rows
- Output formatting and file writing

### src/gui/app.py: Streamlit GUI

Single-page layout:

- File upload with instant preview and configurable delimiter (comma, tab, semicolon, pipe, or custom)
- Advanced options expander (column selection, fuzzy, normalizers, survivor rule, merge, config profiles)
- Find Duplicates button → runs `deduplicate()` with `progress_callback`
- Interactive review via `st.data_editor` with inline checkboxes and column dropdowns
- Batch actions: Accept All, Reject All, Clear Decisions
- Apply review decisions and download cleaned results
- Download buttons for deduplicated CSV, removed rows, and match groups report

### src/gui/components.py: Reusable GUI Widgets

- **`match_group_card()`**: expandable card with `st.data_editor`: inline Keep checkboxes per row, `SelectboxColumn` dropdowns for differing columns, and a live surviving rows preview
- **`config_panel()`**: the advanced options expander, returns settings dict with strategies, survivor rule, merge flag
- **`results_summary()`**: summary metrics and download buttons
- **`apply_review_decisions()`**: builds final DataFrames from user review decisions (merge, split, or keep-all per group) with column override support

## Data Flow

```
Input File
    │
    ▼
read_file()                  ← auto-detect encoding, delimiter, header
    │
    ▼
DataFrame
    │
    ▼
build_default_strategies()   ← (if no explicit strategies)
    │   scan column names → regex patterns
    │   strong keys: email, phone (standalone OR)
    │   weak keys: name, address (AND with strong)
    ▼
_apply_normalizations()      ← add _norm_* shadow columns
    │   normalize_email(), normalize_phone(), etc.
    ▼
_find_match_groups()         ← O(n²) pairwise comparison
    │   for each pair: try all strategies (OR)
    │   _compute_similarity() per column
    │   union-find for transitive closure
    ▼
[review_callback()]          ← optional: interactive review per group
    │   True=accept, False=reject, None=skip
    ▼
_select_survivor()           ← per group: first/last/most-complete/most-recent
    │
    ▼
[_merge_group()]             ← optional: fill blanks from losers
    │
    ▼
DeduplicationResult
    ├── deduplicated_df      ← cleaned DataFrame (shadow cols dropped)
    ├── removed_df           ← rows that were removed
    ├── match_groups         ← list of MatchResult with confidence, columns
    └── log_entries          ← human-readable audit log
```

## How to Add a Normalizer

1. **Add the function** in `src/core/normalizers.py`:

```python
def normalize_company(value: Optional[str]) -> str:
    """Strip legal suffixes (Inc, LLC, Corp), case-fold."""
    if not value or not isinstance(value, str):
        return ""
    name = value.strip().casefold()
    # Strip common suffixes
    for suffix in ("inc", "llc", "corp", "ltd", "co"):
        name = re.sub(rf"\b{suffix}\.?\s*$", "", name).strip()
    return name
```

2. **Register it** in the same file:

```python
class NormalizerType(str, Enum):
    # ... existing types ...
    COMPANY = "company"  # ← add enum value

_NORMALIZER_MAP: dict[NormalizerType, Callable[[str], str]] = {
    # ... existing entries ...
    NormalizerType.COMPANY: normalize_company,  # ← add mapping
}
```

3. **Add auto-detection pattern** in `src/core/dedup.py` (optional):

```python
_COLUMN_TYPE_PATTERNS = [
    # ... existing patterns ...
    (re.compile(r"company|organization|org_name", re.I),
     NormalizerType.COMPANY, Algorithm.TOKEN_SET_RATIO, 85.0, False),
]
```

## How to Add a Matching Algorithm

1. **Add the enum value** in `src/core/dedup.py`:

```python
class Algorithm(str, Enum):
    # ... existing values ...
    SOUNDEX = "soundex"
```

2. **Add the computation** in `_compute_similarity()`:

```python
def _compute_similarity(val_a: str, val_b: str, algorithm: Algorithm) -> float:
    # ... existing cases ...
    # Note: this example assumes a _soundex() helper, which you must supply.
    if algorithm == Algorithm.SOUNDEX:
        return 100.0 if _soundex(val_a) == _soundex(val_b) else 0.0
```

3. **Add the CLI flag value** in `src/cli.py` help text for `--algorithm`.

## How to Add a Survivor Strategy

1. **Add the enum value** in `src/core/dedup.py`:

```python
class SurvivorRule(str, Enum):
    # ... existing values ...
    KEEP_LONGEST = "longest"
```

2. **Add the logic** in `_select_survivor()`:

```python
if rule == SurvivorRule.KEEP_LONGEST:
    return max(indices, key=lambda i: len(str(df.iloc[i].to_dict())))
```

3. **Add to the CLI** survivor map in `src/cli.py`.

## Testing

### Run Tests

```bash
# All tests
pytest tests/ -q

# Specific module
pytest tests/test_dedup.py -q
pytest tests/test_normalizers.py -q
pytest tests/test_io.py -q
pytest tests/test_config.py -q
pytest tests/test_cli.py -q

# Verbose with output
pytest tests/ -v

# Stop on first failure
pytest tests/ -x
```

### Test Structure

```
tests/
├── conftest.py              # Shared fixtures
│   ├── sample_csv_path      # Path to samples/messy_sales.csv
│   ├── sample_df            # Loaded sample CSV as DataFrame
│   ├── simple_df            # Small 5-row DataFrame with obvious duplicates
│   ├── merge_df             # DataFrame with partial records
│   └── tmp_csv              # Temporary CSV from simple_df
├── test_dedup.py            # Engine tests: similarity, union-find, pairs, integration
├── test_normalizers.py      # Normalizer tests: all 5 types with edge cases
├── test_io.py               # I/O tests: encoding, delimiter, header, read/write
├── test_config.py           # Config tests: serialization round-trip
└── test_cli.py              # CLI tests: argument parsing, file handling
```

### Writing Tests

Follow existing patterns. Tests use pytest fixtures from `conftest.py`:

```python
def test_my_feature(simple_df):
    """Test description."""
    result = deduplicate(simple_df, ...)
    assert len(result.match_groups) == expected
    assert result.deduplicated_df.shape[0] == expected_rows
```

## Known Limitations

- **O(n^2) pairwise comparison**: no blocking or indexing. Works well up to ~50,000 rows.
Beyond that, run time keeps growing quadratically with the row count, so large files become impractical. Future optimization: add blocking (partition by first letter, zip code prefix, etc.) to reduce the comparison space.
- **No multi-sheet dedup**: each Excel sheet is processed independently. Cross-sheet deduplication is not supported.
- **Phone normalization requires valid-length numbers**: the `phonenumbers` library rejects numbers that are too short or too long for the detected region. Fallback is digits-only, which may produce false negatives for international numbers without country codes.
- **Single-threaded**: no parallel comparison. Could benefit from `multiprocessing` for large files.
- **Memory-bound**: the entire file is loaded into a pandas DataFrame. Files larger than available RAM will fail. Chunked reading exists but is not integrated with the dedup engine.
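
The blocking optimization suggested under the first limitation could look roughly like the sketch below. It is not part of the codebase: `candidate_pairs()` and the zip-prefix key are hypothetical, and rows are plain dicts rather than DataFrame rows so the example has no dependencies.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Iterator

Row = dict[str, str]


def candidate_pairs(
    rows: list[Row], block_key: Callable[[Row], str]
) -> Iterator[tuple[int, int]]:
    """Yield only the row pairs that share a blocking key (hypothetical helper)."""
    blocks: dict[str, list[int]] = defaultdict(list)
    for position, row in enumerate(rows):
        blocks[block_key(row)].append(position)
    for positions in blocks.values():
        # Pairs are generated inside each block, never across blocks.
        yield from combinations(positions, 2)


rows = [
    {"name": "Alice Smith", "zip": "94107"},
    {"name": "alice smith", "zip": "94107"},
    {"name": "Bob Jones", "zip": "10001"},
    {"name": "Alicia Smyth", "zip": "94110"},
]

# Block on the first three digits of the zip code.
pairs = list(candidate_pairs(rows, lambda row: row["zip"][:3]))
print(pairs)  # [(0, 1), (0, 3), (1, 3)]
```

Here only 3 of the 6 possible pairs are compared. The trade-off is recall: true duplicates that fall into different blocks (for example, because of a mistyped zip code) are never compared, so the blocking key has to be chosen with the data in mind.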