Developer Guide

Architecture, data flow, and extension guide for the DataTools Deduplicator.

Architecture

CLI (src/cli.py)                 GUI (src/gui/app.py)
     │                                │
     │  flags → strategies            │  widgets → strategies
     │  _interactive_review()         │  match_group_card()
     │  tqdm progress bar             │  st.progress()
     │                                │
     └──────────┐    ┌────────────────┘
                │    │
                ▼    ▼
          ┌─────────────────┐
          │   core.dedup     │
          │  deduplicate()   │
          └────────┬────────┘
                   │
      ┌────────────┼────────────┐
      ▼            ▼            ▼
 core.io      core.normalizers  core.config
 read/write   normalize_*()     save/load JSON

Key principle: All business logic lives in src/core/. The CLI and GUI are thin wrappers that translate user input into deduplicate() arguments and display the DeduplicationResult.

File-by-File Reference

src/core/dedup.py — Deduplication Engine

The central module. Contains:

  • Enums: Algorithm (4 fuzzy algorithms), SurvivorRule (4 selection rules)
  • Data classes: ColumnMatchStrategy, MatchStrategy, MatchResult, DeduplicationResult
  • deduplicate() — main entry point. Takes a DataFrame + optional strategies/rules, returns a DeduplicationResult with deduplicated DataFrame, removed rows, match groups, and log entries.
  • build_default_strategies() — scans column names with regex patterns to auto-detect email, phone, name, and address columns. Builds strong/weak key strategies with appropriate algorithms and normalizers.
  • _UnionFind — disjoint-set data structure for transitive closure. If A matches B and B matches C, all three end up in one group.
  • _find_match_groups() — O(n^2) pairwise comparison. For each pair, tries all strategies (OR semantics). Feeds matches into union-find. Returns match groups with confidence scores.
  • _select_survivor() — picks the row to keep based on the survivor rule.
  • _merge_group() — fills blank fields in the survivor from loser rows.
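
The transitive-closure behaviour described for _UnionFind can be illustrated with a minimal sketch. The names UnionFind and group_matches are hypothetical and are not taken from src/core/dedup.py; the similarity step is reduced to a list of row-index pairs that are assumed to have already matched.

```python
class UnionFind:
    """Disjoint-set over row indices 0..n-1 (illustrative, not the project's class)."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        # Path halving: point each visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            # Keep the smaller index as root so grouping is deterministic.
            self.parent[max(root_a, root_b)] = min(root_a, root_b)

    def groups(self) -> list[list[int]]:
        """Return only groups with two or more members."""
        by_root: dict[int, list[int]] = {}
        for i in range(len(self.parent)):
            by_root.setdefault(self.find(i), []).append(i)
        return [members for members in by_root.values() if len(members) > 1]


def group_matches(n_rows: int, matched_pairs: list[tuple[int, int]]) -> list[list[int]]:
    """Feed matched pairs into union-find and return the resulting groups."""
    uf = UnionFind(n_rows)
    for a, b in matched_pairs:
        uf.union(a, b)
    return uf.groups()


# Rows 0-1 and 1-2 matched directly; rows 0 and 2 end up grouped transitively.
print(group_matches(5, [(0, 1), (1, 2)]))  # [[0, 1, 2]]
```

One consequence worth keeping in mind: a single weak match can chain two otherwise unrelated groups together, which is one reason a review step before survivor selection is useful.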

src/core/normalizers.py — Text Normalization

Five normalizer functions, each str → str, idempotent, None-safe:

  • normalize_email() — lowercase, strip Gmail dots, strip +tag suffixes
  • normalize_phone() — parse with phonenumbers to E.164; fallback to digits-only
  • normalize_name() — strip title prefixes (Dr., Mr.) and suffixes (Jr., PhD), case-fold
  • normalize_address() — USPS abbreviations (Street→St, Avenue→Ave), case-fold
  • normalize_string() — trim, collapse whitespace, case-fold

The get_normalizer() registry function maps NormalizerType enum values to functions.
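
As an illustration of the normalizer contract (None-safe, idempotent, always returns a string), here is a rough sketch of an email normalizer. It is not the project's normalize_email(): the function name, the set of dot-insensitive domains, and the choice to strip +tag suffixes for every domain are all assumptions.

```python
from typing import Optional

# Assumption: only these domains ignore dots in the local part.
_DOT_INSENSITIVE_DOMAINS = {"gmail.com", "googlemail.com"}


def normalize_email_sketch(value: Optional[str]) -> str:
    """Lowercase, drop +tag suffixes, and drop dots in Gmail local parts."""
    if not value or not isinstance(value, str):
        return ""
    email = value.strip().lower()
    local, sep, domain = email.rpartition("@")
    if not sep:
        # Not an address; return the trimmed, lowercased text unchanged.
        return email
    local = local.split("+", 1)[0]  # assumption: applied to every domain
    if domain in _DOT_INSENSITIVE_DOMAINS:
        local = local.replace(".", "")
    return f"{local}@{domain}"


once = normalize_email_sketch("John.Doe+news@Gmail.com")
print(once)                                   # johndoe@gmail.com
print(normalize_email_sketch(once) == once)   # True (idempotent)
```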

src/core/io.py — File I/O

Auto-detection stack:

  1. detect_encoding() — checks BOM, then uses charset-normalizer heuristics
  2. detect_delimiter() — uses csv.Sniffer on first 20 lines
  3. detect_header_row() — finds first row where all cells look like column names
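
The first two detection steps can be sketched with the standard library alone. The helper names sniff_bom and sniff_delimiter are hypothetical, the candidate delimiter set is an assumption, and the charset-normalizer fallback used by detect_encoding() is deliberately left out.

```python
import codecs
import csv
from typing import Optional

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw: bytes) -> Optional[str]:
    """Return a codec name if the bytes start with a known BOM, else None."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return encoding
    return None  # the real detect_encoding() would fall back to heuristics here


def sniff_delimiter(text: str, max_lines: int = 20, default: str = ",") -> str:
    """Guess the delimiter from the first few lines using csv.Sniffer."""
    sample = "\n".join(text.splitlines()[:max_lines])
    try:
        return csv.Sniffer().sniff(sample, delimiters=",;\t|").delimiter
    except csv.Error:
        # Sniffer could not decide (for example, a single-column file).
        return default
```

Restricting the candidate delimiters keeps csv.Sniffer from picking a letter or digit that happens to occur the same number of times on every line.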

Main functions:

  • read_file() — reads CSV/TSV/Excel with full auto-detection. Returns a DataFrame.
  • write_file() — writes DataFrame to CSV or Excel. Uses utf-8-sig by default for Windows Excel compatibility.
  • list_sheets() — returns sheet names from an Excel workbook.

src/core/config.py — Configuration Profiles

Save/load deduplication settings as JSON:

  • DeduplicationConfig — flat dataclass with all settings: strategies, survivor rule, merge flag, algorithm, threshold, normalizer map.
  • .to_file() / .from_file() — JSON serialization
  • .to_strategies() — converts config back to MatchStrategy objects for the engine
  • .to_survivor_rule() — converts string to SurvivorRule enum
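
The JSON round trip can be pictured with a cut-down stand-in. ConfigSketch and SurvivorRuleSketch below are hypothetical: the real DeduplicationConfig has more fields, and the enum values shown are assumed from the rule names used elsewhere in this guide.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from enum import Enum
from pathlib import Path


class SurvivorRuleSketch(str, Enum):
    # Assumed values; check SurvivorRule in src/core/dedup.py for the real ones.
    FIRST = "first"
    MOST_COMPLETE = "most-complete"


@dataclass
class ConfigSketch:
    survivor_rule: str = "first"
    merge: bool = False
    threshold: float = 85.0
    normalizers: dict[str, str] = field(default_factory=dict)

    def to_file(self, path: Path) -> None:
        # A flat dataclass serializes directly through asdict().
        path.write_text(json.dumps(asdict(self), indent=2), encoding="utf-8")

    @classmethod
    def from_file(cls, path: Path) -> "ConfigSketch":
        return cls(**json.loads(path.read_text(encoding="utf-8")))

    def to_survivor_rule(self) -> SurvivorRuleSketch:
        # Raises ValueError for an unknown rule string.
        return SurvivorRuleSketch(self.survivor_rule)


def round_trip(config: ConfigSketch) -> ConfigSketch:
    """Write the config to a temporary file and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "profile.json"
        config.to_file(path)
        return ConfigSketch.from_file(path)
```

Storing the survivor rule as a plain string keeps the JSON readable and defers validation to to_survivor_rule(), which appears to be the split the bullet list above describes.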

src/cli.py — Command-Line Interface

Typer-based CLI with 17 options. Key responsibilities:

  • Parse flags into strategies, survivor rule, and other config
  • Set up logging (timestamped log files in logs/)
  • Column name validation with fuzzy suggestions on typos
  • _interactive_review() — side-by-side row display with y/n/s prompts
  • Progress bar via tqdm for files > 10,000 rows
  • Output formatting and file writing
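
Column-name suggestions on typos can be built on difflib from the standard library. This is only a sketch of the idea: suggest_columns is a hypothetical helper, and the cutoff and case-insensitive matching are assumptions rather than the CLI's actual behaviour.

```python
import difflib


def suggest_columns(requested: str, available: list[str], cutoff: float = 0.6) -> list[str]:
    """Return up to three column names that closely resemble the requested one."""
    # Compare case-insensitively, but report the original spelling.
    lowered = {name.lower(): name for name in available}
    hits = difflib.get_close_matches(requested.lower(), list(lowered), n=3, cutoff=cutoff)
    return [lowered[hit] for hit in hits]


print(suggest_columns("emial", ["Email", "Phone", "Name"]))  # ['Email']
```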

src/gui/app.py — Streamlit GUI

Single-page layout:

  • File upload with instant preview
  • Advanced options expander (column selection, fuzzy, normalizers, survivor rule, merge, config profiles)
  • Find Duplicates button → runs deduplicate() with progress_callback
  • Interactive review: expandable match group cards with merge/keep/skip buttons
  • Download buttons for deduplicated CSV, removed rows, and match groups report

src/gui/components.py — Reusable GUI Widgets

  • match_group_card() — expandable card showing side-by-side row comparison with diff highlighting
  • config_panel() — the advanced options expander, returns a DeduplicationConfig
  • results_summary() — summary stats and download buttons

Data Flow

Input File
    │
    ▼
read_file()          ← auto-detect encoding, delimiter, header
    │
    ▼
DataFrame
    │
    ▼
build_default_strategies()   ← (if no explicit strategies)
    │                            scan column names → regex patterns
    │                            strong keys: email, phone (standalone OR)
    │                            weak keys: name, address (AND with strong)
    ▼
_apply_normalizations()      ← add _norm_* shadow columns
    │                            normalize_email(), normalize_phone(), etc.
    ▼
_find_match_groups()         ← O(n²) pairwise comparison
    │                            for each pair: try all strategies (OR)
    │                            _compute_similarity() per column
    │                            union-find for transitive closure
    ▼
[review_callback()]          ← optional: interactive review per group
    │                            True=accept, False=reject, None=skip
    ▼
_select_survivor()           ← per group: first/last/most-complete/most-recent
    │
    ▼
[_merge_group()]             ← optional: fill blanks from losers
    │
    ▼
DeduplicationResult
    ├── deduplicated_df      ← cleaned DataFrame (shadow cols dropped)
    ├── removed_df           ← rows that were removed
    ├── match_groups         ← list of MatchResult with confidence, columns
    └── log_entries          ← human-readable audit log
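
The last two stages of the flow, survivor selection and merge, can be sketched on plain dictionaries. This is not the engine's code: the function names are hypothetical, only the most-complete rule is shown, and blank is taken to mean None or whitespace (a pandas-based version would also need to treat NaN as blank).

```python
from typing import Any

Row = dict[str, Any]


def is_blank(value: Any) -> bool:
    """Treat None and empty or whitespace-only strings as blank."""
    return value is None or (isinstance(value, str) and not value.strip())


def select_most_complete(rows: list[Row]) -> int:
    """Index of the row with the most non-blank fields; ties go to the earliest row."""
    counts = [sum(not is_blank(v) for v in row.values()) for row in rows]
    return counts.index(max(counts))


def merge_group(rows: list[Row], survivor_index: int) -> Row:
    """Copy the survivor and fill its blank fields from the other rows, in order."""
    merged = dict(rows[survivor_index])
    for i, row in enumerate(rows):
        if i == survivor_index:
            continue
        for column, value in row.items():
            if is_blank(merged.get(column)) and not is_blank(value):
                merged[column] = value
    return merged


group = [
    {"name": "Ann Lee", "email": "", "phone": "555-0100"},
    {"name": "", "email": "ann@example.com", "phone": ""},
]
survivor = select_most_complete(group)
print(survivor)                      # 0
print(merge_group(group, survivor))  # name and phone from row 0, email from row 1
```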

How to Add a Normalizer

  1. Add the function in src/core/normalizers.py:
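# Needs `import re` and `from typing import Optional` at the top of the module.
def normalize_company(value: Optional[str]) -> str:
    """Strip trailing legal suffixes (Inc, LLC, Corp, Ltd, Co), case-fold."""
    if not value or not isinstance(value, str):
        return ""
    name = value.strip().casefold()
    # Strip every trailing suffix in one pass so the function stays idempotent:
    # "Acme Inc Co" becomes "acme"; a one-suffix-per-pass loop would stop at "acme inc".
    name = re.sub(r"(?:[\s,]+(?:inc|llc|corp|ltd|co)\b\.?)+$", "", name)
    return name.strip()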
  2. Register it in the same file:
class NormalizerType(str, Enum):
    # ... existing types ...
    COMPANY = "company"      # ← add enum value

_NORMALIZER_MAP: dict[NormalizerType, Callable[[str], str]] = {
    # ... existing entries ...
    NormalizerType.COMPANY: normalize_company,   # ← add mapping
}
  3. Add auto-detection pattern in src/core/dedup.py (optional):
_COLUMN_TYPE_PATTERNS = [
    # ... existing patterns ...
    (re.compile(r"company|organization|org_name", re.I),
     NormalizerType.COMPANY, Algorithm.TOKEN_SET_RATIO, 85.0, False),
]
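
To show how a pattern list like this can drive auto-detection, here is a simplified sketch. It pairs each regex with a type label only; the real _COLUMN_TYPE_PATTERNS entries carry more fields, and the email and phone patterns below are guesses, not the project's.

```python
import re
from typing import Optional

# Simplified stand-in: (pattern, type label). The email and phone regexes are assumptions.
_PATTERNS_SKETCH = [
    (re.compile(r"e[-_ ]?mail", re.I), "email"),
    (re.compile(r"phone|mobile", re.I), "phone"),
    (re.compile(r"company|organization|org_name", re.I), "company"),
]


def detect_column_type(column_name: str) -> Optional[str]:
    """Return the label of the first pattern found in the column name, else None."""
    for pattern, label in _PATTERNS_SKETCH:
        if pattern.search(column_name):
            return label
    return None


print(detect_column_type("Company Name"))  # company
print(detect_column_type("Notes"))         # None
```

Because the first matching pattern wins, the position of a new entry in the list matters when a column name could match more than one pattern.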

How to Add a Matching Algorithm

  1. Add the enum value in src/core/dedup.py:
class Algorithm(str, Enum):
    # ... existing values ...
    SOUNDEX = "soundex"
  2. Add the computation in _compute_similarity():
def _compute_similarity(val_a: str, val_b: str, algorithm: Algorithm) -> float:
    # ... existing cases ...
    if algorithm == Algorithm.SOUNDEX:
        return 100.0 if _soundex(val_a) == _soundex(val_b) else 0.0
  3. Add the CLI flag value in src/cli.py help text for --algorithm.
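
The snippet in step 2 calls a _soundex() helper that it does not define. A possible implementation of American Soundex is sketched below; soundex_sketch and soundex_similarity are hypothetical names, and returning 0.0 when either value has no letters is a design assumption made so that two blank cells do not count as a match.

```python
_SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex_sketch(value: str) -> str:
    """Return the 4-character American Soundex code, or "" if there are no letters."""
    letters = [ch for ch in value.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of identical codes
        digit = _SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # vowels give "", which ends the current run
    return (code + "000")[:4]


def soundex_similarity(val_a: str, val_b: str) -> float:
    """100.0 when both values share a non-empty Soundex code, else 0.0."""
    code_a, code_b = soundex_sketch(val_a), soundex_sketch(val_b)
    return 100.0 if code_a and code_a == code_b else 0.0


print(soundex_sketch("Robert"), soundex_sketch("Rupert"))  # R163 R163
print(soundex_similarity("Robert", "Rupert"))              # 100.0
```

Soundex is all-or-nothing, so any threshold between 0 and 100 behaves the same for this algorithm.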

How to Add a Survivor Strategy

  1. Add the enum value in src/core/dedup.py:
class SurvivorRule(str, Enum):
    # ... existing values ...
    KEEP_LONGEST = "longest"
  2. Add the logic in _select_survivor():
if rule == SurvivorRule.KEEP_LONGEST:
    return max(indices, key=lambda i: len(str(df.iloc[i].to_dict())))
  3. Add to the CLI survivor map in src/cli.py.

Testing

Run Tests

# All tests
pytest tests/ -q

# Specific module
pytest tests/test_dedup.py -q
pytest tests/test_normalizers.py -q
pytest tests/test_io.py -q
pytest tests/test_config.py -q
pytest tests/test_cli.py -q

# Verbose with output
pytest tests/ -v

# Stop on first failure
pytest tests/ -x

Test Structure

tests/
├── conftest.py          # Shared fixtures
│   ├── sample_csv_path  # Path to samples/messy_sales.csv
│   ├── sample_df        # Loaded sample CSV as DataFrame
│   ├── simple_df        # Small 5-row DataFrame with obvious duplicates
│   ├── merge_df         # DataFrame with partial records
│   └── tmp_csv          # Temporary CSV from simple_df
├── test_dedup.py        # Engine tests: similarity, union-find, pairs, integration
├── test_normalizers.py  # Normalizer tests: all 5 types with edge cases
├── test_io.py           # I/O tests: encoding, delimiter, header, read/write
├── test_config.py       # Config tests: serialization round-trip
└── test_cli.py          # CLI tests: argument parsing, file handling

Writing Tests

Follow existing patterns. Tests use pytest fixtures from conftest.py:

def test_my_feature(simple_df):
    """Test description."""
    result = deduplicate(simple_df, ...)
    assert len(result.match_groups) == expected
    assert result.deduplicated_df.shape[0] == expected_rows

Known Limitations

  • O(n^2) pairwise comparison with no blocking or indexing. Works well up to ~50,000 rows, which is already about 1.25 billion pairs because n rows produce n(n-1)/2 comparisons. Beyond that, runtime keeps growing quadratically with row count. Future optimization: add blocking (partition by first letter, zip code prefix, etc.) to reduce comparison space.
  • No multi-sheet dedup — each Excel sheet is processed independently. Cross-sheet deduplication is not supported.
  • Phone normalization requires valid-length numbers — the phonenumbers library rejects numbers that are too short or too long for the detected region. Fallback is digits-only, which may produce false negatives for international numbers without country codes.
  • Single-threaded — no parallel comparison. Could benefit from multiprocessing for large files.
  • Memory-bound — entire file is loaded into a pandas DataFrame. Files larger than available RAM will fail. Chunked reading exists but is not integrated with the dedup engine.
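
The blocking idea mentioned in the first limitation can be sketched as follows. Nothing here exists in the codebase: candidate_pairs is a hypothetical helper, and using the first letter of a name as the blocking key is just the simplest of the options listed above.

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterator


def candidate_pairs(keys: list[str]) -> Iterator[tuple[int, int]]:
    """Yield index pairs only for rows that share a blocking key."""
    blocks: dict[str, list[int]] = defaultdict(list)
    for index, key in enumerate(keys):
        if key:  # rows with a blank key are never compared in this sketch
            blocks[key].append(index)
    for members in blocks.values():
        yield from combinations(members, 2)


names = ["alice", "alicia", "bob", "bobby", "carol"]
keys = [name[:1] for name in names]   # blocking key: first letter
print(list(candidate_pairs(keys)))    # [(0, 1), (2, 3)]
```

Here two comparisons replace the ten that a full pairwise pass over five rows would need. The trade-off is recall: duplicates whose blocking keys differ (for example, a typo in the first letter) are never compared, so a real implementation would likely combine several keys.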