Developer Guide
Architecture, data flow, and extension guide for the DataTools Deduplicator.
Architecture
CLI (src/cli.py)               GUI (src/gui/app.py)
        │                                │
        │ flags → strategies             │ widgets → strategies
        │ _interactive_review()          │ match_group_card()
        │ tqdm progress bar              │ st.progress()
        │                                │
        └──────────┐    ┌────────────────┘
                   │    │
                   ▼    ▼
            ┌─────────────────┐
            │   core.dedup    │
            │  deduplicate()  │
            └────────┬────────┘
                     │
        ┌────────────┼────────────┐
        ▼            ▼            ▼
     core.io core.normalizers core.config
   read/write  normalize_*() save/load JSON
Key principle: All business logic lives in src/core/. The CLI and GUI are thin wrappers that translate user input into deduplicate() arguments and display the DeduplicationResult.
File-by-File Reference
src/core/dedup.py — Deduplication Engine
The central module. Contains:
- Enums: Algorithm (4 fuzzy algorithms), SurvivorRule (4 selection rules)
- Data classes: ColumnMatchStrategy, MatchStrategy, MatchResult, DeduplicationResult
- deduplicate(): main entry point. Takes a DataFrame plus optional strategies/rules and returns a DeduplicationResult with the deduplicated DataFrame, removed rows, match groups, and log entries.
- build_default_strategies(): scans column names with regex patterns to auto-detect email, phone, name, and address columns. Builds strong/weak key strategies with appropriate algorithms and normalizers.
- _UnionFind: disjoint-set data structure for transitive closure. If A matches B and B matches C, all three end up in one group.
- _find_match_groups(): O(n^2) pairwise comparison. For each pair, tries all strategies (OR semantics). Feeds matches into union-find. Returns match groups with confidence scores.
- _select_survivor(): picks the row to keep based on the survivor rule.
- _merge_group(): fills blank fields in the survivor from loser rows.
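The transitive-closure behaviour described for _UnionFind can be illustrated with a minimal disjoint-set sketch. This is not the project's implementation: the class name UnionFindSketch and its method names are assumptions made for illustration only.

```python
from collections import defaultdict


class UnionFindSketch:
    """Minimal disjoint-set (hypothetical; not the project's _UnionFind)."""

    def __init__(self, size: int) -> None:
        # Every row starts as the root of its own single-member set.
        self.parent = list(range(size))

    def find(self, item: int) -> int:
        # Walk up to the root, halving the path along the way.
        while self.parent[item] != item:
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a: int, b: int) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def groups(self) -> list[list[int]]:
        # Collect members by root; keep only groups with more than one row.
        members: dict[int, list[int]] = defaultdict(list)
        for item in range(len(self.parent)):
            members[self.find(item)].append(item)
        return [rows for rows in members.values() if len(rows) > 1]


uf = UnionFindSketch(5)
uf.union(0, 1)  # A matches B
uf.union(1, 2)  # B matches C
print(uf.groups())  # rows 0, 1 and 2 share one group; rows 3 and 4 stay alone
```

Rows 0 and 2 were never compared directly, yet they land in the same group because both are linked to row 1.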
src/core/normalizers.py — Text Normalization
Five normalizer functions, each str → str, idempotent, None-safe:
- normalize_email(): lowercase, strip Gmail dots, strip +tag suffixes
- normalize_phone(): parse with phonenumbers to E.164; fallback to digits-only
- normalize_name(): strip title prefixes (Dr., Mr.) and suffixes (Jr., PhD), case-fold
- normalize_address(): USPS abbreviations (Street→St, Avenue→Ave), case-fold
- normalize_string(): trim, collapse whitespace, case-fold
The get_normalizer() registry function maps NormalizerType enum values to functions.
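To make the "str → str, idempotent, None-safe" contract concrete, here is a rough sketch of an email normalizer in that style. The function name normalize_email_sketch, the list of Gmail domains, and the handling of values without an "@" are assumptions; the real normalize_email() may differ in all of these.

```python
from typing import Optional


def normalize_email_sketch(value: Optional[str]) -> str:
    """Hypothetical stand-in for normalize_email(); details are assumptions."""
    if not value or not isinstance(value, str):
        return ""  # None-safe: blanks and non-strings normalize to ""
    email = value.strip().lower()
    local, sep, domain = email.partition("@")
    if not sep:
        return email  # not an address; leave the cleaned text as-is
    local = local.split("+", 1)[0]  # drop a "+tag" suffix
    if domain in ("gmail.com", "googlemail.com"):  # assumed domain list
        local = local.replace(".", "")  # Gmail ignores dots in the local part
    return f"{local}@{domain}"


once = normalize_email_sketch("  Jane.Doe+news@Gmail.com ")
print(once)
# Idempotent: normalizing an already-normalized value changes nothing.
print(normalize_email_sketch(once) == once)
```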
src/core/io.py — File I/O
Auto-detection stack:
- detect_encoding(): checks for a BOM, then uses charset-normalizer heuristics
- detect_delimiter(): uses csv.Sniffer on the first 20 lines
- detect_header_row(): finds the first row where all cells look like column names
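The first two detection steps can be approximated with the standard library alone. The sketch below is an assumption-laden illustration, not the code in src/core/io.py: the function names, the candidate delimiter set, and the comma fallback are all hypothetical, and the charset-normalizer step is omitted.

```python
import codecs
import csv
from typing import Optional


def detect_bom_encoding(raw: bytes) -> Optional[str]:
    """Return an encoding name if the data starts with a known BOM, else None."""
    # UTF-32 must be checked first: its LE BOM begins with the UTF-16 LE BOM.
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None


def sniff_delimiter(text: str, max_lines: int = 20, default: str = ",") -> str:
    """Guess the delimiter from the first max_lines lines using csv.Sniffer."""
    sample = "\n".join(text.splitlines()[:max_lines])
    try:
        return csv.Sniffer().sniff(sample, delimiters=",\t;|").delimiter
    except csv.Error:
        return default  # Sniffer could not decide; fall back (assumed default)


sample_text = "name;email;phone\nAnn;a@x.com;555\nBob;b@y.com;556\n"
print(repr(sniff_delimiter(sample_text)))
print(detect_bom_encoding(b"\xef\xbb\xbfname,email"))
```

Restricting the candidate delimiters keeps the sniffer from picking a frequent but meaningless character, at the cost of not recognising unusual separators.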
Main functions:
- read_file(): reads CSV/TSV/Excel with full auto-detection. Returns a DataFrame.
- write_file(): writes a DataFrame to CSV or Excel. Uses utf-8-sig by default for Windows Excel compatibility.
- list_sheets(): returns sheet names from an Excel workbook.
src/core/config.py — Configuration Profiles
Save/load deduplication settings as JSON:
- DeduplicationConfig: flat dataclass holding all settings (strategies, survivor rule, merge flag, algorithm, threshold, normalizer map)
- .to_file() / .from_file(): JSON serialization
- .to_strategies(): converts config back to MatchStrategy objects for the engine
- .to_survivor_rule(): converts string to SurvivorRule enum
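A flat dataclass round-trips through JSON with very little code. The sketch below shows the general pattern only; ConfigSketch, its field names, and its default values are hypothetical and much smaller than the real DeduplicationConfig.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class ConfigSketch:
    """Hypothetical, trimmed-down stand-in for DeduplicationConfig."""

    survivor_rule: str = "first"
    merge: bool = False
    threshold: float = 85.0
    normalizers: dict[str, str] = field(default_factory=dict)

    def to_file(self, path: Path) -> None:
        # asdict() turns the dataclass into plain JSON-friendly types.
        path.write_text(json.dumps(asdict(self), indent=2), encoding="utf-8")

    @classmethod
    def from_file(cls, path: Path) -> "ConfigSketch":
        return cls(**json.loads(path.read_text(encoding="utf-8")))


config = ConfigSketch(survivor_rule="most-complete", merge=True,
                      normalizers={"email": "email"})
with tempfile.TemporaryDirectory() as tmp:
    profile = Path(tmp) / "profile.json"
    config.to_file(profile)
    restored = ConfigSketch.from_file(profile)
print(restored == config)
```

Keeping the dataclass flat (strings, numbers, and simple dicts) is what makes this round trip lossless without custom encoders.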
src/cli.py — Command-Line Interface
Typer-based CLI with 17 options. Key responsibilities:
- Parse flags into strategies, survivor rule, and other config
- Set up logging (timestamped log files in logs/)
- Column name validation with fuzzy suggestions on typos
- _interactive_review(): side-by-side row display with y/n/s prompts
- Progress bar via tqdm for files > 10,000 rows
- Output formatting and file writing
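Column name validation with typo suggestions could look roughly like the following. This sketch assumes the standard library's difflib for the fuzzy match; the CLI may use a different matcher, and the function name suggest_columns and the 0.6 cutoff are illustrative choices.

```python
import difflib


def suggest_columns(requested: list[str], available: list[str]) -> dict[str, list[str]]:
    """Map each unknown column name to its closest available names."""
    suggestions: dict[str, list[str]] = {}
    # Compare case-insensitively, but report the original spelling.
    lowered = {name.lower(): name for name in available}
    for name in requested:
        if name in available:
            continue  # valid column, nothing to suggest
        close = difflib.get_close_matches(name.lower(), list(lowered), n=3, cutoff=0.6)
        suggestions[name] = [lowered[match] for match in close]
    return suggestions


print(suggest_columns(["emial", "Phone"], ["Email", "Phone", "Name"]))
```

An empty suggestion list for a name means nothing in the file is close enough, which is a reasonable point to abort with an error listing the available columns.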
src/gui/app.py — Streamlit GUI
Single-page layout:
- File upload with instant preview and configurable delimiter (comma, tab, semicolon, pipe, or custom)
- Advanced options expander (column selection, fuzzy, normalizers, survivor rule, merge, config profiles)
- Find Duplicates button → runs deduplicate() with progress_callback
- Interactive review via st.data_editor with inline checkboxes and column dropdowns
- Batch actions: Accept All, Reject All, Clear Decisions
- Apply review decisions and download cleaned results
- Download buttons for deduplicated CSV, removed rows, and match groups report
src/gui/components.py — Reusable GUI Widgets
- match_group_card(): expandable card built on st.data_editor, with inline Keep checkboxes per row, SelectboxColumn dropdowns for differing columns, and a live surviving rows preview
- config_panel(): the advanced options expander; returns a settings dict with strategies, survivor rule, and merge flag
- results_summary(): summary metrics and download buttons
- apply_review_decisions(): builds the final DataFrames from user review decisions (merge, split, or keep-all per group) with column override support
Data Flow
Input File
│
▼
read_file() ← auto-detect encoding, delimiter, header
│
▼
DataFrame
│
▼
build_default_strategies() ← (if no explicit strategies)
│ scan column names → regex patterns
│ strong keys: email, phone (standalone OR)
│ weak keys: name, address (AND with strong)
▼
_apply_normalizations() ← add _norm_* shadow columns
│ normalize_email(), normalize_phone(), etc.
▼
_find_match_groups() ← O(n²) pairwise comparison
│ for each pair: try all strategies (OR)
│ _compute_similarity() per column
│ union-find for transitive closure
▼
[review_callback()] ← optional: interactive review per group
│ True=accept, False=reject, None=skip
▼
_select_survivor() ← per group: first/last/most-complete/most-recent
│
▼
[_merge_group()] ← optional: fill blanks from losers
│
▼
DeduplicationResult
├── deduplicated_df ← cleaned DataFrame (shadow cols dropped)
├── removed_df ← rows that were removed
├── match_groups ← list of MatchResult with confidence, columns
└── log_entries ← human-readable audit log
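The last two stages of the flow (pick a survivor, then optionally fill its blanks from the losers) can be sketched with plain dictionaries. The engine itself works on DataFrames; the helper names, the blank test, and the "first row wins ties" rule below are assumptions for illustration.

```python
from typing import Optional

Row = dict[str, Optional[str]]


def is_blank(value: Optional[str]) -> bool:
    return value is None or value.strip() == ""


def select_most_complete(rows: list[Row]) -> int:
    """Return the index of the row with the most non-blank fields (first wins ties)."""
    filled = [sum(not is_blank(v) for v in row.values()) for row in rows]
    return filled.index(max(filled))


def merge_group(rows: list[Row], survivor: int) -> Row:
    """Fill blank survivor fields from the other rows, in order."""
    merged = dict(rows[survivor])
    for i, row in enumerate(rows):
        if i == survivor:
            continue
        for column, value in row.items():
            if is_blank(merged.get(column)) and not is_blank(value):
                merged[column] = value
    return merged


group = [
    {"name": "Ann Lee", "email": "", "phone": ""},
    {"name": "Ann Lee", "email": "ann@example.com", "phone": ""},
    {"name": "A. Lee", "email": "", "phone": "555-0100"},
]
keep = select_most_complete(group)
merged = merge_group(group, keep)
print(keep, merged)
```

Merging never overwrites a value the survivor already has, so the survivor rule still decides which spelling of a conflicting field is kept.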
How to Add a Normalizer
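- Add the function in src/core/normalizers.py:
import re
from typing import Optional
# One pattern for all trailing legal suffixes, so stacked suffixes
# ("Acme Inc Corp") are stripped in one pass and the function stays idempotent.
_COMPANY_SUFFIX_RE = re.compile(r"(?:[\s,]+(?:inc|llc|corp|ltd|co)\b\.?)+\s*$")
def normalize_company(value: Optional[str]) -> str:
    """Strip legal suffixes (Inc, LLC, Corp), case-fold."""
    if not value or not isinstance(value, str):
        return ""
    name = value.strip().casefold()
    # Remove every trailing suffix, along with the separator before it
    return _COMPANY_SUFFIX_RE.sub("", name).strip()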
- Register it in the same file:
class NormalizerType(str, Enum):
    # ... existing types ...
    COMPANY = "company"  # ← add enum value

_NORMALIZER_MAP: dict[NormalizerType, Callable[[str], str]] = {
    # ... existing entries ...
    NormalizerType.COMPANY: normalize_company,  # ← add mapping
}
- Add an auto-detection pattern in src/core/dedup.py (optional):
_COLUMN_TYPE_PATTERNS = [
    # ... existing patterns ...
    (re.compile(r"company|organization|org_name", re.I),
     NormalizerType.COMPANY, Algorithm.TOKEN_SET_RATIO, 85.0, False),
]
How to Add a Matching Algorithm
- Add the enum value in src/core/dedup.py:
class Algorithm(str, Enum):
    # ... existing values ...
    SOUNDEX = "soundex"
- Add the computation in _compute_similarity():
def _compute_similarity(val_a: str, val_b: str, algorithm: Algorithm) -> float:
    # ... existing cases ...
    # Assumes a _soundex() helper defined in this module
    if algorithm == Algorithm.SOUNDEX:
        return 100.0 if _soundex(val_a) == _soundex(val_b) else 0.0
- Add the CLI flag value in the src/cli.py help text for --algorithm.
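The _soundex() helper used in the example is not part of the codebase described here, so it would have to be supplied. One possible version, following the standard American Soundex rules, is sketched below under the hypothetical name soundex_sketch.

```python
_SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex_sketch(value: str) -> str:
    """Return the 4-character American Soundex code, or "" if there are no letters."""
    letters = [ch for ch in value.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        if ch not in "hw":
            previous = digit  # vowels break a run of equal codes; h and w do not
        if len(code) == 4:
            break
    return code.ljust(4, "0")  # pad short codes with zeros


print(soundex_sketch("Robert"), soundex_sketch("Rupert"), soundex_sketch("Ashcraft"))
```

Two blank values both encode to the empty string and would therefore compare as equal, so a caller along the lines of _compute_similarity() may want to treat empty codes as a non-match.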
How to Add a Survivor Strategy
- Add the enum value in src/core/dedup.py:
class SurvivorRule(str, Enum):
    # ... existing values ...
    KEEP_LONGEST = "longest"
- Add the logic in _select_survivor():
if rule == SurvivorRule.KEEP_LONGEST:
    # Keep the row whose stringified contents are longest
    return max(indices, key=lambda i: len(str(df.iloc[i].to_dict())))
- Add to the CLI survivor map in src/cli.py.
Testing
Run Tests
# All tests
pytest tests/ -q
# Specific module
pytest tests/test_dedup.py -q
pytest tests/test_normalizers.py -q
pytest tests/test_io.py -q
pytest tests/test_config.py -q
pytest tests/test_cli.py -q
# Verbose with output
pytest tests/ -v
# Stop on first failure
pytest tests/ -x
Test Structure
tests/
├── conftest.py # Shared fixtures
│ ├── sample_csv_path # Path to samples/messy_sales.csv
│ ├── sample_df # Loaded sample CSV as DataFrame
│ ├── simple_df # Small 5-row DataFrame with obvious duplicates
│ ├── merge_df # DataFrame with partial records
│ └── tmp_csv # Temporary CSV from simple_df
├── test_dedup.py # Engine tests: similarity, union-find, pairs, integration
├── test_normalizers.py # Normalizer tests: all 5 types with edge cases
├── test_io.py # I/O tests: encoding, delimiter, header, read/write
├── test_config.py # Config tests: serialization round-trip
└── test_cli.py # CLI tests: argument parsing, file handling
Writing Tests
Follow existing patterns. Tests use pytest fixtures from conftest.py:
def test_my_feature(simple_df):
    """Test description."""
    result = deduplicate(simple_df, ...)
    assert len(result.match_groups) == expected
    assert result.deduplicated_df.shape[0] == expected_rows
Known Limitations
- O(n^2) pairwise comparison: no blocking or indexing. Works well up to ~50,000 rows; beyond that, runtime grows quadratically with row count. Future optimization: add blocking (partition by first letter, zip code prefix, etc.) to reduce the comparison space.
- No multi-sheet dedup: each Excel sheet is processed independently. Cross-sheet deduplication is not supported.
- Phone normalization requires valid-length numbers: the phonenumbers library rejects numbers that are too short or too long for the detected region. Fallback is digits-only, which may produce false negatives for international numbers without country codes.
- Single-threaded: no parallel comparison. Could benefit from multiprocessing for large files.
- Memory-bound: the entire file is loaded into a pandas DataFrame. Files larger than available RAM will fail. Chunked reading exists but is not integrated with the dedup engine.
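The blocking idea mentioned under the first limitation can be sketched as follows. This is not implemented in the engine; blocked_pairs and the first-letter key are hypothetical and only show how partitioning shrinks the number of candidate pairs.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Iterator


def blocked_pairs(values: list[str], block_key: Callable[[str], str]) -> Iterator[tuple[int, int]]:
    """Yield index pairs whose values share a block key."""
    blocks: dict[str, list[int]] = defaultdict(list)
    for index, value in enumerate(values):
        blocks[block_key(value)].append(index)
    # Only rows inside the same block are ever compared with each other.
    for members in blocks.values():
        yield from combinations(members, 2)


names = ["alice", "alicia", "bob", "bobby", "carol", "alec"]
pairs = list(blocked_pairs(names, lambda name: name[:1]))
print(len(pairs), "pairs instead of", len(names) * (len(names) - 1) // 2)
```

The trade-off is recall: two duplicates whose block keys differ (for example, a typo in the first letter) are never compared, so the key should be chosen from a field that is reliable after normalization.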