# DataTools Deduplicator

Find and remove duplicate rows in CSV, delimited text, and Excel files — with fuzzy matching, smart normalization, and interactive review.

## Features

- **Zero-config start** — auto-detects encoding, delimiters, headers, and match columns
- **Fuzzy matching** — Jaro-Winkler, Levenshtein, and token set ratio algorithms
- **5 built-in normalizers** — email (Gmail dot/plus), phone (E.164), name (titles/suffixes), address (USPS), string (whitespace/case)
- **Merge mode** — fill missing fields in the surviving row from removed duplicates
- **4 survivor rules** — keep first, last, most complete, or most recent row per group
- **Interactive review** — inspect match groups with inline checkboxes and column dropdowns, cherry-pick values, preview surviving rows live
- **Config profiles** — save and reload your settings as JSON for repeatable runs
- **Dual interface** — full CLI for automation, Streamlit GUI for visual review
- **Dry-run by default** — preview what would change before writing anything
- **Audit trail** — every run produces a match groups report and timestamped log

## Quick Start

### Install

```bash
pip install -r requirements.txt
```

### CLI

```bash
# Preview duplicates (dry run — no files written)
python -m src.cli customers.csv

# Remove duplicates and save the result
python -m src.cli customers.csv --apply

# Fuzzy-match names at 80% similarity, merge missing fields
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge --apply

# Interactively review each match group
python -m src.cli customers.csv --review --apply
```

### GUI

```bash
streamlit run src/gui/app.py
```

Upload a file, click **Find Duplicates**, review match groups side-by-side, then download the cleaned result.
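
As a rough illustration of what the `most-complete` survivor rule combined with `--merge` means for a single match group, here is a minimal sketch in plain Python. It is not the tool's implementation: the helper names (`is_missing`, `pick_most_complete`, `merge_group`), the treatment of blank strings as missing, and the tie-breaking toward the earliest row are all assumptions made for the example.

```python
from typing import Optional

# One row is a mapping of column name to value (hypothetical representation).
Row = dict[str, Optional[str]]


def is_missing(value: Optional[str]) -> bool:
    """Assumption: None and blank strings both count as missing."""
    return value is None or value.strip() == ""


def pick_most_complete(group: list[Row]) -> int:
    """Return the index of the row with the most filled-in fields.

    Ties go to the earliest row, because max() returns the first maximum.
    """
    return max(
        range(len(group)),
        key=lambda i: sum(not is_missing(v) for v in group[i].values()),
    )


def merge_group(group: list[Row]) -> Row:
    """Keep the most complete row and fill its gaps from the other rows."""
    survivor_idx = pick_most_complete(group)
    survivor = dict(group[survivor_idx])  # copy so the input is not mutated
    for i, row in enumerate(group):
        if i == survivor_idx:
            continue
        for column, value in row.items():
            if is_missing(survivor.get(column)) and not is_missing(value):
                survivor[column] = value
    return survivor


group = [
    {"name": "Ann Lee", "email": "ann@example.com", "phone": "", "city": "Austin"},
    {"name": "Ann Lee", "email": "", "phone": "+15550100", "city": ""},
]
merged = merge_group(group)
print(merged)
# {'name': 'Ann Lee', 'email': 'ann@example.com', 'phone': '+15550100', 'city': 'Austin'}
```

In this sketch the first row survives because it has three filled fields against two, and only its empty `phone` value is taken from the removed row; values the survivor already has are never overwritten.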

## CLI Usage Summary

```
python -m src.cli INPUT_FILE [OPTIONS]

Options:
  --apply                Write output files (default: preview only)
  --output, -o PATH      Output file path
  --subset, -s COLS      Columns to match on (comma-separated)
  --key, -k COLS         Strong-key columns for exact matching
  --fuzzy COLS           Columns to fuzzy-match
  --algorithm, -a ALG    levenshtein | jaro_winkler | token_set_ratio
  --threshold, -t N      Similarity threshold 0-100 (default: 85)
  --normalize COL:TYPE   Per-column normalizers (e.g., email:email,phone:phone)
  --survivor RULE        first | last | most-complete | most-recent
  --merge                Fill missing fields from removed duplicates
  --review               Interactively review each match group
  --config PATH          Load settings from a JSON config file
  --save-config PATH     Save current settings to JSON
  --sheet NAME           Excel sheet name or 0-based index
  --encoding ENC         Override auto-detected encoding
  --header-row N         0-based header row index
  --help                 Show full help
```

## Sample Output

```
$ python -m src.cli samples/messy_sales.csv

Reading messy_sales.csv...
  50 rows, 8 columns
Finding duplicates...

──────────────────────────────────────────────────
  File:      messy_sales.csv
  Rows in:   50
  Rows out:  28
  Removed:   22
  Groups:    22
──────────────────────────────────────────────────

Match groups:
  Group 1: rows [1, 2] → keep row 1 (confidence: 100.0%, matched on: email)
  Group 2: rows [3, 4] → keep row 3 (confidence: 92.3%, matched on: name, phone)
  ...

This was a preview. Add --apply to write the output files.
```

## Output Files

When `--apply` is used, three files are produced:

| File | Contents |
|------|----------|
| `{input}_deduplicated.csv` | Cleaned data with duplicates removed |
| `{input}_removed.csv` | Rows that were removed |
| `{input}_match_groups.csv` | Audit trail: group ID, confidence, matched columns, survivor flag |

## Documentation

- [CLI Reference](docs/CLI-REFERENCE.md) — every flag with examples and recipe sections
- [Developer Guide](docs/DEVELOPER.md) — architecture, data flow, how to extend

## Requirements

- Python 3.10+
- Dependencies: pandas, openpyxl, rapidfuzz, typer, phonenumbers, loguru, tqdm, charset-normalizer

## License

Proprietary. All rights reserved.