DataTools Deduplicator
Find and remove duplicate rows in CSV and Excel files — with fuzzy matching, smart normalization, and interactive review.
Features
- Zero-config start — auto-detects encoding, delimiters, headers, and match columns
- Fuzzy matching — Jaro-Winkler, Levenshtein, and token set ratio algorithms
- 5 built-in normalizers — email (Gmail dot/plus), phone (E.164), name (titles/suffixes), address (USPS), string (whitespace/case)
- Merge mode — fill missing fields in the surviving row from removed duplicates
- 4 survivor rules — keep first, last, most complete, or most recent row per group
- Interactive review — inspect each match group and decide: merge, keep both, or skip
- Config profiles — save and reload your settings as JSON for repeatable runs
- Dual interface — full CLI for automation, Streamlit GUI for visual review
- Dry-run by default — preview what would change before writing anything
- Audit trail — every run produces a match groups report and timestamped log
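To illustrate what the email normalizer's "Gmail dot/plus" handling might look like, here is a minimal sketch. The function name `normalize_gmail` is hypothetical and is not taken from this project's code; the sketch only assumes the well-known Gmail behavior that dots in the local part and anything after a `+` are ignored for delivery.

```python
def normalize_gmail(address: str) -> str:
    """Return a canonical form of an email address for duplicate matching.

    Hypothetical helper: the project's real normalizer may differ.
    """
    cleaned = address.strip().lower()
    local, sep, domain = cleaned.rpartition("@")
    if not sep:
        # No "@" found: not an email, so only trim and lowercase it.
        return cleaned
    if domain == "gmail.com":
        # Drop the "+tag" suffix, then remove dots from the local part.
        local = local.split("+", 1)[0].replace(".", "")
    return f"{local}@{domain}"


print(normalize_gmail(" J.Doe+news@Gmail.com "))
print(normalize_gmail("j.doe@example.com"))
```

With this kind of normalization, `J.Doe+news@Gmail.com` and `jdoe@gmail.com` compare equal, while addresses on other domains keep their dots.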
Quick Start
Install
pip install -r requirements.txt
CLI
# Preview duplicates (dry run — no files written)
python -m src.cli customers.csv
# Remove duplicates and save the result
python -m src.cli customers.csv --apply
# Fuzzy-match names at 80% similarity, merge missing fields
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge --apply
# Interactively review each match group
python -m src.cli customers.csv --review --apply
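As a rough illustration of what a similarity threshold such as `--threshold 80` means, the sketch below scores two strings on a 0-100 scale using plain Levenshtein distance. It is a standard-library stand-in, not the tool's implementation: the project lists rapidfuzz as a dependency, and its scoring may be normalized differently. The names `levenshtein`, `similarity`, and `is_fuzzy_match`, the lowercasing step, and the `>=` comparison are all assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute), computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale; 100 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def is_fuzzy_match(a: str, b: str, threshold: float = 80) -> bool:
    """Assumed rule: a pair matches when its score reaches the threshold."""
    return similarity(a.lower(), b.lower()) >= threshold


print(is_fuzzy_match("Jon Smith", "John Smith"))
print(is_fuzzy_match("Jon Smith", "Jane Doe"))
```

Under this scoring, "Jon Smith" and "John Smith" differ by one edit over ten characters, so they clear an 80 threshold; lowering the threshold admits looser matches and raises the risk of false positives.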
GUI
streamlit run src/gui/app.py
Upload a file, click Find Duplicates, review match groups side-by-side, then download the cleaned result.
CLI Usage Summary
python -m src.cli INPUT_FILE [OPTIONS]
Options:
  --apply                 Write output files (default: preview only)
  --output, -o PATH       Output file path
  --subset, -s COLS       Columns to match on (comma-separated)
  --key, -k COLS          Strong-key columns for exact matching
  --fuzzy COLS            Columns to fuzzy-match
  --algorithm, -a ALG     levenshtein | jaro_winkler | token_set_ratio
  --threshold, -t N       Similarity threshold 0-100 (default: 85)
  --normalize COL:TYPE    Per-column normalizers (e.g., email:email,phone:phone)
  --survivor RULE         first | last | most-complete | most-recent
  --merge                 Fill missing fields from removed duplicates
  --review                Interactively review each match group
  --config PATH           Load settings from a JSON config file
  --save-config PATH      Save current settings to JSON
  --sheet NAME            Excel sheet name or 0-based index
  --encoding ENC          Override auto-detected encoding
  --header-row N          0-based header row index
  --help                  Show full help
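The interaction between `--survivor most-complete` and `--merge` can be pictured with a small sketch: pick the row with the most filled-in fields, then fill its remaining gaps from the other rows in the group. This is an illustration under stated assumptions, not the project's code. The helper names (`is_missing`, `pick_most_complete`, `merge_group`), the treatment of empty strings as missing, and the tie-breaking rule (earliest row wins) are all hypothetical.

```python
from typing import Optional

Row = dict[str, Optional[str]]


def is_missing(value: Optional[str]) -> bool:
    """Assumption: None and blank strings both count as missing."""
    return value is None or value.strip() == ""


def pick_most_complete(group: list[Row]) -> int:
    """Index of the row with the most filled-in fields (earliest wins ties)."""
    filled = [sum(not is_missing(v) for v in row.values()) for row in group]
    return filled.index(max(filled))


def merge_group(group: list[Row]) -> Row:
    """Keep the most complete row and fill its gaps from the other rows."""
    keep = pick_most_complete(group)
    survivor = dict(group[keep])  # copy so the input rows are not mutated
    for i, row in enumerate(group):
        if i == keep:
            continue
        for column, value in row.items():
            if is_missing(survivor.get(column)) and not is_missing(value):
                survivor[column] = value
    return survivor


group = [
    {"name": "Ann Lee", "email": "", "phone": "555-0100"},
    {"name": "Ann Lee", "email": "ann@example.com", "phone": None},
]
print(merge_group(group))
```

Here both rows have two filled fields, so the first row survives and picks up the email address from the second. Values already present in the survivor are never overwritten in this sketch.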
Sample Output
$ python -m src.cli samples/messy_sales.csv
Reading messy_sales.csv...
50 rows, 8 columns
Finding duplicates...
──────────────────────────────────────────────────
File: messy_sales.csv
Rows in: 50
Rows out: 28
Removed: 22
Groups: 22
──────────────────────────────────────────────────
Match groups:
Group 1: rows [1, 2] → keep row 1 (confidence: 100.0%, matched on: email)
Group 2: rows [3, 4] → keep row 3 (confidence: 92.3%, matched on: name, phone)
...
This was a preview. Add --apply to write the output files.
Output Files
When --apply is used, three files are produced:
| File | Contents |
|---|---|
| {input}_deduplicated.csv | Cleaned data with duplicates removed |
| {input}_removed.csv | Rows that were removed |
| {input}_match_groups.csv | Audit trail: group ID, confidence, matched columns, survivor flag |
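For scripts that post-process a run, the naming pattern above can be reproduced with `pathlib`. The sketch below assumes that `{input}` is the input file's name without its extension and that the files are written next to the input; neither detail is confirmed here, and `--output` may change the location. The function name `output_paths` is hypothetical.

```python
from pathlib import Path


def output_paths(input_file: str) -> dict[str, Path]:
    """Map each output kind to its expected path (assumed naming scheme)."""
    source = Path(input_file)
    return {
        kind: source.with_name(f"{source.stem}_{kind}.csv")
        for kind in ("deduplicated", "removed", "match_groups")
    }


for kind, path in output_paths("data/customers.csv").items():
    print(kind, path)
```

A pipeline could use these paths to check that all three files exist after a run with `--apply`.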
Documentation
- CLI Reference — every flag with examples and recipe sections
- Developer Guide — architecture, data flow, how to extend
- User Guide — installation and usage for end users
Requirements
- Python 3.10+
- Dependencies: pandas, openpyxl, rapidfuzz, typer, phonenumbers, loguru, tqdm, charset-normalizer
License
Proprietary. All rights reserved.