DataTools Deduplicator

Find and remove duplicate rows in CSV, delimited text, and Excel files — with fuzzy matching, smart normalization, and interactive review.

Features

  • Zero-config start — auto-detects encoding, delimiters, headers, and match columns
  • Fuzzy matching — Jaro-Winkler, Levenshtein, and token set ratio algorithms
  • 5 built-in normalizers — email (Gmail dot/plus), phone (E.164), name (titles/suffixes), address (USPS), string (whitespace/case)
  • Merge mode — fill missing fields in the surviving row from removed duplicates
  • 4 survivor rules — keep first, last, most complete, or most recent row per group
  • Interactive review — inspect match groups with inline checkboxes and column dropdowns, cherry-pick values, preview surviving rows live
  • Config profiles — save and reload your settings as JSON for repeatable runs
  • Dual interface — full CLI for automation, Streamlit GUI for visual review
  • Dry-run by default — preview what would change before writing anything
  • Audit trail — every run produces a match groups report and timestamped log
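
To illustrate the Gmail dot/plus handling that the email normalizer is described as performing, here is a minimal sketch using only the standard library. The function name `normalize_email` and the set of Gmail domains are assumptions made for this example, not the project's actual code.

```python
# Hypothetical sketch; the real normalizer may differ in name and behavior.
GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}  # assumed set of Gmail domains


def normalize_email(address: str) -> str:
    """Lowercase and trim an address; for Gmail, drop dots and any +tag."""
    cleaned = address.strip().lower()
    local, sep, domain = cleaned.rpartition("@")
    if not sep:
        # No "@" found: return the trimmed, lowercased input unchanged.
        return cleaned
    if domain in GMAIL_DOMAINS:
        # Gmail ignores dots in the local part and everything after "+".
        local = local.split("+", 1)[0].replace(".", "")
    return f"{local}@{domain}"


print(normalize_email(" J.Doe+news@Gmail.com "))
print(normalize_email("j.doe@example.com"))
```

Under this sketch, the two Gmail spellings `J.Doe+news@Gmail.com` and `jdoe@gmail.com` collapse to the same key, while addresses on other domains keep their dots.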

Quick Start

Install

pip install -r requirements.txt

CLI

# Preview duplicates (dry run — no files written)
python -m src.cli customers.csv

# Remove duplicates and save the result
python -m src.cli customers.csv --apply

# Fuzzy-match names at 80% similarity, merge missing fields
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge --apply

# Interactively review each match group
python -m src.cli customers.csv --review --apply

GUI

streamlit run src/gui/app.py

Upload a file, click Find Duplicates, review match groups side-by-side, then download the cleaned result.

CLI Usage Summary

python -m src.cli INPUT_FILE [OPTIONS]

Options:
  --apply                  Write output files (default: preview only)
  --output, -o PATH        Output file path
  --subset, -s COLS        Columns to match on (comma-separated)
  --key, -k COLS           Strong-key columns for exact matching
  --fuzzy COLS             Columns to fuzzy-match
  --algorithm, -a ALG      levenshtein | jaro_winkler | token_set_ratio
  --threshold, -t N        Similarity threshold 0-100 (default: 85)
  --normalize COL:TYPE     Per-column normalizers (e.g., email:email,phone:phone)
  --survivor RULE          first | last | most-complete | most-recent
  --merge                  Fill missing fields from removed duplicates
  --review                 Interactively review each match group
  --config PATH            Load settings from a JSON config file
  --save-config PATH       Save current settings to JSON
  --sheet NAME             Excel sheet name or 0-based index
  --encoding ENC           Override auto-detected encoding
  --header-row N           0-based header row index
  --help                   Show full help
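
The way a 0-100 threshold could gate a fuzzy match can be sketched as follows. This is a pure-Python Levenshtein similarity written for illustration. The project lists rapidfuzz as a dependency, and its scoring may differ from this formula. The names `levenshtein`, `similarity`, and `is_match` are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to a 0-100 score relative to the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * (longest - levenshtein(a, b)) / longest


def is_match(a: str, b: str, threshold: float = 85.0) -> bool:
    """True when the similarity score reaches the threshold."""
    return similarity(a, b) >= threshold


print(similarity("Jon Smith", "John Smith"))
print(is_match("Jon Smith", "John Smith"))
```

In this sketch, "Jon Smith" and "John Smith" differ by one edit over ten characters, so they score 90 and pass the default threshold of 85 but would fail a threshold of 95.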

Sample Output

$ python -m src.cli samples/messy_sales.csv

Reading messy_sales.csv...
  50 rows, 8 columns
Finding duplicates...

──────────────────────────────────────────────────
  File:      messy_sales.csv
  Rows in:   50
  Rows out:  28
  Removed:   22
  Groups:    22
──────────────────────────────────────────────────

Match groups:
  Group 1: rows [1, 2] → keep row 1 (confidence: 100.0%, matched on: email)
  Group 2: rows [3, 4] → keep row 3 (confidence: 92.3%, matched on: name, phone)
  ...

This was a preview. Add --apply to write the output files.
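
The "keep row N" decision in each match group depends on the survivor rule, and merge mode then fills the survivor's gaps. The sketch below shows one way a most-complete rule combined with merging could work on plain dictionaries. The helper names, the treatment of empty strings as missing, and the tie-break toward the earliest row are all assumptions, not the tool's documented behavior.

```python
def is_missing(value) -> bool:
    """Treat None and blank strings as missing (an assumption of this sketch)."""
    return value is None or str(value).strip() == ""


def pick_most_complete(rows: list[dict]) -> int:
    """Index of the row with the most filled fields; ties go to the earliest row."""
    def score(index: int) -> tuple[int, int]:
        filled = sum(not is_missing(v) for v in rows[index].values())
        return (filled, -index)
    return max(range(len(rows)), key=score)


def merge_group(rows: list[dict]) -> dict:
    """Keep the most complete row and fill its gaps from the other rows."""
    keep = pick_most_complete(rows)
    survivor = dict(rows[keep])  # copy so the input rows stay untouched
    for index, row in enumerate(rows):
        if index == keep:
            continue
        for column, value in row.items():
            if is_missing(survivor.get(column)) and not is_missing(value):
                survivor[column] = value
    return survivor


group = [
    {"name": "Ann Lee", "email": "", "phone": "555-0100"},
    {"name": "Ann Lee", "email": "ann@example.com", "phone": ""},
]
print(merge_group(group))
```

Here both rows have two filled fields, so the first row survives and picks up the email address from the removed duplicate.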

Output Files

When --apply is used, three files are produced:

File                        Contents
{input}_deduplicated.csv    Cleaned data with duplicates removed
{input}_removed.csv         Rows that were removed
{input}_match_groups.csv    Audit trail: group ID, confidence, matched columns, survivor flag
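
For scripts that need to pick up these files after a run, the names can be derived from the input path. The sketch below assumes that `{input}` stands for the input file's name without its extension and that the outputs are written next to the input file. Neither is confirmed above, and `expected_outputs` is a hypothetical helper.

```python
from pathlib import Path

# Suffixes taken from the output file names listed above.
SUFFIXES = ("deduplicated", "removed", "match_groups")


def expected_outputs(input_file: str) -> dict[str, Path]:
    """Map each suffix to the output path assumed to sit beside the input."""
    source = Path(input_file)
    return {
        suffix: source.with_name(f"{source.stem}_{suffix}.csv")
        for suffix in SUFFIXES
    }


for suffix, path in expected_outputs("data/customers.csv").items():
    print(f"{suffix}: {path}")
```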

Requirements

  • Python 3.10+
  • Dependencies: pandas, openpyxl, rapidfuzz, typer, phonenumbers, loguru, tqdm, charset-normalizer (the GUI additionally requires streamlit)

License

Proprietary. All rights reserved.
