Two more detectors close the analyzer gap list:

- mixed_line_endings (warn, tool=02): scans raw bytes for combinations of CRLF / LF / bare CR. This is a disaster pattern after multi-source concat (Windows + macOS + Linux exports stitched together). It operates on raw bytes only, so DataFrame-mode analyze() skips it because raw bytes aren't available. _load_for_analysis now returns the raw bytes alongside the DataFrame and repair result so the detector has them.
- near_duplicate_rows (info, tool=01): a cheap dedup signal. Strip and lowercase every string column, then count df.duplicated(). Catches the most common case (same customer entered twice with subtle formatting differences) without paying for fuzzy matching. Anything more sophisticated stays in tool 01.

Six new tests cover both detectors plus the dataframe-mode skip path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
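The two checks described above can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: the function names are hypothetical, and the near-duplicate count uses a plain set over row tuples as a stand-in for the pandas `df.duplicated()` call the commit message mentions.

```python
import re
from collections import Counter

# CRLF is listed first so "\r\n" is counted once, not as a bare CR plus an LF.
_LINE_ENDING = re.compile(rb"\r\n|\r|\n")
_NAMES = {b"\r\n": "CRLF", b"\n": "LF", b"\r": "CR"}


def line_ending_counts(raw: bytes) -> Counter:
    """Count each line-ending style found in the raw file bytes."""
    return Counter(_NAMES[m.group()] for m in _LINE_ENDING.finditer(raw))


def has_mixed_line_endings(raw: bytes) -> bool:
    """True when more than one line-ending style appears in the bytes."""
    return len(line_ending_counts(raw)) > 1


def count_near_duplicate_rows(rows) -> int:
    """Count rows that repeat an earlier row after strip + lowercase.

    `rows` is an iterable of tuples (hypothetical input shape); non-string
    values are compared as-is.
    """
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(v.strip().lower() if isinstance(v, str) else v for v in row)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates
```

Because the line-ending check works on bytes, it has to run before any parser normalizes newlines, which is consistent with the detector being skipped when only a DataFrame is available.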
DataTools
A bundle of Python data-cleaning tools for CSV and Excel files. Two scripts ship today; more are in development.
| # | Tool | What it does |
|---|---|---|
| 01 | Deduplicator | Find and remove duplicate rows with exact + fuzzy matching, smart normalization, and interactive review. |
| 02 | Text Cleaner | Trim whitespace, fold smart quotes, strip invisible / control characters, normalize Unicode, normalize line endings, optional case conversion. |
Deduplicator
Features
- Zero-config start — auto-detects encoding, delimiters, headers, and match columns
- Fuzzy matching — Jaro-Winkler, Levenshtein, and token set ratio algorithms
- 5 built-in normalizers — email (Gmail dot/plus), phone (E.164), name (titles/suffixes), address (USPS), string (whitespace/case)
- Merge mode — fill missing fields in the surviving row from removed duplicates
- 4 survivor rules — keep first, last, most complete, or most recent row per group
- Interactive review — inspect match groups with inline checkboxes and column dropdowns, cherry-pick values, preview surviving rows live
- Config profiles — save and reload your settings as JSON for repeatable runs
- Dual interface — full CLI for automation, Streamlit GUI for visual review
- Dry-run by default — preview what would change before writing anything
- Audit trail — every run produces a match groups report and timestamped log
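To illustrate what a normalizer such as the email one (Gmail dot/plus) might do before matching, here is a minimal sketch. The function name `normalize_email` and the exact rules are assumptions for illustration; the tool's own normalizer may handle more cases.

```python
def normalize_email(address: str) -> str:
    """Lowercase an address and fold Gmail dot / plus variants together.

    Hypothetical helper: Gmail ignores dots in the local part and anything
    after a "+", so those are removed for gmail.com / googlemail.com only.
    """
    cleaned = address.strip().lower()
    local, sep, domain = cleaned.rpartition("@")
    if not sep:
        # No "@" present: return the cleaned text rather than guessing.
        return cleaned
    if domain in {"gmail.com", "googlemail.com"}:
        local = local.split("+", 1)[0].replace(".", "")
    return f"{local}@{domain}"
```

With this normalization, `J.Doe+news@Gmail.com` and `jdoe@gmail.com` compare equal, so an exact match on the normalized column can catch them without fuzzy matching.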
Quick Start
Install
pip install -r requirements.txt
CLI
# Preview duplicates (dry run — no files written)
python -m src.cli customers.csv
# Remove duplicates and save the result
python -m src.cli customers.csv --apply
# Fuzzy-match names at 80% similarity, merge missing fields
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge --apply
# Interactively review each match group
python -m src.cli customers.csv --review --apply
GUI
streamlit run src/gui/app.py
Upload a file, click Find Duplicates, review match groups side-by-side, then download the cleaned result.
CLI Usage Summary
python -m src.cli INPUT_FILE [OPTIONS]
Options:
--apply Write output files (default: preview only)
--output, -o PATH Output file path
--subset, -s COLS Columns to match on (comma-separated)
--key, -k COLS Strong-key columns for exact matching
--fuzzy COLS Columns to fuzzy-match
--algorithm, -a ALG levenshtein | jaro_winkler | token_set_ratio
--threshold, -t N Similarity threshold 0-100 (default: 85)
--normalize COL:TYPE Per-column normalizers (e.g., email:email,phone:phone)
--survivor RULE first | last | most-complete | most-recent
--merge Fill missing fields from removed duplicates
--review Interactively review each match group
--config PATH Load settings from a JSON config file
--save-config PATH Save current settings to JSON
--sheet NAME Excel sheet name or 0-based index
--encoding ENC Override auto-detected encoding
--header-row N 0-based header row index
--help Show full help
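The interaction between --survivor most-complete and --merge can be pictured with a small sketch. It assumes each match group is a list of row dictionaries and that "blank" means None or an empty / whitespace-only string; the helper names are hypothetical and the real tool may break ties or define completeness differently.

```python
def is_blank(value) -> bool:
    """Treat None and empty / whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and not value.strip())


def pick_most_complete(group) -> int:
    """Return the index of the row with the most non-blank fields.

    max() returns the first maximal element, so ties keep the earliest row.
    """
    return max(
        range(len(group)),
        key=lambda i: sum(not is_blank(v) for v in group[i].values()),
    )


def merge_group(group) -> dict:
    """Keep the most complete row and fill its blanks from the other rows."""
    keep = pick_most_complete(group)
    survivor = dict(group[keep])
    for i, row in enumerate(group):
        if i == keep:
            continue
        for column, value in row.items():
            if is_blank(survivor.get(column)) and not is_blank(value):
                survivor[column] = value
    return survivor
```

Merging only ever fills blank fields in the survivor; it never overwrites a value the surviving row already has.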
Sample Output
$ python -m src.cli samples/messy_sales.csv
Reading messy_sales.csv...
50 rows, 8 columns
Finding duplicates...
──────────────────────────────────────────────────
File: messy_sales.csv
Rows in: 50
Rows out: 28
Removed: 22
Groups: 22
──────────────────────────────────────────────────
Match groups:
Group 1: rows [1, 2] → keep row 1 (confidence: 100.0%, matched on: email)
Group 2: rows [3, 4] → keep row 3 (confidence: 92.3%, matched on: name, phone)
...
This was a preview. Add --apply to write the output files.
Output Files
When --apply is used, three files are produced:
| File | Contents |
|---|---|
| {input}_deduplicated.csv | Cleaned data with duplicates removed |
| {input}_removed.csv | Rows that were removed |
| {input}_match_groups.csv | Audit trail: group ID, confidence, matched columns, survivor flag |
Text Cleaner
Character-level hygiene for messy CSV / Excel input. Solves the dirty-data failure modes that silently break VLOOKUPs, dedup runs, and downstream imports:
- Trailing / leading whitespace and tabs in cells
- Non-breaking spaces (U+00A0) hiding inside text where regular spaces should be
- Smart quotes pasted from Word (“”‘’ → ""'')
- Em / en dashes, ellipsis, other typographic Unicode
- Zero-width and bidi-mark characters (U+200B, U+200C, U+200D, etc.)
- BOMs from Excel "Save As CSV UTF-8"
- Mixed line endings (\r\n, bare \r) inside multi-line cells
- Control characters (U+0000-U+001F minus \t \n \r)
- Optional Unicode NFC / NFKC normalization
- Optional per-column case conversion (UPPER / lower / smart Title / Sentence)
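Several of the failure modes listed above can be handled per cell with the standard library. The sketch below is an illustration under assumptions, not the Text Cleaner's actual pipeline: `clean_cell`, the order of the steps, and the specific replacement choices (for example mapping both dash types to a hyphen) are hypothetical.

```python
import re
import unicodedata

# Typographic characters folded to plain ASCII (assumed mapping).
_TYPOGRAPHIC = str.maketrans({
    "\u00a0": " ",                  # non-breaking space
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # ellipsis
})
# Zero-width characters, left/right bidi marks, and the BOM.
_INVISIBLE = re.compile(r"[\u200b\u200c\u200d\u200e\u200f\ufeff]")
# U+0000-U+001F except tab (09), LF (0A), and CR (0D).
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_cell(text: str) -> str:
    """Apply a fixed sequence of character-level fixes to one cell value."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_TYPOGRAPHIC)
    text = _INVISIBLE.sub("", text)
    # Normalize CRLF and bare CR to LF inside multi-line cells.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = _CONTROL.sub("", text)
    return text.strip()
```

NFC is used here because it only recomposes characters; NFKC also rewrites compatibility characters (ligatures, full-width forms), which is why it is described as lossy.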
# Preview what would change (dry-run)
python -m src.cli_text_clean samples/messy_text.csv
# Apply the safe defaults
python -m src.cli_text_clean samples/messy_text.csv --apply
# Title-case the name column, upper-case the SKU column
python -m src.cli_text_clean products.csv --case title:name,upper:sku --apply
# Just trim and collapse — nothing fancy
python -m src.cli_text_clean messy.csv --preset minimal --apply
Three presets: minimal (trim + collapse only), excel-hygiene (default; everything safe ON), paranoid (adds lossy NFKC fold).
Outputs {input}_cleaned.csv plus a per-cell {input}_changes.csv audit (row, column, old, new, ops applied).
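A per-cell audit of this kind can be built by running named operations in sequence and recording which ones changed each value. The sketch below assumes rows are dictionaries, row indices are 0-based, and ops are joined with commas; none of these details are confirmed by the tool, and `audit_changes` is a hypothetical name.

```python
def audit_changes(rows, ops):
    """Apply named string operations to every cell and log what changed.

    rows: list of dicts (one per row); ops: list of (name, function) pairs.
    Returns (cleaned_rows, changes), where each change records row, column,
    old value, new value, and the names of the ops that altered the cell.
    """
    cleaned, changes = [], []
    for row_index, row in enumerate(rows):
        new_row = {}
        for column, old in row.items():
            new, applied = old, []
            if isinstance(old, str):
                for name, func in ops:
                    result = func(new)
                    if result != new:
                        applied.append(name)
                        new = result
            new_row[column] = new
            if applied:
                changes.append({
                    "row": row_index,
                    "column": column,
                    "old": old,
                    "new": new,
                    "ops": ",".join(applied),
                })
        cleaned.append(new_row)
    return cleaned, changes
```

Recording only the operations that actually changed a value keeps the audit short and makes it easy to see which rule touched a given cell.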
See docs/CLI-REFERENCE.md for every flag.
Documentation
- CLI Reference — every flag with examples and recipe sections
- Developer Guide — architecture, data flow, how to extend
Requirements
- Python 3.10+
- Dependencies: pandas, openpyxl, rapidfuzz, typer, phonenumbers, loguru, tqdm, charset-normalizer
License
Proprietary. All rights reserved.