4687cf87b40da2b30471b12a755e105b68ad3f26
Adds a top-level test infrastructure layer addressing four needs at once:
a single command to run anything, cross-platform automation, install/e2e
sanity, and zero-config pickup of new fixtures dropped into test-cases/.
Top-level runner — run_tests.py
python run_tests.py # everything (default)
python run_tests.py --tool dedup # one tool's tests
python run_tests.py --unit # category scopes
python run_tests.py --e2e # end-to-end CLI
python run_tests.py --install # import / dependency sanity
python run_tests.py --fixtures # corpus + dropped-file sweep
python run_tests.py --coverage # term-missing report
python run_tests.py --quick # skip @pytest.mark.slow
Tools: analyze, cli, config, dedup, io, normalizers, text_clean.
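A rough sketch of how a runner like this might translate its flags into a pytest invocation. The marker names come from the pytest.ini description in this commit; the `build_pytest_args` helper, the use of `-k` for `--tool`, and the `--cov=src` target are assumptions for illustration, not the actual run_tests.py code. `--unit` is left out because the commit does not say how unit tests are selected.

```python
import argparse

# Scope flag -> pytest marker. Marker names are from pytest.ini; the wiring
# itself is an assumption.
SCOPE_MARKERS = {
    "e2e": "e2e",
    "install": "install",
    "fixtures": "fixture_sweep",
}


def build_pytest_args(argv):
    """Translate runner flags into a pytest argument list (illustrative)."""
    parser = argparse.ArgumentParser(prog="run_tests.py")
    parser.add_argument("--tool")
    parser.add_argument("--e2e", action="store_true")
    parser.add_argument("--install", action="store_true")
    parser.add_argument("--fixtures", action="store_true")
    parser.add_argument("--quick", action="store_true")
    parser.add_argument("--coverage", action="store_true")
    opts = parser.parse_args(argv)

    # Combine the selected scopes into one marker expression.
    selected = [m for flag, m in SCOPE_MARKERS.items() if getattr(opts, flag)]
    expr = " or ".join(selected)
    if opts.quick:
        expr = f"({expr}) and not slow" if expr else "not slow"

    args = []
    if expr:
        args += ["-m", expr]
    if opts.tool:
        # Assumed: tool tests are selectable by keyword match on the name.
        args += ["-k", opts.tool]
    if opts.coverage:
        # pytest-cov flags; "src" as the coverage target is an assumption.
        args += ["--cov=src", "--cov-report=term-missing"]
    return args
```

With no flags the list is empty, which matches the "everything (default)" behaviour: pytest then falls back to the testpaths configured in pytest.ini.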
Cross-platform — tox.ini
Envs for py310-py313 plus install / e2e / fixtures / coverage / lint.
Forces UTF-8 (PYTHONUTF8=1, PYTHONIOENCODING=utf-8) so identical fixture
bytes parse the same on Linux/macOS/Windows.
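Why the forced UTF-8 matters, as a small standalone illustration (not project code): the same fixture bytes decode differently when a platform's default encoding is a legacy code page such as cp1252, which is what an unqualified `open()` may use on Windows when UTF-8 mode is off.

```python
# Fixture-like bytes: an accented letter and smart quotes, stored as UTF-8.
raw = "caf\u00e9 \u201cquoted\u201d".encode("utf-8")

# Decoded as intended.
as_utf8 = raw.decode("utf-8")

# Decoded with a legacy Windows code page: every multi-byte sequence turns
# into mojibake, and unmapped bytes become U+FFFD.
as_cp1252 = raw.decode("cp1252", errors="replace")

print(repr(as_utf8))
print(repr(as_cp1252))
```

Setting PYTHONUTF8=1 removes that platform dependence for code that does not pass an explicit `encoding=` argument.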
Shared config — pytest.ini
testpaths, python_files conventions, custom markers (slow, e2e, install,
fixture_sweep), warning filters that fail on our own DeprecationWarnings
while tolerating third-party ones.
New test layers
tests/test_install.py — required deps import; project modules import;
src.core public API surface; CLI --help exits 0; streamlit app.py
parses as valid Python; run_tests.py --help works.
tests/test_e2e.py — CLI roundtrips: cli_analyze table + JSON, cli_text_clean
--apply writes a real file with NBSP/smart-quote folded, dedup CLI
removes duplicates, run_tests.py self-tests.
tests/test_fixtures_sweep.py — parametrizes over every CSV/TSV/XLSX
inside test-cases/ (excluding text-cleaner-corpus/, which has its own
suite). Each fixture must: load through repair_bytes, run analyze()
cleanly, and survive clean_dataframe() with row/col counts unchanged
plus idempotency. Drop a CSV in, re-run — no test code changes needed.
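The zero-config pickup can be pictured as a discovery function feeding pytest parametrization. This is a minimal sketch under assumptions: `discover_fixtures` and its constants are hypothetical names, and the real test module may collect files differently.

```python
from pathlib import Path

FIXTURE_SUFFIXES = {".csv", ".tsv", ".xlsx"}
EXCLUDED_DIRS = {"text-cleaner-corpus"}  # has its own suite


def discover_fixtures(root):
    """Return every CSV/TSV/XLSX under root, skipping excluded directories."""
    root = Path(root)
    found = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in FIXTURE_SUFFIXES:
            continue
        # Skip anything that lives below an excluded directory.
        if EXCLUDED_DIRS & set(path.relative_to(root).parts[:-1]):
            continue
        found.append(path)
    return found


# In a pytest module the list would feed parametrization, for example:
#
#   @pytest.mark.fixture_sweep
#   @pytest.mark.parametrize("path", discover_fixtures("test-cases"), ids=str)
#   def test_fixture_survives_cleaning(path): ...
```

Because the file list is built at collection time, a newly dropped file becomes a new test case on the next run without touching the test module.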
tests/test_gap_coverage.py — closes audit gaps: clean_headers=False
toggle, repair_bytes with tab/semicolon delimiters, BOM+NUL+smart-
quote combined-fix scenario, analyze() over an XLSX path, sample_rows
larger than the DataFrame, mid-cell BOM, findings_by_tool edges, plus
a strict xfail documenting the known §4.17 numeric/phone whitespace
heuristic gap.
Test count
Before: 288 passed + 1 xfailed
After: 475 passed + 2 xfailed (the second xfail is the documented
collapse_whitespace gap on phone-shaped cells; spec §4.17 calls
for a heuristic that hasn't been implemented yet).
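For orientation, one possible shape of such a skip heuristic. This is a guess at the idea only: the regex, the function name `collapse_whitespace_sketch`, and the choice to collapse only runs of spaces and tabs are all assumptions, and the actual rules in spec §4.17 are not reproduced here.

```python
import re

# Hypothetical shape test: the cell consists only of digits plus common
# phone/number/date punctuation. Not the spec's definition.
_NUMERIC_SHAPED = re.compile(r"^\s*\+?[\d\s().,/:-]*\d[\d\s().,/:-]*$")
_SPACE_RUN = re.compile(r"[ \t]{2,}")


def collapse_whitespace_sketch(cell, skip_numeric_shaped=True):
    """Collapse runs of spaces/tabs, leaving numeric-shaped cells untouched."""
    if skip_numeric_shaped and _NUMERIC_SHAPED.match(cell):
        return cell
    return _SPACE_RUN.sub(" ", cell)
```

Passing `skip_numeric_shaped=False` mimics the current unconditional behaviour. Once a real heuristic lands, the strict xfail starts passing unexpectedly, which pytest reports as a failure, so the marker has to be removed at that point.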
Functional gaps surfaced (not fixed in this commit):
- Text cleaner: collapse_whitespace runs unconditionally on every string
cell, including phone/numeric/date-shaped ones. Spec §4.17 requires a
skip heuristic. Captured as strict xfail so the gap stays visible.
- io.read_file does not run pre-parse repair; only analyze() and direct
callers of read_csv_repaired() get it. CLI tool pages and the dedup
CLI miss the safety net.
- Analyzer has no mixed_line_endings detector or near_duplicate_rows
detector; both planned but require additional plumbing.
- GUI tool pages each have their own uploader instead of picking up the
home-page upload through session_state.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DataTools
A bundle of Python data-cleaning tools for CSV and Excel files. Two tools ship today; more are in development.
| # | Tool | What it does |
|---|---|---|
| 01 | Deduplicator | Find and remove duplicate rows with exact + fuzzy matching, smart normalization, and interactive review. |
| 02 | Text Cleaner | Trim whitespace, fold smart quotes, strip invisible / control characters, normalize Unicode, normalize line endings, optional case conversion. |
Deduplicator
Features
- Zero-config start — auto-detects encoding, delimiters, headers, and match columns
- Fuzzy matching — Jaro-Winkler, Levenshtein, and token set ratio algorithms
- 5 built-in normalizers — email (Gmail dot/plus), phone (E.164), name (titles/suffixes), address (USPS), string (whitespace/case)
- Merge mode — fill missing fields in the surviving row from removed duplicates
- 4 survivor rules — keep first, last, most complete, or most recent row per group
- Interactive review — inspect match groups with inline checkboxes and column dropdowns, cherry-pick values, preview surviving rows live
- Config profiles — save and reload your settings as JSON for repeatable runs
- Dual interface — full CLI for automation, Streamlit GUI for visual review
- Dry-run by default — preview what would change before writing anything
- Audit trail — every run produces a match groups report and timestamped log
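To make the normalizer idea concrete, here is a minimal sketch of what a Gmail dot/plus fold could look like. The function name `normalize_email` and its exact rules are assumptions for illustration; the shipped normalizer may handle more cases.

```python
def normalize_email(value):
    """Lowercase an address and fold Gmail dot/plus variants together."""
    value = value.strip().lower()
    local, sep, domain = value.rpartition("@")
    if not sep:
        return value  # not email-shaped; leave it alone
    if domain == "gmail.com":
        # Gmail ignores dots in the local part and anything after "+".
        local = local.split("+", 1)[0].replace(".", "")
    return f"{local}@{domain}"


print(normalize_email(" J.Doe+news@Gmail.com "))
print(normalize_email("j.doe@example.com"))
```

Comparing normalized values rather than raw ones is what lets two spellings of the same mailbox land in the same match group.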
Quick Start
Install
pip install -r requirements.txt
CLI
# Preview duplicates (dry run — no files written)
python -m src.cli customers.csv
# Remove duplicates and save the result
python -m src.cli customers.csv --apply
# Fuzzy-match names at 80% similarity, merge missing fields
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge --apply
# Interactively review each match group
python -m src.cli customers.csv --review --apply
GUI
streamlit run src/gui/app.py
Upload a file, click Find Duplicates, review match groups side-by-side, then download the cleaned result.
CLI Usage Summary
python -m src.cli INPUT_FILE [OPTIONS]
Options:
--apply Write output files (default: preview only)
--output, -o PATH Output file path
--subset, -s COLS Columns to match on (comma-separated)
--key, -k COLS Strong-key columns for exact matching
--fuzzy COLS Columns to fuzzy-match
--algorithm, -a ALG levenshtein | jaro_winkler | token_set_ratio
--threshold, -t N Similarity threshold 0-100 (default: 85)
--normalize COL:TYPE Per-column normalizers (e.g., email:email,phone:phone)
--survivor RULE first | last | most-complete | most-recent
--merge Fill missing fields from removed duplicates
--review Interactively review each match group
--config PATH Load settings from a JSON config file
--save-config PATH Save current settings to JSON
--sheet NAME Excel sheet name or 0-based index
--encoding ENC Override auto-detected encoding
--header-row N 0-based header row index
--help Show full help
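How a 0-100 threshold acts on a pair of values, sketched with the standard library's difflib as a stand-in. The tool itself lists rapidfuzz as a dependency and offers Levenshtein, Jaro-Winkler, and token set ratio, so actual scores will differ; `similarity` and `is_match` are hypothetical helpers.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(a, b, threshold=85):
    """Treat two values as duplicates when similarity reaches the threshold."""
    return similarity(a, b) >= threshold


print(is_match("Jon Smith", "John Smith"))
print(is_match("Jon Smith", "John Smith", threshold=99))
```

Lowering the threshold widens match groups and raises the chance of false positives, which is where the dry-run preview and --review are useful.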
Sample Output
$ python -m src.cli samples/messy_sales.csv
Reading messy_sales.csv...
50 rows, 8 columns
Finding duplicates...
──────────────────────────────────────────────────
File: messy_sales.csv
Rows in: 50
Rows out: 28
Removed: 22
Groups: 22
──────────────────────────────────────────────────
Match groups:
Group 1: rows [1, 2] → keep row 1 (confidence: 100.0%, matched on: email)
Group 2: rows [3, 4] → keep row 3 (confidence: 92.3%, matched on: name, phone)
...
This was a preview. Add --apply to write the output files.
Output Files
When --apply is used, three files are produced:
| File | Contents |
|---|---|
| {input}_deduplicated.csv | Cleaned data with duplicates removed |
| {input}_removed.csv | Rows that were removed |
| {input}_match_groups.csv | Audit trail: group ID, confidence, matched columns, survivor flag |
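A small sketch of the naming pattern, assuming `{input}` means the input file's name without its extension and that outputs are written next to the input. Both points are assumptions; the `output_paths` helper is hypothetical and --output can override the destination.

```python
from pathlib import Path

OUTPUT_KINDS = ("deduplicated", "removed", "match_groups")


def output_paths(input_file):
    """Map each output kind to a path derived from the input file name."""
    src = Path(input_file)
    return {kind: src.with_name(f"{src.stem}_{kind}.csv") for kind in OUTPUT_KINDS}


for kind, path in output_paths("data/customers.csv").items():
    print(kind, path)
```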
Text Cleaner
Character-level hygiene for messy CSV / Excel input. Solves the dirty-data failure modes that silently break VLOOKUPs, dedup runs, and downstream imports:
- Trailing / leading whitespace and tabs in cells
- Non-breaking spaces (U+00A0) hiding inside text where regular spaces should be
- Smart quotes pasted from Word (“ ” ‘ ’ → " " ' ')
- Em / en dashes, ellipsis, other typographic Unicode
- Zero-width and bidi-mark characters (U+200B, U+200C, U+200D, etc.)
- BOMs from Excel "Save As CSV UTF-8"
- Mixed line endings (\r\n, bare \r) inside multi-line cells
- Control characters (U+0000-U+001F minus \t \n \r)
- Optional Unicode NFC / NFKC normalization
- Optional per-column case conversion (UPPER / lower / smart Title / Sentence)
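The kinds of fixes listed above can be sketched for a single cell as follows. This is an illustrative subset with a hypothetical `clean_cell` function, not the tool's implementation; in particular, the ASCII replacements chosen for dashes and the ellipsis are assumptions.

```python
import re

# Typographic characters folded to ASCII (a small illustrative subset).
_FOLD = str.maketrans({
    "\u00a0": " ",                 # non-breaking space
    "\u201c": '"', "\u201d": '"',  # smart double quotes
    "\u2018": "'", "\u2019": "'",  # smart single quotes
    "\u2013": "-", "\u2014": "-",  # en / em dash
    "\u2026": "...",               # ellipsis
})
# Zero-width characters, bidi marks, and the BOM are dropped outright.
_INVISIBLE = re.compile(r"[\u200b\u200c\u200d\u200e\u200f\ufeff]")
# C0 controls except tab, newline, and carriage return.
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_cell(text):
    """Apply fold, strip-invisible, strip-control, newline, and trim steps."""
    text = text.translate(_FOLD)
    text = _INVISIBLE.sub("", text)
    text = _CONTROL.sub("", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.strip()
```

Each step is idempotent, so running the cleaner on already-clean output leaves it unchanged.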
# Preview what would change (dry-run)
python -m src.cli_text_clean samples/messy_text.csv
# Apply the safe defaults
python -m src.cli_text_clean samples/messy_text.csv --apply
# Title-case the name column, upper-case the SKU column
python -m src.cli_text_clean products.csv --case title:name,upper:sku --apply
# Just trim and collapse — nothing fancy
python -m src.cli_text_clean messy.csv --preset minimal --apply
Three presets: minimal (trim + collapse only), excel-hygiene (default; everything safe ON), paranoid (adds lossy NFKC fold).
Outputs {input}_cleaned.csv plus a per-cell {input}_changes.csv audit (row, column, old, new, ops applied).
See docs/CLI-REFERENCE.md for every flag.
Documentation
- CLI Reference — every flag with examples and recipe sections
- Developer Guide — architecture, data flow, how to extend
Requirements
- Python 3.10+
- Dependencies: pandas, openpyxl, rapidfuzz, typer, phonenumbers, loguru, tqdm, charset-normalizer
License
Proprietary. All rights reserved.