
Michael 4687cf87b4 test: single-command runner, cross-platform automation, fixture auto-discovery

Adds a top-level test infrastructure layer addressing four needs at once:
a single command to run anything, cross-platform automation, install/e2e
sanity, and zero-config pickup of new fixtures dropped into test-cases/.

Top-level runner — run_tests.py
  python run_tests.py                # everything (default)
  python run_tests.py --tool dedup   # one tool's tests
  python run_tests.py --unit         # category scopes
  python run_tests.py --e2e          # end-to-end CLI
  python run_tests.py --install      # import / dependency sanity
  python run_tests.py --fixtures     # corpus + dropped-file sweep
  python run_tests.py --coverage     # term-missing report
  python run_tests.py --quick        # skip @pytest.mark.slow
Tools: analyze, cli, config, dedup, io, normalizers, text_clean.
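
The flag-to-scope mapping above could be sketched roughly as follows. This is not the actual run_tests.py: the function name `build_pytest_args`, the `tests/test_<tool>.py` path convention, the marker expressions, and the use of pytest-cov for `--coverage` are all assumptions made for illustration.

```python
import argparse

TOOLS = ["analyze", "cli", "config", "dedup", "io", "normalizers", "text_clean"]


def build_pytest_args(argv):
    """Translate runner flags into a pytest argument list (hypothetical mapping)."""
    parser = argparse.ArgumentParser(prog="run_tests.py")
    parser.add_argument("--tool", choices=TOOLS)
    for flag in ("unit", "e2e", "install", "fixtures", "coverage", "quick"):
        parser.add_argument(f"--{flag}", action="store_true")
    ns = parser.parse_args(argv)

    args = []
    if ns.tool:
        # Assumed convention: one test module per tool.
        args.append(f"tests/test_{ns.tool}.py")

    # Each category scope contributes one marker clause; clauses are OR-ed.
    selected = []
    if ns.unit:
        # Assumed: "unit" means everything outside the special categories.
        selected.append("not e2e and not install and not fixture_sweep")
    if ns.e2e:
        selected.append("e2e")
    if ns.install:
        selected.append("install")
    if ns.fixtures:
        selected.append("fixture_sweep")
    expr = " or ".join(selected)

    if ns.quick:
        # Skip tests marked @pytest.mark.slow.
        expr = f"({expr}) and not slow" if expr else "not slow"
    if expr:
        args += ["-m", expr]

    if ns.coverage:
        # pytest-cov options; assumes the plugin is installed.
        args += ["--cov=src", "--cov-report=term-missing"]
    return args
```

With no flags the list is empty, so pytest falls back to the testpaths configured in pytest.ini and runs everything.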

Cross-platform — tox.ini
  Envs for py310-py313 plus install / e2e / fixtures / coverage / lint.
  Forces UTF-8 (PYTHONUTF8=1, PYTHONIOENCODING=utf-8) so identical fixture
  bytes parse the same on Linux/macOS/Windows.
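
A test that shells out to one of the CLIs can force the same two variables on the child process, so it behaves the same whether or not it runs under tox. A minimal sketch, where the helper name `utf8_env` is hypothetical:

```python
import os


def utf8_env(base=None):
    """Return a copy of the environment with Python's UTF-8 settings forced on."""
    env = dict(os.environ if base is None else base)
    env["PYTHONUTF8"] = "1"            # UTF-8 mode: default text encoding is UTF-8
    env["PYTHONIOENCODING"] = "utf-8"  # encoding for stdin/stdout/stderr
    return env
```

The result can be passed as `env=utf8_env()` to `subprocess.run`.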

Shared config — pytest.ini
  testpaths, python_files conventions, custom markers (slow, e2e, install,
  fixture_sweep), warning filters that fail on our own DeprecationWarnings
  while tolerating third-party ones.

New test layers
  tests/test_install.py — required deps import; project modules import;
    src.core public API surface; CLI --help exits 0; streamlit app.py
    parses as valid Python; run_tests.py --help works.
  tests/test_e2e.py — CLI roundtrips: cli_analyze table + JSON, cli_text_clean
    --apply writes a real file with NBSP/smart-quote folded, dedup CLI
    removes duplicates, run_tests.py self-tests.
  tests/test_fixtures_sweep.py — parametrizes over every CSV/TSV/XLSX
    inside test-cases/ (excluding text-cleaner-corpus/, which has its own
    suite). Each fixture must: load through repair_bytes, run analyze()
    cleanly, and survive clean_dataframe() with row/col counts unchanged
    plus idempotency. Drop a CSV in, re-run — no test code changes needed.
  tests/test_gap_coverage.py — closes audit gaps: clean_headers=False
    toggle, repair_bytes with tab/semicolon delimiters, BOM+NUL+smart-
    quote combined-fix scenario, analyze() over an XLSX path, sample_rows
    larger than the DataFrame, mid-cell BOM, findings_by_tool edges, plus
    a strict xfail documenting the known §4.17 numeric/phone whitespace
    heuristic gap.
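
The zero-config pickup in tests/test_fixtures_sweep.py comes down to discovering files at collection time. A minimal sketch of that discovery step, assuming the directory layout described above; the function and constant names are hypothetical and the real module may differ:

```python
from pathlib import Path

FIXTURE_SUFFIXES = {".csv", ".tsv", ".xlsx"}
EXCLUDED_DIRS = {"text-cleaner-corpus"}  # has its own suite


def discover_fixtures(root):
    """Return every CSV/TSV/XLSX under root, skipping excluded directories."""
    root = Path(root)
    found = []
    for path in root.rglob("*"):
        if not path.is_file() or path.suffix.lower() not in FIXTURE_SUFFIXES:
            continue
        # Check only the directory components, not the file name itself.
        rel = path.relative_to(root)
        if EXCLUDED_DIRS.intersection(rel.parts[:-1]):
            continue
        found.append(path)
    # Sort so test IDs are stable across platforms and runs.
    return sorted(found)
```

Passing `discover_fixtures("test-cases")` to `pytest.mark.parametrize` (with `ids=lambda p: p.name`) would then generate one test per fixture, which is why dropping a new file in needs no code change.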

Test count
  Before: 288 passed + 1 xfailed
  After:  475 passed + 2 xfailed (the second xfail is the documented
          collapse_whitespace gap on phone-shaped cells; spec §4.17 calls
          for a heuristic that hasn't been implemented yet).

Functional gaps surfaced (not fixed in this commit):
  - Text cleaner: collapse_whitespace runs unconditionally on every string
    cell, including phone/numeric/date-shaped ones. Spec §4.17 requires a
    skip heuristic. Captured as strict xfail so the gap stays visible.
  - io.read_file does not run pre-parse repair; only analyze() and direct
    callers of read_csv_repaired() get it. CLI tool pages and the dedup
    CLI miss the safety net.
  - Analyzer has no mixed_line_endings detector or near_duplicate_rows
    detector; both planned but require additional plumbing.
  - GUI tool pages each have their own uploader instead of picking up the
    home-page upload through session_state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-29 16:01:06 +00:00

Name	Last commit	Date
.streamlit	fix: remove headless=true so browser opens on launch	2026-04-29 01:23:07 +00:00
docs	feat: implement text cleaner (script 02) with CLI, GUI, and tests	2026-04-29 15:14:15 +00:00
samples	feat: implement text cleaner (script 02) with CLI, GUI, and tests	2026-04-29 15:14:15 +00:00
src	feat(gui): wire analyzer into home page with findings panel and tool badges	2026-04-29 15:53:22 +00:00
test-cases	test: add text-cleaner corpus and close gaps surfaced by it	2026-04-29 15:37:35 +00:00
tests	test: single-command runner, cross-platform automation, fixture auto-discovery	2026-04-29 16:01:06 +00:00
.gitignore	feat: add documentation, Streamlit GUI, and full source tree	2026-04-28 23:06:39 +00:00
pytest.ini	test: single-command runner, cross-platform automation, fixture auto-discovery	2026-04-29 16:01:06 +00:00
README.md	feat: implement text cleaner (script 02) with CLI, GUI, and tests	2026-04-29 15:14:15 +00:00
requirements-dev.txt	feat: add documentation, Streamlit GUI, and full source tree	2026-04-28 23:06:39 +00:00
requirements.txt	feat: add documentation, Streamlit GUI, and full source tree	2026-04-28 23:06:39 +00:00
run_tests.py	test: single-command runner, cross-platform automation, fixture auto-discovery	2026-04-29 16:01:06 +00:00
tox.ini	test: single-command runner, cross-platform automation, fixture auto-discovery	2026-04-29 16:01:06 +00:00

README.md

DataTools

A bundle of Python data-cleaning tools for CSV and Excel files. Two scripts ship today; more are in development.

#	Tool	What it does
01	Deduplicator	Find and remove duplicate rows with exact + fuzzy matching, smart normalization, and interactive review.
02	Text Cleaner	Trim whitespace, fold smart quotes, strip invisible / control characters, normalize Unicode, normalize line endings, optional case conversion.

Deduplicator

Features

Zero-config start — auto-detects encoding, delimiters, headers, and match columns
Fuzzy matching — Jaro-Winkler, Levenshtein, and token set ratio algorithms
5 built-in normalizers — email (Gmail dot/plus), phone (E.164), name (titles/suffixes), address (USPS), string (whitespace/case)
Merge mode — fill missing fields in the surviving row from removed duplicates
4 survivor rules — keep first, last, most complete, or most recent row per group
Interactive review — inspect match groups with inline checkboxes and column dropdowns, cherry-pick values, preview surviving rows live
Config profiles — save and reload your settings as JSON for repeatable runs
Dual interface — full CLI for automation, Streamlit GUI for visual review
Dry-run by default — preview what would change before writing anything
Audit trail — every run produces a match groups report and timestamped log
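
To make the "Gmail dot/plus" email normalizer concrete, here is a minimal sketch of the idea: Gmail ignores dots and anything after a `+` in the local part, so those variants should compare equal. The function name is hypothetical, treating googlemail.com as an alias is an assumption, and the project's own normalizer may behave differently.

```python
def normalize_email(address):
    """Canonicalize an email for matching; Gmail dot/plus variants collapse."""
    cleaned = address.strip().lower()
    local, sep, domain = cleaned.rpartition("@")
    if not sep:
        # No "@" found: not an email, so only trim and lowercase.
        return cleaned
    if domain in {"gmail.com", "googlemail.com"}:
        # Drop the "+tag" suffix, then remove dots from the local part.
        local = local.split("+", 1)[0].replace(".", "")
        domain = "gmail.com"
    return f"{local}@{domain}"
```

Dots and plus tags are left alone for other domains, where they may be significant.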

Quick Start

Install

pip install -r requirements.txt

CLI

# Preview duplicates (dry run — no files written)
python -m src.cli customers.csv

# Remove duplicates and save the result
python -m src.cli customers.csv --apply

# Fuzzy-match names at 80% similarity, merge missing fields
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge --apply

# Interactively review each match group
python -m src.cli customers.csv --review --apply

GUI

streamlit run src/gui/app.py

Upload a file, click Find Duplicates, review match groups side-by-side, then download the cleaned result.

CLI Usage Summary

python -m src.cli INPUT_FILE [OPTIONS]

Options:
  --apply                  Write output files (default: preview only)
  --output, -o PATH        Output file path
  --subset, -s COLS        Columns to match on (comma-separated)
  --key, -k COLS           Strong-key columns for exact matching
  --fuzzy COLS             Columns to fuzzy-match
  --algorithm, -a ALG      levenshtein | jaro_winkler | token_set_ratio
  --threshold, -t N        Similarity threshold 0-100 (default: 85)
  --normalize COL:TYPE     Per-column normalizers (e.g., email:email,phone:phone)
  --survivor RULE          first | last | most-complete | most-recent
  --merge                  Fill missing fields from removed duplicates
  --review                 Interactively review each match group
  --config PATH            Load settings from a JSON config file
  --save-config PATH       Save current settings to JSON
  --sheet NAME             Excel sheet name or 0-based index
  --encoding ENC           Override auto-detected encoding
  --header-row N           0-based header row index
  --help                   Show full help

Sample Output

$ python -m src.cli samples/messy_sales.csv

Reading messy_sales.csv...
  50 rows, 8 columns
Finding duplicates...

──────────────────────────────────────────────────
  File:      messy_sales.csv
  Rows in:   50
  Rows out:  28
  Removed:   22
  Groups:    22
──────────────────────────────────────────────────

Match groups:
  Group 1: rows [1, 2] → keep row 1 (confidence: 100.0%, matched on: email)
  Group 2: rows [3, 4] → keep row 3 (confidence: 92.3%, matched on: name, phone)
  ...

This was a preview. Add --apply to write the output files.

Output Files

When --apply is used, three files are produced:

File	Contents
`{input}_deduplicated.csv`	Cleaned data with duplicates removed
`{input}_removed.csv`	Rows that were removed
`{input}_match_groups.csv`	Audit trail: group ID, confidence, matched columns, survivor flag
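
The naming pattern can be expressed with pathlib. This sketch assumes `{input}` means the input file's name without its extension and that outputs land next to the input; neither is confirmed by the README, and the function name is hypothetical.

```python
from pathlib import Path


def output_paths(input_file):
    """Map each output kind to its path, derived from the input file name."""
    src = Path(input_file)
    kinds = ("deduplicated", "removed", "match_groups")
    # with_name() swaps only the final path component, keeping the directory.
    return {kind: src.with_name(f"{src.stem}_{kind}.csv") for kind in kinds}
```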

Text Cleaner

Character-level hygiene for messy CSV / Excel input. Solves the dirty-data failure modes that silently break VLOOKUPs, dedup runs, and downstream imports:

Trailing / leading whitespace and tabs in cells
Non-breaking spaces (U+00A0) hiding inside text where regular spaces should be
Smart quotes pasted from Word (“ ” ‘ ’ → " " ' ')
Em / en dashes, ellipsis, other typographic Unicode
Zero-width and bidi-mark characters (U+200B, U+200C, U+200D, etc.)
BOMs from Excel "Save As CSV UTF-8"
Mixed line endings (\r\n, bare \r) inside multi-line cells
Control characters (U+0000-U+001F minus \t \n \r)
Optional Unicode NFC / NFKC normalization
Optional per-column case conversion (UPPER / lower / smart Title / Sentence)
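
Several of the fixes listed above fit in a few lines of standard-library Python. The following is a rough sketch of the technique only: the function name, the exact character sets, and the order of operations are assumptions, and the shipped cleaner covers more cases (presets, per-column case conversion, audit output).

```python
import re
import unicodedata

# Typographic characters folded to ASCII equivalents.
FOLD = str.maketrans({
    "\u00a0": " ",                   # non-breaking space
    "\u201c": '"', "\u201d": '"',    # curly double quotes
    "\u2018": "'", "\u2019": "'",    # curly single quotes
    "\u2013": "-", "\u2014": "-",    # en and em dash
    "\u2026": "...",                 # ellipsis
})
# Zero-width characters, bidi marks (U+200E, U+200F), and the BOM.
INVISIBLE = re.compile(r"[\u200b-\u200f\ufeff]")
# C0 control characters except tab, newline, and carriage return.
CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_cell(text, form="NFC"):
    """Apply line-ending, invisible-character, and quote fixes to one cell."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = INVISIBLE.sub("", text)
    text = CONTROL.sub("", text)
    text = text.translate(FOLD)
    text = unicodedata.normalize(form, text)
    return text.strip()
```

Passing `form="NFKC"` corresponds to the lossy compatibility fold, which is why a tool would reasonably leave it off by default.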

# Preview what would change (dry-run)
python -m src.cli_text_clean samples/messy_text.csv

# Apply the safe defaults
python -m src.cli_text_clean samples/messy_text.csv --apply

# Title-case the name column, upper-case the SKU column
python -m src.cli_text_clean products.csv --case title:name,upper:sku --apply

# Just trim and collapse — nothing fancy
python -m src.cli_text_clean messy.csv --preset minimal --apply

Three presets: minimal (trim + collapse only), excel-hygiene (default; everything safe ON), paranoid (adds lossy NFKC fold).

Outputs {input}_cleaned.csv plus a per-cell {input}_changes.csv audit (row, column, old, new, ops applied).

See docs/CLI-REFERENCE.md for every flag.

Documentation

CLI Reference — every flag with examples and recipe sections
Developer Guide — architecture, data flow, how to extend

Requirements

Python 3.10+
Dependencies: pandas, openpyxl, rapidfuzz, typer, phonenumbers, loguru, tqdm, charset-normalizer

License
