feat: implement text cleaner (script 02) with CLI, GUI, and tests

Builds 02_text_cleaner.py from stub to working: character-level hygiene
for CSV/Excel inputs covering trim, whitespace collapse, smart-character
folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char
strip, line-ending normalization, and per-column case conversion. Three
presets (minimal/excel-hygiene/paranoid) keep the buyer surface small.

- src/core/text_clean.py: pure helpers + CleanOptions/CleanResult +
  clean_dataframe with dtype-safe column selection
- src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape
  (dry-run by default, --apply writes cleaned + changes audit, JSON
  config save/load)
- src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset
  picker, advanced toggles, preview, before/after metrics, and three
  download buttons
- tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests
  covering edge cases E1-E50 from the spec
- samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10
  in 10 rows
- test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case
  fixtures

Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7
entry locking the spec, CLI-REFERENCE.md gains the text cleaner
section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md
status row 02 promoted Skeleton -> Working.

200/200 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-29 15:14:15 +00:00
parent b2ca04e6f4
commit 54f92ae47e
28 changed files with 2093 additions and 58 deletions

View File

@@ -1,6 +1,13 @@
# DataTools Deduplicator
# DataTools
Find and remove duplicate rows in CSV, delimited text, and Excel files — with fuzzy matching, smart normalization, and interactive review.
A bundle of Python data-cleaning tools for CSV and Excel files. Two scripts ship today; more are in build.
| # | Tool | What it does |
|---|---|---|
| 01 | **Deduplicator** | Find and remove duplicate rows with exact + fuzzy matching, smart normalization, and interactive review. |
| 02 | **Text Cleaner** | Trim whitespace, fold smart quotes, strip invisible / control characters, normalize Unicode, normalize line endings, optional case conversion. |
## Deduplicator
## Features
@@ -107,6 +114,41 @@ When `--apply` is used, three files are produced:
| `{input}_removed.csv` | Rows that were removed |
| `{input}_match_groups.csv` | Audit trail: group ID, confidence, matched columns, survivor flag |
## Text Cleaner
Character-level hygiene for messy CSV / Excel input. Solves the dirty-data failure modes that silently break VLOOKUPs, dedup runs, and downstream imports:
- Trailing / leading whitespace and tabs in cells
- Non-breaking spaces (`U+00A0`) hiding inside text where regular spaces should be
- Smart quotes pasted from Word (`"` `"` `'` `'``"` `"` `'` `'`)
- Em / en dashes, ellipsis, other typographic Unicode
- Zero-width and bidi-mark characters (`U+200B`, `U+200C`, `U+200D`, etc.)
- BOMs from Excel "Save As CSV UTF-8"
- Mixed line endings (`\r\n`, bare `\r`) inside multi-line cells
- Control characters (`U+0000`-`U+001F` minus `\t \n \r`)
- Optional Unicode NFC / NFKC normalization
- Optional per-column case conversion (UPPER / lower / smart Title / Sentence)
```bash
# Preview what would change (dry-run)
python -m src.cli_text_clean samples/messy_text.csv
# Apply the safe defaults
python -m src.cli_text_clean samples/messy_text.csv --apply
# Title-case the name column, upper-case the SKU column
python -m src.cli_text_clean products.csv --case title:name,upper:sku --apply
# Just trim and collapse — nothing fancy
python -m src.cli_text_clean messy.csv --preset minimal --apply
```
Three presets: `minimal` (trim + collapse only), `excel-hygiene` (default; everything safe ON), `paranoid` (adds lossy NFKC fold).
Outputs `{input}_cleaned.csv` plus a per-cell `{input}_changes.csv` audit (row, column, old, new, ops applied).
See [docs/CLI-REFERENCE.md](docs/CLI-REFERENCE.md#text-cleaner-cli) for every flag.
## Documentation
- [CLI Reference](docs/CLI-REFERENCE.md) — every flag with examples and recipe sections