feat: implement text cleaner (script 02) with CLI, GUI, and tests
Builds 02_text_cleaner.py from stub to working: character-level hygiene for CSV/Excel inputs covering trim, whitespace collapse, smart-character folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char strip, line-ending normalization, and per-column case conversion. Three presets (minimal/excel-hygiene/paranoid) keep the buyer surface small.

- src/core/text_clean.py: pure helpers + CleanOptions/CleanResult + clean_dataframe with dtype-safe column selection
- src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape (dry-run by default, --apply writes cleaned + changes audit, JSON config save/load)
- src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset picker, advanced toggles, preview, before/after metrics, and three download buttons
- tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests covering edge cases E1-E50 from the spec
- samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10 in 10 rows
- test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case fixtures

Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7 entry locking the spec, CLI-REFERENCE.md gains the text cleaner section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md status row 02 promoted Skeleton -> Working.

200/200 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
46
README.md
@@ -1,6 +1,13 @@
# DataTools Deduplicator
# DataTools

Find and remove duplicate rows in CSV, delimited text, and Excel files — with fuzzy matching, smart normalization, and interactive review.
A bundle of Python data-cleaning tools for CSV and Excel files. Two scripts ship today; more are in build.

| # | Tool | What it does |
|---|---|---|
| 01 | **Deduplicator** | Find and remove duplicate rows with exact + fuzzy matching, smart normalization, and interactive review. |
| 02 | **Text Cleaner** | Trim whitespace, fold smart quotes, strip invisible / control characters, normalize Unicode, normalize line endings, optional case conversion. |

## Deduplicator

## Features
@@ -107,6 +114,41 @@ When `--apply` is used, three files are produced:
| `{input}_removed.csv` | Rows that were removed |
| `{input}_match_groups.csv` | Audit trail: group ID, confidence, matched columns, survivor flag |

## Text Cleaner

Character-level hygiene for messy CSV / Excel input. Solves the dirty-data failure modes that silently break VLOOKUPs, dedup runs, and downstream imports:

- Trailing / leading whitespace and tabs in cells
- Non-breaking spaces (`U+00A0`) hiding inside text where regular spaces should be
- Smart quotes pasted from Word (`“` `”` `‘` `’` → `"` `"` `'` `'`)
- Em / en dashes, ellipsis, other typographic Unicode
- Zero-width and bidi-mark characters (`U+200B`, `U+200C`, `U+200D`, etc.)
- BOMs from Excel "Save As CSV UTF-8"
- Mixed line endings (`\r\n`, bare `\r`) inside multi-line cells
- Control characters (`U+0000`-`U+001F` minus `\t \n \r`)
- Optional Unicode NFC / NFKC normalization
- Optional per-column case conversion (UPPER / lower / smart Title / Sentence)
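
The operations in the list above can be sketched as a single per-cell function. This is a minimal illustration using only the standard library, not the code in `src/core/text_clean.py`: the names `clean_cell`, `SMART_FOLDS`, `INVISIBLES`, and `CONTROLS`, the exact fold table, and the order of the steps are all assumptions made for the example.

```python
import re
import unicodedata

# Hypothetical fold table for illustration; the real tool's mapping may differ.
SMART_FOLDS = {
    "\u201c": '"', "\u201d": '"',  # curly double quotes
    "\u2018": "'", "\u2019": "'",  # curly single quotes
    "\u2013": "-", "\u2014": "-",  # en dash, em dash
    "\u2026": "...",               # horizontal ellipsis
    "\u00a0": " ",                 # non-breaking space
}
SMART_TABLE = str.maketrans(SMART_FOLDS)

# BOM (U+FEFF), zero-width characters (U+200B-U+200D), bidi marks (U+200E, U+200F)
INVISIBLES = re.compile(r"[\ufeff\u200b-\u200f]")
# C0 control characters except tab (0x09), newline (0x0A), carriage return (0x0D)
CONTROLS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_cell(text: str, *, nfkc: bool = False) -> str:
    """Apply the hygiene steps to one cell value (assumed order, see lead-in)."""
    text = INVISIBLES.sub("", text)                        # BOM / zero-width strip
    text = CONTROLS.sub("", text)                          # control-char strip
    text = text.translate(SMART_TABLE)                     # smart-character folding
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # line-ending normalization
    # NFC is lossless composition; NFKC additionally folds compatibility
    # characters (for example fullwidth letters) and is therefore lossy.
    text = unicodedata.normalize("NFKC" if nfkc else "NFC", text)
    text = re.sub(r"[ \t]+", " ", text)                    # collapse spaces and tabs
    return text.strip()                                    # trim both ends
```

Invisible characters are stripped before trimming so that a cell like `"Acme\u200b "` does not keep a trailing space that was hidden behind a zero-width character.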

```bash
# Preview what would change (dry-run)
python -m src.cli_text_clean samples/messy_text.csv

# Apply the safe defaults
python -m src.cli_text_clean samples/messy_text.csv --apply

# Title-case the name column, upper-case the SKU column
python -m src.cli_text_clean products.csv --case title:name,upper:sku --apply

# Just trim and collapse — nothing fancy
python -m src.cli_text_clean messy.csv --preset minimal --apply
```

Three presets: `minimal` (trim + collapse only), `excel-hygiene` (default; everything safe ON), `paranoid` (adds lossy NFKC fold).
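
One way to picture the three presets is as named bundles of toggles, each built on the previous one. The sketch below is hypothetical: `PresetOptions`, its field names, and `resolve_preset` are invented for illustration and are not the `CleanOptions` class from this commit. Which operations count as "safe" for `excel-hygiene` (including NFC) is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class PresetOptions:
    """Hypothetical option bundle; field names are illustrative only."""
    trim: bool = True
    collapse_whitespace: bool = True
    fold_smart_chars: bool = False
    strip_invisibles: bool = False
    strip_controls: bool = False
    normalize_line_endings: bool = False
    unicode_form: Optional[str] = None  # "NFC", "NFKC", or None


# minimal: trim + collapse only
MINIMAL = PresetOptions()

# excel-hygiene: every safe operation on (assumption: NFC is treated as safe)
EXCEL_HYGIENE = replace(
    MINIMAL,
    fold_smart_chars=True,
    strip_invisibles=True,
    strip_controls=True,
    normalize_line_endings=True,
    unicode_form="NFC",
)

# paranoid: same as excel-hygiene, but with the lossy NFKC fold
PARANOID = replace(EXCEL_HYGIENE, unicode_form="NFKC")

PRESETS = {
    "minimal": MINIMAL,
    "excel-hygiene": EXCEL_HYGIENE,
    "paranoid": PARANOID,
}


def resolve_preset(name: str = "excel-hygiene") -> PresetOptions:
    """Look up a preset by name, defaulting to excel-hygiene."""
    try:
        return PRESETS[name]
    except KeyError:
        raise ValueError(f"unknown preset: {name!r}") from None
```

Deriving each preset from the previous one with `dataclasses.replace` keeps the difference between presets visible in one place.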
Outputs `{input}_cleaned.csv` plus a per-cell `{input}_changes.csv` audit (row, column, old, new, ops applied).
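
A per-cell audit of this shape can be produced by recording every cell whose value changes, along with the operations that changed it. The following is a rough standard-library sketch, not the tool's implementation: `clean_rows`, `changes_to_csv`, the 0-based row index, and the `+`-joined `ops` field are assumptions for the example.

```python
import csv
import io


def clean_rows(rows, cleaners):
    """Clean a list of row dicts and record per-cell changes.

    rows:     list of {column: value} dicts
    cleaners: ordered mapping of op name -> function(str) -> str
    Returns (cleaned_rows, changes); each change has row, column, old, new, ops.
    """
    cleaned, changes = [], []
    for row_idx, row in enumerate(rows):  # 0-based index (assumption)
        new_row = {}
        for column, old in row.items():
            new, ops = old, []
            if isinstance(old, str):  # leave non-text cells untouched
                for op_name, func in cleaners.items():
                    result = func(new)
                    if result != new:  # record only ops that had an effect
                        ops.append(op_name)
                        new = result
            new_row[column] = new
            if ops:
                changes.append({
                    "row": row_idx,
                    "column": column,
                    "old": old,
                    "new": new,
                    "ops": "+".join(ops),
                })
        cleaned.append(new_row)
    return cleaned, changes


def changes_to_csv(changes):
    """Serialize the audit records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["row", "column", "old", "new", "ops"]
    )
    writer.writeheader()
    writer.writerows(changes)
    return buffer.getvalue()
```

Calling `clean_rows(rows, {"trim": str.strip})` returns the cleaned rows plus one audit record per modified cell. A dry run can print the records, while an apply step would write both outputs to disk.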

See [docs/CLI-REFERENCE.md](docs/CLI-REFERENCE.md#text-cleaner-cli) for every flag.

## Documentation

- [CLI Reference](docs/CLI-REFERENCE.md) — every flag with examples and recipe sections