Files
datatools-dev/README.md
Michael 54f92ae47e feat: implement text cleaner (script 02) with CLI, GUI, and tests
Builds 02_text_cleaner.py from stub to working: character-level hygiene
for CSV/Excel inputs covering trim, whitespace collapse, smart-character
folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char
strip, line-ending normalization, and per-column case conversion. Three
presets (minimal/excel-hygiene/paranoid) keep the buyer surface small.

- src/core/text_clean.py: pure helpers + CleanOptions/CleanResult +
  clean_dataframe with dtype-safe column selection
- src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape
  (dry-run by default, --apply writes cleaned + changes audit, JSON
  config save/load)
- src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset
  picker, advanced toggles, preview, before/after metrics, and three
  download buttons
- tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests
  covering edge cases E1-E50 from the spec
- samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10
  in 10 rows
- test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case
  fixtures

Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7
entry locking the spec, CLI-REFERENCE.md gains the text cleaner
section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md
status row 02 promoted Skeleton -> Working.

200/200 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:14:15 +00:00

165 lines
6.1 KiB
Markdown

# DataTools
A bundle of Python data-cleaning tools for CSV and Excel files. Two scripts ship today; more are in build.
| # | Tool | What it does |
|---|---|---|
| 01 | **Deduplicator** | Find and remove duplicate rows with exact + fuzzy matching, smart normalization, and interactive review. |
| 02 | **Text Cleaner** | Trim whitespace, fold smart quotes, strip invisible / control characters, normalize Unicode, normalize line endings, optional case conversion. |
## Deduplicator
## Features
- **Zero-config start** — auto-detects encoding, delimiters, headers, and match columns
- **Fuzzy matching** — Jaro-Winkler, Levenshtein, and token set ratio algorithms
- **5 built-in normalizers** — email (Gmail dot/plus), phone (E.164), name (titles/suffixes), address (USPS), string (whitespace/case)
- **Merge mode** — fill missing fields in the surviving row from removed duplicates
- **4 survivor rules** — keep first, last, most complete, or most recent row per group
- **Interactive review** — inspect match groups with inline checkboxes and column dropdowns, cherry-pick values, preview surviving rows live
- **Config profiles** — save and reload your settings as JSON for repeatable runs
- **Dual interface** — full CLI for automation, Streamlit GUI for visual review
- **Dry-run by default** — preview what would change before writing anything
- **Audit trail** — every run produces a match groups report and timestamped log
## Quick Start
### Install
```bash
pip install -r requirements.txt
```
### CLI
```bash
# Preview duplicates (dry run — no files written)
python -m src.cli customers.csv
# Remove duplicates and save the result
python -m src.cli customers.csv --apply
# Fuzzy-match names at 80% similarity, merge missing fields
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge --apply
# Interactively review each match group
python -m src.cli customers.csv --review --apply
```
### GUI
```bash
streamlit run src/gui/app.py
```
Upload a file, click **Find Duplicates**, review match groups side-by-side, then download the cleaned result.
## CLI Usage Summary
```
python -m src.cli INPUT_FILE [OPTIONS]
Options:
--apply Write output files (default: preview only)
--output, -o PATH Output file path
--subset, -s COLS Columns to match on (comma-separated)
--key, -k COLS Strong-key columns for exact matching
--fuzzy COLS Columns to fuzzy-match
--algorithm, -a ALG levenshtein | jaro_winkler | token_set_ratio
--threshold, -t N Similarity threshold 0-100 (default: 85)
--normalize COL:TYPE Per-column normalizers (e.g., email:email,phone:phone)
--survivor RULE first | last | most-complete | most-recent
--merge Fill missing fields from removed duplicates
--review Interactively review each match group
--config PATH Load settings from a JSON config file
--save-config PATH Save current settings to JSON
--sheet NAME Excel sheet name or 0-based index
--encoding ENC Override auto-detected encoding
--header-row N 0-based header row index
--help Show full help
```
## Sample Output
```
$ python -m src.cli samples/messy_sales.csv
Reading messy_sales.csv...
50 rows, 8 columns
Finding duplicates...
──────────────────────────────────────────────────
File: messy_sales.csv
Rows in: 50
Rows out: 28
Removed: 22
Groups: 22
──────────────────────────────────────────────────
Match groups:
Group 1: rows [1, 2] → keep row 1 (confidence: 100.0%, matched on: email)
Group 2: rows [3, 4] → keep row 3 (confidence: 92.3%, matched on: name, phone)
...
This was a preview. Add --apply to write the output files.
```
## Output Files
When `--apply` is used, three files are produced:
| File | Contents |
|------|----------|
| `{input}_deduplicated.csv` | Cleaned data with duplicates removed |
| `{input}_removed.csv` | Rows that were removed |
| `{input}_match_groups.csv` | Audit trail: group ID, confidence, matched columns, survivor flag |
## Text Cleaner
Character-level hygiene for messy CSV / Excel input. Solves the dirty-data failure modes that silently break VLOOKUPs, dedup runs, and downstream imports:
- Trailing / leading whitespace and tabs in cells
- Non-breaking spaces (`U+00A0`) hiding inside text where regular spaces should be
- Smart quotes pasted from Word (`"` `"` `'` `'``"` `"` `'` `'`)
- Em / en dashes, ellipsis, other typographic Unicode
- Zero-width and bidi-mark characters (`U+200B`, `U+200C`, `U+200D`, etc.)
- BOMs from Excel "Save As CSV UTF-8"
- Mixed line endings (`\r\n`, bare `\r`) inside multi-line cells
- Control characters (`U+0000`-`U+001F` minus `\t \n \r`)
- Optional Unicode NFC / NFKC normalization
- Optional per-column case conversion (UPPER / lower / smart Title / Sentence)
```bash
# Preview what would change (dry-run)
python -m src.cli_text_clean samples/messy_text.csv
# Apply the safe defaults
python -m src.cli_text_clean samples/messy_text.csv --apply
# Title-case the name column, upper-case the SKU column
python -m src.cli_text_clean products.csv --case title:name,upper:sku --apply
# Just trim and collapse — nothing fancy
python -m src.cli_text_clean messy.csv --preset minimal --apply
```
Three presets: `minimal` (trim + collapse only), `excel-hygiene` (default; everything safe ON), `paranoid` (adds lossy NFKC fold).
Outputs `{input}_cleaned.csv` plus a per-cell `{input}_changes.csv` audit (row, column, old, new, ops applied).
See [docs/CLI-REFERENCE.md](docs/CLI-REFERENCE.md#text-cleaner-cli) for every flag.
## Documentation
- [CLI Reference](docs/CLI-REFERENCE.md) — every flag with examples and recipe sections
- [Developer Guide](docs/DEVELOPER.md) — architecture, data flow, how to extend
## Requirements
- Python 3.10+
- Dependencies: pandas, openpyxl, rapidfuzz, typer, phonenumbers, loguru, tqdm, charset-normalizer
## License
Proprietary. All rights reserved.