feat: implement text cleaner (script 02) with CLI, GUI, and tests

Builds 02_text_cleaner.py from stub to working: character-level hygiene for CSV/Excel inputs covering trim, whitespace collapse, smart-character folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char strip, line-ending normalization, and per-column case conversion. Three presets (minimal/excel-hygiene/paranoid) keep the buyer surface small.

- src/core/text_clean.py: pure helpers + CleanOptions/CleanResult + clean_dataframe with dtype-safe column selection
- src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape (dry-run by default, --apply writes cleaned + changes audit, JSON config save/load)
- src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset picker, advanced toggles, preview, before/after metrics, and three download buttons
- tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests covering edge cases E1-E50 from the spec
- samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10 in 10 rows
- test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case fixtures

Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7 entry locking the spec, CLI-REFERENCE.md gains the text cleaner section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md status row 02 promoted Skeleton -> Working.

200/200 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -1,6 +1,17 @@
# CLI Reference

Complete command-line reference for the DataTools Deduplicator.
Complete command-line reference for the DataTools bundle.

DataTools ships two CLI modules so each script can be invoked independently:

| Module | Command | Purpose |
|---|---|---|
| `src.cli` | `python -m src.cli INPUT_FILE [OPTIONS]` | Deduplicator (script 01) |
| `src.cli_text_clean` | `python -m src.cli_text_clean INPUT_FILE [OPTIONS]` | Text cleaner (script 02) |

The deduplicator section is below; the text cleaner reference is in [Section: Text Cleaner CLI](#text-cleaner-cli).

## Deduplicator

```
python -m src.cli INPUT_FILE [OPTIONS]
@@ -282,3 +293,122 @@ When `--apply` is set, three files are written:
## Logging

Every run writes a timestamped log to `logs/dedup_YYYYMMDD_HHMMSS.log` with full debug-level details: strategies used, pair comparisons, survivor decisions, and merge actions.

---

# Text Cleaner CLI

Character-level hygiene for CSV / Excel files: whitespace trim and collapse, smart-character folding, Unicode normalization, BOM strip, control-char strip, line-ending normalization, optional case conversion. See TECHNICAL.md Section 10.2 for the full functional spec.

```
python -m src.cli_text_clean INPUT_FILE [OPTIONS]
```

## Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| `INPUT_FILE` | Yes | Path to the CSV, TSV, or Excel file to clean |

## Options

### Core

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--apply` | | `false` | Write output files. Without this flag, only a preview is shown. |
| `--output` | `-o` | `{input}_cleaned.csv` | Output file path. |
| `--preset` | | `excel-hygiene` | Preset bundle of safe defaults. See [Presets](#presets). |

### Scope

| Flag | Default | Description |
|------|---------|-------------|
| `--columns` | all string columns | Comma-separated columns to clean. |
| `--skip` | none | Comma-separated columns to skip even if they look like text. Useful for free-text notes columns you don't want touched. |

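As a rough illustration of how `--columns` and `--skip` might combine, the sketch below resolves the final column list from the two comma-separated values. The helper name `resolve_columns` and the rule that `--skip` wins over `--columns` are assumptions made for this example; the actual selection logic lives in `clean_dataframe` and may differ.

```python
from typing import Optional


def resolve_columns(
    text_columns: list[str],
    columns: Optional[str] = None,
    skip: Optional[str] = None,
) -> list[str]:
    """Hypothetical helper: decide which columns get cleaned.

    text_columns: every string-typed column found in the input file.
    columns:      raw value of --columns (comma-separated), or None.
    skip:         raw value of --skip (comma-separated), or None.
    """

    def split(raw: Optional[str]) -> list[str]:
        # Turn "a, b,,c" into ["a", "b", "c"]; None or "" becomes [].
        return [name.strip() for name in raw.split(",") if name.strip()] if raw else []

    wanted = split(columns) or list(text_columns)  # default: all string columns
    skipped = set(split(skip))
    # Assumed precedence: a skipped column is dropped even if it was requested.
    return [c for c in wanted if c in text_columns and c not in skipped]


print(resolve_columns(["vendor", "product", "notes"], skip="notes"))
# -> ['vendor', 'product']
```

In this sketch, names that are not string columns are silently ignored; a real CLI would more likely report them as an error.
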
### Per-operation toggles

These override the active preset.

| Flag | Effect |
|------|--------|
| `--no-trim` | Disable leading/trailing whitespace strip |
| `--no-collapse` | Disable internal whitespace collapse |
| `--no-nfc` | Disable Unicode NFC normalization |
| `--nfkc` | Enable NFKC compatibility fold (lossy: `①` → `1`, `ﬁ` → `fi`) |
| `--no-smart-chars` | Disable smart-character folding (curly quotes, em/en-dash, NBSP, ellipsis) |
| `--no-zero-width` | Disable zero-width / invisible character strip |
| `--no-bom` | Disable leading BOM strip |
| `--no-control` | Disable control-character strip |
| `--no-line-endings` | Disable line-ending normalization |

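To make the difference between NFC and the lossy NFKC fold concrete, here is a minimal sketch built on the standard library's `unicodedata.normalize`. The `fold` function, the `SMART_CHARS` table, and the `ZERO_WIDTH` set are illustrative assumptions; the real helpers in `src/core/text_clean.py` may cover more characters and apply the steps in a different order.

```python
import unicodedata

# Assumed mapping for illustration only.
SMART_CHARS = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u00a0": " ",                  # no-break space (NBSP)
    "\u2026": "...",                # horizontal ellipsis
})
# Maps each zero-width code point to None, which str.translate treats as "delete".
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060"))


def fold(text: str, nfkc: bool = False) -> str:
    """Normalize, then fold smart characters and strip zero-width characters."""
    text = unicodedata.normalize("NFKC" if nfkc else "NFC", text)
    return text.translate(SMART_CHARS).translate(ZERO_WIDTH)


# NFC only composes: "e" + combining acute accent becomes a single code point.
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")   # True

# NFC keeps the circled digit; NFKC replaces it with a plain "1" (lossy).
print(fold("\u2460"), fold("\u2460", nfkc=True))             # ① 1

# Curly quotes, a zero-width space, an em dash, and an ellipsis in one cell.
print(fold("\u201cHi\u201d\u200b \u2014 ok\u2026"))          # "Hi" - ok...
```

The sketch always applies NFC when NFKC is off, so it does not model `--no-nfc`; it is only meant to show why NFKC is opt-in.
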
### Case conversion

| Flag | Forms | Description |
|------|-------|-------------|
| `--case` | `upper`, `lower`, `title`, `sentence` | Apply this case to every selected column |
| `--case` | `mode:col[,mode:col]` | Per-column case (e.g., `--case title:name,upper:code`) |

Title case preserves all-caps tokens (`USA` stays `USA`) and lowercases mid-string particles (`of`, `and`, `the`, etc.).

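The two `--case` forms and the title-case rules can be sketched roughly as follows. Everything here is an assumption for illustration: the function names, the `"*"` key meaning "every selected column", and the exact particle list (the reference only names `of`, `and`, `the`, "etc.").

```python
MODES = {"upper", "lower", "title", "sentence"}
PARTICLES = {"of", "and", "the", "in", "for"}  # assumed list, not the real one


def parse_case_spec(spec: str) -> dict[str, str]:
    """Return {column: mode}; the key "*" stands for every selected column."""
    if ":" not in spec:
        if spec not in MODES:
            raise ValueError(f"unknown case mode: {spec!r}")
        return {"*": spec}
    mapping: dict[str, str] = {}
    for part in spec.split(","):
        mode, _, column = part.partition(":")
        if mode not in MODES or not column:
            raise ValueError(f"bad --case entry: {part!r}")
        mapping[column] = mode
    return mapping


def title_case(text: str) -> str:
    """Title-case a string, keeping all-caps tokens and lowercasing particles."""
    out = []
    for i, word in enumerate(text.split(" ")):
        if len(word) > 1 and word.isupper():
            out.append(word)                 # all-caps token such as USA is kept
        elif i > 0 and word.lower() in PARTICLES:
            out.append(word.lower())         # particle after the first word
        else:
            out.append(word[:1].upper() + word[1:].lower())
    return " ".join(out)


print(parse_case_spec("title:name,upper:sku"))  # {'name': 'title', 'sku': 'upper'}
print(title_case("bank of america USA"))        # Bank of America USA
```

One consequence of the all-caps rule in this sketch is that a fully upper-case value such as `JOHN SMITH` is left unchanged; how the real cleaner treats that case is not stated here.
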
### Audit and config

| Flag | Default | Description |
|------|---------|-------------|
| `--full-changelog` | `false` | Write every cell change to the audit CSV (by default, only the first 1000 changes are written). |
| `--config` | none | Load options from a saved JSON config file. |
| `--save-config` | none | Save the current options to a JSON config file. |

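A saved config is just the current options serialized to JSON, so a round trip can be sketched with a dataclass and the `json` module. The class `CleanOptionsSketch` and its field names are hypothetical stand-ins; the real `CleanOptions` schema and the file layout written by `--save-config` are not documented here.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class CleanOptionsSketch:
    # Illustrative fields only, not the real CleanOptions definition.
    preset: str = "excel-hygiene"
    nfkc: bool = False
    case: dict = field(default_factory=dict)
    full_changelog: bool = False


def save_config(options: CleanOptionsSketch, path: Path) -> None:
    """Write the options as pretty-printed JSON (what --save-config might do)."""
    Path(path).write_text(json.dumps(asdict(options), indent=2), encoding="utf-8")


def load_config(path: Path) -> CleanOptionsSketch:
    """Rebuild the options from a JSON file (what --config might do)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return CleanOptionsSketch(**data)


# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "my.json"
    save_config(CleanOptionsSketch(preset="minimal", case={"*": "upper"}), config_path)
    restored = load_config(config_path)

print(restored.preset, restored.case)  # minimal {'*': 'upper'}
```

Keeping the options in a plain dataclass makes the saved file easy to read and diff, which is one plausible reason for choosing JSON over a binary format.
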
### File format / encoding

| Flag | Default | Description |
|------|---------|-------------|
| `--sheet` | `0` | Excel sheet name or 0-based index. |
| `--encoding` | auto-detect | Override auto-detected file encoding. |
| `--header-row` | auto-detect | 0-based row index for the header. |

## Presets

| Preset | What it does |
|---|---|
| `minimal` | Trim + collapse whitespace only. Nothing else. |
| `excel-hygiene` (default) | Trim, collapse, NFC, smart-character fold, zero-width strip, BOM strip, control strip, line-ending normalize. NFKC off. |
| `paranoid` | All of `excel-hygiene` plus NFKC compatibility fold (lossy). |

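One way to picture the presets is as named bundles of boolean toggles that the per-operation flags then override. The sketch below follows the table above; the key names, `resolve_options`, and `trim_and_collapse` are assumptions for illustration and are not taken from the actual implementation.

```python
import re

OPERATIONS = [
    "trim", "collapse", "nfc", "smart_chars", "zero_width",
    "bom", "control", "line_endings", "nfkc",
]
ALL_OFF = dict.fromkeys(OPERATIONS, False)
ALL_ON = dict.fromkeys(OPERATIONS, True)

PRESETS = {
    "minimal": {**ALL_OFF, "trim": True, "collapse": True},
    "excel-hygiene": {**ALL_ON, "nfkc": False},
    "paranoid": dict(ALL_ON),
}


def resolve_options(preset: str = "excel-hygiene", **overrides: bool) -> dict[str, bool]:
    """Start from a preset, then apply flag overrides such as trim=False or nfkc=True."""
    options = dict(PRESETS[preset])
    options.update(overrides)
    return options


def trim_and_collapse(text: str, options: dict[str, bool]) -> str:
    """Apply only the two whitespace operations that `minimal` enables."""
    if options["collapse"]:
        text = re.sub(r"[ \t]+", " ", text)  # assumed: runs of spaces/tabs become one space
    if options["trim"]:
        text = text.strip()
    return text


print(resolve_options("excel-hygiene", nfkc=True) == PRESETS["paranoid"])    # True
print(repr(trim_and_collapse("  Acme   Corp \t", resolve_options("minimal"))))  # 'Acme Corp'
```

Under this model, `--preset excel-hygiene --nfkc` and `--preset paranoid` resolve to the same set of operations.
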
## Output Files

When `--apply` is set:

| File | Description |
|------|-------------|
| `{stem}_cleaned.csv` | Cleaned DataFrame |
| `{stem}_changes.csv` | Per-cell audit: `row`, `column`, `old`, `new`, `ops_applied` (capped to 1000 rows by default; use `--full-changelog` for all) |

A timestamped log is always written to `logs/text_clean_YYYYMMDD_HHMMSS.log`.

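The audit file's shape and its 1000-row cap can be sketched with the standard `csv` module. The column names come from the table above; the function `write_changes`, its signature, and the single-string format of `ops_applied` are assumptions made for this example.

```python
import csv
import io
from itertools import islice

FIELDS = ["row", "column", "old", "new", "ops_applied"]


def write_changes(changes, handle, full_changelog: bool = False, cap: int = 1000) -> int:
    """Write audit rows to an open text handle and return how many were written."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    # Without --full-changelog, stop after the first `cap` changes.
    rows = changes if full_changelog else islice(changes, cap)
    count = 0
    for change in rows:
        writer.writerow(change)
        count += 1
    return count


# 1500 identical fake changes, purely to exercise the cap.
sample = [
    {"row": 3, "column": "name", "old": "  Ann ", "new": "Ann", "ops_applied": "trim"}
] * 1500

capped = io.StringIO()
print(write_changes(sample, capped))                                # 1000
print(write_changes(sample, io.StringIO(), full_changelog=True))    # 1500
print(capped.getvalue().splitlines()[0])                            # row,column,old,new,ops_applied
```

Capping by default keeps the audit file small on inputs with many changed cells, while `--full-changelog` remains available when a complete record is needed.
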
## Recipes

```bash
# Preview what would change with the safe defaults
python -m src.cli_text_clean messy.csv

# Apply the safe defaults
python -m src.cli_text_clean messy.csv --apply

# Just the basics: only trim and collapse, leave Unicode/quotes alone
python -m src.cli_text_clean messy.csv --preset minimal --apply

# Title-case the name column, upper-case the SKU column, leave others alone for case
python -m src.cli_text_clean people.csv --case title:name,upper:sku --apply

# Clean only specific columns
python -m src.cli_text_clean orders.csv --columns vendor,product --apply

# Skip a free-text notes column from cleaning
python -m src.cli_text_clean tickets.csv --skip notes --apply

# Save the current settings as a profile and reload it later
python -m src.cli_text_clean messy.csv --preset minimal --case upper --save-config my.json
python -m src.cli_text_clean other.csv --config my.json --apply
```