feat: implement text cleaner (script 02) with CLI, GUI, and tests

Builds 02_text_cleaner.py from stub to working: character-level hygiene
for CSV/Excel inputs covering trim, whitespace collapse, smart-character
folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char
strip, line-ending normalization, and per-column case conversion. Three
presets (minimal/excel-hygiene/paranoid) keep the buyer surface small.

- src/core/text_clean.py: pure helpers + CleanOptions/CleanResult +
  clean_dataframe with dtype-safe column selection
- src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape
  (dry-run by default, --apply writes cleaned + changes audit, JSON
  config save/load)
- src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset
  picker, advanced toggles, preview, before/after metrics, and three
  download buttons
- tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests
  covering edge cases E1-E50 from the spec
- samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10
  in 10 rows
- test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case
  fixtures

Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7
entry locking the spec, CLI-REFERENCE.md gains the text cleaner
section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md
status row 02 promoted Skeleton -> Working.

200/200 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-29 15:14:15 +00:00
parent b2ca04e6f4
commit 54f92ae47e
28 changed files with 2093 additions and 58 deletions

View File

@@ -430,6 +430,81 @@ This section captures the full functional spec for each script, beyond the one-l
35. Schedule / cron integration.
36. Direct Shopify / Klaviyo / Mailchimp API integration to dedupe in place. This would be a real differentiator for the Shopify niche specifically and is probably the right v2 direction if early sales validate the niche.
### 10.2 - 10.9 (Future)
### 10.2 `02_text_cleaner.py` - Character-level hygiene
Functional specs for scripts 02 through 09 to be added when each script enters active build. The deduplicator spec is the template; specs for other scripts should follow the same Tier 1 / Tier 2 / Tier 3 structure with explicit strategic framing (what's the market gap this script fills, given that some of its functionality is available free elsewhere).
**Current implementation status**: Stub only. `src/gui/pages/2_Text_Cleaner.py` is a placeholder UI with disabled controls. No `src/core/text_clean.py`, no CLI, no tests. Tier 1 below is the v1 launch target; nothing in this section is built yet.
**Strategic framing**: Excel and the OS provide effectively nothing here. Find/Replace fixes one character at a time. Power Query's "Clean" strips control chars but ignores BOMs, smart quotes, NBSPs, and zero-width chars. OpenRefine has the operations buried under "Common transforms" where the buyer never finds them. Pandas users `df.applymap(str.strip)` and miss everything else.
The market gap this script fills: **one-click correctness for the dirty-CSV failure modes that cause "why won't this VLOOKUP match?"** Trailing spaces, NBSP-in-place-of-space, smart quotes pasted from Word, mojibake, BOMs from Excel's "Save As CSV UTF-8". The buyer doesn't know they need this script until it fixes a problem they have spent two hours on. Demo value is high: the before/after diff sells itself.
**Boundary clarification** (cross-references Section 9):
- 02 owns whitespace, Unicode normalization, smart-character folding, BOM strip, line-ending normalization, zero-width strip, control-char strip, case ops. Writes cleaned values back to disk.
- 03 (format standardizer) owns dates, currencies, names, phones, addresses.
- 04 (missing values) owns disguised nulls (`N/A`, `-`, `unknown`, sentinel codes). Whitespace-only cells: 02 trims first to empty string; 04 then detects empty as null (per Section 9.3).
- 01 (deduplicator) has its own `normalize_string` helper for *match-time* case-folding. That is a match-time policy and stays distinct from 02's *write-time* policy. The two will not be merged; 02 may use lower-level helpers but does not aggressively case-fold cleaned output by default.
#### Tier 1: Must-ship for v1 to be best-of-class
**Operations** (each independently toggleable; defaults given for the `excel-hygiene` preset)
1. Whitespace trim - leading/trailing on every cell. Default ON.
2. Internal whitespace collapse - multi-space and tabs-in-cells to single space. Default ON.
3. Unicode NFC normalization - combining-character forms folded to canonical (e.g., `e + U+0301` to single `é`). Default ON.
4. Unicode NFKC normalization - compat fold (`①` to `1`, `fi` to `fi`). Default OFF, lossy, opt-in only. Not part of any preset other than `paranoid`.
5. Smart-character folding - curly quotes to ASCII, em/en-dash to hyphen, ellipsis `…` to `...`, NBSP `U+00A0` to space. Default ON.
6. Zero-width / invisible character strip - `U+200B`, `U+200C`, `U+200D`, `U+2060`, mid-string `U+FEFF`. Default ON.
7. BOM strip - `U+FEFF` at the start of the first cell of the first column (covers the case where the I/O layer didn't catch it). Default ON.
8. Control character strip - `U+0000`-`U+001F` and `U+007F`, *except* preserve `\t`, `\n`, `\r`. Default ON.
9. Line-ending normalization - within multi-line cells, `\r\n` and bare `\r` to `\n`. Default ON.
10. Case conversion - UPPER / lower / Title / Sentence. Default OFF, per-column. Title case is "smart": preserves all-caps tokens (`USA`, `NASA`) and lowercases mid-string particles (`of`, `and`, `the`).
**Scope control**
11. Per-column selection - by default operate on string-typed columns only; numeric / datetime columns pass through untouched. User can pick columns explicitly via `--columns`.
12. Skip-list - exclude specific columns via `--skip` even if they match the string-dtype filter (e.g., free-text notes columns).
**Trust and audit**
13. Dry-run preview by default. Output shows N cells that would change in column X. `--apply` writes. Non-negotiable for trust. Same standard as the deduplicator.
14. Per-cell change log: `{input}_changes.csv` with (row, column, old, new, ops_applied). Capped to first N rows by default to avoid 50MB audit files; `--full-changelog` removes the cap.
15. Three output files on `--apply`: `{input}_cleaned.csv`, `{input}_changes.csv`, `logs/text_clean_{ts}.log`. Mirrors the deduplicator output shape.
16. Original input file is never modified.
17. Idempotency: `clean(clean(x)) == clean(x)` for every individual op and every preset. Asserted as a property test.
**Configuration**
18. Presets: `--preset excel-hygiene` (everything safe ON, NFKC OFF, case OFF), `--preset minimal` (only trim + collapse), `--preset paranoid` (everything including NFKC). Buyers should not have to learn 9 flags. Default preset when no flag given: `excel-hygiene`.
19. Save / load JSON config. Same shape and reuse pattern as `DeduplicationConfig`.
**UX**
20. `--help` written for non-technical users with concrete examples, not a flag dump. Per DECISIONS.md Section 4b.
21. Progress bar for files over ~10K rows.
22. Error messages name the row, column, and value that caused the problem. No raw stack traces.
23. Sample data (`samples/messy_text.csv`) demonstrates: smart quotes from Excel, NBSP-vs-space, BOM, mixed line endings, zero-width chars. The before/after diff is the demo.
#### Tier 2: Worth-considering for v1.1
24. Custom regex find/replace - power-user escape hatch, per-column.
25. Diacritic strip (`José` to `Jose`). Lossy; opt-in only.
26. Mojibake auto-repair - detect `é` to `é` patterns (UTF-8 read as Latin-1 then re-encoded) and fix. Standard tool: `ftfy`. Promote to Tier 1 if early buyers report this.
27. Punctuation normalization - all Unicode dash/quote/space variants folded; runs of punctuation collapsed.
28. Profile detector - scan a file and recommend which ops to enable based on what's actually present. Lowers config friction further.
#### Tier 3: Optional / later
29. Locale-aware case conversion (Turkish dotted/dotless `i`, German `ß`).
30. Custom character-class strip rules (regex-class).
31. Streaming / chunked write for very large files (defer until a buyer reports it).
#### Open decisions captured at spec time
- Smart-character folding default ON in `excel-hygiene` accepted as the right tradeoff: highest-impact use case, dry-run preview makes the change visible before commit.
- NFKC stays Tier 1 but OFF by default and excluded from `excel-hygiene`. Lossy by design.
- CLI surface: separate `src/cli_text_clean.py` module, matching the "one CLI binary per script on PATH" pattern in Section 3.2. Not a subcommand on the existing dedup Typer app.
- `ftfy` dependency deferred to Tier 2 (~5MB). Revisit if mojibake reports come in.
### 10.3 - 10.9 (Future)
Functional specs for scripts 03 through 09 to be added when each script enters active build. The deduplicator (10.1) and text cleaner (10.2) specs are the template; specs for other scripts should follow the same Tier 1 / Tier 2 / Tier 3 structure with explicit strategic framing (what's the market gap this script fills, given that some of its functionality is available free elsewhere).