feat: implement text cleaner (script 02) with CLI, GUI, and tests

Builds 02_text_cleaner.py from stub to working: character-level hygiene for CSV/Excel inputs covering trim, whitespace collapse, smart-character folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char strip, line-ending normalization, and per-column case conversion. Three presets (minimal/excel-hygiene/paranoid) keep the buyer surface small.

- src/core/text_clean.py: pure helpers + CleanOptions/CleanResult + clean_dataframe with dtype-safe column selection
- src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape (dry-run by default, --apply writes cleaned + changes audit, JSON config save/load)
- src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset picker, advanced toggles, preview, before/after metrics, and three download buttons
- tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests covering edge cases E1-E50 from the spec
- samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10 in 10 rows
- test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case fixtures

Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7 entry locking the spec, CLI-REFERENCE.md gains the text cleaner section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md status row 02 promoted Skeleton -> Working.

200/200 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:14:15 +00:00
parent b2ca04e6f4
commit 54f92ae47e
28 changed files with 2093 additions and 58 deletions
--- a/docs/TECHNICAL.md
+++ b/docs/TECHNICAL.md
@@ -430,6 +430,81 @@ This section captures the full functional spec for each script, beyond the one-l
 35. Schedule / cron integration.
 36. Direct Shopify / Klaviyo / Mailchimp API integration to dedupe in place. This would be a real differentiator for the Shopify niche specifically and is probably the right v2 direction if early sales validate the niche.

-### 10.2 - 10.9 (Future)
+### 10.2 `02_text_cleaner.py` - Character-level hygiene

-Functional specs for scripts 02 through 09 to be added when each script enters active build. The deduplicator spec is the template; specs for other scripts should follow the same Tier 1 / Tier 2 / Tier 3 structure with explicit strategic framing (what's the market gap this script fills, given that some of its functionality is available free elsewhere).
+**Current implementation status**: Stub only. `src/gui/pages/2_Text_Cleaner.py` is a placeholder UI with disabled controls. No `src/core/text_clean.py`, no CLI, no tests. Tier 1 below is the v1 launch target; nothing in this section is built yet.
+
+**Strategic framing**: Excel and the OS provide effectively nothing here. Find/Replace fixes one character at a time. Power Query's "Clean" strips control chars but ignores BOMs, smart quotes, NBSPs, and zero-width chars. OpenRefine has the operations buried under "Common transforms" where the buyer never finds them. Pandas users run `df.applymap(str.strip)` and miss everything else.
+
+The market gap this script fills: **one-click correctness for the dirty-CSV failure modes that cause "why won't this VLOOKUP match?"** Trailing spaces, NBSP-in-place-of-space, smart quotes pasted from Word, mojibake, BOMs from Excel's "Save As CSV UTF-8". The buyer doesn't know they need this script until it fixes a problem they have spent two hours on. Demo value is high: the before/after diff sells itself.
+
+**Boundary clarification** (cross-references Section 9):
+- 02 owns whitespace, Unicode normalization, smart-character folding, BOM strip, line-ending normalization, zero-width strip, control-char strip, case ops. Writes cleaned values back to disk.
+- 03 (format standardizer) owns dates, currencies, names, phones, addresses.
+- 04 (missing values) owns disguised nulls (`N/A`, `-`, `unknown`, sentinel codes). Whitespace-only cells: 02 trims first to empty string; 04 then detects empty as null (per Section 9.3).
+- 01 (deduplicator) has its own `normalize_string` helper for *match-time* case-folding. That is a match-time policy and stays distinct from 02's *write-time* policy. The two will not be merged; 02 may use lower-level helpers but does not aggressively case-fold cleaned output by default.
+
+#### Tier 1: Must-ship for v1 to be best-of-class
+
+**Operations** (each independently toggleable; defaults given for the `excel-hygiene` preset)
+
+1. Whitespace trim - leading/trailing on every cell. Default ON.
+2. Internal whitespace collapse - multi-space and tabs-in-cells to single space. Default ON.
+3. Unicode NFC normalization - combining-character forms folded to canonical (e.g., `e + U+0301` to single `é`). Default ON.
+4. Unicode NFKC normalization - compat fold (`①` to `1`, `ﬁ` to `fi`). Default OFF, lossy, opt-in only. Not part of any preset other than `paranoid`.
+5. Smart-character folding - curly quotes to ASCII, em/en-dash to hyphen, ellipsis `…` to `...`, NBSP `U+00A0` to space. Default ON.
+6. Zero-width / invisible character strip - `U+200B`, `U+200C`, `U+200D`, `U+2060`, mid-string `U+FEFF`. Default ON.
+7. BOM strip - `U+FEFF` at the start of the first cell of the first column (covers the case where the I/O layer didn't catch it). Default ON.
+8. Control character strip - `U+0000`-`U+001F` and `U+007F`, *except* preserve `\t`, `\n`, `\r`. Default ON.
+9. Line-ending normalization - within multi-line cells, `\r\n` and bare `\r` to `\n`. Default ON.
+10. Case conversion - UPPER / lower / Title / Sentence. Default OFF, per-column. Title case is "smart": preserves all-caps tokens (`USA`, `NASA`) and lowercases mid-string particles (`of`, `and`, `the`).
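
The operations above (1-9) can be sketched as a single per-cell function using only the standard library. This is an illustrative sketch, not the code in `src/core/text_clean.py`: `clean_cell`, `SMART_MAP`, `ZERO_WIDTH`, and `CONTROL` are hypothetical names, every op is hard-wired ON rather than toggleable, and `U+FEFF` is stripped everywhere, which covers both the BOM case (op 7) and the mid-string case (op 6). Case conversion (op 10) is left out.

```python
import re
import unicodedata

# Hypothetical names for illustration; not the project's real helpers.
SMART_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # horizontal ellipsis
    "\u00a0": " ",                  # NBSP
})
# Zero-width characters plus U+FEFF (BOM / zero-width no-break space).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))
# C0 controls and DEL, preserving \t (0x09), \n (0x0A), and \r (0x0D).
CONTROL = dict.fromkeys(
    c for c in [*range(0x20), 0x7F] if c not in (0x09, 0x0A, 0x0D)
)


def clean_cell(value: str, nfkc: bool = False) -> str:
    """Apply ops 1-9 to one cell; NFKC replaces NFC only when requested."""
    text = value.translate(ZERO_WIDTH)     # ops 6 and 7
    text = text.translate(CONTROL)         # op 8
    text = text.translate(SMART_MAP)       # op 5
    # Normalize after stripping so a base letter and a combining mark that
    # were separated by an invisible character still compose (ops 3 and 4).
    text = unicodedata.normalize("NFKC" if nfkc else "NFC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")   # op 9
    text = re.sub(r"[ \t]+", " ", text)    # op 2, newlines are kept
    return text.strip()                    # op 1
```

The order matters for the idempotency requirement: stripping invisible characters before normalizing means a second pass finds nothing left to compose, and collapsing whitespace after NFKC catches any spaces the compatibility fold introduces.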
+
+**Scope control**
+
+11. Per-column selection - by default operate on string-typed columns only; numeric / datetime columns pass through untouched. User can pick columns explicitly via `--columns`.
+12. Skip-list - exclude specific columns via `--skip` even if they match the string-dtype filter (e.g., free-text notes columns).
+
+**Trust and audit**
+
+13. Dry-run preview by default. Output shows N cells that would change in column X. `--apply` writes. Non-negotiable for trust. Same standard as the deduplicator.
+14. Per-cell change log: `{input}_changes.csv` with (row, column, old, new, ops_applied). Capped to first N rows by default to avoid 50MB audit files; `--full-changelog` removes the cap.
+15. Three output files on `--apply`: `{input}_cleaned.csv`, `{input}_changes.csv`, `logs/text_clean_{ts}.log`. Mirrors the deduplicator output shape.
+16. Original input file is never modified.
+17. Idempotency: `clean(clean(x)) == clean(x)` for every individual op and every preset. Asserted as a property test.
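
The audit items above can be illustrated with a small sketch of a capped per-cell change log and the idempotency check. Everything here is assumed for illustration: `clean` is a stand-in cleaner (trim and collapse only), `change_log` and `write_change_log` are hypothetical names, the cap of 1000 is arbitrary, and the `ops_applied` column from item 14 is omitted to keep the example short.

```python
import csv
import io
from typing import Iterator, Optional


def clean(value: str) -> str:
    # Stand-in cleaner: trims and collapses all whitespace runs to one space.
    return " ".join(value.split())


def change_log(
    rows: list[dict[str, str]], cap: Optional[int] = 1000
) -> Iterator[tuple[int, str, str, str]]:
    """Yield (row, column, old, new) per changed cell, at most `cap` entries."""
    emitted = 0
    for row_number, row in enumerate(rows, start=1):
        for column, old in row.items():
            new = clean(old)
            if new == old:
                continue
            if cap is not None and emitted >= cap:
                return  # cap reached; cap=None would log every change
            emitted += 1
            yield row_number, column, old, new


def write_change_log(rows: list[dict[str, str]], cap: Optional[int] = 1000) -> str:
    """Render the change log as CSV text; the input rows are never modified."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["row", "column", "old", "new"])
    writer.writerows(change_log(rows, cap))
    return buffer.getvalue()


def is_idempotent(value: str) -> bool:
    # The property from item 17: cleaning twice equals cleaning once.
    return clean(clean(value)) == clean(value)
```

A dry run would report only the entries from `change_log`; passing `cap=None` corresponds to the behaviour described for `--full-changelog`.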
+
+**Configuration**
+
+18. Presets: `--preset excel-hygiene` (everything safe ON, NFKC OFF, case OFF), `--preset minimal` (only trim + collapse), `--preset paranoid` (everything including NFKC). Buyers should not have to learn 9 flags. Default preset when no flag given: `excel-hygiene`.
+19. Save / load JSON config. Same shape and reuse pattern as `DeduplicationConfig`.
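
One way the presets and JSON round-trip above could be modelled is a frozen dataclass with one boolean per operation. This is a sketch under assumptions: `CleanOptionsSketch` and its field names are hypothetical (the real `CleanOptions` fields are not shown in this section), and per-column case conversion is left out because it is not a simple on/off flag.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class CleanOptionsSketch:
    # Defaults mirror the excel-hygiene preset: safe ops ON, NFKC OFF.
    trim: bool = True
    collapse_whitespace: bool = True
    nfc: bool = True
    nfkc: bool = False
    fold_smart_chars: bool = True
    strip_zero_width: bool = True
    strip_bom: bool = True
    strip_control: bool = True
    normalize_line_endings: bool = True


_ALL_OFF = {f.name: False for f in fields(CleanOptionsSketch)}

PRESETS = {
    "excel-hygiene": CleanOptionsSketch(),
    "minimal": CleanOptionsSketch(
        **{**_ALL_OFF, "trim": True, "collapse_whitespace": True}
    ),
    "paranoid": CleanOptionsSketch(nfkc=True),
}


def to_json(options: CleanOptionsSketch) -> str:
    """Serialize options to a stable, human-editable JSON document."""
    return json.dumps(asdict(options), indent=2, sort_keys=True)


def from_json(text: str) -> CleanOptionsSketch:
    """Load options from JSON; unknown keys raise TypeError."""
    return CleanOptionsSketch(**json.loads(text))
```

Keeping the presets as plain instances of the same options type means a saved config and a preset go through the same code path, so the buyer can start from a preset, save it, and edit individual flags later.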
+
+**UX**
+
+20. `--help` written for non-technical users with concrete examples, not a flag dump. Per DECISIONS.md Section 4b.
+21. Progress bar for files over ~10K rows.
+22. Error messages name the row, column, and value that caused the problem. No raw stack traces.
+23. Sample data (`samples/messy_text.csv`) demonstrates: smart quotes from Excel, NBSP-vs-space, BOM, mixed line endings, zero-width chars. The before/after diff is the demo.
+
+#### Tier 2: Worth-considering for v1.1
+
+24. Custom regex find/replace - power-user escape hatch, per-column.
+25. Diacritic strip (`José` to `Jose`). Lossy; opt-in only.
+26. Mojibake auto-repair - detect patterns such as `Ã©` (a UTF-8 `é` decoded as Latin-1 and then re-encoded) and restore the intended character. Standard tool: `ftfy`. Promote to Tier 1 if early buyers report this.
+27. Punctuation normalization - all Unicode dash/quote/space variants folded; runs of punctuation collapsed.
+28. Profile detector - scan a file and recommend which ops to enable based on what's actually present. Lowers config friction further.
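
Two of the Tier 2 items above, diacritic strip (25) and mojibake repair (26), can be illustrated with the standard library alone. This is a minimal sketch with hypothetical function names. The mojibake helper handles only the single UTF-8-read-as-Latin-1 round trip described in item 26; a dedicated tool such as `ftfy` is designed for a much wider range of cases.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Lossy: decompose, drop combining marks, then recompose what remains."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def repair_mojibake(text: str) -> str:
    """Undo one UTF-8-decoded-as-Latin-1 round trip, or return text unchanged."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text contains characters outside Latin-1, or its bytes
        # are not valid UTF-8; in both cases it is not this kind of mojibake.
        return text
```

Correctly encoded text such as `José` passes through `repair_mojibake` untouched, because the Latin-1 byte for `é` on its own is not valid UTF-8 and the function falls back to the original string.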
+
+#### Tier 3: Optional / later
+
+29. Locale-aware case conversion (Turkish dotted/dotless `i`, German `ß`).
+30. Custom character-class strip rules (regex-class).
+31. Streaming / chunked write for very large files (defer until a buyer reports it).
+
+#### Open decisions captured at spec time
+
+- Smart-character folding default ON in `excel-hygiene` accepted as the right tradeoff: highest-impact use case, dry-run preview makes the change visible before commit.
+- NFKC stays Tier 1 but OFF by default and excluded from `excel-hygiene`. Lossy by design.
+- CLI surface: separate `src/cli_text_clean.py` module, matching the "one CLI binary per script on PATH" pattern in Section 3.2. Not a subcommand on the existing dedup Typer app.
+- `ftfy` dependency deferred to Tier 2 (~5MB). Revisit if mojibake reports come in.
+
+### 10.3 - 10.9 (Future)
+
+Functional specs for scripts 03 through 09 to be added when each script enters active build. The deduplicator (10.1) and text cleaner (10.2) specs are the template; specs for other scripts should follow the same Tier 1 / Tier 2 / Tier 3 structure with explicit strategic framing (what's the market gap this script fills, given that some of its functionality is available free elsewhere).