feat: implement text cleaner (script 02) with CLI, GUI, and tests

Builds 02_text_cleaner.py from stub to working: character-level hygiene
for CSV/Excel inputs covering trim, whitespace collapse, smart-character
folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char
strip, line-ending normalization, and per-column case conversion. Three
presets (minimal/excel-hygiene/paranoid) keep the buyer surface small.
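
For reviewers, a minimal sketch of how the operations listed above might
compose on a single cell, using only the standard library. This is not the
code in src/core/text_clean.py: `clean_cell`, `SMART_CHARS`, and `ZERO_WIDTH`
are hypothetical names, and the exact set of folded characters and the order
of operations are assumptions made for illustration.

```python
import re
import unicodedata

# Hypothetical fold table; the real script's mapping may differ.
SMART_CHARS = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
})
# Characters deleted outright (assumed set): ZWSP, ZWNJ, ZWJ, word joiner.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060"))
BOM = "\ufeff"


def clean_cell(text: str, nfkc: bool = False) -> str:
    """Apply the hygiene steps to one string and return the cleaned string."""
    text = text.replace(BOM, "")                           # BOM strip
    text = text.translate(ZERO_WIDTH)                      # zero-width strip
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # line endings -> LF
    # Control-char strip: drop category Cc except newline and tab.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    text = text.translate(SMART_CHARS)                     # smart-character fold
    # NFC by default; NFKC is lossy, so it is opt-in here as in the spec.
    text = unicodedata.normalize("NFKC" if nfkc else "NFC", text)
    text = re.sub(r"[ \t]+", " ", text)                    # whitespace collapse
    return text.strip()                                    # trim


dirty = "\ufeff  Acme\u200b  \u201cInc\u201d\r\n "
print(repr(clean_cell(dirty)))
```

Running the function on its own output returns the same string, which is the
idempotency property the spec asks for.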

- src/core/text_clean.py: pure helpers + CleanOptions/CleanResult +
  clean_dataframe with dtype-safe column selection
- src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape
  (dry-run by default, --apply writes cleaned + changes audit, JSON
  config save/load)
- src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset
  picker, advanced toggles, preview, before/after metrics, and three
  download buttons
- tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests
  covering edge cases E1-E50 from the spec
- samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10
  in 10 rows
- test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case
  fixtures
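
The dry-run and change-audit behaviour mentioned for the CLI can be pictured
roughly as below. This is a sketch under stated assumptions, not the shipped
implementation: `audit_changes` and the audit record keys (`row`, `column`,
`before`, `after`) are hypothetical, and plain dicts stand in for DataFrame
rows.

```python
from typing import Any, Callable, Iterable, Optional


def audit_changes(
    rows: Iterable[dict],
    clean: Callable[[str], str],
    columns: Optional[set] = None,
):
    """Return (cleaned_rows, changes) without mutating the input rows.

    A dry run would report `changes` only; an apply step would also
    write `cleaned_rows` out.
    """
    cleaned, changes = [], []
    for i, row in enumerate(rows):
        new_row: dict[str, Any] = dict(row)  # copy so the input stays intact
        for col, value in row.items():
            if columns is not None and col not in columns:
                continue  # per-column scope control
            if not isinstance(value, str):
                continue  # dtype-safe: leave non-text values untouched
            after = clean(value)
            if after != value:
                new_row[col] = after
                changes.append(
                    {"row": i, "column": col, "before": value, "after": after}
                )
        cleaned.append(new_row)
    return cleaned, changes


rows = [{"name": " Ann ", "age": 30}, {"name": "Bob", "age": 41}]
cleaned, changes = audit_changes(rows, str.strip)
print(changes)
```

Because a second pass over `cleaned` produces an empty change list, the same
helper doubles as a cheap idempotency check.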

Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7
entry locking the spec, CLI-REFERENCE.md gains the text cleaner
section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md
status row 02 promoted Skeleton -> Working.

200/200 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:14:15 +00:00
parent b2ca04e6f4
commit 54f92ae47e
28 changed files with 2093 additions and 58 deletions


@@ -250,6 +250,7 @@ Own-domain SEO is treated as a long-term compounding asset (6-18 months to tract
| April 28, 2026 (v1.3) | **Add hosted browser demo as secondary distribution surface and conversion lever** | Direct consequence of Streamlit choice. See Section 5 and BUSINESS.md Section 7. |
| April 28, 2026 (v1.4) | **Re-apply 03/05 script boundary work dropped during v1.3 merge (silent drift recovery)** | Stream B v1.2 content (sharpened 03/05 descriptions in USER-GUIDE, run-order rule, TECHNICAL.md Section 9 boundary spec, RECOVERY.md pointer) was overwritten when Stream A's parallel v1.3 Streamlit work was saved to project. Restoring per the doc's own no-silent-drift rule. 03 owns "what's not there" (missing values, sentinel codes, imputation), 05 owns "what shouldn't be there" (statistical outliers, domain rules, winsorization). 03 runs before 05 because outlier statistics on data containing NaN or sentinel codes are mathematically poisoned. See TECHNICAL.md Section 9. |
| April 28, 2026 (v1.5) | **Add `02_text_cleaner.py` as new script; renumber 02-08 → 03-09** | Audit revealed character-level hygiene (whitespace trimming, multi-space collapse, Unicode normalization, BOM handling, line-ending normalization, special-character handling) had no clear owner. Was implicitly scattered: `01_deduplicator` normalizes internally for matching only (doesn't write back), `02_format_standardizer` (now 03) implies it but its named scope is dates/currencies/names/phones/addresses, `03_missing_value_handler` (now 04) only handles whitespace-only as disguised null. A buyer with trailing-space pollution had no obvious script to run. Per Section 4a (functional scope principle: one-stop shopping for the workflow), this was a real gap. Added as 02 because text cleaning is a pre-processing step that should run before format standardization, missing-value handling, and outlier detection. Kept 01 (deduplicator) at position 1 as the lead/working/marketing-flagship script; numbering does not strictly equal pipeline order, the orchestrator manages execution order. Renumber consequence: TECHNICAL.md Section 9 boundary references updated 03→04, 05→06; orchestrator references updated 08→09. New contested case documented in Section 9.3: whitespace-only cells (02 trims first, leaving empty string; 04 then detects empty strings as disguised null). Master orchestrator now 09. |
| April 29, 2026 (v1.7) | **Adopt `02_text_cleaner.py` Tier 1/2/3 functional spec; lock `excel-hygiene` as default preset** | Promotes character-level hygiene from a stub to a buildable v1 target. Strategic framing: Excel/Power Query/OpenRefine fail this category for non-technical buyers; the gap is "one-click correctness for dirty-CSV failure modes that cause silent VLOOKUP misses." Spec covers 10 toggleable ops (trim, collapse, NFC, smart-char fold, zero-width strip, BOM strip, control strip, line-ending normalize, NFKC opt-in, per-column case), per-column scope control, dry-run-by-default, per-cell change audit, idempotency, three presets (`minimal`/`excel-hygiene`/`paranoid`), and JSON config save/load. Output shape mirrors deduplicator: `{input}_cleaned.csv`, `{input}_changes.csv`, `logs/text_clean_{ts}.log`. Boundary with adjacent scripts re-asserted: 02 trims whitespace-only cells to empty (04 then detects empty as null per Section 9.3); 02 is *write-time* and stays distinct from `01_deduplicator`'s match-time `normalize_string` helper. Smart-character fold defaults ON in `excel-hygiene` because demo value is highest there and dry-run preview makes the change visible before commit. NFKC stays opt-in (lossy). `ftfy` mojibake repair deferred to Tier 2 to avoid the 5MB dep without buyer demand. CLI ships as separate `src/cli_text_clean.py` module per the one-CLI-per-script pattern in TECHNICAL Section 3.2. Full spec in TECHNICAL.md Section 10.2. |
| April 28, 2026 (v1.6) | **Fold conversation-history content into docs: deduplicator functional spec, lead bundle use cases, competitive landscape, full GUI framework comparison matrix, concrete 04/06 boundary examples, expanded Streamlit-to-SaaS reasoning** | None of this represents new decisions; all of it represents prior analysis that lived only in chat history and was at risk of evaporating. Per the doc's own no-silent-drift rule (Section 8) and the v1.4 recovery story, valuable analysis must be promoted to docs to survive. Specifically: TECHNICAL.md gains Section 10 (per-script functional specs, starting with the deduplicator's 36-item tiered spec) which is the buildable target for the v1 launch GUI port; this also makes the gap between "currently working" (exact + basic fuzzy) and "v1 launch best-of-class" (Tier 1) explicit so the docs don't quietly overstate where the code is. Section 9.3 gains three concrete distinguishing examples (bank-export blank fees / $1M outlier / "999=refused") that prove 04 and 06 are distinct concerns. BUSINESS.md gains Section 4a (Lead Bundle Deep Dive: 15 use cases by persona, 6-row competitive landscape table, market gap statement) which feeds landing page copy and demo design. Section 4c gains a 10-dimension scored framework matrix and per-option summaries (locks the rejection reasoning against re-litigation), plus expanded point 4 on Streamlit-to-SaaS migration cost. RECOVERY.md updated to reference Section 10 in rebuild and priority steps. No structural decisions changed; this is pure capture work. |
---