Commit Graph

2 Commits

Author SHA1 Message Date
0671ef277e feat(io): route read_file through pre-parse repair by default
Previously only analyze() and direct read_csv_repaired() callers got the
byte-level repair pass (BOM strip, NUL strip, smart-double-quote fold,
unquoted-delimiter merge). The dedup CLI and any other read_file consumer
silently missed it.

read_file gains a repair=True default. CSV/TSV inputs run through
repair_bytes before pandas sees them; Excel inputs still pass through
unchanged. Chunked reads (chunk_size set) bypass repair because the pre-
parse pass loads the whole file — preserving streaming behavior on huge
files. Repair actions and unrepairable lines are logged at INFO/WARNING.

cli_text_clean opts out (repair=False): the cleaner offers fine-grained
control via --preset and per-op flags, and a byte-level smart-quote fold
under the user's "minimal" preset would violate that contract. The
cell-level cleaner does the equivalent work itself when its options ask
for it.

Tests: read_file default strips BOM and folds curly double quotes;
repair=False preserves smart quotes; chunked reads still work and skip
repair as documented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:09:35 +00:00
54f92ae47e feat: implement text cleaner (script 02) with CLI, GUI, and tests
Builds 02_text_cleaner.py from stub to working: character-level hygiene
for CSV/Excel inputs covering trim, whitespace collapse, smart-character
folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char
strip, line-ending normalization, and per-column case conversion. Three
presets (minimal/excel-hygiene/paranoid) keep the buyer surface small.

- src/core/text_clean.py: pure helpers + CleanOptions/CleanResult +
  clean_dataframe with dtype-safe column selection
- src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape
  (dry-run by default, --apply writes cleaned + changes audit, JSON
  config save/load)
- src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset
  picker, advanced toggles, preview, before/after metrics, and three
  download buttons
- tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests
  covering edge cases E1-E50 from the spec
- samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10
  in 10 rows
- test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case
  fixtures

Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7
entry locking the spec, CLI-REFERENCE.md gains the text cleaner
section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md
status row 02 promoted Skeleton -> Working.

200/200 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:14:15 +00:00