Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.
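The gate behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in normalize.py: the `confidence`, `fix_action`, and `pre_applied` fields come from this commit, while `severity`, `waived`, `triage()`, and `gate_is_open()` are hypothetical names introduced only for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Finding:
    code: str
    severity: str                     # assumed field: "error", "warning", or "info"
    confidence: str                   # "high", "medium", or "low"
    fix_action: Optional[str] = None  # id of the fix to apply, if any
    pre_applied: bool = False         # set once the gate has auto-applied the fix
    waived: bool = False              # assumed field: user chose to waive the finding


def triage(findings: List[Finding]) -> Tuple[List[Finding], List[Finding]]:
    """Split findings into auto-applied fixes and ones that need user review."""
    auto, review = [], []
    for f in findings:
        if f.fix_action and f.confidence == "high":
            f.pre_applied = True      # high confidence: fix without asking
            auto.append(f)
        else:
            review.append(f)          # medium/low (or no fix): surface to the user
    return auto, review


def gate_is_open(findings: List[Finding]) -> bool:
    """Tool pages stay blocked while any error-level finding is unresolved."""
    return not any(
        f.severity == "error" and not (f.pre_applied or f.waived)
        for f in findings
    )
```

In this sketch, waiving or fixing the last error-level finding is what opens the gate; warnings alone never block a tool page.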
Core (src/core/):
- analyze.py: Finding gains confidence, fix_action, pre_applied; new
detectors for encoding_uncertain, encoding_decode_failed; new
top-level encoding_override parameter.
- fixes.py: registry of fix algorithms keyed by fix_action id.
- normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
the NormalizationResult / Decision dataclasses the gate consumes.
- io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
and normalizes line endings (fixes bare-CR parser crash); empty file
handled gracefully instead of EmptyDataError traceback.
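The ordering fix in io.py matters because stripping NUL bytes from raw UTF-16/32 data destroys every non-ASCII character. A minimal sketch of the corrected order, assuming BOM-based detection of UTF-16/32 (the real repair_bytes may detect these encodings differently; `repair_bytes_sketch` is a hypothetical name):

```python
import codecs

# UTF-32 BOMs are checked first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def repair_bytes_sketch(raw: bytes) -> bytes:
    """Transcode UTF-16/32 to UTF-8, then strip NULs, then normalize line endings."""
    # Step 1: transcode first, so the NUL bytes that are part of UTF-16/32
    # code units are decoded rather than deleted.
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            # The "utf-16"/"utf-32" codecs read the BOM to pick endianness
            # and drop it from the decoded text.
            raw = raw.decode(codec).encode("utf-8")
            break
    # Step 2: any NUL bytes left now are stray and safe to remove.
    raw = raw.replace(b"\x00", b"")
    # Step 3: CRLF and bare CR both become LF.
    raw = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return raw
```

Empty input passes through all three steps unchanged, so the caller can detect it and report an empty file instead of handing it to the parser.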
GUI (src/gui/):
- pages/0_Review.py: gate page with per-finding decision controls,
encoding override picker (16 codepages + custom), and Advanced output
options (encoding, delimiter, line terminator) on the download.
- components.py: require_normalization_gate() helper.
- pages/1-9: gate guard wired on every tool page.
Test corpora:
- test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
UTF-8 files + manifest, synced from Business/DataTools.
- test-cases/text-cleaner-corpus/test_data/17: synced malformed input
(unquoted $1,500.00) for the unquoted-delimiter detector.
Tests (94 new):
- test_normalize.py (48): finding fields, fix registry, auto_fix scope,
decision paths, gate idempotency, output-options helper.
- test_encodings_corpus.py (90, 16 xfailed): parametric detection +
decode + analyzer-no-crash sweep against the manifest.
- test_analyze.py: encoding override + encoding_uncertain detectors.
- test_corpus.py: pre-parse repair in the strict reader.
run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.
Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.
Suite: 765 passed, 17 xfailed (was 458 passed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Text Cleaner Test Corpus
Test fixtures for 02_text_cleaner.py (Excel & CSV Data Cleaning Mastery Bundle).
Layout
text_cleaner_test_corpus/
├── README.md # This file
├── TEST-CASES.md # Full taxonomy and expected behavior per test
├── generate_test_data.py # Regenerates the 20 CSV inputs and expected outputs
├── generate_xlsx.py # Regenerates the multi-sheet XLSX fixture
├── test_data/ # Inputs (21 fixtures: 20 CSV + 1 XLSX)
└── expected/ # Expected outputs (with default and flag variants)
Quick start
Read TEST-CASES.md from top to bottom. Sections 1 (scope boundary) and 2 (default config assumed) are load-bearing; the per-test details in Section 4 don't make sense without them.
To regenerate the test files (e.g., after editing the generator):
python generate_test_data.py
python generate_xlsx.py
To use as pytest fixtures: see Section 6 of TEST-CASES.md.
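As a rough illustration of the pytest wiring, the helper below pairs each CSV input with an expected output. It assumes the default-variant expected file has the same filename as its input, which may not match the naming described in TEST-CASES.md; `pair_fixtures` is a hypothetical helper, not part of the corpus.

```python
from pathlib import Path


def pair_fixtures(corpus_root):
    """Return (input, expected) path pairs for CSV inputs with a same-named expected file."""
    root = Path(corpus_root)
    pairs = []
    for src in sorted((root / "test_data").glob("*.csv")):
        expected = root / "expected" / src.name  # assumed naming convention
        if expected.exists():
            pairs.append((src, expected))
    return pairs
```

The resulting list can be passed to `pytest.mark.parametrize("src, expected", pair_fixtures(...))`, so each fixture runs as its own test case.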
Coverage summary
| Category | Fixtures |
|---|---|
| Whitespace (ASCII + Unicode) | 01, 02 |
| Smart punctuation | 03 |
| Unicode normalization | 04 |
| Invisible / zero-width / control | 05, 06 |
| BOM | 07 |
| Line endings (file-level + embedded) | 08, 09, 10, 11 |
| Case operations (opt-in) | 12 |
| International script preservation | 13 |
| Mojibake | 14 |
| Boundary with script 04 (missing values) | 15 |
| Headers | 16, 19 |
| Negative tests (must NOT touch) | 17 |
| File-level edge cases | 18, 19 |
| Integration | 20 |
| Excel-specific (multi-sheet, Alt+Enter) | 21 |
Out of scope
The following are out of scope and documented in TEST-CASES.md Section 5: encoding detection, large-file performance, GUI behavior, file locking, and CLI argument parsing. Each needs its own test layer.