Michael 82d7fef21e feat(gate): CSV-normalization gate with confidence-tiered findings
Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.
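The triage rule described above can be sketched roughly as follows. This is an illustration only: the `Finding` shape here is a hypothetical stand-in for the real dataclass in analyze.py, the `severity` and `waived` fields and the `triage` helper are assumptions, and `strip_bom` in the usage note is a made-up fix_action id.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    # Hypothetical shape; the real Finding in analyze.py may differ.
    code: str
    severity: str                     # assumed values: "error", "warning", "info"
    confidence: str                   # "high", "medium", or "low"
    fix_action: Optional[str] = None  # id of a registered fix, if one exists
    pre_applied: bool = False
    waived: bool = False              # assumed flag set when the user waives a finding


def triage(findings):
    """Split findings into auto-applied and needs-review, and report gate status."""
    auto, review = [], []
    for f in findings:
        if f.confidence == "high" and f.fix_action is not None:
            f.pre_applied = True      # high-confidence fixes are applied without asking
            auto.append(f)
        else:
            review.append(f)          # medium/low findings go to the user
    # The gate blocks while any unresolved error-level finding has not been waived.
    blocked = any(f.severity == "error" and not f.waived for f in review)
    return auto, review, blocked
```

For example, `triage([Finding("bom", "warning", "high", "strip_bom"), Finding("encoding_uncertain", "error", "low")])` would auto-apply the first finding, queue the second for review, and report the gate as blocked until that finding is waived or resolved.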

Core (src/core/):
  - analyze.py: Finding gains confidence, fix_action, pre_applied; new
    detectors for encoding_uncertain, encoding_decode_failed; new top-
    level encoding_override parameter.
  - fixes.py: registry of fix algorithms keyed by fix_action id.
  - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
    the NormalizationResult / Decision dataclasses the gate consumes.
  - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
    transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
    and normalizes line endings (fixes bare-CR parser crash); empty file
    handled gracefully instead of EmptyDataError traceback.
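Why the ordering in io.py matters: UTF-16 stores ASCII characters with a NUL byte alongside each one, so stripping NULs before transcoding silently mangles any non-ASCII character. A minimal sketch of that ordering, assuming BOM-based detection only; the function names, the `cp1252` fallback, and the BOM table are illustrative and not the project's actual code:

```python
import codecs

# Checked longest-first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_encoding_sketch(raw: bytes, fallback: str = "cp1252") -> str:
    """Try strict UTF-8 first; only fall back if the bytes are not valid UTF-8."""
    try:
        raw.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback  # assumed fallback; the real detector may do more here


def repair_bytes_sketch(raw: bytes) -> bytes:
    """Transcode UTF-16/32 to UTF-8, then strip NULs, then normalize line endings."""
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            # The generic "utf-16"/"utf-32" codecs consume the BOM and pick endianness.
            raw = raw.decode(codec, errors="replace").encode("utf-8")
            break
    raw = raw.replace(b"\x00", b"")                          # stray NULs only, after transcoding
    raw = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")  # CRLF and bare CR become LF
    return raw
```

An empty input passes through unchanged, which leaves the caller free to report "empty file" instead of letting the parser raise.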

GUI (src/gui/):
  - pages/0_Review.py: gate page with per-finding decision controls,
    encoding override picker (16 codepages + custom), and Advanced output
    options (encoding, delimiter, line terminator) on the download.
  - components.py: require_normalization_gate() helper.
  - pages/1-9: gate guard wired on every tool page.

Test corpora:
  - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
    UTF-8 files + manifest, synced from Business/DataTools.
  - test-cases/text-cleaner-corpus/test_data/17: synced malformed input
    (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
  - test_normalize.py (48): finding fields, fix registry, auto_fix scope,
    decision paths, gate idempotency, output-options helper.
  - test_encodings_corpus.py (90, 16 xfailed): parametric detection +
    decode + analyzer-no-crash sweep against the manifest.
  - test_analyze.py: encoding override + encoding_uncertain detectors.
  - test_corpus.py: pre-parse repair in the strict reader.

run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:35:27 +00:00

Text Cleaner Test Corpus

Test fixtures for 02_text_cleaner.py (Excel & CSV Data Cleaning Mastery Bundle).

Layout

text_cleaner_test_corpus/
├── README.md                # This file
├── TEST-CASES.md            # Full taxonomy and expected behavior per test
├── generate_test_data.py    # Regenerates the 20 CSV inputs and expected outputs
├── generate_xlsx.py         # Regenerates the multi-sheet XLSX fixture
├── test_data/               # Inputs (21 fixtures: 20 CSV + 1 XLSX)
└── expected/                # Expected outputs (with default and flag variants)

Quick start

Read TEST-CASES.md from top to bottom. Sections 1 (scope boundary) and 2 (default config assumed) are load-bearing; the per-test details in Section 4 don't make sense without them.

To regenerate the test files (e.g., after editing the generator):

python generate_test_data.py
python generate_xlsx.py

To use as pytest fixtures: see Section 6 of TEST-CASES.md.
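
As a rough illustration of what that wiring might look like, the sketch below pairs each input with an expected output. It assumes the expected file for the default configuration shares the input's filename, which may not match the actual naming scheme in expected/; `fixture_pairs` and `run_cleaner` are hypothetical names, and TEST-CASES.md Section 6 remains the authoritative description.

```python
from pathlib import Path


def fixture_pairs(root):
    """Pair each CSV under test_data/ with a same-named file under expected/, if present."""
    root = Path(root)
    pairs = []
    for src in sorted((root / "test_data").glob("*.csv")):
        want = root / "expected" / src.name  # assumed naming convention
        if want.exists():
            pairs.append((src, want))
    return pairs


# With pytest installed, the pairs can drive one test per fixture:
#
#   @pytest.mark.parametrize("src, want", fixture_pairs(Path(__file__).parent))
#   def test_default_clean(src, want, tmp_path):
#       out = tmp_path / src.name
#       run_cleaner(src, out)  # hypothetical wrapper around 02_text_cleaner.py
#       assert out.read_bytes() == want.read_bytes()
```

Comparing bytes rather than decoded text keeps BOM and line-ending differences visible, which matters for fixtures 07 through 11.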

Coverage summary

Category                                   Fixtures
Whitespace (ASCII + Unicode)               01, 02
Smart punctuation                          03
Unicode normalization                      04
Invisible / zero-width / control           05, 06
BOM                                        07
Line endings (file-level + embedded)       08, 09, 10, 11
Case operations (opt-in)                   12
International script preservation          13
Mojibake                                   14
Boundary with script 04 (missing values)   15
Headers                                    16, 19
Negative tests (must NOT touch)            17
File-level edge cases                      18, 19
Integration                                20
Excel-specific (multi-sheet, Alt+Enter)    21

Out of scope

TEST-CASES.md Section 5 documents what this corpus does not cover: encoding detection, large-file performance, GUI behavior, file locking, and CLI argument parsing. Each needs its own test layer.