Files

Michael c349a90e18 test: add text-cleaner corpus and close gaps surfaced by it

The 21-fixture corpus (test-cases/text-cleaner-corpus/) exercises the cleaner
end-to-end against the spec in TEST-CASES.md. Closing the failing cases drove
five small cleaner fixes plus two fixture-generation fixes:

- _SMART_CHARS: add prime, double prime, guillemets (case 03)
- _ZERO_WIDTH: add soft hyphen U+00AD (case 05)
- clean_dataframe: clean column headers via the same pipeline (cases 16/19/20),
  with a clean_headers toggle on CleanOptions
- smart_title_case: title-case full-shout strings ("ALICE SMITH" -> "Alice
  Smith") while still preserving embedded acronyms; preserve uppercase after
  apostrophe in names ("O'CONNOR" -> "O'Connor", "o'neil" -> "O'neil")
- test_corpus.py reader: pre-strip NUL bytes (C parser truncates at NUL,
  python engine is too strict about embedded literal "), per spec case 06
- generate_test_data.py: properly CSV-escape literal-quote cells in case 03
  expected; quote the rogue-comma price field in case 17 input

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-29 15:37:35 +00:00
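
As a rough illustration of the first two fixes in the commit message above, the sketch below shows one way a character table and a zero-width set could be combined into a single translation pass. The names `_SMART_CHARS` and `_ZERO_WIDTH` come from the commit message; their contents here, the replacement characters chosen, and the `clean_text` helper are assumptions for illustration and may not match 02_text_cleaner.py.

```python
# Hypothetical tables: the names mirror the commit message, the contents are assumed.
_SMART_CHARS = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2032": "'",   # prime
    "\u2033": '"',   # double prime
    "\u00ab": '"',   # left guillemet
    "\u00bb": '"',   # right guillemet
}

_ZERO_WIDTH = {
    "\u200b",  # zero width space
    "\u200c",  # zero width non-joiner
    "\u200d",  # zero width joiner
    "\u00ad",  # soft hyphen
}

# One table: smart characters map to ASCII, zero-width characters map to None (deleted).
_TRANSLATION = str.maketrans({**_SMART_CHARS, **dict.fromkeys(_ZERO_WIDTH)})


def clean_text(value: str) -> str:
    """Replace smart punctuation and drop invisible characters in one pass."""
    return value.translate(_TRANSLATION)
```

Under these assumed mappings, `clean_text("co\u00adoperate")` returns `"cooperate"`. A single `str.translate` call keeps the per-cell cost low, which matters when the same pipeline is also applied to column headers.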

expected
test_data
generate_test_data.py
generate_xlsx.py
README.md
TEST-CASES.md

README.md

Text Cleaner Test Corpus

Test fixtures for 02_text_cleaner.py (Excel & CSV Data Cleaning Mastery Bundle).

Layout

text_cleaner_test_corpus/
├── README.md                # This file
├── TEST-CASES.md            # Full taxonomy and expected behavior per test
├── generate_test_data.py    # Regenerates the 20 CSV inputs and expected outputs
├── generate_xlsx.py         # Regenerates the multi-sheet XLSX fixture
├── test_data/               # Inputs (21 fixtures: 20 CSV + 1 XLSX)
└── expected/                # Expected outputs (with default and flag variants)

Quick start

Read TEST-CASES.md from top to bottom. Sections 1 (scope boundary) and 2 (default config assumed) are load-bearing; the per-test details in Section 4 don't make sense without them.

To regenerate the test files (e.g., after editing the generator):

python generate_test_data.py
python generate_xlsx.py

To use as pytest fixtures: see Section 6 of TEST-CASES.md.
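
Section 6 of TEST-CASES.md is the authoritative guide. As a minimal sketch of the general idea, the helper below pairs each CSV under test_data/ with a same-named file under expected/. The `pair_fixtures` name and the same-name convention are assumptions; the real corpus also ships flag variants, whose naming is not described here.

```python
from pathlib import Path


def pair_fixtures(corpus_root: Path) -> list[tuple[Path, Path]]:
    """Pair each input CSV with the expected file of the same name (assumed convention)."""
    pairs = []
    for src in sorted((corpus_root / "test_data").glob("*.csv")):
        want = corpus_root / "expected" / src.name
        # Skip inputs that have no same-named expected file (e.g. flag-only variants).
        if want.exists():
            pairs.append((src, want))
    return pairs
```

The resulting list could then be fed to `pytest.mark.parametrize` so that each fixture becomes its own test case and failures are reported per file.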

Coverage summary

| Category | Fixtures |
|---|---|
| Whitespace (ASCII + Unicode) | 01, 02 |
| Smart punctuation | 03 |
| Unicode normalization | 04 |
| Invisible / zero-width / control | 05, 06 |
| BOM | 07 |
| Line endings (file-level + embedded) | 08, 09, 10, 11 |
| Case operations (opt-in) | 12 |
| International script preservation | 13 |
| Mojibake | 14 |
| Boundary with script 04 (missing values) | 15 |
| Headers | 16, 19 |
| Negative tests (must NOT touch) | 17 |
| File-level edge cases | 18, 19 |
| Integration | 20 |
| Excel-specific (multi-sheet, Alt+Enter) | 21 |
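
The coverage summary can be sanity-checked mechanically. The sketch below transcribes the table into a dictionary and reports any fixture number from 01 to 21 that no category claims. The `COVERAGE` and `uncovered` names are hypothetical and not part of the corpus.

```python
# Transcribed from the coverage summary; fixture 19 appears under two categories.
COVERAGE = {
    "Whitespace (ASCII + Unicode)": [1, 2],
    "Smart punctuation": [3],
    "Unicode normalization": [4],
    "Invisible / zero-width / control": [5, 6],
    "BOM": [7],
    "Line endings (file-level + embedded)": [8, 9, 10, 11],
    "Case operations (opt-in)": [12],
    "International script preservation": [13],
    "Mojibake": [14],
    "Boundary with script 04 (missing values)": [15],
    "Headers": [16, 19],
    "Negative tests (must NOT touch)": [17],
    "File-level edge cases": [18, 19],
    "Integration": [20],
    "Excel-specific (multi-sheet, Alt+Enter)": [21],
}


def uncovered(total: int = 21) -> list[int]:
    """Return fixture numbers in 1..total that no category lists."""
    seen = {n for ids in COVERAGE.values() for n in ids}
    return [n for n in range(1, total + 1) if n not in seen]
```

With the table as written, `uncovered()` returns an empty list, so every one of the 21 fixtures belongs to at least one category.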

Out of scope

The following are out of scope for this corpus and are documented in TEST-CASES.md Section 5: encoding detection, large-file performance, GUI behavior, file locking, and CLI argument parsing. Each needs its own test layer.