# Text Cleaner Test Corpus

Test fixtures for `02_text_cleaner.py` (Excel & CSV Data Cleaning Mastery Bundle).

## Layout

```
text_cleaner_test_corpus/
├── README.md               # This file
├── TEST-CASES.md           # Full taxonomy and expected behavior per test
├── generate_test_data.py   # Regenerates the 20 CSV inputs and expected outputs
├── generate_xlsx.py        # Regenerates the multi-sheet XLSX fixture
├── test_data/              # Inputs (21 fixtures: 20 CSV + 1 XLSX)
└── expected/               # Expected outputs (with default and flag variants)
```

## Quick start

Read `TEST-CASES.md` from top to bottom. Sections 1 (scope boundary) and 2 (default config assumed) are load-bearing; the per-test details in Section 4 don't make sense without them.

To regenerate the test files (e.g., after editing the generator):

```bash
python generate_test_data.py
python generate_xlsx.py
```

To use as pytest fixtures: see Section 6 of `TEST-CASES.md`.

## Coverage summary

| Category | Fixtures |
|---|---|
| Whitespace (ASCII + Unicode) | 01, 02 |
| Smart punctuation | 03 |
| Unicode normalization | 04 |
| Invisible / zero-width / control | 05, 06 |
| BOM | 07 |
| Line endings (file-level + embedded) | 08, 09, 10, 11 |
| Case operations (opt-in) | 12 |
| International script preservation | 13 |
| Mojibake | 14 |
| Boundary with script 04 (missing values) | 15 |
| Headers | 16, 19 |
| Negative tests (must NOT touch) | 17 |
| File-level edge cases | 18, 19 |
| Integration | 20 |
| Excel-specific (multi-sheet, Alt+Enter) | 21 |

## Out of scope

Documented in `TEST-CASES.md` Section 5: encoding detection, large-file performance, GUI behavior, file-locking, CLI argument parsing. Each needs its own test layer.
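
## Pairing inputs with expected outputs (sketch)

As a rough illustration of how the `test_data/` and `expected/` directories from the layout might be walked together, the helper below maps each input fixture to its candidate expected files. It is a minimal sketch, not the harness described in `TEST-CASES.md` Section 6: the function name `pair_fixtures` is hypothetical, and it assumes that an expected file's name starts with the input file's stem. That naming convention is not confirmed here, so check it against the generator scripts before relying on it.

```python
from pathlib import Path


def pair_fixtures(corpus_root):
    """Map each input fixture name to the expected-output names that match it.

    ASSUMPTION (hypothetical convention): an expected file shares the input's
    stem as a filename prefix, so one input can have a default output plus
    several flag variants. Adjust the matching rule to the real naming scheme.
    """
    root = Path(corpus_root)
    inputs = sorted(p for p in (root / "test_data").iterdir() if p.is_file())
    expected = sorted(p for p in (root / "expected").iterdir() if p.is_file())

    pairs = {}
    for src in inputs:
        # Simple prefix match; a stricter rule may be needed if one stem
        # happens to be a prefix of another.
        pairs[src.name] = [e.name for e in expected if e.name.startswith(src.stem)]
    return pairs
```

The resulting dictionary could feed `pytest.mark.parametrize`, with one case per (input, expected) pair. An input that maps to an empty list is worth flagging, since it suggests either a missing expected file or a wrong naming assumption.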