The 21-fixture corpus (test-cases/text-cleaner-corpus/) exercises the cleaner
end-to-end against the spec in TEST-CASES.md. Closing the failing cases drove
five small cleaner fixes plus two fixture-generation fixes:

- _SMART_CHARS: add prime, double prime, guillemets (case 03)
- _ZERO_WIDTH: add soft hyphen U+00AD (case 05)
- clean_dataframe: clean column headers via the same pipeline (cases
  16/19/20), with a clean_headers toggle on CleanOptions
- smart_title_case: title-case full-shout strings ("ALICE SMITH" -> "Alice
  Smith") while still preserving embedded acronyms; preserve uppercase after
  apostrophe in names ("O'CONNOR" -> "O'Connor", "o'neil" -> "O'neil")
- test_corpus.py reader: pre-strip NUL bytes (the C parser truncates at NUL,
  and the Python engine is too strict about embedded literal "), per spec
  case 06
- generate_test_data.py: properly CSV-escape literal-quote cells in case 03
  expected; quote the rogue-comma price field in case 17 input

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Text Cleaner Test Corpus

Test fixtures for `02_text_cleaner.py` (Excel & CSV Data Cleaning Mastery Bundle).

## Layout

```
text_cleaner_test_corpus/
├── README.md              # This file
├── TEST-CASES.md          # Full taxonomy and expected behavior per test
├── generate_test_data.py  # Regenerates the 20 CSV inputs and expected outputs
├── generate_xlsx.py       # Regenerates the multi-sheet XLSX fixture
├── test_data/             # Inputs (21 fixtures: 20 CSV + 1 XLSX)
└── expected/              # Expected outputs (with default and flag variants)
```

## Quick start

Read `TEST-CASES.md` from top to bottom. Sections 1 (scope boundary) and 2 (default config assumed) are load-bearing; the per-test details in Section 4 don't make sense without them.

To regenerate the test files (e.g., after editing the generator):
```bash
python generate_test_data.py
python generate_xlsx.py
```

To use as pytest fixtures: see Section 6 of `TEST-CASES.md`.
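
As a rough illustration of the first step of such a test, the sketch below pairs each CSV input with an expected output. It assumes that an input in `test_data/` and its default expected output in `expected/` share the same file name; that naming scheme and the `pair_fixtures` helper are hypothetical, and Section 6 of `TEST-CASES.md` remains the reference, especially for the flag variants.

```python
from pathlib import Path


def pair_fixtures(input_dir, expected_dir):
    """Pair each CSV input with the same-named file in expected_dir.

    Assumption (not confirmed by this README): inputs and their default
    expected outputs share a file name. Inputs without a matching
    expected file are skipped rather than reported.
    """
    pairs = []
    for input_path in sorted(Path(input_dir).glob("*.csv")):
        expected_path = Path(expected_dir) / input_path.name
        if expected_path.is_file():
            pairs.append((input_path, expected_path))
    return pairs
```

The returned list of `(input, expected)` tuples can be passed to `pytest.mark.parametrize` so that each fixture shows up as its own test case.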

## Coverage summary

| Category | Fixtures |
|---|---|
| Whitespace (ASCII + Unicode) | 01, 02 |
| Smart punctuation | 03 |
| Unicode normalization | 04 |
| Invisible / zero-width / control | 05, 06 |
| BOM | 07 |
| Line endings (file-level + embedded) | 08, 09, 10, 11 |
| Case operations (opt-in) | 12 |
| International script preservation | 13 |
| Mojibake | 14 |
| Boundary with script 04 (missing values) | 15 |
| Headers | 16, 19 |
| Negative tests (must NOT touch) | 17 |
| File-level edge cases | 18, 19 |
| Integration | 20 |
| Excel-specific (multi-sheet, Alt+Enter) | 21 |

## Out of scope

Documented in `TEST-CASES.md` Section 5: encoding detection, large-file performance, GUI behavior, file-locking, CLI argument parsing. Each needs its own test layer.