# Text Cleaner Test Corpus

Test fixtures for `02_text_cleaner.py` (Excel & CSV Data Cleaning Mastery Bundle).

## Layout

```
text_cleaner_test_corpus/
├── README.md                # This file
├── TEST-CASES.md            # Full taxonomy and expected behavior per test
├── generate_test_data.py    # Regenerates the 20 CSV inputs and expected outputs
├── generate_xlsx.py         # Regenerates the multi-sheet XLSX fixture
├── test_data/               # Inputs (21 fixtures: 20 CSV + 1 XLSX)
└── expected/                # Expected outputs (with default and flag variants)
```

## Quick start

Read `TEST-CASES.md` from top to bottom. Sections 1 (scope boundary) and 2 (default config assumed) are load-bearing; the per-test details in Section 4 don't make sense without them.

To regenerate the test files (e.g., after editing either generator script):

```bash
python generate_test_data.py
python generate_xlsx.py
```
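
After regenerating, a quick sanity check can confirm that the fixture counts still match the layout above (20 CSV + 1 XLSX in `test_data/`). The following is a minimal sketch, not part of the corpus: the helper name `count_fixtures` is hypothetical, and it assumes `test_data/` contains only fixture files.

```python
from collections import Counter
from pathlib import Path


def count_fixtures(test_data_dir):
    """Count the files in a directory, grouped by lowercase extension.

    Hypothetical helper for illustration; it is not shipped with the corpus.
    """
    return Counter(
        p.suffix.lower() for p in Path(test_data_dir).iterdir() if p.is_file()
    )


# Example usage, run from the corpus root (counts taken from the Layout section):
#   counts = count_fixtures("test_data")
#   assert counts[".csv"] == 20 and counts[".xlsx"] == 1, counts
```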

To use as pytest fixtures: see Section 6 of `TEST-CASES.md`.
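
As a rough illustration of what such a setup could look like, the sketch below pairs each input in `test_data/` with its expected outputs in `expected/`. The naming convention is an assumption made for this example (an expected file shares the input's stem, optionally followed by `__<variant>` for flag variants), and `pair_fixtures` is a hypothetical helper; Section 6 of `TEST-CASES.md` remains the authoritative description.

```python
from pathlib import Path


def pair_fixtures(test_data_dir, expected_dir):
    """Map each input fixture to its expected-output files.

    ASSUMPTION (not confirmed by the corpus): an expected file has the same
    stem as its input, or that stem followed by "__<variant>" for a flag
    variant, e.g. 01_x.csv -> 01_x.csv and 01_x__flag.csv.
    """
    expected_files = sorted(
        (p for p in Path(expected_dir).iterdir() if p.is_file()),
        key=lambda p: p.name,
    )
    inputs = sorted(
        (p for p in Path(test_data_dir).iterdir() if p.is_file()),
        key=lambda p: p.name,
    )
    pairs = {}
    for src in inputs:
        pairs[src] = [
            e
            for e in expected_files
            if e.stem == src.stem or e.stem.startswith(src.stem + "__")
        ]
    return pairs


# Possible pytest usage (the cleaner entry point is left out because its
# interface is not described here):
#   CASES = [(s, e) for s, es in pair_fixtures("test_data", "expected").items() for e in es]
#   @pytest.mark.parametrize("src,expected", CASES, ids=lambda p: p.name)
#   def test_fixture(src, expected): ...
```

Matching on the stem plus a fixed separator, rather than a bare prefix, keeps a fixture such as `01_a` from accidentally picking up the outputs of `01_ab`.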

## Coverage summary

| Category | Fixtures |
|---|---|
| Whitespace (ASCII + Unicode) | 01, 02 |
| Smart punctuation | 03 |
| Unicode normalization | 04 |
| Invisible / zero-width / control | 05, 06 |
| BOM | 07 |
| Line endings (file-level + embedded) | 08, 09, 10, 11 |
| Case operations (opt-in) | 12 |
| International script preservation | 13 |
| Mojibake | 14 |
| Boundary with script 04 (missing values) | 15 |
| Headers | 16, 19 |
| Negative tests (must NOT touch) | 17 |
| File-level edge cases | 18, 19 |
| Integration | 20 |
| Excel-specific (multi-sheet, Alt+Enter) | 21 |

## Out of scope

The following are out of scope for this corpus and are documented in Section 5 of `TEST-CASES.md`: encoding detection, large-file performance, GUI behavior, file locking, and CLI argument parsing. Each needs its own test layer.