test: add text-cleaner corpus and close gaps surfaced by it

The 21-fixture corpus (test-cases/text-cleaner-corpus/) exercises the cleaner
end-to-end against the spec in TEST-CASES.md. Closing the failing cases drove
five small cleaner fixes plus two fixture-generation fixes:
- _SMART_CHARS: add prime, double prime, guillemets (case 03)
- _ZERO_WIDTH: add soft hyphen U+00AD (case 05)
- clean_dataframe: clean column headers via the same pipeline (cases 16/19/20),
with a clean_headers toggle on CleanOptions
- smart_title_case: title-case full-shout strings ("ALICE SMITH" -> "Alice
Smith") while still preserving embedded acronyms; keep an existing uppercase
letter after an apostrophe in names ("O'CONNOR" -> "O'Connor") without
promoting a lowercase one ("o'neil" -> "O'neil")
- test_corpus.py reader: pre-strip NUL bytes before parsing, per spec case 06
(the C parser truncates at NUL, and the python engine is too strict about
an embedded literal ")
- generate_test_data.py: properly CSV-escape literal-quote cells in case 03
expected; quote the rogue-comma price field in case 17 input
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
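
For illustration only, the smart_title_case behaviour described in the message can be sketched as below. This is not the code from the commit: the function name smart_title_case_sketch, the helper _title_word, and the single-letter-prefix rule for names such as O'Connor are assumptions chosen to reproduce the examples quoted above.

```python
import re

# Hypothetical rule: a "name" apostrophe has a single-letter prefix (O', D').
_NAME_PREFIX_RE = re.compile(r"^([A-Za-z])'([A-Za-z])(.*)$")


def _title_word(word: str) -> str:
    """Capitalise one word, keeping the letter after a name apostrophe as typed."""
    match = _NAME_PREFIX_RE.match(word)
    if match:
        first, after, rest = match.groups()
        # An uppercase letter after the apostrophe survives; a lowercase
        # one is left alone rather than promoted.
        return first.upper() + "'" + after + rest.lower()
    return word[:1].upper() + word[1:].lower()


def smart_title_case_sketch(text: str) -> str:
    """Title-case text; all-caps words inside mixed-case text are kept as acronyms."""
    full_shout = text.isupper()  # every cased character is uppercase
    result = []
    for word in text.split(" "):
        if not full_shout and len(word) > 1 and word.isupper():
            result.append(word)  # treated as an embedded acronym
        else:
            result.append(_title_word(word))
    return " ".join(result)
```

Under these assumptions a fully uppercase string is title-cased word by word, while an all-caps word inside otherwise mixed-case text is left untouched.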
@@ -0,0 +1,9 @@
id,value,note
1,HelloWorld,NUL byte inside
2,BellSound,BEL character
3,Backspace,backspace
4,VertTab,vertical tab
5,FormFeed,form feed
6,Escape,ESC character
7,Delete,DEL character
8,Mixedjunk,multiple controls in one cell
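
The fixture rows above name the control characters each cell is meant to carry (NUL, BEL, backspace, vertical tab, form feed, ESC, DEL). A minimal sketch of stripping them before CSV parsing follows, using only the standard library. The names _CONTROL_RE, strip_controls, and read_clean_rows are hypothetical, the exact character set the real cleaner removes is an assumption, and so are the positions of the control bytes in the usage example.

```python
import csv
import io
import re

# Assumed set: C0 controls except tab, LF and CR, plus DEL (U+007F).
_CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_controls(text: str) -> str:
    """Remove control characters while keeping tab and line breaks."""
    return _CONTROL_RE.sub("", text)


def read_clean_rows(raw: bytes) -> list[list[str]]:
    """Decode, strip control characters, then parse the result as CSV."""
    text = raw.decode("utf-8")
    # Stripping happens before parsing so an embedded NUL never reaches
    # the CSV parser.
    cleaned = strip_controls(text)
    return list(csv.reader(io.StringIO(cleaned, newline="")))
```

For example, read_clean_rows(b"1,Hello\x00World,NUL byte inside\n") yields one row whose second field is "HelloWorld".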