test: add text-cleaner corpus and close gaps surfaced by it

The 21-fixture corpus (test-cases/text-cleaner-corpus/) exercises the cleaner
end-to-end against the spec in TEST-CASES.md. Closing the failing cases drove
five small cleaner fixes plus two fixture-generation fixes:

- _SMART_CHARS: add prime, double prime, guillemets (case 03)
- _ZERO_WIDTH: add soft hyphen U+00AD (case 05)
- clean_dataframe: clean column headers via the same pipeline (cases 16/19/20),
  with a clean_headers toggle on CleanOptions
- smart_title_case: title-case full-shout strings ("ALICE SMITH" -> "Alice
  Smith") while still preserving embedded acronyms; preserve uppercase after
  apostrophe in names ("O'CONNOR" -> "O'Connor", "o'neil" -> "O'neil")
- test_corpus.py reader: pre-strip NUL bytes before parsing, per spec case 06
  (the C parser truncates at NUL; the python engine is too strict about an
  embedded literal ")
- generate_test_data.py: properly CSV-escape literal-quote cells in case 03
  expected; quote the rogue-comma price field in case 17 input

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:37:35 +00:00
parent 54f92ae47e
commit c349a90e18
50 changed files with 1644 additions and 4 deletions

@@ -0,0 +1,8 @@
id,name,city
1, Alice ,New York
2,Bob, Chicago
3,Carol ,San Francisco
4,Dan Smith,Austin
5, Eve , Boston
6,Frank van der Berg,Denver
7, Grace Hopper , Palo Alto
1 id name city
2 1 Alice New York
3 2 Bob Chicago
4 3 Carol San Francisco
5 4 Dan Smith Austin
6 5 Eve Boston
7 6 Frank van der Berg Denver
8 7 Grace Hopper Palo Alto

@@ -0,0 +1,7 @@
id,label,note
1, Premium ,NBSP padding
2,Discount,narrow NBSP
3, Standard ,ideographic space
4,TierOne,em-space internal
5,CostPlus,thin-space internal
6,   mixed   ,ascii + NBSP combined
1 id label note
2 1 Premium  NBSP padding
3 2 Discount  narrow NBSP
4 3 Standard  ideographic space
5 4 Tier  One em-space internal
6 5 Cost Plus thin-space internal
7 6 mixed   ascii + NBSP combined

@@ -0,0 +1,6 @@
id,quote,measurement
1,“Hello world”,5′ 11″
2,it’s working,
3,2020–2024,from ‘a’ to ‘z’
4,wait…,3 × 4
5,«quoted»,5 ± 0.1
1 id quote measurement
2 1 “Hello world” 5′ 11″
3 2 it’s working
4 3 2020–2024 from ‘a’ to ‘z’
5 4 wait… 3 × 4
6 5 «quoted» 5 ± 0.1
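
The commit message names prime, double prime, and guillemets as the additions to _SMART_CHARS for this case. A minimal sketch of what such a table could look like, assuming the cleaner folds typographic punctuation to ASCII through a translation table. The names SMART_CHARS and fold_smart_chars are hypothetical, and the replacement targets are assumptions rather than the project's actual mapping:

```python
# Hypothetical mapping; the real _SMART_CHARS table may differ.
SMART_CHARS = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2032": "'",  # prime
    "\u2033": '"',  # double prime
    "\u00ab": '"',  # left-pointing guillemet
    "\u00bb": '"',  # right-pointing guillemet
}

_TABLE = str.maketrans(SMART_CHARS)


def fold_smart_chars(text: str) -> str:
    """Replace typographic quotes, primes, and guillemets with ASCII."""
    return text.translate(_TABLE)


print(fold_smart_chars("5\u2032 11\u2033"))
print(fold_smart_chars("\u00abquoted\u00bb"))
```

Characters that are not in the table, such as × and ±, pass through untouched, which matches the "preserve ×" note in the case 19 fixture.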

@@ -0,0 +1,8 @@
id,name,description
1,café,NFC form (single code point)
2,café,NFD form (e + combining accent)
3,naïve,NFC i-diaeresis
4,naïve,NFD i + combining diaeresis
5,office,fi-ligature (ffi)
6,,fullwidth ABC
7,Ⅸ century,roman numeral nine (single code point)
1 id name description
2 1 café NFC form (single code point)
3 2 café NFD form (e + combining accent)
4 3 naïve NFC i-diaeresis
5 4 naïve NFD i + combining diaeresis
6 5 office fi-ligature (ffi)
7 6 ABC fullwidth ABC
8 7 Ⅸ century roman numeral nine (single code point)
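
This fixture mixes canonical variants (NFC vs NFD) with compatibility characters (ligature, fullwidth letters, a single-code-point roman numeral). The page does not say which normalization form the cleaner applies, so the sketch below only illustrates how the two candidate forms differ, using the standard unicodedata module. The helper name normalized_forms is hypothetical:

```python
import unicodedata


def normalized_forms(text: str) -> tuple[str, str]:
    """Return the NFC and NFKC forms of text, in that order."""
    return (
        unicodedata.normalize("NFC", text),
        unicodedata.normalize("NFKC", text),
    )


nfc_cafe = "caf\u00e9"   # precomposed e with acute
nfd_cafe = "cafe\u0301"  # e followed by a combining acute accent

# Canonical composition makes the two spellings compare equal.
print(normalized_forms(nfd_cafe)[0] == nfc_cafe)

# Compatibility characters pass through NFC unchanged but are folded by NFKC.
for sample in ("o\ufb03ce", "\uff21\uff22\uff23", "\u2168 century"):
    print([ascii(form) for form in normalized_forms(sample)])
```

NFC is enough to make rows 1 and 2 (and rows 3 and 4) compare equal. Only NFKC would rewrite rows 5 to 7, so the choice between the two decides whether those cells are preserved or folded.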

@@ -0,0 +1,8 @@
id,value,note
1,Hello,zero-width space inside word
2,Leading,leading + internal ZWSP
3,Trail,trailing ZWSP
4,abc,ZWNJ and ZWJ
5,Marked,LTR + RTL marks bracketing
6,co­operate,soft hyphen
7,nobreak,word joiner
1 id value note
2 1 Hel​lo zero-width space inside word
3 2 ​Lead​ing leading + internal ZWSP
4 3 Trail​ trailing ZWSP
5 4 a‌b‍c ZWNJ and ZWJ
6 5 ‎Marked‏ LTR + RTL marks bracketing
7 6 co­operate soft hyphen
8 7 no⁠break word joiner
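
Per the commit message, the soft hyphen U+00AD was added to _ZERO_WIDTH to close this case. A minimal sketch of stripping the invisible characters this fixture exercises, assuming a regex character class. The names ZERO_WIDTH and strip_zero_width are hypothetical, and the set is inferred from the fixture notes, not copied from the cleaner:

```python
import re

# Hypothetical set built from the fixture notes; the commit only confirms
# that U+00AD (soft hyphen) was added to the real _ZERO_WIDTH.
ZERO_WIDTH = re.compile(
    "["
    "\u200b"  # zero-width space
    "\u200c"  # zero-width non-joiner
    "\u200d"  # zero-width joiner
    "\u200e"  # left-to-right mark
    "\u200f"  # right-to-left mark
    "\u2060"  # word joiner
    "\u00ad"  # soft hyphen
    "]"
)


def strip_zero_width(text: str) -> str:
    """Remove invisible formatting characters anywhere in the string."""
    return ZERO_WIDTH.sub("", text)


print(strip_zero_width("co\u00adoperate"))
print(strip_zero_width("\u200eMarked\u200f"))
```

One design caveat: removing U+200D unconditionally also breaks emoji sequences that are joined with it, so a production cleaner may want to treat that code point separately.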

Binary file not shown.
1 id value note
2 1 Hello�World NUL byte inside
3 2 BellSound BEL character
4 3 Backspace backspace
5 4 Vert Tab vertical tab
6 5 Form Feed form feed
7 6 Escape ESC character
8 7 Delete DEL character
9 8 Mixed�junk multiple controls in one cell
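
The commit message says the test_corpus.py reader pre-strips NUL bytes before parsing this fixture. A minimal sketch of that idea, assuming the file is read as bytes first. It uses the standard csv module instead of the parser engines the message mentions, and the function name read_rows_without_nul is hypothetical:

```python
import csv
import io


def read_rows_without_nul(raw: bytes, encoding: str = "utf-8") -> list[list[str]]:
    """Drop NUL bytes from the raw file content, then parse it as CSV."""
    cleaned = raw.replace(b"\x00", b"")
    text = cleaned.decode(encoding)
    # newline="" lets the csv module handle line endings itself.
    return list(csv.reader(io.StringIO(text, newline="")))


sample = b"id,value,note\n1,Hello\x00World,NUL byte inside\n"
print(read_rows_without_nul(sample))
```

Only NUL is removed at this stage. The other control characters in the fixture (BEL, backspace, ESC, DEL and so on) still reach the cleaner, which is what the case is meant to exercise.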

@@ -0,0 +1,3 @@
id,name,city
1,Alice,New York
2,Bob,Chicago
1 id name city
2 1 Alice New York
3 2 Bob Chicago

@@ -0,0 +1,4 @@
id,name
1,Alice
2,Bob
3,Carol
1 id name
2 1 Alice
3 2 Bob
4 3 Carol

@@ -0,0 +1 @@
id,name
1 id name 1 Alice 2 Bob 3 Carol

@@ -0,0 +1,4 @@
id,name
1,Alice
2,Bob
3,Carol
1 id,name
2 1,Alice
3 2,Bob 3,Carol
4 4,Dan

@@ -0,0 +1,7 @@
id,address,notes
1,"123 Main St
Apt 4B
New York, NY","line1
line2"
2,"Single line","contains
classic mac
1 id address notes
2 1 123 Main St Apt 4B New York, NY line1 line2
3 2 Single line contains classic mac internal
4 3 normal no newlines here

@@ -0,0 +1,5 @@
id,name,email,product
1,ALICE SMITH,Alice@Example.COM,Widget
2,bob jones,BOB@example.com,GADGET
3,Carol Brown,carol@EXAMPLE.com,wIdGeT
4,DAN O'CONNOR,Dan@Example.com,gizmo
1 id name email product
2 1 ALICE SMITH Alice@Example.COM Widget
3 2 bob jones BOB@example.com GADGET
4 3 Carol Brown carol@EXAMPLE.com wIdGeT
5 4 DAN O'CONNOR Dan@Example.com gizmo
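
The name column here contains the inputs the commit message cites for smart_title_case ("ALICE SMITH", "O'CONNOR"). The sketch below is one way to get the three documented results; it is not the project's implementation. The function name smart_title_case_sketch and the acronym rule (an all-caps word inside an otherwise mixed-case string is left alone) are assumptions:

```python
def _title_word(word: str) -> str:
    """Capitalize a word, keeping an existing capital right after an apostrophe."""
    out = []
    for i, ch in enumerate(word):
        if i == 0:
            out.append(ch.upper())
        elif word[i - 1] in "'\u2019" and ch.isupper():
            out.append(ch)  # "O'CONNOR" keeps the C
        else:
            out.append(ch.lower())
    return "".join(out)


def smart_title_case_sketch(text: str) -> str:
    """Title-case text; leave all-caps words alone unless everything is caps."""
    full_shout = text.isupper()
    words = []
    for word in text.split(" "):
        if word.isupper() and len(word) > 1 and not full_shout:
            words.append(word)  # assumed to be an embedded acronym
        else:
            words.append(_title_word(word))
    return " ".join(words)


for name in ("ALICE SMITH", "DAN O'CONNOR", "o'neil", "IBM corp"):
    print(smart_title_case_sketch(name))
```

The apostrophe rule in this sketch is deliberately naive: a full-shout contraction such as "DON'T" would come out as "Don'T", so a real implementation needs some extra signal to tell names from contractions.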

@@ -0,0 +1,7 @@
id,name,note
1, 中国北京 ,Beijing in Chinese (with leading/trailing space)
2,テスト,Japanese katakana (test)
3,تجربة,Arabic (test) - RTL
4,Москва,Russian (Moscow)
5,🎉 launch 🚀,emoji preserved
6,café ☕,emoji + accent combo
1 id name note
2 1 中国北京 Beijing in Chinese (with leading/trailing space)
3 2 テスト Japanese katakana (test)
4 3 تجربة Arabic (test) - RTL
5 4 Москва Russian (Moscow)
6 5 🎉 launch 🚀 emoji preserved
7 6 café ☕ emoji + accent combo

@@ -0,0 +1,5 @@
id,name,city
1,café,München
2,naïve,résumé
3,don’t,smart-apostrophe mojibake
4,Alice,New York
1 id name city
2 1 café München
3 2 naïve résumé
4 3 don’t smart-apostrophe mojibake
5 4 Alice New York

@@ -0,0 +1,8 @@
id,value
1,real
2,
3,
4,  
5,  
6,
7,actual value
1 id value
2 1 real
3 2
4 3
5 4
6 5
7 6
8 7 actual value

@@ -0,0 +1,3 @@
id , Customer Name ,“Email”,Phone
1,Alice,alice@example.com,555-1234
2,Bob,bob@example.com,555-5678
1 id Customer Name  “Email” Phone​
2 1 Alice alice@example.com 555-1234
3 2 Bob bob@example.com 555-5678
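
This is one of the header cases behind the clean_dataframe change: column headers go through the same pipeline as cells, controlled by a clean_headers toggle on CleanOptions. A minimal sketch of that toggle on plain lists instead of a DataFrame. The field name clean_headers comes from the commit message; its default value, clean_cell, and clean_table are hypothetical stand-ins:

```python
from dataclasses import dataclass


@dataclass
class CleanOptions:
    # Field name from the commit message; the default is an assumption.
    clean_headers: bool = True


def clean_cell(value: str) -> str:
    """Stand-in for the real pipeline: drop zero-width spaces, trim whitespace."""
    return value.replace("\u200b", "").strip()


def clean_table(header, rows, options=None):
    """Clean every cell, and the header too when options.clean_headers is set."""
    options = options or CleanOptions()
    if options.clean_headers:
        new_header = [clean_cell(h) for h in header]
    else:
        new_header = list(header)
    new_rows = [[clean_cell(c) for c in row] for row in rows]
    return new_header, new_rows


header = ["id ", " Customer Name ", "Phone\u200b"]
rows = [["1", " Alice ", "555-1234"]]
print(clean_table(header, rows))
print(clean_table(header, rows, CleanOptions(clean_headers=False)))
```

Routing headers through the same function as cells keeps the two from drifting apart; the toggle exists for callers that rely on the original header strings as keys.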

@@ -0,0 +1,4 @@
id,price,european_number,date,phone,quantity
1, 100 ,1 234,2024-01-15,(555) 123-4567,42
2," $1,500.00 ",12 345,15/01/2024,555.123.4567,7
3, N/A ,nan,Jan 15 2024,+1 555 123 4567,0
1 id price european_number date phone quantity
2 1 100 1 234 2024-01-15 (555) 123-4567 42
3 2 $1,500.00 12 345 15/01/2024 555.123.4567 7
4 3 N/A nan Jan 15 2024 +1 555 123 4567 0

@@ -0,0 +1 @@
id ,Name ,Email
1 id Name  Email​

@@ -0,0 +1,5 @@
id , Name ,“Email”,Notes
1, Alice Smith ,Alice@Example.COM,“VIP” customer — contact ASAP…
2, Bob Jones ,bob@example.com,its 56″ tall
3, Carol Brown ,CAROL@EXAMPLE.COM,3 × 4 = 12 (preserve ×)
4, ,empty@example.com,whitespace-only name (becomes empty)
1 id Name  “Email” Notes​
2 1 Alice Smith  Alice@Example.COM “VIP” customer — contact ASAP…
3 2 Bob Jones bob@example.com it’s 5′6″ tall
4 3 Carol Brown CAROL@EXAMPLE.COM 3 × 4 = 12 (preserve ×)
5 4 empty@example.com whitespace-only name (becomes empty)