test: add text-cleaner corpus and close gaps surfaced by it

The 21-fixture corpus (test-cases/text-cleaner-corpus/) exercises the cleaner
end-to-end against the spec in TEST-CASES.md. Closing the failing cases drove
five small cleaner fixes plus two fixture-generation fixes:

- _SMART_CHARS: add prime, double prime, guillemets (case 03)
- _ZERO_WIDTH: add soft hyphen U+00AD (case 05)
- clean_dataframe: clean column headers via the same pipeline (cases 16/19/20),
  with a clean_headers toggle on CleanOptions
- smart_title_case: title-case full-shout strings ("ALICE SMITH" -> "Alice
  Smith") while still preserving embedded acronyms; preserve uppercase after
  apostrophe in names ("O'CONNOR" -> "O'Connor", "o'neil" -> "O'neil")
- test_corpus.py reader: pre-strip NUL bytes before parsing (the C parser
  truncates at NUL; the python engine is too strict about embedded literal
  double quotes), per spec case 06
- generate_test_data.py: properly CSV-escape literal-quote cells in case 03
  expected; quote the rogue-comma price field in case 17 input
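
A minimal sketch of the smart_title_case rules listed above, not the actual implementation. The function names and the exact acronym rule are assumptions for illustration; only the input/output pairs quoted in the message come from the commit.

```python
def _title_word(word: str) -> str:
    """Capitalize one word; keep an uppercase letter right after an apostrophe."""
    head, sep, tail = word.partition("'")
    result = head[:1].upper() + head[1:].lower()
    if sep:
        if tail[:1].isupper():
            # Source had an uppercase letter after the apostrophe: keep it.
            tail = tail[:1] + tail[1:].lower()
        else:
            tail = tail.lower()
        result += sep + tail
    return result


def smart_title_case_sketch(text: str) -> str:
    """Hypothetical stand-in for smart_title_case (assumed behavior)."""
    letters = [c for c in text if c.isalpha()]
    # "Full shout": every letter in the whole string is uppercase.
    full_shout = bool(letters) and all(c.isupper() for c in letters)
    out = []
    for word in text.split(" "):
        if word.isupper() and not full_shout:
            # All-caps word inside mixed-case text: treat as an acronym.
            out.append(word)
        else:
            out.append(_title_word(word))
    return " ".join(out)
```

One consequence of this sketch is that a lone all-caps token such as "NASA" counts as a full-shout string and is title-cased; the real cleaner may handle that case differently.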

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:37:35 +00:00
parent 54f92ae47e
commit c349a90e18
50 changed files with 1644 additions and 4 deletions

@@ -0,0 +1,8 @@
id,name,city
1,Alice,New York
2,Bob,Chicago
3,Carol,San Francisco
4,Dan Smith,Austin
5,Eve,Boston
6,Frank van der Berg,Denver
7,Grace Hopper,Palo Alto

@@ -0,0 +1,7 @@
id,label,note
1,Premium,NBSP padding
2,Discount,narrow NBSP
3,Standard,ideographic space
4,Tier One,em-space internal
5,Cost Plus,thin-space internal
6,mixed,ascii + NBSP combined

@@ -0,0 +1,6 @@
id,quote,measurement
1,"""Hello world""","5' 11"""
2,it's working,-
3,2020-2024,from 'a' to 'z'
4,wait...,3 × 4
5,"""quoted""",5 ± 0.1
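
The fixture above pairs with the _SMART_CHARS change in the commit message. A minimal sketch of such a mapping follows. The table contents and the name replace_smart_chars are assumptions for illustration; in particular, mapping guillemets to straight double quotes is a guess, since the commit only says they were added.

```python
# Assumed mapping; the real _SMART_CHARS table may differ.
SMART_CHARS_SKETCH = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2032": "'",    # prime (feet, minutes)
    "\u2033": '"',    # double prime (inches, seconds)
    "\u00ab": '"',    # left guillemet (assumed target)
    "\u00bb": '"',    # right guillemet (assumed target)
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
}
_TABLE = str.maketrans(SMART_CHARS_SKETCH)


def replace_smart_chars(text: str) -> str:
    """Replace typographic punctuation with ASCII equivalents."""
    return text.translate(_TABLE)
```

Characters outside the table, such as the multiplication sign and plus-minus sign in rows 4 and 5, pass through unchanged.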

@@ -0,0 +1,8 @@
id,name,description
1,café,NFC form (single code point)
2,café,NFD form (e + combining accent)
3,naïve,NFC i-diaeresis
4,naïve,NFD i + combining diaeresis
5,office,fi-ligature (ffi)
6,ABC,fullwidth ABC
7,Ⅸ century,roman numeral nine (single code point)

@@ -0,0 +1,8 @@
id,value,note
1,Hello,zero-width space inside word
2,Leading,leading + internal ZWSP
3,Trail,trailing ZWSP
4,abc,ZWNJ and ZWJ
5,Marked,LTR + RTL marks bracketing
6,cooperate,soft hyphen
7,nobreak,word joiner
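
A minimal sketch of the kind of stripping this fixture exercises, including the soft hyphen (U+00AD) added to _ZERO_WIDTH in this commit. The character set and the name strip_zero_width are assumptions; the real table may include more or fewer code points.

```python
import re

# Assumed set: ZWSP, ZWNJ, ZWJ, LRM, RLM, soft hyphen, word joiner, BOM.
_ZERO_WIDTH_RE = re.compile("[\u200b\u200c\u200d\u200e\u200f\u00ad\u2060\ufeff]")


def strip_zero_width(text: str) -> str:
    """Remove invisible formatting characters from a cell value."""
    return _ZERO_WIDTH_RE.sub("", text)
```

Removing ZWJ unconditionally also breaks joined emoji sequences, so a production cleaner might want to special-case it.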

@@ -0,0 +1,9 @@
id,value,note
1,HelloWorld,NUL byte inside
2,BellSound,BEL character
3,Backspace,backspace
4,VertTab,vertical tab
5,FormFeed,form feed
6,Escape,ESC character
7,Delete,DEL character
8,Mixedjunk,multiple controls in one cell
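
The commit message notes that the test reader pre-strips NUL bytes before parsing this case. A minimal sketch of that idea follows, using the standard-library csv module as a stand-in for whatever parser test_corpus.py actually uses. The function name read_csv_without_nul is hypothetical.

```python
import csv
import io


def read_csv_without_nul(raw: bytes, encoding: str = "utf-8") -> list[list[str]]:
    """Drop NUL bytes from the raw file content, then parse it as CSV."""
    text = raw.replace(b"\x00", b"").decode(encoding)
    # newline="" lets the csv module handle line endings itself.
    return list(csv.reader(io.StringIO(text, newline="")))
```

Stripping at the byte level keeps the parser from ever seeing the NUL, so the other control characters in the fixture still reach the cleaner under test.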

@@ -0,0 +1,3 @@
id,name,city
1,Alice,New York
2,Bob,Chicago

@@ -0,0 +1,4 @@
id,name
1,Alice
2,Bob
3,Carol

@@ -0,0 +1,4 @@
id,name
1,Alice
2,Bob
3,Carol

@@ -0,0 +1,5 @@
id,name
1,Alice
2,Bob
3,Carol
4,Dan

@@ -0,0 +1,9 @@
id,address,notes
1,"123 Main St
Apt 4B
New York, NY","line1
line2"
2,Single line,"contains
classic mac
internal"
3,normal,no newlines here

@@ -0,0 +1,5 @@
id,name,email,product
1,ALICE SMITH,Alice@Example.COM,Widget
2,bob jones,BOB@example.com,GADGET
3,Carol Brown,carol@EXAMPLE.com,wIdGeT
4,DAN O'CONNOR,Dan@Example.com,gizmo

@@ -0,0 +1,5 @@
id,name,email,product
1,ALICE SMITH,alice@example.com,Widget
2,bob jones,bob@example.com,GADGET
3,Carol Brown,carol@example.com,wIdGeT
4,DAN O'CONNOR,dan@example.com,gizmo

@@ -0,0 +1,5 @@
id,name,email,product
1,Alice Smith,Alice@Example.COM,Widget
2,Bob Jones,BOB@example.com,GADGET
3,Carol Brown,carol@EXAMPLE.com,wIdGeT
4,Dan O'Connor,Dan@Example.com,gizmo

@@ -0,0 +1,7 @@
id,name,note
1,中国北京,Beijing in Chinese (with leading/trailing space)
2,テスト,Japanese katakana (test)
3,تجربة,Arabic (test) - RTL
4,Москва,Russian (Moscow)
5,🎉 launch 🚀,emoji preserved
6,café ☕,emoji + accent combo

@@ -0,0 +1,5 @@
id,name,city
1,café,München
2,naïve,résumé
3,don’t,smart-apostrophe mojibake
4,Alice,New York

@@ -0,0 +1,5 @@
id,name,city
1,café,München
2,naïve,résumé
3,don't,smart-apostrophe mojibake
4,Alice,New York

@@ -0,0 +1,8 @@
id,value
1,real
2,
3,
4,
5,
6,
7,actual value

@@ -0,0 +1,3 @@
id,Customer Name,"""Email""",Phone
1,Alice,alice@example.com,555-1234
2,Bob,bob@example.com,555-5678
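
A rough sketch of running headers through the same cleaning pipeline as cell values, with a toggle, follows. The commit names a clean_headers option on CleanOptions; everything else here (CleanOptionsSketch, clean_table, the list-of-rows representation, the toy clean_cell) is hypothetical and stands in for the real clean_dataframe.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CleanOptionsSketch:
    # Mirrors the clean_headers toggle named in the commit message.
    clean_headers: bool = True


def clean_table(
    header: list[str],
    rows: list[list[str]],
    clean_cell: Callable[[str], str],
    options: Optional[CleanOptionsSketch] = None,
) -> tuple[list[str], list[list[str]]]:
    """Apply clean_cell to every cell, and to headers when enabled."""
    options = options or CleanOptionsSketch()
    if options.clean_headers:
        new_header = [clean_cell(h) for h in header]
    else:
        new_header = list(header)
    new_rows = [[clean_cell(c) for c in row] for row in rows]
    return new_header, new_rows


def toy_clean_cell(value: str) -> str:
    """Toy cleaner: turn NBSP into a space and trim the ends."""
    return value.replace("\u00a0", " ").strip()
```

Sharing one cell-level function between headers and values avoids the two drifting apart, which appears to be the intent behind cases 16, 19, and 20.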

@@ -0,0 +1,4 @@
id,price,european_number,date,phone,quantity
1,100,1 234,2024-01-15,(555) 123-4567,42
2,"$1,500.00",12 345,15/01/2024,555.123.4567,7
3,N/A,nan,Jan 15 2024,+1 555 123 4567,0

@@ -0,0 +1 @@
id,Name,Email

@@ -0,0 +1,5 @@
id,Name,"""Email""",Notes
1,Alice Smith,Alice@Example.COM,"""VIP"" customer - contact ASAP..."
2,Bob Jones,bob@example.com,"it's 5'6"" tall"
3,Carol Brown,CAROL@EXAMPLE.COM,3 × 4 = 12 (preserve ×)
4,,empty@example.com,whitespace-only name (becomes empty)