test: add text-cleaner corpus and close gaps surfaced by it

The 21-fixture corpus (test-cases/text-cleaner-corpus/) exercises the cleaner
end-to-end against the spec in TEST-CASES.md. Closing the failing cases drove
five small cleaner fixes plus two fixture-generation fixes:

- _SMART_CHARS: add prime, double prime, guillemets (case 03)
- _ZERO_WIDTH: add soft hyphen U+00AD (case 05)
- clean_dataframe: clean column headers via the same pipeline (cases 16/19/20),
  with a clean_headers toggle on CleanOptions
- smart_title_case: title-case full-shout strings ("ALICE SMITH" -> "Alice
  Smith") while still preserving embedded acronyms; preserve uppercase after
  apostrophe in names ("O'CONNOR" -> "O'Connor", "o'neil" -> "O'neil")
- test_corpus.py reader: pre-strip NUL bytes (the C parser truncates at NUL,
  the Python engine is too strict about embedded literal "), per spec case 06
- generate_test_data.py: properly CSV-escape literal-quote cells in case 03
  expected; quote the rogue-comma price field in case 17 input

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,8 @@
id,name,city
1,Alice,New York
2,Bob,Chicago
3,Carol,San Francisco
4,Dan Smith,Austin
5,Eve,Boston
6,Frank van der Berg,Denver
7,Grace Hopper,Palo Alto
@@ -0,0 +1,7 @@
id,label,note
1,Premium,NBSP padding
2,Discount,narrow NBSP
3,Standard,ideographic space
4,Tier One,em-space internal
5,Cost Plus,thin-space internal
6,mixed,ascii + NBSP combined
@@ -0,0 +1,6 @@
id,quote,measurement
1,"""Hello world""","5' 11"""
2,it's working,-
3,2020-2024,from 'a' to 'z'
4,wait...,3 × 4
5,"""quoted""",5 ± 0.1
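The commit message says `_SMART_CHARS` gained prime, double prime, and guillemets for case 03. A minimal sketch of how such a mapping could be applied, assuming a `str.translate` table. The names `SMART_CHARS_SKETCH` and `normalize_smart_chars`, and the choice to map guillemets to a straight double quote, are illustrative assumptions and not the repository's actual code:

```python
# Hypothetical stand-in for the cleaner's _SMART_CHARS table; the real
# contents and replacement targets are not shown in this commit view.
SMART_CHARS_SKETCH = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2032": "'",   # prime
    "\u2033": '"',   # double prime
    "\u00ab": '"',   # left guillemet (assumed replacement)
    "\u00bb": '"',   # right guillemet (assumed replacement)
})


def normalize_smart_chars(text: str) -> str:
    """Replace typographic quote-like characters with ASCII equivalents."""
    return text.translate(SMART_CHARS_SKETCH)
```

Characters that are not in the table, such as `×` and `±` in rows 4 and 5 above, pass through unchanged.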
@@ -0,0 +1,8 @@
id,name,description
1,café,NFC form (single code point)
2,café,NFD form (e + combining accent)
3,naïve,NFC i-diaeresis
4,naïve,NFD i + combining diaeresis
5,office,fi-ligature (ffi)
6,ABC,fullwidth ABC
7,Ⅸ century,roman numeral nine (single code point)
@@ -0,0 +1,8 @@
id,value,note
1,Hello,zero-width space inside word
2,Leading,leading + internal ZWSP
3,Trail,trailing ZWSP
4,abc,ZWNJ and ZWJ
5,Marked,LTR + RTL marks bracketing
6,cooperate,soft hyphen
7,nobreak,word joiner
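Per the commit message, `_ZERO_WIDTH` now includes the soft hyphen U+00AD (case 05). A minimal sketch of stripping the invisible characters named in the notes column above, assuming a regex character class. `ZERO_WIDTH_SKETCH` and `strip_zero_width` are hypothetical names, and the exact membership of the real set is an assumption:

```python
import re

# Hypothetical stand-in for the cleaner's _ZERO_WIDTH set.
ZERO_WIDTH_SKETCH = re.compile(
    "["
    "\u200b"  # zero-width space
    "\u200c"  # zero-width non-joiner
    "\u200d"  # zero-width joiner
    "\u200e"  # left-to-right mark
    "\u200f"  # right-to-left mark
    "\u2060"  # word joiner
    "\ufeff"  # zero-width no-break space (BOM)
    "\u00ad"  # soft hyphen, the addition described in the commit message
    "]"
)


def strip_zero_width(text: str) -> str:
    """Remove invisible formatting characters without touching visible text."""
    return ZERO_WIDTH_SKETCH.sub("", text)
```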
@@ -0,0 +1,9 @@
id,value,note
1,HelloWorld,NUL byte inside
2,BellSound,BEL character
3,Backspace,backspace
4,VertTab,vertical tab
5,FormFeed,form feed
6,Escape,ESC character
7,Delete,DEL character
8,Mixedjunk,multiple controls in one cell
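The commit message notes that the `test_corpus.py` reader pre-strips NUL bytes before parsing. The "C parser" and "python engine" wording suggests the real reader is built on a pandas CSV parser, but that code is not shown here. The sketch below illustrates the same pre-strip idea with the standard-library `csv` module so it stays self-contained; `read_csv_without_nul` is a hypothetical name:

```python
import csv
import io


def read_csv_without_nul(raw: bytes, encoding: str = "utf-8") -> list[list[str]]:
    """Drop NUL bytes from the raw input, then parse the remainder as CSV."""
    # Strip at the byte level so the parser never sees a NUL at all.
    text = raw.replace(b"\x00", b"").decode(encoding)
    # newline="" lets the csv module handle line endings inside quoted cells.
    return list(csv.reader(io.StringIO(text, newline="")))
```

Stripping before parsing keeps the workaround in the test reader only, so the remaining control characters still reach the cleaner under test.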
3
test-cases/text-cleaner-corpus/expected/07_bom_utf8.csv
Normal file
@@ -0,0 +1,3 @@
id,name,city
1,Alice,New York
2,Bob,Chicago
@@ -0,0 +1,4 @@
id,name
1,Alice
2,Bob
3,Carol
@@ -0,0 +1,4 @@
id,name
1,Alice
2,Bob
3,Carol
@@ -0,0 +1,5 @@
id,name
1,Alice
2,Bob
3,Carol
4,Dan
@@ -0,0 +1,9 @@
id,address,notes
1,"123 Main St
Apt 4B
New York, NY","line1
line2"
2,Single line,"contains
classic mac
internal"
3,normal,no newlines here
@@ -0,0 +1,5 @@
id,name,email,product
1,ALICE SMITH,Alice@Example.COM,Widget
2,bob jones,BOB@example.com,GADGET
3,Carol Brown,carol@EXAMPLE.com,wIdGeT
4,DAN O'CONNOR,Dan@Example.com,gizmo
@@ -0,0 +1,5 @@
id,name,email,product
1,ALICE SMITH,alice@example.com,Widget
2,bob jones,bob@example.com,GADGET
3,Carol Brown,carol@example.com,wIdGeT
4,DAN O'CONNOR,dan@example.com,gizmo
@@ -0,0 +1,5 @@
id,name,email,product
1,Alice Smith,Alice@Example.COM,Widget
2,Bob Jones,BOB@example.com,GADGET
3,Carol Brown,carol@EXAMPLE.com,wIdGeT
4,Dan O'Connor,Dan@Example.com,gizmo
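The name column above matches the `smart_title_case` behaviour described in the commit message. The following is a rough sketch of rules that would reproduce those examples, not the actual implementation. The function name, the ASCII-only word regex, and the "segment after an apostrophe is capitalised only if it was uppercase and longer than one letter" rule are all assumptions made for illustration:

```python
import re

# ASCII-only word matcher, chosen to keep the sketch small (an assumption).
_WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def _title_word(word: str, full_shout: bool) -> str:
    # In mixed-case text, an all-caps word is treated as an embedded acronym.
    if not full_shout and len(word) > 1 and word.isalpha() and word.isupper():
        return word
    head, *tail = word.split("'")
    pieces = [head.capitalize()]
    for segment in tail:
        # Keep an uppercase start after the apostrophe (O'CONNOR -> O'Connor),
        # but never promote a lowercase one (o'neil -> O'neil).
        if len(segment) > 1 and segment[0].isupper():
            pieces.append(segment.capitalize())
        else:
            pieces.append(segment.lower())
    return "'".join(pieces)


def smart_title_case_sketch(text: str) -> str:
    """Title-case a name while respecting acronyms and apostrophes."""
    letters = [ch for ch in text if ch.isalpha()]
    # "Full shout" means every letter in the whole string is uppercase.
    full_shout = bool(letters) and all(ch.isupper() for ch in letters)
    return _WORD_RE.sub(lambda m: _title_word(m.group(0), full_shout), text)
```

One consequence of this design is that acronyms are only recognised when the surrounding text is mixed case. A string that is entirely uppercase is title-cased as a whole.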
@@ -0,0 +1,7 @@
id,name,note
1,中国北京,Beijing in Chinese (with leading/trailing space)
2,テスト,Japanese katakana (test)
3,تجربة,Arabic (test) - RTL
4,Москва,Russian (Moscow)
5,🎉 launch 🚀,emoji preserved
6,café ☕,emoji + accent combo
@@ -0,0 +1,5 @@
id,name,city
1,café,München
2,naïve,résumé
3,don’t,smart-apostrophe mojibake
4,Alice,New York
@@ -0,0 +1,5 @@
id,name,city
1,café,München
2,naïve,résumé
3,don't,smart-apostrophe mojibake
4,Alice,New York
@@ -0,0 +1,8 @@
id,value
1,real
2,
3,
4,
5,
6,
7,actual value
@@ -0,0 +1,3 @@
id,Customer Name,"""Email""",Phone
1,Alice,alice@example.com,555-1234
2,Bob,bob@example.com,555-5678
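According to the commit message, `clean_dataframe` now runs column headers through the same pipeline as cell values, controlled by a `clean_headers` toggle on `CleanOptions`. The sketch below shows that toggle on plain lists rather than a DataFrame so it needs no third-party packages. Only the names `CleanOptions` and `clean_headers` come from the commit message; the default value, `clean_cell`, and `clean_table` are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CleanOptions:
    # Field name taken from the commit message; the default is an assumption.
    clean_headers: bool = True


def clean_cell(value: str) -> str:
    """Stand-in for the real pipeline: collapse whitespace runs and trim.

    str.split() with no arguments also splits on Unicode whitespace
    such as the no-break space.
    """
    return " ".join(value.split())


def clean_table(header, rows, options: Optional[CleanOptions] = None):
    """Clean every cell, and the header too when the toggle is on."""
    options = options or CleanOptions()
    if options.clean_headers:
        new_header = [clean_cell(h) for h in header]
    else:
        new_header = list(header)
    new_rows = [[clean_cell(c) for c in row] for row in rows]
    return new_header, new_rows
```

Keeping the toggle on the options object lets callers who depend on the raw header text opt out without bypassing cell cleaning.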
@@ -0,0 +1,4 @@
id,price,european_number,date,phone,quantity
1,100,1 234,2024-01-15,(555) 123-4567,42
2,"$1,500.00",12 345,15/01/2024,555.123.4567,7
3,N/A,nan,Jan 15 2024,+1 555 123 4567,0
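The quoted `"$1,500.00"` cell above, and the doubled quotes in other fixtures, are what standard CSV escaping produces. The commit message mentions two `generate_test_data.py` fixes of this kind, but the generator itself is not shown. A minimal sketch, assuming the fixtures could be written with Python's `csv.writer`; `to_csv_line` is a hypothetical helper:

```python
import csv
import io


def to_csv_line(cells: list[str]) -> str:
    """Serialise one row, letting csv.writer quote and escape as needed."""
    buf = io.StringIO()
    # Default QUOTE_MINIMAL quotes only cells containing a comma, quote,
    # or line break, and doubles any literal quote inside a quoted cell.
    csv.writer(buf, lineterminator="\n").writerow(cells)
    return buf.getvalue()
```

Building rows through the writer instead of joining cells with commas by hand avoids both problems: a comma inside a price no longer splits the field, and a literal `"` no longer ends the cell early.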
@@ -0,0 +1 @@
id,Name,Email
@@ -0,0 +1,5 @@
id,Name,"""Email""",Notes
1,Alice Smith,Alice@Example.COM,"""VIP"" customer - contact ASAP..."
2,Bob Jones,bob@example.com,"it's 5'6"" tall"
3,Carol Brown,CAROL@EXAMPLE.COM,3 × 4 = 12 (preserve ×)
4,,empty@example.com,whitespace-only name (becomes empty)