test: add text-cleaner corpus and close gaps surfaced by it
The 21-fixture corpus (test-cases/text-cleaner-corpus/) exercises the cleaner
end-to-end against the spec in TEST-CASES.md. Closing the failing cases drove
five small cleaner fixes plus two fixture-generation fixes:
- _SMART_CHARS: add prime, double prime, guillemets (case 03)
- _ZERO_WIDTH: add soft hyphen U+00AD (case 05)
- clean_dataframe: clean column headers via the same pipeline (cases 16/19/20),
with a clean_headers toggle on CleanOptions
- smart_title_case: title-case full-shout strings ("ALICE SMITH" -> "Alice
Smith") while still preserving embedded acronyms; preserve uppercase after
apostrophe in names ("O'CONNOR" -> "O'Connor", "o'neil" -> "O'neil")
- test_corpus.py reader: pre-strip NUL bytes before parsing (the C parser
truncates at NUL, and the Python engine is too strict about an embedded literal "), per spec case 06
- generate_test_data.py: properly CSV-escape literal-quote cells in case 03
expected; quote the rogue-comma price field in case 17 input
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
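
To make the smart_title_case bullet concrete, here is a minimal sketch that reproduces the three examples quoted in the message. It is not the function from this commit: the name smart_title_case_sketch, the helper _title_word, and the rule "an all-caps alphabetic word inside a mixed-case string is an acronym" are assumptions made for illustration.

```python
def _title_word(word: str, full_shout: bool) -> str:
    """Title-case one space-delimited word (hypothetical helper)."""
    # Assumption: in a mixed-case string, a multi-letter all-caps
    # alphabetic word is an embedded acronym and is left untouched.
    if not full_shout and len(word) > 1 and word.isalpha() and word.isupper():
        return word
    head, sep, tail = word.partition("'")
    out = head[:1].upper() + head[1:].lower()
    if sep:
        if tail[:1].isupper():
            # Input had a capital after the apostrophe: preserve it.
            out += sep + tail[:1] + tail[1:].lower()
        else:
            # Input was lowercase after the apostrophe: do not invent one.
            out += sep + tail.lower()
    return out


def smart_title_case_sketch(text: str) -> str:
    """Sketch of the behaviour described in the commit message."""
    letters = [c for c in text if c.isalpha()]
    # "Full shout": every letter in the whole string is uppercase.
    full_shout = bool(letters) and all(c.isupper() for c in letters)
    # Splitting on a single space keeps runs of spaces intact.
    return " ".join(_title_word(w, full_shout) for w in text.split(" "))
```

Under these assumptions "ALICE SMITH" is treated as a full-shout string and title-cased, while "NASA" inside "visit NASA hq" is kept because the surrounding string is mixed case.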
@@ -0,0 +1,8 @@
id,name,city
1, Alice ,New York
2,Bob, Chicago
3,Carol ,San Francisco
4,Dan Smith,Austin
5, Eve , Boston
6,Frank van der Berg,Denver
7, Grace Hopper , Palo Alto
@@ -0,0 +1,7 @@
id,label,note
1, Premium ,NBSP padding
2, Discount ,narrow NBSP
3, Standard ,ideographic space
4,Tier One,em-space internal
5,Cost Plus,thin-space internal
6, mixed ,ascii + NBSP combined
@@ -0,0 +1,6 @@
id,quote,measurement
1,“Hello world”,5′ 11″
2,it’s working,—
3,2020–2024,from ‘a’ to ‘z’
4,wait…,3 × 4
5,«quoted»,5 ± 0.1
@@ -0,0 +1,8 @@
id,name,description
1,café,NFC form (single code point)
2,café,NFD form (e + combining accent)
3,naïve,NFC i-diaeresis
4,naïve,NFD i + combining diaeresis
5,office,fi-ligature (ffi)
6,ABC,fullwidth ABC
7,Ⅸ century,roman numeral nine (single code point)
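
The rows in this fixture contrast canonical forms (NFC vs NFD) with compatibility characters (ligature, fullwidth letters, single-code-point roman numeral). The sketch below shows how Python's standard unicodedata module treats each class. Which normalization form the cleaner actually applies is defined by the spec in TEST-CASES.md and is not stated in this commit, so the helper name fold and its default form are assumptions.

```python
import unicodedata


def fold(text: str, form: str = "NFKC") -> str:
    """Normalize text to the given Unicode form (hypothetical helper)."""
    return unicodedata.normalize(form, text)


# Canonical equivalence: precomposed vs. base letter + combining accent.
nfc_cafe = "caf\u00e9"    # e-acute as a single code point
nfd_cafe = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Compatibility characters: unchanged by NFC, folded by NFKC.
ligature = "o\ufb03ce"          # U+FB03 LATIN SMALL LIGATURE FFI
fullwidth = "\uff21\uff22\uff23"  # fullwidth A, B, C
roman = "\u2168 century"        # U+2168 ROMAN NUMERAL NINE
```

NFC alone makes the two spellings of "café" compare equal but leaves the ligature, fullwidth letters, and roman numeral as they are; NFKC also rewrites those to plain ASCII letters, which changes the text rather than only its encoding.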
@@ -0,0 +1,8 @@
id,value,note
1,Hello,zero-width space inside word
2,Leading,leading + internal ZWSP
3,Trail,trailing ZWSP
4,abc,ZWNJ and ZWJ
5,Marked,LTR + RTL marks bracketing
6,cooperate,soft hyphen
7,nobreak,word joiner
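
The invisible characters named in this fixture's note column (zero-width space, ZWNJ, ZWJ, LTR and RTL marks, soft hyphen, word joiner) can be removed with a single translation table. This is only a sketch: ZERO_WIDTH_SKETCH and strip_zero_width are hypothetical names, and the real _ZERO_WIDTH set in the cleaner may contain more or fewer code points than the ones listed here.

```python
# Hypothetical stand-in for the cleaner's _ZERO_WIDTH set, built from the
# characters this fixture describes.
ZERO_WIDTH_SKETCH = (
    "\u200b"  # ZERO WIDTH SPACE
    "\u200c"  # ZERO WIDTH NON-JOINER
    "\u200d"  # ZERO WIDTH JOINER
    "\u200e"  # LEFT-TO-RIGHT MARK
    "\u200f"  # RIGHT-TO-LEFT MARK
    "\u00ad"  # SOFT HYPHEN (the addition mentioned for case 05)
    "\u2060"  # WORD JOINER
)

# str.translate deletes every code point that maps to None.
_DELETE_TABLE = dict.fromkeys(map(ord, ZERO_WIDTH_SKETCH))


def strip_zero_width(text: str) -> str:
    """Remove the invisible characters listed above from text."""
    return text.translate(_DELETE_TABLE)
```

Deleting ZWJ and ZWNJ unconditionally is a trade-off: it suits tabular name and label data like this corpus, but it would also break emoji sequences and scripts that rely on those joiners.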
Binary file not shown.
3
test-cases/text-cleaner-corpus/test_data/07_bom_utf8.csv
Normal file
@@ -0,0 +1,3 @@
id,name,city
1,Alice,New York
2,Bob,Chicago
@@ -0,0 +1,4 @@
id,name
1,Alice
2,Bob
3,Carol
@@ -0,0 +1 @@
id,name
@@ -0,0 +1,4 @@
id,name
1,Alice
2,Bob
3,Carol
@@ -0,0 +1,7 @@
id,address,notes
1,"123 Main St
Apt 4B
New York, NY","line1
line2"
2,"Single line","contains
classic mac
@@ -0,0 +1,5 @@
id,name,email,product
1,ALICE SMITH,Alice@Example.COM,Widget
2,bob jones,BOB@example.com,GADGET
3,Carol Brown,carol@EXAMPLE.com,wIdGeT
4,DAN O'CONNOR,Dan@Example.com,gizmo
@@ -0,0 +1,7 @@
id,name,note
1, 中国北京 ,Beijing in Chinese (with leading/trailing space)
2,テスト,Japanese katakana (test)
3,تجربة,Arabic (test) - RTL
4,Москва,Russian (Moscow)
5,🎉 launch 🚀,emoji preserved
6,café ☕,emoji + accent combo
5
test-cases/text-cleaner-corpus/test_data/14_mojibake.csv
Normal file
@@ -0,0 +1,5 @@
id,name,city
1,café,München
2,naïve,résumé
3,don’t,smart-apostrophe mojibake
4,Alice,New York
@@ -0,0 +1,8 @@
id,value
1,real
2,
3,
4,
5,
6,
7,actual value
@@ -0,0 +1,3 @@
id , Customer Name ,“Email”,Phone
1,Alice,alice@example.com,555-1234
2,Bob,bob@example.com,555-5678
@@ -0,0 +1,4 @@
id,price,european_number,date,phone,quantity
1, 100 ,1 234,2024-01-15,(555) 123-4567,42
2," $1,500.00 ",12 345,15/01/2024,555.123.4567,7
3, N/A ,nan,Jan 15 2024,+1 555 123 4567,0
@@ -0,0 +1 @@
id ,Name ,Email
@@ -0,0 +1,5 @@
id , Name ,“Email”,Notes
1, Alice Smith ,Alice@Example.COM,“VIP” customer — contact ASAP…
2, Bob Jones ,bob@example.com,it’s 5′6″ tall
3, Carol Brown ,CAROL@EXAMPLE.COM,3 × 4 = 12 (preserve ×)
4, ,empty@example.com,whitespace-only name (becomes empty)
BIN
test-cases/text-cleaner-corpus/test_data/21_excel_pollution.xlsx
Normal file
Binary file not shown.