feat: implement text cleaner (script 02) with CLI, GUI, and tests
Builds 02_text_cleaner.py from stub to working: character-level hygiene for CSV/Excel inputs covering trim, whitespace collapse, smart-character folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char strip, line-ending normalization, and per-column case conversion. Three presets (minimal/excel-hygiene/paranoid) keep the buyer surface small.

- src/core/text_clean.py: pure helpers + CleanOptions/CleanResult + clean_dataframe with dtype-safe column selection
- src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape (dry-run by default, --apply writes cleaned + changes audit, JSON config save/load)
- src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset picker, advanced toggles, preview, before/after metrics, and three download buttons
- tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests covering edge cases E1-E50 from the spec
- samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10 in 10 rows
- test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case fixtures

Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7 entry locking the spec, CLI-REFERENCE.md gains the text cleaner section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md status row 02 promoted Skeleton -> Working.

200/200 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8
test-cases/ec05_multiline_cells.csv
Normal file
@@ -0,0 +1,8 @@
id,address
1,"123 Main St
Apt 4B
NYC NY 10001"
2,"456 Oak Ave
Suite 200
LA CA 90001"
3,"789 Pine Rd
BIN
test-cases/ec06_control_characters.csv
Normal file
Binary file not shown.
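Since the ec06 fixture is not rendered here, the following is a minimal sketch of what a control-character strip could look like. The helper name `strip_control_chars` and the choice to keep tab, LF, and CR are assumptions for illustration, not the behavior confirmed by src/core/text_clean.py.

```python
import unicodedata

# Assumption: tab and line breaks are kept so multi-line cells survive.
KEEP = {"\t", "\n", "\r"}


def strip_control_chars(value: str) -> str:
    """Drop characters in Unicode category Cc, except those in KEEP."""
    return "".join(
        ch for ch in value
        if ch in KEEP or unicodedata.category(ch) != "Cc"
    )


# NUL and BEL are removed; the tab and newline are preserved.
sample = "ab\x00c\x07\td\n"
cleaned = strip_control_chars(sample)
```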
5
test-cases/ec07_unicode_decomposed.csv
Normal file
@@ -0,0 +1,5 @@
name,translation
Café,Cafe
éclair,eclair
你好,Hello (CN)
שלום,Hello (HE)
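For context on what this fixture exercises, here is a small sketch of Unicode normalization using only the standard library. The wrapper name `normalize_text` is hypothetical; the sample strings are written with escapes so the decomposed form is visible.

```python
import unicodedata


def normalize_text(value: str, form: str = "NFC") -> str:
    """Return value in the requested Unicode normalization form."""
    return unicodedata.normalize(form, value)


# 'e' followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
decomposed = "Cafe\u0301"
composed = normalize_text(decomposed)          # NFC composes to U+00E9

# NFKC additionally folds compatibility characters such as the fi ligature.
ligature_folded = normalize_text("\ufb01", "NFKC")
```

The two forms of "Café" render identically but compare unequal until normalized, which is why a decomposed input can defeat exact-match joins or deduplication.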
5
test-cases/ec08_all_numeric.csv
Normal file
@@ -0,0 +1,5 @@
x,y,z
1,1.1,10
2,2.2,20
3,3.3,30
4,4.4,40
6
test-cases/ec09_smart_chars_full.csv
Normal file
@@ -0,0 +1,6 @@
field
‘single curly’
“double curly”
low-9 ‘x’ high-reversed-9
em — en – minus − horizontal ―
ellipsis… narrow nbsp
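The character families named in this fixture (curly quotes, low-9 and high-reversed-9 quotes, dashes, ellipsis, narrow no-break space) could be folded with a translation table along the following lines. The table contents and the name `fold_smart_chars` are assumptions for illustration; the mapping actually used by the cleaner may differ.

```python
# Hypothetical folding table; the real mapping in the cleaner may differ.
SMART_CHAR_MAP = {
    "\u2018": "'", "\u2019": "'", "\u201a": "'", "\u201b": "'",   # single quotes
    "\u201c": '"', "\u201d": '"', "\u201e": '"', "\u201f": '"',   # double quotes
    "\u2013": "-", "\u2014": "-", "\u2015": "-", "\u2212": "-",   # en, em, bar, minus
    "\u2026": "...",                                              # ellipsis
    "\u00a0": " ", "\u202f": " ",                                 # nbsp, narrow nbsp
}
_TABLE = str.maketrans(SMART_CHAR_MAP)


def fold_smart_chars(value: str) -> str:
    """Replace typographic characters with plain ASCII equivalents."""
    return value.translate(_TABLE)


quoted = fold_smart_chars("\u201cBest Dog Collar\u201d")
dashed = fold_smart_chars("em \u2014 en \u2013 minus \u2212")
trailing = fold_smart_chars("High quality\u2026")
```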
5
test-cases/uc16_shopify_nbsp_names.csv
Normal file
@@ -0,0 +1,5 @@
first_name,last_name,phone
John ,Smith,555-1234
Jane,Doe ,555-5678
Bob,Jones,555-9012
Alice,Brown,555-3456
4
test-cases/uc17_product_smart_quotes.csv
Normal file
@@ -0,0 +1,4 @@
sku,title,description
DOG-001,“Best Dog Collar”,High quality…
CAT-002,Cat Toy — Premium,It’s the best
FISH-003,Fish Food – Tropical,Use don’t overfeed
4
test-cases/uc18_excel_csv_utf8_bom.csv
Normal file
@@ -0,0 +1,4 @@
customer_id,name,amount
1001,Alice,100.0
1002,Bob,200.0
1003,Charlie,300.0
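Going by its filename, this fixture presumably begins with a UTF-8 byte order mark, which the diff view does not show. The sketch below illustrates, with made-up bytes rather than the fixture itself, how a BOM leaks into the first header name and how the standard `utf-8-sig` codec removes it.

```python
import csv
import io

# Illustrative bytes: a UTF-8 BOM followed by a header row and one record.
raw = b"\xef\xbb\xbfcustomer_id,name,amount\r\n1001,Alice,100.0\r\n"

# Decoding as plain UTF-8 keeps U+FEFF glued to the first column name.
naive_header = next(csv.reader(io.StringIO(raw.decode("utf-8"))))

# The utf-8-sig codec drops a leading BOM if one is present.
clean_header = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))
```

A lookup such as `row["customer_id"]` would fail against the naive header even though the two names look the same on screen.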
4
test-cases/uc19_pasted_sku_zerowidth.csv
Normal file
@@ -0,0 +1,4 @@
sku,qty
ABC-123,10
XYZ-456,20
QQQ-789,30
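Zero-width characters are invisible in this view, so the SKUs above may look clean while still carrying them. A minimal sketch of a zero-width strip follows; the character set and the name `strip_zero_width` are assumptions, and the set the cleaner actually removes may be wider or narrower.

```python
import re

# Assumed set: zero-width space, non-joiner, joiner, word joiner, and
# zero-width no-break space (U+FEFF appearing inside a value).
_ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")


def strip_zero_width(value: str) -> str:
    """Remove zero-width code points anywhere in the string."""
    return _ZERO_WIDTH.sub("", value)


pasted = "ABC\u200b-123"          # looks like ABC-123 but has 8 code points
cleaned = strip_zero_width(pasted)
```

One caveat worth weighing: U+200D also joins emoji sequences and is meaningful in some scripts, so removing it unconditionally is a trade-off rather than a safe default for every column.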
7
test-cases/uc20_bank_memo_crlf.csv
Normal file
@@ -0,0 +1,7 @@
date,amount,memo
2024-01-15,-1500.0,"Payment
Monthly recurring
Net 30"
2024-01-16,-250.0,Single line memo
2024-01-17,-89.99,"Standard
purchase"
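Line-ending normalization inside a cell can be sketched as two ordered replacements. The name `normalize_line_endings` and the choice of LF as the target are assumptions; the order matters so that a CRLF pair becomes one newline rather than two.

```python
def normalize_line_endings(value: str) -> str:
    """Convert CRLF and bare CR to LF."""
    # Replace CRLF first; otherwise each pair would turn into two LFs.
    return value.replace("\r\n", "\n").replace("\r", "\n")


memo = "Payment\r\nMonthly recurring\rNet 30"
normalized = normalize_line_endings(memo)
```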
6
test-cases/uc21_quickbooks_trailing_spaces.csv
Normal file
@@ -0,0 +1,6 @@
vendor,ein
ACME Corp ,12-3456789
ACME Corp,12-3456789
ACME Corp ,12-3456789
Globex Inc,98-7654321
Globex Inc ,98-7654321
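To show why trailing spaces matter for this kind of vendor list, here is a sketch of trim plus whitespace collapse. The name `tidy_whitespace` and the sample values are illustrative; runs of horizontal whitespace are collapsed while line breaks are left alone, which is an assumption about the intended behavior.

```python
import re

# Whitespace other than CR/LF; for str patterns \s also covers U+00A0.
_HORIZONTAL_WS = re.compile(r"[^\S\r\n]+")


def tidy_whitespace(value: str) -> str:
    """Collapse runs of horizontal whitespace to one space, then trim."""
    return _HORIZONTAL_WS.sub(" ", value).strip()


# Illustrative variants of one vendor name, not copied from the fixture.
vendors = ["ACME Corp ", "ACME Corp", "ACME  Corp\u00a0 "]
distinct_before = len(set(vendors))
distinct_after = len({tidy_whitespace(v) for v in vendors})
```

Before cleaning, the three spellings count as three vendors; after cleaning they collapse to one, which is what a downstream dedup step needs.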
4
test-cases/uc22_unicode_accents.csv
Normal file
@@ -0,0 +1,4 @@
company,city
Café Roma,Boston
Très Belle,Montréal
Naïve Studios,São Paulo
4
test-cases/uc23_word_pasted_dashes.csv
Normal file
@@ -0,0 +1,4 @@
task,owner
Phase 1 — Discovery,Alice
Phase 2 — Design,Bob
Q1 – Q2,Charlie
6
test-cases/uc24_survey_case_inconsistent.csv
Normal file
@@ -0,0 +1,6 @@
response_id,agreement,category
1,YES,Tech
2,yes,TECH
3,Yes,tech
4,yEs,Tech
5,yes, Tech
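Per-column case conversion on rows like these could be sketched as below, using plain dictionaries instead of a DataFrame to stay dependency-free. The function `convert_case` and the mode names are hypothetical, not the options exposed by CleanOptions.

```python
# Hypothetical mode names mapped to built-in str methods.
CASE_FUNCS = {"lower": str.lower, "upper": str.upper, "title": str.title}


def convert_case(rows, modes):
    """Apply a case mode per column; columns not listed are left as-is."""
    converted = []
    for row in rows:
        new_row = dict(row)
        for column, mode in modes.items():
            if column in new_row:
                new_row[column] = CASE_FUNCS[mode](new_row[column])
        converted.append(new_row)
    return converted


rows = [
    {"agreement": "YES", "category": "Tech"},
    {"agreement": "yes", "category": "TECH"},
    {"agreement": "Yes", "category": "tech"},
    {"agreement": "yEs", "category": "Tech"},
    {"agreement": "yes", "category": " Tech"},
]
cleaned = convert_case(rows, {"agreement": "lower", "category": "title"})
agreements = {row["agreement"] for row in cleaned}
categories = {row["category"] for row in cleaned}
```

Case conversion alone leaves the last row's leading space in place, so `categories` still holds two values; a trim step is needed as well to reach a single category.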
4
test-cases/uc25_lead_invisible_unicode.csv
Normal file
@@ -0,0 +1,4 @@
email,source
alice@test.com,Facebook
bob@test.com,Google
charlie@test.com,Organic
6
test-cases/uc26_mixed_line_endings.csv
Normal file
@@ -0,0 +1,6 @@
email,platform
alice@a.com,FB
"alice@a.com
",Google
"alice@a.com
",Organic