feat: implement text cleaner (script 02) with CLI, GUI, and tests

Builds 02_text_cleaner.py from stub to working: character-level hygiene
for CSV/Excel inputs covering trim, whitespace collapse, smart-character
folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char
strip, line-ending normalization, and per-column case conversion. Three
presets (minimal/excel-hygiene/paranoid) keep the buyer surface small.
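
A rough sketch of how the character-level steps listed above could compose,
using only the standard library. The name `clean_text`, the `SMART_MAP`
table, and the step order are assumptions made for illustration; they are not
taken from src/core/text_clean.py, and the real presets may fold a different
set of characters.

```python
import re
import unicodedata

# Hypothetical folding table; the real preset tables may differ.
SMART_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # horizontal ellipsis
    "\u00a0": " ", "\u202f": " ",   # no-break and narrow no-break space
})
# Zero-width characters, direction marks, word joiner, and BOM.
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u200e\u200f\u2060\ufeff]")
# C0 control characters except tab, LF, and CR, plus DEL.
CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_text(value: str, form: str = "NFC") -> str:
    """Apply the hygiene steps in one assumed order."""
    value = value.replace("\r\n", "\n").replace("\r", "\n")  # line endings
    value = ZERO_WIDTH.sub("", value)            # zero-width chars and BOM
    value = CONTROL.sub("", value)               # control characters
    value = value.translate(SMART_MAP)           # smart-character folding
    value = unicodedata.normalize(form, value)   # NFC or NFKC
    value = re.sub(r"[ \t]+", " ", value)        # collapse spaces and tabs
    return value.strip()                         # trim


print(clean_text("\ufeff  \u201cBest\u00a0 Dog\u200b Collar\u201d\u2026 "))
# prints: "Best Dog Collar"...
```

Folding runs before whitespace collapse here so that a no-break space next to
an ordinary space ends up as a single space; a different order would be
equally defensible.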

- src/core/text_clean.py: pure helpers + CleanOptions/CleanResult +
  clean_dataframe with dtype-safe column selection
- src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape
  (dry-run by default, --apply writes cleaned + changes audit, JSON
  config save/load)
- src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset
  picker, advanced toggles, preview, before/after metrics, and three
  download buttons
- tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests
  covering edge cases E1-E50 from the spec
- samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10
  in 10 rows
- test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case
  fixtures
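
To illustrate the dry-run and audit idea from the list above, here is a
minimal sketch over plain dictionaries rather than a DataFrame. The function
`clean_rows` and the audit record layout are hypothetical; skipping every
non-string cell is only a stand-in for the dtype-safe column selection that
clean_dataframe is described as doing.

```python
from typing import Callable


def clean_rows(rows, columns=None, cleaner: Callable[[str], str] = str.strip):
    """Return (cleaned_rows, changes) without mutating the input rows."""
    cleaned, changes = [], []
    for i, row in enumerate(rows):
        new_row = dict(row)
        for col, value in row.items():
            if columns is not None and col not in columns:
                continue  # column not selected for cleaning
            if not isinstance(value, str):
                continue  # leave ints, floats, and None untouched
            after = cleaner(value)
            if after != value:
                new_row[col] = after
                changes.append({"row": i, "column": col,
                                "before": value, "after": after})
        cleaned.append(new_row)
    return cleaned, changes


rows = [
    {"vendor": "ACME Corp ", "amount": 100.0},
    {"vendor": "ACME Corp", "amount": 200.0},
]
cleaned, changes = clean_rows(rows)
print(len(changes))  # 1
```

A dry run would report `changes` and write nothing; an apply step would write
both `cleaned` and `changes` to disk.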

Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7
entry locking the spec, CLI-REFERENCE.md gains the text cleaner
section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md
status row 02 promoted Skeleton -> Working.

200/200 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:14:15 +00:00
parent b2ca04e6f4
commit 54f92ae47e
28 changed files with 2093 additions and 58 deletions

@@ -0,0 +1,8 @@
id,address
1,"123 Main St
Apt 4B
NYC NY 10001"
2,"456 Oak Ave
Suite 200
LA CA 90001"
3,"789 Pine Rd
1 id address
2 1 123 Main St Apt 4B NYC NY 10001
3 2 456 Oak Ave Suite 200 LA CA 90001
4 3 789 Pine Rd Unit 5 SF CA 94101

Binary file not shown.
1 name note
2 Alice normal
3 Bob tab here
4 Charlie null�byte

@@ -0,0 +1,5 @@
name,translation
Café,Cafe
éclair,eclair
你好,Hello (CN)
שלום,Hello (HE)

@@ -0,0 +1,5 @@
x,y,z
1,1.1,10
2,2.2,20
3,3.3,30
4,4.4,40

@@ -0,0 +1,6 @@
field
‘single curly’
“double curly”
low-9 ‘x’ high-reversed-9
em — en – minus − horizontal ―
ellipsis… narrow nbsp 

@@ -0,0 +1,5 @@
first_name,last_name,phone
John ,Smith,555-1234
Jane,Doe ,555-5678
 Bob,Jones,555-9012
Alice,Brown,555-3456

@@ -0,0 +1,4 @@
sku,title,description
DOG-001,“Best Dog Collar”,High quality…
CAT-002,Cat Toy — Premium,It’s the best
FISH-003,Fish Food – Tropical,Use don’t overfeed

@@ -0,0 +1,4 @@
customer_id,name,amount
1001,Alice,100.0
1002,Bob,200.0
1003,Charlie,300.0

@@ -0,0 +1,4 @@
sku,qty
ABC-123,10
XYZ-456,20
QQQ-789,30
1 sku qty
2 ABC​-123 10
3 XYZ-456​ 20
4 QQQ-789 30

@@ -0,0 +1,7 @@
date,amount,memo
2024-01-15,-1500.0,"Payment
Monthly recurring
Net 30"
2024-01-16,-250.0,Single line memo
2024-01-17,-89.99,"Standard
purchase"

@@ -0,0 +1,6 @@
vendor,ein
ACME Corp ,12-3456789
ACME Corp,12-3456789
ACME Corp ,12-3456789
Globex Inc,98-7654321
Globex Inc ,98-7654321

@@ -0,0 +1,4 @@
company,city
Café Roma,Boston
Très Belle,Montréal
Naïve Studios,São Paulo

@@ -0,0 +1,4 @@
task,owner
Phase 1 — Discovery,Alice
Phase 2 — Design,Bob
Q1 – Q2,Charlie

@@ -0,0 +1,6 @@
response_id,agreement,category
1,YES,Tech
2,yes,TECH
3,Yes,tech
4,yEs,Tech
5,yes, Tech

@@ -0,0 +1,4 @@
email,source
alice@test.com,Facebook
bob@test.com,Google
charlie@test.com,Organic
1 email source
2 alice​@test.com Facebook
3 bob@test‎.com Google
4 charlie@test.com Organic

@@ -0,0 +1,6 @@
email,platform
alice@a.com,FB
"alice@a.com
",Google
"alice@a.com
",Organic
1 email platform
2 alice@a.com FB
3 alice@a.com Google
4 alice@a.com Organic
5 bob@a.com FB