feat: 3 new tools, format streaming, distribution-ready demo + landing pages

Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler  src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper          src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner        src/core/pipeline.py + cli_pipeline.py + GUI
                            with soft tool-dependency graph (recommended,
                            not enforced) and JSON save/load for repeatable
                            weekly cleanups.

Format Standardizer reworked for 1 GB international files:
• Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
• Per-row country / address columns drive parsing
• Audit cap (default 10 k rows, ~50 MB RAM)
• standardize_file(): chunked streaming entry point (~165 k rows/sec)
• currency_decimal="auto" for EU comma-decimal locales
• R$ / kr / zł multi-char currency prefixes
• cli_format.py with auto-stream above 100 MB inputs
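
The chunked streaming idea can be pictured roughly as follows. This is a stdlib-only sketch under stated assumptions, not the actual `standardize_file()` signature: `stream_standardize`, `iter_chunks`, `clean_row`, and `chunk_rows` are hypothetical names, and the real entry point presumably works on pandas chunks rather than `csv` rows.

```python
import csv
import io
from itertools import islice
from typing import Callable, Dict, Iterator, List, TextIO

Row = Dict[str, str]


def iter_chunks(reader: csv.DictReader, chunk_rows: int) -> Iterator[List[Row]]:
    """Yield lists of at most chunk_rows rows so memory use stays bounded."""
    while True:
        chunk = list(islice(reader, chunk_rows))
        if not chunk:
            return
        yield chunk


def stream_standardize(src: TextIO, dst: TextIO,
                       clean_row: Callable[[Row], Row],
                       chunk_rows: int = 50_000) -> int:
    """Read src chunk by chunk, clean every row, write to dst; return the row count."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames or [],
                            lineterminator="\n")
    writer.writeheader()
    total = 0
    for chunk in iter_chunks(reader, chunk_rows):
        writer.writerows(clean_row(row) for row in chunk)
        total += len(chunk)
    return total


# Hypothetical usage: strip padding from every cell, one row per chunk.
src = io.StringIO("name,city\n alice ,NYC\nbob, LA \n")
dst = io.StringIO()
rows_written = stream_standardize(
    src, dst, lambda row: {k: v.strip() for k, v in row.items()}, chunk_rows=1
)
```

Only one chunk is held in memory at a time, so file size affects run time rather than peak RAM.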
Encoding detection arbiter + language-aware probe:
Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.
Distribution-readiness assets:
• streamlit_app.py — Streamlit Community Cloud entry shim
• src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
100-row cap + watermark, free-vs-paid boundary enforced at surface
• samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
• landing/ — 4 static HTML pages (apex chooser + 3 niche),
shared CSS, deploy.py URL-substitution script,
auto-generated robots.txt + sitemap.xml + 404.html + favicon
• docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
— full strategy + measurement + deployment + master checklist
Test counts:
before: 1,520 passed · 4 skipped · 17 xfailed
after: 1,729 passed · 0 skipped · 0 xfailed
Tier-1 corpora added:
• missing-corpus 3 use cases + 16 edge cases
• column-mapper-corpus 3 use cases + 5 edge cases
• format-cleaner intl 20-row 13-country stress fixture
Engine hardening flushed out by the corpora:
• interpolate guards against object-dtype columns
• mean/median skip all-NaN columns (silences numpy warning)
• fillna runs under future.no_silent_downcasting (silences pandas warning)
• mojibake test no longer skips when ftfy installed (monkeypatch path)
• drop-row threshold semantics: strict-greater (consistent across rows / cols)
• currency_decimal validator allow-set updated for "auto"
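
A minimal sketch of the strict-greater drop rule mentioned above, assuming the missing fraction is computed over the listed cells only. `missing_fraction` and `drop_sparse_rows` are hypothetical names, not the engine's API.

```python
from typing import List, Optional, Sequence


def missing_fraction(row: Sequence[Optional[str]]) -> float:
    """Share of cells in the row that are missing (None or empty string)."""
    if not row:
        return 0.0
    return sum(1 for cell in row if cell is None or cell == "") / len(row)


def drop_sparse_rows(rows: List[Sequence[Optional[str]]],
                     threshold: float) -> List[Sequence[Optional[str]]]:
    """Drop a row only when its missing fraction is strictly greater than threshold."""
    return [row for row in rows if not missing_fraction(row) > threshold]


rows = [
    ["1", "2", "3", "4"],  # 0.00 missing
    ["", "", "3", "4"],    # 0.50 missing
    ["", "", "", "4"],     # 0.75 missing
    ["", "", "", ""],      # 1.00 missing
]
kept_at_half = drop_sparse_rows(rows, 0.5)  # the 0.50 row sits on the boundary and survives
kept_at_one = drop_sparse_rows(rows, 1.0)   # nothing can exceed 1.0, so nothing is dropped
```

With strict-greater semantics a threshold of 1.0 never drops anything, and a row exactly at the threshold is kept.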
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
23
test-cases/column-mapper-corpus/README.md
Normal file
@@ -0,0 +1,23 @@
# Column Mapper — corpus

Acceptance fixtures for `src/core/column_mapper.py`. Each `.csv` under
`test_data/` is paired with assertions in
`tests/test_column_mapper_corpus.py`.

## Use cases (target client profiles)

| File | Buyer profile | Tested behaviour |
|------|---------------|------------------|
| `uc01_crm_import.csv` + `schemas/uc01_crm_target.json` | Sales ops admin importing leads into Salesforce / HubSpot | Schema enforcement: rename via aliases, coerce types, drop extras, add `owner` default. |
| `uc02_vendor_{a,b,c}.csv` + `schemas/uc02_canonical.json` | Operator unifying vendor exports | Multi-source unification: each vendor uses different headers; auto-inference resolves them all. |
| `uc03_type_coercion.csv` + `schemas/uc03_types.json` | Analyst quick-fixing a mistyped CSV | Mixed-type coercion with documented per-column failure counts (bad rows survive as NaN). |

## Edge cases

| File | Stresses |
|------|----------|
| `ec01_duplicate_target.csv` | Mapping two source columns to the same target → `InputValidationError`. |
| `ec02_unicode_columns.csv` | Non-ASCII column names (Japanese) survive rename and coerce. |
| `ec03_whitespace_headers.csv` | Leading/trailing whitespace in headers still fuzzy-matches the schema. |
| `ec04_no_match.csv` | No source column scores above threshold → empty mapping, fallback unmapped strategy fires. |
| `ec05_required_missing.csv` | Required target field has no source column → `InputValidationError` unless `enforce_required=False`. |
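
The alias-based rename and the duplicate-target check described in the tables above might look roughly like this. It is a simplified sketch: it uses exact matching after normalisation instead of the fuzzy scoring the real mapper applies, `normalize` and `build_mapping` are hypothetical names, and `InputValidationError` is a local stand-in for the project's error type.

```python
import re
from typing import Dict, List


class InputValidationError(ValueError):
    """Local stand-in for the project's error type (assumption)."""


def normalize(header: str) -> str:
    """Case-fold and drop punctuation, underscores and whitespace.

    Uses the Unicode-aware \\W class, so non-ASCII letters are kept.
    """
    return re.sub(r"[\W_]+", "", header.casefold())


def build_mapping(source_headers: List[str], schema: dict) -> Dict[str, str]:
    """Map each source header to a target field via the field name or its aliases."""
    lookup: Dict[str, str] = {}
    for field in schema["fields"]:
        for label in [field["name"], *field.get("aliases", [])]:
            lookup[normalize(label)] = field["name"]

    mapping: Dict[str, str] = {}
    claimed: Dict[str, str] = {}  # target -> first source header that claimed it
    for header in source_headers:
        target = lookup.get(normalize(header))
        if target is None:
            continue  # unmapped; the caller decides what to do with it
        if target in claimed:
            raise InputValidationError(
                f"{header!r} and {claimed[target]!r} both map to {target!r}"
            )
        claimed[target] = header
        mapping[header] = target
    return mapping


schema = {"fields": [
    {"name": "first_name", "aliases": ["First Name", "fname"]},
    {"name": "email", "aliases": ["EmailAddr", "Email"]},
]}
mapping = build_mapping(["  First Name ", "EmailAddr", "xyz"], schema)
```

Padded headers resolve because normalisation removes whitespace before lookup. Two source headers that resolve to the same target raise instead of silently overwriting each other.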
13
test-cases/column-mapper-corpus/schemas/uc01_crm_target.json
Normal file
@@ -0,0 +1,13 @@
{
  "fields": [
    {"name": "first_name", "dtype": "string", "required": true, "aliases": ["First Name", "fname"]},
    {"name": "last_name", "dtype": "string", "required": true, "aliases": ["Last Name", "lname"]},
    {"name": "email", "dtype": "string", "required": true, "aliases": ["EmailAddr", "Email", "email_address"]},
    {"name": "phone", "dtype": "string", "aliases": ["Phone", "phone_number"]},
    {"name": "account_name", "dtype": "string", "aliases": ["Company", "Account"]},
    {"name": "annual_rev", "dtype": "integer", "aliases": ["Annual Revenue", "revenue"]},
    {"name": "lead_source", "dtype": "category", "aliases": ["Lead Source", "source"]},
    {"name": "created_date", "dtype": "date", "aliases": ["Created", "create_date"]},
    {"name": "owner", "dtype": "string", "default": "unassigned"}
  ]
}
@@ -0,0 +1,9 @@
{
  "fields": [
    {"name": "first_name", "dtype": "string", "required": true, "aliases": ["FirstName", "FName", "First Name"]},
    {"name": "last_name", "dtype": "string", "required": true, "aliases": ["LastName", "Surname", "Last Name"]},
    {"name": "email", "dtype": "string", "required": true, "aliases": ["Email", "E-mail", "email_addr", "EmailAddr"]},
    {"name": "phone", "dtype": "string", "aliases": ["Phone Number", "Tel", "phone_number"]},
    {"name": "country", "dtype": "string", "aliases": ["Country", "country_code", "Region"]}
  ]
}
10
test-cases/column-mapper-corpus/schemas/uc03_types.json
Normal file
@@ -0,0 +1,10 @@
{
  "fields": [
    {"name": "id", "dtype": "integer", "required": true},
    {"name": "age", "dtype": "integer"},
    {"name": "active", "dtype": "boolean"},
    {"name": "joined", "dtype": "date"},
    {"name": "score", "dtype": "float"},
    {"name": "notes", "dtype": "string"}
  ]
}
@@ -0,0 +1,3 @@
a,b,c
1,2,3
4,5,6
@@ -0,0 +1,3 @@
名前,Email,価格
Alice,a@x.com,100
Bob,b@x.com,200
@@ -0,0 +1,3 @@
First Name , Last Name ,EmailAddr
Alice,Johnson,alice@x.com
Bob,Smith,bob@x.com
@@ -0,0 +1,3 @@
xyz,abc,foobar
1,2,3
4,5,6
@@ -0,0 +1,3 @@
first_name,age
Alice,30
Bob,25
@@ -0,0 +1,4 @@
First Name,Last Name,EmailAddr,Phone,Company,Annual Revenue,Lead Source,Created
Alice,Johnson,alice@acme.com,555-1234,Acme Corp,1500000,LinkedIn,2025-12-04
Bob,Smith,bob@beta.com,555-5678,Beta LLC,250000,Webinar,2025-11-22
Carlos,Garcia,carlos@gamma.io,555-9012,Gamma Inc,4200000,Referral,2025-10-30
@@ -0,0 +1,3 @@
FirstName,LastName,Email,Phone Number,Country
Alice,Johnson,alice@vendor-a.com,555-1234,USA
Bob,Smith,bob@vendor-a.com,555-5678,USA
@@ -0,0 +1,3 @@
first_name,surname,email_addr,phone,country_code
Carlos,Garcia,carlos@vendor-b.com,555-9012,USA
Diana,Lee,diana@vendor-b.com,555-7777,UK
@@ -0,0 +1,3 @@
FName,Surname,E-mail,Tel,Region
Eve,Martinez,eve@vendor-c.com,555-9988,Bronx
Frank,Brown,frank@vendor-c.com,555-1111,Queens
@@ -0,0 +1,6 @@
id,age,active,joined,score,notes
1,30,true,2025-01-15,87.5,first
2,25,false,2025-02-22,not_a_number,second
3,not_a_number,yes,2025-03-08,76.0,third
4,40,no,bad_date,91.2,fourth
5,55,1,2025-05-01,82.0,fifth
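
The fixture above feeds the "bad rows survive" behaviour: a cell that fails to parse is blanked and counted, and the row is kept. Below is a stdlib-only sketch of that idea, using `None` where the pandas-based engine would hold NaN. `coerce_column`, `to_bool`, and the accepted boolean spellings are assumptions for illustration, not the mapper's actual API.

```python
from datetime import date
from typing import Callable, Dict, List, Optional, Tuple

TRUE_WORDS = {"true", "yes", "1"}    # assumed spellings
FALSE_WORDS = {"false", "no", "0"}   # assumed spellings


def to_bool(text: str) -> bool:
    """Parse a boolean-like word; raise ValueError for anything else."""
    word = text.strip().lower()
    if word in TRUE_WORDS:
        return True
    if word in FALSE_WORDS:
        return False
    raise ValueError(text)


PARSERS: Dict[str, Callable[[str], object]] = {
    "integer": int,
    "float": float,
    "boolean": to_bool,
    "date": date.fromisoformat,
    "string": str,
}


def coerce_column(values: List[str], dtype: str) -> Tuple[List[Optional[object]], int]:
    """Parse each cell; a failed cell becomes None and is counted, the row stays."""
    parse = PARSERS[dtype]
    out: List[Optional[object]] = []
    failures = 0
    for text in values:
        try:
            out.append(parse(text))
        except ValueError:
            out.append(None)
            failures += 1
    return out, failures


# Columns taken from the fixture above.
ages, age_failures = coerce_column(["30", "25", "not_a_number", "40", "55"], "integer")
joined, joined_failures = coerce_column(
    ["2025-01-15", "2025-02-22", "2025-03-08", "bad_date", "2025-05-01"], "date"
)
active, active_failures = coerce_column(["true", "false", "yes", "no", "1"], "boolean")
```

Returning the failure count next to the coerced column is what allows a per-column report without dropping any rows.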
@@ -0,0 +1,21 @@
customer_id,name,phone,country,address,price
INT-001,Alice Johnson,(415) 555-1234,US,"1 Apple Park Way, Cupertino CA 95014",$1499.99
INT-002,Boris Petrov,+7 495 123 4567,RU,"Ulitsa Tverskaya 13, Moscow 125009",₽89500
INT-003,carlos garcia,+34 91 411 1111,ES,"Calle Gran Via 28, Madrid 28013","€1.299,00"
INT-004,JOHN BROWN,020 7946 0958,GB,"10 Downing Street, London SW1A 2AA","£950.00"
INT-005,marie dubois,01 42 86 82 00,FR,"Avenue des Champs-Elysees 100, Paris 75008","€2.499,50"
INT-006,Yuki Tanaka,03-3210-7000,JP,"Marunouchi 2-7-3, Chiyoda-ku Tokyo 100-0005",¥150000
INT-007,Anna Schmidt,030 12345678,DE,"Unter den Linden 5, Berlin 10117","€899,99"
INT-008,giovanni rossi,+39 06 6982,IT,"Via del Corso 320, Roma 00186","€1.450,00"
INT-009,Mei Wang,+86 10 1234 5678,CN,"东长安街 1号, 北京 100006",¥10000
INT-010,Priya Sharma,+91 11 2345 6789,IN,"Connaught Place, New Delhi 110001",₹85000
INT-011,Ahmed Hassan,+20 2 2735 0000,EG,"Tahrir Square, Cairo 11511",E£3500
INT-012,emily smith,+61 2 9374 4000,AU,"Sydney Opera House, Sydney NSW 2000","$2,199.00"
INT-013,Joao Silva,11 3071 0000,BR,"Avenida Paulista 1000, Sao Paulo 01310","R$ 1.299,90"
INT-014,Sofia Lopez,+52 55 5555 0000,MX,"Paseo de la Reforma 222, Ciudad de Mexico 06600","$1,500 MXN"
INT-015,Min-jun Kim,+82 2 2287 0114,KR,"Seoul Plaza, Seoul 04518",₩1500000
INT-016,Mehmet Yilmaz,+90 212 252 0000,TR,"Sultanahmet, Istanbul 34122","₺1.250"
INT-017,david cohen,+972 3 6957 0000,IL,"Dizengoff 50, Tel Aviv 6433222",₪450
INT-018,Hanna Kowalska,+48 22 658 4500,PL,"Marszalkowska 1, Warszawa 00-624","zł 350,00"
INT-019,Lars Nielsen,+45 33 12 88 88,DK,"Vesterbrogade 1, Copenhagen 1620","kr 950"
INT-020,Sven Eriksson,+46 8 506 600 00,SE,"Drottninggatan 1, Stockholm 11151","kr 1.250,50"
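
The `price` column above mixes dot-decimal and comma-decimal conventions, which is the case `currency_decimal="auto"` targets. One plausible heuristic is sketched below. It is an illustration only, not the Format Standardizer's actual algorithm, and `detect_decimal_separator` and `parse_amount` are hypothetical names. The main assumption is that a single separator followed by exactly three digits is read as grouping rather than a decimal mark.

```python
import re
from decimal import Decimal
from typing import Optional


def detect_decimal_separator(number: str) -> Optional[str]:
    """Return '.' or ',' if one acts as the decimal mark, else None."""
    last_dot, last_comma = number.rfind("."), number.rfind(",")
    if last_dot >= 0 and last_comma >= 0:
        # Both present: whichever comes last is the decimal mark.
        return "." if last_dot > last_comma else ","
    sep = "." if last_dot >= 0 else ","
    if sep not in number:
        return None  # plain integer, no separator at all
    tail = number.rsplit(sep, 1)[1]
    # Assumption: a repeated separator, or exactly three trailing digits,
    # means grouping ("1.250", "1,500"), not a decimal fraction.
    if number.count(sep) > 1 or len(tail) == 3:
        return None
    return sep


def parse_amount(text: str) -> Decimal:
    """Strip currency symbols and codes, then normalise to a Decimal."""
    number = re.sub(r"[^0-9.,]", "", text)
    decimal_sep = detect_decimal_separator(number)
    for ch in ".,":
        if ch != decimal_sep:
            number = number.replace(ch, "")  # remove grouping separators
    return Decimal(number.replace(",", "."))


samples = ["$1499.99", "€1.299,00", "$2,199.00", "R$ 1.299,90",
           "zł 350,00", "₺1.250", "kr 1.250,50"]
amounts = [parse_amount(s) for s in samples]
```

The three-digit rule is the weak spot: a value such as `1.250` is inherently ambiguous without locale information. That ambiguity is presumably why the per-row country column is used to drive parsing.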
35
test-cases/missing-corpus/README.md
Normal file
@@ -0,0 +1,35 @@
# Missing Value Handler — corpus

Acceptance fixtures for `src/core/missing.py`. Each `.csv` under
`test_data/` is paired with assertions in `tests/test_missing_corpus.py`.
Add a new case by dropping a CSV into `test_data/` and adding a
`parametrize` entry to the runner.

## Use cases (target client profiles)

| File | Buyer profile | Strategy under test |
|------|---------------|---------------------|
| `uc01_shopify_export.csv` | SMB / Shopify operator | `detect-only` |
| `uc02_marketing_audience.csv` | Marketing / RevOps analyst | `safe-fill` |
| `uc03_consultant_intake.csv` | Analyst / consultant | `drop-incomplete` + threshold |

## Edge cases

| File | What it stresses |
|------|------------------|
| `ec01_all_nan_column.csv` | column 100% missing: fill must skip, drop_col must catch at threshold |
| `ec02_no_missing.csv` | clean file: must be a no-op |
| `ec03_zero_is_not_missing.csv` | numeric `0`, boolean `false`, `"0"` must NOT be treated as missing |
| `ec04_excel_errors.csv` | `#N/A`, `#NULL!`, `#VALUE!` Excel error sentinels |
| `ec05_unicode_whitespace.csv` | NBSP, tab-only, ideographic-space cells treated as whitespace |
| `ec06_mixed_dtypes.csv` | mixed numeric/string in same column: graceful degrade to mode |
| `ec07_real_data_with_padding.csv` | leading/trailing whitespace around real data must NOT be dropped |
| `ec08_single_row.csv` | one-row file: every operation must still work |
| `ec09_single_column.csv` | one-column file with header-only line + sentinels |
| `ec10_all_sentinel_variants.csv` | every `DEFAULT_SENTINELS` entry exercised in one file |
| `ec11_constant_per_column.csv` | `column_fill_values` differs per column |
| `ec12_drop_threshold_boundary.csv` | boundary values for `row_drop_threshold` (0.5, 0.99, 1.0) |
| `ec13_ffill_leading_nan.csv` | leading-NaN run survives ffill (no fabrication) |
| `ec14_interpolate_fallback.csv` | numeric-only strategy on string column triggers fallback |
| `ec15_headers_only.csv` | empty body: must not crash |
| `ec16_idempotent_apply.csv` | running `handle_missing` twice yields the same DataFrame |
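
Several of the edge cases above (ec03, ec04, ec05, ec07) reduce to one question: is this cell missing or not? The sketch below shows one way to answer it. `SENTINELS` is an illustrative subset, so the project's `DEFAULT_SENTINELS` may differ, and `is_missing` is a hypothetical helper rather than the module's API.

```python
from typing import Optional

# Illustrative subset only; not the project's DEFAULT_SENTINELS.
SENTINELS = {
    "n/a", "na", "null", "none", "nan", "-", "--", "?",
    "(blank)", "(none)", "#n/a", "#null!", "#value!",
}


def is_missing(cell: Optional[str]) -> bool:
    """True for None, whitespace-only cells, and known sentinel spellings."""
    if cell is None:
        return True
    # str.strip() with no argument removes all Unicode whitespace,
    # including NBSP (U+00A0) and ideographic space (U+3000).
    stripped = cell.strip()
    return stripped == "" or stripped.casefold() in SENTINELS


checks = {
    "N/A": is_missing("N/A"),            # sentinel, case-insensitive
    "#NULL!": is_missing("#NULL!"),      # Excel error sentinel
    "nbsp": is_missing("\u00a0"),        # whitespace-only cell
    "0": is_missing("0"),                # real value, must stay
    "false": is_missing("false"),        # real value, must stay
    " Alice ": is_missing(" Alice "),    # padded real data, must stay
}
```

Stripping is used only for the test, so padded real data such as `" Alice "` is left unchanged in the cell itself.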
@@ -0,0 +1,5 @@
id,name,deprecated_field
1,Alice,
2,Bob,
3,Charlie,
4,Diana,
4
test-cases/missing-corpus/test_data/ec02_no_missing.csv
Normal file
@@ -0,0 +1,4 @@
id,name,age,city
1,Alice,30,NYC
2,Bob,25,LA
3,Charlie,35,SF
@@ -0,0 +1,5 @@
id,active,balance,count,flag
1,true,0.00,0,0
2,false,150.50,3,1
3,true,0,5,0
4,true,75.25,0,1
@@ -0,0 +1,7 @@
sku,price,units,supplier
A-100,19.99,5,Acme
A-101,#N/A,3,Beta
A-102,29.99,#NULL!,Gamma
A-103,#VALUE!,2,Delta
A-104,9.99,0,Acme
A-105,#N/A,#N/A,#NULL!
@@ -0,0 +1,6 @@
id,note,value
1,hello,10
2, ,20
3, ,30
4,real,40
5, ,50
@@ -0,0 +1,6 @@
id,mixed_col,real_num
1,42,1.0
2,N/A,2.0
3,hello,
4,,4.0
5,99,5.0
@@ -0,0 +1,5 @@
id,name,city
1, Alice ,NYC
2, ,LA
3, Bob ,
4,Charlie, SF
2
test-cases/missing-corpus/test_data/ec08_single_row.csv
Normal file
@@ -0,0 +1,2 @@
id,name,age,city
1,Alice,N/A,
@@ -0,0 +1,7 @@
value
10
N/A
20
" "
-
30
@@ -0,0 +1,22 @@
case_id,sentinel_value
01,N/A
02,n/a
03,NA
04,na
05,NULL
06,null
07,None
08,nil
09,NaN
10,-
11,--
12,?
13,.
14,TBD
15,unknown
16,(blank)
17,(none)
18,#N/A
19,#NULL!
20,missing
21,real_value
@@ -0,0 +1,6 @@
id,country,salary,department
1,USA,50000,Eng
2,,60000,Sales
3,UK,,Eng
4,USA,55000,
5,,,
@@ -0,0 +1,6 @@
id,a,b,c,d
1,1,2,3,4
2,,,3,4
3,,,,4
4,,,,
5,1,2,,
@@ -0,0 +1,8 @@
date,price
2025-01-01,
2025-01-02,
2025-01-03,100.0
2025-01-04,
2025-01-05,
2025-01-06,150.0
2025-01-07,
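
The fixture above starts with two empty prices. A forward fill has nothing to carry into those cells, so they should stay missing rather than be invented. Here is a stdlib-only sketch of that behaviour, where `forward_fill` is a hypothetical helper and `None` stands in for NaN:

```python
from typing import List, Optional


def forward_fill(values: List[Optional[float]]) -> List[Optional[float]]:
    """Carry the last seen value forward; leading gaps stay None."""
    out: List[Optional[float]] = []
    last: Optional[float] = None
    for value in values:
        if value is not None:
            last = value
        out.append(last)
    return out


# The price column from the fixture above.
prices = [None, None, 100.0, None, None, 150.0, None]
filled = forward_fill(prices)
```

The first two entries remain `None` because no earlier observation exists. Later gaps take the most recent price.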
@@ -0,0 +1,6 @@
id,category,value
1,A,10.0
2,B,
3,C,30.0
4,,40.0
5,A,
@@ -0,0 +1 @@
id,name,age,city
@@ -0,0 +1,5 @@
id,name,age
1,Alice,30
2,N/A,
3,Bob,25
4,,40
11
test-cases/missing-corpus/test_data/uc01_shopify_export.csv
Normal file
@@ -0,0 +1,11 @@
customer_id,first_name,last_name,email,phone,city,total_orders,lifetime_value,last_order_date,tags
SHOP-001,Alice,Johnson,alice@shop.com,555-1234,Brooklyn,12,1240.50,2025-12-04,VIP
SHOP-002,Bob,Smith,bob@shop.com,N/A,Queens,5,420.00,2025-11-22,
SHOP-003,Carlos,Garcia,carlos@shop.com,555-5678,-,8,890.25,2025-12-15,Wholesale
SHOP-004,Diana,Lee,diana@shop.com,(555) 222-3344,Manhattan,NULL,1875.00,2025-10-30,VIP|Wholesale
SHOP-005,Eve,Martinez,,555-9988,Bronx,3,180.00,2025-09-15,
SHOP-006,Frank,Brown,frank@shop.com, ,Staten Island,15,2410.75,(blank),
SHOP-007,Grace,Davis,grace@shop.com,555-1111,Brooklyn,1,49.99,#N/A,New
SHOP-008,Henry,Wilson,henry@shop.com,n/a,Queens,7,675.00,2025-11-08,VIP
SHOP-009,Ivy,Chen,ivy@shop.com,555-7777,?,4,320.50,2025-10-12,
SHOP-010,Jack,Taylor,jack@shop.com,555-4444,Manhattan,(none),520.00,2025-12-01,Wholesale
@@ -0,0 +1,16 @@
contact_id,email,segment,region,age,ltv,score,last_engagement_days,source,consent
LEAD-001,a@mkt.com,Enterprise,NA-East,42,12400,87,3,LinkedIn,true
LEAD-002,b@mkt.com,SMB,NA-West,,3200,62,12,Google,true
LEAD-003,c@mkt.com,SMB,EU,29,1800,N/A,7,unknown,true
LEAD-004,d@mkt.com,Enterprise,NA-East,55,,91,1,Webinar,true
LEAD-005,e@mkt.com,Mid-Market,NA-West,38,5600,74,,Referral,true
LEAD-006,f@mkt.com,SMB,EU,,2100,55,21,-,
LEAD-007,g@mkt.com,Enterprise,APAC,47,9800,82,5,LinkedIn,true
LEAD-008,h@mkt.com,SMB,NA-East,33,2900,,9,Google,
LEAD-009,i@mkt.com,Mid-Market,EU,41,4750,68,15,NULL,true
LEAD-010,j@mkt.com,Enterprise,NA-West,,11200,89,2,Webinar,true
LEAD-011,k@mkt.com,SMB,APAC,28,1650,58,18,(blank),true
LEAD-012,l@mkt.com,Mid-Market,NA-East,36,5100,,11,Referral,true
LEAD-013,m@mkt.com,SMB,EU,31,2300,61,N/A,Google,true
LEAD-014,n@mkt.com,Enterprise,APAC,52,10500,93,4,LinkedIn,true
LEAD-015,o@mkt.com,SMB,NA-West,26,1400,49,25,?,
@@ -0,0 +1,13 @@
respondent_id,age,gender,zip,survey_q1,survey_q2,survey_q3,survey_q4,nps,comments,internal_id_legacy,beta_field
R-001,34,F,11201,4,5,3,4,9,"loved it",,
R-002,N/A,M,10001,,,,, ,,,
R-003,41,F,90210,5,4,5,5,10,"perfect",,
R-004,28,M,-,3,,,,7,,,
R-005,,,NULL,,,,,,,,
R-006,52,F,02101,4,4,4,4,8,"good experience",,
R-007,?,?,?,?,?,?,?,?,?,,
R-008,29,M,94102,5,5,5,5,10,"amazing",,
R-009,38,F,60601,2,3,2,2,5,"meh",,
R-010,(blank),(blank),(blank),(blank),(blank),(blank),(blank),(blank),(blank),,
R-011,45,M,30301,4,4,3,4,8,,,
R-012,33,F,11201,5,5,5,4,9,"will recommend",,