feat: 3 new tools, format streaming, distribution-ready demo + landing pages
Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI,
                             with a soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.
Format Standardizer reworked for 1 GB international files:
• Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
• Per-row country / address columns drive parsing
• Audit cap (default 10 k rows, ~50 MB RAM)
• standardize_file(): chunked streaming entry point (~165 k rows/sec)
• currency_decimal="auto" for EU comma-decimal locales
• R$ / kr / zł multi-char currency prefixes
• cli_format.py with auto-stream above 100 MB inputs
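One way to read `currency_decimal="auto"` is as a per-value guess at which separator is the decimal point. The sketch below is an assumption, not the shipped logic: `parse_amount_auto` is a hypothetical helper, and the heuristic (the last separator wins when both are present; a lone comma followed by exactly three digits is treated as a thousands group) is only one plausible rule.

```python
import re
from decimal import Decimal


def parse_amount_auto(text: str) -> Decimal:
    """Hypothetical helper: guess the decimal separator of a currency string."""
    # Drop currency prefixes/suffixes ("R$", "kr", "zł") and spaces.
    cleaned = re.sub(r"[^\d.,-]", "", text)
    last_dot = cleaned.rfind(".")
    last_comma = cleaned.rfind(",")

    if last_dot >= 0 and last_comma >= 0:
        # Both present: assume the one that appears last is the decimal separator.
        decimal_sep = "." if last_dot > last_comma else ","
    elif last_comma >= 0:
        # Comma only: "1,234" or "1,234,567" looks like thousands grouping.
        digits_after = len(cleaned) - last_comma - 1
        is_thousands = cleaned.count(",") > 1 or digits_after == 3
        decimal_sep = "" if is_thousands else ","
    elif last_dot >= 0:
        # Dot only: a single dot is read as decimal, several as grouping.
        decimal_sep = "." if cleaned.count(".") == 1 else ""
    else:
        decimal_sep = ""

    if decimal_sep:
        whole, frac = cleaned.rsplit(decimal_sep, 1)
    else:
        whole, frac = cleaned, ""
    # Whatever separators remain in the integer part are grouping marks.
    whole = whole.replace(".", "").replace(",", "")
    return Decimal(f"{whole}.{frac}" if frac else whole)


print(parse_amount_auto("R$ 1.234,56"))
print(parse_amount_auto("12,50 zł"))
```

A value such as `1.234` stays ambiguous under any rule of this kind; this sketch reads it as a decimal number.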
Encoding detection arbiter + language-aware probe:
Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.
Distribution-readiness assets:
• streamlit_app.py — Streamlit Community Cloud entry shim
• src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
100-row cap + watermark, free-vs-paid boundary enforced at surface
• samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
• landing/ — 4 static HTML pages (apex chooser + 3 niche),
shared CSS, deploy.py URL-substitution script,
auto-generated robots.txt + sitemap.xml + 404.html + favicon
• docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
— full strategy + measurement + deployment + master checklist
Test counts:
before: 1,520 passed · 4 skipped · 17 xfailed
after: 1,729 passed · 0 skipped · 0 xfailed
Tier-1 corpora added:
• missing-corpus 3 use cases + 16 edge cases
• column-mapper-corpus 3 use cases + 5 edge cases
• format-cleaner intl 20-row 13-country stress fixture
Engine hardening flushed out by the corpora:
• interpolate guards against object-dtype columns
• mean/median skip all-NaN columns (silences numpy warning)
• fillna runs under future.no_silent_downcasting (silences pandas warning)
• mojibake test no longer skips when ftfy installed (monkeypatch path)
• drop-row threshold semantics: strict-greater (consistent across rows / cols)
• currency_decimal validator allow-set updated for "auto"
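The strict-greater threshold rule can be illustrated with a short sketch. `rows_to_drop` is a hypothetical helper, not the function in `src/core/missing.py`, and it assumes missing cells are represented as `None`.

```python
def rows_to_drop(rows, threshold):
    """Return indices of rows whose missing fraction is strictly greater than threshold."""
    dropped = []
    for index, row in enumerate(rows):
        missing_fraction = sum(value is None for value in row) / len(row)
        # Strict comparison: a row sitting exactly on the threshold is kept.
        if missing_fraction > threshold:
            dropped.append(index)
    return dropped


rows = [
    [1, 2, 3, 4],              # 0.00 missing
    [1, 2, None, None],        # 0.50 missing
    [1, None, None, None],     # 0.75 missing
    [None, None, None, None],  # 1.00 missing
]

print(rows_to_drop(rows, 0.5))   # [2, 3]
print(rows_to_drop(rows, 0.99))  # [3]
print(rows_to_drop(rows, 1.0))   # []
```

Under this reading a row that is exactly 50 % missing survives a 0.5 threshold, and a threshold of 1.0 never drops anything.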
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
35
test-cases/missing-corpus/README.md
Normal file
@@ -0,0 +1,35 @@
# Missing Value Handler — corpus

Acceptance fixtures for `src/core/missing.py`. Each `.csv` under
`test_data/` is paired with assertions in `tests/test_missing_corpus.py`.
Add a new case by dropping a CSV here and adding a parametrize entry to
the runner.

## Use cases (target client profiles)

| File | Buyer profile | Strategy under test |
|------|---------------|---------------------|
| `uc01_shopify_export.csv` | SMB / Shopify operator | `detect-only` |
| `uc02_marketing_audience.csv` | Marketing / RevOps analyst | `safe-fill` |
| `uc03_consultant_intake.csv` | Analyst / consultant | `drop-incomplete` + threshold |

## Edge cases

| File | What it stresses |
|------|------------------|
| `ec01_all_nan_column.csv` | column 100 % missing — fill must skip, drop_col must catch at threshold |
| `ec02_no_missing.csv` | clean file — must be a no-op |
| `ec03_zero_is_not_missing.csv` | numeric `0`, boolean `false`, `"0"` must NOT be treated as missing |
| `ec04_excel_errors.csv` | `#N/A`, `#NULL!`, `#VALUE!` Excel error sentinels |
| `ec05_unicode_whitespace.csv` | NBSP, tab-only, ideographic-space cells treated as whitespace |
| `ec06_mixed_dtypes.csv` | mixed numeric/string in same column — graceful degrade to mode |
| `ec07_real_data_with_padding.csv` | leading/trailing whitespace around real data must NOT be dropped |
| `ec08_single_row.csv` | one-row file — every operation must still work |
| `ec09_single_column.csv` | one-column file with header-only line + sentinels |
| `ec10_all_sentinel_variants.csv` | every `DEFAULT_SENTINELS` entry exercised in one file |
| `ec11_constant_per_column.csv` | `column_fill_values` differs per column |
| `ec12_drop_threshold_boundary.csv` | boundary values for `row_drop_threshold` (0.5, 0.99, 1.0) |
| `ec13_ffill_leading_nan.csv` | leading-NaN run survives ffill (no fabrication) |
| `ec14_interpolate_fallback.csv` | numeric-only strategy on string column triggers fallback |
| `ec15_headers_only.csv` | empty body — must not crash |
| `ec16_idempotent_apply.csv` | running `handle_missing` twice yields the same DataFrame |
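
The edge cases above imply a cell-level rule for what counts as missing. Below is a minimal sketch of that rule, assuming a hypothetical `is_missing` helper and an illustrative sentinel subset copied from the `ec10` fixture. The real `DEFAULT_SENTINELS` handling in `src/core/missing.py`, including whether matching is case-insensitive, is not shown in this commit view.

```python
# Illustrative subset only; the project's DEFAULT_SENTINELS may differ.
ILLUSTRATIVE_SENTINELS = {
    "n/a", "na", "null", "none", "nan", "-", "--", "?",
    "(blank)", "(none)", "#n/a", "#null!",
}


def is_missing(cell: str) -> bool:
    """Hypothetical check: empty, whitespace-only, or a known sentinel."""
    # str.strip() removes Unicode whitespace too, including NBSP (U+00A0)
    # and the ideographic space (U+3000).
    stripped = cell.strip()
    # Case-insensitive matching is an assumption made for this sketch.
    return stripped == "" or stripped.casefold() in ILLUSTRATIVE_SENTINELS


for cell in ["0", "false", "\u00a0", "\u3000", "  Alice  ", "#N/A"]:
    print(repr(cell), is_missing(cell))
```

Real values such as `0`, `false`, and padded names are kept, while whitespace-only cells and Excel error sentinels are flagged.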
@@ -0,0 +1,5 @@
id,name,deprecated_field
1,Alice,
2,Bob,
3,Charlie,
4,Diana,
4
test-cases/missing-corpus/test_data/ec02_no_missing.csv
Normal file
@@ -0,0 +1,4 @@
id,name,age,city
1,Alice,30,NYC
2,Bob,25,LA
3,Charlie,35,SF
@@ -0,0 +1,5 @@
id,active,balance,count,flag
1,true,0.00,0,0
2,false,150.50,3,1
3,true,0,5,0
4,true,75.25,0,1
@@ -0,0 +1,7 @@
sku,price,units,supplier
A-100,19.99,5,Acme
A-101,#N/A,3,Beta
A-102,29.99,#NULL!,Gamma
A-103,#VALUE!,2,Delta
A-104,9.99,0,Acme
A-105,#N/A,#N/A,#NULL!
@@ -0,0 +1,6 @@
id,note,value
1,hello,10
2, ,20
3, ,30
4,real,40
5, ,50
@@ -0,0 +1,6 @@
id,mixed_col,real_num
1,42,1.0
2,N/A,2.0
3,hello,
4,,4.0
5,99,5.0
@@ -0,0 +1,5 @@
id,name,city
1, Alice ,NYC
2, ,LA
3, Bob ,
4,Charlie, SF
2
test-cases/missing-corpus/test_data/ec08_single_row.csv
Normal file
@@ -0,0 +1,2 @@
id,name,age,city
1,Alice,N/A,
@@ -0,0 +1,7 @@
value
10
N/A
20
" "
-
30
@@ -0,0 +1,22 @@
case_id,sentinel_value
01,N/A
02,n/a
03,NA
04,na
05,NULL
06,null
07,None
08,nil
09,NaN
10,-
11,--
12,?
13,.
14,TBD
15,unknown
16,(blank)
17,(none)
18,#N/A
19,#NULL!
20,missing
21,real_value
@@ -0,0 +1,6 @@
id,country,salary,department
1,USA,50000,Eng
2,,60000,Sales
3,UK,,Eng
4,USA,55000,
5,,,
@@ -0,0 +1,6 @@
id,a,b,c,d
1,1,2,3,4
2,,,3,4
3,,,,4
4,,,,
5,1,2,,
@@ -0,0 +1,8 @@
date,price
2025-01-01,
2025-01-02,
2025-01-03,100.0
2025-01-04,
2025-01-05,
2025-01-06,150.0
2025-01-07,
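
The date/price fixture above starts with two blank prices, which matches the README's `ec13` description of a leading-NaN run that must survive forward fill. A minimal sketch of that behavior follows; `forward_fill` is a hypothetical stand-in for the project's implementation, and `None` stands for a missing cell.

```python
def forward_fill(values):
    """Carry the last seen value forward; leading gaps stay empty."""
    filled = []
    last_seen = None
    for value in values:
        if value is not None:
            last_seen = value
        # Before the first real value, last_seen is still None,
        # so nothing is fabricated for the leading run.
        filled.append(last_seen)
    return filled


# Price column of the fixture above, blanks as None.
prices = [None, None, 100.0, None, None, 150.0, None]
print(forward_fill(prices))
# [None, None, 100.0, 100.0, 100.0, 150.0, 150.0]
```

Filling the leading run would require inventing a value with no earlier observation behind it, which is why it is left empty here.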
@@ -0,0 +1,6 @@
id,category,value
1,A,10.0
2,B,
3,C,30.0
4,,40.0
5,A,
@@ -0,0 +1 @@
id,name,age,city
@@ -0,0 +1,5 @@
id,name,age
1,Alice,30
2,N/A,
3,Bob,25
4,,40
11
test-cases/missing-corpus/test_data/uc01_shopify_export.csv
Normal file
@@ -0,0 +1,11 @@
customer_id,first_name,last_name,email,phone,city,total_orders,lifetime_value,last_order_date,tags
SHOP-001,Alice,Johnson,alice@shop.com,555-1234,Brooklyn,12,1240.50,2025-12-04,VIP
SHOP-002,Bob,Smith,bob@shop.com,N/A,Queens,5,420.00,2025-11-22,
SHOP-003,Carlos,Garcia,carlos@shop.com,555-5678,-,8,890.25,2025-12-15,Wholesale
SHOP-004,Diana,Lee,diana@shop.com,(555) 222-3344,Manhattan,NULL,1875.00,2025-10-30,VIP|Wholesale
SHOP-005,Eve,Martinez,,555-9988,Bronx,3,180.00,2025-09-15,
SHOP-006,Frank,Brown,frank@shop.com, ,Staten Island,15,2410.75,(blank),
SHOP-007,Grace,Davis,grace@shop.com,555-1111,Brooklyn,1,49.99,#N/A,New
SHOP-008,Henry,Wilson,henry@shop.com,n/a,Queens,7,675.00,2025-11-08,VIP
SHOP-009,Ivy,Chen,ivy@shop.com,555-7777,?,4,320.50,2025-10-12,
SHOP-010,Jack,Taylor,jack@shop.com,555-4444,Manhattan,(none),520.00,2025-12-01,Wholesale
@@ -0,0 +1,16 @@
contact_id,email,segment,region,age,ltv,score,last_engagement_days,source,consent
LEAD-001,a@mkt.com,Enterprise,NA-East,42,12400,87,3,LinkedIn,true
LEAD-002,b@mkt.com,SMB,NA-West,,3200,62,12,Google,true
LEAD-003,c@mkt.com,SMB,EU,29,1800,N/A,7,unknown,true
LEAD-004,d@mkt.com,Enterprise,NA-East,55,,91,1,Webinar,true
LEAD-005,e@mkt.com,Mid-Market,NA-West,38,5600,74,,Referral,true
LEAD-006,f@mkt.com,SMB,EU,,2100,55,21,-,
LEAD-007,g@mkt.com,Enterprise,APAC,47,9800,82,5,LinkedIn,true
LEAD-008,h@mkt.com,SMB,NA-East,33,2900,,9,Google,
LEAD-009,i@mkt.com,Mid-Market,EU,41,4750,68,15,NULL,true
LEAD-010,j@mkt.com,Enterprise,NA-West,,11200,89,2,Webinar,true
LEAD-011,k@mkt.com,SMB,APAC,28,1650,58,18,(blank),true
LEAD-012,l@mkt.com,Mid-Market,NA-East,36,5100,,11,Referral,true
LEAD-013,m@mkt.com,SMB,EU,31,2300,61,N/A,Google,true
LEAD-014,n@mkt.com,Enterprise,APAC,52,10500,93,4,LinkedIn,true
LEAD-015,o@mkt.com,SMB,NA-West,26,1400,49,25,?,
@@ -0,0 +1,13 @@
respondent_id,age,gender,zip,survey_q1,survey_q2,survey_q3,survey_q4,nps,comments,internal_id_legacy,beta_field
R-001,34,F,11201,4,5,3,4,9,"loved it",,
R-002,N/A,M,10001,,,,, ,,,
R-003,41,F,90210,5,4,5,5,10,"perfect",,
R-004,28,M,-,3,,,,7,,,
R-005,,,NULL,,,,,,,,
R-006,52,F,02101,4,4,4,4,8,"good experience",,
R-007,?,?,?,?,?,?,?,?,?,,
R-008,29,M,94102,5,5,5,5,10,"amazing",,
R-009,38,F,60601,2,3,2,2,5,"meh",,
R-010,(blank),(blank),(blank),(blank),(blank),(blank),(blank),(blank),(blank),,
R-011,45,M,30301,4,4,3,4,8,,,
R-012,33,F,11201,5,5,5,4,9,"will recommend",,