feat: 3 new tools, format streaming, distribution-ready demo + landing pages

Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:31:26 +00:00
parent d18b95880d
commit 966af8ef94
89 changed files with 12039 additions and 284 deletions

@@ -0,0 +1,35 @@
# Missing Value Handler — corpus

Acceptance fixtures for `src/core/missing.py`. Each `.csv` under
`test_data/` is paired with assertions in `tests/test_missing_corpus.py`.
Add a new case by dropping a CSV here and adding a parametrize entry to
the runner.

## Use cases (target client profiles)

| File | Buyer profile | Strategy under test |
|------|---------------|---------------------|
| `uc01_shopify_export.csv` | SMB / Shopify operator | `detect-only` |
| `uc02_marketing_audience.csv` | Marketing / RevOps analyst | `safe-fill` |
| `uc03_consultant_intake.csv` | Analyst / consultant | `drop-incomplete` + threshold |

## Edge cases

| File | What it stresses |
|------|------------------|
| `ec01_all_nan_column.csv` | column 100 % missing — fill must skip, drop_col must catch at threshold |
| `ec02_no_missing.csv` | clean file — must be a no-op |
| `ec03_zero_is_not_missing.csv` | numeric `0`, boolean `false`, `"0"` must NOT be treated as missing |
| `ec04_excel_errors.csv` | `#N/A`, `#NULL!`, `#VALUE!` Excel error sentinels |
| `ec05_unicode_whitespace.csv` | NBSP, tab-only, ideographic-space cells treated as whitespace |
| `ec06_mixed_dtypes.csv` | mixed numeric/string in same column — graceful degrade to mode |
| `ec07_real_data_with_padding.csv` | leading/trailing whitespace around real data must NOT be dropped |
| `ec08_single_row.csv` | one-row file — every operation must still work |
| `ec09_single_column.csv` | one-column file with header-only line + sentinels |
| `ec10_all_sentinel_variants.csv` | every `DEFAULT_SENTINELS` entry exercised in one file |
| `ec11_constant_per_column.csv` | `column_fill_values` differs per column |
| `ec12_drop_threshold_boundary.csv` | boundary values for `row_drop_threshold` (0.5, 0.99, 1.0) |
| `ec13_ffill_leading_nan.csv` | leading-NaN run survives ffill (no fabrication) |
| `ec14_interpolate_fallback.csv` | numeric-only strategy on string column triggers fallback |
| `ec15_headers_only.csv` | empty body — must not crash |
| `ec16_idempotent_apply.csv` | running `handle_missing` twice yields the same DataFrame |
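
The whitespace and sentinel rules in the table above (ec03, ec05, ec07, ec10) can be sketched with the standard library alone. This is an illustration, not the code in `src/core/missing.py`: the `SENTINELS` set is copied from the rows of `ec10_all_sentinel_variants.csv`, `is_missing` is a hypothetical helper name, and exact, case-sensitive matching after trimming is an assumption.

```python
# Hypothetical sentinel set, copied from the rows of
# ec10_all_sentinel_variants.csv. The project's DEFAULT_SENTINELS may differ.
SENTINELS = frozenset({
    "N/A", "n/a", "NA", "na", "NULL", "null", "None", "nil", "NaN",
    "-", "--", "?", ".", "TBD", "unknown", "(blank)", "(none)",
    "#N/A", "#NULL!", "missing",
})


def is_missing(cell: str) -> bool:
    """Return True when a raw CSV cell should count as missing.

    str.strip() with no arguments removes every character for which
    str.isspace() is true, which includes NBSP (U+00A0) and the
    ideographic space (U+3000), so whitespace-only cells become "".
    """
    stripped = cell.strip()
    return stripped == "" or stripped in SENTINELS


# Zero-like values are real data, not missing (ec03).
print(is_missing("0"), is_missing("false"), is_missing("0.00"))
# Whitespace-only cells are missing (ec05).
print(is_missing("\u00a0"), is_missing("\t"), is_missing("\u3000"))
# Padded real data is kept (ec07).
print(is_missing(" Alice "))
```

Trimming only for the comparison, and never writing the trimmed value back, is what keeps padded values such as ` Alice ` intact in this sketch.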

@@ -0,0 +1,5 @@
id,name,deprecated_field
1,Alice,
2,Bob,
3,Charlie,
4,Diana,
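
For a column like `deprecated_field` above, which has no observed values at all, a mean fill has nothing to compute from and should leave the column untouched. A minimal stdlib sketch of that guard follows; `fill_with_mean` is a hypothetical name, and `None` stands in for a missing cell. It is not the pandas-based implementation in `src/core/missing.py`.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace None entries with the mean of the observed entries.

    When every entry is None there is no mean to compute, so the column
    is returned unchanged instead of raising or producing NaN.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # all-missing column: skip the fill
    column_mean = mean(observed)
    return [column_mean if v is None else v for v in values]


print(fill_with_mean([None, None, None, None]))  # unchanged
print(fill_with_mean([1.0, None, 3.0]))          # gap filled with 2.0
```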

@@ -0,0 +1,4 @@
id,name,age,city
1,Alice,30,NYC
2,Bob,25,LA
3,Charlie,35,SF

@@ -0,0 +1,5 @@
id,active,balance,count,flag
1,true,0.00,0,0
2,false,150.50,3,1
3,true,0,5,0
4,true,75.25,0,1

@@ -0,0 +1,7 @@
sku,price,units,supplier
A-100,19.99,5,Acme
A-101,#N/A,3,Beta
A-102,29.99,#NULL!,Gamma
A-103,#VALUE!,2,Delta
A-104,9.99,0,Acme
A-105,#N/A,#N/A,#NULL!

@@ -0,0 +1,6 @@
id,note,value
1,hello,10
2, ,20
3, ,30
4,real,40
5, ,50

@@ -0,0 +1,6 @@
id,mixed_col,real_num
1,42,1.0
2,N/A,2.0
3,hello,
4,,4.0
5,99,5.0

@@ -0,0 +1,5 @@
id,name,city
1, Alice ,NYC
2, ,LA
3, Bob ,
4,Charlie, SF

@@ -0,0 +1,2 @@
id,name,age,city
1,Alice,N/A,

@@ -0,0 +1,7 @@
value
10
N/A
20
" "
-
30

@@ -0,0 +1,22 @@
case_id,sentinel_value
01,N/A
02,n/a
03,NA
04,na
05,NULL
06,null
07,None
08,nil
09,NaN
10,-
11,--
12,?
13,.
14,TBD
15,unknown
16,(blank)
17,(none)
18,#N/A
19,#NULL!
20,missing
21,real_value

@@ -0,0 +1,6 @@
id,country,salary,department
1,USA,50000,Eng
2,,60000,Sales
3,UK,,Eng
4,USA,55000,
5,,,

@@ -0,0 +1,6 @@
id,a,b,c,d
1,1,2,3,4
2,,,3,4
3,,,,4
4,,,,
5,1,2,,
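
The strict-greater threshold rule named in the commit message can be worked through on this fixture. The sketch below assumes the missing fraction is computed over the four value columns `a` to `d` only (the project may also count `id`), and `missing_fraction` and `rows_to_drop` are hypothetical helper names rather than functions from `src/core/missing.py`.

```python
def missing_fraction(cells):
    """Share of cells that are empty after trimming whitespace."""
    return sum(1 for cell in cells if cell.strip() == "") / len(cells)


def rows_to_drop(rows, threshold):
    """Indices of rows whose missing fraction is strictly greater than
    the threshold. A row sitting exactly on the threshold is kept."""
    return [
        index
        for index, cells in enumerate(rows)
        if missing_fraction(cells) > threshold
    ]


# Columns a, b, c, d from the fixture above (id column left out).
ROWS = [
    ["1", "2", "3", "4"],  # 0.00 missing
    ["", "", "3", "4"],    # 0.50 missing
    ["", "", "", "4"],     # 0.75 missing
    ["", "", "", ""],      # 1.00 missing
    ["1", "2", "", ""],    # 0.50 missing
]

print(rows_to_drop(ROWS, 0.5))   # rows at exactly 0.5 are kept
print(rows_to_drop(ROWS, 0.99))  # only the fully empty row exceeds 0.99
print(rows_to_drop(ROWS, 1.0))   # nothing is strictly greater than 1.0
```

Under strict-greater semantics a threshold of 1.0 never drops a row, which is why 0.99 appears among the boundary values.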

@@ -0,0 +1,8 @@
date,price
2025-01-01,
2025-01-02,
2025-01-03,100.0
2025-01-04,
2025-01-05,
2025-01-06,150.0
2025-01-07,
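
Forward fill copies the last observed value into later gaps, so the two leading gaps in this fixture have nothing to copy from and must stay missing. A minimal stdlib sketch of that behaviour follows, with `None` standing in for an empty cell. `forward_fill` is a hypothetical helper, not the pandas call used by the project.

```python
def forward_fill(values):
    """Propagate the last observed value into the gaps that follow it.

    Gaps before the first observed value stay None, because there is no
    earlier value to carry forward and inventing one would fabricate data.
    """
    filled = []
    last_seen = None
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


# The price column from the fixture above.
PRICES = [None, None, 100.0, None, None, 150.0, None]
print(forward_fill(PRICES))
```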

@@ -0,0 +1,6 @@
id,category,value
1,A,10.0
2,B,
3,C,30.0
4,,40.0
5,A,

@@ -0,0 +1 @@
id,name,age,city

@@ -0,0 +1,5 @@
id,name,age
1,Alice,30
2,N/A,
3,Bob,25
4,,40

@@ -0,0 +1,11 @@
customer_id,first_name,last_name,email,phone,city,total_orders,lifetime_value,last_order_date,tags
SHOP-001,Alice,Johnson,alice@shop.com,555-1234,Brooklyn,12,1240.50,2025-12-04,VIP
SHOP-002,Bob,Smith,bob@shop.com,N/A,Queens,5,420.00,2025-11-22,
SHOP-003,Carlos,Garcia,carlos@shop.com,555-5678,-,8,890.25,2025-12-15,Wholesale
SHOP-004,Diana,Lee,diana@shop.com,(555) 222-3344,Manhattan,NULL,1875.00,2025-10-30,VIP|Wholesale
SHOP-005,Eve,Martinez,,555-9988,Bronx,3,180.00,2025-09-15,
SHOP-006,Frank,Brown,frank@shop.com, ,Staten Island,15,2410.75,(blank),
SHOP-007,Grace,Davis,grace@shop.com,555-1111,Brooklyn,1,49.99,#N/A,New
SHOP-008,Henry,Wilson,henry@shop.com,n/a,Queens,7,675.00,2025-11-08,VIP
SHOP-009,Ivy,Chen,ivy@shop.com,555-7777,?,4,320.50,2025-10-12,
SHOP-010,Jack,Taylor,jack@shop.com,555-4444,Manhattan,(none),520.00,2025-12-01,Wholesale

@@ -0,0 +1,16 @@
contact_id,email,segment,region,age,ltv,score,last_engagement_days,source,consent
LEAD-001,a@mkt.com,Enterprise,NA-East,42,12400,87,3,LinkedIn,true
LEAD-002,b@mkt.com,SMB,NA-West,,3200,62,12,Google,true
LEAD-003,c@mkt.com,SMB,EU,29,1800,N/A,7,unknown,true
LEAD-004,d@mkt.com,Enterprise,NA-East,55,,91,1,Webinar,true
LEAD-005,e@mkt.com,Mid-Market,NA-West,38,5600,74,,Referral,true
LEAD-006,f@mkt.com,SMB,EU,,2100,55,21,-,
LEAD-007,g@mkt.com,Enterprise,APAC,47,9800,82,5,LinkedIn,true
LEAD-008,h@mkt.com,SMB,NA-East,33,2900,,9,Google,
LEAD-009,i@mkt.com,Mid-Market,EU,41,4750,68,15,NULL,true
LEAD-010,j@mkt.com,Enterprise,NA-West,,11200,89,2,Webinar,true
LEAD-011,k@mkt.com,SMB,APAC,28,1650,58,18,(blank),true
LEAD-012,l@mkt.com,Mid-Market,NA-East,36,5100,,11,Referral,true
LEAD-013,m@mkt.com,SMB,EU,31,2300,61,N/A,Google,true
LEAD-014,n@mkt.com,Enterprise,APAC,52,10500,93,4,LinkedIn,true
LEAD-015,o@mkt.com,SMB,NA-West,26,1400,49,25,?,

@@ -0,0 +1,13 @@
respondent_id,age,gender,zip,survey_q1,survey_q2,survey_q3,survey_q4,nps,comments,internal_id_legacy,beta_field
R-001,34,F,11201,4,5,3,4,9,"loved it",,
R-002,N/A,M,10001,,,,, ,,,
R-003,41,F,90210,5,4,5,5,10,"perfect",,
R-004,28,M,-,3,,,,7,,,
R-005,,,NULL,,,,,,,,
R-006,52,F,02101,4,4,4,4,8,"good experience",,
R-007,?,?,?,?,?,?,?,?,?,,
R-008,29,M,94102,5,5,5,5,10,"amazing",,
R-009,38,F,60601,2,3,2,2,5,"meh",,
R-010,(blank),(blank),(blank),(blank),(blank),(blank),(blank),(blank),(blank),,
R-011,45,M,30301,4,4,3,4,8,,,
R-012,33,F,11201,5,5,5,4,9,"will recommend",,