Tools shipped this batch (4 → 6 of 9 Ready):
04 Missing Value Handler: src/core/missing.py + cli_missing.py + GUI
05 Column Mapper: src/core/column_mapper.py + cli_column_map.py + GUI
09 Pipeline Runner: src/core/pipeline.py + cli_pipeline.py + GUI,
   with a soft tool-dependency graph (recommended, not enforced)
   and JSON save/load for repeatable weekly cleanups.
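A minimal sketch of what a "soft" dependency graph with JSON save/load could look like. The tool keys, the `RECOMMENDED_BEFORE` table, and the function names are hypothetical and are not taken from src/core/pipeline.py; the point is only that ordering problems produce advisory warnings instead of errors.

```python
import json

# Hypothetical advisory ordering: each tool maps to the tools that are
# recommended (not required) to run before it.
RECOMMENDED_BEFORE = {
    "missing_value_handler": ["column_mapper"],
    "format_standardizer": ["column_mapper"],
}


def order_warnings(steps):
    """Return advisory warnings for a step list; never raises on bad order."""
    warnings = []
    seen = set()
    for step in steps:
        for dep in RECOMMENDED_BEFORE.get(step, []):
            # Only warn when the recommended predecessor is in the pipeline
            # but has not run yet at this point.
            if dep in steps and dep not in seen:
                warnings.append(f"{step}: recommended to run after {dep}")
        seen.add(step)
    return warnings


def save_pipeline(steps, path):
    """Persist the step list as JSON so a weekly cleanup can be replayed."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"version": 1, "steps": steps}, handle, indent=2)


def load_pipeline(path):
    """Load a previously saved step list."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)["steps"]
```

Because `order_warnings` only reports, a caller can still run the steps in whatever order the user chose.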
Format Standardizer reworked for 1 GB international files:
• Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
• Per-row country / address columns drive parsing
• Audit cap (default 10 k rows, ~50 MB RAM)
• standardize_file(): chunked streaming entry point (~165 k rows/sec)
• currency_decimal="auto" for EU comma-decimal locales
• R$ / kr / zł multi-char currency prefixes
• cli_format.py with auto-stream above 100 MB inputs
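As an illustration of the currency_decimal="auto" idea combined with an LRU cache, here is a hedged sketch. `parse_amount`, the prefix list, and the separator heuristic are assumptions made for this example, not the Format Standardizer's actual code: the rightmost separator is treated as the decimal mark, unless only one kind of separator appears and exactly three digits follow it, in which case it is treated as grouping.

```python
from decimal import Decimal
from functools import lru_cache

# Longest prefixes first so "A$" and "R$" win over a bare "$".
_PREFIXES = sorted(("R$", "A$", "kr", "zł", "$", "€", "£", "¥"),
                   key=len, reverse=True)


def _normalise_separators(text):
    """Rewrite a numeric string so that "." is the only separator left."""
    last_dot, last_comma = text.rfind("."), text.rfind(",")
    if last_dot == -1 and last_comma == -1:
        return text  # plain integer such as "75000"
    pos = max(last_dot, last_comma)
    sep = text[pos]
    other = "," if sep == "." else "."
    digits_after = len(text) - pos - 1
    only_one_kind = min(last_dot, last_comma) == -1
    if only_one_kind and digits_after == 3:
        # Assumption: "1,299" or "2.410" is grouping, not a decimal fraction.
        return text.replace(sep, "")
    # Otherwise the rightmost separator is the decimal mark.
    return text.replace(other, "").replace(sep, ".")


@lru_cache(maxsize=65536)
def parse_amount(raw):
    """Parse one currency string; repeated values are served from the cache."""
    text = raw.strip()
    for prefix in _PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):]
            break
    text = text.replace(" ", "").replace("\u00a0", "")
    return Decimal(_normalise_separators(text))
```

With this heuristic, `parse_amount("€1.890,40")` and `parse_amount("$1,240.50")` both resolve without a locale hint. A value such as "1.234" stays ambiguous, and the sketch resolves it as grouping.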
Encoding detection arbiter + language-aware probe:
Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.
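A rough sketch of how a tied-confidence arbiter with a script-coverage probe might work. The `(encoding, confidence)` candidate list is assumed to come from some upstream detector, and `arbitrate`, `cyrillic_coverage`, and `tie_margin` are hypothetical names, not the project's API. Only a Cyrillic probe is shown.

```python
import unicodedata


def cyrillic_coverage(text):
    """Fraction of alphabetic characters that are Cyrillic letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for ch in letters if "CYRILLIC" in unicodedata.name(ch, ""))
    return cyrillic / len(letters)


def arbitrate(data, candidates, tie_margin=0.05):
    """Pick an encoding; run the coverage probe only when confidences tie.

    candidates: list of (encoding_name, confidence) pairs (assumed input).
    """
    best = max(conf for _, conf in candidates)
    tied = [enc for enc, conf in candidates if best - conf <= tie_margin]
    if len(tied) == 1:
        return tied[0]  # clear winner, no probe needed

    def score(encoding):
        try:
            text = data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            return -1.0  # an undecodable candidate loses the tie
        return cyrillic_coverage(text)

    return max(tied, key=score)
```

For bytes that are really cp1251 Russian, decoding as cp1252 yields accented Latin letters with zero Cyrillic coverage, so the probe breaks the tie in favour of cp1251.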
Distribution-readiness assets:
• streamlit_app.py — Streamlit Community Cloud entry shim
• src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
100-row cap + watermark, free-vs-paid boundary enforced at surface
• samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
• landing/ — 4 static HTML pages (apex chooser + 3 niche),
shared CSS, deploy.py URL-substitution script,
auto-generated robots.txt + sitemap.xml + 404.html + favicon
• docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
— full strategy + measurement + deployment + master checklist
Test counts:
before: 1,520 passed · 4 skipped · 17 xfailed
after: 1,729 passed · 0 skipped · 0 xfailed
Tier-1 corpora added:
• missing-corpus: 3 use cases + 16 edge cases
• column-mapper-corpus: 3 use cases + 5 edge cases
• format-cleaner intl: 20-row, 13-country stress fixture
Engine hardening flushed out by the corpora:
• interpolate guards against object-dtype columns
• mean/median skip all-NaN columns (silences numpy warning)
• fillna runs under future.no_silent_downcasting (silences pandas warning)
• mojibake test no longer skips when ftfy is installed (monkeypatch path)
• drop-row threshold semantics: strict-greater (consistent across rows / cols)
• currency_decimal validator allow-set updated for "auto"
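To make the strict-greater threshold semantics concrete, here is a small pure-Python sketch. The function names and the list-of-lists layout are illustrative only, and the real engine presumably operates on DataFrames. A row or column is dropped only when its missing fraction is strictly greater than the threshold, so a value exactly at the threshold is kept.

```python
def missing_fraction(values):
    """Share of None entries in a sequence (0.0 for an empty sequence)."""
    if not values:
        return 0.0
    return sum(value is None for value in values) / len(values)


def drop_sparse_rows(rows, threshold):
    """Drop rows whose missing fraction is strictly greater than threshold."""
    return [row for row in rows if not missing_fraction(row) > threshold]


def drop_sparse_columns(rows, threshold):
    """Apply the same strict-greater rule column-wise."""
    if not rows:
        return rows
    keep = [
        index
        for index in range(len(rows[0]))
        if not missing_fraction([row[index] for row in rows]) > threshold
    ]
    return [[row[index] for index in keep] for row in rows]
```

With a threshold of 0.5, a row that is exactly half missing survives, and the column-wise helper applies the same comparison.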
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| Customer ID | First Name | Last Name | Email | Phone | Address | City | State | ZIP | Country | Total Orders | Lifetime Value | Last Order Date | Tags | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SHOP-1001 | Alice | Johnson | alice@petshop.com | (415) 555-1234 | 123 Main St., Apt 4B | San Francisco | CA | 94102 | US | 12 | $1 | 240.50 | 2025-12-04 | VIP |
| SHOP-1002 | Bob | SMITH | Bob@PetShop.com | 415.555.1234 | 123 Main St, Apt 4B | San Francisco | CA | 94102 | US | 12 | $1,240.50 | N/A | VIP | |
| SHOP-1003 | carlos | garcia | carlos@petshop.com | 5559876543 | 742 Evergreen Terrace | Springfield | IL | 62704 | US | 5 | 420.00 | 12/15/2025 | Wholesale | |
| SHOP-1004 | Diana | Lee | diana@petshop.com | (555) 222-3344 | PO Box 12, Sherwood Forest | Nottingham | | NG1 5BA | GB | 8 | £890.25 | 2025-10-30 | VIP\|Wholesale | |
| SHOP-1005 | EVE MARTINEZ | | eve.martinez@petshop.com | 555-9988 | Calle Mayor 45 | Madrid | | 28013 | ES | 3 | €180 | 2025-09-15 | | |
| SHOP-1006 | Frank | Brown | frank@petshop.com | | | Berlin | BE | 10115 | DE | 15 | €2.410 | 75 | (blank) | Wholesale |
| SHOP-1007 | Grace | Davis | grace@petshop.com | +1 555-111-1111 | 888 Maple Ave | Toronto | ON | M5V 3A8 | CA | 1 | $49.99 | #N/A | New | |
| SHOP-1008 | henry | wilson | Henry@PetShop.com | 5551111111 | 888 Maple Avenue | Toronto | ON | M5V 3A8 | CA | 1 | $49.99 | 2025-12-01 | New | |
| SHOP-1009 | Ivy | Chen | IVY@petshop.com | +1 (555) 777-7777 | 550 Elm Street, Suite 200 | Brooklyn | NY | 11201 | US | 4 | $320.50 | 10/12/2025 | | |
| SHOP-1010 | Jack | Taylor | jack@petshop.com | (none) | 550 elm street, suite 200 | brooklyn | NY | 11201 | US | 4 | $320.50 | 2025-10-12 | | |
| SHOP-1011 | kate | o'neil | kate.oneil@petshop.com | 415-555-2222 | 99 King's Rd | London | | SW3 4LX | GB | 7 | £675.00 | ? | VIP | |
| SHOP-1012 | luis | rodriguez | LUIS@petshop.com | +34 91 411 1111 | Avenida de la Paz 12, 3°D | Madrid | | 28013 | ES | 2 | €89,99 | unknown | | |
| SHOP-1013 | Mia | Park | mia@petshop.com | 02-9374-4000 | Sydney Opera House Drive | Sydney | NSW | 2000 | AU | 9 | A$ 1,299.00 | 2025-11-20 | Wholesale | |
| SHOP-1014 | Noah | nguyen | noah@petshop.com | +81 3 3210 7000 | 丸の内 2-7-3 | Tokyo | | 100-0005 | JP | 6 | ¥75000 | 2025-12-10 | VIP | |
| SHOP-1015 | Olivia | Brown | OLIVIA@PETSHOP.COM | (555) 333-4444 | 742 evergreen terrace | springfield | IL | 62704 | US | 3 | $180.00 | (none) | | |
| SHOP-1016 | Pavel | Novak | pavel@petshop.com | +44 20 7946 1234 | 22 Baker Street | London | | W1U 6AB | United Kingdom | 4 | £412.00 | 2025-11-18 | VIP | |
| SHOP-1017 | Quinn | Murphy | quinn@petshop.com | +44 20 7946 5678 | 5 Princes Street | Edinburgh | | EH2 2DA | U.K. | 2 | £189.50 | 2025-12-09 | | |
| SHOP-1018 | Rachel | O'Brien | rachel@petshop.com | 02-9374-9999 | 100 George Street | Sydney | NSW | 2000 | UK | 1 | £75.00 | ? | New | |
| SHOP-1019 | Sam | Klein | sam@petshop.com | +49 30 99887766 | Friedrichstraße 100 | Berlin | | 10117 | Germany | 11 | €1.890,40 | 2025-12-11 | VIP\|Wholesale | |
| SHOP-1020 | Tara | Gianni | tara@petshop.com | +39 06 6982 4567 | Via del Corso 250 | Roma | | 00186 | Italia | 5 | €649,99 | 2025-12-03 | | |