Tools shipped this batch (4 → 6 of 9 Ready):
04 Missing Value Handler  src/core/missing.py + cli_missing.py + GUI
05 Column Mapper          src/core/column_mapper.py + cli_column_map.py + GUI
09 Pipeline Runner        src/core/pipeline.py + cli_pipeline.py + GUI,
                          with a soft tool-dependency graph (recommended,
                          not enforced) and JSON save/load for repeatable
                          weekly cleanups.
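A minimal sketch of how an advisory (soft) dependency check and a JSON round-trip could fit together. The tool names, the RECOMMENDED_BEFORE mapping, the file layout, and all function names here are hypothetical and are not taken from src/core/pipeline.py.

```python
import json

# Hypothetical advisory ordering: tool -> tools recommended to run earlier.
RECOMMENDED_BEFORE = {
    "missing_values": ["column_mapper"],
    "format_standardizer": ["column_mapper", "missing_values"],
}


def soft_warnings(steps):
    """Return advisory messages only; never raises, so any order may still run."""
    warnings = []
    seen = set()
    for step in steps:
        for dep in RECOMMENDED_BEFORE.get(step, []):
            if dep not in seen:
                warnings.append(f"{step}: recommended to run after {dep}")
        seen.add(step)
    return warnings


def save_pipeline(steps, path):
    """Persist the ordered step list so the same cleanup can be re-run later."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"version": 1, "steps": steps}, fh, indent=2)


def load_pipeline(path):
    """Read back the ordered step list written by save_pipeline."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)["steps"]
```

Because the check only returns messages, a caller can show them in the GUI or CLI and still execute the steps in the order the user chose.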
Format Standardizer reworked for 1 GB international files:
• Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
• Per-row country / address columns drive parsing
• Audit cap (default 10 k rows, ~50 MB RAM)
• standardize_file(): chunked streaming entry point (~165 k rows/sec)
• currency_decimal="auto" for EU comma-decimal locales
• R$ / kr / zł multi-char currency prefixes
• cli_format.py with auto-stream above 100 MB inputs
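One way the "auto" decimal detection could work, as a sketch only. parse_amount is a hypothetical helper, not the project's API, and the rule for a lone separator (treated as grouping when it is followed by exactly three digits) is an assumption made for illustration.

```python
import re
from decimal import Decimal


def parse_amount(text):
    """Parse a currency string, guessing the decimal separator (hypothetical helper)."""
    s = text.strip()
    # Leading minus or accounting-style parentheses both mean negative.
    negative = s.startswith("-") or (s.startswith("(") and s.endswith(")"))
    # Drop currency symbols and prefixes (R$, kr, zł, ...), signs, and spaces.
    digits = re.sub(r"[^\d.,]", "", s)
    last_dot, last_comma = digits.rfind("."), digits.rfind(",")
    if last_dot >= 0 and last_comma >= 0:
        # Both separators present: whichever comes last is the decimal mark.
        dec = "." if last_dot > last_comma else ","
    else:
        sep = "." if last_dot >= 0 else ","
        tail = digits.rsplit(sep, 1)[-1] if sep in digits else ""
        # Assumption: a lone separator is grouping if repeated or followed by 3 digits.
        is_group = digits.count(sep) > 1 or len(tail) == 3
        dec = None if (sep not in digits or is_group) else sep
    if dec == ",":
        digits = digits.replace(".", "").replace(",", ".")
    elif dec == ".":
        digits = digits.replace(",", "")
    else:
        digits = digits.replace(",", "").replace(".", "")
    value = Decimal(digits)
    return -value if negative else value
```

A lone separator such as "1.234" is genuinely ambiguous; a per-row country column, as mentioned in the list above, is the kind of signal that could settle those cases.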
Encoding detection arbiter + language-aware probe:
Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.
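A sketch of the tie-breaking idea, assuming an upstream detector returns (encoding, confidence) pairs. The function names, the tie_eps tolerance, and the use of the basic Cyrillic block as the only probe are illustrative assumptions; the shipped arbiter may work differently and also covers Eastern European Latin.

```python
def cyrillic_coverage(text):
    """Fraction of alphabetic characters that fall in the basic Cyrillic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0400" <= ch <= "\u04ff" for ch in letters) / len(letters)


def arbitrate(raw, candidates, tie_eps=0.05):
    """Pick an encoding; break near-ties by decoding and scoring script coverage.

    candidates: list of (encoding_name, confidence) from any upstream detector.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_conf = ranked[0][1]
    tied = [name for name, conf in ranked if best_conf - conf <= tie_eps]
    if len(tied) == 1:
        return tied[0]

    def score(name):
        try:
            return cyrillic_coverage(raw.decode(name))
        except (UnicodeDecodeError, LookupError):
            return -1.0  # undecodable or unknown codec loses the tie

    return max(tied, key=score)
```

The probe only runs when confidences are effectively tied, so a clear winner from the detector is never overridden.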
Distribution-readiness assets:
• streamlit_app.py: Streamlit Community Cloud entry shim
• src/gui/app_demo.py: single-page demo, ?p=<persona> routing,
  100-row cap + watermark, free-vs-paid boundary enforced at surface
• samples/demo/: 3 niche datasets + pre-tuned pipeline JSONs
• landing/: 4 static HTML pages (apex chooser + 3 niche),
  shared CSS, deploy.py URL-substitution script,
  auto-generated robots.txt + sitemap.xml + 404.html + favicon
• docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md:
  full strategy + measurement + deployment + master checklist
Test counts:
before: 1,520 passed · 4 skipped · 17 xfailed
after: 1,729 passed · 0 skipped · 0 xfailed
Tier-1 corpora added:
• missing-corpus          3 use cases + 16 edge cases
• column-mapper-corpus    3 use cases + 5 edge cases
• format-cleaner intl     20-row, 13-country stress fixture
Engine hardening flushed out by the corpora:
• interpolate guards against object-dtype columns
• mean/median skip all-NaN columns (silences numpy warning)
• fillna runs under future.no_silent_downcasting (silences pandas warning)
• mojibake test no longer skips when ftfy is installed (monkeypatch path)
• drop-row threshold semantics: strict-greater (consistent across rows / cols)
• currency_decimal validator allow-set updated for "auto"
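The strict-greater threshold and the all-NaN guard can be illustrated in plain Python. This is a sketch under assumptions: the function names are hypothetical, rows are assumed non-empty, and the real engine operates on pandas objects rather than lists.

```python
import math


def is_missing(v):
    """Treat None and float NaN as missing."""
    return v is None or (isinstance(v, float) and math.isnan(v))


def drop_sparse_rows(rows, threshold):
    """Keep a row unless its missing fraction is strictly greater than threshold.

    Assumes every row is non-empty.
    """
    kept = []
    for row in rows:
        frac = sum(is_missing(v) for v in row) / len(row)
        if not frac > threshold:  # a row exactly at the threshold is kept
            kept.append(row)
    return kept


def column_mean_or_none(values):
    """Mean of the present values; None for an all-missing column (nothing to fill)."""
    present = [v for v in values if not is_missing(v)]
    return sum(present) / len(present) if present else None
```

With a threshold of 0.5, a row that is exactly half missing survives and only rows that are more than half missing are dropped; applying the same comparison to columns keeps the two axes consistent.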
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| Txn ID | Date | Description | Amount | Balance | Account | Vendor | Category |
|---|---|---|---|---|---|---|---|
| TXN-2401 | 01/15/2025 | AMAZON.COM*4F2X9 PURCHASE | -$129.99 | $2,450.01 | Checking | Amazon | Office Supplies |
| TXN-2402 | 2025-01-15 | AMAZON.COM*4F2X9 PURCHASE | -$129.99 | 2450.01 | Checking | amazon.com | Office Supplies |
| TXN-2403 | Jan 18 2025 | STAPLES #4422 — paper, toner | ($89.50) | $2360.51 | Checking | STAPLES | Office Supplies |
| TXN-2404 | 01/22/2025 | Verizon Wireless "autopay" | -$120.00 | $2,240.51 | Checking | Verizon | Utilities |
| TXN-2405 | 2025-01-22 | Verizon Wireless autopay | -120.00 | 2,240.51 | Checking | verizon | Utilities |
| TXN-2406 | 01-25-2025 | Stripe Payout — invoice #1077 | +$3,450.00 | $5,690.51 | Checking | Stripe | Income |
| TXN-2407 | 1/27/25 | Office Lease - Suite 204 | -1500.00 | $4,190.51 | Checking | Acme Realty | Rent |
| TXN-2408 | 02/01/2025 | Wire — Acme Realty Mgmt | -$1,500.00 | $2,690.51 | Checking | acme realty | Rent |
| TXN-2409 | 2025-02-03 | Adobe Creative Cloud annual | - $599.88 | $2,090.63 | Credit Card | Adobe Inc. | Software |
| TXN-2410 | 02/03/2025 | ADOBE CREATIVE CLOUD ANN | -599.88 | 2090.63 | Credit Card | adobe | Software |
| TXN-2411 | Feb 5 2025 | FedEx — overnight to client A | -$32.50 | $2,058.13 | Checking | FedEx | Shipping |
| TXN-2412 | 02/07/2025 | Square fee — invoice #1078 | -$3.20 | $2,054.93 | Checking | Square | Fees |
| TXN-2413 | 02/10/2025 | Stripe Payout invoice #1079 | + $1,200.00 | $3,254.93 | Checking | Stripe | Income |
| TXN-2414 | 2025-02-12 | USPS PRIORITY — to vendor B | -12.40 | $3,242.53 | Checking | USPS | Shipping |
| TXN-2415 | 02/14/2025 | Zoom Video Comms — annual | -$149.90 | $3,092.63 | Credit Card | Zoom | Software |
| TXN-2416 | 2/14/25 | Zoom Video Communications | -149.90 | 3092.63 | Credit Card | zoom | Software |
| TXN-2417 | 02/18/2025 | Costco Whse #421 — supplies | -$237.84 | $2,854.79 | Checking | Costco | Office Supplies |
| TXN-2418 | 2025-02-18 | COSTCO WHSE #421 | -237.84 | 2,854.79 | Checking | costco | Office Supplies |
| TXN-2419 | 02/22/2025 | Bank fee — int'l wire | -$45.00 | $2,809.79 | Checking | Bank Fee | Fees |
| TXN-2420 | 02/24/2025 | Stripe Payout — invoice #1080 | +$2,100.00 | $4,909.79 | Checking | Stripe | Income |
| TXN-2421 | 02/28/2025 | Refund — overcharge | +$45.00 | $4,954.79 | Checking | — | Refunds |
| TXN-2422 | Feb 28 2025 | REFUND OVERCHARGE | 45.00 | 4954.79 | Checking | N/A | Refunds |
| TXN-2423 | 03/01/2025 | Office Lease — Suite 204 | -$1,500.00 | $3,454.79 | Checking | Acme Realty | Rent |
| TXN-2424 | 2025-03-03 | Slack Technologies — annual | -$840.00 | $2,614.79 | Credit Card | Slack | Software |
| TXN-2425 | 03/05/2025 | Stripe Payout — invoice #1081 | +$1,875.00 | $4,489.79 | Checking | Stripe | Income |
| TXN-2426 | 03/08/2025 | Wire — Berlin office rent (EUR vendor) | -€1.450,00 | $2,989.79 | Checking | Mietverwaltung GmbH | Rent |
| TXN-2427 | 03/10/2025 | London supplier invoice (GBP) | -£950.00 | $1,939.79 | Checking | Stationery Co Ltd | Office Supplies |
| TXN-2428 | 03/12/2025 | São Paulo agency retainer | -R$ 1.299,90 | $1,679.79 | Credit Card | Estúdio Ágil | Software |
| TXN-2429 | 03/14/2025 | VAT MOSS prep — multi-EU sales | ($89.00) | $1,768.79 | Checking | EU VAT Service | Fees |
| TXN-2430 | 03/14/2025 | VAT MOSS prep multi EU sales | -89.00 | 1,768.79 | Checking | eu vat service | Fees |