datatools-dev/samples/demo/bookkeeper_bank_reconcile.csv
Michael 966af8ef94 feat: 3 new tools, format streaming, distribution-ready demo + landing pages
Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.
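
A "soft" dependency graph with JSON save/load could look roughly like the sketch below. It is illustrative only: the tool names, the `RECOMMENDED_BEFORE` table, and the JSON layout are assumptions, not the contents of src/core/pipeline.py.

```python
import json

# Hypothetical recommendations: tool -> tools that ideally run before it.
RECOMMENDED_BEFORE = {
    "column_mapper": [],
    "missing_values": ["column_mapper"],
    "format_standardizer": ["column_mapper"],
}


def check_order(steps):
    """Return warnings (never errors) for steps placed before a recommended predecessor."""
    warnings = []
    seen = set()
    for step in steps:
        for dep in RECOMMENDED_BEFORE.get(step, []):
            # Only warn when the predecessor is in the pipeline but has not run yet.
            if dep in steps and dep not in seen:
                warnings.append(f"{step} usually runs after {dep}")
        seen.add(step)
    return warnings


def save_pipeline(steps):
    """Serialise the step list so the same cleanup can be repeated later."""
    return json.dumps({"version": 1, "steps": steps}, indent=2)


def load_pipeline(text):
    """Read back a step list written by save_pipeline."""
    return json.loads(text)["steps"]
```

Because the ordering is recommended rather than enforced, `check_order` only reports warnings; the caller is free to run the steps anyway.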

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs
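
To illustrate what automatic decimal-separator detection with multi-character currency prefixes might involve, here is a minimal sketch. The function name `parse_amount`, the symbol list, and the tie-breaking rule for a lone comma are assumptions for illustration; the commit does not show the actual logic behind `currency_decimal="auto"`.

```python
import re
from decimal import Decimal
from functools import lru_cache

# Multi-character prefixes are listed first so "R$" is consumed as one symbol.
_SYMBOLS = re.compile(r"R\$|kr|zł|[$€£]")


@lru_cache(maxsize=4096)  # repeated strings are parsed only once
def parse_amount(text):
    """Parse a currency string, guessing whether ',' or '.' is the decimal mark."""
    s = text.strip()
    # A leading minus or accounting-style parentheses both mean "negative".
    negative = s.startswith("-") or (s.startswith("(") and s.endswith(")"))
    s = _SYMBOLS.sub("", s)
    s = re.sub(r"[^\d.,]", "", s)  # drop signs, spaces, parentheses
    dot, comma = s.rfind("."), s.rfind(",")
    if comma > dot:
        digits_after = len(s) - comma - 1
        if dot == -1 and digits_after == 3:
            # Assumed rule: a lone comma followed by three digits is grouping ("1,200").
            s = s.replace(",", "")
        else:
            # Comma is the rightmost separator: treat it as the decimal mark.
            s = s.replace(".", "").replace(",", ".")
    else:
        # Dot is the rightmost separator (or there is none): commas are grouping.
        s = s.replace(",", "")
    value = Decimal(s)
    return -value if negative else value
```

The heuristic is deliberately simple: a value such as "1.450" with no comma is read as one point four five, so genuinely ambiguous inputs still need a per-row locale hint such as the country column.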

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"
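
The strict-greater threshold rule can be sketched as follows. This uses plain lists rather than the project's engine, and the helper names are hypothetical; it only shows the boundary behaviour, where a row or column sitting exactly on the threshold is kept.

```python
def missing_fraction(values):
    """Fraction of entries that are None or empty (values must be non-empty)."""
    return sum(v is None or v == "" for v in values) / len(values)


def drop_sparse_rows(rows, threshold):
    """Drop a row only when its missing fraction is strictly greater than threshold."""
    return [row for row in rows if not missing_fraction(row) > threshold]


def drop_sparse_columns(rows, threshold):
    """Apply the same strict-greater rule column-wise."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if not missing_fraction(col) > threshold]
    return [[row[i] for i in keep] for row in rows]
```

With a threshold of 0.5, a row that is exactly half empty survives, while a fully empty row is dropped.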

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:31:26 +00:00

Txn ID,Date, Description,Amount,Balance,Account,Vendor,Category
TXN-2401,01/15/2025, AMAZON.COM*4F2X9 PURCHASE,-$129.99,"$2,450.01",Checking,Amazon,Office Supplies
TXN-2402,2025-01-15,AMAZON.COM*4F2X9 PURCHASE,-$129.99,2450.01,Checking,amazon.com,Office Supplies
TXN-2403,Jan 18 2025,"STAPLES #4422 — paper, toner",($89.50),$2360.51,Checking,STAPLES,Office Supplies
TXN-2404,01/22/2025,"Verizon Wireless ""autopay""",-$120.00,"$2,240.51",Checking,Verizon,Utilities
TXN-2405,2025-01-22,Verizon Wireless autopay,-120.00,"2,240.51",Checking,verizon,Utilities
TXN-2406,01-25-2025,Stripe Payout — invoice #1077,"+$3,450.00","$5,690.51",Checking,Stripe,Income
TXN-2407,1/27/25,Office Lease - Suite 204,-1500.00,"$4,190.51",Checking,Acme Realty,Rent
TXN-2408,02/01/2025,Wire — Acme Realty Mgmt,"-$1,500.00","$2,690.51",Checking,acme realty,Rent
TXN-2409,2025-02-03,Adobe Creative Cloud annual,- $599.88,"$2,090.63",Credit Card,Adobe Inc.,Software
TXN-2410,02/03/2025,ADOBE CREATIVE CLOUD ANN,-599.88,2090.63,Credit Card,adobe,Software
TXN-2411,Feb 5 2025,FedEx — overnight to client A,-$32.50,"$2,058.13",Checking,FedEx,Shipping
TXN-2412,02/07/2025,Square fee — invoice #1078,-$3.20,"$2,054.93",Checking,Square,Fees
TXN-2413,02/10/2025,Stripe Payout invoice #1079,"+ $1,200.00","$3,254.93",Checking,Stripe,Income
TXN-2414,2025-02-12,USPS PRIORITY — to vendor B,-12.40,"$3,242.53",Checking,USPS,Shipping
TXN-2415,02/14/2025,Zoom Video Comms — annual,-$149.90,"$3,092.63",Credit Card,Zoom,Software
TXN-2416,2/14/25,Zoom Video Communications,-149.90,3092.63,Credit Card,zoom,Software
TXN-2417,02/18/2025,Costco Whse #421 — supplies,-$237.84,"$2,854.79",Checking,Costco,Office Supplies
TXN-2418,2025-02-18,COSTCO WHSE #421,-237.84,"2,854.79",Checking,costco,Office Supplies
TXN-2419,02/22/2025,Bank fee — int'l wire,-$45.00,"$2,809.79",Checking,Bank Fee,Fees
TXN-2420,02/24/2025,Stripe Payout — invoice #1080,"+$2,100.00","$4,909.79",Checking,Stripe,Income
TXN-2421,02/28/2025, Refund — overcharge ,+$45.00,"$4,954.79",Checking,,Refunds
TXN-2422,Feb 28 2025,REFUND OVERCHARGE,45.00,4954.79,Checking,N/A,Refunds
TXN-2423,03/01/2025,Office Lease — Suite 204,"-$1,500.00","$3,454.79",Checking,Acme Realty,Rent
TXN-2424,2025-03-03,Slack Technologies — annual,-$840.00,"$2,614.79",Credit Card,Slack,Software
TXN-2425,03/05/2025,Stripe Payout — invoice #1081,"+$1,875.00","$4,489.79",Checking,Stripe,Income
TXN-2426,03/08/2025,Wire — Berlin office rent (EUR vendor),"-€1.450,00","$2,989.79",Checking,Mietverwaltung GmbH,Rent
TXN-2427,03/10/2025,London supplier invoice (GBP),-£950.00,"$1,939.79",Checking,Stationery Co Ltd,Office Supplies
TXN-2428,03/12/2025,São Paulo agency retainer,"-R$ 1.299,90","$1,679.79",Credit Card,Estúdio Ágil,Software
TXN-2429,03/14/2025,VAT MOSS prep — multi-EU sales,($89.00),"$1,768.79",Checking,EU VAT Service,Fees
TXN-2430,03/14/2025,VAT MOSS prep multi EU sales,-89.00,"1,768.79",Checking,eu vat service,Fees
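
The Date column above mixes several layouts (01/15/2025, 2025-01-15, Jan 18 2025, 01-25-2025, 1/27/25). As a rough sketch of how such values could be normalised to ISO 8601, the snippet below tries a fixed list of formats in order. The function name and the format list are assumptions made for this sample, including the assumption that slash and dash dates are month-first; this is not the Format Standardizer's actual date logic.

```python
from datetime import datetime

# Candidate layouts seen in the sample, tried in order (month-first assumed).
_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m-%d-%Y", "%m/%d/%y", "%b %d %Y")


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known layout matches."""
    s = text.strip()
    for fmt in _FORMATS:
        try:
            return datetime.strptime(s, fmt).date().isoformat()
        except ValueError:
            continue  # try the next layout
    return None
```

Month names such as "Jan" are matched using the current locale, so the `%b` format assumes an English locale. Returning None for unrecognised values leaves the decision about what to do with them to the caller.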