feat: 3 new tools, format streaming, distribution-ready demo + landing pages

Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
     with soft tool-dependency graph (recommended, not enforced) and
     JSON save/load for repeatable weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 /
  lying-BOM) via tied-confidence arbiter + Cyrillic / EE-Latin coverage
  probes.

Distribution-readiness assets:
  • streamlit_app.py: Streamlit Community Cloud entry shim
  • src/gui/app_demo.py: single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/: 3 niche datasets + pre-tuned pipeline JSONs
  • landing/: 4 static HTML pages (apex chooser + 3 niche), shared CSS,
    deploy.py URL-substitution script, auto-generated robots.txt +
    sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md,
    NEXT-STEPS.md: full strategy + measurement + deployment + master
    checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0 xfailed

Tier-1 corpora added:
  • missing-corpus         3 use cases + 16 edge cases
  • column-mapper-corpus   3 use cases + 5 edge cases
  • format-cleaner intl    20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:31:26 +00:00
parent d18b95880d
commit 966af8ef94
89 changed files with 12039 additions and 284 deletions
--- a/docs/DEMO-PLAN.md
+++ b/docs/DEMO-PLAN.md
@@ -0,0 +1,332 @@
+# Demo Plan — DataTools
+
+> Creator-only. Implements PLAN.md §2.2 (the demo IS the product) and
+> §2.3 (niche down — three landing pages, one engine).
+> **Version**: 1.0 · **Adopted**: 2026-05-01 · **Owner**: Michael
+
+The hosted demo is the single highest-leverage marketing asset in the
+plan. This document defines exactly what loads, in what order, with
+what data, for which buyer — so the operator builds it once and never
+rebuilds it from a stale headline.
+
+## 1. Goals
+
+- Convert a cold visitor to a paid buyer in **under three minutes** of
+  active interaction.
+- Demonstrate the *full pipeline* (not one tool) on a dataset that
+  *looks like the visitor's own work* — not a toy CSV.
+- Survive with zero maintenance attention: once running, the demo
+  should keep working as the engine evolves (the pre-saved pipeline
+  JSONs use the same code path the paid product uses).
+- Provide a shareable artifact for niche-community posts (a public URL
+  the operator can drop into a subreddit reply with one sentence).
+
+## 2. Constraints (non-negotiable)
+
+| Constraint | Source | Implication |
+|---|---|---|
+| Free hosting at launch | BUSINESS.md §9 | Streamlit Community Cloud (1 GB RAM, sleeps after 7 days idle) |
+| No login | BUSINESS.md §7 | No email gate, no signup wall, no "create account to continue" |
+| Async / no-touch | DECISIONS.md §1 #8 | Cannot offer "schedule a demo with us" CTA |
+| Runs locally on paid product | BUSINESS.md §11 | Demo can't expose the same engine to abuse — needs row caps |
+| Friction kills conversion | BUSINESS.md §7 | Demo dataset preloaded; no "select a file" first-step |
+| < $1,200/mo recurring | BUSINESS.md §9 | Migration plan to $5/mo VPS only after rate-limit signal |
+
+## 3. The three personas (per PLAN.md §2.3)
+
+| Tag | Persona | Top-of-funnel keyword | Demo dataset | Pre-saved pipeline |
+|---|---|---|---|---|
+| `shopify-pet` | Shopify operator (priority: pet supplies) | "shopify customer cleanup" | `samples/demo/shopify_pet_customers.csv` | `shopify_pet_pipeline.json` |
+| `bookkeeper`   | Bookkeeper / freelance accountant | "reconcile bank export csv" | `samples/demo/bookkeeper_bank_reconcile.csv` | `bookkeeper_bank_pipeline.json` |
+| `revops`       | Marketing / RevOps agency | "dedupe lead list across vendors" | `samples/demo/agency_combined_leads.csv` | `agency_leads_pipeline.json` |
+
+Each persona gets its **own landing page URL**, its **own demo dataset
+loaded by default**, and its **own H1 + below-the-fold copy.** The
+engine is identical; only positioning differs.
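The persona table above maps naturally onto a small lookup keyed by the `?p=<persona>` query parameter. The following is a minimal sketch, not the shipped routing code: the `Persona` dataclass, `persona_from_url`, and the choice of `shopify-pet` as the fallback are assumptions for illustration; only the tags and file names come from the table.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs, urlparse


@dataclass(frozen=True)
class Persona:
    tag: str
    dataset: str   # demo CSV path, copied from the persona table
    pipeline: str  # pre-saved pipeline JSON, copied from the persona table


# Hypothetical registry; the values mirror the table, the structure is assumed.
PERSONAS = {
    p.tag: p
    for p in (
        Persona("shopify-pet", "samples/demo/shopify_pet_customers.csv",
                "shopify_pet_pipeline.json"),
        Persona("bookkeeper", "samples/demo/bookkeeper_bank_reconcile.csv",
                "bookkeeper_bank_pipeline.json"),
        Persona("revops", "samples/demo/agency_combined_leads.csv",
                "agency_leads_pipeline.json"),
    )
}
DEFAULT_TAG = "shopify-pet"  # assumption: the priority persona is the fallback


def persona_from_url(url: str) -> Persona:
    """Resolve ?p=<tag>; unknown or missing tags fall back to the default."""
    params = parse_qs(urlparse(url).query)
    tag = params.get("p", [DEFAULT_TAG])[0]
    return PERSONAS.get(tag, PERSONAS[DEFAULT_TAG])
```

Falling back to a default rather than raising keeps the "visitor never sees an empty state" property even when a community post mangles the link.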
+
+## 4. Demo dataset specifications
+
+Each dataset is intentionally small (20–30 rows) so the full pipeline
+runs in well under one second on Streamlit Community Cloud's free
+hardware. Each row is a *plausible-looking* export from that
+persona's tooling. Each contains every kind of pollution the bundle's
+five tools fix, so a single demo run shows every tool earning its
+keep.
+
+### 4.0 Pain-point coverage map
+
+Each demo dataset is engineered so the buyer sees their **own top
+pain** demonstrated in the AFTER preview. The mapping below pairs
+each pain from PLAN.md §2.3a with the rows / columns that exercise
+it. Refresh the dataset only when this coverage drops.
+
+| Persona | Pain (from PLAN §2.3a) | Demo coverage |
+|---|---|---|
+| Shopify pet | S1 — Klaviyo per-contact dupes | 5 dup pairs across rows 1–15 (case + format + address-twin variants) |
+| Shopify pet | S2 — feed-rejection chars | smart-quote / NBSP / BOM in rows 1–6, 9, 11 |
+| Shopify pet | S3 — multi-channel | partner-style customer IDs (`SHOP-`); demonstration of column-level mapping covered in RevOps demo |
+| Shopify pet | S4 — subscription identity | rows 1+2, 7+8, 9+10 — same person, different format |
+| Shopify pet | S5 — VAT-MOSS country drift | rows 16–18 (`United Kingdom` / `U.K.` / `UK`) + rows 19–20 (`Germany`/`Italia`) |
+| Bookkeeper | B1 — month-overlap re-import | 7 dup pairs spanning Jan↔Feb and Mar boundaries |
+| Bookkeeper | B2 — 1099 vendor consolidation | Amazon × 3 spellings, Verizon × 2, Acme Realty × 2, Adobe × 2, Costco × 2, Zoom × 2, Stripe × 4 |
+| Bookkeeper | B3 — audit trail | every cell change in the run logged with old/new/rule — surface in the demo's audit tab |
+| Bookkeeper | B4 — per-license economics | demonstrated by pricing copy, not data |
+| Bookkeeper | B5 — multi-currency | rows 26 (EUR), 27 (GBP), 28 (BRL with comma decimal), 29 (parens-negative) |
+| RevOps | R1 — per-contact tier | 6 cross-source dup pairs (HubSpot × LinkedIn × Manual Scrape) |
+| RevOps | R2 — deliverability | rows 26–27 (`uma at uniform dot com`, `victor@@victorco.com` invalid emails) |
+| RevOps | R3 — GDPR / privacy | demonstrated by the network-tab moat panel + zero-upload claim |
+| RevOps | R4 — vendor unification | 3 source values (HubSpot / LinkedIn / Manual Scrape), 13 country codes, mixed-shape headers |
+| RevOps | R5 — suppression list | rows 29–30 (`Suppressed`, `Opted Out` tags) |
+
+### 4.1 `shopify_pet_customers.csv` (20 rows)
+
+**Looks like**: a Shopify customer export filtered for "Pet Supplies"
+sales channel, 12 months activity.
+
+**Pollution included**:
+- Whitespace padding ("  Alice  ", "Sydney Opera House Drive ")
+- Mixed phone formats: `(415) 555-1234`, `415.555.1234`, `5559876543`,
+  `+1 555-111-1111`
+- International phones: GB, ES, DE, AU, JP (15 demo rows span 6
+  countries)
+- Currency variants: `$1,240.50`, `£890.25`, `€2.410,75` (EU comma
+  decimal), `A$ 1,299.00`, `¥75000`
+- Date formats: `2025-12-04`, `12/15/2025`, `?`, `(blank)`, `(none)`,
+  `#N/A`
+- Disguised nulls: `N/A`, blank, `(blank)`, `?`, `#N/A`, `(none)`,
+  `unknown`
+- Name casing: `EVE MARTINEZ`, `henry`, `O'NEIL`, `noah`, mixed Title /
+  ALL CAPS / lower
+- Email case variants that *should* dedup: `Bob@PetShop.com` vs
+  `alice@petshop.com`
+- 5 fuzzy duplicate pairs, including Alice/Bob same address,
+  Grace/Henry same phone, Carlos/Olivia same address, and Ivy/Jack
+  same address
+
+**After running the pipeline**: 20 rows → 15, ~29 cells canonicalized,
+~45 sentinels standardized, 5 cross-row duplicates merged. The
+customer table is now Klaviyo-import-ready and the country column
+(previously `UK` / `U.K.` / `United Kingdom` / `Germany` / `Italia`)
+is GB / DE / IT, so the VAT-MOSS report won't break.
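The "sentinels standardized" count comes from collapsing disguised nulls into a real missing value. A minimal sketch of that idea, assuming a simple case-insensitive set lookup (the function name `normalize_null` and the lookup approach are illustrative, not the engine's actual implementation; the sentinel spellings are the ones listed in the pollution inventory):

```python
from typing import Optional

# Sentinel spellings from the pollution list, stored lowercased.
DISGUISED_NULLS = {"", "n/a", "(blank)", "?", "#n/a", "(none)", "unknown"}


def normalize_null(cell: Optional[str]) -> Optional[str]:
    """Return None for disguised nulls, otherwise the whitespace-trimmed value."""
    if cell is None:
        return None
    trimmed = cell.strip()
    return None if trimmed.lower() in DISGUISED_NULLS else trimmed


# Illustrative row mixing padded text, sentinels, and a real date.
row = ["  Alice  ", "N/A", "(blank)", "2025-12-04", "?"]
cleaned = [normalize_null(c) for c in row]
```

Trimming before the lookup means padded sentinels such as `" N/A "` are caught by the same rule that strips whitespace from real values.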
+
+### 4.2 `bookkeeper_bank_reconcile.csv` (30 rows)
+
+**Looks like**: two months of business checking + credit-card activity
+exported from a bank portal, with the Feb export accidentally
+overlapping the Jan export at the month boundary.
+
+**Pollution included**:
+- Mixed date formats: `01/15/2025`, `2025-01-15`, `Jan 18 2025`,
+  `1/27/25`, `Feb 5 2025`
+- Currency formats: `-$129.99`, `($89.50)` parens-negative,
+  `+$3,450.00`, `- $599.88` space, bare `-129.99`, `(50.00)`
+- Header trailing whitespace: `"Date "`
+- Smart quotes around descriptions: `"autopay"`
+- Em-dash sentinels in Vendor: `—`
+- Smart-em-dash inside descriptions: `STAPLES #4422 — paper, toner`
+- Vendor casing inconsistency: `Amazon` / `amazon.com` / `AMAZON.COM`,
+  `Verizon` / `verizon`
+- 7 duplicate transactions (same date+amount+vendor recorded twice
+  with different formats)
+
+**After running the pipeline**: 30 rows → 23, ~84 cells normalized, 7
+duplicates removed (month-overlap re-imports). All dates
+ISO-formatted, all amounts numeric (including EUR/GBP/BRL with comma
+decimal), vendor casing canonical, parens-negative resolved.
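To make "parens-negative resolved" concrete, here is a minimal sketch of parsing the US-style amount strings listed in the pollution inventory. It is an illustration only: `parse_amount` and its regex are hypothetical, and it deliberately does not cover the EU comma-decimal case that the real engine handles via `currency_decimal="auto"`.

```python
import re
from decimal import Decimal

# Optional sign, optional "$", then digits with thousands commas and
# an optional decimal part. Whitespace may separate sign and symbol.
_AMOUNT_RE = re.compile(r"^(?P<sign>[+-]?)\s*\$?\s*(?P<digits>[\d,]+(?:\.\d+)?)$")


def parse_amount(text: str) -> Decimal:
    """Parse strings like '-$129.99', '($89.50)', '+$3,450.00', '- $599.88'."""
    s = text.strip()
    negative = False
    # Accounting convention: a value wrapped in parentheses is negative.
    if s.startswith("(") and s.endswith(")"):
        negative = True
        s = s[1:-1].strip()
    m = _AMOUNT_RE.match(s)
    if m is None:
        raise ValueError(f"unrecognized amount: {text!r}")
    if m.group("sign") == "-":
        negative = True
    value = Decimal(m.group("digits").replace(",", ""))
    return -value if negative else value
```

`Decimal` is used instead of `float` so that reconciled totals do not pick up binary rounding error.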
+
+### 4.3 `agency_combined_leads.csv` (30 rows)
+
+**Looks like**: a marketing-ops worksheet combining lead exports from
+HubSpot + LinkedIn Sales Navigator + manual scraping, ready for
+campaign targeting.
+
+**Pollution included**:
+- Phone formats per region: US, UK, Spain, Germany, China, India,
+  Australia, Mexico, Israel, Singapore, Hong Kong, Italy, South
+  Korea — 13 country codes
+- Country column inconsistent: `USA` / `US` / `United States`
+- Disguised nulls: `N/A`, `unknown`, `(unknown)`, `(blank)`, `(none)`,
+  `?`, `—`, `#N/A`, `TBD`
+- Source column tags origin (`HubSpot` / `LinkedIn` / `Manual Scrape`)
+- Email duplicates across sources with case variants: `alice@acme.com`
+  + `Alice.Johnson@acme.com`, `bob@beta.com` + `Bob@Beta.com`,
+  `diana@delta.com` from two sources, `carlos@gamma.io` from two
+  sources, `Frank@Foxtrot.de` + `frank@foxtrot.de`
+- Name casing: `DIANA LEE`, `henry`, `IVY CHEN`, mixed
+- 6 fuzzy / cross-source duplicate pairs for the dedup step to catch
+- Score column with sentinel pollution that needs coercion to integer
+
+**After running the pipeline**: 30 rows → 24, ~43 cells canonicalized,
+14 sentinels resolved, 6 cross-source duplicates merged with `merge=true`
+so each survivor inherits the most-complete picture. Invalid-email
+rows (deliverability stress) and `Suppressed`/`Opted Out` tags
+(suppression-list use case) survive as flagged rows the operator
+manually reviews.
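The "survivor inherits the most-complete picture" behaviour can be sketched as a gap-filling merge. This is a simplified illustration, assuming exact matching on a lowercased `email` column; the engine's fuzzy matching and its `merge=true` option are not reproduced here, and `merge_by_email` plus the sample rows are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, str]


def merge_by_email(rows: List[Row]) -> List[Row]:
    """Collapse rows sharing a case-insensitive email.

    The first row seen for an email survives; empty fields on the
    survivor are filled from later duplicates.
    """
    survivors: Dict[str, Row] = {}
    unkeyed: List[Row] = []  # rows with no email cannot be matched
    for row in rows:
        key = row.get("email", "").strip().lower()
        if not key:
            unkeyed.append(dict(row))
            continue
        if key not in survivors:
            survivors[key] = dict(row)
            continue
        kept = survivors[key]
        for field, value in row.items():
            if not kept.get(field, "").strip() and value.strip():
                kept[field] = value
    return list(survivors.values()) + unkeyed


# Illustrative rows, not taken from the demo dataset.
leads = [
    {"email": "bob@beta.com", "name": "Bob", "phone": "", "source": "HubSpot"},
    {"email": "Bob@Beta.com", "name": "", "phone": "+1 555-111-1111", "source": "LinkedIn"},
    {"email": "diana@delta.com", "name": "DIANA LEE", "phone": "", "source": "Manual Scrape"},
]
merged = merge_by_email(leads)
```

Here the HubSpot row for Bob survives and picks up the phone number that only the LinkedIn row carried, while its own `source` value is left untouched.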
+
+## 5. UX flow (per persona)
+
+The demo is a single Streamlit page (likely
+`src/gui/pages/0_Review.py` repurposed for demo mode, or a
+dedicated `app_demo.py` for the cloud build).
+
+```
+┌──────────────────────────────────────────────────────────┐
+│  DataTools — for {Persona}                               │
+│  "{Persona-specific H1}"                                 │
+├──────────────────────────────────────────────────────────┤
+│                                                          │
+│  Sample dataset preloaded:  shopify_pet_customers.csv    │
+│  [Replace with your own file (capped 100 rows)]          │
+│                                                          │
+│  ┌─ BEFORE preview (15 rows) ─────────────────────────┐  │
+│  │ Alice  | (415) 555-1234 | $1,240.50 | …          │  │
+│  │ Bob    | 415.555.1234   | $1,240.50 | …          │  │
+│  │ ...                                              │  │
+│  └──────────────────────────────────────────────────┘  │
+│                                                          │
+│  Pipeline (saved):                                       │
+│  1. Text Clean    →  2. Format Standardize    →          │
+│  3. Missing       →  4. Deduplicate                      │
+│                                                          │
+│  [▶ Run pipeline]                                        │
+│                                                          │
+│  ┌─ AFTER preview ───────────────────────────────────┐  │
+│  │ 20 rows → 15 (5 duplicates merged)                │  │
+│  │ 29 cells canonicalized · 45 sentinels resolved    │  │
+│  │                                                    │  │
+│  │ Alice Johnson  | +14155551234 | 1240.50 | …       │  │
+│  │ ...                                                │  │
+│  └──────────────────────────────────────────────────┘  │
+│                                                          │
+│  [Download cleaned CSV (sample, watermarked)]            │
+│                                                          │
+│  ┌──────────────────────────────────────────────────┐  │
+│  │  Like what you see?                              │  │
+│  │  Run this on YOUR 50,000-row export — locally.   │  │
+│  │  No upload. Your data never leaves your machine. │  │
+│  │  [Get DataTools — $49 →]                         │  │
+│  └──────────────────────────────────────────────────┘  │
+└──────────────────────────────────────────────────────────┘
+```
+
+**Critical UX points**:
+- Sample dataset is *already loaded* on page paint. Visitor never
+  sees an empty state.
+- BEFORE table is shown side-by-side with AFTER once the run
+  completes. Hidden-character toggle on by default so the visitor
+  *sees* what was hidden in their data.
+- "Replace with your own file" is a secondary action below the BEFORE
+  table — not the headline.
+- Per-step metrics are shown in the AFTER block: "29 cells
+  canonicalized, 45 sentinels resolved, 5 duplicates merged." Numbers
+  sell more than narrative.
+- Buy button is **inside** the AFTER block and **above the fold** when
+  the run completes. Friction kills.
+
+## 6. Free vs paid boundary
+
+The demo runs the **same code** as the paid product. Caps are surface,
+not engine.
+
+| Limit | Free demo | Paid (downloaded) |
+|---|---|---|
+| Input rows | 100 | unlimited (1 GB+ via streaming) |
+| File size | 5 MB | unlimited |
+| Output | watermarked CSV ("DataTools demo — buy at <url>" appended as last row) | clean CSV |
+| Pipeline editor | locked to the persona-saved pipeline | full edit / save / load JSON |
+| Save pipeline JSON | disabled | enabled |
+| International | enabled | enabled |
+| Audit log download | disabled | enabled |
+| Tool 06–09 | as they ship | as they ship |
+
+The watermark is a **single trailing row**, not an in-cell tag — so
+the demo's AFTER preview *visibly* reads as production-quality data,
+not "demo crippled" data.
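Because the caps are applied at the surface rather than in the engine, they can be sketched as two small wrappers around the output. The sketch below is an assumption about how that could look, using only the standard library; `cap_rows`, `to_watermarked_csv`, and the exact watermark wording are placeholders, while the 100-row cap and single trailing row come from the table above.

```python
import csv
import io
from typing import List, Sequence

DEMO_ROW_CAP = 100
# Placeholder wording; the real text and URL are set by the demo build.
WATERMARK = "DataTools demo: buy at <url>"


def cap_rows(rows: Sequence[List[str]], cap: int = DEMO_ROW_CAP) -> List[List[str]]:
    """Keep only the first `cap` data rows of an uploaded file."""
    return list(rows[:cap])


def to_watermarked_csv(header: List[str], rows: Sequence[List[str]]) -> str:
    """Serialize rows unchanged, then append one trailing watermark row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    # Pad the watermark row so it has the same column count as the data.
    writer.writerow([WATERMARK] + [""] * (len(header) - 1))
    return buf.getvalue()
```

Keeping the watermark out of the data cells means the rows shown in the AFTER preview are byte-for-byte what the paid product would emit.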
+
+## 7. CTA copy (per persona)
+
+### 7.1 Shopify pet operator
+
+- **H1**: *Clean your customer / vendor / subscriber exports — locally.*
+- **Sub**: *Klaviyo-import-ready in 30 seconds. Catches duplicates Excel
+  misses. Your data never leaves your computer.*
+- **CTA**: *Get DataTools for Shopify — $49 →*
+
+### 7.2 Bookkeeper / freelance accountant
+
+- **H1**: *Reconcile messy bank exports. Hand your client an audit
+  trail.*
+- **Sub**: *Catches the duplicate transaction QuickBooks imported twice.
+  Standardizes dates, amounts, vendor casing. Every change auditable.*
+- **CTA**: *Get DataTools for Bookkeepers — $49 →*
+
+### 7.3 Marketing / RevOps agency
+
+- **H1**: *Dedupe leads across HubSpot, LinkedIn, and manual scrapes.*
+- **Sub**: *International phones, country normalization, fuzzy dedup
+  with merge — one tool, one schema, no upload.*
+- **CTA**: *Get DataTools for RevOps — $49 →*
+
+## 8. Telemetry / conversion tracking
+
+Async + no-touch + free hosting limits what we can instrument. Use
+event-only counters, no PII:
+
+| Event | Source | Aggregate-only field |
+|---|---|---|
+| `demo.page_view` | landing page | persona tag |
+| `demo.run_clicked` | demo page | persona tag |
+| `demo.run_completed` | demo page | persona tag, rows_processed |
+| `demo.cta_clicked` | demo page | persona tag |
+| `gumroad.purchase` | Gumroad webhook | landing-page-source query param (`?from=shopify-pet`) |
+
+Conversion = `cta_clicked / run_completed`. Demo-quality issue surfaces
+when `run_completed / page_view` < 30 % (visitors not engaging).
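The two ratios can be computed directly from the aggregate counters. A minimal sketch, assuming the counters arrive as a plain dict keyed by the event names in the table (the function name `funnel_ratios` and the return shape are illustrative):

```python
from typing import Dict

ENGAGEMENT_FLOOR = 0.30  # run_completed / page_view threshold from the text


def funnel_ratios(counts: Dict[str, int]) -> Dict[str, object]:
    """Compute engagement and conversion from aggregate event counts."""
    views = counts.get("demo.page_view", 0)
    completed = counts.get("demo.run_completed", 0)
    clicks = counts.get("demo.cta_clicked", 0)
    # Guard against division by zero on a cold start with no traffic.
    engagement = completed / views if views else 0.0
    conversion = clicks / completed if completed else 0.0
    return {
        "engagement": engagement,
        "conversion": conversion,
        "engagement_alert": views > 0 and engagement < ENGAGEMENT_FLOOR,
    }


# Illustrative counts, not measured data.
week = funnel_ratios(
    {"demo.page_view": 200, "demo.run_completed": 50, "demo.cta_clicked": 5}
)
```

With these illustrative counts, engagement is 50 / 200 = 0.25, which is under the 30 % floor, and conversion is 5 / 50 = 0.10.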
+
+Self-host counters on Cloudflare Pages (free, GDPR-friendly). No
+Google Analytics — adds privacy banner, conflicts with the "your data
+never leaves your computer" message.
+
+## 9. Maintenance plan
+
+**Recurring**: zero. The demo runs on the same engine the paid
+product ships, so any improvement to the engine improves the demo
+automatically. The pre-saved pipeline JSONs reference column names
+and tool names, both stable APIs.
+
+**Triggers for revisit**:
+
+| Trigger | Action |
+|---|---|
+| Streamlit Community Cloud rate-limits / sleeps too aggressively | Migrate to a $5–10/mo VPS (BUSINESS.md §9 contingency) |
+| Demo dataset becomes stale (e.g. all phones standardize to no-op) | Refresh with a new pollution batch — *don't change the persona* |
+| `run_completed / page_view < 30 %` for 4 consecutive weeks | Audit the demo: is the BEFORE preview showing the mess clearly? Is the AFTER too small to notice? |
+| `cta_clicked / run_completed < 5 %` for 4 consecutive weeks | The demo is impressive but the CTA isn't earning trust — revise copy + add a screenshot of the network tab showing zero outbound calls (PLAN.md §2.4) |
+| New tool ships (06–09) | Decide *per persona* whether to add it to that persona's saved pipeline. Not all tools belong on all personas |
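The two ratio-based triggers share one rule: a weekly ratio stays under its threshold for 4 consecutive weeks. A minimal sketch of that check, assuming the weekly ratios are kept as a chronological list (the helper name and sample values are hypothetical):

```python
from typing import Sequence


def breached_for_weeks(weekly_ratios: Sequence[float], threshold: float,
                       weeks: int = 4) -> bool:
    """True when the most recent `weeks` values are all below `threshold`."""
    if len(weekly_ratios) < weeks:
        return False  # not enough history to judge
    return all(r < threshold for r in weekly_ratios[-weeks:])


# Illustrative weekly series, oldest first.
engagement_history = [0.40, 0.28, 0.25, 0.29, 0.20]
conversion_history = [0.06, 0.04, 0.06, 0.04, 0.02]

audit_demo = breached_for_weeks(engagement_history, 0.30)
revise_cta = breached_for_weeks(conversion_history, 0.05)
```

Only the trailing window is inspected, so a single good week resets the trigger, which matches the "consecutive" wording in the table.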
+
+## 10. Build sequence (drops into PLAN.md week 2)
+
+| Day | Action |
+|---|---|
+| 1 | Demo build of Streamlit app: 3 personas, switch via query param `?p=shopify-pet` |
+| 2 | Pipeline JSONs wired in; row cap + watermark applied; download button |
+| 3 | Deploy to Streamlit Community Cloud · 3 sub-paths or 3 separate apps |
+| 4 | Persona landing pages: 3 static HTML pages on Cloudflare Pages, each with iframe embed of its persona demo + CTA |
+| 5 | Telemetry counters wired (Cloudflare event API) · Gumroad webhook captures `?from=` |
+
+End of day 5: three URLs the operator can drop into three different
+niche-community threads, each performing its own conversion math.
+
+## 11. Anti-temptations (things the demo deliberately refuses)
+
+- **No "try it on your data first" gate that requires email.** The
+  whole point is friction-free.
+- **No "schedule a demo" CTA.** Locked by no-touch.
+- **No live chat widget.** Same.
+- **No A/B-test framework yet.** Single-arm copy, ship it, iterate
+  monthly. A/B requires statistical traffic the funnel doesn't have
+  pre-PMF.
+- **No watermark inside cells.** The AFTER preview must look
+  production-quality. Watermark goes on a single trailing row that's
+  obviously the demo signature.
+- **No animation / loader theatrics.** Pipeline runs in <1 s; a
+  fake-progress bar lies about speed.