feat: 3 new tools, format streaming, distribution-ready demo + landing pages

Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:31:26 +00:00
parent d18b95880d
commit 966af8ef94
89 changed files with 12039 additions and 284 deletions

docs/DEMO-PLAN.md Normal file

@@ -0,0 +1,332 @@
# Demo Plan — DataTools
> Creator-only. Implements PLAN.md §2.2 (the demo IS the product) and
> §2.3 (niche down — three landing pages, one engine).
> **Version**: 1.0 · **Adopted**: 2026-05-01 · **Owner**: Michael
The hosted demo is the single highest-leverage marketing asset in the
plan. This document defines exactly what loads, in what order, with
what data, for which buyer — so the operator builds it once and never
rebuilds it from a stale headline.
## 1. Goals
- Convert a cold visitor to a paid buyer in **under three minutes** of
active interaction.
- Demonstrate the *full pipeline* (not one tool) on a dataset that
*looks like the visitor's own work* — not a toy CSV.
- Survive zero attention to maintenance — once running, the demo
should keep working as the engine evolves (the pre-saved pipeline
JSONs use the same code path the paid product uses).
- Provide a shareable artifact for niche-community posts (a public URL
the operator can drop into a subreddit reply with one sentence).
## 2. Constraints (non-negotiable)
| Constraint | Source | Implication |
|---|---|---|
| Free hosting at launch | BUSINESS.md §9 | Streamlit Community Cloud (1 GB RAM, sleeps after 7 days idle) |
| No login | BUSINESS.md §7 | No email gate, no signup wall, no "create account to continue" |
| Async / no-touch | DECISIONS.md §1 #8 | Cannot offer "schedule a demo with us" CTA |
| Runs locally on paid product | BUSINESS.md §11 | Demo can't expose the same engine to abuse — needs row caps |
| Friction kills conversion | BUSINESS.md §7 | Demo dataset preloaded; no "select a file" first-step |
| < $1,200/mo recurring | BUSINESS.md §9 | Migration plan to $5/mo VPS only after rate-limit signal |
## 3. The three personas (per PLAN.md §2.3)
| Tag | Persona | Top-of-funnel keyword | Demo dataset | Pre-saved pipeline |
|---|---|---|---|---|
| `shopify-pet` | Shopify operator (priority: pet supplies) | "shopify customer cleanup" | `samples/demo/shopify_pet_customers.csv` | `shopify_pet_pipeline.json` |
| `bookkeeper` | Bookkeeper / freelance accountant | "reconcile bank export csv" | `samples/demo/bookkeeper_bank_reconcile.csv` | `bookkeeper_bank_pipeline.json` |
| `revops` | Marketing / RevOps agency | "dedupe lead list across vendors" | `samples/demo/agency_combined_leads.csv` | `agency_leads_pipeline.json` |
Each persona gets its **own landing page URL**, its **own demo dataset
loaded by default**, and its **own H1 + below-the-fold copy.** The
engine is identical; only positioning differs.
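
The table above can be read as a lookup from persona tag to dataset and pipeline. A minimal sketch of that lookup, assuming the tag arrives in the `?p=` query parameter that the build sequence mentions; `resolve_persona` and `DEFAULT_PERSONA` are hypothetical names, and the fallback choice is an assumption, not something this plan specifies:

```python
from urllib.parse import parse_qs

# Persona tag -> (demo dataset, pre-saved pipeline), copied from the table above.
PERSONAS = {
    "shopify-pet": ("samples/demo/shopify_pet_customers.csv", "shopify_pet_pipeline.json"),
    "bookkeeper": ("samples/demo/bookkeeper_bank_reconcile.csv", "bookkeeper_bank_pipeline.json"),
    "revops": ("samples/demo/agency_combined_leads.csv", "agency_leads_pipeline.json"),
}
DEFAULT_PERSONA = "shopify-pet"  # assumption: the plan does not name a fallback


def resolve_persona(query_string: str) -> str:
    """Return the persona tag named by ?p=..., or the default for missing/unknown tags."""
    values = parse_qs(query_string.lstrip("?")).get("p", [])
    tag = values[0].strip().lower() if values else ""
    return tag if tag in PERSONAS else DEFAULT_PERSONA


dataset, pipeline = PERSONAS[resolve_persona("?p=bookkeeper")]
```

Falling back to a known persona rather than raising keeps the "visitor never sees an empty state" rule intact when a link is mistyped.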
## 4. Demo dataset specifications
Each dataset is intentionally small (20-30 rows) so the full pipeline
runs in well under one second on Streamlit Community Cloud's free
hardware. Each row is a *plausible-looking* export from that
persona's tooling. Each contains every kind of pollution the bundle's
five tools fix, so a single demo run shows every tool earning its
keep.
### 4.0 Pain-point coverage map
Each demo dataset is engineered so the buyer sees their **own top
pain** demonstrated in the AFTER preview. The mapping below pairs
each pain from PLAN.md §2.3a with the rows / columns that exercise
it. Refresh the dataset only when this coverage drops.
| Persona | Pain (from PLAN §2.3a) | Demo coverage |
|---|---|---|
| Shopify pet | S1 — Klaviyo per-contact dupes | 5 dup pairs across rows 1-15 (case + format + address-twin variants) |
| Shopify pet | S2 — feed-rejection chars | smart-quote / NBSP / BOM in rows 1-6, 9, 11 |
| Shopify pet | S3 — multi-channel | partner-style customer IDs (`SHOP-`); demonstration of column-level mapping covered in RevOps demo |
| Shopify pet | S4 — subscription identity | rows 1+2, 7+8, 9+10 — same person, different format |
| Shopify pet | S5 — VAT-MOSS country drift | rows 16-18 (`United Kingdom` / `U.K.` / `UK`) + rows 19-20 (`Germany`/`Italia`) |
| Bookkeeper | B1 — month-overlap re-import | 7 dup pairs spanning Jan↔Feb and Mar boundaries |
| Bookkeeper | B2 — 1099 vendor consolidation | Amazon × 3 spellings, Verizon × 2, Acme Realty × 2, Adobe × 2, Costco × 2, Zoom × 2, Stripe × 4 |
| Bookkeeper | B3 — audit trail | every cell change in the run logged with old/new/rule — surface in the demo's audit tab |
| Bookkeeper | B4 — per-license economics | demonstrated by pricing copy, not data |
| Bookkeeper | B5 — multi-currency | rows 26 (EUR), 27 (GBP), 28 (BRL with comma decimal), 29 (parens-negative) |
| RevOps | R1 — per-contact tier | 6 cross-source dup pairs (HubSpot × LinkedIn × Manual Scrape) |
| RevOps | R2 — deliverability | rows 26-27 (`uma at uniform dot com`, `victor@@victorco.com` invalid emails) |
| RevOps | R3 — GDPR / privacy | demonstrated by the network-tab moat panel + zero-upload claim |
| RevOps | R4 — vendor unification | 3 source values (HubSpot / LinkedIn / Manual Scrape), 13 country codes, mixed-shape headers |
| RevOps | R5 — suppression list | rows 29-30 (`Suppressed`, `Opted Out` tags) |
### 4.1 `shopify_pet_customers.csv` (20 rows)
**Looks like**: a Shopify customer export filtered for "Pet Supplies"
sales channel, 12 months activity.
**Pollution included**:
- Whitespace padding (" Alice ", "Sydney Opera House Drive ")
- Mixed phone formats: `(415) 555-1234`, `415.555.1234`, `5559876543`,
`+1 555-111-1111`
- International phones: GB, ES, DE, AU, JP (15 demo rows span 6
countries)
- Currency variants: `$1,240.50`, `£890.25`, `€2.410,75` (EU comma
decimal), `A$ 1,299.00`, `¥75000`
- Date formats: `2025-12-04`, `12/15/2025`, `?`, `(blank)`, `(none)`,
`#N/A`
- Disguised nulls: `N/A`, blank, `(blank)`, `?`, `#N/A`, `(none)`,
`unknown`
- Name casing: `EVE MARTINEZ`, `henry`, `O'NEIL`, `noah`, mixed Title /
ALL CAPS / lower
- Email case variants that *should* dedup: `Bob@PetShop.com` vs
`alice@petshop.com`
- 4 fuzzy duplicates (Alice/Bob same address, Grace/Henry same phone,
Carlos/Olivia same address, Ivy/Jack same address)
**After running the pipeline**: 20 rows → 15, ~29 cells canonicalized,
~45 sentinels standardized, 5 cross-row duplicates merged. The
customer table is now Klaviyo-import-ready and the country column
(previously `UK` / `U.K.` / `United Kingdom` / `Germany` / `Italia`)
is GB / DE / IT — VAT MOSS report won't break.
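
The whitespace and disguised-null items in the pollution list lend themselves to a small illustration. This is a sketch only, not the engine's code: `clean_cell` and `DISGUISED_NULLS` are hypothetical names, and the sentinel set is simply the list given above, compared case-insensitively.

```python
# Sentinels from the "Disguised nulls" bullet above, lower-cased for comparison.
DISGUISED_NULLS = {"", "n/a", "(blank)", "?", "#n/a", "(none)", "unknown"}


def clean_cell(value):
    """Trim padding and map disguised nulls to None; other text passes through."""
    if value is None:
        return None
    # Replace non-breaking spaces with ordinary spaces, then trim both ends.
    text = value.replace("\u00a0", " ").strip()
    return None if text.lower() in DISGUISED_NULLS else text


cleaned = [clean_cell(v) for v in [" Alice ", "#N/A", "(none)", "unknown", "  "]]
```

Mapping every sentinel to a single `None` before deduplication means two rows that differ only by `N/A` versus blank compare as equal on that column.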
### 4.2 `bookkeeper_bank_reconcile.csv` (30 rows)
**Looks like**: two months of business checking + credit-card activity
exported from a bank portal, with the Feb export accidentally
overlapping the Jan export at the month boundary.
**Pollution included**:
- Mixed date formats: `01/15/2025`, `2025-01-15`, `Jan 18 2025`,
`1/27/25`, `Feb 5 2025`
- Currency formats: `-$129.99`, `($89.50)` parens-negative,
`+$3,450.00`, `- $599.88` space, bare `-129.99`, `(50.00)`
- Header trailing whitespace: `"Date "`
- Smart quotes around descriptions: `"autopay"`
- Em-dash sentinels in Vendor: `—`
- Smart-em-dash inside descriptions: `STAPLES #4422 — paper, toner`
- Vendor casing inconsistency: `Amazon` / `amazon.com` / `AMAZON.COM`,
`Verizon` / `verizon`
- 7 duplicate transactions (same date+amount+vendor recorded twice
with different formats)
**After running the pipeline**: 30 rows → 23, ~84 cells normalized, 7
duplicates removed (month-overlap + VAT-MOSS dups). All dates
ISO-formatted, all amounts numeric (including EUR/GBP/BRL with comma
decimal), vendor casing canonical, parens-negative resolved.
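
To make the date and amount normalization concrete, here is a minimal sketch that handles only the US-style variants listed in the pollution bullets above. `parse_amount`, `parse_date` and `DATE_FORMATS` are hypothetical names, and comma-decimal amounts (the `currency_decimal="auto"` case) are deliberately out of scope here.

```python
from datetime import datetime
from decimal import Decimal

# Formats covering the date examples listed above.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y", "%b %d %Y")


def parse_amount(raw: str) -> Decimal:
    """Parse amounts such as '($89.50)' or '- $599.88' into a signed Decimal."""
    text = raw.strip()
    # Accounting style: parentheses mean a negative amount.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    if text.startswith("-"):
        negative = True
    # Keep digits and the decimal point; drop sign, symbol, commas and spaces.
    digits = "".join(ch for ch in text if ch in "0123456789.")
    value = Decimal(digits)
    return -value if negative else value


def parse_date(raw: str) -> str:
    """Try each known format in turn and return an ISO 8601 date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {raw!r}")
```

Using `Decimal` rather than `float` keeps cents exact, which matters when duplicate detection compares amounts for equality.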
### 4.3 `agency_combined_leads.csv` (30 rows)
**Looks like**: a marketing-ops worksheet combining lead exports from
HubSpot + LinkedIn Sales Navigator + manual scraping, ready for
campaign targeting.
**Pollution included**:
- Phone formats per region: US, UK, Spain, Germany, China, India,
Australia, Mexico, Israel, Singapore, Hong Kong, Italy, South
Korea — 13 country codes
- Country column inconsistent: `USA` / `US` / `United States`
- Disguised nulls: `N/A`, `unknown`, `(unknown)`, `(blank)`, `(none)`,
`?`, `—`, `#N/A`, `TBD`
- Source column tags origin (`HubSpot` / `LinkedIn` / `Manual Scrape`)
- Email duplicates across sources with case variants: `alice@acme.com`
+ `Alice.Johnson@acme.com`, `bob@beta.com` + `Bob@Beta.com`,
`diana@delta.com` from two sources, `carlos@gamma.io` from two
sources, `Frank@Foxtrot.de` + `frank@foxtrot.de`
- Name casing: `DIANA LEE`, `henry`, `IVY CHEN`, mixed
- 6 fuzzy / cross-source duplicates designed to exercise the dedup step
- Score column with sentinel pollution that needs coercion to integer
**After running the pipeline**: 30 rows → 24, ~43 cells canonicalized,
14 sentinels resolved, 6 cross-source duplicates merged with `merge=true`
so each survivor inherits the most-complete picture. Invalid-email
rows (deliverability stress) and `Suppressed`/`Opted Out` tags
(suppression-list use case) survive as flagged rows the operator
manually reviews.
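
The "survivor inherits the most-complete picture" behaviour can be sketched as follows. This is an illustration under stated assumptions, not the engine's implementation: `merge_duplicates` is a hypothetical name, matching is exact on a lower-cased email only (no fuzzy matching, so `alice@acme.com` and `Alice.Johnson@acme.com` would not pair here), and the sample values are made up.

```python
def merge_duplicates(rows, key="email"):
    """Group rows by a case-insensitive key; survivors keep the first non-empty value per column."""
    merged = {}   # normalized key -> surviving row
    unkeyed = []  # rows with no usable key pass through unmerged
    for row in rows:
        norm = (row.get(key) or "").strip().lower()
        if not norm:
            unkeyed.append(dict(row))
            continue
        survivor = merged.setdefault(norm, {})
        for column, value in row.items():
            # Fill a column only while the survivor has nothing there yet.
            if not survivor.get(column):
                survivor[column] = value
    return list(merged.values()) + unkeyed


leads = [
    {"email": "bob@beta.com", "name": "Bob", "phone": "", "source": "HubSpot"},
    {"email": "Bob@Beta.com", "name": "", "phone": "+14155551234", "source": "LinkedIn"},
    {"email": "", "name": "Uma", "phone": "", "source": "Manual Scrape"},
]
result = merge_duplicates(leads)
```

Rows without a usable key are passed through rather than dropped, which matches the idea that invalid-email rows survive for manual review.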
## 5. UX flow (per persona)
The demo is a single Streamlit page (likely
`src/gui/pages/0_Review.py` repurposed for demo mode, or a
dedicated `app_demo.py` for the cloud build).
```
┌──────────────────────────────────────────────────────────┐
│ DataTools — for {Persona} │
│ "{Persona-specific H1}" │
├──────────────────────────────────────────────────────────┤
│ │
│ Sample dataset preloaded: shopify_pet_customers.csv │
│ [Replace with your own file (capped 100 rows)] │
│ │
│ ┌─ BEFORE preview (15 rows) ─────────────────────────┐ │
│ │ Alice | (415) 555-1234 | $1,240.50 | … │ │
│ │ Bob | 415.555.1234 | $1,240.50 | … │ │
│ │ ... │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ Pipeline (saved): │
│ 1. Text Clean → 2. Format Standardize → │
│ 3. Missing → 4. Deduplicate │
│ │
│ [▶ Run pipeline] │
│ │
│ ┌─ AFTER preview ───────────────────────────────────┐ │
│ │ 20 rows → 15 (5 duplicates merged) │ │
│ │ ~29 cells canonicalized · ~45 sentinels resolved │ │
│ │ │ │
│ │ Alice Johnson | +14155551234 | 1240.50 | … │ │
│ │ ... │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ [Download cleaned CSV (sample, watermarked)] │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Like what you see? │ │
│ │ Run this on YOUR 50,000-row export — locally. │ │
│ │ No upload. Your data never leaves your machine. │ │
│ │ [Get DataTools — $49 →] │ │
│ └──────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
```
**Critical UX points**:
- Sample dataset is *already loaded* on page paint. Visitor never
sees an empty state.
- BEFORE table is shown side-by-side with AFTER once the run
completes. Hidden-character toggle on by default so the visitor
*sees* what was hidden in their data.
- "Replace with your own file" is a secondary action below the BEFORE
table — not the headline.
- Per-step metrics are shown in the AFTER block: "~29 cells
canonicalized, ~45 sentinels resolved, 5 duplicates merged." Numbers
sell more than narrative.
- Buy button is **inside** the AFTER block and **above the fold** when
the run completes. Friction kills.
## 6. Free vs paid boundary
The demo runs the **same code** as the paid product. Caps are surface,
not engine.
| Limit | Free demo | Paid (downloaded) |
|---|---|---|
| Input rows | 100 | unlimited (1 GB+ via streaming) |
| File size | 5 MB | unlimited |
| Output | watermarked CSV ("DataTools demo — buy at <url>" appended as last row) | clean CSV |
| Pipeline editor | locked to the persona-saved pipeline | full edit / save / load JSON |
| Save pipeline JSON | disabled | enabled |
| International | enabled | enabled |
| Audit log download | disabled | enabled |
| Tools 06-09 | as they ship | as they ship |
The watermark is a **single trailing row**, not an in-cell tag — so
the demo's AFTER preview *visibly* reads as production-quality data,
not "demo crippled" data.
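
A minimal sketch of the surface-level cap and trailing watermark, assuming rows are already cleaned lists of strings. `demo_export` and `ROW_CAP` are hypothetical names, the watermark wording approximates the table's text, and whether the real demo truncates or rejects oversized input is not stated in this plan (this sketch truncates).

```python
import csv
import io

ROW_CAP = 100  # free-demo input cap from the table above


def demo_export(rows, header, buy_url):
    """Write at most ROW_CAP rows as CSV, then append one trailing watermark row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for row in rows[:ROW_CAP]:
        writer.writerow(row)
    # The watermark sits in the first column; the other columns stay empty.
    watermark = [f"DataTools demo - buy at {buy_url}"] + [""] * (len(header) - 1)
    writer.writerow(watermark)
    return buffer.getvalue()


sample = [[str(i), "x"] for i in range(150)]
output = demo_export(sample, ["id", "value"], "https://example.com")
```

Because the cap and watermark live in the export wrapper, the rows passed in can come from the same engine call the paid product uses.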
## 7. CTA copy (per persona)
### 7.1 Shopify pet operator
- **H1**: *Clean your customer / vendor / subscriber exports — locally.*
- **Sub**: *Klaviyo-import-ready in 30 seconds. Catches duplicates Excel
misses. Your data never leaves your computer.*
- **CTA**: *Get DataTools for Shopify — $49 →*
### 7.2 Bookkeeper / freelance accountant
- **H1**: *Reconcile messy bank exports. Hand your client an audit
trail.*
- **Sub**: *Catches the duplicate transaction QuickBooks imported twice.
Standardizes dates, amounts, vendor casing. Every change auditable.*
- **CTA**: *Get DataTools for Bookkeepers — $49 →*
### 7.3 Marketing / RevOps agency
- **H1**: *Dedupe leads across HubSpot, LinkedIn, and manual scrapes.*
- **Sub**: *International phones, country normalization, fuzzy dedup
with merge — one tool, one schema, no upload.*
- **CTA**: *Get DataTools for RevOps — $49 →*
## 8. Telemetry / conversion tracking
Async + no-touch + free hosting limits what we can instrument. Use
event-only counters, no PII:
| Event | Source | Aggregate-only field |
|---|---|---|
| `demo.page_view` | landing page | persona tag |
| `demo.run_clicked` | demo page | persona tag |
| `demo.run_completed` | demo page | persona tag, rows_processed |
| `demo.cta_clicked` | demo page | persona tag |
| `gumroad.purchase` | Gumroad webhook | landing-page-source query param (`?from=shopify-pet`) |
Conversion = `cta_clicked / run_completed`. Demo-quality issue surfaces
when `run_completed / page_view` < 30 % (visitors not engaging).
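
The two ratios can be computed from the aggregate counters alone. A sketch, assuming one set of counts per persona per reporting period; `funnel_health` is a hypothetical name, the 30 % threshold is the one above, the 5 % threshold is taken from the maintenance triggers in §9, and the counts in the usage line are made up:

```python
def funnel_health(page_views, runs_completed, cta_clicks,
                  min_engagement=0.30, min_conversion=0.05):
    """Return both funnel ratios plus flags for the two review triggers."""
    # Guard against division by zero when a period has no traffic.
    engagement = runs_completed / page_views if page_views else 0.0
    conversion = cta_clicks / runs_completed if runs_completed else 0.0
    return {
        "engagement": engagement,   # run_completed / page_view
        "conversion": conversion,   # cta_clicked / run_completed
        "audit_demo": engagement < min_engagement,
        "revise_cta": conversion < min_conversion,
    }


week = funnel_health(page_views=1000, runs_completed=250, cta_clicks=20)
```

In this made-up week, engagement is 25 %, so the demo itself would be flagged for audit, while the 8 % conversion stays above the CTA threshold.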
Self-host counters on Cloudflare Pages (free, GDPR-friendly). No
Google Analytics — adds privacy banner, conflicts with the "your data
never leaves your computer" message.
## 9. Maintenance plan
**Recurring**: zero. The demo runs on the same engine the paid
product ships, so any improvement to the engine improves the demo
automatically. The pre-saved pipeline JSONs reference column names
and tool names, both stable APIs.
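
To illustrate why stable tool names and column names are all a saved pipeline needs, here is a sketch of a JSON-driven runner. The JSON shape (`steps`, `tool`, `options`), the `TOOLS` registry and both toy tool functions are assumptions made for illustration; the actual schema in `src/core/pipeline.py` is not shown in this plan.

```python
import json


def text_clean(rows, options):
    """Toy stand-in: strip surrounding whitespace from every string cell."""
    return [{k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
            for row in rows]


def deduplicate(rows, options):
    """Toy stand-in: keep the first row per case-insensitive key over the given columns."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(str(row.get(col, "")).lower() for col in options["columns"])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical registry: saved pipelines refer to tools by these names only.
TOOLS = {"text_clean": text_clean, "deduplicate": deduplicate}


def run_pipeline(pipeline_json, rows):
    """Apply each saved step in order; an unknown tool name raises KeyError."""
    for step in json.loads(pipeline_json)["steps"]:
        rows = TOOLS[step["tool"]](rows, step.get("options", {}))
    return rows


saved = '{"steps": [{"tool": "text_clean"}, {"tool": "deduplicate", "options": {"columns": ["email"]}}]}'
cleaned = run_pipeline(saved, [{"email": " Bob@Beta.com "}, {"email": "bob@beta.com"}])
```

The saved file holds no code, only names and options, so an engine upgrade changes behaviour without touching the JSON as long as those names stay fixed.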
**Triggers for revisit**:
| Trigger | Action |
|---|---|
| Streamlit Community Cloud rate-limits / sleeps too aggressively | Migrate to a $5-10/mo VPS (BUSINESS.md §9 contingency) |
| Demo dataset becomes stale (e.g. all phones standardize to no-op) | Refresh with a new pollution batch — *don't change the persona* |
| `run_completed / page_view < 30 %` for 4 consecutive weeks | Audit the demo: is the BEFORE preview showing the mess clearly? Is the AFTER too small to notice? |
| `cta_clicked / run_completed < 5 %` for 4 consecutive weeks | The demo is impressive but the CTA isn't earning trust — revise copy + add a screenshot of the network tab showing zero outbound calls (PLAN.md §2.4) |
| New tool ships (06-09) | Decide *per persona* whether to add it to that persona's saved pipeline. Not all tools belong on all personas |
## 10. Build sequence (drops into PLAN.md week 2)
| Day | Action |
|---|---|
| 1 | Demo build of Streamlit app: 3 personas, switch via query param `?p=shopify-pet` |
| 2 | Pipeline JSONs wired in; row cap + watermark applied; download button |
| 3 | Deploy to Streamlit Community Cloud · 3 sub-paths or 3 separate apps |
| 4 | Persona landing pages: 3 static HTML pages on Cloudflare Pages, each with iframe embed of its persona demo + CTA |
| 5 | Telemetry counters wired (Cloudflare event API) · Gumroad webhook captures `?from=` |
End of day 5: three URLs the operator can drop into three different
niche-community threads, each performing its own conversion math.
## 11. Anti-temptations (things the demo deliberately refuses)
- **No "try it on your data first" gate that requires email.** The
whole point is friction-free.
- **No "schedule a demo" CTA.** Locked by no-touch.
- **No live chat widget.** Same.
- **No A/B-test framework yet.** Single-arm copy, ship it, iterate
monthly. A/B requires statistical traffic the funnel doesn't have
pre-PMF.
- **No watermark inside cells.** The AFTER preview must look
production-quality. Watermark goes on a single trailing row that's
obviously the demo signature.
- **No animation / loader theatrics.** Pipeline runs in <1 s; a
fake-progress bar lies about speed.