feat: 3 new tools, format streaming, distribution-ready demo + landing pages
Tools shipped this batch (4 → 6 of 9 Ready):
04 Missing Value Handler src/core/missing.py + cli_missing.py + GUI
05 Column Mapper src/core/column_mapper.py + cli_column_map.py + GUI
09 Pipeline Runner src/core/pipeline.py + cli_pipeline.py + GUI
with soft tool-dependency graph (recommended,
not enforced) and JSON save/load for repeatable
weekly cleanups.
Format Standardizer reworked for 1 GB international files:
• Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
• Per-row country / address columns drive parsing
• Audit cap (default 10 k rows, ~50 MB RAM)
• standardize_file(): chunked streaming entry point (~165 k rows/sec)
• currency_decimal="auto" for EU comma-decimal locales
• R$ / kr / zł multi-char currency prefixes
• cli_format.py with auto-stream above 100 MB inputs
Encoding detection arbiter + language-aware probe:
Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.
Distribution-readiness assets:
• streamlit_app.py — Streamlit Community Cloud entry shim
• src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
100-row cap + watermark, free-vs-paid boundary enforced at surface
• samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
• landing/ — 4 static HTML pages (apex chooser + 3 niche),
shared CSS, deploy.py URL-substitution script,
auto-generated robots.txt + sitemap.xml + 404.html + favicon
• docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
— full strategy + measurement + deployment + master checklist
Test counts:
before: 1,520 passed · 4 skipped · 17 xfailed
after: 1,729 passed · 0 skipped · 0 xfailed
Tier-1 corpora added:
• missing-corpus 3 use cases + 16 edge cases
• column-mapper-corpus 3 use cases + 5 edge cases
• format-cleaner intl 20-row 13-country stress fixture
Engine hardening flushed out by the corpora:
• interpolate guards against object-dtype columns
• mean/median skip all-NaN columns (silences numpy warning)
• fillna runs under future.no_silent_downcasting (silences pandas warning)
• mojibake test no longer skips when ftfy installed (monkeypatch path)
• drop-row threshold semantics: strict-greater (consistent across rows / cols)
• currency_decimal validator allow-set updated for "auto"
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
332
docs/DEMO-PLAN.md
Normal file
@@ -0,0 +1,332 @@
# Demo Plan — DataTools

> Creator-only. Implements PLAN.md §2.2 (the demo IS the product) and
> §2.3 (niche down — three landing pages, one engine).
> **Version**: 1.0 · **Adopted**: 2026-05-01 · **Owner**: Michael

The hosted demo is the single highest-leverage marketing asset in the
plan. This document defines exactly what loads, in what order, with
what data, for which buyer — so the operator builds it once and never
rebuilds it from a stale headline.

## 1. Goals

- Convert a cold visitor to a paid buyer in **under three minutes** of
  active interaction.
- Demonstrate the *full pipeline* (not one tool) on a dataset that
  *looks like the visitor's own work* — not a toy CSV.
- Survive zero attention to maintenance — once running, the demo
  should keep working as the engine evolves (the pre-saved pipeline
  JSONs use the same code path the paid product uses).
- Provide a shareable artifact for niche-community posts (a public URL
  the operator can drop into a subreddit reply with one sentence).

## 2. Constraints (non-negotiable)

| Constraint | Source | Implication |
|---|---|---|
| Free hosting at launch | BUSINESS.md §9 | Streamlit Community Cloud (1 GB RAM, sleeps after 7 days idle) |
| No login | BUSINESS.md §7 | No email gate, no signup wall, no "create account to continue" |
| Async / no-touch | DECISIONS.md §1 #8 | Cannot offer "schedule a demo with us" CTA |
| Runs locally on paid product | BUSINESS.md §11 | Demo can't expose the same engine to abuse — needs row caps |
| Friction kills conversion | BUSINESS.md §7 | Demo dataset preloaded; no "select a file" first-step |
| < $1,200/mo recurring | BUSINESS.md §9 | Migration plan to $5/mo VPS only after rate-limit signal |

## 3. The three personas (per PLAN.md §2.3)

| Tag | Persona | Top-of-funnel keyword | Demo dataset | Pre-saved pipeline |
|---|---|---|---|---|
| `shopify-pet` | Shopify operator (priority: pet supplies) | "shopify customer cleanup" | `samples/demo/shopify_pet_customers.csv` | `shopify_pet_pipeline.json` |
| `bookkeeper` | Bookkeeper / freelance accountant | "reconcile bank export csv" | `samples/demo/bookkeeper_bank_reconcile.csv` | `bookkeeper_bank_pipeline.json` |
| `revops` | Marketing / RevOps agency | "dedupe lead list across vendors" | `samples/demo/agency_combined_leads.csv` | `agency_leads_pipeline.json` |

Each persona gets its **own landing page URL**, its **own demo dataset
loaded by default**, and its **own H1 + below-the-fold copy.** The
engine is identical; only positioning differs.
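
The persona table above maps directly onto the `?p=<persona>` query parameter used for routing. A minimal sketch of that lookup, using only the standard library: the names `PERSONAS`, `resolve_persona`, and `demo_assets` are hypothetical, and falling back to `shopify-pet` for a missing or unknown tag is an assumption, since the plan does not name a default.

```python
from urllib.parse import parse_qs

# Persona tag -> (demo dataset, pre-saved pipeline), copied from the table above.
PERSONAS = {
    "shopify-pet": ("samples/demo/shopify_pet_customers.csv", "shopify_pet_pipeline.json"),
    "bookkeeper": ("samples/demo/bookkeeper_bank_reconcile.csv", "bookkeeper_bank_pipeline.json"),
    "revops": ("samples/demo/agency_combined_leads.csv", "agency_leads_pipeline.json"),
}
DEFAULT_PERSONA = "shopify-pet"  # assumption: the plan does not specify a fallback


def resolve_persona(query_string: str) -> str:
    """Return the persona tag named by ?p=..., or the default for missing/unknown tags."""
    values = parse_qs(query_string.lstrip("?")).get("p", [])
    tag = values[0].strip().lower() if values else ""
    return tag if tag in PERSONAS else DEFAULT_PERSONA


def demo_assets(query_string: str) -> tuple[str, str]:
    """Return the (dataset path, pipeline file) pair to preload for this visitor."""
    return PERSONAS[resolve_persona(query_string)]
```

Keeping the mapping in one dictionary means adding a fourth persona touches one place, and an unrecognised tag still paints a loaded demo instead of an empty state.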

## 4. Demo dataset specifications

Each dataset is intentionally small (20–30 rows) so the full pipeline
runs in well under one second on Streamlit Community Cloud's free
hardware. Each row reads like a *plausible-looking* record exported
from that persona's tooling. Each dataset contains every kind of
pollution the bundle's five tools fix, so a single demo run shows
every tool earning its keep.

### 4.0 Pain-point coverage map

Each demo dataset is engineered so the buyer sees their **own top
pain** demonstrated in the AFTER preview. The mapping below pairs
each pain from PLAN.md §2.3a with the rows / columns that exercise
it. Refresh the dataset only when this coverage drops.

| Persona | Pain (from PLAN §2.3a) | Demo coverage |
|---|---|---|
| Shopify pet | S1 — Klaviyo per-contact dupes | 5 dup pairs across rows 1–15 (case + format + address-twin variants) |
| Shopify pet | S2 — feed-rejection chars | smart-quote / NBSP / BOM in rows 1–6, 9, 11 |
| Shopify pet | S3 — multi-channel | partner-style customer IDs (`SHOP-`); demonstration of column-level mapping covered in RevOps demo |
| Shopify pet | S4 — subscription identity | rows 1+2, 7+8, 9+10 — same person, different format |
| Shopify pet | S5 — VAT-MOSS country drift | rows 16–18 (`United Kingdom` / `U.K.` / `UK`) + rows 19–20 (`Germany`/`Italia`) |
| Bookkeeper | B1 — month-overlap re-import | 7 dup pairs spanning Jan↔Feb and Mar boundaries |
| Bookkeeper | B2 — 1099 vendor consolidation | Amazon × 3 spellings, Verizon × 2, Acme Realty × 2, Adobe × 2, Costco × 2, Zoom × 2, Stripe × 4 |
| Bookkeeper | B3 — audit trail | every cell change in the run logged with old/new/rule — surface in the demo's audit tab |
| Bookkeeper | B4 — per-license economics | demonstrated by pricing copy, not data |
| Bookkeeper | B5 — multi-currency | rows 26 (EUR), 27 (GBP), 28 (BRL with comma decimal), 29 (parens-negative) |
| RevOps | R1 — per-contact tier | 6 cross-source dup pairs (HubSpot × LinkedIn × Manual Scrape) |
| RevOps | R2 — deliverability | rows 26–27 (`uma at uniform dot com`, `victor@@victorco.com` invalid emails) |
| RevOps | R3 — GDPR / privacy | demonstrated by the network-tab moat panel + zero-upload claim |
| RevOps | R4 — vendor unification | 3 source values (HubSpot / LinkedIn / Manual Scrape), 13 country codes, mixed-shape headers |
| RevOps | R5 — suppression list | rows 29–30 (`Suppressed`, `Opted Out` tags) |

### 4.1 `shopify_pet_customers.csv` (20 rows)

**Looks like**: a Shopify customer export filtered for "Pet Supplies"
sales channel, 12 months activity.

**Pollution included**:
- Whitespace padding (" Alice ", "Sydney Opera House Drive ")
- Mixed phone formats: `(415) 555-1234`, `415.555.1234`, `5559876543`,
  `+1 555-111-1111`
- International phones: GB, ES, DE, AU, JP (15 demo rows span 6
  countries)
- Currency variants: `$1,240.50`, `£890.25`, `€2.410,75` (EU comma
  decimal), `A$ 1,299.00`, `¥75000`
- Date formats: `2025-12-04`, `12/15/2025`, `?`, `(blank)`, `(none)`,
  `#N/A`
- Disguised nulls: `N/A`, blank, `(blank)`, `?`, `#N/A`, `(none)`,
  `unknown`
- Name casing: `EVE MARTINEZ`, `henry`, `O'NEIL`, `noah`, mixed Title /
  ALL CAPS / lower
- Email case variants that *should* dedup: `Bob@PetShop.com` vs
  `alice@petshop.com`
- 4 fuzzy duplicates (Alice/Bob same address, Grace/Henry same phone,
  Carlos/Olivia same address, Ivy/Jack same address)

**After running the pipeline**: 20 rows → 15, ~29 cells canonicalized,
~45 sentinels standardised, 5 cross-row duplicates merged. The
customer table is now Klaviyo-import-ready and the country column
(previously `UK` / `U.K.` / `United Kingdom` / `Germany` / `Italia`)
is GB / DE / IT — VAT MOSS report won't break.
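
The disguised-null bullet above can be sketched as a small normalizer. This is a minimal illustration, not the engine's Missing Value Handler: `DISGUISED_NULLS`, `normalize_null`, and `count_sentinels` are hypothetical names, and the sentinel set contains only the spellings listed in this section.

```python
# Sentinel spellings from the pollution list above, lower-cased for matching.
DISGUISED_NULLS = {"", "n/a", "(blank)", "?", "#n/a", "(none)", "unknown"}


def normalize_null(cell):
    """Return None for a disguised null, otherwise the whitespace-trimmed cell text."""
    if cell is None:
        return None
    text = str(cell).strip()  # str.strip() also removes non-breaking spaces
    return None if text.lower() in DISGUISED_NULLS else text


def count_sentinels(rows):
    """Count cells across a list of dict rows that would resolve to a real null."""
    return sum(
        1
        for row in rows
        for value in row.values()
        if normalize_null(value) is None
    )
```

A counter like `count_sentinels` is the kind of figure the AFTER block reports as "sentinels resolved"; how the engine actually counts them is not specified here.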

### 4.2 `bookkeeper_bank_reconcile.csv` (30 rows)

**Looks like**: two months of business checking + credit-card activity
exported from a bank portal, with the Feb export accidentally
overlapping the Jan export at the month boundary.

**Pollution included**:
- Mixed date formats: `01/15/2025`, `2025-01-15`, `Jan 18 2025`,
  `1/27/25`, `Feb 5 2025`
- Currency formats: `-$129.99`, `($89.50)` parens-negative,
  `+$3,450.00`, `- $599.88` space, bare `-129.99`, `(50.00)`
- Header trailing whitespace: `"Date "`
- Smart quotes around descriptions: `“autopay”`
- Em-dash sentinels in Vendor: `—`
- Smart-em-dash inside descriptions: `STAPLES #4422 — paper, toner`
- Vendor casing inconsistency: `Amazon` / `amazon.com` / `AMAZON.COM`,
  `Verizon` / `verizon`
- 7 duplicate transactions (same date+amount+vendor recorded twice
  with different formats)

**After running the pipeline**: 30 rows → 23, ~84 cells normalized, 7
duplicates removed (month-overlap + VAT-MOSS dups). All dates
ISO-formatted, all amounts numeric (including EUR/GBP/BRL with comma
decimal), vendor casing canonical, parens-negative resolved.
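
The currency formats listed above (leading sign, parens-negative, a space after the sign, comma-decimal locales) can be reduced to a signed number with a few string rules. The sketch below is illustrative only and is not the Format Standardizer's code: `parse_amount` and its `decimal_comma` flag are hypothetical, and the caller chooses the decimal convention explicitly, with no automatic locale detection.

```python
import re
from decimal import Decimal

# Currency prefixes seen in the demo data, plus any whitespace inside the amount.
_SYMBOLS = re.compile(r"R\$|A\$|[$£€¥]|\s")


def parse_amount(raw: str, decimal_comma: bool = False) -> Decimal:
    """Parse one bank-export amount string into a signed Decimal."""
    text = raw.strip()
    negative = text.startswith("(") and text.endswith(")")  # parens-negative
    if negative:
        text = text[1:-1]
    text = _SYMBOLS.sub("", text)  # drop currency symbols and inner spaces
    if text.startswith("-"):
        negative, text = True, text[1:]
    elif text.startswith("+"):
        text = text[1:]
    if decimal_comma:
        # EU style: "." groups thousands and "," marks the decimal point.
        text = text.replace(".", "").replace(",", ".")
    else:
        text = text.replace(",", "")
    value = Decimal(text)
    return -value if negative else value
```

`Decimal` is used instead of `float` so that reconciled amounts compare exactly, which matters when duplicate detection keys on date + amount + vendor.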

### 4.3 `agency_combined_leads.csv` (30 rows)

**Looks like**: a marketing-ops worksheet combining lead exports from
HubSpot + LinkedIn Sales Navigator + manual scraping, ready for
campaign targeting.

**Pollution included**:
- Phone formats per region: US, UK, Spain, Germany, China, India,
  Australia, Mexico, Israel, Singapore, Hong Kong, Italy, South
  Korea — 13 country codes
- Country column inconsistent: `USA` / `US` / `United States`
- Disguised nulls: `N/A`, `unknown`, `(unknown)`, `(blank)`, `(none)`,
  `?`, `—`, `#N/A`, `TBD`
- Source column tags origin (`HubSpot` / `LinkedIn` / `Manual Scrape`)
- Email duplicates across sources with case variants: `alice@acme.com`
  + `Alice.Johnson@acme.com`, `bob@beta.com` + `Bob@Beta.com`,
  `diana@delta.com` from two sources, `carlos@gamma.io` from two
  sources, `Frank@Foxtrot.de` + `frank@foxtrot.de`
- Name casing: `DIANA LEE`, `henry`, `IVY CHEN`, mixed
- 6 fuzzy / cross-source duplicates designed to survive a naive
  exact-match dedup
- Score column with sentinel pollution that needs coercion to integer

**After running the pipeline**: 30 rows → 24, ~43 cells canonicalized,
14 sentinels resolved, 6 cross-source duplicates merged with `merge=true`
so each survivor inherits the most-complete picture. Invalid-email
rows (deliverability stress) and `Suppressed`/`Opted Out` tags
(suppression-list use case) survive as flagged rows the operator
manually reviews.
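
The "merge so each survivor inherits the most-complete picture" behaviour can be illustrated on the simplest case, case-variant emails such as `bob@beta.com` + `Bob@Beta.com`. This sketch is an assumption about the general idea, not the engine's dedup: `email_key` and `merge_duplicates` are hypothetical names, and fuzzy matches such as `alice@acme.com` + `Alice.Johnson@acme.com` are out of scope for it.

```python
def email_key(email: str) -> str:
    """Case-insensitive dedup key for an email address."""
    return email.strip().lower()


def merge_duplicates(rows, key_field="email"):
    """Collapse rows sharing a key; the first row survives and fills its blanks from later ones."""
    survivors = {}
    for index, row in enumerate(rows):
        key = email_key(row.get(key_field) or "")
        if not key:
            # Rows without an email cannot be matched here, so each one is kept.
            survivors[("unkeyed", index)] = dict(row)
            continue
        kept = survivors.setdefault(key, dict(row))
        for field, value in row.items():
            if value and not kept.get(field):
                kept[field] = value  # inherit a value the survivor was missing
    return list(survivors.values())
```

For example, a HubSpot row with a name but no phone and a LinkedIn row with the same email and a phone collapse into one row carrying both.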

## 5. UX flow (per persona)

The demo is a single Streamlit page (likely
`src/gui/pages/0_Review.py` repurposed for demo mode, or a
dedicated `app_demo.py` for the cloud build).

```
┌──────────────────────────────────────────────────────────┐
│  DataTools — for {Persona}                               │
│  "{Persona-specific H1}"                                 │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  Sample dataset preloaded: shopify_pet_customers.csv     │
│  [Replace with your own file (capped 100 rows)]          │
│                                                          │
│  ┌─ BEFORE preview (15 rows) ─────────────────────────┐  │
│  │ Alice | (415) 555-1234 | $1,240.50 | …             │  │
│  │ Bob   | 415.555.1234   | $1,240.50 | …             │  │
│  │ ...                                                │  │
│  └────────────────────────────────────────────────────┘  │
│                                                          │
│  Pipeline (saved):                                       │
│    1. Text Clean → 2. Format Standardize →               │
│    3. Missing → 4. Deduplicate                           │
│                                                          │
│  [▶ Run pipeline]                                        │
│                                                          │
│  ┌─ AFTER preview ────────────────────────────────────┐  │
│  │ 15 rows → 11 (4 duplicates merged)                 │  │
│  │ 27 cells canonicalized · 33 sentinels resolved     │  │
│  │                                                    │  │
│  │ Alice Johnson | +14155551234 | 1240.50 | …         │  │
│  │ ...                                                │  │
│  └────────────────────────────────────────────────────┘  │
│                                                          │
│  [Download cleaned CSV (sample, watermarked)]            │
│                                                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │ Like what you see?                                 │  │
│  │ Run this on YOUR 50,000-row export — locally.      │  │
│  │ No upload. Your data never leaves your machine.    │  │
│  │ [Get DataTools — $49 →]                            │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘
```

**Critical UX points**:
- Sample dataset is *already loaded* on page paint. Visitor never
  sees an empty state.
- BEFORE table is shown side-by-side with AFTER once the run
  completes. Hidden-character toggle on by default so the visitor
  *sees* what was hidden in their data.
- "Replace with your own file" is a secondary action below the BEFORE
  table — not the headline.
- Per-step metrics are shown in the AFTER block: "27 cells
  canonicalized, 33 sentinels resolved, 4 duplicates merged." Numbers
  sell more than narrative.
- Buy button is **inside** the AFTER block and **above the fold** when
  the run completes. Friction kills.

## 6. Free vs paid boundary

The demo runs the **same code** as the paid product. Caps are surface,
not engine.

| Limit | Free demo | Paid (downloaded) |
|---|---|---|
| Input rows | 100 | unlimited (1 GB+ via streaming) |
| File size | 5 MB | unlimited |
| Output | watermarked CSV ("DataTools demo — buy at <url>" appended as last row) | clean CSV |
| Pipeline editor | locked to the persona-saved pipeline | full edit / save / load JSON |
| Save pipeline JSON | disabled | enabled |
| International | enabled | enabled |
| Audit log download | disabled | enabled |
| Tool 06–09 | as they ship | as they ship |

The watermark is a **single trailing row**, not an in-cell tag — so
the demo's AFTER preview *visibly* reads as production-quality data,
not "demo crippled" data.
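
"Caps are surface, not engine" can be pictured as two thin wrappers around an unchanged engine call: one truncates the input, the other appends the watermark row on export. The following is a sketch under that reading, using only the standard library. `cap_rows` and `write_watermarked_csv` are hypothetical names, and the watermark text is passed in by the caller, not hard-coded.

```python
import csv
import io

DEMO_ROW_CAP = 100  # "Input rows" limit for the free demo, from the table above


def cap_rows(rows, limit=DEMO_ROW_CAP):
    """Truncate input before it reaches the engine; the engine itself is unchanged."""
    return rows[:limit]


def write_watermarked_csv(header, rows, watermark):
    """Serialize cleaned rows and append the watermark as one trailing row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    # The watermark occupies the first column of a final row; data cells are untouched.
    writer.writerow([watermark] + [""] * (len(header) - 1))
    return buffer.getvalue()
```

Because the watermark lives in its own row, deleting that last line yields exactly what the paid product would have written for the same capped input.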

## 7. CTA copy (per persona)

### 7.1 Shopify pet operator

- **H1**: *Clean your customer / vendor / subscriber exports — locally.*
- **Sub**: *Klaviyo-import-ready in 30 seconds. Catches duplicates Excel
  misses. Your data never leaves your computer.*
- **CTA**: *Get DataTools for Shopify — $49 →*

### 7.2 Bookkeeper / freelance accountant

- **H1**: *Reconcile messy bank exports. Hand your client an audit
  trail.*
- **Sub**: *Catches the duplicate transaction QuickBooks imported twice.
  Standardizes dates, amounts, vendor casing. Every change auditable.*
- **CTA**: *Get DataTools for Bookkeepers — $49 →*

### 7.3 Marketing / RevOps agency

- **H1**: *Dedupe leads across HubSpot, LinkedIn, and manual scrapes.*
- **Sub**: *International phones, country normalization, fuzzy dedup
  with merge — one tool, one schema, no upload.*
- **CTA**: *Get DataTools for RevOps — $49 →*

## 8. Telemetry / conversion tracking

Async + no-touch + free hosting limits what we can instrument. Use
event-only counters, no PII:

| Event | Source | Aggregate-only field |
|---|---|---|
| `demo.page_view` | landing page | persona tag |
| `demo.run_clicked` | demo page | persona tag |
| `demo.run_completed` | demo page | persona tag, rows_processed |
| `demo.cta_clicked` | demo page | persona tag |
| `gumroad.purchase` | Gumroad webhook | landing-page-source query param (`?from=shopify-pet`) |

Conversion = `cta_clicked / run_completed`. Demo-quality issue surfaces
when `run_completed / page_view` < 30 % (visitors not engaging).

Self-host counters on Cloudflare Pages (free, GDPR-friendly). No
Google Analytics — adds privacy banner, conflicts with the "your data
never leaves your computer" message.
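
The two ratios defined above can be computed from aggregate event counts alone. A minimal sketch, assuming events arrive as `(event_name, persona_tag)` pairs: that input shape and the names `funnel_ratios` and `ENGAGEMENT_FLOOR` are illustrative assumptions, not part of any counter API.

```python
from collections import Counter

ENGAGEMENT_FLOOR = 0.30  # run_completed / page_view threshold from this section


def funnel_ratios(events):
    """events: iterable of (event_name, persona_tag) pairs. Returns per-persona ratios."""
    counts = Counter(events)
    personas = sorted({persona for _, persona in counts})
    report = {}
    for persona in personas:
        views = counts[("demo.page_view", persona)]
        runs = counts[("demo.run_completed", persona)]
        clicks = counts[("demo.cta_clicked", persona)]
        engagement = runs / views if views else None   # visitors who finished a run
        conversion = clicks / runs if runs else None   # finished runs that clicked the CTA
        report[persona] = {
            "engagement": engagement,
            "conversion": conversion,
            "demo_quality_issue": engagement is not None and engagement < ENGAGEMENT_FLOOR,
        }
    return report
```

Only counts keyed by persona tag are needed, which keeps the calculation consistent with the no-PII rule.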

## 9. Maintenance plan

**Recurring**: zero. The demo runs on the same engine the paid
product ships, so any improvement to the engine improves the demo
automatically. The pre-saved pipeline JSONs reference column names
and tool names, both stable APIs.

**Triggers for revisit**:

| Trigger | Action |
|---|---|
| Streamlit Community Cloud rate-limits / sleeps too aggressively | Migrate to a $5–10/mo VPS (BUSINESS.md §9 contingency) |
| Demo dataset becomes stale (e.g. all phones standardize to no-op) | Refresh with a new pollution batch — *don't change the persona* |
| `run_completed / page_view < 30 %` for 4 consecutive weeks | Audit the demo: is the BEFORE preview showing the mess clearly? Is the AFTER too small to notice? |
| `cta_clicked / run_completed < 5 %` for 4 consecutive weeks | The demo is impressive but the CTA isn't earning trust — revise copy + add a screenshot of the network tab showing zero outbound calls (PLAN.md §2.4) |
| New tool ships (06–09) | Decide *per persona* whether to add it to that persona's saved pipeline. Not all tools belong on all personas |
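
The two ratio-based triggers in the table share one rule: the metric must sit below its floor for four consecutive weeks. A small sketch of that check, assuming one ratio per week ordered oldest to newest: the function names and the action strings are hypothetical shorthand for the table's actions.

```python
ENGAGEMENT_FLOOR = 0.30  # run_completed / page_view
CONVERSION_FLOOR = 0.05  # cta_clicked / run_completed
WEEKS_REQUIRED = 4


def breached_for_weeks(weekly_ratios, floor, weeks=WEEKS_REQUIRED):
    """True when the most recent `weeks` values are all below `floor`."""
    recent = weekly_ratios[-weeks:]
    return len(recent) == weeks and all(ratio < floor for ratio in recent)


def revisit_actions(engagement_by_week, conversion_by_week):
    """Return the revisit actions whose ratio trigger currently fires."""
    actions = []
    if breached_for_weeks(engagement_by_week, ENGAGEMENT_FLOOR):
        actions.append("audit BEFORE/AFTER preview")
    if breached_for_weeks(conversion_by_week, CONVERSION_FLOOR):
        actions.append("revise CTA copy")
    return actions
```

Requiring a full run of four weeks means a single bad week, or fewer than four weeks of data, never fires a trigger.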

## 10. Build sequence (drops into PLAN.md week 2)

| Day | Action |
|---|---|
| 1 | Demo build of Streamlit app: 3 personas, switch via query param `?p=shopify-pet` |
| 2 | Pipeline JSONs wired in; row cap + watermark applied; download button |
| 3 | Deploy to Streamlit Community Cloud · 3 sub-paths or 3 separate apps |
| 4 | Persona landing pages: 3 static HTML pages on Cloudflare Pages, each with iframe embed of its persona demo + CTA |
| 5 | Telemetry counters wired (Cloudflare event API) · Gumroad webhook captures `?from=` |

End of day 5: three URLs the operator can drop into three different
niche-community threads, each performing its own conversion math.

## 11. Anti-temptations (things the demo deliberately refuses)

- **No "try it on your data first" gate that requires email.** The
  whole point is friction-free.
- **No "schedule a demo" CTA.** Locked by no-touch.
- **No live chat widget.** Same.
- **No A/B-test framework yet.** Single-arm copy, ship it, iterate
  monthly. A/B requires statistical traffic the funnel doesn't have
  pre-PMF.
- **No watermark inside cells.** The AFTER preview must look
  production-quality. Watermark goes on a single trailing row that's
  obviously the demo signature.
- **No animation / loader theatrics.** Pipeline runs in <1 s; a
  fake-progress bar lies about speed.