datatools-dev/docs/DEMO-PLAN.md
Michael 6df726e69e demo: reconstruct sales demos for an accounting audience
Replaces the Shopify / RevOps / Bookkeeper demo trio with three accounting
personas that share one buyer, each entering through a workflow where a
messy export costs money — all running the same saved 4-step pipeline:

- bank_reconciliation.csv (Bookkeeper): 26 -> 20 rows, 6 double-posted
  transactions caught after date+amount standardization.
- vendor_1099.csv (AP / 1099): 24 records -> 8 vendors, 7 missing EINs
  recovered via dedup merge — the 1099-complete story.
- ar_open_invoices.csv (AR): 26 -> 21 rows, 5 double-entered invoices
  removed, blank status backfilled from the twin row.

Every number is validated against the live engine and pinned by
tests/test_demo_pipelines.py (read path mirrors app_demo._load_demo:
dtype=str, keep_default_na=False). Rewires src/gui/app_demo.py PERSONAS
(keys bookkeeper / ap-1099 / ar-aging, accounting H1/sub/CTA) and rewrites
docs/DEMO-PLAN.md sections 3/4/7 with the validated outcomes.

(Repo hygiene forced by a partial-clone gap: finalizes the already-deleted,
unreferenced samples/messy_text.csv whose blob was unrecoverable.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 18:52:39 +00:00

# Demo Plan — DataTools
> Creator-only. Implements PLAN.md §2.2 (the demo IS the product) and
> §2.3 (niche down — three landing pages, one engine).
> **Version**: 1.0 · **Adopted**: 2026-05-01 · **Owner**: Michael
The hosted demo is the single highest-leverage marketing asset in the
plan. This document defines exactly what loads, in what order, with
what data, for which buyer — so the operator builds it once and never
rebuilds it from a stale headline.
## 1. Goals
- Convert a cold visitor to a paid buyer in **under three minutes** of
active interaction.
- Demonstrate the *full pipeline* (not one tool) on a dataset that
*looks like the visitor's own work* — not a toy CSV.
- Survive zero attention to maintenance — once running, the demo
should keep working as the engine evolves (the pre-saved pipeline
JSONs use the same code path the paid product uses).
- Provide a shareable artifact for niche-community posts (a public URL
the operator can drop into a subreddit reply with one sentence).
## 2. Constraints (non-negotiable)
| Constraint | Source | Implication |
|---|---|---|
| Free hosting at launch | BUSINESS.md §9 | Streamlit Community Cloud (1 GB RAM, sleeps after 7 days idle) |
| No login | BUSINESS.md §7 | No email gate, no signup wall, no "create account to continue" |
| Async / no-touch | DECISIONS.md §1 #8 | Cannot offer "schedule a demo with us" CTA |
| Runs locally on paid product | BUSINESS.md §11 | Demo can't expose the same engine to abuse — needs row caps |
| Friction kills conversion | BUSINESS.md §7 | Demo dataset preloaded; no "select a file" first-step |
| < $1,200/mo recurring | BUSINESS.md §9 | Migration plan to $5/mo VPS only after rate-limit signal |
## 3. The three personas — one audience: accounting (per PLAN.md §2.3)
We niche to **accounting** and enter through the three workflows where a
messy export costs real money. Same engine, three landing pages — each
is the same buyer at a different desk (bookkeeping, payables, receivables).
| Tag | Persona | Top-of-funnel keyword | Demo dataset | Pre-saved pipeline |
|---|---|---|---|---|
| `bookkeeper` | Bookkeeper — bank reconciliation | "reconcile bank export csv duplicates" | `samples/demo/bank_reconciliation.csv` | `bank_reconciliation_pipeline.json` |
| `ap-1099` | Accounts payable — 1099 vendor prep | "clean 1099 vendor list missing EIN" | `samples/demo/vendor_1099.csv` | `vendor_1099_pipeline.json` |
| `ar-aging` | Accounts receivable — open invoices | "remove duplicate invoices aging report" | `samples/demo/ar_open_invoices.csv` | `ar_open_invoices_pipeline.json` |
Each persona gets its **own landing page URL** (`?p=<tag>`), its **own
demo dataset loaded by default**, and its **own H1 + below-the-fold
copy** — wired in `src/gui/app_demo.py::PERSONAS`. The engine is
identical; only positioning differs.
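
A minimal sketch of how the `?p=<tag>` switch could resolve to a persona entry. The dict shape, the `DEFAULT_TAG` fallback, and the `resolve_persona` helper are assumptions for illustration; the real structure lives in `src/gui/app_demo.py::PERSONAS` and may differ.

```python
from urllib.parse import parse_qs

# Hypothetical shape: the real dict lives in src/gui/app_demo.py::PERSONAS
# and also carries the H1 / sub / CTA copy for each tag.
PERSONAS = {
    "bookkeeper": {
        "dataset": "samples/demo/bank_reconciliation.csv",
        "pipeline": "bank_reconciliation_pipeline.json",
    },
    "ap-1099": {
        "dataset": "samples/demo/vendor_1099.csv",
        "pipeline": "vendor_1099_pipeline.json",
    },
    "ar-aging": {
        "dataset": "samples/demo/ar_open_invoices.csv",
        "pipeline": "ar_open_invoices_pipeline.json",
    },
}

DEFAULT_TAG = "bookkeeper"  # assumption: which persona an unknown tag falls back to


def resolve_persona(query_string: str) -> tuple[str, dict]:
    """Map a raw query string such as '?p=ap-1099' to (tag, persona entry)."""
    params = parse_qs(query_string.lstrip("?"))
    tag = params.get("p", [DEFAULT_TAG])[0].strip().lower()
    if tag not in PERSONAS:
        tag = DEFAULT_TAG  # never show an empty state, even for a stale link
    return tag, PERSONAS[tag]
```

Falling back to a default persona rather than raising keeps old links (for example, tags from the previous demo trio) landing on a working demo.
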
## 4. Demo dataset specifications
Each dataset is intentionally small (~15–25 rows) so the full pipeline
runs in well under one second on Streamlit Community Cloud's free
hardware. Each row is a *plausible-looking* export from that
persona's tooling. Each contains every kind of pollution the bundle's
five tools fix, so a single demo run shows every tool earning its
keep.
### 4.0 Value-proof map
Each demo dataset is engineered so the buyer sees their **own top pain**
fixed in the AFTER preview, with one unmistakable headline number. All
three run the same saved 4-step pipeline (Clean Text → Standardize
Formats → Fix Missing Values → Find Duplicates). The numbers below are
**validated against the live engine** (`tests/test_demo_pipelines.py`
pins them) — refresh the dataset only if a number stops landing.
| Persona | Headline proof | What the visitor watches happen |
|---|---|---|
| Bookkeeper | **26 → 20 rows · 6 phantom duplicates removed** | The same payment posted twice (different date + amount format) collapses to one; dates go ISO, parens-negatives become real negatives |
| AP / 1099 | **24 records → 8 vendors · 7 missing EINs recovered** | Each vendor's scattered records merge into one complete row; `merge=true` backfills the EIN/address/phone that any single record was missing |
| AR aging | **26 → 21 rows · 5 double-entered invoices removed** | Duplicate invoice numbers collapse; a blank status is backfilled from its twin; invoice + due dates go ISO, amounts numeric |
### 4.1 `bank_reconciliation.csv` (26 rows) — Bookkeeper
**Looks like**: two months (Jan + Feb 2025) of business-checking activity
from a bank portal, where the Feb re-export overlaps Jan so the same
transaction posts twice. Columns: `Date, Description, Vendor, Category,
Amount, Account`.
**Pollution included**:
- Mixed date formats: `01/15/2025`, `2025-01-15`, `Jan 18 2025`, `1/27/25`, `Feb 5 2025`.
- Currency formats incl. negatives: `-$129.99`, `($89.50)` parens-negative, `+$3,450.00`, `- $599.88`, bare `-129.99`, `(50.00)`.
- Whitespace + NBSP padding; smart quotes and an em-dash inside descriptions.
- Vendor casing variety on *non-duplicate* rows: `Amazon` / `amazon.com` / `AMAZON.COM`, `Verizon` / `verizon`.
- Disguised nulls in Category: `—`, `(blank)`, `?`, `unknown`, `TBD`.
- **6 duplicate transactions** — each pair shares the same vendor + real value but a different date *and* amount format, so they collapse only after standardization.
**After running the pipeline** (validated): **26 → 20 rows, 6 duplicates
removed**, 36 date/amount cells standardized (0 unparseable), all dates
ISO, parens-negatives resolved (`($89.50)` → `-89.50`), disguised-null
categories flagged. The reconciliation ties out.
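
The reason these duplicates collapse only after standardization can be sketched with the standard library alone. This is an illustration, not the engine's code: `to_iso_date`, `to_amount`, `dedup_key`, and the format list are hypothetical names chosen to cover the sample formats listed above, and `%b` parsing assumes an English locale.

```python
import re
from datetime import datetime
from decimal import Decimal

# Assumed format list, covering the date variants listed for this dataset.
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%b %d %Y", "%m/%d/%y")


def to_iso_date(raw: str) -> str:
    """Try each known format in turn; raise ValueError if none matches."""
    text = " ".join(raw.split())  # collapses padding, including NBSP
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unparseable date: {raw!r}")


def to_amount(raw: str) -> Decimal:
    """Turn '($89.50)', '- $599.88' or '+$3,450.00' into a signed Decimal."""
    text = raw.strip()
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]  # parens-negative: drop the parentheses
    text = re.sub(r"[\s$,]", "", text)  # drop spaces, currency sign, separators
    if text.startswith("-"):
        negative, text = True, text[1:]
    elif text.startswith("+"):
        text = text[1:]
    value = Decimal(text)  # raises decimal.InvalidOperation on garbage
    return -value if negative else value


def dedup_key(row: dict) -> tuple:
    """Rows are duplicates when vendor, real date and real amount all match."""
    return (
        row["Vendor"].strip().casefold(),
        to_iso_date(row["Date"]),
        to_amount(row["Amount"]),
    )


rows = [
    {"Date": "01/15/2025", "Vendor": "Stripe", "Amount": "+$3,450.00"},
    {"Date": "2025-01-15", "Vendor": "Stripe", "Amount": "3450.00"},
]
unique_keys = {dedup_key(r) for r in rows}  # both rows map to a single key
```

Compared as raw strings the two rows differ in every cell except the vendor; compared on the standardized key they are the same transaction.
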
### 4.2 `vendor_1099.csv` (24 rows) — Accounts payable / 1099
**Looks like**: a 1099-NEC vendor master list where the same vendor was
entered 2–3 times across the year by different staff, each record holding
only *part* of the vendor's details. Columns: `Vendor, Contact, Email,
Phone, EIN, Address, Total_Paid`.
**Pollution included**:
- The duplicate records for a vendor share one email differing only by case/whitespace (the reliable dedup key, matched with the `email` normalizer).
- EIN / Phone / Address scattered across the duplicate set so no single record is complete but the union is — gaps marked `—`, `(blank)`, `TBD`, `unknown`, `N/A`.
- Vendor name casing/spelling variants, phone formats, EIN formats (`12-3456789` vs `123456789`), `Total_Paid` currency variants.
**After running the pipeline** (validated): **24 records → 8 vendors, 16
duplicates removed, 7 missing EINs recovered** by `merge=true` +
`most_complete` survivor, 35 disguised nulls caught, phones/emails/amounts
standardized (0 unparseable). One vendor genuinely has no EIN in any
record — it survives with a blank EIN as the realistic "flag for
follow-up" case.
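
The merge behaviour described here can be sketched as follows: group on a normalized email, keep the most complete record as the survivor, then backfill its gaps from the other records in the group. The function names, the null-token set, and the sample rows are hypothetical; the engine's actual `email` normalizer and `most_complete` logic may differ in detail.

```python
# Assumed sentinel set, taken from the gap markers listed for this dataset.
NULL_TOKENS = {"", "\u2014", "(blank)", "tbd", "unknown", "n/a"}  # \u2014 = em dash


def is_blank(value: str) -> bool:
    return value.strip().casefold() in NULL_TOKENS


def email_key(value: str) -> str:
    """Stand-in for the `email` normalizer: ignore case and padding."""
    return value.strip().casefold()


def merge_group(records: list[dict]) -> dict:
    """Pick the most complete record, then backfill its blanks from the rest."""
    def filled(record: dict) -> int:
        return sum(not is_blank(v) for v in record.values())

    survivor = dict(max(records, key=filled))
    for field in list(survivor):
        if is_blank(survivor[field]):
            donors = [r[field] for r in records if not is_blank(r[field])]
            # No donor anywhere in the group: leave a true blank for follow-up.
            survivor[field] = donors[0] if donors else ""
    return survivor


def dedup_merge(records: list[dict], key_field: str = "Email") -> list[dict]:
    groups: dict[str, list[dict]] = {}
    for record in records:
        groups.setdefault(email_key(record[key_field]), []).append(record)
    return [merge_group(group) for group in groups.values()]


# Hypothetical rows, not taken from vendor_1099.csv.
rows = [
    {"Vendor": "Acme Supply", "Email": "AP@Acme.com ", "Phone": "TBD", "EIN": "\u2014"},
    {"Vendor": "ACME SUPPLY", "Email": "ap@acme.com", "Phone": "555-0100", "EIN": "unknown"},
    {"Vendor": "Acme Supply LLC", "Email": " ap@acme.com", "Phone": "(blank)", "EIN": "12-3456789"},
]
merged = dedup_merge(rows)
```

No single input row holds both the phone and the EIN, but the merged row does, which is the "missing EINs recovered" effect in miniature.
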
### 4.3 `ar_open_invoices.csv` (26 rows) — Accounts receivable
**Looks like**: an open-invoices (unpaid AR) export where some invoices
were double-entered in different formats and client contacts are messy.
Columns: `Invoice, Client, Email, Invoice_Date, Due_Date, Amount, Status`.
**Pollution included**:
- Two date columns with mixed formats; currency variants incl. a credit memo `($300.00)` → `-300.00`.
- Client name casing variety; email case variants (`AP@Acme.com` vs `ap@acme.com`).
- Status disguised nulls: `—`, `?`, `(blank)`, `TBD`, `unknown`, `(none)`.
- **5 double-entered invoices** — same invoice number twice, dates/amount in different formats, one copy with a blank status the other fills.
**After running the pipeline** (validated): **26 → 21 rows, 5 duplicate
invoices removed**, both date columns ISO + amounts numeric + emails
lowercased (0 unparseable), 7 disguised-null statuses caught, and a blank
status backfilled from its twin via `merge=true`. The aging report stops
double-counting.
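
Catching disguised nulls amounts to comparing each cell against a set of sentinel tokens after trimming and case-folding. A rough sketch, assuming the token list above is the whole set; `clean_status` and `count_disguised_nulls` are hypothetical helpers, not engine functions.

```python
# Assumed sentinel set, copied from the Status tokens listed above.
DISGUISED_NULLS = {"\u2014", "?", "(blank)", "tbd", "unknown", "(none)"}  # \u2014 = em dash


def clean_status(raw: str) -> str:
    """Return '' for a sentinel token, otherwise the trimmed value."""
    text = raw.strip()  # str.strip() also removes NBSP padding
    return "" if text.casefold() in DISGUISED_NULLS else text


def count_disguised_nulls(values: list[str]) -> int:
    """Count cells holding a sentinel; genuinely empty cells are not counted."""
    return sum(1 for v in values if v.strip() and clean_status(v) == "")


# Hypothetical column values, not taken from ar_open_invoices.csv.
statuses = ["Open", "TBD", " \u2014", "(none)", "", "Overdue", "?"]
flagged = count_disguised_nulls(statuses)
```

Keeping the count separate from truly empty cells lets the AFTER block report "disguised nulls caught" without inflating it with ordinary blanks.
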
## 5. UX flow (per persona)
The demo is a single Streamlit page (likely
`src/gui/pages/0_Review.py` repurposed for demo mode, or a
dedicated `app_demo.py` for the cloud build).
```
┌──────────────────────────────────────────────────────────┐
│ DataTools — for {Persona} │
│ "{Persona-specific H1}" │
├──────────────────────────────────────────────────────────┤
│ │
│ Sample dataset preloaded: bank_reconciliation.csv │
│ [Replace with your own file (capped 100 rows)] │
│ │
│ ┌─ BEFORE preview (26 rows) ─────────────────────────┐ │
│ │ 01/15/2025 | Stripe | +$3,450.00 | … │ │
│ │ 2025-01-15 | Stripe | 3450.00 | … (dup) │ │
│ │ ... │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ Pipeline (saved): │
│ 1. Clean Text → 2. Standardize Formats → │
│ 3. Fix Missing → 4. Find Duplicates │
│ │
│ [▶ Run pipeline] │
│ │
│ ┌─ AFTER preview ───────────────────────────────────┐ │
│ │ 26 rows → 20 (6 duplicate transactions removed) │ │
│ │ 36 cells standardized · 4 disguised nulls flagged │ │
│ │ │ │
│ │ 2025-01-15 | Stripe | 3450.00 | … │ │
│ │ ... │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ [Download cleaned CSV (sample, watermarked)] │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Like what you see? │ │
│ │ Run this on YOUR 50,000-row export — locally. │ │
│ │ No upload. Your data never leaves your machine. │ │
│ │ [Get DataTools — $49 →] │ │
│ └──────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
```
**Critical UX points**:
- Sample dataset is *already loaded* on page paint. Visitor never
sees an empty state.
- BEFORE table is shown side-by-side with AFTER once the run
completes. Hidden-character toggle on by default so the visitor
*sees* what was hidden in their data.
- "Replace with your own file" is a secondary action below the BEFORE
table — not the headline.
- Per-step metrics are shown in the AFTER block: "36 cells
standardized, 4 disguised nulls flagged, 6 duplicates removed." Numbers
sell more than narrative.
- Buy button is **inside** the AFTER block and **above the fold** when
the run completes. Friction kills.
## 6. Free vs paid boundary
The demo runs the **same code** as the paid product. Caps are surface,
not engine.
| Limit | Free demo | Paid (downloaded) |
|---|---|---|
| Input rows | 100 | unlimited (1 GB+ via streaming) |
| File size | 5 MB | unlimited |
| Output | watermarked CSV ("DataTools demo — buy at <url>" appended as last row) | clean CSV |
| Pipeline editor | locked to the persona-saved pipeline | full edit / save / load JSON |
| Save pipeline JSON | disabled | enabled |
| International | enabled | enabled |
| Audit log download | disabled | enabled |
| Tools 06–09 | as they ship | as they ship |
The watermark is a **single trailing row**, not an in-cell tag — so
the demo's AFTER preview *visibly* reads as production-quality data,
not "demo crippled" data.
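
Because the caps are surface, not engine, they can be applied as thin wrappers around the input and the download. A minimal sketch, assuming list-of-lists rows and the watermark text from the table above; `cap_rows` and `to_watermarked_csv` are illustrative names, not functions from the codebase.

```python
import csv
import io

DEMO_ROW_CAP = 100  # free-demo input cap from the table above
WATERMARK = "DataTools demo \u2014 buy at <url>"  # <url> is left as a placeholder


def cap_rows(rows: list[list[str]], cap: int = DEMO_ROW_CAP) -> list[list[str]]:
    """Truncate the input before it reaches the engine."""
    return rows[:cap]


def to_watermarked_csv(header: list[str], rows: list[list[str]],
                       watermark: str = WATERMARK) -> str:
    """Write the cleaned rows untouched, then one trailing signature row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    # The watermark sits in the first cell; remaining cells stay empty so the
    # row keeps the same column count as the data.
    writer.writerow([watermark] + [""] * (len(header) - 1))
    return buffer.getvalue()
```

Since the data rows are written unchanged, the AFTER preview and the downloaded file differ from paid output only by that final line.
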
## 7. CTA copy (per persona)
Copy lives in `src/gui/app_demo.py::PERSONAS` (H1 / sub / CTA per tag);
keep this section in sync with that dict.
### 7.1 Bookkeeper — bank reconciliation (`?p=bookkeeper`)
- **H1**: *Catch the transactions your bank export posted twice. Locally.*
- **Sub**: *When the Jan and Feb exports overlap, the same payment posts
twice in two formats. DataTools standardizes every date and amount, then
dedups on the real transaction so your reconciliation ties out — 26 rows
→ 20, six phantom duplicates gone.*
- **CTA**: *Get DataTools for Bookkeepers — $49 →*
### 7.2 Accounts payable — 1099 prep (`?p=ap-1099`)
- **H1**: *Build a clean 1099 vendor list — with the missing EINs filled in.*
- **Sub**: *The same vendor entered three times, each record holding only
part of the details. DataTools consolidates to one row and backfills the
gaps from the duplicates — 24 records → 8 vendors, 7 missing EINs
recovered.*
- **CTA**: *Get DataTools for Accounting — $49 →*
### 7.3 Accounts receivable — open invoices (`?p=ar-aging`)
- **H1**: *Stop chasing the invoices your aging report counted twice. Locally.*
- **Sub**: *Double-entered invoices inflate your AR aging and your
follow-ups. DataTools standardizes dates and amounts, lowercases client
emails, and removes the duplicate invoice numbers — 26 rows → 21, five
phantom invoices off the books.*
- **CTA**: *Get DataTools for Accounting — $49 →*
## 8. Telemetry / conversion tracking
Async + no-touch + free hosting limits what we can instrument. Use
event-only counters, no PII:
| Event | Source | Aggregate-only field |
|---|---|---|
| `demo.page_view` | landing page | persona tag |
| `demo.run_clicked` | demo page | persona tag |
| `demo.run_completed` | demo page | persona tag, rows_processed |
| `demo.cta_clicked` | demo page | persona tag |
| `gumroad.purchase` | Gumroad webhook | landing-page-source query param (`?from=bookkeeper`) |
Conversion = `cta_clicked / run_completed`. Demo-quality issue surfaces
when `run_completed / page_view` < 30 % (visitors not engaging).
Self-host counters on Cloudflare Pages (free, GDPR-friendly). No
Google Analytics — adds privacy banner, conflicts with the "your data
never leaves your computer" message.
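
The two ratios above need only aggregate event counts. A small sketch of the arithmetic, assuming the counters arrive as a plain dict keyed by event name; `funnel_metrics` and the sample counts are hypothetical.

```python
from typing import Optional

ENGAGEMENT_FLOOR = 0.30  # run_completed / page_view threshold from the text


def safe_ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return None instead of dividing by zero on a day with no traffic."""
    return numerator / denominator if denominator else None


def funnel_metrics(counts: dict) -> dict:
    engagement = safe_ratio(counts.get("demo.run_completed", 0),
                            counts.get("demo.page_view", 0))
    conversion = safe_ratio(counts.get("demo.cta_clicked", 0),
                            counts.get("demo.run_completed", 0))
    return {
        "engagement": engagement,
        "conversion": conversion,
        # Flag a demo-quality issue only when there is traffic to judge.
        "demo_quality_issue": engagement is not None and engagement < ENGAGEMENT_FLOOR,
    }


# Made-up counts for one persona tag.
sample = funnel_metrics({"demo.page_view": 200, "demo.run_completed": 80,
                         "demo.cta_clicked": 6})
```

With these made-up counts, engagement is 80 / 200 = 40 % and conversion is 6 / 80 = 7.5 %, so no demo-quality flag is raised.
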
## 9. Maintenance plan
**Recurring**: zero. The demo runs on the same engine the paid
product ships, so any improvement to the engine improves the demo
automatically. The pre-saved pipeline JSONs reference column names
and tool names, both stable APIs.
**Triggers for revisit**:
| Trigger | Action |
|---|---|
| Streamlit Community Cloud rate-limits / sleeps too aggressively | Migrate to a $5–10/mo VPS (BUSINESS.md §9 contingency) |
| Demo dataset becomes stale (e.g. all phones standardize to no-op) | Refresh with a new pollution batch — *don't change the persona* |
| `run_completed / page_view < 30 %` for 4 consecutive weeks | Audit the demo: is the BEFORE preview showing the mess clearly? Is the AFTER too small to notice? |
| `cta_clicked / run_completed < 5 %` for 4 consecutive weeks | The demo is impressive but the CTA isn't earning trust — revise copy + add a screenshot of the network tab showing zero outbound calls (PLAN.md §2.4) |
| New tool ships (06–09) | Decide *per persona* whether to add it to that persona's saved pipeline. Not all tools belong on all personas |
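
The "4 consecutive weeks" triggers in the table can be checked with a short helper over weekly ratios, oldest first. This is a sketch under that assumption; `breached_streak` is a hypothetical name.

```python
def breached_streak(weekly_ratios: list[float], threshold: float,
                    weeks: int = 4) -> bool:
    """True when each of the last `weeks` values is below the threshold."""
    recent = weekly_ratios[-weeks:]
    # Require a full window so three bad weeks never fire the trigger early.
    return len(recent) == weeks and all(r < threshold for r in recent)


# Made-up weekly values for illustration.
engagement_by_week = [0.35, 0.28, 0.25, 0.29, 0.22]
audit_demo = breached_streak(engagement_by_week, threshold=0.30)
```

One week at or above the threshold resets the streak, which keeps a single good week from being averaged away and a single bad week from forcing a rewrite.
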
## 10. Build sequence (drops into PLAN.md week 2)
| Day | Action |
|---|---|
| 1 | Demo build of Streamlit app: 3 personas, switch via query param `?p=bookkeeper` |
| 2 | Pipeline JSONs wired in; row cap + watermark applied; download button |
| 3 | Deploy to Streamlit Community Cloud · 3 sub-paths or 3 separate apps |
| 4 | Persona landing pages: 3 static HTML pages on Cloudflare Pages, each with iframe embed of its persona demo + CTA |
| 5 | Telemetry counters wired (Cloudflare event API) · Gumroad webhook captures `?from=` |
End of day 5: three URLs the operator can drop into three different
niche-community threads, each performing its own conversion math.
## 11. Anti-temptations (things the demo deliberately refuses)
- **No "try it on your data first" gate that requires email.** The
whole point is friction-free.
- **No "schedule a demo" CTA.** Locked by no-touch.
- **No live chat widget.** Same.
- **No A/B-test framework yet.** Single-arm copy, ship it, iterate
monthly. A/B requires statistical traffic the funnel doesn't have
pre-PMF.
- **No watermark inside cells.** The AFTER preview must look
production-quality. Watermark goes on a single trailing row that's
obviously the demo signature.
- **No animation / loader theatrics.** Pipeline runs in <1 s; a
fake-progress bar lies about speed.