Michael 966af8ef94 feat: 3 new tools, format streaming, distribution-ready demo + landing pages

Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-01 22:31:26 +00:00


Demo Plan — DataTools

Creator-only. Implements PLAN.md §2.2 (the demo IS the product) and §2.3 (niche down — three landing pages, one engine). Version: 1.0 · Adopted: 2026-05-01 · Owner: Michael

The hosted demo is the single highest-leverage marketing asset in the plan. This document defines exactly what loads, in what order, with what data, for which buyer — so the operator builds it once and never rebuilds it from a stale headline.

1. Goals

Convert a cold visitor to a paid buyer in under three minutes of active interaction.
Demonstrate the full pipeline (not one tool) on a dataset that looks like the visitor's own work — not a toy CSV.
Survive zero attention to maintenance — once running, the demo should keep working as the engine evolves (the pre-saved pipeline JSONs use the same code path the paid product uses).
Provide a shareable artifact for niche-community posts (a public URL the operator can drop into a subreddit reply with one sentence).

2. Constraints (non-negotiable)

| Constraint | Source | Implication |
| --- | --- | --- |
| Free hosting at launch | BUSINESS.md §9 | Streamlit Community Cloud (1 GB RAM, sleeps after 7 days idle) |
| No login | BUSINESS.md §7 | No email gate, no signup wall, no "create account to continue" |
| Async / no-touch | DECISIONS.md §1 #8 | Cannot offer "schedule a demo with us" CTA |
| Runs locally on paid product | BUSINESS.md §11 | Demo can't expose the same engine to abuse; needs row caps |
| Friction kills conversion | BUSINESS.md §7 | Demo dataset preloaded; no "select a file" first-step |
| < $1,200/mo recurring | BUSINESS.md §9 | Migration plan to $5/mo VPS only after rate-limit signal |

3. The three personas (per PLAN.md §2.3)

| Tag | Persona | Top-of-funnel keyword | Demo dataset | Pre-saved pipeline |
| --- | --- | --- | --- | --- |
| `shopify-pet` | Shopify operator (priority: pet supplies) | "shopify customer cleanup" | `samples/demo/shopify_pet_customers.csv` | `shopify_pet_pipeline.json` |
| `bookkeeper` | Bookkeeper / freelance accountant | "reconcile bank export csv" | `samples/demo/bookkeeper_bank_reconcile.csv` | `bookkeeper_bank_pipeline.json` |
| `revops` | Marketing / RevOps agency | "dedupe lead list across vendors" | `samples/demo/agency_combined_leads.csv` | `agency_leads_pipeline.json` |

Each persona gets its own landing page URL, its own demo dataset loaded by default, and its own H1 + below-the-fold copy. The engine is identical; only positioning differs.
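The "one engine, three positionings" idea can be sketched as a lookup keyed by the persona tag. This is a minimal illustration, not the shipped code: the `Persona` class, `resolve_persona`, and the choice of `shopify-pet` as the fallback are assumptions; the tags, dataset paths, pipeline filenames, and the `?p=` query parameter come from this plan.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs, urlparse


@dataclass(frozen=True)
class Persona:
    tag: str
    dataset: str
    pipeline: str


# Values copied from the persona table; the structure itself is hypothetical.
PERSONAS = {
    "shopify-pet": Persona(
        "shopify-pet",
        "samples/demo/shopify_pet_customers.csv",
        "shopify_pet_pipeline.json",
    ),
    "bookkeeper": Persona(
        "bookkeeper",
        "samples/demo/bookkeeper_bank_reconcile.csv",
        "bookkeeper_bank_pipeline.json",
    ),
    "revops": Persona(
        "revops",
        "samples/demo/agency_combined_leads.csv",
        "agency_leads_pipeline.json",
    ),
}

DEFAULT_TAG = "shopify-pet"  # assumption: fallback when ?p= is missing or unknown


def resolve_persona(url: str) -> Persona:
    """Pick the persona from the ?p=<tag> query parameter of a demo URL."""
    query = parse_qs(urlparse(url).query)
    tag = query.get("p", [DEFAULT_TAG])[0]
    return PERSONAS.get(tag, PERSONAS[DEFAULT_TAG])
```

Falling back to a default instead of raising keeps the "visitor never sees an empty state" rule intact even when a shared link carries a mistyped tag.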

4. Demo dataset specifications

Each dataset is intentionally small (20–30 rows) so the full pipeline runs in well under one second on Streamlit Community Cloud's free hardware. Each row looks like a plausible export row from that persona's tooling. Each dataset contains every kind of pollution the bundle's five tools fix, so a single demo run shows every tool earning its keep.

4.0 Pain-point coverage map

Each demo dataset is engineered so the buyer sees their own top pain demonstrated in the AFTER preview. The mapping below pairs each pain from PLAN.md §2.3a with the rows / columns that exercise it. Refresh the dataset only when this coverage drops.

| Persona | Pain (from PLAN §2.3a) | Demo coverage |
| --- | --- | --- |
| Shopify pet | S1 — Klaviyo per-contact dupes | 5 dup pairs across rows 1–15 (case + format + address-twin variants) |
| Shopify pet | S2 — feed-rejection chars | smart-quote / NBSP / BOM in rows 1–6, 9, 11 |
| Shopify pet | S3 — multi-channel | partner-style customer IDs (`SHOP-`); demonstration of column-level mapping covered in RevOps demo |
| Shopify pet | S4 — subscription identity | rows 1+2, 7+8, 9+10 — same person, different format |
| Shopify pet | S5 — VAT-MOSS country drift | rows 16–18 (`United Kingdom` / `U.K.` / `UK`) + rows 19–20 (`Germany`/`Italia`) |
| Bookkeeper | B1 — month-overlap re-import | 7 dup pairs spanning Jan↔Feb and Mar boundaries |
| Bookkeeper | B2 — 1099 vendor consolidation | Amazon × 3 spellings, Verizon × 2, Acme Realty × 2, Adobe × 2, Costco × 2, Zoom × 2, Stripe × 4 |
| Bookkeeper | B3 — audit trail | every cell change in the run logged with old/new/rule — surface in the demo's audit tab |
| Bookkeeper | B4 — per-license economics | demonstrated by pricing copy, not data |
| Bookkeeper | B5 — multi-currency | rows 26 (EUR), 27 (GBP), 28 (BRL with comma decimal), 29 (parens-negative) |
| RevOps | R1 — per-contact tier | 6 cross-source dup pairs (HubSpot × LinkedIn × Manual Scrape) |
| RevOps | R2 — deliverability | rows 26–27 (`uma at uniform dot com`, `victor@@victorco.com` invalid emails) |
| RevOps | R3 — GDPR / privacy | demonstrated by the network-tab moat panel + zero-upload claim |
| RevOps | R4 — vendor unification | 3 source values (HubSpot / LinkedIn / Manual Scrape), 13 country codes, mixed-shape headers |
| RevOps | R5 — suppression list | rows 29–30 (`Suppressed`, `Opted Out` tags) |

4.1 `shopify_pet_customers.csv` (20 rows)

Looks like: a Shopify customer export filtered for "Pet Supplies" sales channel, 12 months activity.

Pollution included:

Whitespace padding (" Alice ", "Sydney Opera House Drive ")
Mixed phone formats: (415) 555-1234, 415.555.1234, 5559876543, +1 555-111-1111
International phones: GB, ES, DE, AU, JP (15 demo rows span 6 countries)
Currency variants: $1,240.50, £890.25, €2.410,75 (EU comma decimal), A$ 1,299.00, ¥75000
Date formats: 2025-12-04, 12/15/2025, ?, (blank), (none), #N/A
Disguised nulls: N/A, blank, (blank), ?, #N/A, (none), unknown
Name casing: EVE MARTINEZ, henry, O'NEIL, noah, mixed Title / ALL CAPS / lower
Email case variants that should dedup: Bob@PetShop.com vs alice@petshop.com
4 fuzzy duplicates (Alice/Bob same address, Grace/Henry same phone, Carlos/Olivia same address, Ivy/Jack same address)

After running the pipeline: 20 rows → 15, ~29 cells canonicalized, ~45 sentinels standardised, 5 cross-row duplicates merged. The customer table is now Klaviyo-import-ready and the country column (previously UK / U.K. / United Kingdom / Germany / Italia) is GB / DE / IT — VAT MOSS report won't break.
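To make the "sentinels standardised" count concrete, here is a minimal sketch of disguised-null detection over the values listed above. It is an illustration only and not the engine's Missing Value Handler: `normalize_null` and `DISGUISED_NULLS` are hypothetical names, and the sketch assumes that "blank" in the list means an empty cell rather than the literal word.

```python
from typing import Optional

# Sentinels taken from the pollution list above, compared case-insensitively.
DISGUISED_NULLS = {"n/a", "(blank)", "?", "#n/a", "(none)", "unknown"}


def normalize_null(value: Optional[str]) -> Optional[str]:
    """Return None for empty or disguised-null cells, else the trimmed text."""
    if value is None:
        return None
    text = value.strip()
    if text == "" or text.casefold() in DISGUISED_NULLS:
        return None
    return text
```

Trimming before the comparison means padded sentinels such as `" N/A "` are caught in the same pass that removes whitespace padding from real values.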

4.2 `bookkeeper_bank_reconcile.csv` (30 rows)

Looks like: two months of business checking + credit-card activity exported from a bank portal, with the Feb export accidentally overlapping the Jan export at the month boundary.

Pollution included:

Mixed date formats: 01/15/2025, 2025-01-15, Jan 18 2025, 1/27/25, Feb 5 2025
Currency formats: -$129.99, ($89.50) parens-negative, +$3,450.00, - $599.88 space, bare -129.99, (50.00)
Header trailing whitespace: "Date "
Smart quotes around descriptions: "autopay"
Em-dash sentinels in Vendor: —
Smart-em-dash inside descriptions: STAPLES #4422 — paper, toner
Vendor casing inconsistency: Amazon / amazon.com / AMAZON.COM, Verizon / verizon
7 duplicate transactions (same date+amount+vendor recorded twice with different formats)

After running the pipeline: 30 rows → 23, ~84 cells normalized, 7 duplicates removed (month-overlap + VAT-MOSS dups). All dates ISO-formatted, all amounts numeric (including EUR/GBP/BRL with comma decimal), vendor casing canonical, parens-negative resolved.
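The amount formats listed above can all be reduced to one signed decimal. The sketch below is a simplified stand-in, not the Format Standardizer itself: `parse_amount` is a hypothetical helper that handles only the US-style dollar formats from the pollution list, and it deliberately leaves out the EUR/GBP/BRL comma-decimal cases.

```python
import re
from decimal import Decimal


def parse_amount(raw: str) -> Decimal:
    """Parse a US-style dollar amount; parentheses or a leading '-' mean negative."""
    text = raw.strip()
    negative = False
    if text.startswith("(") and text.endswith(")"):
        negative = True
        text = text[1:-1]
    text = text.replace(" ", "")  # handles the "- $599.88" spacing variant
    if text.startswith("-"):
        negative = True
        text = text[1:]
    elif text.startswith("+"):
        text = text[1:]
    text = text.lstrip("$").replace(",", "")  # drop symbol and thousands separators
    if not re.fullmatch(r"\d+(\.\d+)?", text):
        raise ValueError(f"unrecognized amount: {raw!r}")
    value = Decimal(text)
    return -value if negative else value
```

Using `Decimal` rather than `float` keeps cents exact, which matters when two differently formatted rows must compare equal during duplicate detection.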

4.3 `agency_combined_leads.csv` (30 rows)

Looks like: a marketing-ops worksheet combining lead exports from HubSpot + LinkedIn Sales Navigator + manual scraping, ready for campaign targeting.

Pollution included:

Phone formats per region: US, UK, Spain, Germany, China, India, Australia, Mexico, Israel, Singapore, Hong Kong, Italy, South Korea — 13 country codes
Country column inconsistent: USA / US / United States
Disguised nulls: N/A, unknown, (unknown), (blank), (none), ?, —, #N/A, TBD
Source column tags origin (HubSpot / LinkedIn / Manual Scrape)
Email duplicates across sources with case variants: alice@acme.com + Alice.Johnson@acme.com, bob@beta.com + Bob@Beta.com, diana@delta.com from two sources, carlos@gamma.io from two sources, Frank@Foxtrot.de + frank@foxtrot.de
Name casing: DIANA LEE, henry, IVY CHEN, mixed
6 fuzzy / cross-source duplicate pairs, each merged into a single surviving row by the dedup step
Score column with sentinel pollution that needs coercion to integer

After running the pipeline: 30 rows → 24, ~43 cells canonicalized, 14 sentinels resolved, 6 cross-source duplicates merged with merge=true so each survivor inherits the most-complete picture. Invalid-email rows (deliverability stress) and Suppressed/Opted Out tags (suppression-list use case) survive as flagged rows the operator manually reviews.
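The "survivor inherits the most-complete picture" behaviour can be sketched as a fill-the-blanks merge. This is an assumption-laden illustration, not the engine's dedup: `merge_by_email` is a hypothetical helper that matches only on case-insensitive email equality (so it would not catch fuzzy pairs such as alice@acme.com and Alice.Johnson@acme.com), and the sample values in the usage note are invented.

```python
def merge_by_email(rows):
    """Collapse rows sharing an email (case-insensitive); first row seen survives.

    Empty fields on the survivor are filled from later duplicates, so the
    survivor ends up with the most complete record.
    """
    survivors = {}
    for row in rows:
        key = row["email"].strip().casefold()
        if key not in survivors:
            survivors[key] = dict(row)  # copy so the input is not mutated
            continue
        kept = survivors[key]
        for field, value in row.items():
            if not kept.get(field) and value:
                kept[field] = value
    return list(survivors.values())
```

For example, a HubSpot row `{"email": "bob@beta.com", "name": "Bob", "phone": ""}` followed by a LinkedIn row `{"email": "Bob@Beta.com", "name": "", "phone": "+14155551234"}` would yield one row that keeps the HubSpot name and gains the LinkedIn phone.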

5. UX flow (per persona)

The demo is a single Streamlit page (likely src/gui/pages/0_Review.py repurposed for demo mode, or a dedicated app_demo.py for the cloud build).

┌──────────────────────────────────────────────────────────┐
│  DataTools — for {Persona}                               │
│  "{Persona-specific H1}"                                 │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  Sample dataset preloaded:  shopify_pet_customers.csv    │
│  [Replace with your own file (capped 100 rows)]          │
│                                                          │
│  ┌─ BEFORE preview (15 rows) ─────────────────────────┐  │
│  │ Alice  | (415) 555-1234 | $1,240.50 | …          │  │
│  │ Bob    | 415.555.1234   | $1,240.50 | …          │  │
│  │ ...                                              │  │
│  └──────────────────────────────────────────────────┘  │
│                                                          │
│  Pipeline (saved):                                       │
│  1. Text Clean    →  2. Format Standardize    →          │
│  3. Missing       →  4. Deduplicate                      │
│                                                          │
│  [▶ Run pipeline]                                        │
│                                                          │
│  ┌─ AFTER preview ───────────────────────────────────┐  │
│  │ 15 rows → 11 (4 duplicates merged)                │  │
│  │ 27 cells canonicalized · 33 sentinels resolved    │  │
│  │                                                    │  │
│  │ Alice Johnson  | +14155551234 | 1240.50 | …       │  │
│  │ ...                                                │  │
│  └──────────────────────────────────────────────────┘  │
│                                                          │
│  [Download cleaned CSV (sample, watermarked)]            │
│                                                          │
│  ┌──────────────────────────────────────────────────┐  │
│  │  Like what you see?                              │  │
│  │  Run this on YOUR 50,000-row export — locally.   │  │
│  │  No upload. Your data never leaves your machine. │  │
│  │  [Get DataTools — $49 →]                         │  │
│  └──────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘

Critical UX points:

Sample dataset is already loaded on page paint. Visitor never sees an empty state.
BEFORE table is shown side-by-side with AFTER once the run completes. Hidden-character toggle on by default so the visitor sees what was hidden in their data.
"Replace with your own file" is a secondary action below the BEFORE table — not the headline.
Per-step metrics are shown in the AFTER block: "27 cells canonicalized, 33 sentinels resolved, 4 duplicates merged." Numbers sell more than narrative.
Buy button is inside the AFTER block and above the fold when the run completes. Friction kills.

6. Free vs paid boundary

The demo runs the same code as the paid product. Caps are surface, not engine.

| Limit | Free demo | Paid (downloaded) |
| --- | --- | --- |
| Input rows | 100 | unlimited (1 GB+ via streaming) |
| File size | 5 MB | unlimited |
| Output | watermarked CSV ("DataTools demo — buy at " appended as last row) | clean CSV |
| Pipeline editor | locked to the persona-saved pipeline | full edit / save / load JSON |
| Save pipeline JSON | disabled | enabled |
| International | enabled | enabled |
| Audit log download | disabled | enabled |
| Tool 06–09 | as they ship | as they ship |

The watermark is a single trailing row, not an in-cell tag — so the demo's AFTER preview visibly reads as production-quality data, not "demo crippled" data.
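"Caps are surface, not engine" can be sketched as two thin wrappers around an unmodified engine: one trims the input, the other appends the watermark row on export. The function names are hypothetical and the watermark text is passed in by the caller; only the 100-row cap and the single-trailing-row rule come from this plan.

```python
import csv
import io

DEMO_ROW_CAP = 100  # input-row limit from the free-vs-paid table


def cap_rows(rows, row_cap=DEMO_ROW_CAP):
    """Keep only the first `row_cap` data rows; the engine itself stays uncapped."""
    return rows[:row_cap]


def to_watermarked_csv(header, rows, watermark_text):
    """Serialize cleaned rows, then append one trailing watermark row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    # The watermark sits in the first cell; remaining cells stay empty so the
    # row keeps the same column count as the data.
    writer.writerow([watermark_text] + [""] * (len(header) - 1))
    return buffer.getvalue()
```

Because neither wrapper touches cell contents, the AFTER preview shows exactly what the paid product would produce for the same rows.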

7. CTA copy (per persona)

7.1 Shopify pet operator

H1: Clean your customer / vendor / subscriber exports — locally.
Sub: Klaviyo-import-ready in 30 seconds. Catches duplicates Excel misses. Your data never leaves your computer.
CTA: Get DataTools for Shopify — $49 →

7.2 Bookkeeper / freelance accountant

H1: Reconcile messy bank exports. Hand your client an audit trail.
Sub: Catches the duplicate transaction Quickbooks imported twice. Standardizes dates, amounts, vendor casing. Every change auditable.
CTA: Get DataTools for Bookkeepers — $49 →

7.3 Marketing / RevOps agency

H1: Dedupe leads across HubSpot, LinkedIn, and manual scrapes.
Sub: International phones, country normalization, fuzzy dedup with merge — one tool, one schema, no upload.
CTA: Get DataTools for RevOps — $49 →

8. Telemetry / conversion tracking

Async + no-touch + free hosting limits what we can instrument. Use event-only counters, no PII:

| Event | Source | Aggregate-only field |
| --- | --- | --- |
| `demo.page_view` | landing page | persona tag |
| `demo.run_clicked` | demo page | persona tag |
| `demo.run_completed` | demo page | persona tag, rows_processed |
| `demo.cta_clicked` | demo page | persona tag |
| `gumroad.purchase` | Gumroad webhook | landing-page-source query param (`?from=shopify-pet`) |

Conversion = cta_clicked / run_completed. A demo-quality issue is indicated when run_completed / page_view falls below 30 % (visitors are not engaging).
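The two ratios can be computed directly from the aggregate counters. A minimal sketch, assuming the counters arrive as a plain dict keyed by event name; `funnel_ratios` and the `engagement` label are hypothetical, while the event names and the 30 % threshold come from this section.

```python
ENGAGEMENT_FLOOR = 0.30  # run_completed / page_view threshold from this plan


def funnel_ratios(counts):
    """Compute engagement and conversion from aggregate event counters."""
    views = counts.get("demo.page_view", 0)
    completed = counts.get("demo.run_completed", 0)
    clicks = counts.get("demo.cta_clicked", 0)
    # None marks "not enough data" instead of dividing by zero.
    engagement = completed / views if views else None
    conversion = clicks / completed if completed else None
    return {
        "engagement": engagement,
        "conversion": conversion,
        "engagement_low": engagement is not None and engagement < ENGAGEMENT_FLOOR,
    }
```

With 200 page views, 50 completed runs, and 5 CTA clicks, engagement is 0.25 (below the floor) and conversion is 0.1.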

Self-host counters on Cloudflare Pages (free, GDPR-friendly). No Google Analytics: it adds a privacy banner and conflicts with the "your data never leaves your computer" message.

9. Maintenance plan

Recurring: zero. The demo runs on the same engine the paid product ships, so any improvement to the engine improves the demo automatically. The pre-saved pipeline JSONs reference column names and tool names, both stable APIs.
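Since the saved pipelines depend on column names staying stable, a small drift check could catch a refreshed dataset that no longer matches its pipeline. The pipeline JSON layout used here (`steps`, each with `tool` and `columns`) is purely an assumption for illustration; the real schema is not described in this plan, and `missing_columns` is a hypothetical helper.

```python
import csv
import io
import json


def missing_columns(pipeline_json: str, csv_text: str) -> set:
    """Return column names the pipeline references that the CSV header lacks.

    Assumed (hypothetical) schema: {"steps": [{"tool": ..., "columns": [...]}]}
    """
    pipeline = json.loads(pipeline_json)
    header = next(csv.reader(io.StringIO(csv_text)))
    # Trim header padding such as "Date " before comparing names.
    available = {name.strip() for name in header}
    referenced = {
        column
        for step in pipeline["steps"]
        for column in step.get("columns", [])
    }
    return referenced - available
```

An empty result means every referenced column is still present; anything else is a signal to update the pipeline JSON or the dataset before the next deploy.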

Triggers for revisit:

| Trigger | Action |
| --- | --- |
| Streamlit Community Cloud rate-limits / sleeps too aggressively | Migrate to a $5–10/mo VPS (BUSINESS.md §9 contingency) |
| Demo dataset becomes stale (e.g. all phones standardize to no-op) | Refresh with a new pollution batch — don't change the persona |
| `run_completed / page_view < 30 %` for 4 consecutive weeks | Audit the demo: is the BEFORE preview showing the mess clearly? Is the AFTER too small to notice? |
| `cta_clicked / run_completed < 5 %` for 4 consecutive weeks | The demo is impressive but the CTA isn't earning trust — revise copy + add a screenshot of the network tab showing zero outbound calls (PLAN.md §2.4) |
| New tool ships (06–09) | Decide per persona whether to add it to that persona's saved pipeline. Not all tools belong on all personas |
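The two ratio-based triggers share one rule: the ratio must stay under its threshold for four weeks in a row. A minimal sketch, assuming weekly ratios are stored oldest-first in a list and that "4 consecutive weeks" means the four most recent ones; the function name is hypothetical.

```python
def below_for_consecutive_weeks(weekly_ratios, threshold, weeks=4):
    """True when the most recent `weeks` ratios are all below `threshold`."""
    if len(weekly_ratios) < weeks:
        return False  # not enough history to fire the trigger
    return all(ratio < threshold for ratio in weekly_ratios[-weeks:])
```

The same helper would serve both rows: pass `0.30` for `run_completed / page_view` and `0.05` for `cta_clicked / run_completed`.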

10. Build sequence (drops into PLAN.md week 2)

| Day | Action |
| --- | --- |
| 1 | Demo build of Streamlit app: 3 personas, switch via query param `?p=shopify-pet` |
| 2 | Pipeline JSONs wired in; row cap + watermark applied; download button |
| 3 | Deploy to Streamlit Community Cloud · 3 sub-paths or 3 separate apps |
| 4 | Persona landing pages: 3 static HTML pages on Cloudflare Pages, each with iframe embed of its persona demo + CTA |
| 5 | Telemetry counters wired (Cloudflare event API) · Gumroad webhook captures `?from=` |

End of day 5: three URLs the operator can drop into three different niche-community threads, each performing its own conversion math.
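The per-URL "conversion math" depends on attributing each purchase to the persona page that sent it. The sketch below assumes each purchase record exposes the URL that carried the `?from=` parameter; the actual Gumroad webhook payload is not described in this plan, so `purchases_by_source` and its input shape are hypothetical.

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse


def purchases_by_source(purchase_urls):
    """Count purchases per ?from=<persona> value; missing values count as 'unknown'."""
    counts = Counter()
    for url in purchase_urls:
        values = parse_qs(urlparse(url).query).get("from")
        counts[values[0] if values else "unknown"] += 1
    return counts
```

Only the persona tag is counted, which keeps the attribution aggregate-only and free of PII, in line with the telemetry rules in section 8.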

11. Anti-temptations (things the demo deliberately refuses)

No "try it on your data first" gate that requires email. The whole point is friction-free.
No "schedule a demo" CTA. Locked by no-touch.
No live chat widget. Same.
No A/B-test framework yet. Single-arm copy, ship it, iterate monthly. A/B requires statistical traffic the funnel doesn't have pre-PMF.
No watermark inside cells. The AFTER preview must look production-quality. Watermark goes on a single trailing row that's obviously the demo signature.
No animation / loader theatrics. Pipeline runs in <1 s; a fake-progress bar lies about speed.
