feat: 3 new tools, format streaming, distribution-ready demo + landing pages

Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler  src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper          src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner        src/core/pipeline.py + cli_pipeline.py + GUI
     with soft tool-dependency graph (recommended, not enforced) and
     JSON save/load for repeatable weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 /
  lying-BOM) via tied-confidence arbiter + Cyrillic / EE-Latin coverage
  probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche), shared CSS,
    deploy.py URL-substitution script, auto-generated robots.txt +
    sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md,
    NEXT-STEPS.md — full strategy + measurement + deployment + master
    checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0 xfailed

Tier-1 corpora added:
  • missing-corpus 3 use cases + 16 edge cases
  • column-mapper-corpus 3 use cases + 5 edge cases
  • format-cleaner intl 20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:31:26 +00:00
parent d18b95880d
commit 966af8ef94
89 changed files with 12039 additions and 284 deletions
--- a/docs/DEMO-PLAN.md
+++ b/docs/DEMO-PLAN.md
@@ -0,0 +1,332 @@
+# Demo Plan — DataTools
+
+> Creator-only. Implements PLAN.md §2.2 (the demo IS the product) and
+> §2.3 (niche down — three landing pages, one engine).
+> **Version**: 1.0 · **Adopted**: 2026-05-01 · **Owner**: Michael
+
+The hosted demo is the single highest-leverage marketing asset in the
+plan. This document defines exactly what loads, in what order, with
+what data, for which buyer — so the operator builds it once and never
+rebuilds it from a stale headline.
+
+## 1. Goals
+
+- Convert a cold visitor to a paid buyer in **under three minutes** of
+  active interaction.
+- Demonstrate the *full pipeline* (not one tool) on a dataset that
+  *looks like the visitor's own work* — not a toy CSV.
+- Survive with zero maintenance attention — once running, the demo
+  should keep working as the engine evolves (the pre-saved pipeline
+  JSONs use the same code path the paid product uses).
+- Provide a shareable artifact for niche-community posts (a public URL
+  the operator can drop into a subreddit reply with one sentence).
+
+## 2. Constraints (non-negotiable)
+
+| Constraint | Source | Implication |
+|---|---|---|
+| Free hosting at launch | BUSINESS.md §9 | Streamlit Community Cloud (1 GB RAM, sleeps after 7 days idle) |
+| No login | BUSINESS.md §7 | No email gate, no signup wall, no "create account to continue" |
+| Async / no-touch | DECISIONS.md §1 #8 | Cannot offer "schedule a demo with us" CTA |
+| Runs locally on paid product | BUSINESS.md §11 | Demo can't expose the same engine to abuse — needs row caps |
+| Friction kills conversion | BUSINESS.md §7 | Demo dataset preloaded; no "select a file" first-step |
+| < $1,200/mo recurring | BUSINESS.md §9 | Migration plan to $5–10/mo VPS only after rate-limit signal |
+
+## 3. The three personas (per PLAN.md §2.3)
+
+| Tag | Persona | Top-of-funnel keyword | Demo dataset | Pre-saved pipeline |
+|---|---|---|---|---|
+| `shopify-pet` | Shopify operator (priority: pet supplies) | "shopify customer cleanup" | `samples/demo/shopify_pet_customers.csv` | `shopify_pet_pipeline.json` |
+| `bookkeeper`   | Bookkeeper / freelance accountant | "reconcile bank export csv" | `samples/demo/bookkeeper_bank_reconcile.csv` | `bookkeeper_bank_pipeline.json` |
+| `revops`       | Marketing / RevOps agency | "dedupe lead list across vendors" | `samples/demo/agency_combined_leads.csv` | `agency_leads_pipeline.json` |
+
+Each persona gets its **own landing page URL**, its **own demo dataset
+loaded by default**, and its **own H1 + below-the-fold copy.** The
+engine is identical; only positioning differs.
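+
+The persona table can be read as a small lookup registry. A minimal
+sketch, assuming the demo resolves the `?p=` value through something
+like this; `Persona`, `PERSONAS`, `resolve_persona`, and the fallback
+to `shopify-pet` are illustrative choices, not the shipped code:
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class Persona:
+    tag: str       # value of the ?p= query parameter
+    dataset: str   # demo CSV loaded by default
+    pipeline: str  # pre-saved pipeline JSON
+
+
+# Rows copied from the persona table; the registry shape is hypothetical.
+PERSONAS = {
+    p.tag: p
+    for p in (
+        Persona("shopify-pet", "samples/demo/shopify_pet_customers.csv",
+                "shopify_pet_pipeline.json"),
+        Persona("bookkeeper", "samples/demo/bookkeeper_bank_reconcile.csv",
+                "bookkeeper_bank_pipeline.json"),
+        Persona("revops", "samples/demo/agency_combined_leads.csv",
+                "agency_leads_pipeline.json"),
+    )
+}
+
+
+def resolve_persona(tag, default="shopify-pet"):
+    """Return the persona for a ?p= value; unknown or missing tags use the default."""
+    key = (tag or "").strip().lower()
+    return PERSONAS.get(key, PERSONAS[default])
+```
+
+Falling back to a default rather than raising keeps the "visitor never
+sees an empty state" rule intact when a link carries a mistyped tag.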
+
+## 4. Demo dataset specifications
+
+Each dataset is intentionally small (20–30 rows) so the full pipeline
+runs in well under one second on Streamlit Community Cloud's free
+hardware. Each row is a *plausible-looking* export from that
+persona's tooling. Each contains every kind of pollution the bundle's
+five tools fix, so a single demo run shows every tool earning its
+keep.
+
+### 4.0 Pain-point coverage map
+
+Each demo dataset is engineered so the buyer sees their **own top
+pain** demonstrated in the AFTER preview. The mapping below pairs
+each pain from PLAN.md §2.3a with the rows / columns that exercise
+it. Refresh the dataset only when this coverage drops.
+
+| Persona | Pain (from PLAN §2.3a) | Demo coverage |
+|---|---|---|
+| Shopify pet | S1 — Klaviyo per-contact dupes | 5 dup pairs across rows 1–15 (case + format + address-twin variants) |
+| Shopify pet | S2 — feed-rejection chars | smart-quote / NBSP / BOM in rows 1–6, 9, 11 |
+| Shopify pet | S3 — multi-channel | partner-style customer IDs (`SHOP-`); demonstration of column-level mapping covered in RevOps demo |
+| Shopify pet | S4 — subscription identity | rows 1+2, 7+8, 9+10 — same person, different format |
+| Shopify pet | S5 — VAT-MOSS country drift | rows 16–18 (`United Kingdom` / `U.K.` / `UK`) + rows 19–20 (`Germany`/`Italia`) |
+| Bookkeeper | B1 — month-overlap re-import | 7 dup pairs spanning Jan↔Feb and Mar boundaries |
+| Bookkeeper | B2 — 1099 vendor consolidation | Amazon × 3 spellings, Verizon × 2, Acme Realty × 2, Adobe × 2, Costco × 2, Zoom × 2, Stripe × 4 |
+| Bookkeeper | B3 — audit trail | every cell change in the run logged with old/new/rule — surface in the demo's audit tab |
+| Bookkeeper | B4 — per-license economics | demonstrated by pricing copy, not data |
+| Bookkeeper | B5 — multi-currency | rows 26 (EUR), 27 (GBP), 28 (BRL with comma decimal), 29 (parens-negative) |
+| RevOps | R1 — per-contact tier | 6 cross-source dup pairs (HubSpot × LinkedIn × Manual Scrape) |
+| RevOps | R2 — deliverability | rows 26–27 (`uma at uniform dot com`, `victor@@victorco.com` invalid emails) |
+| RevOps | R3 — GDPR / privacy | demonstrated by the network-tab moat panel + zero-upload claim |
+| RevOps | R4 — vendor unification | 3 source values (HubSpot / LinkedIn / Manual Scrape), 13 country codes, mixed-shape headers |
+| RevOps | R5 — suppression list | rows 29–30 (`Suppressed`, `Opted Out` tags) |
+
+### 4.1 `shopify_pet_customers.csv` (20 rows)
+
+**Looks like**: a Shopify customer export filtered for "Pet Supplies"
+sales channel, 12 months activity.
+
+**Pollution included**:
+- Whitespace padding ("  Alice  ", "Sydney Opera House Drive ")
+- Mixed phone formats: `(415) 555-1234`, `415.555.1234`, `5559876543`,
+  `+1 555-111-1111`
+- International phones: GB, ES, DE, AU, JP (15 demo rows span 6
+  countries)
+- Currency variants: `$1,240.50`, `£890.25`, `€2.410,75` (EU comma
+  decimal), `A$ 1,299.00`, `¥75000`
+- Date formats: `2025-12-04`, `12/15/2025`, `?`, `(blank)`, `(none)`,
+  `#N/A`
+- Disguised nulls: `N/A`, blank, `(blank)`, `?`, `#N/A`, `(none)`,
+  `unknown`
+- Name casing: `EVE MARTINEZ`, `henry`, `O'NEIL`, `noah`, mixed Title /
+  ALL CAPS / lower
+- Email case variants that *should* dedup: `Bob@PetShop.com` vs
+  `alice@petshop.com`
+- 4 fuzzy duplicates (Alice/Bob same address, Grace/Henry same phone,
+  Carlos/Olivia same address, Ivy/Jack same address)
+
+**After running the pipeline**: 20 rows → 15, ~29 cells canonicalized,
+~45 sentinels standardised, 5 cross-row duplicates merged. The
+customer table is now Klaviyo-import-ready and the country column
+(previously `UK` / `U.K.` / `United Kingdom` / `Germany` / `Italia`)
+is GB / DE / IT — VAT MOSS report won't break.
+
+### 4.2 `bookkeeper_bank_reconcile.csv` (30 rows)
+
+**Looks like**: two months of business checking + credit-card activity
+exported from a bank portal, with the Feb export accidentally
+overlapping the Jan export at the month boundary.
+
+**Pollution included**:
+- Mixed date formats: `01/15/2025`, `2025-01-15`, `Jan 18 2025`,
+  `1/27/25`, `Feb 5 2025`
+- Currency formats: `-$129.99`, `($89.50)` parens-negative,
+  `+$3,450.00`, `- $599.88` space, bare `-129.99`, `(50.00)`
+- Header trailing whitespace: `"Date "`
+- Smart quotes around descriptions: `"autopay"`
+- Em-dash sentinels in Vendor: `—`
+- Smart-em-dash inside descriptions: `STAPLES #4422 — paper, toner`
+- Vendor casing inconsistency: `Amazon` / `amazon.com` / `AMAZON.COM`,
+  `Verizon` / `verizon`
+- 7 duplicate transactions (same date+amount+vendor recorded twice
+  with different formats)
+
+**After running the pipeline**: 30 rows → 23, ~84 cells normalized, 7
+duplicates removed (month-overlap + VAT-MOSS dups). All dates
+ISO-formatted, all amounts numeric (including EUR/GBP/BRL with comma
+decimal), vendor casing canonical, parens-negative resolved.
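+
+To make "parens-negative resolved" concrete, here is a minimal sketch
+of the sign handling for the US-style amounts listed above. It is not
+the engine's implementation: `parse_amount` is a hypothetical helper,
+and it assumes `,` is a thousands separator, so the EUR/BRL
+comma-decimal rows would need the locale-aware path instead.
+
+```python
+import re
+from decimal import Decimal
+
+
+def parse_amount(text):
+    """Parse amounts like '($89.50)', '- $599.88', '+$3,450.00' into a Decimal."""
+    s = text.strip()
+    # Accounting style: a value wrapped in parentheses is negative.
+    negative = s.startswith("(") and s.endswith(")")
+    if negative:
+        s = s[1:-1].strip()
+    # A leading minus sign (optionally followed by a space) is also negative.
+    if s.startswith("-"):
+        negative = True
+    # Keep digits and the decimal point; drops '$', '+', ',', and spaces.
+    digits = re.sub(r"[^0-9.]", "", s)
+    if not digits:
+        raise ValueError(f"no numeric content in {text!r}")
+    value = Decimal(digits)
+    return -value if negative else value
+```
+
+Using `Decimal` instead of `float` keeps cents exact, which matters
+when the output feeds a reconciliation.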
+
+### 4.3 `agency_combined_leads.csv` (30 rows)
+
+**Looks like**: a marketing-ops worksheet combining lead exports from
+HubSpot + LinkedIn Sales Navigator + manual scraping, ready for
+campaign targeting.
+
+**Pollution included**:
+- Phone formats per region: US, UK, Spain, Germany, China, India,
+  Australia, Mexico, Israel, Singapore, Hong Kong, Italy, South
+  Korea — 13 country codes
+- Country column inconsistent: `USA` / `US` / `United States`
+- Disguised nulls: `N/A`, `unknown`, `(unknown)`, `(blank)`, `(none)`,
+  `?`, `—`, `#N/A`, `TBD`
+- Source column tags origin (`HubSpot` / `LinkedIn` / `Manual Scrape`)
+- Email duplicates across sources with case variants: `alice@acme.com`
+  + `Alice.Johnson@acme.com`, `bob@beta.com` + `Bob@Beta.com`,
+  `diana@delta.com` from two sources, `carlos@gamma.io` from two
+  sources, `Frank@Foxtrot.de` + `frank@foxtrot.de`
+- Name casing: `DIANA LEE`, `henry`, `IVY CHEN`, mixed
+- 6 fuzzy / cross-source duplicates designed to be caught by the dedup
+- Score column with sentinel pollution that needs coercion to integer
+
+**After running the pipeline**: 30 rows → 24, ~43 cells canonicalized,
+14 sentinels resolved, 6 cross-source duplicates merged with `merge=true`
+so each survivor inherits the most-complete picture. Invalid-email
+rows (deliverability stress) and `Suppressed`/`Opted Out` tags
+(suppression-list use case) survive as flagged rows the operator
+manually reviews.
+
+## 5. UX flow (per persona)
+
+The demo is a single Streamlit page (likely
+`src/gui/pages/0_Review.py` repurposed for demo mode, or a
+dedicated `app_demo.py` for the cloud build).
+
+```
+┌──────────────────────────────────────────────────────────┐
+│  DataTools — for {Persona}                               │
+│  "{Persona-specific H1}"                                 │
+├──────────────────────────────────────────────────────────┤
+│                                                          │
+│  Sample dataset preloaded:  shopify_pet_customers.csv    │
+│  [Replace with your own file (capped 100 rows)]          │
+│                                                          │
+│  ┌─ BEFORE preview (20 rows) ─────────────────────────┐  │
+│  │ Alice  | (415) 555-1234 | $1,240.50 | …          │  │
+│  │ Bob    | 415.555.1234   | $1,240.50 | …          │  │
+│  │ ...                                              │  │
+│  └──────────────────────────────────────────────────┘  │
+│                                                          │
+│  Pipeline (saved):                                       │
+│  1. Text Clean    →  2. Format Standardize    →          │
+│  3. Missing       →  4. Deduplicate                      │
+│                                                          │
+│  [▶ Run pipeline]                                        │
+│                                                          │
+│  ┌─ AFTER preview ───────────────────────────────────┐  │
+│  │ 20 rows → 15 (5 duplicates merged)                │  │
+│  │ ~29 cells canonicalized · ~45 sentinels resolved  │  │
+│  │                                                    │  │
+│  │ Alice Johnson  | +14155551234 | 1240.50 | …       │  │
+│  │ ...                                                │  │
+│  └──────────────────────────────────────────────────┘  │
+│                                                          │
+│  [Download cleaned CSV (sample, watermarked)]            │
+│                                                          │
+│  ┌──────────────────────────────────────────────────┐  │
+│  │  Like what you see?                              │  │
+│  │  Run this on YOUR 50,000-row export — locally.   │  │
+│  │  No upload. Your data never leaves your machine. │  │
+│  │  [Get DataTools — $49 →]                         │  │
+│  └──────────────────────────────────────────────────┘  │
+└──────────────────────────────────────────────────────────┘
+```
+
+**Critical UX points**:
+- Sample dataset is *already loaded* on page paint. Visitor never
+  sees an empty state.
+- BEFORE table is shown side-by-side with AFTER once the run
+  completes. Hidden-character toggle on by default so the visitor
+  *sees* what was hidden in their data.
+- "Replace with your own file" is a secondary action below the BEFORE
+  table — not the headline.
+- Per-step metrics are shown in the AFTER block: "~29 cells
+  canonicalized, ~45 sentinels resolved, 5 duplicates merged." Numbers
+  sell more than narrative.
+- Buy button is **inside** the AFTER block and **above the fold** when
+  the run completes. Friction kills.
+
+## 6. Free vs paid boundary
+
+The demo runs the **same code** as the paid product. Caps are surface,
+not engine.
+
+| Limit | Free demo | Paid (downloaded) |
+|---|---|---|
+| Input rows | 100 | unlimited (1 GB+ via streaming) |
+| File size | 5 MB | unlimited |
+| Output | watermarked CSV ("DataTools demo — buy at <url>" appended as last row) | clean CSV |
+| Pipeline editor | locked to the persona-saved pipeline | full edit / save / load JSON |
+| Save pipeline JSON | disabled | enabled |
+| International | enabled | enabled |
+| Audit log download | disabled | enabled |
+| Tool 06–08 | as they ship | as they ship |
+
+The watermark is a **single trailing row**, not an in-cell tag — so
+the demo's AFTER preview *visibly* reads as production-quality data,
+not "demo crippled" data.
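+
+A minimal sketch of "caps are surface, not engine", assuming the cap
+and watermark are applied only when the demo serialises its download.
+`demo_csv` and `DEMO_ROW_CAP` are hypothetical names; the watermark
+text follows the table above.
+
+```python
+import csv
+import io
+
+DEMO_ROW_CAP = 100  # free-demo row limit from the table above
+
+
+def demo_csv(rows, fieldnames, buy_url, row_cap=DEMO_ROW_CAP):
+    """Serialise at most row_cap rows, then append one watermark row."""
+    buf = io.StringIO()
+    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
+    writer.writeheader()
+    writer.writerows(rows[:row_cap])
+    # Watermark: a single trailing row, text in the first column only,
+    # so no data cell is altered.
+    watermark = {name: "" for name in fieldnames}
+    watermark[fieldnames[0]] = f"DataTools demo — buy at {buy_url}"
+    writer.writerow(watermark)
+    return buf.getvalue()
+```
+
+Because the engine output is untouched until this last step, the paid
+build can skip the wrapper without any change to the pipeline code.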
+
+## 7. CTA copy (per persona)
+
+### 7.1 Shopify pet operator
+
+- **H1**: *Clean your customer / vendor / subscriber exports — locally.*
+- **Sub**: *Klaviyo-import-ready in 30 seconds. Catches duplicates Excel
+  misses. Your data never leaves your computer.*
+- **CTA**: *Get DataTools for Shopify — $49 →*
+
+### 7.2 Bookkeeper / freelance accountant
+
+- **H1**: *Reconcile messy bank exports. Hand your client an audit
+  trail.*
+- **Sub**: *Catches the duplicate transaction QuickBooks imported twice.
+  Standardizes dates, amounts, vendor casing. Every change auditable.*
+- **CTA**: *Get DataTools for Bookkeepers — $49 →*
+
+### 7.3 Marketing / RevOps agency
+
+- **H1**: *Dedupe leads across HubSpot, LinkedIn, and manual scrapes.*
+- **Sub**: *International phones, country normalization, fuzzy dedup
+  with merge — one tool, one schema, no upload.*
+- **CTA**: *Get DataTools for RevOps — $49 →*
+
+## 8. Telemetry / conversion tracking
+
+Async + no-touch + free hosting limits what we can instrument. Use
+event-only counters, no PII:
+
+| Event | Source | Aggregate-only field |
+|---|---|---|
+| `demo.page_view` | landing page | persona tag |
+| `demo.run_clicked` | demo page | persona tag |
+| `demo.run_completed` | demo page | persona tag, rows_processed |
+| `demo.cta_clicked` | demo page | persona tag |
+| `gumroad.purchase` | Gumroad webhook | landing-page-source query param (`?from=shopify-pet`) |
+
+Conversion = `cta_clicked / run_completed`. Demo-quality issue surfaces
+when `run_completed / page_view` < 30 % (visitors not engaging).
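+
+The two ratios can be computed from the aggregate counters alone. A
+minimal sketch, assuming weekly totals are already available as plain
+integers; `funnel_report` is a hypothetical helper, the 30 % floor is
+the one stated above, and the 5 % floor comes from the §9 trigger
+table. It does not implement the "4 consecutive weeks" rule.
+
+```python
+def funnel_report(page_views, runs_completed, cta_clicks,
+                  engagement_floor=0.30, conversion_floor=0.05):
+    """Compute engagement and conversion ratios and flag the §9 triggers."""
+    # Guard against division by zero on a week with no traffic.
+    engagement = runs_completed / page_views if page_views else 0.0
+    conversion = cta_clicks / runs_completed if runs_completed else 0.0
+    return {
+        "engagement": engagement,   # run_completed / page_view
+        "conversion": conversion,   # cta_clicked / run_completed
+        "audit_demo": engagement < engagement_floor,
+        "revise_cta": conversion < conversion_floor,
+    }
+```
+
+For example, 1,000 page views, 250 completed runs, and 20 CTA clicks
+give 25 % engagement (below the floor) and 8 % conversion (above it).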
+
+Self-host counters on Cloudflare Pages (free, GDPR-friendly). No
+Google Analytics — adds privacy banner, conflicts with the "your data
+never leaves your computer" message.
+
+## 9. Maintenance plan
+
+**Recurring**: zero. The demo runs on the same engine the paid
+product ships, so any improvement to the engine improves the demo
+automatically. The pre-saved pipeline JSONs reference column names
+and tool names, both stable APIs.
+
+**Triggers for revisit**:
+
+| Trigger | Action |
+|---|---|
+| Streamlit Community Cloud rate-limits / sleeps too aggressively | Migrate to a $5–10/mo VPS (BUSINESS.md §9 contingency) |
+| Demo dataset becomes stale (e.g. all phones standardize to no-op) | Refresh with a new pollution batch — *don't change the persona* |
+| `run_completed / page_view < 30 %` for 4 consecutive weeks | Audit the demo: is the BEFORE preview showing the mess clearly? Is the AFTER too small to notice? |
+| `cta_clicked / run_completed < 5 %` for 4 consecutive weeks | The demo is impressive but the CTA isn't earning trust — revise copy + add a screenshot of the network tab showing zero outbound calls (PLAN.md §2.4) |
+| New tool ships (06–08) | Decide *per persona* whether to add it to that persona's saved pipeline. Not all tools belong on all personas |
+
+## 10. Build sequence (drops into PLAN.md week 2)
+
+| Day | Action |
+|---|---|
+| 1 | Demo build of Streamlit app: 3 personas, switch via query param `?p=shopify-pet` |
+| 2 | Pipeline JSONs wired in; row cap + watermark applied; download button |
+| 3 | Deploy to Streamlit Community Cloud · 3 sub-paths or 3 separate apps |
+| 4 | Persona landing pages: 3 static HTML pages on Cloudflare Pages, each with iframe embed of its persona demo + CTA |
+| 5 | Telemetry counters wired (Cloudflare event API) · Gumroad webhook captures `?from=` |
+
+End of day 5: three URLs the operator can drop into three different
+niche-community threads, each performing its own conversion math.
+
+## 11. Anti-temptations (things the demo deliberately refuses)
+
+- **No "try it on your data first" gate that requires email.** The
+  whole point is friction-free.
+- **No "schedule a demo" CTA.** Locked by no-touch.
+- **No live chat widget.** Same.
+- **No A/B-test framework yet.** Single-arm copy, ship it, iterate
+  monthly. A/B requires statistical traffic the funnel doesn't have
+  pre-PMF.
+- **No watermark inside cells.** The AFTER preview must look
+  production-quality. Watermark goes on a single trailing row that's
+  obviously the demo signature.
+- **No animation / loader theatrics.** Pipeline runs in <1 s; a
+  fake-progress bar lies about speed.
--- a/docs/DEPLOYMENT.md
+++ b/docs/DEPLOYMENT.md
@@ -0,0 +1,236 @@
+# Deployment — demo + landing pages
+
+> One page. Two services. ~30 minutes from "code complete" to
+> "URL the user can hit." Every step here is from-scratch reproducible
+> on a clean laptop.
+> **Version**: 1.0 · **Adopted**: 2026-05-01
+
+This doc covers the **two distribution surfaces** that ship to public
+URLs: the Streamlit demo (the iframe target) and the Cloudflare Pages
+landing pages (the marketing surface that embeds it).
+
+The *paid* product — PyInstaller installers, code-signing, Gumroad
+listing — is covered in `docs/NEXT-STEPS.md`.
+
+---
+
+## Part 1 · Deploy the demo (Streamlit Community Cloud — free)
+
+### A. Pre-flight (one-time, ~2 min)
+
+You need a free [Streamlit Community Cloud](https://streamlit.io/cloud)
+account. Sign in with the GitHub account that hosts this repo.
+
+### B. Deploy (~5 min, mostly waiting for the Cloud build)
+
+1. **Push the repo to GitHub** (private or public — both work). The
+   important files are at the **repo root**:
+
+   - `streamlit_app.py` — Cloud auto-detects this; nothing to configure
+   - `requirements.txt` — Cloud installs from this
+   - `.streamlit/config.toml` — Cloud honours this
+   - `samples/demo/*.csv` + `*_pipeline.json` — the demo's data
+   - `src/` — the engine
+
+2. In Streamlit Community Cloud → **New app**:
+   - Repository: your fork
+   - Branch: `main`
+   - Main file path: `streamlit_app.py` (the default — leave it)
+   - App URL: `datatools-demo` (or any free subdomain)
+   - **Deploy**
+
+3. First build is 2–3 min while Cloud installs `pandas`, `phonenumbers`,
+   `rapidfuzz`, etc. Subsequent deploys are < 30 s.
+
+### C. Verify
+
+Open the deployed URL. Append `?p=shopify-pet` to the URL bar —
+the persona-specific demo loads. Try `?p=bookkeeper` and
+`?p=revops` to confirm all three personas route correctly. Click
+**Run pipeline**; the AFTER preview should appear within ~1 second.
+
+### D. The output URL
+
+The deployed URL is what feeds into `landing/deploy.config.json` →
+`demo_base_url`. Omit the trailing slash. For example:
+
+    https://datatools-demo.streamlit.app
+
+### E. Migration trigger
+
+Per `BUSINESS.md` §9 / `DEMO-PLAN.md` §9, migrate to a $5–10/mo VPS
+when:
+
+- Streamlit Community Cloud rate-limits / sleeps too aggressively, OR
+- the demo crosses ~5 k page-views/month (free-tier capacity)
+
+The migration is one command if you containerise:
+`docker run -p 8501:8501 -v $(pwd):/app python:3.12-slim …`
+
+---
+
+## Part 2 · Deploy the landing pages (Cloudflare Pages — free)
+
+### A. Pre-flight (one-time, ~5 min)
+
+You need:
+
+- A Cloudflare account (free) and a domain (any registrar) with
+  nameservers pointed at Cloudflare. **OR** skip the custom domain
+  step and use the auto-generated `*.pages.dev` URL.
+- A Gumroad listing URL (placeholder until your account is set up —
+  use `https://gumroad.com/l/datatools` and update it later).
+
+### B. Build the deploy-ready bundle (~30 sec)
+
+```bash
+# One-time: copy the template
+cp landing/deploy.config.example.json landing/deploy.config.json
+# Edit it with your real URLs (any text editor works)
+"${EDITOR:-nano}" landing/deploy.config.json
+# Build
+python3 landing/deploy.py
+# → produces landing/dist/
+```
+
+`landing/deploy.config.json` is **gitignored**; your real URLs never
+hit the repo.
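+
+A minimal sketch of how the two config values could turn into the
+per-persona URLs that Part 2 §E verifies. This is not the code in
+`landing/deploy.py`: `load_config` and `build_urls` are hypothetical
+helpers, and the sketch assumes `gumroad_listing` holds a full URL
+with no query string of its own.
+
+```python
+import json
+from pathlib import Path
+from urllib.parse import urlencode
+
+
+def load_config(path="landing/deploy.config.json"):
+    """Read the gitignored deploy config."""
+    return json.loads(Path(path).read_text(encoding="utf-8"))
+
+
+def build_urls(config, persona):
+    """Derive the iframe and CTA URLs for one persona tag."""
+    # Tolerate a trailing slash so the result never contains '//?'.
+    demo_base = config["demo_base_url"].rstrip("/")
+    return {
+        "iframe_src": f"{demo_base}/?{urlencode({'p': persona})}",
+        "cta_href": f"{config['gumroad_listing']}?{urlencode({'from': persona})}",
+    }
+```
+
+With `demo_base_url` set to `https://datatools-demo.streamlit.app` and
+the persona `shopify-pet`, the CTA comes out as
+`https://gumroad.com/l/datatools?from=shopify-pet`.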
+
+### C. Deploy (~3 min)
+
+Two paths — pick one:
+
+**Drag-and-drop (zero CLI):**
+
+1. Cloudflare Pages dashboard → **Create project** → **Direct Upload**
+2. Drag `landing/dist/` into the upload zone
+3. Project name: `datatools` (becomes `datatools.pages.dev`)
+4. Click **Deploy**
+
+**Wrangler CLI (one command, scriptable):**
+
+```bash
+npm install -g wrangler          # one-time
+wrangler login                   # one-time
+wrangler pages deploy landing/dist
+```
+
+### D. Custom domain (~5 min, optional)
+
+Pages dashboard → your project → **Custom domains** → add
+`datatools.app` (or whichever apex domain you registered). Cloudflare
+auto-issues TLS. Once propagated:
+
+- `https://datatools.app/` → apex chooser
+- `https://datatools.app/shopify-pet/` → Shopify landing
+- `https://datatools.app/bookkeeper/` → Bookkeeper landing
+- `https://datatools.app/revops/` → RevOps landing
+
+### E. Verify
+
+For each persona:
+
+1. Open the persona URL.
+2. Confirm the demo iframe loads (the URL inside it points at the
+   Streamlit demo from Part 1).
+3. Click "Run pipeline" inside the iframe → AFTER preview appears.
+4. Click the "Get DataTools" button → opens Gumroad with the
+   correct `?from=<persona>` query (verify in the URL bar).
+
+If the iframe shows "Refused to connect", check Cloudflare Pages →
+**Settings** → **Functions** for any CSP that disallows Streamlit's
+domain. (Default Pages config does not set CSP, so this is rarely an
+issue.)
+
+---
+
+## Part 3 · Updates
+
+The cycle is:
+
+```bash
+# 1) Edit code or copy (any text editor works)
+"${EDITOR:-nano}" "landing/<persona>/index.html"
+"${EDITOR:-nano}" src/gui/app_demo.py
+
+# 2) Rebuild landing
+python3 landing/deploy.py
+
+# 3) Re-deploy landing
+wrangler pages deploy landing/dist
+
+# 4) Re-deploy demo
+git push origin main
+# (Streamlit Cloud auto-deploys on push)
+```
+
+Both surfaces deploy in under 5 minutes end-to-end.
+
+---
+
+## Part 4 · Sanity checks (post-deploy, ~3 min)
+
+Run these once, then trust the build (per `POST-LAUNCH.md` §6):
+
+```bash
+# Landing pages serve and reference the right demo URL
+curl -s https://datatools.app/ | grep -c persona-card
+# → 3 (one per persona card)
+
+curl -s https://datatools.app/shopify-pet/ | grep -c "datatools-demo"
+# → ≥1 (iframe src points at your demo)
+
+# Demo responds and routes the persona param
+curl -s "https://datatools-demo.streamlit.app/?p=shopify-pet" | grep -c "Shopify"
+# → ≥1
+
+# Sitemap lists all 4 pages
+curl -s https://datatools.app/sitemap.xml | grep -c "<url>"
+# → 4
+```
+
+---
+
+## Part 5 · Cost ceiling check
+
+| Service | Tier | Cost | Cap |
+|---|---|---|---|
+| Cloudflare Pages | Free | $0 | 500 builds/month, unlimited bandwidth |
+| Streamlit Community Cloud | Free | $0 | 1 GB RAM, sleeps after 7 days idle |
+| Custom domain | Cloudflare or registrar | ~$15/year | n/a |
+| GitHub | Free for private repos with limited collaborators | $0 | n/a |
+| **Total ongoing** | | **~$1.25/mo** (domain only) | |
+
+Well inside the `BUSINESS.md` §9 cap of $1,200/mo recurring. The
+$5–10/mo VPS migration is a contingency only — don't pre-build it.
+
+---
+
+## Troubleshooting
+
+**Streamlit Cloud build fails with "ModuleNotFoundError: src.core"**
+
+`streamlit_app.py` puts the repo root on `sys.path` before invoking
+the demo module — but only if the file is at the repo root. Confirm
+`streamlit_app.py` lives at `/streamlit_app.py`, not nested in a
+folder.
+
+**Cloudflare Pages deploy succeeds but persona pages 404**
+
+The directory layout is preserved by `deploy.py`. Confirm your
+`landing/dist/` has `shopify-pet/index.html`, etc. — not just three
+flat files. If you used drag-and-drop, drag the **directory**, not
+its contents.
+
+**The iframe shows "X-Frame-Options denied"**
+
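+Streamlit Community Cloud allows iframe embedding by default. If
+you've migrated to a self-hosted demo behind a reverse proxy, remove
+the `X-Frame-Options` header from the demo's responses. `ALLOWALL` is
+not a valid value for that header; to restrict embedding to the
+landing domain, send a `Content-Security-Policy` header with
+`frame-ancestors https://datatools.app` instead.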
+
+**Gumroad URL has no `?from=` parameter when clicked**
+
+The `?from=` query param is added by the landing-page CTA, not by
+Gumroad. If it's missing, the landing-page HTML wasn't substituted —
+re-run `python3 landing/deploy.py` and re-deploy.
--- a/docs/NEXT-STEPS.md
+++ b/docs/NEXT-STEPS.md
@@ -0,0 +1,319 @@
+# Next Steps — from "code complete" to first paying customer
+
+> Creator-only. The runnable checklist that takes the operator from
+> the current state (1,729 tests passing, 6 tools shipped, 0 paying
+> customers) through launch and into the first 90 days.
+> **Version**: 1.0 · **Adopted**: 2026-05-01
+
+This document is the **single answer** to "what now?". Every line
+item has an owner, a time estimate, a blocker, a cost, and the
+external dependency that makes it un-shippable today. Items are
+ordered by **must-finish-before-the-next-item** — work top-down.
+
+Cross-references:
+- Strategy: `PLAN.md` (the 8 strategic moves + the 90-day sequence)
+- Demo specs: `DEMO-PLAN.md`
+- Deployment mechanics: `DEPLOYMENT.md`
+- Post-launch measurement: `POST-LAUNCH.md`
+- Locked criteria: `DECISIONS.md` §1
+
+Status legend:
+- **🟢** Done — the asset exists in this repo
+- **🟡** Buildable now — no external dependency needed
+- **🟠** External dependency — needs an account / signup / payment
+- **🔴** Manual / requires user input that can't be automated
+
+---
+
+## Phase 0 · What's already done (skip ahead)
+
+| ✓ | Item | Where it lives |
+|---|------|----------------|
+| 🟢 | 6 of 9 tools shipped (Dedup, Text, Format, Missing, Column-Map, Pipeline) | `src/core/`, `src/cli_*.py`, `src/gui/pages/` |
+| 🟢 | Pipeline Runner (the retention multiplier per `PLAN.md` §2.6) | `src/core/pipeline.py`, `src/cli_pipeline.py`, `src/gui/pages/9_Pipeline_Runner.py` |
+| 🟢 | 1,729 passing tests · 0 skipped · 0 xfailed | `tests/` |
+| 🟢 | 3 niche demo datasets + pre-tuned pipeline JSONs | `samples/demo/` |
+| 🟢 | Streamlit demo app + Cloud entry shim | `streamlit_app.py`, `src/gui/app_demo.py` |
+| 🟢 | 3 niche landing pages + apex chooser + shared CSS | `landing/` |
+| 🟢 | Landing-page deploy script (URL-substitution + sitemap + 404 + favicon) | `landing/deploy.py` |
+| 🟢 | Strategic plan + demo plan + post-launch measurement plan + deployment doc | `docs/PLAN.md`, `DEMO-PLAN.md`, `POST-LAUNCH.md`, `DEPLOYMENT.md` |
+
+---
+
+## Phase 1 · Stand the funnel up (target: end of week 1, ~6 hours total work)
+
+The bottleneck right now is **distribution, not feature count**.
+Everything in this phase is about turning code into a URL the user
+can hit.
+
+### 1.1 — 🟠 Push to GitHub (5 min)
+
+| | |
+|---|---|
+| **What** | `git init` (if not already), commit, push to a private or public GitHub repo. |
+| **Why** | Cloud deploy services need a Git source. Streamlit Community Cloud auto-deploys on push to `main`. |
+| **External dependency** | A GitHub account (free). |
+| **Cost** | $0. |
+| **Blocked by** | Nothing. |
+
+### 1.2 — 🟠 Deploy the demo to Streamlit Community Cloud (15 min)
+
+| | |
+|---|---|
+| **What** | Follow `DEPLOYMENT.md` Part 1. Result: a public URL like `https://datatools-demo.streamlit.app`. |
+| **Why** | The landing pages embed this in their iframe. Without it, the demo iframe on every landing page fails to load. |
+| **External dependency** | Free Streamlit Community Cloud account, signed in via GitHub. |
+| **Cost** | $0. |
+| **Blocked by** | 1.1 (the repo must be on GitHub). |
+| **Watch out for** | First build takes 2–3 min while Cloud installs deps. Subsequent deploys < 30 s. |
+
+### 1.3 — 🟠 Buy the apex domain (5 min, ~$15/year)
+
+| | |
+|---|---|
+| **What** | Register `datatools.app` (or whichever) at any registrar. Point the nameservers at Cloudflare. |
+| **Why** | The landing-page canonical URLs and CTA buttons refer to this domain. Pages can deploy to a free `*.pages.dev` URL first if you want to defer this. |
+| **External dependency** | A registrar account; payment method. |
+| **Cost** | ~$15/year. Within `BUSINESS.md` §9 cost cap. |
+| **Blocked by** | Nothing — can run in parallel with 1.1 / 1.2. |
+
+### 1.4 — 🟠 Deploy the landing pages to Cloudflare Pages (15 min)
+
+| | |
+|---|---|
+| **What** | Follow `DEPLOYMENT.md` Part 2. Run `python3 landing/deploy.py` with the operator's URLs in `deploy.config.json`, then `wrangler pages deploy landing/dist` (or drag-drop). |
+| **Why** | This is the marketing surface. Three persona URLs go live as soon as it deploys. |
+| **External dependency** | Free Cloudflare account; Wrangler CLI (optional — drag-drop works too). |
+| **Cost** | $0. |
+| **Blocked by** | 1.2 (the demo URL goes into `deploy.config.json`); ideally 1.3 for the custom domain. |
+| **Watch out for** | The `deploy.config.json` file is gitignored — your real URLs never get committed. |
+
+### 1.5 — 🟠 Open a Gumroad listing (15 min) **— stub for now**
+
+| | |
+|---|---|
+| **What** | Create a Gumroad account, draft a listing with a single screenshot + the landing-page copy, set price to $49. Don't enable purchases yet — leave it as a draft. |
+| **Why** | The CTA buttons on the landing pages link to `gumroad.com/l/datatools?from=<persona>`. Until the listing exists, those buttons 404. |
+| **External dependency** | Free Gumroad account; Stripe-connected payout method (defer to Phase 2). |
+| **Cost** | $0 to draft, ~10% per sale once live. |
+| **Blocked by** | Nothing — can run in parallel with 1.1–1.4. |
+| **Watch out for** | The listing URL must be `gumroad.com/l/datatools` to match the landing-page hard-coded CTAs. If you pick a different slug, update `landing/deploy.config.json` → `gumroad_listing` and re-run `deploy.py`. |
+
+### 1.6 — 🟡 End-to-end smoke verification (10 min)
+
+| | |
+|---|---|
+| **What** | Run the four `curl` commands from `DEPLOYMENT.md` Part 4. Together they cover all four landing pages, all three demo personas, and `sitemap.xml`. |
+| **Why** | First time something can break is the moment a real user hits it. Ten minutes of `curl` saves a week of "why is conversion zero." |
+| **External dependency** | None. |
+| **Cost** | $0. |
+| **Blocked by** | 1.4 + 1.2. |
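+
+As a complement to the `curl` commands, the same check can be scripted. The sketch below is illustrative only: the two base URLs are the example domains used in this checklist, and it assumes the demo's `?p=<persona>` keys match the landing-page slugs. Substitute the real values from `deploy.config.json`.
+
+```python
+from urllib.error import HTTPError
+from urllib.request import urlopen
+
+# Assumed values: replace with the operator's real URLs.
+LANDING_BASE = "https://datatools.app"
+DEMO_BASE = "https://datatools-demo.streamlit.app"
+# Assumption: demo persona keys are the same as the landing-page slugs.
+PERSONAS = ["shopify-pet", "bookkeeper", "revops"]
+
+
+def smoke_urls(landing=LANDING_BASE, demo=DEMO_BASE, personas=PERSONAS):
+    """Return every URL the smoke check should hit (4 pages, 3 demos, sitemap)."""
+    urls = [f"{landing}/"]
+    urls += [f"{landing}/{p}/" for p in personas]
+    urls += [f"{demo}/?p={p}" for p in personas]
+    urls.append(f"{landing}/sitemap.xml")
+    return urls
+
+
+def status_of(url, timeout=10):
+    """Return the HTTP status code, or None if the host cannot be reached."""
+    try:
+        with urlopen(url, timeout=timeout) as resp:
+            return resp.status
+    except HTTPError as err:
+        return err.code  # e.g. 404 for a page that did not deploy
+    except OSError:
+        return None  # DNS failure, refused connection, timeout
+```
+
+Looping over `smoke_urls()` and printing `status_of(url)` next to each URL gives a one-screen report; anything other than 200 is worth a look before the first community post.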
+
+---
+
+## Phase 2 · Make it sellable (target: end of week 2)
+
+### 2.1 — 🟠 Apple Developer Program enrollment (5 min to start, 1–2 weeks lead)
+
+| | |
+|---|---|
+| **What** | Per `BUSINESS.md` §10. Required for code-signing the macOS installer. |
+| **External dependency** | Apple ID + government-issued ID (individual) or D-U-N-S number (org). |
+| **Cost** | $99/year. |
+| **Blocked by** | Nothing — start ASAP because of the 1–2 week approval window. The pipeline waits on this; nothing else does. |
+
+### 2.2 — 🟡 PyInstaller spec + cross-platform build (1–3 days first time)
+
+| | |
+|---|---|
+| **What** | A `build/datatools.spec` that bundles the Streamlit GUI + all 6 tools + samples into one app. Mac `.dmg`, Windows `.exe` installer, Linux AppImage. |
+| **Why** | The buyer's deliverable. Without this, there is nothing to attach to the Gumroad listing. |
+| **External dependency** | None for Linux/Mac builds. Windows builds need a Windows machine or a CI matrix runner. |
+| **Cost** | $0 (GitHub Actions matrix runners are free for public repos). |
+| **Blocked by** | Nothing for the spec; 2.1 for the signed Mac build. |
+| **Watch out for** | Streamlit's bundle size lands around 300–500 MB per `DECISIONS.md` §4c — accepted tradeoff. |
+
+### 2.3 — 🟡 macOS sign + notarize (30 min once Apple Dev is approved)
+
+| | |
+|---|---|
+| **What** | Sign the `.dmg`, submit to Apple's notarization service, staple the ticket. |
+| **Why** | Without it, Gatekeeper hard-blocks the install with no obvious way out (per `BUSINESS.md` §10). The buyer gives up. |
+| **External dependency** | Apple Developer Program (2.1). |
+| **Cost** | $0 incremental over 2.1. |
+| **Blocked by** | 2.1 + 2.2. |
+
+### 2.4 — 🔴 Refund policy + license + Gumroad listing copy (1 hour)
+
+| | |
+|---|---|
+| **What** | A clear refund policy (14-day no-questions per the FAQ already on the landing pages) + a software licence text + the Gumroad listing description. |
+| **Why** | Required by Gumroad's terms; surfaces on the listing page; protects against buyer disputes. |
+| **External dependency** | None — operator authoring. |
+| **Cost** | $0. |
+| **Blocked by** | Nothing. |
+| **Hint** | Most of the copy is already in the landing pages' FAQ section — paste it into Gumroad. |
+
+### 2.5 — 🟠 Activate the Gumroad listing (15 min)
+
+| | |
+|---|---|
+| **What** | Upload the cross-platform installers from 2.2/2.3, paste the copy from 2.4, set $49 price, enable purchases, configure Stripe payout. |
+| **Why** | This is the "buy" button finally working. |
+| **External dependency** | Gumroad + Stripe account; the installers from 2.2/2.3. |
+| **Cost** | ~10 % per sale. |
+| **Blocked by** | 2.2, 2.3, 2.4. |
+
+---
+
+## Phase 3 · First-traffic ignition (target: end of week 4)
+
+Per `PLAN.md` §3 and `BUSINESS.md` §7 channel priorities. The strict
+no-touch constraint of `DECISIONS.md` §1 #8 makes channel choice
+matter — these are the only ones that fit.
+
+### 3.1 — 🔴 First niche-community post (30 min)
+
+| | |
+|---|---|
+| **What** | One value-first post in one niche-relevant community (e.g. r/shopify, IndieHackers Shopify chat, a Slack/Discord that allows it). Lead with the demo URL, not the buy URL. |
+| **Why** | Marketplaces alone don't drive discovery. Communities are the only first-touch channel that works under no-touch. |
+| **External dependency** | Account in the chosen community; understand its self-promotion rules. |
+| **Cost** | $0. |
+| **Blocked by** | 1.2 (demo URL must work). |
+| **Hint** | Pick the persona whose community the operator knows best. Don't try all three at once; see the `POST-LAUNCH.md` §2 "decide ONE thing" rule. |
+
+### 3.2 — 🟡 First long-tail SEO blog post (4–6 hours)
+
+| | |
+|---|---|
+| **What** | One 800–1,500-word post on `datatools.app/blog/` (sub-route of Cloudflare Pages or Substack) targeting one niche keyword from `BUSINESS.md` §7. Topic: a real problem you've encountered, the cleanup steps, the demo URL at the end. |
+| **Why** | Compounding asset — `BUSINESS.md` §2 says SEO pays in 6–18 months, not week 1. Don't mistake it for an early-stage channel. |
+| **External dependency** | None. |
+| **Cost** | $0. |
+| **Blocked by** | Nothing. |
+
+### 3.3 — 🟡 Cloudflare Web Analytics + event counters (45 min)
+
+| | |
+|---|---|
+| **What** | Enable Cloudflare Web Analytics on the Pages project (one click). Add a tiny inline `<script>` to each landing page that fires `cta_clicked` when the buy button is hit, before redirecting. Per `POST-LAUNCH.md` §1. |
+| **Why** | Without this, the post-launch checklist is unrunnable. |
+| **External dependency** | Cloudflare account (already from 1.4). |
+| **Cost** | $0. |
+| **Blocked by** | 1.4. |
+| **Hint** | The Gumroad webhook captures `?from=<persona>` automatically — no extra wiring. |
+
+### 3.4 — 🟡 Email autoresponder (post-purchase delivery + 3-touch onboarding) (2–3 hours)
+
+| | |
+|---|---|
+| **What** | Gumroad's built-in delivery email plus three follow-up emails (day 1, day 7, day 14): "are you running into X?", "here's an advanced trick", "save your pipeline as JSON for next week". |
+| **Why** | Increases activation, reduces refund risk, surfaces support questions while volume is small. |
+| **External dependency** | Gumroad delivery is built-in. The 3-touch sequence needs a free email service (Resend's free tier or Mailchimp's free tier). |
+| **Cost** | $0–$30/month per `BUSINESS.md` §9. |
+| **Blocked by** | 2.5. |
+
+---
+
+## Phase 4 · First-buyer trigger and review
+
+Per `PLAN.md` §4 decision triggers and `POST-LAUNCH.md` §4.
+
+### 4.1 — 🟢 Run the monthly review (30 min, first Monday after launch)
+
+| | |
+|---|---|
+| **What** | Follow `POST-LAUNCH.md` §2 — pull last-30-days demo events + Gumroad sales + refunds, compute the five numbers, decide ONE change. |
+| **Why** | Without this discipline, the funnel drifts and the operator changes 5 things at once and learns nothing. |
+| **External dependency** | None — analytics from 3.3, sales from 2.5. |
+| **Cost** | $0. |
+| **Blocked by** | 3.3 + 2.5. |
+
+### 4.2 — 🟢 First paying customer (target: 90 days)
+
+| | |
+|---|---|
+| **What** | The actual first sale. |
+| **Why** | Per `BUSINESS.md` §6: validates the funnel; not the business. |
+| **Trigger action** | Continue; no plan change. Next target: the first $1k/month by month 6. |
+
+### 4.3 — 🔴 Zero-paid-in-90-days fallback (only fires if 4.2 doesn't)
+
+| | |
+|---|---|
+| **What** | Per `POST-LAUNCH.md` §4 — audit the funnel, not the features. Run a 1-week outbound experiment to 30 niche contacts as a control (per `BUSINESS.md` §8 the no-touch revisit is allowed below $5k MRR if it produces signal). |
+| **Why** | Distinguishes "no reach" from "no conversion" — they need different fixes. |
+| **External dependency** | Operator's time. |
+| **Cost** | The 10 hr/wk allocation already exists; this displaces other work. |
+| **Blocked by** | The 90-day calendar trigger from 4.2. |
+
+---
+
+## Phase 5 · Steady state — what NOT to build
+
+Per `PLAN.md` §5 (anti-temptations) and `DECISIONS.md` §8 (re-lock
+triggers). The trap is treating "more code" as the answer when the
+data says "more reach" or "more conversion." The five forbidden
+moves until $5k/mo MRR:
+
+| | Why locked |
+|---|---|
+| ❌ More tools (06–08) | `PLAN.md` §2.1 distribution-gate. Tool 09 was the exception; no others until first paid customer + one external review. |
+| ❌ SaaS pivot | `DECISIONS.md` §4 — recurring infra conflicts with the lifestyle constraint. |
+| ❌ Live chat / sales calls | `DECISIONS.md` §1 #8 — no-touch is locked until $5k/mo. |
+| ❌ Custom integrations / one-off consulting | Breaks "build once, sell many." |
+| ❌ Going broad on personas | `PLAN.md` §5 — "all small businesses" converts at 1 %; vertical converts at 5–15 %. |
+
+---
+
+## Triage table — what blocks what
+
+```
+Phase 1 (week 1)              Phase 2 (week 2)         Phase 3 (week 4)
+┌──────────────┐              ┌──────────────┐         ┌──────────────┐
+│ 1.1 Push GH  │──────────┐   │ 2.1 Apple    │ ───┐    │ 3.1 Community│
+│ 1.2 Demo     │──┐       ├──▶│   Dev (1-2w) │    │    │ 3.2 SEO post │
+│ 1.3 Domain   │  │       │   │ 2.2 Build    │ ───┤    │ 3.3 Analytics│
+│ 1.4 Pages    │◀─┘       │   │ 2.3 Sign     │ ───┤    │ 3.4 Emails   │
+│ 1.5 Gumroad  │──────────┘   │ 2.4 Copy     │    │    └──────────────┘
+│ 1.6 Verify   │              │ 2.5 Activate │ ◀──┘
+└──────────────┘              └──────────────┘            ↓
+                                                  ┌──────────────┐
+                                                  │ 4.1 Monthly  │
+                                                  │ 4.2 First $  │
+                                                  │ 4.3 Fallback │
+                                                  └──────────────┘
+```
+
+The longest blocking path runs through **2.1 Apple Developer Program**
+(1–2 weeks of approval lead time). Start it on day 1 of week 1: it gates
+the signed Mac build (2.3) and therefore listing activation (2.5), and
+all of Phase 1 can proceed while you wait.
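+
+The dependency reasoning can be checked mechanically. The sketch below transcribes the hard "Blocked by" entries for a subset of tasks into a graph (soft "ideally" dependencies are left out) and asks what waits on a given task. The helper name `downstream_of` is illustrative, not part of the repo.
+
+```python
+from graphlib import TopologicalSorter
+
+# task -> set of prerequisites, transcribed from the "Blocked by" rows.
+BLOCKED_BY = {
+    "1.2": {"1.1"},
+    "1.4": {"1.2"},
+    "1.6": {"1.2", "1.4"},
+    "2.3": {"2.1", "2.2"},
+    "2.5": {"2.2", "2.3", "2.4"},
+    "3.3": {"1.4"},
+    "3.4": {"2.5"},
+    "4.1": {"2.5", "3.3"},
+}
+
+
+def downstream_of(task, graph=BLOCKED_BY):
+    """Return every task that directly or transitively waits on `task`."""
+    waiting = set()
+    frontier = [task]
+    while frontier:
+        current = frontier.pop()
+        for candidate, prereqs in graph.items():
+            if current in prereqs and candidate not in waiting:
+                waiting.add(candidate)
+                frontier.append(candidate)
+    return waiting
+
+
+# One valid execution order; graphlib raises CycleError if the tables
+# ever describe a circular dependency.
+order = list(TopologicalSorter(BLOCKED_BY).static_order())
+```
+
+Calling `downstream_of("2.1")` lists the tasks that cannot finish until Apple approves the enrollment, which is why it is worth starting first.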
+
+---
+
+## Time estimate — total operator time
+
+| Phase | Hours | Wall-clock |
+|---|---|---|
+| Phase 1 | ~1 hour | end of week 1 (mostly waiting for builds) |
+| Phase 2 | ~1 day | end of week 2 (gated by Apple Dev approval) |
+| Phase 3 | ~7–10 hours | week 3–4 |
+| Phase 4 | 30 min/month | ongoing |
+| **Total to launch** | **~12 hours of operator time** | **~14 days wall-clock** |
+
+Well inside the 10 hr/wk constraint of `DECISIONS.md` §1 #2.
+
+---
+
+## The thing that decides whether the plan works
+
+Not the build. Not the deploy. Not even the first sale.
+
+**The discipline of running the monthly review** in Phase 4 — and the
+"decide ONE thing per month" rule from `POST-LAUNCH.md` §2 — is what
+separates "this product exists" from "this product compounds." Every
+feature added before the funnel is measured is a guess; every change
+made after the monthly review is informed.
+
+Don't skip 4.1.
--- a/docs/PLAN.md
+++ b/docs/PLAN.md
@@ -0,0 +1,220 @@
+# Strategic Plan — DataTools
+
+> Creator-only. Locks the "what next" in light of the locked criteria
+> (DECISIONS.md §1) and the v1.6 honest status (BUSINESS.md §13).
+> **Version**: 1.0 · **Adopted**: 2026-05-01 · **Owner**: Michael
+
+This document is the active plan, derived from the strategic review of
+2026-05-01. It compresses the eight strategic moves and a 90-day
+execution sequence onto one page so the next decision (build vs.
+ship vs. market) has a single reference.
+
+It is **not** a re-lock of operating criteria — those still live in
+DECISIONS.md and have not changed. This plan is downstream of those
+criteria; if a move below conflicts with §1 of Decisions, the criteria
+win.
+
+## 1. Frame
+
+**Locked context** (BUSINESS.md, DECISIONS.md):
+
+- Niche Python automation tools, $49–79 single / $149 suite.
+- Cash budget ≤ $1,200/mo recurring · Time ≤ 10 hr/wk · No external funding.
+- Async + no-touch sales (revisit at $5k/mo MRR).
+- Marketplace-first distribution (Gumroad / Lemon Squeezy).
+- Streamlit GUI + CLI dual interface, runs locally.
+- Lifestyle cashflow goal (no exit needed).
+
+**Honest current state** (2026-05-01):
+
+| Asset | State |
+|---|---|
+| Tools 1–5 (Dedup, Text Clean, Format Standardize, Missing, Column Mapper) | Ready · 1,691 tests passing · 0 xfailed |
+| Tools 6–9 (Outlier, Multi-File Merge, Validator, Pipeline) | Coming Soon |
+| PyInstaller installer pipeline | Not started |
+| macOS code signing (Apple Dev Program) | Not started |
+| Hosted browser demo (Streamlit Cloud) | Not deployed |
+| Landing page | Not live |
+| Marketplace listing (Gumroad) | Not listed |
+| Paying customers | 0 |
+
+**Diagnosis**: the bottleneck is not feature count — it's distribution.
+The next $1 of value comes from closing the gap between "code-complete"
+and "buyer-pulls-out-card", not from tool 6.
+
+## 2. The eight strategic moves
+
+Numbered moves. Each is consistent with locked criteria.
+
+### 2.1 Freeze new-tool development (one exception). Ship what exists.
+
+Tools 6–8 are blocked behind a **distribution gate**: no work on them
+until the existing 5 tools have a paying customer + one external review
+(BUSINESS.md §4 sequence rule, applied recursively inside the bundle).
+
+**Exception granted 2026-05-01**: Tool 09 Pipeline Runner is built
+*now*. Rationale: the pipeline transforms the bundle from "5 tools you
+buy" into "an automatable workflow you depend on." That conversion is
+what produces retention and word-of-mouth — the only marketing channel
+that scales under the no-network/no-touch constraint.
+
+### 2.2 The demo *is* the product. Make it embarrassingly good.
+
+- Three persona-tagged sample datasets, not one generic CSV: Shopify
+  customers / bookkeeper bank export / agency lead list.
+- Run the *full pipeline* on the sample (Review → Dedup → Text Clean →
+  Format → Missing → Column Map). Free version caps **output rows**,
+  not the experience.
+- Embed the demo as an **iframe on the landing page** (not "click to
+  open"). Friction kills conversion.
+- Persistent CTA after demo: *"Run this on your own 50 k-row file →
+  buy for $49 →"* directly above the Gumroad button.
+
+### 2.3 Niche down. Stop selling "data cleaning."
+
+One engine, three landing pages:
+
+| Persona | Landing-page lead | Demo dataset |
+|---|---|---|
+| Shopify operator (priority: pet supplies) | "Clean your customer / vendor / subscriber exports" | uc01_shopify_customer_list |
+| Bookkeeper / freelance accountant | "Reconcile bank exports + vendor lists. Auditable changes." | uc06_bank_export_overlap |
+| Marketing / RevOps agency | "Dedupe lead lists. Standardize phones across vendors." | uc13_combined_lead_sources |
+
+Generic copy competes with `pip install pandas`. Vertical copy
+competes with nothing.
+
+### 2.3a Top pain points per niche
+
+The "what does this actually fix?" question. Each pain point below is
+sourced from operator-domain knowledge of these markets and the
+buyer-use-case research already captured in `BUSINESS.md §4a`. Pain
+points are ranked by **frequency × dollar impact** for that persona —
+high-frequency / high-cost pains lead the landing-page copy and the
+demo dataset.
+
+> **Validation gap (honest disclaimer)**: these pains are derived from
+> operator knowledge of the categories, not from a sample of buyer
+> interviews. Per `BUSINESS.md §8` (no-touch constraint review at $5k/mo
+> MRR), validate the top-3 per persona via 5 buyer interviews before the
+> first $200 of paid acquisition spend. If any pain ranks below the
+> assumed level, swap it for the next-highest in this list.
+
+#### Shopify operator (priority: pet supplies)
+
+| # | Pain | $ / time impact | Tools that fix it |
+|---|------|-----------------|---|
+| S1 | **Klaviyo / Mailchimp / Omnisend per-contact billing.** Subscriber list with 10–18 % duplicate rate (case drift, plus signs in Gmail addresses, multiple devices) → recurring overpay forever. | $30–300/mo per percent of dupes on a 50 k list — recurring | Dedup + Format Standardize (email canonicalization) + Pipeline (re-run weekly) |
+| S2 | **Product feed rejected by Google Merchant Center / Meta Catalog.** Smart quotes in titles, NBSP in SKU, inconsistent attributes; campaign launch delayed 24–72 h while feed gets fixed. | 1–3 days delayed launch × campaign value | Text Cleaner + Format Standardize |
+| S3 | **Multi-channel order consolidation.** Shopify + Etsy + Amazon + Faire + wholesale spreadsheet, each with a different column for "customer email" / "order total" / "ship country". | 4–8 hr / month manually merging | Column Mapper + Dedup + Pipeline |
+| S4 | **Subscription identity fragmentation.** Pet-box subscribers cancel and re-sub under a different email; cohort analysis says churn is 20 % when it's actually 12 % — pricing decisions wrong. | Mis-priced LTV → over- or under-paid acquisition | Dedup with `merge=true` survivor |
+| S5 | **International tax / VAT MOSS compliance.** Country column is `UK` / `U.K.` / `United Kingdom` / `GB` in the same export; VAT report breaks. Phone formats per region break call-center routing. | Compliance penalty risk + ops friction | Format Standardize (per-row country) + Column Mapper |
+
+#### Bookkeeper / freelance accountant
+
+| # | Pain | $ / time impact | Tools that fix it |
+|---|------|-----------------|---|
+| B1 | **Bank-export month-overlap re-import.** Same transaction posts twice when Jan and Feb exports overlap at the boundary; client's books understate cash by 1–4 %. | 2–4 hr / month / client + reconciliation errors | Dedup with explicit Date+Amount+fuzzy Vendor strategy |
+| B2 | **QBO / Xero vendor consolidation for 1099 reports.** "Amazon" / "amazon.com" / "AMAZON.COM*4F2X9" become 3 vendors; 1099 reports break, P&L by vendor unusable. | 1–2 hr / 1099 cycle + IRS-paper-trail risk | Format Standardize (name canonicalization) + Dedup |
+| B3 | **Liability / professional indemnity.** Cannot use AI tools that don't show their work; client audit response window is 24–48 h. | Per-firm liability premium ≈ $500–2,500 / yr | Audit log built into every tool — every change row-logged |
+| B4 | **Per-license-not-per-client economics.** Most cleanup tools are per-seat / per-client SaaS; bookkeepers managing 10–30 clients hit price walls fast. | $30/mo × N clients vs. $49 once | Desktop license, no per-client constraint |
+| B5 | **Multi-currency books.** US-domiciled clients with EU customers; comma-decimal amounts (`€1.234,56`) crash standard parsers; parens-negative (`($89.50)`) treated as positive. | 30–60 min per multi-currency client per month | Format Standardize (`currency_decimal=auto`, parens-negative) |
+
+#### Marketing / RevOps agency
+
+| # | Pain | $ / time impact | Tools that fix it |
+|---|------|-----------------|---|
+| R1 | **HubSpot / Marketo / Iterable per-contact tier pricing.** 10 k contacts → enterprise tier at $4–8 k/mo. Every duplicate is a recurring tax. | $200–800 / month per 1 k duplicate contacts — recurring | Dedup with cross-source merge + Pipeline |
+| R2 | **Email-deliverability / sender reputation.** Sending to invalid or duplicate addresses tanks reputation; recovery takes weeks. | Catastrophic — entire email programme degraded | Format Standardize (email canonicalization) + Missing (sentinel detection) |
+| R3 | **GDPR / contact-data privacy.** Uploading lead data to a third-party cleaning SaaS is itself a GDPR concern; legal review blocks adoption. | Compliance risk + 4–8 wk legal-review delay | Local-only desktop app, zero outbound calls |
+| R4 | **Multi-vendor lead-source unification.** Apollo, ZoomInfo, LinkedIn Sales Nav, manual scrapes — each export has different headers, scoring, country format. | 1–3 days per campaign of manual unification | Column Mapper (alias matching) + Format Standardize (per-row country) + Dedup |
+| R5 | **Suppression-list management across 5+ platforms.** Each platform has its own format; un-deduped suppression lists let opt-outs slip through, triggering CAN-SPAM / GDPR exposure. | Compliance risk + churn-back cost | Pipeline saved as JSON, re-run on each new suppression batch |
+
+### 2.4 Operationalize the moat the docs already name.
+
+Three durable advantages, each promoted from buried feature to
+landing-page H1:
+
+- **Quality**: 1 GB international standardization in ~2.5 minutes,
+  locally. Excel can't do this; OpenRefine fights you for an hour.
+- **Privacy**: "Your data never leaves this computer." Already in the
+  GUI footer — promote to landing-page lead, screenshot the empty
+  network tab.
+- **Update cadence**: ship a v1.1 patch within 30 days of v1.0 launch.
+  Not features — *evidence* the product is alive. "Added Czech Republic
+  phone format support" beats "no updates in 6 months" every time.
+
+### 2.5 Surface the audit-trail feature in sales copy.
+
+Every tool has a structured audit log. Most cleaning tools do not.
+Bookkeepers and consultants get fired if they can't show a client
+what changed. The audit feature is currently invisible on every
+proposed landing page and should be the **second-largest callout**,
+right after "runs locally."
+
+Copy seed: *"Every change auditable. Hand the audit CSV to your client
+with the cleaned file."*
+
+### 2.6 The Pipeline Runner is the retention multiplier.
+
+A buyer with a saved pipeline isn't a one-off purchase — they're a
+recurring user who recommends the product. This is exactly the
+behavioural lever the no-touch constraint needs (DECISIONS.md §8
+trigger). Build it now (see §2.1 exception).
+
+### 2.7 Add a $199 "priority support" tier post-launch.
+
+Same code, async-email SLA (24 h response). Targets the bookkeeper /
+consultant persona whose own time is $300/hr. Zero new product work,
+~3× ARPU on 5–10 % of buyers. Lock the SLA to **async only** so the
+no-touch constraint isn't violated. Defer until $5 k/mo MRR (the same
+trigger DECISIONS.md §8 already names).
+
+### 2.8 Dependency-aware pipeline UX.
+
+Tools have soft execution-order preferences (Text Clean before Format
+Standardize, Format before Dedup, Missing before Dedup). The Pipeline
+Runner *recommends* the order, *warns* on reversals, and **never
+forces** — the user owns their workflow. Implementation: see
+`src/core/pipeline.py` `SOFT_DEPENDENCIES`.
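+
+A minimal sketch of the recommend / warn / never-force behaviour follows. The shape of `SOFT_DEPENDENCIES` shown here (a list of `(earlier, later)` pairs), the tool keys, and both function names are assumptions for illustration; the real structure in `src/core/pipeline.py` may differ.
+
+```python
+import warnings
+
+# Assumed shape: (earlier_tool, later_tool) pairs taken from the three
+# ordering preferences named above. Tool keys are hypothetical.
+SOFT_DEPENDENCIES = [
+    ("text_clean", "format_standardize"),
+    ("format_standardize", "dedup"),
+    ("missing", "dedup"),
+]
+
+
+def order_warnings(steps, deps=SOFT_DEPENDENCIES):
+    """Return one message per recommended ordering that `steps` reverses."""
+    position = {name: i for i, name in enumerate(steps)}
+    messages = []
+    for earlier, later in deps:
+        if earlier in position and later in position:
+            if position[earlier] > position[later]:
+                messages.append(f"'{earlier}' usually runs before '{later}'")
+    return messages
+
+
+def run_pipeline(steps):
+    """Warn about reversals, then keep the user's order unchanged."""
+    for message in order_warnings(steps):
+        warnings.warn(message, UserWarning)
+    return list(steps)  # placeholder for executing each tool in turn
+```
+
+The key design point is that `run_pipeline` never reorders or rejects `steps`; a reversal only produces a warning.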
+
+## 3. 90-day execution sequence
+
+| Week | Action | Done when |
+|---|---|---|
+| 1 | PyInstaller pipeline · Mac/Win unsigned installers · Apple Dev Program enrollment (1–2 wk lead) | `dist/datatools-mac.dmg` and `dist/datatools-win.exe` install on a clean machine |
+| 2 | Demo deployed to Streamlit Cloud · landing page v1 with embedded demo · 3 persona datasets in the demo | Public URL serves a working pipeline run on a sample dataset in < 30 s |
+| 3 | Gumroad listing live · share value-first in 3 niche communities (no pitch) · 1 long-tail SEO post for the lead persona | First listing impression captured · post not removed for self-promotion |
+| 4 | Pipeline Runner v1.0 shipped (this week, 2026-05-01 — exception per §2.1) · v1.1 patch announced with Tool 09 + intl improvements | Pipeline saves/loads JSON · 3 demo pipelines preloaded |
+| 5–8 | Bookkeeper landing page · agency landing page · second tool's promo cycle · priority-support tier added (defer purchase until §2.7 trigger) | Three live landing pages with distinct H1, demo dataset, conversion target |
+| 9–13 | Tool 06–08 only **if** revenue trajectory supports continued investment · otherwise more market work on the existing 5 + 09 | Decision made on 13 Aug 2026 with revenue data, not feature ambition |
+
+## 4. Decision triggers (re-evaluation prompts)
+
+These flip the plan, not the underlying criteria:
+
+| Trigger | Reaction |
+|---|---|
+| First paying customer in week 4–13 | Continue. Plan is working. |
+| **Zero** paid in 90 days | Audit the funnel. Demo conversion? Niche fit? Price? Don't add features. |
+| $5 k/mo MRR | DECISIONS.md §8 trigger fires: revisit async + priority-support tier. |
+| Marketplace policy / shutdown | Switch to own-domain Stripe immediately; landing pages are already self-hosted. |
+| Streamlit hard direction change | Low-probability re-lock per DECISIONS.md §8. Tk fallback is documented. |
+
+## 5. Anti-temptations (things the plan refuses)
+
+- **More tools before more buyers.** Locked. Exception only for Pipeline Runner per §2.1.
+- **SaaS pivot.** Recurring infra conflicts with the lifestyle constraint (DECISIONS.md §4).
+- **Live chat / sales calls.** Conflicts with no-touch (DECISIONS.md §1 #8).
+- **Custom integrations / one-off consulting.** $300/hr looks tempting; breaks the "build once, sell many" model that justifies the entire strategy.
+- **Going broad on personas.** "All small businesses" is a generic landing page that converts at 1 %; "Shopify pet-supply operators with 1k–50k customers" converts at 5–15 % in the right communities.
+
+## 6. What this plan deliberately leaves open
+
+- Whether tools 06–08 ever ship. Decided on revenue, not roadmap.
+- Whether to add a fourth niche landing page. Decided on which of the
+  three is producing.
+- Whether to invest in own-domain SEO. Compounding 6–18 mo asset; not
+  the early-stage channel. Revisit when marketplace + community
+  produces baseline traffic to optimise.
+- Whether to add a Notion / Slack support community. If support volume
+  per 100 sales > 10 (BUSINESS.md §12 target), revisit; else leave async-email only.
--- a/docs/POST-LAUNCH.md
+++ b/docs/POST-LAUNCH.md
@@ -0,0 +1,158 @@
+# Post-launch — 90-day measurement plan
+
+> Creator-only. The other half of `PLAN.md`: PLAN tells you what to
+> build, this tells you what to measure once it's live and which
+> numbers trigger which actions.
+> **Version**: 1.0 · **Adopted**: 2026-05-01 · **Owner**: Michael
+
+This is a runnable monthly checklist, not analytics theatre. Every
+metric below has a **threshold** and an **action**. If you're not
+willing to execute the action when the threshold trips, drop the
+metric — measuring without responding is busywork.
+
+## 1. The five numbers that matter
+
+Every other dashboard, chart, or vanity stat is downstream of these
+five. The funnel is short on purpose; pre-PMF traffic doesn't have
+the resolution to support more.
+
+| # | Metric | How to compute | Threshold | When tripped |
+|---|--------|----------------|-----------|--------------|
+| 1 | **Persona engagement** | `demo.run_completed / demo.page_view` per persona | < 30 % for 4 consecutive weeks | Demo isn't running or BEFORE preview isn't compelling. **Action:** check iframe loads; widen BEFORE preview to show pollution clearly; move demo above the fold. |
+| 2 | **Demo→CTA intent** | `demo.cta_clicked / demo.run_completed` per persona | < 5 % for 4 consecutive weeks | Demo is impressive but the CTA isn't earning trust. **Action:** add network-tab privacy screenshot; soften the price callout; A-B test eyebrow copy on the CTA card. |
+| 3 | **Purchase rate** | `gumroad.purchase / demo.cta_clicked` per persona | < 30 % for 4 consecutive weeks | Visitors click through but don't pull the card out. **Action:** check Gumroad listing renders cleanly; verify refund-policy copy; check that the screenshot on the listing matches the demo they just ran. |
+| 4 | **Refund rate** | `gumroad.refunds / gumroad.purchase` rolling 30 days | > 5 % | Buyer expectation mismatch. **Action:** read every refund email; determine if it's a feature gap (build it), a positioning lie (rewrite), or a personal-fit miss (fine, ignore). |
+| 5 | **Support load** | email tickets / 100 sales rolling 30 days | > 10 | The product isn't self-serve enough at this price. **Action:** find the top 3 questions; add to in-app onboarding + landing-page FAQ + the persona's saved pipeline. |
+
+These five also map to BUSINESS.md §12 — that doc names the metrics;
+this doc operationalises them.
+
+## 2. Monthly review — 30-minute checklist
+
+Block 30 minutes on the first Monday of every month for the first six
+months. After month 6, if the numbers are stable, drop to 15 minutes
+quarterly.
+
+```
+[ ] Pull last 30 days of demo events from Cloudflare Web Analytics
+[ ] Pull last 30 days of Gumroad sales + refunds export
+[ ] Compute the five numbers in §1 per persona
+[ ] Note which thresholds are tripped (if any)
+[ ] Read every refund email since last review
+[ ] Read every support email since last review
+[ ] Decide ONE thing to change this month (only one)
+[ ] Update CHANGELOG with what was changed and why
+[ ] Schedule next review
+```
+
+The "decide ONE thing" rule is load-bearing. Pre-PMF traffic doesn't
+have the volume to A/B-test multiple changes in parallel — you'll just
+confuse yourself about what moved the number.
+
+## 3. Per-persona scoreboard (template)
+
+Maintain in a single text file or spreadsheet. The shape that fits in
+a notebook page is the shape you'll actually update.
+
+```
+Month: 2026-06
+─────────────────────────────────────────────────────────────────
+                       Shopify  Bookkeeper  RevOps  Total
+Page views                420         180     290    890
+Demo runs                 137          59      82    278
+CTA clicks                  9           7       6     22
+Purchases                   3           2       2      7
+
+Metric 1 (engage)        33%         33%     28%   31%
+Metric 2 (intent)         7%         12%      7%    8%
+Metric 3 (purchase)      33%         29%     33%   32%
+Metric 4 (refund)         0%          0%      0%    0%
+Metric 5 (support)         3 tickets  /  100 sales
+
+Tripped thresholds:      RevOps engagement (28% < 30%)
+                         Bookkeeper purchase (29% < 30%)
+
+This-month change:       Move demo embed above the fold on revops
+                         page; reduce hero text by 40%.
+
+Last-month change:       Added network-tab screenshot to all 3
+                         pages.  Result: intent +1.5 percentage
+                         points on Shopify, flat elsewhere.
+```
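+
+The ratios in the template can be recomputed from the raw counts rather than by hand. This is a sketch only: the dictionary layout and function names are illustrative, and it flags any ratio under its §1 threshold for a single period, whereas §1 treats a threshold as tripped only after 4 consecutive weeks. Treat its output as candidates to watch.
+
+```python
+# Raw counts copied from the template scoreboard.
+COUNTS = {
+    "shopify":    {"views": 420, "runs": 137, "clicks": 9, "purchases": 3},
+    "bookkeeper": {"views": 180, "runs": 59,  "clicks": 7, "purchases": 2},
+    "revops":     {"views": 290, "runs": 82,  "clicks": 6, "purchases": 2},
+}
+
+# Lower bounds for metrics 1-3 from section 1, as fractions.
+THRESHOLDS = {"engage": 0.30, "intent": 0.05, "purchase": 0.30}
+
+
+def ratio(numerator, denominator):
+    """Return numerator / denominator, or None when there is no traffic."""
+    return numerator / denominator if denominator else None
+
+
+def funnel(c):
+    """Compute metrics 1-3 for one persona's counts."""
+    return {
+        "engage": ratio(c["runs"], c["views"]),
+        "intent": ratio(c["clicks"], c["runs"]),
+        "purchase": ratio(c["purchases"], c["clicks"]),
+    }
+
+
+def below_threshold(counts=COUNTS, thresholds=THRESHOLDS):
+    """List (persona, metric) pairs whose ratio is under its threshold."""
+    flagged = []
+    for persona, c in counts.items():
+        for metric, value in funnel(c).items():
+            if value is not None and value < thresholds[metric]:
+                flagged.append((persona, metric))
+    return flagged
+```
+
+Keeping the comparison on unrounded ratios avoids a persona slipping past a threshold because of rounding in the displayed percentage.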
+
+## 4. Stage-gate triggers from PLAN.md
+
+Reproduced here so the gate criteria sit beside the metrics that
+fire them:
+
+| Trigger | From | Action |
+|---|------|--------|
+| **First paying customer** | PLAN §4 | Continue. Plan is working. |
+| **Zero paid in 90 days** | PLAN §4 | Audit the funnel. Don't add features. Run a small (1-week) outbound experiment to 30 niche-community contacts as a control, even though it stretches the no-touch constraint, to determine whether the bottleneck is reach or conversion. |
+| **$5 k/mo MRR** | DECISIONS §8 | Re-evaluate async constraint. Add priority-support tier (PLAN §2.7). |
+| **$10 k/mo MRR** | DECISIONS §8 | Revisit time-budget allocation. Decide on tools 06–08 vs. additional bundles. |
+| **Marketplace shutdown** | PLAN §4 / DECISIONS §8 | Switch landing-page CTA to own-domain Stripe Checkout. Pre-built; one-line edit. |
+| **Streamlit hard direction change** | DECISIONS §8 | Low-probability re-lock. Tk fallback documented. |
+| **Burnout signal** | DECISIONS §8 | Stop. Triage. The constraint matters more than the revenue ramp. |
+
+## 5. What we deliberately do NOT measure
+
+These look productive but predict nothing pre-PMF. Don't add them.
+
+- **Bounce rate** — single-page sites have artificially high bounce. Useless signal.
+- **Time on page** — landing pages are *supposed* to be quick reads. Long time on page often means confusion, not engagement.
+- **Heatmaps / scroll-depth** — no statistical resolution at <500 monthly visitors. Add when you cross 5 k/month.
+- **Email open rates** — under `PLAN.md` §2.7, priority support is the only email channel; opens aren't a buying signal.
+- **Social mentions** — vanity. The signal that matters is "did they buy" or "did they come back."
+
+## 6. What we measure once, then trust
+
+Do these once, then let them run for 6+ months without re-measuring:
+
+- **Demo correctness** — once per pipeline release, run all 3 demos
+  end-to-end via `tests/test_pipeline.py` and check the output looks
+  reasonable. The CI pipeline already does this; nothing to add.
+- **Cross-platform install** — once per release, verify the
+  PyInstaller bundle launches on Mac / Windows / Linux. After three
+  green releases, trust the build pipeline; spot-check on major OS
+  updates only.
+- **Privacy claim integrity** — once at launch, capture the network
+  tab while running the cleaner and host that screenshot at a stable
+  URL. Re-capture only when a new tool or dependency is added.
+
+## 7. Per-persona attribution
+
+The buy buttons on the three niche landing pages carry a
+`?from=<persona>` query parameter; the apex page carries none. Gumroad
+propagates the value into the order metadata. Use it to attribute purchases:
+
+| Persona key | Landing page URL | Gumroad query | Audience |
+|---|---|---|---|
+| `shopify-pet` | `/shopify-pet/` | `?from=shopify-pet` | Shopify operator |
+| `bookkeeper`  | `/bookkeeper/`  | `?from=bookkeeper`  | Bookkeeper / freelance accountant |
+| `revops`      | `/revops/`      | `?from=revops`      | Marketing / RevOps agency |
+| `apex`        | `/`             | (no query — use `unknown` bucket) | Generic discovery |
+
+When `unknown` exceeds 30 % of total purchases, add UTM tagging to
+community posts and SEO blog backlinks so you can break the bucket apart.
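+
+A minimal sketch of the bucketing rule above, assuming order records
+have already been exported into Python with the `from` value pulled
+out as a plain string (or `None` when absent). The export step, the
+function names, and the `KNOWN_PERSONAS` constant are illustrative
+assumptions, not part of the shipped code:
+
+```python
+from collections import Counter
+
+# Persona keys taken from the table above; anything else is "unknown".
+KNOWN_PERSONAS = {"shopify-pet", "bookkeeper", "revops"}
+UNKNOWN_THRESHOLD = 0.30  # the 30 % trigger described above
+
+
+def bucket_orders(from_values):
+    """Count purchases per persona; unrecognised values go to 'unknown'."""
+    counts = Counter()
+    for value in from_values:
+        key = value if value in KNOWN_PERSONAS else "unknown"
+        counts[key] += 1
+    return counts
+
+
+def unknown_share(counts):
+    """Fraction of all purchases that landed in the 'unknown' bucket."""
+    total = sum(counts.values())
+    return counts["unknown"] / total if total else 0.0
+
+
+def needs_utm_tagging(counts):
+    """True once 'unknown' exceeds the threshold share of purchases."""
+    return unknown_share(counts) > UNKNOWN_THRESHOLD
+```
+
+Any value outside the three persona keys, including a missing or
+mistyped one, lands in `unknown`, so a typo in a buy-button URL shows
+up as growth in that bucket rather than disappearing.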
+
+## 8. The four months that decide whether the plan works
+
+Reading PLAN.md §3 + this doc together, the rough script:
+
+| Month | What's running | What we expect to learn |
+|---|---|---|
+| **M1** (June) | Installers · demo · 3 landing pages · Gumroad live | Whether the funnel mechanically works. Numbers will be noisy; just look for one purchase. |
+| **M2** (July) | M1 + community posts in 3 niches + 1 SEO post | Which persona converts. Re-allocate effort to the highest-converting niche. |
+| **M3** (August) | M2 + landing-page changes from M2 review | Whether intent-rate moved on the change. Decide tools 06–08 go/no-go. |
+| **M4** (September) | M3 + first repeat-buyer signals | Whether the Pipeline Runner is producing retention as designed. |
+
+By the end of M4, the data tells you whether the plan is on track for
+$1k–3k/mo (the BUSINESS.md §6 6-month target), judged by extrapolating
+the trajectory rather than by the absolute number.
+
+## 9. The hardest part of the plan to execute
+
+Not the metrics. Not the build. **The "decide ONE thing per month"
+rule** — operators with engineering backgrounds chronically pick
+three changes per month and conclude nothing because their signal
+is muddled. This doc says one. It means one.
--- a/docs/REQUIREMENTS.md
+++ b/docs/REQUIREMENTS.md
@@ -46,7 +46,7 @@ Numbered support matrix. Updated with every shipped capability.
 **Cell-level**:
 - `smart_punctuation_in_data`, `nbsp_or_unicode_whitespace`, `zero_width_or_invisible`, `dirty_column_headers`, `whitespace_padding`, `null_like_sentinels`, `suspected_mojibake`, `mixed_case_email_column`, `inconsistent_date_format`, `near_duplicate_rows`, `leading_zero_ids`.

-**Encoding integrity**: `encoding_uncertain`, `encoding_decode_failed`, `empty_input`.
+**Encoding integrity**: `encoding_uncertain`, `encoding_decode_failed`, `encoding_lying_bom`, `empty_input`.

 Sample size: 1,000 rows (configurable).

@@ -71,17 +71,37 @@ Sample size: 1,000 rows (configurable).
 - Full-DataFrame `auto_fix`: ~5 min (~30 µs/cell).
 - Output write: ~10 s.
 - Recommended RAM: 4× input size for full-Apply path.
+- Format standardizer (`standardize_file`): ~150k rows/sec on cache-warm
+  international data; chunk-bounded RAM (~50 MB peak at default
+  chunk_size=50,000). A 1 GB CSV with mixed phone+currency+address
+  columns finishes in ~2.5–10 minutes depending on column count.

 ## 11. Tools
 1. Deduplicator — Ready
 2. Text Cleaner — Ready
 3. Format Standardizer — Ready
-4. Missing Value Handler — Coming Soon
-5. Column Mapper — Coming Soon
+4. Missing Value Handler — Ready
+5. Column Mapper — Ready
 6. Outlier Detector — Coming Soon
 7. Multi-File Merger — Coming Soon
 8. Validator & Reporter — Coming Soon
-9. Pipeline Runner — Coming Soon
+9. Pipeline Runner — Ready
+
+### 11.a Recommended pipeline order (soft, not enforced)
+
+The Pipeline Runner ships with a `SOFT_DEPENDENCIES` table. The
+ordering below is the default, and the runner's ordering warnings are
+derived from it. Re-ordering is allowed: the runner emits a warning
+string and proceeds.
+
+| # | Tool | Why this slot |
+|---|------|---------------|
+| 1 | column_map (optional, for header alignment) | Multi-vendor unification — rename early so downstream tools see canonical headers |
+| 2 | text_clean | NBSP / smart quotes / zero-width pollution silently breaks downstream parsers |
+| 3 | format_standardize | Phones / dates / currencies → canonical form before missing detection and dedup |
+| 4 | missing | Sentinel detection, imputation, drop strategies — needs canonical types |
+| 5 | column_map (optional, for schema enforcement) | Project to target schema, coerce, drop extras AFTER cleaning |
+| 6 | dedup | Fuzzy matching is most accurate on canonicalised data whose sentinels have already been resolved |
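+
+The warn-and-proceed behaviour can be sketched as follows. This is an
+illustration only: `RECOMMENDED_RANK` and `order_warnings` are
+hypothetical names, and the real `SOFT_DEPENDENCIES` table may be
+shaped differently. Because `column_map` is valid in two slots, the
+sketch leaves it unranked and never warns about it:
+
+```python
+# Hypothetical ranks copied from the table above (column_map omitted
+# because it may legitimately run both first and fifth).
+RECOMMENDED_RANK = {
+    "text_clean": 2,
+    "format_standardize": 3,
+    "missing": 4,
+    "dedup": 6,
+}
+
+
+def order_warnings(steps):
+    """Return one warning string per pair of steps run out of order.
+
+    Never raises: the caller is expected to show the warnings and run
+    the pipeline anyway.
+    """
+    warnings = []
+    for i, earlier in enumerate(steps):
+        rank_earlier = RECOMMENDED_RANK.get(earlier)
+        if rank_earlier is None:
+            continue  # unranked step, nothing to compare
+        for later in steps[i + 1:]:
+            rank_later = RECOMMENDED_RANK.get(later)
+            if rank_later is not None and rank_earlier > rank_later:
+                warnings.append(
+                    f"{earlier} runs before {later}; "
+                    f"recommended order is {later} first"
+                )
+    return warnings
+```
+
+Returning strings instead of raising keeps the ordering advisory,
+which matches the "soft, not enforced" intent of this section.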

 ## 12. Gate (Review & Normalize)
 - Gates every tool page.
@@ -95,7 +115,7 @@ Sample size: 1,000 rows (configurable).

 ## 13. Interfaces
 - **GUI**: Streamlit, browser-based, local, no internet.
-- **CLI**: `python -m src.cli` (dedup) · `src.cli_text_clean` · `src.cli_analyze`.
+- **CLI**: `python -m src.cli` (dedup) · `src.cli_text_clean` · `src.cli_format` · `src.cli_missing` · `src.cli_column_map` · `src.cli_pipeline` · `src.cli_analyze`.
 - **Python API**: `from src.core import …` (analyze, repair_bytes, clean_dataframe, deduplicate, standardize_dataframe, …).
 - **JSON output**: `--json` on `cli_analyze`.

@@ -113,8 +133,8 @@ Sample size: 1,000 rows (configurable).
 - **Dev**: pytest, tox.

 ## 16. Test coverage
-- 1,230 tests passing, 4 skipped (ftfy not installed), 17 xfailed (documented).
-- Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases).
+- 1,729 tests passing, 0 skipped, 0 xfailed.
+- Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases + 20-row international stress fixture), missing-handler (3 use cases + 16 edge cases), column-mapper (3 use cases + 5 edge cases).
 - Run: `python run_tests.py [--tool …] [--fixtures] [--coverage]`.

 ## 17. Privacy / data handling