feat: 3 new tools, format streaming, distribution-ready demo + landing pages

Tools shipped this batch (4 → 6 of 9 Ready): 04 Missing Value Handler src/core/missing.py + cli_missing.py + GUI 05 Column Mapper src/core/column_mapper.py + cli_column_map.py + GUI 09 Pipeline Runner src/core/pipeline.py + cli_pipeline.py + GUI with soft tool-dependency graph (recommended, not enforced) and JSON save/load for repeatable weekly cleanups. Format Standardizer reworked for 1 GB international files: • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email • Per-row country / address columns drive parsing • Audit cap (default 10 k rows, ~50 MB RAM) • standardize_file(): chunked streaming entry point (~165 k rows/sec) • currency_decimal="auto" for EU comma-decimal locales • R$ / kr / zł multi-char currency prefixes • cli_format.py with auto-stream above 100 MB inputs Encoding detection arbiter + language-aware probe: Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM) via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes. Distribution-readiness assets: • streamlit_app.py — Streamlit Community Cloud entry shim • src/gui/app_demo.py — single-page demo, ?p=<persona> routing, 100-row cap + watermark, free-vs-paid boundary enforced at surface • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs • landing/ — 4 static HTML pages (apex chooser + 3 niche), shared CSS, deploy.py URL-substitution script, auto-generated robots.txt + sitemap.xml + 404.html + favicon • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md — full strategy + measurement + deployment + master checklist Test counts: before: 1,520 passed · 4 skipped · 17 xfailed after: 1,729 passed · 0 skipped · 0 xfailed Tier-1 corpora added: • missing-corpus 3 use cases + 16 edge cases • column-mapper-corpus 3 use cases + 5 edge cases • format-cleaner intl 20-row 13-country stress fixture Engine hardening flushed out by the corpora: • interpolate guards against object-dtype columns • mean/median skip all-NaN columns (silences numpy warning) • fillna runs under future.no_silent_downcasting (silences pandas warning) • mojibake test no longer skips when ftfy installed (monkeypatch path) • drop-row threshold semantics: strict-greater (consistent across rows / cols) • currency_decimal validator allow-set updated for "auto" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:31:26 +00:00
parent d18b95880d
commit 966af8ef94
89 changed files with 12039 additions and 284 deletions
--- a/docs/PLAN.md
+++ b/docs/PLAN.md
@@ -0,0 +1,220 @@
+# Strategic Plan — DataTools
+
+> Creator-only. Locks the "what next" in light of the locked criteria
+> (DECISIONS.md §1) and the v1.6 honest status (BUSINESS.md §13).
+> **Version**: 1.0 · **Adopted**: 2026-05-01 · **Owner**: Michael
+
+This document is the active plan, derived from the strategic review of
+2026-05-01. It compresses the eight strategic moves and a 90-day
+execution sequence onto one page so the next decision (build vs.
+ship vs. market) has a single reference.
+
+It is **not** a re-lock of operating criteria — those still live in
+DECISIONS.md and have not changed. This plan is downstream of those
+criteria; if a move below conflicts with §1 of Decisions, the criteria
+win.
+
+## 1. Frame
+
+**Locked context** (BUSINESS.md, DECISIONS.md):
+
+- Niche Python automation tools, $49–79 single / $149 suite.
+- Cash budget ≤ $1,200/mo recurring · Time ≤ 10 hr/wk · No external funding.
+- Async + no-touch sales (revisit at $5k/mo MRR).
+- Marketplace-first distribution (Gumroad / Lemon Squeezy).
+- Streamlit GUI + CLI dual interface, runs locally.
+- Lifestyle cashflow goal (no exit needed).
+
+**Honest current state** (2026-05-01):
+
+| Asset | State |
+|---|---|
+| Tools 1–5 (Dedup, Text Clean, Format Standardize, Missing, Column Mapper) | Ready · 1,691 tests passing · 0 xfailed |
+| Tools 6–9 (Outlier, Multi-File Merge, Validator, Pipeline) | Coming Soon |
+| PyInstaller installer pipeline | Not started |
+| macOS code signing (Apple Dev Program) | Not started |
+| Hosted browser demo (Streamlit Cloud) | Not deployed |
+| Landing page | Not live |
+| Marketplace listing (Gumroad) | Not listed |
+| Paying customers | 0 |
+
+**Diagnosis**: the bottleneck is not feature count — it's distribution.
+The next $1 of value comes from closing the gap between "code-complete"
+and "buyer-pulls-out-card", not from tool 6.
+
+## 2. The eight strategic moves
+
+Numbered moves. Each is consistent with locked criteria.
+
+### 2.1 Freeze new-tool development (one exception). Ship what exists.
+
+Tools 6–8 are blocked behind a **distribution gate**: no work on them
+until the existing 5 tools have a paying customer + one external review
+(BUSINESS.md §4 sequence rule, applied recursively inside the bundle).
+
+**Exception granted 2026-05-01**: Tool 09 Pipeline Runner is built
+*now*. Rationale: the pipeline transforms the bundle from "5 tools you
+buy" into "an automatable workflow you depend on." That conversion is
+what produces retention and word-of-mouth — the only marketing channel
+that scales under the no-network/no-touch constraint.
+
+### 2.2 The demo *is* the product. Make it embarrassingly good.
+
+- Three persona-tagged sample datasets, not one generic CSV: Shopify
+  customers / bookkeeper bank export / agency lead list.
+- Run the *full pipeline* on the sample (Review → Dedup → Text Clean →
+  Format → Missing → Column Map). Free version caps **output rows**,
+  not the experience.
+- Embed the demo as an **iframe on the landing page** (not "click to
+  open"). Friction kills conversion.
+- Persistent CTA after demo: *"Run this on your own 50 k-row file →
+  buy for $49 →"* directly above the Gumroad button.
+
+### 2.3 Niche down. Stop selling "data cleaning."
+
+One engine, three landing pages:
+
+| Persona | Landing-page lead | Demo dataset |
+|---|---|---|
+| Shopify operator (priority: pet supplies) | "Clean your customer / vendor / subscriber exports" | uc01_shopify_customer_list |
+| Bookkeeper / freelance accountant | "Reconcile bank exports + vendor lists. Auditable changes." | uc06_bank_export_overlap |
+| Marketing / RevOps agency | "Dedupe lead lists. Standardize phones across vendors." | uc13_combined_lead_sources |
+
+Generic copy competes with `pip install pandas`. Vertical copy
+competes with nothing.
+
+### 2.3a Top pain points per niche
+
+The "what does this actually fix?" question. Each pain point below is
+sourced from operator-domain knowledge of these markets and the
+buyer-use-case research already captured in `BUSINESS.md §4a`. Pain
+points are ranked by **frequency × dollar impact** for that persona —
+high-frequency / high-cost pains lead the landing-page copy and the
+demo dataset.
+
+> **Validation gap (honest disclaimer)**: these pains are derived from
+> operator knowledge of the categories, not from a sample of buyer
+> interviews. Per `BUSINESS.md §8` (no-touch constraint review at $5k/mo
+> MRR), validate the top-3 per persona via 5 buyer interviews before the
+> first $200 of paid acquisition spend. If any pain ranks below the
+> assumed level, swap it for the next-highest in this list.
+
+#### Shopify operator (priority: pet supplies)
+
+| # | Pain | $ / time impact | Tools that fix it |
+|---|------|-----------------|---|
+| S1 | **Klaviyo / Mailchimp / Omnisend per-contact billing.** Subscriber list with 10–18 % duplicate rate (case drift, plus signs in Gmail addresses, multiple devices) → recurring overpay forever. | $30–300/mo per percent of dupes on a 50 k list — recurring | Dedup + Format Standardize (email canonicalization) + Pipeline (re-run weekly) |
+| S2 | **Product feed rejected by Google Merchant Center / Meta Catalog.** Smart quotes in titles, NBSP in SKU, inconsistent attributes; campaign launch delayed 24–72 h while feed gets fixed. | 1–3 days delayed launch × campaign value | Text Cleaner + Format Standardize |
+| S3 | **Multi-channel order consolidation.** Shopify + Etsy + Amazon + Faire + wholesale spreadsheet, each with a different column for "customer email" / "order total" / "ship country". | 4–8 hr / month manually merging | Column Mapper + Dedup + Pipeline |
+| S4 | **Subscription identity fragmentation.** Pet-box subscribers cancel and re-sub under a different email; cohort analysis says churn is 20 % when it's actually 12 % — pricing decisions wrong. | Mis-priced LTV → over- or under-paid acquisition | Dedup with `merge=true` survivor |
+| S5 | **International tax / VAT MOSS compliance.** Country column is `UK` / `U.K.` / `United Kingdom` / `GB` in the same export; VAT report breaks. Phone formats per region break call-center routing. | Compliance penalty risk + ops friction | Format Standardize (per-row country) + Column Mapper |
+
+#### Bookkeeper / freelance accountant
+
+| # | Pain | $ / time impact | Tools that fix it |
+|---|------|-----------------|---|
+| B1 | **Bank-export month-overlap re-import.** Same transaction posts twice when Jan and Feb exports overlap at the boundary; client's books understate cash by 1–4 %. | 2–4 hr / month / client + reconciliation errors | Dedup with explicit Date+Amount+fuzzy Vendor strategy |
+| B2 | **QBO / Xero vendor consolidation for 1099 reports.** "Amazon" / "amazon.com" / "AMAZON.COM*4F2X9" become 3 vendors; 1099 reports break, P&L by vendor unusable. | 1–2 hr / 1099 cycle + IRS-paper-trail risk | Format Standardize (name canonicalization) + Dedup |
+| B3 | **Liability / professional indemnity.** Cannot use AI tools that don't show their work; client audit response window is 24–48 h. | Per-firm liability premium ≈ $500–2,500 / yr | Audit log built into every tool — every change row-logged |
+| B4 | **Per-license-not-per-client economics.** Most cleanup tools are per-seat / per-client SaaS; bookkeepers managing 10–30 clients hit price walls fast. | $30/mo × N clients vs. $49 once | Desktop license, no per-client constraint |
+| B5 | **Multi-currency books.** US-domiciled clients with EU customers; comma-decimal amounts (`€1.234,56`) crash standard parsers; parens-negative (`($89.50)`) treated as positive. | 30–60 min per multi-currency client per month | Format Standardize (`currency_decimal=auto`, parens-negative) |
+
+#### Marketing / RevOps agency
+
+| # | Pain | $ / time impact | Tools that fix it |
+|---|------|-----------------|---|
+| R1 | **HubSpot / Marketo / Iterable per-contact tier pricing.** 10 k contacts → enterprise tier at $4–8 k/mo. Every duplicate is a recurring tax. | $200–800 / month per 1 k duplicate contacts — recurring | Dedup with cross-source merge + Pipeline |
+| R2 | **Email-deliverability / sender reputation.** Sending to invalid or duplicate addresses tanks reputation; recovery takes weeks. | Catastrophic — entire email programme degraded | Format Standardize (email canonicalization) + Missing (sentinel detection) |
+| R3 | **GDPR / contact-data privacy.** Uploading lead data to a third-party cleaning SaaS is itself a GDPR concern; legal review blocks adoption. | Compliance risk + 4–8 wk legal-review delay | Local-only desktop app, zero outbound calls |
+| R4 | **Multi-vendor lead-source unification.** Apollo, ZoomInfo, LinkedIn Sales Nav, manual scrapes — each export has different headers, scoring, country format. | 1–3 days per campaign of manual unification | Column Mapper (alias matching) + Format Standardize (per-row country) + Dedup |
+| R5 | **Suppression-list management across 5+ platforms.** Each platform has its own format; un-deduped suppression lists let opt-outs slip through, triggering CAN-SPAM / GDPR exposure. | Compliance risk + churn-back cost | Pipeline saved as JSON, re-run on each new suppression batch |
+
+### 2.4 Operationalize the moat the docs already name.
+
+Three durable advantages, each promoted from buried feature to
+landing-page H1:
+
+- **Quality**: 1 GB international standardization in ~2.5 minutes,
+  locally. Excel can't do this; OpenRefine fights you for an hour.
+- **Privacy**: "Your data never leaves this computer." Already in the
+  GUI footer — promote to landing-page lead, screenshot the empty
+  network tab.
+- **Update cadence**: ship a v1.1 patch within 30 days of v1.0 launch.
+  Not features — *evidence* the product is alive. "Added Czech Republic
+  phone format support" beats "no updates in 6 months" every time.
+
+### 2.5 Surface the audit-trail feature in sales copy.
+
+Every tool has a structured audit log. Most cleaning tools do not.
+Bookkeepers and consultants get fired if they can't show what changed
+to a client. The audit feature is currently invisible on every
+proposed landing page and should be the **second-largest callout** —
+right after "runs locally."
+
+Copy seed: *"Every change auditable. Hand the audit CSV to your client
+with the cleaned file."*
+
+### 2.6 The Pipeline Runner is the retention multiplier.
+
+A buyer with a saved pipeline isn't a one-off purchase — they're a
+recurring user who recommends the product. This is exactly the
+behavioural lever the no-touch constraint needs (DECISIONS.md §8
+trigger). Build it now (see §2.1 exception).
+
+### 2.7 Add a $199 "priority support" tier post-launch.
+
+Same code, async-email SLA (24 h response). Targets the bookkeeper /
+consultant persona whose own time is $300/hr. Zero new product work,
+~3× ARPU on 5–10 % of buyers. Lock the SLA to **async only** so the
+no-touch constraint isn't violated. Defer until $5 k/mo MRR (the same
+trigger DECISIONS.md §8 already names).
+
+### 2.8 Dependency-aware pipeline UX.
+
+Tools have soft execution-order preferences (Text Clean before Format
+Standardize, Format before Dedup, Missing before Dedup). The Pipeline
+Runner *recommends* the order, *warns* on reversals, and **never
+forces** — the user owns their workflow. Implementation: see
+`src/core/pipeline.py` `SOFT_DEPENDENCIES`.
+
+## 3. 90-day execution sequence
+
+| Week | Action | Done when |
+|---|---|---|
+| 1 | PyInstaller pipeline · Mac/Win unsigned installers · Apple Dev Program enrollment (1–2 wk lead) | `dist/datatools-mac.dmg` and `dist/datatools-win.exe` install on a clean machine |
+| 2 | Demo deployed to Streamlit Cloud · landing page v1 with embedded demo · 3 persona datasets in the demo | Public URL serves a working pipeline run on a sample dataset in < 30 s |
+| 3 | Gumroad listing live · share value-first in 3 niche communities (no pitch) · 1 long-tail SEO post for the lead persona | First listing impression captured · post not removed for self-promotion |
+| 4 | Pipeline Runner v1.0 shipped (this week, 2026-05-01 — exception per §2.1) · v1.1 patch announced with Tool 09 + intl improvements | Pipeline saves/loads JSON · 3 demo pipelines preloaded |
+| 5–8 | Bookkeeper landing page · agency landing page · second tool's promo cycle · priority-support tier added (defer purchase until §2.7 trigger) | Three live landing pages with distinct H1, demo dataset, conversion target |
+| 9–13 | Tool 06–08 only **if** revenue trajectory supports continued investment · otherwise more market work on the existing 5 + 09 | Decision made on 13 Aug 2026 with revenue data, not feature ambition |
+
+## 4. Decision triggers (re-evaluation prompts)
+
+These flip the plan, not the underlying criteria:
+
+| Trigger | Reaction |
+|---|---|
+| First paying customer in week 4–13 | Continue. Plan is working. |
+| **Zero** paid in 90 days | Audit the funnel. Demo conversion? Niche fit? Price? Don't add features. |
+| $5 k/mo MRR | DECISIONS.md §8 trigger fires: revisit async + priority-support tier. |
+| Marketplace policy / shutdown | Switch to own-domain Stripe immediately; landing pages are already self-hosted. |
+| Streamlit hard direction change | Low-probability re-lock per DECISIONS.md §8. Tk fallback is documented. |
+
+## 5. Anti-temptations (things the plan refuses)
+
+- **More tools before more buyers.** Locked. Exception only for Pipeline Runner per §2.1.
+- **SaaS pivot.** Recurring infra conflicts with the lifestyle constraint (DECISIONS.md §4).
+- **Live chat / sales calls.** Conflicts with no-touch (DECISIONS.md §1 #8).
+- **Custom integrations / one-off consulting.** $300/hr looks tempting; breaks the "build once, sell many" model that justifies the entire strategy.
+- **Going broad on personas.** "All small businesses" is a generic landing page that converts at 1 %; "Shopify pet-supply operators with 1k–50k customers" converts at 5–15 % in the right communities.
+
+## 6. What this plan deliberately leaves open
+
+- Whether tools 06–08 ever ship. Decided on revenue, not roadmap.
+- Whether to add a fourth niche landing page. Decided on which of the
+  three is producing.
+- Whether to invest in own-domain SEO. Compounding 6–18 mo asset; not
+  the early-stage channel. Revisit when marketplace + community
+  produces baseline traffic to optimise.
+- Whether to add a Notion / Slack support community. If support volume
+  per 100 sales > 10 (BUSINESS.md §12 target), revisit; else leave async-email only.