feat: 3 new tools, format streaming, distribution-ready demo + landing pages

Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-01 22:31:26 +00:00
parent d18b95880d
commit 966af8ef94
89 changed files with 12039 additions and 284 deletions

220
docs/PLAN.md Normal file
View File

@@ -0,0 +1,220 @@
# Strategic Plan — DataTools
> Creator-only. Locks the "what next" in light of the locked criteria
> (DECISIONS.md §1) and the v1.6 honest status (BUSINESS.md §13).
> **Version**: 1.0 · **Adopted**: 2026-05-01 · **Owner**: Michael
This document is the active plan, derived from the strategic review of
2026-05-01. It compresses the eight strategic moves and a 90-day
execution sequence onto one page so the next decision (build vs.
ship vs. market) has a single reference.
It is **not** a re-lock of operating criteria — those still live in
DECISIONS.md and have not changed. This plan is downstream of those
criteria; if a move below conflicts with §1 of Decisions, the criteria
win.
## 1. Frame
**Locked context** (BUSINESS.md, DECISIONS.md):
- Niche Python automation tools, $4979 single / $149 suite.
- Cash budget ≤ $1,200/mo recurring · Time ≤ 10 hr/wk · No external funding.
- Async + no-touch sales (revisit at $5k/mo MRR).
- Marketplace-first distribution (Gumroad / Lemon Squeezy).
- Streamlit GUI + CLI dual interface, runs locally.
- Lifestyle cashflow goal (no exit needed).
**Honest current state** (2026-05-01):
| Asset | State |
|---|---|
| Tools 15 (Dedup, Text Clean, Format Standardize, Missing, Column Mapper) | Ready · 1,691 tests passing · 0 xfailed |
| Tools 69 (Outlier, Multi-File Merge, Validator, Pipeline) | Coming Soon |
| PyInstaller installer pipeline | Not started |
| macOS code signing (Apple Dev Program) | Not started |
| Hosted browser demo (Streamlit Cloud) | Not deployed |
| Landing page | Not live |
| Marketplace listing (Gumroad) | Not listed |
| Paying customers | 0 |
**Diagnosis**: the bottleneck is not feature count — it's distribution.
The next $1 of value comes from closing the gap between "code-complete"
and "buyer-pulls-out-card", not from tool 6.
## 2. The eight strategic moves
Numbered moves. Each is consistent with locked criteria.
### 2.1 Freeze new-tool development (one exception). Ship what exists.
Tools 68 are blocked behind a **distribution gate**: no work on them
until the existing 5 tools have a paying customer + one external review
(BUSINESS.md §4 sequence rule, applied recursively inside the bundle).
**Exception granted 2026-05-01**: Tool 09 Pipeline Runner is built
*now*. Rationale: the pipeline transforms the bundle from "5 tools you
buy" into "an automatable workflow you depend on." That conversion is
what produces retention and word-of-mouth — the only marketing channel
that scales under the no-network/no-touch constraint.
### 2.2 The demo *is* the product. Make it embarrassingly good.
- Three persona-tagged sample datasets, not one generic CSV: Shopify
customers / bookkeeper bank export / agency lead list.
- Run the *full pipeline* on the sample (Review → Dedup → Text Clean →
Format → Missing → Column Map). Free version caps **output rows**,
not the experience.
- Embed the demo as an **iframe on the landing page** (not "click to
open"). Friction kills conversion.
- Persistent CTA after demo: *"Run this on your own 50 k-row file →
buy for $49 →"* directly above the Gumroad button.
### 2.3 Niche down. Stop selling "data cleaning."
One engine, three landing pages:
| Persona | Landing-page lead | Demo dataset |
|---|---|---|
| Shopify operator (priority: pet supplies) | "Clean your customer / vendor / subscriber exports" | uc01_shopify_customer_list |
| Bookkeeper / freelance accountant | "Reconcile bank exports + vendor lists. Auditable changes." | uc06_bank_export_overlap |
| Marketing / RevOps agency | "Dedupe lead lists. Standardize phones across vendors." | uc13_combined_lead_sources |
Generic copy competes with `pip install pandas`. Vertical copy
competes with nothing.
### 2.3a Top pain points per niche
The "what does this actually fix?" question. Each pain point below is
sourced from operator-domain knowledge of these markets and the
buyer-use-case research already captured in `BUSINESS.md §4a`. Pain
points are ranked by **frequency × dollar impact** for that persona —
high-frequency / high-cost pains lead the landing-page copy and the
demo dataset.
> **Validation gap (honest disclaimer)**: these pains are derived from
> operator knowledge of the categories, not from a sample of buyer
> interviews. Per `BUSINESS.md §8` (no-touch constraint review at $5k/mo
> MRR), validate the top-3 per persona via 5 buyer interviews before the
> first $200 of paid acquisition spend. If any pain ranks below the
> assumed level, swap it for the next-highest in this list.
#### Shopify operator (priority: pet supplies)
| # | Pain | $ / time impact | Tools that fix it |
|---|------|-----------------|---|
| S1 | **Klaviyo / Mailchimp / Omnisend per-contact billing.** Subscriber list with 1018 % duplicate rate (case drift, plus signs in Gmail addresses, multiple devices) → recurring overpay forever. | $30300/mo per percent of dupes on a 50 k list — recurring | Dedup + Format Standardize (email canonicalization) + Pipeline (re-run weekly) |
| S2 | **Product feed rejected by Google Merchant Center / Meta Catalog.** Smart quotes in titles, NBSP in SKU, inconsistent attributes; campaign launch delayed 2472 h while feed gets fixed. | 13 days delayed launch × campaign value | Text Cleaner + Format Standardize |
| S3 | **Multi-channel order consolidation.** Shopify + Etsy + Amazon + Faire + wholesale spreadsheet, each with a different column for "customer email" / "order total" / "ship country". | 48 hr / month manually merging | Column Mapper + Dedup + Pipeline |
| S4 | **Subscription identity fragmentation.** Pet-box subscribers cancel and re-sub under a different email; cohort analysis says churn is 20 % when it's actually 12 % — pricing decisions wrong. | Mis-priced LTV → over- or under-paid acquisition | Dedup with `merge=true` survivor |
| S5 | **International tax / VAT MOSS compliance.** Country column is `UK` / `U.K.` / `United Kingdom` / `GB` in the same export; VAT report breaks. Phone formats per region break call-center routing. | Compliance penalty risk + ops friction | Format Standardize (per-row country) + Column Mapper |
#### Bookkeeper / freelance accountant
| # | Pain | $ / time impact | Tools that fix it |
|---|------|-----------------|---|
| B1 | **Bank-export month-overlap re-import.** Same transaction posts twice when Jan and Feb exports overlap at the boundary; client's books understate cash by 14 %. | 24 hr / month / client + reconciliation errors | Dedup with explicit Date+Amount+fuzzy Vendor strategy |
| B2 | **QBO / Xero vendor consolidation for 1099 reports.** "Amazon" / "amazon.com" / "AMAZON.COM*4F2X9" become 3 vendors; 1099 reports break, P&L by vendor unusable. | 12 hr / 1099 cycle + IRS-paper-trail risk | Format Standardize (name canonicalization) + Dedup |
| B3 | **Liability / professional indemnity.** Cannot use AI tools that don't show their work; client audit response window is 2448 h. | Per-firm liability premium ≈ $5002,500 / yr | Audit log built into every tool — every change row-logged |
| B4 | **Per-license-not-per-client economics.** Most cleanup tools are per-seat / per-client SaaS; bookkeepers managing 1030 clients hit price walls fast. | $30/mo × N clients vs. $49 once | Desktop license, no per-client constraint |
| B5 | **Multi-currency books.** US-domiciled clients with EU customers; comma-decimal amounts (`€1.234,56`) crash standard parsers; parens-negative (`($89.50)`) treated as positive. | 3060 min per multi-currency client per month | Format Standardize (`currency_decimal=auto`, parens-negative) |
#### Marketing / RevOps agency
| # | Pain | $ / time impact | Tools that fix it |
|---|------|-----------------|---|
| R1 | **HubSpot / Marketo / Iterable per-contact tier pricing.** 10 k contacts → enterprise tier at $48 k/mo. Every duplicate is a recurring tax. | $200800 / month per 1 k duplicate contacts — recurring | Dedup with cross-source merge + Pipeline |
| R2 | **Email-deliverability / sender reputation.** Sending to invalid or duplicate addresses tanks reputation; recovery takes weeks. | Catastrophic — entire email programme degraded | Format Standardize (email canonicalization) + Missing (sentinel detection) |
| R3 | **GDPR / contact-data privacy.** Uploading lead data to a third-party cleaning SaaS is itself a GDPR concern; legal review blocks adoption. | Compliance risk + 48 wk legal-review delay | Local-only desktop app, zero outbound calls |
| R4 | **Multi-vendor lead-source unification.** Apollo, ZoomInfo, LinkedIn Sales Nav, manual scrapes — each export has different headers, scoring, country format. | 13 days per campaign of manual unification | Column Mapper (alias matching) + Format Standardize (per-row country) + Dedup |
| R5 | **Suppression-list management across 5+ platforms.** Each platform has its own format; un-deduped suppression lists let opt-outs slip through, triggering CAN-SPAM / GDPR exposure. | Compliance risk + churn-back cost | Pipeline saved as JSON, re-run on each new suppression batch |
### 2.4 Operationalize the moat the docs already name.
Three durable advantages, each promoted from buried feature to
landing-page H1:
- **Quality**: 1 GB international standardization in ~2.5 minutes,
locally. Excel can't do this; OpenRefine fights you for an hour.
- **Privacy**: "Your data never leaves this computer." Already in the
GUI footer — promote to landing-page lead, screenshot the empty
network tab.
- **Update cadence**: ship a v1.1 patch within 30 days of v1.0 launch.
Not features — *evidence* the product is alive. "Added Czech Republic
phone format support" beats "no updates in 6 months" every time.
### 2.5 Surface the audit-trail feature in sales copy.
Every tool has a structured audit log. Most cleaning tools do not.
Bookkeepers and consultants get fired if they can't show what changed
to a client. The audit feature is currently invisible on every
proposed landing page and should be the **second-largest callout**
right after "runs locally."
Copy seed: *"Every change auditable. Hand the audit CSV to your client
with the cleaned file."*
### 2.6 The Pipeline Runner is the retention multiplier.
A buyer with a saved pipeline isn't a one-off purchase — they're a
recurring user who recommends the product. This is exactly the
behavioural lever the no-touch constraint needs (DECISIONS.md §8
trigger). Build it now (see §2.1 exception).
### 2.7 Add a $199 "priority support" tier post-launch.
Same code, async-email SLA (24 h response). Targets the bookkeeper /
consultant persona whose own time is $300/hr. Zero new product work,
~3× ARPU on 510 % of buyers. Lock the SLA to **async only** so the
no-touch constraint isn't violated. Defer until $5 k/mo MRR (the same
trigger DECISIONS.md §8 already names).
### 2.8 Dependency-aware pipeline UX.
Tools have soft execution-order preferences (Text Clean before Format
Standardize, Format before Dedup, Missing before Dedup). The Pipeline
Runner *recommends* the order, *warns* on reversals, and **never
forces** — the user owns their workflow. Implementation: see
`src/core/pipeline.py` `SOFT_DEPENDENCIES`.
## 3. 90-day execution sequence
| Week | Action | Done when |
|---|---|---|
| 1 | PyInstaller pipeline · Mac/Win unsigned installers · Apple Dev Program enrollment (12 wk lead) | `dist/datatools-mac.dmg` and `dist/datatools-win.exe` install on a clean machine |
| 2 | Demo deployed to Streamlit Cloud · landing page v1 with embedded demo · 3 persona datasets in the demo | Public URL serves a working pipeline run on a sample dataset in < 30 s |
| 3 | Gumroad listing live · share value-first in 3 niche communities (no pitch) · 1 long-tail SEO post for the lead persona | First listing impression captured · post not removed for self-promotion |
| 4 | Pipeline Runner v1.0 shipped (this week, 2026-05-01 — exception per §2.1) · v1.1 patch announced with Tool 09 + intl improvements | Pipeline saves/loads JSON · 3 demo pipelines preloaded |
| 58 | Bookkeeper landing page · agency landing page · second tool's promo cycle · priority-support tier added (defer purchase until §2.7 trigger) | Three live landing pages with distinct H1, demo dataset, conversion target |
| 913 | Tool 0608 only **if** revenue trajectory supports continued investment · otherwise more market work on the existing 5 + 09 | Decision made on 13 Aug 2026 with revenue data, not feature ambition |
## 4. Decision triggers (re-evaluation prompts)
These flip the plan, not the underlying criteria:
| Trigger | Reaction |
|---|---|
| First paying customer in week 413 | Continue. Plan is working. |
| **Zero** paid in 90 days | Audit the funnel. Demo conversion? Niche fit? Price? Don't add features. |
| $5 k/mo MRR | DECISIONS.md §8 trigger fires: revisit async + priority-support tier. |
| Marketplace policy / shutdown | Switch to own-domain Stripe immediately; landing pages are already self-hosted. |
| Streamlit hard direction change | Low-probability re-lock per DECISIONS.md §8. Tk fallback is documented. |
## 5. Anti-temptations (things the plan refuses)
- **More tools before more buyers.** Locked. Exception only for Pipeline Runner per §2.1.
- **SaaS pivot.** Recurring infra conflicts with the lifestyle constraint (DECISIONS.md §4).
- **Live chat / sales calls.** Conflicts with no-touch (DECISIONS.md §1 #8).
- **Custom integrations / one-off consulting.** $300/hr looks tempting; breaks the "build once, sell many" model that justifies the entire strategy.
- **Going broad on personas.** "All small businesses" is a generic landing page that converts at 1 %; "Shopify pet-supply operators with 1k50k customers" converts at 515 % in the right communities.
## 6. What this plan deliberately leaves open
- Whether tools 0608 ever ship. Decided on revenue, not roadmap.
- Whether to add a fourth niche landing page. Decided on which of the
three is producing.
- Whether to invest in own-domain SEO. Compounding 618 mo asset; not
the early-stage channel. Revisit when marketplace + community
produces baseline traffic to optimise.
- Whether to add a Notion / Slack support community. If support volume
per 100 sales > 10 (BUSINESS.md §12 target), revisit; else leave async-email only.