Files
datatools-dev/docs/PLAN.md
Michael 966af8ef94 feat: 3 new tools, format streaming, distribution-ready demo + landing pages
Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:31:26 +00:00

14 KiB
Raw Blame History

Strategic Plan — DataTools

Creator-only. Locks the "what next" in light of the locked criteria (DECISIONS.md §1) and the v1.6 honest status (BUSINESS.md §13). Version: 1.0 · Adopted: 2026-05-01 · Owner: Michael

This document is the active plan, derived from the strategic review of 2026-05-01. It compresses the eight strategic moves and a 90-day execution sequence onto one page so the next decision (build vs. ship vs. market) has a single reference.

It is not a re-lock of operating criteria — those still live in DECISIONS.md and have not changed. This plan is downstream of those criteria; if a move below conflicts with §1 of Decisions, the criteria win.

1. Frame

Locked context (BUSINESS.md, DECISIONS.md):

  • Niche Python automation tools, $4979 single / $149 suite.
  • Cash budget ≤ $1,200/mo recurring · Time ≤ 10 hr/wk · No external funding.
  • Async + no-touch sales (revisit at $5k/mo MRR).
  • Marketplace-first distribution (Gumroad / Lemon Squeezy).
  • Streamlit GUI + CLI dual interface, runs locally.
  • Lifestyle cashflow goal (no exit needed).

Honest current state (2026-05-01):

Asset State
Tools 15 (Dedup, Text Clean, Format Standardize, Missing, Column Mapper) Ready · 1,691 tests passing · 0 xfailed
Tools 69 (Outlier, Multi-File Merge, Validator, Pipeline) Coming Soon
PyInstaller installer pipeline Not started
macOS code signing (Apple Dev Program) Not started
Hosted browser demo (Streamlit Cloud) Not deployed
Landing page Not live
Marketplace listing (Gumroad) Not listed
Paying customers 0

Diagnosis: the bottleneck is not feature count — it's distribution. The next $1 of value comes from closing the gap between "code-complete" and "buyer-pulls-out-card", not from tool 6.

2. The eight strategic moves

Numbered moves. Each is consistent with locked criteria.

2.1 Freeze new-tool development (one exception). Ship what exists.

Tools 68 are blocked behind a distribution gate: no work on them until the existing 5 tools have a paying customer + one external review (BUSINESS.md §4 sequence rule, applied recursively inside the bundle).

Exception granted 2026-05-01: Tool 09 Pipeline Runner is built now. Rationale: the pipeline transforms the bundle from "5 tools you buy" into "an automatable workflow you depend on." That conversion is what produces retention and word-of-mouth — the only marketing channel that scales under the no-network/no-touch constraint.

2.2 The demo is the product. Make it embarrassingly good.

  • Three persona-tagged sample datasets, not one generic CSV: Shopify customers / bookkeeper bank export / agency lead list.
  • Run the full pipeline on the sample (Review → Dedup → Text Clean → Format → Missing → Column Map). Free version caps output rows, not the experience.
  • Embed the demo as an iframe on the landing page (not "click to open"). Friction kills conversion.
  • Persistent CTA after demo: "Run this on your own 50 k-row file → buy for $49 →" directly above the Gumroad button.

2.3 Niche down. Stop selling "data cleaning."

One engine, three landing pages:

Persona Landing-page lead Demo dataset
Shopify operator (priority: pet supplies) "Clean your customer / vendor / subscriber exports" uc01_shopify_customer_list
Bookkeeper / freelance accountant "Reconcile bank exports + vendor lists. Auditable changes." uc06_bank_export_overlap
Marketing / RevOps agency "Dedupe lead lists. Standardize phones across vendors." uc13_combined_lead_sources

Generic copy competes with pip install pandas. Vertical copy competes with nothing.

2.3a Top pain points per niche

The "what does this actually fix?" question. Each pain point below is sourced from operator-domain knowledge of these markets and the buyer-use-case research already captured in BUSINESS.md §4a. Pain points are ranked by frequency × dollar impact for that persona — high-frequency / high-cost pains lead the landing-page copy and the demo dataset.

Validation gap (honest disclaimer): these pains are derived from operator knowledge of the categories, not from a sample of buyer interviews. Per BUSINESS.md §8 (no-touch constraint review at $5k/mo MRR), validate the top-3 per persona via 5 buyer interviews before the first $200 of paid acquisition spend. If any pain ranks below the assumed level, swap it for the next-highest in this list.

Shopify operator (priority: pet supplies)

# Pain $ / time impact Tools that fix it
S1 Klaviyo / Mailchimp / Omnisend per-contact billing. Subscriber list with 1018 % duplicate rate (case drift, plus signs in Gmail addresses, multiple devices) → recurring overpay forever. $30300/mo per percent of dupes on a 50 k list — recurring Dedup + Format Standardize (email canonicalization) + Pipeline (re-run weekly)
S2 Product feed rejected by Google Merchant Center / Meta Catalog. Smart quotes in titles, NBSP in SKU, inconsistent attributes; campaign launch delayed 2472 h while feed gets fixed. 13 days delayed launch × campaign value Text Cleaner + Format Standardize
S3 Multi-channel order consolidation. Shopify + Etsy + Amazon + Faire + wholesale spreadsheet, each with a different column for "customer email" / "order total" / "ship country". 48 hr / month manually merging Column Mapper + Dedup + Pipeline
S4 Subscription identity fragmentation. Pet-box subscribers cancel and re-sub under a different email; cohort analysis says churn is 20 % when it's actually 12 % — pricing decisions wrong. Mis-priced LTV → over- or under-paid acquisition Dedup with merge=true survivor
S5 International tax / VAT MOSS compliance. Country column is UK / U.K. / United Kingdom / GB in the same export; VAT report breaks. Phone formats per region break call-center routing. Compliance penalty risk + ops friction Format Standardize (per-row country) + Column Mapper

Bookkeeper / freelance accountant

# Pain $ / time impact Tools that fix it
B1 Bank-export month-overlap re-import. Same transaction posts twice when Jan and Feb exports overlap at the boundary; client's books understate cash by 14 %. 24 hr / month / client + reconciliation errors Dedup with explicit Date+Amount+fuzzy Vendor strategy
B2 QBO / Xero vendor consolidation for 1099 reports. "Amazon" / "amazon.com" / "AMAZON.COM*4F2X9" become 3 vendors; 1099 reports break, P&L by vendor unusable. 12 hr / 1099 cycle + IRS-paper-trail risk Format Standardize (name canonicalization) + Dedup
B3 Liability / professional indemnity. Cannot use AI tools that don't show their work; client audit response window is 2448 h. Per-firm liability premium ≈ $5002,500 / yr Audit log built into every tool — every change row-logged
B4 Per-license-not-per-client economics. Most cleanup tools are per-seat / per-client SaaS; bookkeepers managing 1030 clients hit price walls fast. $30/mo × N clients vs. $49 once Desktop license, no per-client constraint
B5 Multi-currency books. US-domiciled clients with EU customers; comma-decimal amounts (€1.234,56) crash standard parsers; parens-negative (($89.50)) treated as positive. 3060 min per multi-currency client per month Format Standardize (currency_decimal=auto, parens-negative)

Marketing / RevOps agency

# Pain $ / time impact Tools that fix it
R1 HubSpot / Marketo / Iterable per-contact tier pricing. 10 k contacts → enterprise tier at $48 k/mo. Every duplicate is a recurring tax. $200800 / month per 1 k duplicate contacts — recurring Dedup with cross-source merge + Pipeline
R2 Email-deliverability / sender reputation. Sending to invalid or duplicate addresses tanks reputation; recovery takes weeks. Catastrophic — entire email programme degraded Format Standardize (email canonicalization) + Missing (sentinel detection)
R3 GDPR / contact-data privacy. Uploading lead data to a third-party cleaning SaaS is itself a GDPR concern; legal review blocks adoption. Compliance risk + 48 wk legal-review delay Local-only desktop app, zero outbound calls
R4 Multi-vendor lead-source unification. Apollo, ZoomInfo, LinkedIn Sales Nav, manual scrapes — each export has different headers, scoring, country format. 13 days per campaign of manual unification Column Mapper (alias matching) + Format Standardize (per-row country) + Dedup
R5 Suppression-list management across 5+ platforms. Each platform has its own format; un-deduped suppression lists let opt-outs slip through, triggering CAN-SPAM / GDPR exposure. Compliance risk + churn-back cost Pipeline saved as JSON, re-run on each new suppression batch

2.4 Operationalize the moat the docs already name.

Three durable advantages, each promoted from buried feature to landing-page H1:

  • Quality: 1 GB international standardization in ~2.5 minutes, locally. Excel can't do this; OpenRefine fights you for an hour.
  • Privacy: "Your data never leaves this computer." Already in the GUI footer — promote to landing-page lead, screenshot the empty network tab.
  • Update cadence: ship a v1.1 patch within 30 days of v1.0 launch. Not features — evidence the product is alive. "Added Czech Republic phone format support" beats "no updates in 6 months" every time.

2.5 Surface the audit-trail feature in sales copy.

Every tool has a structured audit log. Most cleaning tools do not. Bookkeepers and consultants get fired if they can't show what changed to a client. The audit feature is currently invisible on every proposed landing page and should be the second-largest callout — right after "runs locally."

Copy seed: "Every change auditable. Hand the audit CSV to your client with the cleaned file."

2.6 The Pipeline Runner is the retention multiplier.

A buyer with a saved pipeline isn't a one-off purchase — they're a recurring user who recommends the product. This is exactly the behavioural lever the no-touch constraint needs (DECISIONS.md §8 trigger). Build it now (see §2.1 exception).

2.7 Add a $199 "priority support" tier post-launch.

Same code, async-email SLA (24 h response). Targets the bookkeeper / consultant persona whose own time is $300/hr. Zero new product work, ~3× ARPU on 510 % of buyers. Lock the SLA to async only so the no-touch constraint isn't violated. Defer until $5 k/mo MRR (the same trigger DECISIONS.md §8 already names).

2.8 Dependency-aware pipeline UX.

Tools have soft execution-order preferences (Text Clean before Format Standardize, Format before Dedup, Missing before Dedup). The Pipeline Runner recommends the order, warns on reversals, and never forces — the user owns their workflow. Implementation: see src/core/pipeline.py SOFT_DEPENDENCIES.

3. 90-day execution sequence

Week Action Done when
1 PyInstaller pipeline · Mac/Win unsigned installers · Apple Dev Program enrollment (12 wk lead) dist/datatools-mac.dmg and dist/datatools-win.exe install on a clean machine
2 Demo deployed to Streamlit Cloud · landing page v1 with embedded demo · 3 persona datasets in the demo Public URL serves a working pipeline run on a sample dataset in < 30 s
3 Gumroad listing live · share value-first in 3 niche communities (no pitch) · 1 long-tail SEO post for the lead persona First listing impression captured · post not removed for self-promotion
4 Pipeline Runner v1.0 shipped (this week, 2026-05-01 — exception per §2.1) · v1.1 patch announced with Tool 09 + intl improvements Pipeline saves/loads JSON · 3 demo pipelines preloaded
58 Bookkeeper landing page · agency landing page · second tool's promo cycle · priority-support tier added (defer purchase until §2.7 trigger) Three live landing pages with distinct H1, demo dataset, conversion target
913 Tool 0608 only if revenue trajectory supports continued investment · otherwise more market work on the existing 5 + 09 Decision made on 13 Aug 2026 with revenue data, not feature ambition

4. Decision triggers (re-evaluation prompts)

These flip the plan, not the underlying criteria:

Trigger Reaction
First paying customer in week 413 Continue. Plan is working.
Zero paid in 90 days Audit the funnel. Demo conversion? Niche fit? Price? Don't add features.
$5 k/mo MRR DECISIONS.md §8 trigger fires: revisit async + priority-support tier.
Marketplace policy / shutdown Switch to own-domain Stripe immediately; landing pages are already self-hosted.
Streamlit hard direction change Low-probability re-lock per DECISIONS.md §8. Tk fallback is documented.

5. Anti-temptations (things the plan refuses)

  • More tools before more buyers. Locked. Exception only for Pipeline Runner per §2.1.
  • SaaS pivot. Recurring infra conflicts with the lifestyle constraint (DECISIONS.md §4).
  • Live chat / sales calls. Conflicts with no-touch (DECISIONS.md §1 #8).
  • Custom integrations / one-off consulting. $300/hr looks tempting; breaks the "build once, sell many" model that justifies the entire strategy.
  • Going broad on personas. "All small businesses" is a generic landing page that converts at 1 %; "Shopify pet-supply operators with 1k50k customers" converts at 515 % in the right communities.

6. What this plan deliberately leaves open

  • Whether tools 0608 ever ship. Decided on revenue, not roadmap.
  • Whether to add a fourth niche landing page. Decided on which of the three is producing.
  • Whether to invest in own-domain SEO. Compounding 618 mo asset; not the early-stage channel. Revisit when marketplace + community produces baseline traffic to optimise.
  • Whether to add a Notion / Slack support community. If support volume per 100 sales > 10 (BUSINESS.md §12 target), revisit; else leave async-email only.