Tools shipped this batch (4 → 6 of 9 Ready):
04 Missing Value Handler src/core/missing.py + cli_missing.py + GUI
05 Column Mapper src/core/column_mapper.py + cli_column_map.py + GUI
09 Pipeline Runner src/core/pipeline.py + cli_pipeline.py + GUI
with soft tool-dependency graph (recommended,
not enforced) and JSON save/load for repeatable
weekly cleanups.
Format Standardizer reworked for 1 GB international files:
• Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
• Per-row country / address columns drive parsing
• Audit cap (default 10 k rows, ~50 MB RAM)
• standardize_file(): chunked streaming entry point (~165 k rows/sec)
• currency_decimal="auto" for EU comma-decimal locales
• R$ / kr / zł multi-char currency prefixes
• cli_format.py with auto-stream above 100 MB inputs
Encoding detection arbiter + language-aware probe:
Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.
Distribution-readiness assets:
• streamlit_app.py — Streamlit Community Cloud entry shim
• src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
100-row cap + watermark, free-vs-paid boundary enforced at surface
• samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
• landing/ — 4 static HTML pages (apex chooser + 3 niche),
shared CSS, deploy.py URL-substitution script,
auto-generated robots.txt + sitemap.xml + 404.html + favicon
• docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
— full strategy + measurement + deployment + master checklist
Test counts:
before: 1,520 passed · 4 skipped · 17 xfailed
after: 1,729 passed · 0 skipped · 0 xfailed
Tier-1 corpora added:
• missing-corpus 3 use cases + 16 edge cases
• column-mapper-corpus 3 use cases + 5 edge cases
• format-cleaner intl 20-row 13-country stress fixture
Engine hardening flushed out by the corpora:
• interpolate guards against object-dtype columns
• mean/median skip all-NaN columns (silences numpy warning)
• fillna runs under future.no_silent_downcasting (silences pandas warning)
• mojibake test no longer skips when ftfy installed (monkeypatch path)
• drop-row threshold semantics: strict-greater (consistent across rows / cols)
• currency_decimal validator allow-set updated for "auto"
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
14 KiB
Strategic Plan — DataTools
Creator-only. Locks the "what next" in light of the locked criteria (DECISIONS.md §1) and the v1.6 honest status (BUSINESS.md §13). Version: 1.0 · Adopted: 2026-05-01 · Owner: Michael
This document is the active plan, derived from the strategic review of 2026-05-01. It compresses the eight strategic moves and a 90-day execution sequence onto one page so the next decision (build vs. ship vs. market) has a single reference.
It is not a re-lock of operating criteria — those still live in DECISIONS.md and have not changed. This plan is downstream of those criteria; if a move below conflicts with §1 of Decisions, the criteria win.
1. Frame
Locked context (BUSINESS.md, DECISIONS.md):
- Niche Python automation tools, $49–79 single / $149 suite.
- Cash budget ≤ $1,200/mo recurring · Time ≤ 10 hr/wk · No external funding.
- Async + no-touch sales (revisit at $5k/mo MRR).
- Marketplace-first distribution (Gumroad / Lemon Squeezy).
- Streamlit GUI + CLI dual interface, runs locally.
- Lifestyle cashflow goal (no exit needed).
Honest current state (2026-05-01):
| Asset | State |
|---|---|
| Tools 1–5 (Dedup, Text Clean, Format Standardize, Missing, Column Mapper) | Ready · 1,691 tests passing · 0 xfailed |
| Tools 6–9 (Outlier, Multi-File Merge, Validator, Pipeline) | Coming Soon |
| PyInstaller installer pipeline | Not started |
| macOS code signing (Apple Dev Program) | Not started |
| Hosted browser demo (Streamlit Cloud) | Not deployed |
| Landing page | Not live |
| Marketplace listing (Gumroad) | Not listed |
| Paying customers | 0 |
Diagnosis: the bottleneck is not feature count — it's distribution. The next $1 of value comes from closing the gap between "code-complete" and "buyer-pulls-out-card", not from tool 6.
2. The eight strategic moves
Numbered moves. Each is consistent with locked criteria.
2.1 Freeze new-tool development (one exception). Ship what exists.
Tools 6–8 are blocked behind a distribution gate: no work on them until the existing 5 tools have a paying customer + one external review (BUSINESS.md §4 sequence rule, applied recursively inside the bundle).
Exception granted 2026-05-01: Tool 09 Pipeline Runner is built now. Rationale: the pipeline transforms the bundle from "5 tools you buy" into "an automatable workflow you depend on." That conversion is what produces retention and word-of-mouth — the only marketing channel that scales under the no-network/no-touch constraint.
2.2 The demo is the product. Make it embarrassingly good.
- Three persona-tagged sample datasets, not one generic CSV: Shopify customers / bookkeeper bank export / agency lead list.
- Run the full pipeline on the sample (Review → Dedup → Text Clean → Format → Missing → Column Map). Free version caps output rows, not the experience.
- Embed the demo as an iframe on the landing page (not "click to open"). Friction kills conversion.
- Persistent CTA after demo: "Run this on your own 50 k-row file → buy for $49 →" directly above the Gumroad button.
2.3 Niche down. Stop selling "data cleaning."
One engine, three landing pages:
| Persona | Landing-page lead | Demo dataset |
|---|---|---|
| Shopify operator (priority: pet supplies) | "Clean your customer / vendor / subscriber exports" | uc01_shopify_customer_list |
| Bookkeeper / freelance accountant | "Reconcile bank exports + vendor lists. Auditable changes." | uc06_bank_export_overlap |
| Marketing / RevOps agency | "Dedupe lead lists. Standardize phones across vendors." | uc13_combined_lead_sources |
Generic copy competes with pip install pandas. Vertical copy
competes with nothing.
2.3a Top pain points per niche
The "what does this actually fix?" question. Each pain point below is
sourced from operator-domain knowledge of these markets and the
buyer-use-case research already captured in BUSINESS.md §4a. Pain
points are ranked by frequency × dollar impact for that persona —
high-frequency / high-cost pains lead the landing-page copy and the
demo dataset.
Validation gap (honest disclaimer): these pains are derived from operator knowledge of the categories, not from a sample of buyer interviews. Per
BUSINESS.md §8(no-touch constraint review at $5k/mo MRR), validate the top-3 per persona via 5 buyer interviews before the first $200 of paid acquisition spend. If any pain ranks below the assumed level, swap it for the next-highest in this list.
Shopify operator (priority: pet supplies)
| # | Pain | $ / time impact | Tools that fix it |
|---|---|---|---|
| S1 | Klaviyo / Mailchimp / Omnisend per-contact billing. Subscriber list with 10–18 % duplicate rate (case drift, plus signs in Gmail addresses, multiple devices) → recurring overpay forever. | $30–300/mo per percent of dupes on a 50 k list — recurring | Dedup + Format Standardize (email canonicalization) + Pipeline (re-run weekly) |
| S2 | Product feed rejected by Google Merchant Center / Meta Catalog. Smart quotes in titles, NBSP in SKU, inconsistent attributes; campaign launch delayed 24–72 h while feed gets fixed. | 1–3 days delayed launch × campaign value | Text Cleaner + Format Standardize |
| S3 | Multi-channel order consolidation. Shopify + Etsy + Amazon + Faire + wholesale spreadsheet, each with a different column for "customer email" / "order total" / "ship country". | 4–8 hr / month manually merging | Column Mapper + Dedup + Pipeline |
| S4 | Subscription identity fragmentation. Pet-box subscribers cancel and re-sub under a different email; cohort analysis says churn is 20 % when it's actually 12 % — pricing decisions wrong. | Mis-priced LTV → over- or under-paid acquisition | Dedup with merge=true survivor |
| S5 | International tax / VAT MOSS compliance. Country column is UK / U.K. / United Kingdom / GB in the same export; VAT report breaks. Phone formats per region break call-center routing. |
Compliance penalty risk + ops friction | Format Standardize (per-row country) + Column Mapper |
Bookkeeper / freelance accountant
| # | Pain | $ / time impact | Tools that fix it |
|---|---|---|---|
| B1 | Bank-export month-overlap re-import. Same transaction posts twice when Jan and Feb exports overlap at the boundary; client's books understate cash by 1–4 %. | 2–4 hr / month / client + reconciliation errors | Dedup with explicit Date+Amount+fuzzy Vendor strategy |
| B2 | QBO / Xero vendor consolidation for 1099 reports. "Amazon" / "amazon.com" / "AMAZON.COM*4F2X9" become 3 vendors; 1099 reports break, P&L by vendor unusable. | 1–2 hr / 1099 cycle + IRS-paper-trail risk | Format Standardize (name canonicalization) + Dedup |
| B3 | Liability / professional indemnity. Cannot use AI tools that don't show their work; client audit response window is 24–48 h. | Per-firm liability premium ≈ $500–2,500 / yr | Audit log built into every tool — every change row-logged |
| B4 | Per-license-not-per-client economics. Most cleanup tools are per-seat / per-client SaaS; bookkeepers managing 10–30 clients hit price walls fast. | $30/mo × N clients vs. $49 once | Desktop license, no per-client constraint |
| B5 | Multi-currency books. US-domiciled clients with EU customers; comma-decimal amounts (€1.234,56) crash standard parsers; parens-negative (($89.50)) treated as positive. |
30–60 min per multi-currency client per month | Format Standardize (currency_decimal=auto, parens-negative) |
Marketing / RevOps agency
| # | Pain | $ / time impact | Tools that fix it |
|---|---|---|---|
| R1 | HubSpot / Marketo / Iterable per-contact tier pricing. 10 k contacts → enterprise tier at $4–8 k/mo. Every duplicate is a recurring tax. | $200–800 / month per 1 k duplicate contacts — recurring | Dedup with cross-source merge + Pipeline |
| R2 | Email-deliverability / sender reputation. Sending to invalid or duplicate addresses tanks reputation; recovery takes weeks. | Catastrophic — entire email programme degraded | Format Standardize (email canonicalization) + Missing (sentinel detection) |
| R3 | GDPR / contact-data privacy. Uploading lead data to a third-party cleaning SaaS is itself a GDPR concern; legal review blocks adoption. | Compliance risk + 4–8 wk legal-review delay | Local-only desktop app, zero outbound calls |
| R4 | Multi-vendor lead-source unification. Apollo, ZoomInfo, LinkedIn Sales Nav, manual scrapes — each export has different headers, scoring, country format. | 1–3 days per campaign of manual unification | Column Mapper (alias matching) + Format Standardize (per-row country) + Dedup |
| R5 | Suppression-list management across 5+ platforms. Each platform has its own format; un-deduped suppression lists let opt-outs slip through, triggering CAN-SPAM / GDPR exposure. | Compliance risk + churn-back cost | Pipeline saved as JSON, re-run on each new suppression batch |
2.4 Operationalize the moat the docs already name.
Three durable advantages, each promoted from buried feature to landing-page H1:
- Quality: 1 GB international standardization in ~2.5 minutes, locally. Excel can't do this; OpenRefine fights you for an hour.
- Privacy: "Your data never leaves this computer." Already in the GUI footer — promote to landing-page lead, screenshot the empty network tab.
- Update cadence: ship a v1.1 patch within 30 days of v1.0 launch. Not features — evidence the product is alive. "Added Czech Republic phone format support" beats "no updates in 6 months" every time.
2.5 Surface the audit-trail feature in sales copy.
Every tool has a structured audit log. Most cleaning tools do not. Bookkeepers and consultants get fired if they can't show what changed to a client. The audit feature is currently invisible on every proposed landing page and should be the second-largest callout — right after "runs locally."
Copy seed: "Every change auditable. Hand the audit CSV to your client with the cleaned file."
2.6 The Pipeline Runner is the retention multiplier.
A buyer with a saved pipeline isn't a one-off purchase — they're a recurring user who recommends the product. This is exactly the behavioural lever the no-touch constraint needs (DECISIONS.md §8 trigger). Build it now (see §2.1 exception).
2.7 Add a $199 "priority support" tier post-launch.
Same code, async-email SLA (24 h response). Targets the bookkeeper / consultant persona whose own time is $300/hr. Zero new product work, ~3× ARPU on 5–10 % of buyers. Lock the SLA to async only so the no-touch constraint isn't violated. Defer until $5 k/mo MRR (the same trigger DECISIONS.md §8 already names).
2.8 Dependency-aware pipeline UX.
Tools have soft execution-order preferences (Text Clean before Format
Standardize, Format before Dedup, Missing before Dedup). The Pipeline
Runner recommends the order, warns on reversals, and never
forces — the user owns their workflow. Implementation: see
src/core/pipeline.py SOFT_DEPENDENCIES.
3. 90-day execution sequence
| Week | Action | Done when |
|---|---|---|
| 1 | PyInstaller pipeline · Mac/Win unsigned installers · Apple Dev Program enrollment (1–2 wk lead) | dist/datatools-mac.dmg and dist/datatools-win.exe install on a clean machine |
| 2 | Demo deployed to Streamlit Cloud · landing page v1 with embedded demo · 3 persona datasets in the demo | Public URL serves a working pipeline run on a sample dataset in < 30 s |
| 3 | Gumroad listing live · share value-first in 3 niche communities (no pitch) · 1 long-tail SEO post for the lead persona | First listing impression captured · post not removed for self-promotion |
| 4 | Pipeline Runner v1.0 shipped (this week, 2026-05-01 — exception per §2.1) · v1.1 patch announced with Tool 09 + intl improvements | Pipeline saves/loads JSON · 3 demo pipelines preloaded |
| 5–8 | Bookkeeper landing page · agency landing page · second tool's promo cycle · priority-support tier added (defer purchase until §2.7 trigger) | Three live landing pages with distinct H1, demo dataset, conversion target |
| 9–13 | Tool 06–08 only if revenue trajectory supports continued investment · otherwise more market work on the existing 5 + 09 | Decision made on 13 Aug 2026 with revenue data, not feature ambition |
4. Decision triggers (re-evaluation prompts)
These flip the plan, not the underlying criteria:
| Trigger | Reaction |
|---|---|
| First paying customer in week 4–13 | Continue. Plan is working. |
| Zero paid in 90 days | Audit the funnel. Demo conversion? Niche fit? Price? Don't add features. |
| $5 k/mo MRR | DECISIONS.md §8 trigger fires: revisit async + priority-support tier. |
| Marketplace policy / shutdown | Switch to own-domain Stripe immediately; landing pages are already self-hosted. |
| Streamlit hard direction change | Low-probability re-lock per DECISIONS.md §8. Tk fallback is documented. |
5. Anti-temptations (things the plan refuses)
- More tools before more buyers. Locked. Exception only for Pipeline Runner per §2.1.
- SaaS pivot. Recurring infra conflicts with the lifestyle constraint (DECISIONS.md §4).
- Live chat / sales calls. Conflicts with no-touch (DECISIONS.md §1 #8).
- Custom integrations / one-off consulting. $300/hr looks tempting; breaks the "build once, sell many" model that justifies the entire strategy.
- Going broad on personas. "All small businesses" is a generic landing page that converts at 1 %; "Shopify pet-supply operators with 1k–50k customers" converts at 5–15 % in the right communities.
6. What this plan deliberately leaves open
- Whether tools 06–08 ever ship. Decided on revenue, not roadmap.
- Whether to add a fourth niche landing page. Decided on which of the three is producing.
- Whether to invest in own-domain SEO. Compounding 6–18 mo asset; not the early-stage channel. Revisit when marketplace + community produces baseline traffic to optimise.
- Whether to add a Notion / Slack support community. If support volume per 100 sales > 10 (BUSINESS.md §12 target), revisit; else leave async-email only.