New ``docs/FUTURE-TOOLS.md`` captures post-launch tool ideas with a consistent shape — What / Why / Can we ship now / Approach / GUI sketch / Effort / Risks / Ship criteria. Resting place for things the new-tool freeze in ``PLAN.md`` §2.1 refuses to build but that keep coming up. First entry: **#10 PDF → CSV extractor** (bank statements et al.). Key facts captured: - **Current state**: no PDF infrastructure exists. Zero PDF dependencies in requirements.txt; zero PDF-touching code under ``src/``. The only "PDF" string in the codebase is the planned- output copy for the Quality Check tool, unrelated to extraction. - **Library picks**: pdfplumber as the extraction core (BSD-3, no native compiler, gives coordinate-aware text), Tesseract via pytesseract as the OCR fallback for scanned PDFs, streamlit-drawable-canvas as the region-picker component. - **GUI sketch**: user draws a header strip + a row template on a rendered page; the tool applies that template across N pages, saves the template by layout fingerprint for next month's statement, emits CSV. - **Effort phased A–E**: 3–4 weeks for a text-only MVP; 6–10 weeks for a polished version with multi-page template recall; +2–3 weeks if scanned-PDF OCR is required. - **Difficulty**: medium-hard. The pieces are well-trodden; the combination (region selection that persists across pages and across documents with similar layouts) is where the engineering goes. - **Ship criteria**: ≥1 paying customer + ≥3 paid or ≥5 demo emails asking for PDF extraction + the bookkeeper niche converting at least one customer first. None have fired. Cross-references added: - ``docs/REQUIREMENTS.md`` §11: pointer to FUTURE-TOOLS.md for parked tool ideas, with a one-paragraph summary of #10. - ``docs/PLAN.md`` §2.1: notes that the freeze parks future tools in FUTURE-TOOLS.md and explicitly names #10 as the current highest-pressure entry. - ``docs/NEXT-STEPS.md`` Phase 5 "what NOT to build" table: a new row for the PDF tool tied to the same ship-trigger language. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
15 KiB
Strategic Plan — DataTools
Creator-only. Locks the "what next" in light of the locked criteria (DECISIONS.md §1) and the v1.6 honest status (BUSINESS.md §13). Version: 1.0 · Adopted: 2026-05-01 · Owner: Michael
This document is the active plan, derived from the strategic review of 2026-05-01. It compresses the eight strategic moves and a 90-day execution sequence onto one page so the next decision (build vs. ship vs. market) has a single reference.
It is not a re-lock of operating criteria — those still live in DECISIONS.md and have not changed. This plan is downstream of those criteria; if a move below conflicts with §1 of Decisions, the criteria win.
1. Frame
Locked context (BUSINESS.md, DECISIONS.md):
- Niche Python automation tools, $49–79 single / $149 suite.
- Cash budget ≤ $1,200/mo recurring · Time ≤ 10 hr/wk · No external funding.
- Async + no-touch sales (revisit at $5k/mo MRR).
- Marketplace-first distribution (Gumroad / Lemon Squeezy).
- Streamlit GUI + CLI dual interface, runs locally.
- Lifestyle cashflow goal (no exit needed).
Honest current state (2026-05-01):
| Asset | State |
|---|---|
| Tools 1–5 (Find Duplicates, Clean Text, Standardize Formats, Fix Missing Values, Map Columns) | Ready · 1,691 tests passing · 0 xfailed |
| Tools 6–9 (Find Unusual Values, Combine Files, Quality Check, Automated Workflows) | Coming Soon |
| PyInstaller installer pipeline | Not started |
| macOS code signing (Apple Dev Program) | Not started |
| Hosted browser demo (Streamlit Cloud) | Not deployed |
| Landing page | Not live |
| Marketplace listing (Gumroad) | Not listed |
| Paying customers | 0 |
Diagnosis: the bottleneck is not feature count — it's distribution. The next $1 of value comes from closing the gap between "code-complete" and "buyer-pulls-out-card", not from tool 6.
2. The eight strategic moves
Numbered moves. Each is consistent with locked criteria.
2.1 Freeze new-tool development (one exception). Ship what exists.
Tools 6–8 are blocked behind a distribution gate: no work on them until the existing 5 tools have a paying customer + one external review (BUSINESS.md §4 sequence rule, applied recursively inside the bundle).
Exception granted 2026-05-01: Tool 09 Automated Workflows is built now. Rationale: the pipeline transforms the bundle from "5 tools you buy" into "an automatable workflow you depend on." That conversion is what produces retention and word-of-mouth — the only marketing channel that scales under the no-network/no-touch constraint.
Parked behind the freeze: post-launch tool ideas are captured in
docs/FUTURE-TOOLS.md with feasibility, GUI sketch, effort estimate,
and ship criteria for each. Currently parked: #10 PDF → CSV
extractor (bank statements et al.) — gated on a paying customer +
≥3 paying customers or ≥5 demo emails explicitly asking for PDF
extraction, with the bookkeeper niche converting at least one customer
first. None of those triggers have fired yet.
2.2 The demo is the product. Make it embarrassingly good.
- Three persona-tagged sample datasets, not one generic CSV: Shopify customers / bookkeeper bank export / agency lead list.
- Run the full pipeline on the sample (Review → Dedup → Text Clean → Format → Missing → Column Map). Free version caps output rows, not the experience.
- Embed the demo as an iframe on the landing page (not "click to open"). Friction kills conversion.
- Persistent CTA after demo: "Run this on your own 50 k-row file → buy for $49 →" directly above the Gumroad button.
2.3 Niche down. Stop selling "data cleaning."
One engine, three landing pages:
| Persona | Landing-page lead | Demo dataset |
|---|---|---|
| Shopify operator (priority: pet supplies) | "Clean your customer / vendor / subscriber exports" | uc01_shopify_customer_list |
| Bookkeeper / freelance accountant | "Reconcile bank exports + vendor lists. Auditable changes." | uc06_bank_export_overlap |
| Marketing / RevOps agency | "Dedupe lead lists. Standardize phones across vendors." | uc13_combined_lead_sources |
Generic copy competes with pip install pandas. Vertical copy
competes with nothing.
2.3a Top pain points per niche
The "what does this actually fix?" question. Each pain point below is
sourced from operator-domain knowledge of these markets and the
buyer-use-case research already captured in BUSINESS.md §4a. Pain
points are ranked by frequency × dollar impact for that persona —
high-frequency / high-cost pains lead the landing-page copy and the
demo dataset.
Validation gap (honest disclaimer): these pains are derived from operator knowledge of the categories, not from a sample of buyer interviews. Per
BUSINESS.md §8(no-touch constraint review at $5k/mo MRR), validate the top-3 per persona via 5 buyer interviews before the first $200 of paid acquisition spend. If any pain ranks below the assumed level, swap it for the next-highest in this list.
Shopify operator (priority: pet supplies)
| # | Pain | $ / time impact | Tools that fix it |
|---|---|---|---|
| S1 | Klaviyo / Mailchimp / Omnisend per-contact billing. Subscriber list with 10–18 % duplicate rate (case drift, plus signs in Gmail addresses, multiple devices) → recurring overpay forever. | $30–300/mo per percent of dupes on a 50 k list — recurring | Dedup + Format Standardize (email canonicalization) + Pipeline (re-run weekly) |
| S2 | Product feed rejected by Google Merchant Center / Meta Catalog. Smart quotes in titles, NBSP in SKU, inconsistent attributes; campaign launch delayed 24–72 h while feed gets fixed. | 1–3 days delayed launch × campaign value | Clean Text + Standardize Formats |
| S3 | Multi-channel order consolidation. Shopify + Etsy + Amazon + Faire + wholesale spreadsheet, each with a different column for "customer email" / "order total" / "ship country". | 4–8 hr / month manually merging | Map Columns + Find Duplicates + Automated Workflows |
| S4 | Subscription identity fragmentation. Pet-box subscribers cancel and re-sub under a different email; cohort analysis says churn is 20 % when it's actually 12 % — pricing decisions wrong. | Mis-priced LTV → over- or under-paid acquisition | Dedup with merge=true survivor |
| S5 | International tax / VAT MOSS compliance. Country column is UK / U.K. / United Kingdom / GB in the same export; VAT report breaks. Phone formats per region break call-center routing. |
Compliance penalty risk + ops friction | Standardize Formats (per-row country) + Map Columns |
Bookkeeper / freelance accountant
| # | Pain | $ / time impact | Tools that fix it |
|---|---|---|---|
| B1 | Bank-export month-overlap re-import. Same transaction posts twice when Jan and Feb exports overlap at the boundary; client's books understate cash by 1–4 %. | 2–4 hr / month / client + reconciliation errors | Dedup with explicit Date+Amount+fuzzy Vendor strategy |
| B2 | QBO / Xero vendor consolidation for 1099 reports. "Amazon" / "amazon.com" / "AMAZON.COM*4F2X9" become 3 vendors; 1099 reports break, P&L by vendor unusable. | 1–2 hr / 1099 cycle + IRS-paper-trail risk | Format Standardize (name canonicalization) + Dedup |
| B3 | Liability / professional indemnity. Cannot use AI tools that don't show their work; client audit response window is 24–48 h. | Per-firm liability premium ≈ $500–2,500 / yr | Audit log built into every tool — every change row-logged |
| B4 | Per-license-not-per-client economics. Most cleanup tools are per-seat / per-client SaaS; bookkeepers managing 10–30 clients hit price walls fast. | $30/mo × N clients vs. $49 once | Desktop license, no per-client constraint |
| B5 | Multi-currency books. US-domiciled clients with EU customers; comma-decimal amounts (€1.234,56) crash standard parsers; parens-negative (($89.50)) treated as positive. |
30–60 min per multi-currency client per month | Format Standardize (currency_decimal=auto, parens-negative) |
Marketing / RevOps agency
| # | Pain | $ / time impact | Tools that fix it |
|---|---|---|---|
| R1 | HubSpot / Marketo / Iterable per-contact tier pricing. 10 k contacts → enterprise tier at $4–8 k/mo. Every duplicate is a recurring tax. | $200–800 / month per 1 k duplicate contacts — recurring | Dedup with cross-source merge + Pipeline |
| R2 | Email-deliverability / sender reputation. Sending to invalid or duplicate addresses tanks reputation; recovery takes weeks. | Catastrophic — entire email programme degraded | Format Standardize (email canonicalization) + Missing (sentinel detection) |
| R3 | GDPR / contact-data privacy. Uploading lead data to a third-party cleaning SaaS is itself a GDPR concern; legal review blocks adoption. | Compliance risk + 4–8 wk legal-review delay | Local-only desktop app, zero outbound calls |
| R4 | Multi-vendor lead-source unification. Apollo, ZoomInfo, LinkedIn Sales Nav, manual scrapes — each export has different headers, scoring, country format. | 1–3 days per campaign of manual unification | Map Columns (alias matching) + Standardize Formats (per-row country) + Find Duplicates |
| R5 | Suppression-list management across 5+ platforms. Each platform has its own format; un-deduped suppression lists let opt-outs slip through, triggering CAN-SPAM / GDPR exposure. | Compliance risk + churn-back cost | Pipeline saved as JSON, re-run on each new suppression batch |
2.4 Operationalize the moat the docs already name.
Three durable advantages, each promoted from buried feature to landing-page H1:
- Quality: 1 GB international standardization in ~2.5 minutes, locally. Excel can't do this; OpenRefine fights you for an hour.
- Privacy: "Your data never leaves this computer." Already in the GUI footer — promote to landing-page lead, screenshot the empty network tab.
- Update cadence: ship a v1.1 patch within 30 days of v1.0 launch. Not features — evidence the product is alive. "Added Czech Republic phone format support" beats "no updates in 6 months" every time.
2.5 Surface the audit-trail feature in sales copy.
Every tool has a structured audit log. Most cleaning tools do not. Bookkeepers and consultants get fired if they can't show what changed to a client. The audit feature is currently invisible on every proposed landing page and should be the second-largest callout — right after "runs locally."
Copy seed: "Every change auditable. Hand the audit CSV to your client with the cleaned file."
2.6 Automated Workflows is the retention multiplier.
A buyer with a saved pipeline isn't a one-off purchase — they're a recurring user who recommends the product. This is exactly the behavioural lever the no-touch constraint needs (DECISIONS.md §8 trigger). Build it now (see §2.1 exception).
2.7 Add a $199 "priority support" tier post-launch.
Same code, async-email SLA (24 h response). Targets the bookkeeper / consultant persona whose own time is $300/hr. Zero new product work, ~3× ARPU on 5–10 % of buyers. Lock the SLA to async only so the no-touch constraint isn't violated. Defer until $5 k/mo MRR (the same trigger DECISIONS.md §8 already names).
2.8 Dependency-aware pipeline UX.
Tools have soft execution-order preferences (Text Clean before Format
Standardize, Format before Dedup, Missing before Dedup). Automated
Workflows recommends the order, warns on reversals, and never
forces — the user owns their workflow. Implementation: see
src/core/pipeline.py SOFT_DEPENDENCIES.
3. 90-day execution sequence
| Week | Action | Done when |
|---|---|---|
| 1 | PyInstaller pipeline · Mac/Win unsigned installers · Apple Dev Program enrollment (1–2 wk lead) | dist/datatools-mac.dmg and dist/datatools-win.exe install on a clean machine |
| 2 | Demo deployed to Streamlit Cloud · landing page v1 with embedded demo · 3 persona datasets in the demo | Public URL serves a working pipeline run on a sample dataset in < 30 s |
| 3 | Gumroad listing live · share value-first in 3 niche communities (no pitch) · 1 long-tail SEO post for the lead persona | First listing impression captured · post not removed for self-promotion |
| 4 | Automated Workflows v1.0 shipped (this week, 2026-05-01 — exception per §2.1) · v1.1 patch announced with Tool 09 + intl improvements | Pipeline saves/loads JSON · 3 demo pipelines preloaded |
| 5–8 | Bookkeeper landing page · agency landing page · second tool's promo cycle · priority-support tier added (defer purchase until §2.7 trigger) | Three live landing pages with distinct H1, demo dataset, conversion target |
| 9–13 | Tool 06–08 only if revenue trajectory supports continued investment · otherwise more market work on the existing 5 + 09 | Decision made on 13 Aug 2026 with revenue data, not feature ambition |
4. Decision triggers (re-evaluation prompts)
These flip the plan, not the underlying criteria:
| Trigger | Reaction |
|---|---|
| First paying customer in week 4–13 | Continue. Plan is working. |
| Zero paid in 90 days | Audit the funnel. Demo conversion? Niche fit? Price? Don't add features. |
| $5 k/mo MRR | DECISIONS.md §8 trigger fires: revisit async + priority-support tier. |
| Marketplace policy / shutdown | Switch to own-domain Stripe immediately; landing pages are already self-hosted. |
| Streamlit hard direction change | Low-probability re-lock per DECISIONS.md §8. Tk fallback is documented. |
5. Anti-temptations (things the plan refuses)
- More tools before more buyers. Locked. Exception only for Automated Workflows per §2.1.
- SaaS pivot. Recurring infra conflicts with the lifestyle constraint (DECISIONS.md §4).
- Live chat / sales calls. Conflicts with no-touch (DECISIONS.md §1 #8).
- Custom integrations / one-off consulting. $300/hr looks tempting; breaks the "build once, sell many" model that justifies the entire strategy.
- Going broad on personas. "All small businesses" is a generic landing page that converts at 1 %; "Shopify pet-supply operators with 1k–50k customers" converts at 5–15 % in the right communities.
6. What this plan deliberately leaves open
- Whether tools 06–08 ever ship. Decided on revenue, not roadmap.
- Whether to add a fourth niche landing page. Decided on which of the three is producing.
- Whether to invest in own-domain SEO. Compounding 6–18 mo asset; not the early-stage channel. Revisit when marketplace + community produces baseline traffic to optimise.
- Whether to add a Notion / Slack support community. If support volume per 100 sales > 10 (BUSINESS.md §12 target), revisit; else leave async-email only.