datatools-dev/docs/PLAN.md

# Strategic Plan — DataTools

> Creator-only. Locks the "what next" in light of the locked criteria
> (DECISIONS.md §1) and the v1.6 honest status (BUSINESS.md §13).
> **Version**: 1.0 · **Adopted**: 2026-05-01 · **Owner**: Michael

This document is the active plan, derived from the strategic review of
2026-05-01. It compresses the eight strategic moves and a 90-day
execution sequence onto one page so the next decision (build vs.
ship vs. market) has a single reference.

It is **not** a re-lock of operating criteria — those still live in
DECISIONS.md and have not changed. This plan is downstream of those
criteria; if a move below conflicts with §1 of Decisions, the criteria
win.

## 1. Frame

**Locked context** (BUSINESS.md, DECISIONS.md):

- Niche Python automation tools, $49–79 single / $149 suite.
- Cash budget ≤ $1,200/mo recurring · Time ≤ 10 hr/wk · No external funding.
- Async + no-touch sales (revisit at $5k/mo MRR).
- Marketplace-first distribution (Gumroad / Lemon Squeezy).
- Streamlit GUI + CLI dual interface, runs locally.
- Lifestyle cashflow goal (no exit needed).

**Honest current state** (2026-05-01):

| Asset | State |
|---|---|
| Tools 1–5 (Find Duplicates, Clean Text, Standardize Formats, Fix Missing Values, Map Columns) | Ready · 1,691 tests passing · 0 xfailed |
| Tools 6–9 (Find Unusual Values, Combine Files, Quality Check, Automated Workflows) | Coming Soon |
| PyInstaller installer pipeline | Not started |
| macOS code signing (Apple Dev Program) | Not started |
| Hosted browser demo (Streamlit Cloud) | Not deployed |
| Landing page | Not live |
| Marketplace listing (Gumroad) | Not listed |
| Paying customers | 0 |

**Diagnosis**: the bottleneck is not feature count — it's distribution.
The next $1 of value comes from closing the gap between "code-complete"
and "buyer-pulls-out-card", not from tool 6.

## 2. The eight strategic moves

Numbered moves. Each is consistent with locked criteria.

### 2.1 Freeze new-tool development (one exception). Ship what exists.

Tools 6–8 are blocked behind a **distribution gate**: no work on them
until the existing 5 tools have a paying customer + one external review
(BUSINESS.md §4 sequence rule, applied recursively inside the bundle).

**Exception granted 2026-05-01**: Tool 09 Automated Workflows is built
*now*. Rationale: the pipeline transforms the bundle from "5 tools you
buy" into "an automatable workflow you depend on." That conversion is
what produces retention and word-of-mouth — the only marketing channel
that scales under the no-network/no-touch constraint.

### 2.2 The demo *is* the product. Make it embarrassingly good.

- Three persona-tagged sample datasets, not one generic CSV: Shopify
  customers / bookkeeper bank export / agency lead list.
- Run the *full pipeline* on the sample (Review → Dedup → Text Clean →
  Format → Missing → Column Map). Free version caps **output rows**,
  not the experience.
- Embed the demo as an **iframe on the landing page** (not "click to
  open"). Friction kills conversion.
- Persistent CTA after demo: *"Run this on your own 50 k-row file →
  buy for $49 →"* directly above the Gumroad button.

### 2.3 Niche down. Stop selling "data cleaning."

One engine, three landing pages:

| Persona | Landing-page lead | Demo dataset |
|---|---|---|
| Shopify operator (priority: pet supplies) | "Clean your customer / vendor / subscriber exports" | uc01_shopify_customer_list |
| Bookkeeper / freelance accountant | "Reconcile bank exports + vendor lists. Auditable changes." | uc06_bank_export_overlap |
| Marketing / RevOps agency | "Dedupe lead lists. Standardize phones across vendors." | uc13_combined_lead_sources |

Generic copy competes with `pip install pandas`. Vertical copy
competes with nothing.

### 2.3a Top pain points per niche

The "what does this actually fix?" question. Each pain point below is
sourced from operator-domain knowledge of these markets and the
buyer-use-case research already captured in `BUSINESS.md §4a`. Pain
points are ranked by **frequency × dollar impact** for that persona —
high-frequency / high-cost pains lead the landing-page copy and the
demo dataset.

> **Validation gap (honest disclaimer)**: these pains are derived from
> operator knowledge of the categories, not from a sample of buyer
> interviews. Per `BUSINESS.md §8` (no-touch constraint review at $5k/mo
> MRR), validate the top-3 per persona via 5 buyer interviews before the
> first $200 of paid acquisition spend. If any pain ranks below the
> assumed level, swap it for the next-highest in this list.

#### Shopify operator (priority: pet supplies)

| # | Pain | $ / time impact | Tools that fix it |
|---|------|-----------------|---|
| S1 | **Klaviyo / Mailchimp / Omnisend per-contact billing.** Subscriber list with 10–18 % duplicate rate (case drift, plus signs in Gmail addresses, multiple devices) → recurring overpay forever. | $30–300/mo per percent of dupes on a 50 k list — recurring | Dedup + Format Standardize (email canonicalization) + Pipeline (re-run weekly) |
| S2 | **Product feed rejected by Google Merchant Center / Meta Catalog.** Smart quotes in titles, NBSP in SKU, inconsistent attributes; campaign launch delayed 24–72 h while feed gets fixed. | 1–3 days delayed launch × campaign value | Clean Text + Standardize Formats |
| S3 | **Multi-channel order consolidation.** Shopify + Etsy + Amazon + Faire + wholesale spreadsheet, each with a different column for "customer email" / "order total" / "ship country". | 4–8 hr / month manually merging | Map Columns + Find Duplicates + Automated Workflows |
| S4 | **Subscription identity fragmentation.** Pet-box subscribers cancel and re-sub under a different email; cohort analysis says churn is 20 % when it's actually 12 % — pricing decisions wrong. | Mis-priced LTV → over- or under-paid acquisition | Dedup with `merge=true` survivor |
| S5 | **International tax / VAT MOSS compliance.** Country column is `UK` / `U.K.` / `United Kingdom` / `GB` in the same export; VAT report breaks. Phone formats per region break call-center routing. | Compliance penalty risk + ops friction | Standardize Formats (per-row country) + Map Columns |

#### Bookkeeper / freelance accountant

| # | Pain | $ / time impact | Tools that fix it |
|---|------|-----------------|---|
| B1 | **Bank-export month-overlap re-import.** Same transaction posts twice when Jan and Feb exports overlap at the boundary; client's books understate cash by 1–4 %. | 2–4 hr / month / client + reconciliation errors | Dedup with explicit Date+Amount+fuzzy Vendor strategy |
| B2 | **QBO / Xero vendor consolidation for 1099 reports.** "Amazon" / "amazon.com" / "AMAZON.COM*4F2X9" become 3 vendors; 1099 reports break, P&L by vendor unusable. | 1–2 hr / 1099 cycle + IRS-paper-trail risk | Format Standardize (name canonicalization) + Dedup |
| B3 | **Liability / professional indemnity.** Cannot use AI tools that don't show their work; client audit response window is 24–48 h. | Per-firm liability premium ≈ $500–2,500 / yr | Audit log built into every tool — every change row-logged |
| B4 | **Per-license-not-per-client economics.** Most cleanup tools are per-seat / per-client SaaS; bookkeepers managing 10–30 clients hit price walls fast. | $30/mo × N clients vs. $49 once | Desktop license, no per-client constraint |
| B5 | **Multi-currency books.** US-domiciled clients with EU customers; comma-decimal amounts (`€1.234,56`) crash standard parsers; parens-negative (`($89.50)`) treated as positive. | 30–60 min per multi-currency client per month | Format Standardize (`currency_decimal=auto`, parens-negative) |

#### Marketing / RevOps agency

| # | Pain | $ / time impact | Tools that fix it |
|---|------|-----------------|---|
| R1 | **HubSpot / Marketo / Iterable per-contact tier pricing.** 10 k contacts → enterprise tier at $4–8 k/mo. Every duplicate is a recurring tax. | $200–800 / month per 1 k duplicate contacts — recurring | Dedup with cross-source merge + Pipeline |
| R2 | **Email-deliverability / sender reputation.** Sending to invalid or duplicate addresses tanks reputation; recovery takes weeks. | Catastrophic — entire email programme degraded | Format Standardize (email canonicalization) + Missing (sentinel detection) |
| R3 | **GDPR / contact-data privacy.** Uploading lead data to a third-party cleaning SaaS is itself a GDPR concern; legal review blocks adoption. | Compliance risk + 4–8 wk legal-review delay | Local-only desktop app, zero outbound calls |
| R4 | **Multi-vendor lead-source unification.** Apollo, ZoomInfo, LinkedIn Sales Nav, manual scrapes — each export has different headers, scoring, country format. | 1–3 days per campaign of manual unification | Map Columns (alias matching) + Standardize Formats (per-row country) + Find Duplicates |
| R5 | **Suppression-list management across 5+ platforms.** Each platform has its own format; un-deduped suppression lists let opt-outs slip through, triggering CAN-SPAM / GDPR exposure. | Compliance risk + churn-back cost | Pipeline saved as JSON, re-run on each new suppression batch |

### 2.4 Operationalize the moat the docs already name.

Three durable advantages, each promoted from buried feature to
landing-page H1:

- **Quality**: 1 GB international standardization in ~2.5 minutes,
  locally. Excel can't do this; OpenRefine fights you for an hour.
- **Privacy**: "Your data never leaves this computer." Already in the
  GUI footer — promote to landing-page lead, screenshot the empty
  network tab.
- **Update cadence**: ship a v1.1 patch within 30 days of v1.0 launch.
  Not features — *evidence* the product is alive. "Added Czech Republic
  phone format support" beats "no updates in 6 months" every time.

### 2.5 Surface the audit-trail feature in sales copy.

Every tool has a structured audit log. Most cleaning tools do not.
Bookkeepers and consultants get fired if they can't show what changed
to a client. The audit feature is currently invisible on every
proposed landing page and should be the **second-largest callout** —
right after "runs locally."

Copy seed: *"Every change auditable. Hand the audit CSV to your client
with the cleaned file."*

### 2.6 Automated Workflows is the retention multiplier.

A buyer with a saved pipeline isn't a one-off purchase — they're a
recurring user who recommends the product. This is exactly the
behavioural lever the no-touch constraint needs (DECISIONS.md §8
trigger). Build it now (see §2.1 exception).

### 2.7 Add a $199 "priority support" tier post-launch.

Same code, async-email SLA (24 h response). Targets the bookkeeper /
consultant persona whose own time is $300/hr. Zero new product work,
~3× ARPU on 5–10 % of buyers. Lock the SLA to **async only** so the
no-touch constraint isn't violated. Defer until $5 k/mo MRR (the same
trigger DECISIONS.md §8 already names).

### 2.8 Dependency-aware pipeline UX.

Tools have soft execution-order preferences (Text Clean before Format
Standardize, Format before Dedup, Missing before Dedup). Automated
Workflows *recommends* the order, *warns* on reversals, and **never
forces** — the user owns their workflow. Implementation: see
`src/core/pipeline.py` `SOFT_DEPENDENCIES`.

## 3. 90-day execution sequence

| Week | Action | Done when |
|---|---|---|
| 1 | PyInstaller pipeline · Mac/Win unsigned installers · Apple Dev Program enrollment (1–2 wk lead) | `dist/datatools-mac.dmg` and `dist/datatools-win.exe` install on a clean machine |
| 2 | Demo deployed to Streamlit Cloud · landing page v1 with embedded demo · 3 persona datasets in the demo | Public URL serves a working pipeline run on a sample dataset in < 30 s |
| 3 | Gumroad listing live · share value-first in 3 niche communities (no pitch) · 1 long-tail SEO post for the lead persona | First listing impression captured · post not removed for self-promotion |
| 4 | Automated Workflows v1.0 shipped (this week, 2026-05-01 — exception per §2.1) · v1.1 patch announced with Tool 09 + intl improvements | Pipeline saves/loads JSON · 3 demo pipelines preloaded |
| 5–8 | Bookkeeper landing page · agency landing page · second tool's promo cycle · priority-support tier added (defer purchase until §2.7 trigger) | Three live landing pages with distinct H1, demo dataset, conversion target |
| 9–13 | Tool 06–08 only **if** revenue trajectory supports continued investment · otherwise more market work on the existing 5 + 09 | Decision made on 13 Aug 2026 with revenue data, not feature ambition |

## 4. Decision triggers (re-evaluation prompts)

These flip the plan, not the underlying criteria:

| Trigger | Reaction |
|---|---|
| First paying customer in week 4–13 | Continue. Plan is working. |
| **Zero** paid in 90 days | Audit the funnel. Demo conversion? Niche fit? Price? Don't add features. |
| $5 k/mo MRR | DECISIONS.md §8 trigger fires: revisit async + priority-support tier. |
| Marketplace policy / shutdown | Switch to own-domain Stripe immediately; landing pages are already self-hosted. |
| Streamlit hard direction change | Low-probability re-lock per DECISIONS.md §8. Tk fallback is documented. |

## 5. Anti-temptations (things the plan refuses)

- **More tools before more buyers.** Locked. Exception only for Automated Workflows per §2.1.
- **SaaS pivot.** Recurring infra conflicts with the lifestyle constraint (DECISIONS.md §4).
- **Live chat / sales calls.** Conflicts with no-touch (DECISIONS.md §1 #8).
- **Custom integrations / one-off consulting.** $300/hr looks tempting; breaks the "build once, sell many" model that justifies the entire strategy.
- **Going broad on personas.** "All small businesses" is a generic landing page that converts at 1 %; "Shopify pet-supply operators with 1k–50k customers" converts at 5–15 % in the right communities.

## 6. What this plan deliberately leaves open

- Whether tools 06–08 ever ship. Decided on revenue, not roadmap.
- Whether to add a fourth niche landing page. Decided on which of the
  three is producing.
- Whether to invest in own-domain SEO. Compounding 6–18 mo asset; not
  the early-stage channel. Revisit when marketplace + community
  produces baseline traffic to optimise.
- Whether to add a Notion / Slack support community. If support volume
  per 100 sales > 10 (BUSINESS.md §12 target), revisit; else leave async-email only.