Files
datatools-dev/docs/NEXT-STEPS.md
Michael 966af8ef94 feat: 3 new tools, format streaming, distribution-ready demo + landing pages
Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:31:26 +00:00

320 lines
16 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Next Steps — from "code complete" to first paying customer
> Creator-only. The runnable checklist that takes the operator from
> the current state (1,729 tests passing, 6 tools shipped, 0 paying
> customers) through launch and into the first 90 days.
> **Version**: 1.0 · **Adopted**: 2026-05-01
This document is the **single answer** to "what now?". Every line
item has an owner, a time estimate, a blocker, a cost, and the
external dependency that makes it un-shippable today. Items are
ordered by **must-finish-before-the-next-item** — work top-down.
Cross-references:
- Strategy: `PLAN.md` (the 8 strategic moves + the 90-day sequence)
- Demo specs: `DEMO-PLAN.md`
- Deployment mechanics: `DEPLOYMENT.md`
- Post-launch measurement: `POST-LAUNCH.md`
- Locked criteria: `DECISIONS.md` §1
Status legend:
- **🟢** Done — the asset exists in this repo
- **🟡** Buildable now — no external dependency needed
- **🟠** External dependency — needs an account / signup / payment
- **🔴** Manual / requires user input that can't be automated
---
## Phase 0 · What's already done (skip ahead)
| ✓ | Item | Where it lives |
|---|------|----------------|
| 🟢 | 6 of 9 tools shipped (Dedup, Text, Format, Missing, Column-Map, Pipeline) | `src/core/`, `src/cli_*.py`, `src/gui/pages/` |
| 🟢 | Pipeline Runner (the retention multiplier per `PLAN.md` §2.6) | `src/core/pipeline.py`, `src/cli_pipeline.py`, `src/gui/pages/9_Pipeline_Runner.py` |
| 🟢 | 1,729 passing tests · 0 skipped · 0 xfailed | `tests/` |
| 🟢 | 3 niche demo datasets + pre-tuned pipeline JSONs | `samples/demo/` |
| 🟢 | Streamlit demo app + Cloud entry shim | `streamlit_app.py`, `src/gui/app_demo.py` |
| 🟢 | 3 niche landing pages + apex chooser + shared CSS | `landing/` |
| 🟢 | Landing-page deploy script (URL-substitution + sitemap + 404 + favicon) | `landing/deploy.py` |
| 🟢 | Strategic plan + demo plan + post-launch measurement plan + deployment doc | `docs/PLAN.md`, `DEMO-PLAN.md`, `POST-LAUNCH.md`, `DEPLOYMENT.md` |
---
## Phase 1 · Stand the funnel up (target: end of week 1, ~6 hours total work)
The bottleneck right now is **distribution, not feature count**.
Everything in this phase is about turning code into a URL the user
can hit.
### 1.1 — 🟠 Push to GitHub (5 min)
| | |
|---|---|
| **What** | `git init` (if not already), commit, push to a private or public GitHub repo. |
| **Why** | Cloud deploy services need a Git source. Streamlit Community Cloud auto-deploys on push to `main`. |
| **External dependency** | A GitHub account (free). |
| **Cost** | $0. |
| **Blocked by** | Nothing. |
### 1.2 — 🟠 Deploy the demo to Streamlit Community Cloud (15 min)
| | |
|---|---|
| **What** | Follow `DEPLOYMENT.md` Part 1. Result: a public URL like `https://datatools-demo.streamlit.app`. |
| **Why** | The landing pages embed this in their iframe. Without it, every "Run pipeline" button on the landing pages 404s. |
| **External dependency** | Free Streamlit Community Cloud account, signed in via GitHub. |
| **Cost** | $0. |
| **Blocked by** | 1.1 (the repo must be on GitHub). |
| **Watch out for** | First build takes 23 min while Cloud installs deps. Subsequent deploys < 30 s. |
### 1.3 — 🟠 Buy the apex domain (5 min, ~$15/year)
| | |
|---|---|
| **What** | Register `datatools.app` (or whichever) at any registrar. Point the nameservers at Cloudflare. |
| **Why** | The landing-page canonical URLs and CTA buttons refer to this domain. Pages can deploy to a free `*.pages.dev` URL first if you want to defer this. |
| **External dependency** | A registrar account; payment method. |
| **Cost** | ~$15/year. Within `BUSINESS.md` §9 cost cap. |
| **Blocked by** | Nothing — can run in parallel with 1.1 / 1.2. |
### 1.4 — 🟠 Deploy the landing pages to Cloudflare Pages (15 min)
| | |
|---|---|
| **What** | Follow `DEPLOYMENT.md` Part 2. Run `python3 landing/deploy.py` with the operator's URLs in `deploy.config.json`, then `wrangler pages deploy landing/dist` (or drag-drop). |
| **Why** | This is the marketing surface. Three persona URLs go live as soon as it deploys. |
| **External dependency** | Free Cloudflare account; Wrangler CLI (optional — drag-drop works too). |
| **Cost** | $0. |
| **Blocked by** | 1.2 (the demo URL goes into `deploy.config.json`); ideally 1.3 for the custom domain. |
| **Watch out for** | The `deploy.config.json` file is gitignored — your real URLs never get committed. |
### 1.5 — 🟠 Open a Gumroad listing (15 min) **— stub for now**
| | |
|---|---|
| **What** | Create a Gumroad account, draft a listing with a single screenshot + the landing-page copy, set price to $49. Don't enable purchases yet — leave it as a draft. |
| **Why** | The CTA buttons on the landing pages link to `gumroad.com/l/datatools?from=<persona>`. Until the listing exists, those buttons 404. |
| **External dependency** | Free Gumroad account; Stripe-connected payout method (defer to Phase 2). |
| **Cost** | $0 to draft, ~10% per sale once live. |
| **Blocked by** | Nothing — can run in parallel with 1.11.4. |
| **Watch out for** | The listing URL must be `gumroad.com/l/datatools` to match the landing-page hard-coded CTAs. If you pick a different slug, update `landing/deploy.config.json``gumroad_listing` and re-run `deploy.py`. |
### 1.6 — 🟡 End-to-end smoke verification (10 min)
| | |
|---|---|
| **What** | Run the four `curl` commands from `DEPLOYMENT.md` Part 4. All four landing pages, all three demo personas, sitemap.xml. |
| **Why** | First time something can break is the moment a real user hits it. Ten minutes of `curl` saves a week of "why is conversion zero." |
| **External dependency** | None. |
| **Cost** | $0. |
| **Blocked by** | 1.4 + 1.2. |
---
## Phase 2 · Make it sellable (target: end of week 2)
### 2.1 — 🟠 Apple Developer Program enrollment (5 min to start, 12 weeks lead)
| | |
|---|---|
| **What** | Per `BUSINESS.md` §10. Required for code-signing the macOS installer. |
| **External dependency** | Apple ID + government-issued ID (individual) or D-U-N-S number (org). |
| **Cost** | $99/year. |
| **Blocked by** | Nothing — start ASAP because of the 12 week approval window. The pipeline waits on this; nothing else does. |
### 2.2 — 🟡 PyInstaller spec + cross-platform build (13 days first time)
| | |
|---|---|
| **What** | A `build/datatools.spec` that bundles the Streamlit GUI + all 6 tools + samples into one app. Mac `.dmg`, Windows `.exe` installer, Linux AppImage. |
| **Why** | The buyer's deliverable. Without this, there is nothing to attach to the Gumroad listing. |
| **External dependency** | None for Linux/Mac builds. Windows builds need a Windows machine or a CI matrix runner. |
| **Cost** | $0 (GitHub Actions matrix runners are free for public repos). |
| **Blocked by** | Nothing for the spec; 2.1 for the signed Mac build. |
| **Watch out for** | Streamlit's bundle size lands around 300500 MB per `DECISIONS.md` §4c — accepted tradeoff. |
### 2.3 — 🟡 macOS sign + notarize (30 min once Apple Dev is approved)
| | |
|---|---|
| **What** | Sign the `.dmg`, submit to Apple's notarization service, staple the ticket. |
| **Why** | Without it, Gatekeeper hard-blocks the install with no obvious way out (per `BUSINESS.md` §10). The buyer gives up. |
| **External dependency** | Apple Developer Program (2.1). |
| **Cost** | $0 incremental over 2.1. |
| **Blocked by** | 2.1 + 2.2. |
### 2.4 — 🔴 Refund policy + license + Gumroad listing copy (1 hour)
| | |
|---|---|
| **What** | A clear refund policy (14-day no-questions per the FAQ already on the landing pages) + a software licence text + the Gumroad listing description. |
| **Why** | Required by Gumroad's terms; surfaces on the listing page; protects against buyer disputes. |
| **External dependency** | None — operator authoring. |
| **Cost** | $0. |
| **Blocked by** | Nothing. |
| **Hint** | Most of the copy is already in the landing pages' FAQ section — paste it into Gumroad. |
### 2.5 — 🟠 Activate the Gumroad listing (15 min)
| | |
|---|---|
| **What** | Upload the cross-platform installers from 2.2/2.3, paste the copy from 2.4, set $49 price, enable purchases, configure Stripe payout. |
| **Why** | This is the "buy" button finally working. |
| **External dependency** | Gumroad + Stripe account; the installers from 2.2/2.3. |
| **Cost** | ~10 % per sale. |
| **Blocked by** | 2.2, 2.3, 2.4. |
---
## Phase 3 · First-traffic ignition (target: end of week 4)
Per `PLAN.md` §3 and `BUSINESS.md` §7 channel priorities. The strict
no-touch constraint of `DECISIONS.md` §1 #8 makes channel choice
matter — these are the only ones that fit.
### 3.1 — 🔴 First niche-community post (30 min)
| | |
|---|---|
| **What** | One value-first post in one niche-relevant community (e.g. r/shopify, IndieHackers Shopify chat, a Slack/Discord that allows it). Lead with the demo URL, not the buy URL. |
| **Why** | Marketplaces alone don't drive discovery. Communities are the only first-touch channel that works under no-touch. |
| **External dependency** | Account in the chosen community; understand its self-promotion rules. |
| **Cost** | $0. |
| **Blocked by** | 1.4 (demo URL must work). |
| **Hint** | Pick the persona with the most familiar community to the operator. Don't try all three at once — see `POST-LAUNCH.md` §2 "decide ONE thing" rule. |
### 3.2 — 🟡 First long-tail SEO blog post (46 hours)
| | |
|---|---|
| **What** | One 8001,500-word post on `datatools.app/blog/` (sub-route of Cloudflare Pages or Substack) targeting one niche keyword from `BUSINESS.md` §7. Topic: a real problem you've encountered, the cleanup steps, the demo URL at the end. |
| **Why** | Compounding asset — `BUSINESS.md` §2 says SEO pays in 618 months, not week 1. Don't mistake it for an early-stage channel. |
| **External dependency** | None. |
| **Cost** | $0. |
| **Blocked by** | Nothing. |
### 3.3 — 🟡 Cloudflare Web Analytics + event counters (45 min)
| | |
|---|---|
| **What** | Enable Cloudflare Web Analytics on the Pages project (one click). Add a tiny inline `<script>` to each landing page that fires `cta_clicked` when the buy button is hit, before redirecting. Per `POST-LAUNCH.md` §1. |
| **Why** | Without this, the post-launch checklist is unrunnable. |
| **External dependency** | Cloudflare account (already from 1.4). |
| **Cost** | $0. |
| **Blocked by** | 1.4. |
| **Hint** | The Gumroad webhook captures `?from=<persona>` automatically — no extra wiring. |
### 3.4 — 🟡 Email autoresponder (post-purchase delivery + 3-touch onboarding) (23 hours)
| | |
|---|---|
| **What** | Gumroad's built-in delivery email plus three follow-up emails (day 1, day 7, day 14): "are you running into X?", "here's an advanced trick", "save your pipeline as JSON for next week". |
| **Why** | Increases activation, reduces refund risk, surfaces support questions while volume is small. |
| **External dependency** | Gumroad delivery is built-in. The 3-touch sequence needs a free email service (Resend's free tier or Mailchimp's free tier). |
| **Cost** | $0$30/month per `BUSINESS.md` §9. |
| **Blocked by** | 2.5. |
---
## Phase 4 · First-buyer trigger and review
Per `PLAN.md` §4 decision triggers and `POST-LAUNCH.md` §4.
### 4.1 — 🟢 Run the monthly review (30 min, first Monday after launch)
| | |
|---|---|
| **What** | Follow `POST-LAUNCH.md` §2 — pull last-30-days demo events + Gumroad sales + refunds, compute the five numbers, decide ONE change. |
| **Why** | Without this discipline, the funnel drifts and the operator changes 5 things at once and learns nothing. |
| **External dependency** | None — analytics from 3.3, sales from 2.5. |
| **Cost** | $0. |
| **Blocked by** | 3.3 + 2.5. |
### 4.2 — 🟢 First paying customer (target: 90 days)
| | |
|---|---|
| **What** | The actual first sale. |
| **Why** | Per `BUSINESS.md` §6: validates the funnel; not the business. |
| **Trigger action** | Continue, no plan change. Make the first $1k/month within month 6. |
### 4.3 — 🔴 Zero-paid-in-90-days fallback (only fires if 4.2 doesn't)
| | |
|---|---|
| **What** | Per `POST-LAUNCH.md` §4 — audit the funnel, not the features. Run a 1-week outbound experiment to 30 niche contacts as a control (per `BUSINESS.md` §8 the no-touch revisit is allowed below $5k MRR if it produces signal). |
| **Why** | Distinguishes "no reach" from "no conversion" — they need different fixes. |
| **External dependency** | Operator's time. |
| **Cost** | The 10 hr/wk allocation already exists; this displaces other work. |
| **Blocked by** | The 90-day calendar trigger from 4.2. |
---
## Phase 5 · Steady state — what NOT to build
Per `PLAN.md` §5 (anti-temptations) and `DECISIONS.md` §8 (re-lock
triggers). The trap is treating "more code" as the answer when the
data says "more reach" or "more conversion." The five forbidden
moves until $5k/mo MRR:
| | Why locked |
|---|---|
| ❌ More tools (0608) | `PLAN.md` §2.1 distribution-gate. Tool 09 was the exception; no others until first paid customer + one external review. |
| ❌ SaaS pivot | `DECISIONS.md` §4 — recurring infra conflicts with the lifestyle constraint. |
| ❌ Live chat / sales calls | `DECISIONS.md` §1 #8 — no-touch is locked until $5k/mo. |
| ❌ Custom integrations / one-off consulting | Breaks "build once, sell many." |
| ❌ Going broad on personas | `PLAN.md` §5 — "all small businesses" converts at 1 %; vertical converts at 515 %. |
---
## Triage table — what blocks what
```
Phase 1 (week 1) Phase 2 (week 2) Phase 3 (week 4)
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ 1.1 Push GH │──────────┐ │ 2.1 Apple │ ───┐ │ 3.1 Community│
│ 1.2 Demo │──┐ ├──▶│ Dev (1-2w) │ │ │ 3.2 SEO post │
│ 1.3 Domain │ │ │ │ 2.2 Build │ ───┤ │ 3.3 Analytics│
│ 1.4 Pages │◀─┘ │ │ 2.3 Sign │ ───┤ │ 3.4 Emails │
│ 1.5 Gumroad │──────────┘ │ 2.4 Copy │ │ └──────────────┘
│ 1.6 Verify │ │ 2.5 Activate │ ◀──┘
└──────────────┘ └──────────────┘ ↓
┌──────────────┐
│ 4.1 Monthly │
│ 4.2 First $ │
│ 4.3 Fallback │
└──────────────┘
```
The longest blocking path is **2.1 Apple Developer Program**
(12 weeks). Start it on day 1 of week 1 — it unblocks everything in
Phase 2 and you can do all of Phase 1 while waiting.
---
## Time estimate — total operator time
| Phase | Hours | Wall-clock |
|---|---|---|
| Phase 1 | ~1 hour | end of week 1 (mostly waiting for builds) |
| Phase 2 | ~1 day | end of week 2 (gated by Apple Dev approval) |
| Phase 3 | ~6 hours | week 34 |
| Phase 4 | 30 min/month | ongoing |
| **Total to launch** | **~12 hours of operator time** | **~14 days wall-clock** |
Well inside the 10 hr/wk constraint of `DECISIONS.md` §1 #2.
---
## The thing that decides whether the plan works
Not the build. Not the deploy. Not even the first sale.
**The discipline of running the monthly review** in Phase 4 — and the
"decide ONE thing per month" rule from `POST-LAUNCH.md` §2 — is what
separates "this product exists" from "this product compounds." Every
feature added before the funnel is measured is a guess; every change
made after the monthly review is informed.
Don't skip 4.1.