Tools shipped this batch (4 → 6 of 9 Ready):
04 Missing Value Handler src/core/missing.py + cli_missing.py + GUI
05 Column Mapper src/core/column_mapper.py + cli_column_map.py + GUI
09 Pipeline Runner src/core/pipeline.py + cli_pipeline.py + GUI
with soft tool-dependency graph (recommended,
not enforced) and JSON save/load for repeatable
weekly cleanups.
Format Standardizer reworked for 1 GB international files:
• Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
• Per-row country / address columns drive parsing
• Audit cap (default 10 k rows, ~50 MB RAM)
• standardize_file(): chunked streaming entry point (~165 k rows/sec)
• currency_decimal="auto" for EU comma-decimal locales
• R$ / kr / zł multi-char currency prefixes
• cli_format.py with auto-stream above 100 MB inputs
Encoding detection arbiter + language-aware probe:
Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.
Distribution-readiness assets:
• streamlit_app.py — Streamlit Community Cloud entry shim
• src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
100-row cap + watermark, free-vs-paid boundary enforced at surface
• samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
• landing/ — 4 static HTML pages (apex chooser + 3 niche),
shared CSS, deploy.py URL-substitution script,
auto-generated robots.txt + sitemap.xml + 404.html + favicon
• docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
— full strategy + measurement + deployment + master checklist
Test counts:
before: 1,520 passed · 4 skipped · 17 xfailed
after: 1,729 passed · 0 skipped · 0 xfailed
Tier-1 corpora added:
• missing-corpus 3 use cases + 16 edge cases
• column-mapper-corpus 3 use cases + 5 edge cases
• format-cleaner intl 20-row 13-country stress fixture
Engine hardening flushed out by the corpora:
• interpolate guards against object-dtype columns
• mean/median skip all-NaN columns (silences numpy warning)
• fillna runs under future.no_silent_downcasting (silences pandas warning)
• mojibake test no longer skips when ftfy installed (monkeypatch path)
• drop-row threshold semantics: strict-greater (consistent across rows / cols)
• currency_decimal validator allow-set updated for "auto"
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Deployment — demo + landing pages
One page. Two services. ~30 minutes from "code complete" to "URL the user can hit." Every step here is from-scratch reproducible on a clean laptop.
Version: 1.0 · Adopted: 2026-05-01
This doc covers the two distribution surfaces that ship to public URLs: the Streamlit demo (the iframe target) and the Cloudflare Pages landing pages (the marketing surface that embeds it).
The paid product — PyInstaller installers, code-signing, Gumroad
listing — is covered in docs/NEXT-STEPS.md.
Part 1 · Deploy the demo (Streamlit Community Cloud — free)
A. Pre-flight (one-time, ~2 min)
You need a free Streamlit Community Cloud account. Sign in with the GitHub account that hosts this repo.
B. Deploy (~5 min, mostly waiting for the Cloud build)
- Push the repo to GitHub (private or public; both work). The important files are at the repo root:
  - streamlit_app.py: Cloud auto-detects this; nothing to configure
  - requirements.txt: Cloud installs from this
  - .streamlit/config.toml: Cloud honours this
  - samples/demo/*.csv + *_pipeline.json: the demo's data
  - src/: the engine
- In Streamlit Community Cloud → New app:
  - Repository: your fork
  - Branch: main
  - Main file path: streamlit_app.py (the default; leave it)
  - App URL: datatools-demo (or any free subdomain)
  - Deploy
- First build is 2–3 min while Cloud installs pandas, phonenumbers, rapidfuzz, etc. Subsequent deploys are < 30 s.
C. Verify
Open the deployed URL. Append ?p=shopify-pet to the URL bar —
the persona-specific demo loads. Try ?p=bookkeeper and
?p=revops to confirm all three personas route correctly. Click
Run pipeline; the AFTER preview should appear within ~1 second.
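The three verification URLs can be generated rather than typed by hand. The sketch below is illustrative only: `persona_urls` is a hypothetical helper, not part of the repo, and it assumes the persona slugs are exactly the three named above.

```python
from urllib.parse import urlencode

# Persona slugs as named in the verification step above.
PERSONAS = ("shopify-pet", "bookkeeper", "revops")


def persona_urls(base_url: str) -> list[str]:
    """Return one ?p=<persona> URL per persona for a deployed demo."""
    base = base_url.rstrip("/")  # tolerate a trailing slash on input
    return [f"{base}/?{urlencode({'p': persona})}" for persona in PERSONAS]


if __name__ == "__main__":
    for url in persona_urls("https://datatools-demo.streamlit.app"):
        print(url)
```

Each printed URL can be pasted into the browser in turn; the routing itself still has to be checked by eye, since the demo renders in the browser.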
D. The output URL
The deployed URL is what feeds into landing/deploy.config.json →
demo_base_url. Without trailing slash. For example:
https://datatools-demo.streamlit.app
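A trailing slash is easy to paste by accident. The following is a minimal sketch of a check you could run before writing the value into the config; `normalize_demo_base_url` is a hypothetical helper, and whether deploy.py performs any such validation itself is not stated here.

```python
from urllib.parse import urlsplit


def normalize_demo_base_url(raw: str) -> str:
    """Strip whitespace and trailing slashes; reject anything that is not a bare https origin."""
    url = raw.strip().rstrip("/")
    parts = urlsplit(url)
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError(f"expected an https URL, got {raw!r}")
    if parts.path or parts.query or parts.fragment:
        raise ValueError(f"expected a bare origin without path or query, got {raw!r}")
    return url
```

For example, `normalize_demo_base_url("https://datatools-demo.streamlit.app/")` returns the URL without the slash, ready to paste as demo_base_url.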
E. Migration trigger
Per BUSINESS.md §9 / DEMO-PLAN.md §9, migrate to a $5–10/mo VPS
when:
- Streamlit Community Cloud rate-limits / sleeps too aggressively, OR
- the demo crosses ~5 k page-views/month (free-tier capacity)
The migration is one command if you containerise:
docker run -p 8501:8501 -v $(pwd):/app python:3.12-slim …
Part 2 · Deploy the landing pages (Cloudflare Pages — free)
A. Pre-flight (one-time, ~5 min)
You need:
- A Cloudflare account (free) and a domain (any registrar) with nameservers pointed at Cloudflare. OR skip the custom domain step and use the auto-generated *.pages.dev URL.
- A Gumroad listing URL (a placeholder until your account is set up: use https://gumroad.com/l/datatools and update it later).
B. Build the deploy-ready bundle (~30 sec)
# One-time: copy the template
cp landing/deploy.config.example.json landing/deploy.config.json
# Edit it with your real URLs
edit landing/deploy.config.json
# Build
python3 landing/deploy.py
# → produces landing/dist/
landing/deploy.config.json is gitignored; your real URLs never
hit the repo.
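For readers wondering what a URL-substitution build step looks like, here is a rough sketch under stated assumptions. The `{{key}}` placeholder syntax and the `gumroad_url` key are invented for illustration; only `demo_base_url` is named in this doc, and the real deploy.py may work differently.

```python
import re

# ASSUMPTION: placeholders look like {{demo_base_url}}; the real template syntax may differ.
PLACEHOLDER = re.compile(r"\{\{\s*([a-z_]+)\s*\}\}")


def substitute(html: str, config: dict[str, str]) -> str:
    """Replace each known placeholder with its config value; leave unknown ones untouched."""
    return PLACEHOLDER.sub(lambda m: config.get(m.group(1), m.group(0)), html)


def unresolved(html: str) -> list[str]:
    """List placeholder names still present after substitution."""
    return sorted(set(PLACEHOLDER.findall(html)))


if __name__ == "__main__":
    page = '<iframe src="{{demo_base_url}}/?p=revops"></iframe>'
    config = {"demo_base_url": "https://datatools-demo.streamlit.app"}
    print(substitute(page, config))
```

Checking `unresolved()` on every built page is a cheap way to catch a config key that was never filled in before the bundle is uploaded.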
C. Deploy (~3 min)
Two paths — pick one:
Drag-and-drop (zero CLI):
- Cloudflare Pages dashboard → Create project → Direct Upload
- Drag landing/dist/ into the upload zone
- Project name: datatools (becomes datatools.pages.dev)
- Click Deploy
Wrangler CLI (one command, scriptable):
npm install -g wrangler # one-time
wrangler login # one-time
wrangler pages deploy landing/dist
D. Custom domain (~5 min, optional)
Pages dashboard → your project → Custom domains → add
datatools.app (or whichever apex domain you registered). Cloudflare
auto-issues TLS. Once propagated:
- https://datatools.app/ → apex chooser
- https://datatools.app/shopify-pet/ → Shopify landing
- https://datatools.app/bookkeeper/ → Bookkeeper landing
- https://datatools.app/revops/ → RevOps landing
E. Verify
For each persona:
- Open the persona URL.
- Confirm the demo iframe loads (the URL inside it points at the Streamlit demo from Part 1).
- Click "Run pipeline" inside the iframe → AFTER preview appears.
- Click the "Get DataTools" button → opens Gumroad with the correct ?from=<persona> query (verify in the URL bar).
If the iframe shows "Refused to connect", check whether a _headers file in landing/dist/ (or a Pages Function) sets a Content-Security-Policy whose frame-src disallows Streamlit's domain. (Default Pages config does not set CSP, so this is rarely an issue.)
Part 3 · Updates
The cycle is:
# 1) Edit code or copy
edit landing/<persona>/index.html
edit src/gui/app_demo.py
# 2) Rebuild landing
python3 landing/deploy.py
# 3) Re-deploy landing
wrangler pages deploy landing/dist
# 4) Re-deploy demo
git push origin main
# (Streamlit Cloud auto-deploys on push)
Both surfaces deploy in under 5 minutes end-to-end.
Part 4 · Sanity checks (post-deploy, ~3 min)
Run these once, then trust the build (per POST-LAUNCH.md §6):
# Landing pages serve and reference the right demo URL
curl -s https://datatools.app/ | grep -c persona-card
# → 3 (one per persona card)
curl -s https://datatools.app/shopify-pet/ | grep -c "datatools-demo"
# → ≥1 (iframe src points at your demo)
# Demo responds and routes the persona param
curl -s "https://datatools-demo.streamlit.app/?p=shopify-pet" | grep -c "Shopify"
# → ≥1
# Sitemap is valid XML and lists all 4 pages
curl -s https://datatools.app/sitemap.xml | grep -c "<url>"
# → 4
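Counting `<url>` with grep does not prove the sitemap is well-formed XML. A small parser-based check covers both. This is a sketch only: `sitemap_locations` is a hypothetical helper, and it assumes the generated sitemap uses the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: the generated sitemap declares the standard sitemaps.org 0.9 namespace.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_locations(xml_text: str) -> list[str]:
    """Parse a sitemap and return its <loc> values.

    Raises xml.etree.ElementTree.ParseError if the document is not well-formed.
    """
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
```

Feed it the body fetched from /sitemap.xml and confirm the returned list has four entries. An empty list from an otherwise valid file would suggest the namespace assumption does not hold.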
Part 5 · Cost ceiling check
| Service | Tier | Cost | Cap |
|---|---|---|---|
| Cloudflare Pages | Free | $0 | 500 builds/month, unlimited bandwidth |
| Streamlit Community Cloud | Free | $0 | 1 GB RAM, sleeps after 7 days idle |
| Custom domain | Cloudflare or registrar | ~$15/year | n/a |
| GitHub | Free for private repos with limited collaborators | $0 | n/a |
| Total ongoing | | ~$1.25/mo (domain only) | |
Well inside the BUSINESS.md §9 cap of $1,200/mo recurring. The
$5–10/mo VPS migration is a contingency only — don't pre-build it.
Troubleshooting
Streamlit Cloud build fails with "ModuleNotFoundError: src.core"
streamlit_app.py puts the repo root on sys.path before invoking
the demo module — but only if the file is at the repo root. Confirm
streamlit_app.py lives at /streamlit_app.py, not nested in a
folder.
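The path handling described above usually amounts to a few lines. The sketch below shows the general pattern, not the actual contents of streamlit_app.py; `ensure_repo_root_on_path` is a hypothetical name.

```python
import sys
from pathlib import Path


def ensure_repo_root_on_path(entry_file: str) -> Path:
    """Put the directory containing entry_file at the front of sys.path.

    This only yields the repo root when the entry file itself sits at the
    repo root, which is why a nested streamlit_app.py breaks `import src.core`.
    """
    root = Path(entry_file).resolve().parent
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root


# Typical call at the top of an entry shim (hypothetical usage):
# ensure_repo_root_on_path(__file__)
```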
Cloudflare Pages deploy succeeds but persona pages 404
The directory layout is preserved by deploy.py. Confirm your
landing/dist/ has shopify-pet/index.html, etc. — not just three
flat files. If you used drag-and-drop, drag the directory, not
its contents.
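A quick layout check before uploading can catch this. The sketch assumes one index.html at the top level plus one per persona directory, matching the URLs listed in Part 2; `missing_pages` is a hypothetical helper.

```python
from pathlib import Path

# Persona directory names as they appear in the landing-page URLs.
PERSONAS = ("shopify-pet", "bookkeeper", "revops")


def missing_pages(dist: Path) -> list[str]:
    """Return the expected pages (relative to dist) that are not present as files."""
    expected = ["index.html"] + [f"{persona}/index.html" for persona in PERSONAS]
    return [rel for rel in expected if not (dist / rel).is_file()]


if __name__ == "__main__":
    gaps = missing_pages(Path("landing/dist"))
    print("layout OK" if not gaps else f"missing: {gaps}")
```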
The iframe shows "X-Frame-Options denied"
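Streamlit Community Cloud allows iframe embedding by default. If
you've migrated to a self-hosted demo with a reverse proxy, make sure
the proxy does not send X-Frame-Options: DENY or SAMEORIGIN for the
demo's domain; removing the header entirely is the cleanest fix
(ALLOWALL is a non-standard legacy value). If the proxy also sets a
Content-Security-Policy, its frame-ancestors directive must permit the
landing-page origin.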
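To see what a self-hosted proxy actually sends, inspect the response headers (for example with curl -I) and run them through a check like the one below. It is a deliberately simplified sketch, not the full browser algorithm: it only flags the common cases where a page on another origin would be refused. `blocks_cross_origin_framing` is a hypothetical helper.

```python
def blocks_cross_origin_framing(headers: dict[str, str]) -> bool:
    """Return True if these response headers would stop another origin from framing the page."""
    normalized = {name.lower(): value for name, value in headers.items()}

    # DENY and SAMEORIGIN are the two standard X-Frame-Options values; both block
    # a landing page hosted on a different origin.
    xfo = normalized.get("x-frame-options", "").strip().lower()
    if xfo in {"deny", "sameorigin"}:
        return True

    # CSP frame-ancestors: 'none', 'self', or an empty source list excludes other origins.
    csp = normalized.get("content-security-policy", "")
    for directive in csp.split(";"):
        parts = directive.split()
        if parts and parts[0].lower() == "frame-ancestors":
            sources = [source.lower() for source in parts[1:]]
            return all(source in {"'none'", "'self'"} for source in sources)

    return False
```

A True result for the demo's headers means the landing-page iframe will stay blank until the proxy config is changed.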
Gumroad URL has no ?from= parameter when clicked
The ?from= query param is added by the landing-page CTA, not by
Gumroad. If it's missing, the landing-page HTML wasn't substituted —
re-run python3 landing/deploy.py and re-deploy.
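To confirm the fix without clicking through every page, the CTA link can be checked directly. This sketch assumes you have already copied the button's href out of the built HTML; `cta_persona` is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def cta_persona(href: str) -> Optional[str]:
    """Return the value of the `from` query parameter in a CTA link, or None if absent."""
    values = parse_qs(urlsplit(href).query).get("from")
    return values[0] if values else None
```

For a correctly substituted RevOps page, `cta_persona("https://gumroad.com/l/datatools?from=revops")` returns `"revops"`; a result of None means the link was built without the parameter.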