datatools-dev/docs/DEPLOYMENT.md
Michael 966af8ef94 feat: 3 new tools, format streaming, distribution-ready demo + landing pages
Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:31:26 +00:00


Deployment — demo + landing pages

One page. Two services. ~30 minutes from "code complete" to "URL the user can hit." Every step here is reproducible from scratch on a clean laptop.

Version: 1.0 · Adopted: 2026-05-01

This doc covers the two distribution surfaces that ship to public URLs: the Streamlit demo (the iframe target) and the Cloudflare Pages landing pages (the marketing surface that embeds it).

The paid product — PyInstaller installers, code-signing, Gumroad listing — is covered in docs/NEXT-STEPS.md.


Part 1 · Deploy the demo (Streamlit Community Cloud — free)

A. Pre-flight (one-time, ~2 min)

You need a free Streamlit Community Cloud account. Sign in with the GitHub account that hosts this repo.

B. Deploy (~5 min, mostly waiting for the Cloud build)

  1. Push the repo to GitHub (private or public — both work). The important files are at the repo root:

    • streamlit_app.py — Cloud auto-detects this; nothing to configure
    • requirements.txt — Cloud installs from this
    • .streamlit/config.toml — Cloud honours this
    • samples/demo/*.csv + *_pipeline.json — the demo's data
    • src/ — the engine
  2. In Streamlit Community Cloud → New app:

    • Repository: your fork
    • Branch: main
    • Main file path: streamlit_app.py (the default — leave it)
    • App URL: datatools-demo (or any free subdomain)
    • Deploy
  3. First build takes 2-3 min while Cloud installs pandas, phonenumbers, rapidfuzz, etc. Subsequent deploys are < 30 s.

C. Verify

Open the deployed URL. Append ?p=shopify-pet to the URL bar — the persona-specific demo loads. Try ?p=bookkeeper and ?p=revops to confirm all three personas route correctly. Click Run pipeline; the AFTER preview should appear within ~1 second.
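
The three persona routes can be generated rather than typed by hand. The following is a minimal sketch, assuming the demo is reachable at a base URL like the example used in this doc; the helper name persona_urls is illustrative and not part of the repo.

```python
from urllib.parse import urlencode

# Persona slugs taken from the verification step above.
PERSONAS = ("shopify-pet", "bookkeeper", "revops")


def persona_urls(base_url: str) -> list[str]:
    """Return one demo URL per persona, using the ?p=<persona> query parameter."""
    base = base_url.rstrip("/")  # tolerate a trailing slash on the input
    return [f"{base}/?{urlencode({'p': persona})}" for persona in PERSONAS]


if __name__ == "__main__":
    for url in persona_urls("https://datatools-demo.streamlit.app"):
        print(url)
```

Open each printed URL in a browser and confirm the matching persona loads.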

D. The output URL

The deployed URL is what feeds into landing/deploy.config.json → demo_base_url, without a trailing slash. For example:

https://datatools-demo.streamlit.app
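
A stray trailing slash or a pasted persona query is easy to miss in a config file. The sketch below shows one way to normalise the value before writing it to demo_base_url. The function name and the "bare https origin" rule are assumptions for illustration; the repo's deploy.py may validate differently or not at all.

```python
from urllib.parse import urlparse


def normalize_demo_base_url(url: str) -> str:
    """Strip trailing slashes and reject values that are not bare https origins."""
    cleaned = url.strip().rstrip("/")
    parts = urlparse(cleaned)
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError(f"expected an https URL, got {url!r}")
    # Assumption: the config wants only scheme + host, as in the example above.
    if parts.path or parts.query or parts.fragment:
        raise ValueError(f"expected a bare origin without path or query, got {url!r}")
    return cleaned


if __name__ == "__main__":
    print(normalize_demo_base_url("https://datatools-demo.streamlit.app/"))
```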

E. Migration trigger

Per BUSINESS.md §9 / DEMO-PLAN.md §9, migrate to a $5-10/mo VPS when:

  • Streamlit Community Cloud rate-limits / sleeps too aggressively, OR
  • the demo crosses ~5 k page-views/month (free-tier capacity)

The migration is one command if you containerise: docker run -p 8501:8501 -v $(pwd):/app python:3.12-slim …


Part 2 · Deploy the landing pages (Cloudflare Pages — free)

A. Pre-flight (one-time, ~5 min)

You need:

  • A Cloudflare account (free) and a domain (any registrar) with nameservers pointed at Cloudflare. OR skip the custom domain step and use the auto-generated *.pages.dev URL.
  • A Gumroad listing URL (placeholder until your account is set up — use https://gumroad.com/l/datatools and update it later).

B. Build the deploy-ready bundle (~30 sec)

# One-time: copy the template
cp landing/deploy.config.example.json landing/deploy.config.json
# Edit it with your real URLs (any editor works)
"${EDITOR:-vi}" landing/deploy.config.json
# Build
python3 landing/deploy.py
# → produces landing/dist/

landing/deploy.config.json is gitignored; your real URLs never hit the repo.
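
To make the build step less opaque, here is a rough illustration of what a URL-substitution pass can look like. This is not the repo's deploy.py: the {{KEY}} token syntax and the gumroad_url key are hypothetical, and only demo_base_url is a key named in this doc.

```python
import re

# Hypothetical token syntax: {{DEMO_BASE_URL}}, {{GUMROAD_URL}}, ...
TOKEN = re.compile(r"\{\{([A-Z_]+)\}\}")


def substitute_urls(html: str, config: dict[str, str]) -> str:
    """Replace each {{KEY}} token with config[key.lower()]; leave unknown tokens in place."""

    def repl(match: re.Match) -> str:
        return config.get(match.group(1).lower(), match.group(0))

    return TOKEN.sub(repl, html)


def unresolved_tokens(html: str) -> list[str]:
    """Return the names of tokens still present after substitution."""
    return TOKEN.findall(html)


if __name__ == "__main__":
    page = '<iframe src="{{DEMO_BASE_URL}}/?p=revops"></iframe>'
    print(substitute_urls(page, {"demo_base_url": "https://datatools-demo.streamlit.app"}))
```

Checking for unresolved tokens after a build is a cheap way to catch a config key that was never filled in.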

C. Deploy (~3 min)

Two paths — pick one:

Drag-and-drop (zero CLI):

  1. Cloudflare Pages dashboard → Create project → Direct Upload
  2. Drag landing/dist/ into the upload zone
  3. Project name: datatools (becomes datatools.pages.dev)
  4. Click Deploy

Wrangler CLI (one command, scriptable):

npm install -g wrangler          # one-time
wrangler login                   # one-time
wrangler pages deploy landing/dist

D. Custom domain (~5 min, optional)

Pages dashboard → your project → Custom domains → add datatools.app (or whichever apex domain you registered). Cloudflare auto-issues TLS. Once propagated:

  • https://datatools.app/ → apex chooser
  • https://datatools.app/shopify-pet/ → Shopify landing
  • https://datatools.app/bookkeeper/ → Bookkeeper landing
  • https://datatools.app/revops/ → RevOps landing

E. Verify

For each persona:

  1. Open the persona URL.
  2. Confirm the demo iframe loads (the URL inside it points at the Streamlit demo from Part 1).
  3. Click "Run pipeline" inside the iframe → AFTER preview appears.
  4. Click the "Get DataTools" button → opens Gumroad with the correct ?from=<persona> query (verify in the URL bar).

If the iframe shows "Refused to connect", check Cloudflare Pages → Settings → Functions for any CSP that disallows Streamlit's domain. (Default Pages config does not set CSP, so this is rarely an issue.)
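
Steps 2 and 4 can also be checked against the built HTML before deploying. The sketch below uses only the standard library. It assumes the iframe src carries ?p=<persona> and that the CTA is an <a> element linking to gumroad.com; both are guesses about the page markup, and the function name is illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


class _LinkCollector(HTMLParser):
    """Collect iframe src and anchor href attribute values."""

    def __init__(self) -> None:
        super().__init__()
        self.iframe_srcs: list[str] = []
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "iframe" and attr.get("src"):
            self.iframe_srcs.append(attr["src"])
        elif tag == "a" and attr.get("href"):
            self.hrefs.append(attr["href"])


def check_persona_page(html: str, persona: str, demo_base_url: str) -> dict[str, bool]:
    """Report whether the iframe and the CTA link both reference the given persona."""
    collector = _LinkCollector()
    collector.feed(html)
    iframe_ok = any(
        src.startswith(demo_base_url)
        and parse_qs(urlparse(src).query).get("p") == [persona]
        for src in collector.iframe_srcs
    )
    cta_ok = any(
        "gumroad.com" in urlparse(href).netloc
        and parse_qs(urlparse(href).query).get("from") == [persona]
        for href in collector.hrefs
    )
    return {"iframe": iframe_ok, "cta": cta_ok}
```

This does not replace clicking through in a browser, since it cannot tell whether the iframe actually renders.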


Part 3 · Updates

The cycle is:

# 1) Edit code or copy (swap shopify-pet for the persona you are changing)
"${EDITOR:-vi}" landing/shopify-pet/index.html
"${EDITOR:-vi}" src/gui/app_demo.py

# 2) Rebuild landing
python3 landing/deploy.py

# 3) Re-deploy landing
wrangler pages deploy landing/dist

# 4) Re-deploy demo
git push origin main
# (Streamlit Cloud auto-deploys on push)

Both surfaces deploy in under 5 minutes end-to-end.


Part 4 · Sanity checks (post-deploy, ~3 min)

Run these once, then trust the build (per POST-LAUNCH.md §6):

# Landing pages serve and reference the right demo URL
curl -s https://datatools.app/ | grep -c persona-card
# → 3 (one per persona card)

curl -s https://datatools.app/shopify-pet/ | grep -c "datatools-demo"
# → ≥1 (iframe src points at your demo)

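# Demo responds and routes the persona param
# (URL is quoted so the shell does not treat ? as a glob character)
curl -s "https://datatools-demo.streamlit.app/?p=shopify-pet" | grep -c "Shopify"
# → ≥1
# Streamlit renders page content in the browser, so plain curl may see only the
# app shell; if this prints 0, confirm in a browser as in Part 1 §C.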

# Sitemap lists all 4 pages (grep -o counts every match, even if the XML is on one line)
curl -s https://datatools.app/sitemap.xml | grep -o "<url>" | wc -l
# → 4
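
Counting <url> tags with grep shows the entries exist but does not confirm the file parses as XML. A small sketch that does both, assuming the generated sitemap uses the standard sitemaps.org namespace (the function name is illustrative):

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace; assumed to be what the generated file declares.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_locations(xml_text: str) -> list[str]:
    """Parse sitemap XML and return every <loc> value.

    Raises xml.etree.ElementTree.ParseError if the document is malformed.
    """
    root = ET.fromstring(xml_text)
    return [(loc.text or "").strip() for loc in root.iter(f"{SITEMAP_NS}loc")]
```

Feed it the body fetched by the curl command above and check that the returned list has four entries, one for the apex page and one per persona.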

Part 5 · Cost ceiling check

| Service | Tier | Cost | Cap |
|---|---|---|---|
| Cloudflare Pages | Free | $0 | 500 builds/month, unlimited bandwidth |
| Streamlit Community Cloud | Free | $0 | 1 GB RAM, sleeps after 7 days idle |
| Custom domain | Cloudflare or registrar | ~$15/year | n/a |
| GitHub | Free for private repos with limited collaborators | $0 | n/a |
| Total ongoing | | ~$1.25/mo (domain only) | |

Well inside the BUSINESS.md §9 cap of $1,200/mo recurring. The $5-10/mo VPS migration is a contingency only; don't pre-build it.
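
The monthly total is the annual domain fee spread over twelve months, since every other line is $0. As a quick check of that arithmetic (the helper name is illustrative):

```python
def monthly_cost(annual_usd: float) -> float:
    """Spread an annual charge evenly over 12 months."""
    return annual_usd / 12


if __name__ == "__main__":
    # ~$15/year for the custom domain is the only recurring charge.
    print(f"${monthly_cost(15.0):.2f}/mo")  # → $1.25/mo
```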


Troubleshooting

Streamlit Cloud build fails with "ModuleNotFoundError: src.core"

streamlit_app.py puts the repo root on sys.path before invoking the demo module — but only if the file is at the repo root. Confirm streamlit_app.py lives at /streamlit_app.py, not nested in a folder.
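
For reference, a path shim of this kind typically looks like the sketch below. It is an illustration of the technique, not a copy of the repo's streamlit_app.py, and the function name is hypothetical.

```python
import sys
from pathlib import Path


def ensure_repo_root_on_path(entry_file: str) -> Path:
    """Put the directory containing entry_file at the front of sys.path."""
    repo_root = Path(entry_file).resolve().parent
    if str(repo_root) not in sys.path:
        sys.path.insert(0, str(repo_root))
    return repo_root


# In an entry script this would be called as ensure_repo_root_on_path(__file__)
# before importing anything under src/.
```

If the entry file is nested in a subfolder, the directory added is that subfolder rather than the repo root, so imports of src.core fail exactly as described above.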

Cloudflare Pages deploy succeeds but persona pages 404

The directory layout is preserved by deploy.py. Confirm your landing/dist/ has shopify-pet/index.html, etc. — not just three flat files. If you used drag-and-drop, drag the directory, not its contents.
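
A quick way to confirm the layout before uploading is to check for each expected index.html. This is a minimal sketch, assuming the apex page and the three persona directories listed in Part 2 §D; the function name is illustrative.

```python
from pathlib import Path

# Persona directories taken from the URL list in Part 2, section D.
PERSONAS = ("shopify-pet", "bookkeeper", "revops")


def missing_pages(dist_dir: str) -> list[str]:
    """Return the expected pages that are absent from the build output."""
    dist = Path(dist_dir)
    expected = ["index.html"] + [f"{persona}/index.html" for persona in PERSONAS]
    return [rel for rel in expected if not (dist / rel).is_file()]


if __name__ == "__main__":
    print(missing_pages("landing/dist") or "layout looks complete")
```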

The iframe shows "X-Frame-Options denied"

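Streamlit Community Cloud allows iframe embedding by default. If you've migrated to a self-hosted demo behind a reverse proxy, make sure the proxy does not send X-Frame-Options: DENY or SAMEORIGIN for the demo's domain; remove the header entirely. ALLOWALL is not a valid X-Frame-Options value, so to allow only the landing domain, send Content-Security-Policy: frame-ancestors https://datatools.app instead.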

Gumroad URL has no ?from= parameter when clicked

The ?from= query param is added by the landing-page CTA, not by Gumroad. If it's missing, the landing-page HTML wasn't substituted; re-run python3 landing/deploy.py and re-deploy.
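
When debugging the CTA link, it helps to know what a correctly tagged URL looks like. The sketch below appends the parameter with the standard library, so the choice between ? and & is handled automatically; the function name is illustrative, and this is not necessarily how the landing pages build the link.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def with_from_param(cta_url: str, persona: str) -> str:
    """Return cta_url with from=<persona> set, replacing any existing from value."""
    parts = urlparse(cta_url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "from"
    ]
    query.append(("from", persona))
    return urlunparse(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(with_from_param("https://gumroad.com/l/datatools", "revops"))
    # → https://gumroad.com/l/datatools?from=revops
```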