datatools-dev/docs/DEPLOYMENT.md
Michael 966af8ef94 feat: 3 new tools, format streaming, distribution-ready demo + landing pages
Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:31:26 +00:00

# Deployment — demo + landing pages

> One page. Two services. ~30 minutes from "code complete" to
> "URL the user can hit." Every step here is from-scratch reproducible
> on a clean laptop.
>
> **Version**: 1.0 · **Adopted**: 2026-05-01

This doc covers the **two distribution surfaces** that ship to public
URLs: the Streamlit demo (the iframe target) and the Cloudflare Pages
landing pages (the marketing surface that embeds it).

The *paid* product — PyInstaller installers, code-signing, Gumroad
listing — is covered in `docs/NEXT-STEPS.md`.

---
## Part 1 · Deploy the demo (Streamlit Community Cloud — free)
### A. Pre-flight (one-time, ~2 min)
You need a free [Streamlit Community Cloud](https://streamlit.io/cloud)
account. Sign in with the GitHub account that hosts this repo.
### B. Deploy (~5 min, mostly waiting for the Cloud build)
1. **Push the repo to GitHub** (private or public — both work). The
   important files are at the **repo root**:
   - `streamlit_app.py` — Cloud auto-detects this; nothing to configure
   - `requirements.txt` — Cloud installs from this
   - `.streamlit/config.toml` — Cloud honours this
   - `samples/demo/*.csv` + `*_pipeline.json` — the demo's data
   - `src/` — the engine
2. In Streamlit Community Cloud → **New app**:
   - Repository: your fork
   - Branch: `main`
   - Main file path: `streamlit_app.py` (the default — leave it)
   - App URL: `datatools-demo` (or any free subdomain)
   - **Deploy**
3. First build is 2-3 min while Cloud installs `pandas`, `phonenumbers`,
   `rapidfuzz`, etc. Subsequent deploys are < 30 s.
### C. Verify
Open the deployed URL. Append `?p=shopify-pet` to the URL bar —
the persona-specific demo loads. Try `?p=bookkeeper` and
`?p=revops` to confirm all three personas route correctly. Click
**Run pipeline**; the AFTER preview should appear within ~1 second.
### D. The output URL
The deployed URL is the value for the `demo_base_url` field in
`landing/deploy.config.json`. Enter it without a trailing slash, for
example `https://datatools-demo.streamlit.app`.
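
To illustrate why the trailing slash matters, here is a minimal sketch of how a persona URL could be assembled from `demo_base_url`. The helper name `persona_demo_url` is hypothetical and is not taken from the repo; only the `?p=<persona>` parameter and the three persona slugs come from this doc.

```python
from urllib.parse import urlencode

# Persona slugs as listed in the Verify step of this doc.
PERSONAS = ("shopify-pet", "bookkeeper", "revops")


def persona_demo_url(demo_base_url: str, persona: str) -> str:
    """Hypothetical helper: join the base URL and the ?p=<persona> query."""
    if persona not in PERSONAS:
        raise ValueError(f"unknown persona: {persona!r}")
    # Strip any trailing slash so the result never contains "//?p=".
    base = demo_base_url.rstrip("/")
    return f"{base}/?{urlencode({'p': persona})}"


if __name__ == "__main__":
    for slug in PERSONAS:
        print(persona_demo_url("https://datatools-demo.streamlit.app", slug))
```

Because the helper normalises the base URL, a config value with or without the trailing slash yields the same persona URL.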
### E. Migration trigger
Per `BUSINESS.md` §9 / `DEMO-PLAN.md` §9, migrate to a $5-10/mo VPS
when:
- Streamlit Community Cloud rate-limits / sleeps too aggressively, OR
- the demo crosses ~5 k page-views/month (free-tier capacity)

The migration is one command if you containerise:
`docker run -p 8501:8501 -v $(pwd):/app python:3.12-slim …`

---
## Part 2 · Deploy the landing pages (Cloudflare Pages — free)
### A. Pre-flight (one-time, ~5 min)
You need:
- A Cloudflare account (free) and a domain (any registrar) with
nameservers pointed at Cloudflare. **OR** skip the custom domain
step and use the auto-generated `*.pages.dev` URL.
- A Gumroad listing URL (placeholder until your account is set up —
use `https://gumroad.com/l/datatools` and update it later).
### B. Build the deploy-ready bundle (~30 sec)
```bash
# One-time: copy the template
cp landing/deploy.config.example.json landing/deploy.config.json
# Edit it with your real URLs (opens $EDITOR, falling back to vi)
"${EDITOR:-vi}" landing/deploy.config.json
# Build
python3 landing/deploy.py
# → produces landing/dist/
```
`landing/deploy.config.json` is **gitignored**; your real URLs never
hit the repo.
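
For orientation, the kind of URL substitution a build script like `landing/deploy.py` performs can be sketched as below. This is not the script's actual code: the `{{KEY}}` token syntax, the `substitute_urls` name, and the `gumroad_url` config key are assumptions made for illustration. Only `demo_base_url` is named in this doc.

```python
import json


def substitute_urls(html: str, config: dict) -> str:
    """Replace {{KEY}} tokens with config values (token syntax is assumed)."""
    for key, value in config.items():
        html = html.replace("{{" + key.upper() + "}}", str(value))
    return html


if __name__ == "__main__":
    # json.loads stands in for reading landing/deploy.config.json.
    # "gumroad_url" is a hypothetical key; "demo_base_url" is from the doc.
    config = json.loads(
        '{"demo_base_url": "https://datatools-demo.streamlit.app",'
        ' "gumroad_url": "https://gumroad.com/l/datatools"}'
    )
    page = '<iframe src="{{DEMO_BASE_URL}}/?p=revops"></iframe>'
    print(substitute_urls(page, config))
```

Keeping the real values in a gitignored JSON file and the tokens in the committed HTML is what lets the templates stay public while the URLs stay private.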
### C. Deploy (~3 min)
Two paths — pick one:

**Drag-and-drop (zero CLI):**
1. Cloudflare Pages dashboard → **Create project** → **Direct Upload**
2. Drag `landing/dist/` into the upload zone
3. Project name: `datatools` (becomes `datatools.pages.dev`)
4. Click **Deploy**

**Wrangler CLI (one command, scriptable):**
```bash
npm install -g wrangler # one-time
wrangler login # one-time
wrangler pages deploy landing/dist
```
### D. Custom domain (~5 min, optional)
Pages dashboard → your project → **Custom domains** → add
`datatools.app` (or whichever apex domain you registered). Cloudflare
auto-issues TLS. Once propagated:
- `https://datatools.app/` → apex chooser
- `https://datatools.app/shopify-pet/` → Shopify landing
- `https://datatools.app/bookkeeper/` → Bookkeeper landing
- `https://datatools.app/revops/` → RevOps landing
### E. Verify
For each persona:
1. Open the persona URL.
2. Confirm the demo iframe loads (the URL inside it points at the
Streamlit demo from Part 1).
3. Click "Run pipeline" inside the iframe → AFTER preview appears.
4. Click the "Get DataTools" button → opens Gumroad with the
correct `?from=<persona>` query (verify in the URL bar).

If the iframe shows "Refused to connect", check Cloudflare Pages →
**Settings** → **Functions** for any CSP that disallows Streamlit's
domain. (Default Pages config does not set CSP, so this is rarely an
issue.)

---
## Part 3 · Updates
The cycle is:
```bash
# 1) Edit code or copy (substitute the persona you are changing)
"${EDITOR:-vi}" landing/shopify-pet/index.html
"${EDITOR:-vi}" src/gui/app_demo.py
# 2) Rebuild landing
python3 landing/deploy.py
# 3) Re-deploy landing
wrangler pages deploy landing/dist
# 4) Re-deploy demo
git push origin main
# (Streamlit Cloud auto-deploys on push)
```

Both surfaces deploy in under 5 minutes end-to-end.

---
## Part 4 · Sanity checks (post-deploy, ~3 min)
Run these once, then trust the build (per `POST-LAUNCH.md` §6):
```bash
# Landing pages serve and reference the right demo URL
curl -s https://datatools.app/ | grep -c persona-card
# → 3 (one per persona card)
curl -s https://datatools.app/shopify-pet/ | grep -c "datatools-demo"
# → ≥1 (iframe src points at your demo)
# Demo responds and routes the persona param
curl -s "https://datatools-demo.streamlit.app/?p=shopify-pet" | grep -c "Shopify"
# → ≥1
# Sitemap is valid XML and lists all 4 pages
curl -s https://datatools.app/sitemap.xml | grep -c "<url>"
# → 4
```
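
The `grep` count above only confirms that the string `<url>` appears on four lines; it does not confirm that the sitemap parses. A stricter check can be sketched in Python as follows. It assumes the generated sitemap uses the standard sitemaps.org namespace, and the function name `count_sitemap_urls` is hypothetical. The sample string stands in for the body fetched from `https://datatools.app/sitemap.xml`.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace (assumed to be what deploy.py emits).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def count_sitemap_urls(xml_text: str) -> int:
    """Parse the sitemap and count its <url> entries."""
    # ET.fromstring raises xml.etree.ElementTree.ParseError on malformed XML.
    root = ET.fromstring(xml_text)
    return len(root.findall(f"{SITEMAP_NS}url"))


if __name__ == "__main__":
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://datatools.app/</loc></url>"
        "<url><loc>https://datatools.app/shopify-pet/</loc></url>"
        "<url><loc>https://datatools.app/bookkeeper/</loc></url>"
        "<url><loc>https://datatools.app/revops/</loc></url>"
        "</urlset>"
    )
    print(count_sitemap_urls(sample))
```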
---
## Part 5 · Cost ceiling check
| Service | Tier | Cost | Cap |
|---|---|---|---|
| Cloudflare Pages | Free | $0 | 500 builds/month, unlimited bandwidth |
| Streamlit Community Cloud | Free | $0 | 1 GB RAM, sleeps after 7 days idle |
| Custom domain | Cloudflare or registrar | ~$15/year | n/a |
| GitHub | Free for private repos with limited collaborators | $0 | n/a |
| **Total ongoing** | | **~$1.25/mo** (domain only) | |

Well inside the `BUSINESS.md` §9 cap of $1,200/mo recurring. The
$5-10/mo VPS migration is a contingency only — don't pre-build it.

---
## Troubleshooting
**Streamlit Cloud build fails with "ModuleNotFoundError: src.core"**

`streamlit_app.py` puts the repo root on `sys.path` before invoking
the demo module — but only if the file is at the repo root. Confirm
`streamlit_app.py` lives at `/streamlit_app.py`, not nested in a
folder.
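
The mechanism can be sketched as below. This is an illustration of the `sys.path` technique, not the contents of the real `streamlit_app.py`; the helper name `ensure_on_sys_path` and the import form shown in the comment are assumptions.

```python
import sys
from pathlib import Path


def ensure_on_sys_path(directory: Path) -> str:
    """Prepend a directory to sys.path once and return the entry used."""
    entry = str(directory.resolve())
    if entry not in sys.path:
        sys.path.insert(0, entry)
    return entry


# In a shim that sits at the repo root, the call would look like:
#     ensure_on_sys_path(Path(__file__).parent)
#     from src.gui import app_demo   # assumed import form
# If the shim is nested in a subfolder, Path(__file__).parent is that
# subfolder, so "src" is not found and the import fails.
if __name__ == "__main__":
    # The current directory stands in for the repo root in this sketch.
    print(ensure_on_sys_path(Path.cwd()))
```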

**Cloudflare Pages deploy succeeds but persona pages 404**

The directory layout is preserved by `deploy.py`. Confirm your
`landing/dist/` has `shopify-pet/index.html`, etc. — not just three
flat files. If you used drag-and-drop, drag the **directory**, not
its contents.
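
**The iframe shows "X-Frame-Options denied"**

Streamlit Community Cloud allows iframe embedding by default. If
you've migrated to a self-hosted demo behind a reverse proxy, make
sure the proxy does not send `X-Frame-Options: DENY` or `SAMEORIGIN`
for the demo's domain. Remove the header entirely, since it has no
standard allow-all value. To restrict embedding to the landing pages,
send `Content-Security-Policy: frame-ancestors https://datatools.app`
instead.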

**Gumroad URL has no `?from=` parameter when clicked**

The `from=` query param is added by the landing-page CTA, not by
Gumroad. If it's missing, the landing-page HTML wasn't substituted.
Re-run `python3 landing/deploy.py` and re-deploy.
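
Whether the parameter appears as `?from=` or `&from=` depends on whether the Gumroad URL already carries a query string. A minimal sketch of attaching it safely is shown below. The helper name `with_from_param` is hypothetical and is not taken from `deploy.py` or the landing-page code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_from_param(cta_url: str, persona: str) -> str:
    """Return cta_url with exactly one from=<persona> query parameter."""
    parts = urlsplit(cta_url)
    # Keep existing parameters, dropping any stale "from" value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "from"
    ]
    query.append(("from", persona))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(with_from_param("https://gumroad.com/l/datatools", "revops"))
    print(with_from_param("https://gumroad.com/l/datatools?wanted=true", "revops"))
```

Parsing and re-encoding the query avoids hard-coding the separator, so the same call works for both URL shapes.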