Tools shipped this batch (4 → 6 of 9 Ready):
  04  Missing Value Handler  src/core/missing.py + cli_missing.py + GUI
  05  Column Mapper          src/core/column_mapper.py + cli_column_map.py + GUI
  09  Pipeline Runner        src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py: Streamlit Community Cloud entry shim
  • src/gui/app_demo.py: single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/: 3 niche datasets + pre-tuned pipeline JSONs
  • landing/: 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md:
    full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0 xfailed

Tier-1 corpora added:
  • missing-corpus        3 use cases + 16 edge cases
  • column-mapper-corpus  3 use cases + 5 edge cases
  • format-cleaner intl   20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Deployment: demo + landing pages

> One page. Two services. ~30 minutes from "code complete" to
> "URL the user can hit." Every step here is from-scratch reproducible
> on a clean laptop.
> **Version**: 1.0 · **Adopted**: 2026-05-01

This doc covers the **two distribution surfaces** that ship to public
URLs: the Streamlit demo (the iframe target) and the Cloudflare Pages
landing pages (the marketing surface that embeds it).

The *paid* product (PyInstaller installers, code-signing, Gumroad
listing) is covered in `docs/NEXT-STEPS.md`.

---
## Part 1 · Deploy the demo (Streamlit Community Cloud, free)

### A. Pre-flight (one-time, ~2 min)

You need a free [Streamlit Community Cloud](https://streamlit.io/cloud)
account. Sign in with the GitHub account that hosts this repo.

### B. Deploy (~5 min, mostly waiting for the Cloud build)

1. **Push the repo to GitHub** (private or public; both work). The
   important files are at the **repo root**:

   - `streamlit_app.py`: Cloud auto-detects this; nothing to configure
   - `requirements.txt`: Cloud installs from this
   - `.streamlit/config.toml`: Cloud honours this
   - `samples/demo/*.csv` + `*_pipeline.json`: the demo's data
   - `src/`: the engine

2. In Streamlit Community Cloud → **New app**:
   - Repository: your fork
   - Branch: `main`
   - Main file path: `streamlit_app.py` (the default; leave it)
   - App URL: `datatools-demo` (or any free subdomain)
   - **Deploy**

3. First build is 2–3 min while Cloud installs `pandas`, `phonenumbers`,
   `rapidfuzz`, etc. Subsequent deploys are < 30 s.

### C. Verify

Open the deployed URL. Append `?p=shopify-pet` to the URL and
the persona-specific demo loads. Try `?p=bookkeeper` and
`?p=revops` to confirm all three personas route correctly. Click
**Run pipeline**; the AFTER preview should appear within ~1 second.

### D. The output URL

The deployed URL is what feeds into `landing/deploy.config.json` →
`demo_base_url`, without a trailing slash. For example:

    https://datatools-demo.streamlit.app
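
The trailing-slash rule and the `?p=<persona>` routing from the Verify step can be sketched in a few lines of Python. This is only an illustration: the helper names `normalize_base_url` and `persona_url` are hypothetical and are not part of the repo.

```python
from urllib.parse import urlencode

# Persona slugs taken from the Verify step above.
PERSONAS = ("shopify-pet", "bookkeeper", "revops")


def normalize_base_url(url: str) -> str:
    """Trim whitespace and trailing slashes from the deployed demo URL."""
    return url.strip().rstrip("/")


def persona_url(base_url: str, persona: str) -> str:
    """Build the URL that routes the demo to one persona via ?p=<persona>."""
    if persona not in PERSONAS:
        raise ValueError(f"unknown persona: {persona!r}")
    query = urlencode({"p": persona})
    return f"{normalize_base_url(base_url)}/?{query}"


if __name__ == "__main__":
    # Print the three URLs to open during verification.
    for slug in PERSONAS:
        print(persona_url("https://datatools-demo.streamlit.app/", slug))
```

Normalising once, before the value is written to `demo_base_url`, avoids a doubled slash when the landing pages later append `/?p=...` to it.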

### E. Migration trigger

Per `BUSINESS.md` §9 / `DEMO-PLAN.md` §9, migrate to a $5–10/mo VPS
when:

- Streamlit Community Cloud rate-limits / sleeps too aggressively, OR
- the demo crosses ~5 k page-views/month (free-tier capacity)

The migration is one command if you containerise:
`docker run -p 8501:8501 -v $(pwd):/app python:3.12-slim …`

---
## Part 2 · Deploy the landing pages (Cloudflare Pages, free)

### A. Pre-flight (one-time, ~5 min)

You need:

- A Cloudflare account (free) and a domain (any registrar) with
  nameservers pointed at Cloudflare. **OR** skip the custom domain
  step and use the auto-generated `*.pages.dev` URL.
- A Gumroad listing URL (a placeholder works until your account is set up:
  use `https://gumroad.com/l/datatools` and update it later).

### B. Build the deploy-ready bundle (~30 sec)

```bash
# One-time: copy the template
cp landing/deploy.config.example.json landing/deploy.config.json
# Edit it with your real URLs (any text editor works)
"${EDITOR:-nano}" landing/deploy.config.json
# Build
python3 landing/deploy.py
# → produces landing/dist/
```

`landing/deploy.config.json` is **gitignored**; your real URLs never
hit the repo.
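
For orientation, a URL-substitution build of this kind could look roughly like the sketch below. It is not the repo's `deploy.py`: the placeholder tokens (`{{DEMO_BASE_URL}}`, `{{GUMROAD_URL}}`), the `gumroad_url` config key, and the function names are assumptions made for illustration. Only `demo_base_url` is named in this doc.

```python
import json
import shutil
from pathlib import Path

# Hypothetical placeholder tokens mapped to config keys.
# "demo_base_url" comes from Part 1 D; "gumroad_url" is an assumed key name.
PLACEHOLDERS = {
    "{{DEMO_BASE_URL}}": "demo_base_url",
    "{{GUMROAD_URL}}": "gumroad_url",
}


def load_config(path: Path) -> dict:
    """Read the gitignored JSON config that holds the real URLs."""
    return json.loads(path.read_text(encoding="utf-8"))


def build(src: Path, dist: Path, config: dict) -> list[Path]:
    """Copy src to dist and substitute placeholders in every .html file.

    Returns the pages whose content changed.
    """
    if dist.exists():
        shutil.rmtree(dist)
    # copytree keeps the <persona>/index.html directory layout intact;
    # the ignore pattern skips a stale "dist" folder nested inside src.
    shutil.copytree(src, dist, ignore=shutil.ignore_patterns("dist"))
    changed = []
    for page in sorted(dist.rglob("*.html")):
        original = page.read_text(encoding="utf-8")
        text = original
        for token, key in PLACEHOLDERS.items():
            text = text.replace(token, config[key].rstrip("/"))
        if text != original:
            page.write_text(text, encoding="utf-8")
            changed.append(page)
    return changed
```

Keeping the real URLs in a separate JSON file and substituting at build time is what lets the template HTML stay in the repo while `deploy.config.json` stays out of it.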

### C. Deploy (~3 min)

Two paths; pick one:

**Drag-and-drop (zero CLI):**

1. Cloudflare Pages dashboard → **Create project** → **Direct Upload**
2. Drag `landing/dist/` into the upload zone
3. Project name: `datatools` (becomes `datatools.pages.dev`)
4. Click **Deploy**

**Wrangler CLI (one command, scriptable):**

```bash
npm install -g wrangler   # one-time
wrangler login            # one-time
wrangler pages deploy landing/dist
```

### D. Custom domain (~5 min, optional)

Pages dashboard → your project → **Custom domains** → add
`datatools.app` (or whichever apex domain you registered). Cloudflare
auto-issues TLS. Once propagated:

- `https://datatools.app/` → apex chooser
- `https://datatools.app/shopify-pet/` → Shopify landing
- `https://datatools.app/bookkeeper/` → Bookkeeper landing
- `https://datatools.app/revops/` → RevOps landing

### E. Verify

For each persona:

1. Open the persona URL.
2. Confirm the demo iframe loads (the URL inside it points at the
   Streamlit demo from Part 1).
3. Click "Run pipeline" inside the iframe → AFTER preview appears.
4. Click the "Get DataTools" button → opens Gumroad with the
   correct `?from=<persona>` query (verify in the URL bar).

If the iframe shows "Refused to connect", check Cloudflare Pages →
**Settings** → **Functions** for any CSP that disallows Streamlit's
domain. (Default Pages config does not set CSP, so this is rarely an
issue.)

---
## Part 3 · Updates

The cycle is:

```bash
# 1) Edit code or copy (any text editor works)
"${EDITOR:-nano}" landing/shopify-pet/index.html   # or bookkeeper/, revops/
"${EDITOR:-nano}" src/gui/app_demo.py

# 2) Rebuild landing
python3 landing/deploy.py

# 3) Re-deploy landing
wrangler pages deploy landing/dist

# 4) Re-deploy demo (commit your changes first)
git push origin main
# (Streamlit Cloud auto-deploys on push)
```

Both surfaces deploy in under 5 minutes end-to-end.

---
## Part 4 · Sanity checks (post-deploy, ~3 min)

Run these once, then trust the build (per `POST-LAUNCH.md` §6):

```bash
# Landing pages serve and reference the right demo URL
curl -s https://datatools.app/ | grep -c persona-card
# → 3 (one per persona card)

curl -s https://datatools.app/shopify-pet/ | grep -c "datatools-demo"
# → ≥1 (iframe src points at your demo)

# Demo responds and routes the persona param
# (URL quoted so the shell does not treat "?" as a glob character)
curl -s "https://datatools-demo.streamlit.app/?p=shopify-pet" | grep -c "Shopify"
# → ≥1

# Sitemap lists all 4 pages (grep counts matching lines, one <url> per line)
curl -s https://datatools.app/sitemap.xml | grep -c "<url>"
# → 4
```
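
`grep -c` counts matching lines, so it cannot tell whether the sitemap is well-formed XML, and it undercounts if several `<url>` entries share a line. A stricter check might parse the file instead. The sketch below is an assumption-laden illustration: the function name is hypothetical and the sample document is hand-written from the four URLs in Part 2 D, not fetched from a live site.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_locs(xml_text: str) -> list[str]:
    """Return every <loc> value; raises ET.ParseError on malformed XML."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


# Hand-written sample standing in for the downloaded sitemap.xml.
SAMPLE = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://datatools.app/</loc></url>
  <url><loc>https://datatools.app/shopify-pet/</loc></url>
  <url><loc>https://datatools.app/bookkeeper/</loc></url>
  <url><loc>https://datatools.app/revops/</loc></url>
</urlset>
""".strip()

if __name__ == "__main__":
    locs = sitemap_locs(SAMPLE)
    print(len(locs), "pages listed")
```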

---

## Part 5 · Cost ceiling check

| Service | Tier | Cost | Cap |
|---|---|---|---|
| Cloudflare Pages | Free | $0 | 500 builds/month, unlimited bandwidth |
| Streamlit Community Cloud | Free | $0 | 1 GB RAM, sleeps after 7 days idle |
| Custom domain | Cloudflare or registrar | ~$15/year | n/a |
| GitHub | Free for private repos with limited collaborators | $0 | n/a |
| **Total ongoing** | | **~$1.25/mo** (domain only) | |

Well inside the `BUSINESS.md` §9 cap of $1,200/mo recurring. The
$5–10/mo VPS migration is a contingency only; don't pre-build it.

---
## Troubleshooting

**Streamlit Cloud build fails with "ModuleNotFoundError: src.core"**

`streamlit_app.py` puts the repo root on `sys.path` before invoking
the demo module, but only if the file is at the repo root. Confirm
`streamlit_app.py` lives at `/streamlit_app.py`, not nested in a
folder.
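
To make the failure mode concrete, a root-level shim of this kind typically prepends its own directory to `sys.path` before importing anything under `src/`. The sketch below is an assumption about the general pattern, not a copy of the repo's `streamlit_app.py`; the helper name is hypothetical.

```python
import sys
from pathlib import Path


def ensure_on_sys_path(directory: Path) -> bool:
    """Prepend directory to sys.path once; return True if it was added."""
    entry = str(directory.resolve())
    if entry in sys.path:
        return False
    sys.path.insert(0, entry)
    return True


# In a shim that lives at the repo root, the call would be roughly:
#
#     ensure_on_sys_path(Path(__file__).parent)
#     from src.gui import app_demo  # assumed import path for src/gui/app_demo.py
#
# If the shim sits in a subfolder, Path(__file__).parent is that subfolder,
# `src/` is not found beneath it, and the import fails as described above.
```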

**Cloudflare Pages deploy succeeds but persona pages 404**

The directory layout is preserved by `deploy.py`. Confirm your
`landing/dist/` has `shopify-pet/index.html`, etc., not just three
flat files. If you used drag-and-drop, drag the **directory**, not
its contents.
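
The layout check can be automated before uploading. This is a minimal sketch that assumes the bundle contains one apex `index.html` plus one `index.html` per persona folder, as the URLs in Part 2 D suggest; the function name is hypothetical.

```python
from pathlib import Path

# Persona folders implied by the URLs in Part 2 D.
PERSONAS = ("shopify-pet", "bookkeeper", "revops")


def missing_pages(dist: Path) -> list[str]:
    """List the expected pages that are absent from the built bundle."""
    expected = ["index.html"] + [f"{slug}/index.html" for slug in PERSONAS]
    return [rel for rel in expected if not (dist / rel).is_file()]


if __name__ == "__main__":
    gaps = missing_pages(Path("landing/dist"))
    if gaps:
        print("missing:", ", ".join(gaps))
    else:
        print("layout OK")
```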

**The iframe shows "X-Frame-Options denied"**

Streamlit Community Cloud allows iframe embedding by default. If
you've migrated to a self-hosted demo behind a reverse proxy, remove
the `X-Frame-Options` header for the demo's domain. (`ALLOWALL` is not
a standard value; the header only defines `DENY` and `SAMEORIGIN`.) To
allow embedding from the landing site only, send
`Content-Security-Policy: frame-ancestors https://datatools.app` instead.

**Gumroad URL has no `?from=` parameter when clicked**

The `from=` query parameter is added by the landing-page CTA, not by
Gumroad. If it's missing, the landing-page HTML wasn't substituted;
re-run `python3 landing/deploy.py` and re-deploy.
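
Whether the parameter shows up as `?from=` or `&from=` depends on whether the listing URL already carries a query string. One way to append it safely is sketched below; the helper name and the `ref=x` example parameter are hypothetical, and the real CTA may build its link differently.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_from_param(listing_url: str, persona: str) -> str:
    """Return listing_url with from=<persona> set exactly once."""
    parts = urlsplit(listing_url)
    # Keep any existing parameters, dropping a stale "from" if present.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "from"
    ]
    query.append(("from", persona))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(with_from_param("https://gumroad.com/l/datatools", "revops"))
    print(with_from_param("https://gumroad.com/l/datatools?ref=x", "revops"))
```

Parsing and re-encoding the query, rather than concatenating strings, is what keeps the separator correct in both cases.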