feat: 3 new tools, format streaming, distribution-ready demo + landing pages

Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-01 22:31:26 +00:00
parent d18b95880d
commit 966af8ef94
89 changed files with 12039 additions and 284 deletions

236
docs/DEPLOYMENT.md Normal file
View File

@@ -0,0 +1,236 @@
# Deployment — demo + landing pages
> One page. Two services. ~30 minutes from "code complete" to
> "URL the user can hit." Every step here is from-scratch reproducible
> on a clean laptop.
> **Version**: 1.0 · **Adopted**: 2026-05-01
This doc covers the **two distribution surfaces** that ship to public
URLs: the Streamlit demo (the iframe target) and the Cloudflare Pages
landing pages (the marketing surface that embeds it).
The *paid* product — PyInstaller installers, code-signing, Gumroad
listing — is covered in `docs/NEXT-STEPS.md`.
---
## Part 1 · Deploy the demo (Streamlit Community Cloud — free)
### A. Pre-flight (one-time, ~2 min)
You need a free [Streamlit Community Cloud](https://streamlit.io/cloud)
account. Sign in with the GitHub account that hosts this repo.
### B. Deploy (~5 min, mostly waiting for the Cloud build)
1. **Push the repo to GitHub** (private or public — both work). The
important files are at the **repo root**:
- `streamlit_app.py` — Cloud auto-detects this; nothing to configure
- `requirements.txt` — Cloud installs from this
- `.streamlit/config.toml` — Cloud honours this
- `samples/demo/*.csv` + `*_pipeline.json` — the demo's data
- `src/` — the engine
2. In Streamlit Community Cloud → **New app**:
- Repository: your fork
- Branch: `main`
- Main file path: `streamlit_app.py` (the default — leave it)
- App URL: `datatools-demo` (or any free subdomain)
- **Deploy**
3. First build is 23 min while Cloud installs `pandas`, `phonenumbers`,
`rapidfuzz`, etc. Subsequent deploys are < 30 s.
### C. Verify
Open the deployed URL. Append `?p=shopify-pet` to the URL bar —
the persona-specific demo loads. Try `?p=bookkeeper` and
`?p=revops` to confirm all three personas route correctly. Click
**Run pipeline**; the AFTER preview should appear within ~1 second.
### D. The output URL
The deployed URL is what feeds into `landing/deploy.config.json`
`demo_base_url`. Without trailing slash. For example:
https://datatools-demo.streamlit.app
### E. Migration trigger
Per `BUSINESS.md` §9 / `DEMO-PLAN.md` §9, migrate to a $510/mo VPS
when:
- Streamlit Community Cloud rate-limits / sleeps too aggressively, OR
- the demo crosses ~5 k page-views/month (free-tier capacity)
The migration is one command if you containerise:
`docker run -p 8501:8501 -v $(pwd):/app python:3.12-slim …`
---
## Part 2 · Deploy the landing pages (Cloudflare Pages — free)
### A. Pre-flight (one-time, ~5 min)
You need:
- A Cloudflare account (free) and a domain (any registrar) with
nameservers pointed at Cloudflare. **OR** skip the custom domain
step and use the auto-generated `*.pages.dev` URL.
- A Gumroad listing URL (placeholder until your account is set up —
use `https://gumroad.com/l/datatools` and update it later).
### B. Build the deploy-ready bundle (~30 sec)
```bash
# One-time: copy the template
cp landing/deploy.config.example.json landing/deploy.config.json
# Edit it with your real URLs
edit landing/deploy.config.json
# Build
python3 landing/deploy.py
# → produces landing/dist/
```
`landing/deploy.config.json` is **gitignored**; your real URLs never
hit the repo.
### C. Deploy (~3 min)
Two paths — pick one:
**Drag-and-drop (zero CLI):**
1. Cloudflare Pages dashboard → **Create project****Direct Upload**
2. Drag `landing/dist/` into the upload zone
3. Project name: `datatools` (becomes `datatools.pages.dev`)
4. Click **Deploy**
**Wrangler CLI (one command, scriptable):**
```bash
npm install -g wrangler # one-time
wrangler login # one-time
wrangler pages deploy landing/dist
```
### D. Custom domain (~5 min, optional)
Pages dashboard → your project → **Custom domains** → add
`datatools.app` (or whichever apex domain you registered). Cloudflare
auto-issues TLS. Once propagated:
- `https://datatools.app/` → apex chooser
- `https://datatools.app/shopify-pet/` → Shopify landing
- `https://datatools.app/bookkeeper/` → Bookkeeper landing
- `https://datatools.app/revops/` → RevOps landing
### E. Verify
For each persona:
1. Open the persona URL.
2. Confirm the demo iframe loads (the URL inside it points at the
Streamlit demo from Part 1).
3. Click "Run pipeline" inside the iframe → AFTER preview appears.
4. Click the "Get DataTools" button → opens Gumroad with the
correct `?from=<persona>` query (verify in the URL bar).
If the iframe shows "Refused to connect", check Cloudflare Pages →
**Settings****Functions** for any CSP that disallows Streamlit's
domain. (Default Pages config does not set CSP, so this is rarely an
issue.)
---
## Part 3 · Updates
The cycle is:
```bash
# 1) Edit code or copy
edit landing/<persona>/index.html
edit src/gui/app_demo.py
# 2) Rebuild landing
python3 landing/deploy.py
# 3) Re-deploy landing
wrangler pages deploy landing/dist
# 4) Re-deploy demo
git push origin main
# (Streamlit Cloud auto-deploys on push)
```
Both surfaces deploy in under 5 minutes end-to-end.
---
## Part 4 · Sanity checks (post-deploy, ~3 min)
Run these once, then trust the build (per `POST-LAUNCH.md` §6):
```bash
# Landing pages serve and reference the right demo URL
curl -s https://datatools.app/ | grep -c persona-card
# → 3 (one per persona card)
curl -s https://datatools.app/shopify-pet/ | grep -c "datatools-demo"
# → ≥1 (iframe src points at your demo)
# Demo responds and routes the persona param
curl -s https://datatools-demo.streamlit.app/?p=shopify-pet | grep -c "Shopify"
# → ≥1
# Sitemap is valid XML and lists all 4 pages
curl -s https://datatools.app/sitemap.xml | grep -c "<url>"
# → 4
```
---
## Part 5 · Cost ceiling check
| Service | Tier | Cost | Cap |
|---|---|---|---|
| Cloudflare Pages | Free | $0 | 500 builds/month, unlimited bandwidth |
| Streamlit Community Cloud | Free | $0 | 1 GB RAM, sleeps after 7 days idle |
| Custom domain | Cloudflare or registrar | ~$15/year | n/a |
| GitHub | Free for private repos with limited collaborators | $0 | n/a |
| **Total ongoing** | | **~$1.25/mo** (domain only) | |
Well inside the `BUSINESS.md` §9 cap of $1,200/mo recurring. The
$510/mo VPS migration is a contingency only — don't pre-build it.
---
## Troubleshooting
**Streamlit Cloud build fails with "ModuleNotFoundError: src.core"**
`streamlit_app.py` puts the repo root on `sys.path` before invoking
the demo module — but only if the file is at the repo root. Confirm
`streamlit_app.py` lives at `/streamlit_app.py`, not nested in a
folder.
**Cloudflare Pages deploy succeeds but persona pages 404**
The directory layout is preserved by `deploy.py`. Confirm your
`landing/dist/` has `shopify-pet/index.html`, etc. — not just three
flat files. If you used drag-and-drop, drag the **directory**, not
its contents.
**The iframe shows "X-Frame-Options denied"**
Streamlit Community Cloud allows iframe embedding by default. If
you've migrated to a self-hosted demo with a reverse proxy, set
`X-Frame-Options: ALLOWALL` (or remove the header entirely) for the
demo's domain.
**Gumroad URL has no `?from=` parameter when clicked**
The `&from=` query param is added by the landing-page CTA, not by
Gumroad. If it's missing, the landing-page HTML wasn't substituted —
re-run `python3 landing/deploy.py` and re-deploy.