feat: 3 new tools, format streaming, distribution-ready demo + landing pages

Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:31:26 +00:00
parent d18b95880d
commit 966af8ef94
89 changed files with 12039 additions and 284 deletions

landing/README.md Normal file

@@ -0,0 +1,142 @@
# Landing pages
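Three persona-tagged landing pages, per `docs/PLAN.md` §2.3 and
`docs/DEMO-PLAN.md` §3 / §7. They are static HTML with no framework or
compile step; the only pre-deploy step is the URL substitution described
below, after which they ship to Cloudflare Pages.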
## Structure
```
landing/
├── _shared/styles.css shared CSS (system fonts, no externals)
├── shopify-pet/index.html Shopify operator (priority: pet supplies)
├── bookkeeper/index.html bookkeeper / freelance accountant
├── revops/index.html marketing / RevOps agency
└── README.md this file
```
Each page:
- Inherits `landing/_shared/styles.css`
- Overrides the `--accent` colour variable in an inline `<style>` block
so each persona has its own visual identity (Shopify = mint green,
Bookkeeper = steel blue, RevOps = vivid violet)
- Has a sticky buy bar with the Gumroad CTA tagged with `?from=<persona>`
- Embeds the live demo (Streamlit) via `<iframe>` with a sandbox attribute
- Carries persona-specific H1, sub-copy, use cases, FAQ, and a
ready-to-paste `terminal` block showing the CLI in action
- Includes Open Graph + Schema.org `SoftwareApplication` JSON-LD for
link-share previews and SEO
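As a rough illustration of the JSON-LD item in the list above, the following Python sketch builds a Schema.org `SoftwareApplication` payload wrapped in a `<script>` tag. The helper name `build_json_ld`, the product name, the description, and the `applicationCategory` value are placeholders chosen for the example, not values taken from the actual pages.

```python
import json


def build_json_ld(name: str, description: str, price: str, url: str) -> str:
    """Return a <script> tag holding SoftwareApplication JSON-LD (illustrative only)."""
    payload = {
        "@context": "https://schema.org",
        "@type": "SoftwareApplication",
        "name": name,
        "description": description,
        "applicationCategory": "BusinessApplication",  # assumed category
        "url": url,
        "offers": {"@type": "Offer", "price": price, "priceCurrency": "USD"},
    }
    # Escape "</" so a stray "</script>" inside a string cannot close the tag early.
    body = json.dumps(payload, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Placeholder values; the real copy lives in each persona's index.html.
snippet = build_json_ld(
    name="Example Data Tools",
    description="Placeholder description for one persona page.",
    price="49",
    url="https://datatools.app/bookkeeper/",
)
print(snippet)
```

Keeping the price in a single structured field like this is one way to make the "Pricing change" maintenance step less error-prone.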
## Pre-deploy URL substitutions — automated
The HTML carries placeholder URLs (the literal strings
`https://demo.datatools.app`, `https://datatools.app`,
`https://gumroad.com/l/datatools`, `mailto:hello@datatools.app`)
that **must** be replaced before deployment. A small Python script
does this for you — no global search-and-replace needed.
```bash
# 1) Copy the template and fill in your real URLs:
cp landing/deploy.config.example.json landing/deploy.config.json
"${EDITOR:-vi}" landing/deploy.config.json   # open in your editor of choice
# 2) Build the deploy-ready bundle:
python3 landing/deploy.py
# → produces landing/dist/ with substitutions applied,
# plus robots.txt, sitemap.xml, 404.html, favicon.svg
```
`landing/deploy.config.json` is gitignored so your real URLs never
hit the repo. Re-run `landing/deploy.py` whenever you change a URL or
edit any HTML source.
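The substitution step can be pictured roughly as follows. This is a minimal sketch of the idea, not the contents of `landing/deploy.py`: the function name `apply_substitutions` and the replacement URLs are hypothetical, and the real config schema may differ. It replaces every placeholder in a single regex pass so that one replacement can never be re-matched by another.

```python
import re


def apply_substitutions(text: str, mapping: dict[str, str]) -> str:
    """Replace each placeholder key in `text` with its configured value."""
    if not mapping:
        return text
    # Longest keys first, so a placeholder that is a prefix of another
    # cannot shadow the longer one in the alternation.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


# Hypothetical mapping: the keys are the placeholders listed above,
# the values stand in for whatever deploy.config.json would hold.
mapping = {
    "https://demo.datatools.app": "https://demo.example.com",
    "https://datatools.app": "https://example.com",
    "https://gumroad.com/l/datatools": "https://gumroad.com/l/example",
    "mailto:hello@datatools.app": "mailto:hi@example.com",
}
html = (
    '<a href="https://gumroad.com/l/datatools?from=revops">Buy</a> '
    '<iframe src="https://demo.datatools.app/?p=revops"></iframe>'
)
result = apply_substitutions(html, mapping)
print(result)
```

A single pass matters when a real URL happens to contain another placeholder; sequential `str.replace` calls could rewrite text that an earlier call had already substituted.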
## Cloudflare Pages deployment
The simplest path — one Pages project pointed at `landing/dist/`:
```bash
# Option A: drag-and-drop the directory in the Cloudflare dashboard
# Pages → Create project → Direct Upload → drag landing/dist/
# Option B: Wrangler CLI (one command, scriptable)
wrangler pages deploy landing/dist
```
Configure the custom apex domain (`datatools.app`) in the Cloudflare
Pages project settings; sub-paths `/shopify-pet/`, `/bookkeeper/`,
`/revops/` are served automatically because the directory layout
mirrors them. Cache rule defaults are fine (HTML 1 day, CSS 7 days).
If you want **separate Pages projects** per persona for independent
A/B testing, point three projects at the same `landing/dist/` and
configure each with its own sub-domain (`shopify.datatools.app`, etc.)
and a Pages rule that rewrites the root to that persona's
sub-directory.
## Telemetry wiring (per DEMO-PLAN §8)
The plan calls for event-only counters: no PII and no Google Analytics.
For each page on Cloudflare Pages, either attach a Worker or use
Cloudflare Web Analytics, which is privacy-friendly out of the box and
needs no configuration.
Track:
- `page_view` per persona (auto from CF Web Analytics)
- `cta_clicked` — add a small inline `<script>` that fires a fetch to
`/api/event?event=cta_clicked&persona=<persona>` when the buy button
is clicked, then continues the navigation to Gumroad.
- `demo.run_completed` and `demo.cta_clicked` are owned by the demo
app, not the landing page.
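To make the event-only, no-PII idea concrete, here is a small sketch of what the receiving end of `/api/event` might do. It is written in Python purely for illustration (a Cloudflare Worker would normally be JavaScript), and the allow-lists, the persona slugs, and the function name `record_event` are assumptions rather than part of the plan.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Assumed allow-lists: only known event names and persona slugs are counted.
ALLOWED_EVENTS = {"cta_clicked"}
ALLOWED_PERSONAS = {"shopify-pet", "bookkeeper", "revops"}


def record_event(url: str, counters: Counter) -> bool:
    """Increment a counter for a valid event URL; ignore everything else."""
    query = parse_qs(urlsplit(url).query)
    event = query.get("event", [""])[0]
    persona = query.get("persona", [""])[0]
    if event not in ALLOWED_EVENTS or persona not in ALLOWED_PERSONAS:
        return False
    # Only an (event, persona) tally is kept: no IP, user agent, or cookie.
    counters[(event, persona)] += 1
    return True


counters = Counter()
accepted = record_event("/api/event?event=cta_clicked&persona=revops", counters)
rejected = record_event("/api/event?event=cta_clicked&persona=unknown", counters)
print(accepted, rejected, dict(counters))
```

Rejecting unknown values keeps arbitrary strings out of the counters, which is one simple way to hold the no-PII line.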
Conversion (per DEMO-PLAN §8):
```
demo_engagement = demo.run_completed / page_view (target ≥ 30%)
purchase_intent = demo.cta_clicked / demo.run_completed (target ≥ 5%)
purchase_rate = gumroad.purchase / demo.cta_clicked (target ≥ 30%)
```
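The three ratios above are simple enough to compute by hand, but a short sketch shows how they chain together. The function name `funnel_rates` and the sample counts are made up for illustration; only the formulas and targets come from the block above.

```python
def funnel_rates(page_views: int, runs_completed: int,
                 demo_cta_clicks: int, purchases: int) -> dict[str, float]:
    """Compute the three funnel ratios, treating an empty denominator as 0.0."""
    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else 0.0

    return {
        "demo_engagement": ratio(runs_completed, page_views),
        "purchase_intent": ratio(demo_cta_clicks, runs_completed),
        "purchase_rate": ratio(purchases, demo_cta_clicks),
    }


# Targets as stated in the conversion block.
TARGETS = {"demo_engagement": 0.30, "purchase_intent": 0.05, "purchase_rate": 0.30}

# Illustrative counts only, not real traffic.
rates = funnel_rates(page_views=1000, runs_completed=320,
                     demo_cta_clicks=20, purchases=5)
met = {name: rates[name] >= TARGETS[name] for name in TARGETS}
print(rates)
print(met)
```

With these sample counts the first two targets are met and `purchase_rate` (5 / 20 = 0.25) falls short of its 30% target.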
The Gumroad webhook captures `?from=<persona>` so we can attribute
purchases back to the landing page that produced them.
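A hedged sketch of the attribution round trip: tagging a CTA link with `?from=<persona>` and reading the persona back out of a URL. The helper names `tag_cta` and `persona_from` are hypothetical, and how Gumroad surfaces the parameter in its webhook payload is not covered here.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def tag_cta(url: str, persona: str) -> str:
    """Return `url` with its `from` query parameter set to `persona`."""
    parts = urlsplit(url)
    # Drop any existing `from` value so the tag is never duplicated.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "from"]
    query.append(("from", persona))
    return urlunsplit(parts._replace(query=urlencode(query)))


def persona_from(url: str) -> Optional[str]:
    """Extract the `from` parameter, or None when it is absent."""
    return dict(parse_qsl(urlsplit(url).query)).get("from")


tagged = tag_cta("https://gumroad.com/l/datatools", "bookkeeper")
print(tagged)
print(persona_from(tagged))
```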
## Maintenance triggers (per DEMO-PLAN §9)
Refresh the page when:
| Trigger | Action |
|---|---|
| `demo.cta_clicked / demo.run_completed < 5%` for 4 weeks | The demo is working but the buyer isn't trusting the CTA. Add a screenshot of the network tab showing zero outbound calls. Soften the price callout. |
| `demo.run_completed / page_view < 30%` for 4 weeks | The demo iframe isn't loading or visitors aren't engaging. Check the iframe URL. Move the demo above the fold if it's currently below. |
| New tool ships (06–09) | Add it to the persona's saved pipeline only if it fits; don't bloat the demo with every tool. |
| Pricing change | Update `<meta>` schema, the buybar `.price-tag`, the pricing card, and the FAQ. Search-and-replace `$49` across the file. |
| New persona added (4th, 5th) | Copy `shopify-pet/index.html`, replace persona-specific copy, add to the `footer` cross-link block on the existing pages. |
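
The two metric-based triggers in the table share one shape: a ratio staying under its threshold for four consecutive weeks. A minimal sketch of that check, assuming one ratio is recorded per week (the function name `trigger_fired` and the sample series are hypothetical):

```python
def trigger_fired(weekly_rates: list[float], threshold: float, weeks: int = 4) -> bool:
    """True when the most recent `weeks` values are all below `threshold`."""
    if len(weekly_rates) < weeks:
        return False  # not enough history to judge yet
    return all(rate < threshold for rate in weekly_rates[-weeks:])


# Illustrative weekly purchase-intent ratios, oldest first.
sustained_dip = trigger_fired([0.06, 0.04, 0.03, 0.04, 0.02], threshold=0.05)
one_good_week = trigger_fired([0.04, 0.04, 0.06, 0.04], threshold=0.05)
print(sustained_dip, one_good_week)
```

Requiring every week in the window to fall short avoids refreshing the page over a single noisy week.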
## Why static HTML
Per `DECISIONS.md §5` and `BUSINESS.md §7`, the landing-page channel
must be:
- **Async-friendly** — Cloudflare Pages serves these with no operator
involvement
- **Cheap** — Cloudflare Pages free tier is sufficient until well past
the $5k/mo MRR re-lock trigger (`DECISIONS.md §8`)
- **Privacy-respecting** — no third-party tracker means no cookie
banner, which means no friction added to the conversion funnel
- **Zero ongoing maintenance** — no framework, no build, no upgrades.
The CSS uses system fonts; no Google Fonts; no CDN dependency that
could break the page when their TLS certificate rolls.
## Anti-temptations (per DEMO-PLAN §11 + plan §5)
These pages deliberately hold to the following rules:
- **No live chat widget.** Locked by no-touch.
- **No "schedule a demo with us" CTA.** Same.
- **No email capture before the demo.** Friction kills conversion.
- **No Google Analytics / Meta Pixel.** Privacy story is a moat, not
a checkbox to ignore.
- **No SaaS-style "free trial / no credit card."** This is a one-time
download, not a subscription.
- **No A/B-testing framework yet.** Pre-PMF traffic doesn't reach
statistical significance — ship the single-arm copy, iterate monthly.