Files
datatools-dev/streamlit_app.py
Michael 966af8ef94 feat: 3 new tools, format streaming, distribution-ready demo + landing pages
Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:31:26 +00:00

52 lines
2.0 KiB
Python

"""Streamlit Community Cloud entry point — public demo app.
This is the file Streamlit Community Cloud auto-discovers when you
deploy from this repository: leave the "Main file path" field at its
default (``streamlit_app.py``) and it just works.
Why this lives at the repo root, not in ``src/gui/``:
Streamlit auto-detects sibling files inside a ``pages/`` directory
next to the entry script and renders them as additional pages in
the sidebar. The full product GUI's pages live in
``src/gui/pages/`` — pointing the Cloud at ``src/gui/app_demo.py``
would inadvertently expose every paid-product page in the demo's
sidebar (or require URL-routing tricks to suppress them).
Anchoring the entry script at the repo root means there is no
``pages/`` neighbour and the demo stays single-page by
construction.
The actual demo UI is defined once in ``src/gui/app_demo.py`` so
local development still works the way it always did:
streamlit run src/gui/app_demo.py # local dev, identical UX
Cloud deploy uses this shim:
streamlit run streamlit_app.py # what Cloud invokes
"""
from __future__ import annotations
import sys
from pathlib import Path
# Put the repo root on sys.path so ``src.core`` and ``src.gui`` imports
# resolve cleanly. The demo module does this itself for the local-dev
# case, but the import order matters when this shim runs first on Cloud.
_HERE = Path(__file__).resolve().parent
if str(_HERE) not in sys.path:
sys.path.insert(0, str(_HERE))
# Executing the demo module top-to-bottom is the simplest way to share
# the UI between the two entry points without duplicating code or
# refactoring the demo into a function (Streamlit's idiom is
# script-as-page; converting it to a callable would fight the
# framework). ``runpy`` runs the file in this script's namespace so
# Streamlit's ``st.set_page_config`` / element registration sees the
# correct module.
import runpy
runpy.run_path(
str(_HERE / "src" / "gui" / "app_demo.py"),
run_name="__main__",
)