Commit Graph

13 Commits

Author SHA1 Message Date
1895074b8f test+fix(gui): retire the now-empty "analysis" nav section
The journey-level nav restructure moved Home to a standalone "Start
here" entry and Reconcile into the "Finance" group, leaving the
"analysis" section with zero tools. Two registry tests encoded the old
layout and failed:
- test_every_section_has_at_least_one_tool[analysis] (empty section)
- test_reconciler_present (asserted section == "analysis")

Drop "analysis" from the Section literal, SECTION_LABELS, and app.py's
by_section bucket — it's genuinely dead now (home isn't a registry Tool).
Update the presence tests to assert Reconcile + PDF to CSV live in
"finance". The section-invariant tests (every section non-empty, has a
label, no orphan labels) are preserved and pass.

Full suite: 2441 passed, 91 skipped, 0 failed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 17:11:02 +00:00
09ec01e98b feat(gui): port journey-level nav + local-first pill to the live app
Brings the live Streamlit app in line with the finalized layout-review
mockups (structural/low-risk changes; verified by compile + registry
sanity, still pending a streamlit-run visual check):

- tools_registry: Data Cleaners now in pipeline order (Clean Text ->
  Standardize -> Fix Missing -> Find Duplicates); new "finance" section
  (Reconcile, PDF to CSV) and "coming_soon" section (Find Unusual,
  Quality Check, Combine Files). Adds those to the Section type +
  SECTION_LABELS.
- app.py: Home becomes the "Start here" front door — a standalone,
  unlabeled top entry (play_circle icon) ahead of the hidden
  Activate/Logs/Close pages; nav groups reordered cleaners ->
  transformations -> automations -> finance -> coming soon.
- _legacy.py: render_tool_header now shows the "Runs 100% locally"
  privacy pill (right-aligned, Ready tools only — omitted on Coming
  Soon stubs); accent emphasis CSS for the Start-here nav link.
- i18n: add nav.start_here_title, nav.section_finance,
  nav.section_coming_soon to en + es packs.
- DECISIONS.md: log the PDF/Reconcile in-bundle (Finance group) call.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 17:01:57 +00:00
7203a81af7 copy: strip jargon from tool descriptions and captions
Prior round only touched page_caption; the description field (shown on
home grid cards) still said "imputation", "missingness",
"winsorization", "schema coercion", "fuzzy matching with normalization",
etc. The audience is non-technical buyers — they shouldn't need a stats
or DB-admin vocabulary to read a tool card.

Rewrite both description and page_caption across en, es, and the
tools_registry (the fallback source of truth) using everyday words:
blanks instead of nulls, fill in instead of impute, look wrong instead
of statistical outliers, etc. Same one-line shape as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 17:09:52 +00:00
ea99e292d2 feat(nav): group Home + Reconcile under a new "Analysis" section
Home now appears in the sidebar as "File Analysis" under a labeled
"Analysis" section together with Reconcile Two Files — both pages
are data-analysis workflows (importing/profiling files vs. matching
across files), so grouping them clarifies the sidebar's mental model.

- tools_registry: new ``analysis`` Section; reconcile moves out of
  automations into it.
- i18n: ``nav.section_analysis`` + ``nav.file_analysis_title`` added
  to en.json and es.json.
- app.py: home dropped from the unlabeled section and surfaced at the
  top of the Analysis group; ``default=True`` preserved so first-visit
  routing is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:11:06 +00:00
e44af3a45e feat(reconcile): two-source reconciliation tool
Bank-feed-vs-ledger style matcher: 4-pass greedy assignment (key →
exact → tolerance → fuzzy) with ambiguous candidates routed to a
review bucket instead of arbitrary picks. CLI mirrors the
cli_text_clean preview/--apply pattern; Streamlit page registered
in the automations section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 22:33:14 +00:00
2f349e8191 feat(pdf): tool page with Extract / Build / Manage modes
Phase 3/6. Wires the PDF Extractor into the GUI as a new
"transformations" tool with three modes selected by a horizontal
radio at the top of the page:

**Extract** — pick a saved template, upload one or more
statement PDFs (single + batch shipping together to keep the
common case one-step), get a previewed DataFrame + CSV download.
Per-file row counts and warnings are surfaced; failures on one
file don't kill the whole batch. The combined CSV gets a
``source_file`` first column so the accountant can sort/filter
by statement.

**Build template** — load an existing template or start fresh,
upload a sample PDF, edit every schema field across four tabs
(Pages & table / Columns / Parsing / Save). A live preview below
re-runs ``apply_template`` against the sample on each re-render
so the user sees their changes hit rows immediately. The column-
boundary editor is text-input ("comma-separated x-positions") for
now — replaced by the drawable-canvas visual picker in commit 5.

**Manage templates** — list with rename / delete / export
(downloads the canonical JSON) / import (uploads someone else's
JSON, validated through ``template_from_json``).

Heavy work (``extract_pages_auto``) only runs on explicit user
action (Extract / a new sample upload), and the parsed Page list
is cached in ``st.session_state`` so widget-edit reruns don't
re-parse the PDF.

Logging: tool runs and template saves both hit the audit log via
``log_event("tool_run", …)``, matching every other tool's
instrumentation pattern.

Registered in ``tools_registry.py`` under ``transformations``
with status ``Ready`` and the picture-as-pdf Material icon. i18n
keys added for en + es ("PDF to CSV" / "PDF a CSV").

OCR is wired in this commit — ``extract_pages_auto`` already
falls back through ``pytesseract`` when the binary is available,
and the warning strings it returns surface as ``st.info`` /
``st.warning`` per-file. Commit 6 will polish the OCR UX with a
status row.

Next commits build on this page:
  4 — batch progress + cancellation + per-file error grouping
  5 — drawable-canvas visual picker replaces text x-positions
  6 — OCR availability banner + scanned-page indicators

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:49:44 +00:00
da7d86f457 feat(ui): Material icons in sidebar + stats overview on home
Two pieces of the mockup 2 layout that hadn't landed yet:

1. Sidebar nav icons — emoji glyphs (🧹 ✂️ 🔍 …) swapped for
   Streamlit's ``:material/<name>:`` syntax, picking the outline
   Material Symbol that best matches each mockup SVG:

       Home               → :material/home:
       Fix Missing Values → :material/help_outline:
       Find Unusual Vals  → :material/insights:
       Clean Text         → :material/text_format:
       Standardize Fmts   → :material/format_list_bulleted:
       Find Duplicates    → :material/search:
       Quality Check      → :material/check_circle:
       Map Columns        → :material/view_column:
       Combine Files      → :material/account_tree:
       Auto Workflows     → :material/auto_awesome:
       Activate           → :material/key:
       Close              → :material/close:

   Streamlit injects the icon name as a literal ligature inside a
   first-child ``<span>`` of the nav anchor, expected to render
   through the Material Symbols font. theme.py's base rule was
   forcing Geist on every span under ``stSidebarNav``, turning the
   ligatures back into plain text labels — added a structural
   exception that targets ``[data-testid="stSidebarNavLink"] >
   span:first-child`` (and any descendant), restoring the Material
   font family, neutralizing the inherited ``ss01/cv01/cv11``
   feature settings, and sizing to 18px.

   Also stripped the leading emojis from every page title in the
   en/es i18n packs (``home.title``, ``close_page.title``,
   ``activation.title``, ``tools.*.page_title``) — the icons live
   in the sidebar now, the page H1 no longer needs to carry one.

2. Stats overview on home — new ``_render_stats_overview`` in
   _home.py emits a 4-card grid above the per-file findings panels:
   Files analyzed, Total findings, Warnings (severity ``warn`` ∪
   ``error``), Info (severity ``info``). Card layout follows the
   mockup §stats verbatim — Geist 28px / 600 / -0.03em for the
   numeric value (the "Display number" row in spec §4), tiny
   uppercase tracked label, paper-surface card with the standard
   warm border + faint shadow. The Warnings / Info cards tint the
   number with ``--warn`` / ``--info`` when the count is non-zero.

CSS for ``.dt-stats / .dt-stat / .dt-stat-label / .dt-stat-value /
.dt-stat-unit`` added to ``_DESIGN_TOKENS_CSS``; falls to a
2-column grid below 900px viewport, matching the mockup's media
query.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:31:40 +00:00
dad744f17f refactor(gui): drop Review page + normalization gate
Home is now the only entry point: the "Run analysis" button on the
upload section IS the review step (findings render inline via
render_findings_panel). Tool pages no longer gate on a passed
normalization — running the analyzer is sufficient context.

Removed:
- src/gui/pages/0_Review.py
- src/gui/components/gate.py (re-export seam)
- require_normalization_gate() in src/gui/components/_legacy.py
- "review" section enum in tools_registry.py
- Data Review entry in app.py navigation
- require_normalization_gate() calls + imports in all nine tool pages
- tests/gui/test_gate.py (whole file)
- TestReviewWorkflow in tests/gui/test_workflows.py
- 0_Review entry in tests/gui/test_smoke.py PAGE_SLUGS
- stash_upload's normalization_result+normalization_for stashing
- stash_upload_without_gate (was the gate's negative-path helper)

2017 tests pass (16 retired with the gate flow).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:04:33 +00:00
93e43fc0d9 feat(gui): sidebar sections + non-technical tool labels
Sidebar nav now groups tools under Data Review / Data Cleaners /
Transformations / Automations via st.navigation, replacing the flat
auto-discovered list. Tool display names switch to action-first
phrasing (Find Duplicates, Fix Missing Values, Find Unusual Values,
Standardize Formats, Clean Text, Quality Check, Map Columns, Combine
Files, Automated Workflows) in EN + ES packs and on each page's H1.

The Data Cleaners section follows the requested order: Missing
Values → Outliers → Text Cleaner → Format Standardizer → Deduplicator
→ Quality Check. (Text Cleaner kept inside cleaners since the request
didn't list it but the tool still ships.) Registry now carries a
section field; helpers added: tools_in_section(), section_label().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 19:36:01 +00:00
c4ce86bd64 feat(i18n): add language-pack scaffold with English and Spanish
Introduces ``src/i18n`` with a tiny JSON-backed t() lookup, an in-session
language preference, and a sidebar selector wired through
``hide_streamlit_chrome`` so every page picks up the same picker. Covers
home, tool cards, findings panel, gate, shutdown, and pickup banner
strings. Tests pin pack parity and the farewell-overlay JS escape so
future packs can't silently regress.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:11:30 +00:00
966af8ef94 feat: 3 new tools, format streaming, distribution-ready demo + landing pages
Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:31:26 +00:00
4adeb5c7f3 feat(format): per-cell standardizers + 199-row buyer corpus
Adds src/core/format_standardize.py — a per-cell standardizer for dates,
phones, emails, addresses, names, currencies, booleans — wired through
StandardizeOptions / standardize_dataframe with FieldType registry.

Includes:
- Date parser handles ISO/US/EU/longform/excel-serial/unix-timestamp/
  partial-precision/quarter notation; opt-in French/German/Spanish month
  dictionaries via month_locales.
- Phone via libphonenumber with extension preservation (;ext=N), 001
  international prefix handling, error sentinels for placeholders /
  multi-number cells.
- Email lowercase/trim/mailto/angle-bracket strip with optional
  --gmail-canonical mode.
- Address USPS abbreviation expansion or compression (expand=False per
  corpus § 6.3), state-name → 2-letter conversion, multi-line collapse,
  PO Box normalization, state-code preservation regardless of input case.
- Name handler: Mc/Mac/O'/D' inner caps, hyphen segments, particle
  lowercasing (von/van/de/da), comma-format reversal, period stripping
  for titles/suffixes/initials, PhD/MD acronym preservation, conservative
  mode for mixed-case input.
- Currency: auto-detect EU vs US separators, space-thousands, Swiss
  apostrophe, accounting parens, optional ISO code preservation, error
  sentinels for percentages/ranges/word-values/ambiguous separators.
- Per-domain error_policy ("passthrough" | "sentinel") for surfacing
  malformed values as <error: reason> per corpus § 0.3.

Test corpus from Business/DataTools/test-cases-format-cleaner copied to
test-cases/format-cleaner-corpus/ — 7 fixtures plus FORMATS-CASES.md.
tests/test_format_standardize_corpus.py drives all 199 rows through the
per-cell standardizers; 0 xfailed.

Wires the GUI page (3_Format_Standardizer.py) to "Ready" status.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:11:24 +00:00
f891c6116d refactor(gui): tool registry + components package for per-tool builds
Two low-risk seam moves to enable selling per-tool subsets without
breaking the existing all-in-one bundle. Behaviour identical; every
existing import still resolves; full pytest suite + every page returns
HTTP 200.

1. **Tool registry** (src/gui/tools_registry.py) — replaces the
   inline dict-of-dicts in app.py with a Tool dataclass and a TOOLS
   list. Adds a tier field ("core" today, "pro" / "enterprise" later)
   and tools_for_tier() / tool_by_id() / display_name() helpers. A
   per-tool build slices TOOLS at import time without code changes.

2. **components package** (src/gui/components/) — converts the former
   single components.py into a package with:
     _legacy.py        — original file, unchanged.
     __init__.py       — re-exports the legacy surface; existing
                         "from src.gui.components import …" calls
                         continue to work.
     shared.py         — hide_streamlit_chrome, pickup_or_upload
                         (every build needs these).
     gate.py           — require_normalization_gate (Pro / Suite SKUs).
     findings.py       — analyzer-finding widgets (drops out of a
                         standalone-Dedup build).
     dedup_review.py   — match-group cards + apply pipeline (drops out
                         of a non-dedup build).

   The seam modules are narrow re-exports today. As code migrates out
   of _legacy.py into the focused modules, the public import path
   stays stable via the shim.

E2E: 765 passed, 17 xfailed (unchanged); home page + all 9 tool pages
+ Review page render HTTP 200; full pipeline (analyze → auto_fix →
apply_decisions → output bytes) round-trips on the kitchen-sink
fixture with zero high-confidence findings remaining post-fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:56:21 +00:00