datatools-dev

Author	SHA1	Message	Date
Michael	a1824b8dc4	feat(pdf): Home-style file list + Clear-all button User feedback: the standard file_uploader didn't visually match the Home page, and there was no obvious way to clear out uploaded files between scans (have to refresh the browser tab). Persistent stash + add-only sync. Files captured into ``st.session_state["pdf_uploads"]`` (dict name → {bytes, size}) via an ``on_change`` callback on the file_uploader widget. The callback is add-only — never removes files from the stash based on widget state. Removal is owned by the custom X buttons + widget-counter bump (see below). This guarantees a hidden native X click can't silently drop files behind the user's back. Hidden native file list. A small CSS block suppresses the file_uploader's built-in file rows + their delete buttons (``stFileUploaderFile`` + ``stFileUploaderDeleteBtn``), so the custom list below is the single source of truth on screen. Custom file list (Home pattern). Below the dropzone, every uploaded file gets a row: ``✕ \| 📄 filename \| size``. Top of section shows ``N files · 12.3 MB total``. Counts and sizes update in real time as the user adds or removes files. The X button per row calls ``log_event("upload", "PDF removed: …")``, removes the entry from the stash, and bumps the widget counter to clear the widget too. Clear-all button. Sits next to the Scan button. Wipes the stash, bumps the widget counter, drops any cached scan results (``K_ROWS``, ``K_WARNINGS``, ``K_SOURCE_COUNT``). Audited via ``log_event("upload", "PDF list cleared", count=N)``. Widget reset via counter bump. Streamlit disallows programmatic mutation of widget session-state entries; the standard workaround is to rotate the widget's ``key``. Page maintains ``K_UPLOAD_COUNTER`` which gets incremented on remove / clear-all, producing a fresh ``pdf_upload_v{N}`` key and a freshly-instantiated empty widget. The stash retains any unaffected files; on next upload, the add-only sync picks up the new ones without re-adding the removed ones. Scan rewired to read the stash. Instead of iterating the widget's UploadedFile objects (which the previous code did and which broke when the widget unmounted on remove), the scan loop iterates ``pdf_uploads.items()`` and uses the cached ``bytes``. Diagnostic expander does the same — re-reads from the stash, removing the need for a separate ``K_DIAGNOSTIC`` cache (deleted). ``_format_size`` helper ports the byte-formatting logic from ``_home.py``'s pattern (KB / MB / GB rollover). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 00:28:01 +00:00
Michael	155dd30746	feat(pdf): extract statement header (account + period) + date format Two related additions for the accountant workflow: 1. Statement header extraction. New ``extract_statement_metadata(pages)`` pulls the account number and statement period out of the first page (falls back to page 1+2 if either is missing on page 1 — Wells Fargo business accounts put header info on page 2). Detected fields are stamped onto EVERY transaction row so a multi-statement CSV is self-attributing per row:: { "date": "20250113", "description": "Coffee Shop", "amount_1": -4.50, "account_number": "**5678", "statement_period_start": "20250101", "statement_period_end": "20250131", ... } Account-number regex is tolerant of masks (``1234``), hyphens (``1234-5678-9012``), and spaces. Period regex looks for "Statement Period" / "From" / "Period Covered" labels plus the first 1-2 full-year dates that follow. If only one date is present near the label, it's used for both start and end (some statements show only the closing date). 2. Year inference for short dates. When the row date is a short ``01/13`` or ``Jan 13`` without a year, the scanner now binds the year from the statement period's end date BEFORE formatting. Doesn't handle the December-in-January-statement cross-year case (rare; user can edit in the table). 3. Configurable output date format.** New ``output_date_format`` parameter on ``scan_pdf_for_transactions`` defaults to ``%Y%m%d``. Applied to: the transaction date column AND the statement period start/end fields. The page surfaces a dropdown in Scan options with common presets (YYYYMMDD, YYYY-MM-DD, MM/DD/YYYY, DD/MM/YYYY, ``Mon DD, YYYY``) plus a Custom option that accepts a raw strftime string. New helper: ``format_date(iso_str, fmt)`` converts ISO ``YYYY-MM-DD`` to any strftime; passes invalid input through unchanged so the user can see what was actually there rather than getting silent empties. 20 new tests cover: format_date, account-number extraction (masked / hyphenated / spaced / no-label / short), period extraction (standard / from-to / single-date / no-label), metadata orchestrator (full header / no pages / page-2 fallback), year inference (US / dash / month-name / no-period / unparseable), plus an end-to-end class that builds a header'd PDF with short-date transactions and confirms metadata attribution + year inference + format round-trip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 00:20:46 +00:00
Michael	3cf935c999	fix(pdf): drop zero-amount rows; multi-date rows clean description Two corrections from real-statement feedback: 1. Drop rows where the transaction amount is exactly 0. Bank statements include date+amount-shaped noise like "INTEREST EARNED 0.00", "PAGE TOTAL 0.00", "BALANCE FORWARD 0.00 1,234.56" — all match the date+amount heuristic but aren't transactions. New filter in ``scan_pdf_for_transactions``: drop rows whose ``amount_1`` parses to exactly 0. Non-zero balances in ``amount_2`` don't rescue a zero amount_1 — leftmost amount is the canonical transaction amount. Unparsed-but-non-empty amount strings are kept (user verifies in the editor). 2. Multi-date rows: first date wins for the column, every date excluded from the description. Chase / BofA / Wells commonly show both a transaction date and a posting date per row: 01/13 01/14 COFFEE SHOP $4.50 Before this fix, ``_find_dates_in_words`` returned the first date only and the second date leaked into description as "01/14 COFFEE SHOP". Now it returns ALL dates with their word ranges; the scanner uses ``dates[0]`` as the canonical date and passes every range to the description builder for exclusion. The detector's two-pass strategy now also guards against mixing full-year and short-date matches on the same row. Previously, a header line like ``Page 1/2 of 3 ... Statement Date 01/13/2026`` would return both ``1/2`` and ``01/13/2026``, and ``1/2`` (being leftmost) would have won the date column. Now: if any full-year date is found on the row, short patterns are NOT also collected — full year anchors interpretation. A row with no full-year date (Chase short-date case) still falls back to short patterns and collects all of them. New tests: - ``test_multiple_dates_returned_in_position_order`` — ``01/13`` + ``01/14`` both returned, in order - ``TestMultiDateRow.test_first_date_wins_second_excluded_from_description`` — end-to-end through ``scan_pdf_for_transactions`` - ``TestZeroAmountRowsAreDropped.test_zero_amount_row_dropped`` — "INTEREST EARNED 0.00" row dropped while real txn kept - ``test_negative_amount_kept`` — pin that -40.00 is not treated as zero by the filter Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 00:12:21 +00:00
Michael	263af3c7c2	fix(pdf): short dates without year + diagnostic for "0 rows" runs User uploaded a real Chase statement and got "0 rows detected." Two bugs the rewrite shipped with, plus a diagnostic: 1. Short dates without year weren't recognized. Most bank statements (Chase, Wells, BofA, …) display transaction dates as ``01/13`` or ``Jan 13`` because the year is implied by the statement period. The original regex required ``\d{2,4}`` after the second slash, so ``01/13`` failed to match and rows with no detected date got dropped. Split ``_DATE_RES`` into ``_FULL`` (with year) and ``_SHORT`` (no year), with a two-pass detector: pass 1 tries full-year patterns across the whole row; pass 2 only tries short patterns if pass 1 found nothing. This prevents a stray ``Page 1/2`` from shadowing the real dated transaction on the same line. Short patterns: - ``\d{1,2}/\d{1,2}`` — Chase, etc. - ``\d{1,2}-\d{1,2}`` - ``[A-Z][a-z]{2}\s+\d{1,2}`` — "Jan 13" When parsing, short dates pass through ``parse_date`` and return None (no year to bind to), so the scanner falls back to the raw text — the user sees ``01/13`` in the date column and can correct in the editor. 2. Multi-word dates leaked the day token into the description. A pre-existing bug: ``_find_dates_in_words`` returned only the START word index, and ``_description_from_row`` only excluded that single word. For "Jan 13 Coffee $4.50", the description became "13 Coffee" instead of "Coffee". Fixed by returning ``(start, end, text)`` with ``end`` exclusive (computed from ``len(m.group(1).split())`` so window-overrun doesn't over-consume), and the description builder now skips the full range. 3. New diagnostic: ``diagnose_pdf_lines(pdf_bytes)``. Returns every clustered text line the scanner saw with ``has_date`` / ``has_amount`` flags. When the page's scan returns 0 rows, an auto-expanded "what the scanner saw" expander now renders a table of all extracted lines so the user can: - Spot scanned-PDF cases (empty result → enable OCR) - See which lines have a date but no amount (or vice versa) - Eyeball the date / amount format the scanner missed Without leaving the app or asking the developer for help. Eight new tests cover: short US date (``01/13``), short month- name date with two-word consumption (``Jan 13``), the ``Page 1/2 ... 01/13/2026`` shadowing case, and the multi-word- date description fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 00:06:07 +00:00
Michael	bece2b4030	refactor(pdf): rip out templates; heuristic scan + selectable table User feedback: the template / visual-picker / mode-dispatch implementation was too complex for the actual workflow. Statements drift between months, the canvas state didn't survive multi-page navigation, and accountants don't want to maintain per-bank configuration just to convert PDFs to CSV. Start-over design — one public function, one page, no persistence: ``scan_pdf_for_transactions(pdf_bytes) → (rows, warnings)`` A row is "any text line with a date pattern AND at least one amount pattern." Each detected row is a dict shaped:: { "date": "2026-01-15", "description": "Coffee Shop", "amount_1": -4.50, "amount_2": 1000.00, # if a second amount was found "page": 1, "raw": "01/15/2026 Coffee Shop (4.50) 1,000.00", "source_file": "chase-jan-2026.pdf", } Multi-line descriptions still merge (no-date no-amount lines attach to the previous transaction). Multi-PDF batches share a single combined table with a ``source_file`` column. Page UX: - Upload PDF(s) → optional Options expander (parens-negative, use-OCR) → click Scan → see all detected rows in an ``st.data_editor``. - The editor has an ``Include`` checkbox column (default on), plus user-editable date / description / amount cells and a read-only ``raw`` column showing the original PDF text for verification. - A ``Columns to include in CSV`` multiselect hides ``page`` / ``raw`` from the download by default; user can re-add either. - Download CSV gets only the checked rows. No template save/load. No visual picker. No mode dispatch. No column boundaries. No schema migration. No per-bank configuration files. Deletions: - ``src/pdf_templates.py`` — template storage layer - ``src/gui/_drawable_canvas_compat.py`` — Streamlit compat shim for the canvas (no canvas now) - ``tests/test_pdf_templates.py``, ``test_pdf_row_heuristic.py``, ``test_drawable_canvas_compat.py`` — covered the removed APIs - ``build/hooks/hook-streamlit_drawable_canvas.py`` — hook for the removed dep - ``streamlit-drawable-canvas==0.9.3`` from ``requirements.txt`` - The drawable-canvas references in ``build/datatools.spec`` ``src/pdf_extract.py`` shrinks from ~30 helper functions to ~10. Keeps: value parsers, row clusterer, date/amount token finders, OCR pipeline, dependency guards. The one new public function ``scan_pdf_for_transactions`` glues them together. Tests (59 passing): the unit layer keeps full coverage of the building blocks; the smoke layer pins the end-to-end PDF roundtrip, OCR discovery, dependency-import behavior, and the multi-line-description merge. The fpdf2-generated fixture PDF still drives the real-PDF test. Rollback: ``git revert HEAD`` brings back the template system if needed — but the simpler model should make that unlikely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:57:30 +00:00
Michael	60969c0770	feat(pdf): UI rework — Auto-detect is the default build flow Pulls the user's primary mental model away from "draw column boundaries" toward "tell me what shape your amounts have, see detected rows, save." The visual picker that wasn't working for multi-statement workflows is reachable but no longer the default. Build mode header now has a mode radio: - "Auto-detect (recommended)" — row_heuristic. Tabs: Amount layout · Filters & date · Save. Three small forms; no coordinate UI anywhere. The Amount-layout tab's dropdown picks one of single / txn+balance / debit+credit / debit+credit+balance and auto-derives the min/max amount-count range (overridable under an expander). - "Visual columns (advanced)" — column_visual. Five tabs (the original Visual picker / Pages & table / Columns / Parsing / Save). A yellow warning panel up top reminds the user that column-x templates only work when statement layout is stable. Switching modes triggers a rerun so the right tab set renders immediately. The template object preserves both mode's config trees side-by-side so a user can flip between them without losing work. Live preview below the form runs ``apply_template`` against the cached sample pages (already cached in session_state so this re-renders cheaply on every form edit). The "no rows yet" message is mode-aware — points users at the right tuning knobs for whichever mode they're in. The preview caption notes which mode produced the rows so the user can correlate decisions to output. The visual picker bug the user reported — "a single box stays in the same location regardless of page" — is sidestepped rather than fixed: in row_heuristic mode there's no canvas to confuse, and for the rare column_visual user the canvas is still imperfect but no longer their first interaction with the tool. Cleaning up the column_visual canvas state bugs is a separate follow-up if real users still hit the Advanced mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:46:27 +00:00
Michael	48cd9e8249	feat(pdf): schema v2 + mode field + v1 in-memory migration Bumps ``SCHEMA_VERSION`` from 1 to 2 to add a top-level ``mode`` field distinguishing ``row_heuristic`` (new default) from ``column_visual`` (legacy). The schema bump is real — old code that defaults missing keys would silently mis-extract — so we do it the careful way: - ``new_template`` now returns mode=``row_heuristic`` with the full row-heuristic config tree pre-populated. The legacy column-visual fields are still seeded with empty defaults so switching modes in the GUI doesn't require runtime key insertion. - ``validate_template`` is mode-aware: row_heuristic templates must have a valid ``amounts.shape`` + sane ``row_detection.min/max_amounts_per_row``; column_visual templates keep the existing column/target requirements. - ``load_template`` accepts both v1 and v2 files (``_LOAD_SUPPORTED_VERSIONS = {1, 2}``). v1 files get ``mode="column_visual"`` injected and ``schema_version`` bumped IN MEMORY ONLY — disk file stays v1 until the user explicitly re-saves. A buggy migration can't silently corrupt their template library. - ``save_template`` continues to write the current schema; saving a v1 template through the GUI naturally upgrades it. Mode + shape constants exported (``VALID_MODES``, ``VALID_AMOUNT_SHAPES``) so the GUI dropdowns can derive their options from the source of truth. Tests split into ``TestValidateTemplateRowHeuristic`` (6) + ``TestValidateTemplateColumnVisual`` (4) + ``TestV1Migration`` (1). All 29 template tests pass; the original column-mode tests that previously implicitly relied on schema_version=1 keep working because new_template's seeded column fields are still present in row_heuristic templates (just not validated as required). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:46:10 +00:00
Michael	d80befd05a	feat(pdf): row-heuristic extraction (mode dispatch, no coordinates) User reported the column-visual approach is too brittle for real bank statements: column-x-positions saved against a sample page don't survive layout drift between months (statement A has columns at x=300, statement B drifted to x=320), and a saved template can only realistically work for one statement's specific render. The fundamental fix is to stop depending on coordinates at all. Row-heuristic mode finds transaction rows by pattern: any line with a date token + N amount tokens IS a transaction. Date patterns (US slash / EU slash / ISO / "Jan 15, 2026" / etc.) and amount patterns (currency, parens-negative, thousands grouping) are matched against word text — no x-positions involved. The full pipeline: 1. ``find_transaction_rows`` clusters words into rows and scans each line for date + amount tokens. 2. Multi-line descriptions still attach to the previous row via the no-date-no-amount continuation rule. 3. Amount shapes drive interpretation: ``single`` / ``txn_balance`` / ``debit_credit`` / ``debit_credit_balance``. 4. ``_infer_amount_column_centers`` clusters amount x-midpoints ACROSS ALL detected rows to find natural column groupings — so debit-vs-credit assignment for single-amount lines works without the user marking anything on screen. ``apply_template`` is now a dispatch over ``template["mode"]``: - ``mode="row_heuristic"`` (default for new templates) — the new pipeline. - ``mode="column_visual"`` — the existing pipeline, kept under ``_apply_template_column_visual`` for v1 templates and the Advanced fallback. 18 new tests cover: date detection (US slash, two-digit year, ISO, month-name, missing); amount-token finding (currency, parens, pure text, bare-year rejection); column-center inference (clear two-column case, empty input); end-to-end on synthetic Page objects with all four amount shapes; the critical layout-drift test that proves the same template works on pages of different sizes / different absolute x-positions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:45:55 +00:00
Michael	10015c40e1	fix(pdf): shim image_to_url for drawable-canvas on modern Streamlit User hit ``AttributeError: module 'streamlit.elements.image' has no attribute 'image_to_url'`` on first PDF import. Root cause: ``streamlit-drawable-canvas`` 0.9.3 (last upstream release 2023) calls a Streamlit internal that was relocated in Streamlit ~1.30+. The function moved from ``streamlit.elements.image`` to ``streamlit.elements.lib.image_utils`` AND its signature changed — the second positional argument is now a ``LayoutConfig`` dataclass instead of a plain ``int`` width. Three remedies considered: 1. Downgrade Streamlit. Reverses unrelated improvements + security fixes; not on the table. 2. Fork drawable-canvas. The maintenance hit isn't worth it for a one-line internal API change. 3. Ship a compatibility shim. Re-attach a wrapper at the old import path that adapts the old call shape to the new function. This is the standard workaround the wider Streamlit community has converged on for this exact regression. ``src/gui/_drawable_canvas_compat.py`` does (3). The ``install()`` helper is idempotent, opt-in (not auto-run at module import — a grep for ``_install_canvas_compat`` shows every call site), and no-ops if Streamlit hasn't moved the function OR if the new function isn't where we expect (lets the canvas surface a real error rather than papering over a different bug). The page calls ``_install_canvas_compat()`` once at module top before any ``st_canvas`` invocation; Streamlit's script-rerun model means this fires every page load but the ``_PATCHED`` guard makes re-runs free. The shim wraps the old ``width=int`` arg into a default-constructed ``LayoutConfig()`` — the old ``width=-1`` sentinel meant "use the image's natural width", which is also what an unconfigured LayoutConfig produces. Confirmed by inspecting Streamlit 1.57.0's ``image_utils.py``. 4 new tests pin the shim contract: - ``install()`` attaches ``image_to_url`` to the old path on modern Streamlit - Idempotent — calling twice doesn't double-wrap - Doesn't clobber a future Streamlit that restores the original at the old path - Translates ``(image, -1, False, "RGB", "PNG", "id")`` into a proper call to the new function with a ``LayoutConfig`` instance If a future Streamlit upgrade moves ``image_to_url`` AGAIN, the shim's silent-no-op fallback means the canvas error surfaces again and points at where to look. The shim doesn't paper over mysteries; it only patches the one specific relocation we know about. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:29:20 +00:00
Michael	e6ee2e3481	feat(pdf): robust Tesseract discovery + OS-aware install copy User tried ``brew install tesseract`` in PowerShell after seeing all three OSes listed inline in the OCR banner — easy mistake when the install commands are crammed on one line with ``·`` separators. Two changes pre-empt this: OS-aware OCR banner. The expander now detects the user's platform via ``platform.system()`` and shows only the relevant install instructions: - Windows: UB-Mannheim installer link, numbered steps, explicit "keep the Add to PATH checkbox on" callout, plus a fallback paragraph telling the user how to set ``DATATOOLS_TESSERACT_PATH`` if they already installed without PATH and don't want to reinstall. - macOS: ``brew install tesseract`` with a Homebrew link. - Linux: ``apt install tesseract-ocr`` with a "or your distro's equivalent" hedge. Robust binary discovery in ``ocr_available()``. Three-stage: 1. Honor ``DATATOOLS_TESSERACT_PATH`` env var if set — explicit override for portable installs or non-default locations. 2. Try ``pytesseract``'s default PATH-based lookup. 3. If PATH lookup fails, probe known Windows install paths (``C:\Program Files\Tesseract-OCR\tesseract.exe``, the x86 variant, and ``%LOCALAPPDATA%\Programs\Tesseract-OCR\``) via the new ``_autodetect_tesseract_path``. On hit, set ``pytesseract.pytesseract.tesseract_cmd`` so all subsequent ``image_to_data`` calls use the same binary without re-discovering. This means a user who runs the UB-Mannheim installer with default options but forgets the PATH checkbox will still get OCR working after a launcher restart, without env-var gymnastics. Tests (4 new, 85 total in the suite): - Auto-detect returns None on non-Windows (no false positives on dev laptops). - Auto-detect finds the binary at a mocked ``C:\Program Files\Tesseract-OCR\tesseract.exe``. - Auto-detect returns None when no candidate exists. - ``DATATOOLS_TESSERACT_PATH`` env var beats both PATH lookup and auto-detect (sets ``tesseract_cmd`` even when the path doesn't resolve, so a real binary at a custom location works). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:15:00 +00:00
Michael	538e23d219	build(pdf): bundle PDF deps in installers + pin versions + smoke tests Three changes prepare the next tagged release so end users get the PDF Extractor without ever touching pip. Exact-pin the new deps (``requirements.txt``): pdfplumber==0.11.9 pypdfium2==5.8.0 pytesseract==0.3.13 streamlit-drawable-canvas==0.9.3 Tight pins are the right call for these because the GUI's visual-picker geometry + the parsing-pipeline word positions depend on stable internal behavior — a quiet upstream tweak to ``extract_words`` or ``page.render`` would re-break the tool on the next CI build. Bumping requires a deliberate edit + a CI run, not a transient ``pip install`` resolving to whatever ``setup.py`` pulled. Existing deps stay on their current ``>=X.Y,<X+1`` ranges; the user's "tight pin" concern is specifically about the PDF stack. Wire the new deps into the PyInstaller bundle (``build/``): - ``datatools.spec`` — add ``collect_submodules`` for pdfplumber, pdfminer, pypdfium2, streamlit_drawable_canvas, PIL, pytesseract; add ``collect_data_files`` for pypdfium2 (PDFium native ``.dll``/``.so``/``.dylib``), streamlit_drawable_canvas (frontend JS bundle), pdfminer (Adobe CMap tables). - ``hooks/hook-pypdfium2.py`` — belt-and-braces hook that uses ``collect_dynamic_libs`` to force-include the PDFium binary. Without this the visual picker silently fails on installed builds with a ``FileNotFoundError`` for the shared library. - ``hooks/hook-streamlit_drawable_canvas.py`` — collects the built JS frontend so the canvas iframe loads under the bundled Streamlit server instead of rendering blank. Tesseract is intentionally NOT bundled (option A from the design discussion). Modern bank statements are text-based; bundling Tesseract would ~triple installer size for a long-tail case. The in-app banner directs users to install it from ``UB-Mannheim/tesseract`` if they need OCR. Decision is captured in the ``project-pdf-installer-pending`` memory note. Smoke tests (``tests/test_pdf_extract_smoke.py``, 17 tests) add the layer above the pure unit tests: - ``TestDependencyImports`` — each dep imports cleanly - ``TestRealPdfRoundTrip`` — generates a tiny statement PDF in memory with ``fpdf2`` (test-only dep in ``requirements-dev.txt``), runs ``extract_pages`` + ``apply_template``, asserts 3 rows out with the right signed amounts. Catches "the build succeeded but pdfplumber breaks at runtime." - ``TestRenderPageImage`` — exercises ``pypdfium2.render`` so the hook-bundled native lib gets a real call. This is the most common installer-bug signature (missing .dll) and the test catches it before users do. - ``TestPdfDependencyMissing`` — monkeypatches ``__import__`` to simulate a stripped install; confirms the typed exception + actionable hint round-trip. - ``TestPinnedVersionsMatchInstalled`` — parametrized over all four pinned dists; uses ``importlib.metadata`` rather than ``__version__`` because pypdfium2 doesn't expose it directly. Trips if someone bumps the pin without reinstalling. - ``TestOcrAvailability`` — confirms ``ocr_available()`` returns ``(bool, str)`` and ``extract_pages_auto(allow_ocr=False)`` skips OCR cleanly. All 81 PDF + audit tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:10:43 +00:00
Michael	2d927bc95f	fix(pdf): graceful fallback when PDF dependencies aren't installed User hit a hard ImportError on opening the PDF→CSV tool because ``pip install -r requirements.txt`` hadn't picked up the new ``pdfplumber`` / ``pypdfium2`` lines yet. Streamlit surfaces that as an unfiltered traceback — friendlier to show a clear install-required panel inside the tool instead. Two changes: 1. ``src/pdf_extract.py`` lazy-imports the PDF deps via ``_require_pdfplumber()`` / ``_require_pdfium()`` helpers that raise a new ``PdfDependencyMissing`` (subclass of ImportError) with an actionable ``hint`` field. Pure helpers (``parse_amount``, ``parse_date``, ``cluster_rows``, etc.) keep working with no PDF dep installed — useful for tests and for keeping module-import paths cheap. 2. The tool page probes both deps at render time via ``_pdf_deps_status()``; if anything's missing it shows a ``st.error`` panel with the exact pip command and a "restart the launcher" reminder, then ``st.stop()``s before touching any PDF code path. The page itself loads cleanly without the deps installed, so the sidebar nav doesn't 500 — the user just sees the install panel on click. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:59:20 +00:00
Michael	967d3f6a11	feat(pdf): OCR availability banner + per-run toggle Phase 6/6. Final polish layer on top of the OCR pipeline that ``extract_pages_auto`` has carried since commit 1. - OCR status banner at the top of the page next to the mode selector. Ready: a one-liner caption confirming OCR will run on scanned pages. Unavailable: a collapsed expander explaining the missing piece (``pytesseract`` binding vs. Tesseract binary) with install pointers for Windows, macOS, and Linux. The expander explicitly notes that modern text-based bank statements don't need OCR — most users will never expand it. - "Use OCR for scanned pages" toggle in Extract mode, defaulting to the runtime availability. Disabled (greyed out) when Tesseract isn't usable, so the user can't accidentally set themselves up for confusing warnings. Passes through as ``allow_ocr`` to ``extract_pages_auto``. - Build mode's sample-loading path continues to call ``extract_pages_auto(..., allow_ocr=True)`` — sample preview always uses OCR if available, since the user is actively diagnosing template fit. No schema change. OCR's structural support is in commits 1 + 3; this commit just makes it discoverable + opt-out. Rolling up the 6-commit feature: `b8aff86` Phase 1 — pure pdf_extract module + tests `aea520d` Phase 2 — template storage layer + tests `2f349e8` Phase 3 — Extract/Build/Manage page + nav + i18n `5a8e2ec` Phase 4 — batch polish (ZIP, sort, status block) `b86828d` Phase 5 — visual region picker (drawable canvas) THIS Phase 6 — OCR banner + toggle Each commit is independently revertable; rolling all the way back to ``c16e2a5`` is ``git revert `b86828d` `5a8e2ec` `2f349e8` `aea520d` `b8aff86` <this>`` (or just ``git reset --hard c16e2a5`` on a clean branch). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:54:11 +00:00
Michael	b86828d791	feat(pdf): visual region picker on rendered sample page Phase 5/6. Adds a "Visual picker" tab as the first stop in the template-build flow. The sample PDF page is rasterized with ``pypdfium2`` (capped at ~900px wide for sensible display), and ``streamlit-drawable-canvas`` overlays drawing tools on top. UX: - Line mode — drag short (roughly vertical) strokes where you want columns to split. Each stroke's x-midpoint becomes one boundary in PDF point coordinates. - Rect mode — drag a rectangle around the transactions table; bbox is preserved on the template as ``visual.table_bbox`` for round-trip, future use as a hard crop region. - Transform mode — move/resize already-drawn shapes after the fact. Round-trip: re-entering Build mode with an existing template seeds the canvas with full-height vertical lines for every boundary already on the template, plus the saved bbox if any, so editing-after-save matches the user's mental model. Coordinate translation: the canvas reports pixel positions; we divide by the renderer's pixels-per-PDF-point scale to get back to PDF coordinates that ``apply_template`` already expects. No template-schema change required — the boundaries the picker writes are the same list the text-input editor wrote in commit 3, just sourced visually. New helper in the extraction module: - ``render_page_image(pdf_bytes, page_no, target_width=900)`` — rasterize a single 1-indexed page to a PIL image; returns ``(image, scale)`` for coordinate translation. The text-input boundary editor in the Columns tab remains as a fallback for power users / keyboard-only workflows and for copy-paste from spreadsheet-derived x-positions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:52:54 +00:00
Michael	5a8e2ec9e1	feat(pdf): batch extract polish — ZIP output, sort-by-date, status block Phase 4/6. Polishes the batch workflow shipped in commit 3: - st.status progress block replaces the simple progress bar. Each file appears as its own line as it's processed; the block auto-collapses on completion with a "12/13 extracted" summary and turns red if any file errored. - Sort combined output by date checkbox (default ON) sorts the merged CSV ascending by date, with source_file as a stable secondary sort so multiple statements interleave by date but same-day rows from the same file stay together. - ZIP-of-per-PDF-CSVs output option alongside the combined CSV. When the accountant has 12 statements from 12 different account periods and wants to feed them into 12 separate ledger imports, the ZIP keeps each file's rows in its own CSV named after the original PDF stem. - Per-file summary table gets a ``status`` column ("ok" / "no rows" / "error: ExceptionName") so error grouping is obvious at a glance — already present from commit 3, now upgraded with the status field. Cancellation is intentionally not added — Streamlit's single- thread rerun model has no clean way to interrupt a tool-run mid-stream without architectural changes to extraction. If a user mis-fires Extract on 50 PDFs they can refresh the browser tab; the task will be killed when the next interaction comes in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:51:05 +00:00
Michael	2f349e8191	feat(pdf): tool page with Extract / Build / Manage modes Phase 3/6. Wires the PDF Extractor into the GUI as a new "transformations" tool with three modes selected by a horizontal radio at the top of the page: Extract — pick a saved template, upload one or more statement PDFs (single + batch shipping together to keep the common case one-step), get a previewed DataFrame + CSV download. Per-file row counts and warnings are surfaced; failures on one file don't kill the whole batch. The combined CSV gets a ``source_file`` first column so the accountant can sort/filter by statement. Build template — load an existing template or start fresh, upload a sample PDF, edit every schema field across four tabs (Pages & table / Columns / Parsing / Save). A live preview below re-runs ``apply_template`` against the sample on each re-render so the user sees their changes hit rows immediately. The column- boundary editor is text-input ("comma-separated x-positions") for now — replaced by the drawable-canvas visual picker in commit 5. Manage templates — list with rename / delete / export (downloads the canonical JSON) / import (uploads someone else's JSON, validated through ``template_from_json``). Heavy work (``extract_pages_auto``) only runs on explicit user action (Extract / a new sample upload), and the parsed Page list is cached in ``st.session_state`` so widget-edit reruns don't re-parse the PDF. Logging: tool runs and template saves both hit the audit log via ``log_event("tool_run", …)``, matching every other tool's instrumentation pattern. Registered in ``tools_registry.py`` under ``transformations`` with status ``Ready`` and the picture-as-pdf Material icon. i18n keys added for en + es ("PDF to CSV" / "PDF a CSV"). OCR is wired in this commit — ``extract_pages_auto`` already falls back through ``pytesseract`` when the binary is available, and the warning strings it returns surface as ``st.info`` / ``st.warning`` per-file. Commit 6 will polish the OCR UX with a status row. Next commits build on this page: 4 — batch progress + cancellation + per-file error grouping 5 — drawable-canvas visual picker replaces text x-positions 6 — OCR availability banner + scanned-page indicators Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:49:44 +00:00
Michael	aea520d2f7	feat(pdf): template storage layer (load/save/list/import/export) Phase 2/6. Persists "how to read this bank's statements" as JSON files under ``~/.datatools/pdf_templates/<slug>.json`` so an accountant can build one template per source and reuse it across every statement that follows the same layout. Public API: - ``new_template(name)`` — blank with sensible defaults - ``save_template(t)`` — validate + atomic write (temp + rename) - ``load_template(slug)`` / ``delete_template(slug)`` - ``list_templates()`` — sorted summaries, skips corrupt files - ``template_to_json`` / ``template_from_json`` — portability - ``validate_template(t)`` — returns (ok, errors) list for GUI Schema is documented in the module docstring. Versioned via ``schema_version: 1`` so future fields don't break saved files silently — ``load_template`` refuses unknown versions instead of limping along with missing keys. Validation contract enforces: - non-empty name + slug (lowercase alphanumeric + hyphens) - at least two output columns - at least one column mapped to ``date`` - either one ``amount`` column OR both ``amount_debit`` + ``amount_credit`` - column boundary count consistent with source-column count Storage is atomic: ``_atomic_write`` goes through a temp file + ``os.replace`` so a crashed save can't leave a half-written JSON at the canonical path. The GUI's build flow saves on most visual-picker changes, so this matters more here than for a "save button" workflow. 24 tests cover slugify, defaults, validation branches, round-trip load/save, missing/corrupt file handling, delete, list (incl. skipping corrupt files), atomic-write rollback, and import/export. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:46:44 +00:00
Michael	b8aff862ed	feat(pdf): add pure PDF→DataFrame extraction module Phase 1/6 of the PDF Extractor tool. Pure module — no Streamlit, no user-config I/O — that turns a PDF blob plus a template dict into a ``pandas.DataFrame`` of transaction rows. Primary use case is accountant-style extraction of bank-statement transactions, where each bank's format is encoded as a reusable template. Pipeline: 1. ``extract_pages(pdf_bytes)`` reads with pdfplumber and surfaces words with bounding boxes. 2. ``cluster_rows(words)`` groups words into rows by ``top`` tolerance — no reliance on PDF table-line detection (most bank statements have no visible cell borders). 3. ``assign_columns(row_words, boundaries)`` buckets each word by its horizontal midpoint into N+1 columns defined by N interior x-boundaries. 4. ``_within_table_window`` slices to the band between the header line and the end-marker (e.g. "Closing balance"). 5. ``apply_template`` orchestrates the above, handling: - parens-style negative amounts, currency stripping, custom decimal/thousands separators - separate debit + credit columns combined into a single signed ``amount`` (credit positive, debit negative — accounting register convention; matches QuickBooks/Xero imports) - multi-line description wrapping (rows with empty date column attach to the previous row's description) - row-level regex skip filters (e.g., "Total", "Subtotal") - page-range filters ("all", "2-", "1,3-5") Optional OCR fallback for scanned statements: - ``page_has_extractable_text`` heuristic flags pages with <5 words as likely-scanned. - ``ocr_available()`` checks both the ``pytesseract`` Python binding and the Tesseract binary; surfaces a clear reason string when either is missing. - ``extract_pages_auto`` does text-first, OCR-the-blanks, and returns warnings the UI can surface. 29 unit tests cover the parsing pipeline against synthetic WordBox/Page data — no fixture PDFs required, runs in 0.1s. Real PDF extraction is exercised by hand on the user's statements. Dependencies added: - ``pdfplumber>=0.10,<1`` — text + position extraction - ``pypdfium2>=4,<6`` — page rasterization for OCR + visual picker - ``streamlit-drawable-canvas>=0.9,<1`` — visual region picker (used in commit 5) - ``pytesseract>=0.3,<1`` — OCR (used in commit 6; system Tesseract binary required separately) - ``cryptography>=41,<49`` — bumped upper bound; pdfminer.six transitively requires a recent release. Internal ed25519 license-signing usage is API-stable across the bump. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:44:51 +00:00
Michael	c16e2a5e29	feat(audit): surface log path + /logs link in Help popover Adds a "Log file" section to the sticky-footer Help popover with two affordances: 1. The current audit-log path rendered as monospace text with ``user-select: all`` so a single click selects the whole path for copy-paste into a file manager. Works on every platform — no subprocess required. 2. A "View all logs →" link to the new ``/logs`` page (added in the previous commit) for download/inspection of today's and prior days' files. i18n keys ``footer.help_logs_label`` + ``footer.help_logs_link`` added to en + es packs, matching the existing ``footer.help_*`` naming. ``audit_log_path()`` is wrapped in try/except because a broken audit module MUST NOT take the footer down — falls back to "—". Same defensive pattern the license section uses. Rollback: ``git revert HEAD`` removes the section; the popover and its layout return to the prior shape with zero coupling to the audit module. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 21:26:53 +00:00
Michael	7c9139f199	feat(audit): /logs page — view + download recent audit log files Adds a Streamlit page at ``/logs`` listing every ``datatools-*.jsonl`` file in ``audit_log_dir()`` (7-day window per the retention sweep in `b3ae913`). Each entry shows filename, mtime, byte size, and a ``st.download_button``. Today's file gets its own section at the top. The page also surfaces both paths as copyable monospace text: the active log path (so users can grep/cat it directly on their machine) and the folder path (so they can paste into Explorer / Finder). Wired into navigation via ``st.Page("pages/_Logs.py", ...)`` with ``url_path="logs"``. The sidebar entry is hidden by the same ``hide_streamlit_chrome`` CSS rule that hides ``/activate`` and ``/close`` — same pattern, same ``:has()`` + plain-fallback selectors so the LinkContainer collapses cleanly in modern browsers and the anchor is at least un-clickable in older ones. License gate is OFF for this page (``gate_license=False``) — if a user's license expires they may need logs to file a support request; locking them out of their own audit history would be hostile. Next commit will wire the popover link. Rollback: ``git revert HEAD`` removes the page and its nav entry; the audit log itself keeps working. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 21:24:46 +00:00
Michael	b3ae913bb9	feat(audit): daily filename + 7-day retention sweep Replaces the per-session ``datatools-<ts>-<sid>.jsonl`` filename with a single daily file ``datatools-YYYY-MM-DD.jsonl`` (local date). Sessions on the same calendar day share a file via the writer thread's per-batch open+append; multiple DataTools instances running concurrently on the same day fan into the same file (append-mode small writes are atomic on POSIX, safe-enough on Windows under realistic load). Drops the ``_LOG_PATH`` module global and the lock around it — ``audit_log_path()`` is now pure date math, recomputed on every call so a session that crosses midnight follows the rollover into the next day's file. Adds ``_sweep_old_logs()`` invoked once per process at writer- thread start. Deletes any ``datatools-*.jsonl`` whose mtime is older than 7 days. The glob deliberately matches the legacy per-session filename too, so users upgrading from the previous build don't keep a permanent backlog of pre-retention files. Event ``ts`` fields stay UTC; only the filename uses local date, because users go looking for "today's log" on their wall clock. Tests cover: daily filename shape, sweep removes stale files, sweep keeps fresh files, sweep also clears legacy filenames. Rollback: ``git revert HEAD`` restores the per-session filename and removes the sweep. No data migration needed either way — existing files keep working as JSONL. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 21:22:47 +00:00
Michael	ba07dcb6c7	feat(audit): re-enable audit log (kill switch off by default) Phase 1 diagnostic build validated end-to-end on the user's machine: session cf2ebbd5 (2026-05-19) produced session/upload/analyze/nav/ session-end events with no blank-pages regression. Root cause of the original symptom was the audit_log_path/_session_id deadlock fixed in `a8ff8f4` — the kill switch is no longer load-bearing. Flips ``_DISABLED: True`` → ``False`` so the default install writes a log. The three env-var overrides (``DATATOOLS_AUDIT_ENABLED``, ``DATATOOLS_AUDIT_TRACE``, ``DATATOOLS_AUDIT_PROBE``) and the writer- thread BaseException guard from `76c9f5a` stay in place as escape hatches if the symptom ever recurs. TestKillSwitchContract continues to pass — it monkeypatches ``_DISABLED = True`` explicitly and doesn't rely on the module default. Rollback: ``git revert HEAD`` flips the switch back without removing the diagnostic instrumentation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 17:50:28 +00:00
Michael	76c9f5a679	feat(audit): diagnostic instrumentation env vars + writer-thread guard Phase 1 of the audit-log re-enablement plan. Adds three opt-in env vars that let us ship one instrumented build for the user to run, without flipping the kill switch on for everybody. Default behaviour is byte-identical to today: with no env vars set the kill switch wins, no writer thread starts, no file is written, no stderr line is printed. Env vars (do NOT set in prod): - ``DATATOOLS_AUDIT_ENABLED=1`` — bypass ``_DISABLED`` for one session. ``_DISABLED = True`` stays in the source so an upgrade with no env var is still safe. - ``DATATOOLS_AUDIT_TRACE=1`` — print ``[audit] ...`` lines to stderr at module import, every writer-thread state change, and every producer entry point. Lets the user share a small log instead of attaching a debugger. - ``DATATOOLS_AUDIT_PROBE=<value>`` — bisect the producer path for Phase 2. Values: ``full`` (default), ``noop``, ``no-events``, ``no-page-open``, ``no-session-start``. The named variants return early from the corresponding ``log_*`` function so we can isolate which call is implicated in the blank-pages symptom. Also: - ``_writer_loop`` gets an outer ``try/except BaseException`` so silent thread death now surfaces a ``"writer thread died: ..."`` line in the launcher terminal instead of looking like a hang. - Existing first-write-failure stderr print gets ``flush=True`` so the user actually sees it before the process is killed. - Test fixture switches from the previous-commit ``_DISABLED = False`` override to ``_ENABLE_OVERRIDE = True`` so tests exercise the same bypass path the diagnostic build uses. - Two new tests pin the safety contract: with the kill switch on and no override, every producer is a true no-op (no writer thread, no file). And ``DATATOOLS_AUDIT_PROBE=no-events`` bypasses ``log_event`` even when the override is on — guards the bisect. Rollback: ``git revert HEAD`` removes Phase 1 cleanly. The deadlock fix from the previous commit stays in place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:46:27 +00:00
Michael	a8ff8f4bd0	fix(audit): break audit_log_path/_session_id deadlock Pre-existing latent bug since `d9e32e5`: ``audit_log_path()`` acquires the non-reentrant ``_LOCK`` and, while holding it, calls ``_session_id()`` which also takes ``_LOCK``. On a clean module state (both ``_LOG_PATH`` and ``_SESSION_ID`` unset) the first caller deadlocks. ``log_session_start`` triggers it in practice — it's the first GUI call after import and the ``log_file=str(audit_log_path())`` arg is evaluated before any ``log_event`` has had a chance to lazy-init the session id. Strong candidate contributor to the blank-pages symptom the kill switch was put back to mask: the writer thread (and any producer reaching ``audit_log_path``) would freeze forever, and Ctrl+C would not free the GIL — matches the launcher-can't-be-killed behaviour reported in `1caedbb`. Fix: resolve the session id BEFORE acquiring ``_LOCK`` in ``audit_log_path``. ``_session_id`` already double-checks under its own lock, so the call is safe and self-synchronising. Test fixture in ``tests/test_audit.py`` now bypasses the kill switch via ``monkeypatch.setattr(audit, "_DISABLED", False)`` — env vars are captured at import time and ``monkeypatch.setenv`` won't reach the module-level flag. With the fix in place, all 6 tests pass in 0.15s; without it, ``test_session_start_renders`` (and any test exercising the log_session_start path) hangs indefinitely. Kill switch behaviour is unchanged in production (`_DISABLED = True` in the shipped module); this is purely a correctness fix for the code path that gets exercised when the switch is off. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:45:08 +00:00
Michael	4451f74895	fix(layout): bump bottom block-container padding 4rem → 7rem Last lines on long tool pages were still grazing the fixed Help/Close footer when scrolled all the way down. 4rem gave the cursor of free space the footer claims but no breathing room — the bottom button or text was visually flush against the footer's top edge. 7rem buys ~3rem of clear space on every page so the last content row reads without obstruction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 02:32:13 +00:00
Michael	a022059b1e	chore: drop accidentally-tracked scratch screenshot	2026-05-19 02:30:01 +00:00
Michael	69240fc922	fix(home,close): tool-link preserves file context + drop close-page explanation (1) ``[Tool] →`` action links inside per-file finding rows now preserve the file that the card belongs to. Previously the home page re-set ``home_uploaded_*`` to the FIRST imported file on every rerun — so when a user with multiple imports clicked ``Clean Text →`` on file_B's findings card, the tool page loaded file_A. The click handler in ``_render_finding_row_v2`` now looks the file up in ``home_uploads`` by the findings-card filename and writes ``home_uploaded_name / size / bytes`` BEFORE ``st.switch_page``, so the tool's ``pickup_or_upload`` reads the right context. The filename threads through ``render_findings_panel(..., header=)`` → ``_render_finding_row_v2(..., filename=)``; ``header`` is already the filename today, so no call-site change needed. (2) Close screen "explanation" removed. The long browser-restriction hint paragraph (``quit.close_hint``: "Browsers don't let JavaScript close a tab you opened yourself …") is gone from the farewell overlay — the auto-dismiss path lands the user on about:blank within ~1.5s of the close click, so the explanation never had a chance to be useful. ``autoDismiss`` simplified to "try close, else redirect" without the hint-surface step. The i18n key is retained as a no-op in case the hint comes back. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 02:29:49 +00:00
Michael	9a7d861903	fix(ui): bottom padding + close-screen button removed + sidebar collapse + quiet loguru Four issues batched together since they all touch the GUI shell: - ``stMainBlockContainer``'s ``padding-bottom`` bumped from 0.75rem → 4rem (~one button-height of free space above the fixed Help/Close footer). The last line of content on a page that fills the viewport was previously sitting flush against the footer's top border. - Farewell overlay's "Close this window" button removed per UX request. The auto-dismiss path is now the only flow: try programmatic close (works in Chrome/Edge ``--app`` windows); failing that, surface the hint and redirect the parent window to ``about:blank`` after a short timeout. Previously the user had to click the button to get the same fallback. The ``quit.close_window_button`` i18n key is retained as a no-op for now in case the button comes back; nothing references it. - Sidebar collapse → expand was broken: clicking « collapsed the sidebar but the » expand-back affordance was invisible. Two causes pulled apart: 1. ``.dt-brand { flex: 1 }`` was eating the entire ``stSidebarHeader`` width, squeezing Streamlit's ``stSidebarCollapseButton`` off the right edge. Changed to ``margin: 0 auto 0 0`` so the brand keeps its natural width and the chevron has room to live next to it. 2. The "hide Streamlit chrome" toolbar block was listing ``stToolbar`` and ``stToolbarActions`` for ``display: none`` — but the post-collapse re-open button (``stExpandSidebarButton``) lives inside ``stToolbar``, so hiding the container killed the button too. Dropped both container testids from the hide list and kept the per-icon rules for ``stMainMenu`` / ``stAppDeployButton`` / ``stStatusWidget`` / ``stDecoration``. - Loguru's stderr sink quieted in GUI mode. ``src/gui/app.py`` now runs ``logger.remove()`` + ``logger.add(sys.stderr, level="ERROR", …)`` at the top so internal ``logger.debug`` / ``logger.warning`` breadcrumbs (e.g. ``standardize_dataframe: 7/31 cells were unparseable``) no longer print to the terminal when the user runs ``python -m src.gui``. CLI entry points already do the same configuration per-script. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 02:21:41 +00:00
Michael	1016a4d2c4	feat(home,sidebar): brand hero + sidebar = footer style + PNG icon Bundles a handful of UX cleanups: - Findings-card chevron moved to the LEFT side of the head. CSS still rotates it 90° between collapsed/expanded states. - Tool-link buttons in findings rows (``Clean Text →`` etc.) are now left-justified against the icon column with minimal surrounding whitespace. Action column ratio dropped from 1.8 → 1.4 and the button switched from ``width="stretch"`` (centered text) to ``width="content"`` (shrinks to fit, left-aligned within column). - Home-page hero now mirrors the sidebar brand block: 56px ink "D" chip on the left + "UNALOGIX" eyebrow stacked above "DataTools" wordmark, then the "Clean. Normalize. Transform." tagline beneath. New ``.dt-page-brand / -row / -words / -mark / -eyebrow / -wordmark`` rules in ``_DESIGN_TOKENS_CSS``. Streamlit wraps h1 elements in an emotion-cache div with extra padding; a descendant flattener (``.dt-page-brand-words *`` margin:0 / padding:0) keeps the eyebrow + wordmark stack the same height as the chip so they center-align cleanly. - Sidebar nav restyled to match the sticky-footer Help/Close buttons exactly: 13px / 500 / 1.3 line-height, 5×10px padding, 8px gap between icon and label, transparent background. Active item gets the same ``rgba(0,0,0,0.04)`` tint as the hover state (no white pill, no shadow), only the heavier weight + ink text distinguishes it. - OS app icon (page_icon) switched from SVG to a Pillow-rendered ``datatools_icon_256.png`` so Windows / macOS taskbar+dock pick it up reliably (some OS shells fall back to a default icon for SVG favicons). Rounded-square ink ground with cream "D" centered — same mark as the sidebar chip + hero chip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 02:04:53 +00:00
Michael	6c3939d21b	feat(brand): "Letter D (sans)" app icon — favicon + sidebar chip Implements ``Business/DataTools/app_icons.html`` §03 "Letter D (sans)" as the canonical app mark. - New ``src/gui/assets/datatools_icon.svg`` — 64×64 SVG, 14px corner radius, ink ground (#1c1917), cream "D" (#fef4ed) in Geist 700 / -0.04em tracking. Pure SVG so it renders sharp at every favicon size; font stack falls back through Geist → system sans where the webfont isn't installed (favicons can't load Google Fonts). - ``_home.py``, ``_Activate.py``, ``99_Close.py``: page_icon now resolves the SVG path via ``Path(__file__).parent / "assets" / "datatools_icon.svg"`` instead of the broom 🧹 / 🔑 / 🛑 emojis. Streamlit inlines it as a ``data:image/svg+xml;base64,...`` link tag so the browser tab + OS app-icon for ``python -m src.gui`` matches the sidebar chip. - Sidebar ``.dt-brand-mark`` tightened to match the spec's "Letter D (sans)" rendering: ``font-weight: 700`` and ``letter-spacing: -0.04em`` (was 600 / -0.02em). The on-screen chip is now a scaled-up copy of the OS icon. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 01:50:18 +00:00
Michael	d436e34a45	feat(brand): rebrand to UNALOGIX DataTools + Clean. Normalize. Transform. User-facing copy + brand updates landed together: - Page H1 + browser-tab title: "DataTools — Data Cleaning Mastery" → "UNALOGIX DataTools". Same change in es.json (was "DataTools — Maestría en limpieza de datos"). - Hero subtitle: long descriptive caption replaced with the tagline "Clean. Normalize. Transform." (es: "Limpia. Normaliza. Transforma."). - Sidebar brand block: wordmark is now two lines — UNALOGIX in tiny uppercase tracked eyebrow style on top, DataTools in the 15px semibold wordmark beneath. The 28px "D" chip stays as the recognizable mark. New ``.dt-brand-eyebrow`` rule in ``_DESIGN_TOKENS_CSS``. Top-right Streamlit chrome cleanup — the user reported two stacked icon buttons. ``.streamlit/config.toml`` bumped to ``toolbarMode = "viewer"`` (most aggressive — suppresses status indicator + deploy button + running glyph). CSS belt-and-suspenders hides ``stToolbar``, ``stToolbarActions``, ``stStatusWidget``, ``stDecoration`` for newer Streamlit releases that keep emitting these with inline styles even under toolbarMode=viewer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 01:45:38 +00:00
Michael	0bb72ecd7e	feat(home,sidebar): brand block + collapsible findings + many polish tweaks Batch of UX tweaks the user asked for in quick succession: - Sidebar brand block (mockup §brand) — 28px ink chip with a "D" wordmark plus the "DataTools" text — injected into ``stSidebarHeader`` by a small JS bundled into the iframe-mounted script that already runs from ``hide_streamlit_chrome``. The Streamlit ``stLogoSpacer`` is hidden when the brand block is present so it sits flush at the top of the sidebar. - Findings cards are now collapsible. Each file's card head carries ``data-dt-collapsed="true"`` on first render; clicking the head flips the attribute via the new ``_WIRE_COLLAPSIBLE_FINDINGS_JS`` (MutationObserver re-wires after reruns). A CSS rule ``[stElementContainer]:has(.dt-finding-group-head[data-dt-collapsed ="true"]) ~ *`` hides every later sibling of the head's element container — covers both ``stLayoutWrapper`` (the columns rows in this Streamlit release) and ``stElementContainer`` so the rule survives future Streamlit layout renames. A chevron icon (``chevron_right``) rotates 90° when expanded. The head itself gets ``cursor: pointer`` + an accent-fill hover. - Tool-link buttons in finding rows dropped the leading ``Open`` — now read ``Clean Text →``, ``Standardize Formats →`` etc. - Finding-row column order: action is now LEFT of the description, matching user feedback (``[icon] [Tool →] [description + meta]``). - Head padding bumped to ``16px 22px`` so the filename has visible breathing room from the card's left edge (previously the mono filename felt like it was bleeding into the rounded corner). - Head margin-bottom bumped to 1.5rem for breathing room before the first finding row when expanded; collapsed state tucks the head flush against the card bottom with full ``--r-lg`` corner radius and no visible bottom border. - Files card row layout: ``✕`` button moved to the LEFT of the filename (``[✕] [chip + filename] [size]``). - Sidebar nav rows tightened: link padding 7px → 4px, line-height 1.25, 1px margin-bottom per li, section-header padding-top reduced. Plus a new ``--gap: 0.25rem`` rule for vertical blocks inside bordered containers so the Files card and findings card body have denser inter-row spacing. - Sidebar Language selector restyled: widget labels render as the spec's "Eyebrow" row (11.5px / 500 / 0.08em uppercase, tertiary ink), selectbox combobox gets a paper surface + soft border that matches the rest of the sidebar chrome. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 01:40:22 +00:00
Michael	74d0ee270f	chore(home): remove "Export report" button The disabled "Export report" placeholder is gone — it wasn't tied to a real feature and was just noise in the action bar. Action bar is back to two buttons (Run analysis · Clear results) on a 1:1:4 column split. ``upload.export_report`` keys removed from en + es i18n packs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 01:17:43 +00:00
Michael	06f1ea6cf7	fix(buttons,footer): unify disabled state + restyle Help/Close as nav links (3) Disabled primary buttons no longer read as a "whited-out" dark slab. Streamlit's primary-button selector ``button[data-testid="stBaseButton-primary"]`` has the same specificity as our previous ``button:disabled`` selector, so the primary background + cream text kept winning the cascade tie-break. The disabled rule's selector list now explicitly matches both the ``kind="primary"``/``kind="secondary"`` shapes AND the ``stBaseButton-primary``/``-secondary`` testids, so disabled buttons collapse to ``surface-hover`` background, ``ink-tertiary`` label, soft border — same look regardless of starting kind. A follow-up rule re-asserts ``color: var(--ink-tertiary)`` on every descendant of the disabled primary so the inner ``stMarkdownContainer > p`` doesn't keep the cream label from the "all descendants get --bg" primary rule. (4) The sticky-footer Help + Close buttons now match the sidebar nav-item look. Old outlined-pill chrome is gone: ``.datatools-footer-btn`` is now display:inline-flex with a Material-Symbols ligature icon + label, borderless, ``ink-secondary`` text on a transparent surface, ``rgba(0,0,0,0.04)`` hover background. The Close button keeps a danger tint via ``.close`` so it still reads as the shut-down action, with a soft ``--danger-fill`` hover. Help uses the ``help_outline`` icon, Close uses ``power_settings_new``. Built via a small ``makeFooterBtn`` helper in the iframe JS that appends the icon span + label text node to the button — keeps the existing soft-nav click handlers intact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 01:12:03 +00:00
Michael	784695e3a7	fix(home,findings): reclaim top whitespace + add padding under finding head Two visual cleanups: 1. The block-container "claim padding" rule was a no-op — it targets the legacy ``stAppViewBlockContainer`` testid; Streamlit renamed it to ``stMainBlockContainer`` in the current release. Updated the selector list to match both, so the page title now sits close to the top edge again (~0.5rem from the hidden header) instead of inheriting Streamlit's default ~6rem header reservation. 2. ``.dt-finding-group-head`` margin tightened to ``margin: -1rem -1rem 0.75rem``: -1rem on top/sides still bleeds the head to the card edges, but +0.75rem on the bottom is breathing room between the head's bottom border and the first finding row, which were abutting before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 01:04:42 +00:00
Michael	4816da1ad6	fix(home): show file sizes in KB/MB/GB, never raw bytes Per-row file sizes and the Files-card total-size meta both read as human-readable units now. Smallest unit is KB even for sub-kilobyte files (so ``538 B`` → ``0.5 KB``, ``4914 B`` → ``4.8 KB``), steps up to MB at 1 MiB and GB at 1 GiB. Always one decimal place. New module-level helper ``_format_size(int) -> str`` in ``_home.py``; both the section meta (``1 file · 4.8 KB total``) and the per-row ``dt-file-size`` cell call it instead of the previous ad-hoc ``f"{n:,} B"`` formatter. Keeps the display consistent regardless of file size — and keeps the GUI free of raw byte counts that nobody needs to read. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 00:59:56 +00:00
Michael	6703e2c15c	feat(home): in-card "+ Add more files" replaces Streamlit's dropzone Mockup §file-add lands as the canonical import affordance: - Streamlit's ``st.file_uploader`` widget is still mounted (only path that actually receives browser file events), but parked off-screen via a new ``[data-testid="stFileUploader"] { position:absolute; left:-10000px; … pointer-events:none }`` rule. Its hidden ``<input type="file">`` stays reachable to JavaScript. - The Files card is now always rendered (header + bordered body). The bottom row of the card is a ``button.dt-file-add`` styled per mockup §file-add: dashed top border bleeding to the card edges, surface-hover background, ``+ Add more files`` text in ``--ink-secondary``, accent-fill on hover. - A small ``<script>`` shipped through ``st.iframe`` wires the button: ``click → input.click()`` on the off-screen ``stFileUploaderDropzoneInput``. Streamlit's HTML sanitizer strips inline ``onclick`` from ``unsafe_allow_html`` content, so the binding has to come from a real script element — same pattern the sticky footer and Upload→Import rewriter use. A ``MutationObserver`` re-wires the button when Streamlit remounts it across reruns. The ``dataset.dtWired`` guard prevents double binding. Section structure also tightened to match the mockup: - Section heading is now ``<h2>Files</h2>`` (was ``### Import one or more files to start``) with the count + total size on the right of the same flex row. When no files: ``No files imported yet``. When files exist: ``1 file · 4.8 KB total``. - Dropped the ``upload.intro_multi`` caption and the ``upload.empty_state`` info banner — the card itself plus the in-card Add button cover both prompts. - Empty state now ends after the Files card (no stats / no action bar / no findings rendered) — matches mockup's single-section empty view. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 00:56:11 +00:00
Michael	a9788ba712	feat(ui): page header + files card + action bar + findings cards (mockup 2) Closes the remaining gaps between the live home page and the ``datatools_layout_redesign2.html`` mockup. Four pieces land together because they all consume the same new CSS scaffold: 1. Page header (§page-header) ``st.title`` + ``st.caption`` + ``st.divider`` collapse into one flex header: h1 + body subtitle on the left, ``Runs 100% locally`` privacy pill (success-fill + lock SVG) on the right, soft border below. The "Runs 100% locally" phrase moved out of ``home.caption`` into the new ``home.privacy_pill`` i18n key (en + es). 2. Files card (§files-card) The "Imported files" list is now a single bordered card with a section head (count + KB total on the right, mockup §section-head). Each row renders a 28px accent-fill chip carrying the inline document SVG, a mono filename, a right-aligned mono size, and a compact ``✕`` button. The word-button ``Remove`` is gone — replaced by an icon-only tertiary button styled via a new CSS rule that goes transparent → danger-fill on hover (mockup §file-remove). 3. Action bar (§action-bar) Three buttons in one row: ``Run analysis`` (primary ink), a new disabled ``Export report`` (secondary; coming soon, tooltip), and ``Clear results``. New i18n key ``upload.export_report``. 4. Findings — per-file group cards (§finding-group) ``render_findings_panel`` rewritten end-to-end. Output is now: • A head row (``dt-finding-group-head``) bleeding to the card edges: worst-severity dot · mono filename · count pills enumerating non-zero severities (e.g. ``2 info`` blue, ``1 warning`` amber, ``1 error`` rose). • A flat list of finding rows sorted error → warn → info. Each row: tinted Material-icon chip + title (description with optional ``<code>`` column chip) + mono meta line (rows affected, samples captured) + tertiary ``Open <Tool> →`` action button that ``st.switch_page``s to the relevant tool. The previous tool-grouped expander stack is dropped — the new layout is denser and matches the mockup's single-card-per-file structure. ``_render_one_finding`` (the old per-finding helper that emitted markdown lines + sample tables) remains in the file but is no longer called from the home flow; left in place for any other surface that still depends on the markdown style. The "no issues" success state renders a green dot + mono filename + ``no issues`` success pill in the same card chrome, so empty-result files visually match the rest of the panel rather than getting a generic ``st.success`` callout. CSS additions (``_DESIGN_TOKENS_CSS``): ``.dt-page-header / .dt-page-subtitle / .dt-privacy-pill`` ``.dt-files-section-head / .dt-section-meta`` ``.dt-file-row / .dt-file-icon-chip / .dt-file-name / .dt-file-size`` ``.dt-finding-group-head / .dt-severity-dot{.warn,.info,.error,.success}`` ``.dt-group-filename / .dt-group-counts`` ``.dt-count-pill{.warn,.info,.error,.success}`` ``.dt-finding-row / .dt-finding-icon{.warn,.info,.error}`` ``.dt-finding-title / .dt-finding-meta`` Tertiary button rule (transparent → danger-fill on hover) for the X button and the ``Open Tool →`` row action. theme.py: Explicitly loads Material Symbols Outlined alongside Geist — the severity-chip ligatures (``info`` / ``warning`` / ``error``) need the font present even when no ``:material/`` token has been emitted yet on the page. Tightened ``.dt-finding-icon .dt-mui`` selector with ``[data-testid="stMarkdownContainer"]``-scoped variant so the Material font wins over theme.py's base ``var(--font-sans) !important`` on markdown descendants. Leading section-heading emojis stripped from i18n (``upload.heading``) for parity with the mockup's clean ``Files`` / ``Findings`` h2s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 00:43:42 +00:00
Michael	da7d86f457	feat(ui): Material icons in sidebar + stats overview on home Two pieces of the mockup 2 layout that hadn't landed yet: 1. Sidebar nav icons — emoji glyphs (🧹 ✂️ 🔍 …) swapped for Streamlit's ``:material/<name>:`` syntax, picking the outline Material Symbol that best matches each mockup SVG: Home → :material/home: Fix Missing Values → :material/help_outline: Find Unusual Vals → :material/insights: Clean Text → :material/text_format: Standardize Fmts → :material/format_list_bulleted: Find Duplicates → :material/search: Quality Check → :material/check_circle: Map Columns → :material/view_column: Combine Files → :material/account_tree: Auto Workflows → :material/auto_awesome: Activate → :material/key: Close → :material/close: Streamlit injects the icon name as a literal ligature inside a first-child ``<span>`` of the nav anchor, expected to render through the Material Symbols font. theme.py's base rule was forcing Geist on every span under ``stSidebarNav``, turning the ligatures back into plain text labels — added a structural exception that targets ``[data-testid="stSidebarNavLink"] > span:first-child`` (and any descendant), restoring the Material font family, neutralizing the inherited ``ss01/cv01/cv11`` feature settings, and sizing to 18px. Also stripped the leading emojis from every page title in the en/es i18n packs (``home.title``, ``close_page.title``, ``activation.title``, ``tools.*.page_title``) — the icons live in the sidebar now, the page H1 no longer needs to carry one. 2. Stats overview on home — new ``_render_stats_overview`` in _home.py emits a 4-card grid above the per-file findings panels: Files analyzed, Total findings, Warnings (severity ``warn`` ∪ ``error``), Info (severity ``info``). Card layout follows the mockup §stats verbatim — Geist 28px / 600 / -0.03em for the numeric value (the "Display number" row in spec §4), tiny uppercase tracked label, paper-surface card with the standard warm border + faint shadow. The Warnings / Info cards tint the number with ``--warn`` / ``--info`` when the count is non-zero. CSS for ``.dt-stats / .dt-stat / .dt-stat-label / .dt-stat-value / .dt-stat-unit`` added to ``_DESIGN_TOKENS_CSS``; falls to a 2-column grid below 900px viewport, matching the mockup's media query. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 00:31:40 +00:00
Michael	2501119ac2	feat(ui): replace Fraunces with Geist per geist_spec.md Switches the type system to the single-family Geist spec referenced in ``Business/DataTools/geist_spec.md`` and the matching ``datatools_layout_redesign2.html`` mockup. Editorial-serif headings are out; the product now reads as modern SaaS-tool typography per the spec's positioning note (§10). src/gui/theme.py (new) Implements geist_spec.md §3 verbatim — preconnect + Google Fonts link for Geist (400/500/600/700) and Geist Mono (400/500), the canonical ``:root`` token table (§7) plus severity extensions, and the type scale (§4): h1 32/600/-0.035em, h2 22/600/-0.025em, h3 18/500/-0.018em, h4 15/500/-0.012em, body 14/400, caption 12.5/400, mono 0.92× ss02. ``apply_theme()`` is the single entry point. Two deviations from the spec, both anticipated by spec §6.1: - ``font-family: var(--font-sans) !important`` on the base rule. Streamlit applies ``font-family: "Source Sans"`` directly to ``[data-testid="stMarkdownContainer"]`` and a few widget wrappers at equal-or-higher specificity than the spec's selector list, so plain inheritance loses the cascade. - The base selector list explicitly enumerates ``stSidebarNav``, ``stMarkdownContainer``, ``stVerticalBlock`` and a few siblings so Streamlit's per-widget font reset doesn't reach descendant text. src/gui/components/_legacy.py - ``_DESIGN_TOKENS_CSS`` no longer redeclares fonts or the heading rules — those are theme.py's job (spec §9 says the spec is type-only; everything below is component chrome). - Token references switched from ``--dt-`` to the spec names (``--ink``, ``--bg``, ``--surface``, ``--border``, ``--accent``, ``--font-sans``, ``--font-mono``, …). - Sidebar section-label rule tightened to 11.5px / 500 to match the "Eyebrow" row in spec §4. - Primary-button text color now also targets every descendant (``button[kind="primary"] ``) so the inner ``stMarkdownContainer > p`` doesn't pick up ``color: var(--ink)`` from the base rule and render near-invisible ink-on-ink. - ``hide_streamlit_chrome`` now calls ``apply_theme`` before injecting component CSS so the base tokens are defined first. Acceptance criteria from spec §8 verified at 1920×1050: - h1 computes ``font-family: Geist``, ``font-weight: 600``, ``letter-spacing: -1.12px`` (= 32px × -0.035em), size ``32px``. - Body ``<p>`` inside ``stMarkdownContainer``: Geist 400 / 14px. - Caption: Geist 400 / 12.5px. - Inline mono filenames: Geist Mono in accent-fill chip. - No Source Sans Pro leaks into any text the user reads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 00:21:52 +00:00
Michael	444dffbc63	chore(ui): rename Upload → Import in user-facing strings DataTools is local-first — "Upload" reads like "send data somewhere remote", which contradicts the product positioning. Sweep replaces the user-visible term throughout the UI: - ``src/i18n/packs/en.json`` + ``es.json``: all ``upload.`` strings (heading, intro, uploader labels, empty state, switch-back, etc.) and ``gate.default_name``. The ``intro_multi`` "no upload anywhere" phrasing dropped the verb entirely — now reads "nothing leaves this computer". - All 9 tool pages: ``st.file_uploader(label="Upload …")`` → ``"Import …"``; matching ``st.info("Upload a …")`` empty-state banners; ``help="Upload …"`` strings on disabled uploaders. - ``9_Pipeline_Runner`` + ``5_Column_Mapper``: radio-option text ``"Upload schema/pipeline JSON"`` → ``"Import …"`` plus the ``.startswith("Upload")`` branch guards that read those values. - ``_home.py``: "Uploaded files" → "Imported files". - ``app_demo.py``: "Uploaded file is …" → "Imported file is …". Internal identifiers left untouched: function names (``pickup_or_upload``, ``_StashedUpload``), session-state keys (``home_upload``, ``home_uploads``, ``home_uploaded_``, ``merger_file_upload``), audit-log event category (``"upload"``), Streamlit testid CSS selectors. None of those are visible to the user. The file_uploader's dropzone button text is a baked-in React literal that Streamlit's ``label=`` doesn't reach; rewritten at the DOM level with a small ``_RENAME_UPLOAD_BUTTON_JS`` snippet shipped through ``st.iframe`` (same pattern the sticky footer uses to mount on ``<body>``). A ``MutationObserver`` on the parent document re- applies the swap when Streamlit remounts the dropzone after file add/remove or page navigation, throttled via ``requestAnimationFrame``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 23:48:31 +00:00
Michael	3c4b80895e	fix(home): hide Streamlit's chip row, keep only the canonical file list After upload, two near-identical file lists were shown stacked: Streamlit's built-in compact chip row inside the dropzone (icon + ``messy_sales.csv`` + size) and the home page's own "Uploaded files" section beneath it (filename + Remove button). User flagged the duplication. Hide ``[data-testid="stFileChip"]`` and its first-child wrapper so the chip row collapses; the dropzone's borderless ``+`` button is preserved as the "add more files" affordance, and our "Uploaded files" list is now the single source of truth visually. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 23:42:22 +00:00
Michael	b0ee65e922	feat(ui): warm editorial redesign — Fraunces + Geist + stone palette Lifts ideas from the ``datatools_layout_redesign.html`` mockup (artistic licence, not literal). Two changes: 1. ``.streamlit/config.toml`` ``[theme]`` block — cream paper bg (#fafaf7), warm sidebar (#f5f4ef), stone ink (#1c1917), burnt orange primary (#c2410c). Streamlit threads these through its chrome (focus rings, file-uploader accents, link colors). 2. ``_DESIGN_TOKENS_CSS`` injected by ``hide_streamlit_chrome`` on every page. Imports Fraunces (display serif), Geist (body sans), Geist Mono. Restyles, scoped through ``--dt-*`` custom properties: - Page surface + sidebar — warm cream backgrounds, soft warm borders, no harsh white. - Sidebar nav — section labels in tiny uppercase tracking, nav items with soft hover, active item as a white pill with subtle shadow. - Typography — H1/H2/H3 in Fraunces with tightened tracking; body Geist; inline code Geist Mono with orange-on-cream chip. - Buttons — primary = dark ink (``#1c1917``) with white text; secondary = paper surface with warm border; disabled = muted cream. - Containers / expanders — editorial cards: 14px radius, 1px warm border, faint shadow, warm-cream summary headers. - File uploader — cream dropzone with dashed border + per-file paper chips. - Alerts — soft tinted fills (info=sky, success=mint, warn=amber, error=rose) over the kind-specific palette. - Inputs, tabs, dataframes — paper surfaces with rounded warm borders. Verified at 1920x1050 + 1400x900 on home page (empty + with file uploaded + with findings rendered) and Clean Text tool page; no regressions in the white-bar fix from `65b663b`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 23:36:24 +00:00
Michael	65b663be97	fix(footer): stretch .stApp + sidebar + main to compensate for zoom User screenshot pinned the actual culprit: a horizontal white band across the FULL viewport width (including over the sidebar) above the Help/Close footer. Diagnosis: - ``.stApp`` carries ``zoom: 0.85``, so any descendant sized at ``100vh`` only renders at ~85vh visually. - At 1920x1050 the visual end of ``.stApp`` is around y=893; the fixed footer overlays y=1017..1050; the strip in between (124px at this resolution) is ``body`` painting white through, because ``.stApp``, ``stSidebar`` and ``stMain`` are all shorter than the viewport. - The previous "min-height: 100vh/0.85" rule targeted the legacy ``data-testid="stAppViewBlockContainer"``. The current Streamlit release renamed that testid to ``stMainBlockContainer`` — so the rule was a no-op for months. Verified the new testid by walking the live DOM. Fix: stretch ``.stApp``, ``[data-testid="stSidebar"]`` and ``[data-testid="stMain"]`` with ``min-height: calc(100vh / 0.85)`` so they fill the visible viewport. Keep the block-container's 2rem ``padding-bottom`` (now matching both the new and legacy testids in case Streamlit rolls it back). Verified at 1920x1050: sidebar gray extends to y=1050, content area extends to y=1050, footer overlays the bottom 33px, no white band between content and footer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 23:22:11 +00:00
Michael	c942b8aa19	fix(footer): offset sticky-footer's left edge past the sidebar The "white bar" was the footer's near-white background painting over the bottom of the sidebar. The footer is fixed at body level with ``left: 0; right: 0`` so it spans the full viewport — its ``rgba(255, 255, 255, 0.97)`` background renders as essentially white over the sidebar's ``rgb(240, 242, 246)`` gray, producing a visibly different strip at the bottom of the sidebar (this is what the diagnostic GREEN tint marked as ``stAppViewContainer``-shaped because that is the element directly behind it). Pixel-sampled the bottom row to confirm: y=860 over sidebar → (240, 242, 246) (gray) y=870 over sidebar → (255, 255, 255) (footer-painted white) Fix: in the iframe JS that mounts the footer on ``<body>``, measure ``[data-testid="stSidebar"].getBoundingClientRect().right`` and set the footer's (and help popover's) ``left`` to that offset with ``setProperty(..., 'important')`` so it beats the ``left:0!important`` fallback in CSS. A ``ResizeObserver`` on the sidebar plus a ``window.resize`` listener keep the offset in sync when the sidebar collapses or expands. Sidebar collapsed (width 0 or off-screen) clamps to 0 → footer goes flush-left as before. Also dropped the no-op ``min-height`` on the view container from the previous attempt; ``stAppViewContainer`` is transparent, so stretching it never painted anything. Verified by injecting the same offset on the live page: bottom row at y=890 is now ``(240,242,246)`` over the sidebar and only turns white at x=255 where the content area begins. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 22:52:02 +00:00
Michael	61e63913cb	chore: migrate use_container_width → width (Streamlit deprecation) ``use_container_width`` is being removed after 2025-12-31. Streamlit log was flooding the terminal with the deprecation notice on every rerun. Mechanical sweep: use_container_width=True → width="stretch" use_container_width=False → width="content" 51 call sites across 11 page files + ``app_demo.py``. Also renamed the ``local_download_button`` helper's ``use_container_width`` kwarg to ``width`` (default ``"stretch"``); it has no external callers passing the old name, so this is a safe rename. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 22:43:52 +00:00
Michael	e011c0b6e6	fix(footer): close white gap by stretching stAppViewContainer Color-tag diagnostic confirmed the bottom-of-viewport strip was painted by ``stAppViewContainer`` (it showed GREEN), not by the block container as the previous two attempts assumed. ``.stApp`` has ``zoom: 0.85`` so 100vh visually renders at 85% — apply ``min-height: calc(100vh / 0.85)`` to the view container itself so it spans the full visible viewport and there is no gap for its own background to leak through as a "white bar". Reverts the diagnostic tints (RED/BLUE/GREEN/GOLD); keeps the 2rem block-container padding-bottom that reserves room for the fixed footer overlay. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 22:36:41 +00:00
Michael	2fe324279e	diag(footer): color-tag every candidate bottom-area container Option 2 (stretching the block container with ``min-height``) did not close the white gap. Either the rule isn't applying, or the block container isn't the element that fills the visible bottom of the page. Tint every plausible container so the eye can tell us instantly which one paints the bar: - RED ``stAppViewBlockContainer`` (still has min-height applied) - BLUE ``stMain`` / ``section[stMain]`` (with its own min-height) - GREEN ``stAppViewContainer`` - GOLD ``.stApp`` (zoomed) User reload + report which color shows where the "white bar" previously was — that names the target. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 22:33:19 +00:00
Michael	04dc326020	fix(footer): stretch block container to full viewport to close white gap Option 1 (tightening ``padding-bottom`` from 3rem to 2rem) did not eliminate the gap. The remaining gap is ``.stApp``'s solid white background showing through the area below the block container's natural (content-sized) bottom edge — visible because the home page's content is shorter than the viewport. Stretch the block container with ``min-height: calc(100vh / 0.85)`` so the container itself fills the visible viewport. Now the area between the last finding card and the fixed footer is the block container's own background, not ``.stApp`` showing through — visually continuous with the content above. The ``/0.85`` compensates for ``.stApp { zoom: 0.85 }`` (defined in ``_HIDE_CHROME_CSS``): inside a zoomed container, ``100vh`` renders at 85% of true viewport height, leaving a 15% gap if used raw. ``box-sizing: border-box`` keeps the 2rem padding part of the total height instead of stacking onto it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 22:30:22 +00:00
Michael	d487a44170	fix(footer): tighten block-container ``padding-bottom`` to close white gap Diagnostics confirmed the "white bar" the user has been describing is not a separate element — it's ``[data-testid=stApp]``'s solid white background (``rgb(255,255,255)``, viewport-locked) showing through the gap between where page content ends and where the fixed Help/Close footer overlay begins. ``stApp`` stays put while content scrolls inside it, which is why the bar "doesn't change when scrolling". The gap exists because ``render_sticky_footer`` overrides the block container's ``padding-bottom`` to ``3rem`` (48px) to reserve clear room for the fixed footer. The footer is only ~32-33px tall (min- height 32px + 0.25rem top/bottom padding), so ~16px of that reserve was pure visible white space sitting above the buttons. Reduce ``padding-bottom`` to ``2rem`` (~32px) — just enough to prevent content from rendering under the footer overlay, no more. Eliminates the visible gap without exposing text to clipping. Also remove the diagnostic banner + click-to-inspect iframe from the home page now that the bar is identified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 22:28:17 +00:00

1 2 3 4 5

243 Commits