datatools-dev

Author	SHA1	Message	Date
Michael	450d4fc9a8	feat(pdf): default output date format to YYYY-MM-DD User asked to flip the default from YYYYMMDD to YYYY-MM-DD. ISO is the better default for an accountant CSV workflow: - Lexicographic sort = chronological sort (no parsing needed). - Every spreadsheet tool the user might import into recognises it as a real date with no ambiguity (US vs EU readers can't disagree on the order). - Hyphens make the year/month/day boundaries scan-able by eye. Concrete changes: - New module constant ``DEFAULT_DATE_FORMAT = "%Y-%m-%d"``, used as the default for ``format_date()`` and the ``output_date_format`` keyword on ``scan_pdf_for_transactions``. - Page's ``_DATE_FORMAT_CHOICES`` reordered so the ISO entry is first (index 0 = default Streamlit selection); YYYYMMDD drops to second. - Custom-strftime input default also flips to ``%Y-%m-%d``. Tests updated to reflect the new default (``test_dates_formatted_iso_by_default``, ``test_short_dates_get_year_from_period``, ``test_compact_format_round_trip``, plus a new ``test_default_is_iso`` for the format_date helper). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 02:04:34 +00:00
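A minimal sketch of the new default, assuming ``format_date`` takes an ISO ``YYYY-MM-DD`` string plus an optional strftime format and hands unparseable input back unchanged. The function body is illustrative, not the project's code:

```python
from datetime import datetime

DEFAULT_DATE_FORMAT = "%Y-%m-%d"  # constant name taken from the commit message


def format_date(iso_str, fmt=DEFAULT_DATE_FORMAT):
    """Convert an ISO date string to `fmt`; return the input as-is if it can't be parsed."""
    try:
        return datetime.strptime(iso_str, "%Y-%m-%d").strftime(fmt)
    except (TypeError, ValueError):
        # Pass-through keeps the raw text visible instead of producing an empty cell.
        return iso_str


print(format_date("2025-01-13"))            # ISO by default
print(format_date("2025-01-13", "%Y%m%d"))  # the previous compact default
```

Because ISO strings sort lexicographically in date order, ``sorted()`` on the formatted column needs no parsing.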
Michael	a0042d4aba	feat(pdf): Dec/Jan-aware year inference + filename hint + override Previous year inference picked ``period_end_iso[:4]`` for every short date, which fails on statements that cross the Dec/Jan boundary. A "12/30" row in a 2024-12-16 to 2025-01-15 statement got 2025-12-30 (wrong) instead of 2024-12-30. New cascade for ``_infer_year_for_short_date``: 1. ``override_year``: caller supplies it (new ``"Override year for short dates"`` field in Scan options). Beats every heuristic. Empty by default; the page validates the value is a 4-digit-looking integer in 1900-2100 and falls back to automatic on garbage input. 2. Statement period start + end: the function now takes BOTH dates and generates candidates with every distinct year in the period (one year for same-year statements, two for Dec/Jan boundaries). The picker scores each candidate by distance from the period: candidates inside the period score 0, candidates outside score ``min(|days from start|, |days from end|)``. Lowest-distance candidate wins. So: - ``12/30`` + period 2024-12-16 to 2025-01-15 → 2024-12-30 (inside period, score 0) - ``01/05`` + same period → 2025-01-05 (inside, score 0) - ``12/15`` + same period → 2024-12-15 (1 day before, closer than 2025-12-15 which is 11 months after) 3. ``filename_year_hint``: fallback when the statement period regex misses the bank's specific layout. The page passes ``year_from_filename(upload.name)`` automatically so files like ``eStmt_2025-01-13.pdf`` get year 2025 even if the PDF's text doesn't yield a parseable period. The regex matches the first ``20XX`` token bounded by non-digits. Both new helpers (``year_from_filename`` and the new ``_try_short_date_with_year`` factor-out) are exported and tested.
16 new tests cover: within-period inference (same-year sanity), Dec/Jan boundary cases for both sides, the just-before-period closer-distance case, override priority, filename fallback, no-signal None, dash-format / month-name shorthand round-trip, garbage input, filename year extraction (eStmt pattern, embedded, first-match-wins, no-match, empty). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:59:30 +00:00
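The cascade above can be sketched roughly as follows. ``infer_short_date`` and its signature are hypothetical stand-ins for the real ``_infer_year_for_short_date``, and the filename regex is one plausible reading of "first ``20XX`` token bounded by non-digits":

```python
import re
from datetime import date


def year_from_filename(name):
    """First 20XX token that is not part of a longer digit run, or None."""
    match = re.search(r"(?<!\d)(20\d{2})(?!\d)", name or "")
    return int(match.group(1)) if match else None


def infer_short_date(month, day, period_start=None, period_end=None,
                     override_year=None, filename_year_hint=None):
    """Bind a year to a month/day pair: override > statement period > filename hint."""

    def build(year):
        try:
            return date(year, month, day)
        except ValueError:  # e.g. Feb 29 in a non-leap year
            return None

    if override_year is not None:
        return build(override_year)

    if period_start and period_end:
        years = range(period_start.year, period_end.year + 1)
        candidates = [d for d in map(build, years) if d is not None]

        def distance(d):
            if period_start <= d <= period_end:
                return 0  # inside the statement period
            return min(abs((d - period_start).days), abs((d - period_end).days))

        if candidates:
            return min(candidates, key=distance)

    if filename_year_hint is not None:
        return build(filename_year_hint)
    return None  # no signal at all


start, end = date(2024, 12, 16), date(2025, 1, 15)
print(infer_short_date(12, 30, start, end))  # lands in 2024
print(infer_short_date(1, 5, start, end))    # lands in 2025
```

Scoring by distance rather than picking the period-end year is what lets a row dated one day before the period start resolve to the earlier year.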
Michael	a18b126885	fix(pdf): stamp scan timestamp once; restores Saved-to-path banner After swapping to ``html_download_button`` the user noticed the "✓ Saved to <path>" + 📂 Open Downloads folder pair never appeared. The helper itself is fine — every other tool shows those affordances correctly. Bug was specific to the PDF page. The download button's file_name was being computed with a fresh ``datetime.now().strftime(...)`` on every render. The helper builds its session-state keys from ``f"_dl_btn_{file_name}_{digest}"`` so the keys silently drift every second. After the click and rerun, the helper looks up the saved_key for the NEW file_name, finds nothing in session_state (the click had written to the OLD key), and skips the success banner. Fix: stamp the timestamp once when scan completes, store it in ``K_TIMESTAMP``, and reuse it for the download filename. The filename stays stable across reruns, so the helper's keys are stable, so the saved-path banner renders correctly on the post- click rerun. Also clear ``K_TIMESTAMP`` on Clear-all-files so a new scan gets a fresh stamp. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:50:22 +00:00
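The key drift can be reproduced without Streamlit. In this sketch a plain dict stands in for ``st.session_state``; the ``K_TIMESTAMP`` string value, the digest computation, and the ``transactions_`` filename prefix are assumptions. Only the ``_dl_btn_{file_name}_{digest}`` key shape comes from the commit message:

```python
import hashlib
from datetime import datetime

K_TIMESTAMP = "pdf_scan_timestamp"  # hypothetical session-state key string


def helper_key(file_name, payload):
    """Key shape used by the download helper; the digest recipe is an assumption."""
    digest = hashlib.sha256(payload).hexdigest()[:8]
    return f"_dl_btn_{file_name}_{digest}"


def on_scan_complete(state, now):
    """Stamp the timestamp exactly once, when the scan finishes."""
    state[K_TIMESTAMP] = now.strftime("%Y%m%d_%H%M%S")


def download_name(state):
    """Reuse the stored stamp so every rerun produces the same filename."""
    return f"transactions_{state[K_TIMESTAMP]}.csv"


state = {}
on_scan_complete(state, datetime(2026, 5, 20, 1, 50, 22))
first_render = helper_key(download_name(state), b"csv-bytes")
second_render = helper_key(download_name(state), b"csv-bytes")
print(first_render == second_render)  # stable across reruns
```

Calling ``datetime.now()`` inside ``download_name`` instead would give a different key whenever the two renders straddle a second boundary, which is the failure described above.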
Michael	981a1a9cba	fix(downloads): OneDrive-aware Downloads path + PDF uses html_download_button User reported downloads "do nothing on click" in tool pages and "acts like it downloads but no file in the folder" in the PDF tool. Two root causes, two fixes. Root cause #1 — wrong Downloads folder on Windows. ``_downloads_dir()`` returned ``Path.home() / "Downloads"`` unconditionally. On Windows machines with OneDrive enabled (very common for business users), the real Downloads folder is redirected to ``C:\Users\<u>\OneDrive\Downloads``. Our helper would write to ``C:\Users\<u>\Downloads`` instead — a folder that may not even exist until ``mkdir`` creates it — and the user, naturally opening their actual OneDrive Downloads, sees no file and concludes nothing happened. Now: on Windows, ``_downloads_dir`` queries the registry key ``Software\Microsoft\Windows\CurrentVersion\Explorer\User Shell Folders`` for FOLDERID_Downloads (GUID ``{374DE290-123F-4565-9164-39C4925E467B}``). This entry returns the redirected path when OneDrive is active, the original ``%USERPROFILE%\Downloads`` otherwise — exactly what the user's File Explorer reads. ``%USERPROFILE%`` expansion is applied via ``os.path.expandvars``. Any registry hiccup falls through to ``Path.home() / "Downloads"`` so the helper never raises. The sanity check (path exists OR parent exists) catches the edge case where the registry points into a deleted OneDrive mount. Root cause #2 — PDF page used st.download_button. Every other tool uses the project's ``html_download_button`` helper (which is ``local_download_button`` under the hood — the rename happened in `b9147f3`). ``st.download_button`` has a long-standing bug where the second-or-later instance in a script pass silently fails to fire. The PDF tool predated the rewrite that switched everyone over and was still using the broken native widget. ``_Logs.py`` had the same problem in two places. Swapped all three call sites to ``html_download_button``. 
They now save to ``~/Downloads/<filename>`` (correctly resolved per fix #1) and show the saved path + "Open Downloads folder" button below the click, matching every other tool in the suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:45:51 +00:00
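A hedged sketch of the registry lookup described for root cause #1. The ``HKEY_CURRENT_USER`` hive is an assumption (the message names only the subkey), and ``downloads_dir`` is a stand-in name for the project's ``_downloads_dir``:

```python
import os
import sys
from pathlib import Path

_SHELL_FOLDERS = r"Software\Microsoft\Windows\CurrentVersion\Explorer\User Shell Folders"
_DOWNLOADS_GUID = "{374DE290-123F-4565-9164-39C4925E467B}"


def downloads_dir():
    """Resolve the user's real Downloads folder; never raises."""
    fallback = Path.home() / "Downloads"
    if sys.platform != "win32":
        return fallback
    try:
        import winreg  # Windows-only stdlib module, imported lazily

        # Assumption: the per-user hive holds the (possibly redirected) path.
        with winreg.OpenKey(winreg.HKEY_CURRENT_USER, _SHELL_FOLDERS) as key:
            raw, _value_type = winreg.QueryValueEx(key, _DOWNLOADS_GUID)
        candidate = Path(os.path.expandvars(raw))  # expands %USERPROFILE%
        # Sanity check: reject a path that points into a deleted OneDrive mount.
        if candidate.exists() or candidate.parent.exists():
            return candidate
    except Exception:
        pass  # any registry hiccup falls through to the default
    return fallback


print(downloads_dir())
```

On macOS and Linux the function returns ``~/Downloads`` without touching ``winreg``, so it is safe to call on every platform.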
Michael	dbcf4d4048	feat(pdf): adopt Home-page Files-card layout User wants the PDF page's upload UX to match the Home page exactly — Files section header + bordered card containing the file rows AND the "Add more files" button at the bottom, no visible Streamlit file_uploader competing for attention. Layout changes mirroring ``src/gui/_home.py``: - ``st.file_uploader`` is positioned off-screen via CSS (``position:absolute;left:-10000px;…``). The underlying ``<input type=file>`` stays reachable to JS so the in-card "Add more files" button can programmatically click it. - ``<h2>Files</h2>`` section header with ``N files · X.X MB total`` meta on the right, identical markup (``dt-files-section-head``). - Single ``st.container(border=True)`` hosts every file row (``✕ \| 📄 filename \| size``, using ``dt-file-row`` / ``dt-file-icon-chip`` / ``dt-file-name`` / ``dt-file-size`` classes) AND the "Add more files" button (``dt-file-add``) at the bottom. All classes are already defined globally in ``_legacy.py`` so no new CSS. - The Add button click is wired to the off-screen uploader's ``stFileUploaderDropzoneInput`` via a 30-line iframe script, identical to the Home page's pattern. A ``MutationObserver`` re-wires after Streamlit reruns when the button gets re-mounted. Action buttons (Scan + Clear all) sit BELOW the Files card, side-by-side in a `[1, 1, 4]` column split with ``use_container_width=True`` so they fill their cells cleanly without stretching across the whole row. Both buttons are disabled when no files are uploaded — the empty Files card is its own affordance for the empty state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:34:31 +00:00
Michael	34b56b404a	fix(pdf): drop statement_period_start/end columns from output User asked to remove them — the two columns repeated the same value on every row from a given statement, took up screen space in the editor, and offered limited value once the date column already carries the inferred full date. What's kept: - ``account_number`` — still stamped onto every row so multi- statement CSVs are self-attributing - ``extract_statement_metadata`` — still runs every scan because ``period_end`` is the source of the year inference that binds Chase-style short ``01/13`` dates to ``20250113`` - ``_extract_statement_period`` and its tests — period detection itself isn't going anywhere, just its appearance in the output rows What's removed: - ``record["statement_period_start"]`` / ``record["statement_period_end"]`` assignments in ``scan_pdf_for_transactions`` - The two columns from the page's column-ordering setup - Tests pinning their presence; replaced with assertions that they're explicitly absent Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:28:32 +00:00
Michael	ad7c22d7fb	fix(pdf): consistent 2-decimal amount precision in display and CSV User reported amounts losing trailing zeros (4.50 rendering as 4.5, 1000.00 as 1000) on the same statement. Classic float display issue: a Python float carries no trailing zeros, so ``repr(4.50)`` is ``'4.5'``, and pandas / Streamlit happily show that inconsistency cell-by-cell. Two layers of fix, internal type stays ``float`` for arithmetic: Display. ``st.column_config.NumberColumn(format="%.2f")`` applied programmatically to every ``amount_`` column on the data_editor. Every numeric amount now shows with exactly two decimal places regardless of trailing zeros. CSV export. Pandas' default float-to-CSV writer also drops trailing zeros (the same issue an accountant would see when opening the file in Excel). Before serialising, each amount column is mapped through the new ``format_amount`` helper, which returns ``f"{v:.2f}"`` for numerics, empty string for None/NaN/inf, ``str(value)`` for booleans (guards the ``True → "1.00"`` foot-gun since ``bool`` is an ``int`` subclass), and passes through any string the scanner kept because parsing failed (e.g. ``(4.50)`` when parens-negative is off; user can correct in the editor before re-exporting). ``format_amount`` lives in ``src/pdf_extract.py`` so it's testable in isolation (the page module can't easily be unit tested because of its Streamlit import chain). 8 new tests cover the trailing-zeros case, negatives, None/empty, string-passthrough, bool guard, NaN/inf, and the ``places`` parameter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:27:16 +00:00
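One way ``format_amount`` could behave, written from the rules listed in the commit message. The body is a sketch, not the code in ``src/pdf_extract.py``:

```python
import math


def format_amount(value, places=2):
    """Render an amount cell for CSV export with a fixed number of decimals."""
    if value is None:
        return ""
    if isinstance(value, bool):
        # bool is an int subclass; without this guard True would become "1.00".
        return str(value)
    if isinstance(value, (int, float)):
        if isinstance(value, float) and not math.isfinite(value):
            return ""  # NaN and +/-inf export as empty cells
        return f"{value:.{places}f}"
    # Anything else (e.g. a string the scanner could not parse) passes through.
    return str(value)


for cell in (4.5, 1000.0, -4.5, None, True, "(4.50)"):
    print(repr(format_amount(cell)))
```

Keeping the column as ``float`` internally and formatting only at the export boundary means arithmetic and sorting still work on real numbers.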
Michael	a1824b8dc4	feat(pdf): Home-style file list + Clear-all button User feedback: the standard file_uploader didn't visually match the Home page, and there was no obvious way to clear out uploaded files between scans (have to refresh the browser tab). Persistent stash + add-only sync. Files captured into ``st.session_state["pdf_uploads"]`` (dict name → {bytes, size}) via an ``on_change`` callback on the file_uploader widget. The callback is add-only — never removes files from the stash based on widget state. Removal is owned by the custom X buttons + widget-counter bump (see below). This guarantees a hidden native X click can't silently drop files behind the user's back. Hidden native file list. A small CSS block suppresses the file_uploader's built-in file rows + their delete buttons (``stFileUploaderFile`` + ``stFileUploaderDeleteBtn``), so the custom list below is the single source of truth on screen. Custom file list (Home pattern). Below the dropzone, every uploaded file gets a row: ``✕ \| 📄 filename \| size``. Top of section shows ``N files · 12.3 MB total``. Counts and sizes update in real time as the user adds or removes files. The X button per row calls ``log_event("upload", "PDF removed: …")``, removes the entry from the stash, and bumps the widget counter to clear the widget too. Clear-all button. Sits next to the Scan button. Wipes the stash, bumps the widget counter, drops any cached scan results (``K_ROWS``, ``K_WARNINGS``, ``K_SOURCE_COUNT``). Audited via ``log_event("upload", "PDF list cleared", count=N)``. Widget reset via counter bump. Streamlit disallows programmatic mutation of widget session-state entries; the standard workaround is to rotate the widget's ``key``. Page maintains ``K_UPLOAD_COUNTER`` which gets incremented on remove / clear-all, producing a fresh ``pdf_upload_v{N}`` key and a freshly-instantiated empty widget. 
The stash retains any unaffected files; on next upload, the add-only sync picks up the new ones without re-adding the removed ones. Scan rewired to read the stash. Instead of iterating the widget's UploadedFile objects (which the previous code did and which broke when the widget unmounted on remove), the scan loop iterates ``pdf_uploads.items()`` and uses the cached ``bytes``. Diagnostic expander does the same — re-reads from the stash, removing the need for a separate ``K_DIAGNOSTIC`` cache (deleted). ``_format_size`` helper ports the byte-formatting logic from ``_home.py``'s pattern (KB / MB / GB rollover). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 00:28:01 +00:00
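The stash-plus-counter pattern can be modelled with a plain dict in place of ``st.session_state``. The ``K_UPLOAD_COUNTER`` string value and the helper names below are hypothetical; ``pdf_uploads`` and the ``pdf_upload_v{N}`` key shape come from the message:

```python
K_UPLOADS = "pdf_uploads"
K_UPLOAD_COUNTER = "pdf_upload_counter"  # hypothetical key string


def uploader_key(state):
    """Widget key; bumping the counter yields a fresh, empty uploader."""
    return f"pdf_upload_v{state.get(K_UPLOAD_COUNTER, 0)}"


def sync_add_only(state, uploaded):
    """on_change-style sync: adds new files, never removes existing ones."""
    stash = state.setdefault(K_UPLOADS, {})
    for name, data in uploaded:
        stash.setdefault(name, {"bytes": data, "size": len(data)})


def remove_file(state, name):
    """Custom X button: drop from the stash and rotate the widget key."""
    state.get(K_UPLOADS, {}).pop(name, None)
    state[K_UPLOAD_COUNTER] = state.get(K_UPLOAD_COUNTER, 0) + 1


def format_size(num_bytes):
    """Human-readable size with B / KB / MB / GB rollover."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB"):
        if size < 1024:
            return f"{size:.0f} B" if unit == "B" else f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} GB"


state = {}
sync_add_only(state, [("jan.pdf", b"x" * 2048), ("feb.pdf", b"y" * 512)])
remove_file(state, "jan.pdf")
print(uploader_key(state), list(state[K_UPLOADS]))
```

Because the sync never deletes, an empty widget after a key rotation cannot wipe the files that are still in the stash.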
Michael	155dd30746	feat(pdf): extract statement header (account + period) + date format Three related additions for the accountant workflow: 1. Statement header extraction. New ``extract_statement_metadata(pages)`` pulls the account number and statement period out of the first page (falls back to page 1+2 if either is missing on page 1; Wells Fargo business accounts put header info on page 2). Detected fields are stamped onto EVERY transaction row so a multi-statement CSV is self-attributing per row:: { "date": "20250113", "description": "Coffee Shop", "amount_1": -4.50, "account_number": "**5678", "statement_period_start": "20250101", "statement_period_end": "20250131", ... } Account-number regex is tolerant of masks (``1234``), hyphens (``1234-5678-9012``), and spaces. Period regex looks for "Statement Period" / "From" / "Period Covered" labels plus the first 1-2 full-year dates that follow. If only one date is present near the label, it's used for both start and end (some statements show only the closing date). 2. Year inference for short dates. When the row date is a short ``01/13`` or ``Jan 13`` without a year, the scanner now binds the year from the statement period's end date BEFORE formatting. Doesn't handle the December-in-January-statement cross-year case (rare; user can edit in the table). 3. Configurable output date format. New ``output_date_format`` parameter on ``scan_pdf_for_transactions`` defaults to ``%Y%m%d``. Applied to: the transaction date column AND the statement period start/end fields. The page surfaces a dropdown in Scan options with common presets (YYYYMMDD, YYYY-MM-DD, MM/DD/YYYY, DD/MM/YYYY, ``Mon DD, YYYY``) plus a Custom option that accepts a raw strftime string. New helper: ``format_date(iso_str, fmt)`` converts ISO ``YYYY-MM-DD`` to any strftime; passes invalid input through unchanged so the user can see what was actually there rather than getting silent empties.
20 new tests cover: format_date, account-number extraction (masked / hyphenated / spaced / no-label / short), period extraction (standard / from-to / single-date / no-label), metadata orchestrator (full header / no pages / page-2 fallback), year inference (US / dash / month-name / no-period / unparseable), plus an end-to-end class that builds a header'd PDF with short-date transactions and confirms metadata attribution + year inference + format round-trip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 00:20:46 +00:00
Michael	263af3c7c2	fix(pdf): short dates without year + diagnostic for "0 rows" runs User uploaded a real Chase statement and got "0 rows detected." Two bugs the rewrite shipped with, plus a diagnostic: 1. Short dates without year weren't recognized. Most bank statements (Chase, Wells, BofA, …) display transaction dates as ``01/13`` or ``Jan 13`` because the year is implied by the statement period. The original regex required ``\d{2,4}`` after the second slash, so ``01/13`` failed to match and rows with no detected date got dropped. Split ``_DATE_RES`` into ``_FULL`` (with year) and ``_SHORT`` (no year), with a two-pass detector: pass 1 tries full-year patterns across the whole row; pass 2 only tries short patterns if pass 1 found nothing. This prevents a stray ``Page 1/2`` from shadowing the real dated transaction on the same line. Short patterns: - ``\d{1,2}/\d{1,2}`` — Chase, etc. - ``\d{1,2}-\d{1,2}`` - ``[A-Z][a-z]{2}\s+\d{1,2}`` — "Jan 13" When parsing, short dates pass through ``parse_date`` and return None (no year to bind to), so the scanner falls back to the raw text — the user sees ``01/13`` in the date column and can correct in the editor. 2. Multi-word dates leaked the day token into the description. A pre-existing bug: ``_find_dates_in_words`` returned only the START word index, and ``_description_from_row`` only excluded that single word. For "Jan 13 Coffee $4.50", the description became "13 Coffee" instead of "Coffee". Fixed by returning ``(start, end, text)`` with ``end`` exclusive (computed from ``len(m.group(1).split())`` so window-overrun doesn't over-consume), and the description builder now skips the full range. 3. New diagnostic: ``diagnose_pdf_lines(pdf_bytes)``. Returns every clustered text line the scanner saw with ``has_date`` / ``has_amount`` flags. 
When the page's scan returns 0 rows, an auto-expanded "what the scanner saw" expander now renders a table of all extracted lines so the user can: - Spot scanned-PDF cases (empty result → enable OCR) - See which lines have a date but no amount (or vice versa) - Eyeball the date / amount format the scanner missed Without leaving the app or asking the developer for help. Eight new tests cover: short US date (``01/13``), short month- name date with two-word consumption (``Jan 13``), the ``Page 1/2 ... 01/13/2026`` shadowing case, and the multi-word- date description fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 00:06:07 +00:00
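The two-pass detector can be illustrated on plain strings. The short patterns are the three listed in the commit message; the full-year patterns and the string-based (rather than word-index) matching are simplifying assumptions:

```python
import re

# Pass 1: patterns that carry a year (assumed forms).
_FULL = [
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
]
# Pass 2: year-less patterns from the commit message.
_SHORT = [
    re.compile(r"\b\d{1,2}/\d{1,2}\b"),
    re.compile(r"\b\d{1,2}-\d{1,2}\b"),
    re.compile(r"\b[A-Z][a-z]{2}\s+\d{1,2}\b"),
]


def find_date(line):
    """Return the leftmost full-year date, else the leftmost short date, else None."""
    for patterns in (_FULL, _SHORT):
        hits = [m for m in (p.search(line) for p in patterns) if m]
        if hits:
            return min(hits, key=lambda m: m.start()).group(0)
    return None


print(find_date("Page 1/2 01/13/2026 Coffee 4.50"))
print(find_date("Jan 13 Coffee $4.50"))
```

Running the short patterns only when the full pass finds nothing is what stops ``Page 1/2`` from shadowing ``01/13/2026`` on the same line.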
Michael	bece2b4030	refactor(pdf): rip out templates; heuristic scan + selectable table User feedback: the template / visual-picker / mode-dispatch implementation was too complex for the actual workflow. Statements drift between months, the canvas state didn't survive multi-page navigation, and accountants don't want to maintain per-bank configuration just to convert PDFs to CSV. Start-over design — one public function, one page, no persistence: ``scan_pdf_for_transactions(pdf_bytes) → (rows, warnings)`` A row is "any text line with a date pattern AND at least one amount pattern." Each detected row is a dict shaped:: { "date": "2026-01-15", "description": "Coffee Shop", "amount_1": -4.50, "amount_2": 1000.00, # if a second amount was found "page": 1, "raw": "01/15/2026 Coffee Shop (4.50) 1,000.00", "source_file": "chase-jan-2026.pdf", } Multi-line descriptions still merge (no-date no-amount lines attach to the previous transaction). Multi-PDF batches share a single combined table with a ``source_file`` column. Page UX: - Upload PDF(s) → optional Options expander (parens-negative, use-OCR) → click Scan → see all detected rows in an ``st.data_editor``. - The editor has an ``Include`` checkbox column (default on), plus user-editable date / description / amount cells and a read-only ``raw`` column showing the original PDF text for verification. - A ``Columns to include in CSV`` multiselect hides ``page`` / ``raw`` from the download by default; user can re-add either. - Download CSV gets only the checked rows. No template save/load. No visual picker. No mode dispatch. No column boundaries. No schema migration. No per-bank configuration files. 
Deletions: - ``src/pdf_templates.py`` — template storage layer - ``src/gui/_drawable_canvas_compat.py`` — Streamlit compat shim for the canvas (no canvas now) - ``tests/test_pdf_templates.py``, ``test_pdf_row_heuristic.py``, ``test_drawable_canvas_compat.py`` — covered the removed APIs - ``build/hooks/hook-streamlit_drawable_canvas.py`` — hook for the removed dep - ``streamlit-drawable-canvas==0.9.3`` from ``requirements.txt`` - The drawable-canvas references in ``build/datatools.spec`` ``src/pdf_extract.py`` shrinks from ~30 helper functions to ~10. Keeps: value parsers, row clusterer, date/amount token finders, OCR pipeline, dependency guards. The one new public function ``scan_pdf_for_transactions`` glues them together. Tests (59 passing): the unit layer keeps full coverage of the building blocks; the smoke layer pins the end-to-end PDF roundtrip, OCR discovery, dependency-import behavior, and the multi-line-description merge. The fpdf2-generated fixture PDF still drives the real-PDF test. Rollback: ``git revert HEAD`` brings back the template system if needed — but the simpler model should make that unlikely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:57:30 +00:00
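A toy version of the row heuristic ("a date pattern AND at least one amount pattern") operating on already-extracted text lines. The regexes, the always-on parentheses-negative rule, and the ``scan_lines`` name are assumptions; the real ``scan_pdf_for_transactions`` works from PDF bytes:

```python
import re

DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
AMOUNT_RE = re.compile(r"\(?-?\$?\d[\d,]*\.\d{2}\)?")


def parse_amount(token):
    """'(4.50)' -> -4.5, '1,000.00' -> 1000.0 (parentheses treated as negative)."""
    negative = token.startswith("(") and token.endswith(")")
    value = float(token.strip("()").replace("$", "").replace(",", ""))
    return -value if negative else value


def scan_lines(lines, page=1):
    """Turn text lines into row dicts; undated, amount-free lines extend the previous row."""
    rows = []
    for raw in lines:
        date_match = DATE_RE.search(raw)
        amounts = AMOUNT_RE.findall(raw)
        if date_match and amounts:
            rest = AMOUNT_RE.sub("", raw.replace(date_match.group(0), "", 1))
            row = {"date": date_match.group(0),
                   "description": " ".join(rest.split()),
                   "page": page, "raw": raw}
            for index, token in enumerate(amounts, start=1):
                row[f"amount_{index}"] = parse_amount(token)
            rows.append(row)
        elif rows and not date_match and not amounts and raw.strip():
            rows[-1]["description"] += " " + raw.strip()  # multi-line description
    return rows


rows = scan_lines(["01/15/2026 Coffee Shop (4.50) 1,000.00", "Downtown branch"])
print(rows[0]["description"], rows[0]["amount_1"], rows[0]["amount_2"])
```

Keeping the untouched ``raw`` text on every row is what lets the user verify a parsed value against the PDF line it came from.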
Michael	60969c0770	feat(pdf): UI rework — Auto-detect is the default build flow Pulls the user's primary mental model away from "draw column boundaries" toward "tell me what shape your amounts have, see detected rows, save." The visual picker that wasn't working for multi-statement workflows is reachable but no longer the default. Build mode header now has a mode radio: - "Auto-detect (recommended)" — row_heuristic. Tabs: Amount layout · Filters & date · Save. Three small forms; no coordinate UI anywhere. The Amount-layout tab's dropdown picks one of single / txn+balance / debit+credit / debit+credit+balance and auto-derives the min/max amount-count range (overridable under an expander). - "Visual columns (advanced)" — column_visual. Five tabs (the original Visual picker / Pages & table / Columns / Parsing / Save). A yellow warning panel up top reminds the user that column-x templates only work when statement layout is stable. Switching modes triggers a rerun so the right tab set renders immediately. The template object preserves both mode's config trees side-by-side so a user can flip between them without losing work. Live preview below the form runs ``apply_template`` against the cached sample pages (already cached in session_state so this re-renders cheaply on every form edit). The "no rows yet" message is mode-aware — points users at the right tuning knobs for whichever mode they're in. The preview caption notes which mode produced the rows so the user can correlate decisions to output. The visual picker bug the user reported — "a single box stays in the same location regardless of page" — is sidestepped rather than fixed: in row_heuristic mode there's no canvas to confuse, and for the rare column_visual user the canvas is still imperfect but no longer their first interaction with the tool. Cleaning up the column_visual canvas state bugs is a separate follow-up if real users still hit the Advanced mode. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:46:27 +00:00
Michael	10015c40e1	fix(pdf): shim image_to_url for drawable-canvas on modern Streamlit User hit ``AttributeError: module 'streamlit.elements.image' has no attribute 'image_to_url'`` on first PDF import. Root cause: ``streamlit-drawable-canvas`` 0.9.3 (last upstream release 2023) calls a Streamlit internal that was relocated in Streamlit ~1.30+. The function moved from ``streamlit.elements.image`` to ``streamlit.elements.lib.image_utils`` AND its signature changed — the second positional argument is now a ``LayoutConfig`` dataclass instead of a plain ``int`` width. Three remedies considered: 1. Downgrade Streamlit. Reverses unrelated improvements + security fixes; not on the table. 2. Fork drawable-canvas. The maintenance hit isn't worth it for a one-line internal API change. 3. Ship a compatibility shim. Re-attach a wrapper at the old import path that adapts the old call shape to the new function. This is the standard workaround the wider Streamlit community has converged on for this exact regression. ``src/gui/_drawable_canvas_compat.py`` does (3). The ``install()`` helper is idempotent, opt-in (not auto-run at module import — a grep for ``_install_canvas_compat`` shows every call site), and no-ops if Streamlit hasn't moved the function OR if the new function isn't where we expect (lets the canvas surface a real error rather than papering over a different bug). The page calls ``_install_canvas_compat()`` once at module top before any ``st_canvas`` invocation; Streamlit's script-rerun model means this fires every page load but the ``_PATCHED`` guard makes re-runs free. The shim wraps the old ``width=int`` arg into a default-constructed ``LayoutConfig()`` — the old ``width=-1`` sentinel meant "use the image's natural width", which is also what an unconfigured LayoutConfig produces. Confirmed by inspecting Streamlit 1.57.0's ``image_utils.py``. 
4 new tests pin the shim contract: - ``install()`` attaches ``image_to_url`` to the old path on modern Streamlit - Idempotent — calling twice doesn't double-wrap - Doesn't clobber a future Streamlit that restores the original at the old path - Translates ``(image, -1, False, "RGB", "PNG", "id")`` into a proper call to the new function with a ``LayoutConfig`` instance If a future Streamlit upgrade moves ``image_to_url`` AGAIN, the shim's silent-no-op fallback means the canvas error surfaces again and points at where to look. The shim doesn't paper over mysteries; it only patches the one specific relocation we know about. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:29:20 +00:00
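The shim contract can be shown with stand-in objects instead of Streamlit's real modules. Everything here is illustrative: ``SimpleNamespace`` plays the role of the old and new modules, ``make_layout`` stands in for constructing a default ``LayoutConfig``, and an attribute check replaces the ``_PATCHED`` flag mentioned above:

```python
from types import SimpleNamespace


def install(old_module, new_module, make_layout):
    """Attach an adapter at the old path; return True only if a patch was applied."""
    if hasattr(old_module, "image_to_url"):
        return False  # already patched, or upstream restored the original
    new_func = getattr(new_module, "image_to_url", None)
    if new_func is None:
        return False  # function is not where we expect: let the real error surface

    def image_to_url(image, width, *rest):
        # The old integer width is replaced by a default layout object.
        return new_func(image, make_layout(), *rest)

    old_module.image_to_url = image_to_url
    return True


seen = []
new_mod = SimpleNamespace(
    image_to_url=lambda image, layout, *rest: seen.append((image, layout, rest)) or "url"
)
old_mod = SimpleNamespace()
print(install(old_mod, new_mod, dict), install(old_mod, new_mod, dict))
print(old_mod.image_to_url("img", -1, False, "RGB", "PNG", "id"))
```

Returning ``False`` instead of raising in the two no-op branches mirrors the stated goal: the shim patches one known relocation and stays out of the way otherwise.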
Michael	e6ee2e3481	feat(pdf): robust Tesseract discovery + OS-aware install copy User tried ``brew install tesseract`` in PowerShell after seeing all three OSes listed inline in the OCR banner — easy mistake when the install commands are crammed on one line with ``·`` separators. Two changes pre-empt this: OS-aware OCR banner. The expander now detects the user's platform via ``platform.system()`` and shows only the relevant install instructions: - Windows: UB-Mannheim installer link, numbered steps, explicit "keep the Add to PATH checkbox on" callout, plus a fallback paragraph telling the user how to set ``DATATOOLS_TESSERACT_PATH`` if they already installed without PATH and don't want to reinstall. - macOS: ``brew install tesseract`` with a Homebrew link. - Linux: ``apt install tesseract-ocr`` with a "or your distro's equivalent" hedge. Robust binary discovery in ``ocr_available()``. Three-stage: 1. Honor ``DATATOOLS_TESSERACT_PATH`` env var if set — explicit override for portable installs or non-default locations. 2. Try ``pytesseract``'s default PATH-based lookup. 3. If PATH lookup fails, probe known Windows install paths (``C:\Program Files\Tesseract-OCR\tesseract.exe``, the x86 variant, and ``%LOCALAPPDATA%\Programs\Tesseract-OCR\``) via the new ``_autodetect_tesseract_path``. On hit, set ``pytesseract.pytesseract.tesseract_cmd`` so all subsequent ``image_to_data`` calls use the same binary without re-discovering. This means a user who runs the UB-Mannheim installer with default options but forgets the PATH checkbox will still get OCR working after a launcher restart, without env-var gymnastics. Tests (4 new, 85 total in the suite): - Auto-detect returns None on non-Windows (no false positives on dev laptops). - Auto-detect finds the binary at a mocked ``C:\Program Files\Tesseract-OCR\tesseract.exe``. - Auto-detect returns None when no candidate exists. 
- ``DATATOOLS_TESSERACT_PATH`` env var beats both PATH lookup and auto-detect (sets ``tesseract_cmd`` even when the path doesn't resolve, so a real binary at a custom location works). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:15:00 +00:00
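The three-stage discovery, sketched without importing ``pytesseract``. ``shutil.which`` stands in for pytesseract's PATH lookup, the exact ``Program Files (x86)`` and ``%LOCALAPPDATA%`` file paths are assumptions, and the injectable ``system`` / ``exists`` / ``which`` parameters exist only to make the sketch testable:

```python
import os
import platform
import shutil


def autodetect_tesseract_path(system=None, exists=os.path.exists, env=None):
    """Probe well-known Windows install locations; None elsewhere or on a miss."""
    env = os.environ if env is None else env
    if (system or platform.system()) != "Windows":
        return None
    candidates = [
        r"C:\Program Files\Tesseract-OCR\tesseract.exe",
        r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",  # assumed x86 path
    ]
    local = env.get("LOCALAPPDATA")
    if local:
        candidates.append(local + r"\Programs\Tesseract-OCR\tesseract.exe")
    return next((path for path in candidates if exists(path)), None)


def resolve_tesseract(env=None, which=shutil.which, system=None, exists=os.path.exists):
    """Stage 1: env override. Stage 2: PATH lookup. Stage 3: Windows auto-detect."""
    env = os.environ if env is None else env
    override = env.get("DATATOOLS_TESSERACT_PATH")
    if override:
        return override  # honoured even if the path does not resolve
    return which("tesseract") or autodetect_tesseract_path(system, exists, env)


print(resolve_tesseract())
```

In the real helper the resolved path would then be assigned to ``pytesseract.pytesseract.tesseract_cmd`` so later OCR calls reuse it.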
Michael	2d927bc95f	fix(pdf): graceful fallback when PDF dependencies aren't installed User hit a hard ImportError on opening the PDF→CSV tool because ``pip install -r requirements.txt`` hadn't picked up the new ``pdfplumber`` / ``pypdfium2`` lines yet. Streamlit surfaces that as an unfiltered traceback — friendlier to show a clear install-required panel inside the tool instead. Two changes: 1. ``src/pdf_extract.py`` lazy-imports the PDF deps via ``_require_pdfplumber()`` / ``_require_pdfium()`` helpers that raise a new ``PdfDependencyMissing`` (subclass of ImportError) with an actionable ``hint`` field. Pure helpers (``parse_amount``, ``parse_date``, ``cluster_rows``, etc.) keep working with no PDF dep installed — useful for tests and for keeping module-import paths cheap. 2. The tool page probes both deps at render time via ``_pdf_deps_status()``; if anything's missing it shows a ``st.error`` panel with the exact pip command and a "restart the launcher" reminder, then ``st.stop()``s before touching any PDF code path. The page itself loads cleanly without the deps installed, so the sidebar nav doesn't 500 — the user just sees the install panel on click. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:59:20 +00:00
Michael	967d3f6a11	feat(pdf): OCR availability banner + per-run toggle Phase 6/6. Final polish layer on top of the OCR pipeline that ``extract_pages_auto`` has carried since commit 1. - OCR status banner at the top of the page next to the mode selector. Ready: a one-liner caption confirming OCR will run on scanned pages. Unavailable: a collapsed expander explaining the missing piece (``pytesseract`` binding vs. Tesseract binary) with install pointers for Windows, macOS, and Linux. The expander explicitly notes that modern text-based bank statements don't need OCR — most users will never expand it. - "Use OCR for scanned pages" toggle in Extract mode, defaulting to the runtime availability. Disabled (greyed out) when Tesseract isn't usable, so the user can't accidentally set themselves up for confusing warnings. Passes through as ``allow_ocr`` to ``extract_pages_auto``. - Build mode's sample-loading path continues to call ``extract_pages_auto(..., allow_ocr=True)`` — sample preview always uses OCR if available, since the user is actively diagnosing template fit. No schema change. OCR's structural support is in commits 1 + 3; this commit just makes it discoverable + opt-out. Rolling up the 6-commit feature: `b8aff86` Phase 1 — pure pdf_extract module + tests `aea520d` Phase 2 — template storage layer + tests `2f349e8` Phase 3 — Extract/Build/Manage page + nav + i18n `5a8e2ec` Phase 4 — batch polish (ZIP, sort, status block) `b86828d` Phase 5 — visual region picker (drawable canvas) THIS Phase 6 — OCR banner + toggle Each commit is independently revertable; rolling all the way back to ``c16e2a5`` is ``git revert `b86828d` `5a8e2ec` `2f349e8` `aea520d` `b8aff86` <this>`` (or just ``git reset --hard c16e2a5`` on a clean branch). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:54:11 +00:00
Michael	b86828d791	feat(pdf): visual region picker on rendered sample page Phase 5/6. Adds a "Visual picker" tab as the first stop in the template-build flow. The sample PDF page is rasterized with ``pypdfium2`` (capped at ~900px wide for sensible display), and ``streamlit-drawable-canvas`` overlays drawing tools on top. UX: - Line mode — drag short (roughly vertical) strokes where you want columns to split. Each stroke's x-midpoint becomes one boundary in PDF point coordinates. - Rect mode — drag a rectangle around the transactions table; bbox is preserved on the template as ``visual.table_bbox`` for round-trip, future use as a hard crop region. - Transform mode — move/resize already-drawn shapes after the fact. Round-trip: re-entering Build mode with an existing template seeds the canvas with full-height vertical lines for every boundary already on the template, plus the saved bbox if any, so editing-after-save matches the user's mental model. Coordinate translation: the canvas reports pixel positions; we divide by the renderer's pixels-per-PDF-point scale to get back to PDF coordinates that ``apply_template`` already expects. No template-schema change required — the boundaries the picker writes are the same list the text-input editor wrote in commit 3, just sourced visually. New helper in the extraction module: - ``render_page_image(pdf_bytes, page_no, target_width=900)`` — rasterize a single 1-indexed page to a PIL image; returns ``(image, scale)`` for coordinate translation. The text-input boundary editor in the Columns tab remains as a fallback for power users / keyboard-only workflows and for copy-paste from spreadsheet-derived x-positions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:52:54 +00:00
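The coordinate translation is a single division by the render scale. Both function names below are hypothetical; the 900 px target width comes from the ``render_page_image`` signature above, and the 612 pt page width is simply US Letter used as an example:

```python
def render_scale(page_width_pts, target_width=900):
    """Pixels per PDF point when a page is rasterized `target_width` px wide."""
    return target_width / page_width_pts


def stroke_to_boundary(x_start_px, x_end_px, scale):
    """Map a roughly vertical canvas stroke to one column boundary in PDF points."""
    midpoint_px = (x_start_px + x_end_px) / 2
    return midpoint_px / scale


scale = render_scale(612)  # US Letter is 612 pt wide
print(round(stroke_to_boundary(448, 452, scale), 2))
```

Taking the x-midpoint makes the result insensitive to a stroke that leans slightly left or right.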
Michael	5a8e2ec9e1	feat(pdf): batch extract polish — ZIP output, sort-by-date, status block Phase 4/6. Polishes the batch workflow shipped in commit 3: - st.status progress block replaces the simple progress bar. Each file appears as its own line as it's processed; the block auto-collapses on completion with a "12/13 extracted" summary and turns red if any file errored. - Sort combined output by date checkbox (default ON) sorts the merged CSV ascending by date, with source_file as a stable secondary sort so multiple statements interleave by date but same-day rows from the same file stay together. - ZIP-of-per-PDF-CSVs output option alongside the combined CSV. When the accountant has 12 statements from 12 different account periods and wants to feed them into 12 separate ledger imports, the ZIP keeps each file's rows in its own CSV named after the original PDF stem. - Per-file summary table gets a ``status`` column ("ok" / "no rows" / "error: ExceptionName") so error grouping is obvious at a glance — already present from commit 3, now upgraded with the status field. Cancellation is intentionally not added — Streamlit's single- thread rerun model has no clean way to interrupt a tool-run mid-stream without architectural changes to extraction. If a user mis-fires Extract on 50 PDFs they can refresh the browser tab; the task will be killed when the next interaction comes in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:51:05 +00:00
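Note on 5a8e2ec9e1: a minimal standard-library sketch of the two output options described above, sorting the combined rows by date with ``source_file`` as the secondary key, and writing one CSV per source PDF into a ZIP named after the PDF stem. The helper names ``sort_rows`` and ``zip_per_file_csvs`` and the row-as-dict shape are hypothetical, and dates are assumed to be ISO ``YYYY-MM-DD`` strings so that string order matches chronological order.

```python
import csv
import io
import zipfile
from pathlib import PurePath


def sort_rows(rows):
    """Sort ascending by date, then by source_file (sorted() is stable)."""
    return sorted(rows, key=lambda r: (r["date"], r["source_file"]))


def zip_per_file_csvs(rows, fieldnames):
    """Return ZIP bytes holding one CSV per source PDF, named by its stem."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["source_file"], []).append(row)

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for source_file, file_rows in grouped.items():
            out = io.StringIO(newline="")
            writer = csv.DictWriter(out, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(file_rows)
            zf.writestr(PurePath(source_file).stem + ".csv", out.getvalue())
    return buf.getvalue()


rows = [
    {"source_file": "b.pdf", "date": "2025-01-05", "amount": "10.00"},
    {"source_file": "a.pdf", "date": "2025-01-05", "amount": "20.00"},
    {"source_file": "a.pdf", "date": "2024-12-30", "amount": "30.00"},
]
sorted_rows = sort_rows(rows)
zip_bytes = zip_per_file_csvs(sorted_rows, ["source_file", "date", "amount"])
```

Two uploads that share a stem would collide inside the archive in this sketch; a real implementation would need to disambiguate the names.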
Michael	2f349e8191	feat(pdf): tool page with Extract / Build / Manage modes Phase 3/6. Wires the PDF Extractor into the GUI as a new "transformations" tool with three modes selected by a horizontal radio at the top of the page: Extract — pick a saved template, upload one or more statement PDFs (single + batch shipping together to keep the common case one-step), get a previewed DataFrame + CSV download. Per-file row counts and warnings are surfaced; failures on one file don't kill the whole batch. The combined CSV gets a ``source_file`` first column so the accountant can sort/filter by statement. Build template — load an existing template or start fresh, upload a sample PDF, edit every schema field across four tabs (Pages & table / Columns / Parsing / Save). A live preview below re-runs ``apply_template`` against the sample on each re-render so the user sees their changes hit rows immediately. The column- boundary editor is text-input ("comma-separated x-positions") for now — replaced by the drawable-canvas visual picker in commit 5. Manage templates — list with rename / delete / export (downloads the canonical JSON) / import (uploads someone else's JSON, validated through ``template_from_json``). Heavy work (``extract_pages_auto``) only runs on explicit user action (Extract / a new sample upload), and the parsed Page list is cached in ``st.session_state`` so widget-edit reruns don't re-parse the PDF. Logging: tool runs and template saves both hit the audit log via ``log_event("tool_run", …)``, matching every other tool's instrumentation pattern. Registered in ``tools_registry.py`` under ``transformations`` with status ``Ready`` and the picture-as-pdf Material icon. i18n keys added for en + es ("PDF to CSV" / "PDF a CSV"). OCR is wired in this commit — ``extract_pages_auto`` already falls back through ``pytesseract`` when the binary is available, and the warning strings it returns surface as ``st.info`` / ``st.warning`` per-file. 
Commit 6 will polish the OCR UX with a status row. Next commits build on this page: 4 — batch progress + cancellation + per-file error grouping 5 — drawable-canvas visual picker replaces text x-positions 6 — OCR availability banner + scanned-page indicators Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:49:44 +00:00

19 Commits