datatools-dev

Author	SHA1	Message	Date
Michael	17faf84aed	feat(pdf): probe bundled Tesseract first when running frozen Adds runtime support for the bundled Tesseract that ships inside the DataTools installer / portable / AppImage artifacts. When DataTools is launched from a PyInstaller frozen bundle the OCR engine now resolves automatically — no end-user install required. New helpers in src/pdf_extract.py: - _bundled_tesseract_path() → Path | None — returns <sys._MEIPASS>/tesseract/tesseract[.exe] when getattr(sys, "frozen", False) AND sys._MEIPASS are present; None in dev. - _bundled_tessdata_dir() → Path | None — same gating, returns <sys._MEIPASS>/tesseract/tessdata. - _apply_bundled_tessdata_prefix() — sets TESSDATA_PREFIX to the bundled tessdata dir before any pytesseract call; only if frozen, dir exists, and the user hasn't already overridden the env var. Discovery order in ocr_available() / _autodetect_tesseract_path(): 1. DATATOOLS_TESSERACT_PATH env override (existing) 2. Bundled binary (NEW — frozen-only) 3. System PATH (existing) 4. Windows well-known install dirs (existing legacy fallback) In dev (not frozen) every new probe is a no-op so the developer experience is unchanged. 12 new tests cover frozen vs. non-frozen detection on each platform, the user-override respect for TESSDATA_PREFIX, autodetect priority ordering, and the no-bundled-dir graceful path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 18:19:52 +00:00
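The frozen-only gating described in commit 17faf84aed can be sketched roughly as follows. This is a minimal illustration, not the code in src/pdf_extract.py: the name `bundled_tesseract_path` and its injectable `frozen`, `meipass`, and `windows` parameters are hypothetical, added only so the logic can be exercised outside a PyInstaller bundle.

```python
import sys
from pathlib import Path
from typing import Optional


def bundled_tesseract_path(
    frozen: Optional[bool] = None,
    meipass: Optional[str] = None,
    windows: Optional[bool] = None,
) -> Optional[Path]:
    """Return <_MEIPASS>/tesseract/tesseract[.exe] when frozen, else None.

    The three keyword arguments are illustrative test hooks (an assumption);
    when left as None they fall back to the real interpreter state.
    """
    if frozen is None:
        frozen = bool(getattr(sys, "frozen", False))
    if meipass is None:
        meipass = getattr(sys, "_MEIPASS", None)
    if windows is None:
        windows = sys.platform.startswith("win")

    # In a normal dev checkout neither attribute is set, so the probe is a no-op.
    if not frozen or not meipass:
        return None

    name = "tesseract.exe" if windows else "tesseract"
    return Path(meipass) / "tesseract" / name
```

A caller would presumably consult this only after the DATATOOLS_TESSERACT_PATH override and before falling back to the system PATH, matching the discovery order listed in the commit message.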
Michael	450d4fc9a8	feat(pdf): default output date format to YYYY-MM-DD User asked to flip the default from YYYYMMDD to YYYY-MM-DD. ISO is the better default for an accountant CSV workflow: - Lexicographic sort = chronological sort (no parsing needed). - Every spreadsheet tool the user might import into recognises it as a real date with no ambiguity (US vs EU readers can't disagree on the order). - Hyphens make the year/month/day boundaries scan-able by eye. Concrete changes: - New module constant ``DEFAULT_DATE_FORMAT = "%Y-%m-%d"``, used as the default for ``format_date()`` and the ``output_date_format`` keyword on ``scan_pdf_for_transactions``. - Page's ``_DATE_FORMAT_CHOICES`` reordered so the ISO entry is first (index 0 = default Streamlit selection); YYYYMMDD drops to second. - Custom-strftime input default also flips to ``%Y-%m-%d``. Tests updated to reflect the new default (``test_dates_formatted_iso_by_default``, ``test_short_dates_get_year_from_period``, ``test_compact_format_round_trip``, plus a new ``test_default_is_iso`` for the format_date helper). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 02:04:34 +00:00
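For reference, a minimal sketch of what a `format_date` helper with the new ISO default might look like. The names `DEFAULT_DATE_FORMAT` and `format_date` and the pass-through behaviour for invalid input come from the commit messages in this log; the body is an assumption, not the project's actual implementation.

```python
from datetime import datetime

DEFAULT_DATE_FORMAT = "%Y-%m-%d"  # ISO default, per commit 450d4fc9a8


def format_date(iso_str, fmt=DEFAULT_DATE_FORMAT):
    """Convert an ISO YYYY-MM-DD string to `fmt`; pass invalid input through."""
    try:
        parsed = datetime.strptime(iso_str, "%Y-%m-%d")
    except (TypeError, ValueError):
        # Keep whatever was there so the user can see and correct it.
        return iso_str
    return parsed.strftime(fmt)
```

With these definitions, `format_date("2025-01-13")` keeps the ISO form, while `format_date("2025-01-13", "%Y%m%d")` reproduces the previous compact default.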
Michael	a0042d4aba	feat(pdf): Dec/Jan-aware year inference + filename hint + override Previous year inference picked ``period_end_iso[:4]`` for every short date, which fails on statements that cross the Dec/Jan boundary. A "12/30" row in a 2024-12-16 to 2025-01-15 statement got 2025-12-30 (wrong) instead of 2024-12-30. New cascade for ``_infer_year_for_short_date``: 1. ``override_year`` — caller supplies it (new ``"Override year for short dates"`` field in Scan options). Beats every heuristic. Empty by default; the page validates the value is a 4-digit-looking integer in 1900-2100 and falls back to automatic on garbage input. 2. Statement period start + end — the function now takes BOTH dates and generates candidates with every distinct year in the period (one year for same-year statements, two for Dec/Jan boundaries). The picker scores each candidate by distance from the period: candidates inside the period score 0, candidates outside score ``min(|days from start|, |days from end|)``. Lowest-distance candidate wins. So: - ``12/30`` + period 2024-12-16 to 2025-01-15 → 2024-12-30 (inside period, score 0) - ``01/05`` + same period → 2025-01-05 (inside, score 0) - ``12/15`` + same period → 2024-12-15 (1 day before, closer than 2025-12-15 which is 11 months after) 3. ``filename_year_hint`` — fallback when the statement period regex misses the bank's specific layout. The page passes ``year_from_filename(upload.name)`` automatically so files like ``eStmt_2025-01-13.pdf`` get year 2025 even if the PDF's text doesn't yield a parseable period. The regex matches the first ``20XX`` token bounded by non-digits. Both new helpers (``year_from_filename`` and the new ``_try_short_date_with_year`` factor-out) are exported and tested. 
16 new tests cover: within-period inference (same-year sanity), Dec/Jan boundary cases for both sides, the just-before-period closer-distance case, override priority, filename fallback, no-signal None, dash-format / month-name shorthand round-trip, garbage input, filename year extraction (eStmt pattern, embedded, first-match-wins, no-match, empty). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:59:30 +00:00
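The three-step cascade in commit a0042d4aba (override, then period-distance scoring, then filename hint) can be sketched as below. This is a simplified illustration under stated assumptions: `infer_short_date` and `_safe_date` are hypothetical names, the function takes an already-parsed month and day rather than raw text, and tie-breaking simply prefers the earlier year. Only `year_from_filename` and the scoring rule are taken from the commit message.

```python
import re
from datetime import date

# First 20XX token bounded by non-digits, as described in the commit message.
_YEAR_RE = re.compile(r"(?<!\d)(20\d{2})(?!\d)")


def year_from_filename(name):
    match = _YEAR_RE.search(name or "")
    return int(match.group(1)) if match else None


def _safe_date(year, month, day):
    """Build a date, returning None for impossible ones (e.g. Feb 30)."""
    try:
        return date(year, month, day)
    except ValueError:
        return None


def infer_short_date(month, day, period_start=None, period_end=None,
                     override_year=None, filename_year_hint=None):
    # 1. An explicit override beats every heuristic.
    if override_year is not None:
        return _safe_date(override_year, month, day)

    # 2. One candidate per distinct year in the statement period.
    if period_start is not None and period_end is not None:
        candidates = []
        for year in range(period_start.year, period_end.year + 1):
            candidate = _safe_date(year, month, day)
            if candidate is not None:
                candidates.append(candidate)

        def distance(d):
            if period_start <= d <= period_end:
                return 0  # inside the period
            return min(abs((d - period_start).days), abs((d - period_end).days))

        if candidates:
            return min(candidates, key=distance)

    # 3. Fall back to the year found in the uploaded file's name.
    if filename_year_hint is not None:
        return _safe_date(filename_year_hint, month, day)
    return None
```

For the 2024-12-16 to 2025-01-15 statement used in the commit message, this resolves `12/30` to 2024-12-30, `01/05` to 2025-01-05, and `12/15` to 2024-12-15.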
Michael	34b56b404a	fix(pdf): drop statement_period_start/end columns from output User asked to remove them — the two columns repeated the same value on every row from a given statement, took up screen space in the editor, and offered limited value once the date column already carries the inferred full date. What's kept: - ``account_number`` — still stamped onto every row so multi- statement CSVs are self-attributing - ``extract_statement_metadata`` — still runs every scan because ``period_end`` is the source of the year inference that binds Chase-style short ``01/13`` dates to ``20250113`` - ``_extract_statement_period`` and its tests — period detection itself isn't going anywhere, just its appearance in the output rows What's removed: - ``record["statement_period_start"]`` / ``record["statement_period_end"]`` assignments in ``scan_pdf_for_transactions`` - The two columns from the page's column-ordering setup - Tests pinning their presence; replaced with assertions that they're explicitly absent Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:28:32 +00:00
Michael	ad7c22d7fb	fix(pdf): consistent 2-decimal amount precision in display and CSV User reported amounts losing trailing zeros — 4.50 rendering as 4.5, 1000.00 as 1000 — on the same statement. Classic float display issue: Python's native ``repr(4.5)`` drops the ``.0``, and pandas / Streamlit happily show that inconsistency cell-by-cell. Two layers of fix, internal type stays ``float`` for arithmetic: Display. ``st.column_config.NumberColumn(format="%.2f")`` applied programmatically to every ``amount_`` column on the data_editor. Every numeric amount now shows with exactly two decimal places regardless of trailing zeros. CSV export. Pandas' default float-to-CSV writer also drops trailing zeros (the same issue an accountant would see when opening the file in Excel). Before serialising, each amount column is mapped through the new ``format_amount`` helper — returns ``f"{v:.2f}"`` for numerics, empty string for None/NaN/inf, ``str(value)`` for booleans (guards the ``True → "1.00"`` foot-gun since ``bool`` is an ``int`` subclass), and passes through any string the scanner kept because parsing failed (e.g. ``(4.50)`` when parens-negative is off — user can correct in the editor before re-exporting). ``format_amount`` lives in ``src/pdf_extract.py`` so it's testable in isolation (the page module can't easily be unit tested because of its Streamlit import chain). 8 new tests cover the trailing-zeros case, negatives, None/empty, string-passthrough, bool guard, NaN/inf, and the ``places`` parameter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:27:16 +00:00
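The branching that commit ad7c22d7fb describes for `format_amount` could look something like the sketch below. The function name, the `places` parameter, and the per-type behaviour are taken from the commit message; the exact order of checks and the final `str()` fallback for other types are assumptions.

```python
import math


def format_amount(value, places=2):
    """Render an amount for CSV export with a fixed number of decimals."""
    if value is None:
        return ""
    # bool is an int subclass, so it must be checked before the numeric branch;
    # otherwise True would come out as "1.00".
    if isinstance(value, bool):
        return str(value)
    # Strings the scanner could not parse pass through for manual correction.
    if isinstance(value, str):
        return value
    if isinstance(value, (int, float)):
        if isinstance(value, float) and not math.isfinite(value):
            return ""  # NaN and +/-inf become empty cells
        return f"{value:.{places}f}"
    return str(value)  # assumption: anything else is stringified as-is
```

Mapping each amount column through a helper like this before writing the CSV keeps `4.5` and `1000.0` from losing their trailing zeros.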
Michael	6f2ad57490	fix(pdf): require non-empty description; tighten multi-line merge User reported "Daily Ledger Balances" entries leaking into output. Three correlated bugs in the row qualifier: 1. Empty description is now disqualifying. A row like ``01/13/2025 $1,000.00`` has a date and an amount but no text between them — that's a daily-balance entry, a period-summary, or page furniture. Drop these. New filter sits after ``_description_from_row`` returns: if the description string is empty (or whitespace-only), continue past the row. 2. ``prev`` resets per page. The state that drives multi- line description merging (the "previous transaction this continuation might attach to") used to persist across page boundaries. A no-date no-amount line at the top of page 2 could silently attach to the last transaction on page 1. Fixed by moving the ``prev`` / ``prev_y_bottom`` declarations into the outer page loop so each page starts clean. 3. Multi-line merges now check y-distance. Before this fix, ANY no-date no-amount line attached to the previous transaction's description. A "Daily Ledger Balances" section header several rows below the last transaction would silently fold into it. Now the merge only happens when the gap ``current_top - prev_y_bottom <= 25.0`` PDF points — generous enough for one blank-line gap between wrapped descriptions, tight enough to reject section headers across paragraph breaks. The threshold is a module constant (``_MULTILINE_MERGE_MAX_GAP``) for future tuning if real statements call for it. Three new test classes: - ``TestRequiresDescription.test_empty_description_row_dropped`` — date+amount-no-text row filtered, real transaction kept. - ``TestPrevTransactionResetsPerPage.test_no_cross_page_merge`` — page-1 transaction + page-2 section header = no merge. - ``TestMultilineMergeYGap`` — close continuation merges (10-pt gap), far section header doesn't (100-pt gap). 
The original ``TestMultilineDescription.test_continuation_line_merges`` still passes — its setup has a 10-pt gap which is within the new threshold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 00:58:50 +00:00
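Fixes 2 and 3 from commit 6f2ad57490 (per-page reset of ``prev`` and the y-gap check) can be illustrated with the simplified loop below. The constant name and the 25.0-point threshold come from the commit message; the `merge_continuations` function, the tuple-based line representation, and the choice to advance `prev_y_bottom` after each merged line are assumptions made for the sketch.

```python
_MULTILINE_MERGE_MAX_GAP = 25.0  # PDF points, value quoted in the commit message


def merge_continuations(pages):
    """Merge wrapped description lines into the preceding transaction.

    `pages` is a list of pages; each page is a list of
    (top, bottom, text, is_transaction) tuples in reading order.
    """
    rows = []
    for page in pages:
        # Declared inside the page loop so nothing carries across pages.
        prev = None
        prev_y_bottom = None
        for top, bottom, text, is_transaction in page:
            if is_transaction:
                prev = {"description": text}
                prev_y_bottom = bottom
                rows.append(prev)
            elif prev is not None and top - prev_y_bottom <= _MULTILINE_MERGE_MAX_GAP:
                prev["description"] += " " + text
                # Assumption: track the last merged line so that
                # three-line descriptions keep attaching.
                prev_y_bottom = bottom
            # Otherwise the line is too far below, or no transaction has been
            # seen on this page yet, so it is ignored.
    return rows
```

A continuation 10 points below a transaction merges, while a "Daily Ledger Balances" header 100 points below, or at the top of the next page, does not.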
Michael	155dd30746	feat(pdf): extract statement header (account + period) + date format Two related additions for the accountant workflow: 1. Statement header extraction. New ``extract_statement_metadata(pages)`` pulls the account number and statement period out of the first page (falls back to page 1+2 if either is missing on page 1 — Wells Fargo business accounts put header info on page 2). Detected fields are stamped onto EVERY transaction row so a multi-statement CSV is self-attributing per row:: { "date": "20250113", "description": "Coffee Shop", "amount_1": -4.50, "account_number": "**5678", "statement_period_start": "20250101", "statement_period_end": "20250131", ... } Account-number regex is tolerant of masks (``1234``), hyphens (``1234-5678-9012``), and spaces. Period regex looks for "Statement Period" / "From" / "Period Covered" labels plus the first 1-2 full-year dates that follow. If only one date is present near the label, it's used for both start and end (some statements show only the closing date). 2. Year inference for short dates. When the row date is a short ``01/13`` or ``Jan 13`` without a year, the scanner now binds the year from the statement period's end date BEFORE formatting. Doesn't handle the December-in-January-statement cross-year case (rare; user can edit in the table). 3. Configurable output date format. New ``output_date_format`` parameter on ``scan_pdf_for_transactions`` defaults to ``%Y%m%d``. Applied to: the transaction date column AND the statement period start/end fields. The page surfaces a dropdown in Scan options with common presets (YYYYMMDD, YYYY-MM-DD, MM/DD/YYYY, DD/MM/YYYY, ``Mon DD, YYYY``) plus a Custom option that accepts a raw strftime string. New helper: ``format_date(iso_str, fmt)`` converts ISO ``YYYY-MM-DD`` to any strftime; passes invalid input through unchanged so the user can see what was actually there rather than getting silent empties. 
20 new tests cover: format_date, account-number extraction (masked / hyphenated / spaced / no-label / short), period extraction (standard / from-to / single-date / no-label), metadata orchestrator (full header / no pages / page-2 fallback), year inference (US / dash / month-name / no-period / unparseable), plus an end-to-end class that builds a header'd PDF with short-date transactions and confirms metadata attribution + year inference + format round-trip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 00:20:46 +00:00
Michael	3cf935c999	fix(pdf): drop zero-amount rows; multi-date rows clean description Two corrections from real-statement feedback: 1. Drop rows where the transaction amount is exactly 0. Bank statements include date+amount-shaped noise like "INTEREST EARNED 0.00", "PAGE TOTAL 0.00", "BALANCE FORWARD 0.00 1,234.56" — all match the date+amount heuristic but aren't transactions. New filter in ``scan_pdf_for_transactions``: drop rows whose ``amount_1`` parses to exactly 0. Non-zero balances in ``amount_2`` don't rescue a zero amount_1 — leftmost amount is the canonical transaction amount. Unparsed-but-non-empty amount strings are kept (user verifies in the editor). 2. Multi-date rows: first date wins for the column, every date excluded from the description. Chase / BofA / Wells commonly show both a transaction date and a posting date per row: 01/13 01/14 COFFEE SHOP $4.50 Before this fix, ``_find_dates_in_words`` returned the first date only and the second date leaked into description as "01/14 COFFEE SHOP". Now it returns ALL dates with their word ranges; the scanner uses ``dates[0]`` as the canonical date and passes every range to the description builder for exclusion. The detector's two-pass strategy now also guards against mixing full-year and short-date matches on the same row. Previously, a header line like ``Page 1/2 of 3 ... Statement Date 01/13/2026`` would return both ``1/2`` and ``01/13/2026``, and ``1/2`` (being leftmost) would have won the date column. Now: if any full-year date is found on the row, short patterns are NOT also collected — full year anchors interpretation. A row with no full-year date (Chase short-date case) still falls back to short patterns and collects all of them. 
New tests: - ``test_multiple_dates_returned_in_position_order`` — ``01/13`` + ``01/14`` both returned, in order - ``TestMultiDateRow.test_first_date_wins_second_excluded_from_description`` — end-to-end through ``scan_pdf_for_transactions`` - ``TestZeroAmountRowsAreDropped.test_zero_amount_row_dropped`` — "INTEREST EARNED 0.00" row dropped while real txn kept - ``test_negative_amount_kept`` — pin that -40.00 is not treated as zero by the filter Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 00:12:21 +00:00
Michael	263af3c7c2	fix(pdf): short dates without year + diagnostic for "0 rows" runs User uploaded a real Chase statement and got "0 rows detected." Two bugs the rewrite shipped with, plus a diagnostic: 1. Short dates without year weren't recognized. Most bank statements (Chase, Wells, BofA, …) display transaction dates as ``01/13`` or ``Jan 13`` because the year is implied by the statement period. The original regex required ``\d{2,4}`` after the second slash, so ``01/13`` failed to match and rows with no detected date got dropped. Split ``_DATE_RES`` into ``_FULL`` (with year) and ``_SHORT`` (no year), with a two-pass detector: pass 1 tries full-year patterns across the whole row; pass 2 only tries short patterns if pass 1 found nothing. This prevents a stray ``Page 1/2`` from shadowing the real dated transaction on the same line. Short patterns: - ``\d{1,2}/\d{1,2}`` — Chase, etc. - ``\d{1,2}-\d{1,2}`` - ``[A-Z][a-z]{2}\s+\d{1,2}`` — "Jan 13" When parsing, short dates pass through ``parse_date`` and return None (no year to bind to), so the scanner falls back to the raw text — the user sees ``01/13`` in the date column and can correct in the editor. 2. Multi-word dates leaked the day token into the description. A pre-existing bug: ``_find_dates_in_words`` returned only the START word index, and ``_description_from_row`` only excluded that single word. For "Jan 13 Coffee $4.50", the description became "13 Coffee" instead of "Coffee". Fixed by returning ``(start, end, text)`` with ``end`` exclusive (computed from ``len(m.group(1).split())`` so window-overrun doesn't over-consume), and the description builder now skips the full range. 3. New diagnostic: ``diagnose_pdf_lines(pdf_bytes)``. Returns every clustered text line the scanner saw with ``has_date`` / ``has_amount`` flags. 
When the page's scan returns 0 rows, an auto-expanded "what the scanner saw" expander now renders a table of all extracted lines so the user can: - Spot scanned-PDF cases (empty result → enable OCR) - See which lines have a date but no amount (or vice versa) - Eyeball the date / amount format the scanner missed Without leaving the app or asking the developer for help. Eight new tests cover: short US date (``01/13``), short month- name date with two-word consumption (``Jan 13``), the ``Page 1/2 ... 01/13/2026`` shadowing case, and the multi-word- date description fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 00:06:07 +00:00
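The two-pass detector from commit 263af3c7c2 (full-year patterns first, short patterns only when nothing was found) can be sketched on plain strings as follows. The real ``_find_dates_in_words`` works on word lists and returns index ranges; this simplified `find_dates` function, its string input, and the specific full-year patterns chosen here are assumptions for illustration. The short patterns mirror two of the ones listed in the commit message.

```python
import re

# Pass 1: patterns that carry a year (the exact set used here is an assumption).
_FULL = [
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
]
# Pass 2: short patterns without a year, as listed in the commit message.
_SHORT = [
    re.compile(r"\b\d{1,2}/\d{1,2}\b"),
    re.compile(r"\b[A-Z][a-z]{2}\s+\d{1,2}\b"),
]


def find_dates(line):
    """Return all date strings on the line in left-to-right order."""
    for patterns in (_FULL, _SHORT):
        hits = []
        for pattern in patterns:
            hits.extend((m.start(), m.group(0)) for m in pattern.finditer(line))
        if hits:
            # A full-year hit short-circuits the short pass entirely, so a
            # stray "Page 1/2" cannot shadow the real date.
            return [text for _, text in sorted(hits)]
    return []
```

On a header line such as ``Page 1/2 of 3 Statement Date 01/13/2026`` only the full-year date is returned, while a Chase-style row ``01/13 01/14 COFFEE SHOP $4.50`` yields both short dates in position order.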
Michael	bece2b4030	refactor(pdf): rip out templates; heuristic scan + selectable table User feedback: the template / visual-picker / mode-dispatch implementation was too complex for the actual workflow. Statements drift between months, the canvas state didn't survive multi-page navigation, and accountants don't want to maintain per-bank configuration just to convert PDFs to CSV. Start-over design — one public function, one page, no persistence: ``scan_pdf_for_transactions(pdf_bytes) → (rows, warnings)`` A row is "any text line with a date pattern AND at least one amount pattern." Each detected row is a dict shaped:: { "date": "2026-01-15", "description": "Coffee Shop", "amount_1": -4.50, "amount_2": 1000.00, # if a second amount was found "page": 1, "raw": "01/15/2026 Coffee Shop (4.50) 1,000.00", "source_file": "chase-jan-2026.pdf", } Multi-line descriptions still merge (no-date no-amount lines attach to the previous transaction). Multi-PDF batches share a single combined table with a ``source_file`` column. Page UX: - Upload PDF(s) → optional Options expander (parens-negative, use-OCR) → click Scan → see all detected rows in an ``st.data_editor``. - The editor has an ``Include`` checkbox column (default on), plus user-editable date / description / amount cells and a read-only ``raw`` column showing the original PDF text for verification. - A ``Columns to include in CSV`` multiselect hides ``page`` / ``raw`` from the download by default; user can re-add either. - Download CSV gets only the checked rows. No template save/load. No visual picker. No mode dispatch. No column boundaries. No schema migration. No per-bank configuration files. 
Deletions: - ``src/pdf_templates.py`` — template storage layer - ``src/gui/_drawable_canvas_compat.py`` — Streamlit compat shim for the canvas (no canvas now) - ``tests/test_pdf_templates.py``, ``test_pdf_row_heuristic.py``, ``test_drawable_canvas_compat.py`` — covered the removed APIs - ``build/hooks/hook-streamlit_drawable_canvas.py`` — hook for the removed dep - ``streamlit-drawable-canvas==0.9.3`` from ``requirements.txt`` - The drawable-canvas references in ``build/datatools.spec`` ``src/pdf_extract.py`` shrinks from ~30 helper functions to ~10. Keeps: value parsers, row clusterer, date/amount token finders, OCR pipeline, dependency guards. The one new public function ``scan_pdf_for_transactions`` glues them together. Tests (59 passing): the unit layer keeps full coverage of the building blocks; the smoke layer pins the end-to-end PDF roundtrip, OCR discovery, dependency-import behavior, and the multi-line-description merge. The fpdf2-generated fixture PDF still drives the real-PDF test. Rollback: ``git revert HEAD`` brings back the template system if needed — but the simpler model should make that unlikely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:57:30 +00:00
Michael	d80befd05a	feat(pdf): row-heuristic extraction (mode dispatch, no coordinates) User reported the column-visual approach is too brittle for real bank statements: column-x-positions saved against a sample page don't survive layout drift between months (statement A has columns at x=300, statement B drifted to x=320), and a saved template can only realistically work for one statement's specific render. The fundamental fix is to stop depending on coordinates at all. Row-heuristic mode finds transaction rows by pattern: any line with a date token + N amount tokens IS a transaction. Date patterns (US slash / EU slash / ISO / "Jan 15, 2026" / etc.) and amount patterns (currency, parens-negative, thousands grouping) are matched against word text — no x-positions involved. The full pipeline: 1. ``find_transaction_rows`` clusters words into rows and scans each line for date + amount tokens. 2. Multi-line descriptions still attach to the previous row via the no-date-no-amount continuation rule. 3. Amount shapes drive interpretation: ``single`` / ``txn_balance`` / ``debit_credit`` / ``debit_credit_balance``. 4. ``_infer_amount_column_centers`` clusters amount x-midpoints ACROSS ALL detected rows to find natural column groupings — so debit-vs-credit assignment for single-amount lines works without the user marking anything on screen. ``apply_template`` is now a dispatch over ``template["mode"]``: - ``mode="row_heuristic"`` (default for new templates) — the new pipeline. - ``mode="column_visual"`` — the existing pipeline, kept under ``_apply_template_column_visual`` for v1 templates and the Advanced fallback. 
18 new tests cover: date detection (US slash, two-digit year, ISO, month-name, missing); amount-token finding (currency, parens, pure text, bare-year rejection); column-center inference (clear two-column case, empty input); end-to-end on synthetic Page objects with all four amount shapes; the critical layout-drift test that proves the same template works on pages of different sizes / different absolute x-positions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:45:55 +00:00
Michael	e6ee2e3481	feat(pdf): robust Tesseract discovery + OS-aware install copy User tried ``brew install tesseract`` in PowerShell after seeing all three OSes listed inline in the OCR banner — easy mistake when the install commands are crammed on one line with ``·`` separators. Two changes pre-empt this: OS-aware OCR banner. The expander now detects the user's platform via ``platform.system()`` and shows only the relevant install instructions: - Windows: UB-Mannheim installer link, numbered steps, explicit "keep the Add to PATH checkbox on" callout, plus a fallback paragraph telling the user how to set ``DATATOOLS_TESSERACT_PATH`` if they already installed without PATH and don't want to reinstall. - macOS: ``brew install tesseract`` with a Homebrew link. - Linux: ``apt install tesseract-ocr`` with a "or your distro's equivalent" hedge. Robust binary discovery in ``ocr_available()``. Three-stage: 1. Honor ``DATATOOLS_TESSERACT_PATH`` env var if set — explicit override for portable installs or non-default locations. 2. Try ``pytesseract``'s default PATH-based lookup. 3. If PATH lookup fails, probe known Windows install paths (``C:\Program Files\Tesseract-OCR\tesseract.exe``, the x86 variant, and ``%LOCALAPPDATA%\Programs\Tesseract-OCR\``) via the new ``_autodetect_tesseract_path``. On hit, set ``pytesseract.pytesseract.tesseract_cmd`` so all subsequent ``image_to_data`` calls use the same binary without re-discovering. This means a user who runs the UB-Mannheim installer with default options but forgets the PATH checkbox will still get OCR working after a launcher restart, without env-var gymnastics. Tests (4 new, 85 total in the suite): - Auto-detect returns None on non-Windows (no false positives on dev laptops). - Auto-detect finds the binary at a mocked ``C:\Program Files\Tesseract-OCR\tesseract.exe``. - Auto-detect returns None when no candidate exists. 
- ``DATATOOLS_TESSERACT_PATH`` env var beats both PATH lookup and auto-detect (sets ``tesseract_cmd`` even when the path doesn't resolve, so a real binary at a custom location works). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:15:00 +00:00
Michael	2d927bc95f	fix(pdf): graceful fallback when PDF dependencies aren't installed User hit a hard ImportError on opening the PDF→CSV tool because ``pip install -r requirements.txt`` hadn't picked up the new ``pdfplumber`` / ``pypdfium2`` lines yet. Streamlit surfaces that as an unfiltered traceback — friendlier to show a clear install-required panel inside the tool instead. Two changes: 1. ``src/pdf_extract.py`` lazy-imports the PDF deps via ``_require_pdfplumber()`` / ``_require_pdfium()`` helpers that raise a new ``PdfDependencyMissing`` (subclass of ImportError) with an actionable ``hint`` field. Pure helpers (``parse_amount``, ``parse_date``, ``cluster_rows``, etc.) keep working with no PDF dep installed — useful for tests and for keeping module-import paths cheap. 2. The tool page probes both deps at render time via ``_pdf_deps_status()``; if anything's missing it shows a ``st.error`` panel with the exact pip command and a "restart the launcher" reminder, then ``st.stop()``s before touching any PDF code path. The page itself loads cleanly without the deps installed, so the sidebar nav doesn't 500 — the user just sees the install panel on click. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:59:20 +00:00
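The lazy-import guard in commit 2d927bc95f follows a common pattern, sketched below. ``PdfDependencyMissing`` and its ``hint`` field are named in the commit message; the generic `require_module` helper and the wording of the message and hint are hypothetical stand-ins for the real ``_require_pdfplumber()`` / ``_require_pdfium()`` helpers.

```python
import importlib


class PdfDependencyMissing(ImportError):
    """Raised when an optional PDF dependency is not installed."""

    def __init__(self, package, hint):
        super().__init__(f"{package} is required but not installed")
        self.hint = hint  # actionable text the UI can show to the user


def require_module(module_name, pip_name=None):
    """Import `module_name` on first use instead of at module import time."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        package = pip_name or module_name
        raise PdfDependencyMissing(package, f"pip install {package}") from exc
```

Because the import happens only inside the helper, pure functions in the same module stay importable without the PDF stack, and a page can catch the exception and render the hint instead of a traceback.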
Michael	b86828d791	feat(pdf): visual region picker on rendered sample page Phase 5/6. Adds a "Visual picker" tab as the first stop in the template-build flow. The sample PDF page is rasterized with ``pypdfium2`` (capped at ~900px wide for sensible display), and ``streamlit-drawable-canvas`` overlays drawing tools on top. UX: - Line mode — drag short (roughly vertical) strokes where you want columns to split. Each stroke's x-midpoint becomes one boundary in PDF point coordinates. - Rect mode — drag a rectangle around the transactions table; bbox is preserved on the template as ``visual.table_bbox`` for round-trip, future use as a hard crop region. - Transform mode — move/resize already-drawn shapes after the fact. Round-trip: re-entering Build mode with an existing template seeds the canvas with full-height vertical lines for every boundary already on the template, plus the saved bbox if any, so editing-after-save matches the user's mental model. Coordinate translation: the canvas reports pixel positions; we divide by the renderer's pixels-per-PDF-point scale to get back to PDF coordinates that ``apply_template`` already expects. No template-schema change required — the boundaries the picker writes are the same list the text-input editor wrote in commit 3, just sourced visually. New helper in the extraction module: - ``render_page_image(pdf_bytes, page_no, target_width=900)`` — rasterize a single 1-indexed page to a PIL image; returns ``(image, scale)`` for coordinate translation. The text-input boundary editor in the Columns tab remains as a fallback for power users / keyboard-only workflows and for copy-paste from spreadsheet-derived x-positions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:52:54 +00:00
Michael	b8aff862ed	feat(pdf): add pure PDF→DataFrame extraction module Phase 1/6 of the PDF Extractor tool. Pure module — no Streamlit, no user-config I/O — that turns a PDF blob plus a template dict into a ``pandas.DataFrame`` of transaction rows. Primary use case is accountant-style extraction of bank-statement transactions, where each bank's format is encoded as a reusable template. Pipeline: 1. ``extract_pages(pdf_bytes)`` reads with pdfplumber and surfaces words with bounding boxes. 2. ``cluster_rows(words)`` groups words into rows by ``top`` tolerance — no reliance on PDF table-line detection (most bank statements have no visible cell borders). 3. ``assign_columns(row_words, boundaries)`` buckets each word by its horizontal midpoint into N+1 columns defined by N interior x-boundaries. 4. ``_within_table_window`` slices to the band between the header line and the end-marker (e.g. "Closing balance"). 5. ``apply_template`` orchestrates the above, handling: - parens-style negative amounts, currency stripping, custom decimal/thousands separators - separate debit + credit columns combined into a single signed ``amount`` (credit positive, debit negative — accounting register convention; matches QuickBooks/Xero imports) - multi-line description wrapping (rows with empty date column attach to the previous row's description) - row-level regex skip filters (e.g., "Total", "Subtotal") - page-range filters ("all", "2-", "1,3-5") Optional OCR fallback for scanned statements: - ``page_has_extractable_text`` heuristic flags pages with <5 words as likely-scanned. - ``ocr_available()`` checks both the ``pytesseract`` Python binding and the Tesseract binary; surfaces a clear reason string when either is missing. - ``extract_pages_auto`` does text-first, OCR-the-blanks, and returns warnings the UI can surface. 29 unit tests cover the parsing pipeline against synthetic WordBox/Page data — no fixture PDFs required, runs in 0.1s. 
Real PDF extraction is exercised by hand on the user's statements. Dependencies added: - ``pdfplumber>=0.10,<1`` — text + position extraction - ``pypdfium2>=4,<6`` — page rasterization for OCR + visual picker - ``streamlit-drawable-canvas>=0.9,<1`` — visual region picker (used in commit 5) - ``pytesseract>=0.3,<1`` — OCR (used in commit 6; system Tesseract binary required separately) - ``cryptography>=41,<49`` — bumped upper bound; pdfminer.six transitively requires a recent release. Internal ed25519 license-signing usage is API-stable across the bump. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:44:51 +00:00
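Step 3 of the pipeline in commit b8aff862ed, bucketing each word by its horizontal midpoint into N+1 columns defined by N interior x-boundaries, is a natural fit for a binary search. The sketch below is an assumption about how such a function might work, not the module's actual code: words are modelled as plain ``(x0, x1, text)`` tuples rather than the project's WordBox objects, and each column is joined into a single string.

```python
from bisect import bisect_right


def assign_columns(row_words, boundaries):
    """Bucket words into len(boundaries) + 1 columns by x-midpoint.

    `row_words` is a list of (x0, x1, text) tuples for one clustered row;
    `boundaries` is a list of interior x-positions in PDF points.
    """
    edges = sorted(boundaries)
    columns = [[] for _ in range(len(edges) + 1)]
    # Sorting by x0 keeps the words in each column in reading order.
    for x0, x1, text in sorted(row_words):
        midpoint = (x0 + x1) / 2.0
        columns[bisect_right(edges, midpoint)].append(text)
    return [" ".join(words) for words in columns]
```

With boundaries at x=100 and x=400, a row holding a date near x=65, two description words between them, and an amount near x=440 splits into three cells.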

15 Commits