refactor(pdf): rip out templates; heuristic scan + selectable table

User feedback: the template / visual-picker / mode-dispatch implementation was too complex for the actual workflow. Statements drift between months, the canvas state didn't survive multi-page navigation, and accountants don't want to maintain per-bank configuration just to convert PDFs to CSV.

Start-over design: one public function, one page, no persistence:

    ``scan_pdf_for_transactions(pdf_bytes) → (rows, warnings)``

A row is "any text line with a date pattern AND at least one amount pattern." Each detected row is a dict shaped::

    {
        "date": "2026-01-15",
        "description": "Coffee Shop",
        "amount_1": -4.50,
        "amount_2": 1000.00,   # if a second amount was found
        "page": 1,
        "raw": "01/15/2026 Coffee Shop (4.50) 1,000.00",
        "source_file": "chase-jan-2026.pdf",
    }

Multi-line descriptions still merge (no-date no-amount lines attach to the previous transaction). Multi-PDF batches share a single combined table with a ``source_file`` column.

**Page UX:**

- Upload PDF(s) → optional Options expander (parens-negative, use-OCR) → click Scan → see all detected rows in an ``st.data_editor``.
- The editor has an ``Include`` checkbox column (default on), plus user-editable date / description / amount cells and a read-only ``raw`` column showing the original PDF text for verification.
- A ``Columns to include in CSV`` multiselect hides ``page`` / ``raw`` from the download by default; user can re-add either.
- Download CSV gets only the checked rows.

No template save/load. No visual picker. No mode dispatch. No column boundaries. No schema migration. No per-bank configuration files.

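For reviewers, the row rule and the continuation merge described above can be sketched on plain text lines. This is an illustration only, not the code in ``src/pdf_extract.py``: ``DATE_RE``, ``AMOUNT_RE`` and ``scan_lines`` are hypothetical names, the patterns cover only ``MM/DD/YYYY`` dates and two-decimal amounts, and the ``parse_amount`` signature shown here is assumed.

```python
import re
from datetime import datetime

# Hypothetical patterns, for illustration only. The real token finders
# may accept more date and amount formats than these two.
DATE_RE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")
AMOUNT_RE = re.compile(r"\(\$?[\d,]+\.\d{2}\)|-?\$?[\d,]+\.\d{2}")


def parse_amount(token, negative_in_parens=True):
    """Return a float, or None when the token cannot be decoded."""
    negative = False
    if token.startswith("(") and token.endswith(")"):
        if not negative_in_parens:
            return None  # caller keeps the raw token instead
        negative = True
        token = token[1:-1]
    try:
        value = float(token.replace("$", "").replace(",", ""))
    except ValueError:
        return None
    return -value if negative else value


def scan_lines(lines, page=1, negative_in_parens=True):
    """Apply the 'date AND at least one amount' rule to text lines."""
    rows = []
    for line in lines:
        date_match = DATE_RE.search(line)
        amounts = AMOUNT_RE.findall(line)
        if date_match and amounts:
            # Description is whatever remains once the date and the
            # amount tokens are removed from the line.
            rest = AMOUNT_RE.sub(" ", line.replace(date_match.group(), " ", 1))
            parsed = datetime.strptime(date_match.group(), "%m/%d/%Y")
            row = {
                "date": parsed.date().isoformat(),
                "description": " ".join(rest.split()),
                "page": page,
                "raw": line,
            }
            for i, token in enumerate(amounts, start=1):
                value = parse_amount(token, negative_in_parens)
                row[f"amount_{i}"] = token if value is None else value
            rows.append(row)
        elif rows and not date_match and not amounts and line.strip():
            # Continuation line: attach to the previous transaction.
            rows[-1]["description"] += " " + " ".join(line.split())
    return rows


SAMPLE = [
    "Date Description Amount",
    "01/15/2026 Coffee Shop (4.50) 1,000.00",
    "Vendor memo",
    "01/16/2026 Refund Vendor $12.00",
    "Closing balance: $1,000.00",
]
rows = scan_lines(SAMPLE)
```

In this sketch the header line is skipped because it has no amount, and the closing-balance line is skipped because it has no date. Note that any later line with neither a date nor an amount would also be merged into the preceding description, which is the trade-off of a heuristic this simple.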

**Deletions:**

- ``src/pdf_templates.py``: template storage layer
- ``src/gui/_drawable_canvas_compat.py``: Streamlit compat shim for the canvas (no canvas now)
- ``tests/test_pdf_templates.py``, ``test_pdf_row_heuristic.py``, ``test_drawable_canvas_compat.py``: covered the removed APIs
- ``build/hooks/hook-streamlit_drawable_canvas.py``: hook for the removed dep
- ``streamlit-drawable-canvas==0.9.3`` from ``requirements.txt``
- The drawable-canvas references in ``build/datatools.spec``

**``src/pdf_extract.py``** shrinks from ~30 helper functions to ~10. Keeps: value parsers, row clusterer, date/amount token finders, OCR pipeline, dependency guards. The one new public function ``scan_pdf_for_transactions`` glues them together.

**Tests** (59 passing): the unit layer keeps full coverage of the building blocks; the smoke layer pins the end-to-end PDF roundtrip, OCR discovery, dependency-import behavior, and the multi-line-description merge. The fpdf2-generated fixture PDF still drives the real-PDF test.

Rollback: ``git revert HEAD`` brings back the template system if needed, but the simpler model should make that unlikely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:57:30 +00:00
parent 60969c0770
commit bece2b4030
12 changed files with 729 additions and 3632 deletions
--- a/tests/test_pdf_extract_smoke.py
+++ b/tests/test_pdf_extract_smoke.py
@@ -1,55 +1,43 @@
-"""End-to-end smoke tests for the PDF extraction stack.
+"""End-to-end smoke tests for the PDF transaction scanner.

-These tests run real ``pdfplumber`` + ``pypdfium2`` calls against
-a small PDF generated in-memory with ``fpdf2``. They exist to
-catch the failure mode the user hit on first install — a missing
-or mismatched native dependency that doesn't show up until the
-extractor actually tries to open a PDF.
+These run real ``pdfplumber`` + ``pypdfium2`` (when OCR is in play)
+calls against a small statement-shaped PDF generated in memory
+with ``fpdf2``. They catch the failure modes most likely to bite
+an end-user installer build: missing native lib, broken hook
+bundling, pin/installed mismatch.

-Per ``project-pdf-extractor`` memory: ``test_pdf_extract.py``
-covers the parsing logic on synthetic ``WordBox`` data with no
-PDF dep involved. This file is the layer above: it confirms the
-deps themselves work, that hooks bundled them correctly (the
-versions pinned in ``requirements.txt`` matter here), and that
-the extractor's pipeline survives a round-trip through real
-``pdfplumber.extract_words`` and real ``pypdfium2.render``.
-
-Generation note: ``fpdf2`` is a test-only dep listed in
+Generation note: ``fpdf2`` is a test-only dep in
 ``requirements-dev.txt``. We don't ship it.
 """

 from __future__ import annotations

-import io
-
 import pytest


 def _build_tiny_statement_pdf() -> bytes:
-    """Render a one-page PDF that looks roughly like the simplest
-    possible bank statement: a header line + three transaction
-    rows + a closing-balance footer. Word positions are stable
-    enough that the parser can identify columns by x-position."""
+    """One-page PDF: header line + three transaction rows + a
+    closing-balance footer. The scanner should pick up exactly the
+    three transactions."""
    from fpdf import FPDF

    pdf = FPDF(orientation="P", unit="pt", format="letter")
    pdf.add_page()
    pdf.set_font("Helvetica", size=12)
-    # Header
    pdf.set_xy(40, 50)
    pdf.cell(0, 14, "ACME BANK STATEMENT", new_x="LMARGIN", new_y="NEXT")
-    # Transaction-table header row
+    # Header row (not a transaction — no amount)
    pdf.set_xy(40, 100)
    pdf.cell(120, 14, "Date")
    pdf.set_xy(160, 100)
    pdf.cell(200, 14, "Description")
    pdf.set_xy(360, 100)
    pdf.cell(80, 14, "Amount")
-    # Three rows
+    # Three transactions
    rows = [
-        ("01/15/2026", "Coffee Shop",   "(4.50)"),
-        ("01/16/2026", "Refund Vendor", "$12.00"),
-        ("01/17/2026", "ATM Withdrawal","(40.00)"),
+        ("01/15/2026", "Coffee Shop",     "(4.50)"),
+        ("01/16/2026", "Refund Vendor",   "$12.00"),
+        ("01/17/2026", "ATM Withdrawal",  "(40.00)"),
    ]
    y = 130
    for date, desc, amt in rows:
@@ -60,7 +48,7 @@ def _build_tiny_statement_pdf() -> bytes:
        pdf.set_xy(360, y)
        pdf.cell(80, 14, amt)
        y += 20
-    # Closing-balance footer
+    # Footer: has an amount but no date, so it is not a transaction
    pdf.set_xy(40, y + 20)
    pdf.cell(0, 14, "Closing balance: $1,000.00")
    return bytes(pdf.output())
@@ -72,12 +60,8 @@ def _build_tiny_statement_pdf() -> bytes:


 class TestDependencyImports:
-    """Each runtime PDF dep must be importable.
-
-    These tests will fail fast on a stripped/broken install — most
-    valuable as a CI gate when the requirements.txt pins are
-    bumped, so we know the new pin still installs cleanly across
-    the matrix."""
+    """Each runtime PDF dep must be importable. Fails fast on a
+    stripped install or a missing CI pin."""

    def test_pdfplumber(self):
        import pdfplumber  # noqa: F401
@@ -85,130 +69,135 @@ class TestDependencyImports:
    def test_pypdfium2(self):
        import pypdfium2  # noqa: F401

-    def test_streamlit_drawable_canvas(self):
-        # Don't instantiate the canvas — that needs a Streamlit
-        # script-run context. Just confirm the module loads.
-        import streamlit_drawable_canvas  # noqa: F401
-
    def test_pytesseract(self):
-        # The Python binding must import even when the Tesseract
-        # binary isn't installed — the OCR availability check
-        # handles binary absence separately.
        import pytesseract  # noqa: F401

    def test_PIL(self):
-        # Transitively required by pdfplumber + pypdfium2 + canvas.
-        # Pinning explicit confirms hooks pull it through.
        from PIL import Image  # noqa: F401


 # ---------------------------------------------------------------------------
-# Real-PDF round-trip
+# End-to-end against a real PDF
 # ---------------------------------------------------------------------------


-class TestRealPdfRoundTrip:
-    """``extract_pages`` + ``apply_template`` against a real PDF."""
-
+class TestScanPdfForTransactions:
    @pytest.fixture
    def pdf_bytes(self) -> bytes:
        return _build_tiny_statement_pdf()

-    def test_extract_pages_returns_words(self, pdf_bytes):
-        from src.pdf_extract import extract_pages
-        pages = extract_pages(pdf_bytes)
-        assert len(pages) == 1
-        assert pages[0].width > 0 and pages[0].height > 0
-        # At minimum we should have the words from the header and
-        # one transaction row — proves pdfplumber wired up.
-        all_text = " ".join(w.text for w in pages[0].words)
-        assert "ACME" in all_text
-        assert "Coffee" in all_text
-        assert "01/15/2026" in all_text
+    def test_finds_three_transactions(self, pdf_bytes):
+        from src.pdf_extract import scan_pdf_for_transactions
+        rows, warnings = scan_pdf_for_transactions(pdf_bytes)
+        # The PDF has 3 transactions plus a header and a closing-
+        # balance footer. Header has no amount; closing-balance has
+        # no date in the same line — neither qualifies as a txn.
+        assert len(rows) == 3, (
+            f"expected 3 rows, got {len(rows)}:\n"
+            f"{[r.get('raw') for r in rows]}"
+        )

-    def test_apply_template_extracts_three_rows(self, pdf_bytes):
-        from src.pdf_extract import apply_template, extract_pages
-        # The template's column boundaries are tuned to fpdf2's
-        # x-coordinates above (40 / 160 / 360 pt).
-        tpl = {
-            "pages": {"range": "all"},
-            "table": {
-                "header_text": "Date Description Amount",
-                "end_markers": ["Closing balance"],
-                "column_boundaries": [150, 350],
-                "y_tolerance": 3.0,
-            },
-            "columns": [
-                {"source": 0, "target": "date"},
-                {"source": 1, "target": "description"},
-                {"source": 2, "target": "amount"},
-            ],
-            "parse": {
-                "date_format": "%m/%d/%Y",
-                "amount_negative_in_parens": True,
-                "merge_multiline_description": True,
-            },
-        }
-        pages = extract_pages(pdf_bytes)
-        df = apply_template(pages, tpl)
-        assert len(df) == 3, f"expected 3 rows, got {len(df)}:\n{df}"
-        assert list(df["date"]) == [
+    def test_parses_dates_to_iso(self, pdf_bytes):
+        from src.pdf_extract import scan_pdf_for_transactions
+        rows, _ = scan_pdf_for_transactions(pdf_bytes)
+        assert [r["date"] for r in rows] == [
            "2026-01-15", "2026-01-16", "2026-01-17",
        ]
-        # Parens-negative + currency-positive both round-trip
-        assert df.iloc[0]["amount"] == -4.50
-        assert df.iloc[1]["amount"] == 12.00
-        assert df.iloc[2]["amount"] == -40.00
+
+    def test_parses_amounts_with_signs(self, pdf_bytes):
+        from src.pdf_extract import scan_pdf_for_transactions
+        rows, _ = scan_pdf_for_transactions(pdf_bytes)
+        assert rows[0]["amount_1"] == -4.50
+        assert rows[1]["amount_1"] == 12.00
+        assert rows[2]["amount_1"] == -40.00
+
+    def test_preserves_raw_line(self, pdf_bytes):
+        from src.pdf_extract import scan_pdf_for_transactions
+        rows, _ = scan_pdf_for_transactions(pdf_bytes)
+        # Raw line lets the user verify what was matched.
+        assert all("raw" in r and r["raw"] for r in rows)
+        assert "Coffee" in rows[0]["raw"]
+
+    def test_page_tagged(self, pdf_bytes):
+        from src.pdf_extract import scan_pdf_for_transactions
+        rows, _ = scan_pdf_for_transactions(pdf_bytes)
+        assert all(r["page"] == 1 for r in rows)
+
+    def test_negative_in_parens_off(self, pdf_bytes):
+        """With parens-negative off, the parser can't decode
+        ``(4.50)`` and falls back to the raw text — the row still
+        surfaces, just with the unparsed string in the amount slot
+        so the user can see and fix it in the editor."""
+        from src.pdf_extract import scan_pdf_for_transactions
+        rows, _ = scan_pdf_for_transactions(
+            pdf_bytes, negative_in_parens=False,
+        )
+        # Row 0 had "(4.50)" — without parens-negative, parse_amount
+        # returns None and the scanner keeps the raw token.
+        assert rows[0]["amount_1"] == "(4.50)"
+        # Row 1 had "$12.00" — still parses to positive.
+        assert rows[1]["amount_1"] == 12.00


 # ---------------------------------------------------------------------------
-# pypdfium2 rendering (powers the visual picker)
+# Multi-line description merging
 # ---------------------------------------------------------------------------


-class TestRenderPageImage:
-    """``render_page_image`` is what feeds the drawable canvas.
+class TestMultilineDescription:
+    def test_continuation_line_merges(self):
+        """A line with no date and no amount, sitting between two
+        transaction rows, attaches to the previous transaction's
+        description."""
+        from src.pdf_extract import (
+            Page,
+            WordBox,
+            scan_pdf_for_transactions,
+        )
+        # Building a real multi-line PDF is awkward, so swap in a
+        # fake ``extract_pages_auto`` that returns one synthetic
+        # page of WordBoxes. The scanner then runs its normal
+        # row-clustering and merge logic on those words.
+        from src import pdf_extract as mod

-    Catches the most common installer-bug: native PDFium .dll/.so
-    missing from the bundle. If this test crashes with a
-    ``FileNotFoundError`` it almost always means the
-    ``hook-pypdfium2.py`` didn't pick up the shared lib."""
+        original = mod.extract_pages_auto

-    def test_renders_a_real_pil_image(self):
-        from src.pdf_extract import render_page_image
-        pdf_bytes = _build_tiny_statement_pdf()
-        image, scale = render_page_image(pdf_bytes, page_no=1)
-        # Letter-size at scale ≈ 900/612 ≈ 1.47 → ~900px wide.
-        assert image.width > 800
-        assert image.height > 800
-        assert scale > 0
-        # PIL Image is duck-typed; check the attrs we depend on.
-        assert hasattr(image, "save")
-        assert hasattr(image, "tobytes")
+        def fake(_pdf_bytes, *, allow_ocr=True):
+            words = [
+                WordBox(x0=0, top=0, x1=80, bottom=10, text="01/15/2026"),
+                WordBox(x0=100, top=0, x1=160, bottom=10, text="Coffee"),
+                WordBox(x0=200, top=0, x1=240, bottom=10, text="$4.50"),
+                # Continuation: no date, no amount
+                WordBox(x0=100, top=20, x1=160, bottom=30, text="Vendor"),
+                WordBox(x0=170, top=20, x1=230, bottom=30, text="memo"),
+                # Next transaction
+                WordBox(x0=0, top=40, x1=80, bottom=50, text="01/16/2026"),
+                WordBox(x0=100, top=40, x1=160, bottom=50, text="Other"),
+                WordBox(x0=200, top=40, x1=240, bottom=50, text="$10.00"),
+            ]
+            return [Page(
+                page_no=1, width=300, height=100, text="", words=words,
+            )], []

-    def test_invalid_page_number_clamps(self):
-        from src.pdf_extract import render_page_image
-        pdf_bytes = _build_tiny_statement_pdf()
-        # PDF has 1 page; page_no=99 should clamp, not raise.
-        image, scale = render_page_image(pdf_bytes, page_no=99)
-        assert image.width > 0
+        mod.extract_pages_auto = fake
+        try:
+            rows, _ = scan_pdf_for_transactions(b"")
+        finally:
+            mod.extract_pages_auto = original
+
+        assert len(rows) == 2
+        assert "Vendor memo" in rows[0]["description"]
+        assert rows[1]["description"] == "Other"


 # ---------------------------------------------------------------------------
-# Graceful-fallback behavior
+# Graceful fallback when deps absent
 # ---------------------------------------------------------------------------


 class TestPdfDependencyMissing:
-    """The page should see a clean exception when a dep is absent,
-    not a raw ``ImportError`` that leaks into the Streamlit traceback."""
-
    def test_require_pdfplumber_raises_typed_on_absence(self, monkeypatch):
        from src import pdf_extract
-        # Simulate "pdfplumber not installed" without uninstalling.
-        # ``_require_pdfplumber`` does its own ``import pdfplumber``
-        # at call time; patch ``__import__`` to throw for that one
-        # name only.
        import builtins
        real_import = builtins.__import__

@@ -218,10 +207,10 @@ class TestPdfDependencyMissing:
            return real_import(name, *a, **kw)

        monkeypatch.setattr(builtins, "__import__", fake_import)
-        with pytest.raises(pdf_extract.PdfDependencyMissing) as exc_info:
+        with pytest.raises(pdf_extract.PdfDependencyMissing) as exc:
            pdf_extract._require_pdfplumber()
-        assert "pdfplumber" in str(exc_info.value)
-        assert exc_info.value.hint  # actionable hint must be populated
+        assert "pdfplumber" in str(exc.value)
+        assert exc.value.hint

    def test_require_pdfium_raises_typed_on_absence(self, monkeypatch):
        from src import pdf_extract
@@ -239,17 +228,13 @@ class TestPdfDependencyMissing:


 # ---------------------------------------------------------------------------
-# Requirements-pin consistency
+# Requirements pin consistency
 # ---------------------------------------------------------------------------


 class TestPinnedVersionsMatchInstalled:
    """If someone bumps the pin in ``requirements.txt`` without
-    actually reinstalling, this test points it out before CI does.
-
-    Uses ``importlib.metadata`` rather than each library's
-    ``__version__`` attribute because not every PDF dep exposes
-    one (``pypdfium2`` keeps version info on a submodule)."""
+    actually reinstalling, this test points it out before CI does."""

    def _parse_pins(self) -> dict[str, str]:
        from pathlib import Path
@@ -266,21 +251,17 @@ class TestPinnedVersionsMatchInstalled:
                pins[name.strip()] = version.strip()
        return pins

-    def _installed(self, dist_name: str) -> str:
-        import importlib.metadata as md
-        return md.version(dist_name)
-
    @pytest.mark.parametrize("dist_name", [
        "pdfplumber",
        "pypdfium2",
        "pytesseract",
-        "streamlit-drawable-canvas",
    ])
    def test_pin_matches_installed(self, dist_name):
+        import importlib.metadata as md
        pins = self._parse_pins()
        if dist_name not in pins:
            pytest.skip(f"{dist_name} not exact-pinned in requirements.txt")
-        installed = self._installed(dist_name)
+        installed = md.version(dist_name)
        assert installed == pins[dist_name], (
            f"installed {dist_name}=={installed} but requirements.txt "
            f"pins {pins[dist_name]} — bump the pin, or reinstall."
@@ -288,79 +269,52 @@ class TestPinnedVersionsMatchInstalled:


 # ---------------------------------------------------------------------------
-# OCR availability runtime probe
+# OCR availability
 # ---------------------------------------------------------------------------


 class TestOcrAvailability:
-    """``ocr_available`` is the linchpin of the UI's OCR banner.
-    Returns ``(bool, str)`` — both branches must round-trip."""
-
    def test_returns_a_tuple(self):
        from src.pdf_extract import ocr_available
        result = ocr_available()
-        assert isinstance(result, tuple)
-        assert len(result) == 2
+        assert isinstance(result, tuple) and len(result) == 2
        ok, reason = result
        assert isinstance(ok, bool)
        assert isinstance(reason, str)

    def test_extract_pages_auto_skips_ocr_when_disabled(self):
        from src.pdf_extract import extract_pages_auto
-        # With allow_ocr=False, no OCR even if pages are blank.
        pdf_bytes = _build_tiny_statement_pdf()
        pages, warnings = extract_pages_auto(pdf_bytes, allow_ocr=False)
        assert len(pages) == 1
-        # No OCR-disabled warning on a text PDF, since pages have text.
        assert not any("OCR is disabled" in w for w in warnings)


 class TestTesseractDiscovery:
-    """Windows install paths + env-var override are how a real user
-    (no PATH munging) gets OCR working. Cover the discovery logic
-    even on Linux/macOS test runners by mocking out the OS check
-    and ``Path.exists``."""
-
    def test_autodetect_returns_none_on_non_windows(self, monkeypatch):
        from src import pdf_extract
-        monkeypatch.setattr(
-            "platform.system",
-            lambda: "Linux",
-        )
+        monkeypatch.setattr("platform.system", lambda: "Linux")
        assert pdf_extract._autodetect_tesseract_path() is None

    def test_autodetect_finds_program_files_on_windows(self, monkeypatch):
        from src import pdf_extract
        monkeypatch.setattr("platform.system", lambda: "Windows")
-
        target = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

        def fake_exists(self):
            return str(self) == target

-        monkeypatch.setattr(
-            "pathlib.Path.exists",
-            fake_exists,
-        )
+        monkeypatch.setattr("pathlib.Path.exists", fake_exists)
        assert pdf_extract._autodetect_tesseract_path() == target

-    def test_autodetect_returns_none_when_nothing_installed(
-        self, monkeypatch,
-    ):
+    def test_autodetect_returns_none_when_nothing_installed(self, monkeypatch):
        from src import pdf_extract
        monkeypatch.setattr("platform.system", lambda: "Windows")
        monkeypatch.setattr("pathlib.Path.exists", lambda self: False)
        assert pdf_extract._autodetect_tesseract_path() is None

    def test_env_var_override_takes_precedence(self, monkeypatch, tmp_path):
-        """``DATATOOLS_TESSERACT_PATH`` wins over discovery so a
-        portable install at a non-default path works without
-        relying on PATH."""
        from src import pdf_extract
-        # Point the override at a path that doesn't exist —
-        # ocr_available will try it and report the failure, but
-        # importantly the cmd attribute is set BEFORE the call,
-        # which is what we're verifying.
        fake_bin = str(tmp_path / "fake-tesseract.exe")
        monkeypatch.setenv("DATATOOLS_TESSERACT_PATH", fake_bin)
        pdf_extract.ocr_available()