refactor(pdf): rip out templates; heuristic scan + selectable table

User feedback: the template / visual-picker / mode-dispatch
implementation was too complex for the actual workflow. Statements
drift between months, the canvas state didn't survive multi-page
navigation, and accountants don't want to maintain per-bank
configuration just to convert PDFs to CSV.

Start-over design (one public function, one page, no persistence):

    ``scan_pdf_for_transactions(pdf_bytes) → (rows, warnings)``

A row is "any text line with a date pattern AND at least one amount
pattern." Each detected row is a dict shaped::

    {
      "date": "2026-01-15",
      "description": "Coffee Shop",
      "amount_1": -4.50,
      "amount_2": 1000.00,   # if a second amount was found
      "page": 1,
      "raw": "01/15/2026 Coffee Shop (4.50) 1,000.00",
      "source_file": "chase-jan-2026.pdf",
    }

Multi-line descriptions still merge (no-date no-amount lines attach to
the previous transaction). Multi-PDF batches share a single combined
table with a ``source_file`` column.

**Page UX:**

- Upload PDF(s) → optional Options expander (parens-negative, use-OCR)
  → click Scan → see all detected rows in an ``st.data_editor``.
- The editor has an ``Include`` checkbox column (default on), plus
  user-editable date / description / amount cells and a read-only
  ``raw`` column showing the original PDF text for verification.
- A ``Columns to include in CSV`` multiselect hides ``page`` / ``raw``
  from the download by default; user can re-add either.
- Download CSV gets only the checked rows.

No template save/load. No visual picker. No mode dispatch. No column
boundaries. No schema migration. No per-bank configuration files.

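The row rule and the multi-line merge described above can be sketched
roughly as follows. This is an illustration under stated assumptions,
not the code in ``src/pdf_extract.py``: ``scan_lines``,
``parse_amount``, ``DATE_RE`` and ``AMOUNT_RE`` are hypothetical names,
the sketch takes already-extracted text lines rather than PDF bytes,
and it only recognizes ``MM/DD/YYYY`` dates and two-decimal amounts.

```python
import re
from datetime import datetime

# Hypothetical patterns for illustration; the token finders kept in
# src/pdf_extract.py may recognize more formats than these.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
AMOUNT_RE = re.compile(r"\(?-?\$?\d+(?:,\d{3})*\.\d{2}\)?")


def parse_amount(token, parens_negative=True):
    """Convert an amount token such as "(4.50)" or "1,000.00" to a float."""
    in_parens = token.startswith("(") and token.endswith(")")
    bare = token.strip("()")
    negative = bare.startswith("-") or (parens_negative and in_parens)
    value = float(bare.lstrip("-").replace("$", "").replace(",", ""))
    return -value if negative else value


def scan_lines(lines, page=1, source_file="", parens_negative=True):
    """Return (rows, warnings) for the text lines of one page."""
    rows, warnings = [], []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        date_match = DATE_RE.search(text)
        tokens = AMOUNT_RE.findall(text)
        if date_match and tokens:
            try:
                date = datetime.strptime(date_match.group(0), "%m/%d/%Y").date()
            except ValueError:
                warnings.append(f"page {page}: unparseable date in {text!r}")
                continue
            # What remains after removing the date and the amounts
            # is treated as the description.
            rest = text[: date_match.start()] + " " + text[date_match.end():]
            description = " ".join(AMOUNT_RE.sub(" ", rest).split())
            row = {"date": date.isoformat(), "description": description}
            for i, token in enumerate(tokens, start=1):
                row[f"amount_{i}"] = parse_amount(token, parens_negative)
            row.update(page=page, raw=text, source_file=source_file)
            rows.append(row)
        elif not date_match and not tokens and rows:
            # No date and no amount: continuation of the previous row.
            merged = rows[-1]["description"] + " " + text
            rows[-1]["description"] = merged.strip()
    return rows, warnings


sample = [
    "Account summary",
    "01/15/2026 Coffee Shop (4.50) 1,000.00",
    "Card ending 1234",
    "01/16/2026 Payroll 2,500.00 3,500.00",
]
rows, warnings = scan_lines(sample, source_file="chase-jan-2026.pdf")
```

In this sketch a line that has a date but no amount (or an amount but
no date) is neither a row nor a continuation, so it is dropped; lines
before the first detected row are dropped as well.
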
**Deletions:**

- ``src/pdf_templates.py``: template storage layer
- ``src/gui/_drawable_canvas_compat.py``: Streamlit compat shim for the
  canvas (no canvas now)
- ``tests/test_pdf_templates.py``, ``test_pdf_row_heuristic.py``,
  ``test_drawable_canvas_compat.py``: covered the removed APIs
- ``build/hooks/hook-streamlit_drawable_canvas.py``: hook for the
  removed dep
- ``streamlit-drawable-canvas==0.9.3`` from ``requirements.txt``
- The drawable-canvas references in ``build/datatools.spec``

**``src/pdf_extract.py``** shrinks from ~30 helper functions to ~10.
Keeps: value parsers, row clusterer, date/amount token finders, OCR
pipeline, dependency guards. The one new public function
``scan_pdf_for_transactions`` glues them together.

**Tests** (59 passing): the unit layer keeps full coverage of the
building blocks; the smoke layer pins the end-to-end PDF roundtrip, OCR
discovery, dependency-import behavior, and the multi-line-description
merge. The fpdf2-generated fixture PDF still drives the real-PDF test.

Rollback: ``git revert HEAD`` brings back the template system if
needed, but the simpler model should make that unlikely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:57:30 +00:00
parent 60969c0770
commit bece2b4030
12 changed files with 729 additions and 3632 deletions
--- a/src/gui/_drawable_canvas_compat.py
+++ b/src/gui/_drawable_canvas_compat.py
@@ -1,86 +0,0 @@
-"""Compatibility shim for streamlit-drawable-canvas on modern Streamlit.
-
-``streamlit-drawable-canvas`` 0.9.3 (last release 2023) calls
-``streamlit.elements.image.image_to_url(image, width, clamp,
-channels, output_format, image_id)``. Streamlit ~1.30+ moved this
-helper out of ``streamlit.elements.image`` and changed its
-signature so the second positional argument is now a
-``LayoutConfig`` dataclass instead of a plain ``int`` width.
-
-The canvas package hasn't been updated, so on modern Streamlit
-its very first call fails with::
-
-    AttributeError: module 'streamlit.elements.image'
-        has no attribute 'image_to_url'
-
-This module re-attaches a wrapper at the old import path that
-adapts the old call shape to the new function. Import it once
-before any ``st_canvas`` call; idempotent.
-
-The shim is opt-in (not auto-installed at module import) so the
-audit log of "I patched a third-party internal" is visible in
-``grep`` rather than silently happening on every page load.
-"""
-
-from __future__ import annotations
-
-
-_PATCHED = False
-
-
-def install() -> None:
-    """Install the ``image_to_url`` compatibility shim.
-
-    Idempotent — safe to call multiple times. Returns silently
-    if the canvas package or Streamlit can't be imported (lets
-    the caller handle the "PDF deps missing" path on its own).
-    """
-    global _PATCHED
-    if _PATCHED:
-        return
-
-    try:
-        import streamlit.elements.image as _old_image_module
-    except ImportError:
-        return
-
-    # Already present (old Streamlit, or already shimmed) — bail.
-    if hasattr(_old_image_module, "image_to_url"):
-        _PATCHED = True
-        return
-
-    try:
-        from streamlit.elements.lib.image_utils import (
-            image_to_url as _new_image_to_url,
-        )
-        from streamlit.elements.lib.layout_utils import LayoutConfig
-    except ImportError:
-        # ``image_to_url`` is in some other location we don't know
-        # about yet — let the canvas surface its own error so we
-        # learn where to look. Don't fail silently.
-        return
-
-    def _shim(
-        image,
-        width,
-        clamp,
-        channels,
-        output_format,
-        image_id,
-    ) -> str:
-        """Old API → new API. The old ``width=-1`` sentinel meant
-        "use the image's natural width", which is also the new
-        function's default behavior when ``LayoutConfig`` is left
-        unconfigured."""
-        layout = LayoutConfig()
-        return _new_image_to_url(
-            image,
-            layout,
-            clamp,
-            channels,
-            output_format,
-            image_id,
-        )
-
-    _old_image_module.image_to_url = _shim
-    _PATCHED = True
--- a/src/gui/pages/10_PDF_Extractor.py
+++ b/src/gui/pages/10_PDF_Extractor.py
--- a/src/pdf_extract.py
+++ b/src/pdf_extract.py
--- a/src/pdf_templates.py
+++ b/src/pdf_templates.py
@@ -1,508 +0,0 @@
-"""PDF extract template storage.
-
-Templates encode "how to read this bank's statements" — page
-range, table window markers, column x-positions, target field
-mapping, amount/date parse options. They live as JSON files in
-``~/.datatools/pdf_templates/`` so an accountant can build one
-per source and reuse it for every statement that follows the
-same layout. Templates are portable: the ``export`` / ``import``
-flow is just a file copy of the JSON.
-
-The schema is intentionally a plain dict (not a frozen dataclass)
-because the GUI mutates it incrementally during the build flow.
-``validate_template`` enforces the contract at save time.
-
-Schema (``schema_version: 1``)::
-
-    {
-      "schema_version": 1,
-      "slug": "chase-personal-checking",
-      "name": "Chase Personal Checking",
-      "notes": "",
-      "created_at": "<iso8601>",
-      "updated_at": "<iso8601>",
-      "pages": {
-        "range": "all" | "1-3" | "2,4,6-",
-        "skip_matching": "<regex>"
-      },
-      "table": {
-        "header_text": "<text containing all header words>",
-        "end_markers": ["<regex>", ...],
-        "column_boundaries": [x0, x1, ...],
-        "y_tolerance": 3.0,
-        "skip_rows_matching": ["<regex>", ...]
-      },
-      "columns": [
-        {"source": 0, "target": "date"},
-        ...
-        # ``target`` is one of: date | description | amount |
-        # amount_debit | amount_credit | balance | <free text>
-      ],
-      "parse": {
-        "date_format": "%m/%d/%Y",
-        "date_formats": [],
-        "decimal_separator": ".",
-        "thousands_separator": ",",
-        "currency_strip": "$",
-        "amount_negative_in_parens": true,
-        "merge_multiline_description": true
-      },
-      "visual": {
-        "page_width": 612.0,
-        "page_height": 792.0,
-        "sample_page": 1,
-        "table_bbox": [x0, top, x1, bottom] | null
-      }
-    }
-
-The ``visual`` block is preserved across save/load so the build
-UI can round-trip the user's last visual-picker state.
-"""
-
-from __future__ import annotations
-
-import json
-import os
-import re
-import tempfile
-from datetime import datetime, timezone
-from pathlib import Path
-from typing import Any
-
-
-SCHEMA_VERSION = 2
-
-# Backward-compatible versions ``load_template`` will accept.
-# v1 templates predate the row-heuristic shift and are loaded as
-# ``mode="column_visual"``; they're not auto-migrated on disk, so
-# the user keeps their canonical original until they re-save.
-_LOAD_SUPPORTED_VERSIONS = frozenset({1, 2})
-
-# Extraction modes. ``row_heuristic`` is the default for new
-# templates — finds transactions by date+amount pattern matching
-# with no coordinate dependency. ``column_visual`` is the legacy
-# x-position-boundary approach, kept for old templates and for
-# the "Advanced" build-mode fallback when the heuristic misfires.
-VALID_MODES = frozenset({"row_heuristic", "column_visual"})
-
-# Amount shapes for row_heuristic mode. The GUI offers these as a
-# dropdown; the parser uses them to assign amount tokens to fields.
-VALID_AMOUNT_SHAPES = frozenset({
-    "single",
-    "txn_balance",
-    "debit_credit",
-    "debit_credit_balance",
-})
-
-VALID_TARGETS = frozenset({
-    "date",
-    "description",
-    "amount",
-    "amount_debit",
-    "amount_credit",
-    "balance",
-    "type",
-})
-
-
-# ---------------------------------------------------------------------------
-# Filesystem layout
-# ---------------------------------------------------------------------------
-
-
-def templates_dir() -> Path:
-    """Return ``~/.datatools/pdf_templates/``. Override via the
-    ``DATATOOLS_PDF_TEMPLATES_DIR`` env var (used by tests)."""
-    override = os.environ.get("DATATOOLS_PDF_TEMPLATES_DIR")
-    if override:
-        return Path(override)
-    try:
-        return Path.home() / ".datatools" / "pdf_templates"
-    except Exception:
-        return Path(tempfile.gettempdir()) / "datatools-pdf-templates"
-
-
-def template_path(slug: str) -> Path:
-    """Resolve *slug* to its on-disk JSON path."""
-    return templates_dir() / f"{slug}.json"
-
-
-# ---------------------------------------------------------------------------
-# Slugify
-# ---------------------------------------------------------------------------
-
-
-_SLUG_STRIP = re.compile(r"[^a-z0-9]+")
-
-
-def slugify(name: str) -> str:
-    """Make a filesystem-safe slug from a human-friendly name."""
-    s = (name or "").strip().lower()
-    s = _SLUG_STRIP.sub("-", s).strip("-")
-    return s or "untitled"
-
-
-# ---------------------------------------------------------------------------
-# Construction + defaults
-# ---------------------------------------------------------------------------
-
-
-def new_template(name: str) -> dict[str, Any]:
-    """Build a blank template with sensible defaults.
-
-    Defaults to ``mode="row_heuristic"`` — the simpler, more
-    robust approach. The GUI's build flow lets the user switch to
-    ``mode="column_visual"`` if the heuristic doesn't fit their
-    statement layout.
-    """
-    now = datetime.now(tz=timezone.utc).isoformat(timespec="seconds")
-    slug = slugify(name)
-    return {
-        "schema_version": SCHEMA_VERSION,
-        "slug": slug,
-        "name": name or slug,
-        "notes": "",
-        "mode": "row_heuristic",
-        "created_at": now,
-        "updated_at": now,
-        "pages": {
-            "range": "all",
-            "skip_matching": "",
-        },
-        # Row-heuristic config (primary path).
-        "row_detection": {
-            "min_amounts_per_row": 1,
-            "max_amounts_per_row": 3,
-            "y_tolerance": 3.0,
-            "skip_rows_matching": [],
-            "merge_multiline_description": True,
-        },
-        "amounts": {
-            "shape": "single",
-            "negative_in_parens": True,
-            "decimal_separator": ".",
-            "thousands_separator": ",",
-            "currency_strip": "$",
-        },
-        "date": {
-            "format": "%m/%d/%Y",
-            "formats_fallback": [],
-        },
-        # Column-visual config (legacy / Advanced fallback). Empty
-        # placeholders so the GUI can populate when the user
-        # switches modes without inserting keys at runtime.
-        "table": {
-            "header_text": "",
-            "end_markers": [],
-            "column_boundaries": [],
-            "y_tolerance": 3.0,
-            "skip_rows_matching": [],
-        },
-        "columns": [],
-        "parse": {
-            "date_format": "%m/%d/%Y",
-            "date_formats": [],
-            "decimal_separator": ".",
-            "thousands_separator": ",",
-            "currency_strip": "$",
-            "amount_negative_in_parens": True,
-            "merge_multiline_description": True,
-        },
-        "visual": {
-            "page_width": 612.0,
-            "page_height": 792.0,
-            "sample_page": 1,
-            "table_bbox": None,
-        },
-    }
-
-
-# ---------------------------------------------------------------------------
-# Validation
-# ---------------------------------------------------------------------------
-
-
-def validate_template(template: dict[str, Any]) -> tuple[bool, list[str]]:
-    """Check the template before saving. Returns ``(ok, errors)``.
-
-    Mode-aware: row-heuristic templates and column-visual
-    templates have different required fields. The GUI shows the
-    errors next to the Save button; nothing silent here."""
-    errors: list[str] = []
-    if not isinstance(template, dict):
-        return False, ["Template must be a JSON object."]
-
-    sv = template.get("schema_version")
-    if sv != SCHEMA_VERSION:
-        errors.append(
-            f"Unsupported schema_version {sv!r} (expected {SCHEMA_VERSION})."
-        )
-
-    name = template.get("name", "")
-    if not isinstance(name, str) or not name.strip():
-        errors.append("name is required.")
-
-    slug = template.get("slug") or slugify(name)
-    if not re.match(r"^[a-z0-9][a-z0-9-]{0,63}$", slug or ""):
-        errors.append(
-            "slug must be lowercase alphanumeric + hyphens, "
-            "1–64 chars, starting with a letter or digit."
-        )
-
-    mode = template.get("mode", "row_heuristic")
-    if mode not in VALID_MODES:
-        errors.append(
-            f"mode {mode!r} must be one of: {sorted(VALID_MODES)}."
-        )
-
-    if mode == "row_heuristic":
-        amounts = template.get("amounts", {}) or {}
-        shape = amounts.get("shape", "single")
-        if shape not in VALID_AMOUNT_SHAPES:
-            errors.append(
-                f"amounts.shape {shape!r} must be one of: "
-                f"{sorted(VALID_AMOUNT_SHAPES)}."
-            )
-        rd = template.get("row_detection", {}) or {}
-        min_a = rd.get("min_amounts_per_row", 1)
-        max_a = rd.get("max_amounts_per_row", 3)
-        if not (isinstance(min_a, int) and isinstance(max_a, int)):
-            errors.append(
-                "row_detection.min_amounts_per_row and "
-                "max_amounts_per_row must be integers."
-            )
-        elif min_a < 1 or max_a < min_a:
-            errors.append(
-                "row_detection.min_amounts_per_row must be ≥1 and ≤ "
-                "max_amounts_per_row."
-            )
-
-    elif mode == "column_visual":
-        columns = template.get("columns", [])
-        if not isinstance(columns, list) or len(columns) < 2:
-            errors.append(
-                "column_visual mode: at least two output columns "
-                "are required."
-            )
-        else:
-            seen_targets: list[str] = []
-            for i, col in enumerate(columns):
-                if not isinstance(col, dict):
-                    errors.append(f"columns[{i}] must be an object.")
-                    continue
-                src = col.get("source")
-                tgt = col.get("target")
-                if not isinstance(src, int) or src < 0:
-                    errors.append(
-                        f"columns[{i}].source must be a non-negative "
-                        f"integer."
-                    )
-                if not isinstance(tgt, str) or not tgt:
-                    errors.append(
-                        f"columns[{i}].target must be a non-empty string."
-                    )
-                else:
-                    seen_targets.append(tgt)
-            if "date" not in seen_targets:
-                errors.append(
-                    "column_visual mode: at least one column must map "
-                    "to 'date'."
-                )
-            if (
-                "amount" not in seen_targets
-                and not (
-                    "amount_debit" in seen_targets
-                    and "amount_credit" in seen_targets
-                )
-            ):
-                errors.append(
-                    "column_visual mode: either an 'amount' column or "
-                    "both 'amount_debit' + 'amount_credit' columns "
-                    "are required."
-                )
-
-        table = template.get("table", {}) or {}
-        boundaries = table.get("column_boundaries", [])
-        if not isinstance(boundaries, list):
-            errors.append("table.column_boundaries must be a list.")
-
-    return (not errors), errors
-
-
-# ---------------------------------------------------------------------------
-# Persistence
-# ---------------------------------------------------------------------------
-
-
-def _atomic_write(path: Path, payload: str) -> None:
-    """Write *payload* to *path* via a temp file + rename.
-
-    Avoids leaving a half-written JSON if the process dies mid-save —
-    the GUI saves on every visual-picker change, and a corrupt
-    template file would be hostile to recover from.
-    """
-    path.parent.mkdir(parents=True, exist_ok=True)
-    fd, tmp_path = tempfile.mkstemp(
-        prefix=f".{path.name}.",
-        suffix=".tmp",
-        dir=str(path.parent),
-    )
-    try:
-        with os.fdopen(fd, "w", encoding="utf-8") as f:
-            f.write(payload)
-        os.replace(tmp_path, path)
-    except Exception:
-        try:
-            os.unlink(tmp_path)
-        except FileNotFoundError:
-            pass
-        raise
-
-
-def save_template(template: dict[str, Any]) -> str:
-    """Persist *template* to disk; return the slug it was saved as.
-
-    Stamps ``updated_at``. Atomic via temp-file + rename.
-    Raises ``ValueError`` with a multi-line error list if validation
-    fails — caller should surface that to the user.
-    """
-    ok, errors = validate_template(template)
-    if not ok:
-        raise ValueError("\n".join(errors))
-    template = dict(template)
-    template["updated_at"] = datetime.now(tz=timezone.utc).isoformat(
-        timespec="seconds"
-    )
-    slug = template["slug"]
-    payload = json.dumps(template, indent=2, ensure_ascii=False)
-    _atomic_write(template_path(slug), payload)
-    return slug
-
-
-def load_template(slug: str) -> dict[str, Any]:
-    """Read the template at *slug*. Raises ``FileNotFoundError`` if
-    missing, ``ValueError`` if the JSON is corrupt or the schema
-    version is unsupported.
-
-    v1 templates (pre row-heuristic) are accepted and migrated
-    in-memory to v2 shape with ``mode="column_visual"``. The file
-    on disk is NOT rewritten — the user's canonical original stays
-    intact until they explicitly re-save, so a buggy migration
-    can't silently corrupt their template library.
-    """
-    p = template_path(slug)
-    try:
-        raw = p.read_text(encoding="utf-8")
-    except FileNotFoundError:
-        raise
-    try:
-        data = json.loads(raw)
-    except json.JSONDecodeError as e:
-        raise ValueError(f"Corrupt template {slug!r}: {e}") from e
-    sv = data.get("schema_version")
-    if sv not in _LOAD_SUPPORTED_VERSIONS:
-        raise ValueError(
-            f"Template {slug!r} has unsupported schema_version {sv!r}; "
-            f"this build supports {sorted(_LOAD_SUPPORTED_VERSIONS)}."
-        )
-    return _migrate_to_current(data)
-
-
-def _migrate_to_current(data: dict[str, Any]) -> dict[str, Any]:
-    """In-memory migration of older schemas to the current shape.
-
-    v1 → v2 adds a ``mode`` key defaulting to ``"column_visual"``
-    (since v1 was the column-x-position approach) and stamps
-    ``schema_version`` to the current value. All v1 keys keep
-    their original meaning."""
-    if data.get("schema_version") == 1:
-        data = dict(data)
-        data["schema_version"] = SCHEMA_VERSION
-        data.setdefault("mode", "column_visual")
-    return data
-
-
-def delete_template(slug: str) -> bool:
-    """Remove the template file; returns ``True`` if it existed."""
-    p = template_path(slug)
-    try:
-        p.unlink()
-        return True
-    except FileNotFoundError:
-        return False
-
-
-def list_templates() -> list[dict[str, Any]]:
-    """Return a sorted list of ``{slug, name, updated_at}`` summaries.
-
-    Skips files that fail to parse — surfaces them in the manage UI
-    as warnings rather than crashing the list view.
-    """
-    d = templates_dir()
-    if not d.exists():
-        return []
-    out: list[dict[str, Any]] = []
-    for p in sorted(d.glob("*.json")):
-        try:
-            data = json.loads(p.read_text(encoding="utf-8"))
-        except Exception:
-            continue
-        if not isinstance(data, dict):
-            continue
-        out.append({
-            "slug": data.get("slug") or p.stem,
-            "name": data.get("name") or p.stem,
-            "updated_at": data.get("updated_at", ""),
-            "notes": data.get("notes", ""),
-        })
-    out.sort(key=lambda r: r["updated_at"] or r["name"], reverse=True)
-    return out
-
-
-# ---------------------------------------------------------------------------
-# Import / export
-# ---------------------------------------------------------------------------
-
-
-def template_to_json(template: dict[str, Any]) -> str:
-    """Serialize a template for download. Pretty-printed for human
-    inspection / diffing."""
-    return json.dumps(template, indent=2, ensure_ascii=False)
-
-
-def template_from_json(payload: str) -> dict[str, Any]:
-    """Deserialize uploaded template JSON. Validates schema version
-    but does NOT save — caller decides whether to ``save_template``
-    or merge into the current build.
-
-    Raises ``ValueError`` on malformed input."""
-    try:
-        data = json.loads(payload)
-    except json.JSONDecodeError as e:
-        raise ValueError(f"Not valid JSON: {e}") from e
-    if not isinstance(data, dict):
-        raise ValueError("Top-level JSON must be an object.")
-    sv = data.get("schema_version")
-    if sv != SCHEMA_VERSION:
-        raise ValueError(
-            f"Imported template has schema_version {sv!r}; "
-            f"this build expects {SCHEMA_VERSION}."
-        )
-    return data
-
-
-__all__ = [
-    "SCHEMA_VERSION",
-    "VALID_TARGETS",
-    "delete_template",
-    "list_templates",
-    "load_template",
-    "new_template",
-    "save_template",
-    "slugify",
-    "template_from_json",
-    "template_path",
-    "template_to_json",
-    "templates_dir",
-    "validate_template",
-]