refactor(pdf): rip out templates; heuristic scan + selectable table

User feedback: the template / visual-picker / mode-dispatch
implementation was too complex for the actual workflow.
Statements drift between months, the canvas state didn't survive
multi-page navigation, and accountants don't want to maintain
per-bank configuration just to convert PDFs to CSV.

Start-over design — one public function, one page, no
persistence:

  ``scan_pdf_for_transactions(pdf_bytes) → (rows, warnings)``

A row is "any text line with a date pattern AND at least one
amount pattern." Each detected row is a dict shaped::

    {
      "date": "2026-01-15",
      "description": "Coffee Shop",
      "amount_1": -4.50,
      "amount_2": 1000.00,   # if a second amount was found
      "page": 1,
      "raw": "01/15/2026 Coffee Shop (4.50) 1,000.00",
      "source_file": "chase-jan-2026.pdf",
    }

Multi-line descriptions still merge (no-date no-amount lines
attach to the previous transaction). Multi-PDF batches share a
single combined table with a ``source_file`` column.
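
As an illustration of the row rule above, here is a minimal
sketch that works on plain text lines rather than PDF word boxes.
It is not the code in ``src/pdf_extract.py``: ``scan_text_lines``,
``DATE_RE``, ``AMOUNT_RE`` and ``_to_number`` are hypothetical
names, the patterns only cover ``MM/DD/YYYY`` dates and
two-decimal amounts, and lines that carry only a date or only an
amount are skipped (an assumption; the real handling may differ).

```python
import re

# Hypothetical patterns, chosen only for this illustration.
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")
AMOUNT_RE = re.compile(r"\(?-?\$?\d+(?:,\d{3})*\.\d{2}\)?")


def _to_number(token):
    """Turn '(4.50)' or '1,000.00' into a float; parentheses mean negative."""
    negative = token.startswith("(") and token.endswith(")")
    cleaned = token.strip("()").replace("$", "").replace(",", "")
    value = float(cleaned)
    return -value if negative else value


def scan_text_lines(lines, page=1, source_file=""):
    """Return one dict per line that has a date AND at least one amount."""
    rows = []
    for line in lines:
        date_match = DATE_RE.search(line)
        if date_match is None:
            # No date and no amount: treat as a continuation of the
            # previous transaction's description.
            if rows and line.strip() and not AMOUNT_RE.search(line):
                rows[-1]["description"] += " " + line.strip()
            continue
        rest = line[date_match.end():]
        amounts = list(AMOUNT_RE.finditer(rest))
        if not amounts:
            continue  # date without an amount is not a row in this sketch
        month, day, year = date_match.groups()
        row = {
            "date": f"{year}-{int(month):02d}-{int(day):02d}",
            "description": rest[:amounts[0].start()].strip(),
            "amount_1": _to_number(amounts[0].group()),
            "page": page,
            "raw": line.strip(),
            "source_file": source_file,
        }
        if len(amounts) > 1:
            row["amount_2"] = _to_number(amounts[1].group())
        rows.append(row)
    return rows
```

Feeding it the ``raw`` line from the example dict, followed by an
indented continuation line, yields one row whose description has
the continuation text appended.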

**Page UX:**

- Upload PDF(s) → optional Options expander (parens-negative,
  use-OCR) → click Scan → see all detected rows in an
  ``st.data_editor``.
- The editor has an ``Include`` checkbox column (default on),
  plus user-editable date / description / amount cells and a
  read-only ``raw`` column showing the original PDF text for
  verification.
- A ``Columns to include in CSV`` multiselect hides
  ``page`` / ``raw`` from the download by default; user can
  re-add either.
- Download CSV exports only the checked rows.
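
The export step described in the list above can be sketched
without Streamlit. This is a hedged illustration, not the page
code: ``rows_to_csv`` is a hypothetical helper, and it assumes
each edited row arrives as a dict carrying the ``Include`` flag.

```python
import csv
import io


def rows_to_csv(rows, columns):
    """Serialize only rows whose ``Include`` flag is truthy, keeping *columns*."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        if row.get("Include", True):
            # Columns left out of the selection (for example page / raw) are dropped.
            writer.writerow({col: row.get(col, "") for col in columns})
    return buffer.getvalue()
```

Passing a column list without ``page`` and ``raw`` mirrors the
default state of the multiselect.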

No template save/load. No visual picker. No mode dispatch. No
column boundaries. No schema migration. No per-bank
configuration files.

**Deletions:**

- ``src/pdf_templates.py`` — template storage layer
- ``src/gui/_drawable_canvas_compat.py`` — Streamlit compat shim
  for the canvas (no canvas now)
- ``tests/test_pdf_templates.py``, ``test_pdf_row_heuristic.py``,
  ``test_drawable_canvas_compat.py`` — covered the removed APIs
- ``build/hooks/hook-streamlit_drawable_canvas.py`` — hook for
  the removed dep
- ``streamlit-drawable-canvas==0.9.3`` from ``requirements.txt``
- The drawable-canvas references in ``build/datatools.spec``

**``src/pdf_extract.py``** shrinks from ~30 helper functions to
~10. Keeps: value parsers, row clusterer, date/amount token
finders, OCR pipeline, dependency guards. The one new public
function ``scan_pdf_for_transactions`` glues them together.
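
For a feel of what a value parser has to handle, here is a
minimal sketch modelled on the keyword arguments the updated
tests pass (``decimal``, ``thousands``, ``currency_strip``,
``negative_in_parens``). The name ``parse_amount_sketch``, the
default values, and the choice to drop any Unicode currency
symbol are assumptions, not the shipped ``parse_amount``.

```python
import re
import unicodedata

_PLAIN_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_amount_sketch(text, decimal=".", thousands=",",
                        currency_strip="$", negative_in_parens=True):
    """Return a float, or None when *text* is not a recognizable amount."""
    token = text.strip()
    negative = False
    if token.startswith("(") and token.endswith(")"):
        if not negative_in_parens:
            return None  # with the option off, "(4.50)" is not a plain number
        negative = True
        token = token[1:-1].strip()
    if currency_strip:
        token = token.replace(currency_strip, "")
    # Assumption: any Unicode currency symbol (category "Sc") is noise.
    token = "".join(ch for ch in token if unicodedata.category(ch) != "Sc").strip()
    if token.startswith("-"):
        negative = True
        token = token[1:]
    # Remove grouping first, then normalize the decimal mark to ".".
    if thousands:
        token = token.replace(thousands, "")
    if decimal != ".":
        token = token.replace(decimal, ".")
    if not _PLAIN_NUMBER.fullmatch(token):
        return None
    value = float(token)
    return -value if negative else value
```

Removing the thousands separator before swapping the decimal mark
matters for European input such as ``1.234,56``; the reverse
order would turn the value into ``1.234.56``.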

**Tests** (59 passing): the unit layer keeps full coverage of
the building blocks; the smoke layer pins the end-to-end PDF
roundtrip, OCR discovery, dependency-import behavior, and the
multi-line-description merge. The fpdf2-generated fixture PDF
still drives the real-PDF test.
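
One of the building blocks the unit layer pins is the row
clusterer. The sketch below shows the behavior the tests check
(words with close ``top`` values share a row, each row is sorted
left to right). ``cluster_rows_sketch``, this ``WordBox``
dataclass, and the ``3.0`` default tolerance are assumptions for
illustration, not the module's definitions.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    x0: float
    top: float
    x1: float
    bottom: float
    text: str


def cluster_rows_sketch(words, y_tolerance=3.0):
    """Group words into rows by vertical position, then order each row by x0."""
    rows = []
    for word in sorted(words, key=lambda w: w.top):
        # Compare against the first (topmost) word of the current row.
        if rows and abs(word.top - rows[-1][0].top) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [sorted(row, key=lambda w: w.x0) for row in rows]
```

Anchoring each row on its topmost word keeps a slowly drifting
baseline from chaining unrelated lines into one row.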

Rollback: ``git revert HEAD`` (while this is still the tip
commit) brings back the template system if needed, though the
simpler model should make that unlikely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:57:30 +00:00
parent 60969c0770
commit bece2b4030
12 changed files with 729 additions and 3632 deletions


@@ -58,15 +58,12 @@ hidden_imports += collect_submodules("charset_normalizer")
 hidden_imports += collect_submodules("openpyxl")
 hidden_imports += collect_submodules("loguru")
-# PDF Extractor stack. ``streamlit_drawable_canvas`` and
-# ``pypdfium2`` both have their own PyInstaller hooks under
-# ``build/hooks/`` that pull in the native binary + frontend
-# assets — keep the ``collect_submodules`` calls here for
-# belt-and-braces.
+# PDF Extractor stack. ``pypdfium2`` has its own PyInstaller hook
+# under ``build/hooks/`` that pulls in the native PDFium binary —
+# keep the ``collect_submodules`` calls here for belt-and-braces.
 hidden_imports += collect_submodules("pdfplumber")
 hidden_imports += collect_submodules("pdfminer")
 hidden_imports += collect_submodules("pypdfium2")
-hidden_imports += collect_submodules("streamlit_drawable_canvas")
 hidden_imports += collect_submodules("PIL")
 hidden_imports += collect_submodules("pytesseract")
@@ -91,13 +88,10 @@ datas += collect_data_files("phonenumbers", include_py_files=False)
 # PDF Extractor data files. ``pypdfium2`` ships a native PDFium
 # shared library (``.dll`` / ``.so`` / ``.dylib``) under its package
-# dir; ``streamlit-drawable-canvas`` ships a built JS bundle that
-# Streamlit serves from the package dir at runtime; pdfminer ships
-# the Adobe CMap tables it uses for character mapping. Hooks
-# under ``build/hooks/`` mirror these calls for explicit
-# documentation and survive ``collect_data_files`` regressions.
+# dir; ``pdfminer`` ships the Adobe CMap tables it uses for
+# character mapping. The drawable-canvas frontend bundle is gone
+# now that the visual picker was removed.
 datas += collect_data_files("pypdfium2", include_py_files=False)
-datas += collect_data_files("streamlit_drawable_canvas")
 datas += collect_data_files("pdfminer", include_py_files=False)
 # Our application files. PyInstaller's bundler treats source as code


@@ -1,19 +0,0 @@
"""PyInstaller hook for streamlit-drawable-canvas.
Streamlit components are Python packages that also ship a built
JavaScript/CSS bundle Streamlit serves from disk at component-
render time. Without those assets in the bundle the canvas
iframe loads blank — the user sees the page render fine but the
visual picker shows no image and no drawing controls.
``collect_data_files`` covers the frontend bundle directory
(named ``frontend`` or ``frontend/build`` depending on the
component version). Hidden imports are picked up by the main
spec's ``collect_submodules`` call, repeated here for the same
belt-and-braces reason as ``hook-pypdfium2.py``.
"""
from PyInstaller.utils.hooks import collect_data_files, collect_submodules
datas = collect_data_files("streamlit_drawable_canvas")
hiddenimports = collect_submodules("streamlit_drawable_canvas")


@@ -10,10 +10,14 @@ phonenumbers>=8.13,<9
 streamlit>=1.35,<2
 cryptography>=41,<49
 # PDF Extractor stack — pinned to exact tested versions so a future
-# upstream release can't change the visual picker's coordinate model
-# or pdfplumber's word-position behavior mid-build. Bump these
+# upstream release can't quietly change pdfplumber's word-position
+# behavior or pypdfium2's OCR rendering mid-build. Bump these
 # explicitly when re-testing against a new release.
+#
+# ``pypdfium2`` is here for the OCR fallback path only (rasterizing
+# pages to images for Tesseract). The drawable-canvas dep was
+# removed when the visual picker was ripped out — the scanner is
+# pure heuristic now, no coordinate UI.
 pdfplumber==0.11.9
 pypdfium2==5.8.0
 pytesseract==0.3.13
-streamlit-drawable-canvas==0.9.3


@@ -1,86 +0,0 @@
"""Compatibility shim for streamlit-drawable-canvas on modern Streamlit.
``streamlit-drawable-canvas`` 0.9.3 (last release 2023) calls
``streamlit.elements.image.image_to_url(image, width, clamp,
channels, output_format, image_id)``. Streamlit ~1.30+ moved this
helper out of ``streamlit.elements.image`` and changed its
signature so the second positional argument is now a
``LayoutConfig`` dataclass instead of a plain ``int`` width.
The canvas package hasn't been updated, so on modern Streamlit
its very first call fails with::
AttributeError: module 'streamlit.elements.image'
has no attribute 'image_to_url'
This module re-attaches a wrapper at the old import path that
adapts the old call shape to the new function. Import it once
before any ``st_canvas`` call; idempotent.
The shim is opt-in (not auto-installed at module import) so the
audit log of "I patched a third-party internal" is visible in
``grep`` rather than silently happening on every page load.
"""
from __future__ import annotations
_PATCHED = False
def install() -> None:
"""Install the ``image_to_url`` compatibility shim.
Idempotent — safe to call multiple times. Returns silently
if the canvas package or Streamlit can't be imported (lets
the caller handle the "PDF deps missing" path on its own).
"""
global _PATCHED
if _PATCHED:
return
try:
import streamlit.elements.image as _old_image_module
except ImportError:
return
# Already present (old Streamlit, or already shimmed) — bail.
if hasattr(_old_image_module, "image_to_url"):
_PATCHED = True
return
try:
from streamlit.elements.lib.image_utils import (
image_to_url as _new_image_to_url,
)
from streamlit.elements.lib.layout_utils import LayoutConfig
except ImportError:
# ``image_to_url`` is in some other location we don't know
# about yet — let the canvas surface its own error so we
# learn where to look. Don't fail silently.
return
def _shim(
image,
width,
clamp,
channels,
output_format,
image_id,
) -> str:
"""Old API → new API. The old ``width=-1`` sentinel meant
"use the image's natural width", which is also the new
function's default behavior when ``LayoutConfig`` is left
unconfigured."""
layout = LayoutConfig()
return _new_image_to_url(
image,
layout,
clamp,
channels,
output_format,
image_id,
)
_old_image_module.image_to_url = _shim
_PATCHED = True

File diff suppressed because it is too large

File diff suppressed because it is too large


@@ -1,508 +0,0 @@
"""PDF extract template storage.
Templates encode "how to read this bank's statements" — page
range, table window markers, column x-positions, target field
mapping, amount/date parse options. They live as JSON files in
``~/.datatools/pdf_templates/`` so an accountant can build one
per source and reuse it for every statement that follows the
same layout. Templates are portable: the ``export`` / ``import``
flow is just a file copy of the JSON.
The schema is intentionally a plain dict (not a frozen dataclass)
because the GUI mutates it incrementally during the build flow.
``validate_template`` enforces the contract at save time.
Schema (``schema_version: 1``)::
{
"schema_version": 1,
"slug": "chase-personal-checking",
"name": "Chase Personal Checking",
"notes": "",
"created_at": "<iso8601>",
"updated_at": "<iso8601>",
"pages": {
"range": "all" | "1-3" | "2,4,6-",
"skip_matching": "<regex>"
},
"table": {
"header_text": "<text containing all header words>",
"end_markers": ["<regex>", ...],
"column_boundaries": [x0, x1, ...],
"y_tolerance": 3.0,
"skip_rows_matching": ["<regex>", ...]
},
"columns": [
{"source": 0, "target": "date"},
...
# ``target`` is one of: date | description | amount |
# amount_debit | amount_credit | balance | <free text>
],
"parse": {
"date_format": "%m/%d/%Y",
"date_formats": [],
"decimal_separator": ".",
"thousands_separator": ",",
"currency_strip": "$",
"amount_negative_in_parens": true,
"merge_multiline_description": true
},
"visual": {
"page_width": 612.0,
"page_height": 792.0,
"sample_page": 1,
"table_bbox": [x0, top, x1, bottom] | null
}
}
The ``visual`` block is preserved across save/load so the build
UI can round-trip the user's last visual-picker state.
"""
from __future__ import annotations
import json
import os
import re
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
SCHEMA_VERSION = 2
# Backward-compatible versions ``load_template`` will accept.
# v1 templates predate the row-heuristic shift and are loaded as
# ``mode="column_visual"``; they're not auto-migrated on disk, so
# the user keeps their canonical original until they re-save.
_LOAD_SUPPORTED_VERSIONS = frozenset({1, 2})
# Extraction modes. ``row_heuristic`` is the default for new
# templates — finds transactions by date+amount pattern matching
# with no coordinate dependency. ``column_visual`` is the legacy
# x-position-boundary approach, kept for old templates and for
# the "Advanced" build-mode fallback when the heuristic misfires.
VALID_MODES = frozenset({"row_heuristic", "column_visual"})
# Amount shapes for row_heuristic mode. The GUI offers these as a
# dropdown; the parser uses them to assign amount tokens to fields.
VALID_AMOUNT_SHAPES = frozenset({
"single",
"txn_balance",
"debit_credit",
"debit_credit_balance",
})
VALID_TARGETS = frozenset({
"date",
"description",
"amount",
"amount_debit",
"amount_credit",
"balance",
"type",
})
# ---------------------------------------------------------------------------
# Filesystem layout
# ---------------------------------------------------------------------------
def templates_dir() -> Path:
"""Return ``~/.datatools/pdf_templates/``. Override via the
``DATATOOLS_PDF_TEMPLATES_DIR`` env var (used by tests)."""
override = os.environ.get("DATATOOLS_PDF_TEMPLATES_DIR")
if override:
return Path(override)
try:
return Path.home() / ".datatools" / "pdf_templates"
except Exception:
return Path(tempfile.gettempdir()) / "datatools-pdf-templates"
def template_path(slug: str) -> Path:
"""Resolve *slug* to its on-disk JSON path."""
return templates_dir() / f"{slug}.json"
# ---------------------------------------------------------------------------
# Slugify
# ---------------------------------------------------------------------------
_SLUG_STRIP = re.compile(r"[^a-z0-9]+")
def slugify(name: str) -> str:
"""Make a filesystem-safe slug from a human-friendly name."""
s = (name or "").strip().lower()
s = _SLUG_STRIP.sub("-", s).strip("-")
return s or "untitled"
# ---------------------------------------------------------------------------
# Construction + defaults
# ---------------------------------------------------------------------------
def new_template(name: str) -> dict[str, Any]:
"""Build a blank template with sensible defaults.
Defaults to ``mode="row_heuristic"`` — the simpler, more
robust approach. The GUI's build flow lets the user switch to
``mode="column_visual"`` if the heuristic doesn't fit their
statement layout.
"""
now = datetime.now(tz=timezone.utc).isoformat(timespec="seconds")
slug = slugify(name)
return {
"schema_version": SCHEMA_VERSION,
"slug": slug,
"name": name or slug,
"notes": "",
"mode": "row_heuristic",
"created_at": now,
"updated_at": now,
"pages": {
"range": "all",
"skip_matching": "",
},
# Row-heuristic config (primary path).
"row_detection": {
"min_amounts_per_row": 1,
"max_amounts_per_row": 3,
"y_tolerance": 3.0,
"skip_rows_matching": [],
"merge_multiline_description": True,
},
"amounts": {
"shape": "single",
"negative_in_parens": True,
"decimal_separator": ".",
"thousands_separator": ",",
"currency_strip": "$",
},
"date": {
"format": "%m/%d/%Y",
"formats_fallback": [],
},
# Column-visual config (legacy / Advanced fallback). Empty
# placeholders so the GUI can populate when the user
# switches modes without inserting keys at runtime.
"table": {
"header_text": "",
"end_markers": [],
"column_boundaries": [],
"y_tolerance": 3.0,
"skip_rows_matching": [],
},
"columns": [],
"parse": {
"date_format": "%m/%d/%Y",
"date_formats": [],
"decimal_separator": ".",
"thousands_separator": ",",
"currency_strip": "$",
"amount_negative_in_parens": True,
"merge_multiline_description": True,
},
"visual": {
"page_width": 612.0,
"page_height": 792.0,
"sample_page": 1,
"table_bbox": None,
},
}
# ---------------------------------------------------------------------------
# Validation
# ---------------------------------------------------------------------------
def validate_template(template: dict[str, Any]) -> tuple[bool, list[str]]:
"""Check the template before saving. Returns ``(ok, errors)``.
Mode-aware: row-heuristic templates and column-visual
templates have different required fields. The GUI shows the
errors next to the Save button; nothing silent here."""
errors: list[str] = []
if not isinstance(template, dict):
return False, ["Template must be a JSON object."]
sv = template.get("schema_version")
if sv != SCHEMA_VERSION:
errors.append(
f"Unsupported schema_version {sv!r} (expected {SCHEMA_VERSION})."
)
name = template.get("name", "")
if not isinstance(name, str) or not name.strip():
errors.append("name is required.")
slug = template.get("slug") or slugify(name)
if not re.match(r"^[a-z0-9][a-z0-9-]{0,63}$", slug or ""):
errors.append(
"slug must be lowercase alphanumeric + hyphens, "
"1-64 chars, starting with a letter or digit."
)
mode = template.get("mode", "row_heuristic")
if mode not in VALID_MODES:
errors.append(
f"mode {mode!r} must be one of: {sorted(VALID_MODES)}."
)
if mode == "row_heuristic":
amounts = template.get("amounts", {}) or {}
shape = amounts.get("shape", "single")
if shape not in VALID_AMOUNT_SHAPES:
errors.append(
f"amounts.shape {shape!r} must be one of: "
f"{sorted(VALID_AMOUNT_SHAPES)}."
)
rd = template.get("row_detection", {}) or {}
min_a = rd.get("min_amounts_per_row", 1)
max_a = rd.get("max_amounts_per_row", 3)
if not (isinstance(min_a, int) and isinstance(max_a, int)):
errors.append(
"row_detection.min_amounts_per_row and "
"max_amounts_per_row must be integers."
)
elif min_a < 1 or max_a < min_a:
errors.append(
"row_detection.min_amounts_per_row must be ≥1 and ≤ "
"max_amounts_per_row."
)
elif mode == "column_visual":
columns = template.get("columns", [])
if not isinstance(columns, list) or len(columns) < 2:
errors.append(
"column_visual mode: at least two output columns "
"are required."
)
else:
seen_targets: list[str] = []
for i, col in enumerate(columns):
if not isinstance(col, dict):
errors.append(f"columns[{i}] must be an object.")
continue
src = col.get("source")
tgt = col.get("target")
if not isinstance(src, int) or src < 0:
errors.append(
f"columns[{i}].source must be a non-negative "
f"integer."
)
if not isinstance(tgt, str) or not tgt:
errors.append(
f"columns[{i}].target must be a non-empty string."
)
else:
seen_targets.append(tgt)
if "date" not in seen_targets:
errors.append(
"column_visual mode: at least one column must map "
"to 'date'."
)
if (
"amount" not in seen_targets
and not (
"amount_debit" in seen_targets
and "amount_credit" in seen_targets
)
):
errors.append(
"column_visual mode: either an 'amount' column or "
"both 'amount_debit' + 'amount_credit' columns "
"are required."
)
table = template.get("table", {}) or {}
boundaries = table.get("column_boundaries", [])
if not isinstance(boundaries, list):
errors.append("table.column_boundaries must be a list.")
return (not errors), errors
# ---------------------------------------------------------------------------
# Persistence
# ---------------------------------------------------------------------------
def _atomic_write(path: Path, payload: str) -> None:
"""Write *payload* to *path* via a temp file + rename.
Avoids leaving a half-written JSON if the process dies mid-save —
the GUI saves on every visual-picker change, and a corrupt
template file would be hostile to recover from.
"""
path.parent.mkdir(parents=True, exist_ok=True)
fd, tmp_path = tempfile.mkstemp(
prefix=f".{path.name}.",
suffix=".tmp",
dir=str(path.parent),
)
try:
with os.fdopen(fd, "w", encoding="utf-8") as f:
f.write(payload)
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except FileNotFoundError:
pass
raise
def save_template(template: dict[str, Any]) -> str:
"""Persist *template* to disk; return the slug it was saved as.
Stamps ``updated_at``. Atomic via temp-file + rename.
Raises ``ValueError`` with a multi-line error list if validation
fails — caller should surface that to the user.
"""
ok, errors = validate_template(template)
if not ok:
raise ValueError("\n".join(errors))
template = dict(template)
template["updated_at"] = datetime.now(tz=timezone.utc).isoformat(
timespec="seconds"
)
slug = template["slug"]
payload = json.dumps(template, indent=2, ensure_ascii=False)
_atomic_write(template_path(slug), payload)
return slug
def load_template(slug: str) -> dict[str, Any]:
"""Read the template at *slug*. Raises ``FileNotFoundError`` if
missing, ``ValueError`` if the JSON is corrupt or the schema
version is unsupported.
v1 templates (pre row-heuristic) are accepted and migrated
in-memory to v2 shape with ``mode="column_visual"``. The file
on disk is NOT rewritten — the user's canonical original stays
intact until they explicitly re-save, so a buggy migration
can't silently corrupt their template library.
"""
p = template_path(slug)
try:
raw = p.read_text(encoding="utf-8")
except FileNotFoundError:
raise
try:
data = json.loads(raw)
except json.JSONDecodeError as e:
raise ValueError(f"Corrupt template {slug!r}: {e}") from e
sv = data.get("schema_version")
if sv not in _LOAD_SUPPORTED_VERSIONS:
raise ValueError(
f"Template {slug!r} has unsupported schema_version {sv!r}; "
f"this build supports {sorted(_LOAD_SUPPORTED_VERSIONS)}."
)
return _migrate_to_current(data)
def _migrate_to_current(data: dict[str, Any]) -> dict[str, Any]:
"""In-memory migration of older schemas to the current shape.
v1 → v2 adds a ``mode`` key defaulting to ``"column_visual"``
(since v1 was the column-x-position approach) and stamps
``schema_version`` to the current value. All v1 keys keep
their original meaning."""
if data.get("schema_version") == 1:
data = dict(data)
data["schema_version"] = SCHEMA_VERSION
data.setdefault("mode", "column_visual")
return data
def delete_template(slug: str) -> bool:
"""Remove the template file; returns ``True`` if it existed."""
p = template_path(slug)
try:
p.unlink()
return True
except FileNotFoundError:
return False
def list_templates() -> list[dict[str, Any]]:
"""Return a sorted list of ``{slug, name, updated_at}`` summaries.
Skips files that fail to parse — surfaces them in the manage UI
as warnings rather than crashing the list view.
"""
d = templates_dir()
if not d.exists():
return []
out: list[dict[str, Any]] = []
for p in sorted(d.glob("*.json")):
try:
data = json.loads(p.read_text(encoding="utf-8"))
except Exception:
continue
if not isinstance(data, dict):
continue
out.append({
"slug": data.get("slug") or p.stem,
"name": data.get("name") or p.stem,
"updated_at": data.get("updated_at", ""),
"notes": data.get("notes", ""),
})
out.sort(key=lambda r: r["updated_at"] or r["name"], reverse=True)
return out
# ---------------------------------------------------------------------------
# Import / export
# ---------------------------------------------------------------------------
def template_to_json(template: dict[str, Any]) -> str:
"""Serialize a template for download. Pretty-printed for human
inspection / diffing."""
return json.dumps(template, indent=2, ensure_ascii=False)
def template_from_json(payload: str) -> dict[str, Any]:
"""Deserialize uploaded template JSON. Validates schema version
but does NOT save — caller decides whether to ``save_template``
or merge into the current build.
Raises ``ValueError`` on malformed input."""
try:
data = json.loads(payload)
except json.JSONDecodeError as e:
raise ValueError(f"Not valid JSON: {e}") from e
if not isinstance(data, dict):
raise ValueError("Top-level JSON must be an object.")
sv = data.get("schema_version")
if sv != SCHEMA_VERSION:
raise ValueError(
f"Imported template has schema_version {sv!r}; "
f"this build expects {SCHEMA_VERSION}."
)
return data
__all__ = [
"SCHEMA_VERSION",
"VALID_TARGETS",
"delete_template",
"list_templates",
"load_template",
"new_template",
"save_template",
"slugify",
"template_from_json",
"template_path",
"template_to_json",
"templates_dir",
"validate_template",
]


@@ -1,116 +0,0 @@
"""Tests for the streamlit-drawable-canvas compatibility shim.
The shim re-attaches ``image_to_url`` to ``streamlit.elements.image``
on modern Streamlit where the helper was relocated to
``streamlit.elements.lib.image_utils`` and given a new signature
(takes a ``LayoutConfig`` dataclass instead of a plain ``int``
width).
If this test ever fails on a Streamlit upgrade, it almost
certainly means the ``image_to_url`` function moved AGAIN — the
shim's fallback message points to where to look. Update
``_drawable_canvas_compat.py`` to find the new location.
"""
from __future__ import annotations
import sys
import types
def test_shim_attaches_image_to_url():
"""After ``install()`` the old import path resolves to a
callable, even on modern Streamlit where the original was
relocated."""
# Force a fresh import so the module-level _PATCHED guard
# doesn't short-circuit between tests.
sys.modules.pop("src.gui._drawable_canvas_compat", None)
from src.gui._drawable_canvas_compat import install
install()
import streamlit.elements.image as old_loc
assert hasattr(old_loc, "image_to_url")
assert callable(old_loc.image_to_url)
def test_shim_is_idempotent():
"""Calling ``install()`` twice doesn't double-wrap or break
anything — important because the page module imports + calls
it once, and a Streamlit script-rerun re-executes the page
module top-to-bottom."""
sys.modules.pop("src.gui._drawable_canvas_compat", None)
from src.gui._drawable_canvas_compat import install
install()
import streamlit.elements.image as old_loc
first = old_loc.image_to_url
install()
second = old_loc.image_to_url
assert first is second
def test_shim_no_op_when_image_to_url_already_present():
"""If a future Streamlit restores ``image_to_url`` at the old
location, the shim must not overwrite it — leave the upstream
function in place so the canvas package gets the official
version, not our compatibility wrapper."""
sys.modules.pop("src.gui._drawable_canvas_compat", None)
import streamlit.elements.image as old_loc
sentinel = lambda *a, **kw: "sentinel-url" # noqa: E731
old_loc.image_to_url = sentinel
try:
from src.gui._drawable_canvas_compat import install
install()
assert old_loc.image_to_url is sentinel, (
"Shim must not clobber an existing image_to_url."
)
finally:
# Tidy up so subsequent tests see a clean module.
delattr(old_loc, "image_to_url")
sys.modules.pop("src.gui._drawable_canvas_compat", None)
def test_shim_calls_new_function_with_layout_config():
"""The shim's wrapper must translate the old ``(image, width,
clamp, channels, output_format, image_id)`` call into the new
``(image, layout_config, …)`` signature without breaking."""
sys.modules.pop("src.gui._drawable_canvas_compat", None)
import streamlit.elements.image as old_loc
if hasattr(old_loc, "image_to_url"):
delattr(old_loc, "image_to_url")
# Replace the new function with a recorder so we can inspect
# what arguments the shim passed through.
from streamlit.elements.lib import image_utils
captured: dict = {}
original = image_utils.image_to_url
def recorder(image, layout_config, clamp, channels, output_format, image_id):
captured["image"] = image
captured["layout_config"] = layout_config
captured["clamp"] = clamp
captured["channels"] = channels
captured["output_format"] = output_format
captured["image_id"] = image_id
return "fake-url"
image_utils.image_to_url = recorder
try:
from src.gui._drawable_canvas_compat import install
install()
result = old_loc.image_to_url(
"fake-image", -1, False, "RGB", "PNG", "test-id",
)
assert result == "fake-url"
assert captured["image"] == "fake-image"
assert captured["clamp"] is False
assert captured["channels"] == "RGB"
assert captured["output_format"] == "PNG"
assert captured["image_id"] == "test-id"
# The shim wraps the int width into a LayoutConfig.
from streamlit.elements.lib.layout_utils import LayoutConfig
assert isinstance(captured["layout_config"], LayoutConfig)
finally:
image_utils.image_to_url = original
if hasattr(old_loc, "image_to_url"):
delattr(old_loc, "image_to_url")
sys.modules.pop("src.gui._drawable_canvas_compat", None)


@@ -1,36 +1,33 @@
-"""Tests for the pure PDF-extraction pipeline.
+"""Tests for the minimal PDF transaction scanner.
-Real PDF parsing (``extract_pages``) is a thin wrapper around
-``pdfplumber`` and is exercised by hand on real bank statements.
-These tests pin the meaty bits — value parsing, row clustering,
-column assignment, template-driven extraction — against synthetic
-``WordBox`` data so they run fast and have no PDF dependency.
+The public API is one function: ``scan_pdf_for_transactions``.
+These tests cover the value-parsing helpers, the row clusterer,
+the date/amount token finders, and the end-to-end scanner
+against synthetic ``Page`` objects with no real PDF involved.
+End-to-end-on-a-real-PDF coverage lives in
+``test_pdf_extract_smoke.py``, which uses ``fpdf2`` to generate
+a fixture statement at test time.
 """
 from __future__ import annotations
-import pandas as pd
 from src.pdf_extract import (
     Page,
     WordBox,
-    apply_template,
-    assign_columns,
+    _find_amount_tokens,
+    _find_dates_in_words,
     cluster_rows,
     parse_amount,
     parse_date,
-    _pages_in_range,
-    _within_table_window,
 )
 def _w(text: str, x0: float, top: float, x1: float | None = None) -> WordBox:
     """Convenience constructor — heights and exact x1 don't matter
     for the tests we write."""
     return WordBox(
         x0=x0,
         top=top,
-        x1=x1 if x1 is not None else x0 + 10 * len(text),
+        x1=x1 if x1 is not None else x0 + 8 * len(text),
         bottom=top + 10,
         text=text,
     )
@@ -61,13 +58,18 @@ class TestParseAmount:
         assert parse_amount("not a number") is None
     def test_european_decimal(self):
-        opts = {
-            "decimal_separator": ",",
-            "thousands_separator": ".",
-            "currency_strip": "",
-            "negative_in_parens": True,
-        }
-        assert parse_amount("€1.234,56", opts) == 1234.56
+        assert parse_amount(
+            "€1.234,56",
+            decimal=",",
+            thousands=".",
+            currency_strip="",
+        ) == 1234.56
+    def test_parens_off_disables_paren_negative(self):
+        # With parens off, (4.50) won't be treated as negative —
+        # but it also won't parse cleanly since "(4.50)" isn't a
+        # plain number. Verify the off-path is non-flipping.
+        assert parse_amount("(4.50)", negative_in_parens=False) is None
 class TestParseDate:
@@ -78,7 +80,7 @@ class TestParseDate:
         assert parse_date("2026-01-15", ["%Y-%m-%d"]) == "2026-01-15"
     def test_fallback_format(self):
-        # Not in the supplied list — should still parse via fallback.
+        # Not in supplied list — should still parse via fallback.
         assert parse_date("01/15/26") == "2026-01-15"
     def test_invalid(self):
@@ -88,199 +90,74 @@ class TestParseDate:
 class TestClusterRows:
     def test_groups_close_y(self):
         words = [
-            _w("A", x0=0, top=100),
-            _w("B", x0=20, top=101),
-            _w("C", x0=40, top=102),
+            _w("A", 0, 100), _w("B", 20, 101), _w("C", 40, 102),
         ]
-        rows = cluster_rows(words, y_tolerance=3.0)
+        rows = cluster_rows(words)
         assert len(rows) == 1
         assert [w.text for w in rows[0]] == ["A", "B", "C"]
     def test_separates_far_y(self):
-        words = [
-            _w("A", x0=0, top=100),
-            _w("B", x0=0, top=120),
-        ]
-        rows = cluster_rows(words, y_tolerance=3.0)
-        assert [[w.text for w in r] for r in rows] == [["A"], ["B"]]
+        words = [_w("A", 0, 100), _w("B", 0, 120)]
+        assert [
+            [w.text for w in r] for r in cluster_rows(words)
+        ] == [["A"], ["B"]]
     def test_sorts_left_to_right_within_row(self):
-        words = [
-            _w("C", x0=40, top=100),
-            _w("A", x0=0, top=100),
-            _w("B", x0=20, top=100),
-        ]
-        rows = cluster_rows(words)
-        assert [w.text for w in rows[0]] == ["A", "B", "C"]
+        words = [_w("C", 40, 100), _w("A", 0, 100), _w("B", 20, 100)]
+        assert [w.text for w in cluster_rows(words)[0]] == ["A", "B", "C"]
     def test_empty(self):
         assert cluster_rows([]) == []
-class TestAssignColumns:
-    def test_three_columns(self):
-        # boundaries at x=100, 200 → columns [0,100), [100,200), [200,∞)
-        row = [
-            _w("Jan", x0=10, top=0, x1=40),  # col 0
-            _w("1", x0=45, top=0, x1=55),  # col 0
-            _w("Deposit", x0=110, top=0, x1=180),  # col 1
-            _w("250.00", x0=210, top=0, x1=260),  # col 2
-        ]
-        cells = assign_columns(row, [100, 200])
-        assert cells[0] == "Jan 1"
-        assert cells[1] == "Deposit"
-        assert cells[2] == "250.00"
+class TestFindDatesInWords:
+    def test_us_slash(self):
+        row = [_w("01/15/2026", 0, 0), _w("Coffee", 100, 0)]
+        assert _find_dates_in_words(row) == [(0, "01/15/2026")]
-    def test_no_boundaries_one_column(self):
-        row = [_w("A", 0, 0), _w("B", 20, 0)]
-        cells = assign_columns(row, [])
-        assert cells == ["A B"]
+    def test_two_digit_year(self):
+        row = [_w("01/15/26", 0, 0), _w("Foo", 100, 0)]
+        result = _find_dates_in_words(row)
+        assert result and result[0][1] == "01/15/26"
+    def test_iso(self):
+        row = [_w("2026-01-15", 0, 0), _w("Tx", 100, 0)]
+        assert _find_dates_in_words(row) == [(0, "2026-01-15")]
+    def test_month_name(self):
+        row = [_w("Jan", 0, 0), _w("15,", 25, 0), _w("2026", 50, 0)]
+        result = _find_dates_in_words(row)
+        assert result and "Jan 15" in result[0][1]
+    def test_no_date(self):
+        row = [_w("Just", 0, 0), _w("text", 50, 0)]
+        assert _find_dates_in_words(row) == []
-class TestPagesInRange:
-    def _mk(self, n):
-        return [Page(page_no=i + 1, width=600, height=800, text="", words=[]) for i in range(n)]
+class TestFindAmountTokens:
+    def test_currency_format(self):
+        row = [_w("Coffee", 0, 0), _w("$4.50", 100, 0)]
+        out = _find_amount_tokens(row)
+        assert len(out) == 1
+        assert out[0][2] == "$4.50"
-    def test_all(self):
-        pages = self._mk(5)
-        assert len(_pages_in_range(pages, "all")) == 5
-        assert len(_pages_in_range(pages, "")) == 5
+    def test_parens_negative(self):
+        row = [_w("(123.45)", 0, 0)]
+        out = _find_amount_tokens(row)
+        assert out and out[0][2] == "(123.45)"
-    def test_explicit_list(self):
-        pages = self._mk(5)
-        got = [p.page_no for p in _pages_in_range(pages, "1,3,5")]
-        assert got == [1, 3, 5]
+    def test_no_amount_on_pure_text(self):
+        row = [_w("Hello", 0, 0), _w("World", 50, 0)]
+        assert _find_amount_tokens(row) == []
-    def test_range(self):
-        pages = self._mk(5)
-        got = [p.page_no for p in _pages_in_range(pages, "2-4")]
-        assert got == [2, 3, 4]
-    def test_open_ended(self):
-        pages = self._mk(5)
-        got = [p.page_no for p in _pages_in_range(pages, "3-")]
-        assert got == [3, 4, 5]
+    def test_rejects_bare_year(self):
+        # A bare 4-digit year matches the digit pattern but lacks
+        # any money marker — should be filtered out.
+        row = [_w("2026", 0, 0)]
+        assert _find_amount_tokens(row) == []
-class TestWithinTableWindow:
-    def test_header_skipped_end_excluded(self):
-        rows = [
-            [_w("STATEMENT", 0, 0)],
-            [_w("Date", 0, 20), _w("Description", 50, 20), _w("Amount", 200, 20)],
-            [_w("01/15", 0, 40), _w("Coffee", 50, 40), _w("4.50", 200, 40)],
-            [_w("01/16", 0, 60), _w("Refund", 50, 60), _w("12.00", 200, 60)],
-            [_w("Closing", 0, 80), _w("balance", 50, 80)],
-            [_w("Page", 0, 100), _w("1", 50, 100)],
-        ]
-        out = _within_table_window(rows, "Date Description Amount", ["Closing balance"])
-        # Should keep just the two transaction rows.
-        assert len(out) == 2
-        assert out[0][0].text == "01/15"
-        assert out[1][0].text == "01/16"
-    def test_no_header_returns_empty_when_required(self):
-        rows = [[_w("foo", 0, 0)]]
-        assert _within_table_window(rows, "Date Description Amount", []) == []
-    def test_blank_header_passes_through(self):
-        rows = [[_w("x", 0, 0)], [_w("y", 0, 20)]]
assert _within_table_window(rows, "", []) == rows
class TestApplyTemplate:
"""End-to-end on synthetic ``Page`` objects."""
def _statement_page(self) -> Page:
# Mock layout: 3 columns at x=0/100/200, header at y=20, data at 40+.
words = [
_w("STATEMENT", 0, 0),
# Header
_w("Date", 5, 20), _w("Description", 105, 20), _w("Amount", 205, 20),
# Row 1
_w("01/15/2026", 5, 40), _w("Coffee", 105, 40),
_w("Shop", 140, 40), _w("(4.50)", 205, 40),
# Row 2
_w("01/16/2026", 5, 60), _w("Refund", 105, 60), _w("$12.00", 205, 60),
# Continuation row (no date) — should merge into row 2
_w("from", 105, 80), _w("vendor", 140, 80),
# End marker
_w("Closing", 5, 100), _w("balance", 105, 100), _w("$1,000.00", 205, 100),
]
return Page(page_no=1, width=300, height=120, text="", words=words)
def _template(self) -> dict:
return {
"pages": {"range": "all"},
"table": {
"header_text": "Date Description Amount",
"end_markers": ["Closing balance"],
"column_boundaries": [100, 200],
"y_tolerance": 3.0,
"skip_rows_matching": [],
},
"columns": [
{"source": 0, "target": "date"},
{"source": 1, "target": "description"},
{"source": 2, "target": "amount"},
],
"parse": {
"date_format": "%m/%d/%Y",
"amount_negative_in_parens": True,
"merge_multiline_description": True,
},
}
def test_basic_extraction(self):
df = apply_template([self._statement_page()], self._template())
assert isinstance(df, pd.DataFrame)
assert len(df) == 2
assert list(df["date"]) == ["2026-01-15", "2026-01-16"]
# Parens-negative
assert df.iloc[0]["amount"] == -4.50
# Plain positive with currency strip
assert df.iloc[1]["amount"] == 12.00
# Multi-line description merged
assert "from vendor" in df.iloc[1]["description"]
def test_debit_credit_split_columns(self):
# Layout: date | description | debit | credit columns
page = Page(
page_no=1, width=400, height=80, text="",
words=[
_w("Date", 5, 0), _w("Desc", 105, 0),
_w("Debit", 205, 0), _w("Credit", 305, 0),
_w("01/15/2026", 5, 20), _w("Coffee", 105, 20), _w("4.50", 205, 20),
_w("01/16/2026", 5, 40), _w("Refund", 105, 40),
_w("", 205, 40), # no debit
_w("12.00", 305, 40),
],
)
tpl = {
"table": {
"header_text": "Date Desc Debit Credit",
"column_boundaries": [100, 200, 300],
},
"columns": [
{"source": 0, "target": "date"},
{"source": 1, "target": "description"},
{"source": 2, "target": "amount_debit"},
{"source": 3, "target": "amount_credit"},
],
"parse": {"date_format": "%m/%d/%Y"},
}
df = apply_template([page], tpl)
assert list(df["amount"]) == [-4.50, 12.00]
assert list(df["type"]) == ["debit", "credit"]
def test_skip_rows_matching(self):
page = self._statement_page()
tpl = self._template()
tpl["table"]["skip_rows_matching"] = ["Refund"]
df = apply_template([page], tpl)
# Refund row is dropped — only one transaction left
assert len(df) == 1
assert df.iloc[0]["amount"] == -4.50
def test_empty_pages_returns_empty_df(self):
df = apply_template([], self._template())
assert df.empty
# End-to-end tests against synthetic Page objects are in the smoke
# test module — they need ``scan_pdf_for_transactions`` which in
# turn uses ``extract_pages_auto``. The unit-test layer here pins
# the building blocks; smoke tests pin the wiring.
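Reviewer note: the building blocks pinned here (date detection and amount detection) feed one row rule, "a line is a transaction if it has a date and at least one amount." A minimal sketch of that rule, assuming simplified regexes; `DATE_RE`, `AMOUNT_RE`, and `looks_like_transaction` are hypothetical names for illustration, not the project's `_find_dates_in_words` or `_find_amount_tokens`:

```python
import re

# Hypothetical, simplified patterns (assumptions, not the project's regexes).
# Dates: US slash form (2- or 4-digit year) or ISO form.
DATE_RE = re.compile(r"\b(?:\d{1,2}/\d{1,2}/\d{2,4}|\d{4}-\d{2}-\d{2})\b")
# Amounts: require two decimal places, so a bare year like "2026" never matches.
AMOUNT_RE = re.compile(r"\(?-?\$?\d{1,3}(?:,\d{3})*\.\d{2}\)?")


def looks_like_transaction(line: str) -> bool:
    """Return True when the line has a date AND at least one amount."""
    date = DATE_RE.search(line)
    if date is None:
        return False
    # Look for the amount outside the date span so the two cannot overlap.
    rest = line[:date.start()] + " " + line[date.end():]
    return AMOUNT_RE.search(rest) is not None


if __name__ == "__main__":
    for text in (
        "01/15/2026 Coffee Shop (4.50)",
        "Date Description Amount",
        "Closing balance: $1,000.00",
    ):
        print(text, "->", looks_like_transaction(text))
```

Requiring a decimal part is one way to get the "rejects bare year" behavior; the real filter may use other money markers.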


@@ -1,51 +1,39 @@
"""End-to-end smoke tests for the PDF extraction stack.
"""End-to-end smoke tests for the PDF transaction scanner.
These tests run real ``pdfplumber`` + ``pypdfium2`` calls against
a small PDF generated in-memory with ``fpdf2``. They exist to
catch the failure mode the user hit on first install — a missing
or mismatched native dependency that doesn't show up until the
extractor actually tries to open a PDF.
These run real ``pdfplumber`` calls (plus ``pypdfium2`` when OCR is
in play) against a small statement-shaped PDF generated in memory
with ``fpdf2``. They catch the failure modes most likely to bite
an end-user installer build: a missing native lib, broken hook
bundling, or a pin/installed mismatch.
Per ``project-pdf-extractor`` memory: ``test_pdf_extract.py``
covers the parsing logic on synthetic ``WordBox`` data with no
PDF dep involved. This file is the layer above: it confirms the
deps themselves work, that hooks bundled them correctly (the
versions pinned in ``requirements.txt`` matter here), and that
the extractor's pipeline survives a round-trip through real
``pdfplumber.extract_words`` and real ``pypdfium2.render``.
Generation note: ``fpdf2`` is a test-only dep listed in
Generation note: ``fpdf2`` is a test-only dep in
``requirements-dev.txt``. We don't ship it.
"""
from __future__ import annotations
import io
import pytest
def _build_tiny_statement_pdf() -> bytes:
"""Render a one-page PDF that looks roughly like the simplest
possible bank statement: a header line + three transaction
rows + a closing-balance footer. Word positions are stable
enough that the parser can identify columns by x-position."""
"""One-page PDF: header line + three transaction rows + a
closing-balance footer. The scanner should pick up exactly the
three transactions."""
from fpdf import FPDF
pdf = FPDF(orientation="P", unit="pt", format="letter")
pdf.add_page()
pdf.set_font("Helvetica", size=12)
# Header
pdf.set_xy(40, 50)
pdf.cell(0, 14, "ACME BANK STATEMENT", new_x="LMARGIN", new_y="NEXT")
# Transaction-table header row
# Header row (not a transaction — no amount)
pdf.set_xy(40, 100)
pdf.cell(120, 14, "Date")
pdf.set_xy(160, 100)
pdf.cell(200, 14, "Description")
pdf.set_xy(360, 100)
pdf.cell(80, 14, "Amount")
# Three rows
# Three transactions
rows = [
("01/15/2026", "Coffee Shop", "(4.50)"),
("01/16/2026", "Refund Vendor", "$12.00"),
@@ -60,7 +48,7 @@ def _build_tiny_statement_pdf() -> bytes:
pdf.set_xy(360, y)
pdf.cell(80, 14, amt)
y += 20
# Closing-balance footer
# Footer: has an amount but no date, so it is not a transaction
pdf.set_xy(40, y + 20)
pdf.cell(0, 14, "Closing balance: $1,000.00")
return bytes(pdf.output())
@@ -72,12 +60,8 @@ def _build_tiny_statement_pdf() -> bytes:
class TestDependencyImports:
"""Each runtime PDF dep must be importable.
These tests will fail fast on a stripped/broken install — most
valuable as a CI gate when the requirements.txt pins are
bumped, so we know the new pin still installs cleanly across
the matrix."""
"""Each runtime PDF dep must be importable. Fails fast on a
stripped install or a missing CI pin."""
def test_pdfplumber(self):
import pdfplumber # noqa: F401
@@ -85,130 +69,135 @@ class TestDependencyImports:
def test_pypdfium2(self):
import pypdfium2 # noqa: F401
def test_streamlit_drawable_canvas(self):
# Don't instantiate the canvas — that needs a Streamlit
# script-run context. Just confirm the module loads.
import streamlit_drawable_canvas # noqa: F401
def test_pytesseract(self):
# The Python binding must import even when the Tesseract
# binary isn't installed — the OCR availability check
# handles binary absence separately.
import pytesseract # noqa: F401
def test_PIL(self):
# Transitively required by pdfplumber + pypdfium2 + canvas.
# Pinning it explicitly confirms the hooks pull it through.
from PIL import Image # noqa: F401
# ---------------------------------------------------------------------------
# Real-PDF round-trip
# End-to-end against a real PDF
# ---------------------------------------------------------------------------
class TestRealPdfRoundTrip:
"""``extract_pages`` + ``apply_template`` against a real PDF."""
class TestScanPdfForTransactions:
@pytest.fixture
def pdf_bytes(self) -> bytes:
return _build_tiny_statement_pdf()
def test_extract_pages_returns_words(self, pdf_bytes):
from src.pdf_extract import extract_pages
pages = extract_pages(pdf_bytes)
assert len(pages) == 1
assert pages[0].width > 0 and pages[0].height > 0
# At minimum we should have the words from the header and
# one transaction row — proves pdfplumber wired up.
all_text = " ".join(w.text for w in pages[0].words)
assert "ACME" in all_text
assert "Coffee" in all_text
assert "01/15/2026" in all_text
def test_finds_three_transactions(self, pdf_bytes):
from src.pdf_extract import scan_pdf_for_transactions
rows, warnings = scan_pdf_for_transactions(pdf_bytes)
# The PDF has 3 transactions plus a header and a closing-
# balance footer. Header has no amount; closing-balance has
# no date in the same line — neither qualifies as a txn.
assert len(rows) == 3, (
f"expected 3 rows, got {len(rows)}:\n"
f"{[r.get('raw') for r in rows]}"
)
def test_apply_template_extracts_three_rows(self, pdf_bytes):
from src.pdf_extract import apply_template, extract_pages
# The template's column boundaries are tuned to fpdf2's
# x-coordinates above (40 / 160 / 360 pt).
tpl = {
"pages": {"range": "all"},
"table": {
"header_text": "Date Description Amount",
"end_markers": ["Closing balance"],
"column_boundaries": [150, 350],
"y_tolerance": 3.0,
},
"columns": [
{"source": 0, "target": "date"},
{"source": 1, "target": "description"},
{"source": 2, "target": "amount"},
],
"parse": {
"date_format": "%m/%d/%Y",
"amount_negative_in_parens": True,
"merge_multiline_description": True,
},
}
pages = extract_pages(pdf_bytes)
df = apply_template(pages, tpl)
assert len(df) == 3, f"expected 3 rows, got {len(df)}:\n{df}"
assert list(df["date"]) == [
def test_parses_dates_to_iso(self, pdf_bytes):
from src.pdf_extract import scan_pdf_for_transactions
rows, _ = scan_pdf_for_transactions(pdf_bytes)
assert [r["date"] for r in rows] == [
"2026-01-15", "2026-01-16", "2026-01-17",
]
# Parens-negative + currency-positive both round-trip
assert df.iloc[0]["amount"] == -4.50
assert df.iloc[1]["amount"] == 12.00
assert df.iloc[2]["amount"] == -40.00
def test_parses_amounts_with_signs(self, pdf_bytes):
from src.pdf_extract import scan_pdf_for_transactions
rows, _ = scan_pdf_for_transactions(pdf_bytes)
assert rows[0]["amount_1"] == -4.50
assert rows[1]["amount_1"] == 12.00
assert rows[2]["amount_1"] == -40.00
def test_preserves_raw_line(self, pdf_bytes):
from src.pdf_extract import scan_pdf_for_transactions
rows, _ = scan_pdf_for_transactions(pdf_bytes)
# Raw line lets the user verify what was matched.
assert all("raw" in r and r["raw"] for r in rows)
assert "Coffee" in rows[0]["raw"]
def test_page_tagged(self, pdf_bytes):
from src.pdf_extract import scan_pdf_for_transactions
rows, _ = scan_pdf_for_transactions(pdf_bytes)
assert all(r["page"] == 1 for r in rows)
def test_negative_in_parens_off(self, pdf_bytes):
"""With parens-negative off, the parser can't decode
``(4.50)`` and falls back to the raw text — the row still
surfaces, just with the unparsed string in the amount slot
so the user can see and fix it in the editor."""
from src.pdf_extract import scan_pdf_for_transactions
rows, _ = scan_pdf_for_transactions(
pdf_bytes, negative_in_parens=False,
)
# Row 0 had "(4.50)" — without parens-negative, parse_amount
# returns None and the scanner keeps the raw token.
assert rows[0]["amount_1"] == "(4.50)"
# Row 1 had "$12.00" — still parses to positive.
assert rows[1]["amount_1"] == 12.00
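Reviewer note: the fallback described in ``test_negative_in_parens_off`` can be sketched as below. This is an illustration under assumptions, not the module's code; `parse_amount_sketch` and `amount_cell` are hypothetical names, and the real ``parse_amount`` may accept more formats.

```python
from typing import Optional, Union


def parse_amount_sketch(token: str, negative_in_parens: bool = True) -> Optional[float]:
    """Parse a money token; return None when it cannot be decoded."""
    text = token.strip()
    negative = False
    if text.startswith("(") and text.endswith(")"):
        if not negative_in_parens:
            # Parens are not treated as a sign, so the token is undecodable.
            return None
        negative = True
        text = text[1:-1]
    # Strip the currency symbol and thousands separators before converting.
    text = text.replace("$", "").replace(",", "")
    try:
        value = float(text)
    except ValueError:
        return None
    return -value if negative else value


def amount_cell(token: str, negative_in_parens: bool = True) -> Union[float, str]:
    """Keep the raw token when parsing fails so the user can fix it in the editor."""
    parsed = parse_amount_sketch(token, negative_in_parens)
    return token if parsed is None else parsed
```

Returning the raw string instead of dropping the row keeps the transaction visible in the table, which is the behavior the test pins.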
# ---------------------------------------------------------------------------
# pypdfium2 rendering (powers the visual picker)
# Multi-line description merging
# ---------------------------------------------------------------------------
class TestRenderPageImage:
"""``render_page_image`` is what feeds the drawable canvas.
class TestMultilineDescription:
def test_continuation_line_merges(self):
"""A line with no date and no amount, sitting between two
transaction rows, attaches to the previous transaction's
description."""
from src.pdf_extract import (
Page,
WordBox,
scan_pdf_for_transactions,
)
# Stub ``extract_pages_auto`` with a fake that returns a synthetic
# page, then call the public entry point. This exercises the merge
# behavior without building a real PDF:
from src import pdf_extract as mod
Catches the most common installer-bug: native PDFium .dll/.so
missing from the bundle. If this test crashes with a
``FileNotFoundError`` it almost always means the
``hook-pypdfium2.py`` didn't pick up the shared lib."""
original = mod.extract_pages_auto
def test_renders_a_real_pil_image(self):
from src.pdf_extract import render_page_image
pdf_bytes = _build_tiny_statement_pdf()
image, scale = render_page_image(pdf_bytes, page_no=1)
# Letter-size at scale ≈ 900/612 ≈ 1.47 → ~900px wide.
assert image.width > 800
assert image.height > 800
assert scale > 0
# PIL Image is duck-typed; check the attrs we depend on.
assert hasattr(image, "save")
assert hasattr(image, "tobytes")
def fake(_pdf_bytes, *, allow_ocr=True):
words = [
WordBox(x0=0, top=0, x1=80, bottom=10, text="01/15/2026"),
WordBox(x0=100, top=0, x1=160, bottom=10, text="Coffee"),
WordBox(x0=200, top=0, x1=240, bottom=10, text="$4.50"),
# Continuation: no date, no amount
WordBox(x0=100, top=20, x1=160, bottom=30, text="Vendor"),
WordBox(x0=170, top=20, x1=230, bottom=30, text="memo"),
# Next transaction
WordBox(x0=0, top=40, x1=80, bottom=50, text="01/16/2026"),
WordBox(x0=100, top=40, x1=160, bottom=50, text="Other"),
WordBox(x0=200, top=40, x1=240, bottom=50, text="$10.00"),
]
return [Page(
page_no=1, width=300, height=100, text="", words=words,
)], []
def test_invalid_page_number_clamps(self):
from src.pdf_extract import render_page_image
pdf_bytes = _build_tiny_statement_pdf()
# PDF has 1 page; page_no=99 should clamp, not raise.
image, scale = render_page_image(pdf_bytes, page_no=99)
assert image.width > 0
mod.extract_pages_auto = fake
try:
rows, _ = scan_pdf_for_transactions(b"")
finally:
mod.extract_pages_auto = original
assert len(rows) == 2
assert "Vendor memo" in rows[0]["description"]
assert rows[1]["description"] == "Other"
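Reviewer note: the merge rule this test pins (a line with no date and no amount attaches to the previous transaction) can be sketched on plain text lines. The regexes and the `scan_lines` helper are hypothetical simplifications; the real scanner works on ``WordBox`` rows, not strings.

```python
import re

# Hypothetical, simplified patterns (assumptions for illustration only).
DATE_RE = re.compile(r"\b(?:\d{1,2}/\d{1,2}/\d{2,4}|\d{4}-\d{2}-\d{2})\b")
AMOUNT_RE = re.compile(r"\(?-?\$?\d{1,3}(?:,\d{3})*\.\d{2}\)?")


def scan_lines(lines):
    """Collect transaction rows and fold continuation lines into them."""
    rows = []
    for line in lines:
        date = DATE_RE.search(line)
        amount = AMOUNT_RE.search(line)
        if date and amount:
            # The description is whatever remains after removing date and amounts.
            leftover = AMOUNT_RE.sub("", DATE_RE.sub("", line))
            rows.append({
                "date": date.group(),
                "description": " ".join(leftover.split()),
            })
        elif not date and not amount and rows:
            # Continuation line: append to the previous row's description.
            rows[-1]["description"] += " " + " ".join(line.split())
        # Any other line (or a continuation before the first row) is skipped.
    return rows


if __name__ == "__main__":
    sample = [
        "01/15/2026 Coffee $4.50",
        "Vendor memo",
        "01/16/2026 Other $10.00",
    ]
    for row in scan_lines(sample):
        print(row)
```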
# ---------------------------------------------------------------------------
# Graceful-fallback behavior
# Graceful fallback when deps absent
# ---------------------------------------------------------------------------
class TestPdfDependencyMissing:
"""The page should see a clean exception when a dep is absent,
not a raw ``ImportError`` that leaks into the Streamlit traceback."""
def test_require_pdfplumber_raises_typed_on_absence(self, monkeypatch):
from src import pdf_extract
# Simulate "pdfplumber not installed" without uninstalling.
# ``_require_pdfplumber`` does its own ``import pdfplumber``
# at call time; patch ``__import__`` to throw for that one
# name only.
import builtins
real_import = builtins.__import__
@@ -218,10 +207,10 @@ class TestPdfDependencyMissing:
return real_import(name, *a, **kw)
monkeypatch.setattr(builtins, "__import__", fake_import)
with pytest.raises(pdf_extract.PdfDependencyMissing) as exc_info:
with pytest.raises(pdf_extract.PdfDependencyMissing) as exc:
pdf_extract._require_pdfplumber()
assert "pdfplumber" in str(exc_info.value)
assert exc_info.value.hint # actionable hint must be populated
assert "pdfplumber" in str(exc.value)
assert exc.value.hint
def test_require_pdfium_raises_typed_on_absence(self, monkeypatch):
from src import pdf_extract
@@ -239,17 +228,13 @@ class TestPdfDependencyMissing:
# ---------------------------------------------------------------------------
# Requirements-pin consistency
# Requirements pin consistency
# ---------------------------------------------------------------------------
class TestPinnedVersionsMatchInstalled:
"""If someone bumps the pin in ``requirements.txt`` without
actually reinstalling, this test points it out before CI does.
Uses ``importlib.metadata`` rather than each library's
``__version__`` attribute because not every PDF dep exposes
one (``pypdfium2`` keeps version info on a submodule)."""
actually reinstalling, this test points it out before CI does."""
def _parse_pins(self) -> dict[str, str]:
from pathlib import Path
@@ -266,21 +251,17 @@ class TestPinnedVersionsMatchInstalled:
pins[name.strip()] = version.strip()
return pins
def _installed(self, dist_name: str) -> str:
import importlib.metadata as md
return md.version(dist_name)
@pytest.mark.parametrize("dist_name", [
"pdfplumber",
"pypdfium2",
"pytesseract",
"streamlit-drawable-canvas",
])
def test_pin_matches_installed(self, dist_name):
import importlib.metadata as md
pins = self._parse_pins()
if dist_name not in pins:
pytest.skip(f"{dist_name} not exact-pinned in requirements.txt")
installed = self._installed(dist_name)
installed = md.version(dist_name)
assert installed == pins[dist_name], (
f"installed {dist_name}=={installed} but requirements.txt "
f"pins {pins[dist_name]} — bump the pin, or reinstall."
@@ -288,79 +269,52 @@ class TestPinnedVersionsMatchInstalled:
# ---------------------------------------------------------------------------
# OCR availability runtime probe
# OCR availability
# ---------------------------------------------------------------------------
class TestOcrAvailability:
"""``ocr_available`` is the linchpin of the UI's OCR banner.
Returns ``(bool, str)`` — both branches must round-trip."""
def test_returns_a_tuple(self):
from src.pdf_extract import ocr_available
result = ocr_available()
assert isinstance(result, tuple)
assert len(result) == 2
assert isinstance(result, tuple) and len(result) == 2
ok, reason = result
assert isinstance(ok, bool)
assert isinstance(reason, str)
def test_extract_pages_auto_skips_ocr_when_disabled(self):
from src.pdf_extract import extract_pages_auto
# With allow_ocr=False, no OCR even if pages are blank.
pdf_bytes = _build_tiny_statement_pdf()
pages, warnings = extract_pages_auto(pdf_bytes, allow_ocr=False)
assert len(pages) == 1
# No OCR-disabled warning on a text PDF, since pages have text.
assert not any("OCR is disabled" in w for w in warnings)
class TestTesseractDiscovery:
"""Windows install paths + env-var override are how a real user
(no PATH munging) gets OCR working. Cover the discovery logic
even on Linux/macOS test runners by mocking out the OS check
and ``Path.exists``."""
def test_autodetect_returns_none_on_non_windows(self, monkeypatch):
from src import pdf_extract
monkeypatch.setattr(
"platform.system",
lambda: "Linux",
)
monkeypatch.setattr("platform.system", lambda: "Linux")
assert pdf_extract._autodetect_tesseract_path() is None
def test_autodetect_finds_program_files_on_windows(self, monkeypatch):
from src import pdf_extract
monkeypatch.setattr("platform.system", lambda: "Windows")
target = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
def fake_exists(self):
return str(self) == target
monkeypatch.setattr(
"pathlib.Path.exists",
fake_exists,
)
monkeypatch.setattr("pathlib.Path.exists", fake_exists)
assert pdf_extract._autodetect_tesseract_path() == target
def test_autodetect_returns_none_when_nothing_installed(
self, monkeypatch,
):
def test_autodetect_returns_none_when_nothing_installed(self, monkeypatch):
from src import pdf_extract
monkeypatch.setattr("platform.system", lambda: "Windows")
monkeypatch.setattr("pathlib.Path.exists", lambda self: False)
assert pdf_extract._autodetect_tesseract_path() is None
def test_env_var_override_takes_precedence(self, monkeypatch, tmp_path):
"""``DATATOOLS_TESSERACT_PATH`` wins over discovery so a
portable install at a non-default path works without
relying on PATH."""
from src import pdf_extract
# Point the override at a path that doesn't exist —
# ocr_available will try it and report the failure, but
# importantly the cmd attribute is set BEFORE the call,
# which is what we're verifying.
fake_bin = str(tmp_path / "fake-tesseract.exe")
monkeypatch.setenv("DATATOOLS_TESSERACT_PATH", fake_bin)
pdf_extract.ocr_available()
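Reviewer note: the lookup order these discovery tests describe (env-var override first, then Windows install paths, otherwise nothing) can be sketched as follows. `resolve_tesseract_cmd` is a hypothetical name, and the second candidate path is an assumption; only the ``Program Files`` path and the ``DATATOOLS_TESSERACT_PATH`` variable appear in the tests.

```python
import os
import platform
from pathlib import Path
from typing import Mapping, Optional

# First entry comes from the tests; the x86 entry is an assumed extra candidate.
WINDOWS_CANDIDATES = (
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
)


def resolve_tesseract_cmd(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the Tesseract binary path to try, or None to rely on PATH."""
    env = os.environ if env is None else env
    override = env.get("DATATOOLS_TESSERACT_PATH")
    if override:
        # The override wins even if the file is missing; the caller's
        # availability check reports that failure separately.
        return override
    if platform.system() != "Windows":
        return None
    for candidate in WINDOWS_CANDIDATES:
        if Path(candidate).exists():
            return candidate
    return None
```

Passing the environment as a parameter keeps the sketch testable without touching ``os.environ``.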


@@ -1,280 +0,0 @@
"""Tests for the row-heuristic extraction pipeline.
This is now the primary extraction mode — uses date + amount
pattern matching to find transaction lines, with no dependency
on x-position column boundaries. Robust to layout drift across
statements from the same bank.
The legacy column-visual pipeline keeps its own tests in
``test_pdf_extract.py``.
"""
from __future__ import annotations
import pandas as pd
from src.pdf_extract import (
Page,
WordBox,
apply_template,
apply_template_row_heuristic,
find_transaction_rows,
_find_amount_tokens,
_find_dates_in_words,
_infer_amount_column_centers,
)
def _w(text: str, x0: float, top: float) -> WordBox:
return WordBox(
x0=x0,
top=top,
x1=x0 + 8 * len(text),
bottom=top + 10,
text=text,
)
class TestFindDatesInRow:
def test_us_slash(self):
row = [_w("01/15/2026", 0, 0), _w("Coffee", 100, 0)]
assert _find_dates_in_words(row) == [(0, "01/15/2026")]
def test_two_digit_year(self):
row = [_w("01/15/26", 0, 0), _w("Foo", 100, 0)]
result = _find_dates_in_words(row)
assert result and result[0][1] == "01/15/26"
def test_iso(self):
row = [_w("2026-01-15", 0, 0), _w("Tx", 100, 0)]
assert _find_dates_in_words(row) == [(0, "2026-01-15")]
def test_month_name(self):
# "Jan 15, 2026" — three word tokens, should stitch.
row = [_w("Jan", 0, 0), _w("15,", 25, 0), _w("2026", 50, 0)]
result = _find_dates_in_words(row)
assert result, "Multi-word month-day-year should match"
assert "Jan 15" in result[0][1]
def test_no_date(self):
row = [_w("Just", 0, 0), _w("text", 50, 0)]
assert _find_dates_in_words(row) == []
class TestFindAmountTokens:
def test_currency_format(self):
row = [_w("Coffee", 0, 0), _w("$4.50", 100, 0)]
out = _find_amount_tokens(row)
assert len(out) == 1
assert out[0][2] == "$4.50"
def test_parens_negative(self):
row = [_w("(123.45)", 0, 0)]
out = _find_amount_tokens(row)
assert out and out[0][2] == "(123.45)"
def test_no_amount_on_pure_text(self):
row = [_w("Hello", 0, 0), _w("World", 50, 0)]
assert _find_amount_tokens(row) == []
def test_rejects_bare_year(self):
# "2026" matches the digit pattern but lacks $/decimal/etc.,
# so the looks-like-amount filter should drop it.
row = [_w("2026", 0, 0)]
# Bare integer can pass the regex but not the heuristic.
out = _find_amount_tokens(row)
# Either filtered out OR included — both are defensible.
# If included, it'd be missed-amount territory not a false-
# positive. Pin the conservative behavior: NO match.
assert out == [], "Bare 4-digit year should not register as amount"
class TestInferAmountColumnCenters:
def test_two_clear_columns(self):
# 5 rows, each with two amounts at roughly x=300 and x=450.
rows = []
for top in range(0, 100, 20):
rows.append([
_w("01/15/2026", 20, top),
_w("Item", 100, top),
_w("$10.00", 300, top),
_w("$1,000.00", 450, top),
])
centers = _infer_amount_column_centers(
rows, expected=2, min_amounts=2, max_amounts=2,
)
assert len(centers) == 2
# Left center ≈ 300 + 8*len("$10.00")/2 = 300+24 = 324
assert 310 < centers[0] < 340
assert 460 < centers[1] < 490
def test_no_transactions_returns_empty(self):
rows = [[_w("just", 0, 0), _w("text", 50, 0)]]
assert _infer_amount_column_centers(
rows, expected=2, min_amounts=1, max_amounts=3,
) == []
class TestRowHeuristicEndToEnd:
"""Synthetic ``Page`` objects exercise the full row-heuristic
pipeline end-to-end without a real PDF."""
def _page_single_amount(self) -> Page:
words = [
_w("ACME BANK STATEMENT", 20, 0),
_w("01/15/2026", 20, 30), _w("Coffee", 100, 30),
_w("Shop", 150, 30), _w("$4.50", 400, 30),
_w("01/16/2026", 20, 50), _w("Refund", 100, 50),
_w("from", 100, 70), _w("vendor", 140, 70), # continuation
_w("Vendor", 140, 50), _w("$12.00", 400, 50),
_w("Page", 20, 90), _w("1", 60, 90), # not a txn
]
return Page(page_no=1, width=600, height=120, text="", words=words)
def test_extracts_two_rows_single_amount(self):
tpl = {
"mode": "row_heuristic",
"row_detection": {
"min_amounts_per_row": 1,
"max_amounts_per_row": 1,
"merge_multiline_description": True,
},
"amounts": {"shape": "single", "negative_in_parens": True},
"date": {"format": "%m/%d/%Y"},
}
df = apply_template_row_heuristic([self._page_single_amount()], tpl)
assert len(df) == 2
assert list(df["date"]) == ["2026-01-15", "2026-01-16"]
# Multi-line description merged
assert "from vendor" in df.iloc[1]["description"]
def test_dispatches_through_apply_template(self):
tpl = {
"mode": "row_heuristic",
"row_detection": {"min_amounts_per_row": 1, "max_amounts_per_row": 1},
"amounts": {"shape": "single"},
"date": {"format": "%m/%d/%Y"},
}
df = apply_template([self._page_single_amount()], tpl)
assert isinstance(df, pd.DataFrame)
assert len(df) == 2
def test_txn_balance_shape(self):
page = Page(
page_no=1, width=600, height=100, text="", words=[
_w("01/15/2026", 20, 0), _w("Coffee", 100, 0),
_w("(4.50)", 300, 0), _w("1,000.00", 450, 0),
_w("01/16/2026", 20, 20), _w("Refund", 100, 20),
_w("12.00", 300, 20), _w("1,012.00", 450, 20),
],
)
tpl = {
"mode": "row_heuristic",
"row_detection": {"min_amounts_per_row": 2, "max_amounts_per_row": 2},
"amounts": {"shape": "txn_balance", "negative_in_parens": True},
"date": {"format": "%m/%d/%Y"},
}
df = apply_template([page], tpl)
assert len(df) == 2
assert df.iloc[0]["amount"] == -4.50
assert df.iloc[0]["balance"] == 1000.00
assert df.iloc[1]["amount"] == 12.00
assert df.iloc[1]["balance"] == 1012.00
def test_debit_credit_balance_shape(self):
page = Page(
page_no=1, width=600, height=100, text="", words=[
_w("01/15/2026", 20, 0), _w("Coffee", 100, 0),
_w("4.50", 300, 0), _w("1,000.00", 450, 0),
_w("01/16/2026", 20, 20), _w("Refund", 100, 20),
_w("12.00", 380, 20), _w("1,012.00", 450, 20),
],
)
tpl = {
"mode": "row_heuristic",
"row_detection": {"min_amounts_per_row": 2, "max_amounts_per_row": 3},
"amounts": {"shape": "debit_credit_balance"},
"date": {"format": "%m/%d/%Y"},
}
df = apply_template([page], tpl)
assert len(df) == 2
# Row 0: amount at x=300 (debit column) → debit, balance at 450
assert df.iloc[0]["amount"] == -4.50
assert df.iloc[0]["type"] == "debit"
# Row 1: amount at x=380 (credit column) → credit, balance at 450
assert df.iloc[1]["amount"] == 12.00
assert df.iloc[1]["type"] == "credit"
def test_skip_rows_matching(self):
page = self._page_single_amount()
tpl = {
"mode": "row_heuristic",
"row_detection": {
"min_amounts_per_row": 1,
"max_amounts_per_row": 1,
"skip_rows_matching": ["Refund"],
},
"amounts": {"shape": "single"},
"date": {"format": "%m/%d/%Y"},
}
df = apply_template_row_heuristic([page], tpl)
assert len(df) == 1
assert df.iloc[0]["date"] == "2026-01-15"
def test_layout_drift_doesnt_matter(self):
"""The whole point of row-heuristic: same template works
on pages of different sizes / different column x-positions."""
# Page A: amounts at x=400
page_a = Page(
page_no=1, width=600, height=80, text="", words=[
_w("01/15/2026", 20, 0), _w("Coffee", 100, 0),
_w("$4.50", 400, 0),
],
)
# Page B: amounts shifted to x=520 (different layout)
page_b = Page(
page_no=1, width=720, height=80, text="", words=[
_w("01/15/2026", 50, 0), _w("Coffee", 150, 0),
_w("$4.50", 520, 0),
],
)
tpl = {
"mode": "row_heuristic",
"row_detection": {"min_amounts_per_row": 1, "max_amounts_per_row": 1},
"amounts": {"shape": "single"},
"date": {"format": "%m/%d/%Y"},
}
df_a = apply_template([page_a], tpl)
df_b = apply_template([page_b], tpl)
# Both should extract — proves no coordinate dependency.
assert len(df_a) == 1
assert len(df_b) == 1
assert df_a.iloc[0]["amount"] == df_b.iloc[0]["amount"] == 4.50
class TestFindTransactionRows:
"""The pre-DataFrame stage — returns dict records the build UI
uses to render a preview before the user commits."""
def test_returns_records(self):
page = Page(
page_no=1, width=600, height=80, text="", words=[
_w("01/15/2026", 20, 0), _w("Coffee", 100, 0),
_w("$4.50", 400, 0),
],
)
tpl = {
"mode": "row_heuristic",
"row_detection": {"min_amounts_per_row": 1, "max_amounts_per_row": 1},
"amounts": {"shape": "single"},
"date": {"format": "%m/%d/%Y"},
}
rows = find_transaction_rows([page], tpl)
assert len(rows) == 1
r = rows[0]
assert r["date"] == "2026-01-15"
assert r["description"] == "Coffee"
assert r["amount"] == 4.50
assert r["_page"] == 1
# Raw line is preserved so the GUI can show "what we saw"
assert "_raw_line" in r


@@ -1,316 +0,0 @@
"""Tests for the PDF template storage layer."""
from __future__ import annotations
import json
import pytest
from src.pdf_templates import (
SCHEMA_VERSION,
delete_template,
list_templates,
load_template,
new_template,
save_template,
slugify,
template_from_json,
template_path,
templates_dir,
template_to_json,
validate_template,
)
@pytest.fixture
def isolated_templates(monkeypatch, tmp_path):
"""Redirect the templates directory into ``tmp_path``."""
monkeypatch.setenv("DATATOOLS_PDF_TEMPLATES_DIR", str(tmp_path))
yield tmp_path
class TestSlugify:
def test_basic(self):
assert slugify("Chase Personal Checking") == "chase-personal-checking"
def test_strips_punctuation(self):
assert slugify("BofA: Business (USD)") == "bofa-business-usd"
def test_empty_falls_back(self):
assert slugify("") == "untitled"
assert slugify(" ") == "untitled"
class TestNewTemplate:
def test_has_schema_version(self):
t = new_template("Sample")
assert t["schema_version"] == SCHEMA_VERSION
def test_slug_derived_from_name(self):
t = new_template("Sample Bank")
assert t["slug"] == "sample-bank"
assert t["name"] == "Sample Bank"
def test_timestamps_present(self):
t = new_template("X")
assert t["created_at"]
assert t["updated_at"]


class TestValidateTemplateRowHeuristic:
    """Row-heuristic mode is the v2 default."""

    def _valid(self) -> dict:
        return {
            "schema_version": SCHEMA_VERSION,
            "slug": "x",
            "name": "X",
            "mode": "row_heuristic",
            "row_detection": {
                "min_amounts_per_row": 1,
                "max_amounts_per_row": 3,
            },
            "amounts": {"shape": "single"},
            "date": {"format": "%m/%d/%Y"},
        }

    def test_valid_passes(self):
        ok, errs = validate_template(self._valid())
        assert ok, errs

    def test_missing_name_fails(self):
        t = self._valid()
        t["name"] = ""
        ok, errs = validate_template(t)
        assert not ok

    def test_bad_mode_fails(self):
        t = self._valid()
        t["mode"] = "magic"
        ok, errs = validate_template(t)
        assert not ok
        assert any("mode" in e for e in errs)

    def test_bad_shape_fails(self):
        t = self._valid()
        t["amounts"]["shape"] = "telepathic"
        ok, errs = validate_template(t)
        assert not ok
        assert any("shape" in e for e in errs)

    def test_inverted_amount_range_fails(self):
        t = self._valid()
        t["row_detection"]["min_amounts_per_row"] = 5
        t["row_detection"]["max_amounts_per_row"] = 2
        ok, errs = validate_template(t)
        assert not ok

    def test_does_not_require_columns_in_row_mode(self):
        """Key point: row mode doesn't need ``columns`` populated.
        That's what makes the GUI's primary path simpler than v1."""
        t = self._valid()
        # No columns key at all.
        ok, errs = validate_template(t)
        assert ok, errs


class TestValidateTemplateColumnVisual:
    """Legacy column-visual mode keeps its own contract."""

    def _valid(self) -> dict:
        return {
            "schema_version": SCHEMA_VERSION,
            "slug": "x",
            "name": "X",
            "mode": "column_visual",
            "pages": {"range": "all"},
            "table": {"column_boundaries": [100, 200]},
            "columns": [
                {"source": 0, "target": "date"},
                {"source": 1, "target": "description"},
                {"source": 2, "target": "amount"},
            ],
            "parse": {},
        }

    def test_valid_passes(self):
        ok, errs = validate_template(self._valid())
        assert ok, errs

    def test_requires_date_column(self):
        t = self._valid()
        t["columns"] = [
            {"source": 0, "target": "description"},
            {"source": 1, "target": "amount"},
        ]
        ok, errs = validate_template(t)
        assert not ok
        assert any("date" in e for e in errs)

    def test_requires_amount_or_debit_credit(self):
        t = self._valid()
        t["columns"] = [
            {"source": 0, "target": "date"},
            {"source": 1, "target": "description"},
        ]
        ok, errs = validate_template(t)
        assert not ok
        assert any("amount" in e for e in errs)

    def test_debit_credit_pair_is_valid(self):
        t = self._valid()
        t["columns"] = [
            {"source": 0, "target": "date"},
            {"source": 1, "target": "description"},
            {"source": 2, "target": "amount_debit"},
            {"source": 3, "target": "amount_credit"},
        ]
        t["table"]["column_boundaries"] = [100, 200, 300]
        ok, errs = validate_template(t)
        assert ok, errs


class TestV1Migration:
    """v1 templates load with mode='column_visual' auto-injected;
    the file on disk stays v1 until the user re-saves."""

    def test_loads_v1_template(self, isolated_templates, tmp_path):
        import json

        v1_payload = {
            "schema_version": 1,
            "slug": "legacy",
            "name": "Legacy Bank",
            "pages": {"range": "all"},
            "table": {"column_boundaries": [100, 200]},
            "columns": [
                {"source": 0, "target": "date"},
                {"source": 1, "target": "description"},
                {"source": 2, "target": "amount"},
            ],
            "parse": {},
        }
        (tmp_path / "legacy.json").write_text(
            json.dumps(v1_payload), encoding="utf-8",
        )
        loaded = load_template("legacy")
        # In-memory migration adds mode + bumps schema_version
        assert loaded["mode"] == "column_visual"
        assert loaded["schema_version"] == SCHEMA_VERSION
        # Original keys still intact
        assert loaded["columns"][0]["target"] == "date"


class TestPersistence:
    def test_round_trip(self, isolated_templates):
        t = new_template("Round Trip Bank")
        t["columns"] = [
            {"source": 0, "target": "date"},
            {"source": 1, "target": "description"},
            {"source": 2, "target": "amount"},
        ]
        t["table"]["column_boundaries"] = [100, 200]
        slug = save_template(t)
        assert slug == "round-trip-bank"
        path = template_path(slug)
        assert path.exists()
        loaded = load_template(slug)
        assert loaded["name"] == "Round Trip Bank"
        assert loaded["columns"][0]["target"] == "date"

    def test_save_rejects_invalid(self, isolated_templates):
        with pytest.raises(ValueError):
            save_template({"schema_version": 1, "name": ""})

    def test_load_missing_raises(self, isolated_templates):
        with pytest.raises(FileNotFoundError):
            load_template("does-not-exist")

    def test_load_corrupt_raises(self, isolated_templates, tmp_path):
        bad = tmp_path / "bad.json"
        bad.write_text("not json", encoding="utf-8")
        with pytest.raises(ValueError):
            load_template("bad")

    def test_delete(self, isolated_templates):
        t = new_template("To Delete")
        t["columns"] = [
            {"source": 0, "target": "date"},
            {"source": 1, "target": "amount"},
        ]
        t["table"]["column_boundaries"] = [100]
        save_template(t)
        assert delete_template("to-delete") is True
        assert delete_template("to-delete") is False

    def test_list_returns_summaries(self, isolated_templates):
        for name in ["Alpha", "Bravo"]:
            t = new_template(name)
            t["columns"] = [
                {"source": 0, "target": "date"},
                {"source": 1, "target": "amount"},
            ]
            t["table"]["column_boundaries"] = [100]
            save_template(t)
        rows = list_templates()
        assert {r["slug"] for r in rows} == {"alpha", "bravo"}

    def test_list_skips_corrupt(self, isolated_templates, tmp_path):
        (tmp_path / "broken.json").write_text("nope", encoding="utf-8")
        # Even with a broken file present, list still returns []
        rows = list_templates()
        assert rows == []

    def test_atomic_save_no_partial_file_on_failure(
        self, isolated_templates, monkeypatch
    ):
        """If the write step fails mid-way, no half-written JSON survives
        at the target path. Tests the temp-file-rename safety pattern."""
        t = new_template("Atomic")
        t["columns"] = [
            {"source": 0, "target": "date"},
            {"source": 1, "target": "amount"},
        ]
        t["table"]["column_boundaries"] = [100]
        # Make json.dumps blow up to simulate a failure during write.
        # save_template already validated before this step, so the
        # crash is "after validation, during write".
        import src.pdf_templates as mod

        original_dumps = mod.json.dumps

        def boom(*a, **kw):
            raise IOError("disk full")

        monkeypatch.setattr(mod.json, "dumps", boom)
        with pytest.raises(IOError):
            save_template(t)
        monkeypatch.setattr(mod.json, "dumps", original_dumps)
        assert not template_path("atomic").exists()


class TestImportExport:
    def test_round_trip_via_json(self):
        t = new_template("Exported")
        t["columns"] = [
            {"source": 0, "target": "date"},
            {"source": 1, "target": "amount"},
        ]
        payload = template_to_json(t)
        loaded = template_from_json(payload)
        assert loaded["name"] == "Exported"

    def test_import_rejects_bad_schema(self):
        bad = json.dumps({"schema_version": 999, "name": "X"})
        with pytest.raises(ValueError):
            template_from_json(bad)

    def test_import_rejects_non_object(self):
        with pytest.raises(ValueError):
            template_from_json('["not", "an", "object"]')


def test_templates_dir_env_override(monkeypatch, tmp_path):
    monkeypatch.setenv("DATATOOLS_PDF_TEMPLATES_DIR", str(tmp_path))
    assert templates_dir() == tmp_path