refactor(pdf): rip out templates; heuristic scan + selectable table
User feedback: the template / visual-picker / mode-dispatch
implementation was too complex for the actual workflow.
Statements drift between months, the canvas state didn't survive
multi-page navigation, and accountants don't want to maintain
per-bank configuration just to convert PDFs to CSV.
Start-over design — one public function, one page, no
persistence:
    ``scan_pdf_for_transactions(pdf_bytes) → (rows, warnings)``

A row is "any text line with a date pattern AND at least one
amount pattern." Each detected row is a dict shaped::

    {
        "date": "2026-01-15",
        "description": "Coffee Shop",
        "amount_1": -4.50,
        "amount_2": 1000.00,  # if a second amount was found
        "page": 1,
        "raw": "01/15/2026 Coffee Shop (4.50) 1,000.00",
        "source_file": "chase-jan-2026.pdf",
    }

Multi-line descriptions still merge (no-date no-amount lines
attach to the previous transaction). Multi-PDF batches share a
single combined table with a ``source_file`` column.
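
As a rough illustration of the row rule and the continuation-line merge described above, here is a minimal sketch that works on plain text lines. It is not the code in ``src/pdf_extract.py``: the regexes, ``parse_amount``, and ``detect_rows`` are hypothetical stand-ins, and the real scanner works on word boxes rather than pre-joined strings.

```python
import re
from datetime import datetime

# Hypothetical patterns, chosen only for this sketch.
DATE_RE = re.compile(r"\b(\d{2}/\d{2}/\d{4})\b")
AMOUNT_RE = re.compile(r"\(?-?\$?\d[\d,]*\.\d{2}\)?")


def parse_amount(token, negative_in_parens=True):
    """Return a float, or None when the token cannot be decoded."""
    is_parens = token.startswith("(") and token.endswith(")")
    if is_parens and not negative_in_parens:
        return None
    value = float(token.strip("()").replace("$", "").replace(",", ""))
    return -value if is_parens else value


def detect_rows(lines, page=1, negative_in_parens=True):
    """A row is any line with a date AND at least one amount."""
    rows = []
    for line in lines:
        date_match = DATE_RE.search(line)
        amount_matches = list(AMOUNT_RE.finditer(line))
        if date_match and amount_matches:
            iso = datetime.strptime(date_match.group(1), "%m/%d/%Y").date().isoformat()
            row = {
                "date": iso,
                # Description: text between the date and the first amount.
                "description": line[date_match.end():amount_matches[0].start()].strip(),
                "page": page,
                "raw": line,
            }
            for i, match in enumerate(amount_matches[:2], start=1):
                value = parse_amount(match.group(0), negative_in_parens)
                # Keep the raw token when it cannot be parsed.
                row[f"amount_{i}"] = match.group(0) if value is None else value
            rows.append(row)
        elif rows and not date_match and not amount_matches and line.strip():
            # No date and no amount: attach to the previous transaction.
            rows[-1]["description"] += " " + line.strip()
    return rows
```

Under these assumptions a header line (no amount) and a closing-balance line (no date) are both skipped, while a bare text line after a transaction extends that transaction's description.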
**Page UX:**
- Upload PDF(s) → optional Options expander (parens-negative,
use-OCR) → click Scan → see all detected rows in an
``st.data_editor``.
- The editor has an ``Include`` checkbox column (default on),
plus user-editable date / description / amount cells and a
read-only ``raw`` column showing the original PDF text for
verification.
- A ``Columns to include in CSV`` multiselect hides
``page`` / ``raw`` from the download by default; user can
re-add either.
- Download CSV gets only the checked rows.
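
The download step can be pictured with the standard library alone. This is a sketch under assumptions, not the page's actual code: ``default_columns`` and ``rows_to_csv`` are hypothetical names, and the rows are assumed to be plain dicts carrying the editor's ``Include`` flag.

```python
import csv
import io

# Columns hidden from the download unless the user re-adds them.
DEFAULT_HIDDEN = {"page", "raw"}


def default_columns(rows):
    """Every data key in first-seen order, minus Include and the hidden ones."""
    seen = []
    for row in rows:
        for key in row:
            if key != "Include" and key not in DEFAULT_HIDDEN and key not in seen:
                seen.append(key)
    return seen


def rows_to_csv(rows, columns):
    """Write only the checked rows, restricted to the chosen columns."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=columns, extrasaction="ignore", lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        if row.get("Include", True):  # the checkbox defaults to on
            writer.writerow(row)
    return buf.getvalue()
```

Passing ``default_columns(rows) + ["raw"]`` as ``columns`` would correspond to the user re-adding the ``raw`` column in the multiselect.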
No template save/load. No visual picker. No mode dispatch. No
column boundaries. No schema migration. No per-bank
configuration files.
**Deletions:**
- ``src/pdf_templates.py`` — template storage layer
- ``src/gui/_drawable_canvas_compat.py`` — Streamlit compat shim
for the canvas (no canvas now)
- ``tests/test_pdf_templates.py``, ``test_pdf_row_heuristic.py``,
``test_drawable_canvas_compat.py`` — covered the removed APIs
- ``build/hooks/hook-streamlit_drawable_canvas.py`` — hook for
the removed dep
- ``streamlit-drawable-canvas==0.9.3`` from ``requirements.txt``
- The drawable-canvas references in ``build/datatools.spec``
**``src/pdf_extract.py``** shrinks from ~30 helper functions to
~10. Keeps: value parsers, row clusterer, date/amount token
finders, OCR pipeline, dependency guards. The one new public
function ``scan_pdf_for_transactions`` glues them together.
**Tests** (59 passing): the unit layer keeps full coverage of
the building blocks; the smoke layer pins the end-to-end PDF
roundtrip, OCR discovery, dependency-import behavior, and the
multi-line-description merge. The fpdf2-generated fixture PDF
still drives the real-PDF test.
Rollback: ``git revert HEAD`` brings back the template system if
needed — but the simpler model should make that unlikely.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -1,55 +1,43 @@
-"""End-to-end smoke tests for the PDF extraction stack.
+"""End-to-end smoke tests for the PDF transaction scanner.

-These tests run real ``pdfplumber`` + ``pypdfium2`` calls against
-a small PDF generated in-memory with ``fpdf2``. They exist to
-catch the failure mode the user hit on first install — a missing
-or mismatched native dependency that doesn't show up until the
-extractor actually tries to open a PDF.
+These run real ``pdfplumber`` + ``pypdfium2`` (when OCR is in play)
+calls against a small statement-shaped PDF generated in memory
+with ``fpdf2``. They catch the failure modes most likely to bite
+an end-user installer build: missing native lib, broken hook
+bundling, pin/installed mismatch.

-Per ``project-pdf-extractor`` memory: ``test_pdf_extract.py``
-covers the parsing logic on synthetic ``WordBox`` data with no
-PDF dep involved. This file is the layer above: it confirms the
-deps themselves work, that hooks bundled them correctly (the
-versions pinned in ``requirements.txt`` matter here), and that
-the extractor's pipeline survives a round-trip through real
-``pdfplumber.extract_words`` and real ``pypdfium2.render``.
-
-Generation note: ``fpdf2`` is a test-only dep listed in
+Generation note: ``fpdf2`` is a test-only dep in
 ``requirements-dev.txt``. We don't ship it.
 """

 from __future__ import annotations

 import io

 import pytest


 def _build_tiny_statement_pdf() -> bytes:
-    """Render a one-page PDF that looks roughly like the simplest
-    possible bank statement: a header line + three transaction
-    rows + a closing-balance footer. Word positions are stable
-    enough that the parser can identify columns by x-position."""
+    """One-page PDF: header line + three transaction rows + a
+    closing-balance footer. The scanner should pick up exactly the
+    three transactions."""
     from fpdf import FPDF

     pdf = FPDF(orientation="P", unit="pt", format="letter")
     pdf.add_page()
     pdf.set_font("Helvetica", size=12)
     # Header
     pdf.set_xy(40, 50)
     pdf.cell(0, 14, "ACME BANK STATEMENT", new_x="LMARGIN", new_y="NEXT")
-    # Transaction-table header row
+    # Header row (not a transaction — no amount)
     pdf.set_xy(40, 100)
     pdf.cell(120, 14, "Date")
     pdf.set_xy(160, 100)
     pdf.cell(200, 14, "Description")
     pdf.set_xy(360, 100)
     pdf.cell(80, 14, "Amount")
-    # Three rows
+    # Three transactions
     rows = [
-        ("01/15/2026", "Coffee Shop", "(4.50)"),
-        ("01/16/2026", "Refund Vendor", "$12.00"),
-        ("01/17/2026", "ATM Withdrawal","(40.00)"),
+        ("01/15/2026", "Coffee Shop", "(4.50)"),
+        ("01/16/2026", "Refund Vendor", "$12.00"),
+        ("01/17/2026", "ATM Withdrawal", "(40.00)"),
     ]
     y = 130
     for date, desc, amt in rows:
@@ -60,7 +48,7 @@ def _build_tiny_statement_pdf() -> bytes:
         pdf.set_xy(360, y)
         pdf.cell(80, 14, amt)
         y += 20
-    # Closing-balance footer
+    # Footer — has a date-like number maybe but no real txn shape
     pdf.set_xy(40, y + 20)
     pdf.cell(0, 14, "Closing balance: $1,000.00")
     return bytes(pdf.output())
@@ -72,12 +60,8 @@ def _build_tiny_statement_pdf() -> bytes:


 class TestDependencyImports:
-    """Each runtime PDF dep must be importable.
-
-    These tests will fail fast on a stripped/broken install — most
-    valuable as a CI gate when the requirements.txt pins are
-    bumped, so we know the new pin still installs cleanly across
-    the matrix."""
+    """Each runtime PDF dep must be importable. Fails fast on a
+    stripped install or a missing CI pin."""

     def test_pdfplumber(self):
         import pdfplumber  # noqa: F401
@@ -85,130 +69,135 @@ class TestDependencyImports:
     def test_pypdfium2(self):
         import pypdfium2  # noqa: F401

-    def test_streamlit_drawable_canvas(self):
-        # Don't instantiate the canvas — that needs a Streamlit
-        # script-run context. Just confirm the module loads.
-        import streamlit_drawable_canvas  # noqa: F401
-
     def test_pytesseract(self):
         # The Python binding must import even when the Tesseract
         # binary isn't installed — the OCR availability check
         # handles binary absence separately.
         import pytesseract  # noqa: F401

     def test_PIL(self):
         # Transitively required by pdfplumber + pypdfium2 + canvas.
         # Pinning explicit confirms hooks pull it through.
         from PIL import Image  # noqa: F401


 # ---------------------------------------------------------------------------
-# Real-PDF round-trip
+# End-to-end against a real PDF
 # ---------------------------------------------------------------------------


-class TestRealPdfRoundTrip:
-    """``extract_pages`` + ``apply_template`` against a real PDF."""
-
+class TestScanPdfForTransactions:
     @pytest.fixture
     def pdf_bytes(self) -> bytes:
         return _build_tiny_statement_pdf()

-    def test_extract_pages_returns_words(self, pdf_bytes):
-        from src.pdf_extract import extract_pages
-        pages = extract_pages(pdf_bytes)
-        assert len(pages) == 1
-        assert pages[0].width > 0 and pages[0].height > 0
-        # At minimum we should have the words from the header and
-        # one transaction row — proves pdfplumber wired up.
-        all_text = " ".join(w.text for w in pages[0].words)
-        assert "ACME" in all_text
-        assert "Coffee" in all_text
-        assert "01/15/2026" in all_text
+    def test_finds_three_transactions(self, pdf_bytes):
+        from src.pdf_extract import scan_pdf_for_transactions
+        rows, warnings = scan_pdf_for_transactions(pdf_bytes)
+        # The PDF has 3 transactions plus a header and a closing-
+        # balance footer. Header has no amount; closing-balance has
+        # no date in the same line — neither qualifies as a txn.
+        assert len(rows) == 3, (
+            f"expected 3 rows, got {len(rows)}:\n"
+            f"{[r.get('raw') for r in rows]}"
+        )

-    def test_apply_template_extracts_three_rows(self, pdf_bytes):
-        from src.pdf_extract import apply_template, extract_pages
-        # The template's column boundaries are tuned to fpdf2's
-        # x-coordinates above (40 / 160 / 360 pt).
-        tpl = {
-            "pages": {"range": "all"},
-            "table": {
-                "header_text": "Date Description Amount",
-                "end_markers": ["Closing balance"],
-                "column_boundaries": [150, 350],
-                "y_tolerance": 3.0,
-            },
-            "columns": [
-                {"source": 0, "target": "date"},
-                {"source": 1, "target": "description"},
-                {"source": 2, "target": "amount"},
-            ],
-            "parse": {
-                "date_format": "%m/%d/%Y",
-                "amount_negative_in_parens": True,
-                "merge_multiline_description": True,
-            },
-        }
-        pages = extract_pages(pdf_bytes)
-        df = apply_template(pages, tpl)
-        assert len(df) == 3, f"expected 3 rows, got {len(df)}:\n{df}"
-        assert list(df["date"]) == [
+    def test_parses_dates_to_iso(self, pdf_bytes):
+        from src.pdf_extract import scan_pdf_for_transactions
+        rows, _ = scan_pdf_for_transactions(pdf_bytes)
+        assert [r["date"] for r in rows] == [
             "2026-01-15", "2026-01-16", "2026-01-17",
         ]
-        # Parens-negative + currency-positive both round-trip
-        assert df.iloc[0]["amount"] == -4.50
-        assert df.iloc[1]["amount"] == 12.00
-        assert df.iloc[2]["amount"] == -40.00
+
+    def test_parses_amounts_with_signs(self, pdf_bytes):
+        from src.pdf_extract import scan_pdf_for_transactions
+        rows, _ = scan_pdf_for_transactions(pdf_bytes)
+        assert rows[0]["amount_1"] == -4.50
+        assert rows[1]["amount_1"] == 12.00
+        assert rows[2]["amount_1"] == -40.00
+
+    def test_preserves_raw_line(self, pdf_bytes):
+        from src.pdf_extract import scan_pdf_for_transactions
+        rows, _ = scan_pdf_for_transactions(pdf_bytes)
+        # Raw line lets the user verify what was matched.
+        assert all("raw" in r and r["raw"] for r in rows)
+        assert "Coffee" in rows[0]["raw"]
+
+    def test_page_tagged(self, pdf_bytes):
+        from src.pdf_extract import scan_pdf_for_transactions
+        rows, _ = scan_pdf_for_transactions(pdf_bytes)
+        assert all(r["page"] == 1 for r in rows)
+
+    def test_negative_in_parens_off(self, pdf_bytes):
+        """With parens-negative off, the parser can't decode
+        ``(4.50)`` and falls back to the raw text — the row still
+        surfaces, just with the unparsed string in the amount slot
+        so the user can see and fix it in the editor."""
+        from src.pdf_extract import scan_pdf_for_transactions
+        rows, _ = scan_pdf_for_transactions(
+            pdf_bytes, negative_in_parens=False,
+        )
+        # Row 0 had "(4.50)" — without parens-negative, parse_amount
+        # returns None and the scanner keeps the raw token.
+        assert rows[0]["amount_1"] == "(4.50)"
+        # Row 1 had "$12.00" — still parses to positive.
+        assert rows[1]["amount_1"] == 12.00


 # ---------------------------------------------------------------------------
-# pypdfium2 rendering (powers the visual picker)
+# Multi-line description merging
 # ---------------------------------------------------------------------------


-class TestRenderPageImage:
-    """``render_page_image`` is what feeds the drawable canvas.
+class TestMultilineDescription:
+    def test_continuation_line_merges(self):
+        """A line with no date and no amount, sitting between two
+        transaction rows, attaches to the previous transaction's
+        description."""
+        from src.pdf_extract import (
+            Page,
+            WordBox,
+            scan_pdf_for_transactions,
+        )
+        # Build a synthetic page through the public entry point by
+        # going through extract_pages_auto's intermediate? Easier:
+        # call the internals directly via a fake PDF. For unit
+        # coverage of the merge behavior, route through the helper:
+        from src import pdf_extract as mod

-    Catches the most common installer-bug: native PDFium .dll/.so
-    missing from the bundle. If this test crashes with a
-    ``FileNotFoundError`` it almost always means the
-    ``hook-pypdfium2.py`` didn't pick up the shared lib."""
+        original = mod.extract_pages_auto

-    def test_renders_a_real_pil_image(self):
-        from src.pdf_extract import render_page_image
-        pdf_bytes = _build_tiny_statement_pdf()
-        image, scale = render_page_image(pdf_bytes, page_no=1)
-        # Letter-size at scale ≈ 900/612 ≈ 1.47 → ~900px wide.
-        assert image.width > 800
-        assert image.height > 800
-        assert scale > 0
-        # PIL Image is duck-typed; check the attrs we depend on.
-        assert hasattr(image, "save")
-        assert hasattr(image, "tobytes")
+        def fake(_pdf_bytes, *, allow_ocr=True):
+            words = [
+                WordBox(x0=0, top=0, x1=80, bottom=10, text="01/15/2026"),
+                WordBox(x0=100, top=0, x1=160, bottom=10, text="Coffee"),
+                WordBox(x0=200, top=0, x1=240, bottom=10, text="$4.50"),
+                # Continuation: no date, no amount
+                WordBox(x0=100, top=20, x1=160, bottom=30, text="Vendor"),
+                WordBox(x0=170, top=20, x1=230, bottom=30, text="memo"),
+                # Next transaction
+                WordBox(x0=0, top=40, x1=80, bottom=50, text="01/16/2026"),
+                WordBox(x0=100, top=40, x1=160, bottom=50, text="Other"),
+                WordBox(x0=200, top=40, x1=240, bottom=50, text="$10.00"),
+            ]
+            return [Page(
+                page_no=1, width=300, height=100, text="", words=words,
+            )], []

-    def test_invalid_page_number_clamps(self):
-        from src.pdf_extract import render_page_image
-        pdf_bytes = _build_tiny_statement_pdf()
-        # PDF has 1 page; page_no=99 should clamp, not raise.
-        image, scale = render_page_image(pdf_bytes, page_no=99)
-        assert image.width > 0
+        mod.extract_pages_auto = fake
+        try:
+            rows, _ = scan_pdf_for_transactions(b"")
+        finally:
+            mod.extract_pages_auto = original
+
+        assert len(rows) == 2
+        assert "Vendor memo" in rows[0]["description"]
+        assert rows[1]["description"] == "Other"


 # ---------------------------------------------------------------------------
-# Graceful-fallback behavior
+# Graceful fallback when deps absent
 # ---------------------------------------------------------------------------


 class TestPdfDependencyMissing:
     """The page should see a clean exception when a dep is absent,
     not a raw ``ImportError`` that leaks into the Streamlit traceback."""

     def test_require_pdfplumber_raises_typed_on_absence(self, monkeypatch):
         from src import pdf_extract
         # Simulate "pdfplumber not installed" without uninstalling.
         # ``_require_pdfplumber`` does its own ``import pdfplumber``
         # at call time; patch ``__import__`` to throw for that one
         # name only.
         import builtins
         real_import = builtins.__import__

@@ -218,10 +207,10 @@ class TestPdfDependencyMissing:
             return real_import(name, *a, **kw)

         monkeypatch.setattr(builtins, "__import__", fake_import)
-        with pytest.raises(pdf_extract.PdfDependencyMissing) as exc_info:
+        with pytest.raises(pdf_extract.PdfDependencyMissing) as exc:
             pdf_extract._require_pdfplumber()
-        assert "pdfplumber" in str(exc_info.value)
-        assert exc_info.value.hint  # actionable hint must be populated
+        assert "pdfplumber" in str(exc.value)
+        assert exc.value.hint

     def test_require_pdfium_raises_typed_on_absence(self, monkeypatch):
         from src import pdf_extract
@@ -239,17 +228,13 @@ class TestPdfDependencyMissing:


 # ---------------------------------------------------------------------------
-# Requirements-pin consistency
+# Requirements pin consistency
 # ---------------------------------------------------------------------------


 class TestPinnedVersionsMatchInstalled:
     """If someone bumps the pin in ``requirements.txt`` without
-    actually reinstalling, this test points it out before CI does.
-
-    Uses ``importlib.metadata`` rather than each library's
-    ``__version__`` attribute because not every PDF dep exposes
-    one (``pypdfium2`` keeps version info on a submodule)."""
+    actually reinstalling, this test points it out before CI does."""

     def _parse_pins(self) -> dict[str, str]:
         from pathlib import Path
@@ -266,21 +251,17 @@ class TestPinnedVersionsMatchInstalled:
             pins[name.strip()] = version.strip()
         return pins

-    def _installed(self, dist_name: str) -> str:
-        import importlib.metadata as md
-        return md.version(dist_name)
-
     @pytest.mark.parametrize("dist_name", [
         "pdfplumber",
         "pypdfium2",
         "pytesseract",
-        "streamlit-drawable-canvas",
     ])
     def test_pin_matches_installed(self, dist_name):
+        import importlib.metadata as md
         pins = self._parse_pins()
         if dist_name not in pins:
             pytest.skip(f"{dist_name} not exact-pinned in requirements.txt")
-        installed = self._installed(dist_name)
+        installed = md.version(dist_name)
         assert installed == pins[dist_name], (
             f"installed {dist_name}=={installed} but requirements.txt "
             f"pins {pins[dist_name]} — bump the pin, or reinstall."
@@ -288,79 +269,52 @@ class TestPinnedVersionsMatchInstalled:


 # ---------------------------------------------------------------------------
-# OCR availability runtime probe
+# OCR availability
 # ---------------------------------------------------------------------------


 class TestOcrAvailability:
     """``ocr_available`` is the linchpin of the UI's OCR banner.
     Returns ``(bool, str)`` — both branches must round-trip."""

     def test_returns_a_tuple(self):
         from src.pdf_extract import ocr_available
         result = ocr_available()
-        assert isinstance(result, tuple)
-        assert len(result) == 2
+        assert isinstance(result, tuple) and len(result) == 2
         ok, reason = result
         assert isinstance(ok, bool)
         assert isinstance(reason, str)

     def test_extract_pages_auto_skips_ocr_when_disabled(self):
         from src.pdf_extract import extract_pages_auto
         # With allow_ocr=False, no OCR even if pages are blank.
         pdf_bytes = _build_tiny_statement_pdf()
         pages, warnings = extract_pages_auto(pdf_bytes, allow_ocr=False)
         assert len(pages) == 1
         # No OCR-disabled warning on a text PDF, since pages have text.
         assert not any("OCR is disabled" in w for w in warnings)


 class TestTesseractDiscovery:
     """Windows install paths + env-var override are how a real user
     (no PATH munging) gets OCR working. Cover the discovery logic
     even on Linux/macOS test runners by mocking out the OS check
     and ``Path.exists``."""

     def test_autodetect_returns_none_on_non_windows(self, monkeypatch):
         from src import pdf_extract
-        monkeypatch.setattr(
-            "platform.system",
-            lambda: "Linux",
-        )
+        monkeypatch.setattr("platform.system", lambda: "Linux")
         assert pdf_extract._autodetect_tesseract_path() is None

     def test_autodetect_finds_program_files_on_windows(self, monkeypatch):
         from src import pdf_extract
         monkeypatch.setattr("platform.system", lambda: "Windows")

         target = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

         def fake_exists(self):
             return str(self) == target

-        monkeypatch.setattr(
-            "pathlib.Path.exists",
-            fake_exists,
-        )
+        monkeypatch.setattr("pathlib.Path.exists", fake_exists)
         assert pdf_extract._autodetect_tesseract_path() == target

-    def test_autodetect_returns_none_when_nothing_installed(
-        self, monkeypatch,
-    ):
+    def test_autodetect_returns_none_when_nothing_installed(self, monkeypatch):
         from src import pdf_extract
         monkeypatch.setattr("platform.system", lambda: "Windows")
         monkeypatch.setattr("pathlib.Path.exists", lambda self: False)
         assert pdf_extract._autodetect_tesseract_path() is None

     def test_env_var_override_takes_precedence(self, monkeypatch, tmp_path):
         """``DATATOOLS_TESSERACT_PATH`` wins over discovery so a
         portable install at a non-default path works without
         relying on PATH."""
         from src import pdf_extract
         # Point the override at a path that doesn't exist —
         # ocr_available will try it and report the failure, but
         # importantly the cmd attribute is set BEFORE the call,
         # which is what we're verifying.
         fake_bin = str(tmp_path / "fake-tesseract.exe")
         monkeypatch.setenv("DATATOOLS_TESSERACT_PATH", fake_bin)
         pdf_extract.ocr_available()