Three changes prepare the next tagged release so end users get the
PDF Extractor without ever touching pip.

**Exact-pin the new deps** (``requirements.txt``):

    pdfplumber==0.11.9
    pypdfium2==5.8.0
    pytesseract==0.3.13
    streamlit-drawable-canvas==0.9.3

Tight pins are the right call for these because the GUI's visual-picker
geometry and the parsing pipeline's word positions depend on stable
internal behavior: a quiet upstream tweak to ``extract_words`` or
``page.render`` would re-break the tool on the next CI build. Bumping
requires a deliberate edit + a CI run, not a transient ``pip install``
resolving to whatever ``setup.py`` pulled.

Existing deps stay on their current ``>=X.Y,<X+1`` ranges; the user's
"tight pin" concern is specifically about the PDF stack.

**Wire the new deps into the PyInstaller bundle** (``build/``):

- ``datatools.spec``: add ``collect_submodules`` for pdfplumber,
  pdfminer, pypdfium2, streamlit_drawable_canvas, PIL, pytesseract;
  add ``collect_data_files`` for pypdfium2 (PDFium native
  ``.dll``/``.so``/``.dylib``), streamlit_drawable_canvas (frontend JS
  bundle), pdfminer (Adobe CMap tables).
- ``hooks/hook-pypdfium2.py``: belt-and-braces hook that uses
  ``collect_dynamic_libs`` to force-include the PDFium binary. Without
  this the visual picker silently fails on installed builds with a
  ``FileNotFoundError`` for the shared library.
- ``hooks/hook-streamlit_drawable_canvas.py``: collects the built JS
  frontend so the canvas iframe loads under the bundled Streamlit
  server instead of rendering blank.

**Tesseract is intentionally NOT bundled** (option A from the design
discussion). Modern bank statements are text-based; bundling Tesseract
would ~triple installer size for a long-tail case. The in-app banner
directs users to install it from ``UB-Mannheim/tesseract`` if they need
OCR. Decision is captured in the ``project-pdf-installer-pending``
memory note.
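Since Tesseract is left to the user, the app needs a cheap way to tell
whether it is present before showing the banner. A minimal sketch of
such a check, assuming the binary is discoverable on ``PATH``; the
function name, the hint text, and the injectable ``which`` parameter
are illustrative and not taken from the project:

```python
import shutil
from typing import Callable, Optional, Tuple

# Hypothetical hint text; the real banner wording is not shown in this commit.
INSTALL_HINT = (
    "Tesseract was not found on PATH. Install it "
    "(Windows builds: UB-Mannheim/tesseract) to enable OCR."
)


def ocr_available_sketch(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Tuple[bool, str]:
    """Return (available, message) without importing any OCR library."""
    path = which("tesseract")
    if path is None:
        return False, INSTALL_HINT
    return True, f"Tesseract found at {path}"
```

Passing ``which`` as a parameter keeps the check testable on machines
with or without Tesseract installed.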
**Smoke tests** (``tests/test_pdf_extract_smoke.py``, 17 tests) add the
layer above the pure unit tests:

- ``TestDependencyImports``: each dep imports cleanly.
- ``TestRealPdfRoundTrip``: generates a tiny statement PDF in memory
  with ``fpdf2`` (test-only dep in ``requirements-dev.txt``), runs
  ``extract_pages`` + ``apply_template``, asserts 3 rows out with the
  right signed amounts. Catches "the build succeeded but pdfplumber
  breaks at runtime."
- ``TestRenderPageImage``: exercises ``pypdfium2.render`` so the
  hook-bundled native lib gets a real call. This is the most common
  installer-bug signature (missing .dll) and the test catches it
  before users do.
- ``TestPdfDependencyMissing``: monkeypatches ``__import__`` to
  simulate a stripped install; confirms the typed exception +
  actionable hint round-trip.
- ``TestPinnedVersionsMatchInstalled``: parametrized over all four
  pinned dists; uses ``importlib.metadata`` rather than ``__version__``
  because pypdfium2 doesn't expose it directly. Trips if someone bumps
  the pin without reinstalling.
- ``TestOcrAvailability``: confirms ``ocr_available()`` returns
  ``(bool, str)`` and ``extract_pages_auto(allow_ocr=False)`` skips OCR
  cleanly.

All 81 PDF + audit tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
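The idea behind ``TestPinnedVersionsMatchInstalled`` can be sketched
roughly as follows. This is not the project's test code: the helper
names are hypothetical, and the parser only handles plain
``name==version`` lines (no extras or ``===``), which is an assumption
about how ``requirements.txt`` is written.

```python
import importlib.metadata
from typing import Callable, Dict, Optional, Tuple


def parse_exact_pins(requirements_text: str) -> Dict[str, str]:
    """Collect ``name==version`` lines; ranged specifiers are skipped."""
    pins: Dict[str, str] = {}
    for raw in requirements_text.splitlines():
        # Drop trailing comments and environment markers.
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(
    pins: Dict[str, str],
    installed_version: Callable[[str], str] = importlib.metadata.version,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each drifted dist to (pinned, installed); None means not installed."""
    mismatches: Dict[str, Tuple[str, Optional[str]]] = {}
    for name, wanted in pins.items():
        try:
            found: Optional[str] = installed_version(name)
        except importlib.metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

A test would then assert that ``find_mismatches(parse_exact_pins(text))``
is empty for the real ``requirements.txt``.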
32 lines
1.3 KiB
Python
"""PyInstaller hook for pypdfium2.

``pypdfium2`` ships the native PDFium shared library as a data file
inside its package directory (``pdfium``-prefixed ``.dll`` on
Windows, ``.so`` on Linux, ``.dylib`` on macOS). PyInstaller's
default discovery picks up Python ``.py``/``.pyc`` but can miss
the binary if the package is wheel-installed and the shared lib
isn't on the ``__init__``'s module-level path it scans.

This hook is belt-and-braces: the main spec already calls
``collect_data_files("pypdfium2")`` and ``collect_submodules``,
but PyInstaller's hook-discovery-by-name is the documented
escape hatch for native-bundled libraries. Without this, the
visual picker (which renders PDF pages via
``pypdfium2.PdfDocument(...).render(...)``) silently fails on
installed builds with a ``FileNotFoundError`` for the PDFium
shared library.
"""

from PyInstaller.utils.hooks import (
    collect_all,
    collect_data_files,
    collect_dynamic_libs,
)

datas, binaries, hiddenimports = collect_all("pypdfium2")
# Make absolutely sure the bundled PDFium .dll/.so/.dylib is
# carried over; PyInstaller treats it as a dynamic lib, not data.
binaries += collect_dynamic_libs("pypdfium2")
# And its raw data files (the type stubs + metadata file).
datas += collect_data_files("pypdfium2", include_py_files=False)
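``collect_all`` is documented to gather data files, dynamic libraries,
and submodules together, so the two ``+=`` lines in the hook will likely
append entries that are already in the lists. That is expected to be
harmless, but if the duplicates ever clutter build logs, an
order-preserving dedupe is enough. A small sketch; the helper name is
hypothetical and it is not part of the hook:

```python
from typing import Iterable, List, Tuple

# PyInstaller hook entries are (source_path, dest_dir) tuples.
Entry = Tuple[str, str]


def dedupe_entries(entries: Iterable[Entry]) -> List[Entry]:
    """Drop repeated (source, dest) pairs while keeping first-seen order."""
    # dict preserves insertion order, so fromkeys() acts as an ordered set.
    return list(dict.fromkeys(entries))
```

In the hook this would be applied as ``binaries = dedupe_entries(binaries)``
and ``datas = dedupe_entries(datas)`` after the ``+=`` lines.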