build(pdf): bundle PDF deps in installers + pin versions + smoke tests

Three changes prepare the next tagged release so end users get the
PDF Extractor without ever touching pip.

**Exact-pin the new deps** (``requirements.txt``):

    pdfplumber==0.11.9
    pypdfium2==5.8.0
    pytesseract==0.3.13
    streamlit-drawable-canvas==0.9.3

Tight pins are the right call for these because the GUI's
visual-picker geometry + the parsing-pipeline word positions depend on
stable internal behavior: a quiet upstream tweak to ``extract_words``
or ``page.render`` would re-break the tool on the next CI build.
Bumping requires a deliberate edit + a CI run, not a transient
``pip install`` resolving to whatever ``setup.py`` pulled. Existing
deps stay on their current ``>=X.Y,<X+1`` ranges; the user's "tight
pin" concern is specifically about the PDF stack.

**Wire the new deps into the PyInstaller bundle** (``build/``):

- ``datatools.spec``: add ``collect_submodules`` for pdfplumber,
  pdfminer, pypdfium2, streamlit_drawable_canvas, PIL, pytesseract;
  add ``collect_data_files`` for pypdfium2 (PDFium native
  ``.dll``/``.so``/``.dylib``), streamlit_drawable_canvas (frontend JS
  bundle), pdfminer (Adobe CMap tables).
- ``hooks/hook-pypdfium2.py``: belt-and-braces hook that uses
  ``collect_dynamic_libs`` to force-include the PDFium binary. Without
  this the visual picker silently fails on installed builds with a
  ``FileNotFoundError`` for the shared library.
- ``hooks/hook-streamlit_drawable_canvas.py``: collects the built JS
  frontend so the canvas iframe loads under the bundled Streamlit
  server instead of rendering blank.

**Tesseract is intentionally NOT bundled** (option A from the design
discussion). Modern bank statements are text-based; bundling Tesseract
would ~triple installer size for a long-tail case. The in-app banner
directs users to install it from ``UB-Mannheim/tesseract`` if they
need OCR. Decision is captured in the
``project-pdf-installer-pending`` memory note.
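
Because Tesseract is left to the user, the app needs a cheap way to decide whether to show the install banner. The following is only a minimal sketch of how an ``ocr_available()``-style probe returning ``(bool, str)`` could look, assuming the check is a plain PATH lookup; the hint text and the ``binary`` parameter are hypothetical and the real implementation is not shown in this commit.

```python
import shutil
from typing import Tuple

# Hypothetical hint text; the real banner wording is not part of this commit.
INSTALL_HINT = (
    "Tesseract was not found on PATH; install it from "
    "UB-Mannheim/tesseract to enable OCR."
)


def ocr_available(binary: str = "tesseract") -> Tuple[bool, str]:
    """Return (available, message) based on a PATH lookup only.

    Assumption: finding the executable on PATH is a good enough signal.
    No OCR library is imported, so this is safe on a stripped install.
    """
    path = shutil.which(binary)
    if path is None:
        return False, INSTALL_HINT
    return True, f"Tesseract found at {path}"


if __name__ == "__main__":
    available, message = ocr_available()
    print(available, message)
```

Keeping the probe free of imports means the banner can still render when ``pytesseract`` itself is missing.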

**Smoke tests** (``tests/test_pdf_extract_smoke.py``, 17 tests) add
the layer above the pure unit tests:

- ``TestDependencyImports``: each dep imports cleanly.
- ``TestRealPdfRoundTrip``: generates a tiny statement PDF in memory
  with ``fpdf2`` (test-only dep in ``requirements-dev.txt``), runs
  ``extract_pages`` + ``apply_template``, asserts 3 rows out with the
  right signed amounts. Catches "the build succeeded but pdfplumber
  breaks at runtime."
- ``TestRenderPageImage``: exercises ``pypdfium2.render`` so the
  hook-bundled native lib gets a real call. This is the most common
  installer-bug signature (missing .dll) and the test catches it
  before users do.
- ``TestPdfDependencyMissing``: monkeypatches ``__import__`` to
  simulate a stripped install; confirms the typed exception +
  actionable hint round-trip.
- ``TestPinnedVersionsMatchInstalled``: parametrized over all four
  pinned dists; uses ``importlib.metadata`` rather than
  ``__version__`` because pypdfium2 doesn't expose it directly. Trips
  if someone bumps the pin without reinstalling.
- ``TestOcrAvailability``: confirms ``ocr_available()`` returns
  ``(bool, str)`` and ``extract_pages_auto(allow_ocr=False)`` skips
  OCR cleanly.

All 81 PDF + audit tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
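
The pin-versus-installed check described for ``TestPinnedVersionsMatchInstalled`` can be sketched roughly as below. This is an illustration under stated assumptions, not the test from this commit: the helper names ``parse_exact_pins`` and ``find_pin_mismatches`` are hypothetical, and the parser only understands simple ``name==version`` lines.

```python
import re
from importlib import metadata
from typing import Callable, Dict, List

# Matches simple "name==version" lines; ranges such as ">=1,<2" are ignored.
PIN_RE = re.compile(r"^\s*([A-Za-z0-9._-]+)\s*==\s*([^\s;#]+)")


def parse_exact_pins(requirements_text: str) -> Dict[str, str]:
    """Collect exact pins from requirements-style text (hypothetical helper)."""
    pins: Dict[str, str] = {}
    for line in requirements_text.splitlines():
        match = PIN_RE.match(line)
        if match:
            pins[match.group(1)] = match.group(2)
    return pins


def find_pin_mismatches(
    pins: Dict[str, str],
    installed_version: Callable[[str], str] = metadata.version,
) -> List[str]:
    """Return one message per pin that is missing or differs from the install."""
    problems: List[str] = []
    for name, wanted in pins.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: pinned {wanted}, not installed")
            continue
        if found != wanted:
            problems.append(f"{name}: pinned {wanted}, installed {found}")
    return problems
```

The version lookup is injectable so the check can be exercised without the PDF stack installed; by default it asks ``importlib.metadata``, which works for distributions that do not expose ``__version__``.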
31
build/hooks/hook-pypdfium2.py
Normal file
@@ -0,0 +1,31 @@
"""PyInstaller hook for pypdfium2.

``pypdfium2`` ships the native PDFium shared library inside its
package directory (``pdfium``-prefixed ``.dll`` on Windows, ``.so``
on Linux, ``.dylib`` on macOS). PyInstaller's default discovery
picks up Python ``.py``/``.pyc`` but can miss the binary if the
package is wheel-installed and the shared lib isn't on the
module-level path that PyInstaller scans from ``__init__``.

This hook is belt-and-braces: the main spec already calls
``collect_data_files("pypdfium2")`` and ``collect_submodules``,
but PyInstaller's hook-discovery-by-name is the documented
escape hatch for native-bundled libraries. Without this, the
visual picker (which renders PDF pages via
``pypdfium2.PdfDocument(...).render(...)``) silently fails on
installed builds with a ``FileNotFoundError`` for the PDFium
shared library.
"""

from PyInstaller.utils.hooks import (
    collect_all,
    collect_data_files,
    collect_dynamic_libs,
)

# collect_all already gathers data files, dynamic libs and submodules;
# the explicit calls below repeat part of that work on purpose.
datas, binaries, hiddenimports = collect_all("pypdfium2")
# Make absolutely sure the bundled PDFium .dll/.so/.dylib is
# carried over (PyInstaller treats it as a dynamic lib, not data).
binaries += collect_dynamic_libs("pypdfium2")
# And its raw data files (the type stubs + metadata file).
datas += collect_data_files("pypdfium2", include_py_files=False)
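
One way to confirm the hook did its job is to scan the built output for the PDFium shared library before shipping. The sketch below is not part of this commit; ``find_pdfium_libs`` is a hypothetical helper, and it assumes only what the docstring above states, namely that the library file name contains ``pdfium`` and carries the platform's usual suffix.

```python
import sys
from pathlib import Path
from typing import List, Union

# Shared-library suffix per sys.platform value; anything else falls back to ".so".
_SUFFIX_BY_PLATFORM = {"win32": ".dll", "darwin": ".dylib"}


def native_suffix(platform: str = sys.platform) -> str:
    """Return the shared-library suffix expected on the given platform."""
    return _SUFFIX_BY_PLATFORM.get(platform, ".so")


def find_pdfium_libs(
    bundle_root: Union[str, Path], platform: str = sys.platform
) -> List[Path]:
    """Recursively list files under bundle_root that look like the PDFium lib.

    Assumption: the file name contains "pdfium" and ends with the platform
    suffix; the exact name and location inside the bundle are not checked.
    """
    pattern = f"*pdfium*{native_suffix(platform)}"
    return sorted(p for p in Path(bundle_root).rglob(pattern) if p.is_file())


if __name__ == "__main__":
    # Usage: python check_bundle.py dist/<app-folder>
    hits = find_pdfium_libs(sys.argv[1] if len(sys.argv) > 1 else ".")
    print("\n".join(str(p) for p in hits) or "no PDFium library found")
```

An empty result on a freshly built bundle would point at the missing-library failure described in the docstring, before any user hits it.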
19
build/hooks/hook-streamlit_drawable_canvas.py
Normal file
@@ -0,0 +1,19 @@
"""PyInstaller hook for streamlit-drawable-canvas.

Streamlit components are Python packages that also ship a built
JavaScript/CSS bundle Streamlit serves from disk at
component-render time. Without those assets in the bundle the
canvas iframe loads blank: the user sees the page render fine but
the visual picker shows no image and no drawing controls.

``collect_data_files`` covers the frontend bundle directory
(named ``frontend`` or ``frontend/build`` depending on the
component version). Hidden imports are picked up by the main
spec's ``collect_submodules`` call, repeated here for the same
belt-and-braces reason as ``hook-pypdfium2.py``.
"""

from PyInstaller.utils.hooks import collect_data_files, collect_submodules

datas = collect_data_files("streamlit_drawable_canvas")
hiddenimports = collect_submodules("streamlit_drawable_canvas")
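
The two directory layouts named in the docstring suggest a simple post-build check for the frontend assets. The following is a rough sketch only: ``find_frontend_dir`` is a hypothetical helper, and treating ``index.html`` as the bundle's entry point is an assumption not stated in this commit.

```python
from pathlib import Path
from typing import Optional, Union

# Candidate locations from the hook docstring, most specific first.
CANDIDATE_DIRS = ("frontend/build", "frontend")


def find_frontend_dir(package_dir: Union[str, Path]) -> Optional[Path]:
    """Return the first candidate directory that contains index.html.

    Assumption: the built component is served from a directory whose
    entry point is index.html. Returns None when no candidate qualifies.
    """
    root = Path(package_dir)
    for rel in CANDIDATE_DIRS:
        candidate = root / rel
        if (candidate / "index.html").is_file():
            return candidate
    return None
```

Pointed at the ``streamlit_drawable_canvas`` directory inside a built bundle, a ``None`` result would match the blank-iframe symptom the docstring describes.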