datatools-dev

giteadmin/datatools-dev

Fork 0

Commit Graph

Author	SHA1	Message	Date
Michael	e6ee2e3481	feat(pdf): robust Tesseract discovery + OS-aware install copy User tried ``brew install tesseract`` in PowerShell after seeing all three OSes listed inline in the OCR banner — easy mistake when the install commands are crammed on one line with ``·`` separators. Two changes pre-empt this: OS-aware OCR banner. The expander now detects the user's platform via ``platform.system()`` and shows only the relevant install instructions: - Windows: UB-Mannheim installer link, numbered steps, explicit "keep the Add to PATH checkbox on" callout, plus a fallback paragraph telling the user how to set ``DATATOOLS_TESSERACT_PATH`` if they already installed without PATH and don't want to reinstall. - macOS: ``brew install tesseract`` with a Homebrew link. - Linux: ``apt install tesseract-ocr`` with a "or your distro's equivalent" hedge. Robust binary discovery in ``ocr_available()``. Three-stage: 1. Honor ``DATATOOLS_TESSERACT_PATH`` env var if set — explicit override for portable installs or non-default locations. 2. Try ``pytesseract``'s default PATH-based lookup. 3. If PATH lookup fails, probe known Windows install paths (``C:\Program Files\Tesseract-OCR\tesseract.exe``, the x86 variant, and ``%LOCALAPPDATA%\Programs\Tesseract-OCR\``) via the new ``_autodetect_tesseract_path``. On hit, set ``pytesseract.pytesseract.tesseract_cmd`` so all subsequent ``image_to_data`` calls use the same binary without re-discovering. This means a user who runs the UB-Mannheim installer with default options but forgets the PATH checkbox will still get OCR working after a launcher restart, without env-var gymnastics. Tests (4 new, 85 total in the suite): - Auto-detect returns None on non-Windows (no false positives on dev laptops). - Auto-detect finds the binary at a mocked ``C:\Program Files\Tesseract-OCR\tesseract.exe``. - Auto-detect returns None when no candidate exists. - ``DATATOOLS_TESSERACT_PATH`` env var beats both PATH lookup and auto-detect (sets ``tesseract_cmd`` even when the path doesn't resolve, so a real binary at a custom location works). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:15:00 +00:00
Michael	538e23d219	build(pdf): bundle PDF deps in installers + pin versions + smoke tests Three changes prepare the next tagged release so end users get the PDF Extractor without ever touching pip. Exact-pin the new deps (``requirements.txt``): pdfplumber==0.11.9 pypdfium2==5.8.0 pytesseract==0.3.13 streamlit-drawable-canvas==0.9.3 Tight pins are the right call for these because the GUI's visual-picker geometry + the parsing-pipeline word positions depend on stable internal behavior — a quiet upstream tweak to ``extract_words`` or ``page.render`` would re-break the tool on the next CI build. Bumping requires a deliberate edit + a CI run, not a transient ``pip install`` resolving to whatever ``setup.py`` pulled. Existing deps stay on their current ``>=X.Y,<X+1`` ranges; the user's "tight pin" concern is specifically about the PDF stack. Wire the new deps into the PyInstaller bundle (``build/``): - ``datatools.spec`` — add ``collect_submodules`` for pdfplumber, pdfminer, pypdfium2, streamlit_drawable_canvas, PIL, pytesseract; add ``collect_data_files`` for pypdfium2 (PDFium native ``.dll``/``.so``/``.dylib``), streamlit_drawable_canvas (frontend JS bundle), pdfminer (Adobe CMap tables). - ``hooks/hook-pypdfium2.py`` — belt-and-braces hook that uses ``collect_dynamic_libs`` to force-include the PDFium binary. Without this the visual picker silently fails on installed builds with a ``FileNotFoundError`` for the shared library. - ``hooks/hook-streamlit_drawable_canvas.py`` — collects the built JS frontend so the canvas iframe loads under the bundled Streamlit server instead of rendering blank. Tesseract is intentionally NOT bundled (option A from the design discussion). Modern bank statements are text-based; bundling Tesseract would ~triple installer size for a long-tail case. The in-app banner directs users to install it from ``UB-Mannheim/tesseract`` if they need OCR. Decision is captured in the ``project-pdf-installer-pending`` memory note. Smoke tests (``tests/test_pdf_extract_smoke.py``, 17 tests) add the layer above the pure unit tests: - ``TestDependencyImports`` — each dep imports cleanly - ``TestRealPdfRoundTrip`` — generates a tiny statement PDF in memory with ``fpdf2`` (test-only dep in ``requirements-dev.txt``), runs ``extract_pages`` + ``apply_template``, asserts 3 rows out with the right signed amounts. Catches "the build succeeded but pdfplumber breaks at runtime." - ``TestRenderPageImage`` — exercises ``pypdfium2.render`` so the hook-bundled native lib gets a real call. This is the most common installer-bug signature (missing .dll) and the test catches it before users do. - ``TestPdfDependencyMissing`` — monkeypatches ``__import__`` to simulate a stripped install; confirms the typed exception + actionable hint round-trip. - ``TestPinnedVersionsMatchInstalled`` — parametrized over all four pinned dists; uses ``importlib.metadata`` rather than ``__version__`` because pypdfium2 doesn't expose it directly. Trips if someone bumps the pin without reinstalling. - ``TestOcrAvailability`` — confirms ``ocr_available()`` returns ``(bool, str)`` and ``extract_pages_auto(allow_ocr=False)`` skips OCR cleanly. All 81 PDF + audit tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:10:43 +00:00

Author

SHA1

Message

Date

Michael

e6ee2e3481

feat(pdf): robust Tesseract discovery + OS-aware install copy

User tried ``brew install tesseract`` in PowerShell after seeing
all three OSes listed inline in the OCR banner — easy mistake
when the install commands are crammed on one line with ``·``
separators. Two changes pre-empt this:

**OS-aware OCR banner.** The expander now detects the user's
platform via ``platform.system()`` and shows only the relevant
install instructions:

- **Windows**: UB-Mannheim installer link, numbered steps,
  explicit "keep the Add to PATH checkbox on" callout, plus a
  fallback paragraph telling the user how to set
  ``DATATOOLS_TESSERACT_PATH`` if they already installed
  without PATH and don't want to reinstall.
- **macOS**: ``brew install tesseract`` with a Homebrew link.
- **Linux**: ``apt install tesseract-ocr`` with a "or your
  distro's equivalent" hedge.

**Robust binary discovery in ``ocr_available()``.** Three-stage:

1. Honor ``DATATOOLS_TESSERACT_PATH`` env var if set — explicit
   override for portable installs or non-default locations.
2. Try ``pytesseract``'s default PATH-based lookup.
3. If PATH lookup fails, probe known Windows install paths
   (``C:\Program Files\Tesseract-OCR\tesseract.exe``,
   the x86 variant, and ``%LOCALAPPDATA%\Programs\Tesseract-OCR\``)
   via the new ``_autodetect_tesseract_path``. On hit, set
   ``pytesseract.pytesseract.tesseract_cmd`` so all subsequent
   ``image_to_data`` calls use the same binary without
   re-discovering.

This means a user who runs the UB-Mannheim installer with
default options but forgets the PATH checkbox will still get
OCR working after a launcher restart, without env-var
gymnastics.

Tests (4 new, 85 total in the suite):

- Auto-detect returns None on non-Windows (no false positives
  on dev laptops).
- Auto-detect finds the binary at a mocked
  ``C:\Program Files\Tesseract-OCR\tesseract.exe``.
- Auto-detect returns None when no candidate exists.
- ``DATATOOLS_TESSERACT_PATH`` env var beats both PATH lookup
  and auto-detect (sets ``tesseract_cmd`` even when the path
  doesn't resolve, so a real binary at a custom location works).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-19 23:15:00 +00:00

Michael

538e23d219

build(pdf): bundle PDF deps in installers + pin versions + smoke tests

Three changes prepare the next tagged release so end users get
the PDF Extractor without ever touching pip.

**Exact-pin the new deps** (``requirements.txt``):

  pdfplumber==0.11.9
  pypdfium2==5.8.0
  pytesseract==0.3.13
  streamlit-drawable-canvas==0.9.3

Tight pins are the right call for these because the GUI's
visual-picker geometry + the parsing-pipeline word positions
depend on stable internal behavior — a quiet upstream tweak to
``extract_words`` or ``page.render`` would re-break the tool on
the next CI build. Bumping requires a deliberate edit + a CI
run, not a transient ``pip install`` resolving to whatever
``setup.py`` pulled.

Existing deps stay on their current ``>=X.Y,<X+1`` ranges; the
user's "tight pin" concern is specifically about the PDF stack.

**Wire the new deps into the PyInstaller bundle** (``build/``):

- ``datatools.spec`` — add ``collect_submodules`` for pdfplumber,
  pdfminer, pypdfium2, streamlit_drawable_canvas, PIL,
  pytesseract; add ``collect_data_files`` for pypdfium2 (PDFium
  native ``.dll``/``.so``/``.dylib``), streamlit_drawable_canvas
  (frontend JS bundle), pdfminer (Adobe CMap tables).
- ``hooks/hook-pypdfium2.py`` — belt-and-braces hook that uses
  ``collect_dynamic_libs`` to force-include the PDFium binary.
  Without this the visual picker silently fails on installed
  builds with a ``FileNotFoundError`` for the shared library.
- ``hooks/hook-streamlit_drawable_canvas.py`` — collects the
  built JS frontend so the canvas iframe loads under the bundled
  Streamlit server instead of rendering blank.

**Tesseract is intentionally NOT bundled** (option A from the
design discussion). Modern bank statements are text-based;
bundling Tesseract would ~triple installer size for a long-tail
case. The in-app banner directs users to install it from
``UB-Mannheim/tesseract`` if they need OCR. Decision is captured
in the ``project-pdf-installer-pending`` memory note.

**Smoke tests** (``tests/test_pdf_extract_smoke.py``, 17 tests)
add the layer above the pure unit tests:

- ``TestDependencyImports`` — each dep imports cleanly
- ``TestRealPdfRoundTrip`` — generates a tiny statement PDF in
  memory with ``fpdf2`` (test-only dep in
  ``requirements-dev.txt``), runs ``extract_pages`` +
  ``apply_template``, asserts 3 rows out with the right signed
  amounts. Catches "the build succeeded but pdfplumber breaks at
  runtime."
- ``TestRenderPageImage`` — exercises ``pypdfium2.render`` so the
  hook-bundled native lib gets a real call. This is the most
  common installer-bug signature (missing .dll) and the test
  catches it before users do.
- ``TestPdfDependencyMissing`` — monkeypatches ``__import__`` to
  simulate a stripped install; confirms the typed exception +
  actionable hint round-trip.
- ``TestPinnedVersionsMatchInstalled`` — parametrized over all
  four pinned dists; uses ``importlib.metadata`` rather than
  ``__version__`` because pypdfium2 doesn't expose it directly.
  Trips if someone bumps the pin without reinstalling.
- ``TestOcrAvailability`` — confirms ``ocr_available()`` returns
  ``(bool, str)`` and ``extract_pages_auto(allow_ocr=False)``
  skips OCR cleanly.

All 81 PDF + audit tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-19 23:10:43 +00:00

2 Commits