feat(pdf): robust Tesseract discovery + OS-aware install copy

User tried ``brew install tesseract`` in PowerShell after seeing
all three OSes listed inline in the OCR banner — easy mistake
when the install commands are crammed on one line with ``·``
separators. Two changes pre-empt this:

**OS-aware OCR banner.** The expander now detects the user's
platform via ``platform.system()`` and shows only the relevant
install instructions:

- **Windows**: UB-Mannheim installer link, numbered steps,
  explicit "keep the Add to PATH checkbox on" callout, plus a
  fallback paragraph telling the user how to set
  ``DATATOOLS_TESSERACT_PATH`` if they already installed
  without PATH and don't want to reinstall.
- **macOS**: ``brew install tesseract`` with a Homebrew link.
- **Linux**: ``apt install tesseract-ocr`` with a "or your
  distro's equivalent" hedge.

**Robust binary discovery in ``ocr_available()``.** Three-stage:

1. Honor ``DATATOOLS_TESSERACT_PATH`` env var if set — explicit
   override for portable installs or non-default locations.
2. Try ``pytesseract``'s default PATH-based lookup.
3. If PATH lookup fails, probe known Windows install paths
   (``C:\Program Files\Tesseract-OCR\tesseract.exe``,
   the x86 variant, and ``%LOCALAPPDATA%\Programs\Tesseract-OCR\``)
   via the new ``_autodetect_tesseract_path``. On hit, set
   ``pytesseract.pytesseract.tesseract_cmd`` so all subsequent
   ``image_to_data`` calls use the same binary without
   re-discovering.

This means a user who runs the UB-Mannheim installer with
default options but forgets the PATH checkbox will still get
OCR working after a launcher restart, without env-var
gymnastics.

Tests (4 new, 85 total in the suite):

- Auto-detect returns None on non-Windows (no false positives
  on dev laptops).
- Auto-detect finds the binary at a mocked
  ``C:\Program Files\Tesseract-OCR\tesseract.exe``.
- Auto-detect returns None when no candidate exists.
- ``DATATOOLS_TESSERACT_PATH`` env var beats both PATH lookup
  and auto-detect (sets ``tesseract_cmd`` even when the path
  doesn't resolve, so a real binary at a custom location works).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-19 23:15:00 +00:00
parent 538e23d219
commit e6ee2e3481
3 changed files with 156 additions and 16 deletions

View File

@@ -136,16 +136,44 @@ with c_ocr:
if _ocr_ok:
st.caption("**OCR:** ready · scanned pages will be transcribed.")
else:
import platform as _platform
_os_name = _platform.system()
with st.expander("**OCR:** unavailable", expanded=False):
st.caption(
f"Reason: {_ocr_reason or 'unknown'}. Scanned (image-based) "
"statements will fall through with warnings. "
"To enable OCR, install Tesseract on this machine — "
"[Windows](https://github.com/UB-Mannheim/tesseract/wiki) · "
"macOS: ``brew install tesseract`` · "
"Linux: ``apt install tesseract-ocr``. "
"Modern text-based statements don't need OCR."
st.markdown(
f"**Reason:** {_ocr_reason or 'unknown'}.\n\n"
"Scanned (image-based) statements will fall through "
"with warnings. Most modern bank statements are text-"
"based and don't need OCR — only install Tesseract if "
"your statements actually come through as images."
)
if _os_name == "Windows":
st.markdown(
"**Install on Windows:**\n"
"1. Download the installer from "
"[UB-Mannheim/tesseract](https://github.com/UB-Mannheim/tesseract/wiki) "
"(look for ``tesseract-ocr-w64-setup-…``).\n"
"2. Run it. Keep the **\"Add tesseract to system "
"PATH\"** checkbox on during setup.\n"
"3. Restart the DataTools launcher.\n\n"
"If you installed without PATH and don't want to "
"reinstall, point DataTools at the binary directly "
"by setting the ``DATATOOLS_TESSERACT_PATH`` env "
"var to ``C:\\Program Files\\Tesseract-OCR\\tesseract.exe`` "
"before launching."
)
elif _os_name == "Darwin":
st.markdown(
"**Install on macOS:** ``brew install tesseract`` "
"(requires [Homebrew](https://brew.sh)). Restart "
"the DataTools launcher afterward."
)
else:
st.markdown(
"**Install on Linux:** ``sudo apt install "
"tesseract-ocr`` (Debian/Ubuntu) or your distro's "
"equivalent (``dnf``, ``pacman``, …). Restart the "
"DataTools launcher afterward."
)
st.divider()