feat(pdf): robust Tesseract discovery + OS-aware install copy

A user ran ``brew install tesseract`` in PowerShell after seeing
install commands for all three OSes inline in the OCR banner. It
is an easy mistake when the commands are crammed onto one line
with ``·`` separators. Two changes pre-empt this:

**OS-aware OCR banner.** The expander now detects the user's
platform via ``platform.system()`` and shows only the relevant
install instructions:

- **Windows**: UB-Mannheim installer link, numbered steps,
  explicit "keep the Add to PATH checkbox on" callout, plus a
  fallback paragraph telling the user how to set
  ``DATATOOLS_TESSERACT_PATH`` if they already installed
  without PATH and don't want to reinstall.
- **macOS**: ``brew install tesseract`` with a Homebrew link.
- **Linux**: ``apt install tesseract-ocr`` with a "or your
  distro's equivalent" hedge.
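
A minimal sketch of the per-OS selection described above. The
names ``INSTALL_HINTS`` and ``install_hint`` and the exact hint
wording are hypothetical, not taken from the commit; the fallback
to the Linux hint for unrecognised platforms is also an
assumption. ``platform.system()`` reports macOS as ``"Darwin"``.

```python
import platform
from typing import Optional

# Hypothetical hint table; the real banner copy is richer (numbered
# steps, installer link) and is not reproduced here.
INSTALL_HINTS = {
    "Windows": (
        "Run the UB-Mannheim installer and keep the Add to PATH "
        "checkbox on, or set DATATOOLS_TESSERACT_PATH to the binary."
    ),
    "Darwin": "brew install tesseract",
    "Linux": "apt install tesseract-ocr (or your distro's equivalent)",
}


def install_hint(system: Optional[str] = None) -> str:
    """Return only the install hint relevant to the given platform."""
    system = system or platform.system()
    # Assumption: unknown platforms get the Linux hint.
    return INSTALL_HINTS.get(system, INSTALL_HINTS["Linux"])
```

Passing ``system`` explicitly keeps the function testable on any
runner without patching ``platform.system``.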

**Robust binary discovery in ``ocr_available()``.** Discovery now
runs in three stages:

1. Honor ``DATATOOLS_TESSERACT_PATH`` env var if set — explicit
   override for portable installs or non-default locations.
2. Try ``pytesseract``'s default PATH-based lookup.
3. If PATH lookup fails, probe known Windows install paths
   (``C:\Program Files\Tesseract-OCR\tesseract.exe``,
   the x86 variant, and ``%LOCALAPPDATA%\Programs\Tesseract-OCR\``)
   via the new ``_autodetect_tesseract_path``. On hit, set
   ``pytesseract.pytesseract.tesseract_cmd`` so all subsequent
   ``image_to_data`` calls use the same binary without
   re-discovering.

This means a user who runs the UB-Mannheim installer with
default options but forgets the PATH checkbox will still get
OCR working after a launcher restart, without env-var
gymnastics.
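
The three-stage order above could look roughly like the sketch
below. It is not the code from this commit: ``resolve_tesseract``
and ``windows_candidates`` are hypothetical names, ``shutil.which``
stands in for ``pytesseract``'s own PATH lookup, and the full
x86 and ``%LOCALAPPDATA%`` paths are assumed from the description.
The ``which`` and ``exists`` parameters are injected only so the
logic can be exercised on any OS.

```python
import os
import shutil
from pathlib import PureWindowsPath
from typing import Callable, List, Mapping, Optional


def windows_candidates(env: Mapping[str, str]) -> List[str]:
    """Assumed default install locations on Windows."""
    candidates = [
        r"C:\Program Files\Tesseract-OCR\tesseract.exe",
        r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
    ]
    local = env.get("LOCALAPPDATA")
    if local:
        candidates.append(
            str(PureWindowsPath(local, "Programs", "Tesseract-OCR", "tesseract.exe"))
        )
    return candidates


def resolve_tesseract(
    env: Mapping[str, str],
    system: str,
    which: Callable[[str], Optional[str]] = shutil.which,
    exists: Callable[[str], bool] = os.path.exists,
) -> Optional[str]:
    """Return the path to use for Tesseract, or None if none is found."""
    # Stage 1: the explicit override wins, even if it does not resolve.
    override = env.get("DATATOOLS_TESSERACT_PATH")
    if override:
        return override
    # Stage 2: PATH-based lookup.
    on_path = which("tesseract")
    if on_path:
        return on_path
    # Stage 3: probe known install paths, Windows only.
    if system != "Windows":
        return None
    for candidate in windows_candidates(env):
        if exists(candidate):
            return candidate
    return None
```

A caller would typically pass ``os.environ`` and
``platform.system()``, then assign a non-None result to
``pytesseract.pytesseract.tesseract_cmd`` once so later OCR calls
reuse it.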

Tests (4 new, 85 total in the suite):

- Auto-detect returns None on non-Windows (no false positives
  on dev laptops).
- Auto-detect finds the binary at a mocked
  ``C:\Program Files\Tesseract-OCR\tesseract.exe``.
- Auto-detect returns None when no candidate exists.
- ``DATATOOLS_TESSERACT_PATH`` env var beats both PATH lookup
  and auto-detect. The test points it at a nonexistent path and
  checks that ``tesseract_cmd`` is set to that value anyway, so
  a real binary at a custom location is used as-is.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:15:00 +00:00
parent 538e23d219
commit e6ee2e3481
3 changed files with 156 additions and 16 deletions

@@ -313,3 +313,56 @@ class TestOcrAvailability:
        assert len(pages) == 1
        # No OCR-disabled warning on a text PDF, since pages have text.
        assert not any("OCR is disabled" in w for w in warnings)


class TestTesseractDiscovery:
    """Windows install paths + env-var override are how a real user
    (no PATH munging) gets OCR working. Cover the discovery logic
    even on Linux/macOS test runners by mocking out the OS check
    and ``Path.exists``."""

    def test_autodetect_returns_none_on_non_windows(self, monkeypatch):
        from src import pdf_extract
        monkeypatch.setattr(
            "platform.system",
            lambda: "Linux",
        )
        assert pdf_extract._autodetect_tesseract_path() is None

    def test_autodetect_finds_program_files_on_windows(self, monkeypatch):
        from src import pdf_extract
        monkeypatch.setattr("platform.system", lambda: "Windows")
        target = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

        def fake_exists(self):
            return str(self) == target

        monkeypatch.setattr(
            "pathlib.Path.exists",
            fake_exists,
        )
        assert pdf_extract._autodetect_tesseract_path() == target

    def test_autodetect_returns_none_when_nothing_installed(
        self, monkeypatch,
    ):
        from src import pdf_extract
        monkeypatch.setattr("platform.system", lambda: "Windows")
        monkeypatch.setattr("pathlib.Path.exists", lambda self: False)
        assert pdf_extract._autodetect_tesseract_path() is None

    def test_env_var_override_takes_precedence(self, monkeypatch, tmp_path):
        """``DATATOOLS_TESSERACT_PATH`` wins over discovery so a
        portable install at a non-default path works without
        relying on PATH."""
        from src import pdf_extract
        # Point the override at a path that doesn't exist:
        # ocr_available will try it and report the failure, but
        # importantly the cmd attribute is set BEFORE the call,
        # which is what we're verifying.
        fake_bin = str(tmp_path / "fake-tesseract.exe")
        monkeypatch.setenv("DATATOOLS_TESSERACT_PATH", fake_bin)
        pdf_extract.ocr_available()
        import pytesseract
        assert pytesseract.pytesseract.tesseract_cmd == fake_bin