feat(pdf): robust Tesseract discovery + OS-aware install copy

A user ran ``brew install tesseract`` in PowerShell after seeing
all three OSes listed inline in the OCR banner, an easy mistake
when the install commands are crammed onto one line with ``·``
separators. Two changes pre-empt this:

**OS-aware OCR banner.** The expander now detects the user's
platform via ``platform.system()`` and shows only the relevant
install instructions:

- **Windows**: UB-Mannheim installer link, numbered steps,
  explicit "keep the Add to PATH checkbox on" callout, plus a
  fallback paragraph telling the user how to set
  ``DATATOOLS_TESSERACT_PATH`` if they already installed
  without PATH and don't want to reinstall.
- **macOS**: ``brew install tesseract`` with a Homebrew link.
- **Linux**: ``apt install tesseract-ocr`` with an "or your
  distro's equivalent" hedge.
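
The per-platform selection described above might look roughly like the sketch below. The function name ``install_copy`` and the copy strings are hypothetical stand-ins, not the banner's actual text; only the use of ``platform.system()`` comes from this commit.

```python
from __future__ import annotations

import platform

# Hypothetical copy for illustration; the real banner text lives in the app.
_INSTALL_COPY = {
    "Windows": (
        "Install Tesseract with the UB-Mannheim installer and keep the "
        "'Add to PATH' checkbox on, or set DATATOOLS_TESSERACT_PATH."
    ),
    "Darwin": "brew install tesseract",
    "Linux": "apt install tesseract-ocr (or your distro's equivalent)",
}


def install_copy(system: str | None = None) -> str:
    """Return install instructions for a single platform."""
    # platform.system() reports "Windows", "Darwin" (macOS) or "Linux".
    system = system or platform.system()
    # Unknown platforms get a generic pointer instead of every OS at once.
    return _INSTALL_COPY.get(
        system, "See the Tesseract documentation for your platform."
    )
```

Passing ``system`` explicitly keeps the function easy to exercise in tests without patching ``platform``.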

**Robust binary discovery in ``ocr_available()``.** Three-stage:

1. Honor ``DATATOOLS_TESSERACT_PATH`` env var if set — explicit
   override for portable installs or non-default locations.
2. Try ``pytesseract``'s default PATH-based lookup.
3. If PATH lookup fails, probe known Windows install paths
   (``C:\Program Files\Tesseract-OCR\tesseract.exe``,
   the x86 variant, and ``%LOCALAPPDATA%\Programs\Tesseract-OCR\``)
   via the new ``_autodetect_tesseract_path``. On hit, set
   ``pytesseract.pytesseract.tesseract_cmd`` so all subsequent
   ``image_to_data`` calls use the same binary without
   re-discovering.

This means a user who runs the UB-Mannheim installer with
default options but forgets the PATH checkbox will still get
OCR working after a launcher restart, without env-var
gymnastics.
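
As a rough illustration of the lookup order, here is a simplified, dependency-free sketch. It is not the code in this commit: it uses ``shutil.which`` in place of ``pytesseract``'s version check, and ``resolve_tesseract`` with its ``env``, ``candidates`` and ``which`` parameters is a hypothetical name and signature chosen for testability.

```python
from __future__ import annotations

import os
import shutil
from pathlib import Path
from typing import Callable, Iterable, Mapping


def resolve_tesseract(
    env: Mapping[str, str] = os.environ,
    candidates: Iterable[str] = (),
    which: Callable[[str], str | None] = shutil.which,
) -> str | None:
    """Return the first Tesseract path found, or None."""
    # 1. Explicit override: returned as-is, without checking it exists.
    override = env.get("DATATOOLS_TESSERACT_PATH")
    if override:
        return override
    # 2. Ordinary PATH lookup.
    on_path = which("tesseract")
    if on_path:
        return on_path
    # 3. Probe known install locations in order.
    for candidate in candidates:
        if Path(candidate).exists():
            return candidate
    return None
```

A caller would then assign the result to ``pytesseract.pytesseract.tesseract_cmd`` once, so later OCR calls reuse it.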

Tests (4 new, 85 total in the suite):

- Auto-detect returns None on non-Windows (no false positives
  on dev laptops).
- Auto-detect finds the binary at a mocked
  ``C:\Program Files\Tesseract-OCR\tesseract.exe``.
- Auto-detect returns None when no candidate exists.
- ``DATATOOLS_TESSERACT_PATH`` env var beats both PATH lookup
  and auto-detect (sets ``tesseract_cmd`` even when the path
  doesn't resolve, so a real binary at a custom location works).
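
The first three tests could be written along these lines. This is only a sketch: ``autodetect_stand_in`` is a local stand-in so the example runs on its own, and the test names are hypothetical. The real tests would target ``_autodetect_tesseract_path`` in the project module.

```python
from __future__ import annotations

import platform
from pathlib import Path
from unittest import mock

PROGRAM_FILES = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def autodetect_stand_in() -> str | None:
    """Stand-in for the project's ``_autodetect_tesseract_path``."""
    if platform.system() != "Windows":
        return None
    return PROGRAM_FILES if Path(PROGRAM_FILES).exists() else None


def test_none_on_non_windows():
    # Even if the path "exists", non-Windows platforms must return None.
    with mock.patch("platform.system", return_value="Linux"), \
         mock.patch.object(Path, "exists", return_value=True):
        assert autodetect_stand_in() is None


def test_finds_mocked_program_files_binary():
    with mock.patch("platform.system", return_value="Windows"), \
         mock.patch.object(Path, "exists", return_value=True):
        assert autodetect_stand_in() == PROGRAM_FILES


def test_none_when_no_candidate_exists():
    with mock.patch("platform.system", return_value="Windows"), \
         mock.patch.object(Path, "exists", return_value=False):
        assert autodetect_stand_in() is None
```

Patching both ``platform.system`` and ``Path.exists`` lets the tests run on any development machine, whether or not Tesseract is installed.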

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:15:00 +00:00
parent 538e23d219
commit e6ee2e3481
3 changed files with 156 additions and 16 deletions


@@ -531,23 +531,82 @@ def page_has_extractable_text(page: Page, min_words: int = 5) -> bool:
     return len(page.words) >= min_words


+def _autodetect_tesseract_path() -> str | None:
+    """Probe well-known install locations for ``tesseract.exe``.
+
+    UB-Mannheim's Windows installer drops Tesseract at one of two
+    paths by default. Auto-detecting them lets ``ocr_available``
+    succeed even when the user (or their installer) skipped the
+    "Add to PATH" step — the most common Windows install
+    snag based on real user reports.
+
+    No-op on non-Windows: macOS/Linux package managers
+    always put ``tesseract`` on PATH, so PATH-based discovery is
+    sufficient.
+    """
+    import os as _os
+    import platform as _platform
+    from pathlib import Path as _Path
+
+    if _platform.system() != "Windows":
+        return None
+
+    candidates = [
+        r"C:\Program Files\Tesseract-OCR\tesseract.exe",
+        r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
+        _os.path.expandvars(
+            r"%LOCALAPPDATA%\Programs\Tesseract-OCR\tesseract.exe"
+        ),
+    ]
+    for p in candidates:
+        if p and _Path(p).exists():
+            return p
+    return None
+
+
 def ocr_available() -> tuple[bool, str]:
     """Return ``(available, reason)`` — is OCR usable right now?

     Checks both the Python binding (``pytesseract``) and the
-    Tesseract binary. The reason string is suitable for surfacing to
-    the user when OCR is unavailable.
+    Tesseract binary. The reason string is suitable for surfacing
+    to the user when OCR is unavailable.
+
+    Discovery order for the Tesseract binary:
+    1. ``DATATOOLS_TESSERACT_PATH`` env var — explicit override,
+       wins over everything else. Useful for portable installs.
+    2. Whatever's on PATH (``pytesseract``'s default).
+    3. ``_autodetect_tesseract_path`` — known Windows install
+       locations. Sets ``pytesseract.pytesseract.tesseract_cmd``
+       so subsequent ``image_to_data`` calls use the same binary.
     """
+    import os as _os
+
     try:
-        import pytesseract  # noqa: F401
+        import pytesseract  # noqa: F401, PLC0415
     except ImportError:
         return False, "pytesseract is not installed."
+
+    override = _os.environ.get("DATATOOLS_TESSERACT_PATH")
+    if override:
+        pytesseract.pytesseract.tesseract_cmd = override
+
     try:
-        import pytesseract as pt
-        pt.get_tesseract_version()
-    except Exception as e:
-        return False, f"Tesseract binary not found: {e}"
-    return True, ""
+        pytesseract.get_tesseract_version()
+        return True, ""
+    except Exception as e_path:
+        # Fallback: probe known install locations.
+        candidate = _autodetect_tesseract_path()
+        if candidate:
+            pytesseract.pytesseract.tesseract_cmd = candidate
+            try:
+                pytesseract.get_tesseract_version()
+                return True, ""
+            except Exception as e_candidate:
+                return False, (
+                    f"Tesseract found at {candidate} but failed to "
+                    f"run: {e_candidate}"
+                )
+        return False, f"Tesseract binary not found on PATH: {e_path}"


 def render_page_image(