build: bundle Tesseract 5.5.0 + tessdata into every release artifact

End users no longer have to install Tesseract separately for OCR on
scanned PDFs: the engine ships inside the installer, portable .zip,
and AppImage for all three platforms.

Per-platform fetch in build/make_release.py (run before PyInstaller):

- Windows: download UB-Mannheim installer 5.5.0.20241111, extract
  with 7-Zip, copy tesseract.exe + required DLLs into the staging dir.
- macOS: ``brew install tesseract``, copy binary + every
  Homebrew-prefixed dylib resolved via ``otool -L`` (recurse one level
  for transitive deps), then ``install_name_tool`` rewrites IDs / load
  paths to @loader_path/... so the bundle is relocatable.
- Linux: ``apt-get install tesseract-ocr libtesseract5``, copy binary
  + every non-system .so from ``ldd`` output, then
  ``patchelf --set-rpath '$ORIGIN'``.
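
The Linux step above could be sketched as follows. This is a minimal
illustration, not the code in build/make_release.py:
``non_system_libs`` and the ``SYSTEM_LIB_PREFIXES`` denylist are
hypothetical names, and which libraries count as "system" is an
assumption made for the example.

```python
import re

# Hypothetical denylist: core libraries assumed to exist on every target distro.
SYSTEM_LIB_PREFIXES = (
    "linux-vdso", "ld-linux", "libc.", "libm.", "libdl.", "libpthread.",
    "librt.", "libgcc_s.", "libstdc++.",
)

# Matches "libfoo.so.1 => /path/to/libfoo.so.1 (0x0000...)" lines only.
_LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(\S+)\s+\(0x[0-9a-fA-F]+\)\s*$")


def non_system_libs(ldd_output: str) -> list[str]:
    """Return resolved paths of shared objects that should be bundled."""
    paths = []
    for line in ldd_output.splitlines():
        match = _LDD_LINE.match(line)
        if not match:
            # vdso, the dynamic loader, and "not found" entries carry no
            # resolved path in this form, so they are skipped.
            continue
        name, path = match.groups()
        if name.startswith(SYSTEM_LIB_PREFIXES):
            continue
        paths.append(path)
    return paths
```

Each returned path would then be copied next to the binary before
``patchelf`` sets the rpath, so the loader finds the copies first.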

Wire-up:

- build/datatools.spec reads the DATATOOLS_TESS_STAGING env var (set
  by make_release) and adds the staging dir + tessdata + the
  LICENSE_TESSERACT.txt Apache 2.0 attribution to PyInstaller datas
  so they land at <bundle>/tesseract/{tesseract[.exe],tessdata/}
  and the license sits at the bundle root. Soft-warns when staging
  is empty so dev spec runs still complete.
- English tessdata pulled by fetch_tessdata() from
  tesseract-ocr/tessdata_best (eng.traineddata, ~16 MB). Cached at
  build/vendor/tessdata/.
- .github/workflows/build.yml: actions/cache@v4 step keyed on
  ``tesseract-${{ runner.os }}-5.5.0-tessdata_best-v1`` caches the
  staging dir and the vendored tessdata across runs; apt installs
  patchelf on the Linux runner; PyInstaller step now receives the
  DATATOOLS_TESS_STAGING env var.
- .gitignore: build/_tesseract/ and the .traineddata blob.
- TESSERACT_SKIP_FETCH=1 honored for offline / manual stages.
- Installer / .dmg / .zip / AppImage scripts: one-line comments
  confirming Tesseract rides along automatically via PyInstaller's
  datas (no extra packaging steps required in those scripts).
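
The tessdata caching and TESSERACT_SKIP_FETCH behavior described above
could look roughly like this. It is a sketch under assumptions, not
the real fetch_tessdata(): ``ensure_tessdata`` is a hypothetical name,
the download URL is assumed (the commit only names the tessdata_best
repo), and returning None on a skipped fetch is an illustrative choice.

```python
import os
import urllib.request
from pathlib import Path

# Assumed download location; only the repo name comes from the commit.
TESSDATA_URL = (
    "https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata"
)


def ensure_tessdata(cache_dir, download=urllib.request.urlretrieve,
                    environ=os.environ):
    """Return the cached eng.traineddata path, fetching it at most once.

    Returns None when the file is absent and TESSERACT_SKIP_FETCH=1.
    """
    target = Path(cache_dir) / "eng.traineddata"
    if target.exists():
        return target  # cache hit: no network access
    if environ.get("TESSERACT_SKIP_FETCH") == "1":
        return None  # offline / manually staged build
    target.parent.mkdir(parents=True, exist_ok=True)
    partial = target.with_name(target.name + ".part")
    download(TESSDATA_URL, str(partial))
    partial.replace(target)  # publish only a complete download
    return target
```

Downloading to a ``.part`` file and renaming it keeps an interrupted
fetch from being mistaken for a valid cache entry on the next run.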

Bundle-size delta: ~50-70 MB on disk per platform, ~25-40 MB
post-compression. Net installer size ~250-300 MB (was ~120 MB), an
accepted tradeoff for zero end-user OCR setup.

Reversal of the prior "don't bundle Tesseract" decision (option A).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -24,6 +24,7 @@

# -*- mode: python ; coding: utf-8 -*-
import os
from pathlib import Path
from PyInstaller.utils.hooks import (
    collect_all,
@@ -103,6 +104,78 @@ datas += [
    (str(REPO / ".streamlit" / "config.toml"), ".streamlit"),
]

# ----- Tesseract OCR bundle ----------------------------------------
# ``build/make_release.py`` stages the per-platform Tesseract binary
# + its runtime libs (DLLs/dylibs/.so files) into
# ``build/_tesseract/<target>/`` and the shared eng.traineddata into
# ``build/vendor/tessdata/``. We add both to ``datas`` so PyInstaller
# drops them at the path the runtime expects:
#
#   <bundle>/tesseract/tesseract[.exe]
#   <bundle>/tesseract/<all dll/dylib/so deps>
#   <bundle>/tesseract/tessdata/eng.traineddata
#
# The runtime discovery code in src/pdf_extract.py reads this layout
# from ``Path(sys._MEIPASS) / "tesseract" / ...``. Keep the two ends
# in sync: if you rename "tesseract" here, update pdf_extract.py too.
#
# The orchestrator (make_release.py) sets DATATOOLS_TESS_STAGING to
# the right per-platform dir before invoking PyInstaller. For ad-hoc
# `pyinstaller build/datatools.spec` runs without the orchestrator,
# fall back to the canonical staging path.
_tess_staging_env = os.environ.get("DATATOOLS_TESS_STAGING")
if _tess_staging_env:
    _tess_staging = Path(_tess_staging_env)
else:
    # Pick the obvious per-host staging dir as a fallback so spec-only
    # builds (without the orchestrator) still work in dev.
    import sys as _sys_for_target
    _target_guess = (
        "win" if _sys_for_target.platform.startswith("win")
        else "mac" if _sys_for_target.platform == "darwin"
        else "linux"
    )
    _tess_staging = REPO / "build" / "_tesseract" / _target_guess

_tessdata = REPO / "build" / "vendor" / "tessdata"

if _tess_staging.is_dir() and any(_tess_staging.iterdir()):
    # Drop every file in the staging dir directly under
    # ``<bundle>/tesseract/`` (binary + DLL/dylib/so siblings).
    datas += [(str(_tess_staging), "tesseract")]
else:
    # Don't hard-fail spec parse: useful for first-time devs running
    # PyInstaller before fetching binaries. Surface a loud warning
    # though, since the OCR feature will silently fail at runtime.
    print(
        f"WARNING: {_tess_staging} is empty or missing; OCR will be "
        "disabled in the bundle. Run build/make_release.py (which "
        "calls fetch_tesseract_for_platform) before pyinstaller, or "
        "pre-stage the binary manually."
    )

if (_tessdata / "eng.traineddata").exists():
    datas += [(str(_tessdata), "tesseract/tessdata")]
else:
    print(
        f"WARNING: {_tessdata}/eng.traineddata is missing; OCR will "
        "have no language data at runtime. Run build/make_release.py "
        "or fetch manually per build/vendor/README.md."
    )

# Bundle the Apache-2.0 LICENSE text alongside the binary. The docs
# agent maintains LICENSE_TESSERACT.txt at the repo root; PyInstaller
# drops it at the bundle root next to DataTools[.exe].
_tess_license = REPO / "LICENSE_TESSERACT.txt"
if _tess_license.exists():
    datas += [(str(_tess_license), ".")]
else:
    print(
        "WARNING: LICENSE_TESSERACT.txt missing at repo root. Required "
        "by Apache-2.0 for redistribution; the docs agent should "
        "create it. Continuing without it for now."
    )

# ----- Analysis ------------------------------------------------------

a = Analysis(
@@ -158,6 +231,13 @@ coll = COLLECT(

# macOS .app bundle wrapper. PyInstaller produces it only on Mac;
# this block is a no-op on Win/Linux.
#
# Tesseract bundling note: ``BUNDLE(coll, ...)`` carries the entire
# COLLECT output (binaries + datas) into the .app's
# Contents/Resources tree, so the ``tesseract/`` subdir we built up
# in ``datas`` lands at ``DataTools.app/Contents/Resources/tesseract/``
# and the runtime ``sys._MEIPASS`` resolves there. No extra plumbing
# needed.
import sys as _sys
if _sys.platform == "darwin":
    app = BUNDLE(
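
The spec comments above say that src/pdf_extract.py resolves the
bundled files under ``Path(sys._MEIPASS) / "tesseract"``. A minimal
sketch of such a lookup follows; ``find_bundled_tesseract`` is a
hypothetical name, and the real discovery code is not part of this
diff, so the details here are assumptions.

```python
import sys
from pathlib import Path


def find_bundled_tesseract(bundle_root=None, platform=sys.platform):
    """Return (binary, tessdata_dir) inside a PyInstaller bundle, or None.

    bundle_root defaults to sys._MEIPASS, which PyInstaller sets only in
    frozen apps; in a plain interpreter the function returns None.
    """
    if bundle_root is None:
        bundle_root = getattr(sys, "_MEIPASS", None)
        if bundle_root is None:
            return None
    tess_dir = Path(bundle_root) / "tesseract"
    exe_name = "tesseract.exe" if platform.startswith("win") else "tesseract"
    binary = tess_dir / exe_name
    tessdata = tess_dir / "tessdata"
    # Require both halves of the layout so a partial bundle is treated
    # the same as a missing one.
    if binary.is_file() and (tessdata / "eng.traineddata").is_file():
        return binary, tessdata
    return None
```

A caller could fall back to a system-wide Tesseract when this returns
None, which matches the spec's soft-warning behavior for bundles built
without staged binaries.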