build: bundle Tesseract 5.5.0 + tessdata into every release artifact

End users no longer have to install Tesseract separately for OCR on
scanned PDFs — the engine ships inside the installer, portable .zip,
and AppImage for all three platforms.

Per-platform fetch in build/make_release.py (run before PyInstaller):
- Windows: download UB-Mannheim installer 5.5.0.20241111, extract
  with 7-Zip, copy tesseract.exe + required DLLs into the staging dir.
- macOS: ``brew install tesseract``, copy binary + every Homebrew-
  prefixed dylib resolved via otool -L (recurse one level for
  transitive deps), then install_name_tool rewrites IDs / load paths
  to @loader_path/... so the bundle is relocatable.
- Linux: ``apt-get install tesseract-ocr libtesseract5``, copy binary
  + every non-system .so from ldd output, patchelf --set-rpath '$ORIGIN'.

Wire-up:
- build/datatools.spec reads DATATOOLS_TESS_STAGING env var (set by
  make_release) and adds the staging dir + tessdata + the
  LICENSE_TESSERACT.txt Apache 2.0 attribution to PyInstaller datas
  so they land at <bundle>/tesseract/{tesseract[.exe],tessdata/}
  and the license sits at the bundle root. Soft-warns when staging
  is empty so dev spec runs still complete.
- English tessdata pulled by fetch_tessdata() from
  tesseract-ocr/tessdata_best (eng.traineddata, ~16 MB). Cached at
  build/vendor/tessdata/.
- .github/workflows/build.yml: actions/cache@v4 step keyed on
  ``tesseract-${runner.os}-5.5.0-tessdata_best-v1`` caches the
  staging dir and the vendored tessdata across runs; apt installs
  patchelf on the Linux runner; PyInstaller step now receives the
  DATATOOLS_TESS_STAGING env var.
- .gitignore: build/_tesseract/ and the .traineddata blob.
- TESSERACT_SKIP_FETCH=1 honored for offline / manual stages.
- Installer / .dmg / .zip / AppImage scripts: one-line comments
  confirming Tesseract rides along automatically via PyInstaller's
  datas (no extra packaging steps required in those scripts).

Bundle-size delta: ~50-70 MB on disk per platform, ~25-40 MB post-
compression. Net installer size ~250-300 MB (was ~120 MB) — accepted
tradeoff for zero end-user OCR setup.

Reversal of the prior "don't bundle Tesseract" decision (option A).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-02 18:20:17 +00:00
parent 17faf84aed
commit 93ccada974
10 changed files with 634 additions and 3 deletions

View File

@@ -65,6 +65,30 @@ jobs:
pip install -r requirements.txt
pip install pyinstaller pillow
# ---- Tesseract bundling cache --------------------------------
# The fetch logic inside build/make_release.py downloads:
# * build/vendor/tessdata/eng.traineddata (~16 MB, shared)
# * build/_tesseract/<platform>/ (binary + libs, 30-120 MB)
# Cache both so iterative CI runs don't re-download. The
# cache key bakes in the pinned Tesseract version + tessdata
# URL so a version bump invalidates automatically.
- name: Cache Tesseract bundle inputs
uses: actions/cache@v4
with:
path: |
build/_tesseract
build/vendor/tessdata
key: tesseract-${{ runner.os }}-5.5.0-tessdata_best-v1
# ---- Linux: install patchelf so make_release.py can rewrite
# RPATH on the bundled tesseract binary. apt-get install
# tesseract-ocr is handled inside make_release.py itself. -----
- name: Install Linux build prereqs for Tesseract bundling
if: matrix.os == 'ubuntu-latest'
run: |
sudo apt-get update
sudo apt-get install -y patchelf
- name: Read version
id: version
shell: bash
@@ -75,7 +99,32 @@ jobs:
- name: Generate platform icons
run: python build/generate_icons.py
# Stage Tesseract before PyInstaller. The make_release.py
# helpers handle the per-platform fetch (UB-Mannheim on Win,
# brew on Mac, apt on Linux) and stage the binary + libs into
# build/_tesseract/<platform>/ where the spec picks them up.
# We invoke a tiny inline Python so the workflow doesn't have
# to know the per-platform target string.
- name: Stage Tesseract binary + tessdata
shell: bash
env:
DATATOOLS_PLATFORM: ${{ matrix.platform }}
run: |
python - <<'PY'
import os, sys
sys.path.insert(0, "build")
from make_release import fetch_tessdata, fetch_tesseract_for_platform
target = os.environ["DATATOOLS_PLATFORM"]
fetch_tessdata()
fetch_tesseract_for_platform(target)
PY
- name: Build PyInstaller bundle
shell: bash
env:
# The spec reads this to find the per-platform staging dir;
# see build/datatools.spec for the contract.
DATATOOLS_TESS_STAGING: build/_tesseract/${{ matrix.platform }}
run: pyinstaller build/datatools.spec --clean --noconfirm
# ---- Per-platform installer packaging ------------------------