build: bundle Tesseract 5.5.0 + tessdata into every release artifact

End users no longer have to install Tesseract separately for OCR on
scanned PDFs: the engine ships inside the installer, portable .zip, and
AppImage for all three platforms.

Per-platform fetch in build/make_release.py (run before PyInstaller):

- Windows: download UB-Mannheim installer 5.5.0.20241111, extract with
  7-Zip, copy tesseract.exe + required DLLs into the staging dir.
- macOS: ``brew install tesseract``, copy binary + every
  Homebrew-prefixed dylib resolved via otool -L (recurse one level for
  transitive deps), then install_name_tool rewrites IDs / load paths to
  @loader_path/... so the bundle is relocatable.
- Linux: ``apt-get install tesseract-ocr libtesseract5``, copy binary +
  every non-system .so from ldd output, patchelf --set-rpath '$ORIGIN'.

Wire-up:

- build/datatools.spec reads DATATOOLS_TESS_STAGING env var (set by
  make_release) and adds the staging dir + tessdata + the
  LICENSE_TESSERACT.txt Apache 2.0 attribution to PyInstaller datas so
  they land at <bundle>/tesseract/{tesseract[.exe],tessdata/} and the
  license sits at the bundle root. Soft-warns when staging is empty so
  dev spec runs still complete.
- English tessdata pulled by fetch_tessdata() from
  tesseract-ocr/tessdata_best (eng.traineddata, ~16 MB). Cached at
  build/vendor/tessdata/.
- .github/workflows/build.yml: actions/cache@v4 step keyed on
  ``tesseract-${runner.os}-5.5.0-tessdata_best-v1`` caches the staging
  dir and the vendored tessdata across runs; apt installs patchelf on
  the Linux runner; PyInstaller step now receives the
  DATATOOLS_TESS_STAGING env var.
- .gitignore: build/_tesseract/ and the .traineddata blob.
- TESSERACT_SKIP_FETCH=1 honored for offline / manual stages.
- Installer / .dmg / .zip / AppImage scripts: one-line comments
  confirming Tesseract rides along automatically via PyInstaller's
  datas (no extra packaging steps required in those scripts).

Bundle-size delta: ~50-70 MB on disk per platform, ~25-40 MB
post-compression. Net installer size ~250-300 MB (was ~120 MB), an
accepted tradeoff for zero end-user OCR setup.

Reversal of the prior "don't bundle Tesseract" decision (option A).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:20:17 +00:00
parent 17faf84aed
commit 93ccada974
10 changed files with 634 additions and 3 deletions
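
The fetch-and-cache behaviour that the commit message attributes to
fetch_tessdata() could look roughly like the sketch below. It is an
illustration only, not the code in build/make_release.py. The function
signature, the injectable `fetch` parameter, the `MIN_BYTES` threshold,
and the `.part` temp-file name are all hypothetical. The handling of
TESSERACT_SKIP_FETCH=1 (never touch the network, fail if the file is
missing) is likewise an assumption about what "honored" means.

```python
import os
import urllib.request
from pathlib import Path

# Canonical URL quoted in build/vendor/README.md.
TESSDATA_URL = (
    "https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata"
)
# Assumed sanity threshold: the real file is ~16 MB, so anything far
# smaller is treated as a truncated download or an error page.
MIN_BYTES = 1_000_000


def download(url: str) -> bytes:
    """Default downloader (hypothetical helper); replaceable in tests."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read()


def fetch_tessdata(vendor_dir: Path, fetch=download,
                   min_bytes: int = MIN_BYTES) -> Path:
    """Return the path to eng.traineddata, downloading it only if needed."""
    target = vendor_dir / "eng.traineddata"

    # Cache hit: a plausible-looking file is already in place.
    if target.is_file() and target.stat().st_size >= min_bytes:
        return target

    # Assumed offline behaviour: never download, fail loudly instead.
    if os.environ.get("TESSERACT_SKIP_FETCH") == "1":
        raise FileNotFoundError(
            f"{target} is missing and TESSERACT_SKIP_FETCH=1 is set"
        )

    data = fetch(TESSDATA_URL)
    if len(data) < min_bytes:
        raise ValueError(
            f"download looks truncated: got {len(data)} bytes"
        )

    # Write to a temp name first so an interrupted build never leaves a
    # partial file that a later run would mistake for a cache hit.
    vendor_dir.mkdir(parents=True, exist_ok=True)
    tmp = target.with_name(target.name + ".part")
    tmp.write_bytes(data)
    tmp.replace(target)
    return target
```

Under these assumptions, a release script would call
`fetch_tessdata(Path("build/vendor/tessdata"))` once before the
PyInstaller step. Keeping the downloader injectable lets the cache and
offline paths be exercised without network access.
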
--- a/build/vendor/README.md
+++ b/build/vendor/README.md
@@ -0,0 +1,62 @@
+# build/vendor/ — third-party bundle inputs (fetched at build time)
+
+This tree holds the third-party assets that get bundled into the
+PyInstaller artifacts but that we deliberately do **not** keep in git
+(too large / license-encumbered / re-fetchable on demand).
+
+The build pipeline (`build/make_release.py`) populates everything in
+here before the PyInstaller step. The contents are git-ignored except
+for this README.
+
+## tessdata/
+
+Holds the Tesseract language data file(s) used by the PDF Extractor
+OCR fallback. Only English is bundled today.
+
+### Canonical source
+
+We use the **"best" model** from `tesseract-ocr/tessdata_best` (LSTM,
+slower but higher accuracy than the legacy `tessdata` set, and only
+~12 MB compressed → ~16 MB uncompressed):
+
+```
+https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata
+```
+
+There is also `tessdata_fast/` (~4 MB, lower accuracy) if you ever
+want to optimise for bundle size over recognition quality. For bank
+statements (the only OCR use case so far), the extra accuracy of the
+`_best` model is worth the additional ~12 MB.
+
+### Why we don't vendor it in git
+
+* ~16 MB binary file — bloats clone times for everyone, including
+  contributors who never touch the OCR code path.
+* Apache-2.0-licensed and stable; the file rarely changes upstream
+  (last touched 2021), so a build-time fetch is safe.
+* The Tesseract project explicitly distributes these via GitHub
+  raw URLs — they're meant to be downloaded, not redistributed
+  through other repos.
+
+### How it gets populated
+
+`build/make_release.py::fetch_tessdata()` checks for
+`build/vendor/tessdata/eng.traineddata` on every run. If it's
+missing, the script downloads it from the canonical URL above and
+caches it here. Subsequent builds reuse the cached file.
+
+On CI, the directory is restored from the GitHub Actions cache so we
+don't pay the download cost on every run (`.github/workflows/build.yml`
+caches it under a key of runner OS, Tesseract version, and model set).
+
+### Manual one-time fetch (if you're offline or behind a proxy)
+
+```bash
+mkdir -p build/vendor/tessdata
+curl -L -o build/vendor/tessdata/eng.traineddata \
+  https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata
+```
+
+Verify the file is non-empty and close to the expected ~16 MB; a
+truncated download or an HTML error page will be far smaller. The
+script does a basic sanity check after download.
--- a/build/vendor/tessdata/.gitkeep
+++ b/build/vendor/tessdata/.gitkeep