build: bundle Tesseract 5.5.0 + tessdata into every release artifact

End users no longer have to install Tesseract separately for OCR on
scanned PDFs: the engine ships inside the installer, portable .zip,
and AppImage for all three platforms.

Per-platform fetch in build/make_release.py (run before PyInstaller):
- Windows: download UB-Mannheim installer 5.5.0.20241111, extract
  with 7-Zip, copy tesseract.exe + required DLLs into the staging dir.
- macOS: ``brew install tesseract``, copy binary + every
  Homebrew-prefixed dylib resolved via otool -L (recurse one level
  for transitive deps), then install_name_tool rewrites IDs / load
  paths to @loader_path/... so the bundle is relocatable.
- Linux: ``apt-get install tesseract-ocr libtesseract5``, copy binary
  + every non-system .so from ldd output, patchelf --set-rpath '$ORIGIN'.
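
For illustration, the Linux step of picking non-system libraries out of ldd output could look roughly like the sketch below. The function name, the regular expression, and the list of excluded library names are assumptions made for this example; the actual filter in build/make_release.py is not shown in this commit.

```python
import re

# Hypothetical exclusion list: glibc pieces that are normally left to the
# host system. The real script may use a different rule.
SYSTEM_LIB_PREFIXES = (
    "linux-vdso",
    "ld-linux",
    "libc.so",
    "libm.so",
    "libdl.so",
    "libpthread.so",
    "librt.so",
)

# Matches lines such as "libfoo.so.1 => /path/to/libfoo.so.1 (0x...)".
# Lines without "=>" or with "not found" do not match and are skipped.
LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(/\S+)")


def non_system_libs(ldd_output: str) -> list[str]:
    """Return resolved paths of libraries that are not on the exclusion list."""
    paths = []
    for line in ldd_output.splitlines():
        match = LDD_LINE.match(line)
        if match is None:
            continue
        name, path = match.groups()
        if name.startswith(SYSTEM_LIB_PREFIXES):
            continue
        paths.append(path)
    return paths
```

The input would typically come from something like ``subprocess.run(["ldd", binary], capture_output=True, text=True).stdout``; each returned path is then copied next to the binary before patchelf sets the rpath.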

Wire-up:
- build/datatools.spec reads DATATOOLS_TESS_STAGING env var (set by
  make_release) and adds the staging dir + tessdata + the
  LICENSE_TESSERACT.txt Apache 2.0 attribution to PyInstaller datas
  so they land at <bundle>/tesseract/{tesseract[.exe],tessdata/}
  and the license sits at the bundle root. Soft-warns when staging
  is empty so dev spec runs still complete.
- English tessdata pulled by fetch_tessdata() from
  tesseract-ocr/tessdata_best (eng.traineddata, ~16 MB). Cached at
  build/vendor/tessdata/.
- .github/workflows/build.yml: actions/cache@v4 step keyed on
  ``tesseract-${{ runner.os }}-5.5.0-tessdata_best-v1`` caches the
  staging dir and the vendored tessdata across runs; apt installs
  patchelf on the Linux runner; PyInstaller step now receives the
  DATATOOLS_TESS_STAGING env var.
- .gitignore: build/_tesseract/ and the .traineddata blob.
- TESSERACT_SKIP_FETCH=1 honored for offline / manual stages.
- Installer / .dmg / .zip / AppImage scripts: one-line comments
  confirming Tesseract rides along automatically via PyInstaller's
  datas (no extra packaging steps required in those scripts).
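
As a rough sketch of the spec-side wiring described above, the helper below builds PyInstaller-style (source, destination directory) tuples and only warns when the staging directory is missing or empty. The helper name, its parameters, and the warning text are hypothetical; only the env var name, the license file name, and the target layout come from this message.

```python
import sys
from pathlib import Path


def tesseract_datas(staging: str, tessdata: str, license_file: str) -> list[tuple[str, str]]:
    """Build (source, dest_dir) entries for PyInstaller's ``datas`` list.

    Returns an empty list, with a warning on stderr, when the staging
    directory is unset, missing, or empty, so the spec can still run.
    """
    staging_ok = (
        bool(staging)
        and Path(staging).is_dir()
        and any(Path(staging).iterdir())
    )
    if not staging_ok:
        print(
            "WARNING: Tesseract staging dir is empty; building without OCR",
            file=sys.stderr,
        )
        return []
    return [
        (staging, "tesseract"),            # binary + shared libraries
        (tessdata, "tesseract/tessdata"),  # language data
        (license_file, "."),               # attribution at the bundle root
    ]
```

Inside the spec this might be called as ``tesseract_datas(os.environ.get("DATATOOLS_TESS_STAGING", ""), "build/vendor/tessdata", "LICENSE_TESSERACT.txt")``, with the two paths being assumptions about the repository layout.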

Bundle-size delta: ~50-70 MB on disk per platform, ~25-40 MB
post-compression. Net installer size ~250-300 MB (was ~120 MB), an
accepted tradeoff for zero end-user OCR setup.

Reversal of the prior "don't bundle Tesseract" decision (option A).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
62
build/vendor/README.md
vendored
Normal file
@@ -0,0 +1,62 @@
# build/vendor/ — third-party bundle inputs (fetched at build time)

This tree holds the third-party assets that get bundled into the
PyInstaller artifacts but that we deliberately do **not** keep in git
(too large / license-encumbered / re-fetchable on demand).

The build pipeline (`build/make_release.py`) populates everything in
here before the PyInstaller step. The contents are git-ignored except
for this README.

## tessdata/

Holds the Tesseract language data file(s) used by the PDF Extractor
OCR fallback. Only English is bundled today.

### Canonical source

We use the **"best" model** from `tesseract-ocr/tessdata_best` (LSTM,
slower but higher accuracy than the legacy `tessdata` set, and only
~12 MB compressed → ~16 MB uncompressed):

```
https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata
```

There is also `tessdata_fast/` (~4 MB, lower accuracy) if you ever
want to optimise for bundle size over recognition quality. For bank
statements (the only OCR use case so far), the extra accuracy of the
`_best` model is worth the additional ~12 MB.

### Why we don't vendor it in git

* ~16 MB binary file: bloats clone times for everyone, including
  contributors who never touch the OCR code path.
* Apache-2.0-licensed and stable; the file rarely changes upstream
  (last touched 2021), so a build-time fetch is safe.
* The Tesseract project explicitly distributes these via GitHub
  raw URLs; they're meant to be downloaded, not redistributed
  through other repos.

### How it gets populated

`build/make_release.py::fetch_tessdata()` checks for
`build/vendor/tessdata/eng.traineddata` on every run. If it's
missing, the script downloads it from the canonical URL above and
caches it here. Subsequent builds reuse the cached file.

On CI, the directory is restored from the GitHub Actions cache so we
don't pay the download cost on every run (`.github/workflows/build.yml`
caches `build/vendor/tessdata/` under a key built from the runner OS,
the Tesseract version, and the tessdata variant).
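
The check-then-download behaviour described in this section can be sketched as follows. This is a minimal illustration, not the code in `build/make_release.py`: the signature, the injectable `download` parameter, the `.part` temporary file, and the `_download` helper are all assumptions made for the example.

```python
import shutil
import urllib.request
from pathlib import Path

TESSDATA_URL = (
    "https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata"
)


def _download(url: str, path: Path) -> None:
    """Stream the URL to disk (hypothetical helper)."""
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)


def fetch_tessdata(dest_dir="build/vendor/tessdata", download=_download) -> Path:
    """Return the cached model, downloading it only when it is missing."""
    target = Path(dest_dir) / "eng.traineddata"
    if target.is_file() and target.stat().st_size > 0:
        return target  # cache hit: nothing to fetch

    target.parent.mkdir(parents=True, exist_ok=True)
    # Download to a temporary name so an interrupted fetch never leaves
    # a truncated eng.traineddata behind.
    partial = target.with_name(target.name + ".part")
    download(TESSDATA_URL, partial)
    if partial.stat().st_size == 0:
        partial.unlink()
        raise RuntimeError(f"empty download from {TESSDATA_URL}")
    partial.replace(target)
    return target
```

Passing the downloader as a parameter keeps the cache logic testable without network access; a caller that only wants the default behaviour can simply run `fetch_tessdata()`.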

## Manual one-time fetch (if the build machine is offline or behind a proxy)

```bash
mkdir -p build/vendor/tessdata
# -f makes curl fail on HTTP errors instead of saving the error page
curl -fL -o build/vendor/tessdata/eng.traineddata \
  https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata
```

Verify the file is non-empty and starts with the magic bytes
`b"\x00\x00\x00\x00"` followed by a header that `pytesseract` can
read; the script does a basic sanity check after download.
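
A manual version of that sanity check might look like the sketch below. The function name and return convention are hypothetical, and the expected prefix is copied from the sentence above rather than independently verified against the traineddata format, so treat a header mismatch as a prompt to look closer and not as proof of a bad file.

```python
from pathlib import Path

# Prefix as stated in this README; an assumption, not verified here.
EXPECTED_PREFIX = b"\x00\x00\x00\x00"


def traineddata_problem(path, prefix: bytes = EXPECTED_PREFIX):
    """Return a short description of the problem, or None if the file looks fine."""
    file_path = Path(path)
    if not file_path.is_file():
        return "missing file"
    with file_path.open("rb") as handle:
        head = handle.read(len(prefix))
    if not head:
        return "empty file"
    if head != prefix:
        # Typical cause: an HTML error page saved in place of the model.
        return "unexpected header"
    return None
```

For example, `traineddata_problem("build/vendor/tessdata/eng.traineddata")` returns `None` when the file passes both checks.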
0
build/vendor/tessdata/.gitkeep
vendored
Normal file