datatools-dev/build/vendor/README.md
Michael 93ccada974 build: bundle Tesseract 5.5.0 + tessdata into every release artifact
End users no longer have to install Tesseract separately for OCR on
scanned PDFs — the engine ships inside the installer, portable .zip,
and AppImage for all three platforms.

Per-platform fetch in build/make_release.py (run before PyInstaller):
- Windows: download UB-Mannheim installer 5.5.0.20241111, extract
  with 7-Zip, copy tesseract.exe + required DLLs into the staging dir.
- macOS: ``brew install tesseract``, copy binary + every Homebrew-
  prefixed dylib resolved via otool -L (recurse one level for
  transitive deps), then install_name_tool rewrites IDs / load paths
  to @loader_path/... so the bundle is relocatable.
- Linux: ``apt-get install tesseract-ocr libtesseract5``, copy binary
  + every non-system .so from ldd output, patchelf --set-rpath '$ORIGIN'.

Wire-up:
- build/datatools.spec reads DATATOOLS_TESS_STAGING env var (set by
  make_release) and adds the staging dir + tessdata + the
  LICENSE_TESSERACT.txt Apache 2.0 attribution to PyInstaller datas
  so they land at <bundle>/tesseract/{tesseract[.exe],tessdata/}
  and the license sits at the bundle root. Soft-warns when staging
  is empty so dev spec runs still complete.
- English tessdata pulled by fetch_tessdata() from
  tesseract-ocr/tessdata_best (eng.traineddata, ~16 MB). Cached at
  build/vendor/tessdata/.
- .github/workflows/build.yml: actions/cache@v4 step keyed on
  ``tesseract-${{ runner.os }}-5.5.0-tessdata_best-v1`` caches the
  staging dir and the vendored tessdata across runs; apt installs
  patchelf on the Linux runner; PyInstaller step now receives the
  DATATOOLS_TESS_STAGING env var.
- .gitignore: build/_tesseract/ and the .traineddata blob.
- TESSERACT_SKIP_FETCH=1 honored for offline / manual stages.
- Installer / .dmg / .zip / AppImage scripts: one-line comments
  confirming Tesseract rides along automatically via PyInstaller's
  datas (no extra packaging steps required in those scripts).

Bundle-size delta: ~50-70 MB on disk per platform, ~25-40 MB post-
compression. Net installer size ~250-300 MB (was ~120 MB) — accepted
tradeoff for zero end-user OCR setup.

Reversal of the prior "don't bundle Tesseract" decision (option A).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:20:33 +00:00


# build/vendor/ — third-party bundle inputs (fetched at build time)
This tree holds the third-party assets that get bundled into the
PyInstaller artifacts but that we deliberately do **not** keep in git
(too large / license-encumbered / re-fetchable on demand).

The build pipeline (`build/make_release.py`) populates everything in
here before the PyInstaller step. The contents are git-ignored except
for this README.
## tessdata/
Holds the Tesseract language data file(s) used by the PDF Extractor
OCR fallback. Only English is bundled today.
### Canonical source
We use the **"best" model** from `tesseract-ocr/tessdata_best` (LSTM,
slower but higher accuracy than the legacy `tessdata` set, and only
~12 MB compressed → ~16 MB uncompressed):
```
https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata
```
There is also `tessdata_fast/` (~4 MB, lower accuracy) if you ever
want to optimise for bundle size over recognition quality. For bank
statements (the only OCR use case so far), the extra accuracy of the
`_best` model is worth the additional ~12 MB.
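
If the model variant ever needs to be switchable, the URL pattern above can be parameterised. The following is a minimal sketch, not code from `make_release.py`: the helper name `tessdata_url()` and the `TESSDATA_REPOS` table are hypothetical, and it assumes `tessdata_fast` uses the same `raw/main/<lang>.traineddata` layout as `tessdata_best`.

```python
# Hypothetical helper for illustration; not part of build/make_release.py.
TESSDATA_REPOS = {
    "best": "tesseract-ocr/tessdata_best",  # higher accuracy, ~16 MB for eng
    "fast": "tesseract-ocr/tessdata_fast",  # smaller, ~4 MB for eng
}


def tessdata_url(variant: str = "best", lang: str = "eng") -> str:
    """Build the raw GitHub URL for one traineddata file."""
    try:
        repo = TESSDATA_REPOS[variant]
    except KeyError:
        raise ValueError(f"unknown tessdata variant: {variant!r}") from None
    return f"https://github.com/{repo}/raw/main/{lang}.traineddata"
```

With the defaults, `tessdata_url()` yields the canonical URL listed above, so switching variants would be a one-argument change.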
### Why we don't vendor it in git
* ~16 MB binary file: bloats clone times for everyone, including
  contributors who never touch the OCR code path.
* Apache-2.0-licensed and stable; the file rarely changes upstream
  (last touched 2021), so a build-time fetch is safe.
* The Tesseract project explicitly distributes these via GitHub
  raw URLs; they're meant to be downloaded, not redistributed
  through other repos.
### How it gets populated
`build/make_release.py::fetch_tessdata()` checks for
`build/vendor/tessdata/eng.traineddata` on every run. If it's
missing, the script downloads it from the canonical URL above and
caches it here. Subsequent builds reuse the cached file.

On CI, the directory is restored from the GitHub Actions cache so we
don't pay the download cost on every run (`.github/workflows/build.yml`
caches `build/vendor/tessdata/` under a key that includes the runner
OS, the Tesseract version, and the `tessdata_best` source).
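
The check-then-download behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual body of `fetch_tessdata()`: the `_download` helper, the injectable `download` parameter, and the choice to raise when `TESSERACT_SKIP_FETCH=1` is set and the file is missing are all hypothetical. The relative path assumes the script runs from the repository root.

```python
import os
import urllib.request
from pathlib import Path

TESSDATA_URL = "https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata"
TESSDATA_PATH = Path("build/vendor/tessdata/eng.traineddata")


def _download(url: str, dest: Path) -> None:
    """Stream url into a temp file, then rename, so dest never holds a partial file."""
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as resp, open(tmp, "wb") as out:
        while chunk := resp.read(1 << 20):
            out.write(chunk)
    os.replace(tmp, dest)


def fetch_tessdata(dest: Path = TESSDATA_PATH, url: str = TESSDATA_URL,
                   download=_download) -> Path:
    """Return the cached traineddata path, downloading it only if missing."""
    if dest.is_file() and dest.stat().st_size > 0:
        return dest  # cache hit: reuse the file from an earlier build
    # Assumed semantics: with the skip flag set, a missing file is an error.
    if os.environ.get("TESSERACT_SKIP_FETCH") == "1":
        raise FileNotFoundError(f"{dest} is missing and TESSERACT_SKIP_FETCH=1 is set")
    dest.parent.mkdir(parents=True, exist_ok=True)
    download(url, dest)
    return dest
```

Passing the downloader in as a parameter keeps the cache logic testable without network access; a CI cache restore simply makes the first branch succeed.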
## Manual one-time fetch (if you're offline or behind a proxy)
```bash
mkdir -p build/vendor/tessdata
# -f makes curl exit non-zero on an HTTP error instead of saving the error page
curl -fL -o build/vendor/tessdata/eng.traineddata \
  https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata
```
Verify that the file is non-empty and roughly the expected size
(~16 MB); a much smaller file usually means the download was truncated
or a proxy returned an error page. The script does a basic sanity
check after download.
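
The exact check the script performs is not reproduced here. As a rough sketch of what a manual plausibility check could look like, the hypothetical helper below (`looks_like_traineddata()`, with an assumed 1 MB size floor) rejects missing, tiny, or HTML-looking files. It does not prove that Tesseract can load the model.

```python
from pathlib import Path

# Assumed threshold: the "best" English model is about 16 MB, so anything
# under 1 MB is almost certainly a truncated download or an error page.
MIN_BYTES = 1_000_000


def looks_like_traineddata(path: Path, min_bytes: int = MIN_BYTES) -> bool:
    """Cheap plausibility check; it does not validate the model itself."""
    if not path.is_file() or path.stat().st_size < min_bytes:
        return False
    with open(path, "rb") as fh:
        head = fh.read(64).lstrip().lower()
    # A proxy or captive portal often returns an HTML page instead of the blob.
    return not head.startswith((b"<!doctype", b"<html"))
```

For example, `looks_like_traineddata(Path("build/vendor/tessdata/eng.traineddata"))` should return `True` after a successful manual fetch.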