build: bundle Tesseract 5.5.0 + tessdata into every release artifact
End users no longer have to install Tesseract separately for OCR on
scanned PDFs — the engine ships inside the installer, portable .zip,
and AppImage for all three platforms.
Per-platform fetch in build/make_release.py (run before PyInstaller):
- Windows: download UB-Mannheim installer 5.5.0.20241111, extract
with 7-Zip, copy tesseract.exe + required DLLs into the staging dir.
- macOS: ``brew install tesseract``, copy binary + every Homebrew-
prefixed dylib resolved via otool -L (recurse one level for
transitive deps), then install_name_tool rewrites IDs / load paths
to @loader_path/... so the bundle is relocatable.
- Linux: ``apt-get install tesseract-ocr libtesseract5``, copy binary
+ every non-system .so from ldd output, patchelf --set-rpath '$ORIGIN'.
Wire-up:
- build/datatools.spec reads DATATOOLS_TESS_STAGING env var (set by
make_release) and adds the staging dir + tessdata + the
LICENSE_TESSERACT.txt Apache 2.0 attribution to PyInstaller datas
so they land at <bundle>/tesseract/{tesseract[.exe],tessdata/}
and the license sits at the bundle root. Soft-warns when staging
is empty so dev spec runs still complete.
- English tessdata pulled by fetch_tessdata() from
tesseract-ocr/tessdata_best (eng.traineddata, ~16 MB). Cached at
build/vendor/tessdata/.
- .github/workflows/build.yml: actions/cache@v4 step keyed on
``tesseract-${runner.os}-5.5.0-tessdata_best-v1`` caches the
staging dir and the vendored tessdata across runs; apt installs
patchelf on the Linux runner; PyInstaller step now receives the
DATATOOLS_TESS_STAGING env var.
- .gitignore: build/_tesseract/ and the .traineddata blob.
- TESSERACT_SKIP_FETCH=1 honored for offline / manual stages.
- Installer / .dmg / .zip / AppImage scripts: one-line comments
confirming Tesseract rides along automatically via PyInstaller's
datas (no extra packaging steps required in those scripts).
Bundle-size delta: ~50-70 MB on disk per platform, ~25-40 MB post-
compression. Net installer size ~250-300 MB (was ~120 MB) — accepted
tradeoff for zero end-user OCR setup.
Reversal of the prior "don't bundle Tesseract" decision (option A).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -63,6 +63,14 @@ Name: "desktopicon"; Description: "Create a &desktop shortcut"; GroupDescription
|
||||
Name: "quicklaunchicon"; Description: "Create a &Quick Launch shortcut"; GroupDescription: "Additional shortcuts:"; Flags: unchecked; OnlyBelowVersion: 6.1
|
||||
|
||||
[Files]
|
||||
; PyInstaller's dist/DataTools/ tree includes:
|
||||
; * DataTools.exe + frozen Python runtime
|
||||
; * tesseract/tesseract.exe + DLLs + tessdata/eng.traineddata
|
||||
; (bundled via build/datatools.spec datas; runtime discovery in
|
||||
; src/pdf_extract.py reads sys._MEIPASS / "tesseract" / ...).
|
||||
; * LICENSE_TESSERACT.txt at the bundle root (Apache-2.0).
|
||||
; The recursesubdirs flag below picks all of those up — no separate
|
||||
; Files: entry needed for tesseract/.
|
||||
Source: "..\dist\DataTools\*"; DestDir: "{app}"; Flags: recursesubdirs ignoreversion
|
||||
|
||||
[Icons]
|
||||
|
||||
Reference in New Issue
Block a user