End users no longer have to install Tesseract separately for OCR on
scanned PDFs: the engine ships inside the installer, portable .zip,
and AppImage for all three platforms.

Per-platform fetch in build/make_release.py (run before PyInstaller):
- Windows: download UB-Mannheim installer 5.5.0.20241111, extract
  with 7-Zip, copy tesseract.exe + required DLLs into the staging dir.
- macOS: ``brew install tesseract``, copy binary + every Homebrew-
  prefixed dylib resolved via otool -L (recurse one level for
  transitive deps), then install_name_tool rewrites IDs / load paths
  to @loader_path/... so the bundle is relocatable.
- Linux: ``apt-get install tesseract-ocr libtesseract5``, copy binary
  + every non-system .so from ldd output, patchelf --set-rpath '$ORIGIN'.
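As an illustration of the Linux step, the "every non-system .so from ldd output" filter could look roughly like the sketch below. The function name ``bundleable_libs`` and the denylist of core libraries are assumptions made for this example, not taken from build/make_release.py; the real script may decide what counts as "system" differently.

```python
import re

# Hypothetical denylist: core libraries assumed to be left to the host system.
SYSTEM_LIB_PREFIXES = (
    "linux-vdso", "ld-linux", "libc.so", "libm.so",
    "libdl.so", "libpthread.so", "librt.so",
)

# Matches resolved entries such as:
#   libtesseract.so.5 => /usr/lib/x86_64-linux-gnu/libtesseract.so.5 (0x00007f...)
_LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(/\S+)\s+\(0x[0-9a-fA-F]+\)\s*$")


def bundleable_libs(ldd_output: str) -> list[str]:
    """Return resolved library paths from ldd output, minus the denylist."""
    paths = []
    for line in ldd_output.splitlines():
        match = _LDD_LINE.match(line)
        if match is None:
            # Skips the vdso line, the dynamic loader line, and "not found" entries.
            continue
        soname, path = match.groups()
        if soname.startswith(SYSTEM_LIB_PREFIXES):
            continue
        paths.append(path)
    return paths
```

The returned paths would then be copied next to the staged binary before ``patchelf --set-rpath '$ORIGIN'`` is applied. A macOS variant would parse ``otool -L`` output in the same spirit.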

Wire-up:
- build/datatools.spec reads the DATATOOLS_TESS_STAGING env var (set by
  make_release) and adds the staging dir + tessdata + the
  LICENSE_TESSERACT.txt Apache 2.0 attribution to PyInstaller datas
  so they land at <bundle>/tesseract/{tesseract[.exe],tessdata/}
  and the license sits at the bundle root. Soft-warns when staging
  is empty so dev spec runs still complete.
- English tessdata pulled by fetch_tessdata() from
  tesseract-ocr/tessdata_best (eng.traineddata, ~16 MB). Cached at
  build/vendor/tessdata/.
- .github/workflows/build.yml: actions/cache@v4 step keyed on
  ``tesseract-${{ runner.os }}-5.5.0-tessdata_best-v1`` caches the
  staging dir and the vendored tessdata across runs; apt installs
  patchelf on the Linux runner; PyInstaller step now receives the
  DATATOOLS_TESS_STAGING env var.
- .gitignore: build/_tesseract/ and the .traineddata blob.
- TESSERACT_SKIP_FETCH=1 honored for offline / manual stages.
- Installer / .dmg / .zip / AppImage scripts: one-line comments
  confirming Tesseract rides along automatically via PyInstaller's
  datas (no extra packaging steps required in those scripts).
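For reference, the spec-side wiring from the first bullet might be sketched as follows. The helper name ``tesseract_datas`` and its signature are hypothetical; only the env var name, the destination layout, and the soft-warning behavior come from the description above. The actual code in build/datatools.spec may be organized differently.

```python
import os
import warnings
from pathlib import Path


def tesseract_datas(staging, tessdata_dir, license_file):
    """Build PyInstaller-style (source, dest_dir) pairs for the Tesseract bundle.

    Returns an empty list, with a warning, when the staging dir is unset,
    missing, or empty, so a dev spec run can still complete.
    """
    # Check for an empty string first: Path("") would resolve to the cwd.
    if not staging or not Path(staging).is_dir() or not any(Path(staging).iterdir()):
        warnings.warn("Tesseract staging dir is missing or empty; skipping OCR bundle")
        return []
    return [
        (str(staging), "tesseract"),
        (str(tessdata_dir), os.path.join("tesseract", "tessdata")),
        (str(license_file), "."),  # license lands at the bundle root
    ]


# Assumed usage inside a spec file:
# datas = tesseract_datas(
#     os.environ.get("DATATOOLS_TESS_STAGING", ""),
#     "build/vendor/tessdata",
#     "LICENSE_TESSERACT.txt",
# )
```

Returning an empty list rather than raising keeps the failure mode soft: a local build without staged binaries still produces an app, just without bundled OCR.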

Bundle-size delta: ~50-70 MB on disk per platform, ~25-40 MB post-
compression. Net installer size ~250-300 MB (was ~120 MB), an accepted
tradeoff for zero end-user OCR setup.

Reversal of the prior "don't bundle Tesseract" decision (option A).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
44 lines
1.6 KiB
Bash
Executable File
#!/usr/bin/env bash
# Wrap dist/DataTools.app into a no-install portable .zip.
#
# Usage:
#   bash build/macos/build_zip.sh <version>
#
# Why a portable .zip in addition to the .dmg:
#   * Buyers who don't want an installer can unzip and double-click the
#     .app directly: no drag-to-/Applications step, no installer
#     chrome. Self-contained: the .app holds Python + every dep.
#   * IT-locked-down machines often block .dmg auto-mount but allow
#     .zip download + extraction.
#
# Run after ``pyinstaller build/datatools.spec --clean --noconfirm``
# has produced ``dist/DataTools.app``. Output goes to
# ``dist/DataTools-<version>-mac-portable.zip``.
#
# Tesseract bundling: no-op here. The bundled Tesseract binary +
# dylibs + tessdata are already inside DataTools.app/Contents/Resources/tesseract/
# (placed by PyInstaller's BUNDLE/datas mechanism). ``ditto -c -k``
# preserves the whole .app tree.

set -euo pipefail

VERSION="${1:-0.0.0-dev}"
APP="dist/DataTools.app"
ZIP="dist/DataTools-${VERSION}-mac-portable.zip"

if [[ ! -d "$APP" ]]; then
  echo "Error: $APP not found. Run pyinstaller build/datatools.spec first." >&2
  exit 1
fi

# ``ditto`` preserves the .app bundle's extended attributes and
# resource forks (a plain ``zip`` strips them and can break code
# signatures + Info.plist resolution on the buyer's machine).
#
# --sequesterRsrc keeps the AppleDouble metadata inside the archive
# rather than as parallel ._ files on disk after extraction.
rm -f "$ZIP"
ditto -c -k --sequesterRsrc --keepParent "$APP" "$ZIP"

echo "Built $ZIP ($(du -h "$ZIP" | cut -f1))"
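
Since this script relies on Tesseract already being inside the .app, a post-build check could confirm that the archive really contains it. The following is a minimal sketch in Python, not part of the repository: the function name ``bundled_tesseract_entries`` is hypothetical, and the marker path is taken from the header comment of the script.

```python
import zipfile


def bundled_tesseract_entries(zip_path, marker="Contents/Resources/tesseract/"):
    """List archive members under the bundled Tesseract directory.

    Entries under __MACOSX/ are ignored: that is where ditto's
    --sequesterRsrc option places the sequestered metadata.
    """
    with zipfile.ZipFile(zip_path) as archive:
        return [
            name
            for name in archive.namelist()
            if marker in name and not name.startswith("__MACOSX/")
        ]
```

An empty result for a freshly built ``dist/DataTools-<version>-mac-portable.zip`` would suggest the PyInstaller step ran without a populated staging dir.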