datatools-dev/build/macos/build_dmg.sh
Michael 93ccada974 build: bundle Tesseract 5.5.0 + tessdata into every release artifact
End users no longer have to install Tesseract separately for OCR on
scanned PDFs — the engine ships inside the installer, portable .zip,
and AppImage for all three platforms.

Per-platform fetch in build/make_release.py (run before PyInstaller):
- Windows: download UB-Mannheim installer 5.5.0.20241111, extract
  with 7-Zip, copy tesseract.exe + required DLLs into the staging dir.
- macOS: ``brew install tesseract``, copy binary + every Homebrew-
  prefixed dylib resolved via otool -L (recurse one level for
  transitive deps), then install_name_tool rewrites IDs / load paths
  to @loader_path/... so the bundle is relocatable.
- Linux: ``apt-get install tesseract-ocr libtesseract5``, copy binary
  + every non-system .so from ldd output, patchelf --set-rpath '$ORIGIN'.
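
For illustration only, the Linux step could look roughly like this in
shell. The real logic lives in build/make_release.py (Python) and is not
shown in this commit view, so the function names, the denylist that
approximates "non-system", and the choice to patch every copied library
as well as the binary are all assumptions.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Read `ldd` output on stdin and print the resolved path of every library
# that is not on the core-runtime denylist below.
# ASSUMPTION: "non-system" is approximated by library name; the actual
# rule used by make_release.py may differ.
non_system_libs() {
  awk '
    $2 == "=>" && $3 ~ /^\// {
      n = split($3, parts, "/")
      name = parts[n]
      if (name ~ /^(libc|libm|libdl|libpthread|librt|libgcc_s|libstdc\+\+)\.so/) next
      if (name ~ /^ld-linux/) next
      print $3
    }
  '
}

# Hypothetical helper: copy a binary and its filtered libraries into one
# directory, then set the rpath of each copied file to that directory.
stage_linux_binary() {
  local bin="$1" dest="$2" lib
  mkdir -p "$dest"
  cp "$bin" "$dest/"
  patchelf --set-rpath '$ORIGIN' "$dest/$(basename "$bin")"
  ldd "$bin" | non_system_libs | while read -r lib; do
    cp -L "$lib" "$dest/"
    patchelf --set-rpath '$ORIGIN' "$dest/$(basename "$lib")"
  done
}
```

Usage would be along the lines of
``stage_linux_binary /usr/bin/tesseract build/_tesseract``. Patching the
copied libraries too is a cautious choice, since a RUNPATH on the binary
alone does not cover the libraries' own dependencies.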

Wire-up:
- build/datatools.spec reads DATATOOLS_TESS_STAGING env var (set by
  make_release) and adds the staging dir + tessdata + the
  LICENSE_TESSERACT.txt Apache 2.0 attribution to PyInstaller datas
  so they land at <bundle>/tesseract/{tesseract[.exe],tessdata/}
  and the license sits at the bundle root. Soft-warns when staging
  is empty so dev spec runs still complete.
- English tessdata pulled by fetch_tessdata() from
  tesseract-ocr/tessdata_best (eng.traineddata, ~16 MB). Cached at
  build/vendor/tessdata/.
- .github/workflows/build.yml: actions/cache@v4 step keyed on
  ``tesseract-${{ runner.os }}-5.5.0-tessdata_best-v1`` caches the
  staging dir and the vendored tessdata across runs; apt installs
  patchelf on the Linux runner; PyInstaller step now receives the
  DATATOOLS_TESS_STAGING env var.
- .gitignore: build/_tesseract/ and the .traineddata blob.
- TESSERACT_SKIP_FETCH=1 honored for offline / manual stages.
- Installer / .dmg / .zip / AppImage scripts: one-line comments
  confirming Tesseract rides along automatically via PyInstaller's
  datas (no extra packaging steps required in those scripts).
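
As a rough shell rendering of the tessdata caching and the
TESSERACT_SKIP_FETCH behaviour described above: fetch_tessdata() itself is
Python and is not shown here, so the function name ``fetch_tessdata_sketch``,
the download URL, and the ``.part`` temp-file convention are assumptions,
not the project's actual code.

```shell
#!/usr/bin/env bash
set -euo pipefail

# ASSUMPTION: raw-file URL layout of the tesseract-ocr/tessdata_best repo.
TESSDATA_URL="${TESSDATA_URL:-https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata}"

# Hypothetical shell counterpart of fetch_tessdata().
# $1: cache directory (defaults to the path named in the commit message).
fetch_tessdata_sketch() {
  local cache_dir="${1:-build/vendor/tessdata}"
  local target="$cache_dir/eng.traineddata"

  # A non-empty cached file wins; no network access needed.
  if [[ -s "$target" ]]; then
    echo "cached: $target"
    return 0
  fi

  # Offline / manual staging: leave the cache untouched.
  if [[ "${TESSERACT_SKIP_FETCH:-0}" == "1" ]]; then
    echo "TESSERACT_SKIP_FETCH=1: not downloading $target" >&2
    return 0
  fi

  # Download to a temp name first so an interrupted transfer never
  # leaves a truncated eng.traineddata in the cache.
  mkdir -p "$cache_dir"
  curl --fail --location --silent --show-error \
    --output "$target.part" "$TESSDATA_URL"
  mv "$target.part" "$target"
  echo "fetched: $target"
}
```

Checking the cache before the skip flag means a pre-populated
build/vendor/tessdata/ is reported the same way whether or not the flag
is set.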

Bundle-size delta: ~50-70 MB on disk per platform, ~25-40 MB post-
compression. Net installer size ~250-300 MB (was ~120 MB) — accepted
tradeoff for zero end-user OCR setup.

Reversal of the prior "don't bundle Tesseract" decision (option A).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:20:33 +00:00

#!/usr/bin/env bash
# Wrap dist/DataTools.app into a distributable .dmg.
#
# Usage:
# bash build/macos/build_dmg.sh <version>
#
# Run after ``pyinstaller build/datatools.spec --clean --noconfirm``
# has produced ``dist/DataTools.app``. The output DMG goes to
# ``dist/DataTools-<version>-mac.dmg``.
#
# Code signing + notarization happen separately (see build/README.md
# "Signing"). This script only handles the packaging step.
#
# Tesseract bundling: no-op here. The .app already contains
# Contents/Resources/tesseract/{tesseract, *.dylib, tessdata/} thanks
# to PyInstaller's BUNDLE() carrying the spec's datas through. This
# script just wraps the finished .app — no extra steps for OCR.

set -euo pipefail

VERSION="${1:-0.0.0-dev}"
APP="dist/DataTools.app"
DMG="dist/DataTools-${VERSION}-mac.dmg"

if [[ ! -d "$APP" ]]; then
    echo "Error: $APP not found. Run pyinstaller build/datatools.spec first." >&2
    exit 1
fi

# Drag-target convenience: a /Applications symlink inside the DMG so
# the buyer can drag the app icon to it without leaving the DMG.
STAGE="$(mktemp -d)"
trap 'rm -rf "$STAGE"' EXIT
cp -R "$APP" "$STAGE/"
ln -s /Applications "$STAGE/Applications"

# UDZO = compressed read-only DMG, the standard distribution format.
hdiutil create \
    -volname "DataTools" \
    -srcfolder "$STAGE" \
    -ov \
    -format UDZO \
    "$DMG"

echo "Built $DMG"
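
Because the script treats Tesseract bundling as a no-op, a missing OCR
payload would only surface after the DMG ships. A minimal preflight
sketch, not part of the script above, could check the layout named in the
header comment before packaging. The function name
``check_tesseract_bundle`` is hypothetical, and the exact file list
(the ``tesseract`` binary plus ``tessdata/eng.traineddata``) is an
assumption drawn from the commit message.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Return 0 when the .app at $1 carries the Tesseract payload, 1 otherwise.
# ASSUMPTION: layout is Contents/Resources/tesseract/{tesseract,tessdata/}.
check_tesseract_bundle() {
  local app="$1"
  local tess_dir="$app/Contents/Resources/tesseract"
  local missing=0

  if [[ ! -x "$tess_dir/tesseract" ]]; then
    echo "Missing or non-executable: $tess_dir/tesseract" >&2
    missing=1
  fi

  if [[ ! -f "$tess_dir/tessdata/eng.traineddata" ]]; then
    echo "Missing: $tess_dir/tessdata/eng.traineddata" >&2
    missing=1
  fi

  return "$missing"
}
```

It could be called as ``check_tesseract_bundle "$APP" || exit 1`` just
after the existing ``$APP`` directory check, so a dev build with an empty
staging dir fails loudly instead of producing an OCR-less DMG.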