End users no longer have to install Tesseract separately for OCR on
scanned PDFs — the engine ships inside the installer, portable .zip,
and AppImage for all three platforms.
Per-platform fetch in build/make_release.py (run before PyInstaller):
- Windows: download UB-Mannheim installer 5.5.0.20241111, extract
  with 7-Zip, copy tesseract.exe + required DLLs into the staging dir.
- macOS: ``brew install tesseract``, copy binary + every
  Homebrew-prefixed dylib resolved via ``otool -L`` (recurse one level
  for transitive deps), then install_name_tool rewrites IDs / load
  paths to @loader_path/... so the bundle is relocatable.
- Linux: ``apt-get install tesseract-ocr libtesseract5``, copy binary
  + every non-system .so from ldd output, then
  ``patchelf --set-rpath '$ORIGIN'``.
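A minimal sketch of the Linux selection step described above. It assumes "non-system" means anything other than the vDSO, the dynamic loader, and the core glibc libraries; that exclusion list and the ``non_system_libs`` helper name are illustrative and not taken from make_release.py.

```python
import re
from typing import List

# Assumption: these come from the host and are never copied into the bundle.
HOST_ONLY_PREFIXES = (
    "linux-vdso", "ld-linux", "libc.", "libm.", "libdl.", "libpthread.", "librt.",
)

# Matches resolved entries such as:
#   libfoo.so.1 => /usr/lib/libfoo.so.1 (0x00007f...)
_LDD_LINE = re.compile(
    r"^\s*(?P<name>\S+)\s+=>\s+(?P<path>/\S+)\s+\(0x[0-9a-fA-F]+\)\s*$"
)


def non_system_libs(ldd_output: str) -> List[str]:
    """Return resolved library paths from `ldd` output, skipping host-only ones."""
    paths = []
    for line in ldd_output.splitlines():
        match = _LDD_LINE.match(line)
        if match is None:
            # vDSO and loader lines have no "=>"; "not found" has no path.
            continue
        if match.group("name").startswith(HOST_ONLY_PREFIXES):
            continue
        paths.append(match.group("path"))
    return paths
```

Each returned path would then be copied next to the staged tesseract binary before the patchelf step runs.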
Wire-up:
- build/datatools.spec reads DATATOOLS_TESS_STAGING env var (set by
  make_release) and adds the staging dir + tessdata + the
  LICENSE_TESSERACT.txt Apache 2.0 attribution to PyInstaller datas
  so they land at <bundle>/tesseract/{tesseract[.exe],tessdata/}
  and the license sits at the bundle root. Soft-warns when staging
  is empty so dev spec runs still complete.
- English tessdata pulled by fetch_tessdata() from
  tesseract-ocr/tessdata_best (eng.traineddata, ~16 MB). Cached at
  build/vendor/tessdata/.
- .github/workflows/build.yml: actions/cache@v4 step keyed on
  ``tesseract-${{ runner.os }}-5.5.0-tessdata_best-v1`` caches the
  staging dir and the vendored tessdata across runs; apt installs
  patchelf on the Linux runner; PyInstaller step now receives the
  DATATOOLS_TESS_STAGING env var.
- .gitignore: build/_tesseract/ and the .traineddata blob.
- TESSERACT_SKIP_FETCH=1 honored for offline / manual stages.
- Installer / .dmg / .zip / AppImage scripts: one-line comments
  confirming Tesseract rides along automatically via PyInstaller's
  datas (no extra packaging steps required in those scripts).
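A rough sketch of how the spec-side wiring could look, assuming PyInstaller's usual ``(source, dest_dir)`` tuples for datas. The ``tesseract_datas`` helper, its parameters, and the warning text are hypothetical; only the env var name, the destination layout, and the license file name come from the notes above.

```python
import os
import warnings
from pathlib import Path
from typing import List, Optional, Tuple


def tesseract_datas(
    staging: Optional[str],
    tessdata_dir: str = "build/vendor/tessdata",
    license_file: str = "LICENSE_TESSERACT.txt",
) -> List[Tuple[str, str]]:
    """Build (source, dest_dir) pairs for the staged engine, or [] if absent."""
    staging_path = Path(staging) if staging else None
    if (
        staging_path is None
        or not staging_path.is_dir()
        or not any(staging_path.iterdir())
    ):
        # Soft warning only, so a plain dev run of the spec still completes.
        warnings.warn("Tesseract staging is empty; building without bundled OCR")
        return []
    datas = [(str(staging_path), "tesseract")]  # binary + shared libraries
    if Path(tessdata_dir).is_dir():
        datas.append((tessdata_dir, "tesseract/tessdata"))
    if Path(license_file).is_file():
        datas.append((license_file, "."))  # license at the bundle root
    return datas
```

In the spec this would be called as ``tesseract_datas(os.environ.get("DATATOOLS_TESS_STAGING"))`` and the result appended to the Analysis datas list.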
Bundle-size delta: ~50-70 MB on disk per platform, ~25-40 MB post-
compression. Net installer size ~250-300 MB (was ~120 MB) — accepted
tradeoff for zero end-user OCR setup.
Reversal of the prior "don't bundle Tesseract" decision (option A).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
68 lines
2.2 KiB
Bash
Executable File
#!/usr/bin/env bash
# Wrap dist/DataTools/ (PyInstaller folder mode) into a distributable
# AppImage.
#
# Usage:
#   bash build/appimage/build.sh <version>
#
# Requires ``appimagetool`` on PATH (CI installs it; locally grab the
# latest release from https://github.com/AppImage/AppImageKit/releases).
#
# Output: dist/DataTools-<version>-linux-x86_64.AppImage
#
# Tesseract bundling: no-op here. The PyInstaller bundle in
# dist/DataTools/ already contains tesseract/{tesseract, *.so,
# tessdata/eng.traineddata} from the spec's datas; ``cp -R``
# below carries it along into the AppDir.

set -euo pipefail

VERSION="${1:-0.0.0-dev}"
DIST="dist/DataTools"
OUT="dist/DataTools-${VERSION}-linux-x86_64.AppImage"

if [[ ! -d "$DIST" ]]; then
    echo "Error: $DIST not found. Run pyinstaller build/datatools.spec first." >&2
    exit 1
fi

if ! command -v appimagetool >/dev/null 2>&1; then
    echo "Error: appimagetool not on PATH. See build/appimage/build.sh header." >&2
    exit 1
fi

# Lay out the AppDir.
APPDIR="$(mktemp -d)/DataTools.AppDir"
trap 'rm -rf "$(dirname -- "$APPDIR")"' EXIT
mkdir -p "$APPDIR/usr/bin"

cp -R "$DIST" "$APPDIR/usr/bin/"
cp build/appimage/AppRun "$APPDIR/AppRun"
chmod +x "$APPDIR/AppRun"
cp build/appimage/datatools.desktop "$APPDIR/datatools.desktop"

# Exported before the icon step: the Python heredoc below reads APPDIR
# from os.environ and fails with KeyError if it is only a shell variable.
export APPDIR

# Icon. AppImage requires a top-level <appname>.png next to the
# .desktop. Use the build/icon.png if present, otherwise generate a
# blank placeholder so the build doesn't fail on a fresh checkout.
if [[ -f build/icon.png ]]; then
    cp build/icon.png "$APPDIR/datatools.png"
else
    # 256x256 single-colour PNG written by the inline Python below;
    # appimagetool needs *some* icon present. Replace with a real
    # 1024x1024 PNG before launch.
    python3 - <<'PY'
import struct, zlib, os
def chunk(t, d): return struct.pack(">I", len(d)) + t + d + struct.pack(">I", zlib.crc32(t + d) & 0xffffffff)
W = H = 256
ihdr = struct.pack(">IIBBBBB", W, H, 8, 2, 0, 0, 0)  # 8-bit RGB
raw = b"".join(b"\x00" + b"\x16\x19\x22" * W for _ in range(H))  # filter byte + dark pixels
idat = zlib.compress(raw, 9)
png = b"\x89PNG\r\n\x1a\n" + chunk(b"IHDR", ihdr) + chunk(b"IDAT", idat) + chunk(b"IEND", b"")
out = os.environ["APPDIR"] + "/datatools.png"
with open(out, "wb") as f:
    f.write(png)
PY
fi

ARCH=x86_64 appimagetool "$APPDIR" "$OUT"
echo "Built $OUT"