End users no longer have to install Tesseract separately for OCR on
scanned PDFs — the engine ships inside the installer, portable .zip,
and AppImage for all three platforms.
Per-platform fetch in build/make_release.py (run before PyInstaller):
- Windows: download UB-Mannheim installer 5.5.0.20241111, extract
  with 7-Zip, copy tesseract.exe + required DLLs into the staging dir.
- macOS: ``brew install tesseract``, copy binary + every
  Homebrew-prefixed dylib resolved via otool -L (recurse one level for
  transitive deps), then install_name_tool rewrites IDs / load paths
  to @loader_path/... so the bundle is relocatable.
- Linux: ``apt-get install tesseract-ocr libtesseract5``, copy binary
  + every non-system .so from ldd output, patchelf --set-rpath '$ORIGIN'.
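
The Linux step above depends on picking the non-system libraries out of
ldd output. A minimal sketch of that filter follows. The function name,
the sample output, and the list of prefixes treated as "system" are
assumptions for illustration; the actual rules in build/make_release.py
are not shown in this commit.

```python
import re

# Assumption: libraries with these name prefixes are left to the host system.
SYSTEM_LIB_PREFIXES = (
    "linux-vdso", "ld-linux", "libc.so", "libm.so",
    "libdl.so", "libpthread.so", "librt.so",
)

# Matches lines of the form "<soname> => <absolute path> (0x<address>)".
_LDD_LINE = re.compile(r"^\s*(?P<name>\S+)\s+=>\s+(?P<path>/\S+)\s+\(0x[0-9a-f]+\)$")


def bundle_candidates(ldd_output: str) -> dict[str, str]:
    """Map soname -> resolved path for every non-system library ldd lists."""
    found = {}
    for line in ldd_output.splitlines():
        match = _LDD_LINE.match(line)
        if match is None:
            continue  # vdso, the dynamic loader, or "not found" entries
        name = match["name"]
        if name.startswith(SYSTEM_LIB_PREFIXES):
            continue
        found[name] = match["path"]
    return found


# Illustrative ldd output, not captured from a real build.
SAMPLE = """\
\tlinux-vdso.so.1 (0x00007ffd2a5f2000)
\tlibtesseract.so.5 => /lib/x86_64-linux-gnu/libtesseract.so.5 (0x00007f3a1c000000)
\tliblept.so.5 => /lib/x86_64-linux-gnu/liblept.so.5 (0x00007f3a1bc00000)
\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3a1b800000)
\t/lib64/ld-linux-x86-64.so.2 (0x00007f3a1c4d0000)
"""

for soname, path in bundle_candidates(SAMPLE).items():
    print(f"{soname} <- {path}")
```

Each path returned would then be copied next to the staged binary before
patchelf sets the RPATH to '$ORIGIN'.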
Wire-up:
- build/datatools.spec reads DATATOOLS_TESS_STAGING env var (set by
  make_release) and adds the staging dir + tessdata + the
  LICENSE_TESSERACT.txt Apache 2.0 attribution to PyInstaller datas
  so they land at <bundle>/tesseract/{tesseract[.exe],tessdata/}
  and the license sits at the bundle root. Soft-warns when staging
  is empty so dev spec runs still complete.
- English tessdata pulled by fetch_tessdata() from
  tesseract-ocr/tessdata_best (eng.traineddata, ~16 MB). Cached at
  build/vendor/tessdata/.
- .github/workflows/build.yml: actions/cache@v4 step keyed on
  ``tesseract-${{ runner.os }}-5.5.0-tessdata_best-v1`` caches the
  staging dir and the vendored tessdata across runs; apt installs
  patchelf on the Linux runner; PyInstaller step now receives the
  DATATOOLS_TESS_STAGING env var.
- .gitignore: build/_tesseract/ and the .traineddata blob.
- TESSERACT_SKIP_FETCH=1 honored for offline / manual stages.
- Installer / .dmg / .zip / AppImage scripts: one-line comments
  confirming Tesseract rides along automatically via PyInstaller's
  datas (no extra packaging steps required in those scripts).
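
The spec-side contract described in the wire-up list can be sketched as
follows. This is a hedged illustration, not the code in
build/datatools.spec: the function name tesseract_datas and its
parameters are hypothetical, and only the env var name, the destination
layout, and the soft-warning behaviour come from the notes above.
PyInstaller datas entries are (source, destination_dir) tuples.

```python
import os
import sys
from pathlib import Path


def tesseract_datas(staging_env="DATATOOLS_TESS_STAGING",
                    tessdata_dir="build/vendor/tessdata",
                    license_file="LICENSE_TESSERACT.txt"):
    """Return PyInstaller-style (source, dest_dir) tuples for the Tesseract payload."""
    staging = os.environ.get(staging_env, "")
    staging_path = Path(staging)
    staged = sorted(staging_path.iterdir()) if staging and staging_path.is_dir() else []
    if not staged:
        # Soft warning only: a developer build without a staged engine still completes.
        print(f"warning: {staging_env} is unset or empty; building without Tesseract",
              file=sys.stderr)
        return []

    # Binary and shared libraries land in <bundle>/tesseract/.
    # Simplification: subdirectories of the staging dir are ignored here.
    datas = [(str(p), "tesseract") for p in staged if p.is_file()]
    # Language data lands in <bundle>/tesseract/tessdata/.
    datas += [(str(p), "tesseract/tessdata")
              for p in sorted(Path(tessdata_dir).glob("*.traineddata"))]
    # The license attribution sits at the bundle root.
    if Path(license_file).is_file():
        datas.append((license_file, "."))
    return datas
```

In a spec file the result would be appended to the datas argument of
Analysis(...); returning an empty list instead of raising is what keeps
dev spec runs working when nothing has been staged.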
Bundle-size delta: ~50-70 MB on disk per platform, ~25-40 MB post-
compression. Net installer size ~250-300 MB (was ~120 MB) — accepted
tradeoff for zero end-user OCR setup.
Reversal of the prior "don't bundle Tesseract" decision (option A).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
197 lines
7.2 KiB
YAML
name: Build installers

# Triggers:
#   * Tag push (v*) → produces installers + portable zips, attaches them
#     to a GitHub Release.
#   * Manual dispatch → uploads everything as workflow artifacts only.
#
# Outputs per platform (downloadable by buyers):
#   * macOS: .dmg installer + portable .zip (signed .app inside).
#   * Windows: .exe installer + portable .zip (no-install).
#   * Linux: .AppImage (already portable; no separate zip).
#
# Self-contained: every artifact ships its own Python interpreter + every
# runtime dep through PyInstaller. No pre/post install steps on the
# buyer's machine.
#
# What this workflow doesn't do (yet):
#   * Code signing (Mac Developer ID, Windows code-signing cert).
#     Those need GitHub Secrets the owner sets up first. See
#     build/README.md "Signing" for the secret names this workflow
#     will read once they exist.
#   * Auto-update endpoint generation. v1 distributes via Gumroad;
#     buyers re-download for updates.

on:
  workflow_dispatch:
  push:
    tags:
      - 'v*'

permissions:
  contents: write  # needed to create the release on tag push

jobs:
  build:
    name: Build (${{ matrix.os }})
    strategy:
      fail-fast: false
      matrix:
        include:
          - os: macos-latest
            platform: mac
            installer_glob: dist/DataTools-*-mac.dmg
            portable_glob: dist/DataTools-*-mac-portable.zip
          - os: windows-latest
            platform: win
            installer_glob: dist/DataTools-*-win-setup.exe
            portable_glob: dist/DataTools-*-win-portable.zip
          - os: ubuntu-latest
            platform: linux
            installer_glob: dist/DataTools-*-linux-x86_64.AppImage
            portable_glob: ''  # AppImage is already a portable single file
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
          cache: pip

      - name: Install build deps
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt
          pip install pyinstaller pillow

      # ---- Tesseract bundling cache --------------------------------
      # The fetch logic inside build/make_release.py downloads:
      #   * build/vendor/tessdata/eng.traineddata (~16 MB, shared)
      #   * build/_tesseract/<platform>/ (binary + libs, 30-120 MB)
      # Cache both so iterative CI runs don't re-download. The
      # cache key bakes in the pinned Tesseract version + tessdata
      # URL so a version bump invalidates automatically.
      - name: Cache Tesseract bundle inputs
        uses: actions/cache@v4
        with:
          path: |
            build/_tesseract
            build/vendor/tessdata
          key: tesseract-${{ runner.os }}-5.5.0-tessdata_best-v1

      # ---- Linux: install patchelf so make_release.py can rewrite
      # RPATH on the bundled tesseract binary. apt-get install
      # tesseract-ocr is handled inside make_release.py itself. -----
      - name: Install Linux build prereqs for Tesseract bundling
        if: matrix.os == 'ubuntu-latest'
        run: |
          sudo apt-get update
          sudo apt-get install -y patchelf

      - name: Read version
        id: version
        shell: bash
        run: |
          VER=$(python -c "import re; print(re.search(r'__version__\s*=\s*\"([^\"]+)\"', open('src/__init__.py').read()).group(1))")
          echo "version=$VER" >> "$GITHUB_OUTPUT"

      - name: Generate platform icons
        run: python build/generate_icons.py

      # Stage Tesseract before PyInstaller. The make_release.py
      # helpers handle the per-platform fetch (UB-Mannheim on Win,
      # brew on Mac, apt on Linux) and stage the binary + libs into
      # build/_tesseract/<platform>/ where the spec picks them up.
      # We invoke a tiny inline Python so the workflow doesn't have
      # to know the per-platform target string.
      - name: Stage Tesseract binary + tessdata
        shell: bash
        env:
          DATATOOLS_PLATFORM: ${{ matrix.platform }}
        run: |
          python - <<'PY'
          import os, sys
          sys.path.insert(0, "build")
          from make_release import fetch_tessdata, fetch_tesseract_for_platform
          target = os.environ["DATATOOLS_PLATFORM"]
          fetch_tessdata()
          fetch_tesseract_for_platform(target)
          PY

      - name: Build PyInstaller bundle
        shell: bash
        env:
          # The spec reads this to find the per-platform staging dir;
          # see build/datatools.spec for the contract.
          DATATOOLS_TESS_STAGING: build/_tesseract/${{ matrix.platform }}
        run: pyinstaller build/datatools.spec --clean --noconfirm

      # ---- Per-platform installer packaging ------------------------

      - name: Package macOS DMG (installer)
        if: matrix.os == 'macos-latest'
        run: bash build/macos/build_dmg.sh "${{ steps.version.outputs.version }}"

      - name: Package macOS portable .zip
        if: matrix.os == 'macos-latest'
        run: bash build/macos/build_zip.sh "${{ steps.version.outputs.version }}"

      - name: Install Inno Setup (Windows)
        if: matrix.os == 'windows-latest'
        run: choco install innosetup --no-progress -y

      - name: Package Windows installer
        if: matrix.os == 'windows-latest'
        shell: cmd
        run: |
          iscc /DAppVersion=${{ steps.version.outputs.version }} build\installer.iss

      - name: Package Windows portable .zip
        if: matrix.os == 'windows-latest'
        run: python build/build_portable_zip.py win ${{ steps.version.outputs.version }}

      - name: Install AppImage tooling (Linux)
        if: matrix.os == 'ubuntu-latest'
        run: |
          sudo apt-get update
          sudo apt-get install -y libfuse2 wget
          wget -q https://github.com/AppImage/AppImageKit/releases/download/continuous/appimagetool-x86_64.AppImage -O /usr/local/bin/appimagetool
          sudo chmod +x /usr/local/bin/appimagetool

      - name: Package Linux AppImage
        if: matrix.os == 'ubuntu-latest'
        run: bash build/appimage/build.sh "${{ steps.version.outputs.version }}"

      # ---- Upload + release ----------------------------------------

      - name: Upload installer artifact
        uses: actions/upload-artifact@v4
        with:
          name: DataTools-${{ matrix.platform }}-installer
          path: ${{ matrix.installer_glob }}
          if-no-files-found: error

      - name: Upload portable artifact
        if: matrix.portable_glob != ''
        uses: actions/upload-artifact@v4
        with:
          name: DataTools-${{ matrix.platform }}-portable
          path: ${{ matrix.portable_glob }}
          if-no-files-found: error

      - name: Attach installer to Release (tag push only)
        if: startsWith(github.ref, 'refs/tags/v')
        uses: softprops/action-gh-release@v2
        with:
          files: ${{ matrix.installer_glob }}
          fail_on_unmatched_files: true
          generate_release_notes: true

      - name: Attach portable to Release (tag push only)
        if: startsWith(github.ref, 'refs/tags/v') && matrix.portable_glob != ''
        uses: softprops/action-gh-release@v2
        with:
          files: ${{ matrix.portable_glob }}
          fail_on_unmatched_files: true
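
The "Read version" step in the workflow extracts the version with an
inline regex that only matches a double-quoted assignment and fails with
an AttributeError when nothing matches. The sketch below shows the same
extraction as a standalone helper with an explicit error. The name
read_version is hypothetical and not part of the repository; only the
regex and the src/__init__.py convention come from the workflow.

```python
import re

# Same pattern as the workflow step: __version__ = "<anything but a quote>"
_VERSION_RE = re.compile(r'__version__\s*=\s*"([^"]+)"')


def read_version(source: str) -> str:
    """Return the version string from the text of a module such as src/__init__.py."""
    match = _VERSION_RE.search(source)
    if match is None:
        # The inline one-liner would fail here with an AttributeError on None.
        raise ValueError('no __version__ = "..." assignment found')
    return match.group(1)


print(read_version('__version__ = "1.4.2"\n'))
```

A single-quoted assignment such as __version__ = '1.4.2' is rejected by
this pattern, in the helper and in the workflow step alike.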