Michael 93ccada974 build: bundle Tesseract 5.5.0 + tessdata into every release artifact
End users no longer have to install Tesseract separately for OCR on
scanned PDFs — the engine ships inside the installer, portable .zip,
and AppImage for all three platforms.

Per-platform fetch in build/make_release.py (run before PyInstaller):
- Windows: download UB-Mannheim installer 5.5.0.20241111, extract
  with 7-Zip, copy tesseract.exe + required DLLs into the staging dir.
- macOS: ``brew install tesseract``, copy binary + every Homebrew-
  prefixed dylib resolved via otool -L (recurse one level for
  transitive deps), then install_name_tool rewrites IDs / load paths
  to @loader_path/... so the bundle is relocatable.
- Linux: ``apt-get install tesseract-ocr libtesseract5``, copy binary
  + every non-system .so from ldd output, patchelf --set-rpath '$ORIGIN'.

Wire-up:
- build/datatools.spec reads DATATOOLS_TESS_STAGING env var (set by
  make_release) and adds the staging dir + tessdata + the
  LICENSE_TESSERACT.txt Apache 2.0 attribution to PyInstaller datas
  so they land at <bundle>/tesseract/{tesseract[.exe],tessdata/}
  and the license sits at the bundle root. Soft-warns when staging
  is empty so dev spec runs still complete.
- English tessdata pulled by fetch_tessdata() from
  tesseract-ocr/tessdata_best (eng.traineddata, ~16 MB). Cached at
  build/vendor/tessdata/.
- .github/workflows/build.yml: actions/cache@v4 step keyed on
  ``tesseract-${runner.os}-5.5.0-tessdata_best-v1`` caches the
  staging dir and the vendored tessdata across runs; apt installs
  patchelf on the Linux runner; PyInstaller step now receives the
  DATATOOLS_TESS_STAGING env var.
- .gitignore: build/_tesseract/ and the .traineddata blob.
- TESSERACT_SKIP_FETCH=1 honored for offline / manual stages.
- Installer / .dmg / .zip / AppImage scripts: one-line comments
  confirming Tesseract rides along automatically via PyInstaller's
  datas (no extra packaging steps required in those scripts).

Bundle-size delta: ~50-70 MB on disk per platform, ~25-40 MB post-
compression. Net installer size ~250-300 MB (was ~120 MB) — accepted
tradeoff for zero end-user OCR setup.

Reversal of the prior "don't bundle Tesseract" decision (option A).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:20:33 +00:00
Build — DataTools desktop installer

Cross-platform PyInstaller bundle for Mac / Windows / Linux. The single deliverable the buyer downloads from Gumroad. Owner: Michael · Updated: 2026-05-01

This directory is the build pipeline. Source of truth for the bundle shape, hidden-import lists, per-platform recipes, and the launcher that boots Streamlit inside the bundle.

Files

build/
├── launcher.py           Entry point PyInstaller wraps. Boots a local
│                         Streamlit server, opens browser, locks server
│                         to 127.0.0.1 so the privacy claim holds.
├── datatools.spec        PyInstaller spec — hidden imports, data files,
│                         Mac .app bundle config. Reads the version
│                         from src/__init__.py.
├── installer.iss         Inno Setup script — Windows .exe installer.
│                         Adds Start Menu + Desktop + App Paths entries.
├── generate_icons.py     Builds icon.ico / icon.icns / icon.png from
│                         src/gui/assets/datatools_icon_256.png. Run
│                         once before pyinstaller (CI does this).
├── build_portable_zip.py Cross-platform: zips dist/DataTools/ into a
│                         no-install portable download. Used by the
│                         Windows + Linux portable artifacts.
├── macos/
│   ├── build_dmg.sh      Wraps dist/DataTools.app into a .dmg with a
│   │                     drag-to-/Applications layout (installer).
│   └── build_zip.sh      Wraps dist/DataTools.app into a portable
│                         .zip via ditto (preserves bundle metadata).
├── appimage/
│   ├── AppRun            Entry point invoked when the AppImage runs.
│   ├── datatools.desktop Linux desktop-entry metadata.
│   └── build.sh          Wraps dist/DataTools/ into an .AppImage.
├── hooks/                PyInstaller hooks for libs the static analyser
│   └── hook-streamlit.py misses (Streamlit's dynamic imports).
├── icon.{ico,icns,png}   Generated by generate_icons.py — gitignored.
└── README.md             this file
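
To illustrate what launcher.py is described as doing (boot a local Streamlit server pinned to 127.0.0.1), here is a minimal sketch. It is not the project's actual launcher: `pick_free_port`, `build_streamlit_argv`, and the `app.py` path are hypothetical names. Only the Streamlit options `--server.address`, `--server.port`, `--server.headless`, and `--browser.gatherUsageStats` are real.

```python
import socket


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port on the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS choose a free port
        return sock.getsockname()[1]


def build_streamlit_argv(app_path: str, port: int) -> list[str]:
    """Build the `streamlit run` argv, bound to loopback only.

    app_path is a hypothetical entry script; the real bundle layout may differ.
    """
    return [
        "streamlit", "run", app_path,
        "--server.address", "127.0.0.1",  # never listen on external interfaces
        "--server.port", str(port),
        "--server.headless", "true",      # the launcher opens the browser itself
        "--browser.gatherUsageStats", "false",
    ]


if __name__ == "__main__":
    print(build_streamlit_argv("app.py", pick_free_port()))
```

Binding to 127.0.0.1 rather than 0.0.0.0 is what keeps the server unreachable from other machines on the network, which is the property the privacy claim depends on.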

Distribution outputs per platform

Each CI run produces an installer for buyers who want shortcuts wired automatically and, on macOS and Windows, a portable .zip for buyers (or IT-locked-down machines) that can't run installers. On Linux the AppImage serves as both:

| Platform | Installer | Portable |
|----------|-----------|----------|
| macOS | DataTools-<ver>-mac.dmg | DataTools-<ver>-mac-portable.zip (ditto .app) |
| Windows | DataTools-<ver>-win-setup.exe | DataTools-<ver>-win-portable.zip |
| Linux | DataTools-<ver>-linux-x86_64.AppImage | (the AppImage IS the portable) |

All five outputs are self-contained: every dependency (Python, pandas, streamlit, pdfplumber, the lot) is frozen into the bundle. The buyer does not need to install Python, pip, or anything else first.
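
The naming scheme in the table can be written down as a small helper, for example to check that a release produced every expected file. This is a sketch only: `expected_artifacts` is a hypothetical function, not part of make_release.py, and the platform keys are chosen for illustration.

```python
def expected_artifacts(version: str) -> dict[str, list[str]]:
    """Return the artifact filenames listed in the table, per platform."""
    return {
        "macos": [
            f"DataTools-{version}-mac.dmg",
            f"DataTools-{version}-mac-portable.zip",
        ],
        "windows": [
            f"DataTools-{version}-win-setup.exe",
            f"DataTools-{version}-win-portable.zip",
        ],
        # On Linux the AppImage is both the installer and the portable build.
        "linux": [f"DataTools-{version}-linux-x86_64.AppImage"],
    }


if __name__ == "__main__":
    for platform_name, names in expected_artifacts("1.0.0").items():
        print(platform_name, names)
```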

Easy-launch surface

| Affordance | Windows | macOS |
|------------|---------|-------|
| Desktop shortcut | Inno Setup desktopicon task (checked default) | The .app bundle in /Applications is the icon |
| App menu | Start Menu → DataTools (always installed) | Launchpad + Spotlight (auto from /Applications) |
| Taskbar / Dock | User pins manually (OS forbids programmatic pin) | User pins manually after first launch |
| Run from terminal | DataTools (registered via App Paths) | open -a DataTools (auto from .app bundle) |

CI: .github/workflows/build.yml runs the full pipeline on tag push (matrix: macos-latest, windows-latest, ubuntu-latest) and attaches the resulting installers to a GitHub Release. Manual workflow_dispatch runs upload them as workflow artifacts only.

Releasing

PyInstaller can't cross-compile, so a single machine produces one platform's packages. Run this on each target OS:

# One-time setup per machine:
pip install -r requirements.txt
pip install pyinstaller pillow
# Windows only: install Inno Setup from https://jrsoftware.org/isdl.php
# Linux  only: drop appimagetool onto PATH (see preflight output)

# Build everything for the current OS:
python build/make_release.py

Outputs land in dist/:

  • Windows host → DataTools-<ver>-win-setup.exe + DataTools-<ver>-win-portable.zip
  • macOS host → DataTools-<ver>-mac.dmg + DataTools-<ver>-mac-portable.zip
  • Linux host → DataTools-<ver>-linux-x86_64.AppImage

Useful flags:

python build/make_release.py --preflight       # check tooling, build nothing
python build/make_release.py --clean           # wipe dist/ first
python build/make_release.py --skip-installer  # just the portable zip
python build/make_release.py --skip-portable   # just the installer
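
A preflight check of this kind usually amounts to looking for each external tool on PATH. The sketch below shows the idea under stated assumptions: it is not the actual `--preflight` implementation, `REQUIRED_TOOLS` and `missing_tools` are hypothetical names, and the tool lists are inferred from the recipes in this README (ISCC for Inno Setup, appimagetool, hdiutil, ditto).

```python
import platform
import shutil
from typing import Callable, Optional

# Keys match the values returned by platform.system().
# The tool lists are an assumption based on the recipes in this README.
REQUIRED_TOOLS = {
    "Windows": ["pyinstaller", "ISCC"],
    "Darwin": ["pyinstaller", "hdiutil", "ditto"],
    "Linux": ["pyinstaller", "appimagetool"],
}


def missing_tools(
    system: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the required tools that cannot be found on PATH."""
    return [tool for tool in REQUIRED_TOOLS[system] if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(platform.system())
    print("missing:", missing if missing else "nothing")
```

Passing `which` as a parameter keeps the check testable without touching the real PATH.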

CI build (push tag → GitHub Release)

If you have CI runners for all three OSes:

  1. Bump __version__ in src/__init__.py.
  2. git commit -am "release: vX.Y.Z" && git tag vX.Y.Z.
  3. git push && git push --tags.
  4. CI builds all three platforms and creates a Release with the installers + portable zips attached.
  5. Mirror the Release assets to Gumroad (manual until v2).
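
Steps 1 and 2 are easy to get out of sync (tagging vX.Y.Z without bumping `__version__`, or the reverse). A guard along these lines could run before the build; it is a sketch, and `read_version` and `tag_matches_version` are hypothetical helpers that assume `__version__` is assigned as a plain string literal.

```python
import re

# Matches a line such as: __version__ = "1.2.3"
VERSION_RE = re.compile(
    r"""^__version__\s*=\s*["']([^"']+)["']""", re.MULTILINE
)


def read_version(init_source: str) -> str:
    """Extract __version__ from the text of src/__init__.py."""
    match = VERSION_RE.search(init_source)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


def tag_matches_version(tag: str, init_source: str) -> bool:
    """True when the git tag is 'v' followed by the package version."""
    return tag == f"v{read_version(init_source)}"


if __name__ == "__main__":
    source = '__version__ = "1.2.3"\n'
    print(tag_matches_version("v1.2.3", source))
```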

Signing (Phase 2 — needs accounts/credentials)

Both code-signing steps are intentionally not in CI yet because they require credentials the owner sets up first.

macOS — Apple Developer Program enrollment ($99/yr). Once enrolled, add these GitHub Secrets and uncomment the codesign + notarytool steps in build.yml:

Secret Value
MACOS_DEVELOPER_ID_CERT_P12_BASE64 base64-encoded .p12 cert
MACOS_DEVELOPER_ID_CERT_PASSWORD password for the .p12
MACOS_NOTARY_APPLE_ID Apple ID email
MACOS_NOTARY_TEAM_ID 10-char team ID
MACOS_NOTARY_PASSWORD app-specific password

Windows — Code-signing cert from Sectigo / DigiCert (~$200-400/yr, or ~$300-500 for an EV cert that bypasses SmartScreen). Add:

Secret Value
WINDOWS_CERT_PFX_BASE64 base64-encoded .pfx cert
WINDOWS_CERT_PASSWORD password for the .pfx
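
Both the .p12 and the .pfx arrive in CI as base64 text, so a signing step first has to turn the secret back into a file. A minimal sketch, assuming the secret has been exposed to the step as an environment variable of the same name; `write_cert_from_env` is a hypothetical helper, not something build.yml currently calls.

```python
import base64
import os
from pathlib import Path


def write_cert_from_env(var_name: str, dest: Path) -> Path:
    """Decode a base64-encoded certificate from the environment into dest."""
    encoded = os.environ.get(var_name)
    if not encoded:
        raise RuntimeError(f"secret {var_name} is not set")
    dest.write_bytes(base64.b64decode(encoded))
    dest.chmod(0o600)  # owner-only on POSIX; has limited effect on Windows
    return dest


if __name__ == "__main__":
    # Demo with a fake value; in CI the variable comes from the secret.
    os.environ.setdefault(
        "WINDOWS_CERT_PFX_BASE64", base64.b64encode(b"fake").decode()
    )
    print(write_cert_from_env("WINDOWS_CERT_PFX_BASE64", Path("cert.pfx")))
```

Delete the decoded file at the end of the job so the certificate does not end up in an uploaded artifact.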

Until those are wired, buyers will see:

  • macOS: "DataTools is damaged and can't be opened". Fix by removing the quarantine attribute (xattr -cr /Applications/DataTools.app). Acceptable for the technical buyer; blocking for the non-technical buyer. Don't ship to non-technical buyers without notarization.
  • Windows: SmartScreen "Windows protected your PC" — buyer clicks "More info → Run anyway". Friction but not blocking.
  • Linux: AppImage runs without complaint (Linux has no equivalent trust-store).

Per-platform recipe

Each platform builds on its own machine because PyInstaller does not cross-compile. Pick the platform that matches the bundle you need. GitHub Actions matrix runners are the simplest way to produce all three from one push (see "CI build" above).

Mac (Intel + Apple Silicon, universal2)

# One-time:
pyenv install 3.12
pyenv local 3.12
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install pyinstaller

# Build:
pyinstaller build/datatools.spec --clean

# Output:
#   dist/DataTools/         — folder mode (faster cold start)
#   dist/DataTools.app/     — macOS .app bundle (drag-drop into /Applications)

# Sign + notarize (after Apple Developer Program enrollment per BUSINESS.md §10):
codesign --deep --force --options runtime \
  --sign "Developer ID Application: <YOUR-NAME> (<TEAMID>)" \
  dist/DataTools.app

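# Notarize (notarytool accepts a .zip, .dmg, or .pkg, not a bare .app, so zip first):
ditto -c -k --keepParent dist/DataTools.app dist/DataTools-notarize.zip
xcrun notarytool submit dist/DataTools-notarize.zip \
  --apple-id "<YOUR-APPLE-ID>" \
  --team-id  "<TEAMID>" \
  --password "<APP-SPECIFIC-PASSWORD>" \
  --wait

# Staple the notarization ticket so Gatekeeper sees it offline:
xcrun stapler staple dist/DataTools.app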

# Wrap for distribution:
hdiutil create -volname "DataTools" -srcfolder dist/DataTools.app \
  -ov -format UDZO dist/DataTools-1.0.0-mac.dmg

Windows

# One-time:
py -3.12 -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
pip install pyinstaller

# Build:
pyinstaller build\datatools.spec --clean

# Output:
#   dist\DataTools\          — folder mode
#   dist\DataTools\DataTools.exe

# Wrap with Inno Setup (free):
#   1. Install Inno Setup (https://jrsoftware.org/isdl.php)
#   2. installer.iss sits next to this README. A minimal version looks
#      like this (relative paths resolve from the script's folder, build\):
#        [Setup]
#        AppName=DataTools
#        AppVersion=1.0.0
#        DefaultDirName={autopf}\DataTools
#        OutputDir=..\dist
#        OutputBaseFilename=DataTools-1.0.0-win-setup
#        Compression=lzma
#        SolidCompression=yes
#        [Files]
#        Source: "..\dist\DataTools\*"; DestDir: "{app}"; Flags: recursesubdirs
#        [Icons]
#        Name: "{autoprograms}\DataTools"; Filename: "{app}\DataTools.exe"
#   3. Compile: ISCC.exe build\installer.iss

# Code-sign (optional but reduces SmartScreen warnings):
#   Use signtool with a code-signing cert (Sectigo / DigiCert).
#   Without signing, buyer sees "Windows protected your PC" once;
#   they click "More info → Run anyway." Acceptable for v1.

Linux (AppImage)

python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install pyinstaller

pyinstaller build/datatools.spec --clean
# dist/DataTools/ — folder mode

# Wrap as AppImage (single-file portable app):
#   1. Download appimagetool from https://appimage.org/
#   2. Set up the AppDir layout:
#        DataTools.AppDir/
#        ├── AppRun                     -> usr/bin/DataTools
#        ├── DataTools.desktop          (icon + entry config)
#        ├── icon.png
#        └── usr/bin/                   -> dist/DataTools/*
#   3. ./appimagetool DataTools.AppDir dist/DataTools-1.0.0-linux-x86_64.AppImage

.github/workflows/build.yml (template):

name: Build installers
on:
  workflow_dispatch:
  push:
    tags: [ 'v*' ]
jobs:
  build:
    strategy:
      matrix:
        os: [macos-latest, windows-latest, ubuntu-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -r requirements.txt pyinstaller
      - run: pyinstaller build/datatools.spec --clean
      - uses: actions/upload-artifact@v4
        with:
          name: DataTools-${{ matrix.os }}
          path: dist/

Mac code-signing in CI requires the cert + private key as a GitHub secret (encoded with base64). Detailed walkthrough belongs in a later doc — for v1, sign locally and upload to GitHub Releases.

Common pitfalls

| Symptom | Fix |
|---------|-----|
| Bundle is 800+ MB | Check the excludes list in datatools.spec. matplotlib / scipy / tkinter are the usual suspects. |
| App launches, browser opens, page is blank | Streamlit's static assets aren't bundled. Re-run with --log-level=DEBUG and confirm the static dir was collected by collect_data_files('streamlit'). |
| App launches but logs ImportError: streamlit.runtime.X | Add X to hidden_imports in the spec or to hook-streamlit.py. |
| Mac Gatekeeper says "DataTools is damaged and can't be opened" | The bundle wasn't signed + notarized. Don't ship to buyers without these; see the Mac recipe above. |
| Windows SmartScreen blocks first launch | Buyer clicks "More info → Run anyway". Code-signing reduces but doesn't eliminate this; for v1 it's an accepted friction. |
| Bundle works on dev machine but crashes on a clean machine | Likely a missing C runtime. On Windows, ship the VC++ redistributable inside the installer alongside the bundle. |
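
For the oversized-bundle case, it helps to see which top-level entries of dist/DataTools/ take the space before editing the excludes list. A small sketch follows; `size_by_top_level` is a hypothetical helper and not part of the build scripts.

```python
from pathlib import Path


def size_by_top_level(bundle_dir: Path) -> list[tuple[str, int]]:
    """Return (name, bytes) for each top-level entry, largest first."""
    sizes: dict[str, int] = {}
    for entry in bundle_dir.iterdir():
        if entry.is_file():
            sizes[entry.name] = entry.stat().st_size
        elif entry.is_dir():
            # Sum every regular file below this directory.
            sizes[entry.name] = sum(
                path.stat().st_size
                for path in entry.rglob("*")
                if path.is_file()
            )
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    bundle = Path("dist/DataTools")
    if bundle.is_dir():
        for name, size in size_by_top_level(bundle)[:10]:
            print(f"{size / 1_000_000:8.1f} MB  {name}")
```

A package such as matplotlib or scipy showing up near the top of this list is a sign that it is missing from the excludes.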

Testing the bundle

Smoke-test on a clean machine (or VM) — your dev machine has too much state to trust:

1. Boot a clean Mac / Win / Linux VM.
2. Copy the .dmg / .exe / .AppImage onto it.
3. Install / drag-drop into Applications / chmod +x.
4. Double-click the app icon.
5. Browser should open to http://127.0.0.1:850x within 5 seconds.
6. Drop samples/demo/shopify_pet_customers.csv into the
   Automated Workflows page; click Run; AFTER preview should appear.
7. Confirm in the network tab: zero outbound calls except to
   127.0.0.1 and the Streamlit static asset paths (also local).

Step 7 is the privacy-claim integrity check from docs/POST-LAUNCH.md §6 — do this once per release, then trust it.
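
Step 5 can be scripted when the smoke test runs in a VM. The sketch below polls a local URL until it answers or the 5-second budget runs out. `wait_for_server` is a hypothetical helper, and the port in the usage line is only an example since the real one is whatever the launcher picks.

```python
import time
import urllib.request


def wait_for_server(url: str, timeout: float = 5.0, interval: float = 0.2) -> bool:
    """Poll url until it returns HTTP 200 or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=1.0) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Covers connection refused, URLError, and socket timeouts:
            # the server is simply not up yet.
            pass
        time.sleep(interval)
    return False


if __name__ == "__main__":
    # Example port only; use the one the launcher actually opened.
    print(wait_for_server("http://127.0.0.1:8501/"))
```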

Versioning

The version string appears in three places per release:

  • datatools.spec (CFBundleVersion + CFBundleShortVersionString), which reads it from src/__init__.py
  • the Inno Setup AppVersion line
  • the AppImage filename

Bump __version__ in src/__init__.py first, then confirm the other two spots match. Deriving all three from src/__init__.py is a future refactor; for v1 the manual check is fine.