Add a guarded "Sign & notarize macOS app" step to build.yml that signs dist/DataTools.app with the Developer ID (hardened runtime + entitlements + secure timestamp), notarizes via notarytool, and staples the ticket — running before DMG packaging. The step exits 0 with a warning when the MACOS_* secrets are absent, so dry-run dispatches still produce an (unsigned) build. Add build/macos/entitlements.plist with the hardened-runtime entitlements a frozen PyInstaller/CPython app needs (JIT memory, library-validation disabled for bundled .so/.dylib + Tesseract). Update build/README.md to reflect that macOS signing is now wired and only needs the secrets. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Build — DataTools desktop installer
Cross-platform PyInstaller bundle for Mac / Windows / Linux. The single deliverable the buyer downloads from Gumroad. Owner: Michael · Updated: 2026-05-01
This directory is the build pipeline. Source of truth for the bundle shape, hidden-import lists, per-platform recipes, and the launcher that boots Streamlit inside the bundle.
Files
build/
├── launcher.py Entry point PyInstaller wraps. Boots a local
│ Streamlit server, opens browser, locks server
│ to 127.0.0.1 so the privacy claim holds.
├── datatools.spec PyInstaller spec — hidden imports, data files,
│ Mac .app bundle config. Reads the version
│ from src/__init__.py.
├── installer.iss Inno Setup script — Windows .exe installer.
│ Adds Start Menu + Desktop + App Paths entries.
├── generate_icons.py Builds icon.ico / icon.icns / icon.png from
│ src/gui/assets/datatools_icon_256.png. Run
│ once before pyinstaller (CI does this).
├── tesseract.py Fetches the per-platform Tesseract binary +
│ eng.traineddata at build time. CI imports
│ fetch_tessdata + fetch_tesseract_for_platform.
├── macos/
│ └── build_dmg.sh Wraps dist/DataTools.app into a .dmg with a
│ drag-to-/Applications layout (installer).
├── appimage/
│ ├── AppRun Entry point invoked when the AppImage runs.
│ ├── datatools.desktop Linux desktop-entry metadata.
│ └── build.sh Wraps dist/DataTools/ into an .AppImage.
├── hooks/ PyInstaller hooks for libs the static analyser
│ └── hook-streamlit.py misses (Streamlit's dynamic imports).
├── icon.{ico,icns,png} Generated by generate_icons.py — gitignored.
└── README.md this file
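As a rough illustration of what "locks server to 127.0.0.1" can mean in practice, here is a minimal sketch of a launcher-style helper. It is not the contents of launcher.py. The helper names, the 8501 to 8509 port range, and the use of `python -m streamlit` (which suits a source checkout; a frozen bundle would more likely start Streamlit in-process) are all assumptions. The `--server.*` and `--browser.gatherUsageStats` flags are standard Streamlit options.

```python
import socket
import sys

LOOPBACK = "127.0.0.1"


def find_free_port(candidates=range(8501, 8510)):
    """Return the first candidate port that can be bound on loopback."""
    for port in candidates:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((LOOPBACK, port))
            except OSError:
                continue  # port is taken, try the next candidate
            return port
    raise RuntimeError("no free port in the candidate range")


def build_streamlit_args(app_path, port):
    """Build a `streamlit run` command line pinned to the loopback interface."""
    return [
        sys.executable, "-m", "streamlit", "run", app_path,
        "--server.address", LOOPBACK,       # never listen on external interfaces
        "--server.port", str(port),
        "--server.headless", "true",        # the launcher opens the browser itself
        "--browser.gatherUsageStats", "false",
    ]
```

A launcher built on this would pass the result to subprocess.Popen and then call webbrowser.open on the matching http://127.0.0.1:<port> URL.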
Distribution outputs per platform
Each CI run produces one installer per platform:
| Platform | Installer |
|---|---|
| macOS | DataTools-<ver>-mac.dmg |
| Windows | DataTools-<ver>-win-setup.exe |
| Linux | DataTools-<ver>-linux-x86_64.AppImage (already portable) |
All three outputs are self-contained: every dependency (Python, pandas,
streamlit, pdfplumber, Tesseract OCR + eng.traineddata, the lot)
is frozen into the bundle. The buyer does not need to install Python,
pip, Tesseract, or anything else first. With Tesseract bundled, each
artifact is roughly 250–300 MB on disk (up from ~120 MB pre-OCR);
unpacked installs run ~300–400 MB once scratch space is counted.
Easy-launch surface
| Affordance | Windows | macOS |
|---|---|---|
| Desktop shortcut | Inno Setup desktopicon task (checked default) | The .app bundle in /Applications is the icon |
| App menu | Start Menu → DataTools (always installed) | Launchpad + Spotlight (auto from /Applications) |
| Taskbar / Dock | User pins manually (OS forbids programmatic pin) | User pins manually after first launch |
| Run from terminal | DataTools (registered via App Paths) | open -a DataTools (auto from .app bundle) |
CI: .github/workflows/build.yml runs the full pipeline on tag push
(matrix: macos-latest, windows-latest, ubuntu-latest) and attaches
the resulting installers to a GitHub Release. Manual
workflow_dispatch runs upload them as workflow artifacts only.
Releasing
CI build (push tag → GitHub Release) — the release process
Releases are built by GitHub Actions (.github/workflows/build.yml),
not on a developer's machine. The matrix runs on
macos-latest / windows-latest / ubuntu-latest, stages Tesseract
(build/tesseract.py), runs PyInstaller, packages the per-platform
installer, and attaches it to a GitHub Release on tag push:
- Bump __version__ in src/__init__.py.
- git commit -am "release: vX.Y.Z" && git tag vX.Y.Z.
- git push && git push --tags.
- CI builds all three platforms and creates a Release with the installers attached.
- Mirror the Release assets to Gumroad (manual until v2).
A manual workflow_dispatch run does the same build but uploads the
installers as workflow artifacts instead of creating a Release —
useful for smoke-testing a build without cutting a tag.
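The tag-push versus manual-dispatch split can be sketched as a small decision function. This is an illustration only, not what build.yml does internally; publish_target is a hypothetical name. GITHUB_EVENT_NAME and GITHUB_REF are the standard environment variables GitHub Actions sets on every run.

```python
import os


def publish_target(event_name, ref):
    """Return 'release' for a pushed v* tag, otherwise 'artifacts'."""
    if event_name == "push" and ref.startswith("refs/tags/v"):
        return "release"
    # workflow_dispatch (or any other trigger) only uploads workflow artifacts
    return "artifacts"


if __name__ == "__main__":
    print(publish_target(
        os.environ.get("GITHUB_EVENT_NAME", ""),
        os.environ.get("GITHUB_REF", ""),
    ))
```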
Local build (single platform, for testing)
PyInstaller can't cross-compile, so a local build produces only the current OS's installer. This mirrors what CI does, by hand — use it to debug the bundle before tagging. See the per-platform recipes below for the exact commands; the short version is:
pip install -r requirements.txt
pip install pyinstaller pillow
python build/generate_icons.py
python -c "import sys; sys.path.insert(0,'build'); \
from tesseract import fetch_tessdata, fetch_tesseract_for_platform; \
fetch_tessdata(); fetch_tesseract_for_platform('mac')" # win / mac / linux
pyinstaller build/datatools.spec --clean --noconfirm
# then run the matching packager: build/macos/build_dmg.sh,
# build/installer.iss (iscc), or build/appimage/build.sh
Signing (Phase 2 — needs accounts/credentials)
macOS signing + notarization is now wired into build.yml (the
"Sign & notarize macOS app" step, with build/macos/entitlements.plist).
It is guarded: if MACOS_DEVELOPER_ID_CERT_P12_BASE64 is absent the step
warns and exits 0, so dry-run dispatches still produce an unsigned build.
To activate it, just add the secrets below — no code change needed.
Windows code-signing is still not wired (accepted v1 friction).
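The guard described above might look roughly like the following if it were written as a Python script. The actual step in build.yml may well be shell, so treat the function names here as hypothetical. The `::warning::` prefix is the GitHub Actions workflow command for emitting a warning annotation, and an unset secret reaches the step as an empty string.

```python
import os
import sys

REQUIRED_SECRET = "MACOS_DEVELOPER_ID_CERT_P12_BASE64"


def signing_enabled(env):
    """True when the certificate secret is present and non-empty."""
    return bool(env.get(REQUIRED_SECRET, "").strip())


def main(env=os.environ):
    if not signing_enabled(env):
        # Warn but succeed, so dry-run dispatches still yield an unsigned build.
        print(f"::warning::{REQUIRED_SECRET} is not set; skipping signing")
        return 0
    print("secret present; continuing to codesign, notarytool and stapler")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```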
macOS — Apple Developer Program enrollment ($99/yr). Once enrolled,
add these GitHub Secrets to activate the signing step in build.yml:
| Secret | Value |
|---|---|
| MACOS_DEVELOPER_ID_CERT_P12_BASE64 | base64-encoded .p12 cert |
| MACOS_DEVELOPER_ID_CERT_PASSWORD | password for the .p12 |
| MACOS_NOTARY_APPLE_ID | Apple ID email |
| MACOS_NOTARY_TEAM_ID | 10-char team ID |
| MACOS_NOTARY_PASSWORD | app-specific password |
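For the first secret, the .p12 has to be turned into a single-line base64 string, and CI has to reverse that before importing the certificate. A minimal sketch of the round trip, with hypothetical helper names and file paths chosen by the caller:

```python
import base64
from pathlib import Path


def encode_cert(path):
    """Return the file's bytes as a single-line base64 string (the secret value)."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


def decode_cert(encoded, out_path):
    """Write the decoded bytes back to disk, as a CI step would before importing."""
    Path(out_path).write_bytes(base64.b64decode(encoded))
```

The same pair works for the Windows .pfx.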
Windows — Code-signing cert from Sectigo / DigiCert (~$200-400/yr, or ~$300-500 for an EV cert that bypasses SmartScreen). Add:
| Secret | Value |
|---|---|
| WINDOWS_CERT_PFX_BASE64 | base64-encoded .pfx cert |
| WINDOWS_CERT_PASSWORD | password for the .pfx |
Until the macOS secrets are added and Windows signing is wired, buyers will see:
- macOS: "DataTools is damaged and can't be opened". Fix by removing the quarantine attribute (xattr -cr /Applications/DataTools.app). Acceptable for the technical buyer; blocking for the non-technical buyer. Don't ship to non-technical buyers without notarization.
- Windows: SmartScreen "Windows protected your PC". Buyer clicks "More info → Run anyway". Friction but not blocking.
- Linux: AppImage runs without complaint (Linux has no equivalent trust-store).
Per-platform recipe
Each platform builds on its own machine — PyInstaller does not cross-compile. Pick the platform that matches the bundle you need. GitHub Actions matrix runners are the simplest way to produce all three from one push (see "CI build" below).
Mac (Intel + Apple Silicon, universal2)
# One-time:
pyenv install 3.12
pyenv local 3.12
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install pyinstaller
# Build:
pyinstaller build/datatools.spec --clean
# Output:
# dist/DataTools/ — folder mode (faster cold start)
# dist/DataTools.app/ — macOS .app bundle (drag-drop into /Applications)
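# Sign + notarize (after Apple Developer Program enrollment per BUSINESS.md §10).
# --timestamp requests the secure timestamp that notarization requires;
# --entitlements applies the hardened-runtime entitlements used by CI.
codesign --deep --force --options runtime --timestamp \
--entitlements build/macos/entitlements.plist \
--sign "Developer ID Application: <YOUR-NAME> (<TEAMID>)" \
dist/DataTools.app
# Notarize. notarytool accepts a .zip, .dmg, or .pkg rather than a bare .app,
# so zip the bundle first:
ditto -c -k --keepParent dist/DataTools.app dist/DataTools.zip
xcrun notarytool submit dist/DataTools.zip \
--apple-id "<YOUR-APPLE-ID>" \
--team-id "<TEAMID>" \
--password "<APP-SPECIFIC-PASSWORD>" \
--wait
# Staple the notarization ticket so Gatekeeper sees it offline:
xcrun stapler staple dist/DataTools.app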
# Wrap for distribution:
hdiutil create -volname "DataTools" -srcfolder dist/DataTools.app \
-ov -format UDZO dist/DataTools-1.0.0-mac.dmg
Windows
# One-time:
py -3.12 -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
pip install pyinstaller
# Build:
pyinstaller build\datatools.spec --clean
# Output:
# dist\DataTools\ — folder mode
# dist\DataTools\DataTools.exe
# Wrap with Inno Setup (free):
# 1. Install Inno Setup (https://jrsoftware.org/isdl.php)
# 2. Create installer.iss next to this README (relative paths resolve
#    from the script's own folder, build\):
# [Setup]
# AppName=DataTools
# AppVersion=1.0.0
# DefaultDirName={autopf}\DataTools
# OutputDir=..\dist
# OutputBaseFilename=DataTools-1.0.0-win-setup
# Compression=lzma
# SolidCompression=yes
# [Files]
# Source: "..\dist\DataTools\*"; DestDir: "{app}"; Flags: recursesubdirs
# [Icons]
# Name: "{autoprograms}\DataTools"; Filename: "{app}\DataTools.exe"
# 3. Compile: ISCC.exe build\installer.iss
# Code-sign (optional but reduces SmartScreen warnings):
# Use signtool with a code-signing cert (Sectigo / DigiCert).
# Without signing, buyer sees "Windows protected your PC" once;
# they click "More info → Run anyway." Acceptable for v1.
Linux (AppImage)
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install pyinstaller
pyinstaller build/datatools.spec --clean
# dist/DataTools/ — folder mode
# Wrap as AppImage (single-file portable app):
# 1. Download appimagetool from https://appimage.org/
# 2. Set up the AppDir layout:
# DataTools.AppDir/
# ├── AppRun -> ./DataTools/DataTools
# ├── DataTools.desktop (icon + entry config)
# ├── icon.png
# └── usr/bin/ -> dist/DataTools/*
# 3. ./appimagetool DataTools.AppDir dist/DataTools-1.0.0-linux-x86_64.AppImage
CI build (recommended once the spec is stable)
.github/workflows/build.yml (template):
name: Build installers
on:
workflow_dispatch:
push:
tags: [ 'v*' ]
jobs:
build:
strategy:
matrix:
os: [macos-latest, windows-latest, ubuntu-latest]
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with: { python-version: '3.12' }
- run: pip install -r requirements.txt pyinstaller
- run: pyinstaller build/datatools.spec --clean
- uses: actions/upload-artifact@v4
with:
name: DataTools-${{ matrix.os }}
path: dist/
Mac code-signing in CI requires the cert + private key as a GitHub
secret (encoded with base64). The "Signing" section above lists the
secret names and describes the guarded step in build.yml that uses them.
Tesseract bundling (PDF Extractor OCR)
Frozen artifacts ship a per-platform Tesseract binary plus the English
eng.traineddata model so scanned-PDF support in the PDF Extractor
works out of the box — no separate user install. Source / pip
developer setups still need system Tesseract on PATH.
Layout inside the bundle:
DataTools/ (or DataTools.app/Contents/MacOS/)
└── tesseract/
├── tesseract (Linux/macOS binary; tesseract.exe on Windows)
└── tessdata/
└── eng.traineddata
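The layout above can be checked after a build with a few lines of Python. This is a sketch with a hypothetical function name; bundle_root is whichever directory holds tesseract/ on the platform being checked.

```python
from pathlib import Path


def missing_tesseract_files(bundle_root, windows=False):
    """Return the expected Tesseract files (relative paths) absent from the bundle."""
    binary = "tesseract.exe" if windows else "tesseract"
    expected = [
        Path("tesseract") / binary,
        Path("tesseract") / "tessdata" / "eng.traineddata",
    ]
    root = Path(bundle_root)
    return [rel.as_posix() for rel in expected if not (root / rel).is_file()]
```

An empty list means both the binary and the English model are in place.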
The runtime resolver (in src/, owned by the runtime team) walks, in order:
1. DATATOOLS_TESSERACT_BIN env var override.
2. Path(sys._MEIPASS) / "tesseract" / "tesseract[.exe]" (frozen bundles only).
3. tesseract on PATH.
4. Windows well-known paths.
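The lookup order above could be sketched as follows. This is not the resolver in src/; the function name and the two Windows install paths are assumptions made for illustration, and the override is returned without checking that the file exists.

```python
import os
import shutil
import sys
from pathlib import Path

# Assumed install locations; the real list lives in the runtime resolver.
WINDOWS_WELL_KNOWN = [
    Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe"),
    Path(r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"),
]


def resolve_tesseract(env=None):
    """Return a Path to a Tesseract binary, or None if nothing is found."""
    env = os.environ if env is None else env
    exe = "tesseract.exe" if sys.platform == "win32" else "tesseract"

    # 1. Explicit override wins.
    override = env.get("DATATOOLS_TESSERACT_BIN")
    if override:
        return Path(override)

    # 2. Bundled binary; sys._MEIPASS only exists inside a PyInstaller-frozen app.
    meipass = getattr(sys, "_MEIPASS", None)
    if meipass:
        bundled = Path(meipass) / "tesseract" / exe
        if bundled.is_file():
            return bundled

    # 3. Anything named tesseract on PATH.
    found = shutil.which("tesseract", path=env.get("PATH"))
    if found:
        return Path(found)

    # 4. Well-known Windows install locations.
    if sys.platform == "win32":
        for candidate in WINDOWS_WELL_KNOWN:
            if candidate.is_file():
                return candidate
    return None
```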
Where the bytes come from:
- Tessdata: vendored in-repo at build/vendor/tessdata/eng.traineddata (sourced from tessdata_best). datatools.spec copies it into tesseract/tessdata/.
- Binary: fetched per-platform at build time by build/tesseract.py from pinned upstream URLs. Current pin: Tesseract 5.5.0. CI imports fetch_tessdata + fetch_tesseract_for_platform from this module before PyInstaller.
Updating Tesseract:
- Bump the version pin and the per-platform fetch URLs in build/tesseract.py.
- If the model schema changed upstream, refresh build/vendor/tessdata/eng.traineddata from tessdata_best at the matching tag.
- Push a v* tag so CI rebuilds all three platforms, then smoke-test a scanned PDF through the PDF Extractor.
- Update LICENSE_TESSERACT.txt at the repo root if upstream license terms change (Apache-2.0 today).
License attribution for the bundled binary lives at
LICENSE_TESSERACT.txt at the repo root — it must ship alongside any
binary that contains Tesseract.
Common pitfalls
| Symptom | Fix |
|---|---|
| Bundle is 800+ MB | Check the excludes list in datatools.spec. matplotlib / scipy / tkinter are the usual suspects. |
| App launches, browser opens, page is blank | Streamlit's static assets aren't bundled. Re-run with --log-level=DEBUG and confirm the static dir was collected by collect_data_files('streamlit'). |
| App launches but logs ImportError: streamlit.runtime.X | Add X to hidden_imports in the spec or to hook-streamlit.py. |
| Mac Gatekeeper says "DataTools is damaged and can't be opened" | The bundle wasn't signed + notarized. Don't ship to buyers without these — see the Mac recipe above. |
| Windows SmartScreen blocks first launch | Buyer clicks "More info → Run anyway". Code-signing reduces but doesn't eliminate this; for v1 it's an accepted friction. |
| Bundle works on dev machine but crashes on a clean machine | Likely a missing C runtime. On Windows, ship the VC++ redistributable in the installer alongside the bundle. |
Testing the bundle
Smoke-test on a clean machine (or VM) — your dev machine has too much state to trust:
1. Boot a clean Mac / Win / Linux VM.
2. Copy the .dmg / .exe / .AppImage onto it.
3. Install / drag-drop into Applications / chmod +x.
4. Double-click the app icon.
5. Browser should open to http://127.0.0.1:850x within 5 seconds.
6. Drop samples/demo/shopify_pet_customers.csv into the
Automated Workflows page; click Run; AFTER preview should appear.
7. Confirm in the network tab: zero outbound calls except to
127.0.0.1 and the Streamlit static asset paths (also local).
Step 7 is the privacy-claim integrity check from
docs/POST-LAUNCH.md §6 — do this once per release, then trust it.
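Step 5 can be scripted for repeat runs. The sketch below polls loopback ports until one accepts a TCP connection or the deadline passes. The function name is hypothetical, and reading "850x" as ports 8500 through 8509 is an assumption.

```python
import socket
import time


def wait_for_server(ports, timeout=5.0, host="127.0.0.1"):
    """Return the first port accepting a TCP connection before the deadline, else None."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for port in ports:
            try:
                with socket.create_connection((host, port), timeout=0.25):
                    return port
            except OSError:
                continue  # nothing listening yet on this port
        time.sleep(0.1)
    return None
```

Calling wait_for_server(range(8500, 8510)) right after launching the app gives a pass/fail signal for the five-second budget. It does not replace the manual network-tab check in step 7.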
Versioning
Bump the version string in three places per release:
- datatools.spec (CFBundleVersion + CFBundleShortVersionString)
- the Inno Setup AppVersion line
- the AppImage filename
A single source of truth (e.g. src/__init__.py) is a future
refactor — for v1 the three-spot update is fine.
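One possible shape for that refactor, sketched under the assumption that src/__init__.py holds a plain `__version__ = "X.Y.Z"` assignment. The function names are hypothetical; the filename patterns are the ones from the distribution-outputs table.

```python
import re
from pathlib import Path

VERSION_RE = re.compile(r"""^__version__\s*=\s*["']([^"']+)["']""", re.MULTILINE)


def read_version(init_path="src/__init__.py"):
    """Extract the __version__ string without importing the package."""
    match = VERSION_RE.search(Path(init_path).read_text(encoding="utf-8"))
    if match is None:
        raise ValueError(f"no __version__ assignment found in {init_path}")
    return match.group(1)


def installer_names(version):
    """Derive the per-platform installer filenames from one version string."""
    return {
        "macOS": f"DataTools-{version}-mac.dmg",
        "Windows": f"DataTools-{version}-win-setup.exe",
        "Linux": f"DataTools-{version}-linux-x86_64.AppImage",
    }
```

Parsing the file with a regex rather than importing src avoids pulling in the application's dependencies at build time.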