# Build — DataTools desktop installer

> Cross-platform PyInstaller bundle for Mac / Windows / Linux. The
> single deliverable the buyer downloads from Gumroad.
>
> **Owner**: Michael · **Updated**: 2026-05-01

This directory is the build pipeline. Source of truth for the bundle shape, hidden-import lists, per-platform recipes, and the launcher that boots Streamlit inside the bundle.

## Files

```
build/
├── launcher.py           Entry point PyInstaller wraps. Boots a local
│                         Streamlit server, opens browser, locks server
│                         to 127.0.0.1 so the privacy claim holds.
├── datatools.spec        PyInstaller spec — hidden imports, data files,
│                         Mac .app bundle config. Reads the version
│                         from src/__init__.py.
├── installer.iss         Inno Setup script — Windows .exe installer.
│                         Adds Start Menu + Desktop + App Paths entries.
├── generate_icons.py     Builds icon.ico / icon.icns / icon.png from
│                         src/gui/assets/datatools_icon_256.png. Run
│                         once before pyinstaller (CI does this).
├── tesseract.py          Fetches the per-platform Tesseract binary +
│                         eng.traineddata at build time. CI imports
│                         fetch_tessdata + fetch_tesseract_for_platform.
├── macos/
│   └── build_dmg.sh      Wraps dist/DataTools.app into a .dmg with a
│                         drag-to-/Applications layout (installer).
├── appimage/
│   ├── AppRun            Entry point invoked when the AppImage runs.
│   ├── datatools.desktop Linux desktop-entry metadata.
│   └── build.sh          Wraps dist/DataTools/ into an .AppImage.
├── hooks/                PyInstaller hooks for libs the static analyser
│   └── hook-streamlit.py misses (Streamlit's dynamic imports).
├── icon.{ico,icns,png}   Generated by generate_icons.py — gitignored.
└── README.md             this file
```

## Distribution outputs per platform

Each CI run produces one installer per platform:

| Platform | Installer |
|----------|-----------|
| macOS    | `DataTools-<version>-mac.dmg` |
| Windows  | `DataTools-<version>-win-setup.exe` |
| Linux    | `DataTools-<version>-linux-x86_64.AppImage` (already portable) |

All three outputs are self-contained: every dependency (Python, pandas, streamlit, pdfplumber, **Tesseract OCR + `eng.traineddata`**, the lot) is frozen into the bundle. The buyer does not need to install Python, pip, Tesseract, or anything else first.

With Tesseract bundled, each artifact is roughly **250–300 MB** on disk (up from ~120 MB pre-OCR); unpacked installs run ~300–400 MB once scratch space is counted.

## Easy-launch surface

| Affordance        | Windows                                          | macOS                                           |
|-------------------|--------------------------------------------------|-------------------------------------------------|
| Desktop shortcut  | Inno Setup `desktopicon` task (checked default)  | The .app bundle in /Applications is the icon    |
| App menu          | Start Menu → DataTools (always installed)        | Launchpad + Spotlight (auto from /Applications) |
| Taskbar / Dock    | User pins manually (OS forbids programmatic pin) | User pins manually after first launch           |
| Run from terminal | `DataTools` (registered via App Paths)           | `open -a DataTools` (auto from .app bundle)     |

CI: `.github/workflows/build.yml` runs the full pipeline on tag push (matrix: macos-latest, windows-latest, ubuntu-latest) and attaches the resulting installers to a GitHub Release. Manual `workflow_dispatch` runs upload them as workflow artifacts only.

## Releasing

### CI build (push tag → GitHub Release) — the release process

Releases are built by GitHub Actions (`.github/workflows/build.yml`), not on a developer's machine.
The matrix runs on macos-latest / windows-latest / ubuntu-latest, stages Tesseract (`build/tesseract.py`), runs PyInstaller, packages the per-platform installer, and attaches it to a GitHub Release on tag push:

1. Bump `__version__` in `src/__init__.py`.
2. `git commit -am "release: vX.Y.Z" && git tag vX.Y.Z`.
3. `git push && git push --tags`.
4. CI builds all three platforms and creates a Release with the installers attached.
5. Mirror the Release assets to Gumroad (manual until v2).

A manual `workflow_dispatch` run does the same build but uploads the installers as workflow artifacts instead of creating a Release — useful for smoke-testing a build without cutting a tag.

### Local build (single platform, for testing)

PyInstaller can't cross-compile, so a local build produces only the current OS's installer. This mirrors what CI does, by hand — use it to debug the bundle before tagging. See the per-platform recipes below for the exact commands; the short version is:

```bash
pip install -r requirements.txt
pip install pyinstaller pillow
python build/generate_icons.py
python -c "import sys; sys.path.insert(0,'build'); \
  from tesseract import fetch_tessdata, fetch_tesseract_for_platform; \
  fetch_tessdata(); fetch_tesseract_for_platform('mac')"   # win / mac / linux
pyinstaller build/datatools.spec --clean --noconfirm
# then run the matching packager: build/macos/build_dmg.sh,
# build/installer.iss (iscc), or build/appimage/build.sh
```

## Signing (Phase 2 — needs accounts/credentials)

**macOS signing + notarization is now wired into `build.yml`** (the "Sign & notarize macOS app" step, with `build/macos/entitlements.plist`). It is guarded: if `MACOS_DEVELOPER_ID_CERT_P12_BASE64` is absent the step warns and exits 0, so dry-run dispatches still produce an unsigned build. To activate it, just add the secrets below — no code change needed. **Windows** code-signing is still not wired (accepted v1 friction).

**macOS** — Apple Developer Program enrollment ($99/yr). Once enrolled, add these GitHub Secrets to activate the signing step in `build.yml`:

| Secret | Value |
|---|---|
| `MACOS_DEVELOPER_ID_CERT_P12_BASE64` | base64-encoded `.p12` cert |
| `MACOS_DEVELOPER_ID_CERT_PASSWORD` | password for the .p12 |
| `MACOS_NOTARY_APPLE_ID` | Apple ID email |
| `MACOS_NOTARY_TEAM_ID` | 10-char team ID |
| `MACOS_NOTARY_PASSWORD` | app-specific password |

**Windows** — Code-signing cert from Sectigo / DigiCert (~$200-400/yr, or ~$300-500 for an EV cert that bypasses SmartScreen). Add:

| Secret | Value |
|---|---|
| `WINDOWS_CERT_PFX_BASE64` | base64-encoded `.pfx` cert |
| `WINDOWS_CERT_PASSWORD` | password for the .pfx |

Until signing is active on a platform, buyers will see:

- macOS: "DataTools is damaged and can't be opened" — fix by removing the quarantine attribute (`xattr -cr /Applications/DataTools.app`). Acceptable for the technical buyer; **blocking** for the non-technical buyer. Don't ship to non-technical without notarization.
- Windows: SmartScreen "Windows protected your PC" — buyer clicks "More info → Run anyway". Friction but not blocking.
- Linux: AppImage runs without complaint (Linux has no equivalent trust-store).

## Per-platform recipe

Each platform builds on its own machine — PyInstaller does **not** cross-compile. Pick the platform that matches the bundle you need. GitHub Actions matrix runners are the simplest way to produce all three from one push (see "CI build" below).
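
Because each build targets only the host OS, a build helper has to pick the matching platform key and installer name before it runs. The following is a minimal sketch of that selection, assuming the `win` / `mac` / `linux` keys passed to `fetch_tesseract_for_platform` and the installer filenames used in this README. The helper names `platform_key` and `installer_name` are hypothetical and do not exist in the repo.

```python
import platform

# Maps platform.system() values to the platform keys this README uses
# (the same strings passed to fetch_tesseract_for_platform).
_PLATFORM_KEYS = {"Darwin": "mac", "Windows": "win", "Linux": "linux"}

# Installer filename suffix per platform key, following the README's
# naming (for example DataTools-1.0.0-mac.dmg).
_INSTALLER_SUFFIX = {
    "mac": "mac.dmg",
    "win": "win-setup.exe",
    "linux": "linux-x86_64.AppImage",
}


def platform_key(system=None):
    """Return 'mac', 'win', or 'linux' for the given (or current) OS."""
    system = system or platform.system()
    try:
        return _PLATFORM_KEYS[system]
    except KeyError:
        # PyInstaller cannot cross-compile, so an unknown host is an error.
        raise RuntimeError(f"unsupported build host: {system!r}") from None


def installer_name(version, key):
    """Build the expected installer filename for one platform key."""
    return f"DataTools-{version}-{_INSTALLER_SUFFIX[key]}"
```

Calling `installer_name("1.0.0", platform_key())` on a Mac would yield the `.dmg` name used in the Mac recipe, which makes it easy to check that a local build produced the artifact CI would have produced.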
### Mac (Intel + Apple Silicon, universal2)

```bash
# One-time:
pyenv install 3.12
pyenv local 3.12
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install pyinstaller

# Build:
pyinstaller build/datatools.spec --clean

# Output:
#   dist/DataTools/      — folder mode (faster cold start)
#   dist/DataTools.app/  — macOS .app bundle (drag-drop into /Applications)

# Sign + notarize (after Apple Developer Program enrollment per BUSINESS.md §10):
codesign --deep --force --options runtime \
  --sign "Developer ID Application: <NAME> (<TEAM_ID>)" \
  dist/DataTools.app

# Notarize (notarytool accepts a .zip, .dmg, or .pkg, not a bare .app):
ditto -c -k --keepParent dist/DataTools.app dist/DataTools.zip
xcrun notarytool submit dist/DataTools.zip \
  --apple-id "<APPLE_ID>" \
  --team-id "<TEAM_ID>" \
  --password "<APP_SPECIFIC_PASSWORD>" \
  --wait

# Staple the notarization ticket so Gatekeeper sees it offline:
xcrun stapler staple dist/DataTools.app

# Wrap for distribution:
hdiutil create -volname "DataTools" -srcfolder dist/DataTools.app \
  -ov -format UDZO dist/DataTools-1.0.0-mac.dmg
```

### Windows

```powershell
# One-time:
py -3.12 -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
pip install pyinstaller

# Build:
pyinstaller build\datatools.spec --clean

# Output:
#   dist\DataTools\   — folder mode
#   dist\DataTools\DataTools.exe

# Wrap with Inno Setup (free):
#   1. Install Inno Setup (https://jrsoftware.org/isdl.php)
#   2. Create installer.iss next to this README:
#        [Setup]
#        AppName=DataTools
#        AppVersion=1.0.0
#        DefaultDirName={autopf}\DataTools
#        OutputDir=..\..\dist
#        OutputBaseFilename=DataTools-1.0.0-win-setup
#        Compression=lzma
#        SolidCompression=yes
#        [Files]
#        Source: "..\..\dist\DataTools\*"; DestDir: "{app}"; Flags: recursesubdirs
#        [Icons]
#        Name: "{autoprograms}\DataTools"; Filename: "{app}\DataTools.exe"
#   3. Compile: ISCC.exe build\installer.iss

# Code-sign (optional but reduces SmartScreen warnings):
#   Use signtool with a code-signing cert (Sectigo / DigiCert).
#   Without signing, buyer sees "Windows protected your PC" once;
#   they click "More info → Run anyway." Acceptable for v1.
```

### Linux (AppImage)

```bash
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install pyinstaller
pyinstaller build/datatools.spec --clean
# dist/DataTools/ — folder mode

# Wrap as AppImage (single-file portable app):
#   1. Download appimagetool from https://appimage.org/
#   2. Set up the AppDir layout:
#        DataTools.AppDir/
#        ├── AppRun -> ./DataTools/DataTools
#        ├── DataTools.desktop  (icon + entry config)
#        ├── icon.png
#        └── usr/bin/ -> dist/DataTools/*
#   3. ./appimagetool DataTools.AppDir dist/DataTools-1.0.0-linux-x86_64.AppImage
```

## CI build (recommended once the spec is stable)

`.github/workflows/build.yml` (template):

```yaml
name: Build installers
on:
  workflow_dispatch:
  push:
    tags: [ 'v*' ]
jobs:
  build:
    strategy:
      matrix:
        os: [macos-latest, windows-latest, ubuntu-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -r requirements.txt pyinstaller
      - run: pyinstaller build/datatools.spec --clean
      - uses: actions/upload-artifact@v4
        with:
          name: DataTools-${{ matrix.os }}
          path: dist/
```

Mac code-signing in CI requires the cert + private key as a GitHub secret (encoded with `base64`). Detailed walkthrough belongs in a later doc — for v1, sign locally and upload to GitHub Releases.

## Tesseract bundling (PDF Extractor OCR)

Frozen artifacts ship a per-platform Tesseract binary plus the English `eng.traineddata` model so scanned-PDF support in the PDF Extractor works out of the box — no separate user install. Source / pip developer setups still need system Tesseract on `PATH`.

**Layout inside the bundle**:

```
DataTools/  (or DataTools.app/Contents/MacOS/)
└── tesseract/
    ├── tesseract         (Linux/macOS binary; tesseract.exe on Windows)
    └── tessdata/
        └── eng.traineddata
```

The runtime resolver (in `src/`, owned by the runtime team) walks:

1. `DATATOOLS_TESSERACT_BIN` env var override.
2. `Path(sys._MEIPASS) / "tesseract" / "tesseract[.exe]"` — frozen bundles only.
3. `tesseract` on `PATH`.
4. Windows well-known paths.

**Where the bytes come from**:

- **Tessdata** — vendored in-repo at `build/vendor/tessdata/eng.traineddata` (sourced from [tessdata_best](https://github.com/tesseract-ocr/tessdata_best)). `datatools.spec` copies it into `tesseract/tessdata/`.
- **Binary** — fetched per-platform at build time by `build/tesseract.py` from pinned upstream URLs. Current pin: **Tesseract 5.5.0**. CI imports `fetch_tessdata` + `fetch_tesseract_for_platform` from this module before PyInstaller.

**Updating Tesseract**:

1. Bump the version pin and the per-platform fetch URLs in `build/tesseract.py`.
2. If the model schema changed upstream, refresh `build/vendor/tessdata/eng.traineddata` from `tessdata_best` at the matching tag.
3. Push a `v*` tag so CI rebuilds all three platforms, then smoke-test a scanned PDF through the PDF Extractor.
4. Update `LICENSE_TESSERACT.txt` at the repo root if upstream license terms change (Apache-2.0 today).

License attribution for the bundled binary lives at `LICENSE_TESSERACT.txt` at the repo root — it must ship alongside any binary that contains Tesseract.

## Common pitfalls

| Symptom | Fix |
|---|---|
| Bundle is 800+ MB | Check the `excludes` list in `datatools.spec`. `matplotlib` / `scipy` / `tkinter` are the usual suspects. |
| App launches, browser opens, page is blank | Streamlit's static assets aren't bundled. Re-run with `--log-level=DEBUG` and confirm the static dir was collected by `collect_data_files('streamlit')`. |
| App launches but logs `ImportError: streamlit.runtime.X` | Add `X` to `hidden_imports` in the spec or to `hook-streamlit.py`. |
| Mac Gatekeeper says "DataTools is damaged and can't be opened" | The bundle wasn't signed + notarized. Don't ship to buyers without these — see the Mac recipe above. |
| Windows SmartScreen blocks first launch | Buyer clicks "More info → Run anyway". Code-signing reduces but doesn't eliminate this; for v1 it's an accepted friction. |
| Bundle works on dev machine but crashes on a clean machine | Likely a missing C runtime. On Windows, bundle the [VC++ redistributable](https://aka.ms/vs/17/release/vc_redist.x64.exe) into the installer alongside the app. |

## Testing the bundle

Smoke-test on a **clean** machine (or VM) — your dev machine has too much state to trust:

```
1. Boot a clean Mac / Win / Linux VM.
2. Copy the .dmg / .exe / .AppImage onto it.
3. Install / drag-drop into Applications / chmod +x.
4. Double-click the app icon.
5. Browser should open to http://127.0.0.1:850x within 5 seconds.
6. Drop samples/demo/shopify_pet_customers.csv into the Automated
   Workflows page; click Run; AFTER preview should appear.
7. Confirm in the network tab: zero outbound calls except to
   127.0.0.1 and the Streamlit static asset paths (also local).
```

Step 7 is the privacy-claim integrity check from `docs/POST-LAUNCH.md` §6 — do this once per release, then trust it.

## Versioning

Bump the version string in three places per release:

- `datatools.spec` (CFBundleVersion + CFBundleShortVersionString)
- the Inno Setup `AppVersion` line
- the AppImage filename

A single source of truth (e.g. `src/__init__.py`) is a future refactor — for v1 the three-spot update is fine.
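
One way to collapse the three-spot update into a single source of truth is to have each build script read `__version__` from `src/__init__.py` without importing the package. The following is a minimal sketch, assuming the version is assigned as a plain string literal (for example `__version__ = "1.0.0"`). `parse_version` and `read_version` are hypothetical helpers, not the code `datatools.spec` actually uses.

```python
import re
from pathlib import Path

# Matches a module-level assignment such as: __version__ = "1.2.3"
_VERSION_RE = re.compile(r"""^__version__\s*=\s*['"]([^'"]+)['"]""", re.MULTILINE)


def parse_version(source):
    """Extract the __version__ string from Python source text."""
    match = _VERSION_RE.search(source)
    if match is None:
        raise ValueError("no __version__ string literal found")
    return match.group(1)


def read_version(init_path="src/__init__.py"):
    """Read __version__ from a file without importing the package."""
    return parse_version(Path(init_path).read_text(encoding="utf-8"))
```

Parsing the file as text avoids importing `src` (and its dependencies) at build time. The spec, the Inno Setup version, and the AppImage filename could then all be derived from the one returned value.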