Build — DataTools desktop installer

Cross-platform PyInstaller bundle for Mac / Windows / Linux. The single deliverable the buyer downloads from Gumroad. Owner: Michael · Updated: 2026-05-01

This directory is the build pipeline. Source of truth for the bundle shape, hidden-import lists, per-platform recipes, and the launcher that boots Streamlit inside the bundle.

Files

build/
├── launcher.py           Entry point PyInstaller wraps. Boots a local
│                         Streamlit server, opens browser, locks server
│                         to 127.0.0.1 so the privacy claim holds.
├── datatools.spec        PyInstaller spec — hidden imports, data files,
│                         Mac .app bundle config. Reads the version
│                         from src/__init__.py.
├── installer.iss         Inno Setup script — Windows .exe installer.
│                         Adds Start Menu + Desktop + App Paths entries.
├── generate_icons.py     Builds icon.ico / icon.icns / icon.png from
│                         src/gui/assets/datatools_icon_256.png. Run
│                         once before pyinstaller (CI does this).
├── tesseract.py          Fetches the per-platform Tesseract binary +
│                         eng.traineddata at build time. CI imports
│                         fetch_tessdata + fetch_tesseract_for_platform.
├── macos/
│   └── build_dmg.sh      Wraps dist/DataTools.app into a .dmg with a
│                         drag-to-/Applications layout (installer).
├── appimage/
│   ├── AppRun            Entry point invoked when the AppImage runs.
│   ├── datatools.desktop Linux desktop-entry metadata.
│   └── build.sh          Wraps dist/DataTools/ into an .AppImage.
├── hooks/                PyInstaller hooks for libs the static analyser
│   └── hook-streamlit.py misses (Streamlit's dynamic imports).
├── icon.{ico,icns,png}   Generated by generate_icons.py — gitignored.
└── README.md             this file
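
As a rough illustration of what launcher.py is described as doing (boot a local Streamlit server bound to 127.0.0.1 and open the browser), here is a minimal sketch. The function names, the flag set, and any entry-script path passed in are assumptions for illustration, not the contents of the real launcher.

```python
import socket
import sys


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port on the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS choose
        return sock.getsockname()[1]


def build_streamlit_argv(app_path: str, port: int) -> list[str]:
    """Build the argv for `streamlit run`, pinned to loopback only."""
    return [
        "streamlit", "run", app_path,
        "--server.address", "127.0.0.1",   # never listen on external interfaces
        "--server.port", str(port),
        "--server.headless", "false",      # let Streamlit open the browser
        "--browser.gatherUsageStats", "false",
    ]


def launch(app_path: str) -> None:
    """Start Streamlit in-process; app_path is a hypothetical entry script."""
    sys.argv = build_streamlit_argv(app_path, pick_free_port())
    from streamlit.web import cli as stcli  # imported late: only needed at run time
    sys.exit(stcli.main())
```

Binding to 127.0.0.1 rather than 0.0.0.0 is what keeps the server unreachable from other machines, which is the property the privacy claim depends on.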

Distribution outputs per platform

Each CI run produces one installer per platform:

| Platform | Installer |
| --- | --- |
| macOS | DataTools-<ver>-mac.dmg |
| Windows | DataTools-<ver>-win-setup.exe |
| Linux | DataTools-<ver>-linux-x86_64.AppImage (already portable) |

All three outputs are self-contained: every dependency (Python, pandas, streamlit, pdfplumber, Tesseract OCR + eng.traineddata, the lot) is frozen into the bundle. The buyer does not need to install Python, pip, Tesseract, or anything else first. With Tesseract bundled, each artifact is roughly 250-300 MB on disk (up from ~120 MB pre-OCR); unpacked installs run ~300-400 MB once scratch space is counted.

Easy-launch surface

| Affordance | Windows | macOS |
| --- | --- | --- |
| Desktop shortcut | Inno Setup desktopicon task (checked by default) | The .app bundle in /Applications is the icon |
| App menu | Start Menu → DataTools (always installed) | Launchpad + Spotlight (auto from /Applications) |
| Taskbar / Dock | User pins manually (OS forbids programmatic pin) | User pins manually after first launch |
| Run from terminal | DataTools (registered via App Paths) | open -a DataTools (auto from .app bundle) |

CI: .github/workflows/build.yml runs the full pipeline on tag push (matrix: macos-latest, windows-latest, ubuntu-latest) and attaches the resulting installers to a GitHub Release. Manual workflow_dispatch runs upload them as workflow artifacts only.

Releasing

CI build (push tag → GitHub Release) — the release process

Releases are built by GitHub Actions (.github/workflows/build.yml), not on a developer's machine. The matrix runs on macos-latest / windows-latest / ubuntu-latest, stages Tesseract (build/tesseract.py), runs PyInstaller, packages the per-platform installer, and attaches it to a GitHub Release on tag push:

  1. Bump __version__ in src/__init__.py.
  2. git commit -am "release: vX.Y.Z" && git tag vX.Y.Z.
  3. git push && git push --tags.
  4. CI builds all three platforms and creates a Release with the installers attached.
  5. Mirror the Release assets to Gumroad (manual until v2).
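
Steps 1 and 2 rely on the tag and __version__ agreeing. A small pre-tag check along these lines could catch a mismatch before pushing. The function name and the strict vX.Y.Z pattern are assumptions, not part of the existing pipeline.

```python
import re

# Tags are assumed to look exactly like "v1.2.3".
_TAG_RE = re.compile(r"^v(\d+\.\d+\.\d+)$")


def tag_matches_version(tag: str, version: str) -> bool:
    """Return True when a git tag like 'v1.2.3' matches version '1.2.3'."""
    match = _TAG_RE.match(tag)
    return bool(match) and match.group(1) == version
```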

A manual workflow_dispatch run does the same build but uploads the installers as workflow artifacts instead of creating a Release — useful for smoke-testing a build without cutting a tag.

Local build (single platform, for testing)

PyInstaller can't cross-compile, so a local build produces only the current OS's installer. This mirrors what CI does, by hand — use it to debug the bundle before tagging. See the per-platform recipes below for the exact commands; the short version is:

pip install -r requirements.txt
pip install pyinstaller pillow
python build/generate_icons.py
python -c "import sys; sys.path.insert(0,'build'); \
  from tesseract import fetch_tessdata, fetch_tesseract_for_platform; \
  fetch_tessdata(); fetch_tesseract_for_platform('mac')"   # win / mac / linux
pyinstaller build/datatools.spec --clean --noconfirm
# then run the matching packager: build/macos/build_dmg.sh,
# build/installer.iss (iscc), or build/appimage/build.sh
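
The inline python -c call above can also be written as a small script that picks the platform key automatically. This is a sketch: the 'win' / 'mac' / 'linux' keys come from the comment above, while platform_key and stage_tesseract are hypothetical helper names, and the script assumes it is run from the repo root.

```python
import sys


def platform_key(sys_platform: str = sys.platform) -> str:
    """Map sys.platform to the key passed to fetch_tesseract_for_platform."""
    if sys_platform.startswith("win"):
        return "win"
    if sys_platform == "darwin":
        return "mac"
    if sys_platform.startswith("linux"):
        return "linux"
    raise ValueError(f"unsupported platform: {sys_platform}")


def stage_tesseract() -> None:
    """Fetch tessdata and the Tesseract binary for the current OS."""
    sys.path.insert(0, "build")  # same path trick as the one-liner
    from tesseract import fetch_tessdata, fetch_tesseract_for_platform
    fetch_tessdata()
    fetch_tesseract_for_platform(platform_key())
```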

Signing (Phase 2 — needs accounts/credentials)

Both code-signing steps are intentionally not in CI yet because they require credentials the owner sets up first.

macOS — Apple Developer Program enrollment ($99/yr). Once enrolled, add these GitHub Secrets and uncomment the codesign + notarytool steps in build.yml:

| Secret | Value |
| --- | --- |
| MACOS_DEVELOPER_ID_CERT_P12_BASE64 | base64-encoded .p12 cert |
| MACOS_DEVELOPER_ID_CERT_PASSWORD | password for the .p12 |
| MACOS_NOTARY_APPLE_ID | Apple ID email |
| MACOS_NOTARY_TEAM_ID | 10-char team ID |
| MACOS_NOTARY_PASSWORD | app-specific password |

Windows — Code-signing cert from Sectigo / DigiCert (~$200-400/yr, or ~$300-500 for an EV cert that bypasses SmartScreen). Add:

| Secret | Value |
| --- | --- |
| WINDOWS_CERT_PFX_BASE64 | base64-encoded .pfx cert |
| WINDOWS_CERT_PASSWORD | password for the .pfx |
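
Both *_BASE64 secrets expect the certificate file as base64 text. One minimal way to produce that value in Python, assuming the .p12 / .pfx has already been exported (encode_cert is a hypothetical helper name):

```python
import base64
from pathlib import Path


def encode_cert(path: str) -> str:
    """Return the file's bytes as single-line base64 text for a GitHub Secret."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")
```

The result is a single line with no wrapping, so it can be pasted into the secret field as-is.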

Until those are wired, buyers will see:

  • macOS: "DataTools is damaged and can't be opened". Fix by removing the quarantine attribute (xattr -cr /Applications/DataTools.app). Acceptable for the technical buyer; blocking for the non-technical buyer. Don't ship to non-technical buyers without notarization.
  • Windows: SmartScreen "Windows protected your PC" — buyer clicks "More info → Run anyway". Friction but not blocking.
  • Linux: AppImage runs without complaint (Linux has no equivalent trust-store).

Per-platform recipe

Each platform builds on its own machine because PyInstaller does not cross-compile. Pick the platform that matches the bundle you need. GitHub Actions matrix runners are the simplest way to produce all three from one push (see "CI build" under Releasing, above).

Mac (Intel + Apple Silicon, universal2)

# One-time:
pyenv install 3.12
pyenv local 3.12
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install pyinstaller

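# Build (on a fresh clone, first run build/generate_icons.py, which needs
# pillow, and stage Tesseract as shown under "Local build" above):
pyinstaller build/datatools.spec --clean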

# Output:
#   dist/DataTools/         — folder mode (faster cold start)
#   dist/DataTools.app/     — macOS .app bundle (drag-drop into /Applications)

# Sign + notarize (after Apple Developer Program enrollment per BUSINESS.md §10):
codesign --deep --force --options runtime \
  --sign "Developer ID Application: <YOUR-NAME> (<TEAMID>)" \
  dist/DataTools.app

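# Notarize (notarytool accepts .zip, .dmg, or .pkg, so zip the .app first):
ditto -c -k --keepParent dist/DataTools.app dist/DataTools-notarize.zip
xcrun notarytool submit dist/DataTools-notarize.zip \
  --apple-id "<YOUR-APPLE-ID>" \
  --team-id  "<TEAMID>" \
  --password "<APP-SPECIFIC-PASSWORD>" \
  --wait

# Staple the notarization ticket so Gatekeeper sees it offline:
xcrun stapler staple dist/DataTools.app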

# Wrap for distribution:
hdiutil create -volname "DataTools" -srcfolder dist/DataTools.app \
  -ov -format UDZO dist/DataTools-1.0.0-mac.dmg

Windows

# One-time:
py -3.12 -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
pip install pyinstaller

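# Build (on a fresh clone, first run build\generate_icons.py, which needs
# pillow, and stage Tesseract as shown under "Local build" above):
pyinstaller build\datatools.spec --clean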

# Output:
#   dist\DataTools\          — folder mode
#   dist\DataTools\DataTools.exe

# Wrap with Inno Setup (free):
#   1. Install Inno Setup (https://jrsoftware.org/isdl.php)
#   2. Use build\installer.iss (next to this README). A minimal script looks like:
#        ; relative paths resolve from the script's folder (build\)
#        [Setup]
#        AppName=DataTools
#        AppVersion=1.0.0
#        DefaultDirName={autopf}\DataTools
#        OutputDir=..\dist
#        OutputBaseFilename=DataTools-1.0.0-win-setup
#        Compression=lzma
#        SolidCompression=yes
#        [Files]
#        Source: "..\dist\DataTools\*"; DestDir: "{app}"; Flags: recursesubdirs
#        [Icons]
#        Name: "{autoprograms}\DataTools"; Filename: "{app}\DataTools.exe"
#   3. Compile: ISCC.exe build\installer.iss

# Code-sign (optional but reduces SmartScreen warnings):
#   Use signtool with a code-signing cert (Sectigo / DigiCert).
#   Without signing, buyer sees "Windows protected your PC" once;
#   they click "More info → Run anyway." Acceptable for v1.

Linux (AppImage)

python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install pyinstaller

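# On a fresh clone, first run build/generate_icons.py (needs pillow) and
# stage Tesseract as shown under "Local build" above.
pyinstaller build/datatools.spec --clean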
# dist/DataTools/ — folder mode

# Wrap as AppImage (single-file portable app):
#   1. Download appimagetool from https://appimage.org/
#   2. Set up the AppDir layout:
#        DataTools.AppDir/
#        ├── AppRun                     -> ./DataTools/DataTools
#        ├── DataTools.desktop          (icon + entry config)
#        ├── icon.png
#        └── usr/bin/                   -> dist/DataTools/*
#   3. ./appimagetool DataTools.AppDir dist/DataTools-1.0.0-linux-x86_64.AppImage

.github/workflows/build.yml (template):

name: Build installers
on:
  workflow_dispatch:
  push:
    tags: [ 'v*' ]
jobs:
  build:
    strategy:
      matrix:
        os: [macos-latest, windows-latest, ubuntu-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -r requirements.txt pyinstaller
      - run: pyinstaller build/datatools.spec --clean
      - uses: actions/upload-artifact@v4
        with:
          name: DataTools-${{ matrix.os }}
          path: dist/

Mac code-signing in CI requires the cert + private key as a GitHub secret (encoded with base64). Detailed walkthrough belongs in a later doc — for v1, sign locally and upload to GitHub Releases.

Tesseract bundling (PDF Extractor OCR)

Frozen artifacts ship a per-platform Tesseract binary plus the English eng.traineddata model so scanned-PDF support in the PDF Extractor works out of the box — no separate user install. Source / pip developer setups still need system Tesseract on PATH.

Layout inside the bundle:

DataTools/                  (or DataTools.app/Contents/MacOS/)
└── tesseract/
    ├── tesseract           (Linux/macOS binary; tesseract.exe on Windows)
    └── tessdata/
        └── eng.traineddata
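
A quick way to confirm a built bundle actually contains this layout is to check for the two files before packaging. The sketch below assumes the layout shown above; missing_tesseract_files is a hypothetical helper, not something the build scripts currently provide.

```python
from pathlib import Path


def missing_tesseract_files(bundle_root: str, windows: bool = False) -> list[str]:
    """List expected Tesseract files that are absent under bundle_root."""
    binary = "tesseract.exe" if windows else "tesseract"
    expected = [
        Path("tesseract") / binary,
        Path("tesseract") / "tessdata" / "eng.traineddata",
    ]
    root = Path(bundle_root)
    return [p.as_posix() for p in expected if not (root / p).is_file()]
```

An empty list means both the binary and the English model were collected. For example, missing_tesseract_files("dist/DataTools") could be run after PyInstaller on Linux.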

The runtime resolver (in src/, owned by the runtime team) walks:

  1. DATATOOLS_TESSERACT_BIN env var override.
  2. Path(sys._MEIPASS) / "tesseract" / "tesseract[.exe]" — frozen bundles only.
  3. tesseract on PATH.
  4. Windows well-known paths.
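
The lookup order above can be sketched as follows. This is not the resolver in src/: the function name is made up, the Windows paths are assumed examples of "well-known" locations, and the sketch trusts the env var override without checking that the file exists.

```python
import os
import shutil
import sys
from pathlib import Path
from typing import Optional

# Assumed well-known install locations; the real list lives in src/.
WINDOWS_WELL_KNOWN = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
]


def resolve_tesseract() -> Optional[str]:
    """Return the first Tesseract path found in the documented order, or None."""
    # 1. Explicit override.
    override = os.environ.get("DATATOOLS_TESSERACT_BIN")
    if override:
        return override
    # 2. Binary shipped inside a frozen PyInstaller bundle.
    meipass = getattr(sys, "_MEIPASS", None)
    if meipass:
        name = "tesseract.exe" if os.name == "nt" else "tesseract"
        bundled = Path(meipass) / "tesseract" / name
        if bundled.is_file():
            return str(bundled)
    # 3. System install on PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # 4. Windows fallbacks.
    if os.name == "nt":
        for candidate in WINDOWS_WELL_KNOWN:
            if Path(candidate).is_file():
                return candidate
    return None
```

Using getattr for sys._MEIPASS keeps the same function usable in source checkouts, where that attribute does not exist.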

Where the bytes come from:

  • Tessdata — vendored in-repo at build/vendor/tessdata/eng.traineddata (sourced from tessdata_best). datatools.spec copies it into tesseract/tessdata/.
  • Binary — fetched per-platform at build time by build/tesseract.py from pinned upstream URLs. Current pin: Tesseract 5.5.0. CI imports fetch_tessdata + fetch_tesseract_for_platform from this module before PyInstaller.

Updating Tesseract:

  1. Bump the version pin and the per-platform fetch URLs in build/tesseract.py.
  2. If the model schema changed upstream, refresh build/vendor/tessdata/eng.traineddata from tessdata_best at the matching tag.
  3. Push a v* tag so CI rebuilds all three platforms, then smoke-test a scanned PDF through the PDF Extractor.
  4. Update LICENSE_TESSERACT.txt at the repo root if upstream license terms change (Apache-2.0 today).

License attribution for the bundled binary lives at LICENSE_TESSERACT.txt at the repo root — it must ship alongside any binary that contains Tesseract.

Common pitfalls

| Symptom | Fix |
| --- | --- |
| Bundle is 800+ MB | Check the excludes list in datatools.spec. matplotlib / scipy / tkinter are the usual suspects. |
| App launches, browser opens, page is blank | Streamlit's static assets aren't bundled. Re-run with --log-level=DEBUG and confirm the static dir was collected by collect_data_files('streamlit'). |
| App launches but logs ImportError: streamlit.runtime.X | Add X to hidden_imports in the spec or to hook-streamlit.py. |
| Mac Gatekeeper says "DataTools is damaged and can't be opened" | The bundle wasn't signed + notarized. Don't ship to buyers without these; see the Mac recipe above. |
| Windows SmartScreen blocks first launch | Buyer clicks "More info → Run anyway". Code-signing reduces but doesn't eliminate this; for v1 it's an accepted friction. |
| Bundle works on dev machine but crashes on a clean machine | Likely a missing C runtime. On Windows, ship the VC++ redistributable inside the installer alongside the bundle. |
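
For the oversized-bundle case, it helps to see which top-level entries take the space before editing the excludes list. A small sketch (the function name is illustrative, and the directory to inspect is whatever PyInstaller wrote under dist/):

```python
from pathlib import Path


def largest_entries(bundle_dir: str, top: int = 10) -> list[tuple[str, int]]:
    """Return (name, total_bytes) for the biggest top-level entries, largest first."""
    sizes = []
    for entry in Path(bundle_dir).iterdir():
        if entry.is_dir():
            # Sum every regular file below this directory.
            total = sum(f.stat().st_size for f in entry.rglob("*") if f.is_file())
        else:
            total = entry.stat().st_size
        sizes.append((entry.name, total))
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]
```

If the bundle keeps its libraries in an _internal/ folder (the default onedir layout in PyInstaller 6 and later), point the function at that folder instead of the bundle root.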

Testing the bundle

Smoke-test on a clean machine (or VM) — your dev machine has too much state to trust:

1. Boot a clean Mac / Win / Linux VM.
2. Copy the .dmg / .exe / .AppImage onto it.
3. Install / drag-drop into Applications / chmod +x.
4. Double-click the app icon.
5. Browser should open to http://127.0.0.1:850x within 5 seconds.
6. Drop samples/demo/shopify_pet_customers.csv into the
   Automated Workflows page; click Run; AFTER preview should appear.
7. Confirm in the network tab: zero outbound calls except to
   127.0.0.1 and the Streamlit static asset paths (also local).

Step 7 is the privacy-claim integrity check from docs/POST-LAUNCH.md §6 — do this once per release, then trust it.
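
Step 5 of the checklist (server reachable on loopback within 5 seconds) can be checked from a script instead of by eye. The sketch below only tests that something accepts TCP connections on the port; the function name is hypothetical and the port must be supplied by the caller, since the checklist only gives it as 850x.

```python
import socket
import time


def wait_for_port(port: int, host: str = "127.0.0.1", timeout: float = 5.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            time.sleep(0.1)  # not listening yet; retry until the deadline
    return False
```

For example, wait_for_port(8501) would cover Streamlit's default port. It does not replace the network-tab check in step 7.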

Versioning

Check the version string in three places per release:

  • datatools.spec (CFBundleVersion + CFBundleShortVersionString). The spec reads the version from src/__init__.py (see Files above), so these follow __version__ once it is bumped.
  • the Inno Setup AppVersion line
  • the AppImage filename

Driving all three from src/__init__.py as a single source of truth is a future refactor; for v1 the three-spot check is fine.
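
One shape the single-source refactor could take is to parse __version__ out of src/__init__.py and derive every versioned name from it. This is a sketch under that assumption; read_version and installer_names are hypothetical helpers, and the filename patterns are the ones listed under "Distribution outputs per platform".

```python
import re


def read_version(init_text: str) -> str:
    """Extract __version__ from the text of src/__init__.py."""
    match = re.search(
        r"""^__version__\s*=\s*["']([^"']+)["']""", init_text, re.MULTILINE
    )
    if match is None:
        raise ValueError("__version__ not found")
    return match.group(1)


def installer_names(version: str) -> dict[str, str]:
    """Derive the per-platform installer filenames from one version string."""
    return {
        "mac": f"DataTools-{version}-mac.dmg",
        "win": f"DataTools-{version}-win-setup.exe",
        "linux": f"DataTools-{version}-linux-x86_64.AppImage",
    }
```

Parsing the file as text avoids importing the package at build time, which matters when its dependencies are not yet installed.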