# Build — DataTools desktop installer

> Cross-platform PyInstaller bundle for Mac / Windows / Linux. The
> single deliverable the buyer downloads from Gumroad.
> **Owner**: Michael · **Updated**: 2026-05-01

This directory is the build pipeline. Source of truth for the bundle
shape, hidden-import lists, per-platform recipes, and the launcher
that boots Streamlit inside the bundle.

## Files

```
build/
├── launcher.py             Entry point PyInstaller wraps. Boots a local
│                           Streamlit server, opens browser, locks server
│                           to 127.0.0.1 so the privacy claim holds.
├── datatools.spec          PyInstaller spec — hidden imports, data files,
│                           Mac .app bundle config. Reads the version
│                           from src/__init__.py.
├── installer.iss           Inno Setup script — Windows .exe installer.
│                           Adds Start Menu + Desktop + App Paths entries.
├── generate_icons.py       Builds icon.ico / icon.icns / icon.png from
│                           src/gui/assets/datatools_icon_256.png. Run
│                           once before pyinstaller (CI does this).
├── tesseract.py            Fetches the per-platform Tesseract binary +
│                           eng.traineddata at build time. CI imports
│                           fetch_tessdata + fetch_tesseract_for_platform.
├── macos/
│   └── build_dmg.sh        Wraps dist/DataTools.app into a .dmg with a
│                           drag-to-/Applications layout (installer).
├── appimage/
│   ├── AppRun              Entry point invoked when the AppImage runs.
│   ├── datatools.desktop   Linux desktop-entry metadata.
│   └── build.sh            Wraps dist/DataTools/ into an .AppImage.
├── hooks/                  PyInstaller hooks for libs the static analyser
│   └── hook-streamlit.py   misses (Streamlit's dynamic imports).
├── icon.{ico,icns,png}     Generated by generate_icons.py — gitignored.
└── README.md               this file
```

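As a rough illustration of what the `launcher.py` description above implies, the sketch below picks a local port and builds a Streamlit command line pinned to 127.0.0.1. It is not the actual launcher: the function names, the candidate port range (8501 to 8509), and the `app.py` path in the usage note are assumptions made for the example. The `--server.*` and `--browser.gatherUsageStats` flags are standard Streamlit config options.

```python
import socket
from typing import Iterable, List


def first_free_port(candidates: Iterable[int], host: str = "127.0.0.1") -> int:
    """Return the first candidate port that can be bound on the loopback host.

    Hypothetical helper. Passing 0 as a candidate lets the OS choose a port.
    """
    for port in candidates:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is taken, try the next one
            return sock.getsockname()[1]
    raise RuntimeError("no free port among the candidates")


def streamlit_args(app_path: str, port: int) -> List[str]:
    """Build argv for a Streamlit server that only listens on loopback."""
    return [
        "streamlit", "run", app_path,
        "--server.address", "127.0.0.1",   # never bind to 0.0.0.0
        "--server.port", str(port),
        "--server.headless", "true",       # the launcher opens the browser itself
        "--browser.gatherUsageStats", "false",
    ]
```

For example, `streamlit_args("app.py", first_free_port(range(8501, 8510)))` yields an argument list the launcher could hand to Streamlit before opening `http://127.0.0.1:<port>` in the browser.
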
## Distribution outputs per platform

Each CI run produces one installer per platform:

| Platform | Installer |
|----------|-----------|
| macOS    | `DataTools-<ver>-mac.dmg` |
| Windows  | `DataTools-<ver>-win-setup.exe` |
| Linux    | `DataTools-<ver>-linux-x86_64.AppImage` (already portable) |

All three outputs are self-contained: every dependency (Python, pandas,
streamlit, pdfplumber, **Tesseract OCR + `eng.traineddata`**, the lot)
is frozen into the bundle. The buyer does not need to install Python,
pip, Tesseract, or anything else first. With Tesseract bundled, each
artifact is roughly **250–300 MB** on disk (up from ~120 MB pre-OCR);
unpacked installs run ~300–400 MB once scratch space is counted.

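The naming scheme in the table can be expressed as a small lookup, which is handy when a script needs to check that a release has all three assets. This is a sketch only: the `mac` / `win` / `linux` keys and the helper name are assumptions for illustration, not names taken from the build scripts.

```python
# Filename patterns copied from the table above; the keys are hypothetical.
ARTIFACT_PATTERNS = {
    "mac": "DataTools-{ver}-mac.dmg",
    "win": "DataTools-{ver}-win-setup.exe",
    "linux": "DataTools-{ver}-linux-x86_64.AppImage",
}


def artifact_name(platform_key: str, version: str) -> str:
    """Return the installer filename for one platform and version."""
    try:
        pattern = ARTIFACT_PATTERNS[platform_key]
    except KeyError:
        raise ValueError(f"unknown platform key: {platform_key!r}") from None
    return pattern.format(ver=version)


def expected_artifacts(version: str) -> list:
    """Return all three installer filenames, sorted by platform key."""
    return [artifact_name(key, version) for key in sorted(ARTIFACT_PATTERNS)]
```
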
## Easy-launch surface

| Affordance        | Windows                                          | macOS                                           |
|-------------------|--------------------------------------------------|-------------------------------------------------|
| Desktop shortcut  | Inno Setup `desktopicon` task (checked default)  | The .app bundle in /Applications is the icon    |
| App menu          | Start Menu → DataTools (always installed)        | Launchpad + Spotlight (auto from /Applications) |
| Taskbar / Dock    | User pins manually (OS forbids programmatic pin) | User pins manually after first launch           |
| Run from terminal | `DataTools` (registered via App Paths)           | `open -a DataTools` (auto from .app bundle)     |

CI: `.github/workflows/build.yml` runs the full pipeline on tag push
(matrix: macos-latest, windows-latest, ubuntu-latest) and attaches
the resulting installers to a GitHub Release. Manual
`workflow_dispatch` runs upload them as workflow artifacts only.

## Releasing

### CI build (push tag → GitHub Release) — the release process

Releases are built by GitHub Actions (`.github/workflows/build.yml`),
not on a developer's machine. The matrix runs on
macos-latest / windows-latest / ubuntu-latest, stages Tesseract
(`build/tesseract.py`), runs PyInstaller, packages the per-platform
installer, and attaches it to a GitHub Release on tag push:

1. Bump `__version__` in `src/__init__.py`.
2. `git commit -am "release: vX.Y.Z" && git tag vX.Y.Z`.
3. `git push && git push --tags`.
4. CI builds all three platforms and creates a Release with the
   installers attached.
5. Mirror the Release assets to Gumroad (manual until v2).

A manual `workflow_dispatch` run does the same build but uploads the
installers as workflow artifacts instead of creating a Release —
useful for smoke-testing a build without cutting a tag.

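Steps 1 and 2 only work if the tag and `__version__` agree. A pre-tag check along the following lines could catch a mismatch. It is a sketch under assumptions: it presumes `src/__init__.py` assigns `__version__` as a plain string literal at the start of a line, and both function names are made up for the example.

```python
import re

# Matches a line such as: __version__ = "1.2.3"
VERSION_RE = re.compile(
    r"""^__version__\s*=\s*["']([^"']+)["']""", re.MULTILINE
)


def read_version(init_text: str) -> str:
    """Extract the version string from the text of src/__init__.py."""
    match = VERSION_RE.search(init_text)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


def tag_matches_version(tag: str, init_text: str) -> bool:
    """True when the tag is exactly 'v' followed by the package version."""
    return tag == f"v{read_version(init_text)}"
```

Feeding it `Path("src/__init__.py").read_text()` and the tag you are about to push gives a quick yes/no before step 3.
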
### Local build (single platform, for testing)

PyInstaller can't cross-compile, so a local build produces only the
current OS's installer. This mirrors what CI does, by hand — use it to
debug the bundle before tagging. See the per-platform recipes below for
the exact commands; the short version is:

```bash
pip install -r requirements.txt
pip install pyinstaller pillow
python build/generate_icons.py
python -c "import sys; sys.path.insert(0,'build'); \
  from tesseract import fetch_tessdata, fetch_tesseract_for_platform; \
  fetch_tessdata(); fetch_tesseract_for_platform('mac')"   # win / mac / linux
pyinstaller build/datatools.spec --clean --noconfirm
# then run the matching packager: build/macos/build_dmg.sh,
# build/installer.iss (iscc), or build/appimage/build.sh
```

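To avoid editing the platform argument by hand, the key can be derived from `sys.platform`. The sketch below assumes `fetch_tesseract_for_platform` accepts exactly the three strings hinted at in the comment above (`win`, `mac`, `linux`); the helper name `platform_key` is hypothetical.

```python
import sys


def platform_key(sys_platform: str = sys.platform) -> str:
    """Map a sys.platform value to the win / mac / linux key used above."""
    if sys_platform == "win32":
        return "win"
    if sys_platform == "darwin":
        return "mac"
    if sys_platform.startswith("linux"):
        return "linux"
    raise RuntimeError(f"unsupported platform: {sys_platform!r}")


# Intended use from the repo root (commented out because it needs the repo):
#   sys.path.insert(0, "build")
#   from tesseract import fetch_tessdata, fetch_tesseract_for_platform
#   fetch_tessdata()
#   fetch_tesseract_for_platform(platform_key())
```
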
## Signing (Phase 2 — needs accounts/credentials)

Both code-signing steps are intentionally not in CI yet because they
require credentials the owner sets up first.

**macOS** — Apple Developer Program enrollment ($99/yr). Once enrolled,
add these GitHub Secrets and uncomment the `codesign` + `notarytool`
steps in `build.yml`:

| Secret | Value |
|---|---|
| `MACOS_DEVELOPER_ID_CERT_P12_BASE64` | base64-encoded `.p12` cert |
| `MACOS_DEVELOPER_ID_CERT_PASSWORD` | password for the .p12 |
| `MACOS_NOTARY_APPLE_ID` | Apple ID email |
| `MACOS_NOTARY_TEAM_ID` | 10-char team ID |
| `MACOS_NOTARY_PASSWORD` | app-specific password |

**Windows** — Code-signing cert from Sectigo / DigiCert (~$200-400/yr,
or ~$300-500 for an EV cert that bypasses SmartScreen). Add:

| Secret | Value |
|---|---|
| `WINDOWS_CERT_PFX_BASE64` | base64-encoded `.pfx` cert |
| `WINDOWS_CERT_PASSWORD` | password for the .pfx |

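The two `*_BASE64` secrets hold the certificate file as base64 text. One way to produce that text, and to turn it back into a file inside a CI step, is sketched below. The function names are illustrative and nothing here is taken from `build.yml`.

```python
import base64
from pathlib import Path


def encode_cert_bytes(data: bytes) -> str:
    """Base64-encode raw certificate bytes into a single-line ASCII string."""
    return base64.b64encode(data).decode("ascii")


def decode_cert_text(secret_value: str) -> bytes:
    """Reverse of encode_cert_bytes: recover the raw certificate bytes."""
    return base64.b64decode(secret_value)


def encode_cert_file(cert_path: Path) -> str:
    """Read a .p12 / .pfx file and return the value to paste into the secret."""
    return encode_cert_bytes(cert_path.read_bytes())


def write_cert_file(secret_value: str, out_path: Path) -> None:
    """Write the decoded certificate to disk, e.g. on a CI runner."""
    out_path.write_bytes(decode_cert_text(secret_value))
```

The output is a single line with no embedded newlines, which keeps it easy to paste into the GitHub Secrets form.
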
Until those are wired, buyers will see:

- macOS: "DataTools is damaged and can't be opened" — fix by removing
  the quarantine attribute (`xattr -cr /Applications/DataTools.app`).
  Acceptable for the technical buyer; **blocking** for the
  non-technical buyer. Don't ship to non-technical without notarization.
- Windows: SmartScreen "Windows protected your PC" — buyer clicks
  "More info → Run anyway". Friction but not blocking.
- Linux: AppImage runs without complaint (Linux has no equivalent
  trust-store).

## Per-platform recipe

Each platform builds on its own machine — PyInstaller does **not**
cross-compile. Pick the platform that matches the bundle you need.
GitHub Actions matrix runners are the simplest way to produce all
three from one push (see "CI build" below).

### Mac (Intel + Apple Silicon, universal2)

```bash
# One-time:
pyenv install 3.12
pyenv local 3.12
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install pyinstaller

# Build:
pyinstaller build/datatools.spec --clean

# Output:
#   dist/DataTools/          — folder mode (faster cold start)
#   dist/DataTools.app/      — macOS .app bundle (drag-drop into /Applications)

# Sign + notarize (after Apple Developer Program enrollment per BUSINESS.md §10):
codesign --deep --force --options runtime \
  --sign "Developer ID Application: <YOUR-NAME> (<TEAMID>)" \
  dist/DataTools.app

# Notarize. notarytool takes a .zip / .dmg / .pkg upload, not a bare .app,
# so zip the signed bundle first:
ditto -c -k --keepParent dist/DataTools.app dist/DataTools.zip
xcrun notarytool submit dist/DataTools.zip \
  --apple-id "<YOUR-APPLE-ID>" \
  --team-id "<TEAMID>" \
  --password "<APP-SPECIFIC-PASSWORD>" \
  --wait

# Staple the notarization ticket so Gatekeeper sees it offline:
xcrun stapler staple dist/DataTools.app

# Wrap for distribution:
hdiutil create -volname "DataTools" -srcfolder dist/DataTools.app \
  -ov -format UDZO dist/DataTools-1.0.0-mac.dmg
```

### Windows

```powershell
# One-time:
py -3.12 -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
pip install pyinstaller

# Build:
pyinstaller build\datatools.spec --clean

# Output:
#   dist\DataTools\          — folder mode
#   dist\DataTools\DataTools.exe

# Wrap with Inno Setup (free):
#   1. Install Inno Setup (https://jrsoftware.org/isdl.php)
#   2. Use build\installer.iss (next to this README). A minimal script
#      looks like this; paths are relative to the script's directory:
#        [Setup]
#        AppName=DataTools
#        AppVersion=1.0.0
#        DefaultDirName={autopf}\DataTools
#        OutputDir=..\dist
#        OutputBaseFilename=DataTools-1.0.0-win-setup
#        Compression=lzma
#        SolidCompression=yes
#        [Files]
#        Source: "..\dist\DataTools\*"; DestDir: "{app}"; Flags: recursesubdirs
#        [Icons]
#        Name: "{autoprograms}\DataTools"; Filename: "{app}\DataTools.exe"
#   3. Compile:  ISCC.exe build\installer.iss

# Code-sign (optional but reduces SmartScreen warnings):
#   Use signtool with a code-signing cert (Sectigo / DigiCert).
#   Without signing, buyer sees "Windows protected your PC" once;
#   they click "More info → Run anyway." Acceptable for v1.
```

### Linux (AppImage)

```bash
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install pyinstaller

pyinstaller build/datatools.spec --clean
#   dist/DataTools/          — folder mode

# Wrap as AppImage (single-file portable app):
#   1. Download appimagetool from https://appimage.org/
#   2. Set up the AppDir layout:
#        DataTools.AppDir/
#        ├── AppRun               -> ./DataTools/DataTools
#        ├── DataTools.desktop    (icon + entry config)
#        ├── icon.png
#        └── usr/bin/             -> dist/DataTools/*
#   3. ./appimagetool DataTools.AppDir dist/DataTools-1.0.0-linux-x86_64.AppImage
```

## CI build (recommended once the spec is stable)

`.github/workflows/build.yml` (template):

```yaml
name: Build installers
on:
  workflow_dispatch:
  push:
    tags: [ 'v*' ]
jobs:
  build:
    strategy:
      matrix:
        os: [macos-latest, windows-latest, ubuntu-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -r requirements.txt pyinstaller
      - run: pyinstaller build/datatools.spec --clean
      - uses: actions/upload-artifact@v4
        with:
          name: DataTools-${{ matrix.os }}
          path: dist/
```

Mac code-signing in CI requires the cert + private key as a GitHub
secret (encoded with `base64`). Detailed walkthrough belongs in a
later doc — for v1, sign locally and upload to GitHub Releases.

## Tesseract bundling (PDF Extractor OCR)

Frozen artifacts ship a per-platform Tesseract binary plus the English
`eng.traineddata` model so scanned-PDF support in the PDF Extractor
works out of the box — no separate user install. Source / pip
developer setups still need system Tesseract on `PATH`.

**Layout inside the bundle**:

```
DataTools/                  (or DataTools.app/Contents/MacOS/)
└── tesseract/
    ├── tesseract           (Linux/macOS binary; tesseract.exe on Windows)
    └── tessdata/
        └── eng.traineddata
```

The runtime resolver (in `src/`, owned by the runtime team) walks:

1. `DATATOOLS_TESSERACT_BIN` env var override.
2. `Path(sys._MEIPASS) / "tesseract" / "tesseract[.exe]"` — frozen
   bundles only.
3. `tesseract` on `PATH`.
4. Windows well-known paths.

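The lookup order above can be sketched as follows. This is not the resolver in `src/`; it is a minimal illustration. The function name, the concrete Windows install paths, and the choice to return the env-var override without checking that it exists are all assumptions of the sketch.

```python
import os
import shutil
import sys
from pathlib import Path
from typing import Optional

# Assumed locations: the list above only says "Windows well-known paths".
WINDOWS_WELL_KNOWN = (
    Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe"),
    Path(r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"),
)


def resolve_tesseract() -> Optional[Path]:
    """Return the Tesseract binary to use, or None if none is found."""
    exe = "tesseract.exe" if sys.platform == "win32" else "tesseract"

    # 1. Explicit override wins (returned as-is in this sketch).
    override = os.environ.get("DATATOOLS_TESSERACT_BIN")
    if override:
        return Path(override)

    # 2. Frozen bundle: PyInstaller sets sys._MEIPASS at runtime.
    meipass = getattr(sys, "_MEIPASS", None)
    if meipass:
        bundled = Path(meipass) / "tesseract" / exe
        if bundled.is_file():
            return bundled

    # 3. System install on PATH.
    found = shutil.which("tesseract")
    if found:
        return Path(found)

    # 4. Common Windows install directories.
    if sys.platform == "win32":
        for candidate in WINDOWS_WELL_KNOWN:
            if candidate.is_file():
                return candidate
    return None
```

Putting the override first means a developer can point a frozen build at a different Tesseract without rebuilding.
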
**Where the bytes come from**:

- **Tessdata** — vendored in-repo at `build/vendor/tessdata/eng.traineddata`
  (sourced from [tessdata_best](https://github.com/tesseract-ocr/tessdata_best)).
  `datatools.spec` copies it into `tesseract/tessdata/`.
- **Binary** — fetched per-platform at build time by
  `build/tesseract.py` from pinned upstream URLs. Current pin:
  **Tesseract 5.5.0**. CI imports `fetch_tessdata` +
  `fetch_tesseract_for_platform` from this module before PyInstaller.

**Updating Tesseract**:

1. Bump the version pin and the per-platform fetch URLs in
   `build/tesseract.py`.
2. If the model schema changed upstream, refresh
   `build/vendor/tessdata/eng.traineddata` from `tessdata_best` at the
   matching tag.
3. Push a `v*` tag so CI rebuilds all three platforms, then
   smoke-test a scanned PDF through the PDF Extractor.
4. Update `LICENSE_TESSERACT.txt` at the repo root if upstream license
   terms change (Apache-2.0 today).

License attribution for the bundled binary lives at
`LICENSE_TESSERACT.txt` at the repo root — it must ship alongside any
binary that contains Tesseract.

## Common pitfalls

| Symptom | Fix |
|---|---|
| Bundle is 800+ MB | Check the `excludes` list in `datatools.spec`. `matplotlib` / `scipy` / `tkinter` are the usual suspects. |
| App launches, browser opens, page is blank | Streamlit's static assets aren't bundled. Re-run with `--log-level=DEBUG` and confirm the static dir was collected by `collect_data_files('streamlit')`. |
| App launches but logs `ImportError: streamlit.runtime.X` | Add `X` to `hidden_imports` in the spec or to `hook-streamlit.py`. |
| Mac Gatekeeper says "DataTools is damaged and can't be opened" | The bundle wasn't signed + notarized. Don't ship to buyers without these — see the Mac recipe above. |
| Windows SmartScreen blocks first launch | Buyer clicks "More info → Run anyway". Code-signing reduces but doesn't eliminate this; for v1 it's an accepted friction. |
| Bundle works on dev machine but crashes on a clean machine | Likely a missing C runtime. On Windows, ship the [VC++ redistributable](https://aka.ms/vs/17/release/vc_redist.x64.exe) inside the installer alongside the bundle so it gets installed with the app. |

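For the oversized-bundle row, it helps to see which top-level entries in `dist/DataTools/` take the space before editing `excludes`. The sketch below is a standalone helper, not part of the build scripts. The names are illustrative, and symlinks are skipped so linked files are not counted twice.

```python
from pathlib import Path
from typing import List, Tuple


def dir_size(path: Path) -> int:
    """Total size in bytes of regular files under path (symlinks skipped)."""
    return sum(
        f.stat().st_size
        for f in path.rglob("*")
        if f.is_file() and not f.is_symlink()
    )


def largest_entries(bundle_dir: Path, top: int = 10) -> List[Tuple[str, int]]:
    """Return (name, bytes) for the biggest top-level entries, largest first."""
    sizes = []
    for entry in bundle_dir.iterdir():
        if entry.is_symlink():
            continue
        size = dir_size(entry) if entry.is_dir() else entry.stat().st_size
        sizes.append((entry.name, size))
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]
```

Running `largest_entries(Path("dist/DataTools"))` after a build makes it easy to spot a stray package directory that should have been excluded.
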
## Testing the bundle

Smoke-test on a **clean** machine (or VM) — your dev machine has too
much state to trust:

```
1. Boot a clean Mac / Win / Linux VM.
2. Copy the .dmg / .exe / .AppImage onto it.
3. Install / drag-drop into Applications / chmod +x.
4. Double-click the app icon.
5. Browser should open to http://127.0.0.1:850x within 5 seconds.
6. Drop samples/demo/shopify_pet_customers.csv into the
   Automated Workflows page; click Run; AFTER preview should appear.
7. Confirm in the network tab: zero outbound calls except to
   127.0.0.1 and the Streamlit static asset paths (also local).
```

Step 7 is the privacy-claim integrity check from
`docs/POST-LAUNCH.md` §6 — do this once per release, then trust it.

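Step 5 can be checked from a script rather than by eye. The sketch below polls loopback ports until one accepts a TCP connection or the time budget runs out. It reads `850x` as ports 8501 to 8509, which is an assumption, and it only proves that something is listening, not that the page rendered.

```python
import socket
import time
from typing import Iterable, Optional


def find_listening_port(
    ports: Iterable[int],
    host: str = "127.0.0.1",
    timeout_s: float = 5.0,
    poll_s: float = 0.2,
) -> Optional[int]:
    """Return the first port that accepts a connection within timeout_s."""
    candidates = list(ports)
    deadline = time.monotonic() + timeout_s
    while True:
        for port in candidates:
            try:
                # A successful connect means a server is listening there.
                with socket.create_connection((host, port), timeout=0.25):
                    return port
            except OSError:
                continue
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_s)
```

After launching the app, `find_listening_port(range(8501, 8510))` returning `None` means the server did not come up within the five-second budget.
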
## Versioning

The version lives in `__version__` in `src/__init__.py` (step 1 of
the release process), and `datatools.spec` reads it from there for
CFBundleVersion + CFBundleShortVersionString. Per release, check that
the same version string also ends up in:

- the Inno Setup `AppVersion` line
- the AppImage filename

Deriving those two from `src/__init__.py` as well is a future
refactor; for v1 the manual check is fine.