Build — DataTools desktop installer
Cross-platform PyInstaller bundle for Mac / Windows / Linux. The single deliverable the buyer downloads from Gumroad. Owner: Michael · Updated: 2026-05-01
This directory is the build pipeline. Source of truth for the bundle shape, hidden-import lists, per-platform recipes, and the launcher that boots Streamlit inside the bundle.
Files
build/
├── launcher.py             Entry point PyInstaller wraps. Boots a local
│                           Streamlit server, opens browser, locks server
│                           to 127.0.0.1 so the privacy claim holds.
├── datatools.spec          PyInstaller spec: hidden imports, data files,
│                           Mac .app bundle config. Reads the version
│                           from src/__init__.py.
├── installer.iss           Inno Setup script for the Windows .exe installer.
│                           Adds Start Menu + Desktop + App Paths entries.
├── generate_icons.py       Builds icon.ico / icon.icns / icon.png from
│                           src/gui/assets/datatools_icon_256.png. Run
│                           once before pyinstaller (CI does this).
├── build_portable_zip.py   Cross-platform: zips dist/DataTools/ into a
│                           no-install portable download. Used by the
│                           Windows + Linux portable artifacts.
├── macos/
│   ├── build_dmg.sh        Wraps dist/DataTools.app into a .dmg with a
│   │                       drag-to-/Applications layout (installer).
│   └── build_zip.sh        Wraps dist/DataTools.app into a portable
│                           .zip via ditto (preserves bundle metadata).
├── appimage/
│   ├── AppRun              Entry point invoked when the AppImage runs.
│   ├── datatools.desktop   Linux desktop-entry metadata.
│   └── build.sh            Wraps dist/DataTools/ into an .AppImage.
├── hooks/                  PyInstaller hooks for libs the static analyser
│   └── hook-streamlit.py   misses (Streamlit's dynamic imports).
├── icon.{ico,icns,png}     Generated by generate_icons.py (gitignored).
└── README.md               this file
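As a rough illustration of what launcher.py is described as doing (boot Streamlit, open the browser, stay on 127.0.0.1), here is a minimal sketch. It is not the shipped launcher: the `pick_port` and `build_streamlit_argv` helpers, the 8501 starting port, and the `app.py` script name are all assumptions made for the example.

```python
import socket
import sys
import threading
import webbrowser


def pick_port(start: int = 8501, attempts: int = 9) -> int:
    """Return the first port in [start, start + attempts) that binds on loopback."""
    for port in range(start, start + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("127.0.0.1", port))
            except OSError:
                continue  # port busy, try the next one
            return port
    raise RuntimeError("no free port found")


def build_streamlit_argv(app_script: str, port: int) -> list[str]:
    """Build the argv for `streamlit run`, pinned to the loopback interface."""
    return [
        "streamlit", "run", app_script,
        "--server.address", "127.0.0.1",  # never listen on external interfaces
        "--server.port", str(port),
        "--server.headless", "true",      # the launcher opens the browser itself
        "--browser.gatherUsageStats", "false",
    ]


def main(app_script: str = "app.py") -> None:  # app.py is a hypothetical script name
    port = pick_port()
    sys.argv = build_streamlit_argv(app_script, port)
    # Open the browser a moment after the server starts booting.
    threading.Timer(2.0, webbrowser.open, args=[f"http://127.0.0.1:{port}"]).start()
    from streamlit.web import cli as stcli  # lazy import; requires streamlit
    sys.exit(stcli.main())
```

Binding to 127.0.0.1 rather than 0.0.0.0 is what keeps the server unreachable from other machines on the network.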
Distribution outputs per platform
Each CI run produces two downloads per platform — an installer for buyers who want shortcuts wired automatically, and a portable .zip for buyers (or IT-locked-down machines) that can't run installers:
| Platform | Installer | Portable |
|---|---|---|
| macOS | DataTools-<ver>-mac.dmg | DataTools-<ver>-mac-portable.zip (ditto .app) |
| Windows | DataTools-<ver>-win-setup.exe | DataTools-<ver>-win-portable.zip |
| Linux | DataTools-<ver>-linux-x86_64.AppImage | (the AppImage IS the portable) |
All of these outputs are self-contained: every dependency (Python, pandas,
streamlit, pdfplumber, Tesseract OCR + eng.traineddata, the lot)
is frozen into the bundle. The buyer does not need to install Python,
pip, Tesseract, or anything else first. With Tesseract bundled, each
artifact is roughly 250–300 MB on disk (up from ~120 MB pre-OCR);
unpacked installs run ~300–400 MB once scratch space is counted.
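The naming scheme in the table can be captured in a small helper, which is handy when a script needs to know which files to expect in dist/. This is only a sketch: the `artifact_names` function and its platform keys are hypothetical, and only the filename patterns come from the table.

```python
def artifact_names(version: str, platform: str) -> list[str]:
    """Return the expected artifact filenames for one platform.

    `platform` is one of "mac", "win", "linux" (keys chosen for this example).
    """
    names = {
        "mac": [
            f"DataTools-{version}-mac.dmg",
            f"DataTools-{version}-mac-portable.zip",
        ],
        "win": [
            f"DataTools-{version}-win-setup.exe",
            f"DataTools-{version}-win-portable.zip",
        ],
        # On Linux the AppImage doubles as the portable download.
        "linux": [f"DataTools-{version}-linux-x86_64.AppImage"],
    }
    return names[platform]
```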
Easy-launch surface
| Affordance | Windows | macOS |
|---|---|---|
| Desktop shortcut | Inno Setup desktopicon task (checked default) | The .app bundle in /Applications is the icon |
| App menu | Start Menu → DataTools (always installed) | Launchpad + Spotlight (auto from /Applications) |
| Taskbar / Dock | User pins manually (OS forbids programmatic pin) | User pins manually after first launch |
| Run from terminal | DataTools (registered via App Paths) | open -a DataTools (auto from .app bundle) |
CI: .github/workflows/build.yml runs the full pipeline on tag push
(matrix: macos-latest, windows-latest, ubuntu-latest) and attaches
the resulting installers to a GitHub Release. Manual
workflow_dispatch runs upload them as workflow artifacts only.
Releasing
Single-command local build (recommended for one-developer workflow)
PyInstaller can't cross-compile, so a single machine produces one platform's packages. Run this on each target OS:
# One-time setup per machine:
pip install -r requirements.txt
pip install pyinstaller pillow
# Windows only: install Inno Setup from https://jrsoftware.org/isdl.php
# Linux only: drop appimagetool onto PATH (see preflight output)
# Build everything for the current OS:
python build/make_release.py
Outputs land in dist/:
- Windows host → DataTools-<ver>-win-setup.exe + DataTools-<ver>-win-portable.zip
- macOS host → DataTools-<ver>-mac.dmg + DataTools-<ver>-mac-portable.zip
- Linux host → DataTools-<ver>-linux-x86_64.AppImage
Useful flags:
python build/make_release.py --preflight # check tooling, build nothing
python build/make_release.py --clean # wipe dist/ first
python build/make_release.py --skip-installer # just the portable zip
python build/make_release.py --skip-portable # just the installer
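For readers curious how these flags could map onto build steps, here is a minimal argparse sketch. It is not the real make_release.py: the step names and the `planned_steps` helper are assumptions, and only the four flag names come from the list above.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the four documented flags; everything else here is illustrative."""
    parser = argparse.ArgumentParser(prog="make_release.py")
    parser.add_argument("--preflight", action="store_true",
                        help="check tooling, build nothing")
    parser.add_argument("--clean", action="store_true",
                        help="wipe dist/ first")
    parser.add_argument("--skip-installer", action="store_true",
                        help="build only the portable zip")
    parser.add_argument("--skip-portable", action="store_true",
                        help="build only the installer")
    return parser.parse_args(argv)


def planned_steps(args: argparse.Namespace) -> list[str]:
    """Translate flags into an ordered list of (hypothetical) step names."""
    if args.preflight:
        return ["preflight"]  # tooling check only
    steps = []
    if args.clean:
        steps.append("clean")
    steps.append("pyinstaller")
    if not args.skip_installer:
        steps.append("installer")
    if not args.skip_portable:
        steps.append("portable")
    return steps
```

For example, `planned_steps(parse_args(["--clean", "--skip-portable"]))` yields the clean, pyinstaller, and installer steps in that order.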
CI build (push tag → GitHub Release)
If you have CI runners for all three OSes:
- Bump __version__ in src/__init__.py.
- git commit -am "release: vX.Y.Z" && git tag vX.Y.Z.
- git push && git push --tags.
- CI builds all three platforms and creates a Release with the installers + portable zips attached.
- Mirror the Release assets to Gumroad (manual until v2).
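Because the release starts from the `__version__` string in src/__init__.py, build scripts usually need to read it without importing the package. One way to do that is sketched below; the regex and function names are assumptions for illustration, not the code in datatools.spec.

```python
import re
from pathlib import Path

# Matches lines such as: __version__ = "1.2.3"  (single or double quotes)
_VERSION_RE = re.compile(r"""^__version__\s*=\s*['"]([^'"]+)['"]""", re.MULTILINE)


def parse_version(source: str) -> str:
    """Extract the version string from the text of an __init__.py file."""
    match = _VERSION_RE.search(source)
    if match is None:
        raise ValueError("__version__ not found")
    return match.group(1)


def read_version(init_path: Path = Path("src/__init__.py")) -> str:
    """Read the version from disk without importing the package."""
    return parse_version(init_path.read_text(encoding="utf-8"))
```

Parsing the text instead of importing `src` avoids pulling in pandas or Streamlit just to learn the version number.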
Signing (Phase 2 — needs accounts/credentials)
Both code-signing steps are intentionally not in CI yet because they require credentials the owner sets up first.
macOS — Apple Developer Program enrollment ($99/yr). Once enrolled,
add these GitHub Secrets and uncomment the codesign + notarytool
steps in build.yml:
| Secret | Value |
|---|---|
| MACOS_DEVELOPER_ID_CERT_P12_BASE64 | base64-encoded .p12 cert |
| MACOS_DEVELOPER_ID_CERT_PASSWORD | password for the .p12 |
| MACOS_NOTARY_APPLE_ID | Apple ID email |
| MACOS_NOTARY_TEAM_ID | 10-char team ID |
| MACOS_NOTARY_PASSWORD | app-specific password |
Windows — Code-signing cert from Sectigo / DigiCert (~$200-400/yr, or ~$300-500 for an EV cert that bypasses SmartScreen). Add:
| Secret | Value |
|---|---|
| WINDOWS_CERT_PFX_BASE64 | base64-encoded .pfx cert |
| WINDOWS_CERT_PASSWORD | password for the .pfx |
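Both certificate secrets are stored base64-encoded, so a signing step has to decode them back into a file before codesign or signtool can use them. A minimal sketch of that decode step follows; the `write_cert_from_env` helper is hypothetical and not part of build.yml.

```python
import base64
import os
from pathlib import Path


def write_cert_from_env(env_var: str, dest: Path) -> Path:
    """Decode a base64-encoded certificate from an env var into `dest`."""
    encoded = os.environ.get(env_var)
    if not encoded:
        raise RuntimeError(f"{env_var} is not set")
    dest.write_bytes(base64.b64decode(encoded))
    return dest
```

In CI this would be called with something like `write_cert_from_env("WINDOWS_CERT_PFX_BASE64", Path("cert.pfx"))`, and the decoded file should be deleted once signing finishes.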
Until those are wired, buyers will see:
- macOS: "DataTools is damaged and can't be opened". The fix is to remove the quarantine attribute (xattr -cr /Applications/DataTools.app). Acceptable for the technical buyer; blocking for the non-technical buyer. Don't ship to non-technical buyers without notarization.
- Windows: SmartScreen "Windows protected your PC". The buyer clicks "More info → Run anyway". Friction but not blocking.
- Linux: AppImage runs without complaint (Linux has no equivalent trust-store).
Per-platform recipe
Each platform builds on its own machine — PyInstaller does not cross-compile. Pick the platform that matches the bundle you need. GitHub Actions matrix runners are the simplest way to produce all three from one push (see "CI build" below).
Mac (Intel + Apple Silicon, universal2)
# One-time:
pyenv install 3.12
pyenv local 3.12
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install pyinstaller
# Build:
pyinstaller build/datatools.spec --clean
# Output:
# dist/DataTools/ — folder mode (faster cold start)
# dist/DataTools.app/ — macOS .app bundle (drag-drop into /Applications)
# Sign + notarize (after Apple Developer Program enrollment per BUSINESS.md §10):
codesign --deep --force --options runtime \
--sign "Developer ID Application: <YOUR-NAME> (<TEAMID>)" \
dist/DataTools.app
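# Notarize (notarytool accepts a .zip, .dmg, or .pkg, not a bare .app, so zip first):
ditto -c -k --keepParent dist/DataTools.app dist/DataTools-notarize.zip
xcrun notarytool submit dist/DataTools-notarize.zip \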
--apple-id "<YOUR-APPLE-ID>" \
--team-id "<TEAMID>" \
--password "<APP-SPECIFIC-PASSWORD>" \
--wait
# Staple the notarization ticket so Gatekeeper sees it offline:
xcrun stapler staple dist/DataTools.app
# Wrap for distribution:
hdiutil create -volname "DataTools" -srcfolder dist/DataTools.app \
-ov -format UDZO dist/DataTools-1.0.0-mac.dmg
Windows
# One-time:
py -3.12 -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
pip install pyinstaller
# Build:
pyinstaller build\datatools.spec --clean
# Output:
# dist\DataTools\ — folder mode
# dist\DataTools\DataTools.exe
# Wrap with Inno Setup (free):
# 1. Install Inno Setup (https://jrsoftware.org/isdl.php)
# 2. Create installer.iss next to this README:
# [Setup]
# AppName=DataTools
# AppVersion=1.0.0
# DefaultDirName={autopf}\DataTools
# OutputDir=..\dist
# OutputBaseFilename=DataTools-1.0.0-win-setup
# Compression=lzma
# SolidCompression=yes
# [Files]
# Source: "..\dist\DataTools\*"; DestDir: "{app}"; Flags: recursesubdirs
# [Icons]
# Name: "{autoprograms}\DataTools"; Filename: "{app}\DataTools.exe"
# 3. Compile: ISCC.exe build\installer.iss
# Code-sign (optional but reduces SmartScreen warnings):
# Use signtool with a code-signing cert (Sectigo / DigiCert).
# Without signing, buyer sees "Windows protected your PC" once;
# they click "More info → Run anyway." Acceptable for v1.
Linux (AppImage)
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install pyinstaller
pyinstaller build/datatools.spec --clean
# dist/DataTools/ — folder mode
# Wrap as AppImage (single-file portable app):
# 1. Download appimagetool from https://appimage.org/
# 2. Set up the AppDir layout:
# DataTools.AppDir/
# ├── AppRun -> ./DataTools/DataTools
# ├── DataTools.desktop (icon + entry config)
# ├── icon.png
# └── usr/bin/ -> dist/DataTools/*
# 3. ./appimagetool DataTools.AppDir dist/DataTools-1.0.0-linux-x86_64.AppImage
CI build (recommended once the spec is stable)
.github/workflows/build.yml (template):
name: Build installers
on:
  workflow_dispatch:
  push:
    tags: [ 'v*' ]
jobs:
  build:
    strategy:
      matrix:
        os: [macos-latest, windows-latest, ubuntu-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -r requirements.txt pyinstaller
      - run: pyinstaller build/datatools.spec --clean
      - uses: actions/upload-artifact@v4
        with:
          name: DataTools-${{ matrix.os }}
          path: dist/
Mac code-signing in CI requires the cert + private key as a GitHub
secret (encoded with base64). Detailed walkthrough belongs in a
later doc — for v1, sign locally and upload to GitHub Releases.
Tesseract bundling (PDF Extractor OCR)
Frozen artifacts ship a per-platform Tesseract binary plus the English
eng.traineddata model so scanned-PDF support in the PDF Extractor
works out of the box — no separate user install. Source / pip
developer setups still need system Tesseract on PATH.
Layout inside the bundle:
DataTools/ (or DataTools.app/Contents/MacOS/)
└── tesseract/
├── tesseract (Linux/macOS binary; tesseract.exe on Windows)
└── tessdata/
└── eng.traineddata
The runtime resolver (in src/, owned by the runtime team) walks:
1. DATATOOLS_TESSERACT_BIN env var override.
2. Path(sys._MEIPASS) / "tesseract" / "tesseract[.exe]" (frozen bundles only).
3. tesseract on PATH.
4. Windows well-known paths.
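The resolver order above can be sketched in a few lines of Python. This is an illustration of the lookup sequence, not the resolver that lives in src/. The Windows paths listed are assumed common install locations, and falling through when the override points at a missing file is also an assumption.

```python
import os
import shutil
import sys
from pathlib import Path
from typing import Optional

# Assumed install locations; the real list in src/ may differ.
WINDOWS_WELL_KNOWN = [
    Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe"),
    Path(r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"),
]


def resolve_tesseract() -> Optional[Path]:
    """Return the first Tesseract binary found, or None if there is none."""
    exe = "tesseract.exe" if sys.platform == "win32" else "tesseract"

    # 1. Explicit override via environment variable.
    override = os.environ.get("DATATOOLS_TESSERACT_BIN")
    if override and Path(override).is_file():
        return Path(override)

    # 2. Bundled copy; sys._MEIPASS only exists in frozen PyInstaller builds.
    meipass = getattr(sys, "_MEIPASS", None)
    if meipass:
        bundled = Path(meipass) / "tesseract" / exe
        if bundled.is_file():
            return bundled

    # 3. System install on PATH.
    found = shutil.which("tesseract")
    if found:
        return Path(found)

    # 4. Well-known Windows install directories.
    if sys.platform == "win32":
        for candidate in WINDOWS_WELL_KNOWN:
            if candidate.is_file():
                return candidate
    return None
```

Putting the env var first lets a developer point a frozen build at a different Tesseract for debugging without rebuilding.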
Where the bytes come from:
- Tessdata: vendored in-repo at build/vendor/tessdata/eng.traineddata (sourced from tessdata_best). datatools.spec copies it into tesseract/tessdata/.
- Binary: fetched per-platform at build time by build/make_release.py from pinned upstream URLs. Current pin: Tesseract 5.5.0.
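In a PyInstaller spec, files are placed in the bundle through `datas` and `binaries` lists of (source, destination-directory) tuples. The fragment below sketches how the two Tesseract pieces could be declared to match the layout shown earlier. The staging directory for the fetched binary is a hypothetical path, and the actual entries in datatools.spec may look different.

```python
import sys
from pathlib import Path

# Vendored model, as described above.
TESSDATA_SRC = Path("build/vendor/tessdata/eng.traineddata")
# Hypothetical staging directory where the fetched binary would be unpacked.
TESSERACT_STAGE = Path("build/vendor/tesseract-bin")

exe_name = "tesseract.exe" if sys.platform == "win32" else "tesseract"

# (source path, destination directory inside the bundle)
tesseract_datas = [(TESSDATA_SRC.as_posix(), "tesseract/tessdata")]
tesseract_binaries = [((TESSERACT_STAGE / exe_name).as_posix(), "tesseract")]

# In the spec these lists would be passed to Analysis(...), for example:
#   Analysis(..., datas=[...] + tesseract_datas, binaries=[...] + tesseract_binaries)
```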
Updating Tesseract:
- Bump the version pin and the per-platform fetch URLs in build/make_release.py.
- If the model schema changed upstream, refresh build/vendor/tessdata/eng.traineddata from tessdata_best at the matching tag.
- Rebuild on each platform (python build/make_release.py) and smoke-test a scanned PDF through the PDF Extractor.
- Update LICENSE_TESSERACT.txt at the repo root if upstream license terms change (Apache-2.0 today).
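After an update, a quick sanity check is to confirm that the binary in the rebuilt bundle reports the pinned version. A sketch of such a check is below; the helper names are assumptions, and it relies on `tesseract --version` printing a first line of the form `tesseract 5.5.0` (some builds prefix the number with `v`).

```python
import re
import subprocess
from typing import Optional

PINNED_VERSION = "5.5.0"  # current pin per this README


def parse_tesseract_version(output: str) -> Optional[str]:
    """Pull the x.y.z version out of `tesseract --version` output."""
    match = re.search(r"^tesseract\s+v?(\d+\.\d+\.\d+)", output, re.MULTILINE)
    return match.group(1) if match else None


def bundled_version(binary: str) -> Optional[str]:
    """Run the given binary and return the version it reports, if any."""
    result = subprocess.run(
        [binary, "--version"], capture_output=True, text=True, check=False
    )
    # Older builds print the banner to stderr, so inspect both streams.
    return parse_tesseract_version(result.stdout + result.stderr)
```

A release script could then fail fast with `assert bundled_version(path) == PINNED_VERSION` before packaging.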
License attribution for the bundled binary lives at
LICENSE_TESSERACT.txt at the repo root — it must ship alongside any
binary that contains Tesseract.
Common pitfalls
| Symptom | Fix |
|---|---|
| Bundle is 800+ MB | Check the excludes list in datatools.spec. matplotlib / scipy / tkinter are the usual suspects. |
| App launches, browser opens, page is blank | Streamlit's static assets aren't bundled. Re-run with --log-level=DEBUG and confirm the static dir was collected by collect_data_files('streamlit'). |
| App launches but logs ImportError: streamlit.runtime.X | Add X to hidden_imports in the spec or to hook-streamlit.py. |
| Mac Gatekeeper says "DataTools is damaged and can't be opened" | The bundle wasn't signed + notarized. Don't ship to buyers without these — see the Mac recipe above. |
| Windows SmartScreen blocks first launch | Buyer clicks "More info → Run anyway". Code-signing reduces but doesn't eliminate this; for v1 it's an accepted friction. |
| Bundle works on dev machine but crashes on a clean machine | Likely a missing C runtime. On Windows, ship the VC++ redistributable inside the installer alongside the bundle. |
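For the oversized-bundle case, it helps to see which top-level entries in dist/DataTools/ take the most space before editing the excludes list. A small stdlib-only sketch (the function names are made up for this example):

```python
from pathlib import Path


def dir_size(path: Path) -> int:
    """Total size in bytes of all regular files under `path`."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())


def largest_entries(bundle: Path, top: int = 10) -> list[tuple[str, int]]:
    """Return the `top` largest top-level entries as (name, bytes) pairs."""
    sizes = []
    for child in bundle.iterdir():
        size = dir_size(child) if child.is_dir() else child.stat().st_size
        sizes.append((child.name, size))
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]
```

Running `largest_entries(Path("dist/DataTools"))` after a build shows at a glance whether matplotlib, scipy, or another unexpected package made it into the bundle.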
Testing the bundle
Smoke-test on a clean machine (or VM) — your dev machine has too much state to trust:
1. Boot a clean Mac / Win / Linux VM.
2. Copy the .dmg / .exe / .AppImage onto it.
3. Install / drag-drop into Applications / chmod +x.
4. Double-click the app icon.
5. Browser should open to http://127.0.0.1:850x within 5 seconds.
6. Drop samples/demo/shopify_pet_customers.csv into the
Automated Workflows page; click Run; AFTER preview should appear.
7. Confirm in the network tab: zero outbound calls except to
127.0.0.1 and the Streamlit static asset paths (also local).
Step 7 is the privacy-claim integrity check from
docs/POST-LAUNCH.md §6 — do this once per release, then trust it.
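Step 5 can be partly automated by polling the loopback port until the server answers. The sketch below only checks that something is listening on 127.0.0.1 within the time limit. It does not replace the manual network-tab check in step 7, and the `wait_for_server` name is an assumption.

```python
import socket
import time


def wait_for_server(port: int, host: str = "127.0.0.1", timeout: float = 5.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, else False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            time.sleep(0.1)  # server not up yet, retry shortly
    return False
```

For example, `wait_for_server(8501)` would cover the common case where the app comes up on the first port in the 850x range.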
Versioning
Bump the version string in three places per release:
- datatools.spec (CFBundleVersion + CFBundleShortVersionString)
- the Inno Setup AppVersion line
- the AppImage filename
A single source of truth (e.g. src/__init__.py) is a future
refactor — for v1 the three-spot update is fine.