datatools-dev/build/README.md
Michael b703911df3 docs: reflect bundled Tesseract on every install surface
- NEW LICENSE_TESSERACT.txt at the repo root: header noting it covers
  the bundled Tesseract OCR binary (Apache 2.0, upstream
  tesseract-ocr/tesseract, copyright Google + contributors) and the
  eng.traineddata from tessdata_best (also Apache 2.0). Clarifies
  DataTools itself remains proprietary. Full canonical Apache 2.0
  license text included.
- README.md + README.es.md (Download section): bumped size estimate
  ~200 MB → ~300 MB, added a short paragraph stating Tesseract OCR
  is bundled (no separate install required), with a link to the new
  license file.
- docs/USER-GUIDE.md + docs/USER-GUIDE.es.md (§1.6 System
  requirements): bumped disk estimate, added a paragraph stating
  Tesseract 5.5 + eng.traineddata ship inside every installer /
  portable / AppImage, with a source-install fallback hint pointing
  developers to DEVELOPER.md.
- docs/DEVELOPER.md: new "PDF Extractor — bundled Tesseract" section
  documenting the runtime layout (sys._MEIPASS / tesseract / …),
  discovery order, source of bytes (build/vendor/tessdata + per-
  platform fetch in make_release.py), version pin, update recipe.
- docs/TECHNICAL.md: new §3.10 "Bundled Tesseract (PDF Extractor
  OCR)" — short version of the discovery order for the build
  pipeline section.
- build/README.md: distribution-outputs paragraph now lists
  Tesseract among bundled deps with the ~250-300 MB estimate; new
  "Tesseract bundling" section: layout diagram, resolver order,
  source of bytes + 5.5.0 pin, update steps, license-file ref.

Out-of-scope gaps noted by the docs sweep:
- docs/FUTURE-TOOLS.md §D still describes Tesseract bundling as a
  high-risk packaging headache; now superseded. Worth a one-line
  "(resolved — bundled as of v1.x)" callout in a future pass.
- USER-GUIDE §2 "What's included" table doesn't list PDF Extractor
  at all (it shipped in b8aff86…967d3f6). Separate gap to close.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:20:50 +00:00


Build — DataTools desktop installer

Cross-platform PyInstaller bundle for Mac / Windows / Linux. The single deliverable the buyer downloads from Gumroad. Owner: Michael · Updated: 2026-05-01

This directory is the build pipeline. Source of truth for the bundle shape, hidden-import lists, per-platform recipes, and the launcher that boots Streamlit inside the bundle.

Files

build/
├── launcher.py           Entry point PyInstaller wraps. Boots a local
│                         Streamlit server, opens browser, locks server
│                         to 127.0.0.1 so the privacy claim holds.
├── datatools.spec        PyInstaller spec — hidden imports, data files,
│                         Mac .app bundle config. Reads the version
│                         from src/__init__.py.
├── installer.iss         Inno Setup script — Windows .exe installer.
│                         Adds Start Menu + Desktop + App Paths entries.
├── generate_icons.py     Builds icon.ico / icon.icns / icon.png from
│                         src/gui/assets/datatools_icon_256.png. Run
│                         once before pyinstaller (CI does this).
├── build_portable_zip.py Cross-platform: zips dist/DataTools/ into a
│                         no-install portable download. Used by the
│                         Windows + Linux portable artifacts.
├── macos/
│   ├── build_dmg.sh      Wraps dist/DataTools.app into a .dmg with a
│   │                     drag-to-/Applications layout (installer).
│   └── build_zip.sh      Wraps dist/DataTools.app into a portable
│                         .zip via ditto (preserves bundle metadata).
├── appimage/
│   ├── AppRun            Entry point invoked when the AppImage runs.
│   ├── datatools.desktop Linux desktop-entry metadata.
│   └── build.sh          Wraps dist/DataTools/ into an .AppImage.
├── hooks/                PyInstaller hooks for libs the static analyser
│   └── hook-streamlit.py misses (Streamlit's dynamic imports).
├── icon.{ico,icns,png}   Generated by generate_icons.py — gitignored.
└── README.md             this file
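As an illustration of what launcher.py is described as doing, here is a minimal sketch, not the shipped launcher. The helper names, the free-port strategy, and the two-second browser delay are assumptions; the Streamlit options are standard config flags, and the key one for the privacy claim is `--server.address=127.0.0.1`.

```python
import socket
import sys
import threading
import webbrowser


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port on the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS choose
        return sock.getsockname()[1]


def build_streamlit_argv(app_path: str, port: int) -> list[str]:
    """Build the argv for `streamlit run`, pinned to loopback only."""
    return [
        "streamlit", "run", app_path,
        "--server.address=127.0.0.1",       # never listen on external interfaces
        f"--server.port={port}",
        "--server.headless=true",           # the launcher opens the browser itself
        "--browser.gatherUsageStats=false",
        "--global.developmentMode=false",   # commonly needed in frozen builds
    ]


def launch(app_path: str) -> None:
    """Start Streamlit in-process and open the browser shortly afterwards."""
    # Imported lazily so the helpers above work without Streamlit installed.
    from streamlit.web import cli as stcli

    port = find_free_port()
    url = f"http://127.0.0.1:{port}"
    # Assumption: a fixed delay is good enough; a real launcher might poll the port.
    threading.Timer(2.0, webbrowser.open, args=[url]).start()
    sys.argv = build_streamlit_argv(app_path, port)
    stcli.main()  # click exits the process when the server stops
```

Nothing runs at import time; a frozen entry point would call `launch()` with the path of the Streamlit script inside the bundle.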

Distribution outputs per platform

Each CI run produces two downloads per platform — an installer for buyers who want shortcuts wired automatically, and a portable .zip for buyers (or IT-locked-down machines) that can't run installers:

| Platform | Installer | Portable |
| --- | --- | --- |
| macOS | `DataTools-<ver>-mac.dmg` | `DataTools-<ver>-mac-portable.zip` (ditto .app) |
| Windows | `DataTools-<ver>-win-setup.exe` | `DataTools-<ver>-win-portable.zip` |
| Linux | `DataTools-<ver>-linux-x86_64.AppImage` | (the AppImage IS the portable) |

All five outputs are self-contained: every dependency (Python, pandas, streamlit, pdfplumber, Tesseract OCR + eng.traineddata, the lot) is frozen into the bundle. The buyer does not need to install Python, pip, Tesseract, or anything else first. With Tesseract bundled, each artifact is roughly 250-300 MB on disk (up from ~120 MB pre-OCR); unpacked installs run ~300-400 MB once scratch space is counted.
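For the zip-based portable downloads, the packaging step amounts to archiving the PyInstaller output folder so that it unpacks as a single top-level directory. The sketch below is an assumption about how that could look, not the contents of build_portable_zip.py; the function name and return value are hypothetical. The macOS zip goes through ditto instead, so this only mirrors the generic case.

```python
import zipfile
from pathlib import Path


def build_portable_zip(bundle_dir: Path, out_zip: Path) -> int:
    """Zip bundle_dir (e.g. dist/DataTools) under one top-level folder.

    Returns the number of files written. out_zip must live outside bundle_dir.
    """
    count = 0
    with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(bundle_dir.rglob("*")):
            if path.is_file():
                # Keep the bundle folder name as the archive root.
                arcname = path.relative_to(bundle_dir.parent).as_posix()
                zf.write(path, arcname=arcname)
                count += 1
    return count
```

Calling `build_portable_zip(Path("dist/DataTools"), Path("dist/DataTools-1.0.0-win-portable.zip"))` would produce an archive whose entries all start with `DataTools/`.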

Easy-launch surface

| Affordance | Windows | macOS |
| --- | --- | --- |
| Desktop shortcut | Inno Setup desktopicon task (checked default) | The .app bundle in /Applications is the icon |
| App menu | Start Menu → DataTools (always installed) | Launchpad + Spotlight (auto from /Applications) |
| Taskbar / Dock | User pins manually (OS forbids programmatic pin) | User pins manually after first launch |
| Run from terminal | `DataTools` (registered via App Paths) | `open -a DataTools` (auto from .app bundle) |

CI: .github/workflows/build.yml runs the full pipeline on tag push (matrix: macos-latest, windows-latest, ubuntu-latest) and attaches the resulting installers to a GitHub Release. Manual workflow_dispatch runs upload them as workflow artifacts only.

Releasing

PyInstaller can't cross-compile, so a single machine produces one platform's packages. Run this on each target OS:

# One-time setup per machine:
pip install -r requirements.txt
pip install pyinstaller pillow
# Windows only: install Inno Setup from https://jrsoftware.org/isdl.php
# Linux  only: drop appimagetool onto PATH (see preflight output)

# Build everything for the current OS:
python build/make_release.py

Outputs land in dist/:

  • Windows host → DataTools-<ver>-win-setup.exe + DataTools-<ver>-win-portable.zip
  • macOS host → DataTools-<ver>-mac.dmg + DataTools-<ver>-mac-portable.zip
  • Linux host → DataTools-<ver>-linux-x86_64.AppImage
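The host-to-output mapping above is easy to turn into a post-build check. This is a sketch under assumptions: the function names are hypothetical and make_release.py may already do something similar. Only the file-name patterns come from this README.

```python
from pathlib import Path


def expected_artifacts(version: str, system: str) -> list[str]:
    """Return the artifact names documented for a host OS.

    `system` uses platform.system() values: "Windows", "Darwin", "Linux".
    """
    names = {
        "Windows": [f"DataTools-{version}-win-setup.exe",
                    f"DataTools-{version}-win-portable.zip"],
        "Darwin": [f"DataTools-{version}-mac.dmg",
                   f"DataTools-{version}-mac-portable.zip"],
        "Linux": [f"DataTools-{version}-linux-x86_64.AppImage"],
    }
    if system not in names:
        raise ValueError(f"unsupported host OS: {system!r}")
    return names[system]


def missing_artifacts(dist_dir: Path, version: str, system: str) -> list[str]:
    """List expected artifacts that are not present in dist_dir."""
    return [name for name in expected_artifacts(version, system)
            if not (dist_dir / name).is_file()]
```

After a build, `missing_artifacts(Path("dist"), "1.0.0", platform.system())` should return an empty list.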

Useful flags:

python build/make_release.py --preflight       # check tooling, build nothing
python build/make_release.py --clean           # wipe dist/ first
python build/make_release.py --skip-installer  # just the portable zip
python build/make_release.py --skip-portable   # just the installer
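To make the flag semantics concrete, here is a hedged sketch of how the four flags could map onto build steps. The flag names come from the list above; the step names, their order, and the rule that `--preflight` short-circuits everything else are assumptions, not a description of make_release.py internals.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the documented make_release.py flags."""
    parser = argparse.ArgumentParser(prog="make_release.py")
    parser.add_argument("--preflight", action="store_true",
                        help="check tooling, build nothing")
    parser.add_argument("--clean", action="store_true",
                        help="wipe dist/ first")
    parser.add_argument("--skip-installer", action="store_true",
                        help="build only the portable zip")
    parser.add_argument("--skip-portable", action="store_true",
                        help="build only the installer")
    return parser.parse_args(argv)


def plan_steps(args: argparse.Namespace) -> list[str]:
    """Return the ordered (hypothetical) step names a run would execute."""
    if args.preflight:
        return ["preflight"]  # assumption: preflight never builds anything
    steps = ["preflight"]
    if args.clean:
        steps.append("clean")
    steps.append("pyinstaller")
    if not args.skip_installer:
        steps.append("installer")
    if not args.skip_portable:
        steps.append("portable")
    return steps
```

For example, `plan_steps(parse_args(["--clean", "--skip-installer"]))` yields the preflight, clean, PyInstaller, and portable steps in that order.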

CI build (push tag → GitHub Release)

If you have CI runners for all three OSes:

  1. Bump __version__ in src/__init__.py.
  2. git commit -am "release: vX.Y.Z" && git tag vX.Y.Z.
  3. git push && git push --tags.
  4. CI builds all three platforms and creates a Release with the installers + portable zips attached.
  5. Mirror the Release assets to Gumroad (manual until v2).
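Steps 1 and 2 only work if the tag and `__version__` agree, so a small guard before pushing can help. This is a sketch, assuming `__version__` is a plain string assignment in src/__init__.py; the function names are hypothetical.

```python
import re
from pathlib import Path

# Matches a line such as: __version__ = "1.2.3"
_VERSION_RE = re.compile(r"""^__version__\s*=\s*["']([^"']+)["']""", re.MULTILINE)


def read_version(init_path: Path) -> str:
    """Extract the __version__ string without importing the package."""
    match = _VERSION_RE.search(init_path.read_text(encoding="utf-8"))
    if match is None:
        raise ValueError(f"no __version__ assignment found in {init_path}")
    return match.group(1)


def tag_matches_version(tag: str, version: str) -> bool:
    """Release tags follow the vX.Y.Z convention used in step 2."""
    return tag == f"v{version}"
```

A pre-push check could then compare `read_version(Path("src/__init__.py"))` against the tag about to be pushed.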

Signing (Phase 2 — needs accounts/credentials)

Both code-signing steps are intentionally not in CI yet because they require credentials the owner sets up first.

macOS — Apple Developer Program enrollment ($99/yr). Once enrolled, add these GitHub Secrets and uncomment the codesign + notarytool steps in build.yml:

| Secret | Value |
| --- | --- |
| MACOS_DEVELOPER_ID_CERT_P12_BASE64 | base64-encoded .p12 cert |
| MACOS_DEVELOPER_ID_CERT_PASSWORD | password for the .p12 |
| MACOS_NOTARY_APPLE_ID | Apple ID email |
| MACOS_NOTARY_TEAM_ID | 10-char team ID |
| MACOS_NOTARY_PASSWORD | app-specific password |

Windows — Code-signing cert from Sectigo / DigiCert (~$200-400/yr, or ~$300-500 for an EV cert that bypasses SmartScreen). Add:

| Secret | Value |
| --- | --- |
| WINDOWS_CERT_PFX_BASE64 | base64-encoded .pfx cert |
| WINDOWS_CERT_PASSWORD | password for the .pfx |

Until those are wired, buyers will see:

  • macOS: "DataTools is damaged and can't be opened" — fix by removing the quarantine attribute (xattr -cr /Applications/DataTools.app). Acceptable for the technical buyer; blocking for the non-technical buyer. Don't ship to non-technical without notarization.
  • Windows: SmartScreen "Windows protected your PC" — buyer clicks "More info → Run anyway". Friction but not blocking.
  • Linux: AppImage runs without complaint (Linux has no equivalent trust-store).
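The base64-encoded certificate secrets listed above can be produced, and later restored inside a CI job, with a few lines of standard-library code. This is a sketch only: the helper names are hypothetical, and nothing here is taken from build.yml.

```python
import base64
from pathlib import Path


def encode_cert_for_secret(cert_path: Path) -> str:
    """Return the .p12 / .pfx file as a single-line base64 string."""
    return base64.b64encode(cert_path.read_bytes()).decode("ascii")


def decode_cert_from_secret(secret_value: str, out_path: Path) -> None:
    """Inverse step: write the certificate back to disk before signing."""
    out_path.write_bytes(base64.b64decode(secret_value))
```

The encoded string is what would be pasted into the GitHub secret; the decode helper shows the round trip a signing step would need.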

Per-platform recipe

Each platform builds on its own machine because PyInstaller does not cross-compile. Pick the platform that matches the bundle you need. GitHub Actions matrix runners are the simplest way to produce all three from one push (see "CI build" above).

Mac (Intel + Apple Silicon, universal2)

# One-time:
pyenv install 3.12
pyenv local 3.12
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install pyinstaller

# Build:
pyinstaller build/datatools.spec --clean

# Output:
#   dist/DataTools/         — folder mode (faster cold start)
#   dist/DataTools.app/     — macOS .app bundle (drag-drop into /Applications)

# Sign + notarize (after Apple Developer Program enrollment per BUSINESS.md §10):
codesign --deep --force --options runtime \
  --sign "Developer ID Application: <YOUR-NAME> (<TEAMID>)" \
  dist/DataTools.app

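# Notarize (notarytool accepts a .zip, .dmg, or .pkg, not a bare .app):
ditto -c -k --keepParent dist/DataTools.app dist/DataTools-notarize.zip
xcrun notarytool submit dist/DataTools-notarize.zip \
  --apple-id "<YOUR-APPLE-ID>" \
  --team-id  "<TEAMID>" \
  --password "<APP-SPECIFIC-PASSWORD>" \
  --wait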

# Staple the notarization ticket so Gatekeeper sees it offline:
xcrun stapler staple dist/DataTools.app

# Wrap for distribution:
hdiutil create -volname "DataTools" -srcfolder dist/DataTools.app \
  -ov -format UDZO dist/DataTools-1.0.0-mac.dmg

Windows

# One-time:
py -3.12 -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
pip install pyinstaller

# Build:
pyinstaller build\datatools.spec --clean

# Output:
#   dist\DataTools\          — folder mode
#   dist\DataTools\DataTools.exe

# Wrap with Inno Setup (free):
#   1. Install Inno Setup (https://jrsoftware.org/isdl.php)
#   2. Create installer.iss next to this README:
#        [Setup]
#        AppName=DataTools
#        AppVersion=1.0.0
#        DefaultDirName={autopf}\DataTools
#        OutputDir=..\dist
#        OutputBaseFilename=DataTools-1.0.0-win-setup
#        Compression=lzma
#        SolidCompression=yes
#        [Files]
#        Source: "..\dist\DataTools\*"; DestDir: "{app}"; Flags: recursesubdirs
#        [Icons]
#        Name: "{autoprograms}\DataTools"; Filename: "{app}\DataTools.exe"
#   3. Compile: ISCC.exe build\installer.iss

# Code-sign (optional but reduces SmartScreen warnings):
#   Use signtool with a code-signing cert (Sectigo / DigiCert).
#   Without signing, buyer sees "Windows protected your PC" once;
#   they click "More info → Run anyway." Acceptable for v1.

Linux (AppImage)

python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install pyinstaller

pyinstaller build/datatools.spec --clean
# dist/DataTools/ — folder mode

# Wrap as AppImage (single-file portable app):
#   1. Download appimagetool from https://appimage.org/
#   2. Set up the AppDir layout:
#        DataTools.AppDir/
#        ├── AppRun                     -> ./DataTools/DataTools
#        ├── DataTools.desktop          (icon + entry config)
#        ├── icon.png
#        └── usr/bin/                   -> dist/DataTools/*
#   3. ./appimagetool DataTools.AppDir dist/DataTools-1.0.0-linux-x86_64.AppImage

.github/workflows/build.yml (template):

name: Build installers
on:
  workflow_dispatch:
  push:
    tags: [ 'v*' ]
jobs:
  build:
    strategy:
      matrix:
        os: [macos-latest, windows-latest, ubuntu-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -r requirements.txt pyinstaller
      - run: pyinstaller build/datatools.spec --clean
      - uses: actions/upload-artifact@v4
        with:
          name: DataTools-${{ matrix.os }}
          path: dist/

Mac code-signing in CI requires the cert + private key as a GitHub secret (encoded with base64). Detailed walkthrough belongs in a later doc — for v1, sign locally and upload to GitHub Releases.

Tesseract bundling (PDF Extractor OCR)

Frozen artifacts ship a per-platform Tesseract binary plus the English eng.traineddata model so scanned-PDF support in the PDF Extractor works out of the box — no separate user install. Source / pip developer setups still need system Tesseract on PATH.

Layout inside the bundle:

DataTools/                  (or DataTools.app/Contents/MacOS/)
└── tesseract/
    ├── tesseract           (Linux/macOS binary; tesseract.exe on Windows)
    └── tessdata/
        └── eng.traineddata

The runtime resolver (in src/, owned by the runtime team) walks:

  1. DATATOOLS_TESSERACT_BIN env var override.
  2. Path(sys._MEIPASS) / "tesseract" / "tesseract[.exe]" — frozen bundles only.
  3. tesseract on PATH.
  4. Windows well-known paths.

Where the bytes come from:

  • Tessdata — vendored in-repo at build/vendor/tessdata/eng.traineddata (sourced from tessdata_best). datatools.spec copies it into tesseract/tessdata/.
  • Binary — fetched per-platform at build time by build/make_release.py from pinned upstream URLs. Current pin: Tesseract 5.5.0.

Updating Tesseract:

  1. Bump the version pin and the per-platform fetch URLs in build/make_release.py.
  2. If the model schema changed upstream, refresh build/vendor/tessdata/eng.traineddata from tessdata_best at the matching tag.
  3. Rebuild on each platform (python build/make_release.py) and smoke-test a scanned PDF through the PDF Extractor.
  4. Update LICENSE_TESSERACT.txt at the repo root if upstream license terms change (Apache-2.0 today).
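Before the full scanned-PDF smoke test in step 3, a quick check that the bundled binary can see the English model catches most packaging mistakes. The sketch below relies on Tesseract's `--tessdata-dir` and `--list-langs` options; the helper names are hypothetical, and the output parsing assumes the usual "List of available languages" header line.

```python
import subprocess
from pathlib import Path


def list_langs_command(tesseract_bin: Path, tessdata_dir: Path) -> list[str]:
    """Build the command that lists languages from an explicit tessdata dir."""
    return [str(tesseract_bin), "--tessdata-dir", str(tessdata_dir), "--list-langs"]


def parse_langs(output: str) -> list[str]:
    """Extract language codes, skipping the header line and blanks."""
    lines = [line.strip() for line in output.splitlines()]
    return [line for line in lines if line and not line.startswith("List of")]


def bundled_english_available(tesseract_bin: Path, tessdata_dir: Path) -> bool:
    """Run the bundled binary and confirm that 'eng' is listed."""
    result = subprocess.run(
        list_langs_command(tesseract_bin, tessdata_dir),
        capture_output=True, text=True, check=True,
    )
    # Some releases print the list on stderr, so inspect both streams.
    return "eng" in parse_langs(result.stdout + "\n" + result.stderr)
```

Pointing both arguments at the `tesseract/` folder inside an unpacked bundle exercises the same layout shown earlier in this section.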

License attribution for the bundled binary lives at LICENSE_TESSERACT.txt at the repo root — it must ship alongside any binary that contains Tesseract.

Common pitfalls

| Symptom | Fix |
| --- | --- |
| Bundle is 800+ MB | Check the excludes list in datatools.spec. matplotlib / scipy / tkinter are the usual suspects. |
| App launches, browser opens, page is blank | Streamlit's static assets aren't bundled. Re-run with --log-level=DEBUG and confirm the static dir was collected by collect_data_files('streamlit'). |
| App launches but logs ImportError: streamlit.runtime.X | Add X to hidden_imports in the spec or to hook-streamlit.py. |
| Mac Gatekeeper says "DataTools is damaged and can't be opened" | The bundle wasn't signed + notarized. Don't ship to buyers without these; see the Mac recipe above. |
| Windows SmartScreen blocks first launch | Buyer clicks "More info → Run anyway". Code-signing reduces but doesn't eliminate this; for v1 it's an accepted friction. |
| Bundle works on dev machine but crashes on a clean machine | Likely a missing C runtime. On Windows, ship the VC++ redistributable in the installer alongside the bundle. |
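When chasing the oversized-bundle symptom, it helps to see which top-level folders in dist/DataTools/ carry the weight before editing the excludes list. A minimal sketch with hypothetical helper names:

```python
from pathlib import Path

_MIB = 1024 * 1024


def dir_size_mb(root: Path) -> float:
    """Total size of all files under root, in MiB."""
    total = sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
    return total / _MIB


def largest_entries(root: Path, top: int = 10) -> list[tuple[str, float]]:
    """Top-level children of root with their sizes in MiB, largest first."""
    sizes = []
    for child in root.iterdir():
        size = dir_size_mb(child) if child.is_dir() else child.stat().st_size / _MIB
        sizes.append((child.name, size))
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]
```

Running `largest_entries(Path("dist/DataTools"))` after a build gives a quick ranking to compare against the size estimate earlier in this README.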

Testing the bundle

Smoke-test on a clean machine (or VM) — your dev machine has too much state to trust:

1. Boot a clean Mac / Win / Linux VM.
2. Copy the .dmg / .exe / .AppImage onto it.
3. Install / drag-drop into Applications / chmod +x.
4. Double-click the app icon.
5. Browser should open to http://127.0.0.1:850x within 5 seconds.
6. Drop samples/demo/shopify_pet_customers.csv into the
   Automated Workflows page; click Run; AFTER preview should appear.
7. Confirm in the network tab: zero outbound calls except to
   127.0.0.1 and the Streamlit static asset paths (also local).

Step 7 is the privacy-claim integrity check from docs/POST-LAUNCH.md §6 — do this once per release, then trust it.
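Step 5 can be partly automated by polling the loopback port until the server answers. This is a sketch under assumptions: the function name is hypothetical, and the actual port has to be supplied by the caller since the steps above only say 850x.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True once a TCP connection to host:port succeeds within timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            time.sleep(0.1)  # server not up yet; retry shortly
    return False
```

For instance, `wait_for_port("127.0.0.1", 8501)` would cover the case where the app picked Streamlit's default port. It only proves the server is listening on loopback; the outbound-traffic check in step 7 still needs a manual look at the network tab.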

Versioning

Bump the version string in three places per release:

  • datatools.spec (CFBundleVersion + CFBundleShortVersionString)
  • the Inno Setup AppVersion line
  • the AppImage filename

A single source of truth (e.g. src/__init__.py) is a future refactor — for v1 the three-spot update is fine.
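Until that refactor lands, a small consistency check can catch a missed spot. The sketch below covers only the Inno Setup AppVersion line and the AppImage file name; it assumes AppVersion is a literal `AppVersion=X.Y.Z` line rather than a preprocessor define, and the function names are hypothetical.

```python
import re
from typing import Optional


def inno_app_version(iss_text: str) -> Optional[str]:
    """Return the value of the AppVersion line in an Inno Setup script."""
    match = re.search(r"^\s*AppVersion\s*=\s*(\S+)\s*$", iss_text, re.MULTILINE)
    return match.group(1) if match else None


def version_mismatches(expected: str, iss_text: str, appimage_name: str) -> list[str]:
    """Describe every place whose version differs from the expected one."""
    problems = []
    found = inno_app_version(iss_text)
    if found != expected:
        problems.append(f"installer.iss AppVersion is {found!r}, expected {expected!r}")
    if f"-{expected}-" not in appimage_name:
        problems.append(f"AppImage name {appimage_name!r} does not contain {expected!r}")
    return problems
```

An empty list means the two checked spots agree with the release version; the datatools.spec values would need a similar check.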