datatools-dev/build/README.md
Michael b703911df3 docs: reflect bundled Tesseract on every install surface
- NEW LICENSE_TESSERACT.txt at the repo root: header noting it covers
  the bundled Tesseract OCR binary (Apache 2.0, upstream
  tesseract-ocr/tesseract, copyright Google + contributors) and the
  eng.traineddata from tessdata_best (also Apache 2.0). Clarifies
  DataTools itself remains proprietary. Full canonical Apache 2.0
  license text included.
- README.md + README.es.md (Download section): bumped size estimate
  ~200 MB → ~300 MB, added a short paragraph stating Tesseract OCR
  is bundled (no separate install required), with a link to the new
  license file.
- docs/USER-GUIDE.md + docs/USER-GUIDE.es.md (§1.6 System
  requirements): bumped disk estimate, added a paragraph stating
  Tesseract 5.5 + eng.traineddata ship inside every installer /
  portable / AppImage, with a source-install fallback hint pointing
  developers to DEVELOPER.md.
- docs/DEVELOPER.md: new "PDF Extractor — bundled Tesseract" section
  documenting the runtime layout (sys._MEIPASS / tesseract / …),
  discovery order, source of bytes (build/vendor/tessdata + per-
  platform fetch in make_release.py), version pin, update recipe.
- docs/TECHNICAL.md: new §3.10 "Bundled Tesseract (PDF Extractor
  OCR)" — short version of the discovery order for the build
  pipeline section.
- build/README.md: distribution-outputs paragraph now lists
  Tesseract among bundled deps with the ~250-300 MB estimate; new
  "Tesseract bundling" section: layout diagram, resolver order,
  source of bytes + 5.5.0 pin, update steps, license-file ref.

Out-of-scope gaps noted by the docs sweep:
- docs/FUTURE-TOOLS.md §D still describes Tesseract bundling as a
  high-risk packaging headache; now superseded. Worth a one-line
  "(resolved — bundled as of v1.x)" callout in a future pass.
- USER-GUIDE §2 "What's included" table doesn't list PDF Extractor
  at all (it shipped in b8aff86…967d3f6). Separate gap to close.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:20:50 +00:00

# Build — DataTools desktop installer
> Cross-platform PyInstaller bundle for Mac / Windows / Linux. The
> single deliverable the buyer downloads from Gumroad.
> **Owner**: Michael · **Updated**: 2026-05-01
This directory is the build pipeline. Source of truth for the bundle
shape, hidden-import lists, per-platform recipes, and the launcher
that boots Streamlit inside the bundle.
## Files
```
build/
├── launcher.py            Entry point PyInstaller wraps. Boots a local
│                          Streamlit server, opens browser, locks server
│                          to 127.0.0.1 so the privacy claim holds.
├── datatools.spec         PyInstaller spec: hidden imports, data files,
│                          Mac .app bundle config. Reads the version
│                          from src/__init__.py.
├── installer.iss          Inno Setup script: Windows .exe installer.
│                          Adds Start Menu + Desktop + App Paths entries.
├── generate_icons.py      Builds icon.ico / icon.icns / icon.png from
│                          src/gui/assets/datatools_icon_256.png. Run
│                          once before pyinstaller (CI does this).
├── build_portable_zip.py  Cross-platform: zips dist/DataTools/ into a
│                          no-install portable download. Used by the
│                          Windows + Linux portable artifacts.
├── macos/
│   ├── build_dmg.sh       Wraps dist/DataTools.app into a .dmg with a
│   │                      drag-to-/Applications layout (installer).
│   └── build_zip.sh       Wraps dist/DataTools.app into a portable
│                          .zip via ditto (preserves bundle metadata).
├── appimage/
│   ├── AppRun             Entry point invoked when the AppImage runs.
│   ├── datatools.desktop  Linux desktop-entry metadata.
│   └── build.sh           Wraps dist/DataTools/ into an .AppImage.
├── hooks/                 PyInstaller hooks for libs the static analyser
│   └── hook-streamlit.py  misses (Streamlit's dynamic imports).
├── icon.{ico,icns,png}    Generated by generate_icons.py (gitignored).
└── README.md              this file
```
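The launcher's job described above (boot a local Streamlit server and pin it to `127.0.0.1`) can be sketched roughly as follows. This is an illustration and not the contents of `launcher.py`: the helper names and the `app.py` path are hypothetical, and only the Streamlit command-line options are real.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port on the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[1]


def build_streamlit_argv(app_path: str, port: int) -> list[str]:
    """Build a `streamlit run` argument list that only listens on loopback."""
    return [
        "streamlit", "run", app_path,
        "--server.address", "127.0.0.1",        # never bind external interfaces
        "--server.port", str(port),
        "--server.headless", "true",            # the launcher opens the browser itself
        "--browser.gatherUsageStats", "false",  # no telemetry calls
    ]


# Hypothetical usage: "app.py" stands in for the real Streamlit entry script.
argv = build_streamlit_argv("app.py", find_free_port())
```

Binding to `127.0.0.1` rather than `0.0.0.0` is what keeps the server unreachable from other machines on the network.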
## Distribution outputs per platform
Each CI run produces an installer for buyers who want shortcuts wired
automatically, and a portable download for buyers (or IT-locked-down
machines) that can't run installers. macOS and Windows get both; on
Linux the AppImage fills both roles:
| Platform | Installer | Portable |
|----------|----------------------------------------|------------------------------------------------|
| macOS | `DataTools-<ver>-mac.dmg` | `DataTools-<ver>-mac-portable.zip` (ditto .app)|
| Windows | `DataTools-<ver>-win-setup.exe` | `DataTools-<ver>-win-portable.zip` |
| Linux | `DataTools-<ver>-linux-x86_64.AppImage`| (the AppImage IS the portable) |
All five outputs are self-contained: every dependency (Python, pandas,
streamlit, pdfplumber, **Tesseract OCR + `eng.traineddata`**, the lot)
is frozen into the bundle. The buyer does not need to install Python,
pip, Tesseract, or anything else first. With Tesseract bundled, each
artifact is roughly **250-300 MB** on disk (up from ~120 MB pre-OCR);
unpacked installs run ~300-400 MB once scratch space is counted.
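For the portable downloads, zipping the frozen folder could look roughly like the sketch below. It is not the actual `build_portable_zip.py`; the function name and the choice to keep `DataTools/` as the archive's top-level folder are assumptions.

```python
import zipfile
from pathlib import Path


def zip_bundle(bundle_dir: Path, zip_path: Path) -> int:
    """Zip bundle_dir into zip_path and return the number of files written."""
    bundle_dir = Path(bundle_dir)
    count = 0
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(bundle_dir.rglob("*")):
            if path.is_file():
                # Store entries as "DataTools/..." so the archive unpacks
                # into a single folder instead of scattering files.
                arcname = path.relative_to(bundle_dir.parent).as_posix()
                zf.write(path, arcname)
                count += 1
    return count
```

A call such as `zip_bundle(Path("dist/DataTools"), Path("dist/portable.zip"))` would then produce one archive containing the whole bundle, bundled Tesseract included.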
## Easy-launch surface
| Affordance | Windows | macOS |
|------------------|--------------------------------------------------|------------------------------------------------------|
| Desktop shortcut | Inno Setup `desktopicon` task (checked default) | The .app bundle in /Applications is the icon |
| App menu | Start Menu → DataTools (always installed) | Launchpad + Spotlight (auto from /Applications) |
| Taskbar / Dock | User pins manually (OS forbids programmatic pin) | User pins manually after first launch |
| Run from terminal| `DataTools` (registered via App Paths) | `open -a DataTools` (auto from .app bundle) |
CI: `.github/workflows/build.yml` runs the full pipeline on tag push
(matrix: macos-latest, windows-latest, ubuntu-latest) and attaches
the resulting installers to a GitHub Release. Manual
`workflow_dispatch` runs upload them as workflow artifacts only.
## Releasing
### Single-command local build (recommended for one-developer workflow)
PyInstaller can't cross-compile, so a single machine produces one
platform's packages. Run this on each target OS:
```bash
# One-time setup per machine:
pip install -r requirements.txt
pip install pyinstaller pillow
# Windows only: install Inno Setup from https://jrsoftware.org/isdl.php
# Linux only: drop appimagetool onto PATH (see preflight output)
# Build everything for the current OS:
python build/make_release.py
```
Outputs land in `dist/`:
- Windows host → `DataTools-<ver>-win-setup.exe` + `DataTools-<ver>-win-portable.zip`
- macOS host → `DataTools-<ver>-mac.dmg` + `DataTools-<ver>-mac-portable.zip`
- Linux host → `DataTools-<ver>-linux-x86_64.AppImage`
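The host-to-artifact mapping above can be written down as a small lookup, which is handy for checking that a build produced everything it should. The function names here are hypothetical; only the filename patterns come from this README.

```python
import sys


def platform_key(sys_platform: str = sys.platform) -> str:
    """Map Python's sys.platform onto the three build targets."""
    if sys_platform.startswith("win"):
        return "win"
    if sys_platform == "darwin":
        return "mac"
    return "linux"


def expected_artifacts(version: str, key: str) -> list[str]:
    """Return the filenames a build on the given host should leave in dist/."""
    names = {
        "win": [
            f"DataTools-{version}-win-setup.exe",
            f"DataTools-{version}-win-portable.zip",
        ],
        "mac": [
            f"DataTools-{version}-mac.dmg",
            f"DataTools-{version}-mac-portable.zip",
        ],
        # The AppImage doubles as the portable download on Linux.
        "linux": [f"DataTools-{version}-linux-x86_64.AppImage"],
    }
    return names[key]
```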
Useful flags:
```bash
python build/make_release.py --preflight # check tooling, build nothing
python build/make_release.py --clean # wipe dist/ first
python build/make_release.py --skip-installer # just the portable zip
python build/make_release.py --skip-portable # just the installer
```
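A preflight check in the spirit of `--preflight` might look like the sketch below. The tool list per platform is an assumption inferred from the recipes in this README, not a copy of what `make_release.py` checks, and Inno Setup's `ISCC` is often not on `PATH` by default.

```python
import shutil
from typing import Callable, Optional

# Assumed tool names per platform (hypothetical list for illustration).
REQUIRED_TOOLS = {
    "win": ["pyinstaller", "ISCC"],
    "mac": ["pyinstaller", "hdiutil", "ditto"],
    "linux": ["pyinstaller", "appimagetool"],
}


def missing_tools(
    key: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the required tools that cannot be found on PATH."""
    return [tool for tool in REQUIRED_TOOLS[key] if which(tool) is None]
```

Passing `which` as a parameter keeps the check testable without touching the real `PATH`.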
### CI build (push tag → GitHub Release)
If you have CI runners for all three OSes:
1. Bump `__version__` in `src/__init__.py`.
2. `git commit -am "release: vX.Y.Z" && git tag vX.Y.Z`.
3. `git push && git push --tags`.
4. CI builds all three platforms and creates a Release with the
installers + portable zips attached.
5. Mirror the Release assets to Gumroad (manual until v2).
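Steps 1 and 2 both depend on the same version string, so a release script could derive the tag from `src/__init__.py` instead of typing it twice. A minimal sketch, assuming `__version__` is assigned as a plain string literal; the function names are hypothetical.

```python
import re

# Matches a line such as: __version__ = "1.2.3"
_VERSION_RE = re.compile(r"^__version__\s*=\s*['\"]([^'\"]+)['\"]", re.MULTILINE)


def read_version(init_source: str) -> str:
    """Extract the version string from the text of an __init__.py file."""
    match = _VERSION_RE.search(init_source)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


def tag_for(version: str) -> str:
    """Return the git tag name used for a release, e.g. v1.2.3."""
    return f"v{version}"
```

Parsing the file as text avoids importing the package, which matters when its dependencies are not installed on the machine doing the tagging.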
## Signing (Phase 2 — needs accounts/credentials)
Both code-signing steps are intentionally not in CI yet because they
require credentials the owner sets up first.
**macOS** — Apple Developer Program enrollment ($99/yr). Once enrolled,
add these GitHub Secrets and uncomment the `codesign` + `notarytool`
steps in `build.yml`:
| Secret | Value |
|---|---|
| `MACOS_DEVELOPER_ID_CERT_P12_BASE64` | base64-encoded `.p12` cert |
| `MACOS_DEVELOPER_ID_CERT_PASSWORD` | password for the .p12 |
| `MACOS_NOTARY_APPLE_ID` | Apple ID email |
| `MACOS_NOTARY_TEAM_ID` | 10-char team ID |
| `MACOS_NOTARY_PASSWORD` | app-specific password |
**Windows** — Code-signing cert from Sectigo / DigiCert (~$200-400/yr,
or ~$300-500 for an EV cert that bypasses SmartScreen). Add:
| Secret | Value |
|---|---|
| `WINDOWS_CERT_PFX_BASE64` | base64-encoded `.pfx` cert |
| `WINDOWS_CERT_PASSWORD` | password for the .pfx |
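Both tables expect the certificate as a base64 string. One way to produce that value, and to reverse it inside a CI job before signing, is sketched below; the function names are hypothetical.

```python
import base64
from pathlib import Path


def encode_cert_for_secret(cert_path: Path) -> str:
    """Return the certificate file as a single-line base64 string."""
    return base64.b64encode(Path(cert_path).read_bytes()).decode("ascii")


def decode_secret_to_cert(secret_value: str, out_path: Path) -> None:
    """Write the decoded certificate back to disk (the CI-side step)."""
    Path(out_path).write_bytes(base64.b64decode(secret_value))
```

`base64.b64encode` emits no line breaks, so the result can be pasted into a GitHub Secret as one line.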
Until those are wired, buyers will see:
- macOS: "DataTools is damaged and can't be opened" — fix by removing
the quarantine attribute (`xattr -cr /Applications/DataTools.app`).
Acceptable for the technical buyer; **blocking** for the
non-technical buyer. Don't ship to non-technical without notarization.
- Windows: SmartScreen "Windows protected your PC" — buyer clicks
"More info → Run anyway". Friction but not blocking.
- Linux: AppImage runs without complaint (Linux has no equivalent
trust-store).
## Per-platform recipe
Each platform builds on its own machine — PyInstaller does **not**
cross-compile. Pick the platform that matches the bundle you need.
GitHub Actions matrix runners are the simplest way to produce all
three from one push (see "CI build" below).
### Mac (Intel + Apple Silicon, universal2)
```bash
# One-time:
pyenv install 3.12
pyenv local 3.12
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install pyinstaller
# Build:
pyinstaller build/datatools.spec --clean
# Output:
# dist/DataTools/ — folder mode (faster cold start)
# dist/DataTools.app/ — macOS .app bundle (drag-drop into /Applications)
# Sign + notarize (after Apple Developer Program enrollment per BUSINESS.md §10):
codesign --deep --force --options runtime \
--sign "Developer ID Application: <YOUR-NAME> (<TEAMID>)" \
dist/DataTools.app
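# Notarize (notarytool accepts a .zip, .dmg, or .pkg, not a bare .app):
ditto -c -k --keepParent dist/DataTools.app dist/DataTools-notarize.zip
xcrun notarytool submit dist/DataTools-notarize.zip \
  --apple-id "<YOUR-APPLE-ID>" \
  --team-id "<TEAMID>" \
  --password "<APP-SPECIFIC-PASSWORD>" \
  --wait
# Staple the notarization ticket so Gatekeeper sees it offline:
xcrun stapler staple dist/DataTools.app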
# Wrap for distribution:
hdiutil create -volname "DataTools" -srcfolder dist/DataTools.app \
-ov -format UDZO dist/DataTools-1.0.0-mac.dmg
```
### Windows
```powershell
# One-time:
py -3.12 -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
pip install pyinstaller
# Build:
pyinstaller build\datatools.spec --clean
# Output:
# dist\DataTools\ — folder mode
# dist\DataTools\DataTools.exe
# Wrap with Inno Setup (free):
# 1. Install Inno Setup (https://jrsoftware.org/isdl.php)
# 2. Create installer.iss next to this README:
# [Setup]
# AppName=DataTools
# AppVersion=1.0.0
# DefaultDirName={autopf}\DataTools
# ; Paths are relative to the script's own directory (build\).
# OutputDir=..\dist
# OutputBaseFilename=DataTools-1.0.0-win-setup
# Compression=lzma
# SolidCompression=yes
# [Files]
# Source: "..\dist\DataTools\*"; DestDir: "{app}"; Flags: recursesubdirs
# [Icons]
# Name: "{autoprograms}\DataTools"; Filename: "{app}\DataTools.exe"
# 3. Compile: ISCC.exe build\installer.iss
# Code-sign (optional but reduces SmartScreen warnings):
# Use signtool with a code-signing cert (Sectigo / DigiCert).
# Without signing, buyer sees "Windows protected your PC" once;
# they click "More info → Run anyway." Acceptable for v1.
```
### Linux (AppImage)
```bash
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install pyinstaller
pyinstaller build/datatools.spec --clean
# dist/DataTools/ — folder mode
# Wrap as AppImage (single-file portable app):
# 1. Download appimagetool from https://appimage.org/
# 2. Set up the AppDir layout:
# DataTools.AppDir/
# ├── AppRun -> ./DataTools/DataTools
# ├── DataTools.desktop (icon + entry config)
# ├── icon.png
# └── usr/bin/ -> dist/DataTools/*
# 3. ./appimagetool DataTools.AppDir dist/DataTools-1.0.0-linux-x86_64.AppImage
```
## CI build (recommended once the spec is stable)
`.github/workflows/build.yml` (template):
```yaml
name: Build installers
on:
workflow_dispatch:
push:
tags: [ 'v*' ]
jobs:
build:
strategy:
matrix:
os: [macos-latest, windows-latest, ubuntu-latest]
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with: { python-version: '3.12' }
- run: pip install -r requirements.txt pyinstaller
- run: pyinstaller build/datatools.spec --clean
- uses: actions/upload-artifact@v4
with:
name: DataTools-${{ matrix.os }}
path: dist/
```
Mac code-signing in CI requires the cert + private key as a GitHub
secret (encoded with `base64`). Detailed walkthrough belongs in a
later doc — for v1, sign locally and upload to GitHub Releases.
## Tesseract bundling (PDF Extractor OCR)
Frozen artifacts ship a per-platform Tesseract binary plus the English
`eng.traineddata` model so scanned-PDF support in the PDF Extractor
works out of the box — no separate user install. Source / pip
developer setups still need system Tesseract on `PATH`.
**Layout inside the bundle**:
```
DataTools/ (or DataTools.app/Contents/MacOS/)
└── tesseract/
├── tesseract (Linux/macOS binary; tesseract.exe on Windows)
└── tessdata/
└── eng.traineddata
```
The runtime resolver (in `src/`, owned by the runtime team) tries these
locations in order and uses the first one that resolves:
1. `DATATOOLS_TESSERACT_BIN` env var override.
2. `Path(sys._MEIPASS) / "tesseract" / "tesseract[.exe]"` — frozen
bundles only.
3. `tesseract` on `PATH`.
4. Windows well-known paths.
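The discovery order above can be sketched as a single function. This is not the resolver in `src/`: the function name, the injectable parameters, and the two Windows paths are assumptions made for illustration (the list above only says "well-known paths").

```python
import os
import shutil
import sys
from pathlib import Path
from typing import Callable, Mapping, Optional

# Assumed install locations; not taken from the real resolver.
WINDOWS_WELL_KNOWN = [
    Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe"),
    Path(r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"),
]


def resolve_tesseract(
    env: Optional[Mapping[str, str]] = None,
    meipass: Optional[str] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
    is_windows: Optional[bool] = None,
) -> Optional[Path]:
    """Return the first Tesseract binary found, or None."""
    env = os.environ if env is None else env
    if is_windows is None:
        is_windows = sys.platform.startswith("win")
    exe_name = "tesseract.exe" if is_windows else "tesseract"

    # 1. Explicit override always wins (returned as given, unvalidated).
    override = env.get("DATATOOLS_TESSERACT_BIN")
    if override:
        return Path(override)

    # 2. Bundled copy; sys._MEIPASS only exists inside a frozen bundle.
    if meipass is None:
        meipass = getattr(sys, "_MEIPASS", None)
    if meipass:
        bundled = Path(meipass) / "tesseract" / exe_name
        if bundled.is_file():
            return bundled

    # 3. Whatever is on PATH (source / pip developer setups).
    found = which("tesseract")
    if found:
        return Path(found)

    # 4. Windows fallback locations.
    if is_windows:
        for candidate in WINDOWS_WELL_KNOWN:
            if candidate.is_file():
                return candidate
    return None
```

Putting the env var first lets a developer point a frozen build at a different Tesseract without rebuilding.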
**Where the bytes come from**:
- **Tessdata** — vendored in-repo at `build/vendor/tessdata/eng.traineddata`
(sourced from [tessdata_best](https://github.com/tesseract-ocr/tessdata_best)).
`datatools.spec` copies it into `tesseract/tessdata/`.
- **Binary** — fetched per-platform at build time by
`build/make_release.py` from pinned upstream URLs. Current pin:
**Tesseract 5.5.0**.
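In PyInstaller terms, copying the vendored model into `tesseract/tessdata/` comes down to a `datas` entry of the form `(source_path, destination_dir)`. The fragment below is a sketch of that idea, not an excerpt from `datatools.spec`; the variable and helper names are hypothetical.

```python
from pathlib import Path

# Repo-relative location of the vendored model, as stated above.
TESSDATA_SRC = Path("build/vendor/tessdata/eng.traineddata")

# PyInstaller `datas` entries are (source_path, destination_dir_in_bundle).
tesseract_datas = [
    (str(TESSDATA_SRC), "tesseract/tessdata"),
]


def bundled_tessdata_dir(bundle_root: Path) -> Path:
    """Where the model directory ends up inside an unpacked bundle."""
    return Path(bundle_root) / "tesseract" / "tessdata"
```

A list like this would be passed to `Analysis(..., datas=...)` in the spec alongside the existing data files.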
**Updating Tesseract**:
1. Bump the version pin and the per-platform fetch URLs in
`build/make_release.py`.
2. If the model schema changed upstream, refresh
`build/vendor/tessdata/eng.traineddata` from `tessdata_best` at the
matching tag.
3. Rebuild on each platform (`python build/make_release.py`) and
smoke-test a scanned PDF through the PDF Extractor.
4. Update `LICENSE_TESSERACT.txt` at the repo root if upstream license
terms change (Apache-2.0 today).
License attribution for the bundled binary lives at
`LICENSE_TESSERACT.txt` at the repo root — it must ship alongside any
binary that contains Tesseract.
## Common pitfalls
| Symptom | Fix |
|---|---|
| Bundle is 800+ MB | Check the ``excludes`` list in ``datatools.spec``. ``matplotlib`` / ``scipy`` / ``tkinter`` are the usual suspects. |
| App launches, browser opens, page is blank | Streamlit's static assets aren't bundled. Re-run with `--log-level=DEBUG` and confirm the static dir was collected by `collect_data_files('streamlit')`. |
| App launches but logs ``ImportError: streamlit.runtime.X`` | Add ``X`` to ``hidden_imports`` in the spec or to ``hook-streamlit.py``. |
| Mac Gatekeeper says "DataTools is damaged and can't be opened" | The bundle wasn't signed + notarized. Don't ship to buyers without these — see the Mac recipe above. |
| Windows SmartScreen blocks first launch | Buyer clicks "More info → Run anyway". Code-signing reduces but doesn't eliminate this; for v1 it's an accepted friction. |
| Bundle works on dev machine but crashes on a clean machine | Likely a missing C runtime. On Windows, ship the [VC++ redistributable](https://aka.ms/vs/17/release/vc_redist.x64.exe) in the installer alongside the bundle. |
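When chasing an oversized bundle, it helps to see which top-level entries in `dist/DataTools/` take the most space before editing `excludes`. A small sketch with hypothetical helper names:

```python
from pathlib import Path


def dir_size(path: Path) -> int:
    """Total size in bytes of all files under path."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())


def largest_entries(bundle_dir: Path, top: int = 10) -> list[tuple[str, int]]:
    """Return (name, bytes) for the biggest top-level entries, largest first."""
    sizes = []
    for entry in Path(bundle_dir).iterdir():
        size = dir_size(entry) if entry.is_dir() else entry.stat().st_size
        sizes.append((entry.name, size))
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]
```

If `matplotlib` or `scipy` show up near the top of `largest_entries(Path("dist/DataTools"))`, the `excludes` list is the first place to look.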
## Testing the bundle
Smoke-test on a **clean** machine (or VM) — your dev machine has too
much state to trust:
```
1. Boot a clean Mac / Win / Linux VM.
2. Copy the .dmg / .exe / .AppImage onto it.
3. Install / drag-drop into Applications / chmod +x.
4. Double-click the app icon.
5. Browser should open to http://127.0.0.1:850x within 5 seconds.
6. Drop samples/demo/shopify_pet_customers.csv into the
Automated Workflows page; click Run; AFTER preview should appear.
7. Confirm in the network tab: zero outbound calls except to
127.0.0.1 and the Streamlit static asset paths (also local).
```
Step 7 is the privacy-claim integrity check from
`docs/POST-LAUNCH.md` §6 — do this once per release, then trust it.
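Step 5 can be partly automated by polling the loopback port until the server answers. The sketch below only checks that something accepts TCP connections on `127.0.0.1`; it does not replace the manual network-tab check in step 7. The function name and the example port are hypothetical.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            time.sleep(0.1)  # server not up yet; retry shortly
    return False
```

For example, `wait_for_port("127.0.0.1", 8501)` would cover the 5-second expectation in step 5, assuming the launcher picked port 8501.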
## Versioning
Bump the version string in three places per release:
- `datatools.spec` (CFBundleVersion + CFBundleShortVersionString)
- the Inno Setup `AppVersion` line
- the AppImage filename
A single source of truth (e.g. `src/__init__.py`) is a future
refactor — for v1 the three-spot update is fine.