
Michael fd9606c67b build: drop the local Python release method, return to CI-only installer builds

Removes the single-command Python packaging method (build/make_release.py
+ build/build_portable_zip.py + build/macos/build_zip.sh) and the portable
.zip artifacts it produced. Release builds go back to the original GitHub
Actions process: the CI matrix builds one installer per platform (.dmg /
.exe / .AppImage) on tag push and attaches them to a GitHub Release.

Tesseract OCR bundling is preserved: the fetch helpers the workflow depends
on (fetch_tessdata, fetch_tesseract_for_platform) are extracted into a
standalone build/tesseract.py, which build.yml now imports.

Docs (README, build/README, DEVELOPER, TECHNICAL, USER-GUIDE, vendor README,
es translations) updated to drop the portable-zip flavor and point at the
new module.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-22 17:47:36 +00:00

tessdata

build: bundle Tesseract 5.5.0 + tessdata into every release artifact

2026-06-02 18:20:33 +00:00

README.md

build/vendor/ — third-party bundle inputs (fetched at build time)

This tree holds the third-party assets that get bundled into the PyInstaller artifacts but that we deliberately do not keep in git (too large / license-encumbered / re-fetchable on demand).

The build's Tesseract helper (build/tesseract.py) populates everything in here before the PyInstaller step — CI (.github/workflows/build.yml) calls it ahead of the build. The contents are git-ignored except for this README.

tessdata/

Holds the Tesseract language data file(s) used by the PDF Extractor OCR fallback. Only English is bundled today.

Canonical source

We use the "best" model from tesseract-ocr/tessdata_best (LSTM, slower but higher accuracy than the legacy tessdata set, and only ~12 MB compressed → ~16 MB uncompressed):

https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata

There is also tessdata_fast/ (~4 MB, lower accuracy) if you ever want to optimise for bundle size over recognition quality. For bank statements (the only OCR use case so far), the extra accuracy of the _best model is worth the roughly 12 MB of additional bundle size (~16 MB versus ~4 MB).

Why we don't vendor it in git

~16 MB binary file — bloats clone times for everyone, including contributors who never touch the OCR code path.
Apache-2.0-licensed and stable; the file rarely changes upstream (last touched 2021), so a build-time fetch is safe.
The Tesseract project explicitly distributes these via GitHub raw URLs — they're meant to be downloaded, not redistributed through other repos.

How it gets populated

build/tesseract.py::fetch_tessdata() checks for build/vendor/tessdata/eng.traineddata on every run. If it's missing, it downloads the file from the canonical URL above and caches it here. Subsequent builds reuse the cached file.
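The check-then-download behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in build/tesseract.py: the name fetch_tessdata_sketch, the injectable download parameter, the ".part" temporary file, and the choice to treat an empty file as missing are all assumptions made for this example.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlopen

# URL taken from the "Canonical source" section of this README.
TESSDATA_URL = "https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata"
# Cache location named in this README, relative to the repository root.
TESSDATA_PATH = Path("build/vendor/tessdata/eng.traineddata")


def _download(url: str) -> bytes:
    """Default downloader (assumed): read the whole response body into memory."""
    with urlopen(url, timeout=60) as response:
        return response.read()


def fetch_tessdata_sketch(
    dest: Path = TESSDATA_PATH,
    url: str = TESSDATA_URL,
    download: Callable[[str], bytes] = _download,
) -> Path:
    """Return the cached file, downloading it first only if it is missing."""
    # Assumption: a zero-byte file counts as missing, so it is fetched again.
    if dest.is_file() and dest.stat().st_size > 0:
        return dest  # cache hit: reuse the existing file
    dest.parent.mkdir(parents=True, exist_ok=True)
    data = download(url)
    if not data:
        raise RuntimeError(f"empty download from {url}")
    # Write to a temporary name and rename afterwards, so an interrupted
    # run does not leave a truncated eng.traineddata in the cache.
    tmp = dest.with_name(dest.name + ".part")
    tmp.write_bytes(data)
    tmp.replace(dest)
    return dest
```

Passing the downloader in as a parameter keeps the caching logic testable without network access; a second call with the file already in place returns immediately without invoking it.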

On CI, the directory is restored from the GitHub Actions cache so we don't pay the download cost on every run (.github/workflows/build.yml caches build/vendor/tessdata/ keyed on the URL above).
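To illustrate what "keyed on the URL" buys, here is a small sketch of one way a cache key could be derived from the download URL so that changing the URL automatically invalidates the old cache entry. The function name, the "tessdata" prefix, and the use of a truncated SHA-256 digest are assumptions for illustration; the key expression actually used in build.yml is not shown in this README.

```python
import hashlib


def tessdata_cache_key(url: str, prefix: str = "tessdata") -> str:
    """Build a short, deterministic cache key from the download URL."""
    # Hashing the URL means any change to it (different model, branch,
    # or language file) yields a different key and therefore a cache miss.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}-{digest}"
```

The same URL always maps to the same key, so repeated CI runs hit the cache, while switching to another model URL produces a new key and triggers a fresh download.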

Manual one-time fetch (if you're offline or behind a proxy)

mkdir -p build/vendor/tessdata
curl -fL -o build/vendor/tessdata/eng.traineddata \
  https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata

Verify that the file is non-empty and close to the expected ~16 MB; a file of only a few kilobytes is usually a saved error page or a truncated download. The script does a basic sanity check after download.
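A quick manual check after a hand-run download could look like the sketch below. It is not the sanity check that the build script performs (that code is not reproduced here); the function name, the 1 MB threshold, and the HTML heuristic are assumptions chosen for this example. It tests only the file size and whether the file looks like a saved HTML error page.

```python
from pathlib import Path

# Hypothetical threshold: both model sizes quoted in this README
# (~4 MB fast, ~16 MB best) are well above 1 MB, so anything smaller
# is treated as a failed or truncated download.
MIN_SIZE_BYTES = 1_000_000


def looks_like_traineddata(path: Path, min_size: int = MIN_SIZE_BYTES) -> bool:
    """Heuristic check: file exists, is large enough, and is not HTML."""
    if not path.is_file():
        return False
    if path.stat().st_size < min_size:
        return False
    with path.open("rb") as fh:
        head = fh.read(64).lstrip().lower()
    # An error page saved by mistake starts with markup, not binary data.
    return not (head.startswith(b"<!doctype") or head.startswith(b"<html"))
```

For example, `looks_like_traineddata(Path("build/vendor/tessdata/eng.traineddata"))` returning False is a signal to delete the file and fetch it again.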