build: drop the local Python release method, return to CI-only installer builds

Removes the single-command Python packaging method (build/make_release.py
+ build/build_portable_zip.py + build/macos/build_zip.sh) and the portable
.zip artifacts it produced. Release builds go back to the original GitHub
Actions process: the CI matrix builds one installer per platform (.dmg /
.exe / .AppImage) on tag push and attaches them to a GitHub Release.

Tesseract OCR bundling is preserved: the fetch helpers the workflow depends
on (fetch_tessdata, fetch_tesseract_for_platform) are extracted into a
standalone build/tesseract.py, which build.yml now imports.

Docs (README, build/README, DEVELOPER, TECHNICAL, USER-GUIDE, vendor README,
es translations) updated to drop the portable-zip flavor and point at the
new module.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 17:47:36 +00:00
parent 28ab51a869
commit fd9606c67b
13 changed files with 127 additions and 608 deletions


@@ -298,7 +298,7 @@ All `DataToolsError` subclasses extend stdlib `ValueError` or `OSError` so exist
## PDF Extractor — bundled Tesseract
-Frozen builds (installer / portable .zip / AppImage) ship Tesseract OCR inside the bundle so scanned PDFs work without a separate system install. Source / `pip` developer environments still resolve Tesseract from `PATH`.
+Frozen builds (installer / AppImage) ship Tesseract OCR inside the bundle so scanned PDFs work without a separate system install. Source / `pip` developer environments still resolve Tesseract from `PATH`.
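
The split between frozen bundles and source environments could be resolved at runtime along these lines. This is a minimal sketch, not the project's actual lookup: `find_tesseract` is a hypothetical name, and the `tesseract/` subdirectory holding the executable is an assumption inferred from the `tesseract/tessdata/` path mentioned in this document. `sys.frozen` and `sys._MEIPASS` are the attributes PyInstaller sets in bundled apps.

```python
import shutil
import sys
from pathlib import Path


def find_tesseract():
    """Return the path to a Tesseract executable, or None if none is found."""
    if getattr(sys, "frozen", False):
        # Frozen build: look inside the PyInstaller bundle first.
        bundle_root = Path(getattr(sys, "_MEIPASS", Path(sys.executable).parent))
        exe_name = "tesseract.exe" if sys.platform == "win32" else "tesseract"
        candidate = bundle_root / "tesseract" / exe_name  # assumed bundle layout
        if candidate.is_file():
            return str(candidate)
    # Source / pip environment (or a bundle missing the binary): use PATH.
    return shutil.which("tesseract")
```

Falling back to `PATH` even in a frozen build is a design choice of this sketch; a stricter variant could raise an error instead, so that a broken bundle is noticed early.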
**Runtime layout (frozen bundles)**:
@@ -318,13 +318,13 @@ Frozen builds (installer / portable .zip / AppImage) ship Tesseract OCR inside t
**Where the bytes come from**:
- **Tessdata** is vendored at `build/vendor/tessdata/eng.traineddata` — the "best" English model from [tessdata_best](https://github.com/tesseract-ocr/tessdata_best). PyInstaller's spec copies it into `tesseract/tessdata/` inside the bundle.
-- **Tesseract binary** is fetched at build time by `build/make_release.py` — per-platform download URLs are pinned in that script. The current pin is **Tesseract 5.5.0**.
+- **Tesseract binary** is fetched at build time by `build/tesseract.py` — per-platform download URLs are pinned in that module. The current pin is **Tesseract 5.5.0**. CI (`.github/workflows/build.yml`) imports `fetch_tessdata` + `fetch_tesseract_for_platform` and runs them before PyInstaller.
**To update Tesseract**:
-1. Bump the version pin + the per-platform fetch URLs in `build/make_release.py`.
+1. Bump the version pin + the per-platform fetch URLs in `build/tesseract.py`.
2. If upstream changed the `eng.traineddata` schema, refresh `build/vendor/tessdata/eng.traineddata` from `tessdata_best` at the matching tag.
-3. Rebuild on each platform (`python build/make_release.py`) and smoke-test a scanned-PDF run through the PDF Extractor before tagging the release.
+3. Push a `v*` tag so CI rebuilds all three platforms, then smoke-test a scanned-PDF run through the PDF Extractor before publishing the release.
4. Update `LICENSE_TESSERACT.txt` at the repo root if the upstream license terms change (Tesseract is Apache-2.0 today).
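
Step 1 of the update procedure above comes down to editing a version constant and a per-platform URL table. The sketch below shows one way such a pin table might be organized. It is an illustration only: `TESSERACT_VERSION`, `PINNED_URLS`, and `resolve_pin` are hypothetical names, the `example.invalid` URLs are placeholders, and the real contents of `build/tesseract.py` are not shown in this diff.

```python
import platform

# Hypothetical pin; 5.5.0 is the version named in the docs above.
TESSERACT_VERSION = "5.5.0"

# Placeholder URLs (example.invalid never resolves). The real pinned
# download locations are not part of this diff.
PINNED_URLS = {
    "Windows": "https://example.invalid/tesseract-{version}-windows.zip",
    "Darwin": "https://example.invalid/tesseract-{version}-macos.tar.gz",
    "Linux": "https://example.invalid/tesseract-{version}-linux.tar.gz",
}


def resolve_pin(system=None, version=TESSERACT_VERSION):
    """Return the download URL pinned for one platform."""
    system = system or platform.system()
    try:
        template = PINNED_URLS[system]
    except KeyError:
        # Fail loudly rather than guess a URL for an unpinned platform.
        raise ValueError(f"no Tesseract pin for platform {system!r}") from None
    return template.format(version=version)


if __name__ == "__main__":
    for name in PINNED_URLS:
        print(name, resolve_pin(name))
```

Keeping the version in one constant and formatting it into each URL means a version bump touches a single line, provided upstream keeps its file naming stable. When it does not, the affected template has to be edited as well.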
## Tests