Removes the single-command Python packaging method (build/make_release.py + build/build_portable_zip.py + build/macos/build_zip.sh) and the portable .zip artifacts it produced. Release builds go back to the original GitHub Actions process: the CI matrix builds one installer per platform (.dmg / .exe / .AppImage) on tag push and attaches them to a GitHub Release.

Tesseract OCR bundling is preserved: the fetch helpers the workflow depends on (fetch_tessdata, fetch_tesseract_for_platform) are extracted into a standalone build/tesseract.py, which build.yml now imports.

Docs (README, build/README, DEVELOPER, TECHNICAL, USER-GUIDE, vendor README, es translations) updated to drop the portable-zip flavor and point at the new module.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

# build/vendor/: third-party bundle inputs (fetched at build time)

This tree holds the third-party assets that get bundled into the
PyInstaller artifacts but that we deliberately do **not** keep in git
(too large / license-encumbered / re-fetchable on demand).

The build's Tesseract helper (`build/tesseract.py`) populates
everything in here before the PyInstaller step; CI
(`.github/workflows/build.yml`) calls it ahead of the build. The
contents are git-ignored except for this README.

## tessdata/

Holds the Tesseract language data file(s) used by the PDF Extractor
OCR fallback. Only English is bundled today.

### Canonical source

We use the **"best" model** from `tesseract-ocr/tessdata_best` (LSTM,
slower but higher accuracy than the legacy `tessdata` set, and only
~12 MB compressed → ~16 MB uncompressed):

```
https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata
```

There is also `tessdata_fast/` (~4 MB, lower accuracy) if you ever
want to optimise for bundle size over recognition quality. For bank
statements (the only OCR use case so far), the extra accuracy of the
`_best` model is worth the additional ~12 MB.

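Switching between the two model sets only changes the repository segment of the download URL. A minimal sketch, assuming a hypothetical `tessdata_url` helper (it is not part of `build/tesseract.py`) and assuming `tessdata_fast` uses the same raw-URL layout as the `tessdata_best` URL above:

```python
# Hypothetical helper for illustration; not the project's actual code.
# Assumes tessdata_fast mirrors the tessdata_best raw-URL layout.
TESSDATA_REPOS = {
    "best": "tessdata_best",  # higher accuracy, ~16 MB for English
    "fast": "tessdata_fast",  # smaller (~4 MB), lower accuracy
}


def tessdata_url(variant: str = "best", lang: str = "eng") -> str:
    """Build the raw GitHub URL for one language file of a model set."""
    try:
        repo = TESSDATA_REPOS[variant]
    except KeyError:
        raise ValueError(f"unknown tessdata variant: {variant!r}") from None
    return f"https://github.com/tesseract-ocr/{repo}/raw/main/{lang}.traineddata"


print(tessdata_url())        # the canonical URL used today
print(tessdata_url("fast"))  # the smaller alternative
```

Keeping the variant in one lookup table means a change of model set would touch a single line, along with the CI cache key that depends on the URL.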
### Why we don't vendor it in git

* ~16 MB binary file: it bloats clone times for everyone, including
  contributors who never touch the OCR code path.
* Apache-2.0-licensed and stable; the file rarely changes upstream
  (last touched 2021), so a build-time fetch is safe.
* The Tesseract project explicitly distributes these via GitHub
  raw URLs; they're meant to be downloaded, not redistributed
  through other repos.

### How it gets populated

`build/tesseract.py::fetch_tessdata()` checks for
`build/vendor/tessdata/eng.traineddata` on every run. If it's
missing, it downloads the file from the canonical URL above and
caches it here. Subsequent builds reuse the cached file.

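The check-then-download behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `build/tesseract.py`: the names `fetch_if_missing` and `_download` are hypothetical, and the injectable `download` parameter exists only so the logic can be exercised without network access.

```python
import shutil
import urllib.request
from pathlib import Path
from typing import Callable

TESSDATA_URL = "https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata"


def _download(url: str, dest: Path) -> None:
    """Stream url to a temp file, then rename it into place.

    An interrupted download therefore never leaves a truncated file at dest.
    """
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as resp, open(tmp, "wb") as out:
        shutil.copyfileobj(resp, out)
    tmp.replace(dest)


def fetch_if_missing(
    dest: Path,
    url: str = TESSDATA_URL,
    download: Callable[[str, Path], None] = _download,
) -> bool:
    """Return True if a download happened, False if the cached file was reused."""
    if dest.is_file() and dest.stat().st_size > 0:
        return False  # cache hit: nothing to do
    dest.parent.mkdir(parents=True, exist_ok=True)
    download(url, dest)
    return True
```

Calling `fetch_if_missing(Path("build/vendor/tessdata/eng.traineddata"))` twice would download on the first call and return `False` on the second.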
On CI, the directory is restored from the GitHub Actions cache so we
don't pay the download cost on every run (`.github/workflows/build.yml`
caches `build/vendor/tessdata/` keyed on the URL above).

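Keying the cache on the URL means the cached directory is invalidated whenever the download source changes. The workflow's actual key format is not shown here; the sketch below only illustrates the idea with a hypothetical `cache_key` function that hashes the URL into a short, stable identifier.

```python
import hashlib

TESSDATA_URL = "https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata"


def cache_key(url: str, prefix: str = "tessdata") -> str:
    """Derive a deterministic cache key from the download URL (illustrative only)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}-{digest}"


# Same URL gives the same key; switching to tessdata_fast would give a new key
# and therefore a cache miss on the next CI run.
print(cache_key(TESSDATA_URL))
```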
## Manual one-time fetch (if you're offline or behind a proxy)

```bash
# Create the cache directory, then download the English "best" model into it.
mkdir -p build/vendor/tessdata
curl -L -o build/vendor/tessdata/eng.traineddata \
  https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata
```

Verify the file is non-empty and starts with the magic bytes
`b"\x00\x00\x00\x00"` followed by a header that Tesseract (invoked
through `pytesseract`) can read; the script does a basic sanity check
after download.
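For a manually fetched file, a similar check can be run by hand. The sketch below is an assumption about what a "basic sanity check" might look like, not the script's actual logic: `check_traineddata` and the 1 MB size floor are hypothetical. It returns the first four bytes so you can compare them against the expected header yourself.

```python
from pathlib import Path

MIN_SIZE_BYTES = 1_000_000  # assumed floor; the real file is ~16 MB


def check_traineddata(path: Path, min_size: int = MIN_SIZE_BYTES) -> bytes:
    """Raise ValueError if path looks wrong; otherwise return its first 4 bytes."""
    if not path.is_file():
        raise ValueError(f"{path} does not exist")
    size = path.stat().st_size
    if size < min_size:
        raise ValueError(f"{path} is only {size} bytes; expected at least {min_size}")
    with open(path, "rb") as fh:
        return fh.read(4)


if __name__ == "__main__":
    target = Path("build/vendor/tessdata/eng.traineddata")
    try:
        print("first bytes:", check_traineddata(target))
    except ValueError as exc:
        print("sanity check failed:", exc)
```

A truncated download, or a small HTML error page saved by `curl` in place of the model, would fail the size check.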