# build/vendor/ — third-party bundle inputs (fetched at build time)

This tree holds the third-party assets that get bundled into the
PyInstaller artifacts but that we deliberately do **not** keep in git
(too large / license-encumbered / re-fetchable on demand).

The build's Tesseract helper (`build/tesseract.py`) populates
everything in here before the PyInstaller step — CI
(`.github/workflows/build.yml`) calls it ahead of the build. The
contents are git-ignored except for this README.

## tessdata/

Holds the Tesseract language data file(s) used by the PDF Extractor
OCR fallback. Only English is bundled today.

### Canonical source

We use the **"best" model** from `tesseract-ocr/tessdata_best` (LSTM,
slower but higher accuracy than the legacy `tessdata` set, and only
~12 MB compressed → ~16 MB uncompressed):

```
https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata
```

There is also `tessdata_fast` (~4 MB, lower accuracy) if you ever
want to optimise for bundle size over recognition quality. For bank
statements (the only OCR use case so far), the extra accuracy of the
`_best` model is worth the additional ~12 MB.

### Why we don't vendor it in git

* ~16 MB binary file — bloats clone times for everyone, including
  contributors who never touch the OCR code path.
* Apache-2.0-licensed and stable; the file rarely changes upstream
  (last touched 2021), so a build-time fetch is safe.
* The Tesseract project explicitly distributes these via GitHub
  raw URLs — they're meant to be downloaded, not redistributed
  through other repos.

### How it gets populated

`build/tesseract.py::fetch_tessdata()` checks for
`build/vendor/tessdata/eng.traineddata` on every run. If it's
missing, it downloads the file from the canonical URL above and
caches it here. Subsequent builds reuse the cached file.
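
The check-then-download behaviour described above can be sketched roughly as follows. This is an illustration only, not the code in `build/tesseract.py`: the function name `fetch_tessdata_sketch`, the `download` parameter, and the write-to-`.part`-then-rename step are assumptions made for the example.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlopen
import shutil

TESSDATA_URL = (
    "https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata"
)
TESSDATA_PATH = Path("build/vendor/tessdata/eng.traineddata")


def _download(url: str, dest: Path) -> None:
    """Stream `url` to `dest` (hypothetical default downloader)."""
    with urlopen(url, timeout=60) as resp, dest.open("wb") as out:
        shutil.copyfileobj(resp, out)


def fetch_tessdata_sketch(
    dest: Path = TESSDATA_PATH,
    url: str = TESSDATA_URL,
    download: Callable[[str, Path], None] = _download,
) -> Path:
    """Return `dest`, downloading it first only if it is missing."""
    if dest.is_file():
        return dest  # cache hit: reuse the file from an earlier build

    dest.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: download to a temporary name and rename afterwards, so an
    # interrupted download never leaves a partial file at the cached path.
    tmp = dest.with_name(dest.name + ".part")
    try:
        download(url, tmp)
        tmp.replace(dest)
    finally:
        tmp.unlink(missing_ok=True)  # no-op after a successful rename
    return dest
```

Passing the downloader in as a parameter keeps the cache logic testable without network access; a real helper may simply call its download routine directly.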

On CI, the directory is restored from the GitHub Actions cache so we
don't pay the download cost on every run (`.github/workflows/build.yml`
caches `build/vendor/tessdata/` keyed on the URL above).

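### Manual one-time fetch (if the build host is offline or behind a proxy)

Run this on a machine that can reach GitHub, then copy the file into
place if the build host itself has no network access: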

```bash
mkdir -p build/vendor/tessdata
curl -L -o build/vendor/tessdata/eng.traineddata \
  https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata
```

Verify that the file is non-empty, roughly the expected size (~16 MB),
and not an HTML error page saved under the `.traineddata` name (a
common result of proxy or redirect failures). The script does a basic
sanity check after download.
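
For a manually fetched file, a quick check along these lines can catch the most common failures (an empty file, a truncated download, or an HTML error page). It is a minimal sketch, not the check the build script performs: the helper name `looks_like_traineddata` and the 1 MB threshold are assumptions chosen for illustration.

```python
from pathlib import Path

# Assumed lower bound; the real file is ~16 MB, so anything far below
# that is almost certainly a truncated download or an error page.
MIN_BYTES = 1_000_000


def looks_like_traineddata(path: Path, min_bytes: int = MIN_BYTES) -> bool:
    """Heuristic check that `path` is a plausible traineddata file."""
    if not path.is_file():
        return False
    if path.stat().st_size < min_bytes:
        return False
    # Proxies and failed redirects often return an HTML page with status 200.
    with path.open("rb") as fh:
        head = fh.read(64).lstrip().lower()
    return not head.startswith((b"<!doctype", b"<html"))


if __name__ == "__main__":
    target = Path("build/vendor/tessdata/eng.traineddata")
    print("ok" if looks_like_traineddata(target) else "suspicious file")
```

This only rules out obviously bad downloads; the definitive test is still whether Tesseract can load the model during an OCR run.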