docs: reflect bundled Tesseract on every install surface

- NEW LICENSE_TESSERACT.txt at the repo root: header noting it covers the bundled Tesseract OCR binary (Apache 2.0, upstream tesseract-ocr/tesseract, copyright Google + contributors) and the eng.traineddata from tessdata_best (also Apache 2.0). Clarifies DataTools itself remains proprietary. Full canonical Apache 2.0 license text included.
- README.md + README.es.md (Download section): bumped size estimate ~200 MB → ~300 MB, added a short paragraph stating Tesseract OCR is bundled (no separate install required), with a link to the new license file.
- docs/USER-GUIDE.md + docs/USER-GUIDE.es.md (§1.6 System requirements): bumped disk estimate, added a paragraph stating Tesseract 5.5 + eng.traineddata ship inside every installer / portable / AppImage, with a source-install fallback hint pointing developers to DEVELOPER.md.
- docs/DEVELOPER.md: new "PDF Extractor — bundled Tesseract" section documenting the runtime layout (sys._MEIPASS / tesseract / …), discovery order, source of bytes (build/vendor/tessdata + per-platform fetch in make_release.py), version pin, update recipe.
- docs/TECHNICAL.md: new §3.10 "Bundled Tesseract (PDF Extractor OCR)": short version of the discovery order for the build pipeline section.
- build/README.md: distribution-outputs paragraph now lists Tesseract among bundled deps with the ~250-300 MB estimate; new "Tesseract bundling" section: layout diagram, resolver order, source of bytes + 5.5.0 pin, update steps, license-file ref.

Out-of-scope gaps noted by the docs sweep:

- docs/FUTURE-TOOLS.md §D still describes Tesseract bundling as a high-risk packaging headache; now superseded. Worth a one-line "(resolved — bundled as of v1.x)" callout in a future pass.
- USER-GUIDE §2 "What's included" table doesn't list PDF Extractor at all (it shipped in b8aff86…967d3f6). Separate gap to close.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:20:50 +00:00
parent 93ccada974
commit b703911df3
8 changed files with 329 additions and 6 deletions
--- a/docs/DEVELOPER.md
+++ b/docs/DEVELOPER.md
@@ -296,6 +296,37 @@ GUI / CLI handlers: use `format_for_user(exc, context="...")` to render.

 All `DataToolsError` subclasses extend stdlib `ValueError` or `OSError` so existing handlers still catch them.

+## PDF Extractor — bundled Tesseract
+
+Frozen builds (installer / portable .zip / AppImage) ship Tesseract OCR inside the bundle so scanned PDFs work without a separate system install. Source / `pip` developer environments still resolve Tesseract from `PATH`.
+
+**Runtime layout (frozen bundles)**:
+
+| Resource | Path |
+|---|---|
+| Tesseract binary | `Path(sys._MEIPASS) / "tesseract" / "tesseract"` (Linux/macOS), `…/tesseract/tesseract.exe` (Windows) |
+| Tessdata directory | `Path(sys._MEIPASS) / "tesseract" / "tessdata"` |
+| English model | `Path(sys._MEIPASS) / "tesseract" / "tessdata" / "eng.traineddata"` |
+
+**Discovery order** (PDF Extractor runtime):
+
+1. `DATATOOLS_TESSERACT_BIN` env var (override — explicit path to a `tesseract` binary).
+2. Bundled path under `sys._MEIPASS` (frozen bundles only — falls through to step 3 otherwise).
+3. `tesseract` on `PATH` (developer setups, source checkouts).
+4. Windows well-known locations (`C:\Program Files\Tesseract-OCR\tesseract.exe`, etc.).
+
+**Where the bytes come from**:
+
+- **Tessdata** is vendored at `build/vendor/tessdata/eng.traineddata` — the "best" English model from [tessdata_best](https://github.com/tesseract-ocr/tessdata_best). PyInstaller's spec copies it into `tesseract/tessdata/` inside the bundle.
+- **Tesseract binary** is fetched at build time by `build/make_release.py` — per-platform download URLs are pinned in that script. The current pin is **Tesseract 5.5.0**.
+
+**To update Tesseract**:
+
+1. Bump the version pin + the per-platform fetch URLs in `build/make_release.py`.
+2. If upstream changed the `eng.traineddata` schema, refresh `build/vendor/tessdata/eng.traineddata` from `tessdata_best` at the matching tag.
+3. Rebuild on each platform (`python build/make_release.py`) and smoke-test a scanned-PDF run through the PDF Extractor before tagging the release.
+4. Update `LICENSE_TESSERACT.txt` at the repo root if the upstream license terms change (Tesseract is Apache-2.0 today).
+
 ## Tests

 ```bash
--- a/docs/TECHNICAL.md
+++ b/docs/TECHNICAL.md
@@ -122,6 +122,17 @@ Tag a release → 3 platform artifacts upload to GitHub Releases. Manual: copy t

 `demo/streamlit_app.py` → Streamlit Community Cloud. Configure deployment in Streamlit UI. Custom domain via CNAME (verify policy at deploy time). Fall back to $5/mo VPS if rate limits / branding constraints hit.

+### 3.10 Bundled Tesseract (PDF Extractor OCR)
+
+Frozen builds ship Tesseract 5.5 + `eng.traineddata` inside the PyInstaller bundle so scanned PDFs work without a separate install. Per-platform binary URLs pinned in `build/make_release.py`; tessdata vendored at `build/vendor/tessdata/eng.traineddata`. License attribution in `LICENSE_TESSERACT.txt` at the repo root.
+
+**Discovery order at runtime** (see `docs/DEVELOPER.md` for the full Path layout):
+
+1. `DATATOOLS_TESSERACT_BIN` env var override.
+2. Bundled path under `sys._MEIPASS / "tesseract" /` (frozen bundles only).
+3. `tesseract` on `PATH` (source / pip developer environments).
+4. Windows well-known locations.
+
 ## 4. Libraries

 | Purpose | Library |
--- a/docs/USER-GUIDE.es.md
+++ b/docs/USER-GUIDE.es.md
@@ -103,7 +103,9 @@ La ventana del lanzador queda abierta en segundo plano. Cerrarla detiene el serv

 - Windows 10/11 (64 bits), macOS 11+, Linux moderno (2020+).
 - Navegador moderno (Chrome, Edge, Firefox, Safari, últimos 3 años).
-- ~400 MB de espacio libre en disco (el paquete ocupa ~200 MB; el resto es espacio de trabajo para CSV grandes).
+- ~500 MB de espacio libre en disco (el paquete ocupa ~300 MB; el resto es espacio de trabajo para CSV grandes).
+
+**OCR para PDFs escaneados viene incluido** — Tesseract 5.5 y el modelo en inglés `eng.traineddata` vienen dentro de cada instalador / portable / AppImage. La ruta de extracción de PDFs escaneados del Extractor de PDF funciona sin configuración adicional; no hace falta instalar nada por separado. (Quien ejecute desde un checkout con `pip install -r requirements.txt` sigue necesitando Tesseract del sistema en el `PATH` — ver [DEVELOPER.md §PDF Extractor — bundled Tesseract](DEVELOPER.md#pdf-extractor--bundled-tesseract) (solo en inglés).)

 Matriz de soporte completa: [REQUIREMENTS.md](REQUIREMENTS.md) (solo en inglés).

--- a/docs/USER-GUIDE.md
+++ b/docs/USER-GUIDE.md
@@ -103,7 +103,9 @@ The launcher window stays open in the background. Closing it stops the server

 - Windows 10/11 (64-bit), macOS 11+, modern Linux (2020+).
 - Modern browser (Chrome, Edge, Firefox, Safari, last 3 years).
-- ~400 MB free disk space (the bundle itself is ~200 MB; the rest is working scratch space for large CSVs).
+- ~500 MB free disk space (the bundle itself is ~300 MB; the rest is working scratch space for large CSVs).
+
+**OCR for scanned PDFs is bundled** — Tesseract 5.5 + the English `eng.traineddata` model ship inside every installer / portable / AppImage. The PDF Extractor's scanned-statement path works out of the box; no separate install required. (Developers running from a `pip install -r requirements.txt` checkout still need system Tesseract on `PATH` — see [DEVELOPER.md §PDF Extractor — bundled Tesseract](DEVELOPER.md#pdf-extractor--bundled-tesseract).)

 Full numbered support matrix: [REQUIREMENTS.md](REQUIREMENTS.md).
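
Reviewer note: the four-step discovery order documented in the `docs/DEVELOPER.md` hunk could be implemented roughly as below. This is a minimal sketch, not the PDF Extractor's actual resolver. The function names `bundled_tesseract` and `find_tesseract` and the `WINDOWS_FALLBACKS` list are hypothetical. Only the `DATATOOLS_TESSERACT_BIN` variable, the `sys._MEIPASS / "tesseract"` layout, and the first Windows path come from the docs. The sketch also assumes the env-var override is trusted without checking that the file exists, which the real code may handle differently.

```python
import os
import shutil
import sys
from pathlib import Path
from typing import Optional

ENV_VAR = "DATATOOLS_TESSERACT_BIN"

# Hypothetical list: the docs name only this location, followed by "etc."
WINDOWS_FALLBACKS = [Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe")]


def bundled_tesseract() -> Optional[Path]:
    """Return the expected bundled binary path, or None outside a frozen bundle."""
    base = getattr(sys, "_MEIPASS", None)  # set by PyInstaller at runtime
    if base is None:
        return None
    exe = "tesseract.exe" if os.name == "nt" else "tesseract"
    return Path(base) / "tesseract" / exe


def find_tesseract() -> Optional[Path]:
    """Resolve a Tesseract binary using the documented discovery order."""
    # 1. Explicit override (assumption: returned as-is, not validated).
    override = os.environ.get(ENV_VAR)
    if override:
        return Path(override)

    # 2. Bundled binary, only when running from a frozen bundle.
    bundled = bundled_tesseract()
    if bundled is not None and bundled.is_file():
        return bundled

    # 3. `tesseract` on PATH (source checkouts, developer setups).
    on_path = shutil.which("tesseract")
    if on_path:
        return Path(on_path)

    # 4. Windows well-known install locations.
    if os.name == "nt":
        for candidate in WINDOWS_FALLBACKS:
            if candidate.is_file():
                return candidate

    return None
```

Returning `None` instead of raising leaves the caller free to decide whether a missing binary is an error or whether OCR should simply be skipped.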