docs: reflect bundled Tesseract on every install surface

- NEW LICENSE_TESSERACT.txt at the repo root: header noting it covers the bundled Tesseract OCR binary (Apache 2.0, upstream tesseract-ocr/tesseract, copyright Google + contributors) and the eng.traineddata from tessdata_best (also Apache 2.0). Clarifies DataTools itself remains proprietary. Full canonical Apache 2.0 license text included. - README.md + README.es.md (Download section): bumped size estimate ~200 MB → ~300 MB, added a short paragraph stating Tesseract OCR is bundled (no separate install required), with a link to the new license file. - docs/USER-GUIDE.md + docs/USER-GUIDE.es.md (§1.6 System requirements): bumped disk estimate, added a paragraph stating Tesseract 5.5 + eng.traineddata ship inside every installer / portable / AppImage, with a source-install fallback hint pointing developers to DEVELOPER.md. - docs/DEVELOPER.md: new "PDF Extractor — bundled Tesseract" section documenting the runtime layout (sys._MEIPASS / tesseract / …), discovery order, source of bytes (build/vendor/tessdata + per- platform fetch in make_release.py), version pin, update recipe. - docs/TECHNICAL.md: new §3.10 "Bundled Tesseract (PDF Extractor OCR)" — short version of the discovery order for the build pipeline section. - build/README.md: distribution-outputs paragraph now lists Tesseract among bundled deps with the ~250-300 MB estimate; new "Tesseract bundling" section: layout diagram, resolver order, source of bytes + 5.5.0 pin, update steps, license-file ref. Out-of-scope gaps noted by the docs sweep: - docs/FUTURE-TOOLS.md §D still describes Tesseract bundling as a high-risk packaging headache; now superseded. Worth a one-line "(resolved — bundled as of v1.x)" callout in a future pass. - USER-GUIDE §2 "What's included" table doesn't list PDF Extractor at all (it shipped in b8aff86…967d3f6). Separate gap to close. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:20:50 +00:00
parent 93ccada974
commit b703911df3
8 changed files with 329 additions and 6 deletions
--- a/build/README.md
+++ b/build/README.md
@@ -54,8 +54,11 @@ for buyers (or IT-locked-down machines) that can't run installers:
 | Linux    | `DataTools-<ver>-linux-x86_64.AppImage`| (the AppImage IS the portable)                 |

 All six outputs are self-contained: every dependency (Python, pandas,
-streamlit, pdfplumber, the lot) is frozen into the bundle. The buyer
-does not need to install Python, pip, or anything else first.
+streamlit, pdfplumber, **Tesseract OCR + `eng.traineddata`**, the lot)
+is frozen into the bundle. The buyer does not need to install Python,
+pip, Tesseract, or anything else first. With Tesseract bundled, each
+artifact is roughly **250–300 MB** on disk (up from ~120 MB pre-OCR);
+unpacked installs run ~300–400 MB once scratch space is counted.

 ## Easy-launch surface

@@ -287,6 +290,56 @@ Mac code-signing in CI requires the cert + private key as a GitHub
 secret (encoded with `base64`). Detailed walkthrough belongs in a
 later doc — for v1, sign locally and upload to GitHub Releases.

+## Tesseract bundling (PDF Extractor OCR)
+
+Frozen artifacts ship a per-platform Tesseract binary plus the English
+`eng.traineddata` model so scanned-PDF support in the PDF Extractor
+works out of the box — no separate user install. Source / pip
+developer setups still need system Tesseract on `PATH`.
+
+**Layout inside the bundle**:
+
+```
+DataTools/                  (or DataTools.app/Contents/MacOS/)
+└── tesseract/
+    ├── tesseract           (Linux/macOS binary; tesseract.exe on Windows)
+    └── tessdata/
+        └── eng.traineddata
+```
+
+The runtime resolver (in `src/`, owned by the runtime team) walks:
+
+1. `DATATOOLS_TESSERACT_BIN` env var override.
+2. `Path(sys._MEIPASS) / "tesseract" / "tesseract[.exe]"` — frozen
+   bundles only.
+3. `tesseract` on `PATH`.
+4. Windows well-known paths.
+
+**Where the bytes come from**:
+
+- **Tessdata** — vendored in-repo at `build/vendor/tessdata/eng.traineddata`
+  (sourced from [tessdata_best](https://github.com/tesseract-ocr/tessdata_best)).
+  `datatools.spec` copies it into `tesseract/tessdata/`.
+- **Binary** — fetched per-platform at build time by
+  `build/make_release.py` from pinned upstream URLs. Current pin:
+  **Tesseract 5.5.0**.
+
+**Updating Tesseract**:
+
+1. Bump the version pin and the per-platform fetch URLs in
+   `build/make_release.py`.
+2. If the model schema changed upstream, refresh
+   `build/vendor/tessdata/eng.traineddata` from `tessdata_best` at the
+   matching tag.
+3. Rebuild on each platform (`python build/make_release.py`) and
+   smoke-test a scanned PDF through the PDF Extractor.
+4. Update `LICENSE_TESSERACT.txt` at the repo root if upstream license
+   terms change (Apache-2.0 today).
+
+License attribution for the bundled binary lives at
+`LICENSE_TESSERACT.txt` at the repo root — it must ship alongside any
+binary that contains Tesseract.
+
 ## Common pitfalls

 | Symptom | Fix |