docs: reflect bundled Tesseract on every install surface

- NEW LICENSE_TESSERACT.txt at the repo root: header noting it covers
  the bundled Tesseract OCR binary (Apache 2.0, upstream
  tesseract-ocr/tesseract, copyright Google + contributors) and the
  eng.traineddata from tessdata_best (also Apache 2.0). Clarifies
  DataTools itself remains proprietary. Full canonical Apache 2.0
  license text included.
- README.md + README.es.md (Download section): bumped size estimate
  ~200 MB → ~300 MB, added a short paragraph stating Tesseract OCR
  is bundled (no separate install required), with a link to the new
  license file.
- docs/USER-GUIDE.md + docs/USER-GUIDE.es.md (§1.6 System
  requirements): bumped disk estimate, added a paragraph stating
  Tesseract 5.5 + eng.traineddata ship inside every installer /
  portable / AppImage, with a source-install fallback hint pointing
  developers to DEVELOPER.md.
- docs/DEVELOPER.md: new "PDF Extractor — bundled Tesseract" section
  documenting the runtime layout (sys._MEIPASS / tesseract / …),
  discovery order, source of bytes (build/vendor/tessdata + per-
  platform fetch in make_release.py), version pin, update recipe.
- docs/TECHNICAL.md: new §3.10 "Bundled Tesseract (PDF Extractor
  OCR)" — short version of the discovery order for the build
  pipeline section.
- build/README.md: distribution-outputs paragraph now lists
  Tesseract among bundled deps with the ~250-300 MB estimate; new
  "Tesseract bundling" section: layout diagram, resolver order,
  source of bytes + 5.5.0 pin, update steps, license-file ref.

Out-of-scope gaps noted by the docs sweep:
- docs/FUTURE-TOOLS.md §D still describes Tesseract bundling as a
  high-risk packaging headache; now superseded. Worth a one-line
  "(resolved — bundled as of v1.x)" callout in a future pass.
- USER-GUIDE §2 "What's included" table doesn't list PDF Extractor
  at all (it shipped in b8aff86…967d3f6). Separate gap to close.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-02 18:20:50 +00:00
parent 93ccada974
commit b703911df3
8 changed files with 329 additions and 6 deletions

View File

@@ -54,8 +54,11 @@ for buyers (or IT-locked-down machines) that can't run installers:
| Linux | `DataTools-<ver>-linux-x86_64.AppImage`| (the AppImage IS the portable) |
All six outputs are self-contained: every dependency (Python, pandas,
streamlit, pdfplumber, the lot) is frozen into the bundle. The buyer
does not need to install Python, pip, or anything else first.
streamlit, pdfplumber, **Tesseract OCR + `eng.traineddata`**, the lot)
is frozen into the bundle. The buyer does not need to install Python,
pip, Tesseract, or anything else first. With Tesseract bundled, each
artifact is roughly **250300 MB** on disk (up from ~120 MB pre-OCR);
unpacked installs run ~300400 MB once scratch space is counted.
## Easy-launch surface
@@ -287,6 +290,56 @@ Mac code-signing in CI requires the cert + private key as a GitHub
secret (encoded with `base64`). Detailed walkthrough belongs in a
later doc — for v1, sign locally and upload to GitHub Releases.
## Tesseract bundling (PDF Extractor OCR)
Frozen artifacts ship a per-platform Tesseract binary plus the English
`eng.traineddata` model so scanned-PDF support in the PDF Extractor
works out of the box — no separate user install. Source / pip
developer setups still need system Tesseract on `PATH`.
**Layout inside the bundle**:
```
DataTools/ (or DataTools.app/Contents/MacOS/)
└── tesseract/
├── tesseract (Linux/macOS binary; tesseract.exe on Windows)
└── tessdata/
└── eng.traineddata
```
The runtime resolver (in `src/`, owned by the runtime team) walks:
1. `DATATOOLS_TESSERACT_BIN` env var override.
2. `Path(sys._MEIPASS) / "tesseract" / "tesseract[.exe]"` — frozen
bundles only.
3. `tesseract` on `PATH`.
4. Windows well-known paths.
**Where the bytes come from**:
- **Tessdata** — vendored in-repo at `build/vendor/tessdata/eng.traineddata`
(sourced from [tessdata_best](https://github.com/tesseract-ocr/tessdata_best)).
`datatools.spec` copies it into `tesseract/tessdata/`.
- **Binary** — fetched per-platform at build time by
`build/make_release.py` from pinned upstream URLs. Current pin:
**Tesseract 5.5.0**.
**Updating Tesseract**:
1. Bump the version pin and the per-platform fetch URLs in
`build/make_release.py`.
2. If the model schema changed upstream, refresh
`build/vendor/tessdata/eng.traineddata` from `tessdata_best` at the
matching tag.
3. Rebuild on each platform (`python build/make_release.py`) and
smoke-test a scanned PDF through the PDF Extractor.
4. Update `LICENSE_TESSERACT.txt` at the repo root if upstream license
terms change (Apache-2.0 today).
License attribution for the bundled binary lives at
`LICENSE_TESSERACT.txt` at the repo root — it must ship alongside any
binary that contains Tesseract.
## Common pitfalls
| Symptom | Fix |