docs: reflect bundled Tesseract on every install surface

- NEW LICENSE_TESSERACT.txt at the repo root: header noting it covers
  the bundled Tesseract OCR binary (Apache 2.0, upstream
  tesseract-ocr/tesseract, copyright Google + contributors) and the
  eng.traineddata from tessdata_best (also Apache 2.0). Clarifies
  DataTools itself remains proprietary. Full canonical Apache 2.0
  license text included.
- README.md + README.es.md (Download section): bumped size estimate
  ~200 MB → ~300 MB, added a short paragraph stating Tesseract OCR is
  bundled (no separate install required), with a link to the new
  license file.
- docs/USER-GUIDE.md + docs/USER-GUIDE.es.md (§1.6 System
  requirements): bumped disk estimate, added a paragraph stating
  Tesseract 5.5 + eng.traineddata ship inside every installer /
  portable / AppImage, with a source-install fallback hint pointing
  developers to DEVELOPER.md.
- docs/DEVELOPER.md: new "PDF Extractor — bundled Tesseract" section
  documenting the runtime layout (sys._MEIPASS / tesseract / …),
  discovery order, source of bytes (build/vendor/tessdata +
  per-platform fetch in make_release.py), version pin, update recipe.
- docs/TECHNICAL.md: new §3.10 "Bundled Tesseract (PDF Extractor OCR)":
  short version of the discovery order for the build pipeline section.
- build/README.md: distribution-outputs paragraph now lists Tesseract
  among bundled deps with the ~250-300 MB estimate; new "Tesseract
  bundling" section: layout diagram, resolver order, source of bytes +
  5.5.0 pin, update steps, license-file ref.

Out-of-scope gaps noted by the docs sweep:

- docs/FUTURE-TOOLS.md §D still describes Tesseract bundling as a
  high-risk packaging headache; now superseded. Worth a one-line
  "(resolved — bundled as of v1.x)" callout in a future pass.
- USER-GUIDE §2 "What's included" table doesn't list PDF Extractor at
  all (it shipped in b8aff86…967d3f6). Separate gap to close.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -54,8 +54,11 @@ for buyers (or IT-locked-down machines) that can't run installers:
| Linux | `DataTools-<ver>-linux-x86_64.AppImage`| (the AppImage IS the portable) |

All six outputs are self-contained: every dependency (Python, pandas,
-streamlit, pdfplumber, the lot) is frozen into the bundle. The buyer
-does not need to install Python, pip, or anything else first.
+streamlit, pdfplumber, **Tesseract OCR + `eng.traineddata`**, the lot)
+is frozen into the bundle. The buyer does not need to install Python,
+pip, Tesseract, or anything else first. With Tesseract bundled, each
+artifact is roughly **250–300 MB** on disk (up from ~120 MB pre-OCR);
+unpacked installs run ~300–400 MB once scratch space is counted.

## Easy-launch surface

@@ -287,6 +290,56 @@ Mac code-signing in CI requires the cert + private key as a GitHub
secret (encoded with `base64`). Detailed walkthrough belongs in a
later doc — for v1, sign locally and upload to GitHub Releases.

## Tesseract bundling (PDF Extractor OCR)

Frozen artifacts ship a per-platform Tesseract binary plus the English
`eng.traineddata` model so scanned-PDF support in the PDF Extractor
works out of the box — no separate user install. Source / pip
developer setups still need system Tesseract on `PATH`.

**Layout inside the bundle**:

```
DataTools/                 (or DataTools.app/Contents/MacOS/)
└── tesseract/
    ├── tesseract          (Linux/macOS binary; tesseract.exe on Windows)
    └── tessdata/
        └── eng.traineddata
```

The runtime resolver (in `src/`, owned by the runtime team) walks:

1. `DATATOOLS_TESSERACT_BIN` env var override.
2. `Path(sys._MEIPASS) / "tesseract" / "tesseract[.exe]"` — frozen
   bundles only.
3. `tesseract` on `PATH`.
4. Windows well-known paths.

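The discovery order above might be sketched roughly as follows. This is an illustration only, not the resolver that lives in `src/`: the function name `resolve_tesseract`, its injectable parameters, the fall-through behaviour when the override is invalid, and the two Windows candidate paths are all assumptions (the list above says "Windows well-known paths" without naming them). `sys._MEIPASS` is the attribute PyInstaller sets inside a frozen app.

```python
import os
import shutil
import sys
from pathlib import Path
from typing import Mapping, Optional, Sequence

# Assumed examples only: the README does not list the "well-known" paths.
ASSUMED_WINDOWS_PATHS = (
    Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe"),
    Path(r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"),
)


def resolve_tesseract(
    env: Optional[Mapping[str, str]] = None,
    meipass: Optional[str] = None,
    windows_paths: Sequence[Path] = ASSUMED_WINDOWS_PATHS,
) -> Optional[Path]:
    """Return the first Tesseract binary found in the documented order, or None."""
    env = os.environ if env is None else env
    exe = "tesseract.exe" if os.name == "nt" else "tesseract"

    # 1. Explicit override. Assumption: an override that does not point
    #    at a file is ignored rather than treated as an error.
    override = env.get("DATATOOLS_TESSERACT_BIN")
    if override and Path(override).is_file():
        return Path(override)

    # 2. Frozen bundle: only present when running from a PyInstaller build.
    if meipass is None:
        meipass = getattr(sys, "_MEIPASS", None)
    if meipass:
        bundled = Path(meipass) / "tesseract" / exe
        if bundled.is_file():
            return bundled

    # 3. System install on PATH (source / pip developer setups).
    found = shutil.which("tesseract", path=env.get("PATH"))
    if found:
        return Path(found)

    # 4. Windows fallback locations.
    if os.name == "nt":
        for candidate in windows_paths:
            if candidate.is_file():
                return candidate
    return None
```

Passing `env` and `meipass` explicitly keeps the function testable without freezing the app; in normal use a caller would simply call `resolve_tesseract()` and handle a `None` result by telling the user OCR is unavailable.
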
**Where the bytes come from**:

- **Tessdata** — vendored in-repo at `build/vendor/tessdata/eng.traineddata`
  (sourced from [tessdata_best](https://github.com/tesseract-ocr/tessdata_best)).
  `datatools.spec` copies it into `tesseract/tessdata/`.
- **Binary** — fetched per-platform at build time by
  `build/make_release.py` from pinned upstream URLs. Current pin:
  **Tesseract 5.5.0**.

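One way to keep the version pin and the per-platform URLs in a single place is a small lookup table. The sketch below is illustrative only: `PINS`, `pin_for`, `verify_sha256`, the platform keys, and the `example.invalid` URLs are hypothetical and are not taken from `build/make_release.py`. The checksum step is an added suggestion; the text above does not say whether the build verifies digests.

```python
import hashlib
import platform
from typing import NamedTuple, Optional

TESSERACT_VERSION = "5.5.0"  # current pin per the text above


class Pin(NamedTuple):
    url: str     # where the build would fetch the binary from
    sha256: str  # expected hex digest of the download


# Hypothetical entries: the example.invalid URLs and all-zero digests are
# placeholders, not real upstream artifacts.
_BASE = f"https://example.invalid/tesseract/{TESSERACT_VERSION}"
PINS = {
    "Windows": Pin(f"{_BASE}/windows-x86_64.zip", "0" * 64),
    "Darwin": Pin(f"{_BASE}/macos.tar.gz", "0" * 64),
    "Linux": Pin(f"{_BASE}/linux-x86_64.tar.gz", "0" * 64),
}


def pin_for(system: Optional[str] = None) -> Pin:
    """Look up the pin for a platform.system() value (defaults to this host)."""
    system = system or platform.system()
    try:
        return PINS[system]
    except KeyError:
        raise RuntimeError(f"no Tesseract pin for platform {system!r}") from None


def verify_sha256(data: bytes, expected_hex: str) -> None:
    """Raise ValueError if the downloaded bytes do not match the pinned digest."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_hex.lower():
        raise ValueError(f"checksum mismatch: got {actual}, expected {expected_hex}")
```

With a table like this, bumping the pin means editing one constant and one digest per platform, and a tampered or truncated download fails loudly instead of being frozen into the bundle.
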
**Updating Tesseract**:

1. Bump the version pin and the per-platform fetch URLs in
   `build/make_release.py`.
2. If the model schema changed upstream, refresh
   `build/vendor/tessdata/eng.traineddata` from `tessdata_best` at the
   matching tag.
3. Rebuild on each platform (`python build/make_release.py`) and
   smoke-test a scanned PDF through the PDF Extractor.
4. Update `LICENSE_TESSERACT.txt` at the repo root if upstream license
   terms change (Apache-2.0 today).

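Part of the smoke test in step 3 could be automated by checking that the rebuilt binary reports the pinned version. This is a minimal sketch, assuming the first line of `tesseract --version` looks like `tesseract 5.5.0` (some Windows builds add a `v` prefix and a build suffix). The helper names and the `EXPECTED_VERSION` constant are hypothetical.

```python
import re
import subprocess
from typing import Optional

EXPECTED_VERSION = "5.5.0"  # keep in sync with the pin in build/make_release.py


def parse_tesseract_version(output: str) -> Optional[str]:
    """Extract 'X.Y.Z' from the first line of `tesseract --version` output."""
    lines = output.strip().splitlines()
    if not lines:
        return None
    match = re.match(r"tesseract\s+v?(\d+\.\d+\.\d+)", lines[0])
    return match.group(1) if match else None


def bundled_version(binary: str) -> Optional[str]:
    """Run the binary and return its reported version, or None if unparseable."""
    result = subprocess.run(
        [binary, "--version"], capture_output=True, text=True, check=True
    )
    # Some releases have written the banner to stderr, so check both streams.
    return parse_tesseract_version(result.stdout or result.stderr)
```

A release script could then compare `bundled_version(path_to_binary)` against `EXPECTED_VERSION` and abort the build on a mismatch, before the manual scanned-PDF check.
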
License attribution for the bundled binary lives at
`LICENSE_TESSERACT.txt` at the repo root — it must ship alongside any
binary that contains Tesseract.

## Common pitfalls

| Symptom | Fix |