- NEW LICENSE_TESSERACT.txt at the repo root: header noting it covers
the bundled Tesseract OCR binary (Apache 2.0, upstream
tesseract-ocr/tesseract, copyright Google + contributors) and the
eng.traineddata from tessdata_best (also Apache 2.0). Clarifies
DataTools itself remains proprietary. Full canonical Apache 2.0
license text included.
- README.md + README.es.md (Download section): bumped size estimate
~200 MB → ~300 MB, added a short paragraph stating Tesseract OCR
is bundled (no separate install required), with a link to the new
license file.
- docs/USER-GUIDE.md + docs/USER-GUIDE.es.md (§1.6 System
requirements): bumped disk estimate, added a paragraph stating
Tesseract 5.5 + eng.traineddata ship inside every installer /
portable / AppImage, with a source-install fallback hint pointing
developers to DEVELOPER.md.
- docs/DEVELOPER.md: new "PDF Extractor — bundled Tesseract" section
documenting the runtime layout (sys._MEIPASS / tesseract / …),
discovery order, source of bytes (build/vendor/tessdata plus a
per-platform fetch in make_release.py), version pin, and update recipe.
- docs/TECHNICAL.md: new §3.10 "Bundled Tesseract (PDF Extractor
OCR)" — short version of the discovery order for the build
pipeline section.
- build/README.md: distribution-outputs paragraph now lists
Tesseract among bundled deps with the ~250-300 MB estimate; new
"Tesseract bundling" section: layout diagram, resolver order,
source of bytes + 5.5.0 pin, update steps, license-file ref.
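
For reviewers, the kind of resolver these docs describe can be pictured
roughly as below. This is a minimal sketch, not the shipped code: the
function name, the DATATOOLS_TESSERACT override variable, and the
override / bundle / PATH order are all assumptions made for illustration.
The layout below sys._MEIPASS / tesseract is elided in this message, so
the sketch searches that directory instead of guessing a path.

```python
import os
import shutil
import sys
from pathlib import Path
from typing import Mapping, Optional, Union


def find_tesseract(
    env: Optional[Mapping[str, str]] = None,
    frozen_root: Union[str, Path, None] = None,
) -> Optional[Path]:
    """Hypothetical resolver; the real discovery order lives in DEVELOPER.md."""
    env = os.environ if env is None else env
    exe = "tesseract.exe" if os.name == "nt" else "tesseract"

    # 1. Explicit override (assumed variable name, not from the source).
    override = env.get("DATATOOLS_TESSERACT")
    if override and Path(override).is_file():
        return Path(override)

    # 2. Bundled copy: PyInstaller sets sys._MEIPASS in frozen builds.
    root = frozen_root if frozen_root is not None else getattr(sys, "_MEIPASS", None)
    if root is not None:
        bundle_dir = Path(root) / "tesseract"
        if bundle_dir.is_dir():
            # Search rather than hard-code the sub-layout, which is assumed unknown.
            for candidate in sorted(bundle_dir.rglob(exe)):
                if candidate.is_file():
                    return candidate

    # 3. Source installs: fall back to whatever is on PATH.
    found = shutil.which("tesseract")
    return Path(found) if found else None
```

In a source checkout sys._MEIPASS is absent, so a call to find_tesseract()
falls through to the PATH lookup and returns None when no system Tesseract
is installed.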

Out-of-scope gaps noted by the docs sweep:
- docs/FUTURE-TOOLS.md §D still describes Tesseract bundling as a
high-risk packaging headache; now superseded. Worth a one-line
"(resolved — bundled as of v1.x)" callout in a future pass.
- USER-GUIDE §2 "What's included" table doesn't list PDF Extractor
at all (it shipped in b8aff86…967d3f6). Separate gap to close.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>