Three changes prepare the next tagged release so end users get
the PDF Extractor without ever touching pip.
**Exact-pin the new deps** (``requirements.txt``):
pdfplumber==0.11.9
pypdfium2==5.8.0
pytesseract==0.3.13
streamlit-drawable-canvas==0.9.3
Tight pins are the right call for these because the GUI's
visual-picker geometry + the parsing-pipeline word positions
depend on stable internal behavior — a quiet upstream tweak to
``extract_words`` or ``page.render`` would re-break the tool on
the next CI build. Bumping requires a deliberate edit + a CI
run, not a transient ``pip install`` resolving to whatever
``setup.py`` pulled.
Existing deps stay on their current ``>=X.Y,<X+1`` ranges; the
user's "tight pin" concern is specifically about the PDF stack.
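A minimal sketch of how exact pins like these could be checked against the installed environment. The helper names ``parse_exact_pins`` and ``find_pin_mismatches`` are hypothetical and not part of this change. The parser is deliberately simplified: it ignores environment markers and extras.

```python
from importlib import metadata


def parse_exact_pins(requirements_text):
    """Return {dist_name: version} for every ``name==version`` line."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "==" not in line:
            continue  # ranged or unpinned requirement: not our concern here
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_pin_mismatches(pins, installed_version=metadata.version):
    """Return {dist_name: (pinned, installed_or_None)} for every mismatch.

    ``installed_version`` is injectable so the check can be exercised
    without the real packages being present.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

An empty result from ``find_pin_mismatches`` means every pinned distribution is installed at exactly the pinned version.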
**Wire the new deps into the PyInstaller bundle** (``build/``):
- ``datatools.spec`` — add ``collect_submodules`` for pdfplumber,
pdfminer, pypdfium2, streamlit_drawable_canvas, PIL,
pytesseract; add ``collect_data_files`` for pypdfium2 (PDFium
native ``.dll``/``.so``/``.dylib``), streamlit_drawable_canvas
(frontend JS bundle), pdfminer (Adobe CMap tables).
- ``hooks/hook-pypdfium2.py`` — belt-and-braces hook that uses
``collect_dynamic_libs`` to force-include the PDFium binary.
Without this the visual picker fails on installed builds with a
``FileNotFoundError`` for the shared library.
- ``hooks/hook-streamlit_drawable_canvas.py`` — collects the
built JS frontend so the canvas iframe loads under the bundled
Streamlit server instead of rendering blank.
**Tesseract is intentionally NOT bundled** (option A from the
design discussion). Modern bank statements are text-based;
bundling Tesseract would ~triple installer size for a long-tail
case. The in-app banner directs users to install it from
``UB-Mannheim/tesseract`` if they need OCR. Decision is captured
in the ``project-pdf-installer-pending`` memory note.
**Smoke tests** (``tests/test_pdf_extract_smoke.py``, 17 tests)
add the layer above the pure unit tests:
- ``TestDependencyImports`` — each dep imports cleanly
- ``TestRealPdfRoundTrip`` — generates a tiny statement PDF in
memory with ``fpdf2`` (test-only dep in
``requirements-dev.txt``), runs ``extract_pages`` +
``apply_template``, asserts 3 rows out with the right signed
amounts. Catches "the build succeeded but pdfplumber breaks at
runtime."
- ``TestRenderPageImage`` — exercises pypdfium2's ``page.render`` so the
hook-bundled native lib gets a real call. This is the most
common installer-bug signature (missing .dll) and the test
catches it before users do.
- ``TestPdfDependencyMissing`` — monkeypatches ``__import__`` to
simulate a stripped install; confirms the typed exception +
actionable hint round-trip.
- ``TestPinnedVersionsMatchInstalled`` — parametrized over all
four pinned dists; uses ``importlib.metadata`` rather than
``__version__`` because pypdfium2 doesn't expose it directly.
Trips if someone bumps the pin without reinstalling.
- ``TestOcrAvailability`` — confirms ``ocr_available()`` returns
``(bool, str)`` and ``extract_pages_auto(allow_ocr=False)``
skips OCR cleanly.
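The stripped-install simulation can be sketched roughly as below. This is an illustration under assumptions, not the test code itself: ``PdfDependencyMissing``, ``block_imports`` and ``require`` are hypothetical names, and the real typed exception may differ.

```python
import builtins
from contextlib import contextmanager


class PdfDependencyMissing(ImportError):
    """Hypothetical typed error carrying an actionable install hint."""


@contextmanager
def block_imports(*blocked):
    """Temporarily make ``import <name>`` fail for the given top-level names."""
    real_import = builtins.__import__

    def fake_import(name, globals=None, locals=None, fromlist=(), level=0):
        if name.split(".")[0] in blocked:
            raise ImportError(f"No module named {name!r} (simulated)")
        return real_import(name, globals, locals, fromlist, level)

    builtins.__import__ = fake_import
    try:
        yield
    finally:
        builtins.__import__ = real_import  # always restore the real hook


def require(module_name, hint):
    """Import a dependency or raise the typed error with the hint attached."""
    try:
        # Goes through builtins.__import__, so the patch above applies.
        # importlib.import_module would bypass it.
        return __import__(module_name)
    except ImportError as exc:
        raise PdfDependencyMissing(f"{module_name} is not installed. {hint}") from exc
```

Inside ``with block_imports("pdfplumber"):`` a call such as ``require("pdfplumber", "Run: pip install pdfplumber")`` raises the typed error even when the package is installed, which is what lets the hint text be asserted on.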
All 81 PDF + audit tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1/6 of the PDF Extractor tool. Pure module — no Streamlit,
no user-config I/O — that turns a PDF blob plus a template dict
into a ``pandas.DataFrame`` of transaction rows. Primary use case
is accountant-style extraction of bank-statement transactions,
where each bank's format is encoded as a reusable template.
Pipeline:
1. ``extract_pages(pdf_bytes)`` reads with pdfplumber and surfaces
words with bounding boxes.
2. ``cluster_rows(words)`` groups words into rows by ``top``
tolerance — no reliance on PDF table-line detection (most bank
statements have no visible cell borders).
3. ``assign_columns(row_words, boundaries)`` buckets each word by
its horizontal midpoint into N+1 columns defined by N interior
x-boundaries.
4. ``_within_table_window`` slices to the band between the header
line and the end-marker (e.g. "Closing balance").
5. ``apply_template`` orchestrates the above, handling:
- parens-style negative amounts, currency stripping, custom
decimal/thousands separators
- separate debit + credit columns combined into a single signed
``amount`` (credit positive, debit negative — accounting
register convention; matches QuickBooks/Xero imports)
- multi-line description wrapping (rows with empty date column
attach to the previous row's description)
- row-level regex skip filters (e.g., "Total", "Subtotal")
- page-range filters ("all", "2-", "1,3-5")
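Steps 2 and 3 can be illustrated with a small sketch. The ``WordBox`` fields, the default tolerance, and the function bodies below are assumptions made for illustration; only the function names and the general approach come from the description above.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class WordBox:
    # Field names are assumed; they mirror the usual word-box attributes.
    text: str
    x0: float
    x1: float
    top: float


def cluster_rows(words, tolerance=3.0):
    """Group words whose ``top`` lies within ``tolerance`` of the row's first word."""
    rows = []
    for word in sorted(words, key=lambda w: (w.top, w.x0)):
        if rows and abs(word.top - rows[-1][0].top) <= tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Left-to-right order inside each row.
    return [sorted(row, key=lambda w: w.x0) for row in rows]


def assign_columns(row_words, boundaries):
    """Bucket words into len(boundaries) + 1 columns by horizontal midpoint.

    ``boundaries`` must be sorted ascending.
    """
    cells = [[] for _ in range(len(boundaries) + 1)]
    for word in sorted(row_words, key=lambda w: w.x0):
        midpoint = (word.x0 + word.x1) / 2
        cells[bisect_right(boundaries, midpoint)].append(word.text)
    return [" ".join(cell) for cell in cells]
```

Using the midpoint rather than ``x0`` keeps a right-aligned amount in its own column even when its left edge drifts across a boundary on wide values.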
Optional OCR fallback for scanned statements:
- ``page_has_extractable_text`` heuristic flags pages with <5
words as likely-scanned.
- ``ocr_available()`` checks both the ``pytesseract`` Python
binding and the Tesseract binary; surfaces a clear reason
string when either is missing.
- ``extract_pages_auto`` does text-first, OCR-the-blanks, and
returns warnings the UI can surface.
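A rough sketch of the two availability checks, assuming the Tesseract binary is discovered on ``PATH`` under the name ``tesseract`` and that a page is represented by its list of words. Both are assumptions; the real module may locate the binary differently.

```python
import importlib.util
import shutil


def page_has_extractable_text(words, min_words=5):
    """Pages with fewer than ``min_words`` words are treated as likely scanned."""
    return len(words) >= min_words


def ocr_available():
    """Return (available, reason) without importing pytesseract itself."""
    if importlib.util.find_spec("pytesseract") is None:
        return False, "The pytesseract Python package is not installed."
    if shutil.which("tesseract") is None:
        return False, "The Tesseract binary was not found on PATH."
    return True, "OCR is available."
```

Returning a reason string alongside the boolean lets the caller show the user which of the two pieces is missing.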
29 unit tests cover the parsing pipeline against synthetic
WordBox/Page data: no fixture PDFs are required, and the suite
runs in 0.1s. Real PDF extraction is exercised by hand on the
user's statements.
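As an example of the PDF-free logic that synthetic-data tests can exercise, the amount rules listed under ``apply_template`` might look like the sketch below. ``parse_amount`` and ``signed_amount`` are hypothetical helper names, and the exact edge-case handling is assumed.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_amount(text, decimal_sep=".", thousands_sep=","):
    """Parse '(1,234.56)', '$12.00' or '1.234,56 EUR' into a Decimal, else None."""
    cleaned = text.strip()
    if not cleaned:
        return None
    # Accounting-style negatives: "(900.00)" means -900.00.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    if thousands_sep:
        cleaned = cleaned.replace(thousands_sep, "")
    if decimal_sep != ".":
        cleaned = cleaned.replace(decimal_sep, ".")
    # Strip currency symbols, codes and spaces; keep digits, dot and minus.
    cleaned = re.sub(r"[^0-9.\-]", "", cleaned)
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    return -value if negative else value


def signed_amount(debit_text, credit_text, **separators):
    """Combine separate debit/credit cells: credit positive, debit negative."""
    credit = parse_amount(credit_text, **separators)
    if credit is not None:
        return abs(credit)
    debit = parse_amount(debit_text, **separators)
    if debit is not None:
        return -abs(debit)
    return None
```

``Decimal`` is used instead of ``float`` here so that cent values survive the round trip exactly; whether the real module does the same is not stated above.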
Dependencies added:
- ``pdfplumber>=0.10,<1`` — text + position extraction
- ``pypdfium2>=4,<6`` — page rasterization for OCR + visual picker
- ``streamlit-drawable-canvas>=0.9,<1`` — visual region picker
(used in commit 5)
- ``pytesseract>=0.3,<1`` — OCR (used in commit 6; system
Tesseract binary required separately)
- ``cryptography>=41,<49`` — bumped upper bound; pdfminer.six
transitively requires a recent release. Internal ed25519
license-signing usage is API-stable across the bump.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two coupled hardening upgrades.
1. Asymmetric signatures (HMAC → Ed25519)
The previous HMAC scheme used a symmetric secret that any motivated
reverse engineer could pull out of the shipped binary and use to
mint blobs for any tier / name / email. With Ed25519, the binary
ships only the public verification key; the signing key never
leaves the seller's environment, so binary compromise no longer
yields forgery.
- src/license/crypto.py rewritten around
cryptography.hazmat.primitives.asymmetric.ed25519. Same public
API surface (sign/verify/encode_blob/decode_blob), same canonical
JSON encoding — drop-in for the manager / cli / GUI layers.
- DATATOOLS_LICENSE_PRIVKEY (seller-side) and
DATATOOLS_LICENSE_PUBKEY (build-time) env vars supply the keys;
the in-source dev keypair (src/license/_dev_keypair.py)
deterministically derives from a seed phrase for repro builds and
tests.
- Blob prefix bumped DTLIC1: → DTLIC2:. Decoding a DTLIC1 blob
surfaces a clear "old format" error rather than a confusing
signature mismatch.
- scripts/generate_keypair.py mints fresh production keypairs for
the seller (run once, stash the private key offline).
- Adds cryptography>=41,<46 to requirements.txt (was an undeclared
transitive dep).
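The versioned-prefix handling can be sketched with the standard library alone. The blob layout below (base64 of canonical JSON holding a payload and an opaque signature) is an assumption made for illustration, as is the LicenseFormatError name; signing and Ed25519 verification are deliberately left out.

```python
import base64
import json

PREFIX_V1 = "DTLIC1:"  # retired HMAC-era format
PREFIX_V2 = "DTLIC2:"  # current Ed25519-era format


class LicenseFormatError(ValueError):
    """Hypothetical error type for malformed or outdated blobs."""


def canonical_json(obj):
    """Deterministic encoding: sorted keys, no insignificant whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def encode_blob(payload, signature):
    """Wrap a payload dict and raw signature bytes into a prefixed text blob."""
    body = {"payload": payload, "sig": base64.b64encode(signature).decode("ascii")}
    return PREFIX_V2 + base64.urlsafe_b64encode(canonical_json(body)).decode("ascii")


def decode_blob(blob):
    """Return (payload, signature); reject v1 blobs with an explicit message."""
    if blob.startswith(PREFIX_V1):
        raise LicenseFormatError(
            "This license uses the old DTLIC1 format; request a re-issued key."
        )
    if not blob.startswith(PREFIX_V2):
        raise LicenseFormatError("Not a recognised license blob.")
    body = json.loads(base64.urlsafe_b64decode(blob[len(PREFIX_V2):]))
    return body["payload"], base64.b64decode(body["sig"])
```

Checking the prefix before touching the signature is what turns an old blob into a readable "old format" error instead of a generic verification failure.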
2. Production-safe tripwire
assert_production_safe() refuses to boot a frozen / shipped build
when either:
- DATATOOLS_DEV_MODE=1 is set (would unconditionally bypass every
license check — fine in source/test but catastrophic in a buyer
install).
- The active verification key is still the embedded dev key (the
build pipeline forgot to set DATATOOLS_LICENSE_PUBKEY).
No-op in source / pytest runs (sys.frozen is unset) so test
fixtures and dev workflows keep working without ceremony. Called
from src/cli_license_guard.guard() and from hide_streamlit_chrome
— so it fires on every CLI invocation and every GUI page load.
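A minimal sketch of the tripwire logic described above. The real assert_production_safe() takes no arguments according to this message; the parameters here are an assumption added so the four scenarios can be exercised without a frozen build, and UnsafeProductionBuild is a hypothetical exception name.

```python
import os
import sys


class UnsafeProductionBuild(RuntimeError):
    """Hypothetical error raised when a shipped build is misconfigured."""


def assert_production_safe(active_pubkey, dev_pubkey, *, frozen=None, environ=None):
    """Refuse to run a frozen build that has dev mode on or the dev key active."""
    if frozen is None:
        # PyInstaller sets sys.frozen on bundled executables.
        frozen = bool(getattr(sys, "frozen", False))
    if environ is None:
        environ = os.environ
    if not frozen:
        return  # source checkout or pytest run: nothing to enforce
    if environ.get("DATATOOLS_DEV_MODE") == "1":
        raise UnsafeProductionBuild(
            "DATATOOLS_DEV_MODE=1 is set in a frozen build; refusing to start."
        )
    if active_pubkey == dev_pubkey:
        raise UnsafeProductionBuild(
            "Frozen build is using the embedded dev key; "
            "DATATOOLS_LICENSE_PUBKEY was not set at build time."
        )
```

Both checks fail closed: a misconfigured installer stops at startup rather than running with license enforcement effectively disabled.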
Tests: 49 license-layer unit tests (was 40); added Ed25519
wrong-key rejection, dev-keypair seed pin, blob v2 prefix, v1
rejection with clear message, and four production-safe scenarios
(no-op in source, fires on DEV_MODE in frozen, fires on dev key in
frozen, passes in frozen with prod pubkey). Total: 2024 → 2033.
Docs (REQUIREMENTS §17a, DEVELOPER licensing recipe, DECISIONS
§9b + decision log) updated with the new threat-model write-up,
key-storage workflow, and tripwire behaviour.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Rewrite README.md with project overview, quick-start, and CLI summary
- Add docs/CLI-REFERENCE.md with full flag reference and 8 recipe sections
- Add docs/DEVELOPER.md with architecture, data flow, and extension guides
- Rewrite src/core/__init__.py with public API exports and module docstring
- Add Streamlit GUI (src/gui/) with file upload, advanced options, interactive
match group review with side-by-side diff, and download buttons
- Add .gitignore, requirements.txt, all source code, tests, and sample data
- Add streamlit to requirements.txt
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>