User feedback: the template / visual-picker / mode-dispatch
implementation was too complex for the actual workflow.
Statements drift between months, the canvas state didn't survive
multi-page navigation, and accountants don't want to maintain
per-bank configuration just to convert PDFs to CSV.
Start-over design — one public function, one page, no
persistence:
``scan_pdf_for_transactions(pdf_bytes) → (rows, warnings)``
A row is "any text line with a date pattern AND at least one
amount pattern." Each detected row is a dict shaped::
{
"date": "2026-01-15",
"description": "Coffee Shop",
"amount_1": -4.50,
"amount_2": 1000.00, # if a second amount was found
"page": 1,
"raw": "01/15/2026 Coffee Shop (4.50) 1,000.00",
"source_file": "chase-jan-2026.pdf",
}
Multi-line descriptions still merge (no-date no-amount lines
attach to the previous transaction). Multi-PDF batches share a
single combined table with a ``source_file`` column.
**Page UX:**
- Upload PDF(s) → optional Options expander (parens-negative,
use-OCR) → click Scan → see all detected rows in an
``st.data_editor``.
- The editor has an ``Include`` checkbox column (default on),
plus user-editable date / description / amount cells and a
read-only ``raw`` column showing the original PDF text for
verification.
- A ``Columns to include in CSV`` multiselect hides
``page`` / ``raw`` from the download by default; user can
re-add either.
- Download CSV gets only the checked rows.
No template save/load. No visual picker. No mode dispatch. No
column boundaries. No schema migration. No per-bank
configuration files.
**Deletions:**
- ``src/pdf_templates.py`` — template storage layer
- ``src/gui/_drawable_canvas_compat.py`` — Streamlit compat shim
for the canvas (no canvas now)
- ``tests/test_pdf_templates.py``, ``test_pdf_row_heuristic.py``,
``test_drawable_canvas_compat.py`` — covered the removed APIs
- ``build/hooks/hook-streamlit_drawable_canvas.py`` — hook for
the removed dep
- ``streamlit-drawable-canvas==0.9.3`` from ``requirements.txt``
- The drawable-canvas references in ``build/datatools.spec``
**``src/pdf_extract.py``** shrinks from ~30 helper functions to
~10. Keeps: value parsers, row clusterer, date/amount token
finders, OCR pipeline, dependency guards. The one new public
function ``scan_pdf_for_transactions`` glues them together.
**Tests** (59 passing): the unit layer keeps full coverage of
the building blocks; the smoke layer pins the end-to-end PDF
roundtrip, OCR discovery, dependency-import behavior, and the
multi-line-description merge. The fpdf2-generated fixture PDF
still drives the real-PDF test.
Rollback: ``git revert HEAD`` brings back the template system if
needed — but the simpler model should make that unlikely.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three changes prepare the next tagged release so end users get
the PDF Extractor without ever touching pip.
**Exact-pin the new deps** (``requirements.txt``):
pdfplumber==0.11.9
pypdfium2==5.8.0
pytesseract==0.3.13
streamlit-drawable-canvas==0.9.3
Tight pins are the right call for these because the GUI's
visual-picker geometry + the parsing-pipeline word positions
depend on stable internal behavior — a quiet upstream tweak to
``extract_words`` or ``page.render`` would re-break the tool on
the next CI build. Bumping requires a deliberate edit + a CI
run, not a transient ``pip install`` resolving to whatever
``setup.py`` pulled.
Existing deps stay on their current ``>=X.Y,<X+1`` ranges; the
user's "tight pin" concern is specifically about the PDF stack.
**Wire the new deps into the PyInstaller bundle** (``build/``):
- ``datatools.spec`` — add ``collect_submodules`` for pdfplumber,
pdfminer, pypdfium2, streamlit_drawable_canvas, PIL,
pytesseract; add ``collect_data_files`` for pypdfium2 (PDFium
native ``.dll``/``.so``/``.dylib``), streamlit_drawable_canvas
(frontend JS bundle), pdfminer (Adobe CMap tables).
- ``hooks/hook-pypdfium2.py`` — belt-and-braces hook that uses
``collect_dynamic_libs`` to force-include the PDFium binary.
Without this the visual picker silently fails on installed
builds with a ``FileNotFoundError`` for the shared library.
- ``hooks/hook-streamlit_drawable_canvas.py`` — collects the
built JS frontend so the canvas iframe loads under the bundled
Streamlit server instead of rendering blank.
**Tesseract is intentionally NOT bundled** (option A from the
design discussion). Modern bank statements are text-based;
bundling Tesseract would ~triple installer size for a long-tail
case. The in-app banner directs users to install it from
``UB-Mannheim/tesseract`` if they need OCR. Decision is captured
in the ``project-pdf-installer-pending`` memory note.
**Smoke tests** (``tests/test_pdf_extract_smoke.py``, 17 tests)
add the layer above the pure unit tests:
- ``TestDependencyImports`` — each dep imports cleanly
- ``TestRealPdfRoundTrip`` — generates a tiny statement PDF in
memory with ``fpdf2`` (test-only dep in
``requirements-dev.txt``), runs ``extract_pages`` +
``apply_template``, asserts 3 rows out with the right signed
amounts. Catches "the build succeeded but pdfplumber breaks at
runtime."
- ``TestRenderPageImage`` — exercises ``pypdfium2.render`` so the
hook-bundled native lib gets a real call. This is the most
common installer-bug signature (missing .dll) and the test
catches it before users do.
- ``TestPdfDependencyMissing`` — monkeypatches ``__import__`` to
simulate a stripped install; confirms the typed exception +
actionable hint round-trip.
- ``TestPinnedVersionsMatchInstalled`` — parametrized over all
four pinned dists; uses ``importlib.metadata`` rather than
``__version__`` because pypdfium2 doesn't expose it directly.
Trips if someone bumps the pin without reinstalling.
- ``TestOcrAvailability`` — confirms ``ocr_available()`` returns
``(bool, str)`` and ``extract_pages_auto(allow_ocr=False)``
skips OCR cleanly.
All 81 PDF + audit tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stand up the seamless-download path for non-technical buyers:
* .github/workflows/build.yml — matrix CI (mac/win/linux) that builds
PyInstaller bundles and packages them per platform on tag push,
attaching the resulting installers to a GitHub Release.
* build/installer.iss — Inno Setup script for the Windows installer
(per-user install, optional desktop shortcut, runs on finish).
* build/macos/build_dmg.sh — wraps DataTools.app into a .dmg with a
drag-to-/Applications layout.
* build/appimage/{AppRun,datatools.desktop,build.sh} — AppImage recipe.
* src/__init__.py — single source of truth for __version__; the spec
reads it (was hardcoded), CI passes it through to all packagers.
Buyer download path now lives in the top-level README. Per-build
README documents the Phase 2 step (signing/notarization) that needs
the owner's Apple Developer + Windows code-signing credentials —
those are intentionally not in CI yet because they require setup
outside this repo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>