User feedback: the template / visual-picker / mode-dispatch
implementation was too complex for the actual workflow.
Statements drift between months, the canvas state didn't survive
multi-page navigation, and accountants don't want to maintain
per-bank configuration just to convert PDFs to CSV.
Start-over design — one public function, one page, no
persistence:
``scan_pdf_for_transactions(pdf_bytes) → (rows, warnings)``
A row is "any text line with a date pattern AND at least one
amount pattern." Each detected row is a dict shaped::
{
"date": "2026-01-15",
"description": "Coffee Shop",
"amount_1": -4.50,
"amount_2": 1000.00, # if a second amount was found
"page": 1,
"raw": "01/15/2026 Coffee Shop (4.50) 1,000.00",
"source_file": "chase-jan-2026.pdf",
}
Multi-line descriptions still merge (no-date no-amount lines
attach to the previous transaction). Multi-PDF batches share a
single combined table with a ``source_file`` column.
**Page UX:**
- Upload PDF(s) → optional Options expander (parens-negative,
use-OCR) → click Scan → see all detected rows in an
``st.data_editor``.
- The editor has an ``Include`` checkbox column (default on),
plus user-editable date / description / amount cells and a
read-only ``raw`` column showing the original PDF text for
verification.
- A ``Columns to include in CSV`` multiselect hides
``page`` / ``raw`` from the download by default; user can
re-add either.
- Download CSV gets only the checked rows.
No template save/load. No visual picker. No mode dispatch. No
column boundaries. No schema migration. No per-bank
configuration files.
**Deletions:**
- ``src/pdf_templates.py`` — template storage layer
- ``src/gui/_drawable_canvas_compat.py`` — Streamlit compat shim
for the canvas (no canvas now)
- ``tests/test_pdf_templates.py``, ``test_pdf_row_heuristic.py``,
``test_drawable_canvas_compat.py`` — covered the removed APIs
- ``build/hooks/hook-streamlit_drawable_canvas.py`` — hook for
the removed dep
- ``streamlit-drawable-canvas==0.9.3`` from ``requirements.txt``
- The drawable-canvas references in ``build/datatools.spec``
**``src/pdf_extract.py``** shrinks from ~30 helper functions to
~10. Keeps: value parsers, row clusterer, date/amount token
finders, OCR pipeline, dependency guards. The one new public
function ``scan_pdf_for_transactions`` glues them together.
**Tests** (59 passing): the unit layer keeps full coverage of
the building blocks; the smoke layer pins the end-to-end PDF
roundtrip, OCR discovery, dependency-import behavior, and the
multi-line-description merge. The fpdf2-generated fixture PDF
still drives the real-PDF test.
Rollback: ``git revert HEAD`` brings back the template system if
needed — but the simpler model should make that unlikely.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
24 lines
744 B
Plaintext
24 lines
744 B
Plaintext
pandas>=2.2,<3
|
|
openpyxl>=3.1,<4
|
|
numpy>=1.26,<3
|
|
rapidfuzz>=3.6,<4
|
|
charset-normalizer>=3.3,<4
|
|
loguru>=0.7,<1
|
|
tqdm>=4.66,<5
|
|
typer>=0.12,<1
|
|
phonenumbers>=8.13,<9
|
|
streamlit>=1.35,<2
|
|
cryptography>=41,<49
|
|
# PDF Extractor stack — pinned to exact tested versions so a future
|
|
# upstream release can't quietly change pdfplumber's word-position
|
|
# behavior or pypdfium2's OCR rendering mid-build. Bump these
|
|
# explicitly when re-testing against a new release.
|
|
#
|
|
# ``pypdfium2`` is here for the OCR fallback path only (rasterizing
|
|
# pages to images for Tesseract). The drawable-canvas dep was
|
|
# removed when the visual picker was ripped out — the scanner is
|
|
# pure heuristic now, no coordinate UI.
|
|
pdfplumber==0.11.9
|
|
pypdfium2==5.8.0
|
|
pytesseract==0.3.13
|