datatools-dev

giteadmin/datatools-dev

Commit Graph

Author

SHA1

Message

Date

Michael

263af3c7c2

fix(pdf): short dates without year + diagnostic for "0 rows" runs

User uploaded a real Chase statement and got "0 rows detected."
Two bugs the rewrite shipped with, plus a diagnostic:

**1. Short dates without year weren't recognized.** Most bank
statements (Chase, Wells, BofA, …) display transaction dates as
``01/13`` or ``Jan 13`` because the year is implied by the
statement period. The original regex required ``\d{2,4}`` after
the second slash, so ``01/13`` failed to match and rows with no
detected date got dropped.

Split ``_DATE_RES`` into ``_FULL`` (with year) and ``_SHORT``
(no year), with a two-pass detector: pass 1 tries full-year
patterns across the whole row; pass 2 only tries short patterns
if pass 1 found nothing. This prevents a stray ``Page 1/2`` from
shadowing the real dated transaction on the same line.

Short patterns:
- ``\d{1,2}/\d{1,2}`` — Chase, etc.
- ``\d{1,2}-\d{1,2}``
- ``[A-Z][a-z]{2}\s+\d{1,2}`` — "Jan 13"

When parsing, short dates pass through ``parse_date``, which
returns None (no year to bind to), so the scanner falls back to
the raw text: the user sees ``01/13`` in the date column and
can correct it in the editor.
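
A minimal sketch of the two-pass idea, for illustration only. The pattern lists and the ``find_date`` name are assumptions; the commit's actual ``_FULL`` / ``_SHORT`` regexes are not shown here.

```python
import re

# Hypothetical stand-ins for the full-year and short (no-year) pattern lists.
FULL_PATTERNS = [
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    re.compile(r"\b\d{1,2}-\d{1,2}-\d{2,4}\b"),
]
SHORT_PATTERNS = [
    re.compile(r"\b\d{1,2}/\d{1,2}\b"),
    re.compile(r"\b\d{1,2}-\d{1,2}\b"),
    re.compile(r"\b[A-Z][a-z]{2}\s+\d{1,2}\b"),
]


def find_date(line):
    """Return (text, has_year) for the first date on the line, or None."""
    # Pass 1 tries full-year patterns over the whole line; pass 2 runs
    # only when pass 1 found nothing.
    for patterns, has_year in ((FULL_PATTERNS, True), (SHORT_PATTERNS, False)):
        matches = [m for p in patterns if (m := p.search(line))]
        if matches:
            first = min(matches, key=lambda m: m.start())
            return first.group(0), has_year
    return None
```

With this ordering, ``find_date("Page 1/2 01/13/2026 Coffee 4.50")`` picks the full date even though ``1/2`` appears earlier on the line, while ``find_date("01/13 Coffee 4.50")`` falls through to the short pass.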

**2. Multi-word dates leaked the day token into the description.**
A pre-existing bug: ``_find_dates_in_words`` returned only the
START word index, and ``_description_from_row`` only excluded
that single word. For "Jan 13 Coffee $4.50", the description
became "13 Coffee" instead of "Coffee". Fixed by returning
``(start, end, text)`` with ``end`` exclusive (computed from
``len(m.group(1).split())``, so a match window longer than the
date does not over-consume words), and the description builder
now skips the full range.
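
The range-based fix can be sketched roughly as follows. The regex, the ``WINDOW`` size, and both function names are illustrative assumptions, not the module's real ``_find_dates_in_words`` / ``_description_from_row``.

```python
import re

# Assumed stand-in for the scanner's date regexes; group(1) is the date text.
DATE_RE = re.compile(
    r"^(\d{1,2}/\d{1,2}(?:/\d{2,4})?|[A-Z][a-z]{2}\s+\d{1,2})\b"
)
WINDOW = 3  # hypothetical maximum number of words a date may span


def find_date_span(words):
    """Return (start, end, text) with end exclusive, or None."""
    for start in range(len(words)):
        window = " ".join(words[start:start + WINDOW])
        m = DATE_RE.match(window)
        if m:
            # Count the words actually matched, not the window size.
            end = start + len(m.group(1).split())
            return start, end, m.group(1)
    return None


def description_from_row(words, date_span, amount_indexes):
    """Join every word that is neither part of the date nor an amount."""
    start, end, _ = date_span
    skip = set(range(start, end)) | set(amount_indexes)
    return " ".join(w for i, w in enumerate(words) if i not in skip)
```

For ``["Jan", "13", "Coffee", "$4.50"]`` the span is ``(0, 2, "Jan 13")``, so both date words are skipped and the description is ``"Coffee"``.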

**3. New diagnostic: ``diagnose_pdf_lines(pdf_bytes)``.** Returns
every clustered text line the scanner saw with ``has_date`` /
``has_amount`` flags. When the page's scan returns 0 rows, an
auto-expanded "what the scanner saw" expander now renders a
table of all extracted lines so the user can:

- Spot scanned-PDF cases (empty result → enable OCR)
- See which lines have a date but no amount (or vice versa)
- Eyeball the date / amount format the scanner missed

None of this requires leaving the app or asking the developer for help.
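
As a rough illustration of the per-line flags, the sketch below works on already-extracted text lines. The regexes and the ``diagnose_lines`` name are assumptions; the real ``diagnose_pdf_lines`` takes PDF bytes and reuses the scanner's own detectors.

```python
import re

# Simplified, assumed detectors; the scanner's real patterns are broader.
DATE_RE = re.compile(
    r"\b(?:\d{1,2}[/-]\d{1,2}(?:[/-]\d{2,4})?|[A-Z][a-z]{2}\s+\d{1,2})\b"
)
AMOUNT_RE = re.compile(r"\(?-?\$?\d{1,3}(?:,\d{3})*\.\d{2}\)?")


def diagnose_lines(lines):
    """Flag each text line with whether a date and an amount were seen."""
    return [
        {
            "text": line,
            "has_date": bool(DATE_RE.search(line)),
            "has_amount": bool(AMOUNT_RE.search(line)),
        }
        for line in lines
    ]
```

A line flagged ``has_date`` but not ``has_amount`` (or the reverse) is the kind of near miss the expander is meant to surface.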

Eight new tests cover: short US date (``01/13``), short month-
name date with two-word consumption (``Jan 13``), the
``Page 1/2 ... 01/13/2026`` shadowing case, and the multi-word-
date description fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 00:06:07 +00:00

Michael

bece2b4030

refactor(pdf): rip out templates; heuristic scan + selectable table

User feedback: the template / visual-picker / mode-dispatch
implementation was too complex for the actual workflow.
Statements drift between months, the canvas state didn't survive
multi-page navigation, and accountants don't want to maintain
per-bank configuration just to convert PDFs to CSV.

Start-over design — one public function, one page, no
persistence:

  ``scan_pdf_for_transactions(pdf_bytes) → (rows, warnings)``

A row is "any text line with a date pattern AND at least one
amount pattern." Each detected row is a dict shaped::

    {
      "date": "2026-01-15",
      "description": "Coffee Shop",
      "amount_1": -4.50,
      "amount_2": 1000.00,   # if a second amount was found
      "page": 1,
      "raw": "01/15/2026 Coffee Shop (4.50) 1,000.00",
      "source_file": "chase-jan-2026.pdf",
    }

Multi-line descriptions still merge (no-date no-amount lines
attach to the previous transaction). Multi-PDF batches share a
single combined table with a ``source_file`` column.
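
The row rule and the multi-line merge can be sketched over plain text lines. Everything here is an illustrative assumption (regexes, helper names, full-year US dates only); the real ``scan_pdf_for_transactions`` works from PDF bytes and also returns warnings.

```python
import re
from datetime import datetime

# Assumed, simplified patterns: full-year US dates and two-decimal amounts.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
AMOUNT_RE = re.compile(r"\(?\d{1,3}(?:,\d{3})*\.\d{2}\)?")


def parse_amount(token):
    """Parse '1,000.00' or '(4.50)'; parentheses mean negative."""
    negative = token.startswith("(") and token.endswith(")")
    value = float(token.strip("()").replace(",", ""))
    return -value if negative else value


def to_iso(text):
    """Convert MM/DD/YYYY to ISO format, falling back to the raw text."""
    try:
        return datetime.strptime(text, "%m/%d/%Y").date().isoformat()
    except ValueError:
        return text


def scan_lines(lines, page=1):
    rows = []
    for line in lines:
        date = DATE_RE.search(line)
        amounts = AMOUNT_RE.findall(line)
        if date and amounts:
            # A row needs a date AND at least one amount.
            rest = AMOUNT_RE.sub(" ", line.replace(date.group(0), " ", 1))
            row = {
                "date": to_iso(date.group(0)),
                "description": " ".join(rest.split()),
            }
            for i, token in enumerate(amounts, start=1):
                row[f"amount_{i}"] = parse_amount(token)
            row["page"] = page
            row["raw"] = line
            rows.append(row)
        elif not date and not amounts and rows:
            # Continuation line: attach to the previous transaction.
            rows[-1]["description"] += " " + " ".join(line.split())
    return rows
```

Lines with an amount but no date (for example a totals line) match neither branch and are dropped in this sketch.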

**Page UX:**

- Upload PDF(s) → optional Options expander (parens-negative,
  use-OCR) → click Scan → see all detected rows in an
  ``st.data_editor``.
- The editor has an ``Include`` checkbox column (default on),
  plus user-editable date / description / amount cells and a
  read-only ``raw`` column showing the original PDF text for
  verification.
- A ``Columns to include in CSV`` multiselect hides
  ``page`` / ``raw`` from the download by default; user can
  re-add either.
- Download CSV gets only the checked rows.
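
The download step might look something like the sketch below. It assumes the edited table has been converted to a list of dicts (for example via ``DataFrame.to_dict("records")``); the function names and the ``DEFAULT_HIDDEN`` set are hypothetical.

```python
import csv
import io

# Columns left out of the download unless the user re-adds them.
DEFAULT_HIDDEN = {"page", "raw"}


def default_columns(rows):
    """Collect column names in first-seen order, minus the hidden ones."""
    seen = []
    for row in rows:
        for key in row:
            if key != "Include" and key not in DEFAULT_HIDDEN and key not in seen:
                seen.append(key)
    return seen


def rows_to_csv(rows, columns):
    """Write only the rows whose Include box is checked."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        if row.get("Include", True):
            writer.writerow(row)
    return buffer.getvalue()
```

Passing ``default_columns(rows) + ["raw"]`` would correspond to the user re-adding the ``raw`` column in the multiselect.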

No template save/load. No visual picker. No mode dispatch. No
column boundaries. No schema migration. No per-bank
configuration files.

**Deletions:**

- ``src/pdf_templates.py`` — template storage layer
- ``src/gui/_drawable_canvas_compat.py`` — Streamlit compat shim
  for the canvas (no canvas now)
- ``tests/test_pdf_templates.py``, ``test_pdf_row_heuristic.py``,
  ``test_drawable_canvas_compat.py`` — covered the removed APIs
- ``build/hooks/hook-streamlit_drawable_canvas.py`` — hook for
  the removed dep
- ``streamlit-drawable-canvas==0.9.3`` from ``requirements.txt``
- The drawable-canvas references in ``build/datatools.spec``

**``src/pdf_extract.py``** shrinks from ~30 helper functions to
~10. Keeps: value parsers, row clusterer, date/amount token
finders, OCR pipeline, dependency guards. The one new public
function ``scan_pdf_for_transactions`` glues them together.

**Tests** (59 passing): the unit layer keeps full coverage of
the building blocks; the smoke layer pins the end-to-end PDF
roundtrip, OCR discovery, dependency-import behavior, and the
multi-line-description merge. The fpdf2-generated fixture PDF
still drives the real-PDF test.

Rollback: ``git revert HEAD`` brings back the template system if
needed — but the simpler model should make that unlikely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-19 23:57:30 +00:00

Michael

b8aff862ed

feat(pdf): add pure PDF→DataFrame extraction module

Phase 1/6 of the PDF Extractor tool. Pure module — no Streamlit,
no user-config I/O — that turns a PDF blob plus a template dict
into a ``pandas.DataFrame`` of transaction rows. Primary use case
is accountant-style extraction of bank-statement transactions,
where each bank's format is encoded as a reusable template.

Pipeline:

1. ``extract_pages(pdf_bytes)`` reads with pdfplumber and surfaces
   words with bounding boxes.
2. ``cluster_rows(words)`` groups words into rows by ``top``
   tolerance — no reliance on PDF table-line detection (most bank
   statements have no visible cell borders).
3. ``assign_columns(row_words, boundaries)`` buckets each word by
   its horizontal midpoint into N+1 columns defined by N interior
   x-boundaries.
4. ``_within_table_window`` slices to the band between the header
   line and the end-marker (e.g. "Closing balance").
5. ``apply_template`` orchestrates the above, handling:
   - parens-style negative amounts, currency stripping, custom
     decimal/thousands separators
   - separate debit + credit columns combined into a single signed
     ``amount`` (credit positive, debit negative — accounting
     register convention; matches QuickBooks/Xero imports)
   - multi-line description wrapping (rows with empty date column
     attach to the previous row's description)
   - row-level regex skip filters (e.g., "Total", "Subtotal")
   - page-range filters ("all", "2-", "1,3-5")
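
Steps 2 and 3 of the pipeline can be sketched as below. The ``WordBox`` fields and the default tolerance are assumptions chosen to mirror pdfplumber's word dictionaries (``x0``, ``x1``, ``top``); the module's real signatures may differ.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class WordBox:
    # Assumed shape; the real WordBox may carry more fields.
    text: str
    x0: float
    x1: float
    top: float


def cluster_rows(words, tolerance=3.0):
    """Group words whose top is within tolerance of the row's first word."""
    rows = []
    for word in sorted(words, key=lambda w: (w.top, w.x0)):
        if rows and abs(word.top - rows[-1][0].top) <= tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Restore left-to-right reading order inside each row.
    return [sorted(row, key=lambda w: w.x0) for row in rows]


def assign_columns(row_words, boundaries):
    """Bucket words into len(boundaries) + 1 cells by horizontal midpoint.

    boundaries must be sorted ascending.
    """
    columns = [[] for _ in range(len(boundaries) + 1)]
    for word in row_words:
        midpoint = (word.x0 + word.x1) / 2
        columns[bisect_right(boundaries, midpoint)].append(word.text)
    return [" ".join(cell) for cell in columns]
```

Using the midpoint rather than the left edge means a word that slightly overhangs a boundary still lands in the column that holds most of it.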

Optional OCR fallback for scanned statements:

- ``page_has_extractable_text`` heuristic flags pages with <5
  words as likely-scanned.
- ``ocr_available()`` checks both the ``pytesseract`` Python
  binding and the Tesseract binary; surfaces a clear reason
  string when either is missing.
- ``extract_pages_auto`` does text-first, OCR-the-blanks, and
  returns warnings the UI can surface.
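
A hedged sketch of the two guards, assuming ``ocr_available`` returns an ``(ok, reason)`` pair and that the page check receives a list of words. The return shape, the ``min_words`` parameter, and the reason strings are illustrative.

```python
def page_has_extractable_text(words, min_words=5):
    """Treat pages with fewer than min_words words as likely scanned."""
    return len(words) >= min_words


def ocr_available():
    """Return (ok, reason); reason is empty when OCR can run."""
    try:
        import pytesseract
    except ImportError:
        return False, "The pytesseract package is not installed."
    try:
        # Raises TesseractNotFoundError when the binary is missing.
        pytesseract.get_tesseract_version()
    except pytesseract.TesseractNotFoundError:
        return False, "The Tesseract binary was not found on this system."
    return True, ""
```

Importing ``pytesseract`` inside the function keeps the module importable when the optional dependency is absent.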

29 unit tests cover the parsing pipeline against synthetic
WordBox/Page data — no fixture PDFs required, runs in 0.1s. Real
PDF extraction is exercised by hand on the user's statements.

Dependencies added:
- ``pdfplumber>=0.10,<1`` — text + position extraction
- ``pypdfium2>=4,<6`` — page rasterization for OCR + visual picker
- ``streamlit-drawable-canvas>=0.9,<1`` — visual region picker
  (used in commit 5)
- ``pytesseract>=0.3,<1`` — OCR (used in commit 6; system
  Tesseract binary required separately)
- ``cryptography>=41,<49`` — bumped upper bound; pdfminer.six
  transitively requires a recent release. Internal ed25519
  license-signing usage is API-stable across the bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-19 22:44:51 +00:00

3 Commits