datatools-dev/tests/test_pdf_extract.py
Michael 263af3c7c2 fix(pdf): short dates without year + diagnostic for "0 rows" runs
A user uploaded a real Chase statement and got "0 rows detected."
Two bugs the rewrite shipped with, plus a diagnostic:

**1. Short dates without year weren't recognized.** Most bank
statements (Chase, Wells, BofA, …) display transaction dates as
``01/13`` or ``Jan 13`` because the year is implied by the
statement period. The original regex required ``\d{2,4}`` after
the second slash, so ``01/13`` failed to match and rows with no
detected date got dropped.

Split ``_DATE_RES`` into ``_FULL`` (with year) and ``_SHORT``
(no year), with a two-pass detector: pass 1 tries full-year
patterns across the whole row; pass 2 only tries short patterns
if pass 1 found nothing. This prevents a stray ``Page 1/2`` from
shadowing the real dated transaction on the same line.

Short patterns:
- ``\d{1,2}/\d{1,2}`` — Chase, etc.
- ``\d{1,2}-\d{1,2}``
- ``[A-Z][a-z]{2}\s+\d{1,2}`` — "Jan 13"
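
A minimal sketch of the two-pass idea, for illustration only. The names
``FULL_DATE_RES``, ``SHORT_DATE_RES`` and ``find_date`` are hypothetical
stand-ins, and the exact patterns in the module may differ:

```python
import re

# Hypothetical stand-ins for the full-year and short pattern lists.
FULL_DATE_RES = [
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    re.compile(r"\b\d{1,2}-\d{1,2}-\d{2,4}\b"),
]
SHORT_DATE_RES = [
    re.compile(r"\b\d{1,2}/\d{1,2}\b"),
    re.compile(r"\b\d{1,2}-\d{1,2}\b"),
    re.compile(r"\b[A-Z][a-z]{2}\s+\d{1,2}\b"),
]


def find_date(row_text):
    """Return the first date-like substring, preferring full-year matches."""
    # Pass 1: full-year patterns across the whole row.
    for pattern in FULL_DATE_RES:
        m = pattern.search(row_text)
        if m:
            return m.group(0)
    # Pass 2: short patterns, reached only when pass 1 found nothing.
    for pattern in SHORT_DATE_RES:
        m = pattern.search(row_text)
        if m:
            return m.group(0)
    return None
```

With this ordering, ``Page 1/2 01/13/2026 Coffee 4.50`` yields
``01/13/2026`` because the short pattern never gets a chance to claim
``1/2``.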

At parse time, ``parse_date`` returns None for a short date (there is
no year to bind it to), so the scanner falls back to the raw text:
the user sees ``01/13`` in the date column and can correct it in the
editor.
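
The fallback can be pictured roughly as below. This is a sketch under
assumptions: ``parse_date_sketch`` and ``date_cell`` are hypothetical
names, and the real ``parse_date`` likely accepts more formats:

```python
from datetime import datetime


def parse_date_sketch(text):
    """Hypothetical parser: succeeds only when the text carries a year."""
    for fmt in ("%m/%d/%Y", "%m/%d/%y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # short dates such as "01/13" end up here


def date_cell(raw):
    """Value shown in the date column: parsed date, else the raw text."""
    parsed = parse_date_sketch(raw)
    return parsed if parsed is not None else raw
```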

**2. Multi-word dates leaked the day token into the description.**
A pre-existing bug: ``_find_dates_in_words`` returned only the
START word index, and ``_description_from_row`` only excluded
that single word. For "Jan 13 Coffee $4.50", the description
became "13 Coffee" instead of "Coffee". Fixed by returning
``(start, end, text)`` with ``end`` exclusive (computed from
``len(m.group(1).split())`` so window-overrun doesn't
over-consume), and the description builder now skips the full
range.
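
A rough sketch of the exclusive-end span and the range-skipping
description builder. Everything here is illustrative: the function
names, the window size of 3, and both regexes are assumptions, not the
module's actual code:

```python
import re

# Hypothetical patterns; group 1 of DATE_AT_START is the date text itself.
DATE_AT_START = re.compile(
    r"(\d{1,2}/\d{1,2}(?:/\d{2,4})?|[A-Z][a-z]{2}\s+\d{1,2})\b"
)
AMOUNT_RE = re.compile(r"^-?\$?\d[\d,]*\.\d{2}$")
WINDOW = 3  # assumed number of words joined per match attempt


def find_date_spans(words):
    """Return (start, end, text) tuples with end exclusive."""
    spans = []
    i = 0
    while i < len(words):
        window_text = " ".join(words[i:i + WINDOW])
        m = DATE_AT_START.match(window_text)
        if m:
            # Count only the words the date itself covers, not the window.
            end = i + len(m.group(1).split())
            spans.append((i, end, m.group(1)))
            i = end
        else:
            i += 1
    return spans


def description_from_words(words, spans):
    """Join the words that are neither inside a date span nor an amount."""
    skip = {k for start, end, _ in spans for k in range(start, end)}
    return " ".join(
        w for k, w in enumerate(words)
        if k not in skip and not AMOUNT_RE.match(w)
    )
```

For ``Jan 13 Coffee $4.50`` the span is ``(0, 2, "Jan 13")``, so both
date words are skipped and the description is ``Coffee``.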

**3. New diagnostic: ``diagnose_pdf_lines(pdf_bytes)``.** Returns
every clustered text line the scanner saw with ``has_date`` /
``has_amount`` flags. When the page's scan returns 0 rows, an
auto-expanded "what the scanner saw" expander now renders a
table of all extracted lines so the user can:

- Spot scanned-PDF cases (empty result → enable OCR)
- See which lines have a date but no amount (or vice versa)
- Eyeball the date / amount format the scanner missed

All of this happens without leaving the app or asking the developer
for help.
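
To show the shape of the diagnostic output, here is a sketch that
starts from already-extracted text lines rather than PDF bytes. The
helper name ``diagnose_lines``, the dict keys other than ``has_date``
and ``has_amount``, and both regexes are assumptions:

```python
import re

# Hypothetical, simplified detectors for this sketch only.
DATE_RE = re.compile(
    r"\b(?:\d{1,2}[/-]\d{1,2}(?:[/-]\d{2,4})?|[A-Z][a-z]{2}\s+\d{1,2})\b"
)
AMOUNT_RE = re.compile(r"-?\$?\d[\d,]*\.\d{2}\b")


def diagnose_lines(lines):
    """Flag each text line with whether a date and an amount were seen."""
    return [
        {
            "text": line,
            "has_date": bool(DATE_RE.search(line)),
            "has_amount": bool(AMOUNT_RE.search(line)),
        }
        for line in lines
    ]
```

An empty result corresponds to the scanned-PDF case; a row with only
one flag set points at the format the scanner failed to recognize.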

Eight new tests cover: short US date (``01/13``), short month-
name date with two-word consumption (``Jan 13``), the
``Page 1/2 ... 01/13/2026`` shadowing case, and the multi-word-
date description fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:06:07 +00:00
