fix(pdf): short dates without year + diagnostic for "0 rows" runs

User uploaded a real Chase statement and got "0 rows detected." Two bugs the rewrite shipped with, plus a diagnostic:

**1. Short dates without year weren't recognized.** Most bank statements (Chase, Wells, BofA, …) display transaction dates as ``01/13`` or ``Jan 13`` because the year is implied by the statement period. The original regex required ``\d{2,4}`` after the second slash, so ``01/13`` failed to match and rows with no detected date got dropped.

Split ``_DATE_RES`` into ``_FULL`` (with year) and ``_SHORT`` (no year), with a two-pass detector: pass 1 tries full-year patterns across the whole row; pass 2 only tries short patterns if pass 1 found nothing. This prevents a stray ``Page 1/2`` from shadowing the real dated transaction on the same line.

Short patterns:

- ``\d{1,2}/\d{1,2}`` (Chase, etc.)
- ``\d{1,2}-\d{1,2}``
- ``[A-Z][a-z]{2}\s+\d{1,2}`` ("Jan 13")

``parse_date`` returns None for short dates (there is no year to bind to), so the scanner falls back to the raw text: the user sees ``01/13`` in the date column and can correct it in the editor.

**2. Multi-word dates leaked the day token into the description.** A pre-existing bug: ``_find_dates_in_words`` returned only the START word index, and ``_description_from_row`` only excluded that single word. For "Jan 13 Coffee $4.50", the description became "13 Coffee" instead of "Coffee". Fixed by returning ``(start, end, text)`` with ``end`` exclusive (computed from ``len(m.group(1).split())`` so window-overrun doesn't over-consume); the description builder now skips the full range.

**3. New diagnostic: ``diagnose_pdf_lines(pdf_bytes)``.** Returns every clustered text line the scanner saw, with ``has_date`` / ``has_amount`` flags. When the page's scan returns 0 rows, an auto-expanded "what the scanner saw" expander now renders a table of all extracted lines, so the user can do the following without leaving the app or asking the developer for help:

- Spot scanned-PDF cases (empty result → enable OCR)
- See which lines have a date but no amount (or vice versa)
- Eyeball the date / amount format the scanner missed

Eight new tests cover: short US date (``01/13``), short month-name date with two-word consumption (``Jan 13``), the ``Page 1/2 ... 01/13/2026`` shadowing case, and the multi-word-date description fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
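
The two-pass detection and the ``(start, end, text)`` range from items 1 and 2 can be sketched roughly as below. This is an illustrative sketch, not the code in ``src/pdf_extract``: the pattern lists, ``MAX_WINDOW``, ``find_date`` and ``description`` are hypothetical stand-ins, and the real ``_find_dates_in_words`` / ``_description_from_row`` may differ in signature and detail (for example, this sketch does not strip amounts).

```python
import re

# Hypothetical patterns modeled on the commit message; the real
# _FULL / _SHORT lists may contain more or different entries.
_FULL = [
    re.compile(r"^(\d{1,2}/\d{1,2}/\d{2,4})\b"),
    re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2},?\s+\d{4})\b"),
]
_SHORT = [
    re.compile(r"^(\d{1,2}/\d{1,2})\b"),
    re.compile(r"^(\d{1,2}-\d{1,2})\b"),
    re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2})\b"),
]
MAX_WINDOW = 3  # assumed: the longest date spans three words ("Jan 13, 2026")


def find_date(words):
    """Return (start, end, text) with end exclusive, or None if no date."""
    # Pass 1 tries full-year patterns across the whole row; pass 2 runs
    # the short patterns only if pass 1 found nothing.
    for patterns in (_FULL, _SHORT):
        for start in range(len(words)):
            window = " ".join(words[start:start + MAX_WINDOW])
            for pat in patterns:
                m = pat.match(window)
                if m:
                    text = m.group(1)
                    # Count only the words the match consumed, not the
                    # whole window, so trailing words are not swallowed.
                    end = start + len(text.split())
                    return start, end, text
    return None


def description(words, span):
    """Join the row's words, skipping the full date range."""
    start, end, _ = span
    return " ".join(words[:start] + words[end:])


row = "Jan 13 Coffee".split()
span = find_date(row)          # (0, 2, "Jan 13")
print(description(row, span))  # Coffee

# The stray "1/2" does not shadow the full-year date on the same line.
print(find_date("Page 1/2 01/13/2026 Coffee".split()))  # (2, 3, '01/13/2026')
```

Trying every start position with the full-year patterns before any short pattern is what keeps ``Page 1/2`` from winning: ``1/2`` only matches a short pattern, and pass 2 is never reached once pass 1 finds ``01/13/2026``.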
2026-05-20 00:06:07 +00:00
parent bece2b4030
commit 263af3c7c2
3 changed files with 202 additions and 29 deletions
--- a/src/gui/pages/10_PDF_Extractor.py
+++ b/src/gui/pages/10_PDF_Extractor.py
@@ -23,6 +23,7 @@ from src.audit import log_event, log_page_open
 from src.gui.components import hide_streamlit_chrome, render_sticky_footer
 from src.pdf_extract import (
    PdfDependencyMissing,
+    diagnose_pdf_lines,
    ocr_available,
    scan_pdf_for_transactions,
 )
@@ -58,6 +59,7 @@ render_sticky_footer()
 K_ROWS = "pdf_scan_rows"
 K_WARNINGS = "pdf_scan_warnings"
 K_SOURCE_COUNT = "pdf_scan_source_count"
+K_DIAGNOSTIC = "pdf_scan_diagnostic"


 # ---------------------------------------------------------------------------
@@ -130,6 +132,9 @@ scan_clicked = st.button(
 if scan_clicked and uploads:
    all_rows: list[dict] = []
    all_warnings: list[str] = []
+    # Cache the raw bytes per file so the diagnostic expander can
+    # re-extract lines without asking the user to re-upload.
+    cached_bytes: list[tuple[str, bytes]] = []
    with st.status(
        f"Scanning {len(uploads)} file(s)…",
        expanded=True,
@@ -137,8 +142,10 @@ if scan_clicked and uploads:
        for i, up in enumerate(uploads, start=1):
            st.write(f"**{i}/{len(uploads)}** · {up.name}")
            try:
+                raw = up.read()
+                cached_bytes.append((up.name, raw))
                rows, warns = scan_pdf_for_transactions(
-                    up.read(),
+                    raw,
                    negative_in_parens=negative_in_parens,
                    allow_ocr=use_ocr,
                )
@@ -164,6 +171,7 @@ if scan_clicked and uploads:
    st.session_state[K_ROWS] = all_rows
    st.session_state[K_WARNINGS] = all_warnings
    st.session_state[K_SOURCE_COUNT] = len(uploads)
+    st.session_state[K_DIAGNOSTIC] = cached_bytes

    log_event(
        "tool_run",
@@ -197,10 +205,53 @@ if rows is None:
 elif not rows:
    st.info(
        "No transaction rows detected. The scanner looks for lines "
-        "containing a date and at least one amount. Check the "
-        "warnings expander above for clues — most often the PDF is "
-        "scanned (image-only) and OCR isn't available."
+        "containing a date and at least one amount. The diagnostic "
+        "below shows every line the PDF reader could see — use the "
+        "``has_date`` and ``has_amount`` columns to spot which "
+        "pieces are missing (usually one or the other)."
    )
+    cached_bytes = st.session_state.get(K_DIAGNOSTIC) or []
+    if cached_bytes:
+        with st.expander(
+            "Diagnostic: what the scanner saw",
+            expanded=True,
+        ):
+            for fname, raw in cached_bytes:
+                st.markdown(f"**{fname}**")
+                try:
+                    lines, dwarns = diagnose_pdf_lines(
+                        raw, allow_ocr=use_ocr, max_lines=200,
+                    )
+                except Exception as e:
+                    st.error(f"Diagnostic failed: {type(e).__name__}: {e}")
+                    continue
+                for w in dwarns:
+                    st.caption(w)
+                if not lines:
+                    st.warning(
+                        "Zero text lines extracted. This is almost "
+                        "certainly a scanned (image-based) PDF — "
+                        "enable OCR in Scan options if available."
+                    )
+                    continue
+                st.dataframe(
+                    pd.DataFrame(lines),
+                    hide_index=True,
+                    use_container_width=True,
+                    height=400,
+                )
+                date_hits = sum(1 for ln in lines if ln["has_date"])
+                amt_hits = sum(1 for ln in lines if ln["has_amount"])
+                both = sum(
+                    1 for ln in lines
+                    if ln["has_date"] and ln["has_amount"]
+                )
+                st.caption(
+                    f"{len(lines):,} lines · {date_hits:,} look like "
+                    f"they contain a date · {amt_hits:,} look like "
+                    f"they contain an amount · {both:,} have both "
+                    "(those are the rows the scanner would have kept)."
+                )

 else:
    df = pd.DataFrame(rows)