feat(pdf): extract statement header (account + period) + date format

Three related additions for the accountant workflow:

**1. Statement header extraction.** New
``extract_statement_metadata(pages)`` pulls the account number and
statement period out of the first page (falls back to pages 1+2 if
either is missing on page 1; Wells Fargo business accounts put header
info on page 2). Detected fields are stamped onto EVERY transaction row
so a multi-statement CSV is self-attributing per row::

    {
      "date": "20250113",
      "description": "Coffee Shop",
      "amount_1": -4.50,
      "account_number": "****5678",
      "statement_period_start": "20250101",
      "statement_period_end": "20250131",
      ...
    }

Account-number regex is tolerant of masks (``****1234``), hyphens
(``1234-5678-9012``), and spaces. Period regex looks for
"Statement Period" / "From" / "Period Covered" labels plus the first
1-2 full-year dates that follow. If only one date is present near the
label, it's used for both start and end (some statements show only the
closing date).

**2. Year inference for short dates.** When the row date is a short
``01/13`` or ``Jan 13`` without a year, the scanner now binds the year
from the statement period's end date BEFORE formatting. This does not
handle the December-in-January-statement cross-year case (rare; the
user can edit it in the table).

**3. Configurable output date format.** New ``output_date_format``
parameter on ``scan_pdf_for_transactions`` defaults to ``%Y%m%d``.
Applied to: the transaction date column AND the statement period
start/end fields. The page surfaces a dropdown in Scan options with
common presets (YYYYMMDD, YYYY-MM-DD, MM/DD/YYYY, DD/MM/YYYY,
``Mon DD, YYYY``) plus a Custom option that accepts a raw strftime
string.

New helper: ``format_date(iso_str, fmt)`` converts ISO ``YYYY-MM-DD``
to any strftime format; it passes invalid input through unchanged so
the user can see what was actually there rather than getting silent
empties.

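
The header extraction in item 1 could look roughly like the sketch
below. The regexes, the ``_scan`` and ``_to_iso`` helpers, the 80-character
search window, the US month/day order, and the ISO return values are all
assumptions made for illustration; only the ``extract_statement_metadata``
name, the three field names, and the page 1+2 fallback come from this
commit message.

```python
import re
from datetime import date

# Hypothetical patterns for illustration; the commit's real regexes may differ.
ACCOUNT_RE = re.compile(
    r"Account\s*(?:Number|No\.?|#)?\s*:?\s*([\d*xX][\d*xX -]{2,}\d)",
    re.IGNORECASE,
)
LABEL_RE = re.compile(
    r"\b(?:Statement Period|Period Covered|From)\b", re.IGNORECASE
)
DATE_RE = re.compile(r"\b(\d{1,2})[/-](\d{1,2})[/-](\d{4})\b")


def _to_iso(match):
    """Turn a MM/DD/YYYY match into ISO text (assumes US month/day order)."""
    month, day, year = (int(g) for g in match.groups())
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        return None  # e.g. 13/45/2025 is not a real date


def _scan(text):
    """Look for an account number and a statement period in one text blob."""
    account = ACCOUNT_RE.search(text)
    start = end = None
    label = LABEL_RE.search(text)
    if label:
        # Only consider dates shortly after the label (window size is a guess).
        window = text[label.end():label.end() + 80]
        dates = [d for d in map(_to_iso, DATE_RE.finditer(window)) if d][:2]
        if dates:
            # A single date serves as both start and end.
            start, end = dates[0], dates[-1]
    return {
        "account_number": account.group(1) if account else None,
        "statement_period_start": start,
        "statement_period_end": end,
    }


def extract_statement_metadata(pages):
    """pages: list of page texts. Try page 1, then pages 1+2 if a field is missing."""
    meta = _scan(pages[0] if pages else "")
    if len(pages) > 1 and None in meta.values():
        meta = _scan("\n".join(pages[:2]))
    return meta
```

With a first page that carries no header, the fallback picks the fields
up from page 2; a lone closing date fills both period fields.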
20 new tests cover: format_date, account-number extraction (masked /
hyphenated / spaced / no-label / short), period extraction (standard /
from-to / single-date / no-label), metadata orchestrator (full header /
no pages / page-2 fallback), year inference (US / dash / month-name /
no-period / unparseable), plus an end-to-end class that builds a
header'd PDF with short-date transactions and confirms metadata
attribution + year inference + format round-trip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
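
The year inference and format round-trip that these tests exercise can
be sketched as follows. ``format_date(iso_str, fmt)`` follows the
signature given in this commit message; ``infer_year``, the
``SHORT_FORMATS`` tuple, and the choice to return ``None`` for
unparseable input are hypothetical and may not match the real scanner.

```python
from datetime import datetime


def format_date(iso_str, fmt):
    """Convert ISO YYYY-MM-DD to fmt; pass anything unparseable through as-is."""
    try:
        return datetime.strptime(iso_str, "%Y-%m-%d").strftime(fmt)
    except (ValueError, TypeError):
        return iso_str


# Hypothetical list of short-date layouts (US slash, dash, month name).
# Note that %b is locale-dependent.
SHORT_FORMATS = ("%m/%d", "%m-%d", "%b %d")


def infer_year(short_date, period_end_iso):
    """Bind the statement period's end year to a year-less date.

    Returns an ISO string, or None when there is no period or the date
    does not match any known short layout (an assumed convention).
    """
    if not period_end_iso:
        return None
    year = period_end_iso[:4]  # assumes the period end is ISO YYYY-MM-DD
    for fmt in SHORT_FORMATS:
        try:
            # Parse with the year attached so Feb 29 validates correctly.
            parsed = datetime.strptime(f"{short_date} {year}", f"{fmt} %Y")
        except ValueError:
            continue
        return parsed.date().isoformat()
    return None


iso = infer_year("01/13", "2025-01-31")   # "2025-01-13"
print(format_date(iso, "%Y%m%d"))          # 20250113
print(format_date("13 Jan", "%Y%m%d"))     # 13 Jan (passed through)
```

As in the commit, this sketch always uses the period's end year, so a
``12/28`` row on a January statement would land in the wrong year.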
2026-05-20 00:20:46 +00:00
parent 3cf935c999
commit 155dd30746
4 changed files with 499 additions and 13 deletions
--- a/src/gui/pages/10_PDF_Extractor.py
+++ b/src/gui/pages/10_PDF_Extractor.py
@@ -90,6 +90,15 @@ if not _pdf_ok:
 # Options + upload
 # ---------------------------------------------------------------------------

+_DATE_FORMAT_CHOICES = {
+    "YYYYMMDD (20260113)": "%Y%m%d",
+    "YYYY-MM-DD (2026-01-13)": "%Y-%m-%d",
+    "MM/DD/YYYY (01/13/2026)": "%m/%d/%Y",
+    "DD/MM/YYYY (13/01/2026)": "%d/%m/%Y",
+    "MMM DD, YYYY (Jan 13, 2026)": "%b %d, %Y",
+    "Custom strftime…": "__custom__",
+}
+
 with st.expander("Scan options", expanded=False):
    c1, c2 = st.columns(2)
    negative_in_parens = c1.checkbox(
@@ -112,6 +121,28 @@ with st.expander("Scan options", expanded=False):
        ),
    )

+    c3, c4 = st.columns(2)
+    date_label = c3.selectbox(
+        "Output date format",
+        list(_DATE_FORMAT_CHOICES.keys()),
+        index=0,
+        help=(
+            "Applied to the transaction date AND the statement "
+            "period dates pulled from the header. Pick Custom to "
+            "enter your own ``strftime`` string."
+        ),
+    )
+    output_date_format = _DATE_FORMAT_CHOICES[date_label]
+    if output_date_format == "__custom__":
+        output_date_format = c4.text_input(
+            "Custom strftime format",
+            value="%Y%m%d",
+            help=(
+                "Python ``strftime`` codes — e.g., ``%Y%m%d`` for "
+                "20260113, ``%Y-%m-%d`` for 2026-01-13."
+            ),
+        )
+
 uploads = st.file_uploader(
    "PDF file(s)",
    type=["pdf"],
@@ -148,6 +179,7 @@ if scan_clicked and uploads:
                    raw,
                    negative_in_parens=negative_in_parens,
                    allow_ocr=use_ocr,
+                    output_date_format=output_date_format,
                )
                for r in rows:
                    r["source_file"] = up.name
@@ -258,11 +290,24 @@ else:

    # Order columns so the user-facing fields are leftmost; raw +
    # internals are last and easy to scroll past or unselect at
-    # download time.
-    front = ["date", "description"]
+    # download time. Statement metadata sits with the transaction
+    # detail since it's per-row context an accountant typically
+    # wants alongside the amounts.
+    front = [
+        "date",
+        "description",
+    ]
    amount_cols = sorted(c for c in df.columns if c.startswith("amount_"))
+    metadata_cols = [
+        "account_number",
+        "statement_period_start",
+        "statement_period_end",
+    ]
    tail = ["source_file", "page", "raw"]
-    ordered = [c for c in front + amount_cols + tail if c in df.columns]
+    ordered = [
+        c for c in front + amount_cols + metadata_cols + tail
+        if c in df.columns
+    ]
    extras = [c for c in df.columns if c not in ordered]
    df = df[ordered + extras]