feat(pdf): extract statement header (account + period) + date format

Three related additions for the accountant workflow:

**1. Statement header extraction.** New
``extract_statement_metadata(pages)`` pulls the account number
and statement period out of the first page. If either field is
missing there, it searches pages 1 and 2 together, because Wells
Fargo business accounts put header info on page 2. Detected
fields are stamped onto EVERY transaction row so a
multi-statement CSV is self-attributing per row::

    {
      "date": "20250113",
      "description": "Coffee Shop",
      "amount_1": -4.50,
      "account_number": "****5678",
      "statement_period_start": "20250101",
      "statement_period_end": "20250131",
      ...
    }
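
A minimal sketch of the stamping step, assuming rows and metadata
are plain dicts. ``stamp_metadata`` is a hypothetical name used for
illustration, not the helper in this commit:

```python
def stamp_metadata(rows, metadata):
    """Return new row dicts with every detected metadata field copied in.

    Hypothetical helper: empty or missing fields are skipped so a row never
    gains a blank account number or period.
    """
    detected = {key: value for key, value in metadata.items() if value}
    return [{**row, **detected} for row in rows]


rows = [
    {"date": "20250113", "description": "Coffee Shop", "amount_1": -4.50},
    {"date": "20250115", "description": "Payroll", "amount_1": 2500.00},
]
metadata = {
    "account_number": "****5678",
    "statement_period_start": "20250101",
    "statement_period_end": "20250131",
}
stamped = stamp_metadata(rows, metadata)
```

Building new dicts instead of mutating in place keeps the raw scan
output available for debugging.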

Account-number regex is tolerant of masks (``****1234``),
hyphens (``1234-5678-9012``), and spaces. Period regex looks
for "Statement Period" / "From" / "Period Covered" labels plus
the first 1-2 full-year dates that follow. If only one date is
present near the label, it's used for both start and end (some
statements show only the closing date).
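
The matching rules above could be sketched roughly as follows. The
patterns, the 80-character search window, and the function names
are assumptions for illustration; the regexes in this commit may
differ, and this sketch only recognises numeric full-year dates:

```python
import re

# Assumed pattern: an "Account" label, optional punctuation, then a run of
# digits, mask characters, hyphens, or spaces that ends in a digit.
ACCOUNT_RE = re.compile(
    r"account\s*(?:number|no\.?|#)?\s*[:#]?\s*([*xX\d][*xX\d\- ]{2,}\d)",
    re.IGNORECASE,
)
PERIOD_LABEL_RE = re.compile(
    r"statement\s+period|period\s+covered|\bfrom\b", re.IGNORECASE
)
FULL_DATE_RE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{4}\b")


def extract_account_number(text):
    """Return the labelled account number as written, or None."""
    match = ACCOUNT_RE.search(text)
    return match.group(1).strip() if match else None


def extract_period(text, window=80):
    """Return (start, end) from the first 1-2 full-year dates after a label."""
    label = PERIOD_LABEL_RE.search(text)
    if label is None:
        return None
    nearby = text[label.end(): label.end() + window]
    dates = FULL_DATE_RE.findall(nearby)[:2]
    if not dates:
        return None
    # A lone date (closing date only) serves as both start and end.
    return dates[0], dates[-1]


account = extract_account_number("Account Number: ****5678")
period = extract_period("Statement Period: 01/01/2025 - 01/31/2025")
```

The sketch returns dates as written on the page; normalising them
to ISO would be a separate step.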

**2. Year inference for short dates.** When the row date is a
short ``01/13`` or ``Jan 13`` without a year, the scanner now
takes the year from the statement period's end date BEFORE
formatting. This does not handle the cross-year case of a
December transaction on a January statement (rare; the user can
edit the date in the table).
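
A rough sketch of the inference, assuming the period end is already
a ``datetime.date``. ``infer_year`` and its list of accepted formats
are hypothetical, not the scanner's actual code:

```python
from datetime import date, datetime

# Assumed short-date shapes: US slash, dash, and abbreviated month name.
SHORT_FORMATS = ("%m/%d", "%m-%d", "%b %d")


def infer_year(short_date, period_end):
    """Attach period_end's year to a year-less date; return ISO or None."""
    if period_end is None:
        return None
    for fmt in SHORT_FORMATS:
        try:
            # Parse with the year appended so Feb 29 is validated against
            # the right year.
            parsed = datetime.strptime(
                f"{short_date} {period_end.year}", f"{fmt} %Y"
            )
        except ValueError:
            continue
        return parsed.date().isoformat()
    return None


period_end = date(2025, 1, 31)
january_row = infer_year("01/13", period_end)
# Known limitation: a December row on a January statement gets the wrong year.
december_row = infer_year("12/30", period_end)
```

Under these assumptions ``december_row`` comes out as ``2025-12-30``
even though the transaction belongs to 2024, which is the cross-year
case left to manual editing.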

**3. Configurable output date format.** New
``output_date_format`` parameter on ``scan_pdf_for_transactions``
defaults to ``%Y%m%d``. Applied to: the transaction date column
AND the statement period start/end fields. The page surfaces a
dropdown in Scan options with common presets (YYYYMMDD,
YYYY-MM-DD, MM/DD/YYYY, DD/MM/YYYY, ``Mon DD, YYYY``) plus a
Custom option that accepts a raw strftime string.

New helper: ``format_date(iso_str, fmt)`` converts an ISO
``YYYY-MM-DD`` string to any strftime format. Invalid input is
passed through unchanged so the user can see what was actually
there rather than getting silent empties.
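
The pass-through behaviour can be illustrated with a sketch like
the one below. The name and signature come from this commit, but
the body is an assumption about how such a helper might look:

```python
from datetime import datetime


def format_date(iso_str, fmt):
    """Reformat an ISO YYYY-MM-DD string; return the input if it won't parse."""
    try:
        return datetime.strptime(iso_str, "%Y-%m-%d").strftime(fmt)
    except (TypeError, ValueError):
        # Not an ISO date (or not a string): hand it back untouched so the
        # user sees what the scanner actually read.
        return iso_str


compact = format_date("2025-01-13", "%Y%m%d")
us_style = format_date("2025-01-13", "%m/%d/%Y")
untouched = format_date("13 Jan", "%Y%m%d")
```

Returning the original string keeps bad rows visible in the table
instead of turning them into blanks.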

20 new tests cover: format_date, account-number extraction
(masked / hyphenated / spaced / no-label / short), period
extraction (standard / from-to / single-date / no-label),
metadata orchestrator (full header / no pages / page-2
fallback), year inference (US / dash / month-name / no-period /
unparseable), plus an end-to-end class that builds a PDF with a
statement header and short-date transactions and confirms
metadata attribution + year inference + format round-trip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:20:46 +00:00
parent 3cf935c999
commit 155dd30746
4 changed files with 499 additions and 13 deletions

@@ -90,6 +90,15 @@ if not _pdf_ok:
# Options + upload
# ---------------------------------------------------------------------------
_DATE_FORMAT_CHOICES = {
"YYYYMMDD (20260113)": "%Y%m%d",
"YYYY-MM-DD (2026-01-13)": "%Y-%m-%d",
"MM/DD/YYYY (01/13/2026)": "%m/%d/%Y",
"DD/MM/YYYY (13/01/2026)": "%d/%m/%Y",
"MMM DD, YYYY (Jan 13, 2026)": "%b %d, %Y",
"Custom strftime…": "__custom__",
}
with st.expander("Scan options", expanded=False):
c1, c2 = st.columns(2)
negative_in_parens = c1.checkbox(
@@ -112,6 +121,28 @@ with st.expander("Scan options", expanded=False):
),
)
c3, c4 = st.columns(2)
date_label = c3.selectbox(
"Output date format",
list(_DATE_FORMAT_CHOICES.keys()),
index=0,
help=(
"Applied to the transaction date AND the statement "
"period dates pulled from the header. Pick Custom to "
"enter your own ``strftime`` string."
),
)
output_date_format = _DATE_FORMAT_CHOICES[date_label]
if output_date_format == "__custom__":
output_date_format = c4.text_input(
"Custom strftime format",
value="%Y%m%d",
help=(
"Python ``strftime`` codes — e.g., ``%Y%m%d`` for "
"20260113, ``%Y-%m-%d`` for 2026-01-13."
),
)
uploads = st.file_uploader(
"PDF file(s)",
type=["pdf"],
@@ -148,6 +179,7 @@ if scan_clicked and uploads:
raw,
negative_in_parens=negative_in_parens,
allow_ocr=use_ocr,
output_date_format=output_date_format,
)
for r in rows:
r["source_file"] = up.name
@@ -258,11 +290,24 @@ else:
# Order columns so the user-facing fields are leftmost; raw +
# internals are last and easy to scroll past or unselect at
# download time.
front = ["date", "description"]
# download time. Statement metadata sits with the transaction
# detail since it's per-row context an accountant typically
# wants alongside the amounts.
front = [
"date",
"description",
]
amount_cols = sorted(c for c in df.columns if c.startswith("amount_"))
metadata_cols = [
"account_number",
"statement_period_start",
"statement_period_end",
]
tail = ["source_file", "page", "raw"]
ordered = [c for c in front + amount_cols + tail if c in df.columns]
ordered = [
c for c in front + amount_cols + metadata_cols + tail
if c in df.columns
]
extras = [c for c in df.columns if c not in ordered]
df = df[ordered + extras]