feat(pdf): extract statement header (account + period) + date format

Three related additions for the accountant workflow:

**1. Statement header extraction.** New
``extract_statement_metadata(pages)`` pulls the account number
and statement period from the first page, falling back to
pages 1 and 2 combined if either field is missing on page 1
(Wells Fargo business accounts put header info on page 2).
Detected fields are stamped onto EVERY transaction row so a
multi-statement CSV is self-attributing per row::

    {
        "date": "20250113",
        "description": "Coffee Shop",
        "amount_1": -4.50,
        "account_number": "****5678",
        "statement_period_start": "20250101",
        "statement_period_end": "20250131",
        ...
    }

Account-number regex is tolerant of masks (``****1234``),
hyphens (``1234-5678-9012``), and spaces. Period regex looks
for "Statement Period" / "From" / "Period Covered" labels plus
the first 1-2 full-year dates that follow. If only one date is
present near the label, it's used for both start and end (some
statements show only the closing date).
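
A minimal sketch of the matching described above, under stated assumptions: the commit's actual regexes are not shown here, so the patterns, the 80-character search window, and the names ``find_account_number`` and ``find_statement_period`` are hypothetical. The sketch also handles only numeric ``MM/DD/YYYY``-style dates and returns them as ISO strings.

```python
import re
from datetime import date

# Hypothetical pattern: a label, then a value made of digits, mask
# characters (* or x), hyphens, and spaces that ends in a digit.
_ACCOUNT_RE = re.compile(
    r"\bAccount\s*(?:Number|No\.?|#)?\s*:?\s*"
    r"([*xX\d][*xX\d\- ]{2,}\d)",
    re.IGNORECASE,
)

# Hypothetical label and date patterns for the statement period.
_PERIOD_LABEL_RE = re.compile(
    r"\b(?:Statement\s+Period|Period\s+Covered|From)\b", re.IGNORECASE
)
_FULL_DATE_RE = re.compile(r"\b(\d{1,2})[/-](\d{1,2})[/-](\d{4})\b")


def find_account_number(text):
    """Return the labelled account number as written, or None."""
    match = _ACCOUNT_RE.search(text)
    return match.group(1) if match else None


def find_statement_period(text, window=80):
    """Return (start_iso, end_iso) from the dates after a label, or None."""
    label = _PERIOD_LABEL_RE.search(text)
    if label is None:
        return None
    dates = []
    # Only look a short distance past the label (assumed window size).
    for m in _FULL_DATE_RE.finditer(text[label.end():label.end() + window]):
        month, day, year = (int(g) for g in m.groups())
        try:
            dates.append(date(year, month, day).isoformat())
        except ValueError:
            continue  # skip impossible dates such as 13/45/2025
        if len(dates) == 2:
            break
    if not dates:
        return None
    # A single date is used for both start and end.
    return dates[0], dates[-1]
```

Requiring a label keeps bare digit runs elsewhere on the page from being mistaken for an account number, at the cost of missing unlabelled headers.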

**2. Year inference for short dates.** When the row date is a
short ``01/13`` or ``Jan 13`` without a year, the scanner now
binds the year from the statement period's end date BEFORE
formatting. This does not handle the cross-year case of a
December transaction on a statement that ends in January
(rare; the user can edit the date in the table).
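
The year binding can be sketched roughly as follows. This is an illustration, not the scanner's actual code: the name ``bind_year``, the list of accepted short formats, and the assumption that the period end is available as an ISO string are all hypothetical. ``%b`` also assumes an English locale.

```python
from datetime import date, datetime

# Assumed short-date shapes: 01/13, 01-13, Jan 13.
_SHORT_FORMATS = ("%m/%d", "%m-%d", "%b %d")


def bind_year(short_date, period_end_iso):
    """Attach the statement period's end year to a year-less date.

    Returns an ISO date string, or None when the period end is missing
    or the short date matches none of the assumed formats.
    """
    try:
        year = date.fromisoformat(period_end_iso).year
    except (TypeError, ValueError):
        return None  # no usable statement period
    for fmt in _SHORT_FORMATS:
        try:
            # Parse year and short date together so Feb 29 is validated
            # against the real year instead of strptime's default.
            parsed = datetime.strptime(
                f"{year} {short_date.strip()}", f"%Y {fmt}"
            )
        except ValueError:
            continue
        return parsed.date().isoformat()
    return None
```

The cross-year limitation is visible in the sketch: ``bind_year("12/30", "2025-01-05")`` yields a date in December 2025 rather than December 2024.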

**3. Configurable output date format.** New
``output_date_format`` parameter on ``scan_pdf_for_transactions``
defaults to ``%Y%m%d``. It is applied to the transaction date
column AND the statement period start/end fields. The page
surfaces a dropdown in Scan options with common presets
(YYYYMMDD, YYYY-MM-DD, MM/DD/YYYY, DD/MM/YYYY,
``MMM DD, YYYY``) plus a Custom option that accepts a raw
strftime string.

New helper: ``format_date(iso_str, fmt)`` converts an ISO
``YYYY-MM-DD`` string to any strftime format; it passes invalid
input through unchanged so the user can see what was actually
there rather than getting silent empties.
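
One way the pass-through behaviour could look is sketched below. The name and signature come from this message, but the body is an assumption about the implementation rather than a copy of it.

```python
from datetime import date


def format_date(iso_str, fmt):
    """Format an ISO YYYY-MM-DD string with a strftime pattern.

    Anything that does not parse as an ISO date is returned unchanged,
    so the original cell content stays visible to the user.
    """
    try:
        return date.fromisoformat(iso_str).strftime(fmt)
    except (TypeError, ValueError):
        return iso_str
```

For example, ``format_date("2025-01-13", "%m/%d/%Y")`` gives ``01/13/2025``, while ``format_date("13 Jan", "%m/%d/%Y")`` returns ``13 Jan`` untouched.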

20 new tests cover: format_date, account-number extraction
(masked / hyphenated / spaced / no-label / short), period
extraction (standard / from-to / single-date / no-label),
metadata orchestrator (full header / no pages / page-2
fallback), year inference (US / dash / month-name / no-period /
unparseable), plus an end-to-end class that builds a PDF with a
header and short-date transactions and confirms metadata
attribution + year inference + format round-trip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -90,6 +90,15 @@ if not _pdf_ok:
 # Options + upload
 # ---------------------------------------------------------------------------
 
+_DATE_FORMAT_CHOICES = {
+    "YYYYMMDD (20260113)": "%Y%m%d",
+    "YYYY-MM-DD (2026-01-13)": "%Y-%m-%d",
+    "MM/DD/YYYY (01/13/2026)": "%m/%d/%Y",
+    "DD/MM/YYYY (13/01/2026)": "%d/%m/%Y",
+    "MMM DD, YYYY (Jan 13, 2026)": "%b %d, %Y",
+    "Custom strftime…": "__custom__",
+}
+
 with st.expander("Scan options", expanded=False):
     c1, c2 = st.columns(2)
     negative_in_parens = c1.checkbox(
@@ -112,6 +121,28 @@ with st.expander("Scan options", expanded=False):
         ),
     )
 
+    c3, c4 = st.columns(2)
+    date_label = c3.selectbox(
+        "Output date format",
+        list(_DATE_FORMAT_CHOICES.keys()),
+        index=0,
+        help=(
+            "Applied to the transaction date AND the statement "
+            "period dates pulled from the header. Pick Custom to "
+            "enter your own ``strftime`` string."
+        ),
+    )
+    output_date_format = _DATE_FORMAT_CHOICES[date_label]
+    if output_date_format == "__custom__":
+        output_date_format = c4.text_input(
+            "Custom strftime format",
+            value="%Y%m%d",
+            help=(
+                "Python ``strftime`` codes — e.g., ``%Y%m%d`` for "
+                "20260113, ``%Y-%m-%d`` for 2026-01-13."
+            ),
+        )
+
 uploads = st.file_uploader(
     "PDF file(s)",
     type=["pdf"],
@@ -148,6 +179,7 @@ if scan_clicked and uploads:
             raw,
             negative_in_parens=negative_in_parens,
             allow_ocr=use_ocr,
+            output_date_format=output_date_format,
         )
         for r in rows:
             r["source_file"] = up.name
@@ -258,11 +290,24 @@ else:
 
     # Order columns so the user-facing fields are leftmost; raw +
     # internals are last and easy to scroll past or unselect at
-    # download time.
-    front = ["date", "description"]
+    # download time. Statement metadata sits with the transaction
+    # detail since it's per-row context an accountant typically
+    # wants alongside the amounts.
+    front = [
+        "date",
+        "description",
+    ]
     amount_cols = sorted(c for c in df.columns if c.startswith("amount_"))
+    metadata_cols = [
+        "account_number",
+        "statement_period_start",
+        "statement_period_end",
+    ]
     tail = ["source_file", "page", "raw"]
-    ordered = [c for c in front + amount_cols + tail if c in df.columns]
+    ordered = [
+        c for c in front + amount_cols + metadata_cols + tail
+        if c in df.columns
+    ]
     extras = [c for c in df.columns if c not in ordered]
     df = df[ordered + extras]
 