feat(pdf): Dec/Jan-aware year inference + filename hint + override

The previous year inference used ``period_end_iso[:4]`` for every
short date, which fails on statements that cross the Dec/Jan
boundary. A "12/30" row in a 2024-12-16 to 2025-01-15 statement
was dated 2025-12-30 (wrong) instead of 2024-12-30.

New cascade for ``_infer_year_for_short_date``:

1. **``override_year``** — caller supplies it (new ``"Override
   year for short dates"`` field in Scan options). Beats every
   heuristic. Empty by default; the page checks that the value
   parses as an integer in 1900-2100 and otherwise warns and
   falls back to automatic detection. The page passes it to the
   scanner as ``year_override``.

2. **Statement period start + end** — the function now takes
   BOTH dates and generates candidates with every distinct year
   in the period (one year for same-year statements, two for
   Dec/Jan boundaries). The picker scores each candidate by
   distance from the period: candidates inside the period
   score 0, candidates outside score ``min(|days from start|,
   |days from end|)``. Lowest-distance candidate wins. So:

     - ``12/30`` + period 2024-12-16 to 2025-01-15 → 2024-12-30
       (inside period, score 0)
     - ``01/05`` + same period → 2025-01-05 (inside, score 0)
     - ``12/15`` + same period → 2024-12-15 (1 day before,
       closer than 2025-12-15 which is 11 months after)

3. **``filename_year_hint``** — fallback when the statement
   period regex misses the bank's specific layout. The page
   passes ``year_from_filename(upload.name)`` automatically so
   files like ``eStmt_2025-01-13.pdf`` get year 2025 even if
   the PDF's text doesn't yield a parseable period. The regex
   matches the first ``20XX`` token bounded by non-digits.
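
A minimal sketch of the cascade above, not the code in
``src/pdf_extract.py``. The name ``infer_year_sketch``, its
signature, and the handling of impossible dates (returning None)
are assumptions made for illustration; only the ordering and the
distance scoring come from the description above.

```python
from datetime import date
from typing import Optional


def _candidate(year: int, month: int, day: int) -> Optional[date]:
    """Build a date, or None for impossible ones such as 02/30."""
    try:
        return date(year, month, day)
    except ValueError:
        return None


def infer_year_sketch(
    month: int,
    day: int,
    period_start: Optional[date] = None,
    period_end: Optional[date] = None,
    filename_year_hint: Optional[int] = None,
    override_year: Optional[int] = None,
) -> Optional[date]:
    # 1. An explicit override beats every heuristic.
    if override_year is not None:
        return _candidate(override_year, month, day)

    # 2. Statement period: try every year the period touches and keep
    #    the candidate closest to the period (0 when inside it).
    if period_start is not None and period_end is not None:
        def distance(d: date) -> int:
            if period_start <= d <= period_end:
                return 0
            return min(
                abs((d - period_start).days),
                abs((d - period_end).days),
            )

        years = range(period_start.year, period_end.year + 1)
        candidates = [_candidate(y, month, day) for y in years]
        candidates = [c for c in candidates if c is not None]
        if candidates:
            return min(candidates, key=distance)

    # 3. Filename year as the last resort.
    if filename_year_hint is not None:
        return _candidate(filename_year_hint, month, day)

    # No signal at all.
    return None
```

With the period 2024-12-16 to 2025-01-15, this sketch resolves
``12/30`` to 2024-12-30, ``01/05`` to 2025-01-05, and ``12/15`` to
2024-12-15, matching the three cases listed under step 2.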

Both new helpers (``year_from_filename`` and the new
``_try_short_date_with_year`` factor-out) are exported and
tested. 16 new tests cover: within-period inference (same-year
sanity), Dec/Jan boundary cases for both sides, the
just-before-period closer-distance case, override priority,
filename fallback, no-signal None, dash-format / month-name
shorthand round-trip, garbage input, filename year extraction
(eStmt pattern, embedded, first-match-wins, no-match, empty).
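
For illustration, the filename extraction could look roughly like
the sketch below. The function name ``year_from_filename_sketch``
and the exact pattern are assumptions, not the shipped helper; the
sketch also treats the start and end of the string as valid
boundaries, which the description above does not spell out.

```python
import re
from typing import Optional

# First "20XX" token that is not part of a longer run of digits.
_YEAR_RE = re.compile(r"(?<!\d)(20\d{2})(?!\d)")


def year_from_filename_sketch(filename: Optional[str]) -> Optional[int]:
    """Return the first 20XX year found in the filename, else None."""
    if not filename:
        return None
    match = _YEAR_RE.search(filename)
    return int(match.group(1)) if match else None


print(year_from_filename_sketch("eStmt_2025-01-13.pdf"))
print(year_from_filename_sketch("statement.pdf"))
```

The lookarounds keep a longer digit run such as an account number
from being misread as a year, and ``search`` returns the leftmost
match, which gives the first-match-wins behaviour.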

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:59:30 +00:00
parent a18b126885
commit a0042d4aba
3 changed files with 224 additions and 35 deletions


@@ -32,6 +32,7 @@ from src.pdf_extract import (
    format_amount,
    ocr_available,
    scan_pdf_for_transactions,
    year_from_filename,
)
@@ -179,6 +180,40 @@ with st.expander("Scan options", expanded=False):
),
)
# Year override for short dates. Empty by default — the
# scanner uses statement-period detection + filename year hint
# automatically. Set this when the statement period regex
# misses on a particular bank's layout, or when you want to
# force a specific year (e.g., historical reconciliation).
year_override_str = st.text_input(
    "Override year for short dates (optional)",
    value="",
    help=(
        "Short dates like ``01/13`` get bound to a year by the "
        "scanner: this override first when set, then the "
        "statement period, then the filename year. Leave blank "
        "for automatic. Enter a 4-digit year (e.g., 2025) to "
        "force every short date to that year. Won't affect "
        "dates that already have a year (``01/13/2025``)."
    ),
)
try:
    year_override = (
        int(year_override_str) if year_override_str.strip() else None
    )
    if year_override is not None and not (1900 <= year_override <= 2100):
        st.warning(
            f"Year override {year_override} looks wrong — using "
            "automatic detection instead."
        )
        year_override = None
except ValueError:
    st.warning(
        f"Year override {year_override_str!r} isn't a number — "
        "using automatic detection instead."
    )
    year_override = None
# Persistent stash + rotating widget key. See K_UPLOADS / K_UPLOAD_COUNTER
# docstrings for why the counter exists.
pdf_uploads: dict = st.session_state.setdefault(K_UPLOADS, {})
@@ -425,6 +460,8 @@ if scan_clicked and pdf_uploads:
    negative_in_parens=negative_in_parens,
    allow_ocr=use_ocr,
    output_date_format=output_date_format,
    filename_year_hint=year_from_filename(name),
    year_override=year_override,
)
for r in rows:
    r["source_file"] = name