feat(pdf): Dec/Jan-aware year inference + filename hint + override

Previous year inference picked ``period_end_iso[:4]`` for every
short date, which fails on statements that cross the Dec/Jan
boundary. A "12/30" row in a 2024-12-16 to 2025-01-15 statement
got 2025-12-30 (wrong) instead of 2024-12-30.

New cascade for ``_infer_year_for_short_date``:
1. **``override_year``** — caller supplies it (new ``"Override
year for short dates"`` field in Scan options). Beats every
heuristic. Empty by default; the page checks that the value
is an integer in 1900-2100 and, on invalid input, shows a
warning and falls back to automatic detection.
2. **Statement period start + end** — the function now takes
BOTH dates and generates candidates with every distinct year
in the period (one year for same-year statements, two for
Dec/Jan boundaries). The picker scores each candidate by
distance from the period: candidates inside the period
score 0, candidates outside score ``min(|days from start|,
|days from end|)``. Lowest-distance candidate wins. So:
- ``12/30`` + period 2024-12-16 to 2025-01-15 → 2024-12-30
(inside period, score 0)
- ``01/05`` + same period → 2025-01-05 (inside, score 0)
- ``12/15`` + same period → 2024-12-15 (1 day before,
closer than 2025-12-15 which is 11 months after)
3. **``filename_year_hint``** — fallback when the statement
period regex misses the bank's specific layout. The page
passes ``year_from_filename(upload.name)`` automatically so
files like ``eStmt_2025-01-13.pdf`` get year 2025 even if
the PDF's text doesn't yield a parseable period. The regex
matches the first ``20XX`` token bounded by non-digits.
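The cascade above can be sketched roughly as follows. This is a minimal illustration, not the code in ``src.pdf_extract``: the name ``infer_short_date``, its signature, and the two private helpers are hypothetical, and the behaviour for an invalid day/month under an override (return ``None``) and the tie-break (earliest year wins) are assumptions.

```python
from datetime import date
from typing import Optional


def _build(year: int, month: int, day: int) -> Optional[date]:
    """Return a date, or None when the combination is invalid (e.g. 02/30)."""
    try:
        return date(year, month, day)
    except ValueError:
        return None


def _distance(candidate: date, start: date, end: date) -> int:
    """0 inside the period, else days to the nearer period edge."""
    if start <= candidate <= end:
        return 0
    return min(abs((candidate - start).days), abs((candidate - end).days))


def infer_short_date(
    month: int,
    day: int,
    *,
    period_start: Optional[date] = None,
    period_end: Optional[date] = None,
    filename_year_hint: Optional[int] = None,
    override_year: Optional[int] = None,
) -> Optional[date]:
    # 1. An explicit override beats every heuristic (hypothetical: an
    #    invalid month/day under an override yields None).
    if override_year is not None:
        return _build(override_year, month, day)

    # 2. Statement period: one candidate per distinct year it touches.
    if period_start is not None and period_end is not None:
        years = range(period_start.year, period_end.year + 1)
        candidates = [c for y in years if (c := _build(y, month, day))]
        if candidates:
            # min() keeps the first (earliest) candidate on a tie; that
            # tie-break is an assumption of this sketch.
            return min(
                candidates,
                key=lambda c: _distance(c, period_start, period_end),
            )

    # 3. Filename year as the last fallback.
    if filename_year_hint is not None:
        return _build(filename_year_hint, month, day)

    # No signal at all.
    return None
```

With ``period_start=date(2024, 12, 16)`` and ``period_end=date(2025, 1, 15)``, the sketch binds ``12/30`` and ``12/15`` to 2024 and ``01/05`` to 2025, matching the three cases listed under step 2.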

Both new helpers (``year_from_filename`` and the new
``_try_short_date_with_year`` factor-out) are exported and
tested. 16 new tests cover: within-period inference (same-year
sanity), Dec/Jan boundary cases for both sides, the
just-before-period closer-distance case, override priority,
filename fallback, no-signal None, dash-format / month-name
shorthand round-trip, garbage input, filename year extraction
(eStmt pattern, embedded, first-match-wins, no-match, empty).
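For reference, the filename rule described in step 3 (first ``20XX`` token bounded by non-digits) could look like the sketch below. The function name ``year_from_filename_sketch`` and the exact pattern are assumptions for illustration; the real ``year_from_filename`` may differ in detail.

```python
import re
from typing import Optional

# A 4-digit token starting with "20" that has no digit directly before
# or after it. The lookarounds are this sketch's reading of "bounded by
# non-digits"; start/end of string also count as a boundary.
_YEAR_TOKEN = re.compile(r"(?<!\d)(20\d{2})(?!\d)")


def year_from_filename_sketch(name: Optional[str]) -> Optional[int]:
    """Return the first 20XX year found in a filename, or None."""
    if not name:
        return None
    match = _YEAR_TOKEN.search(name)  # search() returns the leftmost match
    return int(match.group(1)) if match else None
```

Under these assumptions ``eStmt_2025-01-13.pdf`` yields 2025, a name containing two years yields the first one, and names with no ``20XX`` token (or an empty name) yield ``None``.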

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -32,6 +32,7 @@ from src.pdf_extract import (
     format_amount,
     ocr_available,
     scan_pdf_for_transactions,
+    year_from_filename,
 )


@@ -179,6 +180,40 @@ with st.expander("Scan options", expanded=False):
         ),
     )

+    # Year override for short dates. Empty by default — the
+    # scanner uses statement-period detection + filename year hint
+    # automatically. Set this when the statement period regex
+    # misses on a particular bank's layout, or when you want to
+    # force a specific year (e.g., historical reconciliation).
+    year_override_str = st.text_input(
+        "Override year for short dates (optional)",
+        value="",
+        help=(
+            "Short dates like ``01/13`` get bound to a year by the "
+            "scanner: this override first, then statement period, "
+            "then filename year. Leave blank for automatic. Enter "
+            "a 4-digit year (e.g., 2025) to force every short date "
+            "to that year. Won't affect dates that already have a "
+            "year (``01/13/2025``)."
+        ),
+    )
+    try:
+        year_override = (
+            int(year_override_str) if year_override_str.strip() else None
+        )
+        if year_override is not None and not (1900 <= year_override <= 2100):
+            st.warning(
+                f"Year override {year_override} looks wrong — using "
+                "automatic detection instead."
+            )
+            year_override = None
+    except ValueError:
+        st.warning(
+            f"Year override {year_override_str!r} isn't a number — "
+            "using automatic detection instead."
+        )
+        year_override = None
+
 # Persistent stash + rotating widget key. See K_UPLOADS / K_UPLOAD_COUNTER
 # docstrings for why the counter exists.
 pdf_uploads: dict = st.session_state.setdefault(K_UPLOADS, {})
@@ -425,6 +460,8 @@ if scan_clicked and pdf_uploads:
             negative_in_parens=negative_in_parens,
             allow_ocr=use_ocr,
             output_date_format=output_date_format,
+            filename_year_hint=year_from_filename(name),
+            year_override=year_override,
         )
         for r in rows:
             r["source_file"] = name
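The override validation in the Scan options hunk mixes parsing with ``st.warning`` calls. One way to make it unit-testable is to pull the parsing into a pure function, sketched below. ``parse_year_override`` is a hypothetical helper, not part of this commit, and the shortened warning strings are only placeholders.

```python
from typing import Optional, Tuple


def parse_year_override(
    raw: str, lo: int = 1900, hi: int = 2100
) -> Tuple[Optional[int], Optional[str]]:
    """Return (year, warning). Exactly one of them is set, or neither
    when the field is blank (meaning: use automatic detection)."""
    text = raw.strip()
    if not text:
        return None, None
    try:
        year = int(text)
    except ValueError:
        return None, f"Year override {raw!r} isn't a number"
    if not (lo <= year <= hi):
        return None, f"Year override {year} looks wrong"
    return year, None
```

The page could then call this helper and pass any non-``None`` warning to ``st.warning``, keeping the same fallback to automatic detection.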