Commit Graph

207 Commits

Author SHA1 Message Date
6627895a10 test: fix v3 branding drift, add reconcile CLI + registry coverage
GUI/lang-pack tests were asserting against pre-v3 strings ("Data
Cleaning Mastery", "Maestría en limpieza…") that the brand refresh
replaced with "UNALOGIX DataTools" + "Clean. Normalize. Transform."
Updated assertions to the current copy and switched the findings
panel tests to the redesigned flat-list layout (per-finding "Open
Tool →" buttons instead of per-tool expanders).

New coverage:
- tests/test_cli_reconcile.py (13) — preview/apply, tolerance flags,
  sign inversion, key flags, error paths, Excel input.
- tests/test_tools_registry.py (27) — unique tool_ids, page_slug →
  real file, valid sections/tiers, localized accessor fallbacks,
  explicit pins for PDF Extractor + Reconciler entries.
- tests/test_reconcile.py — one-side-empty, key-pass tagging,
  additional validation cases, input-DataFrame immutability.
- tests/gui/test_smoke.py — PAGE_SLUGS now includes 10_PDF_Extractor
  and 11_Reconciler in both en/es.
- tests/gui/test_workflows.py — TestPdfExtractorWorkflow and
  TestReconcilerWorkflow render checks.

Net: 2317 passed → 2418 passed, 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 19:30:02 +00:00
ea99e292d2 feat(nav): group Home + Reconcile under a new "Analysis" section
Home now appears in the sidebar as "File Analysis" under a labeled
"Analysis" section together with Reconcile Two Files — both pages
are data-analysis workflows (importing/profiling files vs. matching
across files), so grouping them clarifies the sidebar's mental model.

- tools_registry: new ``analysis`` Section; reconcile moves out of
  automations into it.
- i18n: ``nav.section_analysis`` + ``nav.file_analysis_title`` added
  to en.json and es.json.
- app.py: home dropped from the unlabeled section and surfaced at the
  top of the Analysis group; ``default=True`` preserved so first-visit
  routing is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:11:06 +00:00
0be59c0f03 fix(gui): shrink white-bar compensation to ~1/4 of original gap
Plain ``min-height: 100vh`` left a ~15vh white bar below ``.stApp``
(the zoom: 0.85 scaler shrinks visual height to 85%). Reinstate the
stretching but stop short of the full ``100vh / 0.85`` overflow:
``calc(96vh / 0.85)`` fills 96vh visually and leaves a ~4vh bar — a
quarter the size, no longer dominating the page.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:06:32 +00:00
3a3a9a895b fix(gui): stop overstretching pages, restore footer clearance
Two layout bugs were hiding the bottom of every tool page behind the
sticky footer:

1. ``.stApp`` and the main/sidebar containers were forced to
   ``min-height: calc(100vh / 0.85)``, ≈ 17.6% taller than the
   viewport, to mask a white bar caused by the ``zoom: 0.85`` scaler.
   That hack stretches short pages and pushes long-page content past
   the visible area. Drop the calc factor — plain ``100vh`` fills the
   visible viewport without forced overflow.

2. ``render_sticky_footer``'s stylesheet re-set the block container's
   ``padding-bottom`` to ``2rem``, overriding the ``7rem`` reserved
   by ``hide_streamlit_chrome``. The footer (~40px tall) needs more
   than 32px of clearance, so the last row of content was sliding
   behind the footer. Remove the override and let chrome's reservation
   stand.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:03:52 +00:00
d090f8cb5e feat(reconcile): auto-detect role columns, preview result tabs
Match-settings selectors now reorder per side to match the file's
column order, using name heuristics (amount / date / desc) so a
typical bank CSV reads Date → Description → Amount → Reference
without manual fiddling. Detected columns also pre-fill as the
default selection.

Result tabs render at most 25 rows with a "preview of N of M"
caption; full data is still available via the existing download
buttons.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 22:39:47 +00:00
e44af3a45e feat(reconcile): two-source reconciliation tool
Bank-feed-vs-ledger style matcher: 4-pass greedy assignment (key →
exact → tolerance → fuzzy) with ambiguous candidates routed to a
review bucket instead of arbitrary picks. CLI mirrors the
cli_text_clean preview/--apply pattern; Streamlit page registered
in the automations section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 22:33:14 +00:00
450d4fc9a8 feat(pdf): default output date format to YYYY-MM-DD
User asked to flip the default from YYYYMMDD to YYYY-MM-DD.
ISO is the better default for an accountant CSV workflow:

- Lexicographic sort = chronological sort (no parsing needed).
- Every spreadsheet tool the user might import into recognises
  it as a real date with no ambiguity (US vs EU readers can't
  disagree on the order).
- Hyphens make the year/month/day boundaries scan-able by eye.

Concrete changes:

- New module constant ``DEFAULT_DATE_FORMAT = "%Y-%m-%d"``,
  used as the default for ``format_date()`` and the
  ``output_date_format`` keyword on
  ``scan_pdf_for_transactions``.
- Page's ``_DATE_FORMAT_CHOICES`` reordered so the ISO entry
  is first (index 0 = default Streamlit selection); YYYYMMDD
  drops to second.
- Custom-strftime input default also flips to ``%Y-%m-%d``.

Tests updated to reflect the new default (``test_dates_formatted_iso_by_default``,
``test_short_dates_get_year_from_period``,
``test_compact_format_round_trip``, plus a new
``test_default_is_iso`` for the format_date helper).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 02:04:34 +00:00
a0042d4aba feat(pdf): Dec/Jan-aware year inference + filename hint + override
Previous year inference picked ``period_end_iso[:4]`` for every
short date, which fails on statements that cross the Dec/Jan
boundary. A "12/30" row in a 2024-12-16 to 2025-01-15 statement
got 2025-12-30 (wrong) instead of 2024-12-30.

New cascade for ``_infer_year_for_short_date``:

1. **``override_year``** — caller supplies it (new ``"Override
   year for short dates"`` field in Scan options). Beats every
   heuristic. Empty by default; the page validates the value
   is a 4-digit-looking integer in 1900-2100 and falls back to
   automatic on garbage input.

2. **Statement period start + end** — the function now takes
   BOTH dates and generates candidates with every distinct year
   in the period (one year for same-year statements, two for
   Dec/Jan boundaries). The picker scores each candidate by
   distance from the period: candidates inside the period
   score 0, candidates outside score ``min(|days from start|,
   |days from end|)``. Lowest-distance candidate wins. So:

     - ``12/30`` + period 2024-12-16 to 2025-01-15 → 2024-12-30
       (inside period, score 0)
     - ``01/05`` + same period → 2025-01-05 (inside, score 0)
     - ``12/15`` + same period → 2024-12-15 (1 day before,
       closer than 2025-12-15 which is 11 months after)

3. **``filename_year_hint``** — fallback when the statement
   period regex misses the bank's specific layout. The page
   passes ``year_from_filename(upload.name)`` automatically so
   files like ``eStmt_2025-01-13.pdf`` get year 2025 even if
   the PDF's text doesn't yield a parseable period. The regex
   matches the first ``20XX`` token bounded by non-digits.

Both new helpers (``year_from_filename`` and the new
``_try_short_date_with_year`` factor-out) are exported and
tested. 16 new tests cover: within-period inference (same-year
sanity), Dec/Jan boundary cases for both sides, the
just-before-period closer-distance case, override priority,
filename fallback, no-signal None, dash-format / month-name
shorthand round-trip, garbage input, filename year extraction
(eStmt pattern, embedded, first-match-wins, no-match, empty).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:59:30 +00:00
a18b126885 fix(pdf): stamp scan timestamp once; restores Saved-to-path banner
After swapping to ``html_download_button`` the user noticed the
"✓ Saved to <path>" + 📂 Open Downloads folder pair never
appeared. The helper itself is fine — every other tool shows
those affordances correctly. Bug was specific to the PDF page.

The download button's file_name was being computed with a fresh
``datetime.now().strftime(...)`` on every render. The helper
builds its session-state keys from
``f"_dl_btn_{file_name}_{digest}"`` so the keys silently drift
every second. After the click and rerun, the helper looks up
the saved_key for the NEW file_name, finds nothing in
session_state (the click had written to the OLD key), and skips
the success banner.

Fix: stamp the timestamp once when scan completes, store it in
``K_TIMESTAMP``, and reuse it for the download filename. The
filename stays stable across reruns, so the helper's keys are
stable, so the saved-path banner renders correctly on the post-
click rerun.

Also clear ``K_TIMESTAMP`` on Clear-all-files so a new scan
gets a fresh stamp.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:50:22 +00:00
981a1a9cba fix(downloads): OneDrive-aware Downloads path + PDF uses html_download_button
User reported downloads "do nothing on click" in tool pages and
"acts like it downloads but no file in the folder" in the PDF
tool. Two root causes, two fixes.

**Root cause #1 — wrong Downloads folder on Windows.**
``_downloads_dir()`` returned ``Path.home() / "Downloads"``
unconditionally. On Windows machines with OneDrive enabled
(very common for business users), the real Downloads folder
is redirected to ``C:\Users\<u>\OneDrive\Downloads``. Our
helper would write to ``C:\Users\<u>\Downloads`` instead —
a folder that may not even exist until ``mkdir`` creates it —
and the user, naturally opening their actual OneDrive
Downloads, sees no file and concludes nothing happened.

Now: on Windows, ``_downloads_dir`` queries the registry key
``Software\Microsoft\Windows\CurrentVersion\Explorer\User
Shell Folders`` for FOLDERID_Downloads (GUID
``{374DE290-123F-4565-9164-39C4925E467B}``). This entry returns
the redirected path when OneDrive is active, the original
``%USERPROFILE%\Downloads`` otherwise — exactly what the user's
File Explorer reads. ``%USERPROFILE%`` expansion is applied
via ``os.path.expandvars``. Any registry hiccup falls through
to ``Path.home() / "Downloads"`` so the helper never raises.

The sanity check (path exists OR parent exists) catches the
edge case where the registry points into a deleted OneDrive
mount.

**Root cause #2 — PDF page used st.download_button.**
Every other tool uses the project's ``html_download_button``
helper (which is ``local_download_button`` under the hood —
the rename happened in b9147f3). ``st.download_button`` has a
long-standing bug where the second-or-later instance in a
script pass silently fails to fire. The PDF tool predated the
rewrite that switched everyone over and was still using the
broken native widget. ``_Logs.py`` had the same problem in two
places.

Swapped all three call sites to ``html_download_button``. They
now save to ``~/Downloads/<filename>`` (correctly resolved per
fix #1) and show the saved path + "Open Downloads folder"
button below the click, matching every other tool in the suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:45:51 +00:00
dbcf4d4048 feat(pdf): adopt Home-page Files-card layout
User wants the PDF page's upload UX to match the Home page
exactly — Files section header + bordered card containing the
file rows AND the "Add more files" button at the bottom, no
visible Streamlit file_uploader competing for attention.

Layout changes mirroring ``src/gui/_home.py``:

- ``st.file_uploader`` is positioned off-screen via CSS
  (``position:absolute;left:-10000px;…``). The underlying
  ``<input type=file>`` stays reachable to JS so the in-card
  "Add more files" button can programmatically click it.
- ``<h2>Files</h2>`` section header with ``N files · X.X MB
  total`` meta on the right, identical markup
  (``dt-files-section-head``).
- Single ``st.container(border=True)`` hosts every file row
  (``✕ | 📄 filename | size``, using ``dt-file-row`` /
  ``dt-file-icon-chip`` / ``dt-file-name`` / ``dt-file-size``
  classes) AND the "Add more files" button (``dt-file-add``)
  at the bottom. All classes are already defined globally in
  ``_legacy.py`` so no new CSS.
- The Add button click is wired to the off-screen uploader's
  ``stFileUploaderDropzoneInput`` via a 30-line iframe script,
  identical to the Home page's pattern. A ``MutationObserver``
  re-wires after Streamlit reruns when the button gets
  re-mounted.

Action buttons (Scan + Clear all) sit BELOW the Files card,
side-by-side in a `[1, 1, 4]` column split with
``use_container_width=True`` so they fill their cells cleanly
without stretching across the whole row. Both buttons are
disabled when no files are uploaded — the empty Files card is
its own affordance for the empty state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:34:31 +00:00
34b56b404a fix(pdf): drop statement_period_start/end columns from output
User asked to remove them — the two columns repeated the same
value on every row from a given statement, took up screen space
in the editor, and offered limited value once the date column
already carries the inferred full date.

What's kept:
- ``account_number`` — still stamped onto every row so multi-
  statement CSVs are self-attributing
- ``extract_statement_metadata`` — still runs every scan because
  ``period_end`` is the source of the year inference that binds
  Chase-style short ``01/13`` dates to ``20250113``
- ``_extract_statement_period`` and its tests — period
  detection itself isn't going anywhere, just its appearance in
  the output rows

What's removed:
- ``record["statement_period_start"]`` / ``record["statement_period_end"]``
  assignments in ``scan_pdf_for_transactions``
- The two columns from the page's column-ordering setup
- Tests pinning their presence; replaced with assertions that
  they're explicitly absent

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:28:32 +00:00
ad7c22d7fb fix(pdf): consistent 2-decimal amount precision in display and CSV
User reported amounts losing trailing zeros — 4.50 rendering as
4.5, 1000.00 as 1000 — on the same statement. Classic float
display issue: Python's native ``repr(4.5)`` drops the
``.0``, and pandas / Streamlit happily show that
inconsistency cell-by-cell.

Two layers of fix, internal type stays ``float`` for arithmetic:

**Display.** ``st.column_config.NumberColumn(format="%.2f")``
applied programmatically to every ``amount_*`` column on the
data_editor. Every numeric amount now shows with exactly two
decimal places regardless of trailing zeros.

**CSV export.** Pandas' default float-to-CSV writer also drops
trailing zeros (the same issue an accountant would see when
opening the file in Excel). Before serialising, each amount
column is mapped through the new ``format_amount`` helper —
returns ``f"{v:.2f}"`` for numerics, empty string for
None/NaN/inf, ``str(value)`` for booleans (guards the
``True → "1.00"`` foot-gun since ``bool`` is an ``int``
subclass), and passes through any string the scanner kept
because parsing failed (e.g. ``(4.50)`` when parens-negative is
off — user can correct in the editor before re-exporting).

``format_amount`` lives in ``src/pdf_extract.py`` so it's
testable in isolation (the page module can't easily be unit
tested because of its Streamlit import chain). 8 new tests
cover the trailing-zeros case, negatives, None/empty,
string-passthrough, bool guard, NaN/inf, and the ``places``
parameter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:27:16 +00:00
6f2ad57490 fix(pdf): require non-empty description; tighten multi-line merge
User reported "Daily Ledger Balances" entries leaking into
output. Three correlated bugs in the row qualifier:

**1. Empty description is now disqualifying.** A row like
``01/13/2025  $1,000.00`` has a date and an amount but no text
between them — that's a daily-balance entry, a period-summary,
or page furniture. Drop these. New filter sits after
``_description_from_row`` returns: if the description string is
empty (or whitespace-only), continue past the row.

**2. ``prev`` resets per page.** The state that drives multi-
line description merging (the "previous transaction this
continuation might attach to") used to persist across page
boundaries. A no-date no-amount line at the top of page 2
could silently attach to the last transaction on page 1. Fixed
by moving the ``prev`` / ``prev_y_bottom`` declarations into
the outer page loop so each page starts clean.

**3. Multi-line merges now check y-distance.** Before this fix,
ANY no-date no-amount line attached to the previous
transaction's description. A "Daily Ledger Balances" section
header several rows below the last transaction would silently
fold into it. Now the merge only happens when the gap
``current_top - prev_y_bottom <= 25.0`` PDF points — generous
enough for one blank-line gap between wrapped descriptions,
tight enough to reject section headers across paragraph
breaks. The threshold is a module constant
(``_MULTILINE_MERGE_MAX_GAP``) for future tuning if real
statements call for it.

Three new test classes:

- ``TestRequiresDescription.test_empty_description_row_dropped``
  — date+amount-no-text row filtered, real transaction kept.
- ``TestPrevTransactionResetsPerPage.test_no_cross_page_merge``
  — page-1 transaction + page-2 section header = no merge.
- ``TestMultilineMergeYGap`` — close continuation merges
  (10-pt gap), far section header doesn't (100-pt gap).

The original ``TestMultilineDescription.test_continuation_line_merges``
still passes — its setup has a 10-pt gap which is within the
new threshold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:58:50 +00:00
a1824b8dc4 feat(pdf): Home-style file list + Clear-all button
User feedback: the standard file_uploader didn't visually match
the Home page, and there was no obvious way to clear out
uploaded files between scans (have to refresh the browser tab).

**Persistent stash + add-only sync.** Files captured into
``st.session_state["pdf_uploads"]`` (dict name → {bytes, size})
via an ``on_change`` callback on the file_uploader widget. The
callback is **add-only** — never removes files from the stash
based on widget state. Removal is owned by the custom X buttons
+ widget-counter bump (see below). This guarantees a hidden
native X click can't silently drop files behind the user's
back.

**Hidden native file list.** A small CSS block suppresses the
file_uploader's built-in file rows + their delete buttons
(``stFileUploaderFile`` + ``stFileUploaderDeleteBtn``), so the
custom list below is the single source of truth on screen.

**Custom file list (Home pattern).** Below the dropzone, every
uploaded file gets a row: ``✕ | 📄 filename | size``. Top of
section shows ``N files · 12.3 MB total``. Counts and sizes
update in real time as the user adds or removes files. The X
button per row calls ``log_event("upload", "PDF removed: …")``,
removes the entry from the stash, and bumps the widget counter
to clear the widget too.

**Clear-all button.** Sits next to the Scan button. Wipes the
stash, bumps the widget counter, drops any cached scan results
(``K_ROWS``, ``K_WARNINGS``, ``K_SOURCE_COUNT``). Audited via
``log_event("upload", "PDF list cleared", count=N)``.

**Widget reset via counter bump.** Streamlit disallows
programmatic mutation of widget session-state entries; the
standard workaround is to rotate the widget's ``key``. Page
maintains ``K_UPLOAD_COUNTER`` which gets incremented on
remove / clear-all, producing a fresh ``pdf_upload_v{N}`` key
and a freshly-instantiated empty widget. The stash retains any
unaffected files; on next upload, the add-only sync picks up
the new ones without re-adding the removed ones.

**Scan rewired to read the stash.** Instead of iterating the
widget's UploadedFile objects (which the previous code did and
which broke when the widget unmounted on remove), the scan
loop iterates ``pdf_uploads.items()`` and uses the cached
``bytes``. Diagnostic expander does the same — re-reads from
the stash, removing the need for a separate ``K_DIAGNOSTIC``
cache (deleted).

**``_format_size`` helper** ports the byte-formatting logic
from ``_home.py``'s pattern (KB / MB / GB rollover).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:28:01 +00:00
155dd30746 feat(pdf): extract statement header (account + period) + date format
Two related additions for the accountant workflow:

**1. Statement header extraction.** New
``extract_statement_metadata(pages)`` pulls the account number
and statement period out of the first page (falls back to
page 1+2 if either is missing on page 1 — Wells Fargo business
accounts put header info on page 2). Detected fields are
stamped onto EVERY transaction row so a multi-statement CSV is
self-attributing per row::

    {
      "date": "20250113",
      "description": "Coffee Shop",
      "amount_1": -4.50,
      "account_number": "****5678",
      "statement_period_start": "20250101",
      "statement_period_end": "20250131",
      ...
    }

Account-number regex is tolerant of masks (``****1234``),
hyphens (``1234-5678-9012``), and spaces. Period regex looks
for "Statement Period" / "From" / "Period Covered" labels plus
the first 1-2 full-year dates that follow. If only one date is
present near the label, it's used for both start and end (some
statements show only the closing date).

**2. Year inference for short dates.** When the row date is a
short ``01/13`` or ``Jan 13`` without a year, the scanner now
binds the year from the statement period's end date BEFORE
formatting. Doesn't handle the December-in-January-statement
cross-year case (rare; user can edit in the table).

**3. Configurable output date format.** New
``output_date_format`` parameter on ``scan_pdf_for_transactions``
defaults to ``%Y%m%d``. Applied to: the transaction date column
AND the statement period start/end fields. The page surfaces a
dropdown in Scan options with common presets (YYYYMMDD,
YYYY-MM-DD, MM/DD/YYYY, DD/MM/YYYY, ``Mon DD, YYYY``) plus a
Custom option that accepts a raw strftime string.

New helper: ``format_date(iso_str, fmt)`` converts ISO
``YYYY-MM-DD`` to any strftime; passes invalid input through
unchanged so the user can see what was actually there rather
than getting silent empties.

20 new tests cover: format_date, account-number extraction
(masked / hyphenated / spaced / no-label / short), period
extraction (standard / from-to / single-date / no-label),
metadata orchestrator (full header / no pages / page-2
fallback), year inference (US / dash / month-name / no-period /
unparseable), plus an end-to-end class that builds a header'd
PDF with short-date transactions and confirms metadata
attribution + year inference + format round-trip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:20:46 +00:00
3cf935c999 fix(pdf): drop zero-amount rows; multi-date rows clean description
Two corrections from real-statement feedback:

**1. Drop rows where the transaction amount is exactly 0.**
Bank statements include date+amount-shaped noise like
"INTEREST EARNED 0.00", "PAGE TOTAL 0.00", "BALANCE FORWARD
0.00 1,234.56" — all match the date+amount heuristic but
aren't transactions. New filter in
``scan_pdf_for_transactions``: drop rows whose ``amount_1``
parses to exactly 0. Non-zero balances in ``amount_2`` don't
rescue a zero amount_1 — leftmost amount is the canonical
transaction amount. Unparsed-but-non-empty amount strings are
kept (user verifies in the editor).

**2. Multi-date rows: first date wins for the column, every
date excluded from the description.** Chase / BofA / Wells
commonly show both a transaction date and a posting date per
row:

    01/13  01/14  COFFEE SHOP  $4.50

Before this fix, ``_find_dates_in_words`` returned the first
date only and the second date leaked into description as
"01/14 COFFEE SHOP". Now it returns ALL dates with their word
ranges; the scanner uses ``dates[0]`` as the canonical date
and passes every range to the description builder for
exclusion.

The detector's two-pass strategy now also guards against
mixing full-year and short-date matches on the same row.
Previously, a header line like ``Page 1/2 of 3 ... Statement
Date 01/13/2026`` would return both ``1/2`` and ``01/13/2026``,
and ``1/2`` (being leftmost) would have won the date column.
Now: if any full-year date is found on the row, short patterns
are NOT also collected — full year anchors interpretation. A
row with no full-year date (Chase short-date case) still falls
back to short patterns and collects all of them.

New tests:
- ``test_multiple_dates_returned_in_position_order`` —
  ``01/13`` + ``01/14`` both returned, in order
- ``TestMultiDateRow.test_first_date_wins_second_excluded_from_description``
  — end-to-end through ``scan_pdf_for_transactions``
- ``TestZeroAmountRowsAreDropped.test_zero_amount_row_dropped``
  — "INTEREST EARNED 0.00" row dropped while real txn kept
- ``test_negative_amount_kept`` — pin that -40.00 is not
  treated as zero by the filter

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:12:21 +00:00
263af3c7c2 fix(pdf): short dates without year + diagnostic for "0 rows" runs
User uploaded a real Chase statement and got "0 rows detected."
Two bugs the rewrite shipped with, plus a diagnostic:

**1. Short dates without year weren't recognized.** Most bank
statements (Chase, Wells, BofA, …) display transaction dates as
``01/13`` or ``Jan 13`` because the year is implied by the
statement period. The original regex required ``\d{2,4}`` after
the second slash, so ``01/13`` failed to match and rows with no
detected date got dropped.

Split ``_DATE_RES`` into ``_FULL`` (with year) and ``_SHORT``
(no year), with a two-pass detector: pass 1 tries full-year
patterns across the whole row; pass 2 only tries short patterns
if pass 1 found nothing. This prevents a stray ``Page 1/2`` from
shadowing the real dated transaction on the same line.

Short patterns:
- ``\d{1,2}/\d{1,2}`` — Chase, etc.
- ``\d{1,2}-\d{1,2}``
- ``[A-Z][a-z]{2}\s+\d{1,2}`` — "Jan 13"

When parsing, short dates pass through ``parse_date`` and
return None (no year to bind to), so the scanner falls back to
the raw text — the user sees ``01/13`` in the date column and
can correct in the editor.

**2. Multi-word dates leaked the day token into the description.**
A pre-existing bug: ``_find_dates_in_words`` returned only the
START word index, and ``_description_from_row`` only excluded
that single word. For "Jan 13 Coffee $4.50", the description
became "13 Coffee" instead of "Coffee". Fixed by returning
``(start, end, text)`` with ``end`` exclusive (computed from
``len(m.group(1).split())`` so window-overrun doesn't
over-consume), and the description builder now skips the full
range.

**3. New diagnostic: ``diagnose_pdf_lines(pdf_bytes)``.** Returns
every clustered text line the scanner saw with ``has_date`` /
``has_amount`` flags. When the page's scan returns 0 rows, an
auto-expanded "what the scanner saw" expander now renders a
table of all extracted lines so the user can:

- Spot scanned-PDF cases (empty result → enable OCR)
- See which lines have a date but no amount (or vice versa)
- Eyeball the date / amount format the scanner missed

Without leaving the app or asking the developer for help.

Eight new tests cover: short US date (``01/13``), short month-
name date with two-word consumption (``Jan 13``), the
``Page 1/2 ... 01/13/2026`` shadowing case, and the multi-word-
date description fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:06:07 +00:00
bece2b4030 refactor(pdf): rip out templates; heuristic scan + selectable table
User feedback: the template / visual-picker / mode-dispatch
implementation was too complex for the actual workflow.
Statements drift between months, the canvas state didn't survive
multi-page navigation, and accountants don't want to maintain
per-bank configuration just to convert PDFs to CSV.

Start-over design — one public function, one page, no
persistence:

  ``scan_pdf_for_transactions(pdf_bytes) → (rows, warnings)``

A row is "any text line with a date pattern AND at least one
amount pattern." Each detected row is a dict shaped::

    {
      "date": "2026-01-15",
      "description": "Coffee Shop",
      "amount_1": -4.50,
      "amount_2": 1000.00,   # if a second amount was found
      "page": 1,
      "raw": "01/15/2026 Coffee Shop (4.50) 1,000.00",
      "source_file": "chase-jan-2026.pdf",
    }

Multi-line descriptions still merge (no-date no-amount lines
attach to the previous transaction). Multi-PDF batches share a
single combined table with a ``source_file`` column.

**Page UX:**

- Upload PDF(s) → optional Options expander (parens-negative,
  use-OCR) → click Scan → see all detected rows in an
  ``st.data_editor``.
- The editor has an ``Include`` checkbox column (default on),
  plus user-editable date / description / amount cells and a
  read-only ``raw`` column showing the original PDF text for
  verification.
- A ``Columns to include in CSV`` multiselect hides
  ``page`` / ``raw`` from the download by default; user can
  re-add either.
- Download CSV gets only the checked rows.

No template save/load. No visual picker. No mode dispatch. No
column boundaries. No schema migration. No per-bank
configuration files.

**Deletions:**

- ``src/pdf_templates.py`` — template storage layer
- ``src/gui/_drawable_canvas_compat.py`` — Streamlit compat shim
  for the canvas (no canvas now)
- ``tests/test_pdf_templates.py``, ``test_pdf_row_heuristic.py``,
  ``test_drawable_canvas_compat.py`` — covered the removed APIs
- ``build/hooks/hook-streamlit_drawable_canvas.py`` — hook for
  the removed dep
- ``streamlit-drawable-canvas==0.9.3`` from ``requirements.txt``
- The drawable-canvas references in ``build/datatools.spec``

**``src/pdf_extract.py``** shrinks from ~30 helper functions to
~10. Keeps: value parsers, row clusterer, date/amount token
finders, OCR pipeline, dependency guards. The one new public
function ``scan_pdf_for_transactions`` glues them together.

**Tests** (59 passing): the unit layer keeps full coverage of
the building blocks; the smoke layer pins the end-to-end PDF
roundtrip, OCR discovery, dependency-import behavior, and the
multi-line-description merge. The fpdf2-generated fixture PDF
still drives the real-PDF test.

Rollback: ``git revert HEAD`` brings back the template system if
needed — but the simpler model should make that unlikely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:57:30 +00:00
60969c0770 feat(pdf): UI rework — Auto-detect is the default build flow
Pulls the user's primary mental model away from "draw column
boundaries" toward "tell me what shape your amounts have, see
detected rows, save." The visual picker that wasn't working for
multi-statement workflows is reachable but no longer the
default.

**Build mode header** now has a mode radio:

- "Auto-detect (recommended)" — row_heuristic. Tabs: Amount
  layout · Filters & date · Save. Three small forms; no
  coordinate UI anywhere. The Amount-layout tab's dropdown picks
  one of single / txn+balance / debit+credit / debit+credit+balance
  and auto-derives the min/max amount-count range (overridable
  under an expander).
- "Visual columns (advanced)" — column_visual. Five tabs (the
  original Visual picker / Pages & table / Columns / Parsing /
  Save). A yellow warning panel up top reminds the user that
  column-x templates only work when statement layout is stable.

Switching modes triggers a rerun so the right tab set renders
immediately. The template object preserves both mode's config
trees side-by-side so a user can flip between them without
losing work.

**Live preview** below the form runs ``apply_template`` against
the cached sample pages (already cached in session_state so this
re-renders cheaply on every form edit). The "no rows yet"
message is mode-aware — points users at the right tuning knobs
for whichever mode they're in. The preview caption notes which
mode produced the rows so the user can correlate decisions to
output.

The visual picker bug the user reported — "a single box stays in
the same location regardless of page" — is sidestepped rather
than fixed: in row_heuristic mode there's no canvas to confuse,
and for the rare column_visual user the canvas is still
imperfect but no longer their first interaction with the tool.
Cleaning up the column_visual canvas state bugs is a separate
follow-up if real users still hit the Advanced mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:46:27 +00:00
48cd9e8249 feat(pdf): schema v2 + mode field + v1 in-memory migration
Bumps ``SCHEMA_VERSION`` from 1 to 2 to add a top-level ``mode``
field distinguishing ``row_heuristic`` (new default) from
``column_visual`` (legacy). The schema bump is real — old code
that defaults missing keys would silently mis-extract — so we
do it the careful way:

- ``new_template`` now returns mode=``row_heuristic`` with the
  full row-heuristic config tree pre-populated. The legacy
  column-visual fields are still seeded with empty defaults so
  switching modes in the GUI doesn't require runtime key
  insertion.
- ``validate_template`` is mode-aware: row_heuristic templates
  must have a valid ``amounts.shape`` + sane
  ``row_detection.min/max_amounts_per_row``; column_visual
  templates keep the existing column/target requirements.
- ``load_template`` accepts both v1 and v2 files
  (``_LOAD_SUPPORTED_VERSIONS = {1, 2}``). v1 files get
  ``mode="column_visual"`` injected and ``schema_version`` bumped
  IN MEMORY ONLY — disk file stays v1 until the user explicitly
  re-saves. A buggy migration can't silently corrupt their
  template library.
- ``save_template`` continues to write the current schema; saving
  a v1 template through the GUI naturally upgrades it.

Mode + shape constants exported (``VALID_MODES``,
``VALID_AMOUNT_SHAPES``) so the GUI dropdowns can derive their
options from the source of truth.

Tests split into ``TestValidateTemplateRowHeuristic`` (6) +
``TestValidateTemplateColumnVisual`` (4) + ``TestV1Migration``
(1). All 29 template tests pass; the original column-mode tests
that previously implicitly relied on schema_version=1 keep
working because new_template's seeded column fields are still
present in row_heuristic templates (just not validated as
required).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:46:10 +00:00
d80befd05a feat(pdf): row-heuristic extraction (mode dispatch, no coordinates)
User reported the column-visual approach is too brittle for real
bank statements: column-x-positions saved against a sample page
don't survive layout drift between months (statement A has
columns at x=300, statement B drifted to x=320), and a saved
template can only realistically work for one statement's
specific render. The fundamental fix is to stop depending on
coordinates at all.

**Row-heuristic mode** finds transaction rows by pattern: any
line with a date token + N amount tokens IS a transaction. Date
patterns (US slash / EU slash / ISO / "Jan 15, 2026" / etc.) and
amount patterns (currency, parens-negative, thousands grouping)
are matched against word text — no x-positions involved.

The full pipeline:

1. ``find_transaction_rows`` clusters words into rows and scans
   each line for date + amount tokens.
2. Multi-line descriptions still attach to the previous row via
   the no-date-no-amount continuation rule.
3. Amount shapes drive interpretation: ``single`` /
   ``txn_balance`` / ``debit_credit`` / ``debit_credit_balance``.
4. ``_infer_amount_column_centers`` clusters amount x-midpoints
   ACROSS ALL detected rows to find natural column groupings —
   so debit-vs-credit assignment for single-amount lines works
   without the user marking anything on screen.

``apply_template`` is now a dispatch over ``template["mode"]``:

- ``mode="row_heuristic"`` (default for new templates) — the new
  pipeline.
- ``mode="column_visual"`` — the existing pipeline, kept under
  ``_apply_template_column_visual`` for v1 templates and the
  Advanced fallback.

18 new tests cover: date detection (US slash, two-digit year,
ISO, month-name, missing); amount-token finding (currency,
parens, pure text, bare-year rejection); column-center inference
(clear two-column case, empty input); end-to-end on synthetic
Page objects with all four amount shapes; the critical
layout-drift test that proves the same template works on pages
of different sizes / different absolute x-positions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:45:55 +00:00
10015c40e1 fix(pdf): shim image_to_url for drawable-canvas on modern Streamlit
User hit ``AttributeError: module 'streamlit.elements.image' has
no attribute 'image_to_url'`` on first PDF import. Root cause:
``streamlit-drawable-canvas`` 0.9.3 (last upstream release 2023)
calls a Streamlit internal that was relocated in Streamlit
~1.30+. The function moved from ``streamlit.elements.image`` to
``streamlit.elements.lib.image_utils`` AND its signature
changed — the second positional argument is now a
``LayoutConfig`` dataclass instead of a plain ``int`` width.

Three remedies considered:

1. Downgrade Streamlit. Reverses unrelated improvements +
   security fixes; not on the table.
2. Fork drawable-canvas. The maintenance hit isn't worth it for a
   one-line internal API change.
3. **Ship a compatibility shim.** Re-attach a wrapper at the old
   import path that adapts the old call shape to the new
   function. This is the standard workaround the wider Streamlit
   community has converged on for this exact regression.

``src/gui/_drawable_canvas_compat.py`` does (3). The ``install()``
helper is idempotent, opt-in (not auto-run at module import — a
grep for ``_install_canvas_compat`` shows every call site), and
no-ops if Streamlit hasn't moved the function OR if the new
function isn't where we expect (lets the canvas surface a real
error rather than papering over a different bug). The page calls
``_install_canvas_compat()`` once at module top before any
``st_canvas`` invocation; Streamlit's script-rerun model means
this fires every page load but the ``_PATCHED`` guard makes
re-runs free.

The shim wraps the old ``width=int`` arg into a default-constructed
``LayoutConfig()`` — the old ``width=-1`` sentinel meant "use
the image's natural width", which is also what an unconfigured
LayoutConfig produces. Confirmed by inspecting Streamlit 1.57.0's
``image_utils.py``.

4 new tests pin the shim contract:

- ``install()`` attaches ``image_to_url`` to the old path on modern
  Streamlit
- Idempotent — calling twice doesn't double-wrap
- Doesn't clobber a future Streamlit that restores the original
  at the old path
- Translates ``(image, -1, False, "RGB", "PNG", "id")`` into a
  proper call to the new function with a ``LayoutConfig`` instance

If a future Streamlit upgrade moves ``image_to_url`` AGAIN, the
shim's silent-no-op fallback means the canvas error surfaces
again and points at where to look. The shim doesn't paper over
mysteries; it only patches the one specific relocation we know
about.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:29:20 +00:00
e6ee2e3481 feat(pdf): robust Tesseract discovery + OS-aware install copy
User tried ``brew install tesseract`` in PowerShell after seeing
all three OSes listed inline in the OCR banner — easy mistake
when the install commands are crammed on one line with ``·``
separators. Two changes pre-empt this:

**OS-aware OCR banner.** The expander now detects the user's
platform via ``platform.system()`` and shows only the relevant
install instructions:

- **Windows**: UB-Mannheim installer link, numbered steps,
  explicit "keep the Add to PATH checkbox on" callout, plus a
  fallback paragraph telling the user how to set
  ``DATATOOLS_TESSERACT_PATH`` if they already installed
  without PATH and don't want to reinstall.
- **macOS**: ``brew install tesseract`` with a Homebrew link.
- **Linux**: ``apt install tesseract-ocr`` with a "or your
  distro's equivalent" hedge.

**Robust binary discovery in ``ocr_available()``.** Three-stage:

1. Honor ``DATATOOLS_TESSERACT_PATH`` env var if set — explicit
   override for portable installs or non-default locations.
2. Try ``pytesseract``'s default PATH-based lookup.
3. If PATH lookup fails, probe known Windows install paths
   (``C:\Program Files\Tesseract-OCR\tesseract.exe``,
   the x86 variant, and ``%LOCALAPPDATA%\Programs\Tesseract-OCR\``)
   via the new ``_autodetect_tesseract_path``. On hit, set
   ``pytesseract.pytesseract.tesseract_cmd`` so all subsequent
   ``image_to_data`` calls use the same binary without
   re-discovering.

This means a user who runs the UB-Mannheim installer with
default options but forgets the PATH checkbox will still get
OCR working after a launcher restart, without env-var
gymnastics.

Tests (4 new, 85 total in the suite):

- Auto-detect returns None on non-Windows (no false positives
  on dev laptops).
- Auto-detect finds the binary at a mocked
  ``C:\Program Files\Tesseract-OCR\tesseract.exe``.
- Auto-detect returns None when no candidate exists.
- ``DATATOOLS_TESSERACT_PATH`` env var beats both PATH lookup
  and auto-detect (sets ``tesseract_cmd`` even when the path
  doesn't resolve, so a real binary at a custom location works).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:15:00 +00:00
538e23d219 build(pdf): bundle PDF deps in installers + pin versions + smoke tests
Three changes prepare the next tagged release so end users get
the PDF Extractor without ever touching pip.

**Exact-pin the new deps** (``requirements.txt``):

  pdfplumber==0.11.9
  pypdfium2==5.8.0
  pytesseract==0.3.13
  streamlit-drawable-canvas==0.9.3

Tight pins are the right call for these because the GUI's
visual-picker geometry + the parsing-pipeline word positions
depend on stable internal behavior — a quiet upstream tweak to
``extract_words`` or ``page.render`` would re-break the tool on
the next CI build. Bumping requires a deliberate edit + a CI
run, not a transient ``pip install`` resolving to whatever
``setup.py`` pulled.

Existing deps stay on their current ``>=X.Y,<X+1`` ranges; the
user's "tight pin" concern is specifically about the PDF stack.

**Wire the new deps into the PyInstaller bundle** (``build/``):

- ``datatools.spec`` — add ``collect_submodules`` for pdfplumber,
  pdfminer, pypdfium2, streamlit_drawable_canvas, PIL,
  pytesseract; add ``collect_data_files`` for pypdfium2 (PDFium
  native ``.dll``/``.so``/``.dylib``), streamlit_drawable_canvas
  (frontend JS bundle), pdfminer (Adobe CMap tables).
- ``hooks/hook-pypdfium2.py`` — belt-and-braces hook that uses
  ``collect_dynamic_libs`` to force-include the PDFium binary.
  Without this the visual picker silently fails on installed
  builds with a ``FileNotFoundError`` for the shared library.
- ``hooks/hook-streamlit_drawable_canvas.py`` — collects the
  built JS frontend so the canvas iframe loads under the bundled
  Streamlit server instead of rendering blank.

**Tesseract is intentionally NOT bundled** (option A from the
design discussion). Modern bank statements are text-based;
bundling Tesseract would ~triple installer size for a long-tail
case. The in-app banner directs users to install it from
``UB-Mannheim/tesseract`` if they need OCR. Decision is captured
in the ``project-pdf-installer-pending`` memory note.

**Smoke tests** (``tests/test_pdf_extract_smoke.py``, 17 tests)
add the layer above the pure unit tests:

- ``TestDependencyImports`` — each dep imports cleanly
- ``TestRealPdfRoundTrip`` — generates a tiny statement PDF in
  memory with ``fpdf2`` (test-only dep in
  ``requirements-dev.txt``), runs ``extract_pages`` +
  ``apply_template``, asserts 3 rows out with the right signed
  amounts. Catches "the build succeeded but pdfplumber breaks at
  runtime."
- ``TestRenderPageImage`` — exercises ``pypdfium2.render`` so the
  hook-bundled native lib gets a real call. This is the most
  common installer-bug signature (missing .dll) and the test
  catches it before users do.
- ``TestPdfDependencyMissing`` — monkeypatches ``__import__`` to
  simulate a stripped install; confirms the typed exception +
  actionable hint round-trip.
- ``TestPinnedVersionsMatchInstalled`` — parametrized over all
  four pinned dists; uses ``importlib.metadata`` rather than
  ``__version__`` because pypdfium2 doesn't expose it directly.
  Trips if someone bumps the pin without reinstalling.
- ``TestOcrAvailability`` — confirms ``ocr_available()`` returns
  ``(bool, str)`` and ``extract_pages_auto(allow_ocr=False)``
  skips OCR cleanly.

All 81 PDF + audit tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:10:43 +00:00
2d927bc95f fix(pdf): graceful fallback when PDF dependencies aren't installed
User hit a hard ImportError on opening the PDF→CSV tool because
``pip install -r requirements.txt`` hadn't picked up the new
``pdfplumber`` / ``pypdfium2`` lines yet. Streamlit surfaces
that as an unfiltered traceback — friendlier to show a clear
install-required panel inside the tool instead.

Two changes:

1. ``src/pdf_extract.py`` lazy-imports the PDF deps via
   ``_require_pdfplumber()`` / ``_require_pdfium()`` helpers that
   raise a new ``PdfDependencyMissing`` (subclass of ImportError)
   with an actionable ``hint`` field. Pure helpers
   (``parse_amount``, ``parse_date``, ``cluster_rows``, etc.)
   keep working with no PDF dep installed — useful for tests and
   for keeping module-import paths cheap.

2. The tool page probes both deps at render time via
   ``_pdf_deps_status()``; if anything's missing it shows a
   ``st.error`` panel with the exact pip command and a
   "restart the launcher" reminder, then ``st.stop()``s before
   touching any PDF code path.

The page itself loads cleanly without the deps installed, so the
sidebar nav doesn't 500 — the user just sees the install panel
on click.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:59:20 +00:00
967d3f6a11 feat(pdf): OCR availability banner + per-run toggle
Phase 6/6. Final polish layer on top of the OCR pipeline that
``extract_pages_auto`` has carried since commit 1.

- **OCR status banner** at the top of the page next to the mode
  selector. Ready: a one-liner caption confirming OCR will run
  on scanned pages. Unavailable: a collapsed expander explaining
  the missing piece (``pytesseract`` binding vs. Tesseract
  binary) with install pointers for Windows, macOS, and Linux.
  The expander explicitly notes that modern text-based bank
  statements don't need OCR — most users will never expand it.
- **"Use OCR for scanned pages" toggle** in Extract mode,
  defaulting to the runtime availability. Disabled (greyed out)
  when Tesseract isn't usable, so the user can't accidentally
  set themselves up for confusing warnings. Passes through as
  ``allow_ocr`` to ``extract_pages_auto``.
- Build mode's sample-loading path continues to call
  ``extract_pages_auto(..., allow_ocr=True)`` — sample preview
  always uses OCR if available, since the user is actively
  diagnosing template fit.

No schema change. OCR's structural support is in commits 1 + 3;
this commit just makes it discoverable + opt-out.

Rolling up the 6-commit feature:

  b8aff86  Phase 1 — pure pdf_extract module + tests
  aea520d  Phase 2 — template storage layer + tests
  2f349e8  Phase 3 — Extract/Build/Manage page + nav + i18n
  5a8e2ec  Phase 4 — batch polish (ZIP, sort, status block)
  b86828d  Phase 5 — visual region picker (drawable canvas)
  THIS     Phase 6 — OCR banner + toggle

Each commit is independently revertable; rolling all the way
back to ``c16e2a5`` is ``git revert b86828d 5a8e2ec 2f349e8
aea520d b8aff86 <this>`` (or just ``git reset --hard c16e2a5``
on a clean branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:54:11 +00:00
b86828d791 feat(pdf): visual region picker on rendered sample page
Phase 5/6. Adds a "Visual picker" tab as the first stop in the
template-build flow. The sample PDF page is rasterized with
``pypdfium2`` (capped at ~900px wide for sensible display), and
``streamlit-drawable-canvas`` overlays drawing tools on top.

UX:

- **Line mode** — drag short (roughly vertical) strokes where you
  want columns to split. Each stroke's x-midpoint becomes one
  boundary in PDF point coordinates.
- **Rect mode** — drag a rectangle around the transactions
  table; bbox is preserved on the template as
  ``visual.table_bbox`` for round-trip, future use as a hard
  crop region.
- **Transform mode** — move/resize already-drawn shapes after
  the fact.

Round-trip: re-entering Build mode with an existing template
seeds the canvas with full-height vertical lines for every
boundary already on the template, plus the saved bbox if any,
so editing-after-save matches the user's mental model.

Coordinate translation: the canvas reports pixel positions; we
divide by the renderer's pixels-per-PDF-point scale to get back
to PDF coordinates that ``apply_template`` already expects. No
template-schema change required — the boundaries the picker
writes are the same list the text-input editor wrote in
commit 3, just sourced visually.

New helper in the extraction module:

- ``render_page_image(pdf_bytes, page_no, target_width=900)`` —
  rasterize a single 1-indexed page to a PIL image; returns
  ``(image, scale)`` for coordinate translation.

The text-input boundary editor in the Columns tab remains as a
fallback for power users / keyboard-only workflows and for
copy-paste from spreadsheet-derived x-positions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:52:54 +00:00
5a8e2ec9e1 feat(pdf): batch extract polish — ZIP output, sort-by-date, status block
Phase 4/6. Polishes the batch workflow shipped in commit 3:

- **st.status progress block** replaces the simple progress bar.
  Each file appears as its own line as it's processed; the block
  auto-collapses on completion with a "12/13 extracted" summary
  and turns red if any file errored.
- **Sort combined output by date** checkbox (default ON) sorts
  the merged CSV ascending by date, with source_file as a stable
  secondary sort so multiple statements interleave by date but
  same-day rows from the same file stay together.
- **ZIP-of-per-PDF-CSVs output option** alongside the combined
  CSV. When the accountant has 12 statements from 12 different
  account periods and wants to feed them into 12 separate ledger
  imports, the ZIP keeps each file's rows in its own CSV named
  after the original PDF stem.
- **Per-file summary table** gets a ``status`` column ("ok" /
  "no rows" / "error: ExceptionName") so error grouping is
  obvious at a glance — already present from commit 3, now
  upgraded with the status field.

Cancellation is intentionally not added — Streamlit's single-
thread rerun model has no clean way to interrupt a tool-run
mid-stream without architectural changes to extraction. If a
user mis-fires Extract on 50 PDFs they can refresh the browser
tab; the task will be killed when the next interaction comes in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:51:05 +00:00
2f349e8191 feat(pdf): tool page with Extract / Build / Manage modes
Phase 3/6. Wires the PDF Extractor into the GUI as a new
"transformations" tool with three modes selected by a horizontal
radio at the top of the page:

**Extract** — pick a saved template, upload one or more
statement PDFs (single + batch shipping together to keep the
common case one-step), get a previewed DataFrame + CSV download.
Per-file row counts and warnings are surfaced; failures on one
file don't kill the whole batch. The combined CSV gets a
``source_file`` first column so the accountant can sort/filter
by statement.

**Build template** — load an existing template or start fresh,
upload a sample PDF, edit every schema field across four tabs
(Pages & table / Columns / Parsing / Save). A live preview below
re-runs ``apply_template`` against the sample on each re-render
so the user sees their changes hit rows immediately. The column-
boundary editor is text-input ("comma-separated x-positions") for
now — replaced by the drawable-canvas visual picker in commit 5.

**Manage templates** — list with rename / delete / export
(downloads the canonical JSON) / import (uploads someone else's
JSON, validated through ``template_from_json``).

Heavy work (``extract_pages_auto``) only runs on explicit user
action (Extract / a new sample upload), and the parsed Page list
is cached in ``st.session_state`` so widget-edit reruns don't
re-parse the PDF.

Logging: tool runs and template saves both hit the audit log via
``log_event("tool_run", …)``, matching every other tool's
instrumentation pattern.

Registered in ``tools_registry.py`` under ``transformations``
with status ``Ready`` and the picture-as-pdf Material icon. i18n
keys added for en + es ("PDF to CSV" / "PDF a CSV").

OCR is wired in this commit — ``extract_pages_auto`` already
falls back through ``pytesseract`` when the binary is available,
and the warning strings it returns surface as ``st.info`` /
``st.warning`` per-file. Commit 6 will polish the OCR UX with a
status row.

Next commits build on this page:
  4 — batch progress + cancellation + per-file error grouping
  5 — drawable-canvas visual picker replaces text x-positions
  6 — OCR availability banner + scanned-page indicators

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:49:44 +00:00
aea520d2f7 feat(pdf): template storage layer (load/save/list/import/export)
Phase 2/6. Persists "how to read this bank's statements" as JSON
files under ``~/.datatools/pdf_templates/<slug>.json`` so an
accountant can build one template per source and reuse it across
every statement that follows the same layout.

Public API:

- ``new_template(name)`` — blank with sensible defaults
- ``save_template(t)`` — validate + atomic write (temp + rename)
- ``load_template(slug)`` / ``delete_template(slug)``
- ``list_templates()`` — sorted summaries, skips corrupt files
- ``template_to_json`` / ``template_from_json`` — portability
- ``validate_template(t)`` — returns (ok, errors) list for GUI

Schema is documented in the module docstring. Versioned via
``schema_version: 1`` so future fields don't break saved files
silently — ``load_template`` refuses unknown versions instead of
limping along with missing keys.

Validation contract enforces:
- non-empty name + slug (lowercase alphanumeric + hyphens)
- at least two output columns
- at least one column mapped to ``date``
- either one ``amount`` column OR both ``amount_debit`` +
  ``amount_credit``
- column boundary count consistent with source-column count

Storage is atomic: ``_atomic_write`` goes through a temp file +
``os.replace`` so a crashed save can't leave a half-written JSON
at the canonical path. The GUI's build flow saves on most
visual-picker changes, so this matters more here than for a
"save button" workflow.

24 tests cover slugify, defaults, validation branches, round-trip
load/save, missing/corrupt file handling, delete, list (incl.
skipping corrupt files), atomic-write rollback, and import/export.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:46:44 +00:00
b8aff862ed feat(pdf): add pure PDF→DataFrame extraction module
Phase 1/6 of the PDF Extractor tool. Pure module — no Streamlit,
no user-config I/O — that turns a PDF blob plus a template dict
into a ``pandas.DataFrame`` of transaction rows. Primary use case
is accountant-style extraction of bank-statement transactions,
where each bank's format is encoded as a reusable template.

Pipeline:

1. ``extract_pages(pdf_bytes)`` reads with pdfplumber and surfaces
   words with bounding boxes.
2. ``cluster_rows(words)`` groups words into rows by ``top``
   tolerance — no reliance on PDF table-line detection (most bank
   statements have no visible cell borders).
3. ``assign_columns(row_words, boundaries)`` buckets each word by
   its horizontal midpoint into N+1 columns defined by N interior
   x-boundaries.
4. ``_within_table_window`` slices to the band between the header
   line and the end-marker (e.g. "Closing balance").
5. ``apply_template`` orchestrates the above, handling:
   - parens-style negative amounts, currency stripping, custom
     decimal/thousands separators
   - separate debit + credit columns combined into a single signed
     ``amount`` (credit positive, debit negative — accounting
     register convention; matches QuickBooks/Xero imports)
   - multi-line description wrapping (rows with empty date column
     attach to the previous row's description)
   - row-level regex skip filters (e.g., "Total", "Subtotal")
   - page-range filters ("all", "2-", "1,3-5")

Optional OCR fallback for scanned statements:

- ``page_has_extractable_text`` heuristic flags pages with <5
  words as likely-scanned.
- ``ocr_available()`` checks both the ``pytesseract`` Python
  binding and the Tesseract binary; surfaces a clear reason
  string when either is missing.
- ``extract_pages_auto`` does text-first, OCR-the-blanks, and
  returns warnings the UI can surface.

29 unit tests cover the parsing pipeline against synthetic
WordBox/Page data — no fixture PDFs required, runs in 0.1s. Real
PDF extraction is exercised by hand on the user's statements.

Dependencies added:
- ``pdfplumber>=0.10,<1`` — text + position extraction
- ``pypdfium2>=4,<6`` — page rasterization for OCR + visual picker
- ``streamlit-drawable-canvas>=0.9,<1`` — visual region picker
  (used in commit 5)
- ``pytesseract>=0.3,<1`` — OCR (used in commit 6; system
  Tesseract binary required separately)
- ``cryptography>=41,<49`` — bumped upper bound; pdfminer.six
  transitively requires a recent release. Internal ed25519
  license-signing usage is API-stable across the bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:44:51 +00:00
c16e2a5e29 feat(audit): surface log path + /logs link in Help popover
Adds a "Log file" section to the sticky-footer Help popover with
two affordances:

1. The current audit-log path rendered as monospace text with
   ``user-select: all`` so a single click selects the whole path
   for copy-paste into a file manager. Works on every platform —
   no subprocess required.
2. A "View all logs →" link to the new ``/logs`` page (added in
   the previous commit) for download/inspection of today's and
   prior days' files.

i18n keys ``footer.help_logs_label`` + ``footer.help_logs_link``
added to en + es packs, matching the existing
``footer.help_*`` naming.

``audit_log_path()`` is wrapped in try/except because a broken
audit module MUST NOT take the footer down — falls back to "—".
Same defensive pattern the license section uses.

Rollback: ``git revert HEAD`` removes the section; the popover
and its layout return to the prior shape with zero coupling to
the audit module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 21:26:53 +00:00
7c9139f199 feat(audit): /logs page — view + download recent audit log files
Adds a Streamlit page at ``/logs`` listing every
``datatools-*.jsonl`` file in ``audit_log_dir()`` (7-day window
per the retention sweep in b3ae913). Each entry shows filename,
mtime, byte size, and a ``st.download_button``. Today's file
gets its own section at the top.

The page also surfaces both paths as copyable monospace text:
the active log path (so users can grep/cat it directly on their
machine) and the folder path (so they can paste into Explorer /
Finder).

Wired into navigation via ``st.Page("pages/_Logs.py", ...)`` with
``url_path="logs"``. The sidebar entry is hidden by the same
``hide_streamlit_chrome`` CSS rule that hides ``/activate`` and
``/close`` — same pattern, same ``:has()`` + plain-fallback
selectors so the LinkContainer collapses cleanly in modern
browsers and the anchor is at least un-clickable in older ones.

License gate is OFF for this page (``gate_license=False``) — if a
user's license expires they may need logs to file a support
request; locking them out of their own audit history would be
hostile.

Next commit will wire the popover link.

Rollback: ``git revert HEAD`` removes the page and its nav entry;
the audit log itself keeps working.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 21:24:46 +00:00
b3ae913bb9 feat(audit): daily filename + 7-day retention sweep
Replaces the per-session ``datatools-<ts>-<sid>.jsonl`` filename
with a single daily file ``datatools-YYYY-MM-DD.jsonl`` (local
date). Sessions on the same calendar day share a file via the
writer thread's per-batch open+append; multiple DataTools
instances running concurrently on the same day fan into the same
file (append-mode small writes are atomic on POSIX, safe-enough on
Windows under realistic load).

Drops the ``_LOG_PATH`` module global and the lock around it —
``audit_log_path()`` is now pure date math, recomputed on every
call so a session that crosses midnight follows the rollover into
the next day's file.

Adds ``_sweep_old_logs()`` invoked once per process at writer-
thread start. Deletes any ``datatools-*.jsonl`` whose mtime is
older than 7 days. The glob deliberately matches the legacy
per-session filename too, so users upgrading from the previous
build don't keep a permanent backlog of pre-retention files.

Event ``ts`` fields stay UTC; only the filename uses local date,
because users go looking for "today's log" on their wall clock.

Tests cover: daily filename shape, sweep removes stale files,
sweep keeps fresh files, sweep also clears legacy filenames.

Rollback: ``git revert HEAD`` restores the per-session filename
and removes the sweep. No data migration needed either way —
existing files keep working as JSONL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 21:22:47 +00:00
ba07dcb6c7 feat(audit): re-enable audit log (kill switch off by default)
Phase 1 diagnostic build validated end-to-end on the user's machine:
session cf2ebbd5 (2026-05-19) produced session/upload/analyze/nav/
session-end events with no blank-pages regression. Root cause of the
original symptom was the audit_log_path/_session_id deadlock fixed in
a8ff8f4 — the kill switch is no longer load-bearing.

Flips ``_DISABLED: True`` → ``False`` so the default install writes a
log. The three env-var overrides (``DATATOOLS_AUDIT_ENABLED``,
``DATATOOLS_AUDIT_TRACE``, ``DATATOOLS_AUDIT_PROBE``) and the writer-
thread BaseException guard from 76c9f5a stay in place as escape
hatches if the symptom ever recurs.

TestKillSwitchContract continues to pass — it monkeypatches
``_DISABLED = True`` explicitly and doesn't rely on the module default.

Rollback: ``git revert HEAD`` flips the switch back without removing
the diagnostic instrumentation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 17:50:28 +00:00
76c9f5a679 feat(audit): diagnostic instrumentation env vars + writer-thread guard
Phase 1 of the audit-log re-enablement plan. Adds three opt-in env
vars that let us ship one instrumented build for the user to run,
without flipping the kill switch on for everybody. **Default
behaviour is byte-identical to today**: with no env vars set the
kill switch wins, no writer thread starts, no file is written, no
stderr line is printed.

Env vars (do NOT set in prod):

- ``DATATOOLS_AUDIT_ENABLED=1`` — bypass ``_DISABLED`` for one
  session. ``_DISABLED = True`` stays in the source so an upgrade
  with no env var is still safe.
- ``DATATOOLS_AUDIT_TRACE=1`` — print ``[audit] ...`` lines to
  stderr at module import, every writer-thread state change, and
  every producer entry point. Lets the user share a small log
  instead of attaching a debugger.
- ``DATATOOLS_AUDIT_PROBE=<value>`` — bisect the producer path
  for Phase 2. Values: ``full`` (default), ``noop``, ``no-events``,
  ``no-page-open``, ``no-session-start``. The named variants
  return early from the corresponding ``log_*`` function so we can
  isolate which call is implicated in the blank-pages symptom.

Also:

- ``_writer_loop`` gets an outer ``try/except BaseException`` so
  silent thread death now surfaces a ``"writer thread died: ..."``
  line in the launcher terminal instead of looking like a hang.
- Existing first-write-failure stderr print gets ``flush=True`` so
  the user actually sees it before the process is killed.
- Test fixture switches from the previous-commit ``_DISABLED = False``
  override to ``_ENABLE_OVERRIDE = True`` so tests exercise the same
  bypass path the diagnostic build uses.
- Two new tests pin the safety contract: with the kill switch on
  and no override, every producer is a true no-op (no writer
  thread, no file). And ``DATATOOLS_AUDIT_PROBE=no-events`` bypasses
  ``log_event`` even when the override is on — guards the bisect.

Rollback: ``git revert HEAD`` removes Phase 1 cleanly. The deadlock
fix from the previous commit stays in place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 14:46:27 +00:00
a8ff8f4bd0 fix(audit): break audit_log_path/_session_id deadlock
Pre-existing latent bug since d9e32e5: ``audit_log_path()`` acquires
the non-reentrant ``_LOCK`` and, while holding it, calls
``_session_id()`` which also takes ``_LOCK``. On a clean module state
(both ``_LOG_PATH`` and ``_SESSION_ID`` unset) the first caller
deadlocks.

``log_session_start`` triggers it in practice — it's the first GUI
call after import and the ``log_file=str(audit_log_path())`` arg is
evaluated before any ``log_event`` has had a chance to lazy-init the
session id. Strong candidate contributor to the blank-pages symptom
the kill switch was put back to mask: the writer thread (and any
producer reaching ``audit_log_path``) would freeze forever, and
Ctrl+C would not free the GIL — matches the launcher-can't-be-killed
behaviour reported in 1caedbb.

Fix: resolve the session id BEFORE acquiring ``_LOCK`` in
``audit_log_path``. ``_session_id`` already double-checks under its
own lock, so the call is safe and self-synchronising.

Test fixture in ``tests/test_audit.py`` now bypasses the kill switch
via ``monkeypatch.setattr(audit, "_DISABLED", False)`` — env vars are
captured at import time and ``monkeypatch.setenv`` won't reach the
module-level flag. With the fix in place, all 6 tests pass in 0.15s;
without it, ``test_session_start_renders`` (and any test exercising
the log_session_start path) hangs indefinitely.

Kill switch behaviour is unchanged in production (`_DISABLED = True`
in the shipped module); this is purely a correctness fix for the
code path that gets exercised when the switch is off.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 14:45:08 +00:00
4451f74895 fix(layout): bump bottom block-container padding 4rem → 7rem
Last lines on long tool pages were still grazing the fixed Help/Close
footer when scrolled all the way down. 4rem gave the cursor of free
space the footer claims but no breathing room — the bottom button
or text was visually flush against the footer's top edge. 7rem buys
~3rem of clear space on every page so the last content row reads
without obstruction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 02:32:13 +00:00
a022059b1e chore: drop accidentally-tracked scratch screenshot 2026-05-19 02:30:01 +00:00
69240fc922 fix(home,close): tool-link preserves file context + drop close-page explanation
(1) ``[Tool] →`` action links inside per-file finding rows now
preserve the file that the card belongs to. Previously the home page
re-set ``home_uploaded_*`` to the FIRST imported file on every rerun
— so when a user with multiple imports clicked
``Clean Text →`` on file_B's findings card, the tool page loaded
file_A. The click handler in ``_render_finding_row_v2`` now looks
the file up in ``home_uploads`` by the findings-card filename and
writes ``home_uploaded_name / size / bytes`` BEFORE
``st.switch_page``, so the tool's ``pickup_or_upload`` reads the
right context.

The filename threads through ``render_findings_panel(..., header=)``
→ ``_render_finding_row_v2(..., filename=)``; ``header`` is already
the filename today, so no call-site change needed.

(2) Close screen "explanation" removed. The long browser-restriction
hint paragraph (``quit.close_hint``: "Browsers don't let JavaScript
close a tab you opened yourself …") is gone from the farewell overlay
— the auto-dismiss path lands the user on about:blank within ~1.5s
of the close click, so the explanation never had a chance to be
useful. ``autoDismiss`` simplified to "try close, else redirect"
without the hint-surface step. The i18n key is retained as a no-op
in case the hint comes back.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 02:29:49 +00:00
9a7d861903 fix(ui): bottom padding + close-screen button removed + sidebar collapse + quiet loguru
Four issues batched together since they all touch the GUI shell:

- ``stMainBlockContainer``'s ``padding-bottom`` bumped from 0.75rem
  → 4rem (~one button-height of free space above the fixed Help/Close
  footer). The last line of content on a page that fills the viewport
  was previously sitting flush against the footer's top border.

- Farewell overlay's "Close this window" button removed per UX
  request. The auto-dismiss path is now the only flow: try
  programmatic close (works in Chrome/Edge ``--app`` windows);
  failing that, surface the hint and redirect the parent window to
  ``about:blank`` after a short timeout. Previously the user had to
  click the button to get the same fallback. The
  ``quit.close_window_button`` i18n key is retained as a no-op for
  now in case the button comes back; nothing references it.

- Sidebar collapse → expand was broken: clicking « collapsed the
  sidebar but the » expand-back affordance was invisible. Two causes
  pulled apart:

   1. ``.dt-brand { flex: 1 }`` was eating the entire
      ``stSidebarHeader`` width, squeezing Streamlit's
      ``stSidebarCollapseButton`` off the right edge. Changed to
      ``margin: 0 auto 0 0`` so the brand keeps its natural width
      and the chevron has room to live next to it.

   2. The "hide Streamlit chrome" toolbar block was listing
      ``stToolbar`` and ``stToolbarActions`` for ``display: none``
      — but the post-collapse re-open button
      (``stExpandSidebarButton``) lives inside ``stToolbar``, so
      hiding the container killed the button too. Dropped both
      container testids from the hide list and kept the per-icon
      rules for ``stMainMenu`` / ``stAppDeployButton`` /
      ``stStatusWidget`` / ``stDecoration``.

- Loguru's stderr sink quieted in GUI mode. ``src/gui/app.py`` now
  runs ``logger.remove()`` + ``logger.add(sys.stderr, level="ERROR",
  …)`` at the top so internal ``logger.debug`` / ``logger.warning``
  breadcrumbs (e.g.
  ``standardize_dataframe: 7/31 cells were unparseable``) no longer
  print to the terminal when the user runs ``python -m src.gui``.
  CLI entry points already do the same configuration per-script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 02:21:41 +00:00
1016a4d2c4 feat(home,sidebar): brand hero + sidebar = footer style + PNG icon
Bundles a handful of UX cleanups:

- Findings-card chevron moved to the LEFT side of the head. CSS still
  rotates it 90° between collapsed/expanded states.

- Tool-link buttons in findings rows (``Clean Text →`` etc.) are now
  left-justified against the icon column with minimal surrounding
  whitespace. Action column ratio dropped from 1.8 → 1.4 and the
  button switched from ``width="stretch"`` (centered text) to
  ``width="content"`` (shrinks to fit, left-aligned within column).

- Home-page hero now mirrors the sidebar brand block: 56px ink "D"
  chip on the left + "UNALOGIX" eyebrow stacked above "DataTools"
  wordmark, then the "Clean. Normalize. Transform." tagline beneath.
  New ``.dt-page-brand / -row / -words / -mark / -eyebrow /
  -wordmark`` rules in ``_DESIGN_TOKENS_CSS``. Streamlit wraps h1
  elements in an emotion-cache div with extra padding; a descendant
  flattener (``.dt-page-brand-words *`` margin:0 / padding:0) keeps
  the eyebrow + wordmark stack the same height as the chip so they
  center-align cleanly.

- Sidebar nav restyled to match the sticky-footer Help/Close buttons
  exactly: 13px / 500 / 1.3 line-height, 5×10px padding, 8px gap
  between icon and label, transparent background. Active item gets
  the same ``rgba(0,0,0,0.04)`` tint as the hover state (no white
  pill, no shadow), only the heavier weight + ink text distinguishes
  it.

- OS app icon (page_icon) switched from SVG to a Pillow-rendered
  ``datatools_icon_256.png`` so Windows / macOS taskbar+dock pick
  it up reliably (some OS shells fall back to a default icon for
  SVG favicons). Rounded-square ink ground with cream "D" centered —
  same mark as the sidebar chip + hero chip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 02:04:53 +00:00
6c3939d21b feat(brand): "Letter D (sans)" app icon — favicon + sidebar chip
Implements ``Business/DataTools/app_icons.html`` §03 "Letter D (sans)"
as the canonical app mark.

- New ``src/gui/assets/datatools_icon.svg`` — 64×64 SVG, 14px corner
  radius, ink ground (#1c1917), cream "D" (#fef4ed) in
  Geist 700 / -0.04em tracking. Pure SVG so it renders sharp at
  every favicon size; font stack falls back through Geist →
  system sans where the webfont isn't installed (favicons can't load
  Google Fonts).

- ``_home.py``, ``_Activate.py``, ``99_Close.py``: page_icon now
  resolves the SVG path via ``Path(__file__).parent / "assets" /
  "datatools_icon.svg"`` instead of the broom 🧹 / 🔑 / 🛑
  emojis. Streamlit inlines it as a ``data:image/svg+xml;base64,...``
  link tag so the browser tab + OS app-icon for ``python -m src.gui``
  matches the sidebar chip.

- Sidebar ``.dt-brand-mark`` tightened to match the spec's "Letter D
  (sans)" rendering: ``font-weight: 700`` and
  ``letter-spacing: -0.04em`` (was 600 / -0.02em). The on-screen
  chip is now a scaled-up copy of the OS icon.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:50:18 +00:00
d436e34a45 feat(brand): rebrand to UNALOGIX DataTools + Clean. Normalize. Transform.
User-facing copy + brand updates landed together:

- Page H1 + browser-tab title: "DataTools — Data Cleaning Mastery"
  → "UNALOGIX DataTools". Same change in es.json (was "DataTools —
  Maestría en limpieza de datos").
- Hero subtitle: long descriptive caption replaced with the tagline
  "Clean. Normalize. Transform." (es: "Limpia. Normaliza.
  Transforma.").
- Sidebar brand block: wordmark is now two lines — UNALOGIX in tiny
  uppercase tracked eyebrow style on top, DataTools in the 15px
  semibold wordmark beneath. The 28px "D" chip stays as the
  recognizable mark. New ``.dt-brand-eyebrow`` rule in
  ``_DESIGN_TOKENS_CSS``.

Top-right Streamlit chrome cleanup — the user reported two stacked
icon buttons. ``.streamlit/config.toml`` bumped to
``toolbarMode = "viewer"`` (most aggressive — suppresses status
indicator + deploy button + running glyph). CSS belt-and-suspenders
hides ``stToolbar``, ``stToolbarActions``, ``stStatusWidget``,
``stDecoration`` for newer Streamlit releases that keep emitting
these with inline styles even under toolbarMode=viewer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:45:38 +00:00
0bb72ecd7e feat(home,sidebar): brand block + collapsible findings + many polish tweaks
Batch of UX tweaks the user asked for in quick succession:

- Sidebar brand block (mockup §brand) — 28px ink chip with a "D"
  wordmark plus the "DataTools" text — injected into
  ``stSidebarHeader`` by a small JS bundled into the iframe-mounted
  script that already runs from ``hide_streamlit_chrome``. The
  Streamlit ``stLogoSpacer`` is hidden when the brand block is
  present so it sits flush at the top of the sidebar.

- Findings cards are now collapsible. Each file's card head carries
  ``data-dt-collapsed="true"`` on first render; clicking the head
  flips the attribute via the new ``_WIRE_COLLAPSIBLE_FINDINGS_JS``
  (MutationObserver re-wires after reruns). A CSS rule
  ``[stElementContainer]:has(.dt-finding-group-head[data-dt-collapsed
  ="true"]) ~ *`` hides every later sibling of the head's element
  container — covers both ``stLayoutWrapper`` (the columns rows in
  this Streamlit release) and ``stElementContainer`` so the rule
  survives future Streamlit layout renames. A chevron icon
  (``chevron_right``) rotates 90° when expanded. The head itself
  gets ``cursor: pointer`` + an accent-fill hover.

- Tool-link buttons in finding rows dropped the leading ``Open`` —
  now read ``Clean Text →``, ``Standardize Formats →`` etc.

- Finding-row column order: action is now LEFT of the description,
  matching user feedback (``[icon] [Tool →] [description + meta]``).

- Head padding bumped to ``16px 22px`` so the filename has visible
  breathing room from the card's left edge (previously the mono
  filename felt like it was bleeding into the rounded corner).

- Head margin-bottom bumped to 1.5rem for breathing room before the
  first finding row when expanded; collapsed state tucks the head
  flush against the card bottom with full ``--r-lg`` corner radius
  and no visible bottom border.

- Files card row layout: ``✕`` button moved to the LEFT of the
  filename (``[✕] [chip + filename] [size]``).

- Sidebar nav rows tightened: link padding 7px → 4px, line-height
  1.25, 1px margin-bottom per li, section-header padding-top reduced.
  Plus a new ``--gap: 0.25rem`` rule for vertical blocks inside
  bordered containers so the Files card and findings card body have
  denser inter-row spacing.

- Sidebar Language selector restyled: widget labels render as the
  spec's "Eyebrow" row (11.5px / 500 / 0.08em uppercase, tertiary
  ink), selectbox combobox gets a paper surface + soft border that
  matches the rest of the sidebar chrome.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:40:22 +00:00
74d0ee270f chore(home): remove "Export report" button
The disabled "Export report" placeholder is gone — it wasn't tied to
a real feature and was just noise in the action bar. Action bar is
back to two buttons (Run analysis · Clear results) on a 1:1:4
column split. ``upload.export_report`` keys removed from en + es
i18n packs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:17:43 +00:00
06f1ea6cf7 fix(buttons,footer): unify disabled state + restyle Help/Close as nav links
(3) Disabled primary buttons no longer read as a "whited-out" dark
slab. Streamlit's primary-button selector
``button[data-testid="stBaseButton-primary"]`` has the same
specificity as our previous ``button:disabled`` selector, so the
primary background + cream text kept winning the cascade tie-break.
The disabled rule's selector list now explicitly matches both the
``kind="primary"``/``kind="secondary"`` shapes AND the
``stBaseButton-primary``/``-secondary`` testids, so disabled
buttons collapse to ``surface-hover`` background, ``ink-tertiary``
label, soft border — same look regardless of starting kind. A
follow-up rule re-asserts ``color: var(--ink-tertiary)`` on every
descendant of the disabled primary so the inner
``stMarkdownContainer > p`` doesn't keep the cream label from the
"all descendants get --bg" primary rule.

(4) The sticky-footer Help + Close buttons now match the sidebar
nav-item look. Old outlined-pill chrome is gone:
``.datatools-footer-btn`` is now display:inline-flex with a
Material-Symbols ligature icon + label, borderless, ``ink-secondary``
text on a transparent surface, ``rgba(0,0,0,0.04)`` hover background.
The Close button keeps a danger tint via ``.close`` so it still reads
as the shut-down action, with a soft ``--danger-fill`` hover. Help
uses the ``help_outline`` icon, Close uses ``power_settings_new``.
Built via a small ``makeFooterBtn`` helper in the iframe JS that
appends the icon span + label text node to the button — keeps the
existing soft-nav click handlers intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:12:03 +00:00
784695e3a7 fix(home,findings): reclaim top whitespace + add padding under finding head
Two visual cleanups:

1. The block-container "claim padding" rule was a no-op — it targets
   the legacy ``stAppViewBlockContainer`` testid; Streamlit renamed
   it to ``stMainBlockContainer`` in the current release. Updated the
   selector list to match both, so the page title now sits close to
   the top edge again (~0.5rem from the hidden header) instead of
   inheriting Streamlit's default ~6rem header reservation.

2. ``.dt-finding-group-head`` margin tightened to ``margin: -1rem
   -1rem 0.75rem``: -1rem on top/sides still bleeds the head to the
   card edges, but +0.75rem on the bottom is breathing room between
   the head's bottom border and the first finding row, which were
   abutting before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:04:42 +00:00
4816da1ad6 fix(home): show file sizes in KB/MB/GB, never raw bytes
Per-row file sizes and the Files-card total-size meta both read as
human-readable units now. Smallest unit is KB even for sub-kilobyte
files (so ``538 B`` → ``0.5 KB``, ``4914 B`` → ``4.8 KB``), steps up
to MB at 1 MiB and GB at 1 GiB. Always one decimal place.

New module-level helper ``_format_size(int) -> str`` in ``_home.py``;
both the section meta (``1 file · 4.8 KB total``) and the per-row
``dt-file-size`` cell call it instead of the previous ad-hoc
``f"{n:,} B"`` formatter. Keeps the display consistent regardless of
file size — and keeps the GUI free of raw byte counts that nobody
needs to read.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:59:56 +00:00