datatools-dev

Author	SHA1	Message	Date
Michael	b703911df3	docs: reflect bundled Tesseract on every install surface - NEW LICENSE_TESSERACT.txt at the repo root: header noting it covers the bundled Tesseract OCR binary (Apache 2.0, upstream tesseract-ocr/tesseract, copyright Google + contributors) and the eng.traineddata from tessdata_best (also Apache 2.0). Clarifies DataTools itself remains proprietary. Full canonical Apache 2.0 license text included. - README.md + README.es.md (Download section): bumped size estimate ~200 MB → ~300 MB, added a short paragraph stating Tesseract OCR is bundled (no separate install required), with a link to the new license file. - docs/USER-GUIDE.md + docs/USER-GUIDE.es.md (§1.6 System requirements): bumped disk estimate, added a paragraph stating Tesseract 5.5 + eng.traineddata ship inside every installer / portable / AppImage, with a source-install fallback hint pointing developers to DEVELOPER.md. - docs/DEVELOPER.md: new "PDF Extractor — bundled Tesseract" section documenting the runtime layout (sys._MEIPASS / tesseract / …), discovery order, source of bytes (build/vendor/tessdata + per- platform fetch in make_release.py), version pin, update recipe. - docs/TECHNICAL.md: new §3.10 "Bundled Tesseract (PDF Extractor OCR)" — short version of the discovery order for the build pipeline section. - build/README.md: distribution-outputs paragraph now lists Tesseract among bundled deps with the ~250-300 MB estimate; new "Tesseract bundling" section: layout diagram, resolver order, source of bytes + 5.5.0 pin, update steps, license-file ref. Out-of-scope gaps noted by the docs sweep: - docs/FUTURE-TOOLS.md §D still describes Tesseract bundling as a high-risk packaging headache; now superseded. Worth a one-line "(resolved — bundled as of v1.x)" callout in a future pass. - USER-GUIDE §2 "What's included" table doesn't list PDF Extractor at all (it shipped in b8aff86…967d3f6). Separate gap to close. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 18:20:50 +00:00
Michael	93ccada974	build: bundle Tesseract 5.5.0 + tessdata into every release artifact End users no longer have to install Tesseract separately for OCR on scanned PDFs — the engine ships inside the installer, portable .zip, and AppImage for all three platforms. Per-platform fetch in build/make_release.py (run before PyInstaller): - Windows: download UB-Mannheim installer 5.5.0.20241111, extract with 7-Zip, copy tesseract.exe + required DLLs into the staging dir. - macOS: ``brew install tesseract``, copy binary + every Homebrew- prefixed dylib resolved via otool -L (recurse one level for transitive deps), then install_name_tool rewrites IDs / load paths to @loader_path/... so the bundle is relocatable. - Linux: ``apt-get install tesseract-ocr libtesseract5``, copy binary + every non-system .so from ldd output, patchelf --set-rpath '$ORIGIN'. Wire-up: - build/datatools.spec reads DATATOOLS_TESS_STAGING env var (set by make_release) and adds the staging dir + tessdata + the LICENSE_TESSERACT.txt Apache 2.0 attribution to PyInstaller datas so they land at <bundle>/tesseract/{tesseract[.exe],tessdata/} and the license sits at the bundle root. Soft-warns when staging is empty so dev spec runs still complete. - English tessdata pulled by fetch_tessdata() from tesseract-ocr/tessdata_best (eng.traineddata, ~16 MB). Cached at build/vendor/tessdata/. - .github/workflows/build.yml: actions/cache@v4 step keyed on ``tesseract-${runner.os}-5.5.0-tessdata_best-v1`` caches the staging dir and the vendored tessdata across runs; apt installs patchelf on the Linux runner; PyInstaller step now receives the DATATOOLS_TESS_STAGING env var. - .gitignore: build/_tesseract/ and the .traineddata blob. - TESSERACT_SKIP_FETCH=1 honored for offline / manual stages. - Installer / .dmg / .zip / AppImage scripts: one-line comments confirming Tesseract rides along automatically via PyInstaller's datas (no extra packaging steps required in those scripts). 
Bundle-size delta: ~50-70 MB on disk per platform, ~25-40 MB post- compression. Net installer size ~250-300 MB (was ~120 MB) — accepted tradeoff for zero end-user OCR setup. Reversal of the prior "don't bundle Tesseract" decision (option A). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 18:20:33 +00:00
Michael	17faf84aed	feat(pdf): probe bundled Tesseract first when running frozen Adds runtime support for the bundled Tesseract that ships inside the DataTools installer / portable / AppImage artifacts. When DataTools is launched from a PyInstaller frozen bundle the OCR engine now resolves automatically, with no end-user install required. New helpers in src/pdf_extract.py: - _bundled_tesseract_path() → Path | None: returns <sys._MEIPASS>/tesseract/tesseract[.exe] when getattr(sys, "frozen", False) is true AND sys._MEIPASS is present; None in dev. - _bundled_tessdata_dir() → Path | None: same gating, returns <sys._MEIPASS>/tesseract/tessdata. - _apply_bundled_tessdata_prefix(): sets TESSDATA_PREFIX to the bundled tessdata dir before any pytesseract call; only if frozen, dir exists, and the user hasn't already overridden the env var. Discovery order in ocr_available() / _autodetect_tesseract_path(): 1. DATATOOLS_TESSERACT_PATH env override (existing) 2. Bundled binary (NEW, frozen-only) 3. System PATH (existing) 4. Windows well-known install dirs (existing legacy fallback) In dev (not frozen) every new probe is a no-op so the developer experience is unchanged. 12 new tests cover frozen vs. non-frozen detection on each platform, the user-override respect for TESSDATA_PREFIX, autodetect priority ordering, and the no-bundled-dir graceful path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 18:19:52 +00:00
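The discovery order in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code in src/pdf_extract.py: the function names are hypothetical, the override is assumed to be returned without an existence check, and step 4 (Windows well-known install dirs) is left out because the commit message does not list the paths.

```python
import os
import shutil
import sys
from pathlib import Path
from typing import Optional


def bundled_tesseract_path() -> Optional[Path]:
    """Return <sys._MEIPASS>/tesseract/tesseract[.exe] when frozen, else None."""
    meipass = getattr(sys, "_MEIPASS", None)
    if not getattr(sys, "frozen", False) or not meipass:
        return None  # dev run: the probe is a no-op
    exe = "tesseract.exe" if os.name == "nt" else "tesseract"
    return Path(meipass) / "tesseract" / exe


def resolve_tesseract() -> Optional[str]:
    """Walk the first three discovery steps in priority order."""
    # 1. Explicit env override (assumption: trusted as-is).
    override = os.environ.get("DATATOOLS_TESSERACT_PATH")
    if override:
        return override
    # 2. Bundled binary, only when running from a frozen bundle.
    bundled = bundled_tesseract_path()
    if bundled is not None and bundled.is_file():
        return str(bundled)
    # 3. System PATH.
    return shutil.which("tesseract")
```

Keeping the override first means a user can still point the app at a different Tesseract build even inside a frozen bundle.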
Michael	4d8513b1a3	docs: cover help popover, +/- nav indicators, render_tool_header User-facing docs (USER-GUIDE en+es, README en+es): - New short paragraph under §3.1 GUI noting the in-tool Help button on every detail page, what it contains (When to use / Steps / Examples / Tip), and that content lives in tools.<id>.help_md. - One-line note in the README tool tables pointing at the same. - Mention the sidebar +/- nav indicators replacing Streamlit's default Material Symbols chevron. Developer docs: - DEVELOPER: new "Tool page header" subsection documenting render_tool_header(tool_id), the help_md markdown skeleton, and the fallback to help.missing_body when a tool's help is absent. Update i18n authoring rules to list help.* keys and the per-tool help_md field alongside name/description/page_title/page_caption. - TECHNICAL: new §10c documenting the sidebar nav indicator swap — CSS in _HIDE_CHROME_CSS plus _SWAP_NAV_SECTION_INDICATOR_JS injected through the hide_streamlit_chrome() iframe bundle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 18:08:01 +00:00
Michael	ac94208d8f	chore: production-readiness sweep on the help-popover wave - Drop unused 'from src.i18n import t' from pages 1-9 (the swap to render_tool_header(tool_id) means no page calls t() directly anymore). Pages 10, 11 and the underscore-prefixed pages were already clean or legitimately use t(). - Rewrite PDF Extractor help_md (en + es). The original prose described features the tool does NOT have — template drawing, per-source saved templates, automatic reuse. The actual tool is a heuristic batch scanner (per its own docstring: "No templates, no per-bank configuration"). New copy: scan → uncheck → pick date format → enable OCR if needed → download. Spanish version tagged with '<!-- TODO: review Spanish -->' since the prose is best-effort. - Document why both stSidebarNavSectionHeader (legacy, streamlit~=1.35) and stNavSectionHeader (current, 1.57) testids appear in the chrome CSS — requirements floor is streamlit>=1.35,<2 so dropping the legacy selector would silently break the lower bound. - Pin the t()-returns-key-on-miss contract that render_tool_header's fallback path depends on, with a comment at the call site. - Pin the demo's intentional skip of hide_streamlit_chrome (so the +/- sidebar swap JS doesn't ever try to load there) with a load- bearing comment in app_demo.py. - Confirmed i18n parity: every tool id has page_title / page_caption / description / name / help_md in BOTH packs; help.button_label and help.missing_body in both. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 18:07:33 +00:00
Michael	4955fb239b	test: cover help_md keys, header smoke, and bilingual ES smoke Two stale Spanish smoke assertions still expected English page titles for PDF Extractor and Reconciler — the i18n work landed real translations ("PDF a CSV", "Reconciliar dos archivos"), so refresh the expected substrings and the surrounding comment. Add new coverage for the help-popover feature: - TestHelpPopoverKeys (test_lang_packs): every tool_id resolves a non-empty tools.<id>.help_md in BOTH packs; help.button_label and help.missing_body resolve in both. - TestDescriptionCopy (test_tools_registry): every Tool.description non-empty and under 120 chars — pins the post-jargon-scrub copy so future drift back into multi-clause prose is loud. - TestRenderToolHeaderSmoke: render_tool_header is callable, listed in components.__all__, and every i18n key it touches resolves in both packs. Runs without a Streamlit script context. Suite: 2427 passed (+9 new), 91 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 18:07:19 +00:00
Michael	4a8961d58a	fix(gui): keep tool-page Help button on one line at narrow widths When the viewport shrank, the help popover button in the title row was wrapping its label vertically (``[icon]`` over ``Help``) because the button was set to use_container_width=True and the column it sat in collapsed below the button's natural width. Three-part fix: - Set use_container_width=False on the popover so the button sizes to content (icon + label) instead of stretching to the column. - Widen the column ratio from [10, 1] to [8, 2] so there's room for the button without forcing the title text to truncate. - Add CSS pinning ``white-space: nowrap`` on every popover button (and its inner div / p) as defense-in-depth: even if the button does get squeezed, the label can't wrap. ``min-width: max-content`` keeps the button from compressing below its content. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 17:54:41 +00:00
Michael	fe4b5dc755	fix(sidebar): correct testid + JS swap so +/− actually renders The prior attempt used data-testid=stSidebarNavSectionHeader, which is not what Streamlit 1.57 emits — the correct testid is stNavSectionHeader (verified against the bundled JS in streamlit/static/static/js/). The section header is also a <div> with onClick, not a <button>, and the React component keeps the expanded state in a prop without surfacing aria-expanded on the DOM. Pure CSS can therefore neither locate the header nor switch the glyph by state, which is why the chevron was unchanged in the rendered UI. Switch strategies: - CSS now targets the correct stNavSectionHeader / stIconMaterial selectors, drops the Material Symbols font from the icon span, and restyles it so a plain ascii character reads as proper typography (size, weight, color, hover). - Add _SWAP_NAV_SECTION_INDICATOR_JS — small inline script that rewrites the icon's text node from "expand_more"/"expand_less" to "+"/"−" (U+2212), throttled via requestAnimationFrame, re-applied on every DOM mutation by a MutationObserver. Bundled into the same iframe injection as the existing brand/upload/findings scripts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 17:52:47 +00:00
Michael	209b5fb1aa	style(sidebar): swap expand chevrons for +/− indicators on nav sections Streamlit's default sidebar section header uses a Material Symbols expand_more chevron — three different icons (chevron down, chevron up, sometimes a plain triangle) depending on version, all of which felt inconsistent with the rest of the chrome. Hide the built-in icon (svg / material-symbols span — covered with multiple selectors for cross-version durability) and render our own glyph as a right-aligned pseudo-element on the section-header button, keyed off the standard ARIA aria-expanded attribute: - collapsed → "+" - expanded → "−" (U+2212, visually balanced with +) Hover deepens the indicator color to match the surrounding nav-link hover treatment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 17:23:49 +00:00
Michael	904356f4e8	feat(gui): inline Help popover next to every tool's title Adds a contextual Help button on each detail page, right of the title. Clicking it opens a Streamlit popover with a one-shot how-to: when to use, numbered steps, before→after examples, and an optional one-line tip. Designed to be scannable — no paragraph prose. Implementation: - New ``render_tool_header(tool_id)`` helper in components replaces the bare ``st.title(...) + st.caption(...)`` block on each of the 11 tool pages. Title in the wide column, popover in a narrow right column; caption sits on its own line beneath. - Help content is one markdown blob per tool stored in i18n under ``tools.<id>.help_md`` (en + es). Editors can tweak copy without touching Python. - ``help.button_label`` and ``help.missing_body`` keys added to both packs for the popover trigger and the empty-tool fallback. All 11 tool pages now use the same header pattern — including the PDF Extractor and Reconciler which previously had hardcoded title/ caption pairs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 17:21:55 +00:00
Michael	7203a81af7	copy: strip jargon from tool descriptions and captions Prior round only touched page_caption; the description field (shown on home grid cards) still said "imputation", "missingness", "winsorization", "schema coercion", "fuzzy matching with normalization", etc. The audience is non-technical buyers — they shouldn't need a stats or DB-admin vocabulary to read a tool card. Rewrite both description and page_caption across en, es, and the tools_registry (the fallback source of truth) using everyday words: blanks instead of nulls, fill in instead of impute, look wrong instead of statistical outliers, etc. Same one-line shape as before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 17:09:52 +00:00
Michael	dd3b9bd59d	copy: tighten tool-page captions to one plain-English line Each tool's page caption is what tells a user what the tool actually does the moment they land. They were inconsistent — some terse, most multi-clause with a redundant "Runs locally — your data never leaves this computer" trailer that's already a privacy pill on Home. Rewrite every caption (en + es) as a single ~60-80 char action-first line. Replaces the hardcoded multi-line Reconciler caption with the same shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 14:34:34 +00:00
Michael	2bd94c4441	docs: document installer + portable downloads in en/es Repo READMEs now show both download flavors side-by-side with first-launch warnings (SmartScreen, Gatekeeper) and link to the deeper walkthrough. USER-GUIDE §1 rewritten from a 9-line stub into six subsections: - §1.1 Windows: installer (5 steps) + portable (4 steps) - §1.2 macOS: DMG (5 steps incl. right-click-Open) + portable - §1.3 Linux: AppImage flow (unchanged) - §1.4 First-launch: port selection, localhost binding, browser open - §1.5 How the GUI works - §1.6 System requirements §6 Troubleshooting picks up portable-specific items: Safari unzip quirks, antivirus quarantine on Win portable, license file location. docs/README and Spanish mirrors updated to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 19:30:28 +00:00
Michael	9c426194b1	build: add single-command release script + portable zip artifacts One-developer workflow: ``python build/make_release.py`` on each target OS produces both the installer and a portable .zip for that platform. Preflight checks PyInstaller / Pillow / iscc / hdiutil / ditto / appimagetool and bails with install hints if anything is missing — no half-built dist/. New scripts: - build/make_release.py — orchestrator, auto-detects host OS. - build/generate_icons.py — icon.ico / icon.icns / icon.png from src/gui/assets/datatools_icon_256.png (Pillow ships ICO + ICNS writers; no platform tooling needed). - build/build_portable_zip.py — Win/Linux portable zip via stdlib. - build/macos/build_zip.sh — Mac portable .app via ditto so bundle metadata survives. installer.iss now adds: Quick Launch task (opt-in, legacy Win 7), App Paths registry entry (Win+R "DataTools" works), SetupIconFile, UninstallDisplayIcon, AppSupportURL, AppUpdatesURL. CI workflow uploads installer + portable per platform and attaches both to GitHub Releases on tag push. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 19:30:17 +00:00
Michael	6627895a10	test: fix v3 branding drift, add reconcile CLI + registry coverage GUI/lang-pack tests were asserting against pre-v3 strings ("Data Cleaning Mastery", "Maestría en limpieza…") that the brand refresh replaced with "UNALOGIX DataTools" + "Clean. Normalize. Transform." Updated assertions to the current copy and switched the findings panel tests to the redesigned flat-list layout (per-finding "Open Tool →" buttons instead of per-tool expanders). New coverage: - tests/test_cli_reconcile.py (13) — preview/apply, tolerance flags, sign inversion, key flags, error paths, Excel input. - tests/test_tools_registry.py (27) — unique tool_ids, page_slug → real file, valid sections/tiers, localized accessor fallbacks, explicit pins for PDF Extractor + Reconciler entries. - tests/test_reconcile.py — one-side-empty, key-pass tagging, additional validation cases, input-DataFrame immutability. - tests/gui/test_smoke.py — PAGE_SLUGS now includes 10_PDF_Extractor and 11_Reconciler in both en/es. - tests/gui/test_workflows.py — TestPdfExtractorWorkflow and TestReconcilerWorkflow render checks. Net: 2317 passed → 2418 passed, 0 failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 19:30:02 +00:00
Michael	ea99e292d2	feat(nav): group Home + Reconcile under a new "Analysis" section Home now appears in the sidebar as "File Analysis" under a labeled "Analysis" section together with Reconcile Two Files — both pages are data-analysis workflows (importing/profiling files vs. matching across files), so grouping them clarifies the sidebar's mental model. - tools_registry: new ``analysis`` Section; reconcile moves out of automations into it. - i18n: ``nav.section_analysis`` + ``nav.file_analysis_title`` added to en.json and es.json. - app.py: home dropped from the unlabeled section and surfaced at the top of the Analysis group; ``default=True`` preserved so first-visit routing is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 23:11:06 +00:00
Michael	0be59c0f03	fix(gui): shrink white-bar compensation to ~1/4 of original gap Plain ``min-height: 100vh`` left a ~15vh white bar below ``.stApp`` (the zoom: 0.85 scaler shrinks visual height to 85%). Reinstate the stretching but stop short of the full ``100vh / 0.85`` overflow: ``calc(96vh / 0.85)`` fills 96vh visually and leaves a ~4vh bar — a quarter the size, no longer dominating the page. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 23:06:32 +00:00
Michael	3a3a9a895b	fix(gui): stop overstretching pages, restore footer clearance Two layout bugs were hiding the bottom of every tool page behind the sticky footer: 1. ``.stApp`` and the main/sidebar containers were forced to ``min-height: calc(100vh / 0.85)``, ≈ 17.6% taller than the viewport, to mask a white bar caused by the ``zoom: 0.85`` scaler. That hack stretches short pages and pushes long-page content past the visible area. Drop the calc factor — plain ``100vh`` fills the visible viewport without forced overflow. 2. ``render_sticky_footer``'s stylesheet re-set the block container's ``padding-bottom`` to ``2rem``, overriding the ``7rem`` reserved by ``hide_streamlit_chrome``. The footer (~40px tall) needs more than 32px of clearance, so the last row of content was sliding behind the footer. Remove the override and let chrome's reservation stand. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 23:03:52 +00:00
Michael	d090f8cb5e	feat(reconcile): auto-detect role columns, preview result tabs Match-settings selectors now reorder per side to match the file's column order, using name heuristics (amount / date / desc) so a typical bank CSV reads Date → Description → Amount → Reference without manual fiddling. Detected columns also pre-fill as the default selection. Result tabs render at most 25 rows with a "preview of N of M" caption; full data is still available via the existing download buttons. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 22:39:47 +00:00
Michael	e44af3a45e	feat(reconcile): two-source reconciliation tool Bank-feed-vs-ledger style matcher: 4-pass greedy assignment (key → exact → tolerance → fuzzy) with ambiguous candidates routed to a review bucket instead of arbitrary picks. CLI mirrors the cli_text_clean preview/--apply pattern; Streamlit page registered in the automations section. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 22:33:14 +00:00
Michael	450d4fc9a8	feat(pdf): default output date format to YYYY-MM-DD User asked to flip the default from YYYYMMDD to YYYY-MM-DD. ISO is the better default for an accountant CSV workflow: - Lexicographic sort = chronological sort (no parsing needed). - Every spreadsheet tool the user might import into recognises it as a real date with no ambiguity (US vs EU readers can't disagree on the order). - Hyphens make the year/month/day boundaries scan-able by eye. Concrete changes: - New module constant ``DEFAULT_DATE_FORMAT = "%Y-%m-%d"``, used as the default for ``format_date()`` and the ``output_date_format`` keyword on ``scan_pdf_for_transactions``. - Page's ``_DATE_FORMAT_CHOICES`` reordered so the ISO entry is first (index 0 = default Streamlit selection); YYYYMMDD drops to second. - Custom-strftime input default also flips to ``%Y-%m-%d``. Tests updated to reflect the new default (``test_dates_formatted_iso_by_default``, ``test_short_dates_get_year_from_period``, ``test_compact_format_round_trip``, plus a new ``test_default_is_iso`` for the format_date helper). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 02:04:34 +00:00
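The sorting argument for the ISO default can be checked in a few lines. This is a sketch only: ``DEFAULT_DATE_FORMAT`` is taken from the commit message, while the ``format_date`` signature shown here is an assumption.

```python
from datetime import date

DEFAULT_DATE_FORMAT = "%Y-%m-%d"


def format_date(value: date, fmt: str = DEFAULT_DATE_FORMAT) -> str:
    """Format a date; the (value, fmt) signature is an assumption."""
    return value.strftime(fmt)


dates = [date(2025, 1, 13), date(2024, 12, 30), date(2025, 1, 5)]
iso_strings = [format_date(d) for d in dates]

# Sorting the ISO strings as plain text gives the chronological order.
sorted_by_text = sorted(iso_strings)
sorted_by_date = [format_date(d) for d in sorted(dates)]
```

The compact ``%Y%m%d`` form sorts the same way; the hyphens mainly help readability and spreadsheet date detection.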
Michael	a0042d4aba	feat(pdf): Dec/Jan-aware year inference + filename hint + override Previous year inference picked ``period_end_iso[:4]`` for every short date, which fails on statements that cross the Dec/Jan boundary. A "12/30" row in a 2024-12-16 to 2025-01-15 statement got 2025-12-30 (wrong) instead of 2024-12-30. New cascade for ``_infer_year_for_short_date``: 1. ``override_year``: caller supplies it (new ``"Override year for short dates"`` field in Scan options). Beats every heuristic. Empty by default; the page validates the value is a 4-digit-looking integer in 1900-2100 and falls back to automatic on garbage input. 2. Statement period start + end: the function now takes BOTH dates and generates candidates with every distinct year in the period (one year for same-year statements, two for Dec/Jan boundaries). The picker scores each candidate by distance from the period: candidates inside the period score 0, candidates outside score ``min(|days from start|, |days from end|)``. Lowest-distance candidate wins. So: - ``12/30`` + period 2024-12-16 to 2025-01-15 → 2024-12-30 (inside period, score 0) - ``01/05`` + same period → 2025-01-05 (inside, score 0) - ``12/15`` + same period → 2024-12-15 (1 day before the period start, closer than 2025-12-15, which is 11 months after the period end) 3. ``filename_year_hint``: fallback when the statement period regex misses the bank's specific layout. The page passes ``year_from_filename(upload.name)`` automatically so files like ``eStmt_2025-01-13.pdf`` get year 2025 even if the PDF's text doesn't yield a parseable period. The regex matches the first ``20XX`` token bounded by non-digits. Both new helpers (``year_from_filename`` and the new ``_try_short_date_with_year`` factor-out) are exported and tested. 
16 new tests cover: within-period inference (same-year sanity), Dec/Jan boundary cases for both sides, the just-before-period closer-distance case, override priority, filename fallback, no-signal None, dash-format / month-name shorthand round-trip, garbage input, filename year extraction (eStmt pattern, embedded, first-match-wins, no-match, empty). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:59:30 +00:00
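The cascade described in this commit can be sketched as below. It is an illustration under stated assumptions, not the project's ``_infer_year_for_short_date``: the ``infer_year`` name and signature are hypothetical, the regex is one plausible reading of "first ``20XX`` token bounded by non-digits" (string edges are also accepted), and ties go to the earlier year.

```python
import re
from datetime import date
from typing import Optional


def year_from_filename(name: str) -> Optional[int]:
    """First 20XX token not adjacent to another digit, else None."""
    match = re.search(r"(?<!\d)(20\d{2})(?!\d)", name)
    return int(match.group(1)) if match else None


def infer_year(
    month: int,
    day: int,
    period_start: Optional[date] = None,
    period_end: Optional[date] = None,
    override_year: Optional[int] = None,
    filename_year_hint: Optional[int] = None,
) -> Optional[int]:
    # 1. An explicit override beats every heuristic.
    if override_year is not None:
        return override_year
    # 2. Score one candidate per distinct year in the statement period.
    if period_start is not None and period_end is not None:
        best = None  # (score, year)
        for year in range(period_start.year, period_end.year + 1):
            try:
                candidate = date(year, month, day)
            except ValueError:  # e.g. 02/29 in a non-leap year
                continue
            if period_start <= candidate <= period_end:
                score = 0
            else:
                score = min(
                    abs((candidate - period_start).days),
                    abs((candidate - period_end).days),
                )
            if best is None or score < best[0]:
                best = (score, year)
        if best is not None:
            return best[1]
    # 3. Fall back to the filename hint; None when there is no signal.
    return filename_year_hint
```

With the period 2024-12-16 to 2025-01-15, ``infer_year(12, 30, start, end)`` picks 2024 and ``infer_year(1, 5, start, end)`` picks 2025, matching the examples in the commit message.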
Michael	a18b126885	fix(pdf): stamp scan timestamp once; restores Saved-to-path banner After swapping to ``html_download_button`` the user noticed the "✓ Saved to <path>" + 📂 Open Downloads folder pair never appeared. The helper itself is fine — every other tool shows those affordances correctly. Bug was specific to the PDF page. The download button's file_name was being computed with a fresh ``datetime.now().strftime(...)`` on every render. The helper builds its session-state keys from ``f"_dl_btn_{file_name}_{digest}"`` so the keys silently drift every second. After the click and rerun, the helper looks up the saved_key for the NEW file_name, finds nothing in session_state (the click had written to the OLD key), and skips the success banner. Fix: stamp the timestamp once when scan completes, store it in ``K_TIMESTAMP``, and reuse it for the download filename. The filename stays stable across reruns, so the helper's keys are stable, so the saved-path banner renders correctly on the post- click rerun. Also clear ``K_TIMESTAMP`` on Clear-all-files so a new scan gets a fresh stamp. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:50:22 +00:00
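The key drift behind this bug is easy to reproduce in isolation. The sketch below is hypothetical throughout: a plain dict stands in for Streamlit's session state, ``dl_key`` mirrors the ``f"_dl_btn_{file_name}_{digest}"`` shape quoted above with an assumed SHA-256 digest, and the filename pattern is invented for illustration.

```python
import hashlib
from datetime import datetime, timedelta


def dl_key(file_name: str, data: bytes) -> str:
    """Session-state key shaped like the helper's; digest choice is assumed."""
    digest = hashlib.sha256(data).hexdigest()[:8]
    return f"_dl_btn_{file_name}_{digest}"


def filename_for(stamp: datetime) -> str:
    """Hypothetical download filename built from a timestamp."""
    return f"transactions_{stamp:%Y%m%d_%H%M%S}.csv"


payload = b"date,amount\n"
click_time = datetime(2025, 5, 20, 1, 50, 0)
rerun_time = click_time + timedelta(seconds=1)

# Buggy pattern: a fresh timestamp per render gives a different key on rerun.
key_at_click = dl_key(filename_for(click_time), payload)
key_at_rerun = dl_key(filename_for(rerun_time), payload)

# Fixed pattern: stamp once when the scan completes, then reuse it.
state = {"K_TIMESTAMP": click_time}
stable_key_1 = dl_key(filename_for(state["K_TIMESTAMP"]), payload)
stable_key_2 = dl_key(filename_for(state["K_TIMESTAMP"]), payload)
```

Because the stored stamp never changes between the click and the rerun, the lookup after the rerun finds the entry the click wrote.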
Michael	981a1a9cba	fix(downloads): OneDrive-aware Downloads path + PDF uses html_download_button User reported downloads "do nothing on click" in tool pages and "acts like it downloads but no file in the folder" in the PDF tool. Two root causes, two fixes. Root cause #1 — wrong Downloads folder on Windows. ``_downloads_dir()`` returned ``Path.home() / "Downloads"`` unconditionally. On Windows machines with OneDrive enabled (very common for business users), the real Downloads folder is redirected to ``C:\Users\<u>\OneDrive\Downloads``. Our helper would write to ``C:\Users\<u>\Downloads`` instead — a folder that may not even exist until ``mkdir`` creates it — and the user, naturally opening their actual OneDrive Downloads, sees no file and concludes nothing happened. Now: on Windows, ``_downloads_dir`` queries the registry key ``Software\Microsoft\Windows\CurrentVersion\Explorer\User Shell Folders`` for FOLDERID_Downloads (GUID ``{374DE290-123F-4565-9164-39C4925E467B}``). This entry returns the redirected path when OneDrive is active, the original ``%USERPROFILE%\Downloads`` otherwise — exactly what the user's File Explorer reads. ``%USERPROFILE%`` expansion is applied via ``os.path.expandvars``. Any registry hiccup falls through to ``Path.home() / "Downloads"`` so the helper never raises. The sanity check (path exists OR parent exists) catches the edge case where the registry points into a deleted OneDrive mount. Root cause #2 — PDF page used st.download_button. Every other tool uses the project's ``html_download_button`` helper (which is ``local_download_button`` under the hood — the rename happened in `b9147f3`). ``st.download_button`` has a long-standing bug where the second-or-later instance in a script pass silently fails to fire. The PDF tool predated the rewrite that switched everyone over and was still using the broken native widget. ``_Logs.py`` had the same problem in two places. Swapped all three call sites to ``html_download_button``. 
They now save to ``~/Downloads/<filename>`` (correctly resolved per fix #1) and show the saved path + "Open Downloads folder" button below the click, matching every other tool in the suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:45:51 +00:00
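A registry-aware Downloads lookup along the lines described for root cause #1 might look like this. It is a sketch, not the project's ``_downloads_dir``: reading from ``HKEY_CURRENT_USER`` is an assumption (the commit names only the subkey), and the broad ``except`` mirrors the stated goal that the helper never raises.

```python
import os
import sys
from pathlib import Path

_DOWNLOADS_GUID = "{374DE290-123F-4565-9164-39C4925E467B}"
_SHELL_FOLDERS_KEY = (
    r"Software\Microsoft\Windows\CurrentVersion\Explorer\User Shell Folders"
)


def downloads_dir() -> Path:
    """Resolve the user's Downloads folder, honoring OneDrive redirection."""
    fallback = Path.home() / "Downloads"
    if sys.platform != "win32":
        return fallback
    try:
        import winreg  # Windows-only stdlib module

        # Assumption: the per-user hive holds the redirected path.
        with winreg.OpenKey(winreg.HKEY_CURRENT_USER, _SHELL_FOLDERS_KEY) as key:
            raw, _value_type = winreg.QueryValueEx(key, _DOWNLOADS_GUID)
        candidate = Path(os.path.expandvars(raw))  # expands %USERPROFILE%
    except Exception:
        return fallback  # any registry hiccup falls through
    # Sanity check: reject paths pointing into a deleted OneDrive mount.
    if candidate.exists() or candidate.parent.exists():
        return candidate
    return fallback
```

On macOS and Linux the function short-circuits to ``~/Downloads``, so the registry code is never imported there.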
Michael	dbcf4d4048	feat(pdf): adopt Home-page Files-card layout User wants the PDF page's upload UX to match the Home page exactly: Files section header + bordered card containing the file rows AND the "Add more files" button at the bottom, no visible Streamlit file_uploader competing for attention. Layout changes mirroring ``src/gui/_home.py``: - ``st.file_uploader`` is positioned off-screen via CSS (``position:absolute;left:-10000px;…``). The underlying ``<input type=file>`` stays reachable to JS so the in-card "Add more files" button can programmatically click it. - ``<h2>Files</h2>`` section header with ``N files · X.X MB total`` meta on the right, identical markup (``dt-files-section-head``). - Single ``st.container(border=True)`` hosts every file row (``✕ | 📄 filename | size``, using ``dt-file-row`` / ``dt-file-icon-chip`` / ``dt-file-name`` / ``dt-file-size`` classes) AND the "Add more files" button (``dt-file-add``) at the bottom. All classes are already defined globally in ``_legacy.py`` so no new CSS. - The Add button click is wired to the off-screen uploader's ``stFileUploaderDropzoneInput`` via a 30-line iframe script, identical to the Home page's pattern. A ``MutationObserver`` re-wires after Streamlit reruns when the button gets re-mounted. Action buttons (Scan + Clear all) sit BELOW the Files card, side-by-side in a ``[1, 1, 4]`` column split with ``use_container_width=True`` so they fill their cells cleanly without stretching across the whole row. Both buttons are disabled when no files are uploaded; the empty Files card is its own affordance for the empty state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:34:31 +00:00
Michael	34b56b404a	fix(pdf): drop statement_period_start/end columns from output User asked to remove them — the two columns repeated the same value on every row from a given statement, took up screen space in the editor, and offered limited value once the date column already carries the inferred full date. What's kept: - ``account_number`` — still stamped onto every row so multi- statement CSVs are self-attributing - ``extract_statement_metadata`` — still runs every scan because ``period_end`` is the source of the year inference that binds Chase-style short ``01/13`` dates to ``20250113`` - ``_extract_statement_period`` and its tests — period detection itself isn't going anywhere, just its appearance in the output rows What's removed: - ``record["statement_period_start"]`` / ``record["statement_period_end"]`` assignments in ``scan_pdf_for_transactions`` - The two columns from the page's column-ordering setup - Tests pinning their presence; replaced with assertions that they're explicitly absent Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:28:32 +00:00
Michael	ad7c22d7fb	fix(pdf): consistent 2-decimal amount precision in display and CSV User reported amounts losing trailing zeros (4.50 rendering as 4.5, 1000.00 as 1000) on the same statement. Classic float display issue: Python's float ``repr`` keeps only the shortest form (``4.50`` prints as ``4.5``), and pandas / Streamlit happily show that inconsistency cell-by-cell. Two layers of fix, internal type stays ``float`` for arithmetic: Display. ``st.column_config.NumberColumn(format="%.2f")`` applied programmatically to every ``amount_`` column on the data_editor. Every numeric amount now shows with exactly two decimal places regardless of trailing zeros. CSV export. Pandas' default float-to-CSV writer also drops trailing zeros (the same issue an accountant would see when opening the file in Excel). Before serialising, each amount column is mapped through the new ``format_amount`` helper, which returns ``f"{v:.2f}"`` for numerics, empty string for None/NaN/inf, ``str(value)`` for booleans (guards the ``True → "1.00"`` foot-gun since ``bool`` is an ``int`` subclass), and passes through any string the scanner kept because parsing failed (e.g. ``(4.50)`` when parens-negative is off; the user can correct it in the editor before re-exporting). ``format_amount`` lives in ``src/pdf_extract.py`` so it's testable in isolation (the page module can't easily be unit tested because of its Streamlit import chain). 8 new tests cover the trailing-zeros case, negatives, None/empty, string-passthrough, bool guard, NaN/inf, and the ``places`` parameter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:27:16 +00:00
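A helper with the behavior listed in this commit could be written as follows. This is a sketch reconstructed from the commit message alone; the real ``format_amount`` in ``src/pdf_extract.py`` may differ in details such as how non-numeric, non-string values are handled.

```python
import math
from typing import Any


def format_amount(value: Any, places: int = 2) -> str:
    """Render an amount cell as text with a fixed number of decimals."""
    if value is None:
        return ""
    # bool is an int subclass, so check it before the numeric branch
    # to avoid rendering True as "1.00".
    if isinstance(value, bool):
        return str(value)
    if isinstance(value, (int, float)):
        if isinstance(value, float) and not math.isfinite(value):
            return ""  # NaN and +/-inf become empty cells
        return f"{value:.{places}f}"
    # Strings the scanner could not parse pass through unchanged.
    return str(value)
```

For example, ``format_amount(4.5)`` gives ``"4.50"`` and ``format_amount("(4.50)")`` returns the string untouched.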
Michael	6f2ad57490	fix(pdf): require non-empty description; tighten multi-line merge User reported "Daily Ledger Balances" entries leaking into output. Three correlated bugs in the row qualifier: 1. Empty description is now disqualifying. A row like ``01/13/2025 $1,000.00`` has a date and an amount but no text between them — that's a daily-balance entry, a period-summary, or page furniture. Drop these. New filter sits after ``_description_from_row`` returns: if the description string is empty (or whitespace-only), continue past the row. 2. ``prev`` resets per page. The state that drives multi- line description merging (the "previous transaction this continuation might attach to") used to persist across page boundaries. A no-date no-amount line at the top of page 2 could silently attach to the last transaction on page 1. Fixed by moving the ``prev`` / ``prev_y_bottom`` declarations into the outer page loop so each page starts clean. 3. Multi-line merges now check y-distance. Before this fix, ANY no-date no-amount line attached to the previous transaction's description. A "Daily Ledger Balances" section header several rows below the last transaction would silently fold into it. Now the merge only happens when the gap ``current_top - prev_y_bottom <= 25.0`` PDF points — generous enough for one blank-line gap between wrapped descriptions, tight enough to reject section headers across paragraph breaks. The threshold is a module constant (``_MULTILINE_MERGE_MAX_GAP``) for future tuning if real statements call for it. Three new test classes: - ``TestRequiresDescription.test_empty_description_row_dropped`` — date+amount-no-text row filtered, real transaction kept. - ``TestPrevTransactionResetsPerPage.test_no_cross_page_merge`` — page-1 transaction + page-2 section header = no merge. - ``TestMultilineMergeYGap`` — close continuation merges (10-pt gap), far section header doesn't (100-pt gap). 
The original ``TestMultilineDescription.test_continuation_line_merges`` still passes — its setup has a 10-pt gap which is within the new threshold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 00:58:50 +00:00
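The merge rule from 6f2ad57490 can be sketched as a single predicate. The constant name and the 25.0-point threshold are quoted from the commit; the function name and its parameters are hypothetical, and ``prev_y_bottom=None`` stands in for "no previous transaction on this page yet".

```python
_MULTILINE_MERGE_MAX_GAP = 25.0  # PDF points, value quoted in the commit


def should_merge_continuation(prev_y_bottom, current_top, has_date, has_amount):
    """Decide whether a line continues the previous transaction's description."""
    # Lines that carry a date or an amount are never continuations.
    if has_date or has_amount:
        return False
    # State resets per page, so the first line of a page has nothing to attach to.
    if prev_y_bottom is None:
        return False
    # Only merge when the vertical gap is small enough to be a wrapped line.
    return (current_top - prev_y_bottom) <= _MULTILINE_MERGE_MAX_GAP
```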
Michael	a1824b8dc4	feat(pdf): Home-style file list + Clear-all button User feedback: the standard file_uploader didn't visually match the Home page, and there was no obvious way to clear out uploaded files between scans (have to refresh the browser tab). Persistent stash + add-only sync. Files captured into ``st.session_state["pdf_uploads"]`` (dict name → {bytes, size}) via an ``on_change`` callback on the file_uploader widget. The callback is add-only — never removes files from the stash based on widget state. Removal is owned by the custom X buttons + widget-counter bump (see below). This guarantees a hidden native X click can't silently drop files behind the user's back. Hidden native file list. A small CSS block suppresses the file_uploader's built-in file rows + their delete buttons (``stFileUploaderFile`` + ``stFileUploaderDeleteBtn``), so the custom list below is the single source of truth on screen. Custom file list (Home pattern). Below the dropzone, every uploaded file gets a row: ``✕ \| 📄 filename \| size``. Top of section shows ``N files · 12.3 MB total``. Counts and sizes update in real time as the user adds or removes files. The X button per row calls ``log_event("upload", "PDF removed: …")``, removes the entry from the stash, and bumps the widget counter to clear the widget too. Clear-all button. Sits next to the Scan button. Wipes the stash, bumps the widget counter, drops any cached scan results (``K_ROWS``, ``K_WARNINGS``, ``K_SOURCE_COUNT``). Audited via ``log_event("upload", "PDF list cleared", count=N)``. Widget reset via counter bump. Streamlit disallows programmatic mutation of widget session-state entries; the standard workaround is to rotate the widget's ``key``. Page maintains ``K_UPLOAD_COUNTER`` which gets incremented on remove / clear-all, producing a fresh ``pdf_upload_v{N}`` key and a freshly-instantiated empty widget. 
The stash retains any unaffected files; on next upload, the add-only sync picks up the new ones without re-adding the removed ones. Scan rewired to read the stash. Instead of iterating the widget's UploadedFile objects (which the previous code did and which broke when the widget unmounted on remove), the scan loop iterates ``pdf_uploads.items()`` and uses the cached ``bytes``. Diagnostic expander does the same — re-reads from the stash, removing the need for a separate ``K_DIAGNOSTIC`` cache (deleted). ``_format_size`` helper ports the byte-formatting logic from ``_home.py``'s pattern (KB / MB / GB rollover). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 00:28:01 +00:00
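A Streamlit-free sketch of two ideas from a1824b8dc4: the add-only sync into the stash and the byte formatter. The stash shape (name → ``{bytes, size}``) follows the commit; ``sync_uploads_add_only``, the ``(name, data)`` input shape, and the exact rounding in ``format_size`` are assumptions, since the real ``_format_size`` is not shown.

```python
def sync_uploads_add_only(stash, uploaded):
    """Copy new (name, data) pairs into the stash; never remove anything."""
    for name, data in uploaded:
        if name not in stash:
            stash[name] = {"bytes": data, "size": len(data)}
    return stash


def format_size(num_bytes):
    """Format a byte count with B / KB / MB / GB rollover."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB"):
        if size < 1024:
            # Whole bytes need no decimals; larger units get one.
            return f"{size:.0f} {unit}" if unit == "B" else f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} GB"
```

Because the sync only ever adds, an empty widget state (for example after a counter bump) leaves the stash untouched; removal has to go through the explicit X or Clear-all paths.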
Michael	155dd30746	feat(pdf): extract statement header (account + period) + date format Three related additions for the accountant workflow: 1. Statement header extraction. New ``extract_statement_metadata(pages)`` pulls the account number and statement period out of the first page (falls back to page 1+2 if either is missing on page 1; Wells Fargo business accounts put header info on page 2). Detected fields are stamped onto EVERY transaction row so a multi-statement CSV is self-attributing per row:: { "date": "20250113", "description": "Coffee Shop", "amount_1": -4.50, "account_number": "**5678", "statement_period_start": "20250101", "statement_period_end": "20250131", ... } Account-number regex is tolerant of masks (``1234``), hyphens (``1234-5678-9012``), and spaces. Period regex looks for "Statement Period" / "From" / "Period Covered" labels plus the first 1-2 full-year dates that follow. If only one date is present near the label, it's used for both start and end (some statements show only the closing date). 2. Year inference for short dates. When the row date is a short ``01/13`` or ``Jan 13`` without a year, the scanner now binds the year from the statement period's end date BEFORE formatting. It does not handle the December-in-January-statement cross-year case (rare; user can edit in the table). 3. Configurable output date format. New ``output_date_format`` parameter on ``scan_pdf_for_transactions`` defaults to ``%Y%m%d``. Applied to: the transaction date column AND the statement period start/end fields. The page surfaces a dropdown in Scan options with common presets (YYYYMMDD, YYYY-MM-DD, MM/DD/YYYY, DD/MM/YYYY, ``Mon DD, YYYY``) plus a Custom option that accepts a raw strftime string. New helper: ``format_date(iso_str, fmt)`` converts ISO ``YYYY-MM-DD`` to any strftime format; passes invalid input through unchanged so the user can see what was actually there rather than getting silent empties. 
20 new tests cover: format_date, account-number extraction (masked / hyphenated / spaced / no-label / short), period extraction (standard / from-to / single-date / no-label), metadata orchestrator (full header / no pages / page-2 fallback), year inference (US / dash / month-name / no-period / unparseable), plus an end-to-end class that builds a header'd PDF with short-date transactions and confirms metadata attribution + year inference + format round-trip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 00:20:46 +00:00
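A sketch of the two date helpers described in 155dd30746. ``format_date(iso_str, fmt)`` is named in the commit; ``bind_year`` is a hypothetical name for the year-inference step, and this version only handles the slash form (``01/13``), not ``Jan 13``. Like the commit, it does not handle the December-in-January cross-year case.

```python
from datetime import datetime


def format_date(iso_str, fmt="%Y%m%d"):
    """Convert ISO YYYY-MM-DD to a strftime format; pass bad input through."""
    try:
        parsed = datetime.strptime(iso_str, "%Y-%m-%d")
    except (TypeError, ValueError):
        # Keep whatever was there so the user can see and fix it.
        return iso_str
    return parsed.strftime(fmt)


def bind_year(short_date, period_end_iso):
    """Attach the statement period's end year to a short MM/DD date."""
    try:
        year = datetime.strptime(period_end_iso, "%Y-%m-%d").year
        month, day = (int(part) for part in short_date.split("/"))
        return datetime(year, month, day).strftime("%Y-%m-%d")
    except (TypeError, ValueError, AttributeError):
        # No usable period or an unparseable date: leave the raw text alone.
        return short_date
```

Chaining the two gives the round trip from the commit's example: ``format_date(bind_year("01/13", "2025-01-31"))`` yields ``20250113``.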
Michael	3cf935c999	fix(pdf): drop zero-amount rows; multi-date rows clean description Two corrections from real-statement feedback: 1. Drop rows where the transaction amount is exactly 0. Bank statements include date+amount-shaped noise like "INTEREST EARNED 0.00", "PAGE TOTAL 0.00", "BALANCE FORWARD 0.00 1,234.56" — all match the date+amount heuristic but aren't transactions. New filter in ``scan_pdf_for_transactions``: drop rows whose ``amount_1`` parses to exactly 0. Non-zero balances in ``amount_2`` don't rescue a zero amount_1 — leftmost amount is the canonical transaction amount. Unparsed-but-non-empty amount strings are kept (user verifies in the editor). 2. Multi-date rows: first date wins for the column, every date excluded from the description. Chase / BofA / Wells commonly show both a transaction date and a posting date per row: 01/13 01/14 COFFEE SHOP $4.50 Before this fix, ``_find_dates_in_words`` returned the first date only and the second date leaked into description as "01/14 COFFEE SHOP". Now it returns ALL dates with their word ranges; the scanner uses ``dates[0]`` as the canonical date and passes every range to the description builder for exclusion. The detector's two-pass strategy now also guards against mixing full-year and short-date matches on the same row. Previously, a header line like ``Page 1/2 of 3 ... Statement Date 01/13/2026`` would return both ``1/2`` and ``01/13/2026``, and ``1/2`` (being leftmost) would have won the date column. Now: if any full-year date is found on the row, short patterns are NOT also collected — full year anchors interpretation. A row with no full-year date (Chase short-date case) still falls back to short patterns and collects all of them. 
New tests: - ``test_multiple_dates_returned_in_position_order`` — ``01/13`` + ``01/14`` both returned, in order - ``TestMultiDateRow.test_first_date_wins_second_excluded_from_description`` — end-to-end through ``scan_pdf_for_transactions`` - ``TestZeroAmountRowsAreDropped.test_zero_amount_row_dropped`` — "INTEREST EARNED 0.00" row dropped while real txn kept - ``test_negative_amount_kept`` — pin that -40.00 is not treated as zero by the filter Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 00:12:21 +00:00
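The full-year guard from 3cf935c999 can be sketched over a plain text line. The real detector works on word lists and more patterns; the two regexes below and the ``find_dates`` name are simplified assumptions covering only the slash forms.

```python
import re

# Simplified stand-ins for the full-year and short (no-year) pattern sets.
_FULL = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")
_SHORT = re.compile(r"\b\d{1,2}/\d{1,2}\b")


def find_dates(line):
    """Return every (position, text) date match, in left-to-right order."""
    full = [(m.start(), m.group()) for m in _FULL.finditer(line)]
    if full:
        # A full-year date anchors the row: short patterns are not collected,
        # so "Page 1/2" cannot shadow "01/13/2026".
        return full
    # No full-year date on the row: fall back to short dates, collecting all.
    return [(m.start(), m.group()) for m in _SHORT.finditer(line)]
```

The first returned entry plays the role of ``dates[0]``, the canonical date; the remaining entries exist so the description builder can exclude them.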
Michael	263af3c7c2	fix(pdf): short dates without year + diagnostic for "0 rows" runs User uploaded a real Chase statement and got "0 rows detected." Two bugs the rewrite shipped with, plus a diagnostic: 1. Short dates without year weren't recognized. Most bank statements (Chase, Wells, BofA, …) display transaction dates as ``01/13`` or ``Jan 13`` because the year is implied by the statement period. The original regex required ``\d{2,4}`` after the second slash, so ``01/13`` failed to match and rows with no detected date got dropped. Split ``_DATE_RES`` into ``_FULL`` (with year) and ``_SHORT`` (no year), with a two-pass detector: pass 1 tries full-year patterns across the whole row; pass 2 only tries short patterns if pass 1 found nothing. This prevents a stray ``Page 1/2`` from shadowing the real dated transaction on the same line. Short patterns: - ``\d{1,2}/\d{1,2}`` — Chase, etc. - ``\d{1,2}-\d{1,2}`` - ``[A-Z][a-z]{2}\s+\d{1,2}`` — "Jan 13" When parsing, short dates pass through ``parse_date`` and return None (no year to bind to), so the scanner falls back to the raw text — the user sees ``01/13`` in the date column and can correct in the editor. 2. Multi-word dates leaked the day token into the description. A pre-existing bug: ``_find_dates_in_words`` returned only the START word index, and ``_description_from_row`` only excluded that single word. For "Jan 13 Coffee $4.50", the description became "13 Coffee" instead of "Coffee". Fixed by returning ``(start, end, text)`` with ``end`` exclusive (computed from ``len(m.group(1).split())`` so window-overrun doesn't over-consume), and the description builder now skips the full range. 3. New diagnostic: ``diagnose_pdf_lines(pdf_bytes)``. Returns every clustered text line the scanner saw with ``has_date`` / ``has_amount`` flags. 
When the page's scan returns 0 rows, an auto-expanded "what the scanner saw" expander now renders a table of all extracted lines so the user can: - Spot scanned-PDF cases (empty result → enable OCR) - See which lines have a date but no amount (or vice versa) - Eyeball the date / amount format the scanner missed Without leaving the app or asking the developer for help. Eight new tests cover: short US date (``01/13``), short month- name date with two-word consumption (``Jan 13``), the ``Page 1/2 ... 01/13/2026`` shadowing case, and the multi-word- date description fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 00:06:07 +00:00
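The multi-word-date fix in 263af3c7c2 hinges on excluding a whole word range, with ``end`` exclusive, rather than a single index. A sketch, assuming a hypothetical ``description_from_words`` that takes the row's words, the date ranges, and the indexes of amount tokens:

```python
def description_from_words(words, date_ranges, amount_indexes):
    """Join the words of a row, skipping date and amount tokens."""
    skip = set(amount_indexes)
    for start, end in date_ranges:
        # end is exclusive, so (0, 2) removes words 0 and 1 ("Jan", "13").
        skip.update(range(start, end))
    return " ".join(word for i, word in enumerate(words) if i not in skip)
```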
Michael	bece2b4030	refactor(pdf): rip out templates; heuristic scan + selectable table User feedback: the template / visual-picker / mode-dispatch implementation was too complex for the actual workflow. Statements drift between months, the canvas state didn't survive multi-page navigation, and accountants don't want to maintain per-bank configuration just to convert PDFs to CSV. Start-over design — one public function, one page, no persistence: ``scan_pdf_for_transactions(pdf_bytes) → (rows, warnings)`` A row is "any text line with a date pattern AND at least one amount pattern." Each detected row is a dict shaped:: { "date": "2026-01-15", "description": "Coffee Shop", "amount_1": -4.50, "amount_2": 1000.00, # if a second amount was found "page": 1, "raw": "01/15/2026 Coffee Shop (4.50) 1,000.00", "source_file": "chase-jan-2026.pdf", } Multi-line descriptions still merge (no-date no-amount lines attach to the previous transaction). Multi-PDF batches share a single combined table with a ``source_file`` column. Page UX: - Upload PDF(s) → optional Options expander (parens-negative, use-OCR) → click Scan → see all detected rows in an ``st.data_editor``. - The editor has an ``Include`` checkbox column (default on), plus user-editable date / description / amount cells and a read-only ``raw`` column showing the original PDF text for verification. - A ``Columns to include in CSV`` multiselect hides ``page`` / ``raw`` from the download by default; user can re-add either. - Download CSV gets only the checked rows. No template save/load. No visual picker. No mode dispatch. No column boundaries. No schema migration. No per-bank configuration files. 
Deletions: - ``src/pdf_templates.py`` — template storage layer - ``src/gui/_drawable_canvas_compat.py`` — Streamlit compat shim for the canvas (no canvas now) - ``tests/test_pdf_templates.py``, ``test_pdf_row_heuristic.py``, ``test_drawable_canvas_compat.py`` — covered the removed APIs - ``build/hooks/hook-streamlit_drawable_canvas.py`` — hook for the removed dep - ``streamlit-drawable-canvas==0.9.3`` from ``requirements.txt`` - The drawable-canvas references in ``build/datatools.spec`` ``src/pdf_extract.py`` shrinks from ~30 helper functions to ~10. Keeps: value parsers, row clusterer, date/amount token finders, OCR pipeline, dependency guards. The one new public function ``scan_pdf_for_transactions`` glues them together. Tests (59 passing): the unit layer keeps full coverage of the building blocks; the smoke layer pins the end-to-end PDF roundtrip, OCR discovery, dependency-import behavior, and the multi-line-description merge. The fpdf2-generated fixture PDF still drives the real-PDF test. Rollback: ``git revert HEAD`` brings back the template system if needed — but the simpler model should make that unlikely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:57:30 +00:00
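The row example in bece2b4030 maps the raw token ``(4.50)`` to ``-4.50`` and ``1,000.00`` to ``1000.00``. A sketch of an amount parser with that behaviour, assuming a ``parens_negative`` flag mirroring the page option; the actual ``parse_amount`` in ``src/pdf_extract.py`` may differ.

```python
def parse_amount(text, parens_negative=True):
    """Parse a statement amount token to float, or None if it does not parse."""
    cleaned = text.strip()
    negative = False
    # Accounting style: a parenthesised amount is negative.
    if parens_negative and cleaned.startswith("(") and cleaned.endswith(")"):
        negative = True
        cleaned = cleaned[1:-1]
    # Drop the currency symbol and thousands separators.
    cleaned = cleaned.replace("$", "").replace(",", "")
    try:
        value = float(cleaned)
    except ValueError:
        return None
    return -value if negative else value
```

With ``parens_negative=False`` the token ``(4.50)`` does not parse, which is the case where the scanner keeps the raw string for the user to correct.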
Michael	60969c0770	feat(pdf): UI rework: Auto-detect is the default build flow Pulls the user's primary mental model away from "draw column boundaries" toward "tell me what shape your amounts have, see detected rows, save." The visual picker that wasn't working for multi-statement workflows is reachable but no longer the default. Build mode header now has a mode radio: - "Auto-detect (recommended)": row_heuristic. Tabs: Amount layout · Filters & date · Save. Three small forms; no coordinate UI anywhere. The Amount-layout tab's dropdown picks one of single / txn+balance / debit+credit / debit+credit+balance and auto-derives the min/max amount-count range (overridable under an expander). - "Visual columns (advanced)": column_visual. Five tabs (the original Visual picker / Pages & table / Columns / Parsing / Save). A yellow warning panel up top reminds the user that column-x templates only work when statement layout is stable. Switching modes triggers a rerun so the right tab set renders immediately. The template object preserves both modes' config trees side-by-side so a user can flip between them without losing work. Live preview below the form runs ``apply_template`` against the cached sample pages (already cached in session_state so this re-renders cheaply on every form edit). The "no rows yet" message is mode-aware: it points users at the right tuning knobs for whichever mode they're in. The preview caption notes which mode produced the rows so the user can correlate decisions to output. The visual picker bug the user reported ("a single box stays in the same location regardless of page") is sidestepped rather than fixed: in row_heuristic mode there's no canvas to confuse, and for the rare column_visual user the canvas is still imperfect but no longer their first interaction with the tool. Cleaning up the column_visual canvas state bugs is a separate follow-up if real users still hit the Advanced mode. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:46:27 +00:00
Michael	48cd9e8249	feat(pdf): schema v2 + mode field + v1 in-memory migration Bumps ``SCHEMA_VERSION`` from 1 to 2 to add a top-level ``mode`` field distinguishing ``row_heuristic`` (new default) from ``column_visual`` (legacy). The schema bump is real — old code that defaults missing keys would silently mis-extract — so we do it the careful way: - ``new_template`` now returns mode=``row_heuristic`` with the full row-heuristic config tree pre-populated. The legacy column-visual fields are still seeded with empty defaults so switching modes in the GUI doesn't require runtime key insertion. - ``validate_template`` is mode-aware: row_heuristic templates must have a valid ``amounts.shape`` + sane ``row_detection.min/max_amounts_per_row``; column_visual templates keep the existing column/target requirements. - ``load_template`` accepts both v1 and v2 files (``_LOAD_SUPPORTED_VERSIONS = {1, 2}``). v1 files get ``mode="column_visual"`` injected and ``schema_version`` bumped IN MEMORY ONLY — disk file stays v1 until the user explicitly re-saves. A buggy migration can't silently corrupt their template library. - ``save_template`` continues to write the current schema; saving a v1 template through the GUI naturally upgrades it. Mode + shape constants exported (``VALID_MODES``, ``VALID_AMOUNT_SHAPES``) so the GUI dropdowns can derive their options from the source of truth. Tests split into ``TestValidateTemplateRowHeuristic`` (6) + ``TestValidateTemplateColumnVisual`` (4) + ``TestV1Migration`` (1). All 29 template tests pass; the original column-mode tests that previously implicitly relied on schema_version=1 keep working because new_template's seeded column fields are still present in row_heuristic templates (just not validated as required). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:46:10 +00:00
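The in-memory v1 migration from 48cd9e8249 can be sketched on a plain dict. ``SCHEMA_VERSION`` and ``_LOAD_SUPPORTED_VERSIONS`` are quoted from the commit; ``migrate_in_memory`` is a hypothetical name for the step ``load_template`` performs after reading the file.

```python
import copy

SCHEMA_VERSION = 2
_LOAD_SUPPORTED_VERSIONS = {1, 2}


def migrate_in_memory(raw):
    """Return an upgraded copy of a loaded template; never touch the input."""
    version = raw.get("schema_version")
    if version not in _LOAD_SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported schema_version: {version!r}")
    # Work on a copy so the caller's dict (and the file on disk) stay as loaded.
    template = copy.deepcopy(raw)
    if version == 1:
        # v1 files predate the mode field and were always column-based.
        template["mode"] = "column_visual"
        template["schema_version"] = SCHEMA_VERSION
    return template
```

Returning a copy is what keeps a buggy migration from corrupting the stored template: nothing is written until the user re-saves.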
Michael	d80befd05a	feat(pdf): row-heuristic extraction (mode dispatch, no coordinates) User reported the column-visual approach is too brittle for real bank statements: column-x-positions saved against a sample page don't survive layout drift between months (statement A has columns at x=300, statement B drifted to x=320), and a saved template can only realistically work for one statement's specific render. The fundamental fix is to stop depending on coordinates at all. Row-heuristic mode finds transaction rows by pattern: any line with a date token + N amount tokens IS a transaction. Date patterns (US slash / EU slash / ISO / "Jan 15, 2026" / etc.) and amount patterns (currency, parens-negative, thousands grouping) are matched against word text — no x-positions involved. The full pipeline: 1. ``find_transaction_rows`` clusters words into rows and scans each line for date + amount tokens. 2. Multi-line descriptions still attach to the previous row via the no-date-no-amount continuation rule. 3. Amount shapes drive interpretation: ``single`` / ``txn_balance`` / ``debit_credit`` / ``debit_credit_balance``. 4. ``_infer_amount_column_centers`` clusters amount x-midpoints ACROSS ALL detected rows to find natural column groupings — so debit-vs-credit assignment for single-amount lines works without the user marking anything on screen. ``apply_template`` is now a dispatch over ``template["mode"]``: - ``mode="row_heuristic"`` (default for new templates) — the new pipeline. - ``mode="column_visual"`` — the existing pipeline, kept under ``_apply_template_column_visual`` for v1 templates and the Advanced fallback. 
18 new tests cover: date detection (US slash, two-digit year, ISO, month-name, missing); amount-token finding (currency, parens, pure text, bare-year rejection); column-center inference (clear two-column case, empty input); end-to-end on synthetic Page objects with all four amount shapes; the critical layout-drift test that proves the same template works on pages of different sizes / different absolute x-positions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:45:55 +00:00
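One way to read the column-center inference in d80befd05a is as 1-D gap clustering over amount x-midpoints. The sketch below is an assumption about the approach, not the actual ``_infer_amount_column_centers``; the ``max_gap`` parameter and its default are invented for illustration.

```python
def infer_column_centers(x_midpoints, max_gap=20.0):
    """Group sorted x-midpoints into clusters and return each cluster's mean."""
    centers = []
    cluster = []
    for x in sorted(x_midpoints):
        # A jump larger than max_gap starts a new column.
        if cluster and x - cluster[-1] > max_gap:
            centers.append(sum(cluster) / len(cluster))
            cluster = []
        cluster.append(x)
    if cluster:
        centers.append(sum(cluster) / len(cluster))
    return centers
```

Because only the gaps between midpoints matter, shifting every amount by the same offset (the x=300 versus x=320 drift in the commit) changes the centers but not the number of columns found.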
Michael	10015c40e1	fix(pdf): shim image_to_url for drawable-canvas on modern Streamlit User hit ``AttributeError: module 'streamlit.elements.image' has no attribute 'image_to_url'`` on first PDF import. Root cause: ``streamlit-drawable-canvas`` 0.9.3 (last upstream release 2023) calls a Streamlit internal that was relocated in Streamlit ~1.30+. The function moved from ``streamlit.elements.image`` to ``streamlit.elements.lib.image_utils`` AND its signature changed — the second positional argument is now a ``LayoutConfig`` dataclass instead of a plain ``int`` width. Three remedies considered: 1. Downgrade Streamlit. Reverses unrelated improvements + security fixes; not on the table. 2. Fork drawable-canvas. The maintenance hit isn't worth it for a one-line internal API change. 3. Ship a compatibility shim. Re-attach a wrapper at the old import path that adapts the old call shape to the new function. This is the standard workaround the wider Streamlit community has converged on for this exact regression. ``src/gui/_drawable_canvas_compat.py`` does (3). The ``install()`` helper is idempotent, opt-in (not auto-run at module import — a grep for ``_install_canvas_compat`` shows every call site), and no-ops if Streamlit hasn't moved the function OR if the new function isn't where we expect (lets the canvas surface a real error rather than papering over a different bug). The page calls ``_install_canvas_compat()`` once at module top before any ``st_canvas`` invocation; Streamlit's script-rerun model means this fires every page load but the ``_PATCHED`` guard makes re-runs free. The shim wraps the old ``width=int`` arg into a default-constructed ``LayoutConfig()`` — the old ``width=-1`` sentinel meant "use the image's natural width", which is also what an unconfigured LayoutConfig produces. Confirmed by inspecting Streamlit 1.57.0's ``image_utils.py``. 
4 new tests pin the shim contract: - ``install()`` attaches ``image_to_url`` to the old path on modern Streamlit - Idempotent — calling twice doesn't double-wrap - Doesn't clobber a future Streamlit that restores the original at the old path - Translates ``(image, -1, False, "RGB", "PNG", "id")`` into a proper call to the new function with a ``LayoutConfig`` instance If a future Streamlit upgrade moves ``image_to_url`` AGAIN, the shim's silent-no-op fallback means the canvas error surfaces again and points at where to look. The shim doesn't paper over mysteries; it only patches the one specific relocation we know about. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:29:20 +00:00
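The shim contract from 10015c40e1 (attach at the old path, stay idempotent, never clobber, no-op when the new location is unknown) can be illustrated with stand-in namespaces. Nothing below imports Streamlit: ``old`` and ``new`` are hypothetical substitutes for the two modules, and placing ``LayoutConfig`` on the new namespace is an assumption made to keep the sketch self-contained.

```python
from types import SimpleNamespace


def install(old_module, new_module):
    """Attach an adapter at the old path; return True only if it patched."""
    # Already present (a previous install, or the original function): no-op.
    if hasattr(old_module, "image_to_url"):
        return False
    new_func = getattr(new_module, "image_to_url", None)
    layout_cls = getattr(new_module, "LayoutConfig", None)
    # Unknown layout: do nothing so the real error surfaces to the caller.
    if new_func is None or layout_cls is None:
        return False

    def image_to_url(image, width, *rest):
        # The old int width is replaced by a default-constructed layout object.
        return new_func(image, layout_cls(), *rest)

    old_module.image_to_url = image_to_url
    return True


class LayoutConfig:
    """Stand-in for the layout dataclass the new function expects."""


old = SimpleNamespace()
new = SimpleNamespace(
    image_to_url=lambda image, layout, *rest: (image, type(layout).__name__, rest),
    LayoutConfig=LayoutConfig,
)
installed_first = install(old, new)
installed_again = install(old, new)
result = old.image_to_url("img", -1, False, "RGB", "PNG", "id")
```

Here the ``hasattr`` check covers both idempotence and the no-clobber rule; the commit describes a separate ``_PATCHED`` guard, which this sketch omits.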
Michael	e6ee2e3481	feat(pdf): robust Tesseract discovery + OS-aware install copy User tried ``brew install tesseract`` in PowerShell after seeing all three OSes listed inline in the OCR banner — easy mistake when the install commands are crammed on one line with ``·`` separators. Two changes pre-empt this: OS-aware OCR banner. The expander now detects the user's platform via ``platform.system()`` and shows only the relevant install instructions: - Windows: UB-Mannheim installer link, numbered steps, explicit "keep the Add to PATH checkbox on" callout, plus a fallback paragraph telling the user how to set ``DATATOOLS_TESSERACT_PATH`` if they already installed without PATH and don't want to reinstall. - macOS: ``brew install tesseract`` with a Homebrew link. - Linux: ``apt install tesseract-ocr`` with a "or your distro's equivalent" hedge. Robust binary discovery in ``ocr_available()``. Three-stage: 1. Honor ``DATATOOLS_TESSERACT_PATH`` env var if set — explicit override for portable installs or non-default locations. 2. Try ``pytesseract``'s default PATH-based lookup. 3. If PATH lookup fails, probe known Windows install paths (``C:\Program Files\Tesseract-OCR\tesseract.exe``, the x86 variant, and ``%LOCALAPPDATA%\Programs\Tesseract-OCR\``) via the new ``_autodetect_tesseract_path``. On hit, set ``pytesseract.pytesseract.tesseract_cmd`` so all subsequent ``image_to_data`` calls use the same binary without re-discovering. This means a user who runs the UB-Mannheim installer with default options but forgets the PATH checkbox will still get OCR working after a launcher restart, without env-var gymnastics. Tests (4 new, 85 total in the suite): - Auto-detect returns None on non-Windows (no false positives on dev laptops). - Auto-detect finds the binary at a mocked ``C:\Program Files\Tesseract-OCR\tesseract.exe``. - Auto-detect returns None when no candidate exists. 
- ``DATATOOLS_TESSERACT_PATH`` env var beats both PATH lookup and auto-detect (sets ``tesseract_cmd`` even when the path doesn't resolve, so a real binary at a custom location works). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:15:00 +00:00
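The three-stage lookup in e6ee2e3481 can be sketched with the standard library only. ``discover_tesseract`` and its injectable ``which`` / ``exists`` parameters are hypothetical (added so the order can be exercised without a real install), and the exact x86 path is an assumption; the real ``ocr_available()`` additionally sets ``pytesseract.pytesseract.tesseract_cmd``.

```python
import os
import platform
import shutil


def windows_candidates(env):
    """Known Windows install locations, in probe order."""
    paths = [
        r"C:\Program Files\Tesseract-OCR\tesseract.exe",
        r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",  # assumed x86 path
    ]
    local = env.get("LOCALAPPDATA")
    if local:
        paths.append(os.path.join(local, "Programs", "Tesseract-OCR", "tesseract.exe"))
    return paths


def discover_tesseract(env=os.environ, system=None, which=shutil.which,
                       exists=os.path.isfile):
    """Return a Tesseract path, or None if no stage finds one."""
    # Stage 1: explicit override wins, even if the path is not verified here.
    override = env.get("DATATOOLS_TESSERACT_PATH")
    if override:
        return override
    # Stage 2: ordinary PATH lookup.
    on_path = which("tesseract")
    if on_path:
        return on_path
    # Stage 3: probe default install locations, Windows only.
    if (system or platform.system()) == "Windows":
        for candidate in windows_candidates(env):
            if exists(candidate):
                return candidate
    return None
```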
Michael	538e23d219	build(pdf): bundle PDF deps in installers + pin versions + smoke tests Three changes prepare the next tagged release so end users get the PDF Extractor without ever touching pip. Exact-pin the new deps (``requirements.txt``): pdfplumber==0.11.9 pypdfium2==5.8.0 pytesseract==0.3.13 streamlit-drawable-canvas==0.9.3 Tight pins are the right call for these because the GUI's visual-picker geometry + the parsing-pipeline word positions depend on stable internal behavior — a quiet upstream tweak to ``extract_words`` or ``page.render`` would re-break the tool on the next CI build. Bumping requires a deliberate edit + a CI run, not a transient ``pip install`` resolving to whatever ``setup.py`` pulled. Existing deps stay on their current ``>=X.Y,<X+1`` ranges; the user's "tight pin" concern is specifically about the PDF stack. Wire the new deps into the PyInstaller bundle (``build/``): - ``datatools.spec`` — add ``collect_submodules`` for pdfplumber, pdfminer, pypdfium2, streamlit_drawable_canvas, PIL, pytesseract; add ``collect_data_files`` for pypdfium2 (PDFium native ``.dll``/``.so``/``.dylib``), streamlit_drawable_canvas (frontend JS bundle), pdfminer (Adobe CMap tables). - ``hooks/hook-pypdfium2.py`` — belt-and-braces hook that uses ``collect_dynamic_libs`` to force-include the PDFium binary. Without this the visual picker silently fails on installed builds with a ``FileNotFoundError`` for the shared library. - ``hooks/hook-streamlit_drawable_canvas.py`` — collects the built JS frontend so the canvas iframe loads under the bundled Streamlit server instead of rendering blank. Tesseract is intentionally NOT bundled (option A from the design discussion). Modern bank statements are text-based; bundling Tesseract would ~triple installer size for a long-tail case. The in-app banner directs users to install it from ``UB-Mannheim/tesseract`` if they need OCR. Decision is captured in the ``project-pdf-installer-pending`` memory note. 
Smoke tests (``tests/test_pdf_extract_smoke.py``, 17 tests) add the layer above the pure unit tests: - ``TestDependencyImports`` — each dep imports cleanly - ``TestRealPdfRoundTrip`` — generates a tiny statement PDF in memory with ``fpdf2`` (test-only dep in ``requirements-dev.txt``), runs ``extract_pages`` + ``apply_template``, asserts 3 rows out with the right signed amounts. Catches "the build succeeded but pdfplumber breaks at runtime." - ``TestRenderPageImage`` — exercises ``pypdfium2.render`` so the hook-bundled native lib gets a real call. This is the most common installer-bug signature (missing .dll) and the test catches it before users do. - ``TestPdfDependencyMissing`` — monkeypatches ``__import__`` to simulate a stripped install; confirms the typed exception + actionable hint round-trip. - ``TestPinnedVersionsMatchInstalled`` — parametrized over all four pinned dists; uses ``importlib.metadata`` rather than ``__version__`` because pypdfium2 doesn't expose it directly. Trips if someone bumps the pin without reinstalling. - ``TestOcrAvailability`` — confirms ``ocr_available()`` returns ``(bool, str)`` and ``extract_pages_auto(allow_ocr=False)`` skips OCR cleanly. All 81 PDF + audit tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 23:10:43 +00:00
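The idea behind ``TestPinnedVersionsMatchInstalled`` in 538e23d219 is to compare each ``==`` pin with ``importlib.metadata`` rather than a module's ``__version__``. A sketch with two hypothetical helpers, ``parse_pins`` and ``mismatched_pins``; the real test is parametrized and may be structured differently.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Collect name -> version for every exact (==) pin in a requirements file."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def mismatched_pins(pins):
    """Return name -> (wanted, installed) for pins that do not match."""
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # not installed at all
        if installed != wanted:
            problems[name] = (wanted, installed)
    return problems
```

An empty result from ``mismatched_pins(parse_pins(text))`` means the environment matches the file; range specifiers such as ``>=X.Y,<X+1`` are skipped because only the PDF stack is exact-pinned.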
Michael	2d927bc95f	fix(pdf): graceful fallback when PDF dependencies aren't installed User hit a hard ImportError on opening the PDF→CSV tool because ``pip install -r requirements.txt`` hadn't picked up the new ``pdfplumber`` / ``pypdfium2`` lines yet. Streamlit surfaces that as an unfiltered traceback — friendlier to show a clear install-required panel inside the tool instead. Two changes: 1. ``src/pdf_extract.py`` lazy-imports the PDF deps via ``_require_pdfplumber()`` / ``_require_pdfium()`` helpers that raise a new ``PdfDependencyMissing`` (subclass of ImportError) with an actionable ``hint`` field. Pure helpers (``parse_amount``, ``parse_date``, ``cluster_rows``, etc.) keep working with no PDF dep installed — useful for tests and for keeping module-import paths cheap. 2. The tool page probes both deps at render time via ``_pdf_deps_status()``; if anything's missing it shows a ``st.error`` panel with the exact pip command and a "restart the launcher" reminder, then ``st.stop()``s before touching any PDF code path. The page itself loads cleanly without the deps installed, so the sidebar nav doesn't 500 — the user just sees the install panel on click. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:59:20 +00:00
Michael	967d3f6a11	feat(pdf): OCR availability banner + per-run toggle Phase 6/6. Final polish layer on top of the OCR pipeline that ``extract_pages_auto`` has carried since commit 1. - OCR status banner at the top of the page next to the mode selector. Ready: a one-liner caption confirming OCR will run on scanned pages. Unavailable: a collapsed expander explaining the missing piece (``pytesseract`` binding vs. Tesseract binary) with install pointers for Windows, macOS, and Linux. The expander explicitly notes that modern text-based bank statements don't need OCR — most users will never expand it. - "Use OCR for scanned pages" toggle in Extract mode, defaulting to the runtime availability. Disabled (greyed out) when Tesseract isn't usable, so the user can't accidentally set themselves up for confusing warnings. Passes through as ``allow_ocr`` to ``extract_pages_auto``. - Build mode's sample-loading path continues to call ``extract_pages_auto(..., allow_ocr=True)`` — sample preview always uses OCR if available, since the user is actively diagnosing template fit. No schema change. OCR's structural support is in commits 1 + 3; this commit just makes it discoverable + opt-out. Rolling up the 6-commit feature: `b8aff86` Phase 1 — pure pdf_extract module + tests `aea520d` Phase 2 — template storage layer + tests `2f349e8` Phase 3 — Extract/Build/Manage page + nav + i18n `5a8e2ec` Phase 4 — batch polish (ZIP, sort, status block) `b86828d` Phase 5 — visual region picker (drawable canvas) THIS Phase 6 — OCR banner + toggle Each commit is independently revertable; rolling all the way back to ``c16e2a5`` is ``git revert `b86828d` `5a8e2ec` `2f349e8` `aea520d` `b8aff86` <this>`` (or just ``git reset --hard c16e2a5`` on a clean branch). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:54:11 +00:00
Michael	b86828d791	feat(pdf): visual region picker on rendered sample page Phase 5/6. Adds a "Visual picker" tab as the first stop in the template-build flow. The sample PDF page is rasterized with ``pypdfium2`` (capped at ~900px wide for sensible display), and ``streamlit-drawable-canvas`` overlays drawing tools on top. UX: - Line mode — drag short (roughly vertical) strokes where you want columns to split. Each stroke's x-midpoint becomes one boundary in PDF point coordinates. - Rect mode — drag a rectangle around the transactions table; bbox is preserved on the template as ``visual.table_bbox`` for round-trip, future use as a hard crop region. - Transform mode — move/resize already-drawn shapes after the fact. Round-trip: re-entering Build mode with an existing template seeds the canvas with full-height vertical lines for every boundary already on the template, plus the saved bbox if any, so editing-after-save matches the user's mental model. Coordinate translation: the canvas reports pixel positions; we divide by the renderer's pixels-per-PDF-point scale to get back to PDF coordinates that ``apply_template`` already expects. No template-schema change required — the boundaries the picker writes are the same list the text-input editor wrote in commit 3, just sourced visually. New helper in the extraction module: - ``render_page_image(pdf_bytes, page_no, target_width=900)`` — rasterize a single 1-indexed page to a PIL image; returns ``(image, scale)`` for coordinate translation. The text-input boundary editor in the Columns tab remains as a fallback for power users / keyboard-only workflows and for copy-paste from spreadsheet-derived x-positions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:52:54 +00:00
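The coordinate translation in commit b86828d791 (divide canvas pixels by the renderer's pixels-per-PDF-point scale, and take each stroke's x-midpoint as a boundary) can be sketched as below. The function name and the `(x_start, x_end)` stroke representation are hypothetical; the canvas component's real payload is not shown in the log.

```python
from typing import List, Sequence, Tuple


def strokes_to_boundaries(
    strokes_px: Sequence[Tuple[float, float]], scale: float
) -> List[float]:
    """Convert drawn strokes (pixel x-extents) to sorted PDF-point boundaries.

    `scale` is pixels per PDF point, as returned alongside the rendered image.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    boundaries = []
    for x_start, x_end in strokes_px:
        midpoint_px = (x_start + x_end) / 2.0   # a stroke is only roughly vertical
        boundaries.append(midpoint_px / scale)  # pixels -> PDF points
    return sorted(boundaries)
```

For example, with a scale of 2.0 pixels per point, a stroke spanning pixels 90 to 110 yields a boundary at 50.0 PDF points.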
Michael	5a8e2ec9e1	feat(pdf): batch extract polish — ZIP output, sort-by-date, status block Phase 4/6. Polishes the batch workflow shipped in commit 3: - st.status progress block replaces the simple progress bar. Each file appears as its own line as it's processed; the block auto-collapses on completion with a "12/13 extracted" summary and turns red if any file errored. - Sort combined output by date checkbox (default ON) sorts the merged CSV ascending by date, with source_file as a stable secondary sort so multiple statements interleave by date but same-day rows from the same file stay together. - ZIP-of-per-PDF-CSVs output option alongside the combined CSV. When the accountant has 12 statements from 12 different account periods and wants to feed them into 12 separate ledger imports, the ZIP keeps each file's rows in its own CSV named after the original PDF stem. - Per-file summary table gets a ``status`` column ("ok" / "no rows" / "error: ExceptionName") so error grouping is obvious at a glance — already present from commit 3, now upgraded with the status field. Cancellation is intentionally not added — Streamlit's single-thread rerun model has no clean way to interrupt a tool-run mid-stream without architectural changes to extraction. If a user mis-fires Extract on 50 PDFs they can refresh the browser tab; the task will be killed when the next interaction comes in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:51:05 +00:00
Michael	2f349e8191	feat(pdf): tool page with Extract / Build / Manage modes Phase 3/6. Wires the PDF Extractor into the GUI as a new "transformations" tool with three modes selected by a horizontal radio at the top of the page: Extract — pick a saved template, upload one or more statement PDFs (single + batch shipping together to keep the common case one-step), get a previewed DataFrame + CSV download. Per-file row counts and warnings are surfaced; failures on one file don't kill the whole batch. The combined CSV gets a ``source_file`` first column so the accountant can sort/filter by statement. Build template — load an existing template or start fresh, upload a sample PDF, edit every schema field across four tabs (Pages & table / Columns / Parsing / Save). A live preview below re-runs ``apply_template`` against the sample on each re-render so the user sees their changes hit rows immediately. The column-boundary editor is text-input ("comma-separated x-positions") for now — replaced by the drawable-canvas visual picker in commit 5. Manage templates — list with rename / delete / export (downloads the canonical JSON) / import (uploads someone else's JSON, validated through ``template_from_json``). Heavy work (``extract_pages_auto``) only runs on explicit user action (Extract / a new sample upload), and the parsed Page list is cached in ``st.session_state`` so widget-edit reruns don't re-parse the PDF. Logging: tool runs and template saves both hit the audit log via ``log_event("tool_run", …)``, matching every other tool's instrumentation pattern. Registered in ``tools_registry.py`` under ``transformations`` with status ``Ready`` and the picture-as-pdf Material icon. i18n keys added for en + es ("PDF to CSV" / "PDF a CSV"). OCR is wired in this commit — ``extract_pages_auto`` already falls back through ``pytesseract`` when the binary is available, and the warning strings it returns surface as ``st.info`` / ``st.warning`` per-file. 
Commit 6 will polish the OCR UX with a status row. Next commits build on this page: 4 — batch progress + cancellation + per-file error grouping 5 — drawable-canvas visual picker replaces text x-positions 6 — OCR availability banner + scanned-page indicators Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:49:44 +00:00
Michael	aea520d2f7	feat(pdf): template storage layer (load/save/list/import/export) Phase 2/6. Persists "how to read this bank's statements" as JSON files under ``~/.datatools/pdf_templates/<slug>.json`` so an accountant can build one template per source and reuse it across every statement that follows the same layout. Public API: - ``new_template(name)`` — blank with sensible defaults - ``save_template(t)`` — validate + atomic write (temp + rename) - ``load_template(slug)`` / ``delete_template(slug)`` - ``list_templates()`` — sorted summaries, skips corrupt files - ``template_to_json`` / ``template_from_json`` — portability - ``validate_template(t)`` — returns (ok, errors) list for GUI Schema is documented in the module docstring. Versioned via ``schema_version: 1`` so future fields don't break saved files silently — ``load_template`` refuses unknown versions instead of limping along with missing keys. Validation contract enforces: - non-empty name + slug (lowercase alphanumeric + hyphens) - at least two output columns - at least one column mapped to ``date`` - either one ``amount`` column OR both ``amount_debit`` + ``amount_credit`` - column boundary count consistent with source-column count Storage is atomic: ``_atomic_write`` goes through a temp file + ``os.replace`` so a crashed save can't leave a half-written JSON at the canonical path. The GUI's build flow saves on most visual-picker changes, so this matters more here than for a "save button" workflow. 24 tests cover slugify, defaults, validation branches, round-trip load/save, missing/corrupt file handling, delete, list (incl. skipping corrupt files), atomic-write rollback, and import/export. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:46:44 +00:00
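The temp-file-plus-``os.replace`` approach that commit aea520d2f7 attributes to ``_atomic_write`` can be illustrated roughly as follows. This is a sketch under assumptions, not the project's implementation: the function name `atomic_write_json`, the temp-file naming, and the `fsync` call are all choices made here.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, data: dict) -> None:
    """Write JSON so that `path` holds either the old or the new content."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the same directory so os.replace stays on one
    # filesystem, which is what makes the final rename atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh, indent=2)
            fh.flush()
            os.fsync(fh.fileno())  # assumption: durability before the rename
        os.replace(tmp_name, path)
    except BaseException:
        # Roll back: drop the partial temp file, leave the original untouched.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

If serialization fails partway through, only the temp file is affected, so a previously saved template at the canonical path remains readable.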
Michael	b8aff862ed	feat(pdf): add pure PDF→DataFrame extraction module Phase 1/6 of the PDF Extractor tool. Pure module — no Streamlit, no user-config I/O — that turns a PDF blob plus a template dict into a ``pandas.DataFrame`` of transaction rows. Primary use case is accountant-style extraction of bank-statement transactions, where each bank's format is encoded as a reusable template. Pipeline: 1. ``extract_pages(pdf_bytes)`` reads with pdfplumber and surfaces words with bounding boxes. 2. ``cluster_rows(words)`` groups words into rows by ``top`` tolerance — no reliance on PDF table-line detection (most bank statements have no visible cell borders). 3. ``assign_columns(row_words, boundaries)`` buckets each word by its horizontal midpoint into N+1 columns defined by N interior x-boundaries. 4. ``_within_table_window`` slices to the band between the header line and the end-marker (e.g. "Closing balance"). 5. ``apply_template`` orchestrates the above, handling: - parens-style negative amounts, currency stripping, custom decimal/thousands separators - separate debit + credit columns combined into a single signed ``amount`` (credit positive, debit negative — accounting register convention; matches QuickBooks/Xero imports) - multi-line description wrapping (rows with empty date column attach to the previous row's description) - row-level regex skip filters (e.g., "Total", "Subtotal") - page-range filters ("all", "2-", "1,3-5") Optional OCR fallback for scanned statements: - ``page_has_extractable_text`` heuristic flags pages with <5 words as likely-scanned. - ``ocr_available()`` checks both the ``pytesseract`` Python binding and the Tesseract binary; surfaces a clear reason string when either is missing. - ``extract_pages_auto`` does text-first, OCR-the-blanks, and returns warnings the UI can surface. 29 unit tests cover the parsing pipeline against synthetic WordBox/Page data — no fixture PDFs required, runs in 0.1s. 
Real PDF extraction is exercised by hand on the user's statements. Dependencies added: - ``pdfplumber>=0.10,<1`` — text + position extraction - ``pypdfium2>=4,<6`` — page rasterization for OCR + visual picker - ``streamlit-drawable-canvas>=0.9,<1`` — visual region picker (used in commit 5) - ``pytesseract>=0.3,<1`` — OCR (used in commit 6; system Tesseract binary required separately) - ``cryptography>=41,<49`` — bumped upper bound; pdfminer.six transitively requires a recent release. Internal ed25519 license-signing usage is API-stable across the bump. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 22:44:51 +00:00
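Steps 2 and 3 of the pipeline in commit b8aff862ed (row clustering by ``top`` tolerance, then bucketing words by horizontal midpoint into N+1 columns) might look roughly like this. The `Word` dataclass, its field names, the `_sketch` function names, and the default tolerance are assumptions for illustration; the module's actual ``cluster_rows`` / ``assign_columns`` signatures beyond what the commit message states are not known here.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Word:
    """Hypothetical stand-in for the module's word-with-bounding-box type."""
    text: str
    x0: float
    x1: float
    top: float


def cluster_rows_sketch(words: Sequence[Word], tolerance: float = 2.0) -> List[List[Word]]:
    """Group words whose `top` lies within `tolerance` of the row's first word."""
    rows: List[List[Word]] = []
    anchor_top = None
    for word in sorted(words, key=lambda w: (w.top, w.x0)):
        if anchor_top is None or word.top - anchor_top > tolerance:
            rows.append([])          # start a new row
            anchor_top = word.top
        rows[-1].append(word)
    # Left-to-right reading order inside each row.
    return [sorted(row, key=lambda w: w.x0) for row in rows]


def assign_columns_sketch(row: Sequence[Word], boundaries: Sequence[float]) -> List[str]:
    """Bucket each word by its x-midpoint into len(boundaries) + 1 cells."""
    cuts = sorted(boundaries)
    cells: List[List[str]] = [[] for _ in range(len(cuts) + 1)]
    for word in row:
        midpoint = (word.x0 + word.x1) / 2.0
        cells[bisect_right(cuts, midpoint)].append(word.text)
    return [" ".join(cell) for cell in cells]
```

Using midpoints rather than left edges means a word that slightly overlaps a boundary still lands in the column holding most of its width.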
Michael	c16e2a5e29	feat(audit): surface log path + /logs link in Help popover Adds a "Log file" section to the sticky-footer Help popover with two affordances: 1. The current audit-log path rendered as monospace text with ``user-select: all`` so a single click selects the whole path for copy-paste into a file manager. Works on every platform — no subprocess required. 2. A "View all logs →" link to the new ``/logs`` page (added in the previous commit) for download/inspection of today's and prior days' files. i18n keys ``footer.help_logs_label`` + ``footer.help_logs_link`` added to en + es packs, matching the existing ``footer.help_*`` naming. ``audit_log_path()`` is wrapped in try/except because a broken audit module MUST NOT take the footer down — falls back to "—". Same defensive pattern the license section uses. Rollback: ``git revert HEAD`` removes the section; the popover and its layout return to the prior shape with zero coupling to the audit module. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 21:26:53 +00:00
Michael	7c9139f199	feat(audit): /logs page — view + download recent audit log files Adds a Streamlit page at ``/logs`` listing every ``datatools-*.jsonl`` file in ``audit_log_dir()`` (7-day window per the retention sweep in `b3ae913`). Each entry shows filename, mtime, byte size, and a ``st.download_button``. Today's file gets its own section at the top. The page also surfaces both paths as copyable monospace text: the active log path (so users can grep/cat it directly on their machine) and the folder path (so they can paste into Explorer / Finder). Wired into navigation via ``st.Page("pages/_Logs.py", ...)`` with ``url_path="logs"``. The sidebar entry is hidden by the same ``hide_streamlit_chrome`` CSS rule that hides ``/activate`` and ``/close`` — same pattern, same ``:has()`` + plain-fallback selectors so the LinkContainer collapses cleanly in modern browsers and the anchor is at least un-clickable in older ones. License gate is OFF for this page (``gate_license=False``) — if a user's license expires they may need logs to file a support request; locking them out of their own audit history would be hostile. Next commit will wire the popover link. Rollback: ``git revert HEAD`` removes the page and its nav entry; the audit log itself keeps working. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 21:24:46 +00:00
Michael	b3ae913bb9	feat(audit): daily filename + 7-day retention sweep Replaces the per-session ``datatools-<ts>-<sid>.jsonl`` filename with a single daily file ``datatools-YYYY-MM-DD.jsonl`` (local date). Sessions on the same calendar day share a file via the writer thread's per-batch open+append; multiple DataTools instances running concurrently on the same day fan into the same file (append-mode small writes are atomic on POSIX, safe-enough on Windows under realistic load). Drops the ``_LOG_PATH`` module global and the lock around it — ``audit_log_path()`` is now pure date math, recomputed on every call so a session that crosses midnight follows the rollover into the next day's file. Adds ``_sweep_old_logs()`` invoked once per process at writer-thread start. Deletes any ``datatools-*.jsonl`` whose mtime is older than 7 days. The glob deliberately matches the legacy per-session filename too, so users upgrading from the previous build don't keep a permanent backlog of pre-retention files. Event ``ts`` fields stay UTC; only the filename uses local date, because users go looking for "today's log" on their wall clock. Tests cover: daily filename shape, sweep removes stale files, sweep keeps fresh files, sweep also clears legacy filenames. Rollback: ``git revert HEAD`` restores the per-session filename and removes the sweep. No data migration needed either way — existing files keep working as JSONL. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 21:22:47 +00:00
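As a rough sketch of what commit b3ae913bb9 describes (a filename derived from the local date, plus an mtime-based sweep over ``datatools-*.jsonl``), assuming hypothetical function names and an explicit `log_dir` argument that the real ``audit_log_path()`` and ``_sweep_old_logs()`` may not take:

```python
import time
from datetime import date
from pathlib import Path
from typing import List, Optional

RETENTION_DAYS = 7  # retention window stated in the commit message


def daily_log_path(log_dir: Path, today: Optional[date] = None) -> Path:
    """Pure date math: recomputed per call so midnight rollover is followed."""
    day = today if today is not None else date.today()  # local date, not UTC
    return log_dir / f"datatools-{day.isoformat()}.jsonl"


def sweep_old_logs(log_dir: Path, now: Optional[float] = None) -> List[Path]:
    """Delete log files whose mtime is older than the retention window."""
    current = now if now is not None else time.time()
    cutoff = current - RETENTION_DAYS * 86400
    removed: List[Path] = []
    # The glob also matches legacy per-session names such as
    # datatools-<ts>-<sid>.jsonl, so old backlogs are cleared too.
    for path in log_dir.glob("datatools-*.jsonl"):
        try:
            if path.stat().st_mtime < cutoff:
                path.unlink()
                removed.append(path)
        except OSError:
            continue  # assumption: a failed delete should never break logging
    return removed
```

Keying the sweep on mtime rather than on the date embedded in the filename is what lets one glob cover both the daily and the legacy naming schemes.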
Michael	ba07dcb6c7	feat(audit): re-enable audit log (kill switch off by default) Phase 1 diagnostic build validated end-to-end on the user's machine: session cf2ebbd5 (2026-05-19) produced session/upload/analyze/nav/session-end events with no blank-pages regression. Root cause of the original symptom was the audit_log_path/_session_id deadlock fixed in `a8ff8f4` — the kill switch is no longer load-bearing. Flips ``_DISABLED: True`` → ``False`` so the default install writes a log. The three env-var overrides (``DATATOOLS_AUDIT_ENABLED``, ``DATATOOLS_AUDIT_TRACE``, ``DATATOOLS_AUDIT_PROBE``) and the writer-thread BaseException guard from `76c9f5a` stay in place as escape hatches if the symptom ever recurs. TestKillSwitchContract continues to pass — it monkeypatches ``_DISABLED = True`` explicitly and doesn't rely on the module default. Rollback: ``git revert HEAD`` flips the switch back without removing the diagnostic instrumentation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 17:50:28 +00:00


221 Commits