datatools-dev

Author	SHA1	Message	Date
Michael	41ab2166ef	build(ci): wire macOS code signing + notarization into release workflow Add a guarded "Sign & notarize macOS app" step to build.yml that signs dist/DataTools.app with the Developer ID (hardened runtime + entitlements + secure timestamp), notarizes via notarytool, and staples the ticket — running before DMG packaging. The step exits 0 with a warning when the MACOS_* secrets are absent, so dry-run dispatches still produce an (unsigned) build. Add build/macos/entitlements.plist with the hardened-runtime entitlements a frozen PyInstaller/CPython app needs (JIT memory, library-validation disabled for bundled .so/.dylib + Tesseract). Update build/README.md to reflect that macOS signing is now wired and only needs the secrets. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 22:56:17 +00:00
Michael	9943e6e537	test(demo): cover the demo app + sales-surface coherence Adds a demo test suite on top of the data-value pins: - tests/gui/test_app_demo.py (new, AppTest): every accounting persona renders with its dataset, the default/unknown-persona fallback resolves to bookkeeper, clicking Run produces the AFTER value (rows reduced to the validated count) with the watermarked download + Gumroad CTA, and switching persona via the quick-switch dropdown clears the stale result. - tests/test_demo_pipelines.py (extended): cross-surface coherence — each persona key served by app_demo has a matching landing page whose iframe (?p=) and CTA (from=) point at it and that the hub links to; no retired Shopify/RevOps language remains in landing HTML; and the demo download still appends exactly one watermark row. Full suite: 2584 passed, 91 skipped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 19:06:50 +00:00
Michael	e7ec79b9b5	demo: retarget landing pages to the accounting audience Reorients the whole sales surface to accounting so it matches the rebuilt demos. Replaces the Shopify and RevOps persona pages with accounts-payable (1099) and accounts-receivable pages, refreshes the bookkeeper page, and rewires the hub + deploy tooling: - landing/bookkeeper/ — refreshed to the validated bank-rec demo (26 -> 20, six phantom duplicates), iframe ?p=bookkeeper. - landing/ap-1099/ — NEW (replaces shopify-pet/): 1099 vendor prep, "24 records -> 8 vendors, 7 missing EINs recovered", iframe ?p=ap-1099, amber accent. - landing/ar-aging/ — NEW (replaces revops/): AR open invoices, "26 -> 21, five double-entered invoices removed", iframe ?p=ar-aging, green accent. - landing/index.html — hub rewritten with the three accounting cards. - deploy.py / deploy.config.example.json / README.md / _shared/styles.css — persona list, sitemap defaults, 404 links, cross-links, docs updated. All demo iframes now point at the renamed app_demo personas; deploy.py builds the dist bundle cleanly (verified) and the Gumroad ?from= tags match. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 18:59:50 +00:00
Michael	6df726e69e	demo: reconstruct sales demos for an accounting audience Replaces the Shopify / RevOps / Bookkeeper demo trio with three accounting personas that share one buyer, each entering through a workflow where a messy export costs money — all running the same saved 4-step pipeline: - bank_reconciliation.csv (Bookkeeper): 26 -> 20 rows, 6 double-posted transactions caught after date+amount standardization. - vendor_1099.csv (AP / 1099): 24 records -> 8 vendors, 7 missing EINs recovered via dedup merge — the 1099-complete story. - ar_open_invoices.csv (AR): 26 -> 21 rows, 5 double-entered invoices removed, blank status backfilled from the twin row. Every number is validated against the live engine and pinned by tests/test_demo_pipelines.py (read path mirrors app_demo._load_demo: dtype=str, keep_default_na=False). Rewires src/gui/app_demo.py PERSONAS (keys bookkeeper / ap-1099 / ar-aging, accounting H1/sub/CTA) and rewrites docs/DEMO-PLAN.md sections 3/4/7 with the validated outcomes. (Repo hygiene forced by a partial-clone gap: finalizes the already-deleted, unreferenced samples/messy_text.csv whose blob was unrecoverable.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 18:52:39 +00:00
Michael	38616d69e2	test(pipeline): complete automated test suite for the pipeline feature Adds ~115 tests pinning the Automated Workflows feature end to end: - tests/test_pipeline.py (+43): per-adapter summary correctness on known inputs, multi-step data flow, error stop/continue contract, empty / single-column / all-disabled edges, dict+file serialization round-trips, recommended_pipeline(include=…), and a synthesized demo integration run. - tests/test_cli_pipeline.py (new, 21): --recommend, dry-run-by-default, --apply output CSV + audit JSON, --steps, --strict abort, arg validation, --continue-on-error vs halt, and a save→load round-trip. Invokes the Typer app directly to bypass the license guard (house pattern). - tests/gui/test_pipeline_builder.py (+9): reorder ▲/▼, disabled edge buttons, disabled-step persistence across reorder, restore-recommended, Advanced JSON export/import, and per-tool Configure panels emitting the correct option dicts (AppTest). - tests/gui/test_pipeline_phrasing.py (new, 30): step_phrase/step_status and the adapter-key→friendly-name bridge as pure functions, incl. pluralization, column prose, and warn/error status derivation. Full suite: 2565 passed, 91 skipped. No product bugs surfaced. Documents the coverage in docs/DEVELOPER.md (test tree + a pipeline-coverage note). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 18:31:15 +00:00
Michael	00d3f28865	feat(pipeline): plain-English per-step result summaries Replaces the raw-JSON summary column in the Results table with the mockup's plain-English phrasing: "312 duplicates removed across 147 groups (18,442 → 18,130 rows)", "1,204 cells cleaned in name & city", etc. (correct singular/plural via a small _n helper). Adds step_phrase() and step_status() to pipeline_modules.py. step_status derives the status pill (✓ ok / ⚠ ok · N skipped / ✗ error / ⏭ skipped) and, for warn/error steps (e.g. format_standardize unparseable cells, column_map coercion failures / missing required targets), an inline detail callout rendered directly below the results table — surfacing non-fatal issues in context without a dedicated always-empty column. Extends tests/gui/test_pipeline_builder.py with phrasing + status assertions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 18:21:17 +00:00
Michael	837f4b88b5	feat(pipeline): visual module-card builder for Automated Workflows Replaces the raw options_json data-editor table with a per-step "module card" builder matching the locked design mockup (layout-review/09_pipeline_runner.html): each step shows a friendly name + caption, an enable toggle, ▲/▼/✕ reorder/remove controls, and a Configure expander that renders that tool's own controls in plain language. Raw JSON is demoted to an Advanced import/export section. New src/gui/components/pipeline_modules.py holds the adapter-key→tool_id friendly-name bridge, one plain-language config renderer per tool (text_clean, format_standardize, missing, column_map, dedup — emitting the exact JSON option shapes the core adapters accept), and render_step_card. Steps live in session state as an ordered list with stable ids so widget keys survive reorder/remove. Reorder is ▲/▼ buttons (no JS drag dependency). The on-disk/CLI pipeline JSON format is unchanged — CLI and src/core untouched. Adds tests/gui/test_pipeline_builder.py (AppTest) covering seed, configure panels, toggle/add/remove, and a full run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 18:16:09 +00:00
Michael	fd9606c67b	build: drop the local Python release method, return to CI-only installer builds Removes the single-command Python packaging method (build/make_release.py + build/build_portable_zip.py + build/macos/build_zip.sh) and the portable .zip artifacts it produced. Release builds go back to the original GitHub Actions process: the CI matrix builds one installer per platform (.dmg / .exe / .AppImage) on tag push and attaches them to a GitHub Release. Tesseract OCR bundling is preserved: the fetch helpers the workflow depends on (fetch_tessdata, fetch_tesseract_for_platform) are extracted into a standalone build/tesseract.py, which build.yml now imports. Docs (README, build/README, DEVELOPER, TECHNICAL, USER-GUIDE, vendor README, es translations) updated to drop the portable-zip flavor and point at the new module. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 17:47:36 +00:00
Michael	28ab51a869	Merge ui-redesign: journey-level UX redesign + live-app port Brings the design-review mockups and the highest-leverage live-app changes into main: - layout-review/ mockups: 12-page review addressed; front door, taught pipeline order, consistent intake, coming-soon stubs, shared tokens. - Live src/gui/: nav reordered to pipeline order with new Finance + Coming-soon groups; Home is the "Start here" front door with a one-click "Clean these files for me" pipeline runner; local-first pill on every working tool header. - DECISIONS.md: PDF to CSV + Reconcile kept in-bundle under Finance. Full suite green: 2441 passed, 91 skipped, 0 failed. Follow-ups tracked (not blockers): streamlit-run visual verification of the live UI; i18n keys for the front-door copy (English literals today); rebuild the live coming-soon stub page bodies.	2026-06-08 17:41:30 +00:00
Michael	1895074b8f	test+fix(gui): retire the now-empty "analysis" nav section The journey-level nav restructure moved Home to a standalone "Start here" entry and Reconcile into the "Finance" group, leaving the "analysis" section with zero tools. Two registry tests encoded the old layout and failed: - test_every_section_has_at_least_one_tool[analysis] (empty section) - test_reconciler_present (asserted section == "analysis") Drop "analysis" from the Section literal, SECTION_LABELS, and app.py's by_section bucket — it's genuinely dead now (home isn't a registry Tool). Update the presence tests to assert Reconcile + PDF to CSV live in "finance". The section-invariant tests (every section non-empty, has a label, no orphan labels) are preserved and pass. Full suite: 2441 passed, 91 skipped, 0 failed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 17:11:02 +00:00
Michael	d807d3c11b	feat(gui): add the one-click "Clean these files for me" front door Issue #1 (the make-or-break UX fix): after the analyzer runs, Home now leads with a primary "Clean these files for me" CTA that runs the recommended pipeline (Clean Text -> Standardize -> Fix Missing -> Find Duplicates, in order) on every imported file and hands back a cleaned CSV per file — collapsing "which tool, what order" to one click. The existing per-finding cards remain, reframed as "Or fix issues one at a time" for users who want manual control. - Reuses the core API verbatim (recommended_pipeline + run_pipeline); reader mirrors 9_Pipeline_Runner._read_uploaded so files load the same way the standalone orchestrator loads them. - Per-file errors are captured so one bad file doesn't kill the batch; cleaned CSVs are cached in session_state so downloads survive reruns and are pruned when a file is removed or re-analyzed. Verified: the read -> run_pipeline -> CSV data path executes correctly (compile + a non-Streamlit functional smoke test). The Streamlit UI scaffolding (button / download_button / progress / session_state) mirrors the proven runner page but still needs a `streamlit run` check. Front-door copy is English literals for now; i18n keys are a follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 17:06:30 +00:00
Michael	09ec01e98b	feat(gui): port journey-level nav + local-first pill to the live app Brings the live Streamlit app in line with the finalized layout-review mockups (structural/low-risk changes; verified by compile + registry sanity, still pending a streamlit-run visual check): - tools_registry: Data Cleaners now in pipeline order (Clean Text -> Standardize -> Fix Missing -> Find Duplicates); new "finance" section (Reconcile, PDF to CSV) and "coming_soon" section (Find Unusual, Quality Check, Combine Files). Adds those to the Section type + SECTION_LABELS. - app.py: Home becomes the "Start here" front door — a standalone, unlabeled top entry (play_circle icon) ahead of the hidden Activate/Logs/Close pages; nav groups reordered cleaners -> transformations -> automations -> finance -> coming soon. - _legacy.py: render_tool_header now shows the "Runs 100% locally" privacy pill (right-aligned, Ready tools only — omitted on Coming Soon stubs); accent emphasis CSS for the Start-here nav link. - i18n: add nav.start_here_title, nav.section_finance, nav.section_coming_soon to en + es packs. - DECISIONS.md: log the PDF/Reconcile in-bundle (Finance group) call. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 17:01:57 +00:00
Michael	48251b625f	refactor(layout-review): consolidate tool-header actions + align reconcile downloads Consistency pass over the parallel-agent work: - Replace 4 divergent inline header wrappers (flex/inline-flex, gap 10/12px, margin-top present/absent across 8 tool pages) with one shared .dt-tool-header-actions class; strip the now-redundant per-button margin-top:0. Every tool header now aligns the local-first pill + Help button identically. - Reconcile downloads row: reorder to the page's exceptions-first order (Review, Unmatched left, Unmatched right, Matched) to match the tabs and metric strip, and drop the lone competing primary — the four are parallel exports of equal weight. Audited and confirmed already-consistent: compact intake banner, privacy pill markup, .dt-next-step strips, the three coming-soon stubs, primary CTAs, and the 3-download CSV/audit/config pattern. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 16:50:25 +00:00
Michael	dd0942d710	feat(layout-review): journey-level redesign — front door, taught order, consistency Addresses the journey-level review (the app felt like 12 tools sharing a stylesheet, not one guided product). File-partitioned changes: Navigation (shell.js): rename Home -> "Start here" with front-door emphasis (.dt-nav-start); reorder Data Cleaners into pipeline order (Clean Text -> Standardize -> Fix Missing -> Find Duplicates); new "Finance" group (Reconcile, PDF to CSV); all stubs moved to a bottom "Coming soon" group, no longer interleaved with working tools. Front door (home.html): a prominent primary "Clean these files for me" that runs the recommended pipeline in order, above the existing per-finding cards (reframed as "fix one thing at a time"). Shared tokens (app.css): .dt-next-step suggestion strip + .dt-nav-start. Teach the order: a slim .dt-next-step strip at the end of each linear cleaner page points to the next pipeline step (Map Columns -> Start here; orchestrator/Finance pages correctly omit it). Local-first: the green "Runs 100% locally" pill now sits in every working tool page's header (home + 8 tools), where client data is entered. Plain English: jargon relabeled on input controls (coerce, E.164, NFC/NFKC, sentinels, survivor rule), technical terms kept in tooltips and audit/output cells only. Stubs (06/08/07): rebuilt to one identical skeleton — info line + plain feature list + a real "Notify me when this ships" button; every disabled control and uploader removed (a dimmed dropzone reads as broken). Intake: full dropzone+chip replaced with the compact "Using <file>" banner on Clean Text, Fix Missing, Find Duplicates, and both Reconcile sides. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 16:44:11 +00:00
Michael	cf31d9ef14	feat(layout-review): address review findings on pages 7-12 Find Duplicates (01_deduplicator): - Delete the redundant outer Options wrapper; surface threshold + survivor rule directly, push the rest behind a single Advanced pane. - Disambiguate competing primaries: top result is an auto-resolved preview (secondary download), review decisions are the single primary. - Plain-English match labels (exact / approximate); clarify the third. - Lift the match-card caption to a one-time instruction; note delimiter is delimited-text-only. Quality Check (08_validator_reporter) — stub: - Remove the dead disabled "Load rules file (JSON)" uploader so the stub invites a single action; keep the informative feature list. Map Columns (05_column_mapper): - Regroup schema -> mapping -> strategy/advanced (core task contiguous). - Make preset-vs-Advanced precedence legible (Custom + modified marker). - Adopt the compact file-intake banner; drop the duplicate resolved-mapping table; fix the add-row gutter style. Combine Files (07_multi_file_merger) — stub: - Actually disable the Merge CTA (add the disabled attribute). PDF to CSV (10_pdf_extractor): - Drop page/raw from the default preview to match export + fix the horizontal clip; surface raw via per-row affordance + overflow-x. - Move the column selector above the download button; give auto-excluded rows a reason; align the files card to Home; de-dupe the row count. Automated Workflows (09_pipeline_runner): - Replace hand-edited JSON step config with per-step control expanders; JSON moved behind Advanced import/export. - Editing the table marks the mode modified; fold the empty error column into the status pill; render summaries as plain English; collapse the explainer by default. Cross-cutting items (stub standardization on page 10, shared disabled-field token, remaining intake rollout) deferred to a holistic pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 16:35:46 +00:00
Michael	563d845b70	feat(layout-review): address review findings on pages 4-6 Find Unusual Values (06_outlier_detector) — coming-soon stub: - Anchor the disabled Method on IQR (multiplier 1.5), not Z-score, per the logged robustness decision. - Drop the redundant feature bullet list (kept alert + greyed controls + disabled button); also fixes the MAD-only-in-bullets mismatch. - Remove the live uploader that dead-ended into disabled controls. Clean Text (02_text_cleaner): - Add an inline hidden-character legend (3 swatches reusing the actual badge classes) beside the canonical "Show hidden characters" toggle. - Unify the two hidden-char toggles: preview one is canonical; the Results bare checkbox is wrapped in a field + bound note. - Describe all three presets (minimal / excel-hygiene / paranoid). - Give "Changes by column" a real "column" header instead of the grey index-gutter style. Standardize Formats (03_format_standardizer): - Make preset-vs-control precedence legible: preset shows Custom with a "modified" marker + base tag, diverging controls flag the winning value (same pattern as Fix Missing Values). - Replace the dead-end unparseable alert with a real "Unparseable cells (47)" expander the alert now points to. - Honest preview caption: "5 of 6 columns (notes skipped)". Intake pattern (the cross-page reference) left untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 16:27:42 +00:00
Michael	be1e263223	feat(layout-review): address Fix Missing Values review findings - Pin down strategy precedence: add a resolution-order legend (per-column -> global -> preset), dim/strike the preset radios when a global strategy overrides them, and add a "Resolves to" column to the per-column override table so the winning value is legible. - Make the demo state honest: Global strategy = median is what drives the 1,043 fills, resolving the detect-only contradiction. - Surface the missingness profile as an always-visible block above the (now-open) Options expander — diagnostic before configuration. - Stop highlighting unchanged before/after cells (respondent_id 0->0); show "(global)" placeholders in unset per-column override cells. - Fold the standalone "Strategy applied per column" table into the before/after table as a strategy column; inset maxed slider knobs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 16:23:32 +00:00
Michael	7ebfd0f153	feat(layout-review): address Reconcile page review findings - Fix doubled "Invert right amount sign" label: keep the field label, strip the checkbox caption to the box only (also evens the 3-up row). - Reorder results exceptions-first: tabs and metric strip both run Review -> Unmatched left -> Unmatched right -> Matched, with Review the default active tab and its table as the inline content; Matched demoted to a trailing context expander. - Surface the "references must match left count" rule with an inline validation indicator under the right reference field instead of a label note alone. - Mark the required Amount join key with the .req accent star on both sides so it reads distinct from the optional date/description pickers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 16:17:20 +00:00
Michael	2592604067	feat(layout-review): address Home page review findings - Findings card no longer truncates silently: panel #1 gains a .dt-finding-more overflow control ("Show all 8 findings · 5 more"). - Replace the dead "Files analyzed: 3" stat (restated the section meta + visible rows) with "Rows scanned" — info not already on screen. - Collapsed findings panels use a real .is-collapsed state variant instead of inline margin-bottom:-16px hacks, so states can't drift. - Action bar buttons are content-sized; drop the 340px island that jarred against the full-width divider/stats below it. Branding kept as deliberate landing-style treatment on Home (per review decision); interior tool pages remain title-only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 16:14:04 +00:00
Michael	58d0009849	refactor(layout-review): inline assets beside pages Move app.css and shell.js into layout-review/ alongside the .html files and reference them by bare filename; drop the assets/ subfolder. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> rollback-2026-06-08	2026-06-08 15:43:31 +00:00
Michael	b6c39d7a09	refactor(layout-review): move assets to repo root Relocate assets/ (app.css, shell.js) from layout-review/ up to the repo root and rewrite every page's link/script refs to ../assets/. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 15:31:53 +00:00
Michael	b2fa8503e6	chore: add layout-review HTML mockups Static layout mockups for each app tool (deduplicator, text cleaner, format standardizer, missing handler, column mapper, outlier detector, multi-file merger, validator/reporter, pipeline runner, PDF extractor, reconciler) plus index/home shells and shared assets. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 15:28:23 +00:00
Michael	b703911df3	docs: reflect bundled Tesseract on every install surface - NEW LICENSE_TESSERACT.txt at the repo root: header noting it covers the bundled Tesseract OCR binary (Apache 2.0, upstream tesseract-ocr/tesseract, copyright Google + contributors) and the eng.traineddata from tessdata_best (also Apache 2.0). Clarifies DataTools itself remains proprietary. Full canonical Apache 2.0 license text included. - README.md + README.es.md (Download section): bumped size estimate ~200 MB → ~300 MB, added a short paragraph stating Tesseract OCR is bundled (no separate install required), with a link to the new license file. - docs/USER-GUIDE.md + docs/USER-GUIDE.es.md (§1.6 System requirements): bumped disk estimate, added a paragraph stating Tesseract 5.5 + eng.traineddata ship inside every installer / portable / AppImage, with a source-install fallback hint pointing developers to DEVELOPER.md. - docs/DEVELOPER.md: new "PDF Extractor — bundled Tesseract" section documenting the runtime layout (sys._MEIPASS / tesseract / …), discovery order, source of bytes (build/vendor/tessdata + per-platform fetch in make_release.py), version pin, update recipe. - docs/TECHNICAL.md: new §3.10 "Bundled Tesseract (PDF Extractor OCR)" — short version of the discovery order for the build pipeline section. - build/README.md: distribution-outputs paragraph now lists Tesseract among bundled deps with the ~250-300 MB estimate; new "Tesseract bundling" section: layout diagram, resolver order, source of bytes + 5.5.0 pin, update steps, license-file ref. Out-of-scope gaps noted by the docs sweep: - docs/FUTURE-TOOLS.md §D still describes Tesseract bundling as a high-risk packaging headache; now superseded. Worth a one-line "(resolved — bundled as of v1.x)" callout in a future pass. - USER-GUIDE §2 "What's included" table doesn't list PDF Extractor at all (it shipped in b8aff86…967d3f6). Separate gap to close. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 18:20:50 +00:00
Michael	93ccada974	build: bundle Tesseract 5.5.0 + tessdata into every release artifact End users no longer have to install Tesseract separately for OCR on scanned PDFs — the engine ships inside the installer, portable .zip, and AppImage for all three platforms. Per-platform fetch in build/make_release.py (run before PyInstaller): - Windows: download UB-Mannheim installer 5.5.0.20241111, extract with 7-Zip, copy tesseract.exe + required DLLs into the staging dir. - macOS: ``brew install tesseract``, copy binary + every Homebrew-prefixed dylib resolved via otool -L (recurse one level for transitive deps), then install_name_tool rewrites IDs / load paths to @loader_path/... so the bundle is relocatable. - Linux: ``apt-get install tesseract-ocr libtesseract5``, copy binary + every non-system .so from ldd output, patchelf --set-rpath '$ORIGIN'. Wire-up: - build/datatools.spec reads DATATOOLS_TESS_STAGING env var (set by make_release) and adds the staging dir + tessdata + the LICENSE_TESSERACT.txt Apache 2.0 attribution to PyInstaller datas so they land at <bundle>/tesseract/{tesseract[.exe],tessdata/} and the license sits at the bundle root. Soft-warns when staging is empty so dev spec runs still complete. - English tessdata pulled by fetch_tessdata() from tesseract-ocr/tessdata_best (eng.traineddata, ~16 MB). Cached at build/vendor/tessdata/. - .github/workflows/build.yml: actions/cache@v4 step keyed on ``tesseract-${runner.os}-5.5.0-tessdata_best-v1`` caches the staging dir and the vendored tessdata across runs; apt installs patchelf on the Linux runner; PyInstaller step now receives the DATATOOLS_TESS_STAGING env var. - .gitignore: build/_tesseract/ and the .traineddata blob. - TESSERACT_SKIP_FETCH=1 honored for offline / manual stages. - Installer / .dmg / .zip / AppImage scripts: one-line comments confirming Tesseract rides along automatically via PyInstaller's datas (no extra packaging steps required in those scripts). Bundle-size delta: ~50-70 MB on disk per platform, ~25-40 MB post-compression. Net installer size ~250-300 MB (was ~120 MB) — accepted tradeoff for zero end-user OCR setup. Reversal of the prior "don't bundle Tesseract" decision (option A). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 18:20:33 +00:00
Michael	17faf84aed	feat(pdf): probe bundled Tesseract first when running frozen Adds runtime support for the bundled Tesseract that ships inside the DataTools installer / portable / AppImage artifacts. When DataTools is launched from a PyInstaller frozen bundle the OCR engine now resolves automatically — no end-user install required. New helpers in src/pdf_extract.py: - _bundled_tesseract_path() → Path | None — returns <sys._MEIPASS>/tesseract/tesseract[.exe] when getattr(sys, "frozen", False) AND sys._MEIPASS are present; None in dev. - _bundled_tessdata_dir() → Path | None — same gating, returns <sys._MEIPASS>/tesseract/tessdata. - _apply_bundled_tessdata_prefix() — sets TESSDATA_PREFIX to the bundled tessdata dir before any pytesseract call; only if frozen, dir exists, and the user hasn't already overridden the env var. Discovery order in ocr_available() / _autodetect_tesseract_path(): 1. DATATOOLS_TESSERACT_PATH env override (existing) 2. Bundled binary (NEW — frozen-only) 3. System PATH (existing) 4. Windows well-known install dirs (existing legacy fallback) In dev (not frozen) every new probe is a no-op so the developer experience is unchanged. 12 new tests cover frozen vs. non-frozen detection on each platform, the user-override respect for TESSDATA_PREFIX, autodetect priority ordering, and the no-bundled-dir graceful path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 18:19:52 +00:00
Michael	4d8513b1a3	docs: cover help popover, +/- nav indicators, render_tool_header User-facing docs (USER-GUIDE en+es, README en+es): - New short paragraph under §3.1 GUI noting the in-tool Help button on every detail page, what it contains (When to use / Steps / Examples / Tip), and that content lives in tools.<id>.help_md. - One-line note in the README tool tables pointing at the same. - Mention the sidebar +/- nav indicators replacing Streamlit's default Material Symbols chevron. Developer docs: - DEVELOPER: new "Tool page header" subsection documenting render_tool_header(tool_id), the help_md markdown skeleton, and the fallback to help.missing_body when a tool's help is absent. Update i18n authoring rules to list help.* keys and the per-tool help_md field alongside name/description/page_title/page_caption. - TECHNICAL: new §10c documenting the sidebar nav indicator swap — CSS in _HIDE_CHROME_CSS plus _SWAP_NAV_SECTION_INDICATOR_JS injected through the hide_streamlit_chrome() iframe bundle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 18:08:01 +00:00
Michael	ac94208d8f	chore: production-readiness sweep on the help-popover wave - Drop unused 'from src.i18n import t' from pages 1-9 (the swap to render_tool_header(tool_id) means no page calls t() directly anymore). Pages 10, 11 and the underscore-prefixed pages were already clean or legitimately use t(). - Rewrite PDF Extractor help_md (en + es). The original prose described features the tool does NOT have — template drawing, per-source saved templates, automatic reuse. The actual tool is a heuristic batch scanner (per its own docstring: "No templates, no per-bank configuration"). New copy: scan → uncheck → pick date format → enable OCR if needed → download. Spanish version tagged with '<!-- TODO: review Spanish -->' since the prose is best-effort. - Document why both stSidebarNavSectionHeader (legacy, streamlit~=1.35) and stNavSectionHeader (current, 1.57) testids appear in the chrome CSS — requirements floor is streamlit>=1.35,<2 so dropping the legacy selector would silently break the lower bound. - Pin the t()-returns-key-on-miss contract that render_tool_header's fallback path depends on, with a comment at the call site. - Pin the demo's intentional skip of hide_streamlit_chrome (so the +/- sidebar swap JS doesn't ever try to load there) with a load-bearing comment in app_demo.py. - Confirmed i18n parity: every tool id has page_title / page_caption / description / name / help_md in BOTH packs; help.button_label and help.missing_body in both. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 18:07:33 +00:00
Michael	4955fb239b	test: cover help_md keys, header smoke, and bilingual ES smoke Two stale Spanish smoke assertions still expected English page titles for PDF Extractor and Reconciler — the i18n work landed real translations ("PDF a CSV", "Reconciliar dos archivos"), so refresh the expected substrings and the surrounding comment. Add new coverage for the help-popover feature: - TestHelpPopoverKeys (test_lang_packs): every tool_id resolves a non-empty tools.<id>.help_md in BOTH packs; help.button_label and help.missing_body resolve in both. - TestDescriptionCopy (test_tools_registry): every Tool.description non-empty and under 120 chars — pins the post-jargon-scrub copy so future drift back into multi-clause prose is loud. - TestRenderToolHeaderSmoke: render_tool_header is callable, listed in components.__all__, and every i18n key it touches resolves in both packs. Runs without a Streamlit script context. Suite: 2427 passed (+9 new), 91 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 18:07:19 +00:00
Michael	4a8961d58a	fix(gui): keep tool-page Help button on one line at narrow widths When the viewport shrunk, the help popover button in the title row was wrapping its label vertically — ``[icon]`` over ``Help`` — because the button was set to use_container_width=True and the column it sat in collapsed below the button's natural width. Two-pronged fix: - Set use_container_width=False on the popover so the button sizes to content (icon + label) instead of stretching to the column. - Widen the column ratio from [10, 1] to [8, 2] so there's room for the button without forcing the title text to truncate. - Add CSS pinning ``white-space: nowrap`` on every popover button (and its inner div / p) as defense-in-depth — even if the button does get squeezed, the label can't wrap. ``min-width: max-content`` keeps the button from compressing below its content. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 17:54:41 +00:00
Michael	fe4b5dc755	fix(sidebar): correct testid + JS swap so +/− actually renders The prior attempt used data-testid=stSidebarNavSectionHeader, which is not what Streamlit 1.57 emits — the correct testid is stNavSectionHeader (verified against the bundled JS in streamlit/static/static/js/). The section header is also a <div> with onClick, not a <button>, and the React component keeps the expanded state in a prop without surfacing aria-expanded on the DOM. Pure CSS can therefore neither locate the header nor switch the glyph by state, which is why the chevron was unchanged in the rendered UI. Switch strategies: - CSS now targets the correct stNavSectionHeader / stIconMaterial selectors, drops the Material Symbols font from the icon span, and restyles it so a plain ascii character reads as proper typography (size, weight, color, hover). - Add _SWAP_NAV_SECTION_INDICATOR_JS — small inline script that rewrites the icon's text node from "expand_more"/"expand_less" to "+"/"−" (U+2212), throttled via requestAnimationFrame, re-applied on every DOM mutation by a MutationObserver. Bundled into the same iframe injection as the existing brand/upload/findings scripts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 17:52:47 +00:00
Michael	209b5fb1aa	style(sidebar): swap expand chevrons for +/− indicators on nav sections Streamlit's default sidebar section header uses a Material Symbols expand_more chevron — three different icons (chevron down, chevron up, sometimes a plain triangle) depending on version, all of which felt inconsistent with the rest of the chrome. Hide the built-in icon (svg / material-symbols span — covered with multiple selectors for cross-version durability) and render our own glyph as a right-aligned pseudo-element on the section-header button, keyed off the standard ARIA aria-expanded attribute: - collapsed → "+" - expanded → "−" (U+2212, visually balanced with +) Hover deepens the indicator color to match the surrounding nav-link hover treatment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 17:23:49 +00:00
Michael	904356f4e8	feat(gui): inline Help popover next to every tool's title Adds a contextual Help button on each detail page, right of the title. Clicking it opens a Streamlit popover with a one-shot how-to: when to use, numbered steps, before→after examples, and an optional one-line tip. Designed to be scannable — no paragraph prose. Implementation: - New ``render_tool_header(tool_id)`` helper in components replaces the bare ``st.title(...) + st.caption(...)`` block on each of the 11 tool pages. Title in the wide column, popover in a narrow right column; caption sits on its own line beneath. - Help content is one markdown blob per tool stored in i18n under ``tools.<id>.help_md`` (en + es). Editors can tweak copy without touching Python. - ``help.button_label`` and ``help.missing_body`` keys added to both packs for the popover trigger and the empty-tool fallback. All 11 tool pages now use the same header pattern — including the PDF Extractor and Reconciler which previously had hardcoded title/ caption pairs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 17:21:55 +00:00
Michael	7203a81af7	copy: strip jargon from tool descriptions and captions Prior round only touched page_caption; the description field (shown on home grid cards) still said "imputation", "missingness", "winsorization", "schema coercion", "fuzzy matching with normalization", etc. The audience is non-technical buyers — they shouldn't need a stats or DB-admin vocabulary to read a tool card. Rewrite both description and page_caption across en, es, and the tools_registry (the fallback source of truth) using everyday words: blanks instead of nulls, fill in instead of impute, look wrong instead of statistical outliers, etc. Same one-line shape as before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 17:09:52 +00:00
Michael	dd3b9bd59d	copy: tighten tool-page captions to one plain-English line Each tool's page caption is what tells a user what the tool actually does the moment they land. They were inconsistent — some terse, most multi-clause with a redundant "Runs locally — your data never leaves this computer" trailer that's already a privacy pill on Home. Rewrite every caption (en + es) as a single ~60-80 char action-first line. Replaces the hardcoded multi-line Reconciler caption with the same shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 14:34:34 +00:00
Michael	2bd94c4441	docs: document installer + portable downloads in en/es Repo READMEs now show both download flavors side-by-side with first-launch warnings (SmartScreen, Gatekeeper) and link to the deeper walkthrough. USER-GUIDE §1 rewritten from a 9-line stub into six subsections: - §1.1 Windows: installer (5 steps) + portable (4 steps) - §1.2 macOS: DMG (5 steps incl. right-click-Open) + portable - §1.3 Linux: AppImage flow (unchanged) - §1.4 First-launch: port selection, localhost binding, browser open - §1.5 How the GUI works - §1.6 System requirements §6 Troubleshooting picks up portable-specific items: Safari unzip quirks, antivirus quarantine on Win portable, license file location. docs/README and Spanish mirrors updated to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 19:30:28 +00:00
Michael	9c426194b1	build: add single-command release script + portable zip artifacts One-developer workflow: ``python build/make_release.py`` on each target OS produces both the installer and a portable .zip for that platform. Preflight checks PyInstaller / Pillow / iscc / hdiutil / ditto / appimagetool and bails with install hints if anything is missing — no half-built dist/. New scripts: - build/make_release.py — orchestrator, auto-detects host OS. - build/generate_icons.py — icon.ico / icon.icns / icon.png from src/gui/assets/datatools_icon_256.png (Pillow ships ICO + ICNS writers; no platform tooling needed). - build/build_portable_zip.py — Win/Linux portable zip via stdlib. - build/macos/build_zip.sh — Mac portable .app via ditto so bundle metadata survives. installer.iss now adds: Quick Launch task (opt-in, legacy Win 7), App Paths registry entry (Win+R "DataTools" works), SetupIconFile, UninstallDisplayIcon, AppSupportURL, AppUpdatesURL. CI workflow uploads installer + portable per platform and attaches both to GitHub Releases on tag push. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 19:30:17 +00:00
Michael	6627895a10	test: fix v3 branding drift, add reconcile CLI + registry coverage GUI/lang-pack tests were asserting against pre-v3 strings ("Data Cleaning Mastery", "Maestría en limpieza…") that the brand refresh replaced with "UNALOGIX DataTools" + "Clean. Normalize. Transform." Updated assertions to the current copy and switched the findings panel tests to the redesigned flat-list layout (per-finding "Open Tool →" buttons instead of per-tool expanders). New coverage: - tests/test_cli_reconcile.py (13) — preview/apply, tolerance flags, sign inversion, key flags, error paths, Excel input. - tests/test_tools_registry.py (27) — unique tool_ids, page_slug → real file, valid sections/tiers, localized accessor fallbacks, explicit pins for PDF Extractor + Reconciler entries. - tests/test_reconcile.py — one-side-empty, key-pass tagging, additional validation cases, input-DataFrame immutability. - tests/gui/test_smoke.py — PAGE_SLUGS now includes 10_PDF_Extractor and 11_Reconciler in both en/es. - tests/gui/test_workflows.py — TestPdfExtractorWorkflow and TestReconcilerWorkflow render checks. Net: 2317 passed → 2418 passed, 0 failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 19:30:02 +00:00
Michael	ea99e292d2	feat(nav): group Home + Reconcile under a new "Analysis" section Home now appears in the sidebar as "File Analysis" under a labeled "Analysis" section together with Reconcile Two Files — both pages are data-analysis workflows (importing/profiling files vs. matching across files), so grouping them clarifies the sidebar's mental model. - tools_registry: new ``analysis`` Section; reconcile moves out of automations into it. - i18n: ``nav.section_analysis`` + ``nav.file_analysis_title`` added to en.json and es.json. - app.py: home dropped from the unlabeled section and surfaced at the top of the Analysis group; ``default=True`` preserved so first-visit routing is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 23:11:06 +00:00
Michael	0be59c0f03	fix(gui): shrink white-bar compensation to ~1/4 of original gap Plain ``min-height: 100vh`` left a ~15vh white bar below ``.stApp`` (the zoom: 0.85 scaler shrinks visual height to 85%). Reinstate the stretching but stop short of the full ``100vh / 0.85`` overflow: ``calc(96vh / 0.85)`` fills 96vh visually and leaves a ~4vh bar — a quarter the size, no longer dominating the page. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 23:06:32 +00:00
Michael	3a3a9a895b	fix(gui): stop overstretching pages, restore footer clearance Two layout bugs were hiding the bottom of every tool page behind the sticky footer: 1. ``.stApp`` and the main/sidebar containers were forced to ``min-height: calc(100vh / 0.85)``, ≈ 17.6% taller than the viewport, to mask a white bar caused by the ``zoom: 0.85`` scaler. That hack stretches short pages and pushes long-page content past the visible area. Drop the calc factor — plain ``100vh`` fills the visible viewport without forced overflow. 2. ``render_sticky_footer``'s stylesheet re-set the block container's ``padding-bottom`` to ``2rem``, overriding the ``7rem`` reserved by ``hide_streamlit_chrome``. The footer (~40px tall) needs more than 32px of clearance, so the last row of content was sliding behind the footer. Remove the override and let chrome's reservation stand. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 23:03:52 +00:00
Michael	d090f8cb5e	feat(reconcile): auto-detect role columns, preview result tabs Match-settings selectors now reorder per side to match the file's column order, using name heuristics (amount / date / desc) so a typical bank CSV reads Date → Description → Amount → Reference without manual fiddling. Detected columns also pre-fill as the default selection. Result tabs render at most 25 rows with a "preview of N of M" caption; full data is still available via the existing download buttons. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 22:39:47 +00:00
Michael	e44af3a45e	feat(reconcile): two-source reconciliation tool Bank-feed-vs-ledger style matcher: 4-pass greedy assignment (key → exact → tolerance → fuzzy) with ambiguous candidates routed to a review bucket instead of arbitrary picks. CLI mirrors the cli_text_clean preview/--apply pattern; Streamlit page registered in the automations section. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 22:33:14 +00:00
Michael	450d4fc9a8	feat(pdf): default output date format to YYYY-MM-DD User asked to flip the default from YYYYMMDD to YYYY-MM-DD. ISO is the better default for an accountant CSV workflow: - Lexicographic sort = chronological sort (no parsing needed). - Every spreadsheet tool the user might import into recognises it as a real date with no ambiguity (US vs EU readers can't disagree on the order). - Hyphens make the year/month/day boundaries scan-able by eye. Concrete changes: - New module constant ``DEFAULT_DATE_FORMAT = "%Y-%m-%d"``, used as the default for ``format_date()`` and the ``output_date_format`` keyword on ``scan_pdf_for_transactions``. - Page's ``_DATE_FORMAT_CHOICES`` reordered so the ISO entry is first (index 0 = default Streamlit selection); YYYYMMDD drops to second. - Custom-strftime input default also flips to ``%Y-%m-%d``. Tests updated to reflect the new default (``test_dates_formatted_iso_by_default``, ``test_short_dates_get_year_from_period``, ``test_compact_format_round_trip``, plus a new ``test_default_is_iso`` for the format_date helper). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 02:04:34 +00:00
Michael	a0042d4aba	feat(pdf): Dec/Jan-aware year inference + filename hint + override Previous year inference picked ``period_end_iso[:4]`` for every short date, which fails on statements that cross the Dec/Jan boundary. A "12/30" row in a 2024-12-16 to 2025-01-15 statement got 2025-12-30 (wrong) instead of 2024-12-30. New cascade for ``_infer_year_for_short_date``: 1. ``override_year``: caller supplies it (new ``"Override year for short dates"`` field in Scan options). Beats every heuristic. Empty by default; the page validates the value is a 4-digit-looking integer in 1900-2100 and falls back to automatic on garbage input. 2. Statement period start + end: the function now takes BOTH dates and generates candidates with every distinct year in the period (one year for same-year statements, two for Dec/Jan boundaries). The picker scores each candidate by distance from the period: candidates inside the period score 0, candidates outside score ``min(|days from start|, |days from end|)``. Lowest-distance candidate wins. So: - ``12/30`` + period 2024-12-16 to 2025-01-15 → 2024-12-30 (inside period, score 0) - ``01/05`` + same period → 2025-01-05 (inside, score 0) - ``12/15`` + same period → 2024-12-15 (1 day before, closer than 2025-12-15 which is 11 months after) 3. ``filename_year_hint``: fallback when the statement period regex misses the bank's specific layout. The page passes ``year_from_filename(upload.name)`` automatically so files like ``eStmt_2025-01-13.pdf`` get year 2025 even if the PDF's text doesn't yield a parseable period. The regex matches the first ``20XX`` token bounded by non-digits. Both new helpers (``year_from_filename`` and the new ``_try_short_date_with_year`` factor-out) are exported and tested.
16 new tests cover: within-period inference (same-year sanity), Dec/Jan boundary cases for both sides, the just-before-period closer-distance case, override priority, filename fallback, no-signal None, dash-format / month-name shorthand round-trip, garbage input, filename year extraction (eStmt pattern, embedded, first-match-wins, no-match, empty). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:59:30 +00:00
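The cascade described in a0042d4aba can be sketched roughly as follows. This is an illustration only, not the repository's code: `infer_short_date` and `_safe_date` are hypothetical names, the signatures are assumed, and the regex is one possible reading of "first ``20XX`` token bounded by non-digits".

```python
from __future__ import annotations

import re
from datetime import date


def year_from_filename(name: str) -> int | None:
    """Return the first 20XX token bounded by non-digits (assumed regex)."""
    match = re.search(r"(?<!\d)(20\d{2})(?!\d)", name)
    return int(match.group(1)) if match else None


def _safe_date(year: int, month: int, day: int) -> date | None:
    """Build a date, returning None for impossible combinations (e.g. 02/30)."""
    try:
        return date(year, month, day)
    except ValueError:
        return None


def infer_short_date(
    month: int,
    day: int,
    period_start: date | None = None,
    period_end: date | None = None,
    override_year: int | None = None,
    filename_year_hint: int | None = None,
) -> date | None:
    # 1. An explicit override beats every heuristic.
    if override_year is not None:
        return _safe_date(override_year, month, day)

    # 2. Score one candidate per distinct year in the statement period:
    #    0 inside the period, otherwise distance in days to the nearer edge.
    if period_start is not None and period_end is not None:
        best: tuple[int, date] | None = None
        for year in range(period_start.year, period_end.year + 1):
            candidate = _safe_date(year, month, day)
            if candidate is None:
                continue
            if period_start <= candidate <= period_end:
                score = 0
            else:
                score = min(
                    abs((candidate - period_start).days),
                    abs((candidate - period_end).days),
                )
            if best is None or score < best[0]:
                best = (score, candidate)
        if best is not None:
            return best[1]

    # 3. Fall back to the year found in the filename, if any.
    if filename_year_hint is not None:
        return _safe_date(filename_year_hint, month, day)

    # No signal at all.
    return None
```

With a 2024-12-16 to 2025-01-15 period, the three cases listed in the commit message resolve to 2024-12-30, 2025-01-05, and 2024-12-15 under this sketch.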
Michael	a18b126885	fix(pdf): stamp scan timestamp once; restores Saved-to-path banner After swapping to ``html_download_button`` the user noticed the "✓ Saved to <path>" + 📂 Open Downloads folder pair never appeared. The helper itself is fine — every other tool shows those affordances correctly. Bug was specific to the PDF page. The download button's file_name was being computed with a fresh ``datetime.now().strftime(...)`` on every render. The helper builds its session-state keys from ``f"_dl_btn_{file_name}_{digest}"`` so the keys silently drift every second. After the click and rerun, the helper looks up the saved_key for the NEW file_name, finds nothing in session_state (the click had written to the OLD key), and skips the success banner. Fix: stamp the timestamp once when scan completes, store it in ``K_TIMESTAMP``, and reuse it for the download filename. The filename stays stable across reruns, so the helper's keys are stable, so the saved-path banner renders correctly on the post- click rerun. Also clear ``K_TIMESTAMP`` on Clear-all-files so a new scan gets a fresh stamp. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:50:22 +00:00
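The key drift behind a18b126885 can be reproduced without Streamlit. The sketch below uses a plain dict in place of ``st.session_state``; the digest algorithm, the ``transactions_`` filename prefix, and the value of ``K_TIMESTAMP`` are assumptions made for illustration, and only the ``_dl_btn_{file_name}_{digest}`` key shape comes from the commit message.

```python
import hashlib
from datetime import datetime

K_TIMESTAMP = "pdf_scan_timestamp"  # hypothetical value for the session-state key


def button_key(file_name: str, data: bytes) -> str:
    """Mimic the helper's key shape; the hashing scheme here is assumed."""
    digest = hashlib.sha256(data).hexdigest()[:8]
    return f"_dl_btn_{file_name}_{digest}"


def drifting_name(now: datetime) -> str:
    """Buggy pattern: a fresh timestamp on every render changes the key."""
    return f"transactions_{now.strftime('%Y%m%d_%H%M%S')}.csv"


def stamp_scan(session_state: dict, now: datetime) -> None:
    """Fixed pattern, step 1: record the timestamp once, when the scan completes."""
    session_state[K_TIMESTAMP] = now.strftime("%Y%m%d_%H%M%S")


def stable_name(session_state: dict) -> str:
    """Fixed pattern, step 2: every later render reuses the stored stamp."""
    return f"transactions_{session_state[K_TIMESTAMP]}.csv"
```

Two renders one second apart produce different keys with ``drifting_name`` and identical keys with ``stable_name``, which is why the post-click rerun can find the saved path again.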
Michael	981a1a9cba	fix(downloads): OneDrive-aware Downloads path + PDF uses html_download_button User reported downloads "do nothing on click" in tool pages and "acts like it downloads but no file in the folder" in the PDF tool. Two root causes, two fixes. Root cause #1 — wrong Downloads folder on Windows. ``_downloads_dir()`` returned ``Path.home() / "Downloads"`` unconditionally. On Windows machines with OneDrive enabled (very common for business users), the real Downloads folder is redirected to ``C:\Users\<u>\OneDrive\Downloads``. Our helper would write to ``C:\Users\<u>\Downloads`` instead — a folder that may not even exist until ``mkdir`` creates it — and the user, naturally opening their actual OneDrive Downloads, sees no file and concludes nothing happened. Now: on Windows, ``_downloads_dir`` queries the registry key ``Software\Microsoft\Windows\CurrentVersion\Explorer\User Shell Folders`` for FOLDERID_Downloads (GUID ``{374DE290-123F-4565-9164-39C4925E467B}``). This entry returns the redirected path when OneDrive is active, the original ``%USERPROFILE%\Downloads`` otherwise — exactly what the user's File Explorer reads. ``%USERPROFILE%`` expansion is applied via ``os.path.expandvars``. Any registry hiccup falls through to ``Path.home() / "Downloads"`` so the helper never raises. The sanity check (path exists OR parent exists) catches the edge case where the registry points into a deleted OneDrive mount. Root cause #2 — PDF page used st.download_button. Every other tool uses the project's ``html_download_button`` helper (which is ``local_download_button`` under the hood — the rename happened in `b9147f3`). ``st.download_button`` has a long-standing bug where the second-or-later instance in a script pass silently fails to fire. The PDF tool predated the rewrite that switched everyone over and was still using the broken native widget. ``_Logs.py`` had the same problem in two places. Swapped all three call sites to ``html_download_button``. 
They now save to ``~/Downloads/<filename>`` (correctly resolved per fix #1) and show the saved path + "Open Downloads folder" button below the click, matching every other tool in the suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:45:51 +00:00
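A minimal sketch of the registry lookup described in 981a1a9cba, assuming the value lives under ``HKEY_CURRENT_USER`` (the commit message names the key path and GUID but not the hive). The function name is illustrative, and the real ``_downloads_dir`` may differ in detail.

```python
import os
import sys
from pathlib import Path

# Key path and GUID as quoted in the commit message.
_SHELL_FOLDERS = r"Software\Microsoft\Windows\CurrentVersion\Explorer\User Shell Folders"
_DOWNLOADS_GUID = "{374DE290-123F-4565-9164-39C4925E467B}"


def downloads_dir() -> Path:
    """Resolve the user's Downloads folder, honouring OneDrive redirection."""
    fallback = Path.home() / "Downloads"
    if sys.platform != "win32":
        return fallback
    try:
        import winreg  # Windows-only stdlib module

        # Assumption: the per-user hive holds the redirected path.
        with winreg.OpenKey(winreg.HKEY_CURRENT_USER, _SHELL_FOLDERS) as key:
            raw, _value_type = winreg.QueryValueEx(key, _DOWNLOADS_GUID)
        candidate = Path(os.path.expandvars(raw))  # expands %USERPROFILE%
        # Sanity check: reject paths that point into a deleted OneDrive mount.
        if candidate.exists() or candidate.parent.exists():
            return candidate
    except Exception:
        # Any registry hiccup falls through to the default location.
        pass
    return fallback
```

On non-Windows hosts the function returns ``Path.home() / "Downloads"`` directly, so the helper never raises there either.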
Michael	dbcf4d4048	feat(pdf): adopt Home-page Files-card layout User wants the PDF page's upload UX to match the Home page exactly: Files section header + bordered card containing the file rows AND the "Add more files" button at the bottom, no visible Streamlit file_uploader competing for attention. Layout changes mirroring ``src/gui/_home.py``: - ``st.file_uploader`` is positioned off-screen via CSS (``position:absolute;left:-10000px;…``). The underlying ``<input type=file>`` stays reachable to JS so the in-card "Add more files" button can programmatically click it. - ``<h2>Files</h2>`` section header with ``N files · X.X MB total`` meta on the right, identical markup (``dt-files-section-head``). - Single ``st.container(border=True)`` hosts every file row (``✕ | 📄 filename | size``, using ``dt-file-row`` / ``dt-file-icon-chip`` / ``dt-file-name`` / ``dt-file-size`` classes) AND the "Add more files" button (``dt-file-add``) at the bottom. All classes are already defined globally in ``_legacy.py`` so no new CSS. - The Add button click is wired to the off-screen uploader's ``stFileUploaderDropzoneInput`` via a 30-line iframe script, identical to the Home page's pattern. A ``MutationObserver`` re-wires after Streamlit reruns when the button gets re-mounted. Action buttons (Scan + Clear all) sit BELOW the Files card, side-by-side in a `[1, 1, 4]` column split with ``use_container_width=True`` so they fill their cells cleanly without stretching across the whole row. Both buttons are disabled when no files are uploaded; the empty Files card is its own affordance for the empty state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:34:31 +00:00
Michael	34b56b404a	fix(pdf): drop statement_period_start/end columns from output User asked to remove them — the two columns repeated the same value on every row from a given statement, took up screen space in the editor, and offered limited value once the date column already carries the inferred full date. What's kept: - ``account_number`` — still stamped onto every row so multi- statement CSVs are self-attributing - ``extract_statement_metadata`` — still runs every scan because ``period_end`` is the source of the year inference that binds Chase-style short ``01/13`` dates to ``20250113`` - ``_extract_statement_period`` and its tests — period detection itself isn't going anywhere, just its appearance in the output rows What's removed: - ``record["statement_period_start"]`` / ``record["statement_period_end"]`` assignments in ``scan_pdf_for_transactions`` - The two columns from the page's column-ordering setup - Tests pinning their presence; replaced with assertions that they're explicitly absent Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:28:32 +00:00
Michael	ad7c22d7fb	fix(pdf): consistent 2-decimal amount precision in display and CSV User reported amounts losing trailing zeros (4.50 rendering as 4.5, 1000.00 as 1000) on the same statement. Classic float display issue: Python's ``repr(4.5)`` is ``4.5``, with no trailing zero, and pandas / Streamlit happily show that inconsistency cell-by-cell. Two layers of fix, internal type stays ``float`` for arithmetic: Display. ``st.column_config.NumberColumn(format="%.2f")`` applied programmatically to every ``amount_`` column on the data_editor. Every numeric amount now shows with exactly two decimal places regardless of trailing zeros. CSV export. Pandas' default float-to-CSV writer also drops trailing zeros (the same issue an accountant would see when opening the file in Excel). Before serialising, each amount column is mapped through the new ``format_amount`` helper, which returns ``f"{v:.2f}"`` for numerics, empty string for None/NaN/inf, ``str(value)`` for booleans (guards the ``True → "1.00"`` foot-gun since ``bool`` is an ``int`` subclass), and passes through any string the scanner kept because parsing failed (e.g. ``(4.50)`` when parens-negative is off; the user can correct it in the editor before re-exporting). ``format_amount`` lives in ``src/pdf_extract.py`` so it's testable in isolation (the page module can't easily be unit tested because of its Streamlit import chain). 8 new tests cover the trailing-zeros case, negatives, None/empty, string-passthrough, bool guard, NaN/inf, and the ``places`` parameter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 01:27:16 +00:00
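The rules listed for ``format_amount`` in ad7c22d7fb could look something like the sketch below. The function name and the ``places`` parameter come from the commit message; the body is a reconstruction from that description and may not match ``src/pdf_extract.py`` line for line.

```python
import math


def format_amount(value, places: int = 2) -> str:
    """Render an amount for CSV export with a fixed number of decimals."""
    if value is None:
        return ""
    # bool is an int subclass, so check it first to avoid True -> "1.00".
    if isinstance(value, bool):
        return str(value)
    # Strings the scanner could not parse pass through untouched.
    if isinstance(value, str):
        return value
    if isinstance(value, (int, float)):
        # NaN and +/-inf have no meaningful fixed-point rendering.
        if isinstance(value, float) and not math.isfinite(value):
            return ""
        return f"{value:.{places}f}"
    # Anything else: fall back to its string form (assumed behaviour).
    return str(value)
```

Keeping the column as ``float`` internally and formatting only at the export boundary means arithmetic on amounts is unaffected by the display fix.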
Michael	6f2ad57490	fix(pdf): require non-empty description; tighten multi-line merge User reported "Daily Ledger Balances" entries leaking into output. Three correlated bugs in the row qualifier: 1. Empty description is now disqualifying. A row like ``01/13/2025 $1,000.00`` has a date and an amount but no text between them — that's a daily-balance entry, a period-summary, or page furniture. Drop these. New filter sits after ``_description_from_row`` returns: if the description string is empty (or whitespace-only), continue past the row. 2. ``prev`` resets per page. The state that drives multi- line description merging (the "previous transaction this continuation might attach to") used to persist across page boundaries. A no-date no-amount line at the top of page 2 could silently attach to the last transaction on page 1. Fixed by moving the ``prev`` / ``prev_y_bottom`` declarations into the outer page loop so each page starts clean. 3. Multi-line merges now check y-distance. Before this fix, ANY no-date no-amount line attached to the previous transaction's description. A "Daily Ledger Balances" section header several rows below the last transaction would silently fold into it. Now the merge only happens when the gap ``current_top - prev_y_bottom <= 25.0`` PDF points — generous enough for one blank-line gap between wrapped descriptions, tight enough to reject section headers across paragraph breaks. The threshold is a module constant (``_MULTILINE_MERGE_MAX_GAP``) for future tuning if real statements call for it. Three new test classes: - ``TestRequiresDescription.test_empty_description_row_dropped`` — date+amount-no-text row filtered, real transaction kept. - ``TestPrevTransactionResetsPerPage.test_no_cross_page_merge`` — page-1 transaction + page-2 section header = no merge. - ``TestMultilineMergeYGap`` — close continuation merges (10-pt gap), far section header doesn't (100-pt gap). 
The original ``TestMultilineDescription.test_continuation_line_merges`` still passes — its setup has a 10-pt gap which is within the new threshold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 00:58:50 +00:00
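The per-page reset and y-gap check from 6f2ad57490 can be illustrated with a small stand-alone sketch. The ``Line`` record and ``merge_page`` function are hypothetical, and the coordinates are assumed to grow downward (so a larger ``top`` is lower on the page); only the 25.0-point threshold and the constant's name come from the commit message.

```python
from dataclasses import dataclass

_MULTILINE_MERGE_MAX_GAP = 25.0  # PDF points, per the commit message


@dataclass
class Line:
    text: str
    top: float
    bottom: float
    is_transaction: bool  # True when the line has a date and an amount


def merge_page(lines: list[Line]) -> list[dict]:
    """Merge wrapped description lines for ONE page."""
    records: list[dict] = []
    # Declared inside the per-page function, so nothing leaks across pages.
    prev = None
    prev_y_bottom = None
    for line in lines:
        if line.is_transaction:
            prev = {"description": line.text}
            records.append(prev)
            prev_y_bottom = line.bottom
        elif prev is not None and line.top - prev_y_bottom <= _MULTILINE_MERGE_MAX_GAP:
            # Close enough to be a wrapped continuation of the description.
            prev["description"] += " " + line.text
            prev_y_bottom = line.bottom
        # Otherwise: a section header or page furniture, left unattached.
    return records
```

Calling ``merge_page`` once per page means a header at the top of page 2 has no previous transaction to attach to, which mirrors the cross-page fix.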

243 Commits