Commit Graph

235 Commits

Author SHA1 Message Date
28ab51a869 Merge ui-redesign: journey-level UX redesign + live-app port
Brings the design-review mockups and the highest-leverage live-app
changes into main:
- layout-review/ mockups: 12-page review addressed; front door, taught
  pipeline order, consistent intake, coming-soon stubs, shared tokens.
- Live src/gui/: nav reordered to pipeline order with new Finance +
  Coming-soon groups; Home is the "Start here" front door with a
  one-click "Clean these files for me" pipeline runner; local-first
  pill on every working tool header.
- DECISIONS.md: PDF to CSV + Reconcile kept in-bundle under Finance.

Full suite green: 2441 passed, 91 skipped, 0 failed.

Follow-ups tracked (not blockers): streamlit-run visual verification of
the live UI; i18n keys for the front-door copy (English literals today);
rebuild the live coming-soon stub page bodies.
2026-06-08 17:41:30 +00:00
1895074b8f test+fix(gui): retire the now-empty "analysis" nav section
The journey-level nav restructure moved Home to a standalone "Start
here" entry and Reconcile into the "Finance" group, leaving the
"analysis" section with zero tools. Two registry tests encoded the old
layout and failed:
- test_every_section_has_at_least_one_tool[analysis] (empty section)
- test_reconciler_present (asserted section == "analysis")

Drop "analysis" from the Section literal, SECTION_LABELS, and app.py's
by_section bucket — it's genuinely dead now (home isn't a registry Tool).
Update the presence tests to assert Reconcile + PDF to CSV live in
"finance". The section-invariant tests (every section non-empty, has a
label, no orphan labels) are preserved and pass.

Full suite: 2441 passed, 91 skipped, 0 failed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 17:11:02 +00:00
d807d3c11b feat(gui): add the one-click "Clean these files for me" front door
Issue #1 (the make-or-break UX fix): after the analyzer runs, Home now
leads with a primary "Clean these files for me" CTA that runs the
recommended pipeline (Clean Text -> Standardize -> Fix Missing -> Find
Duplicates, in order) on every imported file and hands back a cleaned
CSV per file — collapsing "which tool, what order" to one click. The
existing per-finding cards remain, reframed as "Or fix issues one at a
time" for users who want manual control.

- Reuses the core API verbatim (recommended_pipeline + run_pipeline);
  reader mirrors 9_Pipeline_Runner._read_uploaded so files load the same
  way the standalone orchestrator loads them.
- Per-file errors are captured so one bad file doesn't kill the batch;
  cleaned CSVs are cached in session_state so downloads survive reruns
  and are pruned when a file is removed or re-analyzed.

Verified: the read -> run_pipeline -> CSV data path executes correctly
(compile + a non-Streamlit functional smoke test). The Streamlit UI
scaffolding (button / download_button / progress / session_state)
mirrors the proven runner page but still needs a `streamlit run` check.
Front-door copy is English literals for now; i18n keys are a follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 17:06:30 +00:00
09ec01e98b feat(gui): port journey-level nav + local-first pill to the live app
Brings the live Streamlit app in line with the finalized layout-review
mockups (structural/low-risk changes; verified by compile + registry
sanity, still pending a streamlit-run visual check):

- tools_registry: Data Cleaners now in pipeline order (Clean Text ->
  Standardize -> Fix Missing -> Find Duplicates); new "finance" section
  (Reconcile, PDF to CSV) and "coming_soon" section (Find Unusual,
  Quality Check, Combine Files). Adds those to the Section type +
  SECTION_LABELS.
- app.py: Home becomes the "Start here" front door — a standalone,
  unlabeled top entry (play_circle icon) ahead of the hidden
  Activate/Logs/Close pages; nav groups reordered cleaners ->
  transformations -> automations -> finance -> coming soon.
- _legacy.py: render_tool_header now shows the "Runs 100% locally"
  privacy pill (right-aligned, Ready tools only — omitted on Coming
  Soon stubs); accent emphasis CSS for the Start-here nav link.
- i18n: add nav.start_here_title, nav.section_finance,
  nav.section_coming_soon to en + es packs.
- DECISIONS.md: log the PDF/Reconcile in-bundle (Finance group) call.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 17:01:57 +00:00
48251b625f refactor(layout-review): consolidate tool-header actions + align reconcile downloads
Consistency pass over the parallel-agent work:
- Replace 4 divergent inline header wrappers (flex/inline-flex, gap
  10/12px, margin-top present/absent across 8 tool pages) with one shared
  .dt-tool-header-actions class; strip the now-redundant per-button
  margin-top:0. Every tool header now aligns the local-first pill + Help
  button identically.
- Reconcile downloads row: reorder to the page's exceptions-first order
  (Review, Unmatched left, Unmatched right, Matched) to match the tabs and
  metric strip, and drop the lone competing primary — the four are
  parallel exports of equal weight.

Audited and confirmed already-consistent: compact intake banner, privacy
pill markup, .dt-next-step strips, the three coming-soon stubs, primary
CTAs, and the 3-download CSV/audit/config pattern.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:50:25 +00:00
dd0942d710 feat(layout-review): journey-level redesign — front door, taught order, consistency
Addresses the journey-level review (the app felt like 12 tools sharing a
stylesheet, not one guided product). File-partitioned changes:

Navigation (shell.js): rename Home -> "Start here" with front-door
emphasis (.dt-nav-start); reorder Data Cleaners into pipeline order
(Clean Text -> Standardize -> Fix Missing -> Find Duplicates); new
"Finance" group (Reconcile, PDF to CSV); all stubs moved to a bottom
"Coming soon" group, no longer interleaved with working tools.

Front door (home.html): a prominent primary "Clean these files for me"
that runs the recommended pipeline in order, above the existing
per-finding cards (reframed as "fix one thing at a time").

Shared tokens (app.css): .dt-next-step suggestion strip + .dt-nav-start.

Teach the order: a slim .dt-next-step strip at the end of each linear
cleaner page points to the next pipeline step (Map Columns -> Start here;
orchestrator/Finance pages correctly omit it).

Local-first: the green "Runs 100% locally" pill now sits in every working
tool page's header (home + 8 tools), where client data is entered.

Plain English: jargon relabeled on input controls (coerce, E.164,
NFC/NFKC, sentinels, survivor rule), technical terms kept in tooltips and
audit/output cells only.

Stubs (06/08/07): rebuilt to one identical skeleton — info line + plain
feature list + a real "Notify me when this ships" button; every disabled
control and uploader removed (a dimmed dropzone reads as broken).

Intake: full dropzone+chip replaced with the compact "Using <file>" banner
on Clean Text, Fix Missing, Find Duplicates, and both Reconcile sides.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:44:11 +00:00
cf31d9ef14 feat(layout-review): address review findings on pages 7-12
Find Duplicates (01_deduplicator):
- Delete the redundant outer Options wrapper; surface threshold +
  survivor rule directly, push the rest behind a single Advanced pane.
- Disambiguate competing primaries: top result is an auto-resolved
  preview (secondary download), review decisions are the single primary.
- Plain-English match labels (exact / approximate); clarify the third.
- Lift the match-card caption to a one-time instruction; note delimiter
  is delimited-text-only.

Quality Check (08_validator_reporter) — stub:
- Remove the dead disabled "Load rules file (JSON)" uploader so the
  stub invites a single action; keep the informative feature list.

Map Columns (05_column_mapper):
- Regroup schema -> mapping -> strategy/advanced (core task contiguous).
- Make preset-vs-Advanced precedence legible (Custom + modified marker).
- Adopt the compact file-intake banner; drop the duplicate resolved-
  mapping table; fix the add-row gutter style.

Combine Files (07_multi_file_merger) — stub:
- Actually disable the Merge CTA (add the disabled attribute).

PDF to CSV (10_pdf_extractor):
- Drop page/raw from the default preview to match export + fix the
  horizontal clip; surface raw via per-row affordance + overflow-x.
- Move the column selector above the download button; give auto-excluded
  rows a reason; align the files card to Home; de-dupe the row count.

Automated Workflows (09_pipeline_runner):
- Replace hand-edited JSON step config with per-step control expanders;
  JSON moved behind Advanced import/export.
- Editing the table marks the mode modified; fold the empty error column
  into the status pill; render summaries as plain English; collapse the
  explainer by default.

Cross-cutting items (stub standardization on page 10, shared disabled-
field token, remaining intake rollout) deferred to a holistic pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:35:46 +00:00
563d845b70 feat(layout-review): address review findings on pages 4-6
Find Unusual Values (06_outlier_detector) — coming-soon stub:
- Anchor the disabled Method on IQR (multiplier 1.5), not Z-score, per
  the logged robustness decision.
- Drop the redundant feature bullet list (kept alert + greyed controls
  + disabled button); also fixes the MAD-only-in-bullets mismatch.
- Remove the live uploader that dead-ended into disabled controls.

Clean Text (02_text_cleaner):
- Add an inline hidden-character legend (3 swatches reusing the actual
  badge classes) beside the canonical "Show hidden characters" toggle.
- Unify the two hidden-char toggles: preview one is canonical; the
  Results bare checkbox is wrapped in a field + bound note.
- Describe all three presets (minimal / excel-hygiene / paranoid).
- Give "Changes by column" a real "column" header instead of the
  grey index-gutter style.

Standardize Formats (03_format_standardizer):
- Make preset-vs-control precedence legible: preset shows Custom with a
  "modified" marker + base tag, diverging controls flag the winning
  value (same pattern as Fix Missing Values).
- Replace the dead-end unparseable alert with a real "Unparseable
  cells (47)" expander the alert now points to.
- Honest preview caption: "5 of 6 columns (notes skipped)".
Intake pattern (the cross-page reference) left untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:27:42 +00:00
be1e263223 feat(layout-review): address Fix Missing Values review findings
- Pin down strategy precedence: add a resolution-order legend
  (per-column -> global -> preset), dim/strike the preset radios when
  a global strategy overrides them, and add a "Resolves to" column to
  the per-column override table so the winning value is legible.
- Make the demo state honest: Global strategy = median is what drives
  the 1,043 fills, resolving the detect-only contradiction.
- Surface the missingness profile as an always-visible block above the
  (now-open) Options expander — diagnostic before configuration.
- Stop highlighting unchanged before/after cells (respondent_id 0->0);
  show "(global)" placeholders in unset per-column override cells.
- Fold the standalone "Strategy applied per column" table into the
  before/after table as a strategy column; inset maxed slider knobs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:23:32 +00:00
7ebfd0f153 feat(layout-review): address Reconcile page review findings
- Fix doubled "Invert right amount sign" label: keep the field label,
  strip the checkbox caption to the box only (also evens the 3-up row).
- Reorder results exceptions-first: tabs and metric strip both run
  Review -> Unmatched left -> Unmatched right -> Matched, with Review
  the default active tab and its table as the inline content; Matched
  demoted to a trailing context expander.
- Surface the "references must match left count" rule with an inline
  validation indicator under the right reference field instead of a
  label note alone.
- Mark the required Amount join key with the .req accent star on both
  sides so it reads distinct from the optional date/description pickers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:17:20 +00:00
2592604067 feat(layout-review): address Home page review findings
- Findings card no longer truncates silently: panel #1 gains a
  .dt-finding-more overflow control ("Show all 8 findings · 5 more").
- Replace the dead "Files analyzed: 3" stat (restated the section meta
  + visible rows) with "Rows scanned" — info not already on screen.
- Collapsed findings panels use a real .is-collapsed state variant
  instead of inline margin-bottom:-16px hacks, so states can't drift.
- Action bar buttons are content-sized; drop the 340px island that
  jarred against the full-width divider/stats below it.

Branding kept as deliberate landing-style treatment on Home (per
review decision); interior tool pages remain title-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:14:04 +00:00
58d0009849 refactor(layout-review): inline assets beside pages
Move app.css and shell.js into layout-review/ alongside the .html files
and reference them by bare filename; drop the assets/ subfolder.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
rollback-2026-06-08
2026-06-08 15:43:31 +00:00
b6c39d7a09 refactor(layout-review): move assets to repo root
Relocate assets/ (app.css, shell.js) from layout-review/ up to the repo
root and rewrite every page's link/script refs to ../assets/.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 15:31:53 +00:00
b2fa8503e6 chore: add layout-review HTML mockups
Static layout mockups for each app tool (deduplicator, text cleaner,
format standardizer, missing handler, column mapper, outlier detector,
multi-file merger, validator/reporter, pipeline runner, PDF extractor,
reconciler) plus index/home shells and shared assets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 15:28:23 +00:00
b703911df3 docs: reflect bundled Tesseract on every install surface
- NEW LICENSE_TESSERACT.txt at the repo root: header noting it covers
  the bundled Tesseract OCR binary (Apache 2.0, upstream
  tesseract-ocr/tesseract, copyright Google + contributors) and the
  eng.traineddata from tessdata_best (also Apache 2.0). Clarifies
  DataTools itself remains proprietary. Full canonical Apache 2.0
  license text included.
- README.md + README.es.md (Download section): bumped size estimate
  ~200 MB → ~300 MB, added a short paragraph stating Tesseract OCR
  is bundled (no separate install required), with a link to the new
  license file.
- docs/USER-GUIDE.md + docs/USER-GUIDE.es.md (§1.6 System
  requirements): bumped disk estimate, added a paragraph stating
  Tesseract 5.5 + eng.traineddata ship inside every installer /
  portable / AppImage, with a source-install fallback hint pointing
  developers to DEVELOPER.md.
- docs/DEVELOPER.md: new "PDF Extractor — bundled Tesseract" section
  documenting the runtime layout (sys._MEIPASS / tesseract / …),
  discovery order, source of bytes (build/vendor/tessdata + per-
  platform fetch in make_release.py), version pin, update recipe.
- docs/TECHNICAL.md: new §3.10 "Bundled Tesseract (PDF Extractor
  OCR)" — short version of the discovery order for the build
  pipeline section.
- build/README.md: distribution-outputs paragraph now lists
  Tesseract among bundled deps with the ~250-300 MB estimate; new
  "Tesseract bundling" section: layout diagram, resolver order,
  source of bytes + 5.5.0 pin, update steps, license-file ref.

Out-of-scope gaps noted by the docs sweep:
- docs/FUTURE-TOOLS.md §D still describes Tesseract bundling as a
  high-risk packaging headache; now superseded. Worth a one-line
  "(resolved — bundled as of v1.x)" callout in a future pass.
- USER-GUIDE §2 "What's included" table doesn't list PDF Extractor
  at all (it shipped in b8aff86…967d3f6). Separate gap to close.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:20:50 +00:00
93ccada974 build: bundle Tesseract 5.5.0 + tessdata into every release artifact
End users no longer have to install Tesseract separately for OCR on
scanned PDFs — the engine ships inside the installer, portable .zip,
and AppImage for all three platforms.

Per-platform fetch in build/make_release.py (run before PyInstaller):
- Windows: download UB-Mannheim installer 5.5.0.20241111, extract
  with 7-Zip, copy tesseract.exe + required DLLs into the staging dir.
- macOS: ``brew install tesseract``, copy binary + every Homebrew-
  prefixed dylib resolved via otool -L (recurse one level for
  transitive deps), then install_name_tool rewrites IDs / load paths
  to @loader_path/... so the bundle is relocatable.
- Linux: ``apt-get install tesseract-ocr libtesseract5``, copy binary
  + every non-system .so from ldd output, patchelf --set-rpath '$ORIGIN'.

Wire-up:
- build/datatools.spec reads DATATOOLS_TESS_STAGING env var (set by
  make_release) and adds the staging dir + tessdata + the
  LICENSE_TESSERACT.txt Apache 2.0 attribution to PyInstaller datas
  so they land at <bundle>/tesseract/{tesseract[.exe],tessdata/}
  and the license sits at the bundle root. Soft-warns when staging
  is empty so dev spec runs still complete.
- English tessdata pulled by fetch_tessdata() from
  tesseract-ocr/tessdata_best (eng.traineddata, ~16 MB). Cached at
  build/vendor/tessdata/.
- .github/workflows/build.yml: actions/cache@v4 step keyed on
  ``tesseract-${runner.os}-5.5.0-tessdata_best-v1`` caches the
  staging dir and the vendored tessdata across runs; apt installs
  patchelf on the Linux runner; PyInstaller step now receives the
  DATATOOLS_TESS_STAGING env var.
- .gitignore: build/_tesseract/ and the .traineddata blob.
- TESSERACT_SKIP_FETCH=1 honored for offline / manual stages.
- Installer / .dmg / .zip / AppImage scripts: one-line comments
  confirming Tesseract rides along automatically via PyInstaller's
  datas (no extra packaging steps required in those scripts).

Bundle-size delta: ~50-70 MB on disk per platform, ~25-40 MB post-
compression. Net installer size ~250-300 MB (was ~120 MB) — accepted
tradeoff for zero end-user OCR setup.

Reversal of the prior "don't bundle Tesseract" decision (option A).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:20:33 +00:00
17faf84aed feat(pdf): probe bundled Tesseract first when running frozen
Adds runtime support for the bundled Tesseract that ships inside the
DataTools installer / portable / AppImage artifacts. When DataTools
is launched from a PyInstaller frozen bundle the OCR engine now
resolves automatically — no end-user install required.

New helpers in src/pdf_extract.py:
- _bundled_tesseract_path() → Path | None — returns
  <sys._MEIPASS>/tesseract/tesseract[.exe] when getattr(sys,
  "frozen", False) AND sys._MEIPASS are present; None in dev.
- _bundled_tessdata_dir() → Path | None — same gating, returns
  <sys._MEIPASS>/tesseract/tessdata.
- _apply_bundled_tessdata_prefix() — sets TESSDATA_PREFIX to the
  bundled tessdata dir before any pytesseract call; only if frozen,
  dir exists, and the user hasn't already overridden the env var.

Discovery order in ocr_available() / _autodetect_tesseract_path():
1. DATATOOLS_TESSERACT_PATH env override (existing)
2. Bundled binary (NEW — frozen-only)
3. System PATH (existing)
4. Windows well-known install dirs (existing legacy fallback)

In dev (not frozen) every new probe is a no-op so the developer
experience is unchanged.

12 new tests cover frozen vs. non-frozen detection on each platform,
the user-override respect for TESSDATA_PREFIX, autodetect priority
ordering, and the no-bundled-dir graceful path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:19:52 +00:00
4d8513b1a3 docs: cover help popover, +/- nav indicators, render_tool_header
User-facing docs (USER-GUIDE en+es, README en+es):
- New short paragraph under §3.1 GUI noting the in-tool Help button
  on every detail page, what it contains (When to use / Steps /
  Examples / Tip), and that content lives in tools.<id>.help_md.
- One-line note in the README tool tables pointing at the same.
- Mention the sidebar +/- nav indicators replacing Streamlit's
  default Material Symbols chevron.

Developer docs:
- DEVELOPER: new "Tool page header" subsection documenting
  render_tool_header(tool_id), the help_md markdown skeleton, and
  the fallback to help.missing_body when a tool's help is absent.
  Update i18n authoring rules to list help.* keys and the per-tool
  help_md field alongside name/description/page_title/page_caption.
- TECHNICAL: new §10c documenting the sidebar nav indicator swap —
  CSS in _HIDE_CHROME_CSS plus _SWAP_NAV_SECTION_INDICATOR_JS
  injected through the hide_streamlit_chrome() iframe bundle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:08:01 +00:00
ac94208d8f chore: production-readiness sweep on the help-popover wave
- Drop unused 'from src.i18n import t' from pages 1-9 (the swap to
  render_tool_header(tool_id) means no page calls t() directly anymore).
  Pages 10, 11 and the underscore-prefixed pages were already clean or
  legitimately use t().

- Rewrite PDF Extractor help_md (en + es). The original prose described
  features the tool does NOT have — template drawing, per-source saved
  templates, automatic reuse. The actual tool is a heuristic batch
  scanner (per its own docstring: "No templates, no per-bank
  configuration"). New copy: scan → uncheck → pick date format → enable
  OCR if needed → download. Spanish version tagged with
  '<!-- TODO: review Spanish -->' since the prose is best-effort.

- Document why both stSidebarNavSectionHeader (legacy, streamlit~=1.35)
  and stNavSectionHeader (current, 1.57) testids appear in the chrome
  CSS — requirements floor is streamlit>=1.35,<2 so dropping the legacy
  selector would silently break the lower bound.

- Pin the t()-returns-key-on-miss contract that render_tool_header's
  fallback path depends on, with a comment at the call site.

- Pin the demo's intentional skip of hide_streamlit_chrome (so the
  +/- sidebar swap JS doesn't ever try to load there) with a load-
  bearing comment in app_demo.py.

- Confirmed i18n parity: every tool id has page_title / page_caption /
  description / name / help_md in BOTH packs; help.button_label and
  help.missing_body in both.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:07:33 +00:00
4955fb239b test: cover help_md keys, header smoke, and bilingual ES smoke
Two stale Spanish smoke assertions still expected English page titles
for PDF Extractor and Reconciler — the i18n work landed real
translations ("PDF a CSV", "Reconciliar dos archivos"), so refresh the
expected substrings and the surrounding comment.

Add new coverage for the help-popover feature:
- TestHelpPopoverKeys (test_lang_packs): every tool_id resolves a
  non-empty tools.<id>.help_md in BOTH packs; help.button_label and
  help.missing_body resolve in both.
- TestDescriptionCopy (test_tools_registry): every Tool.description
  non-empty and under 120 chars — pins the post-jargon-scrub copy
  so future drift back into multi-clause prose is loud.
- TestRenderToolHeaderSmoke: render_tool_header is callable, listed
  in components.__all__, and every i18n key it touches resolves in
  both packs. Runs without a Streamlit script context.

Suite: 2427 passed (+9 new), 91 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:07:19 +00:00
4a8961d58a fix(gui): keep tool-page Help button on one line at narrow widths
When the viewport shrunk, the help popover button in the title row
was wrapping its label vertically — ``[icon]`` over ``Help`` — because
the button was set to use_container_width=True and the column it sat
in collapsed below the button's natural width.

Two-pronged fix:
- Set use_container_width=False on the popover so the button sizes to
  content (icon + label) instead of stretching to the column.
- Widen the column ratio from [10, 1] to [8, 2] so there's room for
  the button without forcing the title text to truncate.
- Add CSS pinning ``white-space: nowrap`` on every popover button (and
  its inner div / p) as defense-in-depth — even if the button does
  get squeezed, the label can't wrap. ``min-width: max-content`` keeps
  the button from compressing below its content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 17:54:41 +00:00
fe4b5dc755 fix(sidebar): correct testid + JS swap so +/− actually renders
The prior attempt used data-testid=stSidebarNavSectionHeader, which is
not what Streamlit 1.57 emits — the correct testid is stNavSectionHeader
(verified against the bundled JS in streamlit/static/static/js/).
The section header is also a <div> with onClick, not a <button>, and
the React component keeps the expanded state in a prop without
surfacing aria-expanded on the DOM. Pure CSS can therefore neither
locate the header nor switch the glyph by state, which is why the
chevron was unchanged in the rendered UI.

Switch strategies:
- CSS now targets the correct stNavSectionHeader / stIconMaterial
  selectors, drops the Material Symbols font from the icon span, and
  restyles it so a plain ascii character reads as proper typography
  (size, weight, color, hover).
- Add _SWAP_NAV_SECTION_INDICATOR_JS — small inline script that
  rewrites the icon's text node from "expand_more"/"expand_less" to
  "+"/"−" (U+2212), throttled via requestAnimationFrame, re-applied
  on every DOM mutation by a MutationObserver. Bundled into the same
  iframe injection as the existing brand/upload/findings scripts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 17:52:47 +00:00
209b5fb1aa style(sidebar): swap expand chevrons for +/− indicators on nav sections
Streamlit's default sidebar section header uses a Material Symbols
expand_more chevron — three different icons (chevron down, chevron up,
sometimes a plain triangle) depending on version, all of which felt
inconsistent with the rest of the chrome.

Hide the built-in icon (svg / material-symbols span — covered with
multiple selectors for cross-version durability) and render our own
glyph as a right-aligned pseudo-element on the section-header button,
keyed off the standard ARIA aria-expanded attribute:
- collapsed → "+"
- expanded  → "−" (U+2212, visually balanced with +)

Hover deepens the indicator color to match the surrounding nav-link
hover treatment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 17:23:49 +00:00
904356f4e8 feat(gui): inline Help popover next to every tool's title
Adds a contextual Help button on each detail page, right of the title.
Clicking it opens a Streamlit popover with a one-shot how-to: when to
use, numbered steps, before→after examples, and an optional one-line
tip. Designed to be scannable — no paragraph prose.

Implementation:
- New ``render_tool_header(tool_id)`` helper in components replaces the
  bare ``st.title(...) + st.caption(...)`` block on each of the 11 tool
  pages. Title in the wide column, popover in a narrow right column;
  caption sits on its own line beneath.
- Help content is one markdown blob per tool stored in i18n under
  ``tools.<id>.help_md`` (en + es). Editors can tweak copy without
  touching Python.
- ``help.button_label`` and ``help.missing_body`` keys added to both
  packs for the popover trigger and the empty-tool fallback.

All 11 tool pages now use the same header pattern — including the
PDF Extractor and Reconciler which previously had hardcoded title/
caption pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 17:21:55 +00:00
7203a81af7 copy: strip jargon from tool descriptions and captions
Prior round only touched page_caption; the description field (shown on
home grid cards) still said "imputation", "missingness",
"winsorization", "schema coercion", "fuzzy matching with normalization",
etc. The audience is non-technical buyers — they shouldn't need a stats
or DB-admin vocabulary to read a tool card.

Rewrite both description and page_caption across en, es, and the
tools_registry (the fallback source of truth) using everyday words:
blanks instead of nulls, fill in instead of impute, look wrong instead
of statistical outliers, etc. Same one-line shape as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 17:09:52 +00:00
dd3b9bd59d copy: tighten tool-page captions to one plain-English line
Each tool's page caption is what tells a user what the tool actually
does the moment they land. They were inconsistent — some terse, most
multi-clause with a redundant "Runs locally — your data never leaves
this computer" trailer that's already a privacy pill on Home.

Rewrite every caption (en + es) as a single ~60-80 char action-first
line. Replaces the hardcoded multi-line Reconciler caption with the
same shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 14:34:34 +00:00
2bd94c4441 docs: document installer + portable downloads in en/es
Repo READMEs now show both download flavors side-by-side with
first-launch warnings (SmartScreen, Gatekeeper) and link to the
deeper walkthrough.

USER-GUIDE §1 rewritten from a 9-line stub into six subsections:
- §1.1 Windows: installer (5 steps) + portable (4 steps)
- §1.2 macOS:   DMG (5 steps incl. right-click-Open) + portable
- §1.3 Linux:   AppImage flow (unchanged)
- §1.4 First-launch: port selection, localhost binding, browser open
- §1.5 How the GUI works
- §1.6 System requirements

§6 Troubleshooting picks up portable-specific items: Safari unzip
quirks, antivirus quarantine on Win portable, license file location.

docs/README and Spanish mirrors updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 19:30:28 +00:00
9c426194b1 build: add single-command release script + portable zip artifacts
One-developer workflow: ``python build/make_release.py`` on each
target OS produces both the installer and a portable .zip for that
platform. Preflight checks PyInstaller / Pillow / iscc / hdiutil /
ditto / appimagetool and bails with install hints if anything is
missing — no half-built dist/.

New scripts:
- build/make_release.py   — orchestrator, auto-detects host OS.
- build/generate_icons.py — icon.ico / icon.icns / icon.png from
  src/gui/assets/datatools_icon_256.png (Pillow ships ICO + ICNS
  writers; no platform tooling needed).
- build/build_portable_zip.py — Win/Linux portable zip via stdlib.
- build/macos/build_zip.sh — Mac portable .app via ditto so
  bundle metadata survives.

installer.iss now adds: Quick Launch task (opt-in, legacy Win 7),
App Paths registry entry (Win+R "DataTools" works), SetupIconFile,
UninstallDisplayIcon, AppSupportURL, AppUpdatesURL.

CI workflow uploads installer + portable per platform and attaches
both to GitHub Releases on tag push.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 19:30:17 +00:00
6627895a10 test: fix v3 branding drift, add reconcile CLI + registry coverage
GUI/lang-pack tests were asserting against pre-v3 strings ("Data
Cleaning Mastery", "Maestría en limpieza…") that the brand refresh
replaced with "UNALOGIX DataTools" + "Clean. Normalize. Transform."
Updated assertions to the current copy and switched the findings
panel tests to the redesigned flat-list layout (per-finding "Open
Tool →" buttons instead of per-tool expanders).

New coverage:
- tests/test_cli_reconcile.py (13) — preview/apply, tolerance flags,
  sign inversion, key flags, error paths, Excel input.
- tests/test_tools_registry.py (27) — unique tool_ids, page_slug →
  real file, valid sections/tiers, localized accessor fallbacks,
  explicit pins for PDF Extractor + Reconciler entries.
- tests/test_reconcile.py — one-side-empty, key-pass tagging,
  additional validation cases, input-DataFrame immutability.
- tests/gui/test_smoke.py — PAGE_SLUGS now includes 10_PDF_Extractor
  and 11_Reconciler in both en/es.
- tests/gui/test_workflows.py — TestPdfExtractorWorkflow and
  TestReconcilerWorkflow render checks.

Net: 2317 passed → 2418 passed, 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 19:30:02 +00:00
ea99e292d2 feat(nav): group Home + Reconcile under a new "Analysis" section
Home now appears in the sidebar as "File Analysis" under a labeled
"Analysis" section together with Reconcile Two Files — both pages
are data-analysis workflows (importing/profiling files vs. matching
across files), so grouping them clarifies the sidebar's mental model.

- tools_registry: new ``analysis`` Section; reconcile moves out of
  automations into it.
- i18n: ``nav.section_analysis`` + ``nav.file_analysis_title`` added
  to en.json and es.json.
- app.py: home dropped from the unlabeled section and surfaced at the
  top of the Analysis group; ``default=True`` preserved so first-visit
  routing is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:11:06 +00:00
0be59c0f03 fix(gui): shrink white-bar compensation to ~1/4 of original gap
Plain ``min-height: 100vh`` left a ~15vh white bar below ``.stApp``
(the zoom: 0.85 scaler shrinks visual height to 85%). Reinstate the
stretching but stop short of the full ``100vh / 0.85`` overflow:
``calc(96vh / 0.85)`` fills 96vh visually and leaves a ~4vh bar — a
quarter the size, no longer dominating the page.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:06:32 +00:00
3a3a9a895b fix(gui): stop overstretching pages, restore footer clearance
Two layout bugs were hiding the bottom of every tool page behind the
sticky footer:

1. ``.stApp`` and the main/sidebar containers were forced to
   ``min-height: calc(100vh / 0.85)``, ≈ 17.6% taller than the
   viewport, to mask a white bar caused by the ``zoom: 0.85`` scaler.
   That hack stretches short pages and pushes long-page content past
   the visible area. Drop the calc factor — plain ``100vh`` fills the
   visible viewport without forced overflow.

2. ``render_sticky_footer``'s stylesheet re-set the block container's
   ``padding-bottom`` to ``2rem``, overriding the ``7rem`` reserved
   by ``hide_streamlit_chrome``. The footer (~40px tall) needs more
   than 32px of clearance, so the last row of content was sliding
   behind the footer. Remove the override and let chrome's reservation
   stand.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:03:52 +00:00
d090f8cb5e feat(reconcile): auto-detect role columns, preview result tabs
Match-settings selectors now reorder per side to match the file's
column order, using name heuristics (amount / date / desc) so a
typical bank CSV reads Date → Description → Amount → Reference
without manual fiddling. Detected columns also pre-fill as the
default selection.

Result tabs render at most 25 rows with a "preview of N of M"
caption; full data is still available via the existing download
buttons.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 22:39:47 +00:00
e44af3a45e feat(reconcile): two-source reconciliation tool
Bank-feed-vs-ledger style matcher: 4-pass greedy assignment (key →
exact → tolerance → fuzzy) with ambiguous candidates routed to a
review bucket instead of arbitrary picks. CLI mirrors the
cli_text_clean preview/--apply pattern; Streamlit page registered
in the automations section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 22:33:14 +00:00
450d4fc9a8 feat(pdf): default output date format to YYYY-MM-DD
User asked to flip the default from YYYYMMDD to YYYY-MM-DD.
ISO is the better default for an accountant CSV workflow:

- Lexicographic sort = chronological sort (no parsing needed).
- Every spreadsheet tool the user might import into recognises
  it as a real date with no ambiguity (US vs EU readers can't
  disagree on the order).
- Hyphens make the year/month/day boundaries scan-able by eye.

Concrete changes:

- New module constant ``DEFAULT_DATE_FORMAT = "%Y-%m-%d"``,
  used as the default for ``format_date()`` and the
  ``output_date_format`` keyword on
  ``scan_pdf_for_transactions``.
- Page's ``_DATE_FORMAT_CHOICES`` reordered so the ISO entry
  is first (index 0 = default Streamlit selection); YYYYMMDD
  drops to second.
- Custom-strftime input default also flips to ``%Y-%m-%d``.

Tests updated to reflect the new default (``test_dates_formatted_iso_by_default``,
``test_short_dates_get_year_from_period``,
``test_compact_format_round_trip``, plus a new
``test_default_is_iso`` for the format_date helper).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 02:04:34 +00:00
a0042d4aba feat(pdf): Dec/Jan-aware year inference + filename hint + override
Previous year inference picked ``period_end_iso[:4]`` for every
short date, which fails on statements that cross the Dec/Jan
boundary. A "12/30" row in a 2024-12-16 to 2025-01-15 statement
got 2025-12-30 (wrong) instead of 2024-12-30.

New cascade for ``_infer_year_for_short_date``:

1. **``override_year``** — caller supplies it (new ``"Override
   year for short dates"`` field in Scan options). Beats every
   heuristic. Empty by default; the page validates the value
   is a 4-digit-looking integer in 1900-2100 and falls back to
   automatic on garbage input.

2. **Statement period start + end** — the function now takes
   BOTH dates and generates candidates with every distinct year
   in the period (one year for same-year statements, two for
   Dec/Jan boundaries). The picker scores each candidate by
   distance from the period: candidates inside the period
   score 0, candidates outside score ``min(|days from start|,
   |days from end|)``. Lowest-distance candidate wins. So:

     - ``12/30`` + period 2024-12-16 to 2025-01-15 → 2024-12-30
       (inside period, score 0)
     - ``01/05`` + same period → 2025-01-05 (inside, score 0)
     - ``12/15`` + same period → 2024-12-15 (1 day before,
       closer than 2025-12-15 which is 11 months after)

3. **``filename_year_hint``** — fallback when the statement
   period regex misses the bank's specific layout. The page
   passes ``year_from_filename(upload.name)`` automatically so
   files like ``eStmt_2025-01-13.pdf`` get year 2025 even if
   the PDF's text doesn't yield a parseable period. The regex
   matches the first ``20XX`` token bounded by non-digits.

Both new helpers (``year_from_filename`` and the new
``_try_short_date_with_year`` factor-out) are exported and
tested. 16 new tests cover: within-period inference (same-year
sanity), Dec/Jan boundary cases for both sides, the
just-before-period closer-distance case, override priority,
filename fallback, no-signal None, dash-format / month-name
shorthand round-trip, garbage input, filename year extraction
(eStmt pattern, embedded, first-match-wins, no-match, empty).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:59:30 +00:00
a18b126885 fix(pdf): stamp scan timestamp once; restores Saved-to-path banner
After swapping to ``html_download_button`` the user noticed the
"✓ Saved to <path>" + 📂 Open Downloads folder pair never
appeared. The helper itself is fine — every other tool shows
those affordances correctly. Bug was specific to the PDF page.

The download button's file_name was being computed with a fresh
``datetime.now().strftime(...)`` on every render. The helper
builds its session-state keys from
``f"_dl_btn_{file_name}_{digest}"`` so the keys silently drift
every second. After the click and rerun, the helper looks up
the saved_key for the NEW file_name, finds nothing in
session_state (the click had written to the OLD key), and skips
the success banner.

Fix: stamp the timestamp once when scan completes, store it in
``K_TIMESTAMP``, and reuse it for the download filename. The
filename stays stable across reruns, so the helper's keys are
stable, so the saved-path banner renders correctly on the post-
click rerun.

Also clear ``K_TIMESTAMP`` on Clear-all-files so a new scan
gets a fresh stamp.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:50:22 +00:00
981a1a9cba fix(downloads): OneDrive-aware Downloads path + PDF uses html_download_button
User reported downloads "do nothing on click" in tool pages and
"acts like it downloads but no file in the folder" in the PDF
tool. Two root causes, two fixes.

**Root cause #1 — wrong Downloads folder on Windows.**
``_downloads_dir()`` returned ``Path.home() / "Downloads"``
unconditionally. On Windows machines with OneDrive enabled
(very common for business users), the real Downloads folder
is redirected to ``C:\Users\<u>\OneDrive\Downloads``. Our
helper would write to ``C:\Users\<u>\Downloads`` instead —
a folder that may not even exist until ``mkdir`` creates it —
and the user, naturally opening their actual OneDrive
Downloads, sees no file and concludes nothing happened.

Now: on Windows, ``_downloads_dir`` queries the registry key
``Software\Microsoft\Windows\CurrentVersion\Explorer\User
Shell Folders`` for FOLDERID_Downloads (GUID
``{374DE290-123F-4565-9164-39C4925E467B}``). This entry returns
the redirected path when OneDrive is active, the original
``%USERPROFILE%\Downloads`` otherwise — exactly what the user's
File Explorer reads. ``%USERPROFILE%`` expansion is applied
via ``os.path.expandvars``. Any registry hiccup falls through
to ``Path.home() / "Downloads"`` so the helper never raises.

The sanity check (path exists OR parent exists) catches the
edge case where the registry points into a deleted OneDrive
mount.

**Root cause #2 — PDF page used st.download_button.**
Every other tool uses the project's ``html_download_button``
helper (which is ``local_download_button`` under the hood —
the rename happened in b9147f3). ``st.download_button`` has a
long-standing bug where the second-or-later instance in a
script pass silently fails to fire. The PDF tool predated the
rewrite that switched everyone over and was still using the
broken native widget. ``_Logs.py`` had the same problem in two
places.

Swapped all three call sites to ``html_download_button``. They
now save to ``~/Downloads/<filename>`` (correctly resolved per
fix #1) and show the saved path + "Open Downloads folder"
button below the click, matching every other tool in the suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:45:51 +00:00
dbcf4d4048 feat(pdf): adopt Home-page Files-card layout
User wants the PDF page's upload UX to match the Home page
exactly — Files section header + bordered card containing the
file rows AND the "Add more files" button at the bottom, no
visible Streamlit file_uploader competing for attention.

Layout changes mirroring ``src/gui/_home.py``:

- ``st.file_uploader`` is positioned off-screen via CSS
  (``position:absolute;left:-10000px;…``). The underlying
  ``<input type=file>`` stays reachable to JS so the in-card
  "Add more files" button can programmatically click it.
- ``<h2>Files</h2>`` section header with ``N files · X.X MB
  total`` meta on the right, identical markup
  (``dt-files-section-head``).
- Single ``st.container(border=True)`` hosts every file row
  (``✕ | 📄 filename | size``, using ``dt-file-row`` /
  ``dt-file-icon-chip`` / ``dt-file-name`` / ``dt-file-size``
  classes) AND the "Add more files" button (``dt-file-add``)
  at the bottom. All classes are already defined globally in
  ``_legacy.py`` so no new CSS.
- The Add button click is wired to the off-screen uploader's
  ``stFileUploaderDropzoneInput`` via a 30-line iframe script,
  identical to the Home page's pattern. A ``MutationObserver``
  re-wires after Streamlit reruns when the button gets
  re-mounted.

Action buttons (Scan + Clear all) sit BELOW the Files card,
side-by-side in a `[1, 1, 4]` column split with
``use_container_width=True`` so they fill their cells cleanly
without stretching across the whole row. Both buttons are
disabled when no files are uploaded — the empty Files card is
its own affordance for the empty state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:34:31 +00:00
34b56b404a fix(pdf): drop statement_period_start/end columns from output
User asked to remove them — the two columns repeated the same
value on every row from a given statement, took up screen space
in the editor, and offered limited value once the date column
already carries the inferred full date.

What's kept:
- ``account_number`` — still stamped onto every row so multi-
  statement CSVs are self-attributing
- ``extract_statement_metadata`` — still runs every scan because
  ``period_end`` is the source of the year inference that binds
  Chase-style short ``01/13`` dates to ``20250113``
- ``_extract_statement_period`` and its tests — period
  detection itself isn't going anywhere, just its appearance in
  the output rows

What's removed:
- ``record["statement_period_start"]`` / ``record["statement_period_end"]``
  assignments in ``scan_pdf_for_transactions``
- The two columns from the page's column-ordering setup
- Tests pinning their presence; replaced with assertions that
  they're explicitly absent

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:28:32 +00:00
ad7c22d7fb fix(pdf): consistent 2-decimal amount precision in display and CSV
User reported amounts losing trailing zeros — 4.50 rendering as
4.5, 1000.00 as 1000 — on the same statement. Classic float
display issue: Python's native ``repr(4.5)`` drops the
``.0``, and pandas / Streamlit happily show that
inconsistency cell-by-cell.

Two layers of fix, internal type stays ``float`` for arithmetic:

**Display.** ``st.column_config.NumberColumn(format="%.2f")``
applied programmatically to every ``amount_*`` column on the
data_editor. Every numeric amount now shows with exactly two
decimal places regardless of trailing zeros.

**CSV export.** Pandas' default float-to-CSV writer also drops
trailing zeros (the same issue an accountant would see when
opening the file in Excel). Before serialising, each amount
column is mapped through the new ``format_amount`` helper —
returns ``f"{v:.2f}"`` for numerics, empty string for
None/NaN/inf, ``str(value)`` for booleans (guards the
``True → "1.00"`` foot-gun since ``bool`` is an ``int``
subclass), and passes through any string the scanner kept
because parsing failed (e.g. ``(4.50)`` when parens-negative is
off — user can correct in the editor before re-exporting).

``format_amount`` lives in ``src/pdf_extract.py`` so it's
testable in isolation (the page module can't easily be unit
tested because of its Streamlit import chain). 8 new tests
cover the trailing-zeros case, negatives, None/empty,
string-passthrough, bool guard, NaN/inf, and the ``places``
parameter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:27:16 +00:00
6f2ad57490 fix(pdf): require non-empty description; tighten multi-line merge
User reported "Daily Ledger Balances" entries leaking into
output. Three correlated bugs in the row qualifier:

**1. Empty description is now disqualifying.** A row like
``01/13/2025  $1,000.00`` has a date and an amount but no text
between them — that's a daily-balance entry, a period-summary,
or page furniture. Drop these. New filter sits after
``_description_from_row`` returns: if the description string is
empty (or whitespace-only), continue past the row.

**2. ``prev`` resets per page.** The state that drives multi-
line description merging (the "previous transaction this
continuation might attach to") used to persist across page
boundaries. A no-date no-amount line at the top of page 2
could silently attach to the last transaction on page 1. Fixed
by moving the ``prev`` / ``prev_y_bottom`` declarations into
the outer page loop so each page starts clean.

**3. Multi-line merges now check y-distance.** Before this fix,
ANY no-date no-amount line attached to the previous
transaction's description. A "Daily Ledger Balances" section
header several rows below the last transaction would silently
fold into it. Now the merge only happens when the gap
``current_top - prev_y_bottom <= 25.0`` PDF points — generous
enough for one blank-line gap between wrapped descriptions,
tight enough to reject section headers across paragraph
breaks. The threshold is a module constant
(``_MULTILINE_MERGE_MAX_GAP``) for future tuning if real
statements call for it.

Three new test classes:

- ``TestRequiresDescription.test_empty_description_row_dropped``
  — date+amount-no-text row filtered, real transaction kept.
- ``TestPrevTransactionResetsPerPage.test_no_cross_page_merge``
  — page-1 transaction + page-2 section header = no merge.
- ``TestMultilineMergeYGap`` — close continuation merges
  (10-pt gap), far section header doesn't (100-pt gap).

The original ``TestMultilineDescription.test_continuation_line_merges``
still passes — its setup has a 10-pt gap which is within the
new threshold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:58:50 +00:00
a1824b8dc4 feat(pdf): Home-style file list + Clear-all button
User feedback: the standard file_uploader didn't visually match
the Home page, and there was no obvious way to clear out
uploaded files between scans (have to refresh the browser tab).

**Persistent stash + add-only sync.** Files captured into
``st.session_state["pdf_uploads"]`` (dict name → {bytes, size})
via an ``on_change`` callback on the file_uploader widget. The
callback is **add-only** — never removes files from the stash
based on widget state. Removal is owned by the custom X buttons
+ widget-counter bump (see below). This guarantees a hidden
native X click can't silently drop files behind the user's
back.

**Hidden native file list.** A small CSS block suppresses the
file_uploader's built-in file rows + their delete buttons
(``stFileUploaderFile`` + ``stFileUploaderDeleteBtn``), so the
custom list below is the single source of truth on screen.

**Custom file list (Home pattern).** Below the dropzone, every
uploaded file gets a row: ``✕ | 📄 filename | size``. Top of
section shows ``N files · 12.3 MB total``. Counts and sizes
update in real time as the user adds or removes files. The X
button per row calls ``log_event("upload", "PDF removed: …")``,
removes the entry from the stash, and bumps the widget counter
to clear the widget too.

**Clear-all button.** Sits next to the Scan button. Wipes the
stash, bumps the widget counter, drops any cached scan results
(``K_ROWS``, ``K_WARNINGS``, ``K_SOURCE_COUNT``). Audited via
``log_event("upload", "PDF list cleared", count=N)``.

**Widget reset via counter bump.** Streamlit disallows
programmatic mutation of widget session-state entries; the
standard workaround is to rotate the widget's ``key``. Page
maintains ``K_UPLOAD_COUNTER`` which gets incremented on
remove / clear-all, producing a fresh ``pdf_upload_v{N}`` key
and a freshly-instantiated empty widget. The stash retains any
unaffected files; on next upload, the add-only sync picks up
the new ones without re-adding the removed ones.

**Scan rewired to read the stash.** Instead of iterating the
widget's UploadedFile objects (which the previous code did and
which broke when the widget unmounted on remove), the scan
loop iterates ``pdf_uploads.items()`` and uses the cached
``bytes``. Diagnostic expander does the same — re-reads from
the stash, removing the need for a separate ``K_DIAGNOSTIC``
cache (deleted).

**``_format_size`` helper** ports the byte-formatting logic
from ``_home.py``'s pattern (KB / MB / GB rollover).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:28:01 +00:00
155dd30746 feat(pdf): extract statement header (account + period) + date format
Two related additions for the accountant workflow:

**1. Statement header extraction.** New
``extract_statement_metadata(pages)`` pulls the account number
and statement period out of the first page (falls back to
page 1+2 if either is missing on page 1 — Wells Fargo business
accounts put header info on page 2). Detected fields are
stamped onto EVERY transaction row so a multi-statement CSV is
self-attributing per row::

    {
      "date": "20250113",
      "description": "Coffee Shop",
      "amount_1": -4.50,
      "account_number": "****5678",
      "statement_period_start": "20250101",
      "statement_period_end": "20250131",
      ...
    }

Account-number regex is tolerant of masks (``****1234``),
hyphens (``1234-5678-9012``), and spaces. Period regex looks
for "Statement Period" / "From" / "Period Covered" labels plus
the first 1-2 full-year dates that follow. If only one date is
present near the label, it's used for both start and end (some
statements show only the closing date).

**2. Year inference for short dates.** When the row date is a
short ``01/13`` or ``Jan 13`` without a year, the scanner now
binds the year from the statement period's end date BEFORE
formatting. Doesn't handle the December-in-January-statement
cross-year case (rare; user can edit in the table).

**3. Configurable output date format.** New
``output_date_format`` parameter on ``scan_pdf_for_transactions``
defaults to ``%Y%m%d``. Applied to: the transaction date column
AND the statement period start/end fields. The page surfaces a
dropdown in Scan options with common presets (YYYYMMDD,
YYYY-MM-DD, MM/DD/YYYY, DD/MM/YYYY, ``Mon DD, YYYY``) plus a
Custom option that accepts a raw strftime string.

New helper: ``format_date(iso_str, fmt)`` converts ISO
``YYYY-MM-DD`` to any strftime; passes invalid input through
unchanged so the user can see what was actually there rather
than getting silent empties.

20 new tests cover: format_date, account-number extraction
(masked / hyphenated / spaced / no-label / short), period
extraction (standard / from-to / single-date / no-label),
metadata orchestrator (full header / no pages / page-2
fallback), year inference (US / dash / month-name / no-period /
unparseable), plus an end-to-end class that builds a header'd
PDF with short-date transactions and confirms metadata
attribution + year inference + format round-trip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:20:46 +00:00
3cf935c999 fix(pdf): drop zero-amount rows; multi-date rows clean description
Two corrections from real-statement feedback:

**1. Drop rows where the transaction amount is exactly 0.**
Bank statements include date+amount-shaped noise like
"INTEREST EARNED 0.00", "PAGE TOTAL 0.00", "BALANCE FORWARD
0.00 1,234.56" — all match the date+amount heuristic but
aren't transactions. New filter in
``scan_pdf_for_transactions``: drop rows whose ``amount_1``
parses to exactly 0. Non-zero balances in ``amount_2`` don't
rescue a zero amount_1 — leftmost amount is the canonical
transaction amount. Unparsed-but-non-empty amount strings are
kept (user verifies in the editor).

**2. Multi-date rows: first date wins for the column, every
date excluded from the description.** Chase / BofA / Wells
commonly show both a transaction date and a posting date per
row:

    01/13  01/14  COFFEE SHOP  $4.50

Before this fix, ``_find_dates_in_words`` returned the first
date only and the second date leaked into description as
"01/14 COFFEE SHOP". Now it returns ALL dates with their word
ranges; the scanner uses ``dates[0]`` as the canonical date
and passes every range to the description builder for
exclusion.

The detector's two-pass strategy now also guards against
mixing full-year and short-date matches on the same row.
Previously, a header line like ``Page 1/2 of 3 ... Statement
Date 01/13/2026`` would return both ``1/2`` and ``01/13/2026``,
and ``1/2`` (being leftmost) would have won the date column.
Now: if any full-year date is found on the row, short patterns
are NOT also collected — full year anchors interpretation. A
row with no full-year date (Chase short-date case) still falls
back to short patterns and collects all of them.

New tests:
- ``test_multiple_dates_returned_in_position_order`` —
  ``01/13`` + ``01/14`` both returned, in order
- ``TestMultiDateRow.test_first_date_wins_second_excluded_from_description``
  — end-to-end through ``scan_pdf_for_transactions``
- ``TestZeroAmountRowsAreDropped.test_zero_amount_row_dropped``
  — "INTEREST EARNED 0.00" row dropped while real txn kept
- ``test_negative_amount_kept`` — pin that -40.00 is not
  treated as zero by the filter

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:12:21 +00:00
263af3c7c2 fix(pdf): short dates without year + diagnostic for "0 rows" runs
User uploaded a real Chase statement and got "0 rows detected."
Two bugs the rewrite shipped with, plus a diagnostic:

**1. Short dates without year weren't recognized.** Most bank
statements (Chase, Wells, BofA, …) display transaction dates as
``01/13`` or ``Jan 13`` because the year is implied by the
statement period. The original regex required ``\d{2,4}`` after
the second slash, so ``01/13`` failed to match and rows with no
detected date got dropped.

Split ``_DATE_RES`` into ``_FULL`` (with year) and ``_SHORT``
(no year), with a two-pass detector: pass 1 tries full-year
patterns across the whole row; pass 2 only tries short patterns
if pass 1 found nothing. This prevents a stray ``Page 1/2`` from
shadowing the real dated transaction on the same line.

Short patterns:
- ``\d{1,2}/\d{1,2}`` — Chase, etc.
- ``\d{1,2}-\d{1,2}``
- ``[A-Z][a-z]{2}\s+\d{1,2}`` — "Jan 13"

When parsing, short dates pass through ``parse_date`` and
return None (no year to bind to), so the scanner falls back to
the raw text — the user sees ``01/13`` in the date column and
can correct in the editor.

**2. Multi-word dates leaked the day token into the description.**
A pre-existing bug: ``_find_dates_in_words`` returned only the
START word index, and ``_description_from_row`` only excluded
that single word. For "Jan 13 Coffee $4.50", the description
became "13 Coffee" instead of "Coffee". Fixed by returning
``(start, end, text)`` with ``end`` exclusive (computed from
``len(m.group(1).split())`` so window-overrun doesn't
over-consume), and the description builder now skips the full
range.

**3. New diagnostic: ``diagnose_pdf_lines(pdf_bytes)``.** Returns
every clustered text line the scanner saw with ``has_date`` /
``has_amount`` flags. When the page's scan returns 0 rows, an
auto-expanded "what the scanner saw" expander now renders a
table of all extracted lines so the user can:

- Spot scanned-PDF cases (empty result → enable OCR)
- See which lines have a date but no amount (or vice versa)
- Eyeball the date / amount format the scanner missed

Without leaving the app or asking the developer for help.

Eight new tests cover: short US date (``01/13``), short month-
name date with two-word consumption (``Jan 13``), the
``Page 1/2 ... 01/13/2026`` shadowing case, and the multi-word-
date description fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:06:07 +00:00
bece2b4030 refactor(pdf): rip out templates; heuristic scan + selectable table
User feedback: the template / visual-picker / mode-dispatch
implementation was too complex for the actual workflow.
Statements drift between months, the canvas state didn't survive
multi-page navigation, and accountants don't want to maintain
per-bank configuration just to convert PDFs to CSV.

Start-over design — one public function, one page, no
persistence:

  ``scan_pdf_for_transactions(pdf_bytes) → (rows, warnings)``

A row is "any text line with a date pattern AND at least one
amount pattern." Each detected row is a dict shaped::

    {
      "date": "2026-01-15",
      "description": "Coffee Shop",
      "amount_1": -4.50,
      "amount_2": 1000.00,   # if a second amount was found
      "page": 1,
      "raw": "01/15/2026 Coffee Shop (4.50) 1,000.00",
      "source_file": "chase-jan-2026.pdf",
    }

Multi-line descriptions still merge (no-date no-amount lines
attach to the previous transaction). Multi-PDF batches share a
single combined table with a ``source_file`` column.

**Page UX:**

- Upload PDF(s) → optional Options expander (parens-negative,
  use-OCR) → click Scan → see all detected rows in an
  ``st.data_editor``.
- The editor has an ``Include`` checkbox column (default on),
  plus user-editable date / description / amount cells and a
  read-only ``raw`` column showing the original PDF text for
  verification.
- A ``Columns to include in CSV`` multiselect hides
  ``page`` / ``raw`` from the download by default; user can
  re-add either.
- Download CSV gets only the checked rows.

No template save/load. No visual picker. No mode dispatch. No
column boundaries. No schema migration. No per-bank
configuration files.

**Deletions:**

- ``src/pdf_templates.py`` — template storage layer
- ``src/gui/_drawable_canvas_compat.py`` — Streamlit compat shim
  for the canvas (no canvas now)
- ``tests/test_pdf_templates.py``, ``test_pdf_row_heuristic.py``,
  ``test_drawable_canvas_compat.py`` — covered the removed APIs
- ``build/hooks/hook-streamlit_drawable_canvas.py`` — hook for
  the removed dep
- ``streamlit-drawable-canvas==0.9.3`` from ``requirements.txt``
- The drawable-canvas references in ``build/datatools.spec``

**``src/pdf_extract.py``** shrinks from ~30 helper functions to
~10. Keeps: value parsers, row clusterer, date/amount token
finders, OCR pipeline, dependency guards. The one new public
function ``scan_pdf_for_transactions`` glues them together.

**Tests** (59 passing): the unit layer keeps full coverage of
the building blocks; the smoke layer pins the end-to-end PDF
roundtrip, OCR discovery, dependency-import behavior, and the
multi-line-description merge. The fpdf2-generated fixture PDF
still drives the real-PDF test.

Rollback: ``git revert HEAD`` brings back the template system if
needed — but the simpler model should make that unlikely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:57:30 +00:00
60969c0770 feat(pdf): UI rework — Auto-detect is the default build flow
Pulls the user's primary mental model away from "draw column
boundaries" toward "tell me what shape your amounts have, see
detected rows, save." The visual picker that wasn't working for
multi-statement workflows is reachable but no longer the
default.

**Build mode header** now has a mode radio:

- "Auto-detect (recommended)" — row_heuristic. Tabs: Amount
  layout · Filters & date · Save. Three small forms; no
  coordinate UI anywhere. The Amount-layout tab's dropdown picks
  one of single / txn+balance / debit+credit / debit+credit+balance
  and auto-derives the min/max amount-count range (overridable
  under an expander).
- "Visual columns (advanced)" — column_visual. Five tabs (the
  original Visual picker / Pages & table / Columns / Parsing /
  Save). A yellow warning panel up top reminds the user that
  column-x templates only work when statement layout is stable.

Switching modes triggers a rerun so the right tab set renders
immediately. The template object preserves both mode's config
trees side-by-side so a user can flip between them without
losing work.

**Live preview** below the form runs ``apply_template`` against
the cached sample pages (already cached in session_state so this
re-renders cheaply on every form edit). The "no rows yet"
message is mode-aware — points users at the right tuning knobs
for whichever mode they're in. The preview caption notes which
mode produced the rows so the user can correlate decisions to
output.

The visual picker bug the user reported — "a single box stays in
the same location regardless of page" — is sidestepped rather
than fixed: in row_heuristic mode there's no canvas to confuse,
and for the rare column_visual user the canvas is still
imperfect but no longer their first interaction with the tool.
Cleaning up the column_visual canvas state bugs is a separate
follow-up if real users still hit the Advanced mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:46:27 +00:00
48cd9e8249 feat(pdf): schema v2 + mode field + v1 in-memory migration
Bumps ``SCHEMA_VERSION`` from 1 to 2 to add a top-level ``mode``
field distinguishing ``row_heuristic`` (new default) from
``column_visual`` (legacy). The schema bump is real — old code
that defaults missing keys would silently mis-extract — so we
do it the careful way:

- ``new_template`` now returns mode=``row_heuristic`` with the
  full row-heuristic config tree pre-populated. The legacy
  column-visual fields are still seeded with empty defaults so
  switching modes in the GUI doesn't require runtime key
  insertion.
- ``validate_template`` is mode-aware: row_heuristic templates
  must have a valid ``amounts.shape`` + sane
  ``row_detection.min/max_amounts_per_row``; column_visual
  templates keep the existing column/target requirements.
- ``load_template`` accepts both v1 and v2 files
  (``_LOAD_SUPPORTED_VERSIONS = {1, 2}``). v1 files get
  ``mode="column_visual"`` injected and ``schema_version`` bumped
  IN MEMORY ONLY — disk file stays v1 until the user explicitly
  re-saves. A buggy migration can't silently corrupt their
  template library.
- ``save_template`` continues to write the current schema; saving
  a v1 template through the GUI naturally upgrades it.

Mode + shape constants exported (``VALID_MODES``,
``VALID_AMOUNT_SHAPES``) so the GUI dropdowns can derive their
options from the source of truth.

Tests split into ``TestValidateTemplateRowHeuristic`` (6) +
``TestValidateTemplateColumnVisual`` (4) + ``TestV1Migration``
(1). All 29 template tests pass; the original column-mode tests
that previously implicitly relied on schema_version=1 keep
working because new_template's seeded column fields are still
present in row_heuristic templates (just not validated as
required).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:46:10 +00:00
d80befd05a feat(pdf): row-heuristic extraction (mode dispatch, no coordinates)
User reported the column-visual approach is too brittle for real
bank statements: column-x-positions saved against a sample page
don't survive layout drift between months (statement A has
columns at x=300, statement B drifted to x=320), and a saved
template can only realistically work for one statement's
specific render. The fundamental fix is to stop depending on
coordinates at all.

**Row-heuristic mode** finds transaction rows by pattern: any
line with a date token + N amount tokens IS a transaction. Date
patterns (US slash / EU slash / ISO / "Jan 15, 2026" / etc.) and
amount patterns (currency, parens-negative, thousands grouping)
are matched against word text — no x-positions involved.

The full pipeline:

1. ``find_transaction_rows`` clusters words into rows and scans
   each line for date + amount tokens.
2. Multi-line descriptions still attach to the previous row via
   the no-date-no-amount continuation rule.
3. Amount shapes drive interpretation: ``single`` /
   ``txn_balance`` / ``debit_credit`` / ``debit_credit_balance``.
4. ``_infer_amount_column_centers`` clusters amount x-midpoints
   ACROSS ALL detected rows to find natural column groupings —
   so debit-vs-credit assignment for single-amount lines works
   without the user marking anything on screen.

``apply_template`` is now a dispatch over ``template["mode"]``:

- ``mode="row_heuristic"`` (default for new templates) — the new
  pipeline.
- ``mode="column_visual"`` — the existing pipeline, kept under
  ``_apply_template_column_visual`` for v1 templates and the
  Advanced fallback.

18 new tests cover: date detection (US slash, two-digit year,
ISO, month-name, missing); amount-token finding (currency,
parens, pure text, bare-year rejection); column-center inference
(clear two-column case, empty input); end-to-end on synthetic
Page objects with all four amount shapes; the critical
layout-drift test that proves the same template works on pages
of different sizes / different absolute x-positions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:45:55 +00:00