main
66 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
| 9943e6e537 |
test(demo): cover the demo app + sales-surface coherence
Adds a demo test suite on top of the data-value pins: - tests/gui/test_app_demo.py (new, AppTest): every accounting persona renders with its dataset, the default/unknown-persona fallback resolves to bookkeeper, clicking Run produces the AFTER value (rows reduced to the validated count) with the watermarked download + Gumroad CTA, and switching persona via the quick-switch dropdown clears the stale result. - tests/test_demo_pipelines.py (extended): cross-surface coherence — each persona key served by app_demo has a matching landing page whose iframe (?p=) and CTA (from=) point at it and that the hub links to; no retired Shopify/RevOps language remains in landing HTML; and the demo download still appends exactly one watermark row. Full suite: 2584 passed, 91 skipped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
|||
| 6df726e69e |
demo: reconstruct sales demos for an accounting audience
Replaces the Shopify / RevOps / Bookkeeper demo trio with three accounting personas that share one buyer, each entering through a workflow where a messy export costs money — all running the same saved 4-step pipeline: - bank_reconciliation.csv (Bookkeeper): 26 -> 20 rows, 6 double-posted transactions caught after date+amount standardization. - vendor_1099.csv (AP / 1099): 24 records -> 8 vendors, 7 missing EINs recovered via dedup merge — the 1099-complete story. - ar_open_invoices.csv (AR): 26 -> 21 rows, 5 double-entered invoices removed, blank status backfilled from the twin row. Every number is validated against the live engine and pinned by tests/test_demo_pipelines.py (read path mirrors app_demo._load_demo: dtype=str, keep_default_na=False). Rewires src/gui/app_demo.py PERSONAS (keys bookkeeper / ap-1099 / ar-aging, accounting H1/sub/CTA) and rewrites docs/DEMO-PLAN.md sections 3/4/7 with the validated outcomes. (Repo hygiene forced by a partial-clone gap: finalizes the already-deleted, unreferenced samples/messy_text.csv whose blob was unrecoverable.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
|||
| 38616d69e2 |
test(pipeline): complete automated test suite for the pipeline feature
Adds ~115 tests pinning the Automated Workflows feature end to end: - tests/test_pipeline.py (+43): per-adapter summary correctness on known inputs, multi-step data flow, error stop/continue contract, empty / single-column / all-disabled edges, dict+file serialization round-trips, recommended_pipeline(include=…), and a synthesized demo integration run. - tests/test_cli_pipeline.py (new, 21): --recommend, dry-run-by-default, --apply output CSV + audit JSON, --steps, --strict abort, arg validation, --continue-on-error vs halt, and a save→load round-trip. Invokes the Typer app directly to bypass the license guard (house pattern). - tests/gui/test_pipeline_builder.py (+9): reorder ▲/▼, disabled edge buttons, disabled-step persistence across reorder, restore-recommended, Advanced JSON export/import, and per-tool Configure panels emitting the correct option dicts (AppTest). - tests/gui/test_pipeline_phrasing.py (new, 30): step_phrase/step_status and the adapter-key→friendly-name bridge as pure functions, incl. pluralization, column prose, and warn/error status derivation. Full suite: 2565 passed, 91 skipped. No product bugs surfaced. Documents the coverage in docs/DEVELOPER.md (test tree + a pipeline-coverage note). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
|||
| 00d3f28865 |
feat(pipeline): plain-English per-step result summaries
Replaces the raw-JSON summary column in the Results table with the mockup's plain-English phrasing: "312 duplicates removed across 147 groups (18,442 → 18,130 rows)", "1,204 cells cleaned in name & city", etc. (correct singular/plural via a small _n helper). Adds step_phrase() and step_status() to pipeline_modules.py. step_status derives the status pill (✓ ok / ⚠ ok · N skipped / ✗ error / ⏭ skipped) and, for warn/error steps (e.g. format_standardize unparseable cells, column_map coercion failures / missing required targets), an inline detail callout rendered directly below the results table — surfacing non-fatal issues in context without a dedicated always-empty column. Extends tests/gui/test_pipeline_builder.py with phrasing + status assertions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
|||
| 837f4b88b5 |
feat(pipeline): visual module-card builder for Automated Workflows
Replaces the raw options_json data-editor table with a per-step "module card" builder matching the locked design mockup (layout-review/09_pipeline_runner.html): each step shows a friendly name + caption, an enable toggle, ▲/▼/✕ reorder/remove controls, and a Configure expander that renders that tool's own controls in plain language. Raw JSON is demoted to an Advanced import/export section. New src/gui/components/pipeline_modules.py holds the adapter-key→tool_id friendly-name bridge, one plain-language config renderer per tool (text_clean, format_standardize, missing, column_map, dedup — emitting the exact JSON option shapes the core adapters accept), and render_step_card. Steps live in session state as an ordered list with stable ids so widget keys survive reorder/remove. Reorder is ▲/▼ buttons (no JS drag dependency). The on-disk/CLI pipeline JSON format is unchanged — CLI and src/core untouched. Adds tests/gui/test_pipeline_builder.py (AppTest) covering seed, configure panels, toggle/add/remove, and a full run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
|||
| 1895074b8f |
test+fix(gui): retire the now-empty "analysis" nav section
The journey-level nav restructure moved Home to a standalone "Start here" entry and Reconcile into the "Finance" group, leaving the "analysis" section with zero tools. Two registry tests encoded the old layout and failed: - test_every_section_has_at_least_one_tool[analysis] (empty section) - test_reconciler_present (asserted section == "analysis") Drop "analysis" from the Section literal, SECTION_LABELS, and app.py's by_section bucket — it's genuinely dead now (home isn't a registry Tool). Update the presence tests to assert Reconcile + PDF to CSV live in "finance". The section-invariant tests (every section non-empty, has a label, no orphan labels) are preserved and pass. Full suite: 2441 passed, 91 skipped, 0 failed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
|||
| 17faf84aed |
feat(pdf): probe bundled Tesseract first when running frozen
Adds runtime support for the bundled Tesseract that ships inside the DataTools installer / portable / AppImage artifacts. When DataTools is launched from a PyInstaller frozen bundle the OCR engine now resolves automatically — no end-user install required. New helpers in src/pdf_extract.py: - _bundled_tesseract_path() → Path | None — returns <sys._MEIPASS>/tesseract/tesseract[.exe] when getattr(sys, "frozen", False) AND sys._MEIPASS are present; None in dev. - _bundled_tessdata_dir() → Path | None — same gating, returns <sys._MEIPASS>/tesseract/tessdata. - _apply_bundled_tessdata_prefix() — sets TESSDATA_PREFIX to the bundled tessdata dir before any pytesseract call; only if frozen, dir exists, and the user hasn't already overridden the env var. Discovery order in ocr_available() / _autodetect_tesseract_path(): 1. DATATOOLS_TESSERACT_PATH env override (existing) 2. Bundled binary (NEW — frozen-only) 3. System PATH (existing) 4. Windows well-known install dirs (existing legacy fallback) In dev (not frozen) every new probe is a no-op so the developer experience is unchanged. 12 new tests cover frozen vs. non-frozen detection on each platform, the user-override respect for TESSDATA_PREFIX, autodetect priority ordering, and the no-bundled-dir graceful path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| 4955fb239b |
test: cover help_md keys, header smoke, and bilingual ES smoke
Two stale Spanish smoke assertions still expected English page titles
for PDF Extractor and Reconciler — the i18n work landed real
translations ("PDF a CSV", "Reconciliar dos archivos"), so refresh the
expected substrings and the surrounding comment.
Add new coverage for the help-popover feature:
- TestHelpPopoverKeys (test_lang_packs): every tool_id resolves a
non-empty tools.<id>.help_md in BOTH packs; help.button_label and
help.missing_body resolve in both.
- TestDescriptionCopy (test_tools_registry): every Tool.description
non-empty and under 120 chars — pins the post-jargon-scrub copy
so future drift back into multi-clause prose is loud.
- TestRenderToolHeaderSmoke: render_tool_header is callable, listed
in components.__all__, and every i18n key it touches resolves in
both packs. Runs without a Streamlit script context.
Suite: 2427 passed (+9 new), 91 skipped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| 6627895a10 |
test: fix v3 branding drift, add reconcile CLI + registry coverage
GUI/lang-pack tests were asserting against pre-v3 strings ("Data
Cleaning Mastery", "Maestría en limpieza…") that the brand refresh
replaced with "UNALOGIX DataTools" + "Clean. Normalize. Transform."
Updated assertions to the current copy and switched the findings
panel tests to the redesigned flat-list layout (per-finding "Open
Tool →" buttons instead of per-tool expanders).
New coverage:
- tests/test_cli_reconcile.py (13) — preview/apply, tolerance flags,
sign inversion, key flags, error paths, Excel input.
- tests/test_tools_registry.py (27) — unique tool_ids, page_slug →
real file, valid sections/tiers, localized accessor fallbacks,
explicit pins for PDF Extractor + Reconciler entries.
- tests/test_reconcile.py — one-side-empty, key-pass tagging,
additional validation cases, input-DataFrame immutability.
- tests/gui/test_smoke.py — PAGE_SLUGS now includes 10_PDF_Extractor
and 11_Reconciler in both en/es.
- tests/gui/test_workflows.py — TestPdfExtractorWorkflow and
TestReconcilerWorkflow render checks.
Net: 2317 passed → 2418 passed, 0 failures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| e44af3a45e |
feat(reconcile): two-source reconciliation tool
Bank-feed-vs-ledger style matcher: 4-pass greedy assignment (key → exact → tolerance → fuzzy) with ambiguous candidates routed to a review bucket instead of arbitrary picks. CLI mirrors the cli_text_clean preview/--apply pattern; Streamlit page registered in the automations section. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| 450d4fc9a8 |
feat(pdf): default output date format to YYYY-MM-DD
User asked to flip the default from YYYYMMDD to YYYY-MM-DD. ISO is the better default for an accountant CSV workflow: - Lexicographic sort = chronological sort (no parsing needed). - Every spreadsheet tool the user might import into recognises it as a real date with no ambiguity (US vs EU readers can't disagree on the order). - Hyphens make the year/month/day boundaries scan-able by eye. Concrete changes: - New module constant ``DEFAULT_DATE_FORMAT = "%Y-%m-%d"``, used as the default for ``format_date()`` and the ``output_date_format`` keyword on ``scan_pdf_for_transactions``. - Page's ``_DATE_FORMAT_CHOICES`` reordered so the ISO entry is first (index 0 = default Streamlit selection); YYYYMMDD drops to second. - Custom-strftime input default also flips to ``%Y-%m-%d``. Tests updated to reflect the new default (``test_dates_formatted_iso_by_default``, ``test_short_dates_get_year_from_period``, ``test_compact_format_round_trip``, plus a new ``test_default_is_iso`` for the format_date helper). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| a0042d4aba |
feat(pdf): Dec/Jan-aware year inference + filename hint + override
Previous year inference picked ``period_end_iso[:4]`` for every
short date, which fails on statements that cross the Dec/Jan
boundary. A "12/30" row in a 2024-12-16 to 2025-01-15 statement
got 2025-12-30 (wrong) instead of 2024-12-30.
New cascade for ``_infer_year_for_short_date``:
1. **``override_year``** — caller supplies it (new ``"Override
year for short dates"`` field in Scan options). Beats every
heuristic. Empty by default; the page validates the value
is a 4-digit-looking integer in 1900-2100 and falls back to
automatic on garbage input.
2. **Statement period start + end** — the function now takes
BOTH dates and generates candidates with every distinct year
in the period (one year for same-year statements, two for
Dec/Jan boundaries). The picker scores each candidate by
distance from the period: candidates inside the period
score 0, candidates outside score ``min(|days from start|,
|days from end|)``. Lowest-distance candidate wins. So:
- ``12/30`` + period 2024-12-16 to 2025-01-15 → 2024-12-30
(inside period, score 0)
- ``01/05`` + same period → 2025-01-05 (inside, score 0)
- ``12/15`` + same period → 2024-12-15 (1 day before,
closer than 2025-12-15 which is 11 months after)
3. **``filename_year_hint``** — fallback when the statement
period regex misses the bank's specific layout. The page
passes ``year_from_filename(upload.name)`` automatically so
files like ``eStmt_2025-01-13.pdf`` get year 2025 even if
the PDF's text doesn't yield a parseable period. The regex
matches the first ``20XX`` token bounded by non-digits.
Both new helpers (``year_from_filename`` and the new
``_try_short_date_with_year`` factor-out) are exported and
tested. 16 new tests cover: within-period inference (same-year
sanity), Dec/Jan boundary cases for both sides, the
just-before-period closer-distance case, override priority,
filename fallback, no-signal None, dash-format / month-name
shorthand round-trip, garbage input, filename year extraction
(eStmt pattern, embedded, first-match-wins, no-match, empty).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| 34b56b404a |
fix(pdf): drop statement_period_start/end columns from output
User asked to remove them — the two columns repeated the same value on every row from a given statement, took up screen space in the editor, and offered limited value once the date column already carries the inferred full date. What's kept: - ``account_number`` — still stamped onto every row so multi- statement CSVs are self-attributing - ``extract_statement_metadata`` — still runs every scan because ``period_end`` is the source of the year inference that binds Chase-style short ``01/13`` dates to ``20250113`` - ``_extract_statement_period`` and its tests — period detection itself isn't going anywhere, just its appearance in the output rows What's removed: - ``record["statement_period_start"]`` / ``record["statement_period_end"]`` assignments in ``scan_pdf_for_transactions`` - The two columns from the page's column-ordering setup - Tests pinning their presence; replaced with assertions that they're explicitly absent Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| ad7c22d7fb |
fix(pdf): consistent 2-decimal amount precision in display and CSV
User reported amounts losing trailing zeros — 4.50 rendering as
4.5, 1000.00 as 1000 — on the same statement. Classic float
display issue: Python's native ``repr(4.5)`` drops the
``.0``, and pandas / Streamlit happily show that
inconsistency cell-by-cell.
Two layers of fix, internal type stays ``float`` for arithmetic:
**Display.** ``st.column_config.NumberColumn(format="%.2f")``
applied programmatically to every ``amount_*`` column on the
data_editor. Every numeric amount now shows with exactly two
decimal places regardless of trailing zeros.
**CSV export.** Pandas' default float-to-CSV writer also drops
trailing zeros (the same issue an accountant would see when
opening the file in Excel). Before serialising, each amount
column is mapped through the new ``format_amount`` helper —
returns ``f"{v:.2f}"`` for numerics, empty string for
None/NaN/inf, ``str(value)`` for booleans (guards the
``True → "1.00"`` foot-gun since ``bool`` is an ``int``
subclass), and passes through any string the scanner kept
because parsing failed (e.g. ``(4.50)`` when parens-negative is
off — user can correct in the editor before re-exporting).
``format_amount`` lives in ``src/pdf_extract.py`` so it's
testable in isolation (the page module can't easily be unit
tested because of its Streamlit import chain). 8 new tests
cover the trailing-zeros case, negatives, None/empty,
string-passthrough, bool guard, NaN/inf, and the ``places``
parameter.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| 6f2ad57490 |
fix(pdf): require non-empty description; tighten multi-line merge
User reported "Daily Ledger Balances" entries leaking into output. Three correlated bugs in the row qualifier: **1. Empty description is now disqualifying.** A row like ``01/13/2025 $1,000.00`` has a date and an amount but no text between them — that's a daily-balance entry, a period-summary, or page furniture. Drop these. New filter sits after ``_description_from_row`` returns: if the description string is empty (or whitespace-only), continue past the row. **2. ``prev`` resets per page.** The state that drives multi- line description merging (the "previous transaction this continuation might attach to") used to persist across page boundaries. A no-date no-amount line at the top of page 2 could silently attach to the last transaction on page 1. Fixed by moving the ``prev`` / ``prev_y_bottom`` declarations into the outer page loop so each page starts clean. **3. Multi-line merges now check y-distance.** Before this fix, ANY no-date no-amount line attached to the previous transaction's description. A "Daily Ledger Balances" section header several rows below the last transaction would silently fold into it. Now the merge only happens when the gap ``current_top - prev_y_bottom <= 25.0`` PDF points — generous enough for one blank-line gap between wrapped descriptions, tight enough to reject section headers across paragraph breaks. The threshold is a module constant (``_MULTILINE_MERGE_MAX_GAP``) for future tuning if real statements call for it. Three new test classes: - ``TestRequiresDescription.test_empty_description_row_dropped`` — date+amount-no-text row filtered, real transaction kept. - ``TestPrevTransactionResetsPerPage.test_no_cross_page_merge`` — page-1 transaction + page-2 section header = no merge. - ``TestMultilineMergeYGap`` — close continuation merges (10-pt gap), far section header doesn't (100-pt gap). The original ``TestMultilineDescription.test_continuation_line_merges`` still passes — its setup has a 10-pt gap which is within the new threshold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| 155dd30746 |
feat(pdf): extract statement header (account + period) + date format
Two related additions for the accountant workflow:
**1. Statement header extraction.** New
``extract_statement_metadata(pages)`` pulls the account number
and statement period out of the first page (falls back to
page 1+2 if either is missing on page 1 — Wells Fargo business
accounts put header info on page 2). Detected fields are
stamped onto EVERY transaction row so a multi-statement CSV is
self-attributing per row::
{
"date": "20250113",
"description": "Coffee Shop",
"amount_1": -4.50,
"account_number": "****5678",
"statement_period_start": "20250101",
"statement_period_end": "20250131",
...
}
Account-number regex is tolerant of masks (``****1234``),
hyphens (``1234-5678-9012``), and spaces. Period regex looks
for "Statement Period" / "From" / "Period Covered" labels plus
the first 1-2 full-year dates that follow. If only one date is
present near the label, it's used for both start and end (some
statements show only the closing date).
**2. Year inference for short dates.** When the row date is a
short ``01/13`` or ``Jan 13`` without a year, the scanner now
binds the year from the statement period's end date BEFORE
formatting. Doesn't handle the December-in-January-statement
cross-year case (rare; user can edit in the table).
**3. Configurable output date format.** New
``output_date_format`` parameter on ``scan_pdf_for_transactions``
defaults to ``%Y%m%d``. Applied to: the transaction date column
AND the statement period start/end fields. The page surfaces a
dropdown in Scan options with common presets (YYYYMMDD,
YYYY-MM-DD, MM/DD/YYYY, DD/MM/YYYY, ``Mon DD, YYYY``) plus a
Custom option that accepts a raw strftime string.
New helper: ``format_date(iso_str, fmt)`` converts ISO
``YYYY-MM-DD`` to any strftime; passes invalid input through
unchanged so the user can see what was actually there rather
than getting silent empties.
20 new tests cover: format_date, account-number extraction
(masked / hyphenated / spaced / no-label / short), period
extraction (standard / from-to / single-date / no-label),
metadata orchestrator (full header / no pages / page-2
fallback), year inference (US / dash / month-name / no-period /
unparseable), plus an end-to-end class that builds a header'd
PDF with short-date transactions and confirms metadata
attribution + year inference + format round-trip.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| 3cf935c999 |
fix(pdf): drop zero-amount rows; multi-date rows clean description
Two corrections from real-statement feedback:
**1. Drop rows where the transaction amount is exactly 0.**
Bank statements include date+amount-shaped noise like
"INTEREST EARNED 0.00", "PAGE TOTAL 0.00", "BALANCE FORWARD
0.00 1,234.56" — all match the date+amount heuristic but
aren't transactions. New filter in
``scan_pdf_for_transactions``: drop rows whose ``amount_1``
parses to exactly 0. Non-zero balances in ``amount_2`` don't
rescue a zero amount_1 — leftmost amount is the canonical
transaction amount. Unparsed-but-non-empty amount strings are
kept (user verifies in the editor).
**2. Multi-date rows: first date wins for the column, every
date excluded from the description.** Chase / BofA / Wells
commonly show both a transaction date and a posting date per
row:
01/13 01/14 COFFEE SHOP $4.50
Before this fix, ``_find_dates_in_words`` returned the first
date only and the second date leaked into description as
"01/14 COFFEE SHOP". Now it returns ALL dates with their word
ranges; the scanner uses ``dates[0]`` as the canonical date
and passes every range to the description builder for
exclusion.
The detector's two-pass strategy now also guards against
mixing full-year and short-date matches on the same row.
Previously, a header line like ``Page 1/2 of 3 ... Statement
Date 01/13/2026`` would return both ``1/2`` and ``01/13/2026``,
and ``1/2`` (being leftmost) would have won the date column.
Now: if any full-year date is found on the row, short patterns
are NOT also collected — full year anchors interpretation. A
row with no full-year date (Chase short-date case) still falls
back to short patterns and collects all of them.
New tests:
- ``test_multiple_dates_returned_in_position_order`` —
``01/13`` + ``01/14`` both returned, in order
- ``TestMultiDateRow.test_first_date_wins_second_excluded_from_description``
— end-to-end through ``scan_pdf_for_transactions``
- ``TestZeroAmountRowsAreDropped.test_zero_amount_row_dropped``
— "INTEREST EARNED 0.00" row dropped while real txn kept
- ``test_negative_amount_kept`` — pin that -40.00 is not
treated as zero by the filter
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| 263af3c7c2 |
fix(pdf): short dates without year + diagnostic for "0 rows" runs
User uploaded a real Chase statement and got "0 rows detected."
Two bugs the rewrite shipped with, plus a diagnostic:
**1. Short dates without year weren't recognized.** Most bank
statements (Chase, Wells, BofA, …) display transaction dates as
``01/13`` or ``Jan 13`` because the year is implied by the
statement period. The original regex required ``\d{2,4}`` after
the second slash, so ``01/13`` failed to match and rows with no
detected date got dropped.
Split ``_DATE_RES`` into ``_FULL`` (with year) and ``_SHORT``
(no year), with a two-pass detector: pass 1 tries full-year
patterns across the whole row; pass 2 only tries short patterns
if pass 1 found nothing. This prevents a stray ``Page 1/2`` from
shadowing the real dated transaction on the same line.
Short patterns:
- ``\d{1,2}/\d{1,2}`` — Chase, etc.
- ``\d{1,2}-\d{1,2}``
- ``[A-Z][a-z]{2}\s+\d{1,2}`` — "Jan 13"
When parsing, short dates pass through ``parse_date`` and
return None (no year to bind to), so the scanner falls back to
the raw text — the user sees ``01/13`` in the date column and
can correct in the editor.
**2. Multi-word dates leaked the day token into the description.**
A pre-existing bug: ``_find_dates_in_words`` returned only the
START word index, and ``_description_from_row`` only excluded
that single word. For "Jan 13 Coffee $4.50", the description
became "13 Coffee" instead of "Coffee". Fixed by returning
``(start, end, text)`` with ``end`` exclusive (computed from
``len(m.group(1).split())`` so window-overrun doesn't
over-consume), and the description builder now skips the full
range.
**3. New diagnostic: ``diagnose_pdf_lines(pdf_bytes)``.** Returns
every clustered text line the scanner saw with ``has_date`` /
``has_amount`` flags. When the page's scan returns 0 rows, an
auto-expanded "what the scanner saw" expander now renders a
table of all extracted lines so the user can:
- Spot scanned-PDF cases (empty result → enable OCR)
- See which lines have a date but no amount (or vice versa)
- Eyeball the date / amount format the scanner missed
Without leaving the app or asking the developer for help.
Eight new tests cover: short US date (``01/13``), short month-
name date with two-word consumption (``Jan 13``), the
``Page 1/2 ... 01/13/2026`` shadowing case, and the multi-word-
date description fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| bece2b4030 |
refactor(pdf): rip out templates; heuristic scan + selectable table
User feedback: the template / visual-picker / mode-dispatch
implementation was too complex for the actual workflow.
Statements drift between months, the canvas state didn't survive
multi-page navigation, and accountants don't want to maintain
per-bank configuration just to convert PDFs to CSV.
Start-over design — one public function, one page, no
persistence:
``scan_pdf_for_transactions(pdf_bytes) → (rows, warnings)``
A row is "any text line with a date pattern AND at least one
amount pattern." Each detected row is a dict shaped::
{
"date": "2026-01-15",
"description": "Coffee Shop",
"amount_1": -4.50,
"amount_2": 1000.00, # if a second amount was found
"page": 1,
"raw": "01/15/2026 Coffee Shop (4.50) 1,000.00",
"source_file": "chase-jan-2026.pdf",
}
Multi-line descriptions still merge (no-date no-amount lines
attach to the previous transaction). Multi-PDF batches share a
single combined table with a ``source_file`` column.
**Page UX:**
- Upload PDF(s) → optional Options expander (parens-negative,
use-OCR) → click Scan → see all detected rows in an
``st.data_editor``.
- The editor has an ``Include`` checkbox column (default on),
plus user-editable date / description / amount cells and a
read-only ``raw`` column showing the original PDF text for
verification.
- A ``Columns to include in CSV`` multiselect hides
``page`` / ``raw`` from the download by default; user can
re-add either.
- Download CSV gets only the checked rows.
No template save/load. No visual picker. No mode dispatch. No
column boundaries. No schema migration. No per-bank
configuration files.
**Deletions:**
- ``src/pdf_templates.py`` — template storage layer
- ``src/gui/_drawable_canvas_compat.py`` — Streamlit compat shim
for the canvas (no canvas now)
- ``tests/test_pdf_templates.py``, ``test_pdf_row_heuristic.py``,
``test_drawable_canvas_compat.py`` — covered the removed APIs
- ``build/hooks/hook-streamlit_drawable_canvas.py`` — hook for
the removed dep
- ``streamlit-drawable-canvas==0.9.3`` from ``requirements.txt``
- The drawable-canvas references in ``build/datatools.spec``
**``src/pdf_extract.py``** shrinks from ~30 helper functions to
~10. Keeps: value parsers, row clusterer, date/amount token
finders, OCR pipeline, dependency guards. The one new public
function ``scan_pdf_for_transactions`` glues them together.
**Tests** (59 passing): the unit layer keeps full coverage of
the building blocks; the smoke layer pins the end-to-end PDF
roundtrip, OCR discovery, dependency-import behavior, and the
multi-line-description merge. The fpdf2-generated fixture PDF
still drives the real-PDF test.
Rollback: ``git revert HEAD`` brings back the template system if
needed — but the simpler model should make that unlikely.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| 48cd9e8249 |
feat(pdf): schema v2 + mode field + v1 in-memory migration
Bumps ``SCHEMA_VERSION`` from 1 to 2 to add a top-level ``mode``
field distinguishing ``row_heuristic`` (new default) from
``column_visual`` (legacy). The schema bump is real — old code
that defaults missing keys would silently mis-extract — so we
do it the careful way:
- ``new_template`` now returns mode=``row_heuristic`` with the
full row-heuristic config tree pre-populated. The legacy
column-visual fields are still seeded with empty defaults so
switching modes in the GUI doesn't require runtime key
insertion.
- ``validate_template`` is mode-aware: row_heuristic templates
must have a valid ``amounts.shape`` + sane
``row_detection.min/max_amounts_per_row``; column_visual
templates keep the existing column/target requirements.
- ``load_template`` accepts both v1 and v2 files
(``_LOAD_SUPPORTED_VERSIONS = {1, 2}``). v1 files get
``mode="column_visual"`` injected and ``schema_version`` bumped
IN MEMORY ONLY — disk file stays v1 until the user explicitly
re-saves. A buggy migration can't silently corrupt their
template library.
- ``save_template`` continues to write the current schema; saving
a v1 template through the GUI naturally upgrades it.
Mode + shape constants exported (``VALID_MODES``,
``VALID_AMOUNT_SHAPES``) so the GUI dropdowns can derive their
options from the source of truth.
Tests split into ``TestValidateTemplateRowHeuristic`` (6) +
``TestValidateTemplateColumnVisual`` (4) + ``TestV1Migration``
(1). All 29 template tests pass; the original column-mode tests
that previously implicitly relied on schema_version=1 keep
working because new_template's seeded column fields are still
present in row_heuristic templates (just not validated as
required).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| d80befd05a |
feat(pdf): row-heuristic extraction (mode dispatch, no coordinates)
User reported the column-visual approach is too brittle for real bank statements: column-x-positions saved against a sample page don't survive layout drift between months (statement A has columns at x=300, statement B drifted to x=320), and a saved template can only realistically work for one statement's specific render. The fundamental fix is to stop depending on coordinates at all. **Row-heuristic mode** finds transaction rows by pattern: any line with a date token + N amount tokens IS a transaction. Date patterns (US slash / EU slash / ISO / "Jan 15, 2026" / etc.) and amount patterns (currency, parens-negative, thousands grouping) are matched against word text — no x-positions involved. The full pipeline: 1. ``find_transaction_rows`` clusters words into rows and scans each line for date + amount tokens. 2. Multi-line descriptions still attach to the previous row via the no-date-no-amount continuation rule. 3. Amount shapes drive interpretation: ``single`` / ``txn_balance`` / ``debit_credit`` / ``debit_credit_balance``. 4. ``_infer_amount_column_centers`` clusters amount x-midpoints ACROSS ALL detected rows to find natural column groupings — so debit-vs-credit assignment for single-amount lines works without the user marking anything on screen. ``apply_template`` is now a dispatch over ``template["mode"]``: - ``mode="row_heuristic"`` (default for new templates) — the new pipeline. - ``mode="column_visual"`` — the existing pipeline, kept under ``_apply_template_column_visual`` for v1 templates and the Advanced fallback. 18 new tests cover: date detection (US slash, two-digit year, ISO, month-name, missing); amount-token finding (currency, parens, pure text, bare-year rejection); column-center inference (clear two-column case, empty input); end-to-end on synthetic Page objects with all four amount shapes; the critical layout-drift test that proves the same template works on pages of different sizes / different absolute x-positions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| 10015c40e1 |
fix(pdf): shim image_to_url for drawable-canvas on modern Streamlit
User hit ``AttributeError: module 'streamlit.elements.image' has no attribute 'image_to_url'`` on first PDF import. Root cause: ``streamlit-drawable-canvas`` 0.9.3 (last upstream release 2023) calls a Streamlit internal that was relocated in Streamlit ~1.30+. The function moved from ``streamlit.elements.image`` to ``streamlit.elements.lib.image_utils`` AND its signature changed — the second positional argument is now a ``LayoutConfig`` dataclass instead of a plain ``int`` width. Three remedies considered: 1. Downgrade Streamlit. Reverses unrelated improvements + security fixes; not on the table. 2. Fork drawable-canvas. The maintenance hit isn't worth it for a one-line internal API change. 3. **Ship a compatibility shim.** Re-attach a wrapper at the old import path that adapts the old call shape to the new function. This is the standard workaround the wider Streamlit community has converged on for this exact regression. ``src/gui/_drawable_canvas_compat.py`` does (3). The ``install()`` helper is idempotent, opt-in (not auto-run at module import — a grep for ``_install_canvas_compat`` shows every call site), and no-ops if Streamlit hasn't moved the function OR if the new function isn't where we expect (lets the canvas surface a real error rather than papering over a different bug). The page calls ``_install_canvas_compat()`` once at module top before any ``st_canvas`` invocation; Streamlit's script-rerun model means this fires every page load but the ``_PATCHED`` guard makes re-runs free. The shim wraps the old ``width=int`` arg into a default-constructed ``LayoutConfig()`` — the old ``width=-1`` sentinel meant "use the image's natural width", which is also what an unconfigured LayoutConfig produces. Confirmed by inspecting Streamlit 1.57.0's ``image_utils.py``. 4 new tests pin the shim contract: - ``install()`` attaches ``image_to_url`` to the old path on modern Streamlit - Idempotent — calling twice doesn't double-wrap - Doesn't clobber a future Streamlit that restores the original at the old path - Translates ``(image, -1, False, "RGB", "PNG", "id")`` into a proper call to the new function with a ``LayoutConfig`` instance If a future Streamlit upgrade moves ``image_to_url`` AGAIN, the shim's silent-no-op fallback means the canvas error surfaces again and points at where to look. The shim doesn't paper over mysteries; it only patches the one specific relocation we know about. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| e6ee2e3481 |
feat(pdf): robust Tesseract discovery + OS-aware install copy
User tried ``brew install tesseract`` in PowerShell after seeing all three OSes listed inline in the OCR banner — easy mistake when the install commands are crammed on one line with ``·`` separators. Two changes pre-empt this: **OS-aware OCR banner.** The expander now detects the user's platform via ``platform.system()`` and shows only the relevant install instructions: - **Windows**: UB-Mannheim installer link, numbered steps, explicit "keep the Add to PATH checkbox on" callout, plus a fallback paragraph telling the user how to set ``DATATOOLS_TESSERACT_PATH`` if they already installed without PATH and don't want to reinstall. - **macOS**: ``brew install tesseract`` with a Homebrew link. - **Linux**: ``apt install tesseract-ocr`` with a "or your distro's equivalent" hedge. **Robust binary discovery in ``ocr_available()``.** Three-stage: 1. Honor ``DATATOOLS_TESSERACT_PATH`` env var if set — explicit override for portable installs or non-default locations. 2. Try ``pytesseract``'s default PATH-based lookup. 3. If PATH lookup fails, probe known Windows install paths (``C:\Program Files\Tesseract-OCR\tesseract.exe``, the x86 variant, and ``%LOCALAPPDATA%\Programs\Tesseract-OCR\``) via the new ``_autodetect_tesseract_path``. On hit, set ``pytesseract.pytesseract.tesseract_cmd`` so all subsequent ``image_to_data`` calls use the same binary without re-discovering. This means a user who runs the UB-Mannheim installer with default options but forgets the PATH checkbox will still get OCR working after a launcher restart, without env-var gymnastics. Tests (4 new, 85 total in the suite): - Auto-detect returns None on non-Windows (no false positives on dev laptops). - Auto-detect finds the binary at a mocked ``C:\Program Files\Tesseract-OCR\tesseract.exe``. - Auto-detect returns None when no candidate exists. - ``DATATOOLS_TESSERACT_PATH`` env var beats both PATH lookup and auto-detect (sets ``tesseract_cmd`` even when the path doesn't resolve, so a real binary at a custom location works). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| 538e23d219 |
build(pdf): bundle PDF deps in installers + pin versions + smoke tests
Three changes prepare the next tagged release so end users get the PDF Extractor without ever touching pip. **Exact-pin the new deps** (``requirements.txt``): pdfplumber==0.11.9 pypdfium2==5.8.0 pytesseract==0.3.13 streamlit-drawable-canvas==0.9.3 Tight pins are the right call for these because the GUI's visual-picker geometry + the parsing-pipeline word positions depend on stable internal behavior — a quiet upstream tweak to ``extract_words`` or ``page.render`` would re-break the tool on the next CI build. Bumping requires a deliberate edit + a CI run, not a transient ``pip install`` resolving to whatever ``setup.py`` pulled. Existing deps stay on their current ``>=X.Y,<X+1`` ranges; the user's "tight pin" concern is specifically about the PDF stack. **Wire the new deps into the PyInstaller bundle** (``build/``): - ``datatools.spec`` — add ``collect_submodules`` for pdfplumber, pdfminer, pypdfium2, streamlit_drawable_canvas, PIL, pytesseract; add ``collect_data_files`` for pypdfium2 (PDFium native ``.dll``/``.so``/``.dylib``), streamlit_drawable_canvas (frontend JS bundle), pdfminer (Adobe CMap tables). - ``hooks/hook-pypdfium2.py`` — belt-and-braces hook that uses ``collect_dynamic_libs`` to force-include the PDFium binary. Without this the visual picker silently fails on installed builds with a ``FileNotFoundError`` for the shared library. - ``hooks/hook-streamlit_drawable_canvas.py`` — collects the built JS frontend so the canvas iframe loads under the bundled Streamlit server instead of rendering blank. **Tesseract is intentionally NOT bundled** (option A from the design discussion). Modern bank statements are text-based; bundling Tesseract would ~triple installer size for a long-tail case. The in-app banner directs users to install it from ``UB-Mannheim/tesseract`` if they need OCR. Decision is captured in the ``project-pdf-installer-pending`` memory note. **Smoke tests** (``tests/test_pdf_extract_smoke.py``, 17 tests) add the layer above the pure unit tests: - ``TestDependencyImports`` — each dep imports cleanly - ``TestRealPdfRoundTrip`` — generates a tiny statement PDF in memory with ``fpdf2`` (test-only dep in ``requirements-dev.txt``), runs ``extract_pages`` + ``apply_template``, asserts 3 rows out with the right signed amounts. Catches "the build succeeded but pdfplumber breaks at runtime." - ``TestRenderPageImage`` — exercises ``pypdfium2.render`` so the hook-bundled native lib gets a real call. This is the most common installer-bug signature (missing .dll) and the test catches it before users do. - ``TestPdfDependencyMissing`` — monkeypatches ``__import__`` to simulate a stripped install; confirms the typed exception + actionable hint round-trip. - ``TestPinnedVersionsMatchInstalled`` — parametrized over all four pinned dists; uses ``importlib.metadata`` rather than ``__version__`` because pypdfium2 doesn't expose it directly. Trips if someone bumps the pin without reinstalling. - ``TestOcrAvailability`` — confirms ``ocr_available()`` returns ``(bool, str)`` and ``extract_pages_auto(allow_ocr=False)`` skips OCR cleanly. All 81 PDF + audit tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| aea520d2f7 |
feat(pdf): template storage layer (load/save/list/import/export)
Phase 2/6. Persists "how to read this bank's statements" as JSON files under ``~/.datatools/pdf_templates/<slug>.json`` so an accountant can build one template per source and reuse it across every statement that follows the same layout. Public API: - ``new_template(name)`` — blank with sensible defaults - ``save_template(t)`` — validate + atomic write (temp + rename) - ``load_template(slug)`` / ``delete_template(slug)`` - ``list_templates()`` — sorted summaries, skips corrupt files - ``template_to_json`` / ``template_from_json`` — portability - ``validate_template(t)`` — returns (ok, errors) list for GUI Schema is documented in the module docstring. Versioned via ``schema_version: 1`` so future fields don't break saved files silently — ``load_template`` refuses unknown versions instead of limping along with missing keys. Validation contract enforces: - non-empty name + slug (lowercase alphanumeric + hyphens) - at least two output columns - at least one column mapped to ``date`` - either one ``amount`` column OR both ``amount_debit`` + ``amount_credit`` - column boundary count consistent with source-column count Storage is atomic: ``_atomic_write`` goes through a temp file + ``os.replace`` so a crashed save can't leave a half-written JSON at the canonical path. The GUI's build flow saves on most visual-picker changes, so this matters more here than for a "save button" workflow. 24 tests cover slugify, defaults, validation branches, round-trip load/save, missing/corrupt file handling, delete, list (incl. skipping corrupt files), atomic-write rollback, and import/export. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| b8aff862ed |
feat(pdf): add pure PDF→DataFrame extraction module
Phase 1/6 of the PDF Extractor tool. Pure module — no Streamlit,
no user-config I/O — that turns a PDF blob plus a template dict
into a ``pandas.DataFrame`` of transaction rows. Primary use case
is accountant-style extraction of bank-statement transactions,
where each bank's format is encoded as a reusable template.
Pipeline:
1. ``extract_pages(pdf_bytes)`` reads with pdfplumber and surfaces
words with bounding boxes.
2. ``cluster_rows(words)`` groups words into rows by ``top``
tolerance — no reliance on PDF table-line detection (most bank
statements have no visible cell borders).
3. ``assign_columns(row_words, boundaries)`` buckets each word by
its horizontal midpoint into N+1 columns defined by N interior
x-boundaries.
4. ``_within_table_window`` slices to the band between the header
line and the end-marker (e.g. "Closing balance").
5. ``apply_template`` orchestrates the above, handling:
- parens-style negative amounts, currency stripping, custom
decimal/thousands separators
- separate debit + credit columns combined into a single signed
``amount`` (credit positive, debit negative — accounting
register convention; matches QuickBooks/Xero imports)
- multi-line description wrapping (rows with empty date column
attach to the previous row's description)
- row-level regex skip filters (e.g., "Total", "Subtotal")
- page-range filters ("all", "2-", "1,3-5")
Optional OCR fallback for scanned statements:
- ``page_has_extractable_text`` heuristic flags pages with <5
words as likely-scanned.
- ``ocr_available()`` checks both the ``pytesseract`` Python
binding and the Tesseract binary; surfaces a clear reason
string when either is missing.
- ``extract_pages_auto`` does text-first, OCR-the-blanks, and
returns warnings the UI can surface.
29 unit tests cover the parsing pipeline against synthetic
WordBox/Page data — no fixture PDFs required, runs in 0.1s. Real
PDF extraction is exercised by hand on the user's statements.
Dependencies added:
- ``pdfplumber>=0.10,<1`` — text + position extraction
- ``pypdfium2>=4,<6`` — page rasterization for OCR + visual picker
- ``streamlit-drawable-canvas>=0.9,<1`` — visual region picker
(used in commit 5)
- ``pytesseract>=0.3,<1`` — OCR (used in commit 6; system
Tesseract binary required separately)
- ``cryptography>=41,<49`` — bumped upper bound; pdfminer.six
transitively requires a recent release. Internal ed25519
license-signing usage is API-stable across the bump.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| b3ae913bb9 |
feat(audit): daily filename + 7-day retention sweep
Replaces the per-session ``datatools-<ts>-<sid>.jsonl`` filename with a single daily file ``datatools-YYYY-MM-DD.jsonl`` (local date). Sessions on the same calendar day share a file via the writer thread's per-batch open+append; multiple DataTools instances running concurrently on the same day fan into the same file (append-mode small writes are atomic on POSIX, safe-enough on Windows under realistic load). Drops the ``_LOG_PATH`` module global and the lock around it — ``audit_log_path()`` is now pure date math, recomputed on every call so a session that crosses midnight follows the rollover into the next day's file. Adds ``_sweep_old_logs()`` invoked once per process at writer- thread start. Deletes any ``datatools-*.jsonl`` whose mtime is older than 7 days. The glob deliberately matches the legacy per-session filename too, so users upgrading from the previous build don't keep a permanent backlog of pre-retention files. Event ``ts`` fields stay UTC; only the filename uses local date, because users go looking for "today's log" on their wall clock. Tests cover: daily filename shape, sweep removes stale files, sweep keeps fresh files, sweep also clears legacy filenames. Rollback: ``git revert HEAD`` restores the per-session filename and removes the sweep. No data migration needed either way — existing files keep working as JSONL. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| 76c9f5a679 |
feat(audit): diagnostic instrumentation env vars + writer-thread guard
Phase 1 of the audit-log re-enablement plan. Adds three opt-in env vars that let us ship one instrumented build for the user to run, without flipping the kill switch on for everybody. **Default behaviour is byte-identical to today**: with no env vars set the kill switch wins, no writer thread starts, no file is written, no stderr line is printed. Env vars (do NOT set in prod): - ``DATATOOLS_AUDIT_ENABLED=1`` — bypass ``_DISABLED`` for one session. ``_DISABLED = True`` stays in the source so an upgrade with no env var is still safe. - ``DATATOOLS_AUDIT_TRACE=1`` — print ``[audit] ...`` lines to stderr at module import, every writer-thread state change, and every producer entry point. Lets the user share a small log instead of attaching a debugger. - ``DATATOOLS_AUDIT_PROBE=<value>`` — bisect the producer path for Phase 2. Values: ``full`` (default), ``noop``, ``no-events``, ``no-page-open``, ``no-session-start``. The named variants return early from the corresponding ``log_*`` function so we can isolate which call is implicated in the blank-pages symptom. Also: - ``_writer_loop`` gets an outer ``try/except BaseException`` so silent thread death now surfaces a ``"writer thread died: ..."`` line in the launcher terminal instead of looking like a hang. - Existing first-write-failure stderr print gets ``flush=True`` so the user actually sees it before the process is killed. - Test fixture switches from the previous-commit ``_DISABLED = False`` override to ``_ENABLE_OVERRIDE = True`` so tests exercise the same bypass path the diagnostic build uses. - Two new tests pin the safety contract: with the kill switch on and no override, every producer is a true no-op (no writer thread, no file). And ``DATATOOLS_AUDIT_PROBE=no-events`` bypasses ``log_event`` even when the override is on — guards the bisect. Rollback: ``git revert HEAD`` removes Phase 1 cleanly. The deadlock fix from the previous commit stays in place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| a8ff8f4bd0 |
fix(audit): break audit_log_path/_session_id deadlock
Pre-existing latent bug since |
|||
| d9e32e578b |
feat(audit): async writer thread — safe to re-enable
Reported earlier: synchronous file writes in ``log_event`` blocked the GUI render thread on hostile filesystems (Windows antivirus on ``~/.datatools/logs/`` is the prime suspect). A blocking ``open`` call doesn't raise — try/except can't recover from it — so the only safe re-enable is to take file I/O off the render path. Refactor: - ``log_event`` and friends push events onto a ``deque(maxlen=5000)`` via ``put_nowait`` and return in microseconds. - A single daemon thread (``datatools-audit-writer``) drains the queue and writes batches. Holds the queue lock only long enough to snapshot + clear, then does I/O outside the lock so producers can keep enqueueing. - ``audit_log_path()`` is now pure path arithmetic — no ``mkdir`` no ``open``. The writer thread does the directory creation off the request path, so any hang there only affects the writer. - Bounded queue means an unwritable disk doesn't unbounded-grow memory; the queue caps at 5000 and overflow drops OLDEST events so the most-recent (most-diagnostic) ones survive. - First write failure prints once to stderr; subsequent failures are silent so logs don't drown the launcher terminal. - ``flush_audit_log(timeout_s=0.5)`` drains the queue and signals the writer to exit; bounded so a stuck disk can't delay shutdown. Other changes in this commit: - ``shutdown_app`` now emits a "Session ending" event and calls ``flush_audit_log`` before kicking the os._exit timer, so the closing session's events make it to disk. - The Diagnostics sidebar in ``hide_streamlit_chrome`` is re-enabled (the ``if False:`` gate is removed). Wrapped in try/except defensively — render errors print to stderr, never blank the page. - ``_DISABLED`` kill-switch is gone. The async design IS the safety mechanism now. Tests in ``tests/test_audit.py``: - log_event burst of 1000 events completes in well under 1s (proves non-blocking). - Events queued before flush land on disk with the expected JSON shape; session_start renders; idempotent. - Pointing the audit dir at a file (so mkdir fails) doesn't hang or crash the producer. - Non-JSON extras are str()-coerced rather than dropped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| 7ad19ac7f4 |
feat(nav,i18n): sticky footer with Back-to-Home + localized tool headers
Two unrelated UX issues addressed in one sweep across all nine tool
pages because they share the same edit surface.
(1) Sticky footer replaces the top + bottom back-link buttons.
Reported: a big white empty footer space at the bottom of every page;
the Back to Home button at the top scrolled out of view on long pages.
New ``render_sticky_footer()`` helper in ``components/_legacy.py``
injects a fixed-position bar at ``bottom: 0`` of the viewport with:
- A border-top so it visually reads as a non-movable bar.
- A semi-transparent background (rgba 0.96 + ``backdrop-filter: blur``)
so content underneath shows through faintly when the user scrolls.
- A styled ``<a href="home">`` anchor (not an ``st.button``) because
Streamlit widgets can't be CSS-positioned reliably — Streamlit owns
the widget's DOM container and re-mounts it on every rerun. A real
anchor sits exactly where the CSS puts it and triggers Streamlit's
URL routing to the home page.
- ``padding-bottom: 3.5rem`` on the main container so the last widget
isn't hidden behind the bar.
Called once per tool page, immediately after ``hide_streamlit_chrome()``
so it renders even on pages that ``st.stop()`` early before any other
content runs. The old top-and-bottom ``back_to_home_link()`` calls are
removed from every tool page; their entry/exit points were dropping
the button when the script short-circuited.
(2) Tool-page headers now localize.
Reported: switching the sidebar language picker to Spanish left the
tool page's title + caption in English. Root cause: every page had
hard-coded ``st.title("✂️ Clean Text")`` / ``st.caption("Trim
whitespace...")`` strings.
Added per-tool ``tools.<id>.page_title`` and
``tools.<id>.page_caption`` keys to ``en.json`` and ``es.json`` for
all nine tools. Routed each page's title/caption call through ``t()``.
Verified: with ``ui_lang=es`` set, the Clean Text page now renders
"✂️ Limpiar texto" + the Spanish caption.
Updated ``tests/gui/test_smoke.py::EXPECTED_SUBSTRINGS`` so the
``es`` column for each tool page asserts the actual Spanish string
(was a duplicate of the English string back when the page bodies
were English-only).
2220 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| 5128d35961 |
fix(text-cleaner): hoist show_hidden + stress-test all tool pages
Reported crash: clicking "Clean Text" with mojibake.csv (a junk corpus
file that the cleaner ran on but produced zero changes) blew up the
results render with
NameError: name 'show_hidden' is not defined
at the cleaned-preview block. ``show_hidden`` was defined inside
``if result.cells_changed:`` and referenced unconditionally below.
Fix on the page itself: hoist the ``show_hidden = st.toggle(...)``
declaration out of the conditional so it's always in scope for the
downstream cleaned-preview render. One toggle now drives both the
Examples table (which only renders when there are changes) AND the
cleaned preview (which always renders).
Generalized regression net: ``tests/test_junk_corpus_tool_pages.py``.
For nine representative junk files (empty, only_nul, mojibake,
invalid_utf8, utf16_le_no_bom, mismatched_columns, all_nulls,
corrupt_xlsx, single_column) and every Ready/Coming-Soon tool page,
the test:
1. Stashes the junk bytes as the home upload via session_state.
2. Runs the page through AppTest, asserts ``app.exception`` is empty.
3. If the page exposes a deterministic primary-action button label,
clicks it and asserts no exception on the post-click render.
Pages that catch a bad file at read time and short-circuit via
``st.error`` + ``st.stop`` are correctly skipped from the
primary-action half (the button isn't rendered). A genuine crash
shows up as ``app.exception`` carrying a Python traceback — exactly
what the user reported, exactly what we now catch.
162 tests collected, 102 passed, 60 skipped. 4 seconds.
Full suite: 2220 passed, 91 skipped, 35 s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| 696996c119 |
test(junk-corpus): pathological-input stress suite for the analyzer
Build a corpus of 35 deliberately-broken files (empty bytes, NUL
bytes, mojibake, UTF-16 without BOM, mismatched columns, unescaped
quotes, corrupt zip, etc.) and pin the analyzer's stability contract
against them.
Files land in ``test-cases/junk-corpus/test_data/``. The generator
``make_junk_corpus.py`` produces them deterministically (one random
sample uses ``secrets.token_bytes`` — committed bytes are stable
across regenerations because the byte stream is captured at commit
time). README documents the categories and how to add new shapes.
``tests/test_junk_corpus.py`` parametrizes over every file in the
corpus and asserts:
1. ``_run_analysis_on_upload`` never raises — exceptions must be
caught and surfaced as a synthetic ``Finding`` with
severity="error". This was the user-reported crash for
13_non_latin_scripts.csv that the previous fix in
|
|||
| c568aec8a7 |
feat(gui): one-click Close in its own bottom sidebar section
Close is now a direct shutdown trigger: visiting the Close page (the sidebar entry) fires shutdown_app() immediately — no confirm step, no intermediate body. The farewell overlay paints and os._exit(0) lands ~1s later from a daemon thread. Layout: Close moved into its own bottom-of-sidebar section so the destructive action is visually separated from Account/Activate. - New shutdown_app() in components/_legacy.py replaces quit_button. os._exit thread is skipped when "pytest" is in sys.modules so the test suite doesn't suicide on rendering 99_Close. - pages/99_Close.py shrinks to set_page_config + chrome + shutdown_app. - app.py nav grows a new "Close" section header (new nav.section_close key in en/es packs) pinned at the bottom of the navigation dict. Tests updated: - TestQuitButtonRenders → TestClosePageShutsDownImmediately. Assert the shutdown caption renders + no confirm button exists. - test_smoke EXPECTED_SUBSTRINGS["99_Close"] now pins "Shutting down" / "Cerrando" (the visible page body) instead of the removed page title. 2008 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| ff2eaeb6c4 |
feat(home): multi-file upload + per-file analysis, drop tool grid
Home is now upload + analysis only. The page accepts multiple files in
one go, analyzes each independently, and renders findings grouped by
filename in bordered containers. The 3-section tool-card grid is gone —
discovery happens via the sidebar now.
Mechanics:
- file_uploader uses accept_multiple_files=True. Each file's findings
cache in session_state["home_findings_by_file"] keyed by filename so
removing a file via Streamlit's "x" button drops its findings too,
and re-clicking Run only re-analyzes pending files.
- The first uploaded file is mirrored into the singular
home_uploaded_{name,bytes,size} keys so tool pages continue to pick
up an "active" upload through pickup_or_upload — no tool-page changes.
- New i18n keys: upload.intro_multi, upload.uploader_label_multi,
upload.clear_results, upload.empty_state. upload.heading text is
updated to "Upload one or more files to start" (EN + ES).
Dropped tests pinning the tool grid:
- TestHomeToolGridLocalization (test_chrome.py)
- test_home_tool_card_uses_es_name (test_smoke.py)
- TestLiteHomeGridBadges (test_lite_tier.py — locked-card lock-badge
assertions; locking is still enforced per-tool-page via
require_feature_or_render_upgrade)
2009 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| dad744f17f |
refactor(gui): drop Review page + normalization gate
Home is now the only entry point: the "Run analysis" button on the upload section IS the review step (findings render inline via render_findings_panel). Tool pages no longer gate on a passed normalization — running the analyzer is sufficient context. Removed: - src/gui/pages/0_Review.py - src/gui/components/gate.py (re-export seam) - require_normalization_gate() in src/gui/components/_legacy.py - "review" section enum in tools_registry.py - Data Review entry in app.py navigation - require_normalization_gate() calls + imports in all nine tool pages - tests/gui/test_gate.py (whole file) - TestReviewWorkflow in tests/gui/test_workflows.py - 0_Review entry in tests/gui/test_smoke.py PAGE_SLUGS - stash_upload's normalization_result+normalization_for stashing - stash_upload_without_gate (was the gate's negative-path helper) 2017 tests pass (16 retired with the gate flow). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| fc6c22c6a7 |
feat(review): inline file uploader instead of redirect home
When a user lands on Review without an upload, show a file uploader on the page itself and auto-run the analyzer once a file is picked, rather than bouncing them to the home page with a "Back to home" button. Auto-analyze is the right default here: the user is already on the Review page, so they've implicitly committed to a scan. Stashing the bytes in the same session-state keys the home page uses keeps the rest of the flow (encoding picker, gate, tool pages) unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| db5ec084da |
docs+code: rename tool labels everywhere
Sweep follow-up to
|
|||
| 93e43fc0d9 |
feat(gui): sidebar sections + non-technical tool labels
Sidebar nav now groups tools under Data Review / Data Cleaners / Transformations / Automations via st.navigation, replacing the flat auto-discovered list. Tool display names switch to action-first phrasing (Find Duplicates, Fix Missing Values, Find Unusual Values, Standardize Formats, Clean Text, Quality Check, Map Columns, Combine Files, Automated Workflows) in EN + ES packs and on each page's H1. The Data Cleaners section follows the requested order: Missing Values → Outliers → Text Cleaner → Format Standardizer → Deduplicator → Quality Check. (Text Cleaner kept inside cleaners since the request didn't list it but the tool still ships.) Registry now carries a section field; helpers added: tools_in_section(), section_label(). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| e534fb4989 |
sec(license): Ed25519 sigs + production-safe tripwire
Two coupled hardening upgrades. 1. Asymmetric signatures (HMAC → Ed25519) The previous HMAC scheme used a symmetric secret that any motivated reverse engineer could pull out of the shipped binary and use to mint blobs for any tier / name / email. With Ed25519, the binary ships only the public verification key; the signing key never leaves the seller's environment, so binary compromise no longer yields forgery. - src/license/crypto.py rewritten around cryptography.hazmat.primitives.asymmetric.ed25519. Same public API surface (sign/verify/encode_blob/decode_blob), same canonical JSON encoding — drop-in for the manager / cli / GUI layers. - DATATOOLS_LICENSE_PRIVKEY (seller-side) and DATATOOLS_LICENSE_PUBKEY (build-time) env vars supply the keys; the in-source dev keypair (src/license/_dev_keypair.py) deterministically derives from a seed phrase for repro builds and tests. - Blob prefix bumped DTLIC1: → DTLIC2:. Decoding a DTLIC1 blob surfaces a clear "old format" error rather than a confusing signature mismatch. - scripts/generate_keypair.py mints fresh production keypairs for the seller (run once, stash the private key offline). Adds cryptography>=41,<46 to requirements.txt (was an undeclared transitive dep). 2. Production-safe tripwire assert_production_safe() refuses to boot a frozen / shipped build when either: - DATATOOLS_DEV_MODE=1 is set (would unconditionally bypass every license check — fine in source/test but catastrophic in a buyer install). - The active verification key is still the embedded dev key (the build pipeline forgot to set DATATOOLS_LICENSE_PUBKEY). No-op in source / pytest runs (sys.frozen is unset) so test fixtures and dev workflows keep working without ceremony. Called from src/cli_license_guard.guard() and from hide_streamlit_chrome — so it fires on every CLI invocation and every GUI page load. Tests: 49 license-layer unit tests (was 40); added Ed25519 wrong-key rejection, dev-keypair seed pin, blob v2 prefix, v1 rejection with clear message, and four production-safe scenarios (no-op in source, fires on DEV_MODE in frozen, fires on dev key in frozen, passes in frozen with prod pubkey). Total: 2024 → 2033. Docs (REQUIREMENTS §17a, DEVELOPER licensing recipe, DECISIONS §9b + decision log) updated with the new threat-model write-up, key-storage workflow, and tripwire behaviour. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| d32b58e61a |
feat(license): add Lite SKU; remove user-facing free trial
Two coupled changes:
1. Lite tier
- New Tier.LITE in src/license/schema.py.
- FEATURES_BY_TIER[Tier.LITE] = {Deduplicator, Text Cleaner,
Format Standardizer}. The three universally-useful tools that
cover the most common bookkeeping / RevOps / Klaviyo prep
workflows. Other six tools require Core.
- i18n: license.tier_lite, license.feature_locked_title,
license.feature_locked_body, license.upgrade_link,
license.status_locked (en + es).
- Per-tool feature gate at every GUI tool page
(require_feature_or_render_upgrade) and every tool CLI
(guard(feature=...)). A locked tool renders an upgrade
prompt + Manage-license button (GUI) or exits with code 2
(CLI).
- Home grid: tool cards the user's tier doesn't unlock get a
red 🔒 Locked badge in place of green Ready.
2. Trial removed
- Activation form's "Start 1-year trial" button removed.
- license_cli's `trial` subcommand removed.
- activation.trial_button / activation.trial_help i18n keys
dropped (pack parity test stays green).
- Tier.TRIAL stays in the enum (back-compat with any field-
tested trial licenses); LicenseManager._mint stays internal
for tests and the seller's key generator.
- Decision logged in DECISIONS §9b: a 1-year all-features
trial undercuts paid Lite; paid-only keeps tier economics
clean.
Tests (+29 net): +17 Lite-tier unit/guard tests + 13 Lite-tier
GUI tests + 1 trial-absent assertion - 2 trial CLI tests - 1
trial GUI button test. Total: 1995 → 2024.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| e435103113 |
feat(license): registration + 1-year licenses + tier scaffolding
A complete offline licensing layer (no internet at any step): Core - src/license/ — schema (License, Tier, FeatureFlag), HMAC crypto, JSON storage, LicenseManager singleton with activate/renew/ deactivate/issue_trial. Tier-scaffolded so future SKUs can carve per-tool feature sets without consumer-code edits. - scripts/generate_license.py — creator-only key generator. Mints a DTLIC1: blob the buyer pastes into the activation page. GUI - New activation form component (src/gui/components/activation.py). - hide_streamlit_chrome() now inline-renders the activation form when no valid license is present (every page short-circuits to the form until activated). - Sidebar shows tier + days remaining; renewal warning under 30 days. - New pages/_Activate.py for revisiting the form after activation. CLI - src/license_cli.py — activate / renew / status / trial / deactivate commands. Exempt from the guard. - src/cli_license_guard.py — drop-in guard call added to every tool CLI's main(). Lets --help through; respects DATATOOLS_DEV_MODE. i18n - New activation.* and license.* keys in en.json + es.json (page title, form labels, status badges, renewal warnings, error messages). Pack parity test stays green. Test infrastructure - tests/conftest.py autouse fixture sets DATATOOLS_DEV_MODE=1 so the existing 1916 tests continue to pass. - isolated_license_path / activated_license_manager / unactivated_license_manager fixtures for tests that want to drive the real check. Tests (+79) - tests/test_license.py (40): schema, crypto roundtrip, blob encode/decode, tier→feature mapping, activation flow, name/email mismatch rejection, tamper detection, expiration, renewal, dev-mode bypass. - tests/test_license_cli.py (26): every license_cli command + subprocess tests confirming every tool CLI refuses to run without a license, --help always works, DEV_MODE bypasses. - tests/gui/test_activation.py (13): gate blocks without license, passes with trial, activation form submission unlocks the gate, sidebar status, renewal warning, i18n. Total: 1916 → 1995 tests. All pass under the strict warning filter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| b2c7b94fe9 |
fix: clear all latent deprecation + resource warnings
Three real issues surfaced when running the suite with strict warnings: 1. src/core/format_standardize.py: ``datetime.utcfromtimestamp`` is deprecated in CPython 3.12 and slated for removal. Replace with ``datetime.fromtimestamp(ts, tz=timezone.utc)``. Output for the date-only format codes we use is byte-identical. 2. src/core/io.py: ``list_sheets`` leaked the openpyxl file handle by returning ``xl.sheet_names`` from an unclosed ``pd.ExcelFile``. Wrap in a ``with`` block so the FD closes deterministically — also prevents the Windows-only "file is locked" repro path. 3. tests/test_corpus.py: ``TestXlsxPollution.workbook`` fixture returned the bare ``pd.ExcelFile`` instead of yielding + closing. Convert to a yield-and-finally pattern so the class-scoped handle isn't leaked across the whole test file. Also harden pytest.ini's warning policy: escalate ``ResourceWarning`` from ``src`` to an error, alongside the existing ``DeprecationWarning`` rule. Third-party warnings stay filtered — we can't fix pandas/openpyxl/streamlit churn from here. All 1916 tests pass under the strict filter; full and split runs (``pytest``, ``pytest -m 'not gui'``, ``pytest -m gui``) all clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| 35d46a0c1a |
test(gui): add Streamlit AppTest layer (139 tests)
Until now every test ran against core or the CLI; the Streamlit GUI was verified by hand. This commit adds tests/gui/ — 139 AppTest- driven tests behind a 'gui' marker so the quick loop (``pytest -m 'not gui'``) stays at 1777 tests / ~10s while ``pytest`` runs everything (1916 / ~14s). Coverage: - test_smoke.py (59): every page renders in EN and ES, expected substring present, sidebar selector mounted. - test_chrome.py (18): language selector flips session state and re-renders; quit button + farewell strings localize; tool-card names use the active language. - test_gate.py (9): require_normalization_gate no-op / warning / short-circuit / hash-mismatch invariants; warning + button localized. - test_workflows.py (14): happy path per Ready tool — stash upload, render, find primary action, verify result lands in session state. - test_dedup_review.py (8): Accept All / Reject All / Clear Decisions wire through to review_decisions; apply_review_decisions semantics (keep-all, merge, column override). - test_advanced_panels.py (15): config_panel widget defaults and options (algorithm, threshold, survivor rule, merge, multiselects, config save/load). - test_errors.py (4): garbage / empty / single-column uploads don't crash; duplicate-target mapping raises InputValidationError. - test_findings_panel.py (12): driven via a small standalone harness page so we test the component without faking a file_uploader. EN + ES strings, per-tool grouping, open-tool button label, untargeted expander, severity summary. Shared infrastructure in tests/gui/conftest.py: - ``stash_upload`` / ``stash_upload_without_gate`` — populate session_state to pre-pass or block the gate. - ``with_language`` — set ``ui_lang`` before run(). - ``collected_text`` — flatten title/caption/markdown/etc. into one string for substring assertions. - Auto-marking: every test in tests/gui/ gets ``@pytest.mark.gui`` via ``pytest_collection_modifyitems``, so the marker isn't per-test boilerplate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| 64452dd783 |
perf: dedup blocking, column-parallel scaffolding, lazy-copy pipelines
Three follow-on wins from the audit, each with shape-pinning tests.
1. Dedup blocking
- Exact-only strategies (every column EXACT @ 100 — covers strong-
key dedup like email/phone, the drop-duplicates fallback, and
explicit "match on this exact column" calls) now route through
an O(n) groupby fast path. Lossless; no API change required.
Measured: 10k-row email-exact dedup → 73 ms (was ~30 minutes
via the O(n²) pair compare).
- Fuzzy strategies still pair-compare, with opt-in prefix blocking
via deduplicate(..., blocking_columns=[...], blocking_prefix_len=1).
Measured: 5k-row fuzzy-name → 25.6s with blocking vs 179s
without (7x). Trade-off: cross-block matches missed.
2. Column-parallel standardize
- StandardizeOptions.parallel_columns (default 1) lands a
ThreadPoolExecutor over the column loop. Output order and
audit-record order are preserved deterministically via a merge
step keyed off column_types order. Honest doc: under CPython
3.12's GIL the win is roughly neutral (phonenumbers/dateutil
hold the GIL); the API is ready for free-threaded Python 3.13+.
3. Lazy-copy in missing / column_mapper
- _standardize_sentinels now builds per-column changes in a dict
and only materialises the output frame when at least one column
actually changed. On a clean 1 GB file this skips a 1 GB
allocation.
- handle_missing carries an out_is_owned flag, copying on demand
before any mutating step. No-op runs return the input frame.
- map_columns drops the unconditional upfront df.copy(); rename
and drop both return fresh frames already, and schema-add /
coerce trigger _ensure_owned() lazily.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| 5b672370a6 |
perf: cache hot paths, drop wasted allocations, lift 1 GB → 1.5 GB
Five targeted wins driven by an end-to-end audit, with shape-pinning regression tests so reverts are loud: - format_standardize: fuse the dispatcher loop into one pass — was calling Series.tolist() three times per typed column and materialising an intermediate triples list; now one tolist, one walk. On a synthetic 1M-row phone+email frame this measures ~2.7M rows/sec (vs. the previous 150k/sec doc target). - dedup: wrap normalizers in a per-call lru_cache so repeat phones / emails / addresses skip re-parsing. phonenumbers.parse is the expensive call; ~2–5x faster on the normalisation step for realistic workloads. - analyze: _detect_near_duplicates no longer copies the full input frame; builds only the normalised string columns via a dict and references non-string columns by view. Skips the redundant astype(str) when a column is already pandas string dtype. - text_clean: hoist _build_pipeline out of the per-cell loop and add a per-call string cache so 100k repeats of "Active" only run the pipeline once. ~1M rows/sec on repetition-heavy columns. - io.repair_bytes: the non-UTF-8 smart-quote fold path used a Python-level zip walk over the entire decoded string to count replacements — replaced with sum(text.count(c) ...) which runs in C at ~GB/s. Was a latent ~100s on a 1 GB cp1252 file; now <1s. Updates REQUIREMENTS §10 with measured numbers and bumps the buyer- facing upload limit from 1 GB to 1.5 GB across the i18n packs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| c4ce86bd64 |
feat(i18n): add language-pack scaffold with English and Spanish
Introduces ``src/i18n`` with a tiny JSON-backed t() lookup, an in-session language preference, and a sidebar selector wired through ``hide_streamlit_chrome`` so every page picks up the same picker. Covers home, tool cards, findings panel, gate, shutdown, and pickup banner strings. Tests pin pack parity and the farewell-overlay JS escape so future packs can't silently regress. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| 966af8ef94 |
feat: 3 new tools, format streaming, distribution-ready demo + landing pages
Tools shipped this batch (4 → 6 of 9 Ready):
04 Missing Value Handler src/core/missing.py + cli_missing.py + GUI
05 Column Mapper src/core/column_mapper.py + cli_column_map.py + GUI
09 Pipeline Runner src/core/pipeline.py + cli_pipeline.py + GUI
with soft tool-dependency graph (recommended,
not enforced) and JSON save/load for repeatable
weekly cleanups.
Format Standardizer reworked for 1 GB international files:
• Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
• Per-row country / address columns drive parsing
• Audit cap (default 10 k rows, ~50 MB RAM)
• standardize_file(): chunked streaming entry point (~165 k rows/sec)
• currency_decimal="auto" for EU comma-decimal locales
• R$ / kr / zł multi-char currency prefixes
• cli_format.py with auto-stream above 100 MB inputs
Encoding detection arbiter + language-aware probe:
Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.
Distribution-readiness assets:
• streamlit_app.py — Streamlit Community Cloud entry shim
• src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
100-row cap + watermark, free-vs-paid boundary enforced at surface
• samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
• landing/ — 4 static HTML pages (apex chooser + 3 niche),
shared CSS, deploy.py URL-substitution script,
auto-generated robots.txt + sitemap.xml + 404.html + favicon
• docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
— full strategy + measurement + deployment + master checklist
Test counts:
before: 1,520 passed · 4 skipped · 17 xfailed
after: 1,729 passed · 0 skipped · 0 xfailed
Tier-1 corpora added:
• missing-corpus 3 use cases + 16 edge cases
• column-mapper-corpus 3 use cases + 5 edge cases
• format-cleaner intl 20-row 13-country stress fixture
Engine hardening flushed out by the corpora:
• interpolate guards against object-dtype columns
• mean/median skip all-NaN columns (silences numpy warning)
• fillna runs under future.no_silent_downcasting (silences pandas warning)
• mojibake test no longer skips when ftfy installed (monkeypatch path)
• drop-row threshold semantics: strict-greater (consistent across rows / cols)
• currency_decimal validator allow-set updated for "auto"
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|||
| d18b95880d |
feat(format-i18n): broaden international coverage across all domains
Closes ~17 high-value international gaps surfaced by parallel review. Adds 93 regression tests; full project suite now 1323 / 0 / 17 (passed / failed / xfailed). DATES - Adds Portuguese, Italian, Dutch, Russian month dictionaries to the opt-in ``month_locales`` set (now: en, fr, de, es, pt, it, nl, ru). - Adds localized weekday recognition for those locales — "Lundi", "Montag", "lunedì", "понедельник", etc. all strip cleanly before format matching. - New CJK separator normalization: Japanese ``2024年01月15日`` and fullwidth digits ``2024/01/15`` fold to ASCII before parsing. - New named-timezone resolution: EST/PST/JST/CET/IST/GMT/etc. map to fixed UTC offsets via ``_NAMED_TZ_OFFSETS`` so the trailing TZ doesn't block format matching. - New ISO 8601 extended formats: week date (``2024-W03-1``) and ordinal date (``2024-015``), plus RFC 2822 mail-header form (``Mon, 15 Jan 2024 10:30:00``). - New ``two_digit_year_cutoff`` parameter on ``standardize_date()`` — defaults to Python's stdlib 69; lower it for birth-year columns where most subjects were born ≤ 1999. NAMES - Particles set extended with Arabic patronymic markers (bin, ibn, bint, abu, abd, al, al-, el-) and Hebrew (ben, bat, ha, ha-). - Title set extended with German (Herr, Frau), French (M., Mme, Mlle), Spanish (Sr., Sra., Srta., Don, Doña), Italian (Sig., Sig.ra, Dott.), Portuguese. - Acronym map extended with international academic credentials (Dipl, Ing, Mag, Habil, MSc, BSc, LLB, LLM). - New East Asian honorific suffix handler: ``Tanaka-san``, ``Lee-ssi``, ``Park-nim`` keep the suffix lowercase after the hyphen instead of being title-cased into ``Tanaka-San``. - Hyphenated-segment handler now keeps Arabic prefixes ``al-`` / ``el-`` lowercase per Arabic transliteration convention. - New ``family_first`` parameter on ``standardize_name()`` and matching ``name_family_first`` field on ``StandardizeOptions`` — set per-column for East Asian data to skip Western comma-format reversal (``Kim, Min-jae`` stays ``Kim, …`` instead of becoming ``Min-jae Kim``). CURRENCY - Symbol map extended: ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN), ₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS) — covers SE Asia, Africa, Eastern Europe, Latin America gaps. - ISO 4217 code list extended from 23 to ~50: SAR, AED, QAR, KWD, BHD, OMR, ARS, CLP, COP, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS, KES, HUF, CZK, RON, UAH, KZT, etc. EMAIL - New BIDI / RTL override stripping (``standardize_email``): U+202A-U+202E and U+2066-U+2069 stripped from every email. These are a known phishing vector — ``alice@example.com`` displays as ``alice@elpmaxe.com`` to RTL-aware renderers. ADDRESS - Canadian provinces: 13 codes + names → 2-letter (Ontario → ON). - UK postcode pattern recognition (``SW1A 2AA`` shape). - Australian states: 8 codes + names (NSW, VIC, QLD, … + full names). - German Bundesland: 16 codes + names (Bayern → BY, etc.). - International PO Box variants: ``Postfach`` (DE), ``Boîte postale`` (FR), ``Apartado`` (ES), ``Casella postale`` (IT), ``Caixa postal`` (PT) — all fold to canonical ``PO Box``. - ``_INTL_STATE_CODES`` now combines US/CA/AU/DE codes; the position check that preserves state codes regardless of input case applies to all four jurisdictions. - ``_is_state_code_position`` postal pattern broadened to recognize US ZIP, AU 4-digit, CA first half, and UK outward code. CONSTANTS - ``src/core/_constants.py`` gains: ``CA_PROVINCE_CODES`` / ``CA_PROVINCE_NAMES``, ``AU_STATE_CODES`` / ``AU_STATE_NAMES``, ``DE_STATE_CODES`` / ``DE_STATE_NAMES``, ``POSTAL_PATTERNS`` (us/ca/uk/de/au/fr), ``INTL_PO_BOX_PATTERNS`` (per-language regex), ``INTL_STREET_SUFFIXES`` (de/fr/es/it/uk dictionaries — ready for use when address takes a `country_hint` parameter in a future pass). DOCS - TECHNICAL.md §11.3 domain table updated with the new handling per domain plus a new "International coverage" sub-section listing the supported locales / symbols / jurisdictions. DEFERRED (out of scope or rare) - Alternative calendars (Japanese era, Hijri, Hebrew, Buddhist) — corpus § 3.5 marks out of scope. - Persian/Arabic-Indic digit conversion — rare in tabular data. - Trailing-minus RTL currency convention. - Punycode ↔ Unicode IDN normalization. - Mixed-country phone column auto-detection (user can override ``default_region`` per column). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|||
| 26b9771625 |
feat(errors): structured error hierarchy + helpful messages everywhere
Introduces src/core/errors.py with a small structured error hierarchy
that every public entry point now uses. Each error carries the
context a user needs to fix it and the context a maintainer needs to
trace it.
The hierarchy:
DataToolsError (base — formats path, column, operation, suggestion)
InputValidationError (extends ValueError — bad arg / wrong type)
ConfigError (extends ValueError — bad config / options)
FileFormatError (extends ValueError — file is not what we expected)
FileAccessError (extends OSError — file I/O failure)
Subclassing the stdlib bases means existing `except OSError` /
`except ValueError` handlers still catch them — no breaking change.
Helpers:
- ensure_dataframe(value, function=...) — uniform DataFrame guard
- ensure_choice(value, name=, choices=) — uniform enum/literal guard
- wrap_file_read(path, op, exc) — tag OSError with hint + path
- wrap_file_write(path, op, exc) — same, with Windows-aware tip
- format_for_user(exc, context=) — user-facing string for st.error / stderr
Library hardening:
- io.read_file: missing files surface FileAccessError listing whether
the parent directory exists, and the suggestion to check the path.
- io.read_file: chunk_size <= 0 now raises InputValidationError with
a positive-integer suggestion.
- io._read_excel: openpyxl BadZipFile / InvalidFileException / pandas
ValueError ("sheet not found") wrapped as FileFormatError listing
the path and a "list sheets with list_sheets()" hint.
- io._detect_excel_header_row: bare except narrowed to specific
openpyxl exceptions; falls back gracefully and logs at debug so
the real error surfaces from pd.read_excel.
- io.write_file: OSError / PermissionError on to_csv/to_excel wrapped
with file path and Windows-aware "file may be open in another
program" hint.
- dedup._parse_date: bare `except Exception` narrowed to
(TypeError, ValueError, OutOfBoundsDatetime); failed values
logged at debug for survivor-selection forensics.
- dedup._select_survivor: KEEP_MOST_RECENT now raises
InputValidationError instead of silently falling back to keep_first.
- dedup.deduplicate: input validation errors are InputValidationError
with operation/column/suggestion fields.
- format_standardize.from_dict: invalid FieldType for a column raises
ConfigError naming the column AND the bad value AND listing valid
values; same for date_order / phone_format / etc.
- format_standardize.from_file: OSError / JSON decode wrapped with
path AND line/column where parsing failed.
- format_standardize.to_file: TypeError on json.dumps wrapped as
ConfigError with the suspected source (extra_abbreviations).
- format_standardize._apply_field_type: dispatcher's "unknown field
type" branch now raises AssertionError (it's an internal invariant,
not user error — a new enum value was added without a branch).
- format_standardize._resolve_column_types: missing-column error now
InputValidationError with a "check for typos / unparsed header"
suggestion.
- format_standardize.standardize_dataframe: ensure_dataframe at entry.
- text_clean.clean_dataframe: ensure_dataframe at entry.
- config.to_strategies: invalid Algorithm/NormalizerType wrapped as
ConfigError naming the strategy index AND the column.
- config.to_survivor_rule: invalid SurvivorRule wrapped as ConfigError
listing valid values.
- config.from_file: OSError / JSON decode wrapped (mirror of
StandardizeOptions.from_file).
- fixes.repair_mojibake: ImportError on ftfy now logged at info level
with the underlying ImportError so a corrupt-package vs not-installed
distinction is visible in the logs.
- normalizers.normalize_phone: phonenumbers.NumberParseException now
logged at debug when the digits-only fallback drops extension /
country-code information — gives a trail when matching results
look wrong.
GUI / CLI surfaces:
- All 9 page handlers (`except Exception as e: st.error(...)`) now
use format_for_user(), which renders DataToolsError fields nicely
and falls back to "ClassName: message" for unrecognized errors.
- 2_Text_Cleaner and 3_Format_Standardizer additionally distinguish
UnicodeDecodeError with an "re-save as UTF-8" suggestion before
the generic handler.
- cli.py's "Error reading file" handler now uses format_for_user()
and includes the input path in the prefix.
Tests:
- tests/test_errors.py — 22 new tests covering: base class formatting,
stdlib inheritance, ensure_dataframe / ensure_choice helpers,
wrap_file_read / wrap_file_write, format_for_user behavior, and
end-to-end integration (missing file, missing dir, bad JSON, bad
algorithm, bad enum, missing column).
- tests/test_audit_fixes.py + tests/test_io.py — updated 4 tests for
the new exception types (InputValidationError replaces TypeError,
FileAccessError extends OSError).
Full project suite: 1230 passed, 4 skipped, 17 xfailed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|