datatools-dev

Author	SHA1	Message	Date
Michael	c568aec8a7	feat(gui): one-click Close in its own bottom sidebar section Close is now a direct shutdown trigger: visiting the Close page (the sidebar entry) fires shutdown_app() immediately — no confirm step, no intermediate body. The farewell overlay paints and os._exit(0) lands ~1s later from a daemon thread. Layout: Close moved into its own bottom-of-sidebar section so the destructive action is visually separated from Account/Activate. - New shutdown_app() in components/_legacy.py replaces quit_button. os._exit thread is skipped when "pytest" is in sys.modules so the test suite doesn't suicide on rendering 99_Close. - pages/99_Close.py shrinks to set_page_config + chrome + shutdown_app. - app.py nav grows a new "Close" section header (new nav.section_close key in en/es packs) pinned at the bottom of the navigation dict. Tests updated: - TestQuitButtonRenders → TestClosePageShutsDownImmediately. Assert the shutdown caption renders + no confirm button exists. - test_smoke EXPECTED_SUBSTRINGS["99_Close"] now pins "Shutting down" / "Cerrando" (the visible page body) instead of the removed page title. 2008 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 20:17:14 +00:00
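The delayed-exit behaviour described in c568aec8a7 can be sketched roughly as follows. This is a minimal illustration, not the project's `shutdown_app()`: the helper name `schedule_exit` and its `exit_fn`, `delay_s`, and `modules` parameters are hypothetical, chosen so the pytest check and the timer can be exercised without ending the process.

```python
import sys
import threading


def schedule_exit(exit_fn, delay_s=1.0, modules=None):
    """Start a daemon timer that calls exit_fn(0) after delay_s seconds.

    Returns the timer, or None when pytest is loaded, so a test run that
    renders the Close page is not terminated.
    """
    # Hypothetical hook: `modules` defaults to sys.modules, as in the commit text.
    loaded = sys.modules if modules is None else modules
    if "pytest" in loaded:
        return None
    timer = threading.Timer(delay_s, exit_fn, args=(0,))
    timer.daemon = True  # a daemon thread never keeps the interpreter alive
    timer.start()
    return timer
```

In an application this would presumably be called as `schedule_exit(os._exit)`, which gives the farewell overlay about a second to paint before the process ends.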
Michael	ff2eaeb6c4	feat(home): multi-file upload + per-file analysis, drop tool grid Home is now upload + analysis only. The page accepts multiple files in one go, analyzes each independently, and renders findings grouped by filename in bordered containers. The 3-section tool-card grid is gone — discovery happens via the sidebar now. Mechanics: - file_uploader uses accept_multiple_files=True. Each file's findings cache in session_state["home_findings_by_file"] keyed by filename so removing a file via Streamlit's "x" button drops its findings too, and re-clicking Run only re-analyzes pending files. - The first uploaded file is mirrored into the singular home_uploaded_{name,bytes,size} keys so tool pages continue to pick up an "active" upload through pickup_or_upload — no tool-page changes. - New i18n keys: upload.intro_multi, upload.uploader_label_multi, upload.clear_results, upload.empty_state. upload.heading text is updated to "Upload one or more files to start" (EN + ES). Dropped tests pinning the tool grid: - TestHomeToolGridLocalization (test_chrome.py) - test_home_tool_card_uses_es_name (test_smoke.py) - TestLiteHomeGridBadges (test_lite_tier.py — locked-card lock-badge assertions; locking is still enforced per-tool-page via require_feature_or_render_upgrade) 2009 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 20:12:48 +00:00
Michael	dad744f17f	refactor(gui): drop Review page + normalization gate Home is now the only entry point: the "Run analysis" button on the upload section IS the review step (findings render inline via render_findings_panel). Tool pages no longer gate on a passed normalization — running the analyzer is sufficient context. Removed: - src/gui/pages/0_Review.py - src/gui/components/gate.py (re-export seam) - require_normalization_gate() in src/gui/components/_legacy.py - "review" section enum in tools_registry.py - Data Review entry in app.py navigation - require_normalization_gate() calls + imports in all nine tool pages - tests/gui/test_gate.py (whole file) - TestReviewWorkflow in tests/gui/test_workflows.py - 0_Review entry in tests/gui/test_smoke.py PAGE_SLUGS - stash_upload's normalization_result+normalization_for stashing - stash_upload_without_gate (was the gate's negative-path helper) 2017 tests pass (16 retired with the gate flow). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 20:04:33 +00:00
Michael	fc6c22c6a7	feat(review): inline file uploader instead of redirect home When a user lands on Review without an upload, show a file uploader on the page itself and auto-run the analyzer once a file is picked, rather than bouncing them to the home page with a "Back to home" button. Auto-analyze is the right default here: the user is already on the Review page, so they've implicitly committed to a scan. Stashing the bytes in the same session-state keys the home page uses keeps the rest of the flow (encoding picker, gate, tool pages) unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 19:57:01 +00:00
Michael	db5ec084da	docs+code: rename tool labels everywhere Sweep follow-up to `93e43fc`. Display labels now consistent across docs, landing pages, CLI output, code comments, docstrings, and test prose. Five parallel surfaces touched: - docs (EN + ES): README, USER-GUIDE, CLI-REFERENCE, and 11 internal design/planning docs - landing pages: index + bookkeeper/revops/shopify-pet - src: CLI module docstrings, _TOOL_DISPLAY dicts in cli_analyze.py and gui/components/_legacy.py, core module headers, every tool page's module docstring - tests: class/method/module docstrings and section-header comments - test-cases READMEs Page slugs (1_Deduplicator etc.), tool_id strings (01_deduplicator etc.), Python class names (TestDeduplicatorWorkflow, FeatureFlag.*), URL paths, anchor IDs, CSS classes, and asset filenames were left intact since they're code identifiers / structural references. All 2033 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 19:50:09 +00:00
Michael	93e43fc0d9	feat(gui): sidebar sections + non-technical tool labels Sidebar nav now groups tools under Data Review / Data Cleaners / Transformations / Automations via st.navigation, replacing the flat auto-discovered list. Tool display names switch to action-first phrasing (Find Duplicates, Fix Missing Values, Find Unusual Values, Standardize Formats, Clean Text, Quality Check, Map Columns, Combine Files, Automated Workflows) in EN + ES packs and on each page's H1. The Data Cleaners section follows the requested order: Missing Values → Outliers → Text Cleaner → Format Standardizer → Deduplicator → Quality Check. (Text Cleaner kept inside cleaners since the request didn't list it but the tool still ships.) Registry now carries a section field; helpers added: tools_in_section(), section_label(). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 19:36:01 +00:00
Michael	e534fb4989	sec(license): Ed25519 sigs + production-safe tripwire Two coupled hardening upgrades. 1. Asymmetric signatures (HMAC → Ed25519) The previous HMAC scheme used a symmetric secret that any motivated reverse engineer could pull out of the shipped binary and use to mint blobs for any tier / name / email. With Ed25519, the binary ships only the public verification key; the signing key never leaves the seller's environment, so binary compromise no longer yields forgery. - src/license/crypto.py rewritten around cryptography.hazmat.primitives.asymmetric.ed25519. Same public API surface (sign/verify/encode_blob/decode_blob), same canonical JSON encoding — drop-in for the manager / cli / GUI layers. - DATATOOLS_LICENSE_PRIVKEY (seller-side) and DATATOOLS_LICENSE_PUBKEY (build-time) env vars supply the keys; the in-source dev keypair (src/license/_dev_keypair.py) deterministically derives from a seed phrase for repro builds and tests. - Blob prefix bumped DTLIC1: → DTLIC2:. Decoding a DTLIC1 blob surfaces a clear "old format" error rather than a confusing signature mismatch. - scripts/generate_keypair.py mints fresh production keypairs for the seller (run once, stash the private key offline). Adds cryptography>=41,<46 to requirements.txt (was an undeclared transitive dep). 2. Production-safe tripwire assert_production_safe() refuses to boot a frozen / shipped build when either: - DATATOOLS_DEV_MODE=1 is set (would unconditionally bypass every license check — fine in source/test but catastrophic in a buyer install). - The active verification key is still the embedded dev key (the build pipeline forgot to set DATATOOLS_LICENSE_PUBKEY). No-op in source / pytest runs (sys.frozen is unset) so test fixtures and dev workflows keep working without ceremony. Called from src/cli_license_guard.guard() and from hide_streamlit_chrome — so it fires on every CLI invocation and every GUI page load. 
Tests: 49 license-layer unit tests (was 40); added Ed25519 wrong-key rejection, dev-keypair seed pin, blob v2 prefix, v1 rejection with clear message, and four production-safe scenarios (no-op in source, fires on DEV_MODE in frozen, fires on dev key in frozen, passes in frozen with prod pubkey). Total: 2024 → 2033. Docs (REQUIREMENTS §17a, DEVELOPER licensing recipe, DECISIONS §9b + decision log) updated with the new threat-model write-up, key-storage workflow, and tripwire behaviour. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 17:34:48 +00:00
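The production-safe tripwire from e534fb4989 could look something like the sketch below. The function name `check_production_safe`, its explicit arguments, and the `UnsafeBuildError` class are assumptions made for illustration; the real `assert_production_safe()` may read the environment and keys itself. Only the two refusal conditions and the no-op outside frozen builds come from the commit message.

```python
import sys


class UnsafeBuildError(RuntimeError):
    """Raised when a frozen build carries a dev-only configuration."""


def check_production_safe(env, active_pubkey, dev_pubkey, frozen=None):
    # Hypothetical signature: values are passed in so each scenario is testable.
    if frozen is None:
        frozen = bool(getattr(sys, "frozen", False))
    if not frozen:
        return  # source checkout or pytest run: no-op
    if env.get("DATATOOLS_DEV_MODE") == "1":
        raise UnsafeBuildError("DATATOOLS_DEV_MODE=1 is set in a frozen build")
    if active_pubkey == dev_pubkey:
        raise UnsafeBuildError("frozen build still uses the embedded dev key")
```

Passing `frozen` explicitly mirrors the four scenarios the commit lists: no-op in source, refusal on dev mode, refusal on the dev key, and a clean pass with a production key.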
Michael	d32b58e61a	feat(license): add Lite SKU; remove user-facing free trial Two coupled changes: 1. Lite tier - New Tier.LITE in src/license/schema.py. - FEATURES_BY_TIER[Tier.LITE] = {Deduplicator, Text Cleaner, Format Standardizer}. The three universally-useful tools that cover the most common bookkeeping / RevOps / Klaviyo prep workflows. Other six tools require Core. - i18n: license.tier_lite, license.feature_locked_title, license.feature_locked_body, license.upgrade_link, license.status_locked (en + es). - Per-tool feature gate at every GUI tool page (require_feature_or_render_upgrade) and every tool CLI (guard(feature=...)). A locked tool renders an upgrade prompt + Manage-license button (GUI) or exits with code 2 (CLI). - Home grid: tool cards the user's tier doesn't unlock get a red 🔒 Locked badge in place of green Ready. 2. Trial removed - Activation form's "Start 1-year trial" button removed. - license_cli's `trial` subcommand removed. - activation.trial_button / activation.trial_help i18n keys dropped (pack parity test stays green). - Tier.TRIAL stays in the enum (back-compat with any field- tested trial licenses); LicenseManager._mint stays internal for tests and the seller's key generator. - Decision logged in DECISIONS §9b: a 1-year all-features trial undercuts paid Lite; paid-only keeps tier economics clean. Tests (+29 net): +17 Lite-tier unit/guard tests + 13 Lite-tier GUI tests + 1 trial-absent assertion - 2 trial CLI tests - 1 trial GUI button test. Total: 1995 → 2024. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 17:19:30 +00:00
Michael	e435103113	feat(license): registration + 1-year licenses + tier scaffolding A complete offline licensing layer (no internet at any step): Core - src/license/ — schema (License, Tier, FeatureFlag), HMAC crypto, JSON storage, LicenseManager singleton with activate/renew/ deactivate/issue_trial. Tier-scaffolded so future SKUs can carve per-tool feature sets without consumer-code edits. - scripts/generate_license.py — creator-only key generator. Mints a DTLIC1: blob the buyer pastes into the activation page. GUI - New activation form component (src/gui/components/activation.py). - hide_streamlit_chrome() now inline-renders the activation form when no valid license is present (every page short-circuits to the form until activated). - Sidebar shows tier + days remaining; renewal warning under 30 days. - New pages/_Activate.py for revisiting the form after activation. CLI - src/license_cli.py — activate / renew / status / trial / deactivate commands. Exempt from the guard. - src/cli_license_guard.py — drop-in guard call added to every tool CLI's main(). Lets --help through; respects DATATOOLS_DEV_MODE. i18n - New activation.* and license.* keys in en.json + es.json (page title, form labels, status badges, renewal warnings, error messages). Pack parity test stays green. Test infrastructure - tests/conftest.py autouse fixture sets DATATOOLS_DEV_MODE=1 so the existing 1916 tests continue to pass. - isolated_license_path / activated_license_manager / unactivated_license_manager fixtures for tests that want to drive the real check. Tests (+79) - tests/test_license.py (40): schema, crypto roundtrip, blob encode/decode, tier→feature mapping, activation flow, name/email mismatch rejection, tamper detection, expiration, renewal, dev-mode bypass. - tests/test_license_cli.py (26): every license_cli command + subprocess tests confirming every tool CLI refuses to run without a license, --help always works, DEV_MODE bypasses. 
- tests/gui/test_activation.py (13): gate blocks without license, passes with trial, activation form submission unlocks the gate, sidebar status, renewal warning, i18n. Total: 1916 → 1995 tests. All pass under the strict warning filter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 16:54:23 +00:00
Michael	b2c7b94fe9	fix: clear all latent deprecation + resource warnings Three real issues surfaced when running the suite with strict warnings: 1. src/core/format_standardize.py: ``datetime.utcfromtimestamp`` is deprecated in CPython 3.12 and slated for removal. Replace with ``datetime.fromtimestamp(ts, tz=timezone.utc)``. Output for the date-only format codes we use is byte-identical. 2. src/core/io.py: ``list_sheets`` leaked the openpyxl file handle by returning ``xl.sheet_names`` from an unclosed ``pd.ExcelFile``. Wrap in a ``with`` block so the FD closes deterministically — also prevents the Windows-only "file is locked" repro path. 3. tests/test_corpus.py: ``TestXlsxPollution.workbook`` fixture returned the bare ``pd.ExcelFile`` instead of yielding + closing. Convert to a yield-and-finally pattern so the class-scoped handle isn't leaked across the whole test file. Also harden pytest.ini's warning policy: escalate ``ResourceWarning`` from ``src`` to an error, alongside the existing ``DeprecationWarning`` rule. Third-party warnings stay filtered — we can't fix pandas/openpyxl/streamlit churn from here. All 1916 tests pass under the strict filter; full and split runs (``pytest``, ``pytest -m 'not gui'``, ``pytest -m gui``) all clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 16:28:48 +00:00
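The `utcfromtimestamp` replacement named in b2c7b94fe9 amounts to the following. The wrapper name `utc_date_string` and the `%Y-%m-%d` format are illustrative assumptions; the two `datetime` calls are the ones the commit message names.

```python
from datetime import datetime, timezone


def utc_date_string(ts: float) -> str:
    """Format a POSIX timestamp as a UTC calendar date."""
    # Timezone-aware replacement for the deprecated datetime.utcfromtimestamp(ts).
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")


print(utc_date_string(1705314600))  # 2024-01-15
```

The aware datetime differs from the old naive one only in carrying `tzinfo`, which is why date-only format codes produce the same text.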
Michael	35d46a0c1a	test(gui): add Streamlit AppTest layer (139 tests) Until now every test ran against core or the CLI; the Streamlit GUI was verified by hand. This commit adds tests/gui/ — 139 AppTest- driven tests behind a 'gui' marker so the quick loop (``pytest -m 'not gui'``) stays at 1777 tests / ~10s while ``pytest`` runs everything (1916 / ~14s). Coverage: - test_smoke.py (59): every page renders in EN and ES, expected substring present, sidebar selector mounted. - test_chrome.py (18): language selector flips session state and re-renders; quit button + farewell strings localize; tool-card names use the active language. - test_gate.py (9): require_normalization_gate no-op / warning / short-circuit / hash-mismatch invariants; warning + button localized. - test_workflows.py (14): happy path per Ready tool — stash upload, render, find primary action, verify result lands in session state. - test_dedup_review.py (8): Accept All / Reject All / Clear Decisions wire through to review_decisions; apply_review_decisions semantics (keep-all, merge, column override). - test_advanced_panels.py (15): config_panel widget defaults and options (algorithm, threshold, survivor rule, merge, multiselects, config save/load). - test_errors.py (4): garbage / empty / single-column uploads don't crash; duplicate-target mapping raises InputValidationError. - test_findings_panel.py (12): driven via a small standalone harness page so we test the component without faking a file_uploader. EN + ES strings, per-tool grouping, open-tool button label, untargeted expander, severity summary. Shared infrastructure in tests/gui/conftest.py: - ``stash_upload`` / ``stash_upload_without_gate`` — populate session_state to pre-pass or block the gate. - ``with_language`` — set ``ui_lang`` before run(). - ``collected_text`` — flatten title/caption/markdown/etc. into one string for substring assertions. 
- Auto-marking: every test in tests/gui/ gets ``@pytest.mark.gui`` via ``pytest_collection_modifyitems``, so the marker isn't per-test boilerplate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 16:13:40 +00:00
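The auto-marking described in 35d46a0c1a might be implemented along these lines in `tests/gui/conftest.py` or a parent conftest. The path-matching logic is an assumption; the commit only states that `pytest_collection_modifyitems` applies the `gui` marker. The sketch relies on `item.path`, which pytest provides from version 7 onward, and on `add_marker` accepting a marker name as a string.

```python
from pathlib import Path

GUI_DIR = Path("tests") / "gui"  # assumed location, relative to the repo root


def pytest_collection_modifyitems(config, items):
    """Attach the 'gui' marker to every test collected from tests/gui/."""
    for item in items:
        parts = Path(item.path).parts
        # Match the consecutive components ("tests", "gui") anywhere in the path.
        if any(parts[i:i + 2] == GUI_DIR.parts for i in range(len(parts) - 1)):
            item.add_marker("gui")
```

With the marker applied at collection time, `pytest -m 'not gui'` and `pytest -m gui` split the suite without per-test decorators.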
Michael	64452dd783	perf: dedup blocking, column-parallel scaffolding, lazy-copy pipelines Three follow-on wins from the audit, each with shape-pinning tests. 1. Dedup blocking - Exact-only strategies (every column EXACT @ 100 — covers strong- key dedup like email/phone, the drop-duplicates fallback, and explicit "match on this exact column" calls) now route through an O(n) groupby fast path. Lossless; no API change required. Measured: 10k-row email-exact dedup → 73 ms (was ~30 minutes via the O(n²) pair compare). - Fuzzy strategies still pair-compare, with opt-in prefix blocking via deduplicate(..., blocking_columns=[...], blocking_prefix_len=1). Measured: 5k-row fuzzy-name → 25.6s with blocking vs 179s without (7x). Trade-off: cross-block matches missed. 2. Column-parallel standardize - StandardizeOptions.parallel_columns (default 1) lands a ThreadPoolExecutor over the column loop. Output order and audit-record order are preserved deterministically via a merge step keyed off column_types order. Honest doc: under CPython 3.12's GIL the win is roughly neutral (phonenumbers/dateutil hold the GIL); the API is ready for free-threaded Python 3.13+. 3. Lazy-copy in missing / column_mapper - _standardize_sentinels now builds per-column changes in a dict and only materialises the output frame when at least one column actually changed. On a clean 1 GB file this skips a 1 GB allocation. - handle_missing carries an out_is_owned flag, copying on demand before any mutating step. No-op runs return the input frame. - map_columns drops the unconditional upfront df.copy(); rename and drop both return fresh frames already, and schema-add / coerce trigger _ensure_owned() lazily. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 15:54:25 +00:00
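The two dedup ideas in 64452dd783 can be illustrated with plain Python. This is a stdlib sketch under stated assumptions: the commit routes exact strategies through a pandas groupby, whereas this version hashes into a dict, and the helper names `exact_duplicate_groups` and `blocked_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def exact_duplicate_groups(rows, key_columns):
    """Group row indices whose key columns match exactly, in one O(n) pass."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(row[col] for col in key_columns)
        groups[key].append(index)
    # Only keys seen more than once are duplicates.
    return [indices for indices in groups.values() if len(indices) > 1]


def blocked_pairs(values, prefix_len=1):
    """Yield candidate index pairs that share a lowercase prefix block."""
    blocks = defaultdict(list)
    for index, value in enumerate(values):
        blocks[value[:prefix_len].lower()].append(index)
    for indices in blocks.values():
        # Fuzzy comparison would run only on these pairs.
        yield from combinations(indices, 2)


rows = [
    {"email": "a@example.com", "name": "Ann"},
    {"email": "b@example.com", "name": "Bob"},
    {"email": "a@example.com", "name": "Ann B."},
]
print(exact_duplicate_groups(rows, ["email"]))  # [[0, 2]]
print(list(blocked_pairs(["Jon Smith", "John Smith", "Ann Lee"])))  # [(0, 1)]
```

The blocking trade-off the commit mentions shows up directly: two spellings that differ in their first character land in different blocks and are never compared.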
Michael	5b672370a6	perf: cache hot paths, drop wasted allocations, lift 1 GB → 1.5 GB Five targeted wins driven by an end-to-end audit, with shape-pinning regression tests so reverts are loud: - format_standardize: fuse the dispatcher loop into one pass — was calling Series.tolist() three times per typed column and materialising an intermediate triples list; now one tolist, one walk. On a synthetic 1M-row phone+email frame this measures ~2.7M rows/sec (vs. the previous 150k/sec doc target). - dedup: wrap normalizers in a per-call lru_cache so repeat phones / emails / addresses skip re-parsing. phonenumbers.parse is the expensive call; ~2–5x faster on the normalisation step for realistic workloads. - analyze: _detect_near_duplicates no longer copies the full input frame; builds only the normalised string columns via a dict and references non-string columns by view. Skips the redundant astype(str) when a column is already pandas string dtype. - text_clean: hoist _build_pipeline out of the per-cell loop and add a per-call string cache so 100k repeats of "Active" only run the pipeline once. ~1M rows/sec on repetition-heavy columns. - io.repair_bytes: the non-UTF-8 smart-quote fold path used a Python-level zip walk over the entire decoded string to count replacements — replaced with sum(text.count(c) ...) which runs in C at ~GB/s. Was a latent ~100s on a 1 GB cp1252 file; now <1s. Updates REQUIREMENTS §10 with measured numbers and bumps the buyer- facing upload limit from 1 GB to 1.5 GB across the i18n packs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 15:37:26 +00:00
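The `io.repair_bytes` change in 5b672370a6 replaces a per-character Python loop with `str.count`. A rough sketch of that counting approach follows; the `SMART_QUOTES` table and the function name `fold_smart_quotes` are assumptions, since the commit does not list which characters are folded.

```python
# Assumed mapping: curly single and double quotes to their ASCII forms.
SMART_QUOTES = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}
_TABLE = str.maketrans(SMART_QUOTES)


def fold_smart_quotes(text):
    """Return (folded_text, replacement_count) without a per-character loop."""
    # str.count scans in C, one pass per distinct character.
    count = sum(text.count(ch) for ch in SMART_QUOTES)
    return text.translate(_TABLE), count


print(fold_smart_quotes("\u201cit\u2019s\u201d"))  # ('"it\'s"', 3)
```

The number of passes is fixed by the size of the mapping, not the length of the text, so the work stays in C even for very large inputs.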
Michael	c4ce86bd64	feat(i18n): add language-pack scaffold with English and Spanish Introduces ``src/i18n`` with a tiny JSON-backed t() lookup, an in-session language preference, and a sidebar selector wired through ``hide_streamlit_chrome`` so every page picks up the same picker. Covers home, tool cards, findings panel, gate, shutdown, and pickup banner strings. Tests pin pack parity and the farewell-overlay JS escape so future packs can't silently regress. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 15:11:30 +00:00
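A JSON-backed `t()` lookup with a pack parity check, as described in c4ce86bd64, might look roughly like this. Everything here is an assumption for illustration: the signature, the fall-back-to-English behaviour, the `parity_gaps` helper, and the sample keys are not taken from the project.

```python
import json

# Illustrative packs; the real ones live in en.json and es.json.
PACKS = {
    "en": json.loads('{"demo.greeting": "Hello", "demo.farewell": "Goodbye"}'),
    "es": json.loads('{"demo.greeting": "Hola"}'),
}


def t(key, lang, packs=PACKS, default_lang="en"):
    """Look up key in the lang pack, then the default pack, then echo the key."""
    pack = packs.get(lang, packs[default_lang])
    return pack.get(key, packs[default_lang].get(key, key))


def parity_gaps(packs, reference="en"):
    """Map each pack to the keys it lacks or adds relative to the reference."""
    ref_keys = set(packs[reference])
    return {
        lang: sorted(ref_keys ^ set(pack))
        for lang, pack in packs.items()
        if set(pack) != ref_keys
    }


print(t("demo.greeting", "es"))  # Hola
print(parity_gaps(PACKS))        # {'es': ['demo.farewell']}
```

A parity test would assert that `parity_gaps` returns an empty dict, so a key added to one pack but not the other fails loudly.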
Michael	966af8ef94	feat: 3 new tools, format streaming, distribution-ready demo + landing pages Tools shipped this batch (4 → 6 of 9 Ready): 04 Missing Value Handler src/core/missing.py + cli_missing.py + GUI 05 Column Mapper src/core/column_mapper.py + cli_column_map.py + GUI 09 Pipeline Runner src/core/pipeline.py + cli_pipeline.py + GUI with soft tool-dependency graph (recommended, not enforced) and JSON save/load for repeatable weekly cleanups. Format Standardizer reworked for 1 GB international files: • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email • Per-row country / address columns drive parsing • Audit cap (default 10 k rows, ~50 MB RAM) • standardize_file(): chunked streaming entry point (~165 k rows/sec) • currency_decimal="auto" for EU comma-decimal locales • R$ / kr / zł multi-char currency prefixes • cli_format.py with auto-stream above 100 MB inputs Encoding detection arbiter + language-aware probe: Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM) via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes. 
Distribution-readiness assets: • streamlit_app.py — Streamlit Community Cloud entry shim • src/gui/app_demo.py — single-page demo, ?p=<persona> routing, 100-row cap + watermark, free-vs-paid boundary enforced at surface • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs • landing/ — 4 static HTML pages (apex chooser + 3 niche), shared CSS, deploy.py URL-substitution script, auto-generated robots.txt + sitemap.xml + 404.html + favicon • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md — full strategy + measurement + deployment + master checklist Test counts: before: 1,520 passed · 4 skipped · 17 xfailed after: 1,729 passed · 0 skipped · 0 xfailed Tier-1 corpora added: • missing-corpus 3 use cases + 16 edge cases • column-mapper-corpus 3 use cases + 5 edge cases • format-cleaner intl 20-row 13-country stress fixture Engine hardening flushed out by the corpora: • interpolate guards against object-dtype columns • mean/median skip all-NaN columns (silences numpy warning) • fillna runs under future.no_silent_downcasting (silences pandas warning) • mojibake test no longer skips when ftfy installed (monkeypatch path) • drop-row threshold semantics: strict-greater (consistent across rows / cols) • currency_decimal validator allow-set updated for "auto" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 22:31:26 +00:00
Michael	d18b95880d	feat(format-i18n): broaden international coverage across all domains Closes ~17 high-value international gaps surfaced by parallel review. Adds 93 regression tests; full project suite now 1323 / 0 / 17 (passed / failed / xfailed). DATES - Adds Portuguese, Italian, Dutch, Russian month dictionaries to the opt-in ``month_locales`` set (now: en, fr, de, es, pt, it, nl, ru). - Adds localized weekday recognition for those locales — "Lundi", "Montag", "lunedì", "понедельник", etc. all strip cleanly before format matching. - New CJK separator normalization: Japanese ``2024年01月15日`` and fullwidth digits ``２０２４/０１/１５`` fold to ASCII before parsing. - New named-timezone resolution: EST/PST/JST/CET/IST/GMT/etc. map to fixed UTC offsets via ``_NAMED_TZ_OFFSETS`` so the trailing TZ doesn't block format matching. - New ISO 8601 extended formats: week date (``2024-W03-1``) and ordinal date (``2024-015``), plus RFC 2822 mail-header form (``Mon, 15 Jan 2024 10:30:00``). - New ``two_digit_year_cutoff`` parameter on ``standardize_date()`` — defaults to Python's stdlib 69; lower it for birth-year columns where most subjects were born ≤ 1999. NAMES - Particles set extended with Arabic patronymic markers (bin, ibn, bint, abu, abd, al, al-, el-) and Hebrew (ben, bat, ha, ha-). - Title set extended with German (Herr, Frau), French (M., Mme, Mlle), Spanish (Sr., Sra., Srta., Don, Doña), Italian (Sig., Sig.ra, Dott.), Portuguese. - Acronym map extended with international academic credentials (Dipl, Ing, Mag, Habil, MSc, BSc, LLB, LLM). - New East Asian honorific suffix handler: ``Tanaka-san``, ``Lee-ssi``, ``Park-nim`` keep the suffix lowercase after the hyphen instead of being title-cased into ``Tanaka-San``. - Hyphenated-segment handler now keeps Arabic prefixes ``al-`` / ``el-`` lowercase per Arabic transliteration convention. 
- New ``family_first`` parameter on ``standardize_name()`` and matching ``name_family_first`` field on ``StandardizeOptions`` — set per-column for East Asian data to skip Western comma-format reversal (``Kim, Min-jae`` stays ``Kim, …`` instead of becoming ``Min-jae Kim``). CURRENCY - Symbol map extended: ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN), ₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS) — covers SE Asia, Africa, Eastern Europe, Latin America gaps. - ISO 4217 code list extended from 23 to ~50: SAR, AED, QAR, KWD, BHD, OMR, ARS, CLP, COP, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS, KES, HUF, CZK, RON, UAH, KZT, etc. EMAIL - New BIDI / RTL override stripping (``standardize_email``): U+202A-U+202E and U+2066-U+2069 stripped from every email. These are a known phishing vector — ``alice‮@example.com`` displays as ``alice@elpmaxe.com`` to RTL-aware renderers. ADDRESS - Canadian provinces: 13 codes + names → 2-letter (Ontario → ON). - UK postcode pattern recognition (``SW1A 2AA`` shape). - Australian states: 8 codes + names (NSW, VIC, QLD, … + full names). - German Bundesland: 16 codes + names (Bayern → BY, etc.). - International PO Box variants: ``Postfach`` (DE), ``Boîte postale`` (FR), ``Apartado`` (ES), ``Casella postale`` (IT), ``Caixa postal`` (PT) — all fold to canonical ``PO Box``. - ``_INTL_STATE_CODES`` now combines US/CA/AU/DE codes; the position check that preserves state codes regardless of input case applies to all four jurisdictions. - ``_is_state_code_position`` postal pattern broadened to recognize US ZIP, AU 4-digit, CA first half, and UK outward code. CONSTANTS - ``src/core/_constants.py`` gains: ``CA_PROVINCE_CODES`` / ``CA_PROVINCE_NAMES``, ``AU_STATE_CODES`` / ``AU_STATE_NAMES``, ``DE_STATE_CODES`` / ``DE_STATE_NAMES``, ``POSTAL_PATTERNS`` (us/ca/uk/de/au/fr), ``INTL_PO_BOX_PATTERNS`` (per-language regex), ``INTL_STREET_SUFFIXES`` (de/fr/es/it/uk dictionaries — ready for use when address takes a `country_hint` parameter in a future pass). 
DOCS - TECHNICAL.md §11.3 domain table updated with the new handling per domain plus a new "International coverage" sub-section listing the supported locales / symbols / jurisdictions. DEFERRED (out of scope or rare) - Alternative calendars (Japanese era, Hijri, Hebrew, Buddhist) — corpus § 3.5 marks out of scope. - Persian/Arabic-Indic digit conversion — rare in tabular data. - Trailing-minus RTL currency convention. - Punycode ↔ Unicode IDN normalization. - Mixed-country phone column auto-detection (user can override ``default_region`` per column). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 03:06:03 +00:00
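Two of the normalizations listed in d18b95880d, BIDI override stripping and CJK date folding, can be sketched with the standard library. The function names are hypothetical, and the hyphen-separated output of `fold_cjk_date` is an assumption; the commit only says the input is folded to ASCII before parsing.

```python
import re
import unicodedata

# U+202A..U+202E (embeddings and overrides) and U+2066..U+2069 (isolates).
_BIDI_CONTROLS = dict.fromkeys(
    list(range(0x202A, 0x202F)) + list(range(0x2066, 0x206A))
)
_CJK_DATE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def strip_bidi_controls(value):
    """Delete bidirectional control characters from a string."""
    return value.translate(_BIDI_CONTROLS)


def fold_cjk_date(value):
    """Fold fullwidth digits to ASCII and rewrite year/month/day markers."""
    ascii_digits = unicodedata.normalize("NFKC", value)
    return _CJK_DATE.sub(r"\1-\2-\3", ascii_digits)


print(strip_bidi_controls("alice\u202e@example.com"))  # alice@example.com
print(fold_cjk_date("2024年01月15日"))                  # 2024-01-15
```

NFKC normalization maps fullwidth digits to their ASCII counterparts while leaving the 年, 月, and 日 ideographs untouched, which is why the regex runs after it.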
Michael	26b9771625	feat(errors): structured error hierarchy + helpful messages everywhere Introduces src/core/errors.py with a small structured error hierarchy that every public entry point now uses. Each error carries the context a user needs to fix it and the context a maintainer needs to trace it. The hierarchy: DataToolsError (base — formats path, column, operation, suggestion) InputValidationError (extends ValueError — bad arg / wrong type) ConfigError (extends ValueError — bad config / options) FileFormatError (extends ValueError — file is not what we expected) FileAccessError (extends OSError — file I/O failure) Subclassing the stdlib bases means existing `except OSError` / `except ValueError` handlers still catch them — no breaking change. Helpers: - ensure_dataframe(value, function=...) — uniform DataFrame guard - ensure_choice(value, name=, choices=) — uniform enum/literal guard - wrap_file_read(path, op, exc) — tag OSError with hint + path - wrap_file_write(path, op, exc) — same, with Windows-aware tip - format_for_user(exc, context=) — user-facing string for st.error / stderr Library hardening: - io.read_file: missing files surface FileAccessError listing whether the parent directory exists, and the suggestion to check the path. - io.read_file: chunk_size <= 0 now raises InputValidationError with a positive-integer suggestion. - io._read_excel: openpyxl BadZipFile / InvalidFileException / pandas ValueError ("sheet not found") wrapped as FileFormatError listing the path and a "list sheets with list_sheets()" hint. - io._detect_excel_header_row: bare except narrowed to specific openpyxl exceptions; falls back gracefully and logs at debug so the real error surfaces from pd.read_excel. - io.write_file: OSError / PermissionError on to_csv/to_excel wrapped with file path and Windows-aware "file may be open in another program" hint. 
- dedup._parse_date: bare `except Exception` narrowed to (TypeError, ValueError, OutOfBoundsDatetime); failed values logged at debug for survivor-selection forensics. - dedup._select_survivor: KEEP_MOST_RECENT now raises InputValidationError instead of silently falling back to keep_first. - dedup.deduplicate: input validation errors are InputValidationError with operation/column/suggestion fields. - format_standardize.from_dict: invalid FieldType for a column raises ConfigError naming the column AND the bad value AND listing valid values; same for date_order / phone_format / etc. - format_standardize.from_file: OSError / JSON decode wrapped with path AND line/column where parsing failed. - format_standardize.to_file: TypeError on json.dumps wrapped as ConfigError with the suspected source (extra_abbreviations). - format_standardize._apply_field_type: dispatcher's "unknown field type" branch now raises AssertionError (it's an internal invariant, not user error — a new enum value was added without a branch). - format_standardize._resolve_column_types: missing-column error now InputValidationError with a "check for typos / unparsed header" suggestion. - format_standardize.standardize_dataframe: ensure_dataframe at entry. - text_clean.clean_dataframe: ensure_dataframe at entry. - config.to_strategies: invalid Algorithm/NormalizerType wrapped as ConfigError naming the strategy index AND the column. - config.to_survivor_rule: invalid SurvivorRule wrapped as ConfigError listing valid values. - config.from_file: OSError / JSON decode wrapped (mirror of StandardizeOptions.from_file). - fixes.repair_mojibake: ImportError on ftfy now logged at info level with the underlying ImportError so a corrupt-package vs not-installed distinction is visible in the logs. - normalizers.normalize_phone: phonenumbers.NumberParseException now logged at debug when the digits-only fallback drops extension / country-code information — gives a trail when matching results look wrong. 
GUI / CLI surfaces: - All 9 page handlers (`except Exception as e: st.error(...)`) now use format_for_user(), which renders DataToolsError fields nicely and falls back to "ClassName: message" for unrecognized errors. - 2_Text_Cleaner and 3_Format_Standardizer additionally distinguish UnicodeDecodeError with a "re-save as UTF-8" suggestion before the generic handler. - cli.py's "Error reading file" handler now uses format_for_user() and includes the input path in the prefix. Tests: - tests/test_errors.py — 22 new tests covering: base class formatting, stdlib inheritance, ensure_dataframe / ensure_choice helpers, wrap_file_read / wrap_file_write, format_for_user behavior, and end-to-end integration (missing file, missing dir, bad JSON, bad algorithm, bad enum, missing column). - tests/test_audit_fixes.py + tests/test_io.py — updated 4 tests for the new exception types (InputValidationError replaces TypeError, FileAccessError extends OSError). Full project suite: 1230 passed, 4 skipped, 17 xfailed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 02:35:42 +00:00
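The error hierarchy from 26b9771625 relies on multiple inheritance so that existing `except ValueError` and `except OSError` handlers keep working. A minimal sketch is below; the class names and context fields come from the commit message, while the constructor signature and the `__str__` layout are assumptions.

```python
class DataToolsError(Exception):
    """Base error carrying the context needed to fix or trace a failure."""

    def __init__(self, message, *, path=None, column=None,
                 operation=None, suggestion=None):
        super().__init__(message)
        self.message = message
        self.path = path
        self.column = column
        self.operation = operation
        self.suggestion = suggestion

    def __str__(self):
        # Assumed layout: message first, then whichever fields are set.
        parts = [self.message]
        for label in ("operation", "path", "column", "suggestion"):
            value = getattr(self, label)
            if value is not None:
                parts.append(f"{label}: {value}")
        return " | ".join(parts)


class InputValidationError(DataToolsError, ValueError):
    """Bad argument or wrong type; still caught by `except ValueError`."""


class FileAccessError(DataToolsError, OSError):
    """File I/O failure; still caught by `except OSError`."""


try:
    raise FileAccessError("cannot read file", path="data.csv",
                          suggestion="check the path")
except OSError as exc:
    print(exc)  # cannot read file | path: data.csv | suggestion: check the path
```

Putting `DataToolsError` first in each base list means its `__init__` and `__str__` win the method resolution order, while `isinstance` checks against the stdlib bases still succeed.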
Michael	2eece6467d	refactor: dedup, consolidate, harden public APIs across core modules Closes 16 high-value findings from a parallel cross-module review. Refactors: - New src/core/_constants.py centralizes USPS street-suffix abbreviations, US state names, and 2-letter postal codes — one source of truth for both normalize_address (matching keys) and standardize_address (display formatting). Eliminates ~80 lines of duplicated dicts across normalizers.py and format_standardize.py. - format_standardize.py: collapse 4 identical nested _err() helpers into one shared _err_or_passthrough() module function; drop a dead duplicate `return _err("not a phone number")` branch in standardize_phone. - format_standardize.py: precompile per-locale month-name regexes (_MONTH_LOCALE_PATTERNS) and per-state-name regexes (_STATE_NAME_PATTERNS) at import time — they were rebuilt on every cell, a measurable hot path on million-row inputs. - dedup.py: extract _is_missing(value) helper; one definition of "this cell is None / NaN / pd.NA" instead of two. - fixes.py: extract _is_string_column(ser) helper; one dtype check instead of three duplicates across _apply_to_strings, _vectorized_translate, _vectorized_regex_sub. Production-readiness: - format_standardize.standardize_dataframe now logs a warning when more than 10% of typed cells are unparseable — surfaces the silently-broken-pipeline failure mode. - StandardizeOptions.from_dict validates date_order / phone_format / currency_decimal / name_case / boolean_style / *_error_policy enum values up front, with a clear error message instead of a deep crash inside the per-cell function. - StandardizeOptions.from_file and DeduplicationConfig.from_file wrap read + json.loads with descriptive OSError / ValueError messages including the file path. - standardize_date(month_locales=...) validates locale codes against the available set instead of silently passing through unknown ones. 
- io.read_file rejects chunk_size <= 0 (was silently failing inside pandas) and logs the resolved suffix + chunk_size at info level so data-pipeline runs are debuggable. - io.read_file's FileNotFoundError gains explanatory context. - io.write_file, text_clean.clean_dataframe, and dedup.deduplicate now reject non-DataFrame inputs with clear TypeError instead of cryptic pandas tracebacks downstream. - dedup.deduplicate validates that survivor_rule=KEEP_MOST_RECENT has a usable date_column up front; the helper _select_survivor now raises (instead of silently falling back to keep_first) when called directly with bad arguments. - dedup.deduplicate gains a structured no-op return when strategies is empty after auto-detection — preserves schema instead of crashing. - analyze._detect_inconsistent_date_format narrows its bare except to (TypeError, ValueError) and logs a debug line so genuine bugs don't hide behind silent skip. Tests: - tests/test_audit_fixes.py grows by 11 cases covering the new validation paths (chunk_size, DataFrame guards, KEEP_MOST_RECENT date_column, enum validation, locale validation, JSON error wrapping). Full project suite: 1208 passed, 4 skipped, 17 xfailed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 02:23:09 +00:00
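The "more than 10% of typed cells are unparseable" warning from this commit might look like the sketch below. The function name, logger name, and counting interface are hypothetical; only the 10% threshold and the decision to log a warning come from the commit message.

```python
import logging

logger = logging.getLogger("format_standardize_sketch")  # hypothetical logger name

UNPARSEABLE_WARN_RATIO = 0.10  # threshold stated in the commit message


def warn_if_mostly_unparseable(column, n_typed, n_unparseable):
    """Log a warning and return True when more than 10% of typed cells failed."""
    if n_typed == 0:
        return False  # nothing was typed, so there is nothing to report
    ratio = n_unparseable / n_typed
    if ratio > UNPARSEABLE_WARN_RATIO:
        logger.warning(
            "column %r: %.1f%% of %d typed cells were unparseable",
            column, ratio * 100, n_typed,
        )
        return True
    return False
```

The strict `>` comparison follows the wording "more than 10%", so a column at exactly 10% stays quiet.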
Michael	b23a27d4e3	fix: cross-tool audit findings + alignment with format standardizer Closes 12 bugs and 8 gaps surfaced by parallel audits across all core modules, plus aligns the dedup-side normalizers with the new format_standardize behavior where they had silently diverged. Bugs (data integrity / correctness): - dedup: NaN/None values matched as duplicates because str(None)='None'. Two rows with missing email silently merged. - dedup: removed_df had 0 columns when nothing was removed; downstream code expecting matching schema broke. Now preserves column shape. - dedup: ColumnMatchStrategy threshold accepted any value; out-of-range silently broke matching. Validated to [0, 100] in __post_init__. - dedup: strategy referencing a missing column was silently skipped. Now raises ValueError listing available columns. - fixes: replace_null_sentinels crashed on non-string sentinels (int/None from JSON payload). Coerced to str. - fixes: _vectorized_regex_sub raised raw re.error on bad patterns. Now wraps as ValueError with clear message. - io: detect_header_row mis-identified all-empty and metadata-only rows as headers (all([]) is True). Now requires ≥2 non-empty cells. - config: from_dict crashed when JSON had unknown fields, breaking forward compat. Now filters to known fields. - analyze: mixed-case email detector flagged all-None columns because str(None)='None' contains both N and one. Now drops NaN before stringify. New features and gap closures: - io: _detect_excel_header_row mirrors detect_header_row for Excel via openpyxl read-only; _read_excel uses it when header_row=None. - io: write_file gains delimiter + encoding params; .tsv extension defaults to tab. - normalizers: normalize_phone preserves extensions as ;ext=N suffix. - normalizers: normalize_address folds spelled-out US state names to 2-letter codes (California ≡ CA). - normalizers: normalize_name drops surname particles (van, de, von) so "Charles de Gaulle" ≡ "Charles Gaulle" for matching. 
- analyze: new _detect_inconsistent_date_format detector flags columns with mixed ISO/US/EU date shapes; routes to format standardizer. - analyze: _NULL_LIKE recognizes "<na>" (pd.NA repr). - analyze: duplicate-row finding renamed count → n_extra (rows that would actually be removed) with clarified description. - dedup: group_confidence no longer falsely 100.0 when transitive group members lack a recorded direct pair; falls back to 100.0 only when truly no pairs were observed. - dedup: MatchResult / DeduplicationResult docstrings clarify that row_indices refer to the input frame's positional index (output index is reset). - text_clean: visualize_hidden_html(None) now returns None (matches visualize_hidden_text); strip_bom strips at most one BOM per call; sentence_case dead elif branch removed. Tests: - tests/test_audit_fixes.py — 28 regression tests, one or more per numbered finding, named after BUG/GAP/NIT tags so future readers can trace each test back to its audit. - tests/test_fixes_unit.py — 26 isolated unit tests for previously integration-only fix functions (trim_whitespace, strip_nbsp, strip_zero_width, normalize_line_endings, clean_headers, repair_mojibake — last skipped if ftfy unavailable). - tests/test_io.py — adds CSV / TSV / semicolon / UTF-8-BOM round-trip tests + Excel auto-header-detection tests. - tests/test_normalizers.py — adds 8 tests for the alignment work above (phone extension, state names, particles). Adds .claude/ to .gitignore (agent worktrees + local settings). Full project suite: 1197 passed, 4 skipped, 17 xfailed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 02:11:57 +00:00
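The `detect_header_row` bug in this commit comes from `all([])` evaluating to `True`: a row with no non-empty cells passes any "every cell looks like a label" check vacuously. A minimal sketch of the guard, assuming a hypothetical `looks_like_header` helper and a simplified "labels are not numbers" rule (only the ≥2 non-empty cells requirement comes from the commit message):

```python
def _is_number(text):
    """True when the cell parses as a number (thousands commas tolerated)."""
    try:
        float(text.replace(",", ""))
        return True
    except ValueError:
        return False


def looks_like_header(cells):
    """Treat a row as a header only if it has >= 2 non-empty, non-numeric cells."""
    non_empty = [c.strip() for c in cells if c and c.strip()]
    if len(non_empty) < 2:
        # Without this guard an all-empty row yields all([]) == True.
        return False
    return all(not _is_number(c) for c in non_empty)
```

A metadata-only row such as `["Report generated 2026", "", ""]` has one non-empty cell and is rejected by the same guard.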
Michael	4adeb5c7f3	feat(format): per-cell standardizers + 199-row buyer corpus Adds src/core/format_standardize.py, a per-cell standardizer for dates, phones, emails, addresses, names, currencies, booleans, wired through StandardizeOptions / standardize_dataframe with FieldType registry. Includes: - Date parser handles ISO/US/EU/longform/excel-serial/unix-timestamp/partial-precision/quarter notation; opt-in French/German/Spanish month dictionaries via month_locales. - Phone via libphonenumber with extension preservation (;ext=N), 001 international prefix handling, error sentinels for placeholders / multi-number cells. - Email lowercase/trim/mailto/angle-bracket strip with optional --gmail-canonical mode. - Address USPS abbreviation expansion or compression (expand=False per corpus § 6.3), state-name → 2-letter conversion, multi-line collapse, PO Box normalization, state-code preservation regardless of input case. - Name handler: Mc/Mac/O'/D' inner caps, hyphen segments, particle lowercasing (von/van/de/da), comma-format reversal, period stripping for titles/suffixes/initials, PhD/MD acronym preservation, conservative mode for mixed-case input. - Currency: auto-detect EU vs US separators, space-thousands, Swiss apostrophe, accounting parens, optional ISO code preservation, error sentinels for percentages/ranges/word-values/ambiguous separators. - Per-domain error_policy ("passthrough" | "sentinel") for surfacing malformed values as <error: reason> per corpus § 0.3. Test corpus from Business/DataTools/test-cases-format-cleaner copied to test-cases/format-cleaner-corpus/: 7 fixtures plus FORMATS-CASES.md. tests/test_format_standardize_corpus.py drives all 199 rows through the per-cell standardizers; 0 xfailed. Wires the GUI page (3_Format_Standardizer.py) to "Ready" status. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 02:11:24 +00:00
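One way to auto-detect EU versus US separators, as the currency bullet in this commit describes, is to treat the rightmost separator as the decimal mark when both appear and to refuse to guess when a lone separator is followed by exactly three digits. The sketch below is an assumed heuristic, not the project's algorithm: `parse_amount` is a hypothetical name, and signs, accounting parentheses, space-thousands, and the Swiss apostrophe are deliberately left out.

```python
import re
from decimal import Decimal


def parse_amount(text):
    """Return a Decimal, or None when the separators are ambiguous."""
    s = re.sub(r"[^\d.,]", "", text)  # drop currency symbols, spaces, letters
    if not re.search(r"\d", s):
        return None
    last_dot, last_comma = s.rfind("."), s.rfind(",")
    if last_dot == -1 and last_comma == -1:
        return Decimal(s)
    if last_dot != -1 and last_comma != -1:
        # Both present: the rightmost one is assumed to be the decimal mark.
        decimal_sep = "." if last_dot > last_comma else ","
    else:
        sep = "." if last_dot != -1 else ","
        if s.count(sep) > 1:
            # A repeated separator can only be grouping ("1.234.567").
            return Decimal(s.replace(sep, ""))
        if len(s) - s.rfind(sep) - 1 == 3:
            return None  # "1,234" or "1.234": thousands or decimals? ambiguous
        decimal_sep = sep
    thousands_sep = "," if decimal_sep == "." else "."
    return Decimal(s.replace(thousands_sep, "").replace(decimal_sep, "."))
```

Returning `None` stands in for the error sentinel that the commit message says is emitted for ambiguous separators.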
Michael	82d7fef21e	feat(gate): CSV-normalization gate with confidence-tiered findings Adds a Review & Normalize page that sits between upload and every tool page. The analyzer now tags each finding with confidence (high/medium/low) and a fix_action; the gate auto-applies high-confidence fixes, surfaces medium/low ones for user review, and blocks tool pages on error-level findings until resolved or waived. Core (src/core/): - analyze.py: Finding gains confidence, fix_action, pre_applied; new detectors for encoding_uncertain, encoding_decode_failed; new top- level encoding_override parameter. - fixes.py: registry of fix algorithms keyed by fix_action id. - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and the NormalizationResult / Decision dataclasses the gate consumes. - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption) and normalizes line endings (fixes bare-CR parser crash); empty file handled gracefully instead of EmptyDataError traceback. GUI (src/gui/): - pages/0_Review.py: gate page with per-finding decision controls, encoding override picker (16 codepages + custom), and Advanced output options (encoding, delimiter, line terminator) on the download. - components.py: require_normalization_gate() helper. - pages/1-9: gate guard wired on every tool page. Test corpora: - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference UTF-8 files + manifest, synced from Business/DataTools. - test-cases/text-cleaner-corpus/test_data/17: synced malformed input (unquoted $1,500.00) for the unquoted-delimiter detector. Tests (94 new): - test_normalize.py (48): finding fields, fix registry, auto_fix scope, decision paths, gate idempotency, output-options helper. - test_encodings_corpus.py (90, 16 xfailed): parametric detection + decode + analyzer-no-crash sweep against the manifest. - test_analyze.py: encoding override + encoding_uncertain detectors. 
- test_corpus.py: pre-parse repair in the strict reader. run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate; encodings corpus added to --fixtures category. Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema, gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds the analyzer JSON schema with the new fields; README links to all of it. Suite: 765 passed, 17 xfailed (was 458 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 20:35:27 +00:00
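The gate's triage rule (auto-apply high-confidence fixes, surface the rest for review, block on unresolved error-level findings) could be sketched like this. The `Finding` shape is trimmed to the fields the rule needs, and `triage` plus the `waived` set are hypothetical names for illustration; the real API lives in `normalize.py` as `auto_fix()`, `apply_decisions()`, and `is_normalized()`, whose signatures the commit message does not show.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    id: str
    severity: str                     # "info" | "warn" | "error"
    confidence: str                   # "high" | "medium" | "low"
    fix_action: Optional[str] = None  # id of a registered fix, if any


def triage(findings, waived=frozenset()):
    """Split findings into auto-applied vs. review, and report whether to block."""
    auto, review = [], []
    for f in findings:
        if f.confidence == "high" and f.fix_action:
            auto.append(f)    # safe to apply without asking
        else:
            review.append(f)  # medium/low confidence: user decides
    # Assumption: an error-level finding blocks until it is fixed or waived.
    blocked = any(f.severity == "error" and f.id not in waived for f in review)
    return auto, review, blocked
```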
Michael	1049c033cb	feat(gui): visualize leading/trailing whitespace in analyzer findings The analyzer's "Run Analysis" panel rendered sample cells via st.dataframe, which (a) silently collapses leading/trailing ASCII whitespace and (b) displays NBSP/ZWSP/control chars as nothing. The user couldn't see the exact pollution they were being told about. visualize_hidden_html gains a mark_outer_whitespace=True option that wraps each leading and trailing ASCII space/tab in its own badge with a "SP LEAD" / "SP TRAIL" tooltip. The badges are per-character so the user can count exactly how much padding the cleaner will strip. components.render_findings_panel now: - injects hidden_char_css() once at the top of the panel - replaces st.dataframe(samples) with a custom HTML table - renders the value column with mark_outer_whitespace=True - applies white-space: pre-wrap on value cells so any internal ASCII whitespace also stays visible (browsers collapse runs by default) Four new tests cover: leading+trailing badge counts, default-off behaviour, leading tab badge, all-whitespace string treated entirely as leading. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:21:39 +00:00
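The per-character badge idea can be illustrated in plain text, without the HTML spans and tooltips the real `visualize_hidden_html` emits. This is a sketch under assumptions: the function name and bracketed output are made up for illustration, and the "TAB LEAD" label is a guess by analogy with the "SP LEAD" / "SP TRAIL" labels the commit names.

```python
_LABELS = {" ": "SP", "\t": "TAB"}  # "TAB" label is an assumption


def mark_outer_whitespace(s):
    """Replace each leading/trailing space or tab with its own visible badge."""
    without_lead = s.lstrip(" \t")
    n_lead = len(s) - len(without_lead)
    core = without_lead.rstrip(" \t")
    n_trail = len(without_lead) - len(core)
    # One badge per character so the reader can count the padding exactly.
    lead = "".join(f"[{_LABELS[c]} LEAD]" for c in s[:n_lead])
    trail = "".join(f"[{_LABELS[c]} TRAIL]" for c in s[len(s) - n_trail:])
    return lead + core + trail
```

Because leading whitespace is stripped first, an all-whitespace string is counted entirely as leading, matching the fourth test case the commit describes. Internal whitespace is left untouched.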
Michael	90ceada2d1	feat(text_clean): visualize hidden characters in the cleaner GUI The whole point of the cleaner is to remove characters the user can't see — which makes the "before / after" preview nearly useless by default. A cell with NBSP padding looks identical to a cell with regular spaces. Two new helpers in src.core.text_clean: visualize_hidden_text(s) Plain-text rendering: each invisible/control/smart character is replaced by a glyph + [LABEL] (e.g. "·[NBSP]", "→[TAB]", "∅[ZWSP]", """[L DQUOTE]"). Suitable for terminal output, CSV exports, anywhere HTML is wrong. Unmapped C0 controls render as [U+XXXX]. visualize_hidden_html(s) + hidden_char_css() HTML rendering: every flagged character is wrapped in a <span> with a CSS class and a tooltip showing the codepoint and label. Pair with hidden_char_css() to inject the matching styles. Three colour bands (whitespace, special, control) so the user can scan an audit table and spot what's being changed at a glance. Mapping covers: ASCII tab/LF/CR, every NBSP variant (U+00A0, U+202F, U+2009, …), zero-width family (ZWSP/ZWNJ/ZWJ/WJ/BOM/SHY), bidi marks (LRM/RLM), all smart quotes, en/em dashes, ellipsis, prime/double-prime, and guillemets. ASCII printable text passes through; HTML output also escapes &/</> . GUI wiring (src/gui/pages/2_Text_Cleaner.py) The "Examples" changes table now defaults to a hidden-char-rendered HTML view: every NBSP/ZWSP/smart-quote/control char is shown with its badge and codepoint tooltip. A "Show hidden characters" toggle lets the user fall back to the raw st.dataframe view if they prefer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:14:14 +00:00
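A stripped-down version of the plain-text renderer might look like the following. The three mapping entries reuse glyph/label pairs quoted in the commit message; the real table is far larger (it also maps LF, CR, smart quotes, dashes, and so on), and the function body here is an assumed reconstruction rather than the project's code.

```python
# Small illustrative subset of the mapping; the real one covers many more characters.
_HIDDEN = {
    "\u00a0": ("·", "NBSP"),
    "\t": ("→", "TAB"),
    "\u200b": ("∅", "ZWSP"),
}


def visualize_hidden_text(s):
    """Render invisible characters as glyph + [LABEL]; unmapped C0 as [U+XXXX]."""
    if s is None:
        return None
    out = []
    for ch in s:
        if ch in _HIDDEN:
            glyph, label = _HIDDEN[ch]
            out.append(f"{glyph}[{label}]")
        elif ord(ch) < 0x20:
            out.append(f"[U+{ord(ch):04X}]")  # unmapped C0 control character
        else:
            out.append(ch)                    # printable text passes through
    return "".join(out)
```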
Michael	8dfc6ad8ae	feat(analyze): add mixed_line_endings + near_duplicate_rows detectors Two more detectors close the analyzer gap list: mixed_line_endings (warn, tool=02): scans raw bytes for combinations of CRLF / LF / bare CR. Disaster pattern after multi-source concat (Windows + macOS + Linux exports stitched together). Operates on raw bytes only — DataFrame-mode analyze() skips it because raw bytes aren't available. _load_for_analysis now returns the raw bytes alongside the DataFrame and repair result so the detector has them. near_duplicate_rows (info, tool=01): cheap dedup signal — strip and lowercase every string column, then count df.duplicated(). Catches the most common case (same customer entered twice with subtle formatting differences) without paying for fuzzy matching. Anything more sophisticated stays in tool 01. Six new tests cover both detectors plus the dataframe-mode skip path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:09:42 +00:00
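Detecting mixed line endings on raw bytes comes down to counting CRLF pairs first and subtracting them from the bare LF and bare CR counts. A minimal sketch, with hypothetical function names (the commit only names the `mixed_line_endings` finding):

```python
def count_line_endings(raw: bytes):
    """Count CRLF, bare LF, and bare CR terminators in raw bytes."""
    crlf = raw.count(b"\r\n")
    lf = raw.count(b"\n") - crlf  # LFs that are not part of a CRLF
    cr = raw.count(b"\r") - crlf  # CRs that are not part of a CRLF
    return {"CRLF": crlf, "LF": lf, "CR": cr}


def has_mixed_line_endings(raw: bytes):
    """True when more than one terminator style appears in the same file."""
    return sum(1 for n in count_line_endings(raw).values() if n) > 1
```

This works only on the original bytes, which is why the commit notes that DataFrame-mode `analyze()` has to skip the detector.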
Michael	0671ef277e	feat(io): route read_file through pre-parse repair by default Previously only analyze() and direct read_csv_repaired() callers got the byte-level repair pass (BOM strip, NUL strip, smart-double-quote fold, unquoted-delimiter merge). The dedup CLI and any other read_file consumer silently missed it. read_file gains a repair=True default. CSV/TSV inputs run through repair_bytes before pandas sees them; Excel inputs still pass through unchanged. Chunked reads (chunk_size set) bypass repair because the pre-parse pass loads the whole file, preserving streaming behavior on huge files. Repair actions and unrepairable lines are logged at INFO/WARNING. cli_text_clean opts out (repair=False): the cleaner offers fine-grained control via --preset and per-op flags, and a byte-level smart-quote fold under the user's "minimal" preset would violate that contract. The cell-level cleaner does the equivalent work itself when its options ask for it. Tests: read_file default strips BOM and folds curly double quotes; repair=False preserves smart quotes; chunked reads still work and skip repair as documented. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:09:35 +00:00
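The routing rule in this commit (repair on by default, CSV/TSV only, skipped for chunked reads) reduces to a small predicate. `should_repair` is a hypothetical helper written for illustration; the commit message does not say how `read_file` structures this decision internally.

```python
from pathlib import Path

_REPAIRABLE_SUFFIXES = {".csv", ".tsv"}  # Excel inputs pass through unchanged


def should_repair(path, *, repair=True, chunk_size=None):
    """Decide whether a read should go through the byte-level repair pass."""
    if not repair:
        return False  # caller opted out, as cli_text_clean does
    if chunk_size is not None:
        return False  # repair loads the whole file; keep chunked reads streaming
    return Path(path).suffix.lower() in _REPAIRABLE_SUFFIXES
```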
Michael	0b959dee93	feat(text_clean): preserve internal whitespace in numeric/date/phone cells Closes the §4.17 spec gap that test_gap_coverage.py was tracking via xfail: collapse_whitespace must NOT touch cells whose shape carries meaningful internal whitespace. Adds _looks_structured(s), which returns True when s matches: - numeric (currency optional, thousand-grouping by , . or single space) - date (ISO/slash/dot separator, or 'Mon DD YYYY' / 'DD Mon YYYY') - phone (digits + parens/dots/dashes/+/spaces, >= 7 digits, no letters) The pipeline uses a new _smart_collapse_whitespace wrapper that defers to collapse_whitespace only when _looks_structured returns False. The raw collapse_whitespace function is unchanged so direct callers and existing unit tests remain valid. Five new positive tests replace the xfail: - "(555)  123-4567" preserved (phone, double space inside) - "1 234" preserved (European thousands) - "2024-01-15" preserved (ISO date) - "Jan 15 2024" preserved (textual date) - "hello  world" still collapsed to "hello world" (free-text negative case) Conservative on purpose: a false negative just collapses (existing behavior); a false positive leaves intentional double spaces in prose. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:09:25 +00:00
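The shape check and its wrapper could be approximated as below. The regular expressions are illustrative guesses at the three shapes the commit lists, not the project's patterns; textual dates ('Mon DD YYYY') are omitted for brevity, and the `re.sub` call is a stand-in for the real `collapse_whitespace`.

```python
import re

# Assumed patterns: grouped or plain numbers with an optional currency sign,
# and year-first dates with -, / or . separators.
_NUMERIC = re.compile(
    r"^[$€£]?\d{1,3}(?:[,. ]\d{3})*(?:[.,]\d+)?$|^[$€£]?\d+(?:[.,]\d+)?$"
)
_ISO_DATE = re.compile(r"^\d{4}[-/.]\d{1,2}[-/.]\d{1,2}$")
_PHONE_CHARS = re.compile(r"^[\d\s().+\-]+$")  # no letters allowed


def looks_structured(s):
    """True when the cell looks numeric, date-like, or phone-like."""
    s = s.strip()
    if _NUMERIC.match(s) or _ISO_DATE.match(s):
        return True
    return bool(_PHONE_CHARS.match(s)) and sum(c.isdigit() for c in s) >= 7


def smart_collapse_whitespace(s):
    """Collapse whitespace runs only in cells that are not structured."""
    if looks_structured(s):
        return s
    return re.sub(r"\s+", " ", s)
```

Keeping the raw collapse function separate from the wrapper mirrors the commit's design: direct callers of the unconditional version see no behaviour change.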
Michael	4687cf87b4	test: single-command runner, cross-platform automation, fixture auto-discovery Adds a top-level test infrastructure layer addressing four needs at once: a single command to run anything, cross-platform automation, install/e2e sanity, and zero-config pickup of new fixtures dropped into test-cases/. Top-level runner — run_tests.py python run_tests.py # everything (default) python run_tests.py --tool dedup # one tool's tests python run_tests.py --unit # category scopes python run_tests.py --e2e # end-to-end CLI python run_tests.py --install # import / dependency sanity python run_tests.py --fixtures # corpus + dropped-file sweep python run_tests.py --coverage # term-missing report python run_tests.py --quick # skip @pytest.mark.slow Tools: analyze, cli, config, dedup, io, normalizers, text_clean. Cross-platform — tox.ini Envs for py310-py313 plus install / e2e / fixtures / coverage / lint. Forces UTF-8 (PYTHONUTF8=1, PYTHONIOENCODING=utf-8) so identical fixture bytes parse the same on Linux/macOS/Windows. Shared config — pytest.ini testpaths, python_files conventions, custom markers (slow, e2e, install, fixture_sweep), warning filters that fail on our own DeprecationWarnings while tolerating third-party ones. New test layers tests/test_install.py — required deps import; project modules import; src.core public API surface; CLI --help exits 0; streamlit app.py parses as valid Python; run_tests.py --help works. tests/test_e2e.py — CLI roundtrips: cli_analyze table + JSON, cli_text_clean --apply writes a real file with NBSP/smart-quote folded, dedup CLI removes duplicates, run_tests.py self-tests. tests/test_fixtures_sweep.py — parametrizes over every CSV/TSV/XLSX inside test-cases/ (excluding text-cleaner-corpus/, which has its own suite). Each fixture must: load through repair_bytes, run analyze() cleanly, and survive clean_dataframe() with row/col counts unchanged plus idempotency. Drop a CSV in, re-run — no test code changes needed. 
tests/test_gap_coverage.py — closes audit gaps: clean_headers=False toggle, repair_bytes with tab/semicolon delimiters, BOM+NUL+smart- quote combined-fix scenario, analyze() over an XLSX path, sample_rows larger than the DataFrame, mid-cell BOM, findings_by_tool edges, plus a strict xfail documenting the known §4.17 numeric/phone whitespace heuristic gap. Test count Before: 288 passed + 1 xfailed After: 475 passed + 2 xfailed (the second xfail is the documented collapse_whitespace gap on phone-shaped cells; spec §4.17 calls for a heuristic that hasn't been implemented yet). Functional gaps surfaced (not fixed in this commit): - Text cleaner: collapse_whitespace runs unconditionally on every string cell, including phone/numeric/date-shaped ones. Spec §4.17 requires a skip heuristic. Captured as strict xfail so the gap stays visible. - io.read_file does not run pre-parse repair; only analyze() and direct callers of read_csv_repaired() get it. CLI tool pages and the dedup CLI miss the safety net. - Analyzer has no mixed_line_endings detector or near_duplicate_rows detector; both planned but require additional plumbing. - GUI tool pages each have their own uploader instead of picking up the home-page upload through session_state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:01:06 +00:00
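The "drop a CSV in, re-run" behaviour of the fixture sweep rests on discovering files at collection time. A sketch of such a discovery function, assuming a hypothetical `discover_fixtures` helper (the suffix set and the excluded directory come from the commit message; everything else is illustrative):

```python
from pathlib import Path

FIXTURE_SUFFIXES = {".csv", ".tsv", ".xlsx"}
EXCLUDED_DIRS = {"text-cleaner-corpus"}  # that corpus has its own suite


def discover_fixtures(root):
    """Return every CSV/TSV/XLSX under root, skipping excluded directories."""
    root = Path(root)
    found = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in FIXTURE_SUFFIXES:
            continue
        if EXCLUDED_DIRS & set(path.relative_to(root).parts):
            continue  # the file lives somewhere under an excluded directory
        found.append(path)
    return found
```

The returned list can be handed to `pytest.mark.parametrize("fixture", discover_fixtures("test-cases"), ids=str)` so that each file becomes its own test case without any code change.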
Michael	5c62fb6117	feat(cli): src.cli_analyze — Typer CLI for the analyzer python -m src.cli_analyze input.csv # rich table per tool python -m src.cli_analyze input.csv --json # array of finding dicts python -m src.cli_analyze input.csv --strict # exit 1 on warn/error python -m src.cli_analyze input.csv -n 50000 # cap rows scanned Findings are grouped by destination tool so the user can see at a glance which tool to open next. Read-only; exit code 0 unless --strict is set. The CLI keeps its own tool-id -> display-name map so it doesn't depend on the GUI module. 7 tests cover: clean-file passthrough, dirty-file table, --json round-trip, missing-file (exit 2), --strict exit code, --sample-rows cap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:53:11 +00:00
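The exit-code contract described here (0 unless `--strict` is set and a warn/error finding exists) fits in a few lines. The helper name and the severity strings are assumptions; the missing-file case, which the commit says exits with 2, is outside this sketch.

```python
def exit_code(severities, *, strict=False):
    """Return 1 only when --strict is set and a warn/error finding exists."""
    if strict and any(s in {"warn", "error"} for s in severities):
        return 1
    return 0  # read-only advisory scan: success by default
```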
Michael	edf6ccf90b	feat(analyze): upload-time data quality analyzer Pure, advisory scan over an uploaded file or DataFrame that returns a list of Finding objects naming each issue, the affected count, and which downstream tool can fix it. The GUI uses this to badge tool nav items at upload; the CLI will print findings as a table or JSON. src/core/analyze.py: Finding dataclass (id, severity, tool, count, description, column, samples) analyze(source, *, sample_rows=1000, repair_result=None) -> list[Finding] - source: DataFrame, path, or str. Path scans first 1000 rows. - When source is a path, runs the same pre-parse repair the tool pages will use; the resulting RepairResult is auto-surfaced as csv_* findings. A caller-supplied repair_result wins so non-default repair flags are respected. Detectors (each independent, samples capped at 5): - smart_punctuation_in_data -> 02 - nbsp_or_unicode_whitespace -> 02 - zero_width_or_invisible -> 02 - dirty_column_headers -> 02 - whitespace_padding -> 02 - null_like_sentinels -> 04 - suspected_mojibake -> 02 (Tier 2) - mixed_case_email_column -> 02 case op - leading_zero_ids -> informational, no tool Helpers: findings_by_tool() for sidebar grouping, to_dict() for JSON. Detectors are decoupled from the GUI display layer; they emit stable tool ids ("02_text_cleaner") and the GUI maps those to display names. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:41:36 +00:00
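The `Finding` dataclass and its two helpers can be sketched from the field list in this commit. The field types, the defaults, and the choice to make `to_dict` a method are assumptions; the commit message gives only the field names and the helper names.

```python
from collections import defaultdict
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Finding:
    id: str
    severity: str
    tool: Optional[str]  # stable tool id such as "02_text_cleaner", or None
    count: int
    description: str
    column: Optional[str] = None
    samples: list = field(default_factory=list)  # the commit caps these at 5

    def to_dict(self):
        """JSON-friendly representation for the CLI's --json output."""
        return asdict(self)


def findings_by_tool(findings):
    """Group findings by destination tool id for sidebar badging."""
    grouped = defaultdict(list)
    for f in findings:
        grouped[f.tool].append(f)
    return dict(grouped)
```

Emitting tool ids rather than display names keeps the detectors independent of the GUI, which is the decoupling the commit message calls out.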
Michael	b8a9fa1b09	feat(io): pre-parse CSV repair (BOM/NUL/smart-quotes/unquoted-delim) Some pollution patterns block pandas before the cell-level cleaner can run. Add a pre-parse pass on raw bytes that fixes only what breaks parsing, and returns a structured action log the GUI/CLI can surface to the user. repair_bytes(raw, *, encoding, delimiter, fold_quotes, strip_nul, repair_delims): 1. Strip leading UTF-8 BOM. 2. Strip embedded NUL bytes (the C parser truncates fields at NUL). 3. Fold smart double quotes (curly, guillemet, double-prime) to ASCII '"'. Curly singles are NOT folded here; they don't conflict with CSV and the cell-level cleaner handles them more accurately. 4. Per-row repair when one rogue delimiter is embedded in a field that looks like currency or thousands-grouped digits. Tiered scoring keeps " $1,500.00 ,7" unambiguous: the strict currency regex match wins over the loose digit/sigil heuristic. read_csv_repaired(path) -> (DataFrame, RepairResult). RepairResult exposes .actions, .unrepairable_lines, and a summary() grouped by kind. Out of scope for this pass: encoding repair, delimiter conversion, multi-delimiter merges (k>1), which are logged as unrepairable so callers can see what was left alone instead of silently parsing wrong. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:37:49 +00:00
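Steps 1 to 3 of the repair pass can be sketched as follows. This is a simplified illustration: the function name, the action labels, and the return shape are assumptions (the real `repair_bytes` takes more flags and returns a `RepairResult`), and step 4, the per-row delimiter repair, is not attempted here.

```python
BOM = b"\xef\xbb\xbf"
# Curly doubles, guillemets, double prime. Curly singles are deliberately absent.
_SMART_DOUBLE_QUOTES = ["\u201c", "\u201d", "\u00ab", "\u00bb", "\u2033"]


def repair_bytes_sketch(raw, *, encoding="utf-8"):
    """Apply BOM strip, NUL strip, and smart-double-quote fold; log each action."""
    actions = []
    if raw.startswith(BOM):
        raw = raw[len(BOM):]
        actions.append("strip_bom")
    if b"\x00" in raw:
        raw = raw.replace(b"\x00", b"")  # the C parser truncates fields at NUL
        actions.append("strip_nul")
    text = raw.decode(encoding)
    folded = text
    for quote in _SMART_DOUBLE_QUOTES:
        folded = folded.replace(quote, '"')
    if folded != text:
        actions.append("fold_quotes")
    return folded.encode(encoding), actions
```

Returning the action list alongside the bytes is what lets the GUI or CLI tell the user exactly what was changed before parsing.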
Michael	c349a90e18	test: add text-cleaner corpus and close gaps surfaced by it The 21-fixture corpus (test-cases/text-cleaner-corpus/) exercises the cleaner end-to-end against the spec in TEST-CASES.md. Closing the failing cases drove five small cleaner fixes plus two fixture-generation fixes: - _SMART_CHARS: add prime, double prime, guillemets (case 03) - _ZERO_WIDTH: add soft hyphen U+00AD (case 05) - clean_dataframe: clean column headers via the same pipeline (cases 16/19/20), with a clean_headers toggle on CleanOptions - smart_title_case: title-case full-shout strings ("ALICE SMITH" -> "Alice Smith") while still preserving embedded acronyms; preserve uppercase after apostrophe in names ("O'CONNOR" -> "O'Connor", "o'neil" -> "O'neil") - test_corpus.py reader: pre-strip NUL bytes (C parser truncates at NUL, python engine is too strict about embedded literal "), per spec case 06 - generate_test_data.py: properly CSV-escape literal-quote cells in case 03 expected; quote the rogue-comma price field in case 17 input Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:37:35 +00:00
Michael	54f92ae47e	feat: implement text cleaner (script 02) with CLI, GUI, and tests Builds 02_text_cleaner.py from stub to working: character-level hygiene for CSV/Excel inputs covering trim, whitespace collapse, smart-character folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char strip, line-ending normalization, and per-column case conversion. Three presets (minimal/excel-hygiene/paranoid) keep the buyer surface small. - src/core/text_clean.py: pure helpers + CleanOptions/CleanResult + clean_dataframe with dtype-safe column selection - src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape (dry-run by default, --apply writes cleaned + changes audit, JSON config save/load) - src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset picker, advanced toggles, preview, before/after metrics, and three download buttons - tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests covering edge cases E1-E50 from the spec - samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10 in 10 rows - test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case fixtures Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7 entry locking the spec, CLI-REFERENCE.md gains the text cleaner section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md status row 02 promoted Skeleton -> Working. 200/200 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:14:15 +00:00
Michael	b871ab24fc	feat: add documentation, Streamlit GUI, and full source tree - Rewrite README.md with project overview, quick-start, and CLI summary - Add docs/CLI-REFERENCE.md with full flag reference and 8 recipe sections - Add docs/DEVELOPER.md with architecture, data flow, and extension guides - Rewrite src/core/__init__.py with public API exports and module docstring - Add Streamlit GUI (src/gui/) with file upload, advanced options, interactive match group review with side-by-side diff, and download buttons - Add .gitignore, requirements.txt, all source code, tests, and sample data - Add streamlit to requirements.txt Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-28 23:06:39 +00:00

33 Commits