datatools-dev

Author	SHA1	Message	Date
Michael	27f0648093	fix(text-cleaner): make all three download buttons actually fire Only "Download cleaned CSV" was working; "Download changes audit" and "Download config JSON" did nothing on click. The symptom is the classic Streamlit footgun for multiple ``st.download_button`` widgets in adjacent columns: without an explicit ``key`` argument the auto-derived widget IDs can collide, especially when one button is conditionally rendered, and only the first button in source order actually fires on click. Same goes for unstable ``data`` bytes recomputed inside the ``with col:`` block — the widget identity can drift between renders. Robustness pattern applied: - Compute all three byte buffers up front, outside the columns, so the ``data`` parameter is the same object across reruns. - Pass an explicit unique ``key`` ("textclean_dl_cleaned" / "textclean_dl_changes" / "textclean_dl_config") to each button. - Render the changes button unconditionally with ``disabled=True`` and a help tooltip when ``result.changes.empty`` — instead of hiding it. Layout stays steady and the empty case is self-explanatory. - ``use_container_width=True`` so the three buttons size identically inside their columns. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 20:56:52 +00:00
Michael	0a61d52200	feat(text-cleaner): collapse options + auto-scroll to Results on run After clicking Clean Text the user was left at the bottom of the script with the Options block still expanded and no viewport movement — they had to scroll to find the Results. - Wrap the whole Options block in an outer ``st.expander("Options", expanded=not _has_result)``. After the Clean Text rerun, both Preview AND Options collapse, leaving the primary action button + Results as the only prominent elements above the fold. The inner Advanced-options expander is preserved as a nested expander (supported in Streamlit 1.36+; this repo pins 1.35+). - Add a 1px anchor div ``#textclean-results-anchor`` immediately before the Results subheader. - On Clean Text click, set a one-shot ``_textclean_scroll_to_results`` flag in session state; on the next render, pop the flag and inject a tiny ``st.components.v1.html`` iframe whose ``<script>`` calls ``scrollIntoView`` on the parent document's anchor. One-shot so re-renders triggered by other widgets (Show-hidden toggle, etc.) don't jerk the viewport back to the top of Results. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 20:50:43 +00:00
Michael	ca14ce2952	feat(text-cleaner): collapse preview on run + full hidden-char audit Two small UX fixes on the Clean Text page: 1. The input preview is now wrapped in an ``st.expander`` whose default-expanded state is ``not has_result``. Clicking the "Clean Text" primary button stashes the result and calls ``st.rerun()`` so the next pass sees the result in session state and the expander folds — the Results section becomes the primary visual focus. User can re-expand manually to re-inspect the source. 2. The Examples (changes audit) table's Before/After columns were calling ``visualize_hidden_html`` WITHOUT ``mark_outer_whitespace``, so leading/trailing whitespace — which is exactly what the cleaner most often removes — was invisible. Pass ``mark_outer_whitespace=True`` to match the input-preview rendering. Column-name cell now mirrors that flag too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 20:43:52 +00:00
Michael	502a72cd46	feat(nav): ← Back to Home link on every tool page Multi-file workflow: a user uploads several files on Home, clicks "Open <Tool>" on one file's findings, lands on a tool page. The sidebar lets them get back to Home, but a top-of-page back affordance is more discoverable and keeps the hand in the same screen region as the upload list they're working through. - New ``back_to_home_link()`` helper in components/_legacy.py renders a secondary button that calls ``st.switch_page("app.py")`` — under ``st.navigation`` that routes to the default (Home) page. - Wired into every tool page (1-9) directly after ``hide_streamlit_chrome()`` and BEFORE the license gate so a Lite user who lands on a locked tool can navigate away without paying. - New i18n key ``nav.back_to_home`` ("← Back to Home" / "← Volver al inicio") in en/es packs. 2008 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 20:38:01 +00:00
Michael	c568aec8a7	feat(gui): one-click Close in its own bottom sidebar section Close is now a direct shutdown trigger: visiting the Close page (the sidebar entry) fires shutdown_app() immediately — no confirm step, no intermediate body. The farewell overlay paints and os._exit(0) lands ~1s later from a daemon thread. Layout: Close moved into its own bottom-of-sidebar section so the destructive action is visually separated from Account/Activate. - New shutdown_app() in components/_legacy.py replaces quit_button. os._exit thread is skipped when "pytest" is in sys.modules so the test suite doesn't suicide on rendering 99_Close. - pages/99_Close.py shrinks to set_page_config + chrome + shutdown_app. - app.py nav grows a new "Close" section header (new nav.section_close key in en/es packs) pinned at the bottom of the navigation dict. Tests updated: - TestQuitButtonRenders → TestClosePageShutsDownImmediately. Assert the shutdown caption renders + no confirm button exists. - test_smoke EXPECTED_SUBSTRINGS["99_Close"] now pins "Shutting down" / "Cerrando" (the visible page body) instead of the removed page title. 2008 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 20:17:14 +00:00
Michael	dad744f17f	refactor(gui): drop Review page + normalization gate Home is now the only entry point: the "Run analysis" button on the upload section IS the review step (findings render inline via render_findings_panel). Tool pages no longer gate on a passed normalization — running the analyzer is sufficient context. Removed: - src/gui/pages/0_Review.py - src/gui/components/gate.py (re-export seam) - require_normalization_gate() in src/gui/components/_legacy.py - "review" section enum in tools_registry.py - Data Review entry in app.py navigation - require_normalization_gate() calls + imports in all nine tool pages - tests/gui/test_gate.py (whole file) - TestReviewWorkflow in tests/gui/test_workflows.py - 0_Review entry in tests/gui/test_smoke.py PAGE_SLUGS - stash_upload's normalization_result+normalization_for stashing - stash_upload_without_gate (was the gate's negative-path helper) 2017 tests pass (16 retired with the gate flow). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 20:04:33 +00:00
Michael	fc6c22c6a7	feat(review): inline file uploader instead of redirect home When a user lands on Review without an upload, show a file uploader on the page itself and auto-run the analyzer once a file is picked, rather than bouncing them to the home page with a "Back to home" button. Auto-analyze is the right default here: the user is already on the Review page, so they've implicitly committed to a scan. Stashing the bytes in the same session-state keys the home page uses keeps the rest of the flow (encoding picker, gate, tool pages) unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 19:57:01 +00:00
Michael	db5ec084da	docs+code: rename tool labels everywhere Sweep follow-up to `93e43fc`. Display labels now consistent across docs, landing pages, CLI output, code comments, docstrings, and test prose. Five parallel surfaces touched: - docs (EN + ES): README, USER-GUIDE, CLI-REFERENCE, and 11 internal design/planning docs - landing pages: index + bookkeeper/revops/shopify-pet - src: CLI module docstrings, _TOOL_DISPLAY dicts in cli_analyze.py and gui/components/_legacy.py, core module headers, every tool page's module docstring - tests: class/method/module docstrings and section-header comments - test-cases READMEs Page slugs (1_Deduplicator etc.), tool_id strings (01_deduplicator etc.), Python class names (TestDeduplicatorWorkflow, FeatureFlag.*), URL paths, anchor IDs, CSS classes, and asset filenames were left intact since they're code identifiers / structural references. All 2033 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 19:50:09 +00:00
Michael	93e43fc0d9	feat(gui): sidebar sections + non-technical tool labels Sidebar nav now groups tools under Data Review / Data Cleaners / Transformations / Automations via st.navigation, replacing the flat auto-discovered list. Tool display names switch to action-first phrasing (Find Duplicates, Fix Missing Values, Find Unusual Values, Standardize Formats, Clean Text, Quality Check, Map Columns, Combine Files, Automated Workflows) in EN + ES packs and on each page's H1. The Data Cleaners section follows the requested order: Missing Values → Outliers → Text Cleaner → Format Standardizer → Deduplicator → Quality Check. (Text Cleaner kept inside cleaners since the request didn't list it but the tool still ships.) Registry now carries a section field; helpers added: tools_in_section(), section_label(). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 19:36:01 +00:00
Michael	d32b58e61a	feat(license): add Lite SKU; remove user-facing free trial Two coupled changes: 1. Lite tier - New Tier.LITE in src/license/schema.py. - FEATURES_BY_TIER[Tier.LITE] = {Deduplicator, Text Cleaner, Format Standardizer}. The three universally-useful tools that cover the most common bookkeeping / RevOps / Klaviyo prep workflows. Other six tools require Core. - i18n: license.tier_lite, license.feature_locked_title, license.feature_locked_body, license.upgrade_link, license.status_locked (en + es). - Per-tool feature gate at every GUI tool page (require_feature_or_render_upgrade) and every tool CLI (guard(feature=...)). A locked tool renders an upgrade prompt + Manage-license button (GUI) or exits with code 2 (CLI). - Home grid: tool cards the user's tier doesn't unlock get a red 🔒 Locked badge in place of green Ready. 2. Trial removed - Activation form's "Start 1-year trial" button removed. - license_cli's `trial` subcommand removed. - activation.trial_button / activation.trial_help i18n keys dropped (pack parity test stays green). - Tier.TRIAL stays in the enum (back-compat with any field- tested trial licenses); LicenseManager._mint stays internal for tests and the seller's key generator. - Decision logged in DECISIONS §9b: a 1-year all-features trial undercuts paid Lite; paid-only keeps tier economics clean. Tests (+29 net): +17 Lite-tier unit/guard tests + 13 Lite-tier GUI tests + 1 trial-absent assertion - 2 trial CLI tests - 1 trial GUI button test. Total: 1995 → 2024. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 17:19:30 +00:00
Michael	e435103113	feat(license): registration + 1-year licenses + tier scaffolding A complete offline licensing layer (no internet at any step): Core - src/license/ — schema (License, Tier, FeatureFlag), HMAC crypto, JSON storage, LicenseManager singleton with activate/renew/ deactivate/issue_trial. Tier-scaffolded so future SKUs can carve per-tool feature sets without consumer-code edits. - scripts/generate_license.py — creator-only key generator. Mints a DTLIC1: blob the buyer pastes into the activation page. GUI - New activation form component (src/gui/components/activation.py). - hide_streamlit_chrome() now inline-renders the activation form when no valid license is present (every page short-circuits to the form until activated). - Sidebar shows tier + days remaining; renewal warning under 30 days. - New pages/_Activate.py for revisiting the form after activation. CLI - src/license_cli.py — activate / renew / status / trial / deactivate commands. Exempt from the guard. - src/cli_license_guard.py — drop-in guard call added to every tool CLI's main(). Lets --help through; respects DATATOOLS_DEV_MODE. i18n - New activation.* and license.* keys in en.json + es.json (page title, form labels, status badges, renewal warnings, error messages). Pack parity test stays green. Test infrastructure - tests/conftest.py autouse fixture sets DATATOOLS_DEV_MODE=1 so the existing 1916 tests continue to pass. - isolated_license_path / activated_license_manager / unactivated_license_manager fixtures for tests that want to drive the real check. Tests (+79) - tests/test_license.py (40): schema, crypto roundtrip, blob encode/decode, tier→feature mapping, activation flow, name/email mismatch rejection, tamper detection, expiration, renewal, dev-mode bypass. - tests/test_license_cli.py (26): every license_cli command + subprocess tests confirming every tool CLI refuses to run without a license, --help always works, DEV_MODE bypasses. - tests/gui/test_activation.py (13): gate blocks without license, passes with trial, activation form submission unlocks the gate, sidebar status, renewal warning, i18n. Total: 1916 → 1995 tests. All pass under the strict warning filter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 16:54:23 +00:00
Michael	c4ce86bd64	feat(i18n): add language-pack scaffold with English and Spanish Introduces ``src/i18n`` with a tiny JSON-backed t() lookup, an in-session language preference, and a sidebar selector wired through ``hide_streamlit_chrome`` so every page picks up the same picker. Covers home, tool cards, findings panel, gate, shutdown, and pickup banner strings. Tests pin pack parity and the farewell-overlay JS escape so future packs can't silently regress. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 15:11:30 +00:00
Michael	ea89c4d399	ui(gui): say 'window' instead of 'browser tab' in shutdown copy Update the Close page intro, the shutdown overlay, and the toast so they all read "you can close this window" — clearer for users running the app in a dedicated browser window rather than a tab. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 13:51:32 +00:00
Michael	340614e642	feat(gui): promote Quit to a 'Close' menu item in the sidebar nav Move the shutdown control out of the inline sidebar widget and into its own page (pages/99_Close.py), so it appears in the sidebar nav alongside the tool pages. An explicit confirm button on the page prevents accidental nav clicks from killing a live session. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 13:38:02 +00:00
Michael	966af8ef94	feat: 3 new tools, format streaming, distribution-ready demo + landing pages Tools shipped this batch (4 → 6 of 9 Ready): 04 Missing Value Handler src/core/missing.py + cli_missing.py + GUI 05 Column Mapper src/core/column_mapper.py + cli_column_map.py + GUI 09 Pipeline Runner src/core/pipeline.py + cli_pipeline.py + GUI with soft tool-dependency graph (recommended, not enforced) and JSON save/load for repeatable weekly cleanups. Format Standardizer reworked for 1 GB international files: • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email • Per-row country / address columns drive parsing • Audit cap (default 10 k rows, ~50 MB RAM) • standardize_file(): chunked streaming entry point (~165 k rows/sec) • currency_decimal="auto" for EU comma-decimal locales • R$ / kr / zł multi-char currency prefixes • cli_format.py with auto-stream above 100 MB inputs Encoding detection arbiter + language-aware probe: Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM) via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes. Distribution-readiness assets: • streamlit_app.py — Streamlit Community Cloud entry shim • src/gui/app_demo.py — single-page demo, ?p=<persona> routing, 100-row cap + watermark, free-vs-paid boundary enforced at surface • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs • landing/ — 4 static HTML pages (apex chooser + 3 niche), shared CSS, deploy.py URL-substitution script, auto-generated robots.txt + sitemap.xml + 404.html + favicon • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md — full strategy + measurement + deployment + master checklist Test counts: before: 1,520 passed · 4 skipped · 17 xfailed after: 1,729 passed · 0 skipped · 0 xfailed Tier-1 corpora added: • missing-corpus 3 use cases + 16 edge cases • column-mapper-corpus 3 use cases + 5 edge cases • format-cleaner intl 20-row 13-country stress fixture Engine hardening flushed out by the corpora: • interpolate guards against object-dtype columns • mean/median skip all-NaN columns (silences numpy warning) • fillna runs under future.no_silent_downcasting (silences pandas warning) • mojibake test no longer skips when ftfy installed (monkeypatch path) • drop-row threshold semantics: strict-greater (consistent across rows / cols) • currency_decimal validator allow-set updated for "auto" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 22:31:26 +00:00
Michael	26b9771625	feat(errors): structured error hierarchy + helpful messages everywhere Introduces src/core/errors.py with a small structured error hierarchy that every public entry point now uses. Each error carries the context a user needs to fix it and the context a maintainer needs to trace it. The hierarchy: DataToolsError (base — formats path, column, operation, suggestion) InputValidationError (extends ValueError — bad arg / wrong type) ConfigError (extends ValueError — bad config / options) FileFormatError (extends ValueError — file is not what we expected) FileAccessError (extends OSError — file I/O failure) Subclassing the stdlib bases means existing `except OSError` / `except ValueError` handlers still catch them — no breaking change. Helpers: - ensure_dataframe(value, function=...) — uniform DataFrame guard - ensure_choice(value, name=, choices=) — uniform enum/literal guard - wrap_file_read(path, op, exc) — tag OSError with hint + path - wrap_file_write(path, op, exc) — same, with Windows-aware tip - format_for_user(exc, context=) — user-facing string for st.error / stderr Library hardening: - io.read_file: missing files surface FileAccessError listing whether the parent directory exists, and the suggestion to check the path. - io.read_file: chunk_size <= 0 now raises InputValidationError with a positive-integer suggestion. - io._read_excel: openpyxl BadZipFile / InvalidFileException / pandas ValueError ("sheet not found") wrapped as FileFormatError listing the path and a "list sheets with list_sheets()" hint. - io._detect_excel_header_row: bare except narrowed to specific openpyxl exceptions; falls back gracefully and logs at debug so the real error surfaces from pd.read_excel. - io.write_file: OSError / PermissionError on to_csv/to_excel wrapped with file path and Windows-aware "file may be open in another program" hint. - dedup._parse_date: bare `except Exception` narrowed to (TypeError, ValueError, OutOfBoundsDatetime); failed values logged at debug for survivor-selection forensics. - dedup._select_survivor: KEEP_MOST_RECENT now raises InputValidationError instead of silently falling back to keep_first. - dedup.deduplicate: input validation errors are InputValidationError with operation/column/suggestion fields. - format_standardize.from_dict: invalid FieldType for a column raises ConfigError naming the column AND the bad value AND listing valid values; same for date_order / phone_format / etc. - format_standardize.from_file: OSError / JSON decode wrapped with path AND line/column where parsing failed. - format_standardize.to_file: TypeError on json.dumps wrapped as ConfigError with the suspected source (extra_abbreviations). - format_standardize._apply_field_type: dispatcher's "unknown field type" branch now raises AssertionError (it's an internal invariant, not user error — a new enum value was added without a branch). - format_standardize._resolve_column_types: missing-column error now InputValidationError with a "check for typos / unparsed header" suggestion. - format_standardize.standardize_dataframe: ensure_dataframe at entry. - text_clean.clean_dataframe: ensure_dataframe at entry. - config.to_strategies: invalid Algorithm/NormalizerType wrapped as ConfigError naming the strategy index AND the column. - config.to_survivor_rule: invalid SurvivorRule wrapped as ConfigError listing valid values. - config.from_file: OSError / JSON decode wrapped (mirror of StandardizeOptions.from_file). - fixes.repair_mojibake: ImportError on ftfy now logged at info level with the underlying ImportError so a corrupt-package vs not-installed distinction is visible in the logs. - normalizers.normalize_phone: phonenumbers.NumberParseException now logged at debug when the digits-only fallback drops extension / country-code information — gives a trail when matching results look wrong. GUI / CLI surfaces: - All 9 page handlers (`except Exception as e: st.error(...)`) now use format_for_user(), which renders DataToolsError fields nicely and falls back to "ClassName: message" for unrecognized errors. - 2_Text_Cleaner and 3_Format_Standardizer additionally distinguish UnicodeDecodeError with an "re-save as UTF-8" suggestion before the generic handler. - cli.py's "Error reading file" handler now uses format_for_user() and includes the input path in the prefix. Tests: - tests/test_errors.py — 22 new tests covering: base class formatting, stdlib inheritance, ensure_dataframe / ensure_choice helpers, wrap_file_read / wrap_file_write, format_for_user behavior, and end-to-end integration (missing file, missing dir, bad JSON, bad algorithm, bad enum, missing column). - tests/test_audit_fixes.py + tests/test_io.py — updated 4 tests for the new exception types (InputValidationError replaces TypeError, FileAccessError extends OSError). Full project suite: 1230 passed, 4 skipped, 17 xfailed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 02:35:42 +00:00
Michael	4adeb5c7f3	feat(format): per-cell standardizers + 199-row buyer corpus Adds src/core/format_standardize.py — a per-cell standardizer for dates, phones, emails, addresses, names, currencies, booleans — wired through StandardizeOptions / standardize_dataframe with FieldType registry. Includes: - Date parser handles ISO/US/EU/longform/excel-serial/unix-timestamp/ partial-precision/quarter notation; opt-in French/German/Spanish month dictionaries via month_locales. - Phone via libphonenumber with extension preservation (;ext=N), 001 international prefix handling, error sentinels for placeholders / multi-number cells. - Email lowercase/trim/mailto/angle-bracket strip with optional --gmail-canonical mode. - Address USPS abbreviation expansion or compression (expand=False per corpus § 6.3), state-name → 2-letter conversion, multi-line collapse, PO Box normalization, state-code preservation regardless of input case. - Name handler: Mc/Mac/O'/D' inner caps, hyphen segments, particle lowercasing (von/van/de/da), comma-format reversal, period stripping for titles/suffixes/initials, PhD/MD acronym preservation, conservative mode for mixed-case input. - Currency: auto-detect EU vs US separators, space-thousands, Swiss apostrophe, accounting parens, optional ISO code preservation, error sentinels for percentages/ranges/word-values/ambiguous separators. - Per-domain error_policy ("passthrough" \| "sentinel") for surfacing malformed values as <error: reason> per corpus § 0.3. Test corpus from Business/DataTools/test-cases-format-cleaner copied to test-cases/format-cleaner-corpus/ — 7 fixtures plus FORMATS-CASES.md. tests/test_format_standardize_corpus.py drives all 199 rows through the per-cell standardizers; 0 xfailed. Wires the GUI page (3_Format_Standardizer.py) to "Ready" status. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 02:11:24 +00:00
Michael	82d7fef21e	feat(gate): CSV-normalization gate with confidence-tiered findings Adds a Review & Normalize page that sits between upload and every tool page. The analyzer now tags each finding with confidence (high/medium/low) and a fix_action; the gate auto-applies high-confidence fixes, surfaces medium/low ones for user review, and blocks tool pages on error-level findings until resolved or waived. Core (src/core/): - analyze.py: Finding gains confidence, fix_action, pre_applied; new detectors for encoding_uncertain, encoding_decode_failed; new top- level encoding_override parameter. - fixes.py: registry of fix algorithms keyed by fix_action id. - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and the NormalizationResult / Decision dataclasses the gate consumes. - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption) and normalizes line endings (fixes bare-CR parser crash); empty file handled gracefully instead of EmptyDataError traceback. GUI (src/gui/): - pages/0_Review.py: gate page with per-finding decision controls, encoding override picker (16 codepages + custom), and Advanced output options (encoding, delimiter, line terminator) on the download. - components.py: require_normalization_gate() helper. - pages/1-9: gate guard wired on every tool page. Test corpora: - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference UTF-8 files + manifest, synced from Business/DataTools. - test-cases/text-cleaner-corpus/test_data/17: synced malformed input (unquoted $1,500.00) for the unquoted-delimiter detector. Tests (94 new): - test_normalize.py (48): finding fields, fix registry, auto_fix scope, decision paths, gate idempotency, output-options helper. - test_encodings_corpus.py (90, 16 xfailed): parametric detection + decode + analyzer-no-crash sweep against the manifest. - test_analyze.py: encoding override + encoding_uncertain detectors. - test_corpus.py: pre-parse repair in the strict reader. run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate; encodings corpus added to --fixtures category. Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema, gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds the analyzer JSON schema with the new fields; README links to all of it. Suite: 765 passed, 17 xfailed (was 458 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 20:35:27 +00:00
Michael	e9c490ae1b	feat(gui): hidden-char-aware preview tables in Text Cleaner The Text Cleaner had two st.dataframe previews — the initial upload preview ("Preview: filename") and the post-clean "Cleaned preview" table — that both rendered cells with the same browser-collapses- whitespace, hides-invisibles problem the analyzer findings panel had before commit `1049c03`. components.render_hidden_aware_preview(df, n_rows, caption) renders a DataFrame as an HTML table where: - every cell uses visualize_hidden_html(mark_outer_whitespace=True), so leading/trailing ASCII spaces appear as per-character "·" badges - white-space: pre-wrap on every cell preserves internal multi-space runs and embedded newlines visually - headers route through the same visualizer so dirty column names (NBSP padding, ZWSP, smart quotes) show their badges too - NaN cells render as a faint "NaN" placeholder - rows are sticky-headed and scrollable inside a 26rem capped container so a 10-row preview doesn't push the rest of the UI off screen 2_Text_Cleaner.py wires it into both previews: - The upload preview gains its own "Show hidden characters in preview" toggle (default on). - The cleaned preview reuses the existing show_hidden toggle that already governs the Examples changes table, so one switch controls the whole results section. Either toggle off falls back to the original st.dataframe view. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:26:30 +00:00
Michael	90ceada2d1	feat(text_clean): visualize hidden characters in the cleaner GUI The whole point of the cleaner is to remove characters the user can't see — which makes the "before / after" preview nearly useless by default. A cell with NBSP padding looks identical to a cell with regular spaces. Two new helpers in src.core.text_clean: visualize_hidden_text(s) Plain-text rendering: each invisible/control/smart character is replaced by a glyph + [LABEL] (e.g. "·[NBSP]", "→[TAB]", "∅[ZWSP]", """[L DQUOTE]"). Suitable for terminal output, CSV exports, anywhere HTML is wrong. Unmapped C0 controls render as [U+XXXX]. visualize_hidden_html(s) + hidden_char_css() HTML rendering: every flagged character is wrapped in a <span> with a CSS class and a tooltip showing the codepoint and label. Pair with hidden_char_css() to inject the matching styles. Three colour bands (whitespace, special, control) so the user can scan an audit table and spot what's being changed at a glance. Mapping covers: ASCII tab/LF/CR, every NBSP variant (U+00A0, U+202F, U+2009, …), zero-width family (ZWSP/ZWNJ/ZWJ/WJ/BOM/SHY), bidi marks (LRM/RLM), all smart quotes, en/em dashes, ellipsis, prime/double-prime, and guillemets. ASCII printable text passes through; HTML output also escapes &/</> . GUI wiring (src/gui/pages/2_Text_Cleaner.py) The "Examples" changes table now defaults to a hidden-char-rendered HTML view: every NBSP/ZWSP/smart-quote/control char is shown with its badge and codepoint tooltip. A "Show hidden characters" toggle lets the user fall back to the raw st.dataframe view if they prefer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:14:14 +00:00
Michael	794d4cda94	feat(gui): tool pages pick up the home-page upload via session_state Closes the last UX gap from the analyzer review: each tool page had its own st.file_uploader, so users had to upload the same file twice (once on the home page for analysis, once on each tool page). components.pickup_or_upload(label, key, types) returns either: - a _StashedUpload shim wrapping the home-page bytes (when present and the user hasn't asked for a different file on this page), or - the standard st.file_uploader (when nothing is stashed or the user clicked "Use a different file"). _StashedUpload duck-types Streamlit's UploadedFile (.name, .size, .getvalue(), .read()) so existing tool-page code consumes it without changes. A "Use a different file" button per page sets a session-state override flag; a "Switch back to upload-screen file" button clears it. Wired into 2_Text_Cleaner.py and 1_Deduplicator.py — the two pages with working uploaders today. The remaining stub pages adopt it when they're implemented; the helper is the public surface they'll use. Verified by smoke-launching streamlit headless and curling the home, text-cleaner, and deduplicator routes — all return 200 with no errors in the server log. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:09:51 +00:00
Michael	54f92ae47e	feat: implement text cleaner (script 02) with CLI, GUI, and tests Builds 02_text_cleaner.py from stub to working: character-level hygiene for CSV/Excel inputs covering trim, whitespace collapse, smart-character folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char strip, line-ending normalization, and per-column case conversion. Three presets (minimal/excel-hygiene/paranoid) keep the buyer surface small. - src/core/text_clean.py: pure helpers + CleanOptions/CleanResult + clean_dataframe with dtype-safe column selection - src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape (dry-run by default, --apply writes cleaned + changes audit, JSON config save/load) - src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset picker, advanced toggles, preview, before/after metrics, and three download buttons - tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests covering edge cases E1-E50 from the spec - samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10 in 10 rows - test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case fixtures Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7 entry locking the spec, CLI-REFERENCE.md gains the text cleaner section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md status row 02 promoted Skeleton -> Working. 200/200 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:14:15 +00:00
Michael	35ea21ad33	feat: hide Streamlit chrome for app-like appearance Add shared hide_streamlit_chrome() helper that removes header bar, hamburger menu, footer, and deploy button via CSS injection. Called on every page. Add .streamlit/config.toml with minimal toolbar mode. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 01:20:54 +00:00
Michael	f2fdc10af7	feat: refactor GUI to multi-page Streamlit app with 9 tool pages Convert single-page deduplicator into a multi-page suite. Home page shows tool card grid. Deduplicator extracted to its own page (fully working). 8 stub pages added for Text Cleaner, Format Standardizer, Missing Values, Column Mapper, Outlier Detector, Multi-File Merger, Validator & Reporter, and Pipeline Runner — each with functional file upload and coming-soon UI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 01:16:12 +00:00

24 Commits