Commit Graph

41 Commits

Author SHA1 Message Date
e534fb4989 sec(license): Ed25519 sigs + production-safe tripwire
Two coupled hardening upgrades.

1. Asymmetric signatures (HMAC → Ed25519)

The previous HMAC scheme used a symmetric secret that any motivated
reverse engineer could pull out of the shipped binary and use to
mint blobs for any tier / name / email. With Ed25519, the binary
ships only the public verification key; the signing key never
leaves the seller's environment, so binary compromise no longer
yields forgery.

- src/license/crypto.py rewritten around
  cryptography.hazmat.primitives.asymmetric.ed25519. Same public
  API surface (sign/verify/encode_blob/decode_blob), same canonical
  JSON encoding — drop-in for the manager / cli / GUI layers.
- DATATOOLS_LICENSE_PRIVKEY (seller-side) and
  DATATOOLS_LICENSE_PUBKEY (build-time) env vars supply the keys;
  the in-source dev keypair (src/license/_dev_keypair.py)
  deterministically derives from a seed phrase for repro builds and
  tests.
- Blob prefix bumped DTLIC1: → DTLIC2:. Decoding a DTLIC1 blob
  surfaces a clear "old format" error rather than a confusing
  signature mismatch.
- scripts/generate_keypair.py mints fresh production keypairs for
  the seller (run once, stash the private key offline). Adds
  cryptography>=41,<46 to requirements.txt (was an undeclared
  transitive dep).
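
A minimal sketch of the versioned-prefix check described in the list above, assuming a blob is the prefix followed by base64url-encoded canonical JSON. The names encode_blob_payload, decode_blob_payload, and LegacyBlobError are hypothetical, and signature verification is deliberately left out; this is not the code in src/license/crypto.py.

```python
import base64
import json

CURRENT_PREFIX = "DTLIC2:"
LEGACY_PREFIX = "DTLIC1:"


class LegacyBlobError(ValueError):
    """Hypothetical error for blobs minted under the old HMAC scheme."""


def encode_blob_payload(payload: dict) -> str:
    # Canonical JSON: sorted keys and no insignificant whitespace.
    raw = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return CURRENT_PREFIX + base64.urlsafe_b64encode(raw).decode("ascii")


def decode_blob_payload(blob: str) -> dict:
    blob = blob.strip()
    if blob.startswith(LEGACY_PREFIX):
        # Fail early with an explicit message instead of a signature mismatch.
        raise LegacyBlobError("old format (DTLIC1); request a re-issued DTLIC2 license")
    if not blob.startswith(CURRENT_PREFIX):
        raise ValueError("not a license blob")
    raw = base64.urlsafe_b64decode(blob[len(CURRENT_PREFIX):])
    return json.loads(raw)
```

Checking the prefix before any cryptographic work is what lets the caller tell "wrong version" apart from "tampered blob".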

2. Production-safe tripwire

assert_production_safe() refuses to boot a frozen / shipped build
when either:

- DATATOOLS_DEV_MODE=1 is set (would unconditionally bypass every
  license check — fine in source/test but catastrophic in a buyer
  install).
- The active verification key is still the embedded dev key (the
  build pipeline forgot to set DATATOOLS_LICENSE_PUBKEY).

No-op in source / pytest runs (sys.frozen is unset) so test
fixtures and dev workflows keep working without ceremony. Called
from src/cli_license_guard.guard() and from hide_streamlit_chrome
— so it fires on every CLI invocation and every GUI page load.
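
The tripwire logic above could look roughly like the following. This is a sketch under stated assumptions, not the shipped function: the two-key signature, the frozen/environ overrides (added here only to make the sketch testable), and UnsafeBuildError are all hypothetical.

```python
import os
import sys


class UnsafeBuildError(RuntimeError):
    """Hypothetical error type for a build that must not start."""


def assert_production_safe(active_pubkey, dev_pubkey, *, frozen=None, environ=None):
    # Freezers such as PyInstaller set sys.frozen; it is absent in source runs.
    if frozen is None:
        frozen = bool(getattr(sys, "frozen", False))
    if environ is None:
        environ = os.environ
    if not frozen:
        return  # no-op in source and pytest runs
    if environ.get("DATATOOLS_DEV_MODE") == "1":
        raise UnsafeBuildError("DATATOOLS_DEV_MODE=1 is set in a frozen build")
    if active_pubkey == dev_pubkey:
        raise UnsafeBuildError("frozen build still uses the embedded dev key")
```

Both checks sit behind the frozen test, so test fixtures that set DATATOOLS_DEV_MODE=1 are unaffected.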

Tests: 49 license-layer unit tests (was 40); added Ed25519
wrong-key rejection, dev-keypair seed pin, blob v2 prefix, v1
rejection with clear message, and four production-safe scenarios
(no-op in source, fires on DEV_MODE in frozen, fires on dev key in
frozen, passes in frozen with prod pubkey). Total: 2024 → 2033.

Docs (REQUIREMENTS §17a, DEVELOPER licensing recipe, DECISIONS
§9b + decision log) updated with the new threat-model write-up,
key-storage workflow, and tripwire behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:34:48 +00:00
d32b58e61a feat(license): add Lite SKU; remove user-facing free trial
Two coupled changes:

1. Lite tier
   - New Tier.LITE in src/license/schema.py.
   - FEATURES_BY_TIER[Tier.LITE] = {Deduplicator, Text Cleaner,
     Format Standardizer}. The three universally-useful tools that
     cover the most common bookkeeping / RevOps / Klaviyo prep
     workflows. Other six tools require Core.
   - i18n: license.tier_lite, license.feature_locked_title,
     license.feature_locked_body, license.upgrade_link,
     license.status_locked (en + es).
   - Per-tool feature gate at every GUI tool page
     (require_feature_or_render_upgrade) and every tool CLI
     (guard(feature=...)). A locked tool renders an upgrade
     prompt + Manage-license button (GUI) or exits with code 2
     (CLI).
   - Home grid: tool cards the user's tier doesn't unlock get a
     red 🔒 Locked badge in place of green Ready.
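
A rough sketch of the per-tool gate, assuming a plain tier-to-feature-set mapping. Only the three Lite tools and the exit code 2 come from this commit; check_feature is a hypothetical stand-in for the real guard(feature=...), and the Core-only set below is partial (three of the six tools, named from elsewhere in this log).

```python
import enum


class Tier(enum.Enum):
    LITE = "lite"
    CORE = "core"


LITE_TOOLS = frozenset({"Deduplicator", "Text Cleaner", "Format Standardizer"})
# Partial, illustrative list of Core-only tools.
CORE_ONLY_TOOLS = frozenset({"Missing Value Handler", "Column Mapper", "Pipeline Runner"})

FEATURES_BY_TIER = {
    Tier.LITE: LITE_TOOLS,
    Tier.CORE: LITE_TOOLS | CORE_ONLY_TOOLS,
}

EXIT_FEATURE_LOCKED = 2


def check_feature(tier: Tier, feature: str) -> int:
    """Return 0 when the tier unlocks the feature, otherwise exit code 2."""
    return 0 if feature in FEATURES_BY_TIER[tier] else EXIT_FEATURE_LOCKED
```

A CLI entry point would pass a non-zero result to sys.exit(); a GUI page would render the upgrade prompt instead.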

2. Trial removed
   - Activation form's "Start 1-year trial" button removed.
   - license_cli's `trial` subcommand removed.
   - activation.trial_button / activation.trial_help i18n keys
     dropped (pack parity test stays green).
   - Tier.TRIAL stays in the enum (back-compat with any field-
     tested trial licenses); LicenseManager._mint stays internal
     for tests and the seller's key generator.
   - Decision logged in DECISIONS §9b: a 1-year all-features
     trial undercuts paid Lite; paid-only keeps tier economics
     clean.

Tests (+29 net): +17 Lite-tier unit/guard tests + 13 Lite-tier
GUI tests + 1 trial-absent assertion - 2 trial CLI tests - 1
trial GUI button test. Total: 1995 → 2024.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:19:30 +00:00
e435103113 feat(license): registration + 1-year licenses + tier scaffolding
A complete offline licensing layer (no internet at any step):

Core
- src/license/ — schema (License, Tier, FeatureFlag), HMAC crypto,
  JSON storage, LicenseManager singleton with activate/renew/
  deactivate/issue_trial. Tier-scaffolded so future SKUs can carve
  per-tool feature sets without consumer-code edits.
- scripts/generate_license.py — creator-only key generator. Mints a
  DTLIC1: blob the buyer pastes into the activation page.

GUI
- New activation form component (src/gui/components/activation.py).
- hide_streamlit_chrome() now inline-renders the activation form when
  no valid license is present (every page short-circuits to the form
  until activated).
- Sidebar shows tier + days remaining; renewal warning under 30 days.
- New pages/_Activate.py for revisiting the form after activation.

CLI
- src/license_cli.py — activate / renew / status / trial / deactivate
  commands. Exempt from the guard.
- src/cli_license_guard.py — drop-in guard call added to every tool
  CLI's main(). Lets --help through; respects DATATOOLS_DEV_MODE.

i18n
- New activation.* and license.* keys in en.json + es.json
  (page title, form labels, status badges, renewal warnings, error
  messages). Pack parity test stays green.

Test infrastructure
- tests/conftest.py autouse fixture sets DATATOOLS_DEV_MODE=1 so the
  existing 1916 tests continue to pass.
- isolated_license_path / activated_license_manager /
  unactivated_license_manager fixtures for tests that want to drive
  the real check.

Tests (+79)
- tests/test_license.py (40): schema, crypto roundtrip, blob
  encode/decode, tier→feature mapping, activation flow, name/email
  mismatch rejection, tamper detection, expiration, renewal,
  dev-mode bypass.
- tests/test_license_cli.py (26): every license_cli command +
  subprocess tests confirming every tool CLI refuses to run without
  a license, --help always works, DEV_MODE bypasses.
- tests/gui/test_activation.py (13): gate blocks without license,
  passes with trial, activation form submission unlocks the gate,
  sidebar status, renewal warning, i18n.

Total: 1916 → 1995 tests. All pass under the strict warning filter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:54:23 +00:00
c4ce86bd64 feat(i18n): add language-pack scaffold with English and Spanish
Introduces ``src/i18n`` with a tiny JSON-backed t() lookup, an in-session
language preference, and a sidebar selector wired through
``hide_streamlit_chrome`` so every page picks up the same picker. Covers
home, tool cards, findings panel, gate, shutdown, and pickup banner
strings. Tests pin pack parity and the farewell-overlay JS escape so
future packs can't silently regress.
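
For illustration, a JSON-backed lookup and a pack-parity check might be sketched as below. The keys, the inline packs, the fall-back-to-English behaviour, and the parity_gaps helper are assumptions made for this example, not the contents of src/i18n.

```python
import json

# Inline stand-ins for en.json / es.json (hypothetical keys).
PACKS = {
    "en": json.loads('{"home.title": "Data Tools", "shutdown.button": "Close"}'),
    "es": json.loads('{"home.title": "Herramientas de datos", "shutdown.button": "Cerrar"}'),
}


def t(key: str, lang: str = "en") -> str:
    # Assumed fallback chain: requested pack, then English, then the key itself.
    pack = PACKS.get(lang, PACKS["en"])
    return pack.get(key, PACKS["en"].get(key, key))


def parity_gaps(reference: str = "en") -> dict:
    """Map each language to the keys it is missing or has in excess."""
    ref_keys = set(PACKS[reference])
    return {
        lang: sorted(ref_keys ^ set(pack))
        for lang, pack in PACKS.items()
        if set(pack) != ref_keys
    }
```

A parity test then only needs to assert that parity_gaps() is empty.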

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:11:30 +00:00
ea89c4d399 ui(gui): say 'window' instead of 'browser tab' in shutdown copy
Update the Close page intro, the shutdown overlay, and the toast so
they all read "you can close this window" — clearer for users running
the app in a dedicated browser window rather than a tab.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:51:32 +00:00
701108c9d5 fix(gui): inject farewell overlay into parent DOM on shutdown
Replaces the data:-URL navigation (blocked by Chrome since v60 for
top-frame navigation) with a direct DOM-append of a full-screen
overlay onto the parent document. Uses z-index 2147483647 so it sits
above Streamlit's connection-error banner when the websocket drops.

Note: still doesn't fully suppress the connection-error banner in
testing — the next iteration will render the overlay through
Streamlit's own page rather than via a component iframe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:49:48 +00:00
340614e642 feat(gui): promote Quit to a 'Close' menu item in the sidebar nav
Move the shutdown control out of the inline sidebar widget and into
its own page (pages/99_Close.py), so it appears in the sidebar nav
alongside the tool pages. An explicit confirm button on the page
prevents accidental nav clicks from killing a live session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:38:02 +00:00
58c0195def fix(gui): make Quit button actually terminate the server
Signalling the process with SIGTERM/SIGINT didn't reliably shut Streamlit
down — its tornado/asyncio loop swallowed or deferred the signal, so the
browser saw the websocket drop ("Connection error") while the python
process kept running. Replace the signal with a daemon-thread
``os._exit(0)`` after a short delay so the current rerun can paint the
"shutting down" message before the process is hard-killed.
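
The delayed hard exit can be sketched like this. schedule_hard_exit and its exit_fn parameter are hypothetical (the parameter exists only so the sketch can be tested without killing the interpreter). Note that os._exit() skips atexit handlers and finally blocks, which is the point here but also the cost.

```python
import os
import threading
import time


def schedule_hard_exit(delay_seconds: float = 0.5, exit_fn=os._exit) -> threading.Thread:
    """Terminate the process from a daemon thread after a short delay."""

    def _worker() -> None:
        # Give the current rerun time to paint the "shutting down" message.
        time.sleep(delay_seconds)
        exit_fn(0)

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    return thread
```

Because the thread is a daemon, it never keeps the process alive on its own if something else ends it first.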

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:36:36 +00:00
30e257cc44 fix(gui): move Quit button to sidebar so it shows on every page
The footer placement was easy to miss (below all tool cards) and only
rendered on the home page. Hook the button into hide_streamlit_chrome()
so every page that hides default chrome — home + all 9 tool pages — gets
the Quit button at the bottom of the sidebar without per-page edits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:33:32 +00:00
0c25d80146 fix(gui): keep sidebar reopenable + add clean Quit button
The chrome-hiding CSS was removing the Streamlit header wholesale,
which also took the sidebar's expand chevron with it — a collapsed
sidebar became unreopenable. Make the header transparent instead and
explicitly preserve the sidebar collapsed-control.

Also add a Quit button in the app footer that signals the Streamlit
server (SIGTERM, falling back to SIGINT) so closing the GUI returns
the shell prompt cleanly instead of leaving Python hung.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:30:10 +00:00
966af8ef94 feat: 3 new tools, format streaming, distribution-ready demo + landing pages
Tools shipped this batch (3 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:31:26 +00:00
26b9771625 feat(errors): structured error hierarchy + helpful messages everywhere
Introduces src/core/errors.py with a small structured error hierarchy
that every public entry point now uses. Each error carries the
context a user needs to fix it and the context a maintainer needs to
trace it.

The hierarchy:
  DataToolsError  (base — formats path, column, operation, suggestion)
    InputValidationError  (extends ValueError — bad arg / wrong type)
    ConfigError           (extends ValueError — bad config / options)
    FileFormatError       (extends ValueError — file is not what we expected)
    FileAccessError       (extends OSError   — file I/O failure)

Subclassing the stdlib bases means existing `except OSError` /
`except ValueError` handlers still catch them — no breaking change.
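
A cut-down sketch of that inheritance pattern, showing two of the four subclasses. The constructor signature and the string format are assumptions for illustration; only the class names and their stdlib bases come from this commit.

```python
class DataToolsError(Exception):
    """Base error carrying optional context (simplified field set)."""

    def __init__(self, message, *, path=None, suggestion=None):
        super().__init__(message)
        self.message = message
        self.path = path
        self.suggestion = suggestion

    def __str__(self):
        parts = [self.message]
        if self.path is not None:
            parts.append(f"path: {self.path}")
        if self.suggestion is not None:
            parts.append(f"suggestion: {self.suggestion}")
        return " | ".join(parts)


class InputValidationError(DataToolsError, ValueError):
    """Bad argument or wrong type; still caught by `except ValueError`."""


class FileAccessError(DataToolsError, OSError):
    """File I/O failure; still caught by `except OSError`."""
```

An existing handler such as `except OSError as exc:` keeps working and now receives the extra context through str(exc).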

Helpers:
- ensure_dataframe(value, function=...)  — uniform DataFrame guard
- ensure_choice(value, name=, choices=)  — uniform enum/literal guard
- wrap_file_read(path, op, exc)          — tag OSError with hint + path
- wrap_file_write(path, op, exc)         — same, with Windows-aware tip
- format_for_user(exc, context=)         — user-facing string for st.error / stderr

Library hardening:
- io.read_file: missing files raise FileAccessError, which reports
  whether the parent directory exists and suggests checking the path.
- io.read_file: chunk_size <= 0 now raises InputValidationError with
  a positive-integer suggestion.
- io._read_excel: openpyxl BadZipFile / InvalidFileException / pandas
  ValueError ("sheet not found") wrapped as FileFormatError listing
  the path and a "list sheets with list_sheets()" hint.
- io._detect_excel_header_row: bare except narrowed to specific
  openpyxl exceptions; falls back gracefully and logs at debug so
  the real error surfaces from pd.read_excel.
- io.write_file: OSError / PermissionError on to_csv/to_excel wrapped
  with file path and Windows-aware "file may be open in another
  program" hint.
- dedup._parse_date: bare `except Exception` narrowed to
  (TypeError, ValueError, OutOfBoundsDatetime); failed values
  logged at debug for survivor-selection forensics.
- dedup._select_survivor: KEEP_MOST_RECENT now raises
  InputValidationError instead of silently falling back to keep_first.
- dedup.deduplicate: input validation errors are InputValidationError
  with operation/column/suggestion fields.
- format_standardize.from_dict: invalid FieldType for a column raises
  ConfigError naming the column AND the bad value AND listing valid
  values; same for date_order / phone_format / etc.
- format_standardize.from_file: OSError / JSON decode wrapped with
  path AND line/column where parsing failed.
- format_standardize.to_file: TypeError on json.dumps wrapped as
  ConfigError with the suspected source (extra_abbreviations).
- format_standardize._apply_field_type: dispatcher's "unknown field
  type" branch now raises AssertionError (it's an internal invariant,
  not user error — a new enum value was added without a branch).
- format_standardize._resolve_column_types: missing-column error now
  InputValidationError with a "check for typos / unparsed header"
  suggestion.
- format_standardize.standardize_dataframe: ensure_dataframe at entry.
- text_clean.clean_dataframe: ensure_dataframe at entry.
- config.to_strategies: invalid Algorithm/NormalizerType wrapped as
  ConfigError naming the strategy index AND the column.
- config.to_survivor_rule: invalid SurvivorRule wrapped as ConfigError
  listing valid values.
- config.from_file: OSError / JSON decode wrapped (mirror of
  StandardizeOptions.from_file).
- fixes.repair_mojibake: ImportError on ftfy now logged at info level
  with the underlying ImportError so a corrupt-package vs not-installed
  distinction is visible in the logs.
- normalizers.normalize_phone: phonenumbers.NumberParseException now
  logged at debug when the digits-only fallback drops extension /
  country-code information — gives a trail when matching results
  look wrong.

GUI / CLI surfaces:
- All 9 page handlers (`except Exception as e: st.error(...)`) now
  use format_for_user(), which renders DataToolsError fields nicely
  and falls back to "ClassName: message" for unrecognized errors.
- 2_Text_Cleaner and 3_Format_Standardizer additionally distinguish
  UnicodeDecodeError with a "re-save as UTF-8" suggestion before
  the generic handler.
- cli.py's "Error reading file" handler now uses format_for_user()
  and includes the input path in the prefix.

Tests:
- tests/test_errors.py — 22 new tests covering: base class formatting,
  stdlib inheritance, ensure_dataframe / ensure_choice helpers,
  wrap_file_read / wrap_file_write, format_for_user behavior, and
  end-to-end integration (missing file, missing dir, bad JSON, bad
  algorithm, bad enum, missing column).
- tests/test_audit_fixes.py + tests/test_io.py — updated 4 tests for
  the new exception types (InputValidationError replaces TypeError,
  FileAccessError extends OSError).

Full project suite: 1230 passed, 4 skipped, 17 xfailed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:35:42 +00:00
4adeb5c7f3 feat(format): per-cell standardizers + 199-row buyer corpus
Adds src/core/format_standardize.py — a per-cell standardizer for dates,
phones, emails, addresses, names, currencies, booleans — wired through
StandardizeOptions / standardize_dataframe with FieldType registry.

Includes:
- Date parser handles ISO/US/EU/longform/excel-serial/unix-timestamp/
  partial-precision/quarter notation; opt-in French/German/Spanish month
  dictionaries via month_locales.
- Phone via libphonenumber with extension preservation (;ext=N), 001
  international prefix handling, error sentinels for placeholders /
  multi-number cells.
- Email lowercase/trim/mailto/angle-bracket strip with optional
  --gmail-canonical mode.
- Address USPS abbreviation expansion or compression (expand=False per
  corpus § 6.3), state-name → 2-letter conversion, multi-line collapse,
  PO Box normalization, state-code preservation regardless of input case.
- Name handler: Mc/Mac/O'/D' inner caps, hyphen segments, particle
  lowercasing (von/van/de/da), comma-format reversal, period stripping
  for titles/suffixes/initials, PhD/MD acronym preservation, conservative
  mode for mixed-case input.
- Currency: auto-detect EU vs US separators, space-thousands, Swiss
  apostrophe, accounting parens, optional ISO code preservation, error
  sentinels for percentages/ranges/word-values/ambiguous separators.
- Per-domain error_policy ("passthrough" | "sentinel") for surfacing
  malformed values as <error: reason> per corpus § 0.3.

Test corpus from Business/DataTools/test-cases-format-cleaner copied to
test-cases/format-cleaner-corpus/ — 7 fixtures plus FORMATS-CASES.md.
tests/test_format_standardize_corpus.py drives all 199 rows through the
per-cell standardizers; 0 xfailed.

Wires the GUI page (3_Format_Standardizer.py) to "Ready" status.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:11:24 +00:00
3f007ef3d6 feat(gui): 1 GB upload cap + delimiter / encoding diversity caption
Streamlit's default file_uploader footer reads "Limit 200MB per file —
CSV, TSV, XLSX, XLS" which contradicts the 1 GB efficiency target shipped
in 438bc0f and codified in docs/REQUIREMENTS.md §1.1.

Three changes:

1. .streamlit/config.toml — set [server] maxUploadSize = 1024. Footer
   now reads "Limit 1024MB per file".

2. upload_and_analyze_section (home page) — adds an explicit caption
   above the uploader stating size limit, supported formats, the four
   auto-detected delimiters, and the 13 auto-detected encodings (with
   the Review-page override as the safety net).

3. pickup_or_upload (every tool page that falls back to its own
   uploader when no home-page upload is present) — same caption,
   only rendered when the upload accepts CSV/TSV/XLSX/XLS so JSON
   schema / config uploaders aren't decorated.

Test suite: 765 passed, 17 xfailed (no regressions). Home + Review +
Deduplicator pages all serve HTTP 200 under the new config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 21:23:21 +00:00
f891c6116d refactor(gui): tool registry + components package for per-tool builds
Two low-risk seam moves to enable selling per-tool subsets without
breaking the existing all-in-one bundle. Behaviour identical; every
existing import still resolves; full pytest suite + every page returns
HTTP 200.

1. **Tool registry** (src/gui/tools_registry.py) — replaces the
   inline dict-of-dicts in app.py with a Tool dataclass and a TOOLS
   list. Adds a tier field ("core" today, "pro" / "enterprise" later)
   and tools_for_tier() / tool_by_id() / display_name() helpers. A
   per-tool build slices TOOLS at import time without code changes.

2. **components package** (src/gui/components/) — converts the former
   single components.py into a package with:
     _legacy.py        — original file, unchanged.
     __init__.py       — re-exports the legacy surface; existing
                         "from src.gui.components import …" calls
                         continue to work.
     shared.py         — hide_streamlit_chrome, pickup_or_upload
                         (every build needs these).
     gate.py           — require_normalization_gate (Pro / Suite SKUs).
     findings.py       — analyzer-finding widgets (drops out of a
                         standalone-Dedup build).
     dedup_review.py   — match-group cards + apply pipeline (drops out
                         of a non-dedup build).

   The seam modules are narrow re-exports today. As code migrates out
   of _legacy.py into the focused modules, the public import path
   stays stable via the shim.
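
As a rough picture of the registry described in item 1, assuming a frozen dataclass and string tool ids. The ids and the exact field set are hypothetical; only the Tool / TOOLS / tier / helper names come from this commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    id: str
    name: str
    tier: str = "core"  # "pro" / "enterprise" reserved for later


TOOLS = [
    Tool("dedup", "Deduplicator"),
    Tool("text_clean", "Text Cleaner"),
    Tool("format", "Format Standardizer"),
]


def tools_for_tier(tier: str) -> list:
    return [tool for tool in TOOLS if tool.tier == tier]


def tool_by_id(tool_id: str) -> Tool:
    for tool in TOOLS:
        if tool.id == tool_id:
            return tool
    raise KeyError(tool_id)


def display_name(tool_id: str) -> str:
    return tool_by_id(tool_id).name
```

A per-tool build would then only need to trim TOOLS at import time; the helpers stay unchanged.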

E2E: 765 passed, 17 xfailed (unchanged); home page + all 9 tool pages
+ Review page render HTTP 200; full pipeline (analyze → auto_fix →
apply_decisions → output bytes) round-trips on the kitchen-sink
fixture with zero high-confidence findings remaining post-fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:56:21 +00:00
82d7fef21e feat(gate): CSV-normalization gate with confidence-tiered findings
Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.
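
The triage rule in the paragraph above can be sketched as follows. The Finding fields shown, the severity values other than "error", and the triage / gate_blocks helpers are simplifications assumed for this example rather than the real analyzer schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    id: str
    severity: str                      # e.g. "error", "warning", "info"
    confidence: str                    # "high", "medium", or "low"
    fix_action: Optional[str] = None   # id of a registered fix, if any
    resolved: bool = False             # set once resolved or waived


def triage(findings):
    """Split findings into auto-fixable and needs-review lists."""
    auto, review = [], []
    for finding in findings:
        if finding.confidence == "high" and finding.fix_action:
            auto.append(finding)
        else:
            review.append(finding)
    return auto, review


def gate_blocks(findings) -> bool:
    """Tool pages stay blocked while any error-level finding is open."""
    return any(f.severity == "error" and not f.resolved for f in findings)
```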

Core (src/core/):
  - analyze.py: Finding gains confidence, fix_action, pre_applied; new
    detectors for encoding_uncertain, encoding_decode_failed; new top-
    level encoding_override parameter.
  - fixes.py: registry of fix algorithms keyed by fix_action id.
  - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
    the NormalizationResult / Decision dataclasses the gate consumes.
  - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
    transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
    and normalizes line endings (fixes bare-CR parser crash); empty file
    handled gracefully instead of EmptyDataError traceback.

GUI (src/gui/):
  - pages/0_Review.py: gate page with per-finding decision controls,
    encoding override picker (16 codepages + custom), and Advanced output
    options (encoding, delimiter, line terminator) on the download.
  - components.py: require_normalization_gate() helper.
  - pages/1-9: gate guard wired on every tool page.

Test corpora:
  - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
    UTF-8 files + manifest, synced from Business/DataTools.
  - test-cases/text-cleaner-corpus/test_data/17: synced malformed input
    (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
  - test_normalize.py (48): finding fields, fix registry, auto_fix scope,
    decision paths, gate idempotency, output-options helper.
  - test_encodings_corpus.py (90, 16 xfailed): parametric detection +
    decode + analyzer-no-crash sweep against the manifest.
  - test_analyze.py: encoding override + encoding_uncertain detectors.
  - test_corpus.py: pre-parse repair in the strict reader.

run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:35:27 +00:00
e9c490ae1b feat(gui): hidden-char-aware preview tables in Text Cleaner
The Text Cleaner had two st.dataframe previews — the initial upload
preview ("Preview: filename") and the post-clean "Cleaned preview"
table — that both rendered cells with the same browser-collapses-
whitespace, hides-invisibles problem the analyzer findings panel had
before commit 1049c03.

components.render_hidden_aware_preview(df, n_rows, caption) renders a
DataFrame as an HTML table where:
  - every cell uses visualize_hidden_html(mark_outer_whitespace=True),
    so leading/trailing ASCII spaces appear as per-character "·" badges
  - white-space: pre-wrap on every cell preserves internal multi-space
    runs and embedded newlines visually
  - headers route through the same visualizer so dirty column names
    (NBSP padding, ZWSP, smart quotes) show their badges too
  - NaN cells render as a faint "NaN" placeholder
  - rows are sticky-headed and scrollable inside a 26rem capped
    container so a 10-row preview doesn't push the rest of the UI off
    screen

2_Text_Cleaner.py wires it into both previews:
  - The upload preview gains its own "Show hidden characters in preview"
    toggle (default on).
  - The cleaned preview reuses the existing show_hidden toggle that
    already governs the Examples changes table, so one switch controls
    the whole results section.

Either toggle off falls back to the original st.dataframe view.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:26:30 +00:00
1049c033cb feat(gui): visualize leading/trailing whitespace in analyzer findings
The analyzer's "Run Analysis" panel rendered sample cells via st.dataframe,
which (a) silently collapses leading/trailing ASCII whitespace and (b)
displays NBSP/ZWSP/control chars as nothing. The user couldn't see the
exact pollution they were being told about.

visualize_hidden_html gains a mark_outer_whitespace=True option that
wraps each leading and trailing ASCII space/tab in its own badge with a
"SP LEAD" / "SP TRAIL" tooltip. The badges are per-character so the
user can count exactly how much padding the cleaner will strip.

components.render_findings_panel now:
  - injects hidden_char_css() once at the top of the panel
  - replaces st.dataframe(samples) with a custom HTML table
  - renders the value column with mark_outer_whitespace=True
  - applies white-space: pre-wrap on value cells so any internal ASCII
    whitespace also stays visible (browsers collapse runs by default)

Four new tests cover: leading+trailing badge counts, default-off
behaviour, leading tab badge, all-whitespace string treated entirely
as leading.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:21:39 +00:00
e12615357d fix(gui): use page paths relative to streamlit entrypoint
st.page_link resolves paths from the directory of the entrypoint file
(src/gui/app.py), so the existing "src/gui/{page_slug}" prefix doubled
up and produced StreamlitPageNotFoundError on first upload + analysis
(reproducible on Windows; the stack trace from a Windows install
surfaced the bug).

The _TOOL_PAGE_PATHS map already stores the correct relative form
("pages/2_Text_Cleaner.py"); just pass the slug straight to
st.page_link.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:17:50 +00:00
90ceada2d1 feat(text_clean): visualize hidden characters in the cleaner GUI
The whole point of the cleaner is to remove characters the user can't
see — which makes the "before / after" preview nearly useless by default.
A cell with NBSP padding looks identical to a cell with regular spaces.

Two new helpers in src.core.text_clean:

  visualize_hidden_text(s)
    Plain-text rendering: each invisible/control/smart character is
    replaced by a glyph + [LABEL] (e.g. "·[NBSP]", "→[TAB]", "∅[ZWSP]",
    "“[L DQUOTE]"). Suitable for terminal output, CSV exports, anywhere
    HTML is wrong. Unmapped C0 controls render as [U+XXXX].

  visualize_hidden_html(s) + hidden_char_css()
    HTML rendering: every flagged character is wrapped in a <span> with
    a CSS class and a tooltip showing the codepoint and label. Pair with
    hidden_char_css() to inject the matching styles. Three colour bands
    (whitespace, special, control) so the user can scan an audit table
    and spot what's being changed at a glance.

Mapping covers: ASCII tab/LF/CR, every NBSP variant (U+00A0, U+202F,
U+2009, …), zero-width family (ZWSP/ZWNJ/ZWJ/WJ/BOM/SHY), bidi marks
(LRM/RLM), all smart quotes, en/em dashes, ellipsis, prime/double-prime,
and guillemets. ASCII printable text passes through; HTML output also
escapes &, <, and >.
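
A toy version of the plain-text rendering, covering only the three glyph/label pairs spelled out above. render_hidden is a hypothetical name, and unlike the real helper this sketch sends every other C0 control (including LF and CR) down the [U+XXXX] path.

```python
# Subset of the mapping; the real table is much larger.
HIDDEN_GLYPHS = {
    "\u00a0": "·[NBSP]",
    "\t": "→[TAB]",
    "\u200b": "∅[ZWSP]",
}


def render_hidden(text: str) -> str:
    out = []
    for ch in text:
        if ch in HIDDEN_GLYPHS:
            out.append(HIDDEN_GLYPHS[ch])
        elif ord(ch) < 0x20:
            # Unmapped C0 control: show the codepoint.
            out.append(f"[U+{ord(ch):04X}]")
        else:
            out.append(ch)
    return "".join(out)
```

For example, render_hidden("a\u00a0b") returns "a·[NBSP]b", which makes NBSP padding distinguishable from a regular space.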

GUI wiring (src/gui/pages/2_Text_Cleaner.py)
  The "Examples" changes table now defaults to a hidden-char-rendered
  HTML view: every NBSP/ZWSP/smart-quote/control char is shown with its
  badge and codepoint tooltip. A "Show hidden characters" toggle lets
  the user fall back to the raw st.dataframe view if they prefer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:14:14 +00:00
794d4cda94 feat(gui): tool pages pick up the home-page upload via session_state
Closes the last UX gap from the analyzer review: each tool page had its
own st.file_uploader, so users had to upload the same file twice (once
on the home page for analysis, once on each tool page).

components.pickup_or_upload(label, key, types) returns either:
  - a _StashedUpload shim wrapping the home-page bytes (when present and
    the user hasn't asked for a different file on this page), or
  - the standard st.file_uploader (when nothing is stashed or the user
    clicked "Use a different file").

_StashedUpload duck-types Streamlit's UploadedFile (.name, .size,
.getvalue(), .read()) so existing tool-page code consumes it without
changes. A "Use a different file" button per page sets a session-state
override flag; a "Switch back to upload-screen file" button clears it.
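
A minimal sketch of such a duck-typed shim, assuming only the four members listed above are needed. StashedUpload and load_bytes are illustrative names; the real _StashedUpload may expose more.

```python
import io


class StashedUpload:
    """Wraps stashed bytes behind the subset of the UploadedFile surface used here."""

    def __init__(self, name: str, data: bytes):
        self.name = name
        self.size = len(data)
        self._buffer = io.BytesIO(data)

    def getvalue(self) -> bytes:
        # Full contents, regardless of the current read position.
        return self._buffer.getvalue()

    def read(self, n: int = -1) -> bytes:
        return self._buffer.read(n)


def load_bytes(upload) -> bytes:
    """Tool-page code that works with either a real upload or the shim."""
    return upload.getvalue()
```

Since consumers only touch .name, .size, .getvalue(), and .read(), they cannot tell the shim from a fresh upload.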

Wired into 2_Text_Cleaner.py and 1_Deduplicator.py — the two pages with
working uploaders today. The remaining stub pages adopt it when they're
implemented; the helper is the public surface they'll use.

Verified by smoke-launching streamlit headless and curling the home,
text-cleaner, and deduplicator routes — all return 200 with no errors
in the server log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:09:51 +00:00
a8943f29eb feat(gui): wire analyzer into home page with findings panel and tool badges
Home page (src/gui/app.py) gains an upload + analyze section above the tool
grid: file uploader, "Run analysis" / "Skip" buttons, and a findings panel
grouped by destination tool. Tool cards now carry a "N findings" badge
when the active session's findings reference that tool, so the user sees
at a glance which tools their just-uploaded file would benefit from.

src/gui/components.py adds the shared GUI surface:
  - TOOL_DISPLAY_NAMES + tool_display_name() — single source of truth for
    GUI labels, keeping detector tool ids decoupled from the UI.
  - render_findings_panel(findings) — severity icons, expander per tool,
    open-tool page link, sample-cells dataframe.
  - upload_and_analyze_section() — the home-page widget; stashes file
    bytes and findings in session_state so future tool pages can pick up
    the existing upload instead of re-prompting.
  - findings_count_for_tool(tool_id) — used by app.py to badge cards.

CSV/TSV uploads run through repair_bytes() before analysis, so the user
also sees csv_bom_stripped / csv_smart_quotes_folded findings synthesized
from the pre-parse repair pass. Excel uploads skip that step.
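
One way a pre-parse repair pass could produce those two findings, sketched
under assumptions: the input is UTF-8, findings are plain strings, and the
function returns a (bytes, findings) pair. repair_bytes_sketch() is a
hypothetical stand-in; the real repair_bytes() signature is not shown here.

```python
UTF8_BOM = b"\xef\xbb\xbf"

# Curly quotes folded to their ASCII equivalents (illustrative subset).
SMART_QUOTES = str.maketrans(
    {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}
)


def repair_bytes_sketch(raw: bytes) -> tuple[bytes, list[str]]:
    """Strip a leading UTF-8 BOM and fold smart quotes, recording findings."""
    findings: list[str] = []
    if raw.startswith(UTF8_BOM):
        raw = raw[len(UTF8_BOM):]
        findings.append("csv_bom_stripped")
    text = raw.decode("utf-8")  # assumption: uploads are UTF-8
    folded = text.translate(SMART_QUOTES)
    if folded != text:
        findings.append("csv_smart_quotes_folded")
    return folded.encode("utf-8"), findings
```

Running the repair before parsing matters because a curly quote in place of
an ASCII one can change how a CSV parser splits fields.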

The Text Cleaner tool card flips from "Coming Soon" to "Ready" — that has
been true since the v3.0 implementation and the home page just hadn't been
updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:53:22 +00:00
54f92ae47e feat: implement text cleaner (script 02) with CLI, GUI, and tests
Builds 02_text_cleaner.py from stub to working: character-level hygiene
for CSV/Excel inputs covering trim, whitespace collapse, smart-character
folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char
strip, line-ending normalization, and per-column case conversion. Three
presets (minimal/excel-hygiene/paranoid) keep the buyer surface small.
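
To illustrate several of the listed steps on a single cell value, here is a
small sketch. clean_cell(), the step order, and the character tables are
assumptions for illustration; the actual helpers live in
src/core/text_clean.py and may differ.

```python
import re
import unicodedata

# Zero-width family plus BOM, mapped to None so translate() deletes them.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

# Smart quotes and NBSP folded to ASCII (illustrative subset).
SMART = str.maketrans(
    {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'", "\u00a0": " "}
)


def clean_cell(value: str) -> str:
    """Apply a fixed sequence of character-level hygiene steps to one string."""
    value = value.translate(ZERO_WIDTH)                      # BOM / zero-width strip
    value = value.translate(SMART)                           # smart-character folding
    value = unicodedata.normalize("NFC", value)              # Unicode NFC
    value = value.replace("\r\n", "\n").replace("\r", "\n")  # line endings
    value = re.sub(r"[ \t]+", " ", value)                    # whitespace collapse
    return value.strip()                                     # trim
```

Folding NBSP to a plain space before the collapse step is a deliberate
ordering choice in this sketch: it lets runs of mixed ordinary and
non-breaking spaces collapse to a single space.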

- src/core/text_clean.py: pure helpers + CleanOptions/CleanResult +
  clean_dataframe with dtype-safe column selection
- src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape
  (dry-run by default, --apply writes cleaned + changes audit, JSON
  config save/load)
- src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset
  picker, advanced toggles, preview, before/after metrics, and three
  download buttons
- tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests
  covering edge cases E1-E50 from the spec
- samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10
  in 10 rows
- test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case
  fixtures

Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7
entry locking the spec, CLI-REFERENCE.md gains the text cleaner
section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md
status row 02 promoted Skeleton -> Working.

200/200 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:14:15 +00:00
b2ca04e6f4 fix: scale app content to 85% zoom
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 01:30:58 +00:00
223148283d revert: remove 75% zoom, 100% fits correctly with chrome hidden
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 01:29:42 +00:00
1c609214b0 fix: scale app content to 75% to fit window
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 01:28:12 +00:00
dc48578c7e feat: launch Chrome in app mode for chromeless window
python -m src.gui now opens Chrome with --app flag, hiding the address
bar, tabs, and bookmarks bar. Falls back to default browser if Chrome
is not found. The headless flag is passed via the CLI, so running
streamlit run directly still auto-opens the browser normally.
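
A sketch of the launch-with-fallback idea, assuming Chrome is discoverable
on PATH under one of a few common executable names. find_chrome(),
build_app_command(), open_app_window(), and the candidate names are
hypothetical; --app=URL is the Chrome flag the commit refers to.

```python
import shutil
import subprocess
import webbrowser
from typing import Optional

# Assumed executable names; real installs may need platform-specific paths.
CHROME_CANDIDATES = ("google-chrome", "chrome", "chromium", "chromium-browser")


def find_chrome() -> Optional[str]:
    """Return the first Chrome-like executable found on PATH, else None."""
    for name in CHROME_CANDIDATES:
        path = shutil.which(name)
        if path:
            return path
    return None


def build_app_command(chrome_path: str, url: str) -> list[str]:
    """Build the argv that opens url in a chromeless app window."""
    return [chrome_path, f"--app={url}"]


def open_app_window(url: str) -> None:
    """Open url in Chrome app mode, or in the default browser if no Chrome."""
    chrome = find_chrome()
    if chrome is None:
        webbrowser.open(url)
        return
    subprocess.Popen(build_app_command(chrome, url))
```

Keeping the command construction in its own function makes the flag
handling testable without actually spawning a browser.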

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 01:24:54 +00:00
35ea21ad33 feat: hide Streamlit chrome for app-like appearance
Add shared hide_streamlit_chrome() helper that removes header bar,
hamburger menu, footer, and deploy button via CSS injection. Called
on every page. Add .streamlit/config.toml with minimal toolbar mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 01:20:54 +00:00
f2fdc10af7 feat: refactor GUI to multi-page Streamlit app with 9 tool pages
Convert single-page deduplicator into a multi-page suite. Home page shows
tool card grid. Deduplicator extracted to its own page (fully working).
8 stub pages added for Text Cleaner, Format Standardizer, Missing Values,
Column Mapper, Outlier Detector, Multi-File Merger, Validator & Reporter,
and Pipeline Runner — each with functional file upload and coming-soon UI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 01:16:12 +00:00
27fe87c4fe fix: simplify upload placeholder text
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 00:56:32 +00:00
8f1fb690ae chore: bump version to v3.0
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 00:54:37 +00:00
ec9f100e67 feat: add custom delimiter input and update subtitle text
Delimiter dropdown now includes "Other" option with a text input for
custom delimiter characters. Subtitle updated to mention delimited text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 00:46:12 +00:00
310bea08bf feat: add delimiter selector for CSV/TSV files in GUI
Auto-detects delimiter on upload and shows a selectbox with comma, tab,
semicolon, and pipe options. Changing re-reads the file immediately.
Line terminators (Windows/Unix/Mac) already handled by universal newlines.
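
The commit does not say how detection is done; one plausible approach,
sketched here as an assumption, uses the standard library's csv.Sniffer
restricted to the four offered delimiters. detect_delimiter() and its
default are hypothetical.

```python
import csv

# The four options offered in the selectbox: comma, tab, semicolon, pipe.
CANDIDATES = ",\t;|"


def detect_delimiter(sample: str, default: str = ",") -> str:
    """Guess the delimiter from a text sample, falling back to default."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=CANDIDATES).delimiter
    except csv.Error:
        # Sniffer raises csv.Error when it cannot decide.
        return default
```

The detected value would only seed the selectbox; the user's choice still
wins when they change it.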

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 00:30:50 +00:00
24ae566ec4 fix: hide Deploy button from Streamlit toolbar
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 00:25:26 +00:00
f97b633d4c feat: add live surviving rows preview in match group editor
Shows a read-only preview of the output rows below the editor,
updating as checkboxes and column dropdowns are changed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 00:17:34 +00:00
e672488d50 fix: default Keep checkbox to algorithm-selected survivor only
Only the row chosen by the survivor rule (first, last, most-recent, etc.)
is checked by default. Other rows start unchecked.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 00:15:13 +00:00
d368cad89d feat: inline checkboxes and column dropdowns in match group editor
Replace separate checkbox row and "Customize columns" toggle with a
unified st.data_editor grid — Keep checkboxes at the start of each row,
differing columns render as inline selectbox dropdowns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 00:10:16 +00:00
863fe89f2c feat: multi-row survivor support in match group review
Replace radio + Merge/Keep Both buttons with per-row checkboxes
and a single Confirm button. Users can now:

- Keep all rows (not duplicates) — check all, confirm
- Merge to one row — uncheck all but one, optionally customize columns
- Split a group — keep some rows, remove others (new capability)

Decision format changed from {action, survivor_idx, overrides} to
{keep_indices, overrides}. apply_review_decisions() updated to handle
all three modes. Batch actions updated accordingly.
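
A sketch of how one {keep_indices, overrides} decision could cover all three
modes, using plain dicts in place of DataFrame rows. apply_decision() is
hypothetical, and the assumption that overrides maps column names to chosen
values and applies only when a single row survives is mine, not the
commit's.

```python
def apply_decision(rows: list[dict], decision: dict) -> list[dict]:
    """Return the surviving rows of one match group.

    keep_indices selects survivors; overrides (column -> value) is applied
    only when the group is merged down to exactly one row.
    """
    kept = [dict(rows[i]) for i in decision["keep_indices"]]
    overrides = decision.get("overrides") or {}
    if len(kept) == 1 and overrides:
        kept[0].update(overrides)
    return kept


group = [
    {"name": "Ann", "email": ""},
    {"name": "Ann B", "email": "a@x.com"},
    {"name": "Bob", "email": "b@x.com"},
]

keep_all = apply_decision(group, {"keep_indices": [0, 1, 2], "overrides": {}})
merged = apply_decision(
    group, {"keep_indices": [0], "overrides": {"email": "a@x.com"}}
)
split = apply_decision(group, {"keep_indices": [0, 2], "overrides": {}})
```

Here keep_all returns every row, merged returns one row with the email taken
from another row, and split keeps two of the three rows.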

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 23:52:45 +00:00
debb0cb516 feat: per-group survivor selection and column cherry-picking in GUI
Each match group card now has:
- Radio button to pick which row to keep as the base survivor
- "Customize columns" toggle showing only columns that differ
- Per-column selectbox to pick values from any row in the group
- Decisions stored as {action, survivor_idx, overrides} dicts

Added apply_review_decisions() that builds the final DataFrame by
applying survivor selection + column overrides without re-running
the dedup engine. Batch actions also use the new dict format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 23:47:25 +00:00
39e139d777 fix: prevent match group expanders from collapsing on button click
Replace st.rerun() with on_click callbacks so decisions write to
session state before the natural rerun. Decided groups auto-collapse
with status in the label; undecided groups stay expanded. Added undo
button on decided groups.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 23:25:12 +00:00
b871ab24fc feat: add documentation, Streamlit GUI, and full source tree
- Rewrite README.md with project overview, quick-start, and CLI summary
- Add docs/CLI-REFERENCE.md with full flag reference and 8 recipe sections
- Add docs/DEVELOPER.md with architecture, data flow, and extension guides
- Rewrite src/core/__init__.py with public API exports and module docstring
- Add Streamlit GUI (src/gui/) with file upload, advanced options, interactive
  match group review with side-by-side diff, and download buttons
- Add .gitignore, requirements.txt, all source code, tests, and sample data
- Add streamlit to requirements.txt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 23:06:39 +00:00