243 Commits

Author SHA1 Message Date
604debb9a9 revert(home): keep per-tool grouping for per-file findings
Restoring ``render_findings_panel`` on the home page. Previous commit
(c575efd) inlined a flat renderer that dropped the per-tool grouping
and the "Open <Tool>" jump links — that was an over-correction. The
user only wanted the bottom tool-card grid gone (already removed in
ff2eaeb). The grouping inside the findings panel is what lets a user
land on a specific finding and one-click into the cleaner that fixes
it; without it they'd have to guess which sidebar entry to open.

Tool-card grid stays removed. Sidebar nav is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:31:36 +00:00
c575efd26e fix(home): render findings flat — drop per-tool grouping
The home page was calling ``render_findings_panel``, which groups
findings by tool into expanders and renders an "Open <Tool>" page
link under each. After uploading a file, the user still saw a tool
list (just under a different shape) — defeating the earlier cleanup
that removed the tool-cards grid.

Inline a flat renderer in ``_home_page``: per uploaded file, render
the filename header + severity summary + a flat list of findings via
``_render_one_finding`` directly. No expanders, no tool names as
section headers, no per-tool page-link buttons. Tool discovery
happens in the sidebar.

``render_findings_panel`` itself is unchanged — it still groups by
tool and remains tested via the findings-panel harness, but is no
longer used on the home page.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:22:20 +00:00
175389219f fix(gui): translate sidebar tool names when language changes
The sidebar nav was passing ``tool.name`` (the registry's English
field) to ``st.Page``, so the tool entries stayed in English even
after the user picked Spanish from the language selector. Section
headers were already i18n-driven; tool entries were not.

Switch to ``tool_name(tool_id)`` which routes through ``t(...)`` and
picks up the active language from session state. Verified: with
``ui_lang=es`` the sidebar renders Buscar duplicados / Limpiar texto /
Mapear columnas / etc. instead of the English fallbacks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:19:15 +00:00
c568aec8a7 feat(gui): one-click Close in its own bottom sidebar section
Close is now a direct shutdown trigger: visiting the Close page (the
sidebar entry) fires shutdown_app() immediately — no confirm step, no
intermediate body. The farewell overlay paints and os._exit(0) lands
~1s later from a daemon thread.

Layout: Close moved into its own bottom-of-sidebar section so the
destructive action is visually separated from Account/Activate.

- New shutdown_app() in components/_legacy.py replaces quit_button.
  os._exit thread is skipped when "pytest" is in sys.modules so the
  test suite doesn't suicide on rendering 99_Close.
- pages/99_Close.py shrinks to set_page_config + chrome + shutdown_app.
- app.py nav grows a new "Close" section header (new
  nav.section_close key in en/es packs) pinned at the bottom of the
  navigation dict.

Tests updated:
- TestQuitButtonRenders → TestClosePageShutsDownImmediately.
  Assert the shutdown caption renders + no confirm button exists.
- test_smoke EXPECTED_SUBSTRINGS["99_Close"] now pins
  "Shutting down" / "Cerrando" (the visible page body) instead of
  the removed page title.

2008 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:17:14 +00:00
ff2eaeb6c4 feat(home): multi-file upload + per-file analysis, drop tool grid
Home is now upload + analysis only. The page accepts multiple files in
one go, analyzes each independently, and renders findings grouped by
filename in bordered containers. The 3-section tool-card grid is gone —
discovery happens via the sidebar now.

Mechanics:
- file_uploader uses accept_multiple_files=True. Each file's findings
  cache in session_state["home_findings_by_file"] keyed by filename so
  removing a file via Streamlit's "x" button drops its findings too,
  and re-clicking Run only re-analyzes pending files.
- The first uploaded file is mirrored into the singular
  home_uploaded_{name,bytes,size} keys so tool pages continue to pick
  up an "active" upload through pickup_or_upload — no tool-page changes.
- New i18n keys: upload.intro_multi, upload.uploader_label_multi,
  upload.clear_results, upload.empty_state. upload.heading text is
  updated to "Upload one or more files to start" (EN + ES).

Dropped tests pinning the tool grid:
- TestHomeToolGridLocalization (test_chrome.py)
- test_home_tool_card_uses_es_name (test_smoke.py)
- TestLiteHomeGridBadges (test_lite_tier.py — locked-card lock-badge
  assertions; locking is still enforced per-tool-page via
  require_feature_or_render_upgrade)

2009 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:12:48 +00:00
dad744f17f refactor(gui): drop Review page + normalization gate
Home is now the only entry point: the "Run analysis" button on the
upload section IS the review step (findings render inline via
render_findings_panel). Tool pages no longer gate on a passed
normalization — running the analyzer is sufficient context.

Removed:
- src/gui/pages/0_Review.py
- src/gui/components/gate.py (re-export seam)
- require_normalization_gate() in src/gui/components/_legacy.py
- "review" section enum in tools_registry.py
- Data Review entry in app.py navigation
- require_normalization_gate() calls + imports in all nine tool pages
- tests/gui/test_gate.py (whole file)
- TestReviewWorkflow in tests/gui/test_workflows.py
- 0_Review entry in tests/gui/test_smoke.py PAGE_SLUGS
- stash_upload's normalization_result+normalization_for stashing
- stash_upload_without_gate (was the gate's negative-path helper)

2017 tests pass (16 retired with the gate flow).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:04:33 +00:00
fc6c22c6a7 feat(review): inline file uploader instead of redirect home
When a user lands on Review without an upload, show a file uploader
on the page itself and auto-run the analyzer once a file is picked,
rather than bouncing them to the home page with a "Back to home"
button.

Auto-analyze is the right default here: the user is already on the
Review page, so they've implicitly committed to a scan. Stashing the
bytes in the same session-state keys the home page uses keeps the
rest of the flow (encoding picker, gate, tool pages) unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 19:57:01 +00:00
db5ec084da docs+code: rename tool labels everywhere
Sweep follow-up to 93e43fc. Display labels now consistent across docs,
landing pages, CLI output, code comments, docstrings, and test prose.
Five parallel surfaces touched:

- docs (EN + ES): README, USER-GUIDE, CLI-REFERENCE, and 11 internal
  design/planning docs
- landing pages: index + bookkeeper/revops/shopify-pet
- src: CLI module docstrings, _TOOL_DISPLAY dicts in cli_analyze.py
  and gui/components/_legacy.py, core module headers, every tool
  page's module docstring
- tests: class/method/module docstrings and section-header comments
- test-cases READMEs

Page slugs (1_Deduplicator etc.), tool_id strings (01_deduplicator
etc.), Python class names (TestDeduplicatorWorkflow, FeatureFlag.*),
URL paths, anchor IDs, CSS classes, and asset filenames were left
intact since they're code identifiers / structural references.

All 2033 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 19:50:09 +00:00
93e43fc0d9 feat(gui): sidebar sections + non-technical tool labels
Sidebar nav now groups tools under Data Review / Data Cleaners /
Transformations / Automations via st.navigation, replacing the flat
auto-discovered list. Tool display names switch to action-first
phrasing (Find Duplicates, Fix Missing Values, Find Unusual Values,
Standardize Formats, Clean Text, Quality Check, Map Columns, Combine
Files, Automated Workflows) in EN + ES packs and on each page's H1.

The Data Cleaners section follows the requested order: Missing
Values → Outliers → Text Cleaner → Format Standardizer → Deduplicator
→ Quality Check. (Text Cleaner kept inside cleaners since the request
didn't list it but the tool still ships.) Registry now carries a
section field; helpers added: tools_in_section(), section_label().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 19:36:01 +00:00
624f99653e docs(arch): end-to-end system + tech-stack diagrams
New ARCHITECTURE.md pulls the desktop app (TECHNICAL.md) and the
license server (LICENSE-SERVER.md) into a single picture — the two
were never reconciled into an end-to-end view before.

Contents:
  §1. System diagram (ASCII) showing operator laptop, license
      server stack (nginx → FastAPI → Postgres), Postmark, Gumroad,
      and the buyer's machine — with the three primary flows
      (sale, manual mint, offline activation) traced through it.
  §2. Tech stack diagram, layered: desktop / server / operator /
      external SaaS, with version pins.
  §3. Trust + isolation boundaries table — what crosses each one
      and what the threat model is.
  §4. "Where things are stored" — paths, tables, files.
  §5. Pointers to the deeper per-component docs.

ASCII over Mermaid since the repo's Gitea version is unknown and
plain text renders in every viewer / IDE / raw `cat`.

LICENSE-SERVER.md status flipped from "design proposal, not built"
to "deployed (PR 1 + PR 2 code merged)" — that was stale since
the PR 1 deploy yesterday.

TECHNICAL.md and ADMIN.md gain one-line pointers to ARCHITECTURE.md
so people land at the unified view when looking for "how does it
all fit together".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:59:05 +00:00
86ad21db79 docs(license): PR 2 deploy + operator instructions
ADMIN.md gains a "Running a Gumroad webhook" section: how the URL
secret works, how to add a SKU to products.yaml, how to inspect
gumroad_events (recent activity + failures-only queries), how to
replay a failed delivery, and how to test without buyers via
Gumroad's "Send Test Ping" button.

The deployed-vs-queued matrix flips Gumroad + Postmark to
"code merged, deploy pending" so it's clear the bits exist on
main but the live box still runs PR 1.

SETUP-LICENSE-SERVER.md §3 commits the eventual compose.yml shape
with PR 2 environment + secrets lines included but commented out,
ready to uncomment at deploy time. The §3 chown step already covers
the new secret files because it uses `chmod 400 secrets/*` /
`chown 10001:10001 secrets/*`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:33:53 +00:00
2bbaba954b feat(server): Gumroad webhook receiver + Postmark email (PR 2)
Wires the second source-adapter (Gumroad) plus the email delivery
that lets the server fulfill a sale end-to-end without operator
intervention.

Auth model: Gumroad doesn't HMAC the body, so we use their
recommended URL-secret pattern (?secret=...). Wrong/missing secret
returns 404 — no signal to a prober that the endpoint exists.

Webhook flow (server/app/routes/webhooks.py):
  1. audit-log the raw payload (gumroad_events row) BEFORE anything
     else, so a later failure leaves us replayable
  2. parse via GumroadAdapter (server/app/adapters/gumroad.py)
  3. mint_from_sale — UNIQUE(source, source_order_id) dedups
     duplicate webhook retries
  4. send the license email
  5. mark gumroad_events.processed = true

Always returns 200 once auth passes. Non-2xx would trigger Gumroad's
3-day retry storm; we'd rather record the failure on the audit row
and replay manually after fixing whatever surfaced.

Product → tier mapping is per-source YAML at
server/config/products.yaml (lru_cached). Adding a SKU = edit yaml,
restart api. Unmapped product_id is an error on the audit row, not
a crash.

EmailService (server/app/email.py): provider-agnostic interface with
Postmark as the first implementation. When POSTMARK_TOKEN is unset
the factory returns LoggingEmailService instead, so the webhook
exercises end-to-end before Postmark is provisioned.

48 unit tests (was 21) including:
- Gumroad secret verify with constant-time compare
- Sale parsing: amount-in-cents, name fallback from email,
  test=true tagging, missing-required fields, offer codes
- Product mapping lookups
- Email rendering text + HTML, HTML-escapes user input
- Postmark client via httpx.MockTransport (success and 4xx)
- Webhook end-to-end: secret check, audit log, idempotency on
  retry, unmapped product, email failure keeps license

Smoke test (server/scripts/smoke.sh) extended to POST a synthetic
Ping payload, verify the row + audit log, prove wrong-secret is
rejected, prove duplicate sale_id stays one row.

SQLite-test compatibility:
- BigInteger primary key uses with_variant(Integer, "sqlite") since
  SQLite only autoincrements INTEGER PRIMARY KEY.
- python-multipart pulled in for FastAPI Form parsing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:33:43 +00:00
b5cd74d474 docs(admin): live deployment section for the running license server
Documents the post-deploy state of PR 1: live URLs (datatools and
licenses subdomains on unalogix.com), the on-box filesystem layout
under /srv/datatools-license/, where the admin token lives and how
to retrieve / rotate it, the laptop-side SSH-tunnel + admin_cli
mint workflow, inspection commands (logs, psql, container status),
restart / rebuild procedures, manual backup commands until cron
lands, the production-key rotation outline, and a deployed-vs-queued
capability matrix.

Secrets are NEVER pasted into this doc — the admin token's literal
value lives only on disk (mode 400, UID 10001). Committing it to
git would mean permanent leakage via history even after rotation;
documenting its location + rotation procedure achieves the same
operational outcome without the residual exposure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:19:57 +00:00
1cf69dd23b docs(license): runbook fixes from PR 1 self-host deploy
Two real-world footguns surfaced during the first live deploy:

1. docker-compose's uid/gid/mode long-form on file-based secrets is
   silently ignored — that's a swarm-mode-only feature. The
   container app user (UID 10001 from the Dockerfile) cannot read
   a mode-400 file whose host UID it doesn't match. Fix is to
   chown the secret files to 10001 directly; host-side access
   control stays gated by the parent dir's mode 750.

2. nginx 1.24 (Ubuntu 24.04 default) rejects the standalone
   "http2 on;" directive (that arrived in 1.25). Use the legacy
   "listen 443 ssl http2;" combined form. Noted prominently so the
   next deploy doesn't trip on it.

Also realigned §3's compose example to what actually got deployed
for PR 1 — only pg_password + admin_token secrets, postmark /
gumroad / license_privkey commented out as PR 2 / production-key
follow-ups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:17:05 +00:00
673b902377 feat(license): datatools-admin CLI for the mint API
New operator CLI at src/admin_cli.py: mint, list, revoke, ping —
talks to the server's /internal/* endpoints over a local SSH tunnel.
Stdlib-only on the desktop side (urllib + typer), no new top-level
deps. Auth via $DATATOOLS_ADMIN_TOKEN.

scripts/generate_license.py is now annotated as a break-glass tool
for when the server is unreachable — routine work goes through the
new CLI so the authoritative `licenses` row is created.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 00:47:01 +00:00
bab2c9468c feat(server): mint API + Postgres schema + manual adapter (PR 1)
Source-agnostic license issuance service. FastAPI app fronts a
Postgres `licenses` table; the only currently-wired source is
`manual` (operator mints via /internal/mint). Gumroad webhook
adapter lands in PR 2.

Key design points:

- Signing reuses src/license/crypto.py via a COPY into the image
  (single source of truth — blobs minted server-side verify against
  the same embedded pubkey on the buyer's machine).
- Source adapter Protocol (app/adapters/base.py) is the seam for
  Gumroad / Lemon Squeezy / Stripe in later PRs; Mint API speaks
  only SaleEvent / RefundEvent.
- (source, source_order_id) UNIQUE composite gives idempotent
  webhook retries without double-mint.
- JSONB type uses with_variant(JSON, 'sqlite') so the same models
  drive both Postgres prod and SQLite tests (no testcontainers dep).
- Bearer-token auth on /internal/*; the IP-loopback guard was
  removed after the docker bridge made it fight legitimate prod
  traffic (nginx defense + Bearer remain).
- Secrets resolved via *_FILE env vars pointing at
  /run/secrets/<name>, so passwords never appear in `docker inspect`.

21 unit tests (SQLite in-memory, StaticPool) plus a real-Postgres
docker-compose smoke test in server/scripts/smoke.sh that builds the
image, runs the alembic migration, mints a license, verifies the
signature against the host dev pubkey, and checks the DB row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 00:46:54 +00:00
4179cb5156 docs(license): self-hosted server runbook + multi-tenancy plan
Adds SETUP-LICENSE-SERVER.md — end-to-end install runbook for the
license server on the existing invixiom box (Ubuntu 24.04). Covers
DNS, system packages, Postgres + API in Docker, dedicated system
user, secrets layout under /srv/datatools-license/secrets (mode
400), nginx config in a separate sites-available/unalogix file,
Let's Encrypt cert issuance, smoke tests, backups, monitoring, key
rotation, and rollback.

Multi-tenancy is explicit at every layer: separate DNS zone
(unalogix.com vs invixiom.com), separate nginx file, separate TLS
cert, dedicated backend ports (8090 for the API, 5433 for Postgres,
both localhost-only), separate docker compose project and volume.
No invixiom service is touched.

LICENSE-SERVER.md updated: hosting choice moved from "Fly.io /
Render" (rejected) to self-hosted (decided). Points at the new
runbook for ops specifics.

ADMIN.md pointer table updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 22:57:53 +00:00
52e04f63a9 docs(license): design proposal for online issuance & record-keeping
Forward-looking design doc — not implemented. Describes the smallest
useful server that replaces the manual mint-and-paste workflow:
Gumroad webhook → Mint API (KMS-held private key) → Postgres
licenses table, plus a self-service renewal/re-delivery portal.

The desktop app is deliberately untouched across all three migration
phases: activation stays fully offline and continues to verify blobs
against the embedded pubkey, preserving the DECISIONS.md §9b promise
that buyer machines never phone home.

Schema is intentionally a superset of the local issuance JSONL log
(ADMIN.md), so Phase 1 migration is a flat INSERT per row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 22:26:24 +00:00
23c51fd759 feat(license): local issuance log for minted blobs
generate_license.py now appends every minted license to
~/.datatools-creator/issued.jsonl (overridable via env). This is the
creator-side system of record until the server-side flow lands.

The full blob is stored alongside name/email/tier/expiry so buyers
who lose their delivery email can be re-served without re-minting.
File is created mode 600 and lives outside the buyer-facing
~/.datatools/ dir so it never gets bundled into a shipped install.

Log failures are non-fatal (warning to stderr) — the mint already
succeeded by the time we try to log, and forcing a re-mint after a
log error would invalidate any device the buyer had activated. Pass
--no-log for test mints.

ADMIN.md adds a "Customer record-keeping" section with the path,
schema, jq one-liners, and migration note pointing at the upcoming
LICENSE-SERVER.md design doc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 22:25:19 +00:00
65e17e0a70 docs(admin): internal license operations reference
Creator-only ADMIN.md covering keypair generation, blob minting,
dev vs. production key model, tier matrix, and recovery if the
private key is lost. Includes a TL;DR for minting a dev license
against the in-tree keypair.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 22:10:16 +00:00
e534fb4989 sec(license): Ed25519 sigs + production-safe tripwire
Two coupled hardening upgrades.

1. Asymmetric signatures (HMAC → Ed25519)

The previous HMAC scheme used a symmetric secret that any motivated
reverse engineer could pull out of the shipped binary and use to
mint blobs for any tier / name / email. With Ed25519, the binary
ships only the public verification key; the signing key never
leaves the seller's environment, so binary compromise no longer
yields forgery.

- src/license/crypto.py rewritten around
  cryptography.hazmat.primitives.asymmetric.ed25519. Same public
  API surface (sign/verify/encode_blob/decode_blob), same canonical
  JSON encoding — drop-in for the manager / cli / GUI layers.
- DATATOOLS_LICENSE_PRIVKEY (seller-side) and
  DATATOOLS_LICENSE_PUBKEY (build-time) env vars supply the keys;
  the in-source dev keypair (src/license/_dev_keypair.py)
  deterministically derives from a seed phrase for repro builds and
  tests.
- Blob prefix bumped DTLIC1: → DTLIC2:. Decoding a DTLIC1 blob
  surfaces a clear "old format" error rather than a confusing
  signature mismatch.
- scripts/generate_keypair.py mints fresh production keypairs for
  the seller (run once, stash the private key offline). Adds
  cryptography>=41,<46 to requirements.txt (was an undeclared
  transitive dep).

2. Production-safe tripwire

assert_production_safe() refuses to boot a frozen / shipped build
when either:

- DATATOOLS_DEV_MODE=1 is set (would unconditionally bypass every
  license check — fine in source/test but catastrophic in a buyer
  install).
- The active verification key is still the embedded dev key (the
  build pipeline forgot to set DATATOOLS_LICENSE_PUBKEY).

No-op in source / pytest runs (sys.frozen is unset) so test
fixtures and dev workflows keep working without ceremony. Called
from src/cli_license_guard.guard() and from hide_streamlit_chrome
— so it fires on every CLI invocation and every GUI page load.

Tests: 49 license-layer unit tests (was 40); added Ed25519
wrong-key rejection, dev-keypair seed pin, blob v2 prefix, v1
rejection with clear message, and four production-safe scenarios
(no-op in source, fires on DEV_MODE in frozen, fires on dev key in
frozen, passes in frozen with prod pubkey). Total: 2024 → 2033.

Docs (REQUIREMENTS §17a, DEVELOPER licensing recipe, DECISIONS
§9b + decision log) updated with the new threat-model write-up,
key-storage workflow, and tripwire behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:34:48 +00:00
d32b58e61a feat(license): add Lite SKU; remove user-facing free trial
Two coupled changes:

1. Lite tier
   - New Tier.LITE in src/license/schema.py.
   - FEATURES_BY_TIER[Tier.LITE] = {Deduplicator, Text Cleaner,
     Format Standardizer}. The three universally-useful tools that
     cover the most common bookkeeping / RevOps / Klaviyo prep
     workflows. Other six tools require Core.
   - i18n: license.tier_lite, license.feature_locked_title,
     license.feature_locked_body, license.upgrade_link,
     license.status_locked (en + es).
   - Per-tool feature gate at every GUI tool page
     (require_feature_or_render_upgrade) and every tool CLI
     (guard(feature=...)). A locked tool renders an upgrade
     prompt + Manage-license button (GUI) or exits with code 2
     (CLI).
   - Home grid: tool cards the user's tier doesn't unlock get a
     red 🔒 Locked badge in place of green Ready.

2. Trial removed
   - Activation form's "Start 1-year trial" button removed.
   - license_cli's `trial` subcommand removed.
   - activation.trial_button / activation.trial_help i18n keys
     dropped (pack parity test stays green).
   - Tier.TRIAL stays in the enum (back-compat with any field-
     tested trial licenses); LicenseManager._mint stays internal
     for tests and the seller's key generator.
   - Decision logged in DECISIONS §9b: a 1-year all-features
     trial undercuts paid Lite; paid-only keeps tier economics
     clean.

Tests (+29 net): +17 Lite-tier unit/guard tests + 13 Lite-tier
GUI tests + 1 trial-absent assertion - 2 trial CLI tests - 1
trial GUI button test. Total: 1995 → 2024.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:19:30 +00:00
e612c751a8 docs(license): document activation flow, tier system, dev bypass
- USER-GUIDE EN + ES gain a §0 "First launch — activation" section
  covering paid blob activation, 1-year trial, renewal, file
  location, and device-swap.
- REQUIREMENTS §17a "Licensing" — storage path, activation model,
  lifetime, tier list, dev bypass env var. Test count: 1995.
- DEVELOPER gains a "Licensing" recipe in the Extension recipes
  section: public API, feature-flag add, tier add, minting via the
  creator-only script.
- DECISIONS §9b — log the offline-HMAC choice with the threat-model
  trade-off (motivated piracy not stopped; honor-system + 30-day
  refund covers casual sharing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:54:30 +00:00
e435103113 feat(license): registration + 1-year licenses + tier scaffolding
A complete offline licensing layer (no internet at any step):

Core
- src/license/ — schema (License, Tier, FeatureFlag), HMAC crypto,
  JSON storage, LicenseManager singleton with activate/renew/
  deactivate/issue_trial. Tier-scaffolded so future SKUs can carve
  per-tool feature sets without consumer-code edits.
- scripts/generate_license.py — creator-only key generator. Mints a
  DTLIC1: blob the buyer pastes into the activation page.

GUI
- New activation form component (src/gui/components/activation.py).
- hide_streamlit_chrome() now inline-renders the activation form when
  no valid license is present (every page short-circuits to the form
  until activated).
- Sidebar shows tier + days remaining; renewal warning under 30 days.
- New pages/_Activate.py for revisiting the form after activation.

CLI
- src/license_cli.py — activate / renew / status / trial / deactivate
  commands. Exempt from the guard.
- src/cli_license_guard.py — drop-in guard call added to every tool
  CLI's main(). Lets --help through; respects DATATOOLS_DEV_MODE.

i18n
- New activation.* and license.* keys in en.json + es.json
  (page title, form labels, status badges, renewal warnings, error
  messages). Pack parity test stays green.

Test infrastructure
- tests/conftest.py autouse fixture sets DATATOOLS_DEV_MODE=1 so the
  existing 1916 tests continue to pass.
- isolated_license_path / activated_license_manager /
  unactivated_license_manager fixtures for tests that want to drive
  the real check.

Tests (+79)
- tests/test_license.py (40): schema, crypto roundtrip, blob
  encode/decode, tier→feature mapping, activation flow, name/email
  mismatch rejection, tamper detection, expiration, renewal,
  dev-mode bypass.
- tests/test_license_cli.py (26): every license_cli command +
  subprocess tests confirming every tool CLI refuses to run without
  a license, --help always works, DEV_MODE bypasses.
- tests/gui/test_activation.py (13): gate blocks without license,
  passes with trial, activation form submission unlocks the gate,
  sidebar status, renewal warning, i18n.

Total: 1916 → 1995 tests. All pass under the strict warning filter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:54:23 +00:00
b2c7b94fe9 fix: clear all latent deprecation + resource warnings
Three real issues surfaced when running the suite with strict warnings:

1. src/core/format_standardize.py: ``datetime.utcfromtimestamp`` is
   deprecated in CPython 3.12 and slated for removal. Replace with
   ``datetime.fromtimestamp(ts, tz=timezone.utc)``. Output for the
   date-only format codes we use is byte-identical.

2. src/core/io.py: ``list_sheets`` leaked the openpyxl file handle by
   returning ``xl.sheet_names`` from an unclosed ``pd.ExcelFile``.
   Wrap in a ``with`` block so the FD closes deterministically — also
   prevents the Windows-only "file is locked" repro path.

3. tests/test_corpus.py: ``TestXlsxPollution.workbook`` fixture
   returned the bare ``pd.ExcelFile`` instead of yielding + closing.
   Convert to a yield-and-finally pattern so the class-scoped handle
   isn't leaked across the whole test file.

Also harden pytest.ini's warning policy: escalate
``ResourceWarning`` from ``src`` to an error, alongside the existing
``DeprecationWarning`` rule. Third-party warnings stay filtered — we
can't fix pandas/openpyxl/streamlit churn from here.

All 1916 tests pass under the strict filter; full and split runs
(``pytest``, ``pytest -m 'not gui'``, ``pytest -m gui``) all clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:28:48 +00:00
070e3c9f06 docs(gui): document the new GUI test layer
REQUIREMENTS §16 updates the test count (1777 → 1916) and breaks out
the GUI subset. DEVELOPER's Tests section gains the 'gui' marker
recipes and the new tests/gui/ tree under test layout, plus a short
'GUI test layer' explainer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:13:40 +00:00
35d46a0c1a test(gui): add Streamlit AppTest layer (139 tests)
Until now every test ran against core or the CLI; the Streamlit GUI
was verified by hand. This commit adds tests/gui/ — 139 AppTest-
driven tests behind a 'gui' marker so the quick loop
(``pytest -m 'not gui'``) stays at 1777 tests / ~10s while
``pytest`` runs everything (1916 / ~14s).

Coverage:
- test_smoke.py (59): every page renders in EN and ES, expected
  substring present, sidebar selector mounted.
- test_chrome.py (18): language selector flips session state and
  re-renders; quit button + farewell strings localize; tool-card
  names use the active language.
- test_gate.py (9): require_normalization_gate no-op / warning /
  short-circuit / hash-mismatch invariants; warning + button
  localized.
- test_workflows.py (14): happy path per Ready tool — stash
  upload, render, find primary action, verify result lands in
  session state.
- test_dedup_review.py (8): Accept All / Reject All / Clear
  Decisions wire through to review_decisions; apply_review_decisions
  semantics (keep-all, merge, column override).
- test_advanced_panels.py (15): config_panel widget defaults and
  options (algorithm, threshold, survivor rule, merge, multiselects,
  config save/load).
- test_errors.py (4): garbage / empty / single-column uploads don't
  crash; duplicate-target mapping raises InputValidationError.
- test_findings_panel.py (12): driven via a small standalone harness
  page so we test the component without faking a file_uploader. EN
  + ES strings, per-tool grouping, open-tool button label, untargeted
  expander, severity summary.

Shared infrastructure in tests/gui/conftest.py:
- ``stash_upload`` / ``stash_upload_without_gate`` — populate
  session_state to pre-pass or block the gate.
- ``with_language`` — set ``ui_lang`` before run().
- ``collected_text`` — flatten title/caption/markdown/etc. into
  one string for substring assertions.
- Auto-marking: every test in tests/gui/ gets ``@pytest.mark.gui``
  via ``pytest_collection_modifyitems``, so the marker isn't
  per-test boilerplate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:13:40 +00:00
d0423a8912 docs(perf): publish the dedup/parallel/lazy-copy wins and limits
REQUIREMENTS §10 carries the new measured numbers and the dedup
blocking trade-off note. DEVELOPER known-limitations is rewritten to
reflect that exact-only dedup is now O(n), fuzzy-blocking is opt-in,
and column-parallelism is scaffolding for free-threaded Python.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:54:25 +00:00
64452dd783 perf: dedup blocking, column-parallel scaffolding, lazy-copy pipelines
Three follow-on wins from the audit, each with shape-pinning tests.

1. Dedup blocking
   - Exact-only strategies (every column EXACT @ 100 — covers strong-
     key dedup like email/phone, the drop-duplicates fallback, and
     explicit "match on this exact column" calls) now route through
     an O(n) groupby fast path. Lossless; no API change required.
     Measured: 10k-row email-exact dedup → 73 ms (was ~30 minutes
     via the O(n²) pair compare).
   - Fuzzy strategies still pair-compare, with opt-in prefix blocking
     via deduplicate(..., blocking_columns=[...], blocking_prefix_len=1).
     Measured: 5k-row fuzzy-name → 25.6s with blocking vs 179s
     without (7x). Trade-off: cross-block matches missed.

2. Column-parallel standardize
   - StandardizeOptions.parallel_columns (default 1) lands a
     ThreadPoolExecutor over the column loop. Output order and
     audit-record order are preserved deterministically via a merge
     step keyed off column_types order. Honest doc: under CPython
     3.12's GIL the win is roughly neutral (phonenumbers/dateutil
     hold the GIL); the API is ready for free-threaded Python 3.13+.

3. Lazy-copy in missing / column_mapper
   - _standardize_sentinels now builds per-column changes in a dict
     and only materialises the output frame when at least one column
     actually changed. On a clean 1 GB file this skips a 1 GB
     allocation.
   - handle_missing carries an out_is_owned flag, copying on demand
     before any mutating step. No-op runs return the input frame.
   - map_columns drops the unconditional upfront df.copy(); rename
     and drop both return fresh frames already, and schema-add /
     coerce trigger _ensure_owned() lazily.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:54:25 +00:00
e5f632bcd6 docs(perf): publish 1.5 GB target and the new measured throughputs
REQUIREMENTS §10 reflects the post-optimisation numbers and the
known O(n²) dedup match step (flagged for a future blocking pass).
en/es upload-limit copy and uploader help now say 1.5 GB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:37:26 +00:00
5b672370a6 perf: cache hot paths, drop wasted allocations, lift 1 GB → 1.5 GB
Five targeted wins driven by an end-to-end audit, with shape-pinning
regression tests so reverts are loud:

- format_standardize: fuse the dispatcher loop into one pass — was
  calling Series.tolist() three times per typed column and materialising
  an intermediate triples list; now one tolist, one walk. On a
  synthetic 1M-row phone+email frame this measures ~2.7M rows/sec
  (vs. the previous 150k/sec doc target).
- dedup: wrap normalizers in a per-call lru_cache so repeat phones /
  emails / addresses skip re-parsing. phonenumbers.parse is the
  expensive call; ~2–5x faster on the normalisation step for realistic
  workloads.
- analyze: _detect_near_duplicates no longer copies the full input
  frame; builds only the normalised string columns via a dict and
  references non-string columns by view. Skips the redundant
  astype(str) when a column is already pandas string dtype.
- text_clean: hoist _build_pipeline out of the per-cell loop and add a
  per-call string cache so 100k repeats of "Active" only run the
  pipeline once. ~1M rows/sec on repetition-heavy columns.
- io.repair_bytes: the non-UTF-8 smart-quote fold path used a
  Python-level zip walk over the entire decoded string to count
  replacements — replaced with sum(text.count(c) ...) which runs in
  C at ~GB/s. Was a latent ~100s on a 1 GB cp1252 file; now <1s.

Updates REQUIREMENTS §10 with measured numbers and bumps the buyer-
facing upload limit from 1 GB to 1.5 GB across the i18n packs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:37:26 +00:00
318b9b45dc docs(i18n): ship Spanish translations of buyer-facing docs
Adds README.es.md, docs/README.es.md, docs/USER-GUIDE.es.md, and
docs/CLI-REFERENCE.es.md mirroring the English client-facing set.
Each English doc gains a one-line language-switch banner pointing at
its Spanish counterpart; the docs index advertises both language sets
in the buyer-facing section. Internal docs (TECHNICAL, DECISIONS,
REQUIREMENTS, BUSINESS, RECOVERY) stay English-only by design — they
don't ship with the product.

The CLI itself emits English only, so CLI-REFERENCE.es.md notes that
flags and values are language-invariant while translating the prose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:21:18 +00:00
38011872e1 docs(i18n): document language packs across user, dev, and marketing docs
README + USER-GUIDE describe the sidebar picker and current coverage
(home + shared chrome, per-tool bodies pending). DEVELOPER gains a
how-to for adding packs and keys with the parity-test guarantee.
TECHNICAL §10b records the in-house-JSON architecture and locks in the
no-gettext decision (also logged in DECISIONS). REQUIREMENTS reflects
the new interface surface and updated test count. COPY.md adds a
"Language claim" slot so landing/email work can pick it up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:16:24 +00:00
c4ce86bd64 feat(i18n): add language-pack scaffold with English and Spanish
Introduces ``src/i18n`` with a tiny JSON-backed t() lookup, an in-session
language preference, and a sidebar selector wired through
``hide_streamlit_chrome`` so every page picks up the same picker. Covers
home, tool cards, findings panel, gate, shutdown, and pickup banner
strings. Tests pin pack parity and the farewell-overlay JS escape so
future packs can't silently regress.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:11:30 +00:00
4706ed571e build: wire desktop-bundle pipeline (CI matrix + per-platform installers)
Stand up the seamless-download path for non-technical buyers:

* .github/workflows/build.yml — matrix CI (mac/win/linux) that builds
  PyInstaller bundles and packages them per platform on tag push,
  attaching the resulting installers to a GitHub Release.
* build/installer.iss — Inno Setup script for the Windows installer
  (per-user install, optional desktop shortcut, runs on finish).
* build/macos/build_dmg.sh — wraps DataTools.app into a .dmg with a
  drag-to-/Applications layout.
* build/appimage/{AppRun,datatools.desktop,build.sh} — AppImage recipe.
* src/__init__.py — single source of truth for __version__; the spec
  reads it (was hardcoded), CI passes it through to all packagers.

Buyer download path now lives in the top-level README. Per-build
README documents the Phase 2 step (signing/notarization) that needs
the owner's Apple Developer + Windows code-signing credentials —
those are intentionally not in CI yet because they require setup
outside this repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:58:43 +00:00
ea89c4d399 ui(gui): say 'window' instead of 'browser tab' in shutdown copy
Update the Close page intro, the shutdown overlay, and the toast so
they all read "you can close this window" — clearer for users running
the app in a dedicated browser window rather than a tab.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:51:32 +00:00
701108c9d5 fix(gui): inject farewell overlay into parent DOM on shutdown
Replaces the data:-URL navigation (blocked by Chrome since v60 for
top-frame navigation) with a direct DOM-append of a full-screen
overlay onto the parent document. Uses z-index 2147483647 so it sits
above Streamlit's connection-error banner when the websocket drops.

Note: still doesn't fully suppress the connection-error banner in
testing — the next iteration will render the overlay through
Streamlit's own page rather than via a component iframe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:49:48 +00:00
340614e642 feat(gui): promote Quit to a 'Close' menu item in the sidebar nav
Move the shutdown control out of the inline sidebar widget and into
its own page (pages/99_Close.py), so it appears in the sidebar nav
alongside the tool pages. An explicit confirm button on the page
prevents accidental nav clicks from killing a live session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:38:02 +00:00
58c0195def fix(gui): make Quit button actually terminate the server
Signalling the process with SIGTERM/SIGINT didn't reliably shut Streamlit
down — its tornado/asyncio loop swallowed or deferred the signal, so the
browser saw the websocket drop ("Connection error") while the python
process kept running. Replace the signal with a daemon-thread
``os._exit(0)`` after a short delay so the current rerun can paint the
"shutting down" message before the process is hard-killed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:36:36 +00:00
30e257cc44 fix(gui): move Quit button to sidebar so it shows on every page
The footer placement was easy to miss (below all tool cards) and only
rendered on the home page. Hook the button into hide_streamlit_chrome()
so every page that hides default chrome — home + all 9 tool pages — gets
the Quit button at the bottom of the sidebar without per-page edits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:33:32 +00:00
0c25d80146 fix(gui): keep sidebar reopenable + add clean Quit button
The chrome-hiding CSS was removing the Streamlit header wholesale,
which also took the sidebar's expand chevron with it — a collapsed
sidebar became unreopenable. Make the header transparent instead and
explicitly preserve the sidebar collapsed-control.

Also add a Quit button in the app footer that signals the Streamlit
server (SIGTERM, falling back to SIGINT) so closing the GUI returns
the shell prompt cleanly instead of leaving Python hung.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:30:10 +00:00
e1f364f010 feat: Tier B operator scaffolding — bundle, copy SoT, posts, emails
Pick up and finish yesterday's cut-off Tier B pass.

- build/: PyInstaller scaffold (datatools.spec + launcher.py +
  hook-streamlit.py + README) — folder-mode bundle, locked
  127.0.0.1, per-OS recipe
- marketing/COPY.md: single source of truth for every customer-facing
  string — landing H1/sub/CTAs, demo CTAs, email subjects, Gumroad
  listing, banned phrases
- marketing/community-posts/: 9 drafts (3 posts × 3 niches:
  bookkeeper, revops, shopify-pet) — story / tip / soft-offer
- marketing/emails/: 18 drafts (Gumroad delivery + 5-touch
  onboarding × 3 niches), per-niche segmentation guidance
- docs/NEXT-STEPS.md: flip 2.2 / 2.4 / 3.1 / 3.4 to done with
  pointers to the new assets; add Phase 0 inventory rows
- .gitignore: narrow `build/` ignore so PyInstaller spec + launcher
  + hooks get tracked, only generated artifacts (build/build/,
  build/__pycache__/, build/dist/) stay ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 14:04:37 +00:00
966af8ef94 feat: 3 new tools, format streaming, distribution-ready demo + landing pages
Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:31:26 +00:00
d18b95880d feat(format-i18n): broaden international coverage across all domains
Closes ~17 high-value international gaps surfaced by parallel review.
Adds 93 regression tests; full project suite now 1323 / 0 / 17 (passed
/ failed / xfailed).

DATES
- Adds Portuguese, Italian, Dutch, Russian month dictionaries to the
  opt-in ``month_locales`` set (now: en, fr, de, es, pt, it, nl, ru).
- Adds localized weekday recognition for those locales — "Lundi",
  "Montag", "lunedì", "понедельник", etc. all strip cleanly before
  format matching.
- New CJK separator normalization: Japanese ``2024年01月15日`` and
  fullwidth digits ``2024/01/15`` fold to ASCII before parsing.
- New named-timezone resolution: EST/PST/JST/CET/IST/GMT/etc. map to
  fixed UTC offsets via ``_NAMED_TZ_OFFSETS`` so the trailing TZ
  doesn't block format matching.
- New ISO 8601 extended formats: week date (``2024-W03-1``) and
  ordinal date (``2024-015``), plus RFC 2822 mail-header form
  (``Mon, 15 Jan 2024 10:30:00``).
- New ``two_digit_year_cutoff`` parameter on ``standardize_date()`` —
  defaults to Python's stdlib 69; lower it for birth-year columns
  where most subjects were born ≤ 1999.

NAMES
- Particles set extended with Arabic patronymic markers (bin, ibn,
  bint, abu, abd, al, al-, el-) and Hebrew (ben, bat, ha, ha-).
- Title set extended with German (Herr, Frau), French (M., Mme,
  Mlle), Spanish (Sr., Sra., Srta., Don, Doña), Italian (Sig., Sig.ra,
  Dott.), Portuguese.
- Acronym map extended with international academic credentials
  (Dipl, Ing, Mag, Habil, MSc, BSc, LLB, LLM).
- New East Asian honorific suffix handler: ``Tanaka-san``,
  ``Lee-ssi``, ``Park-nim`` keep the suffix lowercase after the
  hyphen instead of being title-cased into ``Tanaka-San``.
- Hyphenated-segment handler now keeps Arabic prefixes ``al-`` /
  ``el-`` lowercase per Arabic transliteration convention.
- New ``family_first`` parameter on ``standardize_name()`` and matching
  ``name_family_first`` field on ``StandardizeOptions`` — set
  per-column for East Asian data to skip Western comma-format reversal
  (``Kim, Min-jae`` stays ``Kim, …`` instead of becoming ``Min-jae Kim``).

CURRENCY
- Symbol map extended: ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN),
  ₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS) — covers SE Asia, Africa,
  Eastern Europe, Latin America gaps.
- ISO 4217 code list extended from 23 to ~50: SAR, AED, QAR, KWD,
  BHD, OMR, ARS, CLP, COP, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS,
  KES, HUF, CZK, RON, UAH, KZT, etc.

EMAIL
- New BIDI / RTL override stripping (``standardize_email``):
  U+202A-U+202E and U+2066-U+2069 stripped from every email. These
  are a known phishing vector — ``alice‮@example.com`` displays as
  ``alice@elpmaxe.com`` to RTL-aware renderers.

ADDRESS
- Canadian provinces: 13 codes + names → 2-letter (Ontario → ON).
- UK postcode pattern recognition (``SW1A 2AA`` shape).
- Australian states: 8 codes + names (NSW, VIC, QLD, … + full names).
- German Bundesland: 16 codes + names (Bayern → BY, etc.).
- International PO Box variants: ``Postfach`` (DE), ``Boîte postale``
  (FR), ``Apartado`` (ES), ``Casella postale`` (IT), ``Caixa postal``
  (PT) — all fold to canonical ``PO Box``.
- ``_INTL_STATE_CODES`` now combines US/CA/AU/DE codes; the position
  check that preserves state codes regardless of input case applies
  to all four jurisdictions.
- ``_is_state_code_position`` postal pattern broadened to recognize
  US ZIP, AU 4-digit, CA first half, and UK outward code.

CONSTANTS
- ``src/core/_constants.py`` gains: ``CA_PROVINCE_CODES`` /
  ``CA_PROVINCE_NAMES``, ``AU_STATE_CODES`` / ``AU_STATE_NAMES``,
  ``DE_STATE_CODES`` / ``DE_STATE_NAMES``, ``POSTAL_PATTERNS``
  (us/ca/uk/de/au/fr), ``INTL_PO_BOX_PATTERNS`` (per-language regex),
  ``INTL_STREET_SUFFIXES`` (de/fr/es/it/uk dictionaries — ready for
  use when address takes a `country_hint` parameter in a future pass).

DOCS
- TECHNICAL.md §11.3 domain table updated with the new handling per
  domain plus a new "International coverage" sub-section listing the
  supported locales / symbols / jurisdictions.

DEFERRED (out of scope or rare)
- Alternative calendars (Japanese era, Hijri, Hebrew, Buddhist) —
  corpus § 3.5 marks out of scope.
- Persian/Arabic-Indic digit conversion — rare in tabular data.
- Trailing-minus RTL currency convention.
- Punycode ↔ Unicode IDN normalization.
- Mixed-country phone column auto-detection (user can override
  ``default_region`` per column).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 03:06:03 +00:00
abb720997e docs: tight, scannable rewrite — every item earns its place
Refactors all 10 docs (README, USER-GUIDE, CLI-REFERENCE, REQUIREMENTS,
TECHNICAL, DEVELOPER, BUSINESS, DECISIONS, RECOVERY, docs/README) from
prose-heavy to bullet-heavy + table-heavy. Same information density,
significantly less reading load.

Net: 2600 → 1652 lines (~37% reduction) WHILE adding the new content
that landed since v1.6:

- Format Standardizer (3rd Ready tool)
- 199-row buyer corpus
- src/core/errors.py structured hierarchy + ensure_dataframe /
  ensure_choice / wrap_file_read|write / format_for_user helpers
- src/core/_constants.py shared USPS/state lookup tables
- Cross-tool audit fixes (NaN matching, removed_df schema, validation,
  enum-bounds checks, forward-compat config)
- Per-domain error_policy across format standardizers
- Inconsistent-date-format detector
- Excel header-row auto-detection + write_file delimiter param

Per-doc changes:

- README.md (175 → 71): 9-tool table at top, status column, 3 CLI
  entry points listed, dropped repeated marketing prose.
- docs/README.md (38 → 27): pure index — buyer-facing vs creator-only
  split + version footer.
- USER-GUIDE.md (208 → 118): tool table replaces script descriptions,
  troubleshooting compressed to bullets, gate explanation tightened.
- CLI-REFERENCE.md (451 → 235): collapsed flag tables, removed
  redundant intro text, kept full recipes section.
- REQUIREMENTS.md (146 → 129): 18 numbered sections (was 17), added
  §18 Error Handling, formatting tightened to single-line entries.
- TECHNICAL.md (570 → 350): collapsed §3 build pipeline tables, merged
  redundant §3.5-3.7 OS sections, added §7 (Error handling) +
  §11.3 (Format Standardizer spec) + §11.4-11.7 (analyzer / gate /
  Review page / repair_bytes promoted from §10.2.x sub-numbering).
- DEVELOPER.md (285 → 161): module map table replaces per-file prose,
  extension recipes condensed, new §Errors covers when to use each
  hierarchy class.
- BUSINESS.md (278 → 225): collapsed prose to tables (use cases,
  competitive landscape, costs, risks); honest-status updated.
- DECISIONS.md (269 → 189): scoring rubric + GUI matrix preserved,
  decision log compressed to single-line entries, added v1.6 entries
  (Format Standardizer Ready, errors module).
- RECOVERY.md (180 → 147): rebuild steps as numbered + tabular,
  external dependencies as one table, recovery priorities tightened.

No information removed; redundancy compressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:49:29 +00:00
26b9771625 feat(errors): structured error hierarchy + helpful messages everywhere
Introduces src/core/errors.py with a small structured error hierarchy
that every public entry point now uses. Each error carries the
context a user needs to fix it and the context a maintainer needs to
trace it.

The hierarchy:
  DataToolsError  (base — formats path, column, operation, suggestion)
    InputValidationError  (extends ValueError — bad arg / wrong type)
    ConfigError           (extends ValueError — bad config / options)
    FileFormatError       (extends ValueError — file is not what we expected)
    FileAccessError       (extends OSError   — file I/O failure)

Subclassing the stdlib bases means existing `except OSError` /
`except ValueError` handlers still catch them — no breaking change.

Helpers:
- ensure_dataframe(value, function=...)  — uniform DataFrame guard
- ensure_choice(value, name=, choices=)  — uniform enum/literal guard
- wrap_file_read(path, op, exc)          — tag OSError with hint + path
- wrap_file_write(path, op, exc)         — same, with Windows-aware tip
- format_for_user(exc, context=)         — user-facing string for st.error / stderr

Library hardening:
- io.read_file: missing files surface FileAccessError listing whether
  the parent directory exists, and the suggestion to check the path.
- io.read_file: chunk_size <= 0 now raises InputValidationError with
  a positive-integer suggestion.
- io._read_excel: openpyxl BadZipFile / InvalidFileException / pandas
  ValueError ("sheet not found") wrapped as FileFormatError listing
  the path and a "list sheets with list_sheets()" hint.
- io._detect_excel_header_row: bare except narrowed to specific
  openpyxl exceptions; falls back gracefully and logs at debug so
  the real error surfaces from pd.read_excel.
- io.write_file: OSError / PermissionError on to_csv/to_excel wrapped
  with file path and Windows-aware "file may be open in another
  program" hint.
- dedup._parse_date: bare `except Exception` narrowed to
  (TypeError, ValueError, OutOfBoundsDatetime); failed values
  logged at debug for survivor-selection forensics.
- dedup._select_survivor: KEEP_MOST_RECENT now raises
  InputValidationError instead of silently falling back to keep_first.
- dedup.deduplicate: input validation errors are InputValidationError
  with operation/column/suggestion fields.
- format_standardize.from_dict: invalid FieldType for a column raises
  ConfigError naming the column AND the bad value AND listing valid
  values; same for date_order / phone_format / etc.
- format_standardize.from_file: OSError / JSON decode wrapped with
  path AND line/column where parsing failed.
- format_standardize.to_file: TypeError on json.dumps wrapped as
  ConfigError with the suspected source (extra_abbreviations).
- format_standardize._apply_field_type: dispatcher's "unknown field
  type" branch now raises AssertionError (it's an internal invariant,
  not user error — a new enum value was added without a branch).
- format_standardize._resolve_column_types: missing-column error now
  InputValidationError with a "check for typos / unparsed header"
  suggestion.
- format_standardize.standardize_dataframe: ensure_dataframe at entry.
- text_clean.clean_dataframe: ensure_dataframe at entry.
- config.to_strategies: invalid Algorithm/NormalizerType wrapped as
  ConfigError naming the strategy index AND the column.
- config.to_survivor_rule: invalid SurvivorRule wrapped as ConfigError
  listing valid values.
- config.from_file: OSError / JSON decode wrapped (mirror of
  StandardizeOptions.from_file).
- fixes.repair_mojibake: ImportError on ftfy now logged at info level
  with the underlying ImportError so a corrupt-package vs not-installed
  distinction is visible in the logs.
- normalizers.normalize_phone: phonenumbers.NumberParseException now
  logged at debug when the digits-only fallback drops extension /
  country-code information — gives a trail when matching results
  look wrong.

GUI / CLI surfaces:
- All 9 page handlers (`except Exception as e: st.error(...)`) now
  use format_for_user(), which renders DataToolsError fields nicely
  and falls back to "ClassName: message" for unrecognized errors.
- 2_Text_Cleaner and 3_Format_Standardizer additionally distinguish
  UnicodeDecodeError with an "re-save as UTF-8" suggestion before
  the generic handler.
- cli.py's "Error reading file" handler now uses format_for_user()
  and includes the input path in the prefix.

Tests:
- tests/test_errors.py — 22 new tests covering: base class formatting,
  stdlib inheritance, ensure_dataframe / ensure_choice helpers,
  wrap_file_read / wrap_file_write, format_for_user behavior, and
  end-to-end integration (missing file, missing dir, bad JSON, bad
  algorithm, bad enum, missing column).
- tests/test_audit_fixes.py + tests/test_io.py — updated 4 tests for
  the new exception types (InputValidationError replaces TypeError,
  FileAccessError extends OSError).

Full project suite: 1230 passed, 4 skipped, 17 xfailed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:35:42 +00:00
2eece6467d refactor: dedup, consolidate, harden public APIs across core modules
Closes 16 high-value findings from a parallel cross-module review.

Refactors:
- New src/core/_constants.py centralizes USPS street-suffix
  abbreviations, US state names, and 2-letter postal codes — one source
  of truth for both normalize_address (matching keys) and
  standardize_address (display formatting). Eliminates ~80 lines of
  duplicated dicts across normalizers.py and format_standardize.py.
- format_standardize.py: collapse 4 identical nested _err() helpers
  into one shared _err_or_passthrough() module function; drop a dead
  duplicate `return _err("not a phone number")` branch in
  standardize_phone.
- format_standardize.py: precompile per-locale month-name regexes
  (_MONTH_LOCALE_PATTERNS) and per-state-name regexes
  (_STATE_NAME_PATTERNS) at import time — they were rebuilt on every
  cell, a measurable hot path on million-row inputs.
- dedup.py: extract _is_missing(value) helper; one definition of
  "this cell is None / NaN / pd.NA" instead of two.
- fixes.py: extract _is_string_column(ser) helper; one dtype check
  instead of three duplicates across _apply_to_strings,
  _vectorized_translate, _vectorized_regex_sub.

Production-readiness:
- format_standardize.standardize_dataframe now logs a warning when
  more than 10% of typed cells are unparseable — surfaces the
  silently-broken-pipeline failure mode.
- StandardizeOptions.from_dict validates date_order / phone_format /
  currency_decimal / name_case / boolean_style / *_error_policy
  enum values up front, with a clear error message instead of a deep
  crash inside the per-cell function.
- StandardizeOptions.from_file and DeduplicationConfig.from_file wrap
  read + json.loads with descriptive OSError / ValueError messages
  including the file path.
- standardize_date(month_locales=...) validates locale codes against
  the available set instead of silently passing through unknown ones.
- io.read_file rejects chunk_size <= 0 (was silently failing inside
  pandas) and logs the resolved suffix + chunk_size at info level so
  data-pipeline runs are debuggable.
- io.read_file's FileNotFoundError gains explanatory context.
- io.write_file, text_clean.clean_dataframe, and dedup.deduplicate
  now reject non-DataFrame inputs with clear TypeError instead of
  cryptic pandas tracebacks downstream.
- dedup.deduplicate validates that survivor_rule=KEEP_MOST_RECENT has
  a usable date_column up front; the helper _select_survivor now
  raises (instead of silently falling back to keep_first) when called
  directly with bad arguments.
- dedup.deduplicate gains a structured no-op return when strategies
  is empty after auto-detection — preserves schema instead of crashing.
- analyze._detect_inconsistent_date_format narrows its bare except to
  (TypeError, ValueError) and logs a debug line so genuine bugs don't
  hide behind silent skip.

Tests:
- tests/test_audit_fixes.py grows by 11 cases covering the new
  validation paths (chunk_size, DataFrame guards, KEEP_MOST_RECENT
  date_column, enum validation, locale validation, JSON error wrapping).

Full project suite: 1208 passed, 4 skipped, 17 xfailed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:23:09 +00:00
b23a27d4e3 fix: cross-tool audit findings + alignment with format standardizer
Closes 12 bugs and 8 gaps surfaced by parallel audits across all core
modules, plus aligns the dedup-side normalizers with the new
format_standardize behavior where they had silently diverged.

Bugs (data integrity / correctness):
- dedup: NaN/None values matched as duplicates because str(None)='None'.
  Two rows with missing email silently merged.
- dedup: removed_df had 0 columns when nothing was removed; downstream
  code expecting matching schema broke. Now preserves column shape.
- dedup: ColumnMatchStrategy threshold accepted any value; out-of-range
  silently broke matching. Validated to [0, 100] in __post_init__.
- dedup: strategy referencing a missing column was silently skipped.
  Now raises ValueError listing available columns.
- fixes: replace_null_sentinels crashed on non-string sentinels (int/None
  from JSON payload). Coerced to str.
- fixes: _vectorized_regex_sub raised raw re.error on bad patterns. Now
  wraps as ValueError with clear message.
- io: detect_header_row mis-identified all-empty and metadata-only rows
  as headers (all([]) is True). Now requires ≥2 non-empty cells.
- config: from_dict crashed when JSON had unknown fields, breaking
  forward compat. Now filters to known fields.
- analyze: mixed-case email detector flagged all-None columns because
  str(None)='None' contains both N and one. Now drops NaN before stringify.

New features and gap closures:
- io: _detect_excel_header_row mirrors detect_header_row for Excel via
  openpyxl read-only; _read_excel uses it when header_row=None.
- io: write_file gains delimiter + encoding params; .tsv extension
  defaults to tab.
- normalizers: normalize_phone preserves extensions as ;ext=N suffix.
- normalizers: normalize_address folds spelled-out US state names to
  2-letter codes (California ≡ CA).
- normalizers: normalize_name drops surname particles (van, de, von)
  so "Charles de Gaulle" ≡ "Charles Gaulle" for matching.
- analyze: new _detect_inconsistent_date_format detector flags columns
  with mixed ISO/US/EU date shapes; routes to format standardizer.
- analyze: _NULL_LIKE recognizes "<na>" (pd.NA repr).
- analyze: duplicate-row finding renamed count → n_extra (rows that
  would actually be removed) with clarified description.
- dedup: group_confidence no longer falsely 100.0 when transitive group
  members lack a recorded direct pair; falls back to 100.0 only when
  truly no pairs were observed.
- dedup: MatchResult / DeduplicationResult docstrings clarify that
  row_indices refer to the input frame's positional index (output index
  is reset).
- text_clean: visualize_hidden_html(None) now returns None (matches
  visualize_hidden_text); strip_bom strips at most one BOM per call;
  sentence_case dead elif branch removed.

Tests:
- tests/test_audit_fixes.py — 28 regression tests, one or more per
  numbered finding, named after BUG/GAP/NIT tags so future readers
  can trace each test back to its audit.
- tests/test_fixes_unit.py — 26 isolated unit tests for previously
  integration-only fix functions (trim_whitespace, strip_nbsp,
  strip_zero_width, normalize_line_endings, clean_headers,
  repair_mojibake — last skipped if ftfy unavailable).
- tests/test_io.py — adds CSV / TSV / semicolon / UTF-8-BOM round-trip
  tests + Excel auto-header-detection tests.
- tests/test_normalizers.py — adds 8 tests for the alignment work
  above (phone extension, state names, particles).

Adds .claude/ to .gitignore (agent worktrees + local settings).

Full project suite: 1197 passed, 4 skipped, 17 xfailed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:11:57 +00:00
4adeb5c7f3 feat(format): per-cell standardizers + 199-row buyer corpus
Adds src/core/format_standardize.py — a per-cell standardizer for dates,
phones, emails, addresses, names, currencies, booleans — wired through
StandardizeOptions / standardize_dataframe with FieldType registry.

Includes:
- Date parser handles ISO/US/EU/longform/excel-serial/unix-timestamp/
  partial-precision/quarter notation; opt-in French/German/Spanish month
  dictionaries via month_locales.
- Phone via libphonenumber with extension preservation (;ext=N), 001
  international prefix handling, error sentinels for placeholders /
  multi-number cells.
- Email lowercase/trim/mailto/angle-bracket strip with optional
  --gmail-canonical mode.
- Address USPS abbreviation expansion or compression (expand=False per
  corpus § 6.3), state-name → 2-letter conversion, multi-line collapse,
  PO Box normalization, state-code preservation regardless of input case.
- Name handler: Mc/Mac/O'/D' inner caps, hyphen segments, particle
  lowercasing (von/van/de/da), comma-format reversal, period stripping
  for titles/suffixes/initials, PhD/MD acronym preservation, conservative
  mode for mixed-case input.
- Currency: auto-detect EU vs US separators, space-thousands, Swiss
  apostrophe, accounting parens, optional ISO code preservation, error
  sentinels for percentages/ranges/word-values/ambiguous separators.
- Per-domain error_policy ("passthrough" | "sentinel") for surfacing
  malformed values as <error: reason> per corpus § 0.3.

Test corpus from Business/DataTools/test-cases-format-cleaner copied to
test-cases/format-cleaner-corpus/ — 7 fixtures plus FORMATS-CASES.md.
tests/test_format_standardize_corpus.py drives all 199 rows through the
per-cell standardizers; 0 xfailed.

Wires the GUI page (3_Format_Standardizer.py) to "Ready" status.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:11:24 +00:00
3f007ef3d6 feat(gui): 1 GB upload cap + delimiter / encoding diversity caption
Streamlit's default file_uploader footer reads "Limit 200MB per file —
CSV, TSV, XLSX, XLS" which contradicts the 1 GB efficiency target shipped
in 438bc0f and codified in docs/REQUIREMENTS.md §1.1.

Three changes:

1. .streamlit/config.toml — set [server] maxUploadSize = 1024. Footer
   now reads "Limit 1024MB per file".

2. upload_and_analyze_section (home page) — adds an explicit caption
   above the uploader stating size limit, supported formats, the four
   auto-detected delimiters, and the 13 auto-detected encodings (with
   the Review-page override as the safety net).

3. pickup_or_upload (every tool page that falls back to its own
   uploader when no home-page upload is present) — same caption,
   only rendered when the upload accepts CSV/TSV/XLSX/XLS so JSON
   schema / config uploaders aren't decorated.

Test suite: 765 passed, 17 xfailed (no regressions). Home + Review +
Deduplicator pages all serve HTTP 200 under the new config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 21:23:21 +00:00