34 Commits

Author SHA1 Message Date
6df726e69e demo: reconstruct sales demos for an accounting audience
Replaces the Shopify / RevOps / Bookkeeper demo trio with three accounting
personas that share one buyer, each entering through a workflow where a
messy export costs money — all running the same saved 4-step pipeline:

- bank_reconciliation.csv (Bookkeeper): 26 -> 20 rows, 6 double-posted
  transactions caught after date+amount standardization.
- vendor_1099.csv (AP / 1099): 24 records -> 8 vendors, 7 missing EINs
  recovered via dedup merge — the 1099-complete story.
- ar_open_invoices.csv (AR): 26 -> 21 rows, 5 double-entered invoices
  removed, blank status backfilled from the twin row.

Every number is validated against the live engine and pinned by
tests/test_demo_pipelines.py (read path mirrors app_demo._load_demo:
dtype=str, keep_default_na=False). Rewires src/gui/app_demo.py PERSONAS
(keys bookkeeper / ap-1099 / ar-aging, accounting H1/sub/CTA) and rewrites
docs/DEMO-PLAN.md sections 3/4/7 with the validated outcomes.

(Repo hygiene forced by a partial-clone gap: finalizes the already-deleted,
unreferenced samples/messy_text.csv whose blob was unrecoverable.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 18:52:39 +00:00
38616d69e2 test(pipeline): complete automated test suite for the pipeline feature
Adds ~115 tests pinning the Automated Workflows feature end to end:

- tests/test_pipeline.py (+43): per-adapter summary correctness on known
  inputs, multi-step data flow, error stop/continue contract, empty /
  single-column / all-disabled edges, dict+file serialization round-trips,
  recommended_pipeline(include=…), and a synthesized demo integration run.
- tests/test_cli_pipeline.py (new, 21): --recommend, dry-run-by-default,
  --apply output CSV + audit JSON, --steps, --strict abort, arg validation,
  --continue-on-error vs halt, and a save→load round-trip. Invokes the Typer
  app directly to bypass the license guard (house pattern).
- tests/gui/test_pipeline_builder.py (+9): reorder ▲/▼, disabled edge
  buttons, disabled-step persistence across reorder, restore-recommended,
  Advanced JSON export/import, and per-tool Configure panels emitting the
  correct option dicts (AppTest).
- tests/gui/test_pipeline_phrasing.py (new, 30): step_phrase/step_status and
  the adapter-key→friendly-name bridge as pure functions, incl. pluralization,
  column prose, and warn/error status derivation.

Full suite: 2565 passed, 91 skipped. No product bugs surfaced. Documents the
coverage in docs/DEVELOPER.md (test tree + a pipeline-coverage note).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 18:31:15 +00:00
fd9606c67b build: drop the local Python release method, return to CI-only installer builds
Removes the single-command Python packaging method (build/make_release.py
+ build/build_portable_zip.py + build/macos/build_zip.sh) and the portable
.zip artifacts it produced. Release builds go back to the original GitHub
Actions process: the CI matrix builds one installer per platform (.dmg /
.exe / .AppImage) on tag push and attaches them to a GitHub Release.

Tesseract OCR bundling is preserved: the fetch helpers the workflow depends
on (fetch_tessdata, fetch_tesseract_for_platform) are extracted into a
standalone build/tesseract.py, which build.yml now imports.

Docs (README, build/README, DEVELOPER, TECHNICAL, USER-GUIDE, vendor README,
es translations) updated to drop the portable-zip flavor and point at the
new module.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 17:47:36 +00:00
b703911df3 docs: reflect bundled Tesseract on every install surface
- NEW LICENSE_TESSERACT.txt at the repo root: header noting it covers
  the bundled Tesseract OCR binary (Apache 2.0, upstream
  tesseract-ocr/tesseract, copyright Google + contributors) and the
  eng.traineddata from tessdata_best (also Apache 2.0). Clarifies
  DataTools itself remains proprietary. Full canonical Apache 2.0
  license text included.
- README.md + README.es.md (Download section): bumped size estimate
  ~200 MB → ~300 MB, added a short paragraph stating Tesseract OCR
  is bundled (no separate install required), with a link to the new
  license file.
- docs/USER-GUIDE.md + docs/USER-GUIDE.es.md (§1.6 System
  requirements): bumped disk estimate, added a paragraph stating
  Tesseract 5.5 + eng.traineddata ship inside every installer /
  portable / AppImage, with a source-install fallback hint pointing
  developers to DEVELOPER.md.
- docs/DEVELOPER.md: new "PDF Extractor — bundled Tesseract" section
  documenting the runtime layout (sys._MEIPASS / tesseract / …),
  discovery order, source of bytes (build/vendor/tessdata + per-
  platform fetch in make_release.py), version pin, update recipe.
- docs/TECHNICAL.md: new §3.10 "Bundled Tesseract (PDF Extractor
  OCR)" — short version of the discovery order for the build
  pipeline section.
- build/README.md: distribution-outputs paragraph now lists
  Tesseract among bundled deps with the ~250-300 MB estimate; new
  "Tesseract bundling" section: layout diagram, resolver order,
  source of bytes + 5.5.0 pin, update steps, license-file ref.

Out-of-scope gaps noted by the docs sweep:
- docs/FUTURE-TOOLS.md §D still describes Tesseract bundling as a
  high-risk packaging headache; now superseded. Worth a one-line
  "(resolved — bundled as of v1.x)" callout in a future pass.
- USER-GUIDE §2 "What's included" table doesn't list PDF Extractor
  at all (it shipped in b8aff86…967d3f6). Separate gap to close.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:20:50 +00:00
4d8513b1a3 docs: cover help popover, +/- nav indicators, render_tool_header
User-facing docs (USER-GUIDE en+es, README en+es):
- New short paragraph under §3.1 GUI noting the in-tool Help button
  on every detail page, what it contains (When to use / Steps /
  Examples / Tip), and that content lives in tools.<id>.help_md.
- One-line note in the README tool tables pointing at the same.
- Mention the sidebar +/- nav indicators replacing Streamlit's
  default Material Symbols chevron.

Developer docs:
- DEVELOPER: new "Tool page header" subsection documenting
  render_tool_header(tool_id), the help_md markdown skeleton, and
  the fallback to help.missing_body when a tool's help is absent.
  Update i18n authoring rules to list help.* keys and the per-tool
  help_md field alongside name/description/page_title/page_caption.
- TECHNICAL: new §10c documenting the sidebar nav indicator swap —
  CSS in _HIDE_CHROME_CSS plus _SWAP_NAV_SECTION_INDICATOR_JS
  injected through the hide_streamlit_chrome() iframe bundle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:08:01 +00:00
2bd94c4441 docs: document installer + portable downloads in en/es
Repo READMEs now show both download flavors side-by-side with
first-launch warnings (SmartScreen, Gatekeeper) and link to the
deeper walkthrough.

USER-GUIDE §1 rewritten from a 9-line stub into six subsections:
- §1.1 Windows: installer (5 steps) + portable (4 steps)
- §1.2 macOS:   DMG (5 steps incl. right-click-Open) + portable
- §1.3 Linux:   AppImage flow (unchanged)
- §1.4 First-launch: port selection, localhost binding, browser open
- §1.5 How the GUI works
- §1.6 System requirements

§6 Troubleshooting picks up portable-specific items: Safari unzip
quirks, antivirus quarantine on Win portable, license file location.

docs/README and Spanish mirrors updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 19:30:28 +00:00
ee0b1f6f6b docs: design notes for future PDF→CSV tool
New ``docs/FUTURE-TOOLS.md`` captures post-launch tool ideas with a
consistent shape — What / Why / Can we ship now / Approach / GUI
sketch / Effort / Risks / Ship criteria. Resting place for things
the new-tool freeze in ``PLAN.md`` §2.1 refuses to build but that
keep coming up.

First entry: **#10 PDF → CSV extractor** (bank statements et al.).

Key facts captured:

- **Current state**: no PDF infrastructure exists. Zero PDF
  dependencies in requirements.txt; zero PDF-touching code under
  ``src/``. The only "PDF" string in the codebase is the planned-
  output copy for the Quality Check tool, unrelated to extraction.
- **Library picks**: pdfplumber as the extraction core (BSD-3,
  no native compiler, gives coordinate-aware text), Tesseract via
  pytesseract as the OCR fallback for scanned PDFs,
  streamlit-drawable-canvas as the region-picker component.
- **GUI sketch**: user draws a header strip + a row template on a
  rendered page; the tool applies that template across N pages,
  saves the template by layout fingerprint for next month's
  statement, emits CSV.
- **Effort phased A–E**: 3–4 weeks for a text-only MVP; 6–10
  weeks for a polished version with multi-page template recall;
  +2–3 weeks if scanned-PDF OCR is required.
- **Difficulty**: medium-hard. The pieces are well-trodden; the
  combination (region selection that persists across pages and
  across documents with similar layouts) is where the engineering
  goes.
- **Ship criteria**: ≥1 paying customer + ≥3 paid or ≥5 demo
  emails asking for PDF extraction + the bookkeeper niche
  converting at least one customer first. None have fired.

Cross-references added:

- ``docs/REQUIREMENTS.md`` §11: pointer to FUTURE-TOOLS.md for
  parked tool ideas, with a one-paragraph summary of #10.
- ``docs/PLAN.md`` §2.1: notes that the freeze parks future tools
  in FUTURE-TOOLS.md and explicitly names #10 as the current
  highest-pressure entry.
- ``docs/NEXT-STEPS.md`` Phase 5 "what NOT to build" table: a new
  row for the PDF tool tied to the same ship-trigger language.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 01:52:42 +00:00
db5ec084da docs+code: rename tool labels everywhere
Sweep follow-up to 93e43fc. Display labels now consistent across docs,
landing pages, CLI output, code comments, docstrings, and test prose.
Five parallel surfaces touched:

- docs (EN + ES): README, USER-GUIDE, CLI-REFERENCE, and 11 internal
  design/planning docs
- landing pages: index + bookkeeper/revops/shopify-pet
- src: CLI module docstrings, _TOOL_DISPLAY dicts in cli_analyze.py
  and gui/components/_legacy.py, core module headers, every tool
  page's module docstring
- tests: class/method/module docstrings and section-header comments
- test-cases READMEs

Page slugs (1_Deduplicator etc.), tool_id strings (01_deduplicator
etc.), Python class names (TestDeduplicatorWorkflow, FeatureFlag.*),
URL paths, anchor IDs, CSS classes, and asset filenames were left
intact since they're code identifiers / structural references.

All 2033 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 19:50:09 +00:00
624f99653e docs(arch): end-to-end system + tech-stack diagrams
New ARCHITECTURE.md pulls the desktop app (TECHNICAL.md) and the
license server (LICENSE-SERVER.md) into a single picture — the two
were never reconciled into an end-to-end view before.

Contents:
  §1. System diagram (ASCII) showing operator laptop, license
      server stack (nginx → FastAPI → Postgres), Postmark, Gumroad,
      and the buyer's machine — with the three primary flows
      (sale, manual mint, offline activation) traced through it.
  §2. Tech stack diagram, layered: desktop / server / operator /
      external SaaS, with version pins.
  §3. Trust + isolation boundaries table — what crosses each one
      and what the threat model is.
  §4. "Where things are stored" — paths, tables, files.
  §5. Pointers to the deeper per-component docs.

ASCII over Mermaid since the repo's Gitea version is unknown and
plain text renders in every viewer / IDE / raw `cat`.

LICENSE-SERVER.md status flipped from "design proposal, not built"
to "deployed (PR 1 + PR 2 code merged)" — that was stale since
the PR 1 deploy yesterday.

TECHNICAL.md and ADMIN.md gain one-line pointers to ARCHITECTURE.md
so people land at the unified view when looking for "how does it
all fit together".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:59:05 +00:00
86ad21db79 docs(license): PR 2 deploy + operator instructions
ADMIN.md gains a "Running a Gumroad webhook" section: how the URL
secret works, how to add a SKU to products.yaml, how to inspect
gumroad_events (recent activity + failures-only queries), how to
replay a failed delivery, and how to test without buyers via
Gumroad's "Send Test Ping" button.

The deployed-vs-queued matrix flips Gumroad + Postmark to
"code merged, deploy pending" so it's clear the bits exist on
main but the live box still runs PR 1.

SETUP-LICENSE-SERVER.md §3 commits the eventual compose.yml shape
with PR 2 environment + secrets lines included but commented out,
ready to uncomment at deploy time. The §3 chown step already covers
the new secret files because it uses `chmod 400 secrets/*` /
`chown 10001:10001 secrets/*`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:33:53 +00:00
b5cd74d474 docs(admin): live deployment section for the running license server
Documents the post-deploy state of PR 1: live URLs (datatools and
licenses subdomains on unalogix.com), the on-box filesystem layout
under /srv/datatools-license/, where the admin token lives and how
to retrieve / rotate it, the laptop-side SSH-tunnel + admin_cli
mint workflow, inspection commands (logs, psql, container status),
restart / rebuild procedures, manual backup commands until cron
lands, the production-key rotation outline, and a deployed-vs-queued
capability matrix.

Secrets are NEVER pasted into this doc — the admin token's literal
value lives only on disk (mode 400, UID 10001). Committing it to
git would mean permanent leakage via history even after rotation;
documenting its location + rotation procedure achieves the same
operational outcome without the residual exposure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:19:57 +00:00
1cf69dd23b docs(license): runbook fixes from PR 1 self-host deploy
Two real-world footguns surfaced during the first live deploy:

1. docker-compose's uid/gid/mode long-form on file-based secrets is
   silently ignored — that's a swarm-mode-only feature. The
   container app user (UID 10001 from the Dockerfile) cannot read
   a mode-400 file whose host UID it doesn't match. Fix is to
   chown the secret files to 10001 directly; host-side access
   control stays gated by the parent dir's mode 750.

2. nginx 1.24 (Ubuntu 24.04 default) rejects the standalone
   "http2 on;" directive (that arrived in 1.25). Use the legacy
   "listen 443 ssl http2;" combined form. Noted prominently so the
   next deploy doesn't trip on it.

Also realigned §3's compose example to what actually got deployed
for PR 1 — only pg_password + admin_token secrets, postmark /
gumroad / license_privkey commented out as PR 2 / production-key
follow-ups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:17:05 +00:00
4179cb5156 docs(license): self-hosted server runbook + multi-tenancy plan
Adds SETUP-LICENSE-SERVER.md — end-to-end install runbook for the
license server on the existing invixiom box (Ubuntu 24.04). Covers
DNS, system packages, Postgres + API in Docker, dedicated system
user, secrets layout under /srv/datatools-license/secrets (mode
400), nginx config in a separate sites-available/unalogix file,
Let's Encrypt cert issuance, smoke tests, backups, monitoring, key
rotation, and rollback.

Multi-tenancy is explicit at every layer: separate DNS zone
(unalogix.com vs invixiom.com), separate nginx file, separate TLS
cert, dedicated backend ports (8090 for the API, 5433 for Postgres,
both localhost-only), separate docker compose project and volume.
No invixiom service is touched.

LICENSE-SERVER.md updated: hosting choice moved from "Fly.io /
Render" (rejected) to self-hosted (decided). Points at the new
runbook for ops specifics.

ADMIN.md pointer table updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 22:57:53 +00:00
52e04f63a9 docs(license): design proposal for online issuance & record-keeping
Forward-looking design doc — not implemented. Describes the smallest
useful server that replaces the manual mint-and-paste workflow:
Gumroad webhook → Mint API (KMS-held private key) → Postgres
licenses table, plus a self-service renewal/re-delivery portal.

The desktop app is deliberately untouched across all three migration
phases: activation stays fully offline and continues to verify blobs
against the embedded pubkey, preserving the DECISIONS.md §9b promise
that buyer machines never phone home.

Schema is intentionally a superset of the local issuance JSONL log
(ADMIN.md), so Phase 1 migration is a flat INSERT per row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 22:26:24 +00:00
23c51fd759 feat(license): local issuance log for minted blobs
generate_license.py now appends every minted license to
~/.datatools-creator/issued.jsonl (overridable via env). This is the
creator-side system of record until the server-side flow lands.

The full blob is stored alongside name/email/tier/expiry so buyers
who lose their delivery email can be re-served without re-minting.
File is created mode 600 and lives outside the buyer-facing
~/.datatools/ dir so it never gets bundled into a shipped install.

Log failures are non-fatal (warning to stderr) — the mint already
succeeded by the time we try to log, and forcing a re-mint after a
log error would invalidate any device the buyer had activated. Pass
--no-log for test mints.

ADMIN.md adds a "Customer record-keeping" section with the path,
schema, jq one-liners, and migration note pointing at the upcoming
LICENSE-SERVER.md design doc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 22:25:19 +00:00
65e17e0a70 docs(admin): internal license operations reference
Creator-only ADMIN.md covering keypair generation, blob minting,
dev vs. production key model, tier matrix, and recovery if the
private key is lost. Includes a TL;DR for minting a dev license
against the in-tree keypair.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 22:10:16 +00:00
e534fb4989 sec(license): Ed25519 sigs + production-safe tripwire
Two coupled hardening upgrades.

1. Asymmetric signatures (HMAC → Ed25519)

The previous HMAC scheme used a symmetric secret that any motivated
reverse engineer could pull out of the shipped binary and use to
mint blobs for any tier / name / email. With Ed25519, the binary
ships only the public verification key; the signing key never
leaves the seller's environment, so binary compromise no longer
yields forgery.

- src/license/crypto.py rewritten around
  cryptography.hazmat.primitives.asymmetric.ed25519. Same public
  API surface (sign/verify/encode_blob/decode_blob), same canonical
  JSON encoding — drop-in for the manager / cli / GUI layers.
- DATATOOLS_LICENSE_PRIVKEY (seller-side) and
  DATATOOLS_LICENSE_PUBKEY (build-time) env vars supply the keys;
  the in-source dev keypair (src/license/_dev_keypair.py)
  deterministically derives from a seed phrase for repro builds and
  tests.
- Blob prefix bumped DTLIC1: → DTLIC2:. Decoding a DTLIC1 blob
  surfaces a clear "old format" error rather than a confusing
  signature mismatch.
- scripts/generate_keypair.py mints fresh production keypairs for
  the seller (run once, stash the private key offline). Adds
  cryptography>=41,<46 to requirements.txt (was an undeclared
  transitive dep).

2. Production-safe tripwire

assert_production_safe() refuses to boot a frozen / shipped build
when either:

- DATATOOLS_DEV_MODE=1 is set (would unconditionally bypass every
  license check — fine in source/test but catastrophic in a buyer
  install).
- The active verification key is still the embedded dev key (the
  build pipeline forgot to set DATATOOLS_LICENSE_PUBKEY).

No-op in source / pytest runs (sys.frozen is unset) so test
fixtures and dev workflows keep working without ceremony. Called
from src/cli_license_guard.guard() and from hide_streamlit_chrome
— so it fires on every CLI invocation and every GUI page load.

Tests: 49 license-layer unit tests (was 40); added Ed25519
wrong-key rejection, dev-keypair seed pin, blob v2 prefix, v1
rejection with clear message, and four production-safe scenarios
(no-op in source, fires on DEV_MODE in frozen, fires on dev key in
frozen, passes in frozen with prod pubkey). Total: 2024 → 2033.

Docs (REQUIREMENTS §17a, DEVELOPER licensing recipe, DECISIONS
§9b + decision log) updated with the new threat-model write-up,
key-storage workflow, and tripwire behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:34:48 +00:00
d32b58e61a feat(license): add Lite SKU; remove user-facing free trial
Two coupled changes:

1. Lite tier
   - New Tier.LITE in src/license/schema.py.
   - FEATURES_BY_TIER[Tier.LITE] = {Deduplicator, Text Cleaner,
     Format Standardizer}. The three universally-useful tools that
     cover the most common bookkeeping / RevOps / Klaviyo prep
     workflows. Other six tools require Core.
   - i18n: license.tier_lite, license.feature_locked_title,
     license.feature_locked_body, license.upgrade_link,
     license.status_locked (en + es).
   - Per-tool feature gate at every GUI tool page
     (require_feature_or_render_upgrade) and every tool CLI
     (guard(feature=...)). A locked tool renders an upgrade
     prompt + Manage-license button (GUI) or exits with code 2
     (CLI).
   - Home grid: tool cards the user's tier doesn't unlock get a
     red 🔒 Locked badge in place of green Ready.

2. Trial removed
   - Activation form's "Start 1-year trial" button removed.
   - license_cli's `trial` subcommand removed.
   - activation.trial_button / activation.trial_help i18n keys
     dropped (pack parity test stays green).
   - Tier.TRIAL stays in the enum (back-compat with any field-
     tested trial licenses); LicenseManager._mint stays internal
     for tests and the seller's key generator.
   - Decision logged in DECISIONS §9b: a 1-year all-features
     trial undercuts paid Lite; paid-only keeps tier economics
     clean.

Tests (+29 net): +17 Lite-tier unit/guard tests + 13 Lite-tier
GUI tests + 1 trial-absent assertion - 2 trial CLI tests - 1
trial GUI button test. Total: 1995 → 2024.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:19:30 +00:00
e612c751a8 docs(license): document activation flow, tier system, dev bypass
- USER-GUIDE EN + ES gain a §0 "First launch — activation" section
  covering paid blob activation, 1-year trial, renewal, file
  location, and device-swap.
- REQUIREMENTS §17a "Licensing" — storage path, activation model,
  lifetime, tier list, dev bypass env var. Test count: 1995.
- DEVELOPER gains a "Licensing" recipe in the Extension recipes
  section: public API, feature-flag add, tier add, minting via the
  creator-only script.
- DECISIONS §9b — log the offline-HMAC choice with the threat-model
  trade-off (motivated piracy not stopped; honor-system + 30-day
  refund covers casual sharing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:54:30 +00:00
070e3c9f06 docs(gui): document the new GUI test layer
REQUIREMENTS §16 updates the test count (1777 → 1916) and breaks out
the GUI subset. DEVELOPER's Tests section gains the 'gui' marker
recipes and the new tests/gui/ tree under test layout, plus a short
'GUI test layer' explainer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:13:40 +00:00
d0423a8912 docs(perf): publish the dedup/parallel/lazy-copy wins and limits
REQUIREMENTS §10 carries the new measured numbers and the dedup
blocking trade-off note. DEVELOPER known-limitations is rewritten to
reflect that exact-only dedup is now O(n), fuzzy-blocking is opt-in,
and column-parallelism is scaffolding for free-threaded Python.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:54:25 +00:00
e5f632bcd6 docs(perf): publish 1.5 GB target and the new measured throughputs
REQUIREMENTS §10 reflects the post-optimisation numbers and the
known O(n²) dedup match step (flagged for a future blocking pass).
en/es upload-limit copy and uploader help now say 1.5 GB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:37:26 +00:00
318b9b45dc docs(i18n): ship Spanish translations of buyer-facing docs
Adds README.es.md, docs/README.es.md, docs/USER-GUIDE.es.md, and
docs/CLI-REFERENCE.es.md mirroring the English client-facing set.
Each English doc gains a one-line language-switch banner pointing at
its Spanish counterpart; the docs index advertises both language sets
in the buyer-facing section. Internal docs (TECHNICAL, DECISIONS,
REQUIREMENTS, BUSINESS, RECOVERY) stay English-only by design — they
don't ship with the product.

The CLI itself emits English only, so CLI-REFERENCE.es.md notes that
flags and values are language-invariant while translating the prose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:21:18 +00:00
38011872e1 docs(i18n): document language packs across user, dev, and marketing docs
README + USER-GUIDE describe the sidebar picker and current coverage
(home + shared chrome, per-tool bodies pending). DEVELOPER gains a
how-to for adding packs and keys with the parity-test guarantee.
TECHNICAL §10b records the in-house-JSON architecture and locks in the
no-gettext decision (also logged in DECISIONS). REQUIREMENTS reflects
the new interface surface and updated test count. COPY.md adds a
"Language claim" slot so landing/email work can pick it up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:16:24 +00:00
e1f364f010 feat: Tier B operator scaffolding — bundle, copy SoT, posts, emails
Pick up and finish yesterday's cut-off Tier B pass.

- build/: PyInstaller scaffold (datatools.spec + launcher.py +
  hook-streamlit.py + README) — folder-mode bundle, locked
  127.0.0.1, per-OS recipe
- marketing/COPY.md: single source of truth for every customer-facing
  string — landing H1/sub/CTAs, demo CTAs, email subjects, Gumroad
  listing, banned phrases
- marketing/community-posts/: 9 drafts (3 posts × 3 niches:
  bookkeeper, revops, shopify-pet) — story / tip / soft-offer
- marketing/emails/: 18 drafts (Gumroad delivery + 5-touch
  onboarding × 3 niches), per-niche segmentation guidance
- docs/NEXT-STEPS.md: flip 2.2 / 2.4 / 3.1 / 3.4 to done with
  pointers to the new assets; add Phase 0 inventory rows
- .gitignore: narrow `build/` ignore so PyInstaller spec + launcher
  + hooks get tracked, only generated artifacts (build/build/,
  build/__pycache__/, build/dist/) stay ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 14:04:37 +00:00
966af8ef94 feat: 3 new tools, format streaming, distribution-ready demo + landing pages
Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:31:26 +00:00
d18b95880d feat(format-i18n): broaden international coverage across all domains
Closes ~17 high-value international gaps surfaced by parallel review.
Adds 93 regression tests; full project suite now 1323 / 0 / 17 (passed
/ failed / xfailed).

DATES
- Adds Portuguese, Italian, Dutch, Russian month dictionaries to the
  opt-in ``month_locales`` set (now: en, fr, de, es, pt, it, nl, ru).
- Adds localized weekday recognition for those locales — "Lundi",
  "Montag", "lunedì", "понедельник", etc. all strip cleanly before
  format matching.
- New CJK separator normalization: Japanese ``2024年01月15日`` and
  fullwidth digits ``2024/01/15`` fold to ASCII before parsing.
- New named-timezone resolution: EST/PST/JST/CET/IST/GMT/etc. map to
  fixed UTC offsets via ``_NAMED_TZ_OFFSETS`` so the trailing TZ
  doesn't block format matching.
- New ISO 8601 extended formats: week date (``2024-W03-1``) and
  ordinal date (``2024-015``), plus RFC 2822 mail-header form
  (``Mon, 15 Jan 2024 10:30:00``).
- New ``two_digit_year_cutoff`` parameter on ``standardize_date()`` —
  defaults to Python's stdlib 69; lower it for birth-year columns
  where most subjects were born ≤ 1999.

NAMES
- Particles set extended with Arabic patronymic markers (bin, ibn,
  bint, abu, abd, al, al-, el-) and Hebrew (ben, bat, ha, ha-).
- Title set extended with German (Herr, Frau), French (M., Mme,
  Mlle), Spanish (Sr., Sra., Srta., Don, Doña), Italian (Sig., Sig.ra,
  Dott.), Portuguese.
- Acronym map extended with international academic credentials
  (Dipl, Ing, Mag, Habil, MSc, BSc, LLB, LLM).
- New East Asian honorific suffix handler: ``Tanaka-san``,
  ``Lee-ssi``, ``Park-nim`` keep the suffix lowercase after the
  hyphen instead of being title-cased into ``Tanaka-San``.
- Hyphenated-segment handler now keeps Arabic prefixes ``al-`` /
  ``el-`` lowercase per Arabic transliteration convention.
- New ``family_first`` parameter on ``standardize_name()`` and matching
  ``name_family_first`` field on ``StandardizeOptions`` — set
  per-column for East Asian data to skip Western comma-format reversal
  (``Kim, Min-jae`` stays ``Kim, …`` instead of becoming ``Min-jae Kim``).

CURRENCY
- Symbol map extended: ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN),
  ₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS) — covers SE Asia, Africa,
  Eastern Europe, Latin America gaps.
- ISO 4217 code list extended from 23 to ~50: SAR, AED, QAR, KWD,
  BHD, OMR, ARS, CLP, COP, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS,
  KES, HUF, CZK, RON, UAH, KZT, etc.

EMAIL
- New BIDI / RTL override stripping (``standardize_email``):
  U+202A-U+202E and U+2066-U+2069 stripped from every email. These
  are a known phishing vector — ``alice‮@example.com`` displays as
  ``alice@elpmaxe.com`` to RTL-aware renderers.

ADDRESS
- Canadian provinces: 13 codes + names → 2-letter (Ontario → ON).
- UK postcode pattern recognition (``SW1A 2AA`` shape).
- Australian states: 8 codes + names (NSW, VIC, QLD, … + full names).
- German Bundesland: 16 codes + names (Bayern → BY, etc.).
- International PO Box variants: ``Postfach`` (DE), ``Boîte postale``
  (FR), ``Apartado`` (ES), ``Casella postale`` (IT), ``Caixa postal``
  (PT) — all fold to canonical ``PO Box``.
- ``_INTL_STATE_CODES`` now combines US/CA/AU/DE codes; the position
  check that preserves state codes regardless of input case applies
  to all four jurisdictions.
- ``_is_state_code_position`` postal pattern broadened to recognize
  US ZIP, AU 4-digit, CA first half, and UK outward code.

CONSTANTS
- ``src/core/_constants.py`` gains: ``CA_PROVINCE_CODES`` /
  ``CA_PROVINCE_NAMES``, ``AU_STATE_CODES`` / ``AU_STATE_NAMES``,
  ``DE_STATE_CODES`` / ``DE_STATE_NAMES``, ``POSTAL_PATTERNS``
  (us/ca/uk/de/au/fr), ``INTL_PO_BOX_PATTERNS`` (per-language regex),
  ``INTL_STREET_SUFFIXES`` (de/fr/es/it/uk dictionaries — ready for
  use when address takes a `country_hint` parameter in a future pass).

DOCS
- TECHNICAL.md §11.3 domain table updated with the new handling per
  domain plus a new "International coverage" sub-section listing the
  supported locales / symbols / jurisdictions.

DEFERRED (out of scope or rare)
- Alternative calendars (Japanese era, Hijri, Hebrew, Buddhist) —
  corpus § 3.5 marks out of scope.
- Persian/Arabic-Indic digit conversion — rare in tabular data.
- Trailing-minus RTL currency convention.
- Punycode ↔ Unicode IDN normalization.
- Mixed-country phone column auto-detection (user can override
  ``default_region`` per column).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 03:06:03 +00:00
abb720997e docs: tight, scannable rewrite — every item earns its place
Refactors all 10 docs (README, USER-GUIDE, CLI-REFERENCE, REQUIREMENTS,
TECHNICAL, DEVELOPER, BUSINESS, DECISIONS, RECOVERY, docs/README) from
prose-heavy to bullet-heavy + table-heavy. Same information density,
significantly less reading load.

Net: 2600 → 1652 lines (~37% reduction) WHILE adding the new content
that landed since v1.6:

- Format Standardizer (3rd Ready tool)
- 199-row buyer corpus
- src/core/errors.py structured hierarchy + ensure_dataframe /
  ensure_choice / wrap_file_read|write / format_for_user helpers
- src/core/_constants.py shared USPS/state lookup tables
- Cross-tool audit fixes (NaN matching, removed_df schema, validation,
  enum-bounds checks, forward-compat config)
- Per-domain error_policy across format standardizers
- Inconsistent-date-format detector
- Excel header-row auto-detection + write_file delimiter param

Per-doc changes:

- README.md (175 → 71): 9-tool table at top, status column, 3 CLI
  entry points listed, dropped repeated marketing prose.
- docs/README.md (38 → 27): pure index — buyer-facing vs creator-only
  split + version footer.
- USER-GUIDE.md (208 → 118): tool table replaces script descriptions,
  troubleshooting compressed to bullets, gate explanation tightened.
- CLI-REFERENCE.md (451 → 235): collapsed flag tables, removed
  redundant intro text, kept full recipes section.
- REQUIREMENTS.md (146 → 129): 18 numbered sections (was 17), added
  §18 Error Handling, formatting tightened to single-line entries.
- TECHNICAL.md (570 → 350): collapsed §3 build pipeline tables, merged
  redundant §3.5-3.7 OS sections, added §7 (Error handling) +
  §11.3 (Format Standardizer spec) + §11.4-11.7 (analyzer / gate /
  Review page / repair_bytes promoted from §10.2.x sub-numbering).
- DEVELOPER.md (285 → 161): module map table replaces per-file prose,
  extension recipes condensed, new §Errors covers when to use each
  hierarchy class.
- BUSINESS.md (278 → 225): collapsed prose to tables (use cases,
  competitive landscape, costs, risks); honest-status updated.
- DECISIONS.md (269 → 189): scoring rubric + GUI matrix preserved,
  decision log compressed to single-line entries, added v1.6 entries
  (Format Standardizer Ready, errors module).
- RECOVERY.md (180 → 147): rebuild steps as numbered + tabular,
  external dependencies as one table, recovery priorities tightened.

No information removed; redundancy compressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:49:29 +00:00
3dd924474c docs: short-form numbered requirements list
New docs/REQUIREMENTS.md catalogs every shipped capability in 17 numbered
categories — file handling, input/output encodings, delimiters, line
endings, detectors, finding schema, confidence tiers, decisions,
performance targets (1 GB), tools, gate behavior, interfaces, platforms,
deps, test coverage, privacy. Linked from README and USER-GUIDE so a
buyer / integrator can scan compliance in under a minute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 21:19:21 +00:00
82d7fef21e feat(gate): CSV-normalization gate with confidence-tiered findings
Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.

Core (src/core/):
  - analyze.py: Finding gains confidence, fix_action, pre_applied; new
    detectors for encoding_uncertain, encoding_decode_failed; new top-
    level encoding_override parameter.
  - fixes.py: registry of fix algorithms keyed by fix_action id.
  - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
    the NormalizationResult / Decision dataclasses the gate consumes.
  - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
    transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
    and normalizes line endings (fixes bare-CR parser crash); empty file
    handled gracefully instead of EmptyDataError traceback.

GUI (src/gui/):
  - pages/0_Review.py: gate page with per-finding decision controls,
    encoding override picker (16 codepages + custom), and Advanced output
    options (encoding, delimiter, line terminator) on the download.
  - components.py: require_normalization_gate() helper.
  - pages/1-9: gate guard wired on every tool page.

Test corpora:
  - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
    UTF-8 files + manifest, synced from Business/DataTools.
  - test-cases/text-cleaner-corpus/test_data/17: synced malformed input
    (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
  - test_normalize.py (48): finding fields, fix registry, auto_fix scope,
    decision paths, gate idempotency, output-options helper.
  - test_encodings_corpus.py (90, 16 xfailed): parametric detection +
    decode + analyzer-no-crash sweep against the manifest.
  - test_analyze.py: encoding override + encoding_uncertain detectors.
  - test_corpus.py: pre-parse repair in the strict reader.

run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:35:27 +00:00
54f92ae47e feat: implement text cleaner (script 02) with CLI, GUI, and tests
Builds 02_text_cleaner.py from stub to working: character-level hygiene
for CSV/Excel inputs covering trim, whitespace collapse, smart-character
folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char
strip, line-ending normalization, and per-column case conversion. Three
presets (minimal/excel-hygiene/paranoid) keep the buyer surface small.

- src/core/text_clean.py: pure helpers + CleanOptions/CleanResult +
  clean_dataframe with dtype-safe column selection
- src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape
  (dry-run by default, --apply writes cleaned + changes audit, JSON
  config save/load)
- src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset
  picker, advanced toggles, preview, before/after metrics, and three
  download buttons
- tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests
  covering edge cases E1-E50 from the spec
- samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10
  in 10 rows
- test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case
  fixtures

Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7
entry locking the spec, CLI-REFERENCE.md gains the text cleaner
section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md
status row 02 promoted Skeleton -> Working.

200/200 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:14:15 +00:00
9ec371a85f docs: update all documentation to reflect v3.0 functionality
Update README, CLI reference, and developer guide to cover delimiter
selector, inline checkboxes/dropdowns, live surviving rows preview,
multi-row survivors, and apply_review_decisions(). Remove dead link.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 00:58:38 +00:00
b871ab24fc feat: add documentation, Streamlit GUI, and full source tree
- Rewrite README.md with project overview, quick-start, and CLI summary
- Add docs/CLI-REFERENCE.md with full flag reference and 8 recipe sections
- Add docs/DEVELOPER.md with architecture, data flow, and extension guides
- Rewrite src/core/__init__.py with public API exports and module docstring
- Add Streamlit GUI (src/gui/) with file upload, advanced options, interactive
  match group review with side-by-side diff, and download buttons
- Add .gitignore, requirements.txt, all source code, tests, and sample data
- Add streamlit to requirements.txt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 23:06:39 +00:00
0613dc420c docs: add project documentation files
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 22:02:07 +00:00