Compare commits

179 Commits

Author SHA1 Message Date
41ab2166ef build(ci): wire macOS code signing + notarization into release workflow
Add a guarded "Sign & notarize macOS app" step to build.yml that signs
dist/DataTools.app with the Developer ID (hardened runtime + entitlements
+ secure timestamp), notarizes via notarytool, and staples the ticket —
running before DMG packaging. The step exits 0 with a warning when the
MACOS_* secrets are absent, so dry-run dispatches still produce an
(unsigned) build.
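
A minimal sketch of the "skip with a warning when secrets are absent" guard, written in Python purely for illustration. The real step lives in build.yml and is not shown in this log; the secret names below are hypothetical stand-ins for the MACOS_* secrets, and `should_sign` is an invented helper.

```python
# Hypothetical secret names; the commit only refers to "MACOS_* secrets".
REQUIRED_SECRETS = ("MACOS_EXAMPLE_CERT", "MACOS_EXAMPLE_PASSWORD")


def missing_secrets(env, required=REQUIRED_SECRETS):
    """Return the required names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


def should_sign(env):
    """Return True when signing can run, False when it should be skipped."""
    missing = missing_secrets(env)
    if missing:
        # "::warning::" is the GitHub Actions workflow command for a
        # warning annotation; the step would then exit 0 so the job
        # still produces an unsigned build.
        print(f"::warning::Skipping sign & notarize; missing: {', '.join(missing)}")
        return False
    return True
```

A caller would exit with status 0 when `should_sign(os.environ)` is False and run the signing commands otherwise.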

Add build/macos/entitlements.plist with the hardened-runtime entitlements
a frozen PyInstaller/CPython app needs (JIT memory, library-validation
disabled for bundled .so/.dylib + Tesseract). Update build/README.md to
reflect that macOS signing is now wired and only needs the secrets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:56:17 +00:00
9943e6e537 test(demo): cover the demo app + sales-surface coherence
Adds a demo test suite on top of the data-value pins:

- tests/gui/test_app_demo.py (new, AppTest): every accounting persona
  renders with its dataset, the default/unknown-persona fallback resolves
  to bookkeeper, clicking Run produces the AFTER value (rows reduced to the
  validated count) with the watermarked download + Gumroad CTA, and
  switching persona via the quick-switch dropdown clears the stale result.
- tests/test_demo_pipelines.py (extended): cross-surface coherence —
  each persona key served by app_demo has a matching landing page that the
  hub links to and whose iframe (?p=) and CTA (from=) point at the persona;
  no retired Shopify/RevOps language remains in landing HTML; and the
  demo download still appends exactly one watermark row.

Full suite: 2584 passed, 91 skipped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 19:06:50 +00:00
e7ec79b9b5 demo: retarget landing pages to the accounting audience
Reorients the whole sales surface to accounting so it matches the rebuilt
demos. Replaces the Shopify and RevOps persona pages with accounts-payable
(1099) and accounts-receivable pages, refreshes the bookkeeper page, and
rewires the hub + deploy tooling:

- landing/bookkeeper/  — refreshed to the validated bank-rec demo
  (26 -> 20, six phantom duplicates), iframe ?p=bookkeeper.
- landing/ap-1099/     — NEW (replaces shopify-pet/): 1099 vendor prep,
  "24 records -> 8 vendors, 7 missing EINs recovered", iframe ?p=ap-1099,
  amber accent.
- landing/ar-aging/    — NEW (replaces revops/): AR open invoices,
  "26 -> 21, five double-entered invoices removed", iframe ?p=ar-aging,
  green accent.
- landing/index.html   — hub rewritten with the three accounting cards.
- deploy.py / deploy.config.example.json / README.md / _shared/styles.css
  — persona list, sitemap defaults, 404 links, cross-links, docs updated.

All demo iframes now point at the renamed app_demo personas; deploy.py
builds the dist bundle cleanly (verified) and the Gumroad ?from= tags match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 18:59:50 +00:00
6df726e69e demo: reconstruct sales demos for an accounting audience
Replaces the Shopify / RevOps / Bookkeeper demo trio with three accounting
personas that share one buyer, each entering through a workflow where a
messy export costs money — all running the same saved 4-step pipeline:

- bank_reconciliation.csv (Bookkeeper): 26 -> 20 rows, 6 double-posted
  transactions caught after date+amount standardization.
- vendor_1099.csv (AP / 1099): 24 records -> 8 vendors, 7 missing EINs
  recovered via dedup merge — the 1099-complete story.
- ar_open_invoices.csv (AR): 26 -> 21 rows, 5 double-entered invoices
  removed, blank status backfilled from the twin row.

Every number is validated against the live engine and pinned by
tests/test_demo_pipelines.py (read path mirrors app_demo._load_demo:
dtype=str, keep_default_na=False). Rewires src/gui/app_demo.py PERSONAS
(keys bookkeeper / ap-1099 / ar-aging, accounting H1/sub/CTA) and rewrites
docs/DEMO-PLAN.md sections 3/4/7 with the validated outcomes.

(Repo hygiene forced by a partial-clone gap: finalizes the already-deleted,
unreferenced samples/messy_text.csv whose blob was unrecoverable.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 18:52:39 +00:00
38616d69e2 test(pipeline): complete automated test suite for the pipeline feature
Adds ~115 tests pinning the Automated Workflows feature end to end:

- tests/test_pipeline.py (+43): per-adapter summary correctness on known
  inputs, multi-step data flow, error stop/continue contract, empty /
  single-column / all-disabled edges, dict+file serialization round-trips,
  recommended_pipeline(include=…), and a synthesized demo integration run.
- tests/test_cli_pipeline.py (new, 21): --recommend, dry-run-by-default,
  --apply output CSV + audit JSON, --steps, --strict abort, arg validation,
  --continue-on-error vs halt, and a save→load round-trip. Invokes the Typer
  app directly to bypass the license guard (house pattern).
- tests/gui/test_pipeline_builder.py (+9): reorder ▲/▼, disabled edge
  buttons, disabled-step persistence across reorder, restore-recommended,
  Advanced JSON export/import, and per-tool Configure panels emitting the
  correct option dicts (AppTest).
- tests/gui/test_pipeline_phrasing.py (new, 30): step_phrase/step_status and
  the adapter-key→friendly-name bridge as pure functions, incl. pluralization,
  column prose, and warn/error status derivation.

Full suite: 2565 passed, 91 skipped. No product bugs surfaced. Documents the
coverage in docs/DEVELOPER.md (test tree + a pipeline-coverage note).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 18:31:15 +00:00
00d3f28865 feat(pipeline): plain-English per-step result summaries
Replaces the raw-JSON summary column in the Results table with the mockup's
plain-English phrasing: "312 duplicates removed across 147 groups
(18,442 → 18,130 rows)", "1,204 cells cleaned in name & city", etc.
(correct singular/plural via a small _n helper).

Adds step_phrase() and step_status() to pipeline_modules.py. step_status
derives the status pill (✓ ok / ⚠ ok · N skipped / ✗ error / ⏭ skipped) and,
for warn/error steps (e.g. format_standardize unparseable cells, column_map
coercion failures / missing required targets), an inline detail callout
rendered directly below the results table — surfacing non-fatal issues in
context without a dedicated always-empty column.

Extends tests/gui/test_pipeline_builder.py with phrasing + status assertions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 18:21:17 +00:00
837f4b88b5 feat(pipeline): visual module-card builder for Automated Workflows
Replaces the raw options_json data-editor table with a per-step "module
card" builder matching the locked design mockup
(layout-review/09_pipeline_runner.html): each step shows a friendly name +
caption, an enable toggle, ▲/▼/✕ reorder/remove controls, and a Configure
expander that renders that tool's own controls in plain language. Raw JSON
is demoted to an Advanced import/export section.

New src/gui/components/pipeline_modules.py holds the adapter-key→tool_id
friendly-name bridge, one plain-language config renderer per tool
(text_clean, format_standardize, missing, column_map, dedup — emitting the
exact JSON option shapes the core adapters accept), and render_step_card.
Steps live in session state as an ordered list with stable ids so widget
keys survive reorder/remove. Reorder is ▲/▼ buttons (no JS drag dependency).
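
Why stable ids matter for reorder can be shown with a small sketch. The step dict shape, `new_step`, `move_step`, and `widget_key` are assumptions for illustration; the commit only states that steps are an ordered list with stable ids.

```python
import itertools

_next_id = itertools.count(1)


def new_step(tool):
    """Create a step with an id that never changes after creation."""
    return {"id": next(_next_id), "tool": tool, "enabled": True}


def move_step(steps, step_id, delta):
    """Return a new list with the step moved up (-1) or down (+1).

    Moving past either end is a no-op, matching disabled edge buttons.
    """
    idx = next(i for i, step in enumerate(steps) if step["id"] == step_id)
    target = idx + delta
    if not 0 <= target < len(steps):
        return list(steps)
    out = list(steps)
    out[idx], out[target] = out[target], out[idx]
    return out


def widget_key(step, name):
    """Key widgets by step id, not list position, so state follows the step."""
    return f"step-{step['id']}-{name}"
```

Keying by position instead (for example `step-0-enabled`) would hand one step's widget state to whichever step moved into that slot.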

The on-disk/CLI pipeline JSON format is unchanged — CLI and src/core
untouched. Adds tests/gui/test_pipeline_builder.py (AppTest) covering seed,
configure panels, toggle/add/remove, and a full run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 18:16:09 +00:00
fd9606c67b build: drop the local Python release method, return to CI-only installer builds
Removes the single-command Python packaging method (build/make_release.py
+ build/build_portable_zip.py + build/macos/build_zip.sh) and the portable
.zip artifacts it produced. Release builds go back to the original GitHub
Actions process: the CI matrix builds one installer per platform (.dmg /
.exe / .AppImage) on tag push and attaches them to a GitHub Release.

Tesseract OCR bundling is preserved: the fetch helpers the workflow depends
on (fetch_tessdata, fetch_tesseract_for_platform) are extracted into a
standalone build/tesseract.py, which build.yml now imports.

Docs (README, build/README, DEVELOPER, TECHNICAL, USER-GUIDE, vendor README,
es translations) updated to drop the portable-zip flavor and point at the
new module.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 17:47:36 +00:00
28ab51a869 Merge ui-redesign: journey-level UX redesign + live-app port
Brings the design-review mockups and the highest-leverage live-app
changes into main:
- layout-review/ mockups: 12-page review addressed; front door, taught
  pipeline order, consistent intake, coming-soon stubs, shared tokens.
- Live src/gui/: nav reordered to pipeline order with new Finance +
  Coming-soon groups; Home is the "Start here" front door with a
  one-click "Clean these files for me" pipeline runner; local-first
  pill on every working tool header.
- DECISIONS.md: PDF to CSV + Reconcile kept in-bundle under Finance.

Full suite green: 2441 passed, 91 skipped, 0 failed.

Follow-ups tracked (not blockers): streamlit-run visual verification of
the live UI; i18n keys for the front-door copy (English literals today);
rebuild the live coming-soon stub page bodies.
2026-06-08 17:41:30 +00:00
1895074b8f test+fix(gui): retire the now-empty "analysis" nav section
The journey-level nav restructure moved Home to a standalone "Start
here" entry and Reconcile into the "Finance" group, leaving the
"analysis" section with zero tools. Two registry tests encoded the old
layout and failed:
- test_every_section_has_at_least_one_tool[analysis] (empty section)
- test_reconciler_present (asserted section == "analysis")

Drop "analysis" from the Section literal, SECTION_LABELS, and app.py's
by_section bucket — it's genuinely dead now (home isn't a registry Tool).
Update the presence tests to assert Reconcile + PDF to CSV live in
"finance". The section-invariant tests (every section non-empty, has a
label, no orphan labels) are preserved and pass.

Full suite: 2441 passed, 91 skipped, 0 failed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 17:11:02 +00:00
d807d3c11b feat(gui): add the one-click "Clean these files for me" front door
Issue #1 (the make-or-break UX fix): after the analyzer runs, Home now
leads with a primary "Clean these files for me" CTA that runs the
recommended pipeline (Clean Text -> Standardize -> Fix Missing -> Find
Duplicates, in order) on every imported file and hands back a cleaned
CSV per file — collapsing "which tool, what order" to one click. The
existing per-finding cards remain, reframed as "Or fix issues one at a
time" for users who want manual control.

- Reuses the core API verbatim (recommended_pipeline + run_pipeline);
  reader mirrors 9_Pipeline_Runner._read_uploaded so files load the same
  way the standalone orchestrator loads them.
- Per-file errors are captured so one bad file doesn't kill the batch;
  cleaned CSVs are cached in session_state so downloads survive reruns
  and are pruned when a file is removed or re-analyzed.
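
The per-file error isolation and cache pruning above can be sketched as below. `clean_batch`, `prune_cache`, and the `clean_one` callable are hypothetical names; in the app the work is done by recommended_pipeline + run_pipeline, whose signatures are not shown in this log.

```python
def clean_batch(files, clean_one):
    """Run clean_one on each file; collect outputs and errors separately."""
    cleaned, errors = {}, {}
    for name, data in files.items():
        try:
            cleaned[name] = clean_one(data)
        except Exception as exc:  # deliberately broad: isolate any per-file failure
            errors[name] = f"{type(exc).__name__}: {exc}"
    return cleaned, errors


def prune_cache(cache, current_names):
    """Drop cached outputs for files that are no longer imported."""
    return {name: out for name, out in cache.items() if name in current_names}
```

With this shape, the UI can offer downloads for every key in `cleaned` and list each entry of `errors` next to its file.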

Verified: the read -> run_pipeline -> CSV data path executes correctly
(compile + a non-Streamlit functional smoke test). The Streamlit UI
scaffolding (button / download_button / progress / session_state)
mirrors the proven runner page but still needs a `streamlit run` check.
Front-door copy is English literals for now; i18n keys are a follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 17:06:30 +00:00
09ec01e98b feat(gui): port journey-level nav + local-first pill to the live app
Brings the live Streamlit app in line with the finalized layout-review
mockups (structural/low-risk changes; verified by compile + registry
sanity, still pending a streamlit-run visual check):

- tools_registry: Data Cleaners now in pipeline order (Clean Text ->
  Standardize -> Fix Missing -> Find Duplicates); new "finance" section
  (Reconcile, PDF to CSV) and "coming_soon" section (Find Unusual,
  Quality Check, Combine Files). Adds those to the Section type +
  SECTION_LABELS.
- app.py: Home becomes the "Start here" front door — a standalone,
  unlabeled top entry (play_circle icon) ahead of the hidden
  Activate/Logs/Close pages; nav groups reordered cleaners ->
  transformations -> automations -> finance -> coming soon.
- _legacy.py: render_tool_header now shows the "Runs 100% locally"
  privacy pill (right-aligned, Ready tools only — omitted on Coming
  Soon stubs); accent emphasis CSS for the Start-here nav link.
- i18n: add nav.start_here_title, nav.section_finance,
  nav.section_coming_soon to en + es packs.
- DECISIONS.md: log the PDF/Reconcile in-bundle (Finance group) call.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 17:01:57 +00:00
48251b625f refactor(layout-review): consolidate tool-header actions + align reconcile downloads
Consistency pass over the parallel-agent work:
- Replace 4 divergent inline header wrappers (flex/inline-flex, gap
  10/12px, margin-top present/absent across 8 tool pages) with one shared
  .dt-tool-header-actions class; strip the now-redundant per-button
  margin-top:0. Every tool header now aligns the local-first pill + Help
  button identically.
- Reconcile downloads row: reorder to the page's exceptions-first order
  (Review, Unmatched left, Unmatched right, Matched) to match the tabs and
  metric strip, and drop the lone competing primary — the four are
  parallel exports of equal weight.

Audited and confirmed already-consistent: compact intake banner, privacy
pill markup, .dt-next-step strips, the three coming-soon stubs, primary
CTAs, and the 3-download CSV/audit/config pattern.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:50:25 +00:00
dd0942d710 feat(layout-review): journey-level redesign — front door, taught order, consistency
Addresses the journey-level review (the app felt like 12 tools sharing a
stylesheet, not one guided product). File-partitioned changes:

Navigation (shell.js): rename Home -> "Start here" with front-door
emphasis (.dt-nav-start); reorder Data Cleaners into pipeline order
(Clean Text -> Standardize -> Fix Missing -> Find Duplicates); new
"Finance" group (Reconcile, PDF to CSV); all stubs moved to a bottom
"Coming soon" group, no longer interleaved with working tools.

Front door (home.html): a prominent primary "Clean these files for me"
that runs the recommended pipeline in order, above the existing
per-finding cards (reframed as "fix one thing at a time").

Shared tokens (app.css): .dt-next-step suggestion strip + .dt-nav-start.

Teach the order: a slim .dt-next-step strip at the end of each linear
cleaner page points to the next pipeline step (Map Columns -> Start here;
orchestrator/Finance pages correctly omit it).

Local-first: the green "Runs 100% locally" pill now sits in every working
tool page's header (home + 8 tools), where client data is entered.

Plain English: jargon relabeled on input controls (coerce, E.164,
NFC/NFKC, sentinels, survivor rule), technical terms kept in tooltips and
audit/output cells only.

Stubs (06/08/07): rebuilt to one identical skeleton — info line + plain
feature list + a real "Notify me when this ships" button; every disabled
control and uploader removed (a dimmed dropzone reads as broken).

Intake: full dropzone+chip replaced with the compact "Using <file>" banner
on Clean Text, Fix Missing, Find Duplicates, and both Reconcile sides.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:44:11 +00:00
cf31d9ef14 feat(layout-review): address review findings on pages 7-12
Find Duplicates (01_deduplicator):
- Delete the redundant outer Options wrapper; surface threshold +
  survivor rule directly, push the rest behind a single Advanced pane.
- Disambiguate competing primaries: top result is an auto-resolved
  preview (secondary download), review decisions are the single primary.
- Plain-English match labels (exact / approximate); clarify the third.
- Lift the match-card caption to a one-time instruction; note delimiter
  is delimited-text-only.

Quality Check (08_validator_reporter) — stub:
- Remove the dead disabled "Load rules file (JSON)" uploader so the
  stub invites a single action; keep the informative feature list.

Map Columns (05_column_mapper):
- Regroup schema -> mapping -> strategy/advanced (core task contiguous).
- Make preset-vs-Advanced precedence legible (Custom + modified marker).
- Adopt the compact file-intake banner; drop the duplicate resolved-
  mapping table; fix the add-row gutter style.

Combine Files (07_multi_file_merger) — stub:
- Actually disable the Merge CTA (add the disabled attribute).

PDF to CSV (10_pdf_extractor):
- Drop page/raw from the default preview to match export + fix the
  horizontal clip; surface raw via per-row affordance + overflow-x.
- Move the column selector above the download button; give auto-excluded
  rows a reason; align the files card to Home; de-dupe the row count.

Automated Workflows (09_pipeline_runner):
- Replace hand-edited JSON step config with per-step control expanders;
  JSON moved behind Advanced import/export.
- Editing the table marks the mode modified; fold the empty error column
  into the status pill; render summaries as plain English; collapse the
  explainer by default.

Cross-cutting items (stub standardization on page 10, shared disabled-
field token, remaining intake rollout) deferred to a holistic pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:35:46 +00:00
563d845b70 feat(layout-review): address review findings on pages 4-6
Find Unusual Values (06_outlier_detector) — coming-soon stub:
- Anchor the disabled Method on IQR (multiplier 1.5), not Z-score, per
  the logged robustness decision.
- Drop the redundant feature bullet list (kept alert + greyed controls
  + disabled button); also fixes the MAD-only-in-bullets mismatch.
- Remove the live uploader that dead-ended into disabled controls.

Clean Text (02_text_cleaner):
- Add an inline hidden-character legend (3 swatches reusing the actual
  badge classes) beside the canonical "Show hidden characters" toggle.
- Unify the two hidden-char toggles: preview one is canonical; the
  Results bare checkbox is wrapped in a field + bound note.
- Describe all three presets (minimal / excel-hygiene / paranoid).
- Give "Changes by column" a real "column" header instead of the
  grey index-gutter style.

Standardize Formats (03_format_standardizer):
- Make preset-vs-control precedence legible: preset shows Custom with a
  "modified" marker + base tag, diverging controls flag the winning
  value (same pattern as Fix Missing Values).
- Replace the dead-end unparseable alert with a real "Unparseable
  cells (47)" expander the alert now points to.
- Honest preview caption: "5 of 6 columns (notes skipped)".
Intake pattern (the cross-page reference) left untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:27:42 +00:00
be1e263223 feat(layout-review): address Fix Missing Values review findings
- Pin down strategy precedence: add a resolution-order legend
  (per-column -> global -> preset), dim/strike the preset radios when
  a global strategy overrides them, and add a "Resolves to" column to
  the per-column override table so the winning value is legible.
- Make the demo state honest: Global strategy = median is what drives
  the 1,043 fills, resolving the detect-only contradiction.
- Surface the missingness profile as an always-visible block above the
  (now-open) Options expander — diagnostic before configuration.
- Stop highlighting unchanged before/after cells (respondent_id 0->0);
  show "(global)" placeholders in unset per-column override cells.
- Fold the standalone "Strategy applied per column" table into the
  before/after table as a strategy column; inset maxed slider knobs.
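
The resolution order (per-column -> global -> preset) and the "Resolves to" column can be sketched like this. The function name and the strategy values other than median and detect-only are assumptions; this commit changes a static mockup, so no such function is implied to exist.

```python
def resolve_strategy(column, per_column, global_strategy, preset_strategy):
    """Return (strategy, source) using per-column -> global -> preset order."""
    override = per_column.get(column)
    if override:
        return override, "per-column"
    if global_strategy:
        return global_strategy, "global"
    return preset_strategy, "preset"


# A per-column override wins; unset columns fall back to the global strategy.
overrides = {"age": "mode"}  # "mode" is a hypothetical strategy name
print(resolve_strategy("age", overrides, "median", "detect-only"))
print(resolve_strategy("income", overrides, "median", "detect-only"))
```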

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:23:32 +00:00
7ebfd0f153 feat(layout-review): address Reconcile page review findings
- Fix doubled "Invert right amount sign" label: keep the field label,
  strip the checkbox caption to the box only (also evens the 3-up row).
- Reorder results exceptions-first: tabs and metric strip both run
  Review -> Unmatched left -> Unmatched right -> Matched, with Review
  the default active tab and its table as the inline content; Matched
  demoted to a trailing context expander.
- Surface the "references must match left count" rule with an inline
  validation indicator under the right reference field instead of a
  label note alone.
- Mark the required Amount join key with the .req accent star on both
  sides so it reads distinct from the optional date/description pickers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:17:20 +00:00
2592604067 feat(layout-review): address Home page review findings
- Findings card no longer truncates silently: panel #1 gains a
  .dt-finding-more overflow control ("Show all 8 findings · 5 more").
- Replace the dead "Files analyzed: 3" stat (restated the section meta
  + visible rows) with "Rows scanned" — info not already on screen.
- Collapsed findings panels use a real .is-collapsed state variant
  instead of inline margin-bottom:-16px hacks, so states can't drift.
- Action bar buttons are content-sized; drop the 340px island that
  jarred against the full-width divider/stats below it.

Branding kept as deliberate landing-style treatment on Home (per
review decision); interior tool pages remain title-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:14:04 +00:00
58d0009849 refactor(layout-review): inline assets beside pages
Move app.css and shell.js into layout-review/ alongside the .html files
and reference them by bare filename; drop the assets/ subfolder.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 15:43:31 +00:00
b6c39d7a09 refactor(layout-review): move assets to repo root
Relocate assets/ (app.css, shell.js) from layout-review/ up to the repo
root and rewrite every page's link/script refs to ../assets/.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 15:31:53 +00:00
b2fa8503e6 chore: add layout-review HTML mockups
Static layout mockups for each app tool (deduplicator, text cleaner,
format standardizer, missing handler, column mapper, outlier detector,
multi-file merger, validator/reporter, pipeline runner, PDF extractor,
reconciler) plus index/home shells and shared assets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 15:28:23 +00:00
b703911df3 docs: reflect bundled Tesseract on every install surface
- NEW LICENSE_TESSERACT.txt at the repo root: header noting it covers
  the bundled Tesseract OCR binary (Apache 2.0, upstream
  tesseract-ocr/tesseract, copyright Google + contributors) and the
  eng.traineddata from tessdata_best (also Apache 2.0). Clarifies
  DataTools itself remains proprietary. Full canonical Apache 2.0
  license text included.
- README.md + README.es.md (Download section): bumped size estimate
  ~200 MB → ~300 MB, added a short paragraph stating Tesseract OCR
  is bundled (no separate install required), with a link to the new
  license file.
- docs/USER-GUIDE.md + docs/USER-GUIDE.es.md (§1.6 System
  requirements): bumped disk estimate, added a paragraph stating
  Tesseract 5.5 + eng.traineddata ship inside every installer /
  portable / AppImage, with a source-install fallback hint pointing
  developers to DEVELOPER.md.
- docs/DEVELOPER.md: new "PDF Extractor — bundled Tesseract" section
  documenting the runtime layout (sys._MEIPASS / tesseract / …),
  discovery order, source of bytes (build/vendor/tessdata + per-
  platform fetch in make_release.py), version pin, update recipe.
- docs/TECHNICAL.md: new §3.10 "Bundled Tesseract (PDF Extractor
  OCR)" — short version of the discovery order for the build
  pipeline section.
- build/README.md: distribution-outputs paragraph now lists
  Tesseract among bundled deps with the ~250-300 MB estimate; new
  "Tesseract bundling" section: layout diagram, resolver order,
  source of bytes + 5.5.0 pin, update steps, license-file ref.

Out-of-scope gaps noted by the docs sweep:
- docs/FUTURE-TOOLS.md §D still describes Tesseract bundling as a
  high-risk packaging headache; now superseded. Worth a one-line
  "(resolved — bundled as of v1.x)" callout in a future pass.
- USER-GUIDE §2 "What's included" table doesn't list PDF Extractor
  at all (it shipped in b8aff86…967d3f6). Separate gap to close.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:20:50 +00:00
93ccada974 build: bundle Tesseract 5.5.0 + tessdata into every release artifact
End users no longer have to install Tesseract separately for OCR on
scanned PDFs — the engine ships inside the installer, portable .zip,
and AppImage for all three platforms.

Per-platform fetch in build/make_release.py (run before PyInstaller):
- Windows: download UB-Mannheim installer 5.5.0.20241111, extract
  with 7-Zip, copy tesseract.exe + required DLLs into the staging dir.
- macOS: ``brew install tesseract``, copy binary + every Homebrew-
  prefixed dylib resolved via otool -L (recurse one level for
  transitive deps), then install_name_tool rewrites IDs / load paths
  to @loader_path/... so the bundle is relocatable.
- Linux: ``apt-get install tesseract-ocr libtesseract5``, copy binary
  + every non-system .so from ldd output, patchelf --set-rpath '$ORIGIN'.

Wire-up:
- build/datatools.spec reads DATATOOLS_TESS_STAGING env var (set by
  make_release) and adds the staging dir + tessdata + the
  LICENSE_TESSERACT.txt Apache 2.0 attribution to PyInstaller datas
  so they land at <bundle>/tesseract/{tesseract[.exe],tessdata/}
  and the license sits at the bundle root. Soft-warns when staging
  is empty so dev spec runs still complete.
- English tessdata pulled by fetch_tessdata() from
  tesseract-ocr/tessdata_best (eng.traineddata, ~16 MB). Cached at
  build/vendor/tessdata/.
- .github/workflows/build.yml: actions/cache@v4 step keyed on
  ``tesseract-${runner.os}-5.5.0-tessdata_best-v1`` caches the
  staging dir and the vendored tessdata across runs; apt installs
  patchelf on the Linux runner; PyInstaller step now receives the
  DATATOOLS_TESS_STAGING env var.
- .gitignore: build/_tesseract/ and the .traineddata blob.
- TESSERACT_SKIP_FETCH=1 honored for offline / manual stages.
- Installer / .dmg / .zip / AppImage scripts: one-line comments
  confirming Tesseract rides along automatically via PyInstaller's
  datas (no extra packaging steps required in those scripts).

Bundle-size delta: ~50-70 MB on disk per platform, ~25-40 MB post-
compression. Net installer size ~250-300 MB (was ~120 MB) — accepted
tradeoff for zero end-user OCR setup.

Reversal of the prior "don't bundle Tesseract" decision (option A).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:20:33 +00:00
17faf84aed feat(pdf): probe bundled Tesseract first when running frozen
Adds runtime support for the bundled Tesseract that ships inside the
DataTools installer / portable / AppImage artifacts. When DataTools
is launched from a PyInstaller frozen bundle the OCR engine now
resolves automatically — no end-user install required.

New helpers in src/pdf_extract.py:
- _bundled_tesseract_path() → Path | None — returns
  <sys._MEIPASS>/tesseract/tesseract[.exe] when getattr(sys,
  "frozen", False) AND sys._MEIPASS are present; None in dev.
- _bundled_tessdata_dir() → Path | None — same gating, returns
  <sys._MEIPASS>/tesseract/tessdata.
- _apply_bundled_tessdata_prefix() — sets TESSDATA_PREFIX to the
  bundled tessdata dir before any pytesseract call; only if frozen,
  dir exists, and the user hasn't already overridden the env var.
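
A rough sketch of the three helpers, following the gating described above. The bodies are a reconstruction from this message, not the code in src/pdf_extract.py; the `env` parameter and the boolean return value are assumptions added so the behaviour can be exercised without touching the real environment.

```python
import os
import sys
from pathlib import Path


def _bundle_root():
    """Return the PyInstaller extraction dir, or None when not frozen."""
    if getattr(sys, "frozen", False) and getattr(sys, "_MEIPASS", None):
        return Path(sys._MEIPASS)
    return None


def bundled_tesseract_path():
    root = _bundle_root()
    if root is None:
        return None
    exe = "tesseract.exe" if sys.platform == "win32" else "tesseract"
    return root / "tesseract" / exe


def bundled_tessdata_dir():
    root = _bundle_root()
    return None if root is None else root / "tesseract" / "tessdata"


def apply_bundled_tessdata_prefix(env=os.environ):
    """Point TESSDATA_PREFIX at the bundled tessdata; never override the user."""
    if env.get("TESSDATA_PREFIX"):
        return False  # user already set it
    tessdata = bundled_tessdata_dir()
    if tessdata is None or not tessdata.is_dir():
        return False  # dev run, or bundle without tessdata
    env["TESSDATA_PREFIX"] = str(tessdata)
    return True
```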

Discovery order in ocr_available() / _autodetect_tesseract_path():
1. DATATOOLS_TESSERACT_PATH env override (existing)
2. Bundled binary (NEW — frozen-only)
3. System PATH (existing)
4. Windows well-known install dirs (existing legacy fallback)

In dev (not frozen) every new probe is a no-op so the developer
experience is unchanged.
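
The four-step discovery order could be expressed as a single function like the sketch below. It is an assumption-laden illustration: the injectable `bundled`, `well_known`, `which`, and `exists` parameters are hypothetical, and whether the real code validates the override path before using it is not stated in this message.

```python
import os
import shutil
from pathlib import Path


def autodetect_tesseract(env=None, bundled=None, well_known=(),
                         which=shutil.which, exists=os.path.isfile):
    """Return the first usable Tesseract binary in priority order, or None."""
    env = os.environ if env is None else env

    # 1. Explicit override from the environment.
    override = env.get("DATATOOLS_TESSERACT_PATH")
    if override and exists(override):
        return Path(override)

    # 2. Bundled binary (callers pass None when not running frozen).
    if bundled is not None and exists(bundled):
        return Path(bundled)

    # 3. System PATH.
    found = which("tesseract")
    if found:
        return Path(found)

    # 4. Well-known install directories (legacy Windows fallback).
    for candidate in well_known:
        if exists(candidate):
            return Path(candidate)
    return None
```

Passing `bundled=None` in dev reproduces the "every new probe is a no-op" behaviour: the function falls through to PATH exactly as before.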

12 new tests cover frozen vs. non-frozen detection on each platform,
the user-override respect for TESSDATA_PREFIX, autodetect priority
ordering, and the no-bundled-dir graceful path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:19:52 +00:00
4d8513b1a3 docs: cover help popover, +/- nav indicators, render_tool_header
User-facing docs (USER-GUIDE en+es, README en+es):
- New short paragraph under §3.1 GUI noting the in-tool Help button
  on every detail page, what it contains (When to use / Steps /
  Examples / Tip), and that content lives in tools.<id>.help_md.
- One-line note in the README tool tables pointing at the same.
- Mention the sidebar +/- nav indicators replacing Streamlit's
  default Material Symbols chevron.

Developer docs:
- DEVELOPER: new "Tool page header" subsection documenting
  render_tool_header(tool_id), the help_md markdown skeleton, and
  the fallback to help.missing_body when a tool's help is absent.
  Update i18n authoring rules to list help.* keys and the per-tool
  help_md field alongside name/description/page_title/page_caption.
- TECHNICAL: new §10c documenting the sidebar nav indicator swap —
  CSS in _HIDE_CHROME_CSS plus _SWAP_NAV_SECTION_INDICATOR_JS
  injected through the hide_streamlit_chrome() iframe bundle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:08:01 +00:00
ac94208d8f chore: production-readiness sweep on the help-popover wave
- Drop unused 'from src.i18n import t' from pages 1-9 (the swap to
  render_tool_header(tool_id) means no page calls t() directly anymore).
  Pages 10, 11 and the underscore-prefixed pages were already clean or
  legitimately use t().

- Rewrite PDF Extractor help_md (en + es). The original prose described
  features the tool does NOT have — template drawing, per-source saved
  templates, automatic reuse. The actual tool is a heuristic batch
  scanner (per its own docstring: "No templates, no per-bank
  configuration"). New copy: scan → uncheck → pick date format → enable
  OCR if needed → download. Spanish version tagged with
  '<!-- TODO: review Spanish -->' since the prose is best-effort.

- Document why both stSidebarNavSectionHeader (legacy, streamlit~=1.35)
  and stNavSectionHeader (current, 1.57) testids appear in the chrome
  CSS — requirements floor is streamlit>=1.35,<2 so dropping the legacy
  selector would silently break the lower bound.

- Pin the t()-returns-key-on-miss contract that render_tool_header's
  fallback path depends on, with a comment at the call site.

- Pin the demo's intentional skip of hide_streamlit_chrome (so the
  +/- sidebar swap JS doesn't ever try to load there) with a load-
  bearing comment in app_demo.py.

- Confirmed i18n parity: every tool id has page_title / page_caption /
  description / name / help_md in BOTH packs; help.button_label and
  help.missing_body in both.
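
The t()-returns-key-on-miss contract and the fallback it enables can be sketched as follows. The pack contents, the `dedup` tool id, and the `help_body` helper are hypothetical; only the key names tools.<id>.help_md and help.missing_body come from this log.

```python
# Hypothetical language pack; real packs live in the i18n module.
PACK = {
    "tools.dedup.help_md": "### When to use\nRemove repeated rows.",
    "help.missing_body": "Help for this tool is not available yet.",
}


def t(key, pack=PACK):
    """Return the translation, or the key itself when it is missing."""
    return pack.get(key, key)


def help_body(tool_id, pack=PACK):
    """Return a tool's help markdown, falling back to the generic body."""
    key = f"tools.{tool_id}.help_md"
    text = t(key, pack)
    if text == key:  # a miss echoes the key back, so equality detects it
        return t("help.missing_body", pack)
    return text
```

The fallback only works while a miss returns the key unchanged, which is why the contract is worth pinning.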

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:07:33 +00:00
4955fb239b test: cover help_md keys, header smoke, and bilingual ES smoke
Two stale Spanish smoke assertions still expected English page titles
for PDF Extractor and Reconciler — the i18n work landed real
translations ("PDF a CSV", "Reconciliar dos archivos"), so refresh the
expected substrings and the surrounding comment.

Add new coverage for the help-popover feature:
- TestHelpPopoverKeys (test_lang_packs): every tool_id resolves a
  non-empty tools.<id>.help_md in BOTH packs; help.button_label and
  help.missing_body resolve in both.
- TestDescriptionCopy (test_tools_registry): every Tool.description
  non-empty and under 120 chars — pins the post-jargon-scrub copy
  so future drift back into multi-clause prose is loud.
- TestRenderToolHeaderSmoke: render_tool_header is callable, listed
  in components.__all__, and every i18n key it touches resolves in
  both packs. Runs without a Streamlit script context.

Suite: 2427 passed (+9 new), 91 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:07:19 +00:00
4a8961d58a fix(gui): keep tool-page Help button on one line at narrow widths
When the viewport shrank, the help popover button in the title row
was wrapping its label vertically — ``[icon]`` over ``Help`` — because
the button was set to use_container_width=True and the column it sat
in collapsed below the button's natural width.

Three-part fix:
- Set use_container_width=False on the popover so the button sizes to
  content (icon + label) instead of stretching to the column.
- Widen the column ratio from [10, 1] to [8, 2] so there's room for
  the button without forcing the title text to truncate.
- Add CSS pinning ``white-space: nowrap`` on every popover button (and
  its inner div / p) as defense-in-depth — even if the button does
  get squeezed, the label can't wrap. ``min-width: max-content`` keeps
  the button from compressing below its content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 17:54:41 +00:00
fe4b5dc755 fix(sidebar): correct testid + JS swap so +/− actually renders
The prior attempt used data-testid=stSidebarNavSectionHeader, which is
not what Streamlit 1.57 emits — the correct testid is stNavSectionHeader
(verified against the bundled JS in streamlit/static/static/js/).
The section header is also a <div> with onClick, not a <button>, and
the React component keeps the expanded state in a prop without
surfacing aria-expanded on the DOM. Pure CSS can therefore neither
locate the header nor switch the glyph by state, which is why the
chevron was unchanged in the rendered UI.

Switch strategies:
- CSS now targets the correct stNavSectionHeader / stIconMaterial
  selectors, drops the Material Symbols font from the icon span, and
  restyles it so a plain ascii character reads as proper typography
  (size, weight, color, hover).
- Add _SWAP_NAV_SECTION_INDICATOR_JS — small inline script that
  rewrites the icon's text node from "expand_more"/"expand_less" to
  "+"/"−" (U+2212), throttled via requestAnimationFrame, re-applied
  on every DOM mutation by a MutationObserver. Bundled into the same
  iframe injection as the existing brand/upload/findings scripts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 17:52:47 +00:00
209b5fb1aa style(sidebar): swap expand chevrons for +/− indicators on nav sections
Streamlit's default sidebar section header uses a Material Symbols
expand_more chevron — three different icons (chevron down, chevron up,
sometimes a plain triangle) depending on version, all of which felt
inconsistent with the rest of the chrome.

Hide the built-in icon (svg / material-symbols span — covered with
multiple selectors for cross-version durability) and render our own
glyph as a right-aligned pseudo-element on the section-header button,
keyed off the standard ARIA aria-expanded attribute:
- collapsed → "+"
- expanded  → "−" (U+2212, visually balanced with +)

Hover deepens the indicator color to match the surrounding nav-link
hover treatment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 17:23:49 +00:00
904356f4e8 feat(gui): inline Help popover next to every tool's title
Adds a contextual Help button on each detail page, right of the title.
Clicking it opens a Streamlit popover with a one-shot how-to: when to
use, numbered steps, before→after examples, and an optional one-line
tip. Designed to be scannable — no paragraph prose.

Implementation:
- New ``render_tool_header(tool_id)`` helper in components replaces the
  bare ``st.title(...) + st.caption(...)`` block on each of the 11 tool
  pages. Title in the wide column, popover in a narrow right column;
  caption sits on its own line beneath.
- Help content is one markdown blob per tool stored in i18n under
  ``tools.<id>.help_md`` (en + es). Editors can tweak copy without
  touching Python.
- ``help.button_label`` and ``help.missing_body`` keys added to both
  packs for the popover trigger and the empty-tool fallback.

All 11 tool pages now use the same header pattern — including the
PDF Extractor and Reconciler which previously had hardcoded title/
caption pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 17:21:55 +00:00
7203a81af7 copy: strip jargon from tool descriptions and captions
Prior round only touched page_caption; the description field (shown on
home grid cards) still said "imputation", "missingness",
"winsorization", "schema coercion", "fuzzy matching with normalization",
etc. The audience is non-technical buyers — they shouldn't need a stats
or DB-admin vocabulary to read a tool card.

Rewrite both description and page_caption across en, es, and the
tools_registry (the fallback source of truth) using everyday words:
blanks instead of nulls, fill in instead of impute, look wrong instead
of statistical outliers, etc. Same one-line shape as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 17:09:52 +00:00
dd3b9bd59d copy: tighten tool-page captions to one plain-English line
Each tool's page caption is what tells a user what the tool actually
does the moment they land. They were inconsistent — some terse, most
multi-clause with a redundant "Runs locally — your data never leaves
this computer" trailer that's already a privacy pill on Home.

Rewrite every caption (en + es) as a single ~60-80 char action-first
line. Replaces the hardcoded multi-line Reconciler caption with the
same shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 14:34:34 +00:00
2bd94c4441 docs: document installer + portable downloads in en/es
Repo READMEs now show both download flavors side-by-side with
first-launch warnings (SmartScreen, Gatekeeper) and link to the
deeper walkthrough.

USER-GUIDE §1 rewritten from a 9-line stub into six subsections:
- §1.1 Windows: installer (5 steps) + portable (4 steps)
- §1.2 macOS:   DMG (5 steps incl. right-click-Open) + portable
- §1.3 Linux:   AppImage flow (unchanged)
- §1.4 First-launch: port selection, localhost binding, browser open
- §1.5 How the GUI works
- §1.6 System requirements

§6 Troubleshooting picks up portable-specific items: Safari unzip
quirks, antivirus quarantine on Win portable, license file location.

docs/README and Spanish mirrors updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 19:30:28 +00:00
9c426194b1 build: add single-command release script + portable zip artifacts
One-developer workflow: ``python build/make_release.py`` on each
target OS produces both the installer and a portable .zip for that
platform. Preflight checks PyInstaller / Pillow / iscc / hdiutil /
ditto / appimagetool and bails with install hints if anything is
missing — no half-built dist/.

New scripts:
- build/make_release.py   — orchestrator, auto-detects host OS.
- build/generate_icons.py — icon.ico / icon.icns / icon.png from
  src/gui/assets/datatools_icon_256.png (Pillow ships ICO + ICNS
  writers; no platform tooling needed).
- build/build_portable_zip.py — Win/Linux portable zip via stdlib.
- build/macos/build_zip.sh — Mac portable .app via ditto so
  bundle metadata survives.

installer.iss now adds: Quick Launch task (opt-in, legacy Win 7),
App Paths registry entry (Win+R "DataTools" works), SetupIconFile,
UninstallDisplayIcon, AppSupportURL, AppUpdatesURL.

CI workflow uploads installer + portable per platform and attaches
both to GitHub Releases on tag push.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 19:30:17 +00:00
6627895a10 test: fix v3 branding drift, add reconcile CLI + registry coverage
GUI/lang-pack tests were asserting against pre-v3 strings ("Data
Cleaning Mastery", "Maestría en limpieza…") that the brand refresh
replaced with "UNALOGIX DataTools" + "Clean. Normalize. Transform."
Updated assertions to the current copy and switched the findings
panel tests to the redesigned flat-list layout (per-finding "Open
Tool →" buttons instead of per-tool expanders).

New coverage:
- tests/test_cli_reconcile.py (13) — preview/apply, tolerance flags,
  sign inversion, key flags, error paths, Excel input.
- tests/test_tools_registry.py (27) — unique tool_ids, page_slug →
  real file, valid sections/tiers, localized accessor fallbacks,
  explicit pins for PDF Extractor + Reconciler entries.
- tests/test_reconcile.py — one-side-empty, key-pass tagging,
  additional validation cases, input-DataFrame immutability.
- tests/gui/test_smoke.py — PAGE_SLUGS now includes 10_PDF_Extractor
  and 11_Reconciler in both en/es.
- tests/gui/test_workflows.py — TestPdfExtractorWorkflow and
  TestReconcilerWorkflow render checks.

Net: 2317 passed → 2418 passed, 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 19:30:02 +00:00
ea99e292d2 feat(nav): group Home + Reconcile under a new "Analysis" section
Home now appears in the sidebar as "File Analysis" under a labeled
"Analysis" section together with Reconcile Two Files — both pages
are data-analysis workflows (importing/profiling files vs. matching
across files), so grouping them clarifies the sidebar's mental model.

- tools_registry: new ``analysis`` Section; reconcile moves out of
  automations into it.
- i18n: ``nav.section_analysis`` + ``nav.file_analysis_title`` added
  to en.json and es.json.
- app.py: home dropped from the unlabeled section and surfaced at the
  top of the Analysis group; ``default=True`` preserved so first-visit
  routing is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:11:06 +00:00
0be59c0f03 fix(gui): shrink white-bar compensation to ~1/4 of original gap
Plain ``min-height: 100vh`` left a ~15vh white bar below ``.stApp``
(the zoom: 0.85 scaler shrinks visual height to 85%). Reinstate the
stretching but stop short of the full ``100vh / 0.85`` overflow:
``calc(96vh / 0.85)`` fills 96vh visually and leaves a ~4vh bar — a
quarter the size, no longer dominating the page.
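
The arithmetic behind those numbers, taking the message's model at face
value (visual height = CSS height x zoom). ``visual_vh`` is a
hypothetical helper used only for this worked example.

```python
ZOOM = 0.85  # the zoom factor named in the commit message


def visual_vh(css_vh: float, zoom: float = ZOOM) -> float:
    """Visual height, in viewport-height units, of a box with the given CSS min-height."""
    return css_vh * zoom


for label, css_vh in [
    ("100vh", 100.0),
    ("calc(100vh / 0.85)", 100.0 / ZOOM),
    ("calc(96vh / 0.85)", 96.0 / ZOOM),
]:
    filled = visual_vh(css_vh)
    bar = max(0.0, 100.0 - filled)  # uncovered strip at the bottom of the viewport
    print(f"{label:>20}: fills {filled:.1f}vh, leaves {bar:.1f}vh")
```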

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:06:32 +00:00
3a3a9a895b fix(gui): stop overstretching pages, restore footer clearance
Two layout bugs were hiding the bottom of every tool page behind the
sticky footer:

1. ``.stApp`` and the main/sidebar containers were forced to
   ``min-height: calc(100vh / 0.85)``, ≈ 17.6% taller than the
   viewport, to mask a white bar caused by the ``zoom: 0.85`` scaler.
   That hack stretches short pages and pushes long-page content past
   the visible area. Drop the calc factor — plain ``100vh`` fills the
   visible viewport without forced overflow.

2. ``render_sticky_footer``'s stylesheet re-set the block container's
   ``padding-bottom`` to ``2rem``, overriding the ``7rem`` reserved
   by ``hide_streamlit_chrome``. The footer (~40px tall) needs more
   than 32px of clearance, so the last row of content was sliding
   behind the footer. Remove the override and let chrome's reservation
   stand.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:03:52 +00:00
d090f8cb5e feat(reconcile): auto-detect role columns, preview result tabs
Match-settings selectors now reorder per side to match the file's
column order, using name heuristics (amount / date / desc) so a
typical bank CSV reads Date → Description → Amount → Reference
without manual fiddling. Detected columns also pre-fill as the
default selection.

Result tabs render at most 25 rows with a "preview of N of M"
caption; full data is still available via the existing download
buttons.
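
A minimal sketch of name-based role detection, assuming simple
substring keywords. ``ROLE_KEYWORDS``, ``detect_roles`` and
``selector_order`` are hypothetical names; the tool's real heuristics
are not shown in this log.

```python
# Hypothetical keyword lists, for illustration only.
ROLE_KEYWORDS = {
    "date": ("date", "posted"),
    "description": ("desc", "memo", "payee"),
    "amount": ("amount", "amt"),
}


def detect_roles(columns):
    """Map each role to the first unclaimed column whose name contains a keyword."""
    roles = {}
    for role, keywords in ROLE_KEYWORDS.items():
        for col in columns:
            if col in roles.values():
                continue  # a column can serve only one role
            if any(k in col.lower() for k in keywords):
                roles[role] = col
                break
    return roles


def selector_order(columns):
    """(role, column) pairs sorted by where the column sits in the file."""
    roles = detect_roles(columns)
    return sorted(roles.items(), key=lambda item: columns.index(item[1]))
```

Each detected column doubles as the selector's default value, so a file
with conventional headers needs no manual selection.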

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 22:39:47 +00:00
e44af3a45e feat(reconcile): two-source reconciliation tool
Bank-feed-vs-ledger style matcher: 4-pass greedy assignment (key →
exact → tolerance → fuzzy) with ambiguous candidates routed to a
review bucket instead of arbitrary picks. CLI mirrors the
cli_text_clean preview/--apply pattern; Streamlit page registered
in the automations section.
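
The pass structure can be sketched as below. This is an illustration
under assumptions, not the tool's implementation: rows are plain dicts,
the tolerances and the 0.8 similarity threshold are made up, and a left
row sent to review is simply not retried in later passes.

```python
from datetime import date
from difflib import SequenceMatcher


def reconcile(left, right, passes):
    """Greedy multi-pass matcher: returns (matches, review, unmatched_left)."""
    open_left = list(range(len(left)))
    open_right = set(range(len(right)))
    matches, review = [], []
    for name, is_match in passes:
        still_open = []
        for i in open_left:
            candidates = [j for j in sorted(open_right) if is_match(left[i], right[j])]
            if len(candidates) == 1:
                open_right.discard(candidates[0])
                matches.append((i, candidates[0], name))
            elif candidates:
                review.append((i, candidates, name))  # ambiguous: a human decides
            else:
                still_open.append(i)
        open_left = still_open
    return matches, review, open_left


def near(a, b, amount_tol=0.01, day_tol=3):
    """Amounts and dates both within a small tolerance (values are assumptions)."""
    return (abs(a["amount"] - b["amount"]) <= amount_tol
            and abs((a["date"] - b["date"]).days) <= day_tol)


def similar(a, b, threshold=0.8):
    """Case-insensitive description similarity via difflib."""
    ratio = SequenceMatcher(None, a["desc"].lower(), b["desc"].lower()).ratio()
    return ratio >= threshold


PASSES = [
    ("key", lambda a, b: bool(a["ref"]) and a["ref"] == b["ref"]),
    ("exact", lambda a, b: a["amount"] == b["amount"] and a["date"] == b["date"]),
    ("tolerance", near),
    ("fuzzy", lambda a, b: abs(a["amount"] - b["amount"]) <= 0.01 and similar(a, b)),
]
```

Because each pass only sees rows that earlier passes left open, a
strict match is never displaced by a looser one.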

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 22:33:14 +00:00
450d4fc9a8 feat(pdf): default output date format to YYYY-MM-DD
User asked to flip the default from YYYYMMDD to YYYY-MM-DD.
ISO is the better default for an accountant CSV workflow:

- Lexicographic sort = chronological sort (no parsing needed).
- Every spreadsheet tool the user might import into recognises
  it as a real date with no ambiguity (US vs EU readers can't
  disagree on the order).
- Hyphens make the year/month/day boundaries scan-able by eye.
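
The sorting claim is easy to check with the standard library alone; the
sample dates below are arbitrary.

```python
from datetime import datetime

stamps = [datetime(2025, 1, 13), datetime(2024, 12, 30), datetime(2025, 10, 2)]

iso = [d.strftime("%Y-%m-%d") for d in stamps]
us = [d.strftime("%m/%d/%Y") for d in stamps]

# Sorting ISO strings as plain text matches sorting the real dates.
assert sorted(iso) == [d.strftime("%Y-%m-%d") for d in sorted(stamps)]
# The same does not hold for MM/DD/YYYY text.
assert sorted(us) != [d.strftime("%m/%d/%Y") for d in sorted(stamps)]
```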

Concrete changes:

- New module constant ``DEFAULT_DATE_FORMAT = "%Y-%m-%d"``,
  used as the default for ``format_date()`` and the
  ``output_date_format`` keyword on
  ``scan_pdf_for_transactions``.
- Page's ``_DATE_FORMAT_CHOICES`` reordered so the ISO entry
  is first (index 0 = default Streamlit selection); YYYYMMDD
  drops to second.
- Custom-strftime input default also flips to ``%Y-%m-%d``.

Tests updated to reflect the new default (``test_dates_formatted_iso_by_default``,
``test_short_dates_get_year_from_period``,
``test_compact_format_round_trip``, plus a new
``test_default_is_iso`` for the format_date helper).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 02:04:34 +00:00
a0042d4aba feat(pdf): Dec/Jan-aware year inference + filename hint + override
Previous year inference picked ``period_end_iso[:4]`` for every
short date, which fails on statements that cross the Dec/Jan
boundary. A "12/30" row in a 2024-12-16 to 2025-01-15 statement
got 2025-12-30 (wrong) instead of 2024-12-30.

New cascade for ``_infer_year_for_short_date``:

1. **``override_year``** — caller supplies it (new ``"Override
   year for short dates"`` field in Scan options). Beats every
   heuristic. Empty by default; the page validates the value
   is a 4-digit-looking integer in 1900-2100 and falls back to
   automatic on garbage input.

2. **Statement period start + end** — the function now takes
   BOTH dates and generates candidates with every distinct year
   in the period (one year for same-year statements, two for
   Dec/Jan boundaries). The picker scores each candidate by
   distance from the period: candidates inside the period
   score 0, candidates outside score ``min(|days from start|,
   |days from end|)``. Lowest-distance candidate wins. So:

     - ``12/30`` + period 2024-12-16 to 2025-01-15 → 2024-12-30
       (inside period, score 0)
     - ``01/05`` + same period → 2025-01-05 (inside, score 0)
     - ``12/15`` + same period → 2024-12-15 (1 day before,
       closer than 2025-12-15 which is 11 months after)

3. **``filename_year_hint``** — fallback when the statement
   period regex misses the bank's specific layout. The page
   passes ``year_from_filename(upload.name)`` automatically so
   files like ``eStmt_2025-01-13.pdf`` get year 2025 even if
   the PDF's text doesn't yield a parseable period. The regex
   matches the first ``20XX`` token bounded by non-digits.
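
The cascade above can be sketched as follows. ``infer_year`` and the
exact regex are illustrative assumptions; only ``year_from_filename``
and the parameter names come from the message itself.

```python
import re
from datetime import date


def _try_date(year, month, day):
    """Build a date, or None when the combination is invalid (e.g. Feb 30)."""
    try:
        return date(year, month, day)
    except ValueError:
        return None


def infer_year(month, day, period_start=None, period_end=None,
               override_year=None, filename_year_hint=None):
    """Pick a year for a short date: override, then period scoring, then filename hint."""
    if override_year is not None:
        return override_year
    if period_start and period_end:
        best = None
        for year in range(period_start.year, period_end.year + 1):
            cand = _try_date(year, month, day)
            if cand is None:
                continue
            if period_start <= cand <= period_end:
                score = 0
            else:
                score = min(abs((cand - period_start).days),
                            abs((cand - period_end).days))
            if best is None or score < best[0]:
                best = (score, year)
        if best is not None:
            return best[1]
    return filename_year_hint


def year_from_filename(name):
    """First 20XX token bounded by non-digits, or None."""
    match = re.search(r"(?<!\d)(20\d{2})(?!\d)", name)
    return int(match.group(1)) if match else None
```

With the period 2024-12-16 to 2025-01-15, ``infer_year(12, 30, ...)``
scores 2024-12-30 at 0 (inside the period) and 2025-12-30 far outside,
so 2024 wins.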

Both new helpers (``year_from_filename`` and the new
``_try_short_date_with_year`` factor-out) are exported and
tested. 16 new tests cover: within-period inference (same-year
sanity), Dec/Jan boundary cases for both sides, the
just-before-period closer-distance case, override priority,
filename fallback, no-signal None, dash-format / month-name
shorthand round-trip, garbage input, filename year extraction
(eStmt pattern, embedded, first-match-wins, no-match, empty).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:59:30 +00:00
a18b126885 fix(pdf): stamp scan timestamp once; restores Saved-to-path banner
After swapping to ``html_download_button`` the user noticed the
"✓ Saved to <path>" + 📂 Open Downloads folder pair never
appeared. The helper itself is fine — every other tool shows
those affordances correctly. Bug was specific to the PDF page.

The download button's file_name was being computed with a fresh
``datetime.now().strftime(...)`` on every render. The helper
builds its session-state keys from
``f"_dl_btn_{file_name}_{digest}"`` so the keys silently drift
every second. After the click and rerun, the helper looks up
the saved_key for the NEW file_name, finds nothing in
session_state (the click had written to the OLD key), and skips
the success banner.

Fix: stamp the timestamp once when scan completes, store it in
``K_TIMESTAMP``, and reuse it for the download filename. The
filename stays stable across reruns, so the helper's keys are
stable, so the saved-path banner renders correctly on the post-
click rerun.
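
The key drift is easy to reproduce outside Streamlit. In this sketch
``dl_state_key`` is a hypothetical stand-in for the helper's key
builder, and the SHA-256 digest is an assumption; only the
``_dl_btn_{file_name}_{digest}`` shape comes from the message.

```python
import hashlib


def dl_state_key(file_name: str, payload: bytes) -> str:
    """Session-state key derived from the file name plus a content digest."""
    digest = hashlib.sha256(payload).hexdigest()[:8]
    return f"_dl_btn_{file_name}_{digest}"


payload = b"date,description,amount\n"
session = {}

# Click happens on a render stamped 12:00:00 ...
session[dl_state_key("scan_120000.csv", payload)] = "/home/user/Downloads/scan_120000.csv"
# ... but the rerun one second later looks up a different key.
banner_before_fix = session.get(dl_state_key("scan_120001.csv", payload))

# With the timestamp stamped once, click and rerun agree on the key.
banner_after_fix = session.get(dl_state_key("scan_120000.csv", payload))
```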

Also clear ``K_TIMESTAMP`` on Clear-all-files so a new scan
gets a fresh stamp.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:50:22 +00:00
981a1a9cba fix(downloads): OneDrive-aware Downloads path + PDF uses html_download_button
User reported downloads "do nothing on click" in tool pages and
"acts like it downloads but no file in the folder" in the PDF
tool. Two root causes, two fixes.

**Root cause #1 — wrong Downloads folder on Windows.**
``_downloads_dir()`` returned ``Path.home() / "Downloads"``
unconditionally. On Windows machines with OneDrive enabled
(very common for business users), the real Downloads folder
is redirected to ``C:\Users\<u>\OneDrive\Downloads``. Our
helper would write to ``C:\Users\<u>\Downloads`` instead —
a folder that may not even exist until ``mkdir`` creates it —
and the user, naturally opening their actual OneDrive
Downloads, sees no file and concludes nothing happened.

Now: on Windows, ``_downloads_dir`` queries the registry key
``Software\Microsoft\Windows\CurrentVersion\Explorer\User
Shell Folders`` for FOLDERID_Downloads (GUID
``{374DE290-123F-4565-9164-39C4925E467B}``). This entry returns
the redirected path when OneDrive is active, the original
``%USERPROFILE%\Downloads`` otherwise — exactly what the user's
File Explorer reads. ``%USERPROFILE%`` expansion is applied
via ``os.path.expandvars``. Any registry hiccup falls through
to ``Path.home() / "Downloads"`` so the helper never raises.

The sanity check (path exists OR parent exists) catches the
edge case where the registry points into a deleted OneDrive
mount.
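
A minimal sketch of that lookup, assuming the value lives under
``HKEY_CURRENT_USER`` and is read with the standard ``winreg`` module.
``downloads_dir`` here is an illustration, not the project's
``_downloads_dir``.

```python
import os
import sys
from pathlib import Path

_DOWNLOADS_GUID = "{374DE290-123F-4565-9164-39C4925E467B}"
_SHELL_FOLDERS = r"Software\Microsoft\Windows\CurrentVersion\Explorer\User Shell Folders"


def downloads_dir() -> Path:
    """Resolve the user's Downloads folder, honoring Windows folder redirection."""
    fallback = Path.home() / "Downloads"
    if sys.platform != "win32":
        return fallback
    try:
        import winreg  # Windows-only standard-library module

        with winreg.OpenKey(winreg.HKEY_CURRENT_USER, _SHELL_FOLDERS) as key:
            raw, _value_type = winreg.QueryValueEx(key, _DOWNLOADS_GUID)
        candidate = Path(os.path.expandvars(raw))  # expands %USERPROFILE%
        # Sanity check: reject paths that point into a missing mount.
        if candidate.exists() or candidate.parent.exists():
            return candidate
    except Exception:
        pass  # any registry hiccup falls through to the default
    return fallback
```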

**Root cause #2 — PDF page used st.download_button.**
Every other tool uses the project's ``html_download_button``
helper (which is ``local_download_button`` under the hood —
the rename happened in b9147f3). ``st.download_button`` has a
long-standing bug where the second-or-later instance in a
script pass silently fails to fire. The PDF tool predated the
rewrite that switched everyone over and was still using the
broken native widget. ``_Logs.py`` had the same problem in two
places.

Swapped all three call sites to ``html_download_button``. They
now save to ``~/Downloads/<filename>`` (correctly resolved per
fix #1) and show the saved path + "Open Downloads folder"
button below the click, matching every other tool in the suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:45:51 +00:00
dbcf4d4048 feat(pdf): adopt Home-page Files-card layout
User wants the PDF page's upload UX to match the Home page
exactly — Files section header + bordered card containing the
file rows AND the "Add more files" button at the bottom, no
visible Streamlit file_uploader competing for attention.

Layout changes mirroring ``src/gui/_home.py``:

- ``st.file_uploader`` is positioned off-screen via CSS
  (``position:absolute;left:-10000px;…``). The underlying
  ``<input type=file>`` stays reachable to JS so the in-card
  "Add more files" button can programmatically click it.
- ``<h2>Files</h2>`` section header with ``N files · X.X MB
  total`` meta on the right, identical markup
  (``dt-files-section-head``).
- Single ``st.container(border=True)`` hosts every file row
  (``✕ | 📄 filename | size``, using ``dt-file-row`` /
  ``dt-file-icon-chip`` / ``dt-file-name`` / ``dt-file-size``
  classes) AND the "Add more files" button (``dt-file-add``)
  at the bottom. All classes are already defined globally in
  ``_legacy.py`` so no new CSS.
- The Add button click is wired to the off-screen uploader's
  ``stFileUploaderDropzoneInput`` via a 30-line iframe script,
  identical to the Home page's pattern. A ``MutationObserver``
  re-wires after Streamlit reruns when the button gets
  re-mounted.

Action buttons (Scan + Clear all) sit BELOW the Files card,
side-by-side in a ``[1, 1, 4]`` column split with
``use_container_width=True`` so they fill their cells cleanly
without stretching across the whole row. Both buttons are
disabled when no files are uploaded — the empty Files card is
its own affordance for the empty state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:34:31 +00:00
34b56b404a fix(pdf): drop statement_period_start/end columns from output
User asked to remove them — the two columns repeated the same
value on every row from a given statement, took up screen space
in the editor, and offered limited value once the date column
already carries the inferred full date.

What's kept:
- ``account_number`` — still stamped onto every row so multi-
  statement CSVs are self-attributing
- ``extract_statement_metadata`` — still runs every scan because
  ``period_end`` is the source of the year inference that binds
  Chase-style short ``01/13`` dates to ``20250113``
- ``_extract_statement_period`` and its tests — period
  detection itself isn't going anywhere, just its appearance in
  the output rows

What's removed:
- ``record["statement_period_start"]`` / ``record["statement_period_end"]``
  assignments in ``scan_pdf_for_transactions``
- The two columns from the page's column-ordering setup
- Tests pinning their presence; replaced with assertions that
  they're explicitly absent

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:28:32 +00:00
ad7c22d7fb fix(pdf): consistent 2-decimal amount precision in display and CSV
User reported amounts losing trailing zeros — 4.50 rendering as
4.5, 1000.00 as 1000 — on the same statement. Classic float
display issue: a float stores no trailing zeros (``repr(4.50)``
is ``'4.5'``), and pandas / Streamlit happily show that
inconsistency cell-by-cell.

Two layers of fix, internal type stays ``float`` for arithmetic:

**Display.** ``st.column_config.NumberColumn(format="%.2f")``
applied programmatically to every ``amount_*`` column on the
data_editor. Every numeric amount now shows with exactly two
decimal places regardless of trailing zeros.

**CSV export.** Pandas' default float-to-CSV writer also drops
trailing zeros (the same issue an accountant would see when
opening the file in Excel). Before serialising, each amount
column is mapped through the new ``format_amount`` helper —
returns ``f"{v:.2f}"`` for numerics, empty string for
None/NaN/inf, ``str(value)`` for booleans (guards the
``True → "1.00"`` foot-gun since ``bool`` is an ``int``
subclass), and passes through any string the scanner kept
because parsing failed (e.g. ``(4.50)`` when parens-negative is
off — user can correct in the editor before re-exporting).

``format_amount`` lives in ``src/pdf_extract.py`` so it's
testable in isolation (the page module can't easily be unit
tested because of its Streamlit import chain). 8 new tests
cover the trailing-zeros case, negatives, None/empty,
string-passthrough, bool guard, NaN/inf, and the ``places``
parameter.
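
A sketch of a helper with the behavior described above. It follows the
rules listed in the message but is a reconstruction, not the code from
``src/pdf_extract.py``.

```python
import math


def format_amount(value, places=2):
    """Render an amount with fixed decimals; see the branch comments for edge cases."""
    if value is None:
        return ""
    if isinstance(value, bool):
        return str(value)  # bool is an int subclass; avoid True -> "1.00"
    if isinstance(value, str):
        return value  # unparsed text passes through for the user to fix
    if isinstance(value, (int, float)):
        if math.isnan(value) or math.isinf(value):
            return ""
        return f"{value:.{places}f}"
    return str(value)
```

The ``bool`` check has to come before the numeric branch, otherwise
``True`` would be formatted as a number.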

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:27:16 +00:00
6f2ad57490 fix(pdf): require non-empty description; tighten multi-line merge
User reported "Daily Ledger Balances" entries leaking into
output. Three correlated bugs in the row qualifier:

**1. Empty description is now disqualifying.** A row like
``01/13/2025  $1,000.00`` has a date and an amount but no text
between them — that's a daily-balance entry, a period-summary,
or page furniture. Drop these. New filter sits after
``_description_from_row`` returns: if the description string is
empty (or whitespace-only), continue past the row.

**2. ``prev`` resets per page.** The state that drives multi-
line description merging (the "previous transaction this
continuation might attach to") used to persist across page
boundaries. A no-date no-amount line at the top of page 2
could silently attach to the last transaction on page 1. Fixed
by moving the ``prev`` / ``prev_y_bottom`` declarations into
the outer page loop so each page starts clean.

**3. Multi-line merges now check y-distance.** Before this fix,
ANY no-date no-amount line attached to the previous
transaction's description. A "Daily Ledger Balances" section
header several rows below the last transaction would silently
fold into it. Now the merge only happens when the gap
``current_top - prev_y_bottom <= 25.0`` PDF points — generous
enough for one blank-line gap between wrapped descriptions,
tight enough to reject section headers across paragraph
breaks. The threshold is a module constant
(``_MULTILINE_MERGE_MAX_GAP``) for future tuning if real
statements call for it.
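
The merge rule can be sketched on pre-classified lines. The line dicts
(``text`` / ``top`` / ``bottom`` / ``is_txn``) and ``merge_continuations``
are assumptions for illustration, as is the choice to advance
``prev_y_bottom`` after each merged continuation.

```python
_MULTILINE_MERGE_MAX_GAP = 25.0  # PDF points, the threshold quoted above


def merge_continuations(page_lines):
    """Attach nearby no-date, no-amount lines to the preceding transaction.

    Call once per page so that ``prev`` never carries across a page boundary.
    """
    transactions = []
    prev = None
    prev_y_bottom = None
    for line in page_lines:
        if line["is_txn"]:
            prev = {"description": line["text"]}
            prev_y_bottom = line["bottom"]
            transactions.append(prev)
        elif prev is not None and line["top"] - prev_y_bottom <= _MULTILINE_MERGE_MAX_GAP:
            prev["description"] += " " + line["text"]
            prev_y_bottom = line["bottom"]  # lets a third wrapped line merge too
        # Anything else (far-away headers, orphan lines) is ignored.
    return transactions
```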

Three new test classes:

- ``TestRequiresDescription.test_empty_description_row_dropped``
  — date+amount-no-text row filtered, real transaction kept.
- ``TestPrevTransactionResetsPerPage.test_no_cross_page_merge``
  — page-1 transaction + page-2 section header = no merge.
- ``TestMultilineMergeYGap`` — close continuation merges
  (10-pt gap), far section header doesn't (100-pt gap).

The original ``TestMultilineDescription.test_continuation_line_merges``
still passes — its setup has a 10-pt gap which is within the
new threshold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:58:50 +00:00
a1824b8dc4 feat(pdf): Home-style file list + Clear-all button
User feedback: the standard file_uploader didn't visually match
the Home page, and there was no obvious way to clear out
uploaded files between scans (have to refresh the browser tab).

**Persistent stash + add-only sync.** Files captured into
``st.session_state["pdf_uploads"]`` (dict name → {bytes, size})
via an ``on_change`` callback on the file_uploader widget. The
callback is **add-only** — never removes files from the stash
based on widget state. Removal is owned by the custom X buttons
+ widget-counter bump (see below). This guarantees a hidden
native X click can't silently drop files behind the user's
back.

**Hidden native file list.** A small CSS block suppresses the
file_uploader's built-in file rows + their delete buttons
(``stFileUploaderFile`` + ``stFileUploaderDeleteBtn``), so the
custom list below is the single source of truth on screen.

**Custom file list (Home pattern).** Below the dropzone, every
uploaded file gets a row: ``✕ | 📄 filename | size``. Top of
section shows ``N files · 12.3 MB total``. Counts and sizes
update in real time as the user adds or removes files. The X
button per row calls ``log_event("upload", "PDF removed: …")``,
removes the entry from the stash, and bumps the widget counter
to clear the widget too.

**Clear-all button.** Sits next to the Scan button. Wipes the
stash, bumps the widget counter, drops any cached scan results
(``K_ROWS``, ``K_WARNINGS``, ``K_SOURCE_COUNT``). Audited via
``log_event("upload", "PDF list cleared", count=N)``.

**Widget reset via counter bump.** Streamlit disallows
programmatic mutation of widget session-state entries; the
standard workaround is to rotate the widget's ``key``. Page
maintains ``K_UPLOAD_COUNTER`` which gets incremented on
remove / clear-all, producing a fresh ``pdf_upload_v{N}`` key
and a freshly-instantiated empty widget. The stash retains any
unaffected files; on next upload, the add-only sync picks up
the new ones without re-adding the removed ones.
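
The stash and counter interplay, sketched with a plain dict standing in
for ``st.session_state``. The function names and the
``upload_counter`` key are hypothetical.

```python
def sync_uploads(stash, widget_files):
    """Add-only sync: copy new widget files into the stash, never delete."""
    for name, data in widget_files.items():
        stash.setdefault(name, {"bytes": data, "size": len(data)})


def uploader_key(state):
    """Widget key that changes whenever the counter is bumped."""
    return f"pdf_upload_v{state['upload_counter']}"


def remove_file(state, name):
    """Drop one file from the stash and rotate the widget key."""
    state["pdf_uploads"].pop(name, None)
    state["upload_counter"] += 1


def clear_all(state):
    """Wipe the stash and rotate the widget key."""
    state["pdf_uploads"].clear()
    state["upload_counter"] += 1
```

Rotating the key gives the page a fresh, empty widget, so the next
add-only sync cannot resurrect a file the user just removed.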

**Scan rewired to read the stash.** Instead of iterating the
widget's UploadedFile objects (which the previous code did and
which broke when the widget unmounted on remove), the scan
loop iterates ``pdf_uploads.items()`` and uses the cached
``bytes``. Diagnostic expander does the same — re-reads from
the stash, removing the need for a separate ``K_DIAGNOSTIC``
cache (deleted).

**``_format_size`` helper** ports the byte-formatting logic
from ``_home.py``'s pattern (KB / MB / GB rollover).
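
A byte formatter with that rollover might look like the sketch below.
The 1024 base and the one-decimal rounding are assumptions, and
``format_size`` is not the project's ``_format_size``.

```python
def format_size(num_bytes):
    """Human-readable size with B / KB / MB / GB rollover."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB"):
        if size < 1024:
            return f"{size:.0f} {unit}" if unit == "B" else f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} GB"
```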

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:28:01 +00:00
155dd30746 feat(pdf): extract statement header (account + period) + date format
Three related additions for the accountant workflow:

**1. Statement header extraction.** New
``extract_statement_metadata(pages)`` pulls the account number
and statement period out of the first page (falls back to
page 1+2 if either is missing on page 1 — Wells Fargo business
accounts put header info on page 2). Detected fields are
stamped onto EVERY transaction row so a multi-statement CSV is
self-attributing per row::

    {
      "date": "20250113",
      "description": "Coffee Shop",
      "amount_1": -4.50,
      "account_number": "****5678",
      "statement_period_start": "20250101",
      "statement_period_end": "20250131",
      ...
    }

Account-number regex is tolerant of masks (``****1234``),
hyphens (``1234-5678-9012``), and spaces. Period regex looks
for "Statement Period" / "From" / "Period Covered" labels plus
the first 1-2 full-year dates that follow. If only one date is
present near the label, it's used for both start and end (some
statements show only the closing date).

**2. Year inference for short dates.** When the row date is a
short ``01/13`` or ``Jan 13`` without a year, the scanner now
binds the year from the statement period's end date BEFORE
formatting. Doesn't handle the December-in-January-statement
cross-year case (rare; user can edit in the table).

**3. Configurable output date format.** New
``output_date_format`` parameter on ``scan_pdf_for_transactions``
defaults to ``%Y%m%d``. Applied to: the transaction date column
AND the statement period start/end fields. The page surfaces a
dropdown in Scan options with common presets (YYYYMMDD,
YYYY-MM-DD, MM/DD/YYYY, DD/MM/YYYY, ``Mon DD, YYYY``) plus a
Custom option that accepts a raw strftime string.

New helper: ``format_date(iso_str, fmt)`` converts ISO
``YYYY-MM-DD`` to any strftime; passes invalid input through
unchanged so the user can see what was actually there rather
than getting silent empties.
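
A sketch of a helper matching that description, using the ``%Y%m%d``
default this commit introduces. It is a reconstruction from the
message, not the project's code.

```python
from datetime import datetime


def format_date(iso_str, fmt="%Y%m%d"):
    """Convert an ISO YYYY-MM-DD string to ``fmt``; return bad input unchanged."""
    try:
        return datetime.strptime(iso_str, "%Y-%m-%d").strftime(fmt)
    except (ValueError, TypeError):
        return iso_str
```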

20 new tests cover: format_date, account-number extraction
(masked / hyphenated / spaced / no-label / short), period
extraction (standard / from-to / single-date / no-label),
metadata orchestrator (full header / no pages / page-2
fallback), year inference (US / dash / month-name / no-period /
unparseable), plus an end-to-end class that builds a header'd
PDF with short-date transactions and confirms metadata
attribution + year inference + format round-trip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:20:46 +00:00
3cf935c999 fix(pdf): drop zero-amount rows; multi-date rows clean description
Two corrections from real-statement feedback:

**1. Drop rows where the transaction amount is exactly 0.**
Bank statements include date+amount-shaped noise like
"INTEREST EARNED 0.00", "PAGE TOTAL 0.00", "BALANCE FORWARD
0.00 1,234.56" — all match the date+amount heuristic but
aren't transactions. New filter in
``scan_pdf_for_transactions``: drop rows whose ``amount_1``
parses to exactly 0. Non-zero balances in ``amount_2`` don't
rescue a zero amount_1 — leftmost amount is the canonical
transaction amount. Unparsed-but-non-empty amount strings are
kept (user verifies in the editor).

**2. Multi-date rows: first date wins for the column, every
date excluded from the description.** Chase / BofA / Wells
commonly show both a transaction date and a posting date per
row:

    01/13  01/14  COFFEE SHOP  $4.50

Before this fix, ``_find_dates_in_words`` returned the first
date only and the second date leaked into description as
"01/14 COFFEE SHOP". Now it returns ALL dates with their word
ranges; the scanner uses ``dates[0]`` as the canonical date
and passes every range to the description builder for
exclusion.

The detector's two-pass strategy now also guards against
mixing full-year and short-date matches on the same row.
Previously, a header line like ``Page 1/2 of 3 ... Statement
Date 01/13/2026`` would return both ``1/2`` and ``01/13/2026``,
and ``1/2`` (being leftmost) would have won the date column.
Now: if any full-year date is found on the row, short patterns
are NOT also collected — full year anchors interpretation. A
row with no full-year date (Chase short-date case) still falls
back to short patterns and collects all of them.
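
Both corrections can be sketched roughly as follows. This is an approximation, not the shipped code: it scans a plain string instead of pdfplumber words, the regex lists are a small illustrative subset, and ``find_dates``, ``_to_number`` and ``drop_zero_amount_rows`` are hypothetical names.

```python
import re
from typing import List, Optional

# Illustrative subsets of the full-year and short-date pattern groups.
_FULL = (
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
)
_SHORT = (
    re.compile(r"\b\d{1,2}/\d{1,2}\b"),
    re.compile(r"\b[A-Z][a-z]{2}\s+\d{1,2}\b"),
)


def find_dates(line: str) -> List[str]:
    """Return every date on the line, left to right; full-year wins."""
    for patterns in (_FULL, _SHORT):
        hits = sorted(
            (m.start(), m.group(0)) for p in patterns for m in p.finditer(line)
        )
        if hits:
            # Short patterns are only tried when no full-year date exists.
            return [text for _, text in hits]
    return []


def _to_number(value) -> Optional[float]:
    """Best-effort parse; None means 'could not parse'."""
    if value is None:
        return None
    if isinstance(value, (int, float)):
        return float(value)
    text = str(value).strip().replace("$", "").replace(",", "")
    if text.startswith("(") and text.endswith(")"):
        text = "-" + text[1:-1]  # parens-style negative
    try:
        return float(text)
    except ValueError:
        return None


def drop_zero_amount_rows(rows: List[dict]) -> List[dict]:
    """Drop rows whose amount_1 parses to exactly 0; keep unparsed ones."""
    return [row for row in rows if _to_number(row.get("amount_1")) != 0]
```

In this sketch a non-zero ``amount_2`` is never consulted, which mirrors the "leftmost amount is canonical" rule above.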

New tests:
- ``test_multiple_dates_returned_in_position_order`` —
  ``01/13`` + ``01/14`` both returned, in order
- ``TestMultiDateRow.test_first_date_wins_second_excluded_from_description``
  — end-to-end through ``scan_pdf_for_transactions``
- ``TestZeroAmountRowsAreDropped.test_zero_amount_row_dropped``
  — "INTEREST EARNED 0.00" row dropped while real txn kept
- ``test_negative_amount_kept`` — pin that -40.00 is not
  treated as zero by the filter

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:12:21 +00:00
263af3c7c2 fix(pdf): short dates without year + diagnostic for "0 rows" runs
User uploaded a real Chase statement and got "0 rows detected."
Two bugs the rewrite shipped with, plus a diagnostic:

**1. Short dates without year weren't recognized.** Most bank
statements (Chase, Wells, BofA, …) display transaction dates as
``01/13`` or ``Jan 13`` because the year is implied by the
statement period. The original regex required ``\d{2,4}`` after
the second slash, so ``01/13`` failed to match and rows with no
detected date got dropped.

Split ``_DATE_RES`` into ``_FULL`` (with year) and ``_SHORT``
(no year), with a two-pass detector: pass 1 tries full-year
patterns across the whole row; pass 2 only tries short patterns
if pass 1 found nothing. This prevents a stray ``Page 1/2`` from
shadowing the real dated transaction on the same line.

Short patterns:
- ``\d{1,2}/\d{1,2}`` — Chase, etc.
- ``\d{1,2}-\d{1,2}``
- ``[A-Z][a-z]{2}\s+\d{1,2}`` — "Jan 13"

When parsing, short dates pass through ``parse_date`` and
return None (no year to bind to), so the scanner falls back to
the raw text — the user sees ``01/13`` in the date column and
can correct in the editor.

**2. Multi-word dates leaked the day token into the description.**
A pre-existing bug: ``_find_dates_in_words`` returned only the
START word index, and ``_description_from_row`` only excluded
that single word. For "Jan 13 Coffee $4.50", the description
became "13 Coffee" instead of "Coffee". Fixed by returning
``(start, end, text)`` with ``end`` exclusive (computed from
``len(m.group(1).split())`` so window-overrun doesn't
over-consume), and the description builder now skips the full
range.

**3. New diagnostic: ``diagnose_pdf_lines(pdf_bytes)``.** Returns
every clustered text line the scanner saw with ``has_date`` /
``has_amount`` flags. When the page's scan returns 0 rows, an
auto-expanded "what the scanner saw" expander now renders a
table of all extracted lines so the user can:

- Spot scanned-PDF cases (empty result → enable OCR)
- See which lines have a date but no amount (or vice versa)
- Eyeball the date / amount format the scanner missed

Without leaving the app or asking the developer for help.
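
The multi-word date fix in item 2 comes down to skipping a half-open word range instead of a single index. A minimal sketch, assuming the row is a list of word strings and that amount positions are passed in separately (the function name and signature here are illustrative, not the real ``_description_from_row``):

```python
from typing import Iterable, Sequence, Tuple


def description_from_row(
    words: Sequence[str],
    date_ranges: Iterable[Tuple[int, int, str]],
    amount_indexes: Iterable[int],
) -> str:
    """Join the words that are neither part of a date nor an amount."""
    skip = set(amount_indexes)
    for start, end, _text in date_ranges:
        # ``end`` is exclusive, so ("Jan", "13") is the range (0, 2).
        skip.update(range(start, end))
    return " ".join(word for i, word in enumerate(words) if i not in skip)
```

Passing ``(0, 1, "Jan")`` instead of ``(0, 2, "Jan 13")`` reproduces the old "13 Coffee" leak.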

Eight new tests cover: short US date (``01/13``), short month-
name date with two-word consumption (``Jan 13``), the
``Page 1/2 ... 01/13/2026`` shadowing case, and the multi-word-
date description fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 00:06:07 +00:00
bece2b4030 refactor(pdf): rip out templates; heuristic scan + selectable table
User feedback: the template / visual-picker / mode-dispatch
implementation was too complex for the actual workflow.
Statements drift between months, the canvas state didn't survive
multi-page navigation, and accountants don't want to maintain
per-bank configuration just to convert PDFs to CSV.

Start-over design — one public function, one page, no
persistence:

  ``scan_pdf_for_transactions(pdf_bytes) → (rows, warnings)``

A row is "any text line with a date pattern AND at least one
amount pattern." Each detected row is a dict shaped::

    {
      "date": "2026-01-15",
      "description": "Coffee Shop",
      "amount_1": -4.50,
      "amount_2": 1000.00,   # if a second amount was found
      "page": 1,
      "raw": "01/15/2026 Coffee Shop (4.50) 1,000.00",
      "source_file": "chase-jan-2026.pdf",
    }

Multi-line descriptions still merge (no-date no-amount lines
attach to the previous transaction). Multi-PDF batches share a
single combined table with a ``source_file`` column.

**Page UX:**

- Upload PDF(s) → optional Options expander (parens-negative,
  use-OCR) → click Scan → see all detected rows in an
  ``st.data_editor``.
- The editor has an ``Include`` checkbox column (default on),
  plus user-editable date / description / amount cells and a
  read-only ``raw`` column showing the original PDF text for
  verification.
- A ``Columns to include in CSV`` multiselect hides
  ``page`` / ``raw`` from the download by default; user can
  re-add either.
- Download CSV gets only the checked rows.

No template save/load. No visual picker. No mode dispatch. No
column boundaries. No schema migration. No per-bank
configuration files.

**Deletions:**

- ``src/pdf_templates.py`` — template storage layer
- ``src/gui/_drawable_canvas_compat.py`` — Streamlit compat shim
  for the canvas (no canvas now)
- ``tests/test_pdf_templates.py``, ``test_pdf_row_heuristic.py``,
  ``test_drawable_canvas_compat.py`` — covered the removed APIs
- ``build/hooks/hook-streamlit_drawable_canvas.py`` — hook for
  the removed dep
- ``streamlit-drawable-canvas==0.9.3`` from ``requirements.txt``
- The drawable-canvas references in ``build/datatools.spec``

**``src/pdf_extract.py``** shrinks from ~30 helper functions to
~10. Keeps: value parsers, row clusterer, date/amount token
finders, OCR pipeline, dependency guards. The one new public
function ``scan_pdf_for_transactions`` glues them together.

**Tests** (59 passing): the unit layer keeps full coverage of
the building blocks; the smoke layer pins the end-to-end PDF
roundtrip, OCR discovery, dependency-import behavior, and the
multi-line-description merge. The fpdf2-generated fixture PDF
still drives the real-PDF test.

Rollback: ``git revert HEAD`` brings back the template system if
needed — but the simpler model should make that unlikely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:57:30 +00:00
60969c0770 feat(pdf): UI rework — Auto-detect is the default build flow
Pulls the user's primary mental model away from "draw column
boundaries" toward "tell me what shape your amounts have, see
detected rows, save." The visual picker that wasn't working for
multi-statement workflows is reachable but no longer the
default.

**Build mode header** now has a mode radio:

- "Auto-detect (recommended)" — row_heuristic. Tabs: Amount
  layout · Filters & date · Save. Three small forms; no
  coordinate UI anywhere. The Amount-layout tab's dropdown picks
  one of single / txn+balance / debit+credit / debit+credit+balance
  and auto-derives the min/max amount-count range (overridable
  under an expander).
- "Visual columns (advanced)" — column_visual. Five tabs (the
  original Visual picker / Pages & table / Columns / Parsing /
  Save). A yellow warning panel up top reminds the user that
  column-x templates only work when statement layout is stable.

Switching modes triggers a rerun so the right tab set renders
immediately. The template object preserves both mode's config
trees side-by-side so a user can flip between them without
losing work.

**Live preview** below the form runs ``apply_template`` against
the cached sample pages (already cached in session_state so this
re-renders cheaply on every form edit). The "no rows yet"
message is mode-aware — points users at the right tuning knobs
for whichever mode they're in. The preview caption notes which
mode produced the rows so the user can correlate decisions to
output.

The visual picker bug the user reported — "a single box stays in
the same location regardless of page" — is sidestepped rather
than fixed: in row_heuristic mode there's no canvas to confuse,
and for the rare column_visual user the canvas is still
imperfect but no longer their first interaction with the tool.
Cleaning up the column_visual canvas state bugs is a separate
follow-up if real users still hit the Advanced mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:46:27 +00:00
48cd9e8249 feat(pdf): schema v2 + mode field + v1 in-memory migration
Bumps ``SCHEMA_VERSION`` from 1 to 2 to add a top-level ``mode``
field distinguishing ``row_heuristic`` (new default) from
``column_visual`` (legacy). The schema bump is real — old code
that defaults missing keys would silently mis-extract — so we
do it the careful way:

- ``new_template`` now returns mode=``row_heuristic`` with the
  full row-heuristic config tree pre-populated. The legacy
  column-visual fields are still seeded with empty defaults so
  switching modes in the GUI doesn't require runtime key
  insertion.
- ``validate_template`` is mode-aware: row_heuristic templates
  must have a valid ``amounts.shape`` + sane
  ``row_detection.min/max_amounts_per_row``; column_visual
  templates keep the existing column/target requirements.
- ``load_template`` accepts both v1 and v2 files
  (``_LOAD_SUPPORTED_VERSIONS = {1, 2}``). v1 files get
  ``mode="column_visual"`` injected and ``schema_version`` bumped
  IN MEMORY ONLY — disk file stays v1 until the user explicitly
  re-saves. A buggy migration can't silently corrupt their
  template library.
- ``save_template`` continues to write the current schema; saving
  a v1 template through the GUI naturally upgrades it.
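
The in-memory-only migration can be sketched like this. ``SCHEMA_VERSION`` and ``_LOAD_SUPPORTED_VERSIONS`` are the names used above; ``migrate_in_memory`` is a hypothetical stand-in for the relevant part of ``load_template``, and raising ``ValueError`` for unknown versions is an assumption.

```python
import copy

SCHEMA_VERSION = 2
_LOAD_SUPPORTED_VERSIONS = {1, 2}


def migrate_in_memory(raw: dict) -> dict:
    """Return an upgraded copy of a parsed template; never touch ``raw``."""
    version = raw.get("schema_version")
    if version not in _LOAD_SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported template schema_version: {version!r}")
    template = copy.deepcopy(raw)  # the on-disk dict stays as loaded
    if version == 1:
        # v1 files predate the mode field and were always column-based.
        template["mode"] = "column_visual"
        template["schema_version"] = SCHEMA_VERSION
    return template
```

Because the function only returns a copy, nothing is written back until the caller explicitly saves.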

Mode + shape constants exported (``VALID_MODES``,
``VALID_AMOUNT_SHAPES``) so the GUI dropdowns can derive their
options from the source of truth.

Tests split into ``TestValidateTemplateRowHeuristic`` (6) +
``TestValidateTemplateColumnVisual`` (4) + ``TestV1Migration``
(1). All 29 template tests pass; the original column-mode tests
that previously implicitly relied on schema_version=1 keep
working because new_template's seeded column fields are still
present in row_heuristic templates (just not validated as
required).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:46:10 +00:00
d80befd05a feat(pdf): row-heuristic extraction (mode dispatch, no coordinates)
User reported the column-visual approach is too brittle for real
bank statements: column-x-positions saved against a sample page
don't survive layout drift between months (statement A has
columns at x=300, statement B drifted to x=320), and a saved
template can only realistically work for one statement's
specific render. The fundamental fix is to stop depending on
coordinates at all.

**Row-heuristic mode** finds transaction rows by pattern: any
line with a date token + N amount tokens IS a transaction. Date
patterns (US slash / EU slash / ISO / "Jan 15, 2026" / etc.) and
amount patterns (currency, parens-negative, thousands grouping)
are matched against word text — no x-positions involved.

The full pipeline:

1. ``find_transaction_rows`` clusters words into rows and scans
   each line for date + amount tokens.
2. Multi-line descriptions still attach to the previous row via
   the no-date-no-amount continuation rule.
3. Amount shapes drive interpretation: ``single`` /
   ``txn_balance`` / ``debit_credit`` / ``debit_credit_balance``.
4. ``_infer_amount_column_centers`` clusters amount x-midpoints
   ACROSS ALL detected rows to find natural column groupings —
   so debit-vs-credit assignment for single-amount lines works
   without the user marking anything on screen.
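
Step 4 does not say which clustering is used. One simple possibility is 1-D gap clustering over the sorted midpoints, sketched below; the function name, the ``gap`` parameter and its default are assumptions for illustration only.

```python
from typing import List, Sequence


def infer_column_centers(midpoints: Sequence[float], gap: float = 25.0) -> List[float]:
    """Group x-midpoints separated by less than ``gap``; return group means."""
    if not midpoints:
        return []
    ordered = sorted(midpoints)
    clusters = [[ordered[0]]]
    for x in ordered[1:]:
        if x - clusters[-1][-1] > gap:
            clusters.append([x])  # wide gap: a new column starts here
        else:
            clusters[-1].append(x)
    return [sum(cluster) / len(cluster) for cluster in clusters]
```

A single-amount line can then be assigned to debit or credit by picking the nearest center.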

``apply_template`` is now a dispatch over ``template["mode"]``:

- ``mode="row_heuristic"`` (default for new templates) — the new
  pipeline.
- ``mode="column_visual"`` — the existing pipeline, kept under
  ``_apply_template_column_visual`` for v1 templates and the
  Advanced fallback.

18 new tests cover: date detection (US slash, two-digit year,
ISO, month-name, missing); amount-token finding (currency,
parens, pure text, bare-year rejection); column-center inference
(clear two-column case, empty input); end-to-end on synthetic
Page objects with all four amount shapes; the critical
layout-drift test that proves the same template works on pages
of different sizes / different absolute x-positions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:45:55 +00:00
10015c40e1 fix(pdf): shim image_to_url for drawable-canvas on modern Streamlit
User hit ``AttributeError: module 'streamlit.elements.image' has
no attribute 'image_to_url'`` on first PDF import. Root cause:
``streamlit-drawable-canvas`` 0.9.3 (last upstream release 2023)
calls a Streamlit internal that was relocated in Streamlit
~1.30+. The function moved from ``streamlit.elements.image`` to
``streamlit.elements.lib.image_utils`` AND its signature
changed — the second positional argument is now a
``LayoutConfig`` dataclass instead of a plain ``int`` width.

Three remedies considered:

1. Downgrade Streamlit. Reverses unrelated improvements +
   security fixes; not on the table.
2. Fork drawable-canvas. The maintenance hit isn't worth it for a
   one-line internal API change.
3. **Ship a compatibility shim.** Re-attach a wrapper at the old
   import path that adapts the old call shape to the new
   function. This is the standard workaround the wider Streamlit
   community has converged on for this exact regression.

``src/gui/_drawable_canvas_compat.py`` does (3). The ``install()``
helper is idempotent, opt-in (not auto-run at module import — a
grep for ``_install_canvas_compat`` shows every call site), and
no-ops if Streamlit hasn't moved the function OR if the new
function isn't where we expect (lets the canvas surface a real
error rather than papering over a different bug). The page calls
``_install_canvas_compat()`` once at module top before any
``st_canvas`` invocation; Streamlit's script-rerun model means
this fires every page load but the ``_PATCHED`` guard makes
re-runs free.

The shim wraps the old ``width=int`` arg into a default-constructed
``LayoutConfig()`` — the old ``width=-1`` sentinel meant "use
the image's natural width", which is also what an unconfigured
LayoutConfig produces. Confirmed by inspecting Streamlit 1.57.0's
``image_utils.py``.

4 new tests pin the shim contract:

- ``install()`` attaches ``image_to_url`` to the old path on modern
  Streamlit
- Idempotent — calling twice doesn't double-wrap
- Doesn't clobber a future Streamlit that restores the original
  at the old path
- Translates ``(image, -1, False, "RGB", "PNG", "id")`` into a
  proper call to the new function with a ``LayoutConfig`` instance

If a future Streamlit upgrade moves ``image_to_url`` AGAIN, the
shim's silent-no-op fallback means the canvas error surfaces
again and points at where to look. The shim doesn't paper over
mysteries; it only patches the one specific relocation we know
about.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:29:20 +00:00
e6ee2e3481 feat(pdf): robust Tesseract discovery + OS-aware install copy
User tried ``brew install tesseract`` in PowerShell after seeing
all three OSes listed inline in the OCR banner — easy mistake
when the install commands are crammed on one line with ``·``
separators. Two changes pre-empt this:

**OS-aware OCR banner.** The expander now detects the user's
platform via ``platform.system()`` and shows only the relevant
install instructions:

- **Windows**: UB-Mannheim installer link, numbered steps,
  explicit "keep the Add to PATH checkbox on" callout, plus a
  fallback paragraph telling the user how to set
  ``DATATOOLS_TESSERACT_PATH`` if they already installed
  without PATH and don't want to reinstall.
- **macOS**: ``brew install tesseract`` with a Homebrew link.
- **Linux**: ``apt install tesseract-ocr`` with a "or your
  distro's equivalent" hedge.

**Robust binary discovery in ``ocr_available()``.** Three-stage:

1. Honor ``DATATOOLS_TESSERACT_PATH`` env var if set — explicit
   override for portable installs or non-default locations.
2. Try ``pytesseract``'s default PATH-based lookup.
3. If PATH lookup fails, probe known Windows install paths
   (``C:\Program Files\Tesseract-OCR\tesseract.exe``,
   the x86 variant, and ``%LOCALAPPDATA%\Programs\Tesseract-OCR\``)
   via the new ``_autodetect_tesseract_path``. On hit, set
   ``pytesseract.pytesseract.tesseract_cmd`` so all subsequent
   ``image_to_data`` calls use the same binary without
   re-discovering.

This means a user who runs the UB-Mannheim installer with
default options but forgets the PATH checkbox will still get
OCR working after a launcher restart, without env-var
gymnastics.
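
A hedged sketch of the three-stage lookup. ``shutil.which`` stands in for pytesseract's own PATH lookup, the ``(x86)`` path and the ``tesseract.exe`` filename under ``%LOCALAPPDATA%`` are assumptions, and ``find_tesseract`` is a hypothetical name (the real entry points are ``ocr_available()`` and ``_autodetect_tesseract_path``). Dependencies are injectable so the logic can be exercised on any OS.

```python
import os
import platform
import shutil
from typing import Callable, List, Mapping, Optional


def windows_candidates(env: Mapping[str, str]) -> List[str]:
    """Well-known install locations to probe when PATH lookup fails."""
    candidates = [
        r"C:\Program Files\Tesseract-OCR\tesseract.exe",
        r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",  # assumed path
    ]
    local = env.get("LOCALAPPDATA")
    if local:
        candidates.append(local + r"\Programs\Tesseract-OCR\tesseract.exe")
    return candidates


def find_tesseract(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
    exists: Callable[[str], bool] = os.path.isfile,
    system: Optional[str] = None,
) -> Optional[str]:
    # Stage 1: the explicit override wins, even if the path does not resolve.
    override = env.get("DATATOOLS_TESSERACT_PATH")
    if override:
        return override
    # Stage 2: ordinary PATH lookup.
    on_path = which("tesseract")
    if on_path:
        return on_path
    # Stage 3: probe known Windows locations only.
    if (system or platform.system()) == "Windows":
        for candidate in windows_candidates(env):
            if exists(candidate):
                return candidate
    return None
```

A caller would assign a non-None result to ``pytesseract.pytesseract.tesseract_cmd`` once, so later OCR calls reuse the same binary.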

Tests (4 new, 85 total in the suite):

- Auto-detect returns None on non-Windows (no false positives
  on dev laptops).
- Auto-detect finds the binary at a mocked
  ``C:\Program Files\Tesseract-OCR\tesseract.exe``.
- Auto-detect returns None when no candidate exists.
- ``DATATOOLS_TESSERACT_PATH`` env var beats both PATH lookup
  and auto-detect (sets ``tesseract_cmd`` even when the path
  doesn't resolve, so a real binary at a custom location works).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:15:00 +00:00
538e23d219 build(pdf): bundle PDF deps in installers + pin versions + smoke tests
Three changes prepare the next tagged release so end users get
the PDF Extractor without ever touching pip.

**Exact-pin the new deps** (``requirements.txt``):

  pdfplumber==0.11.9
  pypdfium2==5.8.0
  pytesseract==0.3.13
  streamlit-drawable-canvas==0.9.3

Tight pins are the right call for these because the GUI's
visual-picker geometry + the parsing-pipeline word positions
depend on stable internal behavior — a quiet upstream tweak to
``extract_words`` or ``page.render`` would re-break the tool on
the next CI build. Bumping requires a deliberate edit + a CI
run, not a transient ``pip install`` resolving to whatever
``setup.py`` pulled.

Existing deps stay on their current ``>=X.Y,<X+1`` ranges; the
user's "tight pin" concern is specifically about the PDF stack.

**Wire the new deps into the PyInstaller bundle** (``build/``):

- ``datatools.spec`` — add ``collect_submodules`` for pdfplumber,
  pdfminer, pypdfium2, streamlit_drawable_canvas, PIL,
  pytesseract; add ``collect_data_files`` for pypdfium2 (PDFium
  native ``.dll``/``.so``/``.dylib``), streamlit_drawable_canvas
  (frontend JS bundle), pdfminer (Adobe CMap tables).
- ``hooks/hook-pypdfium2.py`` — belt-and-braces hook that uses
  ``collect_dynamic_libs`` to force-include the PDFium binary.
  Without this the visual picker silently fails on installed
  builds with a ``FileNotFoundError`` for the shared library.
- ``hooks/hook-streamlit_drawable_canvas.py`` — collects the
  built JS frontend so the canvas iframe loads under the bundled
  Streamlit server instead of rendering blank.

**Tesseract is intentionally NOT bundled** (option A from the
design discussion). Modern bank statements are text-based;
bundling Tesseract would ~triple installer size for a long-tail
case. The in-app banner directs users to install it from
``UB-Mannheim/tesseract`` if they need OCR. Decision is captured
in the ``project-pdf-installer-pending`` memory note.

**Smoke tests** (``tests/test_pdf_extract_smoke.py``, 17 tests)
add the layer above the pure unit tests:

- ``TestDependencyImports`` — each dep imports cleanly
- ``TestRealPdfRoundTrip`` — generates a tiny statement PDF in
  memory with ``fpdf2`` (test-only dep in
  ``requirements-dev.txt``), runs ``extract_pages`` +
  ``apply_template``, asserts 3 rows out with the right signed
  amounts. Catches "the build succeeded but pdfplumber breaks at
  runtime."
- ``TestRenderPageImage`` — exercises ``pypdfium2.render`` so the
  hook-bundled native lib gets a real call. This is the most
  common installer-bug signature (missing .dll) and the test
  catches it before users do.
- ``TestPdfDependencyMissing`` — monkeypatches ``__import__`` to
  simulate a stripped install; confirms the typed exception +
  actionable hint round-trip.
- ``TestPinnedVersionsMatchInstalled`` — parametrized over all
  four pinned dists; uses ``importlib.metadata`` rather than
  ``__version__`` because pypdfium2 doesn't expose it directly.
  Trips if someone bumps the pin without reinstalling.
- ``TestOcrAvailability`` — confirms ``ocr_available()`` returns
  ``(bool, str)`` and ``extract_pages_auto(allow_ocr=False)``
  skips OCR cleanly.
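
The pin check in ``TestPinnedVersionsMatchInstalled`` could look roughly like the sketch below. ``parse_pins`` and ``mismatched_pins`` are hypothetical helpers, not the test's actual code; only the use of ``importlib.metadata`` comes from the description above.

```python
from importlib import metadata
from typing import Callable, Dict, Optional, Tuple


def parse_pins(requirements_text: str) -> Dict[str, str]:
    """Collect ``name==version`` lines; ranges and comments are ignored."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def mismatched_pins(
    pins: Dict[str, str],
    installed_version: Callable[[str], str] = metadata.version,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each drifting dist to (pinned, installed-or-None)."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None  # pinned but not installed at all
        if found != wanted:
            problems[name] = (wanted, found)
    return problems
```

An empty result means every pinned distribution is installed at exactly the pinned version.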

All 81 PDF + audit tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:10:43 +00:00
2d927bc95f fix(pdf): graceful fallback when PDF dependencies aren't installed
User hit a hard ImportError on opening the PDF→CSV tool because
``pip install -r requirements.txt`` hadn't picked up the new
``pdfplumber`` / ``pypdfium2`` lines yet. Streamlit surfaces
that as an unfiltered traceback — friendlier to show a clear
install-required panel inside the tool instead.

Two changes:

1. ``src/pdf_extract.py`` lazy-imports the PDF deps via
   ``_require_pdfplumber()`` / ``_require_pdfium()`` helpers that
   raise a new ``PdfDependencyMissing`` (subclass of ImportError)
   with an actionable ``hint`` field. Pure helpers
   (``parse_amount``, ``parse_date``, ``cluster_rows``, etc.)
   keep working with no PDF dep installed — useful for tests and
   for keeping module-import paths cheap.

2. The tool page probes both deps at render time via
   ``_pdf_deps_status()``; if anything's missing it shows a
   ``st.error`` panel with the exact pip command and a
   "restart the launcher" reminder, then ``st.stop()``s before
   touching any PDF code path.
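
A minimal sketch of the lazy-import guard from change 1. ``PdfDependencyMissing`` and its ``hint`` field are named above; the generic ``require`` helper and the exact hint wording are assumptions (the real module exposes ``_require_pdfplumber()`` and ``_require_pdfium()``).

```python
import importlib
from types import ModuleType


class PdfDependencyMissing(ImportError):
    """Raised when an optional PDF dependency cannot be imported."""

    def __init__(self, package: str, hint: str) -> None:
        super().__init__(f"{package} is not installed. {hint}")
        self.package = package
        self.hint = hint


def require(package: str) -> ModuleType:
    """Import ``package`` on first use instead of at module import time."""
    try:
        return importlib.import_module(package)
    except ImportError as exc:
        raise PdfDependencyMissing(
            package, hint=f"Run: pip install {package}"
        ) from exc
```

Because the exception subclasses ``ImportError``, existing ``except ImportError`` handlers keep working while the page can still read ``exc.hint``.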

The page itself loads cleanly without the deps installed, so the
sidebar nav doesn't 500 — the user just sees the install panel
on click.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:59:20 +00:00
967d3f6a11 feat(pdf): OCR availability banner + per-run toggle
Phase 6/6. Final polish layer on top of the OCR pipeline that
``extract_pages_auto`` has carried since commit 1.

- **OCR status banner** at the top of the page next to the mode
  selector. Ready: a one-liner caption confirming OCR will run
  on scanned pages. Unavailable: a collapsed expander explaining
  the missing piece (``pytesseract`` binding vs. Tesseract
  binary) with install pointers for Windows, macOS, and Linux.
  The expander explicitly notes that modern text-based bank
  statements don't need OCR — most users will never expand it.
- **"Use OCR for scanned pages" toggle** in Extract mode,
  defaulting to the runtime availability. Disabled (greyed out)
  when Tesseract isn't usable, so the user can't accidentally
  set themselves up for confusing warnings. Passes through as
  ``allow_ocr`` to ``extract_pages_auto``.
- Build mode's sample-loading path continues to call
  ``extract_pages_auto(..., allow_ocr=True)`` — sample preview
  always uses OCR if available, since the user is actively
  diagnosing template fit.

No schema change. OCR's structural support is in commits 1 + 3;
this commit just makes it discoverable + opt-out.

Rolling up the 6-commit feature:

  b8aff86  Phase 1 — pure pdf_extract module + tests
  aea520d  Phase 2 — template storage layer + tests
  2f349e8  Phase 3 — Extract/Build/Manage page + nav + i18n
  5a8e2ec  Phase 4 — batch polish (ZIP, sort, status block)
  b86828d  Phase 5 — visual region picker (drawable canvas)
  THIS     Phase 6 — OCR banner + toggle

Each commit is independently revertable; rolling all the way
back to ``c16e2a5`` is ``git revert <this> b86828d 5a8e2ec
2f349e8 aea520d b8aff86`` (newest first, so each revert applies
cleanly), or just ``git reset --hard c16e2a5`` on a clean branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:54:11 +00:00
b86828d791 feat(pdf): visual region picker on rendered sample page
Phase 5/6. Adds a "Visual picker" tab as the first stop in the
template-build flow. The sample PDF page is rasterized with
``pypdfium2`` (capped at ~900px wide for sensible display), and
``streamlit-drawable-canvas`` overlays drawing tools on top.

UX:

- **Line mode** — drag short (roughly vertical) strokes where you
  want columns to split. Each stroke's x-midpoint becomes one
  boundary in PDF point coordinates.
- **Rect mode** — drag a rectangle around the transactions
  table; bbox is preserved on the template as
  ``visual.table_bbox`` for round-trip, future use as a hard
  crop region.
- **Transform mode** — move/resize already-drawn shapes after
  the fact.

Round-trip: re-entering Build mode with an existing template
seeds the canvas with full-height vertical lines for every
boundary already on the template, plus the saved bbox if any,
so editing-after-save matches the user's mental model.

Coordinate translation: the canvas reports pixel positions; we
divide by the renderer's pixels-per-PDF-point scale to get back
to PDF coordinates that ``apply_template`` already expects. No
template-schema change required — the boundaries the picker
writes are the same list the text-input editor wrote in
commit 3, just sourced visually.
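
The coordinate translation is a single division. A small sketch, assuming ``scale`` is the pixels-per-PDF-point value returned alongside the image by ``render_page_image``; both function names below are hypothetical.

```python
def pixels_to_points(x_px: float, scale: float) -> float:
    """Convert a canvas pixel x-position to PDF points."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return x_px / scale


def stroke_to_boundary(x1_px: float, x2_px: float, scale: float) -> float:
    """A roughly vertical stroke becomes one boundary at its x-midpoint."""
    return pixels_to_points((x1_px + x2_px) / 2.0, scale)
```

For a 612 pt wide page rendered at 900 px, ``scale`` is ``900 / 612`` and a stroke drawn around x = 441 px lands near 299.88 pt.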

New helper in the extraction module:

- ``render_page_image(pdf_bytes, page_no, target_width=900)`` —
  rasterize a single 1-indexed page to a PIL image; returns
  ``(image, scale)`` for coordinate translation.

The text-input boundary editor in the Columns tab remains as a
fallback for power users / keyboard-only workflows and for
copy-paste from spreadsheet-derived x-positions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:52:54 +00:00
5a8e2ec9e1 feat(pdf): batch extract polish — ZIP output, sort-by-date, status block
Phase 4/6. Polishes the batch workflow shipped in commit 3:

- **st.status progress block** replaces the simple progress bar.
  Each file appears as its own line as it's processed; the block
  auto-collapses on completion with a "12/13 extracted" summary
  and turns red if any file errored.
- **Sort combined output by date** checkbox (default ON) sorts
  the merged CSV ascending by date, with source_file as a stable
  secondary sort so multiple statements interleave by date but
  same-day rows from the same file stay together.
- **ZIP-of-per-PDF-CSVs output option** alongside the combined
  CSV. When the accountant has 12 statements from 12 different
  account periods and wants to feed them into 12 separate ledger
  imports, the ZIP keeps each file's rows in its own CSV named
  after the original PDF stem.
- **Per-file summary table** (already present from commit 3)
  gains a ``status`` column ("ok" / "no rows" /
  "error: ExceptionName") so error grouping is obvious at a
  glance.
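
The sort and the ZIP option can be sketched with the standard library alone. This assumes rows are dicts with an ISO-format ``date`` string and a ``source_file`` key; ``sort_combined`` and ``zip_per_pdf`` are hypothetical names, and duplicate PDF stems are not handled here.

```python
import csv
import io
import zipfile
from pathlib import PurePath
from typing import Dict, List, Sequence


def sort_combined(rows: List[dict]) -> List[dict]:
    """Ascending by date, then by source file; ties keep original order."""
    return sorted(rows, key=lambda r: (r["date"], r["source_file"]))


def zip_per_pdf(rows: List[dict], fieldnames: Sequence[str]) -> bytes:
    """Build an in-memory ZIP with one CSV per source PDF, named by stem."""
    by_file: Dict[str, List[dict]] = {}
    for row in rows:
        by_file.setdefault(row["source_file"], []).append(row)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for source, file_rows in by_file.items():
            text = io.StringIO()
            writer = csv.DictWriter(text, fieldnames=fieldnames, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(file_rows)
            archive.writestr(PurePath(source).stem + ".csv", text.getvalue())
    return buffer.getvalue()
```

The returned bytes can be handed directly to a download button.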

Cancellation is intentionally not added — Streamlit's single-
thread rerun model has no clean way to interrupt a tool-run
mid-stream without architectural changes to extraction. If a
user mis-fires Extract on 50 PDFs they can refresh the browser
tab; the task will be killed when the next interaction comes in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:51:05 +00:00
2f349e8191 feat(pdf): tool page with Extract / Build / Manage modes
Phase 3/6. Wires the PDF Extractor into the GUI as a new
"transformations" tool with three modes selected by a horizontal
radio at the top of the page:

**Extract** — pick a saved template, upload one or more
statement PDFs (single + batch shipping together to keep the
common case one-step), get a previewed DataFrame + CSV download.
Per-file row counts and warnings are surfaced; failures on one
file don't kill the whole batch. The combined CSV gets a
``source_file`` first column so the accountant can sort/filter
by statement.

**Build template** — load an existing template or start fresh,
upload a sample PDF, edit every schema field across four tabs
(Pages & table / Columns / Parsing / Save). A live preview below
re-runs ``apply_template`` against the sample on each re-render
so the user sees their changes hit rows immediately. The column-
boundary editor is text-input ("comma-separated x-positions") for
now — replaced by the drawable-canvas visual picker in commit 5.

**Manage templates** — list with rename / delete / export
(downloads the canonical JSON) / import (uploads someone else's
JSON, validated through ``template_from_json``).

Heavy work (``extract_pages_auto``) only runs on explicit user
action (Extract / a new sample upload), and the parsed Page list
is cached in ``st.session_state`` so widget-edit reruns don't
re-parse the PDF.

Logging: tool runs and template saves both hit the audit log via
``log_event("tool_run", …)``, matching every other tool's
instrumentation pattern.

Registered in ``tools_registry.py`` under ``transformations``
with status ``Ready`` and the picture-as-pdf Material icon. i18n
keys added for en + es ("PDF to CSV" / "PDF a CSV").

OCR is wired in this commit — ``extract_pages_auto`` already
falls back through ``pytesseract`` when the binary is available,
and the warning strings it returns surface as ``st.info`` /
``st.warning`` per-file. Commit 6 will polish the OCR UX with a
status row.

Next commits build on this page:
  4 — batch progress + cancellation + per-file error grouping
  5 — drawable-canvas visual picker replaces text x-positions
  6 — OCR availability banner + scanned-page indicators

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:49:44 +00:00
aea520d2f7 feat(pdf): template storage layer (load/save/list/import/export)
Phase 2/6. Persists "how to read this bank's statements" as JSON
files under ``~/.datatools/pdf_templates/<slug>.json`` so an
accountant can build one template per source and reuse it across
every statement that follows the same layout.

Public API:

- ``new_template(name)`` — blank with sensible defaults
- ``save_template(t)`` — validate + atomic write (temp + rename)
- ``load_template(slug)`` / ``delete_template(slug)``
- ``list_templates()`` — sorted summaries, skips corrupt files
- ``template_to_json`` / ``template_from_json`` — portability
- ``validate_template(t)`` — returns ``(ok, errors)``, with ``errors`` a list, for the GUI

Schema is documented in the module docstring. Versioned via
``schema_version: 1`` so future fields don't break saved files
silently — ``load_template`` refuses unknown versions instead of
limping along with missing keys.

Validation contract enforces:
- non-empty name + slug (lowercase alphanumeric + hyphens)
- at least two output columns
- at least one column mapped to ``date``
- either one ``amount`` column OR both ``amount_debit`` +
  ``amount_credit``
- column boundary count consistent with source-column count

Storage is atomic: ``_atomic_write`` goes through a temp file +
``os.replace`` so a crashed save can't leave a half-written JSON
at the canonical path. The GUI's build flow saves on most
visual-picker changes, so this matters more here than for a
"save button" workflow.

24 tests cover slugify, defaults, validation branches, round-trip
load/save, missing/corrupt file handling, delete, list (incl.
skipping corrupt files), atomic-write rollback, and import/export.
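
For reference, the temp-file + ``os.replace`` pattern described for ``_atomic_write`` might look like the sketch below. The function name ``atomic_write_json`` and its signature are illustrative assumptions, not the module's real helper.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, payload: dict) -> None:
    """Write payload so a reader never sees a half-written file at path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the destination directory so that os.replace
    # never has to cross a filesystem boundary.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh, indent=2)
            fh.flush()
            os.fsync(fh.fileno())
        # Swap the finished file into place in one step.
        os.replace(tmp_name, path)
    except BaseException:
        # Roll back: drop the temp file and leave any existing target alone.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

If serialization fails partway through, the canonical path still holds the previous good JSON, which is the property the rollback test relies on.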

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:46:44 +00:00
b8aff862ed feat(pdf): add pure PDF→DataFrame extraction module
Phase 1/6 of the PDF Extractor tool. Pure module — no Streamlit,
no user-config I/O — that turns a PDF blob plus a template dict
into a ``pandas.DataFrame`` of transaction rows. Primary use case
is accountant-style extraction of bank-statement transactions,
where each bank's format is encoded as a reusable template.

Pipeline:

1. ``extract_pages(pdf_bytes)`` reads with pdfplumber and surfaces
   words with bounding boxes.
2. ``cluster_rows(words)`` groups words into rows by ``top``
   tolerance — no reliance on PDF table-line detection (most bank
   statements have no visible cell borders).
3. ``assign_columns(row_words, boundaries)`` buckets each word by
   its horizontal midpoint into N+1 columns defined by N interior
   x-boundaries.
4. ``_within_table_window`` slices to the band between the header
   line and the end-marker (e.g. "Closing balance").
5. ``apply_template`` orchestrates the above, handling:
   - parens-style negative amounts, currency stripping, custom
     decimal/thousands separators
   - separate debit + credit columns combined into a single signed
     ``amount`` (credit positive, debit negative — accounting
     register convention; matches QuickBooks/Xero imports)
   - multi-line description wrapping (rows with empty date column
     attach to the previous row's description)
   - row-level regex skip filters (e.g., "Total", "Subtotal")
   - page-range filters ("all", "2-", "1,3-5")
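
Steps 2 and 3 of the pipeline can be sketched in a few lines. This is an illustration under assumptions: the ``Word`` dataclass stands in for the module's own word type, the 3.0-point tolerance is a made-up default, and rows are anchored on their first word rather than on whatever rule the real ``cluster_rows`` uses.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical stand-in for a positioned word from the PDF."""
    text: str
    x0: float
    x1: float
    top: float


def cluster_rows(words, tolerance=3.0):
    """Group words whose ``top`` lies within tolerance of the row's first word."""
    rows = []
    for w in sorted(words, key=lambda w: (w.top, w.x0)):
        if rows and abs(w.top - rows[-1][0].top) <= tolerance:
            rows[-1].append(w)
        else:
            rows.append([w])
    # Left-to-right reading order inside each row.
    return [sorted(r, key=lambda w: w.x0) for r in rows]


def assign_columns(row_words, boundaries):
    """Bucket words by horizontal midpoint; N sorted boundaries give N+1 cells."""
    cells = [[] for _ in range(len(boundaries) + 1)]
    for w in row_words:
        midpoint = (w.x0 + w.x1) / 2
        cells[bisect_right(boundaries, midpoint)].append(w.text)
    return [" ".join(c) for c in cells]
```

Using the midpoint rather than ``x0`` means a right-aligned amount that starts slightly left of a boundary still lands in the amount column.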

Optional OCR fallback for scanned statements:

- ``page_has_extractable_text`` heuristic flags pages with <5
  words as likely-scanned.
- ``ocr_available()`` checks both the ``pytesseract`` Python
  binding and the Tesseract binary; surfaces a clear reason
  string when either is missing.
- ``extract_pages_auto`` does text-first, OCR-the-blanks, and
  returns warnings the UI can surface.
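
A two-part availability check of the kind described for ``ocr_available()`` could be written as below. This is a sketch: the ``(ok, reason)`` return shape and the reason wording are assumptions, and it looks for the binary on ``PATH`` only, so a custom ``tesseract_cmd`` location would not be detected.

```python
import importlib.util
import shutil


def ocr_available() -> tuple[bool, str]:
    """Return (ok, reason); reason is empty when OCR can run."""
    # Check the Python binding without importing it.
    if importlib.util.find_spec("pytesseract") is None:
        return False, "pytesseract is not installed"
    # The binding is only a wrapper; the system binary must exist too.
    if shutil.which("tesseract") is None:
        return False, "Tesseract binary not found on PATH"
    return True, ""
```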

29 unit tests cover the parsing pipeline against synthetic
WordBox/Page data — no fixture PDFs required, runs in 0.1s. Real
PDF extraction is exercised by hand on the user's statements.

Dependencies added:
- ``pdfplumber>=0.10,<1`` — text + position extraction
- ``pypdfium2>=4,<6`` — page rasterization for OCR + visual picker
- ``streamlit-drawable-canvas>=0.9,<1`` — visual region picker
  (used in commit 5)
- ``pytesseract>=0.3,<1`` — OCR (used in commit 6; system
  Tesseract binary required separately)
- ``cryptography>=41,<49`` — bumped upper bound; pdfminer.six
  transitively requires a recent release. Internal ed25519
  license-signing usage is API-stable across the bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:44:51 +00:00
c16e2a5e29 feat(audit): surface log path + /logs link in Help popover
Adds a "Log file" section to the sticky-footer Help popover with
two affordances:

1. The current audit-log path rendered as monospace text with
   ``user-select: all`` so a single click selects the whole path
   for copy-paste into a file manager. Works on every platform —
   no subprocess required.
2. A "View all logs →" link to the new ``/logs`` page (added in
   the previous commit) for download/inspection of today's and
   prior days' files.

i18n keys ``footer.help_logs_label`` + ``footer.help_logs_link``
added to en + es packs, matching the existing
``footer.help_*`` naming.

``audit_log_path()`` is wrapped in try/except because a broken
audit module MUST NOT take the footer down — falls back to "—".
Same defensive pattern the license section uses.

Rollback: ``git revert HEAD`` removes the section; the popover
and its layout return to the prior shape with zero coupling to
the audit module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 21:26:53 +00:00
7c9139f199 feat(audit): /logs page — view + download recent audit log files
Adds a Streamlit page at ``/logs`` listing every
``datatools-*.jsonl`` file in ``audit_log_dir()`` (7-day window
per the retention sweep in b3ae913). Each entry shows filename,
mtime, byte size, and a ``st.download_button``. Today's file
gets its own section at the top.

The page also surfaces both paths as copyable monospace text:
the active log path (so users can grep/cat it directly on their
machine) and the folder path (so they can paste into Explorer /
Finder).

Wired into navigation via ``st.Page("pages/_Logs.py", ...)`` with
``url_path="logs"``. The sidebar entry is hidden by the same
``hide_streamlit_chrome`` CSS rule that hides ``/activate`` and
``/close`` — same pattern, same ``:has()`` + plain-fallback
selectors so the LinkContainer collapses cleanly in modern
browsers and the anchor is at least un-clickable in older ones.

License gate is OFF for this page (``gate_license=False``) — if a
user's license expires they may need logs to file a support
request; locking them out of their own audit history would be
hostile.

Next commit will wire the popover link.

Rollback: ``git revert HEAD`` removes the page and its nav entry;
the audit log itself keeps working.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 21:24:46 +00:00
b3ae913bb9 feat(audit): daily filename + 7-day retention sweep
Replaces the per-session ``datatools-<ts>-<sid>.jsonl`` filename
with a single daily file ``datatools-YYYY-MM-DD.jsonl`` (local
date). Sessions on the same calendar day share a file via the
writer thread's per-batch open+append; multiple DataTools
instances running concurrently on the same day fan into the same
file (append-mode small writes are atomic on POSIX, safe-enough on
Windows under realistic load).

Drops the ``_LOG_PATH`` module global and the lock around it —
``audit_log_path()`` is now pure date math, recomputed on every
call so a session that crosses midnight follows the rollover into
the next day's file.

Adds ``_sweep_old_logs()`` invoked once per process at writer-
thread start. Deletes any ``datatools-*.jsonl`` whose mtime is
older than 7 days. The glob deliberately matches the legacy
per-session filename too, so users upgrading from the previous
build don't keep a permanent backlog of pre-retention files.

Event ``ts`` fields stay UTC; only the filename uses local date,
because users go looking for "today's log" on their wall clock.

Tests cover: daily filename shape, sweep removes stale files,
sweep keeps fresh files, sweep also clears legacy filenames.

Rollback: ``git revert HEAD`` restores the per-session filename
and removes the sweep. No data migration needed either way —
existing files keep working as JSONL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 21:22:47 +00:00
ba07dcb6c7 feat(audit): re-enable audit log (kill switch off by default)
Phase 1 diagnostic build validated end-to-end on the user's machine:
session cf2ebbd5 (2026-05-19) produced session/upload/analyze/nav/
session-end events with no blank-pages regression. Root cause of the
original symptom was the audit_log_path/_session_id deadlock fixed in
a8ff8f4 — the kill switch is no longer load-bearing.

Flips ``_DISABLED: True`` → ``False`` so the default install writes a
log. The three env-var overrides (``DATATOOLS_AUDIT_ENABLED``,
``DATATOOLS_AUDIT_TRACE``, ``DATATOOLS_AUDIT_PROBE``) and the writer-
thread BaseException guard from 76c9f5a stay in place as escape
hatches if the symptom ever recurs.

TestKillSwitchContract continues to pass — it monkeypatches
``_DISABLED = True`` explicitly and doesn't rely on the module default.

Rollback: ``git revert HEAD`` flips the switch back without removing
the diagnostic instrumentation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 17:50:28 +00:00
76c9f5a679 feat(audit): diagnostic instrumentation env vars + writer-thread guard
Phase 1 of the audit-log re-enablement plan. Adds three opt-in env
vars that let us ship one instrumented build for the user to run,
without flipping the kill switch on for everybody. **Default
behaviour is byte-identical to today**: with no env vars set the
kill switch wins, no writer thread starts, no file is written, no
stderr line is printed.

Env vars (do NOT set in prod):

- ``DATATOOLS_AUDIT_ENABLED=1`` — bypass ``_DISABLED`` for one
  session. ``_DISABLED = True`` stays in the source so an upgrade
  with no env var is still safe.
- ``DATATOOLS_AUDIT_TRACE=1`` — print ``[audit] ...`` lines to
  stderr at module import, every writer-thread state change, and
  every producer entry point. Lets the user share a small log
  instead of attaching a debugger.
- ``DATATOOLS_AUDIT_PROBE=<value>`` — bisect the producer path
  for Phase 2. Values: ``full`` (default), ``noop``, ``no-events``,
  ``no-page-open``, ``no-session-start``. The named variants
  return early from the corresponding ``log_*`` function so we can
  isolate which call is implicated in the blank-pages symptom.

Also:

- ``_writer_loop`` gets an outer ``try/except BaseException`` so
  silent thread death now surfaces a ``"writer thread died: ..."``
  line in the launcher terminal instead of looking like a hang.
- Existing first-write-failure stderr print gets ``flush=True`` so
  the user actually sees it before the process is killed.
- Test fixture switches from the previous-commit ``_DISABLED = False``
  override to ``_ENABLE_OVERRIDE = True`` so tests exercise the same
  bypass path the diagnostic build uses.
- Two new tests pin the safety contract: with the kill switch on
  and no override, every producer is a true no-op (no writer
  thread, no file). And ``DATATOOLS_AUDIT_PROBE=no-events`` bypasses
  ``log_event`` even when the override is on — guards the bisect.
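
The outer guard on the writer loop is roughly the shape below. This is a sketch only: the queue-plus-sentinel structure and the ``handle`` callback are assumptions for illustration, and the real ``_writer_loop`` may batch writes differently.

```python
import queue
import sys
import threading


def writer_loop(q: "queue.Queue", handle) -> None:
    """Drain q until a None sentinel arrives; never die silently."""
    try:
        while True:
            item = q.get()
            if item is None:
                return
            handle(item)
    except BaseException as exc:
        # Surface the failure in the launcher terminal instead of
        # leaving what looks like a hang.
        print(f"[audit] writer thread died: {exc!r}", file=sys.stderr, flush=True)


def start_writer(q: "queue.Queue", handle) -> threading.Thread:
    t = threading.Thread(target=writer_loop, args=(q, handle), daemon=True)
    t.start()
    return t
```

Catching ``BaseException`` rather than ``Exception`` also reports ``SystemExit`` and ``KeyboardInterrupt`` raised inside the thread, which is the point of a diagnostic build.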

Rollback: ``git revert HEAD`` removes Phase 1 cleanly. The deadlock
fix from the previous commit stays in place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 14:46:27 +00:00
a8ff8f4bd0 fix(audit): break audit_log_path/_session_id deadlock
Pre-existing latent bug since d9e32e5: ``audit_log_path()`` acquires
the non-reentrant ``_LOCK`` and, while holding it, calls
``_session_id()`` which also takes ``_LOCK``. On a clean module state
(both ``_LOG_PATH`` and ``_SESSION_ID`` unset) the first caller
deadlocks.

``log_session_start`` triggers it in practice — it's the first GUI
call after import and the ``log_file=str(audit_log_path())`` arg is
evaluated before any ``log_event`` has had a chance to lazy-init the
session id. Strong candidate contributor to the blank-pages symptom
the kill switch was put back to mask: the writer thread (and any
producer reaching ``audit_log_path``) would block forever on ``_LOCK``,
and Ctrl+C would not unblock it, which matches the
launcher-can't-be-killed behaviour reported in 1caedbb.

Fix: resolve the session id BEFORE acquiring ``_LOCK`` in
``audit_log_path``. ``_session_id`` already double-checks under its
own lock, so the call is safe and self-synchronising.

Test fixture in ``tests/test_audit.py`` now bypasses the kill switch
via ``monkeypatch.setattr(audit, "_DISABLED", False)`` — env vars are
captured at import time and ``monkeypatch.setenv`` won't reach the
module-level flag. With the fix in place, all 6 tests pass in 0.15s;
without it, ``test_session_start_renders`` (and any test exercising
the log_session_start path) hangs indefinitely.

Kill switch behaviour is unchanged in production (``_DISABLED = True``
in the shipped module); this is purely a correctness fix for the
code path that gets exercised when the switch is off.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 14:45:08 +00:00
4451f74895 fix(layout): bump bottom block-container padding 4rem → 7rem
Last lines on long tool pages were still grazing the fixed Help/Close
footer when scrolled all the way down. 4rem covered only the space
the footer itself claims, with no breathing room: the bottom button
or text sat visually flush against the footer's top edge. 7rem buys
~3rem of clear space on every page so the last content row reads
without obstruction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 02:32:13 +00:00
a022059b1e chore: drop accidentally-tracked scratch screenshot
2026-05-19 02:30:01 +00:00
69240fc922 fix(home,close): tool-link preserves file context + drop close-page explanation
(1) ``[Tool] →`` action links inside per-file finding rows now
preserve the file that the card belongs to. Previously the home page
re-set ``home_uploaded_*`` to the FIRST imported file on every rerun
— so when a user with multiple imports clicked
``Clean Text →`` on file_B's findings card, the tool page loaded
file_A. The click handler in ``_render_finding_row_v2`` now looks
the file up in ``home_uploads`` by the findings-card filename and
writes ``home_uploaded_name / size / bytes`` BEFORE
``st.switch_page``, so the tool's ``pickup_or_upload`` reads the
right context.

The filename threads through ``render_findings_panel(..., header=)``
→ ``_render_finding_row_v2(..., filename=)``; ``header`` is already
the filename today, so no call-site change needed.

(2) Close screen "explanation" removed. The long browser-restriction
hint paragraph (``quit.close_hint``: "Browsers don't let JavaScript
close a tab you opened yourself …") is gone from the farewell overlay
— the auto-dismiss path lands the user on about:blank within ~1.5s
of the close click, so the explanation never had a chance to be
useful. ``autoDismiss`` simplified to "try close, else redirect"
without the hint-surface step. The i18n key is retained as a no-op
in case the hint comes back.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 02:29:49 +00:00
9a7d861903 fix(ui): bottom padding + close-screen button removed + sidebar collapse + quiet loguru
Four issues batched together since they all touch the GUI shell:

- ``stMainBlockContainer``'s ``padding-bottom`` bumped from 0.75rem
  → 4rem (~one button-height of free space above the fixed Help/Close
  footer). The last line of content on a page that fills the viewport
  was previously sitting flush against the footer's top border.

- Farewell overlay's "Close this window" button removed per UX
  request. The auto-dismiss path is now the only flow: try
  programmatic close (works in Chrome/Edge ``--app`` windows);
  failing that, surface the hint and redirect the parent window to
  ``about:blank`` after a short timeout. Previously the user had to
  click the button to get the same fallback. The
  ``quit.close_window_button`` i18n key is retained as a no-op for
  now in case the button comes back; nothing references it.

- Sidebar collapse → expand was broken: clicking « collapsed the
  sidebar but the » expand-back affordance was invisible. Two
  separate causes:

   1. ``.dt-brand { flex: 1 }`` was eating the entire
      ``stSidebarHeader`` width, squeezing Streamlit's
      ``stSidebarCollapseButton`` off the right edge. Changed to
      ``margin: 0 auto 0 0`` so the brand keeps its natural width
      and the chevron has room to live next to it.

   2. The "hide Streamlit chrome" toolbar block was listing
      ``stToolbar`` and ``stToolbarActions`` for ``display: none``
      — but the post-collapse re-open button
      (``stExpandSidebarButton``) lives inside ``stToolbar``, so
      hiding the container killed the button too. Dropped both
      container testids from the hide list and kept the per-icon
      rules for ``stMainMenu`` / ``stAppDeployButton`` /
      ``stStatusWidget`` / ``stDecoration``.

- Loguru's stderr sink quieted in GUI mode. ``src/gui/app.py`` now
  runs ``logger.remove()`` + ``logger.add(sys.stderr, level="ERROR",
  …)`` at the top so internal ``logger.debug`` / ``logger.warning``
  breadcrumbs (e.g.
  ``standardize_dataframe: 7/31 cells were unparseable``) no longer
  print to the terminal when the user runs ``python -m src.gui``.
  CLI entry points already do the same configuration per-script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 02:21:41 +00:00
1016a4d2c4 feat(home,sidebar): brand hero + sidebar = footer style + PNG icon
Bundles a handful of UX cleanups:

- Findings-card chevron moved to the LEFT side of the head. CSS still
  rotates it 90° between collapsed/expanded states.

- Tool-link buttons in findings rows (``Clean Text →`` etc.) are now
  left-justified against the icon column with minimal surrounding
  whitespace. Action column ratio dropped from 1.8 → 1.4 and the
  button switched from ``width="stretch"`` (centered text) to
  ``width="content"`` (shrinks to fit, left-aligned within column).

- Home-page hero now mirrors the sidebar brand block: 56px ink "D"
  chip on the left + "UNALOGIX" eyebrow stacked above "DataTools"
  wordmark, then the "Clean. Normalize. Transform." tagline beneath.
  New ``.dt-page-brand / -row / -words / -mark / -eyebrow /
  -wordmark`` rules in ``_DESIGN_TOKENS_CSS``. Streamlit wraps h1
  elements in an emotion-cache div with extra padding; a descendant
  flattener (``.dt-page-brand-words *`` margin:0 / padding:0) keeps
  the eyebrow + wordmark stack the same height as the chip so they
  center-align cleanly.

- Sidebar nav restyled to match the sticky-footer Help/Close buttons
  exactly: 13px / 500 / 1.3 line-height, 5×10px padding, 8px gap
  between icon and label, transparent background. Active item gets
  the same ``rgba(0,0,0,0.04)`` tint as the hover state (no white
  pill, no shadow), only the heavier weight + ink text distinguishes
  it.

- OS app icon (page_icon) switched from SVG to a Pillow-rendered
  ``datatools_icon_256.png`` so Windows / macOS taskbar+dock pick
  it up reliably (some OS shells fall back to a default icon for
  SVG favicons). Rounded-square ink ground with cream "D" centered —
  same mark as the sidebar chip + hero chip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 02:04:53 +00:00
6c3939d21b feat(brand): "Letter D (sans)" app icon — favicon + sidebar chip
Implements ``Business/DataTools/app_icons.html`` §03 "Letter D (sans)"
as the canonical app mark.

- New ``src/gui/assets/datatools_icon.svg`` — 64×64 SVG, 14px corner
  radius, ink ground (#1c1917), cream "D" (#fef4ed) in
  Geist 700 / -0.04em tracking. Pure SVG so it renders sharp at
  every favicon size; font stack falls back through Geist →
  system sans where the webfont isn't installed (favicons can't load
  Google Fonts).

- ``_home.py``, ``_Activate.py``, ``99_Close.py``: page_icon now
  resolves the SVG path via ``Path(__file__).parent / "assets" /
  "datatools_icon.svg"`` instead of the broom 🧹 / 🔑 / 🛑
  emojis. Streamlit inlines it as a ``data:image/svg+xml;base64,...``
  link tag so the browser tab + OS app-icon for ``python -m src.gui``
  matches the sidebar chip.

- Sidebar ``.dt-brand-mark`` tightened to match the spec's "Letter D
  (sans)" rendering: ``font-weight: 700`` and
  ``letter-spacing: -0.04em`` (was 600 / -0.02em). The on-screen
  chip is now a scaled-up copy of the OS icon.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:50:18 +00:00
d436e34a45 feat(brand): rebrand to UNALOGIX DataTools + Clean. Normalize. Transform.
User-facing copy + brand updates landed together:

- Page H1 + browser-tab title: "DataTools — Data Cleaning Mastery"
  → "UNALOGIX DataTools". Same change in es.json (was "DataTools —
  Maestría en limpieza de datos").
- Hero subtitle: long descriptive caption replaced with the tagline
  "Clean. Normalize. Transform." (es: "Limpia. Normaliza.
  Transforma.").
- Sidebar brand block: wordmark is now two lines — UNALOGIX in tiny
  uppercase tracked eyebrow style on top, DataTools in the 15px
  semibold wordmark beneath. The 28px "D" chip stays as the
  recognizable mark. New ``.dt-brand-eyebrow`` rule in
  ``_DESIGN_TOKENS_CSS``.

Top-right Streamlit chrome cleanup — the user reported two stacked
icon buttons. ``.streamlit/config.toml`` bumped to
``toolbarMode = "viewer"`` (most aggressive — suppresses status
indicator + deploy button + running glyph). CSS belt-and-suspenders
hides ``stToolbar``, ``stToolbarActions``, ``stStatusWidget``,
``stDecoration`` for newer Streamlit releases that keep emitting
these with inline styles even under toolbarMode=viewer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:45:38 +00:00
0bb72ecd7e feat(home,sidebar): brand block + collapsible findings + many polish tweaks
Batch of UX tweaks the user asked for in quick succession:

- Sidebar brand block (mockup §brand) — 28px ink chip with a "D"
  wordmark plus the "DataTools" text — injected into
  ``stSidebarHeader`` by a small JS bundled into the iframe-mounted
  script that already runs from ``hide_streamlit_chrome``. The
  Streamlit ``stLogoSpacer`` is hidden when the brand block is
  present so it sits flush at the top of the sidebar.

- Findings cards are now collapsible. Each file's card head carries
  ``data-dt-collapsed="true"`` on first render; clicking the head
  flips the attribute via the new ``_WIRE_COLLAPSIBLE_FINDINGS_JS``
  (MutationObserver re-wires after reruns). A CSS rule
  ``[stElementContainer]:has(.dt-finding-group-head[data-dt-collapsed
  ="true"]) ~ *`` hides every later sibling of the head's element
  container — covers both ``stLayoutWrapper`` (the columns rows in
  this Streamlit release) and ``stElementContainer`` so the rule
  survives future Streamlit layout renames. A chevron icon
  (``chevron_right``) rotates 90° when expanded. The head itself
  gets ``cursor: pointer`` + an accent-fill hover.

- Tool-link buttons in finding rows dropped the leading ``Open`` —
  now read ``Clean Text →``, ``Standardize Formats →`` etc.

- Finding-row column order: action is now LEFT of the description,
  matching user feedback (``[icon] [Tool →] [description + meta]``).

- Head padding bumped to ``16px 22px`` so the filename has visible
  breathing room from the card's left edge (previously the mono
  filename felt like it was bleeding into the rounded corner).

- Head margin-bottom bumped to 1.5rem for breathing room before the
  first finding row when expanded; collapsed state tucks the head
  flush against the card bottom with full ``--r-lg`` corner radius
  and no visible bottom border.

- Files card row layout: ``✕`` button moved to the LEFT of the
  filename (``[✕] [chip + filename] [size]``).

- Sidebar nav rows tightened: link padding 7px → 4px, line-height
  1.25, 1px margin-bottom per li, section-header padding-top reduced.
  Plus a new ``--gap: 0.25rem`` rule for vertical blocks inside
  bordered containers so the Files card and findings card body have
  denser inter-row spacing.

- Sidebar Language selector restyled: widget labels render as the
  spec's "Eyebrow" row (11.5px / 500 / 0.08em uppercase, tertiary
  ink), selectbox combobox gets a paper surface + soft border that
  matches the rest of the sidebar chrome.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:40:22 +00:00
74d0ee270f chore(home): remove "Export report" button
The disabled "Export report" placeholder is gone — it wasn't tied to
a real feature and was just noise in the action bar. Action bar is
back to two buttons (Run analysis · Clear results) on a 1:1:4
column split. ``upload.export_report`` keys removed from en + es
i18n packs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:17:43 +00:00
06f1ea6cf7 fix(buttons,footer): unify disabled state + restyle Help/Close as nav links
(3) Disabled primary buttons no longer read as a "whited-out" dark
slab. Streamlit's primary-button selector
``button[data-testid="stBaseButton-primary"]`` has the same
specificity as our previous ``button:disabled`` selector, so the
primary background + cream text kept winning the cascade tie-break.
The disabled rule's selector list now explicitly matches both the
``kind="primary"``/``kind="secondary"`` shapes AND the
``stBaseButton-primary``/``-secondary`` testids, so disabled
buttons collapse to ``surface-hover`` background, ``ink-tertiary``
label, soft border — same look regardless of starting kind. A
follow-up rule re-asserts ``color: var(--ink-tertiary)`` on every
descendant of the disabled primary so the inner
``stMarkdownContainer > p`` doesn't keep the cream label from the
"all descendants get --bg" primary rule.

(4) The sticky-footer Help + Close buttons now match the sidebar
nav-item look. Old outlined-pill chrome is gone:
``.datatools-footer-btn`` is now display:inline-flex with a
Material-Symbols ligature icon + label, borderless, ``ink-secondary``
text on a transparent surface, ``rgba(0,0,0,0.04)`` hover background.
The Close button keeps a danger tint via ``.close`` so it still reads
as the shut-down action, with a soft ``--danger-fill`` hover. Help
uses the ``help_outline`` icon, Close uses ``power_settings_new``.
Built via a small ``makeFooterBtn`` helper in the iframe JS that
appends the icon span + label text node to the button — keeps the
existing soft-nav click handlers intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:12:03 +00:00
784695e3a7 fix(home,findings): reclaim top whitespace + add padding under finding head
Two visual cleanups:

1. The block-container "claim padding" rule was a no-op — it targets
   the legacy ``stAppViewBlockContainer`` testid; Streamlit renamed
   it to ``stMainBlockContainer`` in the current release. Updated the
   selector list to match both, so the page title now sits close to
   the top edge again (~0.5rem from the hidden header) instead of
   inheriting Streamlit's default ~6rem header reservation.

2. ``.dt-finding-group-head`` margin tightened to ``margin: -1rem
   -1rem 0.75rem``: -1rem on top/sides still bleeds the head to the
   card edges, but +0.75rem on the bottom is breathing room between
   the head's bottom border and the first finding row, which were
   abutting before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:04:42 +00:00
4816da1ad6 fix(home): show file sizes in KB/MB/GB, never raw bytes
Per-row file sizes and the Files-card total-size meta both read as
human-readable units now. Smallest unit is KB even for sub-kilobyte
files (so ``538 B`` → ``0.5 KB``, ``4914 B`` → ``4.8 KB``), steps up
to MB at 1 MiB and GB at 1 GiB. Always one decimal place.

New module-level helper ``_format_size(int) -> str`` in ``_home.py``;
both the section meta (``1 file · 4.8 KB total``) and the per-row
``dt-file-size`` cell call it instead of the previous ad-hoc
``f"{n:,} B"`` formatter. Keeps the display consistent regardless of
file size — and keeps the GUI free of raw byte counts that nobody
needs to read.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:59:56 +00:00
6703e2c15c feat(home): in-card "+ Add more files" replaces Streamlit's dropzone
Mockup §file-add lands as the canonical import affordance:

- Streamlit's ``st.file_uploader`` widget is still mounted (only path
  that actually receives browser file events), but parked off-screen
  via a new ``[data-testid="stFileUploader"] { position:absolute;
  left:-10000px; … pointer-events:none }`` rule. Its hidden
  ``<input type="file">`` stays reachable to JavaScript.
- The Files card is now always rendered (header + bordered body).
  The bottom row of the card is a ``button.dt-file-add`` styled per
  mockup §file-add: dashed top border bleeding to the card edges,
  surface-hover background, ``+ Add more files`` text in
  ``--ink-secondary``, accent-fill on hover.
- A small ``<script>`` shipped through ``st.iframe`` wires the
  button: ``click → input.click()`` on the off-screen
  ``stFileUploaderDropzoneInput``. Streamlit's HTML sanitizer
  strips inline ``onclick`` from ``unsafe_allow_html`` content, so
  the binding has to come from a real script element — same pattern
  the sticky footer and Upload→Import rewriter use. A
  ``MutationObserver`` re-wires the button when Streamlit remounts
  it across reruns. The ``dataset.dtWired`` guard prevents double
  binding.

Section structure also tightened to match the mockup:

- Section heading is now ``<h2>Files</h2>`` (was ``### Import one
  or more files to start``) with the count + total size on the
  right of the same flex row. When no files: ``No files imported
  yet``. When files exist: ``1 file · 4.8 KB total``.
- Dropped the ``upload.intro_multi`` caption and the
  ``upload.empty_state`` info banner — the card itself plus the
  in-card Add button cover both prompts.
- Empty state now ends after the Files card (no stats / no action
  bar / no findings rendered) — matches mockup's single-section
  empty view.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:56:11 +00:00
a9788ba712 feat(ui): page header + files card + action bar + findings cards (mockup 2)
Closes the remaining gaps between the live home page and the
``datatools_layout_redesign2.html`` mockup. Four pieces land
together because they all consume the same new CSS scaffold:

1. Page header (§page-header)
   ``st.title`` + ``st.caption`` + ``st.divider`` collapse into one
   flex header: h1 + body subtitle on the left, ``Runs 100% locally``
   privacy pill (success-fill + lock SVG) on the right, soft border
   below. The "Runs 100% locally" phrase moved out of
   ``home.caption`` into the new ``home.privacy_pill`` i18n key
   (en + es).

2. Files card (§files-card)
   The "Imported files" list is now a single bordered card with a
   section head (count + KB total on the right, mockup §section-head).
   Each row renders a 28px accent-fill chip carrying the inline
   document SVG, a mono filename, a right-aligned mono size, and a
   compact ``✕`` button. The word-button ``Remove`` is gone —
   replaced by an icon-only tertiary button styled via a new CSS
   rule that goes transparent → danger-fill on hover (mockup
   §file-remove).

3. Action bar (§action-bar)
   Three buttons in one row: ``Run analysis`` (primary ink), a new
   disabled ``Export report`` (secondary; coming soon, tooltip), and
   ``Clear results``. New i18n key ``upload.export_report``.

4. Findings — per-file group cards (§finding-group)
   ``render_findings_panel`` rewritten end-to-end. Output is now:
     • A head row (``dt-finding-group-head``) bleeding to the card
       edges: worst-severity dot · mono filename · count pills
       enumerating non-zero severities (e.g. ``2 info`` blue,
       ``1 warning`` amber, ``1 error`` rose).
     • A flat list of finding rows sorted error → warn → info.
       Each row: tinted Material-icon chip + title (description
       with optional ``<code>`` column chip) + mono meta line
       (rows affected, samples captured) + tertiary
       ``Open <Tool> →`` action button that ``st.switch_page``s
       to the relevant tool.
   The previous tool-grouped expander stack is dropped — the new
   layout is denser and matches the mockup's single-card-per-file
   structure.

   ``_render_one_finding`` (the old per-finding helper that emitted
   markdown lines + sample tables) remains in the file but is no
   longer called from the home flow; left in place for any other
   surface that still depends on the markdown style.

   The "no issues" success state renders a green dot + mono
   filename + ``no issues`` success pill in the same card chrome,
   so empty-result files visually match the rest of the panel
   rather than getting a generic ``st.success`` callout.
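
The per-file summary in item 4 (worst-severity dot, count pills, error → warn → info ordering) reduces to a small pure function. This is a sketch under assumptions: findings are modelled as dicts with a ``severity`` key, the function name is hypothetical, and pills are emitted worst-first, which may differ from the order the real panel uses.

```python
from collections import Counter

# Lower rank sorts first; insertion order doubles as pill order.
_SEVERITY_RANK = {"error": 0, "warning": 1, "info": 2}


def summarize_findings(findings):
    """Return (worst_severity, count_pills, findings_sorted_worst_first)."""
    ordered = sorted(findings, key=lambda f: _SEVERITY_RANK[f["severity"]])
    counts = Counter(f["severity"] for f in ordered)
    # No findings at all maps to the green "no issues" state.
    worst = ordered[0]["severity"] if ordered else "success"
    pills = [f"{counts[s]} {s}" for s in _SEVERITY_RANK if counts[s]]
    return worst, pills, ordered
```

Because ``sorted`` is stable, findings of equal severity keep the order in which the analyzer produced them.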

CSS additions (``_DESIGN_TOKENS_CSS``):
  ``.dt-page-header / .dt-page-subtitle / .dt-privacy-pill``
  ``.dt-files-section-head / .dt-section-meta``
  ``.dt-file-row / .dt-file-icon-chip / .dt-file-name / .dt-file-size``
  ``.dt-finding-group-head / .dt-severity-dot{.warn,.info,.error,.success}``
  ``.dt-group-filename / .dt-group-counts``
  ``.dt-count-pill{.warn,.info,.error,.success}``
  ``.dt-finding-row / .dt-finding-icon{.warn,.info,.error}``
  ``.dt-finding-title / .dt-finding-meta``
  Tertiary button rule (transparent → danger-fill on hover) for
  the X button and the ``Open Tool →`` row action.

theme.py:
  Explicitly loads Material Symbols Outlined alongside Geist —
  the severity-chip ligatures (``info`` / ``warning`` / ``error``)
  need the font present even when no ``:material/`` token has been
  emitted yet on the page. Tightened the ``.dt-finding-icon .dt-mui``
  selector with a ``[data-testid="stMarkdownContainer"]``-scoped
  variant so the Material font wins over theme.py's base
  ``var(--font-sans) !important`` on markdown descendants.

Leading section-heading emojis stripped from i18n
(``upload.heading``) for parity with the mockup's clean ``Files``
/ ``Findings`` h2s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:43:42 +00:00
da7d86f457 feat(ui): Material icons in sidebar + stats overview on home
Two pieces of the mockup 2 layout that hadn't landed yet:

1. Sidebar nav icons — emoji glyphs (🧹 ✂️ 🔍 …) swapped for
   Streamlit's ``:material/<name>:`` syntax, picking the outline
   Material Symbol that best matches each mockup SVG:

       Home               → :material/home:
       Fix Missing Values → :material/help_outline:
       Find Unusual Vals  → :material/insights:
       Clean Text         → :material/text_format:
       Standardize Fmts   → :material/format_list_bulleted:
       Find Duplicates    → :material/search:
       Quality Check      → :material/check_circle:
       Map Columns        → :material/view_column:
       Combine Files      → :material/account_tree:
       Auto Workflows     → :material/auto_awesome:
       Activate           → :material/key:
       Close              → :material/close:

   Streamlit injects the icon name as a literal ligature inside a
   first-child ``<span>`` of the nav anchor, expected to render
   through the Material Symbols font. theme.py's base rule was
   forcing Geist on every span under ``stSidebarNav``, turning the
   ligatures back into plain text labels — added a structural
   exception that targets ``[data-testid="stSidebarNavLink"] >
   span:first-child`` (and any descendant), restoring the Material
   font family, neutralizing the inherited ``ss01/cv01/cv11``
   feature settings, and sizing to 18px.

   Also stripped the leading emojis from every page title in the
   en/es i18n packs (``home.title``, ``close_page.title``,
   ``activation.title``, ``tools.*.page_title``) — the icons live
   in the sidebar now, the page H1 no longer needs to carry one.

2. Stats overview on home — new ``_render_stats_overview`` in
   _home.py emits a 4-card grid above the per-file findings panels:
   Files analyzed, Total findings, Warnings (severity ``warn`` ∪
   ``error``), Info (severity ``info``). Card layout follows the
   mockup §stats verbatim — Geist 28px / 600 / -0.03em for the
   numeric value (the "Display number" row in spec §4), tiny
   uppercase tracked label, paper-surface card with the standard
   warm border + faint shadow. The Warnings / Info cards tint the
   number with ``--warn`` / ``--info`` when the count is non-zero.

CSS for ``.dt-stats / .dt-stat / .dt-stat-label / .dt-stat-value /
.dt-stat-unit`` added to ``_DESIGN_TOKENS_CSS``; falls back to a
2-column grid below a 900px viewport width, matching the mockup's media
query.
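
The four card values can be sketched as below, assuming findings are
available as a mapping of filename to a list of severity strings. That
input shape and the name ``stats_overview`` are hypothetical; the real
``_render_stats_overview`` may read its data differently:

```python
def stats_overview(findings_by_file):
    """Compute the four card values from {filename: [severity, ...]}."""
    severities = [s for sevs in findings_by_file.values() for s in sevs]
    return {
        "files_analyzed": len(findings_by_file),
        "total_findings": len(severities),
        # The Warnings card folds errors in with warnings.
        "warnings": sum(s in ("warn", "error") for s in severities),
        "info": sum(s == "info" for s in severities),
    }
```

With this split, Warnings + Info equals Total findings as long as every
finding carries one of the three severities.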

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:31:40 +00:00
2501119ac2 feat(ui): replace Fraunces with Geist per geist_spec.md
Switches the type system to the single-family Geist spec referenced
in ``Business/DataTools/geist_spec.md`` and the matching
``datatools_layout_redesign2.html`` mockup. Editorial-serif headings
are out; the product now reads as modern SaaS-tool typography per
the spec's positioning note (§10).

  src/gui/theme.py (new)
    Implements geist_spec.md §3 verbatim — preconnect + Google Fonts
    link for Geist (400/500/600/700) and Geist Mono (400/500), the
    canonical ``:root`` token table (§7) plus severity extensions,
    and the type scale (§4): h1 32/600/-0.035em, h2 22/600/-0.025em,
    h3 18/500/-0.018em, h4 15/500/-0.012em, body 14/400, caption
    12.5/400, mono 0.92× ss02. ``apply_theme()`` is the single entry
    point.

    Two deviations from the spec, both anticipated by spec §6.1:
    - ``font-family: var(--font-sans) !important`` on the base rule.
      Streamlit applies ``font-family: "Source Sans"`` directly to
      ``[data-testid="stMarkdownContainer"]`` and a few widget
      wrappers at equal-or-higher specificity than the spec's
      selector list, so plain inheritance loses the cascade.
    - The base selector list explicitly enumerates
      ``stSidebarNav``, ``stMarkdownContainer``, ``stVerticalBlock``
      and a few siblings so Streamlit's per-widget font reset
      doesn't reach descendant text.

  src/gui/components/_legacy.py
    - ``_DESIGN_TOKENS_CSS`` no longer redeclares fonts or the
      heading rules — those are theme.py's job (spec §9 says the
      spec is type-only; everything below is component chrome).
    - Token references switched from ``--dt-*`` to the spec names
      (``--ink``, ``--bg``, ``--surface``, ``--border``, ``--accent``,
      ``--font-sans``, ``--font-mono``, …).
    - Sidebar section-label rule tightened to 11.5px / 500 to match
      the "Eyebrow" row in spec §4.
    - Primary-button text color now also targets every descendant
      (``button[kind="primary"] *``) so the inner
      ``stMarkdownContainer > p`` doesn't pick up
      ``color: var(--ink)`` from the base rule and render
      near-invisible ink-on-ink.
    - ``hide_streamlit_chrome`` now calls ``apply_theme`` before
      injecting component CSS so the base tokens are defined first.

Acceptance criteria from spec §8 verified at 1920×1050:
  - h1 computes ``font-family: Geist``, ``font-weight: 600``,
    ``letter-spacing: -1.12px`` (= 32px × -0.035em), size ``32px``.
  - Body ``<p>`` inside ``stMarkdownContainer``: Geist 400 / 14px.
  - Caption: Geist 400 / 12.5px.
  - Inline mono filenames: Geist Mono in accent-fill chip.
  - No Source Sans Pro leaks into any text the user reads.
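
The computed ``letter-spacing`` values follow from font size × tracking
in em. A small worked check using the heading scale quoted above (the
table and function names are illustrative only):

```python
# tag: (font-size in px, font-weight, letter-spacing in em), from the type scale
TYPE_SCALE = {
    "h1": (32, 600, -0.035),
    "h2": (22, 600, -0.025),
    "h3": (18, 500, -0.018),
    "h4": (15, 500, -0.012),
}


def tracking_px(tag):
    """Letter-spacing the browser should compute for `tag`, in px."""
    size_px, _weight, tracking_em = TYPE_SCALE[tag]
    # em units resolve against the element's own font size.
    return round(size_px * tracking_em, 3)
```

``tracking_px("h1")`` reproduces the ``-1.12px`` figure from the
acceptance check; the other headings can be verified the same way.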

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:21:52 +00:00
444dffbc63 chore(ui): rename Upload → Import in user-facing strings
DataTools is local-first — "Upload" reads like "send data somewhere
remote", which contradicts the product positioning. Sweep replaces
the user-visible term throughout the UI:

- ``src/i18n/packs/en.json`` + ``es.json``: all ``upload.*`` strings
  (heading, intro, uploader labels, empty state, switch-back, etc.)
  and ``gate.default_name``. The ``intro_multi`` "no upload anywhere"
  phrasing dropped the verb entirely — now reads "nothing leaves
  this computer".
- All 9 tool pages: ``st.file_uploader(label="Upload …")`` →
  ``"Import …"``; matching ``st.info("Upload a …")`` empty-state
  banners; ``help="Upload …"`` strings on disabled uploaders.
- ``9_Pipeline_Runner`` + ``5_Column_Mapper``: radio-option text
  ``"Upload schema/pipeline JSON"`` → ``"Import …"`` plus the
  ``.startswith("Upload")`` branch guards that read those values.
- ``_home.py``: "**Uploaded files**" → "**Imported files**".
- ``app_demo.py``: "Uploaded file is …" → "Imported file is …".

Internal identifiers left untouched: function names
(``pickup_or_upload``, ``_StashedUpload``), session-state keys
(``home_upload``, ``home_uploads``, ``home_uploaded_*``,
``merger_file_upload``), audit-log event category (``"upload"``),
Streamlit testid CSS selectors. None of those are visible to the
user.

The file_uploader's dropzone button text is a baked-in React
literal that Streamlit's ``label=`` doesn't reach; rewritten at the
DOM level with a small ``_RENAME_UPLOAD_BUTTON_JS`` snippet shipped
through ``st.iframe`` (same pattern the sticky footer uses to mount
on ``<body>``). A ``MutationObserver`` on the parent document re-
applies the swap when Streamlit remounts the dropzone after file
add/remove or page navigation, throttled via ``requestAnimationFrame``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:48:31 +00:00
3c4b80895e fix(home): hide Streamlit's chip row, keep only the canonical file list
After upload, two near-identical file lists were shown stacked:
Streamlit's built-in compact chip row inside the dropzone (icon +
``messy_sales.csv`` + size) and the home page's own "Uploaded files"
section beneath it (filename + Remove button). User flagged the
duplication.

Hide ``[data-testid="stFileChip"]`` and its first-child wrapper so
the chip row collapses; the dropzone's borderless ``+`` button is
preserved as the "add more files" affordance, and our "Uploaded
files" list is now the single source of truth visually.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:42:22 +00:00
b0ee65e922 feat(ui): warm editorial redesign — Fraunces + Geist + stone palette
Lifts ideas from the ``datatools_layout_redesign.html`` mockup
(artistic licence, not literal). Two changes:

1. ``.streamlit/config.toml`` ``[theme]`` block — cream paper bg
   (#fafaf7), warm sidebar (#f5f4ef), stone ink (#1c1917), burnt
   orange primary (#c2410c). Streamlit threads these through its
   chrome (focus rings, file-uploader accents, link colors).

2. ``_DESIGN_TOKENS_CSS`` injected by ``hide_streamlit_chrome`` on
   every page. Imports Fraunces (display serif), Geist (body sans),
   Geist Mono. Restyles, scoped through ``--dt-*`` custom properties:

   - Page surface + sidebar — warm cream backgrounds, soft warm
     borders, no harsh white.
   - Sidebar nav — section labels in tiny uppercase tracking, nav
     items with soft hover, active item as a white pill with subtle
     shadow.
   - Typography — H1/H2/H3 in Fraunces with tightened tracking;
     body Geist; inline code Geist Mono with orange-on-cream chip.
   - Buttons — primary = dark ink (``#1c1917``) with white text;
     secondary = paper surface with warm border; disabled = muted
     cream.
   - Containers / expanders — editorial cards: 14px radius, 1px
     warm border, faint shadow, warm-cream summary headers.
   - File uploader — cream dropzone with dashed border + per-file
     paper chips.
   - Alerts — soft tinted fills (info=sky, success=mint, warn=amber,
     error=rose) over the kind-specific palette.
   - Inputs, tabs, dataframes — paper surfaces with rounded warm
     borders.

Verified at 1920x1050 + 1400x900 on home page (empty + with file
uploaded + with findings rendered) and Clean Text tool page; no
regressions in the white-bar fix from 65b663b.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:36:24 +00:00
65b663be97 fix(footer): stretch .stApp + sidebar + main to compensate for zoom
A user screenshot pinned down the actual culprit: a horizontal white band
across the FULL viewport width (including over the sidebar) above
the Help/Close footer. Diagnosis:

  - ``.stApp`` carries ``zoom: 0.85``, so any descendant sized at
    ``100vh`` only renders at ~85vh visually.
  - At 1920x1050 the visual end of ``.stApp`` is around y=893; the
    fixed footer overlays y=1017..1050; the strip in between (124px
    at this resolution) is ``body`` painting white through, because
    ``.stApp``, ``stSidebar`` and ``stMain`` are all shorter than
    the viewport.
  - The previous "min-height: 100vh/0.85" rule targeted the legacy
    ``data-testid="stAppViewBlockContainer"``. The current Streamlit
    release renamed that testid to ``stMainBlockContainer`` — so the
    rule was a no-op for months. Verified the new testid by walking
    the live DOM.

Fix: stretch ``.stApp``, ``[data-testid="stSidebar"]`` and
``[data-testid="stMain"]`` with ``min-height: calc(100vh / 0.85)``
so they fill the visible viewport. Keep the block-container's 2rem
``padding-bottom`` (now matching both the new and legacy testids in
case Streamlit rolls it back).

Verified at 1920x1050: sidebar gray extends to y=1050, content area
extends to y=1050, footer overlays the bottom 33px, no white band
between content and footer.
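
The zoom arithmetic behind the fix, as a worked sketch. The function
names are illustrative; the numbers are the ones quoted above:

```python
ZOOM = 0.85  # the zoom factor applied to .stApp


def visual_height(css_px, zoom=ZOOM):
    """On-screen height of a box whose CSS height is css_px under a zoomed ancestor."""
    return css_px * zoom


def compensated_min_height(viewport_px, zoom=ZOOM):
    """CSS height needed to fill viewport_px on screen, i.e. 100vh / zoom."""
    return viewport_px / zoom
```

At a 1050px viewport, ``visual_height(1050)`` is about 892.5, which
matches the "around y=893" observation, and
``visual_height(compensated_min_height(1050))`` comes back to the full
1050px, which is why ``calc(100vh / 0.85)`` closes the band.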

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:22:11 +00:00
c942b8aa19 fix(footer): offset sticky-footer's left edge past the sidebar
The "white bar" was the footer's near-white background painting
over the bottom of the sidebar. The footer is fixed at body level
with ``left: 0; right: 0`` so it spans the full viewport — its
``rgba(255, 255, 255, 0.97)`` background renders as essentially
white over the sidebar's ``rgb(240, 242, 246)`` gray, producing a
visibly different strip at the bottom of the sidebar (this is what
the diagnostic GREEN tint marked as ``stAppViewContainer``-shaped
because that is the element directly behind it).

Pixel-sampled the bottom row to confirm:
  y=860 over sidebar  →  (240, 242, 246)  (gray)
  y=870 over sidebar  →  (255, 255, 255)  (footer-painted white)

Fix: in the iframe JS that mounts the footer on ``<body>``, measure
``[data-testid="stSidebar"].getBoundingClientRect().right`` and set
the footer's (and help popover's) ``left`` to that offset with
``setProperty(..., 'important')`` so it beats the ``left:0!important``
fallback in CSS. A ``ResizeObserver`` on the sidebar plus a
``window.resize`` listener keep the offset in sync when the sidebar
collapses or expands.

Sidebar collapsed (width 0 or off-screen) clamps to 0 → footer goes
flush-left as before. Also dropped the no-op ``min-height`` on the
view container from the previous attempt; ``stAppViewContainer`` is
transparent, so stretching it never painted anything.

Verified by injecting the same offset on the live page: bottom row
at y=890 is now ``(240,242,246)`` over the sidebar and only turns
white at x=255 where the content area begins.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:52:02 +00:00
61e63913cb chore: migrate use_container_width → width (Streamlit deprecation)
``use_container_width`` is being removed after 2025-12-31. Streamlit
was flooding the terminal with the deprecation notice on every
rerun. Mechanical sweep:

  use_container_width=True   →  width="stretch"
  use_container_width=False  →  width="content"

51 call sites across 11 page files + ``app_demo.py``. Also renamed
the ``local_download_button`` helper's ``use_container_width`` kwarg
to ``width`` (default ``"stretch"``); it has no external callers
passing the old name, so this is a safe rename.
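
A sweep like this can be sketched with a single regex over the source
text. This is an illustration of the mapping above, not necessarily how
the 51 call sites were actually rewritten; ``migrate_width_kwarg`` is a
hypothetical helper:

```python
import re

# Matches only literal True/False values; computed values are left alone.
_PATTERN = re.compile(r"\buse_container_width\s*=\s*(True|False)\b")
_REPLACEMENT = {"True": 'width="stretch"', "False": 'width="content"'}


def migrate_width_kwarg(source):
    """Rewrite literal use_container_width=True/False kwargs in source text."""
    return _PATTERN.sub(lambda m: _REPLACEMENT[m.group(1)], source)
```

Because it works on raw text, it would also rewrite matches inside
strings or comments, and call sites that pass a variable
(``use_container_width=flag``) still need a manual edit.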

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:43:52 +00:00
e011c0b6e6 fix(footer): close white gap by stretching stAppViewContainer
Color-tag diagnostic confirmed the bottom-of-viewport strip was
painted by ``stAppViewContainer`` (it showed GREEN), not by the
block container as the previous two attempts assumed. ``.stApp``
has ``zoom: 0.85`` so 100vh visually renders at 85% — apply
``min-height: calc(100vh / 0.85)`` to the view container itself so
it spans the full visible viewport and there is no gap for its own
background to leak through as a "white bar". Reverts the diagnostic
tints (RED/BLUE/GREEN/GOLD); keeps the 2rem block-container
padding-bottom that reserves room for the fixed footer overlay.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:36:41 +00:00
2fe324279e diag(footer): color-tag every candidate bottom-area container
Option 2 (stretching the block container with ``min-height``) did
not close the white gap. Either the rule isn't applying, or the
block container isn't the element that fills the visible bottom of
the page. Tint every plausible container so the eye can tell us
instantly which one paints the bar:

  - RED    ``stAppViewBlockContainer``   (still has min-height applied)
  - BLUE   ``stMain`` / ``section[stMain]``  (with its own min-height)
  - GREEN  ``stAppViewContainer``
  - GOLD   ``.stApp`` (zoomed)

Next step: the user reloads and reports which color shows where the
"white bar" previously was; that names the target.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:33:19 +00:00
04dc326020 fix(footer): stretch block container to full viewport to close white gap
Option 1 (tightening ``padding-bottom`` from 3rem to 2rem) did not
eliminate the gap. The remaining gap is ``.stApp``'s solid white
background showing through the area below the block container's
natural (content-sized) bottom edge — visible because the home
page's content is shorter than the viewport.

Stretch the block container with ``min-height: calc(100vh / 0.85)``
so the container itself fills the visible viewport. Now the area
between the last finding card and the fixed footer is the block
container's own background, not ``.stApp`` showing through —
visually continuous with the content above.

The ``/0.85`` compensates for ``.stApp { zoom: 0.85 }`` (defined in
``_HIDE_CHROME_CSS``): inside a zoomed container, ``100vh`` renders
at 85% of true viewport height, leaving a 15% gap if used raw.
``box-sizing: border-box`` keeps the 2rem padding part of the
total height instead of stacking onto it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:30:22 +00:00
d487a44170 fix(footer): tighten block-container `padding-bottom` to close white gap
Diagnostics confirmed the "white bar" the user has been describing is
not a separate element — it's ``[data-testid=stApp]``'s solid white
background (``rgb(255,255,255)``, viewport-locked) showing through the
gap between where page content ends and where the fixed Help/Close
footer overlay begins. ``stApp`` stays put while content scrolls
inside it, which is why the bar "doesn't change when scrolling".

The gap exists because ``render_sticky_footer`` overrides the block
container's ``padding-bottom`` to ``3rem`` (48px) to reserve clear
room for the fixed footer. The footer is only ~32-33px tall
(``min-height`` 32px + 0.25rem top/bottom padding), so ~16px of that reserve
was pure visible white space sitting above the buttons.

Reduce ``padding-bottom`` to ``2rem`` (~32px) — just enough to
prevent content from rendering under the footer overlay, no more.
Eliminates the visible gap without exposing text to clipping.

Also remove the diagnostic banner + click-to-inspect iframe from
the home page now that the bar is identified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:28:17 +00:00
f106275643 test(home): replace clutter outliner with click-to-inspect
User reported the previous diagnostic was too cluttered to read,
and the white bar showed no outline anyway — meaning the flat
``querySelectorAll('body *')`` walker missed it (likely inside an
iframe's contentDocument, which the script didn't recurse into).

New approach: a single red button "CLAUDE: click here, then click
the white bar" in the top-right. Clicking the button arms an
inspect handler. The next click anywhere on the page reports the
full element stack at that point via ``elementsFromPoint`` AND
recursively descends into any same-origin iframe at the click
location, so iframe contents are no longer invisible.

A black report panel lists every element in the stack with its
tag/id/testid/class, position, z-index, background color, and
bounding rect — TOP element highlighted in red. User clicks the
white bar exactly once and we know what it is.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:23:35 +00:00
8232ab1ca7 test(home): broader diagnostic — outline anything near viewport bottom
Previous diagnostic only outlined fixed/sticky elements; user
confirmed the offending white bar isn't one of those. Cast a much
wider net:

- Outline every element whose visible rect intersects the bottom
  200px of the viewport, regardless of position.
- Border style encodes position: solid=fixed, dashed=sticky,
  dotted=absolute, thin=static/relative.
- Render a readable list in a top-right panel showing each element's
  tag/id/testid/class, position, z-index, height, and background.
- Skip fully transparent + un-positioned elements (those can't
  actually overlay anything).

With this, scroll to the bottom and the panel + colored outlines
will identify exactly which element is the white bar — fixed or
not. The user can paste the panel list (or just name the colored
box) so we know what to remove.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:18:56 +00:00
4c8e1199a4 test(home): outline every fixed/sticky element to find the white bar
User reports: TEST #3 marker sits at the true bottom of the home
page's main content, but when scrolled the test text "goes behind"
an opaque white bar — confirming the bar is fixed/sticky (overlays
scrolling content). Our CSS only declares ONE fixed element near
the bottom (``#datatools-sticky-footer``), which the user already
ruled out. So something else — Streamlit native chrome, a third-
party widget, or a fixed element we haven't enumerated — is
overlaying the content.

Inject a small diagnostic iframe whose JS, running against the
parent document, walks every element on the page and outlines each
``position: fixed`` or ``position: sticky`` node with a distinct
color + a top-left label showing ``tagName#id[data-testid] pos=…
h=…px bg=…``. Re-runs after initial paint, on a couple of delays
(for late-mounting components), and on every scroll.

This is read-only — no DOM mutations beyond outline styles and
labels — so it's safe to ship even if I forget to remove it.
The user can now visually identify which colored box is the
offending white bar and report its label.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:15:19 +00:00
e282f061dc test(home): move marker to true bottom of main content
User reported the previous TEST #2 banner appeared at the *top* of
the main content area instead of the bottom. Root cause: on the home
page, ``render_sticky_footer()`` is called at line 107 — before
``st.title()`` — so anything that function injects in document flow
lands at the top of ``stAppViewBlockContainer``. Other pages call
``render_sticky_footer()`` at the end of their script, so the flow
content lands at the bottom there.

Remove the marker from ``render_sticky_footer`` and add it directly
at the very end of ``_home._home_page()`` — after the findings
panels. If this banner lines up with the offending white strip when
scrolled to the bottom, the strip is something rendered at the tail
of the page (likely an iframe wrapper from ``render_findings_panel``
or the block container's ``padding-bottom``).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:11:24 +00:00
5daae9e5fa test(footer): move marker out of footer into main content flow
User confirmed the previous marker landed inside the Help/Close
sticky footer — which is NOT the offending white bar. They want the
sticky footer kept; the white strip they're trying to remove sits
*above* the footer in the main content area.

Move the marker out of ``#datatools-sticky-footer`` and render it
via ``st.markdown`` immediately before the ``st.iframe`` call that
injects the footer. That places it at the very bottom of
``stAppViewBlockContainer`` — exactly where the iframe wrapper
(``stElementContainer``) and the block container's
``padding-bottom: 3rem`` reservation live.

Styled as a red dashed banner so it's unmistakable. If it lines up
with the white strip clipping text on scroll, one of those two is
the culprit and the next commit can target it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:09:21 +00:00
48cb802dfb test(footer): inject visible marker into #datatools-sticky-footer
The user reports a "white bar/box" at the bottom of the main content
area that clips text when scrolling. The DOM inspector found only one
fixed-position white element near the viewport bottom —
``#datatools-sticky-footer`` (bg ``rgba(255,255,255,0.97)``,
~33px tall) — so this is my best candidate for what they're seeing.

Append a red marker span "◀ CLAUDE TEST: is this the white bar you
want removed? ▶" inside the footer div so the user can visually
confirm. If the text shows up where they see the offending white
bar, the footer is the right target; if the bar is somewhere else,
this confirms it's a different element.

Temporary — to be reverted in the next commit either way.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:06:56 +00:00
d022167ba2 fix(home): widget's "✕" Remove now actually removes the file
Reported: on the Home page after uploading data files, the Remove
buttons "on the right side" did nothing — the file kept showing up in
the list. That was the file_uploader widget's BUILT-IN ✕ icons (the
ones inside the uploader's chrome, on the right of each file row),
not our custom "Remove" buttons further down — the custom ones have
worked correctly since 84e4665.

Cause: ``_home_page`` deliberately treated the widget as add-only and
never honored widget-side removals. The reasoning, per the prior
comment, was that navigation can remount the widget with value ``[]``
— a render-time sync would then wipe ``home_uploads``. Real, but the
side effect was that the widget's own ✕ appeared to do nothing: the
file vanished from the widget chrome, stayed in ``home_uploads``, and
re-rendered immediately in the custom list below.

Fix: hook the file_uploader's ``on_change`` callback to reconcile
``home_uploads`` against the widget's current value. Streamlit's
``on_change`` fires ONLY on user-initiated value changes; the
remount-induced ``[]`` reset doesn't trigger it, so the stash still
survives navigation. Removals from the callback also drop the file's
findings entry and clear the singular ``home_uploaded_*`` keys when
the active upload was removed — matching the custom-button path.

The custom "Remove" buttons further down keep working unchanged; the
existing AppTest path through ``_home_remove_<sha1>`` still removes
exactly the file clicked. 2220 tests pass.
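
The reconciliation step can be sketched as a pure function, assuming the
stash and findings are both keyed by filename. That keying and the name
``reconcile_uploads`` are assumptions; the real code works on
``st.session_state`` and may key entries differently:

```python
def reconcile_uploads(stash, widget_names, findings):
    """Drop stashed files the user removed in the widget; return removed names.

    stash:        {filename: file-like}, kept across navigation
    widget_names: filenames the widget currently holds
    findings:     {filename: results}, pruned alongside the stash
    """
    current = set(widget_names)
    removed = [name for name in stash if name not in current]
    for name in removed:
        del stash[name]
        # Removing a file also drops its findings entry, if any.
        findings.pop(name, None)
    return removed
```

The important part is where it is called from: only the uploader's
``on_change`` callback, never at render time, so a remounted widget
reporting ``[]`` cannot wipe the stash.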

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 20:52:20 +00:00
24ee021314 fix(footer): hide the helper page_link row that was leaking into pages
Same wrong-testid bug as the Close click handler: the CSS rule
that's supposed to position the hidden ``st.page_link`` off-screen
was selecting ``a[data-testid="stPageLink"]``, but the bare
``stPageLink`` testid is on the OUTER wrapper div — the anchor
uses ``stPageLink-NavLink``. ``:has(a[data-testid="stPageLink"]...)``
matched nothing, so the helper rendered as a full-size visible
row at the bottom of every page (the "large white bar blocking
content" the user reported).

Fix: switch both the ``:has()`` rule and the no-:has() fallback
to ``a[data-testid="stPageLink-NavLink"][href*="close"]``. The
``href*="close"`` form also works for base-path deployments
(``/myapp/close``), matching the click handler's selector.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:07:07 +00:00
add3b866ee fix(footer): Close button now actually fires — wrong testid + bad fallback
Two bugs combined to make the footer Close a no-op:

1. The helper page_link's anchor carries
   ``data-testid="stPageLink-NavLink"`` — the bare
   ``stPageLink`` testid is on the OUTER WRAPPER div, not the
   anchor. The old selector ``a[data-testid="stPageLink"]``
   matched nothing, so ``helper`` was always ``null``.
2. The fallback ``window.location.href = './close'`` ran inside
   the component iframe, so it only navigated the (invisible)
   srcdoc iframe. The main app stayed put.

End result: click → nothing visible → shutdown_app never runs →
farewell-script's ``window.close()`` attempt never happens →
user sees the Close button as broken.

Fixes:
- Selector → ``a[data-testid="stPageLink-NavLink"][href*="close"]``.
  ``href*="close"`` covers both root (/close) and base-path
  (/myapp/close) deployments.
- Fallback → resolve the parent window via
  ``doc.defaultView`` (the parent doc's window) with a
  ``window.top`` fallback, so the hard-nav navigates the whole
  app instead of just the iframe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:02:46 +00:00
b568773a1f chore(streamlit): migrate components.v1.html → st.iframe (deprecation)
Streamlit logs a deprecation notice on every render:

  Please replace ``st.components.v1.html`` with ``st.iframe``.
  ``st.components.v1.html`` will be removed after 2026-06-01.

Replace all 9 call sites (6 tool pages + 3 in ``_legacy.py``).
Both APIs feed ``srcdoc`` to the underlying iframe so the
HTML/JS payload and the cross-frame DOM access pattern
(``window.parent.document``) are unchanged.

``st.iframe`` rejects ``height=0`` (raises
``StreamlitInvalidHeightError``), so bump every zero-height call to
``height=1``.
1px is effectively invisible — these are script-only iframes, no
visible payload — and avoids the validator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:57:40 +00:00
4a7f99f0ec fix(footer): restore soft-nav for Close (no page reload on shutdown)
Footer Close was using ``<a href="./close">`` which triggers a
browser hard-nav. That's a visible page-reload flash, websocket
churn, and slower shutdown than the previous sidebar Close —
which used ``st.navigation``'s soft nav.

Restore the soft-nav path:

- ``render_sticky_footer`` now renders a hidden ``st.page_link``
  pointing at ``pages/99_Close.py``. Positioned off-screen via CSS
  (``stElementContainer:has(a[data-testid=stPageLink][href$=/close])``)
  so it occupies no layout space but stays in the DOM, reachable and
  clickable.
- Footer's Close <button> click handler now dispatches a
  programmatic click on that hidden page_link. Streamlit's React
  handler picks it up and runs the soft nav (same code path the
  old sidebar entry used). Falls back to ``window.location.href``
  if the helper link hasn't rendered yet so the button is never
  a no-op.
- The page_link call is wrapped in try/except: ``AppTest`` doesn't
  populate the page-nav session keys it needs and raises
  ``KeyError('url_pathname')``. Failure costs only the soft-nav
  optimization — Close still works via the hard-nav fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:52:00 +00:00
b2449d3139 fix(nav,footer): drop orphan _hidden section header, show footer on Activate
Two follow-ups to the prior sidebar/footer cleanup:

- The "_hidden" section header was still visible in the sidebar
  because Streamlit renders ``stNavSectionHeader`` as a sibling of
  ``stNavSection``, not a child — so the ``:has()`` rule on the
  section was hiding the items list but leaving the header
  (and its collapse/drilldown marker) behind. Move Activate +
  Close into the unlabeled section (key ``""``) alongside Home so
  there is no header to leak in the first place, then hide just
  the two links via ``stSidebarNavLinkContainer:has(...)`` (with
  a defensive ``a[href$=...]`` fallback for browsers without
  ``:has()`` support).
- The sticky footer was missing on ``pages/_Activate.py`` because
  the page never called ``render_sticky_footer`` — added the
  call so the Help / Close bar persists when the user follows
  the popover's Activate / Manage link.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:45:22 +00:00
d840230e48 fix(nav,footer): hide Activate from sidebar, surface it in Help popover
- Collapse the Account section: Activate now lives in the same
  hidden sidebar section as Close (single ``_hidden`` group). Both
  pages stay registered with ``st.navigation`` so /activate and
  /close remain URL-routable for the Help-popover / Close-button
  links — only the sidebar entries + their section header are
  hidden via CSS.
- Help popover always exposes a license-management link now:
  ``Activate now →`` when the license is inactive, ``Manage
  license →`` when it is active and valid. Both point at
  ``./activate``.
- Extend the sidebar-hide CSS to also match ``a[href$="/activate"]``
  and the section that contains it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:39:14 +00:00
9e8b4b2ca9 feat(footer): help popover shows license state + Activate link
- Bump version to 3.0 (src/__init__.py).
- Switch support address to support@unalogix.com.
- Help popover now includes a License section that reads
  ``src.license.current_state()``:
  * When activated + valid: name + expiry date + days remaining.
  * Otherwise: "Not activated" + an ``Activate now →`` link
    pointing at ``./activate``.
  License-state queries are wrapped so a corrupted license file
  can't take the footer down — it falls through to the inactive
  branch.
- Popover HTML is now built in Python (so the license branch
  lives in one place) and passed to the JS as a single string.
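
A minimal sketch of the "built in Python, passed as one string" idea above. ``get_state`` stands in for ``src.license.current_state``; the dict shape it returns here (``valid``, ``name``, ``expires``) and the helper name are assumptions for illustration, not the real API.

```python
import html
from datetime import date


def license_section_html(get_state, today=None):
    """Return the License block of the Help popover as a single HTML string.

    ``get_state`` is a hypothetical callable returning a dict with the
    assumed keys ``valid`` (bool), ``name`` (str) and ``expires`` (date).
    """
    today = today or date.today()
    try:
        state = get_state()
        if state and state.get("valid"):
            days_left = (state["expires"] - today).days
            return (
                f"<p>{html.escape(state['name'])}, expires "
                f"{state['expires'].isoformat()} ({days_left} days left)</p>"
            )
    except Exception:
        # A corrupted license file must not take the footer down:
        # fall through to the inactive branch below.
        pass
    return '<p>Not activated</p><a href="./activate">Activate now →</a>'
```

Keeping both branches in one function means the JS side only ever receives a finished string and never needs to know about license state.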

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:35:47 +00:00
dd231f5a38 fix(footer): render sticky Close+Help footer on the home page too
The sticky footer was only wired into the 9 tool pages — the home
page (``_home.py``) called ``hide_streamlit_chrome`` but never
``render_sticky_footer``, so the app-level Close+Help bar was
missing whenever the user was on the home page. Add the call.

Also drop the home page's now-redundant trailing
``st.divider() + st.caption(t("chrome.footer"))`` block — same
"blank white bar above the sticky footer" symptom that motivated
removing the per-page version from the tool pages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:32:16 +00:00
143c775cdf fix(footer,nav): left-justify buttons, drop per-page caption bar, hide sidebar Close
Three small follow-ups to the sticky-footer rework:

- Left-justify the footer buttons (and reposition the Help popover
  to anchor at the left edge so it lines up with its trigger).
- Remove the per-page ``st.divider() + st.caption("Runs locally…")``
  trailing block from all 9 tool pages. The new sticky footer
  covers that text, so it was rendering as an empty white bar at
  the bottom of each tool page.
- Hide the Close entry from the sidebar nav via CSS. The page stays
  registered with st.navigation so /close is still routable for the
  sticky-footer Close button — only the sidebar link + its section
  header are hidden (via :has() on stNavSection).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:04:12 +00:00
d1b9f642e2 feat(footer): slim sticky footer with Close + Help, drop bottom Back-to-Home
The duplicate full-width Back-to-Home button at the bottom of every
tool page was reading as a "huge footer." Replace it with a real
slim sticky footer holding two controls:

- Close: <a href="./close"> to the Close page (which shuts the app
  down). Full-page nav is fine here — the process is terminating, so the
  session-state-loss concern that retired the previous sticky
  footer doesn't apply.
- Help: JS-toggled popover showing version + support@datatools.app.
  No navigation, no state loss.

Top-of-page Back-to-Home stays (uses st.switch_page, preserves
state). Add footer.* i18n keys for en + es.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:56:02 +00:00
65c85107b6 revert: restore audit-log kill switch — async redesign didn't help
User pulled d9e32e5 (async-writer audit log + re-enabled diagnostics
sidebar) and still sees blank pages. The synchronous-write theory
from the previous round was at most a partial explanation; something
ELSE in the audit-log code path is also taking the page render down
on the user's machine.

Restore the kill switch so the user has a working app while we
diagnose:

- ``src/audit.py``: ``_DISABLED = True`` re-introduced at module
  top, each of ``log_event`` / ``log_session_start`` /
  ``log_page_open`` / ``flush_audit_log`` early-returns. The async
  writer thread is never started.
- ``hide_streamlit_chrome``: ``_render_diagnostics_sidebar()`` call
  re-gated behind ``if False:``.

The async writer code stays in place — easier to flip the flag back
when we identify the real cause than to rewrite a third time. The
shutdown-flush call in ``shutdown_app`` also stays; it early-returns
on the kill switch and is harmless.

Diagnostic plan for the next session: ask the user for the launcher
terminal output (the new stderr "DataTools audit: writes failing..."
message would tell us if the writer thread DID start and DID fail),
and whether ``~/.datatools/logs/`` is being created at all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 02:44:23 +00:00
d9e32e578b feat(audit): async writer thread — safe to re-enable
Reported earlier: synchronous file writes in ``log_event`` blocked
the GUI render thread on hostile filesystems (Windows antivirus on
``~/.datatools/logs/`` is the prime suspect). A blocking ``open``
call doesn't raise — try/except can't recover from it — so the
only safe re-enable is to take file I/O off the render path.

Refactor:

- ``log_event`` and friends push events onto a ``deque(maxlen=5000)``
  via ``put_nowait`` and return in microseconds.
- A single daemon thread (``datatools-audit-writer``) drains the
  queue and writes batches. Holds the queue lock only long enough to
  snapshot + clear, then does I/O outside the lock so producers can
  keep enqueueing.
- ``audit_log_path()`` is now pure path arithmetic: no ``mkdir``,
  no ``open``. The writer thread does the directory creation off
  the request path, so any hang there only affects the writer.
- Bounded queue means an unwritable disk can't grow memory without
  bound; the queue caps at 5000 and overflow drops OLDEST events
  so the most-recent (most-diagnostic) ones survive.
- First write failure prints once to stderr; subsequent failures
  are silent so logs don't drown the launcher terminal.
- ``flush_audit_log(timeout_s=0.5)`` drains the queue and signals
  the writer to exit; bounded so a stuck disk can't delay shutdown.
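
The refactor above can be sketched roughly as follows. This is an illustration under assumptions, not the code in ``src/audit.py``: the class name ``AuditWriter`` and its attributes are hypothetical, and because ``collections.deque`` has no ``put_nowait`` the sketch uses ``append``, which is likewise non-blocking.

```python
import json
import sys
import threading
from collections import deque
from pathlib import Path


class AuditWriter:
    """Hypothetical non-blocking audit writer (illustrative only)."""

    def __init__(self, path, maxlen=5000):
        self._path = Path(path)              # pure path arithmetic, no I/O
        self._queue = deque(maxlen=maxlen)   # when full, the OLDEST event is dropped
        self._lock = threading.Lock()
        self._wake = threading.Event()
        self._stop = threading.Event()
        self._thread = None
        self._warned = False

    def log_event(self, category, message, **extra):
        # Producer side: no file-system call happens here.
        with self._lock:
            self._queue.append({"category": category, "message": message, **extra})
            if self._thread is None:
                self._thread = threading.Thread(
                    target=self._run, name="datatools-audit-writer", daemon=True
                )
                self._thread.start()
        self._wake.set()

    def _drain(self):
        # Hold the lock only long enough to snapshot and clear.
        with self._lock:
            batch = list(self._queue)
            self._queue.clear()
        if not batch:
            return
        try:
            # All I/O, including mkdir, runs on the writer thread.
            self._path.parent.mkdir(parents=True, exist_ok=True)
            with self._path.open("a", encoding="utf-8") as fh:
                for event in batch:
                    fh.write(json.dumps(event, default=str) + "\n")
        except OSError as exc:
            if not self._warned:             # report the first failure only
                self._warned = True
                print(f"audit: writes failing ({exc})", file=sys.stderr)

    def _run(self):
        while not self._stop.is_set():
            self._wake.wait(timeout=0.2)
            self._wake.clear()
            self._drain()
        self._drain()                        # final drain on shutdown

    def flush(self, timeout_s=0.5):
        # Signal the writer to exit and wait a bounded time for it.
        self._stop.set()
        self._wake.set()
        if self._thread is not None:
            self._thread.join(timeout_s)
```

The bounded ``join`` is what keeps a stuck disk from delaying shutdown: if the writer is blocked inside ``open``, ``flush`` simply returns after ``timeout_s`` and the daemon thread dies with the process.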

Other changes in this commit:

- ``shutdown_app`` now emits a "Session ending" event and calls
  ``flush_audit_log`` before kicking the os._exit timer, so the
  closing session's events make it to disk.
- The Diagnostics sidebar in ``hide_streamlit_chrome`` is
  re-enabled (the ``if False:`` gate is removed). Wrapped in
  try/except defensively — render errors print to stderr, never
  blank the page.
- ``_DISABLED`` kill-switch is gone. The async design IS the
  safety mechanism now.

Tests in ``tests/test_audit.py``:

- log_event burst of 1000 events completes in well under 1s
  (proves non-blocking).
- Events queued before flush land on disk with the expected JSON
  shape; session_start renders; idempotent.
- Pointing the audit dir at a file (so mkdir fails) doesn't hang
  or crash the producer.
- Non-JSON extras are str()-coerced rather than dropped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 02:39:48 +00:00
7cb1bc922d fix(nav): restore real Streamlit Back-to-Home button — preserves state
Reported: after the sticky-footer href fix (be7191a) the back-to-home
click worked but the home-page upload list disappeared. Full-page
navigation via ``<a href>`` doesn't preserve ``st.session_state`` on
the user's Streamlit build.

Trade-off forced: pick visible-from-anywhere sticky footer OR state
preservation. Can't have both because ``st.switch_page`` (soft nav,
preserves state) needs a real Streamlit button widget, and Streamlit
widgets can't be reliably CSS-positioned to the viewport bottom —
Streamlit owns the widget DOM and remounts it on every rerun.

State preservation wins. Going back to the pre-sticky design:

- ``render_sticky_footer()`` becomes a no-op shim. Kept as a callable
  so the call sites in every tool page don't have to be touched in
  this commit; the original implementation is preserved as
  ``_render_sticky_footer_DISABLED`` if we ever decide to revisit.
- Every Ready/Coming-Soon tool page (1-9) gets ``back_to_home_link()``
  reinstated near the top of the page (visible at scroll-top) AND
  ``back_to_home_link(key="_back_to_home_link_bottom")`` reinstated
  near the bottom of the page (visible at scroll-bottom). Both
  instances call ``st.switch_page`` via the existing helper — soft
  nav, no full reload, ``st.session_state["home_uploads"]`` and
  every other session-state key survive.

User trades the "always-visible while scrolling" sticky behavior for
the upload-list-survives-navigation behavior. The two-button pattern
(top + bottom) was what we had before the sticky-footer experiment;
on short pages both are visible at once, on long pages the user has
one in reach at either end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 02:31:50 +00:00
be7191a5d1 fix(footer): navigate to / instead of /home on Back to Home
Reported: clicking Back to Home in the sticky footer surfaced
Streamlit's "Page not found — Running the app's main page" message
in the user's build.

Root cause: ``url_path="home"`` on the home page's ``st.Page``
registration is treated as an alias for the default page in some
Streamlit minor versions, but the user's build doesn't honour the
alias for the page that ALSO has ``default=True``. The default page
is served at the root URL ``/``; ``/home`` is treated as a missing
page on that build.

Switch the footer anchor's href from ``"home"`` (which resolved to
``/home`` from any tool-page URL) to ``"./"`` (resolves to the
current document's directory, which on a single-segment URL is the
server root → default page → Home). Robust across Streamlit minor
versions regardless of how the url_path alias is interpreted.
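
The difference between the two hrefs can be checked with standard relative-URL resolution. A small sketch, assuming a hypothetical tool-page URL on a local server; browsers resolve anchors by essentially the same rules as ``urllib.parse.urljoin``:

```python
from urllib.parse import urljoin

# Hypothetical single-segment tool-page URL, for illustration only.
tool_page = "http://localhost:8501/text_cleaner"

old_target = urljoin(tool_page, "home")  # resolves to the /home page
new_target = urljoin(tool_page, "./")    # resolves to the server root

print(old_target)
print(new_target)
```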

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 02:25:57 +00:00
2d2ff43754 re-enable sticky footer + compact CSS — the audit-log I/O was the hang
User confirmed: with the audit-log kill switch (1caedbb) in place,
pages render. So the hang was 100% in the audit-log file writes —
``open()`` blocking on Windows somewhere — not in the chrome
additions disabled during bisection.

Two of those three additions are pure UI and have no filesystem
exposure, so they're safe to re-enable now:

- **Sticky footer**: pure CSS + a components-html iframe whose JS
  appends a div to ``parent.document.body``. No disk touch. The
  user just reported losing the Back-to-Home button to the
  bisection commit — restoring this brings it back.
- **Compact-spacing CSS layer**: gap reductions on stVerticalBlock
  / stHorizontalBlock, slim heading margins, slim hr / caption /
  expander / button / metric padding. Pure CSS.

What stays disabled:

- **Audit-log writes** (``src/audit.py:_DISABLED = True``). Any
  resumption needs an async-write design with a hard timeout so a
  stuck filesystem can't hang the GUI render.
- **Diagnostics sidebar**: it calls ``audit_log_path()`` which
  itself does a ``mkdir()`` — and a hanging mkdir would re-introduce
  the same blank-pages symptom. Will re-enable once the audit log
  is rewritten not to block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 02:22:55 +00:00
36510eee7b fix(findings): namespace per-tool button keys so multi-file render works
Reported: uploading multiple files on the home page and clicking Run
analysis blew up with

    StreamlitDuplicateElementKey: key='_findings_open_02_text_cleaner'

when two uploaded files both had Clean Text findings.

Root cause: ``render_findings_panel`` is invoked once per uploaded
file from ``_home.py``, but the per-tool jump button used a
filename-agnostic key:

    key=f"_findings_open_{tool_id}"

Two files both flagging Clean Text → two buttons with identical keys
→ Streamlit rejects the second one.

Fix:

- Add ``key_namespace: str = ""`` to ``render_findings_panel``. The
  helper hashes it (sha1 truncated to 8 chars) and appends to every
  button key, so different namespaces produce different keys but the
  same namespace stays stable across reruns.
- The home page now passes the filename:
  ``render_findings_panel(findings, header=f"📄 {name}", key_namespace=name)``.
- The single-call site in ``upload_and_analyze_section`` (the legacy
  helper, only used outside the new home-page path) keeps the default
  empty namespace, which is fine because that path renders findings
  for ONE file at a time.
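
A minimal sketch of the key scheme described above. The helper name ``findings_button_key`` is hypothetical, and hashing the empty default namespace the same way as any other value is an assumption:

```python
import hashlib


def findings_button_key(tool_id, key_namespace=""):
    """Build a per-tool button key that is unique per namespace.

    The namespace (e.g. a filename) is hashed so that spaces, dots or
    non-ASCII characters never end up inside the widget key itself.
    """
    digest = hashlib.sha1(key_namespace.encode("utf-8")).hexdigest()[:8]
    return f"_findings_open_{tool_id}_{digest}"


# Two files flagging the same tool now get distinct keys.
key_a = findings_button_key("02_text_cleaner", "customers.csv")
key_b = findings_button_key("02_text_cleaner", "orders.csv")
```

Because the digest depends only on the namespace string, the same file produces the same key on every rerun, which is what keeps the button's identity stable.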

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 02:17:03 +00:00
1caedbbbc7 bisect: kill-switch every audit-log write
Reported: bisection commit c0bfd4d that disabled the sticky footer,
diagnostics sidebar, and compact-CSS didn't fix the blank-page
symptom. User adds that Ctrl+C also can't kill the launcher.

Ctrl+C-doesn't-work + every-page-blank together points at a hang in
the Python process, not an exception. The most likely hang point in
the chrome path is the audit log's file I/O — ``open()`` inside the
``with`` block in ``log_event`` blocks on a stuck filesystem (Windows
antivirus quarantining ``~/.datatools/logs/datatools-*.jsonl`` on
every write is a plausible culprit on the user's machine). A blocking
``open`` call does NOT raise — try/except can't recover from it —
which is why our prior defensive wrap didn't help.

Add a module-level ``_DISABLED = True`` kill switch. ``log_event``,
``log_session_start``, and ``log_page_open`` each early-return at
the very top of the function when the flag is set, before any
file-system call. Path resolution (``audit_log_path``) still works
since it's needed for the diagnostics sidebar (still disabled in
c0bfd4d, but kept harmless).

If pages render after this commit, file I/O from the audit log is
confirmed as the culprit; we'll redesign with an async writer
queue and a tighter timeout. If they still don't, the cause is
somewhere we haven't bisected yet and we move to a hard revert.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 02:14:29 +00:00
c0bfd4dbc9 bisect: temporarily disable new chrome additions to diagnose blank pages
Reported: every page renders empty in the main body even after the
audit-log defensive-wrap commit (59c6d0f). Close button also doesn't
trigger shutdown — that page is blank too. Sidebar nav still renders,
so the chrome path that runs on every page is the suspect.

Three chrome additions landed all at once and are temporarily turned
off so the user can see whether bare chrome restores rendering:

1. **Sticky footer (``render_sticky_footer``)**: short-circuited with
   ``return`` at the top of the function. The CSS-injection +
   components-html iframe mechanic is the highest-suspicion item —
   if the iframe script throws or the CSS interacts badly with the
   user's Streamlit / Python build, the side effects can kill the
   page on their machine while staying invisible on ours. The original body
   is preserved as ``_render_sticky_footer_DISABLED`` so re-enabling
   is a one-line change.

2. **Diagnostics sidebar (``_render_diagnostics_sidebar``)**: call
   site in ``hide_streamlit_chrome`` is gated by ``if False:``.
   Wrapping in try/except (the previous commit) caught exceptions
   but didn't help — silent partial renders inside
   ``with st.sidebar: with st.expander: ...`` can still leave the
   render stack in a bad state on some Streamlit versions.

3. **Compact-spacing CSS layer**: the ``gap: 0.5rem !important;`` on
   ``stVerticalBlock`` / ``stHorizontalBlock``, the slim heading
   margins, the slim hr / caption / expander / button / metric
   rules — all stripped back to the pre-compact ``_HIDE_CHROME_CSS``.
   The ``gap`` rule in particular is a suspect: if the user's
   Streamlit version doesn't render stVerticalBlock as a flex
   container, the rule is harmless; if it does and interacts badly
   with overflow, content could be clipped.

What's deliberately KEPT enabled:

- The audit-log calls (already wrapped from 59c6d0f).
- ``log_page_open`` calls in tool pages (already wrapped internally).
- All UI changes pre-compact (the unified tool-page layout, the
  download-button helper, etc.).

If pages render after this commit, we know it's one of the three
disabled items above and can bisect further. If they still don't
render, the cause is in code that pre-dated the audit-log work and
the bisection has to keep going.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 02:09:23 +00:00
59c6d0f914 fix(audit): defensive wrap so audit failures can never blank the GUI
Reported: after pulling commit c73d716 (audit log) the main body of
every page showed empty. Sidebar nav still worked.

Diagnosis: the most likely path is that something inside the audit
calls — ``_render_diagnostics_sidebar()`` calling ``audit_log_path()``,
or ``log_session_start()`` itself — raises during ``hide_streamlit_chrome``
on the user's environment (Python 3.14 on Windows, a less-tested
combo than the test environment). Streamlit's script runner sees the
exception, and on some chrome paths it eats it without surfacing an
error block, leaving the page body empty.

The audit log is best-effort by design. Make that contract real:

1. ``hide_streamlit_chrome`` now wraps both ``log_session_start()``
   and ``_render_diagnostics_sidebar()`` in try/except. Errors print
   to stderr (so the developer running ``python -m src.gui`` sees
   them in the launcher's console) but never bubble up to kill the
   page render.

2. ``audit_log_path()`` already had a tempdir fallback for the
   primary mkdir failure, but the SECOND mkdir (the tempdir one)
   wasn't protected. Restructured to a two-level fallback: configured dir →
   tempdir → ``/dev/null`` (or ``NUL`` on Windows). The last fallback
   ensures the function never raises; ``log_event``'s own try/except
   handles the eventual unwritable-file case.

3. ``log_page_open(slug)`` now has an outer try/except so it cannot
   raise either — protecting every tool page's render path.
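
The two-level fallback in item 2 could look roughly like this. It is a sketch under assumptions: the function signatures, the ``datatools-logs`` tempdir name and the default filename are all hypothetical, and ``os.devnull`` supplies ``/dev/null`` or ``nul`` depending on the platform.

```python
import os
import tempfile
from pathlib import Path


def resolve_audit_dir(configured):
    """Return the first usable directory, or None if none can be created."""
    candidates = [
        Path(configured),
        Path(tempfile.gettempdir()) / "datatools-logs",  # hypothetical name
    ]
    for candidate in candidates:
        try:
            candidate.mkdir(parents=True, exist_ok=True)
            return candidate
        except OSError:
            continue  # try the next fallback level
    return None


def audit_log_path(configured, filename="datatools-session.jsonl"):
    """Never raises: falls back to the null device as a last resort."""
    directory = resolve_audit_dir(configured)
    if directory is None:
        return Path(os.devnull)
    return directory / filename
```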

If a user reports the same symptom again, the launcher terminal will
now show a real traceback explaining what's wrong, and the GUI will
still render normally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 02:00:31 +00:00
ee0b1f6f6b docs: design notes for future PDF→CSV tool
New ``docs/FUTURE-TOOLS.md`` captures post-launch tool ideas with a
consistent shape — What / Why / Can we ship now / Approach / GUI
sketch / Effort / Risks / Ship criteria. Resting place for things
the new-tool freeze in ``PLAN.md`` §2.1 refuses to build but that
keep coming up.

First entry: **#10 PDF → CSV extractor** (bank statements et al.).

Key facts captured:

- **Current state**: no PDF infrastructure exists. Zero PDF
  dependencies in requirements.txt; zero PDF-touching code under
  ``src/``. The only "PDF" string in the codebase is the planned-
  output copy for the Quality Check tool, unrelated to extraction.
- **Library picks**: pdfplumber as the extraction core (BSD-3,
  no native compiler, gives coordinate-aware text), Tesseract via
  pytesseract as the OCR fallback for scanned PDFs,
  streamlit-drawable-canvas as the region-picker component.
- **GUI sketch**: user draws a header strip + a row template on a
  rendered page; the tool applies that template across N pages,
  saves the template by layout fingerprint for next month's
  statement, emits CSV.
- **Effort phased A–E**: 3–4 weeks for a text-only MVP; 6–10
  weeks for a polished version with multi-page template recall;
  +2–3 weeks if scanned-PDF OCR is required.
- **Difficulty**: medium-hard. The pieces are well-trodden; the
  combination (region selection that persists across pages and
  across documents with similar layouts) is where the engineering
  goes.
- **Ship criteria**: ≥1 paying customer + ≥3 paid or ≥5 demo
  emails asking for PDF extraction + the bookkeeper niche
  converting at least one customer first. None have fired.

Cross-references added:

- ``docs/REQUIREMENTS.md`` §11: pointer to FUTURE-TOOLS.md for
  parked tool ideas, with a one-paragraph summary of #10.
- ``docs/PLAN.md`` §2.1: notes that the freeze parks future tools
  in FUTURE-TOOLS.md and explicitly names #10 as the current
  highest-pressure entry.
- ``docs/NEXT-STEPS.md`` Phase 5 "what NOT to build" table: a new
  row for the PDF tool tied to the same ship-trigger language.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 01:52:42 +00:00
c73d716d06 feat(audit): JSONL audit log for support diagnostics
New ``src/audit.py`` module records GUI actions to a per-session
JSONL file under ``~/.datatools/logs/`` (overrideable via
``DATATOOLS_AUDIT_DIR``). The file is human-readable (one JSON
object per line, each with a ``message`` field) AND trivially
machine-parseable — the support flow is "client mails the file,
we read it and explain what went wrong."

Format example::

    {"ts":"2026-05-17T05:30:00.123+00:00","level":"info","category":"session",
     "session":"a1b2c3d4","message":"Session started",
     "platform":"Windows 11","python":"3.14.0","user":"Michael Dombaugh",
     "log_file":"C:\\Users\\Michael Dombaugh\\.datatools\\logs\\datatools-...jsonl"}
    {"ts":"...","category":"upload","message":"Uploaded customers.csv",
     "filename":"customers.csv","bytes":24813}
    {"ts":"...","category":"analyze","message":"Analyzed customers.csv (3 findings)",
     "filename":"customers.csv","findings":3,"rows":120,"cols":8}
    {"ts":"...","category":"tool_run","message":"Clean Text run",
     "page":"2_Text_Cleaner"}
    {"ts":"...","category":"error","level":"error",
     "message":"analyze(weird.csv): EmptyDataError: No columns to parse",
     "filename":"weird.csv","outcome":"empty_after_repair"}

Public API:

- ``log_event(category, message, **extra)``
- ``log_session_start()`` — idempotent banner with platform info
- ``log_page_open(slug)`` — emit a ``nav`` event, deduplicated per
  Streamlit session so reruns don't spam the log
- ``log_exception(where, exc, **extra)`` — convenience wrapper
- ``audit_log_path()`` / ``audit_log_dir()`` — for the UI

Wired in at:

- ``hide_streamlit_chrome``: stamps session start, mounts a small
  "🩺  Diagnostics" expander in the sidebar with the log path and
  an "Open log folder" button so the user can grab the file to
  attach to a support email.
- Home page: ``upload`` event on every new file, ``upload`` event
  on per-file remove, ``analyze`` event with file count when
  Run-analysis fires.
- ``_run_analysis_on_upload``: ``analyze`` event with rows / cols /
  findings count per file, plus ``error`` events on every caught
  exception (empty upload, empty after repair, pandas EmptyDataError,
  generic Exception).
- Every Ready tool page (1, 2, 3, 4, 5, 9): ``tool_run`` event
  immediately after the primary action stashes its result.
- Every tool page (1-9): ``log_page_open(slug)`` on render — deduped
  via session state so we don't get one event per Streamlit rerun.

Safety:

- ``log_event`` wraps every write in try/except. A broken audit
  log must NOT crash the GUI.
- Non-JSON-serializable extras are ``str()``-coerced before writing.
- File CONTENTS are never logged. We capture filename, byte count,
  and (in the analyzer) a 12-char sha1 fingerprint of the bytes so
  the same file re-uploaded gets the same trace.
- License keys, session cookies, etc. are not logged.
- ``DATATOOLS_AUDIT_DIR`` env var lets tests redirect writes into a
  tmp dir.
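
Two of the safety rules above, ``str()``-coercion of extras and the 12-char sha1 fingerprint, can be sketched as below. The helper names ``format_event`` and ``fingerprint`` are hypothetical, and the field layout only approximates the format example earlier in this message:

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(data: bytes) -> str:
    """12-char sha1 prefix: identifies a file without logging its contents."""
    return hashlib.sha1(data).hexdigest()[:12]


def format_event(category, message, level="info", **extra):
    """Render one JSONL line; extras that are not JSON-serializable are str()-coerced."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
        "level": level,
        "category": category,
        "message": message,
        **extra,
    }
    # default=str is called only for values json cannot encode natively.
    return json.dumps(record, default=str, ensure_ascii=False)
```

Re-uploading identical bytes yields the same fingerprint, so a support reader can correlate events for one file across a session without ever seeing the data.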

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 01:36:35 +00:00
f0885aeb1e feat(analyze,ui): recommend Standardize Formats + bold red Open buttons
Two reported issues addressed together because they're the same UX
flow (home findings panel → jump to relevant tool).

(1) Format-Standardizer recommendations weren't firing.

Reported: uploading a file from the format-cleaner test corpus
(``24_format_dates.csv``, ``25_format_phones.csv``,
``29_format_currencies.csv``, ``30_format_integration.csv``) showed
zero "Standardize Formats" recommendations even though the columns
clearly mixed multiple date / phone / currency formats.

Two underlying causes:

- ``_detect_inconsistent_date_format`` required two MATCHES per
  distinct format. A test column with N rows each in a different
  format had ≤1 match per format and was silently passed over.
  Loosened to "≥1 match per format" — the inconsistency signal is
  the presence of ≥2 distinct formats, not their volume.
- Only date inconsistency was detected. Phones, currency, and
  booleans (the other format-standardizer fix categories) had no
  detector at all.

Added three new detectors:

- ``_detect_inconsistent_phone_format``: nine phone-format regexes
  (plain-10, US paren / dash / dot / space, +country, extension,
  intl plus). Fires when a column is ≥35% phone-shaped AND mixes
  ≥2 formats.
- ``_detect_inconsistent_currency_format``: thirteen currency regexes
  covering US ($1,234.56 / $1234.56), EU (€1.234,56), India lakh
  notation, Swiss apostrophe, trailing-symbol, parens-negative,
  prefix-currency-code, suffix-currency-code, and negative variants.
  Same fire criteria as phone.
- ``_detect_inconsistent_boolean_format``: column is ≥80% boolean
  tokens (yes/no/y/n/true/false/1/0) AND uses ≥3 distinct surface
  forms (e.g. yes / Y / true / 1 mixed together).
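
As an illustration of the fire criteria, here is a sketch of the boolean detector's rule. It is not the code in the analyzer: the function name and signature are hypothetical, and counting case variants such as ``yes`` and ``Yes`` as separate surface forms is an assumption.

```python
BOOLEAN_TOKENS = {"yes", "no", "y", "n", "true", "false", "1", "0"}


def has_inconsistent_boolean_format(values, min_share=0.80, min_forms=3):
    """True when a column is mostly boolean tokens but mixes several spellings."""
    cells = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not cells:
        return False
    boolean_cells = [c for c in cells if c.lower() in BOOLEAN_TOKENS]
    # Gate 1: the column must be predominantly boolean-shaped.
    if len(boolean_cells) / len(cells) < min_share:
        return False
    # Gate 2: count distinct raw spellings, not distinct meanings.
    return len(set(boolean_cells)) >= min_forms
```

The share gate is what keeps free-text columns from firing when a few cells happen to read "yes" or "1".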

Verified on every file in ``test-cases/format-cleaner-corpus/``:
24_format_dates, 25_format_phones, 29_format_currencies all now
produce a format-standardizer Finding. The integration test file
flags all three.

The threshold loosening (from 50% to 35% of values format-shaped) is
still strict enough to avoid false-positives on free-text comment
columns where a few cells happen to look phone- or date-shaped.

(2) The "Open <Tool>" jump links blended into the page.

Reported: the per-tool jump links inside the home findings panel
were too subtle to notice.

Replaced ``st.page_link`` with ``st.button(type="primary")`` so the
buttons render in Streamlit's primary-action red colour, matching the
"Clean Text" / "Find Duplicates" / etc. run buttons. Click handler
delegates to ``st.switch_page(page_slug)`` so it's still a soft
in-app navigation (no full reload).

2220 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 00:54:31 +00:00
229e1afd45 fix(footer): mount Back-to-Home outside Streamlit's container tree
Reported: the sticky footer rendered, but the Back to Home button
inside it wasn't visible.

Likely cause: ``st.markdown`` inserts the footer div inside Streamlit's
content tree, which sits under ``.stApp { zoom: 0.85 }`` (our compact
scaler) and several nested padding/positioning contexts. Streamlit's
own ``<a>`` styling rules can also colour-collide with our anchor.

Switch the mount strategy. Two passes:

1. CSS rules go to the parent document via ``st.markdown`` as before,
   but every property carries ``!important`` and the selectors key on
   ``#datatools-sticky-footer`` (id, not class) plus a dedicated
   ``.datatools-sticky-footer-link`` class on the anchor — so
   Streamlit's default ``<a>`` styles can't override colour or
   padding. ``z-index: 2147483646`` keeps the footer above
   anything else in the page.

2. The footer DOM node itself is created by a script inside a
   zero-height ``streamlit.components.v1.html`` iframe. The script
   does ``window.parent.document.body.appendChild(...)`` so the div
   lives as a direct child of ``<body>`` — outside ``.stApp``,
   outside every Streamlit container, free of every parent's
   ``zoom`` / ``transform`` / ``overflow`` rules.

   If the cross-frame access ever fails (Streamlit sandbox config
   change), the script falls through to appending inside the
   iframe's own document — degraded but still visible.

Each rerun replaces any prior ``#datatools-sticky-footer`` so we
don't accumulate stacked footers on every script pass.

2220 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 00:47:44 +00:00
7ad19ac7f4 feat(nav,i18n): sticky footer with Back-to-Home + localized tool headers
Two unrelated UX issues addressed in one sweep across all nine tool
pages because they share the same edit surface.

(1) Sticky footer replaces the top + bottom back-link buttons.

Reported: a big white empty footer space at the bottom of every page;
the Back to Home button at the top scrolled out of view on long pages.

New ``render_sticky_footer()`` helper in ``components/_legacy.py``
injects a fixed-position bar at ``bottom: 0`` of the viewport with:

- A border-top so it visually reads as a non-movable bar.
- A semi-transparent background (rgba 0.96 + ``backdrop-filter: blur``)
  so content underneath shows through faintly when the user scrolls.
- A styled ``<a href="home">`` anchor (not an ``st.button``) because
  Streamlit widgets can't be CSS-positioned reliably — Streamlit owns
  the widget's DOM container and re-mounts it on every rerun. A real
  anchor sits exactly where the CSS puts it and triggers Streamlit's
  URL routing to the home page.
- ``padding-bottom: 3.5rem`` on the main container so the last widget
  isn't hidden behind the bar.

Called once per tool page, immediately after ``hide_streamlit_chrome()``
so it renders even on pages that ``st.stop()`` early before any other
content runs. The old top-and-bottom ``back_to_home_link()`` calls are
removed from every tool page; their entry/exit points were dropping
the button when the script short-circuited.

(2) Tool-page headers now localize.

Reported: switching the sidebar language picker to Spanish left the
tool page's title + caption in English. Root cause: every page had
hard-coded ``st.title("✂️ Clean Text")`` / ``st.caption("Trim
whitespace...")`` strings.

Added per-tool ``tools.<id>.page_title`` and
``tools.<id>.page_caption`` keys to ``en.json`` and ``es.json`` for
all nine tools. Routed each page's title/caption call through ``t()``.
Verified: with ``ui_lang=es`` set, the Clean Text page now renders
"✂️ Limpiar texto" + the Spanish caption.

Updated ``tests/gui/test_smoke.py::EXPECTED_SUBSTRINGS`` so the
``es`` column for each tool page asserts the actual Spanish string
(was a duplicate of the English string back when the page bodies
were English-only).

2220 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 00:42:45 +00:00
84e4665ab0 fix(home): make per-file Remove button reliable
Reported: the "✕" buttons on the uploaded file list removed files
inconsistently — some clicks took, some didn't.

Two compounding causes:

1. ``key=f"_home_remove_{name}"`` embedded the raw filename in the
   Streamlit widget key. Streamlit's widget-identity machinery
   normalizes keys differently across reruns when they contain
   spaces, dots, brackets, or non-ASCII characters, so a button's
   identity could shift between the render where the user clicked
   it and the rerun that should have processed the click. The click
   was registered, but the post-rerun render produced a new widget
   under a new effective key, and the original click was "lost".

2. The handler mutated ``home_uploads`` mid-loop while subsequent
   iterations were still creating buttons. ``st.rerun()`` raises
   synchronously, but if ANOTHER button's state changed in the same
   pass (e.g. a stale click held over from a fast double-tap), the
   ordering of state-mutation vs widget-key-update vs rerun could
   race.

Fixes:

- Stable widget keys: ``f"_home_remove_{sha1(name)[:10]}"``. The
  hash is identifier-safe regardless of spaces / dots / Unicode in
  the filename. Verified across "sample with spaces.csv",
  "sample.csv", and "日本語.csv" — three sequential Remove clicks
  each remove exactly one file with no clicks lost.

- Two-phase capture: the loop collects the target ``to_remove``
  filename, finishes rendering every other row at consistent widget
  identity, THEN mutates state once and reruns. No more mid-loop
  ``del`` racing other widgets' click handlers.

- Wider click target: column ratio ``[8, 1]`` (was ``[12, 1]``) and
  ``use_container_width=True`` on the Remove button so the click
  surface fills the entire column. Label changed to "Remove" for
  the same reason — "✕" is a thin glyph that compressed the
  hit-test region.
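
The first two fixes can be sketched as below. This is a minimal illustration, not the code in ``_home.py``: ``remove_button_key`` and ``apply_removal`` are hypothetical names, and a plain dict of clicked keys stands in for Streamlit's button return values.

```python
import hashlib


def remove_button_key(filename: str) -> str:
    """Hypothetical helper: derive an identifier-safe widget key from a filename."""
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    return f"_home_remove_{digest[:10]}"


def apply_removal(uploads: dict, clicked: dict) -> dict:
    """Two-phase capture: visit every row first, mutate state once afterwards.

    ``clicked`` maps widget keys to True when that button was pressed; it is
    an assumption standing in for the per-row ``st.button`` results.
    """
    to_remove = None
    for name in uploads:  # phase 1: every row is "rendered" before any mutation
        if clicked.get(remove_button_key(name)) and to_remove is None:
            to_remove = name
    if to_remove is None:
        return uploads
    # phase 2: a single mutation, expressed here as a filtered copy
    return {k: v for k, v in uploads.items() if k != to_remove}
```

The key contains only ASCII hex digits regardless of what the filename contains, and the input mapping is never modified while it is being iterated.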

2220 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 00:34:20 +00:00
4685bb4289 style(chrome): tighter vertical rhythm — less whitespace across screens
Reported: too much whitespace between widgets, dividers, and headings.

Compact-spacing CSS layer added to ``_HIDE_CHROME_CSS`` (so it applies
on every page that calls ``hide_streamlit_chrome``):

- ``[data-testid="stVerticalBlock"]`` and ``stHorizontalBlock`` gap
  trimmed from Streamlit's default ~1rem to 0.5rem.
- Heading margins (h1-h4) tightened — h1/h2/h3 used to leave 1-1.5rem
  above; now 0.25-0.5rem.
- ``hr`` (``st.divider()``) drops from 1rem above+below to 0.4rem.
- Markdown paragraphs and captions: 0.25rem bottom margin instead of
  the default 1rem.
- Expander summary padding reduced (0.35rem top/bottom).
- File-uploader, button, and metric tiles: trimmed internal padding.

Also slimmed the main-container padding from 1rem top / Streamlit
default bottom (~6rem) to 0.5rem top / 0.75rem bottom.

The existing ``zoom: 0.85`` on ``.stApp`` is kept — the user wanted
*less white space*, not *smaller content*, and dropping zoom would
shrink type alongside everything else.

2220 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 00:28:58 +00:00
e96d5901f4 fix(close): graceful about:blank fallback + display-mode aware hint
Reported: user asked whether we can send Alt+F4 / Ctrl+W to the
browser from JavaScript to force-close a tab.

Honest answer that's now baked into the hint message: NO. Synthesized
keyboard events from page JS only reach DOM event listeners, not the
browser chrome or the OS. There is no flag, API, or trick that lets
a page close a tab the user opened themselves. The page CAN close a
window it opened (window.opener trail) or one whose display-mode is
``standalone`` (Chrome/Edge ``--app=URL``) — that's what
``python -m src.gui`` arranges, and that's the path that actually
closes the window without a manual Ctrl+W.

Improvements landed:

1. ``isStandalone(win)`` detects Chrome --app windows up front
   (``matchMedia('(display-mode: standalone)').matches``). In a
   regular tab the manual hint surfaces immediately on the
   "Close this window" click; in --app mode we only show it if the
   close attempt actually fails.

2. ``fallbackToBlank(win)`` navigates the tab to ``about:blank``
   via ``location.replace`` (no history pollution) so the user
   sees a clean empty tab instead of the farewell overlay frozen
   over Streamlit's connection-error banner. They still have to
   Ctrl+W the blank tab, but the screen is no longer a misleading
   "did it close or not?" mess. Fires 250 ms after a failed close
   in --app mode (very rare path), or 1.5 s in a regular tab so
   the user has time to read the hint.

3. Hint message rewritten in en + es to explain WHY the close is
   blocked (browser security — not something we can override), to
   acknowledge the Alt+F4 / Ctrl+W question directly (those don't
   work either, for the same reason), and to point at
   ``python -m src.gui`` as the path that gives a clean auto-close.

2220 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 00:07:51 +00:00
ecfc52499f fix(home): persist upload list across page navigation
Reported: clicking "Back to Home" from a tool page returned the user
to an empty home — their previously-uploaded files were gone.

Root cause: Streamlit's ``st.file_uploader`` widget state does not
reliably survive ``st.switch_page``. The widget gets unmounted on
navigation, and its ``UploadedFile`` objects don't always re-attach
on remount. The home page was treating the widget's return value as
the source of truth, so after navigation the list was empty.

Fix: introduce a session-state stash keyed by filename
(``home_uploads: dict[str, {"bytes": bytes, "size": int}]``) and
treat it as the source of truth for everything downstream — the
active-file pickup keys for tool pages, the per-file findings
cache, and the rendered file list. The widget is reduced to its
narrow role of capturing NEW uploads, which we merge into the stash
without ever removing.

Per-file remove: a "✕" button next to each filename drops just that
file (and its findings). The widget's own "✕" is bypassed by our
rendering, since trusting it would let the widget's state diverge
from the stash.

Clear-results button is unchanged: it wipes only the analysis cache,
leaving uploaded files intact (per the user's "persistent until
cleared" requirement — removal is per-file via "✕").

Tool-page compatibility: the singular ``home_uploaded_{name,size,
bytes}`` keys still get populated from the first entry in the stash
on every render, so ``pickup_or_upload`` on a tool page keeps
finding the active upload. When the user removes the active file,
those keys are cleared so the next render repopulates from whatever
file is now first.

``_StashedUpload`` is a small duck type ( ``.name``, ``.size``,
``.getvalue()`` ) so ``_run_analysis_on_upload`` accepts entries
restored from the stash without changes.
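
A rough sketch of such a duck type, assuming the stash layout described above. The class body and the ``from_stash`` helper are illustrative guesses, not the actual ``_StashedUpload`` source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StashedUpload:
    """Mimics the three members of an uploaded file that the analyzer reads."""

    name: str
    data: bytes

    @property
    def size(self) -> int:
        return len(self.data)

    def getvalue(self) -> bytes:
        return self.data


def from_stash(stash: dict) -> list:
    """Rebuild upload-like objects from a ``home_uploads``-shaped dict.

    Assumes each entry looks like ``{"bytes": ..., "size": ...}``.
    """
    return [StashedUpload(name, entry["bytes"]) for name, entry in stash.items()]
```

Anything that only calls ``.name``, ``.size``, and ``.getvalue()`` can consume these objects in place of a live widget upload.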

2220 tests pass. Smoke-verified via AppTest: pre-stashed
``home_uploads`` renders the file list with per-file remove buttons,
and the persistent state survives a simulated navigation round-trip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 00:04:12 +00:00
21fd8a4cd7 fix(nav): switch_page resolves correctly + bottom-of-page back link
Two issues, same fix surface.

(1) Reported crash on Back-to-Home:

    StreamlitAPIException: Could not find page: app.py.

``st.switch_page("app.py")`` doesn't work under ``st.navigation`` —
the entry script is the nav manager itself and is not a registered
page. The fix needs to pass an ``st.Page`` object whose script
identity matches one registered in the nav.

First-pass attempt (``from src.gui.app import _home_page``) hit a
worse failure: importing ``app.py`` from inside a tool-page render
re-executes the nav setup with the WRONG "main script" context, so
every ``st.Page("pages/N_foo.py", ...)`` call in ``_build_navigation``
fails with "file could not be found".

Fix: extract the home renderer into its own module, ``src/gui/_home.py``,
which has no top-level Streamlit side effects. Both the nav manager
and the back-link helper import ``_home_page`` from there. The Page
object built at click time has the same callable identity as the one
registered, so ``st.switch_page`` resolves it.

(2) Reported UX: the back button scrolled out of view on long pages.

Add a second ``back_to_home_link(key="_back_to_home_link_bottom")``
call near the footer of every tool page (1-9). The unique key avoids
widget-id collision with the top instance. Coming-Soon stubs get it
unconditionally; Ready tools render it only after a result exists
because the page short-circuits with ``st.stop()`` before then —
when no result is on screen the page is short enough that the top
link is sufficient.

2220 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:58:33 +00:00
42f8d78dd5 fix(downloads): drop /select on Windows — opens wrong folder
Reported: clicking "Open Downloads folder" was opening the Documents
folder instead of Downloads. Root cause is the classic Windows
gotcha: when the path contains a space (e.g.
``C:\Users\Michael Dombaugh\Downloads``), Python's
``subprocess.Popen`` packs the ``/select,...`` argument into a single
quoted token, and Explorer's ``/select`` argument parser does NOT
accept that form — it silently falls back to whatever the user's
default Explorer view is (typically Documents).

Resolution paths considered:

- ``shell=True`` with a hand-built command string — works but opens
  the door to shell-injection if a file_name ever contained a quote
  or special char.
- ``cmd /c start "" explorer /select,...`` — same parsing issue.
- ctypes ShellExecuteW — pulls in a Windows-only dependency.
- **Skip /select. Open the folder directly.** ✓

Going with the last. ``explorer <folder>`` reliably opens the folder
regardless of spaces in the path; the user finds the freshly-saved
file by its name. The previous "highlight the file" nicety wasn't
worth the path-parsing fragility: user folders on Windows live under
``C:\Users\<name>``, and any Windows username can contain a space.

macOS keeps the ``open -R <file>`` reveal-in-Finder path because
macOS argument parsing is sane and that's a strict UX win.
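
The resulting per-platform dispatch could look roughly like this. ``reveal_command`` is a hypothetical name, and the sketch only builds the argument list; it is not the helper that ships in the repo.

```python
import sys
from pathlib import PurePath, PurePosixPath


def reveal_command(file_path: PurePath, platform: str = sys.platform) -> list[str]:
    """Return an argv list that shows a saved file's location in the OS file manager."""
    folder = str(file_path.parent)
    if platform.startswith("win"):
        # No /select: open the containing folder so spaces in the path are harmless.
        return ["explorer", folder]
    if platform == "darwin":
        # Finder can reveal the file itself.
        return ["open", "-R", str(file_path)]
    # Linux and everything else: open the folder.
    return ["xdg-open", folder]


# Example with a path that contains a space.
example = reveal_command(PurePosixPath("/home/a b/Downloads/out.csv"), "darwin")
```

Passing the result to ``subprocess.Popen`` as a list keeps ``shell=True`` out of the picture, which is the injection concern raised above.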

2220 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:45:47 +00:00
0f89d7ba66 fix(downloads): use explorer /select on Windows + show open feedback
Reported: clicking "Open Downloads folder" did nothing visible. The
previous implementation called ``os.startfile(folder)`` on Windows,
which is known to silently no-op or open Explorer behind the active
window in some configurations (Streamlit running headless, no
foreground rights inherited by the click handler thread, etc.).

Switch to the more reliable ``explorer /select,<file>`` form:

- Opens Explorer with the just-saved file pre-highlighted instead of
  just navigating to the folder — better UX than the old behavior.
- explorer.exe is a real GUI process that's spawned in the user's
  session with foreground rights, so it shows up on top.
- Fallback chain on Windows: ``/select`` first, then plain
  ``explorer <folder>``, then ``os.startfile`` as a last resort.

macOS upgraded the same way: ``open -R <file>`` reveals in Finder
rather than opening the directory.

Linux: no reliable cross-distro reveal, so ``xdg-open <folder>``.

Plus user feedback at the call site:

- On successful dispatch: ``st.toast("Opening <folder>", icon="📂")``
  — confirms we tried, in case the window comes up behind the
  browser.
- On dispatch failure: ``st.warning`` with the full path the user
  can copy/paste into their file manager manually.

2220 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:25:06 +00:00
b9147f3b66 fix(downloads): save server-side to ~/Downloads + open-folder link
Switch the download mechanic from "browser <a download> with a data:
URL" to "write the bytes directly to the user's Downloads folder and
show them the exact path". DataTools runs as a local Streamlit app,
so the "server" IS the user's machine — there's no reason to go
through the browser save dialog at all.

Flow:

1. Click "Download <something>" button (rendered as a regular
   ``st.button``, so no widget-collision issues).
2. Bytes are written to ``Path.home() / "Downloads" / file_name``
   (overwriting any same-named file).
3. The page reruns and renders a success caption with the absolute
   path the file landed at.
4. An "📂 Open Downloads folder" button appears. Clicking it pops the
   OS file manager via ``os.startfile`` (Windows), ``open`` (macOS),
   or ``xdg-open`` (Linux).

Why this is better than the previous HTML-data-URL helper:

- Unambiguous about where the file went — user sees the full path,
  not "wherever your browser was configured to save".
- The data: URL approach inflated the page payload by about 33%
  (base64), which bloats badly on large outputs; the server-side
  write is byte-for-byte.
- No more browser-side widget collision class of bug.
- The save action is a real Streamlit button, so the existing widget
  semantics (disabled, help tooltip, key isolation) work without
  workarounds.

API surface unchanged. New canonical name ``local_download_button``;
``html_download_button`` is kept as a back-compat alias that points
at the same implementation — every existing call site continues to
work without edits.

Tests are protected from polluting the developer's home dir via a
``DATATOOLS_DOWNLOADS_DIR`` env var override returned by the new
``_downloads_dir()`` helper. Smoke verified end-to-end via AppTest:
click → file appears in tmp dir → success banner shows path →
open-folder button renders.
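
A minimal sketch of the override, assuming the env var simply replaces the default folder. The function names and the ``env`` parameter are illustrative; the real ``_downloads_dir()`` may differ.

```python
import os
from pathlib import Path


def downloads_dir(env=None) -> Path:
    """Resolve the save folder, honouring DATATOOLS_DOWNLOADS_DIR when set."""
    env = os.environ if env is None else env
    override = env.get("DATATOOLS_DOWNLOADS_DIR")
    return Path(override) if override else Path.home() / "Downloads"


def save_download(data: bytes, file_name: str, env=None) -> Path:
    """Write the bytes and return the absolute-ish path shown to the user.

    Overwrites any same-named file, matching the flow described above.
    """
    target = downloads_dir(env) / file_name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target
```

A test can pass a temporary directory through the env mapping (or set the variable with a fixture) so nothing lands in the developer's real Downloads folder.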

2220 tests pass, 91 skipped, 35 s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:48:28 +00:00
5128d35961 fix(text-cleaner): hoist show_hidden + stress-test all tool pages
Reported crash: clicking "Clean Text" with mojibake.csv (a junk corpus
file that the cleaner ran on but produced zero changes) blew up the
results render with

    NameError: name 'show_hidden' is not defined

at the cleaned-preview block. ``show_hidden`` was defined inside
``if result.cells_changed:`` and referenced unconditionally below.

Fix on the page itself: hoist the ``show_hidden = st.toggle(...)``
declaration out of the conditional so it's always in scope for the
downstream cleaned-preview render. One toggle now drives both the
Examples table (which only renders when there are changes) AND the
cleaned preview (which always renders).
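
The scoping bug and the hoist, reduced to plain Python. The function names and the string "renders" are stand-ins for the page code, and ``toggle_value`` is an assumed substitute for the ``st.toggle`` return value.

```python
def render_buggy(cells_changed: int) -> str:
    if cells_changed:
        show_hidden = True  # only bound when there are changes
    # Raises when cells_changed == 0. Inside a function this surfaces as
    # UnboundLocalError, which is a subclass of NameError.
    return f"preview(show_hidden={show_hidden})"


def render_fixed(cells_changed: int, toggle_value: bool = False) -> list:
    show_hidden = toggle_value  # hoisted: always in scope
    parts = []
    if cells_changed:
        parts.append(f"examples(show_hidden={show_hidden})")
    parts.append(f"preview(show_hidden={show_hidden})")
    return parts
```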

Generalized regression net: ``tests/test_junk_corpus_tool_pages.py``.
For nine representative junk files (empty, only_nul, mojibake,
invalid_utf8, utf16_le_no_bom, mismatched_columns, all_nulls,
corrupt_xlsx, single_column) and every Ready/Coming-Soon tool page,
the test:

1. Stashes the junk bytes as the home upload via session_state.
2. Runs the page through AppTest, asserts ``app.exception`` is empty.
3. If the page exposes a deterministic primary-action button label,
   clicks it and asserts no exception on the post-click render.

Pages that catch a bad file at read time and short-circuit via
``st.error`` + ``st.stop`` are correctly skipped from the
primary-action half (the button isn't rendered). A genuine crash
shows up as ``app.exception`` carrying a Python traceback — exactly
what the user reported, exactly what we now catch.

162 tests collected, 102 passed, 60 skipped. 4 seconds.

Full suite: 2220 passed, 91 skipped, 35 s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:41:14 +00:00
696996c119 test(junk-corpus): pathological-input stress suite for the analyzer
Build a corpus of 35 deliberately-broken files (empty bytes, NUL
bytes, mojibake, UTF-16 without BOM, mismatched columns, unescaped
quotes, corrupt zip, etc.) and pin the analyzer's stability contract
against them.

Files land in ``test-cases/junk-corpus/test_data/``. The generator
``make_junk_corpus.py`` produces them deterministically (one random
sample uses ``secrets.token_bytes`` — committed bytes are stable
across regenerations because the byte stream is captured at commit
time). README documents the categories and how to add new shapes.

``tests/test_junk_corpus.py`` parametrizes over every file in the
corpus and asserts:

1. ``_run_analysis_on_upload`` never raises — exceptions must be
   caught and surfaced as a synthetic ``Finding`` with
   severity="error". This was the user-reported crash for
   13_non_latin_scripts.csv that the previous fix in ae9d4a2
   defensively wrapped; the corpus now stops the regression
   from re-landing on a different shape.
2. Every Finding in the result list is well-formed (string id,
   valid severity, non-empty description).
3. A high-risk subset (empty.csv, only_bom.csv, only_nul.csv,
   corrupt_xlsx.xlsx) MUST surface at least one error-level
   Finding — otherwise the GUI would render "no issues found"
   for a structurally broken file.
4. Error-level Finding descriptions are at least 20 chars so the
   UI banner gives the user something to act on.

Also exclude ``junk-corpus`` from ``tests/test_fixtures_sweep.py``
since that sweep is happy-path (round-trip the text cleaner) and
fights with files designed to break it. The contract is enforced
by the dedicated junk-corpus test, not the sweep.

Runtime: 12 s for the junk-corpus tests, 30 s for the full
project suite (was 19 s without these). 2118 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:35:22 +00:00
ae9d4a2db5 fix(home): defensive analysis errors don't crash the whole page
Reported: uploading 13_non_latin_scripts.csv made the home page bubble
a ``pandas.errors.EmptyDataError`` traceback up through the page
chrome instead of surfacing as a per-file error. In a multi-file
analysis run that kills every other file's results too, which is
worse than the symptom itself.

Wrap ``_run_analysis_on_upload`` in proper error handling:

- Empty bytes (``getvalue() == b""``) short-circuit with a synthetic
  error Finding telling the user the upload was zero-byte and to
  re-upload.
- Empty ``repair.repaired_bytes`` (file was all NULs / BOM / stripped
  to nothing) likewise surfaces as a synthetic Finding rather than
  reaching pd.read_csv.
- ``pd.errors.EmptyDataError`` from pandas is caught and rendered as
  a Finding that names the file, its byte size, and suggests opening
  it in a text editor to verify the header row matches the data row
  delimiter.
- Any other exception during read/analyze is caught and surfaces as
  a Finding via ``format_for_user`` so the user gets a clean message,
  not a Python traceback.
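
In outline, the wrapper behaves like the sketch below. The ``Finding`` dataclass, ``analyze_safely``, and the injected ``analyze`` callable are simplified stand-ins, and the single broad ``except`` collapses the separate pandas and ``format_for_user`` branches described above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    id: str
    severity: str
    description: str


def analyze_safely(
    name: str, data: bytes, analyze: Callable[[bytes], List[Finding]]
) -> List[Finding]:
    """Never raise: turn any failure into one error-level Finding for this file."""
    if data == b"":
        return [Finding("empty_upload", "error",
                        f"{name}: the upload was 0 bytes; please re-upload the file.")]
    try:
        return analyze(data)
    except Exception as exc:  # deliberately broad: one bad file must not kill the page
        return [Finding("analysis_failed", "error",
                        f"{name} ({len(data)} bytes) could not be analyzed: {exc}")]
```

Because each call returns a list instead of raising, a loop over several uploads keeps going after a bad file.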

Each file in a multi-file run now stands alone: a bad file produces
one red banner in its own card, every other file analyzes normally.

The 13_non_latin_scripts.csv corpus file is 249 bytes of valid UTF-8
on disk and parses cleanly under the same code path locally — the
user's specific symptom is likely a zero-byte upload (browser /
network / Python 3.14 + Streamlit edge case). The new ``empty_upload``
finding reports the byte count so they can confirm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:22:10 +00:00
ef9f8b5de4 fix(close): Edge fallback + better tryClose + honest hint
There is no JavaScript override for browser tab-close security:
``window.close()`` only succeeds on windows JS opened (Chrome --app
windows qualify; a regular browser tab does not). What we can do is
make the --app path easier to hit and the failure case more
actionable.

Three changes:

1. ``src/gui/__main__.py`` — extend browser detection. PATH lookup
   now also looks for ``msedge`` / ``microsoft-edge``; Windows install
   candidates include the Edge install path; macOS candidates include
   Edge and Chromium. Edge is Chromium-based, supports ``--app``, and
   ships on every Windows 10+ machine — so users without Chrome no
   longer fall through to the regular browser tab. When the fallback
   IS hit, print a warning to stderr explaining why Close-from-page
   will require Ctrl+W. Renamed ``_find_chrome`` to
   ``_find_app_browser`` to reflect the broader scope.

2. ``_FAREWELL_SCRIPT_TEMPLATE`` in ``components/_legacy.py`` —
   factor close attempts into a ``tryClose`` helper that runs three
   escalating tries: standard ``win.close()``, the
   ``win.open('', '_self')`` history-rewrite trick (no-op in modern
   Chrome but free), and ``win.top.close()``. Auto-close on paint AND
   the manual button now both call this helper. Skip the manual hint
   if the close eventually succeeded between the click and the 250 ms
   timeout.

3. ``quit.close_hint`` in en/es i18n packs — rewrite the message to
   tell the user honestly that this is a browser security restriction,
   tell them the Ctrl+W keystroke that works, and point them at
   ``python -m src.gui`` for the auto-closing app-mode experience.

2008 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:17:18 +00:00
aeead05e4c fix(downloads): swap st.download_button for an HTML <a download> helper
Reported symptom: only the FIRST download button in a multi-button
row pops the browser save dialog. The second and third do nothing on
click. Affects every tool page that exposes (cleaned + audit + config)
downloads.

Root cause is ``st.download_button`` itself — when several render in
the same script pass, the click-to-bytes wiring on the browser side
mis-routes and only one button's data is actually exposed. Explicit
``key`` arguments don't fix it; ``use_container_width=True`` doesn't
help either; we confirmed this in the Text Cleaner reverts.

Replace the widget with a real ``<a download="file" href="data:...">``
anchor rendered via ``st.markdown(..., unsafe_allow_html=True)``.
Bypasses Streamlit's widget machinery entirely; behaves identically to
a native browser download. Side benefit: clicking it does NOT trigger
a script rerun, so other in-flight UI state survives.

New helper ``html_download_button`` lives in
``src/gui/components/_legacy.py`` (exported from ``components``). API:

    html_download_button(
        label, data,
        *, file_name, mime="application/octet-stream",
        disabled=False, help=None, use_container_width=True,
    )

Translation pattern applied across every tool page (and shared
``results_summary`` / ``config_panel`` widgets in ``_legacy.py``):

- ``st.download_button(`` -> ``html_download_button(``
- ``data=foo_bytes`` kwarg -> positional second arg
- ``key="..."`` -> dropped (helper has no widget identity)
- ``use_container_width=True`` -> dropped (default)
- ``disabled=`` and ``help=`` pass through unchanged
- Pre-computed byte buffers kept where they were

Total: 17 sites replaced on tool pages (3 in Text Cleaner, 3 in Format
Standardizer, 3 in Fix Missing Values, 3 in Map Columns, 3 in
Automated Workflows, 2 in Find Duplicates), plus 4 in shared
_legacy.py widgets used by Find Duplicates.

Caveat: data: URLs balloon by 33% (base64). Fine for tool output
sizes we ship; if a future result topped a few hundred MB we'd want a
Blob-URL fallback.
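
For reference, a sketch of how such an anchor is built and where the 33% comes from. ``data_url_anchor`` and ``base64_overhead`` are illustrative names, not the body of ``html_download_button``.

```python
import base64
import html


def data_url_anchor(label: str, data: bytes, file_name: str,
                    mime: str = "application/octet-stream") -> str:
    """Build an <a download> tag whose href carries the bytes as a data: URL."""
    payload = base64.b64encode(data).decode("ascii")
    return (
        f'<a download="{html.escape(file_name, quote=True)}" '
        f'href="data:{mime};base64,{payload}">{html.escape(label)}</a>'
    )


def base64_overhead(n_bytes: int) -> float:
    """Fractional size increase of base64: every 3 input bytes become 4 characters."""
    encoded = 4 * ((n_bytes + 2) // 3)
    return encoded / n_bytes - 1.0
```

For inputs whose length is a multiple of three the overhead is exactly one third; padding makes it slightly higher for very small payloads.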

The marketing demo at src/gui/app_demo.py keeps its single
st.download_button — single button, no collision, no need to switch.

2008 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:13:41 +00:00
6415be8bf4 feat(tools): unified post-run UX across all Ready tool pages
Apply the Clean Text page's post-run UX pattern to every other Ready
tool page (Find Duplicates, Standardize Formats, Fix Missing Values,
Map Columns, Automated Workflows) for consistency and ease of use.

Per page:

1. Preview wrapped in ``st.expander(f"Preview: {filename}",
   expanded=not _has_result)``. Open before a result exists, folded
   afterwards.

2. Options / configuration controls wrapped in
   ``st.expander("Options", expanded=not _has_result)``. Inner
   sub-expanders preserved (Streamlit 1.36+ supports nesting).

3. After the primary action stashes the result, set a one-shot
   ``_<tool>_scroll_to_results`` flag in session state and call
   ``st.rerun()`` so the preview + options expanders see the new
   state on the next pass and collapse themselves.

4. ``<div id="<tool>-results-anchor" style="height:1px">`` placed
   immediately before the Results subheader.

5. End-of-page: pop the scroll flag and inject a tiny
   ``streamlit.components.v1.html`` iframe whose ``<script>`` calls
   ``scrollIntoView`` on the parent document's anchor. One-shot, so
   unrelated reruns (toggling Show-hidden, etc.) don't yank the
   viewport.

6. Download buttons hardened against the multi-button Streamlit
   footgun: byte buffers pre-computed outside the column scopes,
   explicit unique ``key="<tool>_dl_<purpose>"`` per button,
   ``use_container_width=True``, and previously-conditional buttons
   now render unconditionally with ``disabled=True`` + a help
   tooltip when the underlying data is empty so layout stays steady.

Per-page judgment calls (already noted in agent reports):

- Find Duplicates: sheet picker and delimiter selector kept OUTSIDE
  expanders (the user still needs to see them when a file fails to
  parse).
- Fix Missing Values: missingness profile wrapped INSIDE the Options
  expander together with Strategy — the Results section already
  shows a before/after missingness comparison that supersedes the
  static input profile.
- Map Columns: all three subsections (Target schema, Strategy,
  Mapping) wrapped under one outer Options expander, matching the
  Text Cleaner pattern.
- Automated Workflows: inner "Recommended tool order" expander stays
  nested inside the outer Options wrap; Run button stays outside
  Options so the user can re-run after tweaking the (collapsed)
  editor.

2008 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:04:37 +00:00
d1aaf3c2b9 feat(quit): close-window button + manual hint on the farewell overlay
The farewell overlay already attempted ``window.top.close()`` after a
Close click — but browsers only honour that for tabs that JS opened
(Chrome --app windows qualify; a regular browser tab does not). For
users whose Chrome wasn't auto-detected and who fell back to
``webbrowser.open``, the overlay stayed put and they had no in-page
way to close.

Add to the overlay HTML:
- A "Close this window" button (uses the user-gesture path, which has
  slightly looser browser rules than auto-close).
- A hidden hint paragraph that reveals itself 250 ms after the
  button is clicked IF the window is still here, telling the user to
  press Ctrl+W (⌘W on Mac).

Wired through the existing _farewell_script template + ``_js_html_safe``
escaping so neither label can break out of the JS string literal.

New i18n keys (en + es): ``quit.close_window_button`` and
``quit.close_hint``.

The existing auto-close attempt remains — Chrome --app users still get
their window closed without touching the button.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:59:17 +00:00
27f0648093 fix(text-cleaner): make all three download buttons actually fire
Only "Download cleaned CSV" was working; "Download changes audit" and
"Download config JSON" did nothing on click.

The symptom is the classic Streamlit footgun for multiple
``st.download_button`` widgets in adjacent columns: without an explicit
``key`` argument the auto-derived widget IDs can collide, especially
when one button is conditionally rendered, and only the first button
in source order actually fires on click. Same goes for unstable
``data`` bytes recomputed inside the ``with col:`` block — the widget
identity can drift between renders.

Robustness pattern applied:
- Compute all three byte buffers up front, outside the columns, so the
  ``data`` parameter is the same object across reruns.
- Pass an explicit unique ``key`` ("textclean_dl_cleaned" /
  "textclean_dl_changes" / "textclean_dl_config") to each button.
- Render the changes button unconditionally with ``disabled=True`` and
  a help tooltip when ``result.changes.empty`` — instead of hiding it.
  Layout stays steady and the empty case is self-explanatory.
- ``use_container_width=True`` so the three buttons size identically
  inside their columns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:56:52 +00:00
0a61d52200 feat(text-cleaner): collapse options + auto-scroll to Results on run
After clicking Clean Text the user was left at the bottom of the
script with the Options block still expanded and no viewport movement
— they had to scroll to find the Results.

- Wrap the whole Options block in an outer ``st.expander("Options",
  expanded=not _has_result)``. After the Clean Text rerun, both
  Preview AND Options collapse, leaving the primary action button +
  Results as the only prominent elements above the fold. The inner
  Advanced-options expander is preserved as a nested expander
  (supported in Streamlit 1.36+; this repo pins 1.35+).
- Add a 1px anchor div ``#textclean-results-anchor`` immediately
  before the Results subheader.
- On Clean Text click, set a one-shot ``_textclean_scroll_to_results``
  flag in session state; on the next render, pop the flag and inject
  a tiny ``st.components.v1.html`` iframe whose ``<script>`` calls
  ``scrollIntoView`` on the parent document's anchor. One-shot so
  re-renders triggered by other widgets (Show-hidden toggle, etc.)
  don't jerk the viewport back to the top of Results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:50:43 +00:00
ca14ce2952 feat(text-cleaner): collapse preview on run + full hidden-char audit
Two small UX fixes on the Clean Text page:

1. The input preview is now wrapped in an ``st.expander`` whose
   default-expanded state is ``not has_result``. Clicking the
   "Clean Text" primary button stashes the result and calls
   ``st.rerun()`` so the next pass sees the result in session state
   and the expander folds — the Results section becomes the primary
   visual focus. User can re-expand manually to re-inspect the source.

2. The Examples (changes audit) table's Before/After columns were
   calling ``visualize_hidden_html`` WITHOUT ``mark_outer_whitespace``,
   so leading/trailing whitespace — which is exactly what the cleaner
   most often removes — was invisible. Pass ``mark_outer_whitespace=True``
   to match the input-preview rendering. Column-name cell now mirrors
   that flag too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:43:52 +00:00
502a72cd46 feat(nav): ← Back to Home link on every tool page
Multi-file workflow: a user uploads several files on Home, clicks
"Open <Tool>" on one file's findings, lands on a tool page. The
sidebar lets them get back to Home, but a top-of-page back affordance
is more discoverable and keeps the hand in the same screen region as
the upload list they're working through.

- New ``back_to_home_link()`` helper in components/_legacy.py renders
  a secondary button that calls ``st.switch_page("app.py")`` — under
  ``st.navigation`` that routes to the default (Home) page.
- Wired into every tool page (1-9) directly after
  ``hide_streamlit_chrome()`` and BEFORE the license gate so a Lite
  user who lands on a locked tool can navigate away without paying.
- New i18n key ``nav.back_to_home`` ("← Back to Home" /
  "← Volver al inicio") in en/es packs.

2008 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:38:01 +00:00
604debb9a9 revert(home): keep per-tool grouping for per-file findings
Restoring ``render_findings_panel`` on the home page. Previous commit
(c575efd) inlined a flat renderer that dropped the per-tool grouping
and the "Open <Tool>" jump links — that was an over-correction. The
user only wanted the bottom tool-card grid gone (already removed in
ff2eaeb). The grouping inside the findings panel is what lets a user
land on a specific finding and one-click into the cleaner that fixes
it; without it they'd have to guess which sidebar entry to open.

Tool-card grid stays removed. Sidebar nav is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:31:36 +00:00
c575efd26e fix(home): render findings flat — drop per-tool grouping
The home page was calling ``render_findings_panel``, which groups
findings by tool into expanders and renders an "Open <Tool>" page
link under each. After uploading a file, the user still saw a tool
list (just under a different shape) — defeating the earlier cleanup
that removed the tool-cards grid.

Inline a flat renderer in ``_home_page``: per uploaded file, render
the filename header + severity summary + a flat list of findings via
``_render_one_finding`` directly. No expanders, no tool names as
section headers, no per-tool page-link buttons. Tool discovery
happens in the sidebar.

``render_findings_panel`` itself is unchanged — it still groups by
tool and remains tested via the findings-panel harness, but is no
longer used on the home page.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:22:20 +00:00
175389219f fix(gui): translate sidebar tool names when language changes
The sidebar nav was passing ``tool.name`` (the registry's English
field) to ``st.Page``, so the tool entries stayed in English even
after the user picked Spanish from the language selector. Section
headers were already i18n-driven; tool entries were not.

Switch to ``tool_name(tool_id)`` which routes through ``t(...)`` and
picks up the active language from session state. Verified: with
``ui_lang=es`` the sidebar renders Buscar duplicados / Limpiar texto /
Mapear columnas / etc. instead of the English fallbacks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:19:15 +00:00
c568aec8a7 feat(gui): one-click Close in its own bottom sidebar section
Close is now a direct shutdown trigger: visiting the Close page (the
sidebar entry) fires shutdown_app() immediately — no confirm step, no
intermediate body. The farewell overlay paints and os._exit(0) lands
~1s later from a daemon thread.

Layout: Close moved into its own bottom-of-sidebar section so the
destructive action is visually separated from Account/Activate.

- New shutdown_app() in components/_legacy.py replaces quit_button.
  os._exit thread is skipped when "pytest" is in sys.modules so the
  test suite doesn't kill its own process when rendering 99_Close.
- pages/99_Close.py shrinks to set_page_config + chrome + shutdown_app.
- app.py nav grows a new "Close" section header (new
  nav.section_close key in en/es packs) pinned at the bottom of the
  navigation dict.

Tests updated:
- TestQuitButtonRenders → TestClosePageShutsDownImmediately.
  Assert the shutdown caption renders + no confirm button exists.
- test_smoke EXPECTED_SUBSTRINGS["99_Close"] now pins
  "Shutting down" / "Cerrando" (the visible page body) instead of
  the removed page title.

2008 tests pass.
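
The delayed-exit pattern described in this commit can be sketched roughly
as follows. This is not the project's shutdown_app(); schedule_exit and
its parameters are hypothetical names, and the exit callable is injectable
so the sketch can be exercised without killing the interpreter.

```python
import os
import sys
import threading


def schedule_exit(delay_s=1.0, exit_fn=os._exit, modules=None):
    """Start a daemon timer that calls exit_fn(0) after delay_s seconds.

    Hypothetical helper, not the project's shutdown_app(). Returns the
    timer, or None when pytest is loaded so a test run is never killed.
    """
    loaded = sys.modules if modules is None else modules
    if "pytest" in loaded:
        return None
    timer = threading.Timer(delay_s, exit_fn, args=(0,))
    timer.daemon = True  # a pending timer never blocks interpreter shutdown
    timer.start()
    return timer
```

Running the exit from a timer thread gives the page time to paint the
farewell overlay before the process goes away.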

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:17:14 +00:00
ff2eaeb6c4 feat(home): multi-file upload + per-file analysis, drop tool grid
Home is now upload + analysis only. The page accepts multiple files in
one go, analyzes each independently, and renders findings grouped by
filename in bordered containers. The 3-section tool-card grid is gone —
discovery happens via the sidebar now.

Mechanics:
- file_uploader uses accept_multiple_files=True. Each file's findings
  cache in session_state["home_findings_by_file"] keyed by filename so
  removing a file via Streamlit's "x" button drops its findings too,
  and re-clicking Run only re-analyzes pending files.
- The first uploaded file is mirrored into the singular
  home_uploaded_{name,bytes,size} keys so tool pages continue to pick
  up an "active" upload through pickup_or_upload — no tool-page changes.
- New i18n keys: upload.intro_multi, upload.uploader_label_multi,
  upload.clear_results, upload.empty_state. upload.heading text is
  updated to "Upload one or more files to start" (EN + ES).
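
A minimal sketch of the per-file cache behaviour listed above, assuming a
plain dict stands in for session_state["home_findings_by_file"].
sync_findings and analyze are hypothetical names, not the page's real
helpers.

```python
def sync_findings(cache, uploaded, analyze):
    """Return a findings cache that matches the current uploads.

    cache:    dict filename -> findings from earlier runs
    uploaded: dict filename -> file bytes currently in the uploader
    analyze:  callable(bytes) -> findings, run only for pending files
    """
    # Drop findings for files the user removed from the uploader.
    fresh = {name: found for name, found in cache.items() if name in uploaded}
    # Analyze only the files that have no cached findings yet.
    for name, data in uploaded.items():
        if name not in fresh:
            fresh[name] = analyze(data)
    return fresh
```

Keying by filename means a re-upload under the same name reuses the cached
findings; whether the real page also compares file content is not stated
here.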

Dropped tests pinning the tool grid:
- TestHomeToolGridLocalization (test_chrome.py)
- test_home_tool_card_uses_es_name (test_smoke.py)
- TestLiteHomeGridBadges (test_lite_tier.py — locked-card lock-badge
  assertions; locking is still enforced per-tool-page via
  require_feature_or_render_upgrade)

2009 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:12:48 +00:00
dad744f17f refactor(gui): drop Review page + normalization gate
Home is now the only entry point: the "Run analysis" button on the
upload section IS the review step (findings render inline via
render_findings_panel). Tool pages no longer gate on a passed
normalization — running the analyzer is sufficient context.

Removed:
- src/gui/pages/0_Review.py
- src/gui/components/gate.py (re-export seam)
- require_normalization_gate() in src/gui/components/_legacy.py
- "review" section enum in tools_registry.py
- Data Review entry in app.py navigation
- require_normalization_gate() calls + imports in all nine tool pages
- tests/gui/test_gate.py (whole file)
- TestReviewWorkflow in tests/gui/test_workflows.py
- 0_Review entry in tests/gui/test_smoke.py PAGE_SLUGS
- stash_upload's normalization_result+normalization_for stashing
- stash_upload_without_gate (was the gate's negative-path helper)

2017 tests pass (16 retired with the gate flow).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:04:33 +00:00
fc6c22c6a7 feat(review): inline file uploader instead of redirect home
When a user lands on Review without an upload, show a file uploader
on the page itself and auto-run the analyzer once a file is picked,
rather than bouncing them to the home page with a "Back to home"
button.

Auto-analyze is the right default here: the user is already on the
Review page, so they've implicitly committed to a scan. Stashing the
bytes in the same session-state keys the home page uses keeps the
rest of the flow (encoding picker, gate, tool pages) unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 19:57:01 +00:00
db5ec084da docs+code: rename tool labels everywhere
Sweep follow-up to 93e43fc. Display labels now consistent across docs,
landing pages, CLI output, code comments, docstrings, and test prose.
Five parallel surfaces touched:

- docs (EN + ES): README, USER-GUIDE, CLI-REFERENCE, and 11 internal
  design/planning docs
- landing pages: index + bookkeeper/revops/shopify-pet
- src: CLI module docstrings, _TOOL_DISPLAY dicts in cli_analyze.py
  and gui/components/_legacy.py, core module headers, every tool
  page's module docstring
- tests: class/method/module docstrings and section-header comments
- test-cases READMEs

Page slugs (1_Deduplicator etc.), tool_id strings (01_deduplicator
etc.), Python class names (TestDeduplicatorWorkflow, FeatureFlag.*),
URL paths, anchor IDs, CSS classes, and asset filenames were left
intact since they're code identifiers / structural references.

All 2033 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 19:50:09 +00:00
93e43fc0d9 feat(gui): sidebar sections + non-technical tool labels
Sidebar nav now groups tools under Data Review / Data Cleaners /
Transformations / Automations via st.navigation, replacing the flat
auto-discovered list. Tool display names switch to action-first
phrasing (Find Duplicates, Fix Missing Values, Find Unusual Values,
Standardize Formats, Clean Text, Quality Check, Map Columns, Combine
Files, Automated Workflows) in EN + ES packs and on each page's H1.

The Data Cleaners section follows the requested order: Missing
Values → Outliers → Text Cleaner → Format Standardizer → Deduplicator
→ Quality Check. (Text Cleaner kept inside cleaners since the request
didn't list it but the tool still ships.) Registry now carries a
section field; helpers added: tools_in_section(), section_label().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 19:36:01 +00:00
624f99653e docs(arch): end-to-end system + tech-stack diagrams
New ARCHITECTURE.md pulls the desktop app (TECHNICAL.md) and the
license server (LICENSE-SERVER.md) into a single picture — the two
were never reconciled into an end-to-end view before.

Contents:
  §1. System diagram (ASCII) showing operator laptop, license
      server stack (nginx → FastAPI → Postgres), Postmark, Gumroad,
      and the buyer's machine — with the three primary flows
      (sale, manual mint, offline activation) traced through it.
  §2. Tech stack diagram, layered: desktop / server / operator /
      external SaaS, with version pins.
  §3. Trust + isolation boundaries table — what crosses each one
      and what the threat model is.
  §4. "Where things are stored" — paths, tables, files.
  §5. Pointers to the deeper per-component docs.

ASCII over Mermaid since the repo's Gitea version is unknown and
plain text renders in every viewer / IDE / raw `cat`.

LICENSE-SERVER.md status flipped from "design proposal, not built"
to "deployed (PR 1 + PR 2 code merged)" — that was stale since
the PR 1 deploy yesterday.

TECHNICAL.md and ADMIN.md gain one-line pointers to ARCHITECTURE.md
so people land at the unified view when looking for "how does it
all fit together".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:59:05 +00:00
86ad21db79 docs(license): PR 2 deploy + operator instructions
ADMIN.md gains a "Running a Gumroad webhook" section: how the URL
secret works, how to add a SKU to products.yaml, how to inspect
gumroad_events (recent activity + failures-only queries), how to
replay a failed delivery, and how to test without buyers via
Gumroad's "Send Test Ping" button.

The deployed-vs-queued matrix flips Gumroad + Postmark to
"code merged, deploy pending" so it's clear the bits exist on
main but the live box still runs PR 1.

SETUP-LICENSE-SERVER.md §3 commits the eventual compose.yml shape
with PR 2 environment + secrets lines included but commented out,
ready to uncomment at deploy time. The §3 chown step already covers
the new secret files because it uses `chmod 400 secrets/*` /
`chown 10001:10001 secrets/*`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:33:53 +00:00
2bbaba954b feat(server): Gumroad webhook receiver + Postmark email (PR 2)
Wires the second source-adapter (Gumroad) plus the email delivery
that lets the server fulfill a sale end-to-end without operator
intervention.

Auth model: Gumroad doesn't HMAC the body, so we use their
recommended URL-secret pattern (?secret=...). Wrong/missing secret
returns 404 — no signal to a prober that the endpoint exists.
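
The URL-secret check could look roughly like this. check_url_secret is a
hypothetical helper and returning bare status codes is an assumption made
for illustration; only hmac.compare_digest is a real stdlib call.

```python
import hmac


def check_url_secret(provided, expected):
    """Return an HTTP status for the ?secret=... query value.

    Hypothetical helper. 404 on any mismatch so a prober cannot tell the
    endpoint exists; 200 means the request may proceed.
    """
    candidate = (provided or "").encode("utf-8")
    # compare_digest avoids leaking, via timing, how many leading bytes
    # of the candidate matched the expected secret.
    if expected and hmac.compare_digest(candidate, expected.encode("utf-8")):
        return 200
    return 404
```

An unset expected secret is treated as a mismatch, so a misconfigured
server rejects every request instead of accepting an empty secret.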

Webhook flow (server/app/routes/webhooks.py):
  1. audit-log the raw payload (gumroad_events row) BEFORE anything
     else, so a later failure leaves us replayable
  2. parse via GumroadAdapter (server/app/adapters/gumroad.py)
  3. mint_from_sale — UNIQUE(source, source_order_id) dedups
     duplicate webhook retries
  4. send the license email
  5. mark gumroad_events.processed = true

Always returns 200 once auth passes. Non-2xx would trigger Gumroad's
3-day retry storm; we'd rather record the failure on the audit row
and replay manually after fixing whatever surfaced.
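
The retry dedup in step 3 rests on the UNIQUE(source, source_order_id)
constraint. A minimal sketch, using stdlib sqlite3 in place of Postgres;
the email column, make_db and mint_once are assumptions for illustration,
not the server's schema or API.

```python
import sqlite3


def make_db():
    """Create an in-memory table with the composite UNIQUE constraint."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE licenses ("
        " id INTEGER PRIMARY KEY,"
        " source TEXT NOT NULL,"
        " source_order_id TEXT NOT NULL,"
        " email TEXT NOT NULL,"
        " UNIQUE (source, source_order_id))"
    )
    return db


def mint_once(db, source, order_id, email):
    """Insert a license row; return True if minted, False on a retry."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO licenses (source, source_order_id, email)"
                " VALUES (?, ?, ?)",
                (source, order_id, email),
            )
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint fired: this delivery was already processed.
        return False
```

Because the database enforces uniqueness, two concurrent retries cannot
both mint, which a check-then-insert in application code would not
guarantee.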

Product → tier mapping is per-source YAML at
server/config/products.yaml (lru_cached). Adding a SKU = edit yaml,
restart api. Unmapped product_id is an error on the audit row, not
a crash.

EmailService (server/app/email.py): provider-agnostic interface with
Postmark as the first implementation. When POSTMARK_TOKEN is unset
the factory returns LoggingEmailService instead, so the webhook
exercises end-to-end before Postmark is provisioned.

48 unit tests (was 21) including:
- Gumroad secret verify with constant-time compare
- Sale parsing: amount-in-cents, name fallback from email,
  test=true tagging, missing-required fields, offer codes
- Product mapping lookups
- Email rendering text + HTML, HTML-escapes user input
- Postmark client via httpx.MockTransport (success and 4xx)
- Webhook end-to-end: secret check, audit log, idempotency on
  retry, unmapped product, email failure keeps license

Smoke test (server/scripts/smoke.sh) extended to POST a synthetic
Ping payload, verify the row + audit log, prove wrong-secret is
rejected, prove duplicate sale_id stays one row.

SQLite-test compatibility:
- BigInteger primary key uses with_variant(Integer, "sqlite") since
  SQLite only autoincrements INTEGER PRIMARY KEY.
- python-multipart pulled in for FastAPI Form parsing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:33:43 +00:00
b5cd74d474 docs(admin): live deployment section for the running license server
Documents the post-deploy state of PR 1: live URLs (datatools and
licenses subdomains on unalogix.com), the on-box filesystem layout
under /srv/datatools-license/, where the admin token lives and how
to retrieve / rotate it, the laptop-side SSH-tunnel + admin_cli
mint workflow, inspection commands (logs, psql, container status),
restart / rebuild procedures, manual backup commands until cron
lands, the production-key rotation outline, and a deployed-vs-queued
capability matrix.

Secrets are NEVER pasted into this doc — the admin token's literal
value lives only on disk (mode 400, UID 10001). Committing it to
git would mean permanent leakage via history even after rotation;
documenting its location + rotation procedure achieves the same
operational outcome without the residual exposure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:19:57 +00:00
1cf69dd23b docs(license): runbook fixes from PR 1 self-host deploy
Two real-world footguns surfaced during the first live deploy:

1. docker-compose's uid/gid/mode long-form on file-based secrets is
   silently ignored — that's a swarm-mode-only feature. The
   container app user (UID 10001 from the Dockerfile) cannot read
   a mode-400 file whose host UID it doesn't match. Fix is to
   chown the secret files to 10001 directly; host-side access
   control stays gated by the parent dir's mode 750.

2. nginx 1.24 (Ubuntu 24.04 default) rejects the standalone
   "http2 on;" directive (that arrived in 1.25). Use the legacy
   "listen 443 ssl http2;" combined form. Noted prominently so the
   next deploy doesn't trip on it.

Also realigned §3's compose example to what actually got deployed
for PR 1 — only pg_password + admin_token secrets, postmark /
gumroad / license_privkey commented out as PR 2 / production-key
follow-ups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:17:05 +00:00
673b902377 feat(license): datatools-admin CLI for the mint API
New operator CLI at src/admin_cli.py: mint, list, revoke, ping —
talks to the server's /internal/* endpoints over a local SSH tunnel.
Stdlib-only on the desktop side (urllib + typer), no new top-level
deps. Auth via $DATATOOLS_ADMIN_TOKEN.

scripts/generate_license.py is now annotated as a break-glass tool
for when the server is unreachable — routine work goes through the
new CLI so the authoritative `licenses` row is created.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 00:47:01 +00:00
bab2c9468c feat(server): mint API + Postgres schema + manual adapter (PR 1)
Source-agnostic license issuance service. FastAPI app fronts a
Postgres `licenses` table; the only currently-wired source is
`manual` (operator mints via /internal/mint). Gumroad webhook
adapter lands in PR 2.

Key design points:

- Signing reuses src/license/crypto.py via a COPY into the image
  (single source of truth — blobs minted server-side verify against
  the same embedded pubkey on the buyer's machine).
- Source adapter Protocol (app/adapters/base.py) is the seam for
  Gumroad / Lemon Squeezy / Stripe in later PRs; Mint API speaks
  only SaleEvent / RefundEvent.
- (source, source_order_id) UNIQUE composite gives idempotent
  webhook retries without double-mint.
- JSONB type uses with_variant(JSON, 'sqlite') so the same models
  drive both Postgres prod and SQLite tests (no testcontainers dep).
- Bearer-token auth on /internal/*; the IP-loopback guard was
  removed after the docker bridge made it fight legitimate prod
  traffic (nginx defense + Bearer remain).
- Secrets resolved via *_FILE env vars pointing at
  /run/secrets/<name>, so passwords never appear in `docker inspect`.
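
The *_FILE convention from the last point can be sketched as below.
read_secret is a hypothetical helper, and both the fallback to the plain
variable and the precedence order are assumptions rather than the
server's confirmed behaviour.

```python
import os


def read_secret(name, environ=None):
    """Resolve a secret from NAME_FILE (a path) or fall back to NAME.

    Hypothetical helper; the precedence shown here is an assumption.
    """
    env = os.environ if environ is None else environ
    path = env.get(name + "_FILE")
    if path:
        # Only the path sits in the environment; the value stays on disk.
        with open(path, encoding="utf-8") as handle:
            return handle.read().strip()
    return env.get(name)
```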

21 unit tests (SQLite in-memory, StaticPool) plus a real-Postgres
docker-compose smoke test in server/scripts/smoke.sh that builds the
image, runs the alembic migration, mints a license, verifies the
signature against the host dev pubkey, and checks the DB row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 00:46:54 +00:00
4179cb5156 docs(license): self-hosted server runbook + multi-tenancy plan
Adds SETUP-LICENSE-SERVER.md — end-to-end install runbook for the
license server on the existing invixiom box (Ubuntu 24.04). Covers
DNS, system packages, Postgres + API in Docker, dedicated system
user, secrets layout under /srv/datatools-license/secrets (mode
400), nginx config in a separate sites-available/unalogix file,
Let's Encrypt cert issuance, smoke tests, backups, monitoring, key
rotation, and rollback.

Multi-tenancy is explicit at every layer: separate DNS zone
(unalogix.com vs invixiom.com), separate nginx file, separate TLS
cert, dedicated backend ports (8090 for the API, 5433 for Postgres,
both localhost-only), separate docker compose project and volume.
No invixiom service is touched.

LICENSE-SERVER.md updated: hosting choice moved from "Fly.io /
Render" (rejected) to self-hosted (decided). Points at the new
runbook for ops specifics.

ADMIN.md pointer table updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 22:57:53 +00:00
52e04f63a9 docs(license): design proposal for online issuance & record-keeping
Forward-looking design doc — not implemented. Describes the smallest
useful server that replaces the manual mint-and-paste workflow:
Gumroad webhook → Mint API (KMS-held private key) → Postgres
licenses table, plus a self-service renewal/re-delivery portal.

The desktop app is deliberately untouched across all three migration
phases: activation stays fully offline and continues to verify blobs
against the embedded pubkey, preserving the DECISIONS.md §9b promise
that buyer machines never phone home.

Schema is intentionally a superset of the local issuance JSONL log
(ADMIN.md), so Phase 1 migration is a flat INSERT per row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 22:26:24 +00:00
23c51fd759 feat(license): local issuance log for minted blobs
generate_license.py now appends every minted license to
~/.datatools-creator/issued.jsonl (overridable via env). This is the
creator-side system of record until the server-side flow lands.

The full blob is stored alongside name/email/tier/expiry so buyers
who lose their delivery email can be re-served without re-minting.
File is created mode 600 and lives outside the buyer-facing
~/.datatools/ dir so it never gets bundled into a shipped install.

Log failures are non-fatal (warning to stderr) — the mint already
succeeded by the time we try to log, and forcing a re-mint after a
log error would invalidate any device the buyer had activated. Pass
--no-log for test mints.
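
A minimal sketch of a non-fatal, append-only JSONL log with owner-only
permissions. append_issuance and the record fields are illustrative
assumptions, not the actual code in generate_license.py.

```python
import json
import os
import sys


def append_issuance(path, record):
    """Append one JSON line to the log; return False instead of raising."""
    try:
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        # O_APPEND keeps earlier rows; 0o600 applies only on creation.
        fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
        with os.fdopen(fd, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(record, sort_keys=True) + "\n")
        return True
    except OSError as exc:
        # The mint already succeeded, so a log failure is only a warning.
        print(f"warning: could not log issuance: {exc}", file=sys.stderr)
        return False
```

Returning a flag instead of raising keeps the mint result usable even
when the log directory is read-only or the disk is full.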

ADMIN.md adds a "Customer record-keeping" section with the path,
schema, jq one-liners, and migration note pointing at the upcoming
LICENSE-SERVER.md design doc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 22:25:19 +00:00
65e17e0a70 docs(admin): internal license operations reference
Creator-only ADMIN.md covering keypair generation, blob minting,
dev vs. production key model, tier matrix, and recovery if the
private key is lost. Includes a TL;DR for minting a dev license
against the in-tree keypair.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 22:10:16 +00:00
e534fb4989 sec(license): Ed25519 sigs + production-safe tripwire
Two coupled hardening upgrades.

1. Asymmetric signatures (HMAC → Ed25519)

The previous HMAC scheme used a symmetric secret that any motivated
reverse engineer could pull out of the shipped binary and use to
mint blobs for any tier / name / email. With Ed25519, the binary
ships only the public verification key; the signing key never
leaves the seller's environment, so binary compromise no longer
yields forgery.

- src/license/crypto.py rewritten around
  cryptography.hazmat.primitives.asymmetric.ed25519. Same public
  API surface (sign/verify/encode_blob/decode_blob), same canonical
  JSON encoding — drop-in for the manager / cli / GUI layers.
- DATATOOLS_LICENSE_PRIVKEY (seller-side) and
  DATATOOLS_LICENSE_PUBKEY (build-time) env vars supply the keys;
  the in-source dev keypair (src/license/_dev_keypair.py)
  deterministically derives from a seed phrase for repro builds and
  tests.
- Blob prefix bumped DTLIC1: → DTLIC2:. Decoding a DTLIC1 blob
  surfaces a clear "old format" error rather than a confusing
  signature mismatch.
- scripts/generate_keypair.py mints fresh production keypairs for
  the seller (run once, stash the private key offline). Adds
  cryptography>=41,<46 to requirements.txt (was an undeclared
  transitive dep).

2. Production-safe tripwire

assert_production_safe() refuses to boot a frozen / shipped build
when either:

- DATATOOLS_DEV_MODE=1 is set (would unconditionally bypass every
  license check — fine in source/test but catastrophic in a buyer
  install).
- The active verification key is still the embedded dev key (the
  build pipeline forgot to set DATATOOLS_LICENSE_PUBKEY).

No-op in source / pytest runs (sys.frozen is unset) so test
fixtures and dev workflows keep working without ceremony. Called
from src/cli_license_guard.guard() and from hide_streamlit_chrome
— so it fires on every CLI invocation and every GUI page load.

Tests: 49 license-layer unit tests (was 40); added Ed25519
wrong-key rejection, dev-keypair seed pin, blob v2 prefix, v1
rejection with clear message, and four production-safe scenarios
(no-op in source, fires on DEV_MODE in frozen, fires on dev key in
frozen, passes in frozen with prod pubkey). Total: 2024 → 2033.

Docs (REQUIREMENTS §17a, DEVELOPER licensing recipe, DECISIONS
§9b + decision log) updated with the new threat-model write-up,
key-storage workflow, and tripwire behaviour.
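
The tripwire logic from section 2 can be sketched as follows. This is an
illustration under stated assumptions, not the project's
assert_production_safe(): the function name, the UnsafeBuildError type
and the explicit key arguments are hypothetical.

```python
import os
import sys


class UnsafeBuildError(RuntimeError):
    """Raised when a frozen build carries dev-only settings."""


def assert_production_safe_sketch(active_pubkey, dev_pubkey,
                                  environ=None, frozen=None):
    """Refuse to continue when a frozen build has dev mode or the dev key."""
    env = os.environ if environ is None else environ
    if frozen is None:
        # PyInstaller sets sys.frozen on bundled apps; it is absent in source.
        frozen = bool(getattr(sys, "frozen", False))
    if not frozen:
        return  # source and pytest runs: no-op
    if env.get("DATATOOLS_DEV_MODE") == "1":
        raise UnsafeBuildError("DATATOOLS_DEV_MODE=1 in a frozen build")
    if active_pubkey == dev_pubkey:
        raise UnsafeBuildError("frozen build still uses the dev public key")
```

The four scenarios named in the test list map onto the branches: no-op in
source, DEV_MODE in frozen, dev key in frozen, and a clean pass with a
production key.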

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:34:48 +00:00
d32b58e61a feat(license): add Lite SKU; remove user-facing free trial
Two coupled changes:

1. Lite tier
   - New Tier.LITE in src/license/schema.py.
   - FEATURES_BY_TIER[Tier.LITE] = {Deduplicator, Text Cleaner,
     Format Standardizer}. The three universally-useful tools that
     cover the most common bookkeeping / RevOps / Klaviyo prep
     workflows. Other six tools require Core.
   - i18n: license.tier_lite, license.feature_locked_title,
     license.feature_locked_body, license.upgrade_link,
     license.status_locked (en + es).
   - Per-tool feature gate at every GUI tool page
     (require_feature_or_render_upgrade) and every tool CLI
     (guard(feature=...)). A locked tool renders an upgrade
     prompt + Manage-license button (GUI) or exits with code 2
     (CLI).
   - Home grid: tool cards the user's tier doesn't unlock get a
     red 🔒 Locked badge in place of green Ready.

2. Trial removed
   - Activation form's "Start 1-year trial" button removed.
   - license_cli's `trial` subcommand removed.
   - activation.trial_button / activation.trial_help i18n keys
     dropped (pack parity test stays green).
   - Tier.TRIAL stays in the enum (back-compat with any field-
     tested trial licenses); LicenseManager._mint stays internal
     for tests and the seller's key generator.
   - Decision logged in DECISIONS §9b: a 1-year all-features
     trial undercuts paid Lite; paid-only keeps tier economics
     clean.

Tests (+29 net): +17 Lite-tier unit/guard tests + 13 Lite-tier
GUI tests + 1 trial-absent assertion - 2 trial CLI tests - 1
trial GUI button test. Total: 1995 → 2024.
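
A sketch of the tier-to-feature gate described in section 1. The feature
identifiers, Tier.CORE and guard_exit_code are assumptions for
illustration; the real schema lives in src/license/schema.py and the six
Core-only tools are represented here by placeholder names.

```python
from enum import Enum


class Tier(Enum):
    LITE = "lite"
    CORE = "core"


LITE_TOOLS = frozenset({"deduplicator", "text_cleaner", "format_standardizer"})
# The six remaining tool names are placeholders for this sketch.
OTHER_TOOLS = frozenset({f"core_tool_{n}" for n in range(1, 7)})
FEATURES_BY_TIER = {
    Tier.LITE: LITE_TOOLS,
    Tier.CORE: LITE_TOOLS | OTHER_TOOLS,
}


def guard_exit_code(tier, feature):
    """Return 0 when the tier unlocks the feature, else 2 (locked)."""
    return 0 if feature in FEATURES_BY_TIER[tier] else 2
```

Keeping the mapping in one table lets the GUI gate and the CLI guard
consult the same data, so a new SKU becomes a table edit.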

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:19:30 +00:00
e612c751a8 docs(license): document activation flow, tier system, dev bypass
- USER-GUIDE EN + ES gain a §0 "First launch — activation" section
  covering paid blob activation, 1-year trial, renewal, file
  location, and device-swap.
- REQUIREMENTS §17a "Licensing" — storage path, activation model,
  lifetime, tier list, dev bypass env var. Test count: 1995.
- DEVELOPER gains a "Licensing" recipe in the Extension recipes
  section: public API, feature-flag add, tier add, minting via the
  creator-only script.
- DECISIONS §9b — log the offline-HMAC choice with the threat-model
  trade-off (motivated piracy not stopped; honor-system + 30-day
  refund covers casual sharing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:54:30 +00:00
e435103113 feat(license): registration + 1-year licenses + tier scaffolding
A complete offline licensing layer (no internet at any step):

Core
- src/license/ — schema (License, Tier, FeatureFlag), HMAC crypto,
  JSON storage, LicenseManager singleton with activate/renew/
  deactivate/issue_trial. Tier-scaffolded so future SKUs can carve
  per-tool feature sets without consumer-code edits.
- scripts/generate_license.py — creator-only key generator. Mints a
  DTLIC1: blob the buyer pastes into the activation page.

GUI
- New activation form component (src/gui/components/activation.py).
- hide_streamlit_chrome() now inline-renders the activation form when
  no valid license is present (every page short-circuits to the form
  until activated).
- Sidebar shows tier + days remaining; renewal warning under 30 days.
- New pages/_Activate.py for revisiting the form after activation.

CLI
- src/license_cli.py — activate / renew / status / trial / deactivate
  commands. Exempt from the guard.
- src/cli_license_guard.py — drop-in guard call added to every tool
  CLI's main(). Lets --help through; respects DATATOOLS_DEV_MODE.

i18n
- New activation.* and license.* keys in en.json + es.json
  (page title, form labels, status badges, renewal warnings, error
  messages). Pack parity test stays green.

Test infrastructure
- tests/conftest.py autouse fixture sets DATATOOLS_DEV_MODE=1 so the
  existing 1916 tests continue to pass.
- isolated_license_path / activated_license_manager /
  unactivated_license_manager fixtures for tests that want to drive
  the real check.

Tests (+79)
- tests/test_license.py (40): schema, crypto roundtrip, blob
  encode/decode, tier→feature mapping, activation flow, name/email
  mismatch rejection, tamper detection, expiration, renewal,
  dev-mode bypass.
- tests/test_license_cli.py (26): every license_cli command +
  subprocess tests confirming every tool CLI refuses to run without
  a license, --help always works, DEV_MODE bypasses.
- tests/gui/test_activation.py (13): gate blocks without license,
  passes with trial, activation form submission unlocks the gate,
  sidebar status, renewal warning, i18n.

Total: 1916 → 1995 tests. All pass under the strict warning filter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:54:23 +00:00
b2c7b94fe9 fix: clear all latent deprecation + resource warnings
Three real issues surfaced when running the suite with strict warnings:

1. src/core/format_standardize.py: ``datetime.utcfromtimestamp`` is
   deprecated in CPython 3.12 and slated for removal. Replace with
   ``datetime.fromtimestamp(ts, tz=timezone.utc)``. Output for the
   date-only format codes we use is byte-identical.

2. src/core/io.py: ``list_sheets`` leaked the openpyxl file handle by
   returning ``xl.sheet_names`` from an unclosed ``pd.ExcelFile``.
   Wrap in a ``with`` block so the FD closes deterministically — also
   prevents the Windows-only "file is locked" repro path.

3. tests/test_corpus.py: ``TestXlsxPollution.workbook`` fixture
   returned the bare ``pd.ExcelFile`` instead of yielding + closing.
   Convert to a yield-and-finally pattern so the class-scoped handle
   isn't leaked across the whole test file.

Also harden pytest.ini's warning policy: escalate
``ResourceWarning`` from ``src`` to an error, alongside the existing
``DeprecationWarning`` rule. Third-party warnings stay filtered — we
can't fix pandas/openpyxl/streamlit churn from here.

All 1916 tests pass under the strict filter; full and split runs
(``pytest``, ``pytest -m 'not gui'``, ``pytest -m gui``) all clean.
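
The replacement in item 1 amounts to the pattern below. format_epoch_date
and its default format are hypothetical; only the
datetime.fromtimestamp(ts, tz=timezone.utc) call comes from the commit.

```python
from datetime import datetime, timezone


def format_epoch_date(ts, fmt="%Y-%m-%d"):
    """Format a Unix timestamp as a UTC date string."""
    # Timezone-aware replacement for the deprecated utcfromtimestamp().
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime(fmt)
```

The result is an aware datetime rather than a naive one, which makes no
difference for date-only format codes but would matter if a %z code were
ever added.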

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:28:48 +00:00
070e3c9f06 docs(gui): document the new GUI test layer
REQUIREMENTS §16 updates the test count (1777 → 1916) and breaks out
the GUI subset. DEVELOPER's Tests section gains the 'gui' marker
recipes and the new tests/gui/ tree under test layout, plus a short
'GUI test layer' explainer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:13:40 +00:00
35d46a0c1a test(gui): add Streamlit AppTest layer (139 tests)
Until now every test ran against core or the CLI; the Streamlit GUI
was verified by hand. This commit adds tests/gui/ — 139 AppTest-
driven tests behind a 'gui' marker so the quick loop
(``pytest -m 'not gui'``) stays at 1777 tests / ~10s while
``pytest`` runs everything (1916 / ~14s).

Coverage:
- test_smoke.py (59): every page renders in EN and ES, expected
  substring present, sidebar selector mounted.
- test_chrome.py (18): language selector flips session state and
  re-renders; quit button + farewell strings localize; tool-card
  names use the active language.
- test_gate.py (9): require_normalization_gate no-op / warning /
  short-circuit / hash-mismatch invariants; warning + button
  localized.
- test_workflows.py (14): happy path per Ready tool — stash
  upload, render, find primary action, verify result lands in
  session state.
- test_dedup_review.py (8): Accept All / Reject All / Clear
  Decisions wire through to review_decisions; apply_review_decisions
  semantics (keep-all, merge, column override).
- test_advanced_panels.py (15): config_panel widget defaults and
  options (algorithm, threshold, survivor rule, merge, multiselects,
  config save/load).
- test_errors.py (4): garbage / empty / single-column uploads don't
  crash; duplicate-target mapping raises InputValidationError.
- test_findings_panel.py (12): driven via a small standalone harness
  page so we test the component without faking a file_uploader. EN
  + ES strings, per-tool grouping, open-tool button label, untargeted
  expander, severity summary.

Shared infrastructure in tests/gui/conftest.py:
- ``stash_upload`` / ``stash_upload_without_gate`` — populate
  session_state to pre-pass or block the gate.
- ``with_language`` — set ``ui_lang`` before run().
- ``collected_text`` — flatten title/caption/markdown/etc. into
  one string for substring assertions.
- Auto-marking: every test in tests/gui/ gets ``@pytest.mark.gui``
  via ``pytest_collection_modifyitems``, so the marker isn't
  per-test boilerplate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:13:40 +00:00
d0423a8912 docs(perf): publish the dedup/parallel/lazy-copy wins and limits
REQUIREMENTS §10 carries the new measured numbers and the dedup
blocking trade-off note. DEVELOPER known-limitations is rewritten to
reflect that exact-only dedup is now O(n), fuzzy-blocking is opt-in,
and column-parallelism is scaffolding for free-threaded Python.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:54:25 +00:00
64452dd783 perf: dedup blocking, column-parallel scaffolding, lazy-copy pipelines
Three follow-on wins from the audit, each with shape-pinning tests.

1. Dedup blocking
   - Exact-only strategies (every column EXACT @ 100 — covers strong-
     key dedup like email/phone, the drop-duplicates fallback, and
     explicit "match on this exact column" calls) now route through
     an O(n) groupby fast path. Lossless; no API change required.
     Measured: 10k-row email-exact dedup → 73 ms (was ~30 minutes
     via the O(n²) pair compare).
   - Fuzzy strategies still pair-compare, with opt-in prefix blocking
     via deduplicate(..., blocking_columns=[...], blocking_prefix_len=1).
     Measured: 5k-row fuzzy-name → 25.6s with blocking vs 179s
     without (7x). Trade-off: cross-block matches missed.

2. Column-parallel standardize
   - StandardizeOptions.parallel_columns (default 1) lands a
     ThreadPoolExecutor over the column loop. Output order and
     audit-record order are preserved deterministically via a merge
     step keyed off column_types order. Honest doc: under CPython
     3.12's GIL the win is roughly neutral (phonenumbers/dateutil
     hold the GIL); the API is ready for free-threaded Python 3.13+.

3. Lazy-copy in missing / column_mapper
   - _standardize_sentinels now builds per-column changes in a dict
     and only materialises the output frame when at least one column
     actually changed. On a clean 1 GB file this skips a 1 GB
     allocation.
   - handle_missing carries an out_is_owned flag, copying on demand
     before any mutating step. No-op runs return the input frame.
   - map_columns drops the unconditional upfront df.copy(); rename
     and drop both return fresh frames already, and schema-add /
     coerce trigger _ensure_owned() lazily.
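
A minimal sketch of the two dedup paths from item 1, using only the standard library. The names `exact_dedup` and `fuzzy_pairs` are hypothetical, and `difflib.SequenceMatcher` stands in for whatever similarity scorer the project uses. This is not the project's `deduplicate()` implementation.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def exact_dedup(rows, key):
    """Keep the first row per exact key value in a single O(n) pass."""
    first_seen = {}
    for row in rows:
        first_seen.setdefault(row[key], row)
    return list(first_seen.values())


def fuzzy_pairs(rows, column, threshold=0.85, blocking_prefix_len=0):
    """Return sorted index pairs whose `column` values look similar.

    Rows are only compared inside their block. A prefix length of 0 puts
    every row in one block (full pair compare); a positive length blocks
    on the lowercase prefix, so cross-block matches are missed by design.
    """
    blocks = defaultdict(list)
    for index, row in enumerate(rows):
        blocks[row[column].lower()[:blocking_prefix_len]].append(index)

    pairs = []
    for members in blocks.values():
        for i, j in combinations(members, 2):
            a = rows[i][column].lower()
            b = rows[j][column].lower()
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                pairs.append((i, j))
    return sorted(pairs)


people = [
    {"name": "Jon Smith"},
    {"name": "John Smith"},
    {"name": "Kathy Lee"},
    {"name": "Cathy Lee"},
]
```

With `blocking_prefix_len=1`, "Kathy Lee" and "Cathy Lee" land in different blocks and are never compared, which is the trade-off the commit message calls out.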

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:54:25 +00:00
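
The deterministic merge described in item 2 of the commit above can be sketched as follows. The function `standardize_columns`, the dict-of-lists table, and the `handlers` mapping are assumptions for illustration; only the `parallel_columns` name and the idea of keying the merge off `column_types` order come from the commit message.

```python
from concurrent.futures import ThreadPoolExecutor


def standardize_columns(table, column_types, handlers, parallel_columns=1):
    """Run one handler per column; results follow column_types order."""

    def work(name):
        handler = handlers[column_types[name]]
        new_values = [handler(value) for value in table[name]]
        changed = sum(1 for old, new in zip(table[name], new_values) if old != new)
        return name, new_values, changed

    with ThreadPoolExecutor(max_workers=parallel_columns) as pool:
        futures = {name: pool.submit(work, name) for name in column_types}
        # Merge step: read results in column_types order, not completion order.
        results = [futures[name].result() for name in column_types]

    out = dict(table)  # untouched columns carry over in their original order
    audit = []
    for name, new_values, changed in results:
        out[name] = new_values
        audit.append((name, changed))
    return out, audit
```

Because the merge iterates `column_types` rather than waiting on whichever future finishes first, the audit list is identical for any worker count.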
259 changed files with 31685 additions and 3804 deletions

@@ -1,8 +1,17 @@
name: Build installers
# Triggers:
# * Tag push (v*) → produces installers, attaches to a GitHub Release.
# * Manual dispatch → produces installers as workflow artifacts only.
# * Tag push (v*) → produces installers, attaches them to a GitHub Release.
# * Manual dispatch → uploads the installers as workflow artifacts only.
#
# Outputs per platform (downloadable by buyers):
# * macOS: .dmg installer
# * Windows: .exe installer
# * Linux: .AppImage (already portable; no separate installer step)
#
# Self-contained: every artifact ships its own Python interpreter + every
# runtime dep (including bundled Tesseract OCR) through PyInstaller. No
# pre/post install steps on the buyer's machine.
#
# What this workflow doesn't do (yet):
# * Code signing (Mac Developer ID, Windows code-signing cert).
@@ -29,12 +38,15 @@ jobs:
matrix:
include:
- os: macos-latest
platform: mac
artifact_name: DataTools-mac.dmg
artifact_path: dist/DataTools-*-mac.dmg
- os: windows-latest
platform: win
artifact_name: DataTools-win.exe
artifact_path: dist/DataTools-*-win-setup.exe
- os: ubuntu-latest
platform: linux
artifact_name: DataTools-linux.AppImage
artifact_path: dist/DataTools-*-linux-x86_64.AppImage
runs-on: ${{ matrix.os }}
@@ -50,7 +62,31 @@ jobs:
run: |
pip install --upgrade pip
pip install -r requirements.txt
pip install pyinstaller
pip install pyinstaller pillow
# ---- Tesseract bundling cache --------------------------------
# The fetch logic inside build/tesseract.py downloads:
# * build/vendor/tessdata/eng.traineddata (~16 MB, shared)
# * build/_tesseract/<platform>/ (binary + libs, 30-120 MB)
# Cache both so iterative CI runs don't re-download. The
# cache key bakes in the pinned Tesseract version + tessdata
# URL so a version bump invalidates automatically.
- name: Cache Tesseract bundle inputs
uses: actions/cache@v4
with:
path: |
build/_tesseract
build/vendor/tessdata
key: tesseract-${{ runner.os }}-5.5.0-tessdata_best-v1
# ---- Linux: install patchelf so tesseract.py can rewrite
# RPATH on the bundled tesseract binary. apt-get install
# tesseract-ocr is handled inside tesseract.py itself. --------
- name: Install Linux build prereqs for Tesseract bundling
if: matrix.os == 'ubuntu-latest'
run: |
sudo apt-get update
sudo apt-get install -y patchelf
- name: Read version
id: version
@@ -59,12 +95,109 @@ jobs:
VER=$(python -c "import re; print(re.search(r'__version__\s*=\s*\"([^\"]+)\"', open('src/__init__.py').read()).group(1))")
echo "version=$VER" >> "$GITHUB_OUTPUT"
- name: Generate platform icons
run: python build/generate_icons.py
# Stage Tesseract before PyInstaller. The tesseract.py helpers
# handle the per-platform fetch (UB-Mannheim on Win, brew on
# Mac, apt on Linux) and stage the binary + libs into
# build/_tesseract/<platform>/ where the spec picks them up.
# We invoke a tiny inline Python so the workflow doesn't have
# to know the per-platform target string.
- name: Stage Tesseract binary + tessdata
shell: bash
env:
DATATOOLS_PLATFORM: ${{ matrix.platform }}
run: |
python - <<'PY'
import os, sys
sys.path.insert(0, "build")
from tesseract import fetch_tessdata, fetch_tesseract_for_platform
target = os.environ["DATATOOLS_PLATFORM"]
fetch_tessdata()
fetch_tesseract_for_platform(target)
PY
- name: Build PyInstaller bundle
shell: bash
env:
# The spec reads this to find the per-platform staging dir;
# see build/datatools.spec for the contract.
DATATOOLS_TESS_STAGING: build/_tesseract/${{ matrix.platform }}
run: pyinstaller build/datatools.spec --clean --noconfirm
# ---- Per-platform packaging ----------------------------------
# ---- macOS code signing + notarization (before DMG packaging) -
# Signs dist/DataTools.app with the Developer ID, notarizes it,
# and staples the ticket so Gatekeeper passes offline. Wrapped in
# a guard: if the cert secret is absent the step prints a warning
# and exits 0, so dry-run dispatches still produce an (unsigned)
# build. Secret names match build/README.md "Signing".
- name: Sign & notarize macOS app
if: matrix.os == 'macos-latest'
env:
CERT_P12_BASE64: ${{ secrets.MACOS_DEVELOPER_ID_CERT_P12_BASE64 }}
CERT_PASSWORD: ${{ secrets.MACOS_DEVELOPER_ID_CERT_PASSWORD }}
NOTARY_APPLE_ID: ${{ secrets.MACOS_NOTARY_APPLE_ID }}
NOTARY_TEAM_ID: ${{ secrets.MACOS_NOTARY_TEAM_ID }}
NOTARY_PASSWORD: ${{ secrets.MACOS_NOTARY_PASSWORD }}
run: |
set -euo pipefail
if [ -z "${CERT_P12_BASE64:-}" ]; then
echo "::warning::MACOS_DEVELOPER_ID_CERT_P12_BASE64 not set — shipping an UNSIGNED build (Gatekeeper will warn buyers)."
exit 0
fi
- name: Package macOS DMG
APP="dist/DataTools.app"
# 1. Import the Developer ID cert into an ephemeral keychain.
KEYCHAIN="$RUNNER_TEMP/build.keychain-db"
KEYCHAIN_PW="$(uuidgen)"
security create-keychain -p "$KEYCHAIN_PW" "$KEYCHAIN"
security set-keychain-settings -lut 3600 "$KEYCHAIN"
security unlock-keychain -p "$KEYCHAIN_PW" "$KEYCHAIN"
echo "$CERT_P12_BASE64" | base64 --decode > "$RUNNER_TEMP/cert.p12"
security import "$RUNNER_TEMP/cert.p12" -k "$KEYCHAIN" -P "$CERT_PASSWORD" \
-T /usr/bin/codesign
security set-key-partition-list -S apple-tool:,apple: -s -k "$KEYCHAIN_PW" "$KEYCHAIN" >/dev/null
# Make the ephemeral keychain searchable (preserve the login keychain).
security list-keychains -d user -s "$KEYCHAIN" \
$(security list-keychains -d user | sed 's/"//g')
IDENTITY="$(security find-identity -v -p codesigning "$KEYCHAIN" \
| grep 'Developer ID Application' | head -1 | awk -F'"' '{print $2}')"
if [ -z "$IDENTITY" ]; then
echo "::error::No 'Developer ID Application' identity found in the imported cert."
exit 1
fi
echo "Signing with: $IDENTITY"
# 2. Sign the bundle (hardened runtime + secure timestamp + entitlements).
# --deep signs the nested dylibs/.so the PyInstaller bundle carries.
codesign --deep --force --options runtime --timestamp \
--entitlements build/macos/entitlements.plist \
--sign "$IDENTITY" "$APP"
codesign --verify --strict --verbose=2 "$APP"
# 3. Notarize the .app (notarytool needs a zip/dmg/pkg, not a bare .app),
# then staple so Gatekeeper validates offline.
if [ -n "${NOTARY_APPLE_ID:-}" ]; then
ditto -c -k --keepParent "$APP" "$RUNNER_TEMP/DataTools.zip"
xcrun notarytool submit "$RUNNER_TEMP/DataTools.zip" \
--apple-id "$NOTARY_APPLE_ID" \
--team-id "$NOTARY_TEAM_ID" \
--password "$NOTARY_PASSWORD" \
--wait
xcrun stapler staple "$APP"
xcrun stapler validate "$APP"
else
echo "::warning::Notary credentials not set — app is signed but NOT notarized (Gatekeeper will still warn)."
fi
rm -f "$RUNNER_TEMP/cert.p12"
# ---- Per-platform installer packaging ------------------------
- name: Package macOS DMG (installer)
if: matrix.os == 'macos-latest'
run: bash build/macos/build_dmg.sh "${{ steps.version.outputs.version }}"
@@ -92,7 +225,7 @@ jobs:
# ---- Upload + release ----------------------------------------
- name: Upload artifact
- name: Upload installer artifact
uses: actions/upload-artifact@v4
with:
name: ${{ matrix.artifact_name }}
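
The "Read version" step in the workflow above extracts `__version__` with an inline regex. A standalone sketch of the same pattern is shown here; the helper names `read_version` and `github_output_line` are hypothetical. Unlike the one-liner, which would fail with an `AttributeError` on a missing match, this version raises an explicit error.

```python
import re

# Same pattern as the workflow one-liner: matches double-quoted values only.
VERSION_RE = re.compile(r'__version__\s*=\s*"([^"]+)"')


def read_version(source_text):
    """Return the version string, or raise a clear error if it is missing."""
    match = VERSION_RE.search(source_text)
    if match is None:
        raise ValueError("no double-quoted __version__ assignment found")
    return match.group(1)


def github_output_line(version):
    """Format the key=value line the step appends to $GITHUB_OUTPUT."""
    return f"version={version}"
```

Note that a single-quoted assignment such as `__version__ = '1.0'` would not match this pattern, so the source file has to keep using double quotes.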

13
.gitignore vendored
@@ -11,6 +11,19 @@ dist/
build/build/
build/__pycache__/
build/dist/
# Generated by build/generate_icons.py from src/gui/assets/datatools_icon_256.png.
# Build artifacts, not source — regenerated each CI run.
build/icon.ico
build/icon.icns
build/icon.png
# Tesseract bundling — fetched at build time, not committed. See
# build/vendor/README.md for the canonical URLs and rationale.
# - build/_tesseract/ : per-platform binary + DLLs/dylibs staging dir
# - build/vendor/tessdata/eng.traineddata : ~16 MB language data
build/_tesseract/
build/vendor/tessdata/*.traineddata
.pytest_cache/
# Claude Code agent worktrees + local settings

@@ -1,5 +1,8 @@
[client]
toolbarMode = "minimal"
# ``viewer`` is the most aggressive — hides Streamlit's running
# indicator, deploy button, and status icons. Keeps the main content
# area's top-right corner clean.
toolbarMode = "viewer"
[browser]
gatherUsageStats = false
@@ -9,3 +12,17 @@ gatherUsageStats = false
# reads "Limit 1024MB per file" — matches the analyzer + gate's stated
# 1 GB efficiency target. See docs/REQUIREMENTS.md §1.1.
maxUploadSize = 1024
# Warm, editorial palette inspired by the
# ``datatools_layout_redesign.html`` mockup — cream paper background,
# stone ink, burnt-orange accent. Streamlit reads these on startup and
# threads them through its widget chrome (file uploader, focus rings,
# primary buttons, links). Heavier visual restyling rides on the CSS
# in ``_legacy.py:_DESIGN_TOKENS_CSS``.
[theme]
base = "light"
primaryColor = "#c2410c"
backgroundColor = "#fafaf7"
secondaryBackgroundColor = "#f5f4ef"
textColor = "#1c1917"
font = "sans serif"

33
DECISIONS.md Normal file
@@ -0,0 +1,33 @@
# Product & architecture decisions
A running log of decisions that aren't obvious from the code and would
otherwise be re-litigated. Newest first.
## 2026-06-08 — PDF to CSV and Reconcile stay in the bundle, under a "Finance" group
**Decision:** `10_pdf_extractor` (PDF to CSV) and `11_reconciler` (Reconcile
Two Files) remain part of the DataTools suite. In the sidebar they are
segregated into their own **Finance** section, distinct from the
file-cleaning tools.
**Context / why this needed deciding:**
- Both tools sit outside the documented 9-script cleaning architecture
(TECHNICAL.md / USER-GUIDE.md stop at the orchestrator).
- They occupy the "reconciliation / manual data-entry" territory the
product's honest-positioning note explicitly placed outside a
file-cleaning tool's scope.
- A journey-level UX review flagged that every extra tool in the main
sidebar raises the "which tool do I need?" load for a non-technical
buyer, so tools serving a different job should live in a clearly
different place.
**Resolution:** Keep them in-bundle (they're built, useful, and ship
today) but group them under "Finance" so the cleaning flow stays
uncluttered. Revisit only if a separate finance-focused product emerges.
**Implications:**
- `tools_registry.py`: Reconcile + PDF to CSV carry a `finance` section.
- Sidebar order: Start here → Data Cleaners → Transformations →
Automations → Finance → Coming soon.
- This is the source-of-truth realization of the `layout-review/`
mockups (see `layout-review/shell.js`).
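
A hedged sketch of how the section grouping in the implications above might look. The `Tool` dataclass, `SECTION_ORDER`, and every section key other than `finance` are assumptions; the real `tools_registry.py` is not shown in this diff and may be structured differently.

```python
from dataclasses import dataclass

# Sidebar order from the decision; the key spellings are hypothetical.
SECTION_ORDER = [
    "start_here",
    "data_cleaners",
    "transformations",
    "automations",
    "finance",
    "coming_soon",
]


@dataclass(frozen=True)
class Tool:
    tool_id: str
    label: str
    section: str


def group_for_sidebar(tools):
    """Group tool labels by section, in sidebar order, skipping empty sections."""
    grouped = {section: [] for section in SECTION_ORDER}
    for tool in tools:
        grouped[tool.section].append(tool.label)
    return [(section, labels) for section, labels in grouped.items() if labels]


TOOLS = [
    Tool("11_reconciler", "Reconcile Two Files", "finance"),
    Tool("01_deduplicator", "Find Duplicates", "data_cleaners"),
    Tool("10_pdf_extractor", "PDF to CSV", "finance"),
]
```

Keeping the order in one list means the sidebar sequence is decided in a single place, regardless of the order in which tools are registered.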

220
LICENSE_TESSERACT.txt Normal file
@@ -0,0 +1,220 @@
This license applies to the bundled Tesseract OCR binary distributed
inside DataTools installer artifacts (Windows .exe, macOS .dmg, Linux
.AppImage) and the corresponding portable .zip downloads.
Tesseract OCR upstream: https://github.com/tesseract-ocr/tesseract
Copyright (C) 2006-2024 Google Inc. and the Tesseract OCR contributors
The Tesseract OCR binary is distributed under the Apache License,
Version 2.0, the full text of which is reproduced verbatim below.
The bundled `eng.traineddata` data file is the "best" English model
from https://github.com/tesseract-ocr/tessdata_best and is licensed
under the Apache License, Version 2.0 as well.
DataTools itself is proprietary and is NOT covered by this license;
see LICENSE.txt at the repository root for DataTools' own license.
================================================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for describing the origin of the Work and
reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may accept and charge a
fee for, acceptance of support, warranty, indemnity, or other
liability obligations and/or rights consistent with this License.
However, in accepting such obligations, You may act only on Your
own behalf and on Your sole responsibility, not on behalf of any
other Contributor, and only if You agree to indemnify, defend,
and hold each Contributor harmless for any liability incurred by,
or claims asserted against, such Contributor by reason of your
accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied. See the License for the specific language governing
permissions and limitations under the License.

@@ -8,27 +8,37 @@ Limpieza local de CSV / Excel. CLI + GUI en el navegador, sin nube, sin ceremoni
| # | Tool | Status |
|---|------|--------|
| 01 | **Duplicate remover**: exact + fuzzy matching, 5 normalizers, survivor rules, audit | Ready |
| 02 | **Text cleaner**: whitespace, typographic characters, BOM, line endings, upper/lower case | Ready |
| 03 | **Format standardizer**: dates, phones, emails, addresses, names, currencies, booleans | Ready |
| 04 | **Missing value manager**: disguised-null detection, profile, mean/median/mode/ffill/bfill/interpolation, drop strategies | Ready |
| 05 | **Column mapper**: fuzzy auto-detection of renames, target schema with type coercion, required fields with defaults, drop/reorder | Ready |
| 06 | Outlier detector | Coming soon |
| 07 | Multi-file combiner | Coming soon |
| 08 | Validator and reports | Coming soon |
| 09 | **Pipeline runner**: chains tools in a recommended (not forced) order, saves/loads JSON, automates weekly cleanups | Ready |
| 01 | **Find duplicates**: exact + fuzzy matching, 5 normalizers, survivor rules, audit | Ready |
| 02 | **Clean text**: whitespace, typographic characters, BOM, line endings, upper/lower case | Ready |
| 03 | **Standardize formats**: dates, phones, emails, addresses, names, currencies, booleans | Ready |
| 04 | **Fix missing values**: disguised-null detection, profile, mean/median/mode/ffill/bfill/interpolation, drop strategies | Ready |
| 05 | **Map columns**: fuzzy auto-detection of renames, target schema with type coercion, required fields with defaults, drop/reorder | Ready |
| 06 | Detect outliers | Coming soon |
| 07 | Combine files | Coming soon |
| 08 | Quality check | Coming soon |
| 09 | **Automated workflows**: chains tools in a recommended (not forced) order, saves/loads JSON, automates weekly cleanups | Ready |
Every tool page includes a **Help** popover (to the right of the title) with a compact When to use it / Steps / Examples / Tip guide. The text lives in the language packs (`tools.<id>.help_md`).
## Download (non-technical users)
Pre-built installers (no Python required):
Pre-built packages: no Python install, no administrator rights, no internet at runtime. Each release offers one **installer** per operating system that creates shortcuts on the desktop + Start Menu / Launchpad.
| Platform | Download | First-launch note |
|---|---|---|
| **macOS** | `DataTools-X.Y.Z-mac.dmg` | Drag DataTools.app to /Applications and double-click. |
| **Windows** | `DataTools-X.Y.Z-win-setup.exe` | Run the installer; it launches from the Start Menu. |
| **Linux** | `DataTools-X.Y.Z-linux-x86_64.AppImage` | `chmod +x` the file, then double-click. |
| Platform | Installer |
|---|---|
| **macOS** | `DataTools-X.Y.Z-mac.dmg`: open it, drag DataTools.app to /Applications, run it from Launchpad. |
| **Windows** | `DataTools-X.Y.Z-win-setup.exe`: run the installer (per-user, no admin). Creates a desktop shortcut + Start Menu entry. |
| **Linux** | `DataTools-X.Y.Z-linux-x86_64.AppImage`: `chmod +x` and double-click. The AppImage is already portable. |
Latest version: see [GitHub Releases](https://git.invixiom.com/giteadmin/datatools-dev/releases) (or the Gumroad listing). The installers take up ~150–200 MB; the launcher starts a local server at http://127.0.0.1:8501 and opens your browser. Nothing is sent to the cloud.
Latest version: see [GitHub Releases](https://git.invixiom.com/giteadmin/datatools-dev/releases) (or the Gumroad listing). Each package takes up ~300 MB unpacked; on first launch the app starts a local server at http://127.0.0.1:8501 and opens your default browser. Nothing leaves your machine.
**Tesseract OCR is included.** Scanned-PDF support in the PDF Extractor works with no extra setup on all three platforms; there is no need to install Tesseract separately. License attribution: see [`LICENSE_TESSERACT.txt`](LICENSE_TESSERACT.txt).
**First-launch warnings (one time only):**
- **macOS** unsigned builds: right-click → **Open** → confirm. (Signed builds skip this.)
- **Windows** SmartScreen: click **More info** → **Run anyway**.
Detailed installation and troubleshooting guide: [User Guide §1](docs/USER-GUIDE.es.md#1-instalaci%C3%B3n).
## Install from source (developers)

@@ -8,27 +8,37 @@ Local CSV / Excel cleaning. CLI + browser GUI, no cloud, no install ceremony. GU
| # | Tool | Status |
|---|------|--------|
| 01 | **Deduplicator** — exact + fuzzy match, 5 normalizers, survivor rules, audit | Ready |
| 02 | **Text Cleaner** — whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | **Format Standardizer** — dates, phones, emails, addresses, names, currencies, booleans | Ready |
| 04 | **Missing Value Handler** — disguised-null detection, profile, mean/median/mode/ffill/bfill/interpolate, drop strategies | Ready |
| 05 | **Column Mapper** — fuzzy auto-rename, target schema with type coercion, required fields with defaults, drop/reorder | Ready |
| 06 | Outlier Detector | Coming Soon |
| 07 | Multi-File Merger | Coming Soon |
| 08 | Validator & Reporter | Coming Soon |
| 09 | **Pipeline Runner** — chain tools with recommended (not forced) order, save/load JSON, automate weekly cleanups | Ready |
| 01 | **Find Duplicates** — exact + fuzzy match, 5 normalizers, survivor rules, audit | Ready |
| 02 | **Clean Text** — whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | **Standardize Formats** — dates, phones, emails, addresses, names, currencies, booleans | Ready |
| 04 | **Fix Missing Values** — disguised-null detection, profile, mean/median/mode/ffill/bfill/interpolate, drop strategies | Ready |
| 05 | **Map Columns** — fuzzy auto-rename, target schema with type coercion, required fields with defaults, drop/reorder | Ready |
| 06 | Find Unusual Values | Coming Soon |
| 07 | Combine Files | Coming Soon |
| 08 | Quality Check | Coming Soon |
| 09 | **Automated Workflows** — chain tools with recommended (not forced) order, save/load JSON, automate weekly cleanups | Ready |
Every tool page has an in-tool **Help** popover (right of the title) with a compact When-to-use / Steps / Examples / Tip card. Copy lives in the language packs (`tools.<id>.help_md`).
## Download (non-technical users)
Pre-built installers — no Python required:
Pre-built bundles — no Python install, no admin rights, no internet at runtime. Each release ships an **installer** per OS that wires up Desktop + Start Menu / Launchpad shortcuts.
| Platform | Download | First-launch note |
|---|---|---|
| **macOS** | `DataTools-X.Y.Z-mac.dmg` | Drag DataTools.app into /Applications, then double-click. |
| **Windows** | `DataTools-X.Y.Z-win-setup.exe` | Run the installer; launches from Start Menu. |
| **Linux** | `DataTools-X.Y.Z-linux-x86_64.AppImage` | `chmod +x` the file, then double-click. |
| Platform | Installer |
|---|---|
| **macOS** | `DataTools-X.Y.Z-mac.dmg` — open, drag DataTools.app into /Applications, launch from Launchpad. |
| **Windows** | `DataTools-X.Y.Z-win-setup.exe` — run installer (per-user, no admin). Desktop shortcut + Start Menu entry created. |
| **Linux** | `DataTools-X.Y.Z-linux-x86_64.AppImage`: `chmod +x`, double-click. The AppImage is already portable. |
Latest release: see [GitHub Releases](https://git.invixiom.com/giteadmin/datatools-dev/releases) (or the Gumroad listing). The installers are ~150–200 MB; the launcher boots a local server at http://127.0.0.1:8501 and opens your browser. Nothing is sent to the cloud.
Latest release: see [GitHub Releases](https://git.invixiom.com/giteadmin/datatools-dev/releases) (or the Gumroad listing). Each bundle is ~300 MB unpacked; on first launch the app starts a local server at http://127.0.0.1:8501 and opens your default browser. Nothing leaves your machine.
**Tesseract OCR is bundled.** Scanned-PDF support in the PDF Extractor works out of the box on all three platforms — no separate Tesseract install required. License attribution: see [`LICENSE_TESSERACT.txt`](LICENSE_TESSERACT.txt).
**First-launch warnings (one-time):**
- **macOS** unsigned builds: right-click → **Open** → confirm. (Signed builds skip this.)
- **Windows** SmartScreen: click **More info** → **Run anyway**.
Detailed install + troubleshooting walkthrough: [User Guide §1](docs/USER-GUIDE.md#1-install).
## Install from source (developers)

@@ -19,23 +19,52 @@ build/
│ Mac .app bundle config. Reads the version
│ from src/__init__.py.
├── installer.iss Inno Setup script — Windows .exe installer.
│ Adds Start Menu + Desktop + App Paths entries.
├── generate_icons.py Builds icon.ico / icon.icns / icon.png from
│ src/gui/assets/datatools_icon_256.png. Run
│ once before pyinstaller (CI does this).
├── tesseract.py Fetches the per-platform Tesseract binary +
│ eng.traineddata at build time. CI imports
│ fetch_tessdata + fetch_tesseract_for_platform.
├── macos/
│ └── build_dmg.sh Wraps dist/DataTools.app into a .dmg with a
│ drag-to-/Applications layout.
│ drag-to-/Applications layout (installer).
├── appimage/
│ ├── AppRun Entry point invoked when the AppImage runs.
│ ├── datatools.desktop Linux desktop-entry metadata.
│ └── build.sh Wraps dist/DataTools/ into an .AppImage.
├── hooks/ PyInstaller hooks for libs the static analyser
│ └── hook-streamlit.py misses (Streamlit's dynamic imports).
├── icon.icns macOS app icon (TODO: produce from a 1024×1024
│ PNG. Optional — bundle still builds without).
├── icon.ico Windows app icon (TODO).
├── icon.png Linux AppImage icon (TODO — build.sh generates
│ a placeholder if missing).
├── icon.{ico,icns,png} Generated by generate_icons.py — gitignored.
└── README.md this file
```
## Distribution outputs per platform
Each CI run produces one installer per platform:
| Platform | Installer |
|----------|----------------------------------------|
| macOS | `DataTools-<ver>-mac.dmg` |
| Windows | `DataTools-<ver>-win-setup.exe` |
| Linux | `DataTools-<ver>-linux-x86_64.AppImage` (already portable) |
All three outputs are self-contained: every dependency (Python, pandas,
streamlit, pdfplumber, **Tesseract OCR + `eng.traineddata`**, the lot)
is frozen into the bundle. The buyer does not need to install Python,
pip, Tesseract, or anything else first. With Tesseract bundled, each
artifact is roughly **250–300 MB** on disk (up from ~120 MB pre-OCR);
unpacked installs run ~300–400 MB once scratch space is counted.
## Easy-launch surface
| Affordance | Windows | macOS |
|------------------|--------------------------------------------------|------------------------------------------------------|
| Desktop shortcut | Inno Setup `desktopicon` task (checked default) | The .app bundle in /Applications is the icon |
| App menu | Start Menu → DataTools (always installed) | Launchpad + Spotlight (auto from /Applications) |
| Taskbar / Dock | User pins manually (OS forbids programmatic pin) | User pins manually after first launch |
| Run from terminal| `DataTools` (registered via App Paths) | `open -a DataTools` (auto from .app bundle) |
CI: `.github/workflows/build.yml` runs the full pipeline on tag push
(matrix: macos-latest, windows-latest, ubuntu-latest) and attaches
the resulting installers to a GitHub Release. Manual
@@ -43,21 +72,55 @@ the resulting installers to a GitHub Release. Manual
## Releasing
### CI build (push tag → GitHub Release) — the release process
Releases are built by GitHub Actions (`.github/workflows/build.yml`),
not on a developer's machine. The matrix runs on
macos-latest / windows-latest / ubuntu-latest, stages Tesseract
(`build/tesseract.py`), runs PyInstaller, packages the per-platform
installer, and attaches it to a GitHub Release on tag push:
1. Bump `__version__` in `src/__init__.py`.
2. `git commit -am "release: vX.Y.Z" && git tag vX.Y.Z`.
3. `git push && git push --tags`.
4. CI builds all three platforms and creates a GitHub Release with
the installers attached.
5. Mirror the GitHub Release assets to Gumroad (manual until v2).
4. CI builds all three platforms and creates a Release with the
installers attached.
5. Mirror the Release assets to Gumroad (manual until v2).
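Since steps 1 and 2 rely on the tag and `__version__` agreeing, a small pre-push check can catch a mismatch. The following is a minimal sketch, not part of the repository: `read_version` and `tag_matches_version` are hypothetical helpers, and the sketch assumes `__version__` is assigned as a plain string literal in `src/__init__.py`.

```python
import re
from pathlib import Path

# Assumption: the version is a simple string literal such as
# __version__ = "1.2.3" at the start of a line.
_VERSION_RE = re.compile(r"""^__version__\s*=\s*["']([^"']+)["']""", re.MULTILINE)


def read_version(init_py: Path) -> str:
    """Return the __version__ string found in *init_py*."""
    match = _VERSION_RE.search(init_py.read_text(encoding="utf-8"))
    if match is None:
        raise ValueError(f"no __version__ assignment found in {init_py}")
    return match.group(1)


def tag_matches_version(tag: str, init_py: Path) -> bool:
    """True when *tag* is exactly 'v' followed by the package version."""
    return tag == f"v{read_version(init_py)}"
```

Running `tag_matches_version("vX.Y.Z", Path("src/__init__.py"))` before `git push --tags` would flag a tag that was cut without bumping the version.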
A manual `workflow_dispatch` run does the same build but uploads the
installers as workflow artifacts instead of creating a Release —
useful for smoke-testing a build without cutting a tag.
### Local build (single platform, for testing)
PyInstaller can't cross-compile, so a local build produces only the
current OS's installer. This mirrors what CI does, by hand — use it to
debug the bundle before tagging. See the per-platform recipes below for
the exact commands; the short version is:
```bash
pip install -r requirements.txt
pip install pyinstaller pillow
python build/generate_icons.py
python -c "import sys; sys.path.insert(0,'build'); \
from tesseract import fetch_tessdata, fetch_tesseract_for_platform; \
fetch_tessdata(); fetch_tesseract_for_platform('mac')" # win / mac / linux
pyinstaller build/datatools.spec --clean --noconfirm
# then run the matching packager: build/macos/build_dmg.sh,
# build/installer.iss (iscc), or build/appimage/build.sh
```
## Signing (Phase 2 — needs accounts/credentials)
Both code-signing steps are intentionally not in CI yet because they
require credentials the owner sets up first.
**macOS signing + notarization is now wired into `build.yml`** (the
"Sign & notarize macOS app" step, with `build/macos/entitlements.plist`).
It is guarded: if `MACOS_DEVELOPER_ID_CERT_P12_BASE64` is absent the step
warns and exits 0, so dry-run dispatches still produce an unsigned build.
To activate it, just add the secrets below — no code change needed.
**Windows** code-signing is still not wired (accepted v1 friction).
**macOS** — Apple Developer Program enrollment ($99/yr). Once enrolled,
add these GitHub Secrets and uncomment the `codesign` + `notarytool`
steps in `build.yml`:
add these GitHub Secrets to activate the signing step in `build.yml`:
| Secret | Value |
|---|---|
@@ -223,6 +286,57 @@ Mac code-signing in CI requires the cert + private key as a GitHub
secret (encoded with `base64`). Detailed walkthrough belongs in a
later doc — for v1, sign locally and upload to GitHub Releases.
## Tesseract bundling (PDF Extractor OCR)
Frozen artifacts ship a per-platform Tesseract binary plus the English
`eng.traineddata` model so scanned-PDF support in the PDF Extractor
works out of the box — no separate user install. Source / pip
developer setups still need system Tesseract on `PATH`.
**Layout inside the bundle**:
```
DataTools/ (or DataTools.app/Contents/MacOS/)
└── tesseract/
├── tesseract (Linux/macOS binary; tesseract.exe on Windows)
└── tessdata/
└── eng.traineddata
```
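A quick way to confirm that a finished bundle matches this layout is to check for the two required files. This is a hedged sketch: `missing_tesseract_files` is a hypothetical helper, and the caller is assumed to pass whichever directory contains the `tesseract/` folder on the platform being checked.

```python
from pathlib import Path
from typing import List


def missing_tesseract_files(bundle_root: Path, windows: bool = False) -> List[str]:
    """List the expected OCR files that are absent under *bundle_root*."""
    binary = "tesseract.exe" if windows else "tesseract"
    expected = [
        Path("tesseract") / binary,
        Path("tesseract") / "tessdata" / "eng.traineddata",
    ]
    # Report paths relative to the bundle root, with forward slashes.
    return [p.as_posix() for p in expected if not (bundle_root / p).is_file()]
```

An empty list means both the binary and the language model are in place; anything else names what the build failed to stage.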
The runtime resolver (in `src/`, owned by the runtime team) walks:
1. `DATATOOLS_TESSERACT_BIN` env var override.
2. `Path(sys._MEIPASS) / "tesseract" / "tesseract[.exe]"` — frozen
bundles only.
3. `tesseract` on `PATH`.
4. Windows well-known paths.
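The lookup order above can be sketched as follows. This is an illustration under stated assumptions, not the resolver that lives in `src/`: the function name `resolve_tesseract`, its parameters, and the single entry in `ASSUMED_WINDOWS_PATHS` are hypothetical, since the list of well-known Windows paths is not given here.

```python
import os
import shutil
import sys
from pathlib import Path
from typing import Mapping, Optional

# Assumption: placeholder for the "Windows well-known paths" step; the
# real list is not documented in this README.
ASSUMED_WINDOWS_PATHS = (
    Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe"),
)


def resolve_tesseract(
    env: Optional[Mapping[str, str]] = None,
    meipass: Optional[str] = None,
    is_windows: Optional[bool] = None,
) -> Optional[Path]:
    """Return the first Tesseract binary found in the documented order."""
    env = os.environ if env is None else env
    if is_windows is None:
        is_windows = sys.platform.startswith("win")
    if meipass is None:
        # sys._MEIPASS only exists inside a frozen PyInstaller bundle.
        meipass = getattr(sys, "_MEIPASS", None)

    # 1. Explicit override via environment variable.
    override = env.get("DATATOOLS_TESSERACT_BIN")
    if override and Path(override).is_file():
        return Path(override)

    # 2. Binary shipped inside the frozen bundle.
    if meipass:
        name = "tesseract.exe" if is_windows else "tesseract"
        bundled = Path(meipass) / "tesseract" / name
        if bundled.is_file():
            return bundled

    # 3. System install on PATH.
    on_path = shutil.which("tesseract", path=env.get("PATH"))
    if on_path:
        return Path(on_path)

    # 4. Fallback locations on Windows.
    if is_windows:
        for candidate in ASSUMED_WINDOWS_PATHS:
            if candidate.is_file():
                return candidate
    return None
```

Passing `env`, `meipass`, and `is_windows` explicitly keeps the function testable without a frozen build; calling it with no arguments uses the live process state.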
**Where the bytes come from**:
- **Tessdata** — vendored in-repo at `build/vendor/tessdata/eng.traineddata`
(sourced from [tessdata_best](https://github.com/tesseract-ocr/tessdata_best)).
`datatools.spec` copies it into `tesseract/tessdata/`.
- **Binary** — fetched per-platform at build time by
`build/tesseract.py` from pinned upstream URLs. Current pin:
**Tesseract 5.5.0**. CI imports `fetch_tessdata` +
`fetch_tesseract_for_platform` from this module before PyInstaller.
**Updating Tesseract**:
1. Bump the version pin and the per-platform fetch URLs in
`build/tesseract.py`.
2. If the model schema changed upstream, refresh
`build/vendor/tessdata/eng.traineddata` from `tessdata_best` at the
matching tag.
3. Push a `v*` tag so CI rebuilds all three platforms, then
smoke-test a scanned PDF through the PDF Extractor.
4. Update `LICENSE_TESSERACT.txt` at the repo root if upstream license
terms change (Apache-2.0 today).
License attribution for the bundled binary lives at
`LICENSE_TESSERACT.txt` at the repo root — it must ship alongside any
binary that contains Tesseract.
## Common pitfalls
| Symptom | Fix |
@@ -246,7 +360,7 @@ much state to trust:
4. Double-click the app icon.
5. Browser should open to http://127.0.0.1:850x within 5 seconds.
6. Drop samples/demo/shopify_pet_customers.csv into the
Pipeline Runner page; click Run; AFTER preview should appear.
Automated Workflows page; click Run; AFTER preview should appear.
7. Confirm in the network tab: zero outbound calls except to
127.0.0.1 and the Streamlit static asset paths (also local).
```
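Step 5 of the checklist can be partly automated by polling for the local server. The sketch below is an assumption-laden illustration: `wait_for_local_port` is a hypothetical helper, and reading "850x" as the port range 8500 to 8509 is an interpretation, not something the checklist states.

```python
import socket
import time
from typing import Iterable, Optional


def wait_for_local_port(
    ports: Iterable[int],
    timeout: float = 5.0,
    host: str = "127.0.0.1",
) -> Optional[int]:
    """Return the first port in *ports* accepting TCP connections, or None."""
    candidates = list(ports)
    deadline = time.monotonic() + timeout
    while True:
        for port in candidates:
            try:
                # A successful connect means something is listening.
                with socket.create_connection((host, port), timeout=0.25):
                    return port
            except OSError:
                continue
        if time.monotonic() >= deadline:
            return None
        time.sleep(0.1)
```

For the checklist above, `wait_for_local_port(range(8500, 8510))` returning `None` after five seconds would indicate the launcher did not start the server in time. It does not verify step 7; outbound traffic still needs the manual network-tab check.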


@@ -9,6 +9,11 @@
# latest release from https://github.com/AppImage/AppImageKit/releases).
#
# Output: dist/DataTools-<version>-linux-x86_64.AppImage
#
# Tesseract bundling: no-op here. The PyInstaller bundle in
# dist/DataTools/ already contains tesseract/{tesseract, *.so,
# tessdata/eng.traineddata} from the spec's datas; ``cp -R``
# below carries it along into the AppDir.
set -euo pipefail


@@ -24,6 +24,7 @@
# -*- mode: python ; coding: utf-8 -*-
import os
from pathlib import Path
from PyInstaller.utils.hooks import (
collect_all,
@@ -58,6 +59,15 @@ hidden_imports += collect_submodules("charset_normalizer")
hidden_imports += collect_submodules("openpyxl")
hidden_imports += collect_submodules("loguru")
# PDF Extractor stack. ``pypdfium2`` has its own PyInstaller hook
# under ``build/hooks/`` that pulls in the native PDFium binary —
# keep the ``collect_submodules`` calls here for belt-and-braces.
hidden_imports += collect_submodules("pdfplumber")
hidden_imports += collect_submodules("pdfminer")
hidden_imports += collect_submodules("pypdfium2")
hidden_imports += collect_submodules("PIL")
hidden_imports += collect_submodules("pytesseract")
# Our own engine + GUI modules. Even though we import them directly
# at the top of ``launcher.py`` / ``app.py``, the Streamlit
# session-state and per-page page discovery layers re-import via
@@ -77,6 +87,14 @@ datas += collect_data_files("streamlit", include_py_files=False)
# phonenumbers ships its country/area-code metadata as resources.
datas += collect_data_files("phonenumbers", include_py_files=False)
# PDF Extractor data files. ``pypdfium2`` ships a native PDFium
# shared library (``.dll`` / ``.so`` / ``.dylib``) under its package
# dir; ``pdfminer`` ships the Adobe CMap tables it uses for
# character mapping. The drawable-canvas frontend bundle is gone
# now that the visual picker was removed.
datas += collect_data_files("pypdfium2", include_py_files=False)
datas += collect_data_files("pdfminer", include_py_files=False)
# Our application files. PyInstaller's bundler treats source as code
# (.pyc) by default; we add it again as data so the launcher's
# ``Path(sys._MEIPASS) / "src" / "gui" / "app.py"`` resolution works.
@@ -86,6 +104,78 @@ datas += [
(str(REPO / ".streamlit" / "config.toml"),".streamlit"),
]
# ----- Tesseract OCR bundle ----------------------------------------
# ``build/tesseract.py`` stages the per-platform Tesseract binary
# + its runtime libs (DLLs/dylibs/sos) into
# ``build/_tesseract/<target>/`` and the shared eng.traineddata into
# ``build/vendor/tessdata/``. We add both to ``datas`` so PyInstaller
# drops them at the path the runtime expects:
#
# <bundle>/tesseract/tesseract[.exe]
# <bundle>/tesseract/<all dll/dylib/so deps>
# <bundle>/tesseract/tessdata/eng.traineddata
#
# The runtime discovery code in src/pdf_extract.py reads this layout
# from ``Path(sys._MEIPASS) / "tesseract" / ...``. Keep the two ends
# in sync — if you rename "tesseract" here, update pdf_extract.py too.
#
# CI (.github/workflows/build.yml) sets DATATOOLS_TESS_STAGING to the
# right per-platform dir before invoking PyInstaller. For ad-hoc
# `pyinstaller build/datatools.spec` runs without that env var, fall
# back to the canonical staging path.
_tess_staging_env = os.environ.get("DATATOOLS_TESS_STAGING")
if _tess_staging_env:
    _tess_staging = Path(_tess_staging_env)
else:
    # Pick the obvious per-host staging dir as a fallback so spec-only
    # builds (without the CI env var) still work in dev.
    import sys as _sys_for_target
    _target_guess = (
        "win" if _sys_for_target.platform.startswith("win")
        else "mac" if _sys_for_target.platform == "darwin"
        else "linux"
    )
    _tess_staging = REPO / "build" / "_tesseract" / _target_guess
_tessdata = REPO / "build" / "vendor" / "tessdata"
if _tess_staging.is_dir() and any(_tess_staging.iterdir()):
    # Drop every file in the staging dir directly under
    # ``<bundle>/tesseract/`` (binary + DLL/dylib/so siblings).
    datas += [(str(_tess_staging), "tesseract")]
else:
    # Don't hard-fail spec parse: useful for first-time devs running
    # PyInstaller before fetching binaries. Surface a loud warning
    # though, since the OCR feature will silently fail at runtime.
    print(
        f"WARNING: {_tess_staging} is empty or missing; OCR will be "
        "disabled in the bundle. Run build/tesseract.py's "
        "fetch_tesseract_for_platform before pyinstaller, or "
        "pre-stage the binary manually."
    )
if (_tessdata / "eng.traineddata").exists():
    datas += [(str(_tessdata), "tesseract/tessdata")]
else:
    print(
        f"WARNING: {_tessdata}/eng.traineddata is missing; OCR will "
        "have no language data at runtime. Run build/tesseract.py's "
        "fetch_tessdata or fetch manually per build/vendor/README.md."
    )
# Bundle the Apache-2.0 LICENSE text alongside the binary. The docs
# agent maintains LICENSE_TESSERACT.txt at the repo root; PyInstaller
# drops it at the bundle root next to DataTools[.exe].
_tess_license = REPO / "LICENSE_TESSERACT.txt"
if _tess_license.exists():
    datas += [(str(_tess_license), ".")]
else:
    print(
        "WARNING: LICENSE_TESSERACT.txt missing at repo root. Required "
        "by Apache-2.0 for redistribution; the docs agent should "
        "create it. Continuing without it for now."
    )
# ----- Analysis ------------------------------------------------------
a = Analysis(
@@ -141,6 +231,13 @@ coll = COLLECT(
# macOS .app bundle wrapper. PyInstaller produces it only on Mac;
# this block is a no-op on Win/Linux.
#
# Tesseract bundling note: ``BUNDLE(coll, ...)`` carries the entire
# COLLECT output (binaries + datas) into the .app's
# Contents/Resources tree, so the ``tesseract/`` subdir we built up
# in ``datas`` lands at ``DataTools.app/Contents/Resources/tesseract/``
# and the runtime ``sys._MEIPASS`` resolves there. No extra plumbing
# needed.
import sys as _sys
if _sys.platform == "darwin":
app = BUNDLE(

build/generate_icons.py (new file, 78 lines)

@@ -0,0 +1,78 @@
"""Generate platform-specific app icons from the source PNG asset.
Outputs:
build/icon.ico Windows multi-resolution icon (16..256 px sizes).
build/icon.icns macOS icon bundle (16..1024 px scaled tiers).
build/icon.png Plain 256x256 PNG used by the Linux AppImage.
Source: ``src/gui/assets/datatools_icon_256.png`` (the same icon
``st.set_page_config`` uses, so the installer / Dock / Taskbar match
the in-app tab favicon).
Run manually:
python build/generate_icons.py
CI runs this automatically before invoking PyInstaller (see
``.github/workflows/build.yml``). All three output files are gitignored:
they are build artifacts derived from the committed PNG.
Self-contained: pulls only Pillow (already a transitive dep of
``pdfplumber``) so no extra installs are required.
"""
from __future__ import annotations
import sys
from pathlib import Path
from PIL import Image
# Repo layout: this script lives at <REPO>/build/. The source PNG is at
# <REPO>/src/gui/assets/datatools_icon_256.png.
BUILD_DIR = Path(__file__).resolve().parent
REPO = BUILD_DIR.parent
SOURCE_PNG = REPO / "src" / "gui" / "assets" / "datatools_icon_256.png"
# Windows ICO needs every size the OS might render at: taskbar (16/24),
# Start Menu (32/48), tile (64/128), shell properties dialog (256).
ICO_SIZES = [(16, 16), (24, 24), (32, 32), (48, 48), (64, 64),
(128, 128), (256, 256)]
def main() -> int:
    if not SOURCE_PNG.exists():
        sys.stderr.write(
            f"Source icon not found at {SOURCE_PNG}.\n"
            "Add a 256x256 (or larger) RGBA PNG there and re-run.\n"
        )
        return 1
    src = Image.open(SOURCE_PNG).convert("RGBA")
    if src.size[0] < 256 or src.size[1] < 256:
        sys.stderr.write(
            f"Source icon is {src.size}; recommend 256x256 or larger "
            "so downscaled tiers look crisp.\n"
        )
    ico_path = BUILD_DIR / "icon.ico"
    src.save(ico_path, format="ICO", sizes=ICO_SIZES)
    print(f"wrote {ico_path} ({ico_path.stat().st_size:,} bytes)")
    icns_path = BUILD_DIR / "icon.icns"
    # Pillow's ICNS writer derives the per-tier sizes from the source
    # image; passing a 256x256 source yields ic07..ic12 entries which
    # cover Finder, Dock, and the Get Info panel.
    src.save(icns_path, format="ICNS")
    print(f"wrote {icns_path} ({icns_path.stat().st_size:,} bytes)")
    # AppImage uses a plain PNG for its desktop entry. Copy the source
    # so the AppImage build script doesn't have to know the asset path.
    png_path = BUILD_DIR / "icon.png"
    src.save(png_path, format="PNG")
    print(f"wrote {png_path} ({png_path.stat().st_size:,} bytes)")
    return 0


if __name__ == "__main__":
    sys.exit(main())


@@ -0,0 +1,31 @@
"""PyInstaller hook for pypdfium2.
``pypdfium2`` ships the native PDFium shared library as a data file
inside its package directory (``pdfium``-prefixed ``.dll`` on
Windows, ``.so`` on Linux, ``.dylib`` on macOS). PyInstaller's
default discovery picks up Python ``.py``/``.pyc`` but can miss
the binary if the package is wheel-installed and the shared lib
isn't on the ``__init__``'s module-level path it scans.
This hook is belt-and-braces — the main spec already calls
``collect_data_files("pypdfium2")`` and ``collect_submodules``,
but PyInstaller's hook-discovery-by-name is the documented
escape hatch for native-bundled libraries. Without this, the
visual picker (which renders PDF pages via
``pypdfium2.PdfDocument(...).render(...)``) silently fails on
installed builds with a ``FileNotFoundError`` for the PDFium
shared library.
"""
from PyInstaller.utils.hooks import (
collect_all,
collect_data_files,
collect_dynamic_libs,
)
datas, binaries, hiddenimports = collect_all("pypdfium2")
# Make absolutely sure the bundled PDFium .dll/.so/.dylib is
# carried over — PyInstaller treats it as a dynamic lib, not data.
binaries += collect_dynamic_libs("pypdfium2")
# And its raw data files (the type stubs + metadata file).
datas += collect_data_files("pypdfium2", include_py_files=False)


@@ -1,11 +1,26 @@
; Inno Setup script for DataTools — Windows installer.
;
; Compile from the repo root:
; iscc /DAppVersion=1.0.0 build\installer.iss
; iscc /DAppVersion=3.0 build\installer.iss
;
; CI passes the version via /DAppVersion to keep src/__init__.py the
; single source of truth. Local manual builds: pass /DAppVersion or
; let the default kick in.
;
; What this installer wires up (covers the "easy launch" surface):
; * Start Menu group: Start → DataTools → DataTools / Uninstall
; * Desktop shortcut: optional, checked by default during install
; * Quick Launch: optional, off by default (legacy Win 7 + power
; users who keep the bar enabled). Windows 10/11
; users pin to taskbar manually via right-click —
; OS security policy forbids programmatic pinning.
; * App Paths entry: so ``DataTools`` typed into Win+R / cmd works.
;
; Self-contained: the installer contains a frozen PyInstaller bundle
; (Python + every runtime dep). No pre-install or post-install steps
; on the buyer's machine. UAC is NOT required because we install
; per-user by default; the prompt only fires if the buyer asks for an
; all-users install.
#ifndef AppVersion
#define AppVersion "0.0.0-dev"
@@ -18,11 +33,15 @@ AppVersion={#AppVersion}
AppVerName=DataTools {#AppVersion}
AppPublisher=DataTools
AppPublisherURL=https://datatools.app
AppSupportURL=https://datatools.app/support
AppUpdatesURL=https://datatools.app/releases
DefaultDirName={autopf}\DataTools
DefaultGroupName=DataTools
DisableProgramGroupPage=yes
OutputDir=..\dist
OutputBaseFilename=DataTools-{#AppVersion}-win-setup
SetupIconFile=icon.ico
UninstallDisplayIcon={app}\DataTools.exe
Compression=lzma2/max
SolidCompression=yes
WizardStyle=modern
@@ -30,20 +49,45 @@ ArchitecturesInstallIn64BitMode=x64
PrivilegesRequired=lowest
PrivilegesRequiredOverridesAllowed=dialog
; Allow per-user install (no UAC prompt) when admin isn't available.
; Buyers without admin rights can still install without IT involvement.
ChangesAssociations=no
CloseApplications=force
RestartApplications=no
[Languages]
Name: "english"; MessagesFile: "compiler:Default.isl"
[Tasks]
Name: "desktopicon"; Description: "Create a &desktop shortcut"; GroupDescription: "Additional shortcuts:"
Name: "quicklaunchicon"; Description: "Create a &Quick Launch shortcut"; GroupDescription: "Additional shortcuts:"; Flags: unchecked; OnlyBelowVersion: 6.1
[Files]
; PyInstaller's dist/DataTools/ tree includes:
; * DataTools.exe + frozen Python runtime
; * tesseract/tesseract.exe + DLLs + tessdata/eng.traineddata
; (bundled via build/datatools.spec datas; runtime discovery in
; src/pdf_extract.py reads sys._MEIPASS / "tesseract" / ...).
; * LICENSE_TESSERACT.txt at the bundle root (Apache-2.0).
; The recursesubdirs flag below picks all of those up — no separate
; Files: entry needed for tesseract/.
Source: "..\dist\DataTools\*"; DestDir: "{app}"; Flags: recursesubdirs ignoreversion
[Icons]
Name: "{group}\DataTools"; Filename: "{app}\DataTools.exe"
; Start Menu entries — created unconditionally so the app is always
; discoverable via Start search.
Name: "{group}\DataTools"; Filename: "{app}\DataTools.exe"; IconFilename: "{app}\DataTools.exe"
Name: "{group}\Uninstall DataTools"; Filename: "{uninstallexe}"
Name: "{autodesktop}\DataTools"; Filename: "{app}\DataTools.exe"; Tasks: desktopicon
; Desktop shortcut: optional via the Tasks page (checked by default).
Name: "{autodesktop}\DataTools"; Filename: "{app}\DataTools.exe"; IconFilename: "{app}\DataTools.exe"; Tasks: desktopicon
; Quick Launch (legacy): the task is only offered on Windows versions
; older than 6.1 (see OnlyBelowVersion in [Tasks]).
Name: "{userappdata}\Microsoft\Internet Explorer\Quick Launch\DataTools"; Filename: "{app}\DataTools.exe"; IconFilename: "{app}\DataTools.exe"; Tasks: quicklaunchicon
[Registry]
; App Paths — lets the buyer launch from Win+R or cmd with just
; "DataTools" instead of a full path. Per-user hive so the per-user
; install path doesn't need admin to register.
Root: HKCU; Subkey: "Software\Microsoft\Windows\CurrentVersion\App Paths\DataTools.exe"; ValueType: string; ValueName: ""; ValueData: "{app}\DataTools.exe"; Flags: uninsdeletekey
[Run]
Filename: "{app}\DataTools.exe"; Description: "Launch DataTools"; Flags: nowait postinstall skipifsilent


@@ -10,6 +10,11 @@
#
# Code signing + notarization happen separately (see build/README.md
# "Signing"). This script only handles the packaging step.
#
# Tesseract bundling: no-op here. The .app already contains
# Contents/Resources/tesseract/{tesseract, *.dylib, tessdata/} thanks
# to PyInstaller's BUNDLE() carrying the spec's datas through. This
# script just wraps the finished .app — no extra steps for OCR.
set -euo pipefail


@@ -0,0 +1,28 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<!--
Hardened-runtime entitlements for the notarized DataTools.app.
PyInstaller freezes a CPython interpreter that maps writable+executable
memory and loads many unsigned .so/.dylib modules at runtime. Without
these entitlements the hardened runtime kills the process on launch
(or notarization rejects the bundle). Keep this list minimal — the app
is a local-only Streamlit server, so no network-server/device/camera
entitlements are needed.
-->
<plist version="1.0">
<dict>
<!-- CPython JIT-style writable/executable memory + ctypes trampolines -->
<key>com.apple.security.cs.allow-jit</key>
<true/>
<key>com.apple.security.cs.allow-unsigned-executable-memory</key>
<true/>
<!-- Load the bundled C-extension .so / .dylib modules (pandas, pdfplumber,
Pillow, the bundled Tesseract dylibs) that aren't Team-ID signed -->
<key>com.apple.security.cs.disable-library-validation</key>
<true/>
<!-- Launcher sets DATATOOLS_*/TESSDATA_PREFIX/PYTHON* before exec -->
<key>com.apple.security.cs.allow-dyld-environment-variables</key>
<true/>
</dict>
</plist>

build/tesseract.py (new file, 453 lines)

@@ -0,0 +1,453 @@
"""Tesseract bundling helpers for the release build.
PDF Extractor OCR ships a per-platform Tesseract binary plus the English
``eng.traineddata`` model inside the frozen PyInstaller bundle so scanned
PDFs work without a separate user install. These helpers fetch the binary
and tessdata at build time; the GitHub Actions workflow
(``.github/workflows/build.yml``) imports ``fetch_tessdata`` and
``fetch_tesseract_for_platform`` and runs them before PyInstaller.
Everything is staged under ``build/_tesseract/<platform>/`` (gitignored).
The PyInstaller spec (``build/datatools.spec``) reads that staging dir plus
``build/vendor/tessdata/`` and bundles them under ``<bundle>/tesseract/``,
where the runtime discovery code in ``src/pdf_extract.py`` expects:
Path(sys._MEIPASS) / "tesseract" / "tesseract[.exe]"
Path(sys._MEIPASS) / "tesseract" / "tessdata" / "eng.traineddata"
"""
from __future__ import annotations
import os
import shutil
import subprocess
import sys
import urllib.request
from pathlib import Path
REPO = Path(__file__).resolve().parent.parent
BUILD = REPO / "build"
# Tesseract bundling. The runtime discovery code in
# ``src/pdf_extract.py`` looks for the binary at
# ``Path(sys._MEIPASS) / "tesseract" / "tesseract[.exe]"`` and tessdata
# at ``... / "tesseract" / "tessdata" / "eng.traineddata"``. We stage
# everything under ``build/_tesseract/<platform>/`` (gitignored) and
# the PyInstaller spec adds that staging dir to ``datas=`` so it lands
# at the right place inside the frozen bundle.
TESSERACT_VERSION = "5.5.0"
TESSDATA_DIR = BUILD / "vendor" / "tessdata"
TESSDATA_URL = (
"https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata"
)
TESSERACT_STAGING = BUILD / "_tesseract"
# ---------------------------------------------------------------------------
# Output helpers — colourless so logs stay readable in any terminal/CI tail.
# ---------------------------------------------------------------------------
def _step(msg: str) -> None:
    print(f"\n==> {msg}", flush=True)


def _ok(msg: str) -> None:
    print(f" ok: {msg}", flush=True)


def _warn(msg: str) -> None:
    print(f" warn: {msg}", flush=True)


def _err(msg: str) -> None:
    print(f" ERROR: {msg}", file=sys.stderr, flush=True)


def _run(cmd: list[str], cwd: Path | None = None, env: dict | None = None) -> None:
    """Run *cmd*, stream output, exit on failure with a useful banner."""
    printable = " ".join(map(str, cmd))
    print(f" $ {printable}", flush=True)
    try:
        subprocess.run(cmd, check=True, cwd=cwd or REPO, env=env)
    except subprocess.CalledProcessError as e:
        _err(f"command failed (exit {e.returncode}): {printable}")
        sys.exit(e.returncode)
    except FileNotFoundError:
        _err(f"command not found: {cmd[0]}")
        sys.exit(127)
# ---------------------------------------------------------------------------
# Tesseract bundling — fetch the binary + tessdata at build time.
#
# We download (not vendor) because:
# * Binaries are large (5-40 MB per platform) and license-encumbered
# to keep current in git.
# * tessdata is Apache-2.0 and ~16 MB — fine to redistribute but
# bloats clones for contributors who don't touch OCR.
#
# Caching layout:
# build/_tesseract/win/tesseract.exe + DLLs
# build/_tesseract/mac/tesseract + dylibs
# build/_tesseract/linux/tesseract + libs
# build/vendor/tessdata/eng.traineddata (shared across platforms)
#
# The PyInstaller spec reads ``build/_tesseract/<platform>/`` and the
# tessdata dir, then bundles them under ``<bundle>/tesseract/``.
# ---------------------------------------------------------------------------
def _download(url: str, dest: Path, *, expected_min_bytes: int = 1024) -> None:
    """Download *url* to *dest* atomically. Sanity-check the size."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_suffix(dest.suffix + ".part")
    print(f" GET {url}", flush=True)
    try:
        with urllib.request.urlopen(url, timeout=120) as r, open(tmp, "wb") as f:
            shutil.copyfileobj(r, f)
    except Exception as e:  # noqa: BLE001 - bubble any network error up
        if tmp.exists():
            tmp.unlink()
        _err(f"download failed: {url}\n {e}")
        raise
    size = tmp.stat().st_size
    if size < expected_min_bytes:
        tmp.unlink()
        raise RuntimeError(
            f"downloaded file too small ({size} bytes < {expected_min_bytes}); "
            f"the URL probably 404'd into an HTML error page."
        )
    tmp.replace(dest)
    _ok(f"downloaded {dest.name} ({size / (1024 * 1024):.1f} MB)")
def fetch_tessdata() -> Path:
    """Ensure ``build/vendor/tessdata/eng.traineddata`` exists; return its path.

    Shared across platforms. Downloaded once and cached. The
    runtime expects this file at ``<bundle>/tesseract/tessdata/eng.traineddata``;
    the PyInstaller spec handles the placement.
    """
    _step("fetch tessdata (eng.traineddata)")
    TESSDATA_DIR.mkdir(parents=True, exist_ok=True)
    target = TESSDATA_DIR / "eng.traineddata"
    if target.exists() and target.stat().st_size > 1_000_000:
        _ok(f"already cached: {target.relative_to(REPO)} "
            f"({target.stat().st_size / (1024 * 1024):.1f} MB)")
        return target
    # ~16 MB on disk for the "best" model. Allow some slack on the
    # min-bytes check (3 MB) so we still catch HTML 404 pages.
    _download(TESSDATA_URL, target, expected_min_bytes=3 * 1024 * 1024)
    return target
def _fetch_tesseract_windows(staging: Path) -> None:
    """Stage tesseract.exe + DLLs into *staging*.

    Strategy (no easy stand-alone Windows tarball exists; UB-Mannheim
    ships the canonical Windows builds as Inno Setup installers):
    1. Download the installer .exe from the UB-Mannheim mirror.
    2. Extract it with 7-Zip (which can read Inno Setup archives via
       the {app} group). 7-Zip is preinstalled on
       ``windows-latest`` GitHub Actions runners (`C:\\Program Files\\7-Zip\\7z.exe`).
    3. Copy tesseract.exe + every DLL + the tessdata dir from the
       extraction into ``staging/``.

    The DLL set tesseract.exe needs at runtime (per UB-Mannheim's
    Inno Setup script):
        libtesseract-5.dll, libleptonica-6.dll, libgomp-1.dll,
        libstdc++-6.dll, libwinpthread-1.dll, libgcc_s_seh-1.dll,
        liblz4.dll, libjpeg-8.dll, libpng16-16.dll, libtiff-6.dll,
        libwebp-7.dll, libwebpmux-3.dll, libopenjp2-7.dll, zlib1.dll

    The whole {app} tree from the installer is ~120 MB; we copy
    just the .exe + .dll files (~50 MB) since the runtime only
    needs the binary and its direct deps.
    """
    # UB-Mannheim posts builds under a versioned filename; the exact
    # build revision changes (5.5.0.20241111 at time of writing).
    # We pin a specific rev so reproducible builds don't drift.
    rev = "20241111"  # patch rev for tesseract 5.5.0 on the UB-Mannheim mirror
    fname = f"tesseract-ocr-w64-setup-{TESSERACT_VERSION}.{rev}.exe"
    url = f"https://digi.bib.uni-mannheim.de/tesseract/{fname}"
    cache = TESSERACT_STAGING / fname
    if not cache.exists():
        _download(url, cache, expected_min_bytes=20 * 1024 * 1024)
    # 7-Zip is preinstalled on windows-latest runners; on a dev box
    # the user installs it (choco install 7zip) or substitutes
    # innoextract. Locate it.
    sevenz = (
        shutil.which("7z")
        or shutil.which("7z.exe")
        or r"C:\Program Files\7-Zip\7z.exe"
    )
    if not Path(sevenz).exists() and not shutil.which("7z"):
        _err(
            "7-Zip not found. On Windows CI runners it's preinstalled; "
            "on a dev box install via ``choco install 7zip`` or extract "
            f"{cache} manually into {staging}/ and re-run with "
            "TESSERACT_SKIP_FETCH=1."
        )
        raise FileNotFoundError("7z")
    extract = TESSERACT_STAGING / "win_extract"
    if extract.exists():
        shutil.rmtree(extract)
    extract.mkdir(parents=True)
    _run([str(sevenz), "x", "-y", f"-o{extract}", str(cache)])
    staging.mkdir(parents=True, exist_ok=True)
    # The Inno Setup payload lands under ``{app}/`` inside the
    # extraction. Recursively grab tesseract.exe + DLLs.
    found_exe = False
    for root, _dirs, files in os.walk(extract):
        for f in files:
            src = Path(root) / f
            if f.lower() == "tesseract.exe":
                shutil.copy2(src, staging / "tesseract.exe")
                found_exe = True
            elif f.lower().endswith(".dll"):
                shutil.copy2(src, staging / f)
    if not found_exe:
        raise RuntimeError(
            f"tesseract.exe not found inside extracted installer at {extract}"
        )
    _ok(f"staged Windows tesseract into {staging.relative_to(REPO)}")
def _fetch_tesseract_macos(staging: Path) -> None:
    """Stage tesseract + dylibs into *staging* on macOS.

    Strategy: use Homebrew. ``brew install tesseract`` is the
    sanctioned macOS path and the binary it installs is the same one
    every guide on the internet points at. We copy the binary +
    every dylib it links against into the staging dir, then run
    ``install_name_tool`` to rewrite the load paths so the binary
    works after relocation into the .app bundle.

    Caveat: ``brew`` must be on PATH (it is on ``macos-latest``
    runners). If it isn't, we surface a helpful error rather than
    fail mysteriously.
    """
    if not shutil.which("brew"):
        _err(
            "Homebrew not found. On macos-latest GitHub runners it's "
            "preinstalled; on a dev Mac install from https://brew.sh and "
            "re-run. Alternatively pre-stage tesseract into "
            f"{staging}/ and set TESSERACT_SKIP_FETCH=1."
        )
        raise FileNotFoundError("brew")
    # ``brew install`` is idempotent, so it is fine to run on every build. We
    # don't pin the version through brew because brew tracks its own
    # taps; instead we assert the version matches TESSERACT_VERSION
    # after install.
    _run(["brew", "install", "tesseract"])
    # Find the binary brew just installed.
    tess_path = shutil.which("tesseract")
    if not tess_path:
        raise RuntimeError("brew install tesseract succeeded but tesseract not on PATH")
    staging.mkdir(parents=True, exist_ok=True)
    shutil.copy2(tess_path, staging / "tesseract")
    # Copy every non-system dylib the binary links against. The
    # ``otool -L`` output lists absolute paths under /opt/homebrew/
    # (Apple Silicon) or /usr/local/ (Intel). We skip /usr/lib/* and
    # /System/* (Apple-shipped, present on every Mac).
    try:
        otool = subprocess.run(
            ["otool", "-L", str(staging / "tesseract")],
            check=True, capture_output=True, text=True,
        )
    except subprocess.CalledProcessError as e:
        raise RuntimeError(f"otool failed: {e.stderr}") from e
    deps = []
    for line in otool.stdout.splitlines()[1:]:
        path = line.strip().split(" ", 1)[0]
        if path.startswith(("/opt/homebrew/", "/usr/local/")):
            deps.append(path)
    # Copy each dep and its transitive deps. One level of recursion
    # is usually enough for the tesseract dep tree (libtesseract →
    # libleptonica → libpng/libjpeg/libtiff/libwebp).
    copied: set[str] = set()

    def _copy_with_deps(libpath: str) -> None:
        if libpath in copied or not Path(libpath).exists():
            return
        copied.add(libpath)
        dest = staging / Path(libpath).name
        shutil.copy2(libpath, dest)
        # Rewrite the dest's own load path to @loader_path so the
        # bundle is relocatable.
        try:
            subprocess.run(
                ["install_name_tool", "-id", f"@loader_path/{Path(libpath).name}", str(dest)],
                check=True, capture_output=True,
            )
        except subprocess.CalledProcessError:
            # Not fatal: install_name_tool refuses on already-relative
            # IDs. The dyld loader will still find them via
            # @loader_path rewrites on the consumer side.
            pass
        # Walk this lib's own deps.
        try:
            sub = subprocess.run(
                ["otool", "-L", libpath], check=True, capture_output=True, text=True,
            )
            for sub_line in sub.stdout.splitlines()[1:]:
                sub_path = sub_line.strip().split(" ", 1)[0]
                if sub_path.startswith(("/opt/homebrew/", "/usr/local/")):
                    _copy_with_deps(sub_path)
        except subprocess.CalledProcessError:
            pass

    for dep in deps:
        _copy_with_deps(dep)
    # Rewrite the tesseract binary's references to point at
    # @loader_path/<dyname> so it can find its deps inside the bundle.
    bin_path = staging / "tesseract"
    for dep in deps:
        try:
            subprocess.run(
                ["install_name_tool", "-change", dep,
                 f"@loader_path/{Path(dep).name}", str(bin_path)],
                check=True, capture_output=True,
            )
        except subprocess.CalledProcessError:
            pass
    _ok(f"staged macOS tesseract + {len(copied)} dylibs into {staging.relative_to(REPO)}")
def _fetch_tesseract_linux(staging: Path) -> None:
    """Stage tesseract + .so files into *staging* on Linux.

    Strategy: ``apt-get install tesseract-ocr libtesseract5`` when
    tesseract is not already on PATH (it is preinstalled on most
    ubuntu-latest images, in which case the install is skipped).
    Then copy the binary + every .so it links against into staging.
    ``patchelf`` rewrites RPATH so the bundle is relocatable.
    """
    if not shutil.which("apt-get") and not shutil.which("tesseract"):
        _err(
            "Neither apt-get nor a pre-installed tesseract found. On "
            "ubuntu-latest runners both are present. On other distros "
            "install tesseract-ocr via your package manager and re-run "
            "with TESSERACT_SKIP_FETCH=1 after pre-staging the binary."
        )
        raise FileNotFoundError("tesseract")

    if shutil.which("apt-get") and not shutil.which("tesseract"):
        _run(["sudo", "apt-get", "update"])
        _run(["sudo", "apt-get", "install", "-y", "tesseract-ocr", "libtesseract5"])

    tess_path = shutil.which("tesseract")
    if not tess_path:
        raise RuntimeError("apt-get install succeeded but tesseract not on PATH")

    staging.mkdir(parents=True, exist_ok=True)
    shutil.copy2(tess_path, staging / "tesseract")

    # Collect .so dependencies via ldd. Skip the dynamic linker and
    # libc/libpthread/libdl/libm/libstdc++/libgcc_s — those are
    # guaranteed to exist on every Linux target and shipping them can
    # cause GLIBC mismatch errors on older distros. The interesting
    # tesseract-specific deps are libtesseract, libleptonica, and the
    # image format libs (libpng, libjpeg, libtiff, libwebp, libgif).
    SKIP_PREFIXES = (
        "linux-vdso", "/lib64/ld-linux", "/lib/ld-linux",
        "libc.so", "libdl.so", "libpthread.so", "libm.so",
        "librt.so", "libnsl.so", "libutil.so",
        # Named in the comment above but previously missing from the tuple.
        "libstdc++.so", "libgcc_s.so",
    )
    try:
        ldd = subprocess.run(
            ["ldd", str(staging / "tesseract")],
            check=True, capture_output=True, text=True,
        )
    except subprocess.CalledProcessError as e:
        raise RuntimeError(f"ldd failed: {e.stderr}") from e

    copied = 0
    for line in ldd.stdout.splitlines():
        # Format: " libfoo.so.N => /path/to/libfoo.so.N (0x...)"
        parts = line.split("=>")
        if len(parts) != 2:
            continue
        soname = parts[0].strip()
        if soname.startswith(SKIP_PREFIXES):
            continue
        path_part = parts[1].strip().split(" ", 1)[0]
        if not path_part or not Path(path_part).exists():
            continue
        shutil.copy2(path_part, staging / Path(path_part).name)
        copied += 1

    # patchelf is optional — if present, rewrite RPATH to $ORIGIN so
    # the binary finds its bundled .so files. If absent, the
    # PyInstaller LD_LIBRARY_PATH that the launcher sets will cover
    # it (we already chdir into _MEIPASS for the runtime).
    if shutil.which("patchelf"):
        try:
            _run(["patchelf", "--set-rpath", "$ORIGIN", str(staging / "tesseract")])
        except SystemExit:
            _warn("patchelf rpath rewrite failed — relying on LD_LIBRARY_PATH at runtime")

    _ok(f"staged Linux tesseract + {copied} .so files into {staging.relative_to(REPO)}")
def fetch_tesseract_for_platform(target: str) -> Path:
    """Stage the per-platform Tesseract binary + libs into ``build/_tesseract/<target>/``.

    Returns the staging dir path. The PyInstaller spec adds this dir
    (plus tessdata) to its ``datas=`` so the bundle ends up with
    everything under ``<bundle>/tesseract/`` where the runtime
    discovery code expects it.

    Honours ``TESSERACT_SKIP_FETCH=1`` — set this when you've
    pre-staged the binary by hand (offline build, behind a proxy,
    custom build of tesseract, etc.). The script still verifies the
    binary is present and surfaces a helpful error if not.
    """
    _step(f"fetch tesseract binary ({target})")
    staging = TESSERACT_STAGING / target
    exe_name = "tesseract.exe" if target == "win" else "tesseract"
    exe_path = staging / exe_name

    if os.environ.get("TESSERACT_SKIP_FETCH") == "1":
        if not exe_path.exists():
            _err(
                f"TESSERACT_SKIP_FETCH=1 but {exe_path} is missing. "
                "Pre-stage the binary + its libs into that dir, then re-run."
            )
            sys.exit(1)
        _ok(f"skipping fetch (TESSERACT_SKIP_FETCH=1); using {exe_path.relative_to(REPO)}")
        return staging

    if exe_path.exists():
        _ok(f"already staged: {exe_path.relative_to(REPO)}")
        return staging

    if target == "win":
        _fetch_tesseract_windows(staging)
    elif target == "mac":
        _fetch_tesseract_macos(staging)
    elif target == "linux":
        _fetch_tesseract_linux(staging)
    else:
        _err(f"unknown target {target!r} for tesseract fetch")
        sys.exit(2)

    if not exe_path.exists():
        _err(
            f"fetch step finished but {exe_path.relative_to(REPO)} is missing. "
            "Inspect the logs above; you may need to pre-stage the binary manually."
        )
        sys.exit(1)
    return staging

63
build/vendor/README.md vendored Normal file

@@ -0,0 +1,63 @@
# build/vendor/ — third-party bundle inputs (fetched at build time)
This tree holds the third-party assets that get bundled into the
PyInstaller artifacts but that we deliberately do **not** keep in git
(too large / license-encumbered / re-fetchable on demand).
The build's Tesseract helper (`build/tesseract.py`) populates
everything in here before the PyInstaller step — CI
(`.github/workflows/build.yml`) calls it ahead of the build. The
contents are git-ignored except for this README.
## tessdata/
Holds the Tesseract language data file(s) used by the PDF Extractor
OCR fallback. Only English is bundled today.
### Canonical source
We use the **"best" model** from `tesseract-ocr/tessdata_best` (LSTM,
slower but higher accuracy than the legacy `tessdata` set, and only
~12 MB compressed → ~16 MB uncompressed):
```
https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata
```
There is also `tessdata_fast/` (~4 MB, lower accuracy) if you ever
want to optimise for bundle size over recognition quality. For bank
statements (the only OCR use case so far), the extra accuracy of the
`_best` model is worth the extra ~12 MB.
### Why we don't vendor it in git
* ~16 MB binary file — bloats clone times for everyone, including
contributors who never touch the OCR code path.
* Apache-2.0-licensed and stable; the file rarely changes upstream
(last touched 2021), so a build-time fetch is safe.
* The Tesseract project explicitly distributes these via GitHub
raw URLs — they're meant to be downloaded, not redistributed
through other repos.
### How it gets populated
`build/tesseract.py::fetch_tessdata()` checks for
`build/vendor/tessdata/eng.traineddata` on every run. If it's
missing, it downloads the file from the canonical URL above and
caches it here. Subsequent builds reuse the cached file.
On CI, the directory is restored from the GitHub Actions cache so we
don't pay the download cost on every run (`.github/workflows/build.yml`
caches `build/vendor/tessdata/` keyed on the URL above).
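
The check-then-download behaviour described above could look roughly like the sketch below. It is not the contents of `build/tesseract.py`: the function name `ensure_tessdata`, its parameters, and the injectable `download` callable are assumptions made for illustration. Only the URL and the destination path come from this README.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlopen

# URL and destination are taken from this README; everything else is hypothetical.
TESSDATA_URL = "https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata"
TESSDATA_PATH = Path("build/vendor/tessdata/eng.traineddata")


def _http_get(url: str) -> bytes:
    """Default downloader: fetch *url* and return the response body."""
    with urlopen(url, timeout=60) as resp:
        return resp.read()


def ensure_tessdata(
    dest: Path = TESSDATA_PATH,
    url: str = TESSDATA_URL,
    download: Callable[[str], bytes] = _http_get,
) -> Path:
    """Return *dest*, downloading it first only if it is missing or empty."""
    if dest.exists() and dest.stat().st_size > 0:
        return dest  # cached by a previous build
    data = download(url)
    if not data:
        raise RuntimeError(f"downloaded tessdata from {url} is empty")
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write under a temporary name and rename afterwards, so an
    # interrupted download never looks like a valid cache entry.
    tmp = dest.with_name(dest.name + ".part")
    tmp.write_bytes(data)
    tmp.replace(dest)
    return dest
```

Passing the downloader in as a parameter keeps the cache logic testable without network access; a real build would simply call `ensure_tessdata()` with the defaults.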
## Manual one-time fetch (if you're offline or behind a proxy)
```bash
mkdir -p build/vendor/tessdata
curl -L -o build/vendor/tessdata/eng.traineddata \
https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata
```
Verify the file is non-empty and starts with the magic bytes
`b"\x00\x00\x00\x00"` followed by a header that `pytesseract` can
read; the script does a basic sanity check after download.

0
build/vendor/tessdata/.gitkeep vendored Normal file

481
docs/ADMIN.md Normal file

@@ -0,0 +1,481 @@
# ADMIN — Internal license operations
Creator/operator-only reference. End users should read `USER-GUIDE.md` instead.
This doc covers everything the creator does that buyers never see: minting
through the live server, where state lives on the box, how to rotate secrets,
generating the signing keypair, the dev vs. production key story, and how to
recover from key loss.
For the end-to-end system + tech stack diagrams, see `ARCHITECTURE.md`.
---
## Live deployment (PR 1)
The license server is running at:
| URL | What it serves |
|---|---|
| `https://datatools.unalogix.com/` | Marketing site (placeholder — "DataTools — coming soon") |
| `https://licenses.datatools.unalogix.com/health` | Liveness + DB reachability probe |
| `https://licenses.datatools.unalogix.com/internal/*` | nginx-blocked on the public side — accessible only via SSH tunnel |
| Postgres @ `127.0.0.1:5433` (localhost) | DB containing the authoritative `licenses` table |
**Host**: `46.225.166.142` (Ubuntu 24.04), nginx 1.24, Postgres 16-alpine + FastAPI in Docker.
**Cert**: Let's Encrypt, covers both subdomains, expires 2026-08-12, auto-renews via `certbot.timer`.
### On-box state
| Path | Contents |
|---|---|
| `/srv/datatools-license/` | Deploy root, mode 750, owned by `datatools-api` |
| `/srv/datatools-license/compose.yml` | Production docker-compose definition |
| `/srv/datatools-license/app/` | Git clone of this repo (re-clone or `git pull` to update) |
| `/srv/datatools-license/secrets/` | Mode 750 dir holding `pg_password`, `admin_token`. Files are mode 400, owned UID 10001 (container app user) |
| `/srv/datatools-license/backups/` | Postgres dumps land here (cron not yet wired — see §"Backups" below) |
| `/etc/nginx/sites-available/unalogix` | nginx config for both subdomains |
| `/etc/letsencrypt/live/datatools.unalogix.com/` | TLS cert + key |
Container names: `datatools-api`, `datatools-postgres`. Both use
`restart: unless-stopped`.
### Get the admin token
```bash
ssh michael@46.225.166.142 'sudo cat /srv/datatools-license/secrets/admin_token'
```
The token is **never** in git, in environment-variable dumps, or in
`docker inspect`. It lives on disk under mode 400 / UID 10001 (so only
root and the container app user can read it).
### Rotate the admin token
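Rotate it any time it has been exposed somewhere it shouldn't be, or as
routine hygiene. Run these as root on the server, since the file is mode 400
and owned by UID 10001: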
```bash
cd /srv/datatools-license
openssl rand -hex 32 > secrets/admin_token
chown 10001:10001 secrets/admin_token
chmod 400 secrets/admin_token
docker compose restart api # ~3 seconds; old token stops working immediately
```
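
For anyone scripting the rotation, the token-generation step can be sketched in Python as below. The helper name `write_admin_token` is hypothetical and not part of this repo; the sketch only covers generating the token and writing it with mode 400. The `chown` to UID 10001 and the `docker compose restart api` shown above are still required.

```python
import os
import secrets
import tempfile
from pathlib import Path


def write_admin_token(path: Path) -> str:
    """Generate a 32-byte hex token and write it to *path* with mode 400."""
    token = secrets.token_hex(32)  # 64 hex chars, like `openssl rand -hex 32`
    # mkstemp creates the file readable and writable by the owner only,
    # so the token is never world-readable, even briefly.
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "w") as fh:
        fh.write(token + "\n")
    os.chmod(tmp, 0o400)
    os.replace(tmp, path)  # atomic swap over the old token file
    return token
```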
### Mint a license from your laptop
```bash
# 1. Open the SSH tunnel (leave running in a background terminal)
ssh -L 8090:127.0.0.1:8090 michael@46.225.166.142 -N &
# 2. Set the auth env
export DATATOOLS_ADMIN_TOKEN="$(ssh michael@46.225.166.142 'sudo cat /srv/datatools-license/secrets/admin_token')"
export DATATOOLS_ADMIN_URL=http://127.0.0.1:8090
# 3. Mint
python3 -m src.admin_cli mint \
--name "Buyer Name" \
--email buyer@example.com \
--tier core
# 4. (optional) List or revoke
python3 -m src.admin_cli list --email buyer@example.com
python3 -m src.admin_cli revoke DT1-CORE-xxxx-yyyy --reason "refund"
```
The blob lands in the response (and in the `licenses` table). Deliver it
to the buyer however suits — copy-paste into email, attach as `.dtlic`.
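
Under the hood, the CLI presumably issues a Bearer-authenticated request through the tunnel. The sketch below shows the general shape of such a request using only the standard library. The `/internal/mint` path and the two environment variables appear in these docs; the JSON body fields and the helper name `build_mint_request` are assumptions that mirror the CLI flags, not a documented schema.

```python
import json
import os
from urllib.request import Request


def build_mint_request(name: str, email: str, tier: str = "core") -> Request:
    """Build (but do not send) a Bearer-authenticated POST to the mint endpoint."""
    base_url = os.environ.get("DATATOOLS_ADMIN_URL", "http://127.0.0.1:8090")
    token = os.environ["DATATOOLS_ADMIN_TOKEN"]
    # Assumed body shape: mirrors the CLI flags, not a documented schema.
    body = json.dumps({"name": name, "email": email, "tier": tier}).encode()
    return Request(
        f"{base_url.rstrip('/')}/internal/mint",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Sending it would be a matter of `urllib.request.urlopen(build_mint_request(...))` while the SSH tunnel is open. For day-to-day use the `src.admin_cli` commands above remain the supported path.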
### Inspect / debug
```bash
# Container status + recent logs
ssh michael@46.225.166.142 'cd /srv/datatools-license && docker compose ps && docker compose logs api --tail 30'
# Query the licenses table directly
ssh michael@46.225.166.142 'cd /srv/datatools-license && docker compose exec -T postgres \
psql -U datatools_api -d datatools_licenses -c "SELECT license_key, email, tier, source, expires_at FROM licenses ORDER BY created_at DESC LIMIT 20;"'
# Public-side health
curl https://licenses.datatools.unalogix.com/health
```
### Bring it down / back up / rebuild
```bash
cd /srv/datatools-license
# Restart just the API (e.g. after rotating a secret)
docker compose restart api
# Restart everything
docker compose restart
# Bring down (DB volume PRESERVED)
docker compose down
# Bring up
docker compose up -d
# Rebuild the image after a git pull
cd app && git pull
cd ..
docker compose build && docker compose up -d
docker compose exec api alembic upgrade head # if new migrations
```
### Backups (not yet automated)
Postgres state is the system of record for the customer list — once PR 2
auto-mints from Gumroad webhooks, losing the DB would mean losing every
buyer record. Schedule a daily dump:
```bash
# /etc/cron.daily/datatools-license-backup — see SETUP-LICENSE-SERVER.md §9
```
Until that's in place, dump manually before any risky operation:
```bash
docker compose exec -T postgres \
pg_dump -U datatools_api datatools_licenses \
| gzip > backups/db-$(date -u +%Y%m%dT%H%M%SZ).sql.gz
```
### Production signing key (not yet rotated)
The server currently signs with the in-tree dev keypair (no
`DATATOOLS_LICENSE_PRIVKEY_FILE` configured → falls back to
`src/license/_dev_keypair.py`). That matches what the desktop currently
verifies against, so existing buyers continue to work.
**Before shipping v1.0 to paying buyers**, rotate to a production keypair:
1. `python scripts/generate_keypair.py` (on a trusted machine).
2. Save the private hex to `/srv/datatools-license/secrets/license_privkey`,
chmod 400, chown 10001:10001.
3. Bake the public hex into the PyInstaller build's
`DATATOOLS_LICENSE_PUBKEY` env.
4. Wire `DATATOOLS_LICENSE_PRIVKEY_FILE` + `DATATOOLS_LICENSE_PUBKEY`
into compose.yml's `api.environment` and add `license_privkey` to
the secrets block.
5. `docker compose up -d api` (a plain `restart` does not pick up
`compose.yml` changes; `up -d` recreates the container).
### What's deployed (PR 1) vs queued (PR 2 / 3)
| Capability | Status |
|---|---|
| Mint API + Postgres + auth | **Live** |
| `datatools-admin` CLI (manual mints) | **Live** |
| `licenses.datatools.unalogix.com/health` public | **Live** |
| Gumroad webhook receiver | **PR 2 — code merged, deploy pending** |
| Postmark transactional email | **PR 2 — code merged, deploy pending** |
| Buyer renewal / re-delivery portal | **PR 3** |
| Cloudflare in front (DDoS / WAF) | Deferred (DNS at supercp/cPanel) |
| Production signing keypair | Deferred (still using dev key) |
| Automated DB backups | **Pending** — see §"Backups" |
### Running a Gumroad webhook (PR 2)
Once PR 2 is deployed, sales fire `POST` to
`https://licenses.datatools.unalogix.com/webhooks/gumroad?secret=<gumroad_secret>`.
Auth is the URL secret (Gumroad's recommended pattern). The handler
audit-logs the raw payload, mints idempotently keyed on `sale_id`,
sends the buyer their blob via Postmark, and returns 200 (always —
non-2xx would trigger 3-day retry storms).
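
The handler's control flow can be illustrated with a small in-memory sketch. This is not the server's code: `WebhookState`, `handle_gumroad_ping`, the placeholder blob, and the `sale_id` / `email` payload fields are assumptions for the example, and the 404 on a bad secret follows the trust table in `ARCHITECTURE.md`. It shows the three properties described above: a constant-time secret check, audit-logging before processing, and idempotent minting that still answers 200 when processing fails.

```python
import hmac
from dataclasses import dataclass, field


@dataclass
class WebhookState:
    """In-memory stand-ins for the audit log, licenses table, and mailer."""
    audit_log: list = field(default_factory=list)
    licenses: dict = field(default_factory=dict)  # sale_id -> blob
    emails_sent: list = field(default_factory=list)


def handle_gumroad_ping(state: WebhookState, expected_secret: str,
                        given_secret: str, payload: dict) -> int:
    """Return the HTTP status code the webhook endpoint would answer with."""
    # Constant-time comparison avoids leaking the secret through timing.
    if not hmac.compare_digest(given_secret.encode(), expected_secret.encode()):
        return 404
    state.audit_log.append(dict(payload))  # record the raw payload first
    try:
        sale_id = payload["sale_id"]
        if sale_id not in state.licenses:  # a retried Ping mints nothing new
            state.licenses[sale_id] = f"DTLIC2:placeholder-for-{sale_id}"
            state.emails_sent.append(payload["email"])
    except Exception as exc:
        # Failures are recorded for later replay, never surfaced as non-2xx.
        state.audit_log[-1]["error"] = repr(exc)
    return 200
```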
**Adding a new SKU:**
1. Create the product in Gumroad and copy its `product_id`.
2. Edit `/srv/datatools-license/app/server/config/products.yaml`,
add a row under `gumroad:` with that ID + the tier you sold.
3. `cd /srv/datatools-license && docker compose restart api` — the
config is read at startup and cached.
**Inspecting webhook activity:**
```bash
# Recent webhook deliveries (all storefronts share this table)
ssh michael@46.225.166.142 'cd /srv/datatools-license && docker compose exec -T postgres \
psql -U datatools_api -d datatools_licenses -c \
"SELECT received_at, order_id, processed, error FROM gumroad_events ORDER BY received_at DESC LIMIT 20;"'
# Failures only (replay candidates)
ssh michael@46.225.166.142 'cd /srv/datatools-license && docker compose exec -T postgres \
psql -U datatools_api -d datatools_licenses -c \
"SELECT id, received_at, order_id, error FROM gumroad_events WHERE processed=false ORDER BY received_at DESC;"'
```
**Replaying a failed webhook** (after fixing the products.yaml mapping
or whatever surfaced the error): Gumroad's "Send Test Ping" button lives
in the seller dashboard and only sends `test=true` payloads, so it cannot
replay a real sale and buyers cannot trigger it. The safest path is to
mint manually via `datatools-admin mint --source manual` and add a note
linking to the original `gumroad_events.id`.
**Testing without buyers:** Gumroad's seller dashboard has a "Send
Test Ping" button. It sets `test=true` in the payload; the adapter
tags the resulting license with `notes='gumroad test ping'` so it's
trivially filterable later.
---
## TL;DR — I just need a license for my dev machine
You're running from source, so the repo's embedded dev keypair signs and
verifies. No env vars needed.
```bash
python scripts/generate_license.py \
--name "Michael Dombaugh" \
--email michael.dombaugh@gmail.com \
--tier core
```
Copy the `DTLIC2:…` blob from stdout, then activate:
```bash
python -m src.license_cli activate "DTLIC2:..." \
--name "Michael Dombaugh" \
--email michael.dombaugh@gmail.com
```
Verify:
```bash
python -m src.license_cli status
```
License lands at `~/.datatools/license.json`, valid 1 year.
> The `--name` / `--email` you pass to `activate` **must** match the values
> the blob was minted with — they're part of the signed payload.
---
## Key model (Ed25519, asymmetric)
| Key | Lives where | Used for |
|-----|------------|---------|
| **Private** (32 bytes, 64 hex chars) | Creator's password manager / KMS only | Signing license blobs |
| **Public** (32 bytes, 64 hex chars) | Baked into the shipped binary | Verifying blobs at activation |
The split is the whole point: an attacker with a copy of the binary still
can't mint blobs — they'd need the private key, which never ships.
There's also an in-tree **dev keypair** (`src/license/_dev_keypair.py`)
derived deterministically from a seed. It's used when no env vars are set,
so devs/tests can sign and verify locally without juggling secrets. Frozen
builds that still use it are rejected at startup by
`assert_production_safe` — see `src/license/crypto.py:84`.
Blob format prefix: `DTLIC2:` (v1 was HMAC; v2 is Ed25519).
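
To make the sign/verify split concrete, here is a minimal sketch using PyCA `cryptography`. The blob layout (prefix, then base64url of signature plus canonical JSON) and the helper names `sign_blob` / `verify_blob` are assumptions for illustration only; the real encoding lives in `src/license/crypto.py` and may differ.

```python
import base64
import json
from typing import Optional

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)

PREFIX = "DTLIC2:"


def sign_blob(private_key: Ed25519PrivateKey, payload: dict) -> str:
    """Hypothetical encoding: prefix + base64url(signature || canonical JSON)."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    signature = private_key.sign(body)  # Ed25519 signatures are always 64 bytes
    return PREFIX + base64.urlsafe_b64encode(signature + body).decode()


def verify_blob(public_key: Ed25519PublicKey, blob: str) -> Optional[dict]:
    """Return the payload if the signature checks out, else None."""
    if not blob.startswith(PREFIX):
        return None
    try:
        raw = base64.urlsafe_b64decode(blob[len(PREFIX):])
        signature, body = raw[:64], raw[64:]
        public_key.verify(signature, body)  # raises InvalidSignature on mismatch
        return json.loads(body)
    except (InvalidSignature, ValueError):
        return None
```

Only `verify_blob` and the public key would need to exist in a shipped binary; signing requires the private key, which is why a copy of the binary is not enough to mint licenses.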
---
## One-time setup — generating the production keypair
Run once, before the first paid release.
```bash
python scripts/generate_keypair.py --output keypair.env
```
You'll get:
```
DATATOOLS_LICENSE_PRIVKEY=<64 hex chars> # KEEP SECRET
DATATOOLS_LICENSE_PUBKEY=<64 hex chars> # BAKE INTO BUILD
```
Then:
1. **Stash the private key** in a password manager / KMS / hardware token.
Losing it means no more renewals — see "Recovery" below.
2. **Delete `keypair.env`** from disk once stored.
3. **Set the public key** as `DATATOOLS_LICENSE_PUBKEY` in the PyInstaller
build environment. The shipped binary embeds it via the env at freeze time.
---
## Minting a buyer license (production)
With the production private key loaded:
```bash
export DATATOOLS_LICENSE_PRIVKEY=<your-private-hex>
python scripts/generate_license.py \
--name "Buyer Name" \
--email buyer@example.com \
--tier core \
--years 1 \
--output buyer.dtlic
```
Flags:
| Flag | Default | Notes |
|------|---------|-------|
| `--name` | required | Buyer's full name. Goes into signed payload. |
| `--email` | required | Buyer's email. Goes into signed payload. |
| `--tier` | `core` | One of: `lite`, `core`, `pro` |
| `--years` | `1` | Lifetime in years |
| `--key` | random | Override the auto-generated license key |
| `--output` / `-o` | stdout | Write blob to file instead of printing |
Deliver the blob to the buyer either inline in the purchase email or as
the attached `.dtlic` file.
---
## Tiers
| Tier | Features |
|------|---------|
| **lite** | Find Duplicates, Clean Text, Standardize Formats |
| **core** | All 9 tools |
| **pro** | All 9 tools + future Pro-only features |
Source of truth: `src/license/features.py::all_features_for_tier`.
---
## Useful one-liners
Mint a free internal/team license (dev key, no env needed):
```bash
python scripts/generate_license.py --name "QA Bot" --email qa@datatools.app --tier core --years 5
```
Mint with a stable, human-readable key:
```bash
python scripts/generate_license.py --name "Acme Corp" --email ops@acme.com \
--tier pro --key "DT1-PRO-ACME-2026"
```
Renew an existing buyer (re-mint with the same name and email using
`scripts/generate_license.py`; the buyer then applies the new blob):
```bash
python -m src.license_cli renew "DTLIC2:..."
```
Check what's active locally:
```bash
python -m src.license_cli status
```
Wipe a local license (move to a new machine, debug a buyer issue):
```bash
python -m src.license_cli deactivate
```
---
## Customer record-keeping — the issuance log
Every successful `scripts/generate_license.py` run appends one JSON
line to a local **issuance log**. This is the creator-side system of
record for "who has a license" until the server-side flow in
`docs/LICENSE-SERVER.md` lands.
**Path:** `~/.datatools-creator/issued.jsonl` (override with
`$DATATOOLS_ISSUANCE_LOG`). Mode 600. Outside the buyer-facing
`~/.datatools/` dir so it never gets bundled into a shipped install.
**Format** — one record per line:
```json
{
"license_key": "DT1-CORE-5dd8e1db-d90c4656",
"name": "Michael Dombaugh",
"email": "michael.dombaugh@gmail.com",
"tier": "core",
"issued_at": "2026-05-13T22:10:27Z",
"expires_at": "2031-05-13T22:10:27Z",
"blob": "DTLIC2:..."
}
```
The full blob is stored so you can re-deliver to a buyer who lost
their email without re-minting (the re-minted blob would have a
different signature and would invalidate any device they'd already
activated against the old one).
**Useful operations:**
```bash
# Full list of issued licenses
cat ~/.datatools-creator/issued.jsonl | jq
# Find by buyer email
jq -r 'select(.email == "buyer@example.com")' ~/.datatools-creator/issued.jsonl
# Count by tier
jq -r .tier ~/.datatools-creator/issued.jsonl | sort | uniq -c
# Licenses expiring within 30 days, including any already expired (GNU date syntax)
jq -r 'select(.expires_at < "'"$(date -u -d '+30 days' +%Y-%m-%dT%H:%M:%SZ)"'") | .email' \
~/.datatools-creator/issued.jsonl
# Re-deliver a buyer's blob
jq -r 'select(.email == "buyer@example.com") | .blob' \
~/.datatools-creator/issued.jsonl
```
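
The `date -d '+30 days'` form above is GNU-specific, so a portable alternative may be useful on macOS. The sketch below assumes the JSONL schema shown earlier; the function name `expiring_within` is hypothetical. Unlike the `jq` one-liner, it excludes licenses that have already expired.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional

LOG_PATH = Path.home() / ".datatools-creator" / "issued.jsonl"


def expiring_within(log_path: Path, days: int,
                    now: Optional[datetime] = None) -> list:
    """Return log records whose expires_at falls in [now, now + days)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now + timedelta(days=days)
    hits = []
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            # Timestamps in the log look like 2031-05-13T22:10:27Z.
            expires = datetime.strptime(
                record["expires_at"], "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc)
            if now <= expires < cutoff:
                hits.append(record)
    return hits
```

Typical use would be `for r in expiring_within(LOG_PATH, 30): print(r["email"])`.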
**Skipping the log** for test mints: pass `--no-log`. Never use this
for real buyer fulfillment — an unlogged mint is invisible to every
future query and to the eventual server-side migration.
**Backup:** treat this file like a small business ledger. Copy it
into your password manager / encrypted cloud sync alongside the
private key. Losing it doesn't break anything cryptographically (you
can still mint new licenses) but it does lose the customer list.
**Migrating to the server:** the JSONL schema is intentionally close
to the planned `licenses` table in `docs/LICENSE-SERVER.md`. Once the
server is up, a one-shot import script will read the JSONL and
insert each row.
---
## Recovery — what if the private key is lost?
Existing licenses keep working until they expire (the public key in the
shipped binary still verifies them). What breaks:
- **Renewals** — you can't mint a new blob for an existing buyer.
- **New sales** — you can't mint anything.
Path forward:
1. Generate a new keypair (`scripts/generate_keypair.py`).
2. Ship a new build with the new public key.
3. Re-issue every active buyer a new blob signed by the new private key.
4. Communicate the upgrade path to buyers.
Treat the private key like a code-signing cert — back it up to two
independent secure locations.
---
## Files & code pointers
| Path | Purpose |
|------|---------|
| `scripts/generate_keypair.py` | One-time keypair generation |
| `scripts/generate_license.py` | Mint a signed blob |
| `src/license/crypto.py` | Sign / verify / dev-key detection |
| `src/license/_dev_keypair.py` | In-tree dev keypair (never ships in prod) |
| `src/license/manager.py` | `assert_production_safe` startup check |
| `src/license/features.py` | Tier → features mapping |
| `src/license_cli.py` | End-user `activate` / `status` / `renew` / `deactivate` |
| `~/.datatools/license.json` | Where activated licenses are stored on each machine |
| `~/.datatools-creator/issued.jsonl` | Creator-side issuance log (one JSON line per mint) |
| `docs/LICENSE-SERVER.md` | Design for the future online issuance + record-keeping system |
| `docs/SETUP-LICENSE-SERVER.md` | Self-hosted server install runbook (DNS, Docker, nginx, TLS, backups) |

241
docs/ARCHITECTURE.md Normal file

@@ -0,0 +1,241 @@
# ARCHITECTURE — end-to-end view
Stitches the desktop app (`TECHNICAL.md`) and the license server
(`LICENSE-SERVER.md`) into a single picture. Read this first for "how
does it all fit together"; drill into the per-component docs for
detail.
---
## 1. System diagram
```
┌────────────────────────────────────────────────────────────────────────┐
│ OPERATOR / DEVELOPER LAPTOP │
│ │
│ git clone / push ←─── code lives in git.invixiom.com │
│ datatools-admin CLI ─── manual mints, list, revoke ─────┐ │
│ ssh -L 8090:127.0.0.1:8090 ───── tunnel for /internal/* ─────┤ │
└────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┘
│ internal Bearer-auth API (over SSH tunnel only)
┌────────────────────────────────────────────────────────────────────────┐
│ LICENSE SERVER — 46.225.166.142 │
│ ───────────────────────────────────────────────────────────────── │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ nginx 1.24 (TLS termination, public reverse proxy) │ │
│ │ │ │
│ │ datatools.unalogix.com → static placeholder │ │
│ │ licenses.datatools.unalogix.com → 127.0.0.1:8090 (FastAPI) │ │
│ │ /internal/* on public surface → blocked (404) │ │
│ └────────────────────────────┬─────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────▼─────────────────────────────────────┐ │
│ │ FastAPI app — datatools-api (Docker container, UID 10001) │ │
│ │ │ │
│ │ ┌──────────────────┐ ┌──────────────────┐ ┌───────────────┐ │ │
│ │ │ /webhooks/* │ │ /internal/* │ │ /health │ │ │
│ │ │ (storefronts) │ │ (Bearer-auth) │ │ (public) │ │ │
│ │ └────────┬─────────┘ └────────┬─────────┘ └───────────────┘ │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ ┌────────────────────────────────────────┐ │ │
│ │ │ SourceAdapter (Protocol) — normalized │ │ │
│ │ │ • ManualAdapter • GumroadAdapter │ │ │
│ │ │ • (LemonSqueezy, Stripe — future) │ │ │
│ │ └────────────────┬───────────────────────┘ │ │
│ │ │ SaleEvent / RefundEvent │ │
│ │ ▼ │ │
│ │ ┌────────────────────────────────────────┐ │ │
│ │ │ mint_from_sale() │ │ │
│ │ │ • Ed25519 sign via PyCA cryptography │ │ │
│ │ │ • idempotent on (source, order_id) │ │ │
│ │ └────────────────┬───────────────────────┘ │ │
│ └────────────────────┼─────────────────────────────────────────────┘ │
│ │ SQL │
│ ┌────────────────────▼─────────────────────────────────────────────┐ │
│ │ Postgres 16 — datatools-postgres (container, vol pg_data) │ │
│ │ • licenses — authoritative customer record │ │
│ │ • gumroad_events — webhook audit log (idempotency, replay) │ │
│ └──────────────────────────────────────────────────────────────────┘ │
└───────────────────────┬────────────────────────────────┬───────────────┘
│ │
┌───────────┘ └──────────┐
│ POST /email (httpx) Gumroad Ping│
▼ POST │
┌───────────────────┐ ┌─────────────▼──┐
│ Postmark │ │ Gumroad │
│ (transactional │ │ (storefront, │
│ email) │ │ payments) │
└───────┬───────────┘ └────────────────┘
│ DKIM-signed email with license blob ▲
▼ │
┌────────────────────────────────────────────────────────────────┴───────┐
│ BUYER'S MACHINE │
│ │
│ Receives email ──► copies DTLIC2: blob ──► pastes into desktop app │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ DataTools desktop (Python 3.12 + Streamlit + Typer CLIs) │ │
│ │ │ │
│ │ ┌────────────────────────────────────────────────────────────┐ │ │
│ │ │ Activate screen — verifies blob signature │ │ │
│ │ │ against EMBEDDED Ed25519 public key │ │ │
│ │ │ (NO network call to the license server, ever) │ │ │
│ │ └─────────────────────────┬──────────────────────────────────┘ │ │
│ │ ▼ │ │
│ │ ~/.datatools/license.json (signed blob, mode 644, on disk) │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ Pays via web browser ─────► Gumroad ────► (kicks off the Ping) │
└────────────────────────────────────────────────────────────────────────┘
```
**Three primary flows**, distinguishable by where they start in the
diagram:
1. **Sale → fulfillment** (the automated path)
Buyer pays at Gumroad → Gumroad fires Ping to
`licenses.datatools.unalogix.com/webhooks/gumroad?secret=…` → nginx
→ FastAPI → audit-log row → adapter normalizes payload → `mint_from_sale`
writes the `licenses` row + Ed25519-signs the blob → Postmark emails
the buyer their blob. End-to-end latency: a few hundred milliseconds.
2. **Manual mint** (operator path — comps, support replacements)
Operator opens SSH tunnel → `datatools-admin mint` → `/internal/mint`
(Bearer-authed, never publicly reachable) → same `mint_from_sale`
path → blob returned in HTTP response. Operator delivers to buyer
out-of-band.
3. **Activation** (buyer path — fully offline)
Buyer pastes blob into desktop's Activate screen → desktop verifies
the Ed25519 signature against the public key **embedded in the
shipped binary** → license written to `~/.datatools/license.json`.
The desktop app makes **no network calls** to the license server at
any point. This preserves the "your data never leaves your computer"
promise (`DECISIONS.md §9b`).
---
## 2. Tech stack
Layered view of what technology lives where. "External SaaS" entries
are services we depend on but don't operate.
```
┌────────────────────────────────────────────────────────────────────────┐
│ DESKTOP APP (shipped binary, runs on buyer's box)                      │
├──────────────────┬─────────────────────────────────────────────────────┤
│ GUI              │ Streamlit 1.35 — local web server, browser opens    │
│ CLI              │ Typer 0.12 — per-tool entry points                  │
│ Core logic       │ pandas 2.x, numpy, rapidfuzz, charset-normalizer    │
│ Crypto (verify)  │ PyCA cryptography — Ed25519 public-key verify only  │
│ Storage          │ ~/.datatools/license.json (file, mode 644)          │
│ i18n             │ JSON catalogs in src/i18n/                          │
│ Build            │ PyInstaller — one-file binary, per OS               │
│ Runtimes         │ Python 3.12 (bundled into installer)                │
│ Platforms        │ Windows · macOS · Linux                             │
└──────────────────┴─────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────────┐
│ LICENSE SERVER (this box; non-buyer-facing)                            │
├──────────────────┬─────────────────────────────────────────────────────┤
│ Edge             │ nginx 1.24 + Let's Encrypt (auto-renew via timer)   │
│ HTTP framework   │ FastAPI 0.119 + Starlette + Pydantic v2             │
│ ASGI server      │ uvicorn 0.39 (+uvloop, +httptools, +watchfiles)     │
│ Form parsing     │ python-multipart (for Gumroad form-encoded Pings)   │
│ ORM              │ SQLAlchemy 2.0                                      │
│ Migrations       │ Alembic 1.18 (one initial migration so far)         │
│ Database         │ Postgres 16-alpine (containerized, single node)     │
│ Database driver  │ psycopg 3.3 (with binary wheel)                     │
│ Crypto (sign)    │ PyCA cryptography — Ed25519 private-key sign        │
│ HTTP client      │ httpx 0.28 (Postmark calls, test mocking)           │
│ Config           │ Pydantic Settings + YAML (products.yaml)            │
│ Container        │ Docker + Docker Compose v2 plugin                   │
│ Image base       │ python:3.12-slim                                    │
│ Process user     │ UID 10001 (non-root `app` user defined in image)    │
│ Logging          │ stdlib `logging` to container stdout → docker logs  │
│ Host OS          │ Ubuntu 24.04 LTS                                    │
└──────────────────┴─────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────────┐
│ OPERATOR / DEVELOPER MACHINE                                           │
├──────────────────┬─────────────────────────────────────────────────────┤
│ Source control   │ git → self-hosted Gitea (git.invixiom.com)          │
│ Admin CLI        │ Typer (src/admin_cli.py)                            │
│ Server access    │ SSH tunnel for /internal/* (no public exposure)     │
│ Break-glass      │ scripts/generate_license.py (offline-only mints,    │
│                  │ used when the license server is unreachable)        │
│ Test runner      │ pytest 8.3 + SQLite in-memory (no docker required)  │
│ Smoke test       │ bash + docker compose (server/scripts/smoke.sh)     │
└──────────────────┴─────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────────┐
│ EXTERNAL SaaS / dependencies                                           │
├──────────────────┬─────────────────────────────────────────────────────┤
│ Storefront       │ Gumroad — Ping webhook to /webhooks/gumroad         │
│ Transactional    │ Postmark — HTTP API for license-delivery emails     │
│ email            │ (LoggingEmailService fallback when token unset)     │
│ TLS CA           │ Let's Encrypt — ACME HTTP-01 challenge via certbot  │
│ Authoritative    │ supercp / cPanel (your DNS host for unalogix.com)   │
│ DNS              │ — Cloudflare front-door deferred                    │
│ Source hosting   │ Self-hosted Gitea (git.invixiom.com) — not on the   │
│                  │ datatools box; shares the same physical host        │
└──────────────────┴─────────────────────────────────────────────────────┘
```
---
## 3. Trust + isolation boundaries
Worth tracing explicitly because the threat model differs at each
boundary:
| Boundary | What crosses it | Trust model |
|---|---|---|
| Buyer ↔ Gumroad | Payment, buyer details | Out of scope — Gumroad's problem |
| Gumroad → license server (webhook) | Signed-by-shared-secret POST | URL secret check; non-matching = 404 (no info leak); audit-log everything regardless |
| License server → Postmark | DKIM-signed transactional mail | Postmark verified-sender domain; HTTP API auth via server token |
| License server → Postgres | SQL over local docker bridge | Same compose project; password from on-disk secret file |
| Operator → license server (`/internal/*`) | Bearer token over SSH tunnel | Token only on disk + in the operator's env; nginx blocks `/internal/*` publicly as defense-in-depth |
| License server → buyer (email) | Plaintext blob in inbox | Buyer's email account hygiene; we deliberately don't encrypt — blob is self-protecting (signature) |
| Buyer → desktop app (activation) | Signed blob pasted in | Verified against pubkey **embedded in the shipped binary**; no network call |
The single most important property to preserve: **the desktop app
never talks to the license server.** All trust in the desktop comes
from the embedded public key + the signed blob. This is what makes
the offline activation guarantee real, and what keeps a license-server
outage from breaking buyers who've already activated.
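The webhook row in the table above relies on a URL secret check that answers 404 on a mismatch. A minimal sketch of that comparison follows, assuming the secret arrives as a path segment; the function name `webhook_status` and the sample secret are illustrative and not taken from the server code.

```python
import hmac

# Hypothetical value; the real secret lives in the server's secrets directory.
EXPECTED_SECRET = "s3cr3t-from-disk"


def webhook_status(path_secret: str, expected: str = EXPECTED_SECRET) -> int:
    """Return the HTTP status for a webhook POST carrying `path_secret`.

    A mismatch yields 404 rather than 401/403, so a prober cannot tell
    that the endpoint exists. compare_digest avoids leaking, through
    timing, how many leading characters matched.
    """
    ok = hmac.compare_digest(path_secret.encode(), expected.encode())
    return 200 if ok else 404
```

Audit logging of every delivery, matching or not, would sit around this call rather than inside it.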
---
## 4. Where things are stored
| Lives on… | Path / location | Contents |
|---|---|---|
| Buyer's machine | `~/.datatools/license.json` | Activated license blob |
| Buyer's machine | Postmark email | Delivery copy of the blob |
| License server | `licenses` table (Postgres) | Authoritative customer record — name, email, tier, blob, source, order ID, promotion, amount paid |
| License server | `gumroad_events` table | Append-only webhook delivery audit log |
| License server | `/srv/datatools-license/secrets/` | Postgres password, admin Bearer token, (PR 2) Postmark token + Gumroad secret |
| License server | `/etc/letsencrypt/live/datatools.unalogix.com/` | TLS cert + key |
| Operator's laptop | `~/.datatools-creator/issued.jsonl` | Creator-side issuance log (pre-server era, kept as a break-glass backup) |
| Operator's laptop | Git clone of this repo | Source code, including `server/config/products.yaml` |
| Gitea | This repo's commits | Everything except secrets |
---
## 5. Related docs
| Doc | Scope |
|---|---|
| `TECHNICAL.md` | Desktop app internals (core libs, GUI, CLIs) |
| `LICENSE-SERVER.md` | Server architecture rationale + DB schema |
| `SETUP-LICENSE-SERVER.md` | Server install runbook (DNS, packages, nginx, TLS, Postgres) |
| `ADMIN.md` | Day-2 operations (minting, rotation, inspection) |
| `DECISIONS.md` | Architecture decision records — `§9b` = no online activation check |
| `USER-GUIDE.md` | Buyer-facing documentation |


@@ -47,7 +47,7 @@ Sell niche Python automation tools as one-time downloadable digital products. Ta
**Surface**: desktop install per OS (PyInstaller) with Streamlit GUI + CLI. Constrained demo on Streamlit Community Cloud.
## 4a. Lead bundle — Deduplicator
## 4a. Lead bundle — Find Duplicates
Highest pain density across all 4 personas. Feeds landing copy, demo design, feature priority. Tech spec: TECHNICAL.md §11.1.
@@ -208,7 +208,7 @@ Headroom enables optional ad spend ($100-200/mo) once a bundle has proven conver
## 13. Honest status (2026-05-01)
- 3 of 9 tools shipped (Dedup, Text Cleaner, Format Standardizer).
- 3 of 9 tools shipped (Find Duplicates, Clean Text, Standardize Formats).
- Cross-platform build pipeline designed, not yet built.
- macOS code signing not yet set up.
- Streamlit GUI shipped for the 3 ready tools.


@@ -8,15 +8,15 @@ Three CLI modules, one per Ready tool:
| Module | Command | Purpose |
|--------|---------|---------|
| `src.cli` | `python -m src.cli FILE` | Eliminador de duplicados |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Limpiador de texto |
| `src.cli` | `python -m src.cli FILE` | Buscar duplicados |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Limpiar texto |
| `src.cli_analyze` | `python -m src.cli_analyze FILE` | Analizador (read-only scan) |
Every command is **preview-only by default**; add `--apply` to write output.
---
# Eliminador de duplicados
# Buscar duplicados
```
python -m src.cli INPUT_FILE [OPTIONS]
@@ -125,7 +125,7 @@ Log: `logs/dedup_YYYYMMDD_HHMMSS.log`.
---
# Limpiador de texto
# Limpiar texto
```
python -m src.cli_text_clean INPUT_FILE [OPTIONS]
@@ -156,7 +156,7 @@ Character-level hygiene. See [TECHNICAL.md §10.2](TECHNICAL.md) (solo en i
- `--config PATH` / `--save-config PATH`.
### File
- `--sheet`, `--encoding`, `--header-row` — same as Eliminador de duplicados.
- `--sheet`, `--encoding`, `--header-row` — same as Buscar duplicados.
## Presets


@@ -6,15 +6,15 @@ Three CLI modules, one per Ready tool:
| Module | Command | Purpose |
|--------|---------|---------|
| `src.cli` | `python -m src.cli FILE` | Deduplicator |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Text Cleaner |
| `src.cli` | `python -m src.cli FILE` | Find Duplicates |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Clean Text |
| `src.cli_analyze` | `python -m src.cli_analyze FILE` | Analyzer (read-only scan) |
Every command is **preview-only by default** — add `--apply` to write output.
---
# Deduplicator
# Find Duplicates
```
python -m src.cli INPUT_FILE [OPTIONS]
@@ -123,7 +123,7 @@ Log: `logs/dedup_YYYYMMDD_HHMMSS.log`.
---
# Text Cleaner
# Clean Text
```
python -m src.cli_text_clean INPUT_FILE [OPTIONS]
@@ -154,7 +154,7 @@ Character-level hygiene. See [TECHNICAL.md §10.2](TECHNICAL.md) for the spec.
- `--config PATH` / `--save-config PATH`.
### File
- `--sheet`, `--encoding`, `--header-row` — same as Deduplicator.
- `--sheet`, `--encoding`, `--header-row` — same as Find Duplicates.
## Presets


@@ -67,7 +67,7 @@ Each candidate scored 1-5 on 6 dimensions. Total /30 → verdict.
**v1.2 rationale**:
- Buyer persona ("hates Excel work but can't code") won't learn a CLI. Refunds at this price.
- Deduplicator needs interactive review — not viable in pure CLI.
- Find Duplicates needs interactive review — not viable in pure CLI.
- Dual interface keeps CLI for automation without sacrificing primary buyer surface.
## 4a. Functional scope principle (v1.2)
@@ -170,11 +170,60 @@ $49-79/bundle · $149 full suite (when 3+ exist).
| Apr 28 (v1.3) | Add hosted browser demo as conversion lever | Direct consequence of Streamlit choice. See §5. |
| Apr 28 (v1.4) | Re-apply 04/06 boundary work (silent-drift recovery) | Stream B v1.2 content overwritten in parallel v1.3 work. Restored per no-silent-drift rule. |
| Apr 28 (v1.5) | Add `02_text_cleaner.py`; renumber 02-08 → 03-09 | Character-level hygiene had no clear owner. See TECHNICAL §10. |
| Apr 29 (v1.7) | Adopt Text Cleaner Tier 1/2/3 spec; lock `excel-hygiene` default | Promotes from stub to buildable v1 target. Full spec in TECHNICAL §11.2. |
| Apr 29 (v1.7) | Adopt Clean Text Tier 1/2/3 spec; lock `excel-hygiene` default | Promotes from stub to buildable v1 target. Full spec in TECHNICAL §11.2. |
| Apr 28 (v1.6) | Fold conversation-history content into docs (deduplicator spec, lead bundle use cases, full GUI matrix, 04/06 examples, Streamlit-to-SaaS reasoning) | No new decisions; promote at-risk analysis from chat history per no-silent-drift rule. |
| May 1 (v1.6) | Mark Format Standardizer **Ready** | 199-row buyer corpus passing; Tier 1 + most Tier 2 built. |
| May 1 (v1.6) | Mark Standardize Formats **Ready** | 199-row buyer corpus passing; Tier 1 + most Tier 2 built. |
| May 1 (v1.6) | Add `src/core/errors.py` structured hierarchy | Uniform helpful messages across CLI + GUI. See TECHNICAL §7. |
| May 13 (v1.6) | Ship in-house JSON i18n + EN/ES packs | Expand addressable market (Spanish-first buyers, LatAm bookkeepers) without a `gettext` build step. JSON packs editable by non-devs; parity test prevents drift. See TECHNICAL §10b. |
| May 13 (v1.6) | Ship licensing: 1-year HMAC-signed blobs, name+email registration, offline verification, tier-scaffolded for future SKUs | Unlock the lifetime-update business model without recurring infra. Honor-system DRM (HMAC + 30-day refund) — sufficient at $49. See §9b below. |
| May 13 (v1.6) | Add Lite SKU (Find Duplicates + Clean Text + Standardize Formats) | Lower-priced entry point for buyers who only need the three universal tools. Per-tool feature gating + lock badges on the home grid surface the upgrade path. See §9b. |
| May 13 (v1.6) | Remove user-facing free trial | A 1-year all-features trial undercut the paid Lite SKU. Paid-only keeps tier economics clean. Internal ``_mint`` API still exists for tests and the seller's key generator. See §9b. |
| May 13 (v1.6) | Upgrade license crypto: HMAC → Ed25519 (asymmetric) | HMAC's symmetric secret was extractable from the shipped binary — anyone with the binary could mint blobs. Ed25519 splits sign (seller) from verify (binary), so binary compromise doesn't let an attacker forge licenses. Blob prefix bumped DTLIC1 → DTLIC2. See §9b. |
| May 13 (v1.6) | Add ``assert_production_safe`` tripwire | A shipped build with ``DATATOOLS_DEV_MODE=1`` or the in-source dev pubkey would silently defeat licensing. The tripwire refuses to boot such a build. No-op in source / pytest runs. See §9b. |
## 9b. Licensing model
**Decision (v1.6)**: offline signed license blobs (HMAC when first adopted, upgraded to Ed25519 later in v1.6; see the threat model below), 1-year lifetime, name + email registration required. Tier-scaffolded so future SKUs (PRO, ENTERPRISE) can carve per-tool feature sets without code changes.
| Option | Verdict |
|---|---|
| **Offline signed blob (chosen)** | **CHOSEN.** No server, no internet, fits the no-touch constraint. Honor-system at this price point. |
| Online activation check | Rejected. Conflicts with the "your data never leaves your computer" promise; introduces support load (server downtime, network issues). |
| No license at all | Rejected. The lifetime-update value prop requires *some* gating to make renewal meaningful. |
| Time-bombed binary (PyInstaller --no-license) | Rejected. Can't deliver renewals without re-shipping the installer. |
| Hardware-locked license | Rejected. Friction on legitimate device-swaps; doesn't match the buyer persona's tolerance. |
**Threat model** (v1.6 — Ed25519): the binary ships only the public key. A motivated reverse engineer who pulls everything out of the binary has the verification key but not the signing key — they can't mint new licenses. The earlier HMAC scheme had this hole; the asymmetric upgrade closes it. The remaining attack surface is:
- Re-signing with a forked binary that ships an attacker-controlled pubkey + auto-grants licenses. Costs more effort than the price of a legitimate copy and the result is per-fork, not shareable.
- Hooking the verification call to always return True. Defeats DRM entirely but only on the attacker's own machine — they could just write down "I unlocked DataTools" and skip the work.
- Setting ``DATATOOLS_DEV_MODE=1`` to bypass checks. **Refused in shipped builds** by ``assert_production_safe``; works in source/test runs only.
The 30-day refund window covers casual blob sharing from a different angle (anyone who shares their blob is implicitly authorizing the buyer to issue them a refund-on-demand).
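The `assert_production_safe` tripwire described above can be sketched as follows. This is an illustration under stated assumptions, not the shipped implementation: the signature, the `UnsafeBuildError` name, and the placeholder dev key are invented here, and "shipped build" is taken to mean a PyInstaller-frozen process (PyInstaller sets `sys.frozen` at runtime).

```python
import os
import sys

# Hypothetical placeholder for the in-source development public key.
DEV_PUBKEY_HEX = "00" * 32


class UnsafeBuildError(RuntimeError):
    """Raised when a shipped build would silently bypass licensing."""


def assert_production_safe(pubkey_hex, env=None, frozen=None):
    """Refuse to boot a frozen build that is in dev mode or ships the dev key.

    `frozen` defaults to PyInstaller's `sys.frozen` marker, so the check
    is a no-op when running from source or under pytest.
    """
    env = os.environ if env is None else env
    if frozen is None:
        frozen = bool(getattr(sys, "frozen", False))
    if not frozen:
        return
    if env.get("DATATOOLS_DEV_MODE") == "1":
        raise UnsafeBuildError("DATATOOLS_DEV_MODE=1 is set in a shipped build")
    if pubkey_hex == DEV_PUBKEY_HEX:
        raise UnsafeBuildError("shipped build embeds the development public key")
```

Passing `env` and `frozen` explicitly keeps the check testable without freezing a binary.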
**What's enforced**:
- License blob signature must verify (Ed25519, against the public key embedded in the binary).
- Buyer-entered name + email must match the values embedded in the blob.
- Expiry date must be in the future.
- Tier must include the requested feature.
**What's NOT enforced**:
- Number of devices the same blob is used on (no concurrent-use detection).
- Reverse-engineered re-signing of expired blobs (would require RSA / online check).
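The non-cryptographic checks in the "enforced" list can be pictured as one ordered function. The sketch below assumes the signature has already been verified, and every name in it (`LicensePayload`, its fields, `check_license`, case-insensitive matching of name and email) is a hypothetical choice for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LicensePayload:
    # Field names are illustrative; the real blob layout is not shown here.
    name: str
    email: str
    expires: date
    features: frozenset


def check_license(payload, name, email, feature, today):
    """Return None when the license allows `feature`, else a reason string.

    Assumes the blob's signature was already verified; this covers only
    the remaining checks from the list above.
    """
    if payload.name.strip().casefold() != name.strip().casefold():
        return "name mismatch"
    if payload.email.strip().casefold() != email.strip().casefold():
        return "email mismatch"
    if payload.expires <= today:  # expiry must be strictly in the future
        return "expired"
    if feature not in payload.features:
        return "feature not in tier"
    return None
```

Nothing here counts devices, which matches the "not enforced" list: the same payload passes on any machine.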
**Future SKUs**: the ``FEATURES_BY_TIER`` table in ``src/license/features.py`` is the single source of truth for "which tools each tier unlocks". Adding a PRO SKU that excludes Automated Workflows is a 1-line edit there + a 1-line edit at the gate site. No consumer-code churn.
**v1.6 SKU lineup**:
| Tier | Tools unlocked | Notes |
|---|---|---|
| LITE | Find Duplicates, Clean Text, Standardize Formats | Entry SKU. Three universal tools that handle the most common bookkeeping / RevOps / Klaviyo prep workflows. |
| CORE | All 9 tools | Full v1 suite. |
| PRO | All 9 tools (scaffolded) | Reserved for future per-feature carve-outs (e.g., scheduled pipelines, API access). |
| ENTERPRISE | All 9 tools (scaffolded) | Reserved for future bulk / multi-seat SKUs. |
| TRIAL | Same as LITE | Deprecated — no longer issuable. Mapping kept for any legacy on-disk trial licenses to load without error. |
**Trial removed (v1.6)**: a 1-year free trial that unlocked every tool would undercut the paid Lite SKU (why pay for Lite when trial gives more for longer?). Paid-only keeps the funnel clean. The internal ``LicenseManager._mint`` API still exists for tests and for the seller's ``scripts/generate_license.py`` key generator; there's no user-facing way to self-issue a license.
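As an illustration of the "single source of truth" idea behind `FEATURES_BY_TIER`, here is a minimal sketch that mirrors the SKU lineup table. The tool ids are hypothetical slugs, `is_unlocked` is an invented helper, and only tools named in this document are listed, so `ALL_TOOLS` is shorter than the real nine-tool suite.

```python
from enum import Enum


class Tier(Enum):
    LITE = "lite"
    CORE = "core"
    PRO = "pro"
    ENTERPRISE = "enterprise"
    TRIAL = "trial"  # deprecated; kept so legacy licenses still load


# Hypothetical slugs; the remaining tools of the suite are omitted here.
LITE_TOOLS = frozenset({"find_duplicates", "clean_text", "standardize_formats"})
ALL_TOOLS = LITE_TOOLS | {"fix_missing_values", "automated_workflows"}

FEATURES_BY_TIER = {
    Tier.LITE: LITE_TOOLS,
    Tier.CORE: ALL_TOOLS,
    Tier.PRO: ALL_TOOLS,
    Tier.ENTERPRISE: ALL_TOOLS,
    Tier.TRIAL: LITE_TOOLS,
}


def is_unlocked(tier, tool_id):
    """Single gate check: callers ask the table, never the tier name."""
    return tool_id in FEATURES_BY_TIER[tier]
```

Under this layout, carving a tool out of PRO would mean replacing `ALL_TOOLS` in one row, for example with `ALL_TOOLS - {"automated_workflows"}`.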
## 8. Re-lock triggers


@@ -32,17 +32,22 @@ rebuilds it from a stale headline.
| Friction kills conversion | BUSINESS.md §7 | Demo dataset preloaded; no "select a file" first-step |
| < $1,200/mo recurring | BUSINESS.md §9 | Migration plan to $5/mo VPS only after rate-limit signal |
## 3. The three personas (per PLAN.md §2.3)
## 3. The three personas — one audience: accounting (per PLAN.md §2.3)
We niche to **accounting** and enter through the three workflows where a
messy export costs real money. Same engine, three landing pages — each
is the same buyer at a different desk (bookkeeping, payables, receivables).
| Tag | Persona | Top-of-funnel keyword | Demo dataset | Pre-saved pipeline |
|---|---|---|---|---|
| `shopify-pet` | Shopify operator (priority: pet supplies) | "shopify customer cleanup" | `samples/demo/shopify_pet_customers.csv` | `shopify_pet_pipeline.json` |
| `bookkeeper` | Bookkeeper / freelance accountant | "reconcile bank export csv" | `samples/demo/bookkeeper_bank_reconcile.csv` | `bookkeeper_bank_pipeline.json` |
| `revops` | Marketing / RevOps agency | "dedupe lead list across vendors" | `samples/demo/agency_combined_leads.csv` | `agency_leads_pipeline.json` |
| `bookkeeper` | Bookkeeper — bank reconciliation | "reconcile bank export csv duplicates" | `samples/demo/bank_reconciliation.csv` | `bank_reconciliation_pipeline.json` |
| `ap-1099` | Accounts payable — 1099 vendor prep | "clean 1099 vendor list missing EIN" | `samples/demo/vendor_1099.csv` | `vendor_1099_pipeline.json` |
| `ar-aging` | Accounts receivable — open invoices | "remove duplicate invoices aging report" | `samples/demo/ar_open_invoices.csv` | `ar_open_invoices_pipeline.json` |
Each persona gets its **own landing page URL**, its **own demo dataset
loaded by default**, and its **own H1 + below-the-fold copy.** The
engine is identical; only positioning differs.
Each persona gets its **own landing page URL** (`?p=<tag>`), its **own
demo dataset loaded by default**, and its **own H1 + below-the-fold
copy** — wired in `src/gui/app_demo.py::PERSONAS`. The engine is
identical; only positioning differs.
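A rough sketch of how a `?p=<tag>` value could resolve to a persona entry. The dataset and pipeline names come from the table above; the dict layout, the `resolve_persona` helper, and the fallback to `bookkeeper` for missing or unknown tags are assumptions about `PERSONAS`, not a copy of it.

```python
DEFAULT_PERSONA = "bookkeeper"  # assumed fallback for missing or unknown tags

# Layout is illustrative; the real dict also carries H1 / sub / CTA copy.
PERSONAS = {
    "bookkeeper": {
        "dataset": "samples/demo/bank_reconciliation.csv",
        "pipeline": "bank_reconciliation_pipeline.json",
    },
    "ap-1099": {
        "dataset": "samples/demo/vendor_1099.csv",
        "pipeline": "vendor_1099_pipeline.json",
    },
    "ar-aging": {
        "dataset": "samples/demo/ar_open_invoices.csv",
        "pipeline": "ar_open_invoices_pipeline.json",
    },
}


def resolve_persona(query_params):
    """Map the `p` query parameter to a known persona tag."""
    tag = (query_params.get("p") or "").strip().lower()
    return tag if tag in PERSONAS else DEFAULT_PERSONA
```

A retired tag such as `shopify-pet` would then land on the bookkeeper demo instead of an error page.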
## 4. Demo dataset specifications
@@ -53,114 +58,77 @@ persona's tooling. Each contains every kind of pollution the bundle's
five tools fix, so a single demo run shows every tool earning its
keep.
### 4.0 Pain-point coverage map
### 4.0 Value-proof map
Each demo dataset is engineered so the buyer sees their **own top
pain** demonstrated in the AFTER preview. The mapping below pairs
each pain from PLAN.md §2.3a with the rows / columns that exercise
it. Refresh the dataset only when this coverage drops.
Each demo dataset is engineered so the buyer sees their **own top pain**
fixed in the AFTER preview, with one unmistakable headline number. All
three run the same saved 4-step pipeline (Clean Text → Standardize
Formats → Fix Missing Values → Find Duplicates). The numbers below are
**validated against the live engine** (`tests/test_demo_pipelines.py`
pins them) — refresh the dataset only if a number stops landing.
| Persona | Pain (from PLAN §2.3a) | Demo coverage |
| Persona | Headline proof | What the visitor watches happen |
|---|---|---|
| Shopify pet | S1 — Klaviyo per-contact dupes | 5 dup pairs across rows 1-15 (case + format + address-twin variants) |
| Shopify pet | S2 — feed-rejection chars | smart-quote / NBSP / BOM in rows 1-6, 9, 11 |
| Shopify pet | S3 — multi-channel | partner-style customer IDs (`SHOP-`); demonstration of column-level mapping covered in RevOps demo |
| Shopify pet | S4 — subscription identity | rows 1+2, 7+8, 9+10 — same person, different format |
| Shopify pet | S5 — VAT-MOSS country drift | rows 16-18 (`United Kingdom` / `U.K.` / `UK`) + rows 19-20 (`Germany`/`Italia`) |
| Bookkeeper | B1 — month-overlap re-import | 7 dup pairs spanning Jan↔Feb and Mar boundaries |
| Bookkeeper | B2 — 1099 vendor consolidation | Amazon × 3 spellings, Verizon × 2, Acme Realty × 2, Adobe × 2, Costco × 2, Zoom × 2, Stripe × 4 |
| Bookkeeper | B3 — audit trail | every cell change in the run logged with old/new/rule — surface in the demo's audit tab |
| Bookkeeper | B4 — per-license economics | demonstrated by pricing copy, not data |
| Bookkeeper | B5 — multi-currency | rows 26 (EUR), 27 (GBP), 28 (BRL with comma decimal), 29 (parens-negative) |
| RevOps | R1 — per-contact tier | 6 cross-source dup pairs (HubSpot × LinkedIn × Manual Scrape) |
| RevOps | R2 — deliverability | rows 26-27 (`uma at uniform dot com`, `victor@@victorco.com` invalid emails) |
| RevOps | R3 — GDPR / privacy | demonstrated by the network-tab moat panel + zero-upload claim |
| RevOps | R4 — vendor unification | 3 source values (HubSpot / LinkedIn / Manual Scrape), 13 country codes, mixed-shape headers |
| RevOps | R5 — suppression list | rows 29-30 (`Suppressed`, `Opted Out` tags) |
| Bookkeeper | **26 → 20 rows · 6 phantom duplicates removed** | The same payment posted twice (different date + amount format) collapses to one; dates go ISO, parens-negatives become real negatives |
| AP / 1099 | **24 records → 8 vendors · 7 missing EINs recovered** | Each vendor's scattered records merge into one complete row; `merge=true` backfills the EIN/address/phone that any single record was missing |
| AR aging | **26 → 21 rows · 5 double-entered invoices removed** | Duplicate invoice numbers collapse; a blank status is backfilled from its twin; invoice + due dates go ISO, amounts numeric |
### 4.1 `shopify_pet_customers.csv` (20 rows)
### 4.1 `bank_reconciliation.csv` (26 rows) — Bookkeeper
**Looks like**: a Shopify customer export filtered for "Pet Supplies"
sales channel, 12 months activity.
**Looks like**: two months (Jan + Feb 2025) of business-checking activity
from a bank portal, where the Feb re-export overlaps Jan so the same
transaction posts twice. Columns: `Date, Description, Vendor, Category,
Amount, Account`.
**Pollution included**:
- Whitespace padding (" Alice ", "Sydney Opera House Drive ")
- Mixed phone formats: `(415) 555-1234`, `415.555.1234`, `5559876543`,
`+1 555-111-1111`
- International phones: GB, ES, DE, AU, JP (15 demo rows span 6
countries)
- Currency variants: `$1,240.50`, `£890.25`, `€2.410,75` (EU comma
decimal), `A$ 1,299.00`, `¥75000`
- Date formats: `2025-12-04`, `12/15/2025`, `?`, `(blank)`, `(none)`,
`#N/A`
- Disguised nulls: `N/A`, blank, `(blank)`, `?`, `#N/A`, `(none)`,
`unknown`
- Name casing: `EVE MARTINEZ`, `henry`, `O'NEIL`, `noah`, mixed Title /
ALL CAPS / lower
- Email case variants that *should* dedup: `Bob@PetShop.com` vs
`alice@petshop.com`
- 4 fuzzy duplicates (Alice/Bob same address, Grace/Henry same phone,
Carlos/Olivia same address, Ivy/Jack same address)
- Mixed date formats: `01/15/2025`, `2025-01-15`, `Jan 18 2025`, `1/27/25`, `Feb 5 2025`.
- Currency formats incl. negatives: `-$129.99`, `($89.50)` parens-negative, `+$3,450.00`, `- $599.88`, bare `-129.99`, `(50.00)`.
- Whitespace + NBSP padding; smart quotes and an em-dash inside descriptions.
- Vendor casing variety on *non-duplicate* rows: `Amazon` / `amazon.com` / `AMAZON.COM`, `Verizon` / `verizon`.
- Disguised nulls in Category: `—`, `(blank)`, `?`, `unknown`, `TBD`.
- **6 duplicate transactions** — each pair shares the same vendor + real value but a different date *and* amount format, so they collapse only after standardization.
**After running the pipeline**: 20 rows → 15, ~29 cells canonicalized,
~45 sentinels standardised, 5 cross-row duplicates merged. The
customer table is now Klaviyo-import-ready and the country column
(previously `UK` / `U.K.` / `United Kingdom` / `Germany` / `Italia`)
is GB / DE / IT — VAT MOSS report won't break.
**After running the pipeline** (validated): **26 → 20 rows, 6 duplicates
removed**, 36 date/amount cells standardized (0 unparseable), all dates
ISO, parens-negatives resolved (`($89.50)` → `-89.50`), disguised-null
categories flagged. The reconciliation ties out.
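Why the duplicates "collapse only after standardization" is easiest to see in code. The sketch below is a stdlib-only stand-in for the real pipeline (which uses pandas); the function names, the format list, and the sample rows are assumptions chosen to match the pollution listed above.

```python
from datetime import datetime
from decimal import Decimal

# Formats seen in the demo data; %Y is tried before %y so 4-digit years win.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y", "%b %d %Y")


def to_iso_date(raw):
    """Parse any of the known date shapes into YYYY-MM-DD."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unparseable date: {raw!r}")


def to_amount(raw):
    """Turn '$', thousands separators, and parens-negatives into a Decimal."""
    text = raw.strip().replace("\u00a0", "")
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    text = text.replace("$", "").replace(",", "").replace(" ", "")
    value = Decimal(text)
    return -value if negative else value


def dedupe(rows):
    """Keep the first row per (ISO date, vendor, numeric amount) key."""
    seen, kept = set(), []
    for date_raw, vendor, amount_raw in rows:
        key = (to_iso_date(date_raw), vendor.strip().casefold(), to_amount(amount_raw))
        if key not in seen:
            seen.add(key)
            kept.append(key)
    return kept
```

Comparing the raw strings `01/15/2025` / `+$3,450.00` against `2025-01-15` / `3450.00` finds no match; comparing the normalized keys does.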
### 4.2 `bookkeeper_bank_reconcile.csv` (30 rows)
### 4.2 `vendor_1099.csv` (24 rows) — Accounts payable / 1099
**Looks like**: two months of business checking + credit-card activity
exported from a bank portal, with the Feb export accidentally
overlapping the Jan export at the month boundary.
**Looks like**: a 1099-NEC vendor master list where the same vendor was
entered 2-3 times across the year by different staff, each record holding
only *part* of the vendor's details. Columns: `Vendor, Contact, Email,
Phone, EIN, Address, Total_Paid`.
**Pollution included**:
- Mixed date formats: `01/15/2025`, `2025-01-15`, `Jan 18 2025`,
`1/27/25`, `Feb 5 2025`
- Currency formats: `-$129.99`, `($89.50)` parens-negative,
`+$3,450.00`, `- $599.88` space, bare `-129.99`, `(50.00)`
- Header trailing whitespace: `"Date "`
- Smart quotes around descriptions: `"autopay"`
- Em-dash sentinels in Vendor: `—`
- Smart-em-dash inside descriptions: `STAPLES #4422 — paper, toner`
- Vendor casing inconsistency: `Amazon` / `amazon.com` / `AMAZON.COM`,
`Verizon` / `verizon`
- 6 duplicate transactions (same date+amount+vendor recorded twice
with different formats)
- The duplicate records for a vendor share one email differing only by case/whitespace (the reliable dedup key, matched with the `email` normalizer).
- EIN / Phone / Address scattered across the duplicate set so no single record is complete but the union is — gaps marked `—`, `(blank)`, `TBD`, `unknown`, `N/A`.
- Vendor name casing/spelling variants, phone formats, EIN formats (`12-3456789` vs `123456789`), `Total_Paid` currency variants.
**After running the pipeline**: 30 rows → 23, ~84 cells normalized, 7
duplicates removed (month-overlap + VAT-MOSS dups). All dates
ISO-formatted, all amounts numeric (including EUR/GBP/BRL with comma
decimal), vendor casing canonical, parens-negative resolved.
**After running the pipeline** (validated): **24 records → 8 vendors, 16
duplicates removed, 7 missing EINs recovered** by `merge=true` +
`most_complete` survivor, 35 disguised nulls caught, phones/emails/amounts
standardized (0 unparseable). One vendor genuinely has no EIN in any
record — it survives with a blank EIN as the realistic "flag for
follow-up" case.
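The `merge=true` + `most_complete` behaviour can be approximated in a few lines. This is a simplified sketch, not the engine's code: it assumes every record has the same columns, groups on a lowercased email, and treats the gap markers listed above as nulls. All function names are hypothetical.

```python
# Gap markers from the dataset description, lowercased; "\u2014" is the dash sentinel.
DISGUISED_NULLS = {"", "\u2014", "(blank)", "tbd", "unknown", "n/a"}


def clean(value):
    """Map disguised nulls to None; otherwise return the stripped value."""
    text = value.strip()
    return None if text.casefold() in DISGUISED_NULLS else text


def merge_group(records):
    """Pick the most complete record, then backfill its gaps from the rest."""
    cleaned = [{k: clean(v) for k, v in r.items()} for r in records]
    survivor = dict(max(cleaned, key=lambda r: sum(v is not None for v in r.values())))
    for other in cleaned:
        for field, value in other.items():
            if survivor[field] is None and value is not None:
                survivor[field] = value
    return survivor


def consolidate(records, key_field="Email"):
    """Group records by normalized email and merge each group to one row."""
    groups = {}
    for record in records:
        key = record[key_field].strip().casefold()
        groups.setdefault(key, []).append(record)
    return [merge_group(group) for group in groups.values()]
```

A vendor whose EIN is missing from every record keeps `None` in that field, which corresponds to the "flag for follow-up" case.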
### 4.3 `agency_combined_leads.csv` (30 rows)
### 4.3 `ar_open_invoices.csv` (26 rows) — Accounts receivable
**Looks like**: a marketing-ops worksheet combining lead exports from
HubSpot + LinkedIn Sales Navigator + manual scraping, ready for
campaign targeting.
**Looks like**: an open-invoices (unpaid AR) export where some invoices
were double-entered in different formats and client contacts are messy.
Columns: `Invoice, Client, Email, Invoice_Date, Due_Date, Amount, Status`.
**Pollution included**:
- Phone formats per region: US, UK, Spain, Germany, China, India,
Australia, Mexico, Israel, Singapore, Hong Kong, Italy, South
Korea — 13 country codes
- Country column inconsistent: `USA` / `US` / `United States`
- Disguised nulls: `N/A`, `unknown`, `(unknown)`, `(blank)`, `(none)`,
`?`, `—`, `#N/A`, `TBD`
- Source column tags origin (`HubSpot` / `LinkedIn` / `Manual Scrape`)
- Email duplicates across sources with case variants: `alice@acme.com`
+ `Alice.Johnson@acme.com`, `bob@beta.com` + `Bob@Beta.com`,
`diana@delta.com` from two sources, `carlos@gamma.io` from two
sources, `Frank@Foxtrot.de` + `frank@foxtrot.de`
- Name casing: `DIANA LEE`, `henry`, `IVY CHEN`, mixed
- 6 fuzzy / cross-source duplicates designed to survive the dedup
- Score column with sentinel pollution that needs coercion to integer
- Two date columns with mixed formats; currency variants incl. a credit memo `($300.00)` → `-300.00`.
- Client name casing variety; email case variants (`AP@Acme.com` vs `ap@acme.com`).
- Status disguised nulls: `—`, `?`, `(blank)`, `TBD`, `unknown`, `(none)`.
- **5 double-entered invoices** — same invoice number twice, dates/amount in different formats, one copy with a blank status the other fills.
**After running the pipeline**: 30 rows → 24, ~43 cells canonicalized,
14 sentinels resolved, 6 cross-source duplicates merged with `merge=true`
so each survivor inherits the most-complete picture. Invalid-email
rows (deliverability stress) and `Suppressed`/`Opted Out` tags
(suppression-list use case) survive as flagged rows the operator
manually reviews.
**After running the pipeline** (validated): **26 → 21 rows, 5 duplicate
invoices removed**, both date columns ISO + amounts numeric + emails
lowercased (0 unparseable), 7 disguised-null statuses caught, and a blank
status backfilled from its twin via `merge=true`. The aging report stops
double-counting.
## 5. UX flow (per persona)
@@ -174,26 +142,26 @@ dedicated `app_demo.py` for the cloud build).
│ "{Persona-specific H1}" │
├──────────────────────────────────────────────────────────┤
│ │
│ Sample dataset preloaded: shopify_pet_customers.csv │
│ Sample dataset preloaded: bank_reconciliation.csv
│ [Replace with your own file (capped 100 rows)] │
│ │
│ ┌─ BEFORE preview (15 rows) ─────────────────────────┐ │
│ │ Alice | (415) 555-1234 | $1,240.50 | … │ │
│ │ Bob | 415.555.1234 | $1,240.50 | … │ │
│ ┌─ BEFORE preview (26 rows) ─────────────────────────┐ │
│ │ 01/15/2025 | Stripe | +$3,450.00 | … │ │
│ │ 2025-01-15 | Stripe | 3450.00 | … (dup) │ │
│ │ ... │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ Pipeline (saved): │
│ 1. Text Clean → 2. Format Standardize → │
│ 3. Missing → 4. Deduplicate
│ 1. Clean Text → 2. Standardize Formats → │
│ 3. Fix Missing → 4. Find Duplicates
│ │
│ [▶ Run pipeline] │
│ │
│ ┌─ AFTER preview ───────────────────────────────────┐ │
│ │ 15 rows → 11 (4 duplicates merged) │ │
│ │ 27 cells canonicalized · 33 sentinels resolved │ │
│ │ 26 rows → 20 (6 duplicate transactions removed) │ │
│ │ 36 cells standardized · 4 disguised nulls flagged │ │
│ │ │ │
│ │ Alice Johnson | +14155551234 | 1240.50 | … │ │
│ │ 2025-01-15 | Stripe | 3450.00 | … │ │
│ │ ... │ │
│ └──────────────────────────────────────────────────┘ │
│ │
@@ -244,27 +212,35 @@ not "demo crippled" data.
## 7. CTA copy (per persona)
### 7.1 Shopify pet operator
Copy lives in `src/gui/app_demo.py::PERSONAS` (H1 / sub / CTA per tag);
keep this section in sync with that dict.
- **H1**: *Clean your customer / vendor / subscriber exports — locally.*
- **Sub**: *Klaviyo-import-ready in 30 seconds. Catches duplicates Excel
misses. Your data never leaves your computer.*
- **CTA**: *Get DataTools for Shopify — $49 →*
### 7.1 Bookkeeper — bank reconciliation (`?p=bookkeeper`)
### 7.2 Bookkeeper / freelance accountant
- **H1**: *Reconcile messy bank exports. Hand your client an audit
trail.*
- **Sub**: *Catches the duplicate transaction QuickBooks imported twice.
Standardizes dates, amounts, vendor casing. Every change auditable.*
- **H1**: *Catch the transactions your bank export posted twice. Locally.*
- **Sub**: *When the Jan and Feb exports overlap, the same payment posts
twice in two formats. DataTools standardizes every date and amount, then
dedups on the real transaction so your reconciliation ties out — 26 rows
→ 20, six phantom duplicates gone.*
- **CTA**: *Get DataTools for Bookkeepers — $49 →*
### 7.3 Marketing / RevOps agency
### 7.2 Accounts payable — 1099 prep (`?p=ap-1099`)
- **H1**: *Dedupe leads across HubSpot, LinkedIn, and manual scrapes.*
- **Sub**: *International phones, country normalization, fuzzy dedup
with merge — one tool, one schema, no upload.*
- **CTA**: *Get DataTools for RevOps — $49 →*
- **H1**: *Build a clean 1099 vendor list — with the missing EINs filled in.*
- **Sub**: *The same vendor entered three times, each record holding only
part of the details. DataTools consolidates to one row and backfills the
gaps from the duplicates — 24 records → 8 vendors, 7 missing EINs
recovered.*
- **CTA**: *Get DataTools for Accounting — $49 →*
### 7.3 Accounts receivable — open invoices (`?p=ar-aging`)
- **H1**: *Stop chasing the invoices your aging report counted twice. Locally.*
- **Sub**: *Double-entered invoices inflate your AR aging and your
follow-ups. DataTools standardizes dates and amounts, lowercases client
emails, and removes the duplicate invoice numbers — 26 rows → 21, five
phantom invoices off the books.*
- **CTA**: *Get DataTools for Accounting — $49 →*
## 8. Telemetry / conversion tracking


@@ -33,7 +33,7 @@ CLI (src/cli*.py) GUI (src/gui/app.py + pages/)
| `core.errors` | `DataToolsError` hierarchy, `ensure_dataframe()`, `ensure_choice()`, `wrap_file_read/write()`, `format_for_user()` |
| `core._constants` | `US_STATE_NAMES`, `US_STATE_CODES`, `USPS_EXPANSIONS`, `USPS_COMPRESSIONS` |
## Data flow — Deduplicator
## Data flow — Find Duplicates
```
read_file() # auto-detect encoding, delimiter, header
@@ -96,6 +96,36 @@ DeduplicationResult # deduplicated_df, removed_df, match_groups, l
No other call sites change. Gate auto-discovers it via the registry.
### Tool page header — `render_tool_header(tool_id)`
Every tool page renders its title block via `render_tool_header(tool_id)` in `src/gui/components/_legacy.py` — do not call `st.title()` + `st.caption()` directly. The helper renders:
- `tools.<id>.page_title` as the page title (left column).
- A **Help** popover button right of the title (icon `:material/help_outline:`, label from `help.button_label`). Clicking opens an `st.popover` containing the markdown body.
- `tools.<id>.page_caption` as the caption below.
All copy is i18n-driven; editors can tweak help text without touching Python. If a tool is missing its `help_md` key, the popover falls back to `help.missing_body`.
**`help_md` structure** (markdown, stored as a single string with `\n` line breaks in JSON):
```
**When to use**
- bullet 1
- bullet 2
**Steps**
1. numbered step
2. numbered step
**Examples**
- example 1
- example 2
**Tip** one-sentence pro tip.
```
Keep it short — the popover is intentionally compact. Mirror the structure across every tool so the muscle memory transfers.
### i18n — language packs
The GUI's user-facing strings live in `src/i18n/packs/<code>.json`, keyed by ISO-639-1 code. English (`en.json`) is canonical; missing keys in other packs fall back to English, and missing keys in English fall back to the literal dotted key so a typo is visible rather than silent.
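The fallback chain described above can be sketched roughly as follows. This is a minimal illustration, not the actual `src/i18n` code: the in-memory `PACKS` dict, the nested-dict pack layout, and the `_lookup` helper are assumptions made for the example.

```python
from typing import Any, Dict, Optional

# Hypothetical in-memory packs; the real ones are loaded from
# src/i18n/packs/<code>.json.
PACKS: Dict[str, Dict[str, Any]] = {
    "en": {"gate": {"warning": "Normalize {name} first."}, "home": {"title": "Home"}},
    "es": {"home": {"title": "Inicio"}},
}


def _lookup(pack: Dict[str, Any], dotted_key: str) -> Optional[str]:
    """Walk a nested dict along a dotted key; None if any segment is missing."""
    node: Any = pack
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node if isinstance(node, str) else None


def t(dotted_key: str, lang: str = "en", **params: str) -> str:
    """Resolve a key: active pack, then English, then the literal dotted key."""
    template = _lookup(PACKS.get(lang, {}), dotted_key)
    if template is None:
        template = _lookup(PACKS["en"], dotted_key)
    if template is None:
        return dotted_key  # a typo stays visible instead of rendering blank
    return template.format(**params) if params else template
```

With these sample packs, a Spanish user sees `Inicio` for `home.title`, the English text for `gate.warning`, and the raw string `home.titel` for a misspelled key.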
@@ -120,12 +150,123 @@ st.warning(t("gate.warning", name=filename)) # {name} interpolated via str.for
3. Use the dotted key at the call site: `t("section.subsection.key")` or `t("section.key", name=value)` for placeholder interpolation.
**Authoring rules:**
- Keys live under semantic sections (`home.*`, `upload.*`, `findings.*`, `tools.<id>.name`). Don't nest by language or by tool unless the string is genuinely tool-specific.
- Keys live under semantic sections (`home.*`, `upload.*`, `findings.*`, `help.*`, `tools.<id>.name`). Don't nest by language or by tool unless the string is genuinely tool-specific.
- Per-tool header copy lives under `tools.<id>.{page_title, page_caption, help_md}`. `page_caption` is the one-line subtitle under the title; `help_md` is the popover body (see *Tool page header* above). Top-level `help.button_label` / `help.missing_body` are shared across every tool.
- Use `{named}` placeholders (not positional `{0}`) so translators see what's being interpolated.
- Strings can contain Streamlit markdown (`**bold**`) — pass through `st.markdown` / `st.caption` as usual.
- Do **not** put strings inside the farewell-overlay JS payload without going through `_js_html_safe()` in `src/gui/components/_legacy.py`; the helper escapes both the JS string terminator and HTML special chars. The test `TestFarewellEscape` pins that contract.
- The sidebar picker is mounted by `hide_streamlit_chrome()`, so every page that calls that helper automatically gets the picker. Pages that don't call it (rare) can call `render_language_selector()` directly.
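To make the farewell-overlay rule concrete, here is a rough sketch of what "escape both the JS string terminator and HTML special chars" can look like. It is a hypothetical stand-in named `js_html_safe`, not the real `_js_html_safe()` from `src/gui/components/_legacy.py`, whose exact behaviour is pinned by `TestFarewellEscape`.

```python
import html


def js_html_safe(text: str) -> str:
    """Escape text for use inside a quoted JS string that ends up as HTML.

    Hypothetical illustration only; the project's real helper may differ.
    """
    # HTML-escape first: & < > " ' become entities, so no quote character
    # survives that could terminate the surrounding JS string literal.
    escaped = html.escape(text, quote=True)
    # Then neutralise characters that would break a JS string literal.
    return (
        escaped.replace("\\", "\\\\")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
        .replace("\u2028", "\\u2028")
        .replace("\u2029", "\\u2029")
    )
```

A translated string that happens to contain a quote or a `</script>` fragment then reaches the overlay as inert text rather than as markup or code.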
### Licensing
The license layer lives at ``src/license/``. The public API:
```python
from src.license import (
get_manager, require_feature, current_state,
FeatureFlag, Tier, License,
)
mgr = get_manager()
if not mgr.is_valid():
raise RuntimeError("Not licensed")
require_feature(FeatureFlag.DEDUPLICATOR)
```
**Storage**: ``~/.datatools/license.json`` (override via
``DATATOOLS_LICENSE_PATH``). Signed with Ed25519 (asymmetric) — the
seller's private key signs; the buyer's binary verifies with the
embedded public key.
**Key material**:
| Variable | Who has it | Where it's used |
|---|---|---|
| ``DATATOOLS_LICENSE_PRIVKEY`` | Seller only | ``scripts/generate_license.py`` (mint a buyer's blob), ``scripts/generate_keypair.py`` writes a fresh one |
| ``DATATOOLS_LICENSE_PUBKEY`` | Every shipped binary | Verification at activation time; set at build time via PyInstaller env |
If neither env var is set, ``src.license.crypto`` falls back to the
deterministic dev keypair in ``src/license/_dev_keypair.py``. The
dev key is in source on purpose (so tests work without secrets),
but a frozen build that's still using it is a build-config bug —
`assert_production_safe()` refuses to start such a binary.
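The fallback-and-refuse behaviour can be pictured with a small sketch. Everything here is illustrative: `resolve_pubkey`, `check_production_safe`, and the all-zero `DEV_PUBKEY_HEX` placeholder are assumptions, not the contents of `src/license/crypto.py`, and only the public-key side is shown.

```python
import sys
from typing import Mapping, Tuple

# Placeholder only: stands in for the dev public key kept in source.
DEV_PUBKEY_HEX = "00" * 32


def resolve_pubkey(env: Mapping[str, str]) -> Tuple[str, bool]:
    """Return (pubkey_hex, is_dev_key) for the given environment."""
    configured = env.get("DATATOOLS_LICENSE_PUBKEY", "").strip()
    if configured:
        return configured, False
    return DEV_PUBKEY_HEX, True


def check_production_safe(env: Mapping[str, str], frozen: bool) -> None:
    """Refuse to run a frozen build that would verify against the dev key."""
    _, is_dev_key = resolve_pubkey(env)
    if frozen and is_dev_key:
        raise RuntimeError("frozen build is still using the dev license key")


def is_frozen() -> bool:
    # PyInstaller sets sys.frozen on bundled executables.
    return bool(getattr(sys, "frozen", False))
```

A source checkout with no env vars passes the check (dev key, not frozen); a frozen build passes only when a production public key was supplied.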
**First-time setup for shipped builds**:
1. ``python scripts/generate_keypair.py --output prod-keys.env`` —
creates a fresh keypair.
2. Stash ``DATATOOLS_LICENSE_PRIVKEY`` somewhere safe (password
manager / KMS). Lose it and you can't issue renewals without
reshipping a new build with a new public key.
3. Configure the PyInstaller build env with
``DATATOOLS_LICENSE_PUBKEY=<hex>`` so the shipped binary
verifies against the production key.
4. Mint buyer licenses with
``DATATOOLS_LICENSE_PRIVKEY=<hex> python scripts/generate_license.py ...``.
**Dev bypass**: ``DATATOOLS_DEV_MODE=1`` short-circuits every check.
The test suite's autouse fixture sets this so existing tests don't
need their own license fixtures. Tests that need the real check
explicitly use ``isolated_license_path`` /
``activated_license_manager`` / ``unactivated_license_manager``.
**Adding a feature flag**:
1. Add the enum value to ``FeatureFlag`` in ``src/license/schema.py``.
2. Add it to the relevant tier's set in
``FEATURES_BY_TIER`` in ``src/license/features.py``.
3. Gate at the call site: ``require_feature(FeatureFlag.YOUR_FLAG)``.
**Adding a new tier**:
1. Add the enum value to ``Tier``.
2. Add a row to ``FEATURES_BY_TIER`` listing the unlocked flags.
3. Add ``license.tier_<name>`` translation keys to every i18n pack.
4. The activation flow, sidebar status badge, feature gate, and home
grid lock badge all pick up the new tier automatically.
**Worked example — the Lite tier**:
```python
# src/license/schema.py
class Tier(str, Enum):
LITE = "lite" # new
CORE = "core"
...
# src/license/features.py
FEATURES_BY_TIER = {
...
Tier.LITE: frozenset({
FeatureFlag.DEDUPLICATOR,
FeatureFlag.TEXT_CLEANER,
FeatureFlag.FORMAT_STANDARDIZER,
}),
Tier.CORE: _all(),
...
}
```
Then in en.json/es.json add ``license.tier_lite``. That's it — the
existing ``require_feature_or_render_upgrade`` (GUI) and
``guard(feature=...)`` (CLI) calls in every tool page/CLI route a
Lite user into the upgrade prompt for any tool the tier doesn't
unlock. The home grid's lock badge fires off the same feature
lookup.
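The lookup that drives both the gate and the lock badge might look roughly like this. It is a self-contained sketch: the real `require_feature()` takes only the flag and reads the tier from the license manager, so the explicit `tier` argument, the `FeatureLockedError` class, and the `PDF_EXTRACTOR` flag are assumptions for illustration.

```python
from enum import Enum


class Tier(str, Enum):
    LITE = "lite"
    CORE = "core"


class FeatureFlag(str, Enum):
    DEDUPLICATOR = "deduplicator"
    TEXT_CLEANER = "text_cleaner"
    FORMAT_STANDARDIZER = "format_standardizer"
    PDF_EXTRACTOR = "pdf_extractor"  # hypothetical flag, for illustration


FEATURES_BY_TIER = {
    Tier.LITE: frozenset({
        FeatureFlag.DEDUPLICATOR,
        FeatureFlag.TEXT_CLEANER,
        FeatureFlag.FORMAT_STANDARDIZER,
    }),
    Tier.CORE: frozenset(FeatureFlag),  # every flag
}


class FeatureLockedError(PermissionError):
    """Raised when the active tier does not unlock a feature."""


def is_unlocked(tier: Tier, flag: FeatureFlag) -> bool:
    return flag in FEATURES_BY_TIER.get(tier, frozenset())


def require_feature_for_tier(tier: Tier, flag: FeatureFlag) -> None:
    if not is_unlocked(tier, flag):
        raise FeatureLockedError(f"{flag.value} is not included in the {tier.value} tier")
```

Because the gate and the badge both reduce to `is_unlocked`, adding a tier is a data change in `FEATURES_BY_TIER` rather than a code change at each call site.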
**Minting a license** (creator-only):
```bash
DATATOOLS_LICENSE_PRIVKEY=<hex> \
python scripts/generate_license.py \
--name "Jane Doe" --email jane@example.com \
--tier core --years 1
```
The script prints a ``DTLIC1:`` blob to stdout — deliver this in the
Gumroad / purchase email. The buyer pastes it into the activation
page or runs ``python -m src.license_cli activate <blob> --name ...``.
### Add a format-standardizer field type
1. Add value to `FieldType` enum in `core/format_standardize.py`.
@@ -155,11 +296,46 @@ GUI / CLI handlers: use `format_for_user(exc, context="...")` to render.
All `DataToolsError` subclasses extend stdlib `ValueError` or `OSError` so existing handlers still catch them.
## PDF Extractor — bundled Tesseract
Frozen builds (installer / AppImage) ship Tesseract OCR inside the bundle so scanned PDFs work without a separate system install. Source / `pip` developer environments still resolve Tesseract from `PATH`.
**Runtime layout (frozen bundles)**:
| Resource | Path |
|---|---|
| Tesseract binary | `Path(sys._MEIPASS) / "tesseract" / "tesseract"` (Linux/macOS), `…/tesseract/tesseract.exe` (Windows) |
| Tessdata directory | `Path(sys._MEIPASS) / "tesseract" / "tessdata"` |
| English model | `Path(sys._MEIPASS) / "tesseract" / "tessdata" / "eng.traineddata"` |
**Discovery order** (PDF Extractor runtime):
1. `DATATOOLS_TESSERACT_BIN` env var (override — explicit path to a `tesseract` binary).
2. Bundled path under `sys._MEIPASS` (frozen bundles only — falls through to step 3 otherwise).
3. `tesseract` on `PATH` (developer setups, source checkouts).
4. Windows well-known locations (`C:\Program Files\Tesseract-OCR\tesseract.exe`, etc.).
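The discovery order above could be implemented along these lines. This is a sketch under stated assumptions, not the PDF Extractor's actual code: the function name `find_tesseract`, the injectable `env` / `meipass` / `platform` parameters (added so the logic is testable), and the choice to fall through when the override path does not exist are all illustrative.

```python
import os
import shutil
import sys
from pathlib import Path
from typing import Mapping, Optional

# Only the one well-known path named in the text; the real list is longer.
WINDOWS_FALLBACKS = (Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe"),)


def find_tesseract(
    env: Optional[Mapping[str, str]] = None,
    meipass: Optional[str] = None,
    platform: str = sys.platform,
) -> Optional[Path]:
    """Return the first Tesseract binary found, following the documented order."""
    env = os.environ if env is None else env
    is_windows = platform.startswith("win")

    # 1. Explicit override via env var (assumed to fall through if missing).
    override = env.get("DATATOOLS_TESSERACT_BIN")
    if override and Path(override).is_file():
        return Path(override)

    # 2. Bundled binary under sys._MEIPASS (set only in frozen bundles).
    bundle_root = meipass if meipass is not None else getattr(sys, "_MEIPASS", None)
    if bundle_root:
        exe_name = "tesseract.exe" if is_windows else "tesseract"
        bundled = Path(bundle_root) / "tesseract" / exe_name
        if bundled.is_file():
            return bundled

    # 3. Whatever `tesseract` resolves to on PATH.
    on_path = shutil.which("tesseract", path=env.get("PATH"))
    if on_path:
        return Path(on_path)

    # 4. Windows well-known install locations.
    if is_windows:
        for candidate in WINDOWS_FALLBACKS:
            if candidate.is_file():
                return candidate
    return None
```

Returning `None` rather than raising leaves it to the caller to decide how to report a missing OCR engine.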
**Where the bytes come from**:
- **Tessdata** is vendored at `build/vendor/tessdata/eng.traineddata` — the "best" English model from [tessdata_best](https://github.com/tesseract-ocr/tessdata_best). PyInstaller's spec copies it into `tesseract/tessdata/` inside the bundle.
- **Tesseract binary** is fetched at build time by `build/tesseract.py` — per-platform download URLs are pinned in that module. The current pin is **Tesseract 5.5.0**. CI (`.github/workflows/build.yml`) imports `fetch_tessdata` + `fetch_tesseract_for_platform` and runs them before PyInstaller.
**To update Tesseract**:
1. Bump the version pin + the per-platform fetch URLs in `build/tesseract.py`.
2. If upstream changed the `eng.traineddata` schema, refresh `build/vendor/tessdata/eng.traineddata` from `tessdata_best` at the matching tag.
3. Push a `v*` tag so CI rebuilds all three platforms, then smoke-test a scanned-PDF run through the PDF Extractor before publishing the release.
4. Update `LICENSE_TESSERACT.txt` at the repo root if the upstream license terms change (Tesseract is Apache-2.0 today).
## Tests
```bash
# All
# All (core + CLI + GUI)
pytest -q
# Quick loop — skip the GUI layer
pytest -q -m 'not gui'
# Only the GUI tests
pytest -q -m gui
# By module
pytest tests/test_dedup.py
# Include slow / integration
@@ -171,22 +347,77 @@ pytest tests/test_dedup.py::TestExactMatch::test_basic
Test layout:
```
tests/
├── conftest.py # fixtures
├── conftest.py # core/CLI fixtures
├── test_dedup.py · test_normalizers.py · test_io.py · test_config.py
├── test_analyze.py · test_normalize.py · test_text_clean.py
├── test_format_standardize.py
├── test_format_standardize_corpus.py # 199-row buyer corpus
├── test_pipeline.py # pipeline engine: adapters, run, validate, serialize
├── test_cli_pipeline.py # pipeline CLI: recommend/apply/strict/audit
├── test_audit_fixes.py · test_errors.py · test_fixes_unit.py
├── test_corpus.py · test_encodings_corpus.py · test_fixtures_sweep.py
└── test_cli.py · test_cli_*.py · test_e2e.py · test_install.py
├── test_cli.py · test_cli_*.py · test_e2e.py · test_install.py
├── test_perf_regressions.py # shape pins for the perf wins
└── gui/ # Streamlit AppTest-driven tests
├── conftest.py # AppTest fixtures + helpers
├── _findings_panel_harness.py # isolated component test page
├── test_smoke.py # every page renders in EN + ES
├── test_chrome.py # language selector, hide_chrome
├── test_gate.py # require_normalization_gate
├── test_workflows.py # happy path per Ready tool
├── test_dedup_review.py # match-group card interactions
├── test_advanced_panels.py # config_panel widgets
├── test_pipeline_builder.py # module-card builder: cards, reorder, JSON, run
├── test_pipeline_phrasing.py # step_phrase/step_status + name bridge (pure fns)
├── test_errors.py # malformed-upload error paths
└── test_findings_panel.py # analyzer findings rendering
```
### Pipeline (Automated Workflows) coverage
The pipeline feature is pinned end to end across four files (~115 tests):
`test_pipeline.py` (core engine — every adapter's summary numbers, step
data-flow, error stop/continue, empty/single-column/all-disabled edges,
dict + file serialization round-trips, `recommended_pipeline(include=…)`,
soft-dependency validation), `test_cli_pipeline.py` (CLI — `--recommend`,
dry-run-by-default, `--apply` output + audit JSON, `--steps`, `--strict`,
`--continue-on-error`, arg validation, save→load round-trip),
`test_pipeline_builder.py` (the visual builder via AppTest — card seeding,
toggle, reorder ▲/▼, add/remove, restore-recommended, Advanced JSON
import/export, per-tool Configure panels emitting the right option dicts),
and `test_pipeline_phrasing.py` (the plain-English `step_phrase`/`step_status`
helpers and the adapter-key→friendly-name bridge as pure functions).
### GUI test layer
GUI tests drive pages with `streamlit.testing.v1.AppTest` —
in-process, no browser, no display. They pre-populate
`st.session_state` with stashed-upload bytes (via the
`stash_upload()` helper in `tests/gui/conftest.py`) and either click
buttons via `app.button[i].click().run()` or assert on the
`session_state` after the run.
Marker registered in `pytest.ini`. Default `pytest` runs everything;
`pytest -m 'not gui'` skips them for a faster core-only loop.
Coming-Soon stubs are pinned by the smoke tests so a regression
("import error", "missing widget") shows up immediately.
Fixture corpora: `test-cases/text-cleaner-corpus/` (21 files) · `test-cases/encodings-corpus/` (31 files) · `test-cases/format-cleaner-corpus/` (7 files + spec).
## Known limitations
- **Dedup is O(n²)** — no blocking. Works to ~50k rows. Future: partition by first letter / ZIP prefix.
- **Single-threaded** — could benefit from `multiprocessing`.
- **Memory-bound** — entire file loaded into pandas. Streaming reads exist but not integrated with dedup engine.
- **Dedup pair-compare is O(n²)** for fuzzy strategies. Exact-only
strategies (every column uses `Algorithm.EXACT` at threshold 100)
now route through an O(n) groupby fast path automatically — no API
change. Fuzzy strategies can opt into prefix blocking via
`deduplicate(..., blocking_columns=[...], blocking_prefix_len=1)`
to partition pairs by a cheap key (trades recall for speed).
- **Threading is opt-in for format_standardize** —
`StandardizeOptions.parallel_columns > 1` uses a thread pool.
On CPython 3.12 the GIL caps the win at roughly neutral; the
scaffolding is in place for free-threaded Python 3.13+.
- **Memory-bound** — entire file loaded into pandas. Streaming reads
exist but not integrated with the dedup engine.
- **No multi-sheet dedup** — each Excel sheet processed independently.
- **Phonenumbers minimum-length** — international numbers without country codes fall back to digits-only.
- **Phonenumbers minimum-length** — international numbers without
country codes fall back to digits-only.
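To show why prefix blocking trades recall for speed, here is a small standalone sketch of the idea. It is not the dedup engine's implementation: `candidate_pairs` is a hypothetical helper, and rows are plain dicts rather than a DataFrame.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple


def candidate_pairs(
    rows: Sequence[Dict[str, str]],
    blocking_columns: Iterable[str],
    prefix_len: int = 1,
) -> Iterator[Tuple[int, int]]:
    """Yield index pairs to compare, restricted to rows sharing a block key."""
    columns = list(blocking_columns)
    blocks: Dict[Tuple[str, ...], List[int]] = defaultdict(list)
    for idx, row in enumerate(rows):
        # Cheap key: lowercase prefix of each blocking column.
        key = tuple(str(row.get(col, "")).strip().lower()[:prefix_len] for col in columns)
        blocks[key].append(idx)
    for members in blocks.values():
        # Fuzzy comparison only happens inside a block.
        yield from combinations(members, 2)
```

Four rows named `Acme`, `acme inc`, `Beta`, and `Bolt` produce two candidate pairs instead of six. The cost is that `Acme` and `Kacme` land in different blocks and are never compared, which is the recall loss the limitation note refers to.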

244
docs/FUTURE-TOOLS.md Normal file

@@ -0,0 +1,244 @@
# Future tools — design notes
> Creator-only. Specs for tools the strategic plan refuses to build right now
> but that surface repeatedly enough to be worth documenting once instead of
> re-thinking from scratch every time a customer asks.
> **Status of these tools**: post-launch, post-revenue. See `PLAN.md` §2.1 —
> new-tool development is frozen until DataTools has a paying customer and a
> repeated demand signal for the same idea. This file is the resting place
> for those ideas in the meantime; nothing here ships unless a future
> decision says it does.
Each entry follows the same shape: **What it does**, **Why someone would
want it**, **Can we ship it now?**, **Approach**, **GUI sketch**, **Effort**,
**Risks/unknowns**, **Ship criteria** (the signal that overrides the freeze).
---
## 10. PDF → CSV extractor (bank statements + similar)
### What it does
Takes a PDF (typically a bank statement, expense report, paystub, invoice,
or any document where humans-but-not-computers can read a table) and turns
the tabular content into a CSV that the rest of DataTools can consume.
The user shows the tool **where** the data lives by drawing rectangles on
a rendered preview of the first page; the tool then applies those region
templates to every page of the document (and remembers the template so the
same template can be re-applied to next month's statement without
re-clicking).
### Why someone would want it
Bookkeepers, accountants, and any small-business operator who:
- Gets bank/credit-card statements only as PDFs (most US banks; many
European ones).
- Wants to import transactions into QuickBooks / Xero / a spreadsheet
without paying $10-$30/month for a SaaS converter (Docparser,
Rossum, Hubdoc) or relying on a Python script they can't maintain.
- Has 12 months × N accounts of statements to back-fill into a
ledger.
This is the most-requested DataTools adjacency in the casual feedback we
have so far. It maps tightly onto the **bookkeeper niche** identified in
`PLAN.md` §2.3 — that persona is exactly who needs PDF extraction and is
exactly the kind of operator who'd pay for a one-time desktop tool over a
recurring SaaS subscription.
### Can we ship it now?
**No.** Current state, verified 2026-05-17:
- No PDF dependency in `requirements.txt` or `requirements-dev.txt`.
- No PDF-touching code anywhere under `src/`. The single
string-mention of "PDF" in the codebase is in the **output** copy for
the Quality Check tool ("generate PDF/Excel quality reports"),
unrelated to extraction.
- No region-selection / canvas component in the Streamlit GUI today.
Building this requires net-new infrastructure on three axes (libraries,
extraction core, region-picker UI). Estimates below.
### Approach (technical)
PDFs split cleanly into two populations and the strategy differs:
1. **Native / text-layer PDFs** — text is stored as text, just laid out
visually. Most modern bank statements are this. Solvable with
coordinate-aware text extraction:
- **`pdfplumber`** (MIT, on top of `pdfminer.six`): gives `(x0, y0,
x1, y1, text)` per character/word/line for each page. Mature, well
tested, single dependency, no native compiler. **First-choice.**
- **`pypdf`** (BSD-3) — text-only, no positions. Too coarse for
statement parsing; useful only for "the whole document as one big
string."
- **`camelot-py`** (MIT) — purpose-built for table extraction.
Heavier (needs `ghostscript` and `tk`/`opencv` for some modes),
and assumes the table grid is already visible. Worth evaluating
as a fallback for documents with explicit ruled tables.
2. **Scanned / image-only PDFs** — pixels of a scanner; no text layer.
Less common from major banks today but still happens with old PDFs
and receipts. Needs OCR:
- **`pytesseract`** wrapping the **Tesseract** binary (Apache-2.0). The
OCR is good for English on clean scans, mediocre on receipts.
Detect with `pdfplumber`: a page with no extractable characters
(`page.chars` is empty) but at least one image object (`page.images`)
is image-only → OCR fallback.
The extraction core would be a state machine:
1. Render page to an image (`pdfplumber.Page.to_image(resolution=...)` returns a
`PageImage` wrapper at a chosen DPI; the underlying PIL image is its `.original` attribute).
2. User draws a header region and per-row regions (or marks a single
table bounding box + column dividers) on the preview.
3. For each PDF page, crop the corresponding pixel region (or pdf
coordinate region), pull the text in that crop, and apply per-region
parsing (date, amount, description).
4. Emit one CSV row per detected statement row.
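Steps 3 and 4 can be sketched without touching a real PDF by working on word dicts shaped like the output of pdfplumber's `extract_words()` (keys `text`, `x0`, `x1`, `top`, `bottom`). The `Column` dataclass, the `extract_rows` function, and the `row_tolerance` default are assumptions for illustration, not a committed design.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Column:
    """A user-drawn column: a name plus its horizontal extent in PDF points."""
    name: str
    x0: float
    x1: float


def extract_rows(
    words: Sequence[dict],
    columns: Sequence[Column],
    table_top: float,
    table_bottom: float,
    row_tolerance: float = 2.0,
) -> List[Dict[str, str]]:
    """Group words inside the table region into rows, then into columns."""
    inside = sorted(
        (w for w in words if w["top"] >= table_top and w["bottom"] <= table_bottom),
        key=lambda w: (w["top"], w["x0"]),
    )
    rows: List[Dict[str, List[Tuple[float, str]]]] = []
    current_top = None
    for w in inside:
        # A jump in the top coordinate starts a new statement row.
        if current_top is None or abs(w["top"] - current_top) > row_tolerance:
            rows.append({c.name: [] for c in columns})
            current_top = w["top"]
        centre = (w["x0"] + w["x1"]) / 2
        for c in columns:
            if c.x0 <= centre < c.x1:
                rows[-1][c.name].append((w["x0"], w["text"]))
                break
    # Join each cell's words left to right.
    return [
        {name: " ".join(text for _, text in sorted(cell)) for name, cell in row.items()}
        for row in rows
    ]
```

The returned list of dicts maps directly onto `csv.DictWriter`, so emitting the CSV is a thin final step; per-column parsing (dates, amounts) would run on the cell strings.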
Bank-statement-specific niceties — implementable as templates on top of
the generic engine:
- Recurring-template store: save "Chase visa October layout" once, the
next month's PDF lands on the same template automatically. JSON file
in `~/.datatools/templates/` keyed by a layout fingerprint (page
size + header text hash).
- Multi-page row stitching: a row that wraps across pages gets merged
back together based on date-column continuity.
- Currency / sign inference: a column that mixes `$1,234.56` and
`($45.00)` — already handled by the (now-existing) Standardize
Formats analyzer rules.
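The recurring-template store could be keyed roughly as below. This is a sketch only: the SHA-256 truncation to 16 hex characters, rounding the page size to whole points, and the helper names are assumptions, and the real fingerprint would need tuning against actual statements.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional

DEFAULT_ROOT = Path.home() / ".datatools" / "templates"


def layout_fingerprint(width: float, height: float, header_text: str) -> str:
    """Hash page size plus normalised header text into a short, stable key."""
    normalised = " ".join(header_text.split()).lower()
    payload = f"{round(width)}x{round(height)}|{normalised}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def save_template(fingerprint: str, template: dict, root: Path = DEFAULT_ROOT) -> Path:
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{fingerprint}.json"
    path.write_text(json.dumps(template, indent=2), encoding="utf-8")
    return path


def load_template(fingerprint: str, root: Path = DEFAULT_ROOT) -> Optional[dict]:
    """Return the stored template, or None when the layout is unknown."""
    path = root / f"{fingerprint}.json"
    if not path.is_file():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

Returning `None` for an unknown fingerprint is what lets the caller fail loudly ("no saved template matches this layout") instead of applying a stale one.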
### GUI sketch
The hardest part of the whole project. Streamlit doesn't ship a native
"draw rectangles on an image" widget. Options:
- **`streamlit-drawable-canvas`** — community component (MIT-licensed).
Lets the user draw freehand rectangles on top of a background image.
Returns the rectangle coordinates as JSON. Active maintenance.
**First-choice for the region picker.**
- **`streamlit-cropper`** — single-rectangle crop tool. Good if we only
needed the table bbox; too limited for "header region + column
dividers + repeating-row template."
- **Custom React component** — fully tailored UX but adds a build
toolchain DataTools doesn't have today. Last resort.
Sketch of the proposed page (under the "Transformations" section of the sidebar):
```
🧾 PDF → CSV (Beta)
─────────────────────────────────────────────────────────────────────
Upload a PDF [ Browse… ]
(statement / invoice / form — text-based PDFs work best)
[ ▸ Preview: October-statement.pdf · 3 pages ]
┌────────────────────────────────────────────────┐
│ CHASE BANK │
│ Statement period Oct 1-31, 2025 │
│ ┌─[1: header strip — drawn in red]──────────┐ │
│ │ Date Description Amount │ │
│ └────────────────────────────────────────────┘ │
│ ┌─[2: row template — drawn in green]────────┐ │
│ │ 10/03 AMAZON.COM #42… -45.67 │ │
│ └────────────────────────────────────────────┘ │
│ ⋮ (more transactions) │
└────────────────────────────────────────────────┘
Columns: [Date] [Description] [Amount] [+ Add column]
Apply template to: ( ) Only this page
(•) All pages with this layout
( ) All pages (force)
[ Save template as… Chase Visa Oct 2025 ]
[ Run extraction → CSV ]
```
After "Run extraction": the standard tool-page result layout (preview
table, "Saved to ~/Downloads/<name>_extracted.csv", "Open Downloads
folder" — matching the other Ready tools).
The **template save/recall** is what makes this a one-time setup
instead of a per-document chore — bookkeepers don't want to re-draw
rectangles every month.
### Effort estimate
| Phase | Scope | Estimate | Risk |
|---|---|---|---|
| **A. Backend, native PDFs only** | pdfplumber-based extraction, hard-coded region passed via a JSON config (no GUI) | **1-2 weeks** | Low: straightforward use of pdfplumber. |
| **B. Region-picker GUI** | streamlit-drawable-canvas, multi-region drawing, per-region role assignment (date / amount / description) | **2-3 weeks** | Medium: the canvas component has quirks; persisting region state across reruns is non-trivial. |
| **C. Multi-page application + template persistence** | Apply one page's template to N pages, save/load templates, layout fingerprint | **1-2 weeks** | Medium: "is the next page the same layout?" is a real perception problem; we'll need a heuristic. |
| **D. Scanned-PDF OCR fallback** | Detect image-only pages, run Tesseract, merge OCR text into the extraction path | **2-3 weeks** | High: OCR accuracy is variable; we'd want a quality threshold + a "fail this page noisily" path. Bundling Tesseract with the PyInstaller build is its own packaging headache. |
| **E. Bank-statement specifics** | Cross-page row stitching, currency-sign inference, multi-account splits | **1-2 weeks** | Medium: every bank's idea of a "statement" differs. Templates absorb most of the variance. |
**Realistic total for a polished v1**: 6-10 calendar weeks of focused work
(text-PDFs + GUI + templates + statement-specific niceties). Add another
2-3 weeks if scanned PDFs are required at launch.
**Minimum viable extract** (just text PDFs, single-region drawing, no
template recall, no OCR): **3-4 weeks**. Worth scoping a beta at that
level before committing to the full surface.
### Difficulty rating
**Medium-hard.** Not because any single piece is novel — pdfplumber +
streamlit-drawable-canvas are well-trodden libraries — but because the
*combination* (point-and-click region selection that persists across
multiple PDF pages and across documents with similar layouts) is where
most of the engineering goes. The "every bank does it slightly
differently" reality makes templates a hard requirement rather than a
nice-to-have, and templates raise the design effort.
### Risks / unknowns
- **Scanned-PDF coverage**: if a meaningful slice of the addressable
market sends image-only PDFs (older statements, scanned receipts),
shipping text-only extraction limits the audience. Decide via the
first 10-20 user requests.
- **PyInstaller packaging of Tesseract**: bundling the OCR binary into
the desktop build is non-trivial. May force a "Tesseract not found —
install it separately" path on first launch, which hurts the "one-
click install" story.
- **Bank layout drift**: a template captured today can stop working
next month if the bank redesigns its statement. Layout-fingerprint
detection has to fail loudly rather than silently produce garbage.
- **PII surface**: bank statements are some of the most sensitive
documents the user might touch. The "runs locally — your data never
leaves this computer" guarantee is even more load-bearing here than
for CSVs. No telemetry, no cloud OCR services, hard line.
### Ship criteria
Before this tool re-enters active development, all of these need to be
true:
- DataTools has shipped to **≥1 paying customer** (the `PLAN.md` §2.1
freeze condition).
- **At least 3 paying customers OR 5 demo-traffic emails** have
explicitly asked for PDF extraction. Below that signal, build
something else.
- The bookkeeper niche (per `PLAN.md` §2.3) has at least one converted
customer — that's the persona who actually needs this tool, and
confirming they pay before building a tool aimed squarely at them
is the discipline the freeze exists to enforce.
If all three hold, the **minimum-viable beta (3-4 weeks)** goes first
(text PDFs + single-region drawing), so we can see real
user behaviour before committing to the full template surface.
---
## (placeholder for additional future-tool entries)
Add new entries above this line. Keep the same shape:
What / Why / Can we ship now / Approach / GUI / Effort / Risks /
Ship criteria. The shape is what makes "is this idea ready" a
factual question instead of an opinion.

259
docs/LICENSE-SERVER.md Normal file

@@ -0,0 +1,259 @@
# LICENSE-SERVER — online issuance & record-keeping
**Status:** **deployed (PR 1 + PR 2 code merged)**. Live at
`licenses.datatools.unalogix.com`. See `ADMIN.md §"Live deployment"`
for day-2 operations, and `ARCHITECTURE.md` for the end-to-end
diagram including the desktop and storefronts.
This doc describes the smallest useful server we could build to
replace the manual mint-and-paste workflow, without compromising the
"your data never leaves your computer" promise to buyers (see
`DECISIONS.md §9b`).
---
## Goals
1. **Automate fulfillment.** Gumroad sale → buyer gets a blob in
their inbox within seconds. No creator intervention.
2. **Authoritative customer list.** A queryable record of who has
what tier, when it expires, what they paid. Replaces the JSONL
log as the system of record.
3. **Self-service renewal & re-delivery.** Buyer enters their email
→ gets a fresh blob or a copy of their existing one. Cuts support
load.
4. **Move the private key off the founder's laptop.** Today the prod
private key has to be loaded as an env var to mint anything;
that's a security hazard. Server-side, it lives in a KMS and the
laptop never touches it.
## Non-goals
- **No phone-home from the desktop app.** Activation stays offline.
The shipped binary still verifies blobs against the embedded
pubkey with no network call. `DECISIONS.md §9b` stands.
- **No per-machine activation limits enforced server-side.** v1
treats one license = one buyer, used on as many of their machines
as they want. Revisit only if abuse appears.
- **No telemetry.** The server only knows what the buyer or Gumroad
tells it (purchase events, renewal requests). It does not learn
anything from desktop installations.
---
## Architecture
```
┌─────────────────┐
│ Gumroad │
└────────┬────────┘
│ webhook (sale, refund)
┌──────────────┐ ┌───────────────┐ ┌──────────────┐
│ Buyer email │◄──────│ Mint API │──────►│ licenses │
│ (SMTP send) │ │ (Python web) │ │ (Postgres) │
└──────────────┘ └───────┬───────┘ └──────────────┘
│ sign() via
┌─────────────────┐
│ KMS / HSM │
│ (private key) │
└─────────────────┘
┌─────────────────────────────────────────┐
│ Renewal / re-delivery portal │
│ - buyer enters email │
│ - signed magic link │
│ - sees current license + "resend" │
└─────────────────────────────────────────┘
```
---
## Components
### 1. Mint API
Thin Python web service (FastAPI or Flask — Streamlit isn't appropriate
here). Two internal endpoints:
- `POST /internal/mint` — name, email, tier, years → blob + DB row.
Auth: shared HMAC header from the webhook receiver only.
- `POST /internal/revoke` — license_key → sets `revoked_at`. Auth: same.
The mint endpoint is the **only** place that calls `crypto.sign()`.
It pulls the private key from the KMS at request time; the key
material never lives in the API process's environment.
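The "shared HMAC header" check on the internal endpoints might be sketched like this. The header name, the hex encoding, and the helper names are assumptions; the point is only that the receiver recomputes the MAC over the raw body and compares in constant time.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical header name, for illustration only.
AUTH_HEADER = "X-DataTools-Signature"


def sign_body(secret: bytes, body: bytes) -> str:
    """HMAC-SHA256 of the raw request body, hex-encoded."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_authorized(secret: bytes, body: bytes, header_value: Optional[str]) -> bool:
    """True only if the caller proved knowledge of the shared secret."""
    expected = sign_body(secret, body).encode("ascii")
    supplied = (header_value or "").encode("utf-8")
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, supplied)
```

The webhook receiver would call `sign_body` before forwarding to `POST /internal/mint`, and the Mint API would reject any request where `is_authorized` is false before parsing the body.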
### 2. Webhook receiver
Public endpoint `POST /webhooks/gumroad`. Verifies Gumroad's
signature, maps the payload to a `mint` call, returns 200. Stores
the raw payload to a `gumroad_events` table for audit.
Refunds: the webhook looks up the license by `gumroad_order_id` and
calls `POST /internal/revoke` with its `license_key`. The desktop app
doesn't currently honor revocations (no online check), but a revoked
license can't be renewed, and the row remains as evidence if a
dispute escalates.
### 3. Renewal portal
Single-page form, public. Buyer enters email → server emails a
signed magic link → click → page shows their license (tier, expiry,
"resend blob" button, "renew" button).
Renew flow: button → `POST /internal/mint` with the same name/email
and a fresh expiry → buyer gets the new blob → pastes into desktop
app via existing `license_cli.py renew`. No code change in the
desktop app.
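One way to build the signed magic link used by the portal is an expiring HMAC token, sketched below. The token layout (`payload.signature`, both base64url), the 15-minute default lifetime, and the function names are assumptions rather than the deployed design.

```python
import base64
import hashlib
import hmac
from typing import Optional


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def make_token(secret: bytes, email: str, now: int, ttl_seconds: int = 900) -> str:
    """Sign "email|expiry" so the portal can trust it when it comes back."""
    payload = f"{email.lower()}|{int(now) + ttl_seconds}".encode("utf-8")
    signature = hmac.new(secret, payload, hashlib.sha256).digest()
    return f"{_b64(payload)}.{_b64(signature)}"


def verify_token(secret: bytes, token: str, now: int) -> Optional[str]:
    """Return the email if the token is authentic and unexpired, else None."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = _unb64(payload_b64)
        signature = _unb64(signature_b64)
    except ValueError:  # wrong shape or invalid base64
        return None
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None
    email, _, expiry = payload.decode("utf-8").rpartition("|")
    if int(expiry) < now:
        return None
    return email
```

The token is stateless, so the portal needs no extra table for pending links; the trade-off is that a link cannot be revoked before it expires, which is why the lifetime is kept short.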
### 4. Database
Postgres (small — a few thousand rows for the foreseeable future).
Single source of truth for the customer list.
---
## Schema
```sql
CREATE TABLE licenses (
license_key text PRIMARY KEY, -- DT1-{TIER}-xxxx-xxxx
name text NOT NULL,
email text NOT NULL,
tier text NOT NULL, -- lite | core | pro | enterprise
issued_at timestamptz NOT NULL,
expires_at timestamptz NOT NULL,
blob text NOT NULL, -- DTLIC2:...
gumroad_order_id text UNIQUE, -- null for manual mints
revoked_at timestamptz, -- null = active
notes text -- free-form support notes
);
CREATE INDEX idx_licenses_email ON licenses (lower(email));
CREATE INDEX idx_licenses_expires ON licenses (expires_at) WHERE revoked_at IS NULL;
CREATE INDEX idx_licenses_gumroad ON licenses (gumroad_order_id);
CREATE TABLE gumroad_events (
id bigserial PRIMARY KEY,
received_at timestamptz NOT NULL DEFAULT now(),
event_type text NOT NULL, -- sale | refund | dispute | ...
order_id text,
raw_payload jsonb NOT NULL,
processed boolean NOT NULL DEFAULT false,
error text -- non-null if processing failed
);
```
The `licenses` schema is the JSONL log fields plus
`gumroad_order_id`, `revoked_at`, `notes`. The migration script from
JSONL → Postgres is therefore a flat insert.
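A sketch of that flat insert follows, using the standard-library `sqlite3` module as a stand-in for Postgres so it runs anywhere. The JSONL field names are assumed to match the column names exactly, and the trimmed `SCHEMA` uses SQLite types; against Postgres the same `INSERT ... ON CONFLICT (license_key) DO NOTHING` statement applies with the driver's own placeholder style.

```python
import json
import sqlite3
from typing import Iterable

COLUMNS = ("license_key", "name", "email", "tier", "issued_at", "expires_at", "blob")

# SQLite-flavoured stand-in for the Postgres table above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS licenses (
    license_key      TEXT PRIMARY KEY,
    name             TEXT NOT NULL,
    email            TEXT NOT NULL,
    tier             TEXT NOT NULL,
    issued_at        TEXT NOT NULL,
    expires_at       TEXT NOT NULL,
    blob             TEXT NOT NULL,
    gumroad_order_id TEXT UNIQUE,
    revoked_at       TEXT,
    notes            TEXT
)
"""

INSERT_SQL = (
    f"INSERT INTO licenses ({', '.join(COLUMNS)}) "
    f"VALUES ({', '.join('?' for _ in COLUMNS)}) "
    "ON CONFLICT (license_key) DO NOTHING"
)


def _count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM licenses").fetchone()[0]


def import_jsonl(conn: sqlite3.Connection, lines: Iterable[str]) -> int:
    """Insert each JSONL record once; return how many new rows were added."""
    before = _count(conn)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        conn.execute(INSERT_SQL, tuple(record[col] for col in COLUMNS))
    conn.commit()
    return _count(conn) - before
```

Because conflicts are ignored, the import is idempotent: re-running it after a partial failure adds only the rows that are still missing.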
---
## Security
- **Private key**: AWS KMS, GCP KMS, or HashiCorp Vault. Mint API
has IAM permission to *use* the key (sign operation), not to
*export* it. Rotating to a new key still requires a new desktop
build (the pubkey is embedded); plan a 90-day overlap window where
both keys are accepted.
- **Webhook secret**: Gumroad's HMAC signature, verified before
touching the body.
- **Internal endpoints**: not reachable from the public internet —
bind to localhost or a private subnet, fronted by the webhook
receiver and the renewal portal.
- **PII**: name + email + Gumroad order ID. Standard customer-data
hygiene — DB backups encrypted at rest, no PII in application
logs, GDPR delete-on-request supported via a `DELETE FROM
licenses WHERE email = ?` (the desktop activation still works
until the license expires; the buyer just won't appear in our
records anymore).
- **Mint API access**: short-lived signed tokens for any creator
CLI that talks to it. The CLI is a thin wrapper around the same
`POST /internal/mint`; the days of running
`scripts/generate_license.py` against the prod private key on a
laptop are over once the server exists.
---
## Migration plan
Three phases, each independently revertable.
### Phase 0 (done)
- Ed25519 signing with prod key on creator's laptop.
- Local JSONL issuance log at `~/.datatools-creator/issued.jsonl`.
### Phase 1 — server stands up, no behavior change
1. Stand up Postgres + Mint API in a small VPS / Fly.io / Render box.
2. Provision a KMS-held keypair; **the public key must match the one
already embedded in the shipped binary** — i.e., import the
existing prod private key into KMS, do not generate a new one. If
the existing key is laptop-only and can't be imported, plan a
build-with-new-pubkey + buyer-side rotation cycle (see
`ADMIN.md` Recovery).
3. Run a one-shot script: read `~/.datatools-creator/issued.jsonl`,
`INSERT … ON CONFLICT (license_key) DO NOTHING` each row.
4. Add a creator-only CLI command `datatools-admin mint` that calls
`POST /internal/mint` instead of running the local script. Local
script stays as a fallback.
At this point: nothing buyer-facing has changed. The creator now has
two ways to mint (server or local) and a real DB.
### Phase 2 — automation
5. Wire the Gumroad webhook. New buyers get automated fulfillment.
6. Manual mints (friends, comps, support replacements) still go
through `datatools-admin mint`, which writes to the same DB.
7. Old local script is deprecated but kept (read-only) as a break-glass
tool if the server is down.
### Phase 3 — self-service
8. Ship the renewal portal.
9. Replace the "email support when I lose my blob" flow with a self-service form.
Each phase ships independently. The desktop app sees no change
across any of them — that's the whole point.
---
## Open questions
- **Hosting choice.** *Decided: self-hosted* on the existing
`46.225.166.142` box alongside the `*.invixiom.com` services.
Runbook in `SETUP-LICENSE-SERVER.md`. Operator owns uptime,
backups, TLS renewal, and key custody — see that doc's
"Operational concerns" section.
- **Per-seat or per-device limits?** v1 says no. Revisit if/when
abuse is observable.
- **Email delivery.** Postmark or SES — both fine. Pick whichever the
rest of the stack uses. Avoid Gmail SMTP for transactional mail.
- **Audit log retention.** `gumroad_events` rows are unbounded growth
but trivially small. Default to forever; partition by year if it
ever exceeds a few GB.
- **Existing Gumroad customers.** Before any of this lands, every
buyer is already in Gumroad's records. A one-shot import from
Gumroad's CSV export → `licenses` table would catch anyone whose
blob the JSONL log doesn't have (e.g., if the creator's laptop
was lost before this design lands).
---
## Code pointers (current state, for the future implementer)
| File | What it does now | What changes |
|------|------------------|--------------|
| `scripts/generate_license.py` | Sign locally, append JSONL | Becomes a CLI client of the Mint API |
| `src/license/crypto.py` | `sign()` reads `$DATATOOLS_LICENSE_PRIVKEY` | `sign()` calls KMS; the env var stays as a fallback for local dev |
| `src/license_cli.py` | Activate / status / renew — already buyer-facing | **No change.** Still verifies offline against embedded pubkey |
| `src/license/manager.py` | Verify, persist | **No change.** |
The desktop app is deliberately decoupled from any of this. The
server is a fulfillment + record-keeping layer wrapped around the
existing, frozen, offline activation flow.


@@ -30,7 +30,7 @@ Status legend:
| ✓ | Item | Where it lives |
|---|------|----------------|
| 🟢 | 6 of 9 tools shipped (Dedup, Text, Format, Missing, Column-Map, Pipeline) | `src/core/`, `src/cli_*.py`, `src/gui/pages/` |
| 🟢 | Pipeline Runner (the retention multiplier per `PLAN.md` §2.6) | `src/core/pipeline.py`, `src/cli_pipeline.py`, `src/gui/pages/9_Pipeline_Runner.py` |
| 🟢 | Automated Workflows (the retention multiplier per `PLAN.md` §2.6) | `src/core/pipeline.py`, `src/cli_pipeline.py`, `src/gui/pages/9_Pipeline_Runner.py` |
| 🟢 | 1,729 passing tests · 0 skipped · 0 xfailed | `tests/` |
| 🟢 | 3 niche demo datasets + pre-tuned pipeline JSONs | `samples/demo/` |
| 🟢 | Streamlit demo app + Cloud entry shim | `streamlit_app.py`, `src/gui/app_demo.py` |
@@ -269,6 +269,7 @@ moves until $5k/mo MRR:
| | Why locked |
|---|---|
| ❌ More tools (06–08) | `PLAN.md` §2.1 distribution-gate. Tool 09 was the exception; no others until first paid customer + one external review. |
| ❌ Tool #10 PDF → CSV (the most-asked-for adjacency) | Parked in `docs/FUTURE-TOOLS.md` with full design + 3–4 wk MVP / 6–10 wk polished estimate. Ship trigger: paying customer + ≥3 paid or ≥5 demo emails asking for PDF + the bookkeeper niche converting first. None have fired yet. |
| ❌ SaaS pivot | `DECISIONS.md` §4 — recurring infra conflicts with the lifestyle constraint. |
| ❌ Live chat / sales calls | `DECISIONS.md` §1 #8 — no-touch is locked until $5k/mo. |
| ❌ Custom integrations / one-off consulting | Breaks "build once, sell many." |


@@ -29,8 +29,8 @@ win.
| Asset | State |
|---|---|
| Tools 1–5 (Dedup, Text Clean, Format Standardize, Missing, Column Mapper) | Ready · 1,691 tests passing · 0 xfailed |
| Tools 6–9 (Outlier, Multi-File Merge, Validator, Pipeline) | Coming Soon |
| Tools 1–5 (Find Duplicates, Clean Text, Standardize Formats, Fix Missing Values, Map Columns) | Ready · 1,691 tests passing · 0 xfailed |
| Tools 6–9 (Find Unusual Values, Combine Files, Quality Check, Automated Workflows) | Coming Soon |
| PyInstaller installer pipeline | Not started |
| macOS code signing (Apple Dev Program) | Not started |
| Hosted browser demo (Streamlit Cloud) | Not deployed |
@@ -52,12 +52,20 @@ Tools 6–8 are blocked behind a **distribution gate**: no work on them
until the existing 5 tools have a paying customer + one external review
(BUSINESS.md §4 sequence rule, applied recursively inside the bundle).
**Exception granted 2026-05-01**: Tool 09 Pipeline Runner is built
**Exception granted 2026-05-01**: Tool 09 Automated Workflows is built
*now*. Rationale: the pipeline transforms the bundle from "5 tools you
buy" into "an automatable workflow you depend on." That conversion is
what produces retention and word-of-mouth — the only marketing channel
that scales under the no-network/no-touch constraint.
**Parked behind the freeze**: post-launch tool ideas are captured in
`docs/FUTURE-TOOLS.md` with feasibility, GUI sketch, effort estimate,
and ship criteria for each. Currently parked: **#10 PDF → CSV
extractor** (bank statements et al.) — gated on a paying customer +
≥3 paying customers or ≥5 demo emails explicitly asking for PDF
extraction, with the bookkeeper niche converting at least one customer
first. None of those triggers have fired yet.
### 2.2 The demo *is* the product. Make it embarrassingly good.
- Three persona-tagged sample datasets, not one generic CSV: Shopify
@@ -104,10 +112,10 @@ demo dataset.
| # | Pain | $ / time impact | Tools that fix it |
|---|------|-----------------|---|
| S1 | **Klaviyo / Mailchimp / Omnisend per-contact billing.** Subscriber list with 10–18 % duplicate rate (case drift, plus signs in Gmail addresses, multiple devices) → recurring overpay forever. | $30–300/mo per percent of dupes on a 50 k list — recurring | Dedup + Format Standardize (email canonicalization) + Pipeline (re-run weekly) |
| S2 | **Product feed rejected by Google Merchant Center / Meta Catalog.** Smart quotes in titles, NBSP in SKU, inconsistent attributes; campaign launch delayed 24–72 h while feed gets fixed. | 1–3 days delayed launch × campaign value | Text Cleaner + Format Standardize |
| S3 | **Multi-channel order consolidation.** Shopify + Etsy + Amazon + Faire + wholesale spreadsheet, each with a different column for "customer email" / "order total" / "ship country". | 4–8 hr / month manually merging | Column Mapper + Dedup + Pipeline |
| S2 | **Product feed rejected by Google Merchant Center / Meta Catalog.** Smart quotes in titles, NBSP in SKU, inconsistent attributes; campaign launch delayed 24–72 h while feed gets fixed. | 1–3 days delayed launch × campaign value | Clean Text + Standardize Formats |
| S3 | **Multi-channel order consolidation.** Shopify + Etsy + Amazon + Faire + wholesale spreadsheet, each with a different column for "customer email" / "order total" / "ship country". | 4–8 hr / month manually merging | Map Columns + Find Duplicates + Automated Workflows |
| S4 | **Subscription identity fragmentation.** Pet-box subscribers cancel and re-sub under a different email; cohort analysis says churn is 20 % when it's actually 12 % — pricing decisions wrong. | Mis-priced LTV → over- or under-paid acquisition | Dedup with `merge=true` survivor |
| S5 | **International tax / VAT MOSS compliance.** Country column is `UK` / `U.K.` / `United Kingdom` / `GB` in the same export; VAT report breaks. Phone formats per region break call-center routing. | Compliance penalty risk + ops friction | Format Standardize (per-row country) + Column Mapper |
| S5 | **International tax / VAT MOSS compliance.** Country column is `UK` / `U.K.` / `United Kingdom` / `GB` in the same export; VAT report breaks. Phone formats per region break call-center routing. | Compliance penalty risk + ops friction | Standardize Formats (per-row country) + Map Columns |
#### Bookkeeper / freelance accountant
@@ -126,7 +134,7 @@ demo dataset.
| R1 | **HubSpot / Marketo / Iterable per-contact tier pricing.** 10 k contacts → enterprise tier at $4–8 k/mo. Every duplicate is a recurring tax. | $200–800 / month per 1 k duplicate contacts — recurring | Dedup with cross-source merge + Pipeline |
| R2 | **Email-deliverability / sender reputation.** Sending to invalid or duplicate addresses tanks reputation; recovery takes weeks. | Catastrophic — entire email programme degraded | Format Standardize (email canonicalization) + Missing (sentinel detection) |
| R3 | **GDPR / contact-data privacy.** Uploading lead data to a third-party cleaning SaaS is itself a GDPR concern; legal review blocks adoption. | Compliance risk + 4–8 wk legal-review delay | Local-only desktop app, zero outbound calls |
| R4 | **Multi-vendor lead-source unification.** Apollo, ZoomInfo, LinkedIn Sales Nav, manual scrapes — each export has different headers, scoring, country format. | 1–3 days per campaign of manual unification | Column Mapper (alias matching) + Format Standardize (per-row country) + Dedup |
| R4 | **Multi-vendor lead-source unification.** Apollo, ZoomInfo, LinkedIn Sales Nav, manual scrapes — each export has different headers, scoring, country format. | 1–3 days per campaign of manual unification | Map Columns (alias matching) + Standardize Formats (per-row country) + Find Duplicates |
| R5 | **Suppression-list management across 5+ platforms.** Each platform has its own format; un-deduped suppression lists let opt-outs slip through, triggering CAN-SPAM / GDPR exposure. | Compliance risk + churn-back cost | Pipeline saved as JSON, re-run on each new suppression batch |
### 2.4 Operationalize the moat the docs already name.
@@ -154,7 +162,7 @@ right after "runs locally."
Copy seed: *"Every change auditable. Hand the audit CSV to your client
with the cleaned file."*
### 2.6 The Pipeline Runner is the retention multiplier.
### 2.6 Automated Workflows is the retention multiplier.
A buyer with a saved pipeline isn't a one-off purchase — they're a
recurring user who recommends the product. This is exactly the
@@ -172,8 +180,8 @@ trigger DECISIONS.md §8 already names).
### 2.8 Dependency-aware pipeline UX.
Tools have soft execution-order preferences (Text Clean before Format
Standardize, Format before Dedup, Missing before Dedup). The Pipeline
Runner *recommends* the order, *warns* on reversals, and **never
Standardize, Format before Dedup, Missing before Dedup). Automated
Workflows *recommends* the order, *warns* on reversals, and **never
forces** — the user owns their workflow. Implementation: see
`src/core/pipeline.py` `SOFT_DEPENDENCIES`.
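
A minimal sketch of the recommend-and-warn behaviour described in §2.8. The name `SOFT_DEPENDENCIES` comes from the text, but its shape here (a list of "earlier, later" pairs), the tool ids, and the `order_warnings` helper are assumptions for illustration, not the contents of `src/core/pipeline.py`.

```python
# Hypothetical shape: (tool that should run earlier, tool that should run later).
SOFT_DEPENDENCIES = [
    ("text_clean", "format_standardize"),
    ("format_standardize", "dedup"),
    ("missing", "dedup"),
]


def order_warnings(steps):
    """Return one warning string per reversed soft dependency; never raises."""
    position = {}
    for index, step in enumerate(steps):
        position.setdefault(step, index)  # use the first occurrence of each tool

    warnings = []
    for before, after in SOFT_DEPENDENCIES:
        if before in position and after in position and position[before] > position[after]:
            warnings.append(
                f"'{after}' runs before '{before}'; the recommended order puts '{before}' first"
            )
    return warnings


# The caller decides what to do with the warnings; the pipeline still runs.
for message in order_warnings(["dedup", "text_clean", "format_standardize"]):
    print("warning:", message)
```

Returning strings rather than raising keeps the "never forces" rule intact: the user's ordering is always executed as given.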
@@ -184,7 +192,7 @@ forces** — the user owns their workflow. Implementation: see
| 1 | PyInstaller pipeline · Mac/Win unsigned installers · Apple Dev Program enrollment (1–2 wk lead) | `dist/datatools-mac.dmg` and `dist/datatools-win.exe` install on a clean machine |
| 2 | Demo deployed to Streamlit Cloud · landing page v1 with embedded demo · 3 persona datasets in the demo | Public URL serves a working pipeline run on a sample dataset in < 30 s |
| 3 | Gumroad listing live · share value-first in 3 niche communities (no pitch) · 1 long-tail SEO post for the lead persona | First listing impression captured · post not removed for self-promotion |
| 4 | Pipeline Runner v1.0 shipped (this week, 2026-05-01 — exception per §2.1) · v1.1 patch announced with Tool 09 + intl improvements | Pipeline saves/loads JSON · 3 demo pipelines preloaded |
| 4 | Automated Workflows v1.0 shipped (this week, 2026-05-01 — exception per §2.1) · v1.1 patch announced with Tool 09 + intl improvements | Pipeline saves/loads JSON · 3 demo pipelines preloaded |
| 5–8 | Bookkeeper landing page · agency landing page · second tool's promo cycle · priority-support tier added (defer purchase until §2.7 trigger) | Three live landing pages with distinct H1, demo dataset, conversion target |
| 9–13 | Tool 06–08 only **if** revenue trajectory supports continued investment · otherwise more market work on the existing 5 + 09 | Decision made on 13 Aug 2026 with revenue data, not feature ambition |
@@ -202,7 +210,7 @@ These flip the plan, not the underlying criteria:
## 5. Anti-temptations (things the plan refuses)
- **More tools before more buyers.** Locked. Exception only for Pipeline Runner per §2.1.
- **More tools before more buyers.** Locked. Exception only for Automated Workflows per §2.1.
- **SaaS pivot.** Recurring infra conflicts with the lifestyle constraint (DECISIONS.md §4).
- **Live chat / sales calls.** Conflicts with no-touch (DECISIONS.md §1 #8).
- **Custom integrations / one-off consulting.** $300/hr looks tempting; breaks the "build once, sell many" model that justifies the entire strategy.


@@ -144,7 +144,7 @@ Reading PLAN.md §3 + this doc together, the rough script:
| **M1** (June) | Installers · demo · 3 landing pages · Gumroad live | Whether the funnel mechanically works. Numbers will be noisy; just look for one purchase. |
| **M2** (July) | M1 + community posts in 3 niches + 1 SEO post | Which persona converts. Re-allocate effort to the highest-converting niche. |
| **M3** (August) | M2 + landing-page changes from M2 review | Whether intent-rate moved on the change. Decide tools 06–08 go/no-go. |
| **M4** (September) | M3 + first repeat-buyer signals | Whether the Pipeline Runner is producing retention as designed. |
| **M4** (September) | M3 + first repeat-buyer signals | Whether Automated Workflows is producing retention as designed. |
By end of M4, the data tells you whether the plan is producing
$1k–3k/mo (BUSINESS.md §6 6-month target) — extrapolated from the


@@ -6,11 +6,13 @@
## Inicio rápido
1. Descarga el instalador para tu sistema operativo desde tu correo de compra.
2. Ejecútalo (no se requieren conocimientos de Python).
3. Lánzalo desde el acceso directo del escritorio → tu navegador predeterminado se abrirá en una página local.
1. Descarga desde tu correo de compra. Dos formatos por sistema operativo — elige uno:
- **Instalador** (`.dmg` en macOS, `.exe` en Windows) — crea acceso directo en el escritorio + entrada en el menú Inicio / Launchpad.
- **.zip portable** — descomprime y haz doble clic. Sin instalación, sin admin, se ejecuta desde cualquier lugar.
2. Ábrelo (no necesitas Python; todo viene incluido).
3. La app arranca un servidor local y abre tu navegador. Nada sale de tu equipo.
Instrucciones completas: [USER-GUIDE.es.md](USER-GUIDE.es.md).
Paso a paso completo incluyendo SmartScreen / Gatekeeper: [USER-GUIDE.es.md §1](USER-GUIDE.es.md#1-instalaci%C3%B3n).
## Documentación


@@ -6,11 +6,13 @@
## Quick Start
1. Download the installer for your OS from your purchase email.
2. Run it (no Python knowledge required).
3. Launch via the desktop shortcut → your default browser opens to a local page.
1. Download from your purchase email. Two flavors per OS — pick one:
- **Installer** (`.dmg` on macOS, `.exe` on Windows) — wires up Desktop + Start Menu / Launchpad shortcuts.
- **Portable .zip** — unzip and double-click. No install, no admin rights, runs from anywhere.
2. Open it (no Python needed; everything is bundled inside).
3. The app starts a local server and opens your browser. Nothing leaves your machine.
Full instructions: [USER-GUIDE.md](USER-GUIDE.md).
Full step-by-step including SmartScreen / Gatekeeper workarounds: [USER-GUIDE.md §1](USER-GUIDE.md#1-install).
## Docs


@@ -21,8 +21,8 @@ project-root/
│ └── CLI-REFERENCE.md
├── src/
│ ├── core/ # shared logic — both CLI + GUI call into this
│ ├── cli.py # Deduplicator CLI
│ ├── cli_text_clean.py # Text Cleaner CLI
│ ├── cli.py # Find Duplicates CLI
│ ├── cli_text_clean.py # Clean Text CLI
│ ├── cli_analyze.py # Analyzer CLI
│ └── gui/
│ ├── app.py # Streamlit entry


@@ -76,36 +76,71 @@ Sample size: 1,000 rows (configurable).
- Full-DataFrame `auto_fix`: ~5 min (~30 µs/cell).
- Output write: ~10 s.
- Recommended RAM: 3–4× input size for the full-Apply path.
- **Format standardizer** (`standardize_dataframe`): ~2.7M rows/sec on
- **Standardize Formats** (`standardize_dataframe`): ~2.7M rows/sec on
cache-warm repetition-heavy columns (synthetic 1M-row in-memory
benchmark, 2 typed columns); the fused single-pass loop replaced a
3-pass ``.tolist()`` cycle, so per-call overhead is now dominated by
the underlying parsers (phonenumbers, dateutil) rather than Python
list materialisation. A 1.5 GB CSV with mixed phone+currency+address
columns finishes in ~1.5–6 minutes depending on column count.
- **Text cleaner** (`clean_dataframe`): ~1M rows/sec on
`StandardizeOptions.parallel_columns` (default 1, serial) lands the
thread-pool scaffolding; on CPython 3.12 with the GIL it's
roughly neutral, but the API is ready for the free-threaded
(PEP 703) Python 3.13+ build where it will help.
- **Clean Text** (`clean_dataframe`): ~1M rows/sec on
repetition-heavy columns (per-call string cache: the pipeline runs
once per *unique* cell value, not once per row).
- **Deduplicator**: known O(n²) match step — works to ~50k rows in
comfortable time. The normalisation pass is now LRU-cached per call
so repeat values (the common dedup workload) skip re-parsing
(~25× faster on the normalisation step alone). Scale beyond 50k
needs blocking — flagged in `docs/NEXT-STEPS.md`.
- **Fix Missing Values** (`handle_missing`): lazy-copy — when sentinel
standardization runs but finds nothing, AND no drops AND no fills
apply, the input frame is returned as-is. On a clean 1 GB file this
saves the 1 GB allocation that the unconditional upfront copy used
to take.
- **Map Columns** (`map_columns`): rename + drop both already
return fresh frames; the explicit upfront `df.copy()` is now
removed and downstream mutating steps (schema-add, coerce) copy on
demand via `_ensure_owned()`. Rename-only and identity-mapping
paths run with zero explicit copies.
- **Find Duplicates**:
- **Exact-only strategies** (every column uses `Algorithm.EXACT` at
threshold 100 — covers strong-key dedup like email/phone, the
fallback drop-duplicates path, and explicit "match on this exact
column" calls) now run in **O(n)** via groupby. Measured: 10k
rows on an email-exact strategy → 73 ms (was ~30 minutes via the
old O(n²) pair compare).
- **Fuzzy strategies** still pair-compare. Opt in to **prefix
blocking** via `deduplicate(..., blocking_columns=['name'],
blocking_prefix_len=1)` to partition pairs by a cheap key.
Measured: 5k rows fuzzy-name dedup → 25.6s with blocking vs.
179s without (7× faster). Trade-off: cross-block matches are
missed; lower `blocking_prefix_len` widens blocks.
- Normalisation pass remains LRU-cached per call so repeat values
(the common dedup workload) skip re-parsing.
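
To make the prefix-blocking trade-off concrete, here is a small standard-library sketch. It is not the project's `deduplicate()` implementation: `fuzzy_pairs`, the 0.85 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen so the example is self-contained.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def fuzzy_pairs(names, threshold=0.85, prefix_len=1):
    """Return index pairs whose names look alike, comparing only within blocks.

    Rows are grouped by the first `prefix_len` characters of the lowercased
    name; pairs are compared inside each group, so candidate pairs in
    different groups are never scored.
    """
    blocks = defaultdict(list)
    for idx, name in enumerate(names):
        key = name.strip().lower()[:prefix_len]
        blocks[key].append(idx)

    matches = []
    for members in blocks.values():
        for i, j in combinations(members, 2):
            score = SequenceMatcher(None, names[i].lower(), names[j].lower()).ratio()
            if score >= threshold:
                matches.append((i, j))
    return sorted(matches)


names = ["Jon Smith", "John Smith", "Ron Smith", "Alice Wong"]
print(fuzzy_pairs(names))                # blocked on the first letter
print(fuzzy_pairs(names, prefix_len=0))  # one big block: every pair is compared
```

With `prefix_len=1`, "Jon Smith" and "Ron Smith" land in different blocks and are never compared, which is the cross-block miss the text warns about; `prefix_len=0` puts every row in one block and recovers that pair at the cost of comparing all pairs.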
## 11. Tools
1. Deduplicator — Ready
2. Text Cleaner — Ready
3. Format Standardizer — Ready
4. Missing Value Handler — Ready
5. Column Mapper — Ready
6. Outlier Detector — Coming Soon
7. Multi-File Merger — Coming Soon
8. Validator & Reporter — Coming Soon
9. Pipeline Runner — Ready
1. Find Duplicates — Ready
2. Clean Text — Ready
3. Standardize Formats — Ready
4. Fix Missing Values — Ready
5. Map Columns — Ready
6. Find Unusual Values — Coming Soon
7. Combine Files — Coming Soon
8. Quality Check — Coming Soon
9. Automated Workflows — Ready
**Future / not in v1.** Tool ideas captured for after-launch consideration
live in `docs/FUTURE-TOOLS.md` — entries there are gated by the new-tool
freeze in `PLAN.md` §2.1 and don't ship without a paying-customer +
repeated-demand signal. Currently parked there:
- **#10. PDF → CSV extractor** (bank statements + similar). No PDF
dependency exists in the repo today; this tool would need pdfplumber,
  streamlit-drawable-canvas, and a templates store. Estimated 3–4 weeks
  for a text-only MVP, 6–10 weeks for the polished version with
multi-page template recall.
### 11.a Recommended pipeline order (soft, not enforced)
The Pipeline Runner ships with a `SOFT_DEPENDENCIES` table; the
Automated Workflows ships with a `SOFT_DEPENDENCIES` table; the
following ordering is the default and the basis of the warning
surface. Re-ordering is allowed; the runner emits a warning string
and proceeds.
@@ -150,7 +185,16 @@ and proceeds.
- **Dev**: pytest, tox.
## 16. Test coverage
- 1,770 tests passing, 0 skipped, 0 xfailed (incl. perf-shape regression tests).
- 2,033 tests passing, 0 skipped, 0 xfailed.
- 1,868 core + CLI tests (run with `pytest -m 'not gui'` for a quick loop).
Includes 49 license-layer unit tests (Ed25519 sign/verify, dev-key
derivation, production-safe tripwire, schema), 25 license-CLI
tests, and 17 Lite-tier feature-map + guard tests.
- 165 GUI tests under `tests/gui/` driving Streamlit pages via `AppTest`
(smoke + EN/ES localization, chrome, gate, workflows, dedup review,
advanced panels, error paths, findings panel, activation +
license gate, Lite-tier per-page lock behaviour). Marked `gui`.
- Includes 15 perf-shape regression tests.
- Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases + 20-row international stress fixture), missing-handler (3 use cases + 16 edge cases), column-mapper (3 use cases + 5 edge cases).
- Run: `python run_tests.py [--tool …] [--fixtures] [--coverage]`.
@@ -160,6 +204,58 @@ and proceeds.
- Original input never modified.
- Audit logs: `logs/` next to each run (timestamped).
## 17a. Licensing
- **Storage**: ``~/.datatools/license.json`` (or
``$DATATOOLS_LICENSE_PATH`` override). Signed with Ed25519
(asymmetric).
- **Crypto**: Ed25519. The seller holds the private key; every
shipped binary embeds only the public key. A motivated reverse
engineer who pulls everything out of the binary still can't sign
new licenses. Keys are 32 bytes raw, exposed as hex via
``DATATOOLS_LICENSE_PRIVKEY`` (seller-side) and
``DATATOOLS_LICENSE_PUBKEY`` (build-time bake-in).
- **Activation**: buyer pastes a base64-encoded license blob
(``DTLIC1:...``) on first launch; app verifies the signature
offline + matches the buyer-entered name/email to the embedded
values.
- **No free trial**: every license requires a paid blob from the
seller. The user-facing trial flow (button + ``license_cli trial``
subcommand) was removed in v1.6 to keep paid-tier economics clean.
- **Lifetime**: every license is 1 year by default. Renewal applies a
fresh blob without losing the embedded buyer identity. Tier may
change during renewal (Lite → Core upgrade path).
- **Tiers**:
- ``lite`` — Find Duplicates + Clean Text + Standardize Formats.
Buyer pays once, gets the three universally-useful tools.
- ``core`` — every Ready tool (all 9 in v1.6).
- ``pro``, ``enterprise`` — scaffolded for future SKUs; currently
mirror Core. Add per-SKU restrictions by editing
``FEATURES_BY_TIER`` in ``src/license/features.py``.
- ``trial`` — kept in the enum for backwards compat with any
field-tested trial licenses but no longer issuable.
- **Feature flags**: every tool has a stable feature id matching its
``tool_id`` in :mod:`src.gui.tools_registry`. Adding a future per-
tool SKU is a one-line change to ``FEATURES_BY_TIER`` — no consumer
code edits.
- **Per-tool gating**: each tool page (GUI) and tool CLI calls
``require_feature(FeatureFlag.<TOOL>)`` at entry. GUI shows an
upgrade prompt + button to the Activate page; CLI prints a
message naming the locked feature and exits with code 2.
- **Lock badge**: the home grid shows a red 🔒 Locked pill on tool
cards the current tier doesn't unlock.
- **Dev bypass**: ``DATATOOLS_DEV_MODE=1`` skips every check (used by
the test suite and during development). **Refused in shipped
builds** by the production-safe tripwire.
- **Production-safe tripwire**: ``assert_production_safe()`` runs at
startup in every frozen build. Refuses to boot when ``DEV_MODE``
is set or the verification key is still the embedded dev key
(i.e., the build pipeline forgot to override
``DATATOOLS_LICENSE_PUBKEY``). No-op in source / pytest runs.
- **No internet**: signature verification is fully offline. The
shipped binary embeds only the public key; the private key never
leaves the seller. See ``docs/DECISIONS.md`` for the threat-model
discussion.
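
The tier and per-tool gating rules above can be sketched as follows. The names `FeatureFlag`, `FEATURES_BY_TIER`, and `require_feature` appear in the text, but the flag values, the explicit `tier` argument, the `FeatureLocked` exception, and `cli_entry` are assumptions for illustration; the real `require_feature` takes only the flag and presumably reads the active license itself.

```python
from enum import Enum


class FeatureFlag(str, Enum):
    # Hypothetical ids; the real ones match each tool's `tool_id`.
    DEDUP = "dedup"
    TEXT_CLEAN = "text_clean"
    FORMAT = "format_standardize"
    PIPELINE = "pipeline"


FEATURES_BY_TIER = {
    "lite": {FeatureFlag.DEDUP, FeatureFlag.TEXT_CLEAN, FeatureFlag.FORMAT},
    "core": set(FeatureFlag),
}
# Scaffolded tiers currently mirror Core, as described above.
FEATURES_BY_TIER["pro"] = FEATURES_BY_TIER["core"]
FEATURES_BY_TIER["enterprise"] = FEATURES_BY_TIER["core"]


class FeatureLocked(Exception):
    """Raised when the current tier does not include the requested feature."""


def require_feature(flag: FeatureFlag, tier: str) -> None:
    if flag not in FEATURES_BY_TIER.get(tier, set()):
        raise FeatureLocked(f"'{flag.value}' is not included in the '{tier}' tier")


def cli_entry(flag: FeatureFlag, tier: str) -> int:
    """Mimic the CLI contract: name the locked feature and return exit code 2."""
    try:
        require_feature(flag, tier)
    except FeatureLocked as exc:
        print(f"Locked: {exc}")
        return 2
    return 0
```

Keeping the mapping in one dictionary is what makes a new per-tool SKU a one-line change: consumers only ever call `require_feature`.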
## 18. Error handling
- Structured hierarchy: `DataToolsError` → `InputValidationError`, `ConfigError`, `FileFormatError`, `FileAccessError`.
- Subclasses extend stdlib `ValueError` / `OSError` so existing handlers still catch them.


@@ -0,0 +1,593 @@
# SETUP — Self-hosted license server runbook
End-to-end build instructions for `licenses.datatools.unalogix.com` on
the existing invixiom box (Ubuntu 24.04, public IP `46.225.166.142`).
Audience: creator/operator. Read top to bottom on first install; use as
a reference thereafter.
Companions:
- `LICENSE-SERVER.md` — the architecture / design rationale
- `ADMIN.md` — day-2 ops (minting comps, looking at the issuance log)
---
## 0. Multi-tenancy: where this lands among existing services
This box already hosts the `*.invixiom.com` family (kasm, files, lifeos,
code, gitea) via one shared nginx + one shared Let's Encrypt cert.
DataTools is intentionally separated from that stack at every layer:
| Layer | Existing | New |
|---|---|---|
| **DNS zone** | `invixiom.com` | `unalogix.com` (different domain) |
| **nginx file** | `/etc/nginx/sites-available/invixiom` | `/etc/nginx/sites-available/unalogix` |
| **nginx symlink** | `sites-enabled/invixiom` | `sites-enabled/unalogix` |
| **TLS cert** | `letsencrypt/live/kasm.invixiom.com[-0001]` | `letsencrypt/live/datatools.unalogix.com` |
| **Backend port** | 8000, 8002, 8003, 8080, 8081, 8443 | **8090** (mint API), **5433** (Postgres, localhost-only) |
| **Docker compose project** | per-service (kasm, lifeos, gitea) | `datatools-license` |
| **Docker volume** | per service | `datatools_pg_data` |
| **Filesystem root** | various | `/srv/datatools-license/` |
| **System user** | various | `datatools-api` (UID auto-assigned, no shell) |
Nothing in the invixiom stack is read, modified, or referenced by the
datatools stack. Restart, upgrade, or remove either without affecting
the other.
---
## 1. Pre-flight checklist (off-box, before any commands run)
These have to be done by the operator outside this box. The build
won't proceed without them.
### 1a. DNS records
In your `unalogix.com` registrar / DNS panel, add:
```
A datatools.unalogix.com 46.225.166.142
A licenses.datatools.unalogix.com 46.225.166.142
```
Verify before continuing:
```bash
dig +short datatools.unalogix.com
dig +short licenses.datatools.unalogix.com
# Both should print: 46.225.166.142
```
DNS propagation can take 1–60 minutes. Let's Encrypt won't issue
certs until DNS resolves correctly.
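
If you would rather script the DNS check than eyeball `dig`, a small Python equivalent might look like this. The helper name and the injectable `resolve` argument are illustrative choices, not part of the runbook. Note that once the Cloudflare proxy from §1c is enabled, public lookups return Cloudflare addresses rather than the origin IP, so this check is only meaningful before that switch.

```python
import socket

EXPECTED_IP = "46.225.166.142"
HOSTS = ("datatools.unalogix.com", "licenses.datatools.unalogix.com")


def unresolved_hosts(hosts=HOSTS, expected=EXPECTED_IP, resolve=socket.gethostbyname):
    """Return (host, resolved_ip_or_None) for every host not pointing at `expected`."""
    bad = []
    for host in hosts:
        try:
            ip = resolve(host)
        except OSError:  # socket.gaierror is a subclass: the name did not resolve
            ip = None
        if ip != expected:
            bad.append((host, ip))
    return bad


if __name__ == "__main__":
    problems = unresolved_hosts()
    for host, ip in problems:
        print(f"{host}: got {ip}, expected {EXPECTED_IP}")
    raise SystemExit(1 if problems else 0)
```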
### 1b. Postmark account (transactional email)
1. Sign up at https://postmarkapp.com (free 100 emails/mo, $15/mo for
the volume range we'll be in).
2. Verify the `unalogix.com` domain (DNS TXT/CNAME records — Postmark
will tell you exactly what to add).
3. Create a Server, copy the **Server API Token**. Stash it; we'll put
it in the app's `.env`.
4. Configure the sender address: `licenses@datatools.unalogix.com`.
If you prefer SES, Mailgun, Resend, etc. — fine, just swap the
adapter (see §6). Postmark is the recommended default.
### 1c. Cloudflare in front (recommended)
Move `unalogix.com` DNS hosting to Cloudflare and enable proxy ("orange
cloud") on both subdomains. Gets you free DDoS protection, WAF, and rate
limiting. **Origin TLS still goes through Let's Encrypt on this box**;
Cloudflare adds a second TLS hop in front. Cert renewal still works
because we use HTTP-01 challenge on the origin, which Cloudflare
proxies transparently.
If you skip this, the public webhook endpoint is directly hammerable.
Not catastrophic at low scale, but the free protection is worth taking.
### 1d. Gumroad webhook secret
In Gumroad's seller dashboard → Settings → Advanced → "Ping URL":
```
URL: https://licenses.datatools.unalogix.com/webhooks/gumroad
Secret: <generate a random hex string; save it as secrets/gumroad_secret, see §3>
```
Don't enter this until §10 ("PR 2 cutover") — the endpoint won't exist
yet during the Mint API build.
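
On the receiving side, the webhook handler needs to compare the supplied secret with the stored one. The sketch below assumes the secret arrives as a `?secret=` query parameter on the Ping URL and that the handler can see the full request URL; `ping_is_authentic` is a hypothetical helper, not the server's actual code.

```python
import hmac
from urllib.parse import parse_qs, urlsplit


def ping_is_authentic(request_url: str, expected_secret: str) -> bool:
    """True only if the URL carries a `secret` parameter equal to the stored one."""
    query = parse_qs(urlsplit(request_url).query)
    supplied = query.get("secret", [""])[0]
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(supplied.encode(), expected_secret.encode())
```

A handler would typically return a 4xx response when this check fails, before touching the database.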
---
## 2. One-time host setup
Run as `root` (or via `sudo`).
```bash
# Update apt cache and pull in the bits the rest of the doc needs.
apt-get update
apt-get install -y \
    docker-compose-plugin \
    certbot \
    python3-certbot-nginx \
    postgresql-client-16   # for psql to reach the containerized DB
# Sanity check: docker + compose v2 are already installed via Docker CE.
docker --version
docker compose version
# Create the system user the app process will run as (no shell, no home).
adduser --system --group --no-create-home --shell /usr/sbin/nologin datatools-api
# Filesystem layout under /srv (separate from /opt to make the
# multi-tenant boundary obvious on disk).
install -d -o datatools-api -g datatools-api -m 750 /srv/datatools-license
install -d -o datatools-api -g datatools-api -m 750 /srv/datatools-license/app
install -d -o datatools-api -g datatools-api -m 750 /srv/datatools-license/secrets
install -d -o datatools-api -g datatools-api -m 750 /srv/datatools-license/backups
```
The `secrets/` dir is mode 750, owned by `datatools-api`. The private
signing key and Postmark token live there as mode-400 files: never in
a systemd `EnvironmentFile`, never in the docker-compose file, never
anywhere `root` doesn't need to look.
> **Gotcha — secret file ownership UID.** Docker compose's
> `uid:`/`gid:`/`mode:` long-form on `secrets:` is silently ignored
> for **file-based** secrets (it's a swarm-mode-only feature). The
> file inside the container appears with whatever ownership it has
> on the host, and the API runs as UID 10001 (the `app` user from
> the Dockerfile). So chown the actual files to **10001** (a numeric
> UID that doesn't exist on the host — that's fine, chown accepts
> it) and rely on the parent dir's mode 750 + ownership for host-side
> access control. See §3 below for the corrected `chown` step.
### Firewall recommendation (separate decision)
The box currently runs without UFW. Enabling it now would affect all
existing services. Two options:
- **(A) Don't enable UFW.** Leave the cloud provider's network firewall
as the perimeter. This is the current state.
- **(B) Enable UFW with `allow 22, 80, 443` only.** Forces every Docker
service to bind to `127.0.0.1` (some currently bind `0.0.0.0`). Will
break any direct-port access until those binds are updated.
Default for this runbook: **(A)**. Revisit independently of the
DataTools rollout. The DataTools containers always bind to `127.0.0.1`
regardless.
---
## 3. Database (Postgres in Docker)
Postgres lives inside the datatools compose project — separate from
every other service on the box, separate volume, separate port,
localhost-only binding.
`/srv/datatools-license/compose.yml`:
```yaml
services:
  postgres:
    image: postgres:16-alpine
    container_name: datatools-postgres
    restart: unless-stopped
    environment:
      POSTGRES_DB: datatools_licenses
      POSTGRES_USER: datatools_api
      POSTGRES_PASSWORD_FILE: /run/secrets/pg_password
    secrets:
      - pg_password
    volumes:
      - datatools_pg_data:/var/lib/postgresql/data
    ports:
      - "127.0.0.1:5433:5432"   # localhost-only, non-default port
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U datatools_api -d datatools_licenses"]
      interval: 10s
      timeout: 3s
      retries: 5
  api:
    build:
      context: ./app
      dockerfile: server/Dockerfile
    image: datatools-license-api:latest
    container_name: datatools-api
    restart: unless-stopped
    depends_on:
      postgres:
        condition: service_healthy
    environment:
      DATABASE_URL: postgresql+psycopg://datatools_api@postgres:5432/datatools_licenses
      PG_PASSWORD_FILE: /run/secrets/pg_password
      DATATOOLS_ADMIN_TOKEN_FILE: /run/secrets/admin_token
      # PR 2 — uncomment when Postmark + Gumroad are provisioned.
      # POSTMARK_TOKEN_FILE: /run/secrets/postmark_token
      # GUMROAD_WEBHOOK_SECRET_FILE: /run/secrets/gumroad_secret
      # Production keypair (replaces in-tree dev key): set
      # DATATOOLS_LICENSE_PRIVKEY_FILE: /run/secrets/license_privkey
      # and DATATOOLS_LICENSE_PUBKEY: <hex> before shipping v1.0.
    secrets:
      - pg_password
      - admin_token
      # PR 2:
      # - postmark_token
      # - gumroad_secret
    ports:
      - "127.0.0.1:8090:8000"   # localhost-only; nginx is the only path in
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 3s
      retries: 3
secrets:
  pg_password: { file: ./secrets/pg_password }
  admin_token: { file: ./secrets/admin_token }
  # PR 2:
  # postmark_token: { file: ./secrets/postmark_token }
  # gumroad_secret: { file: ./secrets/gumroad_secret }
  # Production keypair rotation adds:
  # license_privkey: { file: ./secrets/license_privkey }
volumes:
  datatools_pg_data:
    name: datatools_pg_data
```
Populate the secrets (each file should contain the value with no
trailing newline). For PR 1, only `pg_password` and `admin_token`
are required; the rest land in PR 2 / production key rotation.
```bash
cd /srv/datatools-license
# Random 64-char hex DB password (32 random bytes); tr strips the trailing newline
openssl rand -hex 32 | tr -d '\n' > secrets/pg_password
# Random admin Bearer token (CLI auth). Save this: you'll need it
# on your laptop to talk to /internal/* via the SSH tunnel.
openssl rand -hex 32 | tr -d '\n' > secrets/admin_token
# --- PR 2 secrets ---
# echo -n "<postmark-server-token>" > secrets/postmark_token # from postmarkapp.com
# openssl rand -hex 32 | tr -d '\n' > secrets/gumroad_secret # paste into Gumroad's Ping URL: ?secret=<this>
#
# --- production-key follow-up (defer until v1.0 cutover) ---
# echo -n "<ed25519-private-hex>" > secrets/license_privkey
# Lock everything down. The numeric 10001 matches the in-container
# `app` user (Dockerfile-defined), letting the API read the file
# while keeping host-side access gated by the parent dir's mode 750.
chmod 400 secrets/*
chown 10001:10001 secrets/*
```
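If you would rather generate the secret files from Python (for example inside a provisioning script), the same idea can be sketched as below. The helper name `write_secret` is hypothetical and not part of the repo; it only illustrates the "no trailing newline, owner read-only" requirement using the standard library.

```python
import os
import secrets
import tempfile
from pathlib import Path


def write_secret(path: Path, n_bytes: int = 32) -> str:
    """Write a random hex secret to `path` with no trailing newline.

    Hypothetical helper (an assumption, not repo code).
    """
    value = secrets.token_hex(n_bytes)  # 2 hex chars per byte: 64 chars for 32 bytes
    # O_EXCL refuses to overwrite an existing secret; 0o400 = owner read-only.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o400)
    with os.fdopen(fd, "w") as fh:
        fh.write(value)  # deliberately no "\n"
    return value
```

Ownership (`chown 10001:10001`) still has to be applied separately, as in the shell commands above.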
The corresponding **public** key for `DATATOOLS_LICENSE_PUBKEY` goes
in `/srv/datatools-license/.env` (it's not secret — it's already in
every shipped binary):
```bash
echo "DATATOOLS_LICENSE_PUBKEY=<hex-pubkey>" > /srv/datatools-license/.env
chmod 640 /srv/datatools-license/.env
chown datatools-api:datatools-api /srv/datatools-license/.env
```
---
## 4. App image build
The Mint API source lives in this repo under `server/` (new directory
introduced by PR 1). Build the Docker image:
```bash
cd /srv/datatools-license/app
git clone https://git.invixiom.com/giteadmin/datatools-dev.git .
docker build -t datatools-license-api:latest -f server/Dockerfile server/
```
Schema bootstrap (one-time, after first `docker compose up`):
```bash
docker compose exec api alembic upgrade head
```
Smoke test:
```bash
curl -s http://127.0.0.1:8090/health
# expects: {"status":"ok","db":"ok"}
```
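For scripted bring-up it can help to poll the health endpoint instead of running `curl` by hand. The following is a minimal sketch using only the standard library; the function names, retry count, and delay are assumptions, and the expected JSON shape is taken from the smoke test above.

```python
import json
import time
import urllib.error
import urllib.request


def is_healthy(body: bytes) -> bool:
    """True only for the payload shape shown in the smoke test."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers malformed JSON and bad encodings
        return False
    return (
        isinstance(payload, dict)
        and payload.get("status") == "ok"
        and payload.get("db") == "ok"
    )


def wait_until_healthy(url: str, attempts: int = 10, delay_s: float = 3.0) -> bool:
    """Poll `url` until it reports healthy or the attempts run out."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=3) as resp:
                if resp.status == 200 and is_healthy(resp.read()):
                    return True
        except (urllib.error.URLError, OSError):
            pass  # container may still be starting
        time.sleep(delay_s)
    return False
```

Usage would look like `wait_until_healthy("http://127.0.0.1:8090/health")` on the server itself.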
---
## 5. nginx config
> **Gotcha — nginx version syntax.** Ubuntu 24.04 ships nginx 1.24,
> which uses the legacy `listen 443 ssl http2;` form. The standalone
> `http2 on;` directive arrived in nginx 1.25 and will error on 1.24
> with `unknown directive "http2"`. The config below uses the 1.24
> form.
>
> **Bring-up sequence.** This config references a TLS cert at
> `/etc/letsencrypt/live/datatools.unalogix.com/`, which doesn't
> exist on a fresh install — nginx would refuse to start. The
> working sequence is: (a) install a temporary HTTP-only config
> that serves `.well-known/acme-challenge/` and returns 503 for
> everything else, (b) `nginx -s reload`, (c) run `certbot
> certonly --webroot`, (d) replace with the HTTPS config below,
> (e) `nginx -s reload` again. See §6.
`/etc/nginx/sites-available/unalogix` (**new file**; do not merge
into `invixiom`):
```nginx
# Marketing / product site (datatools.unalogix.com); static for now.
server {
    listen 80;
    server_name datatools.unalogix.com licenses.datatools.unalogix.com;

    # Keep the ACME http-01 path on plain HTTP so `certbot --webroot -w /var/www/html`
    # (see §6) can still answer renewal challenges; redirect everything else.
    location /.well-known/acme-challenge/ {
        root /var/www/html;
    }
    location / {
        return 301 https://$host$request_uri;
    }
}
server {
    listen 443 ssl http2;  # nginx 1.24 syntax (Ubuntu 24.04)
    server_name datatools.unalogix.com;
    ssl_certificate     /etc/letsencrypt/live/datatools.unalogix.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/datatools.unalogix.com/privkey.pem;
    root /srv/datatools-license/site;  # static landing page; create later
    index index.html;
    location / {
        try_files $uri $uri/ =404;
    }
}
# License operations subdomain.
server {
    listen 443 ssl http2;  # nginx 1.24 syntax (Ubuntu 24.04)
    server_name licenses.datatools.unalogix.com;
    ssl_certificate     /etc/letsencrypt/live/datatools.unalogix.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/datatools.unalogix.com/privkey.pem;
    # Block /internal/* from the public side as defense-in-depth.
    # (The app also enforces this server-side; this is layered.)
    location /internal/ {
        return 404;
    }
    location / {
        proxy_pass http://127.0.0.1:8090;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        # Gumroad webhook payloads are tiny but tighten anyway.
        client_max_body_size 1m;
        # Basic rate limiting: 30 req/min/IP on /webhooks/* and /portal/*.
        # Tune in nginx.conf with a `limit_req_zone` directive.
        # limit_req zone=licenses burst=10 nodelay;
    }
}
```
Enable + reload:
```bash
ln -s /etc/nginx/sites-available/unalogix /etc/nginx/sites-enabled/unalogix
nginx -t # validate
systemctl reload nginx
```
---
## 6. TLS cert
Use the webroot http-01 challenge (the nginx plugin works too; this is
slightly more explicit):
```bash
certbot certonly \
--webroot -w /var/www/html \
-d datatools.unalogix.com \
-d licenses.datatools.unalogix.com \
--agree-tos \
--email michael.dombaugh@gmail.com \
--non-interactive
```
Cert lands at `/etc/letsencrypt/live/datatools.unalogix.com/`.
Auto-renewal is already configured by the certbot package (systemd
timer `certbot.timer`). Confirm:
```bash
systemctl list-timers certbot.timer
```
---
## 7. Bring it up
```bash
cd /srv/datatools-license
docker compose up -d
docker compose ps # STATUS should show '(healthy)' for both services
docker compose logs -f api
```
Public smoke test:
```bash
curl -s https://licenses.datatools.unalogix.com/health
# expects: {"status":"ok","db":"ok"}
```
---
## 8. Verification — end-to-end internal mint
From your laptop (NOT the server), open an SSH tunnel for the internal
endpoint:
```bash
ssh -L 8090:127.0.0.1:8090 michael@46.225.166.142 -N
# Leave running; in another terminal:
curl -X POST http://127.0.0.1:8090/internal/mint \
-H "Authorization: Bearer $DATATOOLS_ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name":"Test Buyer",
"email":"test@example.com",
"tier":"core",
"years":1,
"source":"manual",
"notes":"smoke test"
}'
```
Expected: 200 + a `DTLIC2:...` blob + a row inserted in the `licenses`
table. Confirm with:
```bash
docker compose exec postgres \
psql -U datatools_api -d datatools_licenses \
-c "SELECT license_key, email, tier, source FROM licenses;"
```
Then **delete the test row** before going further:
```bash
docker compose exec postgres \
psql -U datatools_api -d datatools_licenses \
-c "DELETE FROM licenses WHERE email = 'test@example.com';"
```
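The same mint call can be made from Python once the SSH tunnel is open. This is a sketch only: `build_mint_request` is a hypothetical helper, and the endpoint path and JSON fields are copied from the `curl` example above rather than from the server's schema.

```python
import json
import urllib.request


def build_mint_request(base_url: str, admin_token: str, *, name: str, email: str,
                       tier: str = "core", years: int = 1,
                       source: str = "manual", notes: str = "") -> urllib.request.Request:
    """Build (but do not send) the POST /internal/mint request."""
    payload = {
        "name": name,
        "email": email,
        "tier": tier,
        "years": years,
        "source": source,
        "notes": notes,
    }
    return urllib.request.Request(
        f"{base_url}/internal/mint",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Sending it (requires the tunnel and a real token):
#   req = build_mint_request("http://127.0.0.1:8090", token,
#                            name="Test Buyer", email="test@example.com")
#   with urllib.request.urlopen(req, timeout=10) as resp:
#       print(resp.status, resp.read().decode())
```

Separating request construction from sending keeps the token handling easy to review and the builder easy to test offline.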
---
## 9. Operational concerns
### Backups (Postgres → off-site)
`/etc/cron.daily/datatools-license-backup`:
```bash
#!/bin/bash
set -euo pipefail
TS=$(date -u +%Y%m%dT%H%M%SZ)
OUT=/srv/datatools-license/backups/db-${TS}.sql.gz
docker compose -f /srv/datatools-license/compose.yml exec -T postgres \
pg_dump -U datatools_api datatools_licenses | gzip > "$OUT"
chmod 600 "$OUT"
# Off-site copy — pick one:
# rclone copy "$OUT" remote:datatools-license-backups/
# aws s3 cp "$OUT" s3://datatools-backups/db/ --sse AES256
find /srv/datatools-license/backups -name 'db-*.sql.gz' -mtime +30 -delete
```
Pick an off-site target. Without one, a disk failure loses every
customer record. Test the restore at least once on a staging copy.
### Monitoring
External uptime probe (free):
1. UptimeRobot account → add monitor for `https://licenses.datatools.unalogix.com/health`.
2. 5-minute interval, alert to email/SMS.
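`restart: unless-stopped` brings a container back when its process exits; the
healthcheck only reports status (plain Docker does not restart a container
just for being `unhealthy`). To see recent failures: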
```bash
docker compose ps # last health-check status
docker compose logs api --tail 200
journalctl -u docker --since '1 hour ago' | grep datatools
```
### Log rotation
Docker handles container logs; cap their size in
`/etc/docker/daemon.json`:
```json
{
"log-driver": "json-file",
"log-opts": {
"max-size": "10m",
"max-file": "3"
}
}
```
Then `systemctl restart docker` (this restarts all containers — schedule
during a quiet window).
### Key rotation (future)
If the private signing key is ever compromised:
1. Generate a new keypair (`scripts/generate_keypair.py`).
2. Build and ship a desktop release with the new pubkey embedded.
3. Update `/srv/datatools-license/secrets/license_privkey` and
`/srv/datatools-license/.env`'s pubkey.
4. `docker compose restart api`.
5. Re-issue every active license (script that queries the DB, calls
`/internal/mint`, emails buyers). Old blobs will fail verification
in the new desktop build.
Plan a 90-day overlap window where the desktop verifies against
*both* keys before retiring the old pubkey. (Verification logic
change to the desktop app — not in scope for PR 1.)
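The overlap window amounts to trying each trusted public key in turn. A minimal sketch of that loop follows; the function name and the injected `verify` callable are assumptions, and the real desktop check would plug in an Ed25519 verifier from whatever crypto library the app already uses.

```python
from typing import Callable, Iterable, Optional

# (public_key, message, signature) -> True if the signature is valid.
Verifier = Callable[[bytes, bytes, bytes], bool]


def first_accepting_key(public_keys: Iterable[bytes], message: bytes,
                        signature: bytes, verify: Verifier) -> Optional[bytes]:
    """Return the first trusted key that accepts the signature, else None.

    Hypothetical helper: list the new key first and the old key second during
    the overlap window, then drop the old key when the window closes.
    """
    for key in public_keys:
        if verify(key, message, signature):
            return key
    return None
```

Returning the matching key (rather than a bare boolean) lets the app notice licenses still signed with the old key and prompt for re-issue.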
---
## 10. PR cutover sequence
This runbook covers the box-level scaffolding. Application code lands
in three independently shippable PRs:
| PR | Adds | Ship gate | Webhook live? |
|---|---|---|---|
| **1** | Source-agnostic Mint API + Postgres + `datatools-admin mint` CLI | Operator can mint a comp license through the server | No |
| **2** | Gumroad adapter + webhook receiver + email send | Real Gumroad sale auto-mints + emails buyer | **Yes** (enable in Gumroad dashboard at this PR's deploy) |
| **3** | Renewal / re-delivery portal | Buyer self-services renewals and lost-blob re-delivery | (unchanged) |
§1d (Gumroad webhook URL) is **filled in during PR 2's deploy**, not
before. Until then the endpoint returns 404.
---
## 11. Rollback
Each component is independently reversible.
```bash
# Stop and remove containers (DB volume persists)
docker compose -f /srv/datatools-license/compose.yml down
# Full teardown including DB (DESTRUCTIVE — backup first)
docker compose -f /srv/datatools-license/compose.yml down -v
# Remove nginx site
rm /etc/nginx/sites-enabled/unalogix
nginx -t && systemctl reload nginx
# Revoke + delete TLS cert
certbot delete --cert-name datatools.unalogix.com
# Remove filesystem
rm -rf /srv/datatools-license # NOTE: includes secrets dir; backup first
# Remove system user
deluser datatools-api
delgroup datatools-api
```
DNS records can stay or be removed — they're not on this host.

@@ -3,6 +3,9 @@
> Creator-only. Do not ship to buyers.
> **Version**: 1.6 · **Updated**: 2026-05-01
For the end-to-end picture (desktop app + license server + storefronts
+ email), see `ARCHITECTURE.md`. This doc focuses on desktop internals.
## 1. Architecture
- **Dual interface**: CLI + GUI, both wrapping the same `src/core/` library.
@@ -31,8 +34,8 @@ src/
normalizers.py # Per-column normalizers for dedup matching
text_clean.py # clean_dataframe + smart_title_case
_constants.py # Shared USPS abbrevs + state names
cli.py # Deduplicator CLI (Typer)
cli_text_clean.py # Text Cleaner CLI
cli.py # Find Duplicates CLI (Typer)
cli_text_clean.py # Clean Text CLI
cli_analyze.py # Analyzer CLI (--json)
gui/
app.py # Streamlit entry point
@@ -119,6 +122,17 @@ Tag a release → 3 platform artifacts upload to GitHub Releases. Manual: copy t
`demo/streamlit_app.py` → Streamlit Community Cloud. Configure deployment in Streamlit UI. Custom domain via CNAME (verify policy at deploy time). Fall back to $5/mo VPS if rate limits / branding constraints hit.
### 3.10 Bundled Tesseract (PDF Extractor OCR)
Frozen builds ship Tesseract 5.5 + `eng.traineddata` inside the PyInstaller bundle so scanned PDFs work without a separate install. Per-platform binary URLs pinned in `build/tesseract.py`; tessdata vendored at `build/vendor/tessdata/eng.traineddata`. License attribution in `LICENSE_TESSERACT.txt` at the repo root.
**Discovery order at runtime** (see `docs/DEVELOPER.md` for the full Path layout):
1. `DATATOOLS_TESSERACT_BIN` env var override.
2. Bundled path under `sys._MEIPASS / "tesseract" /` (frozen bundles only).
3. `tesseract` on `PATH` (source / pip developer environments).
4. Windows well-known locations.
## 4. Libraries
| Purpose | Library |
@@ -189,7 +203,7 @@ GUI / CLI handlers use `format_for_user()` so the user always sees: file path, o
| Bundle | Status |
|--------|--------|
| Data Cleaning Mastery | 3/9 tools Ready (Dedup, Text Cleaner, Format Standardizer); 6 stubs |
| Data Cleaning Mastery | 3/9 tools Ready (Find Duplicates, Clean Text, Standardize Formats); 6 stubs |
| Automated Business Reporting | Not started |
| Ecommerce Data Pipeline | Not started |
| Small Business Finance | Not started |
@@ -211,12 +225,12 @@ Deliberately separate. Confluent original spec was wrong.
| Script | Owns |
|--------|------|
| 04 Missing Value Handler | "What's not there." Disguised nulls (`N/A`, `-`, sentinel codes), missingness patterns, imputation, drop-by-threshold. |
| 06 Outlier Detector | "What shouldn't be there." z-score / IQR / modified-z, multivariate (Isolation Forest, Mahalanobis), domain rules, winsorization. |
| 04 Fix Missing Values | "What's not there." Disguised nulls (`N/A`, `-`, sentinel codes), missingness patterns, imputation, drop-by-threshold. |
| 06 Find Unusual Values | "What shouldn't be there." z-score / IQR / modified-z, multivariate (Isolation Forest, Mahalanobis), domain rules, winsorization. |
**Run order**: 04 before 06. Outlier stats on data with `NaN` / sentinels are mathematically poisoned (means dragged, IQR widens, false negatives).
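A toy illustration of why the order matters, using made-up numbers and a plain z-score (not the tool's implementation): a single `-999` sentinel inflates the standard deviation so much that a genuine outlier no longer stands out.

```python
import statistics


def z_score(value: float, sample: list) -> float:
    """Sample z-score of `value` relative to `sample`."""
    return (value - statistics.mean(sample)) / statistics.stdev(sample)


# Twenty ordinary readings plus one genuine outlier (95).
clean = [10, 11, 12, 13] * 5 + [95]
# The same column before tool 04 has replaced a disguised null.
with_sentinel = clean + [-999]

z_clean = z_score(95, clean)             # well above a typical |z| > 3 cutoff
z_poisoned = z_score(95, with_sentinel)  # falls below 1, so 95 is missed
```

With the sentinel present, the only value a z-score rule would flag is the sentinel itself, which is exactly the false negative described above.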
**Pipeline order** (Pipeline Runner enforces): 02 → 03 → 04 → 05 → 06 → 07 → 08. 01 is order-flexible.
**Pipeline order** (Automated Workflows enforces): 02 → 03 → 04 → 05 → 06 → 07 → 08. 01 is order-flexible.
**Contested cases**:
- Whitespace-only cell — 02 trims to empty; 04 then flags empty as null.
@@ -239,6 +253,15 @@ The GUI uses an in-house, JSON-backed translation layer at `src/i18n/`. **No** `
**Why not gettext**: zero compiled artifacts in the PyInstaller bundle, no build step before tests run, no `.po`/`.mo` round-trip for translators (anyone can edit JSON), and the same lookup works in unit tests without process state. Locked in because the surface won't grow large enough to need the alternative, and the alternative breaks the "drop a file, run pytest, ship" loop.
## 10c. GUI chrome — sidebar nav indicator swap
Streamlit's `st.Page`-driven sidebar renders section headers with a Material Symbols ligature (`expand_more` / `expand_less`). The header element is not a button and carries no `aria-expanded`, so a pure-CSS swap can't follow open/closed state. We replace the glyph with plain typographic `+` / `−` (U+2212) via JS:
- **CSS** (`components/_legacy.py`, `_HIDE_CHROME_CSS`) drops the Material Symbols font on `[data-testid="stIconMaterial"]` inside `[data-testid="stNavSectionHeader"]` so the rewritten character renders as normal text rather than re-resolving as an icon name.
- **JS** (`_SWAP_NAV_SECTION_INDICATOR_JS`) walks each section header, reads the icon's text node, and rewrites `expand_more` → `+` / `expand_less` → `−`. A MutationObserver re-runs the swap when Streamlit re-renders the sidebar (RAF-throttled so a burst of mutations is one swap).
The script ships through the same component-iframe bundle as the brand injector and upload-button rename inside `hide_streamlit_chrome()` — one iframe per page, three DOM mutations.
## 11. Per-script functional specs
Specs live in this section as scripts enter active build. Each follows the Tier 1/2/3 structure with explicit strategic framing (what's the market gap given some of this is free elsewhere).

@@ -4,29 +4,90 @@
**Version**: 1.6 · **Updated**: 2026-05-13
## 0. First launch: activation
DataTools must be activated before any tools unlock. On first launch you'll see the **Activate** screen.
Enter your full name and email, paste the license code from your purchase email (it starts with `DTLIC1:`), and click **Activate**. Renewal works the same way: paste the renewal code and click **Apply renewal**.
**Tiers**:
| Tier | Tools |
|---|---|
| **Lite** | Find Duplicates · Clean Text · Standardize Formats |
| **Core** | All 9 tools |
A Lite user who opens a Core-only tool sees an "Upgrade your license" message. The home page also shows a 🔒 Locked badge on the cards of tools your tier doesn't include. To upgrade, paste a Core code on the Activate page.
Every license lasts 1 year. The sidebar shows your tier and days remaining at all times; a renewal notice appears 30 days before expiry. The license file lives at `~/.datatools/license.json` (Windows: `C:\Users\<you>\.datatools\license.json`).
To use the same license on another machine: deactivate this one (Activate page → **Deactivate this device**) and paste your code again on the new one.
## 1. Installation
You don't need Python installed: the package is self-contained.
You don't need Python or administrator rights: the package ships its own interpreter and all dependencies. Each operating system has a single installer that automatically creates the desktop shortcut and the Start Menu / Launchpad entry.
| Operating system | File | How |
|----|------|-----|
| Windows | `BundleName-Setup-1.0.exe` | Double-click the installer → desktop shortcut. |
| macOS | `BundleName-1.0.dmg` | Mount the DMG and drag it to Applications. Signed and notarized. |
| Linux | `BundleName-1.0.AppImage` | `chmod +x`, double-click. (A fallback `.tar.gz` is also available.) |
### 1.1 Windows
Launching the app opens your default browser to a local page (`http://localhost:8501`).
**Installer (`DataTools-<ver>-win-setup.exe`)**
### How the graphical interface (GUI) works
1. Download `DataTools-<ver>-win-setup.exe` from your license email or GitHub Releases.
2. Double-click the installer. The first time, Windows SmartScreen shows **"Windows protected your PC"**; click **More info** → **Run anyway**. (This warning appears only once per build until we have an EV code-signing certificate.)
3. Accept the per-user install path (`%LOCALAPPDATA%\Programs\DataTools` by default; no UAC prompt). Check **Create a desktop shortcut** if you want one (on by default).
4. Click **Install**, then **Finish**. The installer offers to launch DataTools when it finishes.
5. From now on, launch it from **Start Menu → DataTools**, the **desktop shortcut**, or by typing `DataTools` into Run (Win+R) / cmd.
To pin it to the taskbar, launch the app once, right-click its taskbar icon, and choose **Pin to taskbar**. Windows requires this manual step; no installer can pin programmatically.
**Uninstall**: Settings → Apps → DataTools → Uninstall.
### 1.2 macOS
**Installer DMG (`DataTools-<ver>-mac.dmg`)**
1. Download `DataTools-<ver>-mac.dmg`.
2. Double-click the .dmg. A Finder window opens with the **DataTools** icon and an **Applications** alias.
3. Drag **DataTools** onto **Applications**. Wait for the copy to finish, then eject the DMG.
4. On unsigned builds, the first launch shows **"'DataTools' cannot be opened because the developer cannot be verified"**. Fix: right-click DataTools in /Applications → **Open** → confirm **Open** in the dialog. macOS remembers the choice, so later launches show nothing.
5. Launch it from **Launchpad**, **Spotlight** (`⌘ Space` → type "DataTools"), or **Applications** in Finder.
To keep DataTools in the Dock: launch the app, right-click its Dock icon → **Options → Keep in Dock**. macOS does not let installers pin to the Dock automatically.
**Uninstall**: drag `DataTools.app` to the Trash. Your data files stay where they are; the app installs nothing else.
### 1.3 Linux
`DataTools-<ver>-linux-x86_64.AppImage` is already portable; there is no separate .zip.
1. Download the .AppImage.
2. `chmod +x DataTools-*.AppImage`.
3. Double-click it, or run it from the terminal.
If your distro doesn't include FUSE 2: `sudo apt install libfuse2` (Debian/Ubuntu) or the equivalent.
### 1.4 What happens on first launch
The launcher (named `DataTools.exe` / `DataTools.app` / `DataTools.AppImage`) does three things, in order:
1. Picks a free TCP port on `127.0.0.1`, normally 8501; if that is taken it tries 8502, 8503, …
2. Starts a local Streamlit server on that port. The server is bound to localhost only, never to your network.
3. Opens your default browser at `http://127.0.0.1:<port>/`. If the browser doesn't open within 5 seconds, paste that URL manually.
The launcher window stays open in the background. Closing it stops the server, and the browser tab will then report that the site can't be reached.
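The port-selection step can be sketched as a simple bind probe. This is an assumption about how such a launcher might work, not the shipped code; the 8501–8550 range is taken from the troubleshooting section of this guide, and `find_free_port` is a hypothetical name.

```python
import socket
from typing import Optional


def find_free_port(start: int = 8501, end: int = 8550,
                   host: str = "127.0.0.1") -> Optional[int]:
    """Return the first port in [start, end] that can be bound on `host`."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))  # fails with OSError if the port is taken
            except OSError:
                continue
            return port  # probe socket is closed on leaving the `with` block
    return None
```

There is a small race between releasing the probe socket and the server binding the port, which is usually acceptable for a single-user desktop launcher.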
### 1.5 How the GUI works
- It runs locally on your machine. **No internet, no uploads.**
- The browser is only the display layer. Closing it stops the underlying program.
- Prefer the terminal? Every tool also includes a command-line interface (CLI); see Section 3.
- The browser is only the display layer. Closing it does NOT stop the app; close the launcher window (or quit the macOS .app from the Dock) to shut it down completely.
- Prefer the terminal? Every tool also includes a CLI; see Section 3.
### System requirements
### 1.6 System requirements
- Windows 10/11 (64-bit), macOS 11+, modern Linux (2020+).
- Modern browser (Chrome, Edge, Firefox, Safari; last 3 years).
- ~400-500 MB of free disk space.
- ~500 MB of free disk space (the package takes ~300 MB; the rest is working space for large CSVs).
**OCR for scanned PDFs is included**: Tesseract 5.5 and the English model `eng.traineddata` ship inside every installer / portable / AppImage. The PDF Extractor's scanned-PDF extraction path works without extra configuration; nothing needs to be installed separately. (Anyone running from a checkout with `pip install -r requirements.txt` still needs a system Tesseract on `PATH`; see [DEVELOPER.md §PDF Extractor — bundled Tesseract](DEVELOPER.md#pdf-extractor--bundled-tesseract) (English only).)
Full support matrix: [REQUIREMENTS.md](REQUIREMENTS.md) (English only).
@@ -34,15 +95,15 @@ Full support matrix: [REQUIREMENTS.md](REQUIREMENTS.md) (English only)
| # | Tool | Purpose | Status |
|---|------|---------|--------|
| 01 | Deduplicator | Exact + fuzzy matching, 5 normalizers, audit trail | Ready |
| 02 | Text Cleaner | Whitespace, typographic characters, BOM, line endings, letter case | Ready |
| 03 | Format Standardizer | Dates / phones / emails / addresses / names / currencies / booleans | Ready |
| 04 | Missing Value Handler | Disguised nulls, imputation, drop-by-threshold | Coming soon |
| 05 | Column Mapper | Rename + apply schema | Coming soon |
| 06 | Outlier Detector | z-score, IQR, multivariate | Coming soon |
| 07 | Multi-File Merger | Combines several files | Coming soon |
| 08 | Validator & Reports | Rules + PDF/Excel report | Coming soon |
| 09 | Pipeline Runner | One-click multi-tool launcher | Coming soon |
| 01 | Find Duplicates | Exact + fuzzy matching, 5 normalizers, audit trail | Ready |
| 02 | Clean Text | Whitespace, typographic characters, BOM, line endings, letter case | Ready |
| 03 | Standardize Formats | Dates / phones / emails / addresses / names / currencies / booleans | Ready |
| 04 | Fix Missing Values | Disguised nulls, imputation, drop-by-threshold | Coming soon |
| 05 | Map Columns | Rename + apply schema | Coming soon |
| 06 | Find Unusual Values | z-score, IQR, multivariate | Coming soon |
| 07 | Combine Files | Combines several files | Coming soon |
| 08 | Quality Check | Rules + PDF/Excel report | Coming soon |
| 09 | Automated Workflows | One-click multi-tool launcher | Coming soon |
**Sample data** (`samples/`): `messy_sales.csv`, `bank_export.xlsx`.
@@ -58,6 +119,10 @@ Full support matrix: [REQUIREMENTS.md](REQUIREMENTS.md) (English only)
Advanced options live in collapsible panels. The original file is never modified.
**In-tool help**: every page has a **Help** button to the right of the title. Clicking it opens a pop-up with a compact guide (When to use it · Steps · Examples · Tip). Use it as a mid-task reminder; the pop-up closes when you click outside it, and your data is not affected.
**Sidebar navigation**: the sidebar groups the tools into sections (Analysis, Data Cleaners, Transformations, Automations). Each header shows `+` when collapsed and `−` when expanded; click the header to toggle.
### 3.2 CLI
```bash
@@ -70,17 +135,17 @@ Help: `deduplicator --help`. Full reference: [CLI-REFERENCE.es.md](CLI-REF
### 3.3 Run order (when using the tools manually)
If you don't use the Pipeline Runner, follow this order:
If you don't use Automated Workflows, follow this order:
1. **02 Text Cleaner** first: it normalizes whitespace and special characters.
2. **03 Format Standardizer**: dates, phones, etc. need clean text.
3. **04 Missing Value Handler**: sentinel codes hide as numbers.
4. **05 Column Mapper**: schema before outlier statistics.
5. **06 Outlier Detector**: needs clean numeric data. Computing statistics with `NaN` or `-999` poisons the results.
6. **07 Multi-File Merger**, **08 Validator** as needed.
7. **01 Deduplicator** is order-flexible (it normalizes internally for matching).
1. **02 Clean Text** first: it normalizes whitespace and special characters.
2. **03 Standardize Formats**: dates, phones, etc. need clean text.
3. **04 Fix Missing Values**: sentinel codes hide as numbers.
4. **05 Map Columns**: schema before outlier statistics.
5. **06 Find Unusual Values**: needs clean numeric data. Computing statistics with `NaN` or `-999` poisons the results.
6. **07 Combine Files**, **08 Quality Check** as needed.
7. **01 Find Duplicates** is order-flexible (it normalizes internally for matching).
The Pipeline Runner applies this order automatically.
Automated Workflows applies this order automatically.
### 3.4 Language
@@ -118,12 +183,15 @@ The original file is never modified.
## 6. Troubleshooting
- **The GUI doesn't open / the browser doesn't start**: wait 10-15 s, then visit `http://localhost:8501` manually. Port-in-use error → close other instances.
- **The GUI doesn't open / the browser doesn't start**: wait 10-15 s, then visit `http://127.0.0.1:8501` manually (or whichever port the launcher window shows). Port-in-use error → close other instances. The launcher scans ports 8501–8550 for a free one, so a hung instance can shift the URL.
- **Why does the browser open?** It's the local web-app pattern (same as Jupyter or RStudio). Nothing leaves your machine.
- **Windows SmartScreen**: click "More info" → "Run anyway". Standard for software without an EV signature.
- **macOS "The application is damaged"**: download it again (it was probably corrupted in transit).
- **The Linux AppImage won't run**: `chmod +x file.AppImage`. If FUSE is missing → `sudo apt install libfuse2` or use the `.tar.gz`.
- **Windows SmartScreen**: click "More info" → "Run anyway". Only once per build until we have an EV certificate.
- **macOS "The application is damaged" / "the developer cannot be verified"**: right-click the app → **Open** → confirm. If the message persists, the file was corrupted in transit; download it again. Last resort: `xattr -cr /Applications/DataTools.app` clears the quarantine attribute.
- **macOS: the extracted portable .zip won't open**: Safari unzips on download; if you see a `__MACOSX/` folder or `._DataTools.app` files, you used a different extractor. Extract again with the built-in Archive Utility (right-click the .zip → **Open With → Archive Utility**) to preserve the .app metadata.
- **Windows: antivirus quarantines the portable `DataTools.exe`**: your antivirus doesn't recognize the package. Add the extracted folder to its allowlist. The installer .exe triggers fewer antivirus products because it is a well-known Inno Setup wrapper.
- **The Linux AppImage won't run**: `chmod +x file.AppImage`. If FUSE is missing → `sudo apt install libfuse2`.
- **Slow with large files**: above ~100k rows it takes longer; the progress bar shows this. For millions of rows → use the CLI directly.
- **Where does the app store my license / settings?** `~/.datatools/` on macOS and Linux, `C:\Users\<you>\.datatools\` on Windows. Your input and output files stay where you leave them; the app never copies them elsewhere.
- **I need help**: write to the email address shown on your purchase receipt.
## 7. License

@@ -4,29 +4,90 @@
**Version**: 1.6 · **Updated**: 2026-05-01
## 0. First launch — activation
DataTools must be activated before any tools unlock. On first launch you'll see the **Activate** screen.
Enter your full name + email, paste the license blob from your purchase email (starts with `DTLIC1:`), and click **Activate**. Renewal works the same way — paste the renewal blob, click **Apply renewal**.
**Tiers**:
| Tier | Tools |
|---|---|
| **Lite** | Find Duplicates · Clean Text · Standardize Formats |
| **Core** | All 9 tools |
A Lite user opening a Core-only tool sees an "Upgrade your license" prompt. The home page also shows a 🔒 Locked badge on tool cards your tier doesn't unlock. To upgrade, paste a Core blob on the Activate page.
Every license lasts 1 year. The sidebar shows your tier and days remaining at all times; a renewal warning appears 30 days before expiry. The license file lives at `~/.datatools/license.json` (Windows: `C:\Users\<you>\.datatools\license.json`).
To use the same license on a different machine: deactivate this one (Activate page → **Deactivate this device**) and re-paste your blob on the new machine.
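The expiry arithmetic behind the sidebar readout is simple date subtraction. The sketch below is illustrative only: the function names and status strings are hypothetical, and only the 30-day warning window comes from the text above.

```python
from datetime import date
from typing import Optional

WARNING_WINDOW_DAYS = 30  # renewal warning appears 30 days before expiry


def days_remaining(expires_on: date, today: Optional[date] = None) -> int:
    """Whole days until expiry; negative once the license has lapsed."""
    today = today or date.today()
    return (expires_on - today).days


def sidebar_status(tier: str, expires_on: date, today: Optional[date] = None) -> str:
    """Compose a hypothetical sidebar line for the given tier and expiry date."""
    left = days_remaining(expires_on, today)
    if left < 0:
        return f"{tier}: expired"
    if left <= WARNING_WINDOW_DAYS:
        return f"{tier}: {left} days left, renew soon"
    return f"{tier}: {left} days left"
```

Passing `today` explicitly keeps the logic testable without patching the system clock.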
## 1. Install
You don't need Python — the bundle is self-contained.
You don't need Python and you don't need admin rights — the bundle ships its own interpreter and every dependency. Each OS gets a single installer that wires up the Desktop shortcut + Start Menu / Launchpad entry automatically.
| OS | File | How |
|----|------|-----|
| Windows | `BundleName-Setup-1.0.exe` | Double-click installer → desktop shortcut. |
| macOS | `BundleName-1.0.dmg` | Mount, drag to Applications. Signed + notarized. |
| Linux | `BundleName-1.0.AppImage` | `chmod +x`, double-click. (`.tar.gz` fallback available.) |
### 1.1 Windows
Launching opens your default browser to a local page (`http://localhost:8501`).
**Installer (`DataTools-<ver>-win-setup.exe`)**
### How the GUI works
1. Download `DataTools-<ver>-win-setup.exe` from your release email or GitHub Releases.
2. Double-click the installer. On the first run Windows SmartScreen will say **"Windows protected your PC"**; click **More info** → **Run anyway**. (This warning only appears once per build until we have an EV code-signing cert.)
3. Accept the per-user install location (`%LOCALAPPDATA%\Programs\DataTools` by default — no admin prompt). Check **Create a desktop shortcut** if you want one (on by default).
4. Click **Install**, then **Finish**. The installer offers to launch DataTools immediately.
5. From now on launch from: **Start Menu → DataTools**, the **Desktop shortcut**, or just type `DataTools` into Windows Run (Win+R) / cmd.
To pin to the taskbar, launch the app once, right-click its icon in the taskbar, then **Pin to taskbar**. Windows requires this manual step — no installer is allowed to pin programmatically.
**Uninstall**: Settings → Apps → DataTools → Uninstall.
### 1.2 macOS
**Installer DMG (`DataTools-<ver>-mac.dmg`)**
1. Download `DataTools-<ver>-mac.dmg`.
2. Double-click the .dmg. A Finder window opens showing the **DataTools** icon and an **Applications** alias.
3. Drag **DataTools** onto **Applications**. Wait for the copy to finish, then eject the DMG.
4. On unsigned builds the first launch shows **"DataTools" cannot be opened because the developer cannot be verified**. Fix: right-click DataTools in /Applications → **Open** → confirm **Open** in the dialog. macOS remembers this choice — subsequent launches are clean.
5. Launch from **Launchpad**, **Spotlight** (`⌘ Space` → type "DataTools"), or **Applications** in Finder.
To keep DataTools in the Dock: launch the app, right-click its Dock icon → **Options → Keep in Dock**. macOS doesn't allow installers to pin to the Dock automatically.
**Uninstall**: drag `DataTools.app` to the Trash. Your data files stay where you put them — nothing else is installed.
### 1.3 Linux
`DataTools-<ver>-linux-x86_64.AppImage` is already portable — no separate zip needed.
1. Download the .AppImage.
2. `chmod +x DataTools-*.AppImage`.
3. Double-click, or run it from a terminal.
If your distro doesn't ship FUSE 2: `sudo apt install libfuse2` (Debian/Ubuntu) or equivalent.
### 1.4 What happens on first launch
The launcher (called `DataTools.exe` / `DataTools.app` / `DataTools.AppImage`) does three things, in order:
1. Picks a free TCP port on `127.0.0.1` — usually 8501, falls back through 8502, 8503, … if another app is using 8501.
2. Starts a local Streamlit server on that port. The server is **bound to localhost only**, never to your LAN.
3. Opens your default browser at `http://127.0.0.1:<port>/`. If the browser doesn't open within 5 seconds, paste that URL into your browser manually.
The launcher window stays open in the background. Closing it stops the server — the browser tab will say "this site can't be reached" the next time you click it.
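The port-picking step can be sketched in a few lines of Python. This is an illustrative sketch only, not the launcher's real implementation: `pick_free_port` and `local_url` are hypothetical names, and the upper bound of 8550 is an assumption taken from the troubleshooting notes.

```python
import socket


def pick_free_port(host: str = "127.0.0.1", first: int = 8501, last: int = 8550) -> int:
    """Return the first port in [first, last] that can be bound on host."""
    for port in range(first, last + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                # Binding to 127.0.0.1 keeps the probe on the loopback
                # interface only, never on the LAN.
                probe.bind((host, port))
            except OSError:
                continue  # another program owns this port, try the next one
            return port
    raise RuntimeError(f"no free port between {first} and {last}")


def local_url(port: int, host: str = "127.0.0.1") -> str:
    """URL to open (or paste into the browser) for the chosen port."""
    return f"http://{host}:{port}/"
```

With nothing else running, `local_url(pick_free_port())` would normally give `http://127.0.0.1:8501/`; if 8501 is occupied, the next free port is used instead, which is why the URL can change between launches.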
### 1.5 How the GUI works
- Runs locally on your machine. **No internet, no upload.**
- Browser is just the display surface. Closing it stops the underlying program.
- The browser is just the display surface. Closing it does NOT stop the app — close the launcher window (or quit the macOS .app from the Dock) to fully exit.
- Prefer the terminal? Every tool ships with a CLI too (Section 3).
### System requirements
### 1.6 System requirements
- Windows 10/11 (64-bit), macOS 11+, modern Linux (2020+).
- Modern browser (Chrome, Edge, Firefox, Safari, last 3 years).
- ~400-500 MB free disk space.
- ~500 MB free disk space (the bundle itself is ~300 MB; the rest is working scratch space for large CSVs).
**OCR for scanned PDFs is bundled** — Tesseract 5.5 + the English `eng.traineddata` model ship inside every installer / portable / AppImage. The PDF Extractor's scanned-statement path works out of the box; no separate install required. (Developers running from a `pip install -r requirements.txt` checkout still need system Tesseract on `PATH` — see [DEVELOPER.md §PDF Extractor — bundled Tesseract](DEVELOPER.md#pdf-extractor--bundled-tesseract).)
Full numbered support matrix: [REQUIREMENTS.md](REQUIREMENTS.md).
@@ -34,15 +95,15 @@ Full numbered support matrix: [REQUIREMENTS.md](REQUIREMENTS.md).
| # | Tool | Purpose | Status |
|---|------|---------|--------|
| 01 | Deduplicator | Exact + fuzzy match, 5 normalizers, audit | Ready |
| 02 | Text Cleaner | Whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | Format Standardizer | Dates / phones / emails / addresses / names / currencies / booleans | Ready |
| 04 | Missing Value Handler | Disguised nulls, imputation, drop-by-threshold | Coming Soon |
| 05 | Column Mapper | Rename + enforce schema | Coming Soon |
| 06 | Outlier Detector | z-score, IQR, multivariate | Coming Soon |
| 07 | Multi-File Merger | Combine multiple files | Coming Soon |
| 08 | Validator & Reporter | Rules + PDF/Excel report | Coming Soon |
| 09 | Pipeline Runner | One-click multi-tool launcher | Coming Soon |
| 01 | Find Duplicates | Exact + fuzzy match, 5 normalizers, audit | Ready |
| 02 | Clean Text | Whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | Standardize Formats | Dates / phones / emails / addresses / names / currencies / booleans | Ready |
| 04 | Fix Missing Values | Disguised nulls, imputation, drop-by-threshold | Coming Soon |
| 05 | Map Columns | Rename + enforce schema | Coming Soon |
| 06 | Find Unusual Values | z-score, IQR, multivariate | Coming Soon |
| 07 | Combine Files | Combine multiple files | Coming Soon |
| 08 | Quality Check | Rules + PDF/Excel report | Coming Soon |
| 09 | Automated Workflows | One-click multi-tool launcher | Coming Soon |
**Sample data** (`samples/`): `messy_sales.csv`, `bank_export.xlsx`.
@@ -58,6 +119,10 @@ Full numbered support matrix: [REQUIREMENTS.md](REQUIREMENTS.md).
Advanced options are tucked in expander panes. The original file is never modified.
**In-tool Help**: every tool page has a **Help** button right of the title. Click it to open a popover with a compact how-to (When to use · Steps · Examples · Tip). Use it as a refresher mid-task — the popover closes when you click outside, your inputs are untouched.
**Sidebar nav**: the sidebar groups tools into sections (Analysis, Data Cleaners, Transformations, Automations). Each section header shows `+` when collapsed and `−` when expanded — click the header to toggle.
### 3.2 CLI
```bash
@@ -70,17 +135,17 @@ Get help: `deduplicator --help`. Full reference: [CLI-REFERENCE.md](CLI-REFERENC
### 3.3 Run order (when running tools manually)
If you skip the Pipeline Runner, follow this order:
If you skip Automated Workflows, follow this order:
1. **02 Text Cleaner** first — normalizes whitespace + special chars.
2. **03 Format Standardizer** — dates, phones, etc. need cleaned text.
3. **04 Missing Value Handler** — sentinel codes hide as numbers.
4. **05 Column Mapper** — schema before outlier stats.
5. **06 Outlier Detector** — needs clean numerics. Stats on data with `NaN` or `-999` are mathematically poisoned.
6. **07 Multi-File Merger**, **08 Validator** as needed.
7. **01 Deduplicator** is order-flexible (normalizes internally for matching).
1. **02 Clean Text** first — normalizes whitespace + special chars.
2. **03 Standardize Formats** — dates, phones, etc. need cleaned text.
3. **04 Fix Missing Values** — sentinel codes hide as numbers.
4. **05 Map Columns** — schema before outlier stats.
5. **06 Find Unusual Values** — needs clean numerics. Stats on data with `NaN` or `-999` are mathematically poisoned.
6. **07 Combine Files**, **08 Quality Check** as needed.
7. **01 Find Duplicates** is order-flexible (normalizes internally for matching).
The Pipeline Runner enforces this automatically.
Automated Workflows enforces this automatically.
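To make the ordering rule concrete, here is a small Python sketch that sorts any selection of tools into the sequence listed above. It is a hypothetical helper, not the product's own code. Placing Find Duplicates last is an assumption made for the sketch, since the guide only says that tool is order-flexible.

```python
# Recommended manual run order, taken from the numbered list above.
# Find Duplicates is order-flexible; putting it last is an assumption.
RECOMMENDED_ORDER = [
    "Clean Text",
    "Standardize Formats",
    "Fix Missing Values",
    "Map Columns",
    "Find Unusual Values",
    "Combine Files",
    "Quality Check",
    "Find Duplicates",
]


def in_run_order(selected: list[str]) -> list[str]:
    """Sort the chosen tool names into the recommended run order."""
    unknown = set(selected) - set(RECOMMENDED_ORDER)
    if unknown:
        raise ValueError(f"unknown tool(s): {sorted(unknown)}")
    # Sort by each tool's position in the recommended sequence.
    return sorted(selected, key=RECOMMENDED_ORDER.index)
```

For example, `in_run_order(["Find Unusual Values", "Clean Text"])` puts Clean Text first, so the outlier statistics run on text that has already been normalized.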
### 3.4 Language
@@ -118,12 +183,15 @@ Original input is never modified.
## 6. Troubleshooting
- **GUI won't launch / browser doesn't open** — wait 10-15 s; manually visit `http://localhost:8501`. Port-in-use error → close other instances.
- **GUI won't launch / browser doesn't open** — wait 10-15 s; manually visit `http://127.0.0.1:8501` (or whichever port the launcher window prints). Port-in-use error → close other instances. The launcher walks ports 8501–8550 looking for a free one, so a stale instance can shift the URL.
- **Why does my browser open?** — local web app pattern (same as Jupyter, RStudio). Nothing leaves your machine.
- **Windows SmartScreen** — click "More info" → "Run anyway". Standard for non-EV-signed software.
- **macOS "App is damaged"** — re-download (file likely corrupted in transit).
- **Linux AppImage won't run** — `chmod +x file.AppImage`. Missing FUSE → `sudo apt install libfuse2` or use `.tar.gz`.
- **Windows SmartScreen** — click "More info" → "Run anyway". One-time per build until we have an EV-signed cert.
- **macOS "App is damaged" / "developer cannot be verified"** — right-click the app → **Open** → confirm. If the message persists, the file was likely corrupted in transit — re-download. As a last resort: `xattr -cr /Applications/DataTools.app` clears the quarantine attribute.
- **macOS portable .zip — extracted but won't open** — Safari unzips on download by default; if you see a `__MACOSX/` folder or `._DataTools.app` file you used a different unarchiver. Re-extract with the built-in Archive Utility (right-click the .zip → **Open With → Archive Utility**) so the .app's metadata is preserved.
- **Windows portable .zip — antivirus quarantines DataTools.exe** — your AV doesn't recognize the bundle. Allowlist the extracted folder. The installer .exe trips fewer AV products because it's a known Inno Setup wrapper.
- **Linux AppImage won't run** — `chmod +x file.AppImage`. Missing FUSE → `sudo apt install libfuse2`.
- **Slow on large file** — over ~100k rows takes longer; progress bar shows. Multi-million rows → use the CLI directly.
- **Where does the app store my license / settings?** — `~/.datatools/` on macOS + Linux, `C:\Users\<you>\.datatools\` on Windows. Your input/output files stay where you put them; the app never copies them anywhere else.
- **Need help** — email the address on your purchase receipt.
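If the browser never opens and you cannot see which port the launcher printed, a short probe can show which local ports are answering. This is a minimal Python sketch under stated assumptions: `find_listening_ports` is a hypothetical helper, the 8501 to 8550 range comes from the first bullet above, and a port that answers is not guaranteed to be DataTools rather than some other local program.

```python
import socket


def find_listening_ports(host="127.0.0.1", ports=range(8501, 8551), timeout=0.2):
    """Return the ports on host that accept a TCP connection."""
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            probe.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising on refusal.
            if probe.connect_ex((host, port)) == 0:
                found.append(port)
    return found
```

Each port in the returned list is a candidate URL of the form `http://127.0.0.1:<port>/` to try in the browser. The probe only connects to the loopback address, so nothing leaves the machine.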
## 7. License


@@ -9,9 +9,9 @@ Cloudflare Pages.
```
landing/
├── _shared/styles.css shared CSS (system fonts, no externals)
├── shopify-pet/index.html Shopify operator (priority: pet supplies)
├── bookkeeper/index.html bookkeeper / freelance accountant
├── revops/index.html marketing / RevOps agency
├── bookkeeper/index.html bookkeeper — bank reconciliation
├── ap-1099/index.html accounts payable — 1099 vendor prep
├── ar-aging/index.html accounts receivable — open invoices
└── README.md this file
```
@@ -19,8 +19,8 @@ Each page:
- Inherits `landing/_shared/styles.css`
- Overrides the `--accent` colour variable in an inline `<style>` block
so each persona has its own visual identity (Shopify = mint green,
Bookkeeper = steel blue, RevOps = vivid violet)
so each persona has its own visual identity (Bookkeeper = steel blue,
AP / 1099 = amber/gold, AR = receivables green)
- Has a sticky buy bar with the Gumroad CTA tagged with `?from=<persona>`
- Embeds the live demo (Streamlit) via `<iframe>` with a sandbox attribute
- Carries persona-specific H1, sub-copy, use cases, FAQ, and a
@@ -64,13 +64,13 @@ wrangler pages deploy landing/dist
```
Configure the custom apex domain (`datatools.app`) in the Cloudflare
Pages project settings; sub-paths `/shopify-pet/`, `/bookkeeper/`,
`/revops/` are served automatically because the directory layout
Pages project settings; sub-paths `/bookkeeper/`, `/ap-1099/`,
`/ar-aging/` are served automatically because the directory layout
mirrors them. Cache rule defaults are fine (HTML 1 day, CSS 7 days).
If you want **separate Pages projects** per persona for independent
A/B testing, point three projects at the same `landing/dist/` and
configure each with its own sub-domain (`shopify.datatools.app`, etc.)
configure each with its own sub-domain (`bookkeeper.datatools.app`, etc.)
and a Pages rule that rewrites the root to that persona's
sub-directory.
@@ -110,7 +110,7 @@ Refresh the page when:
| `page_view → run_completed < 30%` for 4 weeks | The demo iframe isn't loading or visitors aren't engaging. Check the iframe URL. Move the demo above the fold if it's currently below. |
| New tool ships (06–09) | Add it to the persona's saved pipeline only if it fits — don't bloat the demo with every tool. |
| Pricing change | Update `<meta>` schema, the buybar `.price-tag`, the pricing card, and the FAQ. Search-and-replace `$49` across the file. |
| New persona added (4th, 5th) | Copy `shopify-pet/index.html`, replace persona-specific copy, add to the `footer` cross-link block on the existing pages. |
| New persona added (4th, 5th) | Copy `bookkeeper/index.html`, replace persona-specific copy, add to the `footer` cross-link block on the existing pages. |
## Why static HTML


@@ -5,7 +5,7 @@
* with zero build step, no privacy banner needed).
* • Mobile-first; layout reflows below 720 px.
* • Dark, focused, content-first. Buyer reads this on a laptop
* between Shopify exports — keep it readable and skimmable.
* between messy accounting exports — keep it readable and skimmable.
* • Persona pages all share this sheet — niche differences live in
* copy + accent-color variables overridden in each page's <style>.
*/
@@ -18,7 +18,7 @@
--text-mute: #9aa3b2;
--text-soft: #c8ced8;
--rule: #252a36;
--accent: #6ee7b7; /* Shopify pet default — overridden per persona */
--accent: #6ee7b7; /* default accent — overridden per persona */
--accent-ink: #052e1a;
--warn: #fbbf24;
--max: 1080px;

391
landing/ap-1099/index.html Normal file

@@ -0,0 +1,391 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>DataTools for 1099 Prep — Clean Your Vendor Master & Recover Missing EINs Locally · $49</title>
<meta name="description" content="Build a clean 1099 vendor list — locally. Consolidates duplicate vendor rows, backfills scattered EINs, and flags the genuinely missing ones. 24 messy records → 8 complete vendors, 7 EINs recovered. Your data never leaves your computer. $49 one-time." />
<meta name="keywords" content="1099 vendor list, missing EIN, accounts payable cleanup, vendor master dedupe, 1099-NEC prep, QuickBooks vendor export, deduplicate vendors" />
<link rel="canonical" href="https://datatools.app/ap-1099/" />
<link rel="stylesheet" href="../_shared/styles.css" />
<!-- Persona accent: Accounts Payable / 1099 → amber/gold invoice tone -->
<style>
:root { --accent: #d97706; --accent-ink: #2a1604; }
</style>
<!-- Open Graph -->
<meta property="og:title" content="DataTools for 1099 Prep — Clean Your Vendor Master & Recover Missing EINs Locally" />
<meta property="og:description" content="Consolidate duplicate vendors, backfill scattered EINs, file 1099-NECs on time. Local. No upload. $49 one-time." />
<meta property="og:type" content="product" />
<meta property="og:url" content="https://datatools.app/ap-1099/" />
<!-- Schema.org Product -->
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "SoftwareApplication",
"name": "DataTools for 1099 Prep",
"operatingSystem": "Windows, macOS, Linux",
"applicationCategory": "BusinessApplication",
"offers": {
"@type": "Offer",
"price": "49",
"priceCurrency": "USD"
},
"description": "Clean your accounts-payable vendor master locally for 1099-NEC season. Six-tool data-cleaning bundle: dedupe-merge to consolidate duplicate vendor rows and backfill missing EINs, text-clean, format-standardize, missing-value handle, column-map, pipeline.",
"softwareVersion": "1.0"
}
</script>
</head>
<body>
<!-- ============= Sticky buy bar ============= -->
<div class="buybar">
<div class="buybar-inner">
<div class="brand"><span class="brand-mark"></span> DataTools <span class="muted">/ for 1099 prep</span></div>
<div>
<span class="price-tag">$49 — one-time, no subscription</span>
<a class="btn" href="https://gumroad.com/l/datatools?from=ap-1099" rel="noopener">Get DataTools →</a>
</div>
</div>
</div>
<!-- ============= Hero ============= -->
<section class="hero">
<div class="container">
<div class="eyebrow">For accounts payable · 1099-NEC season · vendor master cleanup</div>
<h1>Build a clean 1099 vendor list —<br /><strong>with the missing EINs filled in.</strong></h1>
<p class="lead">
The same vendor got entered three times across the year — one row has
the EIN, another the address, another the phone — and now it's January
and you can't file because the numbers are scattered. DataTools
consolidates each vendor to one row and backfills the gaps from the
duplicates: in our sample, <strong>24 messy records become 8 complete
vendors with 7 missing EINs recovered</strong> from duplicate rows.
<strong>Your data never leaves your computer.</strong>
</p>
<div class="cta-row">
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=ap-1099" rel="noopener">Get DataTools for Accounting — $49 →</a>
<a class="btn btn-ghost btn-large" href="#demo">Try the live demo ↓</a>
<span class="price-note">One-time payment · cross-platform · runs offline</span>
</div>
<div class="stats">
<div class="stat"><div class="num">24→8</div><div class="label">messy records to complete vendors</div></div>
<div class="stat"><div class="num">7</div><div class="label">missing EINs recovered</div></div>
<div class="stat"><div class="num">0</div><div class="label">cloud uploads ever</div></div>
</div>
</div>
</section>
<!-- ============= Pain points ============= -->
<section>
<div class="container">
<div class="eyebrow">If any of these sound like your January</div>
<h2>Six pains DataTools fixes in one pass</h2>
<div class="grid">
<div class="card">
<span class="icon">🧾</span>
<h3>The same vendor is in the list two or three times</h3>
<p>Different staff entered "Acme LLC", "Acme, L.L.C.", and "ACME Llc" across the year. Each is a separate row in the vendor master, and each only holds part of the story — so your 1099 totals split across three near-duplicate spellings.</p>
<p class="muted"><strong>What it costs:</strong> hours of manual matching, plus the risk of filing the wrong total.</p>
</div>
<div class="card">
<span class="icon">🔢</span>
<h3>The EIN is on a different row than the rest of the details</h3>
<p>One record captured the EIN at onboarding; the row you actually paid against doesn't have it. At 1099 time the field is blank even though you collected it months ago — it's just sitting on a duplicate.</p>
<p class="muted"><strong>What it costs:</strong> chasing W-9s you already have on file.</p>
</div>
<div class="card">
<span class="icon">📵</span>
<h3>Phones, addresses, and amounts are formatted five different ways</h3>
<p>Remittance phone as <code>(212) 555-0147</code> on one row and <code>212.555.0147</code> on another. Amounts with stray <code>$</code> and commas. The export won't reconcile and the 1099-NEC box totals don't tie out.</p>
<p class="muted"><strong>What it costs:</strong> a half-day reconciling before you can even start filing.</p>
</div>
<div class="card">
<span class="icon"></span>
<h3>You don't know which EINs are genuinely missing</h3>
<p>Some EINs are recoverable from a duplicate row. Some you never collected. Until the list is consolidated you can't tell the two apart — so you either over-chase vendors or under-file.</p>
<p class="muted"><strong>What it costs:</strong> late filings and TIN-mismatch penalties.</p>
</div>
<div class="card">
<span class="icon">📤</span>
<h3>Your QuickBooks vendor export doesn't match your AP ledger</h3>
<p>The vendor master in QuickBooks, the payments spreadsheet, and the W-9 tracker each use different column names for "vendor name" / "Tax ID" / "amount paid." Merging them is an afternoon of manual rename before any analysis begins.</p>
<p class="muted"><strong>What it costs:</strong> 4–8 hours per filing season manually merging exports.</p>
</div>
<div class="card">
<span class="icon">🔒</span>
<h3>Cloud cleaners want you to upload your vendor master</h3>
<p>Your vendor master holds EINs, remittance addresses, and payment history — exactly the data you should not be uploading to a SaaS to clean. DataTools is desktop-only — your vendor list never leaves your computer.</p>
<p class="muted"><strong>What it costs:</strong> nothing — and that's the point.</p>
</div>
</div>
</div>
</section>
<!-- ============= Live demo ============= -->
<section id="demo">
<div class="container">
<div class="eyebrow">Live demo · runs in your browser</div>
<h2>Try it on a real-looking vendor master export</h2>
<p>
The demo below loads a sample 24-row vendor file with the pollution
we've seen in real AP systems: the same vendor entered two or three
times under slightly different spellings, EINs that live on one
duplicate row but not the one you paid against, phones and amounts
formatted five ways, and the usual mess of
<code>N/A</code> / <code>(blank)</code> / <code>?</code> sentinels.
Click <strong>Run pipeline</strong> and watch the 24 records collapse
to <strong>8 complete vendors with 7 EINs recovered</strong> in under
a second.
</p>
<div class="demo-frame">
<iframe
src="https://demo.datatools.app/?p=ap-1099"
loading="lazy"
title="DataTools live demo — accounts payable / 1099 vendor cleanup"
sandbox="allow-scripts allow-same-origin allow-downloads allow-forms"></iframe>
<div class="demo-caption">
Demo runs on free hosting (Streamlit Community Cloud). Capped at
100 input rows · output watermarked with one trailing row. The
paid product has no caps and runs entirely offline.
</div>
</div>
</div>
</section>
<!-- ============= Built for AP / 1099 ============= -->
<section>
<div class="container">
<div class="eyebrow">Built for the accounts-payable team</div>
<h2>Six workflows you do every filing season</h2>
<div class="grid">
<div class="card">
<span class="icon">🧹</span>
<h3>Vendor-master consolidation</h3>
<p>Catches the same vendor that shows up as <code>Acme LLC</code>, <code>Acme, L.L.C.</code>, and <code>ACME Llc</code>. Fuzzy matching groups the spellings; the dedup merge collapses each group to one row and backfills the gaps from its duplicates.</p>
</div>
<div class="card">
<span class="icon">🔢</span>
<h3>EIN backfill &amp; missing-EIN flagging</h3>
<p>Pulls the EIN off whichever duplicate row captured it and fills it into the survivor. The EINs that are <em>genuinely</em> missing get flagged so you know exactly which W-9s to chase.</p>
</div>
<div class="card">
<span class="icon">💵</span>
<h3>1099-NEC amount roll-up</h3>
<p>Before filing: standardize amounts, drop sentinels-as-missing, and merge so each vendor's total paid lands on one row and ties to your AP ledger.</p>
</div>
<div class="card">
<span class="icon">📥</span>
<h3>QuickBooks vendor export cleanup</h3>
<p>Whitespace in Tax IDs, near-identical vendor names, copy-paste smart quotes in remittance addresses — gone. Audit log shows every change for your reviewer.</p>
</div>
<div class="card">
<span class="icon">🔗</span>
<h3>Merging the W-9 tracker into the AP ledger</h3>
<p>The vendor master, the payments spreadsheet, and the W-9 tracker each name "Tax ID" differently. Map Columns aligns them; the dedup merge consolidates across all three sources.</p>
</div>
<div class="card">
<span class="icon">⚙️</span>
<h3>Repeatable pipeline</h3>
<p>Save the cleanup as a JSON file. Drop next year's vendor export on it. Same consolidation, zero re-configuration. Automatable via the CLI.</p>
</div>
</div>
</div>
</section>
<!-- ============= Privacy moat ============= -->
<section>
<div class="container">
<div class="eyebrow">The thing every cloud cleaner can't say</div>
<h2>Your vendor master never leaves your computer.</h2>
<p>
DataTools is a desktop app. There's no upload step, no SaaS account,
no subscription, no "trust our security policy." The first thing you
can do after install is open your browser's network tab, run the
cleaner on your real vendor file, and verify zero outbound
requests.
</p>
<div class="callout">
<strong>Why it matters for AP:</strong> your vendor master holds EINs,
remittance addresses, and payment history. Cloud cleaners require you
to upload it. We don't.
</div>
<div class="terminal"><span class="prompt">$</span> python -m src.cli_pipeline vendor_1099.csv --pipeline vendor_1099_pipeline.json --apply
Reading vendor_1099.csv...
24 rows, 9 columns
Executing pipeline:
<span class="ok"></span> text_clean (38 ms) {cells_changed: 41}
<span class="ok"></span> format_standardize (62 ms) {cells_changed: 36} # phones, EINs, amounts
<span class="ok"></span> missing (11 ms) {sentinels_standardized: 9}
<span class="ok"></span> dedup (140 ms) {groups_merged: 8, rows_removed: 16, eins_backfilled: 7}
Initial rows: 24 → Final rows: 8 (8 complete vendors)
EINs recovered from duplicate rows: 7 | Still missing (flagged): 1
Unparseable cells: 0
Total elapsed: 0.25 s
<span class="prompt">$</span> # zero network calls. zero. promise.</div>
</div>
</section>
<!-- ============= Audit moat ============= -->
<section>
<div class="container">
<div class="eyebrow">For when your reviewer asks "what changed?"</div>
<h2>Every change auditable. Every cell logged.</h2>
<p>
Every modification is recorded with the original value, the new
value, and which rule fired. Hand the audit CSV to your controller or
your reviewer, or add it to the IRS-ready workpaper file, along with the cleaned
vendor list. No <em>"I trust the AI"</em> hand-waving — they see
exactly which EIN came from which duplicate row.
</p>
<div class="callout">
<strong>Real example:</strong> the demo above merged 24 records into
8 vendors and backfilled 7 EINs. The dedup audit lists every vendor
group with the survivor, its merged-in duplicates, and the source row
each recovered EIN was pulled from. The standardize audit lists every
phone, amount, and Tax ID it reformatted.
</div>
</div>
</section>
<!-- ============= Format handling ============= -->
<section>
<div class="container">
<div class="eyebrow">If your vendors are messy — most AP files are</div>
<h2>EINs, phones, addresses, and amounts in every shape.</h2>
<p>
One row has the EIN as <code>12-3456789</code>, another as
<code>123456789</code>. The remittance phone is <code>(212)
555-0147</code> on one and <code>212.555.0147</code> on the next.
An amount reads <code>$12,410.75</code> with a stray space. Excel
treats half of these as text errors. DataTools normalizes every one —
EINs to a single format, phones to E.164, amounts to clean numerics —
so the file reconciles and the 1099 box totals tie out.
</p>
<ul class="bullets">
<li><strong>EIN / Tax-ID normalization</strong> to one consistent <code>NN-NNNNNNN</code> shape, with genuinely-missing ones flagged.</li>
<li><strong>Phone standardization</strong> to E.164 via Google's libphonenumber.</li>
<li><strong>Amount parsing</strong> for <code>$</code> / commas / stray spaces — including amounts Excel mis-types as text.</li>
<li><strong>Address shape detection</strong> for US remittance addresses.</li>
</ul>
</div>
</section>
<!-- ============= What you get ============= -->
<section>
<div class="container">
<div class="eyebrow">In the bundle</div>
<h2>Six tools. One pipeline. One $49 download.</h2>
<div class="grid">
<div class="card"><h3>1 · Find Duplicates</h3><p>Fuzzy match (Jaro-Winkler), 5 normalizers, survivor rules, gap-backfill merge, interactive review.</p></div>
<div class="card"><h3>2 · Clean Text</h3><p>Whitespace, smart chars, NBSP, BOM, line endings, case ops.</p></div>
<div class="card"><h3>3 · Standardize Formats</h3><p>EINs, amounts, dates, phones, emails, addresses, names, booleans.</p></div>
<div class="card"><h3>4 · Fix Missing Values</h3><p>Disguised-null detection, profile, flag genuinely-missing fields, drop strategies.</p></div>
<div class="card"><h3>5 · Map Columns</h3><p>Fuzzy auto-rename, target schema, type coercion, required-field defaults.</p></div>
<div class="card"><h3>6 · Automated Workflows</h3><p>Chain tools in recommended order, save/load JSON, automate next year's vendor cleanup.</p></div>
</div>
</div>
</section>
<!-- ============= Pricing ============= -->
<section>
<div class="container">
<div class="eyebrow">Pricing — pay once, own it</div>
<h2>$49. No subscription. No ceiling on rows or files.</h2>
<div class="pricing">
<div class="card featured">
<div class="row"><div class="price">$49</div><div class="price-suffix">one-time</div></div>
<h3>DataTools for 1099 Prep</h3>
<ul>
<li>All 6 tools, full pipeline</li>
<li>Mac · Windows · Linux installers</li>
<li>Code-signed (no Gatekeeper warnings)</li>
<li>Free updates for the v1.x line</li>
<li>Bonus: ready-made vendor-master &amp; 1099 pipelines</li>
</ul>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=ap-1099" rel="noopener">Buy on Gumroad →</a>
</div>
<div class="card">
<div class="row"><div class="price">$149</div><div class="price-suffix">one-time</div></div>
<h3>Full DataTools Suite</h3>
<p class="muted">Available when 3+ bundles ship. Includes everything in the 1099-prep pack plus the Bookkeeper and Accounts-Receivable bundles. Save $48.</p>
<a class="btn btn-ghost btn-large" href="#" aria-disabled="true">Coming when ready</a>
</div>
</div>
</div>
</section>
<!-- ============= FAQ ============= -->
<section>
<div class="container">
<h2>Questions</h2>
<details class="faq">
<summary>Does this work with my QuickBooks vendor export?</summary>
<p>Yes — the input is just CSV / Excel from any source. Your QuickBooks vendor export works the same as a Xero export, a Bill.com download, or a vendor spreadsheet you maintain by hand. The cleaner doesn't care where the file came from.</p>
</details>
<details class="faq">
<summary>How does this compare to Excel's "Remove Duplicates"?</summary>
<p>Excel does <em>exact</em> deduplication and only deletes — it never backfills. <code>Acme LLC</code> and <code>Acme, L.L.C.</code> are different vendors to Excel, and even when it does catch a duplicate it throws the extra row away, taking the EIN with it. DataTools fuzzy-matches across spelling drift, merges the group to one survivor, and pulls the missing EIN, phone, and address off the rows it merges in.</p>
</details>
<details class="faq">
<summary>How does it recover a missing EIN?</summary>
<p>When it merges a group of duplicate vendor rows, it keeps the survivor and backfills any empty field — including the EIN — from whichever duplicate row had it. In the sample file, 7 of the 8 vendors had their EIN recovered this way; the 1 that's truly missing gets flagged so you know to chase the W-9.</p>
</details>
<details class="faq">
<summary>Do I need to know Python to use it?</summary>
<p>No. The GUI is a browser interface that opens automatically when you double-click the app. It loads your vendor file, you click Run, you download the cleaned list. The CLI is there for power users who want to script next year's cleanup.</p>
</details>
<details class="faq">
<summary>What about my data privacy?</summary>
<p>Your vendor master — EINs, remittance addresses, payment history — never leaves your computer. There is no cloud component, no telemetry, no "anonymous usage stats." When the app is running you can confirm zero outbound network requests in your browser's developer tools.</p>
</details>
<details class="faq">
<summary>What's your refund policy?</summary>
<p>Try the live demo above on the sample vendor dataset before you buy. If DataTools still doesn't fit your workflow, email within 14 days for a refund, no questions asked.</p>
</details>
<details class="faq">
<summary>Will there be updates?</summary>
<p>Yes. The v1.x line is included free for everyone who buys DataTools today. We ship a patch every 30 days adding format support, edge-case fixes, and small features.</p>
</details>
</div>
</section>
<!-- ============= Final CTA ============= -->
<section>
<div class="container" style="text-align: center;">
<h2>Stop chasing scattered EINs by hand.</h2>
<p class="lead" style="margin: 0 auto 28px;">One $49 download. Mac, Windows, or Linux. Runs offline. Consolidates 24 messy records into 8 complete vendors, recovers the 7 EINs hiding on duplicate rows, flags the ones genuinely missing, and saves a pipeline you can re-run on next year's vendor export.</p>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=ap-1099" rel="noopener">Get DataTools for Accounting — $49 →</a>
</div>
</section>
<!-- ============= Footer ============= -->
<footer>
<div class="container">
<div>
<p><strong>DataTools</strong> — local data-cleaning for accounts payable, bookkeepers, and accounts-receivable teams.</p>
<p class="muted">© 2026 · Built solo · Shipped from a small office.</p>
</div>
<div>
<p>
<a href="../bookkeeper/">For bookkeepers</a> ·
<a href="../ar-aging/">For accounts receivable</a><br />
<a href="https://gumroad.com/l/datatools?from=ap-1099">Buy on Gumroad</a> ·
<a href="mailto:hello@datatools.app">Email support</a>
</p>
</div>
</div>
</footer>
</body>
</html>

landing/ar-aging/index.html Normal file

@@ -0,0 +1,358 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>DataTools for Accounts Receivable — Kill Duplicate Invoices Inflating Your AR Aging Report · $49</title>
<meta name="description" content="One tool to clean your open-invoices export: standardize invoice dates, due dates, and amounts, lowercase client emails, then remove double-entered invoice numbers so your AR aging report is accurate. 26 rows → 21, five duplicate invoices removed. Fully offline. $49 one-time." />
<meta name="keywords" content="accounts receivable aging, duplicate invoices, AR cleanup, open invoices export, invoice dedupe, aging report accuracy, receivables csv tool" />
<link rel="canonical" href="https://datatools.app/ar-aging/" />
<link rel="stylesheet" href="../_shared/styles.css" />
<!-- Persona accent: Accounts Receivable → receivables green -->
<style>
:root {
--accent: #059669;
--accent-ink: #03241a;
}
</style>
<meta property="og:title" content="DataTools for Accounts Receivable — Kill Duplicate Invoices Inflating Your AR Aging Report" />
<meta property="og:description" content="Standardize invoice dates, due dates, and amounts, lowercase client emails, then dedupe double-entered invoices — one tool, no upload. $49 one-time." />
<meta property="og:type" content="product" />
<meta property="og:url" content="https://datatools.app/ar-aging/" />
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "SoftwareApplication",
"name": "DataTools for Accounts Receivable",
"operatingSystem": "Windows, macOS, Linux",
"applicationCategory": "BusinessApplication",
"offers": {
"@type": "Offer",
"price": "49",
"priceCurrency": "USD"
},
"description": "Clean and dedupe your open-invoices export so the AR aging report is accurate. Standardize invoice dates, due dates, and amounts, lowercase client emails, then remove double-entered invoice numbers — backfilling a blank status from its twin row. Six-tool data-cleaning bundle for accounts receivable and accounting teams.",
"softwareVersion": "1.0"
}
</script>
</head>
<body>
<div class="buybar">
<div class="buybar-inner">
<div class="brand"><span class="brand-mark"></span> DataTools <span class="muted">/ for Accounts Receivable</span></div>
<div>
<span class="price-tag">$49 — one-time, no subscription</span>
<a class="btn" href="https://gumroad.com/l/datatools?from=ar-aging" rel="noopener">Get DataTools →</a>
</div>
</div>
</div>
<section class="hero">
<div class="container">
<div class="eyebrow">For accounts receivable · controllers · collections · accounting teams</div>
<h1>Stop chasing the invoices<br /><strong>your aging report counted twice.</strong></h1>
<p class="lead">
The same invoice number gets posted twice — once as
<code>3/04/2026</code> for <code>$1,250.00</code>, again as
<code>2026-03-04</code> for <code>1250</code> — so your AR aging
report double-counts the receivable and your team chases a balance
that was never really open. DataTools standardizes every invoice
date, due date, and amount, lowercases client emails, then removes
the double-entered invoice numbers — taking a real open-invoices
export from <strong>26 rows to 21, five duplicate invoices
removed</strong> — all on your own machine, with nothing uploaded.
</p>
<div class="cta-row">
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=ar-aging" rel="noopener">Get DataTools for Accounting — $49 →</a>
<a class="btn btn-ghost btn-large" href="#demo">Try the live demo ↓</a>
<span class="price-note">One-time payment · cross-platform · runs offline</span>
</div>
<div class="stats">
<div class="stat"><div class="num">26→21</div><div class="label">rows after dedupe</div></div>
<div class="stat"><div class="num">5</div><div class="label">duplicate invoices removed</div></div>
<div class="stat"><div class="num">0</div><div class="label">cloud uploads ever</div></div>
</div>
</div>
</section>
<!-- ============= Pain points ============= -->
<section>
<div class="container">
<div class="eyebrow">If your last aging report didn't tie out to cash</div>
<h2>Six pains DataTools fixes before you run the aging report</h2>
<div class="grid">
<div class="card">
<span class="icon">💸</span>
<h3>Double-entered invoices inflate every aging bucket</h3>
<p>The same invoice number posted twice — once in <code>MM/DD/YYYY</code>, once in ISO — lands in two rows and gets counted twice. Your 60-day bucket looks worse than it is, and the receivables total overstates what's actually owed.</p>
<p class="muted"><strong>What it costs:</strong> overstated AR, a balance sheet that won't reconcile, and a controller asking why.</p>
</div>
<div class="card">
<span class="icon">📞</span>
<h3>Collections chases invoices that were already paid or never real</h3>
<p>When a duplicate invoice number shows as still-open, a collector emails the client about a balance that doesn't exist. The client pushes back, trust erodes, and your team burns a morning untangling it.</p>
<p class="muted"><strong>What it costs:</strong> wasted collections hours + an awkward "please disregard" to the client.</p>
</div>
<div class="card">
<span class="icon">⚖️</span>
<h3>Uploading the AR ledger to a cloud cleaner is a compliance headache</h3>
<p>Every cloud-based cleaner wants you to upload your full receivables ledger — client names, amounts, contact emails. That's a data-handling review your firm doesn't want to run. DataTools is desktop-only — no upload, no DPA, no review.</p>
<p class="muted"><strong>What it costs:</strong> weeks of review per tool, or just not cleaning the data at all.</p>
</div>
<div class="card">
<span class="icon">🗓️</span>
<h3>Mixed date formats make due dates and aging unreliable</h3>
<p>Invoice dates arrive as <code>3/4/26</code>, <code>2026-03-04</code>, and <code>Mar 4 2026</code>; due dates are just as mixed. Sort by date and the buckets are wrong, so the wrong invoices show up in the wrong aging column.</p>
<p class="muted"><strong>What it costs:</strong> 13 hours per close reconciling dates by hand, every period.</p>
</div>
<div class="card">
<span class="icon">📧</span>
<h3>Messy client contacts break your remittance reminders</h3>
<p>Client names come in mixed casing and emails arrive as <code>Billing@ClientCo.com</code> in one row and <code>billing@clientco.com</code> in another — so the same client looks like two, and reminders go out twice or not at all.</p>
<p class="muted"><strong>What it costs:</strong> duplicate dunning, missed reminders, and a client list that won't group.</p>
</div>
<div class="card">
<span class="icon"></span>
<h3>Blank invoice statuses hide whether a receivable is really open</h3>
<p>When one of the two twin rows has a blank status, you can't tell if the invoice is open, partial, or paid — so it either gets dropped from the aging report or counted at the wrong stage.</p>
<p class="muted"><strong>What it costs:</strong> misclassified receivables and an aging report you can't trust.</p>
</div>
</div>
</div>
</section>
<section id="demo">
<div class="container">
<div class="eyebrow">Live demo · runs in your browser</div>
<h2>Try it on a real-looking open-invoices export</h2>
<p>
The demo below loads a 26-row open-invoices export with five
double-entered invoice numbers — the same invoice posted twice in
different date and amount formats (<code>3/04/2026</code> vs
<code>2026-03-04</code>, <code>$1,250.00</code> vs <code>1250</code>),
client emails in mixed case, and one blank invoice status. Click
<strong>Run pipeline</strong> and watch the 5-step pipeline (text
clean → format → missing → column map → dedup) standardize both date
columns to ISO, coerce amounts to numbers, lowercase the emails, and
collapse 26 rows to 21 — backfilling the blank status from its twin
row so the aging report is accurate.
</p>
<div class="demo-frame">
<iframe
src="https://demo.datatools.app/?p=ar-aging"
loading="lazy"
title="DataTools live demo — Accounts Receivable"
sandbox="allow-scripts allow-same-origin allow-downloads allow-forms"></iframe>
<div class="demo-caption">
Demo runs on free hosting. Capped at 100 input rows · output
watermarked. The paid product has no caps and runs entirely offline.
</div>
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">Built for the receivables close</div>
<h2>Three workflows you do every period</h2>
<div class="grid">
<div class="card">
<span class="icon">🪢</span>
<h3>Dedupe double-entered invoices</h3>
<p>Match on invoice number, drop the second posting, and keep one canonical row per invoice — backfilling a blank status, due date, or amount from its twin so nothing accurate is lost when the duplicate goes.</p>
</div>
<div class="card">
<span class="icon">🗓️</span>
<h3>Standardize invoice and due dates</h3>
<p>Coerce every invoice date and due date to ISO and every amount to a clean number, so the aging buckets sort correctly and the receivables total ties out to the ledger.</p>
</div>
<div class="card">
<span class="icon">📧</span>
<h3>Normalize client contacts for remittance</h3>
<p>Lowercase client emails and fix name casing so each client groups as one. Send remit-to reminders once, to a clean contact list — not twice because two rows looked like two clients.</p>
</div>
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">If your export comes from QuickBooks, Xero, or a billing system</div>
<h2>Standardized dates and amounts. One row per invoice.</h2>
<p>
Your billing system exports <code>3/04/2026</code>. The re-post of
the same invoice has <code>2026-03-04</code>. The amount is
<code>$1,250.00</code> in one row and <code>1250</code> in the other.
DataTools reads each row, normalizes both date columns to ISO,
coerces the amount to a number, and then matches on invoice number
to keep exactly one canonical row per receivable.
</p>
<ul class="bullets">
<li><strong>Invoice date + due date</strong> both standardized to ISO, so every aging bucket sorts and totals correctly.</li>
<li><strong>Amounts coerced to numbers</strong>: <code>$1,250.00</code> and <code>1250</code> resolve to the same value — no false mismatch between twin rows.</li>
<li><strong>Client emails lowercased</strong> so the same client groups as one for remittance reminders.</li>
<li><strong>Status backfill on dedupe</strong>: when a twin row has a blank invoice status, the survivor inherits it — so no open receivable goes missing from the report.</li>
</ul>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">For anyone who reports on receivables</div>
<h2>Every duplicate invoice you don't catch overstates your AR.</h2>
<p>
Your aging report is only as good as the export under it. Every
double-entered invoice number is a receivable counted twice — it
inflates the aging buckets, overstates the total owed, and sends
collections after balances that aren't really open. DataTools
catches them once, before the report runs, by matching on invoice
number with the date and amount noise already standardized away.
</p>
<div class="callout">
<strong>Real numbers from the demo:</strong> a 26-row open-invoices
export collapses to 21 — that's five double-entered invoices the
mixed date and amount formats were hiding, both date columns now
ISO, amounts numeric, emails lowercased, 0 unparseable, and a blank
status backfilled from its twin row. The aging report finally ties out.
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">The thing every cloud cleaner can't say</div>
<h2>Your clients' receivables never leave your computer.</h2>
<p>
Cloud cleaning tools require you to upload your AR ledger — client
names, invoice amounts, remit-to contacts. That ledger is sensitive
client financial data, and once it's on someone else's server, your
firm owns a data-handling problem you didn't need. DataTools is a
desktop app. There is no upload step.
</p>
<div class="terminal"><span class="prompt">$</span> python -m src.cli_pipeline ar_open_invoices.csv --pipeline ar_open_invoices_pipeline.json --apply
Reading ar_open_invoices.csv...
26 rows, 9 columns
Executing pipeline:
<span class="ok"></span> text_clean (40 ms) {cells_changed: 31}
<span class="ok"></span> format_standardize (120 ms) {dates_to_iso: 41, amounts_to_number: 26, emails_lowercased: 18}
<span class="ok"></span> missing (30 ms) {sentinels_standardized: 4, status_backfilled: 1}
<span class="ok"></span> column_map (20 ms) {columns_renamed: 2}
<span class="ok"></span> dedup (60 ms) {duplicate_invoices_removed: 5, merged: 5}
Initial rows: 26 → Final rows: 21
Unparseable dates/amounts: 0
Total elapsed: 0.3 s
<span class="prompt">$</span> # 5 double-entered invoices gone. aging report ties out. for $49.</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">In the bundle</div>
<h2>Six tools. One pipeline. One $49 download.</h2>
<div class="grid">
<div class="card"><h3>1 · Find Duplicates</h3><p>Match on invoice number; keep one canonical row per receivable and backfill blanks from the twin.</p></div>
<div class="card"><h3>2 · Clean Text</h3><p>Smart quotes from copy-paste, NBSP from spreadsheet exports, BOM from Excel.</p></div>
<div class="card"><h3>3 · Standardize Formats</h3><p>Invoice and due dates to ISO, amounts to clean numbers, client emails lowercased.</p></div>
<div class="card"><h3>4 · Fix Missing Values</h3><p>Detect <code>TBD</code>, <code>(unknown)</code>, <code></code> and backfill blank invoice statuses on dedupe.</p></div>
<div class="card"><h3>5 · Map Columns</h3><p>Project to your aging-report schema, coerce amount to a number, reorder fields for import.</p></div>
<div class="card"><h3>6 · Automated Workflows</h3><p>Save the cleanup as JSON. Drop next period's open-invoices export on it. Same dedupe, automated.</p></div>
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">Pricing — pay once, own it</div>
<h2>$49. No subscription. No per-close fee.</h2>
<div class="pricing">
<div class="card featured">
<div class="row"><div class="price">$49</div><div class="price-suffix">one-time</div></div>
<h3>DataTools for Accounts Receivable</h3>
<ul>
<li>All 6 tools, full pipeline</li>
<li>Mac · Windows · Linux installers</li>
<li>Code-signed (no Gatekeeper warnings)</li>
<li>Free updates for the v1.x line</li>
<li>Bonus: open-invoices dedupe pipeline preset</li>
<li><strong>Use on any number of clients</strong> — no seat limits</li>
</ul>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=ar-aging" rel="noopener">Buy on Gumroad →</a>
</div>
<div class="card">
<div class="row"><div class="price">$149</div><div class="price-suffix">one-time</div></div>
<h3>Full DataTools Suite</h3>
<p class="muted">Available when 3+ bundles ship. Includes everything in the Accounts Receivable pack plus the Bookkeeper and Accounts Payable / 1099 bundles. Save $48.</p>
<a class="btn btn-ghost btn-large" href="#" aria-disabled="true">Coming when ready</a>
</div>
</div>
</div>
</section>
<section>
<div class="container">
<h2>Questions</h2>
<details class="faq">
<summary>Does this replace my accounting system's deduplication?</summary>
<p>No — it cleans the export <em>before</em> you run the aging report or import it back. Most billing systems will happily hold two postings of the same invoice number; DataTools catches the double-entered invoice so it never inflates a single aging bucket.</p>
</details>
<details class="faq">
<summary>How does it know two rows are the same invoice?</summary>
<p>It matches on invoice number after the date and amount formats are standardized away. So a posting dated <code>3/04/2026</code> for <code>$1,250.00</code> and its twin dated <code>2026-03-04</code> for <code>1250</code> are recognized as one invoice — and only one canonical row survives.</p>
</details>
<details class="faq">
<summary>What happens to a blank invoice status when the duplicate is removed?</summary>
<p>It's backfilled. If one twin row has a blank status and the other says <code>open</code>, the surviving row inherits <code>open</code> — so no real receivable drops off the aging report just because the duplicate carried the better data.</p>
</details>
<details class="faq">
<summary>Can I use it on multiple clients without paying again?</summary>
<p>Yes. The license is per-operator, not per-client. Run it on every client's open-invoices export for the same $49.</p>
</details>
<details class="faq">
<summary>What does the audit trail look like?</summary>
<p>A row-by-row CSV: every modified cell with its original value, new value, and which rule fired — every date coerced to ISO, every amount normalized, every duplicate invoice removed. A separate JSON file describes the pipeline that produced it, so the cleanup reproduces deterministically and your client can verify it on their machine.</p>
</details>
<details class="faq">
<summary>What's your refund policy?</summary>
<p>Try the live demo above on the sample open-invoices export before you buy. If DataTools doesn't fit your workflow within 14 days, email for a refund — no questions asked.</p>
</details>
</div>
</section>
<section>
<div class="container" style="text-align: center;">
<h2>Stop counting the same receivable twice.</h2>
<p class="lead" style="margin: 0 auto 28px;">One $49 download. Standardizes invoice dates, due dates, and amounts, lowercases client emails, removes the double-entered invoices your aging report was counting twice, and saves a pipeline you can re-run on next period's open-invoices export.</p>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=ar-aging" rel="noopener">Get DataTools for Accounting — $49 →</a>
</div>
</section>
<footer>
<div class="container">
<div>
<p><strong>DataTools</strong> — local data-cleaning for bookkeepers, accounts payable, and accounts receivable teams.</p>
<p class="muted">© 2026 · Built solo · Shipped from a small office.</p>
</div>
<div>
<p>
<a href="../bookkeeper/">For bookkeepers</a> ·
<a href="../ap-1099/">For accounts payable / 1099</a><br />
<a href="https://gumroad.com/l/datatools?from=ar-aging">Buy on Gumroad</a> ·
<a href="mailto:hello@datatools.app">Email support</a>
</p>
</div>
</div>
</footer>
</body>
</html>
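
The AR page above describes standardizing dates, amounts, and emails, then matching on invoice number and backfilling blanks from the twin row. A minimal Python sketch of that idea follows. It is not the DataTools implementation: the column names (`invoice_number`, `invoice_date`, `amount`, `email`, `status`), the helper names, and the list of accepted date formats are all assumptions made for illustration.

```python
from datetime import datetime
from decimal import Decimal

# Assumed input formats; a real export may need more.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y")


def to_iso(raw):
    """Return the date as YYYY-MM-DD, trying each assumed format in turn."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unparseable date: {raw!r}")


def to_amount(raw):
    """Strip currency symbols and thousands separators, return a Decimal."""
    return Decimal(raw.strip().replace("$", "").replace(",", ""))


def dedupe_invoices(rows):
    """Keep one row per invoice number; fill the survivor's blanks from twins."""
    survivors = {}
    for row in rows:
        clean = dict(row)
        clean["invoice_date"] = to_iso(row["invoice_date"])
        clean["amount"] = to_amount(row["amount"])
        clean["email"] = row["email"].strip().lower()
        key = row["invoice_number"].strip()
        if key not in survivors:
            survivors[key] = clean
            continue
        kept = survivors[key]
        for field, value in clean.items():
            # Backfill only where the surviving row is blank.
            if kept[field] in ("", None) and value not in ("", None):
                kept[field] = value
    return list(survivors.values())
```

The first posting of each invoice number survives here. Which posting should win, and which fields are safe to backfill, is a policy choice this sketch does not settle.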


@@ -3,9 +3,9 @@
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>DataTools for Bookkeepers — Reconcile Bank Exports With An Audit Trail · $49</title>
<meta name="description" content="Reconcile messy bank exports. Catch duplicate transactions QuickBooks imported twice. Standardize dates, amounts, and vendor casing — locally. Every change auditable. $49 one-time." />
<meta name="keywords" content="reconcile bank export csv, quickbooks duplicate transactions, vendor list cleanup, bookkeeper csv tool, bank export deduplicator, bookkeeper audit trail" />
<title>DataTools for Bookkeepers — Catch Bank Transactions Posted Twice · $49</title>
<meta name="description" content="Catch the transactions your bank export posted twice. Standardize every date to ISO and every amount to numeric, then dedup on the real transaction so the reconciliation ties out — with a row-level audit trail. $49 one-time." />
<meta name="keywords" content="bank reconciliation, duplicate transactions, bank export csv cleanup, QuickBooks reconcile, bookkeeper csv tool" />
<link rel="canonical" href="https://datatools.app/bookkeeper/" />
<link rel="stylesheet" href="../_shared/styles.css" />
@@ -18,8 +18,8 @@
</style>
<!-- Open Graph -->
<meta property="og:title" content="DataTools for Bookkeepers — Reconcile Bank Exports With An Audit Trail" />
<meta property="og:description" content="Catch duplicate transactions. Standardize dates and amounts. Hand your client an audit trail. $49 one-time." />
<meta property="og:title" content="DataTools for Bookkeepers — Catch Bank Transactions Posted Twice" />
<meta property="og:description" content="The same payment posts twice in two date/amount formats and a plain dedupe misses it. DataTools standardizes, dedups on the real transaction, and hands you an audit trail. $49 one-time." />
<meta property="og:type" content="product" />
<meta property="og:url" content="https://datatools.app/bookkeeper/" />
@@ -35,7 +35,7 @@
"price": "49",
"priceCurrency": "USD"
},
"description": "Reconcile bank exports, dedupe vendor lists, and produce a hand-off-ready audit trail. Six-tool data-cleaning bundle for bookkeepers and freelance accountants.",
"description": "Catch the duplicate transactions your bank export posted twice across overlapping months, standardize dates and amounts, and produce a hand-off-ready audit trail. Six-tool data-cleaning bundle for bookkeepers and freelance accountants.",
"softwareVersion": "1.0"
}
</script>
@@ -47,7 +47,7 @@
<div class="brand"><span class="brand-mark"></span> DataTools <span class="muted">/ for Bookkeepers</span></div>
<div>
<span class="price-tag">$49 — one-time, no subscription</span>
<a class="btn" href="https://gumroad.com/l/datatools?from=bookkeeper" rel="noopener">Get DataTools →</a>
<a class="btn" href="https://gumroad.com/l/datatools?from=bookkeeper" rel="noopener">Get DataTools for Bookkeepers — $49 →</a>
</div>
</div>
</div>
@@ -55,24 +55,29 @@
<section class="hero">
<div class="container">
<div class="eyebrow">For bookkeepers · freelance accountants · small-firm partners</div>
<h1>Reconcile messy bank exports.<br /><strong>Hand your client an audit trail.</strong></h1>
<h1>Catch the transactions your bank export<br /><strong>posted twice.</strong></h1>
<p class="lead">
The Jan and Feb exports overlap and you've got the same transaction
booked twice. Vendor names are <em>"Amazon"</em>, <em>"amazon.com"</em>,
and <em>"AMAZON.COM*4F2X9"</em> in three different rows. Dates are a
smoosh of <code>01/15/2025</code>, <code>2025-01-15</code>, and
<code>Jan 18 2025</code>. DataTools fixes all of it in one pass —
and produces a row-by-row CSV showing every change so your client
can verify your work.
The Jan and Feb exports overlap, so the <em>same</em> payment posts
twice in two different shapes — <code>01/15/2025&nbsp;&nbsp;+$3,450.00</code>
in one export and <code>2025-01-15&nbsp;&nbsp;3450.00</code> in the
other — and a plain Excel dedupe never catches it because the dates and
amounts don't match character-for-character. DataTools standardizes
every date to ISO and every amount to numeric (parens-negatives
resolved), then dedups on the <em>real</em> transaction so the
reconciliation ties out. On the sample export that's
<strong>26 rows → 20</strong> — six phantom duplicate transactions
removed, 36 date/amount cells standardized, 0 unparseable — and you
get a row-by-row CSV showing every change so your client can verify
your work.
</p>
<div class="cta-row">
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=bookkeeper" rel="noopener">Get DataTools — $49 →</a>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=bookkeeper" rel="noopener">Get DataTools for Bookkeepers — $49 →</a>
<a class="btn btn-ghost btn-large" href="#demo">Try the live demo ↓</a>
<span class="price-note">One-time payment · cross-platform · runs offline</span>
</div>
<div class="stats">
<div class="stat"><div class="num">6</div><div class="label">tools, one bundle</div></div>
<div class="stat"><div class="num">100 %</div><div class="label">auditable changes</div></div>
<div class="stat"><div class="num">26→20</div><div class="label">rows, on the sample export</div></div>
<div class="stat"><div class="num">6</div><div class="label">phantom duplicates removed</div></div>
<div class="stat"><div class="num">0</div><div class="label">cloud uploads ever</div></div>
</div>
</div>
@@ -129,13 +134,15 @@
<div class="eyebrow">Live demo · runs in your browser</div>
<h2>Try it on a sample bank export with a known overlap</h2>
<p>
The demo below loads a 25-row export combining January and February
The demo below loads a 26-row export combining January and February
activity, with the month-boundary rows duplicated across exports —
the exact scenario where QuickBooks (or any reconciler) silently
double-counts transactions. Click <strong>Run pipeline</strong> and
watch the dedup catch every overlap, dates land in ISO format, and
the parens-negative amounts (<code>($89.50)</code>) become proper
negative numbers.
watch it standardize 36 date/amount cells, land every date in ISO
format, turn the parens-negative amounts (<code>($89.50)</code>) into
proper negatives, flag the disguised-null categories, and dedup the
export down to <strong>20 real transactions</strong> — six phantom
duplicates removed, 0 unparseable.
</p>
<div class="demo-frame">
<iframe
@@ -197,13 +204,17 @@
price. DataTools writes the audit by default, downloadable as a
separate CSV alongside the cleaned file.
</div>
<div class="terminal"><span class="prompt">$</span> head -5 client_jan2025_changes.csv
<div class="terminal"><span class="prompt">$</span> python -m src.cli_pipeline bank_reconciliation.csv --pipeline bank_reconciliation_pipeline.json --apply
standardize · 36 date/amount cells normalized (ISO dates, numeric amounts, parens-negatives resolved)
missing · disguised-null categories flagged (—, N/A, (blank))
dedup · 6 phantom duplicate transactions removed
rows · 26 → 20 · 0 unparseable
✓ wrote bank_reconciliation.cleaned.csv + bank_reconciliation.changes.csv (row-level audit)
<span class="prompt">$</span> head -4 bank_reconciliation.changes.csv
row,column,field_type,old,new
0,"Date ",date,"01/15/2025","2025-01-15"
0,Description,name," AMAZON.COM*4F2X9 PURCHASE","Amazon.com*4F2X9 Purchase"
0,Amount,currency,"-$129.99","-129.99"
1,Date ,date,"2025-01-15","2025-01-15"
<span class="prompt">$</span> # one row of audit per cell change. handed to the client. signed off.</div>
0,Amount,currency,"+$3,450.00","3450.00"
0,Category,category,"—","(missing)"
</div>
</section>
@@ -251,12 +262,12 @@ row,column,field_type,old,new
<div class="eyebrow">In the bundle</div>
<h2>Six tools. One pipeline. One $49 download.</h2>
<div class="grid">
<div class="card"><h3>1 · Deduplicator</h3><p>Fuzzy match (Jaro-Winkler), explicit strategies for Date+Amount+Vendor, survivor rules.</p></div>
<div class="card"><h3>2 · Text Cleaner</h3><p>Header whitespace, smart quotes from copy-paste, em-dash sentinels.</p></div>
<div class="card"><h3>3 · Format Standardizer</h3><p>ISO dates, numeric amounts (parens-negative), vendor casing, multi-currency.</p></div>
<div class="card"><h3>4 · Missing Value Handler</h3><p>Disguised-null detection: <code></code>, <code>N/A</code>, <code>(blank)</code>, <code>?</code>.</p></div>
<div class="card"><h3>5 · Column Mapper</h3><p>Project to your accounting tool's required schema, coerce types, drop extras.</p></div>
<div class="card"><h3>6 · Pipeline Runner</h3><p>Save the cleanup. Run it on next month's export with one command. Same audit, automated.</p></div>
<div class="card"><h3>1 · Find Duplicates</h3><p>Fuzzy match (Jaro-Winkler), explicit strategies for Date+Amount+Vendor, survivor rules.</p></div>
<div class="card"><h3>2 · Clean Text</h3><p>Header whitespace, smart quotes from copy-paste, em-dash sentinels.</p></div>
<div class="card"><h3>3 · Standardize Formats</h3><p>ISO dates, numeric amounts (parens-negative), vendor casing, multi-currency.</p></div>
<div class="card"><h3>4 · Fix Missing Values</h3><p>Disguised-null detection: <code></code>, <code>N/A</code>, <code>(blank)</code>, <code>?</code>.</p></div>
<div class="card"><h3>5 · Map Columns</h3><p>Project to your accounting tool's required schema, coerce types, drop extras.</p></div>
<div class="card"><h3>6 · Automated Workflows</h3><p>Save the cleanup. Run it on next month's export with one command. Same audit, automated.</p></div>
</div>
</div>
</section>
@@ -336,13 +347,13 @@ row,column,field_type,old,new
<footer>
<div class="container">
<div>
<p><strong>DataTools</strong> — local data-cleaning for Shopify, bookkeepers, and RevOps teams.</p>
<p><strong>DataTools</strong> — local data-cleaning for bookkeepers, accounts payable, and accounts receivable teams.</p>
<p class="muted">© 2026 · Built solo · Shipped from a small office.</p>
</div>
<div>
<p>
<a href="../shopify-pet/">For Shopify operators</a> ·
<a href="../revops/">For RevOps agencies</a><br />
<a href="../ap-1099/">For accounts payable / 1099</a> ·
<a href="../ar-aging/">For accounts receivable</a><br />
<a href="https://gumroad.com/l/datatools?from=bookkeeper">Buy on Gumroad</a> ·
<a href="mailto:hello@datatools.app">Email support</a>
</p>
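
The bookkeeper page shows a row-level audit file with one line per changed cell. The sketch below shows one way such a change log could be produced by comparing rows before and after cleaning. It is an illustration only: `change_log` and `to_csv` are hypothetical names, the sketch assumes the cleaned rows stay in the same order as the originals, and it omits the `field_type` column that appears in the sample output.

```python
import csv
import io


def change_log(before, after):
    """Compare rows pairwise and record every cell whose value changed."""
    changes = []
    for index, (old_row, new_row) in enumerate(zip(before, after)):
        for column, old in old_row.items():
            new = new_row.get(column, "")
            if old != new:
                changes.append(
                    {"row": index, "column": column, "old": old, "new": new}
                )
    return changes


def to_csv(changes):
    """Serialize the change records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["row", "column", "old", "new"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(changes)
    return buffer.getvalue()
```

Because dedup removes rows, a real audit would also need to log dropped rows separately. This sketch covers only in-place cell changes.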


@@ -11,7 +11,7 @@
"gumroad_listing": "https://gumroad.com/l/datatools",
"support_email": "hello@datatools.app",
"personas": ["shopify-pet", "bookkeeper", "revops"],
"personas": ["bookkeeper", "ap-1099", "ar-aging"],
"_substitutions_made": [
"{{site_origin}}/ → site_origin/",


@@ -7,9 +7,9 @@ to ``landing/deploy.config.json`` and filling in the real URLs:
Output:
landing/dist/index.html
landing/dist/shopify-pet/index.html
landing/dist/bookkeeper/index.html
landing/dist/revops/index.html
landing/dist/ap-1099/index.html
landing/dist/ar-aging/index.html
landing/dist/_shared/styles.css
landing/dist/robots.txt
landing/dist/sitemap.xml
@@ -50,9 +50,9 @@ EXAMPLE_PATH = LANDING / "deploy.config.example.json"
# Files to substitute and copy. Order matters only for readability.
HTML_PAGES = [
LANDING / "index.html",
LANDING / "shopify-pet" / "index.html",
LANDING / "bookkeeper" / "index.html",
LANDING / "revops" / "index.html",
LANDING / "bookkeeper" / "index.html",
LANDING / "ap-1099" / "index.html",
LANDING / "ar-aging" / "index.html",
]
SHARED = LANDING / "_shared" / "styles.css"
@@ -125,7 +125,7 @@ def _stamp_sitemap(cfg: dict) -> str:
site = cfg["site_origin"].rstrip("/")
today = date.today().isoformat()
urls = [site + "/"] + [
f"{site}/{p}/" for p in cfg.get("personas", ["shopify-pet", "bookkeeper", "revops"])
f"{site}/{p}/" for p in cfg.get("personas", ["bookkeeper", "ap-1099", "ar-aging"])
]
items = "\n".join(
f" <url><loc>{u}</loc><lastmod>{today}</lastmod></url>"
@@ -177,11 +177,11 @@ def _build_404_html(cfg: dict) -> str:
<h1>That page isn't here.</h1>
<p class="lead" style="margin: 0 auto 28px;">Pick a workflow below to land somewhere useful.</p>
<p>
<a class="btn" href="{site_origin}/shopify-pet/">For Shopify</a>
&nbsp;
<a class="btn" href="{site_origin}/bookkeeper/">For bookkeepers</a>
&nbsp;
<a class="btn" href="{site_origin}/revops/">For RevOps</a>
<a class="btn" href="{site_origin}/ap-1099/">For AP / 1099</a>
&nbsp;
<a class="btn" href="{site_origin}/ar-aging/">For AR</a>
</p>
</div>
</section>
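
The sitemap hunk above builds one URL per persona and falls back to a default list when the config has no `personas` key. A self-contained sketch of that logic, under stated assumptions, is below. Only the URL construction and the fallback list come from the diff; the function name `stamp_sitemap`, the `today` parameter, the XML wrapper, and the escaping step are assumptions, not the build script's actual code.

```python
from datetime import date
from xml.sax.saxutils import escape

# Fallback list taken from the new side of the diff.
DEFAULT_PERSONAS = ["bookkeeper", "ap-1099", "ar-aging"]


def stamp_sitemap(cfg, today=None):
    """Return sitemap XML for the site root plus one page per persona."""
    site = cfg["site_origin"].rstrip("/")
    stamp = (today or date.today()).isoformat()
    urls = [site + "/"] + [
        f"{site}/{p}/" for p in cfg.get("personas", DEFAULT_PERSONAS)
    ]
    items = "\n".join(
        f"  <url><loc>{escape(u)}</loc><lastmod>{stamp}</lastmod></url>"
        for u in urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{items}\n"
        "</urlset>\n"
    )
```

Passing `today` explicitly makes the output reproducible in tests. Keeping the fallback in one named constant avoids the stale default that this commit had to fix inline.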


@@ -3,13 +3,13 @@
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>DataTools — Local CSV / Excel Cleaning for Shopify, Bookkeepers, and RevOps</title>
<meta name="description" content="One desktop tool. Three workflows. Clean Shopify customer exports, reconcile messy bank statements, or dedupe lead lists across HubSpot and LinkedIn — all locally. $49 one-time." />
<title>DataTools — Local CSV / Excel Cleaning for Bookkeepers and Accountants</title>
<meta name="description" content="One desktop tool for messy accounting exports. Reconcile bank statements, build clean 1099 vendor lists, and de-duplicate AR aging — all locally. $49 one-time." />
<link rel="canonical" href="https://datatools.app/" />
<link rel="stylesheet" href="_shared/styles.css" />
<meta property="og:title" content="DataTools — Local CSV / Excel Cleaning" />
<meta property="og:description" content="One desktop tool, three niche workflows. Runs entirely offline. $49 one-time." />
<meta property="og:title" content="DataTools — Local CSV / Excel Cleaning for Accounting" />
<meta property="og:description" content="Reconcile bank exports, prep 1099 vendor lists, clean AR aging — offline. $49 one-time." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://datatools.app/" />
@@ -38,9 +38,9 @@
box-shadow: var(--shadow);
text-decoration: none;
}
.persona-card.shopify { --card-accent: #6ee7b7; }
.persona-card.bookkeeper{ --card-accent: #7dd3fc; }
.persona-card.revops { --card-accent: #c4b5fd; }
.persona-card.ap1099 { --card-accent: #fbbf24; }
.persona-card.ar { --card-accent: #6ee7b7; }
.persona-card .pill {
display: inline-block;
background: rgba(255,255,255,0.04);
@@ -93,70 +93,69 @@
<section class="hero">
<div class="container">
<div class="eyebrow">For Shopify operators · bookkeepers · marketing & RevOps agencies</div>
<h1>Local CSV / Excel cleaning.<br /><strong>One tool. Three workflows.</strong></h1>
<div class="eyebrow">For bookkeepers · accounts payable · accounts receivable</div>
<h1>Local CSV / Excel cleaning for accounting.<br /><strong>One tool. Three workflows.</strong></h1>
<p class="lead">
DataTools is a desktop app that fixes the data-cleaning headaches
every small business hits — duplicates Excel can't catch,
international phones it can't parse, dates and currencies in three
different formats per export. One $49 download. Works on Mac,
Windows, and Linux. <strong>Your data never leaves your
computer.</strong>
DataTools is a desktop app that fixes the export headaches that
throw off your books — the transaction your bank posted twice,
the vendor entered three ways at 1099 time, the invoice your aging
report counted twice. One $49 download. Mac, Windows, and Linux.
<strong>Your data never leaves your computer.</strong>
</p>
<div class="persona-grid">
<a class="persona-card shopify" href="shopify-pet/">
<span class="pill">🛍️ Shopify operator</span>
<h3>Customer / vendor / subscriber export cleanup</h3>
<p>
Klaviyo-import-ready customer lists in 30 seconds. Catches
cross-device duplicates, standardizes international phones
and addresses, fixes the disguised nulls that break product
feeds.
</p>
<ul class="pain">
<li>· Fix Klaviyo per-contact billing on phantom dupes</li>
<li>· Repair feeds rejected by Google Merchant / Meta</li>
<li>· Unify orders from Shopify + Etsy + Amazon + Faire</li>
<li>· Resolve VAT-MOSS country-name drift</li>
</ul>
<span class="open">Open the Shopify demo &amp; pricing</span>
</a>
<a class="persona-card bookkeeper" href="bookkeeper/">
<span class="pill">📒 Bookkeeper / accountant</span>
<h3>Bank-export reconciliation with audit trail</h3>
<span class="pill">📒 Bookkeeper</span>
<h3>Bank reconciliation with an audit trail</h3>
<p>
Catches the duplicate transaction QuickBooks imported twice
when Jan and Feb exports overlap. Standardizes dates,
amounts, and vendor casing. Hands you a row-level audit log
to share with the client.
When the Jan and Feb exports overlap, the same payment posts
twice in two formats. DataTools standardizes every date and
amount, then dedups on the real transaction so it ties out —
with a row-level audit log to hand the client.
</p>
<ul class="pain">
<li>· Catch month-overlap re-import dupes</li>
<li>· Consolidate vendors for clean 1099 reports</li>
<li>· Produce hand-off-ready audit trail</li>
<li>· Multi-currency books (EUR / GBP / BRL)</li>
<li>· Catch month-overlap re-import duplicates</li>
<li>· ISO dates, numeric amounts, parens-negatives resolved</li>
<li>· Hand-off-ready audit trail</li>
<li>· Sample: 26 rows → 20, six phantom duplicates removed</li>
</ul>
<span class="open">Open the bookkeeper demo &amp; pricing</span>
</a>
<a class="persona-card revops" href="revops/">
<span class="pill">🪢 Marketing / RevOps</span>
<h3>Lead-list dedup across HubSpot, LinkedIn, scrapes</h3>
<a class="persona-card ap1099" href="ap-1099/">
<span class="pill">🧾 Accounts payable / 1099</span>
<h3>Clean 1099 vendor list — missing EINs filled in</h3>
<p>
One canonical lead per real person — across HubSpot,
LinkedIn, Apollo, ZoomInfo, and manual scrapes.
International phones (50+ country codes), per-row country
column, fuzzy match with merge.
The same vendor entered three times, each record holding only
part of the details. DataTools consolidates each vendor to one
row and backfills the gaps from the duplicates, so the EINs you
need at filing time are recovered.
</p>
<ul class="pain">
<li>· Stop paying HubSpot tier price for cross-source dupes</li>
<li>· Protect sender reputation from invalid emails</li>
<li>· Skip the 4–8 wk GDPR review on cloud cleaners</li>
<li>· Suppression-list sync across 5+ platforms</li>
<li>· Consolidate vendor masters for 1099-NEC</li>
<li>· Recover EINs scattered across duplicate records</li>
<li>· Standardize phones, emails, and amounts</li>
<li>· Sample: 24 records → 8 vendors, 7 EINs recovered</li>
</ul>
<span class="open">Open the RevOps demo &amp; pricing</span>
<span class="open">Open the 1099 / AP demo &amp; pricing</span>
</a>
<a class="persona-card ar" href="ar-aging/">
<span class="pill">💵 Accounts receivable</span>
<h3>AR aging without the double-counted invoices</h3>
<p>
Double-entered invoices inflate your aging report and your
follow-ups. DataTools standardizes invoice dates, due dates,
and amounts, lowercases client emails, then removes the
duplicate invoice numbers so the aging is accurate.
</p>
<ul class="pain">
<li>· Remove double-entered invoices from the aging</li>
<li>· ISO dates, numeric amounts, lowercased client emails</li>
<li>· Backfill a blank status from its twin row</li>
<li>· Sample: 26 rows → 21, five duplicate invoices removed</li>
</ul>
<span class="open">Open the AR demo &amp; pricing</span>
</a>
</div>
</div>
@@ -168,9 +167,9 @@
<h2>One engine. Same six tools. Same $49.</h2>
<p>
The persona pages above are positioning, not different products.
Whichever you buy, you get the full bundle: Deduplicator, Text
Cleaner, Format Standardizer, Missing-Value Handler, Column
Mapper, and Pipeline Runner — pre-tuned with a saved pipeline
Whichever you buy, you get the full bundle: Find Duplicates, Clean
Text, Standardize Formats, Fix Missing Values, Map Columns,
and Automated Workflows — pre-tuned with a saved pipeline
that matches your workflow.
</p>
<div class="grid">
@@ -218,14 +217,14 @@
<footer>
<div class="container">
<div>
<p><strong>DataTools</strong> — local data-cleaning for Shopify, bookkeepers, and RevOps teams.</p>
<p><strong>DataTools</strong> — local data-cleaning for bookkeepers, accounts payable, and accounts receivable teams.</p>
<p class="muted">© 2026 · Built solo · Shipped from a small office.</p>
</div>
<div>
<p>
<a href="shopify-pet/">For Shopify operators</a> ·
<a href="bookkeeper/">For bookkeepers</a> ·
<a href="revops/">For RevOps agencies</a><br />
<a href="ap-1099/">For accounts payable / 1099</a> ·
<a href="ar-aging/">For accounts receivable</a><br />
<a href="mailto:hello@datatools.app">Email support</a>
</p>
</div>
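
Reviewer note on the bookkeeper card above: the copy says dates and amounts are standardized first and duplicates are removed afterwards, so that one payment exported in two formats collapses to a single row. A minimal sketch of that ordering, assuming a hypothetical three-column row (`date`, `amount`, `memo`), two input date formats, and an exact-match key. This is an illustration of the idea, not DataTools' actual pipeline.

```python
from datetime import datetime
from decimal import Decimal

# Assumed input date formats; a real bank export may need more.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def to_iso_date(text: str) -> str:
    """Return the date as YYYY-MM-DD, trying each assumed format in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def to_amount(text: str) -> Decimal:
    """Parse '$1,200.00' or '(45.10)' into a Decimal; parentheses mean negative."""
    s = text.strip()
    negative = s.startswith("(") and s.endswith(")")
    s = s.strip("()").replace("$", "").replace(",", "")
    value = Decimal(s)
    return -value if negative else value


def dedup_transactions(rows: list[dict]) -> list[dict]:
    """Normalize each row, then keep the first row per (date, amount, memo)."""
    seen = set()
    kept = []
    for row in rows:
        clean = {
            "date": to_iso_date(row["date"]),
            "amount": to_amount(row["amount"]),
            # Collapse whitespace and case so memo drift does not hide a match.
            "memo": " ".join(row["memo"].split()).upper(),
        }
        key = (clean["date"], clean["amount"], clean["memo"])
        if key not in seen:
            seen.add(key)
            kept.append(clean)
    return kept


if __name__ == "__main__":
    # Made-up sample rows: the first two are the same payment in two formats.
    sample = [
        {"date": "01/31/2026", "amount": "(45.10)", "memo": "Office  Depot"},
        {"date": "2026-01-31", "amount": "-45.10", "memo": "OFFICE DEPOT"},
        {"date": "02/01/2026", "amount": "$1,200.00", "memo": "Client deposit"},
    ]
    for row in dedup_transactions(sample):
        print(row)
```

Normalizing before building the key is the point of the ordering: without it, `(45.10)` and `-45.10` would compare as different strings and the overlap duplicate would survive.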


@@ -1,352 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>DataTools for RevOps — Dedupe Lead Lists Across HubSpot, LinkedIn, and Manual Scrapes · $49</title>
<meta name="description" content="One tool to dedupe lead lists across HubSpot, LinkedIn, and manual scrapes. International phones (50+ country codes), per-row country normalization, fuzzy match across vendors, fully offline. $49 one-time." />
<meta name="keywords" content="dedupe lead list, hubspot deduplicate, linkedin lead cleanup, marketing data cleaning, revops csv tool, multi-vendor lead unification, international phone normalization" />
<link rel="canonical" href="https://datatools.app/revops/" />
<link rel="stylesheet" href="../_shared/styles.css" />
<!-- Persona accent: RevOps → vivid violet -->
<style>
:root {
--accent: #c4b5fd;
--accent-ink: #2e1065;
}
</style>
<meta property="og:title" content="DataTools for RevOps — Dedupe Lead Lists Across HubSpot, LinkedIn, and Manual Scrapes" />
<meta property="og:description" content="International phones, country normalization, fuzzy dedup with merge — one tool, no upload. $49 one-time." />
<meta property="og:type" content="product" />
<meta property="og:url" content="https://datatools.app/revops/" />
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "SoftwareApplication",
"name": "DataTools for RevOps",
"operatingSystem": "Windows, macOS, Linux",
"applicationCategory": "BusinessApplication",
"offers": {
"@type": "Offer",
"price": "49",
"priceCurrency": "USD"
},
"description": "Dedupe and unify lead lists across CRM, scraping, and manual sources. International phone normalization, per-row country, fuzzy match with merge. Six-tool data-cleaning bundle for RevOps and marketing agencies.",
"softwareVersion": "1.0"
}
</script>
</head>
<body>
<div class="buybar">
<div class="buybar-inner">
<div class="brand"><span class="brand-mark"></span> DataTools <span class="muted">/ for RevOps</span></div>
<div>
<span class="price-tag">$49 — one-time, no subscription</span>
<a class="btn" href="https://gumroad.com/l/datatools?from=revops" rel="noopener">Get DataTools →</a>
</div>
</div>
</div>
<section class="hero">
<div class="container">
<div class="eyebrow">For RevOps · marketing ops · agency lead-gen · audience-builders</div>
<h1>Dedupe lead lists across HubSpot, LinkedIn,<br /><strong>and manual scrapes — locally.</strong></h1>
<p class="lead">
The same prospect shows up as <code>alice@acme.com</code> in HubSpot,
<code>Alice.Johnson@acme.com</code> in LinkedIn Sales Navigator, and
<code>alice@acme.com</code> again from your VA's manual scrape. Their
phone is <code>(415) 555-1234</code> in one source and
<code>4155551234</code> in another. DataTools fuzzy-matches across
sources, normalizes phones to E.164 with per-row country awareness,
and produces one canonical lead per real person — without uploading
a single contact to a third-party tool.
</p>
<div class="cta-row">
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=revops" rel="noopener">Get DataTools — $49 →</a>
<a class="btn btn-ghost btn-large" href="#demo">Try the live demo ↓</a>
<span class="price-note">One-time payment · cross-platform · runs offline</span>
</div>
<div class="stats">
<div class="stat"><div class="num">50+</div><div class="label">country codes</div></div>
<div class="stat"><div class="num">3</div><div class="label">CRM sources unified</div></div>
<div class="stat"><div class="num">0</div><div class="label">cloud uploads ever</div></div>
</div>
</div>
</section>
<!-- ============= Pain points ============= -->
<section>
<div class="container">
<div class="eyebrow">If your last campaign launch was held up by data hygiene</div>
<h2>Six pains DataTools fixes before you import to HubSpot</h2>
<div class="grid">
<div class="card">
<span class="icon">💸</span>
<h3>HubSpot / Marketo / Iterable bills you for every duplicate contact</h3>
<p>10 k contacts → enterprise tier at $48 k/mo. 18 % cross-source duplicate rate from Apollo + ZoomInfo + LinkedIn means you're at 8.2 k unique people but paying for 10 k. Every month. Forever.</p>
<p class="muted"><strong>What it costs:</strong> $200–$800 per 1 k duplicate contacts, recurring every month.</p>
</div>
<div class="card">
<span class="icon">🚫</span>
<h3>Sender reputation tanks when you mail to invalid or duplicate addresses</h3>
<p>One bad sending session — to addresses your team scraped or imported without hygiene — and your domain reputation takes weeks to recover. Your good campaigns sit in spam folders during the recovery.</p>
<p class="muted"><strong>What it costs:</strong> catastrophic: entire email programme degraded for 2–6 weeks.</p>
</div>
<div class="card">
<span class="icon">⚖️</span>
<h3>GDPR makes uploading to a cloud cleaner a legal-review marathon</h3>
<p>Every cloud-based lead-cleaner needs you to upload your prospect list. Your legal team needs 4–8 weeks to bless that. DataTools is desktop-only: no upload, no DPA, no review, no delay.</p>
<p class="muted"><strong>What it costs:</strong> 4–8 weeks of legal-review delay per tool, every time.</p>
</div>
<div class="card">
<span class="icon">🪢</span>
<h3>Apollo + ZoomInfo + LinkedIn + manual scrapes all use different schemas</h3>
<p>Each export has its own column names, scoring scale, country format. Unifying them by hand for one campaign costs 1–3 days. Doing it for every campaign is unsustainable.</p>
<p class="muted"><strong>What it costs:</strong> 1–3 days per campaign of manual unification + judgement calls that drift across team members.</p>
</div>
<div class="card">
<span class="icon">🛡️</span>
<h3>Suppression lists across 5+ marketing platforms get out of sync</h3>
<p>Each platform has its own suppression format. Out-of-sync lists let opted-out contacts slip through, triggering CAN-SPAM / GDPR exposure and the kind of "we got a complaint" email no one wants.</p>
<p class="muted"><strong>What it costs:</strong> compliance risk + churn-back cost + stakeholder trust.</p>
</div>
<div class="card">
<span class="icon">📞</span>
<h3>International dialer fails because phone formats vary</h3>
<p>A calling list spanning 15 countries with mixed formats means the dialler rejects 8–15 % of numbers, and your reps spend the day on "number invalid" tones instead of conversations.</p>
<p class="muted"><strong>What it costs:</strong> rep productivity × failure rate × team size.</p>
</div>
</div>
</div>
</section>
<section id="demo">
<div class="container">
<div class="eyebrow">Live demo · runs in your browser</div>
<h2>Try it on a real-looking 3-vendor lead list</h2>
<p>
The demo below loads a 25-row lead worksheet combining HubSpot,
LinkedIn Sales Navigator, and manual scraping — with the same prospect
appearing in two or three sources, country names spelled three
different ways (<code>USA</code>, <code>US</code>, <code>United
States</code>), and 13 different international phone formats. Click
<strong>Run pipeline</strong> and watch the 5-step pipeline (text
clean → format → missing → column map → dedup) collapse 25 rows to 19
with a single canonical record per prospect.
</p>
<div class="demo-frame">
<iframe
src="https://demo.datatools.app/?p=revops"
loading="lazy"
title="DataTools live demo — RevOps"
sandbox="allow-scripts allow-same-origin allow-downloads allow-forms"></iframe>
<div class="demo-caption">
Demo runs on free hosting. Capped at 100 input rows · output
watermarked. The paid product has no caps and runs entirely offline.
</div>
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">Built for the agency RevOps day</div>
<h2>Three workflows you do every campaign</h2>
<div class="grid">
<div class="card">
<span class="icon">🪢</span>
<h3>Email-list dedup across lead sources</h3>
<p>HubSpot exports + LinkedIn Sales Navigator + the VA's spreadsheet, all merged. Fuzzy match across email + phone + name catches the cross-source duplicates that broke your last campaign send.</p>
</div>
<div class="card">
<span class="icon">🌍</span>
<h3>Multi-platform audience reconciliation</h3>
<p>Build one canonical audience from Meta, Google Ads, LinkedIn, and your CRM. Each platform exports a different shape; column-mapper aligns them all, dedup merges the survivors with their most-complete fields.</p>
</div>
<div class="card">
<span class="icon">🛡️</span>
<h3>Suppression-list management</h3>
<p>Suppression lists need to dedupe across email + phone + first-party identifiers. Add a row, dedupe, ship the canonical CSV to every platform — without uploading the suppression list to any of them.</p>
</div>
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">If your campaigns target outside the US — almost everyone's do</div>
<h2>50+ country codes. Per-row country awareness.</h2>
<p>
Your HubSpot list has <code>(415) 555-1234</code>. Your scraped
list from the same prospect has <code>+1 415 555 1234</code>. Your
Italian prospect entered <code>+39 06 6982</code>. Your Brazilian
lead has <code>11 3071 0000</code>. Each comes from a row tagged
with its country — DataTools reads that column per row and parses
every phone correctly to E.164.
</p>
<ul class="bullets">
<li><strong>Per-row country column</strong> drives the parser, so no global default buckets UK numbers as malformed US.</li>
<li><strong>Country-name normalization</strong>: <code>USA</code> / <code>US</code> / <code>United States</code> all resolve to the same ISO-2 code.</li>
<li><strong>50+ country support</strong> via Google's libphonenumber, including KR, CN, IN, MX, BR, IL, TR, PL, DK, SE.</li>
<li><strong>Schema enforcement</strong> via the column-mapper: project to your CRM's required shape, coerce score columns to integers, reorder fields to match the import contract.</li>
</ul>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">For platforms that charge per contact</div>
<h2>Every duplicate you don't catch costs you for the life of the contract.</h2>
<p>
HubSpot prices on contacts. Klaviyo prices on contacts. Marketo,
Iterable, ActiveCampaign — all priced on contacts. Every duplicate
you don't catch is a recurring tax on your campaign. DataTools
catches them once, before import, with a fuzzy matcher that's
tuned to the cross-source noise you actually see.
</p>
<div class="callout">
<strong>Real numbers from the demo:</strong> 25 input rows from
three sources collapse to 19 — that's 6 duplicates the cross-source
noise was hiding. On a 50,000-row campaign list, that ratio
typically saves 12,000+ contacts a month, every month.
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">The thing every cloud cleaner can't say</div>
<h2>Your prospects' contact info never leaves your computer.</h2>
<p>
Cloud lead-cleaning tools require you to upload your audience.
That audience is your single most valuable agency asset — and once
it's on someone else's server, your client's privacy story is
no longer in your hands. DataTools is a desktop app. There is no
upload step.
</p>
<div class="terminal"><span class="prompt">$</span> python -m src.cli_pipeline campaign_q1.csv --pipeline revops_pipeline.json --apply
Reading campaign_q1.csv...
53,802 rows, 14 columns
Executing pipeline:
<span class="ok"></span> text_clean (160 ms) {cells_changed: 8,205}
<span class="ok"></span> format_standardize (1.4 s) {cells_changed: 41,889 — 50 country codes}
<span class="ok"></span> missing (140 ms) {sentinels_standardized: 6,710}
<span class="ok"></span> column_map (220 ms) {columns_renamed: 4, columns_added: 1}
<span class="ok"></span> dedup (4.8 s) {duplicates_removed: 12,344, merged: 12,344}
Initial rows: 53,802 → Final rows: 41,458
Total elapsed: 6.7 s
<span class="prompt">$</span> # 12,344 fewer contacts to pay for. for $49.</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">In the bundle</div>
<h2>Six tools. One pipeline. One $49 download.</h2>
<div class="grid">
<div class="card"><h3>1 · Deduplicator</h3><p>Fuzzy match across email + phone + name + company; merge survivors with most-complete fields.</p></div>
<div class="card"><h3>2 · Text Cleaner</h3><p>Smart quotes from copy-paste, NBSP from spreadsheet exports, BOM from Excel.</p></div>
<div class="card"><h3>3 · Format Standardizer</h3><p>E.164 phones with per-row country, canonical emails, name casing, ISO dates.</p></div>
<div class="card"><h3>4 · Missing Value Handler</h3><p>Detect <code>TBD</code>, <code>(unknown)</code>, <code></code> across vendor exports.</p></div>
<div class="card"><h3>5 · Column Mapper</h3><p>Project to your CRM's required schema, coerce score to integer, reorder for import.</p></div>
<div class="card"><h3>6 · Pipeline Runner</h3><p>Save the cleanup as JSON. Drop next campaign's combined export on it. Same dedup, automated.</p></div>
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">Pricing — pay once, own it</div>
<h2>$49. No subscription. No per-campaign fee.</h2>
<div class="pricing">
<div class="card featured">
<div class="row"><div class="price">$49</div><div class="price-suffix">one-time</div></div>
<h3>DataTools for RevOps</h3>
<ul>
<li>All 6 tools, full pipeline</li>
<li>Mac · Windows · Linux installers</li>
<li>Code-signed (no Gatekeeper warnings)</li>
<li>Free updates for the v1.x line</li>
<li>Bonus: 3-source unification pipeline preset</li>
<li><strong>Use on any number of clients</strong> — no seat limits</li>
</ul>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=revops" rel="noopener">Buy on Gumroad →</a>
</div>
<div class="card">
<div class="row"><div class="price">$149</div><div class="price-suffix">one-time</div></div>
<h3>Full DataTools Suite</h3>
<p class="muted">Available when 3+ bundles ship. Includes everything in the RevOps pack plus the Shopify and Bookkeeper bundles. Save $48.</p>
<a class="btn btn-ghost btn-large" href="#" aria-disabled="true">Coming when ready</a>
</div>
</div>
</div>
</section>
<section>
<div class="container">
<h2>Questions</h2>
<details class="faq">
<summary>Does this replace HubSpot's deduplication?</summary>
<p>No — it cleans data <em>before</em> import to HubSpot (or LinkedIn, Marketo, Klaviyo, etc.). HubSpot's dedup runs on already-imported contacts; DataTools catches duplicates that haven't yet cost you a contract slot.</p>
</details>
<details class="faq">
<summary>Does it handle international phones correctly?</summary>
<p>Yes — via Google's libphonenumber, with 50+ country codes. The killer feature is per-row country: point a column at it (any column with values like <code>US</code>, <code>USA</code>, <code>United States</code>, <code>+1</code>, <code>JP</code>, <code>Japan</code>) and DataTools parses each row in its own region. No more UK numbers bucketed as malformed US.</p>
</details>
<details class="faq">
<summary>Can I use it on multiple clients without paying again?</summary>
<p>Yes. The licence is per-operator, not per-client. Run it on every agency client's lead list for the same $49.</p>
</details>
<details class="faq">
<summary>How does fuzzy match work across columns?</summary>
<p>Out of the box, the dedup engine builds default strategies based on column names — typically email + phone with exact match, name with Jaro-Winkler at 85%. You can override via JSON: pick which columns to match on, which algorithm, and what threshold. Strategies survive in the saved pipeline so next campaign uses the same rules.</p>
</details>
<details class="faq">
<summary>What does the audit trail look like?</summary>
<p>A row-by-row CSV: every modified cell with its original value, new value, and which rule fired. A separate JSON file describes the pipeline that produced it. Together they reproduce the cleanup deterministically — your client can verify it on their machine.</p>
</details>
<details class="faq">
<summary>What's your refund policy?</summary>
<p>Try the live demo above on the sample dataset before you buy. If DataTools doesn't fit your workflow within 14 days, email for a refund — no questions asked.</p>
</details>
</div>
</section>
<section>
<div class="container" style="text-align: center;">
<h2>Stop paying twice for the same contact.</h2>
<p class="lead" style="margin: 0 auto 28px;">One $49 download. Catches the cross-source duplicates HubSpot and LinkedIn can't see, normalizes phones for 50+ countries, and saves a pipeline you can re-run on next campaign's combined list.</p>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=revops" rel="noopener">Get DataTools — $49 →</a>
</div>
</section>
<footer>
<div class="container">
<div>
<p><strong>DataTools</strong> — local data-cleaning for Shopify, bookkeepers, and RevOps teams.</p>
<p class="muted">© 2026 · Built solo · Shipped from a small office.</p>
</div>
<div>
<p>
<a href="../shopify-pet/">For Shopify operators</a> ·
<a href="../bookkeeper/">For bookkeepers</a><br />
<a href="https://gumroad.com/l/datatools?from=revops">Buy on Gumroad</a> ·
<a href="mailto:hello@datatools.app">Email support</a>
</p>
</div>
</div>
</footer>
</body>
</html>
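
Reviewer note on the removed RevOps page above: its copy describes collapsing one prospect who appears in several sources under different email spellings and phone formats, with exact matching on email and phone. A minimal sketch of that exact-match step, assuming hypothetical `email` and `phone` fields and made-up sample pairings. The fuzzy name matching (Jaro-Winkler) mentioned in the FAQ is left out, and the digits-only phone key is a stand-in for real E.164 parsing, which needs a region-aware library such as libphonenumber.

```python
import re


def norm_email(email: str) -> str:
    """Lowercase and trim so case drift does not split one address in two."""
    return email.strip().lower()


def norm_phone(phone: str) -> str:
    """Keep digits only; a stand-in for real E.164 parsing."""
    return re.sub(r"\D", "", phone)


def group_leads(leads: list[dict]) -> list[list[dict]]:
    """Group leads that share a normalized email or phone, transitively."""
    parent = list(range(len(leads)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen: dict[tuple[str, str], int] = {}
    for i, lead in enumerate(leads):
        keys = [
            ("email", norm_email(lead.get("email", ""))),
            ("phone", norm_phone(lead.get("phone", ""))),
        ]
        for key in keys:
            if not key[1]:
                continue  # blank values never link two records
            if key in first_seen:
                parent[find(i)] = find(first_seen[key])
            else:
                first_seen[key] = i

    groups: dict[int, list[dict]] = {}
    for i, lead in enumerate(leads):
        groups.setdefault(find(i), []).append(lead)
    return list(groups.values())


if __name__ == "__main__":
    # Made-up rows modelled on the page's example values.
    sample = [
        {"source": "hubspot", "email": "alice@acme.com", "phone": "(415) 555-1234"},
        {"source": "linkedin", "email": "Alice.Johnson@acme.com", "phone": "4155551234"},
        {"source": "scrape", "email": "alice@acme.com", "phone": ""},
        {"source": "hubspot", "email": "bob@example.com", "phone": "2125550100"},
    ]
    for group in group_leads(sample):
        print([lead["source"] for lead in group])
```

The digits-only key would not match `+1 415 555 1234` against `4155551234`, which is the gap the page's per-row country parsing is meant to close.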


@@ -1,381 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>DataTools for Shopify — Clean Customer & Product Exports Locally · $49</title>
<meta name="description" content="Clean Shopify customer, product, and subscriber exports — locally. Klaviyo-import-ready in 30 seconds. Catches duplicates Excel misses. Your data never leaves your computer. $49 one-time." />
<meta name="keywords" content="shopify customer cleanup, shopify csv cleaner, shopify product feed cleaner, klaviyo deduplicate, shopify customer dedup tool, shopify pet supplies" />
<link rel="canonical" href="https://datatools.app/shopify/" />
<link rel="stylesheet" href="../_shared/styles.css" />
<!-- Persona accent: Shopify pet → mint green (default in shared sheet) -->
<!-- Open Graph -->
<meta property="og:title" content="DataTools for Shopify — Clean Customer & Product Exports Locally" />
<meta property="og:description" content="Klaviyo-import-ready in 30 seconds. Local. No upload. $49 one-time." />
<meta property="og:type" content="product" />
<meta property="og:url" content="https://datatools.app/shopify/" />
<!-- Schema.org Product -->
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "SoftwareApplication",
"name": "DataTools for Shopify",
"operatingSystem": "Windows, macOS, Linux",
"applicationCategory": "BusinessApplication",
"offers": {
"@type": "Offer",
"price": "49",
"priceCurrency": "USD"
},
"description": "Clean Shopify customer, product, and subscriber CSV exports locally. Six-tool data-cleaning bundle: dedupe, text-clean, format-standardize, missing-value handle, column-map, pipeline.",
"softwareVersion": "1.0"
}
</script>
</head>
<body>
<!-- ============= Sticky buy bar ============= -->
<div class="buybar">
<div class="buybar-inner">
<div class="brand"><span class="brand-mark"></span> DataTools <span class="muted">/ for Shopify</span></div>
<div>
<span class="price-tag">$49 — one-time, no subscription</span>
<a class="btn" href="https://gumroad.com/l/datatools?from=shopify-pet" rel="noopener">Get DataTools →</a>
</div>
</div>
</div>
<!-- ============= Hero ============= -->
<section class="hero">
<div class="container">
<div class="eyebrow">For Shopify operators · pet supplies · subscription stores · DTC</div>
<h1>Klaviyo-import-ready customer lists.<br /><strong>In 30 seconds. Locally.</strong></h1>
<p class="lead">
Your Shopify customer export is a mess of formatting drift, disguised
duplicates, and inconsistent phone numbers. DataTools fixes all of it
in one pass — fuzzy-dedupes the same customer Klaviyo would charge
you for twice, standardises phones across your international
subscribers, and hands you a cleaned CSV. <strong>Your data never
leaves your computer.</strong>
</p>
<div class="cta-row">
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=shopify-pet" rel="noopener">Get DataTools — $49 →</a>
<a class="btn btn-ghost btn-large" href="#demo">Try the live demo ↓</a>
<span class="price-note">One-time payment · cross-platform · runs offline</span>
</div>
<div class="stats">
<div class="stat"><div class="num">6</div><div class="label">tools, one bundle</div></div>
<div class="stat"><div class="num">1 GB</div><div class="label">customer file in 2.5 min</div></div>
<div class="stat"><div class="num">0</div><div class="label">cloud uploads ever</div></div>
</div>
</div>
</section>
<!-- ============= Pain points ============= -->
<section>
<div class="container">
<div class="eyebrow">If any of these sound like your Tuesday</div>
<h2>Six pains DataTools fixes in one pass</h2>
<div class="grid">
<div class="card">
<span class="icon">💸</span>
<h3>Klaviyo / Mailchimp / Omnisend bills you for every duplicate</h3>
<p>The same customer signs up several times: once with a typo, once with a plus-tag, once on mobile. Your subscriber list has a 10–18 % duplicate rate and you're paying for every one of them, every month, forever.</p>
<p class="muted"><strong>What it costs:</strong> $30–$300/mo per percent of dupes on a 50 k list, recurring.</p>
</div>
<div class="card">
<span class="icon">📵</span>
<h3>Your product feed got rejected by Google Merchant Center</h3>
<p>Smart quotes from a copy-paste in product titles. NBSP in SKU. Inconsistent attribute casing. Feed bounces, the launch sits for 24–72 hours while you try to find the bad row in a 12,000-line CSV.</p>
<p class="muted"><strong>What it costs:</strong> 1–3 days of delayed campaign × the campaign value.</p>
</div>
<div class="card">
<span class="icon">🪢</span>
<h3>Orders from Shopify + Etsy + Amazon + Faire don't speak the same language</h3>
<p>Each platform's export uses different column names for "customer email" / "ship country" / "order total." Merging takes hours of manual rename and copy-paste before the analysis can even begin.</p>
<p class="muted"><strong>What it costs:</strong> 4–8 hours per month manually merging exports.</p>
</div>
<div class="card">
<span class="icon">🔁</span>
<h3>Subscription churn looks higher than it is</h3>
<p>Pet-box subscribers cancel, then re-sub three months later under a different email or device. Your cohort report says churn is 20 % when it's actually 12 % — and you're over-paying for acquisition because LTV is mis-calculated.</p>
<p class="muted"><strong>What it costs:</strong> wrong CAC ceiling for the next year of paid ads.</p>
</div>
<div class="card">
<span class="icon">🌍</span>
<h3>VAT MOSS / EU tax breaks because country is spelled three ways</h3>
<p>Your UK customers are tagged <code>UK</code>, <code>U.K.</code>, and <code>United Kingdom</code> — all in one export. The VAT report aggregates them as three different markets. Compliance friction every quarter.</p>
<p class="muted"><strong>What it costs:</strong> compliance risk + repeated manual normalization.</p>
</div>
<div class="card">
<span class="icon">🔒</span>
<h3>Cloud cleaners want you to upload your customer list</h3>
<p>Your customer list is your single most valuable business asset. Uploading it to a SaaS to clean it is the privacy story you do not want. DataTools is desktop-only — your list never leaves your computer.</p>
<p class="muted"><strong>What it costs:</strong> nothing — and that's the point.</p>
</div>
</div>
</div>
</section>
<!-- ============= Live demo ============= -->
<section id="demo">
<div class="container">
<div class="eyebrow">Live demo · runs in your browser</div>
<h2>Try it on a real-looking Shopify customer export</h2>
<p>
The demo below loads a sample 15-row Shopify customer file with
pollution we've seen in actual stores: smart quotes from copy-paste,
duplicates with email-case drift, international phones from the UK,
Spain, Germany, Australia, and Japan, and the usual mess of
<code>N/A</code> / <code>(blank)</code> / <code>?</code> sentinels.
Click <strong>Run pipeline</strong> and watch every column get
cleaned in under a second.
</p>
<div class="demo-frame">
<iframe
src="https://demo.datatools.app/?p=shopify-pet"
loading="lazy"
title="DataTools live demo — Shopify pet supplies"
sandbox="allow-scripts allow-same-origin allow-downloads allow-forms"></iframe>
<div class="demo-caption">
Demo runs on free hosting (Streamlit Community Cloud). Capped at
100 input rows · output watermarked with one trailing row. The
paid product has no caps and runs entirely offline.
</div>
</div>
</div>
</section>
<!-- ============= Built for Shopify ============= -->
<section>
<div class="container">
<div class="eyebrow">Built for the Shopify operator</div>
<h2>Six workflows you do every week</h2>
<div class="grid">
<div class="card">
<span class="icon">🧹</span>
<h3>Customer-list cleanup</h3>
<p>Catches the same customer who shows up as <code>john@gmail.com</code>, <code>John@Gmail.com</code>, and <code>j.ohn@gmail.com</code>. Fuzzy match merges the spellings, exact match catches the obvious ones.</p>
</div>
<div class="card">
<span class="icon">📦</span>
<h3>Product catalogue dedup</h3>
<p>SKU whitespace, near-identical product names, copy-paste smart quotes in titles — gone. Audit log shows every change.</p>
</div>
<div class="card">
<span class="icon">🛒</span>
<h3>Abandoned-cart hygiene</h3>
<p>Before re-engagement: dedupe across email + phone, convert sentinels like <code>N/A</code> to true blanks, and format dates so your sequence triggers fire correctly.</p>
</div>
<div class="card">
<span class="icon">📥</span>
<h3>Subscriber-list import to Klaviyo</h3>
<p>Klaviyo charges per contact. Every duplicate you don't catch costs you for the life of the subscription. Catch them once, pay once.</p>
</div>
<div class="card">
<span class="icon">🔗</span>
<h3>Multi-channel order consolidation</h3>
<p>Orders from Shopify + Etsy + a wholesale spreadsheet, each with a different column for "customer email." Column-mapper aligns them; dedup merges across channels.</p>
</div>
<div class="card">
<span class="icon">⚙️</span>
<h3>Repeatable pipeline</h3>
<p>Save the cleanup as a JSON file. Drop next week's export on it. Same cleanup, zero re-configuration. Automatable via the CLI.</p>
</div>
</div>
</div>
</section>
<!-- ============= Privacy moat ============= -->
<section>
<div class="container">
<div class="eyebrow">The thing every cloud cleaner can't say</div>
<h2>Your customer list never leaves your computer.</h2>
<p>
DataTools is a desktop app. There's no upload step, no SaaS account,
no subscription, no "trust our security policy." The first thing you
can do after install is open your browser's network tab, run the
cleaner on your real customer file, and verify zero outbound
requests.
</p>
<div class="callout">
<strong>Why it matters for Shopify:</strong> your customer list is
your single most valuable business asset. Cloud cleaners require
you to upload it. We don't.
</div>
<div class="terminal"><span class="prompt">$</span> python -m src.cli_pipeline customers.csv --apply
Reading customers.csv...
47,832 rows, 14 columns
Executing pipeline:
<span class="ok"></span> text_clean (140 ms) {cells_changed: 12,408}
<span class="ok"></span> format_standardize (810 ms) {cells_changed: 31,202}
<span class="ok"></span> missing (95 ms) {sentinels_standardized: 8,129}
<span class="ok"></span> dedup (3.1 s) {duplicates_removed: 2,347}
Initial rows: 47,832 → Final rows: 45,485
Total elapsed: 4.2 s
<span class="prompt">$</span> # zero network calls. zero. promise.</div>
</div>
</section>
<!-- ============= Audit moat ============= -->
<section>
<div class="container">
<div class="eyebrow">For when your client asks "what changed?"</div>
<h2>Every change auditable. Every cell logged.</h2>
<p>
Every modification is recorded with the original value, the new
value, and which rule fired. Hand the audit CSV to your accountant,
your marketing manager, or your boss along with the cleaned file.
No <em>"I trust the AI"</em> hand-waving — they see exactly what
happened.
</p>
<div class="callout">
<strong>Real example:</strong> the demo above standardized 27
cells across 15 customers. The audit log lists each one — row,
column, before, after, which standardizer fired. The dedup audit
lists every duplicate group with the survivor and its losers.
</div>
</div>
</section>
<!-- ============= International ============= -->
<section>
<div class="container">
<div class="eyebrow">If you sell internationally — most pet brands do</div>
<h2>Phones, addresses, and currencies from anywhere on Earth.</h2>
<p>
Your subscriber from London entered her phone as <code>020 7946
0958</code>. Your Tokyo customer entered <code>03-3210-7000</code>.
Your German wholesale buyer wrote <code>€2.410,75</code>. Excel
thinks all of them are mistakes. DataTools knows what country each
row is from (per-row country column) and parses every one correctly
to E.164 phones, ISO dates, and numeric amounts.
</p>
<ul class="bullets">
<li><strong>50+ country codes</strong> via Google's libphonenumber.</li>
<li><strong>Currency auto-detect</strong> for $ / £ / € / ¥ / R$ / kr / zł — including the EU comma-decimal that breaks Excel.</li>
<li><strong>Address shape detection</strong> for US, UK, Canada, Germany, Australia.</li>
<li><strong>Locale-aware month names</strong> in English, French, German.</li>
</ul>
</div>
</section>
<!-- ============= What you get ============= -->
<section>
<div class="container">
<div class="eyebrow">In the bundle</div>
<h2>Six tools. One pipeline. One $49 download.</h2>
<div class="grid">
<div class="card"><h3>1 · Deduplicator</h3><p>Fuzzy match (Jaro-Winkler), 5 normalizers, survivor rules, interactive review.</p></div>
<div class="card"><h3>2 · Text Cleaner</h3><p>Whitespace, smart chars, NBSP, BOM, line endings, case ops.</p></div>
<div class="card"><h3>3 · Format Standardizer</h3><p>Dates, phones, emails, addresses, names, currencies, booleans.</p></div>
<div class="card"><h3>4 · Missing Value Handler</h3><p>Disguised-null detection, profile, mean/median/mode/ffill, drop strategies.</p></div>
<div class="card"><h3>5 · Column Mapper</h3><p>Fuzzy auto-rename, target schema, type coercion, required-field defaults.</p></div>
<div class="card"><h3>6 · Pipeline Runner</h3><p>Chain tools in recommended order, save/load JSON, automate weekly cleanups.</p></div>
</div>
</div>
</section>
<!-- ============= Pricing ============= -->
<section>
<div class="container">
<div class="eyebrow">Pricing — pay once, own it</div>
<h2>$49. No subscription. No ceiling on rows or files.</h2>
<div class="pricing">
<div class="card featured">
<div class="row"><div class="price">$49</div><div class="price-suffix">one-time</div></div>
<h3>DataTools for Shopify</h3>
<ul>
<li>All 6 tools, full pipeline</li>
<li>Mac · Windows · Linux installers</li>
<li>Code-signed (no Gatekeeper warnings)</li>
<li>Free updates for the v1.x line</li>
<li>Bonus: 3 ready-made Shopify pipelines</li>
</ul>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=shopify-pet" rel="noopener">Buy on Gumroad →</a>
</div>
<div class="card">
<div class="row"><div class="price">$149</div><div class="price-suffix">one-time</div></div>
<h3>Full DataTools Suite</h3>
<p class="muted">Available when 3+ bundles ship. Includes everything in the Shopify pack plus the Bookkeeper and RevOps bundles. Save $48.</p>
<a class="btn btn-ghost btn-large" href="#" aria-disabled="true">Coming when ready</a>
</div>
</div>
</div>
</section>
<!-- ============= FAQ ============= -->
<section>
<div class="container">
<h2>Questions</h2>
<details class="faq">
<summary>Does this work with Shopify Plus?</summary>
<p>Yes — the input is just CSV / Excel from any source. Shopify Plus exports work the same as exports from the standard plan, or from a Shopify-to-CSV pipeline you've stitched together yourself. The cleaner doesn't care.</p>
</details>
<details class="faq">
<summary>How does this compare to Excel's "Remove Duplicates"?</summary>
<p>Excel does <em>exact</em> deduplication. A stray trailing space, a smart quote, or a differently formatted phone number is enough to make one customer look like two to Excel. DataTools fuzzy-matches across case, whitespace, formatting, and even close-but-not-identical strings. The demo above merges 4 customer pairs Excel would leave duplicated.</p>
</details>
<details class="faq">
<summary>How big a file can it handle?</summary>
<p>A 1 GB CSV with international phones + addresses processes in about 2.5 minutes on a typical workstation. Streaming mode keeps memory bounded regardless of input size — we tested it on 26 million rows.</p>
</details>
<details class="faq">
<summary>Do I need to know Python to use it?</summary>
<p>No. The GUI is a browser interface that opens automatically when you double-click the app. You load your file, click Run, and download the cleaned file. The CLI is there for power users who want to script weekly cleanups.</p>
</details>
<details class="faq">
<summary>What about my privacy?</summary>
<p>Your customer list never leaves your computer. There is no cloud component, no telemetry, no "anonymous usage stats." When the app is running you can confirm zero outbound network requests in your browser's developer tools.</p>
</details>
<details class="faq">
<summary>What's your refund policy?</summary>
<p>Try the live demo above on the sample dataset before you buy. If you find within 14 days that DataTools doesn't fit your workflow, email for a refund — no questions asked.</p>
</details>
<details class="faq">
<summary>Will there be updates?</summary>
<p>Yes. Updates in the v1.x line are free for everyone who buys DataTools today. We ship a patch every 30 days adding country support, edge-case fixes, and small features.</p>
</details>
</div>
</section>
<!-- ============= Final CTA ============= -->
<section>
<div class="container" style="text-align: center;">
<h2>Stop deduplicating customers by hand.</h2>
<p class="lead" style="margin: 0 auto 28px;">One $49 download. Mac, Windows, or Linux. Runs offline. Catches the duplicates Excel misses, standardizes the phones from your international customers, and saves a pipeline you can re-run on next week's export.</p>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=shopify-pet" rel="noopener">Get DataTools — $49 →</a>
</div>
</section>
<!-- ============= Footer ============= -->
<footer>
<div class="container">
<div>
<p><strong>DataTools</strong> — local data-cleaning for Shopify stores, bookkeepers, and RevOps teams.</p>
<p class="muted">© 2026 · Built solo · Shipped from a small office.</p>
</div>
<div>
<p>
<a href="../bookkeeper/">For bookkeepers</a> ·
<a href="../revops/">For RevOps agencies</a><br />
<a href="https://gumroad.com/l/datatools?from=shopify-pet">Buy on Gumroad</a> ·
<a href="mailto:hello@datatools.app">Email support</a>
</p>
</div>
</div>
</footer>
</body>
</html>

@@ -0,0 +1,192 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — Find Duplicates</title>
<link rel="stylesheet" href="app.css">
</head>
<body data-page="01_deduplicator">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of <strong>Find Duplicates</strong>, shown with a file imported and a completed run (results + match-group review). <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Tool header -->
<div class="dt-tool-header">
<h1>Find Duplicates</h1>
<div class="dt-tool-header-actions">
<span class="dt-privacy-pill">
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor">
<rect x="4" y="11" width="16" height="10" rx="2"/>
<path d="M8 11V7a4 4 0 018 0v4"/>
</svg>
Runs 100% locally
</span>
<button class="dt-help-btn"><span class="dt-mi">help_outline</span> Help</button>
</div>
</div>
<p class="dt-tool-caption">Find rows that repeat, then keep one and remove the extras.</p>
<div class="dt-spacer"></div>
<!-- File pickup banner (using file from upload screen) -->
<div class="dt-alert info">
<span class="dt-mi">description</span>
<span>Using <strong>customers_export.csv</strong> from the upload screen.</span>
</div>
<button class="dt-btn" style="margin-bottom:4px">Use a different file</button>
<!-- Delimiter selector — delimited-text only (CSV/TSV); omitted for XLSX/XLS.
Shown here because the staged file is customers_export.csv. -->
<div class="dt-field" style="max-width:320px">
<label class="dt-label">Delimiter</label>
<div class="dt-select">Comma (,)</div>
<div class="dt-help-text">Auto-detected on upload. Change if the preview looks wrong.</div>
</div>
<!-- Preview expander (collapsed after a result exists) -->
<details class="dt-expander">
<summary>Preview: customers_export.csv</summary>
<div class="dt-expander-body">
<p class="dt-caption">18,442 rows, 6 columns</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th class="idx"></th><th>name</th><th>email</th><th>city</th><th>phone</th><th>signup_date</th></tr></thead>
<tbody>
<tr><td class="idx">0</td><td>Jane Doe</td><td>jane@acme.io</td><td>Austin</td><td>512-555-0190</td><td>2024-01-04</td></tr>
<tr><td class="idx">1</td><td>jane doe</td><td>JANE@ACME.IO</td><td>austin</td><td>(512) 555-0190</td><td>01/04/2024</td></tr>
<tr><td class="idx">2</td><td>Bob Smith</td><td>bob@globex.com</td><td>Denver</td><td>720-555-7781</td><td>2024-02-11</td></tr>
<tr><td class="idx">3</td><td>R. Smith</td><td>bob@globex.com</td><td>Denver</td><td>720-555-7781</td><td>2024-02-11</td></tr>
</tbody>
</table>
</div>
</div>
</details>
<!-- Basic controls (visible by default) -->
<div class="dt-cols-2">
<div class="dt-field"><label class="dt-label">Match threshold</label>
<div class="dt-slider"><div class="track"><div class="fill" style="width:70%"></div><div class="knob" style="left:70%"></div></div><div class="val">85</div></div>
<div class="dt-help-text">Higher means rows must look more alike to count as a duplicate.</div></div>
<div class="dt-field"><label class="dt-label">When duplicates are found, keep</label>
<div class="dt-select">the most-complete row</div>
<div class="dt-help-text">Which row survives in each group of duplicates.</div></div>
</div>
<!-- Advanced options (single expander; basics live above) -->
<details class="dt-expander">
<summary>Advanced options</summary>
<div class="dt-expander-body">
<p class="dt-help-text" style="margin-top:0">Leave these empty to auto-detect which columns to compare. Otherwise, list the columns that must match <strong>exactly</strong> and the ones that only need to match <strong>approximately</strong> — together these are the columns used to find duplicates.</p>
<div class="dt-cols-2">
<div>
<div class="dt-field"><label class="dt-label">Columns that must match exactly</label>
<div class="dt-multiselect"><span class="dt-ms-chip">email <span class="x"></span></span></div></div>
<div class="dt-field"><label class="dt-label">Columns to match approximately</label>
<div class="dt-multiselect"><span class="dt-ms-chip">name <span class="x"></span></span></div></div>
</div>
<div>
<div class="dt-field"><label class="dt-label">Approximate-match algorithm</label><div class="dt-select">jaro_winkler</div></div>
</div>
</div>
<div class="dt-check on" style="margin-top:6px"><span class="box"><span class="dt-mi">check</span></span> Merge mode — fill missing fields in the surviving row</div>
</div>
</details>
<hr class="dt-divider">
<button class="dt-btn dt-btn-primary dt-btn-block">Find Duplicates</button>
<hr class="dt-divider">
<!-- Results -->
<h2>Results</h2>
<div class="dt-metrics">
<div class="dt-metric"><div class="label">Original rows</div><div class="value">18,442</div></div>
<div class="dt-metric"><div class="label">Duplicate rows</div><div class="value">312</div><div class="delta down">312 removed</div></div>
<div class="dt-metric"><div class="label">Match groups</div><div class="value">147</div></div>
<div class="dt-metric"><div class="label">Rows kept</div><div class="value">18,130</div></div>
</div>
<p class="dt-caption">Preview of an auto-resolved run: each group keeps its auto-picked survivor. Review the groups below to override any pending picks before the final download.</p>
<div class="dt-btn-row" style="max-width:560px">
<button class="dt-btn">Download auto-resolved CSV</button>
<button class="dt-btn">Download removed rows</button>
</div>
<hr class="dt-divider">
<!-- Match groups -->
<h2>Match Groups</h2>
<div class="dt-cols-3" style="max-width:520px">
<button class="dt-btn">Accept All</button>
<button class="dt-btn">Reject All</button>
<button class="dt-btn">Clear Decisions</button>
</div>
<p class="dt-caption" style="margin-top:8px">Differing columns are highlighted. The survivor row is kept; uncheck a row to split it out of the group.</p>
<!-- Match group card 1 -->
<div class="dt-match-card">
<div class="dt-match-head">
<span class="title">Group 1 · 2 rows</span>
<span class="conf"><span class="dt-count-pill success">98% match</span></span>
</div>
<div class="dt-match-body">
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>keep</th><th>name</th><th>email</th><th>city</th><th>phone</th><th>signup_date</th></tr></thead>
<tbody>
<tr class="dt-keep-row"><td><span class="dt-keep-tag">keep</span></td><td>Jane Doe</td><td>jane@acme.io</td><td>Austin</td><td>512-555-0190</td><td>2024-01-04</td></tr>
<tr><td><span class="dt-caption">remove</span></td><td class="dt-cell-flag">jane doe</td><td class="dt-cell-flag">JANE@ACME.IO</td><td class="dt-cell-flag">austin</td><td>(512) 555-0190</td><td class="dt-cell-flag">01/04/2024</td></tr>
</tbody>
</table>
</div>
</div>
</div>
<!-- Match group card 2 -->
<div class="dt-match-card">
<div class="dt-match-head">
<span class="title">Group 2 · 2 rows</span>
<span class="conf"><span class="dt-count-pill warn">87% match</span></span>
</div>
<div class="dt-match-body">
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>keep</th><th>name</th><th>email</th><th>city</th><th>phone</th><th>signup_date</th></tr></thead>
<tbody>
<tr class="dt-keep-row"><td><span class="dt-keep-tag">keep</span></td><td>Bob Smith</td><td>bob@globex.com</td><td>Denver</td><td>720-555-7781</td><td>2024-02-11</td></tr>
<tr><td><span class="dt-caption">remove</span></td><td class="dt-cell-flag">R. Smith</td><td>bob@globex.com</td><td>Denver</td><td>720-555-7781</td><td>2024-02-11</td></tr>
</tbody>
</table>
</div>
</div>
</div>
<p class="dt-caption" style="margin-top:14px">Decisions: 1 merged, 1 pending · Pending groups keep their auto-picked survivor unless you review them.</p>
<button class="dt-btn dt-btn-primary dt-btn-block" style="margin-top:8px">Apply Review Decisions &amp; Download Final CSV</button>
<!-- Processing log -->
<details class="dt-expander" style="margin-top:18px">
<summary>Processing Log</summary>
<div class="dt-expander-body">
<div class="dt-code">[00:00.01] Loaded 18,442 rows from customers_export.csv
[00:00.04] Strategy: exact(email) + fuzzy(name, jaro_winkler ≥ 85)
[00:00.91] Compared 18,442 rows → 147 match groups
[00:01.02] Survivor rule: most-complete · merge=on
[00:01.05] 312 rows flagged for removal</div>
</div>
</details>
<div class="dt-next-step"><span class="dt-mi">arrow_forward</span><span>Duplicates handled — your file is cleaned. Review the result, or go <a href="home.html">back to Start here →</a></span><button class="dt-next-step-dismiss" title="Dismiss"></button></div>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>

@@ -0,0 +1,223 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — Clean Text</title>
<link rel="stylesheet" href="app.css">
<style>
/* Hidden-character badges. These rules mirror src/core/text_clean.py:hidden_char_css();
they are not part of app.css, so they are reproduced inline with the same palette. */
.hidden-char { display: inline-block; padding: 0 2px; margin: 0 1px; border-radius: 3px; font-family: var(--font-mono); font-size: 0.85em; cursor: help; }
.hidden-char.hidden-whitespace { background: #fff3cd; color: #856404; border: 1px solid #ffeaa7; }
.hidden-char.hidden-special { background: #d1ecf1; color: #0c5460; border: 1px solid #bee5eb; }
.hidden-char.hidden-control { background: #f8d7da; color: #721c24; border: 1px solid #f5c6cb; }
</style>
</head>
<body data-page="02_text_cleaner">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of <strong>Clean Text</strong>, shown with a file imported and a completed run (results metrics, changes-by-column, before/after examples, cleaned preview, downloads). <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Tool header -->
<div class="dt-tool-header">
<h1>Clean Text</h1>
<div class="dt-tool-header-actions">
<span class="dt-privacy-pill">
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor">
<rect x="4" y="11" width="16" height="10" rx="2"/>
<path d="M8 11V7a4 4 0 018 0v4"/>
</svg>
Runs 100% locally
</span>
<button class="dt-help-btn"><span class="dt-mi">help_outline</span> Help</button>
</div>
</div>
<p class="dt-tool-caption">Trim extra spaces and strip out odd characters.</p>
<div class="dt-spacer"></div>
<!-- File pickup banner (using file from upload screen) -->
<div class="dt-alert info">
<span class="dt-mi">description</span>
<span>Using <strong>contacts_messy.csv</strong> from the upload screen.</span>
</div>
<button class="dt-btn" style="margin-bottom:4px">Use a different file</button>
<!-- Preview expander (collapsed once a result exists) -->
<details class="dt-expander">
<summary>Preview: contacts_messy.csv</summary>
<div class="dt-expander-body">
<p class="dt-caption">4,120 rows, 4 columns</p>
<div class="dt-check on" style="margin-top:2px"><span class="box"><span class="dt-mi">check</span></span> Show hidden characters</div>
<div style="display:flex;flex-wrap:wrap;align-items:center;gap:14px;margin-top:6px;font-size:12px;color:var(--ink-secondary)">
<span style="display:inline-flex;align-items:center;gap:6px"><span class="hidden-char hidden-whitespace" style="cursor:default">·</span> Whitespace</span>
<span style="display:inline-flex;align-items:center;gap:6px"><span class="hidden-char hidden-special" style="cursor:default"></span> Smart / special</span>
<span style="display:inline-flex;align-items:center;gap:6px"><span class="hidden-char hidden-control" style="cursor:default"></span> Control</span>
</div>
<div class="dt-table-wrap" style="margin-top:8px">
<table class="dt-table">
<thead><tr><th class="idx"></th><th>name</th><th>email</th><th>company</th><th>notes</th></tr></thead>
<tbody>
<tr><td class="idx">0</td><td><span class="hidden-char hidden-whitespace" title="U+0020 SP LEAD">·</span>Jane Doe<span class="hidden-char hidden-whitespace" title="U+0020 SP TRAIL">·</span></td><td>jane@acme.io</td><td>Acme<span class="hidden-char hidden-whitespace" title="U+00A0 NBSP">·</span>Inc.</td><td>VIP<span class="hidden-char hidden-special" title="U+201D RIGHT DOUBLE QUOTE"></span></td></tr>
<tr><td class="idx">1</td><td>Bob&nbsp;&nbsp;Smith</td><td>bob@globex.com<span class="hidden-char hidden-special" title="U+200B ZWSP"></span></td><td>Globex</td><td><span class="hidden-char hidden-control" title="U+0007 CTRL"></span></td></tr>
<tr><td class="idx">2</td><td>Ana López</td><td>ana@initech.com</td><td>Initech<span class="hidden-char hidden-whitespace" title="U+0020 SP TRAIL">·</span></td><td>follow&nbsp;up</td></tr>
<tr><td class="idx">3</td><td><span class="hidden-char hidden-whitespace" title="U+0009 TAB"></span>Wei Chen</td><td>WEI@umbrella.co</td><td>Umbrella</td><td>“key<span class="hidden-char hidden-special" title="U+2014 EM DASH"></span>account”</td></tr>
</tbody>
</table>
</div>
</div>
</details>
<hr class="dt-divider">
<!-- Options expander (collapsed once a result exists) -->
<details class="dt-expander">
<summary>Options</summary>
<div class="dt-expander-body">
<div class="dt-field">
<label class="dt-label">Preset</label>
<div class="dt-radio-row">
<span class="dt-radio on"><span class="dot"></span> excel-hygiene (recommended)</span>
<span class="dt-radio"><span class="dot"></span> minimal</span>
<span class="dt-radio"><span class="dot"></span> paranoid</span>
</div>
<div class="dt-help-text">
minimal: trim and collapse whitespace only — no character substitutions.<br>
excel-hygiene: trim, collapse whitespace, fold smart quotes, strip invisible chars, normalize line endings, and normalize accented characters.<br>
paranoid: everything in excel-hygiene plus strip control characters, strip BOM, and normalize accented and look-alike characters (lossy).
</div>
</div>
<details class="dt-expander">
<summary>Advanced options</summary>
<div class="dt-expander-body">
<div class="dt-cols-2">
<div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Trim leading/trailing whitespace</div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Collapse internal whitespace</div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Normalize line endings (\r\n → \n)</div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Strip control characters</div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Strip BOM</div>
</div>
<div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Fold smart characters (curly quotes, em-dash, NBSP)</div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Strip zero-width / invisible characters</div>
<div class="dt-check on" title="Unicode NFC normalization"><span class="box"><span class="dt-mi">check</span></span> Normalize accented characters (NFC)</div>
<div class="dt-check" title="Unicode NFKC compatibility fold"><span class="box"></span> Normalize accented and look-alike characters (lossy: ① → 1, ﬁ → fi)</div>
</div>
</div>
<h4>Scope</h4>
<div class="dt-field">
<label class="dt-label">Columns to clean (default: all string columns)</label>
<div class="dt-multiselect">
<span class="dt-ms-chip">name <span class="x"></span></span>
<span class="dt-ms-chip">email <span class="x"></span></span>
<span class="dt-ms-chip">company <span class="x"></span></span>
<span class="dt-ms-chip">notes <span class="x"></span></span>
</div>
</div>
<div class="dt-field">
<label class="dt-label">Columns to skip even if they look like text</label>
<div class="dt-multiselect"><span class="dt-ms-placeholder">Choose columns to leave untouched</span></div>
</div>
<h4>Case conversion</h4>
<div class="dt-field" style="max-width:360px">
<label class="dt-label">Apply case conversion to selected columns</label>
<div class="dt-select">None</div>
</div>
</div>
</details>
</div>
</details>
<hr class="dt-divider">
<button class="dt-btn dt-btn-primary dt-btn-block">Clean Text</button>
<hr class="dt-divider">
<!-- Results -->
<h2>Results</h2>
<div class="dt-metrics">
<div class="dt-metric"><div class="label">Cells scanned</div><div class="value">16,480</div></div>
<div class="dt-metric"><div class="label">Cells changed</div><div class="value">3,947</div></div>
<div class="dt-metric"><div class="label">% changed</div><div class="value">24.0%</div></div>
<div class="dt-metric"><div class="label">Columns processed</div><div class="value">4</div></div>
</div>
<div class="dt-field">
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Show hidden characters (NBSP, ZWSP, smart quotes, control chars…)</div>
<div class="dt-help-text">Same setting as “Show hidden characters” in the preview above — toggling either updates both.</div>
</div>
<h4>Changes by column</h4>
<div class="dt-table-wrap" style="max-width:360px">
<table class="dt-table">
<thead><tr><th>column</th><th>cells_changed</th></tr></thead>
<tbody>
<tr><td>company</td><td>1,604</td></tr>
<tr><td>name</td><td>1,210</td></tr>
<tr><td>notes</td><td>982</td></tr>
<tr><td>email</td><td>151</td></tr>
</tbody>
</table>
</div>
<h4>Examples (first 25 changes)</h4>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>Row</th><th>Column</th><th>Before</th><th>After</th><th>Ops applied</th></tr></thead>
<tbody>
<tr><td>1</td><td>name</td><td><span class="hidden-char hidden-whitespace" title="U+0020 SP LEAD">·</span>Jane Doe<span class="hidden-char hidden-whitespace" title="U+0020 SP TRAIL">·</span></td><td>Jane Doe</td><td>trim</td></tr>
<tr><td>1</td><td>company</td><td>Acme<span class="hidden-char hidden-whitespace" title="U+00A0 NBSP">·</span>Inc.</td><td>Acme Inc.</td><td>fold_smart</td></tr>
<tr><td>1</td><td>notes</td><td>VIP<span class="hidden-char hidden-special" title="U+201D RIGHT DOUBLE QUOTE"></span></td><td>VIP"</td><td>fold_smart</td></tr>
<tr><td>2</td><td>name</td><td>Bob<span class="hidden-char hidden-whitespace" title="U+0020 SP">·</span><span class="hidden-char hidden-whitespace" title="U+0020 SP">·</span>Smith</td><td>Bob Smith</td><td>collapse_ws</td></tr>
<tr><td>2</td><td>email</td><td>bob@globex.com<span class="hidden-char hidden-special" title="U+200B ZWSP"></span></td><td>bob@globex.com</td><td>strip_zero_width</td></tr>
<tr><td>2</td><td>notes</td><td><span class="hidden-char hidden-control" title="U+0007 CTRL"></span></td><td></td><td>strip_control</td></tr>
<tr><td>3</td><td>company</td><td>Initech<span class="hidden-char hidden-whitespace" title="U+0020 SP TRAIL">·</span></td><td>Initech</td><td>trim</td></tr>
<tr><td>4</td><td>name</td><td><span class="hidden-char hidden-whitespace" title="U+0009 TAB"></span>Wei Chen</td><td>Wei Chen</td><td>trim</td></tr>
<tr><td>4</td><td>notes</td><td>“key<span class="hidden-char hidden-special" title="U+2014 EM DASH"></span>account”</td><td>"key-account"</td><td>fold_smart</td></tr>
</tbody>
</table>
</div>
<h4>Cleaned preview (first 10 rows)</h4>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th class="idx"></th><th>name</th><th>email</th><th>company</th><th>notes</th></tr></thead>
<tbody>
<tr><td class="idx">0</td><td class="dt-cell-add">Jane Doe</td><td>jane@acme.io</td><td class="dt-cell-add">Acme Inc.</td><td class="dt-cell-add">VIP"</td></tr>
<tr><td class="idx">1</td><td class="dt-cell-add">Bob Smith</td><td class="dt-cell-add">bob@globex.com</td><td>Globex</td><td class="dt-cell-add"></td></tr>
<tr><td class="idx">2</td><td>Ana López</td><td>ana@initech.com</td><td class="dt-cell-add">Initech</td><td>follow up</td></tr>
<tr><td class="idx">3</td><td class="dt-cell-add">Wei Chen</td><td>WEI@umbrella.co</td><td>Umbrella</td><td class="dt-cell-add">"key-account"</td></tr>
</tbody>
</table>
</div>
<p class="dt-caption">Changed cells highlighted. Toggle “Show hidden characters” to inspect the invisibles being removed.</p>
<hr class="dt-divider">
<!-- Downloads -->
<div class="dt-cols-3">
<button class="dt-btn dt-btn-primary">Download cleaned CSV</button>
<button class="dt-btn">Download changes audit</button>
<button class="dt-btn">Download config JSON</button>
</div>
<!-- Next-step suggestion -->
<div class="dt-next-step"><span class="dt-mi">arrow_forward</span><span>Text cleaned. Next, most files need: <a href="03_format_standardizer.html">Standardize Formats →</a></span><button class="dt-next-step-dismiss" title="Dismiss"></button></div>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>

@@ -0,0 +1,265 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — Standardize Formats</title>
<link rel="stylesheet" href="app.css">
</head>
<body data-page="03_format_standardizer">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of <strong>Standardize Formats</strong>, shown with a file imported from the upload screen and a completed run (results + changes audit + standardized preview). <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Tool header -->
<div class="dt-tool-header">
<h1>Standardize Formats</h1>
<div class="dt-tool-header-actions">
<span class="dt-privacy-pill">
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor">
<rect x="4" y="11" width="16" height="10" rx="2"/>
<path d="M8 11V7a4 4 0 018 0v4"/>
</svg>
Runs 100% locally
</span>
<button class="dt-help-btn"><span class="dt-mi">help_outline</span> Help</button>
</div>
</div>
<p class="dt-tool-caption">Make dates, phones, currency, and names look the same throughout.</p>
<div class="dt-spacer"></div>
<!-- File pickup banner (using file from upload screen) -->
<div class="dt-alert info">
<span class="dt-mi">description</span>
<span>Using <strong>customers_export.csv</strong> from the upload screen.</span>
</div>
<button class="dt-btn" style="margin-bottom:4px">Use a different file</button>
<!-- Preview expander (collapsed once a result exists) -->
<details class="dt-expander">
<summary>Preview: customers_export.csv</summary>
<div class="dt-expander-body">
<p class="dt-caption">18,442 rows, 6 columns</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th class="idx"></th><th>full_name</th><th>phone</th><th>amount</th><th>signup_date</th><th>active</th></tr></thead>
<tbody>
<tr><td class="idx">0</td><td>jane DOE</td><td>(512) 555-0190</td><td>$1,234.5</td><td>01/04/2024</td><td>Y</td></tr>
<tr><td class="idx">1</td><td>bob smith</td><td>720.555.7781</td><td>$99</td><td>2024-2-11</td><td>yes</td></tr>
<tr><td class="idx">2</td><td>ALICIA REYES</td><td>+1 415 555 2233</td><td>$45,000</td><td>Mar 3, 2024</td><td>n</td></tr>
<tr><td class="idx">3</td><td>m. okafor</td><td>2125550148</td><td>$7.999</td><td>2024/04/22</td><td>true</td></tr>
</tbody>
</table>
</div>
</div>
</details>
<hr class="dt-divider">
<!-- Options expander (collapsed after run; opened here to show the most informative content) -->
<details class="dt-expander" open>
<summary>Options</summary>
<div class="dt-expander-body">
<h3 style="margin-top:0">Column types</h3>
<p class="dt-caption">Assign each column to a field type. Auto-detected suggestions are pre-filled; pick <strong>(skip)</strong> to leave a column untouched.</p>
<!-- Per-column type selectboxes, 3 per row -->
<div class="dt-cols-3">
<div class="dt-field"><label class="dt-label">full_name</label><div class="dt-select">Name</div></div>
<div class="dt-field"><label class="dt-label">phone</label><div class="dt-select">Phone</div></div>
<div class="dt-field"><label class="dt-label">amount</label><div class="dt-select">Currency</div></div>
</div>
<div class="dt-cols-3">
<div class="dt-field"><label class="dt-label">signup_date</label><div class="dt-select">Date</div></div>
<div class="dt-field"><label class="dt-label">active</label><div class="dt-select">Boolean</div></div>
<div class="dt-field"><label class="dt-label">notes</label><div class="dt-select">(skip)</div></div>
</div>
<hr class="dt-divider">
<h3>Format options</h3>
<!-- Standards preset radio (vertical). Demo state: preset has auto-switched
to Custom because individual controls below diverge from the European base. -->
<div class="dt-field">
<label class="dt-label">Standards preset</label>
<div style="display:flex;flex-direction:column;gap:8px;margin-top:4px">
<span class="dt-radio" title="E.164 phones"><span class="dot"></span> US (default) — ISO 8601 dates · international-format phones (+1…) · USD</span>
<span class="dt-radio"><span class="dot"></span> European — DMY input · INTL phones · EUR comma decimal <span class="dt-count-pill info" style="margin-left:4px">base</span></span>
<span class="dt-radio"><span class="dot"></span> UK — DD/MM/YYYY · GB phones · Yes/No booleans</span>
<span class="dt-radio"><span class="dot"></span> ISO Strict — ISO 8601 · bare-number currency · true/false</span>
<span class="dt-radio"><span class="dot"></span> Legacy US — MM/DD/YYYY · National phones · Yes/No</span>
<span class="dt-radio on"><span class="dot"></span> Custom — based on <strong>European</strong>, 2 controls changed <span class="dt-count-pill warn" style="margin-left:4px">modified</span></span>
</div>
<div class="dt-precedence" style="margin-top:10px">
<span class="dt-mi">rule</span>
<span>Individual controls win over the preset. You started from <strong>European</strong>, then changed <strong>Ambiguous input order</strong> and <strong>Decimal separator</strong> below — so the preset is now <strong>Custom</strong>. The controls' current values are what actually run.</span>
</div>
<div class="dt-help-text">Pick a published standard or regional convention as the baseline. Every option below is still individually overridable; overriding any one switches the preset to Custom.</div>
</div>
<!-- Two-column format options -->
<div class="dt-cols-2" style="margin-top:14px">
<!-- Left column: Dates + Phones -->
<div>
<h4 style="margin-top:0"><strong>Dates</strong></h4>
<div class="dt-field"><label class="dt-label">Output format</label><div class="dt-select">YYYY-MM-DD (ISO)</div></div>
<div class="dt-field">
<label class="dt-label">Ambiguous input order (e.g. 01/02/2024) <span class="dt-count-pill warn" style="margin-left:4px">changed</span></label>
<div class="dt-radio-row">
<span class="dt-radio on"><span class="dot"></span> MDY (US)</span>
<span class="dt-radio"><span class="dot"></span> DMY (EU)</span>
</div>
<div class="dt-help-text">Winning value: <strong>MDY</strong>. Overrides the European base (DMY) — <code>01/02/2024</code> reads as <strong>2024-01-02</strong>.</div>
</div>
<h4><strong>Phones</strong></h4>
<div class="dt-field"><label class="dt-label" title="E.164">Output format</label><div class="dt-select" title="E.164">Standard international format (+15551234567)</div></div>
<div class="dt-field">
<label class="dt-label">Default region (ISO-2)</label>
<div class="dt-input">US</div>
<div class="dt-help-text">Region used when the input has no country code. US, GB, DE, etc.</div>
</div>
</div>
<!-- Right column: Currency + Names + Booleans -->
<div>
<h4 style="margin-top:0"><strong>Currency</strong></h4>
<div class="dt-field">
<label class="dt-label">Decimal separator in input <span class="dt-count-pill warn" style="margin-left:4px">changed</span></label>
<div class="dt-radio-row">
<span class="dt-radio on"><span class="dot"></span> dot (1,234.56)</span>
<span class="dt-radio"><span class="dot"></span> comma (1.234,56)</span>
</div>
<div class="dt-help-text">Winning value: <strong>dot</strong>. Overrides the European base (comma) — <code>$1,234.5</code> reads as <strong>1234.50</strong>.</div>
</div>
<div class="dt-field" style="max-width:200px"><label class="dt-label">Round to decimals</label><div class="dt-input">2</div></div>
<div class="dt-check"><span class="box"></span> Preserve original precision (don't round)</div>
<div class="dt-check"><span class="box"></span> Preserve currency code (emit <code>USD 1234.56</code>, <code>EUR 99.00</code>, etc.)</div>
<h4><strong>Names</strong></h4>
<div class="dt-field"><label class="dt-label">Casing</label><div class="dt-select">Title Case</div></div>
<h4><strong>Booleans</strong></h4>
<div class="dt-field"><label class="dt-label">Output style</label><div class="dt-select">True/False</div></div>
</div>
</div>
</div>
</details>
<hr class="dt-divider">
<button class="dt-btn dt-btn-primary dt-btn-block">Standardize Formats</button>
<hr class="dt-divider">
<!-- Results -->
<h2>Results</h2>
<div class="dt-metrics">
<div class="dt-metric"><div class="label">Cells scanned</div><div class="value">92,210</div></div>
<div class="dt-metric"><div class="label">Cells changed</div><div class="value">61,838</div></div>
<div class="dt-metric"><div class="label">% changed</div><div class="value">67.1%</div></div>
<div class="dt-metric"><div class="label">Unparseable</div><div class="value">47</div></div>
</div>
<div class="dt-alert info">
<span class="dt-mi">info</span>
<span>47 cell(s) in typed columns didn't match a recognizable shape and were left as-is. See <strong>Unparseable cells</strong> below to review them, or re-classify the column to <strong>(skip)</strong>. (They aren't in the changes audit — nothing was changed.)</span>
</div>
<!-- Unparseable cells surface (the alert points here; these are left-as-is, so they never appear in the CHANGES audit) -->
<details class="dt-expander">
<summary>Unparseable cells (47)</summary>
<div class="dt-expander-body">
<p class="dt-caption">Cells in typed columns that didn't match a recognizable shape and were left unchanged.</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>row</th><th>column</th><th>field_type</th><th>value (left as-is)</th></tr></thead>
<tbody>
<tr><td>318</td><td>signup_date</td><td>date</td><td class="dt-cell-flag">soon</td></tr>
<tr><td>902</td><td>phone</td><td>phone</td><td class="dt-cell-flag">ext. 4471</td></tr>
<tr><td>1,544</td><td>amount</td><td>currency</td><td class="dt-cell-flag">TBD</td></tr>
<tr><td>2,087</td><td>active</td><td>boolean</td><td class="dt-cell-flag">maybe</td></tr>
<tr><td>3,610</td><td>signup_date</td><td>date</td><td class="dt-cell-flag">00/00/0000</td></tr>
</tbody>
</table>
</div>
<p class="dt-caption" style="margin-top:8px">… and 42 more.</p>
</div>
</details>
<!-- Changes by column -->
<p style="margin-bottom:6px"><strong>Changes by column</strong></p>
<div class="dt-table-wrap" style="max-width:520px">
<table class="dt-table">
<thead><tr><th>column</th><th>field_type</th><th>cells_changed</th></tr></thead>
<tbody>
<tr><td>amount</td><td>currency</td><td>17,902</td></tr>
<tr><td>full_name</td><td>name</td><td>16,041</td></tr>
<tr><td>phone</td><td>phone</td><td>14,388</td></tr>
<tr><td>signup_date</td><td>date</td><td>11,205</td></tr>
<tr><td>active</td><td>boolean</td><td>2,302</td></tr>
</tbody>
</table>
</div>
<!-- Examples (first 25 changes) -->
<p style="margin:14px 0 6px"><strong>Examples (first 25 changes)</strong></p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>row</th><th>column</th><th>field_type</th><th>before</th><th>after</th></tr></thead>
<tbody>
<tr><td>1</td><td>full_name</td><td>name</td><td class="dt-cell-del">jane DOE</td><td class="dt-cell-add">Jane Doe</td></tr>
<tr><td>1</td><td>phone</td><td>phone</td><td class="dt-cell-del">(512) 555-0190</td><td class="dt-cell-add">+15125550190</td></tr>
<tr><td>1</td><td>amount</td><td>currency</td><td class="dt-cell-del">$1,234.5</td><td class="dt-cell-add">1234.50</td></tr>
<tr><td>1</td><td>signup_date</td><td>date</td><td class="dt-cell-del">01/04/2024</td><td class="dt-cell-add">2024-01-04</td></tr>
<tr><td>1</td><td>active</td><td>boolean</td><td class="dt-cell-del">Y</td><td class="dt-cell-add">True</td></tr>
<tr><td>2</td><td>full_name</td><td>name</td><td class="dt-cell-del">bob smith</td><td class="dt-cell-add">Bob Smith</td></tr>
<tr><td>2</td><td>phone</td><td>phone</td><td class="dt-cell-del">720.555.7781</td><td class="dt-cell-add">+17205557781</td></tr>
<tr><td>2</td><td>signup_date</td><td>date</td><td class="dt-cell-del">2024-2-11</td><td class="dt-cell-add">2024-02-11</td></tr>
<tr><td>3</td><td>signup_date</td><td>date</td><td class="dt-cell-del">Mar 3, 2024</td><td class="dt-cell-add">2024-03-03</td></tr>
<tr><td>4</td><td>amount</td><td>currency</td><td class="dt-cell-del">$7.999</td><td class="dt-cell-add">8.00</td></tr>
</tbody>
</table>
</div>
<!-- Standardized preview -->
<p style="margin:14px 0 6px"><strong>Standardized preview (first 10 rows)</strong></p>
<p class="dt-caption" style="margin:0 0 6px">Showing 5 of 6 columns — <code>notes</code> is set to (skip), so it's omitted here.</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th class="idx"></th><th>full_name</th><th>phone</th><th>amount</th><th>signup_date</th><th>active</th></tr></thead>
<tbody>
<tr><td class="idx">0</td><td>Jane Doe</td><td>+15125550190</td><td>1234.50</td><td>2024-01-04</td><td>True</td></tr>
<tr><td class="idx">1</td><td>Bob Smith</td><td>+17205557781</td><td>99.00</td><td>2024-02-11</td><td>True</td></tr>
<tr><td class="idx">2</td><td>Alicia Reyes</td><td>+14155552233</td><td>45000.00</td><td>2024-03-03</td><td>False</td></tr>
<tr><td class="idx">3</td><td>M. Okafor</td><td>+12125550148</td><td>8.00</td><td>2024-04-22</td><td>True</td></tr>
</tbody>
</table>
</div>
<hr class="dt-divider">
<!-- Downloads (3 columns) -->
<div class="dt-cols-3">
<button class="dt-btn dt-btn-primary">Download standardized CSV</button>
<button class="dt-btn">Download changes audit</button>
<button class="dt-btn">Download config JSON</button>
</div>
<!-- Next-step suggestion -->
<div class="dt-next-step"><span class="dt-mi">arrow_forward</span><span>Formats standardized. Next, most files need: <a href="04_missing_handler.html">Fix Missing Values →</a></span><button class="dt-next-step-dismiss" title="Dismiss"></button></div>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>
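The preset precedence shown in the Standardize Formats mock (individual controls win over the preset, and any divergence relabels the preset as Custom) could be sketched roughly as follows. This is a minimal illustration, not the app's implementation: `PRESETS`, `resolve_options`, and the option keys `date_order` and `decimal_separator` are hypothetical names chosen for the example.

```python
# Hypothetical preset table; the keys are illustrative, not the app's real config schema.
PRESETS = {
    "US": {"date_order": "MDY", "decimal_separator": "dot"},
    "European": {"date_order": "DMY", "decimal_separator": "comma"},
}


def resolve_options(base, controls):
    """Merge individual control values over a base preset.

    Returns (effective options, preset label, names of changed controls).
    The label becomes "Custom" as soon as any control differs from the base.
    """
    base_values = PRESETS[base]
    # Controls are applied last, so they win over the preset's values.
    effective = {**base_values, **controls}
    changed = sorted(k for k, v in effective.items() if base_values.get(k) != v)
    label = "Custom" if changed else base
    return effective, label, changed


# The demo state from the mock: start from European, then change two controls.
effective, label, changed = resolve_options(
    "European", {"date_order": "MDY", "decimal_separator": "dot"}
)
print(label, changed)
```

Under these assumptions, the values in `effective` are what a run would use, and the `changed` list is what would drive the "changed" pills next to each overridden control.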

@@ -0,0 +1,263 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — Fix Missing Values</title>
<link rel="stylesheet" href="app.css">
</head>
<body data-page="04_missing_handler">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of <strong>Fix Missing Values</strong>, shown with a file imported and a completed run (per-column missingness profile + before/after results). <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Tool header -->
<div class="dt-tool-header">
<h1>Fix Missing Values</h1>
<div class="dt-tool-header-actions">
<span class="dt-privacy-pill">
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor">
<rect x="4" y="11" width="16" height="10" rx="2"/>
<path d="M8 11V7a4 4 0 018 0v4"/>
</svg>
Runs 100% locally
</span>
<button class="dt-help-btn"><span class="dt-mi">help_outline</span> Help</button>
</div>
</div>
<p class="dt-tool-caption">Find blank cells (even hidden ones) and fill them in or remove them.</p>
<div class="dt-spacer"></div>
<!-- File pickup banner (using file from upload screen) -->
<div class="dt-alert info">
<span class="dt-mi">description</span>
<span>Using <strong>survey_responses.csv</strong> from the upload screen.</span>
</div>
<button class="dt-btn" style="margin-bottom:4px">Use a different file</button>
<!-- Preview expander (collapsed after a result exists) -->
<details class="dt-expander">
<summary>Preview: survey_responses.csv</summary>
<div class="dt-expander-body">
<p class="dt-caption">2,150 rows, 6 columns</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th class="idx"></th><th>respondent_id</th><th>age</th><th>region</th><th>income</th><th>satisfaction</th><th>comments</th></tr></thead>
<tbody>
<tr><td class="idx">0</td><td>R-1001</td><td>34</td><td>West</td><td>52000</td><td>4</td><td>great service</td></tr>
<tr><td class="idx">1</td><td>R-1002</td><td class="dt-cell-flag">N/A</td><td>East</td><td class="dt-cell-flag"></td><td>3</td><td class="dt-cell-flag">?</td></tr>
<tr><td class="idx">2</td><td>R-1003</td><td>41</td><td class="dt-cell-flag">-</td><td>61000</td><td class="dt-cell-flag">NULL</td><td>none</td></tr>
<tr><td class="idx">3</td><td>R-1004</td><td>29</td><td>South</td><td class="dt-cell-flag">N/A</td><td>5</td><td>quick</td></tr>
</tbody>
</table>
</div>
</div>
</details>
<hr class="dt-divider">
<!-- Missingness profile — always visible: see the damage before configuring -->
<h2>Missingness profile</h2>
<div class="dt-metrics">
<div class="dt-metric"><div class="label">Rows</div><div class="value">2,150</div></div>
<div class="dt-metric"><div class="label">Cells missing</div><div class="value">1,043</div></div>
<div class="dt-metric"><div class="label">% cells missing</div><div class="value">8.1%</div></div>
<div class="dt-metric"><div class="label">Complete rows</div><div class="value">1,388</div></div>
</div>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>column</th><th>dtype</th><th>missing</th><th>missing_pct</th><th>disguised</th><th>has_missing</th></tr></thead>
<tbody>
<tr><td>respondent_id</td><td>object</td><td>0</td><td>0.0%</td><td>0</td><td>False</td></tr>
<tr><td>age</td><td>float64</td><td>187</td><td>8.7%</td><td>61</td><td>True</td></tr>
<tr><td>region</td><td>object</td><td>142</td><td>6.6%</td><td>142</td><td>True</td></tr>
<tr><td>income</td><td>float64</td><td>329</td><td>15.3%</td><td>118</td><td>True</td></tr>
<tr><td>satisfaction</td><td>float64</td><td>95</td><td>4.4%</td><td>40</td><td>True</td></tr>
<tr><td>comments</td><td>object</td><td>290</td><td>13.5%</td><td>290</td><td>True</td></tr>
</tbody>
</table>
</div>
<hr class="dt-divider">
<!-- Options expander (Strategy) — configuration follows the diagnostic -->
<details class="dt-expander" open>
<summary>Options</summary>
<div class="dt-expander-body">
<h3>Strategy</h3>
<div class="dt-precedence">
<span class="dt-mi">layers</span>
<span>Resolution order: <strong>per-column override</strong> → <strong>global strategy</strong> → <strong>preset</strong>. The most specific setting wins; layers it overrides are dimmed.</span>
</div>
<div class="dt-field">
<label class="dt-label">Preset</label>
<div class="dt-help-text" style="color:var(--warn);display:flex;align-items:center;gap:5px;margin-bottom:8px"><span class="dt-mi" style="font-family:'Material Symbols Outlined';font-size:15px;line-height:1">info</span> Overridden by <strong>Global strategy → median</strong> (set under Advanced options). Presets apply only when global is &ldquo;(use preset)&rdquo;.</div>
<div class="dt-radio-row is-overridden" style="flex-direction:column;gap:10px">
<span class="dt-radio on"><span class="dot"></span> detect-only (standardize sentinels to NaN, no fill or drop)</span>
<span class="dt-radio"><span class="dot"></span> safe-fill (numeric → median, categorical → mode)</span>
<span class="dt-radio"><span class="dot"></span> drop-incomplete (drop any row with missing)</span>
</div>
<div class="dt-help-text">detect-only: replace 'N/A', '-', 'NULL', etc. with real NaN, then stop. safe-fill: also fill — numeric columns with median, others with mode. drop-incomplete: also drop every row that has any missing cell.</div>
</div>
<!-- Advanced options expander (open — most informative) -->
<details class="dt-expander" open>
<summary>Advanced options</summary>
<div class="dt-expander-body">
<div class="dt-cols-2">
<div>
<h4>Detection</h4>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Standardize disguised nulls to NaN</div>
<div class="dt-field">
<label class="dt-label" title="Sentinel values">Blanks in disguise (N/A, dash, NULL) — comma-separated</label>
<div class="dt-input">N/A, n/a, NA, NULL, null, None, -, --, ?, #N/A</div>
<div class="dt-help-text">Text that really means &ldquo;empty.&rdquo; Matched case-insensitively after stripping whitespace.</div>
</div>
</div>
<div>
<h4>Strategy override</h4>
<div class="dt-field">
<label class="dt-label">Global strategy</label>
<div class="dt-select">median</div>
<div class="dt-help-text">drop_row / drop_col use the thresholds below. mean / median / interpolate are numeric only — non-numeric columns fall back to the categorical strategy.</div>
</div>
<div class="dt-field">
<label class="dt-label">Categorical fallback (for non-numeric columns)</label>
<div class="dt-select">mode</div>
</div>
</div>
</div>
<h4>Drop thresholds</h4>
<div class="dt-cols-2">
<div class="dt-field">
<label class="dt-label">Row drop threshold (drop rows with ≥ this fraction missing across selected cols)</label>
<div class="dt-slider"><div class="track"><div class="fill" style="width:100%"></div><div class="knob" style="left:calc(100% - 8px)"></div></div><div class="val">1.00</div></div>
</div>
<div class="dt-field">
<label class="dt-label">Column drop threshold (drop columns with ≥ this fraction missing)</label>
<div class="dt-slider"><div class="track"><div class="fill" style="width:100%"></div><div class="knob" style="left:calc(100% - 8px)"></div></div><div class="val">1.00</div></div>
</div>
</div>
<h4>Scope</h4>
<div class="dt-field">
<label class="dt-label">Columns to handle (default: all)</label>
<div class="dt-multiselect">
<span class="dt-ms-chip">respondent_id <span class="x"></span></span>
<span class="dt-ms-chip">age <span class="x"></span></span>
<span class="dt-ms-chip">region <span class="x"></span></span>
<span class="dt-ms-chip">income <span class="x"></span></span>
<span class="dt-ms-chip">satisfaction <span class="x"></span></span>
<span class="dt-ms-chip">comments <span class="x"></span></span>
</div>
</div>
<div class="dt-field">
<label class="dt-label">Columns to skip</label>
<div class="dt-multiselect"><span class="dt-ms-placeholder">Choose columns</span></div>
</div>
<h4>Per-column strategy overrides (optional)</h4>
<p class="dt-caption">Set a different strategy for specific columns. Leave a row on <strong>(global)</strong> to use the global strategy.</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>Column</th><th>Override</th><th>Resolves to</th></tr></thead>
<tbody>
<tr><td>age</td><td><span class="dt-select" style="display:inline-block;min-width:160px;padding:4px 24px 4px 10px;color:var(--ink-tertiary)">(global)</span></td><td>median <span style="color:var(--ink-tertiary);font-size:11px">· global</span></td></tr>
<tr><td>region</td><td><span class="dt-select" style="display:inline-block;min-width:160px;padding:4px 24px 4px 10px;color:var(--ink-tertiary)">(global)</span></td><td>mode <span style="color:var(--ink-tertiary);font-size:11px">· global → categorical fallback</span></td></tr>
<tr><td>income</td><td><span class="dt-select" style="display:inline-block;min-width:160px;padding:4px 24px 4px 10px;color:var(--ink-tertiary)">(global)</span></td><td>median <span style="color:var(--ink-tertiary);font-size:11px">· global</span></td></tr>
<tr><td>satisfaction</td><td><span class="dt-select" style="display:inline-block;min-width:160px;padding:4px 24px 4px 10px;color:var(--ink-tertiary)">(global)</span></td><td>median <span style="color:var(--ink-tertiary);font-size:11px">· global</span></td></tr>
<tr><td>comments</td><td><span class="dt-select" style="display:inline-block;min-width:160px;padding:4px 24px 4px 10px">constant</span></td><td><strong>constant</strong> <span style="color:var(--ink-tertiary);font-size:11px">· this column</span></td></tr>
</tbody>
</table>
</div>
</div>
</details>
</div>
</details>
<hr class="dt-divider">
<button class="dt-btn dt-btn-primary dt-btn-block">Handle Missing Values</button>
<hr class="dt-divider">
<!-- Results -->
<div id="missing-results-anchor"></div>
<h2>Results</h2>
<div class="dt-metrics">
<div class="dt-metric"><div class="label">Sentinels → NaN</div><div class="value">651</div></div>
<div class="dt-metric"><div class="label">Cells filled</div><div class="value">1,043</div></div>
<div class="dt-metric"><div class="label">Rows dropped</div><div class="value">0</div></div>
<div class="dt-metric"><div class="label">Columns dropped</div><div class="value">0</div></div>
</div>
<p><strong>Missingness — before vs. after</strong></p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>column</th><th>before_missing</th><th>before_pct</th><th>after_missing</th><th>after_pct</th><th>strategy</th></tr></thead>
<tbody>
<tr><td>respondent_id</td><td>0</td><td>0.0</td><td>0</td><td>0.0</td><td class="dt-cell-flag"></td></tr>
<tr><td>age</td><td class="dt-cell-flag">187</td><td>8.7</td><td class="dt-cell-add">0</td><td class="dt-cell-add">0.0</td><td>median</td></tr>
<tr><td>region</td><td class="dt-cell-flag">142</td><td>6.6</td><td class="dt-cell-add">0</td><td class="dt-cell-add">0.0</td><td>mode</td></tr>
<tr><td>income</td><td class="dt-cell-flag">329</td><td>15.3</td><td class="dt-cell-add">0</td><td class="dt-cell-add">0.0</td><td>median</td></tr>
<tr><td>satisfaction</td><td class="dt-cell-flag">95</td><td>4.4</td><td class="dt-cell-add">0</td><td class="dt-cell-add">0.0</td><td>median</td></tr>
<tr><td>comments</td><td class="dt-cell-flag">290</td><td>13.5</td><td class="dt-cell-add">0</td><td class="dt-cell-add">0.0</td><td>constant</td></tr>
</tbody>
</table>
</div>
<p><strong>Audit (first 50 changes)</strong></p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>row</th><th>column</th><th>old_value</th><th>new_value</th><th>reason</th></tr></thead>
<tbody>
<tr><td>2</td><td>age</td><td class="dt-cell-flag">N/A</td><td class="dt-cell-add">37.0</td><td>fill: median</td></tr>
<tr><td>2</td><td>income</td><td class="dt-cell-flag">(blank)</td><td class="dt-cell-add">54000.0</td><td>fill: median</td></tr>
<tr><td>2</td><td>comments</td><td class="dt-cell-flag">?</td><td class="dt-cell-add">(no comment)</td><td>fill: constant</td></tr>
<tr><td>3</td><td>region</td><td class="dt-cell-flag">-</td><td class="dt-cell-add">West</td><td>fill: mode</td></tr>
<tr><td>3</td><td>satisfaction</td><td class="dt-cell-flag">NULL</td><td class="dt-cell-add">4.0</td><td>fill: median</td></tr>
<tr><td>4</td><td>income</td><td class="dt-cell-flag">N/A</td><td class="dt-cell-add">54000.0</td><td>fill: median</td></tr>
</tbody>
</table>
</div>
<p class="dt-caption">… and 1,037 more (download the full audit below).</p>
<p><strong>Handled preview (first 10 rows)</strong></p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th class="idx"></th><th>respondent_id</th><th>age</th><th>region</th><th>income</th><th>satisfaction</th><th>comments</th></tr></thead>
<tbody>
<tr><td class="idx">0</td><td>R-1001</td><td>34.0</td><td>West</td><td>52000.0</td><td>4.0</td><td>great service</td></tr>
<tr><td class="idx">1</td><td>R-1002</td><td class="dt-cell-add">37.0</td><td>East</td><td class="dt-cell-add">54000.0</td><td>3.0</td><td class="dt-cell-add">(no comment)</td></tr>
<tr><td class="idx">2</td><td>R-1003</td><td>41.0</td><td class="dt-cell-add">West</td><td>61000.0</td><td class="dt-cell-add">4.0</td><td>none</td></tr>
<tr><td class="idx">3</td><td>R-1004</td><td>29.0</td><td>South</td><td class="dt-cell-add">54000.0</td><td>5.0</td><td>quick</td></tr>
</tbody>
</table>
</div>
<hr class="dt-divider">
<!-- Downloads (html_download_button anchors) -->
<div class="dt-cols-3">
<button class="dt-btn dt-btn-primary">Download handled CSV</button>
<button class="dt-btn">Download changes audit</button>
<button class="dt-btn">Download config JSON</button>
</div>
<div class="dt-next-step"><span class="dt-mi">arrow_forward</span><span>Missing values handled. Next, most files need: <a href="01_deduplicator.html">Find Duplicates →</a></span><button class="dt-next-step-dismiss" title="Dismiss"></button></div>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>
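The Fix Missing Values mock describes three behaviours: sentinel text is matched case-insensitively after stripping whitespace, the most specific strategy wins (per-column override, then global strategy, then preset), and numeric-only strategies fall back to the categorical strategy on non-numeric columns. A minimal sketch of that logic, using only the standard library, might look like this. All function names and signatures here (`is_missing`, `resolve_strategy`, `fill_column`) are hypothetical, and the sentinel set is the mock's list lowercased, not a confirmed default.

```python
from statistics import median, mode

# Lowercased form of the sentinel list shown in the mock (assumed, not confirmed).
SENTINELS = {"n/a", "na", "null", "none", "-", "--", "?", "#n/a"}
NUMERIC_ONLY = {"mean", "median", "interpolate"}


def is_missing(value):
    """True for None, blank strings, and sentinel text (case-insensitive, stripped)."""
    if value is None:
        return True
    text = str(value).strip().lower()
    return text == "" or text in SENTINELS


def resolve_strategy(column, is_numeric, overrides, global_strategy,
                     preset_strategy, categorical_fallback="mode"):
    """Pick the most specific strategy: per-column, then global, then preset."""
    strategy = overrides.get(column) or global_strategy or preset_strategy
    # Numeric-only strategies cannot apply to text columns, so fall back.
    if strategy in NUMERIC_ONLY and not is_numeric:
        return categorical_fallback
    return strategy


def fill_column(values, strategy):
    """Fill missing cells using the median or mode of the present cells."""
    present = [v for v in values if not is_missing(v)]
    if strategy == "median":
        fill = median(float(v) for v in present)
        return [fill if is_missing(v) else float(v) for v in values]
    if strategy == "mode":
        fill = mode(present)
        return [fill if is_missing(v) else v for v in values]
    # Any other strategy is out of scope for this sketch: return a copy unchanged.
    return list(values)


ages = ["34", "N/A", "41", "29"]
strategy = resolve_strategy("age", True, {}, "median", None)
print(fill_column(ages, strategy))
```

In this sketch a `global_strategy` of `None` stands in for the "(use preset)" choice, so the preset only takes effect when neither a per-column override nor a global strategy is set.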

@@ -0,0 +1,221 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — Map Columns</title>
<link rel="stylesheet" href="app.css">
</head>
<body data-page="05_column_mapper">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of <strong>Map Columns</strong>, shown with a file imported, an interactive target schema + mapping configured, and a completed run (results + mapped preview). <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Tool header -->
<div class="dt-tool-header">
<h1>Map Columns</h1>
<div class="dt-tool-header-actions">
<span class="dt-privacy-pill">
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor">
<rect x="4" y="11" width="16" height="10" rx="2"/>
<path d="M8 11V7a4 4 0 018 0v4"/>
</svg>
Runs 100% locally
</span>
<button class="dt-help-btn"><span class="dt-mi">help_outline</span> Help</button>
</div>
</div>
<p class="dt-tool-caption">Rename columns, change their order, and set each one as text, number, or date.</p>
<div class="dt-spacer"></div>
<!-- File pickup banner (using file from upload screen) -->
<div class="dt-alert info">
<span class="dt-mi">description</span>
<span>Using <strong>crm_contacts_raw.csv</strong> from the upload screen.</span>
</div>
<button class="dt-btn" style="margin-bottom:4px">Use a different file</button>
<!-- Preview expander (collapsed after a result exists) -->
<details class="dt-expander">
<summary>Preview: crm_contacts_raw.csv</summary>
<div class="dt-expander-body">
<p class="dt-caption">4,210 rows, 6 columns</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th class="idx"></th><th>Full Name</th><th>EmailAddr</th><th>Phone #</th><th>Signup</th><th>Amount Spent</th><th>Notes</th></tr></thead>
<tbody>
<tr><td class="idx">0</td><td>Jane Doe</td><td>jane@acme.io</td><td>512-555-0190</td><td>01/04/2024</td><td>$1,204.50</td><td>VIP</td></tr>
<tr><td class="idx">1</td><td>Bob Smith</td><td>bob@globex.com</td><td>720-555-7781</td><td>02/11/2024</td><td>$88.00</td><td></td></tr>
<tr><td class="idx">2</td><td>Carla Reyes</td><td>carla@initech.net</td><td>415-555-3322</td><td>03/02/2024</td><td>$612.10</td><td>renewal</td></tr>
<tr><td class="idx">3</td><td>Dev Patel</td><td>dev@umbrella.co</td><td>206-555-9043</td><td>03/19/2024</td><td>$0.00</td><td></td></tr>
</tbody>
</table>
</div>
</div>
</details>
<hr class="dt-divider">
<!-- Options expander (open — heart of the tool) -->
<details class="dt-expander" open>
<summary>Options</summary>
<div class="dt-expander-body">
<!-- ===== Target schema ===== -->
<h3 style="margin-top:0">Target schema</h3>
<div class="dt-field">
<label class="dt-label">How would you like to define the target schema?</label>
<div class="dt-radio-row" style="flex-direction:column; gap:8px">
<span class="dt-radio on"><span class="dot"></span> Build interactively (start from current columns)</span>
<span class="dt-radio"><span class="dot"></span> Import schema JSON</span>
<span class="dt-radio"><span class="dot"></span> Skip (rename / convert types only — no schema)</span>
</div>
<div class="dt-help-text">An interactive build is fastest for one-off cleanup. Import a JSON file when you have a fixed contract (a CRM import format or a database schema). Skip when you only want to rename specific columns or convert their types.</div>
</div>
<p class="dt-caption">Edit the table to define your target schema. Add rows for fields the input doesn't have yet (with a default), or remove rows for columns you want to drop.</p>
<!-- Schema editor (st.data_editor, num_rows=dynamic) -->
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>Target name</th><th>Type</th><th>Required</th><th>Default (for added cols)</th><th>Aliases (comma-sep, helps fuzzy-match)</th></tr></thead>
<tbody>
<tr><td>full_name</td><td>string</td><td></td><td></td><td>Full Name, name</td></tr>
<tr><td>email</td><td>string</td><td></td><td></td><td>EmailAddr, email_address</td></tr>
<tr><td>phone</td><td>string</td><td></td><td></td><td>Phone #, tel</td></tr>
<tr><td>signup_date</td><td>date</td><td></td><td></td><td>Signup</td></tr>
<tr><td>amount_spent</td><td>float</td><td></td><td>0.0</td><td>Amount Spent</td></tr>
<tr><td>source</td><td>string</td><td></td><td>crm-import</td><td></td></tr>
<tr><td style="color:var(--ink-tertiary)"><span class="dt-mi" style="font-size:16px;vertical-align:-3px">add</span> add row</td><td></td><td></td><td></td><td></td></tr>
</tbody>
</table>
</div>
<p class="dt-caption">6 target fields · 1 added field (<code>source</code>) not present in the input.</p>
<hr class="dt-divider">
<!-- ===== Mapping ===== -->
<!-- Mapping follows the schema directly: define the schema, then map sources onto it. -->
<h3>Mapping</h3>
<!-- schema is set → source→target selectbox editor with auto-suggested flag -->
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>Source</th><th>Target</th><th>Auto-suggested</th></tr></thead>
<tbody>
<tr><td>Full Name</td><td>full_name</td><td></td></tr>
<tr><td>EmailAddr</td><td>email</td><td></td></tr>
<tr><td>Phone #</td><td>phone</td><td></td></tr>
<tr><td>Signup</td><td>signup_date</td><td></td></tr>
<tr><td>Amount Spent</td><td>amount_spent</td><td></td></tr>
<tr><td>Notes</td><td>(unmapped)</td><td></td></tr>
</tbody>
</table>
</div>
<p class="dt-caption">Pick a target for each source column. <code>Notes</code> stays unmapped — with the keep-extras strategy it is kept as-is. <code>source</code> is added from the schema default.</p>
<hr class="dt-divider">
<!-- ===== Strategy ===== -->
<!-- Strategy is a modifier on the mapping above (strictness: keep/drop extras, coerce, reorder), so it comes after the user can see what it acts on. -->
<h3>Strategy</h3>
<div class="dt-field">
<label class="dt-label">Preset</label>
<div class="dt-radio-row" style="flex-direction:column; gap:8px">
<span class="dt-radio"><span class="dot"></span> rename-only (just rename, leave types alone, keep extras)</span>
<span class="dt-radio"><span class="dot"></span> lenient-schema (rename + convert types + reorder, keep extras)</span>
<span class="dt-radio"><span class="dot"></span> strict-schema (rename + convert types + reorder, drop extras) <span class="dt-count-pill info" style="margin-left:4px">base</span></span>
<span class="dt-radio on"><span class="dot"></span> Custom — based on <strong>strict-schema</strong>, 1 control changed <span class="dt-count-pill warn" style="margin-left:4px">modified</span></span>
</div>
<div class="dt-precedence" style="margin-top:10px">
<span class="dt-mi">rule</span>
<span>Individual Advanced controls win over the preset. You started from <strong>strict-schema</strong>, then changed <strong>Unmapped source columns</strong> to <strong>keep</strong> below — so the preset is now <strong>Custom</strong>. The controls' current values are what actually run.</span>
</div>
<div class="dt-help-text">Pick a strategy as the baseline. Every Advanced toggle below is still individually overridable; overriding any one switches the preset to Custom.</div>
</div>
<!-- Advanced options expander -->
<details class="dt-expander" open>
<summary>Advanced options</summary>
<div class="dt-expander-body">
<div class="dt-cols-2">
<div>
<div class="dt-field">
<label class="dt-label">Unmapped source columns <span class="dt-count-pill warn" style="margin-left:4px">changed</span></label>
<div class="dt-select">keep</div>
<div class="dt-help-text">Winning value: <strong>keep</strong>. Overrides the strict-schema base (drop) — so <code>Notes</code> survives into the output.</div>
</div>
<div class="dt-check on" title="coerce types per schema"><span class="box"><span class="dt-mi">check</span></span> Convert each column to the right type</div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Reorder to schema order</div>
</div>
<div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Auto-infer mapping (fuzzy match)</div>
<div class="dt-field">
<label class="dt-label">Fuzzy match threshold</label>
<div class="dt-slider"><div class="track"><div class="fill" style="width:80%"></div><div class="knob" style="left:80%"></div></div><div class="val">0.80</div></div>
</div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Enforce required fields</div>
</div>
</div>
</div>
</details>
</div>
</details>
<hr class="dt-divider">
<button class="dt-btn dt-btn-primary dt-btn-block">Apply Column Mapping</button>
<hr class="dt-divider">
<!-- ===== Results ===== -->
<div id="colmap-results-anchor" style="height:1px"></div>
<h2>Results</h2>
<div class="dt-metrics">
<div class="dt-metric"><div class="label">Renamed</div><div class="value">5</div></div>
<div class="dt-metric"><div class="label">Dropped</div><div class="value">0</div></div>
<div class="dt-metric"><div class="label">Added</div><div class="value">1</div></div>
<div class="dt-metric"><div class="label">Coerce fails</div><div class="value">3</div></div>
</div>
<div class="dt-alert info"><span class="dt-mi">info</span><span>Added (with defaults): <code>source</code></span></div>
<div class="dt-alert warn"><span class="dt-mi">warning</span><span>Some cells could not be coerced and were left as NaN: amount_spent (3)</span></div>
<p><strong>Mapped preview (first 10 rows)</strong></p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th class="idx"></th><th>full_name</th><th>email</th><th>phone</th><th>signup_date</th><th>amount_spent</th><th class="dt-cell-add">source</th><th>Notes</th></tr></thead>
<tbody>
<tr><td class="idx">0</td><td>Jane Doe</td><td>jane@acme.io</td><td>512-555-0190</td><td>2024-01-04</td><td>1204.5</td><td>crm-import</td><td>VIP</td></tr>
<tr><td class="idx">1</td><td>Bob Smith</td><td>bob@globex.com</td><td>720-555-7781</td><td>2024-02-11</td><td>88.0</td><td>crm-import</td><td></td></tr>
<tr><td class="idx">2</td><td>Carla Reyes</td><td>carla@initech.net</td><td>415-555-3322</td><td>2024-03-02</td><td>612.1</td><td>crm-import</td><td>renewal</td></tr>
<tr><td class="idx">3</td><td>Dev Patel</td><td>dev@umbrella.co</td><td>206-555-9043</td><td>2024-03-19</td><td>0.0</td><td>crm-import</td><td></td></tr>
<tr><td class="idx">4</td><td>Mei Lin</td><td>mei@hooli.com</td><td>503-555-1188</td><td>2024-04-07</td><td class="dt-cell-flag">NaN</td><td>crm-import</td><td>trial</td></tr>
</tbody>
</table>
</div>
<hr class="dt-divider">
<!-- Downloads (3 columns) -->
<div class="dt-cols-3">
<button class="dt-btn dt-btn-primary">Download mapped CSV</button>
<button class="dt-btn">Download mapping audit</button>
<button class="dt-btn">Download config JSON</button>
</div>
<!-- Next-step suggestion -->
<div class="dt-next-step"><span class="dt-mi">arrow_forward</span><span>Columns mapped. <a href="home.html">Run the recommended clean →</a></span><button class="dt-next-step-dismiss" title="Dismiss"></button></div>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>
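
The Column Mapper preview above pairs each source column with a target field using the schema's aliases and a fuzzy match threshold of 0.80. A minimal sketch of how such auto-suggestion might work, assuming a similarity score from Python's standard `difflib.SequenceMatcher`. The app's real matcher is not shown in this diff, and `normalize`, `suggest_target`, and `SCHEMA` are hypothetical names used only for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical schema, transcribed from the mock: target name -> aliases.
SCHEMA = {
    "full_name": ["Full Name", "name"],
    "email": ["EmailAddr", "email_address"],
    "phone": ["Phone #", "tel"],
    "signup_date": ["Signup"],
    "amount_spent": ["Amount Spent"],
    "source": [],
}


def normalize(name: str) -> str:
    """Lowercase and keep only letters and digits, so 'Phone #' becomes 'phone'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def suggest_target(source: str, schema: dict, threshold: float = 0.80):
    """Return the best-scoring target for a source column, or None if nothing
    reaches the threshold (the column then stays unmapped)."""
    src = normalize(source)
    best_target, best_score = None, 0.0
    for target, aliases in schema.items():
        # Compare against the target name itself and each of its aliases.
        for candidate in [target, *aliases]:
            score = SequenceMatcher(None, src, normalize(candidate)).ratio()
            if score > best_score:
                best_target, best_score = target, score
    return best_target if best_score >= threshold else None


for column in ["Full Name", "EmailAddr", "Phone #", "Signup", "Amount Spent", "Notes"]:
    print(column, "->", suggest_target(column, SCHEMA) or "(unmapped)")
```

With this sample schema, `EmailAddr` resolves to `email` through its alias, while `Notes` scores below the threshold against every candidate and stays unmapped, matching the mapping table in the mock.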


@@ -0,0 +1,55 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — Find Unusual Values</title>
<link rel="stylesheet" href="app.css">
</head>
<body data-page="06_outlier_detector">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of <strong>Find Unusual Values</strong> — a <strong>Coming&nbsp;Soon</strong> tool. The page is a stub: a "coming soon" notice, a plain-English list of what the tool will do, and a single "Notify me" action. <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Tool header -->
<div class="dt-tool-header">
<h1>Find Unusual Values</h1>
<button class="dt-help-btn"><span class="dt-mi">help_outline</span> Help</button>
</div>
<p class="dt-tool-caption">Spot values that look wrong — way too high, too low, or breaking your rules.</p>
<div class="dt-spacer"></div>
<!-- Coming-soon notice (st.info) -->
<div class="dt-alert info">
<span class="dt-mi">info</span>
<span>This tool is coming soon.</span>
</div>
<!-- What it will do (st.markdown) -->
<p><strong>What it will do:</strong></p>
<ul>
<li>Find values that are unusually high or low for a column</li>
<li>Spot values that break the rules you set (out of range, wrong type)</li>
<li>Choose how sensitive the check is</li>
<li>Flag unusual rows by adding a column, without changing your data</li>
<li>Cap extreme values at a limit you choose</li>
<li>See a summary of how many values were flagged</li>
</ul>
<hr class="dt-divider">
<button class="dt-btn dt-btn-primary"><span class="dt-mi">notifications</span> Notify me when this ships</button>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>


@@ -0,0 +1,55 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — Combine Files</title>
<link rel="stylesheet" href="app.css">
</head>
<body data-page="07_multi_file_merger">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of <strong>Combine Files</strong> — a <strong>Coming&nbsp;Soon</strong> tool. The page is a stub: a "coming soon" notice, a plain-English list of what the tool will do, and a single "Notify me" action. <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Tool header -->
<div class="dt-tool-header">
<h1>Combine Files</h1>
<button class="dt-help-btn"><span class="dt-mi">help_outline</span> Help</button>
</div>
<p class="dt-tool-caption">Combine several CSV or Excel files into one — even if columns differ.</p>
<div class="dt-spacer"></div>
<!-- Coming-soon notice (st.info) -->
<div class="dt-alert info">
<span class="dt-mi">info</span>
<span>This tool is coming soon.</span>
</div>
<!-- What it will do (st.markdown) -->
<p><strong>What it will do:</strong></p>
<ul>
<li>Import several CSV or Excel files at once</li>
<li>Line up columns automatically by matching their names</li>
<li>Stack files on top of each other into one long file</li>
<li>Merge files side by side using shared key columns</li>
<li>Handle columns that don't match (fill the gaps with blanks or drop them)</li>
<li>Add a column showing which file each row came from</li>
</ul>
<hr class="dt-divider">
<button class="dt-btn dt-btn-primary"><span class="dt-mi">notifications</span> Notify me when this ships</button>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>


@@ -0,0 +1,55 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — Quality Check</title>
<link rel="stylesheet" href="app.css">
</head>
<body data-page="08_validator_reporter">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of <strong>Quality Check</strong> — a <strong>Coming&nbsp;Soon</strong> tool. The page is a stub: a "coming soon" notice, a plain-English list of what the tool will do, and a single "Notify me" action. <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Tool header -->
<div class="dt-tool-header">
<h1>Quality Check</h1>
<button class="dt-help-btn"><span class="dt-mi">help_outline</span> Help</button>
</div>
<p class="dt-tool-caption">Check your file against rules you set, and export a PDF or Excel report.</p>
<div class="dt-spacer"></div>
<!-- Coming-soon notice (st.info) -->
<div class="dt-alert info">
<span class="dt-mi">info</span>
<span>This tool is coming soon.</span>
</div>
<!-- What it will do (st.markdown) -->
<p><strong>What it will do:</strong></p>
<ul>
<li>Check each column against rules you set (no blanks, no duplicates, matches a pattern, within a range, from a set list)</li>
<li>Check rules across columns (for example, start date is before end date)</li>
<li>Give each column and the whole file a quality score</li>
<li>Export a PDF quality report</li>
<li>Export an Excel report with the problem rows highlighted</li>
<li>Show a summary of what passed, what failed, and how serious each issue is</li>
</ul>
<hr class="dt-divider">
<button class="dt-btn dt-btn-primary"><span class="dt-mi">notifications</span> Notify me when this ships</button>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>


@@ -0,0 +1,373 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — Automated Workflows</title>
<link rel="stylesheet" href="app.css">
</head>
<body data-page="09_pipeline_runner">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of <strong>Automated Workflows</strong> (Pipeline Runner), shown with a file imported, a four-step pipeline configured, and a completed run (results + per-step summary). <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Tool header -->
<div class="dt-tool-header">
<h1>Automated Workflows</h1>
<div class="dt-tool-header-actions">
<span class="dt-privacy-pill">
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor">
<rect x="4" y="11" width="16" height="10" rx="2"/>
<path d="M8 11V7a4 4 0 018 0v4"/>
</svg>
Runs 100% locally
</span>
<button class="dt-help-btn"><span class="dt-mi">help_outline</span> Help</button>
</div>
</div>
<p class="dt-tool-caption">Run several tools in a row — save the steps once, reuse them anytime.</p>
<div class="dt-spacer"></div>
<!-- Upload (file staged) -->
<label class="dt-label">Import CSV or Excel file</label>
<div class="dt-uploader">
<div class="dt-uploader-text">
<span class="hint"><span class="dt-mi" style="vertical-align:-4px">upload_file</span> Drag and drop file here</span>
<span class="sub">Up to 1.5 GB · CSV, TSV, XLSX, XLS · encoding &amp; delimiter auto-detected</span>
</div>
<button class="dt-btn">Browse files</button>
</div>
<div class="dt-file-chip">
<span class="dt-file-icon-chip"><svg viewBox="0 0 24 24" fill="none" stroke="currentColor"><path d="M14 2H6a2 2 0 00-2 2v16a2 2 0 002 2h12a2 2 0 002-2V8z"/><path d="M14 2v6h6"/></svg></span>
<span class="name">customers_export.csv</span>
<span class="size">2.1 MB</span>
<button class="dt-btn dt-btn-tertiary" title="Remove"></button>
</div>
<!-- Preview expander (collapsed once a result exists) -->
<details class="dt-expander">
<summary>Preview: customers_export.csv</summary>
<div class="dt-expander-body">
<p class="dt-caption">18,442 rows, 5 columns</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th class="idx"></th><th>name</th><th>email</th><th>city</th><th>phone</th><th>signup_date</th></tr></thead>
<tbody>
<tr><td class="idx">0</td><td> Jane Doe </td><td>jane@acme.io</td><td>Austin</td><td>512-555-0190</td><td>2024-01-04</td></tr>
<tr><td class="idx">1</td><td>jane doe</td><td>JANE@ACME.IO</td><td>austin</td><td>(512) 555-0190</td><td>01/04/2024</td></tr>
<tr><td class="idx">2</td><td>Bob Smith</td><td>bob@globex.com</td><td>Denver</td><td>720.555.7781</td><td>2024-02-11</td></tr>
<tr><td class="idx">3</td><td>R. Smith</td><td>bob@globex.com</td><td></td><td>720-555-7781</td><td>Feb 11 2024</td></tr>
</tbody>
</table>
</div>
</div>
</details>
<hr class="dt-divider">
<!-- Options: pipeline builder (collapsed once a result exists; opened here to show structure) -->
<details class="dt-expander" open>
<summary>Options</summary>
<div class="dt-expander-body">
<!-- Mode radio. Editing the steps below auto-switches the mode from the
recommended default to "Build interactively" (same precedence-visibility
pattern as Fix Missing Values: the active state is made legible, and the
default it superseded is marked "· modified"). -->
<div class="dt-field">
<label class="dt-label">How would you like to define the pipeline?</label>
<div class="dt-radio-row" style="flex-direction:column;gap:9px">
<span class="dt-radio"><span class="dot"></span> Use the recommended default (Clean Text → Standardize → Fix Missing → Find Duplicates) <span class="dt-count-pill warn" style="margin-left:4px">· modified</span></span>
<span class="dt-radio on"><span class="dot"></span> Build interactively</span>
<span class="dt-radio"><span class="dot"></span> Import a saved pipeline JSON</span>
</div>
</div>
<div class="dt-precedence">
<span class="dt-mi">edit</span>
<span>You started from the recommended default and edited a step, so the mode switched to <strong>Build interactively</strong>. The steps below are now yours to change — pick <strong>recommended default</strong> again to discard your edits and restore the suggested order.</span>
</div>
<p class="dt-caption" style="margin:10px 0">
Add, remove, reorder (drag the row index), enable, or configure each step.
Open a step's <strong>Configure</strong> panel to set its options in plain language.
Tool order is recommended, not enforced — violations surface as warnings below the table.
</p>
<!-- Pipeline editor. Each step row carries an enable toggle + a "Configure"
expander that reveals that tool's OWN controls as the editing surface
(built from .dt-* form classes). Raw per-row JSON has been removed;
JSON survives only as import/export under "Advanced" below. -->
<div class="dt-table-wrap">
<table class="dt-table">
<thead>
<tr>
<th class="idx"></th>
<th>Step</th>
<th style="text-align:center">Enabled</th>
<th style="text-align:right">Configure</th>
</tr>
</thead>
<tbody>
<tr>
<td class="idx">≡ 0</td>
<td><div style="font-weight:500" title="text_clean">Clean Text</div><div class="dt-caption" style="margin:2px 0 0">Trim spaces, collapse repeats, leave case as-is</div></td>
<td><span class="dt-check on" style="margin:0;justify-content:center"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td style="text-align:right;color:var(--ink-tertiary)"><span class="dt-mi" style="font-size:16px;vertical-align:-3px">tune</span> Configure <span class="dt-mi" style="font-size:14px;vertical-align:-2px">expand_more</span></td>
</tr>
</tbody>
</table>
</div>
<!-- text_clean config panel (open to show the per-step editing surface) -->
<details class="dt-expander" open style="margin:6px 0 10px">
<summary>Configure: Clean Text</summary>
<div class="dt-expander-body">
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Trim leading &amp; trailing whitespace</div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Collapse repeated spaces to one</div>
<div class="dt-check"><span class="box"></span> Normalize smart quotes &amp; dashes to plain ASCII</div>
<div class="dt-field">
<label class="dt-label">Letter case</label>
<div class="dt-select">Leave as-is</div>
</div>
</div>
</details>
<div class="dt-table-wrap">
<table class="dt-table">
<tbody>
<tr>
<td class="idx">≡ 1</td>
<td><div style="font-weight:500" title="format_standardize">Standardize Formats</div><div class="dt-caption" style="margin:2px 0 0">Format phone as phone, signup_date as a date</div></td>
<td><span class="dt-check on" style="margin:0;justify-content:center"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td style="text-align:right;color:var(--ink-tertiary)"><span class="dt-mi" style="font-size:16px;vertical-align:-3px">tune</span> Configure <span class="dt-mi" style="font-size:14px;vertical-align:-2px">chevron_right</span></td>
</tr>
</tbody>
</table>
</div>
<!-- format_standardize config panel (collapsed) -->
<details class="dt-expander" style="margin:6px 0 10px">
<summary>Configure: Standardize Formats</summary>
<div class="dt-expander-body">
<p class="dt-caption" style="margin-bottom:8px">Choose a target format for each column. Columns left as &ldquo;Leave as-is&rdquo; are untouched.</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>Column</th><th>Format as</th></tr></thead>
<tbody>
<tr><td>name</td><td><span class="dt-select" style="display:inline-block;min-width:150px;padding:4px 24px 4px 10px;color:var(--ink-tertiary)">Leave as-is</span></td></tr>
<tr><td>email</td><td><span class="dt-select" style="display:inline-block;min-width:150px;padding:4px 24px 4px 10px;color:var(--ink-tertiary)">Leave as-is</span></td></tr>
<tr><td>phone</td><td><span class="dt-select" style="display:inline-block;min-width:150px;padding:4px 24px 4px 10px">Phone number</span></td></tr>
<tr><td>signup_date</td><td><span class="dt-select" style="display:inline-block;min-width:150px;padding:4px 24px 4px 10px">Date</span></td></tr>
</tbody>
</table>
</div>
</div>
</details>
<div class="dt-table-wrap">
<table class="dt-table">
<tbody>
<tr>
<td class="idx">≡ 2</td>
<td><div style="font-weight:500" title="missing">Fix Missing Values</div><div class="dt-caption" style="margin:2px 0 0">Flag blank cells (treat &ldquo;N/A&rdquo; and &ldquo;&mdash;&rdquo; as blank)</div></td>
<td><span class="dt-check on" style="margin:0;justify-content:center"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td style="text-align:right;color:var(--ink-tertiary)"><span class="dt-mi" style="font-size:16px;vertical-align:-3px">tune</span> Configure <span class="dt-mi" style="font-size:14px;vertical-align:-2px">chevron_right</span></td>
</tr>
</tbody>
</table>
</div>
<!-- missing config panel (collapsed) -->
<details class="dt-expander" style="margin:6px 0 10px">
<summary>Configure: Fix Missing Values</summary>
<div class="dt-expander-body">
<div class="dt-field">
<label class="dt-label">What should happen to blank cells?</label>
<div class="dt-radio-row" style="flex-direction:column;gap:8px">
<span class="dt-radio on"><span class="dot"></span> Flag them (mark blanks, change nothing)</span>
<span class="dt-radio"><span class="dot"></span> Fill them in (numbers → median, text → most common)</span>
<span class="dt-radio"><span class="dot"></span> Drop rows that have any blank</span>
</div>
</div>
<div class="dt-field">
<label class="dt-label">Treat these as blank (comma-separated)</label>
<div class="dt-input">N/A, —</div>
<div class="dt-help-text">Matched case-insensitively after stripping whitespace.</div>
</div>
</div>
</details>
<div class="dt-table-wrap">
<table class="dt-table">
<tbody>
<tr>
<td class="idx">≡ 3</td>
<td><div style="font-weight:500" title="dedup">Find Duplicates</div><div class="dt-caption" style="margin:2px 0 0">Match on email &amp; phone; keep the most complete row, merge in missing fields</div></td>
<td><span class="dt-check on" style="margin:0;justify-content:center"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td style="text-align:right;color:var(--ink-tertiary)"><span class="dt-mi" style="font-size:16px;vertical-align:-3px">tune</span> Configure <span class="dt-mi" style="font-size:14px;vertical-align:-2px">chevron_right</span></td>
</tr>
<tr>
<td class="idx" style="color:var(--ink-tertiary)"></td>
<td colspan="3" style="color:var(--ink-tertiary);font-family:var(--font-sans)">Add step</td>
</tr>
</tbody>
</table>
</div>
<!-- dedup config panel (collapsed) -->
<details class="dt-expander" style="margin:6px 0 10px">
<summary>Configure: Find Duplicates</summary>
<div class="dt-expander-body">
<div class="dt-field">
<label class="dt-label">When rows match, which one survives?</label>
<div class="dt-select">Keep the most complete row</div>
<div class="dt-help-text">Other options: keep the first seen, keep the last seen.</div>
</div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Merge matched rows (fill each survivor's blanks from its duplicates)</div>
<div class="dt-field">
<label class="dt-label">Match on these columns</label>
<div class="dt-multiselect">
<span class="dt-ms-chip">email <span class="x"></span></span>
<span class="dt-ms-chip">phone <span class="x"></span></span>
</div>
</div>
</div>
</details>
<!-- Validation: pipeline is in recommended order, so no warning shown (warning block omitted) -->
<!-- Advanced: JSON is import/export only, never the per-step editing surface -->
<details class="dt-expander" style="margin-top:14px">
<summary>Advanced — import / export pipeline as JSON</summary>
<div class="dt-expander-body">
<p class="dt-caption" style="margin-bottom:8px">For sharing or version control. Editing is done in the step panels above — this is just the saved form of the same settings.</p>
<div class="dt-code">{
"version": 1,
"steps": [
{"tool": "text_clean", "enabled": true, "options": {"trim": true, "collapse_whitespace": true}},
{"tool": "format_standardize", "enabled": true, "options": {"column_types": {"phone": "phone", "signup_date": "date"}}},
{"tool": "missing", "enabled": true, "options": {"strategy": "flag", "sentinels": ["N/A", "—"]}},
{"tool": "dedup", "enabled": true, "options": {"survivor_rule": "most_complete", "merge": true, "keys": ["email", "phone"]}}
]
}</div>
<div class="dt-btn-row" style="margin-top:10px">
<button class="dt-btn"><span class="dt-mi">upload</span> Import JSON</button>
<button class="dt-btn"><span class="dt-mi">download</span> Export JSON</button>
</div>
</div>
</details>
<!-- Nested explainer expander -->
<details class="dt-expander" style="margin-top:14px">
<summary>Recommended tool order — why each step belongs where it does</summary>
<div class="dt-expander-body">
<p><strong>text_clean</strong> before <strong>format_standardize</strong>: format parsers (phone / currency / date) fail on input contaminated with smart quotes or padded with NBSP, so clean text first.</p>
<p><strong>text_clean</strong> before <strong>missing</strong>: sentinel detection misses cells padded with NBSP or zero-width characters, so clean text first.</p>
<p><strong>text_clean</strong> before <strong>dedup</strong>: fuzzy matching treats NBSP-padded values as different, so clean text first.</p>
<p><strong>format_standardize</strong> before <strong>missing</strong>: numeric imputation needs numeric dtypes, and canonical phones / currencies improve sentinel detection.</p>
<p><strong>format_standardize</strong> before <strong>dedup</strong>: canonical phones and lowercase emails enable cross-format duplicate matching.</p>
<p style="margin-bottom:0"><strong>missing</strong> before <strong>dedup</strong>: deduping rows with mixed NaN sentinels produces brittle merges, so resolve missing values first.</p>
</div>
</details>
</div>
</details>
<hr class="dt-divider">
<!-- Run -->
<button class="dt-btn dt-btn-primary dt-btn-block">Run Pipeline</button>
<hr class="dt-divider">
<!-- Results -->
<h2>Results</h2>
<div class="dt-metrics">
<div class="dt-metric"><div class="label">Initial rows</div><div class="value">18,442</div></div>
<div class="dt-metric"><div class="label">Final rows</div><div class="value">18,130</div></div>
<div class="dt-metric"><div class="label">Steps run</div><div class="value">4</div></div>
<div class="dt-metric"><div class="label">Elapsed</div><div class="value">1.84 s</div></div>
</div>
<h4>Per-step summary</h4>
<!-- Standalone error column removed: status is one pill per step. A failed step
turns the pill danger and surfaces its message in a detail row directly below
that step (shown only on failure); successful steps just show a green pill.
Summaries are plain-English phrases, not raw JSON. Demo: this run completed
cleanly (all four ok, matching the metrics above) — the format_standardize
row carries a warn pill + detail row to illustrate how a non-fatal step issue
surfaces inline without a dedicated always-empty column. -->
<div class="dt-table-wrap">
<table class="dt-table">
<thead>
<tr><th>step</th><th>status</th><th>elapsed</th><th>summary</th></tr>
</thead>
<tbody>
<tr>
<td>text_clean</td>
<td><span class="dt-count-pill success">ok</span></td>
<td>214 ms</td>
<td style="font-family:var(--font-sans)">1,204 cells changed in name &amp; city</td>
</tr>
<tr>
<td>format_standardize</td>
<td><span class="dt-count-pill warn"><span class="dt-mi" style="font-size:13px;margin-right:3px">warning</span> ok · 141 skipped</span></td>
<td>388 ms</td>
<td style="font-family:var(--font-sans)">18,301 phones and 17,996 dates standardized</td>
</tr>
<tr style="background:var(--warn-fill)">
<td></td>
<td colspan="3" style="font-family:var(--font-sans);color:var(--warn);white-space:normal">
<span class="dt-mi" style="font-size:15px;vertical-align:-3px;margin-right:4px">info</span>
141 phone values didn't match any known pattern and were left unchanged. The step still completed — review them in the output preview if needed.
</td>
</tr>
<tr>
<td>missing</td>
<td><span class="dt-count-pill success">ok</span></td>
<td>121 ms</td>
<td style="font-family:var(--font-sans)">642 blank cells flagged (sentinels &ldquo;N/A&rdquo; and &ldquo;&mdash;&rdquo;)</td>
</tr>
<tr>
<td>dedup</td>
<td><span class="dt-count-pill success">ok</span></td>
<td>911 ms</td>
<td style="font-family:var(--font-sans)">312 duplicates removed across 147 groups (18,442 → 18,130 rows)</td>
</tr>
</tbody>
</table>
</div>
<h4>Output preview (first 10 rows)</h4>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th class="idx"></th><th>name</th><th>email</th><th>city</th><th>phone</th><th>signup_date</th></tr></thead>
<tbody>
<tr><td class="idx">0</td><td>Jane Doe</td><td>jane@acme.io</td><td>Austin</td><td class="dt-cell-add">+1 512-555-0190</td><td class="dt-cell-add">2024-01-04</td></tr>
<tr><td class="idx">1</td><td>Bob Smith</td><td>bob@globex.com</td><td>Denver</td><td class="dt-cell-add">+1 720-555-7781</td><td class="dt-cell-add">2024-02-11</td></tr>
<tr><td class="idx">2</td><td>Carla Reyes</td><td>carla@initech.co</td><td>Phoenix</td><td class="dt-cell-add">+1 480-555-3320</td><td class="dt-cell-add">2024-03-02</td></tr>
<tr><td class="idx">3</td><td>Dan Okafor</td><td>dan@umbrella.net</td><td><span class="dt-cell-flag">⚑ missing</span></td><td class="dt-cell-add">+1 206-555-7745</td><td class="dt-cell-add">2024-03-18</td></tr>
<tr><td class="idx">4</td><td>Emily Tran</td><td>emily@hooli.com</td><td>Seattle</td><td class="dt-cell-add">+1 206-555-1182</td><td class="dt-cell-add">2024-04-05</td></tr>
</tbody>
</table>
</div>
<hr class="dt-divider">
<!-- Downloads (3 columns) -->
<div class="dt-cols-3">
<button class="dt-btn dt-btn-primary"><span class="dt-mi">download</span> Download cleaned CSV</button>
<button class="dt-btn"><span class="dt-mi">download</span> Download pipeline JSON</button>
<button class="dt-btn"><span class="dt-mi">download</span> Download run audit</button>
</div>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>
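
The Automated Workflows preview above states that tool order is recommended rather than enforced, with violations surfaced as warnings. A minimal sketch of such a check, assuming the pipeline JSON shape shown in the mock's Advanced panel (`steps`, `tool`, `enabled`). The rule table is transcribed from the "Recommended tool order" list; `RECOMMENDED_BEFORE` and `order_warnings` are hypothetical names, not the app's actual API.

```python
# Hypothetical rule table: each pair means "first should run before second".
RECOMMENDED_BEFORE = [
    ("text_clean", "format_standardize"),
    ("text_clean", "missing"),
    ("text_clean", "dedup"),
    ("format_standardize", "missing"),
    ("format_standardize", "dedup"),
    ("missing", "dedup"),
]

# Pipeline in the saved-JSON shape from the mock (options omitted for brevity).
PIPELINE = {
    "version": 1,
    "steps": [
        {"tool": "text_clean", "enabled": True},
        {"tool": "format_standardize", "enabled": True},
        {"tool": "missing", "enabled": True},
        {"tool": "dedup", "enabled": True},
    ],
}


def order_warnings(pipeline: dict) -> list:
    """Return one warning per recommended-order rule that the enabled steps break."""
    enabled = [s["tool"] for s in pipeline["steps"] if s.get("enabled", True)]
    # Position of the first occurrence of each enabled tool.
    position = {}
    for index, tool in enumerate(enabled):
        position.setdefault(tool, index)
    warnings = []
    for first, second in RECOMMENDED_BEFORE:
        if first in position and second in position and position[first] > position[second]:
            warnings.append(f"{first} should run before {second}")
    return warnings


print(order_warnings(PIPELINE))
```

Disabled steps are ignored in this sketch, so a step that is out of order but switched off produces no warning. Whether the real tool treats disabled steps that way is an assumption.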


@@ -0,0 +1,203 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — PDF to CSV</title>
<link rel="stylesheet" href="app.css">
</head>
<body data-page="10_pdf_extractor">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of <strong>PDF to CSV</strong>, shown with two bank-statement PDFs imported and a completed scan (candidate transactions in the editable preview table). <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Tool header -->
<div class="dt-tool-header">
<h1>PDF to CSV</h1>
<div class="dt-tool-header-actions">
<span class="dt-privacy-pill">
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor">
<rect x="4" y="11" width="16" height="10" rx="2"/>
<path d="M8 11V7a4 4 0 018 0v4"/>
</svg>
Runs 100% locally
</span>
<button class="dt-help-btn"><span class="dt-mi">help_outline</span> Help</button>
</div>
</div>
<p class="dt-tool-caption">Pull transactions out of bank-statement PDFs into a clean CSV file.</p>
<div class="dt-spacer"></div>
<!-- Scan options expander (collapsed by default) -->
<details class="dt-expander">
<summary>Scan options</summary>
<div class="dt-expander-body">
<div class="dt-cols-2">
<div class="dt-check on">
<span class="box"><span class="dt-mi">check</span></span>
Treat (4.50) as negative
</div>
<div class="dt-check on">
<span class="box"><span class="dt-mi">check</span></span>
Use OCR for scanned pages
</div>
</div>
<p class="dt-help-text" style="margin:0 0 10px">OCR status: ready (bundled Tesseract). Most modern bank PDFs are text-based and don't need OCR; enable it only for image-based scans.</p>
<div class="dt-cols-2">
<div class="dt-field">
<label class="dt-label">Output date format</label>
<div class="dt-select">YYYY-MM-DD (2026-01-13)</div>
</div>
<div class="dt-field">
<label class="dt-label">Override year for short dates (optional)</label>
<input class="dt-input" type="text" placeholder="" value="" disabled>
<div class="dt-help-text">Leave blank for automatic (statement period → filename year → this override).</div>
</div>
</div>
</div>
</details>
<!-- Files section head -->
<div class="dt-files-section-head">
<h2>Files</h2>
<span class="dt-section-meta">2 files · 318.4 KB total</span>
</div>
<!-- Files card (Home-style bordered list + Add more files) -->
<div class="dt-card" style="padding-bottom:0">
<div class="dt-file-row" style="padding:6px 0">
<button class="dt-btn dt-btn-tertiary" title="Remove statement-jan-2026.pdf"></button>
<span class="dt-file-icon-chip"><svg viewBox="0 0 24 24" fill="none" stroke="currentColor"><path d="M14 2H6a2 2 0 00-2 2v16a2 2 0 002 2h12a2 2 0 002-2V8z"/><path d="M14 2v6h6"/></svg></span>
<span class="dt-file-name">statement-jan-2026.pdf</span>
<span class="dt-file-size" style="margin-left:auto">171.2 KB</span>
</div>
<div class="dt-file-row" style="padding:6px 0">
<button class="dt-btn dt-btn-tertiary" title="Remove statement-feb-2026.pdf"></button>
<span class="dt-file-icon-chip"><svg viewBox="0 0 24 24" fill="none" stroke="currentColor"><path d="M14 2H6a2 2 0 00-2 2v16a2 2 0 002 2h12a2 2 0 002-2V8z"/><path d="M14 2v6h6"/></svg></span>
<span class="dt-file-name">statement-feb-2026.pdf</span>
<span class="dt-file-size" style="margin-left:auto">147.2 KB</span>
</div>
<button class="dt-file-add" style="margin-left:-16px;margin-right:-16px;width:calc(100% + 32px)">
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor"><path d="M12 5v14M5 12h14"/></svg> Add more files
</button>
</div>
<!-- Action buttons -->
<div class="dt-btn-row" style="margin-top:16px;max-width:340px">
<button class="dt-btn dt-btn-primary">Scan</button>
<button class="dt-btn">Clear all files</button>
</div>
<hr class="dt-divider">
<!-- Warnings expander (collapsed) -->
<details class="dt-expander">
<summary>Warnings (1)</summary>
<div class="dt-expander-body">
<div class="dt-alert warn">
<span class="dt-mi">warning</span>
<span>[statement-feb-2026.pdf] 2 lines matched a date but no amount — skipped (likely a wrapped description). Check the source if a transaction looks missing.</span>
</div>
</div>
</details>
<!-- Results -->
<h4>47 candidate transaction(s) from 2 file(s)</h4>
<p class="dt-caption">Uncheck rows to exclude. Edit any cell to fix a value the scanner got wrong. Hover the <span class="dt-mi" style="font-size:15px;vertical-align:-3px;color:var(--ink-tertiary)">info</span> on any row to see the original PDF text it came from.</p>
<!-- overflow-x:auto belt-and-suspenders: any residual width scrolls instead of clipping (app.css .dt-table-wrap is overflow:hidden) -->
<div class="dt-table-wrap" style="overflow-x:auto">
<table class="dt-table">
<thead>
<tr>
<th>Include</th>
<th></th>
<th>date</th>
<th>description</th>
<th>amount_debit</th>
<th>amount_credit</th>
<th>account_number</th>
<th>source_file</th>
</tr>
</thead>
<tbody>
<tr>
<td><span class="dt-check on" style="margin:0"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td class="idx" title="raw: 01/03 OPENING BALANCE 2,140.55" style="cursor:help"><span class="dt-mi" style="font-size:16px">info</span></td>
<td>2026-01-03</td><td>OPENING BALANCE</td><td></td><td></td><td>****4821</td><td>statement-jan-2026.pdf</td>
</tr>
<tr>
<td><span class="dt-check on" style="margin:0"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td class="idx" title="raw: 01/05 POS PURCHASE WHOLE FOODS MKT (84.12)" style="cursor:help"><span class="dt-mi" style="font-size:16px">info</span></td>
<td>2026-01-05</td><td>POS PURCHASE WHOLE FOODS MKT</td><td>84.12</td><td></td><td>****4821</td><td>statement-jan-2026.pdf</td>
</tr>
<tr>
<td><span class="dt-check on" style="margin:0"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td class="idx" title="raw: 01/08 ACH DEPOSIT PAYROLL ACME CORP 3,250.00" style="cursor:help"><span class="dt-mi" style="font-size:16px">info</span></td>
<td>2026-01-08</td><td>ACH DEPOSIT PAYROLL ACME CORP</td><td></td><td>3,250.00</td><td>****4821</td><td>statement-jan-2026.pdf</td>
</tr>
<tr>
<td><span class="dt-check on" style="margin:0"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td class="idx" title="raw: 01/11 ONLINE TRANSFER TO SAVINGS (500.00)" style="cursor:help"><span class="dt-mi" style="font-size:16px">info</span></td>
<td>2026-01-11</td><td>ONLINE TRANSFER TO SAVINGS</td><td>500.00</td><td></td><td>****4821</td><td>statement-jan-2026.pdf</td>
</tr>
<tr>
<td><span class="dt-check" style="margin:0"><span class="box"></span></span></td>
<td class="idx" title="raw: 01/12 INTEREST RATE 0.50% APY DETAIL 0.00" style="cursor:help"><span class="dt-mi" style="font-size:16px">info</span></td>
<td class="dt-cell-flag">2026-01-12</td><td class="dt-cell-flag">INTEREST RATE 0.50% APY DETAIL <span style="font-family:var(--font-sans);font-size:11px;font-weight:500;background:var(--warn-fill);color:var(--warn);border-radius:999px;padding:1px 7px;white-space:nowrap">auto-excluded · not a transaction line</span></td><td></td><td></td><td>****4821</td><td>statement-jan-2026.pdf</td>
</tr>
<tr>
<td><span class="dt-check on" style="margin:0"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td class="idx" title="raw: 01/14 DEBIT CARD SHELL OIL #2287 (52.40)" style="cursor:help"><span class="dt-mi" style="font-size:16px">info</span></td>
<td>2026-01-14</td><td>DEBIT CARD SHELL OIL #2287</td><td>52.40</td><td></td><td>****4821</td><td>statement-jan-2026.pdf</td>
</tr>
<tr>
<td><span class="dt-check on" style="margin:0"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td class="idx" title="raw: 02/02 POS PURCHASE TRADER JOES #511 (61.88)" style="cursor:help"><span class="dt-mi" style="font-size:16px">info</span></td>
<td>2026-02-02</td><td>POS PURCHASE TRADER JOES #511</td><td>61.88</td><td></td><td>****4821</td><td>statement-feb-2026.pdf</td>
</tr>
<tr>
<td><span class="dt-check on" style="margin:0"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td class="idx" title="raw: 02/06 ACH DEPOSIT PAYROLL ACME CORP 3,250.00" style="cursor:help"><span class="dt-mi" style="font-size:16px">info</span></td>
<td>2026-02-06</td><td>ACH DEPOSIT PAYROLL ACME CORP</td><td></td><td>3,250.00</td><td>****4821</td><td>statement-feb-2026.pdf</td>
</tr>
<tr>
<td><span class="dt-check on" style="margin:0"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td class="idx" title="raw: 02/09 CHECK #1043 (1,200.00)" style="cursor:help"><span class="dt-mi" style="font-size:16px">info</span></td>
<td>2026-02-09</td><td>CHECK #1043</td><td>1,200.00</td><td></td><td>****4821</td><td>statement-feb-2026.pdf</td>
</tr>
</tbody>
</table>
</div>
<!-- Download area: configure-then-act — column selector first, download button below -->
<div style="margin-top:14px;max-width:520px">
<div class="dt-field" style="margin:0 0 14px">
<label class="dt-label">Columns to include in CSV</label>
<div class="dt-multiselect">
<span class="dt-ms-chip">date <span class="x"></span></span>
<span class="dt-ms-chip">description <span class="x"></span></span>
<span class="dt-ms-chip">amount_debit <span class="x"></span></span>
<span class="dt-ms-chip">amount_credit <span class="x"></span></span>
<span class="dt-ms-chip">account_number <span class="x"></span></span>
<span class="dt-ms-chip">source_file <span class="x"></span></span>
</div>
<div class="dt-help-text"><code>page</code> and <code>raw</code> are kept off by default; tick them if you want them in the file.</div>
</div>
<button class="dt-btn dt-btn-primary dt-btn-block">Download 46 rows as CSV</button>
<p class="dt-caption" style="margin-top:8px">1 row excluded (INTEREST RATE detail line).</p>
</div>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>
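
Review note on the mockup above: the "Treat (4.50) as negative" option and the debit/credit split in the preview table can be read as the rule sketched below. This is a minimal Python sketch under stated assumptions, not the app's scanner. `parse_amount` and `split_debit_credit` are hypothetical names, and the mapping of negative amounts to `amount_debit` is inferred only from the sample rows (for example, raw `(84.12)` shown as a debit of 84.12).

```python
from decimal import Decimal
from typing import Optional, Tuple


def parse_amount(text: str, parens_negative: bool = True) -> Decimal:
    """Parse a statement amount such as '3,250.00', '-12.00' or '(84.12)'.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    s = text.strip().replace(",", "")  # drop thousands separators
    negative = s.startswith("-")
    if negative:
        s = s[1:]
    if s.startswith("(") and s.endswith(")"):
        s = s[1:-1]
        # Accounting notation: parentheses mean negative when the option is on.
        negative = negative or parens_negative
    value = Decimal(s)
    return -value if negative else value


def split_debit_credit(amount: Decimal) -> Tuple[Optional[Decimal], Optional[Decimal]]:
    """Return (amount_debit, amount_credit); assumes money out is a debit."""
    if amount < 0:
        return abs(amount), None
    if amount > 0:
        return None, amount
    return None, None  # zero amounts fill neither column
```

With the option enabled, `parse_amount("(84.12)")` yields a negative value that lands in the debit column, while `parse_amount("3,250.00")` stays positive and lands in the credit column. `Decimal` is used rather than `float` so that cent values compare exactly.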


@@ -0,0 +1,248 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — Reconcile Two Files</title>
<link rel="stylesheet" href="app.css">
</head>
<body data-page="11_reconciler">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of <strong>Reconcile Two Files</strong>, shown with both files imported, key columns mapped, and a completed reconciliation (matched / review / unmatched results). <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Tool header -->
<div class="dt-tool-header">
<h1>Reconcile Two Files</h1>
<div class="dt-tool-header-actions">
<span class="dt-privacy-pill">
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor">
<rect x="4" y="11" width="16" height="10" rx="2"/>
<path d="M8 11V7a4 4 0 018 0v4"/>
</svg>
Runs 100% locally
</span>
<button class="dt-help-btn"><span class="dt-mi">help_outline</span> Help</button>
</div>
</div>
<p class="dt-tool-caption">Compare two lists of transactions (e.g. bank vs. ledger) and flag what doesn't match.</p>
<div class="dt-spacer"></div>
<!-- Side-by-side upload (st.columns(2) → two _side_panel) -->
<div class="dt-cols-2">
<!-- Left side -->
<div>
<h4 style="margin-top:0">Left (e.g. bank feed)</h4>
<div class="dt-alert info">
<span class="dt-mi">description</span>
<span>Using <strong>bank_feed_may.csv</strong> from the upload screen.</span>
</div>
<button class="dt-btn" style="margin-bottom:4px">Use a different file</button>
<p class="dt-caption" style="margin-top:6px"><code>bank_feed_may.csv</code> — 1,204 rows, 4 columns</p>
<details class="dt-expander">
<summary>Preview left (e.g. bank feed)</summary>
<div class="dt-expander-body">
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>posted_date</th><th>description</th><th>amount</th><th>ref</th></tr></thead>
<tbody>
<tr><td>2026-05-01</td><td>ACME SUPPLIES</td><td>-1240.00</td><td>CHK1041</td></tr>
<tr><td>2026-05-02</td><td>PAYROLL RUN</td><td>-8800.00</td><td>ACH5520</td></tr>
<tr><td>2026-05-03</td><td>CLIENT GLOBEX</td><td>5200.00</td><td>DEP0090</td></tr>
<tr><td>2026-05-04</td><td>UTILITY CO</td><td>-318.42</td><td>CHK1042</td></tr>
</tbody>
</table>
</div>
</div>
</details>
</div>
<!-- Right side -->
<div>
<h4 style="margin-top:0">Right (e.g. ledger)</h4>
<div class="dt-alert info">
<span class="dt-mi">description</span>
<span>Using <strong>ledger_may.xlsx</strong> from the upload screen.</span>
</div>
<button class="dt-btn" style="margin-bottom:4px">Use a different file</button>
<p class="dt-caption" style="margin-top:6px"><code>ledger_may.xlsx</code> — 1,198 rows, 5 columns</p>
<details class="dt-expander">
<summary>Preview right (e.g. ledger)</summary>
<div class="dt-expander-body">
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>txn_date</th><th>memo</th><th>value</th><th>invoice_no</th><th>account</th></tr></thead>
<tbody>
<tr><td>2026-05-01</td><td>Acme Supplies Inc</td><td>-1240.00</td><td>INV-1041</td><td>5000</td></tr>
<tr><td>2026-05-02</td><td>Monthly payroll</td><td>-8800.00</td><td>INV-5520</td><td>6000</td></tr>
<tr><td>2026-05-03</td><td>Globex retainer</td><td>5200.00</td><td>INV-0090</td><td>4000</td></tr>
<tr><td>2026-05-04</td><td>City Utilities</td><td>-318.40</td><td>INV-1042</td><td>6100</td></tr>
</tbody>
</table>
</div>
</div>
</details>
</div>
</div>
<hr class="dt-divider">
<!-- Match settings -->
<h2>Match settings</h2>
<div class="dt-cols-2">
<!-- Left pickers (file order: posted_date, description, amount → date, desc, amount) -->
<div>
<h4 style="margin-top:0">Left columns</h4>
<div class="dt-field"><label class="dt-label">Date column (optional)</label><div class="dt-select">posted_date</div></div>
<div class="dt-field"><label class="dt-label">Description column (optional)</label><div class="dt-select">description</div></div>
<div class="dt-field"><label class="dt-label">Amount column <span class="req">*</span></label><div class="dt-select">amount</div></div>
<div class="dt-field"><label class="dt-label">Reference columns (optional, e.g. check / invoice no.)</label>
<div class="dt-multiselect"><span class="dt-ms-chip">ref <span class="x"></span></span></div></div>
</div>
<!-- Right pickers (file order: txn_date, memo, value → date, desc, amount) -->
<div>
<h4 style="margin-top:0">Right columns</h4>
<div class="dt-field"><label class="dt-label">Date column (optional)</label><div class="dt-select">txn_date</div></div>
<div class="dt-field"><label class="dt-label">Description column (optional)</label><div class="dt-select">memo</div></div>
<div class="dt-field"><label class="dt-label">Amount column <span class="req">*</span></label><div class="dt-select">value</div></div>
<div class="dt-field"><label class="dt-label">Reference columns (must match left count)</label>
<div class="dt-multiselect"><span class="dt-ms-chip">invoice_no <span class="x"></span></span></div>
<div class="dt-help-text" style="color:var(--success);display:flex;align-items:center;gap:5px"><span class="dt-mi" style="font-family:'Material Symbols Outlined';font-size:15px;line-height:1">check_circle</span> 1 reference each side — counts match</div></div>
</div>
</div>
<!-- Tolerances & options (expanded=True) -->
<details class="dt-expander" open>
<summary>Tolerances &amp; options</summary>
<div class="dt-expander-body">
<div class="dt-cols-3">
<div class="dt-field"><label class="dt-label">Amount tolerance</label>
<div class="dt-input">0.0200</div>
<div class="dt-help-text">Absolute tolerance on amount (e.g. 0.01 to absorb cent rounding).</div></div>
<div class="dt-field"><label class="dt-label">Date tolerance (days)</label>
<div class="dt-input">1</div>
<div class="dt-help-text">Allow N calendar days of drift between posting dates.</div></div>
<div class="dt-field"><label class="dt-label">Invert right amount sign</label>
<div class="dt-check" style="margin-top:8px"><span class="box"></span></div>
<div class="dt-help-text">Use when one side records debits as positive and the other as negative.</div></div>
</div>
<div class="dt-field"><label class="dt-label">Description similarity boost (0 disables)</label>
<div class="dt-slider"><div class="track"><div class="fill" style="width:80%"></div><div class="knob" style="left:80%"></div></div><div class="val">80</div></div>
<div class="dt-help-text">When both sides have a description column set, accept matches with this minimum fuzzy similarity even if amount/date are merely within tolerance. Lower = more permissive.</div></div>
</div>
</details>
<hr class="dt-divider">
<button class="dt-btn dt-btn-primary dt-btn-block">Reconcile</button>
<hr class="dt-divider">
<!-- Results -->
<h2>Results</h2>
<div class="dt-metrics">
<div class="dt-metric"><div class="label">Review</div><div class="value">9</div></div>
<div class="dt-metric"><div class="label">Unmatched left</div><div class="value">22</div></div>
<div class="dt-metric"><div class="label">Unmatched right</div><div class="value">16</div></div>
<div class="dt-metric"><div class="label">Matched</div><div class="value">1,173</div></div>
</div>
<p class="dt-caption">Coverage: 97.4% of the larger side</p>
<!-- Tabs (st.tabs) — exceptions-first; Review active by default -->
<div class="dt-tabs">
<span class="dt-tab is-active">Review (9)</span>
<span class="dt-tab">Unmatched left (22)</span>
<span class="dt-tab">Unmatched right (16)</span>
<span class="dt-tab">Matched (1,173)</span>
</div>
<!-- Active tab content: Review (exceptions-first default) -->
<p class="dt-caption">Pairs flagged because the algorithm couldn't pick a single best match (e.g. multiple equally-good candidates). Use the left/right indices to disambiguate manually.</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>left_idx</th><th>left_amount</th><th>right_idx</th><th>right_value</th><th>candidates</th></tr></thead>
<tbody>
<tr><td>118</td><td>-450.00</td><td>121, 209</td><td>-450.00</td><td class="dt-cell-flag">2 equal</td></tr>
<tr><td>203</td><td>1000.00</td><td>198, 244</td><td>1000.00</td><td class="dt-cell-flag">2 equal</td></tr>
</tbody>
</table>
</div>
<!-- Other tab previews shown as collapsed expanders for review context -->
<details class="dt-expander">
<summary>Unmatched left (22) — only in bank_feed_may.csv</summary>
<div class="dt-expander-body">
<p class="dt-caption">Showing all 22 rows.</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>posted_date</th><th>description</th><th>amount</th><th>ref</th></tr></thead>
<tbody>
<tr><td class="dt-cell-del">2026-05-09</td><td class="dt-cell-del">BANK FEE</td><td class="dt-cell-del">-12.00</td><td class="dt-cell-del">FEE0001</td></tr>
<tr><td class="dt-cell-del">2026-05-14</td><td class="dt-cell-del">ATM WITHDRAWAL</td><td class="dt-cell-del">-200.00</td><td class="dt-cell-del">ATM7781</td></tr>
</tbody>
</table>
</div>
</div>
</details>
<details class="dt-expander">
<summary>Unmatched right (16) — only in ledger_may.xlsx</summary>
<div class="dt-expander-body">
<p class="dt-caption">Showing all 16 rows.</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>txn_date</th><th>memo</th><th>value</th><th>invoice_no</th><th>account</th></tr></thead>
<tbody>
<tr><td class="dt-cell-del">2026-05-11</td><td class="dt-cell-del">Accrued interest</td><td class="dt-cell-del">37.50</td><td class="dt-cell-del">INV-9001</td><td class="dt-cell-del">7000</td></tr>
<tr><td class="dt-cell-del">2026-05-22</td><td class="dt-cell-del">Depreciation</td><td class="dt-cell-del">-410.00</td><td class="dt-cell-del">INV-9044</td><td class="dt-cell-del">8000</td></tr>
</tbody>
</table>
</div>
</div>
</details>
<details class="dt-expander">
<summary>Matched (1,173) — cleanly reconciled</summary>
<div class="dt-expander-body">
<p class="dt-caption">Preview of first 25 of 1,173 rows — download the CSV below for the full set.</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr>
<th>left_posted_date</th><th>left_description</th><th>left_amount</th>
<th>right_txn_date</th><th>right_memo</th><th>right_value</th><th>amount_diff</th>
</tr></thead>
<tbody>
<tr><td>2026-05-01</td><td>ACME SUPPLIES</td><td>-1240.00</td><td>2026-05-01</td><td>Acme Supplies Inc</td><td>-1240.00</td><td class="dt-cell-add">0.00</td></tr>
<tr><td>2026-05-02</td><td>PAYROLL RUN</td><td>-8800.00</td><td>2026-05-02</td><td>Monthly payroll</td><td>-8800.00</td><td class="dt-cell-add">0.00</td></tr>
<tr><td>2026-05-03</td><td>CLIENT GLOBEX</td><td>5200.00</td><td>2026-05-03</td><td>Globex retainer</td><td>5200.00</td><td class="dt-cell-add">0.00</td></tr>
<tr><td>2026-05-04</td><td>UTILITY CO</td><td>-318.42</td><td>2026-05-04</td><td>City Utilities</td><td>-318.40</td><td class="dt-cell-flag">0.02</td></tr>
<tr><td>2026-05-06</td><td>OFFICE DEPOT</td><td>-89.15</td><td>2026-05-07</td><td>Office supplies</td><td>-89.15</td><td class="dt-cell-add">0.00</td></tr>
</tbody>
</table>
</div>
</div>
</details>
<hr class="dt-divider">
<!-- Downloads (st.columns(4) of html_download_button) — exceptions-first,
matching the tab/metric order; four parallel exports, equal weight -->
<div class="dt-btn-row">
<button class="dt-btn">Review CSV</button>
<button class="dt-btn">Unmatched left CSV</button>
<button class="dt-btn">Unmatched right CSV</button>
<button class="dt-btn">Matched CSV</button>
</div>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>
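
Review note on the mockup above: the tolerance fields and the coverage caption can be checked against the sample figures with a short sketch. This is a minimal Python illustration, not the reconciler's actual algorithm. `is_match` and `coverage_pct` are hypothetical names, both dates are assumed to be present (the UI marks date columns as optional), and the description-similarity setting and the ambiguous-candidate "Review" logic are left out. The coverage formula (matched rows over the larger file's row count) is an inference that happens to reproduce the 97.4% shown.

```python
from datetime import date
from decimal import Decimal


def is_match(left_amount: Decimal, right_amount: Decimal,
             left_date: date, right_date: date,
             amount_tol: Decimal = Decimal("0.02"),
             date_tol_days: int = 1,
             invert_right: bool = False) -> bool:
    """Hypothetical pairwise check: amounts and dates both within tolerance."""
    if invert_right:
        # One side records debits as positive, the other as negative.
        right_amount = -right_amount
    amount_ok = abs(left_amount - right_amount) <= amount_tol
    date_ok = abs((left_date - right_date).days) <= date_tol_days
    return amount_ok and date_ok


def coverage_pct(matched: int, left_rows: int, right_rows: int) -> float:
    """Matched rows as a percentage of the larger side (assumed formula)."""
    larger = max(left_rows, right_rows)
    if larger == 0:
        return 0.0
    return round(100 * matched / larger, 1)
```

Under these assumptions the UTILITY CO pair (-318.42 against -318.40, same day) passes with an amount tolerance of 0.02, the OFFICE DEPOT pair (2026-05-06 against 2026-05-07) passes with a one-day date tolerance, and `coverage_pct(1173, 1204, 1198)` gives the 97.4 shown in the caption. Amounts are `Decimal` so a difference of exactly 0.02 is not lost to binary floating-point rounding.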

layout-review/app.css Normal file

@@ -0,0 +1,542 @@
/* ===========================================================================
DataTools — static layout-review stylesheet
---------------------------------------------------------------------------
Faithful reproduction of the live Streamlit app's design system for human
review of page layouts. Tokens are copied verbatim from src/gui/theme.py
(§3 color + type scale) and the component values from
src/gui/components/_legacy.py:_DESIGN_TOKENS_CSS.
The live app applies these styles to Streamlit's data-testid DOM; here we
re-express the same look against clean semantic classes so the static HTML
stays readable. Where the app uses real .dt-* classes (page header, files
card, findings, stats) the class names are kept identical.
=========================================================================== */
@import url("https://fonts.googleapis.com/css2?family=Geist:wght@400;500;600;700&family=Geist+Mono:wght@400;500&display=swap");
@import url("https://fonts.googleapis.com/css2?family=Material+Symbols+Outlined:opsz,wght,FILL,GRAD@20..48,400,0,0&display=block");
:root {
--font-sans: "Geist", -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
--font-mono: "Geist Mono", ui-monospace, "SF Mono", Menlo, monospace;
--ink: #1c1917;
--ink-secondary: #57534e;
--ink-tertiary: #a8a29e;
--bg: #fafaf7;
--surface: #ffffff;
--surface-hover: #f8f7f3;
--border: #e7e5dc;
--border-strong: #d6d3c7;
--accent: #c2410c;
--accent-hover: #9a3412;
--accent-fill: #fef4ed;
--accent-fill-strong: #fde4d3;
--warn: #b45309;
--warn-fill: #fef3c7;
--info: #0369a1;
--info-fill: #e0f2fe;
--success: #15803d;
--success-fill: #dcfce7;
--danger: #b91c1c;
--danger-fill: #fee2e2;
--r-sm: 6px;
--r-md: 10px;
--r-lg: 14px;
--sidebar-w: 264px;
}
* { box-sizing: border-box; }
html, body {
margin: 0;
padding: 0;
background: var(--bg);
color: var(--ink);
font-family: var(--font-sans);
font-feature-settings: "ss01", "cv01", "cv11";
-webkit-font-smoothing: antialiased;
}
/* ---------- Type scale (theme.py §4) ---------- */
h1 { font-size: 32px; font-weight: 600; letter-spacing: -0.035em; line-height: 1.1; margin: 0 0 4px; }
h2 { font-size: 22px; font-weight: 600; letter-spacing: -0.025em; line-height: 1.2; margin: 1.5rem 0 0.75rem; }
h3 { font-size: 18px; font-weight: 500; letter-spacing: -0.018em; line-height: 1.25; margin: 1.25rem 0 0.5rem; }
h4 { font-size: 15px; font-weight: 500; letter-spacing: -0.012em; line-height: 1.35; margin: 1rem 0 0.5rem; }
p { font-size: 14px; font-weight: 400; line-height: 1.55; color: var(--ink); margin: 0 0 0.6rem; }
strong { font-weight: 500; color: var(--ink); }
a { color: var(--accent); text-decoration: none; }
a:hover { color: var(--accent-hover); text-decoration: underline; }
code, .dt-mono { font-family: var(--font-mono); font-size: 0.92em; font-feature-settings: "ss02"; }
/* ===========================================================================
App frame — sidebar + main + sticky footer
=========================================================================== */
.dt-app { display: flex; min-height: 100vh; }
/* ---------- Sidebar (cream paper) ---------- */
.dt-sidebar {
width: var(--sidebar-w);
flex-shrink: 0;
background: #f5f4ef;
border-right: 1px solid var(--border);
padding: 18px 14px 90px;
position: sticky;
top: 0;
align-self: flex-start;
height: 100vh;
overflow-y: auto;
}
.dt-brand { display: flex; align-items: center; gap: 10px; padding: 0 4px 18px; }
.dt-brand-mark {
width: 28px; height: 28px; border-radius: 7px;
background: var(--ink); color: var(--accent-fill);
display: inline-flex; align-items: center; justify-content: center;
font-weight: 700; font-size: 16px; letter-spacing: -0.04em; line-height: 1; flex-shrink: 0;
}
.dt-brand-name { display: flex; flex-direction: column; gap: 1px; line-height: 1.05; }
.dt-brand-eyebrow {
font-size: 9.5px; font-weight: 600; letter-spacing: 0.14em;
text-transform: uppercase; color: var(--ink-tertiary); line-height: 1;
}
.dt-brand-word { font-weight: 600; font-size: 15px; letter-spacing: -0.02em; color: var(--ink); }
.dt-nav { display: flex; flex-direction: column; }
.dt-nav-section {
font-size: 11.5px; text-transform: uppercase; letter-spacing: 0.08em;
color: var(--ink-tertiary); font-weight: 500;
padding: 14px 10px 4px; margin: 0;
display: flex; align-items: center; justify-content: space-between;
}
.dt-nav-section .dt-nav-indicator { font-size: 16px; color: var(--ink-tertiary); }
.dt-nav-link {
display: flex; align-items: center; gap: 8px;
color: var(--ink-secondary); font-size: 13px; font-weight: 500; line-height: 1.3;
padding: 5px 10px; border-radius: var(--r-sm); margin-bottom: 1px;
text-decoration: none; transition: background 0.12s ease, color 0.12s ease;
}
.dt-nav-link:hover { background: rgba(0,0,0,0.04); color: var(--ink); text-decoration: none; }
.dt-nav-link.is-active { background: rgba(0,0,0,0.04); color: var(--ink); font-weight: 600; }
.dt-nav-link .dt-mi { font-family: "Material Symbols Outlined"; font-size: 18px; color: var(--ink-secondary); line-height: 1; }
.dt-nav-link.is-active .dt-mi { color: var(--ink); }
.dt-nav-link.is-soon { opacity: 0.55; }
/* "Start here" front-door item — weightier than ordinary nav links so the
obvious entry point reads at a glance. Accent-fill ground + accent-hover ink,
slightly larger hit area, with bottom margin to part it from the groups below.
Layers on .dt-nav-link, so the .is-active treatment still overrides cleanly. */
.dt-nav-start {
background: var(--accent-fill); color: var(--accent-hover); font-weight: 600;
padding: 8px 10px; margin-bottom: 12px;
}
.dt-nav-start:hover { background: var(--accent-fill-strong); color: var(--accent-hover); }
.dt-nav-start .dt-mi { color: var(--accent); }
.dt-nav-start.is-active { background: var(--accent-fill-strong); color: var(--accent-hover); }
.dt-nav-start.is-active .dt-mi { color: var(--accent); }
.dt-nav-soon-tag {
margin-left: auto; font-size: 9px; font-weight: 600; letter-spacing: 0.06em;
text-transform: uppercase; color: var(--ink-tertiary);
border: 1px solid var(--border-strong); border-radius: 999px; padding: 1px 6px;
}
.dt-sidebar-foot { margin-top: 22px; padding-top: 16px; border-top: 1px solid var(--border); display: flex; flex-direction: column; gap: 10px; }
.dt-sidebar-label { font-size: 11.5px; font-weight: 500; text-transform: uppercase; letter-spacing: 0.08em; color: var(--ink-tertiary); margin-bottom: 4px; }
.dt-license-badge { font-size: 12.5px; color: var(--ink-secondary); }
/* ---------- Main column ---------- */
.dt-main { flex: 1; min-width: 0; padding: 40px 56px 96px; }
.dt-main-inner { max-width: 920px; margin: 0 auto; }
/* Review banner above every mockup */
.dt-review-banner {
max-width: 920px; margin: 0 auto 20px; display: flex; gap: 10px; align-items: center;
background: var(--info-fill); color: var(--info);
border: 1px solid transparent; border-radius: var(--r-md);
padding: 8px 14px; font-size: 12.5px; line-height: 1.4;
}
.dt-review-banner a { color: var(--info); text-decoration: underline; }
.dt-review-banner .dt-mi { font-family: "Material Symbols Outlined"; font-size: 18px; }
/* ---------- Sticky footer ---------- */
.dt-footer {
position: fixed; bottom: 0; left: var(--sidebar-w); right: 0;
background: rgba(255,255,255,0.97); backdrop-filter: blur(8px);
border-top: 1px solid var(--border-strong);
padding: 8px 20px; z-index: 50;
display: flex; align-items: center; gap: 8px;
}
.dt-footer-btn {
display: inline-flex; align-items: center; gap: 8px;
color: var(--ink-secondary); font-size: 13px; font-weight: 500; line-height: 1.3;
padding: 5px 10px; border-radius: var(--r-sm);
background: transparent; border: none; cursor: pointer; text-decoration: none;
}
.dt-footer-btn:hover { background: rgba(0,0,0,0.04); color: var(--ink); text-decoration: none; }
.dt-footer-btn .dt-mi { font-family: "Material Symbols Outlined"; font-size: 16px; }
/* ===========================================================================
Page header (brand + privacy pill) — .dt-page-* mirror the live app
=========================================================================== */
.dt-page-header {
display: flex; align-items: center; justify-content: space-between; gap: 24px;
margin: 0 0 24px; padding-bottom: 22px; border-bottom: 1px solid var(--border);
}
.dt-page-brand { display: flex; flex-direction: column; gap: 8px; }
.dt-page-brand-row { display: flex; align-items: center; gap: 18px; }
.dt-page-brand-mark {
width: 56px; height: 56px; border-radius: 14px; background: var(--ink);
color: var(--accent-fill); display: inline-flex; align-items: center; justify-content: center;
font-weight: 700; font-size: 32px; letter-spacing: -0.04em; line-height: 1; flex-shrink: 0;
}
.dt-page-brand-words { display: flex; flex-direction: column; gap: 2px; line-height: 1; }
.dt-page-eyebrow { font-size: 11.5px; font-weight: 600; letter-spacing: 0.14em; text-transform: uppercase; color: var(--ink-tertiary); line-height: 1.2; }
.dt-page-wordmark { margin: 0; font-weight: 600; font-size: 32px; letter-spacing: -0.035em; line-height: 1.1; color: var(--ink); }
.dt-page-subtitle { margin: 4px 0 0; color: var(--ink-secondary); font-size: 14px; line-height: 1.5; }
.dt-privacy-pill {
display: inline-flex; align-items: center; gap: 6px; padding: 6px 11px;
background: var(--success-fill); color: var(--success); border-radius: 999px;
font-size: 12px; font-weight: 500; white-space: nowrap; flex-shrink: 0;
}
.dt-privacy-pill svg { width: 13px; height: 13px; stroke-width: 2; }
/* ---------- Tool header (title + Help popover) ---------- */
.dt-tool-header { display: flex; align-items: flex-start; justify-content: space-between; gap: 16px; }
.dt-tool-header h1 { margin: 0; }
.dt-help-btn {
display: inline-flex; align-items: center; gap: 6px; white-space: nowrap;
background: var(--surface); color: var(--ink); border: 1px solid var(--border-strong);
border-radius: var(--r-md); padding: 9px 16px; font-size: 13.5px; font-weight: 500;
cursor: pointer; flex-shrink: 0; margin-top: 6px;
}
.dt-help-btn .dt-mi { font-family: "Material Symbols Outlined"; font-size: 18px; }
.dt-tool-caption { font-size: 12.5px; color: var(--ink-tertiary); line-height: 1.5; margin: 2px 0 0; }
/* Right-side actions cluster in a tool header: the local-first privacy pill +
the Help button. One shared class so every tool page aligns identically
(replaces per-page inline flex/gap/margin drift). */
.dt-tool-header-actions { display: flex; align-items: center; gap: 12px; flex-shrink: 0; margin-top: 6px; }
.dt-tool-header-actions .dt-help-btn { margin-top: 0; }
/* ===========================================================================
Buttons
=========================================================================== */
.dt-btn {
border-radius: var(--r-md); font-family: var(--font-sans); font-weight: 500;
font-size: 13.5px; letter-spacing: -0.005em; line-height: 1; padding: 9px 16px;
border: 1px solid var(--border-strong); background: var(--surface); color: var(--ink);
cursor: pointer; transition: background 0.12s ease, border-color 0.12s ease, color 0.12s ease;
display: inline-flex; align-items: center; justify-content: center; gap: 8px;
}
.dt-btn:hover { background: var(--surface-hover); border-color: var(--ink-tertiary); }
.dt-btn-primary { background: var(--ink); color: var(--bg); border-color: var(--ink); }
.dt-btn-primary:hover { background: #292524; border-color: #292524; color: var(--bg); }
.dt-btn-tertiary { background: transparent; border: none; color: var(--ink-tertiary); padding: 4px 8px; }
.dt-btn-tertiary:hover { background: var(--danger-fill); color: var(--danger); }
.dt-btn:disabled, .dt-btn.is-disabled {
background: var(--surface-hover); color: var(--ink-tertiary);
border: 1px solid var(--border); cursor: not-allowed;
}
.dt-btn-block { width: 100%; }
.dt-btn .dt-mi { font-family: "Material Symbols Outlined"; font-size: 18px; }
.dt-btn-row { display: flex; gap: 10px; flex-wrap: wrap; }
.dt-btn-row > .dt-btn { flex: 1; }
/* ===========================================================================
File uploader (cream dropzone)
=========================================================================== */
.dt-uploader {
background: var(--surface-hover); border: 1px dashed var(--border-strong);
border-radius: var(--r-md); padding: 22px 20px;
display: flex; align-items: center; justify-content: space-between; gap: 16px;
}
.dt-uploader-text { display: flex; flex-direction: column; gap: 2px; }
.dt-uploader-text .hint { font-size: 14px; color: var(--ink); }
.dt-uploader-text .sub { font-size: 12.5px; color: var(--ink-tertiary); }
.dt-uploader .dt-mi { font-family: "Material Symbols Outlined"; font-size: 24px; color: var(--ink-tertiary); }
/* Staged-file chip */
.dt-file-chip {
display: flex; align-items: center; gap: 12px;
background: var(--surface); border: 1px solid var(--border); border-radius: var(--r-sm);
padding: 10px 14px; margin-top: 10px;
}
.dt-file-chip .name { font-family: var(--font-mono); font-size: 13px; color: var(--ink); font-feature-settings: "ss02"; }
.dt-file-chip .size { font-family: var(--font-mono); font-size: 12px; color: var(--ink-tertiary); margin-left: auto; }
/* ===========================================================================
Expanders / bordered cards
=========================================================================== */
.dt-expander {
background: var(--surface); border: 1px solid var(--border); border-radius: var(--r-lg);
overflow: hidden; box-shadow: 0 1px 2px rgba(28,25,23,0.03); margin: 10px 0;
}
.dt-expander > summary, .dt-expander-head {
background: var(--surface-hover); border-bottom: 1px solid var(--border);
padding: 12px 16px; font-weight: 500; color: var(--ink); font-size: 14px;
cursor: pointer; list-style: none; display: flex; align-items: center; gap: 8px;
}
.dt-expander > summary::-webkit-details-marker { display: none; }
.dt-expander > summary::before {
content: "expand_more"; font-family: "Material Symbols Outlined"; font-size: 20px;
color: var(--ink-tertiary); transition: transform 0.15s ease;
}
.dt-expander[open] > summary::before { transform: rotate(180deg); }
.dt-expander-body, .dt-expander > .dt-expander-body { padding: 14px 16px; }
.dt-expander:not([open]) > summary { border-bottom: none; }
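/* Usage sketch (assumed markup, not taken from the app source; the summary and
   body text are illustrative). The [open] and summary selectors above imply a
   native details/summary pair:
   <details class="dt-expander" open>
     <summary>Advanced options</summary>
     <div class="dt-expander-body">Body content goes here.</div>
   </details> */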
.dt-card {
background: var(--surface); border: 1px solid var(--border); border-radius: var(--r-lg);
box-shadow: 0 1px 2px rgba(28,25,23,0.03); padding: 16px; margin: 10px 0;
}
/* ===========================================================================
Alerts
=========================================================================== */
.dt-alert {
border-radius: var(--r-md); border: 1px solid transparent;
padding: 10px 14px; font-size: 13.5px; line-height: 1.45; margin: 10px 0;
display: flex; gap: 10px; align-items: flex-start;
}
.dt-alert .dt-mi { font-family: "Material Symbols Outlined"; font-size: 18px; flex-shrink: 0; margin-top: 1px; }
.dt-alert.info { background: var(--info-fill); color: var(--info); }
.dt-alert.success { background: var(--success-fill); color: var(--success); }
.dt-alert.warn { background: var(--warn-fill); color: var(--warn); }
.dt-alert.error { background: var(--danger-fill); color: var(--danger); }
.dt-alert code { background: rgba(0,0,0,0.05); padding: 1px 5px; border-radius: 4px; }
/* Next-step strip — slim single-line "what to do next" suggestion shown at the
end of a tool's results. Subtle accent ground + left accent rule so it nudges
without competing with alerts; the trailing dismiss control is unobtrusive. */
.dt-next-step {
display: flex; align-items: center; gap: 10px;
background: var(--accent-fill); border-left: 3px solid var(--accent);
border-radius: var(--r-md); padding: 10px 14px; margin: 16px 0;
font-size: 13.5px; line-height: 1.4; color: var(--ink);
}
.dt-next-step .dt-mi { font-family: "Material Symbols Outlined"; font-size: 18px; color: var(--accent); flex-shrink: 0; }
.dt-next-step a { color: var(--accent); font-weight: 500; }
.dt-next-step a:hover { color: var(--accent-hover); }
.dt-next-step-dismiss {
margin-left: auto; background: transparent; border: none; cursor: pointer;
color: var(--ink-tertiary); font-size: 13px; line-height: 1; padding: 2px 4px;
}
.dt-next-step-dismiss:hover { color: var(--ink-secondary); }
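/* Usage sketch (assumed markup; the icon name, link target, and copy are
   illustrative, not taken from the app source):
   <div class="dt-next-step">
     <span class="dt-mi">arrow_forward</span>
     <span>Next: <a href="03_format_standardizer.html">Standardize Formats</a></span>
     <button class="dt-next-step-dismiss" aria-label="Dismiss">Dismiss</button>
   </div> */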
/* ===========================================================================
Inputs (static representations of Streamlit widgets)
=========================================================================== */
.dt-field { margin: 10px 0; }
.dt-label { font-size: 13px; font-weight: 500; color: var(--ink); margin-bottom: 5px; display: block; }
.dt-label .req { color: var(--accent); }
.dt-input, .dt-select, .dt-textarea {
width: 100%; background: var(--surface); border: 1px solid var(--border-strong);
border-radius: var(--r-sm); padding: 8px 11px; font-family: var(--font-sans);
font-size: 13.5px; color: var(--ink);
}
.dt-select { appearance: none; background-image: linear-gradient(45deg, transparent 50%, var(--ink-tertiary) 50%), linear-gradient(135deg, var(--ink-tertiary) 50%, transparent 50%); background-position: calc(100% - 16px) 14px, calc(100% - 11px) 14px; background-size: 5px 5px, 5px 5px; background-repeat: no-repeat; }
.dt-textarea { min-height: 76px; resize: vertical; font-family: var(--font-mono); font-size: 13px; }
.dt-help-text { font-size: 12px; color: var(--ink-tertiary); margin-top: 4px; }
/* Multiselect — chips inside a box */
.dt-multiselect {
width: 100%; background: var(--surface); border: 1px solid var(--border-strong);
border-radius: var(--r-sm); padding: 6px 8px; min-height: 38px;
display: flex; flex-wrap: wrap; gap: 6px; align-items: center;
}
.dt-ms-chip {
display: inline-flex; align-items: center; gap: 5px; background: var(--accent-fill);
color: var(--accent-hover); border-radius: var(--r-sm); padding: 3px 8px;
font-size: 12.5px; font-weight: 500;
}
.dt-ms-chip .x { color: var(--accent); font-size: 13px; }
.dt-ms-placeholder { color: var(--ink-tertiary); font-size: 13px; padding: 2px 4px; }
/* Checkbox / radio */
.dt-check { display: flex; align-items: center; gap: 9px; margin: 8px 0; font-size: 13.5px; color: var(--ink); }
.dt-check .box {
width: 18px; height: 18px; border-radius: 5px; border: 1px solid var(--border-strong);
background: var(--surface); display: inline-flex; align-items: center; justify-content: center; flex-shrink: 0;
}
.dt-check.on .box { background: var(--ink); border-color: var(--ink); color: var(--bg); }
.dt-check.on .box .dt-mi { font-family: "Material Symbols Outlined"; font-size: 14px; }
.dt-radio-row { display: flex; gap: 18px; flex-wrap: wrap; margin: 8px 0; }
.dt-radio { display: inline-flex; align-items: center; gap: 7px; font-size: 13.5px; }
.dt-radio .dot { width: 16px; height: 16px; border-radius: 50%; border: 1px solid var(--border-strong); display: inline-block; flex-shrink: 0; }
.dt-radio.on .dot { border: 5px solid var(--ink); }
/* Strategy precedence legend + overridden state (Fix Missing Values).
Makes the preset -> global -> per-column resolution order legible and
visibly dims a layer when a more specific layer wins. */
.dt-precedence {
display: flex; align-items: center; gap: 8px;
background: var(--surface-hover); border: 1px solid var(--border);
border-radius: var(--r-md); padding: 9px 13px; margin: 0 0 14px;
font-size: 12.5px; color: var(--ink-secondary); line-height: 1.4;
}
.dt-precedence .dt-mi { font-family: "Material Symbols Outlined"; font-size: 18px; color: var(--ink-tertiary); flex-shrink: 0; }
.dt-precedence strong { color: var(--ink); font-weight: 600; }
.dt-radio-row.is-overridden { opacity: 0.5; }
.dt-radio-row.is-overridden .dt-radio { text-decoration: line-through; text-decoration-color: var(--ink-tertiary); }
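/* Usage sketch, not taken from the app source: the legend copy, the "layers"
   icon, and the "Mean" / "Median" labels are illustrative assumptions. A global
   strategy row is dimmed because a per-column choice overrides it:
   <div class="dt-precedence">
     <span class="dt-mi">layers</span>
     <span>Preset, then global, then <strong>per-column</strong> (most specific wins).</span>
   </div>
   <div class="dt-radio-row is-overridden">
     <span class="dt-radio on"><span class="dot"></span>Mean</span>
     <span class="dt-radio"><span class="dot"></span>Median</span>
   </div> */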
/* Slider */
.dt-slider { margin: 14px 0 6px; }
.dt-slider .track { position: relative; height: 4px; background: var(--border-strong); border-radius: 2px; }
.dt-slider .fill { position: absolute; left: 0; top: 0; height: 4px; background: var(--ink); border-radius: 2px; }
.dt-slider .knob { position: absolute; top: 50%; width: 16px; height: 16px; border-radius: 50%; background: var(--ink); transform: translate(-50%, -50%); }
.dt-slider .val { font-family: var(--font-mono); font-size: 12px; color: var(--ink-secondary); margin-top: 8px; }
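/* Usage sketch (assumed markup; the 40% position and the 0.40 readout are
   illustrative). The rules above set no horizontal position, so each instance
   would supply the .fill width and the .knob left offset inline, using the
   same percentage:
   <div class="dt-slider">
     <div class="track">
       <div class="fill" style="width:40%"></div>
       <div class="knob" style="left:40%"></div>
     </div>
     <div class="val">0.40</div>
   </div> */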
/* ===========================================================================
Layout helpers
=========================================================================== */
.dt-row { display: flex; gap: 16px; }
.dt-row > * { flex: 1; min-width: 0; }
.dt-cols-2 { display: grid; grid-template-columns: 1fr 1fr; gap: 16px; }
.dt-cols-3 { display: grid; grid-template-columns: repeat(3, 1fr); gap: 16px; }
.dt-divider { border: none; border-top: 1px solid var(--border); margin: 22px 0; }
.dt-caption { font-size: 12.5px; color: var(--ink-tertiary); line-height: 1.5; }
.dt-spacer { height: 12px; }
/* ===========================================================================
DataFrame / preview table
=========================================================================== */
.dt-table-wrap { border: 1px solid var(--border); border-radius: var(--r-md); overflow: hidden; margin: 8px 0; }
table.dt-table { width: 100%; border-collapse: collapse; font-size: 13px; }
table.dt-table th {
background: var(--surface-hover); color: var(--ink-secondary); font-weight: 500;
text-align: left; padding: 8px 12px; border-bottom: 1px solid var(--border);
font-size: 12px; text-transform: none; white-space: nowrap;
}
table.dt-table td {
padding: 7px 12px; border-bottom: 1px solid var(--border);
font-family: var(--font-mono); font-size: 12.5px; color: var(--ink); font-feature-settings: "ss02"; white-space: nowrap;
}
table.dt-table tr:last-child td { border-bottom: none; }
table.dt-table tr:nth-child(even) td { background: #fcfbf8; }
table.dt-table td.idx { color: var(--ink-tertiary); background: var(--surface-hover); }
.dt-cell-flag { color: var(--warn); }
.dt-cell-del { color: var(--danger); text-decoration: line-through; }
.dt-cell-add { color: var(--success); }
/* ===========================================================================
Stats overview (home) — copied from _legacy.py
=========================================================================== */
.dt-stats { display: grid; grid-template-columns: repeat(4, 1fr); gap: 12px; margin: 8px 0 20px; }
.dt-stat { background: var(--surface); border: 1px solid var(--border); border-radius: var(--r-lg); padding: 16px 18px; box-shadow: 0 1px 2px rgba(28,25,23,0.03); }
.dt-stat-label { font-size: 11.5px; text-transform: uppercase; letter-spacing: 0.08em; color: var(--ink-tertiary); font-weight: 500; margin-bottom: 6px; line-height: 1.4; }
.dt-stat-value { font-size: 28px; font-weight: 600; letter-spacing: -0.03em; line-height: 1; color: var(--ink); display: flex; align-items: baseline; gap: 6px; }
.dt-stat-unit { font-size: 12px; font-weight: 400; color: var(--ink-tertiary); letter-spacing: 0; }
.dt-stat.is-warn .dt-stat-value { color: var(--warn); }
.dt-stat.is-info .dt-stat-value { color: var(--info); }
.dt-stat.is-success .dt-stat-value { color: var(--success); }
@media (max-width: 900px) { .dt-stats { grid-template-columns: repeat(2, 1fr); } }
/* Metric (st.metric) */
.dt-metrics { display: flex; gap: 28px; flex-wrap: wrap; margin: 6px 0 14px; }
.dt-metric .label { font-size: 12.5px; color: var(--ink-tertiary); margin-bottom: 4px; }
.dt-metric .value { font-size: 26px; font-weight: 600; letter-spacing: -0.03em; color: var(--ink); line-height: 1; }
.dt-metric .delta { font-size: 12.5px; margin-top: 3px; }
.dt-metric .delta.up { color: var(--success); }
.dt-metric .delta.down { color: var(--danger); }
/* ===========================================================================
Files card (home) — copied from _legacy.py
=========================================================================== */
.dt-files-section-head { display: flex; align-items: baseline; justify-content: space-between; margin: 4px 0 10px; gap: 12px; }
.dt-files-section-head h2 { margin: 0; }
.dt-section-meta { font-size: 12.5px; color: var(--ink-tertiary); }
.dt-file-row { display: flex; align-items: center; gap: 12px; }
.dt-file-icon-chip { width: 28px; height: 28px; border-radius: var(--r-sm); background: var(--accent-fill); color: var(--accent); display: inline-flex; align-items: center; justify-content: center; flex-shrink: 0; }
.dt-file-icon-chip svg { width: 14px; height: 14px; stroke-width: 1.8; }
.dt-file-name { font-family: var(--font-mono); font-size: 13px; color: var(--ink); font-feature-settings: "ss02"; }
.dt-file-size { font-family: var(--font-mono); font-size: 12px; color: var(--ink-tertiary); font-feature-settings: "ss02"; }
.dt-file-add {
display: flex; align-items: center; justify-content: center; gap: 8px;
width: 100%; padding: 12px 16px; background: var(--surface-hover);
border: none; border-top: 1px dashed var(--border-strong);
border-radius: 0 0 var(--r-lg) var(--r-lg); cursor: pointer;
font-size: 13px; font-weight: 500; color: var(--ink-secondary); margin-top: 14px;
}
.dt-file-add:hover { background: var(--accent-fill); color: var(--accent); }
.dt-file-add svg { width: 14px; height: 14px; stroke-width: 2; }
/* ===========================================================================
Findings panel — copied from _legacy.py
=========================================================================== */
.dt-finding-group-head {
display: flex; align-items: center; gap: 12px; padding: 16px 22px;
border-bottom: 1px solid var(--border); background: var(--surface-hover);
margin: -16px -16px 1.2rem; border-radius: var(--r-lg) var(--r-lg) 0 0;
cursor: pointer; user-select: none;
}
.dt-finding-group-chevron { color: var(--ink-tertiary); font-family: "Material Symbols Outlined"; font-size: 20px; line-height: 1; flex-shrink: 0; }
.dt-severity-dot { width: 8px; height: 8px; border-radius: 50%; flex-shrink: 0; display: inline-block; }
.dt-severity-dot.warn { background: var(--warn); }
.dt-severity-dot.info { background: var(--info); }
.dt-severity-dot.error { background: var(--danger); }
.dt-severity-dot.success { background: var(--success); }
.dt-group-filename { font-family: var(--font-mono); font-size: 13.5px; font-weight: 500; color: var(--ink); font-feature-settings: "ss02"; }
.dt-group-counts { margin-left: auto; display: flex; align-items: center; gap: 8px; }
.dt-count-pill { display: inline-flex; align-items: center; padding: 3px 9px; border-radius: 999px; font-size: 11.5px; font-weight: 500; line-height: 1.4; white-space: nowrap; }
.dt-count-pill.warn { background: var(--warn-fill); color: var(--warn); }
.dt-count-pill.info { background: var(--info-fill); color: var(--info); }
.dt-count-pill.error { background: var(--danger-fill); color: var(--danger); }
.dt-count-pill.success { background: var(--success-fill); color: var(--success); }
.dt-finding-row { display: flex; align-items: flex-start; gap: 12px; padding: 12px 0; border-top: 1px solid var(--border); }
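/* :first-of-type matches by tag name, so it never fires when a
div.dt-finding-group-head precedes the rows; also target the row right
after the head. */
.dt-finding-row:first-of-type, .dt-finding-group-head + .dt-finding-row { border-top: none; }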
.dt-finding-icon { width: 24px; height: 24px; border-radius: var(--r-sm); display: inline-flex; align-items: center; justify-content: center; flex-shrink: 0; }
.dt-finding-icon.warn { background: var(--warn-fill); color: var(--warn); }
.dt-finding-icon.info { background: var(--info-fill); color: var(--info); }
.dt-finding-icon.error { background: var(--danger-fill); color: var(--danger); }
.dt-finding-icon .dt-mi { font-family: "Material Symbols Outlined"; font-size: 16px; line-height: 1; }
.dt-finding-body { flex: 1; min-width: 0; }
.dt-finding-title { font-size: 14px; color: var(--ink); margin: 0 0 2px; line-height: 1.4; letter-spacing: -0.005em; }
.dt-finding-title strong { font-weight: 500; }
.dt-finding-meta { font-family: var(--font-mono); font-size: 12px; color: var(--ink-tertiary); line-height: 1.4; margin: 0; font-feature-settings: "ss02"; }
/* Overflow control — sits at the foot of a findings card when rows are hidden.
Bleeds to the card edges (cancels the .dt-card 16px padding) like .dt-file-add. */
.dt-finding-more {
display: flex; align-items: center; justify-content: center; gap: 6px;
width: calc(100% + 32px); margin: 4px -16px -16px;
padding: 11px 16px; background: var(--surface-hover);
border: none; border-top: 1px solid var(--border);
border-radius: 0 0 var(--r-lg) var(--r-lg); cursor: pointer;
font-family: var(--font-sans); font-size: 12.5px; font-weight: 500; color: var(--ink-secondary);
}
.dt-finding-more:hover { background: var(--accent-fill); color: var(--accent); }
.dt-finding-more .dt-mi { font-family: "Material Symbols Outlined"; font-size: 18px; }
/* Collapsed findings panel — the group head fills the whole card (head only,
no body). Proper state variant so the two states don't drift; replaces the
per-instance inline margin-bottom:-16px hack. */
.dt-card.is-collapsed { padding: 0; }
.dt-finding-group-head.is-collapsed { margin: 0; border-bottom: none; border-radius: var(--r-lg); }
/* Match-group review card (dedup) */
.dt-match-card { background: var(--surface); border: 1px solid var(--border); border-radius: var(--r-lg); box-shadow: 0 1px 2px rgba(28,25,23,0.03); margin: 12px 0; overflow: hidden; }
.dt-match-head { background: var(--surface-hover); border-bottom: 1px solid var(--border); padding: 12px 16px; display: flex; align-items: center; gap: 12px; }
.dt-match-head .title { font-weight: 500; font-size: 14px; }
.dt-match-head .conf { margin-left: auto; }
.dt-match-body { padding: 14px 16px; }
.dt-keep-row { background: var(--success-fill); }
.dt-keep-tag { display: inline-flex; align-items: center; gap: 4px; background: var(--success-fill); color: var(--success); border-radius: 999px; padding: 2px 8px; font-size: 11px; font-weight: 500; }
/* Progress bar */
.dt-progress { height: 6px; background: var(--border); border-radius: 3px; overflow: hidden; margin: 10px 0; }
.dt-progress .bar { height: 100%; background: var(--ink); border-radius: 3px; }
/* Tabs */
.dt-tabs { display: flex; gap: 18px; border-bottom: 1px solid var(--border); margin: 10px 0 16px; }
.dt-tab { font-size: 13.5px; color: var(--ink-secondary); padding: 8px 2px; border-bottom: 2px solid transparent; cursor: pointer; }
.dt-tab.is-active { color: var(--ink); font-weight: 500; border-bottom-color: var(--accent); }
/* Code block */
.dt-code { background: var(--surface-hover); border: 1px solid var(--border); border-radius: var(--r-md); padding: 12px 14px; font-family: var(--font-mono); font-size: 12.5px; color: var(--ink); white-space: pre; overflow-x: auto; font-feature-settings: "ss02"; }
@media (max-width: 1100px) {
.dt-footer { left: 0; }
.dt-sidebar { display: none; }
.dt-main { padding: 28px 24px 96px; }
}

206
layout-review/home.html Normal file

@@ -0,0 +1,206 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — File Analysis (Home)</title>
<link rel="stylesheet" href="app.css">
</head>
<body data-page="home">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of the <strong>Home / File Analysis</strong> page, shown with three imported files in the post-analysis state. <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Page header: brand block + privacy pill -->
<header class="dt-page-header">
<div class="dt-page-brand">
<div class="dt-page-brand-row">
<div class="dt-page-brand-mark">D</div>
<div class="dt-page-brand-words">
<span class="dt-page-eyebrow">UNALOGIX</span>
<h1 class="dt-page-wordmark">DataTools</h1>
</div>
</div>
<p class="dt-page-subtitle">Clean. Normalize. Transform.</p>
</div>
<span class="dt-privacy-pill">
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor">
<rect x="4" y="11" width="16" height="10" rx="2"/>
<path d="M8 11V7a4 4 0 018 0v4"/>
</svg>
Runs 100% locally
</span>
</header>
<!-- Files section head -->
<div class="dt-files-section-head">
<h2>Files</h2>
<span class="dt-section-meta">3 files · 4.7 MB total</span>
</div>
<!-- Files card -->
<div class="dt-card" style="padding-bottom:0">
<div class="dt-file-row" style="padding:6px 0">
<button class="dt-btn dt-btn-tertiary" title="Remove"></button>
<span class="dt-file-icon-chip"><svg viewBox="0 0 24 24" fill="none" stroke="currentColor"><path d="M14 2H6a2 2 0 00-2 2v16a2 2 0 002 2h12a2 2 0 002-2V8z"/><path d="M14 2v6h6"/></svg></span>
<span class="dt-file-name">customers_export.csv</span>
<span class="dt-file-size" style="margin-left:auto">2.1 MB</span>
</div>
<div class="dt-file-row" style="padding:6px 0">
<button class="dt-btn dt-btn-tertiary" title="Remove"></button>
<span class="dt-file-icon-chip"><svg viewBox="0 0 24 24" fill="none" stroke="currentColor"><path d="M14 2H6a2 2 0 00-2 2v16a2 2 0 002 2h12a2 2 0 002-2V8z"/><path d="M14 2v6h6"/></svg></span>
<span class="dt-file-name">q3_transactions.xlsx</span>
<span class="dt-file-size" style="margin-left:auto">1.8 MB</span>
</div>
<div class="dt-file-row" style="padding:6px 0">
<button class="dt-btn dt-btn-tertiary" title="Remove"></button>
<span class="dt-file-icon-chip"><svg viewBox="0 0 24 24" fill="none" stroke="currentColor"><path d="M14 2H6a2 2 0 00-2 2v16a2 2 0 002 2h12a2 2 0 002-2V8z"/><path d="M14 2v6h6"/></svg></span>
<span class="dt-file-name">vendor_list.csv</span>
<span class="dt-file-size" style="margin-left:auto">0.8 MB</span>
</div>
<button class="dt-file-add" style="margin-left:-16px;margin-right:-16px;width:calc(100% + 32px)">
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor"><path d="M12 5v14M5 12h14"/></svg> Add more files
</button>
</div>
<!-- Action bar -->
<div class="dt-btn-row" style="margin-top:16px">
<button class="dt-btn dt-btn-primary" style="flex:0 0 auto">Run analysis</button>
<button class="dt-btn" style="flex:0 0 auto">Clear results</button>
</div>
<hr class="dt-divider">
<!-- Stats overview -->
<div class="dt-stats">
<div class="dt-stat">
<div class="dt-stat-label">Rows scanned</div>
<div class="dt-stat-value">48,210 <span class="dt-stat-unit">rows</span></div>
</div>
<div class="dt-stat">
<div class="dt-stat-label">Total findings</div>
<div class="dt-stat-value">14</div>
</div>
<div class="dt-stat is-warn">
<div class="dt-stat-label">Warnings</div>
<div class="dt-stat-value">9 <span class="dt-stat-unit">to review</span></div>
</div>
<div class="dt-stat is-info">
<div class="dt-stat-label">Info</div>
<div class="dt-stat-value">5 <span class="dt-stat-unit">suggestions</span></div>
</div>
</div>
<!-- ======================================================================
FRONT DOOR: the primary path. A friendly front end for the orchestrator
(09_pipeline_runner): it maps the analyzer's findings to the recommended
pipeline (Clean Text → Standardize → Fix Missing → Find Duplicates),
runs the steps in order, and returns a downloadable result. This card is
the hero of the page; the per-file findings below remain as the manual
"fix one thing at a time" path.
====================================================================== -->
<div class="dt-card" style="border-color:var(--accent);background:var(--accent-fill);box-shadow:0 1px 2px rgba(28,25,23,0.03),0 0 0 1px var(--accent)">
<div style="display:flex;align-items:flex-start;gap:14px;flex-wrap:wrap">
<span class="dt-file-icon-chip" style="width:36px;height:36px;border-radius:var(--r-md)">
<span class="dt-mi" style="font-family:'Material Symbols Outlined';font-size:20px">auto_awesome</span>
</span>
<div style="flex:1;min-width:240px">
<h3 style="margin:0 0 4px;color:var(--ink)">Recommended</h3>
<p style="margin:0;color:var(--ink-secondary)">Runs the recommended cleanup in the right order (fix text, standardize formats, fill blanks, remove duplicates), then hands you the cleaned file.</p>
</div>
<button class="dt-btn dt-btn-primary" style="flex:0 0 auto;align-self:center">
<span class="dt-mi">auto_fix_high</span> Clean these files for me
</button>
</div>
<!-- Pipeline-step affordance: the order the findings will be resolved in -->
<div style="display:flex;align-items:center;gap:6px;flex-wrap:wrap;margin-top:14px;padding-top:12px;border-top:1px solid var(--accent-fill-strong)">
<span class="dt-count-pill" style="background:var(--surface);color:var(--ink-secondary)">1 · Clean Text</span>
<span class="dt-mi" style="font-family:'Material Symbols Outlined';font-size:16px;color:var(--accent)">arrow_forward</span>
<span class="dt-count-pill" style="background:var(--surface);color:var(--ink-secondary)">2 · Standardize</span>
<span class="dt-mi" style="font-family:'Material Symbols Outlined';font-size:16px;color:var(--accent)">arrow_forward</span>
<span class="dt-count-pill" style="background:var(--surface);color:var(--ink-secondary)">3 · Fix Missing</span>
<span class="dt-mi" style="font-family:'Material Symbols Outlined';font-size:16px;color:var(--accent)">arrow_forward</span>
<span class="dt-count-pill" style="background:var(--surface);color:var(--ink-secondary)">4 · Find Duplicates</span>
<span class="dt-caption" style="margin-left:auto">Result downloads when finished</span>
</div>
</div>
<!-- Secondary / manual path — keep full control over each fix -->
<h3 style="margin-top:24px">Or fix issues one at a time</h3>
<p class="dt-caption" style="margin:-2px 0 4px">Prefer to handle things yourself? Open any finding to jump straight to the right tool.</p>
<!-- Per-file findings panel #1 -->
<div class="dt-card">
<div class="dt-finding-group-head">
<span class="dt-finding-group-chevron" style="transform:rotate(90deg)">chevron_right</span>
<span class="dt-severity-dot warn"></span>
<span class="dt-group-filename">customers_export.csv</span>
<div class="dt-group-counts">
<span class="dt-count-pill warn">6 warnings</span>
<span class="dt-count-pill info">2 info</span>
</div>
</div>
<div class="dt-finding-row">
<span class="dt-finding-icon warn"><span class="dt-mi">priority_high</span></span>
<div class="dt-finding-body">
<p class="dt-finding-title"><strong>312 duplicate rows</strong> across exact + near matches</p>
<p class="dt-finding-meta">column: email · Find Duplicates →</p>
</div>
</div>
<div class="dt-finding-row">
<span class="dt-finding-icon warn"><span class="dt-mi">format_color_text</span></span>
<div class="dt-finding-body">
<p class="dt-finding-title"><strong>1,204 cells</strong> with leading / trailing whitespace</p>
<p class="dt-finding-meta">columns: name, city · Clean Text →</p>
</div>
</div>
<div class="dt-finding-row">
<span class="dt-finding-icon info"><span class="dt-mi">event</span></span>
<div class="dt-finding-body">
<p class="dt-finding-title">Mixed date formats in <strong>signup_date</strong></p>
<p class="dt-finding-meta">3 formats detected · Standardize Formats →</p>
</div>
</div>
<button class="dt-finding-more">
<span class="dt-mi">expand_more</span> Show all 8 findings · 5 more
</button>
</div>
<!-- Per-file findings panel #2 (collapsed) -->
<div class="dt-card is-collapsed">
<div class="dt-finding-group-head is-collapsed">
<span class="dt-finding-group-chevron">chevron_right</span>
<span class="dt-severity-dot warn"></span>
<span class="dt-group-filename">q3_transactions.xlsx</span>
<div class="dt-group-counts">
<span class="dt-count-pill warn">3 warnings</span>
<span class="dt-count-pill info">3 info</span>
</div>
</div>
</div>
<!-- Per-file findings panel #3 (clean) -->
<div class="dt-card is-collapsed">
<div class="dt-finding-group-head is-collapsed">
<span class="dt-severity-dot success"></span>
<span class="dt-group-filename">vendor_list.csv</span>
<div class="dt-group-counts">
<span class="dt-count-pill success">no issues</span>
</div>
</div>
</div>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>

71
layout-review/index.html Normal file

@@ -0,0 +1,71 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>DataTools — Layout Review</title>
<link rel="stylesheet" href="app.css">
<style>
.lr-wrap { max-width: 960px; margin: 0 auto; padding: 48px 32px 80px; }
.lr-grid { display: grid; grid-template-columns: repeat(2, 1fr); gap: 14px; margin-top: 18px; }
.lr-card { display: flex; align-items: center; gap: 14px; background: var(--surface); border: 1px solid var(--border); border-radius: var(--r-lg); padding: 16px 18px; box-shadow: 0 1px 2px rgba(28,25,23,0.03); text-decoration: none; transition: border-color .12s ease, box-shadow .12s ease; }
.lr-card:hover { border-color: var(--border-strong); box-shadow: 0 2px 8px rgba(28,25,23,0.06); text-decoration: none; }
.lr-ico { width: 40px; height: 40px; border-radius: var(--r-md); background: var(--accent-fill); color: var(--accent); display: inline-flex; align-items: center; justify-content: center; flex-shrink: 0; }
.lr-ico .dt-mi { font-family: "Material Symbols Outlined"; font-size: 22px; }
.lr-body { min-width: 0; }
.lr-name { font-size: 15px; font-weight: 600; color: var(--ink); letter-spacing: -0.01em; display:flex; align-items:center; gap:8px; }
.lr-desc { font-size: 12.5px; color: var(--ink-secondary); margin-top: 2px; line-height: 1.45; }
.lr-sec { font-size: 11.5px; text-transform: uppercase; letter-spacing: 0.08em; color: var(--ink-tertiary); font-weight: 600; margin: 26px 0 2px; }
.lr-soon { font-size: 9px; font-weight: 600; letter-spacing: .06em; text-transform: uppercase; color: var(--ink-tertiary); border: 1px solid var(--border-strong); border-radius: 999px; padding: 1px 6px; }
</style>
</head>
<body>
<div class="lr-wrap">
<header class="dt-page-header">
<div class="dt-page-brand">
<div class="dt-page-brand-row">
<div class="dt-page-brand-mark">D</div>
<div class="dt-page-brand-words">
<span class="dt-page-eyebrow">UNALOGIX · LAYOUT REVIEW</span>
<h1 class="dt-page-wordmark">DataTools</h1>
</div>
</div>
<p class="dt-page-subtitle">Static HTML reproductions of every tool page, built from the live app's design tokens for human review of layouts.</p>
</div>
</header>
<div class="dt-alert info">
<span class="dt-mi">info</span>
<span>These are faithful static mockups — not the running Streamlit app. Colors, type scale, spacing, and components are copied verbatim from <code>theme.py</code> and <code>components/_legacy.py</code>. Each page is shown in a representative <strong>populated</strong> state so the layout can be reviewed end-to-end. Fonts load from Google Fonts (needs network); the chrome (sidebar + footer) is shared across every page.</span>
</div>
<div class="lr-sec">Analysis</div>
<div class="lr-grid">
<a class="lr-card" href="home.html"><span class="lr-ico"><span class="dt-mi">insert_chart_outlined</span></span><span class="lr-body"><span class="lr-name">File Analysis (Home)</span><span class="lr-desc">Import files, run the analyzer, browse per-file findings.</span></span></a>
<a class="lr-card" href="11_reconciler.html"><span class="lr-ico"><span class="dt-mi">compare_arrows</span></span><span class="lr-body"><span class="lr-name">Reconcile Two Files</span><span class="lr-desc">Compare two lists of transactions and flag what doesn't match.</span></span></a>
</div>
<div class="lr-sec">Data Cleaners</div>
<div class="lr-grid">
<a class="lr-card" href="04_missing_handler.html"><span class="lr-ico"><span class="dt-mi">help_outline</span></span><span class="lr-body"><span class="lr-name">Fix Missing Values</span><span class="lr-desc">Find blank cells (even hidden ones) and fill them in or remove them.</span></span></a>
<a class="lr-card" href="06_outlier_detector.html"><span class="lr-ico"><span class="dt-mi">insights</span></span><span class="lr-body"><span class="lr-name">Find Unusual Values <span class="lr-soon">Soon</span></span><span class="lr-desc">Spot values that look wrong — too high, too low, or rule-breaking.</span></span></a>
<a class="lr-card" href="02_text_cleaner.html"><span class="lr-ico"><span class="dt-mi">text_format</span></span><span class="lr-body"><span class="lr-name">Clean Text</span><span class="lr-desc">Trim extra spaces and strip out odd characters.</span></span></a>
<a class="lr-card" href="03_format_standardizer.html"><span class="lr-ico"><span class="dt-mi">format_list_bulleted</span></span><span class="lr-body"><span class="lr-name">Standardize Formats</span><span class="lr-desc">Make dates, phones, currency, and names look the same throughout.</span></span></a>
<a class="lr-card" href="01_deduplicator.html"><span class="lr-ico"><span class="dt-mi">search</span></span><span class="lr-body"><span class="lr-name">Find Duplicates</span><span class="lr-desc">Find rows that repeat, then keep one and remove the extras.</span></span></a>
<a class="lr-card" href="08_validator_reporter.html"><span class="lr-ico"><span class="dt-mi">check_circle</span></span><span class="lr-body"><span class="lr-name">Quality Check <span class="lr-soon">Soon</span></span><span class="lr-desc">Check your file against rules and export a PDF or Excel report.</span></span></a>
</div>
<div class="lr-sec">Transformations</div>
<div class="lr-grid">
<a class="lr-card" href="05_column_mapper.html"><span class="lr-ico"><span class="dt-mi">view_column</span></span><span class="lr-body"><span class="lr-name">Map Columns</span><span class="lr-desc">Rename columns, reorder, and set each one as text, number, or date.</span></span></a>
<a class="lr-card" href="07_multi_file_merger.html"><span class="lr-ico"><span class="dt-mi">account_tree</span></span><span class="lr-body"><span class="lr-name">Combine Files <span class="lr-soon">Soon</span></span><span class="lr-desc">Combine several CSV or Excel files into one — even if columns differ.</span></span></a>
<a class="lr-card" href="10_pdf_extractor.html"><span class="lr-ico"><span class="dt-mi">picture_as_pdf</span></span><span class="lr-body"><span class="lr-name">PDF to CSV</span><span class="lr-desc">Pull transactions out of bank-statement PDFs into a clean CSV file.</span></span></a>
</div>
<div class="lr-sec">Automations</div>
<div class="lr-grid">
<a class="lr-card" href="09_pipeline_runner.html"><span class="lr-ico"><span class="dt-mi">auto_awesome</span></span><span class="lr-body"><span class="lr-name">Automated Workflows</span><span class="lr-desc">Run several tools in a row — save the steps and reuse them anytime.</span></span></a>
</div>
</div>
</body>
</html>

83
layout-review/shell.js Normal file

@@ -0,0 +1,83 @@
/* Shared app chrome (sidebar nav + sticky footer) for the static layout
review pages. Mirrors src/gui/app.py:_build_navigation() ordering and
src/gui/components/_legacy.py:render_sticky_footer(). Each page sets
<body data-page="<tool_id|home>"> to mark the active nav item. */
(function () {
// Front-door entry — rendered standalone above the section groups.
var START = { id: "home", icon: "insert_chart_outlined", name: "Start here", href: "home.html" };
// Sections + entries in pipeline / job order.
var NAV = [
{ label: "Data Cleaners", items: [
{ id: "02_text_cleaner", icon: "text_format", name: "Clean Text", href: "02_text_cleaner.html" },
{ id: "03_format_standardizer", icon: "format_list_bulleted", name: "Standardize Formats", href: "03_format_standardizer.html" },
{ id: "04_missing_handler", icon: "help_outline", name: "Fix Missing Values", href: "04_missing_handler.html" },
{ id: "01_deduplicator", icon: "search", name: "Find Duplicates", href: "01_deduplicator.html" },
]},
{ label: "Transformations", items: [
{ id: "05_column_mapper", icon: "view_column", name: "Map Columns", href: "05_column_mapper.html" },
]},
{ label: "Automations", items: [
{ id: "09_pipeline_runner", icon: "auto_awesome", name: "Automated Workflows", href: "09_pipeline_runner.html" },
]},
{ label: "Finance", items: [
{ id: "11_reconciler", icon: "compare_arrows", name: "Reconcile Two Files", href: "11_reconciler.html" },
{ id: "10_pdf_extractor", icon: "picture_as_pdf", name: "PDF to CSV", href: "10_pdf_extractor.html" },
]},
{ label: "Coming soon", items: [
{ id: "06_outlier_detector", icon: "insights", name: "Find Unusual Values", href: "06_outlier_detector.html", soon: true },
{ id: "08_validator_reporter", icon: "check_circle", name: "Quality Check", href: "08_validator_reporter.html", soon: true },
{ id: "07_multi_file_merger", icon: "account_tree", name: "Combine Files", href: "07_multi_file_merger.html", soon: true },
]},
];
var active = document.body.getAttribute("data-page") || "";
// ---- Sidebar -----------------------------------------------------------
var sb = document.getElementById("dt-sidebar");
if (sb) {
var html = '' +
'<a class="dt-brand" href="index.html" style="text-decoration:none">' +
'<span class="dt-brand-mark">D</span>' +
'<span class="dt-brand-name">' +
'<span class="dt-brand-eyebrow">UNALOGIX</span>' +
'<span class="dt-brand-word">DataTools</span>' +
'</span>' +
'</a>' +
'<nav class="dt-nav">';
var startCls = "dt-nav-link dt-nav-start" + (START.id === active ? " is-active" : "");
html += '<a class="' + startCls + '" href="' + START.href + '">' +
'<span class="dt-mi">' + START.icon + '</span>' +
'<span>' + START.name + '</span>' +
'</a>';
NAV.forEach(function (sec) {
var indicator = "";
html += '<div class="dt-nav-section">' + sec.label +
'<span class="dt-nav-indicator">' + indicator + '</span></div>';
sec.items.forEach(function (it) {
var cls = "dt-nav-link" + (it.id === active ? " is-active" : "") + (it.soon ? " is-soon" : "");
html += '<a class="' + cls + '" href="' + it.href + '">' +
'<span class="dt-mi">' + it.icon + '</span>' +
'<span>' + it.name + '</span>' +
(it.soon ? '<span class="dt-nav-soon-tag">Soon</span>' : '') +
'</a>';
});
});
html += '</nav>' +
'<div class="dt-sidebar-foot">' +
'<div><div class="dt-sidebar-label">Language</div>' +
'<div class="dt-select" style="pointer-events:none">English</div></div>' +
'<div class="dt-license-badge">Core · 1,820 days left</div>' +
'</div>';
sb.innerHTML = html;
}
// ---- Sticky footer -----------------------------------------------------
var ft = document.getElementById("dt-footer");
if (ft) {
ft.innerHTML =
'<a class="dt-footer-btn" href="index.html"><span class="dt-mi">close</span>Close</a>' +
'<button class="dt-footer-btn" type="button"><span class="dt-mi">help_outline</span>Help</button>' +
'<span style="margin-left:auto;font-size:11.5px;color:var(--ink-tertiary)">DataTools · local-first · static layout preview</span>';
}
})();


@@ -12,9 +12,14 @@ markers =
e2e: end-to-end CLI / integration tests
install: import / dependency sanity tests
fixture_sweep: parametrized sweep over the test-cases/ folder
gui: Streamlit AppTest-driven tests (live in tests/gui/)
# Warnings discipline: fail on unexpected DeprecationWarning from our own
# code, but tolerate third-party deprecations that we can't fix.
# Warnings discipline: fail on any DeprecationWarning *or* ResourceWarning
# from our own ``src`` package so a leaked file handle or stale stdlib call
# can't slip in unnoticed. Tolerate third-party deprecations / resource
# warnings — we can't fix pandas / openpyxl / streamlit churn from here.
filterwarnings =
ignore::DeprecationWarning
ignore::ResourceWarning
error::DeprecationWarning:src
error::ResourceWarning:src
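
Ordering matters for these entries: pytest documents that when a warning matches more than one `filterwarnings` line, the action of the last matching line is performed. The sketch below uses only the stdlib `warnings` module, whose `filterwarnings()` gives the most recently added filter the highest priority, to show the effect of the two possible orders. The helper name `outcome` and the module names `src.cleaners` and `pandas.core` are illustrative assumptions, not names from this repository.

```python
import warnings


def outcome(filters, category, module):
    """Apply (action, category, module_regex) filters in order, emit one
    warning attributed to `module`, and report what happened to it."""
    with warnings.catch_warnings():
        warnings.resetwarnings()
        for action, cat, mod in filters:
            # Each call inserts at the front of the filter list, so a later
            # entry takes precedence over an earlier one.
            warnings.filterwarnings(action, category=cat, module=mod)
        try:
            warnings.warn_explicit(
                "demo", category, filename="demo.py", lineno=1,
                module=module, registry={},
            )
        except category:
            return "error"
        return "ignored"


# Broad ignore first, narrow error last: warnings from "src" are escalated.
broad_first = [
    ("ignore", DeprecationWarning, ""),
    ("error", DeprecationWarning, "src"),
]
# Narrow error first, broad ignore last: the ignore rule shadows the error rule.
narrow_first = list(reversed(broad_first))

print(outcome(broad_first, DeprecationWarning, "src.cleaners"))
print(outcome(broad_first, DeprecationWarning, "pandas.core"))
print(outcome(narrow_first, DeprecationWarning, "src.cleaners"))
```

Under this model, an `error::...:src` entry only takes effect when it is listed after the catch-all `ignore` entry for the same category.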


@@ -1,2 +1,6 @@
pytest>=8.0,<9
pytest-cov>=5.0,<6
# Test-only: generate small fixture PDFs in
# tests/test_pdf_extract_smoke.py so we can exercise pdfplumber +
# pypdfium2 end-to-end without committing binary fixtures.
fpdf2==2.8.7


@@ -8,3 +8,16 @@ tqdm>=4.66,<5
typer>=0.12,<1
phonenumbers>=8.13,<9
streamlit>=1.35,<2
cryptography>=41,<49
# PDF Extractor stack — pinned to exact tested versions so a future
# upstream release can't quietly change pdfplumber's word-position
# behavior or pypdfium2's OCR rendering mid-build. Bump these
# explicitly when re-testing against a new release.
#
# ``pypdfium2`` is here for the OCR fallback path only (rasterizing
# pages to images for Tesseract). The drawable-canvas dep was
# removed when the visual picker was ripped out — the scanner is
# pure heuristic now, no coordinate UI.
pdfplumber==0.11.9
pypdfium2==5.8.0
pytesseract==0.3.13
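
Because these three packages are pinned to exact versions, a quick environment check can confirm that what is installed matches the pins. The following is a minimal sketch, assuming the pins are copied by hand into a dict; `pin_mismatches` is a hypothetical helper, not part of this repository, and it relies only on the stdlib `importlib.metadata` API.

```python
from importlib import metadata

# Pins copied from the requirements hunk above.
PINS = {
    "pdfplumber": "0.11.9",
    "pypdfium2": "5.8.0",
    "pytesseract": "0.3.13",
}


def pin_mismatches(pins, get_version=metadata.version):
    """Return {package: installed_version_or_None} for every package whose
    installed version differs from its pin (None means not installed)."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = found
    return mismatches


if __name__ == "__main__":
    for name, found in pin_mismatches(PINS).items():
        print(f"{name}: pinned {PINS[name]}, installed {found}")
```

The `get_version` parameter exists only so the check can be exercised without the packages installed.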


@@ -1,31 +0,0 @@
Lead ID,First Name,Last Name,Company,Title,Email,Phone,Country,Source,Score,Last Activity,Tags
HUB-001,Alice,Johnson,Acme Corp,VP Marketing,alice@acme.com,(415) 555-1234,USA,HubSpot,87,2025-12-04,Enterprise
HUB-002,bob,smith,Beta LLC,Director Growth,bob@beta.com,N/A,United States,HubSpot,N/A,2025-11-22,SMB
HUB-003,Carlos,Garcia,Gamma Inc,CEO,carlos@gamma.io,+34 91 411 1111,Spain,HubSpot,82,2025-10-30,Enterprise
HUB-004,DIANA,LEE,Delta Co,Marketing Manager,diana@delta.com,020 7946 0958,United Kingdom,HubSpot,74,2025-12-15,Mid-Market
HUB-005,Eve,Martinez,Epsilon Group,VP Ops,eve@epsilon.com,(none),Mexico,HubSpot,(blank),2025-09-15,SMB
LIN-006,Alice,Johnson,Acme Corporation,VP of Marketing,Alice.Johnson@acme.com,4155551234,US,LinkedIn,,2025-12-04,Enterprise
LIN-007,Frank,Brown,Foxtrot Ltd,Head Sales,frank@foxtrot.de,+49 30 12345678,Germany,LinkedIn,68,2025-12-01,Mid-Market
LIN-008,Grace,Davis,Golf Industries,Marketing Lead,grace@golfind.com,+44 20 7946 0958,UK,LinkedIn,79,2025-11-08,Mid-Market
LIN-009,henry,wilson,Hotel Logistics,COO,henry@hotellog.com,+86 10 1234 5678,China,LinkedIn,91,2025-12-12,Enterprise
LIN-010,IVY CHEN,,India Tech,CTO,ivy@indiatech.in,+91 11 2345 6789,IN,LinkedIn,88,2025-11-30,Enterprise
LIN-011,Jack,Taylor,Juliet & Co,Founder,jack@juliet.co,unknown,United States,LinkedIn,?,(unknown),SMB
SCR-012,Diana,Lee,Delta Company,Marketing Manager,diana@delta.com,020-7946-0958,UK,Manual Scrape,74,12/15/2025,Mid-Market
SCR-013,kate,o'neil,Kilo Ventures,Partner,kate@kilo.vc,+1 415 555 2222,USA,Manual Scrape,N/A,?,Investor
SCR-014,Carlos,García,Gamma Incorporated,CEO,Carlos@gamma.io,+34-91-411-1111,Spain,Manual Scrape,82,Oct 30 2025,Enterprise
SCR-015,Liam,Park,Lima Solutions,Director Marketing,liam@limasol.kr,+82 2 2287 0114,South Korea,Manual Scrape,77,2025-11-20,Enterprise
SCR-016,Mia,nguyen,Mike Corp,VP Marketing,mia@mikecorp.com.au,02 9374 4000,Australia,Manual Scrape,72,2025-10-05,Mid-Market
SCR-017,Noah,Brown,November Inc,Head of Growth,noah@november.com,(555) 444-5555,US,Manual Scrape,,#N/A,SMB
HUB-018,Frank,Brown,Foxtrot,Head of Sales,Frank@Foxtrot.de,+49-30-12345678,Germany,HubSpot,68,2025-12-01,Mid-Market
HUB-019,Olivia,Rossi,Oscar Italia,CMO,olivia@oscar.it,+39 06 6982,Italy,HubSpot,85,2025-12-08,Enterprise
HUB-020,papa,wong,Papa Trading,Founder,papa@papatrading.hk,+852 2123 4567,Hong Kong,HubSpot,69,2025-11-15,SMB
LIN-021,Quinn,Reyes,Quebec Group,VP Sales,quinn@quebec.mx,+52 55 5555 0000,Mexico,LinkedIn,80,2025-12-05,Mid-Market
LIN-022,Robert,Tan,Romeo Logistics,Director,r.tan@romeo.sg,+65 6123 4567,Singapore,LinkedIn,76,2025-11-28,Mid-Market
SCR-023,Sara,Khan,Sierra Foods,Head Marketing,sara@sierra.in,+91-22-1234-5678,India,Manual Scrape,73,2025-12-02,SMB
SCR-024,bob,Smith,Beta,Director Growth,Bob@Beta.com,(none),United States,Manual Scrape,(unknown),(unknown),SMB
HUB-025,Tara,Levi,Tango Tech,VP Product,tara@tango.il,+972 3 6957 0000,Israel,HubSpot,82,2025-12-10,Enterprise
HUB-026,Uma,Patel,Uniform Health,CMO,uma at uniform dot com,+44 20 7946 8888,United Kingdom,HubSpot,71,2025-12-12,Enterprise
LIN-027,Victor,Lee,Victor Co,Director,victor@@victorco.com,+1 415 555 8888,USA,LinkedIn,69,2025-11-30,SMB
SCR-028,Wendy,Akin,Whiskey Inc,CMO,wendy@whiskey.tr,+90 212 252 1111,Turkey,Manual Scrape,77,2025-12-04,Mid-Market
SCR-029,Xander,Ng,Xray Group,Founder,xander@xray.sg,+65 6234 5678,Singapore,Manual Scrape,65,2025-11-15,Suppressed
HUB-030,Yara,Costa,Yankee Foods,Marketing Lead,yara@yankee.br,+55 11 3071 2222,Brazil,HubSpot,,2025-12-15,Opted Out


@@ -1,74 +0,0 @@
{
"steps": [
{
"tool": "text_clean",
"options": {},
"enabled": true,
"name": "1. Clean text (whitespace + smart quotes from copy-paste)"
},
{
"tool": "format_standardize",
"options": {
"column_types": {
"First Name": "name",
"Last Name": "name",
"Company": "name",
"Email": "email",
"Phone": "phone"
},
"phone_country_column": "Country",
"phone_format": "E164",
"email_gmail_canonical": true
},
"enabled": true,
"name": "2. E.164 phones (per-row country) · canonical emails · name casing"
},
{
"tool": "missing",
"options": {
"strategy": "none",
"standardize_sentinels": true,
"sentinels": ["N/A", "n/a", "—", "?", "(unknown)", "unknown", "(blank)", "(none)", "TBD", "#N/A"]
},
"enabled": true,
"name": "3. Standardize sentinels across vendor exports"
},
{
"tool": "column_map",
"options": {
"schema": {
"fields": [
{"name": "Lead ID", "dtype": "string", "required": true},
{"name": "First Name", "dtype": "string"},
{"name": "Last Name", "dtype": "string"},
{"name": "Company", "dtype": "string"},
{"name": "Title", "dtype": "string"},
{"name": "Email", "dtype": "string"},
{"name": "Phone", "dtype": "string"},
{"name": "Country", "dtype": "string"},
{"name": "Source", "dtype": "string"},
{"name": "Score", "dtype": "integer"},
{"name": "Last Activity", "dtype": "date"},
{"name": "Tags", "dtype": "string"}
]
},
"auto_infer": true,
"unmapped": "keep",
"coerce_types": true,
"reorder_to_schema": true,
"enforce_required": false
},
"enabled": true,
"name": "4. Coerce types · reorder to canonical schema"
},
{
"tool": "dedup",
"options": {
"survivor_rule": "most_complete",
"merge": true
},
"enabled": true,
"name": "5. Dedup leads across HubSpot / LinkedIn / Manual Scrape (fuzzy + merge)"
}
]
}


@@ -0,0 +1,27 @@
Invoice,Client,Email,Invoice_Date,Due_Date,Amount,Status
INV-1007,ACME LLC,AP@Acme.com,03/04/2025,04/03/2025,"$1,250.00",Open
INV-1007, Acme LLC ,ap@acme.com,2025-03-04,2025-04-03,"1,250.00",(blank)
INV-1001,northwind traders,billing@northwind.com,Mar 6 2025,04/05/2025,$980,Overdue
INV-1002,Globex Corp,AR@Globex.com,3/11/25,4/10/25,"2,400.50",Sent
INV-1011,initech,accounts@initech.com,04/01/2025,05/01/2025,"$ 1,100.00",?
INV-1011,Initech,Accounts@Initech.com,2025-04-01,2025-05-01,1100,Open
INV-1003,Stark Industries,ap@stark.com,Mar 6 2025,Apr 6 2025,$75.00,Open
INV-1004,Wayne Enterprises,ar@wayne.com,03/15/2025,04/14/2025,($300.00),
INV-1015,Hooli,billing@hooli.com,3/11/25,4/10/25,"$4,300.00",Overdue
INV-1015,hooli,Billing@Hooli.com,2025-03-11,2025-04-10,4300,(none)
INV-1005,Soylent Corp,ap@soylent.com,2025-03-20,2025-04-19,"$1,875.25",Sent
INV-1006,Umbrella Co,ar@umbrella.com,03/22/2025,04/21/2025,$640.00,TBD
INV-1019,Cyberdyne Systems,ap@cyberdyne.com,Mar 25 2025,04/24/2025,"$2,050.00",unknown
INV-1019,cyberdyne systems,AP@Cyberdyne.com,2025-03-25,2025-04-24,"2,050.00",Open
INV-1008,Vandelay Industries,ar@vandelay.com,3/28/25,4/27/25,$915.00,Overdue
INV-1009,Gekko & Co,billing@gekko.com,2025-03-30,2025-04-29,"$3,120.75",Open
INV-1010,Pied Piper,ap@piedpiper.com,04/02/2025,05/02/2025,$180,Sent
INV-1023,Tyrell Corp,ar@tyrell.com,04/05/2025,05/05/2025,($300.00),(blank)
INV-1023,Tyrell Corp,AR@Tyrell.com,2025-04-05,2025-05-05,-300.00,Open
INV-1012,Oscorp,ap@oscorp.com,Apr 8 2025,05/08/2025,"$5,000.00",Overdue
INV-1013,Nakatomi Trading,ar@nakatomi.com,4/9/25,5/9/25,$725.50,Sent
INV-1014,Bluth Company,billing@bluth.com,2025-04-10,2025-05-10,"$1,420.00",Open
INV-1016,Dunder Mifflin,ap@dundermifflin.com,04/12/2025,05/12/2025,$960.00,Overdue
INV-1017,Prestige Worldwide,ar@prestige.com,Apr 14 2025,05/14/2025,"$2,680.00",Sent
INV-1018,Sterling Cooper,billing@sterlingcooper.com,4/15/25,5/15/25,"$3,950.00",Open
INV-1020,Wonka Industries,ap@wonka.com,2025-04-18,2025-05-18,"$1,050.00",Overdue


@@ -0,0 +1,50 @@
{
"steps": [
{
"tool": "text_clean",
"enabled": true,
"options": {
"trim": true,
"collapse_whitespace": true,
"fold_smart_chars": true,
"strip_zero_width": true
}
},
{
"tool": "format_standardize",
"enabled": true,
"options": {
"column_types": {
"Invoice_Date": "date",
"Due_Date": "date",
"Amount": "currency",
"Email": "email"
}
}
},
{
"tool": "missing",
"enabled": true,
"options": {
"strategy": "none",
"standardize_sentinels": true,
"sentinels": ["—", "-", "?", "(blank)", "TBD", "unknown", "(none)", "N/A", "#N/A"]
}
},
{
"tool": "dedup",
"enabled": true,
"options": {
"survivor_rule": "most_complete",
"merge": true,
"strategies": [
{
"columns": [
{"column": "Invoice", "algorithm": "exact", "threshold": 100}
]
}
]
}
}
]
}
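
To make the last two steps of this pipeline concrete, here is a minimal sketch of sentinel standardization followed by an exact-match dedup on `Invoice`. It is not the repository's implementation. It assumes that `most_complete` means "keep the row with the most non-empty cells" and that `merge` means "fill the survivor's empty cells from the other rows in the group"; the function names are hypothetical.

```python
# Sentinel list copied from the "missing" step above.
SENTINELS = {"—", "-", "?", "(blank)", "TBD", "unknown", "(none)", "N/A", "#N/A"}


def blank_sentinels(row, sentinels=SENTINELS):
    """Replace empty strings and sentinel placeholders with None."""
    cleaned = {}
    for col, val in row.items():
        text = None if val is None else val.strip()
        cleaned[col] = None if text in (None, "") or text in sentinels else val
    return cleaned


def completeness(row):
    """Number of cells that hold a real value."""
    return sum(val is not None for val in row.values())


def dedup_exact(rows, key):
    """Group rows on an exact key, keep the most complete row per group,
    and fill its gaps from the other rows (assumed merge semantics)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    result = []
    for members in groups.values():
        survivor = dict(max(members, key=completeness))
        for other in members:
            for col, val in other.items():
                if survivor.get(col) is None and val is not None:
                    survivor[col] = val
        result.append(survivor)
    return result


rows = [
    {"Invoice": "INV-1007", "Client": "ACME LLC", "Status": "Open"},
    {"Invoice": "INV-1007", "Client": "Acme LLC", "Status": "(blank)"},
    {"Invoice": "INV-1011", "Client": "initech", "Status": "?"},
    {"Invoice": "INV-1011", "Client": "Initech", "Status": "Open"},
]
deduped = dedup_exact([blank_sentinels(r) for r in rows], key="Invoice")
print(deduped)
```

Running the sentinel pass first matters in this sketch: a `(blank)` or `?` status only counts as missing, and therefore only loses the completeness comparison, once it has been turned into a real null.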


@@ -0,0 +1,27 @@
Date,Description,Vendor,Category,Amount,Account
01/15/2025,“Stripe payout — weekly”,Stripe,Income,"+$3,450.00",Business Checking
2025-01-15,Verizon business line,Verizon,,($89.50),Business Checking
Jan 18 2025,Adobe Creative Cloud ,Adobe,(blank),-$129.99,Business Checking
1/27/25,Office supplies,Amazon,Supplies,-$74.20,Business Checking
02/03/2025, Monthly office rent,Highland Properties,Rent,"$1,200.00",Business Checking
Feb 5 2025,Account service fee,First National Bank,?,(50.00),Business Checking
2025-01-09,Shipping labels,amazon.com,unknown,-$18.40,Business Checking
1/22/25,Contractor — landing page,Bright Lane Design,TBD,- $599.88,Business Checking
Jan 30 2025,Late fee adjustment,verizon,Utilities,-$12.00,Business Checking
2025-01-11,Packaging tape,AMAZON.COM,Supplies,-$31.75,Business Checking
01/06/2025,Client deposit — ACME Co,ACME Co,Income,"$2,500.00",Business Checking
2025-01-20,Google Workspace,Google,Software,-$36.00,Business Checking
Jan 24 2025,Fuel — delivery van,Shell,Vehicle,-$58.63,Business Checking
1/28/25,QuickBooks subscription,Intuit,Software,-$80.00,Business Checking
2025-01-15,Stripe payout weekly,Stripe,Income,3450.00,Business Checking
01/15/2025,Verizon business line,Verizon,Utilities,-89.50,Business Checking
2025-01-18,Adobe Creative Cloud,Adobe,Software,-129.99,Business Checking
2025-02-03,Monthly office rent,Highland Properties,Rent,1200.00,Business Checking
2025-02-05,Account service fee,First National Bank,Bank Fees,-50.00,Business Checking
2025-01-22,Contractor landing page,Bright Lane Design,Contractors,-599.88,Business Checking
02/10/2025,Client deposit — Globex,Globex,Income,"$1,800.00",Business Checking
2025-02-12,Slack subscription,Slack,Software,-$96.00,Business Checking
Feb 14 2025,Coffee — client meeting,Blue Bottle,Meals,-$23.10,Business Checking
2/18/25,Insurance premium,Hartford,Insurance,-$240.50,Business Checking
02/21/2025,Refund — returned printer,Staples,Supplies,$210.99,Business Checking
Feb 25 2025,Domain renewal,Namecheap,Software,-$13.98,Business Checking


@@ -0,0 +1,6 @@
{"steps":[
{"tool":"text_clean","enabled":true,"options":{"trim":true,"collapse_whitespace":true,"fold_smart_chars":true,"strip_zero_width":true}},
{"tool":"format_standardize","enabled":true,"options":{"column_types":{"Date":"date","Amount":"currency"}}},
{"tool":"missing","enabled":true,"options":{"strategy":"none","standardize_sentinels":true,"sentinels":["—","(blank)","?","unknown","TBD","N/A","#N/A","(none)"]}},
{"tool":"dedup","enabled":true,"options":{"survivor_rule":"most_complete","merge":true,"strategies":[{"columns":[{"column":"Date","algorithm":"exact","threshold":100},{"column":"Amount","algorithm":"exact","threshold":100}]}]}}
]}
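
The exact-match dedup on `Date` + `Amount` can only pair rows such as `($89.50)` and `-89.50` if the currency step has already reduced both to the same number. A minimal sketch of that normalization for the US-style amounts in this fixture follows; `parse_amount` is a hypothetical helper, not the repository's `format_standardize` code, and it deliberately does not handle European decimal commas or non-dollar symbols.

```python
from decimal import Decimal


def parse_amount(text):
    """Parse a US-style amount such as "($89.50)", "+$3,450.00" or
    "- $599.88" into a signed Decimal."""
    s = text.strip()
    # Accounting notation: parentheses mean a negative amount.
    negative = s.startswith("(") and s.endswith(")")
    if negative:
        s = s[1:-1]
    # Drop the currency symbol, thousands separators and stray spaces.
    s = s.replace("$", "").replace(",", "").replace(" ", "")
    if s.startswith("-"):
        negative = True
        s = s[1:]
    elif s.startswith("+"):
        s = s[1:]
    value = Decimal(s)
    return -value if negative else value


for raw in ["+$3,450.00", "3450.00", "($89.50)", "-89.50", "- $599.88", "(50.00)"]:
    print(f"{raw!r:>14} -> {parse_amount(raw)}")
```

`Decimal` is used rather than `float` so that two spellings of the same amount compare equal without rounding noise.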


@@ -1,56 +0,0 @@
{
"steps": [
{
"tool": "text_clean",
"options": {},
"enabled": true,
"name": "1. Clean text (header whitespace, smart quotes, em-dash)"
},
{
"tool": "format_standardize",
"options": {
"column_types": {
"Date": "date",
"Amount": "currency",
"Balance": "currency",
"Vendor": "name"
},
"currency_decimal": "auto",
"currency_preserve_code": false,
"currency_decimals": 2,
"date_output_format": "%Y-%m-%d"
},
"enabled": true,
"name": "2. ISO dates · numeric amounts (parens-negative) · vendor casing"
},
{
"tool": "missing",
"options": {
"strategy": "none",
"standardize_sentinels": true,
"sentinels": ["N/A", "n/a", "—", "-", "?", "(blank)", "(none)", "unknown", "#N/A"]
},
"enabled": true,
"name": "3. Standardize disguised nulls (— / N/A / (blank))"
},
{
"tool": "dedup",
"options": {
"survivor_rule": "most_complete",
"merge": false,
"date_column": "Date",
"strategies": [
{
"columns": [
{"column": "Date", "algorithm": "exact", "threshold": 100},
{"column": "Amount", "algorithm": "exact", "threshold": 100},
{"column": "Vendor", "algorithm": "jaro_winkler", "threshold": 80}
]
}
]
},
"enabled": true,
"name": "4. Dedup transactions on Date+Amount+fuzzy Vendor"
}
]
}


@@ -1,31 +0,0 @@
Txn ID,Date ,Description,Amount,Balance,Account,Vendor,Category
TXN-2401,01/15/2025," AMAZON.COM*4F2X9 PURCHASE",-$129.99,"$2,450.01",Checking,Amazon,Office Supplies
TXN-2402,2025-01-15,"AMAZON.COM*4F2X9 PURCHASE",-$129.99,"2450.01",Checking,amazon.com,Office Supplies
TXN-2403,Jan 18 2025,"STAPLES #4422 — paper, toner",($89.50),$2360.51,Checking,STAPLES,Office Supplies
TXN-2404,01/22/2025,"Verizon Wireless ""autopay""",-$120.00,"$2,240.51",Checking,Verizon,Utilities
TXN-2405,2025-01-22,Verizon Wireless autopay,-120.00,"2,240.51",Checking,verizon,Utilities
TXN-2406,01-25-2025,"Stripe Payout — invoice #1077","+$3,450.00","$5,690.51",Checking,Stripe,Income
TXN-2407,1/27/25,"Office Lease - Suite 204",-1500.00,"$4,190.51",Checking,Acme Realty,Rent
TXN-2408,02/01/2025,"Wire — Acme Realty Mgmt","-$1,500.00","$2,690.51",Checking,acme realty,Rent
TXN-2409,2025-02-03,"Adobe Creative Cloud annual","- $599.88","$2,090.63",Credit Card,Adobe Inc.,Software
TXN-2410,02/03/2025,"ADOBE CREATIVE CLOUD ANN",-599.88,2090.63,Credit Card,adobe,Software
TXN-2411,Feb 5 2025,"FedEx — overnight to client A",-$32.50,"$2,058.13",Checking,FedEx,Shipping
TXN-2412,02/07/2025,"Square fee — invoice #1078","-$3.20","$2,054.93",Checking,Square,Fees
TXN-2413,02/10/2025,"Stripe Payout invoice #1079","+ $1,200.00","$3,254.93",Checking,Stripe,Income
TXN-2414,2025-02-12,"USPS PRIORITY — to vendor B","-12.40","$3,242.53",Checking,USPS,Shipping
TXN-2415,02/14/2025,"Zoom Video Comms — annual","-$149.90","$3,092.63",Credit Card,Zoom,Software
TXN-2416,2/14/25,"Zoom Video Communications","-149.90","3092.63",Credit Card,zoom,Software
TXN-2417,02/18/2025,"Costco Whse #421 — supplies","-$237.84","$2,854.79",Checking,Costco,Office Supplies
TXN-2418,2025-02-18,COSTCO WHSE #421,-237.84,"2,854.79",Checking,costco,Office Supplies
TXN-2419,02/22/2025,"Bank fee — int'l wire","-$45.00","$2,809.79",Checking,Bank Fee,Fees
TXN-2420,02/24/2025,"Stripe Payout — invoice #1080","+$2,100.00","$4,909.79",Checking,Stripe,Income
TXN-2421,02/28/2025," Refund — overcharge ","+$45.00","$4,954.79",Checking,,Refunds
TXN-2422,Feb 28 2025,REFUND OVERCHARGE,45.00,4954.79,Checking,N/A,Refunds
TXN-2423,03/01/2025,"Office Lease — Suite 204","-$1,500.00","$3,454.79",Checking,Acme Realty,Rent
TXN-2424,2025-03-03,"Slack Technologies — annual","-$840.00","$2,614.79",Credit Card,Slack,Software
TXN-2425,03/05/2025,"Stripe Payout — invoice #1081","+$1,875.00","$4,489.79",Checking,Stripe,Income
TXN-2426,03/08/2025,"Wire — Berlin office rent (EUR vendor)","-€1.450,00","$2,989.79",Checking,Mietverwaltung GmbH,Rent
TXN-2427,03/10/2025,"London supplier invoice (GBP)","-£950.00","$1,939.79",Checking,Stationery Co Ltd,Office Supplies
TXN-2428,03/12/2025,"São Paulo agency retainer","-R$ 1.299,90","$1,679.79",Credit Card,Estúdio Ágil,Software
TXN-2429,03/14/2025,"VAT MOSS prep — multi-EU sales","($89.00)","$1,768.79",Checking,EU VAT Service,Fees
TXN-2430,03/14/2025,"VAT MOSS prep multi EU sales",-89.00,"1,768.79",Checking,eu vat service,Fees


@@ -1,21 +0,0 @@
Customer ID,First Name,Last Name,Email,Phone,Address,City,State,ZIP,Country,Total Orders,Lifetime Value,Last Order Date,Tags
SHOP-1001, Alice ,Johnson,alice@petshop.com,(415) 555-1234,"123 Main St., Apt 4B",San Francisco,CA,94102,US,12,$1,240.50,2025-12-04,VIP
SHOP-1002,Bob,SMITH,Bob@PetShop.com,415.555.1234,"123 Main St, Apt 4B",San Francisco,CA,94102,US,12,"$1,240.50",N/A,VIP
SHOP-1003,carlos,garcia,carlos@petshop.com,5559876543,"742 Evergreen Terrace",Springfield,IL,62704,US,5,420.00,12/15/2025,Wholesale
SHOP-1004,Diana,Lee,diana@petshop.com,(555) 222-3344,"PO Box 12, Sherwood Forest",Nottingham,,NG1 5BA,GB,8,£890.25,2025-10-30,VIP|Wholesale
SHOP-1005,EVE MARTINEZ,,eve.martinez@petshop.com,555-9988,"Calle Mayor 45","Madrid",,"28013",ES,3,€180,2025-09-15,
SHOP-1006,Frank,Brown,frank@petshop.com,, ,"Berlin",BE,10115,DE,15,€2.410,75,(blank),Wholesale
SHOP-1007,Grace,Davis,grace@petshop.com,+1 555-111-1111,"888 Maple Ave",Toronto,ON,M5V 3A8,CA,1,$49.99,#N/A,New
SHOP-1008,henry,wilson,Henry@PetShop.com,5551111111,"888 Maple Avenue","Toronto",ON,M5V 3A8,CA,1,$49.99,2025-12-01,New
SHOP-1009,Ivy,Chen,IVY@petshop.com,+1 (555) 777-7777,"550 Elm Street, Suite 200",Brooklyn,NY,11201,US,4,"$320.50 ",10/12/2025,
SHOP-1010,Jack,Taylor,jack@petshop.com,(none),"550 elm street, suite 200",brooklyn,NY,11201,US,4,$320.50,2025-10-12,
SHOP-1011,kate,o'neil,kate.oneil@petshop.com,415-555-2222,"99 King's Rd","London",,SW3 4LX,GB,7,£675.00,?,VIP
SHOP-1012,luis,rodriguez,LUIS@petshop.com,+34 91 411 1111,"Avenida de la Paz 12, 3°D",Madrid,,28013,ES,2,"€89,99",unknown,
SHOP-1013,Mia,Park,mia@petshop.com,02-9374-4000,"Sydney Opera House Drive","Sydney",NSW,2000,AU,9,"A$ 1,299.00",2025-11-20,Wholesale
SHOP-1014,Noah,nguyen,noah@petshop.com,+81 3 3210 7000,"丸の内 2-7-3","Tokyo",,100-0005,JP,6,"¥75000",2025-12-10,VIP
SHOP-1015,Olivia,Brown,OLIVIA@PETSHOP.COM,(555) 333-4444,"742 evergreen terrace",springfield,IL,62704,US,3,$180.00,(none),
SHOP-1016,Pavel,Novak,pavel@petshop.com,+44 20 7946 1234,"22 Baker Street",London,,W1U 6AB,United Kingdom,4,£412.00,2025-11-18,VIP
SHOP-1017,Quinn,Murphy,quinn@petshop.com,+44 20 7946 5678,"5 Princes Street",Edinburgh,,EH2 2DA,U.K.,2,£189.50,2025-12-09,
SHOP-1018,Rachel,O'Brien,rachel@petshop.com,02-9374-9999,"100 George Street","Sydney",NSW,2000,UK,1,£75.00,?,New
SHOP-1019,Sam,Klein,sam@petshop.com,+49 30 99887766,"Friedrichstraße 100","Berlin",,10117,Germany,11,"€1.890,40",2025-12-11,VIP|Wholesale
SHOP-1020,Tara,Gianni,tara@petshop.com,+39 06 6982 4567,"Via del Corso 250",Roma,,00186,Italia,5,"€649,99",2025-12-03,


@@ -1,49 +0,0 @@
{
  "steps": [
    {
      "tool": "text_clean",
      "options": {},
      "enabled": true,
      "name": "1. Clean text (whitespace, smart quotes, NBSP, BOM)"
    },
    {
      "tool": "format_standardize",
      "options": {
        "column_types": {
          "First Name": "name",
          "Last Name": "name",
          "Email": "email",
          "Phone": "phone",
          "Address": "address",
          "Lifetime Value": "currency",
          "Last Order Date": "date"
        },
        "phone_country_column": "Country",
        "address_country_column": "Country",
        "currency_preserve_code": true,
        "currency_decimal": "auto",
        "email_gmail_canonical": false
      },
      "enabled": true,
      "name": "2. Standardize phones, addresses, dates, currencies, names"
    },
    {
      "tool": "missing",
      "options": {
        "strategy": "none",
        "standardize_sentinels": true
      },
      "enabled": true,
      "name": "3. Standardize disguised nulls (N/A, -, (blank), ?, #N/A)"
    },
    {
      "tool": "dedup",
      "options": {
        "survivor_rule": "most_complete",
        "merge": true
      },
      "enabled": true,
      "name": "4. Dedup customers (fuzzy match, merge missing fields)"
    }
  ]
}


@@ -0,0 +1,25 @@
Vendor,Contact,Email,Phone,EIN,Address,Total_Paid
Acme Realty,Bob Stein,acme.ap@acmerealty.com,(212) 555-0100,12-3456789,(blank),"$12,400.00"
acme realty llc,Bob Stein, ACME.AP@AcmeRealty.com ,,,"118 Canal St, New York, NY 10013","$8,250"
ACME REALTY,R. Stein,Acme.AP@acmerealty.com,212.555.0100,N/A,TBD,"1,999.99"
Bright Books Bookkeeping,Dana Cole,hello@brightbooks.com,,98-7654321,(blank),"$6,000.00"
bright books,Dana Cole,HELLO@brightbooks.com,(415) 555-0142,unknown,"50 Market St, San Francisco, CA 94105","$6,000"
"Bright Books, LLC",D. Cole, hello@BrightBooks.com,4155550142,98-7654321,unknown,"5,500.00"
Northwind Logistics,Sam Reyes,ap@northwindlog.com,(312) 555-0198,,(blank),"$22,750.00"
northwind logistics inc,Sam Reyes,AP@NorthwindLog.com,,45-6789012,"900 W Loop, Chicago, IL 60607","$22,750"
Pearl Design Studio,“Jo” Marsh,billing@pearldesign.co,,33-2211000,(blank),"$3,200.00"
pearl design,Jo Marsh,Billing@PearlDesign.co,(206) 555-0167,TBD,"77 Pike St, Seattle, WA 98101","$3,200"
PEARL DESIGN STUDIO,J. Marsh, billing@pearldesign.co ,206.555.0167,33-2211000,unknown,"2,800.00"
Cooper Plumbing,Lee Cooper,office@cooperplumb.com,(617) 555-0133,,(blank),"$1,450.00"
cooper plumbing co,Lee Cooper,OFFICE@cooperplumb.com,,TBD,"12 Beacon St, Boston, MA 02108","$1,450"
COOPER PLUMBING,L. Cooper, office@CooperPlumb.com,6175550133,N/A,unknown,900.00
Vertex Marketing,Pat Nguyen,accounts@vertexmktg.com,(404) 555-0119,77-8899001,(blank),"$15,000.00"
vertex marketing group,Pat Nguyen,ACCOUNTS@VertexMktg.com,,unknown,"300 Peachtree St, Atlanta, GA 30308","$15,000"
Summit Consulting,Ray Brooks,invoices@summitconsult.net,,21-0099887,(blank),"$9,800.00"
summit consulting llc,Ray Brooks,INVOICES@summitconsult.net,(303) 555-0175,,"1100 17th St, Denver, CO 80202","$9,800"
SUMMIT CONSULTING,R. Brooks, invoices@SummitConsult.net ,303.555.0175,21-0099887,TBD,"7,250.00"
Garcia Catering,Mia Garcia,ap@garciacatering.com,(305) 555-0188,,(blank),"$4,600.00"
garcia catering services,Mia Garcia,AP@GarciaCatering.com,,66-1234509,"450 Ocean Dr, Miami, FL 33139",$600.00
Northwind Logistics,S. Reyes, ap@northwindlog.com ,312.555.0198,45-6789012,TBD,"21,000.00"
VERTEX MARKETING,P. Nguyen, accounts@vertexmktg.com ,404.555.0119,77-8899001,TBD,"14,500.00"
GARCIA CATERING,M. Garcia,ap@GARCIACATERING.com,305.555.0188,66-1234509,unknown,"4,200.00"


@@ -0,0 +1,49 @@
{
  "steps": [
    {
      "tool": "text_clean",
      "enabled": true,
      "options": {
        "trim": true,
        "collapse_whitespace": true,
        "fold_smart_chars": true,
        "strip_zero_width": true
      }
    },
    {
      "tool": "format_standardize",
      "enabled": true,
      "options": {
        "column_types": {
          "Phone": "phone",
          "Email": "email",
          "Total_Paid": "currency"
        }
      }
    },
    {
      "tool": "missing",
      "enabled": true,
      "options": {
        "strategy": "none",
        "standardize_sentinels": true,
        "sentinels": ["—", "-", "--", "(blank)", "TBD", "unknown", "N/A", "#N/A", "(none)"]
      }
    },
    {
      "tool": "dedup",
      "enabled": true,
      "options": {
        "survivor_rule": "most_complete",
        "merge": true,
        "strategies": [
          {
            "columns": [
              {"column": "Email", "algorithm": "exact", "threshold": 100, "normalizer": "email"}
            ]
          }
        ]
      }
    }
  ]
}
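
The `dedup` step in the pipeline above matches vendor rows on an exact, email-normalized key, keeps the most complete row, and merges in missing fields. The sketch below illustrates that behavior under stated assumptions: the `email` normalizer is taken to mean "trim and lowercase", `most_complete` is taken to mean "the row with the most non-empty fields", and sentinels such as `(blank)` or `TBD` are assumed to have been turned into empty strings by the earlier `missing` step. The helper names (`normalize_email`, `completeness`, `dedup_by_email`) are hypothetical and are not the DataTools implementation.

```python
def normalize_email(value: str) -> str:
    """Assumed normalizer: trim surrounding whitespace and lowercase."""
    return value.strip().lower()


def completeness(row: dict) -> int:
    """Count the non-empty fields in a row."""
    return sum(1 for v in row.values() if v not in (None, ""))


def dedup_by_email(rows: list[dict], key: str = "Email") -> list[dict]:
    """Group rows by normalized email, keep the most complete row per
    group, and fill its empty fields from the other rows in the group."""
    groups: dict[str, list[dict]] = {}
    for row in rows:
        groups.setdefault(normalize_email(row[key]), []).append(row)

    merged = []
    for group in groups.values():
        # max() returns the first row on ties, so input order breaks ties.
        survivor = dict(max(group, key=completeness))
        for other in group:
            for col, val in other.items():
                if survivor.get(col) in (None, "") and val not in (None, ""):
                    survivor[col] = val
        merged.append(survivor)
    return merged


# Three rows modeled on the vendor sample; sentinels already blanked.
rows = [
    {"Vendor": "Acme Realty", "Email": "acme.ap@acmerealty.com",
     "Phone": "(212) 555-0100", "Address": ""},
    {"Vendor": "acme realty llc", "Email": " ACME.AP@AcmeRealty.com ",
     "Phone": "", "Address": "118 Canal St, New York, NY 10013"},
    {"Vendor": "Cooper Plumbing", "Email": "office@cooperplumb.com",
     "Phone": "", "Address": ""},
]
result = dedup_by_email(rows)
```

The two Acme rows collapse into one record that carries both the phone number and the address. How the real tool breaks ties between equally complete rows is not stated in the config, so the first-row-wins rule here is only a placeholder.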


@@ -1,13 +0,0 @@
customer_name,email,vendor,memo
Alice Johnson,alice@example.com,ACME Corp ,Welcome aboard
Bob Smith,bob@example.com,ACME Corp,Returning customer
Charlie Brown,charlie@example.com,Globex,Net 30
Diana Prince,diana@example.com,Globex,VIP
Edward Norton,ed@example.com,“Best Pet Supplies”,Order#42 - rush
Frank Castle,frank@example.com,Stark—Industries,"Line 1
Line 2
Line 3"
grace HOPPER ,grace@example.com,Globex,Loves long memos…
Henry Ford,henry@example.com,Ford Motor,Industrial
Iris West,iris@example.com,S.T.A.R. Labs,Notewith-bell
Jane Doe,jane@example.com,Acme,Standard

106
scripts/generate_keypair.py Normal file

@@ -0,0 +1,106 @@
#!/usr/bin/env python3
"""Generate a fresh Ed25519 keypair for production license signing.

**Creator-only.** Run once, write the private key somewhere safe,
configure the build pipeline with the public key.

Usage::

    python scripts/generate_keypair.py
    python scripts/generate_keypair.py --json
    python scripts/generate_keypair.py --output keys.txt

The output looks like::

    DATATOOLS_LICENSE_PRIVKEY=<64 hex chars>   # KEEP SECRET
    DATATOOLS_LICENSE_PUBKEY=<64 hex chars>    # BAKE INTO BUILD

The private key never goes near the buyer-facing binary. Stash it in
a password manager / KMS / hardware token; the only places it gets
loaded are:

- ``scripts/generate_license.py`` when minting a buyer's blob
- Your CI's signing step, if you've automated blob minting

The public key gets set as ``DATATOOLS_LICENSE_PUBKEY`` in the
PyInstaller build env (so the shipped binary verifies against it),
and the production-safe runtime check refuses to start any frozen
build that's still using the in-source dev key.
"""
from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path

from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey


def generate() -> tuple[str, str]:
    """Return ``(private_hex, public_hex)`` for a fresh keypair."""
    priv = Ed25519PrivateKey.generate()
    priv_hex = priv.private_bytes(
        encoding=serialization.Encoding.Raw,
        format=serialization.PrivateFormat.Raw,
        encryption_algorithm=serialization.NoEncryption(),
    ).hex()
    pub_hex = priv.public_key().public_bytes(
        encoding=serialization.Encoding.Raw,
        format=serialization.PublicFormat.Raw,
    ).hex()
    return priv_hex, pub_hex


def main(argv: list[str] | None = None) -> int:
    p = argparse.ArgumentParser(description=__doc__.splitlines()[0])
    p.add_argument("--json", action="store_true", help="Emit JSON instead of env-file format.")
    p.add_argument("--output", "-o", type=Path, default=None, help="Write to this file instead of stdout.")
    args = p.parse_args(argv)

    priv_hex, pub_hex = generate()

    if args.json:
        payload = json.dumps(
            {"private_key": priv_hex, "public_key": pub_hex},
            indent=2,
        )
    else:
        payload = (
            f"# DataTools license keypair — generated by generate_keypair.py\n"
            f"# KEEP THE PRIVATE KEY SECRET. Lose it and your existing\n"
            f"# licenses can't be renewed (you'd have to ship a new build\n"
            f"# with a new public key and re-issue every active license).\n"
            f"\n"
            f"DATATOOLS_LICENSE_PRIVKEY={priv_hex}\n"
            f"DATATOOLS_LICENSE_PUBKEY={pub_hex}\n"
        )

    if args.output:
        args.output.write_text(payload + "\n", encoding="utf-8")
        # chmod 600 — best-effort; ignored on Windows.
        try:
            args.output.chmod(0o600)
        except OSError:
            pass
        print(f"Wrote {args.output} (mode 600)", file=sys.stderr)
    else:
        print(payload)

    print(
        "\nNext steps:\n"
        "  1. Store the private key in your password manager.\n"
        "  2. Bake the public key into the PyInstaller build:\n"
        "       DATATOOLS_LICENSE_PUBKEY=<pubkey> pyinstaller ...\n"
        "  3. Mint buyer licenses by setting the private key:\n"
        "       DATATOOLS_LICENSE_PRIVKEY=<privkey> "
        "python scripts/generate_license.py --name 'Buyer' --email b@x.com\n",
        file=sys.stderr,
    )
    return 0


if __name__ == "__main__":
    sys.exit(main())
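
The script above emits the keypair in an env-file format, with each raw Ed25519 key rendered as 64 hex characters (32 bytes). Before wiring those values into a build or a minting step, it can help to sanity-check the file. The following is a minimal sketch of such a check, assuming the env-file layout shown in the docstring; `parse_keypair_env` is a hypothetical helper and is not part of the repository.

```python
import re

# bytes.hex() produces lowercase hex, so lowercase is all we accept here.
_HEX64 = re.compile(r"[0-9a-f]{64}")
_REQUIRED = ("DATATOOLS_LICENSE_PRIVKEY", "DATATOOLS_LICENSE_PUBKEY")


def parse_keypair_env(text: str) -> dict[str, str]:
    """Parse NAME=value lines, skipping blanks and '#' comments, and
    check that both keys are present as 64 lowercase hex characters."""
    keys: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        keys[name.strip()] = value.strip()
    for name in _REQUIRED:
        if not _HEX64.fullmatch(keys.get(name, "")):
            raise ValueError(f"{name} must be 64 lowercase hex characters")
    return keys


# Placeholder values only; real keys come from generate_keypair.py.
sample = (
    "# DataTools license keypair\n"
    "\n"
    "DATATOOLS_LICENSE_PRIVKEY=" + "ab" * 32 + "\n"
    "DATATOOLS_LICENSE_PUBKEY=" + "cd" * 32 + "\n"
)
keys = parse_keypair_env(sample)
```

A check like this only validates the shape of the values. It does not confirm that the public key actually belongs to the private key.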

215
scripts/generate_license.py Normal file

@@ -0,0 +1,215 @@
#!/usr/bin/env python3
"""Mint a signed license blob for a buyer (LOCAL, break-glass).

.. warning::

   This script mints **locally**, without going through the license
   server. Prefer :mod:`src.admin_cli` (``datatools-admin mint``)
   for routine work — it writes to the authoritative ``licenses``
   Postgres table and emits the same blob.

   Reach for this script only when the server is unreachable and a
   buyer needs a license *right now*. Mints from here land in the
   local issuance JSONL log; you'll need to reconcile them into the
   server's DB afterwards.

Creator-only tool. Signs with the Ed25519 private key from
``$DATATOOLS_LICENSE_PRIVKEY`` (production) or the in-tree dev key
(local development).

Every successful mint also appends a record to the issuance log at
``~/.datatools-creator/issued.jsonl`` (override with
``$DATATOOLS_ISSUANCE_LOG``). That log is the creator-side system of
record for "who has a license" — useful for re-delivery, support, and
as the seed for the future server-side ``licenses`` table.

Examples
--------

Mint a 1-year CORE license for Jane Doe::

    python scripts/generate_license.py \\
        --name "Jane Doe" --email jane@example.com --tier core

Mint a 2-year PRO license and write the blob to a file::

    python scripts/generate_license.py \\
        --name "Acme Corp" --email ops@acme.com --tier pro \\
        --years 2 --output acme.dtlic

Mint with the production key (CI / manual fulfillment)::

    DATATOOLS_LICENSE_PRIVKEY=<prod-private-hex> \\
        python scripts/generate_license.py --name ... --email ...

The output is a single base64-encoded token starting with ``DTLIC2:``
— paste this whole string into the buyer's delivery email or
deliver as an attached ``.dtlic`` file.
"""
from __future__ import annotations

import argparse
import json
import os
import sys
import uuid
from pathlib import Path

# Make ``src.license`` importable when run from the repo root.
_PROJECT_ROOT = Path(__file__).resolve().parent.parent
if str(_PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(_PROJECT_ROOT))

from src.license import Tier  # noqa: E402
from src.license.crypto import encode_blob, sign  # noqa: E402
from src.license.features import all_features_for_tier  # noqa: E402
from src.license.schema import (  # noqa: E402
    License,
    _utcnow_iso,
    default_expiry_iso,
)


def default_issuance_log() -> Path:
    """Path to the local issuance log (creator-side ledger).

    Resolution order:

    1. ``$DATATOOLS_ISSUANCE_LOG`` (absolute path; useful for tests
       and for pointing at a shared / encrypted volume).
    2. ``~/.datatools-creator/issued.jsonl`` — separate from the
       buyer-facing ``~/.datatools/`` dir so it never gets bundled
       into a shipped install.
    """
    override = os.environ.get("DATATOOLS_ISSUANCE_LOG")
    if override:
        return Path(override).expanduser().resolve()
    return Path.home() / ".datatools-creator" / "issued.jsonl"


def append_issuance_log(record: dict, *, path: Path | None = None) -> Path | None:
    """Best-effort append of *record* to the issuance log.

    Returns the resolved path on success, ``None`` on IO failure
    (with a warning printed to stderr). We intentionally do not raise:
    the blob has already been minted by the time this runs, and losing
    one ledger row is strictly better than aborting after a successful
    mint and leaving the creator unsure whether to re-mint.
    """
    p = path or default_issuance_log()
    try:
        p.parent.mkdir(parents=True, exist_ok=True)
        with p.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record, sort_keys=True) + "\n")
        try:
            p.chmod(0o600)
        except OSError:
            pass
        return p
    except OSError as e:
        print(
            f"WARNING: could not write issuance log at {p}: {e}\n"
            "  The blob above is still valid — record the mint "
            "manually.",
            file=sys.stderr,
        )
        return None


def build_args() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(
        description="Mint a signed DataTools license blob.",
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    p.add_argument("--name", required=True, help="Buyer's full name.")
    p.add_argument("--email", required=True, help="Buyer's email.")
    p.add_argument(
        "--tier",
        default=Tier.CORE.value,
        choices=[t.value for t in Tier],
        help="License tier (default: %(default)s).",
    )
    p.add_argument(
        "--years",
        type=int,
        default=1,
        help="License lifetime in years (default: %(default)s).",
    )
    p.add_argument(
        "--key",
        default=None,
        help="Override the auto-generated license key (default: random).",
    )
    p.add_argument(
        "--output",
        "-o",
        type=Path,
        default=None,
        help="Write the blob to this file (default: print to stdout).",
    )
    p.add_argument(
        "--no-log",
        action="store_true",
        help=(
            "Skip writing to the issuance log. Use for one-off test "
            "mints; do NOT use for real buyer fulfillment."
        ),
    )
    return p


def main(argv: list[str] | None = None) -> int:
    args = build_args().parse_args(argv)
    tier = Tier(args.tier)

    rid = uuid.uuid4().hex
    key = args.key or f"DT1-{tier.value.upper()}-{rid[:8]}-{rid[8:16]}"

    lic = License(
        name=args.name,
        email=args.email,
        license_key=key,
        tier=tier,
        features=all_features_for_tier(tier),
        issued_at=_utcnow_iso(),
        expires_at=default_expiry_iso(years=args.years),
        signature="",
    )
    signature = sign(lic.to_canonical_dict())
    payload = lic.to_canonical_dict()
    payload["signature"] = signature
    blob = encode_blob(payload)

    if not args.no_log:
        log_path = append_issuance_log({
            "license_key": lic.license_key,
            "name": lic.name,
            "email": lic.email,
            "tier": lic.tier.value,
            "issued_at": lic.issued_at,
            "expires_at": lic.expires_at,
            "blob": blob,
        })
    else:
        log_path = None

    if args.output:
        args.output.write_text(blob + "\n", encoding="utf-8")
        print(f"Wrote license to {args.output}", file=sys.stderr)
    else:
        print(blob)

    print(
        f"  name:    {lic.name}\n"
        f"  email:   {lic.email}\n"
        f"  tier:    {lic.tier.value}\n"
        f"  key:     {lic.license_key}\n"
        f"  expires: {lic.expires_at}",
        file=sys.stderr,
    )
    if log_path:
        print(f"  logged:  {log_path}", file=sys.stderr)
    return 0


if __name__ == "__main__":
    sys.exit(main())
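
The script above describes the issuance log as an append-only JSONL ledger that serves re-delivery and support. A lookup against that ledger could be sketched as follows. This assumes one JSON object per line with at least the `email` and `license_key` fields that `main()` writes. `find_latest_by_email` is a hypothetical helper, and the demo records are made up.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def find_latest_by_email(log_path: Path, email: str) -> Optional[dict]:
    """Return the most recently appended record for *email*, or None."""
    wanted = email.strip().lower()
    latest = None
    with log_path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if str(record.get("email", "")).strip().lower() == wanted:
                # The log is append-only, so a later line is a later mint.
                latest = record
    return latest


# Demo against a throwaway log with made-up records.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "issued.jsonl"
    records = [
        {"email": "jane@example.com", "license_key": "DT1-CORE-aaaaaaaa-bbbbbbbb", "tier": "core"},
        {"email": "ops@acme.com", "license_key": "DT1-PRO-cccccccc-dddddddd", "tier": "pro"},
        {"email": "Jane@Example.com", "license_key": "DT1-PRO-eeeeeeee-ffffffff", "tier": "pro"},
    ]
    log.write_text(
        "".join(json.dumps(r, sort_keys=True) + "\n" for r in records),
        encoding="utf-8",
    )
    found = find_latest_by_email(log, "jane@example.com")
    missing = find_latest_by_email(log, "nobody@example.com")
```

Matching on a lowercased email mirrors the `lower(email)` index the server-side schema uses. Relying on line order rather than on `issued_at` avoids assuming anything about the timestamp format.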

16
server/.dockerignore Normal file

@@ -0,0 +1,16 @@
**/__pycache__
**/*.pyc
**/.pytest_cache
**/.mypy_cache
**/.ruff_cache
.git
.venv
venv
docs
landing
marketing
samples
test-cases
tests
logs
build

38
server/Dockerfile Normal file

@@ -0,0 +1,38 @@
# syntax=docker/dockerfile:1.6
FROM python:3.12-slim AS base

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1

RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        curl \
        libpq5 \
    && rm -rf /var/lib/apt/lists/*

RUN useradd --system --create-home --shell /usr/sbin/nologin --uid 10001 app

WORKDIR /app

COPY server/requirements.txt /app/requirements.txt
RUN pip install -r /app/requirements.txt

# Reused crypto / schema logic from the desktop app — single source of truth.
COPY src/license /app/datatools_license

COPY server/app /app/app
COPY server/config /app/config
COPY server/alembic /app/alembic
COPY server/alembic.ini /app/alembic.ini

RUN chown -R app:app /app
USER app

EXPOSE 8000

HEALTHCHECK --interval=30s --timeout=3s --start-period=15s --retries=3 \
    CMD curl --fail --silent --show-error http://localhost:8000/health || exit 1

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--proxy-headers", "--forwarded-allow-ips", "*"]

38
server/alembic.ini Normal file

@@ -0,0 +1,38 @@
[alembic]
script_location = alembic
prepend_sys_path = .
sqlalchemy.url =

[loggers]
keys = root,sqlalchemy,alembic

[handlers]
keys = console

[formatters]
keys = generic

[logger_root]
level = WARN
handlers = console
qualname =

[logger_sqlalchemy]
level = WARN
handlers =
qualname = sqlalchemy.engine

[logger_alembic]
level = INFO
handlers =
qualname = alembic

[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = generic

[formatter_generic]
format = %(levelname)-5.5s [%(name)s] %(message)s
datefmt = %H:%M:%S

46
server/alembic/env.py Normal file

@@ -0,0 +1,46 @@
"""Alembic environment.

Reads the runtime database URL from ``app.db`` (which resolves the
password from the secrets file), so ``alembic upgrade head`` Just
Works inside the API container with no extra env wiring.
"""
from __future__ import annotations

from logging.config import fileConfig

from alembic import context

from app.db import Base, engine
from app import models  # noqa: F401 — imported for side-effect of registering models

config = context.config
if config.config_file_name is not None:
    fileConfig(config.config_file_name)

target_metadata = Base.metadata


def run_migrations_offline() -> None:
    context.configure(
        url=str(engine.url),
        target_metadata=target_metadata,
        literal_binds=True,
        dialect_opts={"paramstyle": "named"},
    )
    with context.begin_transaction():
        context.run_migrations()


def run_migrations_online() -> None:
    with engine.connect() as connection:
        context.configure(connection=connection, target_metadata=target_metadata)
        with context.begin_transaction():
            context.run_migrations()


if context.is_offline_mode():
    run_migrations_offline()
else:
    run_migrations_online()


@@ -0,0 +1,26 @@
"""${message}

Revision ID: ${up_revision}
Revises: ${down_revision | comma,n}
Create Date: ${create_date}
"""
from __future__ import annotations

from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa
${imports if imports else ""}

revision: str = ${repr(up_revision)}
down_revision: Union[str, None] = ${repr(down_revision)}
branch_labels: Union[str, Sequence[str], None] = ${repr(branch_labels)}
depends_on: Union[str, Sequence[str], None] = ${repr(depends_on)}


def upgrade() -> None:
    ${upgrades if upgrades else "pass"}


def downgrade() -> None:
    ${downgrades if downgrades else "pass"}


@@ -0,0 +1,80 @@
"""Initial schema — licenses + gumroad_events.

Revision ID: 0001_initial
Revises:
Create Date: 2026-05-14
"""
from __future__ import annotations

from typing import Sequence, Union

import sqlalchemy as sa
from alembic import op
from sqlalchemy.dialects import postgresql

revision: str = "0001_initial"
down_revision: Union[str, None] = None
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    op.create_table(
        "licenses",
        sa.Column("license_key", sa.String(), primary_key=True),
        sa.Column("name", sa.String(), nullable=False),
        sa.Column("email", sa.String(), nullable=False),
        sa.Column("tier", sa.String(), nullable=False),
        sa.Column("issued_at", sa.DateTime(timezone=True), nullable=False),
        sa.Column("expires_at", sa.DateTime(timezone=True), nullable=False),
        sa.Column("blob", sa.String(), nullable=False),
        sa.Column("source", sa.String(), nullable=False),
        sa.Column("source_order_id", sa.String(), nullable=True),
        sa.Column("promotion", sa.String(), nullable=True),
        sa.Column("amount_paid", sa.Numeric(10, 2), nullable=True),
        sa.Column("currency", sa.String(length=3), server_default=sa.text("'USD'"), nullable=True),
        sa.Column("revoked_at", sa.DateTime(timezone=True), nullable=True),
        sa.Column("notes", sa.String(), nullable=True),
        sa.Column("created_at", sa.DateTime(timezone=True), server_default=sa.text("now()"), nullable=False),
        sa.Column("updated_at", sa.DateTime(timezone=True), server_default=sa.text("now()"), nullable=False),
        sa.UniqueConstraint("source", "source_order_id", name="uq_licenses_source_order"),
    )
    op.create_index(
        "ix_licenses_email_lower",
        "licenses",
        [sa.text("lower(email)")],
    )
    op.create_index(
        "ix_licenses_expires_active",
        "licenses",
        ["expires_at"],
        postgresql_where=sa.text("revoked_at IS NULL"),
    )

    op.create_table(
        "gumroad_events",
        sa.Column("id", sa.BigInteger(), primary_key=True, autoincrement=True),
        sa.Column("received_at", sa.DateTime(timezone=True), server_default=sa.text("now()"), nullable=False),
        sa.Column("event_type", sa.String(), nullable=False),
        sa.Column("order_id", sa.String(), nullable=True),
        sa.Column("raw_payload", postgresql.JSONB(), nullable=False),
        sa.Column("processed", sa.Boolean(), server_default=sa.text("false"), nullable=False),
        sa.Column("error", sa.String(), nullable=True),
    )
    op.create_index("ix_gumroad_events_order_id", "gumroad_events", ["order_id"])
    op.create_index(
        "ix_gumroad_events_unprocessed",
        "gumroad_events",
        ["received_at"],
        postgresql_where=sa.text("processed = false"),
    )


def downgrade() -> None:
    op.drop_index("ix_gumroad_events_unprocessed", table_name="gumroad_events")
    op.drop_index("ix_gumroad_events_order_id", table_name="gumroad_events")
    op.drop_table("gumroad_events")
    op.drop_index("ix_licenses_expires_active", table_name="licenses")
    op.drop_index("ix_licenses_email_lower", table_name="licenses")
    op.drop_table("licenses")

0
server/app/__init__.py Normal file

@@ -0,0 +1,71 @@
"""Source-adapter interface.

The Mint API speaks only the normalized event types defined here.
Each storefront has its own adapter that:

- Verifies the storefront's webhook signature in its native format.
- Parses the storefront's payload into a :class:`SaleEvent` or
  :class:`RefundEvent`.
- Maps the storefront's product/variant IDs to a license tier via
  the per-source config in :mod:`app.adapters.config`.

Adding a new source (Lemon Squeezy, Stripe, Paddle) is one new
module that implements :class:`SourceAdapter`. The Mint API and DB
do not change.
"""
from __future__ import annotations

from dataclasses import dataclass, field
from decimal import Decimal
from typing import Any, Optional, Protocol


@dataclass(frozen=True)
class SaleEvent:
    """A storefront sale, normalized.

    The Mint API consumes this directly — it never reaches into the
    raw storefront payload. Anything storefront-specific that's worth
    keeping is preserved in :attr:`raw_payload` for audit.
    """

    source: str  # e.g. "gumroad", "manual"
    source_order_id: Optional[str]  # storefront's order ID; None for manual mints
    buyer_name: str
    buyer_email: str
    tier: str  # mapped from product/variant
    years: int = 1
    promotion: Optional[str] = None
    amount_paid: Optional[Decimal] = None
    currency: Optional[str] = "USD"
    notes: Optional[str] = None
    raw_payload: dict = field(default_factory=dict)


@dataclass(frozen=True)
class RefundEvent:
    """A storefront refund — marks an existing license revoked."""

    source: str
    source_order_id: str
    reason: Optional[str] = None
    raw_payload: dict = field(default_factory=dict)


class SourceAdapter(Protocol):
    """Interface every storefront adapter implements."""

    source_name: str

    def verify_webhook(self, *, body: bytes, headers: dict[str, str]) -> bool:
        """Return True iff the request came from the legitimate storefront."""
        ...

    def parse_sale(self, payload: dict[str, Any]) -> Optional[SaleEvent]:
        """Return a :class:`SaleEvent` if *payload* is a sale, else None."""
        ...

    def parse_refund(self, payload: dict[str, Any]) -> Optional[RefundEvent]:
        """Return a :class:`RefundEvent` if *payload* is a refund, else None."""
        ...
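
The interface above says a new storefront needs only one module that implements `SourceAdapter`. To make that concrete, here is a minimal sketch of such an adapter. Everything storefront-specific in it is an assumption for illustration: `examplestore` is not a real storefront, the `X-Signature` header with an HMAC-SHA256 hex digest is an invented signing scheme, the payload keys are made up, and the inlined `_TIERS` mapping stands in for the per-source config. `SaleEvent` is redefined in trimmed form so the block runs on its own.

```python
import hashlib
import hmac
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class SaleEvent:
    """Trimmed stand-in for the SaleEvent defined in the module above."""

    source: str
    source_order_id: Optional[str]
    buyer_name: str
    buyer_email: str
    tier: str
    raw_payload: dict = field(default_factory=dict)


class ExampleStoreAdapter:
    """Hypothetical adapter for an imaginary storefront."""

    source_name = "examplestore"
    # Assumed product-to-tier mapping, inlined here for brevity.
    _TIERS = {"prod_core": "core", "prod_pro": "pro"}

    def __init__(self, secret: bytes) -> None:
        self._secret = secret

    def verify_webhook(self, *, body: bytes, headers: dict[str, str]) -> bool:
        """Constant-time check of an assumed HMAC-SHA256 body signature."""
        expected = hmac.new(self._secret, body, hashlib.sha256).hexdigest()
        supplied = headers.get("X-Signature", "")
        return hmac.compare_digest(expected.encode(), supplied.encode())

    def parse_sale(self, payload: dict[str, Any]) -> Optional[SaleEvent]:
        """Normalize a sale payload; return None for anything else."""
        if payload.get("type") != "sale":
            return None
        tier = self._TIERS.get(payload.get("product_id", ""))
        if tier is None:
            return None
        return SaleEvent(
            source=self.source_name,
            source_order_id=payload.get("order_id"),
            buyer_name=payload.get("name", ""),
            buyer_email=payload.get("email", ""),
            tier=tier,
            raw_payload=dict(payload),
        )

    def parse_refund(self, payload: dict[str, Any]) -> None:
        """Refunds are out of scope for this sketch."""
        return None


adapter = ExampleStoreAdapter(secret=b"s3cret")
body = b'{"type": "sale"}'
good_sig = hmac.new(b"s3cret", body, hashlib.sha256).hexdigest()
ok = adapter.verify_webhook(body=body, headers={"X-Signature": good_sig})
bad = adapter.verify_webhook(body=body, headers={"X-Signature": "0" * 64})
event = adapter.parse_sale(
    {"type": "sale", "product_id": "prod_pro", "order_id": "ord-1",
     "name": "Jane Doe", "email": "jane@example.com"}
)
```

Because `SourceAdapter` is a `Protocol`, the class does not need to inherit from it. Providing `source_name` and the three methods with compatible signatures is enough for a static type checker to accept it.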


@@ -0,0 +1,173 @@
"""Gumroad adapter.
Receives "Ping" notifications from Gumroad — form-encoded POSTs sent
when a sale occurs. Gumroad's Ping URL is configured in the seller
dashboard (Settings → Advanced → Ping URL).
Authentication
--------------
Gumroad does not HMAC-sign the body. Their recommended pattern is
to put a secret in the URL itself::
https://licenses.datatools.unalogix.com/webhooks/gumroad?secret=...
The webhook receiver pulls the secret from the query string and
:meth:`GumroadAdapter.verify_webhook` constant-time-compares it
against the configured value. If they don't match, the request is
dropped with 404 (so a probing attacker can't tell whether the
endpoint exists, much less that it's the wrong secret).
The "test" field
----------------
Gumroad sends ``test=true`` on test pings fired from the dashboard.
We treat test pings as real sales (they create licenses just like
production sales), but tag them with ``notes='gumroad test ping'``
so the operator can filter / delete them later. Refusing test pings
would block the standard "Send Test Ping" verification flow.
Refunds, disputes, cancellations
--------------------------------
Stubbed for now (``parse_refund`` returns None). Gumroad doesn't
include refund signals in the standard sale Ping — refunds arrive
via the separate "Resource subscriptions" mechanism. Wiring that
in is PR 2.1; until then, refunds are handled by the operator
running ``datatools-admin revoke``.
"""
from __future__ import annotations
import hmac
from decimal import Decimal
from typing import Any, Optional
from app.adapters.base import RefundEvent, SaleEvent
from app.products import lookup as product_lookup
class GumroadAdapter:
source_name = "gumroad"
def __init__(self, secret: Optional[str]) -> None:
self._secret = secret
# --- Auth ----------------------------------------------------------------
def verify_webhook(self, *, body: bytes, headers: dict[str, str]) -> bool:
"""Not used — Gumroad authentication is via URL query param,
which only the route handler has direct access to. Call
:meth:`verify_secret` instead."""
return False
    def verify_secret(self, presented: Optional[str]) -> bool:
        """Constant-time compare against the configured secret.

        Returns False (not an exception) so the route handler can
        decide the response code; we return 404 to avoid signaling
        endpoint existence to an unauthenticated prober.
        """
        if not self._secret or not presented:
            return False
        # compare_digest() raises TypeError on non-ASCII str, so compare
        # bytes: a garbage query param must yield False, not a 500.
        return hmac.compare_digest(
            presented.encode("utf-8"), self._secret.encode("utf-8")
        )
# --- Parsing -------------------------------------------------------------
    def parse_sale(self, payload: dict[str, Any]) -> Optional[SaleEvent]:
        """Parse a Gumroad Ping form-encoded payload into a SaleEvent.

        Returns None if the payload isn't a sale (e.g. some future
        event type we don't yet handle).

        Raises :class:`UnmappedProductError` if the product_id has no
        mapping. The caller should catch it and record the error in
        the audit row rather than silently dropping the sale.
        """
        # Sale pings always include sale_id (the order ID) and email.
        sale_id = payload.get("sale_id")
        email = payload.get("email")
        product_id = (
            payload.get("product_id")
            or payload.get("product_permalink")
            or payload.get("permalink")
        )
        if not (sale_id and email and product_id):
            return None

        mapping = product_lookup(self.source_name, str(product_id))
        if mapping is None:
            # Unmapped: raise a dedicated error instead of returning
            # None, so the caller can tell "not a sale" apart from
            # "sale we can't fulfil", log it to gumroad_events with
            # error info, and still return 200 (no Gumroad retry storm).
            raise UnmappedProductError(
                f"Gumroad product_id {product_id!r} has no entry in "
                "config/products.yaml. Add a mapping and replay this "
                f"sale (sale_id={sale_id})."
            )

        name = (payload.get("full_name") or "").strip() or _email_local(email)
        price_cents = _to_int(payload.get("price"))
        amount_paid = Decimal(price_cents) / Decimal(100) if price_cents is not None else None
        currency = (payload.get("currency") or "USD").upper()
        promotion = (payload.get("offer_code") or "").strip() or None
        notes = None
        if _is_truthy(payload.get("test")):
            notes = "gumroad test ping"

        return SaleEvent(
            source=self.source_name,
            source_order_id=str(sale_id),
            buyer_name=name,
            buyer_email=email.strip(),
            tier=mapping.tier,
            years=mapping.years,
            promotion=promotion,
            amount_paid=amount_paid,
            currency=currency,
            notes=notes,
            raw_payload=dict(payload),
        )
def parse_refund(self, payload: dict[str, Any]) -> Optional[RefundEvent]:
# PR 2.1.
return None
class UnmappedProductError(ValueError):
"""Raised when a sale arrives for a product not in products.yaml.
Caller catches and logs into ``gumroad_events.error`` so the
operator can fix the mapping and replay.
"""
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _email_local(email: str) -> str:
    """Fallback display name when ``full_name`` is missing: the part
    of the email before the ``@``, capitalized. Better than 'Unknown'
    for support tickets and the buyer's own delivery email."""
    local = email.split("@", 1)[0]
    return local.replace(".", " ").title()


def _to_int(v: Any) -> Optional[int]:
    if v is None or v == "":
        return None
    try:
        return int(v)
    except (TypeError, ValueError):
        return None


def _is_truthy(v: Any) -> bool:
    if isinstance(v, bool):
        return v
    if v is None:
        return False
    return str(v).strip().lower() in {"1", "true", "yes", "on"}
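
To make the parsing path concrete, here is a small sketch of how a form-encoded Ping body turns into the flat mapping that `parse_sale` reads, and how the price in cents becomes a `Decimal` amount. The body below is made up for illustration (the IDs and address are not real Gumroad values); only the field names match what the adapter above looks up.

```python
from decimal import Decimal
from urllib.parse import parse_qsl

# Hypothetical Ping body; field names follow the ones parse_sale() reads.
raw_body = (
    b"sale_id=abc123&email=jane.doe%40example.com"
    b"&product_id=prod_1&price=4900&currency=usd&test=true"
)

# Form-encoded bodies decode to a flat str -> str mapping.
payload = dict(parse_qsl(raw_body.decode("utf-8")))

# The adapter treats "price" as integer cents and scales it to a Decimal.
amount_paid = Decimal(int(payload["price"])) / Decimal(100)
currency = payload["currency"].upper()
is_test = payload["test"].strip().lower() in {"1", "true", "yes", "on"}

print(payload["email"], amount_paid, currency, is_test)
```

Using `Decimal` rather than `float` keeps the stored amount exact, which matters once it lands in a `Numeric(10, 2)` column.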


@@ -0,0 +1,52 @@
"""Manual adapter — operator-initiated mints (comps, support replacements).
There is no webhook to verify and no payload to parse: the operator
hands us the buyer details directly via the CLI, and we construct a
:class:`SaleEvent` from them. ``source='manual'`` separates these
rows from storefront-driven mints in the DB.
"""
from __future__ import annotations
from decimal import Decimal
from typing import Any, Optional
from app.adapters.base import RefundEvent, SaleEvent
class ManualAdapter:
source_name = "manual"
def verify_webhook(self, *, body: bytes, headers: dict[str, str]) -> bool:
return False # manual flows never come through webhooks
def parse_sale(self, payload: dict[str, Any]) -> Optional[SaleEvent]:
return self.build_sale(**payload)
def parse_refund(self, payload: dict[str, Any]) -> Optional[RefundEvent]:
return None
def build_sale(
self,
*,
name: str,
email: str,
tier: str,
years: int = 1,
promotion: Optional[str] = None,
amount_paid: Optional[Decimal] = None,
currency: Optional[str] = "USD",
notes: Optional[str] = None,
) -> SaleEvent:
return SaleEvent(
source=self.source_name,
source_order_id=None,
buyer_name=name,
buyer_email=email,
tier=tier,
years=years,
promotion=promotion,
amount_paid=amount_paid,
currency=currency,
notes=notes,
)

65
server/app/auth.py Normal file

@@ -0,0 +1,65 @@
"""Auth guards for ``/internal/*``.
Active layer: Bearer token, presented by the operator's CLI and
matched against the value in the secrets dir. Token rotation =
update the file, restart the container.
:func:`require_localhost` is preserved but unused by default — it
fights the Docker bridge network model (the container sees the
gateway IP, not 127.0.0.1, regardless of where traffic originated).
Re-enable it only if the API runs in ``network_mode: host``.
"""
from __future__ import annotations
import hmac
from typing import Optional
from fastapi import HTTPException, Request, status
from app.config import get_settings
def require_localhost(request: Request) -> None:
    """Reject the request unless the connecting peer is loopback.

    ``request.client.host`` is the raw TCP peer (nginx connecting from
    127.0.0.1), provided the ASGI server is not configured to rewrite
    it from forwarded headers. We deliberately do NOT trust
    ``X-Forwarded-For`` here; we want the raw peer.
    """
    peer = request.client.host if request.client else None
    if peer not in {"127.0.0.1", "::1"}:
        raise HTTPException(
            status_code=status.HTTP_404_NOT_FOUND,
            detail="Not found",
        )
def require_bearer_token(request: Request) -> None:
    """Verify ``Authorization: Bearer <admin_token>``.

    Uses constant-time comparison so timing leaks don't reveal token
    prefixes. Every failure returns the same 401 body, so the response
    neither echoes the supplied token nor reveals whether a token is
    configured at all: clients see "no token configured" exactly as
    they see "wrong token".
    """
    unauthorized = HTTPException(
        status_code=status.HTTP_401_UNAUTHORIZED,
        detail="Invalid or missing bearer token.",
    )
    settings = get_settings()
    expected: Optional[str] = settings.resolve_admin_token()
    auth = request.headers.get("Authorization", "")
    if not expected or not auth.startswith("Bearer "):
        raise unauthorized
    presented = auth.removeprefix("Bearer ").strip()
    # compare_digest() raises TypeError on non-ASCII str, so compare
    # bytes: a malformed header must yield 401, not a 500.
    if not hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8")):
        raise unauthorized
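
One detail of constant-time comparison is easy to miss: `hmac.compare_digest` accepts `str` arguments only when they are pure ASCII, and raises `TypeError` otherwise. The sketch below shows that behaviour and a bytes-based comparison that returns False instead. `tokens_match` is a hypothetical helper name, not part of this module.

```python
import hmac


def tokens_match(presented: str, expected: str) -> bool:
    """Hypothetical helper: constant-time compare that tolerates any str."""
    return hmac.compare_digest(
        presented.encode("utf-8"), expected.encode("utf-8")
    )


# Passing a non-ASCII str directly raises instead of returning False.
try:
    hmac.compare_digest("tök", "tok")
    raised = False
except TypeError:
    raised = True

print(raised, tokens_match("tok", "tok"), tokens_match("tök", "tok"))
```

Encoding both sides first means attacker-controlled input can only ever produce a mismatch, never an unhandled exception.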

64
server/app/config.py Normal file

@@ -0,0 +1,64 @@
"""Runtime configuration loaded from environment + secret files.
Secrets are read from files (``*_FILE`` env vars pointing at
``/run/secrets/<name>``) so they never appear in ``docker inspect``
or process environment dumps. Plain ``*`` vars are the fallback for
local development where mounting secret files is overkill.
"""
from __future__ import annotations
from functools import lru_cache
from pathlib import Path
from typing import Optional
from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict
class Settings(BaseSettings):
model_config = SettingsConfigDict(env_file=".env", extra="ignore")
database_url: str = Field(
default="postgresql+psycopg://datatools_api@localhost:5432/datatools_licenses",
validation_alias="DATABASE_URL",
)
admin_token: Optional[str] = Field(default=None, validation_alias="DATATOOLS_ADMIN_TOKEN")
admin_token_file: Optional[Path] = Field(default=None, validation_alias="DATATOOLS_ADMIN_TOKEN_FILE")
license_privkey_hex: Optional[str] = Field(default=None, validation_alias="DATATOOLS_LICENSE_PRIVKEY")
license_privkey_file: Optional[Path] = Field(default=None, validation_alias="DATATOOLS_LICENSE_PRIVKEY_FILE")
license_pubkey_hex: Optional[str] = Field(default=None, validation_alias="DATATOOLS_LICENSE_PUBKEY")
postmark_token: Optional[str] = Field(default=None, validation_alias="POSTMARK_TOKEN")
postmark_token_file: Optional[Path] = Field(default=None, validation_alias="POSTMARK_TOKEN_FILE")
gumroad_secret: Optional[str] = Field(default=None, validation_alias="GUMROAD_WEBHOOK_SECRET")
gumroad_secret_file: Optional[Path] = Field(default=None, validation_alias="GUMROAD_WEBHOOK_SECRET_FILE")
def resolve_admin_token(self) -> Optional[str]:
return _resolve(self.admin_token, self.admin_token_file)
def resolve_license_privkey(self) -> Optional[str]:
return _resolve(self.license_privkey_hex, self.license_privkey_file)
def resolve_postmark_token(self) -> Optional[str]:
return _resolve(self.postmark_token, self.postmark_token_file)
def resolve_gumroad_secret(self) -> Optional[str]:
return _resolve(self.gumroad_secret, self.gumroad_secret_file)
def _resolve(inline: Optional[str], path: Optional[Path]) -> Optional[str]:
    if inline:
        return inline.strip()
    if path and path.exists():
        return path.read_text(encoding="utf-8").strip()
    return None


@lru_cache(maxsize=1)
def get_settings() -> Settings:
    return Settings()
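
The lookup order in `_resolve` (inline value first, then the secret file, else None) can be checked in isolation. The sketch below copies that logic into a standalone `resolve` function and uses a temporary directory in place of `/run/secrets`; the file name and values are illustrative only.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve(inline: Optional[str], path: Optional[Path]) -> Optional[str]:
    # Standalone copy of the lookup order, kept here for illustration.
    if inline:
        return inline.strip()
    if path and path.exists():
        return path.read_text(encoding="utf-8").strip()
    return None


with tempfile.TemporaryDirectory() as tmp:
    secret_file = Path(tmp) / "admin_token"  # stands in for /run/secrets/<name>
    secret_file.write_text("from-file\n", encoding="utf-8")

    from_file = resolve(None, secret_file)           # file is used
    from_inline = resolve(" from-env ", secret_file)  # inline value wins
    missing = resolve(None, Path(tmp) / "absent")    # neither source present

print(from_file, from_inline, missing)
```

Stripping the file contents matters because secret files commonly end with a trailing newline.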

65
server/app/db.py Normal file

@@ -0,0 +1,65 @@
"""SQLAlchemy engine + session factory.
The DB password lives in ``/run/secrets/pg_password``; we read it
from there (or ``$PG_PASSWORD`` for local dev) and splice it into
``DATABASE_URL`` so the password never has to be in plaintext in
``compose.yml`` or process environment listings.
"""
from __future__ import annotations
import os
from pathlib import Path
from typing import Generator
from urllib.parse import quote_plus, urlparse, urlunparse
from sqlalchemy import create_engine
from sqlalchemy.orm import DeclarativeBase, Session, sessionmaker
from app.config import get_settings
def _resolve_password() -> str | None:
    inline = os.environ.get("PG_PASSWORD")
    if inline:
        return inline.strip()
    path = os.environ.get("PG_PASSWORD_FILE")
    if path and Path(path).exists():
        return Path(path).read_text(encoding="utf-8").strip()
    return None


def _build_url(base_url: str) -> str:
    """Inject the resolved password into ``base_url`` if absent."""
    parsed = urlparse(base_url)
    if parsed.password:
        return base_url
    pw = _resolve_password()
    if pw is None:
        return base_url
    netloc = f"{parsed.username or ''}:{quote_plus(pw)}@{parsed.hostname}"
    if parsed.port:
        netloc += f":{parsed.port}"
    return urlunparse(parsed._replace(netloc=netloc))


_settings = get_settings()
engine = create_engine(_build_url(_settings.database_url), pool_pre_ping=True, future=True)
SessionLocal = sessionmaker(bind=engine, autoflush=False, autocommit=False, expire_on_commit=False)


class Base(DeclarativeBase):
    """Declarative base for ORM models."""


def get_session() -> Generator[Session, None, None]:
    """FastAPI dependency. Commits on success, rolls back on exception."""
    session = SessionLocal()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()
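
The password splice in `_build_url` is easier to follow with a concrete URL. The sketch below is a standalone variant that takes the password as an argument instead of reading it from the environment; `with_password` and the sample password are hypothetical, while the base URL is the default from the settings module.

```python
from urllib.parse import quote_plus, urlparse, urlunparse


def with_password(base_url: str, password: str) -> str:
    """Hypothetical standalone variant of the splice, password passed in."""
    parsed = urlparse(base_url)
    if parsed.password:
        # A password already in the URL is left untouched.
        return base_url
    netloc = f"{parsed.username or ''}:{quote_plus(password)}@{parsed.hostname}"
    if parsed.port:
        netloc += f":{parsed.port}"
    return urlunparse(parsed._replace(netloc=netloc))


url = with_password(
    "postgresql+psycopg://datatools_api@localhost:5432/datatools_licenses",
    "p@ss/word",
)
print(url)
```

Percent-encoding the password is what keeps characters such as `@` and `/` from being misread as URL delimiters.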

214
server/app/email.py Normal file

@@ -0,0 +1,214 @@
"""Transactional email delivery.
Provider: Postmark. Picked for its transactional-deliverability
reputation and a tiny, no-SDK-needed HTTP API.
Configuration
-------------
- ``POSTMARK_TOKEN`` / ``POSTMARK_TOKEN_FILE`` — server API token.
- ``EMAIL_FROM`` — verified sender address (default
``licenses@datatools.unalogix.com``).
- ``EMAIL_REPLY_TO`` — optional Reply-To (default same as From).
When ``POSTMARK_TOKEN`` is unset the service falls back to
:class:`LoggingEmailService`, which prints the email to stdout
instead of sending. Lets the webhook handler exercise the full
flow before the Postmark account is provisioned.
"""
from __future__ import annotations
import logging
import os
from dataclasses import dataclass
from typing import Optional, Protocol
import httpx
from app.config import get_settings
log = logging.getLogger(__name__)
@dataclass(frozen=True)
class LicenseEmail:
"""Inputs the renderer needs from the caller."""
to_name: str
to_email: str
tier: str
license_key: str
expires_at_iso: str
blob: str
class EmailService(Protocol):
"""Provider-agnostic email surface — keeps Postmark out of the
callers' import graph."""
def send_license(self, msg: LicenseEmail) -> str:
"""Deliver the license-delivery email. Returns a provider
message id (or ``"logged"`` for the dev fallback) so the
caller can record it on the licenses row for audit."""
...
class LoggingEmailService:
"""Stand-in when no real provider is configured. Logs the
rendered message body at INFO so it shows up in ``docker compose
logs api`` — useful during local dev and during the deploy
window before Postmark is wired up."""
def send_license(self, msg: LicenseEmail) -> str:
body = _render_text(msg)
log.info(
"[email-stub] would send to=%s subject=%r\n%s",
msg.to_email,
_subject(msg),
body,
)
return "logged"
class PostmarkEmailService:
"""Postmark transactional API client.
Single endpoint, ~3 fields, no SDK needed. We use a per-call
httpx Client with a tight timeout — webhook handlers run on
the request thread and we never want to block them on a flaky
upstream.
"""
API_URL = "https://api.postmarkapp.com/email"
TIMEOUT_S = 8.0
def __init__(
self,
token: str,
*,
sender: str,
reply_to: Optional[str] = None,
message_stream: str = "outbound",
) -> None:
self._token = token
self._sender = sender
self._reply_to = reply_to or sender
self._stream = message_stream
def send_license(self, msg: LicenseEmail) -> str:
body_text = _render_text(msg)
body_html = _render_html(msg)
payload = {
"From": self._sender,
"To": _rfc_addr(msg.to_name, msg.to_email),
"ReplyTo": self._reply_to,
"Subject": _subject(msg),
"TextBody": body_text,
"HtmlBody": body_html,
"MessageStream": self._stream,
}
headers = {
"Accept": "application/json",
"Content-Type": "application/json",
"X-Postmark-Server-Token": self._token,
}
with httpx.Client(timeout=self.TIMEOUT_S) as c:
r = c.post(self.API_URL, json=payload, headers=headers)
if r.status_code >= 400:
raise EmailDeliveryError(
f"Postmark rejected the request: HTTP {r.status_code} "
f"body={r.text[:300]!r}"
)
return str(r.json().get("MessageID", ""))
class EmailDeliveryError(RuntimeError):
"""Provider returned a non-2xx. Caller should record this on the
audit row so the operator can replay after fixing the provider
config (verified sender domain, paid plan, etc.)."""
# ---------------------------------------------------------------------------
# Factory
# ---------------------------------------------------------------------------
def get_email_service() -> EmailService:
    """Choose the real provider if a token is configured, else the
    logger. ``EMAIL_FROM`` / ``EMAIL_REPLY_TO`` are read from the
    environment on every call. The token comes from the cached
    :func:`get_settings`, so tests that change it must call
    ``get_settings.cache_clear()`` first."""
    settings = get_settings()
    token = settings.resolve_postmark_token()
    if not token:
        return LoggingEmailService()
    sender = os.environ.get("EMAIL_FROM", "licenses@datatools.unalogix.com")
    reply_to = os.environ.get("EMAIL_REPLY_TO")
    return PostmarkEmailService(token, sender=sender, reply_to=reply_to)
# ---------------------------------------------------------------------------
# Rendering
# ---------------------------------------------------------------------------
def _subject(msg: LicenseEmail) -> str:
return f"Your DataTools license ({msg.tier})"
def _render_text(msg: LicenseEmail) -> str:
return (
f"Hi {msg.to_name},\n\n"
f"Thanks for your DataTools purchase. Your license is below.\n\n"
f"License key: {msg.license_key}\n"
f"Tier: {msg.tier}\n"
f"Expires: {msg.expires_at_iso[:10]}\n\n"
f"To activate, paste the full blob (starting with DTLIC2:) into\n"
f"the Activate screen, or run:\n\n"
f" python -m src.license_cli activate \"{msg.blob}\" \\\n"
f" --name \"{msg.to_name}\" --email {msg.to_email}\n\n"
f"Your blob:\n\n"
f"{msg.blob}\n\n"
f"Keep this email — you'll need the blob if you move to a new\n"
f"computer. Questions: reply to this email.\n\n"
f"— DataTools\n"
)
def _render_html(msg: LicenseEmail) -> str:
return (
"<!doctype html><html><body style=\"font-family:system-ui,sans-serif;"
"max-width:560px;margin:auto;padding:24px;color:#222;\">"
f"<p>Hi {_html_escape(msg.to_name)},</p>"
"<p>Thanks for your DataTools purchase. Your license is below.</p>"
"<table cellpadding=\"4\" style=\"border-collapse:collapse;\">"
f"<tr><td><b>License key</b></td><td><code>{_html_escape(msg.license_key)}</code></td></tr>"
f"<tr><td><b>Tier</b></td><td>{_html_escape(msg.tier)}</td></tr>"
f"<tr><td><b>Expires</b></td><td>{_html_escape(msg.expires_at_iso[:10])}</td></tr>"
"</table>"
"<p>To activate, paste the blob below into the <em>Activate</em> "
"screen on first launch.</p>"
"<pre style=\"background:#f4f4f4;padding:12px;border-radius:6px;"
"white-space:pre-wrap;word-break:break-all;font-size:11px;\">"
f"{_html_escape(msg.blob)}</pre>"
"<p style=\"color:#666;font-size:13px;\">Keep this email — you'll "
"need the blob if you move to a new computer. Questions: just reply.</p>"
"<p>— DataTools</p></body></html>"
)
def _rfc_addr(name: str, email: str) -> str:
    # Postmark accepts "Name <addr>" or just "addr". Strip characters
    # that would break the "Name <addr>" form rather than quoting them,
    # so the name stays readable in the inbox.
    if not name or "@" in name:
        return email
    for ch in ',<>"':
        name = name.replace(ch, "")
    name = name.strip()
    if not name:
        return email
    return f"{name} <{email}>"
def _html_escape(s: str) -> str:
return (
s.replace("&", "&amp;")
.replace("<", "&lt;")
.replace(">", "&gt;")
.replace('"', "&quot;")
)
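
Because callers depend only on the `EmailService` protocol, a test can pass in any object with a matching `send_license` method. The sketch below shows one possible test double. `RecordingEmailService` and `deliver` are hypothetical names, and `LicenseEmail` is a trimmed stand-in with two of the real dataclass's fields so the block runs on its own.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class LicenseEmail:
    # Trimmed stand-in for the real dataclass (two of its six fields).
    to_email: str
    license_key: str


class EmailService(Protocol):
    def send_license(self, msg: LicenseEmail) -> str: ...


@dataclass
class RecordingEmailService:
    """Hypothetical test double: stores messages instead of sending."""

    sent: list = field(default_factory=list)

    def send_license(self, msg: LicenseEmail) -> str:
        self.sent.append(msg)
        return f"recorded-{len(self.sent)}"


def deliver(service: EmailService, msg: LicenseEmail) -> str:
    # Callers depend only on the Protocol, never on a concrete provider.
    return service.send_license(msg)


fake = RecordingEmailService()
message_id = deliver(fake, LicenseEmail("jane@example.com", "DT1-PRO-0000aaaa-1111bbbb"))
print(message_id, len(fake.sent))
```

No inheritance is needed: `Protocol` matching is structural, so the double never has to import the provider module.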

19
server/app/main.py Normal file

@@ -0,0 +1,19 @@
"""FastAPI entry point for the DataTools license server."""
from __future__ import annotations
from fastapi import FastAPI
from app.routes import internal, public, webhooks
app = FastAPI(
title="DataTools License Server",
version="0.1.0",
docs_url=None,
redoc_url=None,
openapi_url=None,
)
app.include_router(public.router)
app.include_router(internal.router)
app.include_router(webhooks.router)

136
server/app/mint.py Normal file

@@ -0,0 +1,136 @@
"""Core mint + revoke logic.
Bridges the source-adapter layer (:mod:`app.adapters`) to the DB
layer (:mod:`app.models`), reusing the desktop app's signing /
encoding primitives from ``datatools_license.crypto`` so blobs minted
here verify against the same embedded pubkey on the buyer's machine.
"""
from __future__ import annotations
import os
import uuid
from datetime import datetime, timezone
from typing import Optional
from sqlalchemy import select
from sqlalchemy.orm import Session
from app.adapters.base import SaleEvent
from app.config import get_settings
from app.models import License
def _init_key_env() -> None:
"""Resolve secret-file pointers into env vars before importing crypto.
``datatools_license.crypto`` looks for ``DATATOOLS_LICENSE_PRIVKEY``
/ ``DATATOOLS_LICENSE_PUBKEY`` in ``os.environ``. When those come
from secret files (``*_FILE`` env vars), we read them once at
module import and copy them into ``os.environ`` so crypto can pick
them up without changes.
"""
settings = get_settings()
priv = settings.resolve_license_privkey()
if priv:
os.environ.setdefault("DATATOOLS_LICENSE_PRIVKEY", priv)
pub = settings.license_pubkey_hex
if pub:
os.environ.setdefault("DATATOOLS_LICENSE_PUBKEY", pub)
_init_key_env()
# Imported after env init so the crypto module reads the correct key.
from datatools_license.crypto import encode_blob, sign # noqa: E402
from datatools_license.features import all_features_for_tier # noqa: E402
from datatools_license.schema import ( # noqa: E402
License as LicenseDataclass,
Tier,
_utcnow_iso,
default_expiry_iso,
)
def _generate_license_key(tier: str) -> str:
rid = uuid.uuid4().hex
return f"DT1-{tier.upper()}-{rid[:8]}-{rid[8:16]}"
def _iso_to_dt(iso: str) -> datetime:
return datetime.fromisoformat(iso.replace("Z", "+00:00"))
def mint_from_sale(session: Session, sale: SaleEvent) -> License:
"""Idempotently mint a license for *sale*.
If a row with the same ``(source, source_order_id)`` already
exists, return it untouched — Gumroad retrying a webhook does not
produce a second blob with a different signature. Manual mints
(``source_order_id is None``) skip the dedup check and always
produce a new row.
"""
if sale.source_order_id is not None:
existing = session.execute(
select(License).where(
License.source == sale.source,
License.source_order_id == sale.source_order_id,
)
).scalar_one_or_none()
if existing is not None:
return existing
tier_enum = Tier(sale.tier)
license_key = _generate_license_key(sale.tier)
issued_iso = _utcnow_iso()
expires_iso = default_expiry_iso(years=sale.years)
unsigned = LicenseDataclass(
name=sale.buyer_name,
email=sale.buyer_email,
license_key=license_key,
tier=tier_enum,
features=all_features_for_tier(tier_enum),
issued_at=issued_iso,
expires_at=expires_iso,
signature="",
)
signature = sign(unsigned.to_canonical_dict())
payload = unsigned.to_canonical_dict()
payload["signature"] = signature
blob = encode_blob(payload)
row = License(
license_key=license_key,
name=sale.buyer_name,
email=sale.buyer_email,
tier=sale.tier,
issued_at=_iso_to_dt(issued_iso),
expires_at=_iso_to_dt(expires_iso),
blob=blob,
source=sale.source,
source_order_id=sale.source_order_id,
promotion=sale.promotion,
amount_paid=sale.amount_paid,
currency=sale.currency,
notes=sale.notes,
)
session.add(row)
session.flush()
return row
def revoke_license(
    session: Session,
    *,
    license_key: str,
    reason: Optional[str] = None,
) -> Optional[License]:
    row = session.get(License, license_key)
    if row is None:
        return None
    row.revoked_at = datetime.now(timezone.utc)
    if reason:
        suffix = f"\nRevoked: {reason}"
        row.notes = ((row.notes or "") + suffix).strip()
    return row
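
The two small helpers in this module, key generation and ISO parsing, can be exercised on their own. The sketch below copies them under public names; the `"pro"` tier and the timestamp are illustrative values, not taken from the real tier list.

```python
import re
import uuid
from datetime import datetime, timezone


def generate_license_key(tier: str) -> str:
    rid = uuid.uuid4().hex  # 32 lowercase hex characters
    return f"DT1-{tier.upper()}-{rid[:8]}-{rid[8:16]}"


def iso_to_dt(iso: str) -> datetime:
    # fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so normalise it to an explicit UTC offset first.
    return datetime.fromisoformat(iso.replace("Z", "+00:00"))


key = generate_license_key("pro")  # "pro" is a hypothetical tier name
key_ok = re.fullmatch(r"DT1-PRO-[0-9a-f]{8}-[0-9a-f]{8}", key) is not None
parsed = iso_to_dt("2026-06-29T22:56:17Z")

print(key, key_ok, parsed.isoformat())
```

Each key carries 16 hex characters taken from a random UUID, and the parsed datetime is timezone-aware, matching the `DateTime(timezone=True)` columns it is stored in.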

97
server/app/models.py Normal file

@@ -0,0 +1,97 @@
"""ORM models for the licenses + gumroad_events tables.
Schema mirrors ``docs/LICENSE-SERVER.md``, generalized so any
``source`` can populate it. The ``(source, source_order_id)``
composite uniqueness key gives idempotent webhook retries — a
storefront firing the same sale twice maps to the same row.
"""
from __future__ import annotations
from datetime import datetime
from typing import Optional
from sqlalchemy import (
JSON,
BigInteger,
DateTime,
Index,
Integer,
Numeric,
String,
UniqueConstraint,
func,
text,
)
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import Mapped, mapped_column
# JSONB on Postgres (indexable, queryable), plain JSON elsewhere
# (SQLite for tests). Same Python interface either way.
_JSON_TYPE = JSON().with_variant(JSONB(), "postgresql")
# SQLite only auto-increments INTEGER PRIMARY KEY (not BIGINT).
# Postgres can autoincrement either, so the variant keeps the
# production migration on BigInteger while tests use Integer.
_PK_TYPE = BigInteger().with_variant(Integer(), "sqlite")
from app.db import Base
class License(Base):
__tablename__ = "licenses"
license_key: Mapped[str] = mapped_column(String, primary_key=True)
name: Mapped[str] = mapped_column(String, nullable=False)
email: Mapped[str] = mapped_column(String, nullable=False)
tier: Mapped[str] = mapped_column(String, nullable=False)
issued_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), nullable=False)
expires_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), nullable=False)
blob: Mapped[str] = mapped_column(String, nullable=False)
source: Mapped[str] = mapped_column(String, nullable=False)
source_order_id: Mapped[Optional[str]] = mapped_column(String, nullable=True)
promotion: Mapped[Optional[str]] = mapped_column(String, nullable=True)
amount_paid: Mapped[Optional[float]] = mapped_column(Numeric(10, 2), nullable=True)
currency: Mapped[Optional[str]] = mapped_column(String(3), nullable=True, server_default=text("'USD'"))
revoked_at: Mapped[Optional[datetime]] = mapped_column(DateTime(timezone=True), nullable=True)
notes: Mapped[Optional[str]] = mapped_column(String, nullable=True)
created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), nullable=False, server_default=func.now())
updated_at: Mapped[datetime] = mapped_column(
DateTime(timezone=True),
nullable=False,
server_default=func.now(),
onupdate=func.now(),
)
__table_args__ = (
UniqueConstraint("source", "source_order_id", name="uq_licenses_source_order"),
Index("ix_licenses_email_lower", func.lower(text("email"))),
Index("ix_licenses_expires_active", "expires_at", postgresql_where=text("revoked_at IS NULL")),
)
class GumroadEvent(Base):
"""Append-only audit log of every webhook delivery.
Stored regardless of processing outcome so we can replay failed
events, investigate disputes, and reconstruct the customer
record if the ``licenses`` table is ever corrupted.
"""
__tablename__ = "gumroad_events"
id: Mapped[int] = mapped_column(_PK_TYPE, primary_key=True, autoincrement=True)
received_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), nullable=False, server_default=func.now())
event_type: Mapped[str] = mapped_column(String, nullable=False)
order_id: Mapped[Optional[str]] = mapped_column(String, nullable=True)
raw_payload: Mapped[dict] = mapped_column(_JSON_TYPE, nullable=False)
processed: Mapped[bool] = mapped_column(server_default=text("false"), nullable=False)
error: Mapped[Optional[str]] = mapped_column(String, nullable=True)
__table_args__ = (
Index("ix_gumroad_events_order_id", "order_id"),
Index("ix_gumroad_events_unprocessed", "received_at", postgresql_where=text("processed = false")),
)
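
The `(source, source_order_id)` unique constraint is what makes webhook retries idempotent while still allowing any number of manual mints, because NULL order IDs never compare equal to each other. The sketch below demonstrates that with the standard-library `sqlite3` module and a cut-down table, not the real schema; PostgreSQL's default handling of NULLs in unique constraints is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE licenses ("
    " license_key TEXT PRIMARY KEY,"
    " source TEXT NOT NULL,"
    " source_order_id TEXT,"
    " UNIQUE (source, source_order_id))"
)


def insert(key, source, order_id):
    conn.execute("INSERT INTO licenses VALUES (?, ?, ?)", (key, source, order_id))


insert("k1", "gumroad", "order-1")
try:
    insert("k2", "gumroad", "order-1")  # storefront retry: same pair
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# NULL order IDs never collide, so repeated manual mints are allowed.
insert("k3", "manual", None)
insert("k4", "manual", None)
row_count = conn.execute("SELECT COUNT(*) FROM licenses").fetchone()[0]

print(duplicate_rejected, row_count)
```

The constraint also acts as a backstop if two deliveries of the same sale race past an application-level existence check: the second insert fails rather than creating a duplicate license.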
