Compare commits

34 Commits

Author SHA1 Message Date
41ab2166ef build(ci): wire macOS code signing + notarization into release workflow
Add a guarded "Sign & notarize macOS app" step to build.yml that signs
dist/DataTools.app with the Developer ID (hardened runtime + entitlements
+ secure timestamp), notarizes via notarytool, and staples the ticket —
running before DMG packaging. The step exits 0 with a warning when the
MACOS_* secrets are absent, so dry-run dispatches still produce an
(unsigned) build.
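
The guard can be sketched in Python as follows. This is an illustration only: the secret names are hypothetical (the message only says they share the MACOS_ prefix), the real step lives in build.yml, and the actual signing, notarization, and stapling commands are left out.

```python
import sys

# Hypothetical secret names; the commit only says they share the MACOS_ prefix.
REQUIRED_SECRETS = ("MACOS_CERTIFICATE", "MACOS_NOTARY_PASSWORD")


def missing_secrets(env):
    """Return the required secret names that are unset or empty in env."""
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


def sign_step(env):
    """Return the step's exit code; skipping is treated as success."""
    missing = missing_secrets(env)
    if missing:
        # Warn and succeed so dry-run dispatches still produce an unsigned build.
        print(f"warning: signing skipped, missing {', '.join(missing)}", file=sys.stderr)
        return 0
    # Real signing, notarization, and stapling would run here.
    return 0
```

Returning 0 on the skip path is what keeps a dispatch without secrets green; a failure inside the real signing commands would still fail the job.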

Add build/macos/entitlements.plist with the hardened-runtime entitlements
a frozen PyInstaller/CPython app needs (JIT memory, library-validation
disabled for bundled .so/.dylib + Tesseract). Update build/README.md to
reflect that macOS signing is now wired and only needs the secrets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:56:17 +00:00
9943e6e537 test(demo): cover the demo app + sales-surface coherence
Adds a demo test suite on top of the data-value pins:

- tests/gui/test_app_demo.py (new, AppTest): every accounting persona
  renders with its dataset, the default/unknown-persona fallback resolves
  to bookkeeper, clicking Run produces the AFTER value (rows reduced to the
  validated count) with the watermarked download + Gumroad CTA, and
  switching persona via the quick-switch dropdown clears the stale result.
- tests/test_demo_pipelines.py (extended): cross-surface coherence —
  each persona key served by app_demo has a matching landing page whose
  iframe (?p=) and CTA (from=) point at it and that the hub links to;
  no retired Shopify/RevOps language remains in landing HTML; and the
  demo download still appends exactly one watermark row.

Full suite: 2584 passed, 91 skipped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 19:06:50 +00:00
e7ec79b9b5 demo: retarget landing pages to the accounting audience
Reorients the whole sales surface to accounting so it matches the rebuilt
demos. Replaces the Shopify and RevOps persona pages with accounts-payable
(1099) and accounts-receivable pages, refreshes the bookkeeper page, and
rewires the hub + deploy tooling:

- landing/bookkeeper/  — refreshed to the validated bank-rec demo
  (26 -> 20, six phantom duplicates), iframe ?p=bookkeeper.
- landing/ap-1099/     — NEW (replaces shopify-pet/): 1099 vendor prep,
  "24 records -> 8 vendors, 7 missing EINs recovered", iframe ?p=ap-1099,
  amber accent.
- landing/ar-aging/    — NEW (replaces revops/): AR open invoices,
  "26 -> 21, five double-entered invoices removed", iframe ?p=ar-aging,
  green accent.
- landing/index.html   — hub rewritten with the three accounting cards.
- deploy.py / deploy.config.example.json / README.md / _shared/styles.css
  — persona list, sitemap defaults, 404 links, cross-links, docs updated.

All demo iframes now point at the renamed app_demo personas; deploy.py
builds the dist bundle cleanly (verified) and the Gumroad ?from= tags match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 18:59:50 +00:00
6df726e69e demo: reconstruct sales demos for an accounting audience
Replaces the Shopify / RevOps / Bookkeeper demo trio with three accounting
personas that share one buyer, each entering through a workflow where a
messy export costs money — all running the same saved 4-step pipeline:

- bank_reconciliation.csv (Bookkeeper): 26 -> 20 rows, 6 double-posted
  transactions caught after date+amount standardization.
- vendor_1099.csv (AP / 1099): 24 records -> 8 vendors, 7 missing EINs
  recovered via dedup merge — the 1099-complete story.
- ar_open_invoices.csv (AR): 26 -> 21 rows, 5 double-entered invoices
  removed, blank status backfilled from the twin row.

Every number is validated against the live engine and pinned by
tests/test_demo_pipelines.py (read path mirrors app_demo._load_demo:
dtype=str, keep_default_na=False). Rewires src/gui/app_demo.py PERSONAS
(keys bookkeeper / ap-1099 / ar-aging, accounting H1/sub/CTA) and rewrites
docs/DEMO-PLAN.md sections 3/4/7 with the validated outcomes.
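
A minimal sketch of why the read path matters, assuming pandas is installed. The sample rows and the load_demo_csv name are made up for illustration; only the two read_csv options come from the message above.

```python
import io

import pandas as pd

# Made-up sample: a leading-zero EIN, a vendor literally named "NA", a blank EIN.
RAW = "vendor,ein,amount\nAcme,012345678,100.50\nNA,,7\n"


def load_demo_csv(text):
    """Read every cell as a string and keep blanks as empty strings."""
    return pd.read_csv(io.StringIO(text), dtype=str, keep_default_na=False)


df = load_demo_csv(RAW)
```

With these options the leading zero survives, the literal "NA" is not turned into a missing value, and the blank EIN stays an empty string, so row counts pinned in a test see the same cells the demo does.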

(Repo hygiene forced by a partial-clone gap: finalizes the already-deleted,
unreferenced samples/messy_text.csv whose blob was unrecoverable.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 18:52:39 +00:00
38616d69e2 test(pipeline): complete automated test suite for the pipeline feature
Adds ~115 tests pinning the Automated Workflows feature end to end:

- tests/test_pipeline.py (+43): per-adapter summary correctness on known
  inputs, multi-step data flow, error stop/continue contract, empty /
  single-column / all-disabled edges, dict+file serialization round-trips,
  recommended_pipeline(include=…), and a synthesized demo integration run.
- tests/test_cli_pipeline.py (new, 21): --recommend, dry-run-by-default,
  --apply output CSV + audit JSON, --steps, --strict abort, arg validation,
  --continue-on-error vs halt, and a save→load round-trip. Invokes the Typer
  app directly to bypass the license guard (house pattern).
- tests/gui/test_pipeline_builder.py (+9): reorder ▲/▼, disabled edge
  buttons, disabled-step persistence across reorder, restore-recommended,
  Advanced JSON export/import, and per-tool Configure panels emitting the
  correct option dicts (AppTest).
- tests/gui/test_pipeline_phrasing.py (new, 30): step_phrase/step_status and
  the adapter-key→friendly-name bridge as pure functions, incl. pluralization,
  column prose, and warn/error status derivation.

Full suite: 2565 passed, 91 skipped. No product bugs surfaced. Documents the
coverage in docs/DEVELOPER.md (test tree + a pipeline-coverage note).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 18:31:15 +00:00
00d3f28865 feat(pipeline): plain-English per-step result summaries
Replaces the raw-JSON summary column in the Results table with the mockup's
plain-English phrasing: "312 duplicates removed across 147 groups
(18,442 → 18,130 rows)", "1,204 cells cleaned in name & city", etc.
(correct singular/plural via a small _n helper).

Adds step_phrase() and step_status() to pipeline_modules.py. step_status
derives the status pill (✓ ok / ⚠ ok · N skipped / ✗ error / ⏭ skipped) and,
for warn/error steps (e.g. format_standardize unparseable cells, column_map
coercion failures / missing required targets), an inline detail callout
rendered directly below the results table — surfacing non-fatal issues in
context without a dedicated always-empty column.
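
The status derivation might look roughly like this. The result-dict fields (ran, error, skipped_cells) and the status_pill name are hypothetical; only the four pill labels are taken from the message.

```python
def status_pill(result):
    """Map one step result to the pill text shown in the Results table."""
    if not result.get("ran", True):
        return "⏭ skipped"  # step was disabled or never reached
    if result.get("error"):
        return "✗ error"
    skipped = result.get("skipped_cells", 0)
    if skipped:
        # Non-fatal issues, e.g. unparseable cells, downgrade ok to a warning.
        return f"⚠ ok · {skipped} skipped"
    return "✓ ok"
```

Checking the error before the skipped count means a failed step never shows as a mere warning.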

Extends tests/gui/test_pipeline_builder.py with phrasing + status assertions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 18:21:17 +00:00
837f4b88b5 feat(pipeline): visual module-card builder for Automated Workflows
Replaces the raw options_json data-editor table with a per-step "module
card" builder matching the locked design mockup
(layout-review/09_pipeline_runner.html): each step shows a friendly name +
caption, an enable toggle, ▲/▼/✕ reorder/remove controls, and a Configure
expander that renders that tool's own controls in plain language. Raw JSON
is demoted to an Advanced import/export section.

New src/gui/components/pipeline_modules.py holds the adapter-key→tool_id
friendly-name bridge, one plain-language config renderer per tool
(text_clean, format_standardize, missing, column_map, dedup — emitting the
exact JSON option shapes the core adapters accept), and render_step_card.
Steps live in session state as an ordered list with stable ids so widget
keys survive reorder/remove. Reorder is ▲/▼ buttons (no JS drag dependency).
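
A sketch of the stable-id idea, assuming steps are plain dicts; new_step, move, and widget_key are illustrative names rather than the functions in pipeline_modules.py.

```python
import itertools

_ids = itertools.count(1)


def new_step(tool):
    """Create a step whose id never changes, whatever its position."""
    return {"id": next(_ids), "tool": tool, "enabled": True}


def move(steps, step_id, delta):
    """Swap a step with its neighbour (delta -1 for up, +1 for down), in place."""
    i = next(k for k, s in enumerate(steps) if s["id"] == step_id)
    j = i + delta
    if 0 <= j < len(steps):  # edge buttons are no-ops
        steps[i], steps[j] = steps[j], steps[i]
    return steps


def widget_key(step, name):
    """Key widgets by step id, not list index, so state follows the step."""
    return f"step-{step['id']}-{name}"


steps = [new_step(t) for t in ("text_clean", "format_standardize", "dedup")]
```

Keying widgets by list index instead would hand a moved step the previous occupant's toggle and config state after a reorder.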

The on-disk/CLI pipeline JSON format is unchanged — CLI and src/core
untouched. Adds tests/gui/test_pipeline_builder.py (AppTest) covering seed,
configure panels, toggle/add/remove, and a full run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 18:16:09 +00:00
fd9606c67b build: drop the local Python release method, return to CI-only installer builds
Removes the single-command Python packaging method (build/make_release.py
+ build/build_portable_zip.py + build/macos/build_zip.sh) and the portable
.zip artifacts it produced. Release builds go back to the original GitHub
Actions process: the CI matrix builds one installer per platform (.dmg /
.exe / .AppImage) on tag push and attaches them to a GitHub Release.

Tesseract OCR bundling is preserved: the fetch helpers the workflow depends
on (fetch_tessdata, fetch_tesseract_for_platform) are extracted into a
standalone build/tesseract.py, which build.yml now imports.

Docs (README, build/README, DEVELOPER, TECHNICAL, USER-GUIDE, vendor README,
es translations) updated to drop the portable-zip flavor and point at the
new module.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 17:47:36 +00:00
28ab51a869 Merge ui-redesign: journey-level UX redesign + live-app port
Brings the design-review mockups and the highest-leverage live-app
changes into main:
- layout-review/ mockups: 12-page review addressed; front door, taught
  pipeline order, consistent intake, coming-soon stubs, shared tokens.
- Live src/gui/: nav reordered to pipeline order with new Finance +
  Coming-soon groups; Home is the "Start here" front door with a
  one-click "Clean these files for me" pipeline runner; local-first
  pill on every working tool header.
- DECISIONS.md: PDF to CSV + Reconcile kept in-bundle under Finance.

Full suite green: 2441 passed, 91 skipped, 0 failed.

Follow-ups tracked (not blockers): streamlit-run visual verification of
the live UI; i18n keys for the front-door copy (English literals today);
rebuild the live coming-soon stub page bodies.
2026-06-08 17:41:30 +00:00
1895074b8f test+fix(gui): retire the now-empty "analysis" nav section
The journey-level nav restructure moved Home to a standalone "Start
here" entry and Reconcile into the "Finance" group, leaving the
"analysis" section with zero tools. Two registry tests encoded the old
layout and failed:
- test_every_section_has_at_least_one_tool[analysis] (empty section)
- test_reconciler_present (asserted section == "analysis")

Drop "analysis" from the Section literal, SECTION_LABELS, and app.py's
by_section bucket — it's genuinely dead now (home isn't a registry Tool).
Update the presence tests to assert Reconcile + PDF to CSV live in
"finance". The section-invariant tests (every section non-empty, has a
label, no orphan labels) are preserved and pass.

Full suite: 2441 passed, 91 skipped, 0 failed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 17:11:02 +00:00
d807d3c11b feat(gui): add the one-click "Clean these files for me" front door
Issue #1 (the make-or-break UX fix): after the analyzer runs, Home now
leads with a primary "Clean these files for me" CTA that runs the
recommended pipeline (Clean Text -> Standardize -> Fix Missing -> Find
Duplicates, in order) on every imported file and hands back a cleaned
CSV per file — collapsing "which tool, what order" to one click. The
existing per-finding cards remain, reframed as "Or fix issues one at a
time" for users who want manual control.

- Reuses the core API verbatim (recommended_pipeline + run_pipeline);
  reader mirrors 9_Pipeline_Runner._read_uploaded so files load the same
  way the standalone orchestrator loads them.
- Per-file errors are captured so one bad file doesn't kill the batch;
  cleaned CSVs are cached in session_state so downloads survive reruns
  and are pruned when a file is removed or re-analyzed.
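
The batch behaviour described in these bullets can be sketched as below. clean_batch and prune_cache are hypothetical names, and clean_one stands in for the read -> run_pipeline -> CSV path; plain dicts replace session_state.

```python
def clean_batch(files, clean_one):
    """Run clean_one on every file, collecting failures instead of raising."""
    cleaned, errors = {}, {}
    for name, payload in files.items():
        try:
            cleaned[name] = clean_one(payload)
        except Exception as exc:  # one bad file must not kill the batch
            errors[name] = f"{type(exc).__name__}: {exc}"
    return cleaned, errors


def prune_cache(cache, current_names):
    """Drop cached outputs for files that are no longer imported."""
    for name in list(cache):
        if name not in current_names:
            del cache[name]
    return cache
```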

Verified: the read -> run_pipeline -> CSV data path executes correctly
(compile + a non-Streamlit functional smoke test). The Streamlit UI
scaffolding (button / download_button / progress / session_state)
mirrors the proven runner page but still needs a `streamlit run` check.
Front-door copy is English literals for now; i18n keys are a follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 17:06:30 +00:00
09ec01e98b feat(gui): port journey-level nav + local-first pill to the live app
Brings the live Streamlit app in line with the finalized layout-review
mockups (structural/low-risk changes; verified by compile + registry
sanity, still pending a streamlit-run visual check):

- tools_registry: Data Cleaners now in pipeline order (Clean Text ->
  Standardize -> Fix Missing -> Find Duplicates); new "finance" section
  (Reconcile, PDF to CSV) and "coming_soon" section (Find Unusual,
  Quality Check, Combine Files). Adds those to the Section type +
  SECTION_LABELS.
- app.py: Home becomes the "Start here" front door — a standalone,
  unlabeled top entry (play_circle icon) ahead of the hidden
  Activate/Logs/Close pages; nav groups reordered cleaners ->
  transformations -> automations -> finance -> coming soon.
- _legacy.py: render_tool_header now shows the "Runs 100% locally"
  privacy pill (right-aligned, Ready tools only — omitted on Coming
  Soon stubs); accent emphasis CSS for the Start-here nav link.
- i18n: add nav.start_here_title, nav.section_finance,
  nav.section_coming_soon to en + es packs.
- DECISIONS.md: log the PDF/Reconcile in-bundle (Finance group) call.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 17:01:57 +00:00
48251b625f refactor(layout-review): consolidate tool-header actions + align reconcile downloads
Consistency pass over the parallel-agent work:
- Replace 4 divergent inline header wrappers (flex/inline-flex, gap
  10/12px, margin-top present/absent across 8 tool pages) with one shared
  .dt-tool-header-actions class; strip the now-redundant per-button
  margin-top:0. Every tool header now aligns the local-first pill + Help
  button identically.
- Reconcile downloads row: reorder to the page's exceptions-first order
  (Review, Unmatched left, Unmatched right, Matched) to match the tabs and
  metric strip, and drop the lone competing primary — the four are
  parallel exports of equal weight.

Audited and confirmed already-consistent: compact intake banner, privacy
pill markup, .dt-next-step strips, the three coming-soon stubs, primary
CTAs, and the 3-download CSV/audit/config pattern.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:50:25 +00:00
dd0942d710 feat(layout-review): journey-level redesign — front door, taught order, consistency
Addresses the journey-level review (the app felt like 12 tools sharing a
stylesheet, not one guided product). File-partitioned changes:

Navigation (shell.js): rename Home -> "Start here" with front-door
emphasis (.dt-nav-start); reorder Data Cleaners into pipeline order
(Clean Text -> Standardize -> Fix Missing -> Find Duplicates); new
"Finance" group (Reconcile, PDF to CSV); all stubs moved to a bottom
"Coming soon" group, no longer interleaved with working tools.

Front door (home.html): a prominent primary "Clean these files for me"
that runs the recommended pipeline in order, above the existing
per-finding cards (reframed as "fix one thing at a time").

Shared tokens (app.css): .dt-next-step suggestion strip + .dt-nav-start.

Teach the order: a slim .dt-next-step strip at the end of each linear
cleaner page points to the next pipeline step (Map Columns -> Start here;
orchestrator/Finance pages correctly omit it).

Local-first: the green "Runs 100% locally" pill now sits in every working
tool page's header (home + 8 tools), where client data is entered.

Plain English: jargon relabeled on input controls (coerce, E.164,
NFC/NFKC, sentinels, survivor rule), technical terms kept in tooltips and
audit/output cells only.

Stubs (06/08/07): rebuilt to one identical skeleton — info line + plain
feature list + a real "Notify me when this ships" button; every disabled
control and uploader removed (a dimmed dropzone reads as broken).

Intake: full dropzone+chip replaced with the compact "Using <file>" banner
on Clean Text, Fix Missing, Find Duplicates, and both Reconcile sides.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:44:11 +00:00
cf31d9ef14 feat(layout-review): address review findings on pages 7-12
Find Duplicates (01_deduplicator):
- Delete the redundant outer Options wrapper; surface threshold +
  survivor rule directly, push the rest behind a single Advanced pane.
- Disambiguate competing primaries: top result is an auto-resolved
  preview (secondary download), review decisions are the single primary.
- Plain-English match labels (exact / approximate); clarify the third.
- Lift the match-card caption to a one-time instruction; note delimiter
  is delimited-text-only.

Quality Check (08_validator_reporter) — stub:
- Remove the dead disabled "Load rules file (JSON)" uploader so the
  stub invites a single action; keep the informative feature list.

Map Columns (05_column_mapper):
- Regroup schema -> mapping -> strategy/advanced (core task contiguous).
- Make preset-vs-Advanced precedence legible (Custom + modified marker).
- Adopt the compact file-intake banner; drop the duplicate resolved-
  mapping table; fix the add-row gutter style.

Combine Files (07_multi_file_merger) — stub:
- Actually disable the Merge CTA (add the disabled attribute).

PDF to CSV (10_pdf_extractor):
- Drop page/raw from the default preview to match export + fix the
  horizontal clip; surface raw via per-row affordance + overflow-x.
- Move the column selector above the download button; give auto-excluded
  rows a reason; align the files card to Home; de-dupe the row count.

Automated Workflows (09_pipeline_runner):
- Replace hand-edited JSON step config with per-step control expanders;
  JSON moved behind Advanced import/export.
- Editing the table marks the mode modified; fold the empty error column
  into the status pill; render summaries as plain English; collapse the
  explainer by default.

Cross-cutting items (stub standardization on page 10, shared disabled-
field token, remaining intake rollout) deferred to a holistic pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:35:46 +00:00
563d845b70 feat(layout-review): address review findings on pages 4-6
Find Unusual Values (06_outlier_detector) — coming-soon stub:
- Anchor the disabled Method on IQR (multiplier 1.5), not Z-score, per
  the logged robustness decision.
- Drop the redundant feature bullet list (kept alert + greyed controls
  + disabled button); also fixes the MAD-only-in-bullets mismatch.
- Remove the live uploader that dead-ended into disabled controls.
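
For reference, the IQR rule with a 1.5 multiplier can be written as follows. This is the textbook rule using Python's statistics module, not the product's detector (the page is still a coming-soon stub), and iqr_outliers is an illustrative name.

```python
import statistics


def iqr_outliers(values, multiplier=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [v for v in values if v < low or v > high]
```

Because the fences come from quartiles rather than the mean and standard deviation, a single extreme value does not widen them.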

Clean Text (02_text_cleaner):
- Add an inline hidden-character legend (3 swatches reusing the actual
  badge classes) beside the canonical "Show hidden characters" toggle.
- Unify the two hidden-char toggles: preview one is canonical; the
  Results bare checkbox is wrapped in a field + bound note.
- Describe all three presets (minimal / excel-hygiene / paranoid).
- Give "Changes by column" a real "column" header instead of the
  grey index-gutter style.

Standardize Formats (03_format_standardizer):
- Make preset-vs-control precedence legible: preset shows Custom with a
  "modified" marker + base tag, diverging controls flag the winning
  value (same pattern as Fix Missing Values).
- Replace the dead-end unparseable alert with a real "Unparseable
  cells (47)" expander the alert now points to.
- Honest preview caption: "5 of 6 columns (notes skipped)".
Intake pattern (the cross-page reference) left untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:27:42 +00:00
be1e263223 feat(layout-review): address Fix Missing Values review findings
- Pin down strategy precedence: add a resolution-order legend
  (per-column -> global -> preset), dim/strike the preset radios when
  a global strategy overrides them, and add a "Resolves to" column to
  the per-column override table so the winning value is legible.
- Make the demo state honest: Global strategy = median is what drives
  the 1,043 fills, resolving the detect-only contradiction.
- Surface the missingness profile as an always-visible block above the
  (now-open) Options expander — diagnostic before configuration.
- Stop highlighting unchanged before/after cells (respondent_id 0->0);
  show "(global)" placeholders in unset per-column override cells.
- Fold the standalone "Strategy applied per column" table into the
  before/after table as a strategy column; inset maxed slider knobs.
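
The resolution order in the first bullet (per-column -> global -> preset) amounts to a first-match lookup. A sketch, with resolve_strategy and its arguments as assumed names:

```python
def resolve_strategy(column, per_column, global_strategy, preset_strategy):
    """Return (strategy, source) for a column; the most specific setting wins."""
    override = per_column.get(column)
    if override:
        return override, "per-column"
    if global_strategy:
        return global_strategy, "global"
    return preset_strategy, "preset"
```

Returning the source alongside the strategy is what a "Resolves to" column needs in order to show which level won.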

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:23:32 +00:00
7ebfd0f153 feat(layout-review): address Reconcile page review findings
- Fix doubled "Invert right amount sign" label: keep the field label,
  strip the checkbox caption to the box only (also evens the 3-up row).
- Reorder results exceptions-first: tabs and metric strip both run
  Review -> Unmatched left -> Unmatched right -> Matched, with Review
  the default active tab and its table as the inline content; Matched
  demoted to a trailing context expander.
- Surface the "references must match left count" rule with an inline
  validation indicator under the right reference field instead of a
  label note alone.
- Mark the required Amount join key with the .req accent star on both
  sides so it reads distinct from the optional date/description pickers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:17:20 +00:00
2592604067 feat(layout-review): address Home page review findings
- Findings card no longer truncates silently: panel #1 gains a
  .dt-finding-more overflow control ("Show all 8 findings · 5 more").
- Replace the dead "Files analyzed: 3" stat (restated the section meta
  + visible rows) with "Rows scanned" — info not already on screen.
- Collapsed findings panels use a real .is-collapsed state variant
  instead of inline margin-bottom:-16px hacks, so states can't drift.
- Action bar buttons are content-sized; drop the 340px island that
  jarred against the full-width divider/stats below it.

Branding kept as deliberate landing-style treatment on Home (per
review decision); interior tool pages remain title-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:14:04 +00:00
58d0009849 refactor(layout-review): inline assets beside pages
Move app.css and shell.js into layout-review/ alongside the .html files
and reference them by bare filename; drop the assets/ subfolder.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 15:43:31 +00:00
b6c39d7a09 refactor(layout-review): move assets to repo root
Relocate assets/ (app.css, shell.js) from layout-review/ up to the repo
root and rewrite every page's link/script refs to ../assets/.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 15:31:53 +00:00
b2fa8503e6 chore: add layout-review HTML mockups
Static layout mockups for each app tool (deduplicator, text cleaner,
format standardizer, missing handler, column mapper, outlier detector,
multi-file merger, validator/reporter, pipeline runner, PDF extractor,
reconciler) plus index/home shells and shared assets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 15:28:23 +00:00
b703911df3 docs: reflect bundled Tesseract on every install surface
- NEW LICENSE_TESSERACT.txt at the repo root: header noting it covers
  the bundled Tesseract OCR binary (Apache 2.0, upstream
  tesseract-ocr/tesseract, copyright Google + contributors) and the
  eng.traineddata from tessdata_best (also Apache 2.0). Clarifies
  DataTools itself remains proprietary. Full canonical Apache 2.0
  license text included.
- README.md + README.es.md (Download section): bumped size estimate
  ~200 MB → ~300 MB, added a short paragraph stating Tesseract OCR
  is bundled (no separate install required), with a link to the new
  license file.
- docs/USER-GUIDE.md + docs/USER-GUIDE.es.md (§1.6 System
  requirements): bumped disk estimate, added a paragraph stating
  Tesseract 5.5 + eng.traineddata ship inside every installer /
  portable / AppImage, with a source-install fallback hint pointing
  developers to DEVELOPER.md.
- docs/DEVELOPER.md: new "PDF Extractor — bundled Tesseract" section
  documenting the runtime layout (sys._MEIPASS / tesseract / …),
  discovery order, source of bytes (build/vendor/tessdata + per-
  platform fetch in make_release.py), version pin, update recipe.
- docs/TECHNICAL.md: new §3.10 "Bundled Tesseract (PDF Extractor
  OCR)" — short version of the discovery order for the build
  pipeline section.
- build/README.md: distribution-outputs paragraph now lists
  Tesseract among bundled deps with the ~250-300 MB estimate; new
  "Tesseract bundling" section: layout diagram, resolver order,
  source of bytes + 5.5.0 pin, update steps, license-file ref.

Out-of-scope gaps noted by the docs sweep:
- docs/FUTURE-TOOLS.md §D still describes Tesseract bundling as a
  high-risk packaging headache; now superseded. Worth a one-line
  "(resolved — bundled as of v1.x)" callout in a future pass.
- USER-GUIDE §2 "What's included" table doesn't list PDF Extractor
  at all (it shipped in b8aff86…967d3f6). Separate gap to close.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:20:50 +00:00
93ccada974 build: bundle Tesseract 5.5.0 + tessdata into every release artifact
End users no longer have to install Tesseract separately for OCR on
scanned PDFs — the engine ships inside the installer, portable .zip,
and AppImage for all three platforms.

Per-platform fetch in build/make_release.py (run before PyInstaller):
- Windows: download UB-Mannheim installer 5.5.0.20241111, extract
  with 7-Zip, copy tesseract.exe + required DLLs into the staging dir.
- macOS: ``brew install tesseract``, copy binary + every Homebrew-
  prefixed dylib resolved via otool -L (recurse one level for
  transitive deps), then install_name_tool rewrites IDs / load paths
  to @loader_path/... so the bundle is relocatable.
- Linux: ``apt-get install tesseract-ocr libtesseract5``, copy binary
  + every non-system .so from ldd output, patchelf --set-rpath '$ORIGIN'.

Wire-up:
- build/datatools.spec reads DATATOOLS_TESS_STAGING env var (set by
  make_release) and adds the staging dir + tessdata + the
  LICENSE_TESSERACT.txt Apache 2.0 attribution to PyInstaller datas
  so they land at <bundle>/tesseract/{tesseract[.exe],tessdata/}
  and the license sits at the bundle root. Soft-warns when staging
  is empty so dev spec runs still complete.
- English tessdata pulled by fetch_tessdata() from
  tesseract-ocr/tessdata_best (eng.traineddata, ~16 MB). Cached at
  build/vendor/tessdata/.
- .github/workflows/build.yml: actions/cache@v4 step keyed on
  ``tesseract-${runner.os}-5.5.0-tessdata_best-v1`` caches the
  staging dir and the vendored tessdata across runs; apt installs
  patchelf on the Linux runner; PyInstaller step now receives the
  DATATOOLS_TESS_STAGING env var.
- .gitignore: build/_tesseract/ and the .traineddata blob.
- TESSERACT_SKIP_FETCH=1 honored for offline / manual stages.
- Installer / .dmg / .zip / AppImage scripts: one-line comments
  confirming Tesseract rides along automatically via PyInstaller's
  datas (no extra packaging steps required in those scripts).

Bundle-size delta: ~50-70 MB on disk per platform, ~25-40 MB post-
compression. Net installer size ~250-300 MB (was ~120 MB) — accepted
tradeoff for zero end-user OCR setup.

Reversal of the prior "don't bundle Tesseract" decision (option A).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:20:33 +00:00
17faf84aed feat(pdf): probe bundled Tesseract first when running frozen
Adds runtime support for the bundled Tesseract that ships inside the
DataTools installer / portable / AppImage artifacts. When DataTools
is launched from a PyInstaller frozen bundle the OCR engine now
resolves automatically — no end-user install required.

New helpers in src/pdf_extract.py:
- _bundled_tesseract_path() → Path | None — returns
  <sys._MEIPASS>/tesseract/tesseract[.exe] when getattr(sys,
  "frozen", False) AND sys._MEIPASS are present; None in dev.
- _bundled_tessdata_dir() → Path | None — same gating, returns
  <sys._MEIPASS>/tesseract/tessdata.
- _apply_bundled_tessdata_prefix() — sets TESSDATA_PREFIX to the
  bundled tessdata dir before any pytesseract call; only if frozen,
  dir exists, and the user hasn't already overridden the env var.
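
A sketch of the gating these helpers describe. The function names drop the leading underscore to mark them as illustrations, and the platform and env parameters are additions for testability, not part of src/pdf_extract.py.

```python
import os
import sys
from pathlib import Path


def _bundle_root():
    """Return the PyInstaller extraction dir when frozen, else None."""
    base = getattr(sys, "_MEIPASS", None)
    if getattr(sys, "frozen", False) and base:
        return Path(base)
    return None


def bundled_tesseract_path(platform=sys.platform):
    root = _bundle_root()
    if root is None:
        return None  # dev run: no-op
    exe = "tesseract.exe" if platform.startswith("win") else "tesseract"
    return root / "tesseract" / exe


def bundled_tessdata_dir():
    root = _bundle_root()
    return None if root is None else root / "tesseract" / "tessdata"


def apply_bundled_tessdata_prefix(env=os.environ):
    """Set TESSDATA_PREFIX only if frozen, the dir exists, and it is unset."""
    data_dir = bundled_tessdata_dir()
    if data_dir is None or not data_dir.is_dir() or env.get("TESSDATA_PREFIX"):
        return False
    env["TESSDATA_PREFIX"] = str(data_dir)
    return True
```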

Discovery order in ocr_available() / _autodetect_tesseract_path():
1. DATATOOLS_TESSERACT_PATH env override (existing)
2. Bundled binary (NEW — frozen-only)
3. System PATH (existing)
4. Windows well-known install dirs (existing legacy fallback)
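
The four-step order can be sketched as a first-hit search. find_tesseract and its injectable which / exists parameters are assumptions made so the order is easy to test; only the DATATOOLS_TESSERACT_PATH variable comes from the message.

```python
import os
import shutil


def find_tesseract(env, bundled=None, which=shutil.which,
                   well_known=(), exists=os.path.exists):
    """Return the first Tesseract path found, following the order above."""
    override = env.get("DATATOOLS_TESSERACT_PATH")
    if override:
        return override                      # 1. explicit env override
    if bundled and exists(bundled):
        return str(bundled)                  # 2. bundled binary (frozen only)
    on_path = which("tesseract")
    if on_path:
        return on_path                       # 3. system PATH
    for candidate in well_known:
        if exists(candidate):
            return candidate                 # 4. well-known install dirs
    return None
```

In a dev run the caller would pass bundled=None, so step 2 falls through and the pre-existing behaviour is unchanged.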

In dev (not frozen) every new probe is a no-op so the developer
experience is unchanged.

12 new tests cover frozen vs. non-frozen detection on each platform,
the user-override respect for TESSDATA_PREFIX, autodetect priority
ordering, and the no-bundled-dir graceful path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:19:52 +00:00
4d8513b1a3 docs: cover help popover, +/- nav indicators, render_tool_header
User-facing docs (USER-GUIDE en+es, README en+es):
- New short paragraph under §3.1 GUI noting the in-tool Help button
  on every detail page, what it contains (When to use / Steps /
  Examples / Tip), and that content lives in tools.<id>.help_md.
- One-line note in the README tool tables pointing at the same.
- Mention the sidebar +/- nav indicators replacing Streamlit's
  default Material Symbols chevron.

Developer docs:
- DEVELOPER: new "Tool page header" subsection documenting
  render_tool_header(tool_id), the help_md markdown skeleton, and
  the fallback to help.missing_body when a tool's help is absent.
  Update i18n authoring rules to list help.* keys and the per-tool
  help_md field alongside name/description/page_title/page_caption.
- TECHNICAL: new §10c documenting the sidebar nav indicator swap —
  CSS in _HIDE_CHROME_CSS plus _SWAP_NAV_SECTION_INDICATOR_JS
  injected through the hide_streamlit_chrome() iframe bundle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:08:01 +00:00
ac94208d8f chore: production-readiness sweep on the help-popover wave
- Drop unused 'from src.i18n import t' from pages 1-9 (the swap to
  render_tool_header(tool_id) means no page calls t() directly anymore).
  Pages 10, 11 and the underscore-prefixed pages were already clean or
  legitimately use t().

- Rewrite PDF Extractor help_md (en + es). The original prose described
  features the tool does NOT have — template drawing, per-source saved
  templates, automatic reuse. The actual tool is a heuristic batch
  scanner (per its own docstring: "No templates, no per-bank
  configuration"). New copy: scan → uncheck → pick date format → enable
  OCR if needed → download. Spanish version tagged with
  '<!-- TODO: review Spanish -->' since the prose is best-effort.

- Document why both stSidebarNavSectionHeader (legacy, streamlit~=1.35)
  and stNavSectionHeader (current, 1.57) testids appear in the chrome
  CSS — requirements floor is streamlit>=1.35,<2 so dropping the legacy
  selector would silently break the lower bound.

- Pin the t()-returns-key-on-miss contract that render_tool_header's
  fallback path depends on, with a comment at the call site.

- Pin the demo's intentional skip of hide_streamlit_chrome (so the
  +/- sidebar swap JS doesn't ever try to load there) with a load-
  bearing comment in app_demo.py.

- Confirmed i18n parity: every tool id has page_title / page_caption /
  description / name / help_md in BOTH packs; help.button_label and
  help.missing_body in both.
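
The t()-returns-key-on-miss contract mentioned above is what makes a missing help entry detectable. A sketch with a made-up two-entry pack; the key names follow the message, while the pack contents and help_body are illustrative.

```python
# Made-up language pack; real packs hold many more keys.
PACK = {
    "help.missing_body": "Help for this tool is not available yet.",
    "tools.dedup.help_md": "### When to use\nFind repeated rows.",
}


def t(key):
    """Look up a key; on a miss, return the key itself."""
    return PACK.get(key, key)


def help_body(tool_id):
    """Return the tool's help markdown, or the shared fallback body."""
    key = f"tools.{tool_id}.help_md"
    text = t(key)
    # A result equal to the key means the lookup missed.
    return t("help.missing_body") if text == key else text
```

If t() were ever changed to raise or return an empty string on a miss, the comparison against the key would stop detecting absent help, which is why the contract is worth pinning.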

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:07:33 +00:00
4955fb239b test: cover help_md keys, header smoke, and bilingual ES smoke
Two stale Spanish smoke assertions still expected English page titles
for PDF Extractor and Reconciler — the i18n work landed real
translations ("PDF a CSV", "Reconciliar dos archivos"), so refresh the
expected substrings and the surrounding comment.

Add new coverage for the help-popover feature:
- TestHelpPopoverKeys (test_lang_packs): every tool_id resolves a
  non-empty tools.<id>.help_md in BOTH packs; help.button_label and
  help.missing_body resolve in both.
- TestDescriptionCopy (test_tools_registry): every Tool.description
  non-empty and under 120 chars — pins the post-jargon-scrub copy
  so future drift back into multi-clause prose is loud.
- TestRenderToolHeaderSmoke: render_tool_header is callable, listed
  in components.__all__, and every i18n key it touches resolves in
  both packs. Runs without a Streamlit script context.

Suite: 2427 passed (+9 new), 91 skipped.
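
The description-length pin could look something like the sketch below. The `Tool` dataclass, `SAMPLE_TOOLS`, and `description_violations` are hypothetical stand-ins for the real registry and test class. The 120-character limit is the one quoted above.

```python
from dataclasses import dataclass
from typing import Iterable, List

MAX_DESCRIPTION_CHARS = 120  # limit quoted in the commit message


@dataclass(frozen=True)
class Tool:
    """Hypothetical stand-in for the registry's Tool record."""
    tool_id: str
    description: str


def description_violations(tools: Iterable[Tool]) -> List[str]:
    """Return the ids whose description is blank or not under the limit."""
    return [
        tool.tool_id
        for tool in tools
        if not tool.description.strip()
        or len(tool.description) >= MAX_DESCRIPTION_CHARS
    ]


# Illustrative data only; the real registry entries are not shown in this diff.
SAMPLE_TOOLS = [
    Tool("dedupe", "Find and remove repeated rows."),
    Tool("blank", "   "),
    Tool("wordy", "x" * 120),
]
```

A test would assert that `description_violations(...)` returns an empty list for the real registry, so any drift back to long copy fails loudly.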

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:07:19 +00:00
4a8961d58a fix(gui): keep tool-page Help button on one line at narrow widths
When the viewport shrank, the help popover button in the title row
was wrapping its label vertically — ``[icon]`` over ``Help`` — because
the button was set to use_container_width=True and the column it sat
in collapsed below the button's natural width.

Three-part fix:
- Set use_container_width=False on the popover so the button sizes to
  content (icon + label) instead of stretching to the column.
- Widen the column ratio from [10, 1] to [8, 2] so there's room for
  the button without forcing the title text to truncate.
- Add CSS pinning ``white-space: nowrap`` on every popover button (and
  its inner div / p) as defense-in-depth — even if the button does
  get squeezed, the label can't wrap. ``min-width: max-content`` keeps
  the button from compressing below its content.
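
To see why the ratio change matters, here is a small worked calculation. It assumes columns split the container width proportionally and ignores gaps and padding, which Streamlit's real layout does add. The function names and the 440 px container width are illustrative assumptions.

```python
from typing import List, Sequence


def column_widths(ratios: Sequence[float], container_px: float) -> List[float]:
    """Split container_px proportionally to ratios (gaps ignored: an assumption)."""
    total = sum(ratios)
    return [container_px * r / total for r in ratios]


def help_column_px(ratios: Sequence[float], container_px: float) -> float:
    """Width available to the last (Help button) column."""
    return column_widths(ratios, container_px)[-1]


narrow = 440  # hypothetical narrow title row, in pixels
old_px = help_column_px([10, 1], narrow)  # -> 40.0
new_px = help_column_px([8, 2], narrow)   # -> 88.0
```

Under these assumptions the button column roughly doubles, from about 9% to 20% of the row.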

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 17:54:41 +00:00
fe4b5dc755 fix(sidebar): correct testid + JS swap so +/− actually renders
The prior attempt used data-testid=stSidebarNavSectionHeader, which is
not what Streamlit 1.57 emits — the correct testid is stNavSectionHeader
(verified against the bundled JS in streamlit/static/static/js/).
The section header is also a <div> with onClick, not a <button>, and
the React component keeps the expanded state in a prop without
surfacing aria-expanded on the DOM. Pure CSS can therefore neither
locate the header nor switch the glyph by state, which is why the
chevron was unchanged in the rendered UI.

Switch strategies:
- CSS now targets the correct stNavSectionHeader / stIconMaterial
  selectors, drops the Material Symbols font from the icon span, and
  restyles it so a plain text glyph reads as proper typography
  (size, weight, color, hover).
- Add _SWAP_NAV_SECTION_INDICATOR_JS — small inline script that
  rewrites the icon's text node from "expand_more"/"expand_less" to
  "+"/"−" (U+2212), throttled via requestAnimationFrame, re-applied
  on every DOM mutation by a MutationObserver. Bundled into the same
  iframe injection as the existing brand/upload/findings scripts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 17:52:47 +00:00
209b5fb1aa style(sidebar): swap expand chevrons for +/− indicators on nav sections
Streamlit's default sidebar section header uses a Material Symbols
expand_more chevron — three different icons (chevron down, chevron up,
sometimes a plain triangle) depending on version, all of which felt
inconsistent with the rest of the chrome.

Hide the built-in icon (svg / material-symbols span — covered with
multiple selectors for cross-version durability) and render our own
glyph as a right-aligned pseudo-element on the section-header button,
keyed off the standard ARIA aria-expanded attribute:
- collapsed → "+"
- expanded  → "−" (U+2212, visually balanced with +)

Hover deepens the indicator color to match the surrounding nav-link
hover treatment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 17:23:49 +00:00
904356f4e8 feat(gui): inline Help popover next to every tool's title
Adds a contextual Help button on each detail page, right of the title.
Clicking it opens a Streamlit popover with a one-shot how-to: when to
use, numbered steps, before→after examples, and an optional one-line
tip. Designed to be scannable — no paragraph prose.

Implementation:
- New ``render_tool_header(tool_id)`` helper in components replaces the
  bare ``st.title(...) + st.caption(...)`` block on each of the 11 tool
  pages. Title in the wide column, popover in a narrow right column;
  caption sits on its own line beneath.
- Help content is one markdown blob per tool stored in i18n under
  ``tools.<id>.help_md`` (en + es). Editors can tweak copy without
  touching Python.
- ``help.button_label`` and ``help.missing_body`` keys added to both
  packs for the popover trigger and the empty-tool fallback.

All 11 tool pages now use the same header pattern — including the
PDF Extractor and Reconciler which previously had hardcoded title/
caption pairs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 17:21:55 +00:00
7203a81af7 copy: strip jargon from tool descriptions and captions
Prior round only touched page_caption; the description field (shown on
home grid cards) still said "imputation", "missingness",
"winsorization", "schema coercion", "fuzzy matching with normalization",
etc. The audience is non-technical buyers — they shouldn't need a stats
or DB-admin vocabulary to read a tool card.

Rewrite both description and page_caption across en, es, and the
tools_registry (the fallback source of truth) using everyday words:
blanks instead of nulls, fill in instead of impute, look wrong instead
of statistical outliers, etc. Same one-line shape as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 17:09:52 +00:00
dd3b9bd59d copy: tighten tool-page captions to one plain-English line
Each tool's page caption is what tells a user what the tool actually
does the moment they land. They were inconsistent — some terse, most
multi-clause with a redundant "Runs locally — your data never leaves
this computer" trailer that's already a privacy pill on Home.

Rewrite every caption (en + es) as a single ~60-80 char action-first
line. Replaces the hardcoded multi-line Reconciler caption with the
same shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 14:34:34 +00:00
92 changed files with 8698 additions and 2207 deletions


@@ -1,18 +1,17 @@
name: Build installers
# Triggers:
# * Tag push (v*) → produces installers + portable zips, attaches them
# to a GitHub Release.
# * Manual dispatch → uploads everything as workflow artifacts only.
# * Tag push (v*) → produces installers, attaches them to a GitHub Release.
# * Manual dispatch → uploads the installers as workflow artifacts only.
#
# Outputs per platform (downloadable by buyers):
# * macOS: .dmg installer + portable .zip (signed .app inside).
# * Windows: .exe installer + portable .zip (no-install).
# * Linux: .AppImage (already portable; no separate zip).
# * macOS: .dmg installer
# * Windows: .exe installer
# * Linux: .AppImage (already portable; no separate installer step)
#
# Self-contained: every artifact ships its own Python interpreter + every
# runtime dep through PyInstaller. No pre/post install steps on the
# buyer's machine.
# runtime dep (including bundled Tesseract OCR) through PyInstaller. No
# pre/post install steps on the buyer's machine.
#
# What this workflow doesn't do (yet):
# * Code signing (Mac Developer ID, Windows code-signing cert).
@@ -40,16 +39,16 @@ jobs:
include:
- os: macos-latest
platform: mac
installer_glob: dist/DataTools-*-mac.dmg
portable_glob: dist/DataTools-*-mac-portable.zip
artifact_name: DataTools-mac.dmg
artifact_path: dist/DataTools-*-mac.dmg
- os: windows-latest
platform: win
installer_glob: dist/DataTools-*-win-setup.exe
portable_glob: dist/DataTools-*-win-portable.zip
artifact_name: DataTools-win.exe
artifact_path: dist/DataTools-*-win-setup.exe
- os: ubuntu-latest
platform: linux
installer_glob: dist/DataTools-*-linux-x86_64.AppImage
portable_glob: '' # AppImage is already a portable single file
artifact_name: DataTools-linux.AppImage
artifact_path: dist/DataTools-*-linux-x86_64.AppImage
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v4
@@ -65,6 +64,30 @@ jobs:
pip install -r requirements.txt
pip install pyinstaller pillow
# ---- Tesseract bundling cache --------------------------------
# The fetch logic inside build/tesseract.py downloads:
# * build/vendor/tessdata/eng.traineddata (~16 MB, shared)
# * build/_tesseract/<platform>/ (binary + libs, 30-120 MB)
# Cache both so iterative CI runs don't re-download. The
# cache key bakes in the pinned Tesseract version + tessdata
# URL so a version bump invalidates automatically.
- name: Cache Tesseract bundle inputs
uses: actions/cache@v4
with:
path: |
build/_tesseract
build/vendor/tessdata
key: tesseract-${{ runner.os }}-5.5.0-tessdata_best-v1
# ---- Linux: install patchelf so tesseract.py can rewrite
# RPATH on the bundled tesseract binary. apt-get install
# tesseract-ocr is handled inside tesseract.py itself. --------
- name: Install Linux build prereqs for Tesseract bundling
if: matrix.os == 'ubuntu-latest'
run: |
sudo apt-get update
sudo apt-get install -y patchelf
- name: Read version
id: version
shell: bash
@@ -75,19 +98,109 @@ jobs:
- name: Generate platform icons
run: python build/generate_icons.py
# Stage Tesseract before PyInstaller. The tesseract.py helpers
# handle the per-platform fetch (UB-Mannheim on Win, brew on
# Mac, apt on Linux) and stage the binary + libs into
# build/_tesseract/<platform>/ where the spec picks them up.
# We invoke a tiny inline Python so the workflow doesn't have
# to know the per-platform target string.
- name: Stage Tesseract binary + tessdata
shell: bash
env:
DATATOOLS_PLATFORM: ${{ matrix.platform }}
run: |
python - <<'PY'
import os, sys
sys.path.insert(0, "build")
from tesseract import fetch_tessdata, fetch_tesseract_for_platform
target = os.environ["DATATOOLS_PLATFORM"]
fetch_tessdata()
fetch_tesseract_for_platform(target)
PY
- name: Build PyInstaller bundle
shell: bash
env:
# The spec reads this to find the per-platform staging dir;
# see build/datatools.spec for the contract.
DATATOOLS_TESS_STAGING: build/_tesseract/${{ matrix.platform }}
run: pyinstaller build/datatools.spec --clean --noconfirm
# ---- macOS code signing + notarization (before DMG packaging) -
# Signs dist/DataTools.app with the Developer ID, notarizes it,
# and staples the ticket so Gatekeeper passes offline. Wrapped in
# a guard: if the cert secret is absent the step prints a warning
# and exits 0, so dry-run dispatches still produce an (unsigned)
# build. Secret names match build/README.md "Signing".
- name: Sign & notarize macOS app
if: matrix.os == 'macos-latest'
env:
CERT_P12_BASE64: ${{ secrets.MACOS_DEVELOPER_ID_CERT_P12_BASE64 }}
CERT_PASSWORD: ${{ secrets.MACOS_DEVELOPER_ID_CERT_PASSWORD }}
NOTARY_APPLE_ID: ${{ secrets.MACOS_NOTARY_APPLE_ID }}
NOTARY_TEAM_ID: ${{ secrets.MACOS_NOTARY_TEAM_ID }}
NOTARY_PASSWORD: ${{ secrets.MACOS_NOTARY_PASSWORD }}
run: |
set -euo pipefail
if [ -z "${CERT_P12_BASE64:-}" ]; then
echo "::warning::MACOS_DEVELOPER_ID_CERT_P12_BASE64 not set — shipping an UNSIGNED build (Gatekeeper will warn buyers)."
exit 0
fi
APP="dist/DataTools.app"
# 1. Import the Developer ID cert into an ephemeral keychain.
KEYCHAIN="$RUNNER_TEMP/build.keychain-db"
KEYCHAIN_PW="$(uuidgen)"
security create-keychain -p "$KEYCHAIN_PW" "$KEYCHAIN"
security set-keychain-settings -lut 3600 "$KEYCHAIN"
security unlock-keychain -p "$KEYCHAIN_PW" "$KEYCHAIN"
echo "$CERT_P12_BASE64" | base64 --decode > "$RUNNER_TEMP/cert.p12"
security import "$RUNNER_TEMP/cert.p12" -k "$KEYCHAIN" -P "$CERT_PASSWORD" \
-T /usr/bin/codesign
security set-key-partition-list -S apple-tool:,apple: -s -k "$KEYCHAIN_PW" "$KEYCHAIN" >/dev/null
# Make the ephemeral keychain searchable (preserve the login keychain).
security list-keychains -d user -s "$KEYCHAIN" \
$(security list-keychains -d user | sed 's/"//g')
IDENTITY="$(security find-identity -v -p codesigning "$KEYCHAIN" \
| grep 'Developer ID Application' | head -1 | awk -F'"' '{print $2}')"
if [ -z "$IDENTITY" ]; then
echo "::error::No 'Developer ID Application' identity found in the imported cert."
exit 1
fi
echo "Signing with: $IDENTITY"
# 2. Sign the bundle (hardened runtime + secure timestamp + entitlements).
# --deep signs the nested dylibs/.so the PyInstaller bundle carries.
codesign --deep --force --options runtime --timestamp \
--entitlements build/macos/entitlements.plist \
--sign "$IDENTITY" "$APP"
codesign --verify --strict --verbose=2 "$APP"
# 3. Notarize the .app (notarytool needs a zip/dmg/pkg, not a bare .app),
# then staple so Gatekeeper validates offline.
if [ -n "${NOTARY_APPLE_ID:-}" ]; then
ditto -c -k --keepParent "$APP" "$RUNNER_TEMP/DataTools.zip"
xcrun notarytool submit "$RUNNER_TEMP/DataTools.zip" \
--apple-id "$NOTARY_APPLE_ID" \
--team-id "$NOTARY_TEAM_ID" \
--password "$NOTARY_PASSWORD" \
--wait
xcrun stapler staple "$APP"
xcrun stapler validate "$APP"
else
echo "::warning::Notary credentials not set — app is signed but NOT notarized (Gatekeeper will still warn)."
fi
rm -f "$RUNNER_TEMP/cert.p12"
# ---- Per-platform installer packaging ------------------------
- name: Package macOS DMG (installer)
if: matrix.os == 'macos-latest'
run: bash build/macos/build_dmg.sh "${{ steps.version.outputs.version }}"
- name: Package macOS portable .zip
if: matrix.os == 'macos-latest'
run: bash build/macos/build_zip.sh "${{ steps.version.outputs.version }}"
- name: Install Inno Setup (Windows)
if: matrix.os == 'windows-latest'
run: choco install innosetup --no-progress -y
@@ -98,10 +211,6 @@ jobs:
run: |
iscc /DAppVersion=${{ steps.version.outputs.version }} build\installer.iss
- name: Package Windows portable .zip
if: matrix.os == 'windows-latest'
run: python build/build_portable_zip.py win ${{ steps.version.outputs.version }}
- name: Install AppImage tooling (Linux)
if: matrix.os == 'ubuntu-latest'
run: |
@@ -119,29 +228,14 @@ jobs:
- name: Upload installer artifact
uses: actions/upload-artifact@v4
with:
name: DataTools-${{ matrix.platform }}-installer
path: ${{ matrix.installer_glob }}
name: ${{ matrix.artifact_name }}
path: ${{ matrix.artifact_path }}
if-no-files-found: error
- name: Upload portable artifact
if: matrix.portable_glob != ''
uses: actions/upload-artifact@v4
with:
name: DataTools-${{ matrix.platform }}-portable
path: ${{ matrix.portable_glob }}
if-no-files-found: error
- name: Attach installer to Release (tag push only)
- name: Attach to Release (tag push only)
if: startsWith(github.ref, 'refs/tags/v')
uses: softprops/action-gh-release@v2
with:
files: ${{ matrix.installer_glob }}
files: ${{ matrix.artifact_path }}
fail_on_unmatched_files: true
generate_release_notes: true
- name: Attach portable to Release (tag push only)
if: startsWith(github.ref, 'refs/tags/v') && matrix.portable_glob != ''
uses: softprops/action-gh-release@v2
with:
files: ${{ matrix.portable_glob }}
fail_on_unmatched_files: true
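
The two guards in the "Sign & notarize macOS app" step can be summarized as a small decision function. This is only a sketch of the control flow: the enum and function names are hypothetical, and it does not model the separate hard failure (`exit 1`) when no "Developer ID Application" identity is found in the imported cert.

```python
from enum import Enum
from typing import Mapping


class SigningOutcome(Enum):
    UNSIGNED = "unsigned"
    SIGNED_NOT_NOTARIZED = "signed, not notarized"
    SIGNED_AND_NOTARIZED = "signed and notarized"


def signing_outcome(env: Mapping[str, str]) -> SigningOutcome:
    """Mirror the step's guards: cert secret first, then the notary Apple ID."""
    # Same test as `[ -z "${CERT_P12_BASE64:-}" ]`: unset or empty means skip.
    if not env.get("CERT_P12_BASE64"):
        return SigningOutcome.UNSIGNED
    # Same test as `[ -n "${NOTARY_APPLE_ID:-}" ]`.
    if not env.get("NOTARY_APPLE_ID"):
        return SigningOutcome.SIGNED_NOT_NOTARIZED
    return SigningOutcome.SIGNED_AND_NOTARIZED
```

Only the last outcome passes Gatekeeper without a warning. The other two still produce a build, which is what keeps dry-run dispatches working.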

8
.gitignore vendored

@@ -16,6 +16,14 @@ build/dist/
build/icon.ico
build/icon.icns
build/icon.png
# Tesseract bundling — fetched at build time, not committed. See
# build/vendor/README.md for the canonical URLs and rationale.
# - build/_tesseract/ : per-platform binary + DLLs/dylibs staging dir
# - build/vendor/tessdata/eng.traineddata : ~16 MB language data
build/_tesseract/
build/vendor/tessdata/*.traineddata
.pytest_cache/
# Claude Code agent worktrees + local settings

33
DECISIONS.md Normal file

@@ -0,0 +1,33 @@
# Product & architecture decisions
A running log of decisions that aren't obvious from the code and would
otherwise be re-litigated. Newest first.
## 2026-06-08 — PDF to CSV and Reconcile stay in the bundle, under a "Finance" group
**Decision:** `10_pdf_extractor` (PDF to CSV) and `11_reconciler` (Reconcile
Two Files) remain part of the DataTools suite. In the sidebar they are
segregated into their own **Finance** section, distinct from the
file-cleaning tools.
**Context / why this needed deciding:**
- Both tools sit outside the documented 9-script cleaning architecture
(TECHNICAL.md / USER-GUIDE.md stop at the orchestrator).
- They occupy the "reconciliation / manual data-entry" territory the
product's honest-positioning note explicitly placed outside a
file-cleaning tool's scope.
- A journey-level UX review flagged that every extra tool in the main
sidebar raises the "which tool do I need?" load for a non-technical
buyer, so tools serving a different job should live in a clearly
different place.
**Resolution:** Keep them in-bundle (they're built, useful, and ship
today) but group them under "Finance" so the cleaning flow stays
uncluttered. Revisit only if a separate finance-focused product emerges.
**Implications:**
- `tools_registry.py`: Reconcile + PDF to CSV carry a `finance` section.
- Sidebar order: Start here → Data Cleaners → Transformations →
Automations → Finance → Coming soon.
- This is the source-of-truth realization of the `layout-review/`
mockups (see `layout-review/shell.js`).
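
As a rough illustration of the grouping this decision implies, the sketch below orders tools by section. The section ids, the `TOOLS` list shape, and the two `*_example_*` tool ids are assumptions made for the example. Only `10_pdf_extractor`, `11_reconciler`, the `finance` section, and the section order come from the text above. The real `tools_registry.py` may be structured differently.

```python
from typing import Dict, Iterable, List, Tuple

# Section order from the decision; the ids themselves are hypothetical.
SECTION_ORDER = [
    "start_here",
    "data_cleaners",
    "transformations",
    "automations",
    "finance",
    "coming_soon",
]

# (tool_id, section) pairs; the *_example_* ids are placeholders.
TOOLS: List[Tuple[str, str]] = [
    ("11_reconciler", "finance"),
    ("01_example_cleaner", "data_cleaners"),
    ("10_pdf_extractor", "finance"),
    ("09_example_workflows", "automations"),
]


def sidebar_groups(tools: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group tool ids by section, in SECTION_ORDER, dropping empty sections."""
    groups: Dict[str, List[str]] = {section: [] for section in SECTION_ORDER}
    for tool_id, section in tools:
        groups[section].append(tool_id)  # raises KeyError on an unknown section
    return {section: sorted(ids) for section, ids in groups.items() if ids}
```

Failing on an unknown section is a deliberate choice in this sketch: a typo in a tool's section would otherwise drop it from the sidebar silently.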

220
LICENSE_TESSERACT.txt Normal file

@@ -0,0 +1,220 @@
This license applies to the bundled Tesseract OCR binary distributed
inside DataTools installer artifacts (Windows .exe, macOS .dmg, Linux
.AppImage) and the corresponding portable .zip downloads.
Tesseract OCR upstream: https://github.com/tesseract-ocr/tesseract
Copyright (C) 2006-2024 Google Inc. and the Tesseract OCR contributors
The Tesseract OCR binary is distributed under the Apache License,
Version 2.0, the full text of which is reproduced verbatim below.
The bundled `eng.traineddata` data file is the "best" English model
from https://github.com/tesseract-ocr/tessdata_best and is licensed
under the Apache License, Version 2.0 as well.
DataTools itself is proprietary and is NOT covered by this license;
see LICENSE.txt at the repository root for DataTools' own license.
================================================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License.
However, in accepting such obligations, You may act only on Your
own behalf and on Your sole responsibility, not on behalf of any
other Contributor, and only if You agree to indemnify, defend,
and hold each Contributor harmless for any liability incurred by,
or claims asserted against, such Contributor by reason of your
accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied. See the License for the specific language governing
permissions and limitations under the License.


@@ -18,17 +18,21 @@ Limpieza local de CSV / Excel. CLI + GUI en el navegador, sin nube, sin ceremoni
| 08 | Verificación de calidad | Próximamente |
| 09 | **Flujos automatizados** — encadena herramientas en un orden recomendado (no forzado), guarda/carga JSON, automatiza limpiezas semanales | Listo |
Cada página de herramienta incluye una ventana emergente de **Help** (a la derecha del título) con una guía compacta de Cuándo usarla / Pasos / Ejemplos / Consejo. El texto vive en los paquetes de idioma (`tools.<id>.help_md`).
## Descarga (usuarios no técnicos)
Paquetes precompilados — sin instalar Python, sin permisos de administrador, sin internet en ejecución. Cada versión ofrece dos formatos por sistema operativo: un **instalador** que crea accesos directos en el escritorio + menú Inicio / Launchpad, y un **.zip portable** que descomprimes y haces doble clic. Elige el que te permita tu política de TI.
Paquetes precompilados — sin instalar Python, sin permisos de administrador, sin internet en ejecución. Cada versión ofrece un **instalador** por sistema operativo que crea accesos directos en el escritorio + menú Inicio / Launchpad.
| Plataforma | Instalador (recomendado) | Portable (sin instalar) |
|---|---|---|
| **macOS** | `DataTools-X.Y.Z-mac.dmg` — ábrelo, arrastra DataTools.app a /Applications, ejecútalo desde Launchpad. | `DataTools-X.Y.Z-mac-portable.zip` — descomprime donde quieras, doble clic en `DataTools.app`. |
| **Windows** | `DataTools-X.Y.Z-win-setup.exe` — ejecuta el instalador (por usuario, sin admin). Crea acceso directo en el escritorio + entrada en el menú Inicio. | `DataTools-X.Y.Z-win-portable.zip` — descomprime donde quieras, doble clic en `DataTools.exe`. |
| **Linux** | `DataTools-X.Y.Z-linux-x86_64.AppImage`: `chmod +x` y doble clic. | El AppImage ya es portable. |
| Plataforma | Instalador |
|---|---|
| **macOS** | `DataTools-X.Y.Z-mac.dmg` — ábrelo, arrastra DataTools.app a /Applications, ejecútalo desde Launchpad. |
| **Windows** | `DataTools-X.Y.Z-win-setup.exe` — ejecuta el instalador (por usuario, sin admin). Crea acceso directo en el escritorio + entrada en el menú Inicio. |
| **Linux** | `DataTools-X.Y.Z-linux-x86_64.AppImage`: `chmod +x` y doble clic. El AppImage ya es portable. |
Última versión: consulta [GitHub Releases](https://git.invixiom.com/giteadmin/datatools-dev/releases) (o el listado de Gumroad). Cada paquete ocupa ~200 MB descomprimido; al primer arranque la app levanta un servidor local en http://127.0.0.1:8501 y abre tu navegador predeterminado. Nada sale de tu equipo — instalador y portable son idénticos por dentro.
Última versión: consulta [GitHub Releases](https://git.invixiom.com/giteadmin/datatools-dev/releases) (o el listado de Gumroad). Cada paquete ocupa ~300 MB descomprimido; al primer arranque la app levanta un servidor local en http://127.0.0.1:8501 y abre tu navegador predeterminado. Nada sale de tu equipo.
**Tesseract OCR viene incluido.** El soporte para PDFs escaneados del Extractor de PDF funciona sin configuración adicional en las tres plataformas — no hace falta instalar Tesseract por separado. Atribución de licencia: ver [`LICENSE_TESSERACT.txt`](LICENSE_TESSERACT.txt).
**Avisos del primer arranque (una sola vez):**
- **macOS** sin firma: clic derecho → **Abrir** → confirma. (Las compilaciones firmadas se lo saltan.)


@@ -18,17 +18,21 @@ Local CSV / Excel cleaning. CLI + browser GUI, no cloud, no install ceremony. GU
| 08 | Quality Check | Coming Soon |
| 09 | **Automated Workflows** — chain tools with recommended (not forced) order, save/load JSON, automate weekly cleanups | Ready |
Every tool page has an in-tool **Help** popover (right of the title) with a compact When-to-use / Steps / Examples / Tip card. Copy lives in the language packs (`tools.<id>.help_md`).
## Download (non-technical users)
Pre-built bundles — no Python install, no admin rights, no internet at runtime. Each release ships two flavors per OS: an **installer** that wires up Desktop + Start Menu / Launchpad shortcuts, and a **portable .zip** you unzip and double-click. Pick whichever your IT policy allows.
Pre-built bundles — no Python install, no admin rights, no internet at runtime. Each release ships an **installer** per OS that wires up Desktop + Start Menu / Launchpad shortcuts.
| Platform | Installer (recommended) | Portable (no install) |
|---|---|---|
| **macOS** | `DataTools-X.Y.Z-mac.dmg` — open, drag DataTools.app into /Applications, launch from Launchpad. | `DataTools-X.Y.Z-mac-portable.zip` — unzip anywhere, double-click `DataTools.app`. |
| **Windows** | `DataTools-X.Y.Z-win-setup.exe` — run installer (per-user, no admin). Desktop shortcut + Start Menu entry created. | `DataTools-X.Y.Z-win-portable.zip` — unzip anywhere, double-click `DataTools.exe`. |
| **Linux** | `DataTools-X.Y.Z-linux-x86_64.AppImage`: `chmod +x`, double-click. | The AppImage is already portable. |
| Platform | Installer |
|---|---|
| **macOS** | `DataTools-X.Y.Z-mac.dmg` — open, drag DataTools.app into /Applications, launch from Launchpad. |
| **Windows** | `DataTools-X.Y.Z-win-setup.exe` — run installer (per-user, no admin). Desktop shortcut + Start Menu entry created. |
| **Linux** | `DataTools-X.Y.Z-linux-x86_64.AppImage`: `chmod +x`, double-click. The AppImage is already portable. |
Latest release: see [GitHub Releases](https://git.invixiom.com/giteadmin/datatools-dev/releases) (or the Gumroad listing). Each bundle is ~200 MB unpacked; on first launch the app starts a local server at http://127.0.0.1:8501 and opens your default browser. Nothing leaves your machine — installers and portables are byte-identical inside.
Latest release: see [GitHub Releases](https://git.invixiom.com/giteadmin/datatools-dev/releases) (or the Gumroad listing). Each bundle is ~300 MB unpacked; on first launch the app starts a local server at http://127.0.0.1:8501 and opens your default browser. Nothing leaves your machine.
**Tesseract OCR is bundled.** Scanned-PDF support in the PDF Extractor works out of the box on all three platforms — no separate Tesseract install required. License attribution: see [`LICENSE_TESSERACT.txt`](LICENSE_TESSERACT.txt).
**First-launch warnings (one-time):**
- **macOS** unsigned builds: right-click → **Open** → confirm. (Signed builds skip this.)


@@ -23,14 +23,12 @@ build/
├── generate_icons.py Builds icon.ico / icon.icns / icon.png from
│ src/gui/assets/datatools_icon_256.png. Run
│ once before pyinstaller (CI does this).
├── build_portable_zip.py Cross-platform: zips dist/DataTools/ into a
│ no-install portable download. Used by the
│ Windows + Linux portable artifacts.
├── tesseract.py Fetches the per-platform Tesseract binary +
│ eng.traineddata at build time. CI imports
│ fetch_tessdata + fetch_tesseract_for_platform.
├── macos/
│ ├── build_dmg.sh Wraps dist/DataTools.app into a .dmg with a
│ │ drag-to-/Applications layout (installer).
│ └── build_zip.sh Wraps dist/DataTools.app into a portable
│ .zip via ditto (preserves bundle metadata).
│ └── build_dmg.sh Wraps dist/DataTools.app into a .dmg with a
│ drag-to-/Applications layout (installer).
├── appimage/
│ ├── AppRun Entry point invoked when the AppImage runs.
│ ├── datatools.desktop Linux desktop-entry metadata.
@@ -43,19 +41,20 @@ build/
## Distribution outputs per platform
Each CI run produces two downloads per platform — an installer for
buyers who want shortcuts wired automatically, and a portable .zip
for buyers (or IT-locked-down machines) that can't run installers:
Each CI run produces one installer per platform:
| Platform | Installer | Portable |
|----------|----------------------------------------|------------------------------------------------|
| macOS | `DataTools-<ver>-mac.dmg` | `DataTools-<ver>-mac-portable.zip` (ditto .app)|
| Windows | `DataTools-<ver>-win-setup.exe` | `DataTools-<ver>-win-portable.zip` |
| Linux | `DataTools-<ver>-linux-x86_64.AppImage`| (the AppImage IS the portable) |
| Platform | Installer |
|----------|----------------------------------------|
| macOS | `DataTools-<ver>-mac.dmg` |
| Windows | `DataTools-<ver>-win-setup.exe` |
| Linux | `DataTools-<ver>-linux-x86_64.AppImage` (already portable) |
All six outputs are self-contained: every dependency (Python, pandas,
streamlit, pdfplumber, the lot) is frozen into the bundle. The buyer
does not need to install Python, pip, or anything else first.
All three outputs are self-contained: every dependency (Python, pandas,
streamlit, pdfplumber, **Tesseract OCR + `eng.traineddata`**, the lot)
is frozen into the bundle. The buyer does not need to install Python,
pip, Tesseract, or anything else first. With Tesseract bundled, each
artifact is roughly **250-300 MB** on disk (up from ~120 MB pre-OCR);
unpacked installs run ~300-400 MB once scratch space is counted.
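
As a rough illustration of the naming scheme in the table above, the installer file name can be derived from the version and a short platform key. This is only a sketch: `installer_name` is a hypothetical helper, not something the build scripts define, and the `mac` / `win` / `linux` keys are borrowed from the values `build/tesseract.py` accepts.

```python
def installer_name(version: str, platform: str) -> str:
    """Return the installer file name for one platform key.

    Hypothetical helper for illustration; the suffixes mirror the
    "Distribution outputs per platform" table.
    """
    suffixes = {
        "mac": "mac.dmg",
        "win": "win-setup.exe",
        "linux": "linux-x86_64.AppImage",
    }
    if platform not in suffixes:
        raise ValueError(f"unknown platform: {platform!r}")
    return f"DataTools-{version}-{suffixes[platform]}"


if __name__ == "__main__":
    for key in ("mac", "win", "linux"):
        print(installer_name("1.2.3", key))
```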
## Easy-launch surface
@@ -73,55 +72,55 @@ the resulting installers to a GitHub Release. Manual
## Releasing
### Single-command local build (recommended for one-developer workflow)
### CI build (push tag → GitHub Release) — the release process
PyInstaller can't cross-compile, so a single machine produces one
platform's packages. Run this on each target OS:
```bash
# One-time setup per machine:
pip install -r requirements.txt
pip install pyinstaller pillow
# Windows only: install Inno Setup from https://jrsoftware.org/isdl.php
# Linux only: drop appimagetool onto PATH (see preflight output)
# Build everything for the current OS:
python build/make_release.py
```
Outputs land in `dist/`:
- Windows host → `DataTools-<ver>-win-setup.exe` + `DataTools-<ver>-win-portable.zip`
- macOS host → `DataTools-<ver>-mac.dmg` + `DataTools-<ver>-mac-portable.zip`
- Linux host → `DataTools-<ver>-linux-x86_64.AppImage`
Useful flags:
```bash
python build/make_release.py --preflight # check tooling, build nothing
python build/make_release.py --clean # wipe dist/ first
python build/make_release.py --skip-installer # just the portable zip
python build/make_release.py --skip-portable # just the installer
```
### CI build (push tag → GitHub Release)
If you have CI runners for all three OSes:
Releases are built by GitHub Actions (`.github/workflows/build.yml`),
not on a developer's machine. The matrix runs on
macos-latest / windows-latest / ubuntu-latest, stages Tesseract
(`build/tesseract.py`), runs PyInstaller, packages the per-platform
installer, and attaches it to a GitHub Release on tag push:
1. Bump `__version__` in `src/__init__.py`.
2. `git commit -am "release: vX.Y.Z" && git tag vX.Y.Z`.
3. `git push && git push --tags`.
4. CI builds all three platforms and creates a Release with the
installers + portable zips attached.
installers attached.
5. Mirror the Release assets to Gumroad (manual until v2).
A manual `workflow_dispatch` run does the same build but uploads the
installers as workflow artifacts instead of creating a Release —
useful for smoke-testing a build without cutting a tag.
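
Because the release is driven by a `vX.Y.Z` tag while the version lives in `src/__init__.py`, it can help to confirm the two agree before pushing. The sketch below is one possible check, not part of the workflow: `parse_version` and `tag_matches_version` are hypothetical names, and the regular expression assumes `__version__` is assigned as a plain quoted string.

```python
import re

# Matches lines such as: __version__ = "1.4.0"
VERSION_RE = re.compile(r'__version__\s*=\s*["\']([^"\']+)["\']')


def parse_version(init_source: str) -> str:
    """Extract the version string from the text of src/__init__.py."""
    match = VERSION_RE.search(init_source)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


def tag_matches_version(tag: str, init_source: str) -> bool:
    """True when the git tag is exactly 'v' + the declared version."""
    return tag == f"v{parse_version(init_source)}"


if __name__ == "__main__":
    sample = '__version__ = "1.4.0"\n'
    print(tag_matches_version("v1.4.0", sample))
```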
### Local build (single platform, for testing)
PyInstaller can't cross-compile, so a local build produces only the
current OS's installer. This mirrors what CI does, by hand — use it to
debug the bundle before tagging. See the per-platform recipes below for
the exact commands; the short version is:
```bash
pip install -r requirements.txt
pip install pyinstaller pillow
python build/generate_icons.py
python -c "import sys; sys.path.insert(0,'build'); \
from tesseract import fetch_tessdata, fetch_tesseract_for_platform; \
fetch_tessdata(); fetch_tesseract_for_platform('mac')" # win / mac / linux
pyinstaller build/datatools.spec --clean --noconfirm
# then run the matching packager: build/macos/build_dmg.sh,
# build/installer.iss (iscc), or build/appimage/build.sh
```
## Signing (Phase 2 — needs accounts/credentials)
Both code-signing steps are intentionally not in CI yet because they
require credentials the owner sets up first.
**macOS signing + notarization is now wired into `build.yml`** (the
"Sign & notarize macOS app" step, with `build/macos/entitlements.plist`).
It is guarded: if `MACOS_DEVELOPER_ID_CERT_P12_BASE64` is absent the step
warns and exits 0, so dry-run dispatches still produce an unsigned build.
To activate it, just add the secrets below — no code change needed.
**Windows** code-signing is still not wired (accepted v1 friction).
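
The guard described above (warn and exit 0 when the certificate secret is absent) can be pictured with a small sketch. The real step lives in `build.yml` and is not reproduced here; `should_sign` and `main` are hypothetical names, and only the secret name `MACOS_DEVELOPER_ID_CERT_P12_BASE64` comes from this README.

```python
import os
import sys


def should_sign(env=None) -> bool:
    """Return True only when the signing certificate secret is present."""
    env = os.environ if env is None else env
    return bool(env.get("MACOS_DEVELOPER_ID_CERT_P12_BASE64"))


def main(env=None) -> int:
    if not should_sign(env):
        # Missing secrets are not an error: the build continues unsigned.
        print("warning: signing secrets absent, skipping sign + notarize",
              file=sys.stderr)
        return 0
    # Signing, notarization and stapling would run here (not shown).
    return 0


if __name__ == "__main__":
    sys.exit(main())
```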
**macOS** — Apple Developer Program enrollment ($99/yr). Once enrolled,
add these GitHub Secrets and uncomment the `codesign` + `notarytool`
steps in `build.yml`:
add these GitHub Secrets to activate the signing step in `build.yml`:
| Secret | Value |
|---|---|
@@ -287,6 +286,57 @@ Mac code-signing in CI requires the cert + private key as a GitHub
secret (encoded with `base64`). Detailed walkthrough belongs in a
later doc — for v1, sign locally and upload to GitHub Releases.
## Tesseract bundling (PDF Extractor OCR)
Frozen artifacts ship a per-platform Tesseract binary plus the English
`eng.traineddata` model so scanned-PDF support in the PDF Extractor
works out of the box — no separate user install. Source / pip
developer setups still need system Tesseract on `PATH`.
**Layout inside the bundle**:
```
DataTools/                (or DataTools.app/Contents/MacOS/)
└── tesseract/
    ├── tesseract         (Linux/macOS binary; tesseract.exe on Windows)
    └── tessdata/
        └── eng.traineddata
```
The runtime resolver (in `src/`, owned by the runtime team) walks:
1. `DATATOOLS_TESSERACT_BIN` env var override.
2. `Path(sys._MEIPASS) / "tesseract" / "tesseract[.exe]"` — frozen
bundles only.
3. `tesseract` on `PATH`.
4. Windows well-known paths.
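
The lookup order above could be sketched roughly as follows. This is an illustration under stated assumptions, not the runtime team's code: `resolve_tesseract` is a hypothetical name, and the single entry in `WINDOWS_WELL_KNOWN` is an assumed default install location rather than the list the real resolver uses.

```python
import os
import shutil
import sys
from pathlib import Path
from typing import Optional

# Assumption: one common Windows install location, for illustration only.
WINDOWS_WELL_KNOWN = [
    Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe"),
]


def resolve_tesseract() -> Optional[Path]:
    """Return the first Tesseract binary found, following the four steps."""
    exe = "tesseract.exe" if sys.platform.startswith("win") else "tesseract"

    # 1. Explicit override via environment variable.
    override = os.environ.get("DATATOOLS_TESSERACT_BIN")
    if override and Path(override).is_file():
        return Path(override)

    # 2. Frozen bundle: sys._MEIPASS only exists under PyInstaller.
    meipass = getattr(sys, "_MEIPASS", None)
    if meipass:
        bundled = Path(meipass) / "tesseract" / exe
        if bundled.is_file():
            return bundled

    # 3. Whatever is on PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return Path(on_path)

    # 4. Windows well-known install paths.
    if sys.platform.startswith("win"):
        for candidate in WINDOWS_WELL_KNOWN:
            if candidate.is_file():
                return candidate
    return None
```

Returning `None` instead of raising leaves it to the caller to decide whether missing OCR is fatal or just disables the scanned-PDF path.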
**Where the bytes come from**:
- **Tessdata** — vendored in-repo at `build/vendor/tessdata/eng.traineddata`
(sourced from [tessdata_best](https://github.com/tesseract-ocr/tessdata_best)).
`datatools.spec` copies it into `tesseract/tessdata/`.
- **Binary** — fetched per-platform at build time by
`build/tesseract.py` from pinned upstream URLs. Current pin:
**Tesseract 5.5.0**. CI imports `fetch_tessdata` +
`fetch_tesseract_for_platform` from this module before PyInstaller.
**Updating Tesseract**:
1. Bump the version pin and the per-platform fetch URLs in
`build/tesseract.py`.
2. If the model schema changed upstream, refresh
`build/vendor/tessdata/eng.traineddata` from `tessdata_best` at the
matching tag.
3. Push a `v*` tag so CI rebuilds all three platforms, then
smoke-test a scanned PDF through the PDF Extractor.
4. Update `LICENSE_TESSERACT.txt` at the repo root if upstream license
terms change (Apache-2.0 today).
License attribution for the bundled binary lives at
`LICENSE_TESSERACT.txt` at the repo root — it must ship alongside any
binary that contains Tesseract.
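
After a rebuild, one quick sanity check is to confirm the OCR files actually landed in the bundle. The sketch below assumes `bundle_root` is the directory that `sys._MEIPASS` points to at runtime (its exact location depends on the PyInstaller version and platform); `missing_ocr_files` is a hypothetical helper, not part of the build scripts.

```python
from pathlib import Path


def missing_ocr_files(bundle_root: Path, windows: bool = False) -> list:
    """List the expected OCR-related files that are absent from the bundle."""
    exe = "tesseract.exe" if windows else "tesseract"
    expected = [
        Path("tesseract") / exe,
        Path("tesseract") / "tessdata" / "eng.traineddata",
        Path("LICENSE_TESSERACT.txt"),
    ]
    return [p.as_posix() for p in expected if not (bundle_root / p).is_file()]


if __name__ == "__main__":
    missing = missing_ocr_files(Path("dist") / "DataTools")
    print("missing:", missing if missing else "nothing")
```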
## Common pitfalls
| Symptom | Fix |


@@ -9,6 +9,11 @@
# latest release from https://github.com/AppImage/AppImageKit/releases).
#
# Output: dist/DataTools-<version>-linux-x86_64.AppImage
#
# Tesseract bundling: no-op here. The PyInstaller bundle in
# dist/DataTools/ already contains tesseract/{tesseract, *.so,
# tessdata/eng.traineddata} from the spec's datas; ``cp -R``
# below carries it along into the AppDir.
set -euo pipefail


@@ -1,69 +0,0 @@
"""Wrap the PyInstaller folder build into a portable .zip.

Self-contained download: unzip → double-click the launcher → app runs.
No installer, no Python install, no admin rights required.

Usage:
    python build/build_portable_zip.py <platform> <version>

Where ``platform`` is one of ``win`` / ``mac`` / ``linux``. The
script just produces a generic ``dist/DataTools/`` zip; on macOS the
preferred portable format is the ``ditto``-wrapped .app — see
``build/macos/build_zip.sh`` for that flow. This helper exists mainly
for Windows + Linux, where there's no .app bundle to wrap.

Output:
    dist/DataTools-<version>-<platform>-portable.zip

The zip root is the ``DataTools/`` folder so an unzip produces a
self-contained dir the user can drop anywhere (Desktop, USB stick,
network share). On Windows, the launcher is ``DataTools.exe`` inside
that folder; on Linux, ``DataTools``.
"""
from __future__ import annotations

import shutil
import sys
from pathlib import Path

REPO = Path(__file__).resolve().parent.parent
DIST_DIR = REPO / "dist"
BUNDLE_DIR = DIST_DIR / "DataTools"


def main() -> int:
    if len(sys.argv) < 3:
        sys.stderr.write(
            "usage: python build/build_portable_zip.py <platform> <version>\n"
        )
        return 2
    platform = sys.argv[1]
    version = sys.argv[2]
    if not BUNDLE_DIR.is_dir():
        sys.stderr.write(
            f"Bundle dir not found at {BUNDLE_DIR}.\n"
            "Run ``pyinstaller build/datatools.spec --clean --noconfirm`` first.\n"
        )
        return 1
    out_stem = DIST_DIR / f"DataTools-{version}-{platform}-portable"
    # ``make_archive`` takes a base name (no extension) and produces
    # ``<base>.zip``. ``root_dir`` = parent of what we want compressed,
    # ``base_dir`` = the folder name inside the archive root. This
    # combo yields a single top-level ``DataTools/`` directory inside
    # the .zip rather than dumping its contents loose.
    archive = shutil.make_archive(
        base_name=str(out_stem),
        format="zip",
        root_dir=str(DIST_DIR),
        base_dir="DataTools",
    )
    size_mb = Path(archive).stat().st_size / (1024 * 1024)
    print(f"wrote {archive} ({size_mb:.1f} MB)")
    return 0


if __name__ == "__main__":
    sys.exit(main())


@@ -24,6 +24,7 @@
# -*- mode: python ; coding: utf-8 -*-
import os
from pathlib import Path
from PyInstaller.utils.hooks import (
collect_all,
@@ -103,6 +104,78 @@ datas += [
(str(REPO / ".streamlit" / "config.toml"),".streamlit"),
]
# ----- Tesseract OCR bundle ----------------------------------------
# ``build/tesseract.py`` stages the per-platform Tesseract binary
# + its runtime libs (DLLs/dylibs/sos) into
# ``build/_tesseract/<target>/`` and the shared eng.traineddata into
# ``build/vendor/tessdata/``. We add both to ``datas`` so PyInstaller
# drops them at the path the runtime expects:
#
# <bundle>/tesseract/tesseract[.exe]
# <bundle>/tesseract/<all dll/dylib/so deps>
# <bundle>/tesseract/tessdata/eng.traineddata
#
# The runtime discovery code in src/pdf_extract.py reads this layout
# from ``Path(sys._MEIPASS) / "tesseract" / ...``. Keep the two ends
# in sync — if you rename "tesseract" here, update pdf_extract.py too.
#
# CI (.github/workflows/build.yml) sets DATATOOLS_TESS_STAGING to the
# right per-platform dir before invoking PyInstaller. For ad-hoc
# `pyinstaller build/datatools.spec` runs without that env var, fall
# back to the canonical staging path.
_tess_staging_env = os.environ.get("DATATOOLS_TESS_STAGING")
if _tess_staging_env:
    _tess_staging = Path(_tess_staging_env)
else:
    # Pick the obvious per-host staging dir as a fallback so spec-only
    # builds (without the CI env var) still work in dev.
    import sys as _sys_for_target
    _target_guess = (
        "win" if _sys_for_target.platform.startswith("win")
        else "mac" if _sys_for_target.platform == "darwin"
        else "linux"
    )
    _tess_staging = REPO / "build" / "_tesseract" / _target_guess
_tessdata = REPO / "build" / "vendor" / "tessdata"
if _tess_staging.is_dir() and any(_tess_staging.iterdir()):
    # Drop every file in the staging dir directly under
    # ``<bundle>/tesseract/`` (binary + DLL/dylib/so siblings).
    datas += [(str(_tess_staging), "tesseract")]
else:
    # Don't hard-fail spec parse — useful for first-time devs running
    # PyInstaller before fetching binaries. Surface a loud warning
    # though, since the OCR feature will silently fail at runtime.
    print(
        f"WARNING: {_tess_staging} is empty or missing; OCR will be "
        "disabled in the bundle. Run build/tesseract.py's "
        "fetch_tesseract_for_platform before pyinstaller, or "
        "pre-stage the binary manually."
    )
if (_tessdata / "eng.traineddata").exists():
    datas += [(str(_tessdata), "tesseract/tessdata")]
else:
    print(
        f"WARNING: {_tessdata}/eng.traineddata is missing; OCR will "
        "have no language data at runtime. Run build/tesseract.py's "
        "fetch_tessdata or fetch manually per build/vendor/README.md."
    )
# Bundle the Apache-2.0 LICENSE text alongside the binary. The docs
# agent maintains LICENSE_TESSERACT.txt at the repo root; PyInstaller
# drops it at the bundle root next to DataTools[.exe].
_tess_license = REPO / "LICENSE_TESSERACT.txt"
if _tess_license.exists():
    datas += [(str(_tess_license), ".")]
else:
    print(
        "WARNING: LICENSE_TESSERACT.txt missing at repo root. Required "
        "by Apache-2.0 for redistribution; the docs agent should "
        "create it. Continuing without it for now."
    )
# ----- Analysis ------------------------------------------------------
a = Analysis(
@@ -158,6 +231,13 @@ coll = COLLECT(
# macOS .app bundle wrapper. PyInstaller produces it only on Mac;
# this block is a no-op on Win/Linux.
#
# Tesseract bundling note: ``BUNDLE(coll, ...)`` carries the entire
# COLLECT output (binaries + datas) into the .app's
# Contents/Resources tree, so the ``tesseract/`` subdir we built up
# in ``datas`` lands at ``DataTools.app/Contents/Resources/tesseract/``
# and the runtime ``sys._MEIPASS`` resolves there. No extra plumbing
# needed.
import sys as _sys
if _sys.platform == "darwin":
app = BUNDLE(


@@ -63,6 +63,14 @@ Name: "desktopicon"; Description: "Create a &desktop shortcut"; GroupDescription
Name: "quicklaunchicon"; Description: "Create a &Quick Launch shortcut"; GroupDescription: "Additional shortcuts:"; Flags: unchecked; OnlyBelowVersion: 6.1
[Files]
; PyInstaller's dist/DataTools/ tree includes:
; * DataTools.exe + frozen Python runtime
; * tesseract/tesseract.exe + DLLs + tessdata/eng.traineddata
; (bundled via build/datatools.spec datas; runtime discovery in
; src/pdf_extract.py reads sys._MEIPASS / "tesseract" / ...).
; * LICENSE_TESSERACT.txt at the bundle root (Apache-2.0).
; The recursesubdirs flag below picks all of those up — no separate
; Files: entry needed for tesseract/.
Source: "..\dist\DataTools\*"; DestDir: "{app}"; Flags: recursesubdirs ignoreversion
[Icons]


@@ -10,6 +10,11 @@
#
# Code signing + notarization happen separately (see build/README.md
# "Signing"). This script only handles the packaging step.
#
# Tesseract bundling: no-op here. The .app already contains
# Contents/Resources/tesseract/{tesseract, *.dylib, tessdata/} thanks
# to PyInstaller's BUNDLE() carrying the spec's datas through. This
# script just wraps the finished .app — no extra steps for OCR.
set -euo pipefail


@@ -1,38 +0,0 @@
#!/usr/bin/env bash
# Wrap dist/DataTools.app into a no-install portable .zip.
#
# Usage:
# bash build/macos/build_zip.sh <version>
#
# Why a portable .zip in addition to the .dmg:
# * Buyers who don't want an installer can unzip and double-click the
# .app directly — no drag-to-/Applications step, no installer
# chrome. Self-contained: the .app holds Python + every dep.
# * IT-locked-down machines often block .dmg auto-mount but allow
# .zip download + extraction.
#
# Run after ``pyinstaller build/datatools.spec --clean --noconfirm``
# has produced ``dist/DataTools.app``. Output goes to
# ``dist/DataTools-<version>-mac-portable.zip``.
set -euo pipefail
VERSION="${1:-0.0.0-dev}"
APP="dist/DataTools.app"
ZIP="dist/DataTools-${VERSION}-mac-portable.zip"
if [[ ! -d "$APP" ]]; then
echo "Error: $APP not found. Run pyinstaller build/datatools.spec first." >&2
exit 1
fi
# ``ditto`` preserves the .app bundle's extended attributes and
# resource forks (a plain ``zip`` strips them and can break code
# signatures + Info.plist resolution on the buyer's machine).
#
# --sequesterRsrc keeps the AppleDouble metadata inside the archive
# rather than as parallel ._ files on disk after extraction.
rm -f "$ZIP"
ditto -c -k --sequesterRsrc --keepParent "$APP" "$ZIP"
echo "Built $ZIP ($(du -h "$ZIP" | cut -f1))"


@@ -0,0 +1,28 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<!--
Hardened-runtime entitlements for the notarized DataTools.app.
PyInstaller freezes a CPython interpreter that maps writable+executable
memory and loads many unsigned .so/.dylib modules at runtime. Without
these entitlements the hardened runtime kills the process on launch
(or notarization rejects the bundle). Keep this list minimal — the app
is a local-only Streamlit server, so no network-server/device/camera
entitlements are needed.
-->
<plist version="1.0">
<dict>
<!-- CPython JIT-style writable/executable memory + ctypes trampolines -->
<key>com.apple.security.cs.allow-jit</key>
<true/>
<key>com.apple.security.cs.allow-unsigned-executable-memory</key>
<true/>
<!-- Load the bundled C-extension .so / .dylib modules (pandas, pdfplumber,
Pillow, the bundled Tesseract dylibs) that aren't Team-ID signed -->
<key>com.apple.security.cs.disable-library-validation</key>
<true/>
<!-- Launcher sets DATATOOLS_*/TESSDATA_PREFIX/PYTHON* before exec -->
<key>com.apple.security.cs.allow-dyld-environment-variables</key>
<true/>
</dict>
</plist>


@@ -1,348 +0,0 @@
"""Single-command release builder for DataTools.
PyInstaller can't cross-compile — to produce a Windows .exe you run
this on Windows, for a Mac .dmg you run it on macOS, for a Linux
AppImage you run it on Linux. One script, one OS at a time.
What this script does (in order):
1. Preflight — checks PyInstaller, Pillow, and the platform's
packager (Inno Setup on Win / hdiutil + ditto on Mac /
appimagetool on Linux) are reachable. Bails with install
instructions if anything is missing.
2. Generates icon.ico / icon.icns / icon.png from the PNG asset.
3. Runs PyInstaller against build/datatools.spec.
4. Wraps the PyInstaller output into:
* Windows: DataTools-<ver>-win-setup.exe (Inno Setup)
+ DataTools-<ver>-win-portable.zip
* macOS: DataTools-<ver>-mac.dmg
+ DataTools-<ver>-mac-portable.zip
* Linux: DataTools-<ver>-linux-x86_64.AppImage
5. Prints what landed in dist/ and the byte sizes.
Usage:
python build/make_release.py # build everything for this OS
python build/make_release.py --preflight # check tooling, don't build
python build/make_release.py --skip-installer # only the portable zip
python build/make_release.py --skip-portable # only the installer
python build/make_release.py --clean # wipe dist/ first
Run from the repo root or from build/ — either works.
"""
from __future__ import annotations
import argparse
import platform
import re
import shutil
import subprocess
import sys
from pathlib import Path
REPO = Path(__file__).resolve().parent.parent
BUILD = REPO / "build"
DIST = REPO / "dist"
# ---------------------------------------------------------------------------
# Output helpers — colourless so logs stay readable in any terminal/CI tail.
# ---------------------------------------------------------------------------
def _step(msg: str) -> None:
    print(f"\n==> {msg}", flush=True)


def _ok(msg: str) -> None:
    print(f" ok: {msg}", flush=True)


def _warn(msg: str) -> None:
    print(f" warn: {msg}", flush=True)


def _err(msg: str) -> None:
    print(f" ERROR: {msg}", file=sys.stderr, flush=True)


def _run(cmd: list[str], cwd: Path | None = None, env: dict | None = None) -> None:
    """Run *cmd*, stream output, exit on failure with a useful banner."""
    printable = " ".join(map(str, cmd))
    print(f" $ {printable}", flush=True)
    try:
        subprocess.run(cmd, check=True, cwd=cwd or REPO, env=env)
    except subprocess.CalledProcessError as e:
        _err(f"command failed (exit {e.returncode}): {printable}")
        sys.exit(e.returncode)
    except FileNotFoundError:
        _err(f"command not found: {cmd[0]}")
        sys.exit(127)
# ---------------------------------------------------------------------------
# Platform detection
# ---------------------------------------------------------------------------
def _detect_platform() -> str:
    """Return ``win`` / ``mac`` / ``linux`` based on sys.platform."""
    p = sys.platform
    if p.startswith("win"):
        return "win"
    if p == "darwin":
        return "mac"
    if p.startswith("linux"):
        return "linux"
    _err(f"unsupported platform {p!r}; this script handles win/mac/linux only.")
    sys.exit(2)
# ---------------------------------------------------------------------------
# Version — single source of truth in src/__init__.py
# ---------------------------------------------------------------------------
def _read_version() -> str:
    init_py = (REPO / "src" / "__init__.py").read_text(encoding="utf-8")
    m = re.search(r'__version__\s*=\s*["\']([^"\']+)["\']', init_py)
    if not m:
        _err("could not parse __version__ from src/__init__.py")
        sys.exit(1)
    return m.group(1)
# ---------------------------------------------------------------------------
# Preflight — check tooling before doing anything destructive
# ---------------------------------------------------------------------------
def _have_module(name: str) -> bool:
    try:
        __import__(name)
        return True
    except ImportError:
        return False


def _have_command(name: str) -> bool:
    return shutil.which(name) is not None
# Per-platform install hints. The error messages quote these so a buyer
# building from source isn't left guessing what to install next.
_INSTALL_HINTS = {
    "pyinstaller": "pip install pyinstaller",
    "pil": "pip install pillow",
    "iscc": "Inno Setup (Windows): https://jrsoftware.org/isdl.php — install, then re-open the shell so iscc lands on PATH.",
    "hdiutil": "ships with macOS — if it's missing your Mac install is broken.",
    "ditto": "ships with macOS — if it's missing your Mac install is broken.",
    "appimagetool": "Linux: download appimagetool-x86_64.AppImage from https://github.com/AppImage/AppImageKit/releases, chmod +x, drop on PATH.",
}
def preflight(target: str) -> None:
    """Verify every tool the target build needs is reachable; exit if not."""
    _step(f"preflight ({target})")
    missing: list[tuple[str, str]] = []
    # Python-side deps — same on every platform. The ``_INSTALL_HINTS``
    # lookup uses lowercase keys so module name capitalization doesn't
    # need to match.
    for mod in ("PyInstaller", "PIL"):
        if not _have_module(mod):
            hint = _INSTALL_HINTS.get(mod.lower(), f"pip install {mod}")
            missing.append((mod.lower(), hint))
        else:
            _ok(f"{mod} importable")
    # PyInstaller's CLI must also be reachable as a binary, not just as
    # an importable module — the spec is invoked via the ``pyinstaller``
    # command. ``python -m PyInstaller`` is a fine fallback so don't
    # hard-fail if only the CLI binary is missing.
    if _have_command("pyinstaller"):
        _ok("pyinstaller on PATH")
    else:
        _warn("pyinstaller binary not on PATH — will fall back to `python -m PyInstaller`")
    # Platform-specific packagers.
    if target == "win":
        if _have_command("iscc"):
            _ok("Inno Setup (iscc) on PATH")
        else:
            missing.append(("iscc", _INSTALL_HINTS["iscc"]))
    elif target == "mac":
        for tool in ("hdiutil", "ditto"):
            if _have_command(tool):
                _ok(f"{tool} on PATH")
            else:
                missing.append((tool, _INSTALL_HINTS[tool]))
    elif target == "linux":
        if _have_command("appimagetool"):
            _ok("appimagetool on PATH")
        else:
            missing.append(("appimagetool", _INSTALL_HINTS["appimagetool"]))
    if missing:
        _err("missing prerequisites:")
        for name, hint in missing:
            print(f" - {name}: {hint}", file=sys.stderr)
        sys.exit(1)
    _ok("all prerequisites present")
# ---------------------------------------------------------------------------
# Build steps
# ---------------------------------------------------------------------------
def step_generate_icons() -> None:
    _step("generate icons")
    _run([sys.executable, str(BUILD / "generate_icons.py")])


def step_pyinstaller(clean: bool) -> None:
    _step("pyinstaller bundle")
    # Use ``python -m PyInstaller`` so we don't depend on the binary
    # being on PATH (Windows users frequently see this — pip's
    # Scripts/ dir isn't auto-added).
    cmd = [sys.executable, "-m", "PyInstaller",
           str(BUILD / "datatools.spec"),
           "--noconfirm"]
    if clean:
        cmd.append("--clean")
    _run(cmd)


def step_package_win(version: str, do_installer: bool, do_portable: bool) -> list[Path]:
    out: list[Path] = []
    if do_installer:
        _step("Windows installer (Inno Setup)")
        _run(["iscc", f"/DAppVersion={version}", str(BUILD / "installer.iss")])
        out.append(DIST / f"DataTools-{version}-win-setup.exe")
    if do_portable:
        _step("Windows portable .zip")
        _run([sys.executable, str(BUILD / "build_portable_zip.py"), "win", version])
        out.append(DIST / f"DataTools-{version}-win-portable.zip")
    return out


def step_package_mac(version: str, do_installer: bool, do_portable: bool) -> list[Path]:
    out: list[Path] = []
    if do_installer:
        _step("macOS DMG (installer)")
        _run(["bash", str(BUILD / "macos" / "build_dmg.sh"), version])
        out.append(DIST / f"DataTools-{version}-mac.dmg")
    if do_portable:
        _step("macOS portable .zip")
        _run(["bash", str(BUILD / "macos" / "build_zip.sh"), version])
        out.append(DIST / f"DataTools-{version}-mac-portable.zip")
    return out


def step_package_linux(version: str, do_installer: bool, do_portable: bool) -> list[Path]:
    # On Linux the AppImage IS the portable. We ignore the two flags
    # and always produce the single file — splitting wouldn't add
    # value.
    if not (do_installer or do_portable):
        return []
    _step("Linux AppImage")
    _run(["bash", str(BUILD / "appimage" / "build.sh"), version])
    return [DIST / f"DataTools-{version}-linux-x86_64.AppImage"]
# ---------------------------------------------------------------------------
# Orchestration
# ---------------------------------------------------------------------------
def _summarise(outputs: list[Path]) -> None:
    _step("done — outputs")
    if not outputs:
        _warn("no files produced (everything skipped via flags)")
        return
    for p in outputs:
        if p.exists():
            size_mb = p.stat().st_size / (1024 * 1024)
            print(f" {p.relative_to(REPO)} ({size_mb:.1f} MB)")
        else:
            _warn(f"expected output missing: {p.relative_to(REPO)}")


def main() -> int:
    parser = argparse.ArgumentParser(
        prog="make_release.py",
        description=(
            "Build the installer + portable zip for the current OS. "
            "Cross-compilation isn't supported by PyInstaller — run "
            "this once per platform you want to target."
        ),
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "--platform", choices=("auto", "win", "mac", "linux"), default="auto",
        help="Override OS detection (mostly for testing). Default: auto.",
    )
    parser.add_argument(
        "--preflight", action="store_true",
        help="Check tooling and exit without building.",
    )
    parser.add_argument(
        "--clean", action="store_true",
        help="Wipe dist/ before building.",
    )
    parser.add_argument(
        "--skip-installer", action="store_true",
        help="Don't build the OS installer (.exe / .dmg).",
    )
    parser.add_argument(
        "--skip-portable", action="store_true",
        help="Don't build the portable .zip.",
    )
    args = parser.parse_args()
    target = _detect_platform() if args.platform == "auto" else args.platform
    version = _read_version()
    do_installer = not args.skip_installer
    do_portable = not args.skip_portable
    print(f"DataTools release builder")
    print(f" target: {target} (host: {platform.platform()})")
    print(f" version: {version}")
    print(f" installer: {'yes' if do_installer else 'no'}")
    print(f" portable: {'yes' if do_portable else 'no'}")
    print(f" dist dir: {DIST}")
    if target != _detect_platform():
        _warn(
            f"--platform {target} but host is {_detect_platform()}. "
            "PyInstaller can't cross-compile — the bundle will be for "
            "the HOST, only the packaging step will follow your override. "
            "Useful only for testing the packager paths."
        )
    preflight(target)
    if args.preflight:
        return 0
    if args.clean and DIST.exists():
        _step(f"cleaning {DIST}")
        shutil.rmtree(DIST)
    step_generate_icons()
    step_pyinstaller(clean=args.clean)
    if target == "win":
        outputs = step_package_win(version, do_installer, do_portable)
    elif target == "mac":
        outputs = step_package_mac(version, do_installer, do_portable)
    else:
        outputs = step_package_linux(version, do_installer, do_portable)
    _summarise(outputs)
    return 0


if __name__ == "__main__":
    sys.exit(main())

build/tesseract.py

@@ -0,0 +1,453 @@
"""Tesseract bundling helpers for the release build.
PDF Extractor OCR ships a per-platform Tesseract binary plus the English
``eng.traineddata`` model inside the frozen PyInstaller bundle so scanned
PDFs work without a separate user install. These helpers fetch the binary
and tessdata at build time; the GitHub Actions workflow
(``.github/workflows/build.yml``) imports ``fetch_tessdata`` and
``fetch_tesseract_for_platform`` and runs them before PyInstaller.
Everything is staged under ``build/_tesseract/<platform>/`` (gitignored).
The PyInstaller spec (``build/datatools.spec``) reads that staging dir plus
``build/vendor/tessdata/`` and bundles them under ``<bundle>/tesseract/``,
where the runtime discovery code in ``src/pdf_extract.py`` expects:
Path(sys._MEIPASS) / "tesseract" / "tesseract[.exe]"
Path(sys._MEIPASS) / "tesseract" / "tessdata" / "eng.traineddata"
"""
from __future__ import annotations
import os
import shutil
import subprocess
import sys
import urllib.request
from pathlib import Path
REPO = Path(__file__).resolve().parent.parent
BUILD = REPO / "build"
# Tesseract bundling. The runtime discovery code in
# ``src/pdf_extract.py`` looks for the binary at
# ``Path(sys._MEIPASS) / "tesseract" / "tesseract[.exe]"`` and tessdata
# at ``... / "tesseract" / "tessdata" / "eng.traineddata"``. We stage
# everything under ``build/_tesseract/<platform>/`` (gitignored) and
# the PyInstaller spec adds that staging dir to ``datas=`` so it lands
# at the right place inside the frozen bundle.
TESSERACT_VERSION = "5.5.0"
TESSDATA_DIR = BUILD / "vendor" / "tessdata"
TESSDATA_URL = (
"https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata"
)
TESSERACT_STAGING = BUILD / "_tesseract"
# ---------------------------------------------------------------------------
# Output helpers — colourless so logs stay readable in any terminal/CI tail.
# ---------------------------------------------------------------------------
def _step(msg: str) -> None:
print(f"\n==> {msg}", flush=True)
def _ok(msg: str) -> None:
print(f" ok: {msg}", flush=True)
def _warn(msg: str) -> None:
print(f" warn: {msg}", flush=True)
def _err(msg: str) -> None:
print(f" ERROR: {msg}", file=sys.stderr, flush=True)
def _run(cmd: list[str], cwd: Path | None = None, env: dict | None = None) -> None:
"""Run *cmd*, stream output, exit on failure with a useful banner."""
printable = " ".join(map(str, cmd))
print(f" $ {printable}", flush=True)
try:
subprocess.run(cmd, check=True, cwd=cwd or REPO, env=env)
except subprocess.CalledProcessError as e:
_err(f"command failed (exit {e.returncode}): {printable}")
sys.exit(e.returncode)
except FileNotFoundError:
_err(f"command not found: {cmd[0]}")
sys.exit(127)
# ---------------------------------------------------------------------------
# Tesseract bundling — fetch the binary + tessdata at build time.
#
# We download (not vendor) because:
# * Binaries are large (5-40 MB per platform) and carry licensing
# obligations that make them awkward to keep current in git.
# * tessdata is Apache-2.0 and ~16 MB — fine to redistribute but
# bloats clones for contributors who don't touch OCR.
#
# Caching layout:
# build/_tesseract/win/tesseract.exe + DLLs
# build/_tesseract/mac/tesseract + dylibs
# build/_tesseract/linux/tesseract + libs
# build/vendor/tessdata/eng.traineddata (shared across platforms)
#
# The PyInstaller spec reads ``build/_tesseract/<platform>/`` and the
# tessdata dir, then bundles them under ``<bundle>/tesseract/``.
# ---------------------------------------------------------------------------
def _download(url: str, dest: Path, *, expected_min_bytes: int = 1024) -> None:
"""Download *url* to *dest* atomically. Sanity-check the size."""
dest.parent.mkdir(parents=True, exist_ok=True)
tmp = dest.with_suffix(dest.suffix + ".part")
print(f" GET {url}", flush=True)
try:
with urllib.request.urlopen(url, timeout=120) as r, open(tmp, "wb") as f:
shutil.copyfileobj(r, f)
except Exception as e: # noqa: BLE001 — bubble any network error up
if tmp.exists():
tmp.unlink()
_err(f"download failed: {url}\n {e}")
raise
size = tmp.stat().st_size
if size < expected_min_bytes:
tmp.unlink()
raise RuntimeError(
f"downloaded file too small ({size} bytes < {expected_min_bytes}); "
f"the URL probably 404'd into an HTML error page."
)
tmp.replace(dest)
_ok(f"downloaded {dest.name} ({size / (1024 * 1024):.1f} MB)")
def fetch_tessdata() -> Path:
"""Ensure ``build/vendor/tessdata/eng.traineddata`` exists; return its path.
Shared across platforms. Downloaded once and cached. The
runtime expects this file at ``<bundle>/tesseract/tessdata/eng.traineddata``;
the PyInstaller spec handles the placement.
"""
_step("fetch tessdata (eng.traineddata)")
TESSDATA_DIR.mkdir(parents=True, exist_ok=True)
target = TESSDATA_DIR / "eng.traineddata"
if target.exists() and target.stat().st_size > 1_000_000:
_ok(f"already cached: {target.relative_to(REPO)} "
f"({target.stat().st_size / (1024 * 1024):.1f} MB)")
return target
# ~16 MB on disk for the "best" model. Allow some slack on the
# min-bytes check (3 MB) so we still catch HTML 404 pages.
_download(TESSDATA_URL, target, expected_min_bytes=3 * 1024 * 1024)
return target
def _fetch_tesseract_windows(staging: Path) -> None:
"""Stage tesseract.exe + DLLs into *staging*.
Strategy (no easy stand-alone Windows tarball exists — UB-Mannheim
ships the canonical Windows builds as Inno Setup installers):
1. Download the installer .exe from the UB-Mannheim mirror.
2. Extract it with 7-Zip (which can read Inno Setup archives via
the {app} group). 7-Zip is preinstalled on
``windows-latest`` GitHub Actions runners (`C:\\Program Files\\7-Zip\\7z.exe`).
3. Copy tesseract.exe + every DLL + the tessdata dir from the
extraction into ``staging/``.
The DLL set tesseract.exe needs at runtime (per UB-Mannheim's
Inno Setup script):
libtesseract-5.dll, libleptonica-6.dll, libgomp-1.dll,
libstdc++-6.dll, libwinpthread-1.dll, libgcc_s_seh-1.dll,
liblz4.dll, libjpeg-8.dll, libpng16-16.dll, libtiff-6.dll,
libwebp-7.dll, libwebpmux-3.dll, libopenjp2-7.dll, zlib1.dll
The whole {app} tree from the installer is ~120 MB; we copy
just the .exe + .dll files (~50 MB) since the runtime only
needs the binary and its direct deps.
"""
# UB-Mannheim posts builds under a versioned filename; the exact
# build revision changes (5.5.0.20241111 at time of writing).
# We pin a specific rev so reproducible builds don't drift.
rev = "20241111" # patch rev for tesseract 5.5.0 on the UB-Mannheim mirror
fname = f"tesseract-ocr-w64-setup-{TESSERACT_VERSION}.{rev}.exe"
url = f"https://digi.bib.uni-mannheim.de/tesseract/{fname}"
cache = TESSERACT_STAGING / fname
if not cache.exists():
_download(url, cache, expected_min_bytes=20 * 1024 * 1024)
# 7-Zip is preinstalled on windows-latest runners; on a dev box
# the user installs it (choco install 7zip) or substitutes
# innoextract. Locate it.
sevenz = (
shutil.which("7z")
or shutil.which("7z.exe")
or r"C:\Program Files\7-Zip\7z.exe"
)
if not Path(sevenz).exists() and not shutil.which("7z"):
_err(
"7-Zip not found. On Windows CI runners it's preinstalled; "
"on a dev box install via ``choco install 7zip`` or extract "
f"{cache} manually into {staging}/ and re-run with "
"TESSERACT_SKIP_FETCH=1."
)
raise FileNotFoundError("7z")
extract = TESSERACT_STAGING / "win_extract"
if extract.exists():
shutil.rmtree(extract)
extract.mkdir(parents=True)
_run([str(sevenz), "x", "-y", f"-o{extract}", str(cache)])
staging.mkdir(parents=True, exist_ok=True)
# The Inno Setup payload lands under ``{app}/`` inside the
# extraction. Recursively grab tesseract.exe + DLLs.
found_exe = False
for root, _dirs, files in os.walk(extract):
for f in files:
src = Path(root) / f
if f.lower() == "tesseract.exe":
shutil.copy2(src, staging / "tesseract.exe")
found_exe = True
elif f.lower().endswith(".dll"):
shutil.copy2(src, staging / f)
if not found_exe:
raise RuntimeError(
f"tesseract.exe not found inside extracted installer at {extract}"
)
_ok(f"staged Windows tesseract into {staging.relative_to(REPO)}")
def _fetch_tesseract_macos(staging: Path) -> None:
"""Stage tesseract + dylibs into *staging* on macOS.
Strategy: use Homebrew. ``brew install tesseract`` is the
sanctioned macOS path and the binary it installs is the same one
every guide on the internet points at. We copy the binary +
every dylib it links against into the staging dir, then run
``install_name_tool`` to rewrite the load paths so the binary
works after relocation into the .app bundle.
Caveat: ``brew`` must be on PATH (it is on ``macos-latest``
runners). If it isn't, we surface a helpful error rather than
fail mysteriously.
"""
if not shutil.which("brew"):
_err(
"Homebrew not found. On macos-latest GitHub runners it's "
"preinstalled; on a dev Mac install from https://brew.sh and "
"re-run. Alternatively pre-stage tesseract into "
f"{staging}/ and set TESSERACT_SKIP_FETCH=1."
)
raise FileNotFoundError("brew")
# ``brew install`` is idempotent, so it is fine to run on every build.
# We don't pin the version through brew because brew tracks its own
# taps. Note that nothing below checks the installed version against
# TESSERACT_VERSION, so the staged binary may differ from the pin.
_run(["brew", "install", "tesseract"])
# Find the binary brew just installed.
tess_path = shutil.which("tesseract")
if not tess_path:
raise RuntimeError("brew install tesseract succeeded but tesseract not on PATH")
staging.mkdir(parents=True, exist_ok=True)
shutil.copy2(tess_path, staging / "tesseract")
# Copy every non-system dylib the binary links against. The
# ``otool -L`` output lists absolute paths under /opt/homebrew/
# (Apple Silicon) or /usr/local/ (Intel). We skip /usr/lib/* and
# /System/* (Apple-shipped, present on every Mac).
try:
otool = subprocess.run(
["otool", "-L", str(staging / "tesseract")],
check=True, capture_output=True, text=True,
)
except subprocess.CalledProcessError as e:
raise RuntimeError(f"otool failed: {e.stderr}") from e
deps = []
for line in otool.stdout.splitlines()[1:]:
path = line.strip().split(" ", 1)[0]
if path.startswith(("/opt/homebrew/", "/usr/local/")):
deps.append(path)
# Copy each dep and, recursively, its transitive deps. The tesseract
# dep tree is shallow (libtesseract → libleptonica →
# libpng/libjpeg/libtiff/libwebp), and ``copied`` guards against revisits.
copied: set[str] = set()
def _copy_with_deps(libpath: str) -> None:
if libpath in copied or not Path(libpath).exists():
return
copied.add(libpath)
dest = staging / Path(libpath).name
shutil.copy2(libpath, dest)
# Rewrite the dest's own load path to @loader_path so the
# bundle is relocatable.
try:
subprocess.run(
["install_name_tool", "-id", f"@loader_path/{Path(libpath).name}", str(dest)],
check=True, capture_output=True,
)
except subprocess.CalledProcessError:
# Not fatal — install_name_tool refuses on already-relative
# IDs. The dyld loader will still find them via
# @loader_path rewrites on the consumer side.
pass
# Walk this lib's own deps.
try:
sub = subprocess.run(
["otool", "-L", libpath], check=True, capture_output=True, text=True,
)
for sub_line in sub.stdout.splitlines()[1:]:
sub_path = sub_line.strip().split(" ", 1)[0]
if sub_path.startswith(("/opt/homebrew/", "/usr/local/")):
_copy_with_deps(sub_path)
except subprocess.CalledProcessError:
pass
for dep in deps:
_copy_with_deps(dep)
# Rewrite the tesseract binary's references to point at
# @loader_path/<dyname> so it can find its deps inside the bundle.
bin_path = staging / "tesseract"
for dep in deps:
try:
subprocess.run(
["install_name_tool", "-change", dep,
f"@loader_path/{Path(dep).name}", str(bin_path)],
check=True, capture_output=True,
)
except subprocess.CalledProcessError:
pass
_ok(f"staged macOS tesseract + {len(copied)} dylibs into {staging.relative_to(REPO)}")
def _fetch_tesseract_linux(staging: Path) -> None:
"""Stage tesseract + .so files into *staging* on Linux.
Strategy: reuse a ``tesseract`` already on PATH (preinstalled on most
ubuntu-latest images); only when it is missing, run
``apt-get install tesseract-ocr libtesseract5``. Then copy the
binary + every .so it links against into staging. ``patchelf``,
when available, rewrites RPATH so the bundle is relocatable.
"""
if not shutil.which("apt-get") and not shutil.which("tesseract"):
_err(
"Neither apt-get nor a pre-installed tesseract found. On "
"ubuntu-latest runners both are present. On other distros "
"install tesseract-ocr via your package manager and re-run "
"with TESSERACT_SKIP_FETCH=1 after pre-staging the binary."
)
raise FileNotFoundError("tesseract")
if shutil.which("apt-get") and not shutil.which("tesseract"):
_run(["sudo", "apt-get", "update"])
_run(["sudo", "apt-get", "install", "-y", "tesseract-ocr", "libtesseract5"])
tess_path = shutil.which("tesseract")
if not tess_path:
raise RuntimeError("apt-get install succeeded but tesseract not on PATH")
staging.mkdir(parents=True, exist_ok=True)
shutil.copy2(tess_path, staging / "tesseract")
# Collect .so dependencies via ldd. Skip the vDSO, the dynamic linker,
# and the glibc family (libc/libdl/libpthread/libm/librt/libnsl/libutil):
# those exist on every Linux target, and shipping them can cause GLIBC
# mismatch errors on older distros. libstdc++ and libgcc_s are not in
# the skip list, so they are copied when ldd reports them. The
# interesting tesseract-specific deps are libtesseract, libleptonica,
# and the image format libs (libpng, libjpeg, libtiff, libwebp, libgif).
SKIP_PREFIXES = (
"linux-vdso", "/lib64/ld-linux", "/lib/ld-linux",
"libc.so", "libdl.so", "libpthread.so", "libm.so",
"librt.so", "libnsl.so", "libutil.so",
)
try:
ldd = subprocess.run(
["ldd", str(staging / "tesseract")],
check=True, capture_output=True, text=True,
)
except subprocess.CalledProcessError as e:
raise RuntimeError(f"ldd failed: {e.stderr}") from e
copied = 0
for line in ldd.stdout.splitlines():
# Format: " libfoo.so.N => /path/to/libfoo.so.N (0x...)"
parts = line.split("=>")
if len(parts) != 2:
continue
soname = parts[0].strip()
if soname.startswith(SKIP_PREFIXES):
continue
path_part = parts[1].strip().split(" ", 1)[0]
if not path_part or not Path(path_part).exists():
continue
shutil.copy2(path_part, staging / Path(path_part).name)
copied += 1
# patchelf is optional — if present, rewrite RPATH to $ORIGIN so
# the binary finds its bundled .so files. If absent, the
# PyInstaller LD_LIBRARY_PATH that the launcher sets will cover
# it (we already chdir into _MEIPASS for the runtime).
if shutil.which("patchelf"):
try:
_run(["patchelf", "--set-rpath", "$ORIGIN", str(staging / "tesseract")])
except SystemExit:
_warn("patchelf rpath rewrite failed — relying on LD_LIBRARY_PATH at runtime")
_ok(f"staged Linux tesseract + {copied} .so files into {staging.relative_to(REPO)}")
def fetch_tesseract_for_platform(target: str) -> Path:
"""Stage the per-platform Tesseract binary + libs into ``build/_tesseract/<target>/``.
Returns the staging dir path. The PyInstaller spec adds this dir
(plus tessdata) to its ``datas=`` so the bundle ends up with
everything under ``<bundle>/tesseract/`` where the runtime
discovery code expects it.
Honours ``TESSERACT_SKIP_FETCH=1`` — set this when you've
pre-staged the binary by hand (offline build, behind a proxy,
custom build of tesseract, etc.). The script still verifies the
binary is present and surfaces a helpful error if not.
"""
_step(f"fetch tesseract binary ({target})")
staging = TESSERACT_STAGING / target
exe_name = "tesseract.exe" if target == "win" else "tesseract"
exe_path = staging / exe_name
if os.environ.get("TESSERACT_SKIP_FETCH") == "1":
if not exe_path.exists():
_err(
f"TESSERACT_SKIP_FETCH=1 but {exe_path} is missing. "
"Pre-stage the binary + its libs into that dir, then re-run."
)
sys.exit(1)
_ok(f"skipping fetch (TESSERACT_SKIP_FETCH=1); using {exe_path.relative_to(REPO)}")
return staging
if exe_path.exists():
_ok(f"already staged: {exe_path.relative_to(REPO)}")
return staging
if target == "win":
_fetch_tesseract_windows(staging)
elif target == "mac":
_fetch_tesseract_macos(staging)
elif target == "linux":
_fetch_tesseract_linux(staging)
else:
_err(f"unknown target {target!r} for tesseract fetch")
sys.exit(2)
if not exe_path.exists():
_err(
f"fetch step finished but {exe_path.relative_to(REPO)} is missing. "
"Inspect the logs above; you may need to pre-stage the binary manually."
)
sys.exit(1)
return staging

63
build/vendor/README.md vendored Normal file

@@ -0,0 +1,63 @@
# build/vendor/ — third-party bundle inputs (fetched at build time)
This tree holds the third-party assets that get bundled into the
PyInstaller artifacts but that we deliberately do **not** keep in git
(too large / license-encumbered / re-fetchable on demand).
The build's Tesseract helper (`build/tesseract.py`) populates
everything in here before the PyInstaller step — CI
(`.github/workflows/build.yml`) calls it ahead of the build. The
contents are git-ignored except for this README.
## tessdata/
Holds the Tesseract language data file(s) used by the PDF Extractor
OCR fallback. Only English is bundled today.
### Canonical source
We use the **"best" model** from `tesseract-ocr/tessdata_best` (LSTM,
slower but higher accuracy than the legacy `tessdata` set, and only
~12 MB compressed → ~16 MB uncompressed):
```
https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata
```
There is also `tessdata_fast/` (~4 MB, lower accuracy) if you ever
want to optimise for bundle size over recognition quality. For bank
statements (the only OCR use case so far), the extra accuracy of the
`_best` model is worth the additional ~12 MB.
### Why we don't vendor it in git
* ~16 MB binary file — bloats clone times for everyone, including
contributors who never touch the OCR code path.
* Apache-2.0-licensed and stable; the file rarely changes upstream
(last touched 2021), so a build-time fetch is safe.
* The Tesseract project explicitly distributes these via GitHub
raw URLs — they're meant to be downloaded, not redistributed
through other repos.
### How it gets populated
`build/tesseract.py::fetch_tessdata()` checks for
`build/vendor/tessdata/eng.traineddata` on every run. If it's
missing, it downloads the file from the canonical URL above and
caches it here. Subsequent builds reuse the cached file.
On CI, the directory is restored from the GitHub Actions cache so we
don't pay the download cost on every run (`.github/workflows/build.yml`
caches `build/vendor/tessdata/` keyed on the URL above).
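A minimal sketch of that check-then-fetch pattern, assuming a size
threshold is an adequate proxy for a valid cached file. `ensure_cached`
and its `fetch` callback are hypothetical names used for illustration,
not the helper in `build/tesseract.py`; a fake fetcher stands in for the
real download so the sketch runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_cached(target: Path, fetch: Callable[[Path], None], min_bytes: int) -> bool:
    """Return True if *fetch* ran, False if a plausible cached copy was reused."""
    if target.exists() and target.stat().st_size >= min_bytes:
        return False  # cache hit: nothing to download

    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a sibling ".part" file first so an interrupted fetch never
    # leaves a truncated file at the final path.
    tmp = target.with_suffix(target.suffix + ".part")
    fetch(tmp)
    size = tmp.stat().st_size
    if size < min_bytes:
        tmp.unlink()
        raise RuntimeError(f"fetched file too small ({size} bytes < {min_bytes})")
    tmp.replace(target)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        model = Path(workdir) / "tessdata" / "eng.traineddata"

        def fake_fetch(dest: Path) -> None:
            # Stand-in for the network download (assumption for the demo).
            dest.write_bytes(b"\x01" * 4096)

        print(ensure_cached(model, fake_fetch, min_bytes=1024))  # first run fetches
        print(ensure_cached(model, fake_fetch, min_bytes=1024))  # second run reuses
```

The write-then-rename step is what makes a half-finished download
harmless: the next build simply sees no cached file and fetches again.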
## Manual one-time fetch (if you're offline or behind a proxy)
```bash
mkdir -p build/vendor/tessdata
curl -L -o build/vendor/tessdata/eng.traineddata \
https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata
```
Verify the file is non-empty and roughly the expected size (~16 MB).
The build script's own check after download is size-only: it rejects
anything under 3 MB, which is enough to catch an HTML error page saved
in place of the model.
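For a manual fetch, a rough local check could look like the sketch
below. `looks_like_model` is a hypothetical helper, not part of the
build script; it assumes the same 3 MB floor the script passes to its
download check and additionally rejects files that begin like an HTML
page, the usual symptom of a bad URL or a proxy error.

```python
import tempfile
from pathlib import Path

MIN_MODEL_BYTES = 3 * 1024 * 1024  # same floor the build script uses after download


def looks_like_model(path: Path, min_bytes: int = MIN_MODEL_BYTES) -> bool:
    """Heuristic only: big enough and not an HTML error page."""
    if not path.is_file() or path.stat().st_size < min_bytes:
        return False
    with path.open("rb") as fh:
        head = fh.read(64).lstrip().lower()
    return not head.startswith((b"<!doctype", b"<html"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        error_page = Path(workdir) / "eng.traineddata"
        error_page.write_bytes(b"<!DOCTYPE html><html>Not Found</html>")
        # A saved error page fails both the size and the content check.
        print(looks_like_model(error_page))
```

Run against the real file with
`looks_like_model(Path("build/vendor/tessdata/eng.traineddata"))`.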

0
build/vendor/tessdata/.gitkeep vendored Normal file


@@ -32,17 +32,22 @@ rebuilds it from a stale headline.
| Friction kills conversion | BUSINESS.md §7 | Demo dataset preloaded; no "select a file" first-step |
| < $1,200/mo recurring | BUSINESS.md §9 | Migration plan to $5/mo VPS only after rate-limit signal |
## 3. The three personas (per PLAN.md §2.3)
## 3. The three personas — one audience: accounting (per PLAN.md §2.3)
We niche to **accounting** and enter through the three workflows where a
messy export costs real money. Same engine, three landing pages — each
is the same buyer at a different desk (bookkeeping, payables, receivables).
| Tag | Persona | Top-of-funnel keyword | Demo dataset | Pre-saved pipeline |
|---|---|---|---|---|
| `shopify-pet` | Shopify operator (priority: pet supplies) | "shopify customer cleanup" | `samples/demo/shopify_pet_customers.csv` | `shopify_pet_pipeline.json` |
| `bookkeeper` | Bookkeeper / freelance accountant | "reconcile bank export csv" | `samples/demo/bookkeeper_bank_reconcile.csv` | `bookkeeper_bank_pipeline.json` |
| `revops` | Marketing / RevOps agency | "dedupe lead list across vendors" | `samples/demo/agency_combined_leads.csv` | `agency_leads_pipeline.json` |
| `bookkeeper` | Bookkeeper — bank reconciliation | "reconcile bank export csv duplicates" | `samples/demo/bank_reconciliation.csv` | `bank_reconciliation_pipeline.json` |
| `ap-1099` | Accounts payable — 1099 vendor prep | "clean 1099 vendor list missing EIN" | `samples/demo/vendor_1099.csv` | `vendor_1099_pipeline.json` |
| `ar-aging` | Accounts receivable — open invoices | "remove duplicate invoices aging report" | `samples/demo/ar_open_invoices.csv` | `ar_open_invoices_pipeline.json` |
Each persona gets its **own landing page URL**, its **own demo dataset
loaded by default**, and its **own H1 + below-the-fold copy.** The
engine is identical; only positioning differs.
Each persona gets its **own landing page URL** (`?p=<tag>`), its **own
demo dataset loaded by default**, and its **own H1 + below-the-fold
copy** — wired in `src/gui/app_demo.py::PERSONAS`. The engine is
identical; only positioning differs.
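A minimal sketch of how a `?p=<tag>` value could be resolved to a
persona, assuming a plain dict keyed by tag and a fallback to
`bookkeeper` for a missing or unknown tag (the fallback the demo tests
describe). The dict shape and `resolve_persona` are illustrative
assumptions; the real structure of `PERSONAS` in `src/gui/app_demo.py`
is not shown here.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical shape: tag -> default demo dataset (paths from the table above).
PERSONAS = {
    "bookkeeper": "samples/demo/bank_reconciliation.csv",
    "ap-1099": "samples/demo/vendor_1099.csv",
    "ar-aging": "samples/demo/ar_open_invoices.csv",
}
DEFAULT_PERSONA = "bookkeeper"


def resolve_persona(url: str) -> str:
    """Return the persona tag from ?p=, or the default when absent/unknown."""
    query = parse_qs(urlparse(url).query)
    # parse_qs drops blank values, so "?p=" also falls through to the default.
    tag = query.get("p", [DEFAULT_PERSONA])[0].strip().lower()
    return tag if tag in PERSONAS else DEFAULT_PERSONA


if __name__ == "__main__":
    for url in (
        "https://example.com/demo?p=ap-1099",
        "https://example.com/demo?p=shopify-pet",  # retired tag
        "https://example.com/demo",
    ):
        tag = resolve_persona(url)
        print(tag, "->", PERSONAS[tag])
```

Falling back rather than erroring keeps old links to retired personas
landing on a working demo.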
## 4. Demo dataset specifications
@@ -53,114 +58,77 @@ persona's tooling. Each contains every kind of pollution the bundle's
five tools fix, so a single demo run shows every tool earning its
keep.
### 4.0 Pain-point coverage map
### 4.0 Value-proof map
Each demo dataset is engineered so the buyer sees their **own top
pain** demonstrated in the AFTER preview. The mapping below pairs
each pain from PLAN.md §2.3a with the rows / columns that exercise
it. Refresh the dataset only when this coverage drops.
Each demo dataset is engineered so the buyer sees their **own top pain**
fixed in the AFTER preview, with one unmistakable headline number. All
three run the same saved 4-step pipeline (Clean Text → Standardize
Formats → Fix Missing Values → Find Duplicates). The numbers below are
**validated against the live engine** (`tests/test_demo_pipelines.py`
pins them) — refresh the dataset only if a number stops landing.
| Persona | Pain (from PLAN §2.3a) | Demo coverage |
| Persona | Headline proof | What the visitor watches happen |
|---|---|---|
| Shopify pet | S1 — Klaviyo per-contact dupes | 5 dup pairs across rows 1–15 (case + format + address-twin variants) |
| Shopify pet | S2 — feed-rejection chars | smart-quote / NBSP / BOM in rows 1–6, 9, 11 |
| Shopify pet | S3 — multi-channel | partner-style customer IDs (`SHOP-`); demonstration of column-level mapping covered in RevOps demo |
| Shopify pet | S4 — subscription identity | rows 1+2, 7+8, 9+10 — same person, different format |
| Shopify pet | S5 — VAT-MOSS country drift | rows 16–18 (`United Kingdom` / `U.K.` / `UK`) + rows 19–20 (`Germany`/`Italia`) |
| Bookkeeper | B1 — month-overlap re-import | 7 dup pairs spanning Jan↔Feb and Mar boundaries |
| Bookkeeper | B2 — 1099 vendor consolidation | Amazon × 3 spellings, Verizon × 2, Acme Realty × 2, Adobe × 2, Costco × 2, Zoom × 2, Stripe × 4 |
| Bookkeeper | B3 — audit trail | every cell change in the run logged with old/new/rule — surface in the demo's audit tab |
| Bookkeeper | B4 — per-license economics | demonstrated by pricing copy, not data |
| Bookkeeper | B5 — multi-currency | rows 26 (EUR), 27 (GBP), 28 (BRL with comma decimal), 29 (parens-negative) |
| RevOps | R1 — per-contact tier | 6 cross-source dup pairs (HubSpot × LinkedIn × Manual Scrape) |
| RevOps | R2 — deliverability | rows 26–27 (`uma at uniform dot com`, `victor@@victorco.com` invalid emails) |
| RevOps | R3 — GDPR / privacy | demonstrated by the network-tab moat panel + zero-upload claim |
| RevOps | R4 — vendor unification | 3 source values (HubSpot / LinkedIn / Manual Scrape), 13 country codes, mixed-shape headers |
| RevOps | R5 — suppression list | rows 29–30 (`Suppressed`, `Opted Out` tags) |
| Bookkeeper | **26 → 20 rows · 6 phantom duplicates removed** | The same payment posted twice (different date + amount format) collapses to one; dates go ISO, parens-negatives become real negatives |
| AP / 1099 | **24 records → 8 vendors · 7 missing EINs recovered** | Each vendor's scattered records merge into one complete row; `merge=true` backfills the EIN/address/phone that any single record was missing |
| AR aging | **26 → 21 rows · 5 double-entered invoices removed** | Duplicate invoice numbers collapse; a blank status is backfilled from its twin; invoice + due dates go ISO, amounts numeric |
### 4.1 `shopify_pet_customers.csv` (20 rows)
### 4.1 `bank_reconciliation.csv` (26 rows) — Bookkeeper
**Looks like**: a Shopify customer export filtered for "Pet Supplies"
sales channel, 12 months activity.
**Looks like**: two months (Jan + Feb 2025) of business-checking activity
from a bank portal, where the Feb re-export overlaps Jan so the same
transaction posts twice. Columns: `Date, Description, Vendor, Category,
Amount, Account`.
**Pollution included**:
- Whitespace padding (" Alice ", "Sydney Opera House Drive ")
- Mixed phone formats: `(415) 555-1234`, `415.555.1234`, `5559876543`,
`+1 555-111-1111`
- International phones: GB, ES, DE, AU, JP (15 demo rows span 6
countries)
- Currency variants: `$1,240.50`, `£890.25`, `€2.410,75` (EU comma
decimal), `A$ 1,299.00`, `¥75000`
- Date formats: `2025-12-04`, `12/15/2025`, `?`, `(blank)`, `(none)`,
`#N/A`
- Disguised nulls: `N/A`, blank, `(blank)`, `?`, `#N/A`, `(none)`,
`unknown`
- Name casing: `EVE MARTINEZ`, `henry`, `O'NEIL`, `noah`, mixed Title /
ALL CAPS / lower
- Email case variants that *should* dedup: `Bob@PetShop.com` vs
`alice@petshop.com`
- 4 fuzzy duplicates (Alice/Bob same address, Grace/Henry same phone,
Carlos/Olivia same address, Ivy/Jack same address)
- Mixed date formats: `01/15/2025`, `2025-01-15`, `Jan 18 2025`, `1/27/25`, `Feb 5 2025`.
- Currency formats incl. negatives: `-$129.99`, `($89.50)` parens-negative, `+$3,450.00`, `- $599.88`, bare `-129.99`, `(50.00)`.
- Whitespace + NBSP padding; smart quotes and an em-dash inside descriptions.
- Vendor casing variety on *non-duplicate* rows: `Amazon` / `amazon.com` / `AMAZON.COM`, `Verizon` / `verizon`.
- Disguised nulls in Category: `—`, `(blank)`, `?`, `unknown`, `TBD`.
- **6 duplicate transactions** — each pair shares the same vendor + real value but a different date *and* amount format, so they collapse only after standardization.
**After running the pipeline**: 20 rows → 15, ~29 cells canonicalized,
~45 sentinels standardised, 5 cross-row duplicates merged. The
customer table is now Klaviyo-import-ready and the country column
(previously `UK` / `U.K.` / `United Kingdom` / `Germany` / `Italia`)
is GB / DE / IT — VAT MOSS report won't break.
**After running the pipeline** (validated): **26 → 20 rows, 6 duplicates
removed**, 36 date/amount cells standardized (0 unparseable), all dates
ISO, parens-negatives resolved (`($89.50)` → `-89.50`), disguised-null
categories flagged. The reconciliation ties out.
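Why the duplicates only collapse after standardization can be shown
with a small standalone sketch. It is not the DataTools engine: the
function names, the list of date formats, and the dedup key (ISO date +
lowercased vendor + numeric amount) are assumptions chosen to match the
formats listed above, and the three rows are made up for illustration.

```python
import re
from datetime import datetime
from decimal import Decimal

# Formats seen in the dataset description; %b assumes English month names.
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%b %d %Y", "%m/%d/%y")


def to_iso_date(raw: str) -> str:
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unparseable date: {raw!r}")


def to_amount(raw: str) -> Decimal:
    text = raw.replace("\u00a0", " ").strip()
    # Accounting parentheses and a leading minus both mean "negative".
    negative = (text.startswith("(") and text.endswith(")")) or text.startswith("-")
    value = Decimal(re.sub(r"[^\d.]", "", text))
    return -value if negative else value


def standardize_and_dedupe(rows: list[dict[str, str]]) -> list[dict]:
    seen, kept = set(), []
    for row in rows:
        clean = {
            "Date": to_iso_date(row["Date"]),
            "Vendor": row["Vendor"].strip(),
            "Amount": to_amount(row["Amount"]),
        }
        key = (clean["Date"], clean["Vendor"].lower(), clean["Amount"])
        if key not in seen:  # keep the first occurrence of each transaction
            seen.add(key)
            kept.append(clean)
    return kept


if __name__ == "__main__":
    rows = [
        {"Date": "01/15/2025", "Vendor": "Stripe", "Amount": "+$3,450.00"},
        {"Date": "2025-01-15", "Vendor": " stripe ", "Amount": "3450.00"},
        {"Date": "Jan 18 2025", "Vendor": "Staples", "Amount": "($89.50)"},
    ]
    for row in standardize_and_dedupe(rows):
        print(row)
```

Comparing the raw strings would treat the first two rows as different
transactions; only the normalized key makes them equal.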
### 4.2 `bookkeeper_bank_reconcile.csv` (30 rows)
### 4.2 `vendor_1099.csv` (24 rows) — Accounts payable / 1099
**Looks like**: two months of business checking + credit-card activity
exported from a bank portal, with the Feb export accidentally
overlapping the Jan export at the month boundary.
**Looks like**: a 1099-NEC vendor master list where the same vendor was
entered 2–3 times across the year by different staff, each record holding
only *part* of the vendor's details. Columns: `Vendor, Contact, Email,
Phone, EIN, Address, Total_Paid`.
**Pollution included**:
- Mixed date formats: `01/15/2025`, `2025-01-15`, `Jan 18 2025`,
`1/27/25`, `Feb 5 2025`
- Currency formats: `-$129.99`, `($89.50)` parens-negative,
`+$3,450.00`, `- $599.88` space, bare `-129.99`, `(50.00)`
- Header trailing whitespace: `"Date "`
- Smart quotes around descriptions: `"autopay"`
- Em-dash sentinels in Vendor: `—`
- Smart-em-dash inside descriptions: `STAPLES #4422 — paper, toner`
- Vendor casing inconsistency: `Amazon` / `amazon.com` / `AMAZON.COM`,
`Verizon` / `verizon`
- 6 duplicate transactions (same date+amount+vendor recorded twice
with different formats)
- The duplicate records for a vendor share one email differing only by case/whitespace (the reliable dedup key, matched with the `email` normalizer).
- EIN / Phone / Address scattered across the duplicate set so no single record is complete but the union is — gaps marked `—`, `(blank)`, `TBD`, `unknown`, `N/A`.
- Vendor name casing/spelling variants, phone formats, EIN formats (`12-3456789` vs `123456789`), `Total_Paid` currency variants.
**After running the pipeline**: 30 rows → 23, ~84 cells normalized, 7
duplicates removed (month-overlap + VAT-MOSS dups). All dates
ISO-formatted, all amounts numeric (including EUR/GBP/BRL with comma
decimal), vendor casing canonical, parens-negative resolved.
**After running the pipeline** (validated): **24 records → 8 vendors, 16
duplicates removed, 7 missing EINs recovered** by `merge=true` +
`most_complete` survivor, 35 disguised nulls caught, phones/emails/amounts
standardized (0 unparseable). One vendor genuinely has no EIN in any
record — it survives with a blank EIN as the realistic "flag for
follow-up" case.
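The merge-and-backfill idea can be sketched in a few lines. This is an
illustration of the technique, not the engine's `merge=true`
implementation: `merge_duplicates`, the null list, and the tie-breaking
rule (first record wins when two are equally complete) are assumptions,
and the sample records are invented.

```python
# Disguised nulls from the list above, lowercased; "\u2014" is the dash placeholder.
DISGUISED_NULLS = {"", "\u2014", "(blank)", "tbd", "unknown", "n/a"}


def is_missing(value: str) -> bool:
    return value.strip().lower() in DISGUISED_NULLS


def merge_duplicates(records: list[dict[str, str]], key_field: str = "Email") -> list[dict[str, str]]:
    # Group records on the normalized key (case- and whitespace-insensitive).
    groups: dict[str, list[dict[str, str]]] = {}
    for rec in records:
        groups.setdefault(rec[key_field].strip().lower(), []).append(rec)

    merged = []
    for group in groups.values():
        # Survivor: the record with the most filled-in fields.
        survivor = dict(max(group, key=lambda r: sum(not is_missing(v) for v in r.values())))
        for field in list(survivor):
            if is_missing(survivor[field]):
                # Backfill from any duplicate that has a real value, else leave blank.
                survivor[field] = next(
                    (r[field] for r in group if not is_missing(r.get(field, ""))), ""
                )
        merged.append(survivor)
    return merged


if __name__ == "__main__":
    records = [
        {"Vendor": "Acme Realty", "Email": "AP@Acme.com ", "EIN": "12-3456789", "Phone": "TBD"},
        {"Vendor": "ACME REALTY", "Email": "ap@acme.com", "EIN": "\u2014", "Phone": "(415) 555-1234"},
        {"Vendor": "Zeta Design", "Email": "pay@zeta.example", "EIN": "N/A", "Phone": "unknown"},
    ]
    for row in merge_duplicates(records):
        print(row)
```

A vendor with no EIN in any record comes out with a blank EIN, which
matches the "flag for follow-up" case described above.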
### 4.3 `agency_combined_leads.csv` (30 rows)
### 4.3 `ar_open_invoices.csv` (26 rows) — Accounts receivable
**Looks like**: a marketing-ops worksheet combining lead exports from
HubSpot + LinkedIn Sales Navigator + manual scraping, ready for
campaign targeting.
**Looks like**: an open-invoices (unpaid AR) export where some invoices
were double-entered in different formats and client contacts are messy.
Columns: `Invoice, Client, Email, Invoice_Date, Due_Date, Amount, Status`.
**Pollution included**:
- Phone formats per region: US, UK, Spain, Germany, China, India,
Australia, Mexico, Israel, Singapore, Hong Kong, Italy, South
Korea — 13 country codes
- Country column inconsistent: `USA` / `US` / `United States`
- Disguised nulls: `N/A`, `unknown`, `(unknown)`, `(blank)`, `(none)`,
`?`, `—`, `#N/A`, `TBD`
- Source column tags origin (`HubSpot` / `LinkedIn` / `Manual Scrape`)
- Email duplicates across sources with case variants: `alice@acme.com`
+ `Alice.Johnson@acme.com`, `bob@beta.com` + `Bob@Beta.com`,
`diana@delta.com` from two sources, `carlos@gamma.io` from two
sources, `Frank@Foxtrot.de` + `frank@foxtrot.de`
- Name casing: `DIANA LEE`, `henry`, `IVY CHEN`, mixed
- 6 fuzzy / cross-source duplicates designed to survive the dedup
- Score column with sentinel pollution that needs coercion to integer
- Two date columns with mixed formats; currency variants incl. a credit memo `($300.00)` → `-300.00`.
- Client name casing variety; email case variants (`AP@Acme.com` vs `ap@acme.com`).
- Status disguised nulls: `—`, `?`, `(blank)`, `TBD`, `unknown`, `(none)`.
- **5 double-entered invoices** — same invoice number twice, dates/amount in different formats, one copy with a blank status the other fills.
**After running the pipeline**: 30 rows → 24, ~43 cells canonicalized,
14 sentinels resolved, 6 cross-source duplicates merged with `merge=true`
so each survivor inherits the most-complete picture. Invalid-email
rows (deliverability stress) and `Suppressed`/`Opted Out` tags
(suppression-list use case) survive as flagged rows the operator
manually reviews.
**After running the pipeline** (validated): **26 → 21 rows, 5 duplicate
invoices removed**, both date columns ISO + amounts numeric + emails
lowercased (0 unparseable), 7 disguised-null statuses caught, and a blank
status backfilled from its twin via `merge=true`. The aging report stops
double-counting.
## 5. UX flow (per persona)
@@ -174,26 +142,26 @@ dedicated `app_demo.py` for the cloud build).
│ "{Persona-specific H1}" │
├──────────────────────────────────────────────────────────┤
│ │
│ Sample dataset preloaded: shopify_pet_customers.csv │
│ Sample dataset preloaded: bank_reconciliation.csv
│ [Replace with your own file (capped 100 rows)] │
│ │
│ ┌─ BEFORE preview (15 rows) ─────────────────────────┐ │
│ │ Alice | (415) 555-1234 | $1,240.50 | … │ │
│ │ Bob | 415.555.1234 | $1,240.50 | … │ │
│ ┌─ BEFORE preview (26 rows) ─────────────────────────┐ │
│ │ 01/15/2025 | Stripe | +$3,450.00 | … │ │
│ │ 2025-01-15 | Stripe | 3450.00 | … (dup) │ │
│ │ ... │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ Pipeline (saved): │
│ 1. Text Clean → 2. Format Standardize → │
│ 3. Missing → 4. Deduplicate
│ 1. Clean Text → 2. Standardize Formats → │
│ 3. Fix Missing → 4. Find Duplicates
│ │
│ [▶ Run pipeline] │
│ │
│ ┌─ AFTER preview ───────────────────────────────────┐ │
│ │ 15 rows → 11 (4 duplicates merged) │ │
│ │ 27 cells canonicalized · 33 sentinels resolved │ │
│ │ 26 rows → 20 (6 duplicate transactions removed) │ │
│ │ 36 cells standardized · 4 disguised nulls flagged │ │
│ │ │ │
│ │ Alice Johnson | +14155551234 | 1240.50 | … │ │
│ │ 2025-01-15 | Stripe | 3450.00 | … │ │
│ │ ... │ │
│ └──────────────────────────────────────────────────┘ │
│ │
@@ -244,27 +212,35 @@ not "demo crippled" data.
## 7. CTA copy (per persona)
### 7.1 Shopify pet operator
Copy lives in `src/gui/app_demo.py::PERSONAS` (H1 / sub / CTA per tag);
keep this section in sync with that dict.
- **H1**: *Clean your customer / vendor / subscriber exports — locally.*
- **Sub**: *Klaviyo-import-ready in 30 seconds. Catches duplicates Excel
misses. Your data never leaves your computer.*
- **CTA**: *Get DataTools for Shopify — $49 →*
### 7.1 Bookkeeper — bank reconciliation (`?p=bookkeeper`)
### 7.2 Bookkeeper / freelance accountant
- **H1**: *Reconcile messy bank exports. Hand your client an audit
trail.*
- **Sub**: *Catches the duplicate transaction QuickBooks imported twice.
Standardizes dates, amounts, vendor casing. Every change auditable.*
- **H1**: *Catch the transactions your bank export posted twice. Locally.*
- **Sub**: *When the Jan and Feb exports overlap, the same payment posts
twice in two formats. DataTools standardizes every date and amount, then
dedups on the real transaction so your reconciliation ties out — 26 rows
→ 20, six phantom duplicates gone.*
- **CTA**: *Get DataTools for Bookkeepers — $49 →*
### 7.3 Marketing / RevOps agency
### 7.2 Accounts payable — 1099 prep (`?p=ap-1099`)
- **H1**: *Dedupe leads across HubSpot, LinkedIn, and manual scrapes.*
- **Sub**: *International phones, country normalization, fuzzy dedup
with merge — one tool, one schema, no upload.*
- **CTA**: *Get DataTools for RevOps — $49 →*
- **H1**: *Build a clean 1099 vendor list — with the missing EINs filled in.*
- **Sub**: *The same vendor entered three times, each record holding only
part of the details. DataTools consolidates to one row and backfills the
gaps from the duplicates — 24 records → 8 vendors, 7 missing EINs
recovered.*
- **CTA**: *Get DataTools for Accounting — $49 →*
### 7.3 Accounts receivable — open invoices (`?p=ar-aging`)
- **H1**: *Stop chasing the invoices your aging report counted twice. Locally.*
- **Sub**: *Double-entered invoices inflate your AR aging and your
follow-ups. DataTools standardizes dates and amounts, lowercases client
emails, and removes the duplicate invoice numbers — 26 rows → 21, five
phantom invoices off the books.*
- **CTA**: *Get DataTools for Accounting — $49 →*
## 8. Telemetry / conversion tracking

@@ -96,6 +96,36 @@ DeduplicationResult # deduplicated_df, removed_df, match_groups, l
No other call sites change. Gate auto-discovers it via the registry.
### Tool page header — `render_tool_header(tool_id)`
Every tool page renders its title block via `render_tool_header(tool_id)` in `src/gui/components/_legacy.py` — do not call `st.title()` + `st.caption()` directly. The helper renders:
- `tools.<id>.page_title` as the page title (left column).
- A **Help** popover button right of the title (icon `:material/help_outline:`, label from `help.button_label`). Clicking opens an `st.popover` containing the markdown body.
- `tools.<id>.page_caption` as the caption below.
All copy is i18n-driven; editors can tweak help text without touching Python. If a tool is missing its `help_md` key, the popover falls back to `help.missing_body`.
**`help_md` structure** (markdown, stored as a single string with `\n` line breaks in JSON):
```
**When to use**
- bullet 1
- bullet 2
**Steps**
1. numbered step
2. numbered step
**Examples**
- example 1
- example 2
**Tip** one-sentence pro tip.
```
Keep it short — the popover is intentionally compact. Mirror the structure across every tool so the muscle memory transfers.
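One way to keep every tool's `help_md` aligned with this template is a small lint check. The sketch below is an assumption, not part of the codebase: `missing_help_sections` and `REQUIRED_SECTIONS` are hypothetical names, and the check only looks for the four bold section markers in order.

```python
REQUIRED_SECTIONS = ("**When to use**", "**Steps**", "**Examples**", "**Tip**")


def missing_help_sections(help_md: str) -> list[str]:
    """Return the section markers that are absent or appear out of order."""
    missing = []
    cursor = 0
    for marker in REQUIRED_SECTIONS:
        found = help_md.find(marker, cursor)
        if found == -1:
            missing.append(marker)
        else:
            # Later markers must appear after this one.
            cursor = found + len(marker)
    return missing


sample = (
    "**When to use**\n- bullet 1\n"
    "**Steps**\n1. numbered step\n"
    "**Examples**\n- example 1\n"
    "**Tip** one-sentence pro tip."
)
problems = missing_help_sections(sample)
```

A check like this could run over every `tools.<id>.help_md` value in a language pack and fail when the returned list is non-empty.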
### i18n — language packs
The GUI's user-facing strings live in `src/i18n/packs/<code>.json`, keyed by ISO-639-1 code. English (`en.json`) is canonical; missing keys in other packs fall back to English, and missing keys in English fall back to the literal dotted key so a typo is visible rather than silent.
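The fallback chain can be sketched as follows. This is a minimal illustration, not the real `t()` in `src/i18n/`: the `translate` function, its `packs` argument, and the assumption that packs are nested dicts addressed by dotted keys are all hypothetical.

```python
def translate(key, packs, lang, **params):
    """Resolve a dotted key: active pack first, then English, then the key itself."""
    for code in (lang, "en"):
        node = packs.get(code, {})
        for part in key.split("."):
            if not isinstance(node, dict) or part not in node:
                node = None
                break
            node = node[part]
        if isinstance(node, str):
            # Named placeholders such as {name} are filled via str.format.
            return node.format(**params) if params else node
    # Missing everywhere: return the literal key so a typo stays visible.
    return key


# Hypothetical packs for illustration only.
packs = {
    "en": {
        "home": {"title": "Home"},
        "gate": {"warning": "Check {name} before continuing."},
    },
    "es": {"home": {"title": "Inicio"}},
}
```

With these packs, `translate("home.title", packs, "es")` resolves from the Spanish pack, `translate("gate.warning", packs, "es", name="sales.csv")` falls back to English, and a misspelled key comes back unchanged.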
@@ -120,7 +150,8 @@ st.warning(t("gate.warning", name=filename)) # {name} interpolated via str.for
3. Use the dotted key at the call site: `t("section.subsection.key")` or `t("section.key", name=value)` for placeholder interpolation.
**Authoring rules:**
- Keys live under semantic sections (`home.*`, `upload.*`, `findings.*`, `tools.<id>.name`). Don't nest by language or by tool unless the string is genuinely tool-specific.
- Keys live under semantic sections (`home.*`, `upload.*`, `findings.*`, `help.*`, `tools.<id>.name`). Don't nest by language or by tool unless the string is genuinely tool-specific.
- Per-tool header copy lives under `tools.<id>.{page_title, page_caption, help_md}`. `page_caption` is the one-line subtitle under the title; `help_md` is the popover body (see *Tool page header* above). Top-level `help.button_label` / `help.missing_body` are shared across every tool.
- Use `{named}` placeholders (not positional `{0}`) so translators see what's being interpolated.
- Strings can contain Streamlit markdown (`**bold**`) — pass through `st.markdown` / `st.caption` as usual.
- Do **not** put strings inside the farewell-overlay JS payload without going through `_js_html_safe()` in `src/gui/components/_legacy.py`; the helper escapes both the JS string terminator and HTML special chars. The test `TestFarewellEscape` pins that contract.
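To make the escaping contract in the last rule concrete, here is a hedged sketch of the general technique. It is not the actual `_js_html_safe()` implementation: `js_html_safe_sketch` is a hypothetical name, and the order chosen (HTML-escape first, then encode as a JS string literal with `json.dumps`) is an assumption about one safe way to do it.

```python
import html
import json


def js_html_safe_sketch(text: str) -> str:
    """Return a quoted JS string literal whose content is HTML-escaped.

    html.escape neutralises <, >, &, and both quote characters;
    json.dumps then escapes backslashes and control characters and
    wraps the result in double quotes.
    """
    escaped_html = html.escape(text, quote=True)
    return json.dumps(escaped_html)


payload = js_html_safe_sketch('</script>"; alert(1)//')
```

Because the quotes in the input become HTML entities before the literal is built, the only double quotes left in `payload` are the two that delimit the JS string.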
@@ -265,6 +296,37 @@ GUI / CLI handlers: use `format_for_user(exc, context="...")` to render.
All `DataToolsError` subclasses extend stdlib `ValueError` or `OSError` so existing handlers still catch them.
## PDF Extractor — bundled Tesseract
Frozen builds (installer / AppImage) ship Tesseract OCR inside the bundle so scanned PDFs work without a separate system install. Source / `pip` developer environments still resolve Tesseract from `PATH`.
**Runtime layout (frozen bundles)**:
| Resource | Path |
|---|---|
| Tesseract binary | `Path(sys._MEIPASS) / "tesseract" / "tesseract"` (Linux/macOS), `…/tesseract/tesseract.exe` (Windows) |
| Tessdata directory | `Path(sys._MEIPASS) / "tesseract" / "tessdata"` |
| English model | `Path(sys._MEIPASS) / "tesseract" / "tessdata" / "eng.traineddata"` |
**Discovery order** (PDF Extractor runtime):
1. `DATATOOLS_TESSERACT_BIN` env var (override — explicit path to a `tesseract` binary).
2. Bundled path under `sys._MEIPASS` (frozen bundles only — falls through to step 3 otherwise).
3. `tesseract` on `PATH` (developer setups, source checkouts).
4. Windows well-known locations (`C:\Program Files\Tesseract-OCR\tesseract.exe`, etc.).
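The four-step discovery order can be sketched like this. It is an illustration under stated assumptions, not the PDF Extractor's resolver: `find_tesseract` and `WINDOWS_FALLBACKS` are hypothetical names, the override is returned without checking that the file exists, and only the one Windows location named above is listed.

```python
import os
import shutil
import sys
from pathlib import Path

# Only the location named in the docs; the real list may be longer.
WINDOWS_FALLBACKS = (r"C:\Program Files\Tesseract-OCR\tesseract.exe",)


def find_tesseract(env=None):
    """Return a Path to a tesseract binary, or None if nothing is found."""
    env = os.environ if env is None else env

    # 1. Explicit override via environment variable.
    override = env.get("DATATOOLS_TESSERACT_BIN")
    if override:
        return Path(override)

    # 2. Bundled binary; sys._MEIPASS exists only in frozen PyInstaller builds.
    bundle_root = getattr(sys, "_MEIPASS", None)
    if bundle_root:
        name = "tesseract.exe" if sys.platform == "win32" else "tesseract"
        bundled = Path(bundle_root) / "tesseract" / name
        if bundled.is_file():
            return bundled

    # 3. Whatever is on PATH (developer setups, source checkouts).
    on_path = shutil.which("tesseract", path=env.get("PATH"))
    if on_path:
        return Path(on_path)

    # 4. Windows well-known install locations.
    if sys.platform == "win32":
        for candidate in WINDOWS_FALLBACKS:
            if Path(candidate).is_file():
                return Path(candidate)

    return None
```

Passing `env` explicitly keeps the function easy to test: `find_tesseract({"DATATOOLS_TESSERACT_BIN": "/opt/ocr/tesseract"})` short-circuits at step 1 regardless of what is installed on the machine.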
**Where the bytes come from**:
- **Tessdata** is vendored at `build/vendor/tessdata/eng.traineddata` — the "best" English model from [tessdata_best](https://github.com/tesseract-ocr/tessdata_best). PyInstaller's spec copies it into `tesseract/tessdata/` inside the bundle.
- **Tesseract binary** is fetched at build time by `build/tesseract.py` — per-platform download URLs are pinned in that module. The current pin is **Tesseract 5.5.0**. CI (`.github/workflows/build.yml`) imports `fetch_tessdata` + `fetch_tesseract_for_platform` and runs them before PyInstaller.
**To update Tesseract**:
1. Bump the version pin + the per-platform fetch URLs in `build/tesseract.py`.
2. If upstream changed the `eng.traineddata` schema, refresh `build/vendor/tessdata/eng.traineddata` from `tessdata_best` at the matching tag.
3. Push a `v*` tag so CI rebuilds all three platforms, then smoke-test a scanned-PDF run through the PDF Extractor before publishing the release.
4. Update `LICENSE_TESSERACT.txt` at the repo root if the upstream license terms change (Tesseract is Apache-2.0 today).
## Tests
```bash
@@ -290,6 +352,8 @@ tests/
├── test_analyze.py · test_normalize.py · test_text_clean.py
├── test_format_standardize.py
├── test_format_standardize_corpus.py # 199-row buyer corpus
├── test_pipeline.py # pipeline engine: adapters, run, validate, serialize
├── test_cli_pipeline.py # pipeline CLI: recommend/apply/strict/audit
├── test_audit_fixes.py · test_errors.py · test_fixes_unit.py
├── test_corpus.py · test_encodings_corpus.py · test_fixtures_sweep.py
├── test_cli.py · test_cli_*.py · test_e2e.py · test_install.py
@@ -303,10 +367,27 @@ tests/
├── test_workflows.py # happy path per Ready tool
├── test_dedup_review.py # match-group card interactions
├── test_advanced_panels.py # config_panel widgets
├── test_pipeline_builder.py # module-card builder: cards, reorder, JSON, run
├── test_pipeline_phrasing.py # step_phrase/step_status + name bridge (pure fns)
├── test_errors.py # malformed-upload error paths
└── test_findings_panel.py # analyzer findings rendering
```
### Pipeline (Automated Workflows) coverage
The pipeline feature is pinned end to end across four files (~115 tests):
`test_pipeline.py` (core engine — every adapter's summary numbers, step
data-flow, error stop/continue, empty/single-column/all-disabled edges,
dict + file serialization round-trips, `recommended_pipeline(include=…)`,
soft-dependency validation), `test_cli_pipeline.py` (CLI — `--recommend`,
dry-run-by-default, `--apply` output + audit JSON, `--steps`, `--strict`,
`--continue-on-error`, arg validation, save→load round-trip),
`test_pipeline_builder.py` (the visual builder via AppTest — card seeding,
toggle, reorder ▲/▼, add/remove, restore-recommended, Advanced JSON
import/export, per-tool Configure panels emitting the right option dicts),
and `test_pipeline_phrasing.py` (the plain-English `step_phrase`/`step_status`
helpers and the adapter-key→friendly-name bridge as pure functions).
### GUI test layer
GUI tests drive pages with `streamlit.testing.v1.AppTest` —

@@ -122,6 +122,17 @@ Tag a release → 3 platform artifacts upload to GitHub Releases. Manual: copy t
`demo/streamlit_app.py` → Streamlit Community Cloud. Configure deployment in Streamlit UI. Custom domain via CNAME (verify policy at deploy time). Fall back to $5/mo VPS if rate limits / branding constraints hit.
### 3.10 Bundled Tesseract (PDF Extractor OCR)
Frozen builds ship Tesseract 5.5 + `eng.traineddata` inside the PyInstaller bundle so scanned PDFs work without a separate install. Per-platform binary URLs pinned in `build/tesseract.py`; tessdata vendored at `build/vendor/tessdata/eng.traineddata`. License attribution in `LICENSE_TESSERACT.txt` at the repo root.
**Discovery order at runtime** (see `docs/DEVELOPER.md` for the full Path layout):
1. `DATATOOLS_TESSERACT_BIN` env var override.
2. Bundled path under `sys._MEIPASS / "tesseract" /` (frozen bundles only).
3. `tesseract` on `PATH` (source / pip developer environments).
4. Windows well-known locations.
## 4. Libraries
| Purpose | Library |
@@ -242,6 +253,15 @@ The GUI uses an in-house, JSON-backed translation layer at `src/i18n/`. **No** `
**Why not gettext**: zero compiled artifacts in the PyInstaller bundle, no build step before tests run, no `.po`/`.mo` round-trip for translators (anyone can edit JSON), and the same lookup works in unit tests without process state. Locked in because the surface won't grow large enough to need the alternative, and the alternative breaks the "drop a file, run pytest, ship" loop.
## 10c. GUI chrome — sidebar nav indicator swap
Streamlit's `st.Page`-driven sidebar renders section headers with a Material Symbols ligature (`expand_more` / `expand_less`). The header element is not a button and carries no `aria-expanded`, so a pure-CSS swap can't follow open/closed state. We replace the glyph with plain typographic `+` / `−` (U+2212) via JS:
- **CSS** (`components/_legacy.py`, `_HIDE_CHROME_CSS`) drops the Material Symbols font on `[data-testid="stIconMaterial"]` inside `[data-testid="stNavSectionHeader"]` so the rewritten character renders as normal text rather than re-resolving as an icon name.
- **JS** (`_SWAP_NAV_SECTION_INDICATOR_JS`) walks each section header, reads the icon's text node, and rewrites `expand_more` → `+` / `expand_less` → `−`. A MutationObserver re-runs the swap when Streamlit re-renders the sidebar (throttled with `requestAnimationFrame` so a burst of mutations triggers one swap).
The script ships through the same component-iframe bundle as the brand injector and upload-button rename inside `hide_streamlit_chrome()` — one iframe per page, three DOM mutations.
## 11. Per-script functional specs
Specs live in this section as scripts enter active build. Each follows the Tier 1/2/3 structure with explicit strategic framing (what's the market gap given some of this is free elsewhere).

@@ -25,16 +25,11 @@ Para usar la misma licencia en otro equipo: desactiva éste (página Activar →
## 1. Instalación
No necesitas tener Python ni permisos de administrador — el paquete trae su propio intérprete y todas las dependencias. Dos formatos por sistema operativo, elige el que tu política de TI permita:
- **Instalador** — crea automáticamente acceso directo en el escritorio + entrada en el menú Inicio / Launchpad. Recomendado para la mayoría.
- **.zip portable** — descomprime y haz doble clic. No toca el registro, se ejecuta desde cualquier lugar (escritorio, USB, recurso de red). Úsalo si no puedes ejecutar instaladores, quieres una instalación de una sola carpeta que puedas copiar entre equipos, o estás evaluando antes de instalar.
Ambos formatos son idénticos por dentro: mismo Python, mismas dependencias, mismo comportamiento de arranque.
No necesitas tener Python ni permisos de administrador — el paquete trae su propio intérprete y todas las dependencias. Cada sistema operativo tiene un único instalador que crea automáticamente el acceso directo en el escritorio + la entrada en el menú Inicio / Launchpad.
### 1.1 Windows
**Opción A — Instalador (`DataTools-<ver>-win-setup.exe`)**
**Instalador (`DataTools-<ver>-win-setup.exe`)**
1. Descarga `DataTools-<ver>-win-setup.exe` desde tu correo de licencia o GitHub Releases.
2. Doble clic en el instalador. La primera vez, Windows SmartScreen mostrará **"Windows protegió tu PC"**; pulsa **Más información** → **Ejecutar de todas formas**. (Este aviso solo aparece una vez por compilación hasta que tengamos un certificado EV de firma de código.)
@@ -44,18 +39,11 @@ Ambos formatos son idénticos por dentro: mismo Python, mismas dependencias, mis
Para anclarlo a la barra de tareas, lanza la app una vez, clic derecho en su icono de la barra de tareas, y **Anclar a la barra de tareas**. Windows requiere este paso manual — ningún instalador puede anclar por programa.
**Opción B — Portable (`DataTools-<ver>-win-portable.zip`)**
1. Descarga `DataTools-<ver>-win-portable.zip`.
2. Clic derecho en el .zip → **Extraer todo…** → elige una carpeta (p. ej. `C:\Tools\DataTools`).
3. Abre la carpeta `DataTools\` extraída, doble clic en `DataTools.exe`. El aviso de SmartScreen aparece solo la primera vez.
4. Para crear tu propio acceso directo en el escritorio: clic derecho en `DataTools.exe` → **Enviar a → Escritorio (crear acceso directo)**.
**Desinstalar** (solo instalador): Configuración → Aplicaciones → DataTools → Desinstalar. Portable: borra la carpeta.
**Desinstalar**: Configuración → Aplicaciones → DataTools → Desinstalar.
### 1.2 macOS
**Opción A — DMG instalador (`DataTools-<ver>-mac.dmg`)**
**DMG instalador (`DataTools-<ver>-mac.dmg`)**
1. Descarga `DataTools-<ver>-mac.dmg`.
2. Doble clic en el .dmg. Se abre una ventana de Finder con el icono **DataTools** y un alias **Aplicaciones**.
@@ -65,12 +53,6 @@ Para anclarlo a la barra de tareas, lanza la app una vez, clic derecho en su ico
Para mantener DataTools en el Dock: lanza la app, clic derecho en su icono del Dock → **Opciones → Mantener en el Dock**. macOS no permite que los instaladores fijen al Dock automáticamente.
**Opción B — Portable (`DataTools-<ver>-mac-portable.zip`)**
1. Descarga `DataTools-<ver>-mac-portable.zip`. Safari descomprime al descargar por defecto; en Finder verás `DataTools.app` directamente.
2. Mueve `DataTools.app` a **Aplicaciones** si quieres que aparezca en Launchpad — o déjalo en el escritorio, un USB o un recurso de red. La .app portable se ejecuta desde cualquier sitio.
3. Doble clic en `DataTools.app`. Clic derecho → **Abrir** la primera vez (misma rutina que con el DMG).
**Desinstalar**: arrastra `DataTools.app` a la Papelera. Tus archivos de datos siguen donde estén — la app no instala nada más.
### 1.3 Linux
@@ -103,7 +85,9 @@ La ventana del lanzador queda abierta en segundo plano. Cerrarla detiene el serv
- Windows 10/11 (64 bits), macOS 11+, Linux moderno (2020+).
- Navegador moderno (Chrome, Edge, Firefox, Safari, últimos 3 años).
- ~400 MB de espacio libre en disco (el paquete ocupa ~200 MB; el resto es espacio de trabajo para CSV grandes).
- ~500 MB de espacio libre en disco (el paquete ocupa ~300 MB; el resto es espacio de trabajo para CSV grandes).
**OCR para PDFs escaneados viene incluido**: Tesseract 5.5 y el modelo en inglés `eng.traineddata` vienen dentro de cada instalador / AppImage. La ruta de extracción de PDFs escaneados del Extractor de PDF funciona sin configuración adicional; no hace falta instalar nada por separado. (Quien ejecute desde un checkout con `pip install -r requirements.txt` sigue necesitando Tesseract del sistema en el `PATH`; ver [DEVELOPER.md §PDF Extractor — bundled Tesseract](DEVELOPER.md#pdf-extractor--bundled-tesseract) (solo en inglés).)
Matriz de soporte completa: [REQUIREMENTS.md](REQUIREMENTS.md) (solo en inglés).
@@ -135,6 +119,10 @@ Matriz de soporte completa: [REQUIREMENTS.md](REQUIREMENTS.md) (solo en inglés)
Las opciones avanzadas se encuentran en paneles desplegables. El archivo original nunca se modifica.
**Ayuda en la herramienta**: cada página tiene un botón **Help** a la derecha del título. Al pulsarlo se abre una ventana emergente con una guía compacta (Cuándo usarla · Pasos · Ejemplos · Consejo). Úsala como recordatorio a media tarea — la ventana se cierra al hacer clic fuera y tus datos no se ven afectados.
**Navegación lateral**: la barra lateral agrupa las herramientas en secciones (Análisis, Limpiadores de datos, Transformaciones, Automatizaciones). Cada cabecera muestra `+` cuando está plegada y `−` cuando está desplegada; pulsa la cabecera para alternar.
### 3.2 CLI
```bash

@@ -25,16 +25,11 @@ To use the same license on a different machine: deactivate this one (Activate pa
## 1. Install
You don't need Python and you don't need admin rights — the bundle ships its own interpreter and every dependency. Two flavors per OS, pick whichever your IT policy allows:
- **Installer** — wires up Desktop shortcut + Start Menu / Launchpad entry automatically. Recommended for most users.
- **Portable .zip** — unzip and double-click. No registry writes, runs from anywhere (Desktop, USB stick, network share). Use this if you can't run installers, want a single-folder install you can copy between machines, or are evaluating before committing to install.
Both flavors are byte-identical inside: same Python, same dependencies, same launch behavior.
You don't need Python and you don't need admin rights — the bundle ships its own interpreter and every dependency. Each OS gets a single installer that wires up the Desktop shortcut + Start Menu / Launchpad entry automatically.
### 1.1 Windows
**Option A — Installer (`DataTools-<ver>-win-setup.exe`)**
**Installer (`DataTools-<ver>-win-setup.exe`)**
1. Download `DataTools-<ver>-win-setup.exe` from your release email or GitHub Releases.
2. Double-click the installer. On the first run Windows SmartScreen will say **"Windows protected your PC"**; click **More info** → **Run anyway**. (This warning only appears once per build until we have an EV code-signing cert.)
@@ -44,18 +39,11 @@ Both flavors are byte-identical inside: same Python, same dependencies, same lau
To pin to the taskbar, launch the app once, right-click its icon in the taskbar, then **Pin to taskbar**. Windows requires this manual step — no installer is allowed to pin programmatically.
**Option B — Portable (`DataTools-<ver>-win-portable.zip`)**
1. Download `DataTools-<ver>-win-portable.zip`.
2. Right-click the .zip → **Extract All…** → pick a folder (e.g. `C:\Tools\DataTools`).
3. Open the extracted `DataTools\` folder, double-click `DataTools.exe`. SmartScreen warning fires the first time only.
4. To create your own desktop shortcut later: right-click `DataTools.exe` → **Send to → Desktop (create shortcut)**.
**Uninstall** (installer only): Settings → Apps → DataTools → Uninstall. Portable: delete the folder.
**Uninstall**: Settings → Apps → DataTools → Uninstall.
### 1.2 macOS
**Option A — Installer DMG (`DataTools-<ver>-mac.dmg`)**
**Installer DMG (`DataTools-<ver>-mac.dmg`)**
1. Download `DataTools-<ver>-mac.dmg`.
2. Double-click the .dmg. A Finder window opens showing the **DataTools** icon and an **Applications** alias.
@@ -65,12 +53,6 @@ To pin to the taskbar, launch the app once, right-click its icon in the taskbar,
To keep DataTools in the Dock: launch the app, right-click its Dock icon → **Options → Keep in Dock**. macOS doesn't allow installers to pin to the Dock automatically.
**Option B — Portable (`DataTools-<ver>-mac-portable.zip`)**
1. Download `DataTools-<ver>-mac-portable.zip`. Safari auto-unzips on download; in Finder you'll see `DataTools.app` directly.
2. Move `DataTools.app` to **Applications** if you want it discoverable via Launchpad — or keep it on your Desktop, a USB stick, or a network share. The portable .app runs from anywhere.
3. Double-click `DataTools.app`. Right-click → **Open** the first time (same unsigned-build dance as the DMG).
**Uninstall**: drag `DataTools.app` to the Trash. Your data files stay where you put them — nothing else is installed.
### 1.3 Linux
@@ -103,7 +85,9 @@ The launcher window stays open in the background. Closing it stops the server
- Windows 10/11 (64-bit), macOS 11+, modern Linux (2020+).
- Modern browser (Chrome, Edge, Firefox, Safari, last 3 years).
- ~400 MB free disk space (the bundle itself is ~200 MB; the rest is working scratch space for large CSVs).
- ~500 MB free disk space (the bundle itself is ~300 MB; the rest is working scratch space for large CSVs).
**OCR for scanned PDFs is bundled**: Tesseract 5.5 + the English `eng.traineddata` model ship inside every installer / AppImage. The PDF Extractor's scanned-statement path works out of the box; no separate install required. (Developers running from a `pip install -r requirements.txt` checkout still need system Tesseract on `PATH`; see [DEVELOPER.md §PDF Extractor — bundled Tesseract](DEVELOPER.md#pdf-extractor--bundled-tesseract).)
Full numbered support matrix: [REQUIREMENTS.md](REQUIREMENTS.md).
@@ -135,6 +119,10 @@ Full numbered support matrix: [REQUIREMENTS.md](REQUIREMENTS.md).
Advanced options are tucked in expander panes. The original file is never modified.
**In-tool Help**: every tool page has a **Help** button right of the title. Click it to open a popover with a compact how-to (When to use · Steps · Examples · Tip). Use it as a refresher mid-task — the popover closes when you click outside, your inputs are untouched.
**Sidebar nav**: the sidebar groups tools into sections (Analysis, Data Cleaners, Transformations, Automations). Each section header shows `+` when collapsed and `−` when expanded; click the header to toggle.
### 3.2 CLI
```bash

@@ -9,9 +9,9 @@ Cloudflare Pages.
```
landing/
├── _shared/styles.css shared CSS (system fonts, no externals)
├── shopify-pet/index.html Shopify operator (priority: pet supplies)
├── bookkeeper/index.html bookkeeper / freelance accountant
├── revops/index.html marketing / RevOps agency
├── bookkeeper/index.html bookkeeper — bank reconciliation
├── ap-1099/index.html accounts payable — 1099 vendor prep
├── ar-aging/index.html accounts receivable — open invoices
└── README.md this file
```
@@ -19,8 +19,8 @@ Each page:
- Inherits `landing/_shared/styles.css`
- Overrides the `--accent` colour variable in an inline `<style>` block
so each persona has its own visual identity (Shopify = mint green,
Bookkeeper = steel blue, RevOps = vivid violet)
so each persona has its own visual identity (Bookkeeper = steel blue,
AP / 1099 = amber/gold, AR = receivables green)
- Has a sticky buy bar with the Gumroad CTA tagged with `?from=<persona>`
- Embeds the live demo (Streamlit) via `<iframe>` with a sandbox attribute
- Carries persona-specific H1, sub-copy, use cases, FAQ, and a
@@ -64,13 +64,13 @@ wrangler pages deploy landing/dist
```
Configure the custom apex domain (`datatools.app`) in the Cloudflare
Pages project settings; sub-paths `/shopify-pet/`, `/bookkeeper/`,
`/revops/` are served automatically because the directory layout
Pages project settings; sub-paths `/bookkeeper/`, `/ap-1099/`,
`/ar-aging/` are served automatically because the directory layout
mirrors them. Cache rule defaults are fine (HTML 1 day, CSS 7 days).
If you want **separate Pages projects** per persona for independent
A/B testing, point three projects at the same `landing/dist/` and
configure each with its own sub-domain (`shopify.datatools.app`, etc.)
configure each with its own sub-domain (`bookkeeper.datatools.app`, etc.)
and a Pages rule that rewrites the root to that persona's
sub-directory.
@@ -110,7 +110,7 @@ Refresh the page when:
| `page_view → run_completed < 30%` for 4 weeks | The demo iframe isn't loading or visitors aren't engaging. Check the iframe URL. Move the demo above the fold if it's currently below. |
| New tool ships (06–09) | Add it to the persona's saved pipeline only if it fits; don't bloat the demo with every tool. |
| Pricing change | Update `<meta>` schema, the buybar `.price-tag`, the pricing card, and the FAQ. Search-and-replace `$49` across the file. |
| New persona added (4th, 5th) | Copy `shopify-pet/index.html`, replace persona-specific copy, add to the `footer` cross-link block on the existing pages. |
| New persona added (4th, 5th) | Copy `bookkeeper/index.html`, replace persona-specific copy, add to the `footer` cross-link block on the existing pages. |
## Why static HTML

@@ -5,7 +5,7 @@
* with zero build step, no privacy banner needed).
* • Mobile-first; layout reflows below 720 px.
* • Dark, focused, content-first. Buyer reads this on a laptop
* between Shopify exports — keep it readable and skimmable.
* between messy accounting exports — keep it readable and skimmable.
* • Persona pages all share this sheet — niche differences live in
* copy + accent-color variables overridden in each page's <style>.
*/
@@ -18,7 +18,7 @@
--text-mute: #9aa3b2;
--text-soft: #c8ced8;
--rule: #252a36;
--accent: #6ee7b7; /* Shopify pet default — overridden per persona */
--accent: #6ee7b7; /* default accent — overridden per persona */
--accent-ink: #052e1a;
--warn: #fbbf24;
--max: 1080px;

391
landing/ap-1099/index.html Normal file
@@ -0,0 +1,391 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>DataTools for 1099 Prep — Clean Your Vendor Master & Recover Missing EINs Locally · $49</title>
<meta name="description" content="Build a clean 1099 vendor list — locally. Consolidates duplicate vendor rows, backfills scattered EINs, and flags the genuinely missing ones. 24 messy records → 8 complete vendors, 7 EINs recovered. Your data never leaves your computer. $49 one-time." />
<meta name="keywords" content="1099 vendor list, missing EIN, accounts payable cleanup, vendor master dedupe, 1099-NEC prep, QuickBooks vendor export, deduplicate vendors" />
<link rel="canonical" href="https://datatools.app/ap-1099/" />
<link rel="stylesheet" href="../_shared/styles.css" />
<!-- Persona accent: Accounts Payable / 1099 → amber/gold invoice tone -->
<style>
:root { --accent: #d97706; --accent-ink: #2a1604; }
</style>
<!-- Open Graph -->
<meta property="og:title" content="DataTools for 1099 Prep — Clean Your Vendor Master & Recover Missing EINs Locally" />
<meta property="og:description" content="Consolidate duplicate vendors, backfill scattered EINs, file 1099-NECs on time. Local. No upload. $49 one-time." />
<meta property="og:type" content="product" />
<meta property="og:url" content="https://datatools.app/ap-1099/" />
<!-- Schema.org Product -->
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "SoftwareApplication",
"name": "DataTools for 1099 Prep",
"operatingSystem": "Windows, macOS, Linux",
"applicationCategory": "BusinessApplication",
"offers": {
"@type": "Offer",
"price": "49",
"priceCurrency": "USD"
},
"description": "Clean your accounts-payable vendor master locally for 1099-NEC season. Six-tool data-cleaning bundle: dedupe-merge to consolidate duplicate vendor rows and backfill missing EINs, text-clean, format-standardize, missing-value handle, column-map, pipeline.",
"softwareVersion": "1.0"
}
</script>
</head>
<body>
<!-- ============= Sticky buy bar ============= -->
<div class="buybar">
<div class="buybar-inner">
<div class="brand"><span class="brand-mark"></span> DataTools <span class="muted">/ for 1099 prep</span></div>
<div>
<span class="price-tag">$49 — one-time, no subscription</span>
<a class="btn" href="https://gumroad.com/l/datatools?from=ap-1099" rel="noopener">Get DataTools →</a>
</div>
</div>
</div>
<!-- ============= Hero ============= -->
<section class="hero">
<div class="container">
<div class="eyebrow">For accounts payable · 1099-NEC season · vendor master cleanup</div>
<h1>Build a clean 1099 vendor list —<br /><strong>with the missing EINs filled in.</strong></h1>
<p class="lead">
The same vendor got entered three times across the year — one row has
the EIN, another the address, another the phone — and now it's January
and you can't file because the numbers are scattered. DataTools
consolidates each vendor to one row and backfills the gaps from the
duplicates: in our sample, <strong>24 messy records become 8 complete
vendors with 7 missing EINs recovered</strong> from duplicate rows.
<strong>Your data never leaves your computer.</strong>
</p>
<div class="cta-row">
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=ap-1099" rel="noopener">Get DataTools for Accounting — $49 →</a>
<a class="btn btn-ghost btn-large" href="#demo">Try the live demo ↓</a>
<span class="price-note">One-time payment · cross-platform · runs offline</span>
</div>
<div class="stats">
<div class="stat"><div class="num">24→8</div><div class="label">messy records to complete vendors</div></div>
<div class="stat"><div class="num">7</div><div class="label">missing EINs recovered</div></div>
<div class="stat"><div class="num">0</div><div class="label">cloud uploads ever</div></div>
</div>
</div>
</section>
<!-- ============= Pain points ============= -->
<section>
<div class="container">
<div class="eyebrow">If any of these sound like your January</div>
<h2>Six pains DataTools fixes in one pass</h2>
<div class="grid">
<div class="card">
<span class="icon">🧾</span>
<h3>The same vendor is in the list two or three times</h3>
<p>Different staff entered "Acme LLC", "Acme, L.L.C.", and "ACME Llc" across the year. Each is a separate row in the vendor master, and each only holds part of the story — so your 1099 totals split across three near-duplicate spellings.</p>
<p class="muted"><strong>What it costs:</strong> hours of manual matching, plus the risk of filing the wrong total.</p>
</div>
<div class="card">
<span class="icon">🔢</span>
<h3>The EIN is on a different row than the rest of the details</h3>
<p>One record captured the EIN at onboarding; the row you actually paid against doesn't have it. At 1099 time the field is blank even though you collected it months ago — it's just sitting on a duplicate.</p>
<p class="muted"><strong>What it costs:</strong> chasing W-9s you already have on file.</p>
</div>
<div class="card">
<span class="icon">📵</span>
<h3>Phones, addresses, and amounts are formatted five different ways</h3>
<p>Remittance phone as <code>(212) 555-0147</code> on one row and <code>212.555.0147</code> on another. Amounts with stray <code>$</code> and commas. The export won't reconcile and the 1099-NEC box totals don't tie out.</p>
<p class="muted"><strong>What it costs:</strong> a half-day reconciling before you can even start filing.</p>
</div>
<div class="card">
<span class="icon"></span>
<h3>You don't know which EINs are genuinely missing</h3>
<p>Some EINs are recoverable from a duplicate row. Some you never collected. Until the list is consolidated you can't tell the two apart — so you either over-chase vendors or under-file.</p>
<p class="muted"><strong>What it costs:</strong> late filings and TIN-mismatch penalties.</p>
</div>
<div class="card">
<span class="icon">📤</span>
<h3>Your QuickBooks vendor export doesn't match your AP ledger</h3>
<p>The vendor master in QuickBooks, the payments spreadsheet, and the W-9 tracker each use different column names for "vendor name" / "Tax ID" / "amount paid." Merging them is an afternoon of manual renaming before any analysis begins.</p>
<p class="muted"><strong>What it costs:</strong> an afternoon every filing season spent renaming columns before the exports can be merged.</p>
</div>
<div class="card">
<span class="icon">🔒</span>
<h3>Cloud cleaners want you to upload your vendor master</h3>
<p>Your vendor master holds EINs, remittance addresses, and payment history — exactly the data you should not be uploading to a SaaS to clean. DataTools is desktop-only — your vendor list never leaves your computer.</p>
<p class="muted"><strong>What it costs:</strong> nothing — and that's the point.</p>
</div>
</div>
</div>
</section>
<!-- ============= Live demo ============= -->
<section id="demo">
<div class="container">
<div class="eyebrow">Live demo · runs in your browser</div>
<h2>Try it on a real-looking vendor master export</h2>
<p>
The demo below loads a sample 24-row vendor file with the pollution
we've seen in real AP systems: the same vendor entered two or three
times under slightly different spellings, EINs that live on one
duplicate row but not the one you paid against, phones and amounts
formatted five ways, and the usual mess of
<code>N/A</code> / <code>(blank)</code> / <code>?</code> sentinels.
Click <strong>Run pipeline</strong> and watch the 24 records collapse
to <strong>8 complete vendors with 7 EINs recovered</strong> in under
a second.
</p>
<div class="demo-frame">
<iframe
src="https://demo.datatools.app/?p=ap-1099"
loading="lazy"
title="DataTools live demo — accounts payable / 1099 vendor cleanup"
sandbox="allow-scripts allow-same-origin allow-downloads allow-forms"></iframe>
<div class="demo-caption">
Demo runs on free hosting (Streamlit Community Cloud). Capped at
100 input rows · output watermarked with one trailing row. The
paid product has no caps and runs entirely offline.
</div>
</div>
</div>
</section>
<!-- ============= Built for AP / 1099 ============= -->
<section>
<div class="container">
<div class="eyebrow">Built for the accounts-payable team</div>
<h2>Six workflows you do every filing season</h2>
<div class="grid">
<div class="card">
<span class="icon">🧹</span>
<h3>Vendor-master consolidation</h3>
<p>Catches the same vendor that shows up as <code>Acme LLC</code>, <code>Acme, L.L.C.</code>, and <code>ACME Llc</code>. Fuzzy matching groups the spellings; the dedup merge collapses each group to one row and backfills the gaps from each duplicate.</p>
</div>
<div class="card">
<span class="icon">🔢</span>
<h3>EIN backfill &amp; missing-EIN flagging</h3>
<p>Pulls the EIN off whichever duplicate row captured it and fills it into the survivor. The EINs that are <em>genuinely</em> missing get flagged so you know exactly which W-9s to chase.</p>
</div>
<div class="card">
<span class="icon">💵</span>
<h3>1099-NEC amount roll-up</h3>
<p>Before filing: standardize amounts, drop sentinels-as-missing, and merge so each vendor's total paid lands on one row and ties to your AP ledger.</p>
</div>
<div class="card">
<span class="icon">📥</span>
<h3>QuickBooks vendor export cleanup</h3>
<p>Whitespace in Tax IDs, near-identical vendor names, copy-paste smart quotes in remittance addresses — gone. Audit log shows every change for your reviewer.</p>
</div>
<div class="card">
<span class="icon">🔗</span>
<h3>Merging the W-9 tracker into the AP ledger</h3>
<p>The vendor master, the payments spreadsheet, and the W-9 tracker each name "Tax ID" differently. Map Columns aligns them; the dedup merge consolidates across all three sources.</p>
</div>
<div class="card">
<span class="icon">⚙️</span>
<h3>Repeatable pipeline</h3>
<p>Save the cleanup as a JSON file. Drop next year's vendor export on it. Same consolidation, zero re-configuration. Automatable via the CLI.</p>
</div>
</div>
</div>
</section>
<!-- ============= Privacy moat ============= -->
<section>
<div class="container">
<div class="eyebrow">The thing every cloud cleaner can't say</div>
<h2>Your vendor master never leaves your computer.</h2>
<p>
DataTools is a desktop app. There's no upload step, no SaaS account,
no subscription, no "trust our security policy." The first thing you
can do after install is open your browser's network tab, run the
cleaner on your real vendor file, and verify zero outbound
requests.
</p>
<div class="callout">
<strong>Why it matters for AP:</strong> your vendor master holds EINs,
remittance addresses, and payment history. Cloud cleaners require you
to upload it. We don't.
</div>
<div class="terminal"><span class="prompt">$</span> python -m src.cli_pipeline vendor_1099.csv --pipeline vendor_1099_pipeline.json --apply
Reading vendor_1099.csv...
24 rows, 9 columns
Executing pipeline:
<span class="ok"></span> text_clean (38 ms) {cells_changed: 41}
<span class="ok"></span> format_standardize (62 ms) {cells_changed: 36} # phones, EINs, amounts
<span class="ok"></span> missing (11 ms) {sentinels_standardized: 9}
<span class="ok"></span> dedup (140 ms) {groups_merged: 8, rows_removed: 16, eins_backfilled: 7}
Initial rows: 24 → Final rows: 8 (8 complete vendors)
EINs recovered from duplicate rows: 7 | Still missing (flagged): 1
Unparseable cells: 0
Total elapsed: 0.25 s
<span class="prompt">$</span> # zero network calls. zero. promise.</div>
</div>
</section>
<!-- ============= Audit moat ============= -->
<section>
<div class="container">
<div class="eyebrow">For when your reviewer asks "what changed?"</div>
<h2>Every change auditable. Every cell logged.</h2>
<p>
Every modification is recorded with the original value, the new
value, and which rule fired. Hand the audit CSV to your controller or
reviewer, or file it with your IRS-ready workpapers, along with the cleaned
vendor list. No <em>"I trust the AI"</em> hand-waving — they see
exactly which EIN came from which duplicate row.
</p>
<div class="callout">
<strong>Real example:</strong> the demo above merged 24 records into
8 vendors and backfilled 7 EINs. The dedup audit lists every vendor
group with the survivor, its merged-in duplicates, and the source row
each recovered EIN was pulled from. The standardize audit lists every
phone, amount, and Tax ID it reformatted.
</div>
</div>
</section>
<!-- ============= Format handling ============= -->
<section>
<div class="container">
<div class="eyebrow">If your vendors are messy — most AP files are</div>
<h2>EINs, phones, addresses, and amounts in every shape.</h2>
<p>
One row has the EIN as <code>12-3456789</code>, another as
<code>123456789</code>. The remittance phone is <code>(212)
555-0147</code> on one and <code>212.555.0147</code> on the next.
An amount reads <code>$12,410.75</code> with a stray space. Excel
treats half of these as text errors. DataTools normalizes every one —
EINs to a single format, phones to E.164, amounts to clean numerics —
so the file reconciles and the 1099 box totals tie out.
</p>
<ul class="bullets">
<li><strong>EIN / Tax-ID normalization</strong> to one consistent <code>NN-NNNNNNN</code> shape, with genuinely-missing ones flagged.</li>
<li><strong>Phone standardization</strong> to E.164 via Google's libphonenumber.</li>
<li><strong>Amount parsing</strong> for <code>$</code> / commas / stray spaces — including amounts Excel mis-types as text.</li>
<li><strong>Address shape detection</strong> for US remittance addresses.</li>
</ul>
</div>
</section>
<!-- ============= What you get ============= -->
<section>
<div class="container">
<div class="eyebrow">In the bundle</div>
<h2>Six tools. One pipeline. One $49 download.</h2>
<div class="grid">
<div class="card"><h3>1 · Find Duplicates</h3><p>Fuzzy match (Jaro-Winkler), 5 normalizers, survivor rules, gap-backfill merge, interactive review.</p></div>
<div class="card"><h3>2 · Clean Text</h3><p>Whitespace, smart chars, NBSP, BOM, line endings, case ops.</p></div>
<div class="card"><h3>3 · Standardize Formats</h3><p>EINs, amounts, dates, phones, emails, addresses, names, booleans.</p></div>
<div class="card"><h3>4 · Fix Missing Values</h3><p>Disguised-null detection, profile, flag genuinely-missing fields, drop strategies.</p></div>
<div class="card"><h3>5 · Map Columns</h3><p>Fuzzy auto-rename, target schema, type coercion, required-field defaults.</p></div>
<div class="card"><h3>6 · Automated Workflows</h3><p>Chain tools in recommended order, save/load JSON, automate next year's vendor cleanup.</p></div>
</div>
</div>
</section>
<!-- ============= Pricing ============= -->
<section>
<div class="container">
<div class="eyebrow">Pricing — pay once, own it</div>
<h2>$49. No subscription. No ceiling on rows or files.</h2>
<div class="pricing">
<div class="card featured">
<div class="row"><div class="price">$49</div><div class="price-suffix">one-time</div></div>
<h3>DataTools for 1099 Prep</h3>
<ul>
<li>All 6 tools, full pipeline</li>
<li>Mac · Windows · Linux installers</li>
<li>Code-signed (no Gatekeeper warnings)</li>
<li>Free updates for the v1.x line</li>
<li>Bonus: ready-made vendor-master &amp; 1099 pipelines</li>
</ul>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=ap-1099" rel="noopener">Buy on Gumroad →</a>
</div>
<div class="card">
<div class="row"><div class="price">$149</div><div class="price-suffix">one-time</div></div>
<h3>Full DataTools Suite</h3>
<p class="muted">Available when 3+ bundles ship. Includes everything in the 1099-prep pack plus the Bookkeeper and Accounts-Receivable bundles. Save $48.</p>
<a class="btn btn-ghost btn-large" href="#" aria-disabled="true">Coming when ready</a>
</div>
</div>
</div>
</section>
<!-- ============= FAQ ============= -->
<section>
<div class="container">
<h2>Questions</h2>
<details class="faq">
<summary>Does this work with my QuickBooks vendor export?</summary>
<p>Yes — the input is just CSV / Excel from any source. Your QuickBooks vendor export works the same as a Xero export, a Bill.com download, or a vendor spreadsheet you maintain by hand. The cleaner doesn't care where the file came from.</p>
</details>
<details class="faq">
<summary>How does this compare to Excel's "Remove Duplicates"?</summary>
<p>Excel does <em>exact</em> deduplication and only deletes — it never backfills. <code>Acme LLC</code> and <code>Acme, L.L.C.</code> are different vendors to Excel, and even when it does catch a duplicate it throws the extra row away, taking the EIN with it. DataTools fuzzy-matches across spelling drift, merges the group to one survivor, and pulls the missing EIN, phone, and address off the rows it merges in.</p>
</details>
<details class="faq">
<summary>How does it recover a missing EIN?</summary>
<p>When it merges a group of duplicate vendor rows, it keeps the survivor and backfills any empty field — including the EIN — from whichever duplicate row had it. In the sample file, 7 of the 8 vendors had their EIN recovered this way; the 1 that's truly missing gets flagged so you know to chase the W-9.</p>
</details>
<details class="faq">
<summary>Do I need to know Python to use it?</summary>
<p>No. The GUI is a browser interface that opens automatically when you double-click the app. It loads your vendor file, you click Run, you download the cleaned list. The CLI is there for power users who want to script next year's cleanup.</p>
</details>
<details class="faq">
<summary>What about my data privacy?</summary>
<p>Your vendor master — EINs, remittance addresses, payment history — never leaves your computer. There is no cloud component, no telemetry, no "anonymous usage stats." When the app is running you can confirm zero outbound network requests in your browser's developer tools.</p>
</details>
<details class="faq">
<summary>What's your refund policy?</summary>
<p>Try the live demo above on the sample vendor dataset before you buy. If you find within 14 days of purchase that DataTools doesn't fit your workflow, email for a refund, no questions asked.</p>
</details>
<details class="faq">
<summary>Will there be updates?</summary>
<p>Yes. The v1.x line is included free for everyone who buys DataTools today. We ship a patch every 30 days adding format support, edge-case fixes, and small features.</p>
</details>
</div>
</section>
<!-- ============= Final CTA ============= -->
<section>
<div class="container" style="text-align: center;">
<h2>Stop chasing scattered EINs by hand.</h2>
<p class="lead" style="margin: 0 auto 28px;">One $49 download. Mac, Windows, or Linux. Runs offline. Consolidates 24 messy records into 8 complete vendors, recovers the 7 EINs hiding on duplicate rows, flags the ones genuinely missing, and saves a pipeline you can re-run on next year's vendor export.</p>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=ap-1099" rel="noopener">Get DataTools for Accounting — $49 →</a>
</div>
</section>
<!-- ============= Footer ============= -->
<footer>
<div class="container">
<div>
<p><strong>DataTools</strong> — local data-cleaning for accounts payable, bookkeepers, and accounts-receivable teams.</p>
<p class="muted">© 2026 · Built solo · Shipped from a small office.</p>
</div>
<div>
<p>
<a href="../bookkeeper/">For bookkeepers</a> ·
<a href="../ar-aging/">For accounts receivable</a><br />
<a href="https://gumroad.com/l/datatools?from=ap-1099">Buy on Gumroad</a> ·
<a href="mailto:hello@datatools.app">Email support</a>
</p>
</div>
</div>
</footer>
</body>
</html>
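
The consolidation this page describes (group near-duplicate vendor spellings, keep one survivor, backfill its blank EIN from a duplicate, roll the amounts up) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DataTools' implementation: `vendor_key` is a hypothetical stand-in that matches names by stripping case and punctuation rather than by Jaro-Winkler similarity, the column names `vendor`, `ein`, and `paid` are assumed, and "first row seen wins" is an assumed survivor rule.

```python
import re
from decimal import Decimal, InvalidOperation

# Placeholder strings treated as missing (the sentinels named on the page).
SENTINELS = {"", "n/a", "(blank)", "?"}


def is_missing(raw):
    """True for None or any sentinel string, ignoring case and padding."""
    return raw is None or raw.strip().lower() in SENTINELS


def normalize_ein(raw):
    """Return the EIN as NN-NNNNNNN, or None when missing or not nine digits."""
    if is_missing(raw):
        return None
    digits = re.sub(r"\D", "", raw)
    return f"{digits[:2]}-{digits[2:]}" if len(digits) == 9 else None


def parse_amount(raw):
    """Strip '$', commas, and whitespace; return a Decimal, or None if unparseable."""
    if is_missing(raw):
        return None
    try:
        return Decimal(re.sub(r"[$,\s]", "", raw))
    except InvalidOperation:
        return None


def vendor_key(name):
    """Hypothetical match key: lowercase, punctuation removed, spaces collapsed."""
    return " ".join(re.sub(r"[^a-z0-9 ]", "", name.lower()).split())


def merge_vendors(rows):
    """Collapse rows sharing a vendor_key; the first row seen is the survivor."""
    survivors = {}
    for row in rows:
        key = vendor_key(row["vendor"])
        ein = normalize_ein(row.get("ein"))
        paid = parse_amount(row.get("paid")) or Decimal("0")
        if key not in survivors:
            survivors[key] = {"vendor": row["vendor"].strip(), "ein": ein, "paid": paid}
            continue
        kept = survivors[key]
        kept["ein"] = kept["ein"] or ein  # backfill only when the survivor's EIN is blank
        kept["paid"] += paid              # roll payments up onto the survivor
    return list(survivors.values())


rows = [
    {"vendor": "Acme LLC", "ein": "", "paid": "$1,200.00"},
    {"vendor": "Acme, L.L.C.", "ein": "123456789", "paid": "300"},
    {"vendor": "ACME Llc", "ein": "N/A", "paid": " $12,410.75"},
]
merged = merge_vendors(rows)
print(merged)
```

The three spellings share one key, so the sketch returns a single row whose EIN comes from the second record. A real fuzzy matcher would also catch spellings that differ by more than punctuation and case, which this exact-key approach does not.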

landing/ar-aging/index.html

@@ -0,0 +1,358 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>DataTools for Accounts Receivable — Kill Duplicate Invoices Inflating Your AR Aging Report · $49</title>
<meta name="description" content="One tool to clean your open-invoices export: standardize invoice dates, due dates, and amounts, lowercase client emails, then remove double-entered invoice numbers so your AR aging report is accurate. 26 rows → 21, five duplicate invoices removed. Fully offline. $49 one-time." />
<meta name="keywords" content="accounts receivable aging, duplicate invoices, AR cleanup, open invoices export, invoice dedupe, aging report accuracy, receivables csv tool" />
<link rel="canonical" href="https://datatools.app/ar-aging/" />
<link rel="stylesheet" href="../_shared/styles.css" />
<!-- Persona accent: Accounts Receivable → receivables green -->
<style>
:root {
--accent: #059669;
--accent-ink: #03241a;
}
</style>
<meta property="og:title" content="DataTools for Accounts Receivable — Kill Duplicate Invoices Inflating Your AR Aging Report" />
<meta property="og:description" content="Standardize invoice dates, due dates, and amounts, lowercase client emails, then dedupe double-entered invoices — one tool, no upload. $49 one-time." />
<meta property="og:type" content="product" />
<meta property="og:url" content="https://datatools.app/ar-aging/" />
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "SoftwareApplication",
"name": "DataTools for Accounts Receivable",
"operatingSystem": "Windows, macOS, Linux",
"applicationCategory": "BusinessApplication",
"offers": {
"@type": "Offer",
"price": "49",
"priceCurrency": "USD"
},
"description": "Clean and dedupe your open-invoices export so the AR aging report is accurate. Standardize invoice dates, due dates, and amounts, lowercase client emails, then remove double-entered invoice numbers — backfilling a blank status from its twin row. Six-tool data-cleaning bundle for accounts receivable and accounting teams.",
"softwareVersion": "1.0"
}
</script>
</head>
<body>
<div class="buybar">
<div class="buybar-inner">
<div class="brand"><span class="brand-mark"></span> DataTools <span class="muted">/ for Accounts Receivable</span></div>
<div>
<span class="price-tag">$49 — one-time, no subscription</span>
<a class="btn" href="https://gumroad.com/l/datatools?from=ar-aging" rel="noopener">Get DataTools →</a>
</div>
</div>
</div>
<section class="hero">
<div class="container">
<div class="eyebrow">For accounts receivable · controllers · collections · accounting teams</div>
<h1>Stop chasing the invoices<br /><strong>your aging report counted twice.</strong></h1>
<p class="lead">
The same invoice number gets posted twice — once as
<code>3/04/2026</code> for <code>$1,250.00</code>, again as
<code>2026-03-04</code> for <code>1250</code> — so your AR aging
report double-counts the receivable and your team chases a balance
that was never really open. DataTools standardizes every invoice
date, due date, and amount, lowercases client emails, then removes
the double-entered invoice numbers — taking a real open-invoices
export from <strong>26 rows to 21, five duplicate invoices
removed</strong> — all on your own machine, with nothing uploaded.
</p>
<div class="cta-row">
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=ar-aging" rel="noopener">Get DataTools for Accounting — $49 →</a>
<a class="btn btn-ghost btn-large" href="#demo">Try the live demo ↓</a>
<span class="price-note">One-time payment · cross-platform · runs offline</span>
</div>
<div class="stats">
<div class="stat"><div class="num">26→21</div><div class="label">rows after dedupe</div></div>
<div class="stat"><div class="num">5</div><div class="label">duplicate invoices removed</div></div>
<div class="stat"><div class="num">0</div><div class="label">cloud uploads ever</div></div>
</div>
</div>
</section>
<!-- ============= Pain points ============= -->
<section>
<div class="container">
<div class="eyebrow">If your last aging report didn't tie out to cash</div>
<h2>Six pains DataTools fixes before you run the aging report</h2>
<div class="grid">
<div class="card">
<span class="icon">💸</span>
<h3>Double-entered invoices inflate every aging bucket</h3>
<p>The same invoice number posted twice — once in <code>MM/DD/YYYY</code>, once in ISO — lands in two rows and gets counted twice. Your 60-day bucket looks worse than it is, and the receivables total overstates what's actually owed.</p>
<p class="muted"><strong>What it costs:</strong> overstated AR, a balance sheet that won't reconcile, and a controller asking why.</p>
</div>
<div class="card">
<span class="icon">📞</span>
<h3>Collections chases invoices that were already paid or never real</h3>
<p>When a duplicate invoice number shows as still-open, a collector emails the client about a balance that doesn't exist. The client pushes back, trust erodes, and your team burns a morning untangling it.</p>
<p class="muted"><strong>What it costs:</strong> wasted collections hours + an awkward "please disregard" to the client.</p>
</div>
<div class="card">
<span class="icon">⚖️</span>
<h3>Uploading the AR ledger to a cloud cleaner is a compliance headache</h3>
<p>Every cloud-based cleaner wants you to upload your full receivables ledger — client names, amounts, contact emails. That's a data-handling review your firm doesn't want to run. DataTools is desktop-only — no upload, no DPA, no review.</p>
<p class="muted"><strong>What it costs:</strong> weeks of review per tool, or just not cleaning the data at all.</p>
</div>
<div class="card">
<span class="icon">🗓️</span>
<h3>Mixed date formats make due dates and aging unreliable</h3>
<p>Invoice dates arrive as <code>3/4/26</code>, <code>2026-03-04</code>, and <code>Mar 4 2026</code>; due dates are just as mixed. Sort by date and the buckets are wrong, so the wrong invoices show up in the wrong aging column.</p>
<p class="muted"><strong>What it costs:</strong> 13 hours per close reconciling dates by hand, every period.</p>
</div>
<div class="card">
<span class="icon">📧</span>
<h3>Messy client contacts break your remittance reminders</h3>
<p>Client names come in mixed casing and emails arrive as <code>Billing@ClientCo.com</code> in one row and <code>billing@clientco.com</code> in another — so the same client looks like two, and reminders go out twice or not at all.</p>
<p class="muted"><strong>What it costs:</strong> duplicate dunning, missed reminders, and a client list that won't group.</p>
</div>
<div class="card">
<span class="icon"></span>
<h3>Blank invoice statuses hide whether a receivable is really open</h3>
<p>When one of the two twin rows has a blank status, you can't tell if the invoice is open, partial, or paid — so it either gets dropped from the aging report or counted at the wrong stage.</p>
<p class="muted"><strong>What it costs:</strong> misclassified receivables and an aging report you can't trust.</p>
</div>
</div>
</div>
</section>
<section id="demo">
<div class="container">
<div class="eyebrow">Live demo · runs in your browser</div>
<h2>Try it on a real-looking open-invoices export</h2>
<p>
The demo below loads a 26-row open-invoices export with five
double-entered invoice numbers — the same invoice posted twice in
different date and amount formats (<code>3/04/2026</code> vs
<code>2026-03-04</code>, <code>$1,250.00</code> vs <code>1250</code>),
client emails in mixed case, and one blank invoice status. Click
<strong>Run pipeline</strong> and watch the 5-step pipeline (text
clean → format → missing → column map → dedup) standardize both date
columns to ISO, coerce amounts to numbers, lowercase the emails, and
collapse 26 rows to 21 — backfilling the blank status from its twin
row so the aging report is accurate.
</p>
<div class="demo-frame">
<iframe
src="https://demo.datatools.app/?p=ar-aging"
loading="lazy"
title="DataTools live demo — Accounts Receivable"
sandbox="allow-scripts allow-same-origin allow-downloads allow-forms"></iframe>
<div class="demo-caption">
Demo runs on free hosting. Capped at 100 input rows · output
watermarked. The paid product has no caps and runs entirely offline.
</div>
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">Built for the receivables close</div>
<h2>Three workflows you do every period</h2>
<div class="grid">
<div class="card">
<span class="icon">🪢</span>
<h3>Dedupe double-entered invoices</h3>
<p>Match on invoice number, drop the second posting, and keep one canonical row per invoice — backfilling a blank status, due date, or amount from its twin so nothing accurate is lost when the duplicate goes.</p>
</div>
<div class="card">
<span class="icon">🗓️</span>
<h3>Standardize invoice and due dates</h3>
<p>Coerce every invoice date and due date to ISO and every amount to a clean number, so the aging buckets sort correctly and the receivables total ties out to the ledger.</p>
</div>
<div class="card">
<span class="icon">📧</span>
<h3>Normalize client contacts for remittance</h3>
<p>Lowercase client emails and fix name casing so each client groups as one. Send remit-to reminders once, to a clean contact list — not twice because two rows looked like two clients.</p>
</div>
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">If your export comes from QuickBooks, Xero, or a billing system</div>
<h2>Standardized dates and amounts. One row per invoice.</h2>
<p>
Your billing system exports <code>3/04/2026</code>. The re-post of
the same invoice has <code>2026-03-04</code>. The amount is
<code>$1,250.00</code> in one row and <code>1250</code> in the other.
DataTools reads each row, normalizes both date columns to ISO,
coerces the amount to a number, and then matches on invoice number
to keep exactly one canonical row per receivable.
</p>
<ul class="bullets">
<li><strong>Invoice date + due date</strong> both standardized to ISO, so every aging bucket sorts and totals correctly.</li>
<li><strong>Amounts coerced to numbers</strong>: <code>$1,250.00</code> and <code>1250</code> resolve to the same value — no false mismatch between twin rows.</li>
<li><strong>Client emails lowercased</strong> so the same client groups as one for remittance reminders.</li>
<li><strong>Status backfill on dedupe</strong>: when one twin row has a blank invoice status, the survivor inherits the populated status from the other row, so no open receivable goes missing from the report.</li>
</ul>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">For anyone who reports on receivables</div>
<h2>Every duplicate invoice you don't catch overstates your AR.</h2>
<p>
Your aging report is only as good as the export under it. Every
double-entered invoice number is a receivable counted twice — it
inflates the aging buckets, overstates the total owed, and sends
collections after balances that aren't really open. DataTools
catches them once, before the report runs, by matching on invoice
number with the date and amount noise already standardized away.
</p>
<div class="callout">
<strong>Real numbers from the demo:</strong> a 26-row open-invoices
export collapses to 21 — that's five double-entered invoices the
mixed date and amount formats were hiding, both date columns now
ISO, amounts numeric, emails lowercased, 0 unparseable, and a blank
status backfilled from its twin row. The aging report finally ties out.
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">The thing every cloud cleaner can't say</div>
<h2>Your clients' receivables never leave your computer.</h2>
<p>
Cloud cleaning tools require you to upload your AR ledger — client
names, invoice amounts, remit-to contacts. That ledger is sensitive
client financial data, and once it's on someone else's server, your
firm owns a data-handling problem you didn't need. DataTools is a
desktop app. There is no upload step.
</p>
<div class="terminal"><span class="prompt">$</span> python -m src.cli_pipeline ar_open_invoices.csv --pipeline ar_open_invoices_pipeline.json --apply
Reading ar_open_invoices.csv...
26 rows, 9 columns
Executing pipeline:
<span class="ok"></span> text_clean (40 ms) {cells_changed: 31}
<span class="ok"></span> format_standardize (120 ms) {dates_to_iso: 41, amounts_to_number: 26, emails_lowercased: 18}
<span class="ok"></span> missing (30 ms) {sentinels_standardized: 4, status_backfilled: 1}
<span class="ok"></span> column_map (20 ms) {columns_renamed: 2}
<span class="ok"></span> dedup (60 ms) {duplicate_invoices_removed: 5, merged: 5}
Initial rows: 26 → Final rows: 21
Unparseable dates/amounts: 0
Total elapsed: 0.3 s
<span class="prompt">$</span> # 5 double-entered invoices gone. aging report ties out. for $49.</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">In the bundle</div>
<h2>Six tools. One pipeline. One $49 download.</h2>
<div class="grid">
<div class="card"><h3>1 · Find Duplicates</h3><p>Match on invoice number; keep one canonical row per receivable and backfill blanks from the twin.</p></div>
<div class="card"><h3>2 · Clean Text</h3><p>Smart quotes from copy-paste, NBSP from spreadsheet exports, BOM from Excel.</p></div>
<div class="card"><h3>3 · Standardize Formats</h3><p>Invoice and due dates to ISO, amounts to clean numbers, client emails lowercased.</p></div>
<div class="card"><h3>4 · Fix Missing Values</h3><p>Detect disguised nulls such as <code>TBD</code> and <code>(unknown)</code>, and backfill blank invoice statuses on dedupe.</p></div>
<div class="card"><h3>5 · Map Columns</h3><p>Project to your aging-report schema, coerce amount to a number, reorder fields for import.</p></div>
<div class="card"><h3>6 · Automated Workflows</h3><p>Save the cleanup as JSON. Drop next period's open-invoices export on it. Same dedupe, automated.</p></div>
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">Pricing — pay once, own it</div>
<h2>$49. No subscription. No per-close fee.</h2>
<div class="pricing">
<div class="card featured">
<div class="row"><div class="price">$49</div><div class="price-suffix">one-time</div></div>
<h3>DataTools for Accounts Receivable</h3>
<ul>
<li>All 6 tools, full pipeline</li>
<li>Mac · Windows · Linux installers</li>
<li>Code-signed (no Gatekeeper warnings)</li>
<li>Free updates for the v1.x line</li>
<li>Bonus: open-invoices dedupe pipeline preset</li>
<li><strong>Use on any number of clients</strong> — no seat limits</li>
</ul>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=ar-aging" rel="noopener">Buy on Gumroad →</a>
</div>
<div class="card">
<div class="row"><div class="price">$149</div><div class="price-suffix">one-time</div></div>
<h3>Full DataTools Suite</h3>
<p class="muted">Available when 3+ bundles ship. Includes everything in the Accounts Receivable pack plus the Bookkeeper and Accounts Payable / 1099 bundles. Save $48.</p>
<a class="btn btn-ghost btn-large" href="#" aria-disabled="true">Coming when ready</a>
</div>
</div>
</div>
</section>
<section>
<div class="container">
<h2>Questions</h2>
<details class="faq">
<summary>Does this replace my accounting system's deduplication?</summary>
<p>No — it cleans the export <em>before</em> you run the aging report or import it back. Most billing systems will happily hold two postings of the same invoice number; DataTools catches the double-entered invoice so it never inflates a single aging bucket.</p>
</details>
<details class="faq">
<summary>How does it know two rows are the same invoice?</summary>
<p>It matches on invoice number after the date and amount formats are standardized away. So a posting dated <code>3/04/2026</code> for <code>$1,250.00</code> and its twin dated <code>2026-03-04</code> for <code>1250</code> are recognized as one invoice — and only one canonical row survives.</p>
</details>
<details class="faq">
<summary>What happens to a blank invoice status when the duplicate is removed?</summary>
<p>It's backfilled. If one twin row has a blank status and the other says <code>open</code>, the surviving row inherits <code>open</code> — so no real receivable drops off the aging report just because the duplicate carried the better data.</p>
</details>
<details class="faq">
<summary>Can I use it on multiple clients without paying again?</summary>
<p>Yes. The license is per-operator, not per-client. Run it on every client's open-invoices export for the same $49.</p>
</details>
<details class="faq">
<summary>What does the audit trail look like?</summary>
<p>A row-by-row CSV: every modified cell with its original value, new value, and which rule fired — every date coerced to ISO, every amount normalized, every duplicate invoice removed. A separate JSON file describes the pipeline that produced it, so the cleanup reproduces deterministically and your client can verify it on their machine.</p>
</details>
<details class="faq">
<summary>What's your refund policy?</summary>
<p>Try the live demo above on the sample open-invoices export before you buy. If you find within 14 days of purchase that DataTools doesn't fit your workflow, email for a refund, no questions asked.</p>
</details>
</div>
</section>
<section>
<div class="container" style="text-align: center;">
<h2>Stop counting the same receivable twice.</h2>
<p class="lead" style="margin: 0 auto 28px;">One $49 download. Standardizes invoice dates, due dates, and amounts, lowercases client emails, removes the double-entered invoices your aging report was counting twice, and saves a pipeline you can re-run on next period's open-invoices export.</p>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=ar-aging" rel="noopener">Get DataTools for Accounting — $49 →</a>
</div>
</section>
<footer>
<div class="container">
<div>
<p><strong>DataTools</strong> — local data-cleaning for bookkeepers, accounts payable, and accounts receivable teams.</p>
<p class="muted">© 2026 · Built solo · Shipped from a small office.</p>
</div>
<div>
<p>
<a href="../bookkeeper/">For bookkeepers</a> ·
<a href="../ap-1099/">For accounts payable / 1099</a><br />
<a href="https://gumroad.com/l/datatools?from=ar-aging">Buy on Gumroad</a> ·
<a href="mailto:hello@datatools.app">Email support</a>
</p>
</div>
</div>
</footer>
</body>
</html>
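The blank-status backfill described in the AR FAQ above can be sketched as follows. This is a minimal illustration, not the DataTools implementation: the function name `dedup_with_backfill`, the `invoice_number` key, and the sample rows are all assumptions made for the example.

```python
def dedup_with_backfill(rows, key="invoice_number"):
    """Keep one row per key and fill the survivor's blank cells from its twins.

    Hypothetical helper: the column names and the first-row-wins rule are
    assumptions, not the shipped pipeline's behaviour.
    """
    survivors = {}
    for row in rows:
        k = row[key].strip()
        if k not in survivors:
            # First occurrence becomes the surviving row (copied, not aliased).
            survivors[k] = dict(row)
            continue
        kept = survivors[k]
        for column, value in row.items():
            # Only fill cells that are blank on the survivor.
            if not kept.get(column, "").strip() and value.strip():
                kept[column] = value
    return list(survivors.values())


invoices = [
    {"invoice_number": "INV-1001", "status": "", "amount": "1200.00"},
    {"invoice_number": "INV-1001", "status": "open", "amount": "1200.00"},
    {"invoice_number": "INV-1002", "status": "paid", "amount": "310.50"},
]
cleaned = dedup_with_backfill(invoices)
```

Here the first `INV-1001` row survives and inherits `open` from its twin, so the receivable stays on the aging report with a usable status.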

@@ -3,9 +3,9 @@
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>DataTools for Bookkeepers — Reconcile Bank Exports With An Audit Trail · $49</title>
<meta name="description" content="Reconcile messy bank exports. Catch duplicate transactions QuickBooks imported twice. Standardize dates, amounts, and vendor casing — locally. Every change auditable. $49 one-time." />
<meta name="keywords" content="reconcile bank export csv, quickbooks duplicate transactions, vendor list cleanup, bookkeeper csv tool, bank export deduplicator, bookkeeper audit trail" />
<title>DataTools for Bookkeepers — Catch Bank Transactions Posted Twice · $49</title>
<meta name="description" content="Catch the transactions your bank export posted twice. Standardize every date to ISO and every amount to numeric, then dedup on the real transaction so the reconciliation ties out — with a row-level audit trail. $49 one-time." />
<meta name="keywords" content="bank reconciliation, duplicate transactions, bank export csv cleanup, QuickBooks reconcile, bookkeeper csv tool" />
<link rel="canonical" href="https://datatools.app/bookkeeper/" />
<link rel="stylesheet" href="../_shared/styles.css" />
@@ -18,8 +18,8 @@
</style>
<!-- Open Graph -->
<meta property="og:title" content="DataTools for Bookkeepers — Reconcile Bank Exports With An Audit Trail" />
<meta property="og:description" content="Catch duplicate transactions. Standardize dates and amounts. Hand your client an audit trail. $49 one-time." />
<meta property="og:title" content="DataTools for Bookkeepers — Catch Bank Transactions Posted Twice" />
<meta property="og:description" content="The same payment posts twice in two date/amount formats and a plain dedupe misses it. DataTools standardizes, dedups on the real transaction, and hands you an audit trail. $49 one-time." />
<meta property="og:type" content="product" />
<meta property="og:url" content="https://datatools.app/bookkeeper/" />
@@ -35,7 +35,7 @@
"price": "49",
"priceCurrency": "USD"
},
"description": "Reconcile bank exports, dedupe vendor lists, and produce a hand-off-ready audit trail. Six-tool data-cleaning bundle for bookkeepers and freelance accountants.",
"description": "Catch the duplicate transactions your bank export posted twice across overlapping months, standardize dates and amounts, and produce a hand-off-ready audit trail. Six-tool data-cleaning bundle for bookkeepers and freelance accountants.",
"softwareVersion": "1.0"
}
</script>
@@ -47,7 +47,7 @@
<div class="brand"><span class="brand-mark"></span> DataTools <span class="muted">/ for Bookkeepers</span></div>
<div>
<span class="price-tag">$49 — one-time, no subscription</span>
<a class="btn" href="https://gumroad.com/l/datatools?from=bookkeeper" rel="noopener">Get DataTools →</a>
<a class="btn" href="https://gumroad.com/l/datatools?from=bookkeeper" rel="noopener">Get DataTools for Bookkeepers — $49 →</a>
</div>
</div>
</div>
@@ -55,24 +55,29 @@
<section class="hero">
<div class="container">
<div class="eyebrow">For bookkeepers · freelance accountants · small-firm partners</div>
<h1>Reconcile messy bank exports.<br /><strong>Hand your client an audit trail.</strong></h1>
<h1>Catch the transactions your bank export<br /><strong>posted twice.</strong></h1>
<p class="lead">
The Jan and Feb exports overlap and you've got the same transaction
booked twice. Vendor names are <em>"Amazon"</em>, <em>"amazon.com"</em>,
and <em>"AMAZON.COM*4F2X9"</em> in three different rows. Dates are a
smoosh of <code>01/15/2025</code>, <code>2025-01-15</code>, and
<code>Jan 18 2025</code>. DataTools fixes all of it in one pass —
and produces a row-by-row CSV showing every change so your client
can verify your work.
The Jan and Feb exports overlap, so the <em>same</em> payment posts
twice in two different shapes — <code>01/15/2025&nbsp;&nbsp;+$3,450.00</code>
in one export and <code>2025-01-15&nbsp;&nbsp;3450.00</code> in the
other — and a plain Excel dedupe never catches it because the dates and
amounts don't match character-for-character. DataTools standardizes
every date to ISO and every amount to numeric (parens-negatives
resolved), then dedups on the <em>real</em> transaction so the
reconciliation ties out. On the sample export that's
<strong>26 rows → 20</strong> — six phantom duplicate transactions
removed, 36 date/amount cells standardized, 0 unparseable — and you
get a row-by-row CSV showing every change so your client can verify
your work.
</p>
<div class="cta-row">
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=bookkeeper" rel="noopener">Get DataTools — $49 →</a>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=bookkeeper" rel="noopener">Get DataTools for Bookkeepers — $49 →</a>
<a class="btn btn-ghost btn-large" href="#demo">Try the live demo ↓</a>
<span class="price-note">One-time payment · cross-platform · runs offline</span>
</div>
<div class="stats">
<div class="stat"><div class="num">6</div><div class="label">tools, one bundle</div></div>
<div class="stat"><div class="num">100 %</div><div class="label">auditable changes</div></div>
<div class="stat"><div class="num">26→20</div><div class="label">rows, on the sample export</div></div>
<div class="stat"><div class="num">6</div><div class="label">phantom duplicates removed</div></div>
<div class="stat"><div class="num">0</div><div class="label">cloud uploads ever</div></div>
</div>
</div>
@@ -129,13 +134,15 @@
<div class="eyebrow">Live demo · runs in your browser</div>
<h2>Try it on a sample bank export with a known overlap</h2>
<p>
The demo below loads a 25-row export combining January and February
The demo below loads a 26-row export combining January and February
activity, with the month-boundary rows duplicated across exports —
the exact scenario where QuickBooks (or any reconciler) silently
double-counts transactions. Click <strong>Run pipeline</strong> and
watch the dedup catch every overlap, dates land in ISO format, and
the parens-negative amounts (<code>($89.50)</code>) become proper
negative numbers.
watch it standardize 36 date/amount cells, land every date in ISO
format, turn the parens-negative amounts (<code>($89.50)</code>) into
proper negatives, flag the disguised-null categories, and dedup the
export down to <strong>20 real transactions</strong> — six phantom
duplicates removed, 0 unparseable.
</p>
<div class="demo-frame">
<iframe
@@ -197,13 +204,17 @@
price. DataTools writes the audit by default, downloadable as a
separate CSV alongside the cleaned file.
</div>
<div class="terminal"><span class="prompt">$</span> head -5 client_jan2025_changes.csv
<div class="terminal"><span class="prompt">$</span> python -m src.cli_pipeline bank_reconciliation.csv --pipeline bank_reconciliation_pipeline.json --apply
standardize · 36 date/amount cells normalized (ISO dates, numeric amounts, parens-negatives resolved)
missing · disguised-null categories flagged (—, N/A, (blank))
dedup · 6 phantom duplicate transactions removed
rows · 26 → 20 · 0 unparseable
✓ wrote bank_reconciliation.cleaned.csv + bank_reconciliation.changes.csv (row-level audit)
<span class="prompt">$</span> head -4 bank_reconciliation.changes.csv
row,column,field_type,old,new
0,"Date ",date,"01/15/2025","2025-01-15"
0,Description,name," AMAZON.COM*4F2X9 PURCHASE","Amazon.com*4F2X9 Purchase"
0,Amount,currency,"-$129.99","-129.99"
1,Date ,date,"2025-01-15","2025-01-15"
<span class="prompt">$</span> # one row of audit per cell change. handed to the client. signed off.</div>
0,Amount,currency,"+$3,450.00","3450.00"
0,Category,category,"—","(missing)"
</div>
</section>
@@ -336,13 +347,13 @@ row,column,field_type,old,new
<footer>
<div class="container">
<div>
<p><strong>DataTools</strong> — local data-cleaning for Shopify, bookkeepers, and RevOps teams.</p>
<p><strong>DataTools</strong> — local data-cleaning for bookkeepers, accounts payable, and accounts receivable teams.</p>
<p class="muted">© 2026 · Built solo · Shipped from a small office.</p>
</div>
<div>
<p>
<a href="../shopify-pet/">For Shopify operators</a> ·
<a href="../revops/">For RevOps agencies</a><br />
<a href="../ap-1099/">For accounts payable / 1099</a> ·
<a href="../ar-aging/">For accounts receivable</a><br />
<a href="https://gumroad.com/l/datatools?from=bookkeeper">Buy on Gumroad</a> ·
<a href="mailto:hello@datatools.app">Email support</a>
</p>
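The bookkeeper page's central claim is that a character-for-character dedupe misses the same payment when its date and amount are formatted differently, and that standardizing first makes the duplicate visible. A minimal sketch of that idea, assuming the three date formats quoted in the copy and a dedup key of (date, amount, description); the function names and the key choice are illustrative assumptions, not the shipped pipeline:

```python
from datetime import datetime
from decimal import Decimal

# Assumed input formats, taken from the examples quoted in the page copy.
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%b %d %Y")


def to_iso_date(raw: str) -> str:
    """Return the date as YYYY-MM-DD, trying each assumed format in turn."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unparseable date: {raw!r}")


def to_amount(raw: str) -> Decimal:
    """Parse '+$3,450.00', '3450.00', or '($89.50)' into a signed Decimal."""
    text = raw.strip()
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]  # accounting-style parentheses mean a negative amount
    value = Decimal(text.replace("$", "").replace(",", ""))
    return -value if negative else value


def dedup_transactions(rows):
    """Keep the first row for each normalized (date, amount, description) key."""
    seen = set()
    kept = []
    for row in rows:
        key = (
            to_iso_date(row["date"]),
            to_amount(row["amount"]),
            row["description"].strip().casefold(),
        )
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [
    {"date": "01/15/2025", "amount": "+$3,450.00", "description": "Client payment"},
    {"date": "2025-01-15", "amount": "3450.00", "description": "client payment"},
    {"date": "Jan 18 2025", "amount": "($89.50)", "description": "Office supplies"},
]
unique_rows = dedup_transactions(rows)
```

The first two rows differ as strings but normalize to the same key, so only one survives. `Decimal` is used rather than `float` so that equal amounts compare and hash identically without rounding surprises.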

@@ -11,7 +11,7 @@
"gumroad_listing": "https://gumroad.com/l/datatools",
"support_email": "hello@datatools.app",
"personas": ["shopify-pet", "bookkeeper", "revops"],
"personas": ["bookkeeper", "ap-1099", "ar-aging"],
"_substitutions_made": [
"{{site_origin}}/ → site_origin/",

@@ -7,9 +7,9 @@ to ``landing/deploy.config.json`` and filling in the real URLs:
Output:
landing/dist/index.html
landing/dist/shopify-pet/index.html
landing/dist/bookkeeper/index.html
landing/dist/revops/index.html
landing/dist/ap-1099/index.html
landing/dist/ar-aging/index.html
landing/dist/_shared/styles.css
landing/dist/robots.txt
landing/dist/sitemap.xml
@@ -50,9 +50,9 @@ EXAMPLE_PATH = LANDING / "deploy.config.example.json"
# Files to substitute and copy. Order matters only for readability.
HTML_PAGES = [
LANDING / "index.html",
LANDING / "shopify-pet" / "index.html",
LANDING / "bookkeeper" / "index.html",
LANDING / "revops" / "index.html",
LANDING / "ap-1099" / "index.html",
LANDING / "ar-aging" / "index.html",
]
SHARED = LANDING / "_shared" / "styles.css"
@@ -125,7 +125,7 @@ def _stamp_sitemap(cfg: dict) -> str:
site = cfg["site_origin"].rstrip("/")
today = date.today().isoformat()
urls = [site + "/"] + [
f"{site}/{p}/" for p in cfg.get("personas", ["shopify-pet", "bookkeeper", "revops"])
f"{site}/{p}/" for p in cfg.get("personas", ["bookkeeper", "ap-1099", "ar-aging"])
]
items = "\n".join(
f" <url><loc>{u}</loc><lastmod>{today}</lastmod></url>"
@@ -177,11 +177,11 @@ def _build_404_html(cfg: dict) -> str:
<h1>That page isn't here.</h1>
<p class="lead" style="margin: 0 auto 28px;">Pick a workflow below to land somewhere useful.</p>
<p>
<a class="btn" href="{site_origin}/shopify-pet/">For Shopify</a>
&nbsp;
<a class="btn" href="{site_origin}/bookkeeper/">For bookkeepers</a>
&nbsp;
<a class="btn" href="{site_origin}/revops/">For RevOps</a>
<a class="btn" href="{site_origin}/ap-1099/">For AP / 1099</a>
&nbsp;
<a class="btn" href="{site_origin}/ar-aging/">For AR</a>
</p>
</div>
</section>

@@ -3,13 +3,13 @@
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>DataTools — Local CSV / Excel Cleaning for Shopify, Bookkeepers, and RevOps</title>
<meta name="description" content="One desktop tool. Three workflows. Clean Shopify customer exports, reconcile messy bank statements, or dedupe lead lists across HubSpot and LinkedIn — all locally. $49 one-time." />
<title>DataTools — Local CSV / Excel Cleaning for Bookkeepers and Accountants</title>
<meta name="description" content="One desktop tool for messy accounting exports. Reconcile bank statements, build clean 1099 vendor lists, and de-duplicate AR aging — all locally. $49 one-time." />
<link rel="canonical" href="https://datatools.app/" />
<link rel="stylesheet" href="_shared/styles.css" />
<meta property="og:title" content="DataTools — Local CSV / Excel Cleaning" />
<meta property="og:description" content="One desktop tool, three niche workflows. Runs entirely offline. $49 one-time." />
<meta property="og:title" content="DataTools — Local CSV / Excel Cleaning for Accounting" />
<meta property="og:description" content="Reconcile bank exports, prep 1099 vendor lists, clean AR aging — offline. $49 one-time." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://datatools.app/" />
@@ -38,9 +38,9 @@
box-shadow: var(--shadow);
text-decoration: none;
}
.persona-card.shopify { --card-accent: #6ee7b7; }
.persona-card.bookkeeper{ --card-accent: #7dd3fc; }
.persona-card.revops { --card-accent: #c4b5fd; }
.persona-card.ap1099 { --card-accent: #fbbf24; }
.persona-card.ar { --card-accent: #6ee7b7; }
.persona-card .pill {
display: inline-block;
background: rgba(255,255,255,0.04);
@@ -93,70 +93,69 @@
<section class="hero">
<div class="container">
<div class="eyebrow">For Shopify operators · bookkeepers · marketing & RevOps agencies</div>
<h1>Local CSV / Excel cleaning.<br /><strong>One tool. Three workflows.</strong></h1>
<div class="eyebrow">For bookkeepers · accounts payable · accounts receivable</div>
<h1>Local CSV / Excel cleaning for accounting.<br /><strong>One tool. Three workflows.</strong></h1>
<p class="lead">
DataTools is a desktop app that fixes the data-cleaning headaches
every small business hits — duplicates Excel can't catch,
international phones it can't parse, dates and currencies in three
different formats per export. One $49 download. Works on Mac,
Windows, and Linux. <strong>Your data never leaves your
computer.</strong>
DataTools is a desktop app that fixes the export headaches that
throw off your books — the transaction your bank posted twice,
the vendor entered three ways at 1099 time, the invoice your aging
report counted twice. One $49 download. Mac, Windows, and Linux.
<strong>Your data never leaves your computer.</strong>
</p>
<div class="persona-grid">
<a class="persona-card shopify" href="shopify-pet/">
<span class="pill">🛍️ Shopify operator</span>
<h3>Customer / vendor / subscriber export cleanup</h3>
<p>
Klaviyo-import-ready customer lists in 30 seconds. Catches
cross-device duplicates, standardizes international phones
and addresses, fixes the disguised nulls that break product
feeds.
</p>
<ul class="pain">
<li>· Fix Klaviyo per-contact billing on phantom dupes</li>
<li>· Repair feeds rejected by Google Merchant / Meta</li>
<li>· Unify orders from Shopify + Etsy + Amazon + Faire</li>
<li>· Resolve VAT-MOSS country-name drift</li>
</ul>
<span class="open">Open the Shopify demo &amp; pricing</span>
</a>
<a class="persona-card bookkeeper" href="bookkeeper/">
<span class="pill">📒 Bookkeeper / accountant</span>
<h3>Bank-export reconciliation with audit trail</h3>
<span class="pill">📒 Bookkeeper</span>
<h3>Bank reconciliation with an audit trail</h3>
<p>
Catches the duplicate transaction QuickBooks imported twice
when Jan and Feb exports overlap. Standardizes dates,
amounts, and vendor casing. Hands you a row-level audit log
to share with the client.
When the Jan and Feb exports overlap, the same payment posts
twice in two formats. DataTools standardizes every date and
amount, then dedups on the real transaction so it ties out —
with a row-level audit log to hand the client.
</p>
<ul class="pain">
<li>· Catch month-overlap re-import dupes</li>
<li>· Consolidate vendors for clean 1099 reports</li>
<li>· Produce hand-off-ready audit trail</li>
<li>· Multi-currency books (EUR / GBP / BRL)</li>
<li>· Catch month-overlap re-import duplicates</li>
<li>· ISO dates, numeric amounts, parens-negatives resolved</li>
<li>· Hand-off-ready audit trail</li>
<li>· Sample: 26 rows → 20, six phantom duplicates removed</li>
</ul>
<span class="open">Open the bookkeeper demo &amp; pricing</span>
</a>
<a class="persona-card revops" href="revops/">
<span class="pill">🪢 Marketing / RevOps</span>
<h3>Lead-list dedup across HubSpot, LinkedIn, scrapes</h3>
<a class="persona-card ap1099" href="ap-1099/">
<span class="pill">🧾 Accounts payable / 1099</span>
<h3>Clean 1099 vendor list — missing EINs filled in</h3>
<p>
One canonical lead per real person — across HubSpot,
LinkedIn, Apollo, ZoomInfo, and manual scrapes.
International phones (50+ country codes), per-row country
column, fuzzy match with merge.
The same vendor entered three times, each record holding only
part of the details. DataTools consolidates each vendor to one
row and backfills the gaps from the duplicates, so the EINs you
need at filing time are recovered.
</p>
<ul class="pain">
<li>· Stop paying HubSpot tier price for cross-source dupes</li>
<li>· Protect sender reputation from invalid emails</li>
<li>· Skip the 4–8 wk GDPR review on cloud cleaners</li>
<li>· Suppression-list sync across 5+ platforms</li>
<li>· Consolidate vendor masters for 1099-NEC</li>
<li>· Recover EINs scattered across duplicate records</li>
<li>· Standardize phones, emails, and amounts</li>
<li>· Sample: 24 records → 8 vendors, 7 EINs recovered</li>
</ul>
<span class="open">Open the RevOps demo &amp; pricing</span>
<span class="open">Open the 1099 / AP demo &amp; pricing</span>
</a>
<a class="persona-card ar" href="ar-aging/">
<span class="pill">💵 Accounts receivable</span>
<h3>AR aging without the double-counted invoices</h3>
<p>
Double-entered invoices inflate your aging report and your
follow-ups. DataTools standardizes invoice dates, due dates,
and amounts, lowercases client emails, then removes the
duplicate invoice numbers so the aging is accurate.
</p>
<ul class="pain">
<li>· Remove double-entered invoices from the aging</li>
<li>· ISO dates, numeric amounts, lowercased client emails</li>
<li>· Backfill a blank status from its twin row</li>
<li>· Sample: 26 rows → 21, five duplicate invoices removed</li>
</ul>
<span class="open">Open the AR demo &amp; pricing</span>
</a>
</div>
</div>
@@ -218,14 +217,14 @@
<footer>
<div class="container">
<div>
<p><strong>DataTools</strong> — local data-cleaning for Shopify, bookkeepers, and RevOps teams.</p>
<p><strong>DataTools</strong> — local data-cleaning for bookkeepers, accounts payable, and accounts receivable teams.</p>
<p class="muted">© 2026 · Built solo · Shipped from a small office.</p>
</div>
<div>
<p>
<a href="shopify-pet/">For Shopify operators</a> ·
<a href="bookkeeper/">For bookkeepers</a> ·
<a href="revops/">For RevOps agencies</a><br />
<a href="ap-1099/">For accounts payable / 1099</a> ·
<a href="ar-aging/">For accounts receivable</a><br />
<a href="mailto:hello@datatools.app">Email support</a>
</p>
</div>

@@ -1,352 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>DataTools for RevOps — Dedupe Lead Lists Across HubSpot, LinkedIn, and Manual Scrapes · $49</title>
<meta name="description" content="One tool to dedupe lead lists across HubSpot, LinkedIn, and manual scrapes. International phones (50+ country codes), per-row country normalization, fuzzy match across vendors, fully offline. $49 one-time." />
<meta name="keywords" content="dedupe lead list, hubspot deduplicate, linkedin lead cleanup, marketing data cleaning, revops csv tool, multi-vendor lead unification, international phone normalization" />
<link rel="canonical" href="https://datatools.app/revops/" />
<link rel="stylesheet" href="../_shared/styles.css" />
<!-- Persona accent: RevOps → vivid violet -->
<style>
:root {
--accent: #c4b5fd;
--accent-ink: #2e1065;
}
</style>
<meta property="og:title" content="DataTools for RevOps — Dedupe Lead Lists Across HubSpot, LinkedIn, and Manual Scrapes" />
<meta property="og:description" content="International phones, country normalization, fuzzy dedup with merge — one tool, no upload. $49 one-time." />
<meta property="og:type" content="product" />
<meta property="og:url" content="https://datatools.app/revops/" />
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "SoftwareApplication",
"name": "DataTools for RevOps",
"operatingSystem": "Windows, macOS, Linux",
"applicationCategory": "BusinessApplication",
"offers": {
"@type": "Offer",
"price": "49",
"priceCurrency": "USD"
},
"description": "Dedupe and unify lead lists across CRM, scraping, and manual sources. International phone normalization, per-row country, fuzzy match with merge. Six-tool data-cleaning bundle for RevOps and marketing agencies.",
"softwareVersion": "1.0"
}
</script>
</head>
<body>
<div class="buybar">
<div class="buybar-inner">
<div class="brand"><span class="brand-mark"></span> DataTools <span class="muted">/ for RevOps</span></div>
<div>
<span class="price-tag">$49 — one-time, no subscription</span>
<a class="btn" href="https://gumroad.com/l/datatools?from=revops" rel="noopener">Get DataTools →</a>
</div>
</div>
</div>
<section class="hero">
<div class="container">
<div class="eyebrow">For RevOps · marketing ops · agency lead-gen · audience-builders</div>
<h1>Dedupe lead lists across HubSpot, LinkedIn,<br /><strong>and manual scrapes — locally.</strong></h1>
<p class="lead">
The same prospect shows up as <code>alice@acme.com</code> in HubSpot,
<code>Alice.Johnson@acme.com</code> in LinkedIn Sales Navigator, and
<code>alice@acme.com</code> again from your VA's manual scrape. Their
phone is <code>(415) 555-1234</code> in one source and
<code>4155551234</code> in another. DataTools fuzzy-matches across
sources, normalizes phones to E.164 with per-row country awareness,
and produces one canonical lead per real person — without uploading
a single contact to a third-party tool.
</p>
<div class="cta-row">
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=revops" rel="noopener">Get DataTools — $49 →</a>
<a class="btn btn-ghost btn-large" href="#demo">Try the live demo ↓</a>
<span class="price-note">One-time payment · cross-platform · runs offline</span>
</div>
<div class="stats">
<div class="stat"><div class="num">50+</div><div class="label">country codes</div></div>
<div class="stat"><div class="num">3</div><div class="label">CRM sources unified</div></div>
<div class="stat"><div class="num">0</div><div class="label">cloud uploads ever</div></div>
</div>
</div>
</section>
<!-- ============= Pain points ============= -->
<section>
<div class="container">
<div class="eyebrow">If your last campaign launch was held up by data hygiene</div>
<h2>Five pains DataTools fixes before you import to HubSpot</h2>
<div class="grid">
<div class="card">
<span class="icon">💸</span>
<h3>HubSpot / Marketo / Iterable bills you for every duplicate contact</h3>
<p>10 k contacts → enterprise tier at $4–8 k/mo. 18 % cross-source duplicate rate from Apollo + ZoomInfo + LinkedIn means you're at 8.2 k unique people but paying for 10 k. Every month. Forever.</p>
<p class="muted"><strong>What it costs:</strong> $200–$800 per 1 k duplicate contacts — recurring, every month.</p>
</div>
<div class="card">
<span class="icon">🚫</span>
<h3>Sender reputation tanks when you mail to invalid or duplicate addresses</h3>
<p>One bad sending session — to addresses your team scraped or imported without hygiene — and your domain reputation takes weeks to recover. Your good campaigns sit in spam folders during the recovery.</p>
<p class="muted"><strong>What it costs:</strong> catastrophic — entire email programme degraded for 2–6 weeks.</p>
</div>
<div class="card">
<span class="icon">⚖️</span>
<h3>GDPR makes uploading to a cloud cleaner a legal-review marathon</h3>
<p>Every cloud-based lead-cleaner needs you to upload your prospect list. Your legal team needs 4–8 weeks to bless that. DataTools is desktop-only — no upload, no DPA, no review, no delay.</p>
<p class="muted"><strong>What it costs:</strong> 4–8 weeks of legal-review delay per tool, every time.</p>
</div>
<div class="card">
<span class="icon">🪢</span>
<h3>Apollo + ZoomInfo + LinkedIn + manual scrapes all use different schemas</h3>
<p>Each export has its own column names, scoring scale, country format. Unifying them by hand for one campaign costs 1–3 days. Doing it for every campaign is unsustainable.</p>
<p class="muted"><strong>What it costs:</strong> 1–3 days per campaign of manual unification + judgement calls that drift across team members.</p>
</div>
<div class="card">
<span class="icon">🛡️</span>
<h3>Suppression lists across 5+ marketing platforms get out of sync</h3>
<p>Each platform has its own suppression format. Out-of-sync lists let opted-out contacts slip through, triggering CAN-SPAM / GDPR exposure and the kind of "we got a complaint" email no one wants.</p>
<p class="muted"><strong>What it costs:</strong> compliance risk + churn-back cost + stakeholder trust.</p>
</div>
<div class="card">
<span class="icon">📞</span>
<h3>International dialer fails because phone formats vary</h3>
<p>A calling list covering 15 countries with mixed formats means the dialler rejects 8–15 % of numbers, and your reps spend the day on "number invalid" tones instead of conversations.</p>
<p class="muted"><strong>What it costs:</strong> rep productivity × failure rate × team size.</p>
</div>
</div>
</div>
</section>
<section id="demo">
<div class="container">
<div class="eyebrow">Live demo · runs in your browser</div>
<h2>Try it on a real-looking 3-vendor lead list</h2>
<p>
The demo below loads a 25-row lead worksheet combining HubSpot,
LinkedIn Sales Navigator, and manual scraping — with the same prospect
appearing in two or three sources, country names spelled three
different ways (<code>USA</code>, <code>US</code>, <code>United
States</code>), and 13 different international phone formats. Click
<strong>Run pipeline</strong> and watch the 5-step pipeline (text
clean → format → missing → column map → dedup) collapse 25 rows to 19
with a single canonical record per prospect.
</p>
<div class="demo-frame">
<iframe
src="https://demo.datatools.app/?p=revops"
loading="lazy"
title="DataTools live demo — RevOps"
sandbox="allow-scripts allow-same-origin allow-downloads allow-forms"></iframe>
<div class="demo-caption">
Demo runs on free hosting. Capped at 100 input rows · output
watermarked. The paid product has no caps and runs entirely offline.
</div>
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">Built for the agency RevOps day</div>
<h2>Three workflows you do every campaign</h2>
<div class="grid">
<div class="card">
<span class="icon">🪢</span>
<h3>Email-list dedup across lead sources</h3>
<p>HubSpot exports + LinkedIn Sales Navigator + the VA's spreadsheet, all merged. Fuzzy match across email + phone + name catches the cross-source duplicates that broke your last campaign send.</p>
</div>
<div class="card">
<span class="icon">🌍</span>
<h3>Multi-platform audience reconciliation</h3>
<p>Build one canonical audience from Meta, Google Ads, LinkedIn, and your CRM. Each platform exports a different shape; Map Columns aligns them all, dedup merges the survivors with their most-complete fields.</p>
</div>
<div class="card">
<span class="icon">🛡️</span>
<h3>Suppression-list management</h3>
<p>Suppression lists need to dedupe across email + phone + first-party identifiers. Add a row, dedupe, ship the canonical CSV to every platform — without uploading the suppression list to any of them.</p>
</div>
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">If your campaigns target outside the US — almost everyone's do</div>
<h2>50+ country codes. Per-row country awareness.</h2>
<p>
Your HubSpot list has <code>(415) 555-1234</code>. Your scraped
list from the same prospect has <code>+1 415 555 1234</code>. Your
Italian prospect entered <code>+39 06 6982</code>. Your Brazilian
lead has <code>11 3071 0000</code>. Each comes from a row tagged
with its country — DataTools reads that column per row and parses
every phone correctly to E.164.
</p>
<ul class="bullets">
<li><strong>Per-row country column</strong> drives the parser — no global default that buckets UK numbers as malformed US.</li>
<li><strong>Country-name normalization</strong>: <code>USA</code> / <code>US</code> / <code>United States</code> all resolve to the same ISO-2 code.</li>
<li><strong>50+ country support</strong> via Google's libphonenumber, including KR, CN, IN, MX, BR, IL, TR, PL, DK, SE.</li>
<li><strong>Schema enforcement</strong> via Map Columns: project to your CRM's required shape, coerce score columns to integers, reorder fields to match the import contract.</li>
</ul>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">For platforms that charge per contact</div>
<h2>Every duplicate you don't catch costs you for the life of the contract.</h2>
<p>
HubSpot prices on contacts. Klaviyo prices on contacts. Marketo,
Iterable, ActiveCampaign — all priced on contacts. Every duplicate
you don't catch is a recurring tax on your campaign. DataTools
catches them once, before import, with a fuzzy matcher that's
tuned to the cross-source noise you actually see.
</p>
<div class="callout">
<strong>Real numbers from the demo:</strong> 25 input rows from
three sources collapse to 19 — that's 6 duplicates the cross-source
noise was hiding. On a 50,000-row campaign list, that ratio
typically saves 12,000+ contacts a month, every month.
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">The thing every cloud cleaner can't say</div>
<h2>Your prospects' contact info never leaves your computer.</h2>
<p>
Cloud lead-cleaning tools require you to upload your audience.
That audience is your single most valuable agency asset — and once
it's on someone else's server, your client's privacy story is
no longer in your hands. DataTools is a desktop app. There is no
upload step.
</p>
<div class="terminal"><span class="prompt">$</span> python -m src.cli_pipeline campaign_q1.csv --pipeline revops_pipeline.json --apply
Reading campaign_q1.csv...
53,802 rows, 14 columns
Executing pipeline:
<span class="ok"></span> text_clean (160 ms) {cells_changed: 8,205}
<span class="ok"></span> format_standardize (1.4 s) {cells_changed: 41,889 — 50 country codes}
<span class="ok"></span> missing (140 ms) {sentinels_standardized: 6,710}
<span class="ok"></span> column_map (220 ms) {columns_renamed: 4, columns_added: 1}
<span class="ok"></span> dedup (4.8 s) {duplicates_removed: 12,344, merged: 12,344}
Initial rows: 53,802 → Final rows: 41,458
Total elapsed: 6.7 s
<span class="prompt">$</span> # 12,344 fewer contacts to pay for. for $49.</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">In the bundle</div>
<h2>Six tools. One pipeline. One $49 download.</h2>
<div class="grid">
<div class="card"><h3>1 · Find Duplicates</h3><p>Fuzzy match across email + phone + name + company; merge survivors with most-complete fields.</p></div>
<div class="card"><h3>2 · Clean Text</h3><p>Smart quotes from copy-paste, NBSP from spreadsheet exports, BOM from Excel.</p></div>
<div class="card"><h3>3 · Standardize Formats</h3><p>E.164 phones with per-row country, canonical emails, name casing, ISO dates.</p></div>
<div class="card"><h3>4 · Fix Missing Values</h3><p>Detect <code>TBD</code>, <code>(unknown)</code>, <code></code> across vendor exports.</p></div>
<div class="card"><h3>5 · Map Columns</h3><p>Project to your CRM's required schema, coerce score to integer, reorder for import.</p></div>
<div class="card"><h3>6 · Automated Workflows</h3><p>Save the cleanup as JSON. Drop next campaign's combined export on it. Same dedup, automated.</p></div>
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">Pricing — pay once, own it</div>
<h2>$49. No subscription. No per-campaign fee.</h2>
<div class="pricing">
<div class="card featured">
<div class="row"><div class="price">$49</div><div class="price-suffix">one-time</div></div>
<h3>DataTools for RevOps</h3>
<ul>
<li>All 6 tools, full pipeline</li>
<li>Mac · Windows · Linux installers</li>
<li>Code-signed (no Gatekeeper warnings)</li>
<li>Free updates for the v1.x line</li>
<li>Bonus: 3-source unification pipeline preset</li>
<li><strong>Use on any number of clients</strong> — no seat limits</li>
</ul>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=revops" rel="noopener">Buy on Gumroad →</a>
</div>
<div class="card">
<div class="row"><div class="price">$149</div><div class="price-suffix">one-time</div></div>
<h3>Full DataTools Suite</h3>
<p class="muted">Available when 3+ bundles ship. Includes everything in the RevOps pack plus the Shopify and Bookkeeper bundles. Save $48.</p>
<a class="btn btn-ghost btn-large" href="#" aria-disabled="true">Coming when ready</a>
</div>
</div>
</div>
</section>
<section>
<div class="container">
<h2>Questions</h2>
<details class="faq">
<summary>Does this replace HubSpot's deduplication?</summary>
<p>No — it cleans data <em>before</em> import to HubSpot (or LinkedIn, Marketo, Klaviyo, etc.). HubSpot's dedup runs on already-imported contacts; DataTools catches duplicates that haven't yet cost you a contract slot.</p>
</details>
<details class="faq">
<summary>Does it handle international phones correctly?</summary>
<p>Yes — via Google's libphonenumber, with 50+ country codes. The killer feature is per-row country: point a column at it (any column with values like <code>US</code>, <code>USA</code>, <code>United States</code>, <code>+1</code>, <code>JP</code>, <code>Japan</code>) and DataTools parses each row in its own region. No more UK numbers bucketed as malformed US.</p>
</details>
<details class="faq">
<summary>Can I use it on multiple clients without paying again?</summary>
<p>Yes. The licence is per-operator, not per-client. Run it on every agency client's lead list for the same $49.</p>
</details>
<details class="faq">
<summary>How does fuzzy match work across columns?</summary>
<p>Out of the box, the dedup engine builds default strategies from column names: typically exact match on email and phone, and Jaro-Winkler at 85% on name. You can override this via JSON: pick which columns to match on, which algorithm, and what threshold. Strategies are stored in the saved pipeline, so the next campaign uses the same rules.</p>
</details>
<details class="faq">
<summary>What does the audit trail look like?</summary>
<p>A row-by-row CSV: every modified cell with its original value, new value, and which rule fired. A separate JSON file describes the pipeline that produced it. Together they reproduce the cleanup deterministically — your client can verify it on their machine.</p>
</details>
<details class="faq">
<summary>What's your refund policy?</summary>
<p>Try the live demo above on the sample dataset before you buy. If DataTools doesn't fit your workflow within 14 days, email for a refund — no questions asked.</p>
</details>
</div>
</section>
<section>
<div class="container" style="text-align: center;">
<h2>Stop paying twice for the same contact.</h2>
<p class="lead" style="margin: 0 auto 28px;">One $49 download. Catches the cross-source duplicates HubSpot and LinkedIn can't see, normalizes phones for 50+ countries, and saves a pipeline you can re-run on next campaign's combined list.</p>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=revops" rel="noopener">Get DataTools — $49 →</a>
</div>
</section>
<footer>
<div class="container">
<div>
<p><strong>DataTools</strong> — local data-cleaning for Shopify, bookkeepers, and RevOps teams.</p>
<p class="muted">© 2026 · Built solo · Shipped from a small office.</p>
</div>
<div>
<p>
<a href="../shopify-pet/">For Shopify operators</a> ·
<a href="../bookkeeper/">For bookkeepers</a><br />
<a href="https://gumroad.com/l/datatools?from=revops">Buy on Gumroad</a> ·
<a href="mailto:hello@datatools.app">Email support</a>
</p>
</div>
</div>
</footer>
</body>
</html>
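The audit trail described in the FAQ above (one record per modified cell, with the original value, the new value, and the rule that fired) can be sketched as follows. This is a minimal illustration, not the DataTools implementation: the rule names, the `clean_with_audit` and `audit_to_csv` helpers, and the audit column names are all assumptions made for the example.

```python
import csv
import io

# Hypothetical rule table (names are illustrative, not DataTools' own).
SMART_QUOTES = str.maketrans(
    {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}
)

RULES = [
    ("strip_bom", lambda s: s.replace("\ufeff", "")),      # BOM from Excel
    ("nbsp_to_space", lambda s: s.replace("\u00a0", " ")),  # NBSP from exports
    ("smart_quotes", lambda s: s.translate(SMART_QUOTES)),  # copy-paste quotes
    ("trim", str.strip),
]


def clean_with_audit(rows):
    """Apply every rule to every cell; log one audit entry per change."""
    cleaned, audit = [], []
    for i, row in enumerate(rows):
        out = {}
        for col, value in row.items():
            current = value
            for name, fn in RULES:
                updated = fn(current)
                if updated != current:
                    audit.append({"row": i, "column": col, "before": current,
                                  "after": updated, "rule": name})
                    current = updated
            out[col] = current
        cleaned.append(out)
    return cleaned, audit


def audit_to_csv(audit):
    """Serialize the audit entries as CSV text (assumed column layout)."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["row", "column", "before", "after", "rule"]
    )
    writer.writeheader()
    writer.writerows(audit)
    return buf.getvalue()


rows = [{"name": "\ufeffJane\u00a0Doe ", "notes": "VIP\u201d"}]
cleaned, audit = clean_with_audit(rows)
print(cleaned)
print(audit_to_csv(audit))
```

Because each rule is applied in a fixed order and every change is logged with its before and after values, replaying the same rule list over the same input yields the same cleaned file and the same audit entries.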


@@ -1,381 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>DataTools for Shopify — Clean Customer & Product Exports Locally · $49</title>
<meta name="description" content="Clean Shopify customer, product, and subscriber exports — locally. Klaviyo-import-ready in 30 seconds. Catches duplicates Excel misses. Your data never leaves your computer. $49 one-time." />
<meta name="keywords" content="shopify customer cleanup, shopify csv cleaner, shopify product feed cleaner, klaviyo deduplicate, shopify customer dedup tool, shopify pet supplies" />
<link rel="canonical" href="https://datatools.app/shopify/" />
<link rel="stylesheet" href="../_shared/styles.css" />
<!-- Persona accent: Shopify pet → mint green (default in shared sheet) -->
<!-- Open Graph -->
<meta property="og:title" content="DataTools for Shopify — Clean Customer & Product Exports Locally" />
<meta property="og:description" content="Klaviyo-import-ready in 30 seconds. Local. No upload. $49 one-time." />
<meta property="og:type" content="product" />
<meta property="og:url" content="https://datatools.app/shopify/" />
<!-- Schema.org Product -->
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "SoftwareApplication",
"name": "DataTools for Shopify",
"operatingSystem": "Windows, macOS, Linux",
"applicationCategory": "BusinessApplication",
"offers": {
"@type": "Offer",
"price": "49",
"priceCurrency": "USD"
},
"description": "Clean Shopify customer, product, and subscriber CSV exports locally. Six-tool data-cleaning bundle: dedupe, text-clean, format-standardize, missing-value handle, column-map, pipeline.",
"softwareVersion": "1.0"
}
</script>
</head>
<body>
<!-- ============= Sticky buy bar ============= -->
<div class="buybar">
<div class="buybar-inner">
<div class="brand"><span class="brand-mark"></span> DataTools <span class="muted">/ for Shopify</span></div>
<div>
<span class="price-tag">$49 — one-time, no subscription</span>
<a class="btn" href="https://gumroad.com/l/datatools?from=shopify-pet" rel="noopener">Get DataTools →</a>
</div>
</div>
</div>
<!-- ============= Hero ============= -->
<section class="hero">
<div class="container">
<div class="eyebrow">For Shopify operators · pet supplies · subscription stores · DTC</div>
<h1>Klaviyo-import-ready customer lists.<br /><strong>In 30 seconds. Locally.</strong></h1>
<p class="lead">
Your Shopify customer export is a mess of formatting drift, disguised
duplicates, and inconsistent phone numbers. DataTools fixes all of it
in one pass — fuzzy-dedupes the same customer Klaviyo would charge
you for twice, standardises phones across your international
subscribers, and hands you a cleaned CSV. <strong>Your data never
leaves your computer.</strong>
</p>
<div class="cta-row">
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=shopify-pet" rel="noopener">Get DataTools — $49 →</a>
<a class="btn btn-ghost btn-large" href="#demo">Try the live demo ↓</a>
<span class="price-note">One-time payment · cross-platform · runs offline</span>
</div>
<div class="stats">
<div class="stat"><div class="num">6</div><div class="label">tools, one bundle</div></div>
<div class="stat"><div class="num">1 GB</div><div class="label">customer file in 2.5 min</div></div>
<div class="stat"><div class="num">0</div><div class="label">cloud uploads ever</div></div>
</div>
</div>
</section>
<!-- ============= Pain points ============= -->
<section>
<div class="container">
<div class="eyebrow">If any of these sound like your Tuesday</div>
<h2>Six pains DataTools fixes in one pass</h2>
<div class="grid">
<div class="card">
<span class="icon">💸</span>
<h3>Klaviyo / Mailchimp / Omnisend bills you for every duplicate</h3>
<p>The same customer signs up more than once: with a typo, with a plus-tag, on mobile. Your subscriber list has a 10–18 % duplicate rate and you're paying for every one of them, every month, forever.</p>
<p class="muted"><strong>What it costs:</strong> $30–$300/mo per percent of dupes on a 50 k-list, recurring.</p>
</div>
<div class="card">
<span class="icon">📵</span>
<h3>Your product feed got rejected by Google Merchant Center</h3>
<p>Smart quotes from a copy-paste in product titles. NBSP in SKU. Inconsistent attribute casing. Feed bounces, the launch sits for 24–72 hours while you try to find the bad row in a 12,000-line CSV.</p>
<p class="muted"><strong>What it costs:</strong> 1–3 days of delayed campaign × the campaign value.</p>
</div>
<div class="card">
<span class="icon">🪢</span>
<h3>Orders from Shopify + Etsy + Amazon + Faire don't speak the same language</h3>
<p>Each platform's export uses different column names for "customer email" / "ship country" / "order total." Merging takes hours of manual rename and copy-paste before the analysis can even begin.</p>
<p class="muted"><strong>What it costs:</strong> 4–8 hours per month manually merging exports.</p>
</div>
<div class="card">
<span class="icon">🔁</span>
<h3>Subscription churn looks higher than it is</h3>
<p>Pet-box subscribers cancel, then re-sub three months later under a different email or device. Your cohort report says churn is 20 % when it's actually 12 % — and you're over-paying for acquisition because LTV is mis-calculated.</p>
<p class="muted"><strong>What it costs:</strong> wrong CAC ceiling for the next year of paid ads.</p>
</div>
<div class="card">
<span class="icon">🌍</span>
<h3>VAT MOSS / EU tax breaks because country is spelled three ways</h3>
<p>Your UK customers are tagged <code>UK</code>, <code>U.K.</code>, and <code>United Kingdom</code> — all in one export. The VAT report aggregates them as three different markets. Compliance friction every quarter.</p>
<p class="muted"><strong>What it costs:</strong> compliance risk + repeated manual normalization.</p>
</div>
<div class="card">
<span class="icon">🔒</span>
<h3>Cloud cleaners want you to upload your customer list</h3>
<p>Your customer list is your single most valuable business asset. Uploading it to a SaaS to clean it is the privacy story you do not want. DataTools is desktop-only — your list never leaves your computer.</p>
<p class="muted"><strong>What it costs:</strong> nothing — and that's the point.</p>
</div>
</div>
</div>
</section>
<!-- ============= Live demo ============= -->
<section id="demo">
<div class="container">
<div class="eyebrow">Live demo · runs in your browser</div>
<h2>Try it on a real-looking Shopify customer export</h2>
<p>
The demo below loads a sample 15-row Shopify customer file with
pollution we've seen in actual stores: smart quotes from copy-paste,
duplicates with email-case drift, international phones from the UK,
Spain, Germany, Australia, and Japan, and the usual mess of
<code>N/A</code> / <code>(blank)</code> / <code>?</code> sentinels.
Click <strong>Run pipeline</strong> and watch every column get
cleaned in under a second.
</p>
<div class="demo-frame">
<iframe
src="https://demo.datatools.app/?p=shopify-pet"
loading="lazy"
title="DataTools live demo — Shopify pet supplies"
sandbox="allow-scripts allow-same-origin allow-downloads allow-forms"></iframe>
<div class="demo-caption">
Demo runs on free hosting (Streamlit Community Cloud). Capped at
100 input rows · output watermarked with one trailing row. The
paid product has no caps and runs entirely offline.
</div>
</div>
</div>
</section>
<!-- ============= Built for Shopify ============= -->
<section>
<div class="container">
<div class="eyebrow">Built for the Shopify operator</div>
<h2>Six workflows you do every week</h2>
<div class="grid">
<div class="card">
<span class="icon">🧹</span>
<h3>Customer-list cleanup</h3>
<p>Catches the same customer who shows up as <code>john@gmail.com</code>, <code>John@Gmail.com</code>, and <code>j.ohn@gmail.com</code>. Fuzzy match merges the spellings, exact match catches the obvious ones.</p>
</div>
<div class="card">
<span class="icon">📦</span>
<h3>Product catalogue dedup</h3>
<p>SKU whitespace, near-identical product names, copy-paste smart quotes in titles — gone. Audit log shows every change.</p>
</div>
<div class="card">
<span class="icon">🛒</span>
<h3>Abandoned-cart hygiene</h3>
<p>Before re-engagement: dedupe across email + phone, drop sentinels-as-missing, format dates so your sequence triggers fire correctly.</p>
</div>
<div class="card">
<span class="icon">📥</span>
<h3>Subscriber-list import to Klaviyo</h3>
<p>Klaviyo charges per contact. Every duplicate you don't catch costs you for the life of the subscription. Catch them once, pay once.</p>
</div>
<div class="card">
<span class="icon">🔗</span>
<h3>Multi-channel order consolidation</h3>
<p>Orders from Shopify + Etsy + a wholesale spreadsheet, each with a different column for "customer email." Map Columns aligns them; dedup merges across channels.</p>
</div>
<div class="card">
<span class="icon">⚙️</span>
<h3>Repeatable pipeline</h3>
<p>Save the cleanup as a JSON file. Drop next week's export on it. Same cleanup, zero re-configuration. Automatable via the CLI.</p>
</div>
</div>
</div>
</section>
<!-- ============= Privacy moat ============= -->
<section>
<div class="container">
<div class="eyebrow">The thing every cloud cleaner can't say</div>
<h2>Your customer list never leaves your computer.</h2>
<p>
DataTools is a desktop app. There's no upload step, no SaaS account,
no subscription, no "trust our security policy." The first thing you
can do after install is open your browser's network tab, run the
cleaner on your real customer file, and verify zero outbound
requests.
</p>
<div class="callout">
<strong>Why it matters for Shopify:</strong> your customer list is
your single most valuable business asset. Cloud cleaners require
you to upload it. We don't.
</div>
<div class="terminal"><span class="prompt">$</span> python -m src.cli_pipeline customers.csv --apply
Reading customers.csv...
47,832 rows, 14 columns
Executing pipeline:
<span class="ok"></span> text_clean (140 ms) {cells_changed: 12,408}
<span class="ok"></span> format_standardize (810 ms) {cells_changed: 31,202}
<span class="ok"></span> missing (95 ms) {sentinels_standardized: 8,129}
<span class="ok"></span> dedup (3.1 s) {duplicates_removed: 2,347}
Initial rows: 47,832 → Final rows: 45,485
Total elapsed: 4.2 s
<span class="prompt">$</span> # zero network calls. zero. promise.</div>
</div>
</section>
<!-- ============= Audit moat ============= -->
<section>
<div class="container">
<div class="eyebrow">For when your client asks "what changed?"</div>
<h2>Every change auditable. Every cell logged.</h2>
<p>
Every modification is recorded with the original value, the new
value, and which rule fired. Hand the audit CSV to your accountant,
your marketing manager, or your boss along with the cleaned file.
No <em>"I trust the AI"</em> hand-waving — they see exactly what
happened.
</p>
<div class="callout">
<strong>Real example:</strong> the demo above standardized 27
cells across 15 customers. The audit log lists each one — row,
column, before, after, which standardizer fired. The dedup audit
lists every duplicate group with the survivor and its losers.
</div>
</div>
</section>
<!-- ============= International ============= -->
<section>
<div class="container">
<div class="eyebrow">If you sell internationally — most pet brands do</div>
<h2>Phones, addresses, and currencies from anywhere on Earth.</h2>
<p>
Your subscriber from London entered her phone as <code>020 7946
0958</code>. Your Tokyo customer entered <code>03-3210-7000</code>.
Your German wholesale buyer wrote <code>€2.410,75</code>. Excel
thinks all of them are mistakes. DataTools knows what country each
row is from (per-row country column) and parses every one correctly
to E.164 phones, ISO dates, and numeric amounts.
</p>
<ul class="bullets">
<li><strong>50+ country codes</strong> via Google's libphonenumber.</li>
<li><strong>Currency auto-detect</strong> for $ / £ / € / ¥ / R$ / kr / zł — including the EU comma-decimal that breaks Excel.</li>
<li><strong>Address shape detection</strong> for US, UK, Canada, Germany, Australia.</li>
<li><strong>Locale-aware month names</strong> in English, French, German.</li>
</ul>
</div>
</section>
<!-- ============= What you get ============= -->
<section>
<div class="container">
<div class="eyebrow">In the bundle</div>
<h2>Six tools. One pipeline. One $49 download.</h2>
<div class="grid">
<div class="card"><h3>1 · Find Duplicates</h3><p>Fuzzy match (Jaro-Winkler), 5 normalizers, survivor rules, interactive review.</p></div>
<div class="card"><h3>2 · Clean Text</h3><p>Whitespace, smart chars, NBSP, BOM, line endings, case ops.</p></div>
<div class="card"><h3>3 · Standardize Formats</h3><p>Dates, phones, emails, addresses, names, currencies, booleans.</p></div>
<div class="card"><h3>4 · Fix Missing Values</h3><p>Disguised-null detection, profile, mean/median/mode/ffill, drop strategies.</p></div>
<div class="card"><h3>5 · Map Columns</h3><p>Fuzzy auto-rename, target schema, type coercion, required-field defaults.</p></div>
<div class="card"><h3>6 · Automated Workflows</h3><p>Chain tools in recommended order, save/load JSON, automate weekly cleanups.</p></div>
</div>
</div>
</section>
<!-- ============= Pricing ============= -->
<section>
<div class="container">
<div class="eyebrow">Pricing — pay once, own it</div>
<h2>$49. No subscription. No ceiling on rows or files.</h2>
<div class="pricing">
<div class="card featured">
<div class="row"><div class="price">$49</div><div class="price-suffix">one-time</div></div>
<h3>DataTools for Shopify</h3>
<ul>
<li>All 6 tools, full pipeline</li>
<li>Mac · Windows · Linux installers</li>
<li>Code-signed (no Gatekeeper warnings)</li>
<li>Free updates for the v1.x line</li>
<li>Bonus: 3 ready-made Shopify pipelines</li>
</ul>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=shopify-pet" rel="noopener">Buy on Gumroad →</a>
</div>
<div class="card">
<div class="row"><div class="price">$149</div><div class="price-suffix">one-time</div></div>
<h3>Full DataTools Suite</h3>
<p class="muted">Available when 3+ bundles ship. Includes everything in the Shopify pack plus the Bookkeeper and RevOps bundles. Save $48.</p>
<a class="btn btn-ghost btn-large" href="#" aria-disabled="true">Coming when ready</a>
</div>
</div>
</div>
</section>
<!-- ============= FAQ ============= -->
<section>
<div class="container">
<h2>Questions</h2>
<details class="faq">
<summary>Does this work with Shopify Plus?</summary>
<p>Yes — the input is just CSV / Excel from any source. Your Shopify Plus exports work the same as the standard plan, the same as a Shopify-to-CSV pipeline you've stitched together yourself. The cleaner doesn't care.</p>
</details>
<details class="faq">
<summary>How does this compare to Excel's "Remove Duplicates"?</summary>
<p>Excel does <em>exact</em> deduplication. <code>John@Gmail.com</code> and <code>john@gmail.com</code> are different customers to Excel. DataTools fuzzy-matches across case, whitespace, formatting, and even close-but-not-identical strings. The demo above merges 4 customer pairs Excel would leave duplicated.</p>
</details>
<details class="faq">
<summary>How big a file can it handle?</summary>
<p>1 GB CSV with international phones + addresses processes in about 2.5 minutes on a typical workstation. Streaming mode keeps memory bounded regardless of input size — we tested it on 26 million rows.</p>
</details>
<details class="faq">
<summary>Do I need to know Python to use it?</summary>
<p>No. The GUI is a browser interface that opens automatically when you double-click the app. It loads your file, you click Run, you download the cleaned file. The CLI is there for power users who want to script weekly cleanups.</p>
</details>
<details class="faq">
<summary>What about my privacy?</summary>
<p>Your customer list never leaves your computer. There is no cloud component, no telemetry, no "anonymous usage stats." When the app is running you can confirm zero outbound network requests in your browser's developer tools.</p>
</details>
<details class="faq">
<summary>What's your refund policy?</summary>
<p>Try the live demo above on the sample dataset before you buy. If you still find DataTools doesn't fit your workflow within 14 days, email for a refund — no questions asked.</p>
</details>
<details class="faq">
<summary>Will there be updates?</summary>
<p>Yes. The v1.x line is included free for everyone who buys DataTools today. We ship a patch every 30 days adding country support, edge-case fixes, and small features.</p>
</details>
</div>
</section>
<!-- ============= Final CTA ============= -->
<section>
<div class="container" style="text-align: center;">
<h2>Stop deduplicating customers by hand.</h2>
<p class="lead" style="margin: 0 auto 28px;">One $49 download. Mac, Windows, or Linux. Runs offline. Catches the duplicates Excel misses, standardizes the phones from your international customers, and saves a pipeline you can re-run on next week's export.</p>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=shopify-pet" rel="noopener">Get DataTools — $49 →</a>
</div>
</section>
<!-- ============= Footer ============= -->
<footer>
<div class="container">
<div>
<p><strong>DataTools</strong> — local data-cleaning for Shopify, bookkeepers, and RevOps teams.</p>
<p class="muted">© 2026 · Built solo · Shipped from a small office.</p>
</div>
<div>
<p>
<a href="../bookkeeper/">For bookkeepers</a> ·
<a href="../revops/">For RevOps agencies</a><br />
<a href="https://gumroad.com/l/datatools?from=shopify-pet">Buy on Gumroad</a> ·
<a href="mailto:hello@datatools.app">Email support</a>
</p>
</div>
</div>
</footer>
</body>
</html>
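Two of the normalizations this page describes, collapsing `UK` / `U.K.` / `United Kingdom` into one market and reading an EU comma-decimal amount such as `€2.410,75`, can be illustrated with a small sketch. It is a heuristic written for this example only: the alias table, the function names, and the rule that a lone separator followed by exactly three digits means thousands grouping are assumptions, not DataTools behaviour (the page says the product uses a per-row country column instead of guessing).

```python
import re
from decimal import Decimal

# Hypothetical alias table; a real one would cover many more spellings.
COUNTRY_ALIASES = {
    "uk": "GB",
    "gb": "GB",
    "unitedkingdom": "GB",
    "greatbritain": "GB",
}


def normalize_country(value: str) -> str:
    """Map spelling variants to one code; unknown values pass through."""
    key = re.sub(r"[^a-z]", "", value.casefold())
    return COUNTRY_ALIASES.get(key, value.strip())


def parse_amount(text: str) -> Decimal:
    """Parse '$1,234.50' or '€2.410,75' by guessing the decimal separator."""
    cleaned = re.sub(r"[^\d.,-]", "", text)  # drop currency symbols, spaces
    last_dot, last_comma = cleaned.rfind("."), cleaned.rfind(",")
    if last_dot >= 0 and last_comma >= 0:
        # Both present: whichever comes last is the decimal separator.
        decimal_sep = "." if last_dot > last_comma else ","
    elif last_dot >= 0 or last_comma >= 0:
        sep = "." if last_dot >= 0 else ","
        tail = cleaned[max(last_dot, last_comma) + 1:]
        # Assumption: '1,234' or '1.234.567' is grouping, '12,5' is decimal.
        grouping = len(tail) == 3 or cleaned.count(sep) > 1
        decimal_sep = None if grouping else sep
    else:
        decimal_sep = None
    for ch in {".", ","} - {decimal_sep}:
        cleaned = cleaned.replace(ch, "")
    if decimal_sep == ",":
        cleaned = cleaned.replace(",", ".")
    return Decimal(cleaned)


print(parse_amount("€2.410,75"))
print({normalize_country(c) for c in ["UK", "U.K.", "United Kingdom"]})
```

The three-digit rule is the weak point of any separator-guessing approach: `1,234` is ambiguous without knowing the row's locale, which is why a per-row country hint is the more reliable design.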


@@ -0,0 +1,192 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — Find Duplicates</title>
<link rel="stylesheet" href="app.css">
</head>
<body data-page="01_deduplicator">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of <strong>Find Duplicates</strong>, shown with a file imported and a completed run (results + match-group review). <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Tool header -->
<div class="dt-tool-header">
<h1>Find Duplicates</h1>
<div class="dt-tool-header-actions">
<span class="dt-privacy-pill">
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor">
<rect x="4" y="11" width="16" height="10" rx="2"/>
<path d="M8 11V7a4 4 0 018 0v4"/>
</svg>
Runs 100% locally
</span>
<button class="dt-help-btn"><span class="dt-mi">help_outline</span> Help</button>
</div>
</div>
<p class="dt-tool-caption">Find rows that repeat, then keep one and remove the extras.</p>
<div class="dt-spacer"></div>
<!-- File pickup banner (using file from upload screen) -->
<div class="dt-alert info">
<span class="dt-mi">description</span>
<span>Using <strong>customers_export.csv</strong> from the upload screen.</span>
</div>
<button class="dt-btn" style="margin-bottom:4px">Use a different file</button>
<!-- Delimiter selector — delimited-text only (CSV/TSV); omitted for XLSX/XLS.
Shown here because the staged file is customers_export.csv. -->
<div class="dt-field" style="max-width:320px">
<label class="dt-label">Delimiter</label>
<div class="dt-select">Comma (,)</div>
<div class="dt-help-text">Auto-detected on upload. Change if the preview looks wrong.</div>
</div>
<!-- Preview expander (collapsed after a result exists) -->
<details class="dt-expander">
<summary>Preview: customers_export.csv</summary>
<div class="dt-expander-body">
<p class="dt-caption">18,442 rows, 6 columns</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th class="idx"></th><th>name</th><th>email</th><th>city</th><th>phone</th><th>signup_date</th></tr></thead>
<tbody>
<tr><td class="idx">0</td><td>Jane Doe</td><td>jane@acme.io</td><td>Austin</td><td>512-555-0190</td><td>2024-01-04</td></tr>
<tr><td class="idx">1</td><td>jane doe</td><td>JANE@ACME.IO</td><td>austin</td><td>(512) 555-0190</td><td>01/04/2024</td></tr>
<tr><td class="idx">2</td><td>Bob Smith</td><td>bob@globex.com</td><td>Denver</td><td>720-555-7781</td><td>2024-02-11</td></tr>
<tr><td class="idx">3</td><td>R. Smith</td><td>bob@globex.com</td><td>Denver</td><td>720-555-7781</td><td>2024-02-11</td></tr>
</tbody>
</table>
</div>
</div>
</details>
<!-- Basic controls (visible by default) -->
<div class="dt-cols-2">
<div class="dt-field"><label class="dt-label">Match threshold</label>
<div class="dt-slider"><div class="track"><div class="fill" style="width:70%"></div><div class="knob" style="left:70%"></div></div><div class="val">85</div></div>
<div class="dt-help-text">Higher means rows must look more alike to count as a duplicate.</div></div>
<div class="dt-field"><label class="dt-label">When duplicates are found, keep</label>
<div class="dt-select">the most-complete row</div>
<div class="dt-help-text">Which row survives in each group of duplicates.</div></div>
</div>
<!-- Advanced options (single expander; basics live above) -->
<details class="dt-expander">
<summary>Advanced options</summary>
<div class="dt-expander-body">
<p class="dt-help-text" style="margin-top:0">Leave these empty to auto-detect which columns to compare. Otherwise, list the columns that must match <strong>exactly</strong> and the ones that only need to match <strong>approximately</strong> — together these are the columns used to find duplicates.</p>
<div class="dt-cols-2">
<div>
<div class="dt-field"><label class="dt-label">Columns that must match exactly</label>
<div class="dt-multiselect"><span class="dt-ms-chip">email <span class="x"></span></span></div></div>
<div class="dt-field"><label class="dt-label">Columns to match approximately</label>
<div class="dt-multiselect"><span class="dt-ms-chip">name <span class="x"></span></span></div></div>
</div>
<div>
<div class="dt-field"><label class="dt-label">Approximate-match algorithm</label><div class="dt-select">jaro_winkler</div></div>
</div>
</div>
<div class="dt-check on" style="margin-top:6px"><span class="box"><span class="dt-mi">check</span></span> Merge mode — fill missing fields in the surviving row</div>
</div>
</details>
<hr class="dt-divider">
<button class="dt-btn dt-btn-primary dt-btn-block">Find Duplicates</button>
<hr class="dt-divider">
<!-- Results -->
<h2>Results</h2>
<div class="dt-metrics">
<div class="dt-metric"><div class="label">Original rows</div><div class="value">18,442</div></div>
<div class="dt-metric"><div class="label">Duplicate rows</div><div class="value">312</div><div class="delta down">312 removed</div></div>
<div class="dt-metric"><div class="label">Match groups</div><div class="value">147</div></div>
<div class="dt-metric"><div class="label">Rows kept</div><div class="value">18,130</div></div>
</div>
<p class="dt-caption">Preview of an auto-resolved run: each group keeps its auto-picked survivor. Review the groups below to override any pending picks before the final download.</p>
<div class="dt-btn-row" style="max-width:560px">
<button class="dt-btn">Download auto-resolved CSV</button>
<button class="dt-btn">Download removed rows</button>
</div>
<hr class="dt-divider">
<!-- Match groups -->
<h2>Match Groups</h2>
<div class="dt-cols-3" style="max-width:520px">
<button class="dt-btn">Accept All</button>
<button class="dt-btn">Reject All</button>
<button class="dt-btn">Clear Decisions</button>
</div>
<p class="dt-caption" style="margin-top:8px">Differing columns are highlighted. The survivor row is kept; uncheck a row to split it out of the group.</p>
<!-- Match group card 1 -->
<div class="dt-match-card">
<div class="dt-match-head">
<span class="title">Group 1 · 2 rows</span>
<span class="conf"><span class="dt-count-pill success">98% match</span></span>
</div>
<div class="dt-match-body">
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>keep</th><th>name</th><th>email</th><th>city</th><th>phone</th><th>signup_date</th></tr></thead>
<tbody>
<tr class="dt-keep-row"><td><span class="dt-keep-tag">keep</span></td><td>Jane Doe</td><td>jane@acme.io</td><td>Austin</td><td>512-555-0190</td><td>2024-01-04</td></tr>
<tr><td><span class="dt-caption">remove</span></td><td class="dt-cell-flag">jane doe</td><td class="dt-cell-flag">JANE@ACME.IO</td><td class="dt-cell-flag">austin</td><td>(512) 555-0190</td><td class="dt-cell-flag">01/04/2024</td></tr>
</tbody>
</table>
</div>
</div>
</div>
<!-- Match group card 2 -->
<div class="dt-match-card">
<div class="dt-match-head">
<span class="title">Group 2 · 2 rows</span>
<span class="conf"><span class="dt-count-pill warn">87% match</span></span>
</div>
<div class="dt-match-body">
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>keep</th><th>name</th><th>email</th><th>city</th><th>phone</th><th>signup_date</th></tr></thead>
<tbody>
<tr class="dt-keep-row"><td><span class="dt-keep-tag">keep</span></td><td>Bob Smith</td><td>bob@globex.com</td><td>Denver</td><td>720-555-7781</td><td>2024-02-11</td></tr>
<tr><td><span class="dt-caption">remove</span></td><td class="dt-cell-flag">R. Smith</td><td>bob@globex.com</td><td>Denver</td><td>720-555-7781</td><td>2024-02-11</td></tr>
</tbody>
</table>
</div>
</div>
</div>
<p class="dt-caption" style="margin-top:14px">Decisions: 1 merged, 1 pending · Pending groups keep their auto-picked survivor unless you review them.</p>
<button class="dt-btn dt-btn-primary dt-btn-block" style="margin-top:8px">Apply Review Decisions &amp; Download Final CSV</button>
<!-- Processing log -->
<details class="dt-expander" style="margin-top:18px">
<summary>Processing Log</summary>
<div class="dt-expander-body">
<div class="dt-code">[00:00.01] Loaded 18,442 rows from customers_export.csv
[00:00.04] Strategy: exact(email) + fuzzy(name, jaro_winkler ≥ 85)
[00:00.91] Compared 18,442 rows → 147 match groups
[00:01.02] Survivor rule: most-complete · merge=on
[00:01.05] 312 rows flagged for removal</div>
</div>
</details>
<div class="dt-next-step"><span class="dt-mi">arrow_forward</span><span>Duplicates handled — your file is cleaned. Review the result or <a href="home.html">Back to Start here →</a></span><button class="dt-next-step-dismiss" title="Dismiss"></button></div>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>
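The strategy shown in the mockup's processing log, `exact(email) + fuzzy(name, jaro_winkler ≥ 85)` with a most-complete survivor and merge mode on, could look roughly like the sketch below. Everything here is an assumption for illustration: the function names, the choice to require both conditions at once, the normalization (trim plus case-fold), and the 0 to 1 threshold scale (the UI slider shows 85 on a 0 to 100 scale). The Jaro-Winkler code is the textbook formula, not DataTools' own implementation.

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    hit1, hit2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not hit2[j] and s2[j] == ch:
                hit1[i] = hit2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq1 = [c for c, m in zip(s1, hit1) if m]
    seq2 = [c for c, m in zip(s2, hit2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):  # common prefix, capped at 4 chars
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def norm(value: str) -> str:
    return value.strip().casefold()


def find_duplicates(rows, threshold=0.85):
    """Group row indices sharing a normalized email and a similar name."""
    by_email = {}
    for idx, row in enumerate(rows):
        by_email.setdefault(norm(row["email"]), []).append(idx)
    groups = []
    for indices in by_email.values():
        clusters = []
        for idx in indices:
            for cluster in clusters:
                first = rows[cluster[0]]["name"]
                if jaro_winkler(norm(rows[idx]["name"]), norm(first)) >= threshold:
                    cluster.append(idx)
                    break
            else:
                clusters.append([idx])
        groups.extend(c for c in clusters if len(c) > 1)
    return groups


def resolve(rows, group):
    """Keep the most-complete row (first wins ties); fill its gaps."""
    filled = lambda r: sum(1 for v in r.values() if v not in (None, ""))
    survivor = max(group, key=lambda i: filled(rows[i]))
    merged = dict(rows[survivor])
    for i in group:
        for col, val in rows[i].items():
            if merged.get(col) in (None, "") and val not in (None, ""):
                merged[col] = val
    return survivor, merged


rows = [
    {"name": "Jane Doe", "email": "jane@acme.io", "city": "Austin", "phone": ""},
    {"name": "jane doe", "email": "JANE@ACME.IO", "city": "", "phone": "(512) 555-0190"},
    {"name": "Bob Smith", "email": "bob@globex.com", "city": "Denver", "phone": "720-555-7781"},
]
for group in find_duplicates(rows):
    print(group, resolve(rows, group))
```

Grouping by the exact key first keeps the fuzzy comparison confined to small buckets, which avoids comparing every row with every other row. A real engine would likely combine per-column scores into a group confidence such as the percentages shown on the match cards, which this sketch does not attempt.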


@@ -0,0 +1,223 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — Clean Text</title>
<link rel="stylesheet" href="app.css">
<style>
/* Hidden-character badges — mirrors src/core/text_clean.py:hidden_char_css(),
not part of app.css so reproduced inline against the same palette. */
.hidden-char { display: inline-block; padding: 0 2px; margin: 0 1px; border-radius: 3px; font-family: var(--font-mono); font-size: 0.85em; cursor: help; }
.hidden-char.hidden-whitespace { background: #fff3cd; color: #856404; border: 1px solid #ffeaa7; }
.hidden-char.hidden-special { background: #d1ecf1; color: #0c5460; border: 1px solid #bee5eb; }
.hidden-char.hidden-control { background: #f8d7da; color: #721c24; border: 1px solid #f5c6cb; }
</style>
</head>
<body data-page="02_text_cleaner">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of <strong>Clean Text</strong>, shown with a file imported and a completed run (results metrics, changes-by-column, before/after examples, cleaned preview, downloads). <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Tool header -->
<div class="dt-tool-header">
<h1>Clean Text</h1>
<div class="dt-tool-header-actions">
<span class="dt-privacy-pill">
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor">
<rect x="4" y="11" width="16" height="10" rx="2"/>
<path d="M8 11V7a4 4 0 018 0v4"/>
</svg>
Runs 100% locally
</span>
<button class="dt-help-btn"><span class="dt-mi">help_outline</span> Help</button>
</div>
</div>
<p class="dt-tool-caption">Trim extra spaces and strip out odd characters.</p>
<div class="dt-spacer"></div>
<!-- File pickup banner (using file from upload screen) -->
<div class="dt-alert info">
<span class="dt-mi">description</span>
<span>Using <strong>contacts_messy.csv</strong> from the upload screen.</span>
</div>
<button class="dt-btn" style="margin-bottom:4px">Use a different file</button>
<!-- Preview expander (collapsed once a result exists) -->
<details class="dt-expander">
<summary>Preview: contacts_messy.csv</summary>
<div class="dt-expander-body">
<p class="dt-caption">4,120 rows, 4 columns</p>
<div class="dt-check on" style="margin-top:2px"><span class="box"><span class="dt-mi">check</span></span> Show hidden characters</div>
<div style="display:flex;flex-wrap:wrap;align-items:center;gap:14px;margin-top:6px;font-size:12px;color:var(--ink-secondary)">
<span style="display:inline-flex;align-items:center;gap:6px"><span class="hidden-char hidden-whitespace" style="cursor:default">·</span> Whitespace</span>
<span style="display:inline-flex;align-items:center;gap:6px"><span class="hidden-char hidden-special" style="cursor:default"></span> Smart / special</span>
<span style="display:inline-flex;align-items:center;gap:6px"><span class="hidden-char hidden-control" style="cursor:default"></span> Control</span>
</div>
<div class="dt-table-wrap" style="margin-top:8px">
<table class="dt-table">
<thead><tr><th class="idx"></th><th>name</th><th>email</th><th>company</th><th>notes</th></tr></thead>
<tbody>
<tr><td class="idx">0</td><td><span class="hidden-char hidden-whitespace" title="U+0020 SP LEAD">·</span>Jane Doe<span class="hidden-char hidden-whitespace" title="U+0020 SP TRAIL">·</span></td><td>jane@acme.io</td><td>Acme<span class="hidden-char hidden-whitespace" title="U+00A0 NBSP">·</span>Inc.</td><td>VIP<span class="hidden-char hidden-special" title="U+201D RIGHT DOUBLE QUOTE"></span></td></tr>
<tr><td class="idx">1</td><td>Bob&nbsp;&nbsp;Smith</td><td>bob@globex.com<span class="hidden-char hidden-special" title="U+200B ZWSP"></span></td><td>Globex</td><td><span class="hidden-char hidden-control" title="U+0007 CTRL"></span></td></tr>
<tr><td class="idx">2</td><td>Ana López</td><td>ana@initech.com</td><td>Initech<span class="hidden-char hidden-whitespace" title="U+0020 SP TRAIL">·</span></td><td>follow&nbsp;up</td></tr>
<tr><td class="idx">3</td><td><span class="hidden-char hidden-whitespace" title="U+0009 TAB"></span>Wei Chen</td><td>WEI@umbrella.co</td><td>Umbrella</td><td>“key<span class="hidden-char hidden-special" title="U+2014 EM DASH"></span>account”</td></tr>
</tbody>
</table>
</div>
</div>
</details>
<hr class="dt-divider">
<!-- Options expander (collapsed once a result exists) -->
<details class="dt-expander">
<summary>Options</summary>
<div class="dt-expander-body">
<div class="dt-field">
<label class="dt-label">Preset</label>
<div class="dt-radio-row">
<span class="dt-radio on"><span class="dot"></span> excel-hygiene (recommended)</span>
<span class="dt-radio"><span class="dot"></span> minimal</span>
<span class="dt-radio"><span class="dot"></span> paranoid</span>
</div>
<div class="dt-help-text">
minimal: trim and collapse whitespace only — no character substitutions.<br>
excel-hygiene: trim, collapse whitespace, fold smart quotes, strip invisible chars, normalize line endings, and normalize accented characters.<br>
paranoid: everything in excel-hygiene plus strip control characters, strip BOM, and normalize accented and look-alike characters (lossy).
</div>
</div>
<details class="dt-expander">
<summary>Advanced options</summary>
<div class="dt-expander-body">
<div class="dt-cols-2">
<div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Trim leading/trailing whitespace</div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Collapse internal whitespace</div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Normalize line endings (\r\n → \n)</div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Strip control characters</div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Strip BOM</div>
</div>
<div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Fold smart characters (curly quotes, em-dash, NBSP)</div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Strip zero-width / invisible characters</div>
<div class="dt-check on" title="Unicode NFC normalization"><span class="box"><span class="dt-mi">check</span></span> Normalize accented characters (NFC)</div>
<div class="dt-check" title="Unicode NFKC compatibility fold"><span class="box"></span> Normalize accented and look-alike characters (lossy: ① → 1, &#xFB01; → fi)</div>
</div>
</div>
<h4>Scope</h4>
<div class="dt-field">
<label class="dt-label">Columns to clean (default: all string columns)</label>
<div class="dt-multiselect">
<span class="dt-ms-chip">name <span class="x"></span></span>
<span class="dt-ms-chip">email <span class="x"></span></span>
<span class="dt-ms-chip">company <span class="x"></span></span>
<span class="dt-ms-chip">notes <span class="x"></span></span>
</div>
</div>
<div class="dt-field">
<label class="dt-label">Columns to skip even if they look like text</label>
<div class="dt-multiselect"><span class="dt-ms-placeholder">Choose columns to leave untouched</span></div>
</div>
<h4>Case conversion</h4>
<div class="dt-field" style="max-width:360px">
<label class="dt-label">Apply case conversion to selected columns</label>
<div class="dt-select">None</div>
</div>
</div>
</details>
</div>
</details>
<hr class="dt-divider">
<button class="dt-btn dt-btn-primary dt-btn-block">Clean Text</button>
<hr class="dt-divider">
<!-- Results -->
<h2>Results</h2>
<div class="dt-metrics">
<div class="dt-metric"><div class="label">Cells scanned</div><div class="value">16,480</div></div>
<div class="dt-metric"><div class="label">Cells changed</div><div class="value">3,947</div></div>
<div class="dt-metric"><div class="label">% changed</div><div class="value">24.0%</div></div>
<div class="dt-metric"><div class="label">Columns processed</div><div class="value">4</div></div>
</div>
<div class="dt-field">
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Show hidden characters (NBSP, ZWSP, smart quotes, control chars…)</div>
<div class="dt-help-text">Same setting as “Show hidden characters” in the preview above — toggling either updates both.</div>
</div>
<h4>Changes by column</h4>
<div class="dt-table-wrap" style="max-width:360px">
<table class="dt-table">
<thead><tr><th>column</th><th>cells_changed</th></tr></thead>
<tbody>
<tr><td>company</td><td>1,604</td></tr>
<tr><td>name</td><td>1,210</td></tr>
<tr><td>notes</td><td>982</td></tr>
<tr><td>email</td><td>151</td></tr>
</tbody>
</table>
</div>
<h4>Examples (first 25 changes)</h4>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>Row</th><th>Column</th><th>Before</th><th>After</th><th>Ops applied</th></tr></thead>
<tbody>
<tr><td>1</td><td>name</td><td><span class="hidden-char hidden-whitespace" title="U+0020 SP LEAD">·</span>Jane Doe<span class="hidden-char hidden-whitespace" title="U+0020 SP TRAIL">·</span></td><td>Jane Doe</td><td>trim</td></tr>
<tr><td>1</td><td>company</td><td>Acme<span class="hidden-char hidden-whitespace" title="U+00A0 NBSP">·</span>Inc.</td><td>Acme Inc.</td><td>fold_smart</td></tr>
<tr><td>1</td><td>notes</td><td>VIP<span class="hidden-char hidden-special" title="U+201D RIGHT DOUBLE QUOTE"></span></td><td>VIP"</td><td>fold_smart</td></tr>
<tr><td>2</td><td>name</td><td>Bob<span class="hidden-char hidden-whitespace" title="U+0020 SP">·</span><span class="hidden-char hidden-whitespace" title="U+0020 SP">·</span>Smith</td><td>Bob Smith</td><td>collapse_ws</td></tr>
<tr><td>2</td><td>email</td><td>bob@globex.com<span class="hidden-char hidden-special" title="U+200B ZWSP"></span></td><td>bob@globex.com</td><td>strip_zero_width</td></tr>
<tr><td>2</td><td>notes</td><td><span class="hidden-char hidden-control" title="U+0007 CTRL"></span></td><td></td><td>strip_control</td></tr>
<tr><td>3</td><td>company</td><td>Initech<span class="hidden-char hidden-whitespace" title="U+0020 SP TRAIL">·</span></td><td>Initech</td><td>trim</td></tr>
<tr><td>4</td><td>name</td><td><span class="hidden-char hidden-whitespace" title="U+0009 TAB"></span>Wei Chen</td><td>Wei Chen</td><td>trim</td></tr>
<tr><td>4</td><td>notes</td><td>“key<span class="hidden-char hidden-special" title="U+2014 EM DASH"></span>account”</td><td>"key-account"</td><td>fold_smart, nfc</td></tr>
</tbody>
</table>
</div>
<h4>Cleaned preview (first 10 rows)</h4>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th class="idx"></th><th>name</th><th>email</th><th>company</th><th>notes</th></tr></thead>
<tbody>
<tr><td class="idx">0</td><td class="dt-cell-add">Jane Doe</td><td>jane@acme.io</td><td class="dt-cell-add">Acme Inc.</td><td class="dt-cell-add">VIP"</td></tr>
<tr><td class="idx">1</td><td class="dt-cell-add">Bob Smith</td><td class="dt-cell-add">bob@globex.com</td><td>Globex</td><td class="dt-cell-add"></td></tr>
<tr><td class="idx">2</td><td>Ana López</td><td>ana@initech.com</td><td class="dt-cell-add">Initech</td><td>follow up</td></tr>
<tr><td class="idx">3</td><td class="dt-cell-add">Wei Chen</td><td>WEI@umbrella.co</td><td>Umbrella</td><td class="dt-cell-add">"key-account"</td></tr>
</tbody>
</table>
</div>
<p class="dt-caption">Changed cells highlighted. Toggle “Show hidden characters” to inspect the invisibles being removed.</p>
<hr class="dt-divider">
<!-- Downloads -->
<div class="dt-cols-3">
<button class="dt-btn dt-btn-primary">Download cleaned CSV</button>
<button class="dt-btn">Download changes audit</button>
<button class="dt-btn">Download config JSON</button>
</div>
<!-- Next-step suggestion -->
<div class="dt-next-step"><span class="dt-mi">arrow_forward</span><span>Text cleaned. Next, most files need: <a href="03_format_standardizer.html">Standardize Formats →</a></span><button class="dt-next-step-dismiss" title="Dismiss"></button></div>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>

@@ -0,0 +1,265 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — Standardize Formats</title>
<link rel="stylesheet" href="app.css">
</head>
<body data-page="03_format_standardizer">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of <strong>Standardize Formats</strong>, shown with a file imported from the upload screen and a completed run (results + changes audit + standardized preview). <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Tool header -->
<div class="dt-tool-header">
<h1>Standardize Formats</h1>
<div class="dt-tool-header-actions">
<span class="dt-privacy-pill">
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor">
<rect x="4" y="11" width="16" height="10" rx="2"/>
<path d="M8 11V7a4 4 0 018 0v4"/>
</svg>
Runs 100% locally
</span>
<button class="dt-help-btn"><span class="dt-mi">help_outline</span> Help</button>
</div>
</div>
<p class="dt-tool-caption">Make dates, phones, currency, and names look the same throughout.</p>
<div class="dt-spacer"></div>
<!-- File pickup banner (using file from upload screen) -->
<div class="dt-alert info">
<span class="dt-mi">description</span>
<span>Using <strong>customers_export.csv</strong> from the upload screen.</span>
</div>
<button class="dt-btn" style="margin-bottom:4px">Use a different file</button>
<!-- Preview expander (collapsed once a result exists) -->
<details class="dt-expander">
<summary>Preview: customers_export.csv</summary>
<div class="dt-expander-body">
<p class="dt-caption">18,442 rows, 6 columns</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th class="idx"></th><th>full_name</th><th>phone</th><th>amount</th><th>signup_date</th><th>active</th></tr></thead>
<tbody>
<tr><td class="idx">0</td><td>jane DOE</td><td>(512) 555-0190</td><td>$1,234.5</td><td>01/04/2024</td><td>Y</td></tr>
<tr><td class="idx">1</td><td>bob smith</td><td>720.555.7781</td><td>$99</td><td>2024-2-11</td><td>yes</td></tr>
<tr><td class="idx">2</td><td>ALICIA REYES</td><td>+1 415 555 2233</td><td>$45,000</td><td>Mar 3, 2024</td><td>n</td></tr>
<tr><td class="idx">3</td><td>m. okafor</td><td>2125550148</td><td>$7.999</td><td>2024/04/22</td><td>true</td></tr>
</tbody>
</table>
</div>
</div>
</details>
<hr class="dt-divider">
<!-- Options expander (collapsed after run; opened here to show the most informative content) -->
<details class="dt-expander" open>
<summary>Options</summary>
<div class="dt-expander-body">
<h3 style="margin-top:0">Column types</h3>
<p class="dt-caption">Assign each column to a field type. Auto-detected suggestions are pre-filled; pick <strong>(skip)</strong> to leave a column untouched.</p>
<!-- Per-column type selectboxes, 3 per row -->
<div class="dt-cols-3">
<div class="dt-field"><label class="dt-label">full_name</label><div class="dt-select">Name</div></div>
<div class="dt-field"><label class="dt-label">phone</label><div class="dt-select">Phone</div></div>
<div class="dt-field"><label class="dt-label">amount</label><div class="dt-select">Currency</div></div>
</div>
<div class="dt-cols-3">
<div class="dt-field"><label class="dt-label">signup_date</label><div class="dt-select">Date</div></div>
<div class="dt-field"><label class="dt-label">active</label><div class="dt-select">Boolean</div></div>
<div class="dt-field"><label class="dt-label">notes</label><div class="dt-select">(skip)</div></div>
</div>
<hr class="dt-divider">
<h3>Format options</h3>
<!-- Standards preset radio (vertical). Demo state: preset has auto-switched
to Custom because individual controls below diverge from the European base. -->
<div class="dt-field">
<label class="dt-label">Standards preset</label>
<div style="display:flex;flex-direction:column;gap:8px;margin-top:4px">
<span class="dt-radio" title="E.164 phones"><span class="dot"></span> US (default) — ISO 8601 dates · international-format phones (+1…) · USD</span>
<span class="dt-radio"><span class="dot"></span> European — DMY input · INTL phones · EUR comma decimal <span class="dt-count-pill info" style="margin-left:4px">base</span></span>
<span class="dt-radio"><span class="dot"></span> UK — DD/MM/YYYY · GB phones · Yes/No booleans</span>
<span class="dt-radio"><span class="dot"></span> ISO Strict — ISO 8601 · bare-number currency · true/false</span>
<span class="dt-radio"><span class="dot"></span> Legacy US — MM/DD/YYYY · National phones · Yes/No</span>
<span class="dt-radio on"><span class="dot"></span> Custom — based on <strong>European</strong>, 2 controls changed <span class="dt-count-pill warn" style="margin-left:4px">modified</span></span>
</div>
<div class="dt-precedence" style="margin-top:10px">
<span class="dt-mi">rule</span>
<span>Individual controls win over the preset. You started from <strong>European</strong>, then changed <strong>Ambiguous input order</strong> and <strong>Decimal separator</strong> below — so the preset is now <strong>Custom</strong>. The controls' current values are what actually run.</span>
</div>
<div class="dt-help-text">Pick a published standard or regional convention as the baseline. Every option below is still individually overridable; overriding any one switches the preset to Custom.</div>
</div>
<!-- Two-column format options -->
<div class="dt-cols-2" style="margin-top:14px">
<!-- Left column: Dates + Phones -->
<div>
<h4 style="margin-top:0"><strong>Dates</strong></h4>
<div class="dt-field"><label class="dt-label">Output format</label><div class="dt-select">YYYY-MM-DD (ISO)</div></div>
<div class="dt-field">
<label class="dt-label">Ambiguous input order (e.g. 01/02/2024) <span class="dt-count-pill warn" style="margin-left:4px">changed</span></label>
<div class="dt-radio-row">
<span class="dt-radio on"><span class="dot"></span> MDY (US)</span>
<span class="dt-radio"><span class="dot"></span> DMY (EU)</span>
</div>
<div class="dt-help-text">Winning value: <strong>MDY</strong>. Overrides the European base (DMY) — <code>01/02/2024</code> reads as <strong>2024-01-02</strong>.</div>
</div>
<h4><strong>Phones</strong></h4>
<div class="dt-field"><label class="dt-label" title="E.164">Output format</label><div class="dt-select" title="E.164">Standard international format (+15551234567)</div></div>
<div class="dt-field">
<label class="dt-label">Default region (ISO-2)</label>
<div class="dt-input">US</div>
<div class="dt-help-text">Region used when the input has no country code. US, GB, DE, etc.</div>
</div>
</div>
<!-- Right column: Currency + Names + Booleans -->
<div>
<h4 style="margin-top:0"><strong>Currency</strong></h4>
<div class="dt-field">
<label class="dt-label">Decimal separator in input <span class="dt-count-pill warn" style="margin-left:4px">changed</span></label>
<div class="dt-radio-row">
<span class="dt-radio on"><span class="dot"></span> dot (1,234.56)</span>
<span class="dt-radio"><span class="dot"></span> comma (1.234,56)</span>
</div>
<div class="dt-help-text">Winning value: <strong>dot</strong>. Overrides the European base (comma) — <code>$1,234.5</code> reads as <strong>1234.50</strong>.</div>
</div>
<div class="dt-field" style="max-width:200px"><label class="dt-label">Round to decimals</label><div class="dt-input">2</div></div>
<div class="dt-check"><span class="box"></span> Preserve original precision (don't round)</div>
<div class="dt-check"><span class="box"></span> Preserve currency code (emit <code>USD 1234.56</code>, <code>EUR 99.00</code>, etc.)</div>
<h4><strong>Names</strong></h4>
<div class="dt-field"><label class="dt-label">Casing</label><div class="dt-select">Title Case</div></div>
<h4><strong>Booleans</strong></h4>
<div class="dt-field"><label class="dt-label">Output style</label><div class="dt-select">True/False</div></div>
</div>
</div>
</div>
</details>
<hr class="dt-divider">
<button class="dt-btn dt-btn-primary dt-btn-block">Standardize Formats</button>
<hr class="dt-divider">
<!-- Results -->
<h2>Results</h2>
<div class="dt-metrics">
<div class="dt-metric"><div class="label">Cells scanned</div><div class="value">92,210</div></div>
<div class="dt-metric"><div class="label">Cells changed</div><div class="value">61,838</div></div>
<div class="dt-metric"><div class="label">% changed</div><div class="value">67.1%</div></div>
<div class="dt-metric"><div class="label">Unparseable</div><div class="value">47</div></div>
</div>
<div class="dt-alert info">
<span class="dt-mi">info</span>
<span>47 cell(s) in typed columns didn't match a recognizable shape and were left as-is. See <strong>Unparseable cells</strong> below to review them, or re-classify the column to <strong>(skip)</strong>. (They aren't in the changes audit — nothing was changed.)</span>
</div>
<!-- Unparseable cells surface (the alert points here; these are left-as-is, so they never appear in the CHANGES audit) -->
<details class="dt-expander">
<summary>Unparseable cells (47)</summary>
<div class="dt-expander-body">
<p class="dt-caption">Cells in typed columns that didn't match a recognizable shape and were left unchanged.</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>row</th><th>column</th><th>field_type</th><th>value (left as-is)</th></tr></thead>
<tbody>
<tr><td>318</td><td>signup_date</td><td>date</td><td class="dt-cell-flag">soon</td></tr>
<tr><td>902</td><td>phone</td><td>phone</td><td class="dt-cell-flag">ext. 4471</td></tr>
<tr><td>1,544</td><td>amount</td><td>currency</td><td class="dt-cell-flag">TBD</td></tr>
<tr><td>2,087</td><td>active</td><td>boolean</td><td class="dt-cell-flag">maybe</td></tr>
<tr><td>3,610</td><td>signup_date</td><td>date</td><td class="dt-cell-flag">00/00/0000</td></tr>
</tbody>
</table>
</div>
<p class="dt-caption" style="margin-top:8px">… and 42 more.</p>
</div>
</details>
<!-- Changes by column -->
<p style="margin-bottom:6px"><strong>Changes by column</strong></p>
<div class="dt-table-wrap" style="max-width:520px">
<table class="dt-table">
<thead><tr><th>column</th><th>field_type</th><th>cells_changed</th></tr></thead>
<tbody>
<tr><td>amount</td><td>currency</td><td>17,902</td></tr>
<tr><td>full_name</td><td>name</td><td>16,041</td></tr>
<tr><td>phone</td><td>phone</td><td>14,388</td></tr>
<tr><td>signup_date</td><td>date</td><td>11,205</td></tr>
<tr><td>active</td><td>boolean</td><td>2,302</td></tr>
</tbody>
</table>
</div>
<!-- Examples (first 25 changes) -->
<p style="margin:14px 0 6px"><strong>Examples (first 25 changes)</strong></p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>row</th><th>column</th><th>field_type</th><th>before</th><th>after</th></tr></thead>
<tbody>
<tr><td>1</td><td>full_name</td><td>name</td><td class="dt-cell-del">jane DOE</td><td class="dt-cell-add">Jane Doe</td></tr>
<tr><td>1</td><td>phone</td><td>phone</td><td class="dt-cell-del">(512) 555-0190</td><td class="dt-cell-add">+15125550190</td></tr>
<tr><td>1</td><td>amount</td><td>currency</td><td class="dt-cell-del">$1,234.5</td><td class="dt-cell-add">1234.50</td></tr>
<tr><td>1</td><td>signup_date</td><td>date</td><td class="dt-cell-del">01/04/2024</td><td class="dt-cell-add">2024-01-04</td></tr>
<tr><td>1</td><td>active</td><td>boolean</td><td class="dt-cell-del">Y</td><td class="dt-cell-add">True</td></tr>
<tr><td>2</td><td>full_name</td><td>name</td><td class="dt-cell-del">bob smith</td><td class="dt-cell-add">Bob Smith</td></tr>
<tr><td>2</td><td>phone</td><td>phone</td><td class="dt-cell-del">720.555.7781</td><td class="dt-cell-add">+17205557781</td></tr>
<tr><td>2</td><td>signup_date</td><td>date</td><td class="dt-cell-del">2024-2-11</td><td class="dt-cell-add">2024-02-11</td></tr>
<tr><td>3</td><td>signup_date</td><td>date</td><td class="dt-cell-del">Mar 3, 2024</td><td class="dt-cell-add">2024-03-03</td></tr>
<tr><td>4</td><td>amount</td><td>currency</td><td class="dt-cell-del">$7.999</td><td class="dt-cell-add">8.00</td></tr>
</tbody>
</table>
</div>
<!-- Standardized preview -->
<p style="margin:14px 0 6px"><strong>Standardized preview (first 10 rows)</strong></p>
<p class="dt-caption" style="margin:0 0 6px">Showing 5 of 6 columns — <code>notes</code> is set to (skip), so it's omitted here.</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th class="idx"></th><th>full_name</th><th>phone</th><th>amount</th><th>signup_date</th><th>active</th></tr></thead>
<tbody>
<tr><td class="idx">0</td><td>Jane Doe</td><td>+15125550190</td><td>1234.50</td><td>2024-01-04</td><td>True</td></tr>
<tr><td class="idx">1</td><td>Bob Smith</td><td>+17205557781</td><td>99.00</td><td>2024-02-11</td><td>True</td></tr>
<tr><td class="idx">2</td><td>Alicia Reyes</td><td>+14155552233</td><td>45000.00</td><td>2024-03-03</td><td>False</td></tr>
<tr><td class="idx">3</td><td>M. Okafor</td><td>+12125550148</td><td>8.00</td><td>2024-04-22</td><td>True</td></tr>
</tbody>
</table>
</div>
<hr class="dt-divider">
<!-- Downloads (3 columns) -->
<div class="dt-cols-3">
<button class="dt-btn dt-btn-primary">Download standardized CSV</button>
<button class="dt-btn">Download changes audit</button>
<button class="dt-btn">Download config JSON</button>
</div>
<!-- Next-step suggestion -->
<div class="dt-next-step"><span class="dt-mi">arrow_forward</span><span>Formats standardized. Next, most files need: <a href="04_missing_handler.html">Fix Missing Values →</a></span><button class="dt-next-step-dismiss" title="Dismiss"></button></div>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>

@@ -0,0 +1,263 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — Fix Missing Values</title>
<link rel="stylesheet" href="app.css">
</head>
<body data-page="04_missing_handler">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of <strong>Fix Missing Values</strong>, shown with a file imported and a completed run (per-column missingness profile + before/after results). <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Tool header -->
<div class="dt-tool-header">
<h1>Fix Missing Values</h1>
<div class="dt-tool-header-actions">
<span class="dt-privacy-pill">
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor">
<rect x="4" y="11" width="16" height="10" rx="2"/>
<path d="M8 11V7a4 4 0 018 0v4"/>
</svg>
Runs 100% locally
</span>
<button class="dt-help-btn"><span class="dt-mi">help_outline</span> Help</button>
</div>
</div>
<p class="dt-tool-caption">Find blank cells (even hidden ones) and fill them in or remove them.</p>
<div class="dt-spacer"></div>
<!-- File pickup banner (using file from upload screen) -->
<div class="dt-alert info">
<span class="dt-mi">description</span>
<span>Using <strong>survey_responses.csv</strong> from the upload screen.</span>
</div>
<button class="dt-btn" style="margin-bottom:4px">Use a different file</button>
<!-- Preview expander (collapsed after a result exists) -->
<details class="dt-expander">
<summary>Preview: survey_responses.csv</summary>
<div class="dt-expander-body">
<p class="dt-caption">2,150 rows, 6 columns</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th class="idx"></th><th>respondent_id</th><th>age</th><th>region</th><th>income</th><th>satisfaction</th><th>comments</th></tr></thead>
<tbody>
<tr><td class="idx">0</td><td>R-1001</td><td>34</td><td>West</td><td>52000</td><td>4</td><td>great service</td></tr>
<tr><td class="idx">1</td><td>R-1002</td><td class="dt-cell-flag">N/A</td><td>East</td><td class="dt-cell-flag"></td><td>3</td><td class="dt-cell-flag">?</td></tr>
<tr><td class="idx">2</td><td>R-1003</td><td>41</td><td class="dt-cell-flag">-</td><td>61000</td><td class="dt-cell-flag">NULL</td><td>none</td></tr>
<tr><td class="idx">3</td><td>R-1004</td><td>29</td><td>South</td><td class="dt-cell-flag">N/A</td><td>5</td><td>quick</td></tr>
</tbody>
</table>
</div>
</div>
</details>
<hr class="dt-divider">
<!-- Missingness profile — always visible: see the damage before configuring -->
<h2>Missingness profile</h2>
<div class="dt-metrics">
<div class="dt-metric"><div class="label">Rows</div><div class="value">2,150</div></div>
<div class="dt-metric"><div class="label">Cells missing</div><div class="value">1,043</div></div>
<div class="dt-metric"><div class="label">% cells missing</div><div class="value">8.1%</div></div>
<div class="dt-metric"><div class="label">Complete rows</div><div class="value">1,388</div></div>
</div>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>column</th><th>dtype</th><th>missing</th><th>missing_pct</th><th>disguised</th><th>has_missing</th></tr></thead>
<tbody>
<tr><td>respondent_id</td><td>object</td><td>0</td><td>0.0%</td><td>0</td><td>False</td></tr>
<tr><td>age</td><td>float64</td><td>187</td><td>8.7%</td><td>61</td><td>True</td></tr>
<tr><td>region</td><td>object</td><td>142</td><td>6.6%</td><td>142</td><td>True</td></tr>
<tr><td>income</td><td>float64</td><td>329</td><td>15.3%</td><td>118</td><td>True</td></tr>
<tr><td>satisfaction</td><td>float64</td><td>95</td><td>4.4%</td><td>40</td><td>True</td></tr>
<tr><td>comments</td><td>object</td><td>290</td><td>13.5%</td><td>290</td><td>True</td></tr>
</tbody>
</table>
</div>
<hr class="dt-divider">
<!-- Options expander (Strategy) — configuration follows the diagnostic -->
<details class="dt-expander" open>
<summary>Options</summary>
<div class="dt-expander-body">
<h3>Strategy</h3>
<div class="dt-precedence">
<span class="dt-mi">layers</span>
<span>Resolution order: <strong>per-column override</strong> → <strong>global strategy</strong> → <strong>preset</strong>. The most specific setting wins; layers it overrides are dimmed.</span>
</div>
<div class="dt-field">
<label class="dt-label">Preset</label>
<div class="dt-help-text" style="color:var(--warn);display:flex;align-items:center;gap:5px;margin-bottom:8px"><span class="dt-mi" style="font-family:'Material Symbols Outlined';font-size:15px;line-height:1">info</span> Overridden by <strong>Global strategy → median</strong> (set under Advanced options). Presets apply only when global is &ldquo;(use preset)&rdquo;.</div>
<div class="dt-radio-row is-overridden" style="flex-direction:column;gap:10px">
<span class="dt-radio on"><span class="dot"></span> detect-only (standardize sentinels to NaN, no fill or drop)</span>
<span class="dt-radio"><span class="dot"></span> safe-fill (numeric → median, categorical → mode)</span>
<span class="dt-radio"><span class="dot"></span> drop-incomplete (drop any row with missing)</span>
</div>
<div class="dt-help-text">detect-only: replace 'N/A', '-', 'NULL', etc. with real NaN, then stop. safe-fill: also fill — numeric columns with median, others with mode. drop-incomplete: also drop every row that has any missing cell.</div>
</div>
<!-- Advanced options expander (open — most informative) -->
<details class="dt-expander" open>
<summary>Advanced options</summary>
<div class="dt-expander-body">
<div class="dt-cols-2">
<div>
<h4>Detection</h4>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Standardize disguised nulls to NaN</div>
<div class="dt-field">
<label class="dt-label" title="Sentinel values">Blanks in disguise (N/A, dash, NULL) — comma-separated</label>
<div class="dt-input">N/A, n/a, NA, NULL, null, None, -, --, ?, #N/A</div>
<div class="dt-help-text">Text that really means &ldquo;empty.&rdquo; Matched case-insensitively after stripping whitespace.</div>
</div>
</div>
<div>
<h4>Strategy override</h4>
<div class="dt-field">
<label class="dt-label">Global strategy</label>
<div class="dt-select">median</div>
<div class="dt-help-text">drop_row / drop_col use the thresholds below. mean / median / interpolate are numeric only — non-numeric columns fall back to the categorical strategy.</div>
</div>
<div class="dt-field">
<label class="dt-label">Categorical fallback (for non-numeric columns)</label>
<div class="dt-select">mode</div>
</div>
</div>
</div>
<h4>Drop thresholds</h4>
<div class="dt-cols-2">
<div class="dt-field">
<label class="dt-label">Row drop threshold (drop rows with ≥ this fraction missing across selected cols)</label>
<div class="dt-slider"><div class="track"><div class="fill" style="width:100%"></div><div class="knob" style="left:calc(100% - 8px)"></div></div><div class="val">1.00</div></div>
</div>
<div class="dt-field">
<label class="dt-label">Column drop threshold (drop columns with ≥ this fraction missing)</label>
<div class="dt-slider"><div class="track"><div class="fill" style="width:100%"></div><div class="knob" style="left:calc(100% - 8px)"></div></div><div class="val">1.00</div></div>
</div>
</div>
<h4>Scope</h4>
<div class="dt-field">
<label class="dt-label">Columns to handle (default: all)</label>
<div class="dt-multiselect">
<span class="dt-ms-chip">respondent_id <span class="x"></span></span>
<span class="dt-ms-chip">age <span class="x"></span></span>
<span class="dt-ms-chip">region <span class="x"></span></span>
<span class="dt-ms-chip">income <span class="x"></span></span>
<span class="dt-ms-chip">satisfaction <span class="x"></span></span>
<span class="dt-ms-chip">comments <span class="x"></span></span>
</div>
</div>
<div class="dt-field">
<label class="dt-label">Columns to skip</label>
<div class="dt-multiselect"><span class="dt-ms-placeholder">Choose columns</span></div>
</div>
<h4>Per-column strategy overrides (optional)</h4>
<p class="dt-caption">Set a different strategy for specific columns. Leave a row on <strong>(global)</strong> to use the global strategy.</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>Column</th><th>Override</th><th>Resolves to</th></tr></thead>
<tbody>
<tr><td>age</td><td><span class="dt-select" style="display:inline-block;min-width:160px;padding:4px 24px 4px 10px;color:var(--ink-tertiary)">(global)</span></td><td>median <span style="color:var(--ink-tertiary);font-size:11px">· global</span></td></tr>
<tr><td>region</td><td><span class="dt-select" style="display:inline-block;min-width:160px;padding:4px 24px 4px 10px;color:var(--ink-tertiary)">(global)</span></td><td>mode <span style="color:var(--ink-tertiary);font-size:11px">· global → categorical fallback</span></td></tr>
<tr><td>income</td><td><span class="dt-select" style="display:inline-block;min-width:160px;padding:4px 24px 4px 10px;color:var(--ink-tertiary)">(global)</span></td><td>median <span style="color:var(--ink-tertiary);font-size:11px">· global</span></td></tr>
<tr><td>satisfaction</td><td><span class="dt-select" style="display:inline-block;min-width:160px;padding:4px 24px 4px 10px;color:var(--ink-tertiary)">(global)</span></td><td>median <span style="color:var(--ink-tertiary);font-size:11px">· global</span></td></tr>
<tr><td>comments</td><td><span class="dt-select" style="display:inline-block;min-width:160px;padding:4px 24px 4px 10px">constant</span></td><td><strong>constant</strong> <span style="color:var(--ink-tertiary);font-size:11px">· this column</span></td></tr>
</tbody>
</table>
</div>
</div>
</details>
</div>
</details>
<hr class="dt-divider">
<button class="dt-btn dt-btn-primary dt-btn-block">Handle Missing Values</button>
<hr class="dt-divider">
<!-- Results -->
<div id="missing-results-anchor"></div>
<h2>Results</h2>
<div class="dt-metrics">
<div class="dt-metric"><div class="label">Sentinels → NaN</div><div class="value">651</div></div>
<div class="dt-metric"><div class="label">Cells filled</div><div class="value">1,043</div></div>
<div class="dt-metric"><div class="label">Rows dropped</div><div class="value">0</div></div>
<div class="dt-metric"><div class="label">Columns dropped</div><div class="value">0</div></div>
</div>
<p><strong>Missingness — before vs. after</strong></p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>column</th><th>before_missing</th><th>before_pct</th><th>after_missing</th><th>after_pct</th><th>strategy</th></tr></thead>
<tbody>
<tr><td>respondent_id</td><td>0</td><td>0.0</td><td>0</td><td>0.0</td><td class="dt-cell-flag"></td></tr>
<tr><td>age</td><td class="dt-cell-flag">187</td><td>8.7</td><td class="dt-cell-add">0</td><td class="dt-cell-add">0.0</td><td>median</td></tr>
<tr><td>region</td><td class="dt-cell-flag">142</td><td>6.6</td><td class="dt-cell-add">0</td><td class="dt-cell-add">0.0</td><td>mode</td></tr>
<tr><td>income</td><td class="dt-cell-flag">329</td><td>15.3</td><td class="dt-cell-add">0</td><td class="dt-cell-add">0.0</td><td>median</td></tr>
<tr><td>satisfaction</td><td class="dt-cell-flag">95</td><td>4.4</td><td class="dt-cell-add">0</td><td class="dt-cell-add">0.0</td><td>median</td></tr>
<tr><td>comments</td><td class="dt-cell-flag">290</td><td>13.5</td><td class="dt-cell-add">0</td><td class="dt-cell-add">0.0</td><td>constant</td></tr>
</tbody>
</table>
</div>
<p><strong>Audit (first 50 changes)</strong></p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>row</th><th>column</th><th>old_value</th><th>new_value</th><th>reason</th></tr></thead>
<tbody>
<tr><td>2</td><td>age</td><td class="dt-cell-flag">N/A</td><td class="dt-cell-add">37.0</td><td>fill: median</td></tr>
<tr><td>2</td><td>income</td><td class="dt-cell-flag">(blank)</td><td class="dt-cell-add">54000.0</td><td>fill: median</td></tr>
<tr><td>2</td><td>comments</td><td class="dt-cell-flag">?</td><td class="dt-cell-add">(no comment)</td><td>fill: constant</td></tr>
<tr><td>3</td><td>region</td><td class="dt-cell-flag">-</td><td class="dt-cell-add">West</td><td>fill: mode</td></tr>
<tr><td>3</td><td>satisfaction</td><td class="dt-cell-flag">NULL</td><td class="dt-cell-add">4.0</td><td>fill: median</td></tr>
<tr><td>4</td><td>income</td><td class="dt-cell-flag">N/A</td><td class="dt-cell-add">54000.0</td><td>fill: median</td></tr>
</tbody>
</table>
</div>
<p class="dt-caption">… and 1,037 more (download the full audit below).</p>
<p><strong>Handled preview (first 10 rows)</strong></p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th class="idx"></th><th>respondent_id</th><th>age</th><th>region</th><th>income</th><th>satisfaction</th><th>comments</th></tr></thead>
<tbody>
<tr><td class="idx">0</td><td>R-1001</td><td>34.0</td><td>West</td><td>52000.0</td><td>4.0</td><td>great service</td></tr>
<tr><td class="idx">1</td><td>R-1002</td><td class="dt-cell-add">37.0</td><td>East</td><td class="dt-cell-add">54000.0</td><td>3.0</td><td class="dt-cell-add">(no comment)</td></tr>
<tr><td class="idx">2</td><td>R-1003</td><td>41.0</td><td class="dt-cell-add">West</td><td>61000.0</td><td class="dt-cell-add">4.0</td><td>none</td></tr>
<tr><td class="idx">3</td><td>R-1004</td><td>29.0</td><td>South</td><td class="dt-cell-add">54000.0</td><td>5.0</td><td>quick</td></tr>
</tbody>
</table>
</div>
<hr class="dt-divider">
<!-- Downloads (html_download_button anchors) -->
<div class="dt-cols-3">
<button class="dt-btn dt-btn-primary">Download handled CSV</button>
<button class="dt-btn">Download changes audit</button>
<button class="dt-btn">Download config JSON</button>
</div>
<div class="dt-next-step"><span class="dt-mi">arrow_forward</span><span>Missing values handled. Next, most files need: <a href="01_deduplicator.html">Find Duplicates →</a></span><button class="dt-next-step-dismiss" title="Dismiss"></button></div>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>
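
A minimal Python sketch of the fill behaviour this page mocks up, offered as an illustration and not as the tool's actual code: sentinel strings become missing, numeric-only strategies fall back to the categorical strategy on non-numeric columns, and the gaps are then filled. The names `SENTINELS`, `to_missing`, `resolve_strategy`, and `fill_column` are hypothetical, and the sentinel list is an assumption based on the values shown in the audit table.

```python
from collections import Counter
from statistics import median

# Assumed sentinel list (hypothetical); compared case-insensitively after stripping.
SENTINELS = {"", "n/a", "null", "?", "-"}


def to_missing(value):
    """Return None for sentinel strings, otherwise the value unchanged."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip().lower() in SENTINELS:
        return None
    return value


def is_numeric(values):
    """True when every non-missing value is an int or float (bools excluded)."""
    present = [v for v in values if v is not None]
    return bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )


def resolve_strategy(values, strategy, categorical_fallback="mode"):
    """Numeric-only strategies fall back to the categorical one on text columns."""
    if strategy in {"mean", "median", "interpolate"} and not is_numeric(values):
        return categorical_fallback
    return strategy


def fill_column(raw_values, strategy="median", constant=None):
    """Return (filled_values, resolved_strategy) for a single column."""
    values = [to_missing(v) for v in raw_values]
    present = [v for v in values if v is not None]
    resolved = resolve_strategy(values, strategy)
    if resolved == "constant":
        fill = constant
    elif not present:
        return values, resolved  # nothing to learn a fill value from
    elif resolved == "median":
        fill = float(median(present))
    elif resolved == "mode":
        fill = Counter(present).most_common(1)[0][0]
    else:
        raise ValueError(f"strategy not covered by this sketch: {resolved}")
    return [fill if v is None else v for v in values], resolved


ages, used = fill_column([34, "N/A", 41, 29, 40], "median")
regions, fallback = fill_column(["West", "East", "-", "West"], "median")
print(ages, used)
print(regions, fallback)
```

The sketch covers only `median`, `mode`, and `constant`; the drop thresholds and `interpolate` are left out to keep it short.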

@@ -0,0 +1,221 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — Map Columns</title>
<link rel="stylesheet" href="app.css">
</head>
<body data-page="05_column_mapper">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of <strong>Map Columns</strong>, shown with a file imported, an interactive target schema + mapping configured, and a completed run (results + mapped preview). <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Tool header -->
<div class="dt-tool-header">
<h1>Map Columns</h1>
<div class="dt-tool-header-actions">
<span class="dt-privacy-pill">
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor">
<rect x="4" y="11" width="16" height="10" rx="2"/>
<path d="M8 11V7a4 4 0 018 0v4"/>
</svg>
Runs 100% locally
</span>
<button class="dt-help-btn"><span class="dt-mi">help_outline</span> Help</button>
</div>
</div>
<p class="dt-tool-caption">Rename columns, change their order, and set each one as text, number, or date.</p>
<div class="dt-spacer"></div>
<!-- File pickup banner (using file from upload screen) -->
<div class="dt-alert info">
<span class="dt-mi">description</span>
<span>Using <strong>crm_contacts_raw.csv</strong> from the upload screen.</span>
</div>
<button class="dt-btn" style="margin-bottom:4px">Use a different file</button>
<!-- Preview expander (collapsed after a result exists) -->
<details class="dt-expander">
<summary>Preview: crm_contacts_raw.csv</summary>
<div class="dt-expander-body">
<p class="dt-caption">4,210 rows, 6 columns</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th class="idx"></th><th>Full Name</th><th>EmailAddr</th><th>Phone #</th><th>Signup</th><th>Amount Spent</th><th>Notes</th></tr></thead>
<tbody>
<tr><td class="idx">0</td><td>Jane Doe</td><td>jane@acme.io</td><td>512-555-0190</td><td>01/04/2024</td><td>$1,204.50</td><td>VIP</td></tr>
<tr><td class="idx">1</td><td>Bob Smith</td><td>bob@globex.com</td><td>720-555-7781</td><td>02/11/2024</td><td>$88.00</td><td></td></tr>
<tr><td class="idx">2</td><td>Carla Reyes</td><td>carla@initech.net</td><td>415-555-3322</td><td>03/02/2024</td><td>$612.10</td><td>renewal</td></tr>
<tr><td class="idx">3</td><td>Dev Patel</td><td>dev@umbrella.co</td><td>206-555-9043</td><td>03/19/2024</td><td>$0.00</td><td></td></tr>
</tbody>
</table>
</div>
</div>
</details>
<hr class="dt-divider">
<!-- Options expander (open — heart of the tool) -->
<details class="dt-expander" open>
<summary>Options</summary>
<div class="dt-expander-body">
<!-- ===== Target schema ===== -->
<h3 style="margin-top:0">Target schema</h3>
<div class="dt-field">
<label class="dt-label">How would you like to define the target schema?</label>
<div class="dt-radio-row" style="flex-direction:column; gap:8px">
<span class="dt-radio on"><span class="dot"></span> Build interactively (start from current columns)</span>
<span class="dt-radio"><span class="dot"></span> Import schema JSON</span>
<span class="dt-radio"><span class="dot"></span> Skip (rename / convert types only — no schema)</span>
</div>
<div class="dt-help-text">Building interactively is fastest for one-off cleanup. Import a JSON schema when you have a fixed contract (such as a CRM import format or a database schema). Skip when you only want to rename specific columns or convert their types.</div>
</div>
<p class="dt-caption">Edit the table to define your target schema. Add rows for fields the input doesn't have yet (with a default), or remove rows for columns you want to drop.</p>
<!-- Schema editor (st.data_editor, num_rows=dynamic) -->
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>Target name</th><th>Type</th><th>Required</th><th>Default (for added cols)</th><th>Aliases (comma-sep, helps fuzzy-match)</th></tr></thead>
<tbody>
<tr><td>full_name</td><td>string</td><td></td><td></td><td>Full Name, name</td></tr>
<tr><td>email</td><td>string</td><td></td><td></td><td>EmailAddr, email_address</td></tr>
<tr><td>phone</td><td>string</td><td></td><td></td><td>Phone #, tel</td></tr>
<tr><td>signup_date</td><td>date</td><td></td><td></td><td>Signup</td></tr>
<tr><td>amount_spent</td><td>float</td><td></td><td>0.0</td><td>Amount Spent</td></tr>
<tr><td>source</td><td>string</td><td></td><td>crm-import</td><td></td></tr>
<tr><td style="color:var(--ink-tertiary)"><span class="dt-mi" style="font-size:16px;vertical-align:-3px">add</span> add row</td><td></td><td></td><td></td><td></td></tr>
</tbody>
</table>
</div>
<p class="dt-caption">6 target fields · 1 added field (<code>source</code>) not present in the input.</p>
<hr class="dt-divider">
<!-- ===== Mapping ===== -->
<!-- Mapping follows the schema directly: define the schema, then map sources onto it. -->
<h3>Mapping</h3>
<!-- schema is set → source→target selectbox editor with auto-suggested flag -->
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>Source</th><th>Target</th><th>Auto-suggested</th></tr></thead>
<tbody>
<tr><td>Full Name</td><td>full_name</td><td></td></tr>
<tr><td>EmailAddr</td><td>email</td><td></td></tr>
<tr><td>Phone #</td><td>phone</td><td></td></tr>
<tr><td>Signup</td><td>signup_date</td><td></td></tr>
<tr><td>Amount Spent</td><td>amount_spent</td><td></td></tr>
<tr><td>Notes</td><td>(unmapped)</td><td></td></tr>
</tbody>
</table>
</div>
<p class="dt-caption">Pick a target for each source column. <code>Notes</code> stays unmapped; because <strong>Unmapped source columns</strong> is set to <strong>keep</strong>, it passes through as-is. <code>source</code> is added from the schema default.</p>
<hr class="dt-divider">
<!-- ===== Strategy ===== -->
<!-- Strategy is a modifier on the mapping above (strictness: keep/drop extras, coerce, reorder), so it comes after the user can see what it acts on. -->
<h3>Strategy</h3>
<div class="dt-field">
<label class="dt-label">Preset</label>
<div class="dt-radio-row" style="flex-direction:column; gap:8px">
<span class="dt-radio"><span class="dot"></span> rename-only (just rename, leave types alone, keep extras)</span>
<span class="dt-radio"><span class="dot"></span> lenient-schema (rename + convert types + reorder, keep extras)</span>
<span class="dt-radio"><span class="dot"></span> strict-schema (rename + convert types + reorder, drop extras) <span class="dt-count-pill info" style="margin-left:4px">base</span></span>
<span class="dt-radio on"><span class="dot"></span> Custom — based on <strong>strict-schema</strong>, 1 control changed <span class="dt-count-pill warn" style="margin-left:4px">modified</span></span>
</div>
<div class="dt-precedence" style="margin-top:10px">
<span class="dt-mi">rule</span>
<span>Individual Advanced controls win over the preset. You started from <strong>strict-schema</strong>, then changed <strong>Unmapped source columns</strong> to <strong>keep</strong> below — so the preset is now <strong>Custom</strong>. The controls' current values are what actually run.</span>
</div>
<div class="dt-help-text">Pick a strategy as the baseline. Every Advanced toggle below is still individually overridable; overriding any one switches the preset to Custom.</div>
</div>
<!-- Advanced options expander -->
<details class="dt-expander" open>
<summary>Advanced options</summary>
<div class="dt-expander-body">
<div class="dt-cols-2">
<div>
<div class="dt-field">
<label class="dt-label">Unmapped source columns <span class="dt-count-pill warn" style="margin-left:4px">changed</span></label>
<div class="dt-select">keep</div>
<div class="dt-help-text">Winning value: <strong>keep</strong>. Overrides the strict-schema base (drop) — so <code>Notes</code> survives into the output.</div>
</div>
<div class="dt-check on" title="coerce types per schema"><span class="box"><span class="dt-mi">check</span></span> Convert each column to the right type</div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Reorder to schema order</div>
</div>
<div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Auto-infer mapping (fuzzy match)</div>
<div class="dt-field">
<label class="dt-label">Fuzzy match threshold</label>
<div class="dt-slider"><div class="track"><div class="fill" style="width:80%"></div><div class="knob" style="left:80%"></div></div><div class="val">0.80</div></div>
</div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Enforce required fields</div>
</div>
</div>
</div>
</details>
</div>
</details>
<hr class="dt-divider">
<button class="dt-btn dt-btn-primary dt-btn-block">Apply Column Mapping</button>
<hr class="dt-divider">
<!-- ===== Results ===== -->
<div id="colmap-results-anchor" style="height:1px"></div>
<h2>Results</h2>
<div class="dt-metrics">
<div class="dt-metric"><div class="label">Renamed</div><div class="value">5</div></div>
<div class="dt-metric"><div class="label">Dropped</div><div class="value">0</div></div>
<div class="dt-metric"><div class="label">Added</div><div class="value">1</div></div>
<div class="dt-metric"><div class="label">Coerce fails</div><div class="value">3</div></div>
</div>
<div class="dt-alert info"><span class="dt-mi">info</span><span>Added (with defaults): <code>source</code></span></div>
<div class="dt-alert warn"><span class="dt-mi">warning</span><span>Some cells could not be coerced and were left as NaN: amount_spent (3)</span></div>
<p><strong>Mapped preview (first 10 rows)</strong></p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th class="idx"></th><th>full_name</th><th>email</th><th>phone</th><th>signup_date</th><th>amount_spent</th><th class="dt-cell-add">source</th><th>Notes</th></tr></thead>
<tbody>
<tr><td class="idx">0</td><td>Jane Doe</td><td>jane@acme.io</td><td>512-555-0190</td><td>2024-01-04</td><td>1204.5</td><td>crm-import</td><td>VIP</td></tr>
<tr><td class="idx">1</td><td>Bob Smith</td><td>bob@globex.com</td><td>720-555-7781</td><td>2024-02-11</td><td>88.0</td><td>crm-import</td><td></td></tr>
<tr><td class="idx">2</td><td>Carla Reyes</td><td>carla@initech.net</td><td>415-555-3322</td><td>2024-03-02</td><td>612.1</td><td>crm-import</td><td>renewal</td></tr>
<tr><td class="idx">3</td><td>Dev Patel</td><td>dev@umbrella.co</td><td>206-555-9043</td><td>2024-03-19</td><td>0.0</td><td>crm-import</td><td></td></tr>
<tr><td class="idx">4</td><td>Mei Lin</td><td>mei@hooli.com</td><td>503-555-1188</td><td>2024-04-07</td><td class="dt-cell-flag">NaN</td><td>crm-import</td><td>trial</td></tr>
</tbody>
</table>
</div>
<hr class="dt-divider">
<!-- Downloads (3 columns) -->
<div class="dt-cols-3">
<button class="dt-btn dt-btn-primary">Download mapped CSV</button>
<button class="dt-btn">Download mapping audit</button>
<button class="dt-btn">Download config JSON</button>
</div>
<!-- Next-step suggestion -->
<div class="dt-next-step"><span class="dt-mi">arrow_forward</span><span>Columns mapped. <a href="home.html">Run the recommended clean →</a></span><button class="dt-next-step-dismiss" title="Dismiss"></button></div>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>
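
The mapping flow mocked up on this page (fuzzy-match each source column to a target, rename, convert types, keep unmapped extras, add schema defaults) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's implementation: `SCHEMA_ALIASES`, `normalize`, `suggest_target`, `coerce_float`, and `apply_mapping` are hypothetical names, and `difflib.SequenceMatcher` stands in for whatever matcher the real tool uses. Date conversion and reordering to schema order are omitted.

```python
import math
from difflib import SequenceMatcher

# Target name -> aliases, copied from the schema table in the mockup.
SCHEMA_ALIASES = {
    "full_name": ["Full Name", "name"],
    "email": ["EmailAddr", "email_address"],
    "phone": ["Phone #", "tel"],
    "signup_date": ["Signup"],
    "amount_spent": ["Amount Spent"],
    "source": [],
}


def normalize(name):
    """Lowercase and keep only letters and digits, so 'Full Name' ~ 'full_name'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def suggest_target(source, schema=SCHEMA_ALIASES, threshold=0.80):
    """Return the best-matching target name, or None if below the threshold."""
    best, best_score = None, 0.0
    for target, aliases in schema.items():
        for candidate in [target, *aliases]:
            score = SequenceMatcher(
                None, normalize(source), normalize(candidate)
            ).ratio()
            if score > best_score:
                best, best_score = target, score
    return best if best_score >= threshold else None


def coerce_float(text):
    """Parse strings like '$1,204.50'; return NaN when parsing fails."""
    try:
        return float(str(text).replace("$", "").replace(",", ""))
    except ValueError:
        return math.nan


def apply_mapping(row, mapping, defaults, float_fields, keep_unmapped=True):
    """Rename mapped columns, coerce floats, keep or drop extras, add defaults."""
    out = {}
    for source, value in row.items():
        target = mapping.get(source)
        if target is not None:
            out[target] = coerce_float(value) if target in float_fields else value
        elif keep_unmapped:
            out[source] = value  # unmapped extra passes through unchanged
    for field, default in defaults.items():
        out.setdefault(field, default)  # only fills fields the input lacked
    return out


row = {"Full Name": "Jane Doe", "Amount Spent": "$1,204.50", "Notes": "VIP"}
mapping = {src: suggest_target(src) for src in row}
print(apply_mapping(row, mapping, {"source": "crm-import"}, {"amount_spent"}))
```

Passing `keep_unmapped=False` mirrors the strict-schema base, where extras such as `Notes` are dropped.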

@@ -0,0 +1,55 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — Find Unusual Values</title>
<link rel="stylesheet" href="app.css">
</head>
<body data-page="06_outlier_detector">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of <strong>Find Unusual Values</strong> — a <strong>Coming&nbsp;Soon</strong> tool. The page is a stub: a "coming soon" notice, a plain-English list of what the tool will do, and a single "Notify me" action. <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Tool header -->
<div class="dt-tool-header">
<h1>Find Unusual Values</h1>
<button class="dt-help-btn"><span class="dt-mi">help_outline</span> Help</button>
</div>
<p class="dt-tool-caption">Spot values that look wrong — way too high, too low, or breaking your rules.</p>
<div class="dt-spacer"></div>
<!-- Coming-soon notice (st.info) -->
<div class="dt-alert info">
<span class="dt-mi">info</span>
<span>This tool is coming soon.</span>
</div>
<!-- What it will do (st.markdown) -->
<p><strong>What it will do:</strong></p>
<ul>
<li>Find values that are unusually high or low for a column</li>
<li>Spot values that break the rules you set (out of range, wrong type)</li>
<li>Let you choose how sensitive the check is</li>
<li>Flag unusual rows by adding a column, without changing your data</li>
<li>Cap extreme values at a limit you choose</li>
<li>Show a summary of how many values were flagged</li>
</ul>
<hr class="dt-divider">
<button class="dt-btn dt-btn-primary"><span class="dt-mi">notifications</span> Notify me when this ships</button>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>

@@ -0,0 +1,55 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — Combine Files</title>
<link rel="stylesheet" href="app.css">
</head>
<body data-page="07_multi_file_merger">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of <strong>Combine Files</strong> — a <strong>Coming&nbsp;Soon</strong> tool. The page is a stub: a "coming soon" notice, a plain-English list of what the tool will do, and a single "Notify me" action. <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Tool header -->
<div class="dt-tool-header">
<h1>Combine Files</h1>
<button class="dt-help-btn"><span class="dt-mi">help_outline</span> Help</button>
</div>
<p class="dt-tool-caption">Combine several CSV or Excel files into one — even if columns differ.</p>
<div class="dt-spacer"></div>
<!-- Coming-soon notice (st.info) -->
<div class="dt-alert info">
<span class="dt-mi">info</span>
<span>This tool is coming soon.</span>
</div>
<!-- What it will do (st.markdown) -->
<p><strong>What it will do:</strong></p>
<ul>
<li>Import several CSV or Excel files at once</li>
<li>Line up columns automatically by matching their names</li>
<li>Stack files on top of each other into one long file</li>
<li>Merge files side by side using shared key columns</li>
<li>Handle columns that don't match (fill the gaps with blanks or drop them)</li>
<li>Add a column showing which file each row came from</li>
</ul>
<hr class="dt-divider">
<button class="dt-btn dt-btn-primary"><span class="dt-mi">notifications</span> Notify me when this ships</button>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>

@@ -0,0 +1,55 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — Quality Check</title>
<link rel="stylesheet" href="app.css">
</head>
<body data-page="08_validator_reporter">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of <strong>Quality Check</strong> — a <strong>Coming&nbsp;Soon</strong> tool. The page is a stub: a "coming soon" notice, a plain-English list of what the tool will do, and a single "Notify me" action. <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Tool header -->
<div class="dt-tool-header">
<h1>Quality Check</h1>
<button class="dt-help-btn"><span class="dt-mi">help_outline</span> Help</button>
</div>
<p class="dt-tool-caption">Check your file against rules you set, and export a PDF or Excel report.</p>
<div class="dt-spacer"></div>
<!-- Coming-soon notice (st.info) -->
<div class="dt-alert info">
<span class="dt-mi">info</span>
<span>This tool is coming soon.</span>
</div>
<!-- What it will do (st.markdown) -->
<p><strong>What it will do:</strong></p>
<ul>
<li>Check each column against rules you set (no blanks, no duplicates, matches a pattern, within a range, from a set list)</li>
<li>Check rules across columns (for example, start date is before end date)</li>
<li>Give each column and the whole file a quality score</li>
<li>Export a PDF quality report</li>
<li>Export an Excel report with the problem rows highlighted</li>
<li>Show a summary of what passed, what failed, and how serious each issue is</li>
</ul>
<hr class="dt-divider">
<button class="dt-btn dt-btn-primary"><span class="dt-mi">notifications</span> Notify me when this ships</button>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>

@@ -0,0 +1,373 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — Automated Workflows</title>
<link rel="stylesheet" href="app.css">
</head>
<body data-page="09_pipeline_runner">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of <strong>Automated Workflows</strong> (Pipeline Runner), shown with a file imported, a four-step pipeline configured, and a completed run (results + per-step summary). <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Tool header -->
<div class="dt-tool-header">
<h1>Automated Workflows</h1>
<div class="dt-tool-header-actions">
<span class="dt-privacy-pill">
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor">
<rect x="4" y="11" width="16" height="10" rx="2"/>
<path d="M8 11V7a4 4 0 018 0v4"/>
</svg>
Runs 100% locally
</span>
<button class="dt-help-btn"><span class="dt-mi">help_outline</span> Help</button>
</div>
</div>
<p class="dt-tool-caption">Run several tools in a row — save the steps once, reuse them anytime.</p>
<div class="dt-spacer"></div>
<!-- Upload (file staged) -->
<label class="dt-label">Import CSV or Excel file</label>
<div class="dt-uploader">
<div class="dt-uploader-text">
<span class="hint"><span class="dt-mi" style="vertical-align:-4px">upload_file</span> Drag and drop file here</span>
<span class="sub">Up to 1.5 GB · CSV, TSV, XLSX, XLS · encoding &amp; delimiter auto-detected</span>
</div>
<button class="dt-btn">Browse files</button>
</div>
<div class="dt-file-chip">
<span class="dt-file-icon-chip"><svg viewBox="0 0 24 24" fill="none" stroke="currentColor"><path d="M14 2H6a2 2 0 00-2 2v16a2 2 0 002 2h12a2 2 0 002-2V8z"/><path d="M14 2v6h6"/></svg></span>
<span class="name">customers_export.csv</span>
<span class="size">2.1 MB</span>
<button class="dt-btn dt-btn-tertiary" title="Remove"></button>
</div>
<!-- Preview expander (collapsed once a result exists) -->
<details class="dt-expander">
<summary>Preview: customers_export.csv</summary>
<div class="dt-expander-body">
<p class="dt-caption">18,442 rows, 5 columns</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th class="idx"></th><th>name</th><th>email</th><th>city</th><th>phone</th><th>signup_date</th></tr></thead>
<tbody>
<tr><td class="idx">0</td><td> Jane Doe </td><td>jane@acme.io</td><td>Austin</td><td>512-555-0190</td><td>2024-01-04</td></tr>
<tr><td class="idx">1</td><td>jane doe</td><td>JANE@ACME.IO</td><td>austin</td><td>(512) 555-0190</td><td>01/04/2024</td></tr>
<tr><td class="idx">2</td><td>Bob Smith</td><td>bob@globex.com</td><td>Denver</td><td>720.555.7781</td><td>2024-02-11</td></tr>
<tr><td class="idx">3</td><td>R. Smith</td><td>bob@globex.com</td><td></td><td>720-555-7781</td><td>Feb 11 2024</td></tr>
</tbody>
</table>
</div>
</div>
</details>
<hr class="dt-divider">
<!-- Options: pipeline builder (collapsed once a result exists; opened here to show structure) -->
<details class="dt-expander" open>
<summary>Options</summary>
<div class="dt-expander-body">
<!-- Mode radio. Editing the steps below auto-switches the mode from the
recommended default to "Build interactively" (same precedence-visibility
pattern as Fix Missing Values: the active state is made legible, and the
default it superseded is marked "· modified"). -->
<div class="dt-field">
<label class="dt-label">How would you like to define the pipeline?</label>
<div class="dt-radio-row" style="flex-direction:column;gap:9px">
<span class="dt-radio"><span class="dot"></span> Use the recommended default (Clean Text → Standardize → Fix Missing → Find Duplicates) <span class="dt-count-pill warn" style="margin-left:4px">· modified</span></span>
<span class="dt-radio on"><span class="dot"></span> Build interactively</span>
<span class="dt-radio"><span class="dot"></span> Import a saved pipeline JSON</span>
</div>
</div>
<div class="dt-precedence">
<span class="dt-mi">edit</span>
<span>You started from the recommended default and edited a step, so the mode switched to <strong>Build interactively</strong>. The steps below are now yours to change — pick <strong>recommended default</strong> again to discard your edits and restore the suggested order.</span>
</div>
<p class="dt-caption" style="margin:10px 0">
Add, remove, reorder (drag the row index), enable, or configure each step.
Open a step's <strong>Configure</strong> panel to set its options in plain language.
Tool order is recommended, not enforced — violations surface as warnings below the table.
</p>
<!-- Pipeline editor. Each step row carries an enable toggle + a "Configure"
expander that reveals that tool's OWN controls as the editing surface
(built from .dt-* form classes). Raw per-row JSON has been removed;
JSON survives only as import/export under "Advanced" below. -->
<div class="dt-table-wrap">
<table class="dt-table">
<thead>
<tr>
<th class="idx"></th>
<th>Step</th>
<th style="text-align:center">Enabled</th>
<th style="text-align:right">Configure</th>
</tr>
</thead>
<tbody>
<tr>
<td class="idx">≡ 0</td>
<td><div style="font-weight:500" title="text_clean">Clean Text</div><div class="dt-caption" style="margin:2px 0 0">Trim spaces, collapse repeats, leave case as-is</div></td>
<td><span class="dt-check on" style="margin:0;justify-content:center"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td style="text-align:right;color:var(--ink-tertiary)"><span class="dt-mi" style="font-size:16px;vertical-align:-3px">tune</span> Configure <span class="dt-mi" style="font-size:14px;vertical-align:-2px">expand_more</span></td>
</tr>
</tbody>
</table>
</div>
<!-- text_clean config panel (open to show the per-step editing surface) -->
<details class="dt-expander" open style="margin:6px 0 10px">
<summary>Configure: Clean Text</summary>
<div class="dt-expander-body">
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Trim leading &amp; trailing whitespace</div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Collapse repeated spaces to one</div>
<div class="dt-check"><span class="box"></span> Normalize smart quotes &amp; dashes to plain ASCII</div>
<div class="dt-field">
<label class="dt-label">Letter case</label>
<div class="dt-select">Leave as-is</div>
</div>
</div>
</details>
<div class="dt-table-wrap">
<table class="dt-table">
<tbody>
<tr>
<td class="idx">≡ 1</td>
<td><div style="font-weight:500" title="format_standardize">Standardize Formats</div><div class="dt-caption" style="margin:2px 0 0">Format phone as a phone number, signup_date as a date</div></td>
<td><span class="dt-check on" style="margin:0;justify-content:center"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td style="text-align:right;color:var(--ink-tertiary)"><span class="dt-mi" style="font-size:16px;vertical-align:-3px">tune</span> Configure <span class="dt-mi" style="font-size:14px;vertical-align:-2px">chevron_right</span></td>
</tr>
</tbody>
</table>
</div>
<!-- format_standardize config panel (collapsed) -->
<details class="dt-expander" style="margin:6px 0 10px">
<summary>Configure: Standardize Formats</summary>
<div class="dt-expander-body">
<p class="dt-caption" style="margin-bottom:8px">Choose a target format for each column. Columns left as &ldquo;Leave as-is&rdquo; are untouched.</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>Column</th><th>Format as</th></tr></thead>
<tbody>
<tr><td>name</td><td><span class="dt-select" style="display:inline-block;min-width:150px;padding:4px 24px 4px 10px;color:var(--ink-tertiary)">Leave as-is</span></td></tr>
<tr><td>email</td><td><span class="dt-select" style="display:inline-block;min-width:150px;padding:4px 24px 4px 10px;color:var(--ink-tertiary)">Leave as-is</span></td></tr>
<tr><td>phone</td><td><span class="dt-select" style="display:inline-block;min-width:150px;padding:4px 24px 4px 10px">Phone number</span></td></tr>
<tr><td>signup_date</td><td><span class="dt-select" style="display:inline-block;min-width:150px;padding:4px 24px 4px 10px">Date</span></td></tr>
</tbody>
</table>
</div>
</div>
</details>
<div class="dt-table-wrap">
<table class="dt-table">
<tbody>
<tr>
<td class="idx">≡ 2</td>
<td><div style="font-weight:500" title="missing">Fix Missing Values</div><div class="dt-caption" style="margin:2px 0 0">Flag blank cells (treat &ldquo;N/A&rdquo; and &ldquo;—&rdquo; as blank)</div></td>
<td><span class="dt-check on" style="margin:0;justify-content:center"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td style="text-align:right;color:var(--ink-tertiary)"><span class="dt-mi" style="font-size:16px;vertical-align:-3px">tune</span> Configure <span class="dt-mi" style="font-size:14px;vertical-align:-2px">chevron_right</span></td>
</tr>
</tbody>
</table>
</div>
<!-- missing config panel (collapsed) -->
<details class="dt-expander" style="margin:6px 0 10px">
<summary>Configure: Fix Missing Values</summary>
<div class="dt-expander-body">
<div class="dt-field">
<label class="dt-label">What should happen to blank cells?</label>
<div class="dt-radio-row" style="flex-direction:column;gap:8px">
<span class="dt-radio on"><span class="dot"></span> Flag them (mark blanks, change nothing)</span>
<span class="dt-radio"><span class="dot"></span> Fill them in (numbers → median, text → most common)</span>
<span class="dt-radio"><span class="dot"></span> Drop rows that have any blank</span>
</div>
</div>
<div class="dt-field">
<label class="dt-label">Treat these as blank (comma-separated)</label>
<div class="dt-input">N/A, —</div>
<div class="dt-help-text">Matched case-insensitively after stripping whitespace.</div>
</div>
</div>
</details>
<div class="dt-table-wrap">
<table class="dt-table">
<tbody>
<tr>
<td class="idx">≡ 3</td>
<td><div style="font-weight:500" title="dedup">Find Duplicates</div><div class="dt-caption" style="margin:2px 0 0">Match on email &amp; phone; keep the most complete row, merge in missing fields</div></td>
<td><span class="dt-check on" style="margin:0;justify-content:center"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td style="text-align:right;color:var(--ink-tertiary)"><span class="dt-mi" style="font-size:16px;vertical-align:-3px">tune</span> Configure <span class="dt-mi" style="font-size:14px;vertical-align:-2px">chevron_right</span></td>
</tr>
<tr>
<td class="idx" style="color:var(--ink-tertiary)"></td>
<td colspan="3" style="color:var(--ink-tertiary);font-family:var(--font-sans)">Add step</td>
</tr>
</tbody>
</table>
</div>
<!-- dedup config panel (collapsed) -->
<details class="dt-expander" style="margin:6px 0 10px">
<summary>Configure: Find Duplicates</summary>
<div class="dt-expander-body">
<div class="dt-field">
<label class="dt-label">When rows match, which one survives?</label>
<div class="dt-select">Keep the most complete row</div>
<div class="dt-help-text">Other options: keep the first seen, keep the last seen.</div>
</div>
<div class="dt-check on"><span class="box"><span class="dt-mi">check</span></span> Merge matched rows (fill each survivor's blanks from its duplicates)</div>
<div class="dt-field">
<label class="dt-label">Match on these columns</label>
<div class="dt-multiselect">
<span class="dt-ms-chip">email <span class="x"></span></span>
<span class="dt-ms-chip">phone <span class="x"></span></span>
</div>
</div>
</div>
</details>
<!-- Validation: pipeline is in recommended order, so no warning shown (warning block omitted) -->
<!-- Advanced: JSON is import/export only, never the per-step editing surface -->
<details class="dt-expander" style="margin-top:14px">
<summary>Advanced — import / export pipeline as JSON</summary>
<div class="dt-expander-body">
<p class="dt-caption" style="margin-bottom:8px">For sharing or version control. Editing is done in the step panels above — this is just the saved form of the same settings.</p>
<div class="dt-code">{
"version": 1,
"steps": [
{"tool": "text_clean", "enabled": true, "options": {"trim": true, "collapse_whitespace": true}},
{"tool": "format_standardize", "enabled": true, "options": {"column_types": {"phone": "phone", "signup_date": "date"}}},
{"tool": "missing", "enabled": true, "options": {"strategy": "flag", "sentinels": ["N/A", "—"]}},
{"tool": "dedup", "enabled": true, "options": {"survivor_rule": "most_complete", "merge": true, "keys": ["email", "phone"]}}
]
}</div>
<div class="dt-btn-row" style="margin-top:10px">
<button class="dt-btn"><span class="dt-mi">upload</span> Import JSON</button>
<button class="dt-btn"><span class="dt-mi">download</span> Export JSON</button>
</div>
</div>
</details>
<!-- Nested explainer expander -->
<details class="dt-expander" style="margin-top:14px">
<summary>Recommended tool order — why each step belongs where it does</summary>
<div class="dt-expander-body">
<p><strong>text_clean</strong> before <strong>format_standardize</strong> — format parsers (phone / currency / date) fail on smart-quote-contaminated or NBSP-padded input — clean text first</p>
<p><strong>text_clean</strong> before <strong>missing</strong> — sentinel detection misses cells padded with NBSP / zero-width characters — clean text first</p>
<p><strong>text_clean</strong> before <strong>dedup</strong> — fuzzy matching treats NBSP-padded values as different — clean text first</p>
<p><strong>format_standardize</strong> before <strong>missing</strong> — numeric imputation needs numeric dtypes; canonical phones / currencies improve sentinel detection</p>
<p><strong>format_standardize</strong> before <strong>dedup</strong> — canonical phones / lowercase emails enable cross-format duplicate matching</p>
<p style="margin-bottom:0"><strong>missing</strong> before <strong>dedup</strong> — deduping rows with mixed NaN sentinels produces brittle merges — resolve missing values first</p>
</div>
</details>
</div>
</details>
<hr class="dt-divider">
<!-- Run -->
<button class="dt-btn dt-btn-primary dt-btn-block">Run Pipeline</button>
<hr class="dt-divider">
<!-- Results -->
<h2>Results</h2>
<div class="dt-metrics">
<div class="dt-metric"><div class="label">Initial rows</div><div class="value">18,442</div></div>
<div class="dt-metric"><div class="label">Final rows</div><div class="value">18,130</div></div>
<div class="dt-metric"><div class="label">Steps run</div><div class="value">4</div></div>
<div class="dt-metric"><div class="label">Elapsed</div><div class="value">1.84 s</div></div>
</div>
<h4>Per-step summary</h4>
<!-- Standalone error column removed: status is one pill per step. A failed step
turns the pill danger and surfaces its message in a detail row directly below
that step. A step that completes with a non-fatal issue gets a warn pill and the
same kind of detail row. Clean steps just show a green pill and no detail row.
Summaries are plain-English phrases, not raw JSON. Demo: all four steps in this
run completed, matching the metrics above. The format_standardize row carries
the warn pill + detail row to illustrate how a non-fatal step issue surfaces
inline without a dedicated always-empty column. -->
<div class="dt-table-wrap">
<table class="dt-table">
<thead>
<tr><th>step</th><th>status</th><th>elapsed</th><th>summary</th></tr>
</thead>
<tbody>
<tr>
<td>text_clean</td>
<td><span class="dt-count-pill success">ok</span></td>
<td>214 ms</td>
<td style="font-family:var(--font-sans)">1,204 cells changed in name &amp; city</td>
</tr>
<tr>
<td>format_standardize</td>
<td><span class="dt-count-pill warn"><span class="dt-mi" style="font-size:13px;margin-right:3px">warning</span> ok · 141 skipped</span></td>
<td>388 ms</td>
<td style="font-family:var(--font-sans)">18,301 phones and 17,996 dates standardized</td>
</tr>
<tr style="background:var(--warn-fill)">
<td></td>
<td colspan="3" style="font-family:var(--font-sans);color:var(--warn);white-space:normal">
<span class="dt-mi" style="font-size:15px;vertical-align:-3px;margin-right:4px">info</span>
141 phone values didn't match any known pattern and were left unchanged. The step still completed — review them in the output preview if needed.
</td>
</tr>
<tr>
<td>missing</td>
<td><span class="dt-count-pill success">ok</span></td>
<td>121 ms</td>
<td style="font-family:var(--font-sans)">642 blank cells flagged (sentinel &ldquo;&mdash;&rdquo;)</td>
</tr>
<tr>
<td>dedup</td>
<td><span class="dt-count-pill success">ok</span></td>
<td>911 ms</td>
<td style="font-family:var(--font-sans)">312 duplicates removed across 147 groups (18,442 → 18,130 rows)</td>
</tr>
</tbody>
</table>
</div>
<h4>Output preview (first 10 rows)</h4>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th class="idx"></th><th>name</th><th>email</th><th>city</th><th>phone</th><th>signup_date</th></tr></thead>
<tbody>
<tr><td class="idx">0</td><td>Jane Doe</td><td>jane@acme.io</td><td>Austin</td><td class="dt-cell-add">+1 512-555-0190</td><td class="dt-cell-add">2024-01-04</td></tr>
<tr><td class="idx">1</td><td>Bob Smith</td><td>bob@globex.com</td><td>Denver</td><td class="dt-cell-add">+1 720-555-7781</td><td class="dt-cell-add">2024-02-11</td></tr>
<tr><td class="idx">2</td><td>Carla Reyes</td><td>carla@initech.co</td><td>Phoenix</td><td class="dt-cell-add">+1 480-555-3320</td><td class="dt-cell-add">2024-03-02</td></tr>
<tr><td class="idx">3</td><td>Dan Okafor</td><td>dan@umbrella.net</td><td><span class="dt-cell-flag">⚑ missing</span></td><td class="dt-cell-add">+1 206-555-7745</td><td class="dt-cell-add">2024-03-18</td></tr>
<tr><td class="idx">4</td><td>Emily Tran</td><td>emily@hooli.com</td><td>Seattle</td><td class="dt-cell-add">+1 206-555-1182</td><td class="dt-cell-add">2024-04-05</td></tr>
</tbody>
</table>
</div>
<hr class="dt-divider">
<!-- Downloads (3 columns) -->
<div class="dt-cols-3">
<button class="dt-btn dt-btn-primary"><span class="dt-mi">download</span> Download cleaned CSV</button>
<button class="dt-btn"><span class="dt-mi">download</span> Download pipeline JSON</button>
<button class="dt-btn"><span class="dt-mi">download</span> Download run audit</button>
</div>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>
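
The "Recommended tool order" rules in the page above (text_clean, then format_standardize, then missing, then dedup) could be checked against an exported pipeline roughly as follows. This is a minimal sketch, not the app's validator: `order_warnings` and `RECOMMENDED_ORDER` are hypothetical names, and only the `steps`, `tool`, and `enabled` keys from the mockup's JSON are assumed.

```python
# Recommended order from the explainer panel: a tool listed earlier should run first.
RECOMMENDED_ORDER = ["text_clean", "format_standardize", "missing", "dedup"]


def order_warnings(pipeline: dict) -> list[tuple[str, str]]:
    """Return (should_run_first, runs_first_now) pairs that break the recommended order.

    Hypothetical helper, not the app's validator. Disabled steps and
    unknown tool names are ignored.
    """
    rank = {tool: i for i, tool in enumerate(RECOMMENDED_ORDER)}
    tools = [
        step["tool"]
        for step in pipeline.get("steps", [])
        if step.get("enabled", True) and step.get("tool") in rank
    ]
    violations = []
    for i, earlier in enumerate(tools):
        for later in tools[i + 1:]:
            if rank[later] < rank[earlier]:
                violations.append((later, earlier))
    return violations


# Same step order as the exported JSON in the mockup (options omitted for brevity).
mockup_pipeline = {
    "version": 1,
    "steps": [
        {"tool": "text_clean", "enabled": True},
        {"tool": "format_standardize", "enabled": True},
        {"tool": "missing", "enabled": True},
        {"tool": "dedup", "enabled": True},
    ],
}
print(order_warnings(mockup_pipeline))  # prints []
```

An empty list corresponds to the mockup's state, where the warning block is omitted because the pipeline is already in the recommended order.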


@@ -0,0 +1,203 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — PDF to CSV</title>
<link rel="stylesheet" href="app.css">
</head>
<body data-page="10_pdf_extractor">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of <strong>PDF to CSV</strong>, shown with two bank-statement PDFs imported and a completed scan (candidate transactions in the editable preview table). <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Tool header -->
<div class="dt-tool-header">
<h1>PDF to CSV</h1>
<div class="dt-tool-header-actions">
<span class="dt-privacy-pill">
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor">
<rect x="4" y="11" width="16" height="10" rx="2"/>
<path d="M8 11V7a4 4 0 018 0v4"/>
</svg>
Runs 100% locally
</span>
<button class="dt-help-btn"><span class="dt-mi">help_outline</span> Help</button>
</div>
</div>
<p class="dt-tool-caption">Pull transactions out of bank-statement PDFs into a clean CSV file.</p>
<div class="dt-spacer"></div>
<!-- Scan options expander (collapsed by default) -->
<details class="dt-expander">
<summary>Scan options</summary>
<div class="dt-expander-body">
<div class="dt-cols-2">
<div class="dt-check on">
<span class="box"><span class="dt-mi">check</span></span>
Treat (4.50) as negative
</div>
<div class="dt-check on">
<span class="box"><span class="dt-mi">check</span></span>
Use OCR for scanned pages
</div>
</div>
<p class="dt-help-text" style="margin:0 0 10px">OCR status: ready (bundled Tesseract). Most modern bank PDFs are text-based and don't need OCR — only enable for image-based scans.</p>
<div class="dt-cols-2">
<div class="dt-field">
<label class="dt-label">Output date format</label>
<div class="dt-select">YYYY-MM-DD (2026-01-13)</div>
</div>
<div class="dt-field">
<label class="dt-label">Override year for short dates (optional)</label>
<input class="dt-input" type="text" placeholder="" value="" disabled>
<div class="dt-help-text">Leave blank for automatic (statement period → filename year → this override).</div>
</div>
</div>
</div>
</details>
<!-- Files section head -->
<div class="dt-files-section-head">
<h2>Files</h2>
<span class="dt-section-meta">2 files · 318.4 KB total</span>
</div>
<!-- Files card (Home-style bordered list + Add more files) -->
<div class="dt-card" style="padding-bottom:0">
<div class="dt-file-row" style="padding:6px 0">
<button class="dt-btn dt-btn-tertiary" title="Remove statement-jan-2026.pdf"></button>
<span class="dt-file-icon-chip"><svg viewBox="0 0 24 24" fill="none" stroke="currentColor"><path d="M14 2H6a2 2 0 00-2 2v16a2 2 0 002 2h12a2 2 0 002-2V8z"/><path d="M14 2v6h6"/></svg></span>
<span class="dt-file-name">statement-jan-2026.pdf</span>
<span class="dt-file-size" style="margin-left:auto">171.2 KB</span>
</div>
<div class="dt-file-row" style="padding:6px 0">
<button class="dt-btn dt-btn-tertiary" title="Remove statement-feb-2026.pdf"></button>
<span class="dt-file-icon-chip"><svg viewBox="0 0 24 24" fill="none" stroke="currentColor"><path d="M14 2H6a2 2 0 00-2 2v16a2 2 0 002 2h12a2 2 0 002-2V8z"/><path d="M14 2v6h6"/></svg></span>
<span class="dt-file-name">statement-feb-2026.pdf</span>
<span class="dt-file-size" style="margin-left:auto">147.2 KB</span>
</div>
<button class="dt-file-add" style="margin-left:-16px;margin-right:-16px;width:calc(100% + 32px)">
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor"><path d="M12 5v14M5 12h14"/></svg> Add more files
</button>
</div>
<!-- Action buttons -->
<div class="dt-btn-row" style="margin-top:16px;max-width:340px">
<button class="dt-btn dt-btn-primary">Scan</button>
<button class="dt-btn">Clear all files</button>
</div>
<hr class="dt-divider">
<!-- Warnings expander (collapsed) -->
<details class="dt-expander">
<summary>Warnings (1)</summary>
<div class="dt-expander-body">
<div class="dt-alert warn">
<span class="dt-mi">warning</span>
<span>[statement-feb-2026.pdf] 2 lines matched a date but no amount — skipped (likely a wrapped description). Check the source if a transaction looks missing.</span>
</div>
</div>
</details>
<!-- Results -->
<h4>47 candidate transaction(s) from 2 file(s)</h4>
<p class="dt-caption">Uncheck rows to exclude. Edit any cell to fix a value the scanner got wrong. Hover the <span class="dt-mi" style="font-size:15px;vertical-align:-3px;color:var(--ink-tertiary)">info</span> on any row to see the original PDF text it came from.</p>
<!-- overflow-x:auto belt-and-suspenders: any residual width scrolls instead of clipping (app.css .dt-table-wrap is overflow:hidden) -->
<div class="dt-table-wrap" style="overflow-x:auto">
<table class="dt-table">
<thead>
<tr>
<th>Include</th>
<th></th>
<th>date</th>
<th>description</th>
<th>amount_debit</th>
<th>amount_credit</th>
<th>account_number</th>
<th>source_file</th>
</tr>
</thead>
<tbody>
<tr>
<td><span class="dt-check on" style="margin:0"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td class="idx" title="raw: 01/03 OPENING BALANCE 2,140.55" style="cursor:help"><span class="dt-mi" style="font-size:16px">info</span></td>
<td>2026-01-03</td><td>OPENING BALANCE</td><td></td><td></td><td>****4821</td><td>statement-jan-2026.pdf</td>
</tr>
<tr>
<td><span class="dt-check on" style="margin:0"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td class="idx" title="raw: 01/05 POS PURCHASE WHOLE FOODS MKT (84.12)" style="cursor:help"><span class="dt-mi" style="font-size:16px">info</span></td>
<td>2026-01-05</td><td>POS PURCHASE WHOLE FOODS MKT</td><td>84.12</td><td></td><td>****4821</td><td>statement-jan-2026.pdf</td>
</tr>
<tr>
<td><span class="dt-check on" style="margin:0"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td class="idx" title="raw: 01/08 ACH DEPOSIT PAYROLL ACME CORP 3,250.00" style="cursor:help"><span class="dt-mi" style="font-size:16px">info</span></td>
<td>2026-01-08</td><td>ACH DEPOSIT PAYROLL ACME CORP</td><td></td><td>3,250.00</td><td>****4821</td><td>statement-jan-2026.pdf</td>
</tr>
<tr>
<td><span class="dt-check on" style="margin:0"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td class="idx" title="raw: 01/11 ONLINE TRANSFER TO SAVINGS (500.00)" style="cursor:help"><span class="dt-mi" style="font-size:16px">info</span></td>
<td>2026-01-11</td><td>ONLINE TRANSFER TO SAVINGS</td><td>500.00</td><td></td><td>****4821</td><td>statement-jan-2026.pdf</td>
</tr>
<tr>
<td><span class="dt-check" style="margin:0"><span class="box"></span></span></td>
<td class="idx" title="raw: 01/12 INTEREST RATE 0.50% APY DETAIL 0.00" style="cursor:help"><span class="dt-mi" style="font-size:16px">info</span></td>
<td class="dt-cell-flag">2026-01-12</td><td class="dt-cell-flag">INTEREST RATE 0.50% APY DETAIL <span style="font-family:var(--font-sans);font-size:11px;font-weight:500;background:var(--warn-fill);color:var(--warn);border-radius:999px;padding:1px 7px;white-space:nowrap">auto-excluded · not a transaction line</span></td><td></td><td></td><td>****4821</td><td>statement-jan-2026.pdf</td>
</tr>
<tr>
<td><span class="dt-check on" style="margin:0"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td class="idx" title="raw: 01/14 DEBIT CARD SHELL OIL #2287 (52.40)" style="cursor:help"><span class="dt-mi" style="font-size:16px">info</span></td>
<td>2026-01-14</td><td>DEBIT CARD SHELL OIL #2287</td><td>52.40</td><td></td><td>****4821</td><td>statement-jan-2026.pdf</td>
</tr>
<tr>
<td><span class="dt-check on" style="margin:0"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td class="idx" title="raw: 02/02 POS PURCHASE TRADER JOES #511 (61.88)" style="cursor:help"><span class="dt-mi" style="font-size:16px">info</span></td>
<td>2026-02-02</td><td>POS PURCHASE TRADER JOES #511</td><td>61.88</td><td></td><td>****4821</td><td>statement-feb-2026.pdf</td>
</tr>
<tr>
<td><span class="dt-check on" style="margin:0"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td class="idx" title="raw: 02/06 ACH DEPOSIT PAYROLL ACME CORP 3,250.00" style="cursor:help"><span class="dt-mi" style="font-size:16px">info</span></td>
<td>2026-02-06</td><td>ACH DEPOSIT PAYROLL ACME CORP</td><td></td><td>3,250.00</td><td>****4821</td><td>statement-feb-2026.pdf</td>
</tr>
<tr>
<td><span class="dt-check on" style="margin:0"><span class="box"><span class="dt-mi">check</span></span></span></td>
<td class="idx" title="raw: 02/09 CHECK #1043 (1,200.00)" style="cursor:help"><span class="dt-mi" style="font-size:16px">info</span></td>
<td>2026-02-09</td><td>CHECK #1043</td><td>1,200.00</td><td></td><td>****4821</td><td>statement-feb-2026.pdf</td>
</tr>
</tbody>
</table>
</div>
<!-- Download area: configure-then-act — column selector first, download button below -->
<div style="margin-top:14px;max-width:520px">
<div class="dt-field" style="margin:0 0 14px">
<label class="dt-label">Columns to include in CSV</label>
<div class="dt-multiselect">
<span class="dt-ms-chip">date <span class="x"></span></span>
<span class="dt-ms-chip">description <span class="x"></span></span>
<span class="dt-ms-chip">amount_debit <span class="x"></span></span>
<span class="dt-ms-chip">amount_credit <span class="x"></span></span>
<span class="dt-ms-chip">account_number <span class="x"></span></span>
<span class="dt-ms-chip">source_file <span class="x"></span></span>
</div>
<div class="dt-help-text"><code>page</code> and <code>raw</code> are kept off by default; tick them if you want them in the file.</div>
</div>
<button class="dt-btn dt-btn-primary dt-btn-block">Download 46 rows as CSV</button>
<p class="dt-caption" style="margin-top:8px">1 row excluded (INTEREST RATE detail line).</p>
</div>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>
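
The "Treat (4.50) as negative" option and the debit/credit split shown in the preview table above can be illustrated with a short sketch. The scanner's real parsing code is not part of this diff, so `split_amount` and its `parens_negative` parameter are hypothetical; the sample values are the raw tokens from the mockup rows.

```python
from decimal import Decimal


def split_amount(raw: str, parens_negative: bool = True) -> tuple[str, str]:
    """Return (amount_debit, amount_credit) for one raw amount token.

    Hypothetical helper: a parenthesized amount such as "(84.12)" counts as
    a debit when parens_negative is on; a leading minus always counts as a
    debit; everything else is treated as a credit.
    """
    token = raw.strip()
    is_negative = token.startswith("-")
    if parens_negative and token.startswith("(") and token.endswith(")"):
        is_negative = True
    # Drop the sign markers and thousands separators before parsing.
    digits = token.strip("()-").replace(",", "")
    formatted = f"{Decimal(digits):,.2f}"
    return (formatted, "") if is_negative else ("", formatted)


print(split_amount("(84.12)"))   # ('84.12', '')
print(split_amount("3,250.00"))  # ('', '3,250.00')
```

`Decimal` is used instead of `float` so that amounts keep their exact cents when formatted back out.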


@@ -0,0 +1,248 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — Reconcile Two Files</title>
<link rel="stylesheet" href="app.css">
</head>
<body data-page="11_reconciler">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of <strong>Reconcile Two Files</strong>, shown with both files imported, key columns mapped, and a completed reconciliation (matched / review / unmatched results). <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Tool header -->
<div class="dt-tool-header">
<h1>Reconcile Two Files</h1>
<div class="dt-tool-header-actions">
<span class="dt-privacy-pill">
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor">
<rect x="4" y="11" width="16" height="10" rx="2"/>
<path d="M8 11V7a4 4 0 018 0v4"/>
</svg>
Runs 100% locally
</span>
<button class="dt-help-btn"><span class="dt-mi">help_outline</span> Help</button>
</div>
</div>
<p class="dt-tool-caption">Compare two lists of transactions (e.g. bank vs. ledger) and flag what doesn't match.</p>
<div class="dt-spacer"></div>
<!-- Side-by-side upload (st.columns(2) → two _side_panel) -->
<div class="dt-cols-2">
<!-- Left side -->
<div>
<h4 style="margin-top:0">Left (e.g. bank feed)</h4>
<div class="dt-alert info">
<span class="dt-mi">description</span>
<span>Using <strong>bank_feed_may.csv</strong> from the upload screen.</span>
</div>
<button class="dt-btn" style="margin-bottom:4px">Use a different file</button>
<p class="dt-caption" style="margin-top:6px"><code>bank_feed_may.csv</code> — 1,204 rows, 4 columns</p>
<details class="dt-expander">
<summary>Preview left (e.g. bank feed)</summary>
<div class="dt-expander-body">
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>posted_date</th><th>description</th><th>amount</th><th>ref</th></tr></thead>
<tbody>
<tr><td>2026-05-01</td><td>ACME SUPPLIES</td><td>-1240.00</td><td>CHK1041</td></tr>
<tr><td>2026-05-02</td><td>PAYROLL RUN</td><td>-8800.00</td><td>ACH5520</td></tr>
<tr><td>2026-05-03</td><td>CLIENT GLOBEX</td><td>5200.00</td><td>DEP0090</td></tr>
<tr><td>2026-05-04</td><td>UTILITY CO</td><td>-318.42</td><td>CHK1042</td></tr>
</tbody>
</table>
</div>
</div>
</details>
</div>
<!-- Right side -->
<div>
<h4 style="margin-top:0">Right (e.g. ledger)</h4>
<div class="dt-alert info">
<span class="dt-mi">description</span>
<span>Using <strong>ledger_may.xlsx</strong> from the upload screen.</span>
</div>
<button class="dt-btn" style="margin-bottom:4px">Use a different file</button>
<p class="dt-caption" style="margin-top:6px"><code>ledger_may.xlsx</code> — 1,198 rows, 5 columns</p>
<details class="dt-expander">
<summary>Preview right (e.g. ledger)</summary>
<div class="dt-expander-body">
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>txn_date</th><th>memo</th><th>value</th><th>invoice_no</th><th>account</th></tr></thead>
<tbody>
<tr><td>2026-05-01</td><td>Acme Supplies Inc</td><td>-1240.00</td><td>INV-1041</td><td>5000</td></tr>
<tr><td>2026-05-02</td><td>Monthly payroll</td><td>-8800.00</td><td>INV-5520</td><td>6000</td></tr>
<tr><td>2026-05-03</td><td>Globex retainer</td><td>5200.00</td><td>INV-0090</td><td>4000</td></tr>
<tr><td>2026-05-04</td><td>City Utilities</td><td>-318.40</td><td>INV-1042</td><td>6100</td></tr>
</tbody>
</table>
</div>
</div>
</details>
</div>
</div>
<hr class="dt-divider">
<!-- Match settings -->
<h2>Match settings</h2>
<div class="dt-cols-2">
<!-- Left pickers (file order: posted_date, description, amount → date, desc, amount) -->
<div>
<h4 style="margin-top:0">Left columns</h4>
<div class="dt-field"><label class="dt-label">Date column (optional)</label><div class="dt-select">posted_date</div></div>
<div class="dt-field"><label class="dt-label">Description column (optional)</label><div class="dt-select">description</div></div>
<div class="dt-field"><label class="dt-label">Amount column <span class="req">*</span></label><div class="dt-select">amount</div></div>
<div class="dt-field"><label class="dt-label">Reference columns (optional, e.g. check / invoice no.)</label>
<div class="dt-multiselect"><span class="dt-ms-chip">ref <span class="x"></span></span></div></div>
</div>
<!-- Right pickers (file order: txn_date, memo, value → date, desc, amount) -->
<div>
<h4 style="margin-top:0">Right columns</h4>
<div class="dt-field"><label class="dt-label">Date column (optional)</label><div class="dt-select">txn_date</div></div>
<div class="dt-field"><label class="dt-label">Description column (optional)</label><div class="dt-select">memo</div></div>
<div class="dt-field"><label class="dt-label">Amount column <span class="req">*</span></label><div class="dt-select">value</div></div>
<div class="dt-field"><label class="dt-label">Reference columns (must match left count)</label>
<div class="dt-multiselect"><span class="dt-ms-chip">invoice_no <span class="x"></span></span></div>
<div class="dt-help-text" style="color:var(--success);display:flex;align-items:center;gap:5px"><span class="dt-mi" style="font-family:'Material Symbols Outlined';font-size:15px;line-height:1">check_circle</span> 1 reference each side — counts match</div></div>
</div>
</div>
<!-- Tolerances & options (expanded=True) -->
<details class="dt-expander" open>
<summary>Tolerances &amp; options</summary>
<div class="dt-expander-body">
<div class="dt-cols-3">
<div class="dt-field"><label class="dt-label">Amount tolerance</label>
<div class="dt-input">0.0200</div>
<div class="dt-help-text">Absolute tolerance on amount (e.g. 0.01 to absorb cent rounding).</div></div>
<div class="dt-field"><label class="dt-label">Date tolerance (days)</label>
<div class="dt-input">1</div>
<div class="dt-help-text">Allow N calendar days of drift between posting dates.</div></div>
<div class="dt-field"><label class="dt-label">Invert right amount sign</label>
<div class="dt-check" style="margin-top:8px"><span class="box"></span></div>
<div class="dt-help-text">Use when one side records debits as positive and the other as negative.</div></div>
</div>
<div class="dt-field"><label class="dt-label">Description similarity boost (0 disables)</label>
<div class="dt-slider"><div class="track"><div class="fill" style="width:80%"></div><div class="knob" style="left:80%"></div></div><div class="val">80</div></div>
<div class="dt-help-text">When both sides have a description column set, accept matches with this minimum fuzzy similarity even if amount/date are merely within tolerance. Lower = more permissive.</div></div>
</div>
</details>
<hr class="dt-divider">
<button class="dt-btn dt-btn-primary dt-btn-block">Reconcile</button>
<hr class="dt-divider">
<!-- Results -->
<h2>Results</h2>
<div class="dt-metrics">
<div class="dt-metric"><div class="label">Review</div><div class="value">9</div></div>
<div class="dt-metric"><div class="label">Unmatched left</div><div class="value">22</div></div>
<div class="dt-metric"><div class="label">Unmatched right</div><div class="value">16</div></div>
<div class="dt-metric"><div class="label">Matched</div><div class="value">1,173</div></div>
</div>
<p class="dt-caption">Coverage: 97.4% of the larger side</p>
<!-- Tabs (st.tabs) — exceptions-first; Review active by default -->
<div class="dt-tabs">
<span class="dt-tab is-active">Review (9)</span>
<span class="dt-tab">Unmatched left (22)</span>
<span class="dt-tab">Unmatched right (16)</span>
<span class="dt-tab">Matched (1,173)</span>
</div>
<!-- Active tab content: Review (exceptions-first default) -->
<p class="dt-caption">Pairs flagged because the algorithm couldn't pick a single best match (e.g. multiple equally-good candidates). Use the left/right indices to disambiguate manually.</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>left_idx</th><th>left_amount</th><th>right_idx</th><th>right_value</th><th>candidates</th></tr></thead>
<tbody>
<tr><td>118</td><td>-450.00</td><td>121, 209</td><td>-450.00</td><td class="dt-cell-flag">2 equal</td></tr>
<tr><td>203</td><td>1000.00</td><td>198, 244</td><td>1000.00</td><td class="dt-cell-flag">2 equal</td></tr>
</tbody>
</table>
</div>
<!-- Other tab previews shown as collapsed expanders for review context -->
<details class="dt-expander">
<summary>Unmatched left (22) — only in bank_feed_may.csv</summary>
<div class="dt-expander-body">
<p class="dt-caption">Showing all 22 rows.</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>posted_date</th><th>description</th><th>amount</th><th>ref</th></tr></thead>
<tbody>
<tr><td class="dt-cell-del">2026-05-09</td><td class="dt-cell-del">BANK FEE</td><td class="dt-cell-del">-12.00</td><td class="dt-cell-del">FEE0001</td></tr>
<tr><td class="dt-cell-del">2026-05-14</td><td class="dt-cell-del">ATM WITHDRAWAL</td><td class="dt-cell-del">-200.00</td><td class="dt-cell-del">ATM7781</td></tr>
</tbody>
</table>
</div>
</div>
</details>
<details class="dt-expander">
<summary>Unmatched right (16) — only in ledger_may.xlsx</summary>
<div class="dt-expander-body">
<p class="dt-caption">Showing all 16 rows.</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr><th>txn_date</th><th>memo</th><th>value</th><th>invoice_no</th><th>account</th></tr></thead>
<tbody>
<tr><td class="dt-cell-del">2026-05-11</td><td class="dt-cell-del">Accrued interest</td><td class="dt-cell-del">37.50</td><td class="dt-cell-del">INV-9001</td><td class="dt-cell-del">7000</td></tr>
<tr><td class="dt-cell-del">2026-05-22</td><td class="dt-cell-del">Depreciation</td><td class="dt-cell-del">-410.00</td><td class="dt-cell-del">INV-9044</td><td class="dt-cell-del">8000</td></tr>
</tbody>
</table>
</div>
</div>
</details>
<details class="dt-expander">
<summary>Matched (1,173) — cleanly reconciled</summary>
<div class="dt-expander-body">
<p class="dt-caption">Preview of first 25 of 1,173 rows — download the CSV below for the full set.</p>
<div class="dt-table-wrap">
<table class="dt-table">
<thead><tr>
<th>left_posted_date</th><th>left_description</th><th>left_amount</th>
<th>right_txn_date</th><th>right_memo</th><th>right_value</th><th>amount_diff</th>
</tr></thead>
<tbody>
<tr><td>2026-05-01</td><td>ACME SUPPLIES</td><td>-1240.00</td><td>2026-05-01</td><td>Acme Supplies Inc</td><td>-1240.00</td><td class="dt-cell-add">0.00</td></tr>
<tr><td>2026-05-02</td><td>PAYROLL RUN</td><td>-8800.00</td><td>2026-05-02</td><td>Monthly payroll</td><td>-8800.00</td><td class="dt-cell-add">0.00</td></tr>
<tr><td>2026-05-03</td><td>CLIENT GLOBEX</td><td>5200.00</td><td>2026-05-03</td><td>Globex retainer</td><td>5200.00</td><td class="dt-cell-add">0.00</td></tr>
<tr><td>2026-05-04</td><td>UTILITY CO</td><td>-318.42</td><td>2026-05-04</td><td>City Utilities</td><td>-318.40</td><td class="dt-cell-flag">0.02</td></tr>
<tr><td>2026-05-06</td><td>OFFICE DEPOT</td><td>-89.15</td><td>2026-05-07</td><td>Office supplies</td><td>-89.15</td><td class="dt-cell-add">0.00</td></tr>
</tbody>
</table>
</div>
</div>
</details>
<hr class="dt-divider">
<!-- Downloads (st.columns(4) of html_download_button) — exceptions-first,
matching the tab/metric order; four parallel exports, equal weight -->
<div class="dt-btn-row">
<button class="dt-btn">Review CSV</button>
<button class="dt-btn">Unmatched left CSV</button>
<button class="dt-btn">Unmatched right CSV</button>
<button class="dt-btn">Matched CSV</button>
</div>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>
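
The tolerance settings in the page above (amount tolerance 0.02, date tolerance 1 day, optional sign inversion on the right side) and the coverage caption can be sketched as below. This is an illustration under stated assumptions, not the reconciler's actual matching code: `within_tolerance` and `coverage_pct` are hypothetical names, and coverage is assumed to be matched pairs divided by the row count of the larger file, which is consistent with the mockup's 1,173 matched out of 1,204 rows.

```python
from datetime import date
from decimal import Decimal


def within_tolerance(
    left_amount: Decimal,
    left_date: date,
    right_amount: Decimal,
    right_date: date,
    amount_tol: Decimal = Decimal("0.02"),
    date_tol_days: int = 1,
    invert_right_sign: bool = False,
) -> bool:
    """Check one candidate pair against the absolute amount and date tolerances."""
    if invert_right_sign:
        right_amount = -right_amount
    amount_ok = abs(left_amount - right_amount) <= amount_tol
    date_ok = abs((left_date - right_date).days) <= date_tol_days
    return amount_ok and date_ok


def coverage_pct(matched: int, left_rows: int, right_rows: int) -> float:
    """Assumed definition: matched pairs as a percentage of the larger side."""
    return round(100 * matched / max(left_rows, right_rows), 1)


# UTILITY CO row from the mockup: amounts differ by exactly 0.02, same date.
print(within_tolerance(Decimal("-318.42"), date(2026, 5, 4),
                       Decimal("-318.40"), date(2026, 5, 4)))  # True
print(coverage_pct(1173, 1204, 1198))  # 97.4
```

Amounts are compared as `Decimal` because a binary-float subtraction of -318.42 and -318.40 need not come out to exactly 0.02, which makes a pair sitting right on the tolerance boundary unreliable to classify.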

layout-review/app.css Normal file

@@ -0,0 +1,542 @@
/* ===========================================================================
DataTools — static layout-review stylesheet
---------------------------------------------------------------------------
Faithful reproduction of the live Streamlit app's design system for human
review of page layouts. Tokens are copied verbatim from src/gui/theme.py
(§3 color + type scale) and the component values from
src/gui/components/_legacy.py:_DESIGN_TOKENS_CSS.
The live app applies these styles to Streamlit's data-testid DOM; here we
re-express the same look against clean semantic classes so the static HTML
stays readable. Where the app uses real .dt-* classes (page header, files
card, findings, stats) the class names are kept identical.
=========================================================================== */
@import url("https://fonts.googleapis.com/css2?family=Geist:wght@400;500;600;700&family=Geist+Mono:wght@400;500&display=swap");
@import url("https://fonts.googleapis.com/css2?family=Material+Symbols+Outlined:opsz,wght,FILL,GRAD@20..48,400,0,0&display=block");
:root {
--font-sans: "Geist", -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
--font-mono: "Geist Mono", ui-monospace, "SF Mono", Menlo, monospace;
--ink: #1c1917;
--ink-secondary: #57534e;
--ink-tertiary: #a8a29e;
--bg: #fafaf7;
--surface: #ffffff;
--surface-hover: #f8f7f3;
--border: #e7e5dc;
--border-strong: #d6d3c7;
--accent: #c2410c;
--accent-hover: #9a3412;
--accent-fill: #fef4ed;
--accent-fill-strong: #fde4d3;
--warn: #b45309;
--warn-fill: #fef3c7;
--info: #0369a1;
--info-fill: #e0f2fe;
--success: #15803d;
--success-fill: #dcfce7;
--danger: #b91c1c;
--danger-fill: #fee2e2;
--r-sm: 6px;
--r-md: 10px;
--r-lg: 14px;
--sidebar-w: 264px;
}
* { box-sizing: border-box; }
html, body {
margin: 0;
padding: 0;
background: var(--bg);
color: var(--ink);
font-family: var(--font-sans);
font-feature-settings: "ss01", "cv01", "cv11";
-webkit-font-smoothing: antialiased;
}
/* ---------- Type scale (theme.py §4) ---------- */
h1 { font-size: 32px; font-weight: 600; letter-spacing: -0.035em; line-height: 1.1; margin: 0 0 4px; }
h2 { font-size: 22px; font-weight: 600; letter-spacing: -0.025em; line-height: 1.2; margin: 1.5rem 0 0.75rem; }
h3 { font-size: 18px; font-weight: 500; letter-spacing: -0.018em; line-height: 1.25; margin: 1.25rem 0 0.5rem; }
h4 { font-size: 15px; font-weight: 500; letter-spacing: -0.012em; line-height: 1.35; margin: 1rem 0 0.5rem; }
p { font-size: 14px; font-weight: 400; line-height: 1.55; color: var(--ink); margin: 0 0 0.6rem; }
strong { font-weight: 500; color: var(--ink); }
a { color: var(--accent); text-decoration: none; }
a:hover { color: var(--accent-hover); text-decoration: underline; }
code, .dt-mono { font-family: var(--font-mono); font-size: 0.92em; font-feature-settings: "ss02"; }
/* ===========================================================================
App frame — sidebar + main + sticky footer
=========================================================================== */
.dt-app { display: flex; min-height: 100vh; }
/* ---------- Sidebar (cream paper) ---------- */
.dt-sidebar {
width: var(--sidebar-w);
flex-shrink: 0;
background: #f5f4ef;
border-right: 1px solid var(--border);
padding: 18px 14px 90px;
position: sticky;
top: 0;
align-self: flex-start;
height: 100vh;
overflow-y: auto;
}
.dt-brand { display: flex; align-items: center; gap: 10px; padding: 0 4px 18px; }
.dt-brand-mark {
width: 28px; height: 28px; border-radius: 7px;
background: var(--ink); color: var(--accent-fill);
display: inline-flex; align-items: center; justify-content: center;
font-weight: 700; font-size: 16px; letter-spacing: -0.04em; line-height: 1; flex-shrink: 0;
}
.dt-brand-name { display: flex; flex-direction: column; gap: 1px; line-height: 1.05; }
.dt-brand-eyebrow {
font-size: 9.5px; font-weight: 600; letter-spacing: 0.14em;
text-transform: uppercase; color: var(--ink-tertiary); line-height: 1;
}
.dt-brand-word { font-weight: 600; font-size: 15px; letter-spacing: -0.02em; color: var(--ink); }
.dt-nav { display: flex; flex-direction: column; }
.dt-nav-section {
font-size: 11.5px; text-transform: uppercase; letter-spacing: 0.08em;
color: var(--ink-tertiary); font-weight: 500;
padding: 14px 10px 4px; margin: 0;
display: flex; align-items: center; justify-content: space-between;
}
.dt-nav-section .dt-nav-indicator { font-size: 16px; color: var(--ink-tertiary); }
.dt-nav-link {
display: flex; align-items: center; gap: 8px;
color: var(--ink-secondary); font-size: 13px; font-weight: 500; line-height: 1.3;
padding: 5px 10px; border-radius: var(--r-sm); margin-bottom: 1px;
text-decoration: none; transition: background 0.12s ease, color 0.12s ease;
}
.dt-nav-link:hover { background: rgba(0,0,0,0.04); color: var(--ink); text-decoration: none; }
.dt-nav-link.is-active { background: rgba(0,0,0,0.04); color: var(--ink); font-weight: 600; }
.dt-nav-link .dt-mi { font-family: "Material Symbols Outlined"; font-size: 18px; color: var(--ink-secondary); line-height: 1; }
.dt-nav-link.is-active .dt-mi { color: var(--ink); }
.dt-nav-link.is-soon { opacity: 0.55; }
/* "Start here" front-door item: weightier than ordinary nav links so the
obvious entry point reads at a glance. Accent-fill ground + accent-hover ink,
slightly larger hit area, with bottom margin to separate it from the groups below.
Layers on .dt-nav-link; the .dt-nav-start.is-active rule below restates the
accent treatment so the generic .dt-nav-link.is-active ground does not replace it. */
.dt-nav-start {
background: var(--accent-fill); color: var(--accent-hover); font-weight: 600;
padding: 8px 10px; margin-bottom: 12px;
}
.dt-nav-start:hover { background: var(--accent-fill-strong); color: var(--accent-hover); }
.dt-nav-start .dt-mi { color: var(--accent); }
.dt-nav-start.is-active { background: var(--accent-fill-strong); color: var(--accent-hover); }
.dt-nav-start.is-active .dt-mi { color: var(--accent); }
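/* Usage sketch (assumed markup; the sidebar is built by shell.js, so the href,
icon name and label here are illustrative placeholders, not the real nav):
  <a class="dt-nav-link dt-nav-start" href="home.html">
    <span class="dt-mi">home</span> Start here
  </a>
Adding is-active to the same element keeps the accent ground, because
.dt-nav-start.is-active is declared after .dt-nav-link.is-active at equal specificity. */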
.dt-nav-soon-tag {
margin-left: auto; font-size: 9px; font-weight: 600; letter-spacing: 0.06em;
text-transform: uppercase; color: var(--ink-tertiary);
border: 1px solid var(--border-strong); border-radius: 999px; padding: 1px 6px;
}
.dt-sidebar-foot { margin-top: 22px; padding-top: 16px; border-top: 1px solid var(--border); display: flex; flex-direction: column; gap: 10px; }
.dt-sidebar-label { font-size: 11.5px; font-weight: 500; text-transform: uppercase; letter-spacing: 0.08em; color: var(--ink-tertiary); margin-bottom: 4px; }
.dt-license-badge { font-size: 12.5px; color: var(--ink-secondary); }
/* ---------- Main column ---------- */
.dt-main { flex: 1; min-width: 0; padding: 40px 56px 96px; }
.dt-main-inner { max-width: 920px; margin: 0 auto; }
/* Review banner above every mockup */
.dt-review-banner {
max-width: 920px; margin: 0 auto 20px; display: flex; gap: 10px; align-items: center;
background: var(--info-fill); color: var(--info);
border: 1px solid transparent; border-radius: var(--r-md);
padding: 8px 14px; font-size: 12.5px; line-height: 1.4;
}
.dt-review-banner a { color: var(--info); text-decoration: underline; }
.dt-review-banner .dt-mi { font-family: "Material Symbols Outlined"; font-size: 18px; }
/* ---------- Sticky footer ---------- */
.dt-footer {
position: fixed; bottom: 0; left: var(--sidebar-w); right: 0;
background: rgba(255,255,255,0.97); backdrop-filter: blur(8px);
border-top: 1px solid var(--border-strong);
padding: 8px 20px; z-index: 50;
display: flex; align-items: center; gap: 8px;
}
.dt-footer-btn {
display: inline-flex; align-items: center; gap: 8px;
color: var(--ink-secondary); font-size: 13px; font-weight: 500; line-height: 1.3;
padding: 5px 10px; border-radius: var(--r-sm);
background: transparent; border: none; cursor: pointer; text-decoration: none;
}
.dt-footer-btn:hover { background: rgba(0,0,0,0.04); color: var(--ink); text-decoration: none; }
.dt-footer-btn .dt-mi { font-family: "Material Symbols Outlined"; font-size: 16px; }
/* ===========================================================================
Page header (brand + privacy pill) — .dt-page-* mirror the live app
=========================================================================== */
.dt-page-header {
display: flex; align-items: center; justify-content: space-between; gap: 24px;
margin: 0 0 24px; padding-bottom: 22px; border-bottom: 1px solid var(--border);
}
.dt-page-brand { display: flex; flex-direction: column; gap: 8px; }
.dt-page-brand-row { display: flex; align-items: center; gap: 18px; }
.dt-page-brand-mark {
width: 56px; height: 56px; border-radius: 14px; background: var(--ink);
color: var(--accent-fill); display: inline-flex; align-items: center; justify-content: center;
font-weight: 700; font-size: 32px; letter-spacing: -0.04em; line-height: 1; flex-shrink: 0;
}
.dt-page-brand-words { display: flex; flex-direction: column; gap: 2px; line-height: 1; }
.dt-page-eyebrow { font-size: 11.5px; font-weight: 600; letter-spacing: 0.14em; text-transform: uppercase; color: var(--ink-tertiary); line-height: 1.2; }
.dt-page-wordmark { margin: 0; font-weight: 600; font-size: 32px; letter-spacing: -0.035em; line-height: 1.1; color: var(--ink); }
.dt-page-subtitle { margin: 4px 0 0; color: var(--ink-secondary); font-size: 14px; line-height: 1.5; }
.dt-privacy-pill {
display: inline-flex; align-items: center; gap: 6px; padding: 6px 11px;
background: var(--success-fill); color: var(--success); border-radius: 999px;
font-size: 12px; font-weight: 500; white-space: nowrap; flex-shrink: 0;
}
.dt-privacy-pill svg { width: 13px; height: 13px; stroke-width: 2; }
/* ---------- Tool header (title + Help popover) ---------- */
.dt-tool-header { display: flex; align-items: flex-start; justify-content: space-between; gap: 16px; }
.dt-tool-header h1 { margin: 0; }
.dt-help-btn {
display: inline-flex; align-items: center; gap: 6px; white-space: nowrap;
background: var(--surface); color: var(--ink); border: 1px solid var(--border-strong);
border-radius: var(--r-md); padding: 9px 16px; font-size: 13.5px; font-weight: 500;
cursor: pointer; flex-shrink: 0; margin-top: 6px;
}
.dt-help-btn .dt-mi { font-family: "Material Symbols Outlined"; font-size: 18px; }
.dt-tool-caption { font-size: 12.5px; color: var(--ink-tertiary); line-height: 1.5; margin: 2px 0 0; }
/* Right-side actions cluster in a tool header: the local-first privacy pill +
the Help button. One shared class so every tool page aligns identically
(replaces per-page inline flex/gap/margin drift). */
.dt-tool-header-actions { display: flex; align-items: center; gap: 12px; flex-shrink: 0; margin-top: 6px; }
.dt-tool-header-actions .dt-help-btn { margin-top: 0; }
/* ===========================================================================
Buttons
=========================================================================== */
.dt-btn {
border-radius: var(--r-md); font-family: var(--font-sans); font-weight: 500;
font-size: 13.5px; letter-spacing: -0.005em; line-height: 1; padding: 9px 16px;
border: 1px solid var(--border-strong); background: var(--surface); color: var(--ink);
cursor: pointer; transition: background 0.12s ease, border-color 0.12s ease, color 0.12s ease;
display: inline-flex; align-items: center; justify-content: center; gap: 8px;
}
.dt-btn:hover { background: var(--surface-hover); border-color: var(--ink-tertiary); }
.dt-btn-primary { background: var(--ink); color: var(--bg); border-color: var(--ink); }
.dt-btn-primary:hover { background: #292524; border-color: #292524; color: var(--bg); }
.dt-btn-tertiary { background: transparent; border: none; color: var(--ink-tertiary); padding: 4px 8px; }
.dt-btn-tertiary:hover { background: var(--danger-fill); color: var(--danger); }
.dt-btn:disabled, .dt-btn.is-disabled {
background: var(--surface-hover); color: var(--ink-tertiary);
border: 1px solid var(--border); cursor: not-allowed;
}
.dt-btn-block { width: 100%; }
.dt-btn .dt-mi { font-family: "Material Symbols Outlined"; font-size: 18px; }
.dt-btn-row { display: flex; gap: 10px; flex-wrap: wrap; }
.dt-btn-row > .dt-btn { flex: 1; }
/* ===========================================================================
File uploader (cream dropzone)
=========================================================================== */
.dt-uploader {
background: var(--surface-hover); border: 1px dashed var(--border-strong);
border-radius: var(--r-md); padding: 22px 20px;
display: flex; align-items: center; justify-content: space-between; gap: 16px;
}
.dt-uploader-text { display: flex; flex-direction: column; gap: 2px; }
.dt-uploader-text .hint { font-size: 14px; color: var(--ink); }
.dt-uploader-text .sub { font-size: 12.5px; color: var(--ink-tertiary); }
.dt-uploader .dt-mi { font-family: "Material Symbols Outlined"; font-size: 24px; color: var(--ink-tertiary); }
/* Staged-file chip */
.dt-file-chip {
display: flex; align-items: center; gap: 12px;
background: var(--surface); border: 1px solid var(--border); border-radius: var(--r-sm);
padding: 10px 14px; margin-top: 10px;
}
.dt-file-chip .name { font-family: var(--font-mono); font-size: 13px; color: var(--ink); font-feature-settings: "ss02"; }
.dt-file-chip .size { font-family: var(--font-mono); font-size: 12px; color: var(--ink-tertiary); margin-left: auto; }
/* ===========================================================================
Expanders / bordered cards
=========================================================================== */
.dt-expander {
background: var(--surface); border: 1px solid var(--border); border-radius: var(--r-lg);
overflow: hidden; box-shadow: 0 1px 2px rgba(28,25,23,0.03); margin: 10px 0;
}
.dt-expander > summary, .dt-expander-head {
background: var(--surface-hover); border-bottom: 1px solid var(--border);
padding: 12px 16px; font-weight: 500; color: var(--ink); font-size: 14px;
cursor: pointer; list-style: none; display: flex; align-items: center; gap: 8px;
}
.dt-expander > summary::-webkit-details-marker { display: none; }
.dt-expander > summary::before {
content: "expand_more"; font-family: "Material Symbols Outlined"; font-size: 20px;
color: var(--ink-tertiary); transition: transform 0.15s ease;
}
.dt-expander[open] > summary::before { transform: rotate(180deg); }
.dt-expander-body, .dt-expander > .dt-expander-body { padding: 14px 16px; }
.dt-expander:not([open]) > summary { border-bottom: none; }
.dt-card {
background: var(--surface); border: 1px solid var(--border); border-radius: var(--r-lg);
box-shadow: 0 1px 2px rgba(28,25,23,0.03); padding: 16px; margin: 10px 0;
}
/* ===========================================================================
Alerts
=========================================================================== */
.dt-alert {
border-radius: var(--r-md); border: 1px solid transparent;
padding: 10px 14px; font-size: 13.5px; line-height: 1.45; margin: 10px 0;
display: flex; gap: 10px; align-items: flex-start;
}
.dt-alert .dt-mi { font-family: "Material Symbols Outlined"; font-size: 18px; flex-shrink: 0; margin-top: 1px; }
.dt-alert.info { background: var(--info-fill); color: var(--info); }
.dt-alert.success { background: var(--success-fill); color: var(--success); }
.dt-alert.warn { background: var(--warn-fill); color: var(--warn); }
.dt-alert.error { background: var(--danger-fill); color: var(--danger); }
.dt-alert code { background: rgba(0,0,0,0.05); padding: 1px 5px; border-radius: 4px; }
/* Next-step strip — slim single-line "what to do next" suggestion shown at the
end of a tool's results. Subtle accent ground + left accent rule so it nudges
without competing with alerts; the trailing dismiss control is unobtrusive. */
.dt-next-step {
display: flex; align-items: center; gap: 10px;
background: var(--accent-fill); border-left: 3px solid var(--accent);
border-radius: var(--r-md); padding: 10px 14px; margin: 16px 0;
font-size: 13.5px; line-height: 1.4; color: var(--ink);
}
.dt-next-step .dt-mi { font-family: "Material Symbols Outlined"; font-size: 18px; color: var(--accent); flex-shrink: 0; }
.dt-next-step a { color: var(--accent); font-weight: 500; }
.dt-next-step a:hover { color: var(--accent-hover); }
.dt-next-step-dismiss {
margin-left: auto; background: transparent; border: none; cursor: pointer;
color: var(--ink-tertiary); font-size: 13px; line-height: 1; padding: 2px 4px;
}
.dt-next-step-dismiss:hover { color: var(--ink-secondary); }
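/* Usage sketch (assumed markup; icon name, copy and link target are placeholders):
  <div class="dt-next-step">
    <span class="dt-mi">arrow_forward</span>
    <span>Next: <a href="#">Find Duplicates</a></span>
    <button class="dt-next-step-dismiss" aria-label="Dismiss">Dismiss</button>
  </div>
The dismiss control is pushed to the trailing edge by its margin-left: auto. */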
/* ===========================================================================
Inputs (static representations of Streamlit widgets)
=========================================================================== */
.dt-field { margin: 10px 0; }
.dt-label { font-size: 13px; font-weight: 500; color: var(--ink); margin-bottom: 5px; display: block; }
.dt-label .req { color: var(--accent); }
.dt-input, .dt-select, .dt-textarea {
width: 100%; background: var(--surface); border: 1px solid var(--border-strong);
border-radius: var(--r-sm); padding: 8px 11px; font-family: var(--font-sans);
font-size: 13.5px; color: var(--ink);
}
.dt-select {
appearance: none;
background-image: linear-gradient(45deg, transparent 50%, var(--ink-tertiary) 50%), linear-gradient(135deg, var(--ink-tertiary) 50%, transparent 50%);
background-position: calc(100% - 16px) 14px, calc(100% - 11px) 14px;
background-size: 5px 5px, 5px 5px;
background-repeat: no-repeat;
}
.dt-textarea { min-height: 76px; resize: vertical; font-family: var(--font-mono); font-size: 13px; }
.dt-help-text { font-size: 12px; color: var(--ink-tertiary); margin-top: 4px; }
/* Multiselect — chips inside a box */
.dt-multiselect {
width: 100%; background: var(--surface); border: 1px solid var(--border-strong);
border-radius: var(--r-sm); padding: 6px 8px; min-height: 38px;
display: flex; flex-wrap: wrap; gap: 6px; align-items: center;
}
.dt-ms-chip {
display: inline-flex; align-items: center; gap: 5px; background: var(--accent-fill);
color: var(--accent-hover); border-radius: var(--r-sm); padding: 3px 8px;
font-size: 12.5px; font-weight: 500;
}
.dt-ms-chip .x { color: var(--accent); font-size: 13px; }
.dt-ms-placeholder { color: var(--ink-tertiary); font-size: 13px; padding: 2px 4px; }
/* Checkbox / radio */
.dt-check { display: flex; align-items: center; gap: 9px; margin: 8px 0; font-size: 13.5px; color: var(--ink); }
.dt-check .box {
width: 18px; height: 18px; border-radius: 5px; border: 1px solid var(--border-strong);
background: var(--surface); display: inline-flex; align-items: center; justify-content: center; flex-shrink: 0;
}
.dt-check.on .box { background: var(--ink); border-color: var(--ink); color: var(--bg); }
.dt-check.on .box .dt-mi { font-family: "Material Symbols Outlined"; font-size: 14px; }
.dt-radio-row { display: flex; gap: 18px; flex-wrap: wrap; margin: 8px 0; }
.dt-radio { display: inline-flex; align-items: center; gap: 7px; font-size: 13.5px; }
.dt-radio .dot { width: 16px; height: 16px; border-radius: 50%; border: 1px solid var(--border-strong); display: inline-block; flex-shrink: 0; }
.dt-radio.on .dot { border: 5px solid var(--ink); }
/* Strategy precedence legend + overridden state (Fix Missing Values).
Makes the preset -> global -> per-column resolution order legible and
visibly dims a layer when a more specific layer wins. */
.dt-precedence {
display: flex; align-items: center; gap: 8px;
background: var(--surface-hover); border: 1px solid var(--border);
border-radius: var(--r-md); padding: 9px 13px; margin: 0 0 14px;
font-size: 12.5px; color: var(--ink-secondary); line-height: 1.4;
}
.dt-precedence .dt-mi { font-family: "Material Symbols Outlined"; font-size: 18px; color: var(--ink-tertiary); flex-shrink: 0; }
.dt-precedence strong { color: var(--ink); font-weight: 600; }
.dt-radio-row.is-overridden { opacity: 0.5; }
.dt-radio-row.is-overridden .dt-radio { text-decoration: line-through; text-decoration-color: var(--ink-tertiary); }
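/* Usage sketch (assumed markup; the icon name and option labels are placeholders).
The legend sits above the controls, and is-overridden marks a layer whose
choice is superseded by a more specific one:
  <div class="dt-precedence">
    <span class="dt-mi">layers</span>
    <span>Resolution order: <strong>preset</strong> -> <strong>global</strong> -> <strong>per-column</strong></span>
  </div>
  <div class="dt-radio-row is-overridden">
    <span class="dt-radio on"><span class="dot"></span> Option A</span>
    <span class="dt-radio"><span class="dot"></span> Option B</span>
  </div>
*/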
/* Slider */
.dt-slider { margin: 14px 0 6px; }
.dt-slider .track { position: relative; height: 4px; background: var(--border-strong); border-radius: 2px; }
.dt-slider .fill { position: absolute; left: 0; top: 0; height: 4px; background: var(--ink); border-radius: 2px; }
.dt-slider .knob { position: absolute; top: 50%; width: 16px; height: 16px; border-radius: 50%; background: var(--ink); transform: translate(-50%, -50%); }
.dt-slider .val { font-family: var(--font-mono); font-size: 12px; color: var(--ink-secondary); margin-top: 8px; }
/* ===========================================================================
Layout helpers
=========================================================================== */
.dt-row { display: flex; gap: 16px; }
.dt-row > * { flex: 1; min-width: 0; }
.dt-cols-2 { display: grid; grid-template-columns: 1fr 1fr; gap: 16px; }
.dt-cols-3 { display: grid; grid-template-columns: repeat(3, 1fr); gap: 16px; }
.dt-divider { border: none; border-top: 1px solid var(--border); margin: 22px 0; }
.dt-caption { font-size: 12.5px; color: var(--ink-tertiary); line-height: 1.5; }
.dt-spacer { height: 12px; }
/* ===========================================================================
DataFrame / preview table
=========================================================================== */
.dt-table-wrap { border: 1px solid var(--border); border-radius: var(--r-md); overflow: hidden; margin: 8px 0; }
table.dt-table { width: 100%; border-collapse: collapse; font-size: 13px; }
table.dt-table th {
background: var(--surface-hover); color: var(--ink-secondary); font-weight: 500;
text-align: left; padding: 8px 12px; border-bottom: 1px solid var(--border);
font-size: 12px; text-transform: none; white-space: nowrap;
}
table.dt-table td {
padding: 7px 12px; border-bottom: 1px solid var(--border);
font-family: var(--font-mono); font-size: 12.5px; color: var(--ink); font-feature-settings: "ss02"; white-space: nowrap;
}
table.dt-table tr:last-child td { border-bottom: none; }
table.dt-table tr:nth-child(even) td { background: #fcfbf8; }
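/* tr in the selector matches the zebra rule's specificity, so the index column keeps its ground on even rows. */
table.dt-table tr td.idx { color: var(--ink-tertiary); background: var(--surface-hover); }
/* Cell states are also qualified with table.dt-table so they outrank the base td color rule above. */
.dt-cell-flag, table.dt-table td.dt-cell-flag { color: var(--warn); }
.dt-cell-del, table.dt-table td.dt-cell-del { color: var(--danger); text-decoration: line-through; }
.dt-cell-add, table.dt-table td.dt-cell-add { color: var(--success); }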
/* ===========================================================================
Stats overview (home) — copied from _legacy.py
=========================================================================== */
.dt-stats { display: grid; grid-template-columns: repeat(4, 1fr); gap: 12px; margin: 8px 0 20px; }
.dt-stat { background: var(--surface); border: 1px solid var(--border); border-radius: var(--r-lg); padding: 16px 18px; box-shadow: 0 1px 2px rgba(28,25,23,0.03); }
.dt-stat-label { font-size: 11.5px; text-transform: uppercase; letter-spacing: 0.08em; color: var(--ink-tertiary); font-weight: 500; margin-bottom: 6px; line-height: 1.4; }
.dt-stat-value { font-size: 28px; font-weight: 600; letter-spacing: -0.03em; line-height: 1; color: var(--ink); display: flex; align-items: baseline; gap: 6px; }
.dt-stat-unit { font-size: 12px; font-weight: 400; color: var(--ink-tertiary); letter-spacing: 0; }
.dt-stat.is-warn .dt-stat-value { color: var(--warn); }
.dt-stat.is-info .dt-stat-value { color: var(--info); }
.dt-stat.is-success .dt-stat-value { color: var(--success); }
@media (max-width: 900px) { .dt-stats { grid-template-columns: repeat(2, 1fr); } }
/* Metric (st.metric) */
.dt-metrics { display: flex; gap: 28px; flex-wrap: wrap; margin: 6px 0 14px; }
.dt-metric .label { font-size: 12.5px; color: var(--ink-tertiary); margin-bottom: 4px; }
.dt-metric .value { font-size: 26px; font-weight: 600; letter-spacing: -0.03em; color: var(--ink); line-height: 1; }
.dt-metric .delta { font-size: 12.5px; margin-top: 3px; }
.dt-metric .delta.up { color: var(--success); }
.dt-metric .delta.down { color: var(--danger); }
/* ===========================================================================
Files card (home) — copied from _legacy.py
=========================================================================== */
.dt-files-section-head { display: flex; align-items: baseline; justify-content: space-between; margin: 4px 0 10px; gap: 12px; }
.dt-files-section-head h2 { margin: 0; }
.dt-section-meta { font-size: 12.5px; color: var(--ink-tertiary); }
.dt-file-row { display: flex; align-items: center; gap: 12px; }
.dt-file-icon-chip { width: 28px; height: 28px; border-radius: var(--r-sm); background: var(--accent-fill); color: var(--accent); display: inline-flex; align-items: center; justify-content: center; flex-shrink: 0; }
.dt-file-icon-chip svg { width: 14px; height: 14px; stroke-width: 1.8; }
.dt-file-name { font-family: var(--font-mono); font-size: 13px; color: var(--ink); font-feature-settings: "ss02"; }
.dt-file-size { font-family: var(--font-mono); font-size: 12px; color: var(--ink-tertiary); font-feature-settings: "ss02"; }
.dt-file-add {
display: flex; align-items: center; justify-content: center; gap: 8px;
width: 100%; padding: 12px 16px; background: var(--surface-hover);
border: none; border-top: 1px dashed var(--border-strong);
border-radius: 0 0 var(--r-lg) var(--r-lg); cursor: pointer;
font-size: 13px; font-weight: 500; color: var(--ink-secondary); margin-top: 14px;
}
.dt-file-add:hover { background: var(--accent-fill); color: var(--accent); }
.dt-file-add svg { width: 14px; height: 14px; stroke-width: 2; }
/* ===========================================================================
Findings panel — copied from _legacy.py
=========================================================================== */
.dt-finding-group-head {
display: flex; align-items: center; gap: 12px; padding: 16px 22px;
border-bottom: 1px solid var(--border); background: var(--surface-hover);
margin: -16px -16px 1.2rem; border-radius: var(--r-lg) var(--r-lg) 0 0;
cursor: pointer; user-select: none;
}
.dt-finding-group-chevron { color: var(--ink-tertiary); font-family: "Material Symbols Outlined"; font-size: 20px; line-height: 1; flex-shrink: 0; }
.dt-severity-dot { width: 8px; height: 8px; border-radius: 50%; flex-shrink: 0; display: inline-block; }
.dt-severity-dot.warn { background: var(--warn); }
.dt-severity-dot.info { background: var(--info); }
.dt-severity-dot.error { background: var(--danger); }
.dt-severity-dot.success { background: var(--success); }
.dt-group-filename { font-family: var(--font-mono); font-size: 13.5px; font-weight: 500; color: var(--ink); font-feature-settings: "ss02"; }
.dt-group-counts { margin-left: auto; display: flex; align-items: center; gap: 8px; }
.dt-count-pill { display: inline-flex; align-items: center; padding: 3px 9px; border-radius: 999px; font-size: 11.5px; font-weight: 500; line-height: 1.4; white-space: nowrap; }
.dt-count-pill.warn { background: var(--warn-fill); color: var(--warn); }
.dt-count-pill.info { background: var(--info-fill); color: var(--info); }
.dt-count-pill.error { background: var(--danger-fill); color: var(--danger); }
.dt-count-pill.success { background: var(--success-fill); color: var(--success); }
.dt-finding-row { display: flex; align-items: flex-start; gap: 12px; padding: 12px 0; border-top: 1px solid var(--border); }
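/* :first-of-type matches by element type, not class; the sibling selector also covers a row that directly follows the group head. */
.dt-finding-row:first-of-type, .dt-finding-group-head + .dt-finding-row { border-top: none; }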
.dt-finding-icon { width: 24px; height: 24px; border-radius: var(--r-sm); display: inline-flex; align-items: center; justify-content: center; flex-shrink: 0; }
.dt-finding-icon.warn { background: var(--warn-fill); color: var(--warn); }
.dt-finding-icon.info { background: var(--info-fill); color: var(--info); }
.dt-finding-icon.error { background: var(--danger-fill); color: var(--danger); }
.dt-finding-icon .dt-mi { font-family: "Material Symbols Outlined"; font-size: 16px; line-height: 1; }
.dt-finding-body { flex: 1; min-width: 0; }
.dt-finding-title { font-size: 14px; color: var(--ink); margin: 0 0 2px; line-height: 1.4; letter-spacing: -0.005em; }
.dt-finding-title strong { font-weight: 500; }
.dt-finding-meta { font-family: var(--font-mono); font-size: 12px; color: var(--ink-tertiary); line-height: 1.4; margin: 0; font-feature-settings: "ss02"; }
/* Overflow control — sits at the foot of a findings card when rows are hidden.
Bleeds to the card edges (cancels the .dt-card 16px padding) like .dt-file-add. */
.dt-finding-more {
display: flex; align-items: center; justify-content: center; gap: 6px;
width: calc(100% + 32px); margin: 4px -16px -16px;
padding: 11px 16px; background: var(--surface-hover);
border: none; border-top: 1px solid var(--border);
border-radius: 0 0 var(--r-lg) var(--r-lg); cursor: pointer;
font-family: var(--font-sans); font-size: 12.5px; font-weight: 500; color: var(--ink-secondary);
}
.dt-finding-more:hover { background: var(--accent-fill); color: var(--accent); }
.dt-finding-more .dt-mi { font-family: "Material Symbols Outlined"; font-size: 18px; }
/* Collapsed findings panel — the group head fills the whole card (head only,
no body). Proper state variant so the two states don't drift; replaces the
per-instance inline margin-bottom:-16px hack. */
.dt-card.is-collapsed { padding: 0; }
.dt-finding-group-head.is-collapsed { margin: 0; border-bottom: none; border-radius: var(--r-lg); }
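/* Usage sketch (assumed markup; file name, icon name and count are placeholders).
Both elements carry is-collapsed so the head fills the card with no body:
  <div class="dt-card is-collapsed">
    <div class="dt-finding-group-head is-collapsed">
      <span class="dt-finding-group-chevron">chevron_right</span>
      <span class="dt-severity-dot warn"></span>
      <span class="dt-group-filename">example.csv</span>
      <span class="dt-group-counts"><span class="dt-count-pill warn">3 warnings</span></span>
    </div>
  </div>
*/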
/* Match-group review card (dedup) */
.dt-match-card { background: var(--surface); border: 1px solid var(--border); border-radius: var(--r-lg); box-shadow: 0 1px 2px rgba(28,25,23,0.03); margin: 12px 0; overflow: hidden; }
.dt-match-head { background: var(--surface-hover); border-bottom: 1px solid var(--border); padding: 12px 16px; display: flex; align-items: center; gap: 12px; }
.dt-match-head .title { font-weight: 500; font-size: 14px; }
.dt-match-head .conf { margin-left: auto; }
.dt-match-body { padding: 14px 16px; }
.dt-keep-row { background: var(--success-fill); }
.dt-keep-tag { display: inline-flex; align-items: center; gap: 4px; background: var(--success-fill); color: var(--success); border-radius: 999px; padding: 2px 8px; font-size: 11px; font-weight: 500; }
/* Progress bar */
.dt-progress { height: 6px; background: var(--border); border-radius: 3px; overflow: hidden; margin: 10px 0; }
.dt-progress .bar { height: 100%; background: var(--ink); border-radius: 3px; }
/* Tabs */
.dt-tabs { display: flex; gap: 18px; border-bottom: 1px solid var(--border); margin: 10px 0 16px; }
.dt-tab { font-size: 13.5px; color: var(--ink-secondary); padding: 8px 2px; border-bottom: 2px solid transparent; cursor: pointer; }
.dt-tab.is-active { color: var(--ink); font-weight: 500; border-bottom-color: var(--accent); }
/* Code block */
.dt-code { background: var(--surface-hover); border: 1px solid var(--border); border-radius: var(--r-md); padding: 12px 14px; font-family: var(--font-mono); font-size: 12.5px; color: var(--ink); white-space: pre; overflow-x: auto; font-feature-settings: "ss02"; }
@media (max-width: 1100px) {
.dt-footer { left: 0; }
.dt-sidebar { display: none; }
.dt-main { padding: 28px 24px 96px; }
}

layout-review/home.html Normal file

@@ -0,0 +1,206 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Layout review — File Analysis (Home)</title>
<link rel="stylesheet" href="app.css">
</head>
<body data-page="home">
<div class="dt-app">
<aside class="dt-sidebar" id="dt-sidebar"></aside>
<main class="dt-main">
<div class="dt-review-banner">
<span class="dt-mi">visibility</span>
<span>Static layout preview of the <strong>Home / File Analysis</strong> page, shown with three imported files in the post-analysis state. <a href="index.html">All pages →</a></span>
</div>
<div class="dt-main-inner">
<!-- Page header: brand block + privacy pill -->
<header class="dt-page-header">
<div class="dt-page-brand">
<div class="dt-page-brand-row">
<div class="dt-page-brand-mark">D</div>
<div class="dt-page-brand-words">
<span class="dt-page-eyebrow">UNALOGIX</span>
<h1 class="dt-page-wordmark">DataTools</h1>
</div>
</div>
<p class="dt-page-subtitle">Clean. Normalize. Transform.</p>
</div>
<span class="dt-privacy-pill">
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor">
<rect x="4" y="11" width="16" height="10" rx="2"/>
<path d="M8 11V7a4 4 0 018 0v4"/>
</svg>
Runs 100% locally
</span>
</header>
<!-- Files section head -->
<div class="dt-files-section-head">
<h2>Files</h2>
<span class="dt-section-meta">3 files · 4.7 MB total</span>
</div>
<!-- Files card -->
<div class="dt-card" style="padding-bottom:0">
<div class="dt-file-row" style="padding:6px 0">
<button class="dt-btn dt-btn-tertiary" title="Remove"></button>
<span class="dt-file-icon-chip"><svg viewBox="0 0 24 24" fill="none" stroke="currentColor"><path d="M14 2H6a2 2 0 00-2 2v16a2 2 0 002 2h12a2 2 0 002-2V8z"/><path d="M14 2v6h6"/></svg></span>
<span class="dt-file-name">customers_export.csv</span>
<span class="dt-file-size" style="margin-left:auto">2.1 MB</span>
</div>
<div class="dt-file-row" style="padding:6px 0">
<button class="dt-btn dt-btn-tertiary" title="Remove"></button>
<span class="dt-file-icon-chip"><svg viewBox="0 0 24 24" fill="none" stroke="currentColor"><path d="M14 2H6a2 2 0 00-2 2v16a2 2 0 002 2h12a2 2 0 002-2V8z"/><path d="M14 2v6h6"/></svg></span>
<span class="dt-file-name">q3_transactions.xlsx</span>
<span class="dt-file-size" style="margin-left:auto">1.8 MB</span>
</div>
<div class="dt-file-row" style="padding:6px 0">
<button class="dt-btn dt-btn-tertiary" title="Remove"></button>
<span class="dt-file-icon-chip"><svg viewBox="0 0 24 24" fill="none" stroke="currentColor"><path d="M14 2H6a2 2 0 00-2 2v16a2 2 0 002 2h12a2 2 0 002-2V8z"/><path d="M14 2v6h6"/></svg></span>
<span class="dt-file-name">vendor_list.csv</span>
<span class="dt-file-size" style="margin-left:auto">0.8 MB</span>
</div>
<button class="dt-file-add" style="margin-left:-16px;margin-right:-16px;width:calc(100% + 32px)">
<svg viewBox="0 0 24 24" fill="none" stroke="currentColor"><path d="M12 5v14M5 12h14"/></svg> Add more files
</button>
</div>
<!-- Action bar -->
<div class="dt-btn-row" style="margin-top:16px">
<button class="dt-btn dt-btn-primary" style="flex:0 0 auto">Run analysis</button>
<button class="dt-btn" style="flex:0 0 auto">Clear results</button>
</div>
<hr class="dt-divider">
<!-- Stats overview -->
<div class="dt-stats">
<div class="dt-stat">
<div class="dt-stat-label">Rows scanned</div>
<div class="dt-stat-value">48,210 <span class="dt-stat-unit">rows</span></div>
</div>
<div class="dt-stat">
<div class="dt-stat-label">Total findings</div>
<div class="dt-stat-value">14</div>
</div>
<div class="dt-stat is-warn">
<div class="dt-stat-label">Warnings</div>
<div class="dt-stat-value">9 <span class="dt-stat-unit">to review</span></div>
</div>
<div class="dt-stat is-info">
<div class="dt-stat-label">Info</div>
<div class="dt-stat-value">5 <span class="dt-stat-unit">suggestions</span></div>
</div>
</div>
<!-- ======================================================================
FRONT DOOR — primary path. The orchestrator (09_pipeline_runner)
wearing a friendly face: maps the analyzer's findings to the
recommended pipeline (Clean Text → Standardize → Fix Missing →
Find Duplicates) and runs them in order, returning a downloadable
result. This is the hero of the page; the per-file findings below
remain as the manual "fix one thing at a time" path.
====================================================================== -->
<div class="dt-card" style="border-color:var(--accent);background:var(--accent-fill);box-shadow:0 1px 2px rgba(28,25,23,0.03),0 0 0 1px var(--accent)">
<div style="display:flex;align-items:flex-start;gap:14px;flex-wrap:wrap">
<span class="dt-file-icon-chip" style="width:36px;height:36px;border-radius:var(--r-md)">
<span class="dt-mi" style="font-family:'Material Symbols Outlined';font-size:20px">auto_awesome</span>
</span>
<div style="flex:1;min-width:240px">
<h3 style="margin:0 0 4px;color:var(--ink)">Recommended</h3>
<p style="margin:0;color:var(--ink-secondary)">Runs the recommended clean — fix text, standardize formats, fill blanks, remove duplicates — in the right order, then hands you the cleaned file.</p>
</div>
<button class="dt-btn dt-btn-primary" style="flex:0 0 auto;align-self:center">
<span class="dt-mi">auto_fix_high</span> Clean these files for me
</button>
</div>
<!-- Pipeline-step affordance: the order the findings will be resolved in -->
<div style="display:flex;align-items:center;gap:6px;flex-wrap:wrap;margin-top:14px;padding-top:12px;border-top:1px solid var(--accent-fill-strong)">
<span class="dt-count-pill" style="background:var(--surface);color:var(--ink-secondary)">1 · Clean Text</span>
<span class="dt-mi" style="font-family:'Material Symbols Outlined';font-size:16px;color:var(--accent)">arrow_forward</span>
<span class="dt-count-pill" style="background:var(--surface);color:var(--ink-secondary)">2 · Standardize</span>
<span class="dt-mi" style="font-family:'Material Symbols Outlined';font-size:16px;color:var(--accent)">arrow_forward</span>
<span class="dt-count-pill" style="background:var(--surface);color:var(--ink-secondary)">3 · Fix Missing</span>
<span class="dt-mi" style="font-family:'Material Symbols Outlined';font-size:16px;color:var(--accent)">arrow_forward</span>
<span class="dt-count-pill" style="background:var(--surface);color:var(--ink-secondary)">4 · Find Duplicates</span>
<span class="dt-caption" style="margin-left:auto">Result downloads when finished</span>
</div>
</div>
<!-- Secondary / manual path — keep full control over each fix -->
<h3 style="margin-top:24px">Or fix issues one at a time</h3>
<p class="dt-caption" style="margin:-2px 0 4px">Prefer to handle things yourself? Open any finding to jump straight to the right tool.</p>
<!-- Per-file findings panel #1 -->
<div class="dt-card">
<div class="dt-finding-group-head">
<span class="dt-finding-group-chevron" style="transform:rotate(90deg)">chevron_right</span>
<span class="dt-severity-dot warn"></span>
<span class="dt-group-filename">customers_export.csv</span>
<div class="dt-group-counts">
<span class="dt-count-pill warn">6 warnings</span>
<span class="dt-count-pill info">2 info</span>
</div>
</div>
<div class="dt-finding-row">
<span class="dt-finding-icon warn"><span class="dt-mi">priority_high</span></span>
<div class="dt-finding-body">
<p class="dt-finding-title"><strong>312 duplicate rows</strong> across exact + near matches</p>
<p class="dt-finding-meta">column: email · Find Duplicates →</p>
</div>
</div>
<div class="dt-finding-row">
<span class="dt-finding-icon warn"><span class="dt-mi">format_color_text</span></span>
<div class="dt-finding-body">
<p class="dt-finding-title"><strong>1,204 cells</strong> with leading / trailing whitespace</p>
<p class="dt-finding-meta">columns: name, city · Clean Text →</p>
</div>
</div>
<div class="dt-finding-row">
<span class="dt-finding-icon info"><span class="dt-mi">event</span></span>
<div class="dt-finding-body">
<p class="dt-finding-title">Mixed date formats in <strong>signup_date</strong></p>
<p class="dt-finding-meta">3 formats detected · Standardize Formats →</p>
</div>
</div>
<button class="dt-finding-more">
<span class="dt-mi">expand_more</span> Show all 8 findings · 5 more
</button>
</div>
<!-- Per-file findings panel #2 (collapsed) -->
<div class="dt-card is-collapsed">
<div class="dt-finding-group-head is-collapsed">
<span class="dt-finding-group-chevron">chevron_right</span>
<span class="dt-severity-dot warn"></span>
<span class="dt-group-filename">q3_transactions.xlsx</span>
<div class="dt-group-counts">
<span class="dt-count-pill warn">3 warnings</span>
<span class="dt-count-pill info">3 info</span>
</div>
</div>
</div>
<!-- Per-file findings panel #3 (clean) -->
<div class="dt-card is-collapsed">
<div class="dt-finding-group-head is-collapsed">
<span class="dt-severity-dot success"></span>
<span class="dt-group-filename">vendor_list.csv</span>
<div class="dt-group-counts">
<span class="dt-count-pill success">no issues</span>
</div>
</div>
</div>
</div>
</main>
</div>
<footer class="dt-footer" id="dt-footer"></footer>
<script src="shell.js"></script>
</body>
</html>

71
layout-review/index.html Normal file

@@ -0,0 +1,71 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>DataTools — Layout Review</title>
<link rel="stylesheet" href="app.css">
<style>
.lr-wrap { max-width: 960px; margin: 0 auto; padding: 48px 32px 80px; }
.lr-grid { display: grid; grid-template-columns: repeat(2, 1fr); gap: 14px; margin-top: 18px; }
.lr-card { display: flex; align-items: center; gap: 14px; background: var(--surface); border: 1px solid var(--border); border-radius: var(--r-lg); padding: 16px 18px; box-shadow: 0 1px 2px rgba(28,25,23,0.03); text-decoration: none; transition: border-color .12s ease, box-shadow .12s ease; }
.lr-card:hover { border-color: var(--border-strong); box-shadow: 0 2px 8px rgba(28,25,23,0.06); text-decoration: none; }
.lr-ico { width: 40px; height: 40px; border-radius: var(--r-md); background: var(--accent-fill); color: var(--accent); display: inline-flex; align-items: center; justify-content: center; flex-shrink: 0; }
.lr-ico .dt-mi { font-family: "Material Symbols Outlined"; font-size: 22px; }
.lr-body { min-width: 0; }
.lr-name { font-size: 15px; font-weight: 600; color: var(--ink); letter-spacing: -0.01em; display:flex; align-items:center; gap:8px; }
.lr-desc { font-size: 12.5px; color: var(--ink-secondary); margin-top: 2px; line-height: 1.45; }
.lr-sec { font-size: 11.5px; text-transform: uppercase; letter-spacing: 0.08em; color: var(--ink-tertiary); font-weight: 600; margin: 26px 0 2px; }
.lr-soon { font-size: 9px; font-weight: 600; letter-spacing: .06em; text-transform: uppercase; color: var(--ink-tertiary); border: 1px solid var(--border-strong); border-radius: 999px; padding: 1px 6px; }
</style>
</head>
<body>
<div class="lr-wrap">
<header class="dt-page-header">
<div class="dt-page-brand">
<div class="dt-page-brand-row">
<div class="dt-page-brand-mark">D</div>
<div class="dt-page-brand-words">
<span class="dt-page-eyebrow">UNALOGIX · LAYOUT REVIEW</span>
<h1 class="dt-page-wordmark">DataTools</h1>
</div>
</div>
<p class="dt-page-subtitle">Static HTML reproductions of every tool page, built from the live app's design tokens for human review of layouts.</p>
</div>
</header>
<div class="dt-alert info">
<span class="dt-mi">info</span>
<span>These are faithful static mockups — not the running Streamlit app. Colors, type scale, spacing, and components are copied verbatim from <code>theme.py</code> and <code>components/_legacy.py</code>. Each page is shown in a representative <strong>populated</strong> state so the layout can be reviewed end-to-end. Fonts load from Google Fonts (needs network); the chrome (sidebar + footer) is shared across every page.</span>
</div>
<div class="lr-sec">Analysis</div>
<div class="lr-grid">
<a class="lr-card" href="home.html"><span class="lr-ico"><span class="dt-mi">insert_chart_outlined</span></span><span class="lr-body"><span class="lr-name">File Analysis (Home)</span><span class="lr-desc">Import files, run the analyzer, browse per-file findings.</span></span></a>
<a class="lr-card" href="11_reconciler.html"><span class="lr-ico"><span class="dt-mi">compare_arrows</span></span><span class="lr-body"><span class="lr-name">Reconcile Two Files</span><span class="lr-desc">Compare two lists of transactions and flag what doesn't match.</span></span></a>
</div>
<div class="lr-sec">Data Cleaners</div>
<div class="lr-grid">
<a class="lr-card" href="04_missing_handler.html"><span class="lr-ico"><span class="dt-mi">help_outline</span></span><span class="lr-body"><span class="lr-name">Fix Missing Values</span><span class="lr-desc">Find blank cells (even hidden ones) and fill them in or remove them.</span></span></a>
<a class="lr-card" href="06_outlier_detector.html"><span class="lr-ico"><span class="dt-mi">insights</span></span><span class="lr-body"><span class="lr-name">Find Unusual Values <span class="lr-soon">Soon</span></span><span class="lr-desc">Spot values that look wrong — too high, too low, or rule-breaking.</span></span></a>
<a class="lr-card" href="02_text_cleaner.html"><span class="lr-ico"><span class="dt-mi">text_format</span></span><span class="lr-body"><span class="lr-name">Clean Text</span><span class="lr-desc">Trim extra spaces and strip out odd characters.</span></span></a>
<a class="lr-card" href="03_format_standardizer.html"><span class="lr-ico"><span class="dt-mi">format_list_bulleted</span></span><span class="lr-body"><span class="lr-name">Standardize Formats</span><span class="lr-desc">Make dates, phones, currency, and names look the same throughout.</span></span></a>
<a class="lr-card" href="01_deduplicator.html"><span class="lr-ico"><span class="dt-mi">search</span></span><span class="lr-body"><span class="lr-name">Find Duplicates</span><span class="lr-desc">Find rows that repeat, then keep one and remove the extras.</span></span></a>
<a class="lr-card" href="08_validator_reporter.html"><span class="lr-ico"><span class="dt-mi">check_circle</span></span><span class="lr-body"><span class="lr-name">Quality Check <span class="lr-soon">Soon</span></span><span class="lr-desc">Check your file against rules and export a PDF or Excel report.</span></span></a>
</div>
<div class="lr-sec">Transformations</div>
<div class="lr-grid">
<a class="lr-card" href="05_column_mapper.html"><span class="lr-ico"><span class="dt-mi">view_column</span></span><span class="lr-body"><span class="lr-name">Map Columns</span><span class="lr-desc">Rename columns, reorder, and set each one as text, number, or date.</span></span></a>
<a class="lr-card" href="07_multi_file_merger.html"><span class="lr-ico"><span class="dt-mi">account_tree</span></span><span class="lr-body"><span class="lr-name">Combine Files <span class="lr-soon">Soon</span></span><span class="lr-desc">Combine several CSV or Excel files into one — even if columns differ.</span></span></a>
<a class="lr-card" href="10_pdf_extractor.html"><span class="lr-ico"><span class="dt-mi">picture_as_pdf</span></span><span class="lr-body"><span class="lr-name">PDF to CSV</span><span class="lr-desc">Pull transactions out of bank-statement PDFs into a clean CSV file.</span></span></a>
</div>
<div class="lr-sec">Automations</div>
<div class="lr-grid">
<a class="lr-card" href="09_pipeline_runner.html"><span class="lr-ico"><span class="dt-mi">auto_awesome</span></span><span class="lr-body"><span class="lr-name">Automated Workflows</span><span class="lr-desc">Run several tools in a row — save the steps and reuse them anytime.</span></span></a>
</div>
</div>
</body>
</html>

83
layout-review/shell.js Normal file

@@ -0,0 +1,83 @@
/* Shared app chrome (sidebar nav + sticky footer) for the static layout
review pages. Mirrors src/gui/app.py:_build_navigation() ordering and
src/gui/components/_legacy.py:render_sticky_footer(). Each page sets
<body data-page="<tool_id|home>"> to mark the active nav item. */
(function () {
// Front-door entry — rendered standalone above the section groups.
var START = { id: "home", icon: "insert_chart_outlined", name: "Start here", href: "home.html" };
// Sections + entries in pipeline / job order.
var NAV = [
{ label: "Data Cleaners", items: [
{ id: "02_text_cleaner", icon: "text_format", name: "Clean Text", href: "02_text_cleaner.html" },
{ id: "03_format_standardizer", icon: "format_list_bulleted", name: "Standardize Formats", href: "03_format_standardizer.html" },
{ id: "04_missing_handler", icon: "help_outline", name: "Fix Missing Values", href: "04_missing_handler.html" },
{ id: "01_deduplicator", icon: "search", name: "Find Duplicates", href: "01_deduplicator.html" },
]},
{ label: "Transformations", items: [
{ id: "05_column_mapper", icon: "view_column", name: "Map Columns", href: "05_column_mapper.html" },
]},
{ label: "Automations", items: [
{ id: "09_pipeline_runner", icon: "auto_awesome", name: "Automated Workflows", href: "09_pipeline_runner.html" },
]},
{ label: "Finance", items: [
{ id: "11_reconciler", icon: "compare_arrows", name: "Reconcile Two Files", href: "11_reconciler.html" },
{ id: "10_pdf_extractor", icon: "picture_as_pdf", name: "PDF to CSV", href: "10_pdf_extractor.html" },
]},
{ label: "Coming soon", items: [
{ id: "06_outlier_detector", icon: "insights", name: "Find Unusual Values", href: "06_outlier_detector.html", soon: true },
{ id: "08_validator_reporter", icon: "check_circle", name: "Quality Check", href: "08_validator_reporter.html", soon: true },
{ id: "07_multi_file_merger", icon: "account_tree", name: "Combine Files", href: "07_multi_file_merger.html", soon: true },
]},
];
var active = document.body.getAttribute("data-page") || "";
// ---- Sidebar -----------------------------------------------------------
var sb = document.getElementById("dt-sidebar");
if (sb) {
var html = '' +
'<a class="dt-brand" href="index.html" style="text-decoration:none">' +
'<span class="dt-brand-mark">D</span>' +
'<span class="dt-brand-name">' +
'<span class="dt-brand-eyebrow">UNALOGIX</span>' +
'<span class="dt-brand-word">DataTools</span>' +
'</span>' +
'</a>' +
'<nav class="dt-nav">';
var startCls = "dt-nav-link dt-nav-start" + (START.id === active ? " is-active" : "");
html += '<a class="' + startCls + '" href="' + START.href + '">' +
'<span class="dt-mi">' + START.icon + '</span>' +
'<span>' + START.name + '</span>' +
'</a>';
NAV.forEach(function (sec) {
var indicator = "";
html += '<div class="dt-nav-section">' + sec.label +
'<span class="dt-nav-indicator">' + indicator + '</span></div>';
sec.items.forEach(function (it) {
var cls = "dt-nav-link" + (it.id === active ? " is-active" : "") + (it.soon ? " is-soon" : "");
html += '<a class="' + cls + '" href="' + it.href + '">' +
'<span class="dt-mi">' + it.icon + '</span>' +
'<span>' + it.name + '</span>' +
(it.soon ? '<span class="dt-nav-soon-tag">Soon</span>' : '') +
'</a>';
});
});
html += '</nav>' +
'<div class="dt-sidebar-foot">' +
'<div><div class="dt-sidebar-label">Language</div>' +
'<div class="dt-select" style="pointer-events:none">English</div></div>' +
'<div class="dt-license-badge">Core · 1,820 days left</div>' +
'</div>';
sb.innerHTML = html;
}
// ---- Sticky footer -----------------------------------------------------
var ft = document.getElementById("dt-footer");
if (ft) {
ft.innerHTML =
'<a class="dt-footer-btn" href="index.html"><span class="dt-mi">close</span>Close</a>' +
'<button class="dt-footer-btn" type="button"><span class="dt-mi">help_outline</span>Help</button>' +
'<span style="margin-left:auto;font-size:11.5px;color:var(--ink-tertiary)">DataTools · local-first · static layout preview</span>';
}
})();


@@ -1,31 +0,0 @@
Lead ID,First Name,Last Name,Company,Title,Email,Phone,Country,Source,Score,Last Activity,Tags
HUB-001,Alice,Johnson,Acme Corp,VP Marketing,alice@acme.com,(415) 555-1234,USA,HubSpot,87,2025-12-04,Enterprise
HUB-002,bob,smith,Beta LLC,Director Growth,bob@beta.com,N/A,United States,HubSpot,N/A,2025-11-22,SMB
HUB-003,Carlos,Garcia,Gamma Inc,CEO,carlos@gamma.io,+34 91 411 1111,Spain,HubSpot,82,2025-10-30,Enterprise
HUB-004,DIANA,LEE,Delta Co,Marketing Manager,diana@delta.com,020 7946 0958,United Kingdom,HubSpot,74,2025-12-15,Mid-Market
HUB-005,Eve,Martinez,Epsilon Group,VP Ops,eve@epsilon.com,(none),Mexico,HubSpot,(blank),2025-09-15,SMB
LIN-006,Alice,Johnson,Acme Corporation,VP of Marketing,Alice.Johnson@acme.com,4155551234,US,LinkedIn,,2025-12-04,Enterprise
LIN-007,Frank,Brown,Foxtrot Ltd,Head Sales,frank@foxtrot.de,+49 30 12345678,Germany,LinkedIn,68,2025-12-01,Mid-Market
LIN-008,Grace,Davis,Golf Industries,Marketing Lead,grace@golfind.com,+44 20 7946 0958,UK,LinkedIn,79,2025-11-08,Mid-Market
LIN-009,henry,wilson,Hotel Logistics,COO,henry@hotellog.com,+86 10 1234 5678,China,LinkedIn,91,2025-12-12,Enterprise
LIN-010,IVY CHEN,,India Tech,CTO,ivy@indiatech.in,+91 11 2345 6789,IN,LinkedIn,88,2025-11-30,Enterprise
LIN-011,Jack,Taylor,Juliet & Co,Founder,jack@juliet.co,unknown,United States,LinkedIn,?,(unknown),SMB
SCR-012,Diana,Lee,Delta Company,Marketing Manager,diana@delta.com,020-7946-0958,UK,Manual Scrape,74,12/15/2025,Mid-Market
SCR-013,kate,o'neil,Kilo Ventures,Partner,kate@kilo.vc,+1 415 555 2222,USA,Manual Scrape,N/A,?,Investor
SCR-014,Carlos,García,Gamma Incorporated,CEO,Carlos@gamma.io,+34-91-411-1111,Spain,Manual Scrape,82,Oct 30 2025,Enterprise
SCR-015,Liam,Park,Lima Solutions,Director Marketing,liam@limasol.kr,+82 2 2287 0114,South Korea,Manual Scrape,77,2025-11-20,Enterprise
SCR-016,Mia,nguyen,Mike Corp,VP Marketing,mia@mikecorp.com.au,02 9374 4000,Australia,Manual Scrape,72,2025-10-05,Mid-Market
SCR-017,Noah,Brown,November Inc,Head of Growth,noah@november.com,(555) 444-5555,US,Manual Scrape,,#N/A,SMB
HUB-018,Frank,Brown,Foxtrot,Head of Sales,Frank@Foxtrot.de,+49-30-12345678,Germany,HubSpot,68,2025-12-01,Mid-Market
HUB-019,Olivia,Rossi,Oscar Italia,CMO,olivia@oscar.it,+39 06 6982,Italy,HubSpot,85,2025-12-08,Enterprise
HUB-020,papa,wong,Papa Trading,Founder,papa@papatrading.hk,+852 2123 4567,Hong Kong,HubSpot,69,2025-11-15,SMB
LIN-021,Quinn,Reyes,Quebec Group,VP Sales,quinn@quebec.mx,+52 55 5555 0000,Mexico,LinkedIn,80,2025-12-05,Mid-Market
LIN-022,Robert,Tan,Romeo Logistics,Director,r.tan@romeo.sg,+65 6123 4567,Singapore,LinkedIn,76,2025-11-28,Mid-Market
SCR-023,Sara,Khan,Sierra Foods,Head Marketing,sara@sierra.in,+91-22-1234-5678,India,Manual Scrape,73,2025-12-02,SMB
SCR-024,bob,Smith,Beta,Director Growth,Bob@Beta.com,(none),United States,Manual Scrape,(unknown),(unknown),SMB
HUB-025,Tara,Levi,Tango Tech,VP Product,tara@tango.il,+972 3 6957 0000,Israel,HubSpot,82,2025-12-10,Enterprise
HUB-026,Uma,Patel,Uniform Health,CMO,uma at uniform dot com,+44 20 7946 8888,United Kingdom,HubSpot,71,2025-12-12,Enterprise
LIN-027,Victor,Lee,Victor Co,Director,victor@@victorco.com,+1 415 555 8888,USA,LinkedIn,69,2025-11-30,SMB
SCR-028,Wendy,Akin,Whiskey Inc,CMO,wendy@whiskey.tr,+90 212 252 1111,Turkey,Manual Scrape,77,2025-12-04,Mid-Market
SCR-029,Xander,Ng,Xray Group,Founder,xander@xray.sg,+65 6234 5678,Singapore,Manual Scrape,65,2025-11-15,Suppressed
HUB-030,Yara,Costa,Yankee Foods,Marketing Lead,yara@yankee.br,+55 11 3071 2222,Brazil,HubSpot,,2025-12-15,Opted Out


@@ -1,74 +0,0 @@
{
"steps": [
{
"tool": "text_clean",
"options": {},
"enabled": true,
"name": "1. Clean text (whitespace + smart quotes from copy-paste)"
},
{
"tool": "format_standardize",
"options": {
"column_types": {
"First Name": "name",
"Last Name": "name",
"Company": "name",
"Email": "email",
"Phone": "phone"
},
"phone_country_column": "Country",
"phone_format": "E164",
"email_gmail_canonical": true
},
"enabled": true,
"name": "2. E.164 phones (per-row country) · canonical emails · name casing"
},
{
"tool": "missing",
"options": {
"strategy": "none",
"standardize_sentinels": true,
"sentinels": ["N/A", "n/a", "—", "?", "(unknown)", "unknown", "(blank)", "(none)", "TBD", "#N/A"]
},
"enabled": true,
"name": "3. Standardize sentinels across vendor exports"
},
{
"tool": "column_map",
"options": {
"schema": {
"fields": [
{"name": "Lead ID", "dtype": "string", "required": true},
{"name": "First Name", "dtype": "string"},
{"name": "Last Name", "dtype": "string"},
{"name": "Company", "dtype": "string"},
{"name": "Title", "dtype": "string"},
{"name": "Email", "dtype": "string"},
{"name": "Phone", "dtype": "string"},
{"name": "Country", "dtype": "string"},
{"name": "Source", "dtype": "string"},
{"name": "Score", "dtype": "integer"},
{"name": "Last Activity", "dtype": "date"},
{"name": "Tags", "dtype": "string"}
]
},
"auto_infer": true,
"unmapped": "keep",
"coerce_types": true,
"reorder_to_schema": true,
"enforce_required": false
},
"enabled": true,
"name": "4. Coerce types · reorder to canonical schema"
},
{
"tool": "dedup",
"options": {
"survivor_rule": "most_complete",
"merge": true
},
"enabled": true,
"name": "5. Dedup leads across HubSpot / LinkedIn / Manual Scrape (fuzzy + merge)"
}
]
}


@@ -0,0 +1,27 @@
Invoice,Client,Email,Invoice_Date,Due_Date,Amount,Status
INV-1007,ACME LLC,AP@Acme.com,03/04/2025,04/03/2025,"$1,250.00",Open
INV-1007, Acme LLC ,ap@acme.com,2025-03-04,2025-04-03,"1,250.00",(blank)
INV-1001,northwind traders,billing@northwind.com,Mar 6 2025,04/05/2025,$980,Overdue
INV-1002,Globex Corp,AR@Globex.com,3/11/25,4/10/25,"2,400.50",Sent
INV-1011,initech,accounts@initech.com,04/01/2025,05/01/2025,"$ 1,100.00",?
INV-1011,Initech,Accounts@Initech.com,2025-04-01,2025-05-01,1100,Open
INV-1003,Stark Industries,ap@stark.com,Mar 6 2025,Apr 6 2025,$75.00,Open
INV-1004,Wayne Enterprises,ar@wayne.com,03/15/2025,04/14/2025,($300.00),
INV-1015,Hooli,billing@hooli.com,3/11/25,4/10/25,"$4,300.00",Overdue
INV-1015,hooli,Billing@Hooli.com,2025-03-11,2025-04-10,4300,(none)
INV-1005,Soylent Corp,ap@soylent.com,2025-03-20,2025-04-19,"$1,875.25",Sent
INV-1006,Umbrella Co,ar@umbrella.com,03/22/2025,04/21/2025,$640.00,TBD
INV-1019,Cyberdyne Systems,ap@cyberdyne.com,Mar 25 2025,04/24/2025,"$2,050.00",unknown
INV-1019,cyberdyne systems,AP@Cyberdyne.com,2025-03-25,2025-04-24,"2,050.00",Open
INV-1008,Vandelay Industries,ar@vandelay.com,3/28/25,4/27/25,$915.00,Overdue
INV-1009,Gekko & Co,billing@gekko.com,2025-03-30,2025-04-29,"$3,120.75",Open
INV-1010,Pied Piper,ap@piedpiper.com,04/02/2025,05/02/2025,$180,Sent
INV-1023,Tyrell Corp,ar@tyrell.com,04/05/2025,05/05/2025,($300.00),(blank)
INV-1023,Tyrell Corp,AR@Tyrell.com,2025-04-05,2025-05-05,-300.00,Open
INV-1012,Oscorp,ap@oscorp.com,Apr 8 2025,05/08/2025,"$5,000.00",Overdue
INV-1013,Nakatomi Trading,ar@nakatomi.com,4/9/25,5/9/25,$725.50,Sent
INV-1014,Bluth Company,billing@bluth.com,2025-04-10,2025-05-10,"$1,420.00",Open
INV-1016,Dunder Mifflin,ap@dundermifflin.com,04/12/2025,05/12/2025,$960.00,Overdue
INV-1017,Prestige Worldwide,ar@prestige.com,Apr 14 2025,05/14/2025,"$2,680.00",Sent
INV-1018,Sterling Cooper,billing@sterlingcooper.com,4/15/25,5/15/25,"$3,950.00",Open
INV-1020,Wonka Industries,ap@wonka.com,2025-04-18,2025-05-18,"$1,050.00",Overdue
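
Reviewer note: the `Amount` column above mixes several spellings of the same value (`"$1,250.00"`, `"1,250.00"`, `"$ 1,100.00"`, `($300.00)`, `-300.00`). The sketch below shows one way such strings could be reduced to a single numeric form. It is an illustration only, not the DataTools `format_standardize` implementation; the function name `parse_amount` is hypothetical, and treating parentheses as a negative sign is an assumption based on common accounting notation.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_amount(raw):
    """Return a Decimal for an accounting-style amount string.

    Returns None when the cell is blank or cannot be read as a number.
    Hypothetical helper for illustration; not part of DataTools.
    """
    text = raw.strip()
    if not text:
        return None

    negative = False
    # Assumed accounting convention: (300.00) means -300.00.
    if text.startswith("(") and text.endswith(")"):
        negative = True
        text = text[1:-1]

    # Drop the currency symbol, thousands separators, and any spaces.
    text = re.sub(r"[$,\s]", "", text)

    # Handle an explicit leading sign such as "-300.00" or "+3450.00".
    if text.startswith("-"):
        negative = True
        text = text[1:]
    elif text.startswith("+"):
        text = text[1:]

    try:
        value = Decimal(text)
    except InvalidOperation:
        return None
    return -value if negative else value


# Sample values copied from the Amount column above.
for sample in ["$1,250.00", "1,250.00", "$ 1,100.00", "($300.00)", "-300.00", "$980"]:
    print(repr(sample), "->", parse_amount(sample))
```

`Decimal` is used instead of `float` so that cents are not subject to binary rounding, which matters when duplicate invoices are later compared by amount.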


@@ -0,0 +1,50 @@
{
"steps": [
{
"tool": "text_clean",
"enabled": true,
"options": {
"trim": true,
"collapse_whitespace": true,
"fold_smart_chars": true,
"strip_zero_width": true
}
},
{
"tool": "format_standardize",
"enabled": true,
"options": {
"column_types": {
"Invoice_Date": "date",
"Due_Date": "date",
"Amount": "currency",
"Email": "email"
}
}
},
{
"tool": "missing",
"enabled": true,
"options": {
"strategy": "none",
"standardize_sentinels": true,
"sentinels": ["—", "-", "?", "(blank)", "TBD", "unknown", "(none)", "N/A", "#N/A"]
}
},
{
"tool": "dedup",
"enabled": true,
"options": {
"survivor_rule": "most_complete",
"merge": true,
"strategies": [
{
"columns": [
{"column": "Invoice", "algorithm": "exact", "threshold": 100}
]
}
]
}
}
]
}
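
Reviewer note: the `missing` and `dedup` steps in the config above can be read as "blank out placeholder strings, then collapse rows that share an `Invoice` value". The sketch below illustrates that reading. It is not the DataTools implementation: the function names are hypothetical, sentinel matching is assumed to be an exact whole-cell comparison, and the meaning given to `survivor_rule: most_complete` (keep the row with the most non-empty cells) and `merge: true` (fill the survivor's gaps from the other rows) is inferred from the option names only.

```python
# Sentinel list copied from the config above; "\u2014" is the dash placeholder.
SENTINELS = {"\u2014", "-", "?", "(blank)", "TBD", "unknown", "(none)", "N/A", "#N/A"}


def blank_sentinels(row, sentinels=SENTINELS):
    """Trim each cell and replace empty or placeholder values with None."""
    cleaned = {}
    for column, value in row.items():
        text = (value or "").strip()
        cleaned[column] = None if text == "" or text in sentinels else text
    return cleaned


def completeness(row):
    """Count the cells that hold a real value."""
    return sum(1 for value in row.values() if value is not None)


def dedup_exact(rows, key):
    """Group rows by an exact key match and keep one merged row per group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)

    result = []
    for members in groups.values():
        # Assumed "most_complete" rule: the row with the most filled cells wins.
        survivor = dict(max(members, key=completeness))
        # Assumed "merge" behaviour: fill remaining gaps from the other rows.
        for other in members:
            for column, value in other.items():
                if survivor.get(column) is None and value is not None:
                    survivor[column] = value
        result.append(survivor)
    return result


# Three rows taken from the invoice CSV in this change (columns trimmed).
rows = [
    {"Invoice": "INV-1011", "Client": "initech", "Status": "?"},
    {"Invoice": "INV-1011", "Client": "Initech", "Status": "Open"},
    {"Invoice": "INV-1004", "Client": "Wayne Enterprises", "Status": ""},
]
deduped = dedup_exact([blank_sentinels(r) for r in rows], key="Invoice")
for row in deduped:
    print(row)
```

The sketch skips the `text_clean` and `format_standardize` steps, so differences in casing such as `initech` versus `Initech` are not reconciled here; the surviving row simply keeps its own spelling.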


@@ -0,0 +1,27 @@
Date,Description,Vendor,Category,Amount,Account
01/15/2025,“Stripe payout — weekly”,Stripe,Income,"+$3,450.00",Business Checking
2025-01-15,Verizon business line,Verizon,,($89.50),Business Checking
Jan 18 2025,Adobe Creative Cloud ,Adobe,(blank),-$129.99,Business Checking
1/27/25,Office supplies,Amazon,Supplies,-$74.20,Business Checking
02/03/2025, Monthly office rent,Highland Properties,Rent,"$1,200.00",Business Checking
Feb 5 2025,Account service fee,First National Bank,?,(50.00),Business Checking
2025-01-09,Shipping labels,amazon.com,unknown,-$18.40,Business Checking
1/22/25,Contractor — landing page,Bright Lane Design,TBD,- $599.88,Business Checking
Jan 30 2025,Late fee adjustment,verizon,Utilities,-$12.00,Business Checking
2025-01-11,Packaging tape,AMAZON.COM,Supplies,-$31.75,Business Checking
01/06/2025,Client deposit — ACME Co,ACME Co,Income,"$2,500.00",Business Checking
2025-01-20,Google Workspace,Google,Software,-$36.00,Business Checking
Jan 24 2025,Fuel — delivery van,Shell,Vehicle,-$58.63,Business Checking
1/28/25,QuickBooks subscription,Intuit,Software,-$80.00,Business Checking
2025-01-15,Stripe payout weekly,Stripe,Income,3450.00,Business Checking
01/15/2025,Verizon business line,Verizon,Utilities,-89.50,Business Checking
2025-01-18,Adobe Creative Cloud,Adobe,Software,-129.99,Business Checking
2025-02-03,Monthly office rent,Highland Properties,Rent,1200.00,Business Checking
2025-02-05,Account service fee,First National Bank,Bank Fees,-50.00,Business Checking
2025-01-22,Contractor landing page,Bright Lane Design,Contractors,-599.88,Business Checking
02/10/2025,Client deposit — Globex,Globex,Income,"$1,800.00",Business Checking
2025-02-12,Slack subscription,Slack,Software,-$96.00,Business Checking
Feb 14 2025,Coffee — client meeting,Blue Bottle,Meals,-$23.10,Business Checking
2/18/25,Insurance premium,Hartford,Insurance,-$240.50,Business Checking
02/21/2025,Refund — returned printer,Staples,Supplies,$210.99,Business Checking
Feb 25 2025,Domain renewal,Namecheap,Software,-$13.98,Business Checking
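The `Amount` column in the sample above mixes several spellings of the same value: a leading `+`, a spaced minus (`- $599.88`), accounting-style parentheses (`($89.50)`), thousands separators, and bare numbers. A minimal sketch of the kind of normalization a currency step has to perform is shown below. It assumes US-style formatting only (`$`, `,` for thousands, `.` for decimals); `parse_amount` is a hypothetical helper, not the DataTools `format_standardize` implementation.

```python
from decimal import Decimal


def parse_amount(raw: str) -> Decimal:
    """Hypothetical helper: turn a US-style amount string into a signed Decimal."""
    text = raw.strip()
    # Accounting convention: parentheses mean a negative value.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    # Drop inner spaces so "- $599.88" and "-$599.88" look the same.
    text = text.replace(" ", "")
    if text.startswith("-"):
        negative = True
        text = text[1:]
    elif text.startswith("+"):
        text = text[1:]
    # Assumption: "$" is the only currency symbol and "," only groups thousands.
    text = text.replace("$", "").replace(",", "")
    value = Decimal(text)
    return -value if negative else value


# Amount strings taken from the sample file above.
SAMPLES = ["+$3,450.00", "($89.50)", "-$129.99", "(50.00)", "- $599.88", "3450.00"]

for sample in SAMPLES:
    print(f"{sample!r:>14} -> {parse_amount(sample)}")
```

`Decimal` is used rather than `float` so that cent values compare exactly when rows are later matched on amount.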

@@ -0,0 +1,6 @@
{"steps":[
{"tool":"text_clean","enabled":true,"options":{"trim":true,"collapse_whitespace":true,"fold_smart_chars":true,"strip_zero_width":true}},
{"tool":"format_standardize","enabled":true,"options":{"column_types":{"Date":"date","Amount":"currency"}}},
{"tool":"missing","enabled":true,"options":{"strategy":"none","standardize_sentinels":true,"sentinels":["—","(blank)","?","unknown","TBD","N/A","#N/A","(none)"]}},
{"tool":"dedup","enabled":true,"options":{"survivor_rule":"most_complete","merge":true,"strategies":[{"columns":[{"column":"Date","algorithm":"exact","threshold":100},{"column":"Amount","algorithm":"exact","threshold":100}]}]}}
]}
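The dedup step above matches rows exactly on `Date` and `Amount`, which only works once both columns have been standardized. A minimal sketch of that idea follows, under stated assumptions: amounts are already normalized strings, dates use one of the four spellings seen in the sample file, and `to_iso` and `duplicate_groups` are hypothetical helpers rather than DataTools functions.

```python
from collections import defaultdict
from datetime import datetime

# Date spellings that appear in the sample file. "%b" is locale-dependent;
# this sketch assumes English month abbreviations.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y", "%b %d %Y")


def to_iso(raw: str) -> str:
    """Hypothetical helper: parse a date in any known format to YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")


def duplicate_groups(rows):
    """Group row indexes by (ISO date, amount); keep groups with 2+ rows."""
    groups = defaultdict(list)
    for index, (date, amount) in enumerate(rows):
        groups[(to_iso(date), amount)].append(index)
    return {key: indexes for key, indexes in groups.items() if len(indexes) > 1}


# (Date, Amount) pairs from the sample, with amounts already normalized.
ROWS = [
    ("01/15/2025", "3450.00"),
    ("2025-01-15", "-89.50"),
    ("2025-01-15", "3450.00"),
    ("01/15/2025", "-89.50"),
    ("Jan 18 2025", "-129.99"),
    ("2025-01-18", "-129.99"),
    ("1/27/25", "-74.20"),
]

for key, indexes in duplicate_groups(ROWS).items():
    print(key, indexes)
```

Matching on the raw strings would find no duplicates here, because every repeated transaction is written with a different date format.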


@@ -1,56 +0,0 @@
{
"steps": [
{
"tool": "text_clean",
"options": {},
"enabled": true,
"name": "1. Clean text (header whitespace, smart quotes, em-dash)"
},
{
"tool": "format_standardize",
"options": {
"column_types": {
"Date": "date",
"Amount": "currency",
"Balance": "currency",
"Vendor": "name"
},
"currency_decimal": "auto",
"currency_preserve_code": false,
"currency_decimals": 2,
"date_output_format": "%Y-%m-%d"
},
"enabled": true,
"name": "2. ISO dates · numeric amounts (parens-negative) · vendor casing"
},
{
"tool": "missing",
"options": {
"strategy": "none",
"standardize_sentinels": true,
"sentinels": ["N/A", "n/a", "—", "-", "?", "(blank)", "(none)", "unknown", "#N/A"]
},
"enabled": true,
"name": "3. Standardize disguised nulls (— / N/A / (blank))"
},
{
"tool": "dedup",
"options": {
"survivor_rule": "most_complete",
"merge": false,
"date_column": "Date",
"strategies": [
{
"columns": [
{"column": "Date", "algorithm": "exact", "threshold": 100},
{"column": "Amount", "algorithm": "exact", "threshold": 100},
{"column": "Vendor", "algorithm": "jaro_winkler", "threshold": 80}
]
}
]
},
"enabled": true,
"name": "4. Dedup transactions on Date+Amount+fuzzy Vendor"
}
]
}


@@ -1,31 +0,0 @@
Txn ID,Date ,Description,Amount,Balance,Account,Vendor,Category
TXN-2401,01/15/2025," AMAZON.COM*4F2X9 PURCHASE",-$129.99,"$2,450.01",Checking,Amazon,Office Supplies
TXN-2402,2025-01-15,"AMAZON.COM*4F2X9 PURCHASE",-$129.99,"2450.01",Checking,amazon.com,Office Supplies
TXN-2403,Jan 18 2025,"STAPLES #4422 — paper, toner",($89.50),$2360.51,Checking,STAPLES,Office Supplies
TXN-2404,01/22/2025,"Verizon Wireless ""autopay""",-$120.00,"$2,240.51",Checking,Verizon,Utilities
TXN-2405,2025-01-22,Verizon Wireless autopay,-120.00,"2,240.51",Checking,verizon,Utilities
TXN-2406,01-25-2025,"Stripe Payout — invoice #1077","+$3,450.00","$5,690.51",Checking,Stripe,Income
TXN-2407,1/27/25,"Office Lease - Suite 204",-1500.00,"$4,190.51",Checking,Acme Realty,Rent
TXN-2408,02/01/2025,"Wire — Acme Realty Mgmt","-$1,500.00","$2,690.51",Checking,acme realty,Rent
TXN-2409,2025-02-03,"Adobe Creative Cloud annual","- $599.88","$2,090.63",Credit Card,Adobe Inc.,Software
TXN-2410,02/03/2025,"ADOBE CREATIVE CLOUD ANN",-599.88,2090.63,Credit Card,adobe,Software
TXN-2411,Feb 5 2025,"FedEx — overnight to client A",-$32.50,"$2,058.13",Checking,FedEx,Shipping
TXN-2412,02/07/2025,"Square fee — invoice #1078","-$3.20","$2,054.93",Checking,Square,Fees
TXN-2413,02/10/2025,"Stripe Payout invoice #1079","+ $1,200.00","$3,254.93",Checking,Stripe,Income
TXN-2414,2025-02-12,"USPS PRIORITY — to vendor B","-12.40","$3,242.53",Checking,USPS,Shipping
TXN-2415,02/14/2025,"Zoom Video Comms — annual","-$149.90","$3,092.63",Credit Card,Zoom,Software
TXN-2416,2/14/25,"Zoom Video Communications","-149.90","3092.63",Credit Card,zoom,Software
TXN-2417,02/18/2025,"Costco Whse #421 — supplies","-$237.84","$2,854.79",Checking,Costco,Office Supplies
TXN-2418,2025-02-18,COSTCO WHSE #421,-237.84,"2,854.79",Checking,costco,Office Supplies
TXN-2419,02/22/2025,"Bank fee — int'l wire","-$45.00","$2,809.79",Checking,Bank Fee,Fees
TXN-2420,02/24/2025,"Stripe Payout — invoice #1080","+$2,100.00","$4,909.79",Checking,Stripe,Income
TXN-2421,02/28/2025," Refund — overcharge ","+$45.00","$4,954.79",Checking,,Refunds
TXN-2422,Feb 28 2025,REFUND OVERCHARGE,45.00,4954.79,Checking,N/A,Refunds
TXN-2423,03/01/2025,"Office Lease — Suite 204","-$1,500.00","$3,454.79",Checking,Acme Realty,Rent
TXN-2424,2025-03-03,"Slack Technologies — annual","-$840.00","$2,614.79",Credit Card,Slack,Software
TXN-2425,03/05/2025,"Stripe Payout — invoice #1081","+$1,875.00","$4,489.79",Checking,Stripe,Income
TXN-2426,03/08/2025,"Wire — Berlin office rent (EUR vendor)","-€1.450,00","$2,989.79",Checking,Mietverwaltung GmbH,Rent
TXN-2427,03/10/2025,"London supplier invoice (GBP)","-£950.00","$1,939.79",Checking,Stationery Co Ltd,Office Supplies
TXN-2428,03/12/2025,"São Paulo agency retainer","-R$ 1.299,90","$1,679.79",Credit Card,Estúdio Ágil,Software
TXN-2429,03/14/2025,"VAT MOSS prep — multi-EU sales","($89.00)","$1,768.79",Checking,EU VAT Service,Fees
TXN-2430,03/14/2025,"VAT MOSS prep multi EU sales",-89.00,"1,768.79",Checking,eu vat service,Fees

@@ -1,21 +0,0 @@
Customer ID,First Name,Last Name,Email,Phone,Address,City,State,ZIP,Country,Total Orders,Lifetime Value,Last Order Date,Tags
SHOP-1001, Alice ,Johnson,alice@petshop.com,(415) 555-1234,"123 Main St., Apt 4B",San Francisco,CA,94102,US,12,$1,240.50,2025-12-04,VIP
SHOP-1002,Bob,SMITH,Bob@PetShop.com,415.555.1234,"123 Main St, Apt 4B",San Francisco,CA,94102,US,12,"$1,240.50",N/A,VIP
SHOP-1003,carlos,garcia,carlos@petshop.com,5559876543,"742 Evergreen Terrace",Springfield,IL,62704,US,5,420.00,12/15/2025,Wholesale
SHOP-1004,Diana,Lee,diana@petshop.com,(555) 222-3344,"PO Box 12, Sherwood Forest",Nottingham,,NG1 5BA,GB,8,£890.25,2025-10-30,VIP|Wholesale
SHOP-1005,EVE MARTINEZ,,eve.martinez@petshop.com,555-9988,"Calle Mayor 45","Madrid",,"28013",ES,3,€180,2025-09-15,
SHOP-1006,Frank,Brown,frank@petshop.com,, ,"Berlin",BE,10115,DE,15,€2.410,75,(blank),Wholesale
SHOP-1007,Grace,Davis,grace@petshop.com,+1 555-111-1111,"888 Maple Ave",Toronto,ON,M5V 3A8,CA,1,$49.99,#N/A,New
SHOP-1008,henry,wilson,Henry@PetShop.com,5551111111,"888 Maple Avenue","Toronto",ON,M5V 3A8,CA,1,$49.99,2025-12-01,New
SHOP-1009,Ivy,Chen,IVY@petshop.com,+1 (555) 777-7777,"550 Elm Street, Suite 200",Brooklyn,NY,11201,US,4,"$320.50 ",10/12/2025,
SHOP-1010,Jack,Taylor,jack@petshop.com,(none),"550 elm street, suite 200",brooklyn,NY,11201,US,4,$320.50,2025-10-12,
SHOP-1011,kate,o'neil,kate.oneil@petshop.com,415-555-2222,"99 King's Rd","London",,SW3 4LX,GB,7,£675.00,?,VIP
SHOP-1012,luis,rodriguez,LUIS@petshop.com,+34 91 411 1111,"Avenida de la Paz 12, 3°D",Madrid,,28013,ES,2,"€89,99",unknown,
SHOP-1013,Mia,Park,mia@petshop.com,02-9374-4000,"Sydney Opera House Drive","Sydney",NSW,2000,AU,9,"A$ 1,299.00",2025-11-20,Wholesale
SHOP-1014,Noah,nguyen,noah@petshop.com,+81 3 3210 7000,"丸の内 2-7-3","Tokyo",,100-0005,JP,6,"¥75000",2025-12-10,VIP
SHOP-1015,Olivia,Brown,OLIVIA@PETSHOP.COM,(555) 333-4444,"742 evergreen terrace",springfield,IL,62704,US,3,$180.00,(none),
SHOP-1016,Pavel,Novak,pavel@petshop.com,+44 20 7946 1234,"22 Baker Street",London,,W1U 6AB,United Kingdom,4,£412.00,2025-11-18,VIP
SHOP-1017,Quinn,Murphy,quinn@petshop.com,+44 20 7946 5678,"5 Princes Street",Edinburgh,,EH2 2DA,U.K.,2,£189.50,2025-12-09,
SHOP-1018,Rachel,O'Brien,rachel@petshop.com,02-9374-9999,"100 George Street","Sydney",NSW,2000,UK,1,£75.00,?,New
SHOP-1019,Sam,Klein,sam@petshop.com,+49 30 99887766,"Friedrichstraße 100","Berlin",,10117,Germany,11,"€1.890,40",2025-12-11,VIP|Wholesale
SHOP-1020,Tara,Gianni,tara@petshop.com,+39 06 6982 4567,"Via del Corso 250",Roma,,00186,Italia,5,"€649,99",2025-12-03,

@@ -1,49 +0,0 @@
{
"steps": [
{
"tool": "text_clean",
"options": {},
"enabled": true,
"name": "1. Clean text (whitespace, smart quotes, NBSP, BOM)"
},
{
"tool": "format_standardize",
"options": {
"column_types": {
"First Name": "name",
"Last Name": "name",
"Email": "email",
"Phone": "phone",
"Address": "address",
"Lifetime Value": "currency",
"Last Order Date": "date"
},
"phone_country_column": "Country",
"address_country_column": "Country",
"currency_preserve_code": true,
"currency_decimal": "auto",
"email_gmail_canonical": false
},
"enabled": true,
"name": "2. Standardize phones, addresses, dates, currencies, names"
},
{
"tool": "missing",
"options": {
"strategy": "none",
"standardize_sentinels": true
},
"enabled": true,
"name": "3. Standardize disguised nulls (N/A, -, (blank), ?, #N/A)"
},
{
"tool": "dedup",
"options": {
"survivor_rule": "most_complete",
"merge": true
},
"enabled": true,
"name": "4. Dedup customers (fuzzy match, merge missing fields)"
}
]
}


@@ -0,0 +1,25 @@
Vendor,Contact,Email,Phone,EIN,Address,Total_Paid
Acme Realty,Bob Stein,acme.ap@acmerealty.com,(212) 555-0100,12-3456789,(blank),"$12,400.00"
acme realty llc,Bob Stein, ACME.AP@AcmeRealty.com ,,,"118 Canal St, New York, NY 10013","$8,250"
ACME REALTY,R. Stein,Acme.AP@acmerealty.com,212.555.0100,N/A,TBD,"1,999.99"
Bright Books Bookkeeping,Dana Cole,hello@brightbooks.com,,98-7654321,(blank),"$6,000.00"
bright books,Dana Cole,HELLO@brightbooks.com,(415) 555-0142,unknown,"50 Market St, San Francisco, CA 94105","$6,000"
"Bright Books, LLC",D. Cole, hello@BrightBooks.com,4155550142,98-7654321,unknown,"5,500.00"
Northwind Logistics,Sam Reyes,ap@northwindlog.com,(312) 555-0198,,(blank),"$22,750.00"
northwind logistics inc,Sam Reyes,AP@NorthwindLog.com,,45-6789012,"900 W Loop, Chicago, IL 60607","$22,750"
Pearl Design Studio,“Jo” Marsh,billing@pearldesign.co,,33-2211000,(blank),"$3,200.00"
pearl design,Jo Marsh,Billing@PearlDesign.co,(206) 555-0167,TBD,"77 Pike St, Seattle, WA 98101","$3,200"
PEARL DESIGN STUDIO,J. Marsh, billing@pearldesign.co ,206.555.0167,33-2211000,unknown,"2,800.00"
Cooper Plumbing,Lee Cooper,office@cooperplumb.com,(617) 555-0133,,(blank),"$1,450.00"
cooper plumbing co,Lee Cooper,OFFICE@cooperplumb.com,,TBD,"12 Beacon St, Boston, MA 02108","$1,450"
COOPER PLUMBING,L. Cooper, office@CooperPlumb.com,6175550133,N/A,unknown,900.00
Vertex Marketing,Pat Nguyen,accounts@vertexmktg.com,(404) 555-0119,77-8899001,(blank),"$15,000.00"
vertex marketing group,Pat Nguyen,ACCOUNTS@VertexMktg.com,,unknown,"300 Peachtree St, Atlanta, GA 30308","$15,000"
Summit Consulting,Ray Brooks,invoices@summitconsult.net,,21-0099887,(blank),"$9,800.00"
summit consulting llc,Ray Brooks,INVOICES@summitconsult.net,(303) 555-0175,,"1100 17th St, Denver, CO 80202","$9,800"
SUMMIT CONSULTING,R. Brooks, invoices@SummitConsult.net ,303.555.0175,21-0099887,TBD,"7,250.00"
Garcia Catering,Mia Garcia,ap@garciacatering.com,(305) 555-0188,,(blank),"$4,600.00"
garcia catering services,Mia Garcia,AP@GarciaCatering.com,,66-1234509,"450 Ocean Dr, Miami, FL 33139",$600.00
Northwind Logistics,S. Reyes, ap@northwindlog.com ,312.555.0198,45-6789012,TBD,"21,000.00"
VERTEX MARKETING,P. Nguyen, accounts@vertexmktg.com ,404.555.0119,77-8899001,TBD,"14,500.00"
GARCIA CATERING,M. Garcia,ap@GARCIACATERING.com,305.555.0188,66-1234509,unknown,"4,200.00"

@@ -0,0 +1,49 @@
{
"steps": [
{
"tool": "text_clean",
"enabled": true,
"options": {
"trim": true,
"collapse_whitespace": true,
"fold_smart_chars": true,
"strip_zero_width": true
}
},
{
"tool": "format_standardize",
"enabled": true,
"options": {
"column_types": {
"Phone": "phone",
"Email": "email",
"Total_Paid": "currency"
}
}
},
{
"tool": "missing",
"enabled": true,
"options": {
"strategy": "none",
"standardize_sentinels": true,
"sentinels": ["—", "-", "--", "(blank)", "TBD", "unknown", "N/A", "#N/A", "(none)"]
}
},
{
"tool": "dedup",
"enabled": true,
"options": {
"survivor_rule": "most_complete",
"merge": true,
"strategies": [
{
"columns": [
{"column": "Email", "algorithm": "exact", "threshold": 100, "normalizer": "email"}
]
}
]
}
}
]
}
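The vendor pipeline above dedups on a normalized `Email`, keeps the `most_complete` row, and merges missing fields from the other rows in the group. A minimal sketch of that behavior is given below. Everything in it is an assumption made for illustration: the email normalizer is taken to be trim plus lowercase, ties go to the first row, the sentinel set is a subset of the configured list, and `merge_by_email` is a hypothetical helper, not the DataTools dedup tool.

```python
# Subset of the sentinel list from the config, plus the empty string.
SENTINELS = {"", "-", "--", "(blank)", "TBD", "unknown", "N/A", "#N/A", "(none)"}


def clean(value: str):
    """Trim whitespace and turn disguised nulls into None."""
    text = value.strip()
    return None if text in SENTINELS else text


def completeness(row: dict) -> int:
    """Number of populated fields; used to pick the surviving row."""
    return sum(value is not None for value in row.values())


def merge_by_email(rows: list[dict]) -> list[dict]:
    groups: dict[str, list[dict]] = {}
    for index, row in enumerate(rows):
        cleaned = {column: clean(value) for column, value in row.items()}
        email = cleaned.get("Email")
        # Assumption: normalizing an email means lowercasing it. Rows with no
        # email get a unique key so they are never merged with each other.
        key = email.lower() if email else f"row-{index}"
        groups.setdefault(key, []).append(cleaned)

    merged = []
    for members in groups.values():
        # max() returns the first of several equally complete rows.
        survivor = dict(max(members, key=completeness))
        for other in members:
            for column, value in other.items():
                if survivor[column] is None and value is not None:
                    survivor[column] = value
        merged.append(survivor)
    return merged


# Rows taken from the vendor sample above (Contact and Total_Paid omitted).
VENDOR_ROWS = [
    {"Vendor": "Acme Realty", "Email": "acme.ap@acmerealty.com",
     "Phone": "(212) 555-0100", "EIN": "12-3456789", "Address": "(blank)"},
    {"Vendor": "acme realty llc", "Email": " ACME.AP@AcmeRealty.com ",
     "Phone": "", "EIN": "", "Address": "118 Canal St, New York, NY 10013"},
    {"Vendor": "ACME REALTY", "Email": "Acme.AP@acmerealty.com",
     "Phone": "212.555.0100", "EIN": "N/A", "Address": "TBD"},
    {"Vendor": "Cooper Plumbing", "Email": "office@cooperplumb.com",
     "Phone": "(617) 555-0133", "EIN": "", "Address": "(blank)"},
]

MERGED = merge_by_email(VENDOR_ROWS)
for row in MERGED:
    print(row)
```

Replacing sentinels with `None` before counting matters: otherwise a row full of `TBD` and `(blank)` placeholders would look more complete than it is.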


@@ -1,13 +0,0 @@
customer_name,email,vendor,memo
Alice Johnson,alice@example.com,ACME Corp ,Welcome aboard
Bob Smith,bob@example.com,ACME Corp,Returning customer
Charlie Brown,charlie@example.com,Globex,Net 30
Diana Prince,diana@example.com,Globex,VIP
Edward Norton,ed@example.com,“Best Pet Supplies”,Order#42 - rush
Frank Castle,frank@example.com,Stark—Industries,"Line 1
Line 2
Line 3"
grace HOPPER ,grace@example.com,Globex,Loves long memos…
Henry Ford,henry@example.com,Ford Motor,Industrial
Iris West,iris@example.com,S.T.A.R. Labs,Notewith-bell
Jane Doe,jane@example.com,Acme,Standard

@@ -146,6 +146,101 @@ def _sync_uploader_to_home_uploads() -> None:
     st.session_state["home_findings_by_file"] = findings
+
+
+def _read_upload_df(name: str, data: bytes):
+    """Bytes -> DataFrame. Mirrors the Automated Workflows page reader:
+    Excel by extension, else CSV with encoding fallbacks. Kept in step
+    with ``9_Pipeline_Runner._read_uploaded`` so the one-click clean
+    reads files exactly as the standalone orchestrator would."""
+    import io as _io
+    from pathlib import Path as _Path
+    import pandas as pd
+    suffix = _Path(name).suffix.lower()
+    bio = _io.BytesIO(data)
+    if suffix in (".xlsx", ".xls"):
+        return pd.read_excel(bio)
+    for enc in ("utf-8", "utf-8-sig", "latin-1"):
+        try:
+            bio.seek(0)
+            sep = "\t" if suffix == ".tsv" else ","
+            return pd.read_csv(bio, encoding=enc, sep=sep, on_bad_lines="warn")
+        except UnicodeDecodeError:
+            continue
+    bio.seek(0)
+    return pd.read_csv(bio, encoding="latin-1")
+
+
+def _run_recommended_clean(home_uploads: dict) -> None:
+    """Front-door action: run the recommended pipeline (Clean Text ->
+    Standardize -> Fix Missing -> Find Duplicates, in that order) on
+    every imported file and stash a cleaned CSV per file in
+    ``session_state`` for download. This is the orchestrator wearing a
+    friendly face — it consumes the same ``recommended_pipeline`` the
+    Automated Workflows page builds. Per-file errors are captured so one
+    bad file doesn't kill the batch."""
+    from src.core.pipeline import recommended_pipeline, run_pipeline
+    from src.core.errors import format_for_user
+    from src.audit import log_event
+    pipeline = recommended_pipeline()
+    names = list(home_uploads.keys())
+    results: dict = {}
+    progress = st.progress(0.0, text="Cleaning…")
+    for i, name in enumerate(names, start=1):
+        progress.progress((i - 1) / max(len(names), 1), text=name)
+        try:
+            df = _read_upload_df(name, home_uploads[name]["bytes"])
+            res = run_pipeline(df, pipeline, stop_on_error=False)
+            results[name] = {
+                "csv": res.final_df.to_csv(index=False).encode("utf-8"),
+                "initial_rows": res.initial_rows,
+                "final_rows": res.final_rows,
+                "error": None,
+            }
+        except Exception as e:  # noqa: BLE001 — surface per file, keep the batch alive
+            results[name] = {"csv": None, "error": format_for_user(e)}
+    progress.empty()
+    log_event("tool_run", "Home one-click recommended clean", files=names)
+    st.session_state["home_clean_results"] = results
+    st.rerun()
+
+
+def _render_clean_results() -> None:
+    """Render per-file cleaned-CSV download buttons + a short summary from
+    the stash produced by :func:`_run_recommended_clean`. Only files
+    still present in ``home_uploads`` are shown, so removing a file
+    drops its stale result."""
+    import hashlib as _hashlib
+    results: dict = st.session_state.get("home_clean_results", {})
+    if not results:
+        return
+    current = st.session_state.get("home_uploads", {})
+    for name, r in results.items():
+        if name not in current:
+            continue
+        digest = _hashlib.sha1(
+            name.encode("utf-8"), usedforsecurity=False,
+        ).hexdigest()[:10]
+        if r.get("error"):
+            st.error(f"**Could not clean `{name}`**\n\n```\n{r['error']}\n```")
+            continue
+        stem = name.rsplit(".", 1)[0]
+        st.download_button(
+            f"⬇ Download cleaned {name}",
+            data=r["csv"],
+            file_name=f"{stem}_cleaned.csv",
+            mime="text/csv",
+            key=f"home_clean_dl_{digest}",
+            width="stretch",
+        )
+        removed = r["initial_rows"] - r["final_rows"]
+        st.caption(
+            f"{r['final_rows']:,} rows kept"
+            + (f" · {removed:,} removed" if removed else " · nothing to remove")
+        )
+
+
 def _home_page() -> None:
     """Render the home page — multi-file upload + per-file analysis.
@@ -443,6 +538,7 @@ def _home_page() -> None:
     if clear_clicked:
         st.session_state["home_findings_by_file"] = {}
+        st.session_state["home_clean_results"] = {}
         st.rerun()
     if run_clicked:
@@ -458,6 +554,8 @@ def _home_page() -> None:
             findings_by_file[name] = _run_analysis_on_upload(stashed)
             progress.progress(i / len(pending), text=name)
         st.session_state["home_findings_by_file"] = findings_by_file
+        # A fresh analysis invalidates any prior one-click clean outputs.
+        st.session_state["home_clean_results"] = {}
         progress.empty()
         st.rerun()
@@ -468,6 +566,30 @@ def _home_page() -> None:
     # 4-card summary above the findings panels so the user can
     # eyeball the run before expanding any one file.
     _render_stats_overview(findings_by_file)
+
+    # ---- Front door: one-click recommended clean (primary path) ----
+    # The analyzer has the findings; the majority case is "just fix
+    # it." This primary button runs the recommended pipeline in the
+    # correct order and hands back a cleaned file per upload, so the
+    # user never has to decide which tool or what order. The per-file
+    # findings below remain the "fix one thing at a time" path.
+    if st.button(
+        "✨ Clean these files for me",
+        type="primary",
+        key="home_clean_all",
+        width="stretch",
+    ):
+        _run_recommended_clean(home_uploads)
+    st.caption(
+        "Recommended: cleans text, standardizes formats, fills blanks, "
+        "and removes duplicates — in the right order — then gives you the "
+        "cleaned file."
+    )
+    _render_clean_results()
+
+    # ---- Manual path: per-file findings, fix one thing at a time ----
+    st.markdown("###### Or fix issues one at a time")
+    st.caption("Open any finding below to jump straight to the right tool.")
     # Preserve the upload-stash order so the user sees results in
     # the same order they appear in the file list above.
     for name in home_uploads:
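The one-click clean in this diff isolates failures per file: each upload is processed inside its own `try`, a failure is recorded as an error entry, and the summary caption is computed from the row counts of the files that succeeded. A minimal, Streamlit-free sketch of that pattern follows. `dedupe_lines` is a hypothetical stand-in for the real pipeline, and `run_batch` and `summary` are illustrative names, not functions from this repository.

```python
def dedupe_lines(data: bytes) -> tuple[int, int, bytes]:
    """Hypothetical stand-in for the real pipeline: drop repeated data lines."""
    lines = data.decode("utf-8").splitlines()
    header, body = lines[0], lines[1:]
    kept = list(dict.fromkeys(body))  # order-preserving de-duplication
    cleaned = "\n".join([header, *kept]) + "\n"
    return len(body), len(kept), cleaned.encode("utf-8")


def run_batch(files: dict[str, bytes], clean=dedupe_lines) -> dict[str, dict]:
    """Clean every file; record an error entry instead of raising."""
    results = {}
    for name, data in files.items():
        try:
            initial, final, csv_bytes = clean(data)
            results[name] = {"csv": csv_bytes, "initial_rows": initial,
                             "final_rows": final, "error": None}
        except Exception as exc:  # one bad file must not stop the batch
            results[name] = {"csv": None, "error": str(exc)}
    return results


def summary(result: dict) -> str:
    """Same wording as the caption in the diff above."""
    removed = result["initial_rows"] - result["final_rows"]
    tail = f" · {removed:,} removed" if removed else " · nothing to remove"
    return f"{result['final_rows']:,} rows kept" + tail


RESULTS = run_batch({
    "bank.csv": b"Date,Amount\n2025-01-15,3450.00\n2025-01-15,3450.00\n2025-01-18,-129.99\n",
    "broken.csv": b"\xff\xfe\x00",  # not valid UTF-8, so decoding fails
})

for name, result in RESULTS.items():
    print(name, result["error"] or summary(result))
```

Storing the error next to the result, rather than raising, lets the page render a download button for every good file and an error box for each bad one in a single pass.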


@@ -78,10 +78,11 @@ def _page_for(tool_id: str, *, page_slug: str, icon: str, title: str) -> "st.Pag
 def _build_navigation() -> dict[str, list]:
     by_section: dict[str, list] = {
         "analysis": [],
         "cleaners": [],
         "transformations": [],
         "automations": [],
         "finance": [],
+        "coming_soon": [],
     }
     # Resolve the tool name through ``tool_name`` (i18n lookup) instead
     # of using the registry's English ``tool.name`` field, otherwise the
@@ -96,16 +97,16 @@ def _build_navigation() -> dict[str, list]:
             )
         )

-    # Home is now surfaced under the new "Analysis" section as
-    # "File Analysis" — the home page's content (importing files,
-    # running the analyzer, browsing findings) is itself a data-analysis
-    # workflow, so grouping it next to Reconcile keeps the sidebar's
-    # mental model coherent. ``default=True`` still points at this
-    # page so first-visit lands here regardless of section placement.
+    # Home is the product's front door: "Start here". It's surfaced as a
+    # standalone, unlabeled top entry (in the "" section, ahead of the
+    # hidden Activate/Logs/Close pages) so it reads as the obvious
+    # starting point above the tool groups rather than one item among
+    # equals. The companion CSS in ``hide_streamlit_chrome`` gives its
+    # nav link accent emphasis. ``default=True`` lands first-visit here.
     home = st.Page(
         _home_page,
-        title=_t("nav.file_analysis_title") or "File Analysis",
-        icon=":material/insert_chart_outlined:",
+        title=_t("nav.start_here_title") or "Start here",
+        icon=":material/play_circle:",
         default=True,
         url_path="home",
     )
@@ -136,17 +137,20 @@ def _build_navigation() -> dict[str, list]:
         url_path="close",
     )

-    # Activate / Logs / Close stay in the unlabeled section (key ``""``)
-    # so the CSS in ``hide_streamlit_chrome`` keeps hiding them by
-    # ``href``. Home moved out of that bucket into "Analysis" — the
-    # unlabeled section now contains ONLY hidden pages, so no orphan
-    # entry appears above the "Analysis" header in the sidebar.
+    # Home leads the unlabeled section (key ``""``) so "Start here" sits
+    # at the very top with no section header above it. Activate / Logs /
+    # Close follow in the same unlabeled bucket and stay hidden by their
+    # ``href`` via the CSS in ``hide_streamlit_chrome``. Section order
+    # below is the journey order: cleaners (pipeline order) →
+    # transformations → automations → finance → coming soon (last, so
+    # not-yet-shipped tools never interleave with working ones).
     return {
-        "": [activate, logs, close],
-        section_label("analysis"): [home, *by_section["analysis"]],
+        "": [home, activate, logs, close],
         section_label("cleaners"): by_section["cleaners"],
         section_label("transformations"): by_section["transformations"],
         section_label("automations"): by_section["automations"],
         section_label("finance"): by_section["finance"],
+        section_label("coming_soon"): by_section["coming_soon"],
     }
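The navigation change above relies on dictionary insertion order: the unlabeled `""` section comes first with Home at its head, and the labeled sections follow in journey order with "coming soon" last. A minimal sketch of that ordering is shown below, using plain strings in place of `st.Page` objects. `SECTION_LABELS` and `build_navigation` are hypothetical stand-ins, not the project's `section_label()` helper or `_build_navigation()` function, and the label text is invented for the example.

```python
# Hypothetical label table; its key order is the sidebar's section order.
SECTION_LABELS = {
    "cleaners": "Cleaners",
    "transformations": "Transformations",
    "automations": "Automations",
    "finance": "Finance",
    "coming_soon": "Coming soon",
}


def build_navigation(by_section: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return sections in display order; dicts preserve insertion order."""
    # The unlabeled section goes first, with "home" leading it.
    nav = {"": ["home", "activate", "logs", "close"]}
    for key, label in SECTION_LABELS.items():
        nav[label] = by_section.get(key, [])
    return nav


NAV = build_navigation({
    "finance": ["reconcile"],
    "cleaners": ["text_clean", "format_standardize", "missing", "dedup"],
})

for section, pages in NAV.items():
    print(repr(section), pages)
```

The order of the input mapping does not matter here; only the order in which keys are inserted into the returned dict decides what the sidebar shows first.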


@@ -9,10 +9,10 @@ side-by-side, and converts the visitor to a Gumroad purchase.
 Launch:
     streamlit run src/gui/app_demo.py

-URL routing:
-    https://demo.datatools.app/?p=shopify-pet (Shopify operator)
-    https://demo.datatools.app/?p=bookkeeper (Bookkeeper)
-    https://demo.datatools.app/?p=revops (RevOps agency)
+URL routing (all three personas serve one audience: accounting):
+    https://demo.datatools.app/?p=bookkeeper (Bookkeeper — bank reconciliation)
+    https://demo.datatools.app/?p=ap-1099 (Accounts payable — 1099 vendor prep)
+    https://demo.datatools.app/?p=ar-aging (Accounts receivable — open invoices)

 Free / paid boundary (per docs/DEMO-PLAN.md §6):
     - input rows capped at ``DEMO_ROW_CAP``
@@ -64,59 +64,66 @@ GUMROAD_BASE: str = "https://gumroad.com/l/datatools"
DEMO_DIR = _project_root / "samples" / "demo"
# All three personas serve one audience — accounting — entering through the
# three workflows where messy exports cost real money: bank reconciliation,
# 1099 / AP vendor prep, and AR aging. Each H1/sub names the exact pain and
# the validated demo outcome (see docs/DEMO-PLAN.md §4 for the numbers).
PERSONAS: dict[str, dict[str, Any]] = {
"shopify-pet": {
"label": "Shopify pet operator",
"icon": "🛍️",
"h1": "Klaviyo-import-ready customer lists. **In 30 seconds. Locally.**",
"sub": (
"Your Shopify customer export has duplicates Excel can't catch, "
"international phones Excel can't parse, and disguised nulls "
"(`N/A`, `(blank)`, `?`) that break Klaviyo's import. "
"DataTools fixes all of it in one pass — and your data never "
"leaves your computer."
),
"data_file": "shopify_pet_customers.csv",
"pipeline_file": "shopify_pet_pipeline.json",
"cta": "Get DataTools for Shopify — $49 →",
"landing": "https://datatools.app/shopify/",
},
"bookkeeper": {
"label": "Bookkeeper / freelance accountant",
"label": "Bookkeeper — bank reconciliation",
"icon": "📒",
"h1": "Reconcile messy bank exports. **Hand your client an audit trail.**",
"h1": "Catch the transactions your bank export posted twice. **Locally.**",
"sub": (
"The Jan and Feb exports overlap; the same transaction posts twice. "
"Vendor names are *Amazon* / *amazon.com* / *AMAZON.COM*4F2X9* in "
"three rows. DataTools dedups on Date + Amount + fuzzy Vendor, "
"produces ISO dates and numeric amounts, and gives you a row-level "
"audit log to hand the client."
"When the Jan and Feb exports overlap, the same payment lands "
"twice — once as `01/15/2025 +$3,450.00`, once as "
"`2025-01-15 3450.00`. DataTools standardizes every date and "
"amount, then dedups on the *real* transaction so your "
"reconciliation ties out. In this sample: **26 rows → 20, six "
"phantom duplicates removed** — and your data never leaves your "
"computer."
),
"data_file": "bookkeeper_bank_reconcile.csv",
"pipeline_file": "bookkeeper_bank_pipeline.json",
"data_file": "bank_reconciliation.csv",
"pipeline_file": "bank_reconciliation_pipeline.json",
"cta": "Get DataTools for Bookkeepers — $49 →",
"landing": "https://datatools.app/bookkeeper/",
},
"revops": {
"label": "Marketing / RevOps agency",
"icon": "🪢",
"h1": "Dedupe lead lists across HubSpot, LinkedIn, and manual scrapes — **locally.**",
"ap-1099": {
"label": "Accounts payable — 1099 prep",
"icon": "🧾",
"h1": "Build a clean 1099 vendor list — **with the missing EINs filled in.**",
"sub": (
"The same prospect shows up in HubSpot as `alice@acme.com`, in "
"LinkedIn as `Alice.Johnson@acme.com`, and in your VA's manual "
"scrape as `alice@acme.com` again. Country is `USA` / `US` / "
"`United States`. DataTools fuzzy-matches across sources, "
"normalizes phones for 50+ countries, and merges survivors "
"with their most-complete fields — without uploading anything."
"The same vendor was entered three times across the year — one "
"record has the EIN, another the address, a third the phone. "
"DataTools consolidates each vendor to one row and *backfills the "
"gaps from the duplicates*. In this sample: **24 messy records → "
"8 complete vendors, with 7 missing EINs recovered** from the "
"duplicate rows. No upload, no VLOOKUP gymnastics."
),
"data_file": "agency_combined_leads.csv",
"pipeline_file": "agency_leads_pipeline.json",
"cta": "Get DataTools for RevOps — $49 →",
"landing": "https://datatools.app/revops/",
"data_file": "vendor_1099.csv",
"pipeline_file": "vendor_1099_pipeline.json",
"cta": "Get DataTools for Accounting — $49 →",
"landing": "https://datatools.app/accounting/",
},
"ar-aging": {
"label": "Accounts receivable — open invoices",
"icon": "💵",
"h1": "Stop chasing the invoices your aging report counted twice. **Locally.**",
"sub": (
"Double-entered invoices inflate your AR aging and your "
"follow-ups. DataTools standardizes invoice dates, due dates, and "
"amounts, lowercases client emails, then removes the duplicate "
"invoice numbers — backfilling any blank status from the twin row. "
"In this sample: **26 rows → 21, five phantom invoices off the "
"books** in one pass."
),
"data_file": "ar_open_invoices.csv",
"pipeline_file": "ar_open_invoices_pipeline.json",
"cta": "Get DataTools for Accounting — $49 →",
"landing": "https://datatools.app/accounting/",
},
}
DEFAULT_PERSONA = "shopify-pet"
DEFAULT_PERSONA = "bookkeeper"
# ---------------------------------------------------------------------------
@@ -132,6 +139,15 @@ st.set_page_config(
# Strip Streamlit chrome that breaks the iframe-embed look on the
# landing pages.
#
# We deliberately do NOT call ``hide_streamlit_chrome()`` from the
# paid GUI here — that helper drags in the license gate, the sidebar
# brand block, language selector, and the +/- nav-section indicator
# script. The demo has no sidebar (we hide it below), no licensing
# (it's the marketing surface), and a different visual palette (dark
# theme vs. the paid app's cream paper). Keep this hand-rolled chrome
# in sync with the demo's own dark palette; do NOT replace it with
# the paid GUI's chrome helper.
st.markdown("""
<style>
#MainMenu, footer, header { visibility: hidden; }
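
Review note: the commit message states that a default or unknown persona resolves to `bookkeeper`. A minimal sketch of that lookup follows. `resolve_persona` is a hypothetical name, the table is trimmed to its keys, and the trimming and lowercasing of the `?p=` value are assumptions rather than behaviour confirmed by the diff.

```python
from typing import Optional

# Trimmed stand-in for the PERSONAS table; only the keys matter here.
PERSONAS = {
    "bookkeeper": {"data_file": "bank_reconciliation.csv"},
    "ap-1099": {"data_file": "vendor_1099.csv"},
    "ar-aging": {"data_file": "ar_open_invoices.csv"},
}
DEFAULT_PERSONA = "bookkeeper"


def resolve_persona(raw: Optional[str]) -> str:
    """Map a ?p= query value to a persona key (hypothetical helper)."""
    key = (raw or "").strip().lower()  # normalization is an assumption
    return key if key in PERSONAS else DEFAULT_PERSONA


print(resolve_persona("ap-1099"))
print(resolve_persona("shopify-pet"))  # retired key falls back
```

With this shape, links that still carry a retired key such as `?p=shopify-pet` land on the default persona instead of raising a `KeyError`.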


@@ -48,6 +48,7 @@ __all__ = [
# Shared chrome / pickup
"back_to_home_link",
"render_sticky_footer",
"render_tool_header",
"hide_streamlit_chrome",
"html_download_button",
"local_download_button",


@@ -95,6 +95,18 @@ footer {
[data-testid="stSidebarNav"] a[href$="/close/"] {
display: none !important;
}
/* "Start here" front-door nav item — accent emphasis so the obvious
entry point reads at a glance above the tool groups. Targets the Home
link by href; accent values mirror theme.py (§3 color scale). */
[data-testid="stSidebarNav"] a[href$="/home"],
[data-testid="stSidebarNav"] a[href$="/home/"] {
background: #fef4ed !important;
font-weight: 600 !important;
}
[data-testid="stSidebarNav"] a[href$="/home"]:hover,
[data-testid="stSidebarNav"] a[href$="/home/"]:hover {
background: #fde4d3 !important;
}
/* Reclaim top padding lost from hidden header. Streamlit's default
block-container padding-top is ~6rem (room for the header it ships).
We hide the header, so we reclaim that space — the page title should sit
@@ -279,7 +291,13 @@ body, .stApp {
with class ``st-emotion-cache-…`` inside ``stSidebarNav`` — class
hashes are unstable across versions, so we lean on the structural
position (the bare span / h2 directly inside the nav list) rather
than emotion classes. */
than emotion classes.
``stSidebarNavSectionHeader`` is the LEGACY testid used by
Streamlit ~1.35; current Streamlit emits ``stNavSectionHeader``
(handled by the dedicated block further down). Both are kept in
the selector list because the requirements floor is
``streamlit>=1.35,<2`` — dropping the legacy testid would break
the visual treatment on the lower bound. */
[data-testid="stSidebarNav"] h2,
[data-testid="stSidebarNav"] h3,
[data-testid="stSidebarNavSeparator"] span,
@@ -316,6 +334,9 @@ body, .stApp {
[data-testid="stSidebarNavItems"] > li {
margin-bottom: 1px !important;
}
/* Legacy testid — kept for streamlit~=1.35 (see note above). The
tighter padding for the current ``stNavSectionHeader`` is set in
the dedicated block further down. */
[data-testid="stSidebarNavSectionHeader"] {
padding-top: 10px !important;
padding-bottom: 2px !important;
@@ -335,6 +356,59 @@ body, .stApp {
box-shadow: none !important;
}
/* ---------- Section header expand indicator ----------
Streamlit's nav section header uses a Material Symbols ligature
icon (``expand_more`` / ``expand_less``) and does NOT expose
``aria-expanded`` on the header — the React component keeps that
state internally. Pure CSS therefore can't switch the glyph based
on state, so the visible swap is performed by
``_SWAP_NAV_SECTION_INDICATOR_JS`` (rewrites the icon's text node
to ``+`` / ``−`` and re-applies on mutation). This block only
handles the static styling so the rewritten glyph reads as a
normal typographic plus/minus instead of a Material font ligature
that would still try to resolve ``+`` as an icon name. */
[data-testid="stNavSectionHeader"] {
position: relative !important;
}
[data-testid="stNavSectionHeader"] [data-testid="stIconMaterial"] {
/* Drop the Material Symbols font so the JS-swapped ``+`` / ``−``
characters render as plain typography. ``font-feature-settings``
is reset so no ligature kicks in. */
font-family: var(--font-sans) !important;
font-feature-settings: normal !important;
-webkit-font-feature-settings: normal !important;
-moz-font-feature-settings: normal !important;
font-weight: 500 !important;
font-size: 16px !important;
line-height: 1 !important;
color: var(--ink-tertiary) !important;
width: auto !important;
height: auto !important;
transition: color 0.15s ease !important;
}
[data-testid="stNavSectionHeader"]:hover [data-testid="stIconMaterial"] {
color: var(--ink) !important;
}
/* ---------- Popover button — never wrap the label ----------
Tagged ``dt-help-popover`` in code comments so the render-time
docstring can point editors here. Streamlit's popover trigger is
a normal button; on the tool-page header it sits in a narrow
column right of the title, and a small viewport was squeezing
the column enough to wrap ``Help`` onto two lines under the icon.
``white-space: nowrap`` keeps the icon + label on one line; the
companion ``min-width: max-content`` keeps the BUTTON itself from
shrinking below its content, so it overflows the column cleanly
instead of compressing into a vertical pile. */
[data-testid="stPopover"] button {
white-space: nowrap !important;
min-width: max-content !important;
}
[data-testid="stPopover"] button > div,
[data-testid="stPopover"] button > div > p {
white-space: nowrap !important;
}
/* Inline + block code → mono with subtle accent chip. theme.py owns
the family + size; this layer adds the warm-fill background. */
[data-testid="stMarkdownContainer"] code {
@@ -1112,6 +1186,47 @@ _WIRE_COLLAPSIBLE_FINDINGS_JS = """
"""
_SWAP_NAV_SECTION_INDICATOR_JS = """
<script>
(function () {
// Replace Streamlit's ``expand_more`` / ``expand_less`` Material
// ligature in sidebar nav section headers with plain ``+`` / ``−``.
// The section header isn't a button and doesn't carry
// ``aria-expanded``, so a pure-CSS swap can't switch the glyph
// based on state — we walk the icon's text node directly.
function swap(doc) {
var headers = doc.querySelectorAll('[data-testid="stNavSectionHeader"]');
headers.forEach(function (h) {
var icon = h.querySelector('[data-testid="stIconMaterial"]');
if (!icon) return;
var text = (icon.textContent || '').trim();
var glyph = null;
if (text === 'expand_more') glyph = '+';
else if (text === 'expand_less') glyph = '−'; // U+2212
else if (text === '+' || text === '−') return; // already swapped
else return;
icon.textContent = glyph;
});
}
var doc;
try { doc = window.parent.document; }
catch (e) { doc = document; }
swap(doc);
var win = doc.defaultView || window.parent || window;
if ('MutationObserver' in win) {
var raf = 0;
try {
new win.MutationObserver(function () {
if (raf) return;
raf = win.requestAnimationFrame(function () { raf = 0; swap(doc); });
}).observe(doc.body, { childList: true, subtree: true, characterData: true });
} catch (e) {}
}
})();
</script>
"""
_RENAME_UPLOAD_BUTTON_JS = """
<script>
(function () {
@@ -1185,7 +1300,8 @@ def hide_streamlit_chrome(*, gate_license: bool = True) -> None:
st.iframe(
_INJECT_BRAND_JS
+ _RENAME_UPLOAD_BUTTON_JS
+ _WIRE_COLLAPSIBLE_FINDINGS_JS,
+ _WIRE_COLLAPSIBLE_FINDINGS_JS
+ _SWAP_NAV_SECTION_INDICATOR_JS,
height=1,
)
# Stamp a session-start record into the audit log the first time
@@ -2042,6 +2158,78 @@ a[data-testid="stPageLink-NavLink"][href*="close"] {
)
def render_tool_header(tool_id: str) -> None:
"""Title row with an inline Help popover anchored to the right.
Replaces the bare ``st.title(...)`` + ``st.caption(...)`` block on
each tool page. Help content is one markdown blob per tool in the
i18n pack (``tools.<id>.help_md``) so editors can tweak copy without
touching Python. The popover is anchored next to the title rather
than the caption so it reads as part of the page header.
Layout: ``[title | help button]`` over ``[caption]``. The help
column is narrow; the title gets the rest. Vertical alignment is
left to Streamlit's column default (top) — works on 1.35+ without
the ``vertical_alignment`` kwarg that landed later.
The popover button uses ``use_container_width=False`` so it sizes
to its content (icon + ``Help`` label). With ``True`` the button
stretches to fill the narrow column, and when the viewport shrinks
the label wraps vertically. The companion CSS rule (search
``dt-help-popover``) pins ``white-space: nowrap`` on every popover
button as defense in depth so the label can never wrap, no
matter how the column ends up sized.
"""
col_title, col_help = st.columns([7, 3])
with col_title:
st.title(_t(f"tools.{tool_id}.page_title"))
with col_help:
# Local-first reassurance + Help, right-aligned opposite the
# title. The "Runs 100% locally" privacy pill is shown on every
# working tool page (where the user is actively feeding in a
# customer list) and omitted on not-yet-shipped "Coming Soon"
# tools, which process nothing. When the pill is shown it also
# serves as the spacer that nudges the popover down toward the
# title baseline; without it we keep the explicit spacer.
from src.gui.tools_registry import tool_by_id as _tool_by_id
_tool = _tool_by_id(tool_id)
if _tool is None or _tool.status == "Ready":
import html as _html
st.markdown(
'<div style="display:flex;justify-content:flex-end">'
'<span class="dt-privacy-pill">'
'<svg viewBox="0 0 24 24" fill="none" stroke="currentColor">'
'<rect x="4" y="11" width="16" height="10" rx="2"/>'
'<path d="M8 11V7a4 4 0 018 0v4"/>'
'</svg>'
f'{_html.escape(_t("home.privacy_pill"))}'
'</span>'
'</div>',
unsafe_allow_html=True,
)
else:
# Spacer pushes the popover button down so it sits closer to
# the title's baseline than to its top.
st.write("")
body = _t(f"tools.{tool_id}.help_md")
# ``src.i18n.t`` falls back to returning the lookup key itself
# on miss (see ``_resolve`` → key-as-string fallback). That's
# what we detect here: any tool whose ``help_md`` entry is
# absent from both en + es packs shows the generic missing-body
# string instead of the raw dotted key. Real help_md content
# in the packs starts with ``**When to use**``-style markdown,
# so this prefix check is safe.
if body.startswith("tools."):
body = _t("help.missing_body")
with st.popover(
_t("help.button_label"),
icon=":material/help_outline:",
use_container_width=False,
):
st.markdown(body)
st.caption(_t(f"tools.{tool_id}.page_caption"))
def _render_sticky_footer_DISABLED() -> None:
"""Slim fixed-position footer at the bottom of the viewport.
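
Review note: the missing-help fallback in `render_tool_header` depends on the translator returning the lookup key itself on a miss. The sketch below isolates that check. The `t` function and the pack contents are hypothetical stand-ins for `src.i18n.t` and the real en/es packs; only the prefix test is taken from the diff.

```python
# Hypothetical pack; the real strings live in the i18n packs.
PACK = {
    "tools.01_deduplicator.help_md": "**When to use** Remove duplicate rows.",
    "help.missing_body": "Help for this tool has not been written yet.",
}


def t(key: str) -> str:
    """Stand-in translator: returns the key itself when it is missing."""
    return PACK.get(key, key)


def help_body(tool_id: str) -> str:
    body = t(f"tools.{tool_id}.help_md")
    # A miss comes back as the raw dotted key, which always starts with
    # "tools."; real help text is assumed to start with markdown instead.
    if body.startswith("tools."):
        body = t("help.missing_body")
    return body


print(help_body("01_deduplicator"))
print(help_body("99_not_a_tool"))
```

The check holds only while no real help text begins with the literal prefix `tools.`; comparing the result against the requested key would be a stricter alternative.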


@@ -0,0 +1,505 @@
"""Visual pipeline builder — per-step "module" cards + plain-language config panels.
The Automated Workflows page (``9_Pipeline_Runner.py``) used to configure each
step through a raw ``options_json`` text column. This module replaces that with
one **module card** per step: a friendly name + caption, an enable toggle,
reorder/remove controls, and a **Configure** expander that renders that tool's
own controls in plain language (no JSON). Raw JSON survives only as the page's
Advanced import/export surface.
Each config renderer takes the step's current ``options`` dict, renders the
curated controls from the design mockup (``layout-review/09_pipeline_runner.html``),
and returns an updated **JSON-serialisable** options dict — the same shape the
``TOOL_ADAPTERS`` in ``src/core/pipeline.py`` consume via ``Options.from_dict``.
Two hard Streamlit constraints shaped this:
* No nested expanders — the per-step Configure expander means config renderers
here must NOT open their own expander, and the page must not wrap the card
stack in an outer expander.
* Widget identity must be stable across reorder/remove — every widget key is
derived from a step's stable ``id``, never its list position.
"""
from __future__ import annotations
from typing import Any, Callable, Optional
import pandas as pd
import streamlit as st
from src.gui.tools_registry import tool_name
# ---------------------------------------------------------------------------
# Adapter-key → registry tool_id bridge
# ---------------------------------------------------------------------------
#
# Pipeline steps are keyed by adapter name (``text_clean``); the tools registry
# and i18n packs are keyed by tool_id (``02_text_cleaner``). The registry has no
# reverse lookup, so we keep the bridge here. ``step_label`` resolves the
# localized friendly name; ``step_caption`` returns a short, plain-English "what
# this step does" line for the card body.
PIPELINE_TOOL_META: dict[str, str] = {
"text_clean": "02_text_cleaner",
"format_standardize": "03_format_standardizer",
"missing": "04_missing_handler",
"column_map": "05_column_mapper",
"dedup": "01_deduplicator",
}
_STEP_CAPTIONS: dict[str, str] = {
"text_clean": "Trim spaces, collapse repeats, strip invisible characters.",
"format_standardize": "Canonicalize phones, dates, currency, names per column.",
"missing": "Flag, fill, or drop blank cells (and disguised blanks).",
"column_map": "Rename source columns onto your target column names.",
"dedup": "Find duplicate rows and keep one survivor per group.",
}
def step_label(tool: str) -> str:
"""Friendly, localized name for a pipeline adapter key (falls back to the key)."""
tool_id = PIPELINE_TOOL_META.get(tool)
return tool_name(tool_id) if tool_id else tool
def step_caption(tool: str) -> str:
return _STEP_CAPTIONS.get(tool, "")
# ---------------------------------------------------------------------------
# Plain-English result phrasing
# ---------------------------------------------------------------------------
#
# Each adapter returns a stats dict (see ``TOOL_ADAPTERS`` in
# ``src/core/pipeline.py``). ``step_phrase`` turns that dict into the one-line
# sentence the mockup shows in the Results table ("312 duplicates removed across
# 147 groups …"); ``step_status`` derives the status pill + an optional inline
# detail line for steps that warn (e.g. unparseable cells) or error.
def _fmt_cols(cols: list) -> str:
"""Join column names for prose: 'name', 'name & city', 'a, b & 2 more'."""
cols = [str(c) for c in cols]
if not cols:
return ""
if len(cols) == 1:
return cols[0]
if len(cols) == 2:
return f"{cols[0]} & {cols[1]}"
if len(cols) == 3:
return f"{cols[0]}, {cols[1]} & {cols[2]}"
return f"{cols[0]}, {cols[1]} & {len(cols) - 2} more"
def _in_cols(cols: list) -> str:
label = _fmt_cols(cols)
return f" in {label}" if label else ""
def _n(count: int, noun: str) -> str:
"""'1 column' / '3 columns' — naive but covers every noun used here."""
return f"{count:,} {noun}" if count == 1 else f"{count:,} {noun}s"
def step_phrase(tool: str, summary: dict) -> str:
"""A plain-English, one-line summary of what a step did."""
s = summary or {}
if tool == "text_clean":
changed = s.get("cells_changed", 0)
if not changed:
return "No changes needed."
return f"{_n(changed, 'cell')} cleaned{_in_cols(s.get('columns_processed', []))}"
if tool == "format_standardize":
changed = s.get("cells_changed", 0)
bad = s.get("cells_unparseable", 0)
if not changed and not bad:
return "Nothing to standardize."
base = f"{_n(changed, 'cell')} standardized{_in_cols(s.get('columns_processed', []))}"
return base if not bad else f"{base} ({bad:,} left unchanged)"
if tool == "missing":
parts: list[str] = []
if s.get("cells_filled"):
parts.append(f"{_n(s['cells_filled'], 'cell')} filled")
if s.get("rows_dropped"):
parts.append(f"{_n(s['rows_dropped'], 'row')} dropped")
if s.get("columns_dropped"):
parts.append(f"{_n(len(s['columns_dropped']), 'column')} dropped")
if not parts and s.get("sentinels_standardized"):
parts.append(f"{_n(s['sentinels_standardized'], 'blank cell')} flagged")
return ", ".join(parts) if parts else "No missing values to handle."
if tool == "column_map":
parts = []
if s.get("columns_renamed"):
parts.append(f"{_n(s['columns_renamed'], 'column')} renamed")
if s.get("columns_added"):
parts.append(f"{_n(len(s['columns_added']), 'column')} added")
if s.get("columns_dropped"):
parts.append(f"{_n(len(s['columns_dropped']), 'column')} dropped")
return ", ".join(parts) if parts else "Columns already aligned."
if tool == "dedup":
removed = s.get("duplicates_removed", 0)
if not removed:
return "No duplicates found."
return (
f"{_n(removed, 'duplicate')} removed across {_n(s.get('groups', 0), 'group')} "
f"({s.get('input_rows', 0):,} → {s.get('output_rows', 0):,} rows)"
)
return ", ".join(f"{k}: {v}" for k, v in s.items())
def step_status(
tool: str, summary: dict, *, skipped: bool = False, error: Optional[str] = None,
) -> tuple[str, str, str]:
"""Return ``(pill_label, level, detail)`` for a step result.
``level`` is one of ``ok`` / ``warn`` / ``error`` / ``skipped``. ``detail``
is a longer inline explanation for warn/error rows (else "").
"""
if error:
return "✗ error", "error", error.splitlines()[0]
if skipped:
return "⏭ skipped", "skipped", ""
s = summary or {}
if tool == "format_standardize" and s.get("cells_unparseable"):
n = s["cells_unparseable"]
return (
f"⚠ ok · {n:,} skipped", "warn",
f"{n:,} values didn't match a known pattern and were left "
"unchanged. The step still completed — review them in the output "
"preview if needed.",
)
if tool == "column_map":
fails = s.get("coercion_failures") or {}
n_fail = sum(fails.values()) if isinstance(fails, dict) else 0
missing_req = s.get("missing_required_targets") or []
if missing_req:
return (
"⚠ ok · missing targets", "warn",
"Required target columns had no source match: "
+ ", ".join(map(str, missing_req)) + ".",
)
if n_fail:
return (
f"⚠ ok · {n_fail:,} not coerced", "warn",
f"{n_fail:,} values couldn't be coerced to their target type "
"and were left as-is.",
)
return "✓ ok", "ok", ""
# ---------------------------------------------------------------------------
# Per-tool config renderers
# ---------------------------------------------------------------------------
#
# Uniform signature: ``render_<tool>_config(df, options, kp) -> options``.
# * ``df`` — the uploaded DataFrame (for column lists / type hints).
# * ``options`` — the step's current options dict (seed widget defaults).
# * ``kp`` — key prefix, unique per step (``f"{tool}_{id}"``).
# Returns a JSON-serialisable options dict. Renderers must not open expanders.
_CASE_LABELS: list[tuple[str, Optional[str]]] = [
("Leave as-is", None),
("UPPERCASE", "upper"),
("lowercase", "lower"),
("Title Case", "title"),
("Sentence case", "sentence"),
]
def render_text_clean_config(df: pd.DataFrame, options: dict, kp: str) -> dict:
trim = st.checkbox(
"Trim leading & trailing whitespace",
value=bool(options.get("trim", True)), key=f"{kp}_trim",
)
collapse = st.checkbox(
"Collapse repeated spaces to one",
value=bool(options.get("collapse_whitespace", True)), key=f"{kp}_collapse",
)
fold = st.checkbox(
"Normalize smart quotes & dashes to plain ASCII",
value=bool(options.get("fold_smart_chars", True)), key=f"{kp}_fold",
)
strip_zw = st.checkbox(
"Strip zero-width / invisible characters",
value=bool(options.get("strip_zero_width", True)), key=f"{kp}_zw",
)
cur_case = options.get("case")
case_idx = next((i for i, (_, v) in enumerate(_CASE_LABELS) if v == cur_case), 0)
case_choice = st.selectbox(
"Letter case",
[lbl for lbl, _ in _CASE_LABELS],
index=case_idx, key=f"{kp}_case",
)
case_val = next(v for lbl, v in _CASE_LABELS if lbl == case_choice)
out: dict[str, Any] = {
"trim": trim,
"collapse_whitespace": collapse,
"fold_smart_chars": fold,
"strip_zero_width": strip_zw,
}
if case_val is not None:
out["case"] = case_val
return out
_FORMAT_LABELS: list[tuple[str, Optional[str]]] = [
("Leave as-is", None),
("Date", "date"),
("Phone number", "phone"),
("Currency", "currency"),
("Name", "name"),
("Address", "address"),
("Email", "email"),
("Boolean (yes/no)", "boolean"),
]
def render_format_standardize_config(df: pd.DataFrame, options: dict, kp: str) -> dict:
st.caption(
"Pick a target format for each column. Columns left as “Leave as-is” "
"are untouched."
)
current = dict(options.get("column_types", {}))
labels = [lbl for lbl, _ in _FORMAT_LABELS]
column_types: dict[str, str] = {}
for col in df.columns:
cur_val = current.get(col)
idx = next((i for i, (_, v) in enumerate(_FORMAT_LABELS) if v == cur_val), 0)
choice = st.selectbox(
str(col), labels, index=idx, key=f"{kp}_fmt__{col}",
)
val = next(v for lbl, v in _FORMAT_LABELS if lbl == choice)
if val is not None:
column_types[str(col)] = val
return {"column_types": column_types}
# Plain-language blank-handling choices → core strategy values. "fill" is a UI
# token expanded to numeric median + categorical mode (MissingOptions handles
# the per-dtype split via ``categorical_strategy``).
_MISSING_CHOICES: list[tuple[str, str]] = [
("Flag them (mark blanks, change nothing)", "flag"),
("Fill them in (numbers → median, text → most common)", "fill"),
("Drop rows that have any blank", "drop"),
]
def _missing_mode_from_strategy(strategy: Optional[str]) -> str:
if strategy in ("drop_row", "drop_col", "drop_both"):
return "drop"
if strategy in ("mean", "median", "mode", "constant", "ffill", "bfill", "interpolate"):
return "fill"
return "flag"
def render_missing_config(df: pd.DataFrame, options: dict, kp: str) -> dict:
from src.core.missing import DEFAULT_SENTINELS
cur_mode = _missing_mode_from_strategy(options.get("strategy"))
mode_idx = next((i for i, (_, v) in enumerate(_MISSING_CHOICES) if v == cur_mode), 0)
mode_choice = st.radio(
"What should happen to blank cells?",
[lbl for lbl, _ in _MISSING_CHOICES],
index=mode_idx, key=f"{kp}_strategy",
)
mode = next(v for lbl, v in _MISSING_CHOICES if lbl == mode_choice)
seed_sentinels = options.get("sentinels") or list(DEFAULT_SENTINELS)
sent_text = st.text_input(
"Treat these as blank (comma-separated)",
value=", ".join(seed_sentinels), key=f"{kp}_sentinels",
help="Matched case-insensitively after stripping whitespace.",
)
sentinels = [s.strip() for s in sent_text.split(",") if s.strip()]
out: dict[str, Any] = {
"standardize_sentinels": True,
"sentinels": sentinels,
}
if mode == "flag":
out["strategy"] = "none"
elif mode == "fill":
out["strategy"] = "median"
out["categorical_strategy"] = "mode"
else: # drop
out["strategy"] = "drop_row"
return out
_UNMAPPED_CHOICES = ["keep", "drop", "error"]
def render_column_map_config(df: pd.DataFrame, options: dict, kp: str) -> dict:
st.caption(
"Type the target name each source column should become. Leave a target "
"blank to keep that column's name unchanged."
)
current = dict(options.get("mapping", {}))
table = pd.DataFrame(
{
"source": [str(c) for c in df.columns],
"target": [current.get(str(c), "") for c in df.columns],
}
)
edited = st.data_editor(
table,
width="stretch",
hide_index=True,
disabled=["source"],
column_config={
"source": st.column_config.TextColumn("Source column"),
"target": st.column_config.TextColumn("Rename to"),
},
key=f"{kp}_mapping",
)
mapping = {
str(r["source"]): str(r["target"]).strip()
for _, r in edited.iterrows()
if str(r.get("target") or "").strip()
}
c1, c2 = st.columns(2)
with c1:
unmapped = st.selectbox(
"Columns with no rename",
_UNMAPPED_CHOICES,
index=_UNMAPPED_CHOICES.index(options.get("unmapped", "keep"))
if options.get("unmapped") in _UNMAPPED_CHOICES else 0,
key=f"{kp}_unmapped",
help="keep: leave them in place · drop: remove them · error: stop the run.",
)
with c2:
coerce = st.checkbox(
"Coerce values to target types",
value=bool(options.get("coerce_types", False)), key=f"{kp}_coerce",
)
return {"mapping": mapping, "unmapped": unmapped, "coerce_types": coerce}
_SURVIVOR_LABELS: list[tuple[str, str]] = [
("Keep the most complete row", "most_complete"),
("Keep the first seen", "first"),
("Keep the last seen", "last"),
("Keep the most recent (by date)", "most_recent"),
]
def render_dedup_config(df: pd.DataFrame, options: dict, kp: str) -> dict:
cur_rule = options.get("survivor_rule", "first")
rule_idx = next((i for i, (_, v) in enumerate(_SURVIVOR_LABELS) if v == cur_rule), 0)
rule_choice = st.selectbox(
"When rows match, which one survives?",
[lbl for lbl, _ in _SURVIVOR_LABELS],
index=rule_idx, key=f"{kp}_survivor",
)
survivor_rule = next(v for lbl, v in _SURVIVOR_LABELS if lbl == rule_choice)
merge = st.checkbox(
"Merge matched rows (fill each survivor's blanks from its duplicates)",
value=bool(options.get("merge", False)), key=f"{kp}_merge",
)
# Recover the previously-selected match columns from the stored strategies
# (a single exact-match strategy over the chosen columns).
prev_cols: list[str] = []
for strat in options.get("strategies", []) or []:
for c in strat.get("columns", []):
if c.get("column"):
prev_cols.append(c["column"])
all_cols = [str(c) for c in df.columns]
match_cols = st.multiselect(
"Match on these columns",
all_cols,
default=[c for c in prev_cols if c in all_cols],
key=f"{kp}_matchcols",
help="Rows are duplicates when these columns all match. Leave empty to auto-detect.",
)
out: dict[str, Any] = {"survivor_rule": survivor_rule, "merge": merge}
if match_cols:
out["strategies"] = [
{"columns": [
{"column": c, "algorithm": "exact", "threshold": 100}
for c in match_cols
]}
]
if survivor_rule == "most_recent":
date_default = options.get("date_column")
date_idx = all_cols.index(date_default) if date_default in all_cols else 0
out["date_column"] = st.selectbox(
"Date column (for most-recent)",
all_cols, index=date_idx, key=f"{kp}_datecol",
) if all_cols else None
return out
CONFIG_RENDERERS: dict[str, Callable[[pd.DataFrame, dict, str], dict]] = {
"text_clean": render_text_clean_config,
"format_standardize": render_format_standardize_config,
"missing": render_missing_config,
"column_map": render_column_map_config,
"dedup": render_dedup_config,
}
# ---------------------------------------------------------------------------
# Module card
# ---------------------------------------------------------------------------
def render_step_card(
df: pd.DataFrame, step: dict, idx: int, total: int,
) -> Optional[str]:
"""Render one pipeline step as a module card.
Mutates ``step`` in place (``enabled`` toggle, ``options`` from the Configure
panel). Returns an action string (``"up"`` / ``"down"`` / ``"remove"``) when
the user clicks a reorder/remove control, else ``None`` — the caller applies
the action to the step list and reruns.
"""
sid = step["id"]
kp = f"{step['tool']}_{sid}"
action: Optional[str] = None
with st.container(border=True):
head, toggle, up, down, rm = st.columns([0.66, 0.12, 0.07, 0.07, 0.08])
with head:
st.markdown(f"**{idx + 1}. {step_label(step['tool'])}**")
st.caption(step_caption(step["tool"]))
with toggle:
step["enabled"] = st.toggle(
"On", value=step.get("enabled", True), key=f"{kp}_enabled",
help="Disabled steps are kept in the pipeline but skipped at run time.",
)
with up:
if st.button("↑", key=f"{kp}_up", disabled=idx == 0,
help="Move up", width="stretch"):
action = "up"
with down:
if st.button("↓", key=f"{kp}_down", disabled=idx == total - 1,
help="Move down", width="stretch"):
action = "down"
with rm:
if st.button("✕", key=f"{kp}_rm", help="Remove step", width="stretch"):
action = "remove"
renderer = CONFIG_RENDERERS.get(step["tool"])
with st.expander(f"Configure: {step_label(step['tool'])}"):
if renderer is None:
st.caption("This step has no options.")
else:
step["options"] = renderer(df, step.get("options", {}) or {}, kp)
return action
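
Review note: the docstring above leaves applying the returned action to the caller. A minimal sketch of that caller-side logic as a pure function follows, so it can be exercised without Streamlit. The name `apply_step_action` is hypothetical and is not part of this change; the branch conditions mirror the ones the pipeline page uses.

```python
from typing import Optional


def apply_step_action(steps: list[dict], action: Optional[str], idx: int) -> bool:
    """Apply a card action to the step list in place.

    Returns True when the list changed (the caller would then rerun the page).
    Hypothetical helper for illustration only.
    """
    if action == "remove":
        steps.pop(idx)
        return True
    if action == "up" and idx > 0:
        # Swap with the previous step.
        steps[idx - 1], steps[idx] = steps[idx], steps[idx - 1]
        return True
    if action == "down" and idx < len(steps) - 1:
        # Swap with the next step.
        steps[idx + 1], steps[idx] = steps[idx], steps[idx + 1]
        return True
    # None, or an out-of-range move: nothing to do.
    return False


steps = [
    {"id": 0, "tool": "text_clean"},
    {"id": 1, "tool": "missing"},
    {"id": 2, "tool": "dedup"},
]
apply_step_action(steps, "down", 0)
order_after_down = [s["id"] for s in steps]

apply_step_action(steps, "remove", 1)
order_after_remove = [s["id"] for s in steps]

noop_changed = apply_step_action(steps, "up", 0)
```

Keeping the mutation in one place like this makes the out-of-range cases (moving the first step up, the last step down) easy to test in isolation.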


@@ -25,6 +25,7 @@ from src.gui.components import (
hide_streamlit_chrome,
html_download_button,
render_sticky_footer,
render_tool_header,
)
from src.pdf_extract import (
PdfDependencyMissing,
@@ -103,13 +104,7 @@ def _format_size(n_bytes: int) -> str:
# Header + dep guard
# ---------------------------------------------------------------------------
st.markdown("# PDF to CSV")
st.caption(
"Scan bank-statement PDFs for transaction rows "
"(``[date] [description] [amount]``). Review the table, uncheck "
"rows you don't want, edit any cell that needs fixing, then "
"download as CSV. No per-bank setup."
)
render_tool_header("10_pdf_extractor")
_pdf_ok, _pdf_missing = _pdf_deps_status()
if not _pdf_ok:


@@ -25,6 +25,7 @@ from src.gui.components import (
hide_streamlit_chrome,
html_download_button,
render_sticky_footer,
render_tool_header,
)
from src.core.reconcile import ReconcileOptions, reconcile
@@ -38,12 +39,7 @@ log_page_open("11_Reconciler")
# Header
# ---------------------------------------------------------------------------
st.title("Reconcile Two Files")
st.caption(
"Match transactions between two sources (e.g. bank feed vs. ledger). "
"Outputs four buckets: matched, unmatched-left, unmatched-right, and "
"ambiguous-for-review."
)
render_tool_header("11_reconciler")
# ---------------------------------------------------------------------------


@@ -20,6 +20,7 @@ from src.gui.components import (
apply_review_decisions,
back_to_home_link,
render_sticky_footer,
render_tool_header,
config_panel,
hide_streamlit_chrome,
html_download_button,
@@ -28,7 +29,6 @@ from src.gui.components import (
require_feature_or_render_upgrade,
results_summary,
)
from src.i18n import t
from src.license import FeatureFlag
hide_streamlit_chrome()
@@ -60,8 +60,7 @@ for key, default in _DEFAULTS.items():
# Header
# ---------------------------------------------------------------------------
st.title(t("tools.01_deduplicator.page_title"))
st.caption(t("tools.01_deduplicator.page_caption"))
render_tool_header("01_deduplicator")
# ---------------------------------------------------------------------------


@@ -17,13 +17,13 @@ if str(_project_root) not in sys.path:
from src.gui.components import (
back_to_home_link,
render_sticky_footer,
render_tool_header,
hide_streamlit_chrome,
html_download_button,
pickup_or_upload,
render_hidden_aware_preview,
require_feature_or_render_upgrade,
)
from src.i18n import t
from src.license import FeatureFlag
from src.core.text_clean import (
PRESETS,
@@ -45,8 +45,7 @@ require_feature_or_render_upgrade(FeatureFlag.TEXT_CLEANER)
# Header
# ---------------------------------------------------------------------------
st.title(t("tools.02_text_cleaner.page_title"))
st.caption(t("tools.02_text_cleaner.page_caption"))
render_tool_header("02_text_cleaner")
# ---------------------------------------------------------------------------
# File upload


@@ -17,12 +17,12 @@ if str(_project_root) not in sys.path:
from src.gui.components import (
back_to_home_link,
render_sticky_footer,
render_tool_header,
hide_streamlit_chrome,
html_download_button,
pickup_or_upload,
require_feature_or_render_upgrade,
)
from src.i18n import t
from src.core.format_standardize import (
PRESETS,
FieldType,
@@ -43,8 +43,7 @@ require_feature_or_render_upgrade(FeatureFlag.FORMAT_STANDARDIZER)
# Header
# ---------------------------------------------------------------------------
st.title(t("tools.03_format_standardizer.page_title"))
st.caption(t("tools.03_format_standardizer.page_caption"))
render_tool_header("03_format_standardizer")
# ---------------------------------------------------------------------------


@@ -17,12 +17,12 @@ if str(_project_root) not in sys.path:
from src.gui.components import (
back_to_home_link,
render_sticky_footer,
render_tool_header,
hide_streamlit_chrome,
html_download_button,
pickup_or_upload,
require_feature_or_render_upgrade,
)
from src.i18n import t
from src.core.missing import (
DEFAULT_SENTINELS,
MissingOptions,
@@ -44,8 +44,7 @@ require_feature_or_render_upgrade(FeatureFlag.MISSING_HANDLER)
# Header
# ---------------------------------------------------------------------------
st.title(t("tools.04_missing_handler.page_title"))
st.caption(t("tools.04_missing_handler.page_caption"))
render_tool_header("04_missing_handler")
# ---------------------------------------------------------------------------


@@ -17,12 +17,12 @@ if str(_project_root) not in sys.path:
from src.gui.components import (
back_to_home_link,
render_sticky_footer,
render_tool_header,
hide_streamlit_chrome,
html_download_button,
pickup_or_upload,
require_feature_or_render_upgrade,
)
from src.i18n import t
from src.core.column_mapper import (
MapOptions,
PRESETS,
@@ -45,8 +45,7 @@ require_feature_or_render_upgrade(FeatureFlag.COLUMN_MAPPER)
# Header
# ---------------------------------------------------------------------------
st.title(t("tools.05_column_mapper.page_title"))
st.caption(t("tools.05_column_mapper.page_caption"))
render_tool_header("05_column_mapper")
# ---------------------------------------------------------------------------


@@ -14,10 +14,10 @@ if str(_project_root) not in sys.path:
from src.gui.components import (
back_to_home_link,
render_sticky_footer,
render_tool_header,
hide_streamlit_chrome,
require_feature_or_render_upgrade,
)
from src.i18n import t
from src.license import FeatureFlag
hide_streamlit_chrome()
@@ -31,8 +31,7 @@ require_feature_or_render_upgrade(FeatureFlag.OUTLIER_DETECTOR)
# Header
# ---------------------------------------------------------------------------
st.title(t("tools.06_outlier_detector.page_title"))
st.caption(t("tools.06_outlier_detector.page_caption"))
render_tool_header("06_outlier_detector")
st.info("This tool is under development.")


@@ -14,10 +14,10 @@ if str(_project_root) not in sys.path:
from src.gui.components import (
back_to_home_link,
render_sticky_footer,
render_tool_header,
hide_streamlit_chrome,
require_feature_or_render_upgrade,
)
from src.i18n import t
from src.license import FeatureFlag
hide_streamlit_chrome()
@@ -31,8 +31,7 @@ require_feature_or_render_upgrade(FeatureFlag.MULTI_FILE_MERGER)
# Header
# ---------------------------------------------------------------------------
st.title(t("tools.07_multi_file_merger.page_title"))
st.caption(t("tools.07_multi_file_merger.page_caption"))
render_tool_header("07_multi_file_merger")
st.info("This tool is under development.")


@@ -14,10 +14,10 @@ if str(_project_root) not in sys.path:
from src.gui.components import (
back_to_home_link,
render_sticky_footer,
render_tool_header,
hide_streamlit_chrome,
require_feature_or_render_upgrade,
)
from src.i18n import t
from src.license import FeatureFlag
hide_streamlit_chrome()
@@ -31,8 +31,7 @@ require_feature_or_render_upgrade(FeatureFlag.VALIDATOR_REPORTER)
# Header
# ---------------------------------------------------------------------------
st.title(t("tools.08_validator_reporter.page_title"))
st.caption(t("tools.08_validator_reporter.page_caption"))
render_tool_header("08_validator_reporter")
st.info("This tool is under development.")
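
Review note: every page now swaps its `st.caption(t(...page_caption))` call for `render_tool_header(<tool_id>)`, and the translation pack gains `help.button_label`, `help.missing_body`, and a per-tool `help_md`. The helper's body is not shown in this diff, so the following is only a sketch of the lookup it plausibly performs, assuming the pack is a nested dict. `resolve_tool_header` and its return shape are hypothetical.

```python
def resolve_tool_header(pack: dict, tool_id: str) -> dict:
    """Collect the header strings for one tool from a translation pack.

    Hypothetical helper: falls back to the generic "missing" text when a
    tool has no help_md entry, and to the tool id when the title is absent.
    """
    tool = pack.get("tools", {}).get(tool_id, {})
    help_block = pack.get("help", {})
    return {
        "title": tool.get("page_title", tool_id),
        "caption": tool.get("page_caption", ""),
        "help_label": help_block.get("button_label", "Help"),
        "help_body": tool.get("help_md") or help_block.get("missing_body", ""),
    }


# Trimmed stand-in for the translation pack; only the keys used here.
pack = {
    "help": {
        "button_label": "Help",
        "missing_body": "No help written yet for this tool.",
    },
    "tools": {
        "06_outlier_detector": {
            "page_title": "Find Unusual Values",
            "page_caption": "Spot values that look wrong.",
            "help_md": "**When to use**\n- Spotting typos in numeric data",
        },
        "99_example": {"page_title": "Example"},  # no help_md on purpose
    },
}

with_help = resolve_tool_header(pack, "06_outlier_detector")
without_help = resolve_tool_header(pack, "99_example")
```

Resolving the strings separately from rendering keeps the fallback behaviour testable without a running Streamlit session.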


@@ -17,12 +17,12 @@ if str(_project_root) not in sys.path:
from src.gui.components import (
back_to_home_link,
render_sticky_footer,
render_tool_header,
hide_streamlit_chrome,
html_download_button,
pickup_or_upload,
require_feature_or_render_upgrade,
)
from src.i18n import t
from src.core.pipeline import (
Pipeline,
SOFT_DEPENDENCIES,
@@ -32,6 +32,12 @@ from src.core.pipeline import (
run_pipeline,
validate_pipeline,
)
from src.gui.components.pipeline_modules import (
render_step_card,
step_label,
step_phrase,
step_status,
)
from src.license import FeatureFlag
hide_streamlit_chrome()
@@ -46,8 +52,7 @@ require_feature_or_render_upgrade(FeatureFlag.PIPELINE_RUNNER)
# Header
# ---------------------------------------------------------------------------
st.title(t("tools.09_pipeline_runner.page_title"))
st.caption(t("tools.09_pipeline_runner.page_caption"))
render_tool_header("09_pipeline_runner")
# ---------------------------------------------------------------------------
@@ -105,120 +110,148 @@ st.divider()
# ---------------------------------------------------------------------------
# Pipeline builder
# Pipeline builder — visual module cards
# ---------------------------------------------------------------------------
#
# Wrapped in an outer expander whose default state mirrors the preview
# expander above: open before a result exists, folded once the user has
# clicked Run Pipeline. The pipeline editor is this page's "Options"
# section — structurally analogous to Text Cleaner's options block.
# Each step is a "module" card (src/gui/components/pipeline_modules.py) with a
# plain-language Configure panel — no raw JSON. Steps live in session state as
# an ordered list of dicts, each carrying a STABLE integer id so widget keys
# survive reorder/remove. Raw JSON is import/export only, under Advanced.
#
# NB: the builder is NOT wrapped in an outer expander — per-step Configure
# panels are expanders, and Streamlit forbids nesting expanders.
with st.expander("Options", expanded=not _has_result):
mode = st.radio(
def _seed_steps_from(pipeline) -> None:
"""Replace the session step list from a Pipeline, assigning fresh ids."""
seq = st.session_state.get("pipeline_step_seq", 0)
steps: list[dict] = []
for s in pipeline.steps:
steps.append({
"id": seq, "tool": s.tool,
"enabled": s.enabled, "options": dict(s.options),
})
seq += 1
st.session_state["pipeline_steps"] = steps
st.session_state["pipeline_step_seq"] = seq
if "pipeline_steps" not in st.session_state:
_seed_steps_from(recommended_pipeline())
st.subheader("Build your pipeline")
mode = st.radio(
"How would you like to define the pipeline?",
[
"Use the recommended default (text-clean → format → missing → dedup)",
"Use the recommended default (Clean Text → Standardize → Fix Missing → Find Duplicates)",
"Build interactively",
"Import a saved pipeline JSON",
],
index=0,
key="pipeline_mode",
)
if mode.startswith("Use the recommended"):
# Only reseed on an explicit click that lands here while the steps already
# diverge — otherwise every rerun would wipe edits. We detect "user just
# selected this mode" by comparing against the recommended default and
# offering a one-click restore rather than silently discarding.
rec_dict = recommended_pipeline().to_dict()
cur_dict = {
"steps": [
{"tool": s["tool"], "options": s["options"],
"enabled": s["enabled"], "name": None}
for s in st.session_state["pipeline_steps"]
]
}
if cur_dict != rec_dict:
st.info(
"You've edited the recommended steps, so they're now yours to "
"change — you're effectively in **Build interactively** mode. "
"Restore the suggested steps to discard your edits."
)
if "pipeline_rows" not in st.session_state:
default = recommended_pipeline()
st.session_state["pipeline_rows"] = pd.DataFrame([
{
"tool": s.tool, "enabled": s.enabled,
"options_json": json.dumps(s.options),
}
for s in default.steps
])
if mode.startswith("Use the recommended"):
default = recommended_pipeline()
st.session_state["pipeline_rows"] = pd.DataFrame([
{
"tool": s.tool, "enabled": s.enabled,
"options_json": json.dumps(s.options),
}
for s in default.steps
])
elif mode.startswith("Import"):
if st.button("↺ Restore recommended steps"):
_seed_steps_from(recommended_pipeline())
st.rerun()
elif mode.startswith("Import"):
pipeline_file = st.file_uploader(
"Pipeline JSON", type=["json"], key="pipeline_upload",
)
if pipeline_file is not None:
try:
data = json.loads(pipeline_file.getvalue())
uploaded_pipe = Pipeline.from_dict(data)
st.session_state["pipeline_rows"] = pd.DataFrame([
{
"tool": s.tool, "enabled": s.enabled,
"options_json": json.dumps(s.options),
}
for s in uploaded_pipe.steps
])
st.success(f"Loaded {len(uploaded_pipe.steps)} step(s).")
_seed_steps_from(Pipeline.from_dict(data))
st.success(
f"Loaded {len(st.session_state['pipeline_steps'])} step(s). "
"Switch to **Build interactively** to tweak them."
)
except Exception as e:
from src.core.errors import format_for_user
st.error(f"**Could not parse pipeline**\n\n```\n{format_for_user(e)}\n```")
st.caption(
"Edit the table to add, remove, reorder (drag the row index), enable, "
"or configure each step. Tool order is recommended, not enforced — "
"violations surface as warnings below the table."
)
edited = st.data_editor(
st.session_state["pipeline_rows"],
width="stretch",
num_rows="dynamic",
column_config={
"tool": st.column_config.SelectboxColumn(
"Tool", options=TOOL_NAMES, required=True,
),
"enabled": st.column_config.CheckboxColumn("Enabled"),
"options_json": st.column_config.TextColumn(
"Options (JSON)",
help='e.g. {"column_types": {"phone": "phone"}}',
),
},
key="pipeline_editor",
)
st.session_state["pipeline_rows"] = edited
st.caption(
"Each step is a module: toggle it on/off, reorder with ▲ ▼, remove with ✕, "
"and open **Configure** to set its options in plain language. Tool order is "
"recommended, not enforced — violations surface as warnings below."
)
# Build a Pipeline object from the editor state.
steps_list: list[Step] = []
parse_errors: list[str] = []
for i, row in edited.iterrows():
tool = row.get("tool")
if not tool or pd.isna(tool):
continue
raw_opts = row.get("options_json") or "{}"
if pd.isna(raw_opts):
raw_opts = "{}"
try:
opts = json.loads(raw_opts) if isinstance(raw_opts, str) else dict(raw_opts)
if not isinstance(opts, dict):
raise ValueError("options must be a JSON object")
except Exception as e:
parse_errors.append(f"Step {i + 1}: {e}")
continue
# Render the module stack. A reorder/remove action mutates the list and reruns.
steps = st.session_state["pipeline_steps"]
total = len(steps)
pending_action: tuple[str, int] | None = None
for i, step in enumerate(steps):
act = render_step_card(df, step, i, total)
if act is not None:
pending_action = (act, i)
if pending_action is not None:
act, i = pending_action
if act == "remove":
steps.pop(i)
elif act == "up" and i > 0:
steps[i - 1], steps[i] = steps[i], steps[i - 1]
elif act == "down" and i < total - 1:
steps[i + 1], steps[i] = steps[i], steps[i + 1]
st.session_state["pipeline_steps"] = steps
st.rerun()
# Add-step control.
add_col, btn_col = st.columns([0.7, 0.3])
with add_col:
add_tool = st.selectbox(
"Add a step",
TOOL_NAMES,
format_func=step_label,
key="pipeline_add_tool",
label_visibility="collapsed",
)
with btn_col:
if st.button(" Add step", width="stretch"):
seq = st.session_state.get("pipeline_step_seq", 0)
steps.append({"id": seq, "tool": add_tool, "enabled": True, "options": {}})
st.session_state["pipeline_step_seq"] = seq + 1
st.rerun()
# Build a Pipeline object from the step list.
steps_list: list[Step] = []
parse_errors: list[str] = []
for i, step in enumerate(steps):
try:
steps_list.append(Step(
tool=str(tool),
options=opts,
enabled=bool(row.get("enabled", True)),
tool=str(step["tool"]),
options=dict(step.get("options") or {}),
enabled=bool(step.get("enabled", True)),
))
except Exception as e:
parse_errors.append(f"Step {i + 1}: {e}")
parse_errors.append(f"Step {i + 1} ({step.get('tool')}): {e}")
if parse_errors:
for err in parse_errors:
for err in parse_errors:
st.error(err)
current_pipeline = Pipeline(steps=steps_list) if steps_list else None
current_pipeline = Pipeline(steps=steps_list) if steps_list else None
if current_pipeline is not None:
if current_pipeline is not None:
warnings = validate_pipeline(current_pipeline)
if warnings:
st.warning(
@@ -227,14 +260,37 @@ with st.expander("Options", expanded=not _has_result):
+ "\n\nThe pipeline will still run — these are recommendations only."
)
with st.expander("Recommended tool order — why each step belongs where it does"):
with st.expander("Recommended tool order — why each step belongs where it does"):
st.markdown(
"\n".join(
f"- **{e}** before **{l}** — {why}"
f"- **{step_label(e)}** before **{step_label(l)}** — {why}"
for e, l, why in SOFT_DEPENDENCIES
)
)
with st.expander("Advanced — import / export pipeline as JSON"):
st.caption(
"For sharing or version control. Editing is done in the step panels "
"above — this is just the saved form of the same settings. The same "
"JSON runs in the CLI via `--pipeline pipeline.json`."
)
export_json = json.dumps(
current_pipeline.to_dict() if current_pipeline else {"steps": []},
indent=2, default=str,
)
st.code(export_json, language="json")
adv_paste = st.text_area(
"Paste pipeline JSON to load it", key="pipeline_json_paste", height=140,
)
if st.button("Load pasted JSON", disabled=not adv_paste.strip()):
try:
_seed_steps_from(Pipeline.from_dict(json.loads(adv_paste)))
st.success("Loaded. Scroll up to see the steps.")
st.rerun()
except Exception as e:
from src.core.errors import format_for_user
st.error(f"**Could not parse pipeline**\n\n```\n{format_for_user(e)}\n```")
st.divider()
# ---------------------------------------------------------------------------
@@ -258,14 +314,14 @@ if st.button(
def _on_step(sr) -> None:
completed[0] += 1
if sr.skipped:
log_lines.append(f"{sr.step.display_name()} (skipped)")
log_lines.append(f"{step_label(sr.step.tool)} (skipped)")
elif sr.error:
log_lines.append(
f"{sr.step.display_name()}{sr.error.splitlines()[0]}"
f"{step_label(sr.step.tool)}{sr.error.splitlines()[0]}"
)
else:
log_lines.append(
f"{sr.step.display_name()}{sr.elapsed_seconds*1000:.0f} ms"
f"{step_label(sr.step.tool)}{sr.elapsed_seconds*1000:.0f} ms"
)
log_box.markdown("\n".join(log_lines))
progress.progress(
@@ -329,22 +385,38 @@ m3.metric("Steps run", sum(1 for s in result.step_results if not s.skipped))
m4.metric("Elapsed", f"{result.total_elapsed:.2f} s")
st.markdown("**Per-step summary**")
# Plain-English status pill + summary phrase per step (mockup §Results). The
# at-a-glance table stays scannable; any warn/error step also gets an inline
# detail callout directly below it, so a non-fatal issue surfaces in context
# without a dedicated always-empty column.
step_df = pd.DataFrame([
{
"step": sr.step.display_name(),
"status": (
"skipped" if sr.skipped
else "error" if sr.error
else "ok"
"step": step_label(sr.step.tool),
"status": step_status(
sr.step.tool, sr.summary, skipped=sr.skipped, error=sr.error,
)[0],
"elapsed": f"{int(sr.elapsed_seconds * 1000)} ms",
"summary": (
"" if sr.skipped
else step_phrase(sr.step.tool, sr.summary)
),
"elapsed_ms": int(sr.elapsed_seconds * 1000),
"summary": json.dumps(sr.summary, default=str)[:200],
"error": sr.error or "",
}
for sr in result.step_results
])
st.dataframe(step_df, width="stretch", hide_index=True)
for sr in result.step_results:
_label, level, detail = step_status(
sr.step.tool, sr.summary, skipped=sr.skipped, error=sr.error,
)
if not detail:
continue
name = step_label(sr.step.tool)
if level == "error":
st.error(f"**{name}** — {detail}")
else:
st.warning(f"**{name}** — {detail}")
st.markdown("**Output preview (first 10 rows)**")
st.dataframe(result.final_df.head(10), width="stretch")
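
Review note: the builder comment says each step carries a stable integer id so widget keys survive reorder and remove. The sketch below shows why that works, using a plain dict in place of `st.session_state`. `seed_steps` and `widget_key` are illustrative stand-ins; the key format follows the `f"{step['tool']}_{sid}"` prefix used by the step card.

```python
def seed_steps(tools: list[str], state: dict) -> None:
    """Replace the step list, drawing fresh ids from a running counter."""
    seq = state.get("pipeline_step_seq", 0)
    steps = []
    for tool in tools:
        steps.append({"id": seq, "tool": tool, "enabled": True, "options": {}})
        seq += 1
    state["pipeline_steps"] = steps
    # The counter is never reset, so ids are not reused after a reseed.
    state["pipeline_step_seq"] = seq


def widget_key(step: dict, suffix: str) -> str:
    """Build a widget key from the step's tool and id, not its position."""
    return f"{step['tool']}_{step['id']}_{suffix}"


state: dict = {}
seed_steps(["text_clean", "missing", "dedup"], state)
keys_before = {s["id"]: widget_key(s, "enabled") for s in state["pipeline_steps"]}

# Reordering changes positions but not ids, so every key is unchanged.
state["pipeline_steps"].reverse()
keys_after = {s["id"]: widget_key(s, "enabled") for s in state["pipeline_steps"]}

# Reseeding continues the counter instead of restarting at 0.
seed_steps(["dedup"], state)
first_id_after_reseed = state["pipeline_steps"][0]["id"]
```

Had the keys been derived from the list index instead, moving a step would hand its toggle state to whichever step took its old position.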


@@ -24,7 +24,10 @@ Tier = Literal["core", "pro", "enterprise"]
Status = Literal["Ready", "Coming Soon"]
# Sidebar grouping. Tools are bucketed by what the user is trying to
# accomplish rather than by implementation detail.
Section = Literal["analysis", "cleaners", "transformations", "automations"]
Section = Literal[
"cleaners", "transformations", "automations",
"finance", "coming_soon",
]
@dataclass(frozen=True)
@@ -42,41 +45,21 @@ class Tool:
# Order in this list IS the order shown in each sidebar section, so
# arranging it carefully matters: within "cleaners" we lead with the
# operations a non-technical user is most likely to need (filling
# blanks, flagging outliers) before progressing to format cleanup,
# dedup, and the final quality report.
# arranging it carefully matters. Within "cleaners" the order is the
# recommended PIPELINE order (Clean Text → Standardize → Fix Missing →
# Find Duplicates) so a user running tools by hand follows the sequence
# the orchestrator would. "Coming Soon" tools are grouped at the end in
# their own section so they never interleave with working tools, and the
# finance-oriented tools (Reconcile, PDF to CSV) live in their own group
# (see DECISIONS.md 2026-06-08).
TOOLS: list[Tool] = [
Tool(
tool_id="04_missing_handler",
icon=":material/help_outline:",
name="Fix Missing Values",
description=(
"Detect disguised nulls, missingness analysis, and imputation strategies."
),
page_slug="4_Missing_Values",
status="Ready",
section="cleaners",
),
Tool(
tool_id="06_outlier_detector",
icon=":material/insights:",
name="Find Unusual Values",
description=(
"Z-score, IQR, and MAD detection with domain-rule violations and "
"winsorization."
),
page_slug="6_Outlier_Detector",
status="Coming Soon",
section="cleaners",
),
Tool(
tool_id="02_text_cleaner",
icon=":material/text_format:",
name="Clean Text",
description=(
"Whitespace trim, multi-space collapse, Unicode normalization, "
"BOM and line-ending handling."
"Trim extra spaces and strip out odd characters that copy-paste "
"leaves behind."
),
page_slug="2_Text_Cleaner",
status="Ready",
@@ -87,87 +70,118 @@ TOOLS: list[Tool] = [
icon=":material/format_list_bulleted:",
name="Standardize Formats",
description=(
"Standardize dates, currencies, names, phone numbers, and addresses."
"Make dates, phone numbers, currency, names, and addresses look "
"the same throughout."
),
page_slug="3_Format_Standardizer",
status="Ready",
section="cleaners",
),
Tool(
tool_id="04_missing_handler",
icon=":material/help_outline:",
name="Fix Missing Values",
description=(
"Find blank cells (even ones written as 'N/A' or '?') and fill "
"them in or remove them."
),
page_slug="4_Missing_Values",
status="Ready",
section="cleaners",
),
Tool(
tool_id="01_deduplicator",
icon=":material/search:",
name="Find Duplicates",
description=(
"Fuzzy matching, normalization, survivor selection, and "
"interactive review."
"Find rows that repeat — exact and similar — and remove the extras."
),
page_slug="1_Deduplicator",
status="Ready",
section="cleaners",
),
Tool(
tool_id="08_validator_reporter",
icon=":material/check_circle:",
name="Quality Check",
description=(
"Validate against rules and generate PDF/Excel quality reports."
),
page_slug="8_Validator_Reporter",
status="Coming Soon",
section="cleaners",
),
Tool(
tool_id="05_column_mapper",
icon=":material/view_column:",
name="Map Columns",
description="Rename columns, enforce a target schema, and coerce types.",
description=(
"Rename columns, change their order, and set each one as text, "
"number, or date."
),
page_slug="5_Column_Mapper",
status="Ready",
section="transformations",
),
Tool(
tool_id="07_multi_file_merger",
icon=":material/account_tree:",
name="Combine Files",
description="Combine multiple CSV/Excel files with schema alignment.",
page_slug="7_Multi_File_Merger",
status="Coming Soon",
section="transformations",
),
Tool(
tool_id="09_pipeline_runner",
icon=":material/auto_awesome:",
name="Automated Workflows",
description=(
"Chain tools in recommended order and pass output between steps."
"Run several tools in a row — save the steps once, reuse them "
"anytime."
),
page_slug="9_Pipeline_Runner",
status="Ready",
section="automations",
),
Tool(
tool_id="10_pdf_extractor",
icon=":material/picture_as_pdf:",
name="PDF to CSV",
description=(
"Extract bank-statement transactions from PDFs using reusable "
"per-source templates."
),
page_slug="10_PDF_Extractor",
status="Ready",
section="transformations",
),
Tool(
tool_id="11_reconciler",
icon=":material/compare_arrows:",
name="Reconcile Two Files",
description=(
"Match transactions between two sources (e.g. bank feed vs. "
"ledger) with amount and date tolerance."
"Compare two lists of transactions (e.g. bank vs. ledger) and "
"flag what doesn't match."
),
page_slug="11_Reconciler",
status="Ready",
section="analysis",
section="finance",
),
Tool(
tool_id="10_pdf_extractor",
icon=":material/picture_as_pdf:",
name="PDF to CSV",
description=(
"Pull transactions out of bank-statement PDFs into a clean CSV file."
),
page_slug="10_PDF_Extractor",
status="Ready",
section="finance",
),
Tool(
tool_id="06_outlier_detector",
icon=":material/insights:",
name="Find Unusual Values",
description=(
"Spot values that look wrong — way too high, way too low, or "
"breaking your rules."
),
page_slug="6_Outlier_Detector",
status="Coming Soon",
section="coming_soon",
),
Tool(
tool_id="08_validator_reporter",
icon=":material/check_circle:",
name="Quality Check",
description=(
"Check your file against rules you set, and export a PDF or "
"Excel report."
),
page_slug="8_Validator_Reporter",
status="Coming Soon",
section="coming_soon",
),
Tool(
tool_id="07_multi_file_merger",
icon=":material/account_tree:",
name="Combine Files",
description=(
"Combine several CSV or Excel files into one — even if their "
"columns don't match."
),
page_slug="7_Multi_File_Merger",
status="Coming Soon",
section="coming_soon",
),
]
@@ -175,10 +189,11 @@ TOOLS: list[Tool] = [
# Display labels for each sidebar section. Kept here so i18n falls back
# to a sensible English string if a translation pack is missing the key.
SECTION_LABELS: dict[Section, str] = {
"analysis": "Analysis",
"cleaners": "Data Cleaners",
"transformations": "Transformations",
"automations": "Automations",
"finance": "Finance",
"coming_soon": "Coming soon",
}
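
Review note: because the order of `TOOLS` is the order shown in each sidebar section, the grouping step only has to bucket tools by `section` without sorting. A minimal sketch, assuming a trimmed-down `Tool` with just the fields needed here; `group_by_section` is a hypothetical name and the real sidebar code is not part of this diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """Trimmed stand-in for the registry's Tool dataclass."""
    tool_id: str
    section: str
    status: str = "Ready"


SECTION_LABELS = {
    "cleaners": "Data Cleaners",
    "transformations": "Transformations",
    "automations": "Automations",
    "finance": "Finance",
    "coming_soon": "Coming soon",
}

TOOLS = [
    Tool("02_text_cleaner", "cleaners"),
    Tool("03_format_standardizer", "cleaners"),
    Tool("04_missing_handler", "cleaners"),
    Tool("01_deduplicator", "cleaners"),
    Tool("05_column_mapper", "transformations"),
    Tool("09_pipeline_runner", "automations"),
    Tool("11_reconciler", "finance"),
    Tool("10_pdf_extractor", "finance"),
    Tool("06_outlier_detector", "coming_soon", "Coming Soon"),
    Tool("08_validator_reporter", "coming_soon", "Coming Soon"),
    Tool("07_multi_file_merger", "coming_soon", "Coming Soon"),
]


def group_by_section(tools: list[Tool]) -> dict[str, list[Tool]]:
    """Bucket tools by section, keeping list order within each bucket."""
    groups: dict[str, list[Tool]] = {section: [] for section in SECTION_LABELS}
    for tool in tools:
        groups[tool.section].append(tool)
    return groups


grouped = group_by_section(TOOLS)
cleaner_ids = [t.tool_id for t in grouped["cleaners"]]
section_order = list(grouped)
```

Seeding the buckets from `SECTION_LABELS` also fixes the section order, so "Coming soon" stays last however the tool list is arranged.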


@@ -15,6 +15,10 @@
"ready": "Ready",
"coming_soon": "Coming Soon"
},
"help": {
"button_label": "Help",
"missing_body": "No help written yet for this tool."
},
"upload": {
"heading": "Import one or more files to start",
"intro": "Optional: scan an imported file for data quality issues and see which tools can fix each one. Skip if you already know what you need.",
@@ -107,63 +111,80 @@
"tools": {
"01_deduplicator": {
"name": "Find Duplicates",
"description": "Fuzzy matching, normalization, survivor selection, and interactive review.",
"description": "Find rows that repeat — exact and similar — and remove the extras.",
"page_title": "Find Duplicates",
"page_caption": "Find and remove duplicate rows in CSV, delimited text, and Excel files."
"page_caption": "Find rows that repeat, then keep one and remove the extras.",
"help_md": "**When to use**\n- Customer or contact lists\n- Mailing lists from multiple sources\n- Product catalogs that may overlap\n\n**Steps**\n1. Upload your file\n2. Pick the column(s) that identify a row (email, phone, name+zip)\n3. Choose **Exact** or **Similar** matching\n4. Pick which row to keep (newest, longest, first)\n5. Preview, then export\n\n**Examples**\n- `John Smith` + `JOHN SMITH` → same person\n- `jane@co.com` + `jane@co.com ` → same email (trailing space)\n- `555-1234` + `(555) 1234` → same phone\n\n**Tip** Start with Exact; add Similar if you suspect typos."
},
"02_text_cleaner": {
"name": "Clean Text",
"description": "Whitespace trim, multi-space collapse, Unicode normalization, BOM and line-ending handling.",
"description": "Trim extra spaces and strip out odd characters that copy-paste leaves behind.",
"page_title": "Clean Text",
"page_caption": "Trim whitespace, fold smart quotes, strip invisible characters, and normalize line endings. Runs locally — your data never leaves this computer."
"page_caption": "Trim extra spaces and strip out odd characters.",
"help_md": "**When to use**\n- Text copied from web pages, PDFs, or older systems\n- Files with inconsistent spacing\n- Data with hidden or special characters\n\n**Steps**\n1. Upload your file\n2. Pick the text columns to clean\n3. Choose options: trim spaces, remove invisible characters, normalize quotes\n4. Preview the changes\n5. Export\n\n**Examples**\n- ` hello world ` → `hello world`\n- `“smart quotes”` → `\"normal quotes\"`\n- `datawithhidden` → `datawithhidden`\n\n**Tip** Always preview — text changes can affect later steps like duplicate detection."
},
"03_format_standardizer": {
"name": "Standardize Formats",
"description": "Standardize dates, currencies, names, phone numbers, and addresses.",
"description": "Make dates, phone numbers, currency, names, and addresses look the same throughout.",
"page_title": "Standardize Formats",
"page_caption": "Canonicalize dates, phone numbers, currency, names, addresses, and booleans on a per-column basis. Runs locally — your data never leaves this computer."
"page_caption": "Make dates, phones, currency, and names look the same throughout.",
"help_md": "**When to use**\n- Data from multiple sources that wrote dates/phones differently\n- Before sending to a system that wants one format\n- Preparing data for analysis or charts\n\n**Steps**\n1. Upload your file\n2. Pick a column (date, phone, currency, etc.)\n3. Choose the target format\n4. Preview\n5. Repeat for other columns, then export\n\n**Examples**\n- `Jan 5, 2025` / `01/05/2025` / `5-Jan-25` → `2025-01-05`\n- `(555) 123-4567` / `555.123.4567` → `+1 555-123-4567`\n- `$1,234.50` / `1234.5 USD` → `1234.50`\n\n**Tip** Run on several columns in one session — each column remembers its chosen format."
},
"04_missing_handler": {
"name": "Fix Missing Values",
"description": "Detect disguised nulls, missingness analysis, and imputation strategies.",
"description": "Find blank cells (even ones written as 'N/A' or '?') and fill them in or remove them.",
"page_title": "Fix Missing Values",
"page_caption": "Detect disguised nulls, profile missingness, and apply imputation or drop strategies. Runs locally — your data never leaves this computer."
"page_caption": "Find blank cells (even hidden ones) and fill them in or remove them.",
"help_md": "**When to use**\n- Spreadsheets with gaps\n- Files where someone typed `N/A` or `-` instead of leaving a cell blank\n- Before importing into a system that rejects blanks\n\n**Steps**\n1. Upload your file\n2. Review which columns have blanks\n3. Pick a strategy per column: **fill**, **drop the row**, or **leave alone**\n4. For numbers, pick a fill value: average, median, zero, or your own\n5. Preview, then export\n\n**Examples**\n- `N/A`, `-`, ` ` → treated as blank\n- Blank salary → filled with the column average\n- Row with no email → dropped\n\n**Tip** Don't fill the row's identifier (email, ID) — drop the row instead."
},
"05_column_mapper": {
"name": "Map Columns",
"description": "Rename columns, enforce a target schema, and coerce types.",
"description": "Rename columns, change their order, and set each one as text, number, or date.",
"page_title": "Map Columns",
"page_caption": "Rename columns, enforce a target schema, and coerce types. Runs locally — your data never leaves this computer."
"page_caption": "Rename columns, change their order, and set each one as text, number, or date.",
"help_md": "**When to use**\n- Combining files from vendors with different column names\n- Forcing the layout your system expects\n- Cleaning up exports with extra or weirdly-named columns\n\n**Steps**\n1. Upload your file\n2. Match each incoming column to your standard name\n3. Set each column's type: text, number, or date\n4. Reorder or drop columns\n5. Export with the new layout\n\n**Examples**\n- `cust_email` → `Customer Email`\n- `amt` → `Amount` (set as number)\n- `notes_internal` → drop\n\n**Tip** Save the mapping if you'll get the same file format again next month."
},
"06_outlier_detector": {
"name": "Find Unusual Values",
"description": "Z-score, IQR, and MAD detection with domain-rule violations and winsorization.",
"description": "Spot values that look wrong — way too high, way too low, or breaking your rules.",
"page_title": "Find Unusual Values",
"page_caption": "Detect and handle outliers in numeric columns."
"page_caption": "Spot values that look wrong — way too high, too low, or breaking your rules.",
"help_md": "**When to use**\n- Spotting typos, fraud, or bad imports in numeric data\n- Cleaning sensor or transaction data\n- Before reporting numbers to leadership\n\n**Steps**\n1. Upload your file\n2. Pick the numeric column to check\n3. Set a normal range (or use auto-detect)\n4. Review the flagged rows\n5. Choose: keep, remove, or cap to the limit\n\n**Examples**\n- Salary column with one row at `$9,999,999` → flagged\n- Age column with `250` → flagged\n- Rule: `price must be > 0` → flags negatives\n\n**Tip** Review flagged rows by hand — a real outlier is sometimes the most important data point."
},
"07_multi_file_merger": {
"name": "Combine Files",
"description": "Combine multiple CSV/Excel files with schema alignment.",
"description": "Combine several CSV or Excel files into one — even if their columns don't match.",
"page_title": "Combine Files",
"page_caption": "Combine multiple CSV and Excel files into one dataset."
"page_caption": "Combine several CSV or Excel files into one — even if columns differ.",
"help_md": "**When to use**\n- Monthly reports across the year\n- Exports from different stores or branches\n- Multi-system data that needs to be in one file\n\n**Steps**\n1. Upload two or more files\n2. Confirm column matches (auto-detected; override if needed)\n3. Pick how to handle missing columns (skip, blank, default value)\n4. Preview the combined result\n5. Export the single file\n\n**Examples**\n- `January.csv` + `February.csv` → `2025.csv`\n- `NY-store.xlsx` + `LA-store.xlsx` → `all-stores.csv`\n- File A has `Email`, file B has `email_addr` → matched automatically\n\n**Tip** Add a `source` column so you can tell which file each row came from."
},
"08_validator_reporter": {
"name": "Quality Check",
"description": "Validate against rules and generate PDF/Excel quality reports.",
"description": "Check your file against rules you set, and export a PDF or Excel report.",
"page_title": "Quality Check",
"page_caption": "Validate data against rules and generate quality reports."
"page_caption": "Check your file against rules and export a PDF or Excel report.",
"help_md": "**When to use**\n- Before handing data off to a client or partner\n- Before a strict system import\n- Routine quality audits\n\n**Steps**\n1. Upload your file\n2. Pick the rules to check (required columns, valid emails, no duplicates)\n3. Run the check\n4. Review the score and findings\n5. Export the report as PDF or Excel\n\n**Examples**\n- Rule: `email column must look like an email` → 12 rows fail\n- Rule: `amount must be > 0` → 3 rows fail\n- Rule: `no duplicate customer IDs` → 5 duplicates found\n\n**Tip** Run this last in your cleanup, then keep the PDF as proof of quality."
},
"09_pipeline_runner": {
"name": "Automated Workflows",
"description": "Chain tools in recommended order and pass output between steps.",
"description": "Run several tools in a row — save the steps once, reuse them anytime.",
"page_title": "Automated Workflows",
"page_caption": "Chain DataTools cleaning steps into one repeatable workflow. The pipeline recommends an order; you stay in control."
"page_caption": "Run several tools in a row — save the steps and reuse them anytime.",
"help_md": "**When to use**\n- A cleanup you do every week or month\n- Multi-step processes you keep repeating\n- Onboarding teammates to your data routine\n\n**Steps**\n1. Pick the tools you want to run, in order\n2. Configure each step\n3. Save the workflow as a JSON file\n4. Next time, load the workflow and upload a fresh file\n5. Get the cleaned output in one click\n\n**Examples**\n- `Clean Text` → `Standardize Formats` → `Find Duplicates` → export\n- Saved as `weekly-customer-cleanup.json`\n- Shared with a teammate so they get the same result\n\n**Tip** Start with two or three tools. You can always edit and add more later."
},
"10_pdf_extractor": {
"name": "PDF to CSV",
"description": "Extract bank-statement transactions from PDFs using reusable per-source templates.",
"description": "Pull transactions out of bank-statement PDFs into a clean CSV file.",
"page_title": "PDF to CSV",
"page_caption": "Extract transaction tables from bank-statement PDFs. Build one template per source and reuse it for every statement that follows the same layout. Runs locally — your data never leaves this computer."
"page_caption": "Pull transactions out of bank-statement PDFs into a clean CSV file.",
"help_md": "**When to use**\n- Bank or credit-card statements\n- Vendor invoices with line-item tables\n- Any PDF with a transaction table\n\n**Steps**\n1. Upload one or more PDFs (batch is fine)\n2. Click **Scan** — rows that look like transactions are pulled out automatically\n3. Uncheck any rows you don't want\n4. Pick your date format (and turn on OCR if the PDF is a scanned image)\n5. Download the CSV\n\n**Examples**\n- Chase March statement → 87 transactions found\n- Drop in 12 months at once and get one combined CSV\n- Image-only scan + OCR → still works if Tesseract is installed\n\n**Tip** If a withdrawal shows as `(4.50)`, leave **Treat (4.50) as negative** on. Turn it off only if your statements use a different convention."
},
"11_reconciler": {
"name": "Reconcile Two Files",
"description": "Compare two lists of transactions (e.g. bank vs. ledger) and flag what doesn't match.",
"page_title": "Reconcile Two Files",
"page_caption": "Compare two lists of transactions (e.g. bank vs. ledger) and flag what doesn't match.",
"help_md": "**When to use**\n- Matching your bank statement to your books\n- Vendor invoices vs. payments sent\n- Inventory receipts vs. orders placed\n\n**Steps**\n1. Upload both files (e.g. bank export + accounting export)\n2. Pick the columns to match on (date, amount, reference)\n3. Set tolerances (e.g. date ±2 days, amount exact)\n4. Review four buckets: **matched**, **only in left**, **only in right**, **needs review**\n5. Export the results\n\n**Examples**\n- Bank `2025-03-15 $99.50` ↔ Books `2025-03-16 $99.50` → matched\n- Bank charge with no books entry → only in left\n- Same amount on the same day twice → flagged for review\n\n**Tip** Tighten the date tolerance once you trust the match — fewer ambiguous cases to review."
}
},
"nav": {
@@ -172,9 +193,12 @@
"section_cleaners": "Data Cleaners",
"section_transformations": "Transformations",
"section_automations": "Automations",
"section_finance": "Finance",
"section_coming_soon": "Coming soon",
"review_page_title": "Review",
"home_page_title": "Home",
"file_analysis_title": "File Analysis",
"start_here_title": "Start here",
"section_account": "Account",
"activate_title": "Activate",
"close_title": "Close",

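The "Reconcile Two Files" help text above describes matching on amount and date with a tolerance, then sorting rows into four buckets. A minimal sketch of that idea follows. The names `Txn` and `reconcile` are hypothetical and do not come from the DataTools code base, and the real tool may resolve ambiguous matches differently.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from decimal import Decimal


@dataclass(frozen=True)
class Txn:
    """Hypothetical transaction record used only for this illustration."""
    day: date
    amount: Decimal


def reconcile(left, right, date_tolerance_days=2):
    """Return (matched, only_left, only_right, needs_review).

    Amounts must be equal; dates may differ by up to the tolerance.
    A left row with more than one candidate is sent to needs_review,
    and its candidates stay in only_right (a simplification).
    """
    window = timedelta(days=date_tolerance_days)
    matched, only_left, needs_review = [], [], []
    unused = list(right)
    for txn in left:
        candidates = [
            r for r in unused
            if r.amount == txn.amount and abs(r.day - txn.day) <= window
        ]
        if len(candidates) == 1:
            matched.append((txn, candidates[0]))
            unused.remove(candidates[0])
        elif candidates:
            needs_review.append(txn)
        else:
            only_left.append(txn)
    return matched, only_left, unused, needs_review


bank = [Txn(date(2025, 3, 15), Decimal("99.50")), Txn(date(2025, 3, 20), Decimal("12.00"))]
books = [Txn(date(2025, 3, 16), Decimal("99.50"))]
matched, only_left, only_right, review = reconcile(bank, books)
```

With the default two-day window, the `99.50` rows pair up even though their dates differ by one day, and the `12.00` bank row lands in `only_left`. Tightening `date_tolerance_days` reduces the number of ambiguous candidates, which is the point the help text's tip makes.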

@@ -15,6 +15,10 @@
"ready": "Listo",
"coming_soon": "Próximamente"
},
"help": {
"button_label": "Ayuda",
"missing_body": "Aún no hay ayuda para esta herramienta."
},
"upload": {
"heading": "Importa uno o más archivos para empezar",
"intro": "Opcional: analiza un archivo para detectar problemas de calidad de datos y ver qué herramientas pueden corregir cada uno. Sáltalo si ya sabes lo que necesitas.",
@@ -107,63 +111,80 @@
"tools": {
"01_deduplicator": {
"name": "Buscar duplicados",
"description": "Coincidencia difusa, normalización, selección de superviviente y revisión interactiva.",
"description": "Encuentra filas que se repiten — exactas y similares — y elimina las extras.",
"page_title": "Buscar duplicados",
"page_caption": "Encuentra y elimina filas duplicadas en archivos CSV, texto delimitado y Excel."
"page_caption": "Encuentra filas que se repiten, conserva una y elimina las extras.",
"help_md": "**Cuándo usarlo**\n- Listas de clientes o contactos\n- Listas de correo de varias fuentes\n- Catálogos de productos que pueden solaparse\n\n**Pasos**\n1. Sube tu archivo\n2. Elige la(s) columna(s) que identifican una fila (email, teléfono, nombre+CP)\n3. Elige coincidencia **Exacta** o **Similar**\n4. Elige qué fila conservar (más reciente, más larga, primera)\n5. Previsualiza y exporta\n\n**Ejemplos**\n- `John Smith` + `JOHN SMITH` → misma persona\n- `jane@co.com` + `jane@co.com ` → mismo email (espacio al final)\n- `555-1234` + `(555) 1234` → mismo teléfono\n\n**Consejo** Empieza con Exacta; añade Similar si sospechas erratas."
},
"02_text_cleaner": {
"name": "Limpiar texto",
"description": "Recorte de espacios, colapso de espacios múltiples, normalización Unicode, manejo de BOM y de finales de línea.",
"description": "Quita espacios extra y caracteres raros que deja el copiar y pegar.",
"page_title": "Limpiar texto",
"page_caption": "Recorta espacios, normaliza comillas tipográficas, elimina caracteres invisibles y unifica saltos de línea. Se ejecuta localmente — tus datos nunca salen de este equipo."
"page_caption": "Quita espacios extra y caracteres raros.",
"help_md": "**Cuándo usarlo**\n- Texto copiado de páginas web, PDFs o sistemas antiguos\n- Archivos con espaciado inconsistente\n- Datos con caracteres ocultos o especiales\n\n**Pasos**\n1. Sube tu archivo\n2. Elige las columnas de texto a limpiar\n3. Elige opciones: recortar espacios, eliminar caracteres invisibles, normalizar comillas\n4. Previsualiza los cambios\n5. Exporta\n\n**Ejemplos**\n- ` hola mundo ` → `hola mundo`\n- `“comillas tipográficas”` → `\"comillas normales\"`\n- `datoconoculto` → `datoconoculto`\n\n**Consejo** Siempre previsualiza — los cambios pueden afectar pasos posteriores como duplicados."
},
"03_format_standardizer": {
"name": "Estandarizar formatos",
"description": "Estandariza fechas, monedas, nombres, números de teléfono y direcciones.",
"description": "Haz que fechas, teléfonos, monedas, nombres y direcciones se vean iguales en todo el archivo.",
"page_title": "Estandarizar formatos",
"page_caption": "Canoniza fechas, números de teléfono, monedas, nombres, direcciones y booleanos columna por columna. Se ejecuta localmente — tus datos nunca salen de este equipo."
"page_caption": "Haz que fechas, teléfonos, monedas y nombres se vean iguales en todo el archivo.",
"help_md": "**Cuándo usarlo**\n- Datos de varias fuentes con fechas/teléfonos en formatos distintos\n- Antes de enviar a un sistema que exige un formato\n- Preparando datos para análisis o gráficos\n\n**Pasos**\n1. Sube tu archivo\n2. Elige una columna (fecha, teléfono, moneda, etc.)\n3. Elige el formato destino\n4. Previsualiza\n5. Repite con otras columnas y exporta\n\n**Ejemplos**\n- `5 Ene 2025` / `01/05/2025` / `5-Ene-25` → `2025-01-05`\n- `(555) 123-4567` / `555.123.4567` → `+1 555-123-4567`\n- `$1.234,50` / `1234.5 USD` → `1234.50`\n\n**Consejo** Trabaja varias columnas en una sesión — cada una recuerda su formato."
},
"04_missing_handler": {
"name": "Corregir valores faltantes",
"description": "Detecta nulos disfrazados, analiza la ausencia de datos y aplica estrategias de imputación.",
"description": "Encuentra celdas vacías (incluso escritas como «N/A» o «?») y rellénalas o elimínalas.",
"page_title": "Corregir valores faltantes",
"page_caption": "Detecta nulos disfrazados, perfila la ausencia de datos y aplica imputación o estrategias de descarte. Se ejecuta localmente — tus datos nunca salen de este equipo."
"page_caption": "Encuentra celdas vacías (incluso ocultas) y rellénalas o elimínalas.",
"help_md": "**Cuándo usarlo**\n- Hojas con huecos\n- Archivos donde alguien escribió `N/A` o `-` en vez de dejar la celda vacía\n- Antes de importar a un sistema que rechaza celdas vacías\n\n**Pasos**\n1. Sube tu archivo\n2. Revisa qué columnas tienen celdas vacías\n3. Elige una estrategia por columna: **rellenar**, **eliminar la fila** o **dejar igual**\n4. Para números, elige el valor: media, mediana, cero o uno propio\n5. Previsualiza y exporta\n\n**Ejemplos**\n- `N/A`, `-`, ` ` → tratados como vacíos\n- Salario vacío → relleno con la media de la columna\n- Fila sin email → eliminada\n\n**Consejo** No rellenes el identificador (email, ID) — mejor elimina la fila."
},
"05_column_mapper": {
"name": "Mapear columnas",
"description": "Renombra columnas, aplica un esquema objetivo y fuerza tipos de datos.",
"description": "Renombra columnas, cambia su orden y define cada una como texto, número o fecha.",
"page_title": "Mapear columnas",
"page_caption": "Renombra columnas, aplica un esquema objetivo y fuerza tipos. Se ejecuta localmente — tus datos nunca salen de este equipo."
"page_caption": "Renombra columnas, cambia su orden y define cada una como texto, número o fecha.",
"help_md": "**Cuándo usarlo**\n- Combinando archivos de proveedores con nombres de columna distintos\n- Forzando el esquema que tu sistema espera\n- Limpiando exportes con columnas raras o de más\n\n**Pasos**\n1. Sube tu archivo\n2. Empareja cada columna entrante con tu nombre estándar\n3. Define el tipo de cada columna: texto, número o fecha\n4. Reordena o elimina columnas\n5. Exporta con la nueva disposición\n\n**Ejemplos**\n- `cust_email` → `Email Cliente`\n- `amt` → `Importe` (definido como número)\n- `notes_internal` → eliminar\n\n**Consejo** Guarda el mapeo si recibirás el mismo formato el próximo mes."
},
"06_outlier_detector": {
"name": "Detectar valores atípicos",
"description": "Detección por Z-score, IQR y MAD con reglas de dominio y winsorización.",
"description": "Detecta valores que parecen incorrectos — demasiado altos, demasiado bajos o fuera de regla.",
"page_title": "Detectar valores atípicos",
"page_caption": "Detecta y trata valores atípicos en columnas numéricas."
"page_caption": "Detecta valores que parecen incorrectos — demasiado altos, bajos o fuera de regla.",
"help_md": "**Cuándo usarlo**\n- Detectar erratas, fraude o imports mal hechos en datos numéricos\n- Limpiar datos de sensores o transacciones\n- Antes de reportar números a dirección\n\n**Pasos**\n1. Sube tu archivo\n2. Elige la columna numérica a revisar\n3. Define un rango normal (o usa auto-detección)\n4. Revisa las filas marcadas\n5. Elige: conservar, eliminar o limitar al borde\n\n**Ejemplos**\n- Columna de salarios con una fila de `$9.999.999` → marcada\n- Columna de edad con `250` → marcada\n- Regla: `precio debe ser > 0` → marca los negativos\n\n**Consejo** Revisa a mano las filas marcadas — a veces un atípico real es el dato más importante."
},
"07_multi_file_merger": {
"name": "Combinar archivos",
"description": "Combina varios archivos CSV/Excel alineando sus esquemas.",
"description": "Combina varios archivos CSV o Excel en uno — aunque sus columnas no coincidan.",
"page_title": "Combinar archivos",
"page_caption": "Combina varios archivos CSV y Excel en un único conjunto de datos."
"page_caption": "Combina varios CSV o Excel en uno — aunque las columnas no coincidan.",
"help_md": "**Cuándo usarlo**\n- Informes mensuales a lo largo del año\n- Exportes de varias tiendas o sucursales\n- Datos de varios sistemas que deben quedar en un archivo\n\n**Pasos**\n1. Sube dos o más archivos\n2. Confirma las coincidencias de columna (auto-detectadas; modifica si hace falta)\n3. Decide cómo tratar columnas que falten (omitir, vacío, valor por defecto)\n4. Previsualiza el resultado combinado\n5. Exporta el archivo único\n\n**Ejemplos**\n- `Enero.csv` + `Febrero.csv` → `2025.csv`\n- `NY-store.xlsx` + `LA-store.xlsx` → `todas-las-tiendas.csv`\n- Archivo A tiene `Email`, archivo B tiene `email_addr` → emparejados automáticamente\n\n**Consejo** Añade una columna `origen` para saber de qué archivo viene cada fila."
},
"08_validator_reporter": {
"name": "Verificación de calidad",
"description": "Valida contra reglas y genera informes de calidad en PDF/Excel.",
"description": "Comprueba tu archivo según reglas que tú definas y exporta un informe PDF o Excel.",
"page_title": "Verificación de calidad",
"page_caption": "Valida datos contra reglas y genera informes de calidad."
"page_caption": "Comprueba tu archivo según reglas y exporta un informe PDF o Excel.",
"help_md": "**Cuándo usarlo**\n- Antes de entregar datos a un cliente o socio\n- Antes de un import estricto a otro sistema\n- Auditorías rutinarias de calidad\n\n**Pasos**\n1. Sube tu archivo\n2. Elige las reglas (columnas requeridas, emails válidos, sin duplicados)\n3. Ejecuta la comprobación\n4. Revisa la puntuación y los hallazgos\n5. Exporta el informe como PDF o Excel\n\n**Ejemplos**\n- Regla: `email debe parecerse a un email` → 12 filas fallan\n- Regla: `importe debe ser > 0` → 3 filas fallan\n- Regla: `sin ID de cliente duplicados` → 5 duplicados encontrados\n\n**Consejo** Ejecuta esto al final, y guarda el PDF como prueba de calidad."
},
"09_pipeline_runner": {
"name": "Flujos automatizados",
"description": "Encadena herramientas en el orden recomendado y pasa la salida entre pasos.",
"description": "Ejecuta varias herramientas seguidas — guarda los pasos una vez y reutilízalos.",
"page_title": "Flujos automatizados",
"page_caption": "Encadena pasos de limpieza de DataTools en un flujo repetible. La canalización recomienda un orden; tú mantienes el control."
"page_caption": "Ejecuta varias herramientas seguidas — guarda los pasos y reutilízalos.",
"help_md": "**Cuándo usarlo**\n- Una limpieza que haces cada semana o mes\n- Procesos de varios pasos que repites\n- Cuando entrenas a un compañero en tu rutina de datos\n\n**Pasos**\n1. Elige las herramientas a ejecutar, en orden\n2. Configura cada paso\n3. Guarda el flujo como archivo JSON\n4. La próxima vez, carga el flujo y sube un archivo nuevo\n5. Obtén la salida limpia con un clic\n\n**Ejemplos**\n- `Limpiar texto` → `Estandarizar formatos` → `Buscar duplicados` → exportar\n- Guardado como `limpieza-clientes-semanal.json`\n- Compartido con un compañero para obtener el mismo resultado\n\n**Consejo** Empieza con dos o tres herramientas. Siempre puedes editar y añadir más."
},
"10_pdf_extractor": {
"name": "PDF a CSV",
"description": "Extrae transacciones de extractos bancarios en PDF usando plantillas reutilizables por origen.",
"description": "Extrae transacciones de extractos bancarios en PDF a un archivo CSV limpio.",
"page_title": "PDF a CSV",
"page_caption": "Extrae tablas de transacciones de extractos bancarios en PDF. Crea una plantilla por origen y reutilízala para cada extracto que siga el mismo formato. Se ejecuta localmente — tus datos no salen de este equipo."
"page_caption": "Extrae transacciones de extractos bancarios en PDF a un archivo CSV limpio.",
"help_md": "<!-- TODO: review Spanish -->\n**Cuándo usarlo**\n- Extractos bancarios o de tarjeta\n- Facturas de proveedor con tablas\n- Cualquier PDF con tabla de transacciones\n\n**Pasos**\n1. Sube uno o más PDFs (modo lote permitido)\n2. Pulsa **Escanear** — las filas que parecen transacciones se extraen automáticamente\n3. Desmarca las filas que no quieras\n4. Elige el formato de fecha (y activa OCR si el PDF es una imagen escaneada)\n5. Descarga el CSV\n\n**Ejemplos**\n- Extracto Chase de marzo → 87 transacciones detectadas\n- Procesa 12 meses de una vez y obtén un CSV combinado\n- PDF solo-imagen + OCR → funciona si Tesseract está instalado\n\n**Consejo** Si un cargo aparece como `(4,50)`, deja activado **Tratar (4,50) como negativo**. Desactívalo solo si tus extractos usan otra convención."
},
"11_reconciler": {
"name": "Reconciliar dos archivos",
"description": "Compara dos listas de transacciones (p. ej. banco vs. libro mayor) y señala lo que no coincide.",
"page_title": "Reconciliar dos archivos",
"page_caption": "Compara dos listas de transacciones (p. ej. banco vs. libro mayor) y señala lo que no coincide.",
"help_md": "**Cuándo usarlo**\n- Cuadrar el extracto del banco con tus libros\n- Facturas de proveedor vs. pagos enviados\n- Recepciones de inventario vs. pedidos realizados\n\n**Pasos**\n1. Sube ambos archivos (p. ej. exporte del banco + exporte contable)\n2. Elige las columnas para emparejar (fecha, importe, referencia)\n3. Define tolerancias (p. ej. fecha ±2 días, importe exacto)\n4. Revisa cuatro grupos: **emparejados**, **solo en izquierda**, **solo en derecha**, **necesita revisión**\n5. Exporta los resultados\n\n**Ejemplos**\n- Banco `2025-03-15 $99.50` ↔ Libros `2025-03-16 $99.50` → emparejado\n- Cargo bancario sin entrada en libros → solo en izquierda\n- Mismo importe el mismo día dos veces → marcado para revisión\n\n**Consejo** Estrecha la tolerancia de fecha cuando confíes en el emparejamiento — menos casos ambiguos."
}
},
"nav": {
@@ -172,9 +193,12 @@
"section_cleaners": "Limpiadores de datos",
"section_transformations": "Transformaciones",
"section_automations": "Automatizaciones",
"section_finance": "Finanzas",
"section_coming_soon": "Próximamente",
"review_page_title": "Revisión",
"home_page_title": "Inicio",
"file_analysis_title": "Análisis de archivo",
"start_here_title": "Empezar aquí",
"section_account": "Cuenta",
"activate_title": "Activar",
"close_title": "Cerrar",


@@ -24,6 +24,7 @@ import io
import os
import platform
import re
import sys
from dataclasses import dataclass, field
from datetime import datetime
from pathlib import Path
@@ -286,10 +287,96 @@ def page_has_extractable_text(page: Page, min_words: int = 5) -> bool:
return len(page.words) >= min_words
# ---------------------------------------------------------------------------
# Tesseract discovery
#
# Discovery order (shared with the PyInstaller build agent):
#
# 1. ``DATATOOLS_TESSERACT_PATH`` env var override (user escape hatch)
# 2. Bundled binary inside the PyInstaller frozen bundle
# (``sys._MEIPASS / "tesseract" / "tesseract[.exe]"``) — only
# present when running from a frozen DataTools installer/portable
# build. No-op in a dev checkout.
# 3. System PATH lookup (``pytesseract.get_tesseract_version()``)
# 4. Windows well-known install dirs (legacy fallback for users who
# installed UB Mannheim's Tesseract-OCR themselves)
#
# When a bundled tessdata directory exists, ``TESSDATA_PREFIX`` is set
# so Tesseract picks up the bundled ``eng.traineddata``. User-supplied
# ``TESSDATA_PREFIX`` is never clobbered.
# ---------------------------------------------------------------------------
def _bundled_tesseract_path() -> Path | None:
    """Return the path to the bundled Tesseract binary, or ``None``.

    Only returns a non-None value when running from a PyInstaller
    frozen bundle (``sys.frozen`` is truthy AND ``sys._MEIPASS`` is
    set). The bundled binary lives at
    ``<_MEIPASS>/tesseract/tesseract`` (``.exe`` on Windows) per the
    contract shared with the build agent.

    The file is NOT required to exist for this helper to return a
    path — callers ``stat`` / ``.exists()``-check it themselves so a
    missing bundled binary is treated the same as "not bundled" and
    discovery falls through to PATH lookup.
    """
    if not getattr(sys, "frozen", False):
        return None
    meipass = getattr(sys, "_MEIPASS", None)
    if not meipass:
        return None
    binary = "tesseract.exe" if platform.system() == "Windows" else "tesseract"
    return Path(meipass) / "tesseract" / binary
def _bundled_tessdata_dir() -> Path | None:
    """Return the bundled ``tessdata`` directory or ``None``.

    Same frozen-state gating as ``_bundled_tesseract_path``; the dir
    lives at ``<_MEIPASS>/tesseract/tessdata``. Callers use this to
    point Tesseract at the bundled language data via the
    ``TESSDATA_PREFIX`` env var.
    """
    if not getattr(sys, "frozen", False):
        return None
    meipass = getattr(sys, "_MEIPASS", None)
    if not meipass:
        return None
    return Path(meipass) / "tesseract" / "tessdata"
def _apply_bundled_tessdata_prefix() -> None:
    """Point Tesseract at the bundled ``tessdata`` directory.

    Sets ``TESSDATA_PREFIX`` to the bundled path so the frozen
    Tesseract binary picks up the bundled ``eng.traineddata``. A
    user-supplied ``TESSDATA_PREFIX`` is preserved untouched — power
    users who explicitly chose their own language data win.

    No-op outside a frozen bundle, or if the bundled dir doesn't
    exist (e.g. tessdata wasn't packaged for the current platform).
    """
    if os.environ.get("TESSDATA_PREFIX"):
        return
    tessdata = _bundled_tessdata_dir()
    if tessdata is not None and tessdata.exists():
        os.environ["TESSDATA_PREFIX"] = str(tessdata)
def _autodetect_tesseract_path() -> str | None:
"""Probe well-known install locations for ``tesseract.exe`` on
Windows. No-op on macOS/Linux where Tesseract is on PATH via
the system package manager."""
"""Locate a Tesseract binary outside the user's ``PATH``.
Tries the bundled binary first (only present in PyInstaller
frozen builds) so installer/portable users get a working OCR
without touching their system. Falls back to the legacy Windows
well-known install locations so users who installed UB
Mannheim's Tesseract-OCR themselves keep working too.
"""
bundled = _bundled_tesseract_path()
if bundled is not None and bundled.exists():
return str(bundled)
if platform.system() != "Windows":
return None
candidates = [
@@ -309,17 +396,30 @@ def ocr_available() -> tuple[bool, str]:
"""Return ``(available, reason)`` — is OCR usable right now?
Discovery order: ``DATATOOLS_TESSERACT_PATH`` env var override,
then PATH-based lookup, then well-known Windows install
locations.
then the bundled binary (only present in a frozen build), then
PATH-based lookup, then well-known Windows install locations.
See the module-level discovery block for the full contract.
"""
try:
import pytesseract # noqa: PLC0415
except ImportError:
return False, "pytesseract is not installed."
# Point Tesseract at the bundled tessdata (if any) BEFORE the
# first ``get_tesseract_version`` call so the bundled language
# data is loaded even when the user happens to also have a
# system Tesseract that we'd otherwise fall through to.
_apply_bundled_tessdata_prefix()
override = os.environ.get("DATATOOLS_TESSERACT_PATH")
if override:
pytesseract.pytesseract.tesseract_cmd = override
else:
# Probe the bundled binary BEFORE PATH so frozen builds use
# their own Tesseract instead of any incidental system one.
bundled = _bundled_tesseract_path()
if bundled is not None and bundled.exists():
pytesseract.pytesseract.tesseract_cmd = str(bundled)
try:
pytesseract.get_tesseract_version()
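
The hunk above resolves the Tesseract binary in priority order: the `DATATOOLS_TESSERACT_PATH` override first, then the bundled binary, then whatever is on `PATH`. The sketch below restates only that ordering as a pure function so it can be checked without a frozen build. `pick_tesseract` and its `exists` parameter are hypothetical names for illustration, not part of the module, and the Windows well-known-directory fallback is deliberately left out.

```python
from pathlib import Path


def pick_tesseract(env, bundled, exists=Path.exists):
    """Return (source, path) for the first candidate in priority order.

    env      mapping that stands in for os.environ
    bundled  Path to the bundled binary, or None outside a frozen build
    exists   existence check, injectable so tests need no real files
    """
    override = env.get("DATATOOLS_TESSERACT_PATH")
    if override:
        # The user override always wins, even in a frozen build.
        return "override", override
    if bundled is not None and exists(bundled):
        # Frozen builds prefer their own binary over a system install.
        return "bundled", str(bundled)
    # Nothing explicit: leave the lookup to PATH.
    return "path", None
```

Passing the environment and the existence check as arguments keeps the ordering testable in a dev checkout, where `sys.frozen` is never set and no bundled binary is present.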

116
tests/gui/test_app_demo.py Normal file

@@ -0,0 +1,116 @@
"""Public demo app (``src/gui/app_demo.py``) behavior — AppTest.
The demo app is the marketing surface: it preloads one accounting persona's
dataset, runs the saved pipeline, and shows BEFORE/AFTER + a buy CTA. These
tests pin that every persona renders, the run produces its headline value,
persona switching works, and the buy path is present — so a regression can't
silently ship a broken or empty demo to a prospect.
The dataset value numbers themselves are pinned separately in
``tests/test_demo_pipelines.py``; here we assert the *app* surfaces them.
"""
from __future__ import annotations

from pathlib import Path

import pandas as pd
import pytest
from streamlit.testing.v1 import AppTest

_PAGE = str(
    Path(__file__).resolve().parent.parent.parent / "src" / "gui" / "app_demo.py"
)
_DEMO = Path(__file__).resolve().parent.parent.parent / "samples" / "demo"

# (persona key, data file, expected rows before -> after, a label substring)
_PERSONAS = [
    ("bookkeeper", "bank_reconciliation.csv", 26, 20, "Bookkeeper"),
    ("ap-1099", "vendor_1099.csv", 24, 8, "payable"),
    ("ar-aging", "ar_open_invoices.csv", 26, 21, "receivable"),
]


def _app(persona: str | None = None) -> AppTest:
    at = AppTest.from_file(_PAGE, default_timeout=60)
    if persona is not None:
        at.query_params["p"] = persona
    return at.run()


def _md(at: AppTest) -> str:
    return " ".join(m.value for m in at.markdown)
@pytest.mark.parametrize("key,data_file,before,after,label", _PERSONAS)
def test_persona_renders_with_its_dataset(key, data_file, before, after, label):
    at = _app(key)
    assert not at.exception
    md = _md(at)
    assert label in md, f"persona label {label!r} not rendered"
    # BEFORE preview reflects the real dataset size.
    real_rows = len(pd.read_csv(_DEMO / data_file, dtype=str, keep_default_na=False))
    assert real_rows == before  # guards the fixture against silent drift
    assert f"BEFORE — {before} rows" in md
    # The saved pipeline is shown (read-only) as the canonical steps.
    assert "text_clean" in md and "dedup" in md
    assert any("Run pipeline" in b.label for b in at.button)


def test_default_persona_is_bookkeeper():
    at = _app(None)
    assert not at.exception
    assert "Bookkeeper" in _md(at)


def test_unknown_persona_falls_back_to_default():
    at = _app("not-a-real-persona")
    assert not at.exception
    assert "Bookkeeper" in _md(at)
@pytest.mark.parametrize("key,data_file,before,after,label", _PERSONAS)
def test_run_shows_after_value_and_buy_path(key, data_file, before, after, label):
    at = _app(key)
    [b for b in at.button if "Run pipeline" in b.label][0].click().run()
    assert not at.exception, at.exception
    # A result is cached and the AFTER header reports the dedup win.
    assert "demo_result" in at.session_state
    result = at.session_state["demo_result"]
    assert len(result.final_df) == after
    assert result.final_rows < result.initial_rows
    assert f"{before}{after} rows" in _md(at)
    # The buy path is present after a run (download + Gumroad CTA). The
    # cleaned-CSV download is a download_button, not a plain button.
    downloads = at.get("download_button")
    assert any("Download cleaned CSV" in d.label for d in downloads)
    assert f"gumroad.com/l/datatools?from={key}" in _md(at)
def test_persona_switch_clears_stale_result():
    # Run the bookkeeper demo, then switch persona via the quick-switch
    # dropdown (driving the selectbox — a raw query-param change is
    # overridden by the dropdown's persisted value).
    at = _app("bookkeeper")
    [b for b in at.button if "Run pipeline" in b.label][0].click().run()
    assert "demo_result" in at.session_state
    switch = [s for s in at.selectbox if s.key == "persona_switch"][0]
    switch.set_value("ap-1099").run()
    assert not at.exception
    # The page drops the stale bookkeeper result when the persona changes,
    # so the visitor never sees the wrong dataset's AFTER block.
    assert "demo_result" not in at.session_state
    assert "payable" in _md(at)  # now showing the AP/1099 persona


def test_run_offers_a_watermarked_download():
    """After a run the visitor gets a download, labeled as watermarked
    (the free/paid boundary from DEMO-PLAN §6)."""
    at = _app("bookkeeper")
    [b for b in at.button if "Run pipeline" in b.label][0].click().run()
    dl = [d for d in at.get("download_button") if "Download cleaned CSV" in d.label]
    assert dl, "no cleaned-CSV download after a run"
    assert "watermark" in dl[0].label.lower()


@@ -0,0 +1,281 @@
"""Pipeline Runner — visual module-card builder contract (AppTest).
Pins the behaviors the JSON-table → module-card rewrite introduced:
recommended steps seed as cards with friendly names, each step exposes a
plain-language Configure panel (no raw per-row JSON), steps can be toggled /
added / removed, JSON lives only under Advanced, and a run produces results
with friendly step names. The page's bare initial-render contract across junk
files is covered separately in ``tests/test_junk_corpus_tool_pages.py``.
"""
from __future__ import annotations

import json
from pathlib import Path

import pytest
from streamlit.testing.v1 import AppTest

_PAGE = (
    Path(__file__).resolve().parent.parent.parent
    / "src" / "gui" / "pages" / "9_Pipeline_Runner.py"
)
_CSV = (
    b"name,email,phone,signup_date\n"
    b" Jane Doe ,jane@acme.io,512-555-0190,2024-01-04\n"
    b"jane doe,JANE@ACME.IO,(512) 555-0190,01/04/2024\n"
    b"Bob Smith,bob@globex.com,720.555.7781,2024-02-11\n"
)


def _app() -> AppTest:
    at = AppTest.from_file(str(_PAGE), default_timeout=30)
    at.session_state["home_uploaded_bytes"] = _CSV
    at.session_state["home_uploaded_name"] = "customers.csv"
    at.session_state["home_uploaded_size"] = len(_CSV)
    return at.run()
def test_recommended_steps_seed_as_named_cards():
    at = _app()
    assert not at.exception
    tools = [s["tool"] for s in at.session_state["pipeline_steps"]]
    assert tools == ["text_clean", "format_standardize", "missing", "dedup"]
    md = " ".join(m.value for m in at.markdown)
    for friendly in ("Clean Text", "Standardize Formats",
                     "Fix Missing Values", "Find Duplicates"):
        assert friendly in md


def test_each_step_has_a_configure_panel_and_json_is_advanced_only():
    at = _app()
    labels = [e.label for e in at.get("expander")]
    assert any(l.startswith("Configure: Clean Text") for l in labels)
    assert any(l.startswith("Configure: Find Duplicates") for l in labels)
    # Raw JSON is import/export only — never a per-step editing surface.
    assert any("Advanced — import / export" in l for l in labels)


def test_toggle_disables_step_and_persists():
    at = _app()
    at.toggle[0].set_value(False).run()
    assert at.session_state["pipeline_steps"][0]["enabled"] is False


def test_add_step_appends_a_working_config_panel():
    at = _app()
    [s for s in at.selectbox if s.key == "pipeline_add_tool"][0].set_value("column_map").run()
    [b for b in at.button if "Add step" in b.label][0].click().run()
    assert not at.exception
    assert at.session_state["pipeline_steps"][-1]["tool"] == "column_map"
    labels = [e.label for e in at.get("expander")]
    assert any(l.startswith("Configure: Map Columns") for l in labels)


def test_remove_step_drops_it():
    at = _app()
    before = len(at.session_state["pipeline_steps"])
    # The first ✕ remove button in the card stack.
    [b for b in at.button if b.label == ""][0].click().run()
    assert not at.exception
    assert len(at.session_state["pipeline_steps"]) == before - 1
def test_run_produces_results_with_friendly_names():
at = _app()
[b for b in at.button if b.label == "Run Pipeline"][0].click().run()
assert not at.exception, at.exception
assert "pipeline_result" in at.session_state
res = at.session_state["pipeline_result"]
assert res.initial_rows == 3 and res.final_rows == 2 # the two Jane rows merge
assert all(sr.error is None for sr in res.step_results)


def test_step_phrase_is_plain_english_not_json():
    from src.gui.components.pipeline_modules import step_phrase, step_status

    # dedup phrasing mirrors the design mockup wording exactly.
    phrase = step_phrase("dedup", {
        "input_rows": 18442, "output_rows": 18130,
        "duplicates_removed": 312, "groups": 147,
    })
    assert phrase == "312 duplicates removed across 147 groups (18,442 → 18,130 rows)"
    # text_clean lists affected columns in prose, with thousands separators.
    assert step_phrase("text_clean", {
        "cells_changed": 1204, "columns_processed": ["name", "city"],
    }) == "1,204 cells cleaned in name & city"
    # singular nouns pluralize correctly
    assert step_phrase("missing", {"rows_dropped": 1, "columns_dropped": ["x"]}) == \
        "1 row dropped, 1 column dropped"
    # unparseable cells downgrade the pill to warn with an inline detail
    label, level, detail = step_status(
        "format_standardize", {"cells_changed": 100, "cells_unparseable": 141},
    )
    assert level == "warn" and "141 skipped" in label and detail
    # a clean step is "ok" with no detail
    assert step_status("text_clean", {"cells_changed": 5})[1] == "ok"


# ---------------------------------------------------------------------------
# Helpers for the reorder / config tests below
# ---------------------------------------------------------------------------
def _ids(at) -> dict:
    """Map tool name → that step's stable id (assumes unique tools)."""
    return {s["tool"]: s["id"] for s in at.session_state["pipeline_steps"]}


def _tools(at) -> list:
    return [s["tool"] for s in at.session_state["pipeline_steps"]]


# ---------------------------------------------------------------------------
# Reorder
# ---------------------------------------------------------------------------
def test_reorder_down_swaps_with_next_step():
    at = _app()
    sid = _ids(at)["text_clean"]
    before = _tools(at)
    assert before == ["text_clean", "format_standardize", "missing", "dedup"]
    [b for b in at.button if b.key == f"text_clean_{sid}_down"][0].click().run()
    assert not at.exception
    assert _tools(at) == ["format_standardize", "text_clean", "missing", "dedup"]


def test_reorder_up_swaps_with_previous_step():
    at = _app()
    sid = _ids(at)["missing"]
    [b for b in at.button if b.key == f"missing_{sid}_up"][0].click().run()
    assert not at.exception
    assert _tools(at) == ["text_clean", "missing", "format_standardize", "dedup"]


def test_first_up_and_last_down_buttons_are_disabled():
    at = _app()
    ids = _ids(at)
    first_up = [b for b in at.button if b.key == f"text_clean_{ids['text_clean']}_up"][0]
    last_down = [b for b in at.button if b.key == f"dedup_{ids['dedup']}_down"][0]
    assert first_up.disabled is True
    assert last_down.disabled is True
    # interior steps are freely movable
    mid_up = [b for b in at.button if b.key == f"missing_{ids['missing']}_up"][0]
    assert mid_up.disabled is False


def test_disabled_step_stays_disabled_after_reorder():
    at = _app()
    sid = _ids(at)["text_clean"]
    at.toggle[0].set_value(False).run()
    assert at.session_state["pipeline_steps"][0]["enabled"] is False
    # move the now-disabled first step down one slot
    [b for b in at.button if b.key == f"text_clean_{sid}_down"][0].click().run()
    assert not at.exception
    steps = at.session_state["pipeline_steps"]
    moved = [s for s in steps if s["tool"] == "text_clean"][0]
    assert steps.index(moved) == 1  # it moved
    assert moved["enabled"] is False  # ...and stayed disabled


# ---------------------------------------------------------------------------
# Restore recommended steps
# ---------------------------------------------------------------------------
def test_restore_recommended_steps_button():
    at = _app()
    # Diverge from the recommended default by removing a step.
    [b for b in at.button if b.label == ""][0].click().run()
    assert _tools(at) == ["format_standardize", "missing", "dedup"]
    restore = [b for b in at.button if "Restore recommended steps" in b.label]
    assert len(restore) == 1
    restore[0].click().run()
    assert not at.exception
    assert _tools(at) == ["text_clean", "format_standardize", "missing", "dedup"]


def test_restore_button_absent_when_steps_match_default():
    at = _app()
    # Untouched recommended steps → no restore prompt.
    assert not [b for b in at.button if "Restore recommended steps" in b.label]


# ---------------------------------------------------------------------------
# Advanced JSON export / import
# ---------------------------------------------------------------------------
def test_advanced_json_export_reflects_current_steps():
    at = _app()
    exported = json.loads(at.code[0].value)
    assert [s["tool"] for s in exported["steps"]] == \
        ["text_clean", "format_standardize", "missing", "dedup"]
    # Remove a step and confirm the exported JSON drops it too.
    [b for b in at.button if b.label == ""][0].click().run()
    exported = json.loads(at.code[0].value)
    assert [s["tool"] for s in exported["steps"]] == \
        ["format_standardize", "missing", "dedup"]


def test_load_pasted_json_replaces_the_step_list():
    at = _app()
    one_step = json.dumps(
        {"steps": [{"tool": "dedup", "options": {}, "enabled": True}]}
    )
    [t for t in at.text_area if t.key == "pipeline_json_paste"][0].set_value(
        one_step
    ).run()
    [b for b in at.button if b.label == "Load pasted JSON"][0].click().run()
    assert not at.exception
    assert _tools(at) == ["dedup"]


# ---------------------------------------------------------------------------
# Config renderers emit the right options
# ---------------------------------------------------------------------------
def test_format_standardize_config_emits_column_types():
    at = _app()
    fid = _ids(at)["format_standardize"]
    [s for s in at.selectbox if s.key == f"format_standardize_{fid}_fmt__phone"][0] \
        .set_value("Phone number").run()
    [b for b in at.button if b.label == "Run Pipeline"][0].click().run()
    assert not at.exception
    step = [s for s in at.session_state["pipeline_steps"]
            if s["tool"] == "format_standardize"][0]
    assert step["options"]["column_types"].get("phone") == "phone"


def test_missing_config_drop_radio_emits_drop_row_strategy():
    at = _app()
    mid = _ids(at)["missing"]
    [r for r in at.radio if r.key == f"missing_{mid}_strategy"][0] \
        .set_value("Drop rows that have any blank").run()
    [b for b in at.button if b.label == "Run Pipeline"][0].click().run()
    assert not at.exception
    step = [s for s in at.session_state["pipeline_steps"]
            if s["tool"] == "missing"][0]
    assert step["options"]["strategy"] == "drop_row"


def test_dedup_config_multiselect_builds_strategies():
    at = _app()
    did = _ids(at)["dedup"]
    [m for m in at.multiselect if m.key == f"dedup_{did}_matchcols"][0] \
        .set_value(["email"]).run()
    [b for b in at.button if b.label == "Run Pipeline"][0].click().run()
    assert not at.exception
    step = [s for s in at.session_state["pipeline_steps"]
            if s["tool"] == "dedup"][0]
    strategies = step["options"]["strategies"]
    cols = [c["column"] for c in strategies[0]["columns"]]
    assert cols == ["email"]
    assert strategies[0]["columns"][0]["algorithm"] == "exact"
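
The reorder tests above pin a small contract: a move swaps a step with its neighbour, the first step cannot move up, the last cannot move down, and a step's `enabled` flag travels with it. A minimal sketch of logic that would satisfy that contract follows. `move_step` and its signature are hypothetical; the page's real implementation in `9_Pipeline_Runner.py` is not shown in this diff and may differ.

```python
from __future__ import annotations


def move_step(steps: list[dict], step_id: str, direction: str) -> list[dict]:
    """Return a new list with the step `step_id` moved one slot up or down.

    Hypothetical helper, written only to illustrate the tested behaviour.
    """
    if direction not in ("up", "down"):
        raise ValueError(f"unknown direction: {direction!r}")
    idx = next(i for i, s in enumerate(steps) if s["id"] == step_id)
    target = idx - 1 if direction == "up" else idx + 1
    # Out-of-range moves are no-ops, mirroring the disabled edge buttons.
    if target < 0 or target >= len(steps):
        return list(steps)
    reordered = list(steps)
    # Swap whole dicts so per-step state such as "enabled" moves with the step.
    reordered[idx], reordered[target] = reordered[target], reordered[idx]
    return reordered


steps = [
    {"id": "a1", "tool": "text_clean", "enabled": False},
    {"id": "b2", "tool": "format_standardize", "enabled": True},
    {"id": "c3", "tool": "missing", "enabled": True},
]
moved = move_step(steps, "a1", "down")
print([s["tool"] for s in moved])
```

Swapping the dict objects rather than copying fields is what keeps a disabled step disabled after it moves, which is the property `test_disabled_step_stays_disabled_after_reorder` checks.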


@@ -0,0 +1,254 @@
"""Pure-function tests for pipeline_modules phrasing helpers.
These cover the adapter-key → tool bridge, the plain-English ``step_phrase``
wording, ``step_status`` pill levels, and the column-prose / pluralization
helpers (``_fmt_cols`` / ``_n``). No Streamlit / AppTest needed — every symbol
under test is a pure function over plain dicts/lists.
"""
from __future__ import annotations

import pytest

from src.core.pipeline import TOOL_NAMES
from src.gui.components.pipeline_modules import (
    CONFIG_RENDERERS,
    PIPELINE_TOOL_META,
    _fmt_cols,
    _n,
    step_label,
    step_phrase,
    step_status,
)


# ---------------------------------------------------------------------------
# Bridge completeness
# ---------------------------------------------------------------------------
@pytest.mark.parametrize("tool", TOOL_NAMES)
def test_pipeline_tool_meta_covers_every_tool(tool):
    assert tool in PIPELINE_TOOL_META
    assert PIPELINE_TOOL_META[tool]  # non-empty tool_id


@pytest.mark.parametrize("tool", TOOL_NAMES)
def test_step_label_is_friendly_and_not_the_raw_key(tool):
    label = step_label(tool)
    assert isinstance(label, str)
    assert label
    assert label != tool


@pytest.mark.parametrize("tool", TOOL_NAMES)
def test_every_tool_has_a_config_renderer(tool):
    assert tool in CONFIG_RENDERERS
    assert callable(CONFIG_RENDERERS[tool])


def test_step_label_falls_back_to_raw_key_for_unknown_tool():
    assert step_label("not_a_tool") == "not_a_tool"


# ---------------------------------------------------------------------------
# step_phrase — populated + no-op cases for all five tools
# ---------------------------------------------------------------------------
def test_step_phrase_text_clean_populated_and_noop():
    assert step_phrase("text_clean", {
        "cells_changed": 1204, "columns_processed": ["name", "city"],
    }) == "1,204 cells cleaned in name & city"
    assert step_phrase("text_clean", {"cells_changed": 0}) == "No changes needed."
    assert step_phrase("text_clean", {}) == "No changes needed."


def test_step_phrase_format_standardize_populated_and_noop():
    assert step_phrase("format_standardize", {
        "cells_changed": 50, "columns_processed": ["phone"],
    }) == "50 cells standardized in phone"
    # unparseable cells append a "left unchanged" tail
    assert step_phrase("format_standardize", {
        "cells_changed": 50, "cells_unparseable": 3, "columns_processed": ["phone"],
    }) == "50 cells standardized in phone (3 left unchanged)"
    assert step_phrase("format_standardize", {}) == "Nothing to standardize."
    assert step_phrase("format_standardize", {
        "cells_changed": 0, "cells_unparseable": 0,
    }) == "Nothing to standardize."


def test_step_phrase_missing_populated_and_noop():
    assert step_phrase("missing", {
        "cells_filled": 12, "rows_dropped": 4, "columns_dropped": ["x", "y"],
    }) == "12 cells filled, 4 rows dropped, 2 columns dropped"
    assert step_phrase("missing", {}) == "No missing values to handle."
    # sentinel-only flagging path
    assert step_phrase("missing", {
        "sentinels_standardized": 7,
    }) == "7 blank cells flagged"


def test_step_phrase_column_map_populated_and_noop():
    assert step_phrase("column_map", {
        "columns_renamed": 3, "columns_added": ["new"], "columns_dropped": ["old", "gone"],
    }) == "3 columns renamed, 1 column added, 2 columns dropped"
    assert step_phrase("column_map", {}) == "Columns already aligned."


def test_step_phrase_dedup_mockup_case():
    assert step_phrase("dedup", {
        "input_rows": 18442, "output_rows": 18130,
        "duplicates_removed": 312, "groups": 147,
    }) == "312 duplicates removed across 147 groups (18,442 → 18,130 rows)"


def test_step_phrase_dedup_noop():
    assert step_phrase("dedup", {"duplicates_removed": 0}) == "No duplicates found."
    assert step_phrase("dedup", {}) == "No duplicates found."


# ---------------------------------------------------------------------------
# Pluralization (_n) through step_phrase
# ---------------------------------------------------------------------------
def test_step_phrase_dedup_singular():
    assert step_phrase("dedup", {
        "input_rows": 10, "output_rows": 9,
        "duplicates_removed": 1, "groups": 1,
    }) == "1 duplicate removed across 1 group (10 → 9 rows)"


def test_step_phrase_missing_singular():
    assert step_phrase("missing", {
        "rows_dropped": 1, "columns_dropped": ["x"],
    }) == "1 row dropped, 1 column dropped"


def test_n_singular_vs_plural_every_noun():
    assert _n(1, "cell") == "1 cell"
    assert _n(2, "cell") == "2 cells"
    assert _n(1, "row") == "1 row"
    assert _n(3, "row") == "3 rows"
    assert _n(1, "column") == "1 column"
    assert _n(5, "column") == "5 columns"
    assert _n(1, "duplicate") == "1 duplicate"
    assert _n(9, "duplicate") == "9 duplicates"
    assert _n(1, "group") == "1 group"
    assert _n(4, "group") == "4 groups"


def test_n_thousands_separator():
    assert _n(1204, "cell") == "1,204 cells"
    assert _n(18442, "row") == "18,442 rows"


# ---------------------------------------------------------------------------
# Column prose (_fmt_cols)
# ---------------------------------------------------------------------------
def test_fmt_cols_zero():
    assert _fmt_cols([]) == ""


def test_fmt_cols_one():
    assert _fmt_cols(["name"]) == "name"


def test_fmt_cols_two():
    assert _fmt_cols(["name", "city"]) == "name & city"


def test_fmt_cols_three():
    assert _fmt_cols(["a", "b", "c"]) == "a, b & c"


def test_fmt_cols_four_or_more():
    assert _fmt_cols(["a", "b", "c", "d"]) == "a, b & 2 more"
    assert _fmt_cols(["a", "b", "c", "d", "e"]) == "a, b & 3 more"


def test_fmt_cols_coerces_non_strings():
    assert _fmt_cols([1, 2]) == "1 & 2"


# ---------------------------------------------------------------------------
# step_status — pill levels + details
# ---------------------------------------------------------------------------
def test_step_status_clean_is_ok():
    assert step_status("text_clean", {"cells_changed": 5}) == ("✓ ok", "ok", "")


def test_step_status_skipped():
    label, level, detail = step_status("text_clean", {"cells_changed": 5}, skipped=True)
    assert level == "skipped"
    assert detail == ""
    assert "skipped" in label


def test_step_status_error_uses_first_line_only():
    label, level, detail = step_status(
        "dedup", {}, error="X: msg\nline2\nline3",
    )
    assert level == "error"
    assert detail == "X: msg"
    assert "error" in label


def test_step_status_error_takes_precedence_over_skipped():
    label, level, detail = step_status(
        "text_clean", {}, skipped=True, error="boom\nsecond",
    )
    assert level == "error"
    assert detail == "boom"


def test_step_status_format_standardize_unparseable_warns():
    label, level, detail = step_status(
        "format_standardize", {"cells_changed": 100, "cells_unparseable": 141},
    )
    assert level == "warn"
    assert "141 skipped" in label
    assert detail  # non-empty inline detail


def test_step_status_format_standardize_no_unparseable_is_ok():
    assert step_status(
        "format_standardize", {"cells_changed": 100},
    ) == ("✓ ok", "ok", "")


def test_step_status_column_map_coercion_failures_warn():
    label, level, detail = step_status(
        "column_map", {"coercion_failures": {"age": 4}},
    )
    assert level == "warn"
    assert "4 not coerced" in label
    assert detail


def test_step_status_column_map_missing_required_targets_warn():
    label, level, detail = step_status(
        "column_map", {"missing_required_targets": ["email"]},
    )
    assert level == "warn"
    assert "missing targets" in label
    assert "email" in detail


def test_step_status_column_map_missing_targets_take_precedence_over_coercion():
    # both present → missing-targets branch wins
    label, level, detail = step_status(
        "column_map",
        {"missing_required_targets": ["email"], "coercion_failures": {"age": 4}},
    )
    assert level == "warn"
    assert "missing targets" in label


def test_step_status_unknown_tool_is_ok():
    assert step_status("mystery", {"foo": 1}) == ("✓ ok", "ok", "")
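
The expectations above fully determine how the pluralization and column-prose helpers must behave, so they can be read as a specification. The sketch below is one implementation that satisfies the pinned strings. It is reconstructed from the assertions only: the names `n`, `fmt_cols` and `dedup_phrase` are stand-ins, and the real `_n`, `_fmt_cols` and `step_phrase` in `pipeline_modules` are not shown in this diff and may be written differently.

```python
def n(count: int, noun: str) -> str:
    """Count with thousands separator plus a naively pluralized noun."""
    suffix = "" if count == 1 else "s"
    return f"{count:,} {noun}{suffix}"


def fmt_cols(cols) -> str:
    """Column names as prose: up to three are listed, longer lists are truncated."""
    names = [str(c) for c in cols]  # coerce non-strings, as the tests require
    if not names:
        return ""
    if len(names) == 1:
        return names[0]
    if len(names) == 2:
        return f"{names[0]} & {names[1]}"
    if len(names) == 3:
        return f"{names[0]}, {names[1]} & {names[2]}"
    return f"{names[0]}, {names[1]} & {len(names) - 2} more"


def dedup_phrase(summary: dict) -> str:
    """Plain-English dedup summary; assumes the keys used in the tests above."""
    removed = summary.get("duplicates_removed", 0)
    if not removed:
        return "No duplicates found."
    return (
        f"{n(removed, 'duplicate')} removed across {n(summary['groups'], 'group')} "
        f"({summary['input_rows']:,} → {summary['output_rows']:,} rows)"
    )


print(n(1204, "cell"))
print(fmt_cols(["a", "b", "c", "d"]))
print(dedup_phrase({"input_rows": 10, "output_rows": 9,
                    "duplicates_removed": 1, "groups": 1}))
```

Note that the trailing "rows" inside the parentheses stays plural even in the singular case, which matches the string pinned by `test_step_phrase_dedup_singular`.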


@@ -63,12 +63,11 @@ EXPECTED_SUBSTRINGS: dict[str, dict[str, str]] = {
"7_Multi_File_Merger": {"en": "Combine Files", "es": "Combinar archivos"},
"8_Validator_Reporter": {"en": "Quality Check", "es": "Verificación de calidad"},
"9_Pipeline_Runner": {"en": "Automated", "es": "Flujos automatizados"},
-    # The PDF Extractor and Reconciler pages are English-only today
-    # (translations tracked as a follow-up). The smoke test value is
-    # still that the page *renders at all* in 'es'; the substring is
-    # the same English hero text under both languages.
-    "10_PDF_Extractor": {"en": "PDF to CSV", "es": "PDF to CSV"},
-    "11_Reconciler": {"en": "Reconcile", "es": "Reconcile"},
+    # PDF Extractor + Reconciler page titles are now translated in
+    # both packs (``tools.<id>.page_title``). Their hero copy diverges
+    # by language, so the smoke test pins the localized substring.
+    "10_PDF_Extractor": {"en": "PDF to CSV", "es": "PDF a CSV"},
+    "11_Reconciler": {"en": "Reconcile", "es": "Reconciliar"},
"99_Close": {"en": "Shutting down", "es": "Cerrando"},
}

tests/test_cli_pipeline.py Normal file

@@ -0,0 +1,293 @@
"""Integration tests for the pipeline CLI (src/cli_pipeline.py).
The Typer ``app`` is invoked directly via ``CliRunner`` to bypass the
license ``guard(...)`` that ``main()`` runs before ``app()`` — matching the
house pattern in ``test_cli_text_clean.py``.
"""
from __future__ import annotations

import json

import pandas as pd
import pytest
from typer.testing import CliRunner

from src.cli_pipeline import app
from src.core.pipeline import Pipeline, _DEFAULT_ORDER

runner = CliRunner()


@pytest.fixture
def messy_csv(tmp_path):
    """A small messy CSV with duplicate / whitespace / mixed-case rows."""
    df = pd.DataFrame({
        "name": [" Alice ", "alice", "Bob", "Charlie"],
        "email": ["A@X.COM", "a@x.com", "bob@x.com", "charlie@x.com"],
        "phone": ["555-1234", "5551234", "555-9999", "555-0000"],
        "signup_date": ["2020-01-01", "2020-01-01", "2020-02-02", "2020-03-03"],
    })
    path = tmp_path / "messy.csv"
    df.to_csv(path, index=False)
    return path


def _pipeline_artifacts(csv_path):
    """The output CSV + audit JSON the CLI writes next to *csv_path*."""
    out_csv = csv_path.parent / f"{csv_path.stem}_pipeline.csv"
    audit = csv_path.parent / f"{csv_path.stem}_pipeline.json"
    return out_csv, audit


# ---------------------------------------------------------------------------
# --recommend
# ---------------------------------------------------------------------------
class TestRecommend:
    def test_recommend_prints_valid_json(self):
        result = runner.invoke(app, ["--recommend"])
        assert result.exit_code == 0
        data = json.loads(result.output)
        assert "steps" in data
        tools = [s["tool"] for s in data["steps"]]
        assert tools == list(_DEFAULT_ORDER)

    def test_recommend_default_tools_in_order(self):
        result = runner.invoke(app, ["--recommend"])
        data = json.loads(result.output)
        tools = [s["tool"] for s in data["steps"]]
        assert tools == ["text_clean", "format_standardize", "missing", "dedup"]
        assert len(tools) == 4

    def test_recommend_output_writes_loadable_file(self, tmp_path):
        out = tmp_path / "pipeline.json"
        result = runner.invoke(app, ["--recommend", "--output", str(out)])
        assert result.exit_code == 0
        assert out.exists()
        # Confirmation message printed instead of raw JSON.
        assert str(out) in result.output
        pipe = Pipeline.from_file(out)
        assert [s.tool for s in pipe.steps] == list(_DEFAULT_ORDER)

    def test_recommend_output_message_not_json(self, tmp_path):
        out = tmp_path / "pipeline.json"
        result = runner.invoke(app, ["--recommend", "--output", str(out)])
        assert "saved to" in result.output.lower()


# ---------------------------------------------------------------------------
# Argument / input validation
# ---------------------------------------------------------------------------
class TestArgValidation:
    def test_no_args_exits_2(self):
        result = runner.invoke(app, [])
        assert result.exit_code == 2
        assert "input file is required" in result.output.lower()

    def test_nonexistent_input_exits_1(self, tmp_path):
        missing = tmp_path / "does_not_exist_xyz.csv"
        result = runner.invoke(app, [str(missing)])
        assert result.exit_code == 1
        assert "not found" in result.output.lower()

    def test_pipeline_and_steps_together_exits_1(self, messy_csv, tmp_path):
        pj = tmp_path / "p.json"
        Pipeline.from_dict({"steps": [{"tool": "text_clean"}]}).to_file(pj)
        result = runner.invoke(
            app,
            [str(messy_csv), "--pipeline", str(pj), "--steps", "text_clean"],
        )
        assert result.exit_code == 1
        assert "not both" in result.output.lower()

    def test_pipeline_nonexistent_exits_1(self, messy_csv, tmp_path):
        missing = tmp_path / "no_such_pipeline.json"
        result = runner.invoke(
            app, [str(messy_csv), "--pipeline", str(missing)],
        )
        assert result.exit_code == 1
        assert "not found" in result.output.lower()

    def test_unknown_tool_in_steps_errors(self, messy_csv):
        result = runner.invoke(app, [str(messy_csv), "--steps", "bogus_tool"])
        assert result.exit_code != 0
        # Helpful error naming the offending value.
        assert "bogus_tool" in result.output


# ---------------------------------------------------------------------------
# Dry-run (default)
# ---------------------------------------------------------------------------
class TestDryRun:
    def test_dry_run_exit_0_and_plan_printed(self, messy_csv):
        result = runner.invoke(app, [str(messy_csv)])
        assert result.exit_code == 0
        assert "Pipeline plan:" in result.output
        assert "plan-only run" in result.output

    def test_dry_run_writes_no_artifacts(self, messy_csv):
        result = runner.invoke(app, [str(messy_csv)])
        assert result.exit_code == 0
        out_csv, audit = _pipeline_artifacts(messy_csv)
        assert not out_csv.exists()
        assert not audit.exists()


# ---------------------------------------------------------------------------
# --apply
# ---------------------------------------------------------------------------
class TestApply:
    def test_apply_default_pipeline_writes_outputs(self, messy_csv):
        result = runner.invoke(app, [str(messy_csv), "--apply"])
        assert result.exit_code == 0
        out_csv, audit = _pipeline_artifacts(messy_csv)
        assert out_csv.exists()
        assert audit.exists()
        # Output CSV is readable.
        df = pd.read_csv(out_csv)
        assert len(df.columns) >= 1

    def test_apply_audit_has_documented_keys(self, messy_csv):
        result = runner.invoke(app, [str(messy_csv), "--apply"])
        assert result.exit_code == 0
        _, audit = _pipeline_artifacts(messy_csv)
        data = json.loads(audit.read_text())
        for key in (
            "pipeline", "warnings", "initial_rows", "final_rows",
            "total_elapsed_seconds", "steps",
        ):
            assert key in data, f"missing audit key: {key}"
        # One step entry per pipeline step (default = 4).
        assert len(data["steps"]) == len(_DEFAULT_ORDER)
        for step in data["steps"]:
            for k in (
                "tool", "name", "enabled", "skipped",
                "elapsed_seconds", "summary", "error",
            ):
                assert k in step, f"missing step key: {k}"

    def test_apply_dedup_reduces_rows(self, messy_csv):
        result = runner.invoke(app, [str(messy_csv), "--apply"])
        assert result.exit_code == 0
        _, audit = _pipeline_artifacts(messy_csv)
        data = json.loads(audit.read_text())
        # 4 input rows; the first two are duplicates once cleaned/standardized.
        assert data["initial_rows"] == 4
        assert data["final_rows"] < data["initial_rows"]

    def test_apply_custom_output_path(self, messy_csv, tmp_path):
        out = tmp_path / "custom.csv"
        result = runner.invoke(
            app, [str(messy_csv), "--apply", "--output", str(out)],
        )
        assert result.exit_code == 0
        assert out.exists()
        # Default-named CSV should NOT be written when --output is given.
        default_csv, _ = _pipeline_artifacts(messy_csv)
        assert not default_csv.exists()
        # Audit JSON is still written next to the input.
        _, audit = _pipeline_artifacts(messy_csv)
        assert audit.exists()

    def test_apply_custom_steps_subset(self, messy_csv):
        result = runner.invoke(
            app, [str(messy_csv), "--apply", "--steps", "text_clean,missing"],
        )
        assert result.exit_code == 0
        _, audit = _pipeline_artifacts(messy_csv)
        data = json.loads(audit.read_text())
        tools = [s["tool"] for s in data["steps"]]
        assert tools == ["text_clean", "missing"]


# ---------------------------------------------------------------------------
# Strict mode
# ---------------------------------------------------------------------------
class TestStrict:
    def test_strict_out_of_order_exits_2(self, messy_csv):
        result = runner.invoke(
            app,
            [str(messy_csv), "--steps", "dedup,text_clean", "--strict", "--apply"],
        )
        assert result.exit_code == 2
        assert "abort" in result.output.lower()

    def test_strict_out_of_order_writes_nothing(self, messy_csv):
        result = runner.invoke(
            app,
            [str(messy_csv), "--steps", "dedup,text_clean", "--strict", "--apply"],
        )
        assert result.exit_code == 2
        out_csv, audit = _pipeline_artifacts(messy_csv)
        assert not out_csv.exists()
        assert not audit.exists()


# ---------------------------------------------------------------------------
# Round-trip: --recommend --output then --pipeline --apply
# ---------------------------------------------------------------------------
class TestRoundTrip:
    def test_save_then_run_saved_pipeline(self, messy_csv, tmp_path):
        pj = tmp_path / "p.json"
        r1 = runner.invoke(app, ["--recommend", "--output", str(pj)])
        assert r1.exit_code == 0
        assert pj.exists()
        r2 = runner.invoke(
            app, [str(messy_csv), "--pipeline", str(pj), "--apply"],
        )
        assert r2.exit_code == 0
        out_csv, audit = _pipeline_artifacts(messy_csv)
        assert out_csv.exists()
        assert audit.exists()


# ---------------------------------------------------------------------------
# Step error handling (--continue-on-error)
# ---------------------------------------------------------------------------
class TestStepError:
    """A dedup step with an invalid survivor_rule raises a ConfigError at
    run time, letting us exercise the stop/continue-on-error contract."""

    def _bad_pipeline(self, tmp_path):
        pj = tmp_path / "bad.json"
        Pipeline.from_dict({
            "steps": [{
                "tool": "dedup",
                "options": {"survivor_rule": "not_a_real_rule"},
            }]
        }).to_file(pj)
        return pj

    def test_step_error_halts_without_continue(self, messy_csv, tmp_path):
        pj = self._bad_pipeline(tmp_path)
        result = runner.invoke(
            app, [str(messy_csv), "--pipeline", str(pj), "--apply"],
        )
        assert result.exit_code != 0
        out_csv, audit = _pipeline_artifacts(messy_csv)
        # Halted before writing output.
        assert not out_csv.exists()
        assert not audit.exists()

    def test_continue_on_error_completes_and_records_error(self, messy_csv, tmp_path):
        pj = self._bad_pipeline(tmp_path)
        result = runner.invoke(
            app,
            [str(messy_csv), "--pipeline", str(pj), "--apply",
             "--continue-on-error"],
        )
        assert result.exit_code == 0
        out_csv, audit = _pipeline_artifacts(messy_csv)
        assert out_csv.exists()
        assert audit.exists()
        data = json.loads(audit.read_text())
        assert len(data["steps"]) == 1
        assert data["steps"][0]["error"], "expected the failed step's error recorded"
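
The strict-mode tests imply that a step list departing from the recommended order (`dedup` before `text_clean`) is rejected before anything is written. One way such an ordering check could be expressed is sketched below. This is an assumption for illustration only: `order_warnings` is a hypothetical name, the warning wording is invented, and the actual rule in `src/core/pipeline.py` is not shown in this diff. `DEFAULT_ORDER` copies the four tool names asserted in `TestRecommend`.

```python
# Recommended order, as asserted by test_recommend_default_tools_in_order.
DEFAULT_ORDER = ("text_clean", "format_standardize", "missing", "dedup")


def order_warnings(tools, order=DEFAULT_ORDER):
    """Return one message per pair of steps that run in reverse of `order`.

    Hypothetical helper; tools missing from `order` are ignored.
    """
    rank = {tool: i for i, tool in enumerate(order)}
    warnings = []
    for pos, tool in enumerate(tools):
        for earlier in tools[:pos]:
            if earlier in rank and tool in rank and rank[earlier] > rank[tool]:
                warnings.append(f"{earlier} runs before {tool}")
    return warnings


print(order_warnings(["dedup", "text_clean"]))
print(order_warnings(["text_clean", "missing"]))
```

Under this sketch a strict run would abort whenever the returned list is non-empty, while a non-strict run could surface the same messages under the audit's `warnings` key.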


@@ -0,0 +1,116 @@
"""Demo pipelines must keep showing value (accounting personas).
Each persona's preloaded dataset + saved pipeline is the marketing surface
driven by ``src/gui/app_demo.py``. These tests pin that every demo loads,
runs clean, and produces its headline value (duplicate rows removed, clean
parse, disguised nulls caught) — so a stale dataset or an engine change can't
silently gut the sales demo. The read path mirrors ``app_demo._load_demo``
exactly (``dtype=str, keep_default_na=False`` so every disguised null survives
to the pipeline).
"""
from __future__ import annotations

from pathlib import Path

import pandas as pd
import pytest

from src.core.pipeline import Pipeline, run_pipeline

_REPO = Path(__file__).resolve().parent.parent
_DEMO = _REPO / "samples" / "demo"

# (data_file, pipeline_file, min_duplicates_removed) — one per accounting
# persona in app_demo.PERSONAS. The dup floors are the validated demo numbers.
_DEMOS = [
    ("bank_reconciliation.csv", "bank_reconciliation_pipeline.json", 6),
    ("vendor_1099.csv", "vendor_1099_pipeline.json", 8),
    ("ar_open_invoices.csv", "ar_open_invoices_pipeline.json", 5),
]


@pytest.mark.parametrize("data_file,pipeline_file,min_dupes", _DEMOS)
def test_demo_runs_clean_and_shows_value(data_file, pipeline_file, min_dupes):
    df = pd.read_csv(_DEMO / data_file, dtype=str, keep_default_na=False)
    pipe = Pipeline.from_file(_DEMO / pipeline_file)
    res = run_pipeline(df, pipe, stop_on_error=True)
    # 1. Nothing errored — the demo never shows a visitor a red banner.
    assert all(sr.error is None for sr in res.step_results), [
        (sr.step.tool, sr.error) for sr in res.step_results
    ]
    # 2. Dedup removed the designed duplicate rows (the headline value).
    assert res.final_rows < res.initial_rows
    dedup = next(sr for sr in res.step_results if sr.step.tool == "dedup")
    assert dedup.summary["duplicates_removed"] >= min_dupes
    # 3. Standardization parsed every typed cell — a demo with unparseable
    #    cells reads as "the tool choked," which kills the pitch.
    fmt = next(sr for sr in res.step_results if sr.step.tool == "format_standardize")
    assert fmt.summary["cells_unparseable"] == 0
    assert fmt.summary["cells_changed"] > 0
    # 4. The disguised nulls (—, (blank), TBD, …) were caught.
    miss = next(sr for sr in res.step_results if sr.step.tool == "missing")
    assert miss.summary["sentinels_standardized"] > 0


def test_app_demo_references_each_demo_file():
    """Every data/pipeline file the demo app names must exist on disk.
    Guards against a rename in app_demo.py drifting away from samples/demo/
    (or vice versa) without a test catching it.
    """
    src = (_REPO / "src" / "gui" / "app_demo.py").read_text(encoding="utf-8")
    for data_file, pipeline_file, _ in _DEMOS:
        assert data_file in src, f"{data_file} not referenced in app_demo.py"
        assert pipeline_file in src, f"{pipeline_file} not referenced in app_demo.py"
        assert (_DEMO / data_file).exists(), f"missing {data_file}"
        assert (_DEMO / pipeline_file).exists(), f"missing {pipeline_file}"


# The accounting persona keys served by the demo app — each must line up with
# a landing page that embeds the matching demo. (key, data-file stem)
_PERSONA_KEYS = [
    ("bookkeeper", "bank_reconciliation"),
    ("ap-1099", "vendor_1099"),
    ("ar-aging", "ar_open_invoices"),
]
_LANDING = _REPO / "landing"


@pytest.mark.parametrize("key,stem", _PERSONA_KEYS)
def test_landing_page_embeds_the_matching_demo(key, stem):
    """Each landing page exists and its iframe + CTA point at this persona —
    so the sales surface (landing -> demo app -> dataset) stays coherent."""
    app_src = (_REPO / "src" / "gui" / "app_demo.py").read_text(encoding="utf-8")
    assert f'"{key}"' in app_src, f"persona key {key!r} not served by app_demo.py"
    page = _LANDING / key / "index.html"
    assert page.exists(), f"missing landing page for {key}"
    html = page.read_text(encoding="utf-8")
    assert f"?p={key}" in html, f"{key} landing iframe doesn't load ?p={key}"
    assert f"from={key}" in html, f"{key} landing CTA isn't tagged from={key}"
    # The hub links to this persona's page.
    hub = (_LANDING / "index.html").read_text(encoding="utf-8")
    assert f'href="{key}/"' in hub, f"hub doesn't link to {key}/"


def test_landing_surface_has_no_stale_persona_refs():
    """No retired Shopify / RevOps persona language remains in landing HTML."""
    for html_file in _LANDING.rglob("*.html"):
        text = html_file.read_text(encoding="utf-8").lower()
        for stale in ("shopify", "revops", "klaviyo", "hubspot"):
            assert stale not in text, f"{stale!r} still in {html_file.relative_to(_REPO)}"


def test_demo_app_builds_a_single_watermark_row():
    """The demo download appends exactly one trailing watermark row
    (DEMO-PLAN §6: the AFTER preview must read as production-quality)."""
    src = (_REPO / "src" / "gui" / "app_demo.py").read_text(encoding="utf-8")
    assert "DataTools demo — buy at" in src
    # One trailing row concatenated onto the result frame.
    assert "watermark_row" in src and "pd.concat([result.final_df, watermark_row]" in src
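
The module docstring stresses that the demo data is read as plain strings so that disguised nulls reach the pipeline instead of being converted to NaN at load time. The stdlib sketch below illustrates the same idea without pandas: `csv.DictReader` also leaves every cell as a string, so placeholder tokens can be counted directly. `count_sentinels` is a hypothetical helper, the sample rows are invented, and `SENTINELS` holds only the three tokens named in the test comment (the full list used by the `missing` step is elided there and is not reproduced).

```python
import csv
import io

# Only the three placeholders named in the test comment; the real list is longer.
SENTINELS = {"—", "(blank)", "TBD"}


def count_sentinels(csv_text, sentinels=SENTINELS):
    """Count cells whose stripped text is a placeholder standing in for a null."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(
        1
        for row in reader
        for value in row.values()
        if isinstance(value, str) and value.strip() in sentinels
    )


# Invented sample data, for illustration only.
sample = "vendor,tin\nAcme,TBD\nGlobex,—\nInitech,12-3456789\n"
print(count_sentinels(sample))
```

Had the loader coerced these cells to missing values up front, a count like `sentinels_standardized` would have nothing left to report, which is why the read path keeps them as text.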


@@ -178,3 +178,32 @@ class TestKeyCoverage:
for lang in ("en", "es"):
value = t(key, lang)
assert value and value != key, f"missing {key!r} in {lang}"
class TestHelpPopoverKeys:
"""Every tool's inline Help popover (``render_tool_header``) pulls
its copy from ``tools.<id>.help_md`` and the two shared labels
``help.button_label`` / ``help.missing_body``. A missing key would
fall back to the literal lookup key and render that string in the
popover instead of helpful content."""
@pytest.mark.parametrize("lang", ["en", "es"])
def test_help_shared_keys_present(self, lang):
for key in ("help.button_label", "help.missing_body"):
value = t(key, lang)
assert value and value != key, f"missing {key!r} in {lang!r}"
@pytest.mark.parametrize("lang", ["en", "es"])
def test_every_tool_has_help_md(self, lang):
# Import lazily so this file stays importable without the GUI.
from src.gui.tools_registry import TOOLS
missing: list[str] = []
for tool in TOOLS:
key = f"tools.{tool.tool_id}.help_md"
value = t(key, lang)
if not value or value == key or not value.strip():
missing.append(tool.tool_id)
assert not missing, (
f"language {lang!r} is missing help_md for: {missing}"
)
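The fallback these tests guard against can be sketched as follows. This is a minimal illustration, not the project's `src.i18n` module: `PACKS`, `lookup`, and `missing_keys` are hypothetical names, and the only behavior taken from the tests is that a missing key resolves to the key itself.

```python
from typing import Dict, Iterable, List

# Hypothetical in-memory language packs; the real packs are loaded elsewhere.
PACKS: Dict[str, Dict[str, str]] = {
    "en": {"help.button_label": "Help", "help.missing_body": "No help yet."},
    "es": {"help.button_label": "Ayuda"},
}


def lookup(key: str, lang: str) -> str:
    """Return the translated string, or the key itself when it is missing."""
    return PACKS.get(lang, {}).get(key, key)


def missing_keys(keys: Iterable[str], lang: str) -> List[str]:
    """Collect keys that would render as the literal key or as blank text."""
    missing: List[str] = []
    for key in keys:
        value = lookup(key, lang)
        if value == key or not value.strip():
            missing.append(key)
    return missing
```

Collecting every missing key before asserting, as `test_every_tool_has_help_md` does, reports all gaps in one failure instead of stopping at the first.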


@@ -12,9 +12,16 @@ a fixture statement at test time.
from __future__ import annotations
import os
from pathlib import Path
from src import pdf_extract
from src.pdf_extract import (
Page,
WordBox,
_apply_bundled_tessdata_prefix,
_bundled_tessdata_dir,
_bundled_tesseract_path,
_extract_account_number,
_extract_statement_period,
_find_amount_tokens,
@@ -456,3 +463,131 @@ class TestYearFromFilename:
def test_empty_filename(self):
assert year_from_filename("") is None
assert year_from_filename(None) is None
class TestBundledTesseractPath:
"""Frozen-bundle Tesseract discovery for installer / portable builds.
The build agent packages Tesseract at
``<sys._MEIPASS>/tesseract/tesseract[.exe]`` with language data
at ``<sys._MEIPASS>/tesseract/tessdata``. These tests pin that
contract on the runtime side."""
def test_returns_none_when_not_frozen(self, monkeypatch):
# Default dev environment: ``sys.frozen`` is unset.
monkeypatch.delattr("sys.frozen", raising=False)
monkeypatch.delattr("sys._MEIPASS", raising=False)
assert _bundled_tesseract_path() is None
assert _bundled_tessdata_dir() is None
def test_returns_none_when_frozen_but_no_meipass(self, monkeypatch):
# Defensive: ``sys.frozen`` true but ``_MEIPASS`` missing
# (shouldn't happen in real PyInstaller bundles but guard
# the helper so it can't NoneType-explode).
monkeypatch.setattr("sys.frozen", True, raising=False)
monkeypatch.delattr("sys._MEIPASS", raising=False)
assert _bundled_tesseract_path() is None
assert _bundled_tessdata_dir() is None
def test_frozen_linux_returns_unsuffixed_binary(
self, monkeypatch, tmp_path,
):
monkeypatch.setattr("sys.frozen", True, raising=False)
monkeypatch.setattr("sys._MEIPASS", str(tmp_path), raising=False)
monkeypatch.setattr("platform.system", lambda: "Linux")
expected = tmp_path / "tesseract" / "tesseract"
assert _bundled_tesseract_path() == expected
def test_frozen_macos_returns_unsuffixed_binary(
self, monkeypatch, tmp_path,
):
monkeypatch.setattr("sys.frozen", True, raising=False)
monkeypatch.setattr("sys._MEIPASS", str(tmp_path), raising=False)
monkeypatch.setattr("platform.system", lambda: "Darwin")
expected = tmp_path / "tesseract" / "tesseract"
assert _bundled_tesseract_path() == expected
def test_frozen_windows_returns_exe_binary(self, monkeypatch, tmp_path):
monkeypatch.setattr("sys.frozen", True, raising=False)
monkeypatch.setattr("sys._MEIPASS", str(tmp_path), raising=False)
monkeypatch.setattr("platform.system", lambda: "Windows")
expected = tmp_path / "tesseract" / "tesseract.exe"
assert _bundled_tesseract_path() == expected
def test_frozen_returns_tessdata_dir(self, monkeypatch, tmp_path):
monkeypatch.setattr("sys.frozen", True, raising=False)
monkeypatch.setattr("sys._MEIPASS", str(tmp_path), raising=False)
expected = tmp_path / "tesseract" / "tessdata"
assert _bundled_tessdata_dir() == expected
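The contract pinned above could be satisfied by helpers along these lines. This is a sketch under the layout stated in the class docstring, not the actual `src.pdf_extract` code; the names `bundle_root`, `bundled_tesseract_path`, and `bundled_tessdata_dir` are illustrative.

```python
import platform
import sys
from pathlib import Path
from typing import Optional


def bundle_root() -> Optional[Path]:
    """Return the PyInstaller extraction dir, or None outside a frozen build."""
    if not getattr(sys, "frozen", False):
        return None
    # Guard against a frozen flag without _MEIPASS so callers never see
    # an AttributeError or a Path(None) failure.
    meipass = getattr(sys, "_MEIPASS", None)
    return Path(meipass) if meipass else None


def bundled_tesseract_path() -> Optional[Path]:
    """Path of the bundled binary; only Windows uses the .exe suffix."""
    root = bundle_root()
    if root is None:
        return None
    name = "tesseract.exe" if platform.system() == "Windows" else "tesseract"
    return root / "tesseract" / name


def bundled_tessdata_dir() -> Optional[Path]:
    """Path of the bundled language data directory."""
    root = bundle_root()
    return None if root is None else root / "tesseract" / "tessdata"
```

Neither helper checks that the path exists on disk, which matches the tests above: they compare paths under an empty `tmp_path`.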
class TestAutodetectFavoursBundled:
"""When a bundled binary exists, ``_autodetect_tesseract_path``
should return it BEFORE falling through to Windows install
locations — frozen builds shouldn't depend on the user's
system tesseract even on Windows."""
def test_bundled_wins_over_windows_program_files(
self, monkeypatch, tmp_path,
):
# Simulate frozen Windows build with a bundled binary on disk.
bundle_root = tmp_path / "bundle"
bundled_bin = bundle_root / "tesseract" / "tesseract.exe"
bundled_bin.parent.mkdir(parents=True)
bundled_bin.write_bytes(b"")
monkeypatch.setattr("sys.frozen", True, raising=False)
monkeypatch.setattr(
"sys._MEIPASS", str(bundle_root), raising=False,
)
monkeypatch.setattr("platform.system", lambda: "Windows")
# Pretend the Program Files install also exists — bundled
# should still win because we probe it first.
monkeypatch.setattr(Path, "exists", lambda self: True)
assert pdf_extract._autodetect_tesseract_path() == str(bundled_bin)
def test_falls_through_when_not_frozen(self, monkeypatch):
# Dev: not frozen, not Windows → no candidate at all.
monkeypatch.delattr("sys.frozen", raising=False)
monkeypatch.delattr("sys._MEIPASS", raising=False)
monkeypatch.setattr("platform.system", lambda: "Linux")
assert pdf_extract._autodetect_tesseract_path() is None
class TestApplyBundledTessdataPrefix:
"""``TESSDATA_PREFIX`` env var handling — bundled data should be
pointed at without clobbering a user override."""
def test_no_op_when_not_frozen(self, monkeypatch):
monkeypatch.delenv("TESSDATA_PREFIX", raising=False)
monkeypatch.delattr("sys.frozen", raising=False)
monkeypatch.delattr("sys._MEIPASS", raising=False)
_apply_bundled_tessdata_prefix()
assert "TESSDATA_PREFIX" not in os.environ
def test_sets_when_frozen_and_bundled_exists(
self, monkeypatch, tmp_path,
):
tessdata = tmp_path / "tesseract" / "tessdata"
tessdata.mkdir(parents=True)
monkeypatch.setattr("sys.frozen", True, raising=False)
monkeypatch.setattr("sys._MEIPASS", str(tmp_path), raising=False)
monkeypatch.delenv("TESSDATA_PREFIX", raising=False)
_apply_bundled_tessdata_prefix()
assert os.environ.get("TESSDATA_PREFIX") == str(tessdata)
def test_does_not_clobber_user_override(self, monkeypatch, tmp_path):
tessdata = tmp_path / "tesseract" / "tessdata"
tessdata.mkdir(parents=True)
monkeypatch.setattr("sys.frozen", True, raising=False)
monkeypatch.setattr("sys._MEIPASS", str(tmp_path), raising=False)
monkeypatch.setenv("TESSDATA_PREFIX", "/user/picked/this")
_apply_bundled_tessdata_prefix()
assert os.environ["TESSDATA_PREFIX"] == "/user/picked/this"
def test_no_op_when_bundled_dir_missing(self, monkeypatch, tmp_path):
# Frozen, but the build didn't ship a tessdata dir.
monkeypatch.setattr("sys.frozen", True, raising=False)
monkeypatch.setattr("sys._MEIPASS", str(tmp_path), raising=False)
monkeypatch.delenv("TESSDATA_PREFIX", raising=False)
_apply_bundled_tessdata_prefix()
assert "TESSDATA_PREFIX" not in os.environ
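The four cases above reduce to two guards before the environment variable is written. The sketch below is an assumption about how such a helper might look; `apply_tessdata_prefix` is a hypothetical name, and treating an empty `TESSDATA_PREFIX` as unset is a choice made here, not something the tests specify.

```python
import os
from pathlib import Path
from typing import Optional


def apply_tessdata_prefix(bundled_dir: Optional[Path]) -> bool:
    """Point TESSDATA_PREFIX at bundled data unless the user already set it.

    Returns True only when this call set the variable.
    """
    # Not frozen, or the build shipped no tessdata directory: do nothing.
    if bundled_dir is None or not bundled_dir.is_dir():
        return False
    # A user-provided value wins (assumption: empty string counts as unset).
    if os.environ.get("TESSDATA_PREFIX"):
        return False
    os.environ["TESSDATA_PREFIX"] = str(bundled_dir)
    return True
```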


@@ -322,3 +322,499 @@ class TestSoftDependencies:
assert len(order) == len(TOOL_NAMES), (
f"SOFT_DEPENDENCIES contain a cycle; topo order={order}"
)
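The length comparison in that assertion is the usual way to detect a cycle with Kahn's algorithm: nodes on a cycle never reach in-degree zero, so they are missing from the order. A minimal sketch follows, assuming `SOFT_DEPENDENCIES` maps each tool to the tools that should run before it; that shape is a guess, and `topo_order` is an illustrative name.

```python
from collections import deque
from typing import Dict, List


def topo_order(names: List[str], deps: Dict[str, List[str]]) -> List[str]:
    """Kahn's algorithm; deps[x] lists the tools that should run before x.

    The result is shorter than ``names`` when the graph contains a cycle.
    Assumes every tool mentioned in ``deps`` also appears in ``names``.
    """
    indegree = {name: 0 for name in names}
    dependents: Dict[str, List[str]] = {name: [] for name in names}
    for tool, befores in deps.items():
        for before in befores:
            indegree[tool] += 1
            dependents[before].append(tool)

    queue = deque(name for name in names if indegree[name] == 0)
    order: List[str] = []
    while queue:
        current = queue.popleft()
        order.append(current)
        for nxt in dependents[current]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return order
```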
# ---------------------------------------------------------------------------
# Per-adapter summary correctness — exact numbers on KNOWN-messy input.
# Each adapter is also exercised through run_pipeline so the StepResult
# carries the adapter's summary verbatim.
# ---------------------------------------------------------------------------
def _run_one(df, tool, options):
"""Run a single-step pipeline and return (StepResult, PipelineResult)."""
res = run_pipeline(df, Pipeline(steps=[Step(tool, options)]))
return res.step_results[0], res
class TestTextCleanSummary:
def test_two_trimmable_cells_counted(self):
df = pd.DataFrame({
"a": [" x ", "y", " z"], # 2 cells need trimming (" x ", " z")
"b": ["ok", "fine", "good"], # already clean
})
out, summary = TOOL_ADAPTERS["text_clean"](df, {"trim": True})
assert summary["cells_total"] == 6
assert summary["cells_changed"] == 2
assert sorted(summary["columns_processed"]) == ["a", "b"]
assert out["a"].tolist() == ["x", "y", "z"]
def test_title_case_changes_all_cells(self):
df = pd.DataFrame({"name": ["alice smith", "BOB JONES"]})
out, summary = TOOL_ADAPTERS["text_clean"](df, {"case": "title"})
assert summary["cells_changed"] == 2
assert out["name"].tolist() == ["Alice Smith", "Bob Jones"]
def test_collapse_whitespace_counts_internal_runs(self):
df = pd.DataFrame({"name": ["a  b", "c d", "e   f"]})
out, summary = TOOL_ADAPTERS["text_clean"](
df, {"trim": True, "collapse_whitespace": True},
)
# "a  b" and "e   f" contain multi-space runs and collapse; "c d" is already single-spaced.
assert summary["cells_changed"] == 2
assert out["name"].tolist() == ["a b", "c d", "e f"]
def test_summary_visible_through_run_pipeline(self):
df = pd.DataFrame({"a": [" x ", "y"]})
sr, _res = _run_one(df, "text_clean", {"trim": True})
assert sr.skipped is False
assert sr.error is None
assert sr.summary["cells_changed"] == 1
assert sr.summary["cells_total"] == 2
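The counting rule these tests pin (a cell counts as changed only if its cleaned value differs from the input) can be sketched without pandas. This is an illustration on a plain dict of columns, not the real `text_clean` adapter; `clean_cells` is a hypothetical name and only the `trim` and `collapse_whitespace` options are modeled.

```python
import re
from typing import Dict, List, Tuple


def clean_cells(
    table: Dict[str, List[str]],
    trim: bool = True,
    collapse_whitespace: bool = False,
) -> Tuple[Dict[str, List[str]], Dict[str, object]]:
    """Clean every string cell and report how many cells actually changed."""
    total = 0
    changed = 0
    out: Dict[str, List[str]] = {}
    for column, values in table.items():
        cleaned: List[str] = []
        for value in values:
            new = value.strip() if trim else value
            if collapse_whitespace:
                new = re.sub(r"\s+", " ", new)
            total += 1
            if new != value:
                changed += 1
            cleaned.append(new)
        out[column] = cleaned
    summary = {
        "cells_total": total,
        "cells_changed": changed,
        "columns_processed": list(table),
    }
    return out, summary
```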
class TestFormatStandardizeSummary:
def test_one_unparseable_phone(self):
df = pd.DataFrame({
"phone": ["(415) 555-1234", "not a phone", "+44 20 7946 0958"],
})
out, summary = TOOL_ADAPTERS["format_standardize"](
df, {"column_types": {"phone": "phone"}},
)
assert summary["cells_total"] == 3
assert summary["cells_unparseable"] == 1
assert summary["cells_changed"] == 2
assert summary["columns_processed"] == ["phone"]
assert out["phone"].tolist() == [
"+14155551234", "not a phone", "+442079460958",
]
def test_date_standardization_counts(self):
df = pd.DataFrame({"signup_date": ["2024-01-05", "Jan 5 2024", "garbage"]})
out, summary = TOOL_ADAPTERS["format_standardize"](
df, {"column_types": {"signup_date": "date"}},
)
# "2024-01-05" already canonical; "Jan 5 2024" rewritten; "garbage" fails.
assert summary["cells_unparseable"] == 1
assert summary["cells_changed"] == 1
assert out["signup_date"].tolist()[:2] == ["2024-01-05", "2024-01-05"]
def test_summary_visible_through_run_pipeline(self):
df = pd.DataFrame({"phone": ["(415) 555-1234", "bad"]})
sr, _ = _run_one(df, "format_standardize", {"column_types": {"phone": "phone"}})
assert sr.summary["cells_unparseable"] == 1
assert sr.summary["columns_processed"] == ["phone"]
class TestMissingSummary:
def test_median_fills_each_blank(self):
df = pd.DataFrame({"val": [1.0, np.nan, 3.0, np.nan, 5.0]})
out, summary = TOOL_ADAPTERS["missing"](df, {"strategy": "median"})
assert summary["cells_filled"] == 2 # exactly the 2 NaNs
assert summary["rows_dropped"] == 0
assert out["val"].tolist() == [1.0, 3.0, 3.0, 3.0, 5.0] # median is 3.0
def test_drop_row_by_threshold(self):
# row_drop_threshold is the *fraction* of nulls needed to drop a row.
df = pd.DataFrame({
"a": [1.0, np.nan, 3.0],
"b": ["x", np.nan, "z"], # middle row is 100% null
})
out, summary = TOOL_ADAPTERS["missing"](
df, {"strategy": "drop_row", "row_drop_threshold": 0.4},
)
assert summary["rows_dropped"] == 1
assert len(out) == 2
def test_sentinel_standardization_count(self):
df = pd.DataFrame({"x": ["ok", "N/A", "fine", "N/A"]})
out, summary = TOOL_ADAPTERS["missing"](df, {
"strategy": "none",
"sentinels": ["N/A"],
"standardize_sentinels": True,
})
assert summary["sentinels_standardized"] == 2
# The two "N/A" cells became real NaN.
assert out["x"].isna().sum() == 2
assert out["x"].tolist()[0] == "ok"
def test_summary_visible_through_run_pipeline(self):
df = pd.DataFrame({"val": [1.0, np.nan, 3.0]})
sr, _ = _run_one(df, "missing", {"strategy": "median"})
assert sr.summary["cells_filled"] == 1
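Two of the behaviors above, median fill and threshold-based row dropping, are easy to illustrate with the standard library. The sketch below is not the real `missing` adapter; `fill_with_median` and `drop_sparse_rows` are hypothetical names, and dropping a row when its null fraction is greater than or equal to the threshold is an assumption (the test would pass with a strict comparison too).

```python
import math
from statistics import median
from typing import List, Sequence, Tuple


def is_missing(value: object) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_with_median(values: Sequence[float]) -> Tuple[List[float], int]:
    """Replace missing entries with the median of the present ones."""
    present = [v for v in values if not is_missing(v)]
    if not present:
        return list(values), 0
    mid = median(present)
    filled = [mid if is_missing(v) else v for v in values]
    cells_filled = sum(1 for v in values if is_missing(v))
    return filled, cells_filled


def drop_sparse_rows(rows: Sequence[Sequence[object]], threshold: float) -> Tuple[list, int]:
    """Drop rows whose fraction of missing cells reaches the threshold.

    Assumes every row has at least one cell.
    """
    kept = [
        row for row in rows
        if sum(is_missing(v) for v in row) / len(row) < threshold
    ]
    return kept, len(rows) - len(kept)
```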
class TestColumnMapSummary:
def test_single_rename(self):
df = pd.DataFrame({"old": [1, 2], "keep": [3, 4]})
out, summary = TOOL_ADAPTERS["column_map"](
df, {"mapping": {"old": "new"}, "unmapped": "keep"},
)
assert summary["columns_renamed"] == 1
assert summary["columns_dropped"] == []
assert list(out.columns) == ["new", "keep"]
def test_unmapped_drop_reports_dropped_columns(self):
df = pd.DataFrame({"old": [1, 2], "keep": [3, 4]})
out, summary = TOOL_ADAPTERS["column_map"](
df, {"mapping": {"old": "new"}, "unmapped": "drop"},
)
assert summary["columns_renamed"] == 1
assert summary["columns_dropped"] == ["keep"]
assert list(out.columns) == ["new"]
def test_summary_visible_through_run_pipeline(self):
df = pd.DataFrame({"old": [1], "keep": [2]})
sr, _ = _run_one(df, "column_map", {"mapping": {"old": "new"}})
assert sr.summary["columns_renamed"] == 1
class TestDedupSummary:
def test_exact_duplicate_rows(self):
df = pd.DataFrame({
"email": ["a@x.com", "b@x.com", "a@x.com", "a@x.com"],
"name": ["A", "B", "A", "A"],
})
out, summary = TOOL_ADAPTERS["dedup"](df, {"survivor_rule": "first"})
assert summary["input_rows"] == 4
assert summary["output_rows"] == 2
assert summary["duplicates_removed"] == 2
assert summary["groups"] == 1
assert out["email"].tolist() == ["a@x.com", "b@x.com"]
def test_explicit_exact_strategy_on_column(self):
df = pd.DataFrame({
"email": ["a@x.com", "b@x.com", "a@x.com"],
"name": ["A", "B", "C"],
})
out, summary = TOOL_ADAPTERS["dedup"](df, {
"survivor_rule": "first",
"strategies": [{"columns": [
{"column": "email", "algorithm": "exact", "threshold": 100},
]}],
})
assert summary["duplicates_removed"] == 1
assert summary["groups"] == 1
def test_most_complete_keeps_fuller_survivor(self):
df = pd.DataFrame({
"email": ["a@x.com", "a@x.com"],
"name": ["", "Alice"], # second row is more complete
"phone": ["111", "111"],
})
out, summary = TOOL_ADAPTERS["dedup"](df, {"survivor_rule": "most_complete"})
assert summary["duplicates_removed"] == 1
assert out.iloc[0]["name"] == "Alice"
def test_no_duplicates_is_noop(self):
df = pd.DataFrame({"email": ["a@x.com", "b@x.com"], "name": ["A", "B"]})
out, summary = TOOL_ADAPTERS["dedup"](df, {"survivor_rule": "first"})
assert summary["duplicates_removed"] == 0
assert summary["output_rows"] == 2
def test_summary_visible_through_run_pipeline(self):
df = pd.DataFrame({"email": ["a@x.com", "a@x.com"], "name": ["A", "A"]})
sr, res = _run_one(df, "dedup", {"survivor_rule": "first"})
assert sr.summary["duplicates_removed"] == 1
assert res.final_rows == 1
# ---------------------------------------------------------------------------
# Data flow — a later step depends on an earlier step's output
# ---------------------------------------------------------------------------
class TestDataFlow:
def test_text_clean_enables_dedup_match(self):
# The two phones differ only by surrounding whitespace. dedup alone
# would also merge them, but trimming first means the survivor
# carries the cleaned value.
df = pd.DataFrame({"phone": [" +14155551234 ", "+14155551234"]})
p = Pipeline(steps=[
Step("text_clean", {"trim": True}),
Step("dedup", {"survivor_rule": "first"}),
])
res = run_pipeline(df, p)
assert res.initial_rows == 2
assert res.final_rows == 1
assert res.final_df["phone"].tolist() == ["+14155551234"]
def test_dedup_default_matching_normalizes_whitespace(self):
# Note: dedup's exact matcher already normalizes surrounding
# whitespace, so the two phones collapse even WITHOUT a prior
# text_clean. The survivor still carries the un-trimmed value.
df = pd.DataFrame({"phone": [" +14155551234 ", "+14155551234"]})
res = run_pipeline(df, Pipeline(steps=[Step("dedup", {"survivor_rule": "first"})]))
assert res.final_rows == 1
# Survivor keeps the raw (still-padded) text — dedup does not clean.
assert res.final_df["phone"].tolist() == [" +14155551234 "]
def test_chained_initial_and_final_rows(self):
df = pd.DataFrame({
"name": [" Al ", "al", "Bob"],
"v": [1, 1, 2],
})
p = Pipeline(steps=[
Step("text_clean", {"trim": True, "case": "title"}),
Step("dedup", {"survivor_rule": "first"}),
])
res = run_pipeline(df, p)
# " Al " and "al" both become "Al" → duplicate rows collapse.
assert res.initial_rows == 3
assert res.final_rows == 2
assert "Al" in res.final_df["name"].tolist()
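The difference between the first two tests comes down to matching on a normalized key while keeping the raw row. A minimal sketch of that idea, assuming a "first" survivor rule and whitespace-only normalization; `dedup_first` is an illustrative name and this is not the project's matcher.

```python
from typing import Dict, List


def dedup_first(rows: List[Dict[str, str]], column: str) -> List[Dict[str, str]]:
    """Keep the first row per normalized key; survivors are returned unmodified."""
    seen = set()
    survivors: List[Dict[str, str]] = []
    for row in rows:
        # The key is normalized for comparison only; the row itself is not cleaned.
        match_key = row[column].strip()
        if match_key in seen:
            continue
        seen.add(match_key)
        survivors.append(row)
    return survivors
```

Because the survivor is the raw row, running a cleaning step first is what decides whether the output carries padded or trimmed text.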
# ---------------------------------------------------------------------------
# Error handling — stop_on_error semantics
# ---------------------------------------------------------------------------
class TestErrorHandling:
def test_most_recent_without_date_raises_by_default(self, messy_df):
# dedup with survivor_rule="most_recent" but no date_column errors.
p = Pipeline(steps=[Step("dedup", {"survivor_rule": "most_recent"})])
with pytest.raises(InputValidationError):
run_pipeline(messy_df, p)
def test_continue_on_error_sets_error_string(self):
df = pd.DataFrame({"email": ["a@x.com", "a@x.com"], "name": ["A", "B"]})
p = Pipeline(steps=[
Step("text_clean", {"trim": True}),
Step("dedup", {"survivor_rule": "most_recent"}), # will fail
Step("missing", {"strategy": "none"}),
])
res = run_pipeline(df, p, stop_on_error=False)
bad = res.step_results[1]
assert bad.error is not None
assert isinstance(bad.error, str) and bad.error.strip()
# The failed step did NOT change the row count — previous df carried.
assert res.step_results[2].error is None
assert res.final_rows == 2
def test_failed_step_summary_is_empty(self):
df = pd.DataFrame({"e": ["a", "a"], "n": ["x", "y"]})
p = Pipeline(steps=[Step("dedup", {"survivor_rule": "most_recent"})])
res = run_pipeline(df, p, stop_on_error=False)
assert res.step_results[0].summary == {}
assert res.step_results[0].skipped is False
def test_config_error_on_bad_survivor_rule_propagates(self, messy_df):
p = Pipeline(steps=[Step("dedup", {"survivor_rule": "nonsense"})])
with pytest.raises(ConfigError):
run_pipeline(messy_df, p)
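The `stop_on_error` semantics exercised above can be sketched as a small runner loop. This is an assumption about the control flow, not the real `run_pipeline`: `StepOutcome`, `run_steps`, and the demo steps are hypothetical, and the sketch only models what the tests assert (a failed step records a non-empty error string and an empty summary, and the previous data carries forward).

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple

StepFn = Callable[[Any], Tuple[Any, Dict[str, Any]]]


@dataclass
class StepOutcome:
    name: str
    summary: Dict[str, Any] = field(default_factory=dict)
    error: Optional[str] = None


def run_steps(
    data: Any,
    steps: List[Tuple[str, StepFn]],
    stop_on_error: bool = True,
) -> Tuple[Any, List[StepOutcome]]:
    """Run steps in order; on failure either re-raise or record and continue."""
    outcomes: List[StepOutcome] = []
    current = data
    for name, fn in steps:
        try:
            result, summary = fn(current)
        except Exception as exc:
            if stop_on_error:
                raise
            # Record the failure and hand the previous data to the next step.
            outcomes.append(StepOutcome(name, error=str(exc) or type(exc).__name__))
            continue
        current = result
        outcomes.append(StepOutcome(name, summary=summary))
    return current, outcomes


def count_rows(rows: list) -> Tuple[list, Dict[str, Any]]:
    """Demo step that succeeds."""
    return rows, {"rows_seen": len(rows)}


def needs_date_column(rows: list) -> Tuple[list, Dict[str, Any]]:
    """Demo step that always fails, standing in for a misconfigured dedup."""
    raise ValueError("date_column is required")
```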
# ---------------------------------------------------------------------------
# Edge inputs
# ---------------------------------------------------------------------------
class TestEdgeInputs:
def test_empty_dataframe_runs_clean(self):
empty = pd.DataFrame({"name": [], "phone": []})
res = run_pipeline(
empty,
recommended_pipeline(options={"missing": {"strategy": "none"}}),
)
assert res.initial_rows == 0
assert res.final_rows == 0
assert all(sr.error is None for sr in res.step_results if not sr.skipped)
def test_single_column_dataframe(self):
df = pd.DataFrame({"name": [" Al ", "al"]})
res = run_pipeline(
df, Pipeline(steps=[Step("text_clean", {"trim": True, "case": "title"})]),
)
assert res.final_df["name"].tolist() == ["Al", "Al"]
def test_all_steps_disabled_returns_unchanged(self, messy_df):
snapshot = messy_df.copy(deep=True)
p = Pipeline(steps=[
Step("text_clean", enabled=False),
Step("format_standardize", enabled=False),
Step("missing", enabled=False),
Step("dedup", enabled=False),
])
res = run_pipeline(messy_df, p)
assert all(sr.skipped is True for sr in res.step_results)
assert res.final_rows == res.initial_rows == 5
pd.testing.assert_frame_equal(res.final_df, snapshot)
def test_empty_pipeline_is_identity(self, messy_df):
res = run_pipeline(messy_df, Pipeline(steps=[]))
assert res.step_results == []
assert res.final_rows == 5
pd.testing.assert_frame_equal(res.final_df, messy_df)
# ---------------------------------------------------------------------------
# Serialization round-trips with disabled / named / nested-option steps
# ---------------------------------------------------------------------------
class TestSerializationRoundtrips:
def test_disabled_and_named_step_survive_dict(self):
p = Pipeline(steps=[
Step("text_clean", {"trim": True}, enabled=False, name="Pre-clean"),
Step("dedup", {"survivor_rule": "first"}, name="Final dedup"),
])
loaded = Pipeline.from_dict(p.to_dict())
assert loaded.steps[0].enabled is False
assert loaded.steps[0].name == "Pre-clean"
assert loaded.steps[0].options == {"trim": True}
assert loaded.steps[1].name == "Final dedup"
assert loaded.steps[1].display_name() == "Final dedup"
def test_nested_options_survive_dict(self):
nested = {
"column_types": {"phone": "phone", "signup_date": "date"},
}
strat = {
"survivor_rule": "most_complete",
"strategies": [{"columns": [
{"column": "email", "algorithm": "exact", "threshold": 100},
]}],
}
p = Pipeline(steps=[
Step("format_standardize", nested),
Step("dedup", strat),
])
loaded = Pipeline.from_dict(p.to_dict())
assert loaded.steps[0].options["column_types"] == nested["column_types"]
assert loaded.steps[1].options["strategies"] == strat["strategies"]
def test_nested_options_survive_file(self, tmp_path):
p = Pipeline(steps=[
Step("format_standardize",
{"column_types": {"phone": "phone"}},
enabled=False, name="formats"),
])
path = tmp_path / "pipe.json"
p.to_file(path)
loaded = Pipeline.from_file(path)
assert loaded.steps[0].enabled is False
assert loaded.steps[0].name == "formats"
assert loaded.steps[0].options == {"column_types": {"phone": "phone"}}
def test_roundtrip_is_idempotent(self):
p = Pipeline(steps=[
Step("text_clean", {"trim": True}, enabled=False, name="x"),
Step("missing", {"strategy": "median"}),
])
once = Pipeline.from_dict(p.to_dict())
twice = Pipeline.from_dict(once.to_dict())
assert once.to_dict() == twice.to_dict() == p.to_dict()
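A round-trip like the one tested above is straightforward with dataclasses. The sketch below is a stand-in for the real `Step` and `Pipeline` classes; `StepSpec`, `PipelineSpec`, and the dict layout are assumptions chosen to mirror the fields the tests read (`tool`, `options`, `enabled`, `name`).

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class StepSpec:
    tool: str
    options: Dict[str, Any] = field(default_factory=dict)
    enabled: bool = True
    name: Optional[str] = None

    def to_dict(self) -> Dict[str, Any]:
        # Deep-copy options so nested dicts and lists are not shared.
        return {
            "tool": self.tool,
            "options": copy.deepcopy(self.options),
            "enabled": self.enabled,
            "name": self.name,
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "StepSpec":
        return cls(
            tool=data["tool"],
            options=copy.deepcopy(data.get("options", {})),
            enabled=data.get("enabled", True),
            name=data.get("name"),
        )


@dataclass
class PipelineSpec:
    steps: List[StepSpec] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        return {"steps": [step.to_dict() for step in self.steps]}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "PipelineSpec":
        return cls(steps=[StepSpec.from_dict(s) for s in data.get("steps", [])])
```

Writing every field explicitly, including defaults, is what makes a second round-trip produce an identical dict.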
# ---------------------------------------------------------------------------
# recommended_pipeline(include=...) — subsetting, ordering, option seeding
# ---------------------------------------------------------------------------
class TestRecommendedInclude:
def test_subset_preserves_given_order(self):
p = recommended_pipeline(include=["dedup", "text_clean"])
assert [s.tool for s in p.steps] == ["dedup", "text_clean"]
def test_column_map_first(self):
p = recommended_pipeline(include=[
"column_map", "text_clean", "format_standardize", "missing", "dedup",
])
assert p.steps[0].tool == "column_map"
assert len(p.steps) == 5
def test_column_map_last(self):
p = recommended_pipeline(include=[
"text_clean", "format_standardize", "missing", "dedup", "column_map",
])
assert p.steps[-1].tool == "column_map"
def test_unknown_tool_in_include_raises(self):
with pytest.raises(InputValidationError):
recommended_pipeline(include=["text_clean", "not_a_tool"])
def test_options_seeding_only_targets_named_tool(self):
p = recommended_pipeline(
include=["text_clean", "dedup"],
options={"dedup": {"survivor_rule": "last"}},
)
assert p.steps[0].options == {} # text_clean unseeded
assert p.steps[1].options == {"survivor_rule": "last"}
def test_empty_include_yields_no_steps(self):
p = recommended_pipeline(include=[])
assert p.steps == []
def test_seeded_options_are_independent_copies(self):
seed = {"text_clean": {"trim": True}}
p = recommended_pipeline(include=["text_clean"], options=seed)
# Mutating the produced step must not leak back into the seed.
p.steps[0].options["trim"] = False
assert seed["text_clean"]["trim"] is True
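The include, ordering, and option-seeding rules above can be sketched in a few lines. This is not the real `recommended_pipeline`: `build_steps` is a hypothetical name, the default order is an assumption, and the sketch raises `ValueError` where the project raises `InputValidationError`.

```python
import copy
from typing import Any, Dict, List, Optional

# Assumed default order and known tool names, for illustration only.
DEFAULT_ORDER = ("text_clean", "format_standardize", "missing", "dedup")
KNOWN_TOOLS = set(DEFAULT_ORDER) | {"column_map"}


def build_steps(
    include: Optional[List[str]] = None,
    options: Optional[Dict[str, Dict[str, Any]]] = None,
) -> List[Dict[str, Any]]:
    """Build step dicts in the caller's order, seeding options per tool."""
    tools = list(DEFAULT_ORDER) if include is None else list(include)
    unknown = [tool for tool in tools if tool not in KNOWN_TOOLS]
    if unknown:
        raise ValueError(f"unknown tools: {unknown}")
    seeds = options or {}
    # deepcopy keeps later mutation of a step from leaking into the seed.
    return [
        {"tool": tool, "options": copy.deepcopy(seeds.get(tool, {}))}
        for tool in tools
    ]
```

Distinguishing `include=None` from `include=[]` matters here: the first falls back to the default order, the second yields no steps.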
# ---------------------------------------------------------------------------
# Realistic demo integration — messy customers table end-to-end
# ---------------------------------------------------------------------------
class TestDemoIntegration:
@pytest.fixture
def customers_df(self):
return pd.DataFrame({
"Full Name": [" alice smith ", "BOB JONES", "alice smith", ""],
"Email": ["alice@x.com ", "bob@x.com", "alice@x.com", "carol@x.com"],
"Phone": [" +14155551234 ", "+442079460958",
"+14155551234", "+13035551111"],
})
def test_full_recommended_plus_column_map(self, customers_df):
p = recommended_pipeline(
include=["text_clean", "format_standardize", "missing",
"dedup", "column_map"],
options={
"text_clean": {"trim": True, "collapse_whitespace": True},
"missing": {"strategy": "none"},
"dedup": {
"survivor_rule": "most_complete",
"strategies": [{"columns": [
{"column": "Phone", "algorithm": "exact", "threshold": 100},
]}],
},
"column_map": {
"mapping": {"Full Name": "name", "Email": "email",
"Phone": "phone"},
"unmapped": "keep",
},
},
)
res = run_pipeline(customers_df, p)
# Two rows share the same phone after trimming → one duplicate removed.
assert res.initial_rows == 4
assert res.final_rows == 3
assert res.final_rows < res.initial_rows
# Headers were renamed by the trailing column_map step.
assert list(res.final_df.columns) == ["name", "email", "phone"]
# The surviving Alice row kept its (trimmed) phone.
phones = res.final_df["phone"].tolist()
assert "+14155551234" in phones
assert phones.count("+14155551234") == 1 # only one Alice survives
# Every executed step succeeded.
assert all(sr.error is None for sr in res.step_results if not sr.skipped)
# column_map reported the three renames.
cm = res.step_results[-1]
assert cm.step.tool == "column_map"
assert cm.summary["columns_renamed"] == 3
def test_demo_dedup_step_reports_one_duplicate(self, customers_df):
p = recommended_pipeline(options={
"text_clean": {"trim": True},
"missing": {"strategy": "none"},
"dedup": {
"survivor_rule": "most_complete",
"strategies": [{"columns": [
{"column": "Phone", "algorithm": "exact", "threshold": 100},
]}],
},
})
res = run_pipeline(customers_df, p)
dedup_sr = next(s for s in res.step_results if s.step.tool == "dedup")
assert dedup_sr.summary["duplicates_removed"] == 1
assert dedup_sr.summary["groups"] == 1


@@ -157,6 +157,78 @@ class TestLocalizedAccessors:
assert label and label != f"nav.section_{section}"
class TestDescriptionCopy:
"""The post-jargon-strip descriptions are intentionally tight one-
liners. Pin them so future drift toward bloated marketing copy
(or an accidentally-empty string) is caught by CI."""
# Roomy upper bound; the tightest description today is ~60 chars
# and the longest is just over 90. ~120 leaves headroom for minor
# copy tweaks without inviting paragraph-length card bodies.
_MAX_DESCRIPTION_CHARS = 120
def test_every_description_is_non_empty(self):
empty = [t.tool_id for t in TOOLS if not t.description.strip()]
assert not empty, f"tools with empty descriptions: {empty}"
def test_every_description_under_max_chars(self):
too_long = [
(t.tool_id, len(t.description))
for t in TOOLS
if len(t.description) > self._MAX_DESCRIPTION_CHARS
]
assert not too_long, (
f"tool descriptions exceed {self._MAX_DESCRIPTION_CHARS} chars: "
f"{too_long}"
)
class TestRenderToolHeaderSmoke:
"""``render_tool_header`` is the helper every tool page now calls in
place of ``st.title(...) + st.caption(...)``. We can't render it
without a Streamlit script context, but we CAN verify it imports
cleanly via the public ``src.gui.components`` surface and resolves
the expected i18n keys for a known tool id."""
def test_importable_from_public_components_package(self):
from src.gui.components import render_tool_header
assert callable(render_tool_header)
def test_listed_in_public_all(self):
# The public ``__all__`` is what per-tool builds key off; a
# removal here would silently break tool pages that import
# from ``src.gui.components`` directly.
from src.gui import components as components_pkg
assert "render_tool_header" in components_pkg.__all__
def test_resolves_expected_i18n_keys_for_known_tool(self):
# The helper reads five pack keys per render:
# ``tools.<id>.page_title``, ``tools.<id>.page_caption``,
# ``tools.<id>.help_md``, plus shared ``help.button_label`` /
# ``help.missing_body``. We don't invoke the helper (no script
# context) — we verify the keys it would touch resolve to
# non-empty strings in both packs.
from src.i18n import t as _t
tool_id = "02_text_cleaner"
for lang in ("en", "es"):
for suffix in ("page_title", "page_caption", "help_md"):
key = f"tools.{tool_id}.{suffix}"
value = _t(key, lang)
assert value and value != key, (
f"render_tool_header({tool_id!r}) "
f"would render the literal key {key!r} in {lang!r}"
)
for key in ("help.button_label", "help.missing_body"):
value = _t(key, lang)
assert value and value != key, (
f"render_tool_header would render the literal key "
f"{key!r} in {lang!r}"
)
class TestReconcilerAndPdfArePresent:
"""The two newest pages were the most likely to be forgotten in
the registry — pin them explicitly so a regression flagging
@@ -167,12 +239,15 @@ class TestReconcilerAndPdfArePresent:
assert tool is not None
assert tool.page_slug == "10_PDF_Extractor"
assert tool.status == "Ready"
# PDF to CSV + Reconcile live in the "Finance" group (outside the
# cleaning flow) per DECISIONS.md 2026-06-08.
assert tool.section == "finance"
def test_reconciler_present(self):
tool = tool_by_id("11_reconciler")
assert tool is not None
assert tool.page_slug == "11_Reconciler"
assert tool.status == "Ready"
- # The new "analysis" section was introduced with this tool;
- # if the section disappears, the sidebar group goes empty.
- assert tool.section == "analysis"
+ # Reconcile sits in the "Finance" group (see DECISIONS.md
+ # 2026-06-08); if that section disappears the sidebar goes empty.
+ assert tool.section == "finance"