demo: reconstruct sales demos for an accounting audience

Replaces the Shopify / RevOps / Bookkeeper demo trio with three accounting
personas that share one buyer, each entering through a workflow where a
messy export costs money — all running the same saved 4-step pipeline:

- bank_reconciliation.csv (Bookkeeper): 26 -> 20 rows, 6 double-posted
  transactions caught after date+amount standardization.
- vendor_1099.csv (AP / 1099): 24 records -> 8 vendors, 7 missing EINs
  recovered via dedup merge — the 1099-complete story.
- ar_open_invoices.csv (AR): 26 -> 21 rows, 5 double-entered invoices
  removed, blank status backfilled from the twin row.

Every number is validated against the live engine and pinned by
tests/test_demo_pipelines.py (read path mirrors app_demo._load_demo:
dtype=str, keep_default_na=False). Rewires src/gui/app_demo.py PERSONAS
(keys bookkeeper / ap-1099 / ar-aging, accounting H1/sub/CTA) and rewrites
docs/DEMO-PLAN.md sections 3/4/7 with the validated outcomes.

(Repo hygiene forced by a partial-clone gap: finalizes the already-deleted,
unreferenced samples/messy_text.csv whose blob was unrecoverable.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 18:52:39 +00:00
parent 38616d69e2
commit 6df726e69e
16 changed files with 404 additions and 441 deletions

@@ -32,17 +32,22 @@ rebuilds it from a stale headline.
| Friction kills conversion | BUSINESS.md §7 | Demo dataset preloaded; no "select a file" first-step | | Friction kills conversion | BUSINESS.md §7 | Demo dataset preloaded; no "select a file" first-step |
| < $1,200/mo recurring | BUSINESS.md §9 | Migration plan to $5/mo VPS only after rate-limit signal | | < $1,200/mo recurring | BUSINESS.md §9 | Migration plan to $5/mo VPS only after rate-limit signal |
## 3. The three personas (per PLAN.md §2.3) ## 3. The three personas — one audience: accounting (per PLAN.md §2.3)
We niche to **accounting** and enter through the three workflows where a
messy export costs real money. Same engine, three landing pages — each
is the same buyer at a different desk (bookkeeping, payables, receivables).
| Tag | Persona | Top-of-funnel keyword | Demo dataset | Pre-saved pipeline | | Tag | Persona | Top-of-funnel keyword | Demo dataset | Pre-saved pipeline |
|---|---|---|---|---| |---|---|---|---|---|
| `shopify-pet` | Shopify operator (priority: pet supplies) | "shopify customer cleanup" | `samples/demo/shopify_pet_customers.csv` | `shopify_pet_pipeline.json` | | `bookkeeper` | Bookkeeper — bank reconciliation | "reconcile bank export csv duplicates" | `samples/demo/bank_reconciliation.csv` | `bank_reconciliation_pipeline.json` |
| `bookkeeper` | Bookkeeper / freelance accountant | "reconcile bank export csv" | `samples/demo/bookkeeper_bank_reconcile.csv` | `bookkeeper_bank_pipeline.json` | | `ap-1099` | Accounts payable — 1099 vendor prep | "clean 1099 vendor list missing EIN" | `samples/demo/vendor_1099.csv` | `vendor_1099_pipeline.json` |
| `revops` | Marketing / RevOps agency | "dedupe lead list across vendors" | `samples/demo/agency_combined_leads.csv` | `agency_leads_pipeline.json` | | `ar-aging` | Accounts receivable — open invoices | "remove duplicate invoices aging report" | `samples/demo/ar_open_invoices.csv` | `ar_open_invoices_pipeline.json` |
Each persona gets its **own landing page URL**, its **own demo dataset Each persona gets its **own landing page URL** (`?p=<tag>`), its **own
loaded by default**, and its **own H1 + below-the-fold copy.** The demo dataset loaded by default**, and its **own H1 + below-the-fold
engine is identical; only positioning differs. copy** — wired in `src/gui/app_demo.py::PERSONAS`. The engine is
identical; only positioning differs.
## 4. Demo dataset specifications ## 4. Demo dataset specifications
@@ -53,114 +58,77 @@ persona's tooling. Each contains every kind of pollution the bundle's
five tools fix, so a single demo run shows every tool earning its five tools fix, so a single demo run shows every tool earning its
keep. keep.
### 4.0 Pain-point coverage map ### 4.0 Value-proof map
Each demo dataset is engineered so the buyer sees their **own top Each demo dataset is engineered so the buyer sees their **own top pain**
pain** demonstrated in the AFTER preview. The mapping below pairs fixed in the AFTER preview, with one unmistakable headline number. All
each pain from PLAN.md §2.3a with the rows / columns that exercise three run the same saved 4-step pipeline (Clean Text → Standardize
it. Refresh the dataset only when this coverage drops. Formats → Fix Missing Values → Find Duplicates). The numbers below are
**validated against the live engine** (`tests/test_demo_pipelines.py`
pins them) — refresh the dataset only if a number stops landing.
| Persona | Pain (from PLAN §2.3a) | Demo coverage | | Persona | Headline proof | What the visitor watches happen |
|---|---|---| |---|---|---|
| Shopify pet | S1 — Klaviyo per-contact dupes | 5 dup pairs across rows 1-15 (case + format + address-twin variants) | | Bookkeeper | **26 → 20 rows · 6 phantom duplicates removed** | The same payment posted twice (different date + amount format) collapses to one; dates go ISO, parens-negatives become real negatives |
| Shopify pet | S2 — feed-rejection chars | smart-quote / NBSP / BOM in rows 16, 9, 11 | | AP / 1099 | **24 records → 8 vendors · 7 missing EINs recovered** | Each vendor's scattered records merge into one complete row; `merge=true` backfills the EIN/address/phone that any single record was missing |
| Shopify pet | S3 — multi-channel | partner-style customer IDs (`SHOP-`); demonstration of column-level mapping covered in RevOps demo | | AR aging | **26 → 21 rows · 5 double-entered invoices removed** | Duplicate invoice numbers collapse; a blank status is backfilled from its twin; invoice + due dates go ISO, amounts numeric |
| Shopify pet | S4 — subscription identity | rows 1+2, 7+8, 9+10 — same person, different format |
| Shopify pet | S5 — VAT-MOSS country drift | rows 16-18 (`United Kingdom` / `U.K.` / `UK`) + rows 19-20 (`Germany`/`Italia`) |
| Bookkeeper | B1 — month-overlap re-import | 7 dup pairs spanning Jan↔Feb and Mar boundaries |
| Bookkeeper | B2 — 1099 vendor consolidation | Amazon × 3 spellings, Verizon × 2, Acme Realty × 2, Adobe × 2, Costco × 2, Zoom × 2, Stripe × 4 |
| Bookkeeper | B3 — audit trail | every cell change in the run logged with old/new/rule — surface in the demo's audit tab |
| Bookkeeper | B4 — per-license economics | demonstrated by pricing copy, not data |
| Bookkeeper | B5 — multi-currency | rows 26 (EUR), 27 (GBP), 28 (BRL with comma decimal), 29 (parens-negative) |
| RevOps | R1 — per-contact tier | 6 cross-source dup pairs (HubSpot × LinkedIn × Manual Scrape) |
| RevOps | R2 — deliverability | rows 26-27 (`uma at uniform dot com`, `victor@@victorco.com` invalid emails) |
| RevOps | R3 — GDPR / privacy | demonstrated by the network-tab moat panel + zero-upload claim |
| RevOps | R4 — vendor unification | 3 source values (HubSpot / LinkedIn / Manual Scrape), 13 country codes, mixed-shape headers |
| RevOps | R5 — suppression list | rows 29-30 (`Suppressed`, `Opted Out` tags) |
### 4.1 `shopify_pet_customers.csv` (20 rows) ### 4.1 `bank_reconciliation.csv` (26 rows) — Bookkeeper
**Looks like**: a Shopify customer export filtered for "Pet Supplies" **Looks like**: two months (Jan + Feb 2025) of business-checking activity
sales channel, 12 months activity. from a bank portal, where the Feb re-export overlaps Jan so the same
transaction posts twice. Columns: `Date, Description, Vendor, Category,
Amount, Account`.
**Pollution included**: **Pollution included**:
- Whitespace padding (" Alice ", "Sydney Opera House Drive ") - Mixed date formats: `01/15/2025`, `2025-01-15`, `Jan 18 2025`, `1/27/25`, `Feb 5 2025`.
- Mixed phone formats: `(415) 555-1234`, `415.555.1234`, `5559876543`, - Currency formats incl. negatives: `-$129.99`, `($89.50)` parens-negative, `+$3,450.00`, `- $599.88`, bare `-129.99`, `(50.00)`.
`+1 555-111-1111` - Whitespace + NBSP padding; smart quotes and an em-dash inside descriptions.
- International phones: GB, ES, DE, AU, JP (15 demo rows span 6 - Vendor casing variety on *non-duplicate* rows: `Amazon` / `amazon.com` / `AMAZON.COM`, `Verizon` / `verizon`.
countries) - Disguised nulls in Category: `—`, `(blank)`, `?`, `unknown`, `TBD`.
- Currency variants: `$1,240.50`, `£890.25`, `€2.410,75` (EU comma - **6 duplicate transactions** — each pair shares the same vendor + real value but a different date *and* amount format, so they collapse only after standardization.
decimal), `A$ 1,299.00`, `¥75000`
- Date formats: `2025-12-04`, `12/15/2025`, `?`, `(blank)`, `(none)`,
`#N/A`
- Disguised nulls: `N/A`, blank, `(blank)`, `?`, `#N/A`, `(none)`,
`unknown`
- Name casing: `EVE MARTINEZ`, `henry`, `O'NEIL`, `noah`, mixed Title /
ALL CAPS / lower
- Email case variants that *should* dedup: `Bob@PetShop.com` vs
`alice@petshop.com`
- 4 fuzzy duplicates (Alice/Bob same address, Grace/Henry same phone,
Carlos/Olivia same address, Ivy/Jack same address)
**After running the pipeline**: 20 rows → 15, ~29 cells canonicalized, **After running the pipeline** (validated): **26 → 20 rows, 6 duplicates
~45 sentinels standardised, 5 cross-row duplicates merged. The removed**, 36 date/amount cells standardized (0 unparseable), all dates
customer table is now Klaviyo-import-ready and the country column ISO, parens-negatives resolved (`($89.50)` → `-89.50`), disguised-null
(previously `UK` / `U.K.` / `United Kingdom` / `Germany` / `Italia`) categories flagged. The reconciliation ties out.
is GB / DE / IT — VAT MOSS report won't break.
### 4.2 `bookkeeper_bank_reconcile.csv` (30 rows) ### 4.2 `vendor_1099.csv` (24 rows) — Accounts payable / 1099
**Looks like**: two months of business checking + credit-card activity **Looks like**: a 1099-NEC vendor master list where the same vendor was
exported from a bank portal, with the Feb export accidentally entered 2-3 times across the year by different staff, each record holding
overlapping the Jan export at the month boundary. only *part* of the vendor's details. Columns: `Vendor, Contact, Email,
Phone, EIN, Address, Total_Paid`.
**Pollution included**: **Pollution included**:
- Mixed date formats: `01/15/2025`, `2025-01-15`, `Jan 18 2025`, - The duplicate records for a vendor share one email differing only by case/whitespace (the reliable dedup key, matched with the `email` normalizer).
`1/27/25`, `Feb 5 2025` - EIN / Phone / Address scattered across the duplicate set so no single record is complete but the union is — gaps marked `—`, `(blank)`, `TBD`, `unknown`, `N/A`.
- Currency formats: `-$129.99`, `($89.50)` parens-negative, - Vendor name casing/spelling variants, phone formats, EIN formats (`12-3456789` vs `123456789`), `Total_Paid` currency variants.
`+$3,450.00`, `- $599.88` space, bare `-129.99`, `(50.00)`
- Header trailing whitespace: `"Date "`
- Smart quotes around descriptions: `"autopay"`
- Em-dash sentinels in Vendor: `—`
- Smart-em-dash inside descriptions: `STAPLES #4422 — paper, toner`
- Vendor casing inconsistency: `Amazon` / `amazon.com` / `AMAZON.COM`,
`Verizon` / `verizon`
- 6 duplicate transactions (same date+amount+vendor recorded twice
with different formats)
**After running the pipeline**: 30 rows → 23, ~84 cells normalized, 7 **After running the pipeline** (validated): **24 records → 8 vendors, 16
duplicates removed (month-overlap + VAT-MOSS dups). All dates duplicates removed, 7 missing EINs recovered** by `merge=true` +
ISO-formatted, all amounts numeric (including EUR/GBP/BRL with comma `most_complete` survivor, 35 disguised nulls caught, phones/emails/amounts
decimal), vendor casing canonical, parens-negative resolved. standardized (0 unparseable). One vendor genuinely has no EIN in any
record — it survives with a blank EIN as the realistic "flag for
follow-up" case.
### 4.3 `agency_combined_leads.csv` (30 rows) ### 4.3 `ar_open_invoices.csv` (26 rows) — Accounts receivable
**Looks like**: a marketing-ops worksheet combining lead exports from **Looks like**: an open-invoices (unpaid AR) export where some invoices
HubSpot + LinkedIn Sales Navigator + manual scraping, ready for were double-entered in different formats and client contacts are messy.
campaign targeting. Columns: `Invoice, Client, Email, Invoice_Date, Due_Date, Amount, Status`.
**Pollution included**: **Pollution included**:
- Phone formats per region: US, UK, Spain, Germany, China, India, - Two date columns with mixed formats; currency variants incl. a credit memo `($300.00)` → `-300.00`.
Australia, Mexico, Israel, Singapore, Hong Kong, Italy, South - Client name casing variety; email case variants (`AP@Acme.com` vs `ap@acme.com`).
Korea — 13 country codes - Status disguised nulls: `—`, `?`, `(blank)`, `TBD`, `unknown`, `(none)`.
- Country column inconsistent: `USA` / `US` / `United States` - **5 double-entered invoices** — same invoice number twice, dates/amount in different formats, one copy with a blank status the other fills.
- Disguised nulls: `N/A`, `unknown`, `(unknown)`, `(blank)`, `(none)`,
`?`, `—`, `#N/A`, `TBD`
- Source column tags origin (`HubSpot` / `LinkedIn` / `Manual Scrape`)
- Email duplicates across sources with case variants: `alice@acme.com`
+ `Alice.Johnson@acme.com`, `bob@beta.com` + `Bob@Beta.com`,
`diana@delta.com` from two sources, `carlos@gamma.io` from two
sources, `Frank@Foxtrot.de` + `frank@foxtrot.de`
- Name casing: `DIANA LEE`, `henry`, `IVY CHEN`, mixed
- 6 fuzzy / cross-source duplicates designed to survive the dedup
- Score column with sentinel pollution that needs coercion to integer
**After running the pipeline**: 30 rows → 24, ~43 cells canonicalized, **After running the pipeline** (validated): **26 → 21 rows, 5 duplicate
14 sentinels resolved, 6 cross-source duplicates merged with `merge=true` invoices removed**, both date columns ISO + amounts numeric + emails
so each survivor inherits the most-complete picture. Invalid-email lowercased (0 unparseable), 7 disguised-null statuses caught, and a blank
rows (deliverability stress) and `Suppressed`/`Opted Out` tags status backfilled from its twin via `merge=true`. The aging report stops
(suppression-list use case) survive as flagged rows the operator double-counting.
manually reviews.
## 5. UX flow (per persona) ## 5. UX flow (per persona)
@@ -174,26 +142,26 @@ dedicated `app_demo.py` for the cloud build).
│ "{Persona-specific H1}" │ │ "{Persona-specific H1}" │
├──────────────────────────────────────────────────────────┤ ├──────────────────────────────────────────────────────────┤
│ │ │ │
│ Sample dataset preloaded: shopify_pet_customers.csv │ │ Sample dataset preloaded: bank_reconciliation.csv
│ [Replace with your own file (capped 100 rows)] │ │ [Replace with your own file (capped 100 rows)] │
│ │ │ │
│ ┌─ BEFORE preview (15 rows) ─────────────────────────┐ │ │ ┌─ BEFORE preview (26 rows) ─────────────────────────┐ │
│ │ Alice | (415) 555-1234 | $1,240.50 | … │ │ │ │ 01/15/2025 | Stripe | +$3,450.00 | … │ │
│ │ Bob | 415.555.1234 | $1,240.50 | … │ │ │ │ 2025-01-15 | Stripe | 3450.00 | … (dup) │ │
│ │ ... │ │ │ │ ... │ │
│ └──────────────────────────────────────────────────┘ │ │ └──────────────────────────────────────────────────┘ │
│ │ │ │
│ Pipeline (saved): │ │ Pipeline (saved): │
│ 1. Text Clean → 2. Format Standardize → │ │ 1. Clean Text → 2. Standardize Formats → │
│ 3. Missing → 4. Deduplicate │ 3. Fix Missing → 4. Find Duplicates
│ │ │ │
│ [▶ Run pipeline] │ │ [▶ Run pipeline] │
│ │ │ │
│ ┌─ AFTER preview ───────────────────────────────────┐ │ │ ┌─ AFTER preview ───────────────────────────────────┐ │
│ │ 15 rows → 11 (4 duplicates merged) │ │ │ │ 26 rows → 20 (6 duplicate transactions removed) │ │
│ │ 27 cells canonicalized · 33 sentinels resolved │ │ │ │ 36 cells standardized · 4 disguised nulls flagged │ │
│ │ │ │ │ │ │ │
│ │ Alice Johnson | +14155551234 | 1240.50 | … │ │ │ │ 2025-01-15 | Stripe | 3450.00 | … │ │
│ │ ... │ │ │ │ ... │ │
│ └──────────────────────────────────────────────────┘ │ │ └──────────────────────────────────────────────────┘ │
│ │ │ │
@@ -244,27 +212,35 @@ not "demo crippled" data.
## 7. CTA copy (per persona) ## 7. CTA copy (per persona)
### 7.1 Shopify pet operator Copy lives in `src/gui/app_demo.py::PERSONAS` (H1 / sub / CTA per tag);
keep this section in sync with that dict.
- **H1**: *Clean your customer / vendor / subscriber exports — locally.* ### 7.1 Bookkeeper — bank reconciliation (`?p=bookkeeper`)
- **Sub**: *Klaviyo-import-ready in 30 seconds. Catches duplicates Excel
misses. Your data never leaves your computer.*
- **CTA**: *Get DataTools for Shopify — $49 →*
### 7.2 Bookkeeper / freelance accountant - **H1**: *Catch the transactions your bank export posted twice. Locally.*
- **Sub**: *When the Jan and Feb exports overlap, the same payment posts
- **H1**: *Reconcile messy bank exports. Hand your client an audit twice in two formats. DataTools standardizes every date and amount, then
trail.* dedups on the real transaction so your reconciliation ties out — 26 rows
- **Sub**: *Catches the duplicate transaction Quickbooks imported twice. → 20, six phantom duplicates gone.*
Standardizes dates, amounts, vendor casing. Every change auditable.*
- **CTA**: *Get DataTools for Bookkeepers — $49 →* - **CTA**: *Get DataTools for Bookkeepers — $49 →*
### 7.3 Marketing / RevOps agency ### 7.2 Accounts payable — 1099 prep (`?p=ap-1099`)
- **H1**: *Dedupe leads across HubSpot, LinkedIn, and manual scrapes.* - **H1**: *Build a clean 1099 vendor list — with the missing EINs filled in.*
- **Sub**: *International phones, country normalization, fuzzy dedup - **Sub**: *The same vendor entered three times, each record holding only
with merge — one tool, one schema, no upload.* part of the details. DataTools consolidates to one row and backfills the
- **CTA**: *Get DataTools for RevOps — $49 →* gaps from the duplicates — 24 records → 8 vendors, 7 missing EINs
recovered.*
- **CTA**: *Get DataTools for Accounting — $49 →*
### 7.3 Accounts receivable — open invoices (`?p=ar-aging`)
- **H1**: *Stop chasing the invoices your aging report counted twice. Locally.*
- **Sub**: *Double-entered invoices inflate your AR aging and your
follow-ups. DataTools standardizes dates and amounts, lowercases client
emails, and removes the duplicate invoice numbers — 26 rows → 21, five
phantom invoices off the books.*
- **CTA**: *Get DataTools for Accounting — $49 →*
## 8. Telemetry / conversion tracking ## 8. Telemetry / conversion tracking

@@ -1,31 +0,0 @@
Lead ID,First Name,Last Name,Company,Title,Email,Phone,Country,Source,Score,Last Activity,Tags
HUB-001,Alice,Johnson,Acme Corp,VP Marketing,alice@acme.com,(415) 555-1234,USA,HubSpot,87,2025-12-04,Enterprise
HUB-002,bob,smith,Beta LLC,Director Growth,bob@beta.com,N/A,United States,HubSpot,N/A,2025-11-22,SMB
HUB-003,Carlos,Garcia,Gamma Inc,CEO,carlos@gamma.io,+34 91 411 1111,Spain,HubSpot,82,2025-10-30,Enterprise
HUB-004,DIANA,LEE,Delta Co,Marketing Manager,diana@delta.com,020 7946 0958,United Kingdom,HubSpot,74,2025-12-15,Mid-Market
HUB-005,Eve,Martinez,Epsilon Group,VP Ops,eve@epsilon.com,(none),Mexico,HubSpot,(blank),2025-09-15,SMB
LIN-006,Alice,Johnson,Acme Corporation,VP of Marketing,Alice.Johnson@acme.com,4155551234,US,LinkedIn,,2025-12-04,Enterprise
LIN-007,Frank,Brown,Foxtrot Ltd,Head Sales,frank@foxtrot.de,+49 30 12345678,Germany,LinkedIn,68,2025-12-01,Mid-Market
LIN-008,Grace,Davis,Golf Industries,Marketing Lead,grace@golfind.com,+44 20 7946 0958,UK,LinkedIn,79,2025-11-08,Mid-Market
LIN-009,henry,wilson,Hotel Logistics,COO,henry@hotellog.com,+86 10 1234 5678,China,LinkedIn,91,2025-12-12,Enterprise
LIN-010,IVY CHEN,,India Tech,CTO,ivy@indiatech.in,+91 11 2345 6789,IN,LinkedIn,88,2025-11-30,Enterprise
LIN-011,Jack,Taylor,Juliet & Co,Founder,jack@juliet.co,unknown,United States,LinkedIn,?,(unknown),SMB
SCR-012,Diana,Lee,Delta Company,Marketing Manager,diana@delta.com,020-7946-0958,UK,Manual Scrape,74,12/15/2025,Mid-Market
SCR-013,kate,o'neil,Kilo Ventures,Partner,kate@kilo.vc,+1 415 555 2222,USA,Manual Scrape,N/A,?,Investor
SCR-014,Carlos,García,Gamma Incorporated,CEO,Carlos@gamma.io,+34-91-411-1111,Spain,Manual Scrape,82,Oct 30 2025,Enterprise
SCR-015,Liam,Park,Lima Solutions,Director Marketing,liam@limasol.kr,+82 2 2287 0114,South Korea,Manual Scrape,77,2025-11-20,Enterprise
SCR-016,Mia,nguyen,Mike Corp,VP Marketing,mia@mikecorp.com.au,02 9374 4000,Australia,Manual Scrape,72,2025-10-05,Mid-Market
SCR-017,Noah,Brown,November Inc,Head of Growth,noah@november.com,(555) 444-5555,US,Manual Scrape,,#N/A,SMB
HUB-018,Frank,Brown,Foxtrot,Head of Sales,Frank@Foxtrot.de,+49-30-12345678,Germany,HubSpot,68,2025-12-01,Mid-Market
HUB-019,Olivia,Rossi,Oscar Italia,CMO,olivia@oscar.it,+39 06 6982,Italy,HubSpot,85,2025-12-08,Enterprise
HUB-020,papa,wong,Papa Trading,Founder,papa@papatrading.hk,+852 2123 4567,Hong Kong,HubSpot,69,2025-11-15,SMB
LIN-021,Quinn,Reyes,Quebec Group,VP Sales,quinn@quebec.mx,+52 55 5555 0000,Mexico,LinkedIn,80,2025-12-05,Mid-Market
LIN-022,Robert,Tan,Romeo Logistics,Director,r.tan@romeo.sg,+65 6123 4567,Singapore,LinkedIn,76,2025-11-28,Mid-Market
SCR-023,Sara,Khan,Sierra Foods,Head Marketing,sara@sierra.in,+91-22-1234-5678,India,Manual Scrape,73,2025-12-02,SMB
SCR-024,bob,Smith,Beta,Director Growth,Bob@Beta.com,(none),United States,Manual Scrape,(unknown),(unknown),SMB
HUB-025,Tara,Levi,Tango Tech,VP Product,tara@tango.il,+972 3 6957 0000,Israel,HubSpot,82,2025-12-10,Enterprise
HUB-026,Uma,Patel,Uniform Health,CMO,uma at uniform dot com,+44 20 7946 8888,United Kingdom,HubSpot,71,2025-12-12,Enterprise
LIN-027,Victor,Lee,Victor Co,Director,victor@@victorco.com,+1 415 555 8888,USA,LinkedIn,69,2025-11-30,SMB
SCR-028,Wendy,Akin,Whiskey Inc,CMO,wendy@whiskey.tr,+90 212 252 1111,Turkey,Manual Scrape,77,2025-12-04,Mid-Market
SCR-029,Xander,Ng,Xray Group,Founder,xander@xray.sg,+65 6234 5678,Singapore,Manual Scrape,65,2025-11-15,Suppressed
HUB-030,Yara,Costa,Yankee Foods,Marketing Lead,yara@yankee.br,+55 11 3071 2222,Brazil,HubSpot,,2025-12-15,Opted Out

@@ -1,74 +0,0 @@
{
"steps": [
{
"tool": "text_clean",
"options": {},
"enabled": true,
"name": "1. Clean text (whitespace + smart quotes from copy-paste)"
},
{
"tool": "format_standardize",
"options": {
"column_types": {
"First Name": "name",
"Last Name": "name",
"Company": "name",
"Email": "email",
"Phone": "phone"
},
"phone_country_column": "Country",
"phone_format": "E164",
"email_gmail_canonical": true
},
"enabled": true,
"name": "2. E.164 phones (per-row country) · canonical emails · name casing"
},
{
"tool": "missing",
"options": {
"strategy": "none",
"standardize_sentinels": true,
"sentinels": ["N/A", "n/a", "—", "?", "(unknown)", "unknown", "(blank)", "(none)", "TBD", "#N/A"]
},
"enabled": true,
"name": "3. Standardize sentinels across vendor exports"
},
{
"tool": "column_map",
"options": {
"schema": {
"fields": [
{"name": "Lead ID", "dtype": "string", "required": true},
{"name": "First Name", "dtype": "string"},
{"name": "Last Name", "dtype": "string"},
{"name": "Company", "dtype": "string"},
{"name": "Title", "dtype": "string"},
{"name": "Email", "dtype": "string"},
{"name": "Phone", "dtype": "string"},
{"name": "Country", "dtype": "string"},
{"name": "Source", "dtype": "string"},
{"name": "Score", "dtype": "integer"},
{"name": "Last Activity", "dtype": "date"},
{"name": "Tags", "dtype": "string"}
]
},
"auto_infer": true,
"unmapped": "keep",
"coerce_types": true,
"reorder_to_schema": true,
"enforce_required": false
},
"enabled": true,
"name": "4. Coerce types · reorder to canonical schema"
},
{
"tool": "dedup",
"options": {
"survivor_rule": "most_complete",
"merge": true
},
"enabled": true,
"name": "5. Dedup leads across HubSpot / LinkedIn / Manual Scrape (fuzzy + merge)"
}
]
}

@@ -0,0 +1,27 @@
Invoice,Client,Email,Invoice_Date,Due_Date,Amount,Status
INV-1007,ACME LLC,AP@Acme.com,03/04/2025,04/03/2025,"$1,250.00",Open
INV-1007, Acme LLC ,ap@acme.com,2025-03-04,2025-04-03,"1,250.00",(blank)
INV-1001,northwind traders,billing@northwind.com,Mar 6 2025,04/05/2025,$980,Overdue
INV-1002,Globex Corp,AR@Globex.com,3/11/25,4/10/25,"2,400.50",Sent
INV-1011,initech,accounts@initech.com,04/01/2025,05/01/2025,"$ 1,100.00",?
INV-1011,Initech,Accounts@Initech.com,2025-04-01,2025-05-01,1100,Open
INV-1003,Stark Industries,ap@stark.com,Mar 6 2025,Apr 6 2025,$75.00,Open
INV-1004,Wayne Enterprises,ar@wayne.com,03/15/2025,04/14/2025,($300.00),
INV-1015,Hooli,billing@hooli.com,3/11/25,4/10/25,"$4,300.00",Overdue
INV-1015,hooli,Billing@Hooli.com,2025-03-11,2025-04-10,4300,(none)
INV-1005,Soylent Corp,ap@soylent.com,2025-03-20,2025-04-19,"$1,875.25",Sent
INV-1006,Umbrella Co,ar@umbrella.com,03/22/2025,04/21/2025,$640.00,TBD
INV-1019,Cyberdyne Systems,ap@cyberdyne.com,Mar 25 2025,04/24/2025,"$2,050.00",unknown
INV-1019,cyberdyne systems,AP@Cyberdyne.com,2025-03-25,2025-04-24,"2,050.00",Open
INV-1008,Vandelay Industries,ar@vandelay.com,3/28/25,4/27/25,$915.00,Overdue
INV-1009,Gekko & Co,billing@gekko.com,2025-03-30,2025-04-29,"$3,120.75",Open
INV-1010,Pied Piper,ap@piedpiper.com,04/02/2025,05/02/2025,$180,Sent
INV-1023,Tyrell Corp,ar@tyrell.com,04/05/2025,05/05/2025,($300.00),(blank)
INV-1023,Tyrell Corp,AR@Tyrell.com,2025-04-05,2025-05-05,-300.00,Open
INV-1012,Oscorp,ap@oscorp.com,Apr 8 2025,05/08/2025,"$5,000.00",Overdue
INV-1013,Nakatomi Trading,ar@nakatomi.com,4/9/25,5/9/25,$725.50,Sent
INV-1014,Bluth Company,billing@bluth.com,2025-04-10,2025-05-10,"$1,420.00",Open
INV-1016,Dunder Mifflin,ap@dundermifflin.com,04/12/2025,05/12/2025,$960.00,Overdue
INV-1017,Prestige Worldwide,ar@prestige.com,Apr 14 2025,05/14/2025,"$2,680.00",Sent
INV-1018,Sterling Cooper,billing@sterlingcooper.com,4/15/25,5/15/25,"$3,950.00",Open
INV-1020,Wonka Industries,ap@wonka.com,2025-04-18,2025-05-18,"$1,050.00",Overdue
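
The rows above mix four date layouts and several currency spellings, which is why the duplicate invoices only line up after standardization. As a rough illustration, the normalization could look like the sketch below. It is not the engine's `format_standardize` implementation: `to_iso_date`, `to_amount`, and the `DATE_FORMATS` list are hypothetical names, and the list of layouts is inferred from the sample rows only.

```python
from datetime import datetime
from decimal import Decimal

# Assumption: only the layouts visible in the sample rows are covered.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y", "%b %d %Y")


def to_iso_date(raw: str) -> str:
    """Return the date as YYYY-MM-DD, or raise ValueError if no layout matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next layout
    raise ValueError(f"unparseable date: {raw!r}")


def to_amount(raw: str) -> Decimal:
    """Parse currency text; wrapping parentheses or a leading minus mean negative."""
    text = raw.strip()
    negative = (text.startswith("(") and text.endswith(")")) or text.startswith("-")
    # Keep digits and the decimal point; drop "$", ",", spaces, signs, parentheses.
    digits = "".join(ch for ch in text if ch.isdigit() or ch == ".")
    value = Decimal(digits)
    return -value if negative else value
```

With these helpers, `03/04/2025` and `2025-03-04` map to the same ISO date, and `($300.00)` and `-300.00` map to the same `Decimal`, so the two copies of an invoice compare equal on those columns. Month abbreviations such as `Mar` are parsed with `%b`, which assumes an English locale.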

@@ -0,0 +1,50 @@
{
  "steps": [
    {
      "tool": "text_clean",
      "enabled": true,
      "options": {
        "trim": true,
        "collapse_whitespace": true,
        "fold_smart_chars": true,
        "strip_zero_width": true
      }
    },
    {
      "tool": "format_standardize",
      "enabled": true,
      "options": {
        "column_types": {
          "Invoice_Date": "date",
          "Due_Date": "date",
          "Amount": "currency",
          "Email": "email"
        }
      }
    },
    {
      "tool": "missing",
      "enabled": true,
      "options": {
        "strategy": "none",
        "standardize_sentinels": true,
        "sentinels": ["—", "-", "?", "(blank)", "TBD", "unknown", "(none)", "N/A", "#N/A"]
      }
    },
    {
      "tool": "dedup",
      "enabled": true,
      "options": {
        "survivor_rule": "most_complete",
        "merge": true,
        "strategies": [
          {
            "columns": [
              {"column": "Invoice", "algorithm": "exact", "threshold": 100}
            ]
          }
        ]
      }
    }
  ]
}
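
The `missing` step in this pipeline sets `strategy` to `none` and `standardize_sentinels` to `true`, which reads as "turn the listed placeholders into real empties, but do not impute anything". A minimal sketch of that idea follows, assuming whole-cell matching; the function name `standardize_sentinels` and its return shape are hypothetical and are not taken from the engine.

```python
# Sentinel list copied from the pipeline's "missing" step above.
SENTINELS = {"—", "-", "?", "(blank)", "TBD", "unknown", "(none)", "N/A", "#N/A"}


def standardize_sentinels(rows, sentinels=SENTINELS):
    """Blank out cells whose entire (trimmed) value is a sentinel.

    Hypothetical helper: returns the cleaned rows and the number of cells changed.
    """
    cleaned = []
    changed = 0
    for row in rows:
        new_row = {}
        for col, val in row.items():
            if val.strip() in sentinels:
                new_row[col] = ""  # disguised null becomes a real empty
                changed += 1
            else:
                new_row[col] = val
        cleaned.append(new_row)
    return cleaned, changed


rows = [
    {"Invoice": "INV-1023", "Amount": "-300.00", "Status": "(blank)"},
    {"Invoice": "INV-1006", "Amount": "640.00", "Status": "TBD"},
    {"Invoice": "INV-1019", "Amount": "2050.00", "Status": "unknown"},
    {"Invoice": "INV-1008", "Amount": "915.00", "Status": "Overdue"},
]
cleaned, changed = standardize_sentinels(rows)
```

Matching the whole cell rather than a substring matters here: `-` is a sentinel, yet a negative amount such as `-300.00` has to pass through untouched.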

@@ -0,0 +1,27 @@
Date,Description,Vendor,Category,Amount,Account
01/15/2025,“Stripe payout — weekly”,Stripe,Income,"+$3,450.00",Business Checking
2025-01-15,Verizon business line,Verizon,,($89.50),Business Checking
Jan 18 2025,Adobe Creative Cloud ,Adobe,(blank),-$129.99,Business Checking
1/27/25,Office supplies,Amazon,Supplies,-$74.20,Business Checking
02/03/2025, Monthly office rent,Highland Properties,Rent,"$1,200.00",Business Checking
Feb 5 2025,Account service fee,First National Bank,?,(50.00),Business Checking
2025-01-09,Shipping labels,amazon.com,unknown,-$18.40,Business Checking
1/22/25,Contractor — landing page,Bright Lane Design,TBD,- $599.88,Business Checking
Jan 30 2025,Late fee adjustment,verizon,Utilities,-$12.00,Business Checking
2025-01-11,Packaging tape,AMAZON.COM,Supplies,-$31.75,Business Checking
01/06/2025,Client deposit — ACME Co,ACME Co,Income,"$2,500.00",Business Checking
2025-01-20,Google Workspace,Google,Software,-$36.00,Business Checking
Jan 24 2025,Fuel — delivery van,Shell,Vehicle,-$58.63,Business Checking
1/28/25,QuickBooks subscription,Intuit,Software,-$80.00,Business Checking
2025-01-15,Stripe payout weekly,Stripe,Income,3450.00,Business Checking
01/15/2025,Verizon business line,Verizon,Utilities,-89.50,Business Checking
2025-01-18,Adobe Creative Cloud,Adobe,Software,-129.99,Business Checking
2025-02-03,Monthly office rent,Highland Properties,Rent,1200.00,Business Checking
2025-02-05,Account service fee,First National Bank,Bank Fees,-50.00,Business Checking
2025-01-22,Contractor landing page,Bright Lane Design,Contractors,-599.88,Business Checking
02/10/2025,Client deposit — Globex,Globex,Income,"$1,800.00",Business Checking
2025-02-12,Slack subscription,Slack,Software,-$96.00,Business Checking
Feb 14 2025,Coffee — client meeting,Blue Bottle,Meals,-$23.10,Business Checking
2/18/25,Insurance premium,Hartford,Insurance,-$240.50,Business Checking
02/21/2025,Refund — returned printer,Staples,Supplies,$210.99,Business Checking
Feb 25 2025,Domain renewal,Namecheap,Software,-$13.98,Business Checking

@@ -0,0 +1,6 @@
{"steps":[
{"tool":"text_clean","enabled":true,"options":{"trim":true,"collapse_whitespace":true,"fold_smart_chars":true,"strip_zero_width":true}},
{"tool":"format_standardize","enabled":true,"options":{"column_types":{"Date":"date","Amount":"currency"}}},
{"tool":"missing","enabled":true,"options":{"strategy":"none","standardize_sentinels":true,"sentinels":["—","(blank)","?","unknown","TBD","N/A","#N/A","(none)"]}},
{"tool":"dedup","enabled":true,"options":{"survivor_rule":"most_complete","merge":true,"strategies":[{"columns":[{"column":"Date","algorithm":"exact","threshold":100},{"column":"Amount","algorithm":"exact","threshold":100}]}]}}
]}
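
This pipeline dedups on exact `Date` + `Amount`, which can only work because step 2 has already rewritten both columns into one canonical form. The sketch below illustrates that dependency on a few rows from the sample. The helpers `to_iso_date`, `to_amount`, and `dedup_exact` are hypothetical stand-ins, the format list is an assumption drawn from the sample data, and the sketch keeps the first row of each group rather than applying the pipeline's `most_complete` survivor rule.

```python
from datetime import datetime
from decimal import Decimal

# Date formats seen in the sample export; an assumption, not the engine's list.
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%b %d %Y", "%m/%d/%y")


def to_iso_date(text):
    """Return YYYY-MM-DD for any of the known input formats."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def to_amount(text):
    """Parse '+$3,450.00', '($89.50)' or '- $599.88' into a signed Decimal."""
    s = text.strip()
    negative = (s.startswith("(") and s.endswith(")")) or s.startswith("-")
    for ch in "()$,+- ":
        s = s.replace(ch, "")
    value = Decimal(s)
    return -value if negative else value


def dedup_exact(rows, key_cols):
    """Keep the first row for each exact key; later matches are dropped."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


raw_rows = [
    {"Date": "01/15/2025", "Amount": "+$3,450.00"},
    {"Date": "2025-01-15", "Amount": "($89.50)"},
    {"Date": "1/27/25", "Amount": "-$74.20"},
    {"Date": "2025-01-15", "Amount": "3450.00"},
    {"Date": "01/15/2025", "Amount": "-89.50"},
]
standardized = [
    {"Date": to_iso_date(r["Date"]), "Amount": to_amount(r["Amount"])}
    for r in raw_rows
]
kept_raw = dedup_exact(raw_rows, ("Date", "Amount"))      # nothing matches yet
kept = dedup_exact(standardized, ("Date", "Amount"))      # twins now collapse
```

On the raw strings no two rows share a key, so exact dedup removes nothing; after standardization the Stripe and Verizon twins collapse and three rows remain.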

@@ -1,56 +0,0 @@
{
  "steps": [
    {
      "tool": "text_clean",
      "options": {},
      "enabled": true,
      "name": "1. Clean text (header whitespace, smart quotes, em-dash)"
    },
    {
      "tool": "format_standardize",
      "options": {
        "column_types": {
          "Date": "date",
          "Amount": "currency",
          "Balance": "currency",
          "Vendor": "name"
        },
        "currency_decimal": "auto",
        "currency_preserve_code": false,
        "currency_decimals": 2,
        "date_output_format": "%Y-%m-%d"
      },
      "enabled": true,
      "name": "2. ISO dates · numeric amounts (parens-negative) · vendor casing"
    },
    {
      "tool": "missing",
      "options": {
        "strategy": "none",
        "standardize_sentinels": true,
        "sentinels": ["N/A", "n/a", "—", "-", "?", "(blank)", "(none)", "unknown", "#N/A"]
      },
      "enabled": true,
      "name": "3. Standardize disguised nulls (— / N/A / (blank))"
    },
    {
      "tool": "dedup",
      "options": {
        "survivor_rule": "most_complete",
        "merge": false,
        "date_column": "Date",
        "strategies": [
          {
            "columns": [
              {"column": "Date", "algorithm": "exact", "threshold": 100},
              {"column": "Amount", "algorithm": "exact", "threshold": 100},
              {"column": "Vendor", "algorithm": "jaro_winkler", "threshold": 80}
            ]
          }
        ]
      },
      "enabled": true,
      "name": "4. Dedup transactions on Date+Amount+fuzzy Vendor"
    }
  ]
}

@@ -1,31 +0,0 @@
Txn ID,Date ,Description,Amount,Balance,Account,Vendor,Category
TXN-2401,01/15/2025," AMAZON.COM*4F2X9 PURCHASE",-$129.99,"$2,450.01",Checking,Amazon,Office Supplies
TXN-2402,2025-01-15,"AMAZON.COM*4F2X9 PURCHASE",-$129.99,"2450.01",Checking,amazon.com,Office Supplies
TXN-2403,Jan 18 2025,"STAPLES #4422 — paper, toner",($89.50),$2360.51,Checking,STAPLES,Office Supplies
TXN-2404,01/22/2025,"Verizon Wireless ""autopay""",-$120.00,"$2,240.51",Checking,Verizon,Utilities
TXN-2405,2025-01-22,Verizon Wireless autopay,-120.00,"2,240.51",Checking,verizon,Utilities
TXN-2406,01-25-2025,"Stripe Payout — invoice #1077","+$3,450.00","$5,690.51",Checking,Stripe,Income
TXN-2407,1/27/25,"Office Lease - Suite 204",-1500.00,"$4,190.51",Checking,Acme Realty,Rent
TXN-2408,02/01/2025,"Wire — Acme Realty Mgmt","-$1,500.00","$2,690.51",Checking,acme realty,Rent
TXN-2409,2025-02-03,"Adobe Creative Cloud annual","- $599.88","$2,090.63",Credit Card,Adobe Inc.,Software
TXN-2410,02/03/2025,"ADOBE CREATIVE CLOUD ANN",-599.88,2090.63,Credit Card,adobe,Software
TXN-2411,Feb 5 2025,"FedEx — overnight to client A",-$32.50,"$2,058.13",Checking,FedEx,Shipping
TXN-2412,02/07/2025,"Square fee — invoice #1078","-$3.20","$2,054.93",Checking,Square,Fees
TXN-2413,02/10/2025,"Stripe Payout invoice #1079","+ $1,200.00","$3,254.93",Checking,Stripe,Income
TXN-2414,2025-02-12,"USPS PRIORITY — to vendor B","-12.40","$3,242.53",Checking,USPS,Shipping
TXN-2415,02/14/2025,"Zoom Video Comms — annual","-$149.90","$3,092.63",Credit Card,Zoom,Software
TXN-2416,2/14/25,"Zoom Video Communications","-149.90","3092.63",Credit Card,zoom,Software
TXN-2417,02/18/2025,"Costco Whse #421 — supplies","-$237.84","$2,854.79",Checking,Costco,Office Supplies
TXN-2418,2025-02-18,COSTCO WHSE #421,-237.84,"2,854.79",Checking,costco,Office Supplies
TXN-2419,02/22/2025,"Bank fee — int'l wire","-$45.00","$2,809.79",Checking,Bank Fee,Fees
TXN-2420,02/24/2025,"Stripe Payout — invoice #1080","+$2,100.00","$4,909.79",Checking,Stripe,Income
TXN-2421,02/28/2025," Refund — overcharge ","+$45.00","$4,954.79",Checking,,Refunds
TXN-2422,Feb 28 2025,REFUND OVERCHARGE,45.00,4954.79,Checking,N/A,Refunds
TXN-2423,03/01/2025,"Office Lease — Suite 204","-$1,500.00","$3,454.79",Checking,Acme Realty,Rent
TXN-2424,2025-03-03,"Slack Technologies — annual","-$840.00","$2,614.79",Credit Card,Slack,Software
TXN-2425,03/05/2025,"Stripe Payout — invoice #1081","+$1,875.00","$4,489.79",Checking,Stripe,Income
TXN-2426,03/08/2025,"Wire — Berlin office rent (EUR vendor)","-€1.450,00","$2,989.79",Checking,Mietverwaltung GmbH,Rent
TXN-2427,03/10/2025,"London supplier invoice (GBP)","-£950.00","$1,939.79",Checking,Stationery Co Ltd,Office Supplies
TXN-2428,03/12/2025,"São Paulo agency retainer","-R$ 1.299,90","$1,679.79",Credit Card,Estúdio Ágil,Software
TXN-2429,03/14/2025,"VAT MOSS prep — multi-EU sales","($89.00)","$1,768.79",Checking,EU VAT Service,Fees
TXN-2430,03/14/2025,"VAT MOSS prep multi EU sales",-89.00,"1,768.79",Checking,eu vat service,Fees

@@ -1,21 +0,0 @@
Customer ID,First Name,Last Name,Email,Phone,Address,City,State,ZIP,Country,Total Orders,Lifetime Value,Last Order Date,Tags
SHOP-1001, Alice ,Johnson,alice@petshop.com,(415) 555-1234,"123 Main St., Apt 4B",San Francisco,CA,94102,US,12,$1,240.50,2025-12-04,VIP
SHOP-1002,Bob,SMITH,Bob@PetShop.com,415.555.1234,"123 Main St, Apt 4B",San Francisco,CA,94102,US,12,"$1,240.50",N/A,VIP
SHOP-1003,carlos,garcia,carlos@petshop.com,5559876543,"742 Evergreen Terrace",Springfield,IL,62704,US,5,420.00,12/15/2025,Wholesale
SHOP-1004,Diana,Lee,diana@petshop.com,(555) 222-3344,"PO Box 12, Sherwood Forest",Nottingham,,NG1 5BA,GB,8,£890.25,2025-10-30,VIP|Wholesale
SHOP-1005,EVE MARTINEZ,,eve.martinez@petshop.com,555-9988,"Calle Mayor 45","Madrid",,"28013",ES,3,€180,2025-09-15,
SHOP-1006,Frank,Brown,frank@petshop.com,, ,"Berlin",BE,10115,DE,15,€2.410,75,(blank),Wholesale
SHOP-1007,Grace,Davis,grace@petshop.com,+1 555-111-1111,"888 Maple Ave",Toronto,ON,M5V 3A8,CA,1,$49.99,#N/A,New
SHOP-1008,henry,wilson,Henry@PetShop.com,5551111111,"888 Maple Avenue","Toronto",ON,M5V 3A8,CA,1,$49.99,2025-12-01,New
SHOP-1009,Ivy,Chen,IVY@petshop.com,+1 (555) 777-7777,"550 Elm Street, Suite 200",Brooklyn,NY,11201,US,4,"$320.50 ",10/12/2025,
SHOP-1010,Jack,Taylor,jack@petshop.com,(none),"550 elm street, suite 200",brooklyn,NY,11201,US,4,$320.50,2025-10-12,
SHOP-1011,kate,o'neil,kate.oneil@petshop.com,415-555-2222,"99 King's Rd","London",,SW3 4LX,GB,7,£675.00,?,VIP
SHOP-1012,luis,rodriguez,LUIS@petshop.com,+34 91 411 1111,"Avenida de la Paz 12, 3°D",Madrid,,28013,ES,2,"€89,99",unknown,
SHOP-1013,Mia,Park,mia@petshop.com,02-9374-4000,"Sydney Opera House Drive","Sydney",NSW,2000,AU,9,"A$ 1,299.00",2025-11-20,Wholesale
SHOP-1014,Noah,nguyen,noah@petshop.com,+81 3 3210 7000,"丸の内 2-7-3","Tokyo",,100-0005,JP,6,"¥75000",2025-12-10,VIP
SHOP-1015,Olivia,Brown,OLIVIA@PETSHOP.COM,(555) 333-4444,"742 evergreen terrace",springfield,IL,62704,US,3,$180.00,(none),
SHOP-1016,Pavel,Novak,pavel@petshop.com,+44 20 7946 1234,"22 Baker Street",London,,W1U 6AB,United Kingdom,4,£412.00,2025-11-18,VIP
SHOP-1017,Quinn,Murphy,quinn@petshop.com,+44 20 7946 5678,"5 Princes Street",Edinburgh,,EH2 2DA,U.K.,2,£189.50,2025-12-09,
SHOP-1018,Rachel,O'Brien,rachel@petshop.com,02-9374-9999,"100 George Street","Sydney",NSW,2000,UK,1,£75.00,?,New
SHOP-1019,Sam,Klein,sam@petshop.com,+49 30 99887766,"Friedrichstraße 100","Berlin",,10117,Germany,11,"€1.890,40",2025-12-11,VIP|Wholesale
SHOP-1020,Tara,Gianni,tara@petshop.com,+39 06 6982 4567,"Via del Corso 250",Roma,,00186,Italia,5,"€649,99",2025-12-03,

@@ -1,49 +0,0 @@
{
  "steps": [
    {
      "tool": "text_clean",
      "options": {},
      "enabled": true,
      "name": "1. Clean text (whitespace, smart quotes, NBSP, BOM)"
    },
    {
      "tool": "format_standardize",
      "options": {
        "column_types": {
          "First Name": "name",
          "Last Name": "name",
          "Email": "email",
          "Phone": "phone",
          "Address": "address",
          "Lifetime Value": "currency",
          "Last Order Date": "date"
        },
        "phone_country_column": "Country",
        "address_country_column": "Country",
        "currency_preserve_code": true,
        "currency_decimal": "auto",
        "email_gmail_canonical": false
      },
      "enabled": true,
      "name": "2. Standardize phones, addresses, dates, currencies, names"
    },
    {
      "tool": "missing",
      "options": {
        "strategy": "none",
        "standardize_sentinels": true
      },
      "enabled": true,
      "name": "3. Standardize disguised nulls (N/A, -, (blank), ?, #N/A)"
    },
    {
      "tool": "dedup",
      "options": {
        "survivor_rule": "most_complete",
        "merge": true
      },
      "enabled": true,
      "name": "4. Dedup customers (fuzzy match, merge missing fields)"
    }
  ]
}

@@ -0,0 +1,25 @@
Vendor,Contact,Email,Phone,EIN,Address,Total_Paid
Acme Realty,Bob Stein,acme.ap@acmerealty.com,(212) 555-0100,12-3456789,(blank),"$12,400.00"
acme realty llc,Bob Stein, ACME.AP@AcmeRealty.com ,,,"118 Canal St, New York, NY 10013","$8,250"
ACME REALTY,R. Stein,Acme.AP@acmerealty.com,212.555.0100,N/A,TBD,"1,999.99"
Bright Books Bookkeeping,Dana Cole,hello@brightbooks.com,,98-7654321,(blank),"$6,000.00"
bright books,Dana Cole,HELLO@brightbooks.com,(415) 555-0142,unknown,"50 Market St, San Francisco, CA 94105","$6,000"
"Bright Books, LLC",D. Cole, hello@BrightBooks.com,4155550142,98-7654321,unknown,"5,500.00"
Northwind Logistics,Sam Reyes,ap@northwindlog.com,(312) 555-0198,,(blank),"$22,750.00"
northwind logistics inc,Sam Reyes,AP@NorthwindLog.com,,45-6789012,"900 W Loop, Chicago, IL 60607","$22,750"
Pearl Design Studio,“Jo” Marsh,billing@pearldesign.co,,33-2211000,(blank),"$3,200.00"
pearl design,Jo Marsh,Billing@PearlDesign.co,(206) 555-0167,TBD,"77 Pike St, Seattle, WA 98101","$3,200"
PEARL DESIGN STUDIO,J. Marsh, billing@pearldesign.co ,206.555.0167,33-2211000,unknown,"2,800.00"
Cooper Plumbing,Lee Cooper,office@cooperplumb.com,(617) 555-0133,,(blank),"$1,450.00"
cooper plumbing co,Lee Cooper,OFFICE@cooperplumb.com,,TBD,"12 Beacon St, Boston, MA 02108","$1,450"
COOPER PLUMBING,L. Cooper, office@CooperPlumb.com,6175550133,N/A,unknown,900.00
Vertex Marketing,Pat Nguyen,accounts@vertexmktg.com,(404) 555-0119,77-8899001,(blank),"$15,000.00"
vertex marketing group,Pat Nguyen,ACCOUNTS@VertexMktg.com,,unknown,"300 Peachtree St, Atlanta, GA 30308","$15,000"
Summit Consulting,Ray Brooks,invoices@summitconsult.net,,21-0099887,(blank),"$9,800.00"
summit consulting llc,Ray Brooks,INVOICES@summitconsult.net,(303) 555-0175,,"1100 17th St, Denver, CO 80202","$9,800"
SUMMIT CONSULTING,R. Brooks, invoices@SummitConsult.net ,303.555.0175,21-0099887,TBD,"7,250.00"
Garcia Catering,Mia Garcia,ap@garciacatering.com,(305) 555-0188,,(blank),"$4,600.00"
garcia catering services,Mia Garcia,AP@GarciaCatering.com,,66-1234509,"450 Ocean Dr, Miami, FL 33139",$600.00
Northwind Logistics,S. Reyes, ap@northwindlog.com ,312.555.0198,45-6789012,TBD,"21,000.00"
VERTEX MARKETING,P. Nguyen, accounts@vertexmktg.com ,404.555.0119,77-8899001,TBD,"14,500.00"
GARCIA CATERING,M. Garcia,ap@GARCIACATERING.com,305.555.0188,66-1234509,unknown,"4,200.00"

@@ -0,0 +1,49 @@
{
  "steps": [
    {
      "tool": "text_clean",
      "enabled": true,
      "options": {
        "trim": true,
        "collapse_whitespace": true,
        "fold_smart_chars": true,
        "strip_zero_width": true
      }
    },
    {
      "tool": "format_standardize",
      "enabled": true,
      "options": {
        "column_types": {
          "Phone": "phone",
          "Email": "email",
          "Total_Paid": "currency"
        }
      }
    },
    {
      "tool": "missing",
      "enabled": true,
      "options": {
        "strategy": "none",
        "standardize_sentinels": true,
        "sentinels": ["—", "-", "--", "(blank)", "TBD", "unknown", "N/A", "#N/A", "(none)"]
      }
    },
    {
      "tool": "dedup",
      "enabled": true,
      "options": {
        "survivor_rule": "most_complete",
        "merge": true,
        "strategies": [
          {
            "columns": [
              {"column": "Email", "algorithm": "exact", "threshold": 100, "normalizer": "email"}
            ]
          }
        ]
      }
    }
  ]
}
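
The dedup step here combines `survivor_rule: most_complete` with `merge: true`, which is what lets one vendor row inherit an EIN or phone number from its duplicates. A rough sketch of that behaviour, under stated assumptions: `normalize_email` and `merge_by_key` are hypothetical names, ties between equally complete rows go to the earliest row, and `Total_Paid` is left out because the diff does not show how conflicting non-empty values are resolved.

```python
def normalize_email(text):
    """Trim and lowercase so case and padding variants share one key."""
    return text.strip().lower()


def merge_by_key(rows, key_fn):
    """Group rows by key, keep the most complete row, backfill its gaps."""
    groups = {}
    for row in rows:
        groups.setdefault(key_fn(row), []).append(row)

    merged = []
    for group in groups.values():
        # Survivor: the row with the most non-empty cells (earliest wins ties).
        survivor = dict(max(group, key=lambda r: sum(bool(v) for v in r.values())))
        # Merge: fill each empty survivor cell from any duplicate that has a value.
        for other in group:
            for col, val in other.items():
                if not survivor.get(col) and val:
                    survivor[col] = val
        merged.append(survivor)
    return merged


# Northwind rows from the sample, with sentinels already blanked.
rows = [
    {"Vendor": "Northwind Logistics", "Email": "ap@northwindlog.com",
     "Phone": "(312) 555-0198", "EIN": "", "Address": ""},
    {"Vendor": "northwind logistics inc", "Email": "AP@NorthwindLog.com",
     "Phone": "", "EIN": "45-6789012", "Address": "900 W Loop, Chicago, IL 60607"},
    {"Vendor": "Northwind Logistics", "Email": " ap@northwindlog.com ",
     "Phone": "312.555.0198", "EIN": "45-6789012", "Address": ""},
]
vendors = merge_by_key(rows, key_fn=lambda r: normalize_email(r["Email"]))
```

Keying on the normalized email rather than the vendor name sidesteps the `LLC` / `inc` / casing variants entirely, which is presumably why the pipeline needs only an exact match.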

@@ -1,13 +0,0 @@
customer_name,email,vendor,memo
Alice Johnson,alice@example.com,ACME Corp ,Welcome aboard
Bob Smith,bob@example.com,ACME Corp,Returning customer
Charlie Brown,charlie@example.com,Globex,Net 30
Diana Prince,diana@example.com,Globex,VIP
Edward Norton,ed@example.com,“Best Pet Supplies”,Order#42 - rush
Frank Castle,frank@example.com,Stark—Industries,"Line 1
Line 2
Line 3"
grace HOPPER ,grace@example.com,Globex,Loves long memos…
Henry Ford,henry@example.com,Ford Motor,Industrial
Iris West,iris@example.com,S.T.A.R. Labs,Notewith-bell
Jane Doe,jane@example.com,Acme,Standard

@@ -9,10 +9,10 @@ side-by-side, and converts the visitor to a Gumroad purchase.
 Launch:
     streamlit run src/gui/app_demo.py
-URL routing:
-    https://demo.datatools.app/?p=shopify-pet  (Shopify operator)
-    https://demo.datatools.app/?p=bookkeeper   (Bookkeeper)
-    https://demo.datatools.app/?p=revops       (RevOps agency)
+URL routing (all three personas serve one audience: accounting):
+    https://demo.datatools.app/?p=bookkeeper   (Bookkeeper — bank reconciliation)
+    https://demo.datatools.app/?p=ap-1099      (Accounts payable — 1099 vendor prep)
+    https://demo.datatools.app/?p=ar-aging     (Accounts receivable — open invoices)
 Free / paid boundary (per docs/DEMO-PLAN.md §6):
   - input rows capped at ``DEMO_ROW_CAP``
@@ -64,59 +64,66 @@ GUMROAD_BASE: str = "https://gumroad.com/l/datatools"
 DEMO_DIR = _project_root / "samples" / "demo"
+# All three personas serve one audience — accounting — entering through the
+# three workflows where messy exports cost real money: bank reconciliation,
+# 1099 / AP vendor prep, and AR aging. Each H1/sub names the exact pain and
+# the validated demo outcome (see docs/DEMO-PLAN.md §4 for the numbers).
 PERSONAS: dict[str, dict[str, Any]] = {
-    "shopify-pet": {
-        "label": "Shopify pet operator",
-        "icon": "🛍️",
-        "h1": "Klaviyo-import-ready customer lists. **In 30 seconds. Locally.**",
-        "sub": (
-            "Your Shopify customer export has duplicates Excel can't catch, "
-            "international phones Excel can't parse, and disguised nulls "
-            "(`N/A`, `(blank)`, `?`) that break Klaviyo's import. "
-            "DataTools fixes all of it in one pass — and your data never "
-            "leaves your computer."
-        ),
-        "data_file": "shopify_pet_customers.csv",
-        "pipeline_file": "shopify_pet_pipeline.json",
-        "cta": "Get DataTools for Shopify — $49 →",
-        "landing": "https://datatools.app/shopify/",
-    },
     "bookkeeper": {
-        "label": "Bookkeeper / freelance accountant",
+        "label": "Bookkeeper — bank reconciliation",
         "icon": "📒",
-        "h1": "Reconcile messy bank exports. **Hand your client an audit trail.**",
+        "h1": "Catch the transactions your bank export posted twice. **Locally.**",
         "sub": (
-            "The Jan and Feb exports overlap; the same transaction posts twice. "
-            "Vendor names are *Amazon* / *amazon.com* / *AMAZON.COM*4F2X9* in "
-            "three rows. DataTools dedups on Date + Amount + fuzzy Vendor, "
-            "produces ISO dates and numeric amounts, and gives you a row-level "
-            "audit log to hand the client."
+            "When the Jan and Feb exports overlap, the same payment lands "
+            "twice — once as `01/15/2025 +$3,450.00`, once as "
+            "`2025-01-15 3450.00`. DataTools standardizes every date and "
+            "amount, then dedups on the *real* transaction so your "
+            "reconciliation ties out. In this sample: **26 rows → 20, six "
+            "phantom duplicates removed** — and your data never leaves your "
+            "computer."
         ),
-        "data_file": "bookkeeper_bank_reconcile.csv",
-        "pipeline_file": "bookkeeper_bank_pipeline.json",
+        "data_file": "bank_reconciliation.csv",
+        "pipeline_file": "bank_reconciliation_pipeline.json",
         "cta": "Get DataTools for Bookkeepers — $49 →",
         "landing": "https://datatools.app/bookkeeper/",
     },
-    "revops": {
-        "label": "Marketing / RevOps agency",
-        "icon": "🪢",
-        "h1": "Dedupe lead lists across HubSpot, LinkedIn, and manual scrapes — **locally.**",
+    "ap-1099": {
+        "label": "Accounts payable — 1099 prep",
+        "icon": "🧾",
+        "h1": "Build a clean 1099 vendor list — **with the missing EINs filled in.**",
         "sub": (
-            "The same prospect shows up in HubSpot as `alice@acme.com`, in "
-            "LinkedIn as `Alice.Johnson@acme.com`, and in your VA's manual "
-            "scrape as `alice@acme.com` again. Country is `USA` / `US` / "
-            "`United States`. DataTools fuzzy-matches across sources, "
-            "normalizes phones for 50+ countries, and merges survivors "
-            "with their most-complete fields — without uploading anything."
+            "The same vendor was entered three times across the year — one "
+            "record has the EIN, another the address, a third the phone. "
+            "DataTools consolidates each vendor to one row and *backfills the "
+            "gaps from the duplicates*. In this sample: **24 messy records → "
+            "8 complete vendors, with 7 missing EINs recovered** from the "
+            "duplicate rows. No upload, no VLOOKUP gymnastics."
         ),
-        "data_file": "agency_combined_leads.csv",
-        "pipeline_file": "agency_leads_pipeline.json",
-        "cta": "Get DataTools for RevOps — $49 →",
-        "landing": "https://datatools.app/revops/",
+        "data_file": "vendor_1099.csv",
+        "pipeline_file": "vendor_1099_pipeline.json",
+        "cta": "Get DataTools for Accounting — $49 →",
+        "landing": "https://datatools.app/accounting/",
+    },
+    "ar-aging": {
+        "label": "Accounts receivable — open invoices",
+        "icon": "💵",
+        "h1": "Stop chasing the invoices your aging report counted twice. **Locally.**",
+        "sub": (
+            "Double-entered invoices inflate your AR aging and your "
+            "follow-ups. DataTools standardizes invoice dates, due dates, and "
+            "amounts, lowercases client emails, then removes the duplicate "
+            "invoice numbers — backfilling any blank status from the twin row. "
+            "In this sample: **26 rows → 21, five phantom invoices off the "
+            "books** in one pass."
+        ),
+        "data_file": "ar_open_invoices.csv",
+        "pipeline_file": "ar_open_invoices_pipeline.json",
+        "cta": "Get DataTools for Accounting — $49 →",
+        "landing": "https://datatools.app/accounting/",
     },
 }
-DEFAULT_PERSONA = "shopify-pet"
+DEFAULT_PERSONA = "bookkeeper"
 # ---------------------------------------------------------------------------

@@ -0,0 +1,71 @@
"""Demo pipelines must keep showing value (accounting personas).

Each persona's preloaded dataset + saved pipeline is the marketing surface
driven by ``src/gui/app_demo.py``. These tests pin that every demo loads,
runs clean, and produces its headline value (duplicate rows removed, clean
parse, disguised nulls caught) — so a stale dataset or an engine change can't
silently gut the sales demo. The read path mirrors ``app_demo._load_demo``
exactly (``dtype=str, keep_default_na=False`` so every disguised null survives
to the pipeline).
"""

from __future__ import annotations

from pathlib import Path

import pandas as pd
import pytest

from src.core.pipeline import Pipeline, run_pipeline

_REPO = Path(__file__).resolve().parent.parent
_DEMO = _REPO / "samples" / "demo"

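# (data_file, pipeline_file, min_duplicates_removed), one per accounting
# persona in app_demo.PERSONAS. The floors are lower bounds, not exact counts:
# bank and AR match their validated numbers (6 and 5 rows removed), while
# vendor_1099 collapses 24 records to 8 vendors (16 rows removed), so its
# floor of 8 is looser than the validated outcome.
_DEMOS = [
    ("bank_reconciliation.csv", "bank_reconciliation_pipeline.json", 6),
    ("vendor_1099.csv", "vendor_1099_pipeline.json", 8),
    ("ar_open_invoices.csv", "ar_open_invoices_pipeline.json", 5),
]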


@pytest.mark.parametrize("data_file,pipeline_file,min_dupes", _DEMOS)
def test_demo_runs_clean_and_shows_value(data_file, pipeline_file, min_dupes):
    df = pd.read_csv(_DEMO / data_file, dtype=str, keep_default_na=False)
    pipe = Pipeline.from_file(_DEMO / pipeline_file)
    res = run_pipeline(df, pipe, stop_on_error=True)

    # 1. Nothing errored — the demo never shows a visitor a red banner.
    assert all(sr.error is None for sr in res.step_results), [
        (sr.step.tool, sr.error) for sr in res.step_results
    ]

    # 2. Dedup removed the designed duplicate rows (the headline value).
    assert res.final_rows < res.initial_rows
    dedup = next(sr for sr in res.step_results if sr.step.tool == "dedup")
    assert dedup.summary["duplicates_removed"] >= min_dupes

    # 3. Standardization parsed every typed cell — a demo with unparseable
    #    cells reads as "the tool choked," which kills the pitch.
    fmt = next(sr for sr in res.step_results if sr.step.tool == "format_standardize")
    assert fmt.summary["cells_unparseable"] == 0
    assert fmt.summary["cells_changed"] > 0

    # 4. The disguised nulls (—, (blank), TBD, …) were caught.
    miss = next(sr for sr in res.step_results if sr.step.tool == "missing")
    assert miss.summary["sentinels_standardized"] > 0


def test_app_demo_references_each_demo_file():
    """Every data/pipeline file the demo app names must exist on disk.

    Guards against a rename in app_demo.py drifting away from samples/demo/
    (or vice versa) without a test catching it.
    """
    src = (_REPO / "src" / "gui" / "app_demo.py").read_text(encoding="utf-8")
    for data_file, pipeline_file, _ in _DEMOS:
        assert data_file in src, f"{data_file} not referenced in app_demo.py"
        assert pipeline_file in src, f"{pipeline_file} not referenced in app_demo.py"
        assert (_DEMO / data_file).exists(), f"missing {data_file}"
        assert (_DEMO / pipeline_file).exists(), f"missing {pipeline_file}"
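
The module docstring insists on `dtype=str, keep_default_na=False` for the read path. A small illustration of why, using an inline CSV rather than the demo files (the sample text and variable names below are made up for the example): with pandas defaults, `N/A`, `#N/A`, and empty cells are converted to NaN at load time, so the `missing` step would have no sentinels left to report.

```python
import io

import pandas as pd

# Illustrative data only; not one of the demo files.
CSV_TEXT = "Vendor,EIN\nAcme Realty,N/A\nCooper Plumbing,\nBright Books,#N/A\n"

# Default read: pandas treats "N/A", "#N/A" and "" as missing values.
default_df = pd.read_csv(io.StringIO(CSV_TEXT))
default_missing = int(default_df["EIN"].isna().sum())

# Demo read path: every cell stays the literal string from the file.
raw_df = pd.read_csv(io.StringIO(CSV_TEXT), dtype=str, keep_default_na=False)
raw_values = raw_df["EIN"].tolist()
```

Reading everything as text also keeps amounts such as `($89.50)` and dates such as `1/27/25` exactly as exported, leaving all parsing to the pipeline's `format_standardize` step.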