demo: reconstruct sales demos for an accounting audience

Replaces the Shopify / RevOps / Bookkeeper demo trio with three accounting
personas that share one buyer, each entering through a workflow where a
messy export costs money — all running the same saved 4-step pipeline:

- bank_reconciliation.csv (Bookkeeper): 26 -> 20 rows, 6 double-posted
  transactions caught after date+amount standardization.
- vendor_1099.csv (AP / 1099): 24 records -> 8 vendors, 7 missing EINs
  recovered via dedup merge — the 1099-complete story.
- ar_open_invoices.csv (AR): 26 -> 21 rows, 5 double-entered invoices
  removed, blank status backfilled from the twin row.

Every number is validated against the live engine and pinned by
tests/test_demo_pipelines.py (read path mirrors app_demo._load_demo:
dtype=str, keep_default_na=False). Rewires src/gui/app_demo.py PERSONAS
(keys bookkeeper / ap-1099 / ar-aging, accounting H1/sub/CTA) and rewrites
docs/DEMO-PLAN.md sections 3/4/7 with the validated outcomes.

(Repo hygiene forced by a partial-clone gap: finalizes the already-deleted,
unreferenced samples/messy_text.csv whose blob was unrecoverable.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 18:52:39 +00:00
parent 38616d69e2
commit 6df726e69e
16 changed files with 404 additions and 441 deletions


@@ -32,17 +32,22 @@ rebuilds it from a stale headline.
| Friction kills conversion | BUSINESS.md §7 | Demo dataset preloaded; no "select a file" first-step |
| < $1,200/mo recurring | BUSINESS.md §9 | Migration plan to $5/mo VPS only after rate-limit signal |
## 3. The three personas (per PLAN.md §2.3)
## 3. The three personas — one audience: accounting (per PLAN.md §2.3)
We niche to **accounting** and enter through the three workflows where a
messy export costs real money. Same engine, three landing pages — each
is the same buyer at a different desk (bookkeeping, payables, receivables).
| Tag | Persona | Top-of-funnel keyword | Demo dataset | Pre-saved pipeline |
|---|---|---|---|---|
| `shopify-pet` | Shopify operator (priority: pet supplies) | "shopify customer cleanup" | `samples/demo/shopify_pet_customers.csv` | `shopify_pet_pipeline.json` |
| `bookkeeper` | Bookkeeper / freelance accountant | "reconcile bank export csv" | `samples/demo/bookkeeper_bank_reconcile.csv` | `bookkeeper_bank_pipeline.json` |
| `revops` | Marketing / RevOps agency | "dedupe lead list across vendors" | `samples/demo/agency_combined_leads.csv` | `agency_leads_pipeline.json` |
| `bookkeeper` | Bookkeeper — bank reconciliation | "reconcile bank export csv duplicates" | `samples/demo/bank_reconciliation.csv` | `bank_reconciliation_pipeline.json` |
| `ap-1099` | Accounts payable — 1099 vendor prep | "clean 1099 vendor list missing EIN" | `samples/demo/vendor_1099.csv` | `vendor_1099_pipeline.json` |
| `ar-aging` | Accounts receivable — open invoices | "remove duplicate invoices aging report" | `samples/demo/ar_open_invoices.csv` | `ar_open_invoices_pipeline.json` |
Each persona gets its **own landing page URL**, its **own demo dataset
loaded by default**, and its **own H1 + below-the-fold copy.** The
engine is identical; only positioning differs.
Each persona gets its **own landing page URL** (`?p=<tag>`), its **own
demo dataset loaded by default**, and its **own H1 + below-the-fold
copy** — wired in `src/gui/app_demo.py::PERSONAS`. The engine is
identical; only positioning differs.
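As a rough illustration of the `?p=<tag>` wiring described above, the sketch below resolves a persona from a query string. The dict shape, the `resolve_persona` helper, and the fallback to `bookkeeper` are assumptions made for illustration; the real `PERSONAS` in `src/gui/app_demo.py` may be structured differently. Only the tags and dataset paths come from the table above.

```python
from urllib.parse import parse_qs

# Hypothetical shape; tags and dataset paths are taken from the persona table.
PERSONAS = {
    "bookkeeper": {"dataset": "samples/demo/bank_reconciliation.csv"},
    "ap-1099": {"dataset": "samples/demo/vendor_1099.csv"},
    "ar-aging": {"dataset": "samples/demo/ar_open_invoices.csv"},
}
DEFAULT_TAG = "bookkeeper"  # assumed fallback, not confirmed by the source


def resolve_persona(query_string: str) -> dict:
    """Return the persona entry for ?p=<tag>, or the default if unknown."""
    params = parse_qs(query_string.lstrip("?"))
    tag = params.get("p", [DEFAULT_TAG])[0]
    return PERSONAS.get(tag, PERSONAS[DEFAULT_TAG])
```

Falling back to a default persona rather than raising keeps a mistyped link from landing on an empty page, which fits the "no select-a-file first step" constraint, though the actual behavior is a product decision.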
## 4. Demo dataset specifications
@@ -53,114 +58,77 @@ persona's tooling. Each contains every kind of pollution the bundle's
five tools fix, so a single demo run shows every tool earning its
keep.
### 4.0 Pain-point coverage map
### 4.0 Value-proof map
Each demo dataset is engineered so the buyer sees their **own top
pain** demonstrated in the AFTER preview. The mapping below pairs
each pain from PLAN.md §2.3a with the rows / columns that exercise
it. Refresh the dataset only when this coverage drops.
Each demo dataset is engineered so the buyer sees their **own top pain**
fixed in the AFTER preview, with one unmistakable headline number. All
three run the same saved 4-step pipeline (Clean Text → Standardize
Formats → Fix Missing Values → Find Duplicates). The numbers below are
**validated against the live engine** (`tests/test_demo_pipelines.py`
pins them) — refresh the dataset only if a number stops landing.
| Persona | Pain (from PLAN §2.3a) | Demo coverage |
| Persona | Headline proof | What the visitor watches happen |
|---|---|---|
| Shopify pet | S1 — Klaviyo per-contact dupes | 5 dup pairs across rows 1–15 (case + format + address-twin variants) |
| Shopify pet | S2 — feed-rejection chars | smart-quote / NBSP / BOM in rows 1–6, 9, 11 |
| Shopify pet | S3 — multi-channel | partner-style customer IDs (`SHOP-`); demonstration of column-level mapping covered in RevOps demo |
| Shopify pet | S4 — subscription identity | rows 1+2, 7+8, 9+10 — same person, different format |
| Shopify pet | S5 — VAT-MOSS country drift | rows 16–18 (`United Kingdom` / `U.K.` / `UK`) + rows 19–20 (`Germany`/`Italia`) |
| Bookkeeper | B1 — month-overlap re-import | 7 dup pairs spanning Jan↔Feb and Mar boundaries |
| Bookkeeper | B2 — 1099 vendor consolidation | Amazon × 3 spellings, Verizon × 2, Acme Realty × 2, Adobe × 2, Costco × 2, Zoom × 2, Stripe × 4 |
| Bookkeeper | B3 — audit trail | every cell change in the run logged with old/new/rule — surface in the demo's audit tab |
| Bookkeeper | B4 — per-license economics | demonstrated by pricing copy, not data |
| Bookkeeper | B5 — multi-currency | rows 26 (EUR), 27 (GBP), 28 (BRL with comma decimal), 29 (parens-negative) |
| RevOps | R1 — per-contact tier | 6 cross-source dup pairs (HubSpot × LinkedIn × Manual Scrape) |
| RevOps | R2 — deliverability | rows 26–27 (`uma at uniform dot com`, `victor@@victorco.com` invalid emails) |
| RevOps | R3 — GDPR / privacy | demonstrated by the network-tab moat panel + zero-upload claim |
| RevOps | R4 — vendor unification | 3 source values (HubSpot / LinkedIn / Manual Scrape), 13 country codes, mixed-shape headers |
| RevOps | R5 — suppression list | rows 29–30 (`Suppressed`, `Opted Out` tags) |
| Bookkeeper | **26 → 20 rows · 6 phantom duplicates removed** | The same payment posted twice (different date + amount format) collapses to one; dates go ISO, parens-negatives become real negatives |
| AP / 1099 | **24 records → 8 vendors · 7 missing EINs recovered** | Each vendor's scattered records merge into one complete row; `merge=true` backfills the EIN/address/phone that any single record was missing |
| AR aging | **26 → 21 rows · 5 double-entered invoices removed** | Duplicate invoice numbers collapse; a blank status is backfilled from its twin; invoice + due dates go ISO, amounts numeric |
### 4.1 `shopify_pet_customers.csv` (20 rows)
### 4.1 `bank_reconciliation.csv` (26 rows) — Bookkeeper
**Looks like**: a Shopify customer export filtered for "Pet Supplies"
sales channel, 12 months activity.
**Looks like**: two months (Jan + Feb 2025) of business-checking activity
from a bank portal, where the Feb re-export overlaps Jan so the same
transaction posts twice. Columns: `Date, Description, Vendor, Category,
Amount, Account`.
**Pollution included**:
- Whitespace padding (" Alice ", "Sydney Opera House Drive ")
- Mixed phone formats: `(415) 555-1234`, `415.555.1234`, `5559876543`,
`+1 555-111-1111`
- International phones: GB, ES, DE, AU, JP (15 demo rows span 6
countries)
- Currency variants: `$1,240.50`, `£890.25`, `€2.410,75` (EU comma
decimal), `A$ 1,299.00`, `¥75000`
- Date formats: `2025-12-04`, `12/15/2025`, `?`, `(blank)`, `(none)`,
`#N/A`
- Disguised nulls: `N/A`, blank, `(blank)`, `?`, `#N/A`, `(none)`,
`unknown`
- Name casing: `EVE MARTINEZ`, `henry`, `O'NEIL`, `noah`, mixed Title /
ALL CAPS / lower
- Email case variants that *should* dedup: `Bob@PetShop.com` vs
`alice@petshop.com`
- 4 fuzzy duplicates (Alice/Bob same address, Grace/Henry same phone,
Carlos/Olivia same address, Ivy/Jack same address)
- Mixed date formats: `01/15/2025`, `2025-01-15`, `Jan 18 2025`, `1/27/25`, `Feb 5 2025`.
- Currency formats incl. negatives: `-$129.99`, `($89.50)` parens-negative, `+$3,450.00`, `- $599.88`, bare `-129.99`, `(50.00)`.
- Whitespace + NBSP padding; smart quotes and an em-dash inside descriptions.
- Vendor casing variety on *non-duplicate* rows: `Amazon` / `amazon.com` / `AMAZON.COM`, `Verizon` / `verizon`.
- Disguised nulls in Category: `—`, `(blank)`, `?`, `unknown`, `TBD`.
- **6 duplicate transactions** — each pair shares the same vendor + real value but a different date *and* amount format, so they collapse only after standardization.
**After running the pipeline**: 20 rows → 15, ~29 cells canonicalized,
~45 sentinels standardised, 5 cross-row duplicates merged. The
customer table is now Klaviyo-import-ready and the country column
(previously `UK` / `U.K.` / `United Kingdom` / `Germany` / `Italia`)
is GB / DE / IT — VAT MOSS report won't break.
**After running the pipeline** (validated): **26 → 20 rows, 6 duplicates
removed**, 36 date/amount cells standardized (0 unparseable), all dates
ISO, parens-negatives resolved (`($89.50)` → `-89.50`), disguised-null
categories flagged. The reconciliation ties out.
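To make "collapse only after standardization" concrete, here is a minimal sketch using a few rows from `bank_reconciliation.csv`. It is not the engine's code: the helper names, the format list, and the `(date, vendor, amount)` dedup key are assumptions chosen to mirror the behavior described above.

```python
import re
from datetime import datetime
from decimal import Decimal

# Formats seen in the sample; 4-digit-year patterns are tried before %y.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y", "%b %d %Y")


def to_iso(raw: str) -> str:
    """Parse any of the sample's date spellings into YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unparseable date: {raw!r}")


def parse_amount(raw: str) -> Decimal:
    """Turn '$1,200.00', '- $599.88' or '($89.50)' into a signed Decimal."""
    text = raw.strip()
    negative = text.startswith("(") and text.endswith(")")
    digits = re.sub(r"[$,\s+]", "", text.strip("()"))
    value = Decimal(digits)
    return -value if negative else value


def dedupe(rows):
    """Keep the first row per (ISO date, vendor, amount) key."""
    seen, kept = set(), []
    for date, vendor, amount in rows:
        key = (to_iso(date), vendor.strip().casefold(), parse_amount(amount))
        if key not in seen:
            seen.add(key)
            kept.append(key)
    return kept


ROWS = [
    ("01/15/2025", "Stripe", "+$3,450.00"),
    ("2025-01-15", "Verizon", "($89.50)"),
    ("1/27/25", "Amazon", "-$74.20"),
    ("2025-01-15", "Stripe", "3450.00"),   # same payout, re-exported
    ("01/15/2025", "Verizon", "-89.50"),   # same charge, re-exported
]
```

Compared as raw strings, all five rows are distinct; compared on the normalized key, `dedupe(ROWS)` keeps three.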
### 4.2 `bookkeeper_bank_reconcile.csv` (30 rows)
### 4.2 `vendor_1099.csv` (24 rows) — Accounts payable / 1099
**Looks like**: two months of business checking + credit-card activity
exported from a bank portal, with the Feb export accidentally
overlapping the Jan export at the month boundary.
**Looks like**: a 1099-NEC vendor master list where the same vendor was
entered 2–3 times across the year by different staff, each record holding
only *part* of the vendor's details. Columns: `Vendor, Contact, Email,
Phone, EIN, Address, Total_Paid`.
**Pollution included**:
- Mixed date formats: `01/15/2025`, `2025-01-15`, `Jan 18 2025`,
`1/27/25`, `Feb 5 2025`
- Currency formats: `-$129.99`, `($89.50)` parens-negative,
`+$3,450.00`, `- $599.88` space, bare `-129.99`, `(50.00)`
- Header trailing whitespace: `"Date "`
- Smart quotes around descriptions: `"autopay"`
- Em-dash sentinels in Vendor: `—`
- Smart-em-dash inside descriptions: `STAPLES #4422 — paper, toner`
- Vendor casing inconsistency: `Amazon` / `amazon.com` / `AMAZON.COM`,
`Verizon` / `verizon`
- 6 duplicate transactions (same date+amount+vendor recorded twice
with different formats)
- The duplicate records for a vendor share one email differing only by case/whitespace (the reliable dedup key, matched with the `email` normalizer).
- EIN / Phone / Address scattered across the duplicate set so no single record is complete but the union is — gaps marked `—`, `(blank)`, `TBD`, `unknown`, `N/A`.
- Vendor name casing/spelling variants, phone formats, EIN formats (`12-3456789` vs `123456789`), `Total_Paid` currency variants.
**After running the pipeline**: 30 rows → 23, ~84 cells normalized, 7
duplicates removed (month-overlap + VAT-MOSS dups). All dates
ISO-formatted, all amounts numeric (including EUR/GBP/BRL with comma
decimal), vendor casing canonical, parens-negative resolved.
**After running the pipeline** (validated): **24 records → 8 vendors, 16
duplicates removed, 7 missing EINs recovered** by `merge=true` +
`most_complete` survivor, 35 disguised nulls caught, phones/emails/amounts
standardized (0 unparseable). One vendor genuinely has no EIN in any
record — it survives with a blank EIN as the realistic "flag for
follow-up" case.
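The merge behavior described above can be sketched roughly as follows. This is an illustration under assumptions, not the engine's implementation: the function names are hypothetical, the sample rows are invented (the vendor file's contents are not reproduced here), and "most complete" is approximated as "most non-missing fields".

```python
# Tokens treated as "no value", taken from the pollution list above.
# "\u2014" is the dash placeholder used in the sample data.
SENTINELS = {"", "\u2014", "(blank)", "tbd", "unknown", "n/a"}


def is_missing(value: str) -> bool:
    return value.strip().casefold() in SENTINELS


def merge_group(records: list[dict]) -> dict:
    """Start from the most complete record, then backfill gaps from its twins."""
    survivor = max(records, key=lambda r: sum(not is_missing(v) for v in r.values()))
    candidates = [survivor] + [r for r in records if r is not survivor]
    return {
        field: next((r[field].strip() for r in candidates if not is_missing(r[field])), "")
        for field in survivor
    }


def dedup_vendors(rows: list[dict]) -> list[dict]:
    """Group on the normalized email (the dedup key), then merge each group."""
    groups: dict[str, list[dict]] = {}
    for row in rows:
        groups.setdefault(row["Email"].strip().lower(), []).append(row)
    return [merge_group(group) for group in groups.values()]


# Invented rows for illustration only.
ROWS = [
    {"Vendor": "Acme Supply", "Email": "billing@acme.example", "EIN": "\u2014", "Phone": "555-0100"},
    {"Vendor": "ACME SUPPLY", "Email": " Billing@Acme.example ", "EIN": "12-3456789", "Phone": "TBD"},
    {"Vendor": "Birch Consulting", "Email": "pay@birch.example", "EIN": "unknown", "Phone": "N/A"},
]
```

Here the two Acme records merge into one row carrying both the EIN and the phone, while the Birch record keeps an empty EIN because no twin can supply one, the "flag for follow-up" case.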
### 4.3 `agency_combined_leads.csv` (30 rows)
### 4.3 `ar_open_invoices.csv` (26 rows) — Accounts receivable
**Looks like**: a marketing-ops worksheet combining lead exports from
HubSpot + LinkedIn Sales Navigator + manual scraping, ready for
campaign targeting.
**Looks like**: an open-invoices (unpaid AR) export where some invoices
were double-entered in different formats and client contacts are messy.
Columns: `Invoice, Client, Email, Invoice_Date, Due_Date, Amount, Status`.
**Pollution included**:
- Phone formats per region: US, UK, Spain, Germany, China, India,
Australia, Mexico, Israel, Singapore, Hong Kong, Italy, South
Korea — 13 country codes
- Country column inconsistent: `USA` / `US` / `United States`
- Disguised nulls: `N/A`, `unknown`, `(unknown)`, `(blank)`, `(none)`,
`?`, `—`, `#N/A`, `TBD`
- Source column tags origin (`HubSpot` / `LinkedIn` / `Manual Scrape`)
- Email duplicates across sources with case variants: `alice@acme.com`
+ `Alice.Johnson@acme.com`, `bob@beta.com` + `Bob@Beta.com`,
`diana@delta.com` from two sources, `carlos@gamma.io` from two
sources, `Frank@Foxtrot.de` + `frank@foxtrot.de`
- Name casing: `DIANA LEE`, `henry`, `IVY CHEN`, mixed
- 6 fuzzy / cross-source duplicates designed to survive the dedup
- Score column with sentinel pollution that needs coercion to integer
- Two date columns with mixed formats; currency variants incl. a credit memo `($300.00)` → `-300.00`.
- Client name casing variety; email case variants (`AP@Acme.com` vs `ap@acme.com`).
- Status disguised nulls: `—`, `?`, `(blank)`, `TBD`, `unknown`, `(none)`.
- **5 double-entered invoices** — same invoice number twice, dates/amount in different formats, one copy with a blank status the other fills.
**After running the pipeline**: 30 rows → 24, ~43 cells canonicalized,
14 sentinels resolved, 6 cross-source duplicates merged with `merge=true`
so each survivor inherits the most-complete picture. Invalid-email
rows (deliverability stress) and `Suppressed`/`Opted Out` tags
(suppression-list use case) survive as flagged rows the operator
manually reviews.
**After running the pipeline** (validated): **26 → 21 rows, 5 duplicate
invoices removed**, both date columns ISO + amounts numeric + emails
lowercased (0 unparseable), 7 disguised-null statuses caught, and a blank
status backfilled from its twin via `merge=true`. The aging report stops
double-counting.
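The 26 → 21 headline can be sanity-checked by hand from the `Invoice` column of `samples/demo/ar_open_invoices.csv` as added in this commit. The snippet below is only a quick arithmetic check on that column, not a substitute for `tests/test_demo_pipelines.py`.

```python
from collections import Counter

# Invoice column of ar_open_invoices.csv, in file order.
INVOICES = [
    f"INV-{n}"
    for n in (
        1007, 1007, 1001, 1002, 1011, 1011, 1003, 1004, 1015, 1015,
        1005, 1006, 1019, 1019, 1008, 1009, 1010, 1023, 1023, 1012,
        1013, 1014, 1016, 1017, 1018, 1020,
    )
]

counts = Counter(INVOICES)
duplicated = sorted(inv for inv, seen in counts.items() if seen > 1)
extra_rows = len(INVOICES) - len(counts)

print(len(INVOICES), "rows ->", len(counts), "unique invoices")
print("double-entered:", duplicated)
```

Because the saved pipeline dedups on an exact `Invoice` match, the number of rows removed equals `extra_rows`.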
## 5. UX flow (per persona)
@@ -174,26 +142,26 @@ dedicated `app_demo.py` for the cloud build).
│ "{Persona-specific H1}" │
├──────────────────────────────────────────────────────────┤
│ │
│ Sample dataset preloaded: shopify_pet_customers.csv │
│ Sample dataset preloaded: bank_reconciliation.csv
│ [Replace with your own file (capped 100 rows)] │
│ │
│ ┌─ BEFORE preview (15 rows) ─────────────────────────┐ │
│ │ Alice | (415) 555-1234 | $1,240.50 | … │ │
│ │ Bob | 415.555.1234 | $1,240.50 | … │ │
│ ┌─ BEFORE preview (26 rows) ─────────────────────────┐ │
│ │ 01/15/2025 | Stripe | +$3,450.00 | … │ │
│ │ 2025-01-15 | Stripe | 3450.00 | … (dup) │ │
│ │ ... │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ Pipeline (saved): │
│ 1. Text Clean → 2. Format Standardize → │
│ 3. Missing → 4. Deduplicate
│ 1. Clean Text → 2. Standardize Formats → │
│ 3. Fix Missing → 4. Find Duplicates
│ │
│ [▶ Run pipeline] │
│ │
│ ┌─ AFTER preview ───────────────────────────────────┐ │
│ │ 15 rows → 11 (4 duplicates merged) │ │
│ │ 27 cells canonicalized · 33 sentinels resolved │ │
│ │ 26 rows → 20 (6 duplicate transactions removed) │ │
│ │ 36 cells standardized · 4 disguised nulls flagged │ │
│ │ │ │
│ │ Alice Johnson | +14155551234 | 1240.50 | … │ │
│ │ 2025-01-15 | Stripe | 3450.00 | … │ │
│ │ ... │ │
│ └──────────────────────────────────────────────────┘ │
│ │
@@ -244,27 +212,35 @@ not "demo crippled" data.
## 7. CTA copy (per persona)
### 7.1 Shopify pet operator
Copy lives in `src/gui/app_demo.py::PERSONAS` (H1 / sub / CTA per tag);
keep this section in sync with that dict.
- **H1**: *Clean your customer / vendor / subscriber exports — locally.*
- **Sub**: *Klaviyo-import-ready in 30 seconds. Catches duplicates Excel
misses. Your data never leaves your computer.*
- **CTA**: *Get DataTools for Shopify — $49 →*
### 7.1 Bookkeeper — bank reconciliation (`?p=bookkeeper`)
### 7.2 Bookkeeper / freelance accountant
- **H1**: *Reconcile messy bank exports. Hand your client an audit
trail.*
- **Sub**: *Catches the duplicate transaction Quickbooks imported twice.
Standardizes dates, amounts, vendor casing. Every change auditable.*
- **H1**: *Catch the transactions your bank export posted twice. Locally.*
- **Sub**: *When the Jan and Feb exports overlap, the same payment posts
twice in two formats. DataTools standardizes every date and amount, then
dedups on the real transaction so your reconciliation ties out — 26 rows
→ 20, six phantom duplicates gone.*
- **CTA**: *Get DataTools for Bookkeepers — $49 →*
### 7.3 Marketing / RevOps agency
### 7.2 Accounts payable — 1099 prep (`?p=ap-1099`)
- **H1**: *Dedupe leads across HubSpot, LinkedIn, and manual scrapes.*
- **Sub**: *International phones, country normalization, fuzzy dedup
with merge — one tool, one schema, no upload.*
- **CTA**: *Get DataTools for RevOps — $49 →*
- **H1**: *Build a clean 1099 vendor list — with the missing EINs filled in.*
- **Sub**: *The same vendor entered three times, each record holding only
part of the details. DataTools consolidates to one row and backfills the
gaps from the duplicates — 24 records → 8 vendors, 7 missing EINs
recovered.*
- **CTA**: *Get DataTools for Accounting — $49 →*
### 7.3 Accounts receivable — open invoices (`?p=ar-aging`)
- **H1**: *Stop chasing the invoices your aging report counted twice. Locally.*
- **Sub**: *Double-entered invoices inflate your AR aging and your
follow-ups. DataTools standardizes dates and amounts, lowercases client
emails, and removes the duplicate invoice numbers — 26 rows → 21, five
phantom invoices off the books.*
- **CTA**: *Get DataTools for Accounting — $49 →*
## 8. Telemetry / conversion tracking


@@ -1,31 +0,0 @@
Lead ID,First Name,Last Name,Company,Title,Email,Phone,Country,Source,Score,Last Activity,Tags
HUB-001,Alice,Johnson,Acme Corp,VP Marketing,alice@acme.com,(415) 555-1234,USA,HubSpot,87,2025-12-04,Enterprise
HUB-002,bob,smith,Beta LLC,Director Growth,bob@beta.com,N/A,United States,HubSpot,N/A,2025-11-22,SMB
HUB-003,Carlos,Garcia,Gamma Inc,CEO,carlos@gamma.io,+34 91 411 1111,Spain,HubSpot,82,2025-10-30,Enterprise
HUB-004,DIANA,LEE,Delta Co,Marketing Manager,diana@delta.com,020 7946 0958,United Kingdom,HubSpot,74,2025-12-15,Mid-Market
HUB-005,Eve,Martinez,Epsilon Group,VP Ops,eve@epsilon.com,(none),Mexico,HubSpot,(blank),2025-09-15,SMB
LIN-006,Alice,Johnson,Acme Corporation,VP of Marketing,Alice.Johnson@acme.com,4155551234,US,LinkedIn,,2025-12-04,Enterprise
LIN-007,Frank,Brown,Foxtrot Ltd,Head Sales,frank@foxtrot.de,+49 30 12345678,Germany,LinkedIn,68,2025-12-01,Mid-Market
LIN-008,Grace,Davis,Golf Industries,Marketing Lead,grace@golfind.com,+44 20 7946 0958,UK,LinkedIn,79,2025-11-08,Mid-Market
LIN-009,henry,wilson,Hotel Logistics,COO,henry@hotellog.com,+86 10 1234 5678,China,LinkedIn,91,2025-12-12,Enterprise
LIN-010,IVY CHEN,,India Tech,CTO,ivy@indiatech.in,+91 11 2345 6789,IN,LinkedIn,88,2025-11-30,Enterprise
LIN-011,Jack,Taylor,Juliet & Co,Founder,jack@juliet.co,unknown,United States,LinkedIn,?,(unknown),SMB
SCR-012,Diana,Lee,Delta Company,Marketing Manager,diana@delta.com,020-7946-0958,UK,Manual Scrape,74,12/15/2025,Mid-Market
SCR-013,kate,o'neil,Kilo Ventures,Partner,kate@kilo.vc,+1 415 555 2222,USA,Manual Scrape,N/A,?,Investor
SCR-014,Carlos,García,Gamma Incorporated,CEO,Carlos@gamma.io,+34-91-411-1111,Spain,Manual Scrape,82,Oct 30 2025,Enterprise
SCR-015,Liam,Park,Lima Solutions,Director Marketing,liam@limasol.kr,+82 2 2287 0114,South Korea,Manual Scrape,77,2025-11-20,Enterprise
SCR-016,Mia,nguyen,Mike Corp,VP Marketing,mia@mikecorp.com.au,02 9374 4000,Australia,Manual Scrape,72,2025-10-05,Mid-Market
SCR-017,Noah,Brown,November Inc,Head of Growth,noah@november.com,(555) 444-5555,US,Manual Scrape,,#N/A,SMB
HUB-018,Frank,Brown,Foxtrot,Head of Sales,Frank@Foxtrot.de,+49-30-12345678,Germany,HubSpot,68,2025-12-01,Mid-Market
HUB-019,Olivia,Rossi,Oscar Italia,CMO,olivia@oscar.it,+39 06 6982,Italy,HubSpot,85,2025-12-08,Enterprise
HUB-020,papa,wong,Papa Trading,Founder,papa@papatrading.hk,+852 2123 4567,Hong Kong,HubSpot,69,2025-11-15,SMB
LIN-021,Quinn,Reyes,Quebec Group,VP Sales,quinn@quebec.mx,+52 55 5555 0000,Mexico,LinkedIn,80,2025-12-05,Mid-Market
LIN-022,Robert,Tan,Romeo Logistics,Director,r.tan@romeo.sg,+65 6123 4567,Singapore,LinkedIn,76,2025-11-28,Mid-Market
SCR-023,Sara,Khan,Sierra Foods,Head Marketing,sara@sierra.in,+91-22-1234-5678,India,Manual Scrape,73,2025-12-02,SMB
SCR-024,bob,Smith,Beta,Director Growth,Bob@Beta.com,(none),United States,Manual Scrape,(unknown),(unknown),SMB
HUB-025,Tara,Levi,Tango Tech,VP Product,tara@tango.il,+972 3 6957 0000,Israel,HubSpot,82,2025-12-10,Enterprise
HUB-026,Uma,Patel,Uniform Health,CMO,uma at uniform dot com,+44 20 7946 8888,United Kingdom,HubSpot,71,2025-12-12,Enterprise
LIN-027,Victor,Lee,Victor Co,Director,victor@@victorco.com,+1 415 555 8888,USA,LinkedIn,69,2025-11-30,SMB
SCR-028,Wendy,Akin,Whiskey Inc,CMO,wendy@whiskey.tr,+90 212 252 1111,Turkey,Manual Scrape,77,2025-12-04,Mid-Market
SCR-029,Xander,Ng,Xray Group,Founder,xander@xray.sg,+65 6234 5678,Singapore,Manual Scrape,65,2025-11-15,Suppressed
HUB-030,Yara,Costa,Yankee Foods,Marketing Lead,yara@yankee.br,+55 11 3071 2222,Brazil,HubSpot,,2025-12-15,Opted Out


@@ -1,74 +0,0 @@
{
"steps": [
{
"tool": "text_clean",
"options": {},
"enabled": true,
"name": "1. Clean text (whitespace + smart quotes from copy-paste)"
},
{
"tool": "format_standardize",
"options": {
"column_types": {
"First Name": "name",
"Last Name": "name",
"Company": "name",
"Email": "email",
"Phone": "phone"
},
"phone_country_column": "Country",
"phone_format": "E164",
"email_gmail_canonical": true
},
"enabled": true,
"name": "2. E.164 phones (per-row country) · canonical emails · name casing"
},
{
"tool": "missing",
"options": {
"strategy": "none",
"standardize_sentinels": true,
"sentinels": ["N/A", "n/a", "—", "?", "(unknown)", "unknown", "(blank)", "(none)", "TBD", "#N/A"]
},
"enabled": true,
"name": "3. Standardize sentinels across vendor exports"
},
{
"tool": "column_map",
"options": {
"schema": {
"fields": [
{"name": "Lead ID", "dtype": "string", "required": true},
{"name": "First Name", "dtype": "string"},
{"name": "Last Name", "dtype": "string"},
{"name": "Company", "dtype": "string"},
{"name": "Title", "dtype": "string"},
{"name": "Email", "dtype": "string"},
{"name": "Phone", "dtype": "string"},
{"name": "Country", "dtype": "string"},
{"name": "Source", "dtype": "string"},
{"name": "Score", "dtype": "integer"},
{"name": "Last Activity", "dtype": "date"},
{"name": "Tags", "dtype": "string"}
]
},
"auto_infer": true,
"unmapped": "keep",
"coerce_types": true,
"reorder_to_schema": true,
"enforce_required": false
},
"enabled": true,
"name": "4. Coerce types · reorder to canonical schema"
},
{
"tool": "dedup",
"options": {
"survivor_rule": "most_complete",
"merge": true
},
"enabled": true,
"name": "5. Dedup leads across HubSpot / LinkedIn / Manual Scrape (fuzzy + merge)"
}
]
}


@@ -0,0 +1,27 @@
Invoice,Client,Email,Invoice_Date,Due_Date,Amount,Status
INV-1007,ACME LLC,AP@Acme.com,03/04/2025,04/03/2025,"$1,250.00",Open
INV-1007, Acme LLC ,ap@acme.com,2025-03-04,2025-04-03,"1,250.00",(blank)
INV-1001,northwind traders,billing@northwind.com,Mar 6 2025,04/05/2025,$980,Overdue
INV-1002,Globex Corp,AR@Globex.com,3/11/25,4/10/25,"2,400.50",Sent
INV-1011,initech,accounts@initech.com,04/01/2025,05/01/2025,"$ 1,100.00",?
INV-1011,Initech,Accounts@Initech.com,2025-04-01,2025-05-01,1100,Open
INV-1003,Stark Industries,ap@stark.com,Mar 6 2025,Apr 6 2025,$75.00,Open
INV-1004,Wayne Enterprises,ar@wayne.com,03/15/2025,04/14/2025,($300.00),
INV-1015,Hooli,billing@hooli.com,3/11/25,4/10/25,"$4,300.00",Overdue
INV-1015,hooli,Billing@Hooli.com,2025-03-11,2025-04-10,4300,(none)
INV-1005,Soylent Corp,ap@soylent.com,2025-03-20,2025-04-19,"$1,875.25",Sent
INV-1006,Umbrella Co,ar@umbrella.com,03/22/2025,04/21/2025,$640.00,TBD
INV-1019,Cyberdyne Systems,ap@cyberdyne.com,Mar 25 2025,04/24/2025,"$2,050.00",unknown
INV-1019,cyberdyne systems,AP@Cyberdyne.com,2025-03-25,2025-04-24,"2,050.00",Open
INV-1008,Vandelay Industries,ar@vandelay.com,3/28/25,4/27/25,$915.00,Overdue
INV-1009,Gekko & Co,billing@gekko.com,2025-03-30,2025-04-29,"$3,120.75",Open
INV-1010,Pied Piper,ap@piedpiper.com,04/02/2025,05/02/2025,$180,Sent
INV-1023,Tyrell Corp,ar@tyrell.com,04/05/2025,05/05/2025,($300.00),(blank)
INV-1023,Tyrell Corp,AR@Tyrell.com,2025-04-05,2025-05-05,-300.00,Open
INV-1012,Oscorp,ap@oscorp.com,Apr 8 2025,05/08/2025,"$5,000.00",Overdue
INV-1013,Nakatomi Trading,ar@nakatomi.com,4/9/25,5/9/25,$725.50,Sent
INV-1014,Bluth Company,billing@bluth.com,2025-04-10,2025-05-10,"$1,420.00",Open
INV-1016,Dunder Mifflin,ap@dundermifflin.com,04/12/2025,05/12/2025,$960.00,Overdue
INV-1017,Prestige Worldwide,ar@prestige.com,Apr 14 2025,05/14/2025,"$2,680.00",Sent
INV-1018,Sterling Cooper,billing@sterlingcooper.com,4/15/25,5/15/25,"$3,950.00",Open
INV-1020,Wonka Industries,ap@wonka.com,2025-04-18,2025-05-18,"$1,050.00",Overdue


@@ -0,0 +1,50 @@
{
"steps": [
{
"tool": "text_clean",
"enabled": true,
"options": {
"trim": true,
"collapse_whitespace": true,
"fold_smart_chars": true,
"strip_zero_width": true
}
},
{
"tool": "format_standardize",
"enabled": true,
"options": {
"column_types": {
"Invoice_Date": "date",
"Due_Date": "date",
"Amount": "currency",
"Email": "email"
}
}
},
{
"tool": "missing",
"enabled": true,
"options": {
"strategy": "none",
"standardize_sentinels": true,
"sentinels": ["—", "-", "?", "(blank)", "TBD", "unknown", "(none)", "N/A", "#N/A"]
}
},
{
"tool": "dedup",
"enabled": true,
"options": {
"survivor_rule": "most_complete",
"merge": true,
"strategies": [
{
"columns": [
{"column": "Invoice", "algorithm": "exact", "threshold": 100}
]
}
]
}
}
]
}
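For readers unfamiliar with the pipeline file format, the sketch below shows one way a runner might read it: list the enabled tools in order and pull out the dedup key columns. The engine's real loader is not part of this diff, so `enabled_tools` and `dedup_columns` are hypothetical helpers, and the embedded JSON is an abbreviated copy of the file above.

```python
import json

# Abbreviated copy of the pipeline above (options trimmed for brevity).
PIPELINE_JSON = """
{
  "steps": [
    {"tool": "text_clean", "enabled": true, "options": {"trim": true}},
    {"tool": "format_standardize", "enabled": true,
     "options": {"column_types": {"Invoice_Date": "date", "Amount": "currency"}}},
    {"tool": "missing", "enabled": true,
     "options": {"strategy": "none", "standardize_sentinels": true}},
    {"tool": "dedup", "enabled": true,
     "options": {"survivor_rule": "most_complete", "merge": true,
                 "strategies": [{"columns": [
                     {"column": "Invoice", "algorithm": "exact", "threshold": 100}]}]}}
  ]
}
"""


def enabled_tools(text: str) -> list[str]:
    """Return tool names in run order, skipping disabled steps."""
    steps = json.loads(text)["steps"]
    return [step["tool"] for step in steps if step.get("enabled", True)]


def dedup_columns(text: str) -> list[str]:
    """Collect the column names every dedup strategy matches on."""
    names = []
    for step in json.loads(text)["steps"]:
        if step["tool"] == "dedup":
            for strategy in step["options"].get("strategies", []):
                names.extend(col["column"] for col in strategy["columns"])
    return names
```

Step order matters here: dedup runs last so that it compares values the earlier steps have already standardized.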


@@ -0,0 +1,27 @@
Date,Description,Vendor,Category,Amount,Account
01/15/2025,“Stripe payout — weekly”,Stripe,Income,"+$3,450.00",Business Checking
2025-01-15,Verizon business line,Verizon,,($89.50),Business Checking
Jan 18 2025,Adobe Creative Cloud ,Adobe,(blank),-$129.99,Business Checking
1/27/25,Office supplies,Amazon,Supplies,-$74.20,Business Checking
02/03/2025, Monthly office rent,Highland Properties,Rent,"$1,200.00",Business Checking
Feb 5 2025,Account service fee,First National Bank,?,(50.00),Business Checking
2025-01-09,Shipping labels,amazon.com,unknown,-$18.40,Business Checking
1/22/25,Contractor — landing page,Bright Lane Design,TBD,- $599.88,Business Checking
Jan 30 2025,Late fee adjustment,verizon,Utilities,-$12.00,Business Checking
2025-01-11,Packaging tape,AMAZON.COM,Supplies,-$31.75,Business Checking
01/06/2025,Client deposit — ACME Co,ACME Co,Income,"$2,500.00",Business Checking
2025-01-20,Google Workspace,Google,Software,-$36.00,Business Checking
Jan 24 2025,Fuel — delivery van,Shell,Vehicle,-$58.63,Business Checking
1/28/25,QuickBooks subscription,Intuit,Software,-$80.00,Business Checking
2025-01-15,Stripe payout weekly,Stripe,Income,3450.00,Business Checking
01/15/2025,Verizon business line,Verizon,Utilities,-89.50,Business Checking
2025-01-18,Adobe Creative Cloud,Adobe,Software,-129.99,Business Checking
2025-02-03,Monthly office rent,Highland Properties,Rent,1200.00,Business Checking
2025-02-05,Account service fee,First National Bank,Bank Fees,-50.00,Business Checking
2025-01-22,Contractor landing page,Bright Lane Design,Contractors,-599.88,Business Checking
02/10/2025,Client deposit — Globex,Globex,Income,"$1,800.00",Business Checking
2025-02-12,Slack subscription,Slack,Software,-$96.00,Business Checking
Feb 14 2025,Coffee — client meeting,Blue Bottle,Meals,-$23.10,Business Checking
2/18/25,Insurance premium,Hartford,Insurance,-$240.50,Business Checking
02/21/2025,Refund — returned printer,Staples,Supplies,$210.99,Business Checking
Feb 25 2025,Domain renewal,Namecheap,Software,-$13.98,Business Checking

@@ -0,0 +1,6 @@
{"steps":[
{"tool":"text_clean","enabled":true,"options":{"trim":true,"collapse_whitespace":true,"fold_smart_chars":true,"strip_zero_width":true}},
{"tool":"format_standardize","enabled":true,"options":{"column_types":{"Date":"date","Amount":"currency"}}},
{"tool":"missing","enabled":true,"options":{"strategy":"none","standardize_sentinels":true,"sentinels":["—","(blank)","?","unknown","TBD","N/A","#N/A","(none)"]}},
{"tool":"dedup","enabled":true,"options":{"survivor_rule":"most_complete","merge":true,"strategies":[{"columns":[{"column":"Date","algorithm":"exact","threshold":100},{"column":"Amount","algorithm":"exact","threshold":100}]}]}}
]}
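
Reviewer note on step ordering: the dedup step matches `Date` and `Amount` exactly, so it only finds the double-posted rows if `format_standardize` has already reduced both columns to one canonical form. The sketch below illustrates that dependency with the standard library only. It is not the DataTools implementation; `normalize_date`, `normalize_amount`, `dedup_on_date_amount` and the list of accepted date formats are assumptions based on the sample rows.

```python
from datetime import date, datetime
from decimal import Decimal

# Date layouts seen in bank_reconciliation.csv (assumed list, not the engine's).
_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y", "%b %d %Y")


def normalize_date(raw: str) -> date:
    """Parse any of the sample date layouts into a date object."""
    text = raw.strip()
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")


def normalize_amount(raw: str) -> Decimal:
    """Turn '+$3,450.00', '($89.50)' or '- $599.88' into a signed Decimal."""
    text = raw.strip()
    negative = text.startswith("(") and text.endswith(")")  # accounting negative
    cleaned = text.strip("()").replace("$", "").replace(",", "").replace(" ", "")
    if cleaned.startswith("+"):
        cleaned = cleaned[1:]
    value = Decimal(cleaned)
    return (-value if negative else value).quantize(Decimal("0.01"))


def dedup_on_date_amount(rows):
    """Keep the first row for each normalized (date, amount) pair."""
    seen, kept = set(), []
    for raw_date, raw_amount in rows:
        key = (normalize_date(raw_date), normalize_amount(raw_amount))
        if key not in seen:
            seen.add(key)
            kept.append(key)
    return kept
```

Without the normalization, `01/15/2025` / `+$3,450.00` and `2025-01-15` / `3450.00` would compare as different strings and the exact match would miss them.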

@@ -1,56 +0,0 @@
{
"steps": [
{
"tool": "text_clean",
"options": {},
"enabled": true,
"name": "1. Clean text (header whitespace, smart quotes, em-dash)"
},
{
"tool": "format_standardize",
"options": {
"column_types": {
"Date": "date",
"Amount": "currency",
"Balance": "currency",
"Vendor": "name"
},
"currency_decimal": "auto",
"currency_preserve_code": false,
"currency_decimals": 2,
"date_output_format": "%Y-%m-%d"
},
"enabled": true,
"name": "2. ISO dates · numeric amounts (parens-negative) · vendor casing"
},
{
"tool": "missing",
"options": {
"strategy": "none",
"standardize_sentinels": true,
"sentinels": ["N/A", "n/a", "—", "-", "?", "(blank)", "(none)", "unknown", "#N/A"]
},
"enabled": true,
"name": "3. Standardize disguised nulls (— / N/A / (blank))"
},
{
"tool": "dedup",
"options": {
"survivor_rule": "most_complete",
"merge": false,
"date_column": "Date",
"strategies": [
{
"columns": [
{"column": "Date", "algorithm": "exact", "threshold": 100},
{"column": "Amount", "algorithm": "exact", "threshold": 100},
{"column": "Vendor", "algorithm": "jaro_winkler", "threshold": 80}
]
}
]
},
"enabled": true,
"name": "4. Dedup transactions on Date+Amount+fuzzy Vendor"
}
]
}

@@ -1,31 +0,0 @@
Txn ID,Date ,Description,Amount,Balance,Account,Vendor,Category
TXN-2401,01/15/2025," AMAZON.COM*4F2X9 PURCHASE",-$129.99,"$2,450.01",Checking,Amazon,Office Supplies
TXN-2402,2025-01-15,"AMAZON.COM*4F2X9 PURCHASE",-$129.99,"2450.01",Checking,amazon.com,Office Supplies
TXN-2403,Jan 18 2025,"STAPLES #4422 — paper, toner",($89.50),$2360.51,Checking,STAPLES,Office Supplies
TXN-2404,01/22/2025,"Verizon Wireless ""autopay""",-$120.00,"$2,240.51",Checking,Verizon,Utilities
TXN-2405,2025-01-22,Verizon Wireless autopay,-120.00,"2,240.51",Checking,verizon,Utilities
TXN-2406,01-25-2025,"Stripe Payout — invoice #1077","+$3,450.00","$5,690.51",Checking,Stripe,Income
TXN-2407,1/27/25,"Office Lease - Suite 204",-1500.00,"$4,190.51",Checking,Acme Realty,Rent
TXN-2408,02/01/2025,"Wire — Acme Realty Mgmt","-$1,500.00","$2,690.51",Checking,acme realty,Rent
TXN-2409,2025-02-03,"Adobe Creative Cloud annual","- $599.88","$2,090.63",Credit Card,Adobe Inc.,Software
TXN-2410,02/03/2025,"ADOBE CREATIVE CLOUD ANN",-599.88,2090.63,Credit Card,adobe,Software
TXN-2411,Feb 5 2025,"FedEx — overnight to client A",-$32.50,"$2,058.13",Checking,FedEx,Shipping
TXN-2412,02/07/2025,"Square fee — invoice #1078","-$3.20","$2,054.93",Checking,Square,Fees
TXN-2413,02/10/2025,"Stripe Payout invoice #1079","+ $1,200.00","$3,254.93",Checking,Stripe,Income
TXN-2414,2025-02-12,"USPS PRIORITY — to vendor B","-12.40","$3,242.53",Checking,USPS,Shipping
TXN-2415,02/14/2025,"Zoom Video Comms — annual","-$149.90","$3,092.63",Credit Card,Zoom,Software
TXN-2416,2/14/25,"Zoom Video Communications","-149.90","3092.63",Credit Card,zoom,Software
TXN-2417,02/18/2025,"Costco Whse #421 — supplies","-$237.84","$2,854.79",Checking,Costco,Office Supplies
TXN-2418,2025-02-18,COSTCO WHSE #421,-237.84,"2,854.79",Checking,costco,Office Supplies
TXN-2419,02/22/2025,"Bank fee — int'l wire","-$45.00","$2,809.79",Checking,Bank Fee,Fees
TXN-2420,02/24/2025,"Stripe Payout — invoice #1080","+$2,100.00","$4,909.79",Checking,Stripe,Income
TXN-2421,02/28/2025," Refund — overcharge ","+$45.00","$4,954.79",Checking,,Refunds
TXN-2422,Feb 28 2025,REFUND OVERCHARGE,45.00,4954.79,Checking,N/A,Refunds
TXN-2423,03/01/2025,"Office Lease — Suite 204","-$1,500.00","$3,454.79",Checking,Acme Realty,Rent
TXN-2424,2025-03-03,"Slack Technologies — annual","-$840.00","$2,614.79",Credit Card,Slack,Software
TXN-2425,03/05/2025,"Stripe Payout — invoice #1081","+$1,875.00","$4,489.79",Checking,Stripe,Income
TXN-2426,03/08/2025,"Wire — Berlin office rent (EUR vendor)","-€1.450,00","$2,989.79",Checking,Mietverwaltung GmbH,Rent
TXN-2427,03/10/2025,"London supplier invoice (GBP)","-£950.00","$1,939.79",Checking,Stationery Co Ltd,Office Supplies
TXN-2428,03/12/2025,"São Paulo agency retainer","-R$ 1.299,90","$1,679.79",Credit Card,Estúdio Ágil,Software
TXN-2429,03/14/2025,"VAT MOSS prep — multi-EU sales","($89.00)","$1,768.79",Checking,EU VAT Service,Fees
TXN-2430,03/14/2025,"VAT MOSS prep multi EU sales",-89.00,"1,768.79",Checking,eu vat service,Fees

@@ -1,21 +0,0 @@
Customer ID,First Name,Last Name,Email,Phone,Address,City,State,ZIP,Country,Total Orders,Lifetime Value,Last Order Date,Tags
SHOP-1001, Alice ,Johnson,alice@petshop.com,(415) 555-1234,"123 Main St., Apt 4B",San Francisco,CA,94102,US,12,$1,240.50,2025-12-04,VIP
SHOP-1002,Bob,SMITH,Bob@PetShop.com,415.555.1234,"123 Main St, Apt 4B",San Francisco,CA,94102,US,12,"$1,240.50",N/A,VIP
SHOP-1003,carlos,garcia,carlos@petshop.com,5559876543,"742 Evergreen Terrace",Springfield,IL,62704,US,5,420.00,12/15/2025,Wholesale
SHOP-1004,Diana,Lee,diana@petshop.com,(555) 222-3344,"PO Box 12, Sherwood Forest",Nottingham,,NG1 5BA,GB,8,£890.25,2025-10-30,VIP|Wholesale
SHOP-1005,EVE MARTINEZ,,eve.martinez@petshop.com,555-9988,"Calle Mayor 45","Madrid",,"28013",ES,3,€180,2025-09-15,
SHOP-1006,Frank,Brown,frank@petshop.com,, ,"Berlin",BE,10115,DE,15,€2.410,75,(blank),Wholesale
SHOP-1007,Grace,Davis,grace@petshop.com,+1 555-111-1111,"888 Maple Ave",Toronto,ON,M5V 3A8,CA,1,$49.99,#N/A,New
SHOP-1008,henry,wilson,Henry@PetShop.com,5551111111,"888 Maple Avenue","Toronto",ON,M5V 3A8,CA,1,$49.99,2025-12-01,New
SHOP-1009,Ivy,Chen,IVY@petshop.com,+1 (555) 777-7777,"550 Elm Street, Suite 200",Brooklyn,NY,11201,US,4,"$320.50 ",10/12/2025,
SHOP-1010,Jack,Taylor,jack@petshop.com,(none),"550 elm street, suite 200",brooklyn,NY,11201,US,4,$320.50,2025-10-12,
SHOP-1011,kate,o'neil,kate.oneil@petshop.com,415-555-2222,"99 King's Rd","London",,SW3 4LX,GB,7,£675.00,?,VIP
SHOP-1012,luis,rodriguez,LUIS@petshop.com,+34 91 411 1111,"Avenida de la Paz 12, 3°D",Madrid,,28013,ES,2,"€89,99",unknown,
SHOP-1013,Mia,Park,mia@petshop.com,02-9374-4000,"Sydney Opera House Drive","Sydney",NSW,2000,AU,9,"A$ 1,299.00",2025-11-20,Wholesale
SHOP-1014,Noah,nguyen,noah@petshop.com,+81 3 3210 7000,"丸の内 2-7-3","Tokyo",,100-0005,JP,6,"¥75000",2025-12-10,VIP
SHOP-1015,Olivia,Brown,OLIVIA@PETSHOP.COM,(555) 333-4444,"742 evergreen terrace",springfield,IL,62704,US,3,$180.00,(none),
SHOP-1016,Pavel,Novak,pavel@petshop.com,+44 20 7946 1234,"22 Baker Street",London,,W1U 6AB,United Kingdom,4,£412.00,2025-11-18,VIP
SHOP-1017,Quinn,Murphy,quinn@petshop.com,+44 20 7946 5678,"5 Princes Street",Edinburgh,,EH2 2DA,U.K.,2,£189.50,2025-12-09,
SHOP-1018,Rachel,O'Brien,rachel@petshop.com,02-9374-9999,"100 George Street","Sydney",NSW,2000,UK,1,£75.00,?,New
SHOP-1019,Sam,Klein,sam@petshop.com,+49 30 99887766,"Friedrichstraße 100","Berlin",,10117,Germany,11,"€1.890,40",2025-12-11,VIP|Wholesale
SHOP-1020,Tara,Gianni,tara@petshop.com,+39 06 6982 4567,"Via del Corso 250",Roma,,00186,Italia,5,"€649,99",2025-12-03,

@@ -1,49 +0,0 @@
{
"steps": [
{
"tool": "text_clean",
"options": {},
"enabled": true,
"name": "1. Clean text (whitespace, smart quotes, NBSP, BOM)"
},
{
"tool": "format_standardize",
"options": {
"column_types": {
"First Name": "name",
"Last Name": "name",
"Email": "email",
"Phone": "phone",
"Address": "address",
"Lifetime Value": "currency",
"Last Order Date": "date"
},
"phone_country_column": "Country",
"address_country_column": "Country",
"currency_preserve_code": true,
"currency_decimal": "auto",
"email_gmail_canonical": false
},
"enabled": true,
"name": "2. Standardize phones, addresses, dates, currencies, names"
},
{
"tool": "missing",
"options": {
"strategy": "none",
"standardize_sentinels": true
},
"enabled": true,
"name": "3. Standardize disguised nulls (N/A, -, (blank), ?, #N/A)"
},
{
"tool": "dedup",
"options": {
"survivor_rule": "most_complete",
"merge": true
},
"enabled": true,
"name": "4. Dedup customers (fuzzy match, merge missing fields)"
}
]
}

@@ -0,0 +1,25 @@
Vendor,Contact,Email,Phone,EIN,Address,Total_Paid
Acme Realty,Bob Stein,acme.ap@acmerealty.com,(212) 555-0100,12-3456789,(blank),"$12,400.00"
acme realty llc,Bob Stein, ACME.AP@AcmeRealty.com ,,,"118 Canal St, New York, NY 10013","$8,250"
ACME REALTY,R. Stein,Acme.AP@acmerealty.com,212.555.0100,N/A,TBD,"1,999.99"
Bright Books Bookkeeping,Dana Cole,hello@brightbooks.com,,98-7654321,(blank),"$6,000.00"
bright books,Dana Cole,HELLO@brightbooks.com,(415) 555-0142,unknown,"50 Market St, San Francisco, CA 94105","$6,000"
"Bright Books, LLC",D. Cole, hello@BrightBooks.com,4155550142,98-7654321,unknown,"5,500.00"
Northwind Logistics,Sam Reyes,ap@northwindlog.com,(312) 555-0198,,(blank),"$22,750.00"
northwind logistics inc,Sam Reyes,AP@NorthwindLog.com,,45-6789012,"900 W Loop, Chicago, IL 60607","$22,750"
Pearl Design Studio,“Jo” Marsh,billing@pearldesign.co,,33-2211000,(blank),"$3,200.00"
pearl design,Jo Marsh,Billing@PearlDesign.co,(206) 555-0167,TBD,"77 Pike St, Seattle, WA 98101","$3,200"
PEARL DESIGN STUDIO,J. Marsh, billing@pearldesign.co ,206.555.0167,33-2211000,unknown,"2,800.00"
Cooper Plumbing,Lee Cooper,office@cooperplumb.com,(617) 555-0133,,(blank),"$1,450.00"
cooper plumbing co,Lee Cooper,OFFICE@cooperplumb.com,,TBD,"12 Beacon St, Boston, MA 02108","$1,450"
COOPER PLUMBING,L. Cooper, office@CooperPlumb.com,6175550133,N/A,unknown,900.00
Vertex Marketing,Pat Nguyen,accounts@vertexmktg.com,(404) 555-0119,77-8899001,(blank),"$15,000.00"
vertex marketing group,Pat Nguyen,ACCOUNTS@VertexMktg.com,,unknown,"300 Peachtree St, Atlanta, GA 30308","$15,000"
Summit Consulting,Ray Brooks,invoices@summitconsult.net,,21-0099887,(blank),"$9,800.00"
summit consulting llc,Ray Brooks,INVOICES@summitconsult.net,(303) 555-0175,,"1100 17th St, Denver, CO 80202","$9,800"
SUMMIT CONSULTING,R. Brooks, invoices@SummitConsult.net ,303.555.0175,21-0099887,TBD,"7,250.00"
Garcia Catering,Mia Garcia,ap@garciacatering.com,(305) 555-0188,,(blank),"$4,600.00"
garcia catering services,Mia Garcia,AP@GarciaCatering.com,,66-1234509,"450 Ocean Dr, Miami, FL 33139",$600.00
Northwind Logistics,S. Reyes, ap@northwindlog.com ,312.555.0198,45-6789012,TBD,"21,000.00"
VERTEX MARKETING,P. Nguyen, accounts@vertexmktg.com ,404.555.0119,77-8899001,TBD,"14,500.00"
GARCIA CATERING,M. Garcia,ap@GARCIACATERING.com,305.555.0188,66-1234509,unknown,"4,200.00"

@@ -0,0 +1,49 @@
{
"steps": [
{
"tool": "text_clean",
"enabled": true,
"options": {
"trim": true,
"collapse_whitespace": true,
"fold_smart_chars": true,
"strip_zero_width": true
}
},
{
"tool": "format_standardize",
"enabled": true,
"options": {
"column_types": {
"Phone": "phone",
"Email": "email",
"Total_Paid": "currency"
}
}
},
{
"tool": "missing",
"enabled": true,
"options": {
"strategy": "none",
"standardize_sentinels": true,
"sentinels": ["—", "-", "--", "(blank)", "TBD", "unknown", "N/A", "#N/A", "(none)"]
}
},
{
"tool": "dedup",
"enabled": true,
"options": {
"survivor_rule": "most_complete",
"merge": true,
"strategies": [
{
"columns": [
{"column": "Email", "algorithm": "exact", "threshold": 100, "normalizer": "email"}
]
}
]
}
}
]
}
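
Reviewer note on `"survivor_rule": "most_complete"` with `"merge": true`: the commit message describes this combination as recovering missing EINs from duplicate rows. The sketch below shows one plausible reading of that behavior. It is an illustration under assumptions, not the engine's code: `merge_duplicates` is a hypothetical helper, the grouping key is a trimmed, lowercased email, and blanks are assumed to be empty strings after sentinel standardization.

```python
def merge_duplicates(records: list[dict[str, str]], key: str) -> list[dict[str, str]]:
    """Collapse records sharing a normalized key into one merged record each."""
    groups: dict[str, list[dict[str, str]]] = {}
    for rec in records:
        groups.setdefault(rec[key].strip().lower(), []).append(rec)

    merged = []
    for group in groups.values():
        # Survivor: the record with the most non-empty fields (first wins ties).
        survivor = dict(max(group, key=lambda r: sum(1 for v in r.values() if v)))
        # Merge: fill each blank survivor field from any duplicate that has it.
        for other in group:
            for field, value in other.items():
                if not survivor.get(field) and value:
                    survivor[field] = value
        merged.append(survivor)
    return merged
```

In the Summit Consulting rows of the sample, one record carries the EIN and another the address and phone; under this reading the merged row ends up with all three.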

@@ -1,13 +0,0 @@
customer_name,email,vendor,memo
Alice Johnson,alice@example.com,ACME Corp ,Welcome aboard
Bob Smith,bob@example.com,ACME Corp,Returning customer
Charlie Brown,charlie@example.com,Globex,Net 30
Diana Prince,diana@example.com,Globex,VIP
Edward Norton,ed@example.com,“Best Pet Supplies”,Order#42 - rush
Frank Castle,frank@example.com,Stark—Industries,"Line 1
Line 2
Line 3"
grace HOPPER ,grace@example.com,Globex,Loves long memos…
Henry Ford,henry@example.com,Ford Motor,Industrial
Iris West,iris@example.com,S.T.A.R. Labs,Notewith-bell
Jane Doe,jane@example.com,Acme,Standard

@@ -9,10 +9,10 @@ side-by-side, and converts the visitor to a Gumroad purchase.
 Launch:
     streamlit run src/gui/app_demo.py
-URL routing:
-    https://demo.datatools.app/?p=shopify-pet (Shopify operator)
-    https://demo.datatools.app/?p=bookkeeper (Bookkeeper)
-    https://demo.datatools.app/?p=revops (RevOps agency)
+URL routing (all three personas serve one audience: accounting):
+    https://demo.datatools.app/?p=bookkeeper (Bookkeeper — bank reconciliation)
+    https://demo.datatools.app/?p=ap-1099 (Accounts payable — 1099 vendor prep)
+    https://demo.datatools.app/?p=ar-aging (Accounts receivable — open invoices)
 Free / paid boundary (per docs/DEMO-PLAN.md §6):
   - input rows capped at ``DEMO_ROW_CAP``
@@ -64,59 +64,66 @@ GUMROAD_BASE: str = "https://gumroad.com/l/datatools"
 DEMO_DIR = _project_root / "samples" / "demo"
+# All three personas serve one audience — accounting — entering through the
+# three workflows where messy exports cost real money: bank reconciliation,
+# 1099 / AP vendor prep, and AR aging. Each H1/sub names the exact pain and
+# the validated demo outcome (see docs/DEMO-PLAN.md §4 for the numbers).
 PERSONAS: dict[str, dict[str, Any]] = {
-    "shopify-pet": {
-        "label": "Shopify pet operator",
-        "icon": "🛍️",
-        "h1": "Klaviyo-import-ready customer lists. **In 30 seconds. Locally.**",
-        "sub": (
-            "Your Shopify customer export has duplicates Excel can't catch, "
-            "international phones Excel can't parse, and disguised nulls "
-            "(`N/A`, `(blank)`, `?`) that break Klaviyo's import. "
-            "DataTools fixes all of it in one pass — and your data never "
-            "leaves your computer."
-        ),
-        "data_file": "shopify_pet_customers.csv",
-        "pipeline_file": "shopify_pet_pipeline.json",
-        "cta": "Get DataTools for Shopify — $49 →",
-        "landing": "https://datatools.app/shopify/",
-    },
     "bookkeeper": {
-        "label": "Bookkeeper / freelance accountant",
+        "label": "Bookkeeper — bank reconciliation",
         "icon": "📒",
-        "h1": "Reconcile messy bank exports. **Hand your client an audit trail.**",
+        "h1": "Catch the transactions your bank export posted twice. **Locally.**",
         "sub": (
-            "The Jan and Feb exports overlap; the same transaction posts twice. "
-            "Vendor names are *Amazon* / *amazon.com* / *AMAZON.COM*4F2X9* in "
-            "three rows. DataTools dedups on Date + Amount + fuzzy Vendor, "
-            "produces ISO dates and numeric amounts, and gives you a row-level "
-            "audit log to hand the client."
+            "When the Jan and Feb exports overlap, the same payment lands "
+            "twice — once as `01/15/2025 +$3,450.00`, once as "
+            "`2025-01-15 3450.00`. DataTools standardizes every date and "
+            "amount, then dedups on the *real* transaction so your "
+            "reconciliation ties out. In this sample: **26 rows → 20, six "
+            "phantom duplicates removed** — and your data never leaves your "
+            "computer."
         ),
-        "data_file": "bookkeeper_bank_reconcile.csv",
-        "pipeline_file": "bookkeeper_bank_pipeline.json",
+        "data_file": "bank_reconciliation.csv",
+        "pipeline_file": "bank_reconciliation_pipeline.json",
         "cta": "Get DataTools for Bookkeepers — $49 →",
         "landing": "https://datatools.app/bookkeeper/",
     },
-    "revops": {
-        "label": "Marketing / RevOps agency",
-        "icon": "🪢",
-        "h1": "Dedupe lead lists across HubSpot, LinkedIn, and manual scrapes — **locally.**",
+    "ap-1099": {
+        "label": "Accounts payable — 1099 prep",
+        "icon": "🧾",
+        "h1": "Build a clean 1099 vendor list — **with the missing EINs filled in.**",
         "sub": (
-            "The same prospect shows up in HubSpot as `alice@acme.com`, in "
-            "LinkedIn as `Alice.Johnson@acme.com`, and in your VA's manual "
-            "scrape as `alice@acme.com` again. Country is `USA` / `US` / "
-            "`United States`. DataTools fuzzy-matches across sources, "
-            "normalizes phones for 50+ countries, and merges survivors "
-            "with their most-complete fields — without uploading anything."
+            "The same vendor was entered three times across the year — one "
+            "record has the EIN, another the address, a third the phone. "
+            "DataTools consolidates each vendor to one row and *backfills the "
+            "gaps from the duplicates*. In this sample: **24 messy records → "
+            "8 complete vendors, with 7 missing EINs recovered** from the "
+            "duplicate rows. No upload, no VLOOKUP gymnastics."
         ),
-        "data_file": "agency_combined_leads.csv",
-        "pipeline_file": "agency_leads_pipeline.json",
-        "cta": "Get DataTools for RevOps — $49 →",
-        "landing": "https://datatools.app/revops/",
+        "data_file": "vendor_1099.csv",
+        "pipeline_file": "vendor_1099_pipeline.json",
+        "cta": "Get DataTools for Accounting — $49 →",
+        "landing": "https://datatools.app/accounting/",
     },
+    "ar-aging": {
+        "label": "Accounts receivable — open invoices",
+        "icon": "💵",
+        "h1": "Stop chasing the invoices your aging report counted twice. **Locally.**",
+        "sub": (
+            "Double-entered invoices inflate your AR aging and your "
+            "follow-ups. DataTools standardizes invoice dates, due dates, and "
+            "amounts, lowercases client emails, then removes the duplicate "
+            "invoice numbers — backfilling any blank status from the twin row. "
+            "In this sample: **26 rows → 21, five phantom invoices off the "
+            "books** in one pass."
+        ),
+        "data_file": "ar_open_invoices.csv",
+        "pipeline_file": "ar_open_invoices_pipeline.json",
+        "cta": "Get DataTools for Accounting — $49 →",
+        "landing": "https://datatools.app/accounting/",
+    },
 }
-DEFAULT_PERSONA = "shopify-pet"
+DEFAULT_PERSONA = "bookkeeper"
 # ---------------------------------------------------------------------------

@@ -0,0 +1,71 @@
"""Demo pipelines must keep showing value (accounting personas).
Each persona's preloaded dataset + saved pipeline is the marketing surface
driven by ``src/gui/app_demo.py``. These tests pin that every demo loads,
runs clean, and produces its headline value (duplicate rows removed, clean
parse, disguised nulls caught) — so a stale dataset or an engine change can't
silently gut the sales demo. The read path mirrors ``app_demo._load_demo``
exactly (``dtype=str, keep_default_na=False`` so every disguised null survives
to the pipeline).
"""

from __future__ import annotations

from pathlib import Path

import pandas as pd
import pytest

from src.core.pipeline import Pipeline, run_pipeline

_REPO = Path(__file__).resolve().parent.parent
_DEMO = _REPO / "samples" / "demo"

# (data_file, pipeline_file, min_duplicates_removed) — one per accounting
# persona in app_demo.PERSONAS. The dup floors are the validated demo numbers.
_DEMOS = [
    ("bank_reconciliation.csv", "bank_reconciliation_pipeline.json", 6),
    ("vendor_1099.csv", "vendor_1099_pipeline.json", 8),
    ("ar_open_invoices.csv", "ar_open_invoices_pipeline.json", 5),
]


@pytest.mark.parametrize("data_file,pipeline_file,min_dupes", _DEMOS)
def test_demo_runs_clean_and_shows_value(data_file, pipeline_file, min_dupes):
    df = pd.read_csv(_DEMO / data_file, dtype=str, keep_default_na=False)
    pipe = Pipeline.from_file(_DEMO / pipeline_file)
    res = run_pipeline(df, pipe, stop_on_error=True)

    # 1. Nothing errored — the demo never shows a visitor a red banner.
    assert all(sr.error is None for sr in res.step_results), [
        (sr.step.tool, sr.error) for sr in res.step_results
    ]

    # 2. Dedup removed the designed duplicate rows (the headline value).
    assert res.final_rows < res.initial_rows
    dedup = next(sr for sr in res.step_results if sr.step.tool == "dedup")
    assert dedup.summary["duplicates_removed"] >= min_dupes

    # 3. Standardization parsed every typed cell — a demo with unparseable
    #    cells reads as "the tool choked," which kills the pitch.
    fmt = next(sr for sr in res.step_results if sr.step.tool == "format_standardize")
    assert fmt.summary["cells_unparseable"] == 0
    assert fmt.summary["cells_changed"] > 0

    # 4. The disguised nulls (—, (blank), TBD, …) were caught.
    miss = next(sr for sr in res.step_results if sr.step.tool == "missing")
    assert miss.summary["sentinels_standardized"] > 0


def test_app_demo_references_each_demo_file():
    """Every data/pipeline file the demo app names must exist on disk.

    Guards against a rename in app_demo.py drifting away from samples/demo/
    (or vice versa) without a test catching it.
    """
    src = (_REPO / "src" / "gui" / "app_demo.py").read_text(encoding="utf-8")
    for data_file, pipeline_file, _ in _DEMOS:
        assert data_file in src, f"{data_file} not referenced in app_demo.py"
        assert pipeline_file in src, f"{pipeline_file} not referenced in app_demo.py"
        assert (_DEMO / data_file).exists(), f"missing {data_file}"
        assert (_DEMO / pipeline_file).exists(), f"missing {pipeline_file}"
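
Reviewer note on the floors in `_DEMOS`: the commit message quotes row counts of 26 -> 20, 24 -> 8 and 26 -> 21. The sketch below compares the removals those counts imply with the `min_duplicates_removed` floors. It assumes every dropped row is counted as a duplicate removed by the dedup step, which the diff does not confirm; `HEADLINES`, `FLOORS` and `slack` are names introduced here for illustration.

```python
# (rows in, rows out) as quoted in the commit message.
HEADLINES = {
    "bank_reconciliation.csv": (26, 20),
    "vendor_1099.csv": (24, 8),
    "ar_open_invoices.csv": (26, 21),
}

# Floors copied from _DEMOS in tests/test_demo_pipelines.py.
FLOORS = {
    "bank_reconciliation.csv": 6,
    "vendor_1099.csv": 8,
    "ar_open_invoices.csv": 5,
}


def slack(name: str) -> int:
    """Rows implied removed by the headline minus the test's floor."""
    rows_in, rows_out = HEADLINES[name]
    return (rows_in - rows_out) - FLOORS[name]


if __name__ == "__main__":
    for name in HEADLINES:
        print(name, slack(name))
```

Under that assumption the bank and AR floors are tight, while the vendor floor of 8 is looser than the 16 removals implied by 24 -> 8, so the vendor demo could lose half its duplicates without failing the test.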