demo: reconstruct sales demos for an accounting audience
Replaces the Shopify / RevOps / Bookkeeper demo trio with three accounting personas that share one buyer, each entering through a workflow where a messy export costs money — all running the same saved 4-step pipeline: - bank_reconciliation.csv (Bookkeeper): 26 -> 20 rows, 6 double-posted transactions caught after date+amount standardization. - vendor_1099.csv (AP / 1099): 24 records -> 8 vendors, 7 missing EINs recovered via dedup merge — the 1099-complete story. - ar_open_invoices.csv (AR): 26 -> 21 rows, 5 double-entered invoices removed, blank status backfilled from the twin row. Every number is validated against the live engine and pinned by tests/test_demo_pipelines.py (read path mirrors app_demo._load_demo: dtype=str, keep_default_na=False). Rewires src/gui/app_demo.py PERSONAS (keys bookkeeper / ap-1099 / ar-aging, accounting H1/sub/CTA) and rewrites docs/DEMO-PLAN.md sections 3/4/7 with the validated outcomes. (Repo hygiene forced by a partial-clone gap: finalizes the already-deleted, unreferenced samples/messy_text.csv whose blob was unrecoverable.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -32,17 +32,22 @@ rebuilds it from a stale headline.
|
|||||||
| Friction kills conversion | BUSINESS.md §7 | Demo dataset preloaded; no "select a file" first-step |
|
| Friction kills conversion | BUSINESS.md §7 | Demo dataset preloaded; no "select a file" first-step |
|
||||||
| < $1,200/mo recurring | BUSINESS.md §9 | Migration plan to $5/mo VPS only after rate-limit signal |
|
| < $1,200/mo recurring | BUSINESS.md §9 | Migration plan to $5/mo VPS only after rate-limit signal |
|
||||||
|
|
||||||
## 3. The three personas (per PLAN.md §2.3)
|
## 3. The three personas — one audience: accounting (per PLAN.md §2.3)
|
||||||
|
|
||||||
|
We niche to **accounting** and enter through the three workflows where a
|
||||||
|
messy export costs real money. Same engine, three landing pages — each
|
||||||
|
is the same buyer at a different desk (bookkeeping, payables, receivables).
|
||||||
|
|
||||||
| Tag | Persona | Top-of-funnel keyword | Demo dataset | Pre-saved pipeline |
|
| Tag | Persona | Top-of-funnel keyword | Demo dataset | Pre-saved pipeline |
|
||||||
|---|---|---|---|---|
|
|---|---|---|---|---|
|
||||||
| `shopify-pet` | Shopify operator (priority: pet supplies) | "shopify customer cleanup" | `samples/demo/shopify_pet_customers.csv` | `shopify_pet_pipeline.json` |
|
| `bookkeeper` | Bookkeeper — bank reconciliation | "reconcile bank export csv duplicates" | `samples/demo/bank_reconciliation.csv` | `bank_reconciliation_pipeline.json` |
|
||||||
| `bookkeeper` | Bookkeeper / freelance accountant | "reconcile bank export csv" | `samples/demo/bookkeeper_bank_reconcile.csv` | `bookkeeper_bank_pipeline.json` |
|
| `ap-1099` | Accounts payable — 1099 vendor prep | "clean 1099 vendor list missing EIN" | `samples/demo/vendor_1099.csv` | `vendor_1099_pipeline.json` |
|
||||||
| `revops` | Marketing / RevOps agency | "dedupe lead list across vendors" | `samples/demo/agency_combined_leads.csv` | `agency_leads_pipeline.json` |
|
| `ar-aging` | Accounts receivable — open invoices | "remove duplicate invoices aging report" | `samples/demo/ar_open_invoices.csv` | `ar_open_invoices_pipeline.json` |
|
||||||
|
|
||||||
Each persona gets its **own landing page URL**, its **own demo dataset
|
Each persona gets its **own landing page URL** (`?p=<tag>`), its **own
|
||||||
loaded by default**, and its **own H1 + below-the-fold copy.** The
|
demo dataset loaded by default**, and its **own H1 + below-the-fold
|
||||||
engine is identical; only positioning differs.
|
copy** — wired in `src/gui/app_demo.py::PERSONAS`. The engine is
|
||||||
|
identical; only positioning differs.
|
||||||
|
|
||||||
## 4. Demo dataset specifications
|
## 4. Demo dataset specifications
|
||||||
|
|
||||||
@@ -53,114 +58,77 @@ persona's tooling. Each contains every kind of pollution the bundle's
|
|||||||
five tools fix, so a single demo run shows every tool earning its
|
five tools fix, so a single demo run shows every tool earning its
|
||||||
keep.
|
keep.
|
||||||
|
|
||||||
### 4.0 Pain-point coverage map
|
### 4.0 Value-proof map
|
||||||
|
|
||||||
Each demo dataset is engineered so the buyer sees their **own top
|
Each demo dataset is engineered so the buyer sees their **own top pain**
|
||||||
pain** demonstrated in the AFTER preview. The mapping below pairs
|
fixed in the AFTER preview, with one unmistakable headline number. All
|
||||||
each pain from PLAN.md §2.3a with the rows / columns that exercise
|
three run the same saved 4-step pipeline (Clean Text → Standardize
|
||||||
it. Refresh the dataset only when this coverage drops.
|
Formats → Fix Missing Values → Find Duplicates). The numbers below are
|
||||||
|
**validated against the live engine** (`tests/test_demo_pipelines.py`
|
||||||
|
pins them) — refresh the dataset only if a number stops landing.
|
||||||
|
|
||||||
| Persona | Pain (from PLAN §2.3a) | Demo coverage |
|
| Persona | Headline proof | What the visitor watches happen |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| Shopify pet | S1 — Klaviyo per-contact dupes | 5 dup pairs across rows 1–15 (case + format + address-twin variants) |
|
| Bookkeeper | **26 → 20 rows · 6 phantom duplicates removed** | The same payment posted twice (different date + amount format) collapses to one; dates go ISO, parens-negatives become real negatives |
|
||||||
| Shopify pet | S2 — feed-rejection chars | smart-quote / NBSP / BOM in rows 1–6, 9, 11 |
|
| AP / 1099 | **24 records → 8 vendors · 7 missing EINs recovered** | Each vendor's scattered records merge into one complete row; `merge=true` backfills the EIN/address/phone that any single record was missing |
|
||||||
| Shopify pet | S3 — multi-channel | partner-style customer IDs (`SHOP-`); demonstration of column-level mapping covered in RevOps demo |
|
| AR aging | **26 → 21 rows · 5 double-entered invoices removed** | Duplicate invoice numbers collapse; a blank status is backfilled from its twin; invoice + due dates go ISO, amounts numeric |
|
||||||
| Shopify pet | S4 — subscription identity | rows 1+2, 7+8, 9+10 — same person, different format |
|
|
||||||
| Shopify pet | S5 — VAT-MOSS country drift | rows 16–18 (`United Kingdom` / `U.K.` / `UK`) + rows 19–20 (`Germany`/`Italia`) |
|
|
||||||
| Bookkeeper | B1 — month-overlap re-import | 7 dup pairs spanning Jan↔Feb and Mar boundaries |
|
|
||||||
| Bookkeeper | B2 — 1099 vendor consolidation | Amazon × 3 spellings, Verizon × 2, Acme Realty × 2, Adobe × 2, Costco × 2, Zoom × 2, Stripe × 4 |
|
|
||||||
| Bookkeeper | B3 — audit trail | every cell change in the run logged with old/new/rule — surface in the demo's audit tab |
|
|
||||||
| Bookkeeper | B4 — per-license economics | demonstrated by pricing copy, not data |
|
|
||||||
| Bookkeeper | B5 — multi-currency | rows 26 (EUR), 27 (GBP), 28 (BRL with comma decimal), 29 (parens-negative) |
|
|
||||||
| RevOps | R1 — per-contact tier | 6 cross-source dup pairs (HubSpot × LinkedIn × Manual Scrape) |
|
|
||||||
| RevOps | R2 — deliverability | rows 26–27 (`uma at uniform dot com`, `victor@@victorco.com` invalid emails) |
|
|
||||||
| RevOps | R3 — GDPR / privacy | demonstrated by the network-tab moat panel + zero-upload claim |
|
|
||||||
| RevOps | R4 — vendor unification | 3 source values (HubSpot / LinkedIn / Manual Scrape), 13 country codes, mixed-shape headers |
|
|
||||||
| RevOps | R5 — suppression list | rows 29–30 (`Suppressed`, `Opted Out` tags) |
|
|
||||||
|
|
||||||
### 4.1 `shopify_pet_customers.csv` (20 rows)
|
### 4.1 `bank_reconciliation.csv` (26 rows) — Bookkeeper
|
||||||
|
|
||||||
**Looks like**: a Shopify customer export filtered for "Pet Supplies"
|
**Looks like**: two months (Jan + Feb 2025) of business-checking activity
|
||||||
sales channel, 12 months activity.
|
from a bank portal, where the Feb re-export overlaps Jan so the same
|
||||||
|
transaction posts twice. Columns: `Date, Description, Vendor, Category,
|
||||||
|
Amount, Account`.
|
||||||
|
|
||||||
**Pollution included**:
|
**Pollution included**:
|
||||||
- Whitespace padding (" Alice ", "Sydney Opera House Drive ")
|
- Mixed date formats: `01/15/2025`, `2025-01-15`, `Jan 18 2025`, `1/27/25`, `Feb 5 2025`.
|
||||||
- Mixed phone formats: `(415) 555-1234`, `415.555.1234`, `5559876543`,
|
- Currency formats incl. negatives: `-$129.99`, `($89.50)` parens-negative, `+$3,450.00`, `- $599.88`, bare `-129.99`, `(50.00)`.
|
||||||
`+1 555-111-1111`
|
- Whitespace + NBSP padding; smart quotes and an em-dash inside descriptions.
|
||||||
- International phones: GB, ES, DE, AU, JP (15 demo rows span 6
|
- Vendor casing variety on *non-duplicate* rows: `Amazon` / `amazon.com` / `AMAZON.COM`, `Verizon` / `verizon`.
|
||||||
countries)
|
- Disguised nulls in Category: `—`, `(blank)`, `?`, `unknown`, `TBD`.
|
||||||
- Currency variants: `$1,240.50`, `£890.25`, `€2.410,75` (EU comma
|
- **6 duplicate transactions** — each pair shares the same vendor + real value but a different date *and* amount format, so they collapse only after standardization.
|
||||||
decimal), `A$ 1,299.00`, `¥75000`
|
|
||||||
- Date formats: `2025-12-04`, `12/15/2025`, `?`, `(blank)`, `(none)`,
|
|
||||||
`#N/A`
|
|
||||||
- Disguised nulls: `N/A`, blank, `(blank)`, `?`, `#N/A`, `(none)`,
|
|
||||||
`unknown`
|
|
||||||
- Name casing: `EVE MARTINEZ`, `henry`, `O'NEIL`, `noah`, mixed Title /
|
|
||||||
ALL CAPS / lower
|
|
||||||
- Email case variants that *should* dedup: `Bob@PetShop.com` vs
|
|
||||||
`alice@petshop.com`
|
|
||||||
- 4 fuzzy duplicates (Alice/Bob same address, Grace/Henry same phone,
|
|
||||||
Carlos/Olivia same address, Ivy/Jack same address)
|
|
||||||
|
|
||||||
**After running the pipeline**: 20 rows → 15, ~29 cells canonicalized,
|
**After running the pipeline** (validated): **26 → 20 rows, 6 duplicates
|
||||||
~45 sentinels standardised, 5 cross-row duplicates merged. The
|
removed**, 36 date/amount cells standardized (0 unparseable), all dates
|
||||||
customer table is now Klaviyo-import-ready and the country column
|
ISO, parens-negatives resolved (`($89.50)` → `-89.50`), disguised-null
|
||||||
(previously `UK` / `U.K.` / `United Kingdom` / `Germany` / `Italia`)
|
categories flagged. The reconciliation ties out.
|
||||||
is GB / DE / IT — VAT MOSS report won't break.
|
|
||||||
|
|
||||||
### 4.2 `bookkeeper_bank_reconcile.csv` (30 rows)
|
### 4.2 `vendor_1099.csv` (24 rows) — Accounts payable / 1099
|
||||||
|
|
||||||
**Looks like**: two months of business checking + credit-card activity
|
**Looks like**: a 1099-NEC vendor master list where the same vendor was
|
||||||
exported from a bank portal, with the Feb export accidentally
|
entered 2–3 times across the year by different staff, each record holding
|
||||||
overlapping the Jan export at the month boundary.
|
only *part* of the vendor's details. Columns: `Vendor, Contact, Email,
|
||||||
|
Phone, EIN, Address, Total_Paid`.
|
||||||
|
|
||||||
**Pollution included**:
|
**Pollution included**:
|
||||||
- Mixed date formats: `01/15/2025`, `2025-01-15`, `Jan 18 2025`,
|
- The duplicate records for a vendor share one email differing only by case/whitespace (the reliable dedup key, matched with the `email` normalizer).
|
||||||
`1/27/25`, `Feb 5 2025`
|
- EIN / Phone / Address scattered across the duplicate set so no single record is complete but the union is — gaps marked `—`, `(blank)`, `TBD`, `unknown`, `N/A`.
|
||||||
- Currency formats: `-$129.99`, `($89.50)` parens-negative,
|
- Vendor name casing/spelling variants, phone formats, EIN formats (`12-3456789` vs `123456789`), `Total_Paid` currency variants.
|
||||||
`+$3,450.00`, `- $599.88` space, bare `-129.99`, `(50.00)`
|
|
||||||
- Header trailing whitespace: `"Date "`
|
|
||||||
- Smart quotes around descriptions: `"autopay"`
|
|
||||||
- Em-dash sentinels in Vendor: `—`
|
|
||||||
- Smart-em-dash inside descriptions: `STAPLES #4422 — paper, toner`
|
|
||||||
- Vendor casing inconsistency: `Amazon` / `amazon.com` / `AMAZON.COM`,
|
|
||||||
`Verizon` / `verizon`
|
|
||||||
- 6 duplicate transactions (same date+amount+vendor recorded twice
|
|
||||||
with different formats)
|
|
||||||
|
|
||||||
**After running the pipeline**: 30 rows → 23, ~84 cells normalized, 7
|
**After running the pipeline** (validated): **24 records → 8 vendors, 16
|
||||||
duplicates removed (month-overlap + VAT-MOSS dups). All dates
|
duplicates removed, 7 missing EINs recovered** by `merge=true` +
|
||||||
ISO-formatted, all amounts numeric (including EUR/GBP/BRL with comma
|
`most_complete` survivor, 35 disguised nulls caught, phones/emails/amounts
|
||||||
decimal), vendor casing canonical, parens-negative resolved.
|
standardized (0 unparseable). One vendor genuinely has no EIN in any
|
||||||
|
record — it survives with a blank EIN as the realistic "flag for
|
||||||
|
follow-up" case.
|
||||||
|
|
||||||
### 4.3 `agency_combined_leads.csv` (30 rows)
|
### 4.3 `ar_open_invoices.csv` (26 rows) — Accounts receivable
|
||||||
|
|
||||||
**Looks like**: a marketing-ops worksheet combining lead exports from
|
**Looks like**: an open-invoices (unpaid AR) export where some invoices
|
||||||
HubSpot + LinkedIn Sales Navigator + manual scraping, ready for
|
were double-entered in different formats and client contacts are messy.
|
||||||
campaign targeting.
|
Columns: `Invoice, Client, Email, Invoice_Date, Due_Date, Amount, Status`.
|
||||||
|
|
||||||
**Pollution included**:
|
**Pollution included**:
|
||||||
- Phone formats per region: US, UK, Spain, Germany, China, India,
|
- Two date columns with mixed formats; currency variants incl. a credit memo `($300.00)` → `-300.00`.
|
||||||
Australia, Mexico, Israel, Singapore, Hong Kong, Italy, South
|
- Client name casing variety; email case variants (`AP@Acme.com` vs `ap@acme.com`).
|
||||||
Korea — 13 country codes
|
- Status disguised nulls: `—`, `?`, `(blank)`, `TBD`, `unknown`, `(none)`.
|
||||||
- Country column inconsistent: `USA` / `US` / `United States`
|
- **5 double-entered invoices** — same invoice number twice, dates/amount in different formats, one copy with a blank status the other fills.
|
||||||
- Disguised nulls: `N/A`, `unknown`, `(unknown)`, `(blank)`, `(none)`,
|
|
||||||
`?`, `—`, `#N/A`, `TBD`
|
|
||||||
- Source column tags origin (`HubSpot` / `LinkedIn` / `Manual Scrape`)
|
|
||||||
- Email duplicates across sources with case variants: `alice@acme.com`
|
|
||||||
+ `Alice.Johnson@acme.com`, `bob@beta.com` + `Bob@Beta.com`,
|
|
||||||
`diana@delta.com` from two sources, `carlos@gamma.io` from two
|
|
||||||
sources, `Frank@Foxtrot.de` + `frank@foxtrot.de`
|
|
||||||
- Name casing: `DIANA LEE`, `henry`, `IVY CHEN`, mixed
|
|
||||||
- 6 fuzzy / cross-source duplicates designed to survive the dedup
|
|
||||||
- Score column with sentinel pollution that needs coercion to integer
|
|
||||||
|
|
||||||
**After running the pipeline**: 30 rows → 24, ~43 cells canonicalized,
|
**After running the pipeline** (validated): **26 → 21 rows, 5 duplicate
|
||||||
14 sentinels resolved, 6 cross-source duplicates merged with `merge=true`
|
invoices removed**, both date columns ISO + amounts numeric + emails
|
||||||
so each survivor inherits the most-complete picture. Invalid-email
|
lowercased (0 unparseable), 7 disguised-null statuses caught, and a blank
|
||||||
rows (deliverability stress) and `Suppressed`/`Opted Out` tags
|
status backfilled from its twin via `merge=true`. The aging report stops
|
||||||
(suppression-list use case) survive as flagged rows the operator
|
double-counting.
|
||||||
manually reviews.
|
|
||||||
|
|
||||||
## 5. UX flow (per persona)
|
## 5. UX flow (per persona)
|
||||||
|
|
||||||
@@ -174,26 +142,26 @@ dedicated `app_demo.py` for the cloud build).
|
|||||||
│ "{Persona-specific H1}" │
|
│ "{Persona-specific H1}" │
|
||||||
├──────────────────────────────────────────────────────────┤
|
├──────────────────────────────────────────────────────────┤
|
||||||
│ │
|
│ │
|
||||||
│ Sample dataset preloaded: shopify_pet_customers.csv │
|
│ Sample dataset preloaded: bank_reconciliation.csv │
|
||||||
│ [Replace with your own file (capped 100 rows)] │
|
│ [Replace with your own file (capped 100 rows)] │
|
||||||
│ │
|
│ │
|
||||||
│ ┌─ BEFORE preview (15 rows) ─────────────────────────┐ │
|
│ ┌─ BEFORE preview (26 rows) ─────────────────────────┐ │
|
||||||
│ │ Alice | (415) 555-1234 | $1,240.50 | … │ │
|
│ │ 01/15/2025 | Stripe | +$3,450.00 | … │ │
|
||||||
│ │ Bob | 415.555.1234 | $1,240.50 | … │ │
|
│ │ 2025-01-15 | Stripe | 3450.00 | … (dup) │ │
|
||||||
│ │ ... │ │
|
│ │ ... │ │
|
||||||
│ └──────────────────────────────────────────────────┘ │
|
│ └──────────────────────────────────────────────────┘ │
|
||||||
│ │
|
│ │
|
||||||
│ Pipeline (saved): │
|
│ Pipeline (saved): │
|
||||||
│ 1. Text Clean → 2. Format Standardize → │
|
│ 1. Clean Text → 2. Standardize Formats → │
|
||||||
│ 3. Missing → 4. Deduplicate │
|
│ 3. Fix Missing → 4. Find Duplicates │
|
||||||
│ │
|
│ │
|
||||||
│ [▶ Run pipeline] │
|
│ [▶ Run pipeline] │
|
||||||
│ │
|
│ │
|
||||||
│ ┌─ AFTER preview ───────────────────────────────────┐ │
|
│ ┌─ AFTER preview ───────────────────────────────────┐ │
|
||||||
│ │ 15 rows → 11 (4 duplicates merged) │ │
|
│ │ 26 rows → 20 (6 duplicate transactions removed) │ │
|
||||||
│ │ 27 cells canonicalized · 33 sentinels resolved │ │
|
│ │ 36 cells standardized · 4 disguised nulls flagged │ │
|
||||||
│ │ │ │
|
│ │ │ │
|
||||||
│ │ Alice Johnson | +14155551234 | 1240.50 | … │ │
|
│ │ 2025-01-15 | Stripe | 3450.00 | … │ │
|
||||||
│ │ ... │ │
|
│ │ ... │ │
|
||||||
│ └──────────────────────────────────────────────────┘ │
|
│ └──────────────────────────────────────────────────┘ │
|
||||||
│ │
|
│ │
|
||||||
@@ -244,27 +212,35 @@ not "demo crippled" data.
|
|||||||
|
|
||||||
## 7. CTA copy (per persona)
|
## 7. CTA copy (per persona)
|
||||||
|
|
||||||
### 7.1 Shopify pet operator
|
Copy lives in `src/gui/app_demo.py::PERSONAS` (H1 / sub / CTA per tag);
|
||||||
|
keep this section in sync with that dict.
|
||||||
|
|
||||||
- **H1**: *Clean your customer / vendor / subscriber exports — locally.*
|
### 7.1 Bookkeeper — bank reconciliation (`?p=bookkeeper`)
|
||||||
- **Sub**: *Klaviyo-import-ready in 30 seconds. Catches duplicates Excel
|
|
||||||
misses. Your data never leaves your computer.*
|
|
||||||
- **CTA**: *Get DataTools for Shopify — $49 →*
|
|
||||||
|
|
||||||
### 7.2 Bookkeeper / freelance accountant
|
- **H1**: *Catch the transactions your bank export posted twice. Locally.*
|
||||||
|
- **Sub**: *When the Jan and Feb exports overlap, the same payment posts
|
||||||
- **H1**: *Reconcile messy bank exports. Hand your client an audit
|
twice in two formats. DataTools standardizes every date and amount, then
|
||||||
trail.*
|
dedups on the real transaction so your reconciliation ties out — 26 rows
|
||||||
- **Sub**: *Catches the duplicate transaction Quickbooks imported twice.
|
→ 20, six phantom duplicates gone.*
|
||||||
Standardizes dates, amounts, vendor casing. Every change auditable.*
|
|
||||||
- **CTA**: *Get DataTools for Bookkeepers — $49 →*
|
- **CTA**: *Get DataTools for Bookkeepers — $49 →*
|
||||||
|
|
||||||
### 7.3 Marketing / RevOps agency
|
### 7.2 Accounts payable — 1099 prep (`?p=ap-1099`)
|
||||||
|
|
||||||
- **H1**: *Dedupe leads across HubSpot, LinkedIn, and manual scrapes.*
|
- **H1**: *Build a clean 1099 vendor list — with the missing EINs filled in.*
|
||||||
- **Sub**: *International phones, country normalization, fuzzy dedup
|
- **Sub**: *The same vendor entered three times, each record holding only
|
||||||
with merge — one tool, one schema, no upload.*
|
part of the details. DataTools consolidates to one row and backfills the
|
||||||
- **CTA**: *Get DataTools for RevOps — $49 →*
|
gaps from the duplicates — 24 records → 8 vendors, 7 missing EINs
|
||||||
|
recovered.*
|
||||||
|
- **CTA**: *Get DataTools for Accounting — $49 →*
|
||||||
|
|
||||||
|
### 7.3 Accounts receivable — open invoices (`?p=ar-aging`)
|
||||||
|
|
||||||
|
- **H1**: *Stop chasing the invoices your aging report counted twice. Locally.*
|
||||||
|
- **Sub**: *Double-entered invoices inflate your AR aging and your
|
||||||
|
follow-ups. DataTools standardizes dates and amounts, lowercases client
|
||||||
|
emails, and removes the duplicate invoice numbers — 26 rows → 21, five
|
||||||
|
phantom invoices off the books.*
|
||||||
|
- **CTA**: *Get DataTools for Accounting — $49 →*
|
||||||
|
|
||||||
## 8. Telemetry / conversion tracking
|
## 8. Telemetry / conversion tracking
|
||||||
|
|
||||||
|
|||||||
@@ -1,31 +0,0 @@
|
|||||||
Lead ID,First Name,Last Name,Company,Title,Email,Phone,Country,Source,Score,Last Activity,Tags
|
|
||||||
HUB-001,Alice,Johnson,Acme Corp,VP Marketing,alice@acme.com,(415) 555-1234,USA,HubSpot,87,2025-12-04,Enterprise
|
|
||||||
HUB-002,bob,smith,Beta LLC,Director Growth,bob@beta.com,N/A,United States,HubSpot,N/A,2025-11-22,SMB
|
|
||||||
HUB-003,Carlos,Garcia,Gamma Inc,CEO,carlos@gamma.io,+34 91 411 1111,Spain,HubSpot,82,2025-10-30,Enterprise
|
|
||||||
HUB-004,DIANA,LEE,Delta Co,Marketing Manager,diana@delta.com,020 7946 0958,United Kingdom,HubSpot,74,2025-12-15,Mid-Market
|
|
||||||
HUB-005,Eve,Martinez,Epsilon Group,VP Ops,eve@epsilon.com,(none),Mexico,HubSpot,(blank),2025-09-15,SMB
|
|
||||||
LIN-006,Alice,Johnson,Acme Corporation,VP of Marketing,Alice.Johnson@acme.com,4155551234,US,LinkedIn,—,2025-12-04,Enterprise
|
|
||||||
LIN-007,Frank,Brown,Foxtrot Ltd,Head Sales,frank@foxtrot.de,+49 30 12345678,Germany,LinkedIn,68,2025-12-01,Mid-Market
|
|
||||||
LIN-008,Grace,Davis,Golf Industries,Marketing Lead,grace@golfind.com,+44 20 7946 0958,UK,LinkedIn,79,2025-11-08,Mid-Market
|
|
||||||
LIN-009,henry,wilson,Hotel Logistics,COO,henry@hotellog.com,+86 10 1234 5678,China,LinkedIn,91,2025-12-12,Enterprise
|
|
||||||
LIN-010,IVY CHEN,,India Tech,CTO,ivy@indiatech.in,+91 11 2345 6789,IN,LinkedIn,88,2025-11-30,Enterprise
|
|
||||||
LIN-011,Jack,Taylor,Juliet & Co,Founder,jack@juliet.co,unknown,United States,LinkedIn,?,(unknown),SMB
|
|
||||||
SCR-012,Diana,Lee,Delta Company,Marketing Manager,diana@delta.com,020-7946-0958,UK,Manual Scrape,74,12/15/2025,Mid-Market
|
|
||||||
SCR-013,kate,o'neil,Kilo Ventures,Partner,kate@kilo.vc,+1 415 555 2222,USA,Manual Scrape,N/A,?,Investor
|
|
||||||
SCR-014,Carlos,García,Gamma Incorporated,CEO,Carlos@gamma.io,+34-91-411-1111,Spain,Manual Scrape,82,Oct 30 2025,Enterprise
|
|
||||||
SCR-015,Liam,Park,Lima Solutions,Director Marketing,liam@limasol.kr,+82 2 2287 0114,South Korea,Manual Scrape,77,2025-11-20,Enterprise
|
|
||||||
SCR-016,Mia,nguyen,Mike Corp,VP Marketing,mia@mikecorp.com.au,02 9374 4000,Australia,Manual Scrape,72,2025-10-05,Mid-Market
|
|
||||||
SCR-017,Noah,Brown,November Inc,Head of Growth,noah@november.com,(555) 444-5555,US,Manual Scrape,—,#N/A,SMB
|
|
||||||
HUB-018,Frank,Brown,Foxtrot,Head of Sales,Frank@Foxtrot.de,+49-30-12345678,Germany,HubSpot,68,2025-12-01,Mid-Market
|
|
||||||
HUB-019,Olivia,Rossi,Oscar Italia,CMO,olivia@oscar.it,+39 06 6982,Italy,HubSpot,85,2025-12-08,Enterprise
|
|
||||||
HUB-020,papa,wong,Papa Trading,Founder,papa@papatrading.hk,+852 2123 4567,Hong Kong,HubSpot,69,2025-11-15,SMB
|
|
||||||
LIN-021,Quinn,Reyes,Quebec Group,VP Sales,quinn@quebec.mx,+52 55 5555 0000,Mexico,LinkedIn,80,2025-12-05,Mid-Market
|
|
||||||
LIN-022,Robert,Tan,Romeo Logistics,Director,r.tan@romeo.sg,+65 6123 4567,Singapore,LinkedIn,76,2025-11-28,Mid-Market
|
|
||||||
SCR-023,Sara,Khan,Sierra Foods,Head Marketing,sara@sierra.in,+91-22-1234-5678,India,Manual Scrape,73,2025-12-02,SMB
|
|
||||||
SCR-024,bob,Smith,Beta,Director Growth,Bob@Beta.com,(none),United States,Manual Scrape,(unknown),(unknown),SMB
|
|
||||||
HUB-025,Tara,Levi,Tango Tech,VP Product,tara@tango.il,+972 3 6957 0000,Israel,HubSpot,82,2025-12-10,Enterprise
|
|
||||||
HUB-026,Uma,Patel,Uniform Health,CMO,uma at uniform dot com,+44 20 7946 8888,United Kingdom,HubSpot,71,2025-12-12,Enterprise
|
|
||||||
LIN-027,Victor,Lee,Victor Co,Director,victor@@victorco.com,+1 415 555 8888,USA,LinkedIn,69,2025-11-30,SMB
|
|
||||||
SCR-028,Wendy,Akin,Whiskey Inc,CMO,wendy@whiskey.tr,+90 212 252 1111,Turkey,Manual Scrape,77,2025-12-04,Mid-Market
|
|
||||||
SCR-029,Xander,Ng,Xray Group,Founder,xander@xray.sg,+65 6234 5678,Singapore,Manual Scrape,65,2025-11-15,Suppressed
|
|
||||||
HUB-030,Yara,Costa,Yankee Foods,Marketing Lead,yara@yankee.br,+55 11 3071 2222,Brazil,HubSpot,—,2025-12-15,Opted Out
|
|
||||||
|
@@ -1,74 +0,0 @@
|
|||||||
{
|
|
||||||
"steps": [
|
|
||||||
{
|
|
||||||
"tool": "text_clean",
|
|
||||||
"options": {},
|
|
||||||
"enabled": true,
|
|
||||||
"name": "1. Clean text (whitespace + smart quotes from copy-paste)"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"tool": "format_standardize",
|
|
||||||
"options": {
|
|
||||||
"column_types": {
|
|
||||||
"First Name": "name",
|
|
||||||
"Last Name": "name",
|
|
||||||
"Company": "name",
|
|
||||||
"Email": "email",
|
|
||||||
"Phone": "phone"
|
|
||||||
},
|
|
||||||
"phone_country_column": "Country",
|
|
||||||
"phone_format": "E164",
|
|
||||||
"email_gmail_canonical": true
|
|
||||||
},
|
|
||||||
"enabled": true,
|
|
||||||
"name": "2. E.164 phones (per-row country) · canonical emails · name casing"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"tool": "missing",
|
|
||||||
"options": {
|
|
||||||
"strategy": "none",
|
|
||||||
"standardize_sentinels": true,
|
|
||||||
"sentinels": ["N/A", "n/a", "—", "?", "(unknown)", "unknown", "(blank)", "(none)", "TBD", "#N/A"]
|
|
||||||
},
|
|
||||||
"enabled": true,
|
|
||||||
"name": "3. Standardize sentinels across vendor exports"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"tool": "column_map",
|
|
||||||
"options": {
|
|
||||||
"schema": {
|
|
||||||
"fields": [
|
|
||||||
{"name": "Lead ID", "dtype": "string", "required": true},
|
|
||||||
{"name": "First Name", "dtype": "string"},
|
|
||||||
{"name": "Last Name", "dtype": "string"},
|
|
||||||
{"name": "Company", "dtype": "string"},
|
|
||||||
{"name": "Title", "dtype": "string"},
|
|
||||||
{"name": "Email", "dtype": "string"},
|
|
||||||
{"name": "Phone", "dtype": "string"},
|
|
||||||
{"name": "Country", "dtype": "string"},
|
|
||||||
{"name": "Source", "dtype": "string"},
|
|
||||||
{"name": "Score", "dtype": "integer"},
|
|
||||||
{"name": "Last Activity", "dtype": "date"},
|
|
||||||
{"name": "Tags", "dtype": "string"}
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"auto_infer": true,
|
|
||||||
"unmapped": "keep",
|
|
||||||
"coerce_types": true,
|
|
||||||
"reorder_to_schema": true,
|
|
||||||
"enforce_required": false
|
|
||||||
},
|
|
||||||
"enabled": true,
|
|
||||||
"name": "4. Coerce types · reorder to canonical schema"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"tool": "dedup",
|
|
||||||
"options": {
|
|
||||||
"survivor_rule": "most_complete",
|
|
||||||
"merge": true
|
|
||||||
},
|
|
||||||
"enabled": true,
|
|
||||||
"name": "5. Dedup leads across HubSpot / LinkedIn / Manual Scrape (fuzzy + merge)"
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
27
samples/demo/ar_open_invoices.csv
Normal file
27
samples/demo/ar_open_invoices.csv
Normal file
@@ -0,0 +1,27 @@
|
|||||||
|
Invoice,Client,Email,Invoice_Date,Due_Date,Amount,Status
|
||||||
|
INV-1007,ACME LLC,AP@Acme.com,03/04/2025,04/03/2025,"$1,250.00",Open
|
||||||
|
INV-1007, Acme LLC ,ap@acme.com,2025-03-04,2025-04-03,"1,250.00",(blank)
|
||||||
|
INV-1001,northwind traders,billing@northwind.com,Mar 6 2025,04/05/2025,$980,Overdue
|
||||||
|
INV-1002,Globex Corp,AR@Globex.com,3/11/25,4/10/25,"2,400.50",Sent
|
||||||
|
INV-1011,initech,accounts@initech.com,04/01/2025,05/01/2025,"$ 1,100.00",?
|
||||||
|
INV-1011,Initech,Accounts@Initech.com,2025-04-01,2025-05-01,1100,Open
|
||||||
|
INV-1003,Stark Industries,ap@stark.com,Mar 6 2025,Apr 6 2025,$75.00,Open
|
||||||
|
INV-1004,Wayne Enterprises,ar@wayne.com,03/15/2025,04/14/2025,($300.00),—
|
||||||
|
INV-1015,Hooli,billing@hooli.com,3/11/25,4/10/25,"$4,300.00",Overdue
|
||||||
|
INV-1015,hooli,Billing@Hooli.com,2025-03-11,2025-04-10,4300,(none)
|
||||||
|
INV-1005,Soylent Corp,ap@soylent.com,2025-03-20,2025-04-19,"$1,875.25",Sent
|
||||||
|
INV-1006,Umbrella Co,ar@umbrella.com,03/22/2025,04/21/2025,$640.00,TBD
|
||||||
|
INV-1019,Cyberdyne Systems,ap@cyberdyne.com,Mar 25 2025,04/24/2025,"$2,050.00",unknown
|
||||||
|
INV-1019,cyberdyne systems,AP@Cyberdyne.com,2025-03-25,2025-04-24,"2,050.00",Open
|
||||||
|
INV-1008,Vandelay Industries,ar@vandelay.com,3/28/25,4/27/25,$915.00,Overdue
|
||||||
|
INV-1009,Gekko & Co,billing@gekko.com,2025-03-30,2025-04-29,"$3,120.75",Open
|
||||||
|
INV-1010,Pied Piper,ap@piedpiper.com,04/02/2025,05/02/2025,$180,Sent
|
||||||
|
INV-1023,Tyrell Corp,ar@tyrell.com,04/05/2025,05/05/2025,($300.00),(blank)
|
||||||
|
INV-1023,Tyrell Corp,AR@Tyrell.com,2025-04-05,2025-05-05,-300.00,Open
|
||||||
|
INV-1012,Oscorp,ap@oscorp.com,Apr 8 2025,05/08/2025,"$5,000.00",Overdue
|
||||||
|
INV-1013,Nakatomi Trading,ar@nakatomi.com,4/9/25,5/9/25,$725.50,Sent
|
||||||
|
INV-1014,Bluth Company,billing@bluth.com,2025-04-10,2025-05-10,"$1,420.00",Open
|
||||||
|
INV-1016,Dunder Mifflin,ap@dundermifflin.com,04/12/2025,05/12/2025,$960.00,Overdue
|
||||||
|
INV-1017,Prestige Worldwide,ar@prestige.com,Apr 14 2025,05/14/2025,"$2,680.00",Sent
|
||||||
|
INV-1018,Sterling Cooper,billing@sterlingcooper.com,4/15/25,5/15/25,"$3,950.00",Open
|
||||||
|
INV-1020,Wonka Industries,ap@wonka.com,2025-04-18,2025-05-18,"$1,050.00",Overdue
|
||||||
|
50
samples/demo/ar_open_invoices_pipeline.json
Normal file
50
samples/demo/ar_open_invoices_pipeline.json
Normal file
@@ -0,0 +1,50 @@
|
|||||||
|
{
|
||||||
|
"steps": [
|
||||||
|
{
|
||||||
|
"tool": "text_clean",
|
||||||
|
"enabled": true,
|
||||||
|
"options": {
|
||||||
|
"trim": true,
|
||||||
|
"collapse_whitespace": true,
|
||||||
|
"fold_smart_chars": true,
|
||||||
|
"strip_zero_width": true
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"tool": "format_standardize",
|
||||||
|
"enabled": true,
|
||||||
|
"options": {
|
||||||
|
"column_types": {
|
||||||
|
"Invoice_Date": "date",
|
||||||
|
"Due_Date": "date",
|
||||||
|
"Amount": "currency",
|
||||||
|
"Email": "email"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"tool": "missing",
|
||||||
|
"enabled": true,
|
||||||
|
"options": {
|
||||||
|
"strategy": "none",
|
||||||
|
"standardize_sentinels": true,
|
||||||
|
"sentinels": ["—", "-", "?", "(blank)", "TBD", "unknown", "(none)", "N/A", "#N/A"]
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"tool": "dedup",
|
||||||
|
"enabled": true,
|
||||||
|
"options": {
|
||||||
|
"survivor_rule": "most_complete",
|
||||||
|
"merge": true,
|
||||||
|
"strategies": [
|
||||||
|
{
|
||||||
|
"columns": [
|
||||||
|
{"column": "Invoice", "algorithm": "exact", "threshold": 100}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
27
samples/demo/bank_reconciliation.csv
Normal file
27
samples/demo/bank_reconciliation.csv
Normal file
@@ -0,0 +1,27 @@
|
|||||||
|
Date,Description,Vendor,Category,Amount,Account
|
||||||
|
01/15/2025,“Stripe payout — weekly”,Stripe,Income,"+$3,450.00",Business Checking
|
||||||
|
2025-01-15,Verizon business line,Verizon,—,($89.50),Business Checking
|
||||||
|
Jan 18 2025,Adobe Creative Cloud ,Adobe,(blank),-$129.99,Business Checking
|
||||||
|
1/27/25,Office supplies,Amazon,Supplies,-$74.20,Business Checking
|
||||||
|
02/03/2025, Monthly office rent,Highland Properties,Rent,"$1,200.00",Business Checking
|
||||||
|
Feb 5 2025,Account service fee,First National Bank,?,(50.00),Business Checking
|
||||||
|
2025-01-09,Shipping labels,amazon.com,unknown,-$18.40,Business Checking
|
||||||
|
1/22/25,Contractor — landing page,Bright Lane Design,TBD,- $599.88,Business Checking
|
||||||
|
Jan 30 2025,Late fee adjustment,verizon,Utilities,-$12.00,Business Checking
|
||||||
|
2025-01-11,Packaging tape,AMAZON.COM,Supplies,-$31.75,Business Checking
|
||||||
|
01/06/2025,Client deposit — ACME Co,ACME Co,Income,"$2,500.00",Business Checking
|
||||||
|
2025-01-20,Google Workspace,Google,Software,-$36.00,Business Checking
|
||||||
|
Jan 24 2025,Fuel — delivery van,Shell,Vehicle,-$58.63,Business Checking
|
||||||
|
1/28/25,QuickBooks subscription,Intuit,Software,-$80.00,Business Checking
|
||||||
|
2025-01-15,Stripe payout weekly,Stripe,Income,3450.00,Business Checking
|
||||||
|
01/15/2025,Verizon business line,Verizon,Utilities,-89.50,Business Checking
|
||||||
|
2025-01-18,Adobe Creative Cloud,Adobe,Software,-129.99,Business Checking
|
||||||
|
2025-02-03,Monthly office rent,Highland Properties,Rent,1200.00,Business Checking
|
||||||
|
2025-02-05,Account service fee,First National Bank,Bank Fees,-50.00,Business Checking
|
||||||
|
2025-01-22,Contractor landing page,Bright Lane Design,Contractors,-599.88,Business Checking
|
||||||
|
02/10/2025,Client deposit — Globex,Globex,Income,"$1,800.00",Business Checking
|
||||||
|
2025-02-12,Slack subscription,Slack,Software,-$96.00,Business Checking
|
||||||
|
Feb 14 2025,Coffee — client meeting,Blue Bottle,Meals,-$23.10,Business Checking
|
||||||
|
2/18/25,Insurance premium,Hartford,Insurance,-$240.50,Business Checking
|
||||||
|
02/21/2025,Refund — returned printer,Staples,Supplies,$210.99,Business Checking
|
||||||
|
Feb 25 2025,Domain renewal,Namecheap,Software,-$13.98,Business Checking
|
||||||
|
6
samples/demo/bank_reconciliation_pipeline.json
Normal file
6
samples/demo/bank_reconciliation_pipeline.json
Normal file
@@ -0,0 +1,6 @@
|
|||||||
|
{"steps":[
|
||||||
|
{"tool":"text_clean","enabled":true,"options":{"trim":true,"collapse_whitespace":true,"fold_smart_chars":true,"strip_zero_width":true}},
|
||||||
|
{"tool":"format_standardize","enabled":true,"options":{"column_types":{"Date":"date","Amount":"currency"}}},
|
||||||
|
{"tool":"missing","enabled":true,"options":{"strategy":"none","standardize_sentinels":true,"sentinels":["—","(blank)","?","unknown","TBD","N/A","#N/A","(none)"]}},
|
||||||
|
{"tool":"dedup","enabled":true,"options":{"survivor_rule":"most_complete","merge":true,"strategies":[{"columns":[{"column":"Date","algorithm":"exact","threshold":100},{"column":"Amount","algorithm":"exact","threshold":100}]}]}}
|
||||||
|
]}
|
||||||
@@ -1,56 +0,0 @@
|
|||||||
{
|
|
||||||
"steps": [
|
|
||||||
{
|
|
||||||
"tool": "text_clean",
|
|
||||||
"options": {},
|
|
||||||
"enabled": true,
|
|
||||||
"name": "1. Clean text (header whitespace, smart quotes, em-dash)"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"tool": "format_standardize",
|
|
||||||
"options": {
|
|
||||||
"column_types": {
|
|
||||||
"Date": "date",
|
|
||||||
"Amount": "currency",
|
|
||||||
"Balance": "currency",
|
|
||||||
"Vendor": "name"
|
|
||||||
},
|
|
||||||
"currency_decimal": "auto",
|
|
||||||
"currency_preserve_code": false,
|
|
||||||
"currency_decimals": 2,
|
|
||||||
"date_output_format": "%Y-%m-%d"
|
|
||||||
},
|
|
||||||
"enabled": true,
|
|
||||||
"name": "2. ISO dates · numeric amounts (parens-negative) · vendor casing"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"tool": "missing",
|
|
||||||
"options": {
|
|
||||||
"strategy": "none",
|
|
||||||
"standardize_sentinels": true,
|
|
||||||
"sentinels": ["N/A", "n/a", "—", "-", "?", "(blank)", "(none)", "unknown", "#N/A"]
|
|
||||||
},
|
|
||||||
"enabled": true,
|
|
||||||
"name": "3. Standardize disguised nulls (— / N/A / (blank))"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"tool": "dedup",
|
|
||||||
"options": {
|
|
||||||
"survivor_rule": "most_complete",
|
|
||||||
"merge": false,
|
|
||||||
"date_column": "Date",
|
|
||||||
"strategies": [
|
|
||||||
{
|
|
||||||
"columns": [
|
|
||||||
{"column": "Date", "algorithm": "exact", "threshold": 100},
|
|
||||||
{"column": "Amount", "algorithm": "exact", "threshold": 100},
|
|
||||||
{"column": "Vendor", "algorithm": "jaro_winkler", "threshold": 80}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"enabled": true,
|
|
||||||
"name": "4. Dedup transactions on Date+Amount+fuzzy Vendor"
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
@@ -1,31 +0,0 @@
|
|||||||
Txn ID,Date ,Description,Amount,Balance,Account,Vendor,Category
|
|
||||||
TXN-2401,01/15/2025," AMAZON.COM*4F2X9 PURCHASE",-$129.99,"$2,450.01",Checking,Amazon,Office Supplies
|
|
||||||
TXN-2402,2025-01-15,"AMAZON.COM*4F2X9 PURCHASE",-$129.99,"2450.01",Checking,amazon.com,Office Supplies
|
|
||||||
TXN-2403,Jan 18 2025,"STAPLES #4422 — paper, toner",($89.50),$2360.51,Checking,STAPLES,Office Supplies
|
|
||||||
TXN-2404,01/22/2025,"Verizon Wireless ""autopay""",-$120.00,"$2,240.51",Checking,Verizon,Utilities
|
|
||||||
TXN-2405,2025-01-22,Verizon Wireless autopay,-120.00,"2,240.51",Checking,verizon,Utilities
|
|
||||||
TXN-2406,01-25-2025,"Stripe Payout — invoice #1077","+$3,450.00","$5,690.51",Checking,Stripe,Income
|
|
||||||
TXN-2407,1/27/25,"Office Lease - Suite 204",-1500.00,"$4,190.51",Checking,Acme Realty,Rent
|
|
||||||
TXN-2408,02/01/2025,"Wire — Acme Realty Mgmt","-$1,500.00","$2,690.51",Checking,acme realty,Rent
|
|
||||||
TXN-2409,2025-02-03,"Adobe Creative Cloud annual","- $599.88","$2,090.63",Credit Card,Adobe Inc.,Software
|
|
||||||
TXN-2410,02/03/2025,"ADOBE CREATIVE CLOUD ANN",-599.88,2090.63,Credit Card,adobe,Software
|
|
||||||
TXN-2411,Feb 5 2025,"FedEx — overnight to client A",-$32.50,"$2,058.13",Checking,FedEx,Shipping
|
|
||||||
TXN-2412,02/07/2025,"Square fee — invoice #1078","-$3.20","$2,054.93",Checking,Square,Fees
|
|
||||||
TXN-2413,02/10/2025,"Stripe Payout invoice #1079","+ $1,200.00","$3,254.93",Checking,Stripe,Income
|
|
||||||
TXN-2414,2025-02-12,"USPS PRIORITY — to vendor B","-12.40","$3,242.53",Checking,USPS,Shipping
|
|
||||||
TXN-2415,02/14/2025,"Zoom Video Comms — annual","-$149.90","$3,092.63",Credit Card,Zoom,Software
|
|
||||||
TXN-2416,2/14/25,"Zoom Video Communications","-149.90","3092.63",Credit Card,zoom,Software
|
|
||||||
TXN-2417,02/18/2025,"Costco Whse #421 — supplies","-$237.84","$2,854.79",Checking,Costco,Office Supplies
|
|
||||||
TXN-2418,2025-02-18,COSTCO WHSE #421,-237.84,"2,854.79",Checking,costco,Office Supplies
|
|
||||||
TXN-2419,02/22/2025,"Bank fee — int'l wire","-$45.00","$2,809.79",Checking,Bank Fee,Fees
|
|
||||||
TXN-2420,02/24/2025,"Stripe Payout — invoice #1080","+$2,100.00","$4,909.79",Checking,Stripe,Income
|
|
||||||
TXN-2421,02/28/2025," Refund — overcharge ","+$45.00","$4,954.79",Checking,—,Refunds
|
|
||||||
TXN-2422,Feb 28 2025,REFUND OVERCHARGE,45.00,4954.79,Checking,N/A,Refunds
|
|
||||||
TXN-2423,03/01/2025,"Office Lease — Suite 204","-$1,500.00","$3,454.79",Checking,Acme Realty,Rent
|
|
||||||
TXN-2424,2025-03-03,"Slack Technologies — annual","-$840.00","$2,614.79",Credit Card,Slack,Software
|
|
||||||
TXN-2425,03/05/2025,"Stripe Payout — invoice #1081","+$1,875.00","$4,489.79",Checking,Stripe,Income
|
|
||||||
TXN-2426,03/08/2025,"Wire — Berlin office rent (EUR vendor)","-€1.450,00","$2,989.79",Checking,Mietverwaltung GmbH,Rent
|
|
||||||
TXN-2427,03/10/2025,"London supplier invoice (GBP)","-£950.00","$1,939.79",Checking,Stationery Co Ltd,Office Supplies
|
|
||||||
TXN-2428,03/12/2025,"São Paulo agency retainer","-R$ 1.299,90","$1,679.79",Credit Card,Estúdio Ágil,Software
|
|
||||||
TXN-2429,03/14/2025,"VAT MOSS prep — multi-EU sales","($89.00)","$1,768.79",Checking,EU VAT Service,Fees
|
|
||||||
TXN-2430,03/14/2025,"VAT MOSS prep multi EU sales",-89.00,"1,768.79",Checking,eu vat service,Fees
|
|
||||||
|
@@ -1,21 +0,0 @@
|
|||||||
Customer ID,First Name,Last Name,Email,Phone,Address,City,State,ZIP,Country,Total Orders,Lifetime Value,Last Order Date,Tags
|
|
||||||
SHOP-1001, Alice ,Johnson,alice@petshop.com,(415) 555-1234,"123 Main St., Apt 4B",San Francisco,CA,94102,US,12,$1,240.50,2025-12-04,VIP
|
|
||||||
SHOP-1002,Bob,SMITH,Bob@PetShop.com,415.555.1234,"123 Main St, Apt 4B",San Francisco,CA,94102,US,12,"$1,240.50",N/A,VIP
|
|
||||||
SHOP-1003,carlos,garcia,carlos@petshop.com,5559876543,"742 Evergreen Terrace",Springfield,IL,62704,US,5,420.00,12/15/2025,Wholesale
|
|
||||||
SHOP-1004,Diana,Lee,diana@petshop.com,(555) 222-3344,"PO Box 12, Sherwood Forest",Nottingham,,NG1 5BA,GB,8,£890.25,2025-10-30,VIP|Wholesale
|
|
||||||
SHOP-1005,EVE MARTINEZ,,eve.martinez@petshop.com,555-9988,"Calle Mayor 45","Madrid",,"28013",ES,3,€180,2025-09-15,
|
|
||||||
SHOP-1006,Frank,Brown,frank@petshop.com,, ,"Berlin",BE,10115,DE,15,€2.410,75,(blank),Wholesale
|
|
||||||
SHOP-1007,Grace,Davis,grace@petshop.com,+1 555-111-1111,"888 Maple Ave",Toronto,ON,M5V 3A8,CA,1,$49.99,#N/A,New
|
|
||||||
SHOP-1008,henry,wilson,Henry@PetShop.com,5551111111,"888 Maple Avenue","Toronto",ON,M5V 3A8,CA,1,$49.99,2025-12-01,New
|
|
||||||
SHOP-1009,Ivy,Chen,IVY@petshop.com,+1 (555) 777-7777,"550 Elm Street, Suite 200",Brooklyn,NY,11201,US,4,"$320.50 ",10/12/2025,
|
|
||||||
SHOP-1010,Jack,Taylor,jack@petshop.com,(none),"550 elm street, suite 200",brooklyn,NY,11201,US,4,$320.50,2025-10-12,
|
|
||||||
SHOP-1011,kate,o'neil,kate.oneil@petshop.com,415-555-2222,"99 King's Rd","London",,SW3 4LX,GB,7,£675.00,?,VIP
|
|
||||||
SHOP-1012,luis,rodriguez,LUIS@petshop.com,+34 91 411 1111,"Avenida de la Paz 12, 3°D",Madrid,,28013,ES,2,"€89,99",unknown,
|
|
||||||
SHOP-1013,Mia,Park,mia@petshop.com,02-9374-4000,"Sydney Opera House Drive","Sydney",NSW,2000,AU,9,"A$ 1,299.00",2025-11-20,Wholesale
|
|
||||||
SHOP-1014,Noah,nguyen,noah@petshop.com,+81 3 3210 7000,"丸の内 2-7-3","Tokyo",,100-0005,JP,6,"¥75000",2025-12-10,VIP
|
|
||||||
SHOP-1015,Olivia,Brown,OLIVIA@PETSHOP.COM,(555) 333-4444,"742 evergreen terrace",springfield,IL,62704,US,3,$180.00,(none),
|
|
||||||
SHOP-1016,Pavel,Novak,pavel@petshop.com,+44 20 7946 1234,"22 Baker Street",London,,W1U 6AB,United Kingdom,4,£412.00,2025-11-18,VIP
|
|
||||||
SHOP-1017,Quinn,Murphy,quinn@petshop.com,+44 20 7946 5678,"5 Princes Street",Edinburgh,,EH2 2DA,U.K.,2,£189.50,2025-12-09,
|
|
||||||
SHOP-1018,Rachel,O'Brien,rachel@petshop.com,02-9374-9999,"100 George Street","Sydney",NSW,2000,UK,1,£75.00,?,New
|
|
||||||
SHOP-1019,Sam,Klein,sam@petshop.com,+49 30 99887766,"Friedrichstraße 100","Berlin",,10117,Germany,11,"€1.890,40",2025-12-11,VIP|Wholesale
|
|
||||||
SHOP-1020,Tara,Gianni,tara@petshop.com,+39 06 6982 4567,"Via del Corso 250",Roma,,00186,Italia,5,"€649,99",2025-12-03,
|
|
||||||
|
@@ -1,49 +0,0 @@
|
|||||||
{
|
|
||||||
"steps": [
|
|
||||||
{
|
|
||||||
"tool": "text_clean",
|
|
||||||
"options": {},
|
|
||||||
"enabled": true,
|
|
||||||
"name": "1. Clean text (whitespace, smart quotes, NBSP, BOM)"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"tool": "format_standardize",
|
|
||||||
"options": {
|
|
||||||
"column_types": {
|
|
||||||
"First Name": "name",
|
|
||||||
"Last Name": "name",
|
|
||||||
"Email": "email",
|
|
||||||
"Phone": "phone",
|
|
||||||
"Address": "address",
|
|
||||||
"Lifetime Value": "currency",
|
|
||||||
"Last Order Date": "date"
|
|
||||||
},
|
|
||||||
"phone_country_column": "Country",
|
|
||||||
"address_country_column": "Country",
|
|
||||||
"currency_preserve_code": true,
|
|
||||||
"currency_decimal": "auto",
|
|
||||||
"email_gmail_canonical": false
|
|
||||||
},
|
|
||||||
"enabled": true,
|
|
||||||
"name": "2. Standardize phones, addresses, dates, currencies, names"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"tool": "missing",
|
|
||||||
"options": {
|
|
||||||
"strategy": "none",
|
|
||||||
"standardize_sentinels": true
|
|
||||||
},
|
|
||||||
"enabled": true,
|
|
||||||
"name": "3. Standardize disguised nulls (N/A, -, (blank), ?, #N/A)"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"tool": "dedup",
|
|
||||||
"options": {
|
|
||||||
"survivor_rule": "most_complete",
|
|
||||||
"merge": true
|
|
||||||
},
|
|
||||||
"enabled": true,
|
|
||||||
"name": "4. Dedup customers (fuzzy match, merge missing fields)"
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
25
samples/demo/vendor_1099.csv
Normal file
25
samples/demo/vendor_1099.csv
Normal file
@@ -0,0 +1,25 @@
|
|||||||
|
Vendor,Contact,Email,Phone,EIN,Address,Total_Paid
|
||||||
|
Acme Realty,Bob Stein,acme.ap@acmerealty.com,(212) 555-0100,12-3456789,(blank),"$12,400.00"
|
||||||
|
acme realty llc,Bob Stein, ACME.AP@AcmeRealty.com ,,—,"118 Canal St, New York, NY 10013","$8,250"
|
||||||
|
ACME REALTY,R. Stein,Acme.AP@acmerealty.com,212.555.0100,N/A,TBD,"1,999.99"
|
||||||
|
Bright Books Bookkeeping,Dana Cole,hello@brightbooks.com,,98-7654321,(blank),"$6,000.00"
|
||||||
|
bright books,Dana Cole,HELLO@brightbooks.com,(415) 555-0142,unknown,"50 Market St, San Francisco, CA 94105","$6,000"
|
||||||
|
"Bright Books, LLC",D. Cole, hello@BrightBooks.com,4155550142,98-7654321,unknown,"5,500.00"
|
||||||
|
Northwind Logistics,Sam Reyes,ap@northwindlog.com,(312) 555-0198,—,(blank),"$22,750.00"
|
||||||
|
northwind logistics inc,Sam Reyes,AP@NorthwindLog.com,,45-6789012,"900 W Loop, Chicago, IL 60607","$22,750"
|
||||||
|
Pearl Design Studio,“Jo” Marsh,billing@pearldesign.co,,33-2211000,(blank),"$3,200.00"
|
||||||
|
pearl design,Jo Marsh,Billing@PearlDesign.co,(206) 555-0167,TBD,"77 Pike St, Seattle, WA 98101","$3,200"
|
||||||
|
PEARL DESIGN STUDIO,J. Marsh, billing@pearldesign.co ,206.555.0167,33-2211000,unknown,"2,800.00"
|
||||||
|
Cooper Plumbing,Lee Cooper,office@cooperplumb.com,(617) 555-0133,—,(blank),"$1,450.00"
|
||||||
|
cooper plumbing co,Lee Cooper,OFFICE@cooperplumb.com,,TBD,"12 Beacon St, Boston, MA 02108","$1,450"
|
||||||
|
COOPER PLUMBING,L. Cooper, office@CooperPlumb.com,6175550133,N/A,unknown,900.00
|
||||||
|
Vertex Marketing,Pat Nguyen,accounts@vertexmktg.com,(404) 555-0119,77-8899001,(blank),"$15,000.00"
|
||||||
|
vertex marketing group,Pat Nguyen,ACCOUNTS@VertexMktg.com,,unknown,"300 Peachtree St, Atlanta, GA 30308","$15,000"
|
||||||
|
Summit Consulting,Ray Brooks,invoices@summitconsult.net,,21-0099887,(blank),"$9,800.00"
|
||||||
|
summit consulting llc,Ray Brooks,INVOICES@summitconsult.net,(303) 555-0175,—,"1100 17th St, Denver, CO 80202","$9,800"
|
||||||
|
SUMMIT CONSULTING,R. Brooks, invoices@SummitConsult.net ,303.555.0175,21-0099887,TBD,"7,250.00"
|
||||||
|
Garcia Catering,Mia Garcia,ap@garciacatering.com,(305) 555-0188,—,(blank),"$4,600.00"
|
||||||
|
garcia catering services,Mia Garcia,AP@GarciaCatering.com,,66-1234509,"450 Ocean Dr, Miami, FL 33139",$600.00
|
||||||
|
Northwind Logistics,S. Reyes, ap@northwindlog.com ,312.555.0198,45-6789012,TBD,"21,000.00"
|
||||||
|
VERTEX MARKETING,P. Nguyen, accounts@vertexmktg.com ,404.555.0119,77-8899001,TBD,"14,500.00"
|
||||||
|
GARCIA CATERING,M. Garcia,ap@GARCIACATERING.com,305.555.0188,66-1234509,unknown,"4,200.00"
|
||||||
|
49
samples/demo/vendor_1099_pipeline.json
Normal file
49
samples/demo/vendor_1099_pipeline.json
Normal file
@@ -0,0 +1,49 @@
|
|||||||
|
{
|
||||||
|
"steps": [
|
||||||
|
{
|
||||||
|
"tool": "text_clean",
|
||||||
|
"enabled": true,
|
||||||
|
"options": {
|
||||||
|
"trim": true,
|
||||||
|
"collapse_whitespace": true,
|
||||||
|
"fold_smart_chars": true,
|
||||||
|
"strip_zero_width": true
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"tool": "format_standardize",
|
||||||
|
"enabled": true,
|
||||||
|
"options": {
|
||||||
|
"column_types": {
|
||||||
|
"Phone": "phone",
|
||||||
|
"Email": "email",
|
||||||
|
"Total_Paid": "currency"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"tool": "missing",
|
||||||
|
"enabled": true,
|
||||||
|
"options": {
|
||||||
|
"strategy": "none",
|
||||||
|
"standardize_sentinels": true,
|
||||||
|
"sentinels": ["—", "-", "--", "(blank)", "TBD", "unknown", "N/A", "#N/A", "(none)"]
|
||||||
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"tool": "dedup",
|
||||||
|
"enabled": true,
|
||||||
|
"options": {
|
||||||
|
"survivor_rule": "most_complete",
|
||||||
|
"merge": true,
|
||||||
|
"strategies": [
|
||||||
|
{
|
||||||
|
"columns": [
|
||||||
|
{"column": "Email", "algorithm": "exact", "threshold": 100, "normalizer": "email"}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
@@ -1,13 +0,0 @@
|
|||||||
customer_name,email,vendor,memo
|
|
||||||
Alice Johnson,alice@example.com,ACME Corp ,Welcome aboard
|
|
||||||
Bob Smith,bob@example.com,ACME Corp,Returning customer
|
|
||||||
Charlie Brown,charlie@example.com,Globex,Net 30
|
|
||||||
Diana Prince,diana@example.com,Globex,VIP
|
|
||||||
Edward Norton,ed@example.com,“Best Pet Supplies”,Order#42 - rush
|
|
||||||
Frank Castle,frank@example.com,Stark—Industries,"Line 1
|
|
||||||
Line 2
|
|
||||||
Line 3"
|
|
||||||
grace HOPPER ,grace@example.com,Globex,Loves long memos…
|
|
||||||
Henry Ford,henry@example.com,Ford Motor,Industrial
|
|
||||||
Iris West,iris@example.com,S.T.A.R. Labs,Notewith-bell
|
|
||||||
Jane Doe,jane@example.com,Acme,Standard
|
|
||||||
|
@@ -9,10 +9,10 @@ side-by-side, and converts the visitor to a Gumroad purchase.
|
|||||||
Launch:
|
Launch:
|
||||||
streamlit run src/gui/app_demo.py
|
streamlit run src/gui/app_demo.py
|
||||||
|
|
||||||
URL routing:
|
URL routing (all three personas serve one audience: accounting):
|
||||||
https://demo.datatools.app/?p=shopify-pet (Shopify operator)
|
https://demo.datatools.app/?p=bookkeeper (Bookkeeper — bank reconciliation)
|
||||||
https://demo.datatools.app/?p=bookkeeper (Bookkeeper)
|
https://demo.datatools.app/?p=ap-1099 (Accounts payable — 1099 vendor prep)
|
||||||
https://demo.datatools.app/?p=revops (RevOps agency)
|
https://demo.datatools.app/?p=ar-aging (Accounts receivable — open invoices)
|
||||||
|
|
||||||
Free / paid boundary (per docs/DEMO-PLAN.md §6):
|
Free / paid boundary (per docs/DEMO-PLAN.md §6):
|
||||||
- input rows capped at ``DEMO_ROW_CAP``
|
- input rows capped at ``DEMO_ROW_CAP``
|
||||||
@@ -64,59 +64,66 @@ GUMROAD_BASE: str = "https://gumroad.com/l/datatools"
|
|||||||
DEMO_DIR = _project_root / "samples" / "demo"
|
DEMO_DIR = _project_root / "samples" / "demo"
|
||||||
|
|
||||||
|
|
||||||
|
# All three personas serve one audience — accounting — entering through the
|
||||||
|
# three workflows where messy exports cost real money: bank reconciliation,
|
||||||
|
# 1099 / AP vendor prep, and AR aging. Each H1/sub names the exact pain and
|
||||||
|
# the validated demo outcome (see docs/DEMO-PLAN.md §4 for the numbers).
|
||||||
PERSONAS: dict[str, dict[str, Any]] = {
|
PERSONAS: dict[str, dict[str, Any]] = {
|
||||||
"shopify-pet": {
|
|
||||||
"label": "Shopify pet operator",
|
|
||||||
"icon": "🛍️",
|
|
||||||
"h1": "Klaviyo-import-ready customer lists. **In 30 seconds. Locally.**",
|
|
||||||
"sub": (
|
|
||||||
"Your Shopify customer export has duplicates Excel can't catch, "
|
|
||||||
"international phones Excel can't parse, and disguised nulls "
|
|
||||||
"(`N/A`, `(blank)`, `?`) that break Klaviyo's import. "
|
|
||||||
"DataTools fixes all of it in one pass — and your data never "
|
|
||||||
"leaves your computer."
|
|
||||||
),
|
|
||||||
"data_file": "shopify_pet_customers.csv",
|
|
||||||
"pipeline_file": "shopify_pet_pipeline.json",
|
|
||||||
"cta": "Get DataTools for Shopify — $49 →",
|
|
||||||
"landing": "https://datatools.app/shopify/",
|
|
||||||
},
|
|
||||||
"bookkeeper": {
|
"bookkeeper": {
|
||||||
"label": "Bookkeeper / freelance accountant",
|
"label": "Bookkeeper — bank reconciliation",
|
||||||
"icon": "📒",
|
"icon": "📒",
|
||||||
"h1": "Reconcile messy bank exports. **Hand your client an audit trail.**",
|
"h1": "Catch the transactions your bank export posted twice. **Locally.**",
|
||||||
"sub": (
|
"sub": (
|
||||||
"The Jan and Feb exports overlap; the same transaction posts twice. "
|
"When the Jan and Feb exports overlap, the same payment lands "
|
||||||
"Vendor names are *Amazon* / *amazon.com* / *AMAZON.COM*4F2X9* in "
|
"twice — once as `01/15/2025 +$3,450.00`, once as "
|
||||||
"three rows. DataTools dedups on Date + Amount + fuzzy Vendor, "
|
"`2025-01-15 3450.00`. DataTools standardizes every date and "
|
||||||
"produces ISO dates and numeric amounts, and gives you a row-level "
|
"amount, then dedups on the *real* transaction so your "
|
||||||
"audit log to hand the client."
|
"reconciliation ties out. In this sample: **26 rows → 20, six "
|
||||||
|
"phantom duplicates removed** — and your data never leaves your "
|
||||||
|
"computer."
|
||||||
),
|
),
|
||||||
"data_file": "bookkeeper_bank_reconcile.csv",
|
"data_file": "bank_reconciliation.csv",
|
||||||
"pipeline_file": "bookkeeper_bank_pipeline.json",
|
"pipeline_file": "bank_reconciliation_pipeline.json",
|
||||||
"cta": "Get DataTools for Bookkeepers — $49 →",
|
"cta": "Get DataTools for Bookkeepers — $49 →",
|
||||||
"landing": "https://datatools.app/bookkeeper/",
|
"landing": "https://datatools.app/bookkeeper/",
|
||||||
},
|
},
|
||||||
"revops": {
|
"ap-1099": {
|
||||||
"label": "Marketing / RevOps agency",
|
"label": "Accounts payable — 1099 prep",
|
||||||
"icon": "🪢",
|
"icon": "🧾",
|
||||||
"h1": "Dedupe lead lists across HubSpot, LinkedIn, and manual scrapes — **locally.**",
|
"h1": "Build a clean 1099 vendor list — **with the missing EINs filled in.**",
|
||||||
"sub": (
|
"sub": (
|
||||||
"The same prospect shows up in HubSpot as `alice@acme.com`, in "
|
"The same vendor was entered three times across the year — one "
|
||||||
"LinkedIn as `Alice.Johnson@acme.com`, and in your VA's manual "
|
"record has the EIN, another the address, a third the phone. "
|
||||||
"scrape as `alice@acme.com` again. Country is `USA` / `US` / "
|
"DataTools consolidates each vendor to one row and *backfills the "
|
||||||
"`United States`. DataTools fuzzy-matches across sources, "
|
"gaps from the duplicates*. In this sample: **24 messy records → "
|
||||||
"normalizes phones for 50+ countries, and merges survivors "
|
"8 complete vendors, with 7 missing EINs recovered** from the "
|
||||||
"with their most-complete fields — without uploading anything."
|
"duplicate rows. No upload, no VLOOKUP gymnastics."
|
||||||
),
|
),
|
||||||
"data_file": "agency_combined_leads.csv",
|
"data_file": "vendor_1099.csv",
|
||||||
"pipeline_file": "agency_leads_pipeline.json",
|
"pipeline_file": "vendor_1099_pipeline.json",
|
||||||
"cta": "Get DataTools for RevOps — $49 →",
|
"cta": "Get DataTools for Accounting — $49 →",
|
||||||
"landing": "https://datatools.app/revops/",
|
"landing": "https://datatools.app/accounting/",
|
||||||
|
},
|
||||||
|
"ar-aging": {
|
||||||
|
"label": "Accounts receivable — open invoices",
|
||||||
|
"icon": "💵",
|
||||||
|
"h1": "Stop chasing the invoices your aging report counted twice. **Locally.**",
|
||||||
|
"sub": (
|
||||||
|
"Double-entered invoices inflate your AR aging and your "
|
||||||
|
"follow-ups. DataTools standardizes invoice dates, due dates, and "
|
||||||
|
"amounts, lowercases client emails, then removes the duplicate "
|
||||||
|
"invoice numbers — backfilling any blank status from the twin row. "
|
||||||
|
"In this sample: **26 rows → 21, five phantom invoices off the "
|
||||||
|
"books** in one pass."
|
||||||
|
),
|
||||||
|
"data_file": "ar_open_invoices.csv",
|
||||||
|
"pipeline_file": "ar_open_invoices_pipeline.json",
|
||||||
|
"cta": "Get DataTools for Accounting — $49 →",
|
||||||
|
"landing": "https://datatools.app/accounting/",
|
||||||
},
|
},
|
||||||
}
|
}
|
||||||
|
|
||||||
DEFAULT_PERSONA = "shopify-pet"
|
DEFAULT_PERSONA = "bookkeeper"
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
|
|||||||
71
tests/test_demo_pipelines.py
Normal file
71
tests/test_demo_pipelines.py
Normal file
@@ -0,0 +1,71 @@
|
|||||||
|
"""Demo pipelines must keep showing value (accounting personas).
|
||||||
|
|
||||||
|
Each persona's preloaded dataset + saved pipeline is the marketing surface
|
||||||
|
driven by ``src/gui/app_demo.py``. These tests pin that every demo loads,
|
||||||
|
runs clean, and produces its headline value (duplicate rows removed, clean
|
||||||
|
parse, disguised nulls caught) — so a stale dataset or an engine change can't
|
||||||
|
silently gut the sales demo. The read path mirrors ``app_demo._load_demo``
|
||||||
|
exactly (``dtype=str, keep_default_na=False`` so every disguised null survives
|
||||||
|
to the pipeline).
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from src.core.pipeline import Pipeline, run_pipeline
|
||||||
|
|
||||||
|
_REPO = Path(__file__).resolve().parent.parent
|
||||||
|
_DEMO = _REPO / "samples" / "demo"
|
||||||
|
|
||||||
|
# (data_file, pipeline_file, min_duplicates_removed) — one per accounting
|
||||||
|
# persona in app_demo.PERSONAS. The dup floors are the validated demo numbers.
|
||||||
|
_DEMOS = [
|
||||||
|
("bank_reconciliation.csv", "bank_reconciliation_pipeline.json", 6),
|
||||||
|
("vendor_1099.csv", "vendor_1099_pipeline.json", 8),
|
||||||
|
("ar_open_invoices.csv", "ar_open_invoices_pipeline.json", 5),
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("data_file,pipeline_file,min_dupes", _DEMOS)
|
||||||
|
def test_demo_runs_clean_and_shows_value(data_file, pipeline_file, min_dupes):
|
||||||
|
df = pd.read_csv(_DEMO / data_file, dtype=str, keep_default_na=False)
|
||||||
|
pipe = Pipeline.from_file(_DEMO / pipeline_file)
|
||||||
|
res = run_pipeline(df, pipe, stop_on_error=True)
|
||||||
|
|
||||||
|
# 1. Nothing errored — the demo never shows a visitor a red banner.
|
||||||
|
assert all(sr.error is None for sr in res.step_results), [
|
||||||
|
(sr.step.tool, sr.error) for sr in res.step_results
|
||||||
|
]
|
||||||
|
|
||||||
|
# 2. Dedup removed the designed duplicate rows (the headline value).
|
||||||
|
assert res.final_rows < res.initial_rows
|
||||||
|
dedup = next(sr for sr in res.step_results if sr.step.tool == "dedup")
|
||||||
|
assert dedup.summary["duplicates_removed"] >= min_dupes
|
||||||
|
|
||||||
|
# 3. Standardization parsed every typed cell — a demo with unparseable
|
||||||
|
# cells reads as "the tool choked," which kills the pitch.
|
||||||
|
fmt = next(sr for sr in res.step_results if sr.step.tool == "format_standardize")
|
||||||
|
assert fmt.summary["cells_unparseable"] == 0
|
||||||
|
assert fmt.summary["cells_changed"] > 0
|
||||||
|
|
||||||
|
# 4. The disguised nulls (—, (blank), TBD, …) were caught.
|
||||||
|
miss = next(sr for sr in res.step_results if sr.step.tool == "missing")
|
||||||
|
assert miss.summary["sentinels_standardized"] > 0
|
||||||
|
|
||||||
|
|
||||||
|
def test_app_demo_references_each_demo_file():
|
||||||
|
"""Every data/pipeline file the demo app names must exist on disk.
|
||||||
|
|
||||||
|
Guards against a rename in app_demo.py drifting away from samples/demo/
|
||||||
|
(or vice versa) without a test catching it.
|
||||||
|
"""
|
||||||
|
src = (_REPO / "src" / "gui" / "app_demo.py").read_text(encoding="utf-8")
|
||||||
|
for data_file, pipeline_file, _ in _DEMOS:
|
||||||
|
assert data_file in src, f"{data_file} not referenced in app_demo.py"
|
||||||
|
assert pipeline_file in src, f"{pipeline_file} not referenced in app_demo.py"
|
||||||
|
assert (_DEMO / data_file).exists(), f"missing {data_file}"
|
||||||
|
assert (_DEMO / pipeline_file).exists(), f"missing {pipeline_file}"
|
||||||
Reference in New Issue
Block a user