diff --git a/docs/DEMO-PLAN.md b/docs/DEMO-PLAN.md
index 2a289e2..3fb00ee 100644
--- a/docs/DEMO-PLAN.md
+++ b/docs/DEMO-PLAN.md
@@ -32,17 +32,22 @@ rebuilds it from a stale headline.
 | Friction kills conversion | BUSINESS.md §7 | Demo dataset preloaded; no "select a file" first-step |
 | < $1,200/mo recurring | BUSINESS.md §9 | Migration plan to $5/mo VPS only after rate-limit signal |
-## 3. The three personas (per PLAN.md §2.3)
+## 3. The three personas — one audience: accounting (per PLAN.md §2.3)
+
+We niche to **accounting** and enter through the three workflows where a
+messy export costs real money. Same engine, three landing pages — each
+is the same buyer at a different desk (bookkeeping, payables, receivables).
 | Tag | Persona | Top-of-funnel keyword | Demo dataset | Pre-saved pipeline |
 |---|---|---|---|---|
-| `shopify-pet` | Shopify operator (priority: pet supplies) | "shopify customer cleanup" | `samples/demo/shopify_pet_customers.csv` | `shopify_pet_pipeline.json` |
-| `bookkeeper` | Bookkeeper / freelance accountant | "reconcile bank export csv" | `samples/demo/bookkeeper_bank_reconcile.csv` | `bookkeeper_bank_pipeline.json` |
-| `revops` | Marketing / RevOps agency | "dedupe lead list across vendors" | `samples/demo/agency_combined_leads.csv` | `agency_leads_pipeline.json` |
+| `bookkeeper` | Bookkeeper — bank reconciliation | "reconcile bank export csv duplicates" | `samples/demo/bank_reconciliation.csv` | `bank_reconciliation_pipeline.json` |
+| `ap-1099` | Accounts payable — 1099 vendor prep | "clean 1099 vendor list missing EIN" | `samples/demo/vendor_1099.csv` | `vendor_1099_pipeline.json` |
+| `ar-aging` | Accounts receivable — open invoices | "remove duplicate invoices aging report" | `samples/demo/ar_open_invoices.csv` | `ar_open_invoices_pipeline.json` |
-Each persona gets its **own landing page URL**, its **own demo dataset
-loaded by default**, and its **own H1 + below-the-fold copy.** The
-engine is identical; only positioning differs.
+Each persona gets its **own landing page URL** (`?p=`), its **own
+demo dataset loaded by default**, and its **own H1 + below-the-fold
+copy** — wired in `src/gui/app_demo.py::PERSONAS`. The engine is
+identical; only positioning differs.
 ## 4. Demo dataset specifications
@@ -53,114 +58,77 @@ persona's tooling. Each contains every kind of pollution the bundle's five tools fix, so a single demo run shows every tool earning its keep.
-### 4.0 Pain-point coverage map
+### 4.0 Value-proof map
-Each demo dataset is engineered so the buyer sees their **own top
-pain** demonstrated in the AFTER preview. The mapping below pairs
-each pain from PLAN.md §2.3a with the rows / columns that exercise
-it. Refresh the dataset only when this coverage drops.
+Each demo dataset is engineered so the buyer sees their **own top pain**
+fixed in the AFTER preview, with one unmistakable headline number. All
+three run the same saved 4-step pipeline (Clean Text → Standardize
+Formats → Fix Missing Values → Find Duplicates). The numbers below are
+**validated against the live engine** (`tests/test_demo_pipelines.py`
+pins them) — refresh the dataset only if a number stops landing.
-| Persona | Pain (from PLAN §2.3a) | Demo coverage |
+| Persona | Headline proof | What the visitor watches happen |
 |---|---|---|
-| Shopify pet | S1 — Klaviyo per-contact dupes | 5 dup pairs across rows 1–15 (case + format + address-twin variants) |
-| Shopify pet | S2 — feed-rejection chars | smart-quote / NBSP / BOM in rows 1–6, 9, 11 |
-| Shopify pet | S3 — multi-channel | partner-style customer IDs (`SHOP-`); demonstration of column-level mapping covered in RevOps demo |
-| Shopify pet | S4 — subscription identity | rows 1+2, 7+8, 9+10 — same person, different format |
-| Shopify pet | S5 — VAT-MOSS country drift | rows 16–18 (`United Kingdom` / `U.K.` / `UK`) + rows 19–20 (`Germany`/`Italia`) |
-| Bookkeeper | B1 — month-overlap re-import | 7 dup pairs spanning Jan↔Feb and Mar boundaries |
-| Bookkeeper | B2 — 1099 vendor consolidation | Amazon × 3 spellings, Verizon × 2, Acme Realty × 2, Adobe × 2, Costco × 2, Zoom × 2, Stripe × 4 |
-| Bookkeeper | B3 — audit trail | every cell change in the run logged with old/new/rule — surface in the demo's audit tab |
-| Bookkeeper | B4 — per-license economics | demonstrated by pricing copy, not data |
-| Bookkeeper | B5 — multi-currency | rows 26 (EUR), 27 (GBP), 28 (BRL with comma decimal), 29 (parens-negative) |
-| RevOps | R1 — per-contact tier | 6 cross-source dup pairs (HubSpot × LinkedIn × Manual Scrape) |
-| RevOps | R2 — deliverability | rows 26–27 (`uma at uniform dot com`, `victor@@victorco.com` invalid emails) |
-| RevOps | R3 — GDPR / privacy | demonstrated by the network-tab moat panel + zero-upload claim |
-| RevOps | R4 — vendor unification | 3 source values (HubSpot / LinkedIn / Manual Scrape), 13 country codes, mixed-shape headers |
-| RevOps | R5 — suppression list | rows 29–30 (`Suppressed`, `Opted Out` tags) |
+| Bookkeeper | **26 → 20 rows · 6 phantom duplicates removed** | The same payment posted twice (different date + amount format) collapses to one; dates go ISO, parens-negatives become real negatives |
+| AP / 1099 | **24 records → 8 vendors · 7 missing EINs recovered** | Each vendor's scattered records merge into one complete row; `merge=true` backfills the EIN/address/phone that any single record was missing |
+| AR aging | **26 → 21 rows · 5 double-entered invoices removed** | Duplicate invoice numbers collapse; a blank status is backfilled from its twin; invoice + due dates go ISO, amounts numeric |
-### 4.1 `shopify_pet_customers.csv` (20 rows)
+### 4.1 `bank_reconciliation.csv` (26 rows) — Bookkeeper
-**Looks like**: a Shopify customer export filtered for "Pet Supplies"
-sales channel, 12 months activity.
+**Looks like**: two months (Jan + Feb 2025) of business-checking activity
+from a bank portal, where the Feb re-export overlaps Jan so the same
+transaction posts twice. Columns: `Date, Description, Vendor, Category,
+Amount, Account`.
 **Pollution included**:
-- Whitespace padding (" Alice ", "Sydney Opera House Drive ")
-- Mixed phone formats: `(415) 555-1234`, `415.555.1234`, `5559876543`,
-  `+1 555-111-1111`
-- International phones: GB, ES, DE, AU, JP (15 demo rows span 6
-  countries)
-- Currency variants: `$1,240.50`, `£890.25`, `€2.410,75` (EU comma
-  decimal), `A$ 1,299.00`, `¥75000`
-- Date formats: `2025-12-04`, `12/15/2025`, `?`, `(blank)`, `(none)`,
-  `#N/A`
-- Disguised nulls: `N/A`, blank, `(blank)`, `?`, `#N/A`, `(none)`,
-  `unknown`
-- Name casing: `EVE MARTINEZ`, `henry`, `O'NEIL`, `noah`, mixed Title /
-  ALL CAPS / lower
-- Email case variants that *should* dedup: `Bob@PetShop.com` vs
-  `alice@petshop.com`
-- 4 fuzzy duplicates (Alice/Bob same address, Grace/Henry same phone,
-  Carlos/Olivia same address, Ivy/Jack same address)
+- Mixed date formats: `01/15/2025`, `2025-01-15`, `Jan 18 2025`, `1/27/25`, `Feb 5 2025`.
+- Currency formats incl. negatives: `-$129.99`, `($89.50)` parens-negative, `+$3,450.00`, `- $599.88`, bare `-129.99`, `(50.00)`.
+- Whitespace + NBSP padding; smart quotes and an em-dash inside descriptions.
+- Vendor casing variety on *non-duplicate* rows: `Amazon` / `amazon.com` / `AMAZON.COM`, `Verizon` / `verizon`.
+- Disguised nulls in Category: `—`, `(blank)`, `?`, `unknown`, `TBD`.
+- **6 duplicate transactions** — each pair shares the same vendor + real value but a different date *and* amount format, so they collapse only after standardization.
-**After running the pipeline**: 20 rows → 15, ~29 cells canonicalized,
-~45 sentinels standardised, 5 cross-row duplicates merged. The
-customer table is now Klaviyo-import-ready and the country column
-(previously `UK` / `U.K.` / `United Kingdom` / `Germany` / `Italia`)
-is GB / DE / IT — VAT MOSS report won't break.
+**After running the pipeline** (validated): **26 → 20 rows, 6 duplicates
+removed**, 36 date/amount cells standardized (0 unparseable), all dates
+ISO, parens-negatives resolved (`($89.50)` → `-89.50`), disguised-null
+categories flagged. The reconciliation ties out.
-### 4.2 `bookkeeper_bank_reconcile.csv` (30 rows)
+### 4.2 `vendor_1099.csv` (24 rows) — Accounts payable / 1099
-**Looks like**: two months of business checking + credit-card activity
-exported from a bank portal, with the Feb export accidentally
-overlapping the Jan export at the month boundary.
+**Looks like**: a 1099-NEC vendor master list where the same vendor was
+entered 2–3 times across the year by different staff, each record holding
+only *part* of the vendor's details. Columns: `Vendor, Contact, Email,
+Phone, EIN, Address, Total_Paid`.
 **Pollution included**:
-- Mixed date formats: `01/15/2025`, `2025-01-15`, `Jan 18 2025`,
-  `1/27/25`, `Feb 5 2025`
-- Currency formats: `-$129.99`, `($89.50)` parens-negative,
-  `+$3,450.00`, `- $599.88` space, bare `-129.99`, `(50.00)`
-- Header trailing whitespace: `"Date "`
-- Smart quotes around descriptions: `"autopay"`
-- Em-dash sentinels in Vendor: `—`
-- Smart-em-dash inside descriptions: `STAPLES #4422 — paper, toner`
-- Vendor casing inconsistency: `Amazon` / `amazon.com` / `AMAZON.COM`,
-  `Verizon` / `verizon`
-- 6 duplicate transactions (same date+amount+vendor recorded twice
-  with different formats)
+- The duplicate records for a vendor share one email differing only by case/whitespace (the reliable dedup key, matched with the `email` normalizer).
+- EIN / Phone / Address scattered across the duplicate set so no single record is complete but the union is — gaps marked `—`, `(blank)`, `TBD`, `unknown`, `N/A`.
+- Vendor name casing/spelling variants, phone formats, EIN formats (`12-3456789` vs `123456789`), `Total_Paid` currency variants.
-**After running the pipeline**: 30 rows → 23, ~84 cells normalized, 7
-duplicates removed (month-overlap + VAT-MOSS dups). All dates
-ISO-formatted, all amounts numeric (including EUR/GBP/BRL with comma
-decimal), vendor casing canonical, parens-negative resolved.
+**After running the pipeline** (validated): **24 records → 8 vendors, 16
+duplicates removed, 7 missing EINs recovered** by `merge=true` +
+`most_complete` survivor, 35 disguised nulls caught, phones/emails/amounts
+standardized (0 unparseable). One vendor genuinely has no EIN in any
+record — it survives with a blank EIN as the realistic "flag for
+follow-up" case.
-### 4.3 `agency_combined_leads.csv` (30 rows)
+### 4.3 `ar_open_invoices.csv` (26 rows) — Accounts receivable
-**Looks like**: a marketing-ops worksheet combining lead exports from
-HubSpot + LinkedIn Sales Navigator + manual scraping, ready for
-campaign targeting.
+**Looks like**: an open-invoices (unpaid AR) export where some invoices
+were double-entered in different formats and client contacts are messy.
+Columns: `Invoice, Client, Email, Invoice_Date, Due_Date, Amount, Status`.
 **Pollution included**:
-- Phone formats per region: US, UK, Spain, Germany, China, India,
-  Australia, Mexico, Israel, Singapore, Hong Kong, Italy, South
-  Korea — 13 country codes
-- Country column inconsistent: `USA` / `US` / `United States`
-- Disguised nulls: `N/A`, `unknown`, `(unknown)`, `(blank)`, `(none)`,
-  `?`, `—`, `#N/A`, `TBD`
-- Source column tags origin (`HubSpot` / `LinkedIn` / `Manual Scrape`)
-- Email duplicates across sources with case variants: `alice@acme.com`
-  + `Alice.Johnson@acme.com`, `bob@beta.com` + `Bob@Beta.com`,
-  `diana@delta.com` from two sources, `carlos@gamma.io` from two
-  sources, `Frank@Foxtrot.de` + `frank@foxtrot.de`
-- Name casing: `DIANA LEE`, `henry`, `IVY CHEN`, mixed
-- 6 fuzzy / cross-source duplicates designed to survive the dedup
-- Score column with sentinel pollution that needs coercion to integer
+- Two date columns with mixed formats; currency variants incl. a credit memo `($300.00)` → `-300.00`.
+- Client name casing variety; email case variants (`AP@Acme.com` vs `ap@acme.com`).
+- Status disguised nulls: `—`, `?`, `(blank)`, `TBD`, `unknown`, `(none)`.
+- **5 double-entered invoices** — same invoice number twice, dates/amount in different formats, one copy with a blank status the other fills.
-**After running the pipeline**: 30 rows → 24, ~43 cells canonicalized,
-14 sentinels resolved, 6 cross-source duplicates merged with `merge=true`
-so each survivor inherits the most-complete picture. Invalid-email
-rows (deliverability stress) and `Suppressed`/`Opted Out` tags
-(suppression-list use case) survive as flagged rows the operator
-manually reviews.
+**After running the pipeline** (validated): **26 → 21 rows, 5 duplicate
+invoices removed**, both date columns ISO + amounts numeric + emails
+lowercased (0 unparseable), 7 disguised-null statuses caught, and a blank
+status backfilled from its twin via `merge=true`. The aging report stops
+double-counting.
 ## 5. UX flow (per persona)
@@ -174,26 +142,26 @@ dedicated `app_demo.py` for the cloud build).
 │ "{Persona-specific H1}" │
 ├──────────────────────────────────────────────────────────┤
 │ │
-│ Sample dataset preloaded: shopify_pet_customers.csv │
+│ Sample dataset preloaded: bank_reconciliation.csv │
 │ [Replace with your own file (capped 100 rows)] │
 │ │
-│ ┌─ BEFORE preview (15 rows) ─────────────────────────┐ │
-│ │ Alice | (415) 555-1234 | $1,240.50 | … │ │
-│ │ Bob | 415.555.1234 | $1,240.50 | … │ │
+│ ┌─ BEFORE preview (26 rows) ─────────────────────────┐ │
+│ │ 01/15/2025 | Stripe | +$3,450.00 | … │ │
+│ │ 2025-01-15 | Stripe | 3450.00 | … (dup) │ │
 │ │ ... │ │
 │ └──────────────────────────────────────────────────┘ │
 │ │
 │ Pipeline (saved): │
-│ 1. Text Clean → 2. Format Standardize → │
-│ 3. Missing → 4. Deduplicate │
+│ 1. Clean Text → 2. Standardize Formats → │
+│ 3. Fix Missing → 4. Find Duplicates │
 │ │
 │ [▶ Run pipeline] │
 │ │
 │ ┌─ AFTER preview ───────────────────────────────────┐ │
-│ │ 15 rows → 11 (4 duplicates merged) │ │
-│ │ 27 cells canonicalized · 33 sentinels resolved │ │
+│ │ 26 rows → 20 (6 duplicate transactions removed) │ │
+│ │ 36 cells standardized · 4 disguised nulls flagged │ │
 │ │ │ │
-│ │ Alice Johnson | +14155551234 | 1240.50 | … │ │
+│ │ 2025-01-15 | Stripe | 3450.00 | … │ │
 │ │ ... │ │
 │ └──────────────────────────────────────────────────┘ │
 │ │
@@ -244,27 +212,35 @@ not "demo crippled" data.
 ## 7. CTA copy (per persona)
-### 7.1 Shopify pet operator
+Copy lives in `src/gui/app_demo.py::PERSONAS` (H1 / sub / CTA per tag);
+keep this section in sync with that dict.
-- **H1**: *Clean your customer / vendor / subscriber exports — locally.*
-- **Sub**: *Klaviyo-import-ready in 30 seconds. Catches duplicates Excel
-  misses. Your data never leaves your computer.*
-- **CTA**: *Get DataTools for Shopify — $49 →*
+### 7.1 Bookkeeper — bank reconciliation (`?p=bookkeeper`)
-### 7.2 Bookkeeper / freelance accountant
-
-- **H1**: *Reconcile messy bank exports. Hand your client an audit
-  trail.*
-- **Sub**: *Catches the duplicate transaction Quickbooks imported twice.
-  Standardizes dates, amounts, vendor casing. Every change auditable.*
+- **H1**: *Catch the transactions your bank export posted twice. Locally.*
+- **Sub**: *When the Jan and Feb exports overlap, the same payment posts
+  twice in two formats. DataTools standardizes every date and amount, then
+  dedups on the real transaction so your reconciliation ties out — 26 rows
+  → 20, six phantom duplicates gone.*
 - **CTA**: *Get DataTools for Bookkeepers — $49 →*
-### 7.3 Marketing / RevOps agency
+### 7.2 Accounts payable — 1099 prep (`?p=ap-1099`)
-- **H1**: *Dedupe leads across HubSpot, LinkedIn, and manual scrapes.*
-- **Sub**: *International phones, country normalization, fuzzy dedup
-  with merge — one tool, one schema, no upload.*
-- **CTA**: *Get DataTools for RevOps — $49 →*
+- **H1**: *Build a clean 1099 vendor list — with the missing EINs filled in.*
+- **Sub**: *The same vendor entered three times, each record holding only
+  part of the details. DataTools consolidates to one row and backfills the
+  gaps from the duplicates — 24 records → 8 vendors, 7 missing EINs
+  recovered.*
+- **CTA**: *Get DataTools for Accounting — $49 →*
+
+### 7.3 Accounts receivable — open invoices (`?p=ar-aging`)
+
+- **H1**: *Stop chasing the invoices your aging report counted twice. Locally.*
+- **Sub**: *Double-entered invoices inflate your AR aging and your
+  follow-ups. DataTools standardizes dates and amounts, lowercases client
+  emails, and removes the duplicate invoice numbers — 26 rows → 21, five
+  phantom invoices off the books.*
+- **CTA**: *Get DataTools for Accounting — $49 →*
 ## 8. Telemetry / conversion tracking
diff --git a/samples/demo/agency_combined_leads.csv b/samples/demo/agency_combined_leads.csv
deleted file mode 100644
index 8f8eb84..0000000
--- a/samples/demo/agency_combined_leads.csv
+++ /dev/null
@@ -1,31 +0,0 @@
-Lead ID,First Name,Last Name,Company,Title,Email,Phone,Country,Source,Score,Last Activity,Tags
-HUB-001,Alice,Johnson,Acme Corp,VP Marketing,alice@acme.com,(415) 555-1234,USA,HubSpot,87,2025-12-04,Enterprise
-HUB-002,bob,smith,Beta LLC,Director Growth,bob@beta.com,N/A,United States,HubSpot,N/A,2025-11-22,SMB
-HUB-003,Carlos,Garcia,Gamma Inc,CEO,carlos@gamma.io,+34 91 411 1111,Spain,HubSpot,82,2025-10-30,Enterprise
-HUB-004,DIANA,LEE,Delta Co,Marketing Manager,diana@delta.com,020 7946 0958,United Kingdom,HubSpot,74,2025-12-15,Mid-Market
-HUB-005,Eve,Martinez,Epsilon Group,VP Ops,eve@epsilon.com,(none),Mexico,HubSpot,(blank),2025-09-15,SMB
-LIN-006,Alice,Johnson,Acme Corporation,VP of Marketing,Alice.Johnson@acme.com,4155551234,US,LinkedIn,—,2025-12-04,Enterprise
-LIN-007,Frank,Brown,Foxtrot Ltd,Head Sales,frank@foxtrot.de,+49 30 12345678,Germany,LinkedIn,68,2025-12-01,Mid-Market
-LIN-008,Grace,Davis,Golf Industries,Marketing Lead,grace@golfind.com,+44 20 7946 0958,UK,LinkedIn,79,2025-11-08,Mid-Market
-LIN-009,henry,wilson,Hotel Logistics,COO,henry@hotellog.com,+86 10 1234 5678,China,LinkedIn,91,2025-12-12,Enterprise
-LIN-010,IVY CHEN,,India Tech,CTO,ivy@indiatech.in,+91 11 2345 6789,IN,LinkedIn,88,2025-11-30,Enterprise
-LIN-011,Jack,Taylor,Juliet & Co,Founder,jack@juliet.co,unknown,United States,LinkedIn,?,(unknown),SMB
-SCR-012,Diana,Lee,Delta Company,Marketing Manager,diana@delta.com,020-7946-0958,UK,Manual Scrape,74,12/15/2025,Mid-Market
-SCR-013,kate,o'neil,Kilo Ventures,Partner,kate@kilo.vc,+1 415 555 2222,USA,Manual Scrape,N/A,?,Investor
-SCR-014,Carlos,García,Gamma Incorporated,CEO,Carlos@gamma.io,+34-91-411-1111,Spain,Manual Scrape,82,Oct 30 2025,Enterprise
-SCR-015,Liam,Park,Lima Solutions,Director Marketing,liam@limasol.kr,+82 2 2287 0114,South Korea,Manual Scrape,77,2025-11-20,Enterprise
-SCR-016,Mia,nguyen,Mike Corp,VP Marketing,mia@mikecorp.com.au,02 9374 4000,Australia,Manual Scrape,72,2025-10-05,Mid-Market
-SCR-017,Noah,Brown,November Inc,Head of Growth,noah@november.com,(555) 444-5555,US,Manual Scrape,—,#N/A,SMB
-HUB-018,Frank,Brown,Foxtrot,Head of Sales,Frank@Foxtrot.de,+49-30-12345678,Germany,HubSpot,68,2025-12-01,Mid-Market
-HUB-019,Olivia,Rossi,Oscar Italia,CMO,olivia@oscar.it,+39 06 6982,Italy,HubSpot,85,2025-12-08,Enterprise
-HUB-020,papa,wong,Papa Trading,Founder,papa@papatrading.hk,+852 2123 4567,Hong Kong,HubSpot,69,2025-11-15,SMB
-LIN-021,Quinn,Reyes,Quebec Group,VP Sales,quinn@quebec.mx,+52 55 5555 0000,Mexico,LinkedIn,80,2025-12-05,Mid-Market
-LIN-022,Robert,Tan,Romeo Logistics,Director,r.tan@romeo.sg,+65 6123 4567,Singapore,LinkedIn,76,2025-11-28,Mid-Market
-SCR-023,Sara,Khan,Sierra Foods,Head Marketing,sara@sierra.in,+91-22-1234-5678,India,Manual Scrape,73,2025-12-02,SMB
-SCR-024,bob,Smith,Beta,Director Growth,Bob@Beta.com,(none),United States,Manual Scrape,(unknown),(unknown),SMB
-HUB-025,Tara,Levi,Tango Tech,VP Product,tara@tango.il,+972 3 6957 0000,Israel,HubSpot,82,2025-12-10,Enterprise
-HUB-026,Uma,Patel,Uniform Health,CMO,uma at uniform dot com,+44 20 7946 8888,United Kingdom,HubSpot,71,2025-12-12,Enterprise
-LIN-027,Victor,Lee,Victor Co,Director,victor@@victorco.com,+1 415 555 8888,USA,LinkedIn,69,2025-11-30,SMB
-SCR-028,Wendy,Akin,Whiskey Inc,CMO,wendy@whiskey.tr,+90 212 252 1111,Turkey,Manual Scrape,77,2025-12-04,Mid-Market
-SCR-029,Xander,Ng,Xray Group,Founder,xander@xray.sg,+65 6234 5678,Singapore,Manual Scrape,65,2025-11-15,Suppressed
-HUB-030,Yara,Costa,Yankee Foods,Marketing Lead,yara@yankee.br,+55 11 3071 2222,Brazil,HubSpot,—,2025-12-15,Opted Out
diff --git a/samples/demo/agency_leads_pipeline.json b/samples/demo/agency_leads_pipeline.json
deleted file mode 100644
index 06e40dd..0000000
--- a/samples/demo/agency_leads_pipeline.json
+++ /dev/null
@@ -1,74 +0,0 @@
-{
-  "steps": [
-    {
-      "tool": "text_clean",
-      "options": {},
-      "enabled": true,
-      "name": "1. Clean text (whitespace + smart quotes from copy-paste)"
-    },
-    {
-      "tool": "format_standardize",
-      "options": {
-        "column_types": {
-          "First Name": "name",
-          "Last Name": "name",
-          "Company": "name",
-          "Email": "email",
-          "Phone": "phone"
-        },
-        "phone_country_column": "Country",
-        "phone_format": "E164",
-        "email_gmail_canonical": true
-      },
-      "enabled": true,
-      "name": "2. E.164 phones (per-row country) · canonical emails · name casing"
-    },
-    {
-      "tool": "missing",
-      "options": {
-        "strategy": "none",
-        "standardize_sentinels": true,
-        "sentinels": ["N/A", "n/a", "—", "?", "(unknown)", "unknown", "(blank)", "(none)", "TBD", "#N/A"]
-      },
-      "enabled": true,
-      "name": "3. Standardize sentinels across vendor exports"
-    },
-    {
-      "tool": "column_map",
-      "options": {
-        "schema": {
-          "fields": [
-            {"name": "Lead ID", "dtype": "string", "required": true},
-            {"name": "First Name", "dtype": "string"},
-            {"name": "Last Name", "dtype": "string"},
-            {"name": "Company", "dtype": "string"},
-            {"name": "Title", "dtype": "string"},
-            {"name": "Email", "dtype": "string"},
-            {"name": "Phone", "dtype": "string"},
-            {"name": "Country", "dtype": "string"},
-            {"name": "Source", "dtype": "string"},
-            {"name": "Score", "dtype": "integer"},
-            {"name": "Last Activity", "dtype": "date"},
-            {"name": "Tags", "dtype": "string"}
-          ]
-        },
-        "auto_infer": true,
-        "unmapped": "keep",
-        "coerce_types": true,
-        "reorder_to_schema": true,
-        "enforce_required": false
-      },
-      "enabled": true,
-      "name": "4. Coerce types · reorder to canonical schema"
-    },
-    {
-      "tool": "dedup",
-      "options": {
-        "survivor_rule": "most_complete",
-        "merge": true
-      },
-      "enabled": true,
-      "name": "5. Dedup leads across HubSpot / LinkedIn / Manual Scrape (fuzzy + merge)"
-    }
-  ]
-}
diff --git a/samples/demo/ar_open_invoices.csv b/samples/demo/ar_open_invoices.csv
new file mode 100644
index 0000000..a893b4e
--- /dev/null
+++ b/samples/demo/ar_open_invoices.csv
@@ -0,0 +1,27 @@
+Invoice,Client,Email,Invoice_Date,Due_Date,Amount,Status
+INV-1007,ACME LLC,AP@Acme.com,03/04/2025,04/03/2025,"$1,250.00",Open
+INV-1007, Acme LLC ,ap@acme.com,2025-03-04,2025-04-03,"1,250.00",(blank)
+INV-1001,northwind traders,billing@northwind.com,Mar 6 2025,04/05/2025,$980,Overdue
+INV-1002,Globex Corp,AR@Globex.com,3/11/25,4/10/25,"2,400.50",Sent
+INV-1011,initech,accounts@initech.com,04/01/2025,05/01/2025,"$ 1,100.00",?
+INV-1011,Initech,Accounts@Initech.com,2025-04-01,2025-05-01,1100,Open
+INV-1003,Stark Industries,ap@stark.com,Mar 6 2025,Apr 6 2025,$75.00,Open
+INV-1004,Wayne Enterprises,ar@wayne.com,03/15/2025,04/14/2025,($300.00),—
+INV-1015,Hooli,billing@hooli.com,3/11/25,4/10/25,"$4,300.00",Overdue
+INV-1015,hooli,Billing@Hooli.com,2025-03-11,2025-04-10,4300,(none)
+INV-1005,Soylent Corp,ap@soylent.com,2025-03-20,2025-04-19,"$1,875.25",Sent
+INV-1006,Umbrella Co,ar@umbrella.com,03/22/2025,04/21/2025,$640.00,TBD
+INV-1019,Cyberdyne Systems,ap@cyberdyne.com,Mar 25 2025,04/24/2025,"$2,050.00",unknown
+INV-1019,cyberdyne systems,AP@Cyberdyne.com,2025-03-25,2025-04-24,"2,050.00",Open
+INV-1008,Vandelay Industries,ar@vandelay.com,3/28/25,4/27/25,$915.00,Overdue
+INV-1009,Gekko & Co,billing@gekko.com,2025-03-30,2025-04-29,"$3,120.75",Open
+INV-1010,Pied Piper,ap@piedpiper.com,04/02/2025,05/02/2025,$180,Sent
+INV-1023,Tyrell Corp,ar@tyrell.com,04/05/2025,05/05/2025,($300.00),(blank)
+INV-1023,Tyrell Corp,AR@Tyrell.com,2025-04-05,2025-05-05,-300.00,Open
+INV-1012,Oscorp,ap@oscorp.com,Apr 8 2025,05/08/2025,"$5,000.00",Overdue
+INV-1013,Nakatomi Trading,ar@nakatomi.com,4/9/25,5/9/25,$725.50,Sent
+INV-1014,Bluth Company,billing@bluth.com,2025-04-10,2025-05-10,"$1,420.00",Open
+INV-1016,Dunder Mifflin,ap@dundermifflin.com,04/12/2025,05/12/2025,$960.00,Overdue
+INV-1017,Prestige Worldwide,ar@prestige.com,Apr 14 2025,05/14/2025,"$2,680.00",Sent
+INV-1018,Sterling Cooper,billing@sterlingcooper.com,4/15/25,5/15/25,"$3,950.00",Open
+INV-1020,Wonka Industries,ap@wonka.com,2025-04-18,2025-05-18,"$1,050.00",Overdue
diff --git a/samples/demo/ar_open_invoices_pipeline.json b/samples/demo/ar_open_invoices_pipeline.json
new file mode 100644
index 0000000..0e2518e
--- /dev/null
+++ b/samples/demo/ar_open_invoices_pipeline.json
@@ -0,0 +1,50 @@
+{
+  "steps": [
+    {
+      "tool": "text_clean",
+      "enabled": true,
+      "options": {
+        "trim": true,
+        "collapse_whitespace": true,
+        "fold_smart_chars": true,
+        "strip_zero_width": true
+      }
+    },
+    {
+      "tool": "format_standardize",
+      "enabled": true,
+      "options": {
+        "column_types": {
+          "Invoice_Date": "date",
+          "Due_Date": "date",
+          "Amount": "currency",
+          "Email": "email"
+        }
+      }
+    },
+    {
+      "tool": "missing",
+      "enabled": true,
+      "options": {
+        "strategy": "none",
+        "standardize_sentinels": true,
+        "sentinels": ["—", "-", "?", "(blank)", "TBD", "unknown", "(none)", "N/A", "#N/A"]
+      }
+    },
+    {
+      "tool": "dedup",
+      "enabled": true,
+      "options": {
+        "survivor_rule": "most_complete",
+        "merge": true,
+        "strategies": [
+          {
+            "columns": [
+              {"column": "Invoice", "algorithm": "exact", "threshold": 100}
+            ]
+          }
+        ]
+      }
+    }
+  ]
+}
diff --git a/samples/demo/bank_reconciliation.csv b/samples/demo/bank_reconciliation.csv
new file mode 100644
index 0000000..9ec1df8
--- /dev/null
+++ b/samples/demo/bank_reconciliation.csv
@@ -0,0 +1,27 @@
+Date,Description,Vendor,Category,Amount,Account
+01/15/2025,“Stripe payout — weekly”,Stripe,Income,"+$3,450.00",Business Checking
+2025-01-15,Verizon business line,Verizon,—,($89.50),Business Checking
+Jan 18 2025,Adobe Creative Cloud ,Adobe,(blank),-$129.99,Business Checking
+1/27/25,Office supplies,Amazon,Supplies,-$74.20,Business Checking
+02/03/2025, Monthly office rent,Highland Properties,Rent,"$1,200.00",Business Checking
+Feb 5 2025,Account service fee,First National Bank,?,(50.00),Business Checking
+2025-01-09,Shipping labels,amazon.com,unknown,-$18.40,Business Checking
+1/22/25,Contractor — landing page,Bright Lane Design,TBD,- $599.88,Business Checking
+Jan 30 2025,Late fee adjustment,verizon,Utilities,-$12.00,Business Checking
+2025-01-11,Packaging tape,AMAZON.COM,Supplies,-$31.75,Business Checking
+01/06/2025,Client deposit — ACME Co,ACME Co,Income,"$2,500.00",Business Checking
+2025-01-20,Google Workspace,Google,Software,-$36.00,Business Checking
+Jan 24 2025,Fuel — delivery van,Shell,Vehicle,-$58.63,Business Checking
+1/28/25,QuickBooks subscription,Intuit,Software,-$80.00,Business Checking
+2025-01-15,Stripe payout weekly,Stripe,Income,3450.00,Business Checking
+01/15/2025,Verizon business line,Verizon,Utilities,-89.50,Business Checking
+2025-01-18,Adobe Creative Cloud,Adobe,Software,-129.99,Business Checking
+2025-02-03,Monthly office rent,Highland Properties,Rent,1200.00,Business Checking
+2025-02-05,Account service fee,First National Bank,Bank Fees,-50.00,Business Checking
+2025-01-22,Contractor landing page,Bright Lane Design,Contractors,-599.88,Business Checking
+02/10/2025,Client deposit — Globex,Globex,Income,"$1,800.00",Business Checking
+2025-02-12,Slack subscription,Slack,Software,-$96.00,Business Checking
+Feb 14 2025,Coffee — client meeting,Blue Bottle,Meals,-$23.10,Business Checking
+2/18/25,Insurance premium,Hartford,Insurance,-$240.50,Business Checking
+02/21/2025,Refund — returned printer,Staples,Supplies,$210.99,Business Checking
+Feb 25 2025,Domain renewal,Namecheap,Software,-$13.98,Business Checking
diff --git a/samples/demo/bank_reconciliation_pipeline.json b/samples/demo/bank_reconciliation_pipeline.json
new file mode 100644
index 0000000..b7b4b44
--- /dev/null
+++ b/samples/demo/bank_reconciliation_pipeline.json
@@ -0,0 +1,6 @@
+{"steps":[
+ {"tool":"text_clean","enabled":true,"options":{"trim":true,"collapse_whitespace":true,"fold_smart_chars":true,"strip_zero_width":true}},
+ {"tool":"format_standardize","enabled":true,"options":{"column_types":{"Date":"date","Amount":"currency"}}},
+ {"tool":"missing","enabled":true,"options":{"strategy":"none","standardize_sentinels":true,"sentinels":["—","(blank)","?","unknown","TBD","N/A","#N/A","(none)"]}},
+ {"tool":"dedup","enabled":true,"options":{"survivor_rule":"most_complete","merge":true,"strategies":[{"columns":[{"column":"Date","algorithm":"exact","threshold":100},{"column":"Amount","algorithm":"exact","threshold":100}]}]}}
+]}
diff --git a/samples/demo/bookkeeper_bank_pipeline.json b/samples/demo/bookkeeper_bank_pipeline.json
deleted file mode 100644
index 87d3fc0..0000000
--- a/samples/demo/bookkeeper_bank_pipeline.json
+++ /dev/null
@@ -1,56 +0,0 @@
-{
-  "steps": [
-    {
-      "tool": "text_clean",
-      "options": {},
-      "enabled": true,
-      "name": "1. Clean text (header whitespace, smart quotes, em-dash)"
-    },
-    {
-      "tool": "format_standardize",
-      "options": {
-        "column_types": {
-          "Date": "date",
-          "Amount": "currency",
-          "Balance": "currency",
-          "Vendor": "name"
-        },
-        "currency_decimal": "auto",
-        "currency_preserve_code": false,
-        "currency_decimals": 2,
-        "date_output_format": "%Y-%m-%d"
-      },
-      "enabled": true,
-      "name": "2. ISO dates · numeric amounts (parens-negative) · vendor casing"
-    },
-    {
-      "tool": "missing",
-      "options": {
-        "strategy": "none",
-        "standardize_sentinels": true,
-        "sentinels": ["N/A", "n/a", "—", "-", "?", "(blank)", "(none)", "unknown", "#N/A"]
-      },
-      "enabled": true,
-      "name": "3. Standardize disguised nulls (— / N/A / (blank))"
-    },
-    {
-      "tool": "dedup",
-      "options": {
-        "survivor_rule": "most_complete",
-        "merge": false,
-        "date_column": "Date",
-        "strategies": [
-          {
-            "columns": [
-              {"column": "Date", "algorithm": "exact", "threshold": 100},
-              {"column": "Amount", "algorithm": "exact", "threshold": 100},
-              {"column": "Vendor", "algorithm": "jaro_winkler", "threshold": 80}
-            ]
-          }
-        ]
-      },
-      "enabled": true,
-      "name": "4. Dedup transactions on Date+Amount+fuzzy Vendor"
-    }
-  ]
-}
diff --git a/samples/demo/bookkeeper_bank_reconcile.csv b/samples/demo/bookkeeper_bank_reconcile.csv
deleted file mode 100644
index c29d865..0000000
--- a/samples/demo/bookkeeper_bank_reconcile.csv
+++ /dev/null
@@ -1,31 +0,0 @@
-Txn ID,Date ,Description,Amount,Balance,Account,Vendor,Category
-TXN-2401,01/15/2025," AMAZON.COM*4F2X9 PURCHASE",-$129.99,"$2,450.01",Checking,Amazon,Office Supplies
-TXN-2402,2025-01-15,"AMAZON.COM*4F2X9 PURCHASE",-$129.99,"2450.01",Checking,amazon.com,Office Supplies
-TXN-2403,Jan 18 2025,"STAPLES #4422 — paper, toner",($89.50),$2360.51,Checking,STAPLES,Office Supplies
-TXN-2404,01/22/2025,"Verizon Wireless ""autopay""",-$120.00,"$2,240.51",Checking,Verizon,Utilities
-TXN-2405,2025-01-22,Verizon Wireless autopay,-120.00,"2,240.51",Checking,verizon,Utilities
-TXN-2406,01-25-2025,"Stripe Payout — invoice #1077","+$3,450.00","$5,690.51",Checking,Stripe,Income
-TXN-2407,1/27/25,"Office Lease - Suite 204",-1500.00,"$4,190.51",Checking,Acme Realty,Rent
-TXN-2408,02/01/2025,"Wire — Acme Realty Mgmt","-$1,500.00","$2,690.51",Checking,acme realty,Rent
-TXN-2409,2025-02-03,"Adobe Creative Cloud annual","- $599.88","$2,090.63",Credit Card,Adobe Inc.,Software
-TXN-2410,02/03/2025,"ADOBE CREATIVE CLOUD ANN",-599.88,2090.63,Credit Card,adobe,Software
-TXN-2411,Feb 5 2025,"FedEx — overnight to client A",-$32.50,"$2,058.13",Checking,FedEx,Shipping
-TXN-2412,02/07/2025,"Square fee — invoice #1078","-$3.20","$2,054.93",Checking,Square,Fees
-TXN-2413,02/10/2025,"Stripe Payout invoice #1079","+ $1,200.00","$3,254.93",Checking,Stripe,Income
-TXN-2414,2025-02-12,"USPS PRIORITY — to vendor B","-12.40","$3,242.53",Checking,USPS,Shipping
-TXN-2415,02/14/2025,"Zoom Video Comms — annual","-$149.90","$3,092.63",Credit Card,Zoom,Software
-TXN-2416,2/14/25,"Zoom Video Communications","-149.90","3092.63",Credit Card,zoom,Software
-TXN-2417,02/18/2025,"Costco Whse #421 — supplies","-$237.84","$2,854.79",Checking,Costco,Office Supplies
-TXN-2418,2025-02-18,COSTCO WHSE #421,-237.84,"2,854.79",Checking,costco,Office Supplies
-TXN-2419,02/22/2025,"Bank fee — int'l wire","-$45.00","$2,809.79",Checking,Bank Fee,Fees
-TXN-2420,02/24/2025,"Stripe Payout — invoice #1080","+$2,100.00","$4,909.79",Checking,Stripe,Income
-TXN-2421,02/28/2025," Refund — overcharge ","+$45.00","$4,954.79",Checking,—,Refunds
-TXN-2422,Feb 28 2025,REFUND OVERCHARGE,45.00,4954.79,Checking,N/A,Refunds
-TXN-2423,03/01/2025,"Office Lease — Suite 204","-$1,500.00","$3,454.79",Checking,Acme Realty,Rent
-TXN-2424,2025-03-03,"Slack Technologies — annual","-$840.00","$2,614.79",Credit Card,Slack,Software
-TXN-2425,03/05/2025,"Stripe Payout — invoice #1081","+$1,875.00","$4,489.79",Checking,Stripe,Income
-TXN-2426,03/08/2025,"Wire — Berlin office rent (EUR vendor)","-€1.450,00","$2,989.79",Checking,Mietverwaltung GmbH,Rent
-TXN-2427,03/10/2025,"London supplier invoice (GBP)","-£950.00","$1,939.79",Checking,Stationery Co Ltd,Office Supplies
-TXN-2428,03/12/2025,"São Paulo agency retainer","-R$ 1.299,90","$1,679.79",Credit Card,Estúdio Ágil,Software
-TXN-2429,03/14/2025,"VAT MOSS prep — multi-EU sales","($89.00)","$1,768.79",Checking,EU VAT Service,Fees
-TXN-2430,03/14/2025,"VAT MOSS prep multi EU sales",-89.00,"1,768.79",Checking,eu vat service,Fees
diff --git a/samples/demo/shopify_pet_customers.csv b/samples/demo/shopify_pet_customers.csv
deleted file mode 100644
index 58b1caf..0000000
--- a/samples/demo/shopify_pet_customers.csv
+++ /dev/null
@@ -1,21 +0,0 @@
-Customer ID,First Name,Last Name,Email,Phone,Address,City,State,ZIP,Country,Total Orders,Lifetime Value,Last Order Date,Tags
-SHOP-1001, Alice ,Johnson,alice@petshop.com,(415) 555-1234,"123 Main St., Apt 4B",San Francisco,CA,94102,US,12,$1,240.50,2025-12-04,VIP
-SHOP-1002,Bob,SMITH,Bob@PetShop.com,415.555.1234,"123 Main St, Apt 4B",San Francisco,CA,94102,US,12,"$1,240.50",N/A,VIP
-SHOP-1003,carlos,garcia,carlos@petshop.com,5559876543,"742 Evergreen Terrace",Springfield,IL,62704,US,5,420.00,12/15/2025,Wholesale
-SHOP-1004,Diana,Lee,diana@petshop.com,(555) 222-3344,"PO Box 12, Sherwood Forest",Nottingham,,NG1 5BA,GB,8,£890.25,2025-10-30,VIP|Wholesale
-SHOP-1005,EVE MARTINEZ,,eve.martinez@petshop.com,555-9988,"Calle Mayor 45","Madrid",,"28013",ES,3,€180,2025-09-15,
-SHOP-1006,Frank,Brown,frank@petshop.com,, ,"Berlin",BE,10115,DE,15,€2.410,75,(blank),Wholesale
-SHOP-1007,Grace,Davis,grace@petshop.com,+1 555-111-1111,"888 Maple Ave",Toronto,ON,M5V 3A8,CA,1,$49.99,#N/A,New
-SHOP-1008,henry,wilson,Henry@PetShop.com,5551111111,"888 Maple Avenue","Toronto",ON,M5V 3A8,CA,1,$49.99,2025-12-01,New
-SHOP-1009,Ivy,Chen,IVY@petshop.com,+1 (555) 777-7777,"550 Elm Street, Suite 200",Brooklyn,NY,11201,US,4,"$320.50 ",10/12/2025,
-SHOP-1010,Jack,Taylor,jack@petshop.com,(none),"550 elm street, suite 200",brooklyn,NY,11201,US,4,$320.50,2025-10-12,
-SHOP-1011,kate,o'neil,kate.oneil@petshop.com,415-555-2222,"99 King's Rd","London",,SW3 4LX,GB,7,£675.00,?,VIP
-SHOP-1012,luis,rodriguez,LUIS@petshop.com,+34 91 411 1111,"Avenida de la Paz 12, 3°D",Madrid,,28013,ES,2,"€89,99",unknown,
-SHOP-1013,Mia,Park,mia@petshop.com,02-9374-4000,"Sydney Opera House Drive","Sydney",NSW,2000,AU,9,"A$ 1,299.00",2025-11-20,Wholesale
-SHOP-1014,Noah,nguyen,noah@petshop.com,+81 3 3210 7000,"丸の内 2-7-3","Tokyo",,100-0005,JP,6,"¥75000",2025-12-10,VIP
-SHOP-1015,Olivia,Brown,OLIVIA@PETSHOP.COM,(555) 333-4444,"742 evergreen terrace",springfield,IL,62704,US,3,$180.00,(none),
-SHOP-1016,Pavel,Novak,pavel@petshop.com,+44 20 7946 1234,"22 Baker Street",London,,W1U 6AB,United Kingdom,4,£412.00,2025-11-18,VIP -SHOP-1017,Quinn,Murphy,quinn@petshop.com,+44 20 7946 5678,"5 Princes Street",Edinburgh,,EH2 2DA,U.K.,2,£189.50,2025-12-09, -SHOP-1018,Rachel,O'Brien,rachel@petshop.com,02-9374-9999,"100 George Street","Sydney",NSW,2000,UK,1,£75.00,?,New -SHOP-1019,Sam,Klein,sam@petshop.com,+49 30 99887766,"Friedrichstraße 100","Berlin",,10117,Germany,11,"€1.890,40",2025-12-11,VIP|Wholesale -SHOP-1020,Tara,Gianni,tara@petshop.com,+39 06 6982 4567,"Via del Corso 250",Roma,,00186,Italia,5,"€649,99",2025-12-03, diff --git a/samples/demo/shopify_pet_pipeline.json b/samples/demo/shopify_pet_pipeline.json deleted file mode 100644 index 8aa2ad2..0000000 --- a/samples/demo/shopify_pet_pipeline.json +++ /dev/null @@ -1,49 +0,0 @@ -{ - "steps": [ - { - "tool": "text_clean", - "options": {}, - "enabled": true, - "name": "1. Clean text (whitespace, smart quotes, NBSP, BOM)" - }, - { - "tool": "format_standardize", - "options": { - "column_types": { - "First Name": "name", - "Last Name": "name", - "Email": "email", - "Phone": "phone", - "Address": "address", - "Lifetime Value": "currency", - "Last Order Date": "date" - }, - "phone_country_column": "Country", - "address_country_column": "Country", - "currency_preserve_code": true, - "currency_decimal": "auto", - "email_gmail_canonical": false - }, - "enabled": true, - "name": "2. Standardize phones, addresses, dates, currencies, names" - }, - { - "tool": "missing", - "options": { - "strategy": "none", - "standardize_sentinels": true - }, - "enabled": true, - "name": "3. Standardize disguised nulls (N/A, -, (blank), ?, #N/A)" - }, - { - "tool": "dedup", - "options": { - "survivor_rule": "most_complete", - "merge": true - }, - "enabled": true, - "name": "4. 
Dedup customers (fuzzy match, merge missing fields)" - } - ] -} diff --git a/samples/demo/vendor_1099.csv b/samples/demo/vendor_1099.csv new file mode 100644 index 0000000..aa76e26 --- /dev/null +++ b/samples/demo/vendor_1099.csv @@ -0,0 +1,25 @@ +Vendor,Contact,Email,Phone,EIN,Address,Total_Paid +Acme Realty,Bob Stein,acme.ap@acmerealty.com,(212) 555-0100,12-3456789,(blank),"$12,400.00" +acme realty llc,Bob Stein, ACME.AP@AcmeRealty.com ,,—,"118 Canal St, New York, NY 10013","$8,250" +ACME REALTY,R. Stein,Acme.AP@acmerealty.com,212.555.0100,N/A,TBD,"1,999.99" +Bright Books Bookkeeping,Dana Cole,hello@brightbooks.com,,98-7654321,(blank),"$6,000.00" +bright books,Dana Cole,HELLO@brightbooks.com,(415) 555-0142,unknown,"50 Market St, San Francisco, CA 94105","$6,000" +"Bright Books, LLC",D. Cole, hello@BrightBooks.com,4155550142,98-7654321,unknown,"5,500.00" +Northwind Logistics,Sam Reyes,ap@northwindlog.com,(312) 555-0198,—,(blank),"$22,750.00" +northwind logistics inc,Sam Reyes,AP@NorthwindLog.com,,45-6789012,"900 W Loop, Chicago, IL 60607","$22,750" +Pearl Design Studio,“Jo” Marsh,billing@pearldesign.co,,33-2211000,(blank),"$3,200.00" +pearl design,Jo Marsh,Billing@PearlDesign.co,(206) 555-0167,TBD,"77 Pike St, Seattle, WA 98101","$3,200" +PEARL DESIGN STUDIO,J. Marsh, billing@pearldesign.co ,206.555.0167,33-2211000,unknown,"2,800.00" +Cooper Plumbing,Lee Cooper,office@cooperplumb.com,(617) 555-0133,—,(blank),"$1,450.00" +cooper plumbing co,Lee Cooper,OFFICE@cooperplumb.com,,TBD,"12 Beacon St, Boston, MA 02108","$1,450" +COOPER PLUMBING,L. 
Cooper, office@CooperPlumb.com,6175550133,N/A,unknown,900.00 +Vertex Marketing,Pat Nguyen,accounts@vertexmktg.com,(404) 555-0119,77-8899001,(blank),"$15,000.00" +vertex marketing group,Pat Nguyen,ACCOUNTS@VertexMktg.com,,unknown,"300 Peachtree St, Atlanta, GA 30308","$15,000" +Summit Consulting,Ray Brooks,invoices@summitconsult.net,,21-0099887,(blank),"$9,800.00" +summit consulting llc,Ray Brooks,INVOICES@summitconsult.net,(303) 555-0175,—,"1100 17th St, Denver, CO 80202","$9,800" +SUMMIT CONSULTING,R. Brooks, invoices@SummitConsult.net ,303.555.0175,21-0099887,TBD,"7,250.00" +Garcia Catering,Mia Garcia,ap@garciacatering.com,(305) 555-0188,—,(blank),"$4,600.00" +garcia catering services,Mia Garcia,AP@GarciaCatering.com,,66-1234509,"450 Ocean Dr, Miami, FL 33139",$600.00 +Northwind Logistics,S. Reyes, ap@northwindlog.com ,312.555.0198,45-6789012,TBD,"21,000.00" +VERTEX MARKETING,P. Nguyen, accounts@vertexmktg.com ,404.555.0119,77-8899001,TBD,"14,500.00" +GARCIA CATERING,M. Garcia,ap@GARCIACATERING.com,305.555.0188,66-1234509,unknown,"4,200.00" diff --git a/samples/demo/vendor_1099_pipeline.json b/samples/demo/vendor_1099_pipeline.json new file mode 100644 index 0000000..4fb77ac --- /dev/null +++ b/samples/demo/vendor_1099_pipeline.json @@ -0,0 +1,49 @@ +{ + "steps": [ + { + "tool": "text_clean", + "enabled": true, + "options": { + "trim": true, + "collapse_whitespace": true, + "fold_smart_chars": true, + "strip_zero_width": true + } + }, + { + "tool": "format_standardize", + "enabled": true, + "options": { + "column_types": { + "Phone": "phone", + "Email": "email", + "Total_Paid": "currency" + } + } + }, + { + "tool": "missing", + "enabled": true, + "options": { + "strategy": "none", + "standardize_sentinels": true, + "sentinels": ["—", "-", "--", "(blank)", "TBD", "unknown", "N/A", "#N/A", "(none)"] + } + }, + { + "tool": "dedup", + "enabled": true, + "options": { + "survivor_rule": "most_complete", + "merge": true, + "strategies": [ + { + "columns": [ + {"column": 
"Email", "algorithm": "exact", "threshold": 100, "normalizer": "email"} + ] + } + ] + } + } + ] +} diff --git a/samples/messy_text.csv b/samples/messy_text.csv deleted file mode 100644 index 95de1cd..0000000 --- a/samples/messy_text.csv +++ /dev/null @@ -1,13 +0,0 @@ -customer_name,email,vendor,memo -Alice Johnson,alice@example.com,ACME Corp ,Welcome aboard -Bob Smith,bob@example.com,ACME Corp,Returning customer -Charlie Brown,charlie@example.com,Globex,Net 30 -Diana Prince,diana​@example.com,Globex,VIP -Edward Norton,ed@example.com,“Best Pet Supplies”,Order#42 - rush -Frank Castle,frank@example.com,Stark—Industries,"Line 1 -Line 2 -Line 3" - grace HOPPER ,grace@example.com,Globex,Loves long memos… -Henry Ford,henry@example.com,Ford Motor,Industrial -Iris West,iris@example.com,S.T.A.R. Labs,Notewith-bell -Jane Doe,jane@example.com,Acme,Standard diff --git a/src/gui/app_demo.py b/src/gui/app_demo.py index 80d18cb..f91164f 100644 --- a/src/gui/app_demo.py +++ b/src/gui/app_demo.py @@ -9,10 +9,10 @@ side-by-side, and converts the visitor to a Gumroad purchase. 
Launch: streamlit run src/gui/app_demo.py -URL routing: - https://demo.datatools.app/?p=shopify-pet (Shopify operator) - https://demo.datatools.app/?p=bookkeeper (Bookkeeper) - https://demo.datatools.app/?p=revops (RevOps agency) +URL routing (all three personas serve one audience: accounting): + https://demo.datatools.app/?p=bookkeeper (Bookkeeper — bank reconciliation) + https://demo.datatools.app/?p=ap-1099 (Accounts payable — 1099 vendor prep) + https://demo.datatools.app/?p=ar-aging (Accounts receivable — open invoices) Free / paid boundary (per docs/DEMO-PLAN.md §6): - input rows capped at ``DEMO_ROW_CAP`` @@ -64,59 +64,66 @@ GUMROAD_BASE: str = "https://gumroad.com/l/datatools" DEMO_DIR = _project_root / "samples" / "demo" +# All three personas serve one audience — accounting — entering through the +# three workflows where messy exports cost real money: bank reconciliation, +# 1099 / AP vendor prep, and AR aging. Each H1/sub names the exact pain and +# the validated demo outcome (see docs/DEMO-PLAN.md §4 for the numbers). PERSONAS: dict[str, dict[str, Any]] = { - "shopify-pet": { - "label": "Shopify pet operator", - "icon": "🛍️", - "h1": "Klaviyo-import-ready customer lists. **In 30 seconds. Locally.**", - "sub": ( - "Your Shopify customer export has duplicates Excel can't catch, " - "international phones Excel can't parse, and disguised nulls " - "(`N/A`, `(blank)`, `?`) that break Klaviyo's import. " - "DataTools fixes all of it in one pass — and your data never " - "leaves your computer." - ), - "data_file": "shopify_pet_customers.csv", - "pipeline_file": "shopify_pet_pipeline.json", - "cta": "Get DataTools for Shopify — $49 →", - "landing": "https://datatools.app/shopify/", - }, "bookkeeper": { - "label": "Bookkeeper / freelance accountant", + "label": "Bookkeeper — bank reconciliation", "icon": "📒", - "h1": "Reconcile messy bank exports. **Hand your client an audit trail.**", + "h1": "Catch the transactions your bank export posted twice. 
**Locally.**", "sub": ( - "The Jan and Feb exports overlap; the same transaction posts twice. " - "Vendor names are *Amazon* / *amazon.com* / *AMAZON.COM*4F2X9* in " - "three rows. DataTools dedups on Date + Amount + fuzzy Vendor, " - "produces ISO dates and numeric amounts, and gives you a row-level " - "audit log to hand the client." + "When the Jan and Feb exports overlap, the same payment lands " + "twice — once as `01/15/2025 +$3,450.00`, once as " + "`2025-01-15 3450.00`. DataTools standardizes every date and " + "amount, then dedups on the *real* transaction so your " + "reconciliation ties out. In this sample: **26 rows → 20, six " + "phantom duplicates removed** — and your data never leaves your " + "computer." ), - "data_file": "bookkeeper_bank_reconcile.csv", - "pipeline_file": "bookkeeper_bank_pipeline.json", + "data_file": "bank_reconciliation.csv", + "pipeline_file": "bank_reconciliation_pipeline.json", "cta": "Get DataTools for Bookkeepers — $49 →", "landing": "https://datatools.app/bookkeeper/", }, - "revops": { - "label": "Marketing / RevOps agency", - "icon": "🪢", - "h1": "Dedupe lead lists across HubSpot, LinkedIn, and manual scrapes — **locally.**", + "ap-1099": { + "label": "Accounts payable — 1099 prep", + "icon": "🧾", + "h1": "Build a clean 1099 vendor list — **with the missing EINs filled in.**", "sub": ( - "The same prospect shows up in HubSpot as `alice@acme.com`, in " - "LinkedIn as `Alice.Johnson@acme.com`, and in your VA's manual " - "scrape as `alice@acme.com` again. Country is `USA` / `US` / " - "`United States`. DataTools fuzzy-matches across sources, " - "normalizes phones for 50+ countries, and merges survivors " - "with their most-complete fields — without uploading anything." + "The same vendor was entered three times across the year — one " + "record has the EIN, another the address, a third the phone. " + "DataTools consolidates each vendor to one row and *backfills the " + "gaps from the duplicates*. 
In this sample: **24 messy records → " + "8 complete vendors, with 7 missing EINs recovered** from the " + "duplicate rows. No upload, no VLOOKUP gymnastics." ), - "data_file": "agency_combined_leads.csv", - "pipeline_file": "agency_leads_pipeline.json", - "cta": "Get DataTools for RevOps — $49 →", - "landing": "https://datatools.app/revops/", + "data_file": "vendor_1099.csv", + "pipeline_file": "vendor_1099_pipeline.json", + "cta": "Get DataTools for Accounting — $49 →", + "landing": "https://datatools.app/accounting/", + }, + "ar-aging": { + "label": "Accounts receivable — open invoices", + "icon": "💵", + "h1": "Stop chasing the invoices your aging report counted twice. **Locally.**", + "sub": ( + "Double-entered invoices inflate your AR aging and your " + "follow-ups. DataTools standardizes invoice dates, due dates, and " + "amounts, lowercases client emails, then removes the duplicate " + "invoice numbers — backfilling any blank status from the twin row. " + "In this sample: **26 rows → 21, five phantom invoices off the " + "books** in one pass." + ), + "data_file": "ar_open_invoices.csv", + "pipeline_file": "ar_open_invoices_pipeline.json", + "cta": "Get DataTools for Accounting — $49 →", + "landing": "https://datatools.app/accounting/", }, } -DEFAULT_PERSONA = "shopify-pet" +DEFAULT_PERSONA = "bookkeeper" # --------------------------------------------------------------------------- diff --git a/tests/test_demo_pipelines.py b/tests/test_demo_pipelines.py new file mode 100644 index 0000000..1901f1f --- /dev/null +++ b/tests/test_demo_pipelines.py @@ -0,0 +1,71 @@ +"""Demo pipelines must keep showing value (accounting personas). + +Each persona's preloaded dataset + saved pipeline is the marketing surface +driven by ``src/gui/app_demo.py``. These tests pin that every demo loads, +runs clean, and produces its headline value (duplicate rows removed, clean +parse, disguised nulls caught) — so a stale dataset or an engine change can't +silently gut the sales demo. 
The read path mirrors ``app_demo._load_demo``
+exactly (``dtype=str, keep_default_na=False`` so every disguised null survives
+to the pipeline).
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pandas as pd
+import pytest
+
+from src.core.pipeline import Pipeline, run_pipeline
+
+_REPO = Path(__file__).resolve().parent.parent
+_DEMO = _REPO / "samples" / "demo"
+
+# (data_file, pipeline_file, min_duplicates_removed) — one per accounting
+# persona in app_demo.PERSONAS. The dup floors are the validated demo numbers.
+_DEMOS = [
+    ("bank_reconciliation.csv", "bank_reconciliation_pipeline.json", 6),
+    ("vendor_1099.csv", "vendor_1099_pipeline.json", 8),
+    ("ar_open_invoices.csv", "ar_open_invoices_pipeline.json", 5),
+]
+
+
+@pytest.mark.parametrize("data_file,pipeline_file,min_dupes", _DEMOS)
+def test_demo_runs_clean_and_shows_value(data_file, pipeline_file, min_dupes):
+    df = pd.read_csv(_DEMO / data_file, dtype=str, keep_default_na=False)
+    pipe = Pipeline.from_file(_DEMO / pipeline_file)
+    res = run_pipeline(df, pipe, stop_on_error=True)
+
+    # 1. Nothing errored — the demo never shows a visitor a red banner.
+    assert all(sr.error is None for sr in res.step_results), [
+        (sr.step.tool, sr.error) for sr in res.step_results
+    ]
+
+    # 2. Dedup removed the designed duplicate rows (the headline value).
+    assert res.final_rows < res.initial_rows
+    dedup = next(sr for sr in res.step_results if sr.step.tool == "dedup")
+    assert dedup.summary["duplicates_removed"] >= min_dupes
+
+    # 3. Standardization parsed every typed cell — a demo with unparseable
+    # cells reads as "the tool choked," which kills the pitch.
+    fmt = next(sr for sr in res.step_results if sr.step.tool == "format_standardize")
+    assert fmt.summary["cells_unparseable"] == 0
+    assert fmt.summary["cells_changed"] > 0
+
+    # 4. The disguised nulls (—, (blank), TBD, …) were caught.
+    miss = next(sr for sr in res.step_results if sr.step.tool == "missing")
+    assert miss.summary["sentinels_standardized"] > 0
+
+
+def test_app_demo_references_each_demo_file():
+    """Every data/pipeline file the demo app names must exist on disk.
+
+    Guards against a rename in app_demo.py drifting away from samples/demo/
+    (or vice versa) without a test catching it.
+    """
+    src = (_REPO / "src" / "gui" / "app_demo.py").read_text(encoding="utf-8")
+    for data_file, pipeline_file, _ in _DEMOS:
+        assert data_file in src, f"{data_file} not referenced in app_demo.py"
+        assert pipeline_file in src, f"{pipeline_file} not referenced in app_demo.py"
+        assert (_DEMO / data_file).exists(), f"missing {data_file}"
+        assert (_DEMO / pipeline_file).exists(), f"missing {pipeline_file}"
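Reviewer note: the tests pin the outcome of the `vendor_1099` demo (sentinels caught, duplicates merged, gaps backfilled) but the mechanism lives inside the engine. The sketch below is a rough, standard-library-only illustration of that idea, not the project's `Pipeline` implementation. The `consolidate` helper and the three-row `SAMPLE` excerpt are hypothetical. The sentinel list is copied from `vendor_1099_pipeline.json`, and the survivor rule is simplified to "first row seen" rather than `most_complete`.

```python
import csv
import io

# Hypothetical excerpt modelled on samples/demo/vendor_1099.csv (fewer columns).
SAMPLE = '''Vendor,Email,EIN,Address
Acme Realty,acme.ap@acmerealty.com,12-3456789,(blank)
acme realty llc, ACME.AP@AcmeRealty.com ,—,"118 Canal St, New York, NY 10013"
Cooper Plumbing,office@cooperplumb.com,—,(blank)
'''

# Sentinel values taken from vendor_1099_pipeline.json, plus the empty string.
SENTINELS = {"", "—", "-", "--", "(blank)", "TBD", "unknown", "N/A", "#N/A", "(none)"}


def consolidate(rows):
    """Merge rows that share a normalized email and backfill missing cells.

    Assumption: every row carries an email. Simplification: the first row
    seen for an email is the survivor; later duplicates only fill its gaps.
    """
    merged = {}
    for row in rows:
        # Trim each cell and map disguised nulls to None.
        clean = {}
        for key, value in row.items():
            value = value.strip()
            clean[key] = None if value in SENTINELS else value
        # Lowercase the email so case variants land in the same group.
        email = (clean["Email"] or "").lower()
        clean["Email"] = email or None
        survivor = merged.setdefault(email, clean)
        if survivor is not clean:
            # Backfill only the cells the survivor is still missing.
            for key, value in clean.items():
                if survivor[key] is None:
                    survivor[key] = value
    return list(merged.values())


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
vendors = consolidate(rows)
for vendor in vendors:
    print(vendor)
```

In this excerpt the two Acme rows collapse into one record that keeps the EIN from the first row and picks up the address from the second, while the Cooper row keeps a missing EIN because no duplicate supplies one. That matches the kind of behaviour the demo copy describes, under the stated simplifications.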