feat: 3 new tools, format streaming, distribution-ready demo + landing pages
Tools shipped this batch (3 → 6 of 9 Ready):
04 Missing Value Handler src/core/missing.py + cli_missing.py + GUI
05 Column Mapper src/core/column_mapper.py + cli_column_map.py + GUI
09 Pipeline Runner src/core/pipeline.py + cli_pipeline.py + GUI
with soft tool-dependency graph (recommended,
not enforced) and JSON save/load for repeatable
weekly cleanups.
Format Standardizer reworked for 1 GB international files:
• Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
• Per-row country / address columns drive parsing
• Audit cap (default 10 k rows, ~50 MB RAM)
• standardize_file(): chunked streaming entry point (~165 k rows/sec)
• currency_decimal="auto" for EU comma-decimal locales
• R$ / kr / zł multi-char currency prefixes
• cli_format.py with auto-stream above 100 MB inputs
Encoding detection arbiter + language-aware probe:
Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.
Distribution-readiness assets:
• streamlit_app.py — Streamlit Community Cloud entry shim
• src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
100-row cap + watermark, free-vs-paid boundary enforced at surface
• samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
• landing/ — 4 static HTML pages (apex chooser + 3 niche),
shared CSS, deploy.py URL-substitution script,
auto-generated robots.txt + sitemap.xml + 404.html + favicon
• docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
— full strategy + measurement + deployment + master checklist
Test counts:
before: 1,520 passed · 4 skipped · 17 xfailed
after: 1,729 passed · 0 skipped · 0 xfailed
Tier-1 corpora added:
• missing-corpus 3 use cases + 16 edge cases
• column-mapper-corpus 3 use cases + 5 edge cases
• format-cleaner intl 20-row 13-country stress fixture
Engine hardening flushed out by the corpora:
• interpolate guards against object-dtype columns
• mean/median skip all-NaN columns (silences numpy warning)
• fillna runs under future.no_silent_downcasting (silences pandas warning)
• mojibake test no longer skips when ftfy installed (monkeypatch path)
• drop-row threshold semantics: strict-greater (consistent across rows / cols)
• currency_decimal validator allow-set updated for "auto"
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4
.gitignore
vendored
@@ -10,3 +10,7 @@ build/

 # Claude Code agent worktrees + local settings
 .claude/
+
+# Landing-page deploy outputs and operator config (real URLs, not committed)
+landing/dist/
+landing/deploy.config.json
12
README.md
@@ -9,12 +9,12 @@ Local CSV / Excel cleaning. CLI + browser GUI, no cloud, no install ceremony.
 | 01 | **Deduplicator** — exact + fuzzy match, 5 normalizers, survivor rules, audit | Ready |
 | 02 | **Text Cleaner** — whitespace, smart chars, BOM, line endings, case ops | Ready |
 | 03 | **Format Standardizer** — dates, phones, emails, addresses, names, currencies, booleans | Ready |
-| 04 | Missing Value Handler | Coming Soon |
-| 05 | Column Mapper | Coming Soon |
+| 04 | **Missing Value Handler** — disguised-null detection, profile, mean/median/mode/ffill/bfill/interpolate, drop strategies | Ready |
+| 05 | **Column Mapper** — fuzzy auto-rename, target schema with type coercion, required fields with defaults, drop/reorder | Ready |
 | 06 | Outlier Detector | Coming Soon |
 | 07 | Multi-File Merger | Coming Soon |
 | 08 | Validator & Reporter | Coming Soon |
-| 09 | Pipeline Runner | Coming Soon |
+| 09 | **Pipeline Runner** — chain tools with recommended (not forced) order, save/load JSON, automate weekly cleanups | Ready |

 ## Install

@@ -31,10 +31,14 @@ Python 3.10+ required.
 streamlit run src/gui/app.py
 ```

-**CLI** — three entry points:
+**CLI** — seven entry points:
 ```bash
 python -m src.cli customers.csv [--apply] # dedup
 python -m src.cli_text_clean messy.csv [--apply] # text clean
+python -m src.cli_format intl.csv [--apply] # format standardize (auto-streams >100 MB)
+python -m src.cli_missing holes.csv [--apply] # missing values
+python -m src.cli_column_map vendor.csv [--apply] # column mapper
+python -m src.cli_pipeline any_file.csv [--apply] # chain tools end-to-end
 python -m src.cli_analyze any_file.csv [--json] # scan only
 ```

332
docs/DEMO-PLAN.md
Normal file
@@ -0,0 +1,332 @@
# Demo Plan — DataTools

> Creator-only. Implements PLAN.md §2.2 (the demo IS the product) and
> §2.3 (niche down — three landing pages, one engine).
> **Version**: 1.0 · **Adopted**: 2026-05-01 · **Owner**: Michael

The hosted demo is the single highest-leverage marketing asset in the
plan. This document defines exactly what loads, in what order, with
what data, for which buyer — so the operator builds it once and never
rebuilds it from a stale headline.

## 1. Goals

- Convert a cold visitor to a paid buyer in **under three minutes** of
  active interaction.
- Demonstrate the *full pipeline* (not one tool) on a dataset that
  *looks like the visitor's own work* — not a toy CSV.
- Survive zero attention to maintenance — once running, the demo
  should keep working as the engine evolves (the pre-saved pipeline
  JSONs use the same code path the paid product uses).
- Provide a shareable artifact for niche-community posts (a public URL
  the operator can drop into a subreddit reply with one sentence).

## 2. Constraints (non-negotiable)

| Constraint | Source | Implication |
|---|---|---|
| Free hosting at launch | BUSINESS.md §9 | Streamlit Community Cloud (1 GB RAM, sleeps after 7 days idle) |
| No login | BUSINESS.md §7 | No email gate, no signup wall, no "create account to continue" |
| Async / no-touch | DECISIONS.md §1 #8 | Cannot offer "schedule a demo with us" CTA |
| Runs locally on paid product | BUSINESS.md §11 | Demo can't expose the same engine to abuse — needs row caps |
| Friction kills conversion | BUSINESS.md §7 | Demo dataset preloaded; no "select a file" first-step |
| < $1,200/mo recurring | BUSINESS.md §9 | Migration plan to $5–10/mo VPS only after rate-limit signal |

## 3. The three personas (per PLAN.md §2.3)

| Tag | Persona | Top-of-funnel keyword | Demo dataset | Pre-saved pipeline |
|---|---|---|---|---|
| `shopify-pet` | Shopify operator (priority: pet supplies) | "shopify customer cleanup" | `samples/demo/shopify_pet_customers.csv` | `shopify_pet_pipeline.json` |
| `bookkeeper` | Bookkeeper / freelance accountant | "reconcile bank export csv" | `samples/demo/bookkeeper_bank_reconcile.csv` | `bookkeeper_bank_pipeline.json` |
| `revops` | Marketing / RevOps agency | "dedupe lead list across vendors" | `samples/demo/agency_combined_leads.csv` | `agency_leads_pipeline.json` |

Each persona gets its **own landing page URL**, its **own demo dataset
loaded by default**, and its **own H1 + below-the-fold copy.** The
engine is identical; only positioning differs.

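
The persona table above amounts to a lookup from the `?p=` tag to a dataset and a saved pipeline. A minimal sketch of that routing follows. The `Persona` class, the `PERSONAS` dict and the `resolve_persona` helper are hypothetical names, not taken from `app_demo.py`. The fallback to `shopify-pet` for unknown tags and the location of the pipeline JSONs next to the CSVs are also assumptions.

```python
from dataclasses import dataclass
from typing import Mapping

DEMO_DIR = "samples/demo"  # assumption: pipeline JSONs sit beside the CSVs


@dataclass(frozen=True)
class Persona:
    tag: str
    dataset: str
    pipeline: str


def _persona(tag: str, csv_name: str, json_name: str) -> Persona:
    return Persona(tag, f"{DEMO_DIR}/{csv_name}", f"{DEMO_DIR}/{json_name}")


# File names copied from the persona table; the structure is illustrative.
PERSONAS = {
    p.tag: p
    for p in (
        _persona("shopify-pet", "shopify_pet_customers.csv", "shopify_pet_pipeline.json"),
        _persona("bookkeeper", "bookkeeper_bank_reconcile.csv", "bookkeeper_bank_pipeline.json"),
        _persona("revops", "agency_combined_leads.csv", "agency_leads_pipeline.json"),
    )
}
DEFAULT_TAG = "shopify-pet"  # assumption: unknown or missing tags fall back here


def resolve_persona(query: Mapping[str, str]) -> Persona:
    """Pick the persona for a query-string mapping such as {"p": "revops"}."""
    tag = str(query.get("p", "")).strip().lower()
    return PERSONAS.get(tag, PERSONAS[DEFAULT_TAG])
```

In a Streamlit page the mapping passed in would typically be `st.query_params`, which is dict-like in recent Streamlit releases. Keeping the lookup a pure function means it can be tested without a running app.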
## 4. Demo dataset specifications

Each dataset is intentionally small (20–30 rows) so the full pipeline
runs in well under one second on Streamlit Community Cloud's free
hardware. Each one is a *plausible-looking* export from that
persona's tooling. Each contains every kind of pollution the bundle's
five tools fix, so a single demo run shows every tool earning its
keep.

### 4.0 Pain-point coverage map

Each demo dataset is engineered so the buyer sees their **own top
pain** demonstrated in the AFTER preview. The mapping below pairs
each pain from PLAN.md §2.3a with the rows / columns that exercise
it. Refresh the dataset only when this coverage drops.

| Persona | Pain (from PLAN §2.3a) | Demo coverage |
|---|---|---|
| Shopify pet | S1 — Klaviyo per-contact dupes | 5 dup pairs across rows 1–15 (case + format + address-twin variants) |
| Shopify pet | S2 — feed-rejection chars | smart-quote / NBSP / BOM in rows 1–6, 9, 11 |
| Shopify pet | S3 — multi-channel | partner-style customer IDs (`SHOP-`); demonstration of column-level mapping covered in RevOps demo |
| Shopify pet | S4 — subscription identity | rows 1+2, 7+8, 9+10 — same person, different format |
| Shopify pet | S5 — VAT-MOSS country drift | rows 16–18 (`United Kingdom` / `U.K.` / `UK`) + rows 19–20 (`Germany`/`Italia`) |
| Bookkeeper | B1 — month-overlap re-import | 7 dup pairs spanning Jan↔Feb and Mar boundaries |
| Bookkeeper | B2 — 1099 vendor consolidation | Amazon × 3 spellings, Verizon × 2, Acme Realty × 2, Adobe × 2, Costco × 2, Zoom × 2, Stripe × 4 |
| Bookkeeper | B3 — audit trail | every cell change in the run logged with old/new/rule — surface in the demo's audit tab |
| Bookkeeper | B4 — per-license economics | demonstrated by pricing copy, not data |
| Bookkeeper | B5 — multi-currency | rows 26 (EUR), 27 (GBP), 28 (BRL with comma decimal), 29 (parens-negative) |
| RevOps | R1 — per-contact tier | 6 cross-source dup pairs (HubSpot × LinkedIn × Manual Scrape) |
| RevOps | R2 — deliverability | rows 26–27 (`uma at uniform dot com`, `victor@@victorco.com` invalid emails) |
| RevOps | R3 — GDPR / privacy | demonstrated by the network-tab moat panel + zero-upload claim |
| RevOps | R4 — vendor unification | 3 source values (HubSpot / LinkedIn / Manual Scrape), 13 country codes, mixed-shape headers |
| RevOps | R5 — suppression list | rows 29–30 (`Suppressed`, `Opted Out` tags) |

### 4.1 `shopify_pet_customers.csv` (20 rows)

**Looks like**: a Shopify customer export filtered for "Pet Supplies"
sales channel, 12 months activity.

**Pollution included**:
- Whitespace padding (" Alice ", "Sydney Opera House Drive ")
- Mixed phone formats: `(415) 555-1234`, `415.555.1234`, `5559876543`,
  `+1 555-111-1111`
- International phones: GB, ES, DE, AU, JP (15 demo rows span 6
  countries)
- Currency variants: `$1,240.50`, `£890.25`, `€2.410,75` (EU comma
  decimal), `A$ 1,299.00`, `¥75000`
- Date formats: `2025-12-04`, `12/15/2025`, `?`, `(blank)`, `(none)`,
  `#N/A`
- Disguised nulls: `N/A`, blank, `(blank)`, `?`, `#N/A`, `(none)`,
  `unknown`
- Name casing: `EVE MARTINEZ`, `henry`, `O'NEIL`, `noah`, mixed Title /
  ALL CAPS / lower
- Email case variants that *should* dedup: `Bob@PetShop.com` vs
  `alice@petshop.com`
- 4 fuzzy duplicates (Alice/Bob same address, Grace/Henry same phone,
  Carlos/Olivia same address, Ivy/Jack same address)

**After running the pipeline**: 20 rows → 15, ~29 cells canonicalized,
~45 sentinels standardised, 5 cross-row duplicates merged. The
customer table is now Klaviyo-import-ready and the country column
(previously `UK` / `U.K.` / `United Kingdom` / `Germany` / `Italia`)
is GB / DE / IT — VAT MOSS report won't break.

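
The disguised-null bullet above can be made concrete with a small lookup. This is a sketch only: the sentinel spellings come from the list in this section, while the function names, the case-insensitive matching and the whitespace stripping are assumptions, and `src/core/missing.py` may detect these values differently.

```python
from typing import Iterable, Optional

# Spellings taken from the pollution list above, stored lowercase.
DISGUISED_NULLS = frozenset({"", "n/a", "(blank)", "?", "#n/a", "(none)", "unknown"})


def normalize_null(value: Optional[str]) -> Optional[str]:
    """Return None for a disguised null, otherwise the value unchanged."""
    if value is None:
        return None
    # Assumption: padding and case do not matter when matching sentinels.
    return None if value.strip().lower() in DISGUISED_NULLS else value


def count_sentinels(cells: Iterable[Optional[str]]) -> int:
    """Count cells holding a non-empty sentinel (truly blank cells excluded)."""
    return sum(
        1
        for cell in cells
        if cell is not None and cell.strip() and normalize_null(cell) is None
    )
```

A count like this is the kind of figure the AFTER block reports as "sentinels resolved", although the exact counting rule the engine uses is not specified here.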
### 4.2 `bookkeeper_bank_reconcile.csv` (30 rows)

**Looks like**: two months of business checking + credit-card activity
exported from a bank portal, with the Feb export accidentally
overlapping the Jan export at the month boundary.

**Pollution included**:
- Mixed date formats: `01/15/2025`, `2025-01-15`, `Jan 18 2025`,
  `1/27/25`, `Feb 5 2025`
- Currency formats: `-$129.99`, `($89.50)` parens-negative,
  `+$3,450.00`, `- $599.88` space, bare `-129.99`, `(50.00)`
- Header trailing whitespace: `"Date "`
- Smart quotes around descriptions: `"autopay"`
- Em-dash sentinels in Vendor: `—`
- Smart-em-dash inside descriptions: `STAPLES #4422 — paper, toner`
- Vendor casing inconsistency: `Amazon` / `amazon.com` / `AMAZON.COM`,
  `Verizon` / `verizon`
- 7 duplicate transactions (same date+amount+vendor recorded twice
  with different formats)

**After running the pipeline**: 30 rows → 23, ~84 cells normalized, 7
duplicates removed (month-overlap + VAT-MOSS dups). All dates
ISO-formatted, all amounts numeric (including EUR/GBP/BRL with comma
decimal), vendor casing canonical, parens-negative resolved.

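
The currency formats listed above (leading minus, parens-negative, thousands separators, EU comma decimal) can all be reduced to one signed number. The sketch below is one possible heuristic, not the Format Standardizer's implementation. The function name `parse_amount` is hypothetical, and the rule "a trailing comma followed by one or two digits is a decimal comma" is an assumption that will misread some inputs, for example a bare `1.234` meant as one thousand two hundred thirty-four.

```python
import re
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Parse one currency cell into a signed Decimal (illustrative only)."""
    s = text.strip()
    # Accounting style: a value wrapped in parentheses is negative.
    negative = s.startswith("(") and s.endswith(")")
    first_digit = re.search(r"\d", s)
    if first_digit is None:
        raise ValueError(f"no digits in {text!r}")
    # A minus sign anywhere before the first digit also marks a negative.
    if "-" in s[: first_digit.start()]:
        negative = True
    # Drop currency symbols, spaces, signs and parentheses.
    digits = re.sub(r"[^\d.,]", "", s)
    last_dot, last_comma = digits.rfind("."), digits.rfind(",")
    if last_comma > last_dot and len(digits) - last_comma - 1 in (1, 2):
        # Rightmost separator is a comma followed by 1-2 digits: decimal comma.
        digits = digits.replace(".", "").replace(",", ".")
    else:
        # Commas are thousands separators.
        digits = digits.replace(",", "")
        if digits.count(".") > 1:
            # Several dots can only be thousands separators.
            digits = digits.replace(".", "")
    value = Decimal(digits)
    return -value if negative else value
```

`Decimal` is used rather than `float` so that amounts such as `129.99` survive a round trip without binary rounding noise, which matters for reconciliation.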
### 4.3 `agency_combined_leads.csv` (30 rows)

**Looks like**: a marketing-ops worksheet combining lead exports from
HubSpot + LinkedIn Sales Navigator + manual scraping, ready for
campaign targeting.

**Pollution included**:
- Phone formats per region: US, UK, Spain, Germany, China, India,
  Australia, Mexico, Israel, Singapore, Hong Kong, Italy, South
  Korea — 13 country codes
- Country column inconsistent: `USA` / `US` / `United States`
- Disguised nulls: `N/A`, `unknown`, `(unknown)`, `(blank)`, `(none)`,
  `?`, `—`, `#N/A`, `TBD`
- Source column tags origin (`HubSpot` / `LinkedIn` / `Manual Scrape`)
- Email duplicates across sources with case variants: `alice@acme.com`
  + `Alice.Johnson@acme.com`, `bob@beta.com` + `Bob@Beta.com`,
  `diana@delta.com` from two sources, `carlos@gamma.io` from two
  sources, `Frank@Foxtrot.de` + `frank@foxtrot.de`
- Name casing: `DIANA LEE`, `henry`, `IVY CHEN`, mixed
- 6 fuzzy / cross-source duplicates designed to stress the dedup step
- Score column with sentinel pollution that needs coercion to integer

**After running the pipeline**: 30 rows → 24, ~43 cells canonicalized,
14 sentinels resolved, 6 cross-source duplicates merged with `merge=true`
so each survivor inherits the most-complete picture. Invalid-email
rows (deliverability stress) and `Suppressed`/`Opted Out` tags
(suppression-list use case) survive as flagged rows the operator
manually reviews.

## 5. UX flow (per persona)

The demo is a single Streamlit page (likely
`src/gui/pages/0_Review.py` repurposed for demo mode, or a
dedicated `app_demo.py` for the cloud build).

```
┌──────────────────────────────────────────────────────────┐
│  DataTools — for {Persona}                               │
│  "{Persona-specific H1}"                                 │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  Sample dataset preloaded: shopify_pet_customers.csv     │
│  [Replace with your own file (capped 100 rows)]          │
│                                                          │
│  ┌─ BEFORE preview (15 rows) ───────────────────────┐    │
│  │ Alice | (415) 555-1234 | $1,240.50 | …           │    │
│  │ Bob   | 415.555.1234   | $1,240.50 | …           │    │
│  │ ...                                              │    │
│  └──────────────────────────────────────────────────┘    │
│                                                          │
│  Pipeline (saved):                                       │
│    1. Text Clean → 2. Format Standardize →               │
│    3. Missing → 4. Deduplicate                           │
│                                                          │
│  [▶ Run pipeline]                                        │
│                                                          │
│  ┌─ AFTER preview ──────────────────────────────────┐    │
│  │ 15 rows → 11 (4 duplicates merged)               │    │
│  │ 27 cells canonicalized · 33 sentinels resolved   │    │
│  │                                                  │    │
│  │ Alice Johnson | +14155551234 | 1240.50 | …       │    │
│  │ ...                                              │    │
│  └──────────────────────────────────────────────────┘    │
│                                                          │
│  [Download cleaned CSV (sample, watermarked)]            │
│                                                          │
│  ┌──────────────────────────────────────────────────┐    │
│  │ Like what you see?                               │    │
│  │ Run this on YOUR 50,000-row export — locally.    │    │
│  │ No upload. Your data never leaves your machine.  │    │
│  │ [Get DataTools — $49 →]                          │    │
│  └──────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────┘
```

**Critical UX points**:
- Sample dataset is *already loaded* on page paint. Visitor never
  sees an empty state.
- BEFORE table is shown side-by-side with AFTER once the run
  completes. Hidden-character toggle on by default so the visitor
  *sees* what was hidden in their data.
- "Replace with your own file" is a secondary action below the BEFORE
  table — not the headline.
- Per-step metrics are shown in the AFTER block: "27 cells
  canonicalized, 33 sentinels resolved, 4 duplicates merged." Numbers
  sell more than narrative.
- Buy button is **inside** the AFTER block and **above the fold** when
  the run completes. Friction kills.

## 6. Free vs paid boundary

The demo runs the **same code** as the paid product. Caps are surface,
not engine.

| Limit | Free demo | Paid (downloaded) |
|---|---|---|
| Input rows | 100 | unlimited (1 GB+ via streaming) |
| File size | 5 MB | unlimited |
| Output | watermarked CSV ("DataTools demo — buy at <url>" appended as last row) | clean CSV |
| Pipeline editor | locked to the persona-saved pipeline | full edit / save / load JSON |
| Save pipeline JSON | disabled | enabled |
| International | enabled | enabled |
| Audit log download | disabled | enabled |
| Tools 06–08 | as they ship | as they ship |

The watermark is a **single trailing row**, not an in-cell tag — so
the demo's AFTER preview *visibly* reads as production-quality data,
not "demo crippled" data.

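
"Caps are surface, not engine" can be read as: truncate the input before the engine sees it, and append the watermark after the engine is done. A minimal sketch of both steps follows, using only the standard library. The helper names are hypothetical, and padding the watermark row with empty cells so the CSV stays rectangular is an assumption rather than documented behaviour of `app_demo.py`.

```python
import csv
import io
from typing import List, Sequence

DEMO_ROW_CAP = 100  # "Input rows" limit from the table above


def cap_rows(rows: Sequence[List[str]], cap: int = DEMO_ROW_CAP) -> List[List[str]]:
    """Keep only the first `cap` data rows; the engine itself is untouched."""
    return list(rows[:cap])


def to_watermarked_csv(
    header: Sequence[str], rows: Sequence[Sequence[str]], watermark: str
) -> str:
    """Serialise rows to CSV text and append one trailing watermark row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    # Assumption: watermark text in the first column, remaining cells empty,
    # so no data cell is modified and the file keeps a constant column count.
    writer.writerow([watermark] + [""] * (len(header) - 1))
    return buffer.getvalue()
```

Because the data rows pass through unchanged, the paid build can produce the clean CSV by simply skipping the final `writerow` call.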
## 7. CTA copy (per persona)

### 7.1 Shopify pet operator

- **H1**: *Clean your customer / vendor / subscriber exports — locally.*
- **Sub**: *Klaviyo-import-ready in 30 seconds. Catches duplicates Excel
  misses. Your data never leaves your computer.*
- **CTA**: *Get DataTools for Shopify — $49 →*

### 7.2 Bookkeeper / freelance accountant

- **H1**: *Reconcile messy bank exports. Hand your client an audit
  trail.*
- **Sub**: *Catches the duplicate transaction QuickBooks imported twice.
  Standardizes dates, amounts, vendor casing. Every change auditable.*
- **CTA**: *Get DataTools for Bookkeepers — $49 →*

### 7.3 Marketing / RevOps agency

- **H1**: *Dedupe leads across HubSpot, LinkedIn, and manual scrapes.*
- **Sub**: *International phones, country normalization, fuzzy dedup
  with merge — one tool, one schema, no upload.*
- **CTA**: *Get DataTools for RevOps — $49 →*

## 8. Telemetry / conversion tracking

Async + no-touch + free hosting limits what we can instrument. Use
event-only counters, no PII:

| Event | Source | Aggregate-only field |
|---|---|---|
| `demo.page_view` | landing page | persona tag |
| `demo.run_clicked` | demo page | persona tag |
| `demo.run_completed` | demo page | persona tag, rows_processed |
| `demo.cta_clicked` | demo page | persona tag |
| `gumroad.purchase` | Gumroad webhook | landing-page-source query param (`?from=shopify-pet`) |

Conversion = `cta_clicked / run_completed`. Demo-quality issue surfaces
when `run_completed / page_view` < 30 % (visitors not engaging).

Self-host counters on Cloudflare Pages (free, GDPR-friendly). No
Google Analytics — adds privacy banner, conflicts with the "your data
never leaves your computer" message.

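
The two ratios defined above are simple to compute from per-persona event counts. The sketch below assumes the counts arrive as a plain mapping keyed by the event names in the table. The function name and the returned keys are hypothetical, and returning `None` when a denominator is zero is a design assumption.

```python
from typing import Dict, Mapping, Optional

ENGAGEMENT_FLOOR = 0.30  # run_completed / page_view threshold from the text


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Divide, returning None instead of raising when there is no traffic."""
    return numerator / denominator if denominator else None


def funnel_metrics(counts: Mapping[str, int]) -> Dict[str, object]:
    """Compute engagement and conversion for one persona's event counts."""
    views = counts.get("demo.page_view", 0)
    completed = counts.get("demo.run_completed", 0)
    cta = counts.get("demo.cta_clicked", 0)
    engagement = _ratio(completed, views)
    conversion = _ratio(cta, completed)
    return {
        "engagement": engagement,  # run_completed / page_view
        "conversion": conversion,  # cta_clicked / run_completed
        # Flag only when there is enough data to compute the ratio at all.
        "engagement_alert": engagement is not None and engagement < ENGAGEMENT_FLOOR,
    }
```

For example, 200 page views, 50 completed runs and 5 CTA clicks give an engagement of 0.25, which is below the 30 % floor, and a conversion of 0.10.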
## 9. Maintenance plan

**Recurring**: zero. The demo runs on the same engine the paid
product ships, so any improvement to the engine improves the demo
automatically. The pre-saved pipeline JSONs reference column names
and tool names, both stable APIs.

**Triggers for revisit**:

| Trigger | Action |
|---|---|
| Streamlit Community Cloud rate-limits / sleeps too aggressively | Migrate to a $5–10/mo VPS (BUSINESS.md §9 contingency) |
| Demo dataset becomes stale (e.g. all phones standardize to no-op) | Refresh with a new pollution batch — *don't change the persona* |
| `run_completed / page_view < 30 %` for 4 consecutive weeks | Audit the demo: is the BEFORE preview showing the mess clearly? Is the AFTER too small to notice? |
| `cta_clicked / run_completed < 5 %` for 4 consecutive weeks | The demo is impressive but the CTA isn't earning trust — revise copy + add a screenshot of the network tab showing zero outbound calls (PLAN.md §2.4) |
| New tool ships (06–08) | Decide *per persona* whether to add it to that persona's saved pipeline. Not all tools belong on all personas |

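
The two ratio triggers in the table share one rule: the metric has stayed below its threshold for four weeks in a row. A small sketch of that check is below. The function name is hypothetical, and reading "4 consecutive weeks" as "the four most recent weekly values" is an assumption.

```python
from typing import Sequence


def below_for_consecutive_weeks(
    weekly_ratios: Sequence[float], threshold: float, weeks: int = 4
) -> bool:
    """True when the last `weeks` weekly values are all under `threshold`."""
    if len(weekly_ratios) < weeks:
        return False  # not enough history to fire the trigger
    return all(ratio < threshold for ratio in weekly_ratios[-weeks:])
```

Called with `threshold=0.30` on the weekly `run_completed / page_view` series, or with `threshold=0.05` on `cta_clicked / run_completed`, it covers both rows.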
## 10. Build sequence (drops into PLAN.md week 2)

| Day | Action |
|---|---|
| 1 | Demo build of Streamlit app: 3 personas, switch via query param `?p=shopify-pet` |
| 2 | Pipeline JSONs wired in; row cap + watermark applied; download button |
| 3 | Deploy to Streamlit Community Cloud · 3 sub-paths or 3 separate apps |
| 4 | Persona landing pages: 3 static HTML pages on Cloudflare Pages, each with iframe embed of its persona demo + CTA |
| 5 | Telemetry counters wired (Cloudflare event API) · Gumroad webhook captures `?from=` |

End of day 5: three URLs the operator can drop into three different
niche-community threads, each performing its own conversion math.

## 11. Anti-temptations (things the demo deliberately refuses)

- **No "try it on your data first" gate that requires email.** The
  whole point is friction-free.
- **No "schedule a demo" CTA.** Locked by no-touch.
- **No live chat widget.** Same.
- **No A/B-test framework yet.** Single-arm copy, ship it, iterate
  monthly. A/B requires statistical traffic the funnel doesn't have
  pre-PMF.
- **No watermark inside cells.** The AFTER preview must look
  production-quality. Watermark goes on a single trailing row that's
  obviously the demo signature.
- **No animation / loader theatrics.** Pipeline runs in <1 s; a
  fake-progress bar lies about speed.
236
docs/DEPLOYMENT.md
Normal file
@@ -0,0 +1,236 @@
# Deployment — demo + landing pages

> One page. Two services. ~30 minutes from "code complete" to
> "URL the user can hit." Every step here is from-scratch reproducible
> on a clean laptop.
> **Version**: 1.0 · **Adopted**: 2026-05-01

This doc covers the **two distribution surfaces** that ship to public
URLs: the Streamlit demo (the iframe target) and the Cloudflare Pages
landing pages (the marketing surface that embeds it).

The *paid* product — PyInstaller installers, code-signing, Gumroad
listing — is covered in `docs/NEXT-STEPS.md`.

---

## Part 1 · Deploy the demo (Streamlit Community Cloud — free)

### A. Pre-flight (one-time, ~2 min)

You need a free [Streamlit Community Cloud](https://streamlit.io/cloud)
account. Sign in with the GitHub account that hosts this repo.

### B. Deploy (~5 min, mostly waiting for the Cloud build)

1. **Push the repo to GitHub** (private or public — both work). The
   important files are at the **repo root**:

   - `streamlit_app.py` — Cloud auto-detects this; nothing to configure
   - `requirements.txt` — Cloud installs from this
   - `.streamlit/config.toml` — Cloud honours this
   - `samples/demo/*.csv` + `*_pipeline.json` — the demo's data
   - `src/` — the engine

2. In Streamlit Community Cloud → **New app**:
   - Repository: your fork
   - Branch: `main`
   - Main file path: `streamlit_app.py` (the default — leave it)
   - App URL: `datatools-demo` (or any free subdomain)
   - **Deploy**

3. First build is 2–3 min while Cloud installs `pandas`, `phonenumbers`,
   `rapidfuzz`, etc. Subsequent deploys are < 30 s.

### C. Verify

Open the deployed URL. Append `?p=shopify-pet` to the URL bar —
the persona-specific demo loads. Try `?p=bookkeeper` and
`?p=revops` to confirm all three personas route correctly. Click
**Run pipeline**; the AFTER preview should appear within ~1 second.

### D. The output URL

The deployed URL is what feeds into `landing/deploy.config.json` →
`demo_base_url`. Without trailing slash. For example:

    https://datatools-demo.streamlit.app

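
The "without trailing slash" rule exists because the persona path and query are appended to `demo_base_url`. How `landing/deploy.py` performs the substitution is not shown in this document, so the helper below is a hypothetical illustration of the idea: normalise the base once, then append `?p=<persona>`.

```python
from urllib.parse import urlencode


def persona_demo_url(demo_base_url: str, persona: str) -> str:
    """Build the iframe target for one persona from the configured base URL."""
    # Tolerate a stray trailing slash so a config typo cannot produce "//?p=".
    base = demo_base_url.rstrip("/")
    return f"{base}/?{urlencode({'p': persona})}"
```

With the example value above, `persona_demo_url("https://datatools-demo.streamlit.app", "bookkeeper")` yields the same URL you would type by hand in section C.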
### E. Migration trigger

Per `BUSINESS.md` §9 / `DEMO-PLAN.md` §9, migrate to a $5–10/mo VPS
when:

- Streamlit Community Cloud rate-limits / sleeps too aggressively, OR
- the demo crosses ~5 k page-views/month (free-tier capacity)

The migration is one command if you containerise:
`docker run -p 8501:8501 -v $(pwd):/app python:3.12-slim …`

---

## Part 2 · Deploy the landing pages (Cloudflare Pages — free)

### A. Pre-flight (one-time, ~5 min)

You need:

- A Cloudflare account (free) and a domain (any registrar) with
  nameservers pointed at Cloudflare. **OR** skip the custom domain
  step and use the auto-generated `*.pages.dev` URL.
- A Gumroad listing URL (placeholder until your account is set up —
  use `https://gumroad.com/l/datatools` and update it later).

### B. Build the deploy-ready bundle (~30 sec)

```bash
# One-time: copy the template
cp landing/deploy.config.example.json landing/deploy.config.json
# Edit it with your real URLs (any text editor works)
"${EDITOR:-vi}" landing/deploy.config.json
# Build
python3 landing/deploy.py
# → produces landing/dist/
```

`landing/deploy.config.json` is **gitignored**; your real URLs never
hit the repo.

### C. Deploy (~3 min)

Two paths — pick one:

**Drag-and-drop (zero CLI):**

1. Cloudflare Pages dashboard → **Create project** → **Direct Upload**
2. Drag `landing/dist/` into the upload zone
3. Project name: `datatools` (becomes `datatools.pages.dev`)
4. Click **Deploy**

**Wrangler CLI (one command, scriptable):**

```bash
npm install -g wrangler # one-time
wrangler login # one-time
wrangler pages deploy landing/dist
```

### D. Custom domain (~5 min, optional)

Pages dashboard → your project → **Custom domains** → add
`datatools.app` (or whichever apex domain you registered). Cloudflare
auto-issues TLS. Once propagated:

- `https://datatools.app/` → apex chooser
- `https://datatools.app/shopify-pet/` → Shopify landing
- `https://datatools.app/bookkeeper/` → Bookkeeper landing
- `https://datatools.app/revops/` → RevOps landing

### E. Verify

For each persona:

1. Open the persona URL.
2. Confirm the demo iframe loads (the URL inside it points at the
   Streamlit demo from Part 1).
3. Click "Run pipeline" inside the iframe → AFTER preview appears.
4. Click the "Get DataTools" button → opens Gumroad with the
   correct `?from=<persona>` query (verify in the URL bar).

If the iframe shows "Refused to connect", check Cloudflare Pages →
**Settings** → **Functions** for any CSP that disallows Streamlit's
domain. (Default Pages config does not set CSP, so this is rarely an
issue.)

---

## Part 3 · Updates

The cycle is:

```bash
# 1) Edit code or copy, for example:
#      landing/<persona>/index.html
#      src/gui/app_demo.py

# 2) Rebuild landing
python3 landing/deploy.py

# 3) Re-deploy landing
wrangler pages deploy landing/dist

# 4) Re-deploy demo
git push origin main
# (Streamlit Cloud auto-deploys on push)
```

Both surfaces deploy in under 5 minutes end-to-end.

---

## Part 4 · Sanity checks (post-deploy, ~3 min)

Run these once, then trust the build (per `POST-LAUNCH.md` §6):

```bash
# Landing pages serve and reference the right demo URL
curl -s "https://datatools.app/" | grep -c persona-card
# → 3 (one per persona card)

curl -s "https://datatools.app/shopify-pet/" | grep -c "datatools-demo"
# → ≥1 (iframe src points at your demo)

# Demo responds and routes the persona param
# (URL quoted so the shell does not treat "?" as a glob; Streamlit renders
#  most content client-side, so if this prints 0, confirm in a browser)
curl -s "https://datatools-demo.streamlit.app/?p=shopify-pet" | grep -c "Shopify"
# → ≥1

# Sitemap is valid XML and lists all 4 pages
curl -s "https://datatools.app/sitemap.xml" | grep -c "<url>"
# → 4
```

---

## Part 5 · Cost ceiling check

| Service | Tier | Cost | Cap |
|---|---|---|---|
| Cloudflare Pages | Free | $0 | 500 builds/month, unlimited bandwidth |
| Streamlit Community Cloud | Free | $0 | 1 GB RAM, sleeps after 7 days idle |
| Custom domain | Cloudflare or registrar | ~$15/year | n/a |
| GitHub | Free for private repos with limited collaborators | $0 | n/a |
| **Total ongoing** | | **~$1.25/mo** (domain only) | |

Well inside the `BUSINESS.md` §9 cap of $1,200/mo recurring. The
$5–10/mo VPS migration is a contingency only — don't pre-build it.

---

## Troubleshooting

**Streamlit Cloud build fails with "ModuleNotFoundError: src.core"**

`streamlit_app.py` puts the repo root on `sys.path` before invoking
the demo module — but only if the file is at the repo root. Confirm
`streamlit_app.py` lives at `/streamlit_app.py`, not nested in a
folder.

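
The mechanism described above, putting a directory on `sys.path` so that `src.core` becomes importable, can be sketched in a few lines. This is not the contents of `streamlit_app.py`; the helper name `ensure_on_sys_path` is hypothetical and only illustrates why the shim's own location matters.

```python
import sys
from pathlib import Path


def ensure_on_sys_path(root: Path) -> None:
    """Put `root` at the front of sys.path unless it is already listed."""
    root_str = str(root.resolve())
    if root_str not in sys.path:
        sys.path.insert(0, root_str)


# In an entry script that sits at the repo root, the call would look like:
#   ensure_on_sys_path(Path(__file__).parent)
# If the script were nested one folder down, Path(__file__).parent would no
# longer be the repo root and "import src.core" would fail as described above.
```
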
**Cloudflare Pages deploy succeeds but persona pages 404**

The directory layout is preserved by `deploy.py`. Confirm your
`landing/dist/` has `shopify-pet/index.html`, etc. — not just three
flat files. If you used drag-and-drop, drag the **directory**, not
its contents.

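**The iframe shows "X-Frame-Options denied"**

Streamlit Community Cloud allows iframe embedding by default. If
you've migrated to a self-hosted demo with a reverse proxy, remove
the `X-Frame-Options` header for the demo's domain. The header has no
valid "allow all" value: `ALLOWALL` only appears to work because
browsers ignore values they don't recognise. To restrict embedding to
the landing pages, send `Content-Security-Policy: frame-ancestors`
with the landing-page origin instead.
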
**Gumroad URL has no `?from=` parameter when clicked**

The `from=` query parameter is added by the landing-page CTA, not by
Gumroad. If it's missing, the landing-page HTML wasn't substituted —
re-run `python3 landing/deploy.py` and re-deploy.
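
Whether the parameter shows up as `?from=` or `&from=` depends on whether the Gumroad URL already carries a query string. The sketch below shows one way a build step could append it safely. The function name is hypothetical, and `landing/deploy.py` may do its substitution differently.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_source(gumroad_url: str, persona: str) -> str:
    """Return the CTA URL with exactly one from=<persona> query parameter."""
    parts = urlsplit(gumroad_url)
    # Keep existing parameters, but drop any stale "from" before re-adding it.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "from"
    ]
    query.append(("from", persona))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Parsing and re-encoding the query avoids hand-built strings such as `...?wanted=true?from=revops`, which Gumroad would not read as a `from` parameter.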
319
docs/NEXT-STEPS.md
Normal file
@@ -0,0 +1,319 @@
# Next Steps — from "code complete" to first paying customer

> Creator-only. The runnable checklist that takes the operator from
> the current state (1,729 tests passing, 6 tools shipped, 0 paying
> customers) through launch and into the first 90 days.
> **Version**: 1.0 · **Adopted**: 2026-05-01

This document is the **single answer** to "what now?". Every line
item has an owner, a time estimate, a blocker, a cost, and the
external dependency that makes it un-shippable today. Items are
ordered by **must-finish-before-the-next-item** — work top-down.

Cross-references:
- Strategy: `PLAN.md` (the 8 strategic moves + the 90-day sequence)
- Demo specs: `DEMO-PLAN.md`
- Deployment mechanics: `DEPLOYMENT.md`
- Post-launch measurement: `POST-LAUNCH.md`
- Locked criteria: `DECISIONS.md` §1

Status legend:
- **🟢** Done — the asset exists in this repo
- **🟡** Buildable now — no external dependency needed
- **🟠** External dependency — needs an account / signup / payment
- **🔴** Manual / requires user input that can't be automated

---

## Phase 0 · What's already done (skip ahead)

| ✓ | Item | Where it lives |
|---|------|----------------|
| 🟢 | 6 of 9 tools shipped (Dedup, Text, Format, Missing, Column-Map, Pipeline) | `src/core/`, `src/cli_*.py`, `src/gui/pages/` |
| 🟢 | Pipeline Runner (the retention multiplier per `PLAN.md` §2.6) | `src/core/pipeline.py`, `src/cli_pipeline.py`, `src/gui/pages/9_Pipeline_Runner.py` |
| 🟢 | 1,729 passing tests · 0 skipped · 0 xfailed | `tests/` |
| 🟢 | 3 niche demo datasets + pre-tuned pipeline JSONs | `samples/demo/` |
| 🟢 | Streamlit demo app + Cloud entry shim | `streamlit_app.py`, `src/gui/app_demo.py` |
| 🟢 | 3 niche landing pages + apex chooser + shared CSS | `landing/` |
| 🟢 | Landing-page deploy script (URL-substitution + sitemap + 404 + favicon) | `landing/deploy.py` |
| 🟢 | Strategic plan + demo plan + post-launch measurement plan + deployment doc | `docs/PLAN.md`, `DEMO-PLAN.md`, `POST-LAUNCH.md`, `DEPLOYMENT.md` |

---

## Phase 1 · Stand the funnel up (target: end of week 1, ~1 hour of hands-on work)

The bottleneck right now is **distribution, not feature count**.
Everything in this phase is about turning code into a URL the user
can hit.

### 1.1 · 🟠 Push to GitHub (5 min)

| | |
|---|---|
| **What** | `git init` (if not already done), commit, push to a private or public GitHub repo. |
| **Why** | Cloud deploy services need a Git source. Streamlit Community Cloud auto-deploys on push to `main`. |
| **External dependency** | A GitHub account (free). |
| **Cost** | $0. |
| **Blocked by** | Nothing. |

### 1.2 · 🟠 Deploy the demo to Streamlit Community Cloud (15 min)

| | |
|---|---|
| **What** | Follow `DEPLOYMENT.md` Part 1. Result: a public URL like `https://datatools-demo.streamlit.app`. |
| **Why** | The landing pages embed this in their iframe. Without it, every "Run pipeline" button on the landing pages 404s. |
| **External dependency** | Free Streamlit Community Cloud account, signed in via GitHub. |
| **Cost** | $0. |
| **Blocked by** | 1.1 (the repo must be on GitHub). |
| **Watch out for** | First build takes 2–3 min while Cloud installs deps. Subsequent deploys take < 30 s. |

### 1.3 · 🟠 Buy the apex domain (5 min, ~$15/year)

| | |
|---|---|
| **What** | Register `datatools.app` (or whichever) at any registrar. Point the nameservers at Cloudflare. |
| **Why** | The landing-page canonical URLs and CTA buttons refer to this domain. Pages can deploy to a free `*.pages.dev` URL first if you want to defer this. |
| **External dependency** | A registrar account; payment method. |
| **Cost** | ~$15/year. Within `BUSINESS.md` §9 cost cap. |
| **Blocked by** | Nothing; can run in parallel with 1.1 / 1.2. |

### 1.4 · 🟠 Deploy the landing pages to Cloudflare Pages (15 min)

| | |
|---|---|
| **What** | Follow `DEPLOYMENT.md` Part 2. Run `python3 landing/deploy.py` with the operator's URLs in `deploy.config.json`, then `wrangler pages deploy landing/dist` (or drag-drop). |
| **Why** | This is the marketing surface. Three persona URLs go live as soon as it deploys. |
| **External dependency** | Free Cloudflare account; Wrangler CLI (optional; drag-drop works too). |
| **Cost** | $0. |
| **Blocked by** | 1.2 (the demo URL goes into `deploy.config.json`); ideally 1.3 for the custom domain. |
| **Watch out for** | The `deploy.config.json` file is gitignored, so your real URLs never get committed. |

### 1.5 · 🟠 Open a Gumroad listing (15 min) **(stub for now)**

| | |
|---|---|
| **What** | Create a Gumroad account, draft a listing with a single screenshot + the landing-page copy, set price to $49. Don't enable purchases yet; leave it as a draft. |
| **Why** | The CTA buttons on the landing pages link to `gumroad.com/l/datatools?from=<persona>`. Until the listing exists, those buttons 404. |
| **External dependency** | Free Gumroad account; Stripe-connected payout method (defer to Phase 2). |
| **Cost** | $0 to draft, ~10% per sale once live. |
| **Blocked by** | Nothing; can run in parallel with 1.1–1.4. |
| **Watch out for** | The listing URL must be `gumroad.com/l/datatools` to match the landing-page hard-coded CTAs. If you pick a different slug, update `landing/deploy.config.json` → `gumroad_listing` and re-run `deploy.py`. |

### 1.6 · 🟡 End-to-end smoke verification (10 min)

| | |
|---|---|
| **What** | Run the four `curl` commands from `DEPLOYMENT.md` Part 4. They cover all four landing pages, all three demo personas, and `sitemap.xml`. |
| **Why** | The first time something can break is the moment a real user hits it. Ten minutes of `curl` saves a week of "why is conversion zero." |
| **External dependency** | None. |
| **Cost** | $0. |
| **Blocked by** | 1.4 + 1.2. |

---

## Phase 2 · Make it sellable (target: end of week 2)

### 2.1 · 🟠 Apple Developer Program enrollment (5 min to start, 1–2 weeks lead)

| | |
|---|---|
| **What** | Per `BUSINESS.md` §10. Required for code-signing the macOS installer. |
| **External dependency** | Apple ID + government-issued ID (individual) or D-U-N-S number (org). |
| **Cost** | $99/year. |
| **Blocked by** | Nothing. Start ASAP because of the 1–2 week approval window. Only the signed macOS build (2.3) waits on this; nothing else does. |

### 2.2 · 🟡 PyInstaller spec + cross-platform build (1–3 days first time)

| | |
|---|---|
| **What** | A `build/datatools.spec` that bundles the Streamlit GUI + all 6 tools + samples into one app. Mac `.dmg`, Windows `.exe` installer, Linux AppImage. |
| **Why** | The buyer's deliverable. Without this, there is nothing to attach to the Gumroad listing. |
| **External dependency** | None for Linux/Mac builds. Windows builds need a Windows machine or a CI matrix runner. |
| **Cost** | $0 (GitHub Actions matrix runners are free for public repos). |
| **Blocked by** | Nothing for the spec; 2.1 for the signed Mac build. |
| **Watch out for** | Streamlit's bundle size lands around 300–500 MB per `DECISIONS.md` §4c; this is an accepted tradeoff. |

### 2.3 · 🟡 macOS sign + notarize (30 min once Apple Dev is approved)

| | |
|---|---|
| **What** | Sign the `.dmg`, submit to Apple's notarization service, staple the ticket. |
| **Why** | Without it, Gatekeeper hard-blocks the install with no obvious way out (per `BUSINESS.md` §10). The buyer gives up. |
| **External dependency** | Apple Developer Program (2.1). |
| **Cost** | $0 incremental over 2.1. |
| **Blocked by** | 2.1 + 2.2. |

### 2.4 · 🔴 Refund policy + license + Gumroad listing copy (1 hour)

| | |
|---|---|
| **What** | A clear refund policy (14-day no-questions per the FAQ already on the landing pages) + a software license text + the Gumroad listing description. |
| **Why** | Required by Gumroad's terms; surfaces on the listing page; protects against buyer disputes. |
| **External dependency** | None; the operator writes it. |
| **Cost** | $0. |
| **Blocked by** | Nothing. |
| **Hint** | Most of the copy is already in the landing pages' FAQ section; paste it into Gumroad. |

### 2.5 · 🟠 Activate the Gumroad listing (15 min)

| | |
|---|---|
| **What** | Upload the cross-platform installers from 2.2/2.3, paste the copy from 2.4, set $49 price, enable purchases, configure Stripe payout. |
| **Why** | This is the "buy" button finally working. |
| **External dependency** | Gumroad + Stripe account; the installers from 2.2/2.3. |
| **Cost** | ~10 % per sale. |
| **Blocked by** | 2.2, 2.3, 2.4. |

---

## Phase 3 · First-traffic ignition (target: end of week 4)

Per `PLAN.md` §3 and `BUSINESS.md` §7 channel priorities. The strict
no-touch constraint of `DECISIONS.md` §1 #8 makes channel choice
matter: these are the only ones that fit.

### 3.1 · 🔴 First niche-community post (30 min)

| | |
|---|---|
| **What** | One value-first post in one niche-relevant community (e.g. r/shopify, IndieHackers Shopify chat, a Slack/Discord that allows it). Lead with the demo URL, not the buy URL. |
| **Why** | Marketplaces alone don't drive discovery. Communities are the only first-touch channel that works under no-touch. |
| **External dependency** | Account in the chosen community; understand its self-promotion rules. |
| **Cost** | $0. |
| **Blocked by** | 1.4 (demo URL must work). |
| **Hint** | Pick the persona whose community the operator knows best. Don't try all three at once; see the `POST-LAUNCH.md` §2 "decide ONE thing" rule. |

### 3.2 · 🟡 First long-tail SEO blog post (4–6 hours)

| | |
|---|---|
| **What** | One 800–1,500-word post on `datatools.app/blog/` (sub-route of Cloudflare Pages or Substack) targeting one niche keyword from `BUSINESS.md` §7. Topic: a real problem you've encountered, the cleanup steps, the demo URL at the end. |
| **Why** | Compounding asset: `BUSINESS.md` §2 says SEO pays in 6–18 months, not week 1. Don't mistake it for an early-stage channel. |
| **External dependency** | None. |
| **Cost** | $0. |
| **Blocked by** | Nothing. |

### 3.3 · 🟡 Cloudflare Web Analytics + event counters (45 min)

| | |
|---|---|
| **What** | Enable Cloudflare Web Analytics on the Pages project (one click). Add a tiny inline `<script>` to each landing page that fires `cta_clicked` when the buy button is hit, before redirecting. Per `POST-LAUNCH.md` §1. |
| **Why** | Without this, the post-launch checklist is unrunnable. |
| **External dependency** | Cloudflare account (already from 1.4). |
| **Cost** | $0. |
| **Blocked by** | 1.4. |
| **Hint** | The Gumroad webhook captures `?from=<persona>` automatically; no extra wiring needed. |

### 3.4 · 🟡 Email autoresponder (post-purchase delivery + 3-touch onboarding) (2–3 hours)

| | |
|---|---|
| **What** | Gumroad's built-in delivery email plus three follow-up emails (day 1, day 7, day 14): "are you running into X?", "here's an advanced trick", "save your pipeline as JSON for next week". |
| **Why** | Increases activation, reduces refund risk, surfaces support questions while volume is small. |
| **External dependency** | Gumroad delivery is built-in. The 3-touch sequence needs a free email service (Resend's free tier or Mailchimp's free tier). |
| **Cost** | $0–$30/month per `BUSINESS.md` §9. |
| **Blocked by** | 2.5. |

---

## Phase 4 · First-buyer trigger and review

Per `PLAN.md` §4 decision triggers and `POST-LAUNCH.md` §4.

### 4.1 · 🟢 Run the monthly review (30 min, first Monday after launch)

| | |
|---|---|
| **What** | Follow `POST-LAUNCH.md` §2: pull last-30-days demo events + Gumroad sales + refunds, compute the five numbers, decide ONE change. |
| **Why** | Without this discipline, the funnel drifts and the operator changes 5 things at once and learns nothing. |
| **External dependency** | None; analytics come from 3.3, sales from 2.5. |
| **Cost** | $0. |
| **Blocked by** | 3.3 + 2.5. |

### 4.2 · 🟢 First paying customer (target: 90 days)

| | |
|---|---|
| **What** | The actual first sale. |
| **Why** | Per `BUSINESS.md` §6: validates the funnel; not the business. |
| **Trigger action** | Continue, no plan change. Aim for the first $1k/month by month 6. |

### 4.3 · 🔴 Zero-paid-in-90-days fallback (only fires if 4.2 doesn't)

| | |
|---|---|
| **What** | Per `POST-LAUNCH.md` §4: audit the funnel, not the features. Run a 1-week outbound experiment to 30 niche contacts as a control (per `BUSINESS.md` §8 the no-touch revisit is allowed below $5k MRR if it produces signal). |
| **Why** | Distinguishes "no reach" from "no conversion"; they need different fixes. |
| **External dependency** | Operator's time. |
| **Cost** | The 10 hr/wk allocation already exists; this displaces other work. |
| **Blocked by** | The 90-day calendar trigger from 4.2. |

---

## Phase 5 · Steady state: what NOT to build

Per `PLAN.md` §5 (anti-temptations) and `DECISIONS.md` §8 (re-lock
triggers). The trap is treating "more code" as the answer when the
data says "more reach" or "more conversion." The five forbidden
moves until $5k/mo MRR:

| | Why locked |
|---|---|
| ❌ More tools (06–08) | `PLAN.md` §2.1 distribution-gate. Tool 09 was the exception; no others until first paid customer + one external review. |
| ❌ SaaS pivot | `DECISIONS.md` §4: recurring infra conflicts with the lifestyle constraint. |
| ❌ Live chat / sales calls | `DECISIONS.md` §1 #8: no-touch is locked until $5k/mo. |
| ❌ Custom integrations / one-off consulting | Breaks "build once, sell many." |
| ❌ Going broad on personas | `PLAN.md` §5: "all small businesses" converts at 1 %; vertical converts at 5–15 %. |

---

## Triage table: what blocks what

```
Phase 1 (week 1)              Phase 2 (week 2)        Phase 3 (week 4)
┌──────────────┐              ┌──────────────┐        ┌──────────────┐
│ 1.1 Push GH  │──────────┐   │ 2.1 Apple    │ ───┐   │ 3.1 Community│
│ 1.2 Demo     │──┐       ├──▶│    Dev (1-2w)│    │   │ 3.2 SEO post │
│ 1.3 Domain   │  │       │   │ 2.2 Build    │ ───┤   │ 3.3 Analytics│
│ 1.4 Pages    │◀─┘       │   │ 2.3 Sign     │ ───┤   │ 3.4 Emails   │
│ 1.5 Gumroad  │──────────┘   │ 2.4 Copy     │    │   └──────────────┘
│ 1.6 Verify   │              │ 2.5 Activate │ ◀──┘
└──────────────┘              └──────────────┘               ↓
                                                      ┌──────────────┐
                                                      │ 4.1 Monthly  │
                                                      │ 4.2 First $  │
                                                      │ 4.3 Fallback │
                                                      └──────────────┘
```
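
The hard "Blocked by" rows from Phases 1 to 4 can also be checked mechanically. The sketch below copies those rows into a dictionary and uses the standard-library `graphlib` module (Python 3.9+) to produce one valid working order. Soft preferences ("ideally 1.3") and the calendar trigger for 4.3 are left out; the variable names are illustrative.

```python
from graphlib import TopologicalSorter

# Each key is blocked by the items in its set, copied from the
# "Blocked by" rows of Phases 1 to 4. Items that neither block nor are
# blocked by anything (1.3, 1.5, 3.2) are omitted.
BLOCKED_BY = {
    "1.2": {"1.1"},
    "1.4": {"1.2"},
    "1.6": {"1.2", "1.4"},
    "2.3": {"2.1", "2.2"},
    "2.5": {"2.2", "2.3", "2.4"},
    "3.1": {"1.4"},
    "3.3": {"1.4"},
    "3.4": {"2.5"},
    "4.1": {"2.5", "3.3"},
}

# static_order() yields every item after all of its blockers.
order = list(TopologicalSorter(BLOCKED_BY).static_order())
print(order)
```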

The longest blocking path is **2.1 Apple Developer Program**
(1–2 weeks). Start it on day 1 of week 1: it gates the signed macOS
build (2.3) and therefore listing activation (2.5), and you can do
all of Phase 1 while waiting.

---

## Time estimate: total operator time

| Phase | Hours | Wall-clock |
|---|---|---|
| Phase 1 | ~1 hour | end of week 1 (mostly waiting for builds) |
| Phase 2 | ~1 day | end of week 2 (gated by Apple Dev approval) |
| Phase 3 | ~6 hours | week 3–4 |
| Phase 4 | 30 min/month | ongoing |
| **Total to launch** | **~12 hours of operator time** | **~14 days wall-clock** |

Well inside the 10 hr/wk constraint of `DECISIONS.md` §1 #2.

---

## The thing that decides whether the plan works

Not the build. Not the deploy. Not even the first sale.

**The discipline of running the monthly review** in Phase 4, together
with the "decide ONE thing per month" rule from `POST-LAUNCH.md` §2,
is what separates "this product exists" from "this product
compounds." Every feature added before the funnel is measured is a
guess; every change made after the monthly review is informed.

Don't skip 4.1.
220
docs/PLAN.md
Normal file
@@ -0,0 +1,220 @@
# Strategic Plan: DataTools

> Creator-only. Locks the "what next" in light of the locked criteria
> (DECISIONS.md §1) and the v1.6 honest status (BUSINESS.md §13).
> **Version**: 1.0 · **Adopted**: 2026-05-01 · **Owner**: Michael

This document is the active plan, derived from the strategic review of
2026-05-01. It compresses the eight strategic moves and a 90-day
execution sequence onto one page so the next decision (build vs.
ship vs. market) has a single reference.

It is **not** a re-lock of operating criteria; those still live in
DECISIONS.md and have not changed. This plan is downstream of those
criteria; if a move below conflicts with §1 of Decisions, the criteria
win.

## 1. Frame

**Locked context** (BUSINESS.md, DECISIONS.md):

- Niche Python automation tools, $49–79 single / $149 suite.
- Cash budget ≤ $1,200/mo recurring · Time ≤ 10 hr/wk · No external funding.
- Async + no-touch sales (revisit at $5k/mo MRR).
- Marketplace-first distribution (Gumroad / Lemon Squeezy).
- Streamlit GUI + CLI dual interface, runs locally.
- Lifestyle cashflow goal (no exit needed).

**Honest current state** (2026-05-01):

| Asset | State |
|---|---|
| Tools 1–5 (Dedup, Text Clean, Format Standardize, Missing, Column Mapper) | Ready · 1,691 tests passing · 0 xfailed |
| Tools 6–9 (Outlier, Multi-File Merge, Validator, Pipeline) | Coming Soon |
| PyInstaller installer pipeline | Not started |
| macOS code signing (Apple Dev Program) | Not started |
| Hosted browser demo (Streamlit Cloud) | Not deployed |
| Landing page | Not live |
| Marketplace listing (Gumroad) | Not listed |
| Paying customers | 0 |

**Diagnosis**: the bottleneck is not feature count; it's distribution.
The next $1 of value comes from closing the gap between "code-complete"
and "buyer-pulls-out-card", not from tool 6.

## 2. The eight strategic moves

Numbered moves. Each is consistent with locked criteria.

### 2.1 Freeze new-tool development (one exception). Ship what exists.

Tools 6–8 are blocked behind a **distribution gate**: no work on them
until the existing 5 tools have a paying customer + one external review
(BUSINESS.md §4 sequence rule, applied recursively inside the bundle).

**Exception granted 2026-05-01**: Tool 09 Pipeline Runner is built
*now*. Rationale: the pipeline transforms the bundle from "5 tools you
buy" into "an automatable workflow you depend on." That conversion is
what produces retention and word-of-mouth, the only marketing channel
that scales under the no-network/no-touch constraint.

### 2.2 The demo *is* the product. Make it embarrassingly good.

- Three persona-tagged sample datasets, not one generic CSV: Shopify
  customers / bookkeeper bank export / agency lead list.
- Run the *full pipeline* on the sample (Review → Dedup → Text Clean →
  Format → Missing → Column Map). Free version caps **output rows**,
  not the experience.
- Embed the demo as an **iframe on the landing page** (not "click to
  open"). Friction kills conversion.
- Persistent CTA after demo: *"Run this on your own 50 k-row file →
  buy for $49 →"* directly above the Gumroad button.

### 2.3 Niche down. Stop selling "data cleaning."

One engine, three landing pages:

| Persona | Landing-page lead | Demo dataset |
|---|---|---|
| Shopify operator (priority: pet supplies) | "Clean your customer / vendor / subscriber exports" | uc01_shopify_customer_list |
| Bookkeeper / freelance accountant | "Reconcile bank exports + vendor lists. Auditable changes." | uc06_bank_export_overlap |
| Marketing / RevOps agency | "Dedupe lead lists. Standardize phones across vendors." | uc13_combined_lead_sources |

Generic copy competes with `pip install pandas`. Vertical copy
competes with nothing.

### 2.3a Top pain points per niche

The "what does this actually fix?" question. Each pain point below is
sourced from operator-domain knowledge of these markets and the
buyer-use-case research already captured in `BUSINESS.md §4a`. Pain
points are ranked by **frequency × dollar impact** for that persona;
high-frequency / high-cost pains lead the landing-page copy and the
demo dataset.

> **Validation gap (honest disclaimer)**: these pains are derived from
> operator knowledge of the categories, not from a sample of buyer
> interviews. Per `BUSINESS.md §8` (no-touch constraint review at $5k/mo
> MRR), validate the top-3 per persona via 5 buyer interviews before the
> first $200 of paid acquisition spend. If any pain ranks below the
> assumed level, swap it for the next-highest in this list.

#### Shopify operator (priority: pet supplies)

| # | Pain | $ / time impact | Tools that fix it |
|---|------|-----------------|---|
| S1 | **Klaviyo / Mailchimp / Omnisend per-contact billing.** Subscriber list with 10–18 % duplicate rate (case drift, plus signs in Gmail addresses, multiple devices) → recurring overpay forever. | $30–300/mo per percent of dupes on a 50 k list, recurring | Dedup + Format Standardize (email canonicalization) + Pipeline (re-run weekly) |
| S2 | **Product feed rejected by Google Merchant Center / Meta Catalog.** Smart quotes in titles, NBSP in SKU, inconsistent attributes; campaign launch delayed 24–72 h while feed gets fixed. | 1–3 days delayed launch × campaign value | Text Cleaner + Format Standardize |
| S3 | **Multi-channel order consolidation.** Shopify + Etsy + Amazon + Faire + wholesale spreadsheet, each with a different column for "customer email" / "order total" / "ship country". | 4–8 hr / month manually merging | Column Mapper + Dedup + Pipeline |
| S4 | **Subscription identity fragmentation.** Pet-box subscribers cancel and re-sub under a different email; cohort analysis says churn is 20 % when it's actually 12 %, so pricing decisions are wrong. | Mis-priced LTV → over- or under-paid acquisition | Dedup with `merge=true` survivor |
| S5 | **International tax / VAT MOSS compliance.** Country column is `UK` / `U.K.` / `United Kingdom` / `GB` in the same export; VAT report breaks. Phone formats per region break call-center routing. | Compliance penalty risk + ops friction | Format Standardize (per-row country) + Column Mapper |
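
To make the S1 duplicate sources concrete, here is a minimal sketch of email canonicalization covering the two causes named in that row: case drift and Gmail plus-tags. `canonical_email` is a hypothetical function written for illustration; it is not the Format Standardizer's implementation, which may apply different or additional rules.

```python
def canonical_email(address: str) -> str:
    """Lower-case an address and, for Gmail only, drop any +tag suffix."""
    cleaned = address.strip().lower()
    local, sep, domain = cleaned.rpartition("@")
    if not sep:
        return cleaned  # not an address; leave it for a validator to flag
    if domain in {"gmail.com", "googlemail.com"}:
        local = local.split("+", 1)[0]  # "jane+pets" -> "jane"
    return f"{local}@{domain}"


if __name__ == "__main__":
    rows = ["Jane.Doe+Pets@Gmail.com", "jane.doe@gmail.com", " Bob@Example.com "]
    # Rows that share a canonical form are duplicate candidates.
    print(sorted({canonical_email(r) for r in rows}))
```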

#### Bookkeeper / freelance accountant

| # | Pain | $ / time impact | Tools that fix it |
|---|------|-----------------|---|
| B1 | **Bank-export month-overlap re-import.** Same transaction posts twice when Jan and Feb exports overlap at the boundary; client's books understate cash by 1–4 %. | 2–4 hr / month / client + reconciliation errors | Dedup with explicit Date+Amount+fuzzy Vendor strategy |
| B2 | **QBO / Xero vendor consolidation for 1099 reports.** "Amazon" / "amazon.com" / "AMAZON.COM*4F2X9" become 3 vendors; 1099 reports break, P&L by vendor unusable. | 1–2 hr / 1099 cycle + IRS-paper-trail risk | Format Standardize (name canonicalization) + Dedup |
| B3 | **Liability / professional indemnity.** Cannot use AI tools that don't show their work; client audit response window is 24–48 h. | Per-firm liability premium ≈ $500–2,500 / yr | Audit log built into every tool; every change is row-logged |
| B4 | **Per-license-not-per-client economics.** Most cleanup tools are per-seat / per-client SaaS; bookkeepers managing 10–30 clients hit price walls fast. | $30/mo × N clients vs. $49 once | Desktop license, no per-client constraint |
| B5 | **Multi-currency books.** US-domiciled clients with EU customers; comma-decimal amounts (`€1.234,56`) crash standard parsers; parens-negative (`($89.50)`) treated as positive. | 30–60 min per multi-currency client per month | Format Standardize (`currency_decimal=auto`, parens-negative) |
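
The B5 failure modes are easy to reproduce. The sketch below shows one way to handle comma-decimal and parens-negative amounts; `parse_amount` and its separator heuristic are assumptions made for illustration, not the behaviour of `currency_decimal="auto"` in the Format Standardizer.

```python
import re
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Parse amounts such as "€1.234,56" or "($89.50)" into a Decimal.

    Assumed heuristic: when both separators appear, the right-most one is
    the decimal mark; a lone comma followed by exactly three digits is
    treated as a thousands separator. Raises InvalidOperation if no
    digits are present.
    """
    s = text.strip()
    # Accounting-style parentheses or a leading minus mark a negative.
    negative = (s.startswith("(") and s.endswith(")")) or s.startswith("-")
    digits = re.sub(r"[^\d.,]", "", s)  # drop symbols, spaces, parens
    last_dot, last_comma = digits.rfind("."), digits.rfind(",")
    trailing = len(digits) - last_comma - 1
    if last_comma > last_dot and (last_dot != -1 or trailing != 3):
        # Comma is the decimal mark: "1.234,56" -> "1234.56"
        digits = digits.replace(".", "").replace(",", ".")
    else:
        # Dot is the decimal mark, or there is none: "1,234.56" -> "1234.56"
        digits = digits.replace(",", "")
    value = Decimal(digits)
    return -value if negative else value
```

A bare `1.234` stays ambiguous under this heuristic (it is read as a US decimal), which is one reason a per-row country or locale hint is useful.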

#### Marketing / RevOps agency

| # | Pain | $ / time impact | Tools that fix it |
|---|------|-----------------|---|
| R1 | **HubSpot / Marketo / Iterable per-contact tier pricing.** 10 k contacts → enterprise tier at $4–8 k/mo. Every duplicate is a recurring tax. | $200–800 / month per 1 k duplicate contacts, recurring | Dedup with cross-source merge + Pipeline |
| R2 | **Email-deliverability / sender reputation.** Sending to invalid or duplicate addresses tanks reputation; recovery takes weeks. | Catastrophic: entire email programme degraded | Format Standardize (email canonicalization) + Missing (sentinel detection) |
| R3 | **GDPR / contact-data privacy.** Uploading lead data to a third-party cleaning SaaS is itself a GDPR concern; legal review blocks adoption. | Compliance risk + 4–8 wk legal-review delay | Local-only desktop app, zero outbound calls |
| R4 | **Multi-vendor lead-source unification.** Apollo, ZoomInfo, LinkedIn Sales Nav, manual scrapes: each export has different headers, scoring, country format. | 1–3 days per campaign of manual unification | Column Mapper (alias matching) + Format Standardize (per-row country) + Dedup |
| R5 | **Suppression-list management across 5+ platforms.** Each platform has its own format; un-deduped suppression lists let opt-outs slip through, triggering CAN-SPAM / GDPR exposure. | Compliance risk + churn-back cost | Pipeline saved as JSON, re-run on each new suppression batch |

### 2.4 Operationalize the moat the docs already name.

Three durable advantages, each promoted from buried feature to
landing-page H1:

- **Quality**: 1 GB international standardization in ~2.5 minutes,
  locally. Excel can't do this; OpenRefine fights you for an hour.
- **Privacy**: "Your data never leaves this computer." Already in the
  GUI footer; promote to landing-page lead, screenshot the empty
  network tab.
- **Update cadence**: ship a v1.1 patch within 30 days of v1.0 launch.
  Not features but *evidence* that the product is alive. "Added Czech
  Republic phone format support" beats "no updates in 6 months" every
  time.

### 2.5 Surface the audit-trail feature in sales copy.

Every tool has a structured audit log. Most cleaning tools do not.
Bookkeepers and consultants get fired if they can't show what changed
to a client. The audit feature is currently invisible on every
proposed landing page and should be the **second-largest callout**,
right after "runs locally."

Copy seed: *"Every change auditable. Hand the audit CSV to your client
with the cleaned file."*

### 2.6 The Pipeline Runner is the retention multiplier.

A buyer with a saved pipeline isn't a one-off purchase; they're a
recurring user who recommends the product. This is exactly the
behavioural lever the no-touch constraint needs (DECISIONS.md §8
trigger). Build it now (see §2.1 exception).

### 2.7 Add a $199 "priority support" tier post-launch.

Same code, async-email SLA (24 h response). Targets the bookkeeper /
consultant persona whose own time is $300/hr. Zero new product work,
~3× ARPU on 5–10 % of buyers. Lock the SLA to **async only** so the
no-touch constraint isn't violated. Defer until $5 k/mo MRR (the same
trigger DECISIONS.md §8 already names).

### 2.8 Dependency-aware pipeline UX.

Tools have soft execution-order preferences (Text Clean before Format
Standardize, Format before Dedup, Missing before Dedup). The Pipeline
Runner *recommends* the order, *warns* on reversals, and **never
forces**: the user owns their workflow. Implementation: see
`src/core/pipeline.py` `SOFT_DEPENDENCIES`.
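
The recommend-and-warn behaviour can be sketched in a few lines. Everything below is illustrative: the `SOFT_ORDER` pair list, the tool keys, and `order_warnings` are assumptions, and the real `SOFT_DEPENDENCIES` in `src/core/pipeline.py` may be shaped differently.

```python
# Hypothetical shape: (earlier, later) pairs taken from the prose above.
# The real SOFT_DEPENDENCIES in src/core/pipeline.py may differ.
SOFT_ORDER = [
    ("text_clean", "format"),
    ("format", "dedup"),
    ("missing", "dedup"),
]


def order_warnings(steps: list[str]) -> list[str]:
    """Return one warning per reversed pair; never reorder or reject steps."""
    position = {name: i for i, name in enumerate(steps)}
    warnings = []
    for earlier, later in SOFT_ORDER:
        if earlier in position and later in position:
            if position[earlier] > position[later]:
                warnings.append(f"'{earlier}' is recommended before '{later}'")
    return warnings


if __name__ == "__main__":
    print(order_warnings(["dedup", "format"]))
```

Returning warnings instead of raising keeps the decision with the user, which matches the "never forces" rule.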

## 3. 90-day execution sequence

| Week | Action | Done when |
|---|---|---|
| 1 | PyInstaller pipeline · Mac/Win unsigned installers · Apple Dev Program enrollment (1–2 wk lead) | `dist/datatools-mac.dmg` and `dist/datatools-win.exe` install on a clean machine |
| 2 | Demo deployed to Streamlit Cloud · landing page v1 with embedded demo · 3 persona datasets in the demo | Public URL serves a working pipeline run on a sample dataset in < 30 s |
| 3 | Gumroad listing live · share value-first in 3 niche communities (no pitch) · 1 long-tail SEO post for the lead persona | First listing impression captured · post not removed for self-promotion |
| 4 | Pipeline Runner v1.0 shipped (this week, 2026-05-01; exception per §2.1) · v1.1 patch announced with Tool 09 + intl improvements | Pipeline saves/loads JSON · 3 demo pipelines preloaded |
| 5–8 | Bookkeeper landing page · agency landing page · second tool's promo cycle · priority-support tier added (defer purchase until §2.7 trigger) | Three live landing pages with distinct H1, demo dataset, conversion target |
| 9–13 | Tool 06–08 only **if** revenue trajectory supports continued investment · otherwise more market work on the existing 5 + 09 | Decision made on 13 Aug 2026 with revenue data, not feature ambition |

## 4. Decision triggers (re-evaluation prompts)

These flip the plan, not the underlying criteria:

| Trigger | Reaction |
|---|---|
| First paying customer in week 4–13 | Continue. Plan is working. |
| **Zero** paid in 90 days | Audit the funnel. Demo conversion? Niche fit? Price? Don't add features. |
| $5 k/mo MRR | DECISIONS.md §8 trigger fires: revisit async + priority-support tier. |
| Marketplace policy / shutdown | Switch to own-domain Stripe immediately; landing pages are already self-hosted. |
| Streamlit hard direction change | Low-probability re-lock per DECISIONS.md §8. Tk fallback is documented. |

## 5. Anti-temptations (things the plan refuses)

- **More tools before more buyers.** Locked. Exception only for Pipeline Runner per §2.1.
- **SaaS pivot.** Recurring infra conflicts with the lifestyle constraint (DECISIONS.md §4).
- **Live chat / sales calls.** Conflicts with no-touch (DECISIONS.md §1 #8).
- **Custom integrations / one-off consulting.** $300/hr looks tempting; breaks the "build once, sell many" model that justifies the entire strategy.
- **Going broad on personas.** "All small businesses" is a generic landing page that converts at 1 %; "Shopify pet-supply operators with 1k–50k customers" converts at 5–15 % in the right communities.

## 6. What this plan deliberately leaves open

- Whether tools 06–08 ever ship. Decided on revenue, not roadmap.
- Whether to add a fourth niche landing page. Decided on which of the
  three is producing.
- Whether to invest in own-domain SEO. Compounding 6–18 mo asset; not
  the early-stage channel. Revisit when marketplace + community
  produces baseline traffic to optimise.
- Whether to add a Notion / Slack support community. If support volume
  per 100 sales > 10 (BUSINESS.md §12 target), revisit; else leave async-email only.
158
docs/POST-LAUNCH.md
Normal file
@@ -0,0 +1,158 @@
# Post-launch — 90-day measurement plan

> Creator-only. The other half of `PLAN.md`: PLAN tells you what to
> build, this tells you what to measure once it's live and which
> numbers trigger which actions.
> **Version**: 1.0 · **Adopted**: 2026-05-01 · **Owner**: Michael

This is a runnable monthly checklist, not analytics theatre. Every
metric below has a **threshold** and an **action**. If you're not
willing to execute the action when the threshold trips, drop the
metric — measuring without responding is busywork.

## 1. The five numbers that matter

Every other dashboard, chart, or vanity stat is downstream of these
five. The funnel is short on purpose; pre-PMF traffic doesn't have
the resolution to support more.

| # | Metric | How to compute | Threshold | When tripped |
|---|--------|----------------|-----------|--------------|
| 1 | **Persona engagement** | `demo.run_completed / demo.page_view` per persona | < 30 % for 4 consecutive weeks | Demo isn't running or BEFORE preview isn't compelling. **Action:** check iframe loads; widen BEFORE preview to show pollution clearly; move demo above the fold. |
| 2 | **Demo→CTA intent** | `demo.cta_clicked / demo.run_completed` per persona | < 5 % for 4 consecutive weeks | Demo is impressive but the CTA isn't earning trust. **Action:** add network-tab privacy screenshot; soften the price callout; A-B test eyebrow copy on the CTA card. |
| 3 | **Purchase rate** | `gumroad.purchase / demo.cta_clicked` per persona | < 30 % for 4 consecutive weeks | Visitors click through but don't pull the card out. **Action:** check Gumroad listing renders cleanly; verify refund-policy copy; check that the screenshot on the listing matches the demo they just ran. |
| 4 | **Refund rate** | `gumroad.refunds / gumroad.purchase` rolling 30 days | > 5 % | Buyer expectation mismatch. **Action:** read every refund email; determine if it's a feature gap (build it), a positioning lie (rewrite), or a personal-fit miss (fine, ignore). |
| 5 | **Support load** | email tickets / 100 sales rolling 30 days | > 10 | The product isn't self-serve enough at this price. **Action:** find the top 3 questions; add to in-app onboarding + landing-page FAQ + the persona's saved pipeline. |

These five also map to BUSINESS.md §12 — that doc names the metrics;
this doc operationalises them.

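As a rough illustration of how the five ratios in §1 could be computed during a monthly review, here is a minimal Python sketch. The `FunnelCounts` fields, the function names, and the `THRESHOLDS` table are all hypothetical; only the ratios and threshold values come from the table above. The sketch checks a single period and does not encode the "4 consecutive weeks" condition.

```python
from dataclasses import dataclass


@dataclass
class FunnelCounts:
    """Raw 30-day counts for one persona (field names are illustrative)."""
    page_views: int
    demo_runs: int
    cta_clicks: int
    purchases: int
    refunds: int
    support_tickets: int


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def five_numbers(c: FunnelCounts) -> dict[str, float]:
    """Compute the five metrics from the §1 table for one persona."""
    return {
        "engagement": ratio(c.demo_runs, c.page_views),
        "intent": ratio(c.cta_clicks, c.demo_runs),
        "purchase": ratio(c.purchases, c.cta_clicks),
        "refund": ratio(c.refunds, c.purchases),
        "support_per_100": 100 * ratio(c.support_tickets, c.purchases),
    }


# (limit, direction): "min" trips when the value falls below the limit,
# "max" trips when it rises above it. Values mirror the §1 table.
THRESHOLDS = {
    "engagement": (0.30, "min"),
    "intent": (0.05, "min"),
    "purchase": (0.30, "min"),
    "refund": (0.05, "max"),
    "support_per_100": (10.0, "max"),
}


def tripped(metrics: dict[str, float]) -> list[str]:
    """List the metrics on the wrong side of their threshold this period."""
    out = []
    for name, (limit, direction) in THRESHOLDS.items():
        value = metrics[name]
        below = direction == "min" and value < limit
        above = direction == "max" and value > limit
        if below or above:
            out.append(name)
    return out


# RevOps column of the §3 scoreboard template: 290 views, 82 runs, 6 clicks.
revops = FunnelCounts(page_views=290, demo_runs=82, cta_clicks=6,
                      purchases=2, refunds=0, support_tickets=0)
print(tripped(five_numbers(revops)))  # ['engagement']
```

Keeping the thresholds in one table means the monthly review only has to feed in counts; the decision about what to change stays with the reviewer.
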
## 2. Monthly review — 30-minute checklist

Block 30 minutes on the first Monday of every month for the first six
months. After month 6 if numbers are stable, drop to 15 minutes
quarterly.

```
[ ] Pull last 30 days of demo events from Cloudflare Web Analytics
[ ] Pull last 30 days of Gumroad sales + refunds export
[ ] Compute the five numbers in §1 per persona
[ ] Note which thresholds are tripped (if any)
[ ] Read every refund email since last review
[ ] Read every support email since last review
[ ] Decide ONE thing to change this month (only one)
[ ] Update CHANGELOG with what was changed and why
[ ] Schedule next review
```

The "decide ONE thing" rule is load-bearing. Pre-PMF traffic doesn't
have the volume to A/B-test multiple changes in parallel — you'll just
confuse yourself about what moved the number.

## 3. Per-persona scoreboard (template)

Maintain in a single text file or spreadsheet. The shape that fits in
a notebook page is the shape you'll actually update.

```
Month: 2026-06
─────────────────────────────────────────────────────────────────
                      Shopify   Bookkeeper   RevOps   Total
Page views                420          180      290     890
Demo runs                 137           59       82     278
CTA clicks                  9            7        6      22
Purchases                   3            2        2       7

Metric 1 (engage)         33%          33%      28%     31%
Metric 2 (intent)          7%          12%       7%      8%
Metric 3 (purchase)       33%          29%      33%     32%
Metric 4 (refund)          0%           0%       0%      0%
Metric 5 (support)    3 tickets / 100 sales

Tripped thresholds:   RevOps engagement (28% < 30%)
                      Bookkeeper purchase (29% < 30%)

This-month change:    Move demo embed above the fold on revops
                      page; reduce hero text by 40%.

Last-month change:    Added network-tab screenshot to all 3
                      pages. Result: intent +1.5 percentage
                      points on Shopify, flat elsewhere.
```

## 4. Stage-gate triggers from PLAN.md

Reproduced here so the gate criteria sit beside the metrics that
fire them:

| Trigger | From | Action |
|---|------|--------|
| **First paying customer** | PLAN §4 | Continue. Plan is working. |
| **Zero paid in 90 days** | PLAN §4 | Audit the funnel. Don't add features. Run a small (1-week) outbound experiment to 30 niche-community contacts as a control, even though it stretches the no-touch constraint, to determine whether the bottleneck is reach or conversion. |
| **$5 k/mo MRR** | DECISIONS §8 | Re-evaluate async constraint. Add priority-support tier (PLAN §2.7). |
| **$10 k/mo MRR** | DECISIONS §8 | Revisit time-budget allocation. Decide on tools 06–08 vs. additional bundles. |
| **Marketplace shutdown** | PLAN §4 / DECISIONS §8 | Switch landing-page CTA to own-domain Stripe Checkout. Pre-built; one-line edit. |
| **Streamlit hard direction change** | DECISIONS §8 | Low-probability re-lock. Tk fallback documented. |
| **Burnout signal** | DECISIONS §8 | Stop. Triage. The constraint matters more than the revenue ramp. |

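The numeric gates in the table above can be checked mechanically at each review. The sketch below is only an illustration: the function name, its parameters, and the trigger labels are hypothetical, and the three non-numeric triggers (marketplace shutdown, Streamlit direction change, burnout) are left to human judgment.

```python
def fired_triggers(mrr_usd: float, days_live: int, paying_customers: int) -> list[str]:
    """Return the numeric stage-gate triggers that currently apply."""
    fired = []
    if paying_customers >= 1:
        fired.append("first_paying_customer")
    elif days_live >= 90:
        # No paid customer after 90 days: audit the funnel, don't add features.
        fired.append("zero_paid_in_90_days")
    if mrr_usd >= 5_000:
        fired.append("mrr_5k")
    if mrr_usd >= 10_000:
        fired.append("mrr_10k")
    return fired


print(fired_triggers(mrr_usd=0, days_live=95, paying_customers=0))
print(fired_triggers(mrr_usd=6_200, days_live=200, paying_customers=140))
```
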
## 5. What we deliberately do NOT measure

These look productive but predict nothing pre-PMF. Don't add them.

- **Bounce rate** — single-page sites have artificially high bounce. Useless signal.
- **Time on page** — landing pages are *supposed* to be quick reads. Long time on page often means confusion, not engagement.
- **Heatmaps / scroll-depth** — no statistical resolution at <500 monthly visitors. Add when you cross 5 k/month.
- **Email open rates** — under §2.7 priority support is the only email channel; opens aren't a buying signal.
- **Social mentions** — vanity. The signal that matters is "did they buy" or "did they come back."

## 6. What we measure once, then trust

Do these once, then let them run for 6+ months without re-measuring:

- **Demo correctness** — once per pipeline release, run all 3 demos
  end-to-end via `tests/test_pipeline.py` and check the output looks
  reasonable. The CI pipeline already does this; nothing to add.
- **Cross-platform install** — once per release, verify the
  PyInstaller bundle launches on Mac / Windows / Linux. After three
  green releases, trust the build pipeline; spot-check on major OS
  updates only.
- **Privacy claim integrity** — once at launch, capture the network
  tab while running the cleaner and host that screenshot at a stable
  URL. Re-capture only when a new tool or dependency is added.

## 7. Per-persona attribution

The buy buttons on every landing page carry `?from=<persona>` query
parameters. Gumroad propagates that into the order metadata. Use it
to attribute purchases:

| persona key | landing page URL | Gumroad query | Source |
|---|---|---|---|
| `shopify-pet` | `/shopify-pet/` | `?from=shopify-pet` | Shopify operator |
| `bookkeeper` | `/bookkeeper/` | `?from=bookkeeper` | Bookkeeper / freelance accountant |
| `revops` | `/revops/` | `?from=revops` | Marketing / RevOps agency |
| `apex` | `/` | (no query — use `unknown` bucket) | Generic discovery |

When `unknown` exceeds 30 % of total, add UTM tagging to community
posts and SEO blog backlinks so you can break the bucket apart.

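A small sketch of the attribution rule above, assuming each order can be reduced to the URL the buyer arrived with. How Gumroad actually exposes the `from` value in its export is not shown in this doc, so the input shape and the function names here are hypothetical.

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse

# Persona keys from the attribution table; anything else is "unknown".
KNOWN_PERSONAS = {"shopify-pet", "bookkeeper", "revops"}


def persona_from_url(url: str) -> str:
    """Read the `from` query parameter and map it to a persona bucket."""
    values = parse_qs(urlparse(url).query).get("from", [])
    if values and values[0] in KNOWN_PERSONAS:
        return values[0]
    return "unknown"


def unknown_share(urls: list[str]) -> float:
    """Fraction of orders that fall into the `unknown` bucket."""
    counts = Counter(persona_from_url(u) for u in urls)
    total = sum(counts.values())
    return counts["unknown"] / total if total else 0.0


orders = [
    "https://gumroad.com/l/datatools?from=bookkeeper",
    "https://gumroad.com/l/datatools?from=revops",
    "https://gumroad.com/l/datatools",
]
if unknown_share(orders) > 0.30:
    print("unknown bucket above 30 %: add UTM tagging")
```
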
## 8. The four months that decide whether the plan works

Reading PLAN.md §3 + this doc together, the rough script:

| Month | What's running | What we expect to learn |
|---|---|---|
| **M1** (June) | Installers · demo · 3 landing pages · Gumroad live | Whether the funnel mechanically works. Numbers will be noisy; just look for one purchase. |
| **M2** (July) | M1 + community posts in 3 niches + 1 SEO post | Which persona converts. Re-allocate effort to the highest-converting niche. |
| **M3** (August) | M2 + landing-page changes from M2 review | Whether intent-rate moved on the change. Decide tools 06–08 go/no-go. |
| **M4** (September) | M3 + first repeat-buyer signals | Whether the Pipeline Runner is producing retention as designed. |

By end of M4, the data tells you whether the plan is producing
$1k–3k/mo (BUSINESS.md §6 6-month target) — extrapolated from the
trajectory, not the absolute number.

## 9. The hardest part of the plan to execute

Not the metrics. Not the build. **The "decide ONE thing per month"
rule** — operators with engineering backgrounds chronically pick
three changes per month and conclude nothing because their signal
is muddled. This doc says one. It means one.
@@ -46,7 +46,7 @@ Numbered support matrix. Updated with every shipped capability.
**Cell-level**:
- `smart_punctuation_in_data`, `nbsp_or_unicode_whitespace`, `zero_width_or_invisible`, `dirty_column_headers`, `whitespace_padding`, `null_like_sentinels`, `suspected_mojibake`, `mixed_case_email_column`, `inconsistent_date_format`, `near_duplicate_rows`, `leading_zero_ids`.

**Encoding integrity**: `encoding_uncertain`, `encoding_decode_failed`, `empty_input`.
**Encoding integrity**: `encoding_uncertain`, `encoding_decode_failed`, `encoding_lying_bom`, `empty_input`.

Sample size: 1,000 rows (configurable).

@@ -71,17 +71,37 @@ Sample size: 1,000 rows (configurable).
- Full-DataFrame `auto_fix`: ~5 min (~30 µs/cell).
- Output write: ~10 s.
- Recommended RAM: 4× input size for full-Apply path.
- Format standardizer (`standardize_file`): ~150k rows/sec on cache-warm
  international data; chunk-bounded RAM (~50 MB peak at default
  chunk_size=50,000). A 1 GB CSV with mixed phone+currency+address
  columns finishes in ~2.5–10 minutes depending on column count.

## 11. Tools
1. Deduplicator — Ready
2. Text Cleaner — Ready
3. Format Standardizer — Ready
4. Missing Value Handler — Coming Soon
5. Column Mapper — Coming Soon
4. Missing Value Handler — Ready
5. Column Mapper — Ready
6. Outlier Detector — Coming Soon
7. Multi-File Merger — Coming Soon
8. Validator & Reporter — Coming Soon
9. Pipeline Runner — Coming Soon
9. Pipeline Runner — Ready

### 11.a Recommended pipeline order (soft, not enforced)

The Pipeline Runner ships with a `SOFT_DEPENDENCIES` table; the
following ordering is the default and the basis of the warning
surface. Re-ordering is allowed; the runner emits a warning string
and proceeds.

| # | Tool | Why this slot |
|---|------|---------------|
| 1 | column_map (optional, for header alignment) | Multi-vendor unification — rename early so downstream tools see canonical headers |
| 2 | text_clean | NBSP / smart quotes / zero-width pollution silently breaks downstream parsers |
| 3 | format_standardize | Phones / dates / currencies → canonical form before missing detection and dedup |
| 4 | missing | Sentinel detection, imputation, drop strategies — needs canonical types |
| 5 | column_map (optional, for schema enforcement) | Project to target schema, coerce, drop extras AFTER cleaning |
| 6 | dedup | Fuzzy matching is most accurate on canonicalised, sentinel-laundered data |

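To make the "warn and proceed" behaviour in 11.a concrete, here is a minimal sketch of a soft-ordering check. The dictionary shape below is an assumption made for illustration; the real `SOFT_DEPENDENCIES` table in `src/core/pipeline.py` may be structured differently, and `order_warnings` is a hypothetical name.

```python
# Hypothetical shape: each tool maps to the tools recommended to run before it.
# column_map is left out because it is optional in either slot.
SOFT_DEPENDENCIES = {
    "text_clean": [],
    "format_standardize": ["text_clean"],
    "missing": ["text_clean", "format_standardize"],
    "dedup": ["text_clean", "format_standardize", "missing"],
}


def order_warnings(steps: list[str]) -> list[str]:
    """Return one warning per recommended predecessor that runs too late.

    Predecessors that are absent from the pipeline are ignored: the order
    is advice about relative position, not a requirement. If a tool appears
    more than once, its last position is the one that is checked.
    """
    position = {name: i for i, name in enumerate(steps)}
    warnings = []
    for name, i in position.items():
        for before in SOFT_DEPENDENCIES.get(name, []):
            if before in position and position[before] > i:
                warnings.append(
                    f"'{name}' runs before '{before}'; "
                    "the recommended order is the reverse"
                )
    return warnings


# The caller prints the warnings and runs the pipeline anyway.
for message in order_warnings(["dedup", "text_clean", "format_standardize"]):
    print("warning:", message)
```

Returning strings instead of raising keeps the check advisory, which matches the "soft, not enforced" wording above.
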
## 12. Gate (Review & Normalize)
- Gates every tool page.
@@ -95,7 +115,7 @@ Sample size: 1,000 rows (configurable).
## 13. Interfaces
- **GUI**: Streamlit, browser-based, local, no internet.
- **CLI**: `python -m src.cli` (dedup) · `src.cli_text_clean` · `src.cli_analyze`.
- **CLI**: `python -m src.cli` (dedup) · `src.cli_text_clean` · `src.cli_format` · `src.cli_missing` · `src.cli_column_map` · `src.cli_pipeline` · `src.cli_analyze`.
- **Python API**: `from src.core import …` (analyze, repair_bytes, clean_dataframe, deduplicate, standardize_dataframe, …).
- **JSON output**: `--json` on `cli_analyze`.

@@ -113,8 +133,8 @@ Sample size: 1,000 rows (configurable).
- **Dev**: pytest, tox.

## 16. Test coverage
- 1,230 tests passing, 4 skipped (ftfy not installed), 17 xfailed (documented).
- Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases).
- 1,729 tests passing, 0 skipped, 0 xfailed.
- Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases + 20-row international stress fixture), missing-handler (3 use cases + 16 edge cases), column-mapper (3 use cases + 5 edge cases).
- Run: `python run_tests.py [--tool …] [--fixtures] [--coverage]`.

## 17. Privacy / data handling

142
landing/README.md
Normal file
@@ -0,0 +1,142 @@
# Landing pages

Three persona-tagged landing pages per `docs/PLAN.md` §2.3 and
`docs/DEMO-PLAN.md` §3 / §7. Static HTML, zero build step, ship to
Cloudflare Pages.

## Structure

```
landing/
├── _shared/styles.css          shared CSS (system fonts, no externals)
├── shopify-pet/index.html      Shopify operator (priority: pet supplies)
├── bookkeeper/index.html       bookkeeper / freelance accountant
├── revops/index.html           marketing / RevOps agency
└── README.md                   this file
```

Each page:

- Inherits `landing/_shared/styles.css`
- Overrides the `--accent` colour variable in an inline `<style>` block
  so each persona has its own visual identity (Shopify = mint green,
  Bookkeeper = steel blue, RevOps = vivid violet)
- Has a sticky buy bar with the Gumroad CTA tagged with `?from=<persona>`
- Embeds the live demo (Streamlit) via `<iframe>` with a sandbox attribute
- Carries persona-specific H1, sub-copy, use cases, FAQ, and a
  ready-to-paste `terminal` block showing the CLI in action
- Includes Open Graph + Schema.org `SoftwareApplication` JSON-LD for
  link-share previews and SEO

## Pre-deploy URL substitutions — automated

The HTML carries placeholder URLs (the literal strings
`https://demo.datatools.app`, `https://datatools.app`,
`https://gumroad.com/l/datatools`, `mailto:hello@datatools.app`)
that **must** be replaced before deployment. A small Python script
does this for you — no global search-and-replace needed.

```bash
# 1) Copy the template and fill in your real URLs:
cp landing/deploy.config.example.json landing/deploy.config.json
"${EDITOR:-vi}" landing/deploy.config.json

# 2) Build the deploy-ready bundle:
python3 landing/deploy.py
# → produces landing/dist/ with substitutions applied,
#   plus robots.txt, sitemap.xml, 404.html, favicon.svg
```

`landing/deploy.config.json` is gitignored so your real URLs never
hit the repo. Re-run `landing/deploy.py` whenever you change a URL or
edit any HTML source.

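The real `landing/deploy.py` is not reproduced in this README. As a rough idea of what the substitution step involves, the sketch below assumes the config is a flat JSON object mapping each placeholder string to its real value; that format, the `substitute` name, and the example URLs are assumptions made for illustration only.

```python
import json
import re


def substitute(html: str, mapping: dict[str, str]) -> str:
    """Replace every placeholder in a single pass.

    A single regex pass means a replacement value is never re-scanned, so a
    real URL that happens to contain another placeholder is left alone.
    Longer placeholders are tried first so the most specific one wins.
    """
    if not mapping:
        return html
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], html)


# Hypothetical config contents (placeholder -> real value).
config = json.loads("""
{
  "https://datatools.app": "https://example.com",
  "https://gumroad.com/l/datatools": "https://example.gumroad.com/l/tools"
}
""")

page = (
    '<a href="https://gumroad.com/l/datatools?from=bookkeeper">Buy</a> '
    '<link rel="canonical" href="https://datatools.app/bookkeeper/" />'
)
print(substitute(page, config))
```

The `?from=<persona>` suffix and the per-persona path survive because only the placeholder prefix is matched.
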
## Cloudflare Pages deployment

The simplest path — one Pages project pointed at `landing/dist/`:

```bash
# Option A: drag-and-drop the directory in the Cloudflare dashboard
# Pages → Create project → Direct Upload → drag landing/dist/

# Option B: Wrangler CLI (one command, scriptable)
wrangler pages deploy landing/dist
```

Configure the custom apex domain (`datatools.app`) in the Cloudflare
Pages project settings; sub-paths `/shopify-pet/`, `/bookkeeper/`,
`/revops/` are served automatically because the directory layout
mirrors them. Cache rule defaults are fine (HTML 1 day, CSS 7 days).

If you want **separate Pages projects** per persona for independent
A/B testing, point three projects at the same `landing/dist/` and
configure each with its own sub-domain (`shopify.datatools.app`, etc.)
and a Pages rule that rewrites the root to that persona's
sub-directory.

## Telemetry wiring (per DEMO-PLAN §8)

The plan calls for event-only counters, no PII, no Google Analytics.

For each page, on Cloudflare Pages, attach a Worker (or use Cloudflare
Web Analytics — it's privacy-friendly out of the box and zero config).
Track:

- `page_view` per persona (auto from CF Web Analytics)
- `cta_clicked` — add a small inline `<script>` that fires a fetch to
  `/api/event?event=cta_clicked&persona=<persona>` when the buy button
  is clicked, then continues the navigation to Gumroad.
- `demo.run_completed` and `demo.cta_clicked` are owned by the demo
  app, not the landing page.

Conversion (per DEMO-PLAN §8):

```
demo_engagement = demo.run_completed / page_view          (target ≥ 30%)
purchase_intent = demo.cta_clicked   / demo.run_completed (target ≥ 5%)
purchase_rate   = gumroad.purchase   / demo.cta_clicked   (target ≥ 30%)
```

The Gumroad webhook captures `?from=<persona>` so we can attribute
purchases back to the landing page that produced them.

## Maintenance triggers (per DEMO-PLAN §9)

Refresh the page when:

| Trigger | Action |
|---|---|
| `cta_clicked / run_completed < 5%` for 4 weeks | The demo is working but the buyer isn't trusting the CTA. Add a screenshot of the network tab showing zero outbound calls. Soften the price callout. |
| `page_view → run_completed < 30%` for 4 weeks | The demo iframe isn't loading or visitors aren't engaging. Check the iframe URL. Move the demo above the fold if it's currently below. |
| New tool ships (06–08) | Add it to the persona's saved pipeline only if it fits — don't bloat the demo with every tool. |
| Pricing change | Update `<meta>` schema, the buybar `.price-tag`, the pricing card, and the FAQ. Search-and-replace `$49` across the file. |
| New persona added (4th, 5th) | Copy `shopify-pet/index.html`, replace persona-specific copy, add to the `footer` cross-link block on the existing pages. |

## Why static HTML

Per `DECISIONS.md §5` and `BUSINESS.md §7`, the landing-page channel
must be:

- **Async-friendly** — Cloudflare Pages serves these with no operator
  involvement
- **Cheap** — Cloudflare Pages free tier is sufficient until well past
  the $5k/mo MRR re-lock trigger (`DECISIONS.md §8`)
- **Privacy-respecting** — no third-party tracker means no cookie
  banner, which means no friction added to the conversion funnel
- **Zero ongoing maintenance** — no framework, no build, no upgrades.
  The CSS uses system fonts; no Google Fonts; no CDN dependency that
  could break the page when their TLS certificate rolls.

## Anti-temptations (per DEMO-PLAN §11 + plan §5)

These pages deliberately exclude:

- **No live chat widget.** Locked by no-touch.
- **No "schedule a demo with us" CTA.** Same.
- **No email capture before the demo.** Friction kills conversion.
- **No Google Analytics / Meta Pixel.** Privacy story is a moat, not
  a checkbox to ignore.
- **No SaaS-style "free trial / no credit card."** This is a one-time
  download, not a subscription.
- **No A/B-testing framework yet.** Pre-PMF traffic doesn't reach
  statistical significance — ship the single-arm copy, iterate monthly.
234
landing/_shared/styles.css
Normal file
@@ -0,0 +1,234 @@
/* DataTools landing-page styles — single shared sheet for all niches.
 *
 * Design constraints:
 *   • No external font / CSS dependencies (works on Cloudflare Pages
 *     with zero build step, no privacy banner needed).
 *   • Mobile-first; layout reflows below 720 px.
 *   • Dark, focused, content-first. Buyer reads this on a laptop
 *     between Shopify exports — keep it readable and skimmable.
 *   • Persona pages all share this sheet — niche differences live in
 *     copy + accent-color variables overridden in each page's <style>.
 */

:root {
  --bg: #0f1115;
  --surface: #161922;
  --surface-2: #1d212b;
  --text: #e8eaed;
  --text-mute: #9aa3b2;
  --text-soft: #c8ced8;
  --rule: #252a36;
  --accent: #6ee7b7; /* Shopify pet default — overridden per persona */
  --accent-ink: #052e1a;
  --warn: #fbbf24;
  --max: 1080px;
  --radius: 12px;
  --shadow: 0 1px 3px rgba(0,0,0,0.3), 0 8px 24px rgba(0,0,0,0.2);
  --mono: ui-monospace, SFMono-Regular, "SF Mono", Menlo, monospace;
  --sans: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto,
          "Helvetica Neue", Arial, sans-serif;
}

* { box-sizing: border-box; }

html, body {
  margin: 0; padding: 0;
  background: var(--bg);
  color: var(--text);
  font-family: var(--sans);
  font-size: 16px;
  line-height: 1.55;
  -webkit-font-smoothing: antialiased;
}

a { color: var(--accent); text-decoration: none; }
a:hover { text-decoration: underline; }

/* ----- Sticky buy bar ----- */
.buybar {
  position: sticky; top: 0; z-index: 50;
  background: rgba(15,17,21,0.92);
  backdrop-filter: blur(8px);
  border-bottom: 1px solid var(--rule);
  padding: 10px 20px;
}
.buybar-inner {
  max-width: var(--max); margin: 0 auto;
  display: flex; align-items: center; justify-content: space-between;
  gap: 16px;
}
.buybar .brand { font-weight: 600; letter-spacing: -0.01em; }
.buybar .brand-mark { color: var(--accent); margin-right: 6px; }
.buybar .price-tag { color: var(--text-mute); font-size: 14px; margin-right: 12px; }

/* ----- Buttons ----- */
.btn {
  display: inline-block;
  background: var(--accent); color: var(--accent-ink);
  font-weight: 600; font-size: 15px;
  padding: 11px 18px; border-radius: 8px;
  border: 0; cursor: pointer;
  transition: transform 0.05s ease, box-shadow 0.15s ease;
}
.btn:hover { transform: translateY(-1px); text-decoration: none; box-shadow: var(--shadow); }
.btn-large {
  padding: 14px 24px; font-size: 17px;
}
.btn-ghost {
  background: transparent; color: var(--text-soft);
  border: 1px solid var(--rule);
}
.btn-ghost:hover { background: var(--surface); }

/* ----- Layout ----- */
section {
  padding: 60px 20px;
  border-bottom: 1px solid var(--rule);
}
section:last-of-type { border-bottom: 0; }
.container { max-width: var(--max); margin: 0 auto; }

h1, h2, h3 { line-height: 1.2; letter-spacing: -0.02em; margin-top: 0; }
h1 { font-size: 44px; margin-bottom: 18px; }
h2 { font-size: 30px; margin-bottom: 16px; }
h3 { font-size: 19px; margin-bottom: 8px; }
p { margin: 0 0 14px 0; color: var(--text-soft); }
.muted { color: var(--text-mute); }
.eyebrow { color: var(--accent); font-size: 13px; font-weight: 600;
  text-transform: uppercase; letter-spacing: 0.08em; margin-bottom: 10px; }

ul.bullets { padding-left: 20px; margin: 0 0 14px 0; }
ul.bullets li { margin-bottom: 8px; color: var(--text-soft); }

/* ----- Hero ----- */
.hero {
  padding: 80px 20px 60px;
  background: radial-gradient(ellipse at top, var(--surface), var(--bg) 60%);
}
.hero h1 strong { color: var(--accent); font-weight: 700; }
.hero .lead {
  font-size: 19px; color: var(--text-soft); max-width: 720px;
  margin-bottom: 28px;
}
.hero .cta-row { display: flex; gap: 12px; flex-wrap: wrap; align-items: center; }
.hero .price-note { color: var(--text-mute); font-size: 14px; }

/* ----- Demo embed ----- */
.demo-frame {
  background: var(--surface);
  border: 1px solid var(--rule);
  border-radius: var(--radius);
  overflow: hidden;
  box-shadow: var(--shadow);
}
.demo-frame iframe {
  width: 100%; height: 720px; border: 0; display: block;
  background: var(--surface-2);
}
.demo-caption {
  font-size: 14px; color: var(--text-mute);
  padding: 10px 16px; border-top: 1px solid var(--rule);
}

/* ----- Cards / grids ----- */
.grid {
  display: grid; gap: 18px;
  grid-template-columns: repeat(auto-fit, minmax(260px, 1fr));
}
.card {
  background: var(--surface);
  border: 1px solid var(--rule);
  border-radius: var(--radius);
  padding: 22px;
}
.card h3 { color: var(--text); }
.card p:last-child { margin-bottom: 0; }
.card .icon {
  display: inline-block; font-size: 22px; margin-bottom: 8px;
}

/* ----- Stats row ----- */
.stats { display: flex; gap: 28px; flex-wrap: wrap; margin: 18px 0 0; }
.stats .stat .num {
  font-family: var(--mono); font-size: 26px; font-weight: 600;
  color: var(--accent);
}
.stats .stat .label { font-size: 13px; color: var(--text-mute); }

/* ----- Privacy / audit callout panels ----- */
.callout {
  background: var(--surface);
  border-left: 3px solid var(--accent);
  border-radius: 0 var(--radius) var(--radius) 0;
  padding: 18px 22px;
  margin: 18px 0;
}
.callout strong { color: var(--text); }

/* ----- Code-ish blocks ----- */
.terminal {
  font-family: var(--mono); font-size: 14px;
  background: #0a0c10;
  color: #d8dfe8;
  border: 1px solid var(--rule);
  border-radius: var(--radius);
  padding: 16px 18px;
  overflow-x: auto;
  white-space: pre;
  line-height: 1.45;
}
.terminal .prompt { color: var(--text-mute); }
.terminal .ok { color: var(--accent); }
.terminal .warn { color: var(--warn); }

/* ----- Pricing ----- */
.pricing {
  display: grid; gap: 18px;
  grid-template-columns: repeat(auto-fit, minmax(260px, 1fr));
}
.pricing .card .price {
  font-size: 38px; font-weight: 700; letter-spacing: -0.02em;
  color: var(--text);
}
.pricing .card .price-suffix { font-size: 14px; color: var(--text-mute); margin-left: 4px; }
.pricing .card.featured { border-color: var(--accent); }
.pricing .card .row { display: flex; align-items: baseline; gap: 4px; margin-bottom: 12px; }
.pricing .card ul { padding-left: 18px; margin: 12px 0 18px; }
.pricing .card li { color: var(--text-soft); margin-bottom: 6px; }

/* ----- FAQ ----- */
details.faq {
  border-bottom: 1px solid var(--rule);
  padding: 14px 0;
}
details.faq summary {
  font-weight: 600; color: var(--text);
  cursor: pointer; list-style: none;
  display: flex; align-items: center; justify-content: space-between;
}
details.faq summary::after {
  content: "+"; color: var(--accent); font-size: 22px;
  margin-left: 14px;
}
details.faq[open] summary::after { content: "−"; }
details.faq p { margin-top: 10px; }

/* ----- Footer ----- */
footer {
  padding: 40px 20px 60px;
  font-size: 14px;
  color: var(--text-mute);
}
footer .container { display: flex; gap: 28px; flex-wrap: wrap; justify-content: space-between; }
footer a { color: var(--text-soft); }
footer p { color: var(--text-mute); }

/* ----- Responsive ----- */
@media (max-width: 720px) {
  h1 { font-size: 32px; }
  h2 { font-size: 24px; }
  section { padding: 40px 18px; }
  .hero { padding: 56px 18px 40px; }
  .demo-frame iframe { height: 560px; }
  .buybar-inner .price-tag { display: none; }
}
354
landing/bookkeeper/index.html
Normal file
@@ -0,0 +1,354 @@
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1" />
  <title>DataTools for Bookkeepers — Reconcile Bank Exports With An Audit Trail · $49</title>
  <meta name="description" content="Reconcile messy bank exports. Catch duplicate transactions QuickBooks imported twice. Standardize dates, amounts, and vendor casing — locally. Every change auditable. $49 one-time." />
  <meta name="keywords" content="reconcile bank export csv, quickbooks duplicate transactions, vendor list cleanup, bookkeeper csv tool, bank export deduplicator, bookkeeper audit trail" />
  <link rel="canonical" href="https://datatools.app/bookkeeper/" />
  <link rel="stylesheet" href="../_shared/styles.css" />

  <!-- Persona accent: Bookkeeper → calm steel-blue -->
  <style>
    :root {
      --accent: #7dd3fc;
      --accent-ink: #042c43;
    }
  </style>

  <!-- Open Graph -->
  <meta property="og:title" content="DataTools for Bookkeepers — Reconcile Bank Exports With An Audit Trail" />
  <meta property="og:description" content="Catch duplicate transactions. Standardize dates and amounts. Hand your client an audit trail. $49 one-time." />
  <meta property="og:type" content="product" />
  <meta property="og:url" content="https://datatools.app/bookkeeper/" />

  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "SoftwareApplication",
    "name": "DataTools for Bookkeepers",
    "operatingSystem": "Windows, macOS, Linux",
    "applicationCategory": "BusinessApplication",
    "offers": {
      "@type": "Offer",
      "price": "49",
      "priceCurrency": "USD"
    },
    "description": "Reconcile bank exports, dedupe vendor lists, and produce a hand-off-ready audit trail. Six-tool data-cleaning bundle for bookkeepers and freelance accountants.",
    "softwareVersion": "1.0"
  }
  </script>
</head>
<body>
|
||||
|
||||
<div class="buybar">
|
||||
<div class="buybar-inner">
|
||||
<div class="brand"><span class="brand-mark">●</span> DataTools <span class="muted">/ for Bookkeepers</span></div>
|
||||
<div>
|
||||
<span class="price-tag">$49 — one-time, no subscription</span>
|
||||
<a class="btn" href="https://gumroad.com/l/datatools?from=bookkeeper" rel="noopener">Get DataTools →</a>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<section class="hero">
|
||||
<div class="container">
|
||||
<div class="eyebrow">For bookkeepers · freelance accountants · small-firm partners</div>
|
||||
<h1>Reconcile messy bank exports.<br /><strong>Hand your client an audit trail.</strong></h1>
|
||||
<p class="lead">
|
||||
The Jan and Feb exports overlap and you've got the same transaction
|
||||
booked twice. Vendor names are <em>"Amazon"</em>, <em>"amazon.com"</em>,
|
||||
and <em>"AMAZON.COM*4F2X9"</em> in three different rows. Dates are a
|
||||
smoosh of <code>01/15/2025</code>, <code>2025-01-15</code>, and
|
||||
<code>Jan 18 2025</code>. DataTools fixes all of it in one pass —
|
||||
and produces a row-by-row CSV showing every change so your client
|
||||
can verify your work.
|
||||
</p>
|
||||
<div class="cta-row">
|
||||
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=bookkeeper" rel="noopener">Get DataTools — $49 →</a>
|
||||
<a class="btn btn-ghost btn-large" href="#demo">Try the live demo ↓</a>
|
||||
<span class="price-note">One-time payment · cross-platform · runs offline</span>
|
||||
</div>
|
||||
<div class="stats">
|
||||
<div class="stat"><div class="num">6</div><div class="label">tools, one bundle</div></div>
|
||||
<div class="stat"><div class="num">100 %</div><div class="label">auditable changes</div></div>
|
||||
<div class="stat"><div class="num">0</div><div class="label">cloud uploads ever</div></div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<!-- ============= Pain points ============= -->
|
||||
<section>
|
||||
<div class="container">
|
||||
<div class="eyebrow">If you've spent a Saturday on this, you already know</div>
|
||||
<h2>Six pains DataTools fixes in one pass</h2>
|
||||
<div class="grid">
|
||||
<div class="card">
|
||||
<span class="icon">📅</span>
|
||||
<h3>Jan and Feb bank exports overlap — the same transaction posts twice</h3>
|
||||
<p>QuickBooks (or any reconciler) silently double-counts the month-boundary rows. Your client's books understate cash by 1–4 % and nobody notices until tax season.</p>
|
||||
<p class="muted"><strong>What it costs:</strong> 2–4 hours per month per client + reconciliation errors that can compound.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">📒</span>
|
||||
<h3>1099 reports break because vendors are spelled three ways</h3>
|
||||
<p>"Amazon", "amazon.com", "AMAZON.COM*4F2X9" become three separate vendors in QBO. You ship three 1099s instead of one — and the 1099-NEC threshold breaks both ways.</p>
|
||||
<p class="muted"><strong>What it costs:</strong> 1–2 hours per 1099 cycle + IRS-paper-trail risk.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">🛡️</span>
|
||||
<h3>"Show me what you changed" — your liability hangs on the answer</h3>
|
||||
<p>Cloud cleaners that "just clean your data" don't give you a row-level audit log. Your professional indemnity insurance hates that. Your client's auditor hates that. You hate explaining it.</p>
|
||||
<p class="muted"><strong>What it costs:</strong> per-firm liability premium + 24–48 hr audit-response window stress.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">👥</span>
|
||||
<h3>Per-client SaaS pricing destroys your margins at 10+ clients</h3>
|
||||
<p>$30/mo per client × 20 clients = $600/mo, every month, for tooling. DataTools is a one-time desktop license you use on every client's books for the same $49. Forever.</p>
|
||||
<p class="muted"><strong>What it costs:</strong> the difference between a $30/mo/client subscription and $49 once.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">🌍</span>
|
||||
<h3>Multi-currency books break standard parsers</h3>
|
||||
<p>Your client has EU customers. Their amounts come in as <code>€1.234,56</code> (comma decimal). Standard import tools see "1.234" as the whole-dollar amount and drop the rest. Parens-negative <code>($89.50)</code> gets read as positive.</p>
|
||||
<p class="muted"><strong>What it costs:</strong> 30–60 min per multi-currency client per month + occasional silent errors.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">🔒</span>
|
||||
<h3>Your client's books are too sensitive for a cloud cleaner</h3>
|
||||
<p>One "vendor breach" email to your clients ends the relationship. DataTools is desktop-only. No upload, no SaaS account, no third party seeing a single transaction. Verifiable in your browser's network tab.</p>
|
||||
<p class="muted"><strong>What it costs:</strong> nothing — and that's exactly the point.</p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section id="demo">
|
||||
<div class="container">
|
||||
<div class="eyebrow">Live demo · runs in your browser</div>
|
||||
<h2>Try it on a sample bank export with a known overlap</h2>
|
||||
<p>
|
||||
The demo below loads a 25-row export combining January and February
|
||||
activity, with the month-boundary rows duplicated across exports —
|
||||
the exact scenario where QuickBooks (or any reconciler) silently
|
||||
double-counts transactions. Click <strong>Run pipeline</strong> and
|
||||
watch the dedup catch every overlap, dates land in ISO format, and
|
||||
the parens-negative amounts (<code>($89.50)</code>) become proper
|
||||
negative numbers.
|
||||
</p>
|
||||
<div class="demo-frame">
|
||||
<iframe
|
||||
src="https://demo.datatools.app/?p=bookkeeper"
|
||||
loading="lazy"
|
||||
title="DataTools live demo — Bookkeeper"
|
||||
sandbox="allow-scripts allow-same-origin allow-downloads allow-forms"></iframe>
|
||||
<div class="demo-caption">
|
||||
Demo runs on free hosting. Capped at 100 input rows · output
|
||||
watermarked. The paid product has no caps and runs entirely offline.
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<div class="container">
|
||||
<div class="eyebrow">Built for the bookkeeper's actual day</div>
|
||||
<h2>Four workflows the rest of the industry tax-codes around</h2>
|
||||
<div class="grid">
|
||||
<div class="card">
|
||||
<span class="icon">🏦</span>
|
||||
<h3>Bank export reconciliation</h3>
|
||||
<p>Two months of activity overlap at the boundary. The same transaction posts twice — once in each export — with different formatting. DataTools dedups on Date + Amount + fuzzy Vendor and catches all of them.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">📒</span>
|
||||
<h3>Vendor list consolidation</h3>
|
||||
<p>QuickBooks has <code>amazon.com</code>. Your spreadsheet has <code>Amazon</code>. The bank statement has <code>AMAZON.COM*4F2X9</code>. Standardize the casing, fuzzy-match across sources, hand the client one clean vendor list.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">👥</span>
|
||||
<h3>Customer master cleanup pre-migration</h3>
|
||||
<p>Before moving from one accounting system to another, the customer master needs to be deduped, standardized, and audited. One tool, one pipeline, one CSV in / clean CSV out.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">🧾</span>
|
||||
<h3>Expense report dedup</h3>
|
||||
<p>Same receipt scanned twice. Same Uber ride entered manually and then imported from the corporate card. Catch them once — and produce the audit log that proves the duplicate <em>was</em> a duplicate.</p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<div class="container">
|
||||
<div class="eyebrow">The feature your liability insurance cares about</div>
|
||||
<h2>Every change auditable. Period.</h2>
|
||||
<p>
|
||||
Every cell DataTools modifies is logged with the original value, the
|
||||
new value, and which rule fired. When your client asks why a
|
||||
transaction got merged or a date got reformatted, you don't say
|
||||
"the AI did it." You hand them the CSV.
|
||||
</p>
|
||||
<div class="callout">
|
||||
<strong>Why this matters specifically to bookkeepers:</strong> your
|
||||
professional liability hangs on traceability. Cloud cleaners that
|
||||
"just clean your data" without a row-level audit are unsafe at any
|
||||
price. DataTools writes the audit by default, downloadable as a
|
||||
separate CSV alongside the cleaned file.
|
||||
</div>
|
||||
<div class="terminal"><span class="prompt">$</span> head -5 client_jan2025_changes.csv
|
||||
row,column,field_type,old,new
|
||||
0,"Date ",date,"01/15/2025","2025-01-15"
|
||||
0,Description,name," AMAZON.COM*4F2X9 PURCHASE","Amazon.com*4F2X9 Purchase"
|
||||
0,Amount,currency,"-$129.99","-129.99"
|
||||
1,"Date ",date,"Jan 18 2025","2025-01-18"
|
||||
<span class="prompt">$</span> # one row of audit per cell change. handed to the client. signed off.</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<div class="container">
|
||||
<div class="eyebrow">The thing every cloud reconciler can't say</div>
|
||||
<h2>Your client's books never leave your computer.</h2>
|
||||
<p>
|
||||
Your clients trust you with their books. That trust is one
|
||||
"we noticed our data appeared in a vendor breach" email away from
|
||||
gone. DataTools is a desktop app — no upload, no SaaS, no
|
||||
subscription, no third party seeing a single transaction.
|
||||
</p>
|
||||
<div class="callout">
|
||||
<strong>Confirm it yourself.</strong> Open your browser's network
|
||||
tab when DataTools is running. Click around. Run the pipeline.
|
||||
Zero outbound requests. Ever.
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<div class="container">
|
||||
<div class="eyebrow">If your clients run multi-currency books</div>
|
||||
<h2>$ £ € ¥ R$ kr zł — handled.</h2>
|
||||
<p>
|
||||
Standardize <code>$1,234.56</code>, <code>1.234,56 €</code> (EU
|
||||
decimal), <code>($89.50)</code> (parens-negative),
|
||||
<code>R$ 250,00</code>, <code>kr 1.250,50</code>, and the rest of
|
||||
the long tail. Output is canonical numeric (your import tool's
|
||||
favourite shape) with optional ISO 4217 prefix
|
||||
(<code>USD 1234.56</code>) when you need to preserve the
|
||||
currency.
|
||||
</p>
|
||||
<ul class="bullets">
|
||||
<li><strong>Auto-detect</strong> EU comma decimal so your French and German clients' books reconcile without per-locale config.</li>
|
||||
<li><strong>Parens-negative</strong> handled — accounting convention, not just a math style.</li>
|
||||
<li><strong>Multi-character prefixes</strong> like <code>R$</code> (Brazilian Real) and <code>kr</code> (Nordic) detected before the single-symbol regex so they don't get bucketed as USD.</li>
|
||||
</ul>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<div class="container">
|
||||
<div class="eyebrow">In the bundle</div>
|
||||
<h2>Six tools. One pipeline. One $49 download.</h2>
|
||||
<div class="grid">
|
||||
<div class="card"><h3>1 · Deduplicator</h3><p>Fuzzy match (Jaro-Winkler), explicit strategies for Date+Amount+Vendor, survivor rules.</p></div>
|
||||
<div class="card"><h3>2 · Text Cleaner</h3><p>Header whitespace, smart quotes from copy-paste, em-dash sentinels.</p></div>
|
||||
<div class="card"><h3>3 · Format Standardizer</h3><p>ISO dates, numeric amounts (parens-negative), vendor casing, multi-currency.</p></div>
|
||||
<div class="card"><h3>4 · Missing Value Handler</h3><p>Disguised-null detection: <code>—</code>, <code>N/A</code>, <code>(blank)</code>, <code>?</code>.</p></div>
|
||||
<div class="card"><h3>5 · Column Mapper</h3><p>Project to your accounting tool's required schema, coerce types, drop extras.</p></div>
|
||||
<div class="card"><h3>6 · Pipeline Runner</h3><p>Save the cleanup. Run it on next month's export with one command. Same audit, automated.</p></div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<div class="container">
|
||||
<div class="eyebrow">Pricing — pay once, own it</div>
|
||||
<h2>$49. No subscription. No per-client license.</h2>
|
||||
<div class="pricing">
|
||||
<div class="card featured">
|
||||
<div class="row"><div class="price">$49</div><div class="price-suffix">one-time</div></div>
|
||||
<h3>DataTools for Bookkeepers</h3>
|
||||
<ul>
|
||||
<li>All 6 tools, full pipeline</li>
|
||||
<li>Mac · Windows · Linux installers</li>
|
||||
<li>Code-signed (no Gatekeeper warnings)</li>
|
||||
<li>Free updates for the v1.x line</li>
|
||||
<li>Bonus: ready-made bank-reconcile and vendor-cleanup pipelines</li>
|
||||
<li><strong>Use on any number of clients</strong> — no seat limits</li>
|
||||
</ul>
|
||||
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=bookkeeper" rel="noopener">Buy on Gumroad →</a>
|
||||
</div>
|
||||
<div class="card">
|
||||
<div class="row"><div class="price">$199</div><div class="price-suffix">one-time</div></div>
|
||||
<h3>+ Priority email support</h3>
|
||||
<p class="muted">Available post-launch. 24-hour async response on edge cases. Same product. Targeted at bookkeepers whose own time is > $200/hr.</p>
|
||||
<a class="btn btn-ghost btn-large" href="#" aria-disabled="true">Coming soon</a>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<div class="container">
|
||||
<h2>Questions</h2>
|
||||
|
||||
<details class="faq">
|
||||
<summary>Does this replace QuickBooks / Xero?</summary>
|
||||
<p>No — DataTools cleans the data <em>before</em> it goes into your accounting system, or after you export it for analysis. It sits alongside QB/Xero, not in place of them. Think of it as the import-clean-up step that should have shipped with the bank export feature in the first place.</p>
|
||||
</details>
|
||||
|
||||
<details class="faq">
|
||||
<summary>Can I use it on multiple clients without paying again?</summary>
|
||||
<p>Yes. The license is per-bookkeeper, not per-client. Run it on every client's books for the same $49.</p>
|
||||
</details>
|
||||
|
||||
<details class="faq">
|
||||
<summary>What does the audit log look like in court?</summary>
|
||||
<p>It's a CSV with five columns per change: <code>row, column, field_type, old, new</code>. Plus a JSON pipeline file describing exactly which rules ran in which order. Together they reproduce the cleanup deterministically — your client (or their auditor) can verify it on their machine.</p>
|
||||
</details>
|
||||
|
||||
<details class="faq">
|
||||
<summary>How does it handle Excel-only weirdness like serial dates?</summary>
|
||||
<p>Excel serial dates (the number 45306 = 2024-01-15) are detected and converted automatically. So are Unix timestamps in seconds and milliseconds, RFC 2822 dates from email exports, partial-precision dates (<code>2024-01</code>, <code>2024-Q1</code>), and locale-specific month names in English/French/German.</p>
|
||||
</details>
|
||||
|
||||
<details class="faq">
|
||||
<summary>What about my clients' privacy?</summary>
|
||||
<p>Your clients' books never leave your computer. The cleaner is a desktop app with zero network code in the data path. You can verify this in your browser's network tab.</p>
|
||||
</details>
|
||||
|
||||
<details class="faq">
|
||||
<summary>What's your refund policy?</summary>
|
||||
<p>Try the live demo above on the sample dataset before you buy. If DataTools doesn't fit your workflow within 14 days, email for a refund — no questions asked.</p>
|
||||
</details>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<div class="container" style="text-align: center;">
|
||||
<h2>Stop reconciling bank exports by hand.</h2>
|
||||
<p class="lead" style="margin: 0 auto 28px;">One $49 download. Catches the duplicate transactions QuickBooks imported twice, standardizes dates, amounts, and vendor casing, and hands you a row-level audit log to share with your client.</p>
|
||||
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=bookkeeper" rel="noopener">Get DataTools — $49 →</a>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<footer>
|
||||
<div class="container">
|
||||
<div>
|
||||
<p><strong>DataTools</strong> — local data-cleaning for Shopify, bookkeepers, and RevOps teams.</p>
|
||||
<p class="muted">© 2026 · Built solo · Shipped from a small office.</p>
|
||||
</div>
|
||||
<div>
|
||||
<p>
|
||||
<a href="../shopify-pet/">For Shopify operators</a> ·
|
||||
<a href="../revops/">For RevOps agencies</a><br />
|
||||
<a href="https://gumroad.com/l/datatools?from=bookkeeper">Buy on Gumroad</a> ·
|
||||
<a href="mailto:hello@datatools.app">Email support</a>
|
||||
</p>
|
||||
</div>
|
||||
</div>
|
||||
</footer>
|
||||
|
||||
</body>
|
||||
</html>
|
||||
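The multi-currency section of the bookkeeper page describes normalizing `$1,234.56`, `1.234,56 €`, `($89.50)`, `R$ 250,00`, and `kr 1.250,50` to one canonical numeric form. A minimal Python sketch of that idea follows. The name `parse_amount` and its separator heuristic are illustrative assumptions, not the Format Standardizer's actual implementation in this commit.

```python
import re
from decimal import Decimal


def parse_amount(raw: str) -> Decimal:
    """Normalize a currency string to a signed Decimal (illustrative heuristic only)."""
    text = raw.strip()
    # Accounting convention: (89.50) means -89.50. A plain minus sign also counts.
    negative = (text.startswith("(") and text.endswith(")")) or "-" in text
    # Drop currency symbols, letters, and spaces; keep digits and both separators.
    body = re.sub(r"[^\d.,]", "", text).strip(".,")
    if not body:
        raise ValueError(f"no digits found in {raw!r}")

    last_dot, last_comma = body.rfind("."), body.rfind(",")
    if last_dot >= 0 and last_comma >= 0:
        # Both separators present: the right-most one is the decimal mark.
        decimal_sep = "." if last_dot > last_comma else ","
    elif last_comma >= 0:
        # Only commas: one comma not followed by exactly three digits reads as
        # a decimal mark (250,00); anything else is treated as thousands (1,234).
        tail = len(body) - last_comma - 1
        decimal_sep = "," if body.count(",") == 1 and tail != 3 else None
    elif last_dot >= 0:
        # Only dots: a single dot is a decimal mark, repeated dots are thousands.
        decimal_sep = "." if body.count(".") == 1 else None
    else:
        decimal_sep = None

    if decimal_sep is None:
        whole, frac = re.sub(r"[.,]", "", body), ""
    else:
        whole, _, frac = body.rpartition(decimal_sep)
        whole = re.sub(r"[.,]", "", whole)

    value = Decimal(f"{whole or '0'}.{frac or '0'}")
    return -value if negative else value


for sample in ["$1,234.56", "1.234,56 €", "($89.50)", "R$ 250,00", "kr 1.250,50"]:
    print(f"{sample!r:>16} -> {parse_amount(sample)}")
```

A lone separator followed by exactly three digits (`1.234`) stays ambiguous without locale information. This sketch picks one reading per separator; a per-row country column, as the commit message describes, would be the more reliable tie-breaker.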
22
landing/deploy.config.example.json
Normal file
@@ -0,0 +1,22 @@
|
||||
{
|
||||
"_comment": [
|
||||
"Deployment substitution config. Copy to deploy.config.json and",
|
||||
"fill in the real URLs before running deploy.py.",
|
||||
"deploy.config.json is gitignored (never commit your real URLs)."
|
||||
],
|
||||
|
||||
"site_origin": "https://datatools.app",
|
||||
|
||||
"demo_base_url": "https://datatools-demo.streamlit.app",
|
||||
"gumroad_listing": "https://gumroad.com/l/datatools",
|
||||
"support_email": "hello@datatools.app",
|
||||
|
||||
"personas": ["shopify-pet", "bookkeeper", "revops"],
|
||||
|
||||
"_substitutions_made": [
|
||||
"{{site_origin}}/ → site_origin/",
|
||||
"{{demo_base_url}}/?p=<persona> → live demo iframe per persona",
|
||||
"{{gumroad_url}}?from=<persona> → Gumroad CTA on every page",
|
||||
"{{support_email}} → mailto: link"
|
||||
]
|
||||
}
|
||||
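`deploy.py` only checks that the four required keys are present and non-empty. A stricter pre-flight check could look like the sketch below. `config_problems` is a hypothetical helper, not part of this commit, and the rules it applies (https-only URLs, a single `@` in the email) are assumptions about what a sane config looks like.

```python
from urllib.parse import urlsplit

URL_KEYS = ("site_origin", "demo_base_url", "gumroad_listing")


def config_problems(cfg: dict) -> list:
    """Return human-readable problems; an empty list means the config looks sane."""
    problems = []
    for key in URL_KEYS:
        parts = urlsplit(str(cfg.get(key, "")))
        if parts.scheme != "https" or not parts.netloc:
            problems.append(f"{key}: expected an absolute https:// URL")

    # Very loose email check: exactly one "@" with text on both sides.
    local, at, domain = str(cfg.get("support_email", "")).partition("@")
    if not (local and at and domain) or "@" in domain:
        problems.append("support_email: expected something like name@example.com")

    # "personas" is optional; when present it should list folder names.
    personas = cfg.get("personas", [])
    if not isinstance(personas, list) or not all(
        isinstance(p, str) and p for p in personas
    ):
        problems.append("personas: expected a list of non-empty folder names")
    return problems


example = {
    "site_origin": "https://datatools.app",
    "demo_base_url": "https://datatools-demo.streamlit.app",
    "gumroad_listing": "https://gumroad.com/l/datatools",
    "support_email": "hello@datatools.app",
    "personas": ["shopify-pet", "bookkeeper", "revops"],
}
for problem in config_problems(example):
    print("config problem:", problem)
```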
235
landing/deploy.py
Normal file
@@ -0,0 +1,235 @@
|
||||
"""Build a deploy-ready ``landing/dist/`` from the source HTML.
|
||||
|
||||
Run from the repo root after copying ``landing/deploy.config.example.json``
|
||||
to ``landing/deploy.config.json`` and filling in the real URLs:
|
||||
|
||||
python3 landing/deploy.py
|
||||
|
||||
Output:
|
||||
landing/dist/index.html
|
||||
landing/dist/shopify-pet/index.html
|
||||
landing/dist/bookkeeper/index.html
|
||||
landing/dist/revops/index.html
|
||||
landing/dist/_shared/styles.css
|
||||
landing/dist/robots.txt
|
||||
landing/dist/sitemap.xml
|
||||
landing/dist/404.html
|
||||
landing/dist/favicon.svg
|
||||
|
||||
Upload ``landing/dist/`` to Cloudflare Pages (drag-and-drop in the
|
||||
dashboard, or ``wrangler pages deploy landing/dist``).
|
||||
|
||||
Why this script exists:
|
||||
The source HTML carries placeholder URLs (``{{demo_base_url}}``,
|
||||
``{{gumroad_url}}``, ``{{support_email}}``, ``{{site_origin}}``)
|
||||
so the operator's actual demo / Gumroad / domain URLs aren't
|
||||
committed to the repo. This script reads the operator's config
|
||||
and produces a ready-to-upload bundle.
|
||||
|
||||
It also stamps a sitemap.xml + robots.txt + 404.html and copies
|
||||
the shared CSS so the output directory is fully self-contained.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import re
|
||||
import shutil
|
||||
import sys
|
||||
from datetime import date
|
||||
from pathlib import Path
|
||||
|
||||
LANDING = Path(__file__).resolve().parent
|
||||
REPO = LANDING.parent
|
||||
DIST = LANDING / "dist"
|
||||
|
||||
CONFIG_PATH = LANDING / "deploy.config.json"
|
||||
EXAMPLE_PATH = LANDING / "deploy.config.example.json"
|
||||
|
||||
|
||||
# Files to substitute and copy. Order matters only for readability.
|
||||
HTML_PAGES = [
|
||||
LANDING / "index.html",
|
||||
LANDING / "shopify-pet" / "index.html",
|
||||
LANDING / "bookkeeper" / "index.html",
|
||||
LANDING / "revops" / "index.html",
|
||||
]
|
||||
SHARED = LANDING / "_shared" / "styles.css"
|
||||
|
||||
|
||||
def _load_config() -> dict:
|
||||
if not CONFIG_PATH.exists():
|
||||
sys.stderr.write(
|
||||
f"\nERROR: {CONFIG_PATH.name} not found.\n"
|
||||
f" cp {EXAMPLE_PATH.name} {CONFIG_PATH.name}\n"
|
||||
f" edit {CONFIG_PATH.name} with your real URLs\n"
|
||||
f" re-run: python3 landing/deploy.py\n\n"
|
||||
)
|
||||
sys.exit(2)
|
||||
cfg = json.loads(CONFIG_PATH.read_text(encoding="utf-8"))
|
||||
required = ("site_origin", "demo_base_url", "gumroad_listing", "support_email")
|
||||
missing = [k for k in required if not cfg.get(k)]
|
||||
if missing:
|
||||
sys.stderr.write(
|
||||
f"\nERROR: {CONFIG_PATH.name} is missing required fields: {missing}\n"
|
||||
f" See {EXAMPLE_PATH.name} for the full template.\n\n"
|
||||
)
|
||||
sys.exit(2)
|
||||
return cfg
|
||||
|
||||
|
||||
def _substitute(text: str, cfg: dict) -> str:
|
||||
"""Replace placeholders + the demo / Gumroad URL patterns the source HTML uses today."""
|
||||
site_origin = cfg["site_origin"].rstrip("/")
|
||||
demo_base = cfg["demo_base_url"].rstrip("/")
|
||||
gumroad_base = cfg["gumroad_listing"]
|
||||
support_email = cfg["support_email"]
|
||||
|
||||
# Direct placeholder tokens (clean approach — used by future copy).
|
||||
text = text.replace("{{site_origin}}", site_origin)
|
||||
text = text.replace("{{demo_base_url}}", demo_base)
|
||||
text = text.replace("{{gumroad_url}}", gumroad_base)
|
||||
text = text.replace("{{support_email}}", support_email)
|
||||
|
||||
# Backwards-compatible patterns: the source HTML in this repo carries
|
||||
# literal ``https://datatools.app`` and ``https://demo.datatools.app``
|
||||
# so this script swaps those too. Once new pages adopt the
|
||||
# ``{{placeholder}}`` style above, this layer can be retired.
|
||||
text = re.sub(
|
||||
r"https://demo\.datatools\.app",
|
||||
demo_base,
|
||||
text,
|
||||
)
|
||||
# Replace ``https://datatools.app/...`` for canonical / OG URLs but
|
||||
# leave the bare ``hello@datatools.app`` address alone: the pattern needs the
|
||||
# ``https://`` prefix, so the support email is swapped separately below.
|
||||
text = re.sub(
|
||||
r"https://datatools\.app",
|
||||
site_origin,
|
||||
text,
|
||||
)
|
||||
# Gumroad URL family — preserve the ``?from=<persona>`` query.
|
||||
text = re.sub(
|
||||
r"https://gumroad\.com/l/datatools",
|
||||
gumroad_base.rstrip("/"),
|
||||
text,
|
||||
)
|
||||
# Support email shows up only as ``mailto:hello@datatools.app``.
|
||||
text = text.replace("mailto:hello@datatools.app", f"mailto:{support_email}")
|
||||
text = text.replace("hello@datatools.app", support_email)
|
||||
|
||||
return text
|
||||
|
||||
|
||||
def _stamp_sitemap(cfg: dict) -> str:
|
||||
site = cfg["site_origin"].rstrip("/")
|
||||
today = date.today().isoformat()
|
||||
urls = [site + "/"] + [
|
||||
f"{site}/{p}/" for p in cfg.get("personas", ["shopify-pet", "bookkeeper", "revops"])
|
||||
]
|
||||
items = "\n".join(
|
||||
f" <url><loc>{u}</loc><lastmod>{today}</lastmod></url>"
|
||||
for u in urls
|
||||
)
|
||||
return (
|
||||
'<?xml version="1.0" encoding="UTF-8"?>\n'
|
||||
'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
|
||||
f"{items}\n"
|
||||
"</urlset>\n"
|
||||
)
|
||||
|
||||
|
||||
def _robots_txt(cfg: dict) -> str:
|
||||
return (
|
||||
"# Allow everything; we want every persona page indexable.\n"
|
||||
"User-agent: *\n"
|
||||
"Allow: /\n"
|
||||
f"Sitemap: {cfg['site_origin'].rstrip('/')}/sitemap.xml\n"
|
||||
)
|
||||
|
||||
|
||||
def _favicon_svg() -> str:
|
||||
"""Tiny self-contained SVG favicon: a green dot on a dark rounded square, echoing the brand mark."""
|
||||
return (
|
||||
'<?xml version="1.0" encoding="UTF-8"?>\n'
|
||||
'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 64 64">\n'
|
||||
' <rect width="64" height="64" rx="14" fill="#0f1115"/>\n'
|
||||
' <circle cx="32" cy="32" r="9" fill="#6ee7b7"/>\n'
|
||||
"</svg>\n"
|
||||
)
|
||||
|
||||
|
||||
def _build_404_html(cfg: dict) -> str:
|
||||
"""Cloudflare Pages serves 404.html when a path doesn't match."""
|
||||
site_origin = cfg["site_origin"].rstrip("/")
|
||||
return f"""<!DOCTYPE html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="utf-8" />
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
||||
<title>Not found · DataTools</title>
|
||||
<link rel="stylesheet" href="/_shared/styles.css" />
|
||||
</head>
|
||||
<body>
|
||||
<section class="hero" style="text-align: center;">
|
||||
<div class="container">
|
||||
<div class="eyebrow">404</div>
|
||||
<h1>That page isn't here.</h1>
|
||||
<p class="lead" style="margin: 0 auto 28px;">Pick a workflow below to land somewhere useful.</p>
|
||||
<p>
|
||||
<a class="btn" href="{site_origin}/shopify-pet/">For Shopify</a>
|
||||
|
||||
<a class="btn" href="{site_origin}/bookkeeper/">For bookkeepers</a>
|
||||
|
||||
<a class="btn" href="{site_origin}/revops/">For RevOps</a>
|
||||
</p>
|
||||
</div>
|
||||
</section>
|
||||
</body>
|
||||
</html>
|
||||
"""
|
||||
|
||||
|
||||
def main() -> int:
|
||||
cfg = _load_config()
|
||||
|
||||
if DIST.exists():
|
||||
shutil.rmtree(DIST)
|
||||
DIST.mkdir(parents=True)
|
||||
|
||||
# Shared CSS (same path the source HTML expects: ``../_shared/styles.css``)
|
||||
(DIST / "_shared").mkdir()
|
||||
shutil.copy(SHARED, DIST / "_shared" / "styles.css")
|
||||
|
||||
# Per-page substitutions
|
||||
page_count = 0
|
||||
for src in HTML_PAGES:
|
||||
rel = src.relative_to(LANDING)
|
||||
dest = DIST / rel
|
||||
dest.parent.mkdir(parents=True, exist_ok=True)
|
||||
dest.write_text(_substitute(src.read_text(encoding="utf-8"), cfg), encoding="utf-8")
|
||||
page_count += 1
|
||||
|
||||
# Stamped supporting files
|
||||
(DIST / "robots.txt").write_text(_robots_txt(cfg))
|
||||
(DIST / "sitemap.xml").write_text(_stamp_sitemap(cfg), encoding="utf-8")
|
||||
(DIST / "404.html").write_text(_build_404_html(cfg), encoding="utf-8")
|
||||
(DIST / "favicon.svg").write_text(_favicon_svg())
|
||||
|
||||
# Final report
|
||||
print(f"\n✓ Built {page_count} HTML pages + sitemap + robots + 404 + favicon")
|
||||
print(f" Output: {DIST.relative_to(REPO)}/")
|
||||
print()
|
||||
print("Next steps:")
|
||||
print(" 1) wrangler pages deploy landing/dist # if you use Wrangler")
|
||||
print(" OR drag-and-drop landing/dist/ in the Cloudflare Pages dashboard")
|
||||
print(" 2) Configure custom domain on Cloudflare Pages → "
|
||||
f"{cfg['site_origin']}")
|
||||
print(" 3) Verify: open the deployed apex URL, click each persona "
|
||||
"card, click each demo iframe, click each buy button → Gumroad listing")
|
||||
print()
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
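One edge case in `_substitute` is worth noting: the `{{placeholder}}` tokens are replaced first, and the legacy regexes then run over the already-substituted text. If the configured Gumroad URL merely extends the legacy literal, the second pass matches inside the first pass's output. A single-pass replacement avoids that, as sketched below. `substitute_once` and the `datatools-pro` slug are hypothetical and used only for illustration.

```python
import re


def substitute_once(text: str, mapping: dict) -> str:
    """Replace every key of ``mapping`` in one left-to-right pass.

    Replaced text is never rescanned, so a value that contains another
    key (or extends it) cannot be substituted a second time.
    """
    # Longest keys first, so a longer literal wins over a shorter prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # A callable replacement also sidesteps backslash handling in re.sub.
    return pattern.sub(lambda m: mapping[m.group(0)], text)


# Hypothetical config value that extends the legacy literal URL.
new_url = "https://gumroad.com/l/datatools-pro"
mapping = {
    "{{gumroad_url}}": new_url,
    "https://gumroad.com/l/datatools": new_url,
}
html = (
    '<a href="{{gumroad_url}}?from=bookkeeper">Buy</a> '
    '<a href="https://gumroad.com/l/datatools?from=revops">Buy</a>'
)
result = substitute_once(html, mapping)
print(result)
```

With the two-pass approach, the first link would end in `datatools-pro-pro`; in the single pass both links point at the same listing and keep their `?from=<persona>` query.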
236
landing/index.html
Normal file
@@ -0,0 +1,236 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="utf-8" />
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
||||
<title>DataTools — Local CSV / Excel Cleaning for Shopify, Bookkeepers, and RevOps</title>
|
||||
<meta name="description" content="One desktop tool. Three workflows. Clean Shopify customer exports, reconcile messy bank statements, or dedupe lead lists across HubSpot and LinkedIn — all locally. $49 one-time." />
|
||||
<link rel="canonical" href="https://datatools.app/" />
|
||||
<link rel="stylesheet" href="_shared/styles.css" />
|
||||
|
||||
<meta property="og:title" content="DataTools — Local CSV / Excel Cleaning" />
|
||||
<meta property="og:description" content="One desktop tool, three niche workflows. Runs entirely offline. $49 one-time." />
|
||||
<meta property="og:type" content="website" />
|
||||
<meta property="og:url" content="https://datatools.app/" />
|
||||
|
||||
<style>
|
||||
/* Apex-page–only tweaks: persona cards are slightly bigger and use
|
||||
per-card accent borders so the visitor visually identifies which
|
||||
card matches their work in <2 seconds. */
|
||||
.persona-grid {
|
||||
display: grid; gap: 24px;
|
||||
grid-template-columns: repeat(auto-fit, minmax(300px, 1fr));
|
||||
margin-top: 28px;
|
||||
}
|
||||
.persona-card {
|
||||
background: var(--surface);
|
||||
border: 1px solid var(--rule);
|
||||
border-radius: var(--radius);
|
||||
padding: 28px;
|
||||
display: flex; flex-direction: column;
|
||||
transition: transform 0.08s ease, border-color 0.15s ease, box-shadow 0.2s ease;
|
||||
text-decoration: none;
|
||||
color: inherit;
|
||||
}
|
||||
.persona-card:hover {
|
||||
transform: translateY(-2px);
|
||||
border-color: var(--card-accent, var(--accent));
|
||||
box-shadow: var(--shadow);
|
||||
text-decoration: none;
|
||||
}
|
||||
.persona-card.shopify { --card-accent: #6ee7b7; }
|
||||
.persona-card.bookkeeper{ --card-accent: #7dd3fc; }
|
||||
.persona-card.revops { --card-accent: #c4b5fd; }
|
||||
.persona-card .pill {
|
||||
display: inline-block;
|
||||
background: rgba(255,255,255,0.04);
|
||||
color: var(--card-accent, var(--accent));
|
||||
border: 1px solid var(--card-accent, var(--accent));
|
||||
padding: 4px 10px; border-radius: 999px;
|
||||
font-size: 12px; font-weight: 600;
|
||||
letter-spacing: 0.04em;
|
||||
margin-bottom: 12px;
|
||||
align-self: flex-start;
|
||||
}
|
||||
.persona-card h3 {
|
||||
color: var(--text);
|
||||
font-size: 22px;
|
||||
margin-bottom: 12px;
|
||||
}
|
||||
.persona-card p {
|
||||
color: var(--text-soft);
|
||||
flex: 1;
|
||||
margin-bottom: 16px;
|
||||
}
|
||||
.persona-card .pain {
|
||||
font-size: 14px; color: var(--text-mute);
|
||||
margin: 8px 0 18px;
|
||||
}
|
||||
.persona-card .pain li { margin-bottom: 4px; }
|
||||
.persona-card .open {
|
||||
color: var(--card-accent, var(--accent));
|
||||
font-weight: 600;
|
||||
font-size: 15px;
|
||||
}
|
||||
.persona-card .open::after {
|
||||
content: " →";
|
||||
transition: margin-left 0.15s ease;
|
||||
}
|
||||
.persona-card:hover .open::after { margin-left: 4px; }
|
||||
</style>
|
||||
</head>
|
||||
<body>
|
||||
|
||||
<!-- Sticky brand bar (no buy CTA on the apex — visitor hasn't picked a niche yet) -->
|
||||
<div class="buybar">
|
||||
<div class="buybar-inner">
|
||||
<div class="brand"><span class="brand-mark">●</span> DataTools</div>
|
||||
<div>
|
||||
<span class="price-tag">Pick your workflow ↓</span>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<section class="hero">
|
||||
<div class="container">
|
||||
<div class="eyebrow">For Shopify operators · bookkeepers · marketing & RevOps agencies</div>
|
||||
<h1>Local CSV / Excel cleaning.<br /><strong>One tool. Three workflows.</strong></h1>
|
||||
<p class="lead">
|
||||
DataTools is a desktop app that fixes the data-cleaning headaches
|
||||
every small business hits — duplicates Excel can't catch,
|
||||
international phones it can't parse, dates and currencies in three
|
||||
different formats per export. One $49 download. Works on Mac,
|
||||
Windows, and Linux. <strong>Your data never leaves your
|
||||
computer.</strong>
|
||||
</p>
|
||||
|
||||
<div class="persona-grid">
|
||||
<a class="persona-card shopify" href="shopify-pet/">
|
||||
<span class="pill">🛍️ Shopify operator</span>
|
||||
<h3>Customer / vendor / subscriber export cleanup</h3>
|
||||
<p>
|
||||
Klaviyo-import-ready customer lists in 30 seconds. Catches
|
||||
cross-device duplicates, standardizes international phones
|
||||
and addresses, fixes the disguised nulls that break product
|
||||
feeds.
|
||||
</p>
|
||||
<ul class="pain">
|
||||
<li>· Fix Klaviyo per-contact billing on phantom dupes</li>
|
||||
<li>· Repair feeds rejected by Google Merchant / Meta</li>
|
||||
<li>· Unify orders from Shopify + Etsy + Amazon + Faire</li>
|
||||
<li>· Resolve VAT-MOSS country-name drift</li>
|
||||
</ul>
|
||||
<span class="open">Open the Shopify demo & pricing</span>
|
||||
</a>
|
||||
|
||||
<a class="persona-card bookkeeper" href="bookkeeper/">
|
||||
<span class="pill">📒 Bookkeeper / accountant</span>
|
||||
<h3>Bank-export reconciliation with audit trail</h3>
|
||||
<p>
|
||||
Catches the duplicate transaction QuickBooks imported twice
|
||||
when Jan and Feb exports overlap. Standardizes dates,
|
||||
amounts, and vendor casing. Hands you a row-level audit log
|
||||
to share with the client.
|
||||
</p>
|
||||
<ul class="pain">
|
||||
<li>· Catch month-overlap re-import dupes</li>
|
||||
<li>· Consolidate vendors for clean 1099 reports</li>
|
||||
<li>· Produce hand-off-ready audit trail</li>
|
||||
<li>· Multi-currency books (EUR / GBP / BRL)</li>
|
||||
</ul>
|
||||
<span class="open">Open the bookkeeper demo & pricing</span>
|
||||
</a>
|
||||
|
||||
<a class="persona-card revops" href="revops/">
|
||||
<span class="pill">🪢 Marketing / RevOps</span>
|
||||
<h3>Lead-list dedup across HubSpot, LinkedIn, scrapes</h3>
|
||||
<p>
|
||||
One canonical lead per real person — across HubSpot,
|
||||
LinkedIn, Apollo, ZoomInfo, and manual scrapes.
|
||||
International phones (50+ country codes), per-row country
|
||||
column, fuzzy match with merge.
|
||||
</p>
|
||||
<ul class="pain">
|
||||
<li>· Stop paying HubSpot tier price for cross-source dupes</li>
|
||||
<li>· Protect sender reputation from invalid emails</li>
|
||||
<li>· Skip the 4–8 wk GDPR review on cloud cleaners</li>
|
||||
<li>· Suppression-list sync across 5+ platforms</li>
|
||||
</ul>
|
||||
<span class="open">Open the RevOps demo & pricing</span>
|
||||
</a>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<div class="container">
|
||||
<div class="eyebrow">What's the same across all three</div>
|
||||
<h2>One engine. Same six tools. Same $49.</h2>
|
||||
<p>
|
||||
The persona pages above are positioning, not different products.
|
||||
Whichever you buy, you get the full bundle: Deduplicator, Text
|
||||
Cleaner, Format Standardizer, Missing-Value Handler, Column
|
||||
Mapper, and Pipeline Runner — pre-tuned with a saved pipeline
|
||||
that matches your workflow.
|
||||
</p>
|
||||
<div class="grid">
|
||||
<div class="card">
|
||||
<span class="icon">🔒</span>
|
||||
<h3>Local-first</h3>
|
||||
<p>Desktop app. No cloud upload, no SaaS account, no subscription. Verify zero outbound calls in your browser's network tab.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">📋</span>
|
||||
<h3>Auditable</h3>
|
||||
<p>Every cell change is logged with the original value, the new value, and which rule fired. Hand the audit CSV to your client.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">🌍</span>
|
||||
<h3>International</h3>
|
||||
<p>50+ country codes, per-row country awareness, EU comma decimals, parens-negative amounts, locale-aware month names.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">⚙️</span>
|
||||
<h3>Repeatable</h3>
|
||||
<p>Save your cleanup as a JSON pipeline. Re-run on next week's export with one CLI command. Same cleanup, zero re-config.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">📦</span>
|
||||
<h3>Cross-platform</h3>
|
||||
<p>Mac · Windows · Linux installers. Code-signed for macOS Gatekeeper. Free updates for the v1.x line.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">💰</span>
|
||||
<h3>$49 one-time</h3>
|
||||
<p>No subscription. No per-client license. No row caps. No AI black-box.</p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<div class="container" style="text-align: center;">
|
||||
<h2>Pick your workflow above to try the live demo.</h2>
|
||||
<p class="muted">Or read the docs first — every tool has a CLI, every pipeline is JSON, every change is audited.</p>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<footer>
|
||||
<div class="container">
|
||||
<div>
|
||||
<p><strong>DataTools</strong> — local data-cleaning for Shopify, bookkeepers, and RevOps teams.</p>
|
||||
<p class="muted">© 2026 · Built solo · Shipped from a small office.</p>
|
||||
</div>
|
||||
<div>
|
||||
<p>
|
||||
<a href="shopify-pet/">For Shopify operators</a> ·
|
||||
<a href="bookkeeper/">For bookkeepers</a> ·
|
||||
<a href="revops/">For RevOps agencies</a><br />
|
||||
<a href="mailto:hello@datatools.app">Email support</a>
|
||||
</p>
|
||||
</div>
|
||||
</div>
|
||||
</footer>
|
||||
|
||||
</body>
|
||||
</html>
|
||||
352
landing/revops/index.html
Normal file
@@ -0,0 +1,352 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="utf-8" />
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
||||
<title>DataTools for RevOps — Dedupe Lead Lists Across HubSpot, LinkedIn, and Manual Scrapes · $49</title>
|
||||
<meta name="description" content="One tool to dedupe lead lists across HubSpot, LinkedIn, and manual scrapes. International phones (50+ country codes), per-row country normalization, fuzzy match across vendors, fully offline. $49 one-time." />
|
||||
<meta name="keywords" content="dedupe lead list, hubspot deduplicate, linkedin lead cleanup, marketing data cleaning, revops csv tool, multi-vendor lead unification, international phone normalization" />
|
||||
<link rel="canonical" href="https://datatools.app/revops/" />
|
||||
<link rel="stylesheet" href="../_shared/styles.css" />
|
||||
|
||||
<!-- Persona accent: RevOps → vivid violet -->
|
||||
<style>
|
||||
:root {
|
||||
--accent: #c4b5fd;
|
||||
--accent-ink: #2e1065;
|
||||
}
|
||||
</style>
|
||||
|
||||
<meta property="og:title" content="DataTools for RevOps — Dedupe Lead Lists Across HubSpot, LinkedIn, and Manual Scrapes" />
|
||||
<meta property="og:description" content="International phones, country normalization, fuzzy dedup with merge — one tool, no upload. $49 one-time." />
|
||||
<meta property="og:type" content="product" />
|
||||
<meta property="og:url" content="https://datatools.app/revops/" />
|
||||
|
||||
<script type="application/ld+json">
|
||||
{
|
||||
"@context": "https://schema.org",
|
||||
"@type": "SoftwareApplication",
|
||||
"name": "DataTools for RevOps",
|
||||
"operatingSystem": "Windows, macOS, Linux",
|
||||
"applicationCategory": "BusinessApplication",
|
||||
"offers": {
|
||||
"@type": "Offer",
|
||||
"price": "49",
|
||||
"priceCurrency": "USD"
|
||||
},
|
||||
"description": "Dedupe and unify lead lists across CRM, scraping, and manual sources. International phone normalization, per-row country, fuzzy match with merge. Six-tool data-cleaning bundle for RevOps and marketing agencies.",
|
||||
"softwareVersion": "1.0"
|
||||
}
|
||||
</script>
|
||||
</head>
|
||||
<body>
|
||||
|
||||
<div class="buybar">
|
||||
<div class="buybar-inner">
|
||||
<div class="brand"><span class="brand-mark">●</span> DataTools <span class="muted">/ for RevOps</span></div>
|
||||
<div>
|
||||
<span class="price-tag">$49 — one-time, no subscription</span>
|
||||
<a class="btn" href="https://gumroad.com/l/datatools?from=revops" rel="noopener">Get DataTools →</a>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<section class="hero">
|
||||
<div class="container">
|
||||
<div class="eyebrow">For RevOps · marketing ops · agency lead-gen · audience-builders</div>
|
||||
<h1>Dedupe lead lists across HubSpot, LinkedIn,<br /><strong>and manual scrapes — locally.</strong></h1>
|
||||
<p class="lead">
|
||||
The same prospect shows up as <code>alice@acme.com</code> in HubSpot,
|
||||
<code>Alice.Johnson@acme.com</code> in LinkedIn Sales Navigator, and
|
||||
<code>alice@acme.com</code> again from your VA's manual scrape. Their
|
||||
phone is <code>(415) 555-1234</code> in one source and
|
||||
<code>4155551234</code> in another. DataTools fuzzy-matches across
|
||||
sources, normalizes phones to E.164 with per-row country awareness,
|
||||
and produces one canonical lead per real person — without uploading
|
||||
a single contact to a third-party tool.
|
||||
</p>
|
||||
<div class="cta-row">
|
||||
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=revops" rel="noopener">Get DataTools — $49 →</a>
|
||||
<a class="btn btn-ghost btn-large" href="#demo">Try the live demo ↓</a>
|
||||
<span class="price-note">One-time payment · cross-platform · runs offline</span>
|
||||
</div>
|
||||
<div class="stats">
|
||||
<div class="stat"><div class="num">50+</div><div class="label">country codes</div></div>
|
||||
<div class="stat"><div class="num">3</div><div class="label">CRM sources unified</div></div>
|
||||
<div class="stat"><div class="num">0</div><div class="label">cloud uploads ever</div></div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<!-- ============= Pain points ============= -->
|
||||
<section>
|
||||
<div class="container">
|
||||
<div class="eyebrow">If your last campaign launch was held up by data hygiene</div>
|
||||
<h2>Six pains DataTools fixes before you import to HubSpot</h2>
|
||||
<div class="grid">
|
||||
<div class="card">
|
||||
<span class="icon">💸</span>
|
||||
<h3>HubSpot / Marketo / Iterable bills you for every duplicate contact</h3>
|
||||
<p>10 k contacts → enterprise tier at $4–8 k/mo. 18 % cross-source duplicate rate from Apollo + ZoomInfo + LinkedIn means you're at 8.2 k unique people but paying for 10 k. Every month. Forever.</p>
|
||||
<p class="muted"><strong>What it costs:</strong> $200–$800 per 1 k duplicate contacts — recurring, every month.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">🚫</span>
|
||||
<h3>Sender reputation tanks when you mail to invalid or duplicate addresses</h3>
|
||||
<p>One bad sending session — to addresses your team scraped or imported without hygiene — and your domain reputation takes weeks to recover. Your good campaigns sit in spam folders during the recovery.</p>
|
||||
<p class="muted"><strong>What it costs:</strong> catastrophic — entire email programme degraded for 2–6 weeks.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">⚖️</span>
|
||||
<h3>GDPR makes uploading to a cloud cleaner a legal-review marathon</h3>
|
||||
<p>Every cloud-based lead-cleaner needs you to upload your prospect list. Your legal team needs 4–8 weeks to bless that. DataTools is desktop-only — no upload, no DPA, no review, no delay.</p>
|
||||
<p class="muted"><strong>What it costs:</strong> 4–8 weeks of legal-review delay per tool, every time.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">🪢</span>
|
||||
<h3>Apollo + ZoomInfo + LinkedIn + manual scrapes all use different schemas</h3>
|
||||
<p>Each export has its own column names, scoring scale, country format. Unifying them by hand for one campaign costs 1–3 days. Doing it for every campaign is unsustainable.</p>
|
||||
<p class="muted"><strong>What it costs:</strong> 1–3 days per campaign of manual unification + judgement calls that drift across team members.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">🛡️</span>
|
||||
<h3>Suppression lists across 5+ marketing platforms get out of sync</h3>
|
||||
<p>Each platform has its own suppression format. Out-of-sync lists let opted-out contacts slip through, triggering CAN-SPAM / GDPR exposure and the kind of "we got a complaint" email no one wants.</p>
|
||||
<p class="muted"><strong>What it costs:</strong> compliance risk + churn-back cost + stakeholder trust.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">📞</span>
|
||||
<h3>International dialer fails because phone formats vary</h3>
|
||||
<p>A calling list spanning 15 countries with mixed formats means the dialer rejects 8–15 % of numbers, and your reps spend the day on "number invalid" tones instead of conversations.</p>
|
||||
<p class="muted"><strong>What it costs:</strong> rep productivity × failure rate × team size.</p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section id="demo">
|
||||
<div class="container">
|
||||
<div class="eyebrow">Live demo · runs in your browser</div>
|
||||
<h2>Try it on a real-looking 3-vendor lead list</h2>
|
||||
<p>
|
||||
The demo below loads a 25-row lead worksheet combining HubSpot,
|
||||
LinkedIn Sales Navigator, and manual scraping — with the same prospect
|
||||
appearing in two or three sources, country names spelled three
|
||||
different ways (<code>USA</code>, <code>US</code>, <code>United
|
||||
States</code>), and 13 different international phone formats. Click
|
||||
<strong>Run pipeline</strong> and watch the 5-step pipeline (text
|
||||
clean → format → missing → column map → dedup) collapse 25 rows to 19
|
||||
with a single canonical record per prospect.
|
||||
</p>
|
||||
<div class="demo-frame">
|
||||
<iframe
|
||||
src="https://demo.datatools.app/?p=revops"
|
||||
loading="lazy"
|
||||
title="DataTools live demo — RevOps"
|
||||
sandbox="allow-scripts allow-same-origin allow-downloads allow-forms"></iframe>
|
||||
<div class="demo-caption">
|
||||
Demo runs on free hosting. Capped at 100 input rows · output
|
||||
watermarked. The paid product has no caps and runs entirely offline.
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<div class="container">
|
||||
<div class="eyebrow">Built for the agency RevOps day</div>
|
||||
<h2>Three workflows you do every campaign</h2>
|
||||
<div class="grid">
|
||||
<div class="card">
|
||||
<span class="icon">🪢</span>
|
||||
<h3>Email-list dedup across lead sources</h3>
|
||||
<p>HubSpot exports + LinkedIn Sales Navigator + the VA's spreadsheet, all merged. Fuzzy match across email + phone + name catches the cross-source duplicates that broke your last campaign send.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">🌍</span>
|
||||
<h3>Multi-platform audience reconciliation</h3>
|
||||
<p>Build one canonical audience from Meta, Google Ads, LinkedIn, and your CRM. Each platform exports a different shape; column-mapper aligns them all, dedup merges the survivors with their most-complete fields.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">🛡️</span>
|
||||
<h3>Suppression-list management</h3>
|
||||
<p>Suppression lists need to dedupe across email + phone + first-party identifiers. Add a row, dedupe, ship the canonical CSV to every platform — without uploading the suppression list to any of them.</p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<div class="container">
|
||||
<div class="eyebrow">If your campaigns target outside the US — almost everyone's do</div>
|
||||
<h2>50+ country codes. Per-row country awareness.</h2>
|
||||
<p>
|
||||
Your HubSpot list has <code>(415) 555-1234</code>. Your scraped
|
||||
list for the same prospect has <code>+1 415 555 1234</code>. Your
|
||||
Italian prospect entered <code>+39 06 6982</code>. Your Brazilian
|
||||
lead has <code>11 3071 0000</code>. Each comes from a row tagged
|
||||
with its country — DataTools reads that column per row and parses
|
||||
every phone correctly to E.164.
|
||||
</p>
|
||||
<ul class="bullets">
|
||||
<li><strong>Per-row country column</strong> drives the parser, so there is no global default that buckets UK numbers as malformed US.</li>
|
||||
<li><strong>Country-name normalization</strong>: <code>USA</code> / <code>US</code> / <code>United States</code> all resolve to the same ISO-2 code.</li>
|
||||
<li><strong>50+ country support</strong> via Google's libphonenumber, including KR, CN, IN, MX, BR, IL, TR, PL, DK, SE.</li>
|
||||
<li><strong>Schema enforcement</strong> via the column-mapper: project to your CRM's required shape, coerce score columns to integers, reorder fields to match the import contract.</li>
|
||||
</ul>
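<!--
The country-name normalization described in the bullets above can be sketched
as a plain alias lookup. This is a minimal illustration, not the DataTools
implementation: COUNTRY_ALIASES and normalize_country are hypothetical names,
and the table only covers a few of the spellings mentioned on these pages.

```python
from typing import Optional

# Hypothetical alias table (an assumption for illustration).
# Keys are lowercased spellings with dots removed and whitespace collapsed.
COUNTRY_ALIASES = {
    "us": "US",
    "usa": "US",
    "united states": "US",
    "uk": "GB",
    "gb": "GB",
    "united kingdom": "GB",
    "jp": "JP",
    "japan": "JP",
}


def normalize_country(raw: str) -> Optional[str]:
    """Map a free-text country value to an ISO 3166-1 alpha-2 code.

    Returns None when the spelling is not in the alias table, so the
    caller can decide how to treat unknown values.
    """
    key = " ".join(raw.replace(".", "").lower().split())
    return COUNTRY_ALIASES.get(key)


if __name__ == "__main__":
    for value in ["USA", "US", "United States", "U.K.", "Atlantis"]:
        print(repr(value), normalize_country(value))
```

Returning None for unknown spellings, rather than guessing, keeps the
per-row phone parser from being handed a wrong region.
-->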
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<div class="container">
|
||||
<div class="eyebrow">For platforms that charge per contact</div>
|
||||
<h2>Every duplicate you don't catch costs you for the life of the contract.</h2>
|
||||
<p>
|
||||
HubSpot prices on contacts. Klaviyo prices on contacts. Marketo,
|
||||
Iterable, ActiveCampaign — all priced on contacts. Every duplicate
|
||||
you don't catch is a recurring tax on your campaign. DataTools
|
||||
catches them once, before import, with a fuzzy matcher that's
|
||||
tuned to the cross-source noise you actually see.
|
||||
</p>
|
||||
<div class="callout">
|
||||
<strong>Real numbers from the demo:</strong> 25 input rows from
|
||||
three sources collapse to 19 — that's 6 duplicates the cross-source
|
||||
noise was hiding, a 24 % duplicate rate. On a 50,000-row campaign list,
the same ratio would mean 12,000 fewer contacts to pay for, every month.
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<div class="container">
|
||||
<div class="eyebrow">The thing every cloud cleaner can't say</div>
|
||||
<h2>Your prospects' contact info never leaves your computer.</h2>
|
||||
<p>
|
||||
Cloud lead-cleaning tools require you to upload your audience.
|
||||
That audience is your single most valuable agency asset — and once
|
||||
it's on someone else's server, your client's privacy story is
|
||||
no longer in your hands. DataTools is a desktop app. There is no
|
||||
upload step.
|
||||
</p>
|
||||
<div class="terminal"><span class="prompt">$</span> python -m src.cli_pipeline campaign_q1.csv --pipeline revops_pipeline.json --apply
|
||||
Reading campaign_q1.csv...
|
||||
53,802 rows, 14 columns
|
||||
Executing pipeline:
|
||||
<span class="ok">✓</span> text_clean (160 ms) {cells_changed: 8,205}
|
||||
<span class="ok">✓</span> format_standardize (1.4 s) {cells_changed: 41,889 — 50 country codes}
|
||||
<span class="ok">✓</span> missing (140 ms) {sentinels_standardized: 6,710}
|
||||
<span class="ok">✓</span> column_map (220 ms) {columns_renamed: 4, columns_added: 1}
|
||||
<span class="ok">✓</span> dedup (4.8 s) {duplicates_removed: 12,344, merged: 12,344}
|
||||
|
||||
Initial rows: 53,802 → Final rows: 41,458
|
||||
Total elapsed: 6.7 s
|
||||
<span class="prompt">$</span> # 12,344 fewer contacts to pay for, for $49.</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<div class="container">
|
||||
<div class="eyebrow">In the bundle</div>
|
||||
<h2>Six tools. One pipeline. One $49 download.</h2>
|
||||
<div class="grid">
|
||||
<div class="card"><h3>1 · Deduplicator</h3><p>Fuzzy match across email + phone + name + company; merge survivors with most-complete fields.</p></div>
|
||||
<div class="card"><h3>2 · Text Cleaner</h3><p>Smart quotes from copy-paste, NBSP from spreadsheet exports, BOM from Excel.</p></div>
|
||||
<div class="card"><h3>3 · Format Standardizer</h3><p>E.164 phones with per-row country, canonical emails, name casing, ISO dates.</p></div>
|
||||
<div class="card"><h3>4 · Missing Value Handler</h3><p>Detect <code>TBD</code>, <code>(unknown)</code>, <code>—</code> across vendor exports.</p></div>
|
||||
<div class="card"><h3>5 · Column Mapper</h3><p>Project to your CRM's required schema, coerce score to integer, reorder for import.</p></div>
|
||||
<div class="card"><h3>6 · Pipeline Runner</h3><p>Save the cleanup as JSON. Drop next campaign's combined export on it. Same dedup, automated.</p></div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<div class="container">
|
||||
<div class="eyebrow">Pricing — pay once, own it</div>
|
||||
<h2>$49. No subscription. No per-campaign fee.</h2>
|
||||
<div class="pricing">
|
||||
<div class="card featured">
|
||||
<div class="row"><div class="price">$49</div><div class="price-suffix">one-time</div></div>
|
||||
<h3>DataTools for RevOps</h3>
|
||||
<ul>
|
||||
<li>All 6 tools, full pipeline</li>
|
||||
<li>Mac · Windows · Linux installers</li>
|
||||
<li>Code-signed (no Gatekeeper warnings)</li>
|
||||
<li>Free updates for the v1.x line</li>
|
||||
<li>Bonus: 3-source unification pipeline preset</li>
|
||||
<li><strong>Use on any number of clients</strong> — no seat limits</li>
|
||||
</ul>
|
||||
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=revops" rel="noopener">Buy on Gumroad →</a>
|
||||
</div>
|
||||
<div class="card">
|
||||
<div class="row"><div class="price">$149</div><div class="price-suffix">one-time</div></div>
|
||||
<h3>Full DataTools Suite</h3>
|
||||
<p class="muted">Available when 3+ bundles ship. Includes everything in the RevOps pack plus the Shopify and Bookkeeper bundles. Save $48.</p>
|
||||
<a class="btn btn-ghost btn-large" href="#" aria-disabled="true">Coming when ready</a>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<div class="container">
|
||||
<h2>Questions</h2>
|
||||
|
||||
<details class="faq">
|
||||
<summary>Does this replace HubSpot's deduplication?</summary>
|
||||
<p>No — it cleans data <em>before</em> import to HubSpot (or LinkedIn, Marketo, Klaviyo, etc.). HubSpot's dedup runs on already-imported contacts; DataTools catches duplicates that haven't yet cost you a contract slot.</p>
|
||||
</details>
|
||||
|
||||
<details class="faq">
|
||||
<summary>Does it handle international phones correctly?</summary>
|
||||
<p>Yes, via Google's libphonenumber, with 50+ country codes. The killer feature is per-row country: point DataTools at your country column (any column with values like <code>US</code>, <code>USA</code>, <code>United States</code>, <code>+1</code>, <code>JP</code>, <code>Japan</code>) and it parses each row's phone in that row's own region. No more UK numbers bucketed as malformed US.</p>
|
||||
</details>
|
||||
|
||||
<details class="faq">
|
||||
<summary>Can I use it on multiple clients without paying again?</summary>
|
||||
<p>Yes. The licence is per-operator, not per-client. Run it on every agency client's lead list for the same $49.</p>
|
||||
</details>
|
||||
|
||||
<details class="faq">
|
||||
<summary>How does fuzzy match work across columns?</summary>
|
||||
<p>Out of the box, the dedup engine builds default strategies based on column names — typically email + phone with exact match, name with Jaro-Winkler at 85%. You can override via JSON: pick which columns to match on, which algorithm, and what threshold. Strategies are stored in the saved pipeline, so the next campaign uses the same rules.</p>
|
||||
</details>
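<!--
The default strategy in the answer above (exact match on email and phone,
Jaro-Winkler at 85 % on name) can be sketched in plain Python. This is a
minimal illustration, not the DataTools engine: the function names, the
record shape (dicts with "email", "phone", "name" keys), and the choice to
treat any one matching field as a duplicate are all assumptions.

```python
import re


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def is_duplicate(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Assumed rule: same email, or same phone digits, or similar name."""
    email_a = a.get("email", "").strip().lower()
    email_b = b.get("email", "").strip().lower()
    if email_a and email_a == email_b:
        return True
    # Digits-only comparison; real input would be normalized to E.164 first.
    phone_a = re.sub(r"\D", "", a.get("phone", ""))
    phone_b = re.sub(r"\D", "", b.get("phone", ""))
    if phone_a and phone_a == phone_b:
        return True
    name_a = a.get("name", "").strip().lower()
    name_b = b.get("name", "").strip().lower()
    return bool(name_a and name_b) and jaro_winkler(name_a, name_b) >= threshold
```

Matching on name alone is aggressive for common names; a stricter variant
would require a name match plus one other agreeing field.
-->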
|
||||
|
||||
<details class="faq">
|
||||
<summary>What does the audit trail look like?</summary>
|
||||
<p>A row-by-row CSV: every modified cell with its original value, new value, and which rule fired. A separate JSON file describes the pipeline that produced it. Together they reproduce the cleanup deterministically — your client can verify it on their machine.</p>
|
||||
</details>
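<!--
The row-level audit trail described above (original value, new value, and
the rule that fired for every modified cell) can be sketched as follows.
This is a minimal illustration under stated assumptions, not the DataTools
audit format: the two rules, the function names, and the audit column names
("row", "column", "original", "new", "rule") are all hypothetical.

```python
import csv
import io


def strip_whitespace(value: str) -> str:
    return value.strip()


def lowercase_email(value: str) -> str:
    # Only touch values that look like an email address.
    return value.lower() if "@" in value else value


# Hypothetical rule list: (rule name recorded in the audit, function).
RULES = [
    ("strip_whitespace", strip_whitespace),
    ("lowercase_email", lowercase_email),
]

AUDIT_FIELDS = ["row", "column", "original", "new", "rule"]


def clean_with_audit(rows):
    """Apply every rule to every cell and log each change that occurs."""
    cleaned, audit = [], []
    for row_index, row in enumerate(rows):
        new_row = dict(row)
        for column in row:
            for rule_name, rule in RULES:
                before = new_row[column]
                after = rule(before)
                if after != before:
                    audit.append(
                        {
                            "row": row_index,
                            "column": column,
                            "original": before,
                            "new": after,
                            "rule": rule_name,
                        }
                    )
                    new_row[column] = after
        cleaned.append(new_row)
    return cleaned, audit


def audit_to_csv(audit) -> str:
    """Serialize audit entries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=AUDIT_FIELDS)
    writer.writeheader()
    writer.writerows(audit)
    return buffer.getvalue()
```

Because each entry stores the value before and after a single rule, a cell
changed by two rules produces two audit rows, which keeps every step
traceable.
-->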
|
||||
|
||||
<details class="faq">
|
||||
<summary>What's your refund policy?</summary>
|
||||
<p>Try the live demo above on the sample dataset before you buy. If DataTools doesn't fit your workflow within 14 days, email for a refund — no questions asked.</p>
|
||||
</details>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<div class="container" style="text-align: center;">
|
||||
<h2>Stop paying twice for the same contact.</h2>
|
||||
<p class="lead" style="margin: 0 auto 28px;">One $49 download. Catches the cross-source duplicates HubSpot and LinkedIn can't see, normalizes phones for 50+ countries, and saves a pipeline you can re-run on next campaign's combined list.</p>
|
||||
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=revops" rel="noopener">Get DataTools — $49 →</a>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<footer>
|
||||
<div class="container">
|
||||
<div>
|
||||
<p><strong>DataTools</strong> — local data-cleaning for Shopify, bookkeepers, and RevOps teams.</p>
|
||||
<p class="muted">© 2026 · Built solo · Shipped from a small office.</p>
|
||||
</div>
|
||||
<div>
|
||||
<p>
|
||||
<a href="../shopify-pet/">For Shopify operators</a> ·
|
||||
<a href="../bookkeeper/">For bookkeepers</a><br />
|
||||
<a href="https://gumroad.com/l/datatools?from=revops">Buy on Gumroad</a> ·
|
||||
<a href="mailto:hello@datatools.app">Email support</a>
|
||||
</p>
|
||||
</div>
|
||||
</div>
|
||||
</footer>
|
||||
|
||||
</body>
|
||||
</html>
|
||||
381
landing/shopify-pet/index.html
Normal file
@@ -0,0 +1,381 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="utf-8" />
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1" />
|
||||
<title>DataTools for Shopify — Clean Customer & Product Exports Locally · $49</title>
|
||||
<meta name="description" content="Clean Shopify customer, product, and subscriber exports — locally. Klaviyo-import-ready in 30 seconds. Catches duplicates Excel misses. Your data never leaves your computer. $49 one-time." />
|
||||
<meta name="keywords" content="shopify customer cleanup, shopify csv cleaner, shopify product feed cleaner, klaviyo deduplicate, shopify customer dedup tool, shopify pet supplies" />
|
||||
<link rel="canonical" href="https://datatools.app/shopify-pet/" />
|
||||
<link rel="stylesheet" href="../_shared/styles.css" />
|
||||
|
||||
<!-- Persona accent: Shopify pet → mint green (default in shared sheet) -->
|
||||
|
||||
<!-- Open Graph -->
|
||||
<meta property="og:title" content="DataTools for Shopify — Clean Customer & Product Exports Locally" />
|
||||
<meta property="og:description" content="Klaviyo-import-ready in 30 seconds. Local. No upload. $49 one-time." />
|
||||
<meta property="og:type" content="product" />
|
||||
<meta property="og:url" content="https://datatools.app/shopify-pet/" />
|
||||
|
||||
<!-- Schema.org SoftwareApplication -->
|
||||
<script type="application/ld+json">
|
||||
{
|
||||
"@context": "https://schema.org",
|
||||
"@type": "SoftwareApplication",
|
||||
"name": "DataTools for Shopify",
|
||||
"operatingSystem": "Windows, macOS, Linux",
|
||||
"applicationCategory": "BusinessApplication",
|
||||
"offers": {
|
||||
"@type": "Offer",
|
||||
"price": "49",
|
||||
"priceCurrency": "USD"
|
||||
},
|
||||
"description": "Clean Shopify customer, product, and subscriber CSV exports locally. Six-tool data-cleaning bundle: dedupe, text-clean, format-standardize, missing-value handle, column-map, pipeline.",
|
||||
"softwareVersion": "1.0"
|
||||
}
|
||||
</script>
|
||||
</head>
|
||||
<body>
|
||||
|
||||
<!-- ============= Sticky buy bar ============= -->
|
||||
<div class="buybar">
|
||||
<div class="buybar-inner">
|
||||
<div class="brand"><span class="brand-mark">●</span> DataTools <span class="muted">/ for Shopify</span></div>
|
||||
<div>
|
||||
<span class="price-tag">$49 — one-time, no subscription</span>
|
||||
<a class="btn" href="https://gumroad.com/l/datatools?from=shopify-pet" rel="noopener">Get DataTools →</a>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- ============= Hero ============= -->
|
||||
<section class="hero">
|
||||
<div class="container">
|
||||
<div class="eyebrow">For Shopify operators · pet supplies · subscription stores · DTC</div>
|
||||
<h1>Klaviyo-import-ready customer lists.<br /><strong>In 30 seconds. Locally.</strong></h1>
|
||||
<p class="lead">
|
||||
Your Shopify customer export is a mess of formatting drift, disguised
|
||||
duplicates, and inconsistent phone numbers. DataTools fixes all of it
|
||||
in one pass — fuzzy-dedupes the same customer Klaviyo would charge
|
||||
you for twice, standardises phones across your international
|
||||
subscribers, and hands you a cleaned CSV. <strong>Your data never
|
||||
leaves your computer.</strong>
|
||||
</p>
|
||||
<div class="cta-row">
|
||||
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=shopify-pet" rel="noopener">Get DataTools — $49 →</a>
|
||||
<a class="btn btn-ghost btn-large" href="#demo">Try the live demo ↓</a>
|
||||
<span class="price-note">One-time payment · cross-platform · runs offline</span>
|
||||
</div>
|
||||
<div class="stats">
|
||||
<div class="stat"><div class="num">6</div><div class="label">tools, one bundle</div></div>
|
||||
<div class="stat"><div class="num">1 GB</div><div class="label">customer file in 2.5 min</div></div>
|
||||
<div class="stat"><div class="num">0</div><div class="label">cloud uploads ever</div></div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<!-- ============= Pain points ============= -->
|
||||
<section>
|
||||
<div class="container">
|
||||
<div class="eyebrow">If any of these sound like your Tuesday</div>
|
||||
<h2>Six pains DataTools fixes in one pass</h2>
|
||||
<div class="grid">
|
||||
<div class="card">
|
||||
<span class="icon">💸</span>
|
||||
<h3>Klaviyo / Mailchimp / Omnisend bills you for every duplicate</h3>
|
||||
<p>The same customer signs up three times: once with a typo, once with a plus-tag, once on mobile. Your subscriber list has a 10–18 % duplicate rate and you're paying for every one of them, every month, forever.</p>
|
||||
<p class="muted"><strong>What it costs:</strong> $30–$300/mo per percent of dupes on a 50 k list, recurring.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">📵</span>
|
||||
<h3>Your product feed got rejected by Google Merchant Center</h3>
|
||||
<p>Smart quotes from a copy-paste in product titles. NBSP in SKU. Inconsistent attribute casing. Feed bounces, the launch sits for 24–72 hours while you try to find the bad row in a 12,000-line CSV.</p>
|
||||
<p class="muted"><strong>What it costs:</strong> 1–3 days of delayed campaign × the campaign value.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">🪢</span>
|
||||
<h3>Orders from Shopify + Etsy + Amazon + Faire don't speak the same language</h3>
|
||||
<p>Each platform's export uses different column names for "customer email" / "ship country" / "order total." Merging takes hours of manual rename and copy-paste before the analysis can even begin.</p>
|
||||
<p class="muted"><strong>What it costs:</strong> 4–8 hours per month manually merging exports.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">🔁</span>
|
||||
<h3>Subscription churn looks higher than it is</h3>
|
||||
<p>Pet-box subscribers cancel, then re-sub three months later under a different email or device. Your cohort report says churn is 20 % when it's actually 12 %, so LTV is understated and your acquisition budget is set against the wrong number.</p>
|
||||
<p class="muted"><strong>What it costs:</strong> wrong CAC ceiling for the next year of paid ads.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">🌍</span>
|
||||
<h3>VAT MOSS / EU tax breaks because country is spelled three ways</h3>
|
||||
<p>Your UK customers are tagged <code>UK</code>, <code>U.K.</code>, and <code>United Kingdom</code> — all in one export. The VAT report aggregates them as three different markets. Compliance friction every quarter.</p>
|
||||
<p class="muted"><strong>What it costs:</strong> compliance risk + repeated manual normalization.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">🔒</span>
|
||||
<h3>Cloud cleaners want you to upload your customer list</h3>
|
||||
<p>Your customer list is your single most valuable business asset. Uploading it to a SaaS to clean it is the privacy story you do not want. DataTools is desktop-only — your list never leaves your computer.</p>
|
||||
<p class="muted"><strong>What it costs:</strong> nothing — and that's the point.</p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<!-- ============= Live demo ============= -->
|
||||
<section id="demo">
|
||||
<div class="container">
|
||||
<div class="eyebrow">Live demo · runs in your browser</div>
|
||||
<h2>Try it on a real-looking Shopify customer export</h2>
|
||||
<p>
|
||||
The demo below loads a sample 15-row Shopify customer file with
|
||||
pollution we've seen in actual stores: smart quotes from copy-paste,
|
||||
duplicates with email-case drift, international phones from the UK,
|
||||
Spain, Germany, Australia, and Japan, and the usual mess of
|
||||
<code>N/A</code> / <code>(blank)</code> / <code>?</code> sentinels.
|
||||
Click <strong>Run pipeline</strong> and watch every column get
|
||||
cleaned in under a second.
|
||||
</p>
|
||||
<div class="demo-frame">
|
||||
<iframe
|
||||
src="https://demo.datatools.app/?p=shopify-pet"
|
||||
loading="lazy"
|
||||
title="DataTools live demo — Shopify pet supplies"
|
||||
sandbox="allow-scripts allow-same-origin allow-downloads allow-forms"></iframe>
|
||||
<div class="demo-caption">
|
||||
Demo runs on free hosting (Streamlit Community Cloud). Capped at
|
||||
100 input rows · output watermarked with one trailing row. The
|
||||
paid product has no caps and runs entirely offline.
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<!-- ============= Built for Shopify ============= -->
|
||||
<section>
|
||||
<div class="container">
|
||||
<div class="eyebrow">Built for the Shopify operator</div>
|
||||
<h2>Six workflows you do every week</h2>
|
||||
<div class="grid">
|
||||
<div class="card">
|
||||
<span class="icon">🧹</span>
|
||||
<h3>Customer-list cleanup</h3>
|
||||
<p>Catches the same customer who shows up as <code>john@gmail.com</code>, <code>John@Gmail.com</code>, and <code>j.ohn@gmail.com</code>. Fuzzy match merges the spellings, exact match catches the obvious ones.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">📦</span>
|
||||
<h3>Product catalogue dedup</h3>
|
||||
<p>SKU whitespace, near-identical product names, copy-paste smart quotes in titles — gone. Audit log shows every change.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">🛒</span>
|
||||
<h3>Abandoned-cart hygiene</h3>
|
||||
<p>Before re-engagement: dedupe across email + phone, convert sentinel values to true blanks, and format dates so your sequence triggers fire correctly.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">📥</span>
|
||||
<h3>Subscriber-list import to Klaviyo</h3>
|
||||
<p>Klaviyo charges per contact. Every duplicate you don't catch costs you for the life of the subscription. Catch them once, pay once.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">🔗</span>
|
||||
<h3>Multi-channel order consolidation</h3>
|
||||
<p>Orders from Shopify + Etsy + a wholesale spreadsheet, each with a different column for "customer email." Column-mapper aligns them; dedup merges across channels.</p>
|
||||
</div>
|
||||
<div class="card">
|
||||
<span class="icon">⚙️</span>
|
||||
<h3>Repeatable pipeline</h3>
|
||||
<p>Save the cleanup as a JSON file. Drop next week's export on it. Same cleanup, zero re-configuration. Automatable via the CLI.</p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<!-- ============= Privacy moat ============= -->
|
||||
<section>
|
||||
<div class="container">
|
||||
<div class="eyebrow">The thing every cloud cleaner can't say</div>
|
||||
<h2>Your customer list never leaves your computer.</h2>
|
||||
<p>
|
||||
DataTools is a desktop app. There's no upload step, no SaaS account,
|
||||
no subscription, no "trust our security policy." The first thing you
|
||||
can do after install is open your browser's network tab, run the
|
||||
cleaner on your real customer file, and verify zero outbound
|
||||
requests.
|
||||
</p>
|
||||
<div class="callout">
|
||||
<strong>Why it matters for Shopify:</strong> your customer list is
|
||||
your single most valuable business asset. Cloud cleaners require
|
||||
you to upload it. We don't.
|
||||
</div>
|
||||
<div class="terminal"><span class="prompt">$</span> python -m src.cli_pipeline customers.csv --apply
|
||||
Reading customers.csv...
|
||||
47,832 rows, 14 columns
|
||||
Executing pipeline:
|
||||
<span class="ok">✓</span> text_clean (140 ms) {cells_changed: 12,408}
|
||||
<span class="ok">✓</span> format_standardize (810 ms) {cells_changed: 31,202}
|
||||
<span class="ok">✓</span> missing (95 ms) {sentinels_standardized: 8,129}
|
||||
<span class="ok">✓</span> dedup (3.1 s) {duplicates_removed: 2,347}
|
||||
|
||||
Initial rows: 47,832 → Final rows: 45,485
|
||||
Total elapsed: 4.2 s
|
||||
<span class="prompt">$</span> # zero network calls. zero. promise.</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<!-- ============= Audit moat ============= -->
|
||||
<section>
|
||||
<div class="container">
|
||||
<div class="eyebrow">For when your client asks "what changed?"</div>
|
||||
<h2>Every change auditable. Every cell logged.</h2>
|
||||
<p>
|
||||
Every modification is recorded with the original value, the new
|
||||
value, and which rule fired. Hand the audit CSV to your accountant,
|
||||
your marketing manager, or your boss along with the cleaned file.
|
||||
No <em>"I trust the AI"</em> hand-waving — they see exactly what
|
||||
happened.
|
||||
</p>
|
||||
<div class="callout">
|
||||
<strong>Real example:</strong> the demo above standardized 27
|
||||
cells across 20 customers. The audit log lists each one — row,
|
||||
column, before, after, which standardizer fired. The dedup audit
|
||||
lists every duplicate group with the survivor and its losers.
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<!-- ============= International ============= -->
|
||||
<section>
|
||||
<div class="container">
|
||||
<div class="eyebrow">If you sell internationally — most pet brands do</div>
|
||||
<h2>Phones, addresses, and currencies from anywhere on Earth.</h2>
|
||||
<p>
|
||||
Your subscriber from London entered her phone as <code>020 7946
|
||||
0958</code>. Your Tokyo customer entered <code>03-3210-7000</code>.
|
||||
Your German wholesale buyer wrote <code>€2.410,75</code>. Excel
|
||||
thinks all of them are mistakes. DataTools knows what country each
|
||||
row is from (per-row country column) and parses every one correctly
|
||||
to E.164 phones, ISO dates, and numeric amounts.
|
||||
</p>
|
||||
<ul class="bullets">
|
||||
<li><strong>50+ country codes</strong> via Google's libphonenumber.</li>
|
||||
<li><strong>Currency auto-detect</strong> for $ / £ / € / ¥ / R$ / kr / zł — including the EU comma-decimal that breaks Excel.</li>
|
||||
<li><strong>Address shape detection</strong> for US, UK, Canada, Germany, Australia.</li>
|
||||
<li><strong>Locale-aware month names</strong> in English, French, German.</li>
|
||||
</ul>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<!-- ============= What you get ============= -->
|
||||
<section>
|
||||
<div class="container">
|
||||
<div class="eyebrow">In the bundle</div>
|
||||
<h2>Six tools. One pipeline. One $49 download.</h2>
|
||||
<div class="grid">
|
||||
<div class="card"><h3>1 · Deduplicator</h3><p>Fuzzy match (Jaro-Winkler), 5 normalizers, survivor rules, interactive review.</p></div>
|
||||
<div class="card"><h3>2 · Text Cleaner</h3><p>Whitespace, smart chars, NBSP, BOM, line endings, case ops.</p></div>
|
||||
<div class="card"><h3>3 · Format Standardizer</h3><p>Dates, phones, emails, addresses, names, currencies, booleans.</p></div>
|
||||
<div class="card"><h3>4 · Missing Value Handler</h3><p>Disguised-null detection, profile, mean/median/mode/ffill, drop strategies.</p></div>
|
||||
<div class="card"><h3>5 · Column Mapper</h3><p>Fuzzy auto-rename, target schema, type coercion, required-field defaults.</p></div>
|
||||
<div class="card"><h3>6 · Pipeline Runner</h3><p>Chain tools in recommended order, save/load JSON, automate weekly cleanups.</p></div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<!-- ============= Pricing ============= -->
|
||||
<section>
|
||||
<div class="container">
|
||||
<div class="eyebrow">Pricing — pay once, own it</div>
|
||||
<h2>$49. No subscription. No ceiling on rows or files.</h2>
|
||||
<div class="pricing">
|
||||
<div class="card featured">
|
||||
<div class="row"><div class="price">$49</div><div class="price-suffix">one-time</div></div>
|
||||
<h3>DataTools for Shopify</h3>
|
||||
<ul>
|
||||
<li>All 6 tools, full pipeline</li>
|
||||
<li>Mac · Windows · Linux installers</li>
|
||||
<li>Code-signed (no Gatekeeper warnings)</li>
|
||||
<li>Free updates for the v1.x line</li>
|
||||
<li>Bonus: 3 ready-made Shopify pipelines</li>
|
||||
</ul>
|
||||
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=shopify-pet" rel="noopener">Buy on Gumroad →</a>
|
||||
</div>
|
||||
<div class="card">
|
||||
<div class="row"><div class="price">$149</div><div class="price-suffix">one-time</div></div>
|
||||
<h3>Full DataTools Suite</h3>
|
||||
<p class="muted">Available when 3+ bundles ship. Includes everything in the Shopify pack plus the Bookkeeper and RevOps bundles. Save $48.</p>
|
||||
<a class="btn btn-ghost btn-large" href="#" aria-disabled="true">Coming when ready</a>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<!-- ============= FAQ ============= -->
|
||||
<section>
|
||||
<div class="container">
|
||||
<h2>Questions</h2>
|
||||
|
||||
<details class="faq">
|
||||
<summary>Does this work with Shopify Plus?</summary>
|
||||
<p>Yes. The input is just CSV or Excel from any source, so Shopify Plus exports work the same as standard-plan exports or a Shopify-to-CSV pipeline you've stitched together yourself. The cleaner doesn't care.</p>
|
||||
</details>
|
||||
|
||||
<details class="faq">
|
||||
<summary>How does this compare to Excel's "Remove Duplicates"?</summary>
|
||||
<p>Excel does <em>exact</em> deduplication. <code>John@Gmail.com</code> and <code>john@gmail.com</code> are different customers to Excel. DataTools fuzzy-matches across case, whitespace, formatting, and even close-but-not-identical strings. The demo above merges 4 customer pairs Excel would leave duplicated.</p>
|
||||
</details>
|
||||
|
||||
<details class="faq">
|
||||
<summary>How big a file can it handle?</summary>
|
||||
<p>1 GB CSV with international phones + addresses processes in about 2.5 minutes on a typical workstation. Streaming mode keeps memory bounded regardless of input size — we tested it on 26 million rows.</p>
|
||||
</details>
|
||||
|
||||
<details class="faq">
|
||||
<summary>Do I need to know Python to use it?</summary>
|
||||
<p>No. The GUI is a browser interface that opens automatically when you double-click the app. It loads your file, you click Run, you download the cleaned file. The CLI is there for power users who want to script weekly cleanups.</p>
|
||||
</details>
|
||||
|
||||
<details class="faq">
|
||||
<summary>What about my privacy?</summary>
|
||||
<p>Your customer list never leaves your computer. There is no cloud component, no telemetry, no "anonymous usage stats." When the app is running you can confirm zero outbound network requests in your browser's developer tools.</p>
|
||||
</details>
|
||||
|
||||
<details class="faq">
|
||||
<summary>What's your refund policy?</summary>
|
||||
<p>Try the live demo above on the sample dataset before you buy. If DataTools still doesn't fit your workflow, email within 14 days of purchase for a refund, no questions asked.</p>
|
||||
</details>
|
||||
|
||||
<details class="faq">
|
||||
<summary>Will there be updates?</summary>
|
||||
<p>Yes. The v1.x line is included free for everyone who buys DataTools today. We ship a patch every 30 days adding country support, edge-case fixes, and small features.</p>
|
||||
</details>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<!-- ============= Final CTA ============= -->
|
||||
<section>
|
||||
<div class="container" style="text-align: center;">
|
||||
<h2>Stop deduplicating customers by hand.</h2>
|
||||
<p class="lead" style="margin: 0 auto 28px;">One $49 download. Mac, Windows, or Linux. Runs offline. Catches the duplicates Excel misses, standardizes the phones from your international customers, and saves a pipeline you can re-run on next week's export.</p>
|
||||
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=shopify-pet" rel="noopener">Get DataTools — $49 →</a>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<!-- ============= Footer ============= -->
|
||||
<footer>
|
||||
<div class="container">
|
||||
<div>
|
||||
<p><strong>DataTools</strong> — local data-cleaning for Shopify, bookkeepers, and RevOps teams.</p>
|
||||
<p class="muted">© 2026 · Built solo · Shipped from a small office.</p>
|
||||
</div>
|
||||
<div>
|
||||
<p>
|
||||
<a href="../bookkeeper/">For bookkeepers</a> ·
|
||||
<a href="../revops/">For RevOps agencies</a><br />
|
||||
<a href="https://gumroad.com/l/datatools?from=shopify-pet">Buy on Gumroad</a> ·
|
||||
<a href="mailto:hello@datatools.app">Email support</a>
|
||||
</p>
|
||||
</div>
|
||||
</div>
|
||||
</footer>
|
||||
|
||||
</body>
|
||||
</html>
|
||||
31
samples/demo/agency_combined_leads.csv
Normal file
@@ -0,0 +1,31 @@
|
||||
Lead ID,First Name,Last Name,Company,Title,Email,Phone,Country,Source,Score,Last Activity,Tags
|
||||
HUB-001,Alice,Johnson,Acme Corp,VP Marketing,alice@acme.com,(415) 555-1234,USA,HubSpot,87,2025-12-04,Enterprise
|
||||
HUB-002,bob,smith,Beta LLC,Director Growth,bob@beta.com,N/A,United States,HubSpot,N/A,2025-11-22,SMB
|
||||
HUB-003,Carlos,Garcia,Gamma Inc,CEO,carlos@gamma.io,+34 91 411 1111,Spain,HubSpot,82,2025-10-30,Enterprise
|
||||
HUB-004,DIANA,LEE,Delta Co,Marketing Manager,diana@delta.com,020 7946 0958,United Kingdom,HubSpot,74,2025-12-15,Mid-Market
|
||||
HUB-005,Eve,Martinez,Epsilon Group,VP Ops,eve@epsilon.com,(none),Mexico,HubSpot,(blank),2025-09-15,SMB
|
||||
LIN-006,Alice,Johnson,Acme Corporation,VP of Marketing,Alice.Johnson@acme.com,4155551234,US,LinkedIn,—,2025-12-04,Enterprise
|
||||
LIN-007,Frank,Brown,Foxtrot Ltd,Head Sales,frank@foxtrot.de,+49 30 12345678,Germany,LinkedIn,68,2025-12-01,Mid-Market
|
||||
LIN-008,Grace,Davis,Golf Industries,Marketing Lead,grace@golfind.com,+44 20 7946 0958,UK,LinkedIn,79,2025-11-08,Mid-Market
|
||||
LIN-009,henry,wilson,Hotel Logistics,COO,henry@hotellog.com,+86 10 1234 5678,China,LinkedIn,91,2025-12-12,Enterprise
|
||||
LIN-010,IVY CHEN,,India Tech,CTO,ivy@indiatech.in,+91 11 2345 6789,IN,LinkedIn,88,2025-11-30,Enterprise
|
||||
LIN-011,Jack,Taylor,Juliet & Co,Founder,jack@juliet.co,unknown,United States,LinkedIn,?,(unknown),SMB
|
||||
SCR-012,Diana,Lee,Delta Company,Marketing Manager,diana@delta.com,020-7946-0958,UK,Manual Scrape,74,12/15/2025,Mid-Market
|
||||
SCR-013,kate,o'neil,Kilo Ventures,Partner,kate@kilo.vc,+1 415 555 2222,USA,Manual Scrape,N/A,?,Investor
|
||||
SCR-014,Carlos,García,Gamma Incorporated,CEO,Carlos@gamma.io,+34-91-411-1111,Spain,Manual Scrape,82,Oct 30 2025,Enterprise
|
||||
SCR-015,Liam,Park,Lima Solutions,Director Marketing,liam@limasol.kr,+82 2 2287 0114,South Korea,Manual Scrape,77,2025-11-20,Enterprise
|
||||
SCR-016,Mia,nguyen,Mike Corp,VP Marketing,mia@mikecorp.com.au,02 9374 4000,Australia,Manual Scrape,72,2025-10-05,Mid-Market
|
||||
SCR-017,Noah,Brown,November Inc,Head of Growth,noah@november.com,(555) 444-5555,US,Manual Scrape,—,#N/A,SMB
|
||||
HUB-018,Frank,Brown,Foxtrot,Head of Sales,Frank@Foxtrot.de,+49-30-12345678,Germany,HubSpot,68,2025-12-01,Mid-Market
|
||||
HUB-019,Olivia,Rossi,Oscar Italia,CMO,olivia@oscar.it,+39 06 6982,Italy,HubSpot,85,2025-12-08,Enterprise
|
||||
HUB-020,papa,wong,Papa Trading,Founder,papa@papatrading.hk,+852 2123 4567,Hong Kong,HubSpot,69,2025-11-15,SMB
|
||||
LIN-021,Quinn,Reyes,Quebec Group,VP Sales,quinn@quebec.mx,+52 55 5555 0000,Mexico,LinkedIn,80,2025-12-05,Mid-Market
|
||||
LIN-022,Robert,Tan,Romeo Logistics,Director,r.tan@romeo.sg,+65 6123 4567,Singapore,LinkedIn,76,2025-11-28,Mid-Market
|
||||
SCR-023,Sara,Khan,Sierra Foods,Head Marketing,sara@sierra.in,+91-22-1234-5678,India,Manual Scrape,73,2025-12-02,SMB
|
||||
SCR-024,bob,Smith,Beta,Director Growth,Bob@Beta.com,(none),United States,Manual Scrape,(unknown),(unknown),SMB
|
||||
HUB-025,Tara,Levi,Tango Tech,VP Product,tara@tango.il,+972 3 6957 0000,Israel,HubSpot,82,2025-12-10,Enterprise
|
||||
HUB-026,Uma,Patel,Uniform Health,CMO,uma at uniform dot com,+44 20 7946 8888,United Kingdom,HubSpot,71,2025-12-12,Enterprise
|
||||
LIN-027,Victor,Lee,Victor Co,Director,victor@@victorco.com,+1 415 555 8888,USA,LinkedIn,69,2025-11-30,SMB
|
||||
SCR-028,Wendy,Akin,Whiskey Inc,CMO,wendy@whiskey.tr,+90 212 252 1111,Turkey,Manual Scrape,77,2025-12-04,Mid-Market
|
||||
SCR-029,Xander,Ng,Xray Group,Founder,xander@xray.sg,+65 6234 5678,Singapore,Manual Scrape,65,2025-11-15,Suppressed
|
||||
HUB-030,Yara,Costa,Yankee Foods,Marketing Lead,yara@yankee.br,+55 11 3071 2222,Brazil,HubSpot,—,2025-12-15,Opted Out
|
||||
|
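The lead file above repeats several contacts across sources with only email-case drift between them. A minimal sketch of how such candidates could be surfaced, using only the standard library and a handful of rows copied from the sample. The `find_email_duplicates` helper is hypothetical and is not DataTools' dedup engine, which also does fuzzy matching.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the sample above (three columns only).
SAMPLE = """Lead ID,Email,Source
HUB-003,carlos@gamma.io,HubSpot
HUB-004,diana@delta.com,HubSpot
SCR-012,diana@delta.com,Manual Scrape
SCR-014,Carlos@gamma.io,Manual Scrape
LIN-007,frank@foxtrot.de,LinkedIn
HUB-018,Frank@Foxtrot.de,HubSpot
HUB-019,olivia@oscar.it,HubSpot
"""


def find_email_duplicates(csv_text):
    """Return {normalized_email: [lead ids]} for emails seen more than once."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Hypothetical normalization: trim whitespace and ignore case.
        key = row["Email"].strip().lower()
        groups[key].append(row["Lead ID"])
    return {email: ids for email, ids in groups.items() if len(ids) > 1}


duplicates = find_email_duplicates(SAMPLE)
for email, ids in sorted(duplicates.items()):
    print(email, ids)
```

Exact matching on a normalized key only catches the obvious cases; rows such as `alice@acme.com` and `Alice.Johnson@acme.com` would still need a fuzzy comparison on name and company.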
74
samples/demo/agency_leads_pipeline.json
Normal file
@@ -0,0 +1,74 @@
|
||||
{
|
||||
"steps": [
|
||||
{
|
||||
"tool": "text_clean",
|
||||
"options": {},
|
||||
"enabled": true,
|
||||
"name": "1. Clean text (whitespace + smart quotes from copy-paste)"
|
||||
},
|
||||
{
|
||||
"tool": "format_standardize",
|
||||
"options": {
|
||||
"column_types": {
|
||||
"First Name": "name",
|
||||
"Last Name": "name",
|
||||
"Company": "name",
|
||||
"Email": "email",
|
||||
"Phone": "phone"
|
||||
},
|
||||
"phone_country_column": "Country",
|
||||
"phone_format": "E164",
|
||||
"email_gmail_canonical": true
|
||||
},
|
||||
"enabled": true,
|
||||
"name": "2. E.164 phones (per-row country) · canonical emails · name casing"
|
||||
},
|
||||
{
|
||||
"tool": "missing",
|
||||
"options": {
|
||||
"strategy": "none",
|
||||
"standardize_sentinels": true,
|
||||
"sentinels": ["N/A", "n/a", "—", "?", "(unknown)", "unknown", "(blank)", "(none)", "TBD", "#N/A"]
|
||||
},
|
||||
"enabled": true,
|
||||
"name": "3. Standardize sentinels across vendor exports"
|
||||
},
|
||||
{
|
||||
"tool": "column_map",
|
||||
"options": {
|
||||
"schema": {
|
||||
"fields": [
|
||||
{"name": "Lead ID", "dtype": "string", "required": true},
|
||||
{"name": "First Name", "dtype": "string"},
|
||||
{"name": "Last Name", "dtype": "string"},
|
||||
{"name": "Company", "dtype": "string"},
|
||||
{"name": "Title", "dtype": "string"},
|
||||
{"name": "Email", "dtype": "string"},
|
||||
{"name": "Phone", "dtype": "string"},
|
||||
{"name": "Country", "dtype": "string"},
|
||||
{"name": "Source", "dtype": "string"},
|
||||
{"name": "Score", "dtype": "integer"},
|
||||
{"name": "Last Activity", "dtype": "date"},
|
||||
{"name": "Tags", "dtype": "string"}
|
||||
]
|
||||
},
|
||||
"auto_infer": true,
|
||||
"unmapped": "keep",
|
||||
"coerce_types": true,
|
||||
"reorder_to_schema": true,
|
||||
"enforce_required": false
|
||||
},
|
||||
"enabled": true,
|
||||
"name": "4. Coerce types · reorder to canonical schema"
|
||||
},
|
||||
{
|
||||
"tool": "dedup",
|
||||
"options": {
|
||||
"survivor_rule": "most_complete",
|
||||
"merge": true
|
||||
},
|
||||
"enabled": true,
|
||||
"name": "5. Dedup leads across HubSpot / LinkedIn / Manual Scrape (fuzzy + merge)"
|
||||
}
|
||||
]
|
||||
}
|
||||
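A pipeline file of this shape can be inspected before it is run. The sketch below is illustrative only: it assumes each step carries the four keys seen in the sample files (`tool`, `options`, `enabled`, `name`), which is inferred from the samples and is not a documented schema. The `enabled_steps` helper and the shortened inline document are hypothetical.

```python
import json

# Shortened, hypothetical pipeline document in the same shape as the samples.
PIPELINE_JSON = """
{
  "steps": [
    {"tool": "text_clean", "options": {}, "enabled": true, "name": "1. Clean text"},
    {"tool": "missing", "options": {"strategy": "none"}, "enabled": false, "name": "2. Sentinels"},
    {"tool": "dedup", "options": {"survivor_rule": "most_complete", "merge": true},
     "enabled": true, "name": "3. Dedup"}
  ]
}
"""

# Assumption: these keys are required; inferred from the sample files only.
REQUIRED_KEYS = {"tool", "options", "enabled", "name"}


def enabled_steps(pipeline_text):
    """Parse a pipeline document and return the enabled steps in file order."""
    pipeline = json.loads(pipeline_text)
    steps = pipeline["steps"]
    for index, step in enumerate(steps):
        missing = REQUIRED_KEYS - step.keys()
        if missing:
            raise ValueError(f"step {index} is missing keys: {sorted(missing)}")
    return [step for step in steps if step["enabled"]]


for step in enabled_steps(PIPELINE_JSON):
    print(f'{step["tool"]:<12} {step["name"]}')
```

Checking the keys up front means a hand-edited file fails with a readable message instead of partway through a run.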
56
samples/demo/bookkeeper_bank_pipeline.json
Normal file
@@ -0,0 +1,56 @@
|
||||
{
|
||||
"steps": [
|
||||
{
|
||||
"tool": "text_clean",
|
||||
"options": {},
|
||||
"enabled": true,
|
||||
"name": "1. Clean text (header whitespace, smart quotes, em-dash)"
|
||||
},
|
||||
{
|
||||
"tool": "format_standardize",
|
||||
"options": {
|
||||
"column_types": {
|
||||
"Date": "date",
|
||||
"Amount": "currency",
|
||||
"Balance": "currency",
|
||||
"Vendor": "name"
|
||||
},
|
||||
"currency_decimal": "auto",
|
||||
"currency_preserve_code": false,
|
||||
"currency_decimals": 2,
|
||||
"date_output_format": "%Y-%m-%d"
|
||||
},
|
||||
"enabled": true,
|
||||
"name": "2. ISO dates · numeric amounts (parens-negative) · vendor casing"
|
||||
},
|
||||
{
|
||||
"tool": "missing",
|
||||
"options": {
|
||||
"strategy": "none",
|
||||
"standardize_sentinels": true,
|
||||
"sentinels": ["N/A", "n/a", "—", "-", "?", "(blank)", "(none)", "unknown", "#N/A"]
|
||||
},
|
||||
"enabled": true,
|
||||
"name": "3. Standardize disguised nulls (— / N/A / (blank))"
|
||||
},
|
||||
{
|
||||
"tool": "dedup",
|
||||
"options": {
|
||||
"survivor_rule": "most_complete",
|
||||
"merge": false,
|
||||
"date_column": "Date",
|
||||
"strategies": [
|
||||
{
|
||||
"columns": [
|
||||
{"column": "Date", "algorithm": "exact", "threshold": 100},
|
||||
{"column": "Amount", "algorithm": "exact", "threshold": 100},
|
||||
{"column": "Vendor", "algorithm": "jaro_winkler", "threshold": 80}
|
||||
]
|
||||
}
|
||||
]
|
||||
},
|
||||
"enabled": true,
|
||||
"name": "4. Dedup transactions on Date+Amount+fuzzy Vendor"
|
||||
}
|
||||
]
|
||||
}
|
||||
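The `currency_decimal: "auto"` option and the "parens-negative" note in step 2 describe behaviour that is easy to get wrong by hand. The following is a rough sketch of one way such parsing could work, not the Format Standardizer's actual logic: `parse_amount` is a hypothetical helper, and its rule (when both separators appear, the last one is the decimal mark; a lone comma followed by one or two digits is a decimal comma) is an assumption.

```python
import re
from decimal import Decimal


def parse_amount(raw):
    """Parse a currency string such as '($89.50)' or '-€1.450,00' to a Decimal."""
    text = raw.strip()
    # Accounting-style parentheses or a minus sign both mean negative.
    negative = (text.startswith("(") and text.endswith(")")) or "-" in text
    digits = re.sub(r"[^0-9.,]", "", text)
    if not any(ch.isdigit() for ch in digits):
        raise ValueError(f"no digits in {raw!r}")

    last_dot = digits.rfind(".")
    last_comma = digits.rfind(",")
    if last_dot != -1 and last_comma != -1:
        # Both separators present: whichever comes last is the decimal mark.
        decimal_sep = "." if last_dot > last_comma else ","
    elif last_comma != -1:
        # Lone comma: decimal only when followed by one or two digits.
        tail = digits[last_comma + 1:]
        is_decimal = digits.count(",") == 1 and len(tail) in (1, 2)
        decimal_sep = "," if is_decimal else None
    else:
        # Lone dot is read as a decimal point; '2.410' stays ambiguous.
        decimal_sep = "." if digits.count(".") == 1 else None

    for sep in {".", ","} - {decimal_sep}:
        digits = digits.replace(sep, "")
    if decimal_sep == ",":
        digits = digits.replace(",", ".")
    value = Decimal(digits)
    return -value if negative else value


for sample in ["($89.50)", "+$3,450.00", "-€1.450,00", "-R$ 1.299,90", "€89,99"]:
    print(sample, "->", parse_amount(sample))
```

A value such as `2.410` cannot be resolved from the string alone, which is why a per-row country or locale hint is useful alongside any automatic rule.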
31
samples/demo/bookkeeper_bank_reconcile.csv
Normal file
@@ -0,0 +1,31 @@
|
||||
Txn ID,Date ,Description,Amount,Balance,Account,Vendor,Category
|
||||
TXN-2401,01/15/2025," AMAZON.COM*4F2X9 PURCHASE",-$129.99,"$2,450.01",Checking,Amazon,Office Supplies
|
||||
TXN-2402,2025-01-15,"AMAZON.COM*4F2X9 PURCHASE",-$129.99,"2450.01",Checking,amazon.com,Office Supplies
|
||||
TXN-2403,Jan 18 2025,"STAPLES #4422 — paper, toner",($89.50),$2360.51,Checking,STAPLES,Office Supplies
|
||||
TXN-2404,01/22/2025,"Verizon Wireless ""autopay""",-$120.00,"$2,240.51",Checking,Verizon,Utilities
|
||||
TXN-2405,2025-01-22,Verizon Wireless autopay,-120.00,"2,240.51",Checking,verizon,Utilities
|
||||
TXN-2406,01-25-2025,"Stripe Payout — invoice #1077","+$3,450.00","$5,690.51",Checking,Stripe,Income
|
||||
TXN-2407,1/27/25,"Office Lease - Suite 204",-1500.00,"$4,190.51",Checking,Acme Realty,Rent
|
||||
TXN-2408,02/01/2025,"Wire — Acme Realty Mgmt","-$1,500.00","$2,690.51",Checking,acme realty,Rent
|
||||
TXN-2409,2025-02-03,"Adobe Creative Cloud annual","- $599.88","$2,090.63",Credit Card,Adobe Inc.,Software
|
||||
TXN-2410,02/03/2025,"ADOBE CREATIVE CLOUD ANN",-599.88,2090.63,Credit Card,adobe,Software
|
||||
TXN-2411,Feb 5 2025,"FedEx — overnight to client A",-$32.50,"$2,058.13",Checking,FedEx,Shipping
|
||||
TXN-2412,02/07/2025,"Square fee — invoice #1078","-$3.20","$2,054.93",Checking,Square,Fees
|
||||
TXN-2413,02/10/2025,"Stripe Payout invoice #1079","+ $1,200.00","$3,254.93",Checking,Stripe,Income
|
||||
TXN-2414,2025-02-12,"USPS PRIORITY — to vendor B","-12.40","$3,242.53",Checking,USPS,Shipping
|
||||
TXN-2415,02/14/2025,"Zoom Video Comms — annual","-$149.90","$3,092.63",Credit Card,Zoom,Software
|
||||
TXN-2416,2/14/25,"Zoom Video Communications","-149.90","3092.63",Credit Card,zoom,Software
|
||||
TXN-2417,02/18/2025,"Costco Whse #421 — supplies","-$237.84","$2,854.79",Checking,Costco,Office Supplies
|
||||
TXN-2418,2025-02-18,COSTCO WHSE #421,-237.84,"2,854.79",Checking,costco,Office Supplies
|
||||
TXN-2419,02/22/2025,"Bank fee — int'l wire","-$45.00","$2,809.79",Checking,Bank Fee,Fees
|
||||
TXN-2420,02/24/2025,"Stripe Payout — invoice #1080","+$2,100.00","$4,909.79",Checking,Stripe,Income
|
||||
TXN-2421,02/28/2025," Refund — overcharge ","+$45.00","$4,954.79",Checking,—,Refunds
|
||||
TXN-2422,Feb 28 2025,REFUND OVERCHARGE,45.00,4954.79,Checking,N/A,Refunds
|
||||
TXN-2423,03/01/2025,"Office Lease — Suite 204","-$1,500.00","$3,454.79",Checking,Acme Realty,Rent
|
||||
TXN-2424,2025-03-03,"Slack Technologies — annual","-$840.00","$2,614.79",Credit Card,Slack,Software
|
||||
TXN-2425,03/05/2025,"Stripe Payout — invoice #1081","+$1,875.00","$4,489.79",Checking,Stripe,Income
|
||||
TXN-2426,03/08/2025,"Wire — Berlin office rent (EUR vendor)","-€1.450,00","$2,989.79",Checking,Mietverwaltung GmbH,Rent
|
||||
TXN-2427,03/10/2025,"London supplier invoice (GBP)","-£950.00","$1,939.79",Checking,Stationery Co Ltd,Office Supplies
|
||||
TXN-2428,03/12/2025,"São Paulo agency retainer","-R$ 1.299,90","$1,679.79",Credit Card,Estúdio Ágil,Software
|
||||
TXN-2429,03/14/2025,"VAT MOSS prep — multi-EU sales","($89.00)","$1,768.79",Checking,EU VAT Service,Fees
|
||||
TXN-2430,03/14/2025,"VAT MOSS prep multi EU sales",-89.00,"1,768.79",Checking,eu vat service,Fees
|
||||
|
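The Date column above mixes at least five layouts. As an illustration of how they could be brought to the `%Y-%m-%d` output named in the pipeline, here is a small sketch that tries a fixed list of formats in order. The `to_iso_date` helper and the format list are assumptions for this sample; it reads slashed dates month-first (US style) and relies on English month abbreviations in the active locale.

```python
from datetime import datetime

# Assumed candidate layouts, based only on the values in the sample file.
CANDIDATE_FORMATS = (
    "%Y-%m-%d",  # 2025-01-15
    "%m/%d/%Y",  # 01/15/2025
    "%m/%d/%y",  # 1/27/25
    "%m-%d-%Y",  # 01-25-2025
    "%b %d %Y",  # Jan 18 2025
)


def to_iso_date(raw):
    """Return the date as YYYY-MM-DD, or None when no layout matches."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next layout
    return None


for sample in ["01/15/2025", "2025-01-15", "Jan 18 2025", "01-25-2025", "1/27/25"]:
    print(sample, "->", to_iso_date(sample))
```

Returning `None` instead of raising keeps unparseable cells visible for review rather than stopping the whole run.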
21
samples/demo/shopify_pet_customers.csv
Normal file
@@ -0,0 +1,21 @@
|
||||
Customer ID,First Name,Last Name,Email,Phone,Address,City,State,ZIP,Country,Total Orders,Lifetime Value,Last Order Date,Tags
|
||||
SHOP-1001, Alice ,Johnson,alice@petshop.com,(415) 555-1234,"123 Main St., Apt 4B",San Francisco,CA,94102,US,12,"$1,240.50",2025-12-04,VIP
|
||||
SHOP-1002,Bob,SMITH,Bob@PetShop.com,415.555.1234,"123 Main St, Apt 4B",San Francisco,CA,94102,US,12,"$1,240.50",N/A,VIP
|
||||
SHOP-1003,carlos,garcia,carlos@petshop.com,5559876543,"742 Evergreen Terrace",Springfield,IL,62704,US,5,420.00,12/15/2025,Wholesale
|
||||
SHOP-1004,Diana,Lee,diana@petshop.com,(555) 222-3344,"PO Box 12, Sherwood Forest",Nottingham,,NG1 5BA,GB,8,£890.25,2025-10-30,VIP|Wholesale
|
||||
SHOP-1005,EVE MARTINEZ,,eve.martinez@petshop.com,555-9988,"Calle Mayor 45","Madrid",,"28013",ES,3,€180,2025-09-15,
|
||||
SHOP-1006,Frank,Brown,frank@petshop.com,, ,"Berlin",BE,10115,DE,15,"€2.410,75",(blank),Wholesale
|
||||
SHOP-1007,Grace,Davis,grace@petshop.com,+1 555-111-1111,"888 Maple Ave",Toronto,ON,M5V 3A8,CA,1,$49.99,#N/A,New
|
||||
SHOP-1008,henry,wilson,Henry@PetShop.com,5551111111,"888 Maple Avenue","Toronto",ON,M5V 3A8,CA,1,$49.99,2025-12-01,New
|
||||
SHOP-1009,Ivy,Chen,IVY@petshop.com,+1 (555) 777-7777,"550 Elm Street, Suite 200",Brooklyn,NY,11201,US,4,"$320.50 ",10/12/2025,
|
||||
SHOP-1010,Jack,Taylor,jack@petshop.com,(none),"550 elm street, suite 200",brooklyn,NY,11201,US,4,$320.50,2025-10-12,
|
||||
SHOP-1011,kate,o'neil,kate.oneil@petshop.com,415-555-2222,"99 King's Rd","London",,SW3 4LX,GB,7,£675.00,?,VIP
|
||||
SHOP-1012,luis,rodriguez,LUIS@petshop.com,+34 91 411 1111,"Avenida de la Paz 12, 3°D",Madrid,,28013,ES,2,"€89,99",unknown,
|
||||
SHOP-1013,Mia,Park,mia@petshop.com,02-9374-4000,"Sydney Opera House Drive","Sydney",NSW,2000,AU,9,"A$ 1,299.00",2025-11-20,Wholesale
|
||||
SHOP-1014,Noah,nguyen,noah@petshop.com,+81 3 3210 7000,"丸の内 2-7-3","Tokyo",,100-0005,JP,6,"¥75000",2025-12-10,VIP
|
||||
SHOP-1015,Olivia,Brown,OLIVIA@PETSHOP.COM,(555) 333-4444,"742 evergreen terrace",springfield,IL,62704,US,3,$180.00,(none),
|
||||
SHOP-1016,Pavel,Novak,pavel@petshop.com,+44 20 7946 1234,"22 Baker Street",London,,W1U 6AB,United Kingdom,4,£412.00,2025-11-18,VIP
|
||||
SHOP-1017,Quinn,Murphy,quinn@petshop.com,+44 20 7946 5678,"5 Princes Street",Edinburgh,,EH2 2DA,U.K.,2,£189.50,2025-12-09,
|
||||
SHOP-1018,Rachel,O'Brien,rachel@petshop.com,02-9374-9999,"100 George Street","Sydney",NSW,2000,UK,1,£75.00,?,New
|
||||
SHOP-1019,Sam,Klein,sam@petshop.com,+49 30 99887766,"Friedrichstraße 100","Berlin",,10117,Germany,11,"€1.890,40",2025-12-11,VIP|Wholesale
|
||||
SHOP-1020,Tara,Gianni,tara@petshop.com,+39 06 6982 4567,"Via del Corso 250",Roma,,00186,Italia,5,"€649,99",2025-12-03,
|
||||
|
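The Country column above mixes ISO codes with free-text spellings (`GB`, `UK`, `U.K.`, `United Kingdom`, `Germany`, `Italia`). A minimal sketch of collapsing them to one code per market follows. The alias table and the `normalize_country` helper are hypothetical and cover only the values present in this sample; a real table would need many more entries.

```python
# Assumed alias table: lowercase spelling without dots -> ISO 3166-1 alpha-2.
COUNTRY_ALIASES = {
    "us": "US", "usa": "US", "united states": "US",
    "gb": "GB", "uk": "GB", "united kingdom": "GB",
    "de": "DE", "germany": "DE",
    "it": "IT", "italy": "IT", "italia": "IT",
    "es": "ES", "spain": "ES",
    "ca": "CA", "au": "AU", "jp": "JP",
}


def normalize_country(raw):
    """Map a free-text country value to an ISO code, or None if unknown."""
    key = raw.strip().lower().replace(".", "")
    key = " ".join(key.split())  # collapse repeated whitespace
    return COUNTRY_ALIASES.get(key)


for sample in ["UK", "U.K.", "United Kingdom", "GB", "Italia", "Germany"]:
    print(repr(sample), "->", normalize_country(sample))
```

Unknown values return `None` so they can be reviewed rather than silently guessed. Note that a spelling fix alone will not catch row SHOP-1018, where a Sydney address is tagged `UK`.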
49
samples/demo/shopify_pet_pipeline.json
Normal file
@@ -0,0 +1,49 @@
|
||||
{
|
||||
"steps": [
|
||||
{
|
||||
"tool": "text_clean",
|
||||
"options": {},
|
||||
"enabled": true,
|
||||
"name": "1. Clean text (whitespace, smart quotes, NBSP, BOM)"
|
||||
},
|
||||
{
|
||||
"tool": "format_standardize",
|
||||
"options": {
|
||||
"column_types": {
|
||||
"First Name": "name",
|
||||
"Last Name": "name",
|
||||
"Email": "email",
|
||||
"Phone": "phone",
|
||||
"Address": "address",
|
||||
"Lifetime Value": "currency",
|
||||
"Last Order Date": "date"
|
||||
},
|
||||
"phone_country_column": "Country",
|
||||
"address_country_column": "Country",
|
||||
"currency_preserve_code": true,
|
||||
"currency_decimal": "auto",
|
||||
"email_gmail_canonical": false
|
||||
},
|
||||
"enabled": true,
|
||||
"name": "2. Standardize phones, addresses, dates, currencies, names"
|
||||
},
|
||||
{
|
||||
"tool": "missing",
|
||||
"options": {
|
||||
"strategy": "none",
|
||||
"standardize_sentinels": true
|
||||
},
|
||||
"enabled": true,
|
||||
"name": "3. Standardize disguised nulls (N/A, -, (blank), ?, #N/A)"
|
||||
},
|
||||
{
|
||||
"tool": "dedup",
|
||||
"options": {
|
||||
"survivor_rule": "most_complete",
|
||||
"merge": true
|
||||
},
|
||||
"enabled": true,
|
||||
"name": "4. Dedup customers (fuzzy match, merge missing fields)"
|
||||
}
|
||||
]
|
||||
}
|
||||
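Step 3 of this pipeline turns disguised nulls into real missing values. The sketch below shows the general idea on plain dictionaries. The sentinel list is an assumption pieced together from the step name and the sample data, not the Missing Value Handler's actual defaults, and `standardize_sentinels` is a hypothetical helper.

```python
# Assumed sentinel spellings (compared case-insensitively after trimming).
ASSUMED_SENTINELS = {"n/a", "-", "(blank)", "?", "#n/a", "(none)", "unknown"}


def is_sentinel(value, sentinels):
    """True when the cell is empty or matches a known placeholder."""
    text = value.strip().lower()
    return text == "" or text in sentinels


def standardize_sentinels(rows, sentinels=ASSUMED_SENTINELS):
    """Return copies of the rows with placeholder cells replaced by None."""
    lookup = {s.lower() for s in sentinels}
    return [
        {col: (None if is_sentinel(val, lookup) else val) for col, val in row.items()}
        for row in rows
    ]


rows = [
    {"Customer ID": "SHOP-1002", "Phone": "415.555.1234", "Last Order Date": "N/A"},
    {"Customer ID": "SHOP-1010", "Phone": "(none)", "Last Order Date": "2025-10-12"},
    {"Customer ID": "SHOP-1007", "Phone": "+1 555-111-1111", "Last Order Date": "#N/A"},
]
cleaned = standardize_sentinels(rows)
for row in cleaned:
    print(row)
```

Running this before dedup matters when merging is enabled: a cell holding `N/A` counts as filled, so a less complete record could otherwise win the survivor rule.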
355
src/cli_column_map.py
Normal file
@@ -0,0 +1,355 @@
|
||||
"""CLI for the DataTools Column Mapper (script 05).
|
||||
|
||||
Usage:
|
||||
python -m src.cli_column_map input.csv # auto-mapping preview
|
||||
python -m src.cli_column_map input.csv --schema target.json --apply
|
||||
python -m src.cli_column_map input.csv --rename "First Name=first_name,Email=email" --apply
|
||||
python -m src.cli_column_map input.csv --schema target.json --preset strict-schema --apply
|
||||
python -m src.cli_column_map input.csv --schema target.json --coerce --apply
|
||||
python -m src.cli_column_map --help
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import sys
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
import typer
|
||||
from loguru import logger
|
||||
|
||||
app = typer.Typer(
|
||||
name="column-map",
|
||||
help=(
|
||||
"Rename columns, enforce a target schema, and coerce types in CSV / Excel files.\n\n"
|
||||
"Default behaviour: preview the mapping (no file written). Add --apply "
|
||||
"to write the mapped output and audit log.\n\n"
|
||||
"Examples:\n\n"
|
||||
" # Show what auto-mapping would do (no schema → identity)\n"
|
||||
" python -m src.cli_column_map vendor.csv\n\n"
|
||||
" # Map against a target JSON schema with strict drop / coerce / reorder\n"
|
||||
" python -m src.cli_column_map vendor.csv --schema target.json "
|
||||
"--preset strict-schema --apply\n\n"
|
||||
" # Hand-rolled rename without a schema\n"
|
||||
" python -m src.cli_column_map data.csv "
|
||||
"--rename 'First Name=first_name,Last Name=last_name' --apply\n\n"
|
||||
" # Coerce specific columns inline\n"
|
||||
" python -m src.cli_column_map data.csv "
|
||||
"--coerce-col 'age:integer,joined:date' --apply\n"
|
||||
),
|
||||
add_completion=False,
|
||||
no_args_is_help=True,
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def _setup_logging(log_dir: Path) -> Path:
|
||||
log_dir.mkdir(parents=True, exist_ok=True)
|
||||
ts = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
log_path = log_dir / f"column_map_{ts}.log"
|
||||
logger.remove()
|
||||
logger.add(sys.stderr, level="WARNING", format="{message}")
|
||||
logger.add(
|
||||
str(log_path),
|
||||
level="DEBUG",
|
||||
format="{time:YYYY-MM-DD HH:mm:ss} | {level:<8} | {message}",
|
||||
)
|
||||
return log_path
|
||||
|
||||
|
||||
def _parse_pairs(raw: Optional[str], separator: str = ",") -> dict[str, str]:
|
||||
"""Parse ``a=1,b=2`` into a dict."""
|
||||
if not raw:
|
||||
return {}
|
||||
out: dict[str, str] = {}
|
||||
for piece in raw.split(separator):
|
||||
piece = piece.strip()
|
||||
if not piece:
|
||||
continue
|
||||
if "=" not in piece:
|
||||
raise typer.BadParameter(
|
||||
f"Invalid pair: {piece!r}. Expected 'key=value[,key=value...]'."
|
||||
)
|
||||
k, v = piece.split("=", 1)
|
||||
out[k.strip()] = v.strip()
|
||||
return out
|
||||
|
||||
|
||||
def _parse_coerce(raw: Optional[str]) -> dict[str, str]:
|
||||
"""Parse ``age:integer,joined:date`` into a dict."""
|
||||
if not raw:
|
||||
return {}
|
||||
out: dict[str, str] = {}
|
||||
for piece in raw.split(","):
|
||||
piece = piece.strip()
|
||||
if not piece:
|
||||
continue
|
||||
if ":" not in piece:
|
||||
raise typer.BadParameter(
|
||||
f"Invalid --coerce-col piece: {piece!r}. "
|
||||
f"Expected 'col:dtype[,col:dtype...]'."
|
||||
)
|
||||
col, dtype = piece.split(":", 1)
|
||||
out[col.strip()] = dtype.strip()
|
||||
return out
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
# Main command
# ---------------------------------------------------------------------------

@app.command()
def map_(
    input_file: str = typer.Argument(
        ...,
        help="Path to the CSV or Excel file.",
    ),
    output: Optional[str] = typer.Option(
        None, "--output", "-o",
        help="Output file path. Default: {input}_mapped.csv",
    ),
    apply: bool = typer.Option(
        False, "--apply",
        help="Write the output. Without this flag, only the mapping plan is shown.",
    ),
    preset: str = typer.Option(
        "rename-only", "--preset",
        help="Preset: rename-only, strict-schema, or lenient-schema.",
    ),
    schema: Optional[str] = typer.Option(
        None, "--schema",
        help="Path to a target schema JSON file (TargetSchema format).",
    ),
    rename: Optional[str] = typer.Option(
        None, "--rename",
        help="Explicit rename pairs: 'src=tgt[,src=tgt...]' (overrides auto-inference).",
    ),
    coerce_col: Optional[str] = typer.Option(
        None, "--coerce-col",
        help=(
            "Inline type coercion (no schema needed): 'col:dtype[,col:dtype...]'. "
            "Valid dtypes: string, integer, float, boolean, date, datetime, category, auto."
        ),
    ),
    unmapped: Optional[str] = typer.Option(
        None, "--unmapped",
        help="Strategy for unmapped source columns: keep | drop | error.",
    ),
    threshold: Optional[float] = typer.Option(
        None, "--threshold",
        help="Fuzzy-match threshold for auto-inference (0.0..1.0). Default 0.6.",
    ),
    no_auto: bool = typer.Option(
        False, "--no-auto",
        help="Disable auto-inference; honour only explicit --rename pairs.",
    ),
    no_coerce: bool = typer.Option(
        False, "--no-coerce",
        help="Disable type coercion (overrides preset).",
    ),
    no_reorder: bool = typer.Option(
        False, "--no-reorder",
        help="Disable schema-order reorder (overrides preset).",
    ),
    no_required: bool = typer.Option(
        False, "--no-required",
        help="Don't enforce required-target presence (overrides preset).",
    ),
    config: Optional[str] = typer.Option(
        None, "--config",
        help="Load options from a saved JSON config file.",
    ),
    save_config: Optional[str] = typer.Option(
        None, "--save-config",
        help="Save current options to a JSON config file.",
    ),
    sheet: Optional[str] = typer.Option(
        None, "--sheet",
        help="Excel sheet name or index (default: first sheet).",
    ),
    encoding_override: Optional[str] = typer.Option(
        None, "--encoding",
        help="Override auto-detected file encoding.",
    ),
    header_row: Optional[int] = typer.Option(
        None, "--header-row",
        help="0-based row index for the header (default: auto-detect).",
    ),
):
    """Map source columns to a target schema; rename, coerce, drop, reorder."""
    from src.core.io import read_file, write_file
    from src.core.column_mapper import (
        MapOptions,
        PRESETS,
        TargetField,
        TargetSchema,
        coerce_series,
        map_columns,
    )
    import pandas as pd

    input_path = Path(input_file)
    if not input_path.exists():
        typer.echo(f"Error: File not found: {input_path}", err=True)
        raise typer.Exit(1)

    if preset not in PRESETS:
        typer.echo(
            f"Error: Unknown preset '{preset}'. "
            f"Choose from: {', '.join(sorted(PRESETS))}.",
            err=True,
        )
        raise typer.Exit(1)

    log_path = _setup_logging(Path("logs"))

    # Build options
    if config:
        cfg_path = Path(config)
        if not cfg_path.exists():
            typer.echo(f"Error: Config file not found: {cfg_path}", err=True)
            raise typer.Exit(1)
        options = MapOptions.from_file(cfg_path)
    else:
        options = MapOptions.from_preset(preset)

    if schema:
        sp = Path(schema)
        if not sp.exists():
            typer.echo(f"Error: Schema file not found: {sp}", err=True)
            raise typer.Exit(1)
        options.schema = TargetSchema.from_file(sp)
    if rename:
        options.mapping = {**options.mapping, **_parse_pairs(rename)}
    if unmapped:
        options.unmapped = unmapped  # type: ignore[assignment]
    if threshold is not None:
        options.fuzzy_threshold = threshold
    if no_auto:
        options.auto_infer = False
    if no_coerce:
        options.coerce_types = False
    if no_reorder:
        options.reorder_to_schema = False
    if no_required:
        options.enforce_required = False

    # Inline coercion (no schema): build a tiny one-field-per-column schema.
    inline_coerce = _parse_coerce(coerce_col)
    if inline_coerce and options.schema is None:
        options.schema = TargetSchema(fields=[
            TargetField(name=col, dtype=dt)  # type: ignore[arg-type]
            for col, dt in inline_coerce.items()
        ])
        options.coerce_types = True

    if save_config:
        saved = options.to_file(save_config)
        typer.echo(f"Config saved to {saved}")

    # Read input
    typer.echo(f"Reading {input_path.name}...")
    try:
        sheet_arg: str | int | None = None
        if sheet is not None:
            try:
                sheet_arg = int(sheet)
            except ValueError:
                sheet_arg = sheet
        df = read_file(
            input_path,
            encoding=encoding_override,
            header_row=header_row,
            sheet_name=sheet_arg if sheet_arg is not None else 0,
            repair=False,
        )
        if not isinstance(df, pd.DataFrame):
            df = pd.concat(list(df), ignore_index=True)
    except Exception as e:
        typer.echo(f"Error reading file: {e}", err=True)
        raise typer.Exit(1)

    typer.echo(f" {len(df)} rows, {len(df.columns)} columns")

    typer.echo("Mapping columns...")
    try:
        result = map_columns(df, options)
    except (ValueError, OSError) as e:
        typer.echo(f"Error: {e}", err=True)
        raise typer.Exit(1)

    _print_results(result, input_path, options)

    if apply:
        stem = input_path.stem
        out_path = Path(output) if output else input_path.parent / f"{stem}_mapped.csv"
        write_file(result.mapped_df, out_path)
        typer.echo(f"\nMapped file: {out_path}")
        # Audit: write the resolved mapping as JSON next to the input file
        # (this is the input's directory even when --output points elsewhere).
        audit_path = input_path.parent / f"{stem}_mapping.json"
        audit_path.write_text(json.dumps({
            "mapping": result.mapping,
            "inferred_pairs": result.inferred_pairs,
            "columns_renamed": result.columns_renamed,
            "columns_dropped": result.columns_dropped,
            "columns_added": result.columns_added,
            "coercion_failures": result.coercion_failures,
            "unmapped_kept": result.unmapped_kept,
            "missing_required_targets": result.missing_required_targets,
        }, indent=2, default=str))
        typer.echo(f"Mapping audit: {audit_path}")
    else:
        typer.echo("\nThis was a preview. Add --apply to write the mapped output.")

    typer.echo(f"Log: {log_path}")


# ---------------------------------------------------------------------------
# Output formatting
# ---------------------------------------------------------------------------

def _print_results(result, input_path: Path, options) -> None:
    typer.echo(f"\n{'─'*60}")
    typer.echo(f" File: {input_path.name}")
    typer.echo(f" Columns renamed: {result.columns_renamed}")
    typer.echo(f" Columns dropped: {len(result.columns_dropped)}")
    typer.echo(f" Columns added: {len(result.columns_added)}")
    typer.echo(f" Unmapped kept: {len(result.unmapped_kept)}")
    typer.echo(f" Coercion failures: "
               f"{sum(result.coercion_failures.values())} cells across "
               f"{len(result.coercion_failures)} column(s)")
    typer.echo(f"{'─'*60}")

    if result.mapping:
        typer.echo("\nMapping:")
        for src, tgt in result.mapping.items():
            tag = " (auto)" if src in result.inferred_pairs else ""
            arrow = "→" if src != tgt else "≡"
            typer.echo(f" {src!r} {arrow} {tgt!r}{tag}")
    if result.columns_dropped:
        typer.echo(f"\nDropped: {result.columns_dropped}")
    if result.columns_added:
        typer.echo(f"\nAdded (defaults): {result.columns_added}")
    if result.coercion_failures:
        typer.echo("\nCoercion failures:")
        for col, n in result.coercion_failures.items():
            typer.echo(f" {col}: {n} row(s) could not be coerced")
    if result.missing_required_targets:
        typer.echo(f"\nMissing required targets: {result.missing_required_targets}")


# ---------------------------------------------------------------------------
# __main__
# ---------------------------------------------------------------------------

def main():
    app()


if __name__ == "__main__":
    main()
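The two parsing helpers in this file, `_parse_pairs` and `_parse_coerce`, differ only in the key/value separator (`=` versus `:`). A minimal sketch of how one shared parser could cover both, assuming a hypothetical name `parse_delimited_pairs` (not part of this commit) and a plain `ValueError` in place of `typer.BadParameter` so the example runs without Typer installed:

```python
from typing import Optional


def parse_delimited_pairs(
    raw: Optional[str],
    item_sep: str = ",",
    kv_sep: str = "=",
) -> dict[str, str]:
    """Parse 'k1=v1,k2=v2' (or 'k1:v1,k2:v2') into a dict.

    Hypothetical helper for illustration only; not part of the commit.
    """
    if not raw:
        return {}
    out: dict[str, str] = {}
    for piece in raw.split(item_sep):
        piece = piece.strip()
        if not piece:
            continue  # tolerate trailing or doubled item separators
        if kv_sep not in piece:
            raise ValueError(
                f"Invalid pair: {piece!r}. Expected 'key{kv_sep}value'."
            )
        # Split only once so the value itself may contain kv_sep.
        key, value = piece.split(kv_sep, 1)
        out[key.strip()] = value.strip()
    return out


# --rename style input (whitespace and a trailing comma are tolerated)
renames = parse_delimited_pairs("cust_name = name, e-mail=email,")
# --coerce-col style input
coercions = parse_delimited_pairs("age:integer,joined:date", kv_sep=":")
print(renames)
print(coercions)
```

In the CLI itself the error would presumably stay a `typer.BadParameter` so that Typer reports it as a usage error; the `ValueError` here is only to keep the sketch dependency-free.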
364
src/cli_format.py
Normal file
@@ -0,0 +1,364 @@
"""CLI for the DataTools Format Standardizer (script 03).

Usage:
    python -m src.cli_format input.csv \\
        --types 'phone:phone,price:currency,name:name' \\
        --apply

    # 1 GB international file with per-row country column:
    python -m src.cli_format huge.csv \\
        --types 'phone:phone,address:address,price:currency' \\
        --phone-country country --address-country country \\
        --preserve-code --audit-max 50000 --apply

The CLI auto-streams (chunked read/write, bounded RAM) when the input
exceeds ~100 MB. Force or disable with ``--stream`` / ``--no-stream``.
"""

from __future__ import annotations

import sys
from datetime import datetime
from pathlib import Path
from typing import Optional

import typer
from loguru import logger

app = typer.Typer(
    name="format",
    help=(
        "Standardize dates, phones, currencies, names, and addresses "
        "in CSV / Excel files.\n\n"
        "Default behaviour: preview the changes (no file written). "
        "Add --apply to write output.\n\n"
        "For 1 GB+ international files, the CLI auto-streams in 50,000-row "
        "chunks so memory stays bounded. Use --phone-country / "
        "--address-country to point at a per-row ISO-3166 column for "
        "country-aware parsing.\n\n"
        "Examples:\n\n"
        " # Preview\n"
        " python -m src.cli_format data.csv --types 'phone:phone,price:currency'\n\n"
        " # International file with per-row country\n"
        " python -m src.cli_format leads.csv --types 'phone:phone' "
        "--phone-country country --apply\n\n"
        " # Force streaming with smaller chunks for tight memory\n"
        " python -m src.cli_format huge.csv --types 'phone:phone' "
        "--stream --chunk-size 10000 --apply\n"
    ),
    add_completion=False,
    no_args_is_help=True,
)


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

def _setup_logging(log_dir: Path) -> Path:
    log_dir.mkdir(parents=True, exist_ok=True)
    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_path = log_dir / f"format_{ts}.log"
    logger.remove()
    logger.add(sys.stderr, level="WARNING", format="{message}")
    logger.add(
        str(log_path), level="DEBUG",
        format="{time:YYYY-MM-DD HH:mm:ss} | {level:<8} | {message}",
    )
    return log_path


def _parse_types(raw: Optional[str]) -> dict[str, str]:
    """Parse ``col:phone,col:date`` into a dict."""
    if not raw:
        return {}
    out: dict[str, str] = {}
    for piece in raw.split(","):
        piece = piece.strip()
        if not piece:
            continue
        if ":" not in piece:
            raise typer.BadParameter(
                f"Invalid --types piece: {piece!r}. "
                f"Expected 'col:type[,col:type...]' "
                f"where type is one of: date, phone, currency, name, address, email, boolean."
            )
        col, ft = piece.split(":", 1)
        out[col.strip()] = ft.strip()
    return out


_AUTO_STREAM_THRESHOLD = 100 * 1024 * 1024  # 100 MB


# ---------------------------------------------------------------------------
# Main command
# ---------------------------------------------------------------------------

@app.command()
def standardize(
    input_file: str = typer.Argument(..., help="CSV or TSV file path."),
    output: Optional[str] = typer.Option(
        None, "--output", "-o",
        help="Output file path. Default: {input}_standardized.csv",
    ),
    apply: bool = typer.Option(
        False, "--apply",
        help="Write the output. Without this flag, only a preview is shown.",
    ),
    types: Optional[str] = typer.Option(
        None, "--types",
        help="Per-column types: 'col:type[,col:type...]'. "
             "Types: date, phone, currency, name, address, email, boolean.",
    ),
    preset: Optional[str] = typer.Option(
        None, "--preset",
        help="Named preset (e.g. 'us', 'uk', 'eu', 'jp'). Layered before --types.",
    ),
    phone_country: Optional[str] = typer.Option(
        None, "--phone-country",
        help="Column name carrying the per-row ISO-3166 country code for phones.",
    ),
    address_country: Optional[str] = typer.Option(
        None, "--address-country",
        help="Column name carrying the per-row country code for addresses.",
    ),
    phone_region: str = typer.Option(
        "US", "--phone-region",
        help="Default phone region when no per-row column is set. ISO-3166 alpha-2.",
    ),
    phone_format: str = typer.Option(
        "E164", "--phone-format",
        help="Phone output format: E164 | INTERNATIONAL | NATIONAL | RFC3966 | DIGITS.",
    ),
    preserve_code: bool = typer.Option(
        False, "--preserve-code",
        help="Currency: emit ISO-4217 prefix (e.g. 'USD 1500.00').",
    ),
    decimals: int = typer.Option(
        2, "--decimals",
        help="Currency decimal precision.",
    ),
    audit_max: int = typer.Option(
        10_000, "--audit-max",
        help="Cap the change-audit at N rows (0 = no audit, -1 = unbounded).",
    ),
    stream: Optional[bool] = typer.Option(
        None, "--stream/--no-stream",
        help="Force streaming (chunked, bounded RAM). Auto-on for inputs > 100 MB.",
    ),
    chunk_size: int = typer.Option(
        50_000, "--chunk-size",
        help="Rows per chunk in streaming mode.",
    ),
    cache_size: int = typer.Option(
        262_144, "--cache-size",
        help="Per-column LRU-cache size (set 0 to disable).",
    ),
    encoding_override: Optional[str] = typer.Option(
        None, "--encoding",
        help="Override auto-detected file encoding.",
    ),
    delimiter: Optional[str] = typer.Option(
        None, "--delimiter",
        help="Override auto-detected delimiter.",
    ),
    config: Optional[str] = typer.Option(
        None, "--config",
        help="Load options from a saved JSON config.",
    ),
    save_config: Optional[str] = typer.Option(
        None, "--save-config",
        help="Save current options to a JSON config.",
    ),
):
    """Standardize formats across a CSV / TSV. Auto-streams for large inputs."""
    from src.core.format_standardize import (
        FieldType,
        StandardizeOptions,
        standardize_dataframe,
        standardize_file,
    )
    from src.core.io import read_file, detect_encoding, detect_delimiter
    import pandas as pd

    inp = Path(input_file)
    if not inp.exists():
        typer.echo(f"Error: File not found: {inp}", err=True)
        raise typer.Exit(1)

    log_path = _setup_logging(Path("logs"))

    # Build options
    if config:
        cp = Path(config)
        if not cp.exists():
            typer.echo(f"Error: Config file not found: {cp}", err=True)
            raise typer.Exit(1)
        options = StandardizeOptions.from_file(cp)
    elif preset:
        try:
            options = StandardizeOptions.from_preset(preset)
        except ValueError as e:
            typer.echo(f"Error: {e}", err=True)
            raise typer.Exit(1)
    else:
        options = StandardizeOptions()

    parsed_types = _parse_types(types)
    if parsed_types:
        try:
            options.column_types = {
                col: FieldType(t) for col, t in parsed_types.items()
            }
        except ValueError as e:
            typer.echo(
                f"Error: {e}. Valid types: "
                + ", ".join(sorted(t.value for t in FieldType)),
                err=True,
            )
            raise typer.Exit(1)

    if not options.column_types:
        typer.echo(
            "Error: no column types declared. Pass --types 'col:type,...' "
            "or --preset / --config with a column_types map.",
            err=True,
        )
        raise typer.Exit(1)

    if phone_country:
        options.phone_country_column = phone_country
    if address_country:
        options.address_country_column = address_country
    # NOTE: the options below are assigned unconditionally, so their CLI
    # defaults override any values loaded from --config or --preset.
    options.phone_region = phone_region
    options.phone_format = phone_format  # type: ignore[assignment]
    options.currency_preserve_code = preserve_code
    options.currency_decimals = decimals
    options.audit_max_rows = (
        None if audit_max < 0 else audit_max
    )
    options.cache_size = cache_size

    if save_config:
        saved = options.to_file(save_config)
        typer.echo(f"Config saved to {saved}")

    # Decide streaming mode: an explicit --stream / --no-stream wins,
    # otherwise stream only when the file exceeds the 100 MB threshold.
    file_size = inp.stat().st_size
    use_stream = stream if stream is not None else file_size > _AUTO_STREAM_THRESHOLD

    enc = encoding_override or detect_encoding(inp)
    delim = delimiter or detect_delimiter(inp, enc)

    out_path = Path(output) if output else inp.parent / f"{inp.stem}_standardized.csv"

    typer.echo(
        f"Reading {inp.name} ({file_size/1024/1024:.1f} MB; "
        f"{'streaming' if use_stream else 'in-memory'} mode)..."
    )

    if use_stream:
        if not apply:
            typer.echo(
                "\nStreaming mode does not produce a preview. "
                "Re-run with --apply to write output, or pass --no-stream to preview in memory."
            )
            raise typer.Exit(0)

        # One-element list so the nested callback can update the timestamp.
        last_log = [0.0]
        import time as _time

        def _progress(rows, chunks):
            # Throttle progress output to at most one line per second.
            now = _time.perf_counter()
            if now - last_log[0] < 1.0:
                return
            last_log[0] = now
            typer.echo(f" ... {rows:,} rows ({chunks} chunks)")

        t0 = _time.perf_counter()
        res = standardize_file(
            inp, out_path, options,
            chunk_size=chunk_size,
            progress_callback=_progress,
            encoding=enc,
            delimiter=delim,
        )
        elapsed = _time.perf_counter() - t0
        typer.echo(f"\n{'─'*60}")
        typer.echo(f" File: {inp.name}")
        typer.echo(f" Rows: {res.rows_processed:,}")
        typer.echo(f" Chunks: {res.chunks_processed}")
        typer.echo(f" Cells changed: {res.cells_changed:,}")
        typer.echo(
            f" Cells unparseable: {res.cells_unparseable:,} / {res.cells_total:,}"
        )
        typer.echo(
            f" Throughput: {res.rows_processed / max(elapsed, 1e-9):,.0f} rows/sec"
        )
        typer.echo(f" Elapsed: {elapsed:.2f}s")
        typer.echo(f"{'─'*60}")
        typer.echo(f"\nStandardized: {res.output_path}")
        if res.audit_path:
            typer.echo(f"Changes audit: {res.audit_path}")
        typer.echo(f"Log: {log_path}")
        return

    # In-memory path
    try:
        df = read_file(
            inp, encoding=enc, delimiter=delim, repair=False,
        )
        if not isinstance(df, pd.DataFrame):
            df = pd.concat(list(df), ignore_index=True)
    except Exception as e:
        typer.echo(f"Error reading file: {e}", err=True)
        raise typer.Exit(1)

    typer.echo(f" {len(df):,} rows, {len(df.columns)} columns")

    typer.echo("Standardizing...")
    try:
        result = standardize_dataframe(df, options)
    except (ValueError, OSError) as e:
        typer.echo(f"Error: {e}", err=True)
        raise typer.Exit(1)

    pct = (result.cells_changed / result.cells_total * 100) if result.cells_total else 0
    typer.echo(f"\n{'─'*60}")
    typer.echo(f" File: {inp.name}")
    typer.echo(f" Columns processed: {len(result.columns_processed)}")
    typer.echo(f" Cells scanned: {result.cells_total:,}")
    typer.echo(f" Cells changed: {result.cells_changed:,} ({pct:.1f}%)")
    typer.echo(f" Cells unparseable: {result.cells_unparseable:,}")
    typer.echo(f"{'─'*60}")
    if result.cells_changed and not result.changes.empty:
        typer.echo("\nFirst examples:")
        for _, row in result.changes.head(5).iterrows():
            old = repr(row["old"])[:40]
            new = repr(row["new"])[:40]
            typer.echo(
                f" Row {row['row'] + 1}, {row['column']} "
                f"({row['field_type']}): {old} → {new}"
            )

    if apply:
        from src.core.io import write_file
        write_file(result.standardized_df, out_path)
        typer.echo(f"\nStandardized: {out_path}")
        if not result.changes.empty:
            audit_path = inp.parent / f"{inp.stem}_changes.csv"
            write_file(result.changes, audit_path)
            typer.echo(f"Changes audit: {audit_path}")
    else:
        typer.echo("\nThis was a preview. Add --apply to write the output.")

    typer.echo(f"Log: {log_path}")


def main():
    app()


if __name__ == "__main__":
    main()
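The streaming path in `cli_format.py` combines two small decisions: whether to stream at all (an explicit `--stream` / `--no-stream` wins, otherwise the 100 MB threshold decides) and how often to print progress (at most once per second). A minimal standalone sketch of both, assuming hypothetical names `should_stream` and `make_progress` that do not appear in the commit; unlike the CLI, which seeds its timestamp with `0.0`, this sketch uses a `None` sentinel so the first call always emits, and it takes an injectable clock so the throttle can be exercised without sleeping:

```python
import time
from typing import Callable, Optional

AUTO_STREAM_THRESHOLD = 100 * 1024 * 1024  # 100 MB, mirrors the CLI constant


def should_stream(
    file_size: int,
    stream_flag: Optional[bool],
    threshold: int = AUTO_STREAM_THRESHOLD,
) -> bool:
    """An explicit flag wins; otherwise stream only above the threshold."""
    if stream_flag is not None:
        return stream_flag
    return file_size > threshold


def make_progress(
    emit: Callable[[str], None],
    min_interval: float = 1.0,
    clock: Callable[[], float] = time.perf_counter,
) -> Callable[[int, int], None]:
    """Return a (rows, chunks) callback that emits at most once per interval."""
    last_emit: Optional[float] = None

    def progress(rows: int, chunks: int) -> None:
        nonlocal last_emit
        now = clock()
        if last_emit is not None and now - last_emit < min_interval:
            return  # too soon since the last line; stay quiet
        last_emit = now
        emit(f"  ... {rows:,} rows ({chunks} chunks)")

    return progress


# Demo with a fake clock: calls arrive at t=0.0, 0.4 and 1.2 seconds.
ticks = iter([0.0, 0.4, 1.2])
lines: list[str] = []
report = make_progress(lines.append, clock=lambda: next(ticks))
for n in (50_000, 100_000, 150_000):
    report(n, n // 50_000)
print(lines)  # the middle call is suppressed by the 1-second throttle
```

Note that the size check is strict, so a file of exactly 100 MB stays on the in-memory path, matching the `>` comparison in the CLI.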
380
src/cli_missing.py
Normal file
@@ -0,0 +1,380 @@
"""CLI for the DataTools Missing Value Handler (script 04).

Usage:
    python -m src.cli_missing input.csv  # profile only
    python -m src.cli_missing input.csv --apply  # detect-only + write
    python -m src.cli_missing input.csv --preset safe-fill --apply
    python -m src.cli_missing input.csv --strategy median --apply
    python -m src.cli_missing input.csv --strategy drop_row --apply
    python -m src.cli_missing input.csv --strategy constant --fill-value 0 --apply
    python -m src.cli_missing input.csv --strategy median --columns age,score --apply
    python -m src.cli_missing input.csv --col-strategy "age:median,city:mode" --apply
    python -m src.cli_missing --help
"""

from __future__ import annotations

import sys
from datetime import datetime
from pathlib import Path
from typing import Optional

import typer
from loguru import logger

app = typer.Typer(
    name="missing",
    help=(
        "Detect and handle missing values in CSV / Excel files.\n\n"
        "Default behaviour: profile only (no file written). Add --apply to "
        "write the handled output and audit log.\n\n"
        "Strategies:\n"
        " none, drop_row, drop_col, drop_both,\n"
        " mean, median, mode, constant,\n"
        " ffill, bfill, interpolate\n\n"
        "Examples:\n\n"
        " # Profile missingness without writing anything\n"
        " python -m src.cli_missing customers.csv\n\n"
        " # Standardize sentinels (\"N/A\", \"-\", \"NULL\", …) to NaN and write\n"
        " python -m src.cli_missing customers.csv --apply\n\n"
        " # Safe fill: numeric → median, categorical → mode\n"
        " python -m src.cli_missing customers.csv --preset safe-fill --apply\n\n"
        " # Drop rows missing >50% of selected columns\n"
        " python -m src.cli_missing customers.csv --strategy drop_row "
        "--row-threshold 0.5 --apply\n\n"
        " # Per-column strategies\n"
        " python -m src.cli_missing customers.csv "
        "--col-strategy 'age:median,city:mode,notes:constant' --fill-value '' --apply\n"
    ),
    add_completion=False,
    no_args_is_help=True,
)


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

def _setup_logging(log_dir: Path) -> Path:
    log_dir.mkdir(parents=True, exist_ok=True)
    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_path = log_dir / f"missing_{ts}.log"
    logger.remove()
    logger.add(sys.stderr, level="WARNING", format="{message}")
    logger.add(
        str(log_path),
        level="DEBUG",
        format="{time:YYYY-MM-DD HH:mm:ss} | {level:<8} | {message}",
    )
    return log_path


def _split_csv_arg(raw: Optional[str]) -> Optional[list[str]]:
    if raw is None:
        return None
    return [c.strip() for c in raw.split(",") if c.strip()]


def _parse_col_strategy(raw: Optional[str]) -> dict[str, str]:
    """Parse ``--col-strategy 'age:median,city:mode'`` into a dict."""
    if not raw:
        return {}
    out: dict[str, str] = {}
    for piece in raw.split(","):
        piece = piece.strip()
        if not piece:
            continue
        if ":" not in piece:
            raise typer.BadParameter(
                f"Invalid --col-strategy piece: '{piece}'. "
                f"Expected 'col:strategy[,col:strategy...]'."
            )
        col, strat = piece.split(":", 1)
        out[col.strip()] = strat.strip()
    return out


# ---------------------------------------------------------------------------
# Main command
# ---------------------------------------------------------------------------

@app.command()
def handle(
    input_file: str = typer.Argument(
        ...,
        help="Path to the CSV or Excel file.",
    ),
    output: Optional[str] = typer.Option(
        None, "--output", "-o",
        help="Output file path. Default: {input}_missing.csv",
    ),
    apply: bool = typer.Option(
        False, "--apply",
        help="Write the output. Without this flag, only the profile is shown.",
    ),
    preset: str = typer.Option(
        "detect-only", "--preset",
        help="Preset: detect-only, safe-fill, or drop-incomplete.",
    ),
    strategy: Optional[str] = typer.Option(
        None, "--strategy",
        help=(
            "Override the preset strategy: none, drop_row, drop_col, drop_both, "
            "mean, median, mode, constant, ffill, bfill, interpolate."
        ),
    ),
    col_strategy: Optional[str] = typer.Option(
        None, "--col-strategy",
        help="Per-column strategies: 'col:strategy[,col:strategy...]'.",
    ),
    fill_value: Optional[str] = typer.Option(
        None, "--fill-value",
        help="Constant fill value (used with --strategy constant).",
    ),
    columns: Optional[str] = typer.Option(
        None, "--columns",
        help="Comma-separated columns to handle (default: all columns).",
    ),
    skip: Optional[str] = typer.Option(
        None, "--skip",
        help="Comma-separated columns to skip.",
    ),
    sentinels: Optional[str] = typer.Option(
        None, "--sentinels",
        help=(
            "Comma-separated extra sentinels to treat as missing "
            "(merged with the built-in defaults)."
        ),
    ),
    no_sentinels: bool = typer.Option(
        False, "--no-sentinels",
        help="Disable disguised-null standardization entirely.",
    ),
    row_threshold: float = typer.Option(
        1.0, "--row-threshold",
        help=(
            "For drop_row: drop rows whose missing fraction across selected "
            "columns is STRICTLY GREATER than this value (0.0..1.0). "
            "Default 1.0 = never drop. Use 0.0 to drop any row with any "
            "missing; 0.5 to drop rows >50% missing."
        ),
    ),
    col_threshold: float = typer.Option(
        1.0, "--col-threshold",
        help=(
            "For drop_col: drop columns whose missing fraction is strictly "
            "greater than this value. Default 1.0 = never drop."
        ),
    ),
    config: Optional[str] = typer.Option(
        None, "--config",
        help="Load options from a saved JSON config file.",
    ),
    save_config: Optional[str] = typer.Option(
        None, "--save-config",
        help="Save current options to a JSON config file.",
    ),
    sheet: Optional[str] = typer.Option(
        None, "--sheet",
        help="Excel sheet name or index (default: first sheet).",
    ),
    encoding_override: Optional[str] = typer.Option(
        None, "--encoding",
        help="Override auto-detected file encoding.",
    ),
    header_row: Optional[int] = typer.Option(
        None, "--header-row",
        help="0-based row index for the header (default: auto-detect).",
    ),
    full_changelog: bool = typer.Option(
        False, "--full-changelog",
        help="Write every change to the audit CSV (default caps to first 1000).",
    ),
):
    """Detect and handle missing values."""
    from src.core.io import read_file, write_file
    from src.core.missing import MissingOptions, PRESETS, handle_missing
    import pandas as pd

    # Validate inputs
    input_path = Path(input_file)
    if not input_path.exists():
        typer.echo(f"Error: File not found: {input_path}", err=True)
        raise typer.Exit(1)

    if preset not in PRESETS:
        typer.echo(
            f"Error: Unknown preset '{preset}'. "
            f"Choose from: {', '.join(sorted(PRESETS))}.",
            err=True,
        )
        raise typer.Exit(1)

    log_path = _setup_logging(Path("logs"))

    # Build options
    if config:
        cfg_path = Path(config)
        if not cfg_path.exists():
            typer.echo(f"Error: Config file not found: {cfg_path}", err=True)
            raise typer.Exit(1)
        options = MissingOptions.from_file(cfg_path)
        logger.info("Loaded config from {}", cfg_path)
    else:
        options = MissingOptions.from_preset(preset)

    if strategy:
        options.strategy = strategy  # type: ignore[assignment]
    if col_strategy:
        options.column_strategies = _parse_col_strategy(col_strategy)  # type: ignore[assignment]
    if fill_value is not None:
        options.fill_value = fill_value
    cols_list = _split_csv_arg(columns)
    if cols_list is not None:
        options.columns = cols_list
    skip_list = _split_csv_arg(skip)
    if skip_list:
        options.skip_columns = skip_list
    extra = _split_csv_arg(sentinels)
    if extra:
        # Merge with the existing sentinels, de-duplicating while keeping order.
        options.sentinels = list(dict.fromkeys([*options.sentinels, *extra]))
    if no_sentinels:
        options.standardize_sentinels = False
    # NOTE: both thresholds are assigned unconditionally, so the CLI defaults
    # (1.0) override any thresholds loaded from --config or --preset.
    options.row_drop_threshold = row_threshold
    options.col_drop_threshold = col_threshold

    if save_config:
        saved = options.to_file(save_config)
        typer.echo(f"Config saved to {saved}")

    # Read input
    typer.echo(f"Reading {input_path.name}...")
    try:
        sheet_arg: str | int | None = None
        if sheet is not None:
            try:
                sheet_arg = int(sheet)
            except ValueError:
                sheet_arg = sheet
        df = read_file(
            input_path,
            encoding=encoding_override,
            header_row=header_row,
            sheet_name=sheet_arg if sheet_arg is not None else 0,
            repair=False,
        )
        if not isinstance(df, pd.DataFrame):
            df = pd.concat(list(df), ignore_index=True)
    except Exception as e:
        typer.echo(f"Error reading file: {e}", err=True)
        raise typer.Exit(1)

    typer.echo(f" {len(df)} rows, {len(df.columns)} columns")

    # Run
    typer.echo("Profiling missingness...")
    try:
        result = handle_missing(df, options)
    except (ValueError, OSError) as e:
        typer.echo(f"Error: {e}", err=True)
        raise typer.Exit(1)

    _print_results(result, input_path, options)

    # Write
    if apply:
        stem = input_path.stem
        out_path = Path(output) if output else input_path.parent / f"{stem}_missing.csv"
        write_file(result.handled_df, out_path)
        typer.echo(f"\nHandled file: {out_path}")

        if not result.changes.empty:
            changes_path = input_path.parent / f"{stem}_missing_changes.csv"
            audit_df = result.changes
            cap = 1000
            if not full_changelog and len(audit_df) > cap:
                typer.echo(
                    f"Note: changelog capped at {cap} rows. "
                    f"Use --full-changelog to write all {len(audit_df)} changes."
                )
                audit_df = audit_df.head(cap)
            write_file(audit_df, changes_path)
            typer.echo(f"Changes audit: {changes_path}")
    else:
        typer.echo(
            "\nThis was a profile only. Add --apply to write the handled output."
        )

    typer.echo(f"Log: {log_path}")


# ---------------------------------------------------------------------------
|
||||
# Output formatting
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def _print_results(result, input_path: Path, options) -> None:
|
||||
typer.echo(f"\n{'─'*60}")
|
||||
typer.echo(f" File: {input_path.name}")
|
||||
typer.echo(f" Rows: {result.profile_before.rows_total}")
|
||||
typer.echo(f" Columns processed: {len(result.columns_processed)}")
|
||||
typer.echo(
|
||||
f" Cells missing: "
|
||||
f"{result.profile_before.cells_missing} / {result.profile_before.cells_total}"
|
||||
f" ({result.profile_before.cells_missing_pct:.1f}%)"
|
||||
)
|
||||
typer.echo(
|
||||
f" Rows w/ any missing: "
|
||||
f"{result.profile_before.rows_with_any_missing} "
|
||||
f"(complete: {result.profile_before.rows_complete})"
|
||||
)
|
||||
typer.echo(f"{'─'*60}")
|
||||
|
||||
typer.echo("\nPer-column profile:")
|
||||
profile_df = result.profile_before.to_dataframe()
|
||||
for _, row in profile_df.iterrows():
|
||||
marker = " " if row["missing"] == 0 else " "
|
||||
typer.echo(
|
||||
f"{marker}{row['column']:<24} {row['dtype']:<10} "
|
||||
f"missing={row['missing']:<6} ({row['missing_pct']:>5.1f}%)"
|
||||
+ (
|
||||
f" top sentinel: {row['top_sentinel']!r} ×{row['top_sentinel_count']}"
|
||||
if row["top_sentinel_count"] else ""
|
||||
)
|
||||
)
|
||||
|
||||
typer.echo("\nActions:")
|
||||
typer.echo(f" Sentinels standardized to NaN: {result.sentinels_standardized}")
|
||||
typer.echo(f" Cells filled: {result.cells_filled}")
|
||||
typer.echo(f" Rows dropped: {result.rows_dropped}")
|
||||
typer.echo(
|
||||
f" Columns dropped: {len(result.columns_dropped)}"
|
||||
+ (f" ({', '.join(result.columns_dropped)})" if result.columns_dropped else "")
|
||||
)
|
||||
|
||||
if result.strategy_per_column:
|
||||
typer.echo("\nStrategy per column:")
|
||||
for col, strat in result.strategy_per_column.items():
|
||||
typer.echo(f" {col}: {strat}")
|
||||
|
||||
if not result.changes.empty:
|
||||
typer.echo("\nFirst examples:")
|
||||
for _, row in result.changes.head(5).iterrows():
|
||||
old = repr(row["old"])[:40]
|
||||
new = repr(row["new"])[:40]
|
||||
row_label = "—" if row["row"] == -1 else f"Row {row['row'] + 1}"
|
||||
typer.echo(
|
||||
f" {row_label}, {row['column']}: {old} → {new} "
|
||||
f"[{row['action']}]"
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# __main__
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def main():
|
||||
app()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
307
src/cli_pipeline.py
Normal file
@@ -0,0 +1,307 @@
|
||||
"""CLI for the DataTools Pipeline Runner (script 09).
|
||||
|
||||
Usage:
|
||||
# Run the recommended default pipeline (text → format → missing → dedup):
|
||||
python -m src.cli_pipeline input.csv --apply
|
||||
|
||||
# Quick custom order via --steps:
|
||||
python -m src.cli_pipeline input.csv \\
|
||||
--steps text_clean,format_standardize,missing --apply
|
||||
|
||||
# Save the recommended pipeline to a JSON for editing:
|
||||
python -m src.cli_pipeline --recommend --output pipeline.json
|
||||
|
||||
# Run a saved pipeline:
|
||||
python -m src.cli_pipeline weekly_export.csv --pipeline pipeline.json --apply
|
||||
|
||||
# Strict mode: fail if the pipeline contains soft-dependency violations
|
||||
python -m src.cli_pipeline data.csv --steps dedup,text_clean \\
|
||||
--strict --apply
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import sys
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
import typer
|
||||
from loguru import logger
|
||||
|
||||
app = typer.Typer(
|
||||
name="pipeline",
|
||||
help=(
|
||||
"Chain DataTools cleaning steps into one orchestrated workflow.\n\n"
|
||||
"Default behaviour: preview the plan only (nothing is executed or "
"written). Add --apply to run the pipeline and write the cleaned output and audit log.\n\n"
|
||||
"The pipeline RECOMMENDS an order based on tool dependencies "
|
||||
"(text-clean before format-standardize, format before dedup, etc.) "
|
||||
"and WARNS on out-of-order configs but does not block them. Use "
|
||||
"--strict to escalate warnings to errors.\n\n"
|
||||
"Tools available: text_clean, format_standardize, missing, "
|
||||
"column_map, dedup."
|
||||
),
|
||||
add_completion=False,
|
||||
no_args_is_help=False,
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def _setup_logging(log_dir: Path) -> Path:
|
||||
log_dir.mkdir(parents=True, exist_ok=True)
|
||||
ts = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
log_path = log_dir / f"pipeline_{ts}.log"
|
||||
logger.remove()
|
||||
logger.add(sys.stderr, level="WARNING", format="{message}")
|
||||
logger.add(
|
||||
str(log_path), level="DEBUG",
|
||||
format="{time:YYYY-MM-DD HH:mm:ss} | {level:<8} | {message}",
|
||||
)
|
||||
return log_path
|
||||
|
||||
|
||||
def _split_csv_arg(raw: Optional[str]) -> Optional[list[str]]:
|
||||
if raw is None:
|
||||
return None
|
||||
return [c.strip() for c in raw.split(",") if c.strip()]
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Main command
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
@app.command()
|
||||
def run(
|
||||
input_file: Optional[str] = typer.Argument(
|
||||
None,
|
||||
help="CSV / TSV / Excel file. Optional with --recommend.",
|
||||
),
|
||||
pipeline_path: Optional[str] = typer.Option(
|
||||
None, "--pipeline", "-p",
|
||||
help="Path to a pipeline JSON file (Pipeline.from_file format).",
|
||||
),
|
||||
steps: Optional[str] = typer.Option(
|
||||
None, "--steps",
|
||||
help=(
|
||||
"Quick pipeline: comma-separated tool names in execution order. "
|
||||
"Each step uses defaults. Example: 'text_clean,format_standardize,dedup'."
|
||||
),
|
||||
),
|
||||
recommend: bool = typer.Option(
|
||||
False, "--recommend",
|
||||
help="Print (or save) the recommended default pipeline and exit.",
|
||||
),
|
||||
output: Optional[str] = typer.Option(
|
||||
None, "--output", "-o",
|
||||
help=(
|
||||
"When --recommend is set, save the pipeline JSON here. "
|
||||
"Otherwise, write the pipeline output to this CSV path "
|
||||
"(default: {input}_pipeline.csv)."
|
||||
),
|
||||
),
|
||||
apply: bool = typer.Option(
|
||||
False, "--apply",
|
||||
help="Execute the pipeline and write the output. Without this flag, only the plan is shown.",
|
||||
),
|
||||
strict: bool = typer.Option(
|
||||
False, "--strict",
|
||||
help="Treat soft-dependency warnings as errors (refuse to run).",
|
||||
),
|
||||
continue_on_error: bool = typer.Option(
|
||||
False, "--continue-on-error",
|
||||
help="Don't abort if a step fails; carry the previous step's df forward.",
|
||||
),
|
||||
encoding_override: Optional[str] = typer.Option(
|
||||
None, "--encoding",
|
||||
help="Override auto-detected file encoding.",
|
||||
),
|
||||
delimiter: Optional[str] = typer.Option(
|
||||
None, "--delimiter",
|
||||
help="Override auto-detected delimiter.",
|
||||
),
|
||||
):
|
||||
"""Run a DataTools cleaning pipeline."""
|
||||
from src.core.pipeline import (
|
||||
Pipeline,
|
||||
recommended_pipeline,
|
||||
run_pipeline,
|
||||
validate_pipeline,
|
||||
)
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# --recommend: print or save the default pipeline and exit
|
||||
# ------------------------------------------------------------------
|
||||
if recommend:
|
||||
pipe = recommended_pipeline()
|
||||
body = json.dumps(pipe.to_dict(), indent=2)
|
||||
if output:
|
||||
Path(output).write_text(body)
|
||||
typer.echo(f"Recommended pipeline saved to {output}")
|
||||
else:
|
||||
typer.echo(body)
|
||||
return
|
||||
|
||||
if not input_file:
|
||||
typer.echo(
|
||||
"Error: input file is required (or use --recommend to "
|
||||
"emit the default pipeline).",
|
||||
err=True,
|
||||
)
|
||||
raise typer.Exit(2)
|
||||
|
||||
inp = Path(input_file)
|
||||
if not inp.exists():
|
||||
typer.echo(f"Error: File not found: {inp}", err=True)
|
||||
raise typer.Exit(1)
|
||||
|
||||
log_path = _setup_logging(Path("logs"))
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Resolve pipeline source: --pipeline file, --steps list, or default
|
||||
# ------------------------------------------------------------------
|
||||
if pipeline_path and steps:
|
||||
typer.echo(
|
||||
"Error: pass either --pipeline or --steps, not both.",
|
||||
err=True,
|
||||
)
|
||||
raise typer.Exit(1)
|
||||
|
||||
if pipeline_path:
|
||||
pp = Path(pipeline_path)
|
||||
if not pp.exists():
|
||||
typer.echo(f"Error: pipeline file not found: {pp}", err=True)
|
||||
raise typer.Exit(1)
|
||||
try:
|
||||
pipe = Pipeline.from_file(pp)
|
||||
except Exception as e:
|
||||
from src.core.errors import format_for_user
|
||||
typer.echo(f"Error reading pipeline: {format_for_user(e)}", err=True)
|
||||
raise typer.Exit(1)
|
||||
elif steps:
|
||||
names = _split_csv_arg(steps) or []
|
||||
try:
|
||||
pipe = recommended_pipeline(include=names)
|
||||
except Exception as e:
|
||||
from src.core.errors import format_for_user
|
||||
typer.echo(f"Error: {format_for_user(e)}", err=True)
|
||||
raise typer.Exit(1)
|
||||
else:
|
||||
pipe = recommended_pipeline()
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Plan + warnings
|
||||
# ------------------------------------------------------------------
|
||||
warnings = validate_pipeline(pipe)
|
||||
typer.echo(f"\n{'─'*60}")
|
||||
typer.echo(" Pipeline plan:")
|
||||
for i, step in enumerate(pipe.steps, 1):
|
||||
flag = " " if step.enabled else "✗ "
|
||||
typer.echo(f" {i}. {flag}{step.display_name():<22} options={step.options or {}}")
|
||||
typer.echo(f"{'─'*60}")
|
||||
if warnings:
|
||||
typer.echo("\nSoft-dependency warnings (recommended order violated):")
|
||||
for w in warnings:
|
||||
typer.echo(f" ! {w}")
|
||||
if strict:
|
||||
typer.echo(
|
||||
"\nAborting: --strict was set. Reorder the steps or drop --strict.",
|
||||
err=True,
|
||||
)
|
||||
raise typer.Exit(2)
|
||||
|
||||
if not apply:
|
||||
typer.echo(
|
||||
"\nThis was a plan-only run. Add --apply to execute the pipeline."
|
||||
)
|
||||
typer.echo(f"Log: {log_path}")
|
||||
return
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Read input + execute
|
||||
# ------------------------------------------------------------------
|
||||
from src.core.io import read_file, write_file
|
||||
import pandas as pd
|
||||
|
||||
typer.echo(f"\nReading {inp.name}...")
|
||||
try:
|
||||
df = read_file(
|
||||
inp, encoding=encoding_override, delimiter=delimiter, repair=False,
|
||||
)
|
||||
if not isinstance(df, pd.DataFrame):
|
||||
df = pd.concat(list(df), ignore_index=True)
|
||||
except Exception as e:
|
||||
typer.echo(f"Error reading file: {e}", err=True)
|
||||
raise typer.Exit(1)
|
||||
|
||||
typer.echo(f" {len(df):,} rows, {len(df.columns)} columns")
|
||||
|
||||
typer.echo("\nExecuting pipeline:")
|
||||
|
||||
def _on_step(sr) -> None:
|
||||
if sr.skipped:
|
||||
typer.echo(f" - {sr.step.display_name()} (skipped)")
|
||||
elif sr.error:
|
||||
typer.echo(f" ✗ {sr.step.display_name()} ({sr.elapsed_seconds*1000:.0f} ms) — ERROR: {sr.error.splitlines()[0]}")
|
||||
else:
|
||||
typer.echo(f" ✓ {sr.step.display_name()} ({sr.elapsed_seconds*1000:.0f} ms) {sr.summary}")
|
||||
|
||||
try:
|
||||
result = run_pipeline(
|
||||
df, pipe,
|
||||
on_step_complete=_on_step,
|
||||
stop_on_error=not continue_on_error,
|
||||
)
|
||||
except Exception as e:
|
||||
from src.core.errors import format_for_user
|
||||
typer.echo(f"\nPipeline halted: {format_for_user(e)}", err=True)
|
||||
raise typer.Exit(1)
|
||||
|
||||
typer.echo(f"\n{'─'*60}")
|
||||
typer.echo(f" Initial rows: {result.initial_rows:,}")
|
||||
typer.echo(f" Final rows: {result.final_rows:,}")
|
||||
typer.echo(f" Steps run: {sum(1 for s in result.step_results if not s.skipped)}")
|
||||
typer.echo(f" Total elapsed: {result.total_elapsed:.2f} s")
|
||||
typer.echo(f"{'─'*60}")
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Write output + audit
|
||||
# ------------------------------------------------------------------
|
||||
out_path = Path(output) if output else inp.parent / f"{inp.stem}_pipeline.csv"
|
||||
write_file(result.final_df, out_path)
|
||||
typer.echo(f"\nPipeline output: {out_path}")
|
||||
|
||||
audit_path = inp.parent / f"{inp.stem}_pipeline.json"
|
||||
audit_path.write_text(json.dumps({
|
||||
"pipeline": pipe.to_dict(),
|
||||
"warnings": result.warnings,
|
||||
"initial_rows": result.initial_rows,
|
||||
"final_rows": result.final_rows,
|
||||
"total_elapsed_seconds": result.total_elapsed,
|
||||
"steps": [
|
||||
{
|
||||
"tool": sr.step.tool,
|
||||
"name": sr.step.display_name(),
|
||||
"enabled": sr.step.enabled,
|
||||
"skipped": sr.skipped,
|
||||
"elapsed_seconds": sr.elapsed_seconds,
|
||||
"summary": sr.summary,
|
||||
"error": sr.error,
|
||||
}
|
||||
for sr in result.step_results
|
||||
],
|
||||
}, indent=2, default=str))
|
||||
typer.echo(f"Pipeline audit: {audit_path}")
|
||||
typer.echo(f"Log: {log_path}")
|
||||
|
||||
|
||||
def main() -> None:
|
||||
app()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
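The help text above describes step ordering as recommended rather than enforced, with `--strict` escalating warnings to errors. A minimal sketch of that behaviour follows. The `SOFT_DEPS` table, `check_order`, and `resolve` are hypothetical names for illustration; the real `SOFT_DEPENDENCIES` and `validate_pipeline` live in `src/core/pipeline.py`, which is not shown in this part of the diff and may differ.

```python
from __future__ import annotations

# Hypothetical dependency table: tool -> tools that should run before it.
# The entries are an assumption based on the orderings the CLI help mentions.
SOFT_DEPS: dict[str, tuple[str, ...]] = {
    "format_standardize": ("text_clean",),
    "dedup": ("text_clean", "format_standardize"),
}


def check_order(steps: list[str]) -> list[str]:
    """Return one warning per soft dependency that runs after its dependant."""
    position = {name: i for i, name in enumerate(steps)}
    warnings: list[str] = []
    for i, name in enumerate(steps):
        for dep in SOFT_DEPS.get(name, ()):
            # A dependency that is absent from the pipeline is not a violation.
            if dep in position and position[dep] > i:
                warnings.append(f"{name} runs before {dep}")
    return warnings


def resolve(steps: list[str], *, strict: bool = False) -> list[str]:
    """Warn by default; raise only when strict mode is requested."""
    warnings = check_order(steps)
    if warnings and strict:
        raise ValueError("; ".join(warnings))
    return warnings
```

With this table, `resolve(["dedup", "text_clean"])` returns a single warning, while the same call with `strict=True` raises, which mirrors the warn-versus-abort split in the command above.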
||||
@@ -96,15 +96,54 @@ from .format_standardize import (
|
||||
PRESETS as STANDARDIZE_PRESETS,
|
||||
StandardizeOptions,
|
||||
StandardizeResult,
|
||||
StreamingStandardizeResult,
|
||||
detect_currency_code,
|
||||
standardize_address,
|
||||
standardize_boolean,
|
||||
standardize_currency,
|
||||
standardize_dataframe,
|
||||
standardize_date,
|
||||
standardize_file,
|
||||
standardize_name,
|
||||
standardize_phone,
|
||||
)
|
||||
from .missing import (
|
||||
DEFAULT_SENTINELS,
|
||||
ColumnReport,
|
||||
MissingOptions,
|
||||
MissingProfile,
|
||||
MissingResult,
|
||||
PRESETS as MISSING_PRESETS,
|
||||
Strategy as MissingStrategy,
|
||||
detect_sentinels,
|
||||
handle_missing,
|
||||
is_missing_like,
|
||||
profile_missing,
|
||||
)
|
||||
from .column_mapper import (
|
||||
ColumnDtype,
|
||||
MapOptions,
|
||||
MapResult,
|
||||
PRESETS as MAP_PRESETS,
|
||||
TargetField,
|
||||
TargetSchema,
|
||||
UnmappedStrategy,
|
||||
coerce_series,
|
||||
infer_mapping,
|
||||
map_columns,
|
||||
)
|
||||
from .pipeline import (
|
||||
Pipeline,
|
||||
PipelineResult,
|
||||
SOFT_DEPENDENCIES,
|
||||
Step,
|
||||
StepResult,
|
||||
TOOL_ADAPTERS,
|
||||
TOOL_NAMES,
|
||||
recommended_pipeline,
|
||||
run_pipeline,
|
||||
validate_pipeline,
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
# Core
|
||||
@@ -171,6 +210,7 @@ __all__ = [
|
||||
"STANDARDIZE_PRESETS",
|
||||
"StandardizeOptions",
|
||||
"StandardizeResult",
|
||||
"StreamingStandardizeResult",
|
||||
"detect_currency_code",
|
||||
"standardize_dataframe",
|
||||
"standardize_date",
|
||||
@@ -179,4 +219,39 @@ __all__ = [
|
||||
"standardize_name",
|
||||
"standardize_address",
|
||||
"standardize_boolean",
|
||||
"standardize_file",
|
||||
# Missing-value handling
|
||||
"DEFAULT_SENTINELS",
|
||||
"ColumnReport",
|
||||
"MissingOptions",
|
||||
"MissingProfile",
|
||||
"MissingResult",
|
||||
"MISSING_PRESETS",
|
||||
"MissingStrategy",
|
||||
"detect_sentinels",
|
||||
"handle_missing",
|
||||
"is_missing_like",
|
||||
"profile_missing",
|
||||
# Column mapping
|
||||
"ColumnDtype",
|
||||
"MapOptions",
|
||||
"MapResult",
|
||||
"MAP_PRESETS",
|
||||
"TargetField",
|
||||
"TargetSchema",
|
||||
"UnmappedStrategy",
|
||||
"coerce_series",
|
||||
"infer_mapping",
|
||||
"map_columns",
|
||||
# Pipeline
|
||||
"Pipeline",
|
||||
"PipelineResult",
|
||||
"SOFT_DEPENDENCIES",
|
||||
"Step",
|
||||
"StepResult",
|
||||
"TOOL_ADAPTERS",
|
||||
"TOOL_NAMES",
|
||||
"recommended_pipeline",
|
||||
"run_pipeline",
|
||||
"validate_pipeline",
|
||||
]
|
||||
|
||||
@@ -593,6 +593,40 @@ def _count_row_terminators(raw: bytes) -> tuple[int, int, int]:
|
||||
return n_crlf, n_lf, n_cr
|
||||
|
||||
|
||||
def _detect_lying_bom(raw: bytes) -> list[Finding]:
|
||||
"""Flag files whose UTF-8 BOM disagrees with the body bytes.
|
||||
|
||||
The "lying BOM" pattern is a file that starts with the UTF-8 BOM
|
||||
(``EF BB BF``) but whose body cannot be decoded as UTF-8 — typically
|
||||
a cp1252 export that someone hand-prepended a BOM to in an attempt to
|
||||
make Excel happy. The encoding detector recovers transparently
|
||||
(returns cp1252), but the user should still be told their file is
|
||||
misrepresenting itself so the next downstream tool doesn't get
|
||||
surprised.
|
||||
"""
|
||||
if not raw.startswith(b"\xef\xbb\xbf"):
|
||||
return []
|
||||
try:
|
||||
raw[3:].decode("utf-8")
|
||||
return [] # honest BOM — body is real UTF-8
|
||||
except UnicodeDecodeError:
|
||||
pass
|
||||
return [Finding(
|
||||
id="encoding_lying_bom",
|
||||
severity="warn",
|
||||
tool="",
|
||||
count=1,
|
||||
description=(
|
||||
"File starts with a UTF-8 BOM, but the body bytes are not "
|
||||
"valid UTF-8 — the BOM is misleading. The encoding detector "
|
||||
"recovered by falling back to a single-byte codepage; you "
|
||||
"may want to re-save the file with a matching encoding."
|
||||
),
|
||||
confidence="high",
|
||||
fix_action=FIX_NONE,
|
||||
)]
|
||||
|
||||
|
||||
def _detect_mixed_line_endings(raw: bytes) -> list[Finding]:
|
||||
"""Flag files that mix CRLF, LF, and bare CR row terminators.
|
||||
|
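`_detect_lying_bom` above decodes the whole buffer in one call. If the analyzer ever passes a truncated head sample (an assumption; the caller is only partly visible in this diff), a buffer that ends in the middle of a multi-byte character would be reported as a lying BOM. A sketch of a truncation-tolerant variant, using the standard library's incremental decoder; `bom_contradicts_body` and `is_complete` are hypothetical names:

```python
import codecs

UTF8_BOM = b"\xef\xbb\xbf"


def bom_contradicts_body(raw: bytes, *, is_complete: bool = True) -> bool:
    """True when raw starts with a UTF-8 BOM but its body is not UTF-8.

    With is_complete=False, an unfinished multi-byte sequence at the very
    end of raw is tolerated, because the sample may have been cut mid-character.
    """
    if not raw.startswith(UTF8_BOM):
        return False
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False buffers a trailing partial sequence instead of raising.
        decoder.decode(raw[len(UTF8_BOM):], final=is_complete)
    except UnicodeDecodeError:
        return True
    return False
```

Invalid bytes in the middle of the buffer still raise with `final=False`, so a cp1252 body behind a BOM is still caught.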
||||
@@ -875,6 +909,7 @@ def analyze(
|
||||
findings.extend(_findings_from_repair(repair_result))
|
||||
if raw_for_byte_scan is not None:
|
||||
findings.extend(_detect_mixed_line_endings(raw_for_byte_scan))
|
||||
findings.extend(_detect_lying_bom(raw_for_byte_scan))
|
||||
findings.extend(_detect_encoding_uncertainty(df))
|
||||
findings.extend(_detect_smart_punctuation(df))
|
||||
findings.extend(_detect_invisible_chars(df))
|
||||
@@ -890,6 +925,7 @@ def analyze(
|
||||
|
||||
def _load_for_analysis(
|
||||
path: Path, *, sample_rows: int, encoding_override: Optional[str] = None,
|
||||
fold_quotes: bool = True,
|
||||
) -> tuple[pd.DataFrame, Optional[RepairResult], Optional[bytes]]:
|
||||
"""Read just enough of *path* to scan, with the same robust pre-parse
|
||||
repair the tool pages will use.
|
||||
@@ -903,6 +939,12 @@ def _load_for_analysis(
|
||||
When *encoding_override* is set, it replaces the detected encoding
|
||||
entirely — the user has explicitly told us what the file is. The
|
||||
delimiter is still detected (it's separate from encoding choice).
|
||||
|
||||
*fold_quotes* defaults to True so the byte-level smart-quote fold
|
||||
runs as part of the repair pass (correct for CSV parsing). Pass
|
||||
False when the caller needs a content-preserving decode for
|
||||
identity round-trip checks (encoding corpus tests, format-fidelity
|
||||
audits).
|
||||
"""
|
||||
suffix = path.suffix.lower()
|
||||
if suffix in (".xlsx", ".xls"):
|
||||
@@ -937,7 +979,7 @@ def _load_for_analysis(
|
||||
if not head.strip():
|
||||
return pd.DataFrame(), None, head
|
||||
|
||||
-repair = repair_bytes(head, encoding=enc, delimiter=delim)
+repair = repair_bytes(head, encoding=enc, delimiter=delim, fold_quotes=fold_quotes)
|
||||
import io as _io
|
||||
try:
|
||||
df = pd.read_csv(
|
||||
@@ -954,7 +996,9 @@ def _load_for_analysis(
|
||||
# never trips; the 2× row-size multiplier above handles 99% of inputs.
|
||||
if not head_was_full and len(df) < sample_rows:
|
||||
full_raw = path.read_bytes()
|
||||
-full_repair = repair_bytes(full_raw, encoding=enc, delimiter=delim)
+full_repair = repair_bytes(
+full_raw, encoding=enc, delimiter=delim, fold_quotes=fold_quotes,
+)
|
||||
try:
|
||||
df = pd.read_csv(
|
||||
_io.BytesIO(full_repair.repaired_bytes),
|
||||
|
||||
633
src/core/column_mapper.py
Normal file
@@ -0,0 +1,633 @@
|
||||
"""DataTools Column Mapper.
|
||||
|
||||
Rename columns, enforce a target schema, coerce types, drop / add /
|
||||
reorder columns. Designed for the three buyer profiles the toolkit
|
||||
already serves:
|
||||
|
||||
1. **Schema enforcement** — analyst receives a CSV that has to fit a
|
||||
known target shape (a CRM import format, a database schema, a
|
||||
mailing-list contract). Map source columns to target names, coerce
|
||||
each to the declared type, drop the extras, fail clearly when a
|
||||
required target field is missing.
|
||||
2. **Multi-source unification** — operator merges vendor/partner
|
||||
exports where every file uses different column names ("First Name"
|
||||
/ "first_name" / "FirstName"). The fuzzy auto-mapper proposes a
|
||||
mapping; the user reviews and overrides.
|
||||
3. **Type coercion** — quick conversion of mis-typed columns (string
|
||||
"123" → int, "true"/"yes" → bool, "2024-01-15" → date) without
|
||||
leaving the tool, with errors surfaced row-by-row.
|
||||
|
||||
Public API
|
||||
----------
|
||||
Types:
|
||||
TargetField, TargetSchema, ColumnMapping, MapOptions, MapResult,
|
||||
ColumnDtype
|
||||
|
||||
Functions:
|
||||
map_columns(df, options) -> MapResult
|
||||
infer_mapping(df, schema, *, threshold=0.6) -> dict[src, target]
|
||||
coerce_series(series, dtype) -> (Series, n_failures)
|
||||
|
||||
Presets:
|
||||
PRESETS = {"rename-only", "strict-schema", "lenient-schema"}
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import re
|
||||
from dataclasses import asdict, dataclass, field
|
||||
from pathlib import Path
|
||||
from typing import Any, Iterable, Literal, Optional
|
||||
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
from loguru import logger
|
||||
from pandas.api import types as pdtypes
|
||||
|
||||
from .errors import ConfigError, InputValidationError, ensure_choice, ensure_dataframe
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Types
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
ColumnDtype = Literal[
|
||||
"string",
|
||||
"integer",
|
||||
"float",
|
||||
"boolean",
|
||||
"date",
|
||||
"datetime",
|
||||
"category",
|
||||
"auto", # leave dtype alone
|
||||
]
|
||||
|
||||
_VALID_DTYPES: frozenset[str] = frozenset({
|
||||
"string", "integer", "float", "boolean", "date", "datetime",
|
||||
"category", "auto",
|
||||
})
|
||||
|
||||
|
||||
@dataclass
|
||||
class TargetField:
|
||||
"""One field in a target schema.
|
||||
|
||||
Required fields whose source column is missing produce a
|
||||
``MapResult.missing_required_targets`` entry rather than silently
|
||||
creating a NaN column.
|
||||
"""
|
||||
|
||||
name: str
|
||||
dtype: ColumnDtype = "auto"
|
||||
required: bool = False
|
||||
aliases: list[str] = field(default_factory=list)
|
||||
default: Any = None
|
||||
|
||||
|
||||
@dataclass
|
||||
class TargetSchema:
|
||||
"""Ordered list of target fields. Ordering survives into the result DataFrame."""
|
||||
|
||||
fields: list[TargetField]
|
||||
|
||||
def field_names(self) -> list[str]:
|
||||
return [f.name for f in self.fields]
|
||||
|
||||
def get(self, name: str) -> Optional[TargetField]:
|
||||
return next((f for f in self.fields if f.name == name), None)
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return {"fields": [asdict(f) for f in self.fields]}
|
||||
|
||||
def to_file(self, path: str | Path) -> Path:
|
||||
out = Path(path)
|
||||
out.write_text(json.dumps(self.to_dict(), indent=2, default=str))
|
||||
return out
|
||||
|
||||
@classmethod
|
||||
def from_dict(cls, data: dict) -> TargetSchema:
|
||||
if "fields" not in data:
|
||||
raise ConfigError(
|
||||
"Target schema must contain a 'fields' list",
|
||||
operation="TargetSchema.from_dict",
|
||||
suggestion='Example: {"fields": [{"name": "email", "dtype": "string", "required": true}, ...]}',
|
||||
)
|
||||
fields = []
|
||||
for entry in data["fields"]:
|
||||
if isinstance(entry, str):
|
||||
fields.append(TargetField(name=entry))
|
||||
continue
|
||||
if "name" not in entry:
|
||||
raise ConfigError(
|
||||
f"Schema field is missing 'name': {entry!r}",
|
||||
operation="TargetSchema.from_dict",
|
||||
)
|
||||
dtype = entry.get("dtype", "auto")
|
||||
if dtype not in _VALID_DTYPES:
|
||||
raise ConfigError(
|
||||
f"Schema field {entry['name']!r}: unknown dtype {dtype!r}",
|
||||
operation="TargetSchema.from_dict",
|
||||
suggestion=f"Valid: {sorted(_VALID_DTYPES)}",
|
||||
)
|
||||
fields.append(TargetField(
|
||||
name=entry["name"],
|
||||
dtype=dtype,
|
||||
required=bool(entry.get("required", False)),
|
||||
aliases=list(entry.get("aliases", [])),
|
||||
default=entry.get("default"),
|
||||
))
|
||||
return cls(fields=fields)
|
||||
|
||||
@classmethod
|
||||
def from_file(cls, path: str | Path) -> TargetSchema:
|
||||
return cls.from_dict(json.loads(Path(path).read_text(encoding="utf-8")))
|
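As read from `TargetSchema.from_dict` above, a schema file holds a `fields` list whose entries are either bare strings or objects with `name`, `dtype`, `required`, `aliases`, and `default`. The sketch below shows one such payload and a stdlib-only stand-in that expands it the same way. `normalise_fields` is a hypothetical helper, not part of the module, and it raises `ValueError` where the module raises `ConfigError`.

```python
import json

VALID_DTYPES = {
    "string", "integer", "float", "boolean",
    "date", "datetime", "category", "auto",
}


def normalise_fields(payload: dict) -> list[dict]:
    """Expand every schema entry to the full five-key form."""
    if "fields" not in payload:
        raise ValueError("schema must contain a 'fields' list")
    out = []
    for entry in payload["fields"]:
        if isinstance(entry, str):  # shorthand: just a field name
            entry = {"name": entry}
        if "name" not in entry:
            raise ValueError(f"schema field is missing 'name': {entry!r}")
        dtype = entry.get("dtype", "auto")
        if dtype not in VALID_DTYPES:
            raise ValueError(f"unknown dtype {dtype!r}")
        out.append({
            "name": entry["name"],
            "dtype": dtype,
            "required": bool(entry.get("required", False)),
            "aliases": list(entry.get("aliases", [])),
            "default": entry.get("default"),
        })
    return out


# Example payload; the field names are illustrative only.
schema_json = """
{"fields": [
  {"name": "email", "dtype": "string", "required": true,
   "aliases": ["E-mail", "email_address"]},
  "notes"
]}
"""
fields = normalise_fields(json.loads(schema_json))
```

The bare string `"notes"` expands to an optional field with dtype `"auto"`, no aliases, and no default.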
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Fuzzy column-name matching
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
# Whitespace, punctuation, and case all vary across vendors. We normalise
# both sides (to a collapsed string and to a token set) before comparing.
|
||||
_NORM_RE = re.compile(r"[^a-z0-9]+")
|
||||
|
||||
|
||||
def _normalize_name(name: str) -> str:
|
||||
"""Lowercase, strip non-alphanumerics — ``First Name`` → ``firstname``."""
|
||||
if not isinstance(name, str):
|
||||
return ""
|
||||
return _NORM_RE.sub("", name.strip().lower())
|
||||
|
||||
|
||||
def _token_set(name: str) -> frozenset[str]:
|
||||
"""Tokenise a column name on non-alphanumeric boundaries."""
|
||||
if not isinstance(name, str):
|
||||
return frozenset()
|
||||
parts = [p for p in _NORM_RE.split(name.strip().lower()) if p]
|
||||
return frozenset(parts)
|
||||
|
||||
|
||||
def _name_similarity(a: str, b: str) -> float:
|
||||
"""Cheap similarity score in [0.0, 1.0].
|
||||
|
||||
Combines exact-after-normalisation, token Jaccard, and SequenceMatcher
|
||||
ratio. A real fuzzy library (rapidfuzz) is already a project
|
||||
dependency for the deduplicator — we use it when available, fall
|
||||
back to stdlib ``difflib`` otherwise so the mapper works in trimmed
|
||||
builds.
|
||||
"""
|
||||
if not a or not b:
|
||||
return 0.0
|
||||
na, nb = _normalize_name(a), _normalize_name(b)
|
||||
if na == nb:
|
||||
return 1.0
|
||||
|
||||
ta, tb = _token_set(a), _token_set(b)
|
||||
jaccard = (len(ta & tb) / len(ta | tb)) if (ta or tb) else 0.0
|
||||
|
||||
try:
|
||||
from rapidfuzz import fuzz
|
||||
seq = fuzz.ratio(na, nb) / 100.0
|
||||
except ImportError:
|
||||
from difflib import SequenceMatcher
|
||||
seq = SequenceMatcher(None, na, nb).ratio()
|
||||
|
||||
return max(jaccard, seq)
|
||||
|
||||
|
||||
def infer_mapping(
|
||||
df: pd.DataFrame,
|
||||
schema: TargetSchema,
|
||||
*,
|
||||
threshold: float = 0.6,
|
||||
) -> dict[str, str]:
|
||||
"""Best-guess source-column → target-field mapping.
|
||||
|
||||
Returns a dict keyed by source-column name. A source column is
|
||||
omitted from the result when no candidate scores at or above *threshold*.
|
||||
Each target is matched at most once: the highest-scoring source
|
||||
wins, ties broken by source-column order in *df*.
|
||||
|
||||
Aliases declared on a :class:`TargetField` are scored as if they
|
||||
were target names — useful for vendor-specific synonyms
|
||||
(``["customer_id", "cust_id", "client_no"]``).
|
||||
"""
|
||||
ensure_dataframe(df, function="infer_mapping")
|
||||
sources = list(df.columns)
|
||||
targets = schema.fields
|
||||
|
||||
# All (source, target) candidate scores; keep only those at or above
|
||||
# threshold, sorted descending so a greedy walk picks the best
|
||||
# available pairings first.
|
||||
scored: list[tuple[float, str, str]] = []
|
||||
for src in sources:
|
||||
for tgt in targets:
|
||||
best = _name_similarity(src, tgt.name)
|
||||
for alias in tgt.aliases:
|
||||
s = _name_similarity(src, alias)
|
||||
if s > best:
|
||||
best = s
|
||||
if best >= threshold:
|
||||
scored.append((best, str(src), tgt.name))
|
||||
|
||||
scored.sort(key=lambda x: (-x[0], sources.index(x[1])))
|
||||
|
||||
mapping: dict[str, str] = {}
|
||||
used_targets: set[str] = set()
|
||||
for score, src, tgt in scored:
|
||||
if src in mapping or tgt in used_targets:
|
||||
continue
|
||||
mapping[src] = tgt
|
||||
used_targets.add(tgt)
|
||||
return mapping
|
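The scoring and greedy one-to-one assignment described in the two docstrings above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the module's code: `similarity` and `greedy_mapping` are hypothetical names, targets are plain strings without aliases, and `difflib` stands in for `rapidfuzz`.

```python
from __future__ import annotations

import re
from difflib import SequenceMatcher

_NON_ALNUM = re.compile(r"[^a-z0-9]+")


def similarity(a: str, b: str) -> float:
    """Score in [0.0, 1.0]: exact after normalisation, else max(Jaccard, ratio)."""
    la, lb = a.strip().lower(), b.strip().lower()
    na, nb = _NON_ALNUM.sub("", la), _NON_ALNUM.sub("", lb)
    if not na or not nb:
        return 0.0
    if na == nb:
        return 1.0
    ta = {t for t in _NON_ALNUM.split(la) if t}
    tb = {t for t in _NON_ALNUM.split(lb) if t}
    jaccard = len(ta & tb) / len(ta | tb)
    return max(jaccard, SequenceMatcher(None, na, nb).ratio())


def greedy_mapping(
    sources: list[str], targets: list[str], threshold: float = 0.6
) -> dict[str, str]:
    """Pair each target with at most one source, best scores first."""
    scored = [
        (similarity(src, tgt), i, src, tgt)
        for i, src in enumerate(sources)
        for tgt in targets
    ]
    scored = [row for row in scored if row[0] >= threshold]
    # Highest score first; ties go to the earlier source column.
    scored.sort(key=lambda row: (-row[0], row[1]))
    mapping: dict[str, str] = {}
    used: set[str] = set()
    for _score, _i, src, tgt in scored:
        if src in mapping or tgt in used:
            continue
        mapping[src] = tgt
        used.add(tgt)
    return mapping
```

Because the walk is greedy, a source that loses its best target to a higher-scoring rival can still claim a lower-scoring target later, provided that pairing also clears the threshold.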
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Type coercion
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
_TRUTHY = frozenset({"true", "t", "yes", "y", "1"})
|
||||
_FALSY = frozenset({"false", "f", "no", "n", "0"})
|
||||
|
||||
|
||||
def _coerce_boolean(value: Any) -> Any:
|
||||
if isinstance(value, bool):
|
||||
return value
|
||||
if value is None or value is pd.NA or (isinstance(value, float) and pd.isna(value)):
|
||||
return pd.NA
|
||||
if isinstance(value, (int, float)):
|
||||
return bool(value)
|
||||
if isinstance(value, str):
|
||||
v = value.strip().lower()
|
||||
if v in _TRUTHY:
|
||||
return True
|
||||
if v in _FALSY:
|
||||
return False
|
||||
raise ValueError(f"cannot coerce to boolean: {value!r}")
|
||||
|
||||
|
||||
def coerce_series(series: pd.Series, dtype: ColumnDtype) -> tuple[pd.Series, int]:
|
||||
"""Coerce *series* to *dtype*, returning ``(coerced, n_failures)``.
|
||||
|
||||
Failures are counted but never raised — the caller (``map_columns``)
|
||||
surfaces them through ``MapResult.coercion_failures`` so the user
|
||||
can inspect which rows didn't fit. Already-typed inputs are cheap
|
||||
no-ops.
|
||||
"""
|
||||
if dtype == "auto":
|
||||
return series, 0
|
||||
if dtype == "string":
|
||||
return series.astype("string"), 0
|
||||
if dtype == "category":
|
||||
return series.astype("category"), 0
|
||||
if dtype == "integer":
|
||||
coerced = pd.to_numeric(series, errors="coerce")
|
||||
# Use nullable Int64 so missing entries don't force a float dtype.
|
||||
rounded = coerced.round().astype("Int64")
|
||||
# Failures = original non-NaN cells whose numeric coercion produced NaN.
|
||||
original_filled = series.notna()
|
||||
failed = (rounded.isna() & original_filled).sum()
|
||||
return rounded, int(failed)
|
||||
if dtype == "float":
|
||||
coerced = pd.to_numeric(series, errors="coerce").astype("Float64")
|
||||
original_filled = series.notna()
|
||||
failed = (coerced.isna() & original_filled).sum()
|
||||
return coerced, int(failed)
|
||||
if dtype == "boolean":
|
||||
out: list[Any] = []
|
||||
failed = 0
|
||||
for v in series.tolist():
|
||||
try:
|
||||
out.append(_coerce_boolean(v))
|
||||
except ValueError:
|
||||
out.append(pd.NA)
|
||||
failed += 1
|
||||
return pd.Series(out, index=series.index, dtype="boolean"), failed
|
||||
if dtype in {"date", "datetime"}:
|
||||
coerced = pd.to_datetime(series, errors="coerce", utc=False)
|
||||
original_filled = series.notna()
|
||||
failed = (coerced.isna() & original_filled).sum()
|
||||
if dtype == "date":
|
||||
# Drop the time component but keep dtype as datetime64 so
|
||||
# downstream operations (delta, sort) still work.
|
||||
coerced = coerced.dt.normalize()
|
||||
return coerced, int(failed)
|
||||
raise InputValidationError(
|
||||
f"Unknown dtype {dtype!r}",
|
||||
operation="coerce_series",
|
||||
suggestion=f"Valid: {sorted(_VALID_DTYPES)}",
|
||||
)
|
||||
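The "count failures, never raise" contract of `coerce_series` can be illustrated without pandas. The sketch below mirrors the boolean branch under one loud assumption: `None` stands in for `pd.NA`, and a plain list stands in for the Series. The names `coerce_boolean` and `coerce_boolean_column` are illustrative only.

```python
import math
from typing import Any, Optional

TRUTHY = frozenset({"true", "t", "yes", "y", "1"})
FALSY = frozenset({"false", "f", "no", "n", "0"})


def coerce_boolean(value: Any) -> Optional[bool]:
    """Return True/False, None for missing input, or raise ValueError."""
    if isinstance(value, bool):
        return value
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None  # assumption: None plays the role of pd.NA here
    if isinstance(value, (int, float)):
        return bool(value)
    if isinstance(value, str):
        token = value.strip().lower()
        if token in TRUTHY:
            return True
        if token in FALSY:
            return False
    raise ValueError(f"cannot coerce to boolean: {value!r}")


def coerce_boolean_column(values: list[Any]) -> tuple[list[Optional[bool]], int]:
    """Coerce every cell; unparseable cells become None and are counted."""
    out: list[Optional[bool]] = []
    failed = 0
    for v in values:
        try:
            out.append(coerce_boolean(v))
        except ValueError:
            out.append(None)
            failed += 1
    return out, failed


coerced, n_failures = coerce_boolean_column(
    ["Yes", " n ", 1, None, "maybe", float("nan")]
)
```

Only `"maybe"` counts as a failure; the two genuinely missing cells pass through as missing without inflating the count, which is the distinction the `original_filled` mask draws in the numeric branches.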
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Options / result dataclasses
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
# Strategy for handling source columns that don't appear in the target
# schema. ``keep`` preserves them (moved after the schema fields when
# ``reorder_to_schema`` is set); ``drop`` removes them; ``error``
# raises an InputValidationError.
|
||||
UnmappedStrategy = Literal["keep", "drop", "error"]
|
||||
|
||||
PRESETS: dict[str, dict[str, Any]] = {
|
||||
"rename-only": {
|
||||
"auto_infer": True,
|
||||
"unmapped": "keep",
|
||||
"coerce_types": False,
|
||||
"reorder_to_schema": False,
|
||||
},
|
||||
"strict-schema": {
|
||||
"auto_infer": True,
|
||||
"unmapped": "drop",
|
||||
"coerce_types": True,
|
||||
"reorder_to_schema": True,
|
||||
},
|
||||
"lenient-schema": {
|
||||
"auto_infer": True,
|
||||
"unmapped": "keep",
|
||||
"coerce_types": True,
|
||||
"reorder_to_schema": True,
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
@dataclass
|
||||
class MapOptions:
|
||||
"""Toggles for column mapping.
|
||||
|
||||
Defaults match the ``rename-only`` preset: best-effort fuzzy match
against the schema (if provided), keep unmapped source columns in
their original positions, no type coercion, no reorder.
|
||||
"""
|
||||
|
||||
# Either pass an explicit ``mapping`` dict or a ``schema`` (and let
|
||||
# the engine infer the mapping). Explicit mapping wins when both
|
||||
# are set.
|
||||
mapping: dict[str, str] = field(default_factory=dict)
|
||||
schema: Optional[TargetSchema] = None
|
||||
|
||||
# When True (default), missing entries in ``mapping`` are filled in
|
||||
# by ``infer_mapping`` against ``schema``. When False, only the
|
||||
# explicit mapping is honoured.
|
||||
auto_infer: bool = True
|
||||
fuzzy_threshold: float = 0.6
|
||||
|
||||
# What to do with source columns that aren't in the mapping.
|
||||
unmapped: UnmappedStrategy = "keep"
|
||||
|
||||
# Apply target-field dtypes from the schema after rename.
|
||||
coerce_types: bool = False
|
||||
|
||||
# Reorder output to match schema.fields order. Unmapped survivors
|
||||
# (when unmapped="keep") are appended at the end in their original
|
||||
# source order.
|
||||
reorder_to_schema: bool = False
|
||||
|
||||
# Required-target enforcement. When True (default), a required
# target field with no source column and no schema default raises an
# InputValidationError. When False, such fields are left out of the
# output and reported in ``MapResult.missing_required_targets``.
|
||||
enforce_required: bool = True
|
||||
|
||||
@classmethod
|
||||
def from_preset(cls, name: str) -> MapOptions:
|
||||
if name not in PRESETS:
|
||||
raise ConfigError(
|
||||
f"Unknown preset '{name}'",
|
||||
operation="MapOptions.from_preset",
|
||||
suggestion=f"Available: {sorted(PRESETS)}",
|
||||
)
|
||||
return cls(**PRESETS[name])
|
||||
|
||||
@classmethod
|
||||
def from_dict(cls, data: dict) -> MapOptions:
|
||||
known = set(cls.__dataclass_fields__)
|
||||
kwargs = {k: v for k, v in data.items() if k in known}
|
||||
if "schema" in kwargs and isinstance(kwargs["schema"], dict):
|
||||
kwargs["schema"] = TargetSchema.from_dict(kwargs["schema"])
|
||||
return cls(**kwargs)
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
out: dict[str, Any] = {
|
||||
"mapping": dict(self.mapping),
|
||||
"auto_infer": self.auto_infer,
|
||||
"fuzzy_threshold": self.fuzzy_threshold,
|
||||
"unmapped": self.unmapped,
|
||||
"coerce_types": self.coerce_types,
|
||||
"reorder_to_schema": self.reorder_to_schema,
|
||||
"enforce_required": self.enforce_required,
|
||||
}
|
||||
if self.schema is not None:
|
||||
out["schema"] = self.schema.to_dict()
|
||||
return out
|
||||
|
||||
def to_file(self, path: str | Path) -> Path:
|
||||
out = Path(path)
|
||||
out.write_text(json.dumps(self.to_dict(), indent=2, default=str))
|
||||
return out
|
||||
|
||||
@classmethod
|
||||
def from_file(cls, path: str | Path) -> MapOptions:
|
||||
return cls.from_dict(json.loads(Path(path).read_text()))
|
||||
|
||||
def validate(self) -> None:
|
||||
ensure_choice(
|
||||
self.unmapped, name="unmapped",
|
||||
choices=("keep", "drop", "error"),
|
||||
function="MapOptions.validate",
|
||||
)
|
||||
if not (0.0 <= self.fuzzy_threshold <= 1.0):
|
||||
raise ConfigError(
|
||||
f"fuzzy_threshold must be in [0.0, 1.0], got {self.fuzzy_threshold!r}",
|
||||
operation="MapOptions.validate",
|
||||
)
|
||||
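The preset and round-trip behaviour of `MapOptions` can be sketched with a trimmed-down dataclass. This is an illustration under stated assumptions: `Options` is a hypothetical reduced class with four fields, and `ValueError` stands in for the project's `ConfigError`.

```python
from dataclasses import asdict, dataclass, field, fields
from typing import Any

PRESETS: dict[str, dict[str, Any]] = {
    "rename-only": {"unmapped": "keep", "coerce_types": False},
    "strict-schema": {"unmapped": "drop", "coerce_types": True},
}


@dataclass
class Options:
    mapping: dict[str, str] = field(default_factory=dict)
    unmapped: str = "keep"
    coerce_types: bool = False
    fuzzy_threshold: float = 0.6

    @classmethod
    def from_preset(cls, name: str) -> "Options":
        if name not in PRESETS:
            # ValueError is a stand-in for the project's ConfigError.
            raise ValueError(f"Unknown preset {name!r}; available: {sorted(PRESETS)}")
        return cls(**PRESETS[name])

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "Options":
        # Unknown keys are ignored so older or newer JSON files still load.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})

    def to_dict(self) -> dict[str, Any]:
        return asdict(self)


strict = Options.from_preset("strict-schema")
restored = Options.from_dict({**strict.to_dict(), "legacy_flag": True})
```

Filtering on the known field names is what lets a saved options file survive a field being added or removed between versions; the cost is that a misspelled key is dropped silently rather than reported.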
|
||||
|
||||
@dataclass
|
||||
class MapResult:
|
||||
"""Output of ``map_columns``."""
|
||||
|
||||
mapped_df: pd.DataFrame
|
||||
mapping: dict[str, str] # source → target
|
||||
inferred_pairs: dict[str, str] # subset of mapping that was auto-inferred
|
||||
columns_renamed: int
|
||||
columns_dropped: list[str]
|
||||
columns_added: list[str] # schema fields absent from the input, added with their default (pd.NA if none)
|
||||
coercion_failures: dict[str, int] # column → n_rows_that_failed_coercion
|
||||
unmapped_kept: list[str]
|
||||
missing_required_targets: list[str]
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Main entry point
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def map_columns(
|
||||
df: pd.DataFrame,
|
||||
options: Optional[MapOptions] = None,
|
||||
) -> MapResult:
|
||||
"""Apply *options* to *df* and return a :class:`MapResult`.
|
||||
|
||||
Pipeline placement (recommended, not enforced)
|
||||
----------------------------------------------
|
||||
Two natural slots:
|
||||
* **Early** — header alignment for multi-vendor unification.
|
||||
Each vendor uses different column names; rename to a canonical
|
||||
schema before any other tool runs.
|
||||
* **Late** — schema enforcement for output. After cleaning, coerce
|
||||
types and project to the target shape (CRM import contract,
|
||||
database schema). Run after format / missing so the coerced
|
||||
data is canonical first.
|
||||
The pipeline runner does not enforce a position; place by use case.
|
||||
|
||||
Pipeline:
|
||||
1. Compose mapping (explicit ``options.mapping`` ∪ inferred
|
||||
pairs from ``options.schema``).
|
||||
2. Reject duplicate target names — two source columns mapped to
|
||||
the same target is a user error, not a silent overwrite.
|
||||
3. Decide what to do with unmapped source columns
|
||||
(``keep`` / ``drop`` / ``error``).
|
||||
4. Rename, then handle missing required targets, then coerce
|
||||
types, then reorder.
|
||||
"""
|
||||
ensure_dataframe(df, function="map_columns")
|
||||
options = options or MapOptions()
|
||||
options.validate()
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# 1. Compose the effective mapping
|
||||
# ------------------------------------------------------------------
|
||||
explicit = dict(options.mapping)
|
||||
inferred: dict[str, str] = {}
|
||||
if options.schema is not None and options.auto_infer:
|
||||
all_inferred = infer_mapping(df, options.schema, threshold=options.fuzzy_threshold)
|
||||
# Explicit user pairings always win.
|
||||
used_targets = set(explicit.values())
|
||||
for src, tgt in all_inferred.items():
|
||||
if src in explicit:
|
||||
continue
|
||||
if tgt in used_targets:
|
||||
continue
|
||||
inferred[src] = tgt
|
||||
used_targets.add(tgt)
|
||||
|
||||
mapping: dict[str, str] = {**inferred, **explicit}
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# 2. Validate mapping coherence
|
||||
# ------------------------------------------------------------------
|
||||
unknown_sources = [s for s in mapping if s not in df.columns]
|
||||
if unknown_sources:
|
||||
raise InputValidationError(
|
||||
f"Mapping references columns not in input: {unknown_sources}",
|
||||
operation="map_columns",
|
||||
suggestion=f"Available source columns: {list(df.columns)}",
|
||||
)
|
||||
target_counts: dict[str, int] = {}
|
||||
for tgt in mapping.values():
|
||||
target_counts[tgt] = target_counts.get(tgt, 0) + 1
|
||||
duplicates = [t for t, n in target_counts.items() if n > 1]
|
||||
if duplicates:
|
||||
raise InputValidationError(
|
||||
f"Multiple source columns mapped to the same target(s): {duplicates}",
|
||||
operation="map_columns",
|
||||
suggestion="Each target name must be unique. Drop or rename the conflicting source columns.",
|
||||
)
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# 3. Handle unmapped source columns
|
||||
# ------------------------------------------------------------------
|
||||
unmapped_sources = [c for c in df.columns if c not in mapping]
|
||||
unmapped_kept: list[str] = []
|
||||
columns_dropped: list[str] = []
|
||||
if unmapped_sources:
|
||||
if options.unmapped == "drop":
|
||||
columns_dropped = list(unmapped_sources)
|
||||
elif options.unmapped == "error":
|
||||
raise InputValidationError(
|
||||
f"Source columns have no mapping and unmapped='error': {unmapped_sources}",
|
||||
operation="map_columns",
|
||||
suggestion=(
|
||||
"Either add explicit mapping entries, set unmapped='keep' / 'drop', "
|
||||
"or include the columns in the target schema."
|
||||
),
|
||||
)
|
||||
else:
|
||||
unmapped_kept = list(unmapped_sources)
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# 4. Apply rename and drop
|
||||
# ------------------------------------------------------------------
|
||||
out = df.copy()
|
||||
if columns_dropped:
|
||||
out = out.drop(columns=columns_dropped)
|
||||
if mapping:
|
||||
out = out.rename(columns=mapping)
|
||||
columns_renamed = sum(1 for src, tgt in mapping.items() if src != tgt)
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# 5. Handle the schema's required + default fields
|
||||
# ------------------------------------------------------------------
|
||||
columns_added: list[str] = []
|
||||
missing_required: list[str] = []
|
||||
if options.schema is not None:
|
||||
present = set(out.columns)
|
||||
for tf in options.schema.fields:
|
||||
if tf.name in present:
|
||||
continue
|
||||
if tf.required and tf.default is None:
|
||||
missing_required.append(tf.name)
|
||||
continue
|
||||
# Add with the schema default (pd.NA when no default is set).
|
||||
out[tf.name] = tf.default if tf.default is not None else pd.NA
|
||||
columns_added.append(tf.name)
|
||||
|
||||
if missing_required and options.enforce_required:
|
||||
raise InputValidationError(
|
||||
f"Required target field(s) missing from input: {missing_required}",
|
||||
operation="map_columns",
|
||||
suggestion=(
|
||||
"Either add explicit mapping entries, lower fuzzy_threshold, "
|
||||
"supply a default in the schema, or set enforce_required=False."
|
||||
),
|
||||
)
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# 6. Coerce types per the schema
|
||||
# ------------------------------------------------------------------
|
||||
coercion_failures: dict[str, int] = {}
|
||||
if options.coerce_types and options.schema is not None:
|
||||
for tf in options.schema.fields:
|
||||
if tf.name not in out.columns or tf.dtype == "auto":
|
||||
continue
|
||||
try:
|
||||
series, fails = coerce_series(out[tf.name], tf.dtype)
|
||||
except (ValueError, TypeError) as e:
|
||||
logger.warning(
|
||||
"map_columns: coerce of {!r} → {} failed: {}",
|
||||
tf.name, tf.dtype, e,
|
||||
)
|
||||
continue
|
||||
out[tf.name] = series
|
||||
if fails:
|
||||
coercion_failures[tf.name] = fails
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# 7. Reorder
|
||||
# ------------------------------------------------------------------
|
||||
if options.reorder_to_schema and options.schema is not None:
|
||||
ordered = [f.name for f in options.schema.fields if f.name in out.columns]
|
||||
# Append survivors (kept-unmapped originals) in their pre-rename order.
|
||||
survivors = [c for c in out.columns if c not in ordered]
|
||||
out = out.loc[:, ordered + survivors]
|
||||
|
||||
return MapResult(
|
||||
mapped_df=out,
|
||||
mapping=mapping,
|
||||
inferred_pairs=inferred,
|
||||
columns_renamed=columns_renamed,
|
||||
columns_dropped=columns_dropped,
|
||||
columns_added=columns_added,
|
||||
coercion_failures=coercion_failures,
|
||||
unmapped_kept=unmapped_kept,
|
||||
missing_required_targets=missing_required,
|
||||
)
|
||||
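Steps 1 to 3 of `map_columns` (compose, validate, handle unmapped) operate on column names only, so they can be sketched without a DataFrame. The helper below is a hypothetical illustration: `plan_columns` is not part of the module, and `ValueError` stands in for `InputValidationError`.

```python
from collections import Counter


def plan_columns(
    columns: list[str],
    explicit: dict[str, str],
    inferred: dict[str, str],
    unmapped: str = "keep",
) -> tuple[list[str], list[str], list[str]]:
    """Return (output column names, kept unmapped, dropped unmapped)."""
    # 1. Compose: explicit pairs win over inferred ones on source and target.
    used_targets = set(explicit.values())
    mapping: dict[str, str] = {}
    for src, tgt in inferred.items():
        if src in explicit or tgt in used_targets:
            continue
        mapping[src] = tgt
        used_targets.add(tgt)
    mapping.update(explicit)

    # 2. Validate: every source must exist, every target must be unique.
    unknown = [s for s in mapping if s not in columns]
    if unknown:
        raise ValueError(f"Mapping references columns not in input: {unknown}")
    duplicates = [t for t, n in Counter(mapping.values()).items() if n > 1]
    if duplicates:
        raise ValueError(f"Multiple sources mapped to the same target(s): {duplicates}")

    # 3. Unmapped strategy.
    leftover = [c for c in columns if c not in mapping]
    if leftover and unmapped == "error":
        raise ValueError(f"Source columns have no mapping: {leftover}")
    dropped = list(leftover) if unmapped == "drop" else []
    kept = list(leftover) if unmapped == "keep" else []

    output = [mapping.get(c, c) for c in columns if c not in dropped]
    return output, kept, dropped


cols = ["Cust Name", "E-mail", "Notes"]
explicit = {"Cust Name": "name"}
inferred = {"E-mail": "email", "Cust Name": "full_name"}
strict_plan = plan_columns(cols, explicit, inferred, unmapped="drop")
lenient_plan = plan_columns(cols, explicit, inferred, unmapped="keep")
```

The inferred `Cust Name` pairing is discarded because the user already mapped that source explicitly, which matches the "explicit user pairings always win" rule in step 1.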
@@ -514,6 +514,19 @@ def deduplicate(
|
||||
) -> DeduplicationResult:
|
||||
"""Run the full deduplication pipeline.
|
||||
|
||||
Pipeline placement (recommended, not enforced)
|
||||
----------------------------------------------
|
||||
Run *last* among the cleaning tools. Fuzzy matching is more
|
||||
accurate when:
|
||||
* text has been cleaned (NBSP padding doesn't make
|
||||
``"Alice "`` look different from ``"Alice"``);
|
||||
* formats have been canonicalized (``+14155551234`` matches
|
||||
across rows where the source had ``(415) 555-1234`` and
|
||||
``415.555.1234``);
|
||||
* missing values have been standardized (NaN matching is
|
||||
brittle; sentinel-laundered cells produce false matches).
|
||||
See ``src.core.pipeline.SOFT_DEPENDENCIES``.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
df : input DataFrame
|
||||
|
||||
@@ -815,7 +815,22 @@ _CURRENCY_TRIM_RE = re.compile(
|
||||
_PARENS_NEGATIVE_RE = re.compile(r"^\s*\(\s*(.+?)\s*\)\s*$")
|
||||
|
||||
|
||||
-CurrencyDecimal = Literal["dot", "comma"]
+CurrencyDecimal = Literal["dot", "comma", "auto"]
|
||||
|
||||
|
||||
# Multi-character symbol prefixes that aren't captured by the
|
||||
# single-codepoint ``_CURRENCY_SYMBOLS`` table. Order matters: the
|
||||
# detector checks these prefixes BEFORE the single-symbol regex, so
|
||||
# ``R$`` resolves to BRL even though ``$`` alone would map to USD.
|
||||
_PREFIX_TO_ISO: dict[str, str] = {
|
||||
"r$": "BRL", # Brazilian Real
|
||||
"kr": "SEK", # ambiguous Nordic — picks SEK as most common; see tests
|
||||
"zł": "PLN", # Polish Złoty
|
||||
"лв": "BGN", # Bulgarian Lev
|
||||
"₽": "RUB", # already in symbol table; kept for parity
|
||||
"rs.": "INR", # rupees — covers IN/PK informal usage
|
||||
"rs": "INR",
|
||||
}
|
||||
|
||||
|
||||
def detect_currency_code(value: str) -> Optional[str]:
|
||||
@@ -825,9 +840,21 @@ def detect_currency_code(value: str) -> Optional[str]:
|
||||
symbol → code mapping (``$1234`` → ``USD``). Symbol mapping is best-
|
||||
effort: ``$`` is ambiguous between USD/CAD/AUD/MXN — the caller is
|
||||
expected to constrain that via input data discipline.
|
||||
|
||||
Multi-char prefixes (``R$``, ``zł``, ``kr``) are recognised before
|
||||
the single-symbol regex so Brazilian / Polish / Nordic data isn't
|
||||
silently bucketed as USD.
|
||||
"""
|
||||
if not isinstance(value, str):
|
||||
return None
|
||||
head = value.lstrip().lower()
|
||||
for prefix, code in _PREFIX_TO_ISO.items():
|
||||
if head.startswith(prefix):
|
||||
# Make sure the next char (if any) isn't a letter — avoid
|
||||
# matching ``rsa`` as ``rs``-then-``a``.
|
||||
tail = head[len(prefix):]
|
||||
if not tail or not tail[0].isalpha():
|
||||
return code
|
||||
m = _CURRENCY_DETECT_RE.search(value)
|
||||
if m is None:
|
||||
return None
|
||||
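The prefix check at the top of `detect_currency_code` can be exercised on its own. The sketch below copies a subset of the prefix table and the "next character must not be a letter" guard from the hunk above; `detect_prefix_code` is a hypothetical name, and the fallback to the single-symbol regex is deliberately left out.

```python
from typing import Optional

# Subset of the table above; "rs." is listed before "rs" so the longer
# prefix is tried first.
PREFIX_TO_ISO: dict[str, str] = {
    "r$": "BRL",
    "kr": "SEK",
    "zł": "PLN",
    "rs.": "INR",
    "rs": "INR",
}


def detect_prefix_code(value: object) -> Optional[str]:
    """Return an ISO code for a leading multi-character currency prefix."""
    if not isinstance(value, str):
        return None
    head = value.lstrip().lower()
    for prefix, code in PREFIX_TO_ISO.items():
        if head.startswith(prefix):
            tail = head[len(prefix):]
            # Reject when a letter follows, so "krona" is not read as "kr".
            if not tail or not tail[0].isalpha():
                return code
    return None


samples = ["R$ 1,5", "kr 100", "krona", "rsa 5", "Rs. 500", "zł12", "$5"]
detected = [detect_prefix_code(s) for s in samples]
```

Plain `$5` falls through to `None` here because it has no multi-character prefix; in the real function that case is handled by the single-symbol regex that runs next.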
@@ -852,10 +879,16 @@ def standardize_currency(
|
||||
|
||||
``decimal="dot"``: ``$1,234.56`` → ``1234.56`` (US/UK convention).
|
||||
``decimal="comma"``: ``1.234,56 €`` → ``1234.56`` (EU convention).
|
||||
-Either mode auto-detects the EU shape when both ``.`` and ``,`` are
-present and the comma sits after the dot (so ``€1.234,56`` parses
-correctly even under the dot-default mode). Space-thousands and
-Swiss apostrophe-thousands are also recognized.
+``decimal="auto"``: same as ``dot`` but a single trailing comma
+whose tail is NOT exactly 3 digits is read as a decimal separator
+(``850,50`` → ``850.50``, ``R$ 1,5`` → ``1.5``). Use this for
+mixed-locale international files. Length-3 tails (``1,234``) stay
+ambiguous regardless of mode.
+
+All three modes auto-detect the EU shape when both ``.`` and ``,``
+are present and the comma sits after the dot (so ``€1.234,56``
+parses correctly even under the dot-default mode). Space-thousands
+and Swiss apostrophe-thousands are also recognized.
|
||||
|
||||
The output always uses a dot as the decimal separator since that is
|
||||
the form pandas/Python parse natively.
|
||||
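The comma rule described for ``decimal="auto"`` can be sketched as a small pure function. This is an illustration of the rule only, assuming the input has already had its currency symbol stripped and contains no dot; `resolve_auto_comma` is a hypothetical name, and `ValueError` stands in for the module's `_err(...)` return.

```python
def resolve_auto_comma(rest: str) -> str:
    """Apply the auto-mode comma rule to a dot-free numeric string."""
    if "," not in rest:
        return rest
    if rest.count(",") > 1:
        # Several commas can only be thousands separators.
        return rest.replace(",", "")
    after = rest.rsplit(",", 1)[1]
    if len(after) == 3:
        # "1,234" could be 1234 or 1.234: refuse to guess.
        raise ValueError("ambiguous separator")
    # A single comma with a 1-, 2- or 4+-digit tail is read as a decimal.
    return rest.replace(",", ".")


resolved = [resolve_auto_comma(s) for s in ["850,50", "1,5", "1,234,567", "42"]]
```

The three-digit case is the only one that errors, which keeps the guess confined to inputs where a thousands-group reading is implausible.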
@@ -899,6 +932,22 @@ def standardize_currency(
|
||||
|
||||
code = detect_currency_code(s) if preserve_code else None
|
||||
|
||||
# Strip any multi-char currency prefix (``R$``, ``kr``, ``zł``)
|
||||
# before the symbol-table regex — these aren't single codepoints
|
||||
# so the table-driven trim would otherwise leave them in place.
|
||||
head = s.lstrip().lower()
|
||||
for prefix in _PREFIX_TO_ISO:
|
||||
if head.startswith(prefix):
|
||||
tail_start = len(prefix)
|
||||
if tail_start < len(head) and head[tail_start].isalpha():
|
||||
continue
|
||||
# Strip the matched prefix from the original (preserve case
|
||||
# of any trailing content).
|
||||
stripped_lead = s[: len(s) - len(head)]
|
||||
s = stripped_lead + s.lstrip()[len(prefix):]
|
||||
s = s.lstrip()
|
||||
break
|
||||
|
||||
negative = False
|
||||
m = _PARENS_NEGATIVE_RE.match(s)
|
||||
if m:
|
||||
@@ -948,6 +997,19 @@ def standardize_currency(
|
||||
# is unambiguously EU — treat the comma as decimal.
|
||||
if had_space_thousands:
|
||||
rest = rest.replace(",", ".")
|
||||
elif decimal == "auto":
|
||||
# International auto-detection: a single comma whose
|
||||
# tail is NOT exactly 3 digits is far more likely to be
|
||||
# an EU/BRL decimal (``850,50``, ``1,5``) than a
|
||||
# malformed US thousands group. Length-3 tails stay
|
||||
# ambiguous and require an explicit locale.
|
||||
after = rest.rsplit(",", 1)[1]
|
||||
if rest.count(",") > 1:
|
||||
rest = rest.replace(",", "")
|
||||
elif len(after) == 3:
|
||||
return _err("ambiguous separator, set --currency-locale")
|
||||
else:
|
||||
rest = rest.replace(",", ".")
|
||||
else:
|
||||
after = rest.rsplit(",", 1)[1]
|
||||
if len(after) != 3:
|
||||
@@ -1910,6 +1972,26 @@ class StandardizeOptions:
|
||||
# verbatim into Title Case rendering.
|
||||
extra_abbreviations: dict[str, str] = field(default_factory=dict)
|
||||
|
||||
# ----- Scale knobs for large international files -----
|
||||
# Per-row country/region overrides. When set, each phone or address
|
||||
# row's region is read from the named column (an ISO-3166 alpha-2 code:
|
||||
# "US", "GB", "JP", "FR", …). Falls back to ``phone_region`` /
|
||||
# global default when the column is missing or the cell is blank.
|
||||
phone_country_column: Optional[str] = None
|
||||
address_country_column: Optional[str] = None
|
||||
|
||||
# Audit cap. The change table can grow to tens of millions of rows on
|
||||
# a 1 GB input — capping protects memory and keeps the audit usable.
|
||||
# ``cells_changed`` still counts every modification; only the per-row
|
||||
# ``changes`` DataFrame is truncated. Set to None for unbounded.
|
||||
audit_max_rows: Optional[int] = 10_000
|
||||
|
||||
# Value-level LRU cache size per standardizer. Repeated phone numbers
|
||||
# (call-list duplicates), repeated currencies, repeated boolean
|
||||
# tokens — all dominate at scale. A 256k-entry cache absorbs most
|
||||
# real-world cardinalities without ballooning memory.
|
||||
cache_size: int = 262_144
|
||||
|
||||
@classmethod
|
||||
def from_preset(cls, name: str, **overrides: Any) -> StandardizeOptions:
|
||||
"""Build options from a named preset, with optional field overrides.
|
||||
@@ -1953,7 +2035,7 @@ class StandardizeOptions:
|
||||
for field_name, valid in (
|
||||
("date_order", {"MDY", "DMY"}),
|
||||
("phone_format", set(_PHONE_FORMAT_MAP) | {"DIGITS"}),
|
||||
-("currency_decimal", {"dot", "comma"}),
+("currency_decimal", {"dot", "comma", "auto"}),
|
||||
("name_case", {"title", "upper", "lower"}),
|
||||
("boolean_style", set(_BOOL_OUTPUT)),
|
||||
("date_error_policy", {"passthrough", "sentinel"}),
|
||||
@@ -2213,6 +2295,193 @@ def _resolve_column_types(
|
||||
return resolved
|
||||
|
||||
|
||||
def _build_cached_dispatcher(
|
||||
field_type: FieldType,
|
||||
options: StandardizeOptions,
|
||||
):
|
||||
"""Return a per-value standardizer wrapped in an LRU cache.
|
||||
|
||||
The cache key is the raw cell value plus, when applicable, the
|
||||
per-row region derived from ``phone_country_column`` /
|
||||
``address_country_column``. Repeated values are O(1) lookups —
|
||||
critical at 1 GB scale where the same number appears thousands
|
||||
of times.
|
||||
|
||||
The dispatcher captures the relevant subset of ``options`` so the
|
||||
cache key stays small (we don't want to serialize the whole
|
||||
options dataclass into every cache entry).
|
||||
"""
|
||||
from functools import lru_cache
|
||||
|
||||
cache_size = options.cache_size if options.cache_size > 0 else None
|
||||
|
||||
if field_type == FieldType.DATE:
|
||||
out_fmt = options.date_output_format
|
||||
date_order = options.date_order
|
||||
date_err = options.date_error_policy
|
||||
locales = (
|
||||
tuple(options.date_month_locales) if options.date_month_locales else None
|
||||
)
|
||||
|
||||
@lru_cache(maxsize=cache_size)
|
||||
def fn(value: Any, _region: Optional[str] = None):
|
||||
return _apply_field_type_for(
|
||||
value, FieldType.DATE, options,
|
||||
_date_args=(out_fmt, date_order, date_err, locales),
|
||||
)
|
||||
return fn
|
||||
|
||||
if field_type == FieldType.PHONE:
|
||||
out_fmt = options.phone_format
|
||||
err = options.phone_error_policy
|
||||
default_region = options.phone_region
|
||||
|
||||
@lru_cache(maxsize=cache_size)
|
||||
def fn(value: Any, region: Optional[str] = None):
|
||||
r = region or default_region
|
||||
return _apply_field_type_for(
|
||||
value, FieldType.PHONE, options,
|
||||
_phone_args=(out_fmt, r, err),
|
||||
)
|
||||
return fn
|
||||
|
||||
if field_type == FieldType.CURRENCY:
|
||||
decimal = options.currency_decimal
|
||||
decimals = options.currency_decimals
|
||||
preserve = options.currency_preserve_code
|
||||
err = options.currency_error_policy
|
||||
|
||||
@lru_cache(maxsize=cache_size)
|
||||
def fn(value: Any, _region: Optional[str] = None):
|
||||
return _apply_field_type_for(
|
||||
value, FieldType.CURRENCY, options,
|
||||
_currency_args=(decimal, decimals, preserve, err),
|
||||
)
|
||||
return fn
|
||||
|
||||
if field_type == FieldType.BOOLEAN:
|
||||
style = options.boolean_style
|
||||
|
||||
@lru_cache(maxsize=cache_size)
|
||||
def fn(value: Any, _region: Optional[str] = None):
|
||||
return _apply_field_type_for(
|
||||
value, FieldType.BOOLEAN, options,
|
||||
_boolean_args=(style,),
|
||||
)
|
||||
return fn
|
||||
|
||||
if field_type == FieldType.EMAIL:
|
||||
gmail = options.email_gmail_canonical
|
||||
err = options.email_error_policy
|
||||
|
||||
@lru_cache(maxsize=cache_size)
|
||||
def fn(value: Any, _region: Optional[str] = None):
|
||||
return _apply_field_type_for(
|
||||
value, FieldType.EMAIL, options,
|
||||
_email_args=(gmail, err),
|
||||
)
|
||||
return fn
|
||||
|
||||
# Names are usually unique per row, so no cache wraps them, but we
# still go through ``_apply_field_type`` for parity.
|
||||
if field_type == FieldType.NAME:
|
||||
def fn(value: Any, _region: Optional[str] = None):
|
||||
return _apply_field_type(value, FieldType.NAME, options)
|
||||
return fn
|
||||
|
||||
if field_type == FieldType.ADDRESS:
|
||||
# Addresses can be cached too — long lists of repeated office
|
||||
# addresses or warehouse locations are common in commerce data.
|
||||
@lru_cache(maxsize=cache_size)
|
||||
def fn(value: Any, _region: Optional[str] = None):
|
||||
return _apply_field_type(value, FieldType.ADDRESS, options)
|
||||
return fn
|
||||
|
||||
# Fallback (shouldn't happen — every FieldType is covered above).
|
||||
return lambda value, _region=None: _apply_field_type(value, field_type, options)
|
||||
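The closure-plus-`lru_cache` pattern used by `_build_cached_dispatcher` can be shown with a toy normalizer. Everything below is a sketch: `build_cached_phone_normalizer` and its digit-stripping body are hypothetical stand-ins for the real phone standardizer, kept only to make cache hits observable.

```python
from functools import lru_cache
from typing import Callable, Optional


def build_cached_phone_normalizer(
    default_region: str = "US",
    cache_size: int = 1024,
) -> Callable[..., str]:
    """Return a per-value normalizer whose results are memoised."""
    # As in the dispatcher above, a non-positive size means "unbounded".
    maxsize = cache_size if cache_size > 0 else None

    @lru_cache(maxsize=maxsize)
    def normalize(value: str, region: Optional[str] = None) -> str:
        # The settings are captured by the closure, so the cache key is
        # just (value, region) and stays small.
        r = region or default_region
        digits = "".join(ch for ch in value if ch.isdigit())
        return f"{r}:{digits}"  # toy output format, not E.164

    return normalize


normalize = build_cached_phone_normalizer()
first = normalize("(415) 555-1234")
second = normalize("(415) 555-1234")        # served from the cache
other = normalize("(415) 555-1234", "GB")   # different region, new entry
stats = normalize.cache_info()
```

Two caveats follow from how `functools.lru_cache` builds keys: arguments must be hashable, and with the default `typed=False` values that compare equal (such as `1`, `1.0` and `True`) share one cache entry.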
|
||||
|
||||
def _apply_field_type_for(
|
||||
value: Any,
|
||||
field_type: FieldType,
|
||||
options: StandardizeOptions,
|
||||
*,
|
||||
_date_args=None,
|
||||
_phone_args=None,
|
||||
_currency_args=None,
|
||||
_boolean_args=None,
|
||||
_email_args=None,
|
||||
) -> tuple[Any, bool, bool]:
|
||||
"""Cacheable dispatcher: same shape as :func:`_apply_field_type` but
|
||||
accepts pre-extracted scalar argument tuples so the LRU cache key is
|
||||
just ``(value, region)`` instead of the full options object.
|
||||
"""
|
||||
if value is None or (isinstance(value, float) and pd.isna(value)):
|
||||
return value, False, True
|
||||
if not isinstance(value, str):
|
||||
if field_type == FieldType.BOOLEAN:
|
||||
style = (_boolean_args or (options.boolean_style,))[0]
|
||||
new, changed = standardize_boolean(value, style=style)
|
||||
return new, changed, True
|
||||
value = str(value)
|
||||
|
||||
if not value.strip():
|
||||
return value, False, True
|
||||
|
||||
if field_type == FieldType.DATE:
|
||||
out_fmt, date_order, err, locales = _date_args or (
|
||||
options.date_output_format, options.date_order,
|
||||
options.date_error_policy,
|
||||
tuple(options.date_month_locales) if options.date_month_locales else None,
|
||||
)
|
||||
new, changed = standardize_date(
|
||||
value,
|
||||
output_format=out_fmt,
|
||||
date_order=date_order,
|
||||
error_policy=err,
|
||||
month_locales=list(locales) if locales else None,
|
||||
)
|
||||
elif field_type == FieldType.PHONE:
|
||||
out_fmt, region, err = _phone_args or (
|
||||
options.phone_format, options.phone_region, options.phone_error_policy,
|
||||
)
|
||||
new, changed = standardize_phone(
|
||||
value, output_format=out_fmt, default_region=region, error_policy=err,
|
||||
)
|
||||
elif field_type == FieldType.CURRENCY:
|
||||
decimal, decimals, preserve, err = _currency_args or (
|
||||
options.currency_decimal, options.currency_decimals,
|
||||
options.currency_preserve_code, options.currency_error_policy,
|
||||
)
|
||||
new, changed = standardize_currency(
|
||||
value,
|
||||
decimal=decimal,
|
||||
decimals=decimals,
|
||||
preserve_code=preserve,
|
||||
error_policy=err,
|
||||
)
|
||||
elif field_type == FieldType.BOOLEAN:
|
||||
style = (_boolean_args or (options.boolean_style,))[0]
|
||||
new, changed = standardize_boolean(value, style=style)
|
||||
elif field_type == FieldType.EMAIL:
|
||||
gmail, err = _email_args or (
|
||||
options.email_gmail_canonical, options.email_error_policy,
|
||||
)
|
||||
new, changed = standardize_email(
|
||||
value, gmail_canonical=gmail, error_policy=err,
|
||||
)
|
||||
else:
|
||||
return _apply_field_type(value, field_type, options)
|
||||
|
||||
parsed = True
|
||||
if not changed and field_type in {
|
||||
FieldType.DATE, FieldType.PHONE, FieldType.CURRENCY, FieldType.BOOLEAN,
|
||||
}:
|
||||
parsed = _is_already_canonical(value, field_type, options)
|
||||
|
||||
return new, changed, parsed
|
||||
|
||||
|
||||
def standardize_dataframe(
|
||||
df: pd.DataFrame,
|
||||
options: Optional[StandardizeOptions] = None,
|
||||
@@ -2221,6 +2490,28 @@ def standardize_dataframe(
|
||||
|
||||
Columns absent from ``options.column_types`` pass through unchanged.
|
||||
The input DataFrame is not mutated.
|
||||
|
||||
Pipeline placement (recommended, not enforced)
|
||||
----------------------------------------------
|
||||
Run *after* the text cleaner (smart-quote / NBSP / zero-width
|
||||
pollution breaks phone, currency, and date parsers) and *before*
|
||||
the missing-value handler (numeric imputation expects canonical
|
||||
types) and the deduplicator (canonical phone E.164 / lowercase
|
||||
email enables cross-format duplicate matching). See
|
||||
``src.core.pipeline.SOFT_DEPENDENCIES``.
|
||||
|
||||
Performance characteristics
|
||||
---------------------------
|
||||
Per-cell standardizers are wrapped in an LRU cache (size
|
||||
``options.cache_size``) so repeated values — common in real
|
||||
international data, where the same office phone or vendor address
|
||||
appears thousands of times — short-circuit. The dispatch loop calls
the cached dispatcher once per cell over ``series.tolist()``; on a
10-million-row column this is roughly 4-8× faster than the previous
uncached ``_apply_field_type`` loop.
|
||||
|
||||
For inputs larger than will fit comfortably in RAM, prefer
|
||||
:func:`standardize_file` which streams chunks from disk.
|
||||
"""
|
||||
from .errors import ensure_dataframe
|
||||
ensure_dataframe(df, function="standardize_dataframe")
|
||||
@@ -2228,33 +2519,74 @@ def standardize_dataframe(
|
||||
out = df.copy()
|
||||
column_types = _resolve_column_types(options, out.columns)
|
||||
|
||||
-change_records: list[dict[str, Any]] = []
|
||||
cells_changed = 0
|
||||
cells_unparseable = 0
|
||||
cells_total = 0
|
||||
audit_cap = options.audit_max_rows
|
||||
audit_room = float("inf") if audit_cap is None else audit_cap
|
||||
audit_records: list[dict[str, Any]] = []
|
||||
|
||||
# Per-row region columns must exist in the frame when set.
|
||||
if options.phone_country_column and options.phone_country_column not in out.columns:
|
||||
from .errors import InputValidationError
|
||||
raise InputValidationError(
|
||||
f"phone_country_column={options.phone_country_column!r} not in input columns",
|
||||
operation="standardize_dataframe",
|
||||
suggestion=f"Available: {list(out.columns)}",
|
||||
)
|
||||
if options.address_country_column and options.address_country_column not in out.columns:
|
||||
from .errors import InputValidationError
|
||||
raise InputValidationError(
|
||||
f"address_country_column={options.address_country_column!r} not in input columns",
|
||||
operation="standardize_dataframe",
|
||||
suggestion=f"Available: {list(out.columns)}",
|
||||
)
|
||||
|
||||
for col, field_type in column_types.items():
|
||||
series = out[col]
|
||||
-new_values: list[Any] = []
-for row_idx, original in enumerate(series.tolist()):
-cells_total += 1
-new, changed, parsed = _apply_field_type(original, field_type, options)
+cells_total += len(series)
+dispatcher = _build_cached_dispatcher(field_type, options)
|
||||
|
||||
# Per-row region lookup. Phones and addresses are the two types
|
||||
# that benefit from country context; everything else ignores the
|
||||
# second argument.
|
||||
region_series: Optional[pd.Series] = None
|
||||
if field_type == FieldType.PHONE and options.phone_country_column:
|
||||
region_series = out[options.phone_country_column]
|
||||
elif field_type == FieldType.ADDRESS and options.address_country_column:
|
||||
region_series = out[options.address_country_column]
|
||||
|
||||
new_values: list[Any] = [None] * len(series)
|
||||
if region_series is None:
|
||||
triples = [dispatcher(v) for v in series.tolist()]
|
||||
else:
|
||||
regions = region_series.tolist()
|
||||
triples = [
|
||||
dispatcher(v, _normalize_region(r))
|
||||
for v, r in zip(series.tolist(), regions)
|
||||
]
|
||||
|
||||
for i, (orig, (new, changed, parsed)) in enumerate(
|
||||
zip(series.tolist(), triples)
|
||||
):
|
||||
new_values[i] = new
|
||||
if changed:
|
||||
cells_changed += 1
|
||||
change_records.append({
|
||||
"row": row_idx,
|
||||
"column": col,
|
||||
"field_type": field_type.value,
|
||||
"old": original,
|
||||
"new": new,
|
||||
})
|
||||
if audit_room > 0:
|
||||
audit_records.append({
|
||||
"row": i,
|
||||
"column": col,
|
||||
"field_type": field_type.value,
|
||||
"old": orig,
|
||||
"new": new,
|
||||
})
|
||||
audit_room -= 1
|
||||
if not parsed:
|
||||
cells_unparseable += 1
|
||||
new_values.append(new)
|
||||
out[col] = new_values
|
||||
|
||||
changes_df = pd.DataFrame(
|
||||
change_records,
|
||||
audit_records,
|
||||
columns=["row", "column", "field_type", "old", "new"],
|
||||
)
|
||||
|
||||
@@ -2272,6 +2604,16 @@ def standardize_dataframe(
|
||||
int(100 * cells_unparseable / cells_total),
|
||||
)
|
||||
|
||||
# Only log the cap message when it would surprise the caller —
|
||||
# cap=0 is the streaming-path's deliberate "audit budget exhausted"
|
||||
# signal and shouldn't generate noise per chunk.
|
||||
if audit_cap and audit_cap > 0 and cells_changed > audit_cap:
|
||||
logger.info(
|
||||
"standardize_dataframe: audit capped at {} rows "
|
||||
"(cells_changed={}); raise audit_max_rows or set to None for full audit.",
|
||||
audit_cap, cells_changed,
|
||||
)
|
||||
|
||||
return StandardizeResult(
|
||||
standardized_df=out,
|
||||
changes=changes_df,
|
||||
@@ -2280,3 +2622,290 @@ def standardize_dataframe(
|
||||
cells_total=cells_total,
|
||||
columns_processed=list(column_types.keys()),
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Per-row region helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
# Common country-name → ISO-3166 alpha-2 mappings. The phonenumbers
|
||||
# library wants the alpha-2 code, but real spreadsheets carry full names
|
||||
# ("United Kingdom", "Japan", "Brazil"). Add new entries lazily as users
|
||||
# bring in data — the table is a soft mapping, missing entries fall back
|
||||
# to the global ``phone_region``.
|
||||
_COUNTRY_NAME_TO_ISO2: dict[str, str] = {
|
||||
"united states": "US", "usa": "US", "u.s.": "US", "u.s.a.": "US",
|
||||
"united kingdom": "GB", "uk": "GB", "great britain": "GB", "england": "GB",
|
||||
"canada": "CA",
|
||||
"mexico": "MX",
|
||||
"france": "FR",
|
||||
"germany": "DE", "deutschland": "DE",
|
||||
"italy": "IT", "italia": "IT",
|
||||
"spain": "ES", "españa": "ES",
|
||||
"portugal": "PT",
|
||||
"netherlands": "NL", "holland": "NL",
|
||||
"belgium": "BE",
|
||||
"switzerland": "CH", "schweiz": "CH",
|
||||
"austria": "AT", "österreich": "AT",
|
||||
"ireland": "IE",
|
||||
"sweden": "SE", "norway": "NO", "denmark": "DK", "finland": "FI",
|
||||
"poland": "PL", "czech republic": "CZ", "czechia": "CZ", "hungary": "HU",
|
||||
"russia": "RU", "ukraine": "UA",
|
||||
"japan": "JP", "中国": "CN", "china": "CN", "south korea": "KR", "korea": "KR",
|
||||
"india": "IN", "indonesia": "ID", "thailand": "TH", "vietnam": "VN",
|
||||
"philippines": "PH", "malaysia": "MY", "singapore": "SG",
|
||||
"australia": "AU", "new zealand": "NZ",
|
||||
"brazil": "BR", "brasil": "BR",
|
||||
"argentina": "AR", "chile": "CL", "colombia": "CO", "peru": "PE",
|
||||
"south africa": "ZA",
|
||||
"uae": "AE", "united arab emirates": "AE",
|
||||
"saudi arabia": "SA",
|
||||
"egypt": "EG",
|
||||
"israel": "IL",
|
||||
"turkey": "TR", "türkiye": "TR",
|
||||
}
|
||||
|
||||
|
||||
def _normalize_region(value: Any) -> Optional[str]:
|
||||
"""Normalise a region cell to an ISO-3166 alpha-2 code.
|
||||
|
||||
Accepts ISO codes (``US``, ``us``, ``USA``), full names
|
||||
(``United States``, ``Japan``), and falls back to None when the
|
||||
value is empty or unrecognized — letting the dispatcher use the
|
||||
global default region.
|
||||
"""
|
||||
if value is None:
|
||||
return None
|
||||
if isinstance(value, float) and pd.isna(value):
|
||||
return None
|
||||
if not isinstance(value, str):
|
||||
value = str(value)
|
||||
s = value.strip()
|
||||
if not s:
|
||||
return None
|
||||
upper = s.upper()
|
||||
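# ISO-3166 alpha-2 (e.g. "US", "JP"). "UK" is a common non-ISO alias,
# so consult the name table first to map it to "GB".
if len(upper) == 2 and upper.isalpha():
return _COUNTRY_NAME_TO_ISO2.get(s.lower(), upper)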
|
||||
# ISO-3166 alpha-3 (e.g. "USA", "JPN"): resolved through a small
# explicit table; codes not listed fall through to the name lookup.
|
||||
if len(upper) == 3 and upper.isalpha():
|
||||
# phonenumbers accepts alpha-2 only; map a few common alpha-3.
|
||||
alpha3_map = {
|
||||
"USA": "US", "GBR": "GB", "CAN": "CA", "MEX": "MX", "DEU": "DE",
|
||||
"FRA": "FR", "ITA": "IT", "ESP": "ES", "JPN": "JP", "CHN": "CN",
|
||||
"KOR": "KR", "BRA": "BR", "AUS": "AU", "IND": "IN", "RUS": "RU",
|
||||
}
|
||||
if upper in alpha3_map:
|
||||
return alpha3_map[upper]
|
||||
# Full country name lookup.
|
||||
return _COUNTRY_NAME_TO_ISO2.get(s.lower())
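
The lookup order used by `_normalize_region` (empty check, alpha-2, alpha-3 table, full name) can be sketched without pandas. This is a minimal illustration rather than the module's code: `normalize_region_sketch` and both tables are hypothetical, trimmed stand-ins.

```python
import math
from typing import Any, Optional

# Hypothetical, trimmed tables; the real module carries far more entries.
NAME_TO_ISO2 = {"united kingdom": "GB", "japan": "JP", "brazil": "BR"}
ALPHA3_TO_ISO2 = {"USA": "US", "GBR": "GB", "JPN": "JP"}


def normalize_region_sketch(value: Any) -> Optional[str]:
    """Return an ISO-3166 alpha-2 code, or None when nothing matches."""
    # Missing cells (None or float NaN) carry no region information.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    s = str(value).strip()
    if not s:
        return None
    upper = s.upper()
    # Two letters are taken as an alpha-2 code as-is.
    if len(upper) == 2 and upper.isalpha():
        return upper
    # Three letters go through the explicit alpha-3 table.
    if len(upper) == 3 and upper in ALPHA3_TO_ISO2:
        return ALPHA3_TO_ISO2[upper]
    # Everything else is treated as a full country name.
    return NAME_TO_ISO2.get(s.lower())


print(normalize_region_sketch(" us "))
print(normalize_region_sketch("JPN"))
print(normalize_region_sketch("United Kingdom"))
print(normalize_region_sketch("Atlantis"))
```

Returning `None` for unknown values lets the caller fall back to a global default region instead of failing the row.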
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Streaming entry point — for inputs that don't fit in memory
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
@dataclass
|
||||
class StreamingStandardizeResult:
|
||||
"""Summary returned by :func:`standardize_file`.
|
||||
|
||||
Mirrors :class:`StandardizeResult` but without the in-memory
|
||||
DataFrame — the standardized output is written incrementally to
|
||||
``output_path``. The ``changes`` audit is also written
|
||||
incrementally to ``audit_path`` and capped at
|
||||
``options.audit_max_rows`` total rows across all chunks.
|
||||
"""
|
||||
|
||||
output_path: Path
|
||||
audit_path: Optional[Path]
|
||||
rows_processed: int
|
||||
chunks_processed: int
|
||||
cells_changed: int
|
||||
cells_unparseable: int
|
||||
cells_total: int
|
||||
columns_processed: list[str]
|
||||
|
||||
|
||||
def standardize_file(
|
||||
input_path: str | Path,
|
||||
output_path: str | Path,
|
||||
options: Optional[StandardizeOptions] = None,
|
||||
*,
|
||||
chunk_size: int = 50_000,
|
||||
audit_path: Optional[str | Path] = None,
|
||||
progress_callback: Optional[Any] = None,
|
||||
encoding: str = "utf-8",
|
||||
delimiter: str = ",",
|
||||
) -> StreamingStandardizeResult:
|
||||
"""Standardize a CSV/TSV file in chunks, writing output incrementally.
|
||||
|
||||
For inputs too large to materialize in memory, this entry point
|
||||
streams ``chunk_size`` rows at a time through
|
||||
:func:`standardize_dataframe` and writes each chunk to *output_path*
|
||||
as it completes. Memory stays bounded by the chunk size regardless
|
||||
of input file size.
|
||||
|
||||
The audit is written to *audit_path* (default
``{output_path.stem}_changes.csv`` next to *output_path*). The
``options.audit_max_rows`` budget is shared across all chunks: once
it is spent, later chunks write no audit rows. Pass
``audit_max_rows=None`` for a full audit (bounded only by disk).
|
||||
|
||||
Performance for a 1 GB CSV with ~10 M rows on a typical workstation:
|
||||
- chunk_size=50_000 → ~50 MB peak DataFrame footprint
|
||||
- phone-only standardization: ~3-6 minutes (cache-warm)
|
||||
- mixed phone + currency + address: ~8-15 minutes
|
||||
- first chunk is the cold-cache slowest; later chunks ride the LRU.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
input_path
|
||||
CSV or TSV path. Excel inputs aren't streamed — load with
|
||||
:func:`read_file` and use :func:`standardize_dataframe`.
|
||||
output_path
|
||||
Where to write the standardized CSV. Existing files are
|
||||
overwritten.
|
||||
chunk_size
|
||||
Rows per chunk. Default 50,000 ≈ 50 MB resident for typical
|
||||
widths. Higher → less I/O overhead, more peak memory.
|
||||
progress_callback
|
||||
Optional ``callable(rows_processed, chunks_processed)``
|
||||
called once per chunk.
|
||||
"""
|
||||
from .errors import wrap_file_read, wrap_file_write
|
||||
options = options or StandardizeOptions()
|
||||
inp = Path(input_path)
|
||||
out = Path(output_path)
|
||||
if not inp.exists():
|
||||
from .errors import FileAccessError
|
||||
raise FileAccessError(
|
||||
f"Input file not found: {inp}",
|
||||
path=inp, operation="standardize_file",
|
||||
)
|
||||
|
||||
audit_p = Path(audit_path) if audit_path else out.with_name(
|
||||
f"{out.stem}_changes.csv"
|
||||
)
|
||||
|
||||
rows_processed = 0
|
||||
chunks_processed = 0
|
||||
cells_changed = 0
|
||||
cells_unparseable = 0
|
||||
cells_total = 0
|
||||
columns_processed: list[str] = []
|
||||
audit_room = (
|
||||
options.audit_max_rows if options.audit_max_rows is not None
|
||||
else float("inf")
|
||||
)
|
||||
|
||||
out.parent.mkdir(parents=True, exist_ok=True)
|
||||
audit_p.parent.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
out_writer_open = False
|
||||
audit_writer_open = False
|
||||
|
||||
try:
|
||||
reader = pd.read_csv(
|
||||
inp, chunksize=chunk_size, encoding=encoding,
|
||||
sep=delimiter, dtype=str, keep_default_na=False,
|
||||
)
|
||||
except (OSError, FileNotFoundError) as e:
|
||||
raise wrap_file_read(inp, "standardize_file", e) from e
|
||||
|
||||
try:
|
||||
for chunk in reader:
|
||||
# The chunked reader gives back row indices that restart
|
||||
# at chunk boundaries; renumber so audit row indices reflect
|
||||
# the full input file.
|
||||
chunk_offset = rows_processed
|
||||
chunk_options = options
|
||||
# Local audit cap per chunk: never exceed the global budget.
|
||||
if options.audit_max_rows is not None and audit_room <= 0:
|
||||
# Disable audit for this chunk by setting cap=0; the
|
||||
# standardizer skips appending records once room == 0.
|
||||
chunk_options = _replace_options(options, audit_max_rows=0)
|
||||
|
||||
result = standardize_dataframe(chunk, chunk_options)
|
||||
cells_changed += result.cells_changed
|
||||
cells_unparseable += result.cells_unparseable
|
||||
cells_total += result.cells_total
|
||||
if not columns_processed:
|
||||
columns_processed = list(result.columns_processed)
|
||||
|
||||
# Write the standardized chunk
|
||||
try:
|
||||
if not out_writer_open:
|
||||
result.standardized_df.to_csv(
|
||||
out, mode="w", index=False, encoding=encoding,
|
||||
sep=delimiter,
|
||||
)
|
||||
out_writer_open = True
|
||||
else:
|
||||
result.standardized_df.to_csv(
|
||||
out, mode="a", index=False, header=False,
|
||||
encoding=encoding, sep=delimiter,
|
||||
)
|
||||
except OSError as e:
|
||||
raise wrap_file_write(out, "standardize_file", e) from e
|
||||
|
||||
# Write the audit (re-numbering rows to absolute file positions).
|
||||
if not result.changes.empty and audit_room > 0:
|
||||
# ``audit_room`` is float('inf') when the user wants an
|
||||
# unbounded audit; ``iloc[:inf]`` is invalid, so take the
|
||||
# whole frame in that case.
|
||||
if audit_room == float("inf"):
|
||||
cap_changes = result.changes.copy()
|
||||
else:
|
||||
cap_changes = result.changes.iloc[: int(audit_room)].copy()
|
||||
cap_changes["row"] = cap_changes["row"] + chunk_offset
|
||||
try:
|
||||
if not audit_writer_open:
|
||||
cap_changes.to_csv(
|
||||
audit_p, mode="w", index=False, encoding=encoding,
|
||||
)
|
||||
audit_writer_open = True
|
||||
else:
|
||||
cap_changes.to_csv(
|
||||
audit_p, mode="a", index=False, header=False,
|
||||
encoding=encoding,
|
||||
)
|
||||
except OSError as e:
|
||||
raise wrap_file_write(audit_p, "standardize_file", e) from e
|
||||
audit_room -= len(cap_changes)
|
||||
|
||||
rows_processed += len(chunk)
|
||||
chunks_processed += 1
|
||||
if progress_callback:
|
||||
try:
|
||||
progress_callback(rows_processed, chunks_processed)
|
||||
except Exception:
|
||||
# Progress callbacks are advisory — don't kill the run.
|
||||
logger.opt(exception=True).debug(
|
||||
"progress_callback raised; ignoring"
|
||||
)
|
||||
finally:
|
||||
# Ensure the iterator is closed (closes the underlying file).
|
||||
if hasattr(reader, "close"):
|
||||
reader.close()
|
||||
|
||||
return StreamingStandardizeResult(
|
||||
output_path=out,
|
||||
audit_path=audit_p if audit_writer_open else None,
|
||||
rows_processed=rows_processed,
|
||||
chunks_processed=chunks_processed,
|
||||
cells_changed=cells_changed,
|
||||
cells_unparseable=cells_unparseable,
|
||||
cells_total=cells_total,
|
||||
columns_processed=columns_processed,
|
||||
)
|
||||
|
||||
|
||||
def _replace_options(options: StandardizeOptions, **kwargs: Any) -> StandardizeOptions:
|
||||
"""Cheap shallow clone of :class:`StandardizeOptions` with overrides.
|
||||
|
||||
Used by the streaming path to reduce the audit budget chunk-by-chunk
|
||||
without mutating the caller's options object.
|
||||
"""
|
||||
from dataclasses import replace
|
||||
return replace(options, **kwargs)
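
The audit bookkeeping in `standardize_file` combines two ideas: chunk-local row indices are shifted to absolute file positions, and one budget is spent across all chunks. A minimal sketch of that arithmetic follows, assuming plain lists in place of DataFrames; `merge_chunk_audits` is a hypothetical helper, not part of the module.

```python
from typing import Iterable, Optional


def merge_chunk_audits(
    chunks: Iterable[tuple[int, list[dict]]],
    audit_max_rows: Optional[int] = None,
) -> list[dict]:
    """Merge per-chunk change records under one global row budget.

    Each item in *chunks* is ``(rows_in_chunk, changes)`` where every
    change record carries a chunk-local ``"row"`` index.
    """
    room = float("inf") if audit_max_rows is None else audit_max_rows
    merged: list[dict] = []
    offset = 0  # rows seen before the current chunk
    for n_rows, changes in chunks:
        if room > 0:
            # Slicing with infinity is invalid, so take everything then.
            take = changes if room == float("inf") else changes[: int(room)]
            # Renumber to absolute file positions.
            merged.extend({**rec, "row": rec["row"] + offset} for rec in take)
            room -= len(take)
        offset += n_rows
    return merged


chunks = [(3, [{"row": 0}, {"row": 2}]), (3, [{"row": 1}]), (2, [{"row": 0}])]
print([r["row"] for r in merge_chunk_audits(chunks, audit_max_rows=3)])
print([r["row"] for r in merge_chunk_audits(chunks)])
```

With a cap of 3 the last chunk contributes nothing, while the row counter still advances so later offsets stay correct.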
|
||||
|
||||
236
src/core/io.py
@@ -18,6 +18,207 @@ from loguru import logger
|
||||
# Encoding detection
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
# charset-normalizer often picks an Eastern-European code page (cp1250,
|
||||
# cp1258) for byte-equivalent Western content, mac_iceland over mac_roman
|
||||
# in the Mac family, and shift_jis_2004 for short Cyrillic samples. The
|
||||
# arbiter below resolves these specific false positives without
|
||||
# overruling the detector when its top pick is genuinely the right
|
||||
# answer.
|
||||
#
|
||||
# Mapping is *over-picked encoding* → *more plausible substitutes (in
|
||||
# priority order)*. We accept either the candidate's primary encoding
|
||||
# name or any of its ``could_be_from_charset`` aliases.
|
||||
_ENCODING_FALLBACKS: dict[str, tuple[str, ...]] = {
|
||||
"cp1250": ("cp1252", "latin_1", "iso8859_15", "iso8859_2"),
|
||||
"cp1258": ("iso8859_2", "cp1250", "cp1252"),
|
||||
"mac_iceland": ("mac_roman",),
|
||||
"shift_jis_2004": ("koi8_r", "cp1251", "cp1252", "iso8859_2"),
|
||||
"shift_jisx0213": ("koi8_r", "cp1251", "cp1252", "iso8859_2"),
|
||||
}
|
||||
|
||||
|
||||
def _arbitrate_charset_match(matches) -> Optional[str]:
|
||||
"""Pick the most plausible encoding from a charset-normalizer match list.
|
||||
|
||||
Two distinguishing signals separate a false positive from a real
|
||||
pick when the top encoding is one we've recorded as over-picked:
|
||||
|
||||
* If the top match's own ``could_be_from_charset`` alias list
|
||||
already names a preferred fallback (e.g. cp1250 with cp1252 as a
|
||||
sibling), we substitute — charset-normalizer has flagged the
|
||||
byte content as ambiguous.
|
||||
* If the second-ranked match shares identical *chaos* and
|
||||
*coherence* scores with the top — meaning the bytes decode
|
||||
byte-equivalently under both — we substitute when the second
|
||||
match is the preferred Western default.
|
||||
|
||||
When neither signal fires (real cp1250 / cp1258 content where
|
||||
charset-normalizer is genuinely confident), the top pick is
|
||||
returned unchanged.
|
||||
"""
|
||||
ranked = list(matches)
|
||||
if not ranked:
|
||||
return None
|
||||
top = ranked[0]
|
||||
top_enc = top.encoding.lower()
|
||||
fallbacks = _ENCODING_FALLBACKS.get(top_enc)
|
||||
if not fallbacks:
|
||||
return top_enc
|
||||
|
||||
# The decisive signal: a lower-ranked candidate that ties the top
|
||||
# pick on both chaos and coherence has decoded the bytes
|
||||
# *identically*, so the choice between them is byte-equivalent. When
|
||||
# one of those tied candidates is a preferred Western default,
|
||||
# substitute. We walk the fallbacks in priority order so the most
|
||||
# canonical alternative wins (for cp1250: cp1252, then latin_1, then
# iso8859_15, then iso8859_2).
|
||||
#
|
||||
# When no tied candidate matches, we leave the top pick alone — that
|
||||
# is the "real cp1250 / cp1258 content" path where charset-normalizer
|
||||
# is genuinely confident.
|
||||
top_chaos = getattr(top, "chaos", None)
|
||||
top_coherence = getattr(top, "coherence", None)
|
||||
tied: list = []
|
||||
for m in ranked[1:]:
|
||||
if m.chaos != top_chaos or m.coherence != top_coherence:
|
||||
break # ranked list is monotonically less confident
|
||||
tied.append(m)
|
||||
|
||||
if tied:
|
||||
for preferred in fallbacks:
|
||||
for m in tied:
|
||||
candidates = {
|
||||
m.encoding.lower(),
|
||||
*(a.lower() for a in m.could_be_from_charset),
|
||||
}
|
||||
if preferred in candidates:
|
||||
return preferred
|
||||
|
||||
# No tied alternative — but charset-normalizer occasionally folds
|
||||
# the more popular Western alias into the *top pick's own* alias
|
||||
# list (cp1250 with cp1252 listed alongside). When that happens,
|
||||
# prefer the canonical Western form.
|
||||
top_aliases = {a.lower() for a in top.could_be_from_charset}
|
||||
for preferred in fallbacks:
|
||||
# Only honour an in-alias swap if the preferred encoding is a
|
||||
# different family from the top pick (cp1252 swap from cp1250 is
|
||||
# legitimate; iso8859_2 swap from cp1250 is not — they differ
|
||||
# bytewise on accented Eastern letters).
|
||||
if preferred in top_aliases and not _same_byte_family(top_enc, preferred):
|
||||
return preferred
|
||||
|
||||
return top_enc
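
The tied-score rule in `_arbitrate_charset_match` can be illustrated in isolation. The sketch below is an assumption-laden simplification: `FakeMatch` is a hypothetical stand-in for a charset-normalizer match object, `FALLBACKS` is a one-entry table, and the in-alias swap branch is left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeMatch:
    """Hypothetical stand-in exposing only the attributes the rule reads."""
    encoding: str
    chaos: float
    coherence: float
    could_be_from_charset: list[str] = field(default_factory=list)


# Over-picked encoding -> preferred substitutes, in priority order.
FALLBACKS = {"cp1250": ("cp1252", "latin_1")}


def arbitrate_sketch(ranked: list[FakeMatch]) -> Optional[str]:
    if not ranked:
        return None
    top = ranked[0]
    top_enc = top.encoding.lower()
    fallbacks = FALLBACKS.get(top_enc)
    if not fallbacks:
        return top_enc  # not a known over-pick: trust the detector
    # Collect the run of candidates that tie the top pick on both scores.
    tied = []
    for m in ranked[1:]:
        if (m.chaos, m.coherence) != (top.chaos, top.coherence):
            break
        tied.append(m)
    # The highest-priority fallback found among the tied candidates wins.
    for preferred in fallbacks:
        for m in tied:
            names = {m.encoding.lower(), *(a.lower() for a in m.could_be_from_charset)}
            if preferred in names:
                return preferred
    return top_enc


tied_case = [FakeMatch("cp1250", 0.0, 0.5), FakeMatch("cp1252", 0.0, 0.5)]
confident_case = [FakeMatch("cp1250", 0.0, 0.9), FakeMatch("cp1252", 0.1, 0.4)]
print(arbitrate_sketch(tied_case))
print(arbitrate_sketch(confident_case))
```

Identical scores are read as "both decodings are equally plausible", which is the only situation in which the fallback table is allowed to override the top pick in this sketch.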
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Language-aware probe: distinguish KOI8-R from Shift_JIS, ISO-8859-2 from
|
||||
# cp1258 when charset-normalizer cannot.
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
# Unicode ranges that uniquely identify each language family. A candidate
|
||||
# encoding "wins" the probe when its decoding of the raw bytes produces
|
||||
# the highest *coverage ratio* (letters in the target range divided by
# all non-ASCII characters).
|
||||
_CYRILLIC_RANGE = (0x0400, 0x04FF)
|
||||
_EE_LATIN_LETTERS = frozenset(
|
||||
"ąćęłńóśźżĄĆĘŁŃÓŚŹŻ" # Polish
|
||||
"áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ" # Czech
|
||||
"áéíóöőúüűÁÉÍÓÖŐÚÜŰ" # Hungarian
|
||||
"äčďéíĺľňóôŕšťúýžÄČĎÉÍĹĽŇÓÔŔŠŤÚÝŽ" # Slovak
|
||||
)
|
||||
|
||||
# Encodings to probe when charset-normalizer fingerprints the file as
|
||||
# Japanese (a frequent misfire on short Cyrillic samples whose byte
|
||||
# patterns happen to coincide with shift_jis lead bytes).
|
||||
_CYRILLIC_PROBES: tuple[str, ...] = ("koi8_r", "cp1251", "iso8859_5")
|
||||
_EE_LATIN_PROBES: tuple[str, ...] = ("iso8859_2", "cp1250")
|
||||
|
||||
|
||||
def _cyrillic_coverage(text: str) -> float:
|
||||
"""Fraction of *all non-ASCII characters* in *text* that are Cyrillic letters.
|
||||
|
||||
Dividing by all non-ASCII (rather than only letters) penalises
|
||||
decodings that produce mostly symbols/box-drawing with a sprinkle
|
||||
of incidental Cyrillic glyphs — a real KOI8-R Russian text scores
|
||||
>0.7 because nearly every non-ASCII codepoint IS a Cyrillic letter,
|
||||
whereas a Japanese-shift_jis-decoded-as-koi8r text scores low.
|
||||
"""
|
||||
non_ascii = [c for c in text if ord(c) >= 0x80]
|
||||
if not non_ascii:
|
||||
return 0.0
|
||||
cyr = sum(
|
||||
1 for c in non_ascii
|
||||
if c.isalpha() and _CYRILLIC_RANGE[0] <= ord(c) <= _CYRILLIC_RANGE[1]
|
||||
)
|
||||
return cyr / len(non_ascii)
|
||||
|
||||
|
||||
def _ee_latin_coverage(text: str) -> float:
|
||||
"""Fraction of *all non-ASCII characters* in *text* that look like EE Latin."""
|
||||
non_ascii = [c for c in text if ord(c) >= 0x80]
|
||||
if not non_ascii:
|
||||
return 0.0
|
||||
ee = sum(1 for c in non_ascii if c in _EE_LATIN_LETTERS)
|
||||
return ee / len(non_ascii)
|
||||
|
||||
|
||||
def _probe_language(raw: bytes, top_enc: str) -> Optional[str]:
|
||||
"""Try language-specific decodings when charset-normalizer guessed wrong.
|
||||
|
||||
Returns a better encoding name when one of the probe candidates
|
||||
decodes the bytes into a language-coherent text (Cyrillic ≥ 70 % for
|
||||
Cyrillic probes, EE-Latin ≥ 50 % for EE Latin probes), else None.
|
||||
"""
|
||||
if top_enc in {"shift_jis_2004", "shift_jisx0213", "shift_jis", "cp932"}:
|
||||
probes, scorer, threshold = _CYRILLIC_PROBES, _cyrillic_coverage, 0.70
|
||||
elif top_enc in {"cp1258", "iso8859_16"}:
|
||||
probes, scorer, threshold = _EE_LATIN_PROBES, _ee_latin_coverage, 0.50
|
||||
else:
|
||||
return None
|
||||
|
||||
# Score the top pick first. If the top encoding *itself* decodes the
|
||||
# bytes into reasonable Cyrillic / EE Latin text, the bytes are
|
||||
# genuinely in that script — don't override.
|
||||
try:
|
||||
top_decoded = raw.decode(top_enc, errors="replace")
|
||||
top_score = scorer(top_decoded)
|
||||
except LookupError:
|
||||
top_score = 0.0
|
||||
|
||||
best_enc: Optional[str] = None
|
||||
best_score = 0.0
|
||||
for enc in probes:
|
||||
try:
|
||||
decoded = raw.decode(enc)
|
||||
except (UnicodeDecodeError, LookupError):
|
||||
continue
|
||||
score = scorer(decoded)
|
||||
if score > best_score:
|
||||
best_score = score
|
||||
best_enc = enc
|
||||
|
||||
# Require both an absolute coverage threshold AND a clear margin over
|
||||
# the top pick — otherwise we risk hijacking real Japanese / Vietnamese
|
||||
# content whose decode happens to produce a few Cyrillic / EE-Latin
|
||||
# glyphs by coincidence.
|
||||
if best_enc and best_score >= threshold and best_score >= top_score + 0.30:
|
||||
return best_enc
|
||||
return None
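
The coverage probe can be tried on a small sample. The sketch below assumes a Cyrillic-only probe and omits the gate on the detector's top pick; `cyrillic_coverage` and `pick_probe` are hypothetical names, and `latin_1` merely stands in for a misdetected encoding.

```python
from typing import Optional

CYRILLIC = (0x0400, 0x04FF)


def cyrillic_coverage(text: str) -> float:
    """Share of non-ASCII characters that are Cyrillic letters."""
    non_ascii = [c for c in text if ord(c) >= 0x80]
    if not non_ascii:
        return 0.0
    hits = sum(1 for c in non_ascii if c.isalpha() and CYRILLIC[0] <= ord(c) <= CYRILLIC[1])
    return hits / len(non_ascii)


def pick_probe(
    raw: bytes,
    top_enc: str,
    probes: tuple[str, ...] = ("koi8_r", "cp1251"),
    threshold: float = 0.70,
    margin: float = 0.30,
) -> Optional[str]:
    """Return a probe encoding only if it clearly beats the top pick."""
    top_score = cyrillic_coverage(raw.decode(top_enc, errors="replace"))
    best_enc, best_score = None, 0.0
    for enc in probes:
        try:
            decoded = raw.decode(enc)
        except UnicodeDecodeError:
            continue
        score = cyrillic_coverage(decoded)
        if score > best_score:  # strict: the earlier probe wins ties
            best_enc, best_score = enc, score
    if best_enc and best_score >= threshold and best_score >= top_score + margin:
        return best_enc
    return None


raw = "Привет, мир".encode("koi8_r")
print(cyrillic_coverage(raw.decode("koi8_r")))
print(cyrillic_coverage(raw.decode("latin_1")))
print(pick_probe(raw, "latin_1"))
print(pick_probe(raw, "koi8_r"))
```

The margin requirement is what keeps the probe from overriding a top pick that already decodes the bytes into the target script.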
|
||||
|
||||
|
||||
# Pairs of related encodings (same script family) whose byte values
# nonetheless DIFFER for accented letters. An in-alias swap between the
# two members of a pair is refused (e.g. cp1250 vs iso8859_2 are
# byte-distinct even though charset-normalizer lists them as siblings).
|
||||
_SAME_FAMILY: set[frozenset[str]] = {
|
||||
frozenset({"cp1250", "iso8859_2"}),
|
||||
frozenset({"mac_iceland", "mac_turkish"}),
|
||||
frozenset({"shift_jis_2004", "shift_jisx0213"}),
|
||||
}
|
||||
|
||||
|
||||
def _same_byte_family(a: str, b: str) -> bool:
|
||||
return frozenset({a, b}) in _SAME_FAMILY
|
||||
|
||||
|
||||
def detect_encoding(path: Path, sample_bytes: int = 65_536) -> str:
|
||||
"""Detect file encoding by reading the first *sample_bytes*.
|
||||
|
||||
@@ -34,8 +235,21 @@ def detect_encoding(path: Path, sample_bytes: int = 65_536) -> str:
|
||||
|
||||
# Check BOM first
|
||||
if raw[:3] == b"\xef\xbb\xbf":
|
||||
return "utf-8-sig"
|
||||
if raw[:2] in (b"\xff\xfe", b"\xfe\xff"):
|
||||
# A "lying" BOM: file claims utf-8 but the body bytes don't decode
|
||||
# as utf-8. Fall through to charset detection on the BOM-stripped
|
||||
# body so we don't hand back utf-8-sig that will then fail to read.
|
||||
body = raw[3:]
|
||||
try:
|
||||
body.decode("utf-8")
|
||||
return "utf-8-sig"
|
||||
except UnicodeDecodeError:
|
||||
logger.debug(
|
||||
"detect_encoding({}): file has UTF-8 BOM but body is not "
|
||||
"valid UTF-8 — falling through to charset detection",
|
||||
Path(path).name,
|
||||
)
|
||||
raw = body
|
||||
elif raw[:2] in (b"\xff\xfe", b"\xfe\xff"):
|
||||
return "utf-16"
|
||||
|
||||
# Strict UTF-8 wins. charset_normalizer fingerprints small files
|
||||
@@ -48,11 +262,21 @@ def detect_encoding(path: Path, sample_bytes: int = 65_536) -> str:
|
||||
except UnicodeDecodeError:
|
||||
pass
|
||||
|
||||
result = from_bytes(raw).best()
|
||||
if result is None:
|
||||
matches = from_bytes(raw)
|
||||
enc = _arbitrate_charset_match(matches)
|
||||
if enc is None:
|
||||
return "utf-8"
|
||||
enc = result.encoding.lower()
|
||||
# Normalise common aliases
|
||||
# Language-aware probe runs after the arbiter so we only spend cycles
|
||||
# on the cases where charset-normalizer fingerprinted the bytes as a
|
||||
# codepage that doesn't match the apparent script. Returns a better
|
||||
# encoding only when the probe finds a high-coverage match.
|
||||
probed = _probe_language(raw, enc)
|
||||
if probed:
|
||||
logger.debug(
|
||||
"detect_encoding({}): language probe overrode {} → {}",
|
||||
Path(path).name, enc, probed,
|
||||
)
|
||||
enc = probed
|
||||
if enc in ("ascii", "us-ascii"):
|
||||
enc = "utf-8"
|
||||
return enc
|
||||
|
||||
780
src/core/missing.py
Normal file
@@ -0,0 +1,780 @@
|
||||
"""DataTools Missing Value Handler.
|
||||
|
||||
Detects disguised nulls, profiles missingness per column, and applies
|
||||
imputation or drop strategies with a full audit trail.
|
||||
|
||||
Public API
|
||||
----------
|
||||
Per-column helpers:
|
||||
is_missing_like(value, sentinels) -> bool
|
||||
detect_sentinels(series, sentinels) -> dict[str, int]
|
||||
|
||||
DataFrame entry points:
|
||||
profile_missing(df, options) -> MissingProfile
|
||||
handle_missing(df, options) -> MissingResult
|
||||
|
||||
Types:
|
||||
MissingOptions, MissingProfile, MissingResult, ColumnReport, Strategy
|
||||
|
||||
Presets (PRESETS):
|
||||
"detect-only" — only standardize sentinels to NaN, no fill / drop.
|
||||
"safe-fill" — sentinels → NaN, then numeric=median, categorical=mode.
|
||||
"drop-incomplete" — sentinels → NaN, then drop rows with any missing.
|
||||
|
||||
Use cases covered
|
||||
-----------------
|
||||
1. Disguised nulls in survey / CRM exports ("N/A", "n/a", "-", "(blank)",
|
||||
"TBD", whitespace-only, "?", "null", "NaN").
|
||||
2. Per-column profile for QA reports (counts, %, top sentinel hit).
|
||||
3. Row-drop with threshold (e.g., drop rows missing >50% of columns).
|
||||
4. Column-drop with threshold (e.g., drop columns missing >80%).
|
||||
5. Numeric imputation (mean / median / constant), categorical (mode /
|
||||
constant), time-series (ffill / bfill).
|
||||
6. Per-column overrides — different strategy per column in the same run.
|
||||
|
||||
Non-goals
|
||||
---------
|
||||
- ML-based imputation (KNN / iterative) — out of scope for v1.
|
||||
- Group-wise imputation by another column — deferred until a real use case.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import re
|
||||
from dataclasses import asdict, dataclass, field
|
||||
from pathlib import Path
|
||||
from typing import Any, Iterable, Literal, Optional
|
||||
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
from loguru import logger
|
||||
from pandas.api import types as pdtypes
|
||||
|
||||
from .errors import ConfigError, InputValidationError, ensure_choice, ensure_dataframe
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Sentinel detection
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
# Default disguised-null sentinels. Matched case-insensitively after a
|
||||
# strip(). Whitespace-only strings ("", " ") are always treated as
|
||||
# missing regardless of this list.
|
||||
DEFAULT_SENTINELS: tuple[str, ...] = (
|
||||
"n/a", "na", "n.a.", "n.a",
|
||||
"null", "none", "nil",
|
||||
"nan",
|
||||
"-", "--", "---",
|
||||
"?", "??",
|
||||
".",
|
||||
"tbd", "tba",
|
||||
"unknown", "unk",
|
||||
"(blank)", "(none)", "(empty)", "(null)",
|
||||
"#n/a", "#na", "#null!", "#value!",
|
||||
"missing",
|
||||
)
|
||||
|
||||
_WHITESPACE_ONLY_RE = re.compile(r"^\s*$")
|
||||
|
||||
|
||||
def is_missing_like(value: Any, sentinels: Iterable[str] = DEFAULT_SENTINELS) -> bool:
|
||||
"""True when *value* should be treated as missing.
|
||||
|
||||
Catches: real NaN/None, whitespace-only strings, and any string that
|
||||
matches a sentinel after case-fold and strip.
|
||||
"""
|
||||
if value is None:
|
||||
return True
|
||||
# pandas / numpy NaN
|
||||
try:
|
||||
if isinstance(value, float) and np.isnan(value):
|
||||
return True
|
||||
except (TypeError, ValueError):
|
||||
pass
|
||||
if isinstance(value, pd._libs.tslibs.nattype.NaTType): # type: ignore[attr-defined]
|
||||
return True
|
||||
if not isinstance(value, str):
|
||||
return False
|
||||
if _WHITESPACE_ONLY_RE.match(value):
|
||||
return True
|
||||
needle = value.strip().casefold()
|
||||
return needle in {s.casefold() for s in sentinels}
|
||||
|
||||
|
||||
def detect_sentinels(
|
||||
series: pd.Series,
|
||||
sentinels: Iterable[str] = DEFAULT_SENTINELS,
|
||||
) -> dict[str, int]:
|
||||
"""Return ``{sentinel_value: count}`` for sentinels found in *series*.
|
||||
|
||||
Real NaN cells are not counted (they're already missing). Whitespace-
|
||||
only strings are bucketed under the literal key ``"(whitespace)"`` so
|
||||
callers can surface them distinctly from non-whitespace sentinels.
|
||||
"""
|
||||
counts: dict[str, int] = {}
|
||||
needles = {s.casefold(): s for s in sentinels}
|
||||
for value in series:
|
||||
if value is None or (isinstance(value, float) and pd.isna(value)):
|
||||
continue
|
||||
if not isinstance(value, str):
|
||||
continue
|
||||
if _WHITESPACE_ONLY_RE.match(value):
|
||||
counts["(whitespace)"] = counts.get("(whitespace)", 0) + 1
|
||||
continue
|
||||
key = value.strip().casefold()
|
||||
if key in needles:
|
||||
label = needles[key]
|
||||
counts[label] = counts.get(label, 0) + 1
|
||||
return counts
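
The two sentinel helpers can be exercised on plain Python values. The following sketch assumes a shortened sentinel list and uses `math.isnan` instead of numpy or pandas; the `_sketch` names are hypothetical.

```python
import math
import re
from typing import Any, Iterable

WHITESPACE_ONLY = re.compile(r"^\s*$")
SENTINELS = ("n/a", "null", "-", "tbd")  # shortened list for illustration


def is_missing_like_sketch(value: Any, sentinels: Iterable[str] = SENTINELS) -> bool:
    """True for None, NaN, blank strings, and sentinel strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if not isinstance(value, str):
        return False  # real numbers such as 0 are data, not gaps
    if WHITESPACE_ONLY.match(value):
        return True
    return value.strip().casefold() in {s.casefold() for s in sentinels}


def count_sentinels_sketch(values: Iterable[Any], sentinels: Iterable[str] = SENTINELS) -> dict[str, int]:
    """Count sentinel hits; blank strings share the "(whitespace)" key."""
    needles = {s.casefold(): s for s in sentinels}
    counts: dict[str, int] = {}
    for value in values:
        if not isinstance(value, str):
            continue  # None, NaN and numbers are not disguised nulls
        if WHITESPACE_ONLY.match(value):
            counts["(whitespace)"] = counts.get("(whitespace)", 0) + 1
            continue
        key = value.strip().casefold()
        if key in needles:
            label = needles[key]
            counts[label] = counts.get(label, 0) + 1
    return counts


column = ["N/A", "n/a", " ", "x", None, "TBD", "0"]
print([is_missing_like_sketch(v) for v in column])
print(count_sentinels_sketch(column))
```

Counting under the canonical sentinel label (rather than the raw cell text) keeps `"N/A"` and `"n/a"` in one bucket, which is what a per-column QA report needs.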
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Strategies / options / results
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
Strategy = Literal[
|
||||
"none", # detect-only; do not fill or drop.
|
||||
"drop_row", # drop rows that are missing in any selected column.
|
||||
"drop_col", # drop columns whose missing fraction exceeds threshold.
|
||||
"drop_both", # apply drop_col first, then drop_row on what remains.
|
||||
"mean", # numeric only.
|
||||
"median", # numeric only.
|
||||
"mode", # any dtype.
|
||||
"constant", # fill with options.fill_value.
|
||||
"ffill",
|
||||
"bfill",
|
||||
"interpolate", # linear interpolation, numeric only.
|
||||
]
|
||||
|
||||
_NUMERIC_STRATEGIES: frozenset[str] = frozenset(
|
||||
{"mean", "median", "interpolate"},
|
||||
)
|
||||
_FILL_STRATEGIES: frozenset[str] = frozenset(
|
||||
{"mean", "median", "mode", "constant", "ffill", "bfill", "interpolate"},
|
||||
)
|
||||
_DROP_STRATEGIES: frozenset[str] = frozenset(
|
||||
{"drop_row", "drop_col", "drop_both"},
|
||||
)
|
||||
|
||||
|
||||
PRESETS: dict[str, dict[str, Any]] = {
|
||||
"detect-only": {
|
||||
"standardize_sentinels": True,
|
||||
"strategy": "none",
|
||||
},
|
||||
"safe-fill": {
|
||||
"standardize_sentinels": True,
|
||||
"strategy": "median",
|
||||
"categorical_strategy": "mode",
|
||||
},
|
||||
"drop-incomplete": {
|
||||
"standardize_sentinels": True,
|
||||
"strategy": "drop_row",
|
||||
# Strict-greater semantics: 0.0 → drop a row as soon as any
|
||||
# selected column is missing.
|
||||
"row_drop_threshold": 0.0,
|
||||
},
|
||||
}
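
The strict-greater threshold used by the drop presets is easy to get wrong at the boundaries, so a small sketch may help. `should_drop` is a hypothetical helper that assumes the missing fraction is computed as a plain ratio over the selected cells.

```python
def should_drop(missing: int, total: int, threshold: float) -> bool:
    """Drop when the missing fraction is strictly greater than threshold."""
    if total == 0:
        return False  # nothing selected, nothing to judge
    return missing / total > threshold


# threshold 0.0: any missing cell triggers a drop, a complete row does not
print(should_drop(1, 4, 0.0), should_drop(0, 4, 0.0))
# threshold 0.5: exactly half missing is kept, more than half is dropped
print(should_drop(2, 4, 0.5), should_drop(3, 4, 0.5))
# threshold 1.0: never drops, since no fraction exceeds 100%
print(should_drop(4, 4, 1.0))
```

Because the comparison is strict, `1.0` works as a "never drop" default and `0.0` as "drop on any gap" without a separate on/off flag.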
|
||||
|
||||
|
||||
@dataclass
|
||||
class MissingOptions:
|
||||
"""Toggles for missing-value detection and handling.
|
||||
|
||||
Defaults match the ``detect-only`` preset: sentinels are standardized
|
||||
to NaN, but no rows are dropped and no values are filled.
|
||||
"""
|
||||
|
||||
# Detection
|
||||
sentinels: list[str] = field(default_factory=lambda: list(DEFAULT_SENTINELS))
|
||||
standardize_sentinels: bool = True
|
||||
|
||||
# Strategy applied to all selected columns. ``categorical_strategy``
|
||||
# is a fallback used by numeric-only strategies (mean/median/interpolate)
|
||||
# when a selected column is non-numeric — rather than crash, fall back
|
||||
# to a reasonable categorical strategy.
|
||||
strategy: Strategy = "none"
|
||||
categorical_strategy: Strategy = "mode"
|
||||
|
||||
# Per-column overrides take precedence over ``strategy`` / preset.
|
||||
column_strategies: dict[str, Strategy] = field(default_factory=dict)
|
||||
|
||||
# Constant-fill payload. Either a scalar (applied to every selected
|
||||
# column) or a per-column dict for differentiated fills.
|
||||
fill_value: Any = None
|
||||
column_fill_values: dict[str, Any] = field(default_factory=dict)
|
||||
|
||||
# Drop thresholds (0.0 .. 1.0). A row/column is dropped when its
|
||||
# missing fraction is *strictly greater than* the threshold. So:
|
||||
# 1.0 (default) — never drop (no fraction exceeds 100%)
|
||||
# 0.5 — drop when more than half is missing
|
||||
# 0.0 — drop on any missing at all
|
||||
row_drop_threshold: float = 1.0
|
||||
col_drop_threshold: float = 1.0
|
||||
|
||||
# Scope control
|
||||
columns: Optional[list[str]] = None
|
||||
skip_columns: list[str] = field(default_factory=list)
|
||||
|
||||
@classmethod
|
||||
def from_preset(cls, name: str) -> MissingOptions:
|
||||
if name not in PRESETS:
|
||||
raise ConfigError(
|
||||
f"Unknown preset '{name}'",
|
||||
operation="MissingOptions.from_preset",
|
||||
suggestion=f"Available: {sorted(PRESETS)}",
|
||||
)
|
||||
return cls(**PRESETS[name])
|
||||
|
||||
@classmethod
|
||||
def from_dict(cls, data: dict) -> MissingOptions:
|
||||
known = set(cls.__dataclass_fields__)
|
||||
kwargs = {k: v for k, v in data.items() if k in known}
|
||||
return cls(**kwargs)
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return asdict(self)
|
||||
|
||||
def to_file(self, path: str | Path) -> Path:
|
||||
out = Path(path)
|
||||
out.write_text(json.dumps(self.to_dict(), indent=2, default=str))
|
||||
return out
|
||||
|
||||
@classmethod
|
||||
def from_file(cls, path: str | Path) -> MissingOptions:
|
||||
return cls.from_dict(json.loads(Path(path).read_text()))
|
||||
|
||||
def validate(self) -> None:
|
||||
"""Fail fast on incoherent option combinations."""
|
||||
choices = (
|
||||
"none", "drop_row", "drop_col", "drop_both",
|
||||
"mean", "median", "mode", "constant",
|
||||
"ffill", "bfill", "interpolate",
|
||||
)
|
||||
ensure_choice(self.strategy, name="strategy", choices=choices,
|
||||
function="MissingOptions.validate")
|
||||
ensure_choice(self.categorical_strategy, name="categorical_strategy",
|
||||
choices=choices, function="MissingOptions.validate")
|
||||
for col, strat in self.column_strategies.items():
|
||||
ensure_choice(strat, name=f"column_strategies[{col!r}]",
|
||||
choices=choices, function="MissingOptions.validate")
|
||||
if not (0.0 <= self.row_drop_threshold <= 1.0):
|
||||
raise ConfigError(
|
||||
f"row_drop_threshold must be in [0.0, 1.0], got "
|
||||
f"{self.row_drop_threshold!r}",
|
||||
operation="MissingOptions.validate",
|
||||
)
|
||||
if not (0.0 <= self.col_drop_threshold <= 1.0):
|
||||
raise ConfigError(
|
||||
f"col_drop_threshold must be in [0.0, 1.0], got "
|
||||
f"{self.col_drop_threshold!r}",
|
||||
operation="MissingOptions.validate",
|
||||
)
|
||||
|
||||
|
||||
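The save/load round trip used by `MissingOptions` can be sketched in isolation. `DemoOptions` below is a hypothetical stand-in (it is not part of the module) that mirrors only the pattern visible above: `from_dict` keeps keys that name a declared dataclass field and silently ignores the rest, so a JSON file written by a newer version with extra keys should still load.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field, fields
from typing import Any


@dataclass
class DemoOptions:
    """Hypothetical stand-in for an options dataclass (illustration only)."""

    strategy: str = "none"
    row_drop_threshold: float = 1.0
    column_strategies: dict[str, str] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> DemoOptions:
        # Keep only keys that name a declared field; ignore everything else.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})


original = DemoOptions(strategy="median", column_strategies={"age": "mean"})
payload = json.loads(json.dumps(asdict(original)))
payload["added_in_a_later_version"] = True  # unknown key, dropped on load
restored = DemoOptions.from_dict(payload)
```

One consequence of this design is that a misspelled key in a hand-edited JSON file is ignored rather than reported, so the affected field quietly keeps its default.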
@dataclass
class ColumnReport:
    """Per-column missingness snapshot."""

    column: str
    dtype: str
    total: int
    missing: int  # NaN cells (after sentinel standardization if enabled)
    missing_pct: float  # 0.0 .. 100.0
    sentinels_found: dict[str, int]  # disguised nulls hit, pre-standardization

    @property
    def has_missing(self) -> bool:
        return self.missing > 0


@dataclass
class MissingProfile:
    """Whole-DataFrame missingness profile."""

    columns: list[ColumnReport]
    rows_total: int
    cells_total: int
    cells_missing: int
    rows_with_any_missing: int
    rows_complete: int

    @property
    def cells_missing_pct(self) -> float:
        return (self.cells_missing / self.cells_total * 100.0) if self.cells_total else 0.0

    def to_dataframe(self) -> pd.DataFrame:
        """Long-form table suitable for the GUI / CLI."""
        rows = []
        for r in self.columns:
            top = max(r.sentinels_found.items(), key=lambda kv: kv[1], default=("", 0))
            rows.append({
                "column": r.column,
                "dtype": r.dtype,
                "missing": r.missing,
                "missing_pct": round(r.missing_pct, 2),
                "top_sentinel": top[0],
                "top_sentinel_count": top[1],
                "sentinel_total": sum(r.sentinels_found.values()),
            })
        return pd.DataFrame(rows)


@dataclass
class MissingResult:
    """Output of ``handle_missing``."""

    handled_df: pd.DataFrame
    profile_before: MissingProfile
    profile_after: MissingProfile
    changes: pd.DataFrame  # cols: row, column, old, new, action
    rows_dropped: int
    columns_dropped: list[str]
    cells_filled: int
    sentinels_standardized: int
    columns_processed: list[str]
    strategy_per_column: dict[str, Strategy]


# ---------------------------------------------------------------------------
# Profiling
# ---------------------------------------------------------------------------

def _select_columns(df: pd.DataFrame, options: MissingOptions) -> list[str]:
    """Pick the columns to operate on (mirrors text_clean._select_columns).

    Default: every column. Missing-value handling is meaningful for any
    dtype, unlike text cleaning which only touches strings.
    """
    if options.columns is not None:
        unknown = [c for c in options.columns if c not in df.columns]
        if unknown:
            raise InputValidationError(
                f"Columns not found in input: {unknown}",
                operation="handle_missing",
                suggestion=f"Available: {list(df.columns)}",
            )
        chosen: Iterable[str] = options.columns
    else:
        chosen = list(df.columns)
    skip = set(options.skip_columns)
    return [c for c in chosen if c not in skip]


def _standardize_sentinels(
    df: pd.DataFrame,
    columns: list[str],
    sentinels: Iterable[str],
) -> tuple[pd.DataFrame, list[dict[str, Any]], int]:
    """Replace sentinel strings with NaN in the selected columns.

    Returns ``(new_df, change_records, total_replacements)``. ``change_records``
    is appended to the audit table so the user can see exactly which cells
    were converted from "N/A" / "-" / etc. to a real null.
    """
    out = df.copy()
    needles = {s.casefold(): s for s in sentinels}
    records: list[dict[str, Any]] = []
    total = 0

    for col in columns:
        series = out[col]
        # Only iterate object/string columns — numeric/datetime cells can't
        # contain string sentinels by construction.
        if not (pdtypes.is_object_dtype(series) or pdtypes.is_string_dtype(series)):
            continue
        new_values: list[Any] = []
        changed = False
        for row_idx, value in enumerate(series.tolist()):
            if value is None or (isinstance(value, float) and pd.isna(value)):
                new_values.append(value)
                continue
            if not isinstance(value, str):
                new_values.append(value)
                continue
            if _WHITESPACE_ONLY_RE.match(value):
                records.append({
                    "row": row_idx,
                    "column": col,
                    "old": value,
                    "new": np.nan,
                    "action": "standardize:whitespace",
                })
                new_values.append(np.nan)
                changed = True
                total += 1
                continue
            key = value.strip().casefold()
            if key in needles:
                records.append({
                    "row": row_idx,
                    "column": col,
                    "old": value,
                    "new": np.nan,
                    "action": f"standardize:{needles[key]}",
                })
                new_values.append(np.nan)
                changed = True
                total += 1
            else:
                new_values.append(value)
        if changed:
            out[col] = new_values
    return out, records, total

||||
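The matching rule in `_standardize_sentinels` (strip, then casefold, then exact lookup, with whitespace-only cells always treated as missing) can be tried on a single Series. This is a minimal sketch under stated assumptions: `SENTINELS` and `WHITESPACE_ONLY` are illustrative stand-ins, since the module's `DEFAULT_SENTINELS` and `_WHITESPACE_ONLY_RE` are defined outside this excerpt, and the audit records are left out.

```python
import re

import pandas as pd

# Assumed values for illustration; the module's real defaults are not shown here.
SENTINELS = ["N/A", "-", "null"]
WHITESPACE_ONLY = re.compile(r"^\s*$")


def is_disguised_null(value, needles):
    """True for whitespace-only strings and case-insensitive sentinel hits."""
    if not isinstance(value, str):
        return False
    if WHITESPACE_ONLY.match(value):
        return True
    return value.strip().casefold() in needles


needles = {s.casefold() for s in SENTINELS}
raw = pd.Series(["Ada", " n/a ", "-", "   ", "NULL", None, "N/A-ish"], dtype=object)
hits = raw.map(lambda v: is_disguised_null(v, needles)).astype(bool)
cleaned = raw.mask(hits)  # sentinel cells become NaN, everything else is kept
```

Because the lookup is an exact match on the stripped value, a cell such as `"N/A-ish"` is left alone; only whole-cell sentinels are converted.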
def profile_missing(
    df: pd.DataFrame,
    options: Optional[MissingOptions] = None,
) -> MissingProfile:
    """Compute a per-column missingness profile.

    Sentinels are *not* mutated in *df*; this is a read-only inspection.
    The profile reports both raw NaN counts and which sentinel strings
    were hit so the GUI / CLI can show "12 disguised nulls (8 'N/A',
    4 '-')" alongside "47 real NaN".
    """
    ensure_dataframe(df, function="profile_missing")
    options = options or MissingOptions()
    columns = _select_columns(df, options)
    sentinels = options.sentinels if options.standardize_sentinels else []

    reports: list[ColumnReport] = []
    for col in columns:
        series = df[col]
        sentinels_hit = detect_sentinels(series, sentinels) if sentinels else {}
        # Effective missing = real-NaN count + sentinel hits (since they'd
        # become NaN once standardize_sentinels runs). This makes the
        # "before" profile match what the user sees post-standardization.
        nan_count = int(series.isna().sum())
        sentinel_count = sum(sentinels_hit.values())
        total = len(series)
        missing = nan_count + sentinel_count
        reports.append(ColumnReport(
            column=str(col),
            dtype=str(series.dtype),
            total=total,
            missing=missing,
            missing_pct=(missing / total * 100.0) if total else 0.0,
            sentinels_found=sentinels_hit,
        ))

    # For row-level stats use NaN ∪ sentinels in the selected columns.
    if columns and len(df):
        if sentinels:
            mask = pd.DataFrame(index=df.index)
            needles = {s.casefold() for s in sentinels}
            for col in columns:
                series = df[col]
                if pdtypes.is_object_dtype(series) or pdtypes.is_string_dtype(series):
                    sentinel_mask = series.apply(
                        lambda v: isinstance(v, str)
                        and (
                            bool(_WHITESPACE_ONLY_RE.match(v))
                            or v.strip().casefold() in needles
                        )
                    )
                    mask[col] = series.isna() | sentinel_mask
                else:
                    mask[col] = series.isna()
        else:
            mask = df[columns].isna()
        rows_with_any = int(mask.any(axis=1).sum())
        rows_complete = int((~mask.any(axis=1)).sum())
        cells_missing = int(mask.values.sum())
        cells_total = int(mask.size)
    else:
        rows_with_any = 0
        rows_complete = len(df)
        cells_missing = 0
        cells_total = len(df) * len(columns)

    return MissingProfile(
        columns=reports,
        rows_total=len(df),
        cells_total=cells_total,
        cells_missing=cells_missing,
        rows_with_any_missing=rows_with_any,
        rows_complete=rows_complete,
    )


# ---------------------------------------------------------------------------
# Imputation
# ---------------------------------------------------------------------------

def _resolve_strategy(
    col: str,
    series: pd.Series,
    options: MissingOptions,
) -> Strategy:
    """Effective strategy for *col*: per-column override → global → fallback.

    If the column is non-numeric and the selected strategy is numeric-only,
    fall back to ``options.categorical_strategy`` so the run doesn't crash
    halfway through. The fallback is logged so the audit trail records
    why a different strategy fired.
    """
    strat: Strategy = options.column_strategies.get(col, options.strategy)
    if strat in _NUMERIC_STRATEGIES and not pdtypes.is_numeric_dtype(series):
        logger.debug(
            "Column {!r}: strategy {!r} requires numeric dtype "
            "(got {}); falling back to {!r}",
            col, strat, series.dtype, options.categorical_strategy,
        )
        return options.categorical_strategy
    return strat

||||
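The fallback in `_resolve_strategy` is what lets a preset like `safe-fill` apply `median` to a mixed-dtype frame. A minimal sketch of the idea, assuming a simplified `resolve` helper (a hypothetical name, without the per-column overrides or logging of the real function):

```python
import pandas as pd
from pandas.api import types as pdtypes

NUMERIC_ONLY = frozenset({"mean", "median", "interpolate"})


def resolve(series, requested, categorical_fallback="mode"):
    """Return the strategy that will actually run for this column."""
    if requested in NUMERIC_ONLY and not pdtypes.is_numeric_dtype(series):
        return categorical_fallback
    return requested


df = pd.DataFrame({
    "age": [31.0, None, 45.0],
    "city": ["Oslo", None, "Oslo"],
})
resolved = {col: resolve(df[col], "median") for col in df.columns}

# Apply the resolved strategies: median for the numeric column,
# most frequent value for the text column.
filled = df.copy()
filled["age"] = df["age"].fillna(df["age"].median())
filled["city"] = df["city"].fillna(df["city"].mode(dropna=True).iloc[0])
```

The trade-off is that the caller asked for `median` and a text column silently received `mode`, which is presumably why the result object reports the effective strategy per column.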
def _fill_value_for(
    col: str,
    series: pd.Series,
    strategy: Strategy,
    options: MissingOptions,
) -> Any:
    """Compute the scalar fill for *series* under *strategy*.

    Returns a sentinel ``object()`` when the strategy doesn't yield a
    single scalar (ffill/bfill/interpolate handle the fill themselves).
    """
    if strategy == "mean":
        return series.mean()
    if strategy == "median":
        return series.median()
    if strategy == "mode":
        modes = series.mode(dropna=True)
        return modes.iloc[0] if len(modes) else None
    if strategy == "constant":
        if col in options.column_fill_values:
            return options.column_fill_values[col]
        return options.fill_value
    return _NO_SCALAR


_NO_SCALAR = object()


def _apply_fill(
    df: pd.DataFrame,
    col: str,
    strategy: Strategy,
    options: MissingOptions,
    records: list[dict[str, Any]],
) -> int:
    """Apply *strategy* to a single column. Returns cells filled."""
    series = df[col]
    missing_mask = series.isna()
    if not missing_mask.any():
        return 0

    if strategy == "ffill":
        filled = series.ffill()
    elif strategy == "bfill":
        filled = series.bfill()
    elif strategy == "interpolate":
        # Interpolation is only defined for numeric series — guard so an
        # accidentally-routed object column produces no output rather
        # than a confusing TypeError.
        if not pdtypes.is_numeric_dtype(series):
            return 0
        filled = series.interpolate(method="linear", limit_direction="both")
    else:
        # Skip mean/median computation entirely on all-NaN numeric columns
        # so we don't trip numpy's "Mean of empty slice" RuntimeWarning.
        if (
            strategy in {"mean", "median"}
            and pdtypes.is_numeric_dtype(series)
            and series.dropna().empty
        ):
            return 0
        scalar = _fill_value_for(col, series, strategy, options)
        if scalar is _NO_SCALAR:
            return 0
        if scalar is None or (isinstance(scalar, float) and pd.isna(scalar)):
            # Nothing to fill with — e.g., all-NaN column under "mean".
            logger.debug(
                "Column {!r}: strategy {!r} produced no fill value (all-NaN?)",
                col, strategy,
            )
            return 0
        # Opt into pandas 2.x's future no-silent-downcast behaviour to
        # avoid the FutureWarning fired when fillna would auto-downcast
        # an object column. We then call infer_objects ourselves to
        # preserve the dtype the user would have ended up with.
        with pd.option_context("future.no_silent_downcasting", True):
            filled = series.fillna(scalar)
            if pdtypes.is_object_dtype(series):
                filled = filled.infer_objects(copy=False)

    cells = 0
    for row_idx in np.flatnonzero(missing_mask.values):
        old = series.iloc[row_idx]
        new = filled.iloc[row_idx]
        if pd.isna(new):
            # ffill/bfill at a leading/trailing NaN run can leave NaN in
            # place. Don't audit a no-op fill.
            continue
        records.append({
            "row": int(row_idx),
            "column": col,
            "old": old,
            "new": new,
            "action": f"fill:{strategy}",
        })
        cells += 1
    df[col] = filled
    return cells


def _apply_drops(
    df: pd.DataFrame,
    columns: list[str],
    strategy: Strategy,
    options: MissingOptions,
    records: list[dict[str, Any]],
) -> tuple[pd.DataFrame, int, list[str]]:
    """Drop rows / columns according to *strategy*.

    Returns ``(new_df, rows_dropped, columns_dropped)``.
    """
    out = df
    rows_dropped = 0
    cols_dropped: list[str] = []

    # Drop semantics (consistent across rows and columns): a row/column
    # is dropped when its missing fraction is *strictly greater* than the
    # threshold. The default threshold of 1.0 therefore means "never
    # drop" (no fraction can exceed 100%); 0.0 means "drop on any
    # missing"; intermediate values trigger when the missing share rises
    # above the chosen ceiling.
    if strategy in {"drop_col", "drop_both"} and columns:
        pct = out[columns].isna().mean()
        to_drop = [c for c, frac in pct.items() if frac > options.col_drop_threshold]
        if to_drop:
            for c in to_drop:
                records.append({
                    "row": -1,
                    "column": c,
                    "old": f"{int(out[c].isna().sum())} missing / {len(out)}",
                    "new": "",
                    "action": "drop_column",
                })
            out = out.drop(columns=to_drop)
            cols_dropped = to_drop
            columns = [c for c in columns if c not in to_drop]

    if strategy in {"drop_row", "drop_both"} and columns:
        sel = out[columns]
        frac = sel.isna().mean(axis=1)
        drop_mask = frac > options.row_drop_threshold
        rows_dropped = int(drop_mask.sum())
        if rows_dropped:
            for row_idx in np.flatnonzero(drop_mask.values):
                miss_cols = [c for c in columns if pd.isna(sel.iloc[row_idx][c])]
                records.append({
                    "row": int(row_idx),
                    "column": ",".join(miss_cols),
                    "old": "",
                    "new": "",
                    "action": "drop_row",
                })
            out = out.loc[~drop_mask].reset_index(drop=True)

    return out, rows_dropped, cols_dropped

||||
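The strictly-greater drop rule described in `_apply_drops` is easy to misread at the boundaries, so here is a small worked case. It is a sketch only: `rows_kept` is a hypothetical helper that reproduces the comparison `frac > threshold` on rows and omits the audit records.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
})
# Per-row missing fractions are 0.0, 0.5, 1.0 and 0.0.


def rows_kept(frame, threshold):
    """Keep rows whose missing fraction is NOT strictly above the threshold."""
    missing_fraction = frame.isna().mean(axis=1)
    return frame.loc[~(missing_fraction > threshold)]


kept_default = rows_kept(df, 1.0)  # nothing exceeds 100%, so nothing is dropped
kept_half = rows_kept(df, 0.5)     # the half-missing row survives (0.5 is not > 0.5)
kept_strict = rows_kept(df, 0.0)   # any missing cell drops the row
```

Note that with the default threshold of 1.0 even a fully empty row is kept, since its fraction equals the threshold rather than exceeding it.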
def handle_missing(
    df: pd.DataFrame,
    options: Optional[MissingOptions] = None,
) -> MissingResult:
    """Detect and handle missing values in *df*.

    Pipeline placement (recommended, not enforced)
    ----------------------------------------------
    Run *after* the text cleaner (so NBSP-padded / zero-width-only
    cells are correctly detected as missing) and the format
    standardizer (so numeric imputation has numeric dtypes). Run
    *before* the deduplicator (so dedup doesn't merge a row with a
    missing email into a row that has one). See
    ``src.core.pipeline.SOFT_DEPENDENCIES``.

    Pipeline:
      1. Standardize disguised-null sentinels to ``NaN`` (audit-logged).
      2. Apply column drops (if strategy includes ``drop_col``).
      3. Apply row drops (if strategy includes ``drop_row``).
      4. Apply per-column fills (mean/median/mode/constant/ffill/bfill/
         interpolate). Per-column overrides win over the global strategy.

    The input DataFrame is not mutated.
    """
    ensure_dataframe(df, function="handle_missing")
    options = options or MissingOptions()
    options.validate()

    profile_before = profile_missing(df, options)
    columns = _select_columns(df, options)

    logger.debug(
        "handle_missing: rows={}, cols={}, strategy={}, scope_cols={}",
        len(df), len(df.columns), options.strategy, len(columns),
    )

    records: list[dict[str, Any]] = []
    sentinels_replaced = 0

    # ------------------------------------------------------------------
    # 1. Sentinel standardization
    # ------------------------------------------------------------------
    if options.standardize_sentinels and options.sentinels and columns:
        out, sentinel_records, sentinels_replaced = _standardize_sentinels(
            df, columns, options.sentinels,
        )
        records.extend(sentinel_records)
    else:
        out = df.copy()

    # ------------------------------------------------------------------
    # 2 + 3. Drops (column-first, then row)
    # ------------------------------------------------------------------
    rows_dropped = 0
    columns_dropped: list[str] = []
    global_strategy = options.strategy
    if global_strategy in _DROP_STRATEGIES:
        out, rows_dropped, columns_dropped = _apply_drops(
            out, columns, global_strategy, options, records,
        )
        # Update column scope after potential drops.
        columns = [c for c in columns if c not in columns_dropped]

    # ------------------------------------------------------------------
    # 4. Fills (per-column)
    # ------------------------------------------------------------------
    cells_filled = 0
    strategy_per_column: dict[str, Strategy] = {}
    for col in columns:
        strat = _resolve_strategy(col, out[col], options)
        strategy_per_column[col] = strat
        if strat in _FILL_STRATEGIES:
            cells_filled += _apply_fill(out, col, strat, options, records)

    # ------------------------------------------------------------------
    # Build audit + after-profile
    # ------------------------------------------------------------------
    changes_df = pd.DataFrame(
        records, columns=["row", "column", "old", "new", "action"],
    )
    profile_after = profile_missing(out, options)

    return MissingResult(
        handled_df=out,
        profile_before=profile_before,
        profile_after=profile_after,
        changes=changes_df,
        rows_dropped=rows_dropped,
        columns_dropped=columns_dropped,
        cells_filled=cells_filled,
        sentinels_standardized=sentinels_replaced,
        columns_processed=columns,
        strategy_per_column=strategy_per_column,
    )
501
src/core/pipeline.py
Normal file
@@ -0,0 +1,501 @@
"""DataTools Pipeline Runner.

Chain the cleaning tools (text-clean, format-standardize, missing,
column-map, dedup) into a single orchestrated workflow. The pipeline
threads the DataFrame from one step to the next; each step's options
are JSON-serializable so the entire pipeline can be saved, shared, and
re-run on next week's export.

Design tenets
-------------
* **Recommended, not forced.** The recommended order
  (text → format → missing → dedup, with column-map fitting either
  end depending on use case) is encoded in
  :data:`SOFT_DEPENDENCIES`. The runner WARNS on out-of-order
  pipelines but never refuses to execute them — the user owns their
  workflow.
* **Each step is opt-in / opt-out.** ``Step.enabled = False`` skips
  the step without removing it from the saved configuration.
* **Adapters are tiny.** Each tool is wrapped by a small adapter that
  bridges its native ``Options`` / ``Result`` shape to the pipeline's
  uniform ``(df, options_dict) → (new_df, summary)`` contract.

Public API
----------
Types:
    Step, Pipeline, StepResult, PipelineResult

Functions:
    run_pipeline(df, pipeline) -> PipelineResult
    validate_pipeline(pipeline) -> list[str]
    recommended_pipeline(*, include=None, options=None) -> Pipeline

Constants:
    TOOL_ADAPTERS — name → adapter callable
    TOOL_NAMES — sorted list of recognised tool names
    SOFT_DEPENDENCIES — list of (earlier, later, reason) tuples
"""

from __future__ import annotations

import json
import time
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Callable, Iterable, Optional

import pandas as pd
from loguru import logger

from .errors import (
    ConfigError,
    DataToolsError,
    InputValidationError,
    ensure_choice,
    ensure_dataframe,
)


# ---------------------------------------------------------------------------
# Tool adapters — bridge each tool's native API to the pipeline contract
# ---------------------------------------------------------------------------

def _adapter_text_clean(
    df: pd.DataFrame, options: dict[str, Any],
) -> tuple[pd.DataFrame, dict[str, Any]]:
    from .text_clean import CleanOptions, clean_dataframe
    opts = CleanOptions.from_dict(options) if options else CleanOptions()
    res = clean_dataframe(df, opts)
    return res.cleaned_df, {
        "cells_total": res.cells_total,
        "cells_changed": res.cells_changed,
        "columns_processed": list(res.columns_processed),
    }


def _adapter_format_standardize(
    df: pd.DataFrame, options: dict[str, Any],
) -> tuple[pd.DataFrame, dict[str, Any]]:
    from .format_standardize import StandardizeOptions, standardize_dataframe
    opts = StandardizeOptions.from_dict(options) if options else StandardizeOptions()
    res = standardize_dataframe(df, opts)
    return res.standardized_df, {
        "cells_total": res.cells_total,
        "cells_changed": res.cells_changed,
        "cells_unparseable": res.cells_unparseable,
        "columns_processed": list(res.columns_processed),
    }


def _adapter_missing(
    df: pd.DataFrame, options: dict[str, Any],
) -> tuple[pd.DataFrame, dict[str, Any]]:
    from .missing import MissingOptions, handle_missing
    opts = MissingOptions.from_dict(options) if options else MissingOptions()
    res = handle_missing(df, opts)
    return res.handled_df, {
        "sentinels_standardized": res.sentinels_standardized,
        "cells_filled": res.cells_filled,
        "rows_dropped": res.rows_dropped,
        "columns_dropped": list(res.columns_dropped),
        "columns_processed": list(res.columns_processed),
    }


def _adapter_column_map(
    df: pd.DataFrame, options: dict[str, Any],
) -> tuple[pd.DataFrame, dict[str, Any]]:
    from .column_mapper import MapOptions, map_columns
    opts = MapOptions.from_dict(options) if options else MapOptions()
    res = map_columns(df, opts)
    return res.mapped_df, {
        "columns_renamed": res.columns_renamed,
        "columns_dropped": list(res.columns_dropped),
        "columns_added": list(res.columns_added),
        "coercion_failures": dict(res.coercion_failures),
        "missing_required_targets": list(res.missing_required_targets),
    }


def _adapter_dedup(
    df: pd.DataFrame, options: dict[str, Any],
) -> tuple[pd.DataFrame, dict[str, Any]]:
    from .dedup import deduplicate, SurvivorRule
    from .config import DeduplicationConfig
    options = options or {}
    survivor = options.get("survivor_rule", "first")
    if isinstance(survivor, str):
        try:
            survivor = SurvivorRule(survivor)
        except ValueError as e:
            raise ConfigError(
                f"Unknown survivor_rule {survivor!r}",
                operation="pipeline.dedup",
                cause=e,
                suggestion=f"Valid: {[r.value for r in SurvivorRule]}",
            ) from e

    # Optional explicit strategies via the same JSON shape as
    # DeduplicationConfig: ``[{"columns": [{"column": "phone",
    # "algorithm": "exact", "threshold": 100}, ...]}, ...]``.
    raw_strategies = options.get("strategies")
    explicit_strategies = None
    if raw_strategies:
        cfg = DeduplicationConfig.from_dict({"strategies": raw_strategies})
        explicit_strategies = cfg.to_strategies()

    res = deduplicate(
        df,
        strategies=explicit_strategies,
        survivor_rule=survivor,
        merge=options.get("merge", False),
        preview=False,  # pipeline always commits the dedup output
        date_column=options.get("date_column"),
    )
    final = res.deduplicated_df if res.deduplicated_df is not None else df
    return final, {
        "input_rows": len(df),
        "output_rows": len(final),
        "duplicates_removed": len(df) - len(final),
        "groups": len(res.match_groups) if res.match_groups else 0,
    }


TOOL_ADAPTERS: dict[str, Callable[..., tuple[pd.DataFrame, dict[str, Any]]]] = {
    "text_clean": _adapter_text_clean,
    "format_standardize": _adapter_format_standardize,
    "missing": _adapter_missing,
    "column_map": _adapter_column_map,
    "dedup": _adapter_dedup,
}

TOOL_NAMES: list[str] = sorted(TOOL_ADAPTERS)

||||
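The uniform `(df, options_dict) → (new_df, summary)` adapter contract is what allows a runner to thread one DataFrame through arbitrary steps. The sketch below illustrates that contract with two toy adapters; `strip_adapter`, `drop_duplicates_adapter`, and `ADAPTERS` are hypothetical names for this example and are not the module's real adapters or its `run_pipeline`.

```python
from typing import Any, Callable

import pandas as pd

Adapter = Callable[[pd.DataFrame, dict[str, Any]], tuple[pd.DataFrame, dict[str, Any]]]


def strip_adapter(df, options):
    """Toy adapter: strip surrounding whitespace in the named columns."""
    out = df.copy()
    changed = 0
    for col in options.get("columns", []):
        stripped = out[col].str.strip()
        changed += int((stripped != out[col]).sum())
        out[col] = stripped
    return out, {"cells_changed": changed}


def drop_duplicates_adapter(df, options):
    """Toy adapter: exact-match de-duplication on an optional column subset."""
    out = df.drop_duplicates(subset=options.get("subset")).reset_index(drop=True)
    return out, {"duplicates_removed": len(df) - len(out)}


ADAPTERS: dict[str, Adapter] = {
    "strip": strip_adapter,
    "drop_duplicates": drop_duplicates_adapter,
}

steps = [
    ("strip", {"columns": ["email"]}),
    ("drop_duplicates", {"subset": ["email"]}),
]
frame = pd.DataFrame({"email": ["a@x.io", " a@x.io ", "b@x.io"]})
summaries = []
for tool, options in steps:
    # Each step receives the previous step's output.
    frame, summary = ADAPTERS[tool](frame, options)
    summaries.append((tool, summary))
```

Running the two steps in the opposite order would leave the padded duplicate in place, which is the kind of ordering effect the soft dependencies are meant to flag.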
# ---------------------------------------------------------------------------
# Soft dependencies
# ---------------------------------------------------------------------------

# Pairs of (earlier, later, reason) where running *earlier* before
# *later* is recommended. A reversal triggers a WARNING — never a
# block. The user owns their workflow.
SOFT_DEPENDENCIES: list[tuple[str, str, str]] = [
    (
        "text_clean", "format_standardize",
        "format parsers (phone / currency / date) fail on smart-quote-"
        "contaminated or NBSP-padded input — clean text first",
    ),
    (
        "text_clean", "missing",
        "sentinel detection misses cells padded with NBSP / zero-width "
        "characters — clean text first",
    ),
    (
        "text_clean", "dedup",
        "fuzzy matching treats NBSP-padded values as different — "
        "clean text first",
    ),
    (
        "format_standardize", "missing",
        "numeric imputation needs numeric dtypes; canonical phones / "
        "currencies improve sentinel detection",
    ),
    (
        "format_standardize", "dedup",
        "canonical phones / lowercase emails enable cross-format "
        "duplicate matching",
    ),
    (
        "missing", "dedup",
        "deduping rows with mixed NaN sentinels produces brittle merges "
        "— resolve missing values first",
    ),
]

||||
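A warn-only order check over `(earlier, later, reason)` tuples might look like the sketch below. This is an assumption about the approach, not the module's `validate_pipeline` (which is not shown in this excerpt): `order_warnings` is a hypothetical helper, and `SOFT_DEPS` is a shortened copy of the table with abbreviated reasons.

```python
# Shortened, illustrative dependency table: (earlier, later, reason).
SOFT_DEPS = [
    ("text_clean", "format_standardize", "parsers expect clean text"),
    ("format_standardize", "missing", "numeric imputation needs numeric dtypes"),
    ("missing", "dedup", "resolve missing values before merging rows"),
]


def order_warnings(tools, deps=SOFT_DEPS):
    """Return one message per dependency whose two tools run in reverse order."""
    first_seen = {}
    for position, tool in enumerate(tools):
        first_seen.setdefault(tool, position)
    messages = []
    for earlier, later, reason in deps:
        if earlier not in first_seen or later not in first_seen:
            continue  # a dependency only applies when both tools are present
        if first_seen[later] < first_seen[earlier]:
            messages.append(f"{later!r} runs before {earlier!r}: {reason}")
    return messages


in_order = order_warnings(["text_clean", "format_standardize", "missing", "dedup"])
out_of_order = order_warnings(["missing", "text_clean", "format_standardize"])
```

Returning messages instead of raising matches the stated tenet that a reversal is a warning and never a block; the caller decides whether to log, display, or ignore them.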
# ---------------------------------------------------------------------------
|
||||
# Step / Pipeline / Result dataclasses
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
@dataclass
|
||||
class Step:
|
||||
"""One step in a pipeline.
|
||||
|
||||
Attributes
|
||||
----------
|
||||
tool : Name of the tool to run. Must be a key of :data:`TOOL_ADAPTERS`.
|
||||
options : JSON-serializable dict of tool-specific options. Each
|
||||
adapter parses this through the tool's ``Options.from_dict``.
|
||||
enabled : Skip the step (without removing it) when False.
|
||||
name : Optional friendly label for logs / GUI rendering. Defaults
|
||||
to the tool name.
|
||||
"""
|
||||
|
||||
tool: str
|
||||
options: dict[str, Any] = field(default_factory=dict)
|
||||
enabled: bool = True
|
||||
name: Optional[str] = None
|
||||
|
||||
def display_name(self) -> str:
|
||||
return self.name or self.tool
|
||||
|
||||
def __post_init__(self) -> None:
|
||||
if self.tool not in TOOL_ADAPTERS:
|
||||
raise ConfigError(
|
||||
f"Unknown tool {self.tool!r}",
|
||||
operation="Step.__post_init__",
|
||||
suggestion=f"Valid tools: {TOOL_NAMES}",
|
||||
)
|
||||
|
||||
|
||||
@dataclass
|
||||
class Pipeline:
|
||||
"""An ordered sequence of :class:`Step` records."""
|
||||
|
||||
steps: list[Step] = field(default_factory=list)
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return {"steps": [asdict(s) for s in self.steps]}
|
||||
|
||||
def to_file(self, path: str | Path) -> Path:
|
||||
out = Path(path)
|
||||
out.write_text(json.dumps(self.to_dict(), indent=2, default=str))
|
||||
return out
|
||||
|
||||
@classmethod
|
||||
def from_dict(cls, data: dict) -> Pipeline:
|
||||
if "steps" not in data:
|
||||
raise ConfigError(
|
||||
"Pipeline file must contain a 'steps' list",
|
||||
operation="Pipeline.from_dict",
|
||||
suggestion='Example: {"steps": [{"tool": "text_clean"}, ...]}',
|
||||
)
|
||||
steps: list[Step] = []
|
||||
for raw in data["steps"]:
|
||||
if "tool" not in raw:
|
||||
raise ConfigError(
|
||||
f"Step is missing 'tool': {raw!r}",
|
||||
operation="Pipeline.from_dict",
|
||||
)
|
||||
steps.append(Step(
|
||||
tool=raw["tool"],
|
||||
options=dict(raw.get("options") or {}),
|
||||
enabled=bool(raw.get("enabled", True)),
|
||||
name=raw.get("name"),
|
||||
))
|
||||
return cls(steps=steps)
|
||||
|
||||
@classmethod
|
||||
def from_file(cls, path: str | Path) -> Pipeline:
|
||||
return cls.from_dict(json.loads(Path(path).read_text(encoding="utf-8")))
|
||||
|
||||
|
||||
@dataclass
|
||||
class StepResult:
|
||||
"""One step's outcome."""
|
||||
|
||||
step: Step
|
||||
summary: dict[str, Any]
|
||||
elapsed_seconds: float
|
||||
skipped: bool = False
|
||||
error: Optional[str] = None # rendered exception, not the live one
|
||||
|
||||
|
||||
@dataclass
|
||||
class PipelineResult:
|
||||
"""Whole-run outcome."""
|
||||
|
||||
final_df: pd.DataFrame
|
||||
step_results: list[StepResult]
|
||||
total_elapsed: float
|
||||
initial_rows: int
|
||||
final_rows: int
|
||||
warnings: list[str]
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Recommended pipeline + validation
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
# The single canonical default. Column-map is omitted: include it only
|
||||
# when the caller needs header alignment (early) or schema enforcement
|
||||
# (late). Adding it as an "auto" middle step would override the user's
|
||||
# downstream column lookups without their having asked.
|
||||
_DEFAULT_ORDER: list[str] = [
|
||||
"text_clean",
|
||||
"format_standardize",
|
||||
"missing",
|
||||
"dedup",
|
||||
]
|
||||
|
||||
|
||||
def recommended_pipeline(
|
||||
*,
|
||||
include: Optional[Iterable[str]] = None,
|
||||
options: Optional[dict[str, dict[str, Any]]] = None,
|
||||
) -> Pipeline:
|
||||
"""Build the recommended pipeline.
|
||||
|
||||
Defaults to ``[text_clean, format_standardize, missing, dedup]`` —
|
||||
the canonical workflow surfaced in DECISIONS.md and
|
||||
``src.core.pipeline.SOFT_DEPENDENCIES``.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
include
|
||||
Names of tools to include, in the desired order. When None,
|
||||
uses :data:`_DEFAULT_ORDER`. Pass ``["column_map", "text_clean",
|
||||
...]`` to put column-map first (header-alignment use case) or
|
||||
``[..., "column_map"]`` to put it last (schema-enforcement use
|
||||
case).
|
||||
options
|
||||
Optional ``{tool_name: {option_dict}}`` to seed each step. A
|
||||
missing entry uses the tool's default options.
|
||||
"""
|
||||
chosen = list(include) if include is not None else list(_DEFAULT_ORDER)
|
||||
seed = options or {}
|
||||
for t in chosen:
|
||||
ensure_choice(
|
||||
t, name="tool", choices=TOOL_NAMES,
|
||||
function="recommended_pipeline",
|
||||
)
|
||||
return Pipeline(steps=[
|
||||
Step(tool=t, options=dict(seed.get(t) or {}))
|
||||
for t in chosen
|
||||
])
|
||||
|
||||
|
||||
def validate_pipeline(pipeline: Pipeline) -> list[str]:
|
||||
"""Return a list of WARNING strings for soft-dependency violations.
|
||||
|
||||
Empty list = pipeline is in recommended order. Each warning is a
|
||||
single human-readable sentence the CLI / GUI can surface verbatim.
|
||||
Disabled steps are ignored.
|
||||
"""
|
||||
enabled = [s for s in pipeline.steps if s.enabled]
|
||||
positions: dict[str, int] = {}
|
||||
duplicates: list[str] = []
|
||||
for i, s in enumerate(enabled):
|
||||
if s.tool in positions:
|
||||
# Multiple steps for the same tool are allowed (a user might
|
||||
# text-clean twice with different scopes). Skip the dep
|
||||
# check for the duplicate so we don't spam warnings.
|
||||
duplicates.append(s.tool)
|
||||
else:
|
||||
positions[s.tool] = i
|
||||
|
||||
warnings: list[str] = []
|
||||
for earlier, later, why in SOFT_DEPENDENCIES:
|
||||
if earlier in positions and later in positions:
|
||||
if positions[earlier] > positions[later]:
|
||||
warnings.append(
|
||||
f"step {later!r} runs BEFORE {earlier!r} — {why}"
|
||||
)
|
||||
return warnings
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Execution
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def run_pipeline(
|
||||
df: pd.DataFrame,
|
||||
pipeline: Pipeline,
|
||||
*,
|
||||
on_step_complete: Optional[Callable[[StepResult], None]] = None,
|
||||
stop_on_error: bool = True,
|
||||
) -> PipelineResult:
|
||||
"""Execute *pipeline* against *df*.
|
||||
|
||||
The DataFrame from each step's adapter is passed to the next step;
|
||||
the original input is never mutated. Soft-dependency warnings are
|
||||
captured up-front and returned via ``PipelineResult.warnings`` so
|
||||
the caller can surface them — the run proceeds regardless.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
on_step_complete
|
||||
Optional ``callable(StepResult)`` fired after each step. Useful
|
||||
for live progress in the GUI.
|
||||
stop_on_error
|
||||
When True (default), the first failing step's exception
|
||||
propagates and execution halts. Set False to continue past a
|
||||
failing step using the previous step's output (the failed
|
||||
step's ``StepResult.error`` holds the rendered exception).
|
||||
"""
|
||||
ensure_dataframe(df, function="run_pipeline")
|
||||
if not isinstance(pipeline, Pipeline):
|
||||
raise InputValidationError(
|
||||
f"Expected Pipeline, got {type(pipeline).__name__}",
|
||||
operation="run_pipeline",
|
||||
)
|
||||
|
||||
warnings = validate_pipeline(pipeline)
|
||||
if warnings:
|
||||
for w in warnings:
|
||||
logger.warning("pipeline order: {}", w)
|
||||
|
||||
initial_rows = len(df)
|
||||
step_results: list[StepResult] = []
|
||||
current = df
|
||||
t_start = time.perf_counter()
|
||||
|
||||
for step in pipeline.steps:
|
||||
if not step.enabled:
|
||||
sr = StepResult(
|
||||
step=step, summary={}, elapsed_seconds=0.0, skipped=True,
|
||||
)
|
||||
step_results.append(sr)
|
||||
if on_step_complete:
|
||||
_safe_call(on_step_complete, sr)
|
||||
continue
|
||||
|
||||
adapter = TOOL_ADAPTERS[step.tool]
|
||||
s_start = time.perf_counter()
|
||||
try:
|
||||
new_df, summary = adapter(current, step.options)
|
||||
except Exception as e: # noqa: BLE001 — pipeline owns the error contract
|
||||
elapsed = time.perf_counter() - s_start
|
||||
err_msg = (
|
||||
e.format() if isinstance(e, DataToolsError) else f"{type(e).__name__}: {e}"
|
||||
)
|
||||
sr = StepResult(
|
||||
step=step, summary={}, elapsed_seconds=elapsed,
|
||||
error=err_msg,
|
||||
)
|
||||
step_results.append(sr)
|
||||
if on_step_complete:
|
||||
_safe_call(on_step_complete, sr)
|
||||
if stop_on_error:
|
||||
raise
|
||||
logger.warning(
|
||||
"pipeline step {!r} failed; continuing with previous output",
|
||||
step.display_name(),
|
||||
)
|
||||
continue
|
||||
|
||||
current = new_df
|
||||
sr = StepResult(
|
||||
step=step, summary=summary,
|
||||
elapsed_seconds=time.perf_counter() - s_start,
|
||||
)
|
||||
step_results.append(sr)
|
||||
if on_step_complete:
|
||||
_safe_call(on_step_complete, sr)
|
||||
|
||||
return PipelineResult(
|
||||
final_df=current,
|
||||
step_results=step_results,
|
||||
total_elapsed=time.perf_counter() - t_start,
|
||||
initial_rows=initial_rows,
|
||||
final_rows=len(current),
|
||||
warnings=warnings,
|
||||
)
|
||||
|
||||
|
||||
def _safe_call(callback: Callable, *args: Any) -> None:
|
||||
"""Run a user-supplied callback, logging but never propagating errors."""
|
||||
try:
|
||||
callback(*args)
|
||||
except Exception: # noqa: BLE001 — progress callbacks are advisory
|
||||
logger.opt(exception=True).debug("pipeline callback raised; ignoring")
|
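Reviewer sketch for the pipeline module above: a stripped-down, stdlib-only illustration of the two behaviours its docstrings describe, namely soft-dependency warnings that never block a run, and `stop_on_error=False` continuing with the previous step's output. The `SOFT_DEPENDENCIES` entries here are hypothetical placeholders (the real table is not shown in this hunk), the adapters return only the data rather than the `(new_df, summary)` pair used above, and `validate` / `run` are simplified stand-ins for `validate_pipeline` / `run_pipeline`, not the shipped implementation.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical (earlier, later, why) triples; the real SOFT_DEPENDENCIES
# table is defined elsewhere in src/core/pipeline.py.
SOFT_DEPENDENCIES = [
    ("text_clean", "format_standardize", "parsers choke on smart quotes"),
    ("missing", "dedup", "sentinels should be NaN before matching"),
]


@dataclass
class Step:
    tool: str
    enabled: bool = True
    options: dict[str, Any] = field(default_factory=dict)


def validate(steps: list[Step]) -> list[str]:
    """Warn when an enabled step runs before a tool it should follow."""
    positions: dict[str, int] = {}
    for i, s in enumerate(x for x in steps if x.enabled):
        positions.setdefault(s.tool, i)  # first occurrence wins
    return [
        f"step {later!r} runs BEFORE {earlier!r}: {why}"
        for earlier, later, why in SOFT_DEPENDENCIES
        if earlier in positions and later in positions
        and positions[earlier] > positions[later]
    ]


def run(rows, steps, adapters, stop_on_error=True):
    """Thread rows through each adapter; optionally continue past failures."""
    errors = []
    current = rows
    for s in steps:
        if not s.enabled:
            continue
        try:
            current = adapters[s.tool](current, s.options)
        except Exception as e:
            errors.append(f"{s.tool}: {type(e).__name__}: {e}")
            if stop_on_error:
                raise
            # Otherwise `current` still holds the previous step's output.
    return current, errors
```

With `stop_on_error=False` a failing step is recorded and skipped, so later steps still see the last good data; this mirrors the contract described in the `run_pipeline` docstring.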
||||
@@ -535,6 +535,15 @@ def clean_dataframe(df: pd.DataFrame, options: Optional[CleanOptions] = None) ->
|
||||
|
||||
Numeric, datetime, and boolean columns are skipped by default. The input
|
||||
DataFrame is not mutated; a copy is returned in ``CleanResult.cleaned_df``.
|
||||
|
||||
Pipeline placement (recommended, not enforced)
|
||||
----------------------------------------------
|
||||
*Best run early.* Smart-quote, NBSP, and zero-width pollution
|
||||
silently breaks downstream parsers — phone numbers fail on
|
||||
smart-quote contamination, sentinel detection misses NBSP-padded
|
||||
cells, and fuzzy dedup treats whitespace-padded values as
|
||||
different. Running this tool before format / missing / dedup is
|
||||
the standard order. See ``src.core.pipeline.SOFT_DEPENDENCIES``.
|
||||
"""
|
||||
from .errors import ensure_dataframe
|
||||
ensure_dataframe(df, function="clean_dataframe")
|
||||
|
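Reviewer sketch for the "Best run early" note above: a small stdlib-only illustration of why invisible characters defeat downstream checks. `clean_text`, `is_sentinel`, and the `SENTINELS` set are hypothetical helpers written for this note; they are not the text cleaner's or the missing-value handler's actual API.

```python
SENTINELS = {"n/a", "null", "-", ""}  # illustrative subset only

# Zero-width space, non-joiner, joiner, and BOM map to None (deleted).
_ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))
_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def clean_text(value: str) -> str:
    """Fold NBSP to a plain space, drop zero-width chars, straighten quotes."""
    value = value.replace("\u00a0", " ").translate(_ZERO_WIDTH).translate(_QUOTES)
    return value.strip()


def is_sentinel(value: str) -> bool:
    """Naive disguised-null check: strip, lowercase, look up."""
    return value.strip().lower() in SENTINELS


polluted = "N/A\u200b"          # trailing zero-width space
print(is_sentinel(polluted))              # False: the hidden char survives strip()
print(is_sentinel(clean_text(polluted)))  # True once the text is cleaned
print(clean_text("Acme\u00a0Corp") == "Acme Corp")  # True: NBSP folded
```

The same effect applies to exact and fuzzy dedup: `"Acme\u00a0Corp"` and `"Acme Corp"` compare unequal until the NBSP is folded.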
||||
468
src/gui/app_demo.py
Normal file
@@ -0,0 +1,468 @@
|
||||
"""DataTools — public demo app (deploys to Streamlit Community Cloud).
|
||||
|
||||
This is a SEPARATE entry point from the main GUI (``src/gui/app.py``).
|
||||
The full GUI is the paid product surface; this demo is the marketing
|
||||
surface — a single page that runs one of three persona-specific
|
||||
pipelines on a preloaded sample file, shows the BEFORE / AFTER
|
||||
side-by-side, and converts the visitor to a Gumroad purchase.
|
||||
|
||||
Launch:
|
||||
streamlit run src/gui/app_demo.py
|
||||
|
||||
URL routing:
|
||||
https://demo.datatools.app/?p=shopify-pet (Shopify operator)
|
||||
https://demo.datatools.app/?p=bookkeeper (Bookkeeper)
|
||||
https://demo.datatools.app/?p=revops (RevOps agency)
|
||||
|
||||
Free / paid boundary (per docs/DEMO-PLAN.md §6):
|
||||
- input rows capped at ``DEMO_ROW_CAP``
|
||||
- input file size capped at ``DEMO_FILE_CAP_MB``
|
||||
- download CSV gets a single trailing watermark row
|
||||
- the pipeline editor is read-only — visitor sees it but can't change it
|
||||
- no audit-log download (paid feature)
|
||||
- no save-pipeline-JSON (paid feature)
|
||||
|
||||
The demo runs the *same engine* as the paid product. Caps are applied
|
||||
at the surface layer only — when the buyer downloads and runs the paid
|
||||
build, every cap disappears.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import io
|
||||
import json
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
import pandas as pd
|
||||
import streamlit as st
|
||||
|
||||
|
||||
# Ensure project root is on sys.path so `src.core` imports work
|
||||
_project_root = Path(__file__).resolve().parent.parent.parent
|
||||
if str(_project_root) not in sys.path:
|
||||
sys.path.insert(0, str(_project_root))
|
||||
|
||||
from src.core.pipeline import Pipeline, run_pipeline
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Free / paid boundary constants
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
DEMO_ROW_CAP: int = 100
|
||||
DEMO_FILE_CAP_MB: int = 5
|
||||
GUMROAD_BASE: str = "https://gumroad.com/l/datatools"
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Persona registry — single source of truth
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
DEMO_DIR = _project_root / "samples" / "demo"
|
||||
|
||||
|
||||
PERSONAS: dict[str, dict[str, Any]] = {
|
||||
"shopify-pet": {
|
||||
"label": "Shopify pet operator",
|
||||
"icon": "🛍️",
|
||||
"h1": "Klaviyo-import-ready customer lists. **In 30 seconds. Locally.**",
|
||||
"sub": (
|
||||
"Your Shopify customer export has duplicates Excel can't catch, "
|
||||
"international phones Excel can't parse, and disguised nulls "
|
||||
"(`N/A`, `(blank)`, `?`) that break Klaviyo's import. "
|
||||
"DataTools fixes all of it in one pass — and your data never "
|
||||
"leaves your computer."
|
||||
),
|
||||
"data_file": "shopify_pet_customers.csv",
|
||||
"pipeline_file": "shopify_pet_pipeline.json",
|
||||
"cta": "Get DataTools for Shopify — $49 →",
|
||||
"landing": "https://datatools.app/shopify/",
|
||||
},
|
||||
"bookkeeper": {
|
||||
"label": "Bookkeeper / freelance accountant",
|
||||
"icon": "📒",
|
||||
"h1": "Reconcile messy bank exports. **Hand your client an audit trail.**",
|
||||
"sub": (
|
||||
"The Jan and Feb exports overlap; the same transaction posts twice. "
|
||||
"Vendor names are *Amazon* / *amazon.com* / *AMAZON.COM*4F2X9* in "
|
||||
"three rows. DataTools dedups on Date + Amount + fuzzy Vendor, "
|
||||
"produces ISO dates and numeric amounts, and gives you a row-level "
|
||||
"audit log to hand the client."
|
||||
),
|
||||
"data_file": "bookkeeper_bank_reconcile.csv",
|
||||
"pipeline_file": "bookkeeper_bank_pipeline.json",
|
||||
"cta": "Get DataTools for Bookkeepers — $49 →",
|
||||
"landing": "https://datatools.app/bookkeeper/",
|
||||
},
|
||||
"revops": {
|
||||
"label": "Marketing / RevOps agency",
|
||||
"icon": "🪢",
|
||||
"h1": "Dedupe lead lists across HubSpot, LinkedIn, and manual scrapes — **locally.**",
|
||||
"sub": (
|
||||
"The same prospect shows up in HubSpot as `alice@acme.com`, in "
|
||||
"LinkedIn as `Alice.Johnson@acme.com`, and in your VA's manual "
|
||||
"scrape as `alice@acme.com` again. Country is `USA` / `US` / "
|
||||
"`United States`. DataTools fuzzy-matches across sources, "
|
||||
"normalizes phones for 50+ countries, and merges survivors "
|
||||
"with their most-complete fields — without uploading anything."
|
||||
),
|
||||
"data_file": "agency_combined_leads.csv",
|
||||
"pipeline_file": "agency_leads_pipeline.json",
|
||||
"cta": "Get DataTools for RevOps — $49 →",
|
||||
"landing": "https://datatools.app/revops/",
|
||||
},
|
||||
}
|
||||
|
||||
DEFAULT_PERSONA = "shopify-pet"
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Page config + routing
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.set_page_config(
|
||||
page_title="DataTools — try it live",
|
||||
page_icon="🧹",
|
||||
layout="wide",
|
||||
initial_sidebar_state="collapsed",
|
||||
)
|
||||
|
||||
# Strip Streamlit chrome that breaks the iframe-embed look on the
|
||||
# landing pages.
|
||||
st.markdown("""
|
||||
<style>
|
||||
#MainMenu, footer, header { visibility: hidden; }
|
||||
.block-container { padding-top: 1.2rem; padding-bottom: 1rem; max-width: 1200px; }
|
||||
[data-testid="stSidebarNav"] { display: none; }
|
||||
section[data-testid="stSidebar"] { display: none; }
|
||||
.stApp { background: #0f1115; color: #e8eaed; }
|
||||
h1, h2, h3 { color: #e8eaed; letter-spacing: -0.01em; }
|
||||
hr { border-color: #252a36; }
|
||||
.demo-card {
|
||||
background: #161922;
|
||||
border: 1px solid #252a36;
|
||||
border-radius: 12px;
|
||||
padding: 18px;
|
||||
}
|
||||
.cta-block {
|
||||
background: linear-gradient(135deg, #161922 0%, #1d212b 100%);
|
||||
border: 1px solid #6ee7b7;
|
||||
border-radius: 12px;
|
||||
padding: 24px;
|
||||
text-align: center;
|
||||
}
|
||||
.cta-block a {
|
||||
display: inline-block;
|
||||
background: #6ee7b7; color: #052e1a;
|
||||
font-weight: 600; padding: 12px 22px;
|
||||
border-radius: 8px; text-decoration: none;
|
||||
font-size: 17px; margin-top: 12px;
|
||||
}
|
||||
.metric-pill {
|
||||
display: inline-block;
|
||||
background: #1d212b; border: 1px solid #252a36;
|
||||
padding: 4px 10px; border-radius: 999px;
|
||||
font-family: ui-monospace, monospace; font-size: 13px;
|
||||
color: #6ee7b7; margin-right: 6px; margin-bottom: 4px;
|
||||
}
|
||||
</style>
|
||||
""", unsafe_allow_html=True)
|
||||
|
||||
|
||||
def _resolve_persona() -> str:
|
||||
"""Read ``?p=<persona>`` from query string; fall back to default."""
|
||||
try:
|
||||
params = st.query_params
|
||||
raw = params.get("p", DEFAULT_PERSONA)
|
||||
except AttributeError:
|
||||
# Older Streamlit versions
|
||||
params = st.experimental_get_query_params()
|
||||
raw = params.get("p", [DEFAULT_PERSONA])
|
||||
raw = raw[0] if isinstance(raw, list) else raw
|
||||
if raw not in PERSONAS:
|
||||
return DEFAULT_PERSONA
|
||||
return raw
|
||||
|
||||
|
||||
persona_key = _resolve_persona()
|
||||
persona = PERSONAS[persona_key]
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Header + persona switch
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
col_brand, col_switch = st.columns([3, 2])
|
||||
with col_brand:
|
||||
st.markdown(f"### 🧹 DataTools / for {persona['label']}")
|
||||
with col_switch:
|
||||
# Quick-switch dropdown for visitors landing on the wrong persona
|
||||
new_choice = st.selectbox(
|
||||
"Try a different demo",
|
||||
options=list(PERSONAS),
|
||||
format_func=lambda k: f"{PERSONAS[k]['icon']} {PERSONAS[k]['label']}",
|
||||
index=list(PERSONAS).index(persona_key),
|
||||
key="persona_switch",
|
||||
label_visibility="collapsed",
|
||||
)
|
||||
if new_choice != persona_key:
|
||||
st.query_params["p"] = new_choice
|
||||
st.rerun()
|
||||
|
||||
st.markdown(f"## {persona['h1']}")
|
||||
st.markdown(persona["sub"])
|
||||
|
||||
st.markdown("---")
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Load preloaded sample data + pipeline
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
@st.cache_data(show_spinner=False)
|
||||
def _load_demo(data_file: str, pipeline_file: str) -> tuple[pd.DataFrame, Pipeline]:
|
||||
df = pd.read_csv(DEMO_DIR / data_file, dtype=str, keep_default_na=False)
|
||||
pipe = Pipeline.from_file(DEMO_DIR / pipeline_file)
|
||||
return df, pipe
|
||||
|
||||
|
||||
sample_df, sample_pipeline = _load_demo(persona["data_file"], persona["pipeline_file"])
|
||||
|
||||
|
||||
def _read_uploaded(uploaded_file) -> tuple[pd.DataFrame, list[str]]:
|
||||
"""Decode an uploaded file. Returns (df, warnings)."""
|
||||
warnings: list[str] = []
|
||||
raw = uploaded_file.getvalue()
|
||||
size_mb = len(raw) / 1024 / 1024
|
||||
if size_mb > DEMO_FILE_CAP_MB:
|
||||
warnings.append(
|
||||
f"Uploaded file is {size_mb:.1f} MB — demo capped at "
|
||||
f"{DEMO_FILE_CAP_MB} MB. The paid product has no size limit."
|
||||
)
|
||||
return sample_df.copy(), warnings
|
||||
suffix = Path(uploaded_file.name).suffix.lower()
|
||||
bio = io.BytesIO(raw)
|
||||
try:
|
||||
if suffix in (".xlsx", ".xls"):
|
||||
df = pd.read_excel(bio, dtype=str, keep_default_na=False)
|
||||
else:
|
||||
for enc in ("utf-8", "utf-8-sig", "latin-1"):
|
||||
try:
|
||||
bio.seek(0)
|
||||
sep = "\t" if suffix == ".tsv" else ","
|
||||
df = pd.read_csv(
|
||||
bio, dtype=str, keep_default_na=False,
|
||||
encoding=enc, sep=sep, on_bad_lines="warn",
|
||||
)
|
||||
break
|
||||
except UnicodeDecodeError:
|
||||
continue
|
||||
else:
|
||||
bio.seek(0)
|
||||
df = pd.read_csv(bio, dtype=str, keep_default_na=False, encoding="latin-1")
|
||||
except Exception as e:
|
||||
warnings.append(f"Could not read your file ({type(e).__name__}). "
|
||||
"Demo will run on the sample dataset.")
|
||||
return sample_df.copy(), warnings
|
||||
if len(df) > DEMO_ROW_CAP:
|
||||
warnings.append(
|
||||
f"Demo capped at {DEMO_ROW_CAP} rows — your file has {len(df):,}. "
|
||||
f"Running on the first {DEMO_ROW_CAP} rows. The paid product has no row limit."
|
||||
)
|
||||
df = df.head(DEMO_ROW_CAP)
|
||||
return df, warnings
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# File source: preloaded sample (default) or user upload
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.markdown(f"#### Sample dataset preloaded · `{persona['data_file']}`")
|
||||
|
||||
with st.expander(
|
||||
"Or replace with your own file (capped at "
|
||||
f"{DEMO_ROW_CAP} rows / {DEMO_FILE_CAP_MB} MB for the demo)",
|
||||
expanded=False,
|
||||
):
|
||||
uploaded = st.file_uploader(
|
||||
"Your file",
|
||||
type=["csv", "tsv", "xlsx", "xls"],
|
||||
key="demo_user_file",
|
||||
label_visibility="collapsed",
|
||||
help=(
|
||||
f"Files over {DEMO_FILE_CAP_MB} MB fall back to the sample dataset; "
|
||||
f"otherwise only the first {DEMO_ROW_CAP} rows are processed. "
|
||||
"The paid build runs on 1 GB+ files via streaming."
|
||||
),
|
||||
)
|
||||
|
||||
if uploaded is not None:
|
||||
df_in, upload_warnings = _read_uploaded(uploaded)
|
||||
for w in upload_warnings:
|
||||
st.info(w)
|
||||
using_sample = False
|
||||
else:
|
||||
df_in = sample_df.copy()
|
||||
using_sample = True
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# BEFORE preview
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.markdown(f"#### BEFORE — {len(df_in)} rows, {len(df_in.columns)} columns")
|
||||
st.dataframe(df_in.head(10), use_container_width=True, hide_index=True)
|
||||
|
||||
st.markdown("---")
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Pipeline (read-only)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.markdown("#### Pipeline (saved — paid version is editable)")
|
||||
pipe_summary = " → ".join(
|
||||
f"**{i + 1}.** {step.tool}"
|
||||
for i, step in enumerate(sample_pipeline.steps)
|
||||
)
|
||||
st.markdown(pipe_summary)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Run
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
run_clicked = st.button(
|
||||
"▶ Run pipeline",
|
||||
type="primary",
|
||||
use_container_width=True,
|
||||
key="demo_run_button",
|
||||
)
|
||||
|
||||
if run_clicked:
|
||||
with st.spinner("Running…"):
|
||||
t0 = time.perf_counter()
|
||||
try:
|
||||
result = run_pipeline(df_in, sample_pipeline, stop_on_error=False)
|
||||
except Exception as e:
|
||||
from src.core.errors import format_for_user
|
||||
st.error(f"Demo halted: {format_for_user(e)}")
|
||||
st.stop()
|
||||
elapsed = time.perf_counter() - t0
|
||||
st.session_state["demo_result"] = result
|
||||
st.session_state["demo_elapsed"] = elapsed
|
||||
st.session_state["demo_persona"] = persona_key
|
||||
|
||||
result = st.session_state.get("demo_result")
|
||||
elapsed = st.session_state.get("demo_elapsed", 0.0)
|
||||
result_persona = st.session_state.get("demo_persona")
|
||||
|
||||
# Reset cached result when persona switches
|
||||
if result is not None and result_persona != persona_key:
|
||||
result = None
|
||||
st.session_state.pop("demo_result", None)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# AFTER + metrics + CTA
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
if result is not None:
|
||||
st.markdown("---")
|
||||
st.markdown(
|
||||
f"#### AFTER — {result.initial_rows} → {result.final_rows} rows · "
|
||||
f"finished in {elapsed*1000:.0f} ms"
|
||||
)
|
||||
|
||||
# Per-step metric pills
|
||||
pills_html: list[str] = []
|
||||
for sr in result.step_results:
|
||||
if sr.skipped:
|
||||
continue
|
||||
if sr.error:
|
||||
pills_html.append(
|
||||
f'<span class="metric-pill" style="color:#fbbf24">'
|
||||
f'{sr.step.tool}: error</span>'
|
||||
)
|
||||
continue
|
||||
s = sr.summary
|
||||
bits: list[str] = []
|
||||
if "cells_changed" in s and s["cells_changed"]:
|
||||
bits.append(f"{s['cells_changed']} cells")
|
||||
if "sentinels_standardized" in s and s["sentinels_standardized"]:
|
||||
bits.append(f"{s['sentinels_standardized']} sentinels")
|
||||
if "duplicates_removed" in s and s["duplicates_removed"]:
|
||||
bits.append(f"{s['duplicates_removed']} dupes merged")
|
||||
if "columns_renamed" in s and s["columns_renamed"]:
|
||||
bits.append(f"{s['columns_renamed']} renamed")
|
||||
label = ", ".join(bits) if bits else "no-op"
|
||||
pills_html.append(
|
||||
f'<span class="metric-pill">{sr.step.tool}: {label}</span>'
|
||||
)
|
||||
st.markdown("".join(pills_html), unsafe_allow_html=True)
|
||||
|
||||
st.dataframe(result.final_df.head(10), use_container_width=True, hide_index=True)
|
||||
|
||||
# ----- Download with watermark row -----
|
||||
watermark_row = pd.DataFrame([{
|
||||
col: f"DataTools demo — buy at {persona['landing']}"
|
||||
if i == 0 else ""
|
||||
for i, col in enumerate(result.final_df.columns)
|
||||
}])
|
||||
out_df = pd.concat([result.final_df, watermark_row], ignore_index=True)
|
||||
csv_bytes = out_df.to_csv(index=False).encode("utf-8-sig")
|
||||
|
||||
col_dl, col_cta = st.columns([1, 2])
|
||||
with col_dl:
|
||||
st.download_button(
|
||||
"Download cleaned CSV (sample · watermarked)",
|
||||
data=csv_bytes,
|
||||
file_name=Path(persona["data_file"]).stem + "_cleaned_demo.csv",
|
||||
mime="text/csv",
|
||||
use_container_width=True,
|
||||
)
|
||||
with col_cta:
|
||||
st.markdown(
|
||||
f"""
|
||||
<div class="cta-block">
|
||||
<strong style="font-size: 18px;">Like what you see?</strong><br/>
|
||||
Run this on YOUR full file — locally. No upload. No row limit. No watermark.<br/>
|
||||
<a href="{GUMROAD_BASE}?from={persona_key}" rel="noopener">{persona['cta']}</a>
|
||||
</div>
|
||||
""",
|
||||
unsafe_allow_html=True,
|
||||
)
|
||||
else:
|
||||
# Pre-run state: show the buy block at the bottom anyway so the
|
||||
# CTA is on the page before the visitor has run anything.
|
||||
st.markdown(
|
||||
f"""
|
||||
<div class="cta-block" style="margin-top: 24px;">
|
||||
<strong style="font-size: 18px;">Already convinced?</strong><br/>
|
||||
Skip the demo and grab the full version. One-time payment, no subscription.<br/>
|
||||
<a href="{GUMROAD_BASE}?from={persona_key}" rel="noopener">{persona['cta']}</a>
|
||||
</div>
|
||||
""",
|
||||
unsafe_allow_html=True,
|
||||
)
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Footer trust block
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.markdown("---")
|
||||
col_t1, col_t2, col_t3 = st.columns(3)
|
||||
with col_t1:
|
||||
st.markdown("**🔒 Runs locally**\n\nThe paid product is desktop-only. Your data never leaves your computer.")
|
||||
with col_t2:
|
||||
st.markdown("**📋 Audit trail**\n\nEvery cell change row-logged with old / new / which rule fired.")
|
||||
with col_t3:
|
||||
st.markdown("**💰 One-time $49**\n\nNo subscription. Mac · Windows · Linux. Free updates for v1.x.")
|
||||
|
||||
st.caption(
|
||||
f"Demo capped at {DEMO_ROW_CAP} rows · output watermarked with one trailing row · "
|
||||
"running on free hosting. The paid product is uncapped and runs offline."
|
||||
)
|
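Reviewer sketch for the demo's free/paid boundary above: the row cap and the single trailing watermark row, reduced to a stdlib-only example. The demo itself uses pandas (`df.head(DEMO_ROW_CAP)` and `pd.concat`); `cap_rows` and `watermarked_csv` are hypothetical helpers that only illustrate the same surface-layer behaviour on plain lists of dicts.

```python
import csv
import io

DEMO_ROW_CAP = 100  # mirrors the constant in app_demo.py


def cap_rows(rows, cap=DEMO_ROW_CAP):
    """Return (rows limited to `cap`, warning string or None)."""
    if len(rows) <= cap:
        return rows, None
    return rows[:cap], f"Demo capped at {cap} rows; your file has {len(rows):,}."


def watermarked_csv(rows, columns, notice):
    """Serialize rows plus one trailing row carrying `notice` in column 0."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    # The watermark occupies the first column; every other cell is empty.
    writer.writerow({c: (notice if i == 0 else "") for i, c in enumerate(columns)})
    return buf.getvalue()


rows = [{"email": f"u{i}@x.io", "name": "n"} for i in range(150)]
kept, warning = cap_rows(rows)
csv_text = watermarked_csv(kept, ["email", "name"], "DataTools demo")
```

Because both helpers act on the data going in and the bytes going out, the engine in between stays identical to the uncapped build, which is the design the module docstring describes.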
||||
@@ -1,111 +1,368 @@
|
||||
"""DataTools Missing Value Handler — stub page."""
|
||||
"""DataTools Missing Value Handler — Streamlit page."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import io
|
||||
import json
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import pandas as pd
|
||||
import streamlit as st
|
||||
|
||||
_project_root = Path(__file__).resolve().parent.parent.parent.parent
|
||||
if str(_project_root) not in sys.path:
|
||||
sys.path.insert(0, str(_project_root))
|
||||
|
||||
from src.gui.components import hide_streamlit_chrome, require_normalization_gate
|
||||
from src.gui.components import (
|
||||
hide_streamlit_chrome,
|
||||
pickup_or_upload,
|
||||
require_normalization_gate,
|
||||
)
|
||||
from src.core.missing import (
|
||||
DEFAULT_SENTINELS,
|
||||
MissingOptions,
|
||||
PRESETS,
|
||||
handle_missing,
|
||||
profile_missing,
|
||||
)
|
||||
|
||||
hide_streamlit_chrome()
|
||||
require_normalization_gate()
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Header
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.title("🕳️ Missing Value Handler")
|
||||
st.caption("Detect, analyze, and handle missing values in your data.")
|
||||
st.caption(
|
||||
"Detect disguised nulls, profile missingness, and apply imputation or "
|
||||
"drop strategies. Runs locally — your data never leaves this computer."
|
||||
)
|
||||
|
||||
st.info("This tool is under development.")
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# What this tool will do
|
||||
# File upload
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.markdown("""
|
||||
**Features:**
|
||||
- Detect disguised nulls (empty strings, "N/A", "n/a", "-", "NULL", "None", etc.)
|
||||
- Missingness analysis: per-column counts, percentages, and patterns
|
||||
- Visualize missing data heatmap
|
||||
- Imputation strategies: drop rows/columns, fill with mean/median/mode, forward-fill, backward-fill
|
||||
- Custom sentinel value replacement
|
||||
- Before/after comparison
|
||||
""")
|
||||
uploaded = pickup_or_upload(
|
||||
label="Upload CSV or Excel file",
|
||||
key="missing_file_upload",
|
||||
types=["csv", "tsv", "xlsx", "xls"],
|
||||
)
|
||||
|
||||
if uploaded is None:
|
||||
st.info("Upload a CSV, TSV, or Excel file to begin.")
|
||||
st.stop()
|
||||
|
||||
|
||||
@st.cache_data(show_spinner=False)
|
||||
def _read_uploaded(name: str, data: bytes) -> pd.DataFrame:
|
||||
"""Read the uploaded bytes into a DataFrame.
|
||||
|
||||
Unlike the text cleaner, we do *not* force ``dtype=str`` here: missing-
|
||||
value handling is more useful when numeric columns are typed correctly
|
||||
(so mean / median / interpolate work without manual coercion).
|
||||
Sentinel strings are still detected because they survive in object
|
||||
columns where any cell is non-numeric.
|
||||
"""
|
||||
suffix = Path(name).suffix.lower()
|
||||
bio = io.BytesIO(data)
|
||||
if suffix in (".xlsx", ".xls"):
|
||||
return pd.read_excel(bio)
|
||||
for enc in ("utf-8", "utf-8-sig", "latin-1"):
|
||||
try:
|
||||
bio.seek(0)
|
||||
sep = "\t" if suffix == ".tsv" else ","
|
||||
return pd.read_csv(bio, encoding=enc, sep=sep, on_bad_lines="warn")
|
||||
except UnicodeDecodeError:
|
||||
continue
|
||||
bio.seek(0)
|
||||
return pd.read_csv(bio, encoding="latin-1")
|
||||
|
||||
|
||||
try:
|
||||
df = _read_uploaded(uploaded.name, uploaded.getvalue())
|
||||
except Exception as e:
|
||||
from src.core.errors import format_for_user
|
||||
st.error(
|
||||
f"**Could not read `{uploaded.name}`**\n\n"
|
||||
f"```\n{format_for_user(e)}\n```"
|
||||
)
|
||||
st.stop()
|
||||
|
||||
st.subheader(f"Preview: {uploaded.name}")
|
||||
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
|
||||
st.dataframe(df.head(10), use_container_width=True)
|
||||
|
||||
st.divider()
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# File upload (functional)
|
||||
# Initial profile (read-only)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
uploaded = st.file_uploader(
|
||||
"Upload CSV or Excel file",
|
||||
type=["csv", "tsv", "xlsx", "xls"],
|
||||
help="Upload a file to preview. Processing is not yet available.",
|
||||
key="missing_file_upload",
|
||||
)
|
||||
st.subheader("Missingness profile")
|
||||
|
||||
if uploaded is not None:
|
||||
import pandas as pd
|
||||
try:
|
||||
if uploaded.name.endswith((".xlsx", ".xls")):
|
||||
df = pd.read_excel(uploaded)
|
||||
else:
|
||||
df = pd.read_csv(uploaded)
|
||||
st.subheader(f"Preview: {uploaded.name}")
|
||||
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
|
||||
st.dataframe(df.head(10), use_container_width=True)
|
||||
except Exception as e:
|
||||
from src.core.errors import format_for_user
|
||||
st.error(
|
||||
f"**Could not read `{uploaded.name}`**\n\n"
|
||||
f"```\n{format_for_user(e)}\n```"
|
||||
initial_profile = profile_missing(df, MissingOptions())
prof_df = initial_profile.to_dataframe()

m1, m2, m3, m4 = st.columns(4)
m1.metric("Rows", initial_profile.rows_total)
m2.metric("Cells missing", initial_profile.cells_missing)
m3.metric("% cells missing", f"{initial_profile.cells_missing_pct:.1f}%")
m4.metric("Complete rows", initial_profile.rows_complete)

st.dataframe(prof_df, use_container_width=True, hide_index=True)

if initial_profile.cells_missing == 0:
    st.success("No missing values or disguised nulls detected. Nothing to handle.")

st.divider()
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Options
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.subheader("Strategy")

preset_label = st.radio(
    "Preset",
    [
        "detect-only (standardize sentinels to NaN, no fill or drop)",
        "safe-fill (numeric → median, categorical → mode)",
        "drop-incomplete (drop any row with missing)",
    ],
    index=0,
    help=(
        "detect-only: replace 'N/A', '-', 'NULL', etc. with real NaN, then stop. "
        "safe-fill: also fill — numeric columns with median, others with mode. "
        "drop-incomplete: also drop every row that has any missing cell."
    ),
)
preset_key = preset_label.split(" ", 1)[0]
options = MissingOptions.from_preset(preset_key)
|
||||
|
||||
with st.expander("Advanced options"):
|
||||
col_a, col_b = st.columns(2)
|
||||
|
||||
with col_a:
|
||||
st.markdown("**Detection**")
|
||||
options.standardize_sentinels = st.checkbox(
|
||||
"Standardize disguised nulls to NaN",
|
||||
value=options.standardize_sentinels,
|
||||
help="Replace 'N/A', '-', 'NULL', whitespace-only cells, etc. with real NaN.",
|
||||
)
|
||||
sentinels_text = st.text_input(
|
||||
"Sentinel values (comma-separated)",
|
||||
value=", ".join(options.sentinels),
|
||||
disabled=not options.standardize_sentinels,
|
||||
help="Matched case-insensitively after stripping whitespace.",
|
||||
)
|
||||
options.sentinels = [
|
||||
s.strip() for s in sentinels_text.split(",") if s.strip()
|
||||
]
|
||||
|
||||
with col_b:
|
||||
st.markdown("**Strategy override**")
|
||||
strat_options = [
|
||||
"(use preset)",
|
||||
"none", "drop_row", "drop_col", "drop_both",
|
||||
"mean", "median", "mode", "constant",
|
||||
"ffill", "bfill", "interpolate",
|
||||
]
|
||||
strat_choice = st.selectbox(
|
||||
"Global strategy",
|
||||
strat_options,
|
||||
index=0,
|
||||
help=(
|
||||
"drop_row / drop_col use the thresholds below. "
|
||||
"mean / median / interpolate are numeric only — non-numeric "
|
||||
"columns fall back to the categorical strategy."
|
||||
),
|
||||
)
|
||||
if strat_choice != "(use preset)":
|
||||
options.strategy = strat_choice # type: ignore[assignment]
|
||||
|
||||
cat_strat = st.selectbox(
|
||||
"Categorical fallback (for non-numeric columns)",
|
||||
["mode", "constant", "ffill", "bfill", "none"],
|
||||
index=0,
|
||||
)
|
||||
options.categorical_strategy = cat_strat # type: ignore[assignment]
|
||||
|
||||
if options.strategy == "constant" or cat_strat == "constant":
|
||||
fill_val = st.text_input(
|
||||
"Constant fill value",
|
||||
value="",
|
||||
help="Used when strategy = constant. Leave blank to fill with empty string.",
|
||||
)
|
||||
options.fill_value = fill_val
|
||||
|
||||
st.markdown("**Drop thresholds**")
|
||||
col_c, col_d = st.columns(2)
|
||||
with col_c:
|
||||
options.row_drop_threshold = st.slider(
|
||||
"Row drop threshold (drop rows with ≥ this fraction missing across selected cols)",
|
||||
0.0, 1.0, options.row_drop_threshold, 0.05,
|
||||
)
|
||||
with col_d:
|
||||
options.col_drop_threshold = st.slider(
|
||||
"Column drop threshold (drop columns with ≥ this fraction missing)",
|
||||
0.0, 1.0, options.col_drop_threshold, 0.05,
|
||||
)
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Placeholder options
|
||||
# ---------------------------------------------------------------------------
|
||||
st.markdown("**Scope**")
|
||||
selected_cols = st.multiselect(
|
||||
"Columns to handle (default: all)",
|
||||
options=list(df.columns),
|
||||
default=list(df.columns),
|
||||
)
|
||||
skip_cols = st.multiselect(
|
||||
"Columns to skip",
|
||||
options=list(df.columns),
|
||||
default=[],
|
||||
)
|
||||
options.columns = selected_cols if selected_cols else None
|
||||
options.skip_columns = list(skip_cols)
|
||||
|
||||
st.subheader("Detection Settings")
|
||||
|
||||
st.text_input(
|
||||
"Null patterns (comma-separated)",
|
||||
value="N/A, n/a, NA, -, NULL, None, empty, .",
|
||||
disabled=True,
|
||||
help="Values to treat as missing.",
|
||||
)
|
||||
|
||||
st.subheader("Handling Strategy")
|
||||
|
||||
st.selectbox("Strategy", [
|
||||
"Drop rows with any missing",
|
||||
"Drop rows above threshold",
|
||||
"Fill with mean (numeric)",
|
||||
"Fill with median (numeric)",
|
||||
"Fill with mode (categorical)",
|
||||
"Forward-fill",
|
||||
"Backward-fill",
|
||||
"Custom value",
|
||||
], disabled=True)
|
||||
|
||||
st.slider("Drop threshold (%)", 0, 100, 50, disabled=True, help="Drop rows missing more than this % of columns.")
|
||||
|
||||
st.divider()
|
||||
st.button("Handle Missing Values", type="primary", use_container_width=True, disabled=True)
|
||||
st.markdown("**Per-column strategy overrides** (optional)")
|
||||
st.caption(
|
||||
"Set a different strategy for specific columns. Leave any row blank to "
|
||||
"use the global strategy."
|
||||
)
|
||||
per_col_overrides: dict[str, str] = {}
|
||||
only_missing_cols = [
|
||||
r.column for r in initial_profile.columns if r.has_missing
|
||||
]
|
||||
if only_missing_cols:
|
||||
edit_df = pd.DataFrame({
|
||||
"column": only_missing_cols,
|
||||
"strategy": ["" for _ in only_missing_cols],
|
||||
})
|
||||
edited = st.data_editor(
|
||||
edit_df,
|
||||
use_container_width=True,
|
||||
hide_index=True,
|
||||
column_config={
|
||||
"column": st.column_config.TextColumn("Column", disabled=True),
|
||||
"strategy": st.column_config.SelectboxColumn(
|
||||
"Override",
|
||||
options=[
|
||||
"", "drop_row", "drop_col",
|
||||
"mean", "median", "mode", "constant",
|
||||
"ffill", "bfill", "interpolate",
|
||||
],
|
||||
),
|
||||
},
|
||||
key="missing_per_col_editor",
|
||||
)
|
||||
for _, row in edited.iterrows():
|
||||
if row["strategy"]:
|
||||
per_col_overrides[row["column"]] = row["strategy"]
|
||||
options.column_strategies = per_col_overrides # type: ignore[assignment]
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Footer
|
||||
# Run
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.divider()
|
||||
st.caption(
|
||||
"Runs locally. Your data never leaves this computer. "
|
||||
"| DataTools v3.0"
|
||||
)
|
||||
|
||||
if st.button("Handle Missing Values", type="primary", use_container_width=True):
    with st.spinner("Handling..."):
        try:
            result = handle_missing(df, options)
        except (ValueError, OSError) as e:
            from src.core.errors import format_for_user
            st.error(format_for_user(e))
            st.stop()
    st.session_state["missing_result"] = result
    st.session_state["missing_input_name"] = uploaded.name
    st.session_state["missing_options"] = options.to_dict()

result = st.session_state.get("missing_result")
if result is None:
    st.info("Choose a strategy and click **Handle Missing Values** to run.")
    st.stop()
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Results
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.subheader("Results")

m1, m2, m3, m4 = st.columns(4)
m1.metric("Sentinels → NaN", result.sentinels_standardized)
m2.metric("Cells filled", result.cells_filled)
m3.metric("Rows dropped", result.rows_dropped)
m4.metric("Columns dropped", len(result.columns_dropped))

if result.columns_dropped:
    st.warning(f"Dropped columns: {', '.join(result.columns_dropped)}")

st.markdown("**Missingness — before vs. after**")
before = result.profile_before.to_dataframe().set_index("column")[
    ["missing", "missing_pct"]
].rename(columns={"missing": "before_missing", "missing_pct": "before_pct"})
after = result.profile_after.to_dataframe().set_index("column")[
    ["missing", "missing_pct"]
].rename(columns={"missing": "after_missing", "missing_pct": "after_pct"})
combined = before.join(after, how="outer").fillna(0)
st.dataframe(combined, use_container_width=True)

if result.strategy_per_column:
    st.markdown("**Strategy applied per column**")
    strat_df = pd.DataFrame(
        [{"column": c, "strategy": s} for c, s in result.strategy_per_column.items()]
    )
    st.dataframe(strat_df, use_container_width=True, hide_index=True)

if not result.changes.empty:
    st.markdown("**Audit (first 50 changes)**")
    audit_view = result.changes.head(50).copy()
    audit_view["row"] = audit_view["row"].apply(lambda x: "—" if x == -1 else x + 1)
    st.dataframe(audit_view, use_container_width=True, hide_index=True)
    if len(result.changes) > 50:
        st.caption(f"… and {len(result.changes) - 50} more (download the full audit below).")

st.markdown("**Handled preview (first 10 rows)**")
st.dataframe(result.handled_df.head(10), use_container_width=True)
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Downloads
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.divider()
stem = Path(st.session_state.get("missing_input_name", "input")).stem

dl_a, dl_b, dl_c = st.columns(3)
with dl_a:
    handled_bytes = result.handled_df.to_csv(index=False).encode("utf-8-sig")
    st.download_button(
        "Download handled CSV",
        data=handled_bytes,
        file_name=f"{stem}_missing.csv",
        mime="text/csv",
    )
with dl_b:
    if not result.changes.empty:
        changes_bytes = result.changes.to_csv(index=False).encode("utf-8-sig")
        st.download_button(
            "Download changes audit",
            data=changes_bytes,
            file_name=f"{stem}_missing_changes.csv",
            mime="text/csv",
        )
with dl_c:
    config_bytes = json.dumps(
        st.session_state.get("missing_options", {}), indent=2, default=str,
    ).encode("utf-8")
    st.download_button(
        "Download config JSON",
        data=config_bytes,
        file_name="missing_config.json",
        mime="application/json",
    )

st.divider()
st.caption("Runs locally. Your data never leaves this computer. | DataTools v3.0")
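# Illustrative sketch, not part of this commit. The page's help text says
# sentinel values are "matched case-insensitively after stripping whitespace"
# and that whitespace-only cells count as disguised nulls. The real matching
# lives in src/core/missing.py, which this diff does not show, so the helper
# name `is_disguised_null` and its exact rules are assumptions.

```python
def is_disguised_null(value, sentinels):
    """Return True if `value` should be treated as missing (hypothetical helper)."""
    if value is None:
        return True
    if not isinstance(value, str):
        # Numbers, dates, etc. are never sentinel strings in this sketch.
        return False
    text = value.strip().casefold()
    # Whitespace-only cells collapse to "" after stripping.
    if text == "":
        return True
    wanted = {s.strip().casefold() for s in sentinels}
    return text in wanted


# Example with a few of the sentinels named in the help text.
SENTINELS = ["N/A", "-", "NULL"]
flags = [is_disguised_null(v, SENTINELS) for v in [" n/a ", "null", "-", "   ", "Alice", 0]]
```

# Under these assumptions only "Alice" and 0 survive as real values; a
# vectorised pandas version would apply the same normalisation per column.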
|
||||
|
||||
@@ -1,102 +1,413 @@
|
||||
"""DataTools Column Mapper — stub page."""
|
||||
"""DataTools Column Mapper — Streamlit page."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import io
|
||||
import json
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import pandas as pd
|
||||
import streamlit as st
|
||||
|
||||
_project_root = Path(__file__).resolve().parent.parent.parent.parent
|
||||
if str(_project_root) not in sys.path:
|
||||
sys.path.insert(0, str(_project_root))
|
||||
|
||||
from src.gui.components import hide_streamlit_chrome, require_normalization_gate
|
||||
from src.gui.components import (
|
||||
hide_streamlit_chrome,
|
||||
pickup_or_upload,
|
||||
require_normalization_gate,
|
||||
)
|
||||
from src.core.column_mapper import (
|
||||
MapOptions,
|
||||
PRESETS,
|
||||
TargetField,
|
||||
TargetSchema,
|
||||
infer_mapping,
|
||||
map_columns,
|
||||
)
|
||||
|
||||
hide_streamlit_chrome()
|
||||
require_normalization_gate()
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Header
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.title("🗂️ Column Mapper")
|
||||
st.caption("Rename columns, enforce a target schema, and coerce types.")
|
||||
st.caption(
|
||||
"Rename columns, enforce a target schema, and coerce types. Runs locally — "
|
||||
"your data never leaves this computer."
|
||||
)
|
||||
|
||||
st.info("This tool is under development.")
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# What this tool will do
|
||||
# File upload
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.markdown("""
|
||||
**Features:**
|
||||
- Rename columns via interactive mapping table
|
||||
- Load a target schema (JSON/CSV) to auto-map columns
|
||||
- Fuzzy column name matching for automatic suggestions
|
||||
- Type coercion (string → int, string → date, etc.)
|
||||
- Drop unmapped columns or keep as-is
|
||||
- Reorder columns to match target schema
|
||||
""")
|
||||
uploaded = pickup_or_upload(
|
||||
label="Upload CSV or Excel file",
|
||||
key="colmap_file_upload",
|
||||
types=["csv", "tsv", "xlsx", "xls"],
|
||||
)
|
||||
|
||||
if uploaded is None:
|
||||
st.info("Upload a CSV, TSV, or Excel file to begin.")
|
||||
st.stop()
|
||||
|
||||
|
||||
@st.cache_data(show_spinner=False)
def _read_uploaded(name: str, data: bytes) -> pd.DataFrame:
    suffix = Path(name).suffix.lower()
    bio = io.BytesIO(data)
    if suffix in (".xlsx", ".xls"):
        return pd.read_excel(bio)
    for enc in ("utf-8", "utf-8-sig", "latin-1"):
        try:
            bio.seek(0)
            sep = "\t" if suffix == ".tsv" else ","
            return pd.read_csv(bio, encoding=enc, sep=sep, on_bad_lines="warn")
        except UnicodeDecodeError:
            continue
    bio.seek(0)
    return pd.read_csv(bio, encoding="latin-1")


try:
    df = _read_uploaded(uploaded.name, uploaded.getvalue())
except Exception as e:
    from src.core.errors import format_for_user
    st.error(
        f"**Could not read `{uploaded.name}`**\n\n"
        f"```\n{format_for_user(e)}\n```"
    )
    st.stop()

st.subheader(f"Preview: {uploaded.name}")
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
st.dataframe(df.head(10), use_container_width=True)
st.divider()
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Schema input
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.subheader("Target schema")

schema_mode = st.radio(
    "How would you like to define the target schema?",
    [
        "Build interactively (start from current columns)",
        "Upload schema JSON",
        "Skip (rename / coerce only — no schema)",
    ],
    index=0,
    help=(
        "An interactive build is fastest for one-off cleanup. Upload a JSON "
        "when you have a fixed contract (a CRM import format, db schema). "
        "Skip when you only want to rename or coerce specific columns."
    ),
)

schema: TargetSchema | None = None

if schema_mode.startswith("Upload"):
    schema_file = st.file_uploader(
        "Schema JSON",
        type=["json"],
        key="colmap_schema_upload",
        help='Format: {"fields": [{"name": "email", "dtype": "string", "required": true, "aliases": ["EmailAddr"]}, ...]}',
    )
    if schema_file is not None:
        try:
            schema = TargetSchema.from_dict(json.loads(schema_file.getvalue()))
            st.success(f"Loaded {len(schema.fields)} target field(s).")
        except Exception as e:
            from src.core.errors import format_for_user
            st.error(f"**Could not parse schema**\n\n```\n{format_for_user(e)}\n```")

elif schema_mode.startswith("Build"):
    st.caption(
        "Edit the table to define your target schema. Add rows for fields the "
        "input doesn't have yet (with a default), or remove rows for columns "
        "you want to drop."
    )
    initial = pd.DataFrame({
        "name": list(df.columns),
        "dtype": ["auto"] * len(df.columns),
        "required": [False] * len(df.columns),
        "default": [""] * len(df.columns),
        "aliases": [""] * len(df.columns),
    })
    edited = st.data_editor(
        initial,
        use_container_width=True,
        num_rows="dynamic",
        column_config={
            "name": st.column_config.TextColumn("Target name"),
            "dtype": st.column_config.SelectboxColumn(
                "Type",
                options=[
                    "auto", "string", "integer", "float",
                    "boolean", "date", "datetime", "category",
                ],
            ),
            "required": st.column_config.CheckboxColumn("Required"),
            "default": st.column_config.TextColumn("Default (for added cols)"),
            "aliases": st.column_config.TextColumn(
                "Aliases (comma-sep, helps fuzzy-match)",
            ),
        },
        key="colmap_schema_editor",
    )
    fields: list[TargetField] = []
    for _, row in edited.iterrows():
        name = str(row.get("name", "")).strip()
        if not name:
            continue
        aliases = [
            a.strip() for a in str(row.get("aliases", "") or "").split(",")
            if a.strip()
        ]
        default_raw = row.get("default")
        default_val = (
            default_raw if (default_raw not in (None, "", float("nan")))
            else None
        )
        try:
            if isinstance(default_val, float) and pd.isna(default_val):
                default_val = None
        except TypeError:
            pass
        fields.append(TargetField(
            name=name,
            dtype=str(row.get("dtype", "auto")),  # type: ignore[arg-type]
            required=bool(row.get("required", False)),
            aliases=aliases,
            default=default_val,
        ))
    if fields:
        schema = TargetSchema(fields=fields)
|
||||
|
||||
st.divider()
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# File upload (functional)
|
||||
# Strategy
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
uploaded = st.file_uploader(
|
||||
"Upload CSV or Excel file",
|
||||
type=["csv", "tsv", "xlsx", "xls"],
|
||||
help="Upload a file to preview. Processing is not yet available.",
|
||||
key="colmap_file_upload",
|
||||
st.subheader("Strategy")

preset_label = st.radio(
    "Preset",
    [
        "rename-only (just rename, leave types alone, keep extras)",
        "lenient-schema (rename + coerce + reorder, keep extras)",
        "strict-schema (rename + coerce + reorder, drop extras)",
    ],
    index=0,
)
preset_key = preset_label.split(" ", 1)[0]
options = MapOptions.from_preset(preset_key)
options.schema = schema
|
||||
|
||||
if uploaded is not None:
|
||||
import pandas as pd
|
||||
try:
|
||||
if uploaded.name.endswith((".xlsx", ".xls")):
|
||||
df = pd.read_excel(uploaded)
|
||||
else:
|
||||
df = pd.read_csv(uploaded)
|
||||
st.subheader(f"Preview: {uploaded.name}")
|
||||
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
|
||||
st.dataframe(df.head(10), use_container_width=True)
|
||||
|
||||
st.subheader("Column Mapping")
|
||||
st.caption("Map source columns to target names. (Interactive mapping coming soon.)")
|
||||
mapping_data = pd.DataFrame({
|
||||
"Source Column": df.columns.tolist(),
|
||||
"Target Column": df.columns.tolist(),
|
||||
"Type": ["auto"] * len(df.columns),
|
||||
})
|
||||
st.dataframe(mapping_data, use_container_width=True, hide_index=True)
|
||||
except Exception as e:
|
||||
from src.core.errors import format_for_user
|
||||
st.error(
|
||||
f"**Could not read `{uploaded.name}`**\n\n"
|
||||
f"```\n{format_for_user(e)}\n```"
|
||||
with st.expander("Advanced options"):
    col_a, col_b = st.columns(2)
    with col_a:
        options.unmapped = st.selectbox(  # type: ignore[assignment]
            "Unmapped source columns",
            ["keep", "drop", "error"],
            index=["keep", "drop", "error"].index(options.unmapped),
        )
        options.coerce_types = st.checkbox(
            "Coerce types per schema", value=options.coerce_types,
        )
        options.reorder_to_schema = st.checkbox(
            "Reorder to schema order", value=options.reorder_to_schema,
        )
    with col_b:
        options.auto_infer = st.checkbox(
            "Auto-infer mapping (fuzzy match)", value=options.auto_infer,
        )
        options.fuzzy_threshold = st.slider(
            "Fuzzy match threshold", 0.0, 1.0, options.fuzzy_threshold, 0.05,
        )
        options.enforce_required = st.checkbox(
            "Enforce required fields", value=options.enforce_required,
        )
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Placeholder options
|
||||
# Mapping editor — show inferred and let user override
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.subheader("Schema Options")
|
||||
st.subheader("Mapping")
|
||||
|
||||
st.file_uploader("Load target schema (JSON)", type=["json"], disabled=True, key="colmap_schema")
|
||||
st.checkbox("Drop unmapped columns", value=False, disabled=True)
|
||||
st.checkbox("Reorder to match schema", value=True, disabled=True)
|
||||
|
||||
st.divider()
|
||||
st.button("Apply Column Mapping", type="primary", use_container_width=True, disabled=True)
|
||||
if schema is None:
    st.caption(
        "No schema — define explicit renames below (left blank means keep "
        "the source name)."
    )
    rename_initial = pd.DataFrame({
        "source": list(df.columns),
        "target": list(df.columns),
    })
    rename_edited = st.data_editor(
        rename_initial,
        use_container_width=True,
        column_config={
            "source": st.column_config.TextColumn("Source", disabled=True),
            "target": st.column_config.TextColumn("Target"),
        },
        hide_index=True,
        key="colmap_rename_only_editor",
    )
    explicit_mapping: dict[str, str] = {}
    for _, row in rename_edited.iterrows():
        src = str(row["source"])
        tgt = str(row["target"]).strip()
        if tgt and tgt != src:
            explicit_mapping[src] = tgt
    options.mapping = explicit_mapping
else:
    inferred = (
        infer_mapping(df, schema, threshold=options.fuzzy_threshold)
        if options.auto_infer else {}
    )
    target_options = ["(unmapped)"] + schema.field_names()
    map_initial = pd.DataFrame({
        "source": list(df.columns),
        "target": [inferred.get(c, "(unmapped)") for c in df.columns],
        "auto": [c in inferred for c in df.columns],
    })
    map_edited = st.data_editor(
        map_initial,
        use_container_width=True,
        column_config={
            "source": st.column_config.TextColumn("Source", disabled=True),
            "target": st.column_config.SelectboxColumn(
                "Target", options=target_options,
            ),
            "auto": st.column_config.CheckboxColumn("Auto-suggested", disabled=True),
        },
        hide_index=True,
        key="colmap_schema_mapping_editor",
    )
    explicit_mapping = {}
    for _, row in map_edited.iterrows():
        src = str(row["source"])
        tgt = str(row["target"])
        if tgt and tgt != "(unmapped)":
            explicit_mapping[src] = tgt
    options.mapping = explicit_mapping
    # Disable auto-infer for the actual run since the editor already shows
    # the user's resolved choices (they can manually re-select to add).
    options.auto_infer = False
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Footer
|
||||
# Run
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.divider()
|
||||
st.caption(
|
||||
"Runs locally. Your data never leaves this computer. "
|
||||
"| DataTools v3.0"
|
||||
|
||||
if st.button("Apply Column Mapping", type="primary", use_container_width=True):
    with st.spinner("Mapping..."):
        try:
            result = map_columns(df, options)
        except (ValueError, OSError) as e:
            from src.core.errors import format_for_user
            st.error(format_for_user(e))
            st.stop()
    st.session_state["colmap_result"] = result
    st.session_state["colmap_input_name"] = uploaded.name
    st.session_state["colmap_options"] = options.to_dict()

result = st.session_state.get("colmap_result")
if result is None:
    st.info("Configure a mapping and click **Apply Column Mapping** to run.")
    st.stop()
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Results
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.subheader("Results")

m1, m2, m3, m4 = st.columns(4)
m1.metric("Renamed", result.columns_renamed)
m2.metric("Dropped", len(result.columns_dropped))
m3.metric("Added", len(result.columns_added))
m4.metric(
    "Coerce fails",
    sum(result.coercion_failures.values()) if result.coercion_failures else 0,
)

if result.columns_dropped:
    st.warning(f"Dropped columns: {', '.join(result.columns_dropped)}")
if result.columns_added:
    st.info(f"Added (with defaults): {', '.join(result.columns_added)}")
if result.coercion_failures:
    st.warning(
        "Some cells could not be coerced and were left as NaN: "
        + ", ".join(f"{c} ({n})" for c, n in result.coercion_failures.items())
    )

if result.mapping:
    st.markdown("**Resolved mapping**")
    map_df = pd.DataFrame(
        [
            {"source": s, "target": t, "auto": s in result.inferred_pairs}
            for s, t in result.mapping.items()
        ],
    )
    st.dataframe(map_df, use_container_width=True, hide_index=True)

st.markdown("**Mapped preview (first 10 rows)**")
st.dataframe(result.mapped_df.head(10), use_container_width=True)
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Downloads
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.divider()
stem = Path(st.session_state.get("colmap_input_name", "input")).stem

dl_a, dl_b, dl_c = st.columns(3)
with dl_a:
    mapped_bytes = result.mapped_df.to_csv(index=False).encode("utf-8-sig")
    st.download_button(
        "Download mapped CSV",
        data=mapped_bytes,
        file_name=f"{stem}_mapped.csv",
        mime="text/csv",
    )
with dl_b:
    audit_bytes = json.dumps({
        "mapping": result.mapping,
        "inferred_pairs": result.inferred_pairs,
        "columns_renamed": result.columns_renamed,
        "columns_dropped": result.columns_dropped,
        "columns_added": result.columns_added,
        "coercion_failures": result.coercion_failures,
        "unmapped_kept": result.unmapped_kept,
        "missing_required_targets": result.missing_required_targets,
    }, indent=2, default=str).encode("utf-8")
    st.download_button(
        "Download mapping audit",
        data=audit_bytes,
        file_name=f"{stem}_mapping.json",
        mime="application/json",
    )
with dl_c:
    config_bytes = json.dumps(
        st.session_state.get("colmap_options", {}), indent=2, default=str,
    ).encode("utf-8")
    st.download_button(
        "Download config JSON",
        data=config_bytes,
        file_name="column_map_config.json",
        mime="application/json",
    )

st.divider()
st.caption("Runs locally. Your data never leaves this computer. | DataTools v3.0")
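# Illustrative sketch, not part of this commit. The Column Mapper page loads a
# schema shaped like {"fields": [{"name": ..., "aliases": [...]}]} and passes a
# "fuzzy match threshold" to infer_mapping(). The real matcher lives in
# src/core/column_mapper.py and is not shown in this diff, so the scoring below
# (difflib ratio on normalised names), the helper names, and the 0.8 default
# are all assumptions chosen only to show how such a threshold could behave.

```python
import json
from difflib import SequenceMatcher


def _normalise(name):
    # Compare names case-insensitively, ignoring spaces and punctuation.
    return "".join(ch for ch in name.casefold() if ch.isalnum())


def infer_mapping_sketch(source_columns, schema, threshold=0.8):
    """Map each source column to its best-scoring target field (hypothetical)."""
    mapping = {}
    for src in source_columns:
        best_name, best_score = None, 0.0
        for field in schema["fields"]:
            # A field can be matched by its name or by any declared alias.
            for candidate in [field["name"], *field.get("aliases", [])]:
                score = SequenceMatcher(
                    None, _normalise(src), _normalise(candidate)
                ).ratio()
                if score > best_score:
                    best_name, best_score = field["name"], score
        # Columns scoring below the threshold stay unmapped.
        if best_name is not None and best_score >= threshold:
            mapping[src] = best_name
    return mapping


schema = json.loads(
    '{"fields": [{"name": "email", "dtype": "string", '
    '"required": true, "aliases": ["EmailAddr"]}]}'
)
example = infer_mapping_sketch(["Email Addr", "phone"], schema)
```

# Here "Email Addr" matches the alias "EmailAddr" exactly after normalisation,
# while "phone" scores far below the threshold and stays unmapped. This sketch
# does not resolve two source columns competing for the same target.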
|
||||
|
||||
@@ -1,104 +1,370 @@
|
||||
"""DataTools Pipeline Runner — stub page."""
|
||||
"""DataTools Pipeline Runner — Streamlit page."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import io
|
||||
import json
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import pandas as pd
|
||||
import streamlit as st
|
||||
|
||||
_project_root = Path(__file__).resolve().parent.parent.parent.parent
|
||||
if str(_project_root) not in sys.path:
|
||||
sys.path.insert(0, str(_project_root))
|
||||
|
||||
from src.gui.components import hide_streamlit_chrome, require_normalization_gate
|
||||
from src.gui.components import (
|
||||
hide_streamlit_chrome,
|
||||
pickup_or_upload,
|
||||
require_normalization_gate,
|
||||
)
|
||||
from src.core.pipeline import (
|
||||
Pipeline,
|
||||
SOFT_DEPENDENCIES,
|
||||
Step,
|
||||
TOOL_NAMES,
|
||||
recommended_pipeline,
|
||||
run_pipeline,
|
||||
validate_pipeline,
|
||||
)
|
||||
|
||||
hide_streamlit_chrome()
|
||||
require_normalization_gate()
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Header
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.title("⚙️ Pipeline Runner")
|
||||
st.caption("Chain tools in sequence and pass output between steps automatically.")
|
||||
|
||||
st.info("This tool is under development.")
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# What this tool will do
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.markdown("""
|
||||
**Features:**
|
||||
- Select tools to run in sequence
|
||||
- Recommended order: Text Cleaner → Format Standardizer → Missing Values → Deduplicator → Validator
|
||||
- Each step's output feeds into the next step's input
|
||||
- Per-step configuration overrides
|
||||
- Progress tracking across all steps
|
||||
- Final combined report
|
||||
""")
|
||||
|
||||
st.divider()
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# File upload (functional)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
uploaded = st.file_uploader(
|
||||
"Upload CSV or Excel file",
|
||||
type=["csv", "tsv", "xlsx", "xls"],
|
||||
help="Upload a file to preview. Processing is not yet available.",
|
||||
key="pipeline_file_upload",
|
||||
st.caption(
|
||||
"Chain DataTools cleaning steps into one repeatable workflow. The "
|
||||
"pipeline recommends an order; you stay in control."
|
||||
)
|
||||
|
||||
if uploaded is not None:
|
||||
import pandas as pd
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# File upload
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
uploaded = pickup_or_upload(
    label="Upload CSV or Excel file",
    key="pipeline_file_upload",
    types=["csv", "tsv", "xlsx", "xls"],
)

if uploaded is None:
    st.info("Upload a CSV, TSV, or Excel file to begin.")
    st.stop()


@st.cache_data(show_spinner=False)
def _read_uploaded(name: str, data: bytes) -> pd.DataFrame:
    suffix = Path(name).suffix.lower()
    bio = io.BytesIO(data)
    if suffix in (".xlsx", ".xls"):
        return pd.read_excel(bio)
    for enc in ("utf-8", "utf-8-sig", "latin-1"):
        try:
            bio.seek(0)
            sep = "\t" if suffix == ".tsv" else ","
            return pd.read_csv(bio, encoding=enc, sep=sep, on_bad_lines="warn")
        except UnicodeDecodeError:
            continue
    bio.seek(0)
    return pd.read_csv(bio, encoding="latin-1")


try:
    df = _read_uploaded(uploaded.name, uploaded.getvalue())
except Exception as e:
    from src.core.errors import format_for_user
    st.error(
        f"**Could not read `{uploaded.name}`**\n\n"
        f"```\n{format_for_user(e)}\n```"
    )
    st.stop()

st.subheader(f"Preview: {uploaded.name}")
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
st.dataframe(df.head(10), use_container_width=True)
st.divider()
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Pipeline builder
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.subheader("Pipeline")
|
||||
|
||||
mode = st.radio(
|
||||
"How would you like to define the pipeline?",
|
||||
[
|
||||
"Use the recommended default (text-clean → format → missing → dedup)",
|
||||
"Build interactively",
|
||||
"Upload a saved pipeline JSON",
|
||||
],
|
||||
index=0,
|
||||
)
|
||||
|
||||
if "pipeline_rows" not in st.session_state:
|
||||
default = recommended_pipeline()
|
||||
st.session_state["pipeline_rows"] = pd.DataFrame([
|
||||
{
|
||||
"tool": s.tool, "enabled": s.enabled,
|
||||
"options_json": json.dumps(s.options),
|
||||
}
|
||||
for s in default.steps
|
||||
])
|
||||
|
||||
if mode.startswith("Use the recommended"):
|
||||
default = recommended_pipeline()
|
||||
st.session_state["pipeline_rows"] = pd.DataFrame([
|
||||
{
|
||||
"tool": s.tool, "enabled": s.enabled,
|
||||
"options_json": json.dumps(s.options),
|
||||
}
|
||||
for s in default.steps
|
||||
])
|
||||
elif mode.startswith("Upload"):
|
||||
pipeline_file = st.file_uploader(
|
||||
"Pipeline JSON", type=["json"], key="pipeline_upload",
|
||||
)
|
||||
if pipeline_file is not None:
|
||||
try:
|
||||
data = json.loads(pipeline_file.getvalue())
|
||||
uploaded_pipe = Pipeline.from_dict(data)
|
||||
st.session_state["pipeline_rows"] = pd.DataFrame([
|
||||
{
|
||||
"tool": s.tool, "enabled": s.enabled,
|
||||
"options_json": json.dumps(s.options),
|
||||
}
|
||||
for s in uploaded_pipe.steps
|
||||
])
|
||||
st.success(f"Loaded {len(uploaded_pipe.steps)} step(s).")
|
||||
except Exception as e:
|
||||
from src.core.errors import format_for_user
|
||||
st.error(f"**Could not parse pipeline**\n\n```\n{format_for_user(e)}\n```")
|
||||
|
||||
st.caption(
|
||||
"Edit the table to add, remove, reorder (drag the row index), enable, "
|
||||
"or configure each step. Tool order is recommended, not enforced — "
|
||||
"violations surface as warnings below the table."
|
||||
)
|
||||
edited = st.data_editor(
|
||||
st.session_state["pipeline_rows"],
|
||||
use_container_width=True,
|
||||
num_rows="dynamic",
|
||||
column_config={
|
||||
"tool": st.column_config.SelectboxColumn(
|
||||
"Tool", options=TOOL_NAMES, required=True,
|
||||
),
|
||||
"enabled": st.column_config.CheckboxColumn("Enabled"),
|
||||
"options_json": st.column_config.TextColumn(
|
||||
"Options (JSON)",
|
||||
help='e.g. {"column_types": {"phone": "phone"}}',
|
||||
),
|
||||
},
|
||||
key="pipeline_editor",
|
||||
)
|
||||
st.session_state["pipeline_rows"] = edited
|
||||
|
||||
# Build a Pipeline object from the editor state.
|
||||
steps_list: list[Step] = []
|
||||
parse_errors: list[str] = []
|
||||
for i, row in edited.iterrows():
|
||||
tool = row.get("tool")
|
||||
if not tool or pd.isna(tool):
|
||||
continue
|
||||
raw_opts = row.get("options_json") or "{}"
|
||||
if pd.isna(raw_opts):
|
||||
raw_opts = "{}"
|
||||
try:
|
||||
if uploaded.name.endswith((".xlsx", ".xls")):
|
||||
df = pd.read_excel(uploaded)
|
||||
else:
|
||||
df = pd.read_csv(uploaded)
|
||||
st.subheader(f"Preview: {uploaded.name}")
|
||||
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
|
||||
st.dataframe(df.head(10), use_container_width=True)
|
||||
opts = json.loads(raw_opts) if isinstance(raw_opts, str) else dict(raw_opts)
|
||||
if not isinstance(opts, dict):
|
||||
raise ValueError("options must be a JSON object")
|
||||
except Exception as e:
|
||||
from src.core.errors import format_for_user
|
||||
st.error(
|
||||
f"**Could not read `{uploaded.name}`**\n\n"
|
||||
f"```\n{format_for_user(e)}\n```"
|
||||
parse_errors.append(f"Step {i + 1}: {e}")
|
||||
continue
|
||||
try:
|
||||
steps_list.append(Step(
|
||||
tool=str(tool),
|
||||
options=opts,
|
||||
enabled=bool(row.get("enabled", True)),
|
||||
))
|
||||
except Exception as e:
|
||||
parse_errors.append(f"Step {i + 1}: {e}")
|
||||
|
||||
if parse_errors:
|
||||
for err in parse_errors:
|
||||
st.error(err)
|
||||
|
||||
current_pipeline = Pipeline(steps=steps_list) if steps_list else None
|
||||
|
||||
if current_pipeline is not None:
|
||||
warnings = validate_pipeline(current_pipeline)
|
||||
if warnings:
|
||||
st.warning(
|
||||
"Pipeline is out of recommended order:\n\n"
|
||||
+ "\n".join(f"- {w}" for w in warnings)
|
||||
+ "\n\nThe pipeline will still run — these are recommendations only."
|
||||
)
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Pipeline steps (checklist)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.subheader("Pipeline Steps")
|
||||
st.caption("Select tools to include in the pipeline (recommended order):")
|
||||
|
||||
st.checkbox("1. Text Cleaner", value=True, disabled=True)
|
||||
st.checkbox("2. Format Standardizer", value=True, disabled=True)
|
||||
st.checkbox("3. Missing Value Handler", value=True, disabled=True)
|
||||
st.checkbox("4. Column Mapper", value=False, disabled=True)
|
||||
st.checkbox("5. Outlier Detector", value=False, disabled=True)
|
||||
st.checkbox("6. Deduplicator", value=True, disabled=True)
|
||||
st.checkbox("7. Multi-File Merger", value=False, disabled=True)
|
||||
st.checkbox("8. Validator & Reporter", value=True, disabled=True)
|
||||
|
||||
st.subheader("Pipeline Configuration")
|
||||
|
||||
st.selectbox("On error", ["Stop pipeline", "Skip step and continue", "Prompt for decision"], disabled=True)
|
||||
st.checkbox("Generate combined report at end", value=True, disabled=True)
|
||||
with st.expander("Recommended tool order — why each step belongs where it does"):
|
||||
st.markdown(
|
||||
"\n".join(
|
||||
f"- **{e}** before **{l}** — {why}"
|
||||
for e, l, why in SOFT_DEPENDENCIES
|
||||
)
|
||||
)
|
||||
|
||||
st.divider()
|
||||
st.button("Run Pipeline", type="primary", use_container_width=True, disabled=True)
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Footer
|
||||
# Run
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
run_disabled = current_pipeline is None or not current_pipeline.steps
|
||||
|
||||
if st.button(
|
||||
"Run Pipeline",
|
||||
type="primary",
|
||||
use_container_width=True,
|
||||
disabled=run_disabled,
|
||||
):
|
||||
progress = st.progress(0.0, text="Starting...")
|
||||
log_box = st.empty()
|
||||
log_lines: list[str] = []
|
||||
total_enabled = sum(1 for s in current_pipeline.steps if s.enabled)
|
||||
completed = [0]
|
||||
|
||||
def _on_step(sr) -> None:
|
||||
completed[0] += 1
|
||||
if sr.skipped:
|
||||
log_lines.append(f"○ {sr.step.display_name()} (skipped)")
|
||||
elif sr.error:
|
||||
log_lines.append(
|
||||
f"✗ {sr.step.display_name()} — {sr.error.splitlines()[0]}"
|
||||
)
|
||||
else:
|
||||
log_lines.append(
|
||||
f"✓ {sr.step.display_name()} — {sr.elapsed_seconds*1000:.0f} ms"
|
||||
)
|
||||
log_box.markdown("\n".join(log_lines))
|
||||
progress.progress(
|
||||
completed[0] / max(total_enabled, 1),
|
||||
text=f"Step {completed[0]}/{total_enabled}",
|
||||
)
|
||||
|
||||
try:
|
||||
result = run_pipeline(
|
||||
df, current_pipeline,
|
||||
on_step_complete=_on_step,
|
||||
stop_on_error=False,
|
||||
)
|
||||
except Exception as e:
|
||||
from src.core.errors import format_for_user
|
||||
st.error(f"**Pipeline halted**\n\n```\n{format_for_user(e)}\n```")
|
||||
st.stop()
|
||||
|
||||
progress.progress(1.0, text="Done")
|
||||
st.session_state["pipeline_result"] = result
|
||||
st.session_state["pipeline_input_name"] = uploaded.name
|
||||
|
||||
result = st.session_state.get("pipeline_result")
|
||||
if result is None:
|
||||
st.info(
|
||||
"Configure the pipeline above and click **Run Pipeline** to "
|
||||
"execute it on your file."
|
||||
)
|
||||
st.stop()
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Results
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.subheader("Results")
|
||||
|
||||
m1, m2, m3, m4 = st.columns(4)
|
||||
m1.metric("Initial rows", result.initial_rows)
|
||||
m2.metric("Final rows", result.final_rows)
|
||||
m3.metric("Steps run", sum(1 for s in result.step_results if not s.skipped))
|
||||
m4.metric("Elapsed", f"{result.total_elapsed:.2f} s")
|
||||
|
||||
st.markdown("**Per-step summary**")
|
||||
step_df = pd.DataFrame([
|
||||
{
|
||||
"step": sr.step.display_name(),
|
||||
"status": (
|
||||
"skipped" if sr.skipped
|
||||
else "error" if sr.error
|
||||
else "ok"
|
||||
),
|
||||
"elapsed_ms": int(sr.elapsed_seconds * 1000),
|
||||
"summary": json.dumps(sr.summary, default=str)[:200],
|
||||
"error": sr.error or "",
|
||||
}
|
||||
for sr in result.step_results
|
||||
])
|
||||
st.dataframe(step_df, use_container_width=True, hide_index=True)
|
||||
|
||||
st.markdown("**Output preview (first 10 rows)**")
|
||||
st.dataframe(result.final_df.head(10), use_container_width=True)
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Downloads
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.divider()
|
||||
st.caption(
|
||||
"Runs locally. Your data never leaves this computer. "
|
||||
"| DataTools v3.0"
|
||||
)
|
||||
stem = Path(st.session_state.get("pipeline_input_name", "input")).stem
|
||||
|
||||
dl_a, dl_b, dl_c = st.columns(3)
|
||||
with dl_a:
|
||||
bytes_csv = result.final_df.to_csv(index=False).encode("utf-8-sig")
|
||||
st.download_button(
|
||||
"Download cleaned CSV",
|
||||
data=bytes_csv,
|
||||
file_name=f"{stem}_pipeline.csv",
|
||||
mime="text/csv",
|
||||
)
|
||||
with dl_b:
|
||||
pipeline_bytes = json.dumps(
|
||||
current_pipeline.to_dict() if current_pipeline else {"steps": []},
|
||||
indent=2, default=str,
|
||||
).encode("utf-8")
|
||||
st.download_button(
|
||||
"Download pipeline JSON",
|
||||
data=pipeline_bytes,
|
||||
file_name="pipeline.json",
|
||||
mime="application/json",
|
||||
help="Save this and pass --pipeline pipeline.json to the CLI to re-run on next week's file.",
|
||||
)
|
||||
with dl_c:
|
||||
audit_bytes = json.dumps({
|
||||
"warnings": result.warnings,
|
||||
"initial_rows": result.initial_rows,
|
||||
"final_rows": result.final_rows,
|
||||
"total_elapsed_seconds": result.total_elapsed,
|
||||
"steps": [
|
||||
{
|
||||
"tool": sr.step.tool,
|
||||
"name": sr.step.display_name(),
|
||||
"enabled": sr.step.enabled,
|
||||
"skipped": sr.skipped,
|
||||
"elapsed_seconds": sr.elapsed_seconds,
|
||||
"summary": sr.summary,
|
||||
"error": sr.error,
|
||||
}
|
||||
for sr in result.step_results
|
||||
],
|
||||
}, indent=2, default=str).encode("utf-8")
|
||||
st.download_button(
|
||||
"Download run audit",
|
||||
data=audit_bytes,
|
||||
file_name=f"{stem}_pipeline_audit.json",
|
||||
mime="application/json",
|
||||
)
|
||||
|
||||
st.divider()
|
||||
st.caption("Runs locally. Your data never leaves this computer. | DataTools v3.0")
|
||||
|
||||
@@ -78,7 +78,7 @@ TOOLS: list[Tool] = [
|
||||
"Detect disguised nulls, missingness analysis, and imputation strategies."
|
||||
),
|
||||
page_slug="4_Missing_Values",
|
||||
status="Coming Soon",
|
||||
status="Ready",
|
||||
),
|
||||
Tool(
|
||||
tool_id="05_column_mapper",
|
||||
@@ -86,7 +86,7 @@ TOOLS: list[Tool] = [
|
||||
name="Column Mapper",
|
||||
description="Rename columns, enforce a target schema, and coerce types.",
|
||||
page_slug="5_Column_Mapper",
|
||||
status="Coming Soon",
|
||||
status="Ready",
|
||||
),
|
||||
Tool(
|
||||
tool_id="06_outlier_detector",
|
||||
@@ -125,7 +125,7 @@ TOOLS: list[Tool] = [
|
||||
"Chain tools in recommended order and pass output between steps."
|
||||
),
|
||||
page_slug="9_Pipeline_Runner",
|
||||
status="Coming Soon",
|
||||
status="Ready",
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
51
streamlit_app.py
Normal file
@@ -0,0 +1,51 @@
"""Streamlit Community Cloud entry point — public demo app.

This is the file Streamlit Community Cloud auto-discovers when you
deploy from this repository: leave the "Main file path" field at its
default (``streamlit_app.py``) and it just works.

Why this lives at the repo root, not in ``src/gui/``:
    Streamlit auto-detects sibling files inside a ``pages/`` directory
    next to the entry script and renders them as additional pages in
    the sidebar. The full product GUI's pages live in
    ``src/gui/pages/`` — pointing the Cloud at ``src/gui/app_demo.py``
    would inadvertently expose every paid-product page in the demo's
    sidebar (or require URL-routing tricks to suppress them).
    Anchoring the entry script at the repo root means there is no
    ``pages/`` neighbour and the demo stays single-page by
    construction.

The actual demo UI is defined once in ``src/gui/app_demo.py`` so
local development still works the way it always did:

    streamlit run src/gui/app_demo.py    # local dev, identical UX

Cloud deploy uses this shim:

    streamlit run streamlit_app.py       # what Cloud invokes
"""

from __future__ import annotations

import sys
from pathlib import Path

# Put the repo root on sys.path so ``src.core`` and ``src.gui`` imports
# resolve cleanly. The demo module does this itself for the local-dev
# case, but the import order matters when this shim runs first on Cloud.
_HERE = Path(__file__).resolve().parent
if str(_HERE) not in sys.path:
    sys.path.insert(0, str(_HERE))

# Executing the demo module top-to-bottom is the simplest way to share
# the UI between the two entry points without duplicating code or
# refactoring the demo into a function (Streamlit's idiom is
# script-as-page; converting it to a callable would fight the
# framework). ``runpy.run_path`` executes the file in a fresh module
# namespace with ``__name__`` set to ``"__main__"``, so the demo runs
# the same way it does when launched directly.
import runpy
runpy.run_path(
    str(_HERE / "src" / "gui" / "app_demo.py"),
    run_name="__main__",
)
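The `runpy.run_path` call used by the shim can be checked in isolation. The sketch below is only an illustration: `demo_stub.py` is a hypothetical stand-in for `src/gui/app_demo.py`, written to a temporary directory, and no Streamlit code is involved. It shows that the target script sees `__name__ == "__main__"` and that its globals come back as a dictionary rather than landing in the caller's namespace.

```python
import runpy
import tempfile
from pathlib import Path

# Hypothetical stand-in for src/gui/app_demo.py, written to a temp dir.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "demo_stub.py"
    script.write_text(
        "seen_name = __name__\nmarker = 'defined by the stub'\n",
        encoding="utf-8",
    )

    # run_path executes the file and returns the globals it produced.
    stub_globals = runpy.run_path(str(script), run_name="__main__")

# The stub ran as if it were the entry script ...
ran_as_main = stub_globals["seen_name"] == "__main__"
# ... but its names were not injected into the calling module.
leaked_into_caller = "marker" in globals()
print(ran_as_main, leaked_into_caller)
```

Because the stub's names stay in the returned dictionary, the shim and the demo module do not share variables; the shim only needs `sys.path` to be set up before the call.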
23
test-cases/column-mapper-corpus/README.md
Normal file
@@ -0,0 +1,23 @@
# Column Mapper — corpus

Acceptance fixtures for `src/core/column_mapper.py`. Each `.csv` under
`test_data/` is paired with assertions in
`tests/test_column_mapper_corpus.py`.

## Use cases (target client profiles)

| File | Buyer profile | Tested behaviour |
|------|---------------|------------------|
| `uc01_crm_import.csv` + `schemas/uc01_crm_target.json` | Sales ops admin importing leads into Salesforce / HubSpot | Schema enforcement: rename via aliases, coerce types, drop extras, add `owner` default. |
| `uc02_vendor_{a,b,c}.csv` + `schemas/uc02_canonical.json` | Operator unifying vendor exports | Multi-source unification: each vendor uses different headers; auto-inference resolves them all. |
| `uc03_type_coercion.csv` + `schemas/uc03_types.json` | Analyst quick-fixing a mistyped CSV | Mixed-type coercion with documented per-column failure counts (bad rows survive as NaN). |

## Edge cases

| File | Stresses |
|------|----------|
| `ec01_duplicate_target.csv` | Mapping two source columns to the same target → InputValidationError. |
| `ec02_unicode_columns.csv` | Non-ASCII column names (Japanese) survive rename and coerce. |
| `ec03_whitespace_headers.csv` | Leading/trailing whitespace in headers still fuzzy-matches the schema. |
| `ec04_no_match.csv` | No source column scores above threshold → empty mapping, fallback unmapped strategy fires. |
| `ec05_required_missing.csv` | Required target field has no source column → InputValidationError unless `enforce_required=False`. |
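The header matching these fixtures exercise (normalised names, aliases, a score threshold, and each target claimed at most once) can be sketched roughly as follows. This is an illustration only, not the code in `src/core/column_mapper.py`: `normalize` and `sketch_infer_mapping` are hypothetical names, the scoring uses the standard library's `difflib.SequenceMatcher` as an assumed similarity measure, and the `0.6` default threshold is taken from the unit tests rather than from the module itself.

```python
import re
from difflib import SequenceMatcher


def normalize(header: str) -> str:
    """Lower-case, trim, and collapse runs of punctuation/whitespace to '_'."""
    return re.sub(r"[\W_]+", "_", header.strip().lower()).strip("_")


def sketch_infer_mapping(source_headers, targets, threshold=0.6):
    """Greedy best-score matching of source headers to target fields.

    `targets` maps each target field name to its list of aliases.
    Every target is claimed by at most one source header.
    """
    candidates = []
    for src in source_headers:
        for target, aliases in targets.items():
            # Score against the target name and every alias; keep the best.
            score = max(
                SequenceMatcher(None, normalize(src), normalize(name)).ratio()
                for name in [target, *aliases]
            )
            if score >= threshold:
                candidates.append((score, src, target))

    mapping, used_targets = {}, set()
    # Highest score first, so an exact match beats a merely similar header.
    for _score, src, target in sorted(candidates, key=lambda c: -c[0]):
        if src not in mapping and target not in used_targets:
            mapping[src] = target
            used_targets.add(target)
    return mapping


crm_targets = {
    "first_name": ["fname"],
    "last_name": ["lname"],
    "email": ["EmailAddr", "email_address"],
}
whitespace_case = sketch_infer_mapping(
    [" First Name ", " Last Name ", "EmailAddr"], crm_targets
)
no_match_case = sketch_infer_mapping(["xyz", "abc", "foobar"], crm_targets)
```

With the headers from `ec03_whitespace_headers.csv` every column resolves despite the padding, while the headers from `ec04_no_match.csv` fall below the threshold and produce an empty mapping.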
13
test-cases/column-mapper-corpus/schemas/uc01_crm_target.json
Normal file
@@ -0,0 +1,13 @@
{
  "fields": [
    {"name": "first_name", "dtype": "string", "required": true, "aliases": ["First Name", "fname"]},
    {"name": "last_name", "dtype": "string", "required": true, "aliases": ["Last Name", "lname"]},
    {"name": "email", "dtype": "string", "required": true, "aliases": ["EmailAddr", "Email", "email_address"]},
    {"name": "phone", "dtype": "string", "aliases": ["Phone", "phone_number"]},
    {"name": "account_name", "dtype": "string", "aliases": ["Company", "Account"]},
    {"name": "annual_rev", "dtype": "integer", "aliases": ["Annual Revenue", "revenue"]},
    {"name": "lead_source", "dtype": "category", "aliases": ["Lead Source", "source"]},
    {"name": "created_date", "dtype": "date", "aliases": ["Created", "create_date"]},
    {"name": "owner", "dtype": "string", "default": "unassigned"}
  ]
}
@@ -0,0 +1,9 @@
{
  "fields": [
    {"name": "first_name", "dtype": "string", "required": true, "aliases": ["FirstName", "FName", "First Name"]},
    {"name": "last_name", "dtype": "string", "required": true, "aliases": ["LastName", "Surname", "Last Name"]},
    {"name": "email", "dtype": "string", "required": true, "aliases": ["Email", "E-mail", "email_addr", "EmailAddr"]},
    {"name": "phone", "dtype": "string", "aliases": ["Phone Number", "Tel", "phone_number"]},
    {"name": "country", "dtype": "string", "aliases": ["Country", "country_code", "Region"]}
  ]
}
10
test-cases/column-mapper-corpus/schemas/uc03_types.json
Normal file
@@ -0,0 +1,10 @@
{
  "fields": [
    {"name": "id", "dtype": "integer", "required": true},
    {"name": "age", "dtype": "integer"},
    {"name": "active", "dtype": "boolean"},
    {"name": "joined", "dtype": "date"},
    {"name": "score", "dtype": "float"},
    {"name": "notes", "dtype": "string"}
  ]
}
@@ -0,0 +1,3 @@
|
||||
a,b,c
|
||||
1,2,3
|
||||
4,5,6
|
||||
|
@@ -0,0 +1,3 @@
|
||||
名前,Email,価格
|
||||
Alice,a@x.com,100
|
||||
Bob,b@x.com,200
|
||||
|
@@ -0,0 +1,3 @@
|
||||
First Name , Last Name ,EmailAddr
|
||||
Alice,Johnson,alice@x.com
|
||||
Bob,Smith,bob@x.com
|
||||
|
@@ -0,0 +1,3 @@
|
||||
xyz,abc,foobar
|
||||
1,2,3
|
||||
4,5,6
|
||||
|
@@ -0,0 +1,3 @@
|
||||
first_name,age
|
||||
Alice,30
|
||||
Bob,25
|
||||
|
@@ -0,0 +1,4 @@
|
||||
First Name,Last Name,EmailAddr,Phone,Company,Annual Revenue,Lead Source,Created
|
||||
Alice,Johnson,alice@acme.com,555-1234,Acme Corp,1500000,LinkedIn,2025-12-04
|
||||
Bob,Smith,bob@beta.com,555-5678,Beta LLC,250000,Webinar,2025-11-22
|
||||
Carlos,Garcia,carlos@gamma.io,555-9012,Gamma Inc,4200000,Referral,2025-10-30
|
||||
|
@@ -0,0 +1,3 @@
|
||||
FirstName,LastName,Email,Phone Number,Country
|
||||
Alice,Johnson,alice@vendor-a.com,555-1234,USA
|
||||
Bob,Smith,bob@vendor-a.com,555-5678,USA
|
||||
|
@@ -0,0 +1,3 @@
|
||||
first_name,surname,email_addr,phone,country_code
|
||||
Carlos,Garcia,carlos@vendor-b.com,555-9012,USA
|
||||
Diana,Lee,diana@vendor-b.com,555-7777,UK
|
||||
|
@@ -0,0 +1,3 @@
|
||||
FName,Surname,E-mail,Tel,Region
|
||||
Eve,Martinez,eve@vendor-c.com,555-9988,Bronx
|
||||
Frank,Brown,frank@vendor-c.com,555-1111,Queens
|
||||
|
@@ -0,0 +1,6 @@
|
||||
id,age,active,joined,score,notes
|
||||
1,30,true,2025-01-15,87.5,first
|
||||
2,25,false,2025-02-22,not_a_number,second
|
||||
3,not_a_number,yes,2025-03-08,76.0,third
|
||||
4,40,no,bad_date,91.2,fourth
|
||||
5,55,1,2025-05-01,82.0,fifth
|
||||
|
@@ -0,0 +1,21 @@
|
||||
customer_id,name,phone,country,address,price
|
||||
INT-001,Alice Johnson,(415) 555-1234,US,"1 Apple Park Way, Cupertino CA 95014",$1499.99
|
||||
INT-002,Boris Petrov,+7 495 123 4567,RU,"Ulitsa Tverskaya 13, Moscow 125009",₽89500
|
||||
INT-003,carlos garcia,+34 91 411 1111,ES,"Calle Gran Via 28, Madrid 28013","€1.299,00"
|
||||
INT-004,JOHN BROWN,020 7946 0958,GB,"10 Downing Street, London SW1A 2AA","£950.00"
|
||||
INT-005,marie dubois,01 42 86 82 00,FR,"Avenue des Champs-Elysees 100, Paris 75008","€2.499,50"
|
||||
INT-006,Yuki Tanaka,03-3210-7000,JP,"Marunouchi 2-7-3, Chiyoda-ku Tokyo 100-0005",¥150000
|
||||
INT-007,Anna Schmidt,030 12345678,DE,"Unter den Linden 5, Berlin 10117","€899,99"
|
||||
INT-008,giovanni rossi,+39 06 6982,IT,"Via del Corso 320, Roma 00186","€1.450,00"
|
||||
INT-009,Mei Wang,+86 10 1234 5678,CN,"东长安街 1号, 北京 100006",¥10000
|
||||
INT-010,Priya Sharma,+91 11 2345 6789,IN,"Connaught Place, New Delhi 110001",₹85000
|
||||
INT-011,Ahmed Hassan,+20 2 2735 0000,EG,"Tahrir Square, Cairo 11511",E£3500
|
||||
INT-012,emily smith,+61 2 9374 4000,AU,"Sydney Opera House, Sydney NSW 2000","$2,199.00"
|
||||
INT-013,Joao Silva,11 3071 0000,BR,"Avenida Paulista 1000, Sao Paulo 01310","R$ 1.299,90"
|
||||
INT-014,Sofia Lopez,+52 55 5555 0000,MX,"Paseo de la Reforma 222, Ciudad de Mexico 06600","$1,500 MXN"
|
||||
INT-015,Min-jun Kim,+82 2 2287 0114,KR,"Seoul Plaza, Seoul 04518",₩1500000
|
||||
INT-016,Mehmet Yilmaz,+90 212 252 0000,TR,"Sultanahmet, Istanbul 34122","₺1.250"
|
||||
INT-017,david cohen,+972 3 6957 0000,IL,"Dizengoff 50, Tel Aviv 6433222",₪450
|
||||
INT-018,Hanna Kowalska,+48 22 658 4500,PL,"Marszalkowska 1, Warszawa 00-624","zł 350,00"
|
||||
INT-019,Lars Nielsen,+45 33 12 88 88,DK,"Vesterbrogade 1, Copenhagen 1620","kr 950"
|
||||
INT-020,Sven Eriksson,+46 8 506 600 00,SE,"Drottninggatan 1, Stockholm 11151","kr 1.250,50"
|
||||
|
35
test-cases/missing-corpus/README.md
Normal file
@@ -0,0 +1,35 @@
# Missing Value Handler — corpus

Acceptance fixtures for `src/core/missing.py`. Each `.csv` under
`test_data/` is paired with assertions in `tests/test_missing_corpus.py`.
Add a new case by dropping a CSV here and adding a parametrize entry to
the runner.

## Use cases (target client profiles)

| File | Buyer profile | Strategy under test |
|------|---------------|---------------------|
| `uc01_shopify_export.csv` | SMB / Shopify operator | `detect-only` |
| `uc02_marketing_audience.csv` | Marketing / RevOps analyst | `safe-fill` |
| `uc03_consultant_intake.csv` | Analyst / consultant | `drop-incomplete` + threshold |

## Edge cases

| File | What it stresses |
|------|------------------|
| `ec01_all_nan_column.csv` | column 100 % missing — fill must skip, drop_col must catch at threshold |
| `ec02_no_missing.csv` | clean file — must be a no-op |
| `ec03_zero_is_not_missing.csv` | numeric `0`, boolean `false`, `"0"` must NOT be treated as missing |
| `ec04_excel_errors.csv` | `#N/A`, `#NULL!`, `#VALUE!` Excel error sentinels |
| `ec05_unicode_whitespace.csv` | NBSP, tab-only, ideographic-space cells treated as whitespace |
| `ec06_mixed_dtypes.csv` | mixed numeric/string in same column — graceful degrade to mode |
| `ec07_real_data_with_padding.csv` | leading/trailing whitespace around real data must NOT be dropped |
| `ec08_single_row.csv` | one-row file — every operation must still work |
| `ec09_single_column.csv` | one-column file with header-only line + sentinels |
| `ec10_all_sentinel_variants.csv` | every `DEFAULT_SENTINELS` entry exercised in one file |
| `ec11_constant_per_column.csv` | `column_fill_values` differs per column |
| `ec12_drop_threshold_boundary.csv` | boundary values for `row_drop_threshold` (0.5, 0.99, 1.0) |
| `ec13_ffill_leading_nan.csv` | leading-NaN run survives ffill (no fabrication) |
| `ec14_interpolate_fallback.csv` | numeric-only strategy on string column triggers fallback |
| `ec15_headers_only.csv` | empty body — must not crash |
| `ec16_idempotent_apply.csv` | running `handle_missing` twice yields the same DataFrame |
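The detection rules several of these edge cases pin down (sentinel strings, whitespace-only cells, and zero or `false` not counting as missing) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the logic in `src/core/missing.py`: `SKETCH_SENTINELS` is a hypothetical set assembled from the values in the `ec04` and `ec10` fixtures, the real `DEFAULT_SENTINELS` may differ, and `is_disguised_null` is an invented helper name.

```python
# Hypothetical sentinel set, lower-cased; built from the fixture values only.
SKETCH_SENTINELS = {
    "n/a", "na", "null", "none", "nil", "nan", "-", "--", "?", ".",
    "tbd", "unknown", "(blank)", "(none)", "#n/a", "#null!", "#value!",
    "missing",
}


def is_disguised_null(cell) -> bool:
    """Return True when a cell should count as missing.

    Numeric 0, boolean False and the string "0" are real values, so the
    check works on the stripped text rather than on truthiness.
    """
    if cell is None:
        return True
    # str.strip() with no arguments also removes Unicode whitespace such
    # as NBSP (U+00A0) and the ideographic space (U+3000).
    text = str(cell).strip()
    if text == "":
        return True
    return text.casefold() in SKETCH_SENTINELS


row = ["A-105", "#N/A", 0, "\u00a0", " Alice ", "0", False]
flags = [is_disguised_null(cell) for cell in row]
```

Stripping before the comparison is what keeps padded real data such as `" Alice "` from being flagged while a lone NBSP still counts as missing.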
@@ -0,0 +1,5 @@
|
||||
id,name,deprecated_field
|
||||
1,Alice,
|
||||
2,Bob,
|
||||
3,Charlie,
|
||||
4,Diana,
|
||||
|
4
test-cases/missing-corpus/test_data/ec02_no_missing.csv
Normal file
@@ -0,0 +1,4 @@
|
||||
id,name,age,city
|
||||
1,Alice,30,NYC
|
||||
2,Bob,25,LA
|
||||
3,Charlie,35,SF
|
||||
|
@@ -0,0 +1,5 @@
|
||||
id,active,balance,count,flag
|
||||
1,true,0.00,0,0
|
||||
2,false,150.50,3,1
|
||||
3,true,0,5,0
|
||||
4,true,75.25,0,1
|
||||
|
@@ -0,0 +1,7 @@
|
||||
sku,price,units,supplier
|
||||
A-100,19.99,5,Acme
|
||||
A-101,#N/A,3,Beta
|
||||
A-102,29.99,#NULL!,Gamma
|
||||
A-103,#VALUE!,2,Delta
|
||||
A-104,9.99,0,Acme
|
||||
A-105,#N/A,#N/A,#NULL!
|
||||
|
@@ -0,0 +1,6 @@
|
||||
id,note,value
|
||||
1,hello,10
|
||||
2, ,20
|
||||
3, ,30
|
||||
4,real,40
|
||||
5, ,50
|
||||
|
@@ -0,0 +1,6 @@
|
||||
id,mixed_col,real_num
|
||||
1,42,1.0
|
||||
2,N/A,2.0
|
||||
3,hello,
|
||||
4,,4.0
|
||||
5,99,5.0
|
||||
|
@@ -0,0 +1,5 @@
|
||||
id,name,city
|
||||
1, Alice ,NYC
|
||||
2, ,LA
|
||||
3, Bob ,
|
||||
4,Charlie, SF
|
||||
|
2
test-cases/missing-corpus/test_data/ec08_single_row.csv
Normal file
@@ -0,0 +1,2 @@
|
||||
id,name,age,city
|
||||
1,Alice,N/A,
|
||||
|
@@ -0,0 +1,7 @@
|
||||
value
|
||||
10
|
||||
N/A
|
||||
20
|
||||
" "
|
||||
-
|
||||
30
|
||||
|
@@ -0,0 +1,22 @@
|
||||
case_id,sentinel_value
|
||||
01,N/A
|
||||
02,n/a
|
||||
03,NA
|
||||
04,na
|
||||
05,NULL
|
||||
06,null
|
||||
07,None
|
||||
08,nil
|
||||
09,NaN
|
||||
10,-
|
||||
11,--
|
||||
12,?
|
||||
13,.
|
||||
14,TBD
|
||||
15,unknown
|
||||
16,(blank)
|
||||
17,(none)
|
||||
18,#N/A
|
||||
19,#NULL!
|
||||
20,missing
|
||||
21,real_value
|
||||
|
@@ -0,0 +1,6 @@
|
||||
id,country,salary,department
|
||||
1,USA,50000,Eng
|
||||
2,,60000,Sales
|
||||
3,UK,,Eng
|
||||
4,USA,55000,
|
||||
5,,,
|
||||
|
@@ -0,0 +1,6 @@
|
||||
id,a,b,c,d
|
||||
1,1,2,3,4
|
||||
2,,,3,4
|
||||
3,,,,4
|
||||
4,,,,
|
||||
5,1,2,,
|
||||
|
@@ -0,0 +1,8 @@
|
||||
date,price
|
||||
2025-01-01,
|
||||
2025-01-02,
|
||||
2025-01-03,100.0
|
||||
2025-01-04,
|
||||
2025-01-05,
|
||||
2025-01-06,150.0
|
||||
2025-01-07,
|
||||
|
@@ -0,0 +1,6 @@
|
||||
id,category,value
|
||||
1,A,10.0
|
||||
2,B,
|
||||
3,C,30.0
|
||||
4,,40.0
|
||||
5,A,
|
||||
|
@@ -0,0 +1 @@
|
||||
id,name,age,city
|
||||
|
@@ -0,0 +1,5 @@
|
||||
id,name,age
|
||||
1,Alice,30
|
||||
2,N/A,
|
||||
3,Bob,25
|
||||
4,,40
|
||||
|
11
test-cases/missing-corpus/test_data/uc01_shopify_export.csv
Normal file
@@ -0,0 +1,11 @@
|
||||
customer_id,first_name,last_name,email,phone,city,total_orders,lifetime_value,last_order_date,tags
|
||||
SHOP-001,Alice,Johnson,alice@shop.com,555-1234,Brooklyn,12,1240.50,2025-12-04,VIP
|
||||
SHOP-002,Bob,Smith,bob@shop.com,N/A,Queens,5,420.00,2025-11-22,
|
||||
SHOP-003,Carlos,Garcia,carlos@shop.com,555-5678,-,8,890.25,2025-12-15,Wholesale
|
||||
SHOP-004,Diana,Lee,diana@shop.com,(555) 222-3344,Manhattan,NULL,1875.00,2025-10-30,VIP|Wholesale
|
||||
SHOP-005,Eve,Martinez,,555-9988,Bronx,3,180.00,2025-09-15,
|
||||
SHOP-006,Frank,Brown,frank@shop.com, ,Staten Island,15,2410.75,(blank),
|
||||
SHOP-007,Grace,Davis,grace@shop.com,555-1111,Brooklyn,1,49.99,#N/A,New
|
||||
SHOP-008,Henry,Wilson,henry@shop.com,n/a,Queens,7,675.00,2025-11-08,VIP
|
||||
SHOP-009,Ivy,Chen,ivy@shop.com,555-7777,?,4,320.50,2025-10-12,
|
||||
SHOP-010,Jack,Taylor,jack@shop.com,555-4444,Manhattan,(none),520.00,2025-12-01,Wholesale
|
||||
|
@@ -0,0 +1,16 @@
|
||||
contact_id,email,segment,region,age,ltv,score,last_engagement_days,source,consent
|
||||
LEAD-001,a@mkt.com,Enterprise,NA-East,42,12400,87,3,LinkedIn,true
|
||||
LEAD-002,b@mkt.com,SMB,NA-West,,3200,62,12,Google,true
|
||||
LEAD-003,c@mkt.com,SMB,EU,29,1800,N/A,7,unknown,true
|
||||
LEAD-004,d@mkt.com,Enterprise,NA-East,55,,91,1,Webinar,true
|
||||
LEAD-005,e@mkt.com,Mid-Market,NA-West,38,5600,74,,Referral,true
|
||||
LEAD-006,f@mkt.com,SMB,EU,,2100,55,21,-,
|
||||
LEAD-007,g@mkt.com,Enterprise,APAC,47,9800,82,5,LinkedIn,true
|
||||
LEAD-008,h@mkt.com,SMB,NA-East,33,2900,,9,Google,
|
||||
LEAD-009,i@mkt.com,Mid-Market,EU,41,4750,68,15,NULL,true
|
||||
LEAD-010,j@mkt.com,Enterprise,NA-West,,11200,89,2,Webinar,true
|
||||
LEAD-011,k@mkt.com,SMB,APAC,28,1650,58,18,(blank),true
|
||||
LEAD-012,l@mkt.com,Mid-Market,NA-East,36,5100,,11,Referral,true
|
||||
LEAD-013,m@mkt.com,SMB,EU,31,2300,61,N/A,Google,true
|
||||
LEAD-014,n@mkt.com,Enterprise,APAC,52,10500,93,4,LinkedIn,true
|
||||
LEAD-015,o@mkt.com,SMB,NA-West,26,1400,49,25,?,
|
||||
|
@@ -0,0 +1,13 @@
|
||||
respondent_id,age,gender,zip,survey_q1,survey_q2,survey_q3,survey_q4,nps,comments,internal_id_legacy,beta_field
|
||||
R-001,34,F,11201,4,5,3,4,9,"loved it",,
|
||||
R-002,N/A,M,10001,,,,, ,,,
|
||||
R-003,41,F,90210,5,4,5,5,10,"perfect",,
|
||||
R-004,28,M,-,3,,,,7,,,
|
||||
R-005,,,NULL,,,,,,,,
|
||||
R-006,52,F,02101,4,4,4,4,8,"good experience",,
|
||||
R-007,?,?,?,?,?,?,?,?,?,,
|
||||
R-008,29,M,94102,5,5,5,5,10,"amazing",,
|
||||
R-009,38,F,60601,2,3,2,2,5,"meh",,
|
||||
R-010,(blank),(blank),(blank),(blank),(blank),(blank),(blank),(blank),(blank),,
|
||||
R-011,45,M,30301,4,4,3,4,8,,,
|
||||
R-012,33,F,11201,5,5,5,4,9,"will recommend",,
|
||||
|
@@ -253,16 +253,20 @@ class TestEncodingOverride:
|
||||
|
||||
|
||||
class TestEncodingDecodeFailedFromRepair:
|
||||
def test_decode_replaced_action_surfaces_error_finding(self, tmp_path):
|
||||
# Create a file with a UTF-8 BOM but cp1252 body bytes — utf-8-sig
|
||||
# fails on byte 0x80 (€ in cp1252).
|
||||
def test_lying_bom_recovered_and_flagged(self, tmp_path):
|
||||
# File has a UTF-8 BOM but the body bytes are cp1252 (0x80 = € in
|
||||
# cp1252; not a valid UTF-8 continuation byte). Detector should
|
||||
# recover transparently to cp1252 and surface an
|
||||
# ``encoding_lying_bom`` warn so the user knows.
|
||||
f = tmp_path / "lying_bom.csv"
|
||||
f.write_bytes(b"\xef\xbb\xbfid,name\n1,\x80100\n")
|
||||
findings = analyze(f)
|
||||
ids = {x.id for x in findings}
|
||||
assert "encoding_decode_failed" in ids
|
||||
bad = next(x for x in findings if x.id == "encoding_decode_failed")
|
||||
assert bad.severity == "error"
|
||||
assert "encoding_lying_bom" in ids
|
||||
bad = next(x for x in findings if x.id == "encoding_lying_bom")
|
||||
assert bad.severity == "warn"
|
||||
# Decode should have succeeded — no replacement-character finding.
|
||||
assert "encoding_decode_failed" not in ids
|
||||
|
||||
|
||||
class TestMixedLineEndings:
|
||||
|
||||
374
tests/test_column_mapper.py
Normal file
@@ -0,0 +1,374 @@
|
||||
"""Tests for src/core/column_mapper.py."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
import pytest
|
||||
|
||||
from src.core.errors import ConfigError, InputValidationError
|
||||
from src.core.column_mapper import (
|
||||
MapOptions,
|
||||
PRESETS,
|
||||
TargetField,
|
||||
TargetSchema,
|
||||
coerce_series,
|
||||
infer_mapping,
|
||||
map_columns,
|
||||
)
|
||||
|
||||
# ---------------------------------------------------------------------------
# infer_mapping — fuzzy matcher
# ---------------------------------------------------------------------------

class TestInferMapping:
    def test_exact_normalized_match(self):
        df = pd.DataFrame({"First Name": [], "Last Name": []})
        schema = TargetSchema(fields=[
            TargetField(name="first_name"), TargetField(name="last_name"),
        ])
        m = infer_mapping(df, schema)
        assert m == {"First Name": "first_name", "Last Name": "last_name"}

    def test_alias_match(self):
        df = pd.DataFrame({"EmailAddr": []})
        schema = TargetSchema(fields=[
            TargetField(name="email", aliases=["EmailAddr", "email_address"]),
        ])
        m = infer_mapping(df, schema)
        assert m == {"EmailAddr": "email"}

    def test_below_threshold_excluded(self):
        df = pd.DataFrame({"xyz": []})
        schema = TargetSchema(fields=[TargetField(name="email")])
        m = infer_mapping(df, schema, threshold=0.6)
        assert m == {}

    def test_target_matched_at_most_once(self):
        df = pd.DataFrame({"first_name": [], "fname": []})
        schema = TargetSchema(fields=[TargetField(name="first_name")])
        m = infer_mapping(df, schema)
        # Exact match wins; "fname" stays unmapped.
        assert m == {"first_name": "first_name"}

    def test_threshold_zero_matches_anything(self):
        df = pd.DataFrame({"a": [], "b": []})
        schema = TargetSchema(fields=[TargetField(name="z")])
        m = infer_mapping(df, schema, threshold=0.0)
        assert len(m) == 1


# ---------------------------------------------------------------------------
# coerce_series
# ---------------------------------------------------------------------------

class TestCoerceSeries:
    def test_integer_clean(self):
        s = pd.Series(["1", "2", "3"])
        out, fails = coerce_series(s, "integer")
        assert list(out) == [1, 2, 3]
        assert fails == 0

    def test_integer_with_failure(self):
        s = pd.Series(["1", "bad", "3"])
        out, fails = coerce_series(s, "integer")
        assert fails == 1
        assert pd.isna(out.iloc[1])

    def test_float_with_thousands_sep(self):
        # Plain floats; thousands-sep handling is for format standardizer.
        s = pd.Series(["1.5", "2.0", "3.25"])
        out, fails = coerce_series(s, "float")
        assert fails == 0
        assert out.iloc[2] == 3.25

    def test_boolean_truthy_falsy(self):
        s = pd.Series(["true", "false", "Yes", "no", "1", "0"])
        out, fails = coerce_series(s, "boolean")
        assert fails == 0
        assert list(out) == [True, False, True, False, True, False]

    def test_boolean_unknown_value_fails(self):
        s = pd.Series(["true", "maybe"])
        out, fails = coerce_series(s, "boolean")
        assert fails == 1
        assert pd.isna(out.iloc[1])

    def test_date_iso_format(self):
        s = pd.Series(["2025-01-15", "2025-02-20"])
        out, fails = coerce_series(s, "date")
        assert fails == 0
        assert out.iloc[0].year == 2025

    def test_date_failure(self):
        s = pd.Series(["2025-01-15", "garbage"])
        out, fails = coerce_series(s, "date")
        assert fails == 1
        assert pd.isna(out.iloc[1])

    def test_string_passthrough(self):
        s = pd.Series([1, 2, 3])
        out, fails = coerce_series(s, "string")
        assert fails == 0
        assert out.dtype.name == "string"

    def test_auto_returns_unchanged(self):
        s = pd.Series([1, 2])
        out, fails = coerce_series(s, "auto")
        assert fails == 0
        assert out is s

    def test_unknown_dtype_raises(self):
        with pytest.raises(InputValidationError):
            coerce_series(pd.Series([1]), "bogus")  # type: ignore[arg-type]


# ---------------------------------------------------------------------------
# map_columns — explicit mapping
# ---------------------------------------------------------------------------

class TestMapColumnsExplicit:
    def test_simple_rename(self):
        df = pd.DataFrame({"a": [1], "b": [2]})
        opts = MapOptions(mapping={"a": "alpha", "b": "beta"})
        res = map_columns(df, opts)
        assert list(res.mapped_df.columns) == ["alpha", "beta"]
        assert res.columns_renamed == 2

    def test_unknown_source_raises(self):
        df = pd.DataFrame({"a": [1]})
        opts = MapOptions(mapping={"missing": "x"})
        with pytest.raises(InputValidationError):
            map_columns(df, opts)

    def test_duplicate_target_raises(self):
        df = pd.DataFrame({"a": [1], "b": [2]})
        opts = MapOptions(mapping={"a": "x", "b": "x"})
        with pytest.raises(InputValidationError):
            map_columns(df, opts)

    def test_unmapped_keep(self):
        df = pd.DataFrame({"a": [1], "b": [2]})
        opts = MapOptions(mapping={"a": "alpha"}, unmapped="keep")
        res = map_columns(df, opts)
        assert "b" in res.mapped_df.columns
        assert res.unmapped_kept == ["b"]

    def test_unmapped_drop(self):
        df = pd.DataFrame({"a": [1], "b": [2]})
        opts = MapOptions(mapping={"a": "alpha"}, unmapped="drop")
        res = map_columns(df, opts)
        assert list(res.mapped_df.columns) == ["alpha"]
        assert res.columns_dropped == ["b"]

    def test_unmapped_error(self):
        df = pd.DataFrame({"a": [1], "b": [2]})
        opts = MapOptions(mapping={"a": "alpha"}, unmapped="error")
        with pytest.raises(InputValidationError):
            map_columns(df, opts)


# ---------------------------------------------------------------------------
# map_columns — schema + auto-inference
# ---------------------------------------------------------------------------

class TestMapColumnsWithSchema:
    def test_auto_infer_renames(self):
        df = pd.DataFrame({"First Name": ["A"], "Last Name": ["B"]})
        schema = TargetSchema(fields=[
            TargetField(name="first_name"), TargetField(name="last_name"),
        ])
        opts = MapOptions(schema=schema, auto_infer=True)
        res = map_columns(df, opts)
        assert "first_name" in res.mapped_df.columns
        assert "last_name" in res.mapped_df.columns
        assert res.inferred_pairs == {"First Name": "first_name", "Last Name": "last_name"}

    def test_explicit_overrides_inferred(self):
        df = pd.DataFrame({"name": ["A"], "fname": ["B"]})
        schema = TargetSchema(fields=[TargetField(name="first_name")])
        opts = MapOptions(
            schema=schema,
            mapping={"fname": "first_name"},
            auto_infer=True,
        )
        res = map_columns(df, opts)
        assert res.mapping["fname"] == "first_name"
        assert "name" not in res.mapping

    def test_required_missing_raises(self):
        df = pd.DataFrame({"first_name": ["A"]})
        schema = TargetSchema(fields=[
            TargetField(name="first_name", required=True),
            TargetField(name="email", required=True),
        ])
        opts = MapOptions(schema=schema, auto_infer=False, enforce_required=True)
        with pytest.raises(InputValidationError):
            map_columns(df, opts)

    def test_required_missing_with_default_added(self):
        df = pd.DataFrame({"first_name": ["A"]})
        schema = TargetSchema(fields=[
            TargetField(name="first_name", required=True),
            TargetField(name="source", required=False, default="import"),
        ])
        opts = MapOptions(schema=schema, auto_infer=False)
        res = map_columns(df, opts)
        assert "source" in res.mapped_df.columns
        assert res.mapped_df.iloc[0]["source"] == "import"
        assert res.columns_added == ["source"]

    def test_required_missing_disabled(self):
        df = pd.DataFrame({"first_name": ["A"]})
        schema = TargetSchema(fields=[
            TargetField(name="first_name", required=True),
            TargetField(name="email", required=True),
        ])
        opts = MapOptions(schema=schema, auto_infer=False, enforce_required=False)
        res = map_columns(df, opts)
        assert "email" in res.missing_required_targets

    def test_reorder_to_schema(self):
        df = pd.DataFrame({"z": [1], "a": [2], "m": [3]})
        schema = TargetSchema(fields=[
            TargetField(name="a"), TargetField(name="m"), TargetField(name="z"),
        ])
        opts = MapOptions(schema=schema, auto_infer=True, reorder_to_schema=True)
        res = map_columns(df, opts)
        assert list(res.mapped_df.columns) == ["a", "m", "z"]

    def test_coerce_types(self):
        df = pd.DataFrame({"age": ["30", "bad", "40"], "active": ["true", "no", "yes"]})
        schema = TargetSchema(fields=[
            TargetField(name="age", dtype="integer"),
            TargetField(name="active", dtype="boolean"),
        ])
        opts = MapOptions(schema=schema, auto_infer=True, coerce_types=True)
        res = map_columns(df, opts)
        assert res.mapped_df["age"].iloc[0] == 30
        assert res.mapped_df["active"].iloc[0] is True or res.mapped_df["active"].iloc[0]
        assert res.coercion_failures == {"age": 1}


# ---------------------------------------------------------------------------
# Presets
# ---------------------------------------------------------------------------

class TestPresets:
    def test_strict_schema_drops_and_coerces_and_reorders(self):
        df = pd.DataFrame({"First Name": ["A"], "Email": ["a@x"], "extra": [1]})
        schema = TargetSchema(fields=[
            TargetField(name="first_name", required=True),
            TargetField(name="email", required=True),
        ])
        opts = MapOptions.from_preset("strict-schema")
        opts.schema = schema
        res = map_columns(df, opts)
        assert list(res.mapped_df.columns) == ["first_name", "email"]
        assert res.columns_dropped == ["extra"]

    def test_lenient_keeps_extras(self):
        df = pd.DataFrame({"First Name": ["A"], "extra": [1]})
        schema = TargetSchema(fields=[TargetField(name="first_name")])
        opts = MapOptions.from_preset("lenient-schema")
        opts.schema = schema
        res = map_columns(df, opts)
        assert "extra" in res.mapped_df.columns

    def test_unknown_preset(self):
        with pytest.raises(ConfigError):
            MapOptions.from_preset("does-not-exist")


# ---------------------------------------------------------------------------
# Schema serialization
# ---------------------------------------------------------------------------

class TestSchemaIO:
    def test_roundtrip_dict(self):
        schema = TargetSchema(fields=[
            TargetField(name="x", dtype="integer", required=True, aliases=["X", "X "]),
            TargetField(name="y", default="z"),
        ])
        d = schema.to_dict()
        loaded = TargetSchema.from_dict(d)
        assert loaded.field_names() == ["x", "y"]
        assert loaded.fields[0].required is True
        assert loaded.fields[1].default == "z"

    def test_from_dict_string_field(self):
        # Allow shorthand: bare string defaults to dtype=auto.
        loaded = TargetSchema.from_dict({"fields": ["a", "b"]})
        assert loaded.field_names() == ["a", "b"]

    def test_from_dict_unknown_dtype_raises(self):
        with pytest.raises(ConfigError):
            TargetSchema.from_dict({"fields": [{"name": "x", "dtype": "bogus"}]})

    def test_from_dict_missing_name_raises(self):
        with pytest.raises(ConfigError):
            TargetSchema.from_dict({"fields": [{"dtype": "string"}]})

    def test_options_roundtrip_to_file(self, tmp_path):
        schema = TargetSchema(fields=[TargetField(name="x", dtype="string")])
        opts = MapOptions(
            schema=schema,
            mapping={"a": "x"},
            unmapped="drop",
            coerce_types=True,
            reorder_to_schema=True,
        )
        path = tmp_path / "cfg.json"
        opts.to_file(path)
        loaded = MapOptions.from_file(path)
        assert loaded.mapping == {"a": "x"}
        assert loaded.unmapped == "drop"
        assert loaded.coerce_types is True
        assert loaded.schema is not None
        assert loaded.schema.field_names() == ["x"]


# ---------------------------------------------------------------------------
# Validation
# ---------------------------------------------------------------------------

class TestValidation:
    def test_invalid_unmapped_strategy(self):
        opts = MapOptions(unmapped="bogus")  # type: ignore[arg-type]
        with pytest.raises(InputValidationError):
            opts.validate()

    def test_threshold_out_of_range(self):
        opts = MapOptions(fuzzy_threshold=1.5)
        with pytest.raises(ConfigError):
            opts.validate()

    def test_non_dataframe_input(self):
        with pytest.raises(InputValidationError):
            map_columns([1, 2, 3])  # type: ignore[arg-type]


# ---------------------------------------------------------------------------
# Idempotency
# ---------------------------------------------------------------------------

class TestIdempotency:
    def test_double_apply_is_stable(self):
        df = pd.DataFrame({"First Name": ["A"], "Email": ["a@x"]})
        schema = TargetSchema(fields=[
            TargetField(name="first_name"),
            TargetField(name="email"),
        ])
        opts = MapOptions(schema=schema, auto_infer=True, reorder_to_schema=True)
        first = map_columns(df, opts)
        second = map_columns(first.mapped_df, opts)
        pd.testing.assert_frame_equal(second.mapped_df, first.mapped_df)

    def test_input_not_mutated(self):
        df = pd.DataFrame({"a": [1], "b": [2]})
        snapshot = df.copy(deep=True)
        map_columns(df, MapOptions(mapping={"a": "x"}))
        pd.testing.assert_frame_equal(df, snapshot)
||||
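The `TestInferMapping` cases in this file pin down three properties of the fuzzy matcher: header normalization, a similarity threshold, and each target being claimed at most once. One way those properties could fit together is sketched below. It is an illustration only: `normalize_header` and `infer_mapping_sketch` are hypothetical names, the sketch takes plain lists rather than a DataFrame and `TargetSchema`, and scoring with `difflib.SequenceMatcher` is an assumption, since the diff does not show how `src/core/column_mapper.py` scores candidates.

```python
import re
from difflib import SequenceMatcher


def normalize_header(name: str) -> str:
    """Lower-case and collapse runs of non-alphanumerics into one underscore."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def infer_mapping_sketch(columns, targets, threshold=0.6):
    """Greedy one-to-one fuzzy match of source columns onto target names."""
    scored = []
    for col in columns:
        for tgt in targets:
            score = SequenceMatcher(
                None, normalize_header(col), normalize_header(tgt)
            ).ratio()
            if score >= threshold:
                scored.append((score, col, tgt))
    # Best scores first, so an exact match claims its target before a near miss.
    scored.sort(key=lambda item: item[0], reverse=True)
    mapping, used_targets = {}, set()
    for _score, col, tgt in scored:
        if col in mapping or tgt in used_targets:
            continue
        mapping[col] = tgt
        used_targets.add(tgt)
    return mapping


print(infer_mapping_sketch(["First Name", "Last Name"], ["first_name", "last_name"]))
print(infer_mapping_sketch(["first_name", "fname"], ["first_name"]))
print(infer_mapping_sketch(["xyz"], ["email"]))
```

Sorting all candidate pairs before assigning is what makes the exact match win over `fname` in the second call; assigning column by column in input order would depend on column ordering instead.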
240
tests/test_column_mapper_corpus.py
Normal file
@@ -0,0 +1,240 @@
"""Acceptance corpus for the Column Mapper.

Loads every fixture in ``test-cases/column-mapper-corpus/test_data/``
and asserts the documented behaviour against the documented schema.
"""

from __future__ import annotations

import json
from pathlib import Path

import pandas as pd
import pytest

from src.core.errors import InputValidationError
from src.core.column_mapper import (
    MapOptions,
    TargetField,
    TargetSchema,
    map_columns,
)

CORPUS = Path(__file__).resolve().parents[1] / "test-cases" / "column-mapper-corpus"
TEST_DATA = CORPUS / "test_data"
SCHEMAS = CORPUS / "schemas"


def _read(name: str) -> pd.DataFrame:
    return pd.read_csv(TEST_DATA / name)


def _schema(name: str) -> TargetSchema:
    return TargetSchema.from_file(SCHEMAS / name)


# ---------------------------------------------------------------------------
# UC01 — CRM import
# ---------------------------------------------------------------------------

class TestUC01CrmImport:
    def test_strict_schema_round_trip(self):
        df = _read("uc01_crm_import.csv")
        schema = _schema("uc01_crm_target.json")
        opts = MapOptions.from_preset("strict-schema")
        opts.schema = schema
        res = map_columns(df, opts)

        # Every required target is present after the run.
        for f in schema.fields:
            if f.required:
                assert f.name in res.mapped_df.columns

        # 'owner' default added.
        assert "owner" in res.columns_added
        assert (res.mapped_df["owner"] == "unassigned").all()

        # No unmapped survivors (strict preset drops extras).
        assert res.unmapped_kept == []

        # Reordered to schema order.
        expected_prefix = [f.name for f in schema.fields]
        assert list(res.mapped_df.columns)[: len(expected_prefix)] == expected_prefix

    def test_types_coerced_from_strings(self):
        df = _read("uc01_crm_import.csv")
        schema = _schema("uc01_crm_target.json")
        opts = MapOptions.from_preset("strict-schema")
        opts.schema = schema
        res = map_columns(df, opts)
        # annual_rev → integer (was numeric strings in the source).
        assert pd.api.types.is_integer_dtype(res.mapped_df["annual_rev"])
        # created_date → datetime64.
        assert pd.api.types.is_datetime64_any_dtype(res.mapped_df["created_date"])


# ---------------------------------------------------------------------------
# UC02 — Multi-vendor unification
# ---------------------------------------------------------------------------

class TestUC02MultiVendor:
    @pytest.mark.parametrize("vendor", ["a", "b", "c"])
    def test_each_vendor_normalises_to_canonical(self, vendor):
        df = _read(f"uc02_vendor_{vendor}.csv")
        schema = _schema("uc02_canonical.json")
        opts = MapOptions.from_preset("lenient-schema")
        opts.schema = schema
        opts.fuzzy_threshold = 0.5  # vendor C uses obscure aliases ("FName", "Tel")
        res = map_columns(df, opts)
        # Every required canonical field landed in the output.
        for f in schema.fields:
            if f.required:
                assert f.name in res.mapped_df.columns, (
                    f"vendor {vendor}: missing {f.name}; mapping={res.mapping}"
                )

    def test_concatenated_vendors_share_schema(self):
        # The point of unification: after each vendor goes through the
        # mapper, the resulting frames stack cleanly.
        schema = _schema("uc02_canonical.json")
        opts = MapOptions.from_preset("strict-schema")
        opts.schema = schema
        opts.fuzzy_threshold = 0.5
        frames = [
            map_columns(_read(f"uc02_vendor_{v}.csv"), opts).mapped_df
            for v in ("a", "b", "c")
        ]
        unified = pd.concat(frames, ignore_index=True)
        assert list(unified.columns) == [f.name for f in schema.fields]
        # Total rows = sum of inputs.
        assert len(unified) == sum(len(f) for f in frames)


# ---------------------------------------------------------------------------
# UC03 — Type coercion
# ---------------------------------------------------------------------------

class TestUC03TypeCoercion:
    def test_documented_failures_are_reported(self):
        df = _read("uc03_type_coercion.csv")
        schema = _schema("uc03_types.json")
        opts = MapOptions.from_preset("lenient-schema")
        opts.schema = schema
        res = map_columns(df, opts)
        # Bad rows survive as NaN, with counts recorded.
        assert res.coercion_failures.get("age") == 1
        assert res.coercion_failures.get("score") == 1
        assert res.coercion_failures.get("joined") == 1

    def test_coerced_dtypes(self):
        df = _read("uc03_type_coercion.csv")
        schema = _schema("uc03_types.json")
        opts = MapOptions.from_preset("lenient-schema")
        opts.schema = schema
        res = map_columns(df, opts)
        out = res.mapped_df
        assert pd.api.types.is_integer_dtype(out["id"])
        assert out["active"].dtype.name == "boolean"
        assert pd.api.types.is_datetime64_any_dtype(out["joined"])
        # Float failures NaN-ify.
        assert pd.isna(out["score"].iloc[1])


# ---------------------------------------------------------------------------
# Edge cases
# ---------------------------------------------------------------------------

class TestEC01DuplicateTarget:
    def test_two_sources_to_same_target_raises(self):
        df = _read("ec01_duplicate_target.csv")
        opts = MapOptions(mapping={"a": "x", "b": "x"})
        with pytest.raises(InputValidationError):
            map_columns(df, opts)


class TestEC02UnicodeColumns:
    def test_japanese_column_renamed(self):
        df = _read("ec02_unicode_columns.csv")
        opts = MapOptions(mapping={"名前": "name", "価格": "price"})
        res = map_columns(df, opts)
        assert "name" in res.mapped_df.columns
        assert "price" in res.mapped_df.columns
        # Email passes through (unmapped, kept by default).
        assert "Email" in res.mapped_df.columns


class TestEC03WhitespaceHeaders:
    def test_header_whitespace_does_not_block_match(self):
        df = _read("ec03_whitespace_headers.csv")
        schema = TargetSchema(fields=[
            TargetField(name="first_name", aliases=["First Name"]),
            TargetField(name="last_name", aliases=["Last Name"]),
            TargetField(name="email", aliases=["EmailAddr"]),
        ])
        opts = MapOptions(schema=schema, auto_infer=True)
        res = map_columns(df, opts)
        # All three columns should map despite the leading/trailing spaces.
        assert len(res.mapping) == 3


class TestEC04NoMatch:
    def test_zero_inferred_with_no_match(self):
        df = _read("ec04_no_match.csv")
        schema = TargetSchema(fields=[
            TargetField(name="email"), TargetField(name="phone"),
        ])
        opts = MapOptions(schema=schema, auto_infer=True, unmapped="keep")
        res = map_columns(df, opts)
        assert res.inferred_pairs == {}
        # Source columns survive as-is under keep.
        assert set(df.columns) <= set(res.mapped_df.columns)

    def test_no_match_with_unmapped_error(self):
        df = _read("ec04_no_match.csv")
        schema = TargetSchema(fields=[TargetField(name="email")])
        opts = MapOptions(
            schema=schema, auto_infer=True, unmapped="error",
            enforce_required=False,
        )
        with pytest.raises(InputValidationError):
            map_columns(df, opts)


class TestEC05RequiredMissing:
    def test_required_missing_raises(self):
        df = _read("ec05_required_missing.csv")
        schema = TargetSchema(fields=[
            TargetField(name="first_name", required=True),
            TargetField(name="email", required=True),
        ])
        opts = MapOptions(schema=schema, auto_infer=True, enforce_required=True)
        with pytest.raises(InputValidationError):
            map_columns(df, opts)

    def test_disable_enforce_surfaces_in_result(self):
        df = _read("ec05_required_missing.csv")
        schema = TargetSchema(fields=[
            TargetField(name="first_name", required=True),
            TargetField(name="email", required=True),
        ])
        opts = MapOptions(schema=schema, auto_infer=True, enforce_required=False)
        res = map_columns(df, opts)
        assert "email" in res.missing_required_targets


# ---------------------------------------------------------------------------
# Whole-corpus property tests
# ---------------------------------------------------------------------------

ALL_FIXTURES = sorted(p.name for p in TEST_DATA.glob("*.csv"))


@pytest.mark.parametrize("fixture", ALL_FIXTURES)
def test_map_columns_does_not_mutate_input(fixture):
    df = pd.read_csv(TEST_DATA / fixture)
    snapshot = df.copy(deep=True)
    try:
        map_columns(df, MapOptions())  # identity run; default options.
    except InputValidationError:
        pass  # ec01 / ec05 raise here — fine, mutation is what we care about.
    pd.testing.assert_frame_equal(df, snapshot)
@@ -169,8 +169,23 @@ class TestMojibake:
         assert actual.equals(expected), "14 mojibake default (no repair) differs"

     def test_fixed_variant(self):
-        # --fix-mojibake is Tier 2; the cleaner does not implement it. Mark xfail.
-        pytest.xfail("Mojibake auto-repair is Tier 2; not yet implemented (uses ftfy).")
+        """Mojibake auto-repair (ftfy-backed) restores the original text.
+
+        Skipped automatically when ftfy is not installed — the engine
+        falls back to a no-op in that case and the diff would never close.
+        """
+        try:
+            import ftfy  # noqa: F401
+        except ImportError:
+            pytest.skip("ftfy not installed — install ftfy to enable mojibake repair")
+
+        from src.core.fixes import repair_mojibake
+
+        df = _read_csv_strict(TEST_DATA / "14_mojibake.csv")
+        expected = _read_csv_strict(EXPECTED / "14_mojibake__fixed.csv")
+        repaired, _ = repair_mojibake(df)
+        actual = repaired.reset_index(drop=True)
+        assert actual.equals(expected), "14 mojibake fixed variant differs"


 class TestEmptyFile:
@@ -14,12 +14,11 @@ What's tested
 REJECT / LOW_CONFIDENCE.
 3. The decoded DataFrame matches the canonical reference content.

-Cases where the current implementation is known to fail (charset-
-normalizer label drift on byte-equivalent encodings, ``repair_bytes``
-NUL-strip destroying UTF-16, the "lying BOM" pathological case) are
-marked ``xfail`` so they surface in the report as documented gaps.
-A future fix that makes the case pass will flip xfail to xpass and the
-test owner can drop the marker.
+Detection arbiter (cp1250→cp1252, mac_iceland→mac_roman, lying-BOM
+recovery) and a language-aware probe (Cyrillic / EE-Latin coverage)
+together close every documented gap; the ``KNOWN_*_FAILURES`` dicts
+below are kept empty as a tripwire — re-add an entry only when a real
+limitation surfaces.
 """

 from __future__ import annotations
@@ -41,27 +40,9 @@ REFERENCE_DIR = CORPUS / "reference"

 # Known failures the analyzer does not yet handle correctly. Each entry
 # has a one-line reason — drop the entry once a fix lands.
-KNOWN_DETECTION_FAILURES = {
-    "E03_western_basic_cp1252.csv": "charset-normalizer returns cp1250 for byte-equivalent content",
-    "E04_western_basic_latin1.csv": "charset-normalizer returns cp1250 for byte-equivalent content",
-    "E05_western_basic_latin9.csv": "charset-normalizer returns cp1250 for byte-equivalent content",
-    "E06_western_basic_macroman.csv": "returns mac_iceland (same family) instead of mac_roman",
-    "E11_western_extended_cp1252.csv": "charset-normalizer returns cp1250 for cp1252 content",
-    "E15_eastern_european_iso88592.csv": "charset-normalizer returns cp1258 for ISO-8859-2 content",
-    "E18_cyrillic_koi8r.csv": "charset-normalizer returns shift_jis_2004 for KOI8-R content",
-}
+KNOWN_DETECTION_FAILURES: dict[str, str] = {}

-KNOWN_DECODE_FAILURES = {
-    "E03_western_basic_cp1252.csv": "decoded as cp1250 — different mapping at 0xF1 (ñ vs ń)",
-    "E04_western_basic_latin1.csv": "decoded as cp1250 — different mapping at 0xF1",
-    "E05_western_basic_latin9.csv": "decoded as cp1250 — different mapping at 0xF1",
-    "E10_western_extended_utf8.csv": "byte-level smart-quote fold rewrites U+201C/U+201D to ASCII before parse",
-    "E11_western_extended_cp1252.csv": "wrong encoding + smart-quote fold",
-    "E12_western_extended_utf16le.csv": "byte-level smart-quote fold rewrites U+201C/U+201D before parse",
-    "E15_eastern_european_iso88592.csv": "wrong encoding (cp1258 != ISO-8859-2)",
-    "E18_cyrillic_koi8r.csv": "wrong encoding (shift_jis_2004 != KOI8-R)",
-    "E30_pathological_lying_bom.csv": "utf-8-sig fails on cp1252 body bytes; needs lying-BOM recovery",
-}
+KNOWN_DECODE_FAILURES: dict[str, str] = {}


 def _normalize_encoding(name: str) -> str:
@@ -164,7 +145,12 @@ def _decodable_entries():
     ],
 )
 def test_decoded_matches_reference(entry):
-    df, _, _ = _load_for_analysis(CORPUS / entry["filename"], sample_rows=1000)
+    # The reference files preserve smart quotes — disable byte-level
+    # smart-quote folding so this round-trip identity test isn't
+    # confounded by the analyzer's deliberate parser-safety fold.
+    df, _, _ = _load_for_analysis(
+        CORPUS / entry["filename"], sample_rows=1000, fold_quotes=False,
+    )
     ref_text = REFERENCES[entry["canonical_content_id"]]
     ref_rows = list(csv.reader(io.StringIO(ref_text)))
     if not ref_rows:
@@ -230,8 +230,27 @@ class TestRepairMojibake:


 class TestRepairMojibakeNoFtfy:
-    @pytest.mark.skipif(_HAS_FTFY, reason="ftfy installed — exercises the no-op path")
-    def test_returns_input_unchanged_without_ftfy(self):
+    def test_returns_input_unchanged_without_ftfy(self, monkeypatch):
+        """Exercise the no-op path regardless of whether ftfy is installed.
+
+        ``repair_mojibake`` lazy-imports ftfy inside the function body, so
+        we hide ``ftfy`` from ``sys.modules`` and from import resolution
+        before calling. The function must then degrade to ``(df, 0)``
+        without raising.
+        """
+        import sys
+        import builtins
+
+        monkeypatch.delitem(sys.modules, "ftfy", raising=False)
+        real_import = builtins.__import__
+
+        def fake_import(name, *args, **kwargs):
+            if name == "ftfy" or name.startswith("ftfy."):
+                raise ImportError("ftfy hidden by test")
+            return real_import(name, *args, **kwargs)
+
+        monkeypatch.setattr(builtins, "__import__", fake_import)
+
         df = pd.DataFrame({"x": ["café"]})
         out, changed = repair_mojibake(df)
         assert changed == 0
||||
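The import-hiding trick in `test_returns_input_unchanged_without_ftfy` also works outside pytest. The sketch below packages it as a context manager using only the standard library. `hide_module` and `normalize_or_passthrough` are hypothetical names introduced for illustration, and `unicodedata` stands in for ftfy so the example runs without third-party packages; the `(value, changed)` return shape mirrors the `(df, 0)` contract described in the docstring above.

```python
import builtins
import sys
from contextlib import contextmanager


@contextmanager
def hide_module(name: str):
    """Make ``import name`` raise ImportError inside the block, then restore."""
    prefix = name + "."
    saved = {
        key: mod for key, mod in list(sys.modules.items())
        if key == name or key.startswith(prefix)
    }
    for key in saved:
        del sys.modules[key]
    real_import = builtins.__import__

    def fake_import(mod, *args, **kwargs):
        if mod == name or mod.startswith(prefix):
            raise ImportError(f"{name} hidden by hide_module")
        return real_import(mod, *args, **kwargs)

    builtins.__import__ = fake_import
    try:
        yield
    finally:
        # Always restore the real importer and any cached modules.
        builtins.__import__ = real_import
        sys.modules.update(saved)


def normalize_or_passthrough(text: str):
    """Hypothetical function with a lazily imported optional dependency."""
    try:
        import unicodedata  # stand-in for an optional package such as ftfy
    except ImportError:
        return text, 0
    fixed = unicodedata.normalize("NFC", text)
    return fixed, int(fixed != text)


with hide_module("unicodedata"):
    degraded = normalize_or_passthrough("cafe\u0301")
normal = normalize_or_passthrough("cafe\u0301")
print(degraded, normal)
```

Both steps are needed: removing the cached entry from `sys.modules` alone would let the import machinery load the module again, and patching `__import__` alone would leave any caller that inspects `sys.modules` directly seeing the module as present.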
105
tests/test_format_intl_corpus.py
Normal file
@@ -0,0 +1,105 @@
"""Acceptance corpus for international format standardization.

Stresses the rework's three pillars on a single mixed-locale fixture:
* Per-row country column drives phone parsing.
* ``currency_decimal="auto"`` resolves comma-decimal locales.
* Streaming entry point handles the same content unchanged.
"""

from __future__ import annotations

from pathlib import Path

import pandas as pd
import pytest

from src.core.format_standardize import (
    FieldType,
    StandardizeOptions,
    standardize_dataframe,
    standardize_file,
)

CORPUS = Path(__file__).resolve().parents[1] / "test-cases" / "format-cleaner-corpus" / "international"
FIXTURE = CORPUS / "intl_phones_addresses.csv"


@pytest.fixture(scope="module")
def df():
    return pd.read_csv(FIXTURE, dtype=str, keep_default_na=False)


@pytest.fixture(scope="module")
def options():
    return StandardizeOptions(
        column_types={
            "name": FieldType.NAME,
            "phone": FieldType.PHONE,
            "price": FieldType.CURRENCY,
        },
        phone_country_column="country",
        currency_preserve_code=True,
        currency_decimal="auto",
    )


|
||||
class TestPhonesByRegion:
|
||||
def test_every_row_lands_on_correct_e164_prefix(self, df, options):
|
||||
# Each row's country column drives the per-row region used by
|
||||
# phonenumbers.parse — the correct + prefix is the acceptance bar.
|
||||
res = standardize_dataframe(df, options)
|
||||
out = res.standardized_df
|
||||
# ISO-2 → expected E.164 country code prefix
|
||||
prefix_for_country = {
|
||||
"US": "+1", "GB": "+44", "RU": "+7", "ES": "+34",
|
||||
"FR": "+33", "JP": "+81", "DE": "+49", "IT": "+39",
|
||||
"CN": "+86", "IN": "+91", "EG": "+20", "AU": "+61",
|
||||
"BR": "+55", "MX": "+52", "KR": "+82", "TR": "+90",
|
||||
"IL": "+972", "PL": "+48", "DK": "+45", "SE": "+46",
|
||||
}
|
||||
bad: list[tuple[str, str, str]] = []
|
||||
for _, row in out.iterrows():
|
||||
want = prefix_for_country[row["country"]]
|
||||
got = row["phone"]
|
||||
if not got.startswith(want):
|
||||
bad.append((row["country"], want, got))
|
||||
assert not bad, f"phone prefix mismatches: {bad}"


class TestCurrencyByLocale:
    def test_eu_decimal_comma_resolves_under_auto(self, df, options):
        res = standardize_dataframe(df, options)
        # Spain, France, Germany, Italy, Brazil and Sweden all use a decimal
        # comma. Verify that no comma survives standardization.
        eu_idx = df.index[df["country"].isin(
            ["ES", "FR", "DE", "IT", "BR", "SE"]
        )]
        for i in eu_idx:
            val = res.standardized_df.loc[i, "price"]
            # Either ``CODE NNN.NN`` or bare ``NNN.NN`` is acceptable, but the
            # comma in the source must have become a dot in the output.
            assert "," not in val, (
                f"row {i} ({df.loc[i, 'country']}): comma persisted in {val!r}"
            )

    def test_brl_real_prefix_recognised(self, df, options):
        res = standardize_dataframe(df, options)
        br_row = res.standardized_df[res.standardized_df["country"] == "BR"].iloc[0]
        assert "BRL" in br_row["price"]


class TestStreamingMatchesInMemory:
    def test_same_output_via_streaming(self, tmp_path, df, options):
        # Streaming the same fixture through standardize_file should
        # produce the same typed-column values as the in-memory path.
        in_mem = standardize_dataframe(df, options).standardized_df
        out = tmp_path / "out.csv"
        # Use a chunk size that does not divide the 20-row fixture evenly,
        # so chunk boundaries fall mid-file (7 + 7 + 6 rows).
        res = standardize_file(FIXTURE, out, options, chunk_size=7)
        assert res.rows_processed == len(df)
        streamed = pd.read_csv(out, dtype=str, keep_default_na=False)
        # Compare typed columns only; the others pass through unchanged.
        for col in options.column_types:
            assert streamed[col].tolist() == in_mem[col].astype(str).tolist(), (
                f"column {col} differs between in-memory and streaming"
            )
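
The tests above only check the observable result of `currency_decimal="auto"`: no comma survives for comma-decimal locales. As a rough illustration of how such a mode could decide which separator is the decimal mark, here is a minimal sketch. The function name `normalize_decimal` and its rules are assumptions made for illustration; they are not taken from `src/core/format_standardize.py`.

```python
from __future__ import annotations

import re


def normalize_decimal(text: str) -> str:
    """Return the digits of ``text`` with '.' as the decimal mark.

    Hypothetical heuristic (not the project's parser):
    * both ',' and '.' present: the rightmost one is the decimal mark;
    * one kind only, appearing once with 1-2 trailing digits: decimal mark;
    * anything else is treated as thousands grouping.
    Currency symbols, codes and signs are discarded.
    """
    digits = re.sub(r"[^\d.,]", "", text)
    last_comma, last_dot = digits.rfind(","), digits.rfind(".")
    if last_comma >= 0 and last_dot >= 0:
        decimal = "," if last_comma > last_dot else "."
    else:
        sep = "," if last_comma >= 0 else "."
        tail = digits.rsplit(sep, 1)[-1] if sep in digits else ""
        is_decimal = digits.count(sep) == 1 and 1 <= len(tail) <= 2
        decimal = sep if is_decimal else ""
    # Remove whichever separators were classified as grouping.
    for grouping in {",", "."} - {decimal}:
        digits = digits.replace(grouping, "")
    return digits.replace(decimal, ".") if decimal else digits


for sample in ("R$ 250,00", "$1,500.00", "€850,50", "1.234.567,89"):
    print(sample, "->", normalize_decimal(sample))
```

A heuristic like this stays ambiguous for inputs such as `1,234`, which is why a per-row country hint is useful as a tie-breaker.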
@@ -110,16 +110,16 @@ _DATE_EXPECTED_MDY: dict[str, object] = {
     "FD13": PASSTHROUGH,
     "FD14": PASSTHROUGH,
     "FD15": PASSTHROUGH,
-    # excel serial → 2024-01-15 (xfail - not implemented)
+    # excel serial dates (numeric days since 1899-12-30)
     "FD22": "2024-01-15",
     "FD23": "2024-01-15",
-    # unix timestamp seconds / millis → 2024-01-15 (xfail)
+    # unix timestamps (seconds, milliseconds)
     "FD24": "2024-01-15",
     "FD25": "2024-01-15",
     # partial precision - corpus preserves it
     "FD26": "2024-01",
-    "FD27": "2024-01",  # xfail - text precision
-    "FD28": "2024-Q1",  # xfail - quarter
+    "FD27": "2024-01",  # text precision month
+    "FD28": "2024-Q1",  # quarter
     "FD29": "2024",
     # 2-digit year cutoff (per docs: 1969 wins over 2069)
     "FD30": "1969-01-15",
@@ -135,7 +135,7 @@ _DATE_EXPECTED_MDY: dict[str, object] = {
     "FD37": "2024-01-15",
     # garbage → pass through (corpus 0.3 boundary table)
     # FD38/39/40 → PASSTHROUGH default
-    # locale-specific month names (xfail - not shipped)
+    # locale-specific month names (en/fr/de via month_locales)
     "FD41": "2024-01-15",
     "FD42": "2024-01-15",
     # timezone - corpus 3.3 says fixed-offset only
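
The hunk above turns the Excel-serial and Unix-timestamp cases from xfail into expected values of `2024-01-15`. The conversions involved can be sketched with the standard library alone. The helper names and the seconds-versus-milliseconds cutoff below are assumptions for illustration, not the project's implementation.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

# Day 0 of the serial-date scheme named in the diff comment (1899-12-30).
EXCEL_EPOCH = datetime(1899, 12, 30)

# Assumed cutoff: magnitudes at or above this are read as milliseconds.
# Millisecond stamps before 1973 would be misread as seconds.
MILLIS_CUTOFF = 100_000_000_000


def excel_serial_to_iso(serial: float) -> str:
    """Convert an Excel serial day count to an ISO date (YYYY-MM-DD)."""
    return (EXCEL_EPOCH + timedelta(days=serial)).date().isoformat()


def unix_to_iso(stamp: int) -> str:
    """Convert a Unix timestamp (seconds or milliseconds) to a UTC ISO date."""
    seconds = stamp / 1000 if abs(stamp) >= MILLIS_CUTOFF else stamp
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date().isoformat()


print(excel_serial_to_iso(45306))
print(unix_to_iso(1705276800))
print(unix_to_iso(1705276800000))
```

All three calls print `2024-01-15`, matching the expected value used for FD22 to FD25.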
301
tests/test_format_streaming.py
Normal file
@@ -0,0 +1,301 @@
"""Tests for the format-standardizer rework: cache, vectorized dispatch,
per-row country, audit cap, and streaming entry point."""

from __future__ import annotations

import csv
from pathlib import Path

import pandas as pd
import pytest

from src.core.format_standardize import (
    FieldType,
    StandardizeOptions,
    StreamingStandardizeResult,
    _normalize_region,
    standardize_dataframe,
    standardize_file,
)


# ---------------------------------------------------------------------------
# Per-row country / region
# ---------------------------------------------------------------------------

class TestPerRowCountry:
    def test_phone_uses_per_row_country(self):
        df = pd.DataFrame({
            "phone": ["020 7946 0958", "03-3210-7000", "(415) 555-1234"],
            "country": ["GB", "JP", "US"],
        })
        opts = StandardizeOptions(
            column_types={"phone": FieldType.PHONE},
            phone_country_column="country",
        )
        res = standardize_dataframe(df, opts)
        out = res.standardized_df["phone"].tolist()
        assert out[0].startswith("+44")
        assert out[1].startswith("+81")
        assert out[2].startswith("+1")

    def test_phone_country_full_name_resolved(self):
        df = pd.DataFrame({
            "phone": ["020 7946 0958"],
            "country": ["United Kingdom"],
        })
        opts = StandardizeOptions(
            column_types={"phone": FieldType.PHONE},
            phone_country_column="country",
        )
        res = standardize_dataframe(df, opts)
        assert res.standardized_df["phone"].iloc[0].startswith("+44")

    def test_blank_country_falls_back_to_default(self):
        df = pd.DataFrame({
            "phone": ["(415) 555-1234"],
            "country": [""],  # blank → use default region
        })
        opts = StandardizeOptions(
            column_types={"phone": FieldType.PHONE},
            phone_country_column="country",
            phone_region="US",
        )
        res = standardize_dataframe(df, opts)
        assert res.standardized_df["phone"].iloc[0] == "+14155551234"

    def test_unknown_country_column_raises(self):
        df = pd.DataFrame({"phone": ["x"]})
        opts = StandardizeOptions(
            column_types={"phone": FieldType.PHONE},
            phone_country_column="missing_col",
        )
        from src.core.errors import InputValidationError
        with pytest.raises(InputValidationError):
            standardize_dataframe(df, opts)


class TestNormalizeRegion:
    def test_iso2_passthrough(self):
        assert _normalize_region("US") == "US"
        assert _normalize_region("us") == "US"
        assert _normalize_region(" jp ") == "JP"

    def test_iso3_mapped(self):
        assert _normalize_region("USA") == "US"
        assert _normalize_region("GBR") == "GB"
        assert _normalize_region("JPN") == "JP"

    def test_full_name(self):
        assert _normalize_region("United States") == "US"
        assert _normalize_region("Japan") == "JP"
        assert _normalize_region("Brazil") == "BR"
        assert _normalize_region("brasil") == "BR"
        assert _normalize_region("España") == "ES"

    def test_blank_or_unknown(self):
        assert _normalize_region("") is None
        assert _normalize_region(" ") is None
        assert _normalize_region(None) is None
        assert _normalize_region("xyz-no-such-country") is None


# ---------------------------------------------------------------------------
# Audit cap
# ---------------------------------------------------------------------------

class TestAuditCap:
    def test_cap_truncates_change_rows(self):
        df = pd.DataFrame({
            "phone": ["(415) 555-12{:02d}".format(i) for i in range(50)],
        })
        opts = StandardizeOptions(
            column_types={"phone": FieldType.PHONE},
            audit_max_rows=5,
        )
        res = standardize_dataframe(df, opts)
        # cells_changed counts everything; the audit table is capped.
        assert res.cells_changed == 50
        assert len(res.changes) == 5

    def test_unbounded_audit(self):
        df = pd.DataFrame({
            "phone": ["(415) 555-12{:02d}".format(i) for i in range(20)],
        })
        opts = StandardizeOptions(
            column_types={"phone": FieldType.PHONE},
            audit_max_rows=None,
        )
        res = standardize_dataframe(df, opts)
        assert len(res.changes) == 20


# ---------------------------------------------------------------------------
# Cache + vectorized dispatch (correctness)
# ---------------------------------------------------------------------------

class TestCacheCorrectness:
    def test_repeated_phone_consistent(self):
        # 1000 copies of the same phone should produce identical output.
        df = pd.DataFrame({"phone": ["(415) 555-1234"] * 1000})
        opts = StandardizeOptions(
            column_types={"phone": FieldType.PHONE},
            audit_max_rows=None,
        )
        res = standardize_dataframe(df, opts)
        assert (res.standardized_df["phone"] == "+14155551234").all()
        assert res.cells_changed == 1000

    def test_cache_disabled_still_works(self):
        df = pd.DataFrame({"phone": ["(415) 555-1234", "020 7946 0958"]})
        opts = StandardizeOptions(
            column_types={"phone": FieldType.PHONE},
            cache_size=0,  # disabled
        )
        res = standardize_dataframe(df, opts)
        assert res.standardized_df["phone"].iloc[0] == "+14155551234"


# ---------------------------------------------------------------------------
# Streaming standardize_file
# ---------------------------------------------------------------------------

class TestStandardizeFile:
    def test_basic_streaming(self, tmp_path):
        inp = tmp_path / "in.csv"
        inp.write_text(
            "phone,country,price\n"
            "(415) 555-1234,US,$1500.00\n"
            "020 7946 0958,GB,£99.99\n"
            "03-3210-7000,JP,¥12000\n"
            "+33 1 42 86 82 00,FR,€850.50\n"
        )
        out = tmp_path / "out.csv"
        opts = StandardizeOptions(
            column_types={"phone": FieldType.PHONE, "price": FieldType.CURRENCY},
            phone_country_column="country",
            currency_preserve_code=True,
        )
        res = standardize_file(inp, out, opts, chunk_size=2)
        assert isinstance(res, StreamingStandardizeResult)
        assert res.rows_processed == 4
        assert res.chunks_processed == 2
        assert out.exists()
        out_df = pd.read_csv(out, dtype=str, keep_default_na=False)
        assert out_df["phone"].iloc[0].startswith("+1")
        assert out_df["phone"].iloc[1].startswith("+44")
        assert out_df["phone"].iloc[2].startswith("+81")
        assert out_df["phone"].iloc[3].startswith("+33")

    def test_audit_capped_across_chunks(self, tmp_path):
        # 60 rows, audit cap 10, chunks of 20 → audit must stop at 10.
        inp = tmp_path / "in.csv"
        rows = ["phone\n"] + [f"(415) 555-12{i:02d}\n" for i in range(60)]
        inp.write_text("".join(rows))
        out = tmp_path / "out.csv"
        opts = StandardizeOptions(
            column_types={"phone": FieldType.PHONE},
            audit_max_rows=10,
        )
        res = standardize_file(inp, out, opts, chunk_size=20)
        # Audit file exists and has exactly 10 data rows + 1 header.
        audit_lines = res.audit_path.read_text().splitlines()
        assert len(audit_lines) - 1 == 10

    def test_audit_row_indices_are_global(self, tmp_path):
        # Audit row numbers must reflect absolute file position, not chunk-local.
        inp = tmp_path / "in.csv"
        rows = ["phone\n"] + [f"(415) 555-12{i:02d}\n" for i in range(30)]
        inp.write_text("".join(rows))
        out = tmp_path / "out.csv"
        opts = StandardizeOptions(
            column_types={"phone": FieldType.PHONE},
            audit_max_rows=None,
        )
        res = standardize_file(inp, out, opts, chunk_size=10)
        audit = pd.read_csv(res.audit_path)
        # Rows should be 0..29, monotonically increasing.
        assert audit["row"].tolist() == list(range(30))

    def test_progress_callback_fires(self, tmp_path):
        inp = tmp_path / "in.csv"
        inp.write_text("phone\n" + "\n".join("(415) 555-1234" for _ in range(20)) + "\n")
        out = tmp_path / "out.csv"
        opts = StandardizeOptions(column_types={"phone": FieldType.PHONE})
        seen: list[tuple[int, int]] = []
        def cb(rows, chunks):
            seen.append((rows, chunks))
        standardize_file(inp, out, opts, chunk_size=5, progress_callback=cb)
        assert len(seen) == 4
        assert seen[-1] == (20, 4)

    def test_progress_callback_exception_does_not_abort(self, tmp_path):
        inp = tmp_path / "in.csv"
        inp.write_text("phone\n(415) 555-1234\n")
        out = tmp_path / "out.csv"
        opts = StandardizeOptions(column_types={"phone": FieldType.PHONE})
        def bad_cb(*a, **k):
            raise RuntimeError("boom")
        # Must not raise.
        res = standardize_file(inp, out, opts, chunk_size=1, progress_callback=bad_cb)
        assert res.rows_processed == 1

    def test_missing_input_raises_clean_error(self, tmp_path):
        from src.core.errors import FileAccessError
        opts = StandardizeOptions(column_types={"phone": FieldType.PHONE})
        with pytest.raises(FileAccessError):
            standardize_file(
                tmp_path / "missing.csv",
                tmp_path / "out.csv",
                opts,
            )


# ---------------------------------------------------------------------------
# International coverage smoke
# ---------------------------------------------------------------------------

class TestInternationalCoverage:
    @pytest.mark.parametrize("number,country,prefix", [
        ("020 7946 0958", "GB", "+44"),
        ("03-3210-7000", "JP", "+81"),
        ("+49 30 12345678", "DE", "+49"),
        ("01 42 86 82 00", "FR", "+33"),
        ("+39 06 6982", "IT", "+39"),
        ("+34 91 411 1111", "ES", "+34"),
        ("+86 10 1234 5678", "CN", "+86"),
        ("+91 11 2345 6789", "IN", "+91"),
        ("+61 2 9374 4000", "AU", "+61"),
        ("11 3071 0000", "BR", "+55"),
        ("+52 55 5555 0000", "MX", "+52"),
        ("+82 2 2287 0114", "KR", "+82"),
    ])
    def test_phone_via_per_row_region(self, number, country, prefix):
        df = pd.DataFrame({"phone": [number], "country": [country]})
        opts = StandardizeOptions(
            column_types={"phone": FieldType.PHONE},
            phone_country_column="country",
        )
        res = standardize_dataframe(df, opts)
        out = res.standardized_df["phone"].iloc[0]
        assert out.startswith(prefix), (
            f"{number!r} ({country}): expected to start with {prefix}, got {out!r}"
        )

    @pytest.mark.parametrize("price,want_code", [
        ("$1,500.00", "USD"),
        ("€850,50", "EUR"),
        ("£99.99", "GBP"),
        ("¥12000", "JPY"),
        ("R$ 250,00", "BRL"),
        ("CHF 1200.00", "CHF"),
    ])
    def test_currency_codes_detected(self, price, want_code):
        df = pd.DataFrame({"price": [price]})
        opts = StandardizeOptions(
            column_types={"price": FieldType.CURRENCY},
            currency_preserve_code=True,
            currency_decimal="auto",  # international mode
        )
        res = standardize_dataframe(df, opts)
        assert want_code in res.standardized_df["price"].iloc[0]
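
The streaming tests in this file pin down three behaviours: rows are processed in fixed-size chunks, the audit stops growing at `audit_max_rows` while the change counter keeps counting, and audit row numbers are global rather than chunk-local. A minimal standard-library sketch of that pattern follows. `stream_transform`, `iter_chunks` and `digits_only` are hypothetical names for illustration and do not mirror the signature of `standardize_file`.

```python
from __future__ import annotations

import csv
import io
from itertools import islice
from typing import Callable, Iterator, TextIO


def iter_chunks(rows: Iterator[dict], size: int) -> Iterator[list[dict]]:
    """Yield lists of at most ``size`` rows until the iterator is exhausted."""
    while chunk := list(islice(rows, size)):
        yield chunk


def stream_transform(
    src: TextIO,
    dst: TextIO,
    column: str,
    transform: Callable[[str], str],
    chunk_size: int = 50_000,
    audit_max_rows: int | None = 10_000,
) -> dict:
    """Rewrite ``column`` chunk by chunk and keep a capped audit of changes."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    audit: list[tuple[int, str, str]] = []
    rows = chunks = changed = 0
    for chunk in iter_chunks(reader, chunk_size):
        for row in chunk:
            old = row[column]
            new = transform(old)
            if new != old:
                changed += 1  # counted even once the audit is full
                if audit_max_rows is None or len(audit) < audit_max_rows:
                    audit.append((rows, old, new))  # ``rows`` is the global index
                row[column] = new
            rows += 1
        writer.writerows(chunk)
        chunks += 1
    return {"rows": rows, "chunks": chunks, "changed": changed, "audit": audit}


def digits_only(value: str) -> str:
    """Toy transform standing in for real phone normalisation."""
    return "".join(ch for ch in value if ch.isdigit())


demo_src = io.StringIO("phone\n(415) 555-1200\n(415) 555-1201\n(415) 555-1202\n")
demo_dst = io.StringIO()
summary = stream_transform(demo_src, demo_dst, "phone", digits_only,
                           chunk_size=2, audit_max_rows=1)
print(summary["rows"], summary["chunks"], summary["changed"], len(summary["audit"]))
```

Keeping the running row counter outside the chunk loop is what makes the audit indices global; memory use is bounded by one chunk plus the capped audit list.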
@@ -8,10 +8,8 @@ These cover edges that existing suites missed:
 - ``analyze()`` with ``sample_rows >= len(df)`` (uses copy(), not head()).
 - ``findings_by_tool`` on an empty list.
 - BOM that appears mid-cell rather than at file start.
-
-The collapse-whitespace heuristic for numeric/date/phone-shaped cells (spec
-§4.17) is *not yet implemented* and is captured here as a known-gap xfail
-so it's surfaced rather than silently missing.
+- The collapse-whitespace heuristic for numeric/date/phone-shaped cells
+  (spec §4.17), now wired in via ``_smart_collapse_whitespace``.
 """
 
 from __future__ import annotations
462
tests/test_missing.py
Normal file
@@ -0,0 +1,462 @@
"""Tests for src/core/missing.py."""

from __future__ import annotations

import json

import numpy as np
import pandas as pd
import pytest

from src.core.errors import ConfigError, InputValidationError
from src.core.missing import (
    DEFAULT_SENTINELS,
    MissingOptions,
    PRESETS,
    detect_sentinels,
    handle_missing,
    is_missing_like,
    profile_missing,
)


# ---------------------------------------------------------------------------
# is_missing_like
# ---------------------------------------------------------------------------

class TestIsMissingLike:
    def test_none(self):
        assert is_missing_like(None)

    def test_nan(self):
        assert is_missing_like(np.nan)

    def test_pd_nat(self):
        assert is_missing_like(pd.NaT)

    def test_empty_string(self):
        assert is_missing_like("")

    def test_whitespace_only(self):
        assert is_missing_like(" ")
        assert is_missing_like("\t\n ")

    def test_default_sentinels(self):
        for s in ("N/A", "n/a", "NULL", "null", "-", "--", "?", "TBD", "(blank)"):
            assert is_missing_like(s), f"expected {s!r} to be missing-like"

    def test_case_insensitive(self):
        assert is_missing_like("N/A")
        assert is_missing_like("n/A")
        assert is_missing_like("NA")
        assert is_missing_like("na")

    def test_real_value_not_missing(self):
        assert not is_missing_like("hello")
        assert not is_missing_like("0")
        assert not is_missing_like(0)
        assert not is_missing_like(0.0)

    def test_zero_is_not_missing(self):
        # Common bug: treating 0 / "0" / False as missing.
        assert not is_missing_like(0)
        assert not is_missing_like(False)

    def test_custom_sentinels_override(self):
        assert is_missing_like("xx", sentinels=["xx"])
        assert not is_missing_like("xx", sentinels=["zz"])


# ---------------------------------------------------------------------------
# detect_sentinels
# ---------------------------------------------------------------------------

class TestDetectSentinels:
    def test_counts_by_label(self):
        s = pd.Series(["alice", "N/A", "n/a", "NULL", " ", "", "bob"])
        counts = detect_sentinels(s)
        # 'N/A' and 'n/a' fold to the same sentinel under casefold. Which
        # canonical label they are counted under depends on DEFAULT_SENTINELS,
        # so assert on the totals rather than on per-label keys.
        assert sum(v for k, v in counts.items() if k != "(whitespace)") == 3
        assert counts["(whitespace)"] == 2

    def test_skips_real_nan(self):
        s = pd.Series(["a", np.nan, "N/A"])
        counts = detect_sentinels(s)
        assert sum(counts.values()) == 1

    def test_no_sentinels_returns_empty(self):
        s = pd.Series(["alice", "bob", "charlie"])
        assert detect_sentinels(s) == {}
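
The three tests above describe the counting contract: sentinel strings are matched case-insensitively, empty and whitespace-only strings are grouped under `(whitespace)`, and real missing values are skipped. A minimal sketch of that contract using only the standard library is given below. `count_sentinels` and the `SENTINELS` tuple are hypothetical stand-ins; the project's `detect_sentinels` and `DEFAULT_SENTINELS` may differ.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable

# Hypothetical sentinel list, chosen only for this sketch.
SENTINELS = ("N/A", "NA", "NULL", "-", "--", "?", "TBD", "(blank)")


def count_sentinels(
    values: Iterable[object],
    sentinels: Iterable[str] = SENTINELS,
) -> dict[str, int]:
    """Count sentinel strings per canonical label, case-insensitively."""
    canonical: dict[str, str] = {}
    for label in sentinels:
        canonical.setdefault(label.casefold(), label)  # first spelling wins
    counts: Counter[str] = Counter()
    for value in values:
        if not isinstance(value, str):
            continue  # real None / NaN are not sentinels
        stripped = value.strip()
        if not stripped:
            counts["(whitespace)"] += 1
        elif stripped.casefold() in canonical:
            counts[canonical[stripped.casefold()]] += 1
    return dict(counts)


print(count_sentinels(["alice", "N/A", "n/a", "NULL", " ", "", "bob"]))
```

Mapping each folded spelling to one canonical label keeps the report stable regardless of how the sentinel was capitalised in the data.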


# ---------------------------------------------------------------------------
# profile_missing
# ---------------------------------------------------------------------------

class TestProfileMissing:
    def test_basic(self):
        df = pd.DataFrame({
            "name": ["Alice", "Bob", "N/A", "", "Charlie"],
            "age": [30, None, 25, 40, np.nan],
        })
        prof = profile_missing(df, MissingOptions())
        assert prof.rows_total == 5
        # name: '' + 'N/A' = 2 sentinels; age: 2 NaN
        report_by_col = {r.column: r for r in prof.columns}
        assert report_by_col["name"].missing == 2
        assert report_by_col["age"].missing == 2
        assert prof.cells_missing == 4

    def test_complete_dataframe(self):
        df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
        prof = profile_missing(df, MissingOptions())
        assert prof.cells_missing == 0
        assert prof.rows_complete == 3
        assert prof.rows_with_any_missing == 0

    def test_to_dataframe_columns(self):
        df = pd.DataFrame({"x": [1, None]})
        prof = profile_missing(df, MissingOptions())
        out = prof.to_dataframe()
        assert set(out.columns) >= {"column", "missing", "missing_pct", "top_sentinel"}

    def test_disabled_sentinels_only_counts_real_nan(self):
        df = pd.DataFrame({"x": ["N/A", "alice", np.nan]})
        opts = MissingOptions(standardize_sentinels=False)
        prof = profile_missing(df, opts)
        report_by_col = {r.column: r for r in prof.columns}
        # Only the real NaN counts; 'N/A' is left alone.
        assert report_by_col["x"].missing == 1


# ---------------------------------------------------------------------------
# handle_missing: sentinel standardization
# ---------------------------------------------------------------------------

class TestSentinelStandardization:
    def test_replaces_sentinels_with_nan(self):
        df = pd.DataFrame({"x": ["alice", "N/A", "-", " ", "bob"]})
        res = handle_missing(df, MissingOptions(strategy="none"))
        # 'N/A' + '-' + whitespace-only = 3
        assert res.sentinels_standardized == 3
        assert res.handled_df["x"].isna().sum() == 3
        assert res.handled_df.iloc[0]["x"] == "alice"
        assert res.handled_df.iloc[4]["x"] == "bob"

    def test_audit_records_each_replacement(self):
        df = pd.DataFrame({"x": ["alice", "N/A", "bob"]})
        res = handle_missing(df, MissingOptions(strategy="none"))
        assert len(res.changes) == 1
        assert res.changes.iloc[0]["action"].startswith("standardize:")

    def test_disabled_keeps_sentinels(self):
        df = pd.DataFrame({"x": ["alice", "N/A", "bob"]})
        opts = MissingOptions(standardize_sentinels=False, strategy="none")
        res = handle_missing(df, opts)
        assert res.sentinels_standardized == 0
        assert res.handled_df.iloc[1]["x"] == "N/A"

    def test_custom_sentinels_extend_default(self):
        df = pd.DataFrame({"x": ["alice", "MISSING_DATA", "bob"]})
        opts = MissingOptions(
            sentinels=[*DEFAULT_SENTINELS, "MISSING_DATA"],
            strategy="none",
        )
        res = handle_missing(df, opts)
        assert res.sentinels_standardized == 1


# ---------------------------------------------------------------------------
# handle_missing: fill strategies
# ---------------------------------------------------------------------------

class TestFillStrategies:
    @pytest.fixture
    def numeric_df(self):
        return pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, np.nan]})

    def test_mean(self, numeric_df):
        res = handle_missing(numeric_df, MissingOptions(strategy="mean"))
        # mean of [1, 2, 4] = 7/3
        filled = res.handled_df["x"].iloc[2]
        assert abs(filled - 7.0 / 3.0) < 1e-9
        assert res.cells_filled == 2

    def test_median(self, numeric_df):
        res = handle_missing(numeric_df, MissingOptions(strategy="median"))
        # median of [1, 2, 4] = 2.0
        assert res.handled_df["x"].iloc[2] == 2.0

    def test_mode(self):
        df = pd.DataFrame({"x": ["a", "a", "b", None, None]})
        res = handle_missing(df, MissingOptions(strategy="mode"))
        assert res.handled_df["x"].iloc[3] == "a"
        assert res.handled_df["x"].iloc[4] == "a"
        assert res.cells_filled == 2

    def test_constant_scalar(self, numeric_df):
        res = handle_missing(
            numeric_df,
            MissingOptions(strategy="constant", fill_value=99.0),
        )
        assert res.handled_df["x"].iloc[2] == 99.0
        assert res.handled_df["x"].iloc[4] == 99.0

    def test_constant_per_column(self):
        df = pd.DataFrame({"a": [1, np.nan], "b": ["x", None]})
        opts = MissingOptions(
            strategy="constant",
            column_fill_values={"a": 0, "b": "?"},
        )
        res = handle_missing(df, opts)
        assert res.handled_df["a"].iloc[1] == 0
        assert res.handled_df["b"].iloc[1] == "?"

    def test_ffill(self):
        df = pd.DataFrame({"x": [1.0, np.nan, np.nan, 4.0]})
        res = handle_missing(df, MissingOptions(strategy="ffill"))
        assert list(res.handled_df["x"]) == [1.0, 1.0, 1.0, 4.0]

    def test_bfill(self):
        df = pd.DataFrame({"x": [1.0, np.nan, np.nan, 4.0]})
        res = handle_missing(df, MissingOptions(strategy="bfill"))
        assert list(res.handled_df["x"]) == [1.0, 4.0, 4.0, 4.0]

    def test_interpolate(self):
        df = pd.DataFrame({"x": [1.0, np.nan, np.nan, 4.0]})
        res = handle_missing(df, MissingOptions(strategy="interpolate"))
        assert list(res.handled_df["x"]) == [1.0, 2.0, 3.0, 4.0]

    def test_numeric_strategy_falls_back_for_categorical(self):
        df = pd.DataFrame({"x": ["a", "a", None, "b"]})
        opts = MissingOptions(strategy="median", categorical_strategy="mode")
        res = handle_missing(df, opts)
        assert res.strategy_per_column["x"] == "mode"
        assert res.handled_df["x"].iloc[2] == "a"

    def test_per_column_strategy_overrides_global(self):
        df = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", None]})
        opts = MissingOptions(
            strategy="median",
            column_strategies={"b": "constant"},
            fill_value="??",
        )
        res = handle_missing(df, opts)
        assert res.handled_df["a"].iloc[1] == 1.0  # median of [1.0]
        assert res.handled_df["b"].iloc[1] == "??"

    def test_all_nan_column_safely_skipped(self):
        df = pd.DataFrame({"x": [np.nan, np.nan, np.nan]})
        res = handle_missing(df, MissingOptions(strategy="mean"))
        assert res.cells_filled == 0
        assert res.handled_df["x"].isna().all()


# ---------------------------------------------------------------------------
# handle_missing: drops
# ---------------------------------------------------------------------------

class TestDropStrategies:
    def test_drop_row_any_missing(self):
        # Strict-greater: threshold 0.0 → drop any row with any missing.
        df = pd.DataFrame({
            "a": [1, 2, np.nan, 4],
            "b": ["x", None, "z", "w"],
        })
        opts = MissingOptions(strategy="drop_row", row_drop_threshold=0.0)
        res = handle_missing(df, opts)
        # Rows 1 and 2 each have one missing cell; rows 0 and 3 are clean.
        assert res.rows_dropped == 2
        assert len(res.handled_df) == 2

    def test_drop_row_default_threshold_never_drops(self):
        # Default 1.0 = never drop: no fraction exceeds 100%.
        df = pd.DataFrame({
            "a": [1, 2, np.nan],
            "b": ["x", "y", None],
        })
        opts = MissingOptions(strategy="drop_row")  # threshold defaults to 1.0
        res = handle_missing(df, opts)
        assert res.rows_dropped == 0

    def test_drop_row_partial_threshold(self):
        df = pd.DataFrame({
            "a": [1, np.nan, np.nan, np.nan],
            "b": [10, 20, np.nan, np.nan],
            "c": [100, 200, np.nan, 400],
        })
        # Strict-greater: threshold 0.5 → drop rows with > 50% missing.
        opts = MissingOptions(strategy="drop_row", row_drop_threshold=0.5)
        res = handle_missing(df, opts)
        # row 0: 0/3, row 1: 1/3 (0.33) -> keep
        # row 2: 3/3 (1.0) -> drop, row 3: 2/3 (0.67) -> drop
        assert res.rows_dropped == 2

    def test_drop_col_threshold(self):
        df = pd.DataFrame({
            "keep": [1, 2, 3, 4],
            "drop_me": [np.nan, np.nan, np.nan, 1],  # 75% missing
        })
        # Strict-greater: 0.5 → drop columns with > 50% missing.
        opts = MissingOptions(strategy="drop_col", col_drop_threshold=0.5)
        res = handle_missing(df, opts)
        assert "drop_me" in res.columns_dropped
        assert "keep" not in res.columns_dropped

    def test_drop_both(self):
        df = pd.DataFrame({
            "keep": [1, 2, 3, 4, 5],
            "drop_col": [np.nan] * 5,
            "x": [1, np.nan, 3, np.nan, 5],
        })
        opts = MissingOptions(
            strategy="drop_both",
            col_drop_threshold=0.99,  # >99% missing → drop column
            row_drop_threshold=0.0,  # any missing in remaining cols → drop row
        )
        res = handle_missing(df, opts)
        # drop_col is 100% missing → dropped
        assert "drop_col" in res.columns_dropped
        # Remaining scope (keep + x): rows 1 and 3 have a missing x → drop.
        assert res.rows_dropped == 2

    def test_drop_audit_records_dropped_rows(self):
        df = pd.DataFrame({"a": [1, np.nan], "b": [2, np.nan]})
        # Drop the fully-missing row (frac > 0.99).
        opts = MissingOptions(strategy="drop_row", row_drop_threshold=0.99)
        res = handle_missing(df, opts)
        drop_records = res.changes[res.changes["action"] == "drop_row"]
        assert len(drop_records) == 1
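
The drop tests rely on a strict-greater comparison: a row is dropped only when its missing fraction exceeds the threshold, so `0.0` drops any row with a missing cell and `1.0` never drops. The rule can be sketched without pandas as follows. `rows_to_drop` is a hypothetical helper, and using `None` as the only missing marker is a simplification made for this sketch.

```python
from __future__ import annotations


def rows_to_drop(rows: list[list[object]], threshold: float) -> list[int]:
    """Return indices of rows whose missing fraction is strictly > threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be within [0, 1]")
    drop: list[int] = []
    for index, row in enumerate(rows):
        missing = sum(1 for cell in row if cell is None)
        # Empty rows are kept; the comparison is strict, so a fraction
        # equal to the threshold does not trigger a drop.
        if row and missing / len(row) > threshold:
            drop.append(index)
    return drop


table = [
    [1, 10, 100],
    [None, 20, 200],
    [None, None, None],
    [None, None, 400],
]
print(rows_to_drop(table, 0.5))
```

With the same four rows as `test_drop_row_partial_threshold`, the fractions are 0, 1/3, 1 and 2/3, so only the last two rows exceed a threshold of 0.5.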


# ---------------------------------------------------------------------------
# Scope: columns / skip_columns
# ---------------------------------------------------------------------------

class TestScope:
    def test_columns_filter(self):
        df = pd.DataFrame({"a": [np.nan, 2], "b": [np.nan, 4]})
        opts = MissingOptions(columns=["a"], strategy="constant", fill_value=99)
        res = handle_missing(df, opts)
        assert res.handled_df["a"].iloc[0] == 99
        # b should be untouched
        assert pd.isna(res.handled_df["b"].iloc[0])

    def test_skip_columns(self):
        df = pd.DataFrame({"a": [np.nan, 2], "b": [np.nan, 4]})
        opts = MissingOptions(skip_columns=["b"], strategy="constant", fill_value=99)
        res = handle_missing(df, opts)
        assert res.handled_df["a"].iloc[0] == 99
        assert pd.isna(res.handled_df["b"].iloc[0])

    def test_unknown_column_raises(self):
        df = pd.DataFrame({"a": [1]})
        opts = MissingOptions(columns=["does_not_exist"])
        with pytest.raises(InputValidationError):
            handle_missing(df, opts)


# ---------------------------------------------------------------------------
# Presets / config
# ---------------------------------------------------------------------------

class TestPresets:
    def test_detect_only_does_not_fill(self):
        df = pd.DataFrame({"x": ["alice", "N/A", "bob"]})
        opts = MissingOptions.from_preset("detect-only")
        res = handle_missing(df, opts)
        assert res.sentinels_standardized == 1
        assert res.cells_filled == 0
        assert res.rows_dropped == 0

    def test_safe_fill_fills(self):
        df = pd.DataFrame({"age": [30, np.nan, 25, 40], "name": ["a", "a", None, "b"]})
        opts = MissingOptions.from_preset("safe-fill")
        res = handle_missing(df, opts)
        assert res.cells_filled == 2

    def test_drop_incomplete(self):
        df = pd.DataFrame({"a": [1, np.nan, 3], "b": [10, 20, 30]})
        opts = MissingOptions.from_preset("drop-incomplete")
        res = handle_missing(df, opts)
        assert res.rows_dropped == 1

    def test_unknown_preset_raises(self):
        with pytest.raises(ConfigError):
            MissingOptions.from_preset("does-not-exist")

    def test_roundtrip_to_file(self, tmp_path):
        opts = MissingOptions.from_preset("safe-fill")
        opts.column_strategies = {"age": "median"}
        path = tmp_path / "cfg.json"
        opts.to_file(path)
        loaded = MissingOptions.from_file(path)
        assert loaded.strategy == opts.strategy
        assert loaded.column_strategies == opts.column_strategies


# ---------------------------------------------------------------------------
# Validation
# ---------------------------------------------------------------------------

class TestValidate:
    def test_invalid_strategy(self):
        opts = MissingOptions(strategy="bogus")  # type: ignore[arg-type]
        with pytest.raises(InputValidationError):
            opts.validate()

    def test_threshold_out_of_range(self):
        opts = MissingOptions(row_drop_threshold=1.5)
        with pytest.raises(ConfigError):
            opts.validate()

    def test_handle_missing_validates(self):
        df = pd.DataFrame({"x": [1]})
        opts = MissingOptions(strategy="bogus")  # type: ignore[arg-type]
        with pytest.raises(InputValidationError):
            handle_missing(df, opts)

    def test_non_dataframe_input(self):
        with pytest.raises(InputValidationError):
            handle_missing([1, 2, 3])  # type: ignore[arg-type]
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
# End-to-end realistic case
# ---------------------------------------------------------------------------

class TestEndToEnd:
    def test_messy_customer_export(self):
        df = pd.DataFrame({
            "customer_id": [1, 2, 3, 4, 5, 6],
            "name": ["Alice", "Bob", "N/A", " ", "Charlie", None],
            "email": ["a@x.com", "-", "c@x.com", "d@x.com", "NULL", "f@x.com"],
            "age": [30, np.nan, 25, 40, np.nan, 50],
        })
        opts = MissingOptions(
            standardize_sentinels=True,
            strategy="median",
            categorical_strategy="constant",
            fill_value="UNKNOWN",
        )
        res = handle_missing(df, opts)

        # Sentinels: "N/A" and " " in name; "-" and "NULL" in email.
        # (The None in name is a real NaN, not a sentinel.)
        # Whitespace + 'N/A' on name = 2; '-' + 'NULL' on email = 2. Total = 4.
        assert res.sentinels_standardized == 4
        # name has 3 missing after standardize (N/A, " ", None) → constant fill
        # email has 2 missing → constant fill
        # age has 2 missing → median (35.0 of [30, 25, 40, 50])
        assert res.cells_filled == 7
        assert res.handled_df["name"].isna().sum() == 0
        assert res.handled_df["email"].isna().sum() == 0
        assert res.handled_df["age"].isna().sum() == 0
        assert (res.handled_df["name"] == "UNKNOWN").sum() == 3
        assert (res.handled_df["age"] == 35.0).sum() == 2  # median of [30, 25, 40, 50]

    def test_input_not_mutated(self):
        df = pd.DataFrame({"x": ["N/A", "alice", np.nan]})
        df_copy = df.copy()
        handle_missing(df, MissingOptions.from_preset("safe-fill"))
        pd.testing.assert_frame_equal(df, df_copy)
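The counts asserted in `test_messy_customer_export` above can be checked by hand. The following is a minimal sketch in plain Python, not the engine's implementation: the `SENTINELS` set and the `is_sentinel` helper are assumptions made for illustration (the real sentinel list lives in `src/core/missing.py` and is not shown in this diff). It only reproduces the arithmetic behind the three expected numbers.

```python
from statistics import median

# Hypothetical sentinel set for illustration; the engine's real list is larger.
SENTINELS = {"N/A", "-", "NULL"}


def is_sentinel(value):
    """Return True for a disguised null: a known sentinel or whitespace-only text."""
    if not isinstance(value, str):
        return False  # None / NaN are real missing values, not sentinels
    return value in SENTINELS or value.strip() == ""


name = ["Alice", "Bob", "N/A", " ", "Charlie", None]
email = ["a@x.com", "-", "c@x.com", "d@x.com", "NULL", "f@x.com"]
age = [30, None, 25, 40, None, 50]

# Sentinels are counted only on text cells; the None in `name` is not one.
sentinels = sum(is_sentinel(v) for v in name + email)

# After standardizing, every sentinel and every real null needs a fill.
missing_name = sum(v is None or is_sentinel(v) for v in name)
missing_email = sum(v is None or is_sentinel(v) for v in email)
missing_age = sum(v is None for v in age)
cells_to_fill = missing_name + missing_email + missing_age

# Median of the four observed ages.
age_median = median(v for v in age if v is not None)

print(sentinels, cells_to_fill, age_median)
```

Under these assumptions the sketch yields 4 sentinels, 7 cells to fill, and a median age of 35.0, which is what the test asserts.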
463
tests/test_missing_corpus.py
Normal file
@@ -0,0 +1,463 @@
"""Acceptance corpus for the Missing Value Handler.

Loads every fixture in ``test-cases/missing-corpus/test_data/`` and
asserts the documented behaviour. The fixtures are split into:

* ``uc##`` — three target-client use cases (Shopify operator,
  marketing analyst, consultant intake).
* ``ec##`` — edge cases the engine must handle without surprise:
  all-NaN columns, zeros that aren't missing, Excel errors, unicode
  whitespace, mixed dtypes, padding, single row/column, every default
  sentinel, per-column constants, drop thresholds, leading-NaN ffill,
  numeric-strategy fallback for non-numeric columns, headers-only,
  idempotency.

Each test runs through the public API (``handle_missing``) so any
regression in the engine surfaces here. Fixture files double as living
documentation for what the tool is supposed to do.
"""

from __future__ import annotations

import io
from pathlib import Path

import numpy as np
import pandas as pd
import pytest

from src.core.missing import (
    MissingOptions,
    handle_missing,
    is_missing_like,
    profile_missing,
)

CORPUS = Path(__file__).resolve().parents[1] / "test-cases" / "missing-corpus"
TEST_DATA = CORPUS / "test_data"


def _read(name: str, *, dtype_str: bool = False) -> pd.DataFrame:
    """Load a corpus CSV.

    By default we let pandas infer dtypes — that's the most realistic
    intake path (Excel exports keep numeric columns numeric). A handful
    of cases pass ``dtype_str=True`` to keep sentinels visible in
    columns that would otherwise be coerced to float.
    """
    path = TEST_DATA / name
    if dtype_str:
        return pd.read_csv(path, dtype=str, keep_default_na=False)
    return pd.read_csv(path)


# ---------------------------------------------------------------------------
# Use case 1 — Shopify operator: detect-only
# ---------------------------------------------------------------------------

class TestUC01ShopifyExport:
    """SMB operator standardizes disguised nulls before reimporting."""

    def test_detect_only_replaces_sentinels(self):
        df = _read("uc01_shopify_export.csv", dtype_str=True)
        opts = MissingOptions.from_preset("detect-only")
        res = handle_missing(df, opts)
        # Spot-check known sentinels from the fixture
        assert res.sentinels_standardized > 0
        assert res.cells_filled == 0
        assert res.rows_dropped == 0

        # Fields that contained 'N/A', '-', 'NULL', '(blank)', '#N/A',
        # 'n/a', '?', '(none)' should now be NaN.
        for row, col in [
            (1, "phone"),            # 'N/A'
            (2, "city"),             # '-'
            (3, "total_orders"),     # 'NULL'
            (5, "phone"),            # ' '
            (5, "last_order_date"),  # '(blank)'
            (6, "last_order_date"),  # '#N/A'
            (7, "phone"),            # 'n/a'
            (8, "city"),             # '?'
            (9, "total_orders"),     # '(none)'
        ]:
            assert pd.isna(res.handled_df.iloc[row][col]), (
                f"Expected NaN at row {row} col {col}, got "
                f"{res.handled_df.iloc[row][col]!r}"
            )

    def test_real_values_preserved(self):
        df = _read("uc01_shopify_export.csv", dtype_str=True)
        res = handle_missing(df, MissingOptions.from_preset("detect-only"))
        # First row should be untouched.
        assert res.handled_df.iloc[0]["first_name"] == "Alice"
        assert res.handled_df.iloc[0]["email"] == "alice@shop.com"
        assert res.handled_df.iloc[0]["lifetime_value"] == "1240.50"

    def test_audit_log_complete(self):
        df = _read("uc01_shopify_export.csv", dtype_str=True)
        res = handle_missing(df, MissingOptions.from_preset("detect-only"))
        # One audit row per sentinel replacement.
        assert len(res.changes) == res.sentinels_standardized
        assert set(res.changes["action"].apply(lambda s: s.startswith("standardize:"))) == {True}


# ---------------------------------------------------------------------------
# Use case 2 — Marketing analyst: safe-fill
# ---------------------------------------------------------------------------

class TestUC02MarketingAudience:
    """Marketer fills numeric columns with median, categorical with mode."""

    def test_safe_fill_clears_all_missing(self):
        df = _read("uc02_marketing_audience.csv")
        opts = MissingOptions.from_preset("safe-fill")
        res = handle_missing(df, opts)
        # Every cell in scope should be filled.
        assert res.profile_after.cells_missing == 0
        assert res.cells_filled > 0

    def test_numeric_uses_median_categorical_uses_mode(self):
        df = _read("uc02_marketing_audience.csv")
        opts = MissingOptions.from_preset("safe-fill")
        res = handle_missing(df, opts)
        # 'age' is numeric → median strategy
        assert res.strategy_per_column["age"] == "median"
        # 'segment' / 'region' / 'source' are object → mode fallback
        assert res.strategy_per_column["segment"] == "mode"
        assert res.strategy_per_column["region"] == "mode"

    def test_per_column_override(self):
        df = _read("uc02_marketing_audience.csv")
        opts = MissingOptions.from_preset("safe-fill")
        opts.column_strategies = {"source": "constant"}
        opts.column_fill_values = {"source": "unknown"}
        res = handle_missing(df, opts)
        # Cells previously holding sentinels in 'source' should now equal "unknown".
        assert (res.handled_df["source"] == "unknown").sum() >= 3

    def test_consent_real_false_not_dropped(self):
        # 'consent' column has empty cells but also explicit "true"; mode fill
        # must not silently change a real "true" to anything else.
        df = _read("uc02_marketing_audience.csv")
        res = handle_missing(df, MissingOptions.from_preset("safe-fill"))
        original_trues = (df["consent"] == "true").sum()
        result_trues = (res.handled_df["consent"] == "true").sum()
        # Filled rows can become "true" (mode) but should not lose existing trues.
        assert result_trues >= original_trues


# ---------------------------------------------------------------------------
# Use case 3 — Consultant intake: threshold drops + fill
# ---------------------------------------------------------------------------

class TestUC03ConsultantIntake:
    """Drop sparse columns and rows, then fill the survivors."""

    def test_drop_col_removes_legacy_fields(self):
        df = _read("uc03_consultant_intake.csv", dtype_str=True)
        # internal_id_legacy and beta_field are 100% missing — drop them.
        opts = MissingOptions(
            standardize_sentinels=True,
            strategy="drop_col",
            col_drop_threshold=0.99,
        )
        res = handle_missing(df, opts)
        assert "internal_id_legacy" in res.columns_dropped
        assert "beta_field" in res.columns_dropped

    def test_drop_row_removes_mostly_empty_respondents(self):
        df = _read("uc03_consultant_intake.csv", dtype_str=True)
        opts = MissingOptions(
            standardize_sentinels=True,
            strategy="drop_both",
            col_drop_threshold=0.99,  # drop the legacy / beta cols first
            row_drop_threshold=0.5,   # then drop rows with >50% missing
        )
        res = handle_missing(df, opts)
        # R-002, R-005, R-007, R-010 are mostly-empty respondents.
        assert res.rows_dropped >= 4
        # Non-empty respondents survive.
        kept_ids = set(res.handled_df["respondent_id"].tolist())
        for survivor in ("R-001", "R-003", "R-006", "R-008", "R-009", "R-012"):
            assert survivor in kept_ids


# ---------------------------------------------------------------------------
# Edge cases
# ---------------------------------------------------------------------------

class TestEC01AllNanColumn:
    def test_fill_skips_all_nan_column(self):
        df = _read("ec01_all_nan_column.csv")
        res = handle_missing(df, MissingOptions(strategy="mean"))
        # Mean of all-NaN is NaN — engine must NOT fabricate a value.
        assert res.handled_df["deprecated_field"].isna().all()
        assert res.cells_filled == 0

    def test_drop_col_catches_all_nan(self):
        df = _read("ec01_all_nan_column.csv")
        res = handle_missing(
            df, MissingOptions(strategy="drop_col", col_drop_threshold=0.99),
        )
        assert "deprecated_field" in res.columns_dropped
        assert "name" not in res.columns_dropped


class TestEC02NoMissing:
    def test_clean_file_is_noop(self):
        df = _read("ec02_no_missing.csv")
        res = handle_missing(df, MissingOptions.from_preset("safe-fill"))
        assert res.sentinels_standardized == 0
        assert res.cells_filled == 0
        assert res.rows_dropped == 0
        pd.testing.assert_frame_equal(res.handled_df, df)


class TestEC03ZeroIsNotMissing:
    def test_zero_preserved(self):
        df = _read("ec03_zero_is_not_missing.csv")
        res = handle_missing(df, MissingOptions.from_preset("safe-fill"))
        # Original zeros remain zero.
        assert (res.handled_df["balance"] == 0).sum() == (df["balance"] == 0).sum()
        assert (res.handled_df["count"] == 0).sum() == (df["count"] == 0).sum()
        # No spurious changes recorded.
        assert res.cells_filled == 0
        assert res.sentinels_standardized == 0

    def test_is_missing_like_zero_predicate(self):
        # Direct predicate check — zeros, false, "0" must all be non-missing.
        assert not is_missing_like(0)
        assert not is_missing_like(0.0)
        assert not is_missing_like(False)
        assert not is_missing_like("0")
        assert not is_missing_like("0.00")


class TestEC04ExcelErrors:
    def test_excel_error_sentinels_recognized(self):
        df = _read("ec04_excel_errors.csv", dtype_str=True)
        res = handle_missing(df, MissingOptions(strategy="none"))
        # 6 error sentinels in the fixture: #N/A, #NULL!, #VALUE!, #N/A, #N/A, #NULL!
        assert res.sentinels_standardized == 6


class TestEC05UnicodeWhitespace:
    def test_nbsp_and_ideographic_space_count_as_missing(self):
        df = _read("ec05_unicode_whitespace.csv", dtype_str=True)
        res = handle_missing(df, MissingOptions(strategy="none"))
        # rows 1, 2, 4 contain NBSP / tab / ideographic space respectively
        assert res.handled_df["note"].isna().sum() == 3
        assert res.handled_df.iloc[0]["note"] == "hello"
        assert res.handled_df.iloc[3]["note"] == "real"


class TestEC06MixedDtypes:
    def test_mixed_column_falls_back_to_mode(self):
        # Read with native dtypes so 'real_num' stays numeric.
        df = _read("ec06_mixed_dtypes.csv")
        opts = MissingOptions(
            standardize_sentinels=True,
            strategy="median",
            categorical_strategy="mode",
        )
        res = handle_missing(df, opts)
        # mixed_col holds 'N/A' / 'hello' alongside numbers → object dtype,
        # median falls back to mode.
        assert res.strategy_per_column["mixed_col"] == "mode"
        # real_num is float dtype → median runs.
        assert res.strategy_per_column["real_num"] == "median"


class TestEC07RealDataWithPadding:
    def test_padded_real_data_not_treated_as_missing(self):
        df = _read("ec07_real_data_with_padding.csv", dtype_str=True)
        res = handle_missing(df, MissingOptions(strategy="none"))
        # Only row 1 (name=" ") and row 2 (city=blank) should become NaN.
        # " Alice ", " Bob ", " SF" must remain.
        assert res.handled_df.iloc[0]["name"] == " Alice "
        assert res.handled_df.iloc[2]["name"] == " Bob "
        assert res.handled_df.iloc[3]["city"] == " SF"


class TestEC08SingleRow:
    def test_single_row_handles_cleanly(self):
        df = _read("ec08_single_row.csv", dtype_str=True)
        # detect-only
        res = handle_missing(df, MissingOptions(strategy="none"))
        assert res.sentinels_standardized == 2  # 'N/A' + ''
        # safe-fill on a one-row file: median/mode of a single value is itself.
        res2 = handle_missing(df, MissingOptions.from_preset("safe-fill"))
        assert res2.handled_df.iloc[0]["name"] == "Alice"


class TestEC09SingleColumn:
    def test_single_column_works(self):
        df = _read("ec09_single_column.csv", dtype_str=True)
        res = handle_missing(df, MissingOptions(strategy="none"))
        # 'N/A', whitespace-only ' ', '-' = 3 sentinels
        assert res.sentinels_standardized == 3
        assert res.handled_df["value"].isna().sum() == 3


class TestEC10AllSentinelVariants:
    def test_every_default_sentinel_recognized(self):
        df = _read("ec10_all_sentinel_variants.csv", dtype_str=True)
        res = handle_missing(df, MissingOptions(strategy="none"))
        # 20 sentinels + 1 real value
        assert res.sentinels_standardized == 20
        # The 'real_value' row stays.
        assert (res.handled_df["sentinel_value"] == "real_value").sum() == 1


class TestEC11ConstantPerColumn:
    def test_per_column_fill_values(self):
        df = _read("ec11_constant_per_column.csv", dtype_str=True)
        opts = MissingOptions(
            strategy="constant",
            column_fill_values={
                "country": "USA",
                "salary": "0",
                "department": "Unassigned",
            },
        )
        res = handle_missing(df, opts)
        # Fixture has 1 UK row + 2 USA rows + 2 blanks. Filling blanks with
        # "USA" yields 4 USA total; UK is preserved.
        assert (res.handled_df["country"] == "USA").sum() == 4
        assert (res.handled_df["country"] == "UK").sum() == 1
        assert (res.handled_df["department"] == "Unassigned").sum() >= 2


class TestEC12DropThresholdBoundary:
    def test_threshold_one_never_drops(self):
        # threshold 1.0 + strict-greater = never drop.
        df = _read("ec12_drop_threshold_boundary.csv")
        opts = MissingOptions(strategy="drop_row", row_drop_threshold=1.0)
        res = handle_missing(df, opts)
        assert res.rows_dropped == 0

    def test_threshold_just_under_one_drops_fully_missing(self):
        # threshold 0.99: drop only fully-missing rows (frac > 0.99 → frac == 1.0).
        df = _read("ec12_drop_threshold_boundary.csv")
        opts = MissingOptions(
            strategy="drop_row",
            row_drop_threshold=0.99,
            columns=["a", "b", "c", "d"],  # exclude id from the scope
        )
        res = handle_missing(df, opts)
        # Only row 3 (id=4, all four are NaN) qualifies.
        assert res.rows_dropped == 1

    def test_threshold_half_drops_majority_missing(self):
        df = _read("ec12_drop_threshold_boundary.csv")
        opts = MissingOptions(
            strategy="drop_row",
            row_drop_threshold=0.5,
            columns=["a", "b", "c", "d"],
        )
        res = handle_missing(df, opts)
        # Missing fractions across [a,b,c,d]:
        #   row 0: 0/4=0.0   keep
        #   row 1: 2/4=0.5   keep (strict >, not equal)
        #   row 2: 3/4=0.75  drop
        #   row 3: 4/4=1.0   drop
        #   row 4: 2/4=0.5   keep
        assert res.rows_dropped == 2

    def test_threshold_zero_drops_any_missing(self):
        df = _read("ec12_drop_threshold_boundary.csv")
        opts = MissingOptions(
            strategy="drop_row",
            row_drop_threshold=0.0,
            columns=["a", "b", "c", "d"],
        )
        res = handle_missing(df, opts)
        # Every body row except row 0 has at least one missing.
        assert res.rows_dropped == 4


class TestEC13FfillLeadingNan:
    def test_leading_nan_run_survives_ffill(self):
        df = _read("ec13_ffill_leading_nan.csv")
        res = handle_missing(df, MissingOptions(strategy="ffill"))
        # First two rows (leading NaN) remain NaN — there's nothing to fill from.
        assert pd.isna(res.handled_df["price"].iloc[0])
        assert pd.isna(res.handled_df["price"].iloc[1])
        # Mid-series gets filled forward.
        assert res.handled_df["price"].iloc[3] == 100.0
        assert res.handled_df["price"].iloc[4] == 100.0
        # Trailing NaN gets filled by the last seen value.
        assert res.handled_df["price"].iloc[6] == 150.0


class TestEC14InterpolateFallback:
    def test_interpolate_on_non_numeric_falls_back(self):
        df = _read("ec14_interpolate_fallback.csv", dtype_str=True)
        opts = MissingOptions(
            strategy="interpolate",
            categorical_strategy="mode",
        )
        res = handle_missing(df, opts)
        # All columns are object dtype here → fallback to mode.
        assert res.strategy_per_column["category"] == "mode"
        assert res.strategy_per_column["value"] == "mode"


class TestEC15HeadersOnly:
    def test_empty_body_does_not_crash(self):
        df = _read("ec15_headers_only.csv")
        # All operations must be no-ops on an empty body.
        for preset in ("detect-only", "safe-fill", "drop-incomplete"):
            res = handle_missing(df, MissingOptions.from_preset(preset))
            assert len(res.handled_df) == 0
            assert res.cells_filled == 0
            assert res.rows_dropped == 0


class TestEC16Idempotency:
    def test_safe_fill_is_idempotent(self):
        df = _read("ec16_idempotent_apply.csv", dtype_str=True)
        opts = MissingOptions.from_preset("safe-fill")
        first = handle_missing(df, opts)
        second = handle_missing(first.handled_df, opts)
        # Second pass should make no further changes.
        pd.testing.assert_frame_equal(
            second.handled_df.reset_index(drop=True),
            first.handled_df.reset_index(drop=True),
        )
        assert second.cells_filled == 0
        assert second.sentinels_standardized == 0

    def test_detect_only_is_idempotent(self):
        df = _read("ec16_idempotent_apply.csv", dtype_str=True)
        opts = MissingOptions.from_preset("detect-only")
        first = handle_missing(df, opts)
        second = handle_missing(first.handled_df, opts)
        assert second.sentinels_standardized == 0


# ---------------------------------------------------------------------------
# Whole-corpus property tests
# ---------------------------------------------------------------------------

ALL_FIXTURES = sorted(p.name for p in TEST_DATA.glob("*.csv"))


@pytest.mark.parametrize("fixture", ALL_FIXTURES)
def test_handle_missing_does_not_mutate_input(fixture):
    """Every fixture must leave the input DataFrame untouched."""
    df = pd.read_csv(TEST_DATA / fixture, dtype=str, keep_default_na=False)
    if df.empty and len(df.columns) == 0:
        pytest.skip(f"{fixture}: completely empty file")
    snapshot = df.copy(deep=True)
    handle_missing(df, MissingOptions.from_preset("safe-fill"))
    pd.testing.assert_frame_equal(df, snapshot)


@pytest.mark.parametrize("fixture", ALL_FIXTURES)
def test_profile_runs_on_every_fixture(fixture):
    """``profile_missing`` must succeed on every corpus file."""
    df = pd.read_csv(TEST_DATA / fixture, dtype=str, keep_default_na=False)
    prof = profile_missing(df, MissingOptions())
    assert prof.rows_total == len(df)
    assert prof.cells_total == len(df) * len(df.columns)
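The `ec12` drop-threshold tests in the corpus file above rely on a strict "greater than" comparison between a row's missing fraction and `row_drop_threshold`. The sketch below illustrates that rule in plain Python. It is an assumption inferred from the test comments ("strict >, not equal"), not the engine's code: `rows_to_drop` is a hypothetical helper, and the rows are invented to match the missing fractions documented in the test (0/4, 2/4, 3/4, 4/4, 2/4), since the fixture file itself is not shown in this diff.

```python
def rows_to_drop(rows, threshold):
    """Return indices of rows whose missing fraction is strictly above threshold."""
    dropped = []
    for i, row in enumerate(rows):
        if not row:
            continue  # no cells in scope, nothing to measure
        frac = sum(v is None for v in row) / len(row)
        if frac > threshold:  # strict: a row sitting exactly on the threshold is kept
            dropped.append(i)
    return dropped


# Hypothetical rows over columns [a, b, c, d]; None marks a missing cell.
rows = [
    [1, 2, 3, 4],              # 0/4 missing
    [1, None, 3, None],        # 2/4 missing
    [None, None, 3, None],     # 3/4 missing
    [None, None, None, None],  # 4/4 missing
    [None, 2, None, 4],        # 2/4 missing
]

# Number of rows dropped at each threshold exercised by the tests.
counts = {t: len(rows_to_drop(rows, t)) for t in (1.0, 0.99, 0.5, 0.0)}
print(counts)
```

With a strict comparison, a threshold of 1.0 can never drop a row, and 0.5 keeps the two rows that are exactly half missing. The resulting counts (0, 1, 2, 4) line up with the four `rows_dropped` assertions in `TestEC12DropThresholdBoundary`.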
324
tests/test_pipeline.py
Normal file
@@ -0,0 +1,324 @@
"""Tests for src/core/pipeline.py."""

from __future__ import annotations

import json

import numpy as np
import pandas as pd
import pytest

from src.core.errors import ConfigError, InputValidationError
from src.core.pipeline import (
    Pipeline,
    PipelineResult,
    SOFT_DEPENDENCIES,
    Step,
    StepResult,
    TOOL_ADAPTERS,
    TOOL_NAMES,
    recommended_pipeline,
    run_pipeline,
    validate_pipeline,
)


# ---------------------------------------------------------------------------
# Step / Pipeline construction
# ---------------------------------------------------------------------------

class TestStep:
    def test_unknown_tool_raises(self):
        with pytest.raises(ConfigError):
            Step(tool="bogus_tool")

    def test_default_options_empty_dict(self):
        s = Step(tool="text_clean")
        assert s.options == {}
        assert s.enabled is True

    def test_display_name_falls_back_to_tool(self):
        assert Step(tool="dedup").display_name() == "dedup"
        assert Step(tool="dedup", name="Final dedup").display_name() == "Final dedup"


class TestPipelineSerialization:
    def test_roundtrip_dict(self):
        p = Pipeline(steps=[
            Step("text_clean", {"trim": True}),
            Step("dedup", {"survivor_rule": "first"}),
        ])
        out = p.to_dict()
        loaded = Pipeline.from_dict(out)
        assert len(loaded.steps) == 2
        assert loaded.steps[0].tool == "text_clean"
        assert loaded.steps[1].options["survivor_rule"] == "first"

    def test_roundtrip_file(self, tmp_path):
        p = Pipeline(steps=[Step("text_clean")])
        path = tmp_path / "p.json"
        p.to_file(path)
        loaded = Pipeline.from_file(path)
        assert loaded.steps[0].tool == "text_clean"

    def test_from_dict_missing_steps_key(self):
        with pytest.raises(ConfigError):
            Pipeline.from_dict({})

    def test_from_dict_missing_tool(self):
        with pytest.raises(ConfigError):
            Pipeline.from_dict({"steps": [{"options": {}}]})


# ---------------------------------------------------------------------------
# recommended_pipeline
# ---------------------------------------------------------------------------

class TestRecommendedPipeline:
    def test_default_order(self):
        p = recommended_pipeline()
        assert [s.tool for s in p.steps] == [
            "text_clean", "format_standardize", "missing", "dedup",
        ]

    def test_default_passes_validation(self):
        p = recommended_pipeline()
        assert validate_pipeline(p) == []

    def test_include_overrides_default(self):
        p = recommended_pipeline(include=["text_clean", "missing"])
        assert [s.tool for s in p.steps] == ["text_clean", "missing"]

    def test_options_seed_reaches_step(self):
        p = recommended_pipeline(options={"text_clean": {"trim": False}})
        assert p.steps[0].options == {"trim": False}

    def test_unknown_tool_raises(self):
        with pytest.raises(InputValidationError):
            recommended_pipeline(include=["bogus"])

    def test_can_place_column_map_first_or_last(self):
        # Both placements must be acceptable per the docstring.
        first = recommended_pipeline(include=[
            "column_map", "text_clean", "format_standardize", "missing", "dedup",
        ])
        last = recommended_pipeline(include=[
            "text_clean", "format_standardize", "missing", "column_map", "dedup",
        ])
        # No soft-dependency rule names column_map, so neither warns.
        assert validate_pipeline(first) == []
        assert validate_pipeline(last) == []


# ---------------------------------------------------------------------------
# validate_pipeline — soft dependencies
# ---------------------------------------------------------------------------

class TestValidatePipeline:
    def test_in_order_no_warnings(self):
        p = recommended_pipeline()
        assert validate_pipeline(p) == []

    def test_dedup_before_text_clean_warns(self):
        p = Pipeline(steps=[Step("dedup"), Step("text_clean")])
        ws = validate_pipeline(p)
        assert len(ws) == 1
        assert "dedup" in ws[0] and "text_clean" in ws[0]

    def test_format_before_text_clean_warns(self):
        p = Pipeline(steps=[Step("format_standardize"), Step("text_clean")])
        ws = validate_pipeline(p)
        assert any("format_standardize" in w for w in ws)

    def test_disabled_steps_ignored(self):
        # Disabled dedup-first should not trigger a warning.
        p = Pipeline(steps=[
            Step("dedup", enabled=False),
            Step("text_clean"),
        ])
        assert validate_pipeline(p) == []

    def test_duplicate_tool_does_not_double_warn(self):
        # text_clean twice (legitimate: two-pass cleaning) shouldn't
        # generate redundant warnings.
        p = Pipeline(steps=[
            Step("text_clean"),
            Step("text_clean"),
        ])
        assert validate_pipeline(p) == []


# ---------------------------------------------------------------------------
# run_pipeline — execution
# ---------------------------------------------------------------------------

@pytest.fixture
def messy_df():
    return pd.DataFrame({
        "name": [" Alice ", "BOB", "N/A", "", "charlie "],
        "phone": ["(415) 555-1234", "+44 20 7946 0958", "03-3210-7000", "", "(415) 555-1234"],
        "country": ["US", "GB", "JP", "", "US"],
    })


class TestRunPipeline:
|
||||
def test_recommended_pipeline_runs_end_to_end(self, messy_df):
|
||||
p = recommended_pipeline(options={
|
||||
"format_standardize": {
|
||||
"column_types": {"phone": "phone"},
|
||||
"phone_country_column": "country",
|
||||
},
|
||||
"missing": {"strategy": "none"},
|
||||
})
|
||||
res = run_pipeline(messy_df, p)
|
||||
assert isinstance(res, PipelineResult)
|
||||
assert res.initial_rows == 5
|
||||
# Dedup at the end removes the Alice/charlie duplicate (same phone).
|
||||
assert res.final_rows < res.initial_rows
|
||||
assert res.warnings == []
|
||||
|
||||
def test_initial_df_not_mutated(self, messy_df):
|
||||
snapshot = messy_df.copy(deep=True)
|
||||
run_pipeline(messy_df, recommended_pipeline())
|
||||
pd.testing.assert_frame_equal(messy_df, snapshot)
|
||||
|
||||
def test_disabled_step_skipped(self, messy_df):
|
||||
p = Pipeline(steps=[
|
||||
Step("text_clean", enabled=False),
|
||||
Step("missing", options={"strategy": "none"}),
|
||||
])
|
||||
res = run_pipeline(messy_df, p)
|
||||
assert res.step_results[0].skipped is True
|
||||
assert res.step_results[1].skipped is False
|
||||
|
||||
def test_step_results_ordered_and_timed(self, messy_df):
|
||||
p = recommended_pipeline(options={
|
||||
"missing": {"strategy": "none"},
|
||||
})
|
||||
res = run_pipeline(messy_df, p)
|
||||
assert len(res.step_results) == 4
|
||||
for sr in res.step_results:
|
||||
assert sr.elapsed_seconds >= 0
|
||||
assert [sr.step.tool for sr in res.step_results] == [
|
||||
"text_clean", "format_standardize", "missing", "dedup",
|
||||
]
|
||||
|
||||
def test_warnings_returned_but_run_proceeds(self, messy_df):
|
||||
p = Pipeline(steps=[
|
            Step("dedup"),
            Step("text_clean"),
        ])
        res = run_pipeline(messy_df, p)
        assert res.warnings  # warnings present
        # Both steps still ran.
        assert all(not sr.skipped for sr in res.step_results)

    def test_progress_callback_fires_per_step(self, messy_df):
        seen: list[StepResult] = []
        p = Pipeline(steps=[
            Step("text_clean"),
            Step("missing", options={"strategy": "none"}),
        ])
        run_pipeline(messy_df, p, on_step_complete=seen.append)
        assert len(seen) == 2
        assert all(isinstance(s, StepResult) for s in seen)

    def test_progress_callback_exception_does_not_abort(self, messy_df):
        def bad(_sr):
            raise RuntimeError("boom")
        p = Pipeline(steps=[Step("text_clean")])
        # Must not raise.
        res = run_pipeline(messy_df, p, on_step_complete=bad)
        assert res.final_rows == 5

    def test_stop_on_error_default(self, messy_df):
        # Force an error by giving format_standardize a non-existent column.
        p = Pipeline(steps=[
            Step("format_standardize", options={
                "column_types": {"does_not_exist": "phone"},
            }),
        ])
        with pytest.raises(InputValidationError):
            run_pipeline(messy_df, p)

    def test_continue_on_error_carries_previous_df(self, messy_df):
        p = Pipeline(steps=[
            Step("text_clean"),
            Step("format_standardize", options={
                "column_types": {"does_not_exist": "phone"},
            }),
            Step("missing", options={"strategy": "none"}),
        ])
        res = run_pipeline(messy_df, p, stop_on_error=False)
        # Step 2 errored, step 3 still ran.
        assert res.step_results[1].error is not None
        assert res.step_results[2].error is None
        assert res.final_rows == 5

    def test_non_dataframe_input(self):
        with pytest.raises(InputValidationError):
            run_pipeline([1, 2, 3], recommended_pipeline())  # type: ignore[arg-type]


# ---------------------------------------------------------------------------
# Per-tool adapter sanity
# ---------------------------------------------------------------------------

class TestAdapters:
    @pytest.mark.parametrize("tool", TOOL_NAMES)
    def test_adapter_with_default_options_runs(self, tool, messy_df):
        # Each adapter must accept an empty options dict and return a
        # (df, summary) pair.
        out_df, summary = TOOL_ADAPTERS[tool](messy_df, {})
        assert isinstance(out_df, pd.DataFrame)
        assert isinstance(summary, dict)

    def test_format_standardize_adapter_passes_column_types(self, messy_df):
        out, summary = TOOL_ADAPTERS["format_standardize"](
            messy_df, {"column_types": {"phone": "phone"}},
        )
        assert summary["columns_processed"] == ["phone"]

    def test_dedup_adapter_with_unknown_survivor_rule_raises(self, messy_df):
        with pytest.raises(ConfigError):
            TOOL_ADAPTERS["dedup"](messy_df, {"survivor_rule": "bogus"})


# ---------------------------------------------------------------------------
# SOFT_DEPENDENCIES integrity
# ---------------------------------------------------------------------------

class TestSoftDependencies:
    def test_every_pair_uses_known_tools(self):
        for earlier, later, _ in SOFT_DEPENDENCIES:
            assert earlier in TOOL_NAMES
            assert later in TOOL_NAMES

    def test_all_reasons_non_empty(self):
        for _, _, why in SOFT_DEPENDENCIES:
            assert why and isinstance(why, str)
            # Reason should be a sentence: at least 20 chars.
            assert len(why) > 20

    def test_dependencies_form_a_dag(self):
        # No cycles: there must exist a topological ordering of the
        # tools such that every soft dependency (earlier, later)
        # is satisfied. With 5 tools and 6 deps this is easy to verify.
        from collections import defaultdict, deque
        edges: dict[str, list[str]] = defaultdict(list)
        in_degree: dict[str, int] = {t: 0 for t in TOOL_NAMES}
        for earlier, later, _ in SOFT_DEPENDENCIES:
            edges[earlier].append(later)
            in_degree[later] += 1
        queue = deque(t for t, d in in_degree.items() if d == 0)
        order = []
        while queue:
            t = queue.popleft()
            order.append(t)
            for nxt in edges[t]:
                in_degree[nxt] -= 1
                if in_degree[nxt] == 0:
                    queue.append(nxt)
        assert len(order) == len(TOOL_NAMES), (
            f"SOFT_DEPENDENCIES contain a cycle; topo order={order}"
        )
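
The DAG test above only exercises the passing case. As a rough illustration of why a short ordering signals a cycle, the sketch below runs the same Kahn-style loop over made-up data. The helper `topo_order` and the tool names `a`, `b`, `c` are hypothetical and are not part of this repository.

```python
from collections import defaultdict, deque


def topo_order(nodes, deps):
    """Kahn's algorithm: the result is shorter than `nodes` if deps contain a cycle."""
    edges = defaultdict(list)
    in_degree = {n: 0 for n in nodes}
    for earlier, later, _reason in deps:
        edges[earlier].append(later)
        in_degree[later] += 1
    # Start from every node that nothing depends on.
    queue = deque(n for n, d in in_degree.items() if d == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in edges[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                queue.append(nxt)
    return order


# Hypothetical tool names and edges, for illustration only.
tools = ["a", "b", "c"]
acyclic = [("a", "b", "reason"), ("b", "c", "reason")]
cyclic = acyclic + [("c", "a", "reason")]

assert topo_order(tools, acyclic) == ["a", "b", "c"]
# With the back edge c -> a, no node ever reaches in-degree 0.
assert topo_order(tools, cyclic) == []
```

Nodes on a cycle never reach in-degree zero, so they are never queued; comparing `len(order)` against the node count is therefore enough to detect the cycle.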