feat: 3 new tools, format streaming, distribution-ready demo + landing pages

Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-01 22:31:26 +00:00
parent d18b95880d
commit 966af8ef94
89 changed files with 12039 additions and 284 deletions

4
.gitignore vendored
View File

@@ -10,3 +10,7 @@ build/
# Claude Code agent worktrees + local settings
.claude/
# Landing-page deploy outputs and operator config (real URLs, not committed)
landing/dist/
landing/deploy.config.json

View File

@@ -9,12 +9,12 @@ Local CSV / Excel cleaning. CLI + browser GUI, no cloud, no install ceremony.
| 01 | **Deduplicator** — exact + fuzzy match, 5 normalizers, survivor rules, audit | Ready |
| 02 | **Text Cleaner** — whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | **Format Standardizer** — dates, phones, emails, addresses, names, currencies, booleans | Ready |
| 04 | Missing Value Handler | Coming Soon |
| 05 | Column Mapper | Coming Soon |
| 04 | **Missing Value Handler** — disguised-null detection, profile, mean/median/mode/ffill/bfill/interpolate, drop strategies | Ready |
| 05 | **Column Mapper** — fuzzy auto-rename, target schema with type coercion, required fields with defaults, drop/reorder | Ready |
| 06 | Outlier Detector | Coming Soon |
| 07 | Multi-File Merger | Coming Soon |
| 08 | Validator & Reporter | Coming Soon |
| 09 | Pipeline Runner | Coming Soon |
| 09 | **Pipeline Runner** — chain tools with recommended (not forced) order, save/load JSON, automate weekly cleanups | Ready |
## Install
@@ -31,10 +31,14 @@ Python 3.10+ required.
streamlit run src/gui/app.py
```
**CLI**three entry points:
**CLI**seven entry points:
```bash
python -m src.cli customers.csv [--apply] # dedup
python -m src.cli_text_clean messy.csv [--apply] # text clean
python -m src.cli_format intl.csv [--apply] # format standardize (auto-streams >100 MB)
python -m src.cli_missing holes.csv [--apply] # missing values
python -m src.cli_column_map vendor.csv [--apply] # column mapper
python -m src.cli_pipeline any_file.csv [--apply] # chain tools end-to-end
python -m src.cli_analyze any_file.csv [--json] # scan only
```

332
docs/DEMO-PLAN.md Normal file
View File

@@ -0,0 +1,332 @@
# Demo Plan — DataTools
> Creator-only. Implements PLAN.md §2.2 (the demo IS the product) and
> §2.3 (niche down — three landing pages, one engine).
> **Version**: 1.0 · **Adopted**: 2026-05-01 · **Owner**: Michael
The hosted demo is the single highest-leverage marketing asset in the
plan. This document defines exactly what loads, in what order, with
what data, for which buyer — so the operator builds it once and never
rebuilds it from a stale headline.
## 1. Goals
- Convert a cold visitor to a paid buyer in **under three minutes** of
active interaction.
- Demonstrate the *full pipeline* (not one tool) on a dataset that
*looks like the visitor's own work* — not a toy CSV.
- Survive zero attention to maintenance — once running, the demo
should keep working as the engine evolves (the pre-saved pipeline
JSONs use the same code path the paid product uses).
- Provide a shareable artifact for niche-community posts (a public URL
the operator can drop into a subreddit reply with one sentence).
## 2. Constraints (non-negotiable)
| Constraint | Source | Implication |
|---|---|---|
| Free hosting at launch | BUSINESS.md §9 | Streamlit Community Cloud (1 GB RAM, sleeps after 7 days idle) |
| No login | BUSINESS.md §7 | No email gate, no signup wall, no "create account to continue" |
| Async / no-touch | DECISIONS.md §1 #8 | Cannot offer "schedule a demo with us" CTA |
| Runs locally on paid product | BUSINESS.md §11 | Demo can't expose the same engine to abuse — needs row caps |
| Friction kills conversion | BUSINESS.md §7 | Demo dataset preloaded; no "select a file" first-step |
| < $1,200/mo recurring | BUSINESS.md §9 | Migration plan to $5/mo VPS only after rate-limit signal |
## 3. The three personas (per PLAN.md §2.3)
| Tag | Persona | Top-of-funnel keyword | Demo dataset | Pre-saved pipeline |
|---|---|---|---|---|
| `shopify-pet` | Shopify operator (priority: pet supplies) | "shopify customer cleanup" | `samples/demo/shopify_pet_customers.csv` | `shopify_pet_pipeline.json` |
| `bookkeeper` | Bookkeeper / freelance accountant | "reconcile bank export csv" | `samples/demo/bookkeeper_bank_reconcile.csv` | `bookkeeper_bank_pipeline.json` |
| `revops` | Marketing / RevOps agency | "dedupe lead list across vendors" | `samples/demo/agency_combined_leads.csv` | `agency_leads_pipeline.json` |
Each persona gets its **own landing page URL**, its **own demo dataset
loaded by default**, and its **own H1 + below-the-fold copy.** The
engine is identical; only positioning differs.
## 4. Demo dataset specifications
Each dataset is intentionally small (~1525 rows) so the full pipeline
runs in well under one second on Streamlit Community Cloud's free
hardware. Each row is a *plausible-looking* export from that
persona's tooling. Each contains every kind of pollution the bundle's
five tools fix, so a single demo run shows every tool earning its
keep.
### 4.0 Pain-point coverage map
Each demo dataset is engineered so the buyer sees their **own top
pain** demonstrated in the AFTER preview. The mapping below pairs
each pain from PLAN.md §2.3a with the rows / columns that exercise
it. Refresh the dataset only when this coverage drops.
| Persona | Pain (from PLAN §2.3a) | Demo coverage |
|---|---|---|
| Shopify pet | S1 — Klaviyo per-contact dupes | 5 dup pairs across rows 115 (case + format + address-twin variants) |
| Shopify pet | S2 — feed-rejection chars | smart-quote / NBSP / BOM in rows 16, 9, 11 |
| Shopify pet | S3 — multi-channel | partner-style customer IDs (`SHOP-`); demonstration of column-level mapping covered in RevOps demo |
| Shopify pet | S4 — subscription identity | rows 1+2, 7+8, 9+10 — same person, different format |
| Shopify pet | S5 — VAT-MOSS country drift | rows 1618 (`United Kingdom` / `U.K.` / `UK`) + rows 1920 (`Germany`/`Italia`) |
| Bookkeeper | B1 — month-overlap re-import | 7 dup pairs spanning Jan↔Feb and Mar boundaries |
| Bookkeeper | B2 — 1099 vendor consolidation | Amazon × 3 spellings, Verizon × 2, Acme Realty × 2, Adobe × 2, Costco × 2, Zoom × 2, Stripe × 4 |
| Bookkeeper | B3 — audit trail | every cell change in the run logged with old/new/rule — surface in the demo's audit tab |
| Bookkeeper | B4 — per-license economics | demonstrated by pricing copy, not data |
| Bookkeeper | B5 — multi-currency | rows 26 (EUR), 27 (GBP), 28 (BRL with comma decimal), 29 (parens-negative) |
| RevOps | R1 — per-contact tier | 6 cross-source dup pairs (HubSpot × LinkedIn × Manual Scrape) |
| RevOps | R2 — deliverability | rows 2627 (`uma at uniform dot com`, `victor@@victorco.com` invalid emails) |
| RevOps | R3 — GDPR / privacy | demonstrated by the network-tab moat panel + zero-upload claim |
| RevOps | R4 — vendor unification | 3 source values (HubSpot / LinkedIn / Manual Scrape), 13 country codes, mixed-shape headers |
| RevOps | R5 — suppression list | rows 2930 (`Suppressed`, `Opted Out` tags) |
### 4.1 `shopify_pet_customers.csv` (20 rows)
**Looks like**: a Shopify customer export filtered for "Pet Supplies"
sales channel, 12 months activity.
**Pollution included**:
- Whitespace padding (" Alice ", "Sydney Opera House Drive ")
- Mixed phone formats: `(415) 555-1234`, `415.555.1234`, `5559876543`,
`+1 555-111-1111`
- International phones: GB, ES, DE, AU, JP (15 demo rows span 6
countries)
- Currency variants: `$1,240.50`, `£890.25`, `€2.410,75` (EU comma
decimal), `A$ 1,299.00`, `¥75000`
- Date formats: `2025-12-04`, `12/15/2025`, `?`, `(blank)`, `(none)`,
`#N/A`
- Disguised nulls: `N/A`, blank, `(blank)`, `?`, `#N/A`, `(none)`,
`unknown`
- Name casing: `EVE MARTINEZ`, `henry`, `O'NEIL`, `noah`, mixed Title /
ALL CAPS / lower
- Email case variants that *should* dedup: `Bob@PetShop.com` vs
`alice@petshop.com`
- 4 fuzzy duplicates (Alice/Bob same address, Grace/Henry same phone,
Carlos/Olivia same address, Ivy/Jack same address)
**After running the pipeline**: 20 rows → 15, ~29 cells canonicalized,
~45 sentinels standardised, 5 cross-row duplicates merged. The
customer table is now Klaviyo-import-ready and the country column
(previously `UK` / `U.K.` / `United Kingdom` / `Germany` / `Italia`)
is GB / DE / IT — VAT MOSS report won't break.
### 4.2 `bookkeeper_bank_reconcile.csv` (30 rows)
**Looks like**: two months of business checking + credit-card activity
exported from a bank portal, with the Feb export accidentally
overlapping the Jan export at the month boundary.
**Pollution included**:
- Mixed date formats: `01/15/2025`, `2025-01-15`, `Jan 18 2025`,
`1/27/25`, `Feb 5 2025`
- Currency formats: `-$129.99`, `($89.50)` parens-negative,
`+$3,450.00`, `- $599.88` space, bare `-129.99`, `(50.00)`
- Header trailing whitespace: `"Date "`
- Smart quotes around descriptions: `"autopay"`
- Em-dash sentinels in Vendor: `—`
- Smart-em-dash inside descriptions: `STAPLES #4422 — paper, toner`
- Vendor casing inconsistency: `Amazon` / `amazon.com` / `AMAZON.COM`,
`Verizon` / `verizon`
- 6 duplicate transactions (same date+amount+vendor recorded twice
with different formats)
**After running the pipeline**: 30 rows → 23, ~84 cells normalized, 7
duplicates removed (month-overlap + VAT-MOSS dups). All dates
ISO-formatted, all amounts numeric (including EUR/GBP/BRL with comma
decimal), vendor casing canonical, parens-negative resolved.
### 4.3 `agency_combined_leads.csv` (30 rows)
**Looks like**: a marketing-ops worksheet combining lead exports from
HubSpot + LinkedIn Sales Navigator + manual scraping, ready for
campaign targeting.
**Pollution included**:
- Phone formats per region: US, UK, Spain, Germany, China, India,
Australia, Mexico, Israel, Singapore, Hong Kong, Italy, South
Korea — 13 country codes
- Country column inconsistent: `USA` / `US` / `United States`
- Disguised nulls: `N/A`, `unknown`, `(unknown)`, `(blank)`, `(none)`,
`?`, `—`, `#N/A`, `TBD`
- Source column tags origin (`HubSpot` / `LinkedIn` / `Manual Scrape`)
- Email duplicates across sources with case variants: `alice@acme.com`
+ `Alice.Johnson@acme.com`, `bob@beta.com` + `Bob@Beta.com`,
`diana@delta.com` from two sources, `carlos@gamma.io` from two
sources, `Frank@Foxtrot.de` + `frank@foxtrot.de`
- Name casing: `DIANA LEE`, `henry`, `IVY CHEN`, mixed
- 6 fuzzy / cross-source duplicates designed to survive the dedup
- Score column with sentinel pollution that needs coercion to integer
**After running the pipeline**: 30 rows → 24, ~43 cells canonicalized,
14 sentinels resolved, 6 cross-source duplicates merged with `merge=true`
so each survivor inherits the most-complete picture. Invalid-email
rows (deliverability stress) and `Suppressed`/`Opted Out` tags
(suppression-list use case) survive as flagged rows the operator
manually reviews.
## 5. UX flow (per persona)
The demo is a single Streamlit page (likely
`src/gui/pages/0_Review.py` repurposed for demo mode, or a
dedicated `app_demo.py` for the cloud build).
```
┌──────────────────────────────────────────────────────────┐
│ DataTools — for {Persona} │
│ "{Persona-specific H1}" │
├──────────────────────────────────────────────────────────┤
│ │
│ Sample dataset preloaded: shopify_pet_customers.csv │
│ [Replace with your own file (capped 100 rows)] │
│ │
│ ┌─ BEFORE preview (15 rows) ─────────────────────────┐ │
│ │ Alice | (415) 555-1234 | $1,240.50 | … │ │
│ │ Bob | 415.555.1234 | $1,240.50 | … │ │
│ │ ... │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ Pipeline (saved): │
│ 1. Text Clean → 2. Format Standardize → │
│ 3. Missing → 4. Deduplicate │
│ │
│ [▶ Run pipeline] │
│ │
│ ┌─ AFTER preview ───────────────────────────────────┐ │
│ │ 15 rows → 11 (4 duplicates merged) │ │
│ │ 27 cells canonicalized · 33 sentinels resolved │ │
│ │ │ │
│ │ Alice Johnson | +14155551234 | 1240.50 | … │ │
│ │ ... │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ [Download cleaned CSV (sample, watermarked)] │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Like what you see? │ │
│ │ Run this on YOUR 50,000-row export — locally. │ │
│ │ No upload. Your data never leaves your machine. │ │
│ │ [Get DataTools — $49 →] │ │
│ └──────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
```
**Critical UX points**:
- Sample dataset is *already loaded* on page paint. Visitor never
sees an empty state.
- BEFORE table is shown side-by-side with AFTER once the run
completes. Hidden-character toggle on by default so the visitor
*sees* what was hidden in their data.
- "Replace with your own file" is a secondary action below the BEFORE
table — not the headline.
- Per-step metrics are shown in the AFTER block: "27 cells
canonicalized, 33 sentinels resolved, 4 duplicates merged." Numbers
sell more than narrative.
- Buy button is **inside** the AFTER block and **above the fold** when
the run completes. Friction kills.
## 6. Free vs paid boundary
The demo runs the **same code** as the paid product. Caps are surface,
not engine.
| Limit | Free demo | Paid (downloaded) |
|---|---|---|
| Input rows | 100 | unlimited (1 GB+ via streaming) |
| File size | 5 MB | unlimited |
| Output | watermarked CSV ("DataTools demo — buy at <url>" appended as last row) | clean CSV |
| Pipeline editor | locked to the persona-saved pipeline | full edit / save / load JSON |
| Save pipeline JSON | disabled | enabled |
| International | enabled | enabled |
| Audit log download | disabled | enabled |
| Tool 0609 | as they ship | as they ship |
The watermark is a **single trailing row**, not an in-cell tag — so
the demo's AFTER preview *visibly* reads as production-quality data,
not "demo crippled" data.
## 7. CTA copy (per persona)
### 7.1 Shopify pet operator
- **H1**: *Clean your customer / vendor / subscriber exports — locally.*
- **Sub**: *Klaviyo-import-ready in 30 seconds. Catches duplicates Excel
misses. Your data never leaves your computer.*
- **CTA**: *Get DataTools for Shopify — $49 →*
### 7.2 Bookkeeper / freelance accountant
- **H1**: *Reconcile messy bank exports. Hand your client an audit
trail.*
- **Sub**: *Catches the duplicate transaction Quickbooks imported twice.
Standardizes dates, amounts, vendor casing. Every change auditable.*
- **CTA**: *Get DataTools for Bookkeepers — $49 →*
### 7.3 Marketing / RevOps agency
- **H1**: *Dedupe leads across HubSpot, LinkedIn, and manual scrapes.*
- **Sub**: *International phones, country normalization, fuzzy dedup
with merge — one tool, one schema, no upload.*
- **CTA**: *Get DataTools for RevOps — $49 →*
## 8. Telemetry / conversion tracking
Async + no-touch + free hosting limits what we can instrument. Use
event-only counters, no PII:
| Event | Source | Aggregate-only field |
|---|---|---|
| `demo.page_view` | landing page | persona tag |
| `demo.run_clicked` | demo page | persona tag |
| `demo.run_completed` | demo page | persona tag, rows_processed |
| `demo.cta_clicked` | demo page | persona tag |
| `gumroad.purchase` | Gumroad webhook | landing-page-source query param (`?from=shopify-pet`) |
Conversion = `cta_clicked / run_completed`. Demo-quality issue surfaces
when `run_completed / page_view` < 30 % (visitors not engaging).
Self-host counters on Cloudflare Pages (free, GDPR-friendly). No
Google Analytics — adds privacy banner, conflicts with the "your data
never leaves your computer" message.
## 9. Maintenance plan
**Recurring**: zero. The demo runs on the same engine the paid
product ships, so any improvement to the engine improves the demo
automatically. The pre-saved pipeline JSONs reference column names
and tool names, both stable APIs.
**Triggers for revisit**:
| Trigger | Action |
|---|---|
| Streamlit Community Cloud rate-limits / sleeps too aggressively | Migrate to a $510/mo VPS (BUSINESS.md §9 contingency) |
| Demo dataset becomes stale (e.g. all phones standardize to no-op) | Refresh with a new pollution batch — *don't change the persona* |
| `run_completed / page_view < 30 %` for 4 consecutive weeks | Audit the demo: is the BEFORE preview showing the mess clearly? Is the AFTER too small to notice? |
| `cta_clicked / run_completed < 5 %` for 4 consecutive weeks | The demo is impressive but the CTA isn't earning trust — revise copy + add a screenshot of the network tab showing zero outbound calls (PLAN.md §2.4) |
| New tool ships (0609) | Decide *per persona* whether to add it to that persona's saved pipeline. Not all tools belong on all personas |
## 10. Build sequence (drops into PLAN.md week 2)
| Day | Action |
|---|---|
| 1 | Demo build of Streamlit app: 3 personas, switch via query param `?p=shopify-pet` |
| 2 | Pipeline JSONs wired in; row cap + watermark applied; download button |
| 3 | Deploy to Streamlit Community Cloud · 3 sub-paths or 3 separate apps |
| 4 | Persona landing pages: 3 static HTML pages on Cloudflare Pages, each with iframe embed of its persona demo + CTA |
| 5 | Telemetry counters wired (Cloudflare event API) · Gumroad webhook captures `?from=` |
End of day 5: three URLs the operator can drop into three different
niche-community threads, each performing its own conversion math.
## 11. Anti-temptations (things the demo deliberately refuses)
- **No "try it on your data first" gate that requires email.** The
whole point is friction-free.
- **No "schedule a demo" CTA.** Locked by no-touch.
- **No live chat widget.** Same.
- **No A/B-test framework yet.** Single-arm copy, ship it, iterate
monthly. A/B requires statistical traffic the funnel doesn't have
pre-PMF.
- **No watermark inside cells.** The AFTER preview must look
production-quality. Watermark goes on a single trailing row that's
obviously the demo signature.
- **No animation / loader theatrics.** Pipeline runs in <1 s; a
fake-progress bar lies about speed.

236
docs/DEPLOYMENT.md Normal file
View File

@@ -0,0 +1,236 @@
# Deployment — demo + landing pages
> One page. Two services. ~30 minutes from "code complete" to
> "URL the user can hit." Every step here is from-scratch reproducible
> on a clean laptop.
> **Version**: 1.0 · **Adopted**: 2026-05-01
This doc covers the **two distribution surfaces** that ship to public
URLs: the Streamlit demo (the iframe target) and the Cloudflare Pages
landing pages (the marketing surface that embeds it).
The *paid* product — PyInstaller installers, code-signing, Gumroad
listing — is covered in `docs/NEXT-STEPS.md`.
---
## Part 1 · Deploy the demo (Streamlit Community Cloud — free)
### A. Pre-flight (one-time, ~2 min)
You need a free [Streamlit Community Cloud](https://streamlit.io/cloud)
account. Sign in with the GitHub account that hosts this repo.
### B. Deploy (~5 min, mostly waiting for the Cloud build)
1. **Push the repo to GitHub** (private or public — both work). The
important files are at the **repo root**:
- `streamlit_app.py` — Cloud auto-detects this; nothing to configure
- `requirements.txt` — Cloud installs from this
- `.streamlit/config.toml` — Cloud honours this
- `samples/demo/*.csv` + `*_pipeline.json` — the demo's data
- `src/` — the engine
2. In Streamlit Community Cloud → **New app**:
- Repository: your fork
- Branch: `main`
- Main file path: `streamlit_app.py` (the default — leave it)
- App URL: `datatools-demo` (or any free subdomain)
- **Deploy**
3. First build is 23 min while Cloud installs `pandas`, `phonenumbers`,
`rapidfuzz`, etc. Subsequent deploys are < 30 s.
### C. Verify
Open the deployed URL. Append `?p=shopify-pet` to the URL bar —
the persona-specific demo loads. Try `?p=bookkeeper` and
`?p=revops` to confirm all three personas route correctly. Click
**Run pipeline**; the AFTER preview should appear within ~1 second.
### D. The output URL
The deployed URL is what feeds into `landing/deploy.config.json`
`demo_base_url`. Without trailing slash. For example:
https://datatools-demo.streamlit.app
### E. Migration trigger
Per `BUSINESS.md` §9 / `DEMO-PLAN.md` §9, migrate to a $510/mo VPS
when:
- Streamlit Community Cloud rate-limits / sleeps too aggressively, OR
- the demo crosses ~5 k page-views/month (free-tier capacity)
The migration is one command if you containerise:
`docker run -p 8501:8501 -v $(pwd):/app python:3.12-slim …`
---
## Part 2 · Deploy the landing pages (Cloudflare Pages — free)
### A. Pre-flight (one-time, ~5 min)
You need:
- A Cloudflare account (free) and a domain (any registrar) with
nameservers pointed at Cloudflare. **OR** skip the custom domain
step and use the auto-generated `*.pages.dev` URL.
- A Gumroad listing URL (placeholder until your account is set up —
use `https://gumroad.com/l/datatools` and update it later).
### B. Build the deploy-ready bundle (~30 sec)
```bash
# One-time: copy the template
cp landing/deploy.config.example.json landing/deploy.config.json
# Edit it with your real URLs
edit landing/deploy.config.json
# Build
python3 landing/deploy.py
# → produces landing/dist/
```
`landing/deploy.config.json` is **gitignored**; your real URLs never
hit the repo.
### C. Deploy (~3 min)
Two paths — pick one:
**Drag-and-drop (zero CLI):**
1. Cloudflare Pages dashboard → **Create project****Direct Upload**
2. Drag `landing/dist/` into the upload zone
3. Project name: `datatools` (becomes `datatools.pages.dev`)
4. Click **Deploy**
**Wrangler CLI (one command, scriptable):**
```bash
npm install -g wrangler # one-time
wrangler login # one-time
wrangler pages deploy landing/dist
```
### D. Custom domain (~5 min, optional)
Pages dashboard → your project → **Custom domains** → add
`datatools.app` (or whichever apex domain you registered). Cloudflare
auto-issues TLS. Once propagated:
- `https://datatools.app/` → apex chooser
- `https://datatools.app/shopify-pet/` → Shopify landing
- `https://datatools.app/bookkeeper/` → Bookkeeper landing
- `https://datatools.app/revops/` → RevOps landing
### E. Verify
For each persona:
1. Open the persona URL.
2. Confirm the demo iframe loads (the URL inside it points at the
Streamlit demo from Part 1).
3. Click "Run pipeline" inside the iframe → AFTER preview appears.
4. Click the "Get DataTools" button → opens Gumroad with the
correct `?from=<persona>` query (verify in the URL bar).
If the iframe shows "Refused to connect", check Cloudflare Pages →
**Settings****Functions** for any CSP that disallows Streamlit's
domain. (Default Pages config does not set CSP, so this is rarely an
issue.)
---
## Part 3 · Updates
The cycle is:
```bash
# 1) Edit code or copy
edit landing/<persona>/index.html
edit src/gui/app_demo.py
# 2) Rebuild landing
python3 landing/deploy.py
# 3) Re-deploy landing
wrangler pages deploy landing/dist
# 4) Re-deploy demo
git push origin main
# (Streamlit Cloud auto-deploys on push)
```
Both surfaces deploy in under 5 minutes end-to-end.
---
## Part 4 · Sanity checks (post-deploy, ~3 min)
Run these once, then trust the build (per `POST-LAUNCH.md` §6):
```bash
# Landing pages serve and reference the right demo URL
curl -s https://datatools.app/ | grep -c persona-card
# → 3 (one per persona card)
curl -s https://datatools.app/shopify-pet/ | grep -c "datatools-demo"
# → ≥1 (iframe src points at your demo)
# Demo responds and routes the persona param
curl -s https://datatools-demo.streamlit.app/?p=shopify-pet | grep -c "Shopify"
# → ≥1
# Sitemap is valid XML and lists all 4 pages
curl -s https://datatools.app/sitemap.xml | grep -c "<url>"
# → 4
```
---
## Part 5 · Cost ceiling check
| Service | Tier | Cost | Cap |
|---|---|---|---|
| Cloudflare Pages | Free | $0 | 500 builds/month, unlimited bandwidth |
| Streamlit Community Cloud | Free | $0 | 1 GB RAM, sleeps after 7 days idle |
| Custom domain | Cloudflare or registrar | ~$15/year | n/a |
| GitHub | Free for private repos with limited collaborators | $0 | n/a |
| **Total ongoing** | | **~$1.25/mo** (domain only) | |
Well inside the `BUSINESS.md` §9 cap of $1,200/mo recurring. The
$510/mo VPS migration is a contingency only — don't pre-build it.
---
## Troubleshooting
**Streamlit Cloud build fails with "ModuleNotFoundError: src.core"**
`streamlit_app.py` puts the repo root on `sys.path` before invoking
the demo module — but only if the file is at the repo root. Confirm
`streamlit_app.py` lives at `/streamlit_app.py`, not nested in a
folder.
**Cloudflare Pages deploy succeeds but persona pages 404**
The directory layout is preserved by `deploy.py`. Confirm your
`landing/dist/` has `shopify-pet/index.html`, etc. — not just three
flat files. If you used drag-and-drop, drag the **directory**, not
its contents.
**The iframe shows "X-Frame-Options denied"**
Streamlit Community Cloud allows iframe embedding by default. If
you've migrated to a self-hosted demo with a reverse proxy, set
`X-Frame-Options: ALLOWALL` (or remove the header entirely) for the
demo's domain.
**Gumroad URL has no `?from=` parameter when clicked**
The `&from=` query param is added by the landing-page CTA, not by
Gumroad. If it's missing, the landing-page HTML wasn't substituted —
re-run `python3 landing/deploy.py` and re-deploy.

319
docs/NEXT-STEPS.md Normal file
View File

@@ -0,0 +1,319 @@
# Next Steps — from "code complete" to first paying customer
> Creator-only. The runnable checklist that takes the operator from
> the current state (1,729 tests passing, 6 tools shipped, 0 paying
> customers) through launch and into the first 90 days.
> **Version**: 1.0 · **Adopted**: 2026-05-01
This document is the **single answer** to "what now?". Every line
item has an owner, a time estimate, a blocker, a cost, and the
external dependency that makes it un-shippable today. Items are
ordered by **must-finish-before-the-next-item** — work top-down.
Cross-references:
- Strategy: `PLAN.md` (the 8 strategic moves + the 90-day sequence)
- Demo specs: `DEMO-PLAN.md`
- Deployment mechanics: `DEPLOYMENT.md`
- Post-launch measurement: `POST-LAUNCH.md`
- Locked criteria: `DECISIONS.md` §1
Status legend:
- **🟢** Done — the asset exists in this repo
- **🟡** Buildable now — no external dependency needed
- **🟠** External dependency — needs an account / signup / payment
- **🔴** Manual / requires user input that can't be automated
---
## Phase 0 · What's already done (skip ahead)
| ✓ | Item | Where it lives |
|---|------|----------------|
| 🟢 | 6 of 9 tools shipped (Dedup, Text, Format, Missing, Column-Map, Pipeline) | `src/core/`, `src/cli_*.py`, `src/gui/pages/` |
| 🟢 | Pipeline Runner (the retention multiplier per `PLAN.md` §2.6) | `src/core/pipeline.py`, `src/cli_pipeline.py`, `src/gui/pages/9_Pipeline_Runner.py` |
| 🟢 | 1,729 passing tests · 0 skipped · 0 xfailed | `tests/` |
| 🟢 | 3 niche demo datasets + pre-tuned pipeline JSONs | `samples/demo/` |
| 🟢 | Streamlit demo app + Cloud entry shim | `streamlit_app.py`, `src/gui/app_demo.py` |
| 🟢 | 3 niche landing pages + apex chooser + shared CSS | `landing/` |
| 🟢 | Landing-page deploy script (URL-substitution + sitemap + 404 + favicon) | `landing/deploy.py` |
| 🟢 | Strategic plan + demo plan + post-launch measurement plan + deployment doc | `docs/PLAN.md`, `DEMO-PLAN.md`, `POST-LAUNCH.md`, `DEPLOYMENT.md` |
---
## Phase 1 · Stand the funnel up (target: end of week 1, ~6 hours total work)
The bottleneck right now is **distribution, not feature count**.
Everything in this phase is about turning code into a URL the user
can hit.
### 1.1 — 🟠 Push to GitHub (5 min)
| | |
|---|---|
| **What** | `git init` (if not already), commit, push to a private or public GitHub repo. |
| **Why** | Cloud deploy services need a Git source. Streamlit Community Cloud auto-deploys on push to `main`. |
| **External dependency** | A GitHub account (free). |
| **Cost** | $0. |
| **Blocked by** | Nothing. |
### 1.2 — 🟠 Deploy the demo to Streamlit Community Cloud (15 min)
| | |
|---|---|
| **What** | Follow `DEPLOYMENT.md` Part 1. Result: a public URL like `https://datatools-demo.streamlit.app`. |
| **Why** | The landing pages embed this in their iframe. Without it, every "Run pipeline" button on the landing pages 404s. |
| **External dependency** | Free Streamlit Community Cloud account, signed in via GitHub. |
| **Cost** | $0. |
| **Blocked by** | 1.1 (the repo must be on GitHub). |
| **Watch out for** | First build takes 23 min while Cloud installs deps. Subsequent deploys < 30 s. |
### 1.3 — 🟠 Buy the apex domain (5 min, ~$15/year)
| | |
|---|---|
| **What** | Register `datatools.app` (or whichever) at any registrar. Point the nameservers at Cloudflare. |
| **Why** | The landing-page canonical URLs and CTA buttons refer to this domain. Pages can deploy to a free `*.pages.dev` URL first if you want to defer this. |
| **External dependency** | A registrar account; payment method. |
| **Cost** | ~$15/year. Within `BUSINESS.md` §9 cost cap. |
| **Blocked by** | Nothing — can run in parallel with 1.1 / 1.2. |
### 1.4 — 🟠 Deploy the landing pages to Cloudflare Pages (15 min)
| | |
|---|---|
| **What** | Follow `DEPLOYMENT.md` Part 2. Run `python3 landing/deploy.py` with the operator's URLs in `deploy.config.json`, then `wrangler pages deploy landing/dist` (or drag-drop). |
| **Why** | This is the marketing surface. Three persona URLs go live as soon as it deploys. |
| **External dependency** | Free Cloudflare account; Wrangler CLI (optional — drag-drop works too). |
| **Cost** | $0. |
| **Blocked by** | 1.2 (the demo URL goes into `deploy.config.json`); ideally 1.3 for the custom domain. |
| **Watch out for** | The `deploy.config.json` file is gitignored — your real URLs never get committed. |
### 1.5 — 🟠 Open a Gumroad listing (15 min) **— stub for now**
| | |
|---|---|
| **What** | Create a Gumroad account, draft a listing with a single screenshot + the landing-page copy, set price to $49. Don't enable purchases yet — leave it as a draft. |
| **Why** | The CTA buttons on the landing pages link to `gumroad.com/l/datatools?from=<persona>`. Until the listing exists, those buttons 404. |
| **External dependency** | Free Gumroad account; Stripe-connected payout method (defer to Phase 2). |
| **Cost** | $0 to draft, ~10% per sale once live. |
| **Blocked by** | Nothing — can run in parallel with 1.11.4. |
| **Watch out for** | The listing URL must be `gumroad.com/l/datatools` to match the landing-page hard-coded CTAs. If you pick a different slug, update `landing/deploy.config.json``gumroad_listing` and re-run `deploy.py`. |
### 1.6 — 🟡 End-to-end smoke verification (10 min)
| | |
|---|---|
| **What** | Run the four `curl` commands from `DEPLOYMENT.md` Part 4. All four landing pages, all three demo personas, sitemap.xml. |
| **Why** | First time something can break is the moment a real user hits it. Ten minutes of `curl` saves a week of "why is conversion zero." |
| **External dependency** | None. |
| **Cost** | $0. |
| **Blocked by** | 1.4 + 1.2. |
---
## Phase 2 · Make it sellable (target: end of week 2)
### 2.1 — 🟠 Apple Developer Program enrollment (5 min to start, 12 weeks lead)
| | |
|---|---|
| **What** | Per `BUSINESS.md` §10. Required for code-signing the macOS installer. |
| **External dependency** | Apple ID + government-issued ID (individual) or D-U-N-S number (org). |
| **Cost** | $99/year. |
| **Blocked by** | Nothing — start ASAP because of the 12 week approval window. The pipeline waits on this; nothing else does. |
### 2.2 — 🟡 PyInstaller spec + cross-platform build (13 days first time)
| | |
|---|---|
| **What** | A `build/datatools.spec` that bundles the Streamlit GUI + all 6 tools + samples into one app. Mac `.dmg`, Windows `.exe` installer, Linux AppImage. |
| **Why** | The buyer's deliverable. Without this, there is nothing to attach to the Gumroad listing. |
| **External dependency** | None for Linux/Mac builds. Windows builds need a Windows machine or a CI matrix runner. |
| **Cost** | $0 (GitHub Actions matrix runners are free for public repos). |
| **Blocked by** | Nothing for the spec; 2.1 for the signed Mac build. |
| **Watch out for** | Streamlit's bundle size lands around 300500 MB per `DECISIONS.md` §4c — accepted tradeoff. |
### 2.3 — 🟡 macOS sign + notarize (30 min once Apple Dev is approved)
| | |
|---|---|
| **What** | Sign the `.dmg`, submit to Apple's notarization service, staple the ticket. |
| **Why** | Without it, Gatekeeper hard-blocks the install with no obvious way out (per `BUSINESS.md` §10). The buyer gives up. |
| **External dependency** | Apple Developer Program (2.1). |
| **Cost** | $0 incremental over 2.1. |
| **Blocked by** | 2.1 + 2.2. |
### 2.4 — 🔴 Refund policy + license + Gumroad listing copy (1 hour)
| | |
|---|---|
| **What** | A clear refund policy (14-day no-questions per the FAQ already on the landing pages) + a software licence text + the Gumroad listing description. |
| **Why** | Required by Gumroad's terms; surfaces on the listing page; protects against buyer disputes. |
| **External dependency** | None — operator authoring. |
| **Cost** | $0. |
| **Blocked by** | Nothing. |
| **Hint** | Most of the copy is already in the landing pages' FAQ section — paste it into Gumroad. |
### 2.5 — 🟠 Activate the Gumroad listing (15 min)
| | |
|---|---|
| **What** | Upload the cross-platform installers from 2.2/2.3, paste the copy from 2.4, set $49 price, enable purchases, configure Stripe payout. |
| **Why** | This is the "buy" button finally working. |
| **External dependency** | Gumroad + Stripe account; the installers from 2.2/2.3. |
| **Cost** | ~10 % per sale. |
| **Blocked by** | 2.2, 2.3, 2.4. |
---
## Phase 3 · First-traffic ignition (target: end of week 4)
Per `PLAN.md` §3 and `BUSINESS.md` §7 channel priorities. The strict
no-touch constraint of `DECISIONS.md` §1 #8 makes channel choice
matter — these are the only ones that fit.
### 3.1 — 🔴 First niche-community post (30 min)
| | |
|---|---|
| **What** | One value-first post in one niche-relevant community (e.g. r/shopify, IndieHackers Shopify chat, a Slack/Discord that allows it). Lead with the demo URL, not the buy URL. |
| **Why** | Marketplaces alone don't drive discovery. Communities are the only first-touch channel that works under no-touch. |
| **External dependency** | Account in the chosen community; understand its self-promotion rules. |
| **Cost** | $0. |
| **Blocked by** | 1.4 (demo URL must work). |
| **Hint** | Pick the persona with the most familiar community to the operator. Don't try all three at once — see `POST-LAUNCH.md` §2 "decide ONE thing" rule. |
### 3.2 — 🟡 First long-tail SEO blog post (46 hours)
| | |
|---|---|
| **What** | One 8001,500-word post on `datatools.app/blog/` (sub-route of Cloudflare Pages or Substack) targeting one niche keyword from `BUSINESS.md` §7. Topic: a real problem you've encountered, the cleanup steps, the demo URL at the end. |
| **Why** | Compounding asset — `BUSINESS.md` §2 says SEO pays in 618 months, not week 1. Don't mistake it for an early-stage channel. |
| **External dependency** | None. |
| **Cost** | $0. |
| **Blocked by** | Nothing. |
### 3.3 — 🟡 Cloudflare Web Analytics + event counters (45 min)
| | |
|---|---|
| **What** | Enable Cloudflare Web Analytics on the Pages project (one click). Add a tiny inline `<script>` to each landing page that fires `cta_clicked` when the buy button is hit, before redirecting. Per `POST-LAUNCH.md` §1. |
| **Why** | Without this, the post-launch checklist is unrunnable. |
| **External dependency** | Cloudflare account (already from 1.4). |
| **Cost** | $0. |
| **Blocked by** | 1.4. |
| **Hint** | The Gumroad webhook captures `?from=<persona>` automatically — no extra wiring. |
### 3.4 — 🟡 Email autoresponder (post-purchase delivery + 3-touch onboarding) (23 hours)
| | |
|---|---|
| **What** | Gumroad's built-in delivery email plus three follow-up emails (day 1, day 7, day 14): "are you running into X?", "here's an advanced trick", "save your pipeline as JSON for next week". |
| **Why** | Increases activation, reduces refund risk, surfaces support questions while volume is small. |
| **External dependency** | Gumroad delivery is built-in. The 3-touch sequence needs a free email service (Resend's free tier or Mailchimp's free tier). |
| **Cost** | $0$30/month per `BUSINESS.md` §9. |
| **Blocked by** | 2.5. |
---
## Phase 4 · First-buyer trigger and review
Per `PLAN.md` §4 decision triggers and `POST-LAUNCH.md` §4.
### 4.1 — 🟢 Run the monthly review (30 min, first Monday after launch)
| | |
|---|---|
| **What** | Follow `POST-LAUNCH.md` §2 — pull last-30-days demo events + Gumroad sales + refunds, compute the five numbers, decide ONE change. |
| **Why** | Without this discipline, the funnel drifts and the operator changes 5 things at once and learns nothing. |
| **External dependency** | None — analytics from 3.3, sales from 2.5. |
| **Cost** | $0. |
| **Blocked by** | 3.3 + 2.5. |
### 4.2 — 🟢 First paying customer (target: 90 days)
| | |
|---|---|
| **What** | The actual first sale. |
| **Why** | Per `BUSINESS.md` §6: validates the funnel; not the business. |
| **Trigger action** | Continue, no plan change. Make the first $1k/month within month 6. |
### 4.3 — 🔴 Zero-paid-in-90-days fallback (only fires if 4.2 doesn't)
| | |
|---|---|
| **What** | Per `POST-LAUNCH.md` §4 — audit the funnel, not the features. Run a 1-week outbound experiment to 30 niche contacts as a control (per `BUSINESS.md` §8 the no-touch revisit is allowed below $5k MRR if it produces signal). |
| **Why** | Distinguishes "no reach" from "no conversion" — they need different fixes. |
| **External dependency** | Operator's time. |
| **Cost** | The 10 hr/wk allocation already exists; this displaces other work. |
| **Blocked by** | The 90-day calendar trigger from 4.2. |
---
## Phase 5 · Steady state — what NOT to build
Per `PLAN.md` §5 (anti-temptations) and `DECISIONS.md` §8 (re-lock
triggers). The trap is treating "more code" as the answer when the
data says "more reach" or "more conversion." The five forbidden
moves until $5k/mo MRR:
| | Why locked |
|---|---|
| ❌ More tools (0608) | `PLAN.md` §2.1 distribution-gate. Tool 09 was the exception; no others until first paid customer + one external review. |
| ❌ SaaS pivot | `DECISIONS.md` §4 — recurring infra conflicts with the lifestyle constraint. |
| ❌ Live chat / sales calls | `DECISIONS.md` §1 #8 — no-touch is locked until $5k/mo. |
| ❌ Custom integrations / one-off consulting | Breaks "build once, sell many." |
| ❌ Going broad on personas | `PLAN.md` §5 — "all small businesses" converts at 1 %; vertical converts at 515 %. |
---
## Triage table — what blocks what
```
Phase 1 (week 1) Phase 2 (week 2) Phase 3 (week 4)
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ 1.1 Push GH │──────────┐ │ 2.1 Apple │ ───┐ │ 3.1 Community│
│ 1.2 Demo │──┐ ├──▶│ Dev (1-2w) │ │ │ 3.2 SEO post │
│ 1.3 Domain │ │ │ │ 2.2 Build │ ───┤ │ 3.3 Analytics│
│ 1.4 Pages │◀─┘ │ │ 2.3 Sign │ ───┤ │ 3.4 Emails │
│ 1.5 Gumroad │──────────┘ │ 2.4 Copy │ │ └──────────────┘
│ 1.6 Verify │ │ 2.5 Activate │ ◀──┘
└──────────────┘ └──────────────┘ ↓
┌──────────────┐
│ 4.1 Monthly │
│ 4.2 First $ │
│ 4.3 Fallback │
└──────────────┘
```
The longest blocking path is **2.1 Apple Developer Program**
(12 weeks). Start it on day 1 of week 1 — it unblocks everything in
Phase 2 and you can do all of Phase 1 while waiting.
---
## Time estimate — total operator time
| Phase | Hours | Wall-clock |
|---|---|---|
| Phase 1 | ~1 hour | end of week 1 (mostly waiting for builds) |
| Phase 2 | ~1 day | end of week 2 (gated by Apple Dev approval) |
| Phase 3 | ~6 hours | week 34 |
| Phase 4 | 30 min/month | ongoing |
| **Total to launch** | **~12 hours of operator time** | **~14 days wall-clock** |
Well inside the 10 hr/wk constraint of `DECISIONS.md` §1 #2.
---
## The thing that decides whether the plan works
Not the build. Not the deploy. Not even the first sale.
**The discipline of running the monthly review** in Phase 4 — and the
"decide ONE thing per month" rule from `POST-LAUNCH.md` §2 — is what
separates "this product exists" from "this product compounds." Every
feature added before the funnel is measured is a guess; every change
made after the monthly review is informed.
Don't skip 4.1.

220
docs/PLAN.md Normal file
View File

@@ -0,0 +1,220 @@
# Strategic Plan — DataTools
> Creator-only. Locks the "what next" in light of the locked criteria
> (DECISIONS.md §1) and the v1.6 honest status (BUSINESS.md §13).
> **Version**: 1.0 · **Adopted**: 2026-05-01 · **Owner**: Michael
This document is the active plan, derived from the strategic review of
2026-05-01. It compresses the eight strategic moves and a 90-day
execution sequence onto one page so the next decision (build vs.
ship vs. market) has a single reference.
It is **not** a re-lock of operating criteria — those still live in
DECISIONS.md and have not changed. This plan is downstream of those
criteria; if a move below conflicts with §1 of Decisions, the criteria
win.
## 1. Frame
**Locked context** (BUSINESS.md, DECISIONS.md):
- Niche Python automation tools, $4979 single / $149 suite.
- Cash budget ≤ $1,200/mo recurring · Time ≤ 10 hr/wk · No external funding.
- Async + no-touch sales (revisit at $5k/mo MRR).
- Marketplace-first distribution (Gumroad / Lemon Squeezy).
- Streamlit GUI + CLI dual interface, runs locally.
- Lifestyle cashflow goal (no exit needed).
**Honest current state** (2026-05-01):
| Asset | State |
|---|---|
| Tools 15 (Dedup, Text Clean, Format Standardize, Missing, Column Mapper) | Ready · 1,691 tests passing · 0 xfailed |
| Tools 69 (Outlier, Multi-File Merge, Validator, Pipeline) | Coming Soon |
| PyInstaller installer pipeline | Not started |
| macOS code signing (Apple Dev Program) | Not started |
| Hosted browser demo (Streamlit Cloud) | Not deployed |
| Landing page | Not live |
| Marketplace listing (Gumroad) | Not listed |
| Paying customers | 0 |
**Diagnosis**: the bottleneck is not feature count — it's distribution.
The next $1 of value comes from closing the gap between "code-complete"
and "buyer-pulls-out-card", not from tool 6.
## 2. The eight strategic moves
Numbered moves. Each is consistent with locked criteria.
### 2.1 Freeze new-tool development (one exception). Ship what exists.
Tools 68 are blocked behind a **distribution gate**: no work on them
until the existing 5 tools have a paying customer + one external review
(BUSINESS.md §4 sequence rule, applied recursively inside the bundle).
**Exception granted 2026-05-01**: Tool 09 Pipeline Runner is built
*now*. Rationale: the pipeline transforms the bundle from "5 tools you
buy" into "an automatable workflow you depend on." That conversion is
what produces retention and word-of-mouth — the only marketing channel
that scales under the no-network/no-touch constraint.
### 2.2 The demo *is* the product. Make it embarrassingly good.
- Three persona-tagged sample datasets, not one generic CSV: Shopify
customers / bookkeeper bank export / agency lead list.
- Run the *full pipeline* on the sample (Review → Dedup → Text Clean →
Format → Missing → Column Map). Free version caps **output rows**,
not the experience.
- Embed the demo as an **iframe on the landing page** (not "click to
open"). Friction kills conversion.
- Persistent CTA after demo: *"Run this on your own 50 k-row file →
buy for $49 →"* directly above the Gumroad button.
### 2.3 Niche down. Stop selling "data cleaning."
One engine, three landing pages:
| Persona | Landing-page lead | Demo dataset |
|---|---|---|
| Shopify operator (priority: pet supplies) | "Clean your customer / vendor / subscriber exports" | uc01_shopify_customer_list |
| Bookkeeper / freelance accountant | "Reconcile bank exports + vendor lists. Auditable changes." | uc06_bank_export_overlap |
| Marketing / RevOps agency | "Dedupe lead lists. Standardize phones across vendors." | uc13_combined_lead_sources |
Generic copy competes with `pip install pandas`. Vertical copy
competes with nothing.
### 2.3a Top pain points per niche
The "what does this actually fix?" question. Each pain point below is
sourced from operator-domain knowledge of these markets and the
buyer-use-case research already captured in `BUSINESS.md §4a`. Pain
points are ranked by **frequency × dollar impact** for that persona —
high-frequency / high-cost pains lead the landing-page copy and the
demo dataset.
> **Validation gap (honest disclaimer)**: these pains are derived from
> operator knowledge of the categories, not from a sample of buyer
> interviews. Per `BUSINESS.md §8` (no-touch constraint review at $5k/mo
> MRR), validate the top-3 per persona via 5 buyer interviews before the
> first $200 of paid acquisition spend. If any pain ranks below the
> assumed level, swap it for the next-highest in this list.
#### Shopify operator (priority: pet supplies)
| # | Pain | $ / time impact | Tools that fix it |
|---|------|-----------------|---|
| S1 | **Klaviyo / Mailchimp / Omnisend per-contact billing.** Subscriber list with 1018 % duplicate rate (case drift, plus signs in Gmail addresses, multiple devices) → recurring overpay forever. | $30300/mo per percent of dupes on a 50 k list — recurring | Dedup + Format Standardize (email canonicalization) + Pipeline (re-run weekly) |
| S2 | **Product feed rejected by Google Merchant Center / Meta Catalog.** Smart quotes in titles, NBSP in SKU, inconsistent attributes; campaign launch delayed 2472 h while feed gets fixed. | 13 days delayed launch × campaign value | Text Cleaner + Format Standardize |
| S3 | **Multi-channel order consolidation.** Shopify + Etsy + Amazon + Faire + wholesale spreadsheet, each with a different column for "customer email" / "order total" / "ship country". | 48 hr / month manually merging | Column Mapper + Dedup + Pipeline |
| S4 | **Subscription identity fragmentation.** Pet-box subscribers cancel and re-sub under a different email; cohort analysis says churn is 20 % when it's actually 12 % — pricing decisions wrong. | Mis-priced LTV → over- or under-paid acquisition | Dedup with `merge=true` survivor |
| S5 | **International tax / VAT MOSS compliance.** Country column is `UK` / `U.K.` / `United Kingdom` / `GB` in the same export; VAT report breaks. Phone formats per region break call-center routing. | Compliance penalty risk + ops friction | Format Standardize (per-row country) + Column Mapper |
#### Bookkeeper / freelance accountant
| # | Pain | $ / time impact | Tools that fix it |
|---|------|-----------------|---|
| B1 | **Bank-export month-overlap re-import.** Same transaction posts twice when Jan and Feb exports overlap at the boundary; client's books understate cash by 14 %. | 24 hr / month / client + reconciliation errors | Dedup with explicit Date+Amount+fuzzy Vendor strategy |
| B2 | **QBO / Xero vendor consolidation for 1099 reports.** "Amazon" / "amazon.com" / "AMAZON.COM*4F2X9" become 3 vendors; 1099 reports break, P&L by vendor unusable. | 12 hr / 1099 cycle + IRS-paper-trail risk | Format Standardize (name canonicalization) + Dedup |
| B3 | **Liability / professional indemnity.** Cannot use AI tools that don't show their work; client audit response window is 2448 h. | Per-firm liability premium ≈ $5002,500 / yr | Audit log built into every tool — every change row-logged |
| B4 | **Per-license-not-per-client economics.** Most cleanup tools are per-seat / per-client SaaS; bookkeepers managing 1030 clients hit price walls fast. | $30/mo × N clients vs. $49 once | Desktop license, no per-client constraint |
| B5 | **Multi-currency books.** US-domiciled clients with EU customers; comma-decimal amounts (`€1.234,56`) crash standard parsers; parens-negative (`($89.50)`) treated as positive. | 3060 min per multi-currency client per month | Format Standardize (`currency_decimal=auto`, parens-negative) |
#### Marketing / RevOps agency
| # | Pain | $ / time impact | Tools that fix it |
|---|------|-----------------|---|
| R1 | **HubSpot / Marketo / Iterable per-contact tier pricing.** 10 k contacts → enterprise tier at $48 k/mo. Every duplicate is a recurring tax. | $200800 / month per 1 k duplicate contacts — recurring | Dedup with cross-source merge + Pipeline |
| R2 | **Email-deliverability / sender reputation.** Sending to invalid or duplicate addresses tanks reputation; recovery takes weeks. | Catastrophic — entire email programme degraded | Format Standardize (email canonicalization) + Missing (sentinel detection) |
| R3 | **GDPR / contact-data privacy.** Uploading lead data to a third-party cleaning SaaS is itself a GDPR concern; legal review blocks adoption. | Compliance risk + 48 wk legal-review delay | Local-only desktop app, zero outbound calls |
| R4 | **Multi-vendor lead-source unification.** Apollo, ZoomInfo, LinkedIn Sales Nav, manual scrapes — each export has different headers, scoring, country format. | 13 days per campaign of manual unification | Column Mapper (alias matching) + Format Standardize (per-row country) + Dedup |
| R5 | **Suppression-list management across 5+ platforms.** Each platform has its own format; un-deduped suppression lists let opt-outs slip through, triggering CAN-SPAM / GDPR exposure. | Compliance risk + churn-back cost | Pipeline saved as JSON, re-run on each new suppression batch |
### 2.4 Operationalize the moat the docs already name.
Three durable advantages, each promoted from buried feature to
landing-page H1:
- **Quality**: 1 GB international standardization in ~2.5 minutes,
locally. Excel can't do this; OpenRefine fights you for an hour.
- **Privacy**: "Your data never leaves this computer." Already in the
GUI footer — promote to landing-page lead, screenshot the empty
network tab.
- **Update cadence**: ship a v1.1 patch within 30 days of v1.0 launch.
Not features — *evidence* the product is alive. "Added Czech Republic
phone format support" beats "no updates in 6 months" every time.
### 2.5 Surface the audit-trail feature in sales copy.
Every tool has a structured audit log. Most cleaning tools do not.
Bookkeepers and consultants get fired if they can't show what changed
to a client. The audit feature is currently invisible on every
proposed landing page and should be the **second-largest callout**
right after "runs locally."
Copy seed: *"Every change auditable. Hand the audit CSV to your client
with the cleaned file."*
### 2.6 The Pipeline Runner is the retention multiplier.
A buyer with a saved pipeline isn't a one-off purchase — they're a
recurring user who recommends the product. This is exactly the
behavioural lever the no-touch constraint needs (DECISIONS.md §8
trigger). Build it now (see §2.1 exception).
### 2.7 Add a $199 "priority support" tier post-launch.
Same code, async-email SLA (24 h response). Targets the bookkeeper /
consultant persona whose own time is $300/hr. Zero new product work,
~3× ARPU on 510 % of buyers. Lock the SLA to **async only** so the
no-touch constraint isn't violated. Defer until $5 k/mo MRR (the same
trigger DECISIONS.md §8 already names).
### 2.8 Dependency-aware pipeline UX.
Tools have soft execution-order preferences (Text Clean before Format
Standardize, Format before Dedup, Missing before Dedup). The Pipeline
Runner *recommends* the order, *warns* on reversals, and **never
forces** — the user owns their workflow. Implementation: see
`src/core/pipeline.py` `SOFT_DEPENDENCIES`.
## 3. 90-day execution sequence
| Week | Action | Done when |
|---|---|---|
| 1 | PyInstaller pipeline · Mac/Win unsigned installers · Apple Dev Program enrollment (12 wk lead) | `dist/datatools-mac.dmg` and `dist/datatools-win.exe` install on a clean machine |
| 2 | Demo deployed to Streamlit Cloud · landing page v1 with embedded demo · 3 persona datasets in the demo | Public URL serves a working pipeline run on a sample dataset in < 30 s |
| 3 | Gumroad listing live · share value-first in 3 niche communities (no pitch) · 1 long-tail SEO post for the lead persona | First listing impression captured · post not removed for self-promotion |
| 4 | Pipeline Runner v1.0 shipped (this week, 2026-05-01 — exception per §2.1) · v1.1 patch announced with Tool 09 + intl improvements | Pipeline saves/loads JSON · 3 demo pipelines preloaded |
| 58 | Bookkeeper landing page · agency landing page · second tool's promo cycle · priority-support tier added (defer purchase until §2.7 trigger) | Three live landing pages with distinct H1, demo dataset, conversion target |
| 913 | Tool 0608 only **if** revenue trajectory supports continued investment · otherwise more market work on the existing 5 + 09 | Decision made on 13 Aug 2026 with revenue data, not feature ambition |
## 4. Decision triggers (re-evaluation prompts)
These flip the plan, not the underlying criteria:
| Trigger | Reaction |
|---|---|
| First paying customer in week 413 | Continue. Plan is working. |
| **Zero** paid in 90 days | Audit the funnel. Demo conversion? Niche fit? Price? Don't add features. |
| $5 k/mo MRR | DECISIONS.md §8 trigger fires: revisit async + priority-support tier. |
| Marketplace policy / shutdown | Switch to own-domain Stripe immediately; landing pages are already self-hosted. |
| Streamlit hard direction change | Low-probability re-lock per DECISIONS.md §8. Tk fallback is documented. |
## 5. Anti-temptations (things the plan refuses)
- **More tools before more buyers.** Locked. Exception only for Pipeline Runner per §2.1.
- **SaaS pivot.** Recurring infra conflicts with the lifestyle constraint (DECISIONS.md §4).
- **Live chat / sales calls.** Conflicts with no-touch (DECISIONS.md §1 #8).
- **Custom integrations / one-off consulting.** $300/hr looks tempting; breaks the "build once, sell many" model that justifies the entire strategy.
- **Going broad on personas.** "All small businesses" is a generic landing page that converts at 1 %; "Shopify pet-supply operators with 1k50k customers" converts at 515 % in the right communities.
## 6. What this plan deliberately leaves open
- Whether tools 0608 ever ship. Decided on revenue, not roadmap.
- Whether to add a fourth niche landing page. Decided on which of the
three is producing.
- Whether to invest in own-domain SEO. Compounding 618 mo asset; not
the early-stage channel. Revisit when marketplace + community
produces baseline traffic to optimise.
- Whether to add a Notion / Slack support community. If support volume
per 100 sales > 10 (BUSINESS.md §12 target), revisit; else leave async-email only.

158
docs/POST-LAUNCH.md Normal file
View File

@@ -0,0 +1,158 @@
# Post-launch — 90-day measurement plan
> Creator-only. The other half of `PLAN.md`: PLAN tells you what to
> build, this tells you what to measure once it's live and which
> numbers trigger which actions.
> **Version**: 1.0 · **Adopted**: 2026-05-01 · **Owner**: Michael
This is a runnable monthly checklist, not analytics theatre. Every
metric below has a **threshold** and an **action**. If you're not
willing to execute the action when the threshold trips, drop the
metric — measuring without responding is busywork.
## 1. The five numbers that matter
Every other dashboard, chart, or vanity stat is downstream of these
five. The funnel is short on purpose; pre-PMF traffic doesn't have
the resolution to support more.
| # | Metric | How to compute | Threshold | When tripped |
|---|--------|----------------|-----------|--------------|
| 1 | **Persona engagement** | `demo.run_completed / demo.page_view` per persona | < 30 % for 4 consecutive weeks | Demo isn't running or BEFORE preview isn't compelling. **Action:** check iframe loads; widen BEFORE preview to show pollution clearly; move demo above the fold. |
| 2 | **Demo→CTA intent** | `demo.cta_clicked / demo.run_completed` per persona | < 5 % for 4 consecutive weeks | Demo is impressive but the CTA isn't earning trust. **Action:** add network-tab privacy screenshot; soften the price callout; A-B test eyebrow copy on the CTA card. |
| 3 | **Purchase rate** | `gumroad.purchase / demo.cta_clicked` per persona | < 30 % for 4 consecutive weeks | Visitors click through but don't pull the card out. **Action:** check Gumroad listing renders cleanly; verify refund-policy copy; check that the screenshot on the listing matches the demo they just ran. |
| 4 | **Refund rate** | `gumroad.refunds / gumroad.purchase` rolling 30 days | > 5 % | Buyer expectation mismatch. **Action:** read every refund email; determine if it's a feature gap (build it), a positioning lie (rewrite), or a personal-fit miss (fine, ignore). |
| 5 | **Support load** | email tickets / 100 sales rolling 30 days | > 10 | The product isn't self-serve enough at this price. **Action:** find the top 3 questions; add to in-app onboarding + landing-page FAQ + the persona's saved pipeline. |
These five also map to BUSINESS.md §12 — that doc names the metrics;
this doc operationalises them.
## 2. Monthly review — 30-minute checklist
Block 30 minutes on the first Monday of every month for the first six
months. After month 6 if numbers are stable, drop to 15 minutes
quarterly.
```
[ ] Pull last 30 days of demo events from Cloudflare Web Analytics
[ ] Pull last 30 days of Gumroad sales + refunds export
[ ] Compute the five numbers in §1 per persona
[ ] Note which thresholds are tripped (if any)
[ ] Read every refund email since last review
[ ] Read every support email since last review
[ ] Decide ONE thing to change this month (only one)
[ ] Update CHANGELOG with what was changed and why
[ ] Schedule next review
```
The "decide ONE thing" rule is load-bearing. Pre-PMF traffic doesn't
have the volume to A/B-test multiple changes in parallel — you'll just
confuse yourself about what moved the number.
## 3. Per-persona scoreboard (template)
Maintain in a single text file or spreadsheet. The shape that fits in
a notebook page is the shape you'll actually update.
```
Month: 2026-06
─────────────────────────────────────────────────────────────────
Shopify Bookkeeper RevOps Total
Page views 420 180 290 890
Demo runs 137 59 82 278
CTA clicks 9 7 6 22
Purchases 3 2 2 7
Metric 1 (engage) 33% 33% 28% 31%
Metric 2 (intent) 7% 12% 7% 8%
Metric 3 (purchase) 33% 29% 33% 32%
Metric 4 (refund) 0% 0% 0% 0%
Metric 5 (support) 3 tickets / 100 sales
Tripped thresholds: RevOps engagement (28% < 30%)
This-month change: Move demo embed above the fold on revops
page; reduce hero text by 40%.
Last-month change: Added network-tab screenshot to all 3
pages. Result: intent +1.5 percentage
points on Shopify, flat elsewhere.
```
## 4. Stage-gate triggers from PLAN.md
Reproduced here so the gate criteria sit beside the metrics that
fire them:
| Trigger | From | Action |
|---|------|--------|
| **First paying customer** | PLAN §4 | Continue. Plan is working. |
| **Zero paid in 90 days** | PLAN §4 | Audit the funnel. Don't add features. Run a small (1-week) outbound experiment to 30 niche-community contacts as a control, even though it stretches the no-touch constraint, to determine whether the bottleneck is reach or conversion. |
| **$5 k/mo MRR** | DECISIONS §8 | Re-evaluate async constraint. Add priority-support tier (PLAN §2.7). |
| **$10 k/mo MRR** | DECISIONS §8 | Revisit time-budget allocation. Decide on tools 0608 vs. additional bundles. |
| **Marketplace shutdown** | PLAN §4 / DECISIONS §8 | Switch landing-page CTA to own-domain Stripe Checkout. Pre-built; one-line edit. |
| **Streamlit hard direction change** | DECISIONS §8 | Low-probability re-lock. Tk fallback documented. |
| **Burnout signal** | DECISIONS §8 | Stop. Triage. The constraint matters more than the revenue ramp. |
## 5. What we deliberately do NOT measure
These look productive but predict nothing pre-PMF. Don't add them.
- **Bounce rate** — single-page sites have artificially high bounce. Useless signal.
- **Time on page** — landing pages are *supposed* to be quick reads. Long time on page often means confusion, not engagement.
- **Heatmaps / scroll-depth** — no statistical resolution at <500 monthly visitors. Add when you cross 5 k/month.
- **Email open rates** — under §2.7 priority support is the only email channel; opens aren't a buying signal.
- **Social mentions** — vanity. The signal that matters is "did they buy" or "did they come back."
## 6. What we measure once, then trust
Do these once, then let them run for 6+ months without re-measuring:
- **Demo correctness** — once per pipeline release, run all 3 demos
end-to-end via `tests/test_pipeline.py` and check the output looks
reasonable. The CI pipeline already does this; nothing to add.
- **Cross-platform install** — once per release, verify the
PyInstaller bundle launches on Mac / Windows / Linux. After three
green releases, trust the build pipeline; spot-check on major OS
updates only.
- **Privacy claim integrity** — once at launch, capture the network
tab while running the cleaner and host that screenshot at a stable
URL. Re-capture only when a new tool or dependency is added.
## 7. Per-persona attribution
The buy buttons on every landing page carry `?from=<persona>` query
parameters. Gumroad propagates that into the order metadata. Use it
to attribute purchases:
| persona key | landing page URL | Gumroad query | Source |
|---|---|---|---|
| `shopify-pet` | `/shopify-pet/` | `?from=shopify-pet` | Shopify operator |
| `bookkeeper` | `/bookkeeper/` | `?from=bookkeeper` | Bookkeeper / freelance accountant |
| `revops` | `/revops/` | `?from=revops` | Marketing / RevOps agency |
| `apex` | `/` | (no query — use `unknown` bucket) | Generic discovery |
When `unknown` exceeds 30 % of total, add UTM tagging to community
posts and SEO blog backlinks so you can break the bucket apart.
## 8. The four months that decide whether the plan works
Reading PLAN.md §3 + this doc together, the rough script:
| Month | What's running | What we expect to learn |
|---|---|---|
| **M1** (June) | Installers · demo · 3 landing pages · Gumroad live | Whether the funnel mechanically works. Numbers will be noisy; just look for one purchase. |
| **M2** (July) | M1 + community posts in 3 niches + 1 SEO post | Which persona converts. Re-allocate effort to the highest-converting niche. |
| **M3** (August) | M2 + landing-page changes from M2 review | Whether intent-rate moved on the change. Decide tools 0608 go/no-go. |
| **M4** (September) | M3 + first repeat-buyer signals | Whether the Pipeline Runner is producing retention as designed. |
By end of M4, the data tells you whether the plan is producing
$1k3k/mo (BUSINESS.md §6 6-month target) — extrapolated from the
trajectory, not the absolute number.
## 9. The hardest part of the plan to execute
Not the metrics. Not the build. **The "decide ONE thing per month"
rule** — operators with engineering backgrounds chronically pick
three changes per month and conclude nothing because their signal
is muddled. This doc says one. It means one.

View File

@@ -46,7 +46,7 @@ Numbered support matrix. Updated with every shipped capability.
**Cell-level**:
- `smart_punctuation_in_data`, `nbsp_or_unicode_whitespace`, `zero_width_or_invisible`, `dirty_column_headers`, `whitespace_padding`, `null_like_sentinels`, `suspected_mojibake`, `mixed_case_email_column`, `inconsistent_date_format`, `near_duplicate_rows`, `leading_zero_ids`.
**Encoding integrity**: `encoding_uncertain`, `encoding_decode_failed`, `empty_input`.
**Encoding integrity**: `encoding_uncertain`, `encoding_decode_failed`, `encoding_lying_bom`, `empty_input`.
Sample size: 1,000 rows (configurable).
@@ -71,17 +71,37 @@ Sample size: 1,000 rows (configurable).
- Full-DataFrame `auto_fix`: ~5 min (~30 µs/cell).
- Output write: ~10 s.
- Recommended RAM: 4× input size for full-Apply path.
- Format standardizer (`standardize_file`): ~150k rows/sec on cache-warm
international data; chunk-bounded RAM (~50 MB peak at default
chunk_size=50,000). A 1 GB CSV with mixed phone+currency+address
columns finishes in ~2.510 minutes depending on column count.
## 11. Tools
1. Deduplicator — Ready
2. Text Cleaner — Ready
3. Format Standardizer — Ready
4. Missing Value Handler — Coming Soon
5. Column Mapper — Coming Soon
4. Missing Value Handler — Ready
5. Column Mapper — Ready
6. Outlier Detector — Coming Soon
7. Multi-File Merger — Coming Soon
8. Validator & Reporter — Coming Soon
9. Pipeline Runner — Coming Soon
9. Pipeline Runner — Ready
### 11.a Recommended pipeline order (soft, not enforced)
The Pipeline Runner ships with a `SOFT_DEPENDENCIES` table; the
following ordering is the default and the basis of the warning
surface. Re-ordering is allowed; the runner emits a warning string
and proceeds.
| # | Tool | Why this slot |
|---|------|---------------|
| 1 | column_map (optional, for header alignment) | Multi-vendor unification — rename early so downstream tools see canonical headers |
| 2 | text_clean | NBSP / smart quotes / zero-width pollution silently breaks downstream parsers |
| 3 | format_standardize | Phones / dates / currencies → canonical form before missing detection and dedup |
| 4 | missing | Sentinel detection, imputation, drop strategies — needs canonical types |
| 5 | column_map (optional, for schema enforcement) | Project to target schema, coerce, drop extras AFTER cleaning |
| 6 | dedup | Fuzzy matching is most accurate on canonicalised, sentinel-laundered data |
## 12. Gate (Review & Normalize)
- Gates every tool page.
@@ -95,7 +115,7 @@ Sample size: 1,000 rows (configurable).
## 13. Interfaces
- **GUI**: Streamlit, browser-based, local, no internet.
- **CLI**: `python -m src.cli` (dedup) · `src.cli_text_clean` · `src.cli_analyze`.
- **CLI**: `python -m src.cli` (dedup) · `src.cli_text_clean` · `src.cli_format` · `src.cli_missing` · `src.cli_column_map` · `src.cli_pipeline` · `src.cli_analyze`.
- **Python API**: `from src.core import …` (analyze, repair_bytes, clean_dataframe, deduplicate, standardize_dataframe, …).
- **JSON output**: `--json` on `cli_analyze`.
@@ -113,8 +133,8 @@ Sample size: 1,000 rows (configurable).
- **Dev**: pytest, tox.
## 16. Test coverage
- 1,230 tests passing, 4 skipped (ftfy not installed), 17 xfailed (documented).
- Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases).
- 1,729 tests passing, 0 skipped, 0 xfailed.
- Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases + 20-row international stress fixture), missing-handler (3 use cases + 16 edge cases), column-mapper (3 use cases + 5 edge cases).
- Run: `python run_tests.py [--tool …] [--fixtures] [--coverage]`.
## 17. Privacy / data handling

142
landing/README.md Normal file
View File

@@ -0,0 +1,142 @@
# Landing pages
Three persona-tagged landing pages per `docs/PLAN.md` §2.3 and
`docs/DEMO-PLAN.md` §3 / §7. Static HTML, zero build step, ship to
Cloudflare Pages.
## Structure
```
landing/
├── _shared/styles.css shared CSS (system fonts, no externals)
├── shopify-pet/index.html Shopify operator (priority: pet supplies)
├── bookkeeper/index.html bookkeeper / freelance accountant
├── revops/index.html marketing / RevOps agency
└── README.md this file
```
Each page:
- Inherits `landing/_shared/styles.css`
- Overrides the `--accent` colour variable in an inline `<style>` block
so each persona has its own visual identity (Shopify = mint green,
Bookkeeper = steel blue, RevOps = vivid violet)
- Has a sticky buy bar with the Gumroad CTA tagged with `?from=<persona>`
- Embeds the live demo (Streamlit) via `<iframe>` with a sandbox attribute
- Carries persona-specific H1, sub-copy, use cases, FAQ, and a
ready-to-paste `terminal` block showing the CLI in action
- Includes Open Graph + Schema.org `SoftwareApplication` JSON-LD for
link-share previews and SEO
## Pre-deploy URL substitutions — automated
The HTML carries placeholder URLs (the literal strings
`https://demo.datatools.app`, `https://datatools.app`,
`https://gumroad.com/l/datatools`, `mailto:hello@datatools.app`)
that **must** be replaced before deployment. A small Python script
does this for you — no global search-and-replace needed.
```bash
# 1) Copy the template and fill in your real URLs:
cp landing/deploy.config.example.json landing/deploy.config.json
edit landing/deploy.config.json
# 2) Build the deploy-ready bundle:
python3 landing/deploy.py
# → produces landing/dist/ with substitutions applied,
# plus robots.txt, sitemap.xml, 404.html, favicon.svg
```
`landing/deploy.config.json` is gitignored so your real URLs never
hit the repo. Re-run `landing/deploy.py` whenever you change a URL or
edit any HTML source.
## Cloudflare Pages deployment
The simplest path — one Pages project pointed at `landing/dist/`:
```bash
# Option A: drag-and-drop the directory in the Cloudflare dashboard
# Pages → Create project → Direct Upload → drag landing/dist/
# Option B: Wrangler CLI (one command, scriptable)
wrangler pages deploy landing/dist
```
Configure the custom apex domain (`datatools.app`) in the Cloudflare
Pages project settings; sub-paths `/shopify-pet/`, `/bookkeeper/`,
`/revops/` are served automatically because the directory layout
mirrors them. Cache rule defaults are fine (HTML 1 day, CSS 7 days).
If you want **separate Pages projects** per persona for independent
A/B testing, point three projects at the same `landing/dist/` and
configure each with its own sub-domain (`shopify.datatools.app`, etc.)
and a Pages rule that rewrites the root to that persona's
sub-directory.
## Telemetry wiring (per DEMO-PLAN §8)
The plan calls for event-only counters, no PII, no Google Analytics.
For each page, on Cloudflare Pages, attach a Worker (or use Cloudflare
Web Analytics — it's privacy-friendly out of the box and zero config).
Track:
- `page_view` per persona (auto from CF Web Analytics)
- `cta_clicked` — add a small inline `<script>` that fires a fetch to
`/api/event?event=cta_clicked&persona=<persona>` when the buy button
is clicked, then continues the navigation to Gumroad.
- `demo.run_completed` and `demo.cta_clicked` are owned by the demo
app, not the landing page.
Conversion (per DEMO-PLAN §8):
```
demo_engagement = demo.run_completed / page_view (target ≥ 30%)
purchase_intent = demo.cta_clicked / demo.run_completed (target ≥ 5%)
purchase_rate = gumroad.purchase / demo.cta_clicked (target ≥ 30%)
```
The Gumroad webhook captures `?from=<persona>` so we can attribute
purchases back to the landing page that produced them.
## Maintenance triggers (per DEMO-PLAN §9)
Refresh the page when:
| Trigger | Action |
|---|---|
| `cta_clicked / run_completed < 5%` for 4 weeks | The demo is working but the buyer isn't trusting the CTA. Add a screenshot of the network tab showing zero outbound calls. Soften the price callout. |
| `page_view → run_completed < 30%` for 4 weeks | The demo iframe isn't loading or visitors aren't engaging. Check the iframe URL. Move the demo above the fold if it's currently below. |
| New tool ships (0609) | Add it to the persona's saved pipeline only if it fits — don't bloat the demo with every tool. |
| Pricing change | Update `<meta>` schema, the buybar `.price-tag`, the pricing card, and the FAQ. Search-and-replace `$49` across the file. |
| New persona added (4th, 5th) | Copy `shopify-pet/index.html`, replace persona-specific copy, add to the `footer` cross-link block on the existing pages. |
## Why static HTML
Per `DECISIONS.md §5` and `BUSINESS.md §7`, the landing-page channel
must be:
- **Async-friendly** — Cloudflare Pages serves these with no operator
involvement
- **Cheap** — Cloudflare Pages free tier is sufficient until well past
the $5k/mo MRR re-lock trigger (`DECISIONS.md §8`)
- **Privacy-respecting** — no third-party tracker means no cookie
banner, which means no friction added to the conversion funnel
- **Zero ongoing maintenance** — no framework, no build, no upgrades.
The CSS uses system fonts; no Google Fonts; no CDN dependency that
could break the page when their TLS certificate rolls.
## Anti-temptations (per DEMO-PLAN §11 + plan §5)
These pages deliberately exclude:
- **No live chat widget.** Locked by no-touch.
- **No "schedule a demo with us" CTA.** Same.
- **No email capture before the demo.** Friction kills conversion.
- **No Google Analytics / Meta Pixel.** Privacy story is a moat, not
a checkbox to ignore.
- **No SaaS-style "free trial / no credit card."** This is a one-time
download, not a subscription.
- **No A/B-testing framework yet.** Pre-PMF traffic doesn't reach
statistical significance — ship the single-arm copy, iterate monthly.

234
landing/_shared/styles.css Normal file
View File

@@ -0,0 +1,234 @@
/* DataTools landing-page styles — single shared sheet for all niches.
*
* Design constraints:
* • No external font / CSS dependencies (works on Cloudflare Pages
* with zero build step, no privacy banner needed).
* • Mobile-first; layout reflows below 720 px.
* • Dark, focused, content-first. Buyer reads this on a laptop
* between Shopify exports — keep it readable and skimmable.
* • Persona pages all share this sheet — niche differences live in
* copy + accent-color variables overridden in each page's <style>.
*/
:root {
--bg: #0f1115;
--surface: #161922;
--surface-2: #1d212b;
--text: #e8eaed;
--text-mute: #9aa3b2;
--text-soft: #c8ced8;
--rule: #252a36;
--accent: #6ee7b7; /* Shopify pet default — overridden per persona */
--accent-ink: #052e1a;
--warn: #fbbf24;
--max: 1080px;
--radius: 12px;
--shadow: 0 1px 3px rgba(0,0,0,0.3), 0 8px 24px rgba(0,0,0,0.2);
--mono: ui-monospace, SFMono-Regular, "SF Mono", Menlo, monospace;
--sans: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto,
"Helvetica Neue", Arial, sans-serif;
}
* { box-sizing: border-box; }
html, body {
margin: 0; padding: 0;
background: var(--bg);
color: var(--text);
font-family: var(--sans);
font-size: 16px;
line-height: 1.55;
-webkit-font-smoothing: antialiased;
}
a { color: var(--accent); text-decoration: none; }
a:hover { text-decoration: underline; }
/* ----- Sticky buy bar ----- */
.buybar {
position: sticky; top: 0; z-index: 50;
background: rgba(15,17,21,0.92);
backdrop-filter: blur(8px);
border-bottom: 1px solid var(--rule);
padding: 10px 20px;
}
.buybar-inner {
max-width: var(--max); margin: 0 auto;
display: flex; align-items: center; justify-content: space-between;
gap: 16px;
}
.buybar .brand { font-weight: 600; letter-spacing: -0.01em; }
.buybar .brand-mark { color: var(--accent); margin-right: 6px; }
.buybar .price-tag { color: var(--text-mute); font-size: 14px; margin-right: 12px; }
/* ----- Buttons ----- */
.btn {
display: inline-block;
background: var(--accent); color: var(--accent-ink);
font-weight: 600; font-size: 15px;
padding: 11px 18px; border-radius: 8px;
border: 0; cursor: pointer;
transition: transform 0.05s ease, box-shadow 0.15s ease;
}
.btn:hover { transform: translateY(-1px); text-decoration: none; box-shadow: var(--shadow); }
.btn-large {
padding: 14px 24px; font-size: 17px;
}
.btn-ghost {
background: transparent; color: var(--text-soft);
border: 1px solid var(--rule);
}
.btn-ghost:hover { background: var(--surface); }
/* ----- Layout ----- */
section {
padding: 60px 20px;
border-bottom: 1px solid var(--rule);
}
section:last-of-type { border-bottom: 0; }
.container { max-width: var(--max); margin: 0 auto; }
h1, h2, h3 { line-height: 1.2; letter-spacing: -0.02em; margin-top: 0; }
h1 { font-size: 44px; margin-bottom: 18px; }
h2 { font-size: 30px; margin-bottom: 16px; }
h3 { font-size: 19px; margin-bottom: 8px; }
p { margin: 0 0 14px 0; color: var(--text-soft); }
.muted { color: var(--text-mute); }
.eyebrow { color: var(--accent); font-size: 13px; font-weight: 600;
text-transform: uppercase; letter-spacing: 0.08em; margin-bottom: 10px; }
ul.bullets { padding-left: 20px; margin: 0 0 14px 0; }
ul.bullets li { margin-bottom: 8px; color: var(--text-soft); }
/* ----- Hero ----- */
.hero {
padding: 80px 20px 60px;
background: radial-gradient(ellipse at top, var(--surface), var(--bg) 60%);
}
.hero h1 strong { color: var(--accent); font-weight: 700; }
.hero .lead {
font-size: 19px; color: var(--text-soft); max-width: 720px;
margin-bottom: 28px;
}
.hero .cta-row { display: flex; gap: 12px; flex-wrap: wrap; align-items: center; }
.hero .price-note { color: var(--text-mute); font-size: 14px; }
/* ----- Demo embed ----- */
.demo-frame {
background: var(--surface);
border: 1px solid var(--rule);
border-radius: var(--radius);
overflow: hidden;
box-shadow: var(--shadow);
}
.demo-frame iframe {
width: 100%; height: 720px; border: 0; display: block;
background: var(--surface-2);
}
.demo-caption {
font-size: 14px; color: var(--text-mute);
padding: 10px 16px; border-top: 1px solid var(--rule);
}
/* ----- Cards / grids ----- */
.grid {
display: grid; gap: 18px;
grid-template-columns: repeat(auto-fit, minmax(260px, 1fr));
}
.card {
background: var(--surface);
border: 1px solid var(--rule);
border-radius: var(--radius);
padding: 22px;
}
.card h3 { color: var(--text); }
.card p:last-child { margin-bottom: 0; }
.card .icon {
display: inline-block; font-size: 22px; margin-bottom: 8px;
}
/* ----- Stats row ----- */
.stats { display: flex; gap: 28px; flex-wrap: wrap; margin: 18px 0 0; }
.stats .stat .num {
font-family: var(--mono); font-size: 26px; font-weight: 600;
color: var(--accent);
}
.stats .stat .label { font-size: 13px; color: var(--text-mute); }
/* ----- Privacy / audit callout panels ----- */
.callout {
background: var(--surface);
border-left: 3px solid var(--accent);
border-radius: 0 var(--radius) var(--radius) 0;
padding: 18px 22px;
margin: 18px 0;
}
.callout strong { color: var(--text); }
/* ----- Code-ish blocks ----- */
.terminal {
font-family: var(--mono); font-size: 14px;
background: #0a0c10;
color: #d8dfe8;
border: 1px solid var(--rule);
border-radius: var(--radius);
padding: 16px 18px;
overflow-x: auto;
white-space: pre;
line-height: 1.45;
}
.terminal .prompt { color: var(--text-mute); }
.terminal .ok { color: var(--accent); }
.terminal .warn { color: var(--warn); }
/* ----- Pricing ----- */
.pricing {
display: grid; gap: 18px;
grid-template-columns: repeat(auto-fit, minmax(260px, 1fr));
}
.pricing .card .price {
font-size: 38px; font-weight: 700; letter-spacing: -0.02em;
color: var(--text);
}
.pricing .card .price-suffix { font-size: 14px; color: var(--text-mute); margin-left: 4px; }
.pricing .card.featured { border-color: var(--accent); }
.pricing .card .row { display: flex; align-items: baseline; gap: 4px; margin-bottom: 12px; }
.pricing .card ul { padding-left: 18px; margin: 12px 0 18px; }
.pricing .card li { color: var(--text-soft); margin-bottom: 6px; }
/* ----- FAQ ----- */
details.faq {
border-bottom: 1px solid var(--rule);
padding: 14px 0;
}
details.faq summary {
font-weight: 600; color: var(--text);
cursor: pointer; list-style: none;
display: flex; align-items: center; justify-content: space-between;
}
details.faq summary::after {
content: "+"; color: var(--accent); font-size: 22px;
margin-left: 14px;
}
details.faq[open] summary::after { content: ""; }
details.faq p { margin-top: 10px; }
/* ----- Footer ----- */
footer {
padding: 40px 20px 60px;
font-size: 14px;
color: var(--text-mute);
}
footer .container { display: flex; gap: 28px; flex-wrap: wrap; justify-content: space-between; }
footer a { color: var(--text-soft); }
footer p { color: var(--text-mute); }
/* ----- Responsive ----- */
@media (max-width: 720px) {
h1 { font-size: 32px; }
h2 { font-size: 24px; }
section { padding: 40px 18px; }
.hero { padding: 56px 18px 40px; }
.demo-frame iframe { height: 560px; }
.buybar-inner .price-tag { display: none; }
}

View File

@@ -0,0 +1,354 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>DataTools for Bookkeepers — Reconcile Bank Exports With An Audit Trail · $49</title>
<meta name="description" content="Reconcile messy bank exports. Catch duplicate transactions QuickBooks imported twice. Standardize dates, amounts, and vendor casing — locally. Every change auditable. $49 one-time." />
<meta name="keywords" content="reconcile bank export csv, quickbooks duplicate transactions, vendor list cleanup, bookkeeper csv tool, bank export deduplicator, bookkeeper audit trail" />
<link rel="canonical" href="https://datatools.app/bookkeeper/" />
<link rel="stylesheet" href="../_shared/styles.css" />
<!-- Persona accent: Bookkeeper → calm steel-blue -->
<style>
:root {
--accent: #7dd3fc;
--accent-ink: #042c43;
}
</style>
<!-- Open Graph -->
<meta property="og:title" content="DataTools for Bookkeepers — Reconcile Bank Exports With An Audit Trail" />
<meta property="og:description" content="Catch duplicate transactions. Standardize dates and amounts. Hand your client an audit trail. $49 one-time." />
<meta property="og:type" content="product" />
<meta property="og:url" content="https://datatools.app/bookkeeper/" />
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "SoftwareApplication",
"name": "DataTools for Bookkeepers",
"operatingSystem": "Windows, macOS, Linux",
"applicationCategory": "BusinessApplication",
"offers": {
"@type": "Offer",
"price": "49",
"priceCurrency": "USD"
},
"description": "Reconcile bank exports, dedupe vendor lists, and produce a hand-off-ready audit trail. Six-tool data-cleaning bundle for bookkeepers and freelance accountants.",
"softwareVersion": "1.0"
}
</script>
</head>
<body>
<div class="buybar">
<div class="buybar-inner">
<div class="brand"><span class="brand-mark"></span> DataTools <span class="muted">/ for Bookkeepers</span></div>
<div>
<span class="price-tag">$49 — one-time, no subscription</span>
<a class="btn" href="https://gumroad.com/l/datatools?from=bookkeeper" rel="noopener">Get DataTools →</a>
</div>
</div>
</div>
<section class="hero">
<div class="container">
<div class="eyebrow">For bookkeepers · freelance accountants · small-firm partners</div>
<h1>Reconcile messy bank exports.<br /><strong>Hand your client an audit trail.</strong></h1>
<p class="lead">
The Jan and Feb exports overlap and you've got the same transaction
booked twice. Vendor names are <em>"Amazon"</em>, <em>"amazon.com"</em>,
and <em>"AMAZON.COM*4F2X9"</em> in three different rows. Dates are a
smoosh of <code>01/15/2025</code>, <code>2025-01-15</code>, and
<code>Jan 18 2025</code>. DataTools fixes all of it in one pass —
and produces a row-by-row CSV showing every change so your client
can verify your work.
</p>
<div class="cta-row">
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=bookkeeper" rel="noopener">Get DataTools — $49 →</a>
<a class="btn btn-ghost btn-large" href="#demo">Try the live demo ↓</a>
<span class="price-note">One-time payment · cross-platform · runs offline</span>
</div>
<div class="stats">
<div class="stat"><div class="num">6</div><div class="label">tools, one bundle</div></div>
<div class="stat"><div class="num">100 %</div><div class="label">auditable changes</div></div>
<div class="stat"><div class="num">0</div><div class="label">cloud uploads ever</div></div>
</div>
</div>
</section>
<!-- ============= Pain points ============= -->
<section>
<div class="container">
<div class="eyebrow">If you've spent a Saturday on this, you already know</div>
<h2>Five pains DataTools fixes in one pass</h2>
<div class="grid">
<div class="card">
<span class="icon">📅</span>
<h3>Jan and Feb bank exports overlap — the same transaction posts twice</h3>
<p>QuickBooks (or any reconciler) silently double-counts the month-boundary rows. Your client's books understate cash by 14 % and nobody notices until tax season.</p>
<p class="muted"><strong>What it costs:</strong> 24 hours per month per client + reconciliation errors that can compound.</p>
</div>
<div class="card">
<span class="icon">📒</span>
<h3>1099 reports break because vendors are spelled three ways</h3>
<p>"Amazon", "amazon.com", "AMAZON.COM*4F2X9" become three separate vendors in QBO. You ship three 1099s instead of one — and the 1099-NEC threshold breaks both ways.</p>
<p class="muted"><strong>What it costs:</strong> 12 hours per 1099 cycle + IRS-paper-trail risk.</p>
</div>
<div class="card">
<span class="icon">🛡️</span>
<h3>"Show me what you changed" — your liability hangs on the answer</h3>
<p>Cloud cleaners that "just clean your data" don't give you a row-level audit log. Your professional indemnity insurance hates that. Your client's auditor hates that. You hate explaining it.</p>
<p class="muted"><strong>What it costs:</strong> per-firm liability premium + 2448 hr audit-response window stress.</p>
</div>
<div class="card">
<span class="icon">👥</span>
<h3>Per-client SaaS pricing destroys your margins at 10+ clients</h3>
<p>$30/mo per client × 20 clients = $600/mo, every month, for tooling. DataTools is a one-time desktop license you use on every client's books for the same $49. Forever.</p>
<p class="muted"><strong>What it costs:</strong> the difference between a $30/mo/client subscription and $49 once.</p>
</div>
<div class="card">
<span class="icon">🌍</span>
<h3>Multi-currency books break standard parsers</h3>
<p>Your client has EU customers. Their amounts come in as <code>€1.234,56</code> (comma decimal). Standard import tools see "1.234" as the whole-dollar amount and drop the rest. Parens-negative <code>($89.50)</code> gets read as positive.</p>
<p class="muted"><strong>What it costs:</strong> 3060 min per multi-currency client per month + occasional silent errors.</p>
</div>
<div class="card">
<span class="icon">🔒</span>
<h3>Your client's books are too sensitive for a cloud cleaner</h3>
<p>One "vendor breach" email to your clients ends the relationship. DataTools is desktop-only. No upload, no SaaS account, no third party seeing a single transaction. Verifiable in your browser's network tab.</p>
<p class="muted"><strong>What it costs:</strong> nothing — and that's exactly the point.</p>
</div>
</div>
</div>
</section>
<section id="demo">
<div class="container">
<div class="eyebrow">Live demo · runs in your browser</div>
<h2>Try it on a sample bank export with a known overlap</h2>
<p>
The demo below loads a 25-row export combining January and February
activity, with the month-boundary rows duplicated across exports —
the exact scenario where QuickBooks (or any reconciler) silently
double-counts transactions. Click <strong>Run pipeline</strong> and
watch the dedup catch every overlap, dates land in ISO format, and
the parens-negative amounts (<code>($89.50)</code>) become proper
negative numbers.
</p>
<div class="demo-frame">
<iframe
src="https://demo.datatools.app/?p=bookkeeper"
loading="lazy"
title="DataTools live demo — Bookkeeper"
sandbox="allow-scripts allow-same-origin allow-downloads allow-forms"></iframe>
<div class="demo-caption">
Demo runs on free hosting. Capped at 100 input rows · output
watermarked. The paid product has no caps and runs entirely offline.
</div>
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">Built for the bookkeeper's actual day</div>
<h2>Four workflows the rest of the industry tax-codes around</h2>
<div class="grid">
<div class="card">
<span class="icon">🏦</span>
<h3>Bank export reconciliation</h3>
<p>Two months of activity overlap at the boundary. The same transaction posts twice — once in each export — with different formatting. DataTools dedups on Date + Amount + fuzzy Vendor and catches all of them.</p>
</div>
<div class="card">
<span class="icon">📒</span>
<h3>Vendor list consolidation</h3>
<p>QuickBooks has <code>amazon.com</code>. Your spreadsheet has <code>Amazon</code>. The bank statement has <code>AMAZON.COM*4F2X9</code>. Standardize the casing, fuzzy-match across sources, hand the client one clean vendor list.</p>
</div>
<div class="card">
<span class="icon">👥</span>
<h3>Customer master cleanup pre-migration</h3>
<p>Before moving from one accounting system to another, the customer master needs to be deduped, standardized, and audited. One tool, one pipeline, one CSV in / clean CSV out.</p>
</div>
<div class="card">
<span class="icon">🧾</span>
<h3>Expense report dedup</h3>
<p>Same receipt scanned twice. Same Uber ride entered manually and then imported from the corporate card. Catch them once — and produce the audit log that proves the duplicate <em>was</em> a duplicate.</p>
</div>
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">The feature your liability insurance cares about</div>
<h2>Every change auditable. Period.</h2>
<p>
Every cell DataTools modifies is logged with the original value, the
new value, and which rule fired. When your client asks why a
transaction got merged or a date got reformatted, you don't say
"the AI did it." You hand them the CSV.
</p>
<div class="callout">
<strong>Why this matters specifically to bookkeepers:</strong> your
professional liability hangs on traceability. Cloud cleaners that
"just clean your data" without a row-level audit are unsafe at any
price. DataTools writes the audit by default, downloadable as a
separate CSV alongside the cleaned file.
</div>
<div class="terminal"><span class="prompt">$</span> head -5 client_jan2025_changes.csv
row,column,field_type,old,new
0,"Date ",date,"01/15/2025","2025-01-15"
0,Description,name," AMAZON.COM*4F2X9 PURCHASE","Amazon.com*4F2X9 Purchase"
0,Amount,currency,"-$129.99","-129.99"
1,Date ,date,"2025-01-15","2025-01-15"
<span class="prompt">$</span> # one row of audit per cell change. handed to the client. signed off.</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">The thing every cloud reconciler can't say</div>
<h2>Your client's books never leave your computer.</h2>
<p>
Your clients trust you with their books. That trust is one
"we noticed our data appeared in a vendor breach" email away from
gone. DataTools is a desktop app — no upload, no SaaS, no
subscription, no third party seeing a single transaction.
</p>
<div class="callout">
<strong>Confirm it yourself.</strong> Open your browser's network
tab when DataTools is running. Click around. Run the pipeline.
Zero outbound requests. Ever.
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">If your clients run multi-currency books</div>
<h2>$ £ € ¥ R$ kr zł — handled.</h2>
<p>
Standardize <code>$1,234.56</code>, <code>1.234,56 €</code> (EU
decimal), <code>($89.50)</code> (parens-negative),
<code>R$ 250,00</code>, <code>kr 1.250,50</code>, and the rest of
the long tail. Output is canonical numeric (your import tool's
favourite shape) with optional ISO 4217 prefix
(<code>USD 1234.56</code>) when you need to preserve the
currency.
</p>
<ul class="bullets">
<li><strong>Auto-detect</strong> EU comma decimal so your French and German clients' books reconcile without per-locale config.</li>
<li><strong>Parens-negative</strong> handled — accounting convention, not just a math style.</li>
<li><strong>Multi-character prefixes</strong> like <code>R$</code> (Brazilian Real) and <code>kr</code> (Nordic) detected before the single-symbol regex so they don't get bucketed as USD.</li>
</ul>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">In the bundle</div>
<h2>Six tools. One pipeline. One $49 download.</h2>
<div class="grid">
<div class="card"><h3>1 · Deduplicator</h3><p>Fuzzy match (Jaro-Winkler), explicit strategies for Date+Amount+Vendor, survivor rules.</p></div>
<div class="card"><h3>2 · Text Cleaner</h3><p>Header whitespace, smart quotes from copy-paste, em-dash sentinels.</p></div>
<div class="card"><h3>3 · Format Standardizer</h3><p>ISO dates, numeric amounts (parens-negative), vendor casing, multi-currency.</p></div>
<div class="card"><h3>4 · Missing Value Handler</h3><p>Disguised-null detection: <code></code>, <code>N/A</code>, <code>(blank)</code>, <code>?</code>.</p></div>
<div class="card"><h3>5 · Column Mapper</h3><p>Project to your accounting tool's required schema, coerce types, drop extras.</p></div>
<div class="card"><h3>6 · Pipeline Runner</h3><p>Save the cleanup. Run it on next month's export with one command. Same audit, automated.</p></div>
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">Pricing — pay once, own it</div>
<h2>$49. No subscription. No per-client license.</h2>
<div class="pricing">
<div class="card featured">
<div class="row"><div class="price">$49</div><div class="price-suffix">one-time</div></div>
<h3>DataTools for Bookkeepers</h3>
<ul>
<li>All 6 tools, full pipeline</li>
<li>Mac · Windows · Linux installers</li>
<li>Code-signed (no Gatekeeper warnings)</li>
<li>Free updates for the v1.x line</li>
<li>Bonus: ready-made bank-reconcile and vendor-cleanup pipelines</li>
<li><strong>Use on any number of clients</strong> — no seat limits</li>
</ul>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=bookkeeper" rel="noopener">Buy on Gumroad →</a>
</div>
<div class="card">
<div class="row"><div class="price">$199</div><div class="price-suffix">one-time</div></div>
<h3>+ Priority email support</h3>
<p class="muted">Available post-launch. 24-hour async response on edge cases. Same product. Targeted at bookkeepers whose own time is &gt; $200/hr.</p>
<a class="btn btn-ghost btn-large" href="#" aria-disabled="true">Coming soon</a>
</div>
</div>
</div>
</section>
<section>
<div class="container">
<h2>Questions</h2>
<details class="faq">
<summary>Does this replace QuickBooks / Xero?</summary>
<p>No — DataTools cleans the data <em>before</em> it goes into your accounting system, or after you export it for analysis. It sits alongside QB/Xero, not in place of them. Think of it as the import-clean-up step that should have shipped with the bank export feature in the first place.</p>
</details>
<details class="faq">
<summary>Can I use it on multiple clients without paying again?</summary>
<p>Yes. The licence is per-bookkeeper, not per-client. Run it on every client's books for the same $49.</p>
</details>
<details class="faq">
<summary>What's the audit log look like in court?</summary>
<p>It's a CSV with five columns per change: <code>row, column, field_type, old, new</code>. Plus a JSON pipeline file describing exactly which rules ran in which order. Together they reproduce the cleanup deterministically — your client (or their auditor) can verify it on their machine.</p>
</details>
<details class="faq">
<summary>How does it handle Excel-only weirdness like serial dates?</summary>
<p>Excel serial dates (the number 45295 = 2024-01-15) are detected and converted automatically. So are Unix timestamps in seconds and milliseconds, RFC 2822 dates from email exports, partial-precision dates (<code>2024-01</code>, <code>2024-Q1</code>), and locale-specific month names in English/French/German.</p>
</details>
<details class="faq">
<summary>What about my clients' privacy?</summary>
<p>Your clients' books never leave your computer. The cleaner is a desktop app with zero network code in the data path. You can verify this in your browser's network tab.</p>
</details>
<details class="faq">
<summary>What's your refund policy?</summary>
<p>Try the live demo above on the sample dataset before you buy. If DataTools doesn't fit your workflow within 14 days, email for a refund — no questions asked.</p>
</details>
</div>
</section>
<section>
<div class="container" style="text-align: center;">
<h2>Stop reconciling bank exports by hand.</h2>
<p class="lead" style="margin: 0 auto 28px;">One $49 download. Catches the duplicate transactions QuickBooks imported twice, standardises dates and amounts and vendor casing, and hands you a row-level audit log to share with your client.</p>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=bookkeeper" rel="noopener">Get DataTools — $49 →</a>
</div>
</section>
<footer>
<div class="container">
<div>
<p><strong>DataTools</strong> — local data-cleaning for Shopify, bookkeepers, and RevOps teams.</p>
<p class="muted">© 2026 · Built solo · Shipped from a small office.</p>
</div>
<div>
<p>
<a href="../shopify-pet/">For Shopify operators</a> ·
<a href="../revops/">For RevOps agencies</a><br />
<a href="https://gumroad.com/l/datatools?from=bookkeeper">Buy on Gumroad</a> ·
<a href="mailto:hello@datatools.app">Email support</a>
</p>
</div>
</div>
</footer>
</body>
</html>

View File

@@ -0,0 +1,22 @@
{
"_comment": [
"Deployment substitution config. Copy to deploy.config.json and",
"fill in the real URLs before running deploy.py.",
"deploy.config.json is gitignored (never commit your real URLs)."
],
"site_origin": "https://datatools.app",
"demo_base_url": "https://datatools-demo.streamlit.app",
"gumroad_listing": "https://gumroad.com/l/datatools",
"support_email": "hello@datatools.app",
"personas": ["shopify-pet", "bookkeeper", "revops"],
"_substitutions_made": [
"{{site_origin}}/ → site_origin/",
"{{demo_base_url}}/?p=<persona> → live demo iframe per persona",
"{{gumroad_url}}?from=<persona> → Gumroad CTA on every page",
"{{support_email}} → mailto: link"
]
}

235
landing/deploy.py Normal file
View File

@@ -0,0 +1,235 @@
"""Build a deploy-ready ``landing/dist/`` from the source HTML.
Run from the repo root after copying ``landing/deploy.config.example.json``
to ``landing/deploy.config.json`` and filling in the real URLs:
python3 landing/deploy.py
Output:
landing/dist/index.html
landing/dist/shopify-pet/index.html
landing/dist/bookkeeper/index.html
landing/dist/revops/index.html
landing/dist/_shared/styles.css
landing/dist/robots.txt
landing/dist/sitemap.xml
landing/dist/404.html
landing/dist/favicon.svg
Upload ``landing/dist/`` to Cloudflare Pages (drag-and-drop in the
dashboard, or ``wrangler pages deploy landing/dist``).
Why this script exists:
The source HTML carries placeholder URLs (``{{demo_base_url}}``,
``{{gumroad_url}}``, ``{{support_email}}``, ``{{site_origin}}``)
so the operator's actual demo / Gumroad / domain URLs aren't
committed to the repo. This script reads the operator's config
and produces a ready-to-upload bundle.
It also stamps a sitemap.xml + robots.txt + 404.html and copies
the shared CSS so the output directory is fully self-contained.
"""
from __future__ import annotations
import json
import re
import shutil
import sys
from datetime import date
from pathlib import Path
LANDING = Path(__file__).resolve().parent
REPO = LANDING.parent
DIST = LANDING / "dist"
CONFIG_PATH = LANDING / "deploy.config.json"
EXAMPLE_PATH = LANDING / "deploy.config.example.json"
# Files to substitute and copy. Order matters only for readability.
HTML_PAGES = [
LANDING / "index.html",
LANDING / "shopify-pet" / "index.html",
LANDING / "bookkeeper" / "index.html",
LANDING / "revops" / "index.html",
]
SHARED = LANDING / "_shared" / "styles.css"
def _load_config() -> dict:
if not CONFIG_PATH.exists():
sys.stderr.write(
f"\nERROR: {CONFIG_PATH.name} not found.\n"
f" cp {EXAMPLE_PATH.name} {CONFIG_PATH.name}\n"
f" edit {CONFIG_PATH.name} with your real URLs\n"
f" re-run: python3 landing/deploy.py\n\n"
)
sys.exit(2)
cfg = json.loads(CONFIG_PATH.read_text())
required = ("site_origin", "demo_base_url", "gumroad_listing", "support_email")
missing = [k for k in required if not cfg.get(k)]
if missing:
sys.stderr.write(
f"\nERROR: {CONFIG_PATH.name} is missing required fields: {missing}\n"
f" See {EXAMPLE_PATH.name} for the full template.\n\n"
)
sys.exit(2)
return cfg
def _substitute(text: str, cfg: dict) -> str:
"""Replace placeholders + the demo / Gumroad URL patterns the source HTML uses today."""
site_origin = cfg["site_origin"].rstrip("/")
demo_base = cfg["demo_base_url"].rstrip("/")
gumroad_base = cfg["gumroad_listing"]
support_email = cfg["support_email"]
# Direct placeholder tokens (clean approach — used by future copy).
text = text.replace("{{site_origin}}", site_origin)
text = text.replace("{{demo_base_url}}", demo_base)
text = text.replace("{{gumroad_url}}", gumroad_base)
text = text.replace("{{support_email}}", support_email)
# Backwards-compatible patterns: the source HTML in this repo carries
# literal ``https://datatools.app`` and ``https://demo.datatools.app``
# so this script swaps those too. Once new pages adopt the
# ``{{placeholder}}`` style above, this layer can be retired.
text = re.sub(
r"https://demo\.datatools\.app",
demo_base,
text,
)
# Replace ``https://datatools.app/...`` for canonical / OG URLs but
# do NOT swap ``https://datatools.app`` when it is followed by an
# at-sign as part of an email address (no such case today; defensive).
text = re.sub(
r"https://datatools\.app",
site_origin,
text,
)
# Gumroad URL family — preserve the ``?from=<persona>`` query.
text = re.sub(
r"https://gumroad\.com/l/datatools",
gumroad_base.rstrip("/").replace("/l/datatools", "/l/datatools"),
text,
)
# Support email shows up only as ``mailto:hello@datatools.app``.
text = text.replace("mailto:hello@datatools.app", f"mailto:{support_email}")
text = text.replace("hello@datatools.app", support_email)
return text
def _stamp_sitemap(cfg: dict) -> str:
site = cfg["site_origin"].rstrip("/")
today = date.today().isoformat()
urls = [site + "/"] + [
f"{site}/{p}/" for p in cfg.get("personas", ["shopify-pet", "bookkeeper", "revops"])
]
items = "\n".join(
f" <url><loc>{u}</loc><lastmod>{today}</lastmod></url>"
for u in urls
)
return (
'<?xml version="1.0" encoding="UTF-8"?>\n'
'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
f"{items}\n"
"</urlset>\n"
)
def _robots_txt(cfg: dict) -> str:
return (
"# Allow everything; we want every persona page indexable.\n"
"User-agent: *\n"
"Allow: /\n"
f"Sitemap: {cfg['site_origin'].rstrip('/')}/sitemap.xml\n"
)
def _favicon_svg() -> str:
"""Tiny self-contained SVG favicon — broom emoji-style mark."""
return (
'<?xml version="1.0" encoding="UTF-8"?>\n'
'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 64 64">\n'
' <rect width="64" height="64" rx="14" fill="#0f1115"/>\n'
' <circle cx="32" cy="32" r="9" fill="#6ee7b7"/>\n'
"</svg>\n"
)
def _build_404_html(cfg: dict) -> str:
"""Cloudflare Pages serves 404.html when a path doesn't match."""
site_origin = cfg["site_origin"].rstrip("/")
return f"""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Not found · DataTools</title>
<link rel="stylesheet" href="/_shared/styles.css" />
</head>
<body>
<section class="hero" style="text-align: center;">
<div class="container">
<div class="eyebrow">404</div>
<h1>That page isn't here.</h1>
<p class="lead" style="margin: 0 auto 28px;">Pick a workflow below to land somewhere useful.</p>
<p>
<a class="btn" href="{site_origin}/shopify-pet/">For Shopify</a>
&nbsp;
<a class="btn" href="{site_origin}/bookkeeper/">For bookkeepers</a>
&nbsp;
<a class="btn" href="{site_origin}/revops/">For RevOps</a>
</p>
</div>
</section>
</body>
</html>
"""
def main() -> int:
cfg = _load_config()
if DIST.exists():
shutil.rmtree(DIST)
DIST.mkdir(parents=True)
# Shared CSS (same path the source HTML expects: ``../_shared/styles.css``)
(DIST / "_shared").mkdir()
shutil.copy(SHARED, DIST / "_shared" / "styles.css")
# Per-page substitutions
page_count = 0
for src in HTML_PAGES:
rel = src.relative_to(LANDING)
dest = DIST / rel
dest.parent.mkdir(parents=True, exist_ok=True)
dest.write_text(_substitute(src.read_text(), cfg))
page_count += 1
# Stamped supporting files
(DIST / "robots.txt").write_text(_robots_txt(cfg))
(DIST / "sitemap.xml").write_text(_stamp_sitemap(cfg))
(DIST / "404.html").write_text(_build_404_html(cfg))
(DIST / "favicon.svg").write_text(_favicon_svg())
# Final report
print(f"\n✓ Built {page_count} HTML pages + sitemap + robots + 404 + favicon")
print(f" Output: {DIST.relative_to(REPO)}/")
print()
print("Next steps:")
print(" 1) wrangler pages deploy landing/dist # if you use Wrangler")
print(" OR drag-and-drop landing/dist/ in the Cloudflare Pages dashboard")
print(" 2) Configure custom domain on Cloudflare Pages → "
f"{cfg['site_origin']}")
print(" 3) Verify: open the deployed apex URL, click each persona "
"card, click each demo iframe, click each buy button → Gumroad listing")
print()
return 0
if __name__ == "__main__":
sys.exit(main())

236
landing/index.html Normal file
View File

@@ -0,0 +1,236 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>DataTools — Local CSV / Excel Cleaning for Shopify, Bookkeepers, and RevOps</title>
<meta name="description" content="One desktop tool. Three workflows. Clean Shopify customer exports, reconcile messy bank statements, or dedupe lead lists across HubSpot and LinkedIn — all locally. $49 one-time." />
<link rel="canonical" href="https://datatools.app/" />
<link rel="stylesheet" href="_shared/styles.css" />
<meta property="og:title" content="DataTools — Local CSV / Excel Cleaning" />
<meta property="og:description" content="One desktop tool, three niche workflows. Runs entirely offline. $49 one-time." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://datatools.app/" />
<style>
/* Apex-pageonly tweaks: persona cards are slightly bigger and use
per-card accent borders so the visitor visually identifies which
card matches their work in <2 seconds. */
.persona-grid {
display: grid; gap: 24px;
grid-template-columns: repeat(auto-fit, minmax(300px, 1fr));
margin-top: 28px;
}
.persona-card {
background: var(--surface);
border: 1px solid var(--rule);
border-radius: var(--radius);
padding: 28px;
display: flex; flex-direction: column;
transition: transform 0.08s ease, border-color 0.15s ease, box-shadow 0.2s ease;
text-decoration: none;
color: inherit;
}
.persona-card:hover {
transform: translateY(-2px);
border-color: var(--card-accent, var(--accent));
box-shadow: var(--shadow);
text-decoration: none;
}
.persona-card.shopify { --card-accent: #6ee7b7; }
.persona-card.bookkeeper{ --card-accent: #7dd3fc; }
.persona-card.revops { --card-accent: #c4b5fd; }
.persona-card .pill {
display: inline-block;
background: rgba(255,255,255,0.04);
color: var(--card-accent, var(--accent));
border: 1px solid var(--card-accent, var(--accent));
padding: 4px 10px; border-radius: 999px;
font-size: 12px; font-weight: 600;
letter-spacing: 0.04em;
margin-bottom: 12px;
align-self: flex-start;
}
.persona-card h3 {
color: var(--text);
font-size: 22px;
margin-bottom: 12px;
}
.persona-card p {
color: var(--text-soft);
flex: 1;
margin-bottom: 16px;
}
.persona-card .pain {
font-size: 14px; color: var(--text-mute);
margin: 8px 0 18px;
}
.persona-card .pain li { margin-bottom: 4px; }
.persona-card .open {
color: var(--card-accent, var(--accent));
font-weight: 600;
font-size: 15px;
}
.persona-card .open::after {
content: " →";
transition: margin-left 0.15s ease;
}
.persona-card:hover .open::after { margin-left: 4px; }
</style>
</head>
<body>
<!-- Sticky brand bar (no buy CTA on the apex — visitor hasn't picked a niche yet) -->
<div class="buybar">
<div class="buybar-inner">
<div class="brand"><span class="brand-mark"></span> DataTools</div>
<div>
<span class="price-tag">Pick your workflow ↓</span>
</div>
</div>
</div>
<section class="hero">
<div class="container">
<div class="eyebrow">For Shopify operators · bookkeepers · marketing & RevOps agencies</div>
<h1>Local CSV / Excel cleaning.<br /><strong>One tool. Three workflows.</strong></h1>
<p class="lead">
DataTools is a desktop app that fixes the data-cleaning headaches
every small business hits — duplicates Excel can't catch,
international phones it can't parse, dates and currencies in three
different formats per export. One $49 download. Works on Mac,
Windows, and Linux. <strong>Your data never leaves your
computer.</strong>
</p>
<div class="persona-grid">
<a class="persona-card shopify" href="shopify-pet/">
<span class="pill">🛍️ Shopify operator</span>
<h3>Customer / vendor / subscriber export cleanup</h3>
<p>
Klaviyo-import-ready customer lists in 30 seconds. Catches
cross-device duplicates, standardizes international phones
and addresses, fixes the disguised nulls that break product
feeds.
</p>
<ul class="pain">
<li>· Fix Klaviyo per-contact billing on phantom dupes</li>
<li>· Repair feeds rejected by Google Merchant / Meta</li>
<li>· Unify orders from Shopify + Etsy + Amazon + Faire</li>
<li>· Resolve VAT-MOSS country-name drift</li>
</ul>
<span class="open">Open the Shopify demo &amp; pricing</span>
</a>
<a class="persona-card bookkeeper" href="bookkeeper/">
<span class="pill">📒 Bookkeeper / accountant</span>
<h3>Bank-export reconciliation with audit trail</h3>
<p>
Catches the duplicate transaction QuickBooks imported twice
when Jan and Feb exports overlap. Standardizes dates,
amounts, and vendor casing. Hands you a row-level audit log
to share with the client.
</p>
<ul class="pain">
<li>· Catch month-overlap re-import dupes</li>
<li>· Consolidate vendors for clean 1099 reports</li>
<li>· Produce hand-off-ready audit trail</li>
<li>· Multi-currency books (EUR / GBP / BRL)</li>
</ul>
<span class="open">Open the bookkeeper demo &amp; pricing</span>
</a>
<a class="persona-card revops" href="revops/">
<span class="pill">🪢 Marketing / RevOps</span>
<h3>Lead-list dedup across HubSpot, LinkedIn, scrapes</h3>
<p>
One canonical lead per real person — across HubSpot,
LinkedIn, Apollo, ZoomInfo, and manual scrapes.
International phones (50+ country codes), per-row country
column, fuzzy match with merge.
</p>
<ul class="pain">
<li>· Stop paying HubSpot tier price for cross-source dupes</li>
<li>· Protect sender reputation from invalid emails</li>
<li>· Skip the 48 wk GDPR review on cloud cleaners</li>
<li>· Suppression-list sync across 5+ platforms</li>
</ul>
<span class="open">Open the RevOps demo &amp; pricing</span>
</a>
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">What's the same across all three</div>
<h2>One engine. Same six tools. Same $49.</h2>
<p>
The persona pages above are positioning, not different products.
Whichever you buy, you get the full bundle: Deduplicator, Text
Cleaner, Format Standardizer, Missing-Value Handler, Column
Mapper, and Pipeline Runner — pre-tuned with a saved pipeline
that matches your workflow.
</p>
<div class="grid">
<div class="card">
<span class="icon">🔒</span>
<h3>Local-first</h3>
<p>Desktop app. No cloud upload, no SaaS account, no subscription. Verify zero outbound calls in your browser's network tab.</p>
</div>
<div class="card">
<span class="icon">📋</span>
<h3>Auditable</h3>
<p>Every cell change is logged with the original value, the new value, and which rule fired. Hand the audit CSV to your client.</p>
</div>
<div class="card">
<span class="icon">🌍</span>
<h3>International</h3>
<p>50+ country codes, per-row country awareness, EU comma decimals, parens-negative amounts, locale-aware month names.</p>
</div>
<div class="card">
<span class="icon">⚙️</span>
<h3>Repeatable</h3>
<p>Save your cleanup as a JSON pipeline. Re-run on next week's export with one CLI command. Same cleanup, zero re-config.</p>
</div>
<div class="card">
<span class="icon">📦</span>
<h3>Cross-platform</h3>
<p>Mac · Windows · Linux installers. Code-signed for macOS Gatekeeper. Free updates for the v1.x line.</p>
</div>
<div class="card">
<span class="icon">💰</span>
<h3>$49 one-time</h3>
<p>No subscription. No per-client license. No row caps. No AI black-box.</p>
</div>
</div>
</div>
</section>
<section>
<div class="container" style="text-align: center;">
<h2>Pick your workflow above to try the live demo.</h2>
<p class="muted">Or read the docs first — every tool has a CLI, every pipeline is JSON, every change is audited.</p>
</div>
</section>
<footer>
<div class="container">
<div>
<p><strong>DataTools</strong> — local data-cleaning for Shopify, bookkeepers, and RevOps teams.</p>
<p class="muted">© 2026 · Built solo · Shipped from a small office.</p>
</div>
<div>
<p>
<a href="shopify-pet/">For Shopify operators</a> ·
<a href="bookkeeper/">For bookkeepers</a> ·
<a href="revops/">For RevOps agencies</a><br />
<a href="mailto:hello@datatools.app">Email support</a>
</p>
</div>
</div>
</footer>
</body>
</html>

352
landing/revops/index.html Normal file
View File

@@ -0,0 +1,352 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>DataTools for RevOps — Dedupe Lead Lists Across HubSpot, LinkedIn, and Manual Scrapes · $49</title>
<meta name="description" content="One tool to dedupe lead lists across HubSpot, LinkedIn, and manual scrapes. International phones (50+ country codes), per-row country normalization, fuzzy match across vendors, fully offline. $49 one-time." />
<meta name="keywords" content="dedupe lead list, hubspot deduplicate, linkedin lead cleanup, marketing data cleaning, revops csv tool, multi-vendor lead unification, international phone normalization" />
<link rel="canonical" href="https://datatools.app/revops/" />
<link rel="stylesheet" href="../_shared/styles.css" />
<!-- Persona accent: RevOps → vivid violet -->
<style>
:root {
--accent: #c4b5fd;
--accent-ink: #2e1065;
}
</style>
<meta property="og:title" content="DataTools for RevOps — Dedupe Lead Lists Across HubSpot, LinkedIn, and Manual Scrapes" />
<meta property="og:description" content="International phones, country normalization, fuzzy dedup with merge — one tool, no upload. $49 one-time." />
<meta property="og:type" content="product" />
<meta property="og:url" content="https://datatools.app/revops/" />
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "SoftwareApplication",
"name": "DataTools for RevOps",
"operatingSystem": "Windows, macOS, Linux",
"applicationCategory": "BusinessApplication",
"offers": {
"@type": "Offer",
"price": "49",
"priceCurrency": "USD"
},
"description": "Dedupe and unify lead lists across CRM, scraping, and manual sources. International phone normalization, per-row country, fuzzy match with merge. Six-tool data-cleaning bundle for RevOps and marketing agencies.",
"softwareVersion": "1.0"
}
</script>
</head>
<body>
<div class="buybar">
<div class="buybar-inner">
<div class="brand"><span class="brand-mark"></span> DataTools <span class="muted">/ for RevOps</span></div>
<div>
<span class="price-tag">$49 — one-time, no subscription</span>
<a class="btn" href="https://gumroad.com/l/datatools?from=revops" rel="noopener">Get DataTools →</a>
</div>
</div>
</div>
<section class="hero">
<div class="container">
<div class="eyebrow">For RevOps · marketing ops · agency lead-gen · audience-builders</div>
<h1>Dedupe lead lists across HubSpot, LinkedIn,<br /><strong>and manual scrapes — locally.</strong></h1>
<p class="lead">
The same prospect shows up as <code>alice@acme.com</code> in HubSpot,
<code>Alice.Johnson@acme.com</code> in LinkedIn Sales Navigator, and
<code>alice@acme.com</code> again from your VA's manual scrape. Their
phone is <code>(415) 555-1234</code> in one source and
<code>4155551234</code> in another. DataTools fuzzy-matches across
sources, normalizes phones to E.164 with per-row country awareness,
and produces one canonical lead per real person — without uploading
a single contact to a third-party tool.
</p>
<div class="cta-row">
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=revops" rel="noopener">Get DataTools — $49 →</a>
<a class="btn btn-ghost btn-large" href="#demo">Try the live demo ↓</a>
<span class="price-note">One-time payment · cross-platform · runs offline</span>
</div>
<div class="stats">
<div class="stat"><div class="num">50+</div><div class="label">country codes</div></div>
<div class="stat"><div class="num">3</div><div class="label">CRM sources unified</div></div>
<div class="stat"><div class="num">0</div><div class="label">cloud uploads ever</div></div>
</div>
</div>
</section>
<!-- ============= Pain points ============= -->
<section>
<div class="container">
<div class="eyebrow">If your last campaign launch was held up by data hygiene</div>
<h2>Five pains DataTools fixes before you import to HubSpot</h2>
<div class="grid">
<div class="card">
<span class="icon">💸</span>
<h3>HubSpot / Marketo / Iterable bills you for every duplicate contact</h3>
<p>10 k contacts → enterprise tier at $48 k/mo. 18 % cross-source duplicate rate from Apollo + ZoomInfo + LinkedIn means you're at 8.2 k unique people but paying for 10 k. Every month. Forever.</p>
<p class="muted"><strong>What it costs:</strong> $200$800 per 1 k duplicate contacts — recurring, every month.</p>
</div>
<div class="card">
<span class="icon">🚫</span>
<h3>Sender reputation tanks when you mail to invalid or duplicate addresses</h3>
<p>One bad sending session — to addresses your team scraped or imported without hygiene — and your domain reputation takes weeks to recover. Your good campaigns sit in spam folders during the recovery.</p>
<p class="muted"><strong>What it costs:</strong> catastrophic — entire email programme degraded for 26 weeks.</p>
</div>
<div class="card">
<span class="icon">⚖️</span>
<h3>GDPR makes uploading to a cloud cleaner a legal-review marathon</h3>
<p>Every cloud-based lead-cleaner needs you to upload your prospect list. Your legal team needs 48 weeks to bless that. DataTools is desktop-only — no upload, no DPA, no review, no delay.</p>
<p class="muted"><strong>What it costs:</strong> 48 weeks of legal-review delay per tool, every time.</p>
</div>
<div class="card">
<span class="icon">🪢</span>
<h3>Apollo + ZoomInfo + LinkedIn + manual scrapes all use different schemas</h3>
<p>Each export has its own column names, scoring scale, country format. Unifying them by hand for one campaign costs 13 days. Doing it for every campaign is unsustainable.</p>
<p class="muted"><strong>What it costs:</strong> 13 days per campaign of manual unification + judgement calls that drift across team members.</p>
</div>
<div class="card">
<span class="icon">🛡️</span>
<h3>Suppression lists across 5+ marketing platforms get out of sync</h3>
<p>Each platform has its own suppression format. Out-of-sync lists let opted-out contacts slip through, triggering CAN-SPAM / GDPR exposure and the kind of "we got a complaint" email no one wants.</p>
<p class="muted"><strong>What it costs:</strong> compliance risk + churn-back cost + stakeholder trust.</p>
</div>
<div class="card">
<span class="icon">📞</span>
<h3>International dialer fails because phone formats vary</h3>
<p>Calling list to 15 countries with mixed formats means dialler rejects 815 % of numbers, your reps spend the day on "number invalid" tones instead of conversations.</p>
<p class="muted"><strong>What it costs:</strong> rep productivity × failure rate × team size.</p>
</div>
</div>
</div>
</section>
<section id="demo">
<div class="container">
<div class="eyebrow">Live demo · runs in your browser</div>
<h2>Try it on a real-looking 3-vendor lead list</h2>
<p>
The demo below loads a 25-row lead worksheet combining HubSpot,
LinkedIn Sales Navigator, and manual scraping — with the same prospect
appearing in two or three sources, country names spelled three
different ways (<code>USA</code>, <code>US</code>, <code>United
States</code>), and 13 different international phone formats. Click
<strong>Run pipeline</strong> and watch the 5-step pipeline (text
clean → format → missing → column map → dedup) collapse 25 rows to 19
with a single canonical record per prospect.
</p>
<div class="demo-frame">
<iframe
src="https://demo.datatools.app/?p=revops"
loading="lazy"
title="DataTools live demo — RevOps"
sandbox="allow-scripts allow-same-origin allow-downloads allow-forms"></iframe>
<div class="demo-caption">
Demo runs on free hosting. Capped at 100 input rows · output
watermarked. The paid product has no caps and runs entirely offline.
</div>
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">Built for the agency RevOps day</div>
<h2>Three workflows you do every campaign</h2>
<div class="grid">
<div class="card">
<span class="icon">🪢</span>
<h3>Email-list dedup across lead sources</h3>
<p>HubSpot exports + LinkedIn Sales Navigator + the VA's spreadsheet, all merged. Fuzzy match across email + phone + name catches the cross-source duplicates that broke your last campaign send.</p>
</div>
<div class="card">
<span class="icon">🌍</span>
<h3>Multi-platform audience reconciliation</h3>
<p>Build one canonical audience from Meta, Google Ads, LinkedIn, and your CRM. Each platform exports a different shape; column-mapper aligns them all, dedup merges the survivors with their most-complete fields.</p>
</div>
<div class="card">
<span class="icon">🛡️</span>
<h3>Suppression-list management</h3>
<p>Suppression lists need to dedupe across email + phone + first-party identifiers. Add a row, dedupe, ship the canonical CSV to every platform — without uploading the suppression list to any of them.</p>
</div>
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">If your campaigns target outside the US — almost everyone's do</div>
<h2>50+ country codes. Per-row country awareness.</h2>
<p>
Your HubSpot list has <code>(415) 555-1234</code>. Your scraped
list from the same prospect has <code>+1 415 555 1234</code>. Your
Italian prospect entered <code>+39 06 6982</code>. Your Brazilian
lead has <code>11 3071 0000</code>. Each comes from a row tagged
with its country — DataTools reads that column per row and parses
every phone correctly to E.164.
</p>
<ul class="bullets">
<li><strong>Per-row country column</strong> drives the parser — no global default that bucks UK numbers as malformed US.</li>
<li><strong>Country-name normalization</strong>: <code>USA</code> / <code>US</code> / <code>United States</code> all resolve to the same ISO-2 code.</li>
<li><strong>50+ country support</strong> via Google's libphonenumber, including KR, CN, IN, MX, BR, IL, TR, PL, DK, SE.</li>
<li><strong>Schema enforcement</strong> via the column-mapper: project to your CRM's required shape, coerce score columns to integers, reorder fields to match the import contract.</li>
</ul>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">For platforms that charge per contact</div>
<h2>Every duplicate you don't catch costs you for the life of the contract.</h2>
<p>
HubSpot prices on contacts. Klaviyo prices on contacts. Marketo,
Iterable, ActiveCampaign — all priced on contacts. Every duplicate
you don't catch is a recurring tax on your campaign. DataTools
catches them once, before import, with a fuzzy matcher that's
tuned to the cross-source noise you actually see.
</p>
<div class="callout">
<strong>Real numbers from the demo:</strong> 25 input rows from
three sources collapse to 19 — that's 6 duplicates the cross-source
noise was hiding. On a 50,000-row campaign list, that ratio
typically saves 12,000+ contacts a month, every month.
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">The thing every cloud cleaner can't say</div>
<h2>Your prospects' contact info never leaves your computer.</h2>
<p>
Cloud lead-cleaning tools require you to upload your audience.
That audience is your single most valuable agency asset — and once
it's on someone else's server, your client's privacy story is
no longer in your hands. DataTools is a desktop app. There is no
upload step.
</p>
<div class="terminal"><span class="prompt">$</span> python -m src.cli_pipeline campaign_q1.csv --pipeline revops_pipeline.json --apply
Reading campaign_q1.csv...
53,802 rows, 14 columns
Executing pipeline:
<span class="ok"></span> text_clean (160 ms) {cells_changed: 8,205}
<span class="ok"></span> format_standardize (1.4 s) {cells_changed: 41,889 — 50 country codes}
<span class="ok"></span> missing (140 ms) {sentinels_standardized: 6,710}
<span class="ok"></span> column_map (220 ms) {columns_renamed: 4, columns_added: 1}
<span class="ok"></span> dedup (4.8 s) {duplicates_removed: 12,344, merged: 12,344}
Initial rows: 53,802 → Final rows: 41,458
Total elapsed: 6.7 s
<span class="prompt">$</span> # 12,344 fewer contacts to pay for. for $49.</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">In the bundle</div>
<h2>Six tools. One pipeline. One $49 download.</h2>
<div class="grid">
<div class="card"><h3>1 · Deduplicator</h3><p>Fuzzy match across email + phone + name + company; merge survivors with most-complete fields.</p></div>
<div class="card"><h3>2 · Text Cleaner</h3><p>Smart quotes from copy-paste, NBSP from spreadsheet exports, BOM from Excel.</p></div>
<div class="card"><h3>3 · Format Standardizer</h3><p>E.164 phones with per-row country, canonical emails, name casing, ISO dates.</p></div>
<div class="card"><h3>4 · Missing Value Handler</h3><p>Detect <code>TBD</code>, <code>(unknown)</code>, <code></code> across vendor exports.</p></div>
<div class="card"><h3>5 · Column Mapper</h3><p>Project to your CRM's required schema, coerce score to integer, reorder for import.</p></div>
<div class="card"><h3>6 · Pipeline Runner</h3><p>Save the cleanup as JSON. Drop next campaign's combined export on it. Same dedup, automated.</p></div>
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">Pricing — pay once, own it</div>
<h2>$49. No subscription. No per-campaign fee.</h2>
<div class="pricing">
<div class="card featured">
<div class="row"><div class="price">$49</div><div class="price-suffix">one-time</div></div>
<h3>DataTools for RevOps</h3>
<ul>
<li>All 6 tools, full pipeline</li>
<li>Mac · Windows · Linux installers</li>
<li>Code-signed (no Gatekeeper warnings)</li>
<li>Free updates for the v1.x line</li>
<li>Bonus: 3-source unification pipeline preset</li>
<li><strong>Use on any number of clients</strong> — no seat limits</li>
</ul>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=revops" rel="noopener">Buy on Gumroad →</a>
</div>
<div class="card">
<div class="row"><div class="price">$149</div><div class="price-suffix">one-time</div></div>
<h3>Full DataTools Suite</h3>
<p class="muted">Available when 3+ bundles ship. Includes everything in the RevOps pack plus the Shopify and Bookkeeper bundles. Save $48.</p>
<a class="btn btn-ghost btn-large" href="#" aria-disabled="true">Coming when ready</a>
</div>
</div>
</div>
</section>
<section>
<div class="container">
<h2>Questions</h2>
<details class="faq">
<summary>Does this replace HubSpot's deduplication?</summary>
<p>No — it cleans data <em>before</em> import to HubSpot (or LinkedIn, Marketo, Klaviyo, etc.). HubSpot's dedup runs on already-imported contacts; DataTools catches duplicates that haven't yet cost you a contract slot.</p>
</details>
<details class="faq">
<summary>Does it handle international phones correctly?</summary>
<p>Yes — via Google's libphonenumber, with 50+ country codes. The killer feature is per-row country: point a column at it (any column with values like <code>US</code>, <code>USA</code>, <code>United States</code>, <code>+1</code>, <code>JP</code>, <code>Japan</code>) and DataTools parses each row in its own region. No more UK numbers bucketed as malformed US.</p>
</details>
<details class="faq">
<summary>Can I use it on multiple clients without paying again?</summary>
<p>Yes. The licence is per-operator, not per-client. Run it on every agency client's lead list for the same $49.</p>
</details>
<details class="faq">
<summary>How does fuzzy match work across columns?</summary>
<p>Out of the box, the dedup engine builds default strategies based on column names — typically email + phone with exact match, name with Jaro-Winkler at 85%. You can override via JSON: pick which columns to match on, which algorithm, and what threshold. Strategies survive in the saved pipeline so next campaign uses the same rules.</p>
</details>
<details class="faq">
<summary>What's the audit trail look like?</summary>
<p>A row-by-row CSV: every modified cell with its original value, new value, and which rule fired. A separate JSON file describes the pipeline that produced it. Together they reproduce the cleanup deterministically — your client can verify it on their machine.</p>
</details>
<details class="faq">
<summary>What's your refund policy?</summary>
<p>Try the live demo above on the sample dataset before you buy. If DataTools doesn't fit your workflow within 14 days, email for a refund — no questions asked.</p>
</details>
</div>
</section>
<section>
<div class="container" style="text-align: center;">
<h2>Stop paying twice for the same contact.</h2>
<p class="lead" style="margin: 0 auto 28px;">One $49 download. Catches the cross-source duplicates HubSpot and LinkedIn can't see, normalizes phones for 50+ countries, and saves a pipeline you can re-run on next campaign's combined list.</p>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=revops" rel="noopener">Get DataTools — $49 →</a>
</div>
</section>
<footer>
<div class="container">
<div>
<p><strong>DataTools</strong> — local data-cleaning for Shopify, bookkeepers, and RevOps teams.</p>
<p class="muted">© 2026 · Built solo · Shipped from a small office.</p>
</div>
<div>
<p>
<a href="../shopify-pet/">For Shopify operators</a> ·
<a href="../bookkeeper/">For bookkeepers</a><br />
<a href="https://gumroad.com/l/datatools?from=revops">Buy on Gumroad</a> ·
<a href="mailto:hello@datatools.app">Email support</a>
</p>
</div>
</div>
</footer>
</body>
</html>

View File

@@ -0,0 +1,381 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>DataTools for Shopify — Clean Customer & Product Exports Locally · $49</title>
<meta name="description" content="Clean Shopify customer, product, and subscriber exports — locally. Klaviyo-import-ready in 30 seconds. Catches duplicates Excel misses. Your data never leaves your computer. $49 one-time." />
<meta name="keywords" content="shopify customer cleanup, shopify csv cleaner, shopify product feed cleaner, klaviyo deduplicate, shopify customer dedup tool, shopify pet supplies" />
<link rel="canonical" href="https://datatools.app/shopify/" />
<link rel="stylesheet" href="../_shared/styles.css" />
<!-- Persona accent: Shopify pet → mint green (default in shared sheet) -->
<!-- Open Graph -->
<meta property="og:title" content="DataTools for Shopify — Clean Customer & Product Exports Locally" />
<meta property="og:description" content="Klaviyo-import-ready in 30 seconds. Local. No upload. $49 one-time." />
<meta property="og:type" content="product" />
<meta property="og:url" content="https://datatools.app/shopify/" />
<!-- Schema.org Product -->
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "SoftwareApplication",
"name": "DataTools for Shopify",
"operatingSystem": "Windows, macOS, Linux",
"applicationCategory": "BusinessApplication",
"offers": {
"@type": "Offer",
"price": "49",
"priceCurrency": "USD"
},
"description": "Clean Shopify customer, product, and subscriber CSV exports locally. Six-tool data-cleaning bundle: dedupe, text-clean, format-standardize, missing-value handle, column-map, pipeline.",
"softwareVersion": "1.0"
}
</script>
</head>
<body>
<!-- ============= Sticky buy bar ============= -->
<div class="buybar">
<div class="buybar-inner">
<div class="brand"><span class="brand-mark"></span> DataTools <span class="muted">/ for Shopify</span></div>
<div>
<span class="price-tag">$49 — one-time, no subscription</span>
<a class="btn" href="https://gumroad.com/l/datatools?from=shopify-pet" rel="noopener">Get DataTools →</a>
</div>
</div>
</div>
<!-- ============= Hero ============= -->
<section class="hero">
<div class="container">
<div class="eyebrow">For Shopify operators · pet supplies · subscription stores · DTC</div>
<h1>Klaviyo-import-ready customer lists.<br /><strong>In 30 seconds. Locally.</strong></h1>
<p class="lead">
Your Shopify customer export is a mess of formatting drift, disguised
duplicates, and inconsistent phone numbers. DataTools fixes all of it
in one pass — fuzzy-dedupes the same customer Klaviyo would charge
you for twice, standardises phones across your international
subscribers, and hands you a cleaned CSV. <strong>Your data never
leaves your computer.</strong>
</p>
<div class="cta-row">
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=shopify-pet" rel="noopener">Get DataTools — $49 →</a>
<a class="btn btn-ghost btn-large" href="#demo">Try the live demo ↓</a>
<span class="price-note">One-time payment · cross-platform · runs offline</span>
</div>
<div class="stats">
<div class="stat"><div class="num">6</div><div class="label">tools, one bundle</div></div>
<div class="stat"><div class="num">1 GB</div><div class="label">customer file in 2.5 min</div></div>
<div class="stat"><div class="num">0</div><div class="label">cloud uploads ever</div></div>
</div>
</div>
</section>
<!-- ============= Pain points ============= -->
<section>
<div class="container">
<div class="eyebrow">If any of these sound like your Tuesday</div>
<h2>Five pains DataTools fixes in one pass</h2>
<div class="grid">
<div class="card">
<span class="icon">💸</span>
<h3>Klaviyo / Mailchimp / Omnisend bills you for every duplicate</h3>
<p>Same customer signs up twice — once with a typo, once with a plus-tag, once on mobile. Your subscriber list has 1018 % duplicate rate and you're paying for every one of them, every month, forever.</p>
<p class="muted"><strong>What it costs:</strong> $30$300/mo per percent of dupes on a 50 k-list — recurring.</p>
</div>
<div class="card">
<span class="icon">📵</span>
<h3>Your product feed got rejected by Google Merchant Center</h3>
<p>Smart quotes from a copy-paste in product titles. NBSP in SKU. Inconsistent attribute casing. Feed bounces, the launch sits for 2472 hours while you try to find the bad row in a 12,000-line CSV.</p>
<p class="muted"><strong>What it costs:</strong> 13 days of delayed campaign × the campaign value.</p>
</div>
<div class="card">
<span class="icon">🪢</span>
<h3>Orders from Shopify + Etsy + Amazon + Faire don't speak the same language</h3>
<p>Each platform's export uses different column names for "customer email" / "ship country" / "order total." Merging takes hours of manual rename and copy-paste before the analysis can even begin.</p>
<p class="muted"><strong>What it costs:</strong> 48 hours per month manually merging exports.</p>
</div>
<div class="card">
<span class="icon">🔁</span>
<h3>Subscription churn looks higher than it is</h3>
<p>Pet-box subscribers cancel, then re-sub three months later under a different email or device. Your cohort report says churn is 20 % when it's actually 12 % — and you're over-paying for acquisition because LTV is mis-calculated.</p>
<p class="muted"><strong>What it costs:</strong> wrong CAC ceiling for the next year of paid ads.</p>
</div>
<div class="card">
<span class="icon">🌍</span>
<h3>VAT MOSS / EU tax breaks because country is spelled three ways</h3>
<p>Your UK customers are tagged <code>UK</code>, <code>U.K.</code>, and <code>United Kingdom</code> — all in one export. The VAT report aggregates them as three different markets. Compliance friction every quarter.</p>
<p class="muted"><strong>What it costs:</strong> compliance risk + repeated manual normalization.</p>
</div>
<div class="card">
<span class="icon">🔒</span>
<h3>Cloud cleaners want you to upload your customer list</h3>
<p>Your customer list is your single most valuable business asset. Uploading it to a SaaS to clean it is the privacy story you do not want. DataTools is desktop-only — your list never leaves your computer.</p>
<p class="muted"><strong>What it costs:</strong> nothing — and that's the point.</p>
</div>
</div>
</div>
</section>
<!-- ============= Live demo ============= -->
<section id="demo">
<div class="container">
<div class="eyebrow">Live demo · runs in your browser</div>
<h2>Try it on a real-looking Shopify customer export</h2>
<p>
The demo below loads a sample 15-row Shopify customer file with
pollution we've seen in actual stores: smart quotes from copy-paste,
duplicates with email-case drift, international phones from the UK,
Spain, Germany, Australia, and Japan, and the usual mess of
<code>N/A</code> / <code>(blank)</code> / <code>?</code> sentinels.
Click <strong>Run pipeline</strong> and watch every column get
cleaned in under a second.
</p>
<div class="demo-frame">
<iframe
src="https://demo.datatools.app/?p=shopify-pet"
loading="lazy"
title="DataTools live demo — Shopify pet supplies"
sandbox="allow-scripts allow-same-origin allow-downloads allow-forms"></iframe>
<div class="demo-caption">
Demo runs on free hosting (Streamlit Community Cloud). Capped at
100 input rows · output watermarked with one trailing row. The
paid product has no caps and runs entirely offline.
</div>
</div>
</div>
</section>
<!-- ============= Built for Shopify ============= -->
<section>
<div class="container">
<div class="eyebrow">Built for the Shopify operator</div>
<h2>Five workflows you do every week</h2>
<div class="grid">
<div class="card">
<span class="icon">🧹</span>
<h3>Customer-list cleanup</h3>
<p>Catches the same customer who shows up as <code>john@gmail.com</code>, <code>John@Gmail.com</code>, and <code>j.ohn@gmail.com</code>. Fuzzy match merges the spellings, exact match catches the obvious ones.</p>
</div>
<div class="card">
<span class="icon">📦</span>
<h3>Product catalogue dedup</h3>
<p>SKU whitespace, near-identical product names, copy-paste smart quotes in titles — gone. Audit log shows every change.</p>
</div>
<div class="card">
<span class="icon">🛒</span>
<h3>Abandoned-cart hygiene</h3>
<p>Before re-engagement: dedupe across email + phone, drop sentinels-as-missing, format dates so your sequence triggers fire correctly.</p>
</div>
<div class="card">
<span class="icon">📥</span>
<h3>Subscriber-list import to Klaviyo</h3>
<p>Klaviyo charges per contact. Every duplicate you don't catch costs you for the life of the subscription. Catch them once, pay once.</p>
</div>
<div class="card">
<span class="icon">🔗</span>
<h3>Multi-channel order consolidation</h3>
<p>Orders from Shopify + Etsy + a wholesale spreadsheet, each with a different column for "customer email." Column-mapper aligns them; dedup merges across channels.</p>
</div>
<div class="card">
<span class="icon">⚙️</span>
<h3>Repeatable pipeline</h3>
<p>Save the cleanup as a JSON file. Drop next week's export on it. Same cleanup, zero re-configuration. Automatable via the CLI.</p>
</div>
</div>
</div>
</section>
<!-- ============= Privacy moat ============= -->
<section>
<div class="container">
<div class="eyebrow">The thing every cloud cleaner can't say</div>
<h2>Your customer list never leaves your computer.</h2>
<p>
DataTools is a desktop app. There's no upload step, no SaaS account,
no subscription, no "trust our security policy." The first thing you
can do after install is open your browser's network tab, run the
cleaner on your real customer file, and verify zero outbound
requests.
</p>
<div class="callout">
<strong>Why it matters for Shopify:</strong> your customer list is
your single most valuable business asset. Cloud cleaners require
you to upload it. We don't.
</div>
<div class="terminal"><span class="prompt">$</span> python -m src.cli_pipeline customers.csv --apply
Reading customers.csv...
47,832 rows, 14 columns
Executing pipeline:
<span class="ok"></span> text_clean (140 ms) {cells_changed: 12,408}
<span class="ok"></span> format_standardize (810 ms) {cells_changed: 31,202}
<span class="ok"></span> missing (95 ms) {sentinels_standardized: 8,129}
<span class="ok"></span> dedup (3.1 s) {duplicates_removed: 2,347}
Initial rows: 47,832 → Final rows: 45,485
Total elapsed: 4.2 s
<span class="prompt">$</span> # zero network calls. zero. promise.</div>
</div>
</section>
<!-- ============= Audit moat ============= -->
<section>
<div class="container">
<div class="eyebrow">For when your client asks "what changed?"</div>
<h2>Every change auditable. Every cell logged.</h2>
<p>
Every modification is recorded with the original value, the new
value, and which rule fired. Hand the audit CSV to your accountant,
your marketing manager, or your boss along with the cleaned file.
No <em>"I trust the AI"</em> hand-waving — they see exactly what
happened.
</p>
<div class="callout">
<strong>Real example:</strong> the demo above standardized 27
cells across 15 customers. The audit log lists each one — row,
column, before, after, which standardizer fired. The dedup audit
lists every duplicate group with the survivor and its losers.
</div>
</div>
</section>
<!-- ============= International ============= -->
<section>
<div class="container">
<div class="eyebrow">If you sell internationally — most pet brands do</div>
<h2>Phones, addresses, and currencies from anywhere on Earth.</h2>
<p>
Your subscriber from London entered her phone as <code>020 7946
0958</code>. Your Tokyo customer entered <code>03-3210-7000</code>.
Your German wholesale buyer wrote <code>€2.410,75</code>. Excel
thinks all of them are mistakes. DataTools knows what country each
row is from (per-row country column) and parses every one correctly
to E.164 phones, ISO dates, and numeric amounts.
</p>
<ul class="bullets">
<li><strong>50+ country codes</strong> via Google's libphonenumber.</li>
<li><strong>Currency auto-detect</strong> for $ / £ / € / ¥ / R$ / kr / zł — including the EU comma-decimal that breaks Excel.</li>
<li><strong>Address shape detection</strong> for US, UK, Canada, Germany, Australia.</li>
<li><strong>Locale-aware month names</strong> in English, French, German.</li>
</ul>
</div>
</section>
<!-- ============= What you get ============= -->
<section>
<div class="container">
<div class="eyebrow">In the bundle</div>
<h2>Six tools. One pipeline. One $49 download.</h2>
<div class="grid">
<div class="card"><h3>1 · Deduplicator</h3><p>Fuzzy match (Jaro-Winkler), 5 normalizers, survivor rules, interactive review.</p></div>
<div class="card"><h3>2 · Text Cleaner</h3><p>Whitespace, smart chars, NBSP, BOM, line endings, case ops.</p></div>
<div class="card"><h3>3 · Format Standardizer</h3><p>Dates, phones, emails, addresses, names, currencies, booleans.</p></div>
<div class="card"><h3>4 · Missing Value Handler</h3><p>Disguised-null detection, profile, mean/median/mode/ffill, drop strategies.</p></div>
<div class="card"><h3>5 · Column Mapper</h3><p>Fuzzy auto-rename, target schema, type coercion, required-field defaults.</p></div>
<div class="card"><h3>6 · Pipeline Runner</h3><p>Chain tools in recommended order, save/load JSON, automate weekly cleanups.</p></div>
</div>
</div>
</section>
<!-- ============= Pricing ============= -->
<section>
<div class="container">
<div class="eyebrow">Pricing — pay once, own it</div>
<h2>$49. No subscription. No ceiling on rows or files.</h2>
<div class="pricing">
<div class="card featured">
<div class="row"><div class="price">$49</div><div class="price-suffix">one-time</div></div>
<h3>DataTools for Shopify</h3>
<ul>
<li>All 6 tools, full pipeline</li>
<li>Mac · Windows · Linux installers</li>
<li>Code-signed (no Gatekeeper warnings)</li>
<li>Free updates for the v1.x line</li>
<li>Bonus: 3 ready-made Shopify pipelines</li>
</ul>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=shopify-pet" rel="noopener">Buy on Gumroad →</a>
</div>
<div class="card">
<div class="row"><div class="price">$149</div><div class="price-suffix">one-time</div></div>
<h3>Full DataTools Suite</h3>
<p class="muted">Available when 3+ bundles ship. Includes everything in the Shopify pack plus the Bookkeeper and RevOps bundles. Save $48.</p>
<a class="btn btn-ghost btn-large" href="#" aria-disabled="true">Coming when ready</a>
</div>
</div>
</div>
</section>
<!-- ============= FAQ ============= -->
<section>
<div class="container">
<h2>Questions</h2>
<details class="faq">
<summary>Does this work with Shopify Plus?</summary>
<p>Yes — the input is just CSV / Excel from any source. Your Shopify Plus exports work the same as the standard plan, the same as a Shopify-to-CSV pipeline you've stitched together yourself. The cleaner doesn't care.</p>
</details>
<details class="faq">
<summary>How does this compare to Excel's "Remove Duplicates"?</summary>
<p>Excel does <em>exact</em> deduplication. <code>John@Gmail.com</code> and <code>john@gmail.com</code> are different customers to Excel. DataTools fuzzy-matches across case, whitespace, formatting, and even close-but-not-identical strings. The demo above merges 4 customer pairs Excel would leave duplicated.</p>
</details>
<details class="faq">
<summary>How big a file can it handle?</summary>
<p>1 GB CSV with international phones + addresses processes in about 2.5 minutes on a typical workstation. Streaming mode keeps memory bounded regardless of input size — we tested it on 26 million rows.</p>
</details>
<details class="faq">
<summary>Do I need to know Python to use it?</summary>
<p>No. The GUI is a browser interface that opens automatically when you double-click the app. It loads your file, you click Run, you download the cleaned file. The CLI is there for power users who want to script weekly cleanups.</p>
</details>
<details class="faq">
<summary>What about my privacy?</summary>
<p>Your customer list never leaves your computer. There is no cloud component, no telemetry, no "anonymous usage stats." When the app is running you can confirm zero outbound network requests in your browser's developer tools.</p>
</details>
<details class="faq">
<summary>What's your refund policy?</summary>
<p>Try the live demo above on the sample dataset before you buy. If you still find DataTools doesn't fit your workflow within 14 days, email for a refund — no questions asked.</p>
</details>
<details class="faq">
<summary>Will there be updates?</summary>
<p>Yes. The v1.x line is included free for everyone who buys DataTools today. We ship a patch every 30 days adding country support, edge-case fixes, and small features.</p>
</details>
</div>
</section>
<!-- ============= Final CTA ============= -->
<section>
<div class="container" style="text-align: center;">
<h2>Stop deduplicating customers by hand.</h2>
<p class="lead" style="margin: 0 auto 28px;">One $49 download. Mac, Windows, or Linux. Runs offline. Catches the duplicates Excel misses, standardizes the phones from your international customers, and saves a pipeline you can re-run on next week's export.</p>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=shopify-pet" rel="noopener">Get DataTools — $49 →</a>
</div>
</section>
<!-- ============= Footer ============= -->
<footer>
<div class="container">
<div>
<p><strong>DataTools</strong> — local data-cleaning for Shopify, bookkeepers, and RevOps teams.</p>
<p class="muted">© 2026 · Built solo · Shipped from a small office.</p>
</div>
<div>
<p>
<a href="../bookkeeper/">For bookkeepers</a> ·
<a href="../revops/">For RevOps agencies</a><br />
<a href="https://gumroad.com/l/datatools?from=shopify-pet">Buy on Gumroad</a> ·
<a href="mailto:hello@datatools.app">Email support</a>
</p>
</div>
</div>
</footer>
</body>
</html>

View File

@@ -0,0 +1,31 @@
Lead ID,First Name,Last Name,Company,Title,Email,Phone,Country,Source,Score,Last Activity,Tags
HUB-001,Alice,Johnson,Acme Corp,VP Marketing,alice@acme.com,(415) 555-1234,USA,HubSpot,87,2025-12-04,Enterprise
HUB-002,bob,smith,Beta LLC,Director Growth,bob@beta.com,N/A,United States,HubSpot,N/A,2025-11-22,SMB
HUB-003,Carlos,Garcia,Gamma Inc,CEO,carlos@gamma.io,+34 91 411 1111,Spain,HubSpot,82,2025-10-30,Enterprise
HUB-004,DIANA,LEE,Delta Co,Marketing Manager,diana@delta.com,020 7946 0958,United Kingdom,HubSpot,74,2025-12-15,Mid-Market
HUB-005,Eve,Martinez,Epsilon Group,VP Ops,eve@epsilon.com,(none),Mexico,HubSpot,(blank),2025-09-15,SMB
LIN-006,Alice,Johnson,Acme Corporation,VP of Marketing,Alice.Johnson@acme.com,4155551234,US,LinkedIn,,2025-12-04,Enterprise
LIN-007,Frank,Brown,Foxtrot Ltd,Head Sales,frank@foxtrot.de,+49 30 12345678,Germany,LinkedIn,68,2025-12-01,Mid-Market
LIN-008,Grace,Davis,Golf Industries,Marketing Lead,grace@golfind.com,+44 20 7946 0958,UK,LinkedIn,79,2025-11-08,Mid-Market
LIN-009,henry,wilson,Hotel Logistics,COO,henry@hotellog.com,+86 10 1234 5678,China,LinkedIn,91,2025-12-12,Enterprise
LIN-010,IVY CHEN,,India Tech,CTO,ivy@indiatech.in,+91 11 2345 6789,IN,LinkedIn,88,2025-11-30,Enterprise
LIN-011,Jack,Taylor,Juliet & Co,Founder,jack@juliet.co,unknown,United States,LinkedIn,?,(unknown),SMB
SCR-012,Diana,Lee,Delta Company,Marketing Manager,diana@delta.com,020-7946-0958,UK,Manual Scrape,74,12/15/2025,Mid-Market
SCR-013,kate,o'neil,Kilo Ventures,Partner,kate@kilo.vc,+1 415 555 2222,USA,Manual Scrape,N/A,?,Investor
SCR-014,Carlos,García,Gamma Incorporated,CEO,Carlos@gamma.io,+34-91-411-1111,Spain,Manual Scrape,82,Oct 30 2025,Enterprise
SCR-015,Liam,Park,Lima Solutions,Director Marketing,liam@limasol.kr,+82 2 2287 0114,South Korea,Manual Scrape,77,2025-11-20,Enterprise
SCR-016,Mia,nguyen,Mike Corp,VP Marketing,mia@mikecorp.com.au,02 9374 4000,Australia,Manual Scrape,72,2025-10-05,Mid-Market
SCR-017,Noah,Brown,November Inc,Head of Growth,noah@november.com,(555) 444-5555,US,Manual Scrape,,#N/A,SMB
HUB-018,Frank,Brown,Foxtrot,Head of Sales,Frank@Foxtrot.de,+49-30-12345678,Germany,HubSpot,68,2025-12-01,Mid-Market
HUB-019,Olivia,Rossi,Oscar Italia,CMO,olivia@oscar.it,+39 06 6982,Italy,HubSpot,85,2025-12-08,Enterprise
HUB-020,papa,wong,Papa Trading,Founder,papa@papatrading.hk,+852 2123 4567,Hong Kong,HubSpot,69,2025-11-15,SMB
LIN-021,Quinn,Reyes,Quebec Group,VP Sales,quinn@quebec.mx,+52 55 5555 0000,Mexico,LinkedIn,80,2025-12-05,Mid-Market
LIN-022,Robert,Tan,Romeo Logistics,Director,r.tan@romeo.sg,+65 6123 4567,Singapore,LinkedIn,76,2025-11-28,Mid-Market
SCR-023,Sara,Khan,Sierra Foods,Head Marketing,sara@sierra.in,+91-22-1234-5678,India,Manual Scrape,73,2025-12-02,SMB
SCR-024,bob,Smith,Beta,Director Growth,Bob@Beta.com,(none),United States,Manual Scrape,(unknown),(unknown),SMB
HUB-025,Tara,Levi,Tango Tech,VP Product,tara@tango.il,+972 3 6957 0000,Israel,HubSpot,82,2025-12-10,Enterprise
HUB-026,Uma,Patel,Uniform Health,CMO,uma at uniform dot com,+44 20 7946 8888,United Kingdom,HubSpot,71,2025-12-12,Enterprise
LIN-027,Victor,Lee,Victor Co,Director,victor@@victorco.com,+1 415 555 8888,USA,LinkedIn,69,2025-11-30,SMB
SCR-028,Wendy,Akin,Whiskey Inc,CMO,wendy@whiskey.tr,+90 212 252 1111,Turkey,Manual Scrape,77,2025-12-04,Mid-Market
SCR-029,Xander,Ng,Xray Group,Founder,xander@xray.sg,+65 6234 5678,Singapore,Manual Scrape,65,2025-11-15,Suppressed
HUB-030,Yara,Costa,Yankee Foods,Marketing Lead,yara@yankee.br,+55 11 3071 2222,Brazil,HubSpot,,2025-12-15,Opted Out
1 Lead ID First Name Last Name Company Title Email Phone Country Source Score Last Activity Tags
2 HUB-001 Alice Johnson Acme Corp VP Marketing alice@acme.com (415) 555-1234 USA HubSpot 87 2025-12-04 Enterprise
3 HUB-002 bob smith Beta LLC Director Growth bob@beta.com N/A United States HubSpot N/A 2025-11-22 SMB
4 HUB-003 Carlos Garcia Gamma Inc CEO carlos@gamma.io +34 91 411 1111 Spain HubSpot 82 2025-10-30 Enterprise
5 HUB-004 DIANA LEE Delta Co Marketing Manager diana@delta.com 020 7946 0958 United Kingdom HubSpot 74 2025-12-15 Mid-Market
6 HUB-005 Eve Martinez Epsilon Group VP Ops eve@epsilon.com (none) Mexico HubSpot (blank) 2025-09-15 SMB
7 LIN-006 Alice Johnson Acme Corporation VP of Marketing Alice.Johnson@acme.com 4155551234 US LinkedIn 2025-12-04 Enterprise
8 LIN-007 Frank Brown Foxtrot Ltd Head Sales frank@foxtrot.de +49 30 12345678 Germany LinkedIn 68 2025-12-01 Mid-Market
9 LIN-008 Grace Davis Golf Industries Marketing Lead grace@golfind.com +44 20 7946 0958 UK LinkedIn 79 2025-11-08 Mid-Market
10 LIN-009 henry wilson Hotel Logistics COO henry@hotellog.com +86 10 1234 5678 China LinkedIn 91 2025-12-12 Enterprise
11 LIN-010 IVY CHEN India Tech CTO ivy@indiatech.in +91 11 2345 6789 IN LinkedIn 88 2025-11-30 Enterprise
12 LIN-011 Jack Taylor Juliet & Co Founder jack@juliet.co unknown United States LinkedIn ? (unknown) SMB
13 SCR-012 Diana Lee Delta Company Marketing Manager diana@delta.com 020-7946-0958 UK Manual Scrape 74 12/15/2025 Mid-Market
14 SCR-013 kate o'neil Kilo Ventures Partner kate@kilo.vc +1 415 555 2222 USA Manual Scrape N/A ? Investor
15 SCR-014 Carlos García Gamma Incorporated CEO Carlos@gamma.io +34-91-411-1111 Spain Manual Scrape 82 Oct 30 2025 Enterprise
16 SCR-015 Liam Park Lima Solutions Director Marketing liam@limasol.kr +82 2 2287 0114 South Korea Manual Scrape 77 2025-11-20 Enterprise
17 SCR-016 Mia nguyen Mike Corp VP Marketing mia@mikecorp.com.au 02 9374 4000 Australia Manual Scrape 72 2025-10-05 Mid-Market
18 SCR-017 Noah Brown November Inc Head of Growth noah@november.com (555) 444-5555 US Manual Scrape #N/A SMB
19 HUB-018 Frank Brown Foxtrot Head of Sales Frank@Foxtrot.de +49-30-12345678 Germany HubSpot 68 2025-12-01 Mid-Market
20 HUB-019 Olivia Rossi Oscar Italia CMO olivia@oscar.it +39 06 6982 Italy HubSpot 85 2025-12-08 Enterprise
21 HUB-020 papa wong Papa Trading Founder papa@papatrading.hk +852 2123 4567 Hong Kong HubSpot 69 2025-11-15 SMB
22 LIN-021 Quinn Reyes Quebec Group VP Sales quinn@quebec.mx +52 55 5555 0000 Mexico LinkedIn 80 2025-12-05 Mid-Market
23 LIN-022 Robert Tan Romeo Logistics Director r.tan@romeo.sg +65 6123 4567 Singapore LinkedIn 76 2025-11-28 Mid-Market
24 SCR-023 Sara Khan Sierra Foods Head Marketing sara@sierra.in +91-22-1234-5678 India Manual Scrape 73 2025-12-02 SMB
25 SCR-024 bob Smith Beta Director Growth Bob@Beta.com (none) United States Manual Scrape (unknown) (unknown) SMB
26 HUB-025 Tara Levi Tango Tech VP Product tara@tango.il +972 3 6957 0000 Israel HubSpot 82 2025-12-10 Enterprise
27 HUB-026 Uma Patel Uniform Health CMO uma at uniform dot com +44 20 7946 8888 United Kingdom HubSpot 71 2025-12-12 Enterprise
28 LIN-027 Victor Lee Victor Co Director victor@@victorco.com +1 415 555 8888 USA LinkedIn 69 2025-11-30 SMB
29 SCR-028 Wendy Akin Whiskey Inc CMO wendy@whiskey.tr +90 212 252 1111 Turkey Manual Scrape 77 2025-12-04 Mid-Market
30 SCR-029 Xander Ng Xray Group Founder xander@xray.sg +65 6234 5678 Singapore Manual Scrape 65 2025-11-15 Suppressed
31 HUB-030 Yara Costa Yankee Foods Marketing Lead yara@yankee.br +55 11 3071 2222 Brazil HubSpot 2025-12-15 Opted Out

View File

@@ -0,0 +1,74 @@
{
"steps": [
{
"tool": "text_clean",
"options": {},
"enabled": true,
"name": "1. Clean text (whitespace + smart quotes from copy-paste)"
},
{
"tool": "format_standardize",
"options": {
"column_types": {
"First Name": "name",
"Last Name": "name",
"Company": "name",
"Email": "email",
"Phone": "phone"
},
"phone_country_column": "Country",
"phone_format": "E164",
"email_gmail_canonical": true
},
"enabled": true,
"name": "2. E.164 phones (per-row country) · canonical emails · name casing"
},
{
"tool": "missing",
"options": {
"strategy": "none",
"standardize_sentinels": true,
"sentinels": ["N/A", "n/a", "—", "?", "(unknown)", "unknown", "(blank)", "(none)", "TBD", "#N/A"]
},
"enabled": true,
"name": "3. Standardize sentinels across vendor exports"
},
{
"tool": "column_map",
"options": {
"schema": {
"fields": [
{"name": "Lead ID", "dtype": "string", "required": true},
{"name": "First Name", "dtype": "string"},
{"name": "Last Name", "dtype": "string"},
{"name": "Company", "dtype": "string"},
{"name": "Title", "dtype": "string"},
{"name": "Email", "dtype": "string"},
{"name": "Phone", "dtype": "string"},
{"name": "Country", "dtype": "string"},
{"name": "Source", "dtype": "string"},
{"name": "Score", "dtype": "integer"},
{"name": "Last Activity", "dtype": "date"},
{"name": "Tags", "dtype": "string"}
]
},
"auto_infer": true,
"unmapped": "keep",
"coerce_types": true,
"reorder_to_schema": true,
"enforce_required": false
},
"enabled": true,
"name": "4. Coerce types · reorder to canonical schema"
},
{
"tool": "dedup",
"options": {
"survivor_rule": "most_complete",
"merge": true
},
"enabled": true,
"name": "5. Dedup leads across HubSpot / LinkedIn / Manual Scrape (fuzzy + merge)"
}
]
}

View File

@@ -0,0 +1,56 @@
{
"steps": [
{
"tool": "text_clean",
"options": {},
"enabled": true,
"name": "1. Clean text (header whitespace, smart quotes, em-dash)"
},
{
"tool": "format_standardize",
"options": {
"column_types": {
"Date": "date",
"Amount": "currency",
"Balance": "currency",
"Vendor": "name"
},
"currency_decimal": "auto",
"currency_preserve_code": false,
"currency_decimals": 2,
"date_output_format": "%Y-%m-%d"
},
"enabled": true,
"name": "2. ISO dates · numeric amounts (parens-negative) · vendor casing"
},
{
"tool": "missing",
"options": {
"strategy": "none",
"standardize_sentinels": true,
"sentinels": ["N/A", "n/a", "—", "-", "?", "(blank)", "(none)", "unknown", "#N/A"]
},
"enabled": true,
"name": "3. Standardize disguised nulls (— / N/A / (blank))"
},
{
"tool": "dedup",
"options": {
"survivor_rule": "most_complete",
"merge": false,
"date_column": "Date",
"strategies": [
{
"columns": [
{"column": "Date", "algorithm": "exact", "threshold": 100},
{"column": "Amount", "algorithm": "exact", "threshold": 100},
{"column": "Vendor", "algorithm": "jaro_winkler", "threshold": 80}
]
}
]
},
"enabled": true,
"name": "4. Dedup transactions on Date+Amount+fuzzy Vendor"
}
]
}

View File

@@ -0,0 +1,31 @@
Txn ID,Date ,Description,Amount,Balance,Account,Vendor,Category
TXN-2401,01/15/2025," AMAZON.COM*4F2X9 PURCHASE",-$129.99,"$2,450.01",Checking,Amazon,Office Supplies
TXN-2402,2025-01-15,"AMAZON.COM*4F2X9 PURCHASE",-$129.99,"2450.01",Checking,amazon.com,Office Supplies
TXN-2403,Jan 18 2025,"STAPLES #4422 — paper, toner",($89.50),$2360.51,Checking,STAPLES,Office Supplies
TXN-2404,01/22/2025,"Verizon Wireless ""autopay""",-$120.00,"$2,240.51",Checking,Verizon,Utilities
TXN-2405,2025-01-22,Verizon Wireless autopay,-120.00,"2,240.51",Checking,verizon,Utilities
TXN-2406,01-25-2025,"Stripe Payout — invoice #1077","+$3,450.00","$5,690.51",Checking,Stripe,Income
TXN-2407,1/27/25,"Office Lease - Suite 204",-1500.00,"$4,190.51",Checking,Acme Realty,Rent
TXN-2408,02/01/2025,"Wire — Acme Realty Mgmt","-$1,500.00","$2,690.51",Checking,acme realty,Rent
TXN-2409,2025-02-03,"Adobe Creative Cloud annual","- $599.88","$2,090.63",Credit Card,Adobe Inc.,Software
TXN-2410,02/03/2025,"ADOBE CREATIVE CLOUD ANN",-599.88,2090.63,Credit Card,adobe,Software
TXN-2411,Feb 5 2025,"FedEx — overnight to client A",-$32.50,"$2,058.13",Checking,FedEx,Shipping
TXN-2412,02/07/2025,"Square fee — invoice #1078","-$3.20","$2,054.93",Checking,Square,Fees
TXN-2413,02/10/2025,"Stripe Payout invoice #1079","+ $1,200.00","$3,254.93",Checking,Stripe,Income
TXN-2414,2025-02-12,"USPS PRIORITY — to vendor B","-12.40","$3,242.53",Checking,USPS,Shipping
TXN-2415,02/14/2025,"Zoom Video Comms — annual","-$149.90","$3,092.63",Credit Card,Zoom,Software
TXN-2416,2/14/25,"Zoom Video Communications","-149.90","3092.63",Credit Card,zoom,Software
TXN-2417,02/18/2025,"Costco Whse #421 — supplies","-$237.84","$2,854.79",Checking,Costco,Office Supplies
TXN-2418,2025-02-18,COSTCO WHSE #421,-237.84,"2,854.79",Checking,costco,Office Supplies
TXN-2419,02/22/2025,"Bank fee — int'l wire","-$45.00","$2,809.79",Checking,Bank Fee,Fees
TXN-2420,02/24/2025,"Stripe Payout — invoice #1080","+$2,100.00","$4,909.79",Checking,Stripe,Income
TXN-2421,02/28/2025," Refund — overcharge ","+$45.00","$4,954.79",Checking,,Refunds
TXN-2422,Feb 28 2025,REFUND OVERCHARGE,45.00,4954.79,Checking,N/A,Refunds
TXN-2423,03/01/2025,"Office Lease — Suite 204","-$1,500.00","$3,454.79",Checking,Acme Realty,Rent
TXN-2424,2025-03-03,"Slack Technologies — annual","-$840.00","$2,614.79",Credit Card,Slack,Software
TXN-2425,03/05/2025,"Stripe Payout — invoice #1081","+$1,875.00","$4,489.79",Checking,Stripe,Income
TXN-2426,03/08/2025,"Wire — Berlin office rent (EUR vendor)","-€1.450,00","$2,989.79",Checking,Mietverwaltung GmbH,Rent
TXN-2427,03/10/2025,"London supplier invoice (GBP)","-£950.00","$1,939.79",Checking,Stationery Co Ltd,Office Supplies
TXN-2428,03/12/2025,"São Paulo agency retainer","-R$ 1.299,90","$1,679.79",Credit Card,Estúdio Ágil,Software
TXN-2429,03/14/2025,"VAT MOSS prep — multi-EU sales","($89.00)","$1,768.79",Checking,EU VAT Service,Fees
TXN-2430,03/14/2025,"VAT MOSS prep multi EU sales",-89.00,"1,768.79",Checking,eu vat service,Fees
1 Txn ID Date Description Amount Balance Account Vendor Category
2 TXN-2401 01/15/2025 AMAZON.COM*4F2X9 PURCHASE -$129.99 $2,450.01 Checking Amazon Office Supplies
3 TXN-2402 2025-01-15 AMAZON.COM*4F2X9 PURCHASE -$129.99 2450.01 Checking amazon.com Office Supplies
4 TXN-2403 Jan 18 2025 STAPLES #4422 — paper, toner ($89.50) $2360.51 Checking STAPLES Office Supplies
5 TXN-2404 01/22/2025 Verizon Wireless "autopay" -$120.00 $2,240.51 Checking Verizon Utilities
6 TXN-2405 2025-01-22 Verizon Wireless autopay -120.00 2,240.51 Checking verizon Utilities
7 TXN-2406 01-25-2025 Stripe Payout — invoice #1077 +$3,450.00 $5,690.51 Checking Stripe Income
8 TXN-2407 1/27/25 Office Lease - Suite 204 -1500.00 $4,190.51 Checking Acme Realty Rent
9 TXN-2408 02/01/2025 Wire — Acme Realty Mgmt -$1,500.00 $2,690.51 Checking acme realty Rent
10 TXN-2409 2025-02-03 Adobe Creative Cloud annual - $599.88 $2,090.63 Credit Card Adobe Inc. Software
11 TXN-2410 02/03/2025 ADOBE CREATIVE CLOUD ANN -599.88 2090.63 Credit Card adobe Software
12 TXN-2411 Feb 5 2025 FedEx — overnight to client A -$32.50 $2,058.13 Checking FedEx Shipping
13 TXN-2412 02/07/2025 Square fee — invoice #1078 -$3.20 $2,054.93 Checking Square Fees
14 TXN-2413 02/10/2025 Stripe Payout invoice #1079 + $1,200.00 $3,254.93 Checking Stripe Income
15 TXN-2414 2025-02-12 USPS PRIORITY — to vendor B -12.40 $3,242.53 Checking USPS Shipping
16 TXN-2415 02/14/2025 Zoom Video Comms — annual -$149.90 $3,092.63 Credit Card Zoom Software
17 TXN-2416 2/14/25 Zoom Video Communications -149.90 3092.63 Credit Card zoom Software
18 TXN-2417 02/18/2025 Costco Whse #421 — supplies -$237.84 $2,854.79 Checking Costco Office Supplies
19 TXN-2418 2025-02-18 COSTCO WHSE #421 -237.84 2,854.79 Checking costco Office Supplies
20 TXN-2419 02/22/2025 Bank fee — int'l wire -$45.00 $2,809.79 Checking Bank Fee Fees
21 TXN-2420 02/24/2025 Stripe Payout — invoice #1080 +$2,100.00 $4,909.79 Checking Stripe Income
22 TXN-2421 02/28/2025 Refund — overcharge +$45.00 $4,954.79 Checking Refunds
23 TXN-2422 Feb 28 2025 REFUND OVERCHARGE 45.00 4954.79 Checking N/A Refunds
24 TXN-2423 03/01/2025 Office Lease — Suite 204 -$1,500.00 $3,454.79 Checking Acme Realty Rent
25 TXN-2424 2025-03-03 Slack Technologies — annual -$840.00 $2,614.79 Credit Card Slack Software
26 TXN-2425 03/05/2025 Stripe Payout — invoice #1081 +$1,875.00 $4,489.79 Checking Stripe Income
27 TXN-2426 03/08/2025 Wire — Berlin office rent (EUR vendor) -€1.450,00 $2,989.79 Checking Mietverwaltung GmbH Rent
28 TXN-2427 03/10/2025 London supplier invoice (GBP) -£950.00 $1,939.79 Checking Stationery Co Ltd Office Supplies
29 TXN-2428 03/12/2025 São Paulo agency retainer -R$ 1.299,90 $1,679.79 Credit Card Estúdio Ágil Software
30 TXN-2429 03/14/2025 VAT MOSS prep — multi-EU sales ($89.00) $1,768.79 Checking EU VAT Service Fees
31 TXN-2430 03/14/2025 VAT MOSS prep multi EU sales -89.00 1,768.79 Checking eu vat service Fees

View File

@@ -0,0 +1,21 @@
Customer ID,First Name,Last Name,Email,Phone,Address,City,State,ZIP,Country,Total Orders,Lifetime Value,Last Order Date,Tags
SHOP-1001, Alice ,Johnson,alice@petshop.com,(415) 555-1234,"123 Main St., Apt 4B",San Francisco,CA,94102,US,12,$1,240.50,2025-12-04,VIP
SHOP-1002,Bob,SMITH,Bob@PetShop.com,415.555.1234,"123 Main St, Apt 4B",San Francisco,CA,94102,US,12,"$1,240.50",N/A,VIP
SHOP-1003,carlos,garcia,carlos@petshop.com,5559876543,"742 Evergreen Terrace",Springfield,IL,62704,US,5,420.00,12/15/2025,Wholesale
SHOP-1004,Diana,Lee,diana@petshop.com,(555) 222-3344,"PO Box 12, Sherwood Forest",Nottingham,,NG1 5BA,GB,8,£890.25,2025-10-30,VIP|Wholesale
SHOP-1005,EVE MARTINEZ,,eve.martinez@petshop.com,555-9988,"Calle Mayor 45","Madrid",,"28013",ES,3,€180,2025-09-15,
SHOP-1006,Frank,Brown,frank@petshop.com,, ,"Berlin",BE,10115,DE,15,€2.410,75,(blank),Wholesale
SHOP-1007,Grace,Davis,grace@petshop.com,+1 555-111-1111,"888 Maple Ave",Toronto,ON,M5V 3A8,CA,1,$49.99,#N/A,New
SHOP-1008,henry,wilson,Henry@PetShop.com,5551111111,"888 Maple Avenue","Toronto",ON,M5V 3A8,CA,1,$49.99,2025-12-01,New
SHOP-1009,Ivy,Chen,IVY@petshop.com,+1 (555) 777-7777,"550 Elm Street, Suite 200",Brooklyn,NY,11201,US,4,"$320.50 ",10/12/2025,
SHOP-1010,Jack,Taylor,jack@petshop.com,(none),"550 elm street, suite 200",brooklyn,NY,11201,US,4,$320.50,2025-10-12,
SHOP-1011,kate,o'neil,kate.oneil@petshop.com,415-555-2222,"99 King's Rd","London",,SW3 4LX,GB,7,£675.00,?,VIP
SHOP-1012,luis,rodriguez,LUIS@petshop.com,+34 91 411 1111,"Avenida de la Paz 12, 3°D",Madrid,,28013,ES,2,"€89,99",unknown,
SHOP-1013,Mia,Park,mia@petshop.com,02-9374-4000,"Sydney Opera House Drive","Sydney",NSW,2000,AU,9,"A$ 1,299.00",2025-11-20,Wholesale
SHOP-1014,Noah,nguyen,noah@petshop.com,+81 3 3210 7000,"丸の内 2-7-3","Tokyo",,100-0005,JP,6,"¥75000",2025-12-10,VIP
SHOP-1015,Olivia,Brown,OLIVIA@PETSHOP.COM,(555) 333-4444,"742 evergreen terrace",springfield,IL,62704,US,3,$180.00,(none),
SHOP-1016,Pavel,Novak,pavel@petshop.com,+44 20 7946 1234,"22 Baker Street",London,,W1U 6AB,United Kingdom,4,£412.00,2025-11-18,VIP
SHOP-1017,Quinn,Murphy,quinn@petshop.com,+44 20 7946 5678,"5 Princes Street",Edinburgh,,EH2 2DA,U.K.,2,£189.50,2025-12-09,
SHOP-1018,Rachel,O'Brien,rachel@petshop.com,02-9374-9999,"100 George Street","Sydney",NSW,2000,UK,1,£75.00,?,New
SHOP-1019,Sam,Klein,sam@petshop.com,+49 30 99887766,"Friedrichstraße 100","Berlin",,10117,Germany,11,"€1.890,40",2025-12-11,VIP|Wholesale
SHOP-1020,Tara,Gianni,tara@petshop.com,+39 06 6982 4567,"Via del Corso 250",Roma,,00186,Italia,5,"€649,99",2025-12-03,
1 Customer ID First Name Last Name Email Phone Address City State ZIP Country Total Orders Lifetime Value Last Order Date Tags
2 SHOP-1001 Alice Johnson alice@petshop.com (415) 555-1234 123 Main St., Apt 4B San Francisco CA 94102 US 12 $1 240.50 2025-12-04 VIP
3 SHOP-1002 Bob SMITH Bob@PetShop.com 415.555.1234 123 Main St, Apt 4B San Francisco CA 94102 US 12 $1,240.50 N/A VIP
4 SHOP-1003 carlos garcia carlos@petshop.com 5559876543 742 Evergreen Terrace Springfield IL 62704 US 5 420.00 12/15/2025 Wholesale
5 SHOP-1004 Diana Lee diana@petshop.com (555) 222-3344 PO Box 12, Sherwood Forest Nottingham NG1 5BA GB 8 £890.25 2025-10-30 VIP|Wholesale
6 SHOP-1005 EVE MARTINEZ eve.martinez@petshop.com 555-9988 Calle Mayor 45 Madrid 28013 ES 3 €180 2025-09-15
7 SHOP-1006 Frank Brown frank@petshop.com Berlin BE 10115 DE 15 €2.410 75 (blank) Wholesale
8 SHOP-1007 Grace Davis grace@petshop.com +1 555-111-1111 888 Maple Ave Toronto ON M5V 3A8 CA 1 $49.99 #N/A New
9 SHOP-1008 henry wilson Henry@PetShop.com 5551111111 888 Maple Avenue Toronto ON M5V 3A8 CA 1 $49.99 2025-12-01 New
10 SHOP-1009 Ivy Chen IVY@petshop.com +1 (555) 777-7777 550 Elm Street, Suite 200 Brooklyn NY 11201 US 4 $320.50 10/12/2025
11 SHOP-1010 Jack Taylor jack@petshop.com (none) 550 elm street, suite 200 brooklyn NY 11201 US 4 $320.50 2025-10-12
12 SHOP-1011 kate o'neil kate.oneil@petshop.com 415-555-2222 99 King's Rd London SW3 4LX GB 7 £675.00 ? VIP
13 SHOP-1012 luis rodriguez LUIS@petshop.com +34 91 411 1111 Avenida de la Paz 12, 3°D Madrid 28013 ES 2 €89,99 unknown
14 SHOP-1013 Mia Park mia@petshop.com 02-9374-4000 Sydney Opera House Drive Sydney NSW 2000 AU 9 A$ 1,299.00 2025-11-20 Wholesale
15 SHOP-1014 Noah nguyen noah@petshop.com +81 3 3210 7000 丸の内 2-7-3 Tokyo 100-0005 JP 6 ¥75000 2025-12-10 VIP
16 SHOP-1015 Olivia Brown OLIVIA@PETSHOP.COM (555) 333-4444 742 evergreen terrace springfield IL 62704 US 3 $180.00 (none)
17 SHOP-1016 Pavel Novak pavel@petshop.com +44 20 7946 1234 22 Baker Street London W1U 6AB United Kingdom 4 £412.00 2025-11-18 VIP
18 SHOP-1017 Quinn Murphy quinn@petshop.com +44 20 7946 5678 5 Princes Street Edinburgh EH2 2DA U.K. 2 £189.50 2025-12-09
19 SHOP-1018 Rachel O'Brien rachel@petshop.com 02-9374-9999 100 George Street Sydney NSW 2000 UK 1 £75.00 ? New
20 SHOP-1019 Sam Klein sam@petshop.com +49 30 99887766 Friedrichstraße 100 Berlin 10117 Germany 11 €1.890,40 2025-12-11 VIP|Wholesale
21 SHOP-1020 Tara Gianni tara@petshop.com +39 06 6982 4567 Via del Corso 250 Roma 00186 Italia 5 €649,99 2025-12-03

View File

@@ -0,0 +1,49 @@
{
"steps": [
{
"tool": "text_clean",
"options": {},
"enabled": true,
"name": "1. Clean text (whitespace, smart quotes, NBSP, BOM)"
},
{
"tool": "format_standardize",
"options": {
"column_types": {
"First Name": "name",
"Last Name": "name",
"Email": "email",
"Phone": "phone",
"Address": "address",
"Lifetime Value": "currency",
"Last Order Date": "date"
},
"phone_country_column": "Country",
"address_country_column": "Country",
"currency_preserve_code": true,
"currency_decimal": "auto",
"email_gmail_canonical": false
},
"enabled": true,
"name": "2. Standardize phones, addresses, dates, currencies, names"
},
{
"tool": "missing",
"options": {
"strategy": "none",
"standardize_sentinels": true
},
"enabled": true,
"name": "3. Standardize disguised nulls (N/A, -, (blank), ?, #N/A)"
},
{
"tool": "dedup",
"options": {
"survivor_rule": "most_complete",
"merge": true
},
"enabled": true,
"name": "4. Dedup customers (fuzzy match, merge missing fields)"
}
]
}

355
src/cli_column_map.py Normal file
View File

@@ -0,0 +1,355 @@
"""CLI for the DataTools Column Mapper (script 05).
Usage:
python -m src.cli_column_map input.csv # auto-mapping preview
python -m src.cli_column_map input.csv --schema target.json --apply
python -m src.cli_column_map input.csv --rename "First Name=first_name,Email=email" --apply
python -m src.cli_column_map input.csv --schema target.json --preset strict-schema --apply
python -m src.cli_column_map input.csv --schema target.json --coerce --apply
python -m src.cli_column_map --help
"""
from __future__ import annotations
import json
import sys
from datetime import datetime
from pathlib import Path
from typing import Optional
import typer
from loguru import logger
app = typer.Typer(
name="column-map",
help=(
"Rename columns, enforce a target schema, and coerce types in CSV / Excel files.\n\n"
"Default behaviour: preview the mapping (no file written). Add --apply "
"to write the mapped output and audit log.\n\n"
"Examples:\n\n"
" # Show what auto-mapping would do (no schema → identity)\n"
" python -m src.cli_column_map vendor.csv\n\n"
" # Map against a target JSON schema with strict drop / coerce / reorder\n"
" python -m src.cli_column_map vendor.csv --schema target.json "
"--preset strict-schema --apply\n\n"
" # Hand-rolled rename without a schema\n"
" python -m src.cli_column_map data.csv "
"--rename 'First Name=first_name,Last Name=last_name' --apply\n\n"
" # Coerce specific columns inline\n"
" python -m src.cli_column_map data.csv "
"--coerce-col 'age:integer,joined:date' --apply\n"
),
add_completion=False,
no_args_is_help=True,
)
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _setup_logging(log_dir: Path) -> Path:
log_dir.mkdir(parents=True, exist_ok=True)
ts = datetime.now().strftime("%Y%m%d_%H%M%S")
log_path = log_dir / f"column_map_{ts}.log"
logger.remove()
logger.add(sys.stderr, level="WARNING", format="{message}")
logger.add(
str(log_path),
level="DEBUG",
format="{time:YYYY-MM-DD HH:mm:ss} | {level:<8} | {message}",
)
return log_path
def _parse_pairs(raw: Optional[str], separator: str = ",") -> dict[str, str]:
"""Parse ``a=1,b=2`` into a dict."""
if not raw:
return {}
out: dict[str, str] = {}
for piece in raw.split(separator):
piece = piece.strip()
if not piece:
continue
if "=" not in piece:
raise typer.BadParameter(
f"Invalid pair: {piece!r}. Expected 'key=value[,key=value...]'."
)
k, v = piece.split("=", 1)
out[k.strip()] = v.strip()
return out
def _parse_coerce(raw: Optional[str]) -> dict[str, str]:
"""Parse ``age:integer,joined:date`` into a dict."""
if not raw:
return {}
out: dict[str, str] = {}
for piece in raw.split(","):
piece = piece.strip()
if not piece:
continue
if ":" not in piece:
raise typer.BadParameter(
f"Invalid --coerce-col piece: {piece!r}. "
f"Expected 'col:dtype[,col:dtype...]'."
)
col, dtype = piece.split(":", 1)
out[col.strip()] = dtype.strip()
return out
# ---------------------------------------------------------------------------
# Main command
# ---------------------------------------------------------------------------
@app.command()
def map_(
input_file: str = typer.Argument(
...,
help="Path to the CSV or Excel file.",
),
output: Optional[str] = typer.Option(
None, "--output", "-o",
help="Output file path. Default: {input}_mapped.csv",
),
apply: bool = typer.Option(
False, "--apply",
help="Write the output. Without this flag, only the mapping plan is shown.",
),
preset: str = typer.Option(
"rename-only", "--preset",
help="Preset: rename-only, strict-schema, or lenient-schema.",
),
schema: Optional[str] = typer.Option(
None, "--schema",
help="Path to a target schema JSON file (TargetSchema format).",
),
rename: Optional[str] = typer.Option(
None, "--rename",
help="Explicit rename pairs: 'src=tgt[,src=tgt...]' (overrides auto-inference).",
),
coerce_col: Optional[str] = typer.Option(
None, "--coerce-col",
help=(
"Inline type coercion (no schema needed): 'col:dtype[,col:dtype...]'. "
"Valid dtypes: string, integer, float, boolean, date, datetime, category, auto."
),
),
unmapped: Optional[str] = typer.Option(
None, "--unmapped",
help="Strategy for unmapped source columns: keep | drop | error.",
),
threshold: Optional[float] = typer.Option(
None, "--threshold",
help="Fuzzy-match threshold for auto-inference (0.0..1.0). Default 0.6.",
),
no_auto: bool = typer.Option(
False, "--no-auto",
help="Disable auto-inference; honour only explicit --rename pairs.",
),
no_coerce: bool = typer.Option(
False, "--no-coerce",
help="Disable type coercion (overrides preset).",
),
no_reorder: bool = typer.Option(
False, "--no-reorder",
help="Disable schema-order reorder (overrides preset).",
),
no_required: bool = typer.Option(
False, "--no-required",
help="Don't enforce required-target presence (overrides preset).",
),
config: Optional[str] = typer.Option(
None, "--config",
help="Load options from a saved JSON config file.",
),
save_config: Optional[str] = typer.Option(
None, "--save-config",
help="Save current options to a JSON config file.",
),
sheet: Optional[str] = typer.Option(
None, "--sheet",
help="Excel sheet name or index (default: first sheet).",
),
encoding_override: Optional[str] = typer.Option(
None, "--encoding",
help="Override auto-detected file encoding.",
),
header_row: Optional[int] = typer.Option(
None, "--header-row",
help="0-based row index for the header (default: auto-detect).",
),
):
"""Map source columns to a target schema; rename, coerce, drop, reorder."""
from src.core.io import read_file, write_file
from src.core.column_mapper import (
MapOptions,
PRESETS,
TargetField,
TargetSchema,
coerce_series,
map_columns,
)
import pandas as pd
input_path = Path(input_file)
if not input_path.exists():
typer.echo(f"Error: File not found: {input_path}", err=True)
raise typer.Exit(1)
if preset not in PRESETS:
typer.echo(
f"Error: Unknown preset '{preset}'. "
f"Choose from: {', '.join(sorted(PRESETS))}.",
err=True,
)
raise typer.Exit(1)
log_path = _setup_logging(Path("logs"))
# Build options
if config:
cfg_path = Path(config)
if not cfg_path.exists():
typer.echo(f"Error: Config file not found: {cfg_path}", err=True)
raise typer.Exit(1)
options = MapOptions.from_file(cfg_path)
else:
options = MapOptions.from_preset(preset)
if schema:
sp = Path(schema)
if not sp.exists():
typer.echo(f"Error: Schema file not found: {sp}", err=True)
raise typer.Exit(1)
options.schema = TargetSchema.from_file(sp)
if rename:
options.mapping = {**options.mapping, **_parse_pairs(rename)}
if unmapped:
options.unmapped = unmapped # type: ignore[assignment]
if threshold is not None:
options.fuzzy_threshold = threshold
if no_auto:
options.auto_infer = False
if no_coerce:
options.coerce_types = False
if no_reorder:
options.reorder_to_schema = False
if no_required:
options.enforce_required = False
# Inline coercion (no schema): build a tiny one-field-per-column schema.
inline_coerce = _parse_coerce(coerce_col)
if inline_coerce and options.schema is None:
options.schema = TargetSchema(fields=[
TargetField(name=col, dtype=dt) # type: ignore[arg-type]
for col, dt in inline_coerce.items()
])
options.coerce_types = True
if save_config:
saved = options.to_file(save_config)
typer.echo(f"Config saved to {saved}")
# Read input
typer.echo(f"Reading {input_path.name}...")
try:
sheet_arg: str | int | None = None
if sheet is not None:
try:
sheet_arg = int(sheet)
except ValueError:
sheet_arg = sheet
df = read_file(
input_path,
encoding=encoding_override,
header_row=header_row,
sheet_name=sheet_arg if sheet_arg is not None else 0,
repair=False,
)
if not isinstance(df, pd.DataFrame):
df = pd.concat(list(df), ignore_index=True)
except Exception as e:
typer.echo(f"Error reading file: {e}", err=True)
raise typer.Exit(1)
typer.echo(f" {len(df)} rows, {len(df.columns)} columns")
typer.echo("Mapping columns...")
try:
result = map_columns(df, options)
except (ValueError, OSError) as e:
typer.echo(f"Error: {e}", err=True)
raise typer.Exit(1)
_print_results(result, input_path, options)
if apply:
stem = input_path.stem
out_path = Path(output) if output else input_path.parent / f"{stem}_mapped.csv"
write_file(result.mapped_df, out_path)
typer.echo(f"\nMapped file: {out_path}")
# Audit: write the resolved mapping as JSON next to the output.
audit_path = input_path.parent / f"{stem}_mapping.json"
audit_path.write_text(json.dumps({
"mapping": result.mapping,
"inferred_pairs": result.inferred_pairs,
"columns_renamed": result.columns_renamed,
"columns_dropped": result.columns_dropped,
"columns_added": result.columns_added,
"coercion_failures": result.coercion_failures,
"unmapped_kept": result.unmapped_kept,
"missing_required_targets": result.missing_required_targets,
}, indent=2, default=str))
typer.echo(f"Mapping audit: {audit_path}")
else:
typer.echo("\nThis was a preview. Add --apply to write the mapped output.")
typer.echo(f"Log: {log_path}")
# ---------------------------------------------------------------------------
# Output formatting
# ---------------------------------------------------------------------------
def _print_results(result, input_path: Path, options) -> None:
typer.echo(f"\n{''*60}")
typer.echo(f" File: {input_path.name}")
typer.echo(f" Columns renamed: {result.columns_renamed}")
typer.echo(f" Columns dropped: {len(result.columns_dropped)}")
typer.echo(f" Columns added: {len(result.columns_added)}")
typer.echo(f" Unmapped kept: {len(result.unmapped_kept)}")
typer.echo(f" Coercion failures: "
f"{sum(result.coercion_failures.values())} cells across "
f"{len(result.coercion_failures)} column(s)")
typer.echo(f"{''*60}")
if result.mapping:
typer.echo("\nMapping:")
for src, tgt in result.mapping.items():
tag = " (auto)" if src in result.inferred_pairs else ""
arrow = "" if src != tgt else ""
typer.echo(f" {src!r} {arrow} {tgt!r}{tag}")
if result.columns_dropped:
typer.echo(f"\nDropped: {result.columns_dropped}")
if result.columns_added:
typer.echo(f"\nAdded (defaults): {result.columns_added}")
if result.coercion_failures:
typer.echo("\nCoercion failures:")
for col, n in result.coercion_failures.items():
typer.echo(f" {col}: {n} row(s) could not be coerced")
if result.missing_required_targets:
typer.echo(f"\nMissing required targets: {result.missing_required_targets}")
# ---------------------------------------------------------------------------
# __main__
# ---------------------------------------------------------------------------
def main():
app()
if __name__ == "__main__":
main()

364
src/cli_format.py Normal file
View File

@@ -0,0 +1,364 @@
"""CLI for the DataTools Format Standardizer (script 03).
Usage:
python -m src.cli_format input.csv \\
--types 'phone:phone,price:currency,name:name' \\
--apply
# 1 GB international file with per-row country column:
python -m src.cli_format huge.csv \\
--types 'phone:phone,address:address,price:currency' \\
--phone-country country --address-country country \\
--preserve-code --audit-max 50000 --apply
The CLI auto-streams (chunked read/write, bounded RAM) when the input
exceeds ~100 MB. Force or disable with ``--stream`` / ``--no-stream``.
"""
from __future__ import annotations
import sys
from datetime import datetime
from pathlib import Path
from typing import Optional
import typer
from loguru import logger
app = typer.Typer(
name="format",
help=(
"Standardize dates, phones, currencies, names, and addresses "
"in CSV / Excel files.\n\n"
"Default behaviour: preview the changes (no file written). "
"Add --apply to write output.\n\n"
"For 1 GB+ international files, the CLI auto-streams in 50,000-row "
"chunks so memory stays bounded. Use --phone-country / "
"--address-country to point at a per-row ISO-3166 column for "
"country-aware parsing.\n\n"
"Examples:\n\n"
" # Preview\n"
" python -m src.cli_format data.csv --types 'phone:phone,price:currency'\n\n"
" # International file with per-row country\n"
" python -m src.cli_format leads.csv --types 'phone:phone' "
"--phone-country country --apply\n\n"
" # Force streaming with smaller chunks for tight memory\n"
" python -m src.cli_format huge.csv --types 'phone:phone' "
"--stream --chunk-size 10000 --apply\n"
),
add_completion=False,
no_args_is_help=True,
)
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _setup_logging(log_dir: Path) -> Path:
log_dir.mkdir(parents=True, exist_ok=True)
ts = datetime.now().strftime("%Y%m%d_%H%M%S")
log_path = log_dir / f"format_{ts}.log"
logger.remove()
logger.add(sys.stderr, level="WARNING", format="{message}")
logger.add(
str(log_path), level="DEBUG",
format="{time:YYYY-MM-DD HH:mm:ss} | {level:<8} | {message}",
)
return log_path
def _parse_types(raw: Optional[str]) -> dict[str, str]:
"""Parse ``col:phone,col:date`` into a dict."""
if not raw:
return {}
out: dict[str, str] = {}
for piece in raw.split(","):
piece = piece.strip()
if not piece:
continue
if ":" not in piece:
raise typer.BadParameter(
f"Invalid --types piece: {piece!r}. "
f"Expected 'col:type[,col:type...]' "
f"where type is one of: date, phone, currency, name, address, email, boolean."
)
col, ft = piece.split(":", 1)
out[col.strip()] = ft.strip()
return out
_AUTO_STREAM_THRESHOLD = 100 * 1024 * 1024 # 100 MB
# ---------------------------------------------------------------------------
# Main command
# ---------------------------------------------------------------------------
@app.command()
def standardize(
input_file: str = typer.Argument(..., help="CSV or TSV file path."),
output: Optional[str] = typer.Option(
None, "--output", "-o",
help="Output file path. Default: {input}_standardized.csv",
),
apply: bool = typer.Option(
False, "--apply",
help="Write the output. Without this flag, only a preview is shown.",
),
types: Optional[str] = typer.Option(
None, "--types",
help="Per-column types: 'col:type[,col:type...]'. "
"Types: date, phone, currency, name, address, email, boolean.",
),
preset: Optional[str] = typer.Option(
None, "--preset",
help="Named preset (e.g. 'us', 'uk', 'eu', 'jp'). Layered before --types.",
),
phone_country: Optional[str] = typer.Option(
None, "--phone-country",
help="Column name carrying the per-row ISO-3166 country code for phones.",
),
address_country: Optional[str] = typer.Option(
None, "--address-country",
help="Column name carrying the per-row country code for addresses.",
),
phone_region: str = typer.Option(
"US", "--phone-region",
help="Default phone region when no per-row column is set. ISO-3166 alpha-2.",
),
phone_format: str = typer.Option(
"E164", "--phone-format",
help="Phone output format: E164 | INTERNATIONAL | NATIONAL | RFC3966 | DIGITS.",
),
preserve_code: bool = typer.Option(
False, "--preserve-code",
help="Currency: emit ISO-4217 prefix (e.g. 'USD 1500.00').",
),
decimals: int = typer.Option(
2, "--decimals",
help="Currency decimal precision.",
),
audit_max: int = typer.Option(
10_000, "--audit-max",
help="Cap the change-audit at N rows (0 = no audit, -1 = unbounded).",
),
stream: Optional[bool] = typer.Option(
None, "--stream/--no-stream",
help="Force streaming (chunked, bounded RAM). Auto-on for inputs > 100 MB.",
),
chunk_size: int = typer.Option(
50_000, "--chunk-size",
help="Rows per chunk in streaming mode.",
),
cache_size: int = typer.Option(
262_144, "--cache-size",
help="Per-column LRU-cache size (set 0 to disable).",
),
encoding_override: Optional[str] = typer.Option(
None, "--encoding",
help="Override auto-detected file encoding.",
),
delimiter: Optional[str] = typer.Option(
None, "--delimiter",
help="Override auto-detected delimiter.",
),
config: Optional[str] = typer.Option(
None, "--config",
help="Load options from a saved JSON config.",
),
save_config: Optional[str] = typer.Option(
None, "--save-config",
help="Save current options to a JSON config.",
),
):
"""Standardize formats across a CSV / TSV. Auto-streams for large inputs."""
from src.core.format_standardize import (
FieldType,
StandardizeOptions,
standardize_dataframe,
standardize_file,
)
from src.core.io import read_file, detect_encoding, detect_delimiter
import pandas as pd
inp = Path(input_file)
if not inp.exists():
typer.echo(f"Error: File not found: {inp}", err=True)
raise typer.Exit(1)
log_path = _setup_logging(Path("logs"))
# Build options
if config:
cp = Path(config)
if not cp.exists():
typer.echo(f"Error: Config file not found: {cp}", err=True)
raise typer.Exit(1)
options = StandardizeOptions.from_file(cp)
elif preset:
try:
options = StandardizeOptions.from_preset(preset)
except ValueError as e:
typer.echo(f"Error: {e}", err=True)
raise typer.Exit(1)
else:
options = StandardizeOptions()
parsed_types = _parse_types(types)
if parsed_types:
try:
options.column_types = {
col: FieldType(t) for col, t in parsed_types.items()
}
except ValueError as e:
typer.echo(
f"Error: {e}. Valid types: "
+ ", ".join(sorted(t.value for t in FieldType)),
err=True,
)
raise typer.Exit(1)
if not options.column_types:
typer.echo(
"Error: no column types declared. Pass --types 'col:type,...' "
"or --preset / --config with a column_types map.",
err=True,
)
raise typer.Exit(1)
if phone_country:
options.phone_country_column = phone_country
if address_country:
options.address_country_column = address_country
options.phone_region = phone_region
options.phone_format = phone_format # type: ignore[assignment]
options.currency_preserve_code = preserve_code
options.currency_decimals = decimals
options.audit_max_rows = (
None if audit_max < 0 else audit_max
)
options.cache_size = cache_size
if save_config:
saved = options.to_file(save_config)
typer.echo(f"Config saved to {saved}")
# Decide streaming mode
file_size = inp.stat().st_size
use_stream = stream if stream is not None else file_size > _AUTO_STREAM_THRESHOLD
enc = encoding_override or detect_encoding(inp)
delim = delimiter or detect_delimiter(inp, enc)
out_path = Path(output) if output else inp.parent / f"{inp.stem}_standardized.csv"
typer.echo(
f"Reading {inp.name} ({file_size/1024/1024:.1f} MB; "
f"{'streaming' if use_stream else 'in-memory'} mode)..."
)
if use_stream:
if not apply:
typer.echo(
"\nStreaming mode does not produce a preview. "
"Re-run with --apply to write output, or remove --stream to preview a sample."
)
raise typer.Exit(0)
last_log = [0.0]
import time as _time
def _progress(rows, chunks):
now = _time.perf_counter()
if now - last_log[0] < 1.0:
return
last_log[0] = now
typer.echo(f" ... {rows:,} rows ({chunks} chunks)")
t0 = _time.perf_counter()
res = standardize_file(
inp, out_path, options,
chunk_size=chunk_size,
progress_callback=_progress,
encoding=enc,
delimiter=delim,
)
elapsed = _time.perf_counter() - t0
typer.echo(f"\n{''*60}")
typer.echo(f" File: {inp.name}")
typer.echo(f" Rows: {res.rows_processed:,}")
typer.echo(f" Chunks: {res.chunks_processed}")
typer.echo(f" Cells changed: {res.cells_changed:,}")
typer.echo(
f" Cells unparseable: {res.cells_unparseable:,} / {res.cells_total:,}"
)
typer.echo(
f" Throughput: {res.rows_processed / max(elapsed, 1e-9):,.0f} rows/sec"
)
typer.echo(f" Elapsed: {elapsed:.2f}s")
typer.echo(f"{''*60}")
typer.echo(f"\nStandardized: {res.output_path}")
if res.audit_path:
typer.echo(f"Changes audit: {res.audit_path}")
typer.echo(f"Log: {log_path}")
return
# In-memory path
try:
df = read_file(
inp, encoding=enc, delimiter=delim, repair=False,
)
if not isinstance(df, pd.DataFrame):
df = pd.concat(list(df), ignore_index=True)
except Exception as e:
typer.echo(f"Error reading file: {e}", err=True)
raise typer.Exit(1)
typer.echo(f" {len(df):,} rows, {len(df.columns)} columns")
typer.echo("Standardizing...")
try:
result = standardize_dataframe(df, options)
except (ValueError, OSError) as e:
typer.echo(f"Error: {e}", err=True)
raise typer.Exit(1)
pct = (result.cells_changed / result.cells_total * 100) if result.cells_total else 0
typer.echo(f"\n{''*60}")
typer.echo(f" File: {inp.name}")
typer.echo(f" Columns processed: {len(result.columns_processed)}")
typer.echo(f" Cells scanned: {result.cells_total:,}")
typer.echo(f" Cells changed: {result.cells_changed:,} ({pct:.1f}%)")
typer.echo(f" Cells unparseable: {result.cells_unparseable:,}")
typer.echo(f"{''*60}")
if result.cells_changed and not result.changes.empty:
typer.echo("\nFirst examples:")
for _, row in result.changes.head(5).iterrows():
old = repr(row["old"])[:40]
new = repr(row["new"])[:40]
typer.echo(
f" Row {row['row'] + 1}, {row['column']} "
f"({row['field_type']}): {old}{new}"
)
if apply:
from src.core.io import write_file
write_file(result.standardized_df, out_path)
typer.echo(f"\nStandardized: {out_path}")
if not result.changes.empty:
audit_path = inp.parent / f"{inp.stem}_changes.csv"
write_file(result.changes, audit_path)
typer.echo(f"Changes audit: {audit_path}")
else:
typer.echo("\nThis was a preview. Add --apply to write the output.")
typer.echo(f"Log: {log_path}")
def main():
app()
if __name__ == "__main__":
main()

380
src/cli_missing.py Normal file
View File

@@ -0,0 +1,380 @@
"""CLI for the DataTools Missing Value Handler (script 04).
Usage:
python -m src.cli_missing input.csv # profile only
python -m src.cli_missing input.csv --apply # detect-only + write
python -m src.cli_missing input.csv --preset safe-fill --apply
python -m src.cli_missing input.csv --strategy median --apply
python -m src.cli_missing input.csv --strategy drop_row --apply
python -m src.cli_missing input.csv --strategy constant --fill-value 0 --apply
python -m src.cli_missing input.csv --strategy median --columns age,score --apply
python -m src.cli_missing input.csv --col-strategy "age:median,city:mode" --apply
python -m src.cli_missing --help
"""
from __future__ import annotations
import sys
from datetime import datetime
from pathlib import Path
from typing import Optional
import typer
from loguru import logger
app = typer.Typer(
name="missing",
help=(
"Detect and handle missing values in CSV / Excel files.\n\n"
"Default behaviour: profile only (no file written). Add --apply to "
"write the handled output and audit log.\n\n"
"Strategies:\n"
" none, drop_row, drop_col, drop_both,\n"
" mean, median, mode, constant,\n"
" ffill, bfill, interpolate\n\n"
"Examples:\n\n"
" # Profile missingness without writing anything\n"
" python -m src.cli_missing customers.csv\n\n"
" # Standardize sentinels (\"N/A\", \"-\", \"NULL\", …) to NaN and write\n"
" python -m src.cli_missing customers.csv --apply\n\n"
" # Safe fill: numeric → median, categorical → mode\n"
" python -m src.cli_missing customers.csv --preset safe-fill --apply\n\n"
" # Drop rows missing >50%% of selected columns\n"
" python -m src.cli_missing customers.csv --strategy drop_row "
"--row-threshold 0.5 --apply\n\n"
" # Per-column strategies\n"
" python -m src.cli_missing customers.csv "
"--col-strategy 'age:median,city:mode,notes:constant' --fill-value '' --apply\n"
),
add_completion=False,
no_args_is_help=True,
)
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _setup_logging(log_dir: Path) -> Path:
log_dir.mkdir(parents=True, exist_ok=True)
ts = datetime.now().strftime("%Y%m%d_%H%M%S")
log_path = log_dir / f"missing_{ts}.log"
logger.remove()
logger.add(sys.stderr, level="WARNING", format="{message}")
logger.add(
str(log_path),
level="DEBUG",
format="{time:YYYY-MM-DD HH:mm:ss} | {level:<8} | {message}",
)
return log_path
def _split_csv_arg(raw: Optional[str]) -> Optional[list[str]]:
if raw is None:
return None
return [c.strip() for c in raw.split(",") if c.strip()]
def _parse_col_strategy(raw: Optional[str]) -> dict[str, str]:
"""Parse ``--col-strategy 'age:median,city:mode'`` into a dict."""
if not raw:
return {}
out: dict[str, str] = {}
for piece in raw.split(","):
piece = piece.strip()
if not piece:
continue
if ":" not in piece:
raise typer.BadParameter(
f"Invalid --col-strategy piece: '{piece}'. "
f"Expected 'col:strategy[,col:strategy...]'."
)
col, strat = piece.split(":", 1)
out[col.strip()] = strat.strip()
return out
# ---------------------------------------------------------------------------
# Main command
# ---------------------------------------------------------------------------
@app.command()
def handle(
input_file: str = typer.Argument(
...,
help="Path to the CSV or Excel file.",
),
output: Optional[str] = typer.Option(
None, "--output", "-o",
help="Output file path. Default: {input}_missing.csv",
),
apply: bool = typer.Option(
False, "--apply",
help="Write the output. Without this flag, only the profile is shown.",
),
preset: str = typer.Option(
"detect-only", "--preset",
help="Preset: detect-only, safe-fill, or drop-incomplete.",
),
strategy: Optional[str] = typer.Option(
None, "--strategy",
help=(
"Override the preset strategy: none, drop_row, drop_col, drop_both, "
"mean, median, mode, constant, ffill, bfill, interpolate."
),
),
col_strategy: Optional[str] = typer.Option(
None, "--col-strategy",
help="Per-column strategies: 'col:strategy[,col:strategy...]'.",
),
fill_value: Optional[str] = typer.Option(
None, "--fill-value",
help="Constant fill value (used with --strategy constant).",
),
columns: Optional[str] = typer.Option(
None, "--columns",
help="Comma-separated columns to handle (default: all columns).",
),
skip: Optional[str] = typer.Option(
None, "--skip",
help="Comma-separated columns to skip.",
),
sentinels: Optional[str] = typer.Option(
None, "--sentinels",
help=(
"Comma-separated extra sentinels to treat as missing "
"(merged with the built-in defaults)."
),
),
no_sentinels: bool = typer.Option(
False, "--no-sentinels",
help="Disable disguised-null standardization entirely.",
),
row_threshold: float = typer.Option(
1.0, "--row-threshold",
help=(
"For drop_row: drop rows whose missing fraction across selected "
"columns is STRICTLY GREATER than this value (0.0..1.0). "
"Default 1.0 = never drop. Use 0.0 to drop any row with any "
"missing; 0.5 to drop rows >50%% missing."
),
),
col_threshold: float = typer.Option(
1.0, "--col-threshold",
help=(
"For drop_col: drop columns whose missing fraction is strictly "
"greater than this value. Default 1.0 = never drop."
),
),
config: Optional[str] = typer.Option(
None, "--config",
help="Load options from a saved JSON config file.",
),
save_config: Optional[str] = typer.Option(
None, "--save-config",
help="Save current options to a JSON config file.",
),
sheet: Optional[str] = typer.Option(
None, "--sheet",
help="Excel sheet name or index (default: first sheet).",
),
encoding_override: Optional[str] = typer.Option(
None, "--encoding",
help="Override auto-detected file encoding.",
),
header_row: Optional[int] = typer.Option(
None, "--header-row",
help="0-based row index for the header (default: auto-detect).",
),
full_changelog: bool = typer.Option(
False, "--full-changelog",
help="Write every change to the audit CSV (default caps to first 1000).",
),
):
"""Detect and handle missing values."""
from src.core.io import read_file, write_file
from src.core.missing import MissingOptions, PRESETS, handle_missing
import pandas as pd
# Validate inputs
input_path = Path(input_file)
if not input_path.exists():
typer.echo(f"Error: File not found: {input_path}", err=True)
raise typer.Exit(1)
if preset not in PRESETS:
typer.echo(
f"Error: Unknown preset '{preset}'. "
f"Choose from: {', '.join(sorted(PRESETS))}.",
err=True,
)
raise typer.Exit(1)
log_path = _setup_logging(Path("logs"))
# Build options
if config:
cfg_path = Path(config)
if not cfg_path.exists():
typer.echo(f"Error: Config file not found: {cfg_path}", err=True)
raise typer.Exit(1)
options = MissingOptions.from_file(cfg_path)
logger.info("Loaded config from {}", cfg_path)
else:
options = MissingOptions.from_preset(preset)
if strategy:
options.strategy = strategy # type: ignore[assignment]
if col_strategy:
options.column_strategies = _parse_col_strategy(col_strategy) # type: ignore[assignment]
if fill_value is not None:
options.fill_value = fill_value
cols_list = _split_csv_arg(columns)
if cols_list is not None:
options.columns = cols_list
skip_list = _split_csv_arg(skip)
if skip_list:
options.skip_columns = skip_list
extra = _split_csv_arg(sentinels)
if extra:
options.sentinels = list(dict.fromkeys([*options.sentinels, *extra]))
if no_sentinels:
options.standardize_sentinels = False
options.row_drop_threshold = row_threshold
options.col_drop_threshold = col_threshold
if save_config:
saved = options.to_file(save_config)
typer.echo(f"Config saved to {saved}")
# Read input
typer.echo(f"Reading {input_path.name}...")
try:
sheet_arg: str | int | None = None
if sheet is not None:
try:
sheet_arg = int(sheet)
except ValueError:
sheet_arg = sheet
df = read_file(
input_path,
encoding=encoding_override,
header_row=header_row,
sheet_name=sheet_arg if sheet_arg is not None else 0,
repair=False,
)
if not isinstance(df, pd.DataFrame):
df = pd.concat(list(df), ignore_index=True)
except Exception as e:
typer.echo(f"Error reading file: {e}", err=True)
raise typer.Exit(1)
typer.echo(f" {len(df)} rows, {len(df.columns)} columns")
# Run
typer.echo("Profiling missingness...")
try:
result = handle_missing(df, options)
except (ValueError, OSError) as e:
typer.echo(f"Error: {e}", err=True)
raise typer.Exit(1)
_print_results(result, input_path, options)
# Write
if apply:
stem = input_path.stem
out_path = Path(output) if output else input_path.parent / f"{stem}_missing.csv"
write_file(result.handled_df, out_path)
typer.echo(f"\nHandled file: {out_path}")
if not result.changes.empty:
changes_path = input_path.parent / f"{stem}_missing_changes.csv"
audit_df = result.changes
cap = 1000
if not full_changelog and len(audit_df) > cap:
typer.echo(
f"Note: changelog capped at {cap} rows. "
f"Use --full-changelog to write all {len(audit_df)} changes."
)
audit_df = audit_df.head(cap)
write_file(audit_df, changes_path)
typer.echo(f"Changes audit: {changes_path}")
else:
typer.echo(
"\nThis was a profile only. Add --apply to write the handled output."
)
typer.echo(f"Log: {log_path}")
# ---------------------------------------------------------------------------
# Output formatting
# ---------------------------------------------------------------------------
def _print_results(result, input_path: Path, options) -> None:
typer.echo(f"\n{''*60}")
typer.echo(f" File: {input_path.name}")
typer.echo(f" Rows: {result.profile_before.rows_total}")
typer.echo(f" Columns processed: {len(result.columns_processed)}")
typer.echo(
f" Cells missing: "
f"{result.profile_before.cells_missing} / {result.profile_before.cells_total}"
f" ({result.profile_before.cells_missing_pct:.1f}%)"
)
typer.echo(
f" Rows w/ any missing: "
f"{result.profile_before.rows_with_any_missing} "
f"(complete: {result.profile_before.rows_complete})"
)
typer.echo(f"{''*60}")
typer.echo("\nPer-column profile:")
profile_df = result.profile_before.to_dataframe()
for _, row in profile_df.iterrows():
marker = " " if row["missing"] == 0 else " "
typer.echo(
f"{marker}{row['column']:<24} {row['dtype']:<10} "
f"missing={row['missing']:<6} ({row['missing_pct']:>5.1f}%)"
+ (
f" top sentinel: {row['top_sentinel']!r} ×{row['top_sentinel_count']}"
if row["top_sentinel_count"] else ""
)
)
typer.echo("\nActions:")
typer.echo(f" Sentinels standardized to NaN: {result.sentinels_standardized}")
typer.echo(f" Cells filled: {result.cells_filled}")
typer.echo(f" Rows dropped: {result.rows_dropped}")
typer.echo(
f" Columns dropped: {len(result.columns_dropped)}"
+ (f" ({', '.join(result.columns_dropped)})" if result.columns_dropped else "")
)
if result.strategy_per_column:
typer.echo("\nStrategy per column:")
for col, strat in result.strategy_per_column.items():
typer.echo(f" {col}: {strat}")
if not result.changes.empty:
typer.echo("\nFirst examples:")
for _, row in result.changes.head(5).iterrows():
old = repr(row["old"])[:40]
new = repr(row["new"])[:40]
row_label = "" if row["row"] == -1 else f"Row {row['row'] + 1}"
typer.echo(
f" {row_label}, {row['column']}: {old}{new} "
f"[{row['action']}]"
)
# ---------------------------------------------------------------------------
# __main__
# ---------------------------------------------------------------------------
def main():
app()
if __name__ == "__main__":
main()

307
src/cli_pipeline.py Normal file
View File

@@ -0,0 +1,307 @@
"""CLI for the DataTools Pipeline Runner (script 09).
Usage:
# Run the recommended default pipeline (text → format → missing → dedup):
python -m src.cli_pipeline input.csv --apply
# Quick custom order via --steps:
python -m src.cli_pipeline input.csv \\
--steps text_clean,format_standardize,missing --apply
# Save the recommended pipeline to a JSON for editing:
python -m src.cli_pipeline --recommend --output pipeline.json
# Run a saved pipeline:
python -m src.cli_pipeline weekly_export.csv --pipeline pipeline.json --apply
# Strict mode: fail if the pipeline contains soft-dependency violations
python -m src.cli_pipeline data.csv --steps dedup,text_clean \\
--strict --apply
"""
from __future__ import annotations
import json
import sys
from datetime import datetime
from pathlib import Path
from typing import Optional
import typer
from loguru import logger
app = typer.Typer(
name="pipeline",
help=(
"Chain DataTools cleaning steps into one orchestrated workflow.\n\n"
"Default behaviour: preview the plan + run the pipeline (no file "
"written). Add --apply to write the cleaned output and audit log.\n\n"
"The pipeline RECOMMENDS an order based on tool dependencies "
"(text-clean before format-standardize, format before dedup, etc.) "
"and WARNS on out-of-order configs but does not block them. Use "
"--strict to escalate warnings to errors.\n\n"
"Tools available: text_clean, format_standardize, missing, "
"column_map, dedup."
),
add_completion=False,
no_args_is_help=False,
)
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _setup_logging(log_dir: Path) -> Path:
log_dir.mkdir(parents=True, exist_ok=True)
ts = datetime.now().strftime("%Y%m%d_%H%M%S")
log_path = log_dir / f"pipeline_{ts}.log"
logger.remove()
logger.add(sys.stderr, level="WARNING", format="{message}")
logger.add(
str(log_path), level="DEBUG",
format="{time:YYYY-MM-DD HH:mm:ss} | {level:<8} | {message}",
)
return log_path
def _split_csv_arg(raw: Optional[str]) -> Optional[list[str]]:
if raw is None:
return None
return [c.strip() for c in raw.split(",") if c.strip()]
# ---------------------------------------------------------------------------
# Main command
# ---------------------------------------------------------------------------
@app.command()
def run(
input_file: Optional[str] = typer.Argument(
None,
help="CSV / TSV / Excel file. Optional with --recommend.",
),
pipeline_path: Optional[str] = typer.Option(
None, "--pipeline", "-p",
help="Path to a pipeline JSON file (Pipeline.from_file format).",
),
steps: Optional[str] = typer.Option(
None, "--steps",
help=(
"Quick pipeline: comma-separated tool names in execution order. "
"Each step uses defaults. Example: 'text_clean,format_standardize,dedup'."
),
),
recommend: bool = typer.Option(
False, "--recommend",
help="Print (or save) the recommended default pipeline and exit.",
),
output: Optional[str] = typer.Option(
None, "--output", "-o",
help=(
"When --recommend is set, save the pipeline JSON here. "
"Otherwise, write the pipeline output to this CSV path "
"(default: {input}_pipeline.csv)."
),
),
apply: bool = typer.Option(
False, "--apply",
help="Write the output. Without this flag, only the plan is shown.",
),
strict: bool = typer.Option(
False, "--strict",
help="Treat soft-dependency warnings as errors (refuse to run).",
),
continue_on_error: bool = typer.Option(
False, "--continue-on-error",
help="Don't abort if a step fails; carry the previous step's df forward.",
),
encoding_override: Optional[str] = typer.Option(
None, "--encoding",
help="Override auto-detected file encoding.",
),
delimiter: Optional[str] = typer.Option(
None, "--delimiter",
help="Override auto-detected delimiter.",
),
):
"""Run a DataTools cleaning pipeline."""
from src.core.pipeline import (
Pipeline,
recommended_pipeline,
run_pipeline,
validate_pipeline,
)
# ------------------------------------------------------------------
# --recommend: print or save the default pipeline and exit
# ------------------------------------------------------------------
if recommend:
pipe = recommended_pipeline()
body = json.dumps(pipe.to_dict(), indent=2)
if output:
Path(output).write_text(body)
typer.echo(f"Recommended pipeline saved to {output}")
else:
typer.echo(body)
return
if not input_file:
typer.echo(
"Error: input file is required (or use --recommend to "
"emit the default pipeline).",
err=True,
)
raise typer.Exit(2)
inp = Path(input_file)
if not inp.exists():
typer.echo(f"Error: File not found: {inp}", err=True)
raise typer.Exit(1)
log_path = _setup_logging(Path("logs"))
# ------------------------------------------------------------------
# Resolve pipeline source: --pipeline file, --steps list, or default
# ------------------------------------------------------------------
if pipeline_path and steps:
typer.echo(
"Error: pass either --pipeline or --steps, not both.",
err=True,
)
raise typer.Exit(1)
if pipeline_path:
pp = Path(pipeline_path)
if not pp.exists():
typer.echo(f"Error: pipeline file not found: {pp}", err=True)
raise typer.Exit(1)
try:
pipe = Pipeline.from_file(pp)
except Exception as e:
from src.core.errors import format_for_user
typer.echo(f"Error reading pipeline: {format_for_user(e)}", err=True)
raise typer.Exit(1)
elif steps:
names = _split_csv_arg(steps) or []
try:
pipe = recommended_pipeline(include=names)
except Exception as e:
from src.core.errors import format_for_user
typer.echo(f"Error: {format_for_user(e)}", err=True)
raise typer.Exit(1)
else:
pipe = recommended_pipeline()
# ------------------------------------------------------------------
# Plan + warnings
# ------------------------------------------------------------------
warnings = validate_pipeline(pipe)
typer.echo(f"\n{''*60}")
typer.echo(" Pipeline plan:")
for i, step in enumerate(pipe.steps, 1):
flag = " " if step.enabled else ""
typer.echo(f" {i}. {flag}{step.display_name():<22} options={step.options or {}}")
typer.echo(f"{''*60}")
if warnings:
typer.echo("\nSoft-dependency warnings (recommended order violated):")
for w in warnings:
typer.echo(f" ! {w}")
if strict:
typer.echo(
"\nAborting: --strict was set. Reorder the steps or drop --strict.",
err=True,
)
raise typer.Exit(2)
if not apply:
typer.echo(
"\nThis was a plan-only run. Add --apply to execute the pipeline."
)
typer.echo(f"Log: {log_path}")
return
# ------------------------------------------------------------------
# Read input + execute
# ------------------------------------------------------------------
from src.core.io import read_file, write_file
import pandas as pd
typer.echo(f"\nReading {inp.name}...")
try:
df = read_file(
inp, encoding=encoding_override, delimiter=delimiter, repair=False,
)
if not isinstance(df, pd.DataFrame):
df = pd.concat(list(df), ignore_index=True)
except Exception as e:
typer.echo(f"Error reading file: {e}", err=True)
raise typer.Exit(1)
typer.echo(f" {len(df):,} rows, {len(df.columns)} columns")
typer.echo("\nExecuting pipeline:")
def _on_step(sr) -> None:
if sr.skipped:
typer.echo(f" - {sr.step.display_name()} (skipped)")
elif sr.error:
typer.echo(f"{sr.step.display_name()} ({sr.elapsed_seconds*1000:.0f} ms) — ERROR: {sr.error.splitlines()[0]}")
else:
typer.echo(f"{sr.step.display_name()} ({sr.elapsed_seconds*1000:.0f} ms) {sr.summary}")
try:
result = run_pipeline(
df, pipe,
on_step_complete=_on_step,
stop_on_error=not continue_on_error,
)
except Exception as e:
from src.core.errors import format_for_user
typer.echo(f"\nPipeline halted: {format_for_user(e)}", err=True)
raise typer.Exit(1)
typer.echo(f"\n{''*60}")
typer.echo(f" Initial rows: {result.initial_rows:,}")
typer.echo(f" Final rows: {result.final_rows:,}")
typer.echo(f" Steps run: {sum(1 for s in result.step_results if not s.skipped)}")
typer.echo(f" Total elapsed: {result.total_elapsed:.2f} s")
typer.echo(f"{''*60}")
# ------------------------------------------------------------------
# Write output + audit
# ------------------------------------------------------------------
out_path = Path(output) if output else inp.parent / f"{inp.stem}_pipeline.csv"
write_file(result.final_df, out_path)
typer.echo(f"\nPipeline output: {out_path}")
audit_path = inp.parent / f"{inp.stem}_pipeline.json"
audit_path.write_text(json.dumps({
"pipeline": pipe.to_dict(),
"warnings": result.warnings,
"initial_rows": result.initial_rows,
"final_rows": result.final_rows,
"total_elapsed_seconds": result.total_elapsed,
"steps": [
{
"tool": sr.step.tool,
"name": sr.step.display_name(),
"enabled": sr.step.enabled,
"skipped": sr.skipped,
"elapsed_seconds": sr.elapsed_seconds,
"summary": sr.summary,
"error": sr.error,
}
for sr in result.step_results
],
}, indent=2, default=str))
typer.echo(f"Pipeline audit: {audit_path}")
typer.echo(f"Log: {log_path}")
def main() -> None:
app()
if __name__ == "__main__":
main()

View File

@@ -96,15 +96,54 @@ from .format_standardize import (
PRESETS as STANDARDIZE_PRESETS,
StandardizeOptions,
StandardizeResult,
StreamingStandardizeResult,
detect_currency_code,
standardize_address,
standardize_boolean,
standardize_currency,
standardize_dataframe,
standardize_date,
standardize_file,
standardize_name,
standardize_phone,
)
from .missing import (
DEFAULT_SENTINELS,
ColumnReport,
MissingOptions,
MissingProfile,
MissingResult,
PRESETS as MISSING_PRESETS,
Strategy as MissingStrategy,
detect_sentinels,
handle_missing,
is_missing_like,
profile_missing,
)
from .column_mapper import (
ColumnDtype,
MapOptions,
MapResult,
PRESETS as MAP_PRESETS,
TargetField,
TargetSchema,
UnmappedStrategy,
coerce_series,
infer_mapping,
map_columns,
)
from .pipeline import (
Pipeline,
PipelineResult,
SOFT_DEPENDENCIES,
Step,
StepResult,
TOOL_ADAPTERS,
TOOL_NAMES,
recommended_pipeline,
run_pipeline,
validate_pipeline,
)
__all__ = [
# Core
@@ -171,6 +210,7 @@ __all__ = [
"STANDARDIZE_PRESETS",
"StandardizeOptions",
"StandardizeResult",
"StreamingStandardizeResult",
"detect_currency_code",
"standardize_dataframe",
"standardize_date",
@@ -179,4 +219,39 @@ __all__ = [
"standardize_name",
"standardize_address",
"standardize_boolean",
"standardize_file",
# Missing-value handling
"DEFAULT_SENTINELS",
"ColumnReport",
"MissingOptions",
"MissingProfile",
"MissingResult",
"MISSING_PRESETS",
"MissingStrategy",
"detect_sentinels",
"handle_missing",
"is_missing_like",
"profile_missing",
# Column mapping
"ColumnDtype",
"MapOptions",
"MapResult",
"MAP_PRESETS",
"TargetField",
"TargetSchema",
"UnmappedStrategy",
"coerce_series",
"infer_mapping",
"map_columns",
# Pipeline
"Pipeline",
"PipelineResult",
"SOFT_DEPENDENCIES",
"Step",
"StepResult",
"TOOL_ADAPTERS",
"TOOL_NAMES",
"recommended_pipeline",
"run_pipeline",
"validate_pipeline",
]

View File

@@ -593,6 +593,40 @@ def _count_row_terminators(raw: bytes) -> tuple[int, int, int]:
return n_crlf, n_lf, n_cr
def _detect_lying_bom(raw: bytes) -> list[Finding]:
"""Flag files whose UTF-8 BOM disagrees with the body bytes.
The "lying BOM" pattern is a file that starts with the UTF-8 BOM
(``EF BB BF``) but whose body cannot be decoded as UTF-8 — typically
a cp1252 export that someone hand-prepended a BOM to in an attempt to
make Excel happy. The encoding detector recovers transparently
(returns cp1252), but the user should still be told their file is
misrepresenting itself so the next downstream tool doesn't get
surprised.
"""
if not raw[:3] == b"\xef\xbb\xbf":
return []
try:
raw[3:].decode("utf-8")
return [] # honest BOM — body is real UTF-8
except UnicodeDecodeError:
pass
return [Finding(
id="encoding_lying_bom",
severity="warn",
tool="",
count=1,
description=(
"File starts with a UTF-8 BOM, but the body bytes are not "
"valid UTF-8 — the BOM is misleading. The encoding detector "
"recovered by falling back to a single-byte codepage; you "
"may want to re-save the file with a matching encoding."
),
confidence="high",
fix_action=FIX_NONE,
)]
def _detect_mixed_line_endings(raw: bytes) -> list[Finding]:
"""Flag files that mix CRLF, LF, and bare CR row terminators.
@@ -875,6 +909,7 @@ def analyze(
findings.extend(_findings_from_repair(repair_result))
if raw_for_byte_scan is not None:
findings.extend(_detect_mixed_line_endings(raw_for_byte_scan))
findings.extend(_detect_lying_bom(raw_for_byte_scan))
findings.extend(_detect_encoding_uncertainty(df))
findings.extend(_detect_smart_punctuation(df))
findings.extend(_detect_invisible_chars(df))
@@ -890,6 +925,7 @@ def analyze(
def _load_for_analysis(
path: Path, *, sample_rows: int, encoding_override: Optional[str] = None,
fold_quotes: bool = True,
) -> tuple[pd.DataFrame, Optional[RepairResult], Optional[bytes]]:
"""Read just enough of *path* to scan, with the same robust pre-parse
repair the tool pages will use.
@@ -903,6 +939,12 @@ def _load_for_analysis(
When *encoding_override* is set, it replaces the detected encoding
entirely — the user has explicitly told us what the file is. The
delimiter is still detected (it's separate from encoding choice).
*fold_quotes* defaults to True so the byte-level smart-quote fold
runs as part of the repair pass (correct for CSV parsing). Pass
False when the caller needs a content-preserving decode for
identity round-trip checks (encoding corpus tests, format-fidelity
audits).
"""
suffix = path.suffix.lower()
if suffix in (".xlsx", ".xls"):
@@ -937,7 +979,7 @@ def _load_for_analysis(
if not head.strip():
return pd.DataFrame(), None, head
repair = repair_bytes(head, encoding=enc, delimiter=delim)
repair = repair_bytes(head, encoding=enc, delimiter=delim, fold_quotes=fold_quotes)
import io as _io
try:
df = pd.read_csv(
@@ -954,7 +996,9 @@ def _load_for_analysis(
# never trips; the 2× row-size multiplier above handles 99% of inputs.
if not head_was_full and len(df) < sample_rows:
full_raw = path.read_bytes()
full_repair = repair_bytes(full_raw, encoding=enc, delimiter=delim)
full_repair = repair_bytes(
full_raw, encoding=enc, delimiter=delim, fold_quotes=fold_quotes,
)
try:
df = pd.read_csv(
_io.BytesIO(full_repair.repaired_bytes),

633
src/core/column_mapper.py Normal file
View File

@@ -0,0 +1,633 @@
"""DataTools Column Mapper.
Rename columns, enforce a target schema, coerce types, drop / add /
reorder columns. Designed for the three buyer profiles the toolkit
already serves:
1. **Schema enforcement** — analyst receives a CSV that has to fit a
known target shape (a CRM import format, a database schema, a
mailing-list contract). Map source columns to target names, coerce
each to the declared type, drop the extras, fail clearly when a
required target field is missing.
2. **Multi-source unification** — operator merges vendor/partner
exports where every file uses different column names ("First Name"
/ "first_name" / "FirstName"). The fuzzy auto-mapper proposes a
mapping; the user reviews and overrides.
3. **Type coercion** — quick conversion of mis-typed columns (string
"123" → int, "true"/"yes" → bool, "2024-01-15" → date) without
leaving the tool, with errors surfaced row-by-row.
Public API
----------
Types:
TargetField, TargetSchema, ColumnMapping, MapOptions, MapResult,
ColumnDtype
Functions:
map_columns(df, options) -> MapResult
infer_mapping(df, schema, *, threshold=0.6) -> dict[src, target]
coerce_series(series, dtype) -> (Series, n_failures)
Presets:
PRESETS = {"rename-only", "strict-schema", "lenient-schema"}
"""
from __future__ import annotations
import json
import re
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Iterable, Literal, Optional
import numpy as np
import pandas as pd
from loguru import logger
from pandas.api import types as pdtypes
from .errors import ConfigError, InputValidationError, ensure_choice, ensure_dataframe
# ---------------------------------------------------------------------------
# Types
# ---------------------------------------------------------------------------
ColumnDtype = Literal[
"string",
"integer",
"float",
"boolean",
"date",
"datetime",
"category",
"auto", # leave dtype alone
]
_VALID_DTYPES: frozenset[str] = frozenset({
"string", "integer", "float", "boolean", "date", "datetime",
"category", "auto",
})
@dataclass
class TargetField:
"""One field in a target schema.
Required fields whose source column is missing produce a
``MapResult.missing_required_targets`` entry rather than silently
creating a NaN column.
"""
name: str
dtype: ColumnDtype = "auto"
required: bool = False
aliases: list[str] = field(default_factory=list)
default: Any = None
@dataclass
class TargetSchema:
"""Ordered list of target fields. Ordering survives into the result DataFrame."""
fields: list[TargetField]
def field_names(self) -> list[str]:
return [f.name for f in self.fields]
def get(self, name: str) -> Optional[TargetField]:
return next((f for f in self.fields if f.name == name), None)
def to_dict(self) -> dict:
return {"fields": [asdict(f) for f in self.fields]}
def to_file(self, path: str | Path) -> Path:
out = Path(path)
out.write_text(json.dumps(self.to_dict(), indent=2, default=str))
return out
@classmethod
def from_dict(cls, data: dict) -> TargetSchema:
if "fields" not in data:
raise ConfigError(
"Target schema must contain a 'fields' list",
operation="TargetSchema.from_dict",
suggestion='Example: {"fields": [{"name": "email", "dtype": "string", "required": true}, ...]}',
)
fields = []
for entry in data["fields"]:
if isinstance(entry, str):
fields.append(TargetField(name=entry))
continue
if "name" not in entry:
raise ConfigError(
f"Schema field is missing 'name': {entry!r}",
operation="TargetSchema.from_dict",
)
dtype = entry.get("dtype", "auto")
if dtype not in _VALID_DTYPES:
raise ConfigError(
f"Schema field {entry['name']!r}: unknown dtype {dtype!r}",
operation="TargetSchema.from_dict",
suggestion=f"Valid: {sorted(_VALID_DTYPES)}",
)
fields.append(TargetField(
name=entry["name"],
dtype=dtype,
required=bool(entry.get("required", False)),
aliases=list(entry.get("aliases", [])),
default=entry.get("default"),
))
return cls(fields=fields)
@classmethod
def from_file(cls, path: str | Path) -> TargetSchema:
return cls.from_dict(json.loads(Path(path).read_text()))
# ---------------------------------------------------------------------------
# Fuzzy column-name matching
# ---------------------------------------------------------------------------
# Whitespace, punctuation, and case all vary across vendors. We normalise
# both sides to a token list before comparing.
_NORM_RE = re.compile(r"[^a-z0-9]+")
def _normalize_name(name: str) -> str:
"""Lowercase, strip non-alphanumerics — ``First Name`` → ``firstname``."""
if not isinstance(name, str):
return ""
return _NORM_RE.sub("", name.strip().lower())
def _token_set(name: str) -> frozenset[str]:
"""Tokenise a column name on non-alphanumeric boundaries."""
if not isinstance(name, str):
return frozenset()
parts = [p for p in _NORM_RE.split(name.strip().lower()) if p]
return frozenset(parts)
def _name_similarity(a: str, b: str) -> float:
"""Cheap similarity score in [0.0, 1.0].
Combines exact-after-normalisation, token Jaccard, and SequenceMatcher
ratio. A real fuzzy library (rapidfuzz) is already a project
dependency for the deduplicator — we use it when available, fall
back to stdlib ``difflib`` otherwise so the mapper works in trimmed
builds.
"""
if not a or not b:
return 0.0
na, nb = _normalize_name(a), _normalize_name(b)
if na == nb:
return 1.0
ta, tb = _token_set(a), _token_set(b)
jaccard = (len(ta & tb) / len(ta | tb)) if (ta or tb) else 0.0
try:
from rapidfuzz import fuzz
seq = fuzz.ratio(na, nb) / 100.0
except ImportError:
from difflib import SequenceMatcher
seq = SequenceMatcher(None, na, nb).ratio()
return max(jaccard, seq)
def infer_mapping(
df: pd.DataFrame,
schema: TargetSchema,
*,
threshold: float = 0.6,
) -> dict[str, str]:
"""Best-guess source-column → target-field mapping.
Returns a dict keyed by source-column name. A source column is
omitted from the result when no candidate scores above *threshold*.
Each target is matched at most once: the highest-scoring source
wins, ties broken by source-column order in *df*.
Aliases declared on a :class:`TargetField` are scored as if they
were target names — useful for vendor-specific synonyms
(``["customer_id", "cust_id", "client_no"]``).
"""
ensure_dataframe(df, function="infer_mapping")
sources = list(df.columns)
targets = schema.fields
# All (source, target) candidate scores; keep only those above
# threshold, sorted descending so a greedy walk picks the best
# available pairings first.
scored: list[tuple[float, str, str]] = []
for src in sources:
for tgt in targets:
best = _name_similarity(src, tgt.name)
for alias in tgt.aliases:
s = _name_similarity(src, alias)
if s > best:
best = s
if best >= threshold:
scored.append((best, str(src), tgt.name))
scored.sort(key=lambda x: (-x[0], sources.index(x[1])))
mapping: dict[str, str] = {}
used_targets: set[str] = set()
for score, src, tgt in scored:
if src in mapping or tgt in used_targets:
continue
mapping[src] = tgt
used_targets.add(tgt)
return mapping
# ---------------------------------------------------------------------------
# Type coercion
# ---------------------------------------------------------------------------
_TRUTHY = frozenset({"true", "t", "yes", "y", "1"})
_FALSY = frozenset({"false", "f", "no", "n", "0"})
def _coerce_boolean(value: Any) -> Any:
if isinstance(value, bool):
return value
if value is None or (isinstance(value, float) and pd.isna(value)):
return pd.NA
if isinstance(value, (int, float)):
return bool(value)
if isinstance(value, str):
v = value.strip().lower()
if v in _TRUTHY:
return True
if v in _FALSY:
return False
raise ValueError(f"cannot coerce to boolean: {value!r}")
def coerce_series(series: pd.Series, dtype: ColumnDtype) -> tuple[pd.Series, int]:
"""Coerce *series* to *dtype*, returning ``(coerced, n_failures)``.
Failures are counted but never raised — the caller (``map_columns``)
surfaces them through ``MapResult.coercion_failures`` so the user
can inspect which rows didn't fit. Already-typed inputs are cheap
no-ops.
"""
if dtype == "auto":
return series, 0
if dtype == "string":
return series.astype("string"), 0
if dtype == "category":
return series.astype("category"), 0
if dtype == "integer":
coerced = pd.to_numeric(series, errors="coerce")
# Use nullable Int64 so NaN entries don't get cast to floats.
rounded = coerced.round().astype("Int64")
# Failures = original non-NaN cells whose numeric coercion produced NaN.
original_filled = series.notna()
failed = (rounded.isna() & original_filled).sum()
return rounded, int(failed)
if dtype == "float":
coerced = pd.to_numeric(series, errors="coerce").astype("Float64")
original_filled = series.notna()
failed = (coerced.isna() & original_filled).sum()
return coerced, int(failed)
if dtype == "boolean":
out: list[Any] = []
failed = 0
for v in series.tolist():
try:
out.append(_coerce_boolean(v))
except ValueError:
out.append(pd.NA)
failed += 1
return pd.Series(out, index=series.index, dtype="boolean"), failed
if dtype in {"date", "datetime"}:
coerced = pd.to_datetime(series, errors="coerce", utc=False)
original_filled = series.notna()
failed = (coerced.isna() & original_filled).sum()
if dtype == "date":
# Drop the time component but keep dtype as datetime64 so
# downstream operations (delta, sort) still work.
coerced = coerced.dt.normalize()
return coerced, int(failed)
raise InputValidationError(
f"Unknown dtype {dtype!r}",
operation="coerce_series",
suggestion=f"Valid: {sorted(_VALID_DTYPES)}",
)
# ---------------------------------------------------------------------------
# Options / result dataclasses
# ---------------------------------------------------------------------------
# Strategy for handling source columns that don't appear in the target
# schema. ``keep`` preserves them at the end of the output; ``drop``
# removes them; ``error`` raises an InputValidationError.
UnmappedStrategy = Literal["keep", "drop", "error"]
PRESETS: dict[str, dict[str, Any]] = {
"rename-only": {
"auto_infer": True,
"unmapped": "keep",
"coerce_types": False,
"reorder_to_schema": False,
},
"strict-schema": {
"auto_infer": True,
"unmapped": "drop",
"coerce_types": True,
"reorder_to_schema": True,
},
"lenient-schema": {
"auto_infer": True,
"unmapped": "keep",
"coerce_types": True,
"reorder_to_schema": True,
},
}
@dataclass
class MapOptions:
"""Toggles for column mapping.
Defaults match the ``rename-only`` preset: best-effort fuzzy match
against the schema (if provided), keep unmapped source columns
after the mapped ones, no type coercion, no reorder.
"""
# Either pass an explicit ``mapping`` dict or a ``schema`` (and let
# the engine infer the mapping). Explicit mapping wins when both
# are set.
mapping: dict[str, str] = field(default_factory=dict)
schema: Optional[TargetSchema] = None
# When True (default), missing entries in ``mapping`` are filled in
# by ``infer_mapping`` against ``schema``. When False, only the
# explicit mapping is honoured.
auto_infer: bool = True
fuzzy_threshold: float = 0.6
# What to do with source columns that aren't in the mapping.
unmapped: UnmappedStrategy = "keep"
# Apply target-field dtypes from the schema after rename.
coerce_types: bool = False
# Reorder output to match schema.fields order. Unmapped survivors
# (when unmapped="keep") are appended at the end in their original
# source order.
reorder_to_schema: bool = False
# Required-target enforcement. When True (default), a required
# target field that has no source column raises an InputValidationError.
# When False, the missing field is added with ``default`` value.
enforce_required: bool = True
@classmethod
def from_preset(cls, name: str) -> MapOptions:
if name not in PRESETS:
raise ConfigError(
f"Unknown preset '{name}'",
operation="MapOptions.from_preset",
suggestion=f"Available: {sorted(PRESETS)}",
)
return cls(**PRESETS[name])
@classmethod
def from_dict(cls, data: dict) -> MapOptions:
known = set(cls.__dataclass_fields__)
kwargs = {k: v for k, v in data.items() if k in known}
if "schema" in kwargs and isinstance(kwargs["schema"], dict):
kwargs["schema"] = TargetSchema.from_dict(kwargs["schema"])
return cls(**kwargs)
def to_dict(self) -> dict:
out: dict[str, Any] = {
"mapping": dict(self.mapping),
"auto_infer": self.auto_infer,
"fuzzy_threshold": self.fuzzy_threshold,
"unmapped": self.unmapped,
"coerce_types": self.coerce_types,
"reorder_to_schema": self.reorder_to_schema,
"enforce_required": self.enforce_required,
}
if self.schema is not None:
out["schema"] = self.schema.to_dict()
return out
def to_file(self, path: str | Path) -> Path:
out = Path(path)
out.write_text(json.dumps(self.to_dict(), indent=2, default=str))
return out
@classmethod
def from_file(cls, path: str | Path) -> MapOptions:
return cls.from_dict(json.loads(Path(path).read_text()))
def validate(self) -> None:
ensure_choice(
self.unmapped, name="unmapped",
choices=("keep", "drop", "error"),
function="MapOptions.validate",
)
if not (0.0 <= self.fuzzy_threshold <= 1.0):
raise ConfigError(
f"fuzzy_threshold must be in [0.0, 1.0], got {self.fuzzy_threshold!r}",
operation="MapOptions.validate",
)
@dataclass
class MapResult:
"""Output of ``map_columns``."""
mapped_df: pd.DataFrame
mapping: dict[str, str] # source → target
inferred_pairs: dict[str, str] # subset of mapping that was auto-inferred
columns_renamed: int
columns_dropped: list[str]
columns_added: list[str] # required-defaulted fields added with default value
coercion_failures: dict[str, int] # column → n_rows_that_failed_coercion
unmapped_kept: list[str]
missing_required_targets: list[str]
# ---------------------------------------------------------------------------
# Main entry point
# ---------------------------------------------------------------------------
def map_columns(
df: pd.DataFrame,
options: Optional[MapOptions] = None,
) -> MapResult:
"""Apply *options* to *df* and return a :class:`MapResult`.
Pipeline placement (recommended, not enforced)
----------------------------------------------
Two natural slots:
* **Early** — header alignment for multi-vendor unification.
Each vendor uses different column names; rename to a canonical
schema before any other tool runs.
* **Late** — schema enforcement for output. After cleaning, coerce
types and project to the target shape (CRM import contract,
database schema). Run after format / missing so the coerced
data is canonical first.
The pipeline runner does not enforce a position; place by use case.
Pipeline:
1. Compose mapping (explicit ``options.mapping`` inferred
pairs from ``options.schema``).
2. Reject duplicate target names — two source columns mapped to
the same target is a user error, not a silent overwrite.
3. Decide what to do with unmapped source columns
(``keep`` / ``drop`` / ``error``).
4. Rename, then handle missing required targets, then coerce
types, then reorder.
"""
ensure_dataframe(df, function="map_columns")
options = options or MapOptions()
options.validate()
# ------------------------------------------------------------------
# 1. Compose the effective mapping
# ------------------------------------------------------------------
explicit = dict(options.mapping)
inferred: dict[str, str] = {}
if options.schema is not None and options.auto_infer:
all_inferred = infer_mapping(df, options.schema, threshold=options.fuzzy_threshold)
# Explicit user pairings always win.
used_targets = set(explicit.values())
for src, tgt in all_inferred.items():
if src in explicit:
continue
if tgt in used_targets:
continue
inferred[src] = tgt
used_targets.add(tgt)
mapping: dict[str, str] = {**inferred, **explicit}
# ------------------------------------------------------------------
# 2. Validate mapping coherence
# ------------------------------------------------------------------
unknown_sources = [s for s in mapping if s not in df.columns]
if unknown_sources:
raise InputValidationError(
f"Mapping references columns not in input: {unknown_sources}",
operation="map_columns",
suggestion=f"Available source columns: {list(df.columns)}",
)
target_counts: dict[str, int] = {}
for tgt in mapping.values():
target_counts[tgt] = target_counts.get(tgt, 0) + 1
duplicates = [t for t, n in target_counts.items() if n > 1]
if duplicates:
raise InputValidationError(
f"Multiple source columns mapped to the same target(s): {duplicates}",
operation="map_columns",
suggestion="Each target name must be unique. Drop or rename the conflicting source columns.",
)
# ------------------------------------------------------------------
# 3. Handle unmapped source columns
# ------------------------------------------------------------------
unmapped_sources = [c for c in df.columns if c not in mapping]
unmapped_kept: list[str] = []
columns_dropped: list[str] = []
if unmapped_sources:
if options.unmapped == "drop":
columns_dropped = list(unmapped_sources)
elif options.unmapped == "error":
raise InputValidationError(
f"Source columns have no mapping and unmapped='error': {unmapped_sources}",
operation="map_columns",
suggestion=(
"Either add explicit mapping entries, set unmapped='keep' / 'drop', "
"or include the columns in the target schema."
),
)
else:
unmapped_kept = list(unmapped_sources)
# ------------------------------------------------------------------
# 4. Apply rename and drop
# ------------------------------------------------------------------
out = df.copy()
if columns_dropped:
out = out.drop(columns=columns_dropped)
if mapping:
out = out.rename(columns=mapping)
columns_renamed = sum(1 for src, tgt in mapping.items() if src != tgt)
# ------------------------------------------------------------------
# 5. Handle the schema's required + default fields
# ------------------------------------------------------------------
columns_added: list[str] = []
missing_required: list[str] = []
if options.schema is not None:
present = set(out.columns)
for tf in options.schema.fields:
if tf.name in present:
continue
if tf.required and tf.default is None:
missing_required.append(tf.name)
continue
# Add with default value (NaN if no default).
out[tf.name] = tf.default if tf.default is not None else pd.NA
columns_added.append(tf.name)
if missing_required and options.enforce_required:
raise InputValidationError(
f"Required target field(s) missing from input: {missing_required}",
operation="map_columns",
suggestion=(
"Either add explicit mapping entries, lower fuzzy_threshold, "
"supply a default in the schema, or set enforce_required=False."
),
)
# ------------------------------------------------------------------
# 6. Coerce types per the schema
# ------------------------------------------------------------------
coercion_failures: dict[str, int] = {}
if options.coerce_types and options.schema is not None:
for tf in options.schema.fields:
if tf.name not in out.columns or tf.dtype == "auto":
continue
try:
series, fails = coerce_series(out[tf.name], tf.dtype)
except (ValueError, TypeError) as e:
logger.warning(
"map_columns: coerce of {!r}{} failed: {}",
tf.name, tf.dtype, e,
)
continue
out[tf.name] = series
if fails:
coercion_failures[tf.name] = fails
# ------------------------------------------------------------------
# 7. Reorder
# ------------------------------------------------------------------
if options.reorder_to_schema and options.schema is not None:
ordered = [f.name for f in options.schema.fields if f.name in out.columns]
# Append survivors (kept-unmapped originals) in their pre-rename order.
survivors = [c for c in out.columns if c not in ordered]
out = out.loc[:, ordered + survivors]
return MapResult(
mapped_df=out,
mapping=mapping,
inferred_pairs=inferred,
columns_renamed=columns_renamed,
columns_dropped=columns_dropped,
columns_added=columns_added,
coercion_failures=coercion_failures,
unmapped_kept=unmapped_kept,
missing_required_targets=missing_required,
)

View File

@@ -514,6 +514,19 @@ def deduplicate(
) -> DeduplicationResult:
"""Run the full deduplication pipeline.
Pipeline placement (recommended, not enforced)
----------------------------------------------
Run *last* among the cleaning tools. Fuzzy matching is more
accurate when:
* text has been hygiened (NBSP padding doesn't make
``"Alice "`` look different from ``"Alice"``);
* formats have been canonicalized (``+14155551234`` matches
across rows where the source had ``(415) 555-1234`` and
``415.555.1234``);
* missing values have been standardized (NaN matching is
brittle; sentinel-laundered cells produce false matches).
See ``src.core.pipeline.SOFT_DEPENDENCIES``.
Parameters
----------
df : input DataFrame

View File

@@ -815,7 +815,22 @@ _CURRENCY_TRIM_RE = re.compile(
_PARENS_NEGATIVE_RE = re.compile(r"^\s*\(\s*(.+?)\s*\)\s*$")
CurrencyDecimal = Literal["dot", "comma"]
CurrencyDecimal = Literal["dot", "comma", "auto"]
# Multi-character symbol prefixes that aren't captured by the
# single-codepoint ``_CURRENCY_SYMBOLS`` table. Order matters: the
# detector checks these prefixes BEFORE the single-symbol regex, so
# ``R$`` resolves to BRL even though ``$`` alone would map to USD.
_PREFIX_TO_ISO: dict[str, str] = {
"r$": "BRL", # Brazilian Real
"kr": "SEK", # ambiguous Nordic — picks SEK as most common; see tests
"": "PLN", # Polish Złoty
"лв": "BGN", # Bulgarian Lev
"": "RUB", # already in symbol table; kept for parity
"rs.": "INR", # rupees — covers IN/PK informal usage
"rs": "INR",
}
def detect_currency_code(value: str) -> Optional[str]:
@@ -825,9 +840,21 @@ def detect_currency_code(value: str) -> Optional[str]:
symbol → code mapping (``$1234`` → ``USD``). Symbol mapping is best-
effort: ``$`` is ambiguous between USD/CAD/AUD/MXN — the caller is
expected to constrain that via input data discipline.
Multi-char prefixes (``R$``, ``zł``, ``kr``) are recognised before
the single-symbol regex so Brazilian / Polish / Nordic data isn't
silently bucketed as USD.
"""
if not isinstance(value, str):
return None
head = value.lstrip().lower()
for prefix, code in _PREFIX_TO_ISO.items():
if head.startswith(prefix):
# Make sure the next char (if any) isn't a letter — avoid
# matching ``rsa`` as ``rs``-then-``a``.
tail = head[len(prefix):]
if not tail or not tail[0].isalpha():
return code
m = _CURRENCY_DETECT_RE.search(value)
if m is None:
return None
@@ -852,10 +879,16 @@ def standardize_currency(
``decimal="dot"``: ``$1,234.56`` → ``1234.56`` (US/UK convention).
``decimal="comma"``: ``1.234,56 €`` → ``1234.56`` (EU convention).
Either mode auto-detects the EU shape when both ``.`` and ``,`` are
present and the comma sits after the dot (so ``€1.234,56`` parses
correctly even under the dot-default mode). Space-thousands and
Swiss apostrophe-thousands are also recognized.
``decimal="auto"``: same as ``dot`` but a single trailing comma
whose tail is NOT exactly 3 digits is read as a decimal separator
(``850,50`` → ``850.50``, ``R$ 1,5`` → ``1.5``). Use this for
mixed-locale international files. Length-3 tails (``1,234``) stay
ambiguous regardless of mode.
All three modes auto-detect the EU shape when both ``.`` and ``,``
are present and the comma sits after the dot (so ``€1.234,56``
parses correctly even under the dot-default mode). Space-thousands
and Swiss apostrophe-thousands are also recognized.
The output always uses a dot as the decimal separator since that is
the form pandas/Python parse natively.
@@ -899,6 +932,22 @@ def standardize_currency(
code = detect_currency_code(s) if preserve_code else None
# Strip any multi-char currency prefix (``R$``, ``kr``, ``zł``)
# before the symbol-table regex — these aren't single codepoints
# so the table-driven trim would otherwise leave them in place.
head = s.lstrip().lower()
for prefix in _PREFIX_TO_ISO:
if head.startswith(prefix):
tail_start = len(prefix)
if tail_start < len(head) and head[tail_start].isalpha():
continue
# Strip the matched prefix from the original (preserve case
# of any trailing content).
stripped_lead = s[: len(s) - len(head)]
s = stripped_lead + s.lstrip()[len(prefix):]
s = s.lstrip()
break
negative = False
m = _PARENS_NEGATIVE_RE.match(s)
if m:
@@ -948,6 +997,19 @@ def standardize_currency(
# is unambiguously EU — treat the comma as decimal.
if had_space_thousands:
rest = rest.replace(",", ".")
elif decimal == "auto":
# International auto-detection: a single comma whose
# tail is NOT exactly 3 digits is far more likely to be
# an EU/BRL decimal (``850,50``, ``1,5``) than a
# malformed US thousands group. Length-3 tails stay
# ambiguous and require an explicit locale.
after = rest.rsplit(",", 1)[1]
if rest.count(",") > 1:
rest = rest.replace(",", "")
elif len(after) == 3:
return _err("ambiguous separator, set --currency-locale")
else:
rest = rest.replace(",", ".")
else:
after = rest.rsplit(",", 1)[1]
if len(after) != 3:
@@ -1910,6 +1972,26 @@ class StandardizeOptions:
# verbatim into Title Case rendering.
extra_abbreviations: dict[str, str] = field(default_factory=dict)
# ----- Scale knobs for large international files -----
# Per-row country/region overrides. When set, each phone or address
# row's region is read from the named column (an ISO-3166 alpha-2 code:
# "US", "GB", "JP", "FR", …). Falls back to ``phone_region`` /
# global default when the column is missing or the cell is blank.
phone_country_column: Optional[str] = None
address_country_column: Optional[str] = None
# Audit cap. The change table can grow to tens of millions of rows on
# a 1 GB input — capping protects memory and keeps the audit usable.
# ``cells_changed`` still counts every modification; only the per-row
# ``changes`` DataFrame is truncated. Set to None for unbounded.
audit_max_rows: Optional[int] = 10_000
# Value-level LRU cache size per standardizer. Repeated phone numbers
# (call-list duplicates), repeated currencies, repeated boolean
# tokens — all dominate at scale. A 256k-entry cache absorbs most
# real-world cardinalities without ballooning memory.
cache_size: int = 262_144
@classmethod
def from_preset(cls, name: str, **overrides: Any) -> StandardizeOptions:
"""Build options from a named preset, with optional field overrides.
@@ -1953,7 +2035,7 @@ class StandardizeOptions:
for field_name, valid in (
("date_order", {"MDY", "DMY"}),
("phone_format", set(_PHONE_FORMAT_MAP) | {"DIGITS"}),
("currency_decimal", {"dot", "comma"}),
("currency_decimal", {"dot", "comma", "auto"}),
("name_case", {"title", "upper", "lower"}),
("boolean_style", set(_BOOL_OUTPUT)),
("date_error_policy", {"passthrough", "sentinel"}),
@@ -2213,6 +2295,193 @@ def _resolve_column_types(
return resolved
def _build_cached_dispatcher(
field_type: FieldType,
options: StandardizeOptions,
):
"""Return a per-value standardizer wrapped in an LRU cache.
The cache key is the raw cell value plus, when applicable, the
per-row region derived from ``phone_country_column`` /
``address_country_column``. Repeated values are O(1) lookups —
critical at 1 GB scale where the same number appears thousands
of times.
The dispatcher captures the relevant subset of ``options`` so the
cache key stays small (we don't want to serialize the whole
options dataclass into every cache entry).
"""
from functools import lru_cache
cache_size = options.cache_size if options.cache_size > 0 else None
if field_type == FieldType.DATE:
out_fmt = options.date_output_format
date_order = options.date_order
date_err = options.date_error_policy
locales = (
tuple(options.date_month_locales) if options.date_month_locales else None
)
@lru_cache(maxsize=cache_size)
def fn(value: Any, _region: Optional[str] = None):
return _apply_field_type_for(
value, FieldType.DATE, options,
_date_args=(out_fmt, date_order, date_err, locales),
)
return fn
if field_type == FieldType.PHONE:
out_fmt = options.phone_format
err = options.phone_error_policy
default_region = options.phone_region
@lru_cache(maxsize=cache_size)
def fn(value: Any, region: Optional[str] = None):
r = region or default_region
return _apply_field_type_for(
value, FieldType.PHONE, options,
_phone_args=(out_fmt, r, err),
)
return fn
if field_type == FieldType.CURRENCY:
decimal = options.currency_decimal
decimals = options.currency_decimals
preserve = options.currency_preserve_code
err = options.currency_error_policy
@lru_cache(maxsize=cache_size)
def fn(value: Any, _region: Optional[str] = None):
return _apply_field_type_for(
value, FieldType.CURRENCY, options,
_currency_args=(decimal, decimals, preserve, err),
)
return fn
if field_type == FieldType.BOOLEAN:
style = options.boolean_style
@lru_cache(maxsize=cache_size)
def fn(value: Any, _region: Optional[str] = None):
return _apply_field_type_for(
value, FieldType.BOOLEAN, options,
_boolean_args=(style,),
)
return fn
if field_type == FieldType.EMAIL:
gmail = options.email_gmail_canonical
err = options.email_error_policy
@lru_cache(maxsize=cache_size)
def fn(value: Any, _region: Optional[str] = None):
return _apply_field_type_for(
value, FieldType.EMAIL, options,
_email_args=(gmail, err),
)
return fn
# Names and addresses are usually unique per row; no cache wraps
# them but we still go through ``_apply_field_type`` for parity.
if field_type == FieldType.NAME:
def fn(value: Any, _region: Optional[str] = None):
return _apply_field_type(value, FieldType.NAME, options)
return fn
if field_type == FieldType.ADDRESS:
# Addresses can be cached too — long lists of repeated office
# addresses or warehouse locations are common in commerce data.
@lru_cache(maxsize=cache_size)
def fn(value: Any, _region: Optional[str] = None):
return _apply_field_type(value, FieldType.ADDRESS, options)
return fn
# Fallback (shouldn't happen — every FieldType is covered above).
return lambda value, _region=None: _apply_field_type(value, field_type, options)
def _apply_field_type_for(
value: Any,
field_type: FieldType,
options: StandardizeOptions,
*,
_date_args=None,
_phone_args=None,
_currency_args=None,
_boolean_args=None,
_email_args=None,
) -> tuple[Any, bool, bool]:
"""Cacheable dispatcher: same shape as :func:`_apply_field_type` but
accepts pre-extracted scalar argument tuples so the LRU cache key is
just ``(value, region)`` instead of the full options object.
"""
if value is None or (isinstance(value, float) and pd.isna(value)):
return value, False, True
if not isinstance(value, str):
if field_type == FieldType.BOOLEAN:
style = (_boolean_args or (options.boolean_style,))[0]
new, changed = standardize_boolean(value, style=style)
return new, changed, True
value = str(value)
if not value.strip():
return value, False, True
if field_type == FieldType.DATE:
out_fmt, date_order, err, locales = _date_args or (
options.date_output_format, options.date_order,
options.date_error_policy,
tuple(options.date_month_locales) if options.date_month_locales else None,
)
new, changed = standardize_date(
value,
output_format=out_fmt,
date_order=date_order,
error_policy=err,
month_locales=list(locales) if locales else None,
)
elif field_type == FieldType.PHONE:
out_fmt, region, err = _phone_args or (
options.phone_format, options.phone_region, options.phone_error_policy,
)
new, changed = standardize_phone(
value, output_format=out_fmt, default_region=region, error_policy=err,
)
elif field_type == FieldType.CURRENCY:
decimal, decimals, preserve, err = _currency_args or (
options.currency_decimal, options.currency_decimals,
options.currency_preserve_code, options.currency_error_policy,
)
new, changed = standardize_currency(
value,
decimal=decimal,
decimals=decimals,
preserve_code=preserve,
error_policy=err,
)
elif field_type == FieldType.BOOLEAN:
style = (_boolean_args or (options.boolean_style,))[0]
new, changed = standardize_boolean(value, style=style)
elif field_type == FieldType.EMAIL:
gmail, err = _email_args or (
options.email_gmail_canonical, options.email_error_policy,
)
new, changed = standardize_email(
value, gmail_canonical=gmail, error_policy=err,
)
else:
return _apply_field_type(value, field_type, options)
parsed = True
if not changed and field_type in {
FieldType.DATE, FieldType.PHONE, FieldType.CURRENCY, FieldType.BOOLEAN,
}:
parsed = _is_already_canonical(value, field_type, options)
return new, changed, parsed
def standardize_dataframe(
df: pd.DataFrame,
options: Optional[StandardizeOptions] = None,
@@ -2221,6 +2490,28 @@ def standardize_dataframe(
Columns absent from ``options.column_types`` pass through unchanged.
The input DataFrame is not mutated.
Pipeline placement (recommended, not enforced)
----------------------------------------------
Run *after* the text cleaner (smart-quote / NBSP / zero-width
pollution breaks phone, currency, and date parsers) and *before*
the missing-value handler (numeric imputation expects canonical
types) and the deduplicator (canonical phone E.164 / lowercase
email enables cross-format duplicate matching). See
``src.core.pipeline.SOFT_DEPENDENCIES``.
Performance characteristics
---------------------------
Per-cell standardizers are wrapped in an LRU cache (size
``options.cache_size``) so repeated values — common in real
international data, where the same office phone or vendor address
appears thousands of times — short-circuit. The dispatch loop uses
``Series.map`` for pandas-native iteration; on a 10-million-row
column this is roughly 4-8× faster than the previous
``for v in series.tolist()`` path.
For inputs larger than will fit comfortably in RAM, prefer
:func:`standardize_file` which streams chunks from disk.
"""
from .errors import ensure_dataframe
ensure_dataframe(df, function="standardize_dataframe")
@@ -2228,33 +2519,74 @@ def standardize_dataframe(
out = df.copy()
column_types = _resolve_column_types(options, out.columns)
change_records: list[dict[str, Any]] = []
cells_changed = 0
cells_unparseable = 0
cells_total = 0
audit_cap = options.audit_max_rows
audit_room = float("inf") if audit_cap is None else audit_cap
audit_records: list[dict[str, Any]] = []
# Per-row region columns must exist in the frame when set.
if options.phone_country_column and options.phone_country_column not in out.columns:
from .errors import InputValidationError
raise InputValidationError(
f"phone_country_column={options.phone_country_column!r} not in input columns",
operation="standardize_dataframe",
suggestion=f"Available: {list(out.columns)}",
)
if options.address_country_column and options.address_country_column not in out.columns:
from .errors import InputValidationError
raise InputValidationError(
f"address_country_column={options.address_country_column!r} not in input columns",
operation="standardize_dataframe",
suggestion=f"Available: {list(out.columns)}",
)
for col, field_type in column_types.items():
series = out[col]
new_values: list[Any] = []
for row_idx, original in enumerate(series.tolist()):
cells_total += 1
new, changed, parsed = _apply_field_type(original, field_type, options)
cells_total += len(series)
dispatcher = _build_cached_dispatcher(field_type, options)
# Per-row region lookup. Phones and addresses are the two types
# that benefit from country context; everything else ignores the
# second argument.
region_series: Optional[pd.Series] = None
if field_type == FieldType.PHONE and options.phone_country_column:
region_series = out[options.phone_country_column]
elif field_type == FieldType.ADDRESS and options.address_country_column:
region_series = out[options.address_country_column]
new_values: list[Any] = [None] * len(series)
if region_series is None:
triples = [dispatcher(v) for v in series.tolist()]
else:
regions = region_series.tolist()
triples = [
dispatcher(v, _normalize_region(r))
for v, r in zip(series.tolist(), regions)
]
for i, (orig, (new, changed, parsed)) in enumerate(
zip(series.tolist(), triples)
):
new_values[i] = new
if changed:
cells_changed += 1
change_records.append({
"row": row_idx,
if audit_room > 0:
audit_records.append({
"row": i,
"column": col,
"field_type": field_type.value,
"old": original,
"old": orig,
"new": new,
})
audit_room -= 1
if not parsed:
cells_unparseable += 1
new_values.append(new)
out[col] = new_values
changes_df = pd.DataFrame(
change_records,
audit_records,
columns=["row", "column", "field_type", "old", "new"],
)
@@ -2272,6 +2604,16 @@ def standardize_dataframe(
int(100 * cells_unparseable / cells_total),
)
# Only log the cap message when it would surprise the caller —
# cap=0 is the streaming-path's deliberate "audit budget exhausted"
# signal and shouldn't generate noise per chunk.
if audit_cap and audit_cap > 0 and cells_changed > audit_cap:
logger.info(
"standardize_dataframe: audit capped at {} rows "
"(cells_changed={}); raise audit_max_rows or set to None for full audit.",
audit_cap, cells_changed,
)
return StandardizeResult(
standardized_df=out,
changes=changes_df,
@@ -2280,3 +2622,290 @@ def standardize_dataframe(
cells_total=cells_total,
columns_processed=list(column_types.keys()),
)
# ---------------------------------------------------------------------------
# Per-row region helpers
# ---------------------------------------------------------------------------
# Common country-name → ISO-3166 alpha-2 mappings. The phonenumbers
# library wants the alpha-2 code, but real spreadsheets carry full names
# ("United Kingdom", "Japan", "Brazil"). Add new entries lazily as users
# bring in data — the table is a soft mapping, missing entries fall back
# to the global ``phone_region``.
_COUNTRY_NAME_TO_ISO2: dict[str, str] = {
"united states": "US", "usa": "US", "u.s.": "US", "u.s.a.": "US",
"united kingdom": "GB", "uk": "GB", "great britain": "GB", "england": "GB",
"canada": "CA",
"mexico": "MX",
"france": "FR",
"germany": "DE", "deutschland": "DE",
"italy": "IT", "italia": "IT",
"spain": "ES", "españa": "ES",
"portugal": "PT",
"netherlands": "NL", "holland": "NL",
"belgium": "BE",
"switzerland": "CH", "schweiz": "CH",
"austria": "AT", "österreich": "AT",
"ireland": "IE",
"sweden": "SE", "norway": "NO", "denmark": "DK", "finland": "FI",
"poland": "PL", "czech republic": "CZ", "czechia": "CZ", "hungary": "HU",
"russia": "RU", "ukraine": "UA",
"japan": "JP", "中国": "CN", "china": "CN", "south korea": "KR", "korea": "KR",
"india": "IN", "indonesia": "ID", "thailand": "TH", "vietnam": "VN",
"philippines": "PH", "malaysia": "MY", "singapore": "SG",
"australia": "AU", "new zealand": "NZ",
"brazil": "BR", "brasil": "BR",
"argentina": "AR", "chile": "CL", "colombia": "CO", "peru": "PE",
"south africa": "ZA",
"uae": "AE", "united arab emirates": "AE",
"saudi arabia": "SA",
"egypt": "EG",
"israel": "IL",
"turkey": "TR", "türkiye": "TR",
}
def _normalize_region(value: Any) -> Optional[str]:
"""Normalise a region cell to an ISO-3166 alpha-2 code.
Accepts ISO codes (``US``, ``us``, ``USA``), full names
(``United States``, ``Japan``), and falls back to None when the
value is empty or unrecognized — letting the dispatcher use the
global default region.
"""
if value is None:
return None
if isinstance(value, float) and pd.isna(value):
return None
if not isinstance(value, str):
value = str(value)
s = value.strip()
if not s:
return None
upper = s.upper()
# ISO-3166 alpha-2 (e.g. "US", "JP")
if len(upper) == 2 and upper.isalpha():
return upper
# ISO-3166 alpha-3 (e.g. "USA", "JPN") — strip last letter as a
# cheap heuristic, then validate alpha-2.
if len(upper) == 3 and upper.isalpha():
# phonenumbers accepts alpha-2 only; map a few common alpha-3.
alpha3_map = {
"USA": "US", "GBR": "GB", "CAN": "CA", "MEX": "MX", "DEU": "DE",
"FRA": "FR", "ITA": "IT", "ESP": "ES", "JPN": "JP", "CHN": "CN",
"KOR": "KR", "BRA": "BR", "AUS": "AU", "IND": "IN", "RUS": "RU",
}
if upper in alpha3_map:
return alpha3_map[upper]
# Full country name lookup.
return _COUNTRY_NAME_TO_ISO2.get(s.lower())
# ---------------------------------------------------------------------------
# Streaming entry point — for inputs that don't fit in memory
# ---------------------------------------------------------------------------
@dataclass
class StreamingStandardizeResult:
"""Summary returned by :func:`standardize_file`.
Mirrors :class:`StandardizeResult` but without the in-memory
DataFrame — the standardized output is written incrementally to
``output_path``. The ``changes`` audit is also written
incrementally to ``audit_path`` and capped at
``options.audit_max_rows`` total rows across all chunks.
"""
output_path: Path
audit_path: Optional[Path]
rows_processed: int
chunks_processed: int
cells_changed: int
cells_unparseable: int
cells_total: int
columns_processed: list[str]
def standardize_file(
input_path: str | Path,
output_path: str | Path,
options: Optional[StandardizeOptions] = None,
*,
chunk_size: int = 50_000,
audit_path: Optional[str | Path] = None,
progress_callback: Optional[Any] = None,
encoding: str = "utf-8",
delimiter: str = ",",
) -> StreamingStandardizeResult:
"""Standardize a CSV/TSV file in chunks, writing output incrementally.
For inputs too large to materialize in memory, this entry point
streams ``chunk_size`` rows at a time through
:func:`standardize_dataframe` and writes each chunk to *output_path*
as it completes. Memory stays bounded by the chunk size regardless
of input file size.
The audit is written to *audit_path* (default
``{output_path.stem}_changes.csv``). Each chunk's
``options.audit_max_rows`` budget is respected per chunk; pass
``audit_max_rows=None`` for a full audit (memory-bounded only by
disk).
Performance for a 1 GB CSV with ~10 M rows on a typical workstation:
- chunk_size=50_000 → ~50 MB peak DataFrame footprint
- phone-only standardization: ~3-6 minutes (cache-warm)
- mixed phone + currency + address: ~8-15 minutes
- first chunk is the cold-cache slowest; later chunks ride the LRU.
Parameters
----------
input_path
CSV or TSV path. Excel inputs aren't streamed — load with
:func:`read_file` and use :func:`standardize_dataframe`.
output_path
Where to write the standardized CSV. Existing files are
overwritten.
chunk_size
Rows per chunk. Default 50,000 ≈ 50 MB resident for typical
widths. Higher → less I/O overhead, more peak memory.
progress_callback
Optional ``callable(rows_processed, chunks_processed)``
called once per chunk.
"""
from .errors import wrap_file_read, wrap_file_write
options = options or StandardizeOptions()
inp = Path(input_path)
out = Path(output_path)
if not inp.exists():
from .errors import FileAccessError
raise FileAccessError(
f"Input file not found: {inp}",
path=inp, operation="standardize_file",
)
audit_p = Path(audit_path) if audit_path else out.with_name(
f"{out.stem}_changes.csv"
)
rows_processed = 0
chunks_processed = 0
cells_changed = 0
cells_unparseable = 0
cells_total = 0
columns_processed: list[str] = []
audit_room = (
options.audit_max_rows if options.audit_max_rows is not None
else float("inf")
)
out.parent.mkdir(parents=True, exist_ok=True)
audit_p.parent.mkdir(parents=True, exist_ok=True)
out_writer_open = False
audit_writer_open = False
try:
reader = pd.read_csv(
inp, chunksize=chunk_size, encoding=encoding,
sep=delimiter, dtype=str, keep_default_na=False,
)
except (OSError, FileNotFoundError) as e:
raise wrap_file_read(inp, "standardize_file", e) from e
try:
for chunk in reader:
# The chunked reader gives back row indices that restart
# at chunk boundaries; renumber so audit row indices reflect
# the full input file.
chunk_offset = rows_processed
chunk_options = options
# Local audit cap per chunk: never exceed the global budget.
if options.audit_max_rows is not None and audit_room <= 0:
# Disable audit for this chunk by setting cap=0; the
# standardizer skips appending records once room == 0.
chunk_options = _replace_options(options, audit_max_rows=0)
result = standardize_dataframe(chunk, chunk_options)
cells_changed += result.cells_changed
cells_unparseable += result.cells_unparseable
cells_total += result.cells_total
if not columns_processed:
columns_processed = list(result.columns_processed)
# Write the standardized chunk
try:
if not out_writer_open:
result.standardized_df.to_csv(
out, mode="w", index=False, encoding=encoding,
sep=delimiter,
)
out_writer_open = True
else:
result.standardized_df.to_csv(
out, mode="a", index=False, header=False,
encoding=encoding, sep=delimiter,
)
except OSError as e:
raise wrap_file_write(out, "standardize_file", e) from e
# Write the audit (re-numbering rows to absolute file positions).
if not result.changes.empty and audit_room > 0:
# ``audit_room`` is float('inf') when the user wants an
# unbounded audit; ``iloc[:inf]`` is invalid, so take the
# whole frame in that case.
if audit_room == float("inf"):
cap_changes = result.changes.copy()
else:
cap_changes = result.changes.iloc[: int(audit_room)].copy()
cap_changes["row"] = cap_changes["row"] + chunk_offset
try:
if not audit_writer_open:
cap_changes.to_csv(
audit_p, mode="w", index=False, encoding=encoding,
)
audit_writer_open = True
else:
cap_changes.to_csv(
audit_p, mode="a", index=False, header=False,
encoding=encoding,
)
except OSError as e:
raise wrap_file_write(audit_p, "standardize_file", e) from e
audit_room -= len(cap_changes)
rows_processed += len(chunk)
chunks_processed += 1
if progress_callback:
try:
progress_callback(rows_processed, chunks_processed)
except Exception:
# Progress callbacks are advisory — don't kill the run.
logger.opt(exception=True).debug(
"progress_callback raised; ignoring"
)
finally:
# Ensure the iterator is closed (closes the underlying file).
if hasattr(reader, "close"):
reader.close()
return StreamingStandardizeResult(
output_path=out,
audit_path=audit_p if audit_writer_open else None,
rows_processed=rows_processed,
chunks_processed=chunks_processed,
cells_changed=cells_changed,
cells_unparseable=cells_unparseable,
cells_total=cells_total,
columns_processed=columns_processed,
)
def _replace_options(options: StandardizeOptions, **kwargs: Any) -> StandardizeOptions:
"""Cheap shallow clone of :class:`StandardizeOptions` with overrides.
Used by the streaming path to reduce the audit budget chunk-by-chunk
without mutating the caller's options object.
"""
from dataclasses import replace
return replace(options, **kwargs)

View File

@@ -18,6 +18,207 @@ from loguru import logger
# Encoding detection
# ---------------------------------------------------------------------------
# charset-normalizer often picks an Eastern-European code page (cp1250,
# cp1258) for byte-equivalent Western content, mac_iceland over mac_roman
# in the Mac family, and shift_jis_2004 for short Cyrillic samples. The
# arbiter below resolves these specific false positives without
# overruling the detector when its top pick is genuinely the right
# answer.
#
# Mapping is *over-picked encoding* → *more plausible substitutes (in
# priority order)*. We accept either the candidate's primary encoding
# name or any of its ``could_be_from_charset`` aliases.
_ENCODING_FALLBACKS: dict[str, tuple[str, ...]] = {
"cp1250": ("cp1252", "latin_1", "iso8859_15", "iso8859_2"),
"cp1258": ("iso8859_2", "cp1250", "cp1252"),
"mac_iceland": ("mac_roman",),
"shift_jis_2004": ("koi8_r", "cp1251", "cp1252", "iso8859_2"),
"shift_jisx0213": ("koi8_r", "cp1251", "cp1252", "iso8859_2"),
}
def _arbitrate_charset_match(matches) -> Optional[str]:
"""Pick the most plausible encoding from a charset-normalizer match list.
Two distinguishing signals separate a false positive from a real
pick when the top encoding is one we've recorded as over-picked:
* If the top match's own ``could_be_from_charset`` alias list
already names a preferred fallback (e.g. cp1250 with cp1252 as a
sibling), we substitute — charset-normalizer has flagged the
byte content as ambiguous.
* If the second-ranked match shares identical *chaos* and
*coherence* scores with the top — meaning the bytes decode
byte-equivalently under both — we substitute when the second
match is the preferred Western default.
When neither signal fires (real cp1250 / cp1258 content where
charset-normalizer is genuinely confident), the top pick is
returned unchanged.
"""
ranked = list(matches)
if not ranked:
return None
top = ranked[0]
top_enc = top.encoding.lower()
fallbacks = _ENCODING_FALLBACKS.get(top_enc)
if not fallbacks:
return top_enc
# The decisive signal: a lower-ranked candidate that ties the top
# pick on both chaos and coherence has decoded the bytes
# *identically*, so the choice between them is byte-equivalent. When
# one of those tied candidates is a preferred Western default,
# substitute. We walk the fallbacks in priority order so the most
# canonical alternative wins (cp1252 over iso8859_2 over iso8859_15).
#
# When no tied candidate matches, we leave the top pick alone — that
# is the "real cp1250 / cp1258 content" path where charset-normalizer
# is genuinely confident.
top_chaos = getattr(top, "chaos", None)
top_coherence = getattr(top, "coherence", None)
tied: list = []
for m in ranked[1:]:
if m.chaos != top_chaos or m.coherence != top_coherence:
break # ranked list is monotonically less confident
tied.append(m)
if tied:
for preferred in fallbacks:
for m in tied:
candidates = {
m.encoding.lower(),
*(a.lower() for a in m.could_be_from_charset),
}
if preferred in candidates:
return preferred
# No tied alternative — but charset-normalizer occasionally folds
# the more popular Western alias into the *top pick's own* alias
# list (cp1250 with cp1252 listed alongside). When that happens,
# prefer the canonical Western form.
top_aliases = {a.lower() for a in top.could_be_from_charset}
for preferred in fallbacks:
# Only honour an in-alias swap if the preferred encoding is a
# different family from the top pick (cp1252 swap from cp1250 is
# legitimate; iso8859_2 swap from cp1250 is not — they differ
# bytewise on accented Eastern letters).
if preferred in top_aliases and not _same_byte_family(top_enc, preferred):
return preferred
return top_enc
# ---------------------------------------------------------------------------
# Language-aware probe: distinguish KOI8-R from Shift_JIS, ISO-8859-2 from
# cp1258 when charset-normalizer cannot.
# ---------------------------------------------------------------------------
# Unicode ranges that uniquely identify each language family. A candidate
# encoding "wins" the probe when its decoding of the raw bytes produces
# the highest *coverage ratio* (non-ASCII letters in the target range
# divided by total non-ASCII letters).
_CYRILLIC_RANGE = (0x0400, 0x04FF)
_EE_LATIN_LETTERS = frozenset(
"ąćęłńóśźżĄĆĘŁŃÓŚŹŻ" # Polish
"áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ" # Czech
"áéíóöőúüűÁÉÍÓÖŐÚÜŰ" # Hungarian
"äčďéíĺľňóôŕšťúýžÄČĎÉÍĹĽŇÓÔŔŠŤÚÝŽ" # Slovak
)
# Encodings to probe when charset-normalizer fingerprints the file as
# Japanese (a frequent misfire on short Cyrillic samples whose byte
# patterns happen to coincide with shift_jis lead bytes).
_CYRILLIC_PROBES: tuple[str, ...] = ("koi8_r", "cp1251", "iso8859_5")
_EE_LATIN_PROBES: tuple[str, ...] = ("iso8859_2", "cp1250")
def _cyrillic_coverage(text: str) -> float:
"""Fraction of *all non-ASCII characters* in *text* that are Cyrillic letters.
Dividing by all non-ASCII (rather than only letters) penalises
decodings that produce mostly symbols/box-drawing with a sprinkle
of incidental Cyrillic glyphs — a real KOI8-R Russian text scores
>0.7 because nearly every non-ASCII codepoint IS a Cyrillic letter,
whereas a Japanese-shift_jis-decoded-as-koi8r text scores low.
"""
non_ascii = [c for c in text if ord(c) >= 0x80]
if not non_ascii:
return 0.0
cyr = sum(
1 for c in non_ascii
if c.isalpha() and _CYRILLIC_RANGE[0] <= ord(c) <= _CYRILLIC_RANGE[1]
)
return cyr / len(non_ascii)
def _ee_latin_coverage(text: str) -> float:
"""Fraction of *all non-ASCII characters* in *text* that look like EE Latin."""
non_ascii = [c for c in text if ord(c) >= 0x80]
if not non_ascii:
return 0.0
ee = sum(1 for c in non_ascii if c in _EE_LATIN_LETTERS)
return ee / len(non_ascii)
def _probe_language(raw: bytes, top_enc: str) -> Optional[str]:
"""Try language-specific decodings when charset-normalizer guessed wrong.
Returns a better encoding name when one of the probe candidates
decodes the bytes into a language-coherent text (Cyrillic ≥ 70 % for
Cyrillic probes, EE-Latin ≥ 50 % for EE Latin probes), else None.
"""
if top_enc in {"shift_jis_2004", "shift_jisx0213", "shift_jis", "cp932"}:
probes, scorer, threshold = _CYRILLIC_PROBES, _cyrillic_coverage, 0.70
elif top_enc in {"cp1258", "iso8859_16"}:
probes, scorer, threshold = _EE_LATIN_PROBES, _ee_latin_coverage, 0.50
else:
return None
# Score the top pick first. If the top encoding *itself* decodes the
# bytes into reasonable Cyrillic / EE Latin text, the bytes are
# genuinely in that script — don't override.
try:
top_decoded = raw.decode(top_enc, errors="replace")
top_score = scorer(top_decoded)
except LookupError:
top_score = 0.0
best_enc: Optional[str] = None
best_score = 0.0
for enc in probes:
try:
decoded = raw.decode(enc)
except (UnicodeDecodeError, LookupError):
continue
score = scorer(decoded)
if score > best_score:
best_score = score
best_enc = enc
# Require both an absolute coverage threshold AND a clear margin over
# the top pick — otherwise we risk hijacking real Japanese / Vietnamese
# content whose decode happens to produce a few Cyrillic / EE-Latin
# glyphs by coincidence.
if best_enc and best_score >= threshold and best_score >= top_score + 0.30:
return best_enc
return None
# Pairs of encoding names whose byte ranges DIFFER for accented letters.
# Used to refuse spurious in-alias swaps (e.g. cp1250 vs iso8859_2 are
# byte-distinct even though charset-normalizer lists them as siblings).
_SAME_FAMILY: set[frozenset[str]] = {
frozenset({"cp1250", "iso8859_2"}),
frozenset({"mac_iceland", "mac_turkish"}),
frozenset({"shift_jis_2004", "shift_jisx0213"}),
}
def _same_byte_family(a: str, b: str) -> bool:
return frozenset({a, b}) in _SAME_FAMILY
def detect_encoding(path: Path, sample_bytes: int = 65_536) -> str:
"""Detect file encoding by reading the first *sample_bytes*.
@@ -34,8 +235,21 @@ def detect_encoding(path: Path, sample_bytes: int = 65_536) -> str:
# Check BOM first
if raw[:3] == b"\xef\xbb\xbf":
# A "lying" BOM: file claims utf-8 but the body bytes don't decode
# as utf-8. Fall through to charset detection on the BOM-stripped
# body so we don't hand back utf-8-sig that will then fail to read.
body = raw[3:]
try:
body.decode("utf-8")
return "utf-8-sig"
if raw[:2] in (b"\xff\xfe", b"\xfe\xff"):
except UnicodeDecodeError:
logger.debug(
"detect_encoding({}): file has UTF-8 BOM but body is not "
"valid UTF-8 — falling through to charset detection",
Path(path).name,
)
raw = body
elif raw[:2] in (b"\xff\xfe", b"\xfe\xff"):
return "utf-16"
# Strict UTF-8 wins. charset_normalizer fingerprints small files
@@ -48,11 +262,21 @@ def detect_encoding(path: Path, sample_bytes: int = 65_536) -> str:
except UnicodeDecodeError:
pass
result = from_bytes(raw).best()
if result is None:
matches = from_bytes(raw)
enc = _arbitrate_charset_match(matches)
if enc is None:
return "utf-8"
enc = result.encoding.lower()
# Normalise common aliases
# Language-aware probe runs after the arbiter so we only spend cycles
# on the cases where charset-normalizer fingerprinted the bytes as a
# codepage that doesn't match the apparent script. Returns a better
# encoding only when the probe finds a high-coverage match.
probed = _probe_language(raw, enc)
if probed:
logger.debug(
"detect_encoding({}): language probe overrode {}{}",
Path(path).name, enc, probed,
)
enc = probed
if enc in ("ascii", "us-ascii"):
enc = "utf-8"
return enc

780
src/core/missing.py Normal file
View File

@@ -0,0 +1,780 @@
"""DataTools Missing Value Handler.
Detects disguised nulls, profiles missingness per column, and applies
imputation or drop strategies with a full audit trail.
Public API
----------
Per-column helpers:
is_missing_like(value, sentinels) -> bool
detect_sentinels(series, sentinels) -> dict[str, int]
DataFrame entry points:
profile_missing(df, options) -> MissingProfile
handle_missing(df, options) -> MissingResult
Types:
MissingOptions, MissingProfile, MissingResult, ColumnReport, Strategy
Presets (PRESETS):
"detect-only" — only standardize sentinels to NaN, no fill / drop.
"safe-fill" — sentinels → NaN, then numeric=median, categorical=mode.
"drop-incomplete" — sentinels → NaN, then drop rows with any missing.
Use cases covered
-----------------
1. Disguised nulls in survey / CRM exports ("N/A", "n/a", "-", "(blank)",
"TBD", whitespace-only, "?", "null", "NaN").
2. Per-column profile for QA reports (counts, %, top sentinel hit).
3. Row-drop with threshold (e.g., drop rows missing >50% of columns).
4. Column-drop with threshold (e.g., drop columns missing >80%).
5. Numeric imputation (mean / median / constant), categorical (mode /
constant), time-series (ffill / bfill).
6. Per-column overrides — different strategy per column in the same run.
Non-goals
---------
- ML-based imputation (KNN / iterative) — out of scope for v1.
- Group-wise imputation by another column — deferred until a real use case.
"""
from __future__ import annotations
import json
import re
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Iterable, Literal, Optional
import numpy as np
import pandas as pd
from loguru import logger
from pandas.api import types as pdtypes
from .errors import ConfigError, InputValidationError, ensure_choice, ensure_dataframe
# ---------------------------------------------------------------------------
# Sentinel detection
# ---------------------------------------------------------------------------
# Default disguised-null sentinels. Matched case-insensitively after a
# strip(). Whitespace-only strings ("", " ") are always treated as
# missing regardless of this list.
DEFAULT_SENTINELS: tuple[str, ...] = (
"n/a", "na", "n.a.", "n.a",
"null", "none", "nil",
"nan",
"-", "--", "---",
"?", "??",
".",
"tbd", "tba",
"unknown", "unk",
"(blank)", "(none)", "(empty)", "(null)",
"#n/a", "#na", "#null!", "#value!",
"missing",
)
_WHITESPACE_ONLY_RE = re.compile(r"^\s*$")
def is_missing_like(value: Any, sentinels: Iterable[str] = DEFAULT_SENTINELS) -> bool:
"""True when *value* should be treated as missing.
Catches: real NaN/None, whitespace-only strings, and any string that
matches a sentinel after case-fold and strip.
"""
if value is None:
return True
# pandas / numpy NaN
try:
if isinstance(value, float) and np.isnan(value):
return True
except (TypeError, ValueError):
pass
if isinstance(value, pd._libs.tslibs.nattype.NaTType): # type: ignore[attr-defined]
return True
if not isinstance(value, str):
return False
if _WHITESPACE_ONLY_RE.match(value):
return True
needle = value.strip().casefold()
return needle in {s.casefold() for s in sentinels}
def detect_sentinels(
series: pd.Series,
sentinels: Iterable[str] = DEFAULT_SENTINELS,
) -> dict[str, int]:
"""Return ``{sentinel_value: count}`` for sentinels found in *series*.
Real NaN cells are not counted (they're already missing). Whitespace-
only strings are bucketed under the literal key ``"(whitespace)"`` so
callers can surface them distinctly from non-whitespace sentinels.
"""
counts: dict[str, int] = {}
needles = {s.casefold(): s for s in sentinels}
for value in series:
if value is None or (isinstance(value, float) and pd.isna(value)):
continue
if not isinstance(value, str):
continue
if _WHITESPACE_ONLY_RE.match(value):
counts["(whitespace)"] = counts.get("(whitespace)", 0) + 1
continue
key = value.strip().casefold()
if key in needles:
label = needles[key]
counts[label] = counts.get(label, 0) + 1
return counts
# ---------------------------------------------------------------------------
# Strategies / options / results
# ---------------------------------------------------------------------------
Strategy = Literal[
"none", # detect-only; do not fill or drop.
"drop_row", # drop rows that are missing in any selected column.
"drop_col", # drop columns whose missing fraction exceeds threshold.
"drop_both", # apply drop_col first, then drop_row on what remains.
"mean", # numeric only.
"median", # numeric only.
"mode", # any dtype.
"constant", # fill with options.fill_value.
"ffill",
"bfill",
"interpolate", # linear interpolation, numeric only.
]
_NUMERIC_STRATEGIES: frozenset[str] = frozenset(
{"mean", "median", "interpolate"},
)
_FILL_STRATEGIES: frozenset[str] = frozenset(
{"mean", "median", "mode", "constant", "ffill", "bfill", "interpolate"},
)
_DROP_STRATEGIES: frozenset[str] = frozenset(
{"drop_row", "drop_col", "drop_both"},
)
PRESETS: dict[str, dict[str, Any]] = {
"detect-only": {
"standardize_sentinels": True,
"strategy": "none",
},
"safe-fill": {
"standardize_sentinels": True,
"strategy": "median",
"categorical_strategy": "mode",
},
"drop-incomplete": {
"standardize_sentinels": True,
"strategy": "drop_row",
# Strict-greater semantics: 0.0 → drop a row as soon as any
# selected column is missing.
"row_drop_threshold": 0.0,
},
}
@dataclass
class MissingOptions:
"""Toggles for missing-value detection and handling.
Defaults match the ``detect-only`` preset: sentinels are standardized
to NaN, but no rows are dropped and no values are filled.
"""
# Detection
sentinels: list[str] = field(default_factory=lambda: list(DEFAULT_SENTINELS))
standardize_sentinels: bool = True
# Strategy applied to all selected columns. ``categorical_strategy``
# is a fallback used by numeric-only strategies (mean/median/interpolate)
# when a selected column is non-numeric — rather than crash, fall back
# to a reasonable categorical strategy.
strategy: Strategy = "none"
categorical_strategy: Strategy = "mode"
# Per-column overrides take precedence over ``strategy`` / preset.
column_strategies: dict[str, Strategy] = field(default_factory=dict)
# Constant-fill payload. Either a scalar (applied to every selected
# column) or a per-column dict for differentiated fills.
fill_value: Any = None
column_fill_values: dict[str, Any] = field(default_factory=dict)
# Drop thresholds (0.0 .. 1.0). A row/column is dropped when its
# missing fraction is *strictly greater than* the threshold. So:
# 1.0 (default) — never drop (no fraction exceeds 100%)
# 0.5 — drop when more than half is missing
# 0.0 — drop on any missing at all
row_drop_threshold: float = 1.0
col_drop_threshold: float = 1.0
# Scope control
columns: Optional[list[str]] = None
skip_columns: list[str] = field(default_factory=list)
@classmethod
def from_preset(cls, name: str) -> MissingOptions:
if name not in PRESETS:
raise ConfigError(
f"Unknown preset '{name}'",
operation="MissingOptions.from_preset",
suggestion=f"Available: {sorted(PRESETS)}",
)
return cls(**PRESETS[name])
@classmethod
def from_dict(cls, data: dict) -> MissingOptions:
known = set(cls.__dataclass_fields__)
kwargs = {k: v for k, v in data.items() if k in known}
return cls(**kwargs)
def to_dict(self) -> dict:
return asdict(self)
def to_file(self, path: str | Path) -> Path:
out = Path(path)
out.write_text(json.dumps(self.to_dict(), indent=2, default=str))
return out
@classmethod
def from_file(cls, path: str | Path) -> MissingOptions:
return cls.from_dict(json.loads(Path(path).read_text()))
def validate(self) -> None:
"""Fail fast on incoherent option combinations."""
choices = (
"none", "drop_row", "drop_col", "drop_both",
"mean", "median", "mode", "constant",
"ffill", "bfill", "interpolate",
)
ensure_choice(self.strategy, name="strategy", choices=choices,
function="MissingOptions.validate")
ensure_choice(self.categorical_strategy, name="categorical_strategy",
choices=choices, function="MissingOptions.validate")
for col, strat in self.column_strategies.items():
ensure_choice(strat, name=f"column_strategies[{col!r}]",
choices=choices, function="MissingOptions.validate")
if not (0.0 <= self.row_drop_threshold <= 1.0):
raise ConfigError(
f"row_drop_threshold must be in [0.0, 1.0], got "
f"{self.row_drop_threshold!r}",
operation="MissingOptions.validate",
)
if not (0.0 <= self.col_drop_threshold <= 1.0):
raise ConfigError(
f"col_drop_threshold must be in [0.0, 1.0], got "
f"{self.col_drop_threshold!r}",
operation="MissingOptions.validate",
)
@dataclass
class ColumnReport:
"""Per-column missingness snapshot."""
column: str
dtype: str
total: int
missing: int # NaN cells (after sentinel standardization if enabled)
missing_pct: float # 0.0 .. 100.0
sentinels_found: dict[str, int] # disguised nulls hit, pre-standardization
@property
def has_missing(self) -> bool:
return self.missing > 0
@dataclass
class MissingProfile:
"""Whole-DataFrame missingness profile."""
columns: list[ColumnReport]
rows_total: int
cells_total: int
cells_missing: int
rows_with_any_missing: int
rows_complete: int
@property
def cells_missing_pct(self) -> float:
return (self.cells_missing / self.cells_total * 100.0) if self.cells_total else 0.0
def to_dataframe(self) -> pd.DataFrame:
"""Long-form table suitable for the GUI / CLI."""
rows = []
for r in self.columns:
top = max(r.sentinels_found.items(), key=lambda kv: kv[1], default=("", 0))
rows.append({
"column": r.column,
"dtype": r.dtype,
"missing": r.missing,
"missing_pct": round(r.missing_pct, 2),
"top_sentinel": top[0],
"top_sentinel_count": top[1],
"sentinel_total": sum(r.sentinels_found.values()),
})
return pd.DataFrame(rows)
@dataclass
class MissingResult:
"""Output of ``handle_missing``."""
handled_df: pd.DataFrame
profile_before: MissingProfile
profile_after: MissingProfile
changes: pd.DataFrame # cols: row, column, old, new, action
rows_dropped: int
columns_dropped: list[str]
cells_filled: int
sentinels_standardized: int
columns_processed: list[str]
strategy_per_column: dict[str, Strategy]
# ---------------------------------------------------------------------------
# Profiling
# ---------------------------------------------------------------------------
def _select_columns(df: pd.DataFrame, options: MissingOptions) -> list[str]:
"""Pick the columns to operate on (mirrors text_clean._select_columns).
Default: every column. Missing-value handling is meaningful for any
dtype, unlike text cleaning which only touches strings.
"""
if options.columns is not None:
unknown = [c for c in options.columns if c not in df.columns]
if unknown:
raise InputValidationError(
f"Columns not found in input: {unknown}",
operation="handle_missing",
suggestion=f"Available: {list(df.columns)}",
)
chosen: Iterable[str] = options.columns
else:
chosen = list(df.columns)
skip = set(options.skip_columns)
return [c for c in chosen if c not in skip]
def _standardize_sentinels(
df: pd.DataFrame,
columns: list[str],
sentinels: Iterable[str],
) -> tuple[pd.DataFrame, list[dict[str, Any]], int]:
"""Replace sentinel strings with NaN in the selected columns.
Returns ``(new_df, change_records, total_replacements)``. ``change_records``
is appended to the audit table so the user can see exactly which cells
were converted from "N/A" / "-" / etc. to a real null.
"""
out = df.copy()
needles = {s.casefold(): s for s in sentinels}
records: list[dict[str, Any]] = []
total = 0
for col in columns:
series = out[col]
# Only iterate object/string columns — numeric/datetime cells can't
# contain string sentinels by construction.
if not (pdtypes.is_object_dtype(series) or pdtypes.is_string_dtype(series)):
continue
new_values: list[Any] = []
changed = False
for row_idx, value in enumerate(series.tolist()):
if value is None or (isinstance(value, float) and pd.isna(value)):
new_values.append(value)
continue
if not isinstance(value, str):
new_values.append(value)
continue
if _WHITESPACE_ONLY_RE.match(value):
records.append({
"row": row_idx,
"column": col,
"old": value,
"new": np.nan,
"action": "standardize:whitespace",
})
new_values.append(np.nan)
changed = True
total += 1
continue
key = value.strip().casefold()
if key in needles:
records.append({
"row": row_idx,
"column": col,
"old": value,
"new": np.nan,
"action": f"standardize:{needles[key]}",
})
new_values.append(np.nan)
changed = True
total += 1
else:
new_values.append(value)
if changed:
out[col] = new_values
return out, records, total
def profile_missing(
df: pd.DataFrame,
options: Optional[MissingOptions] = None,
) -> MissingProfile:
"""Compute a per-column missingness profile.
Sentinels are *not* mutated in *df*; this is a read-only inspection.
The profile reports both raw NaN counts and which sentinel strings
were hit so the GUI / CLI can show "12 disguised nulls (8 'N/A',
4 '-')" alongside "47 real NaN".
"""
ensure_dataframe(df, function="profile_missing")
options = options or MissingOptions()
columns = _select_columns(df, options)
sentinels = options.sentinels if options.standardize_sentinels else []
reports: list[ColumnReport] = []
for col in columns:
series = df[col]
sentinels_hit = detect_sentinels(series, sentinels) if sentinels else {}
# Effective missing = real-NaN count + sentinel hits (since they'd
# become NaN once standardize_sentinels runs). This makes the
# "before" profile match what the user sees post-standardization.
nan_count = int(series.isna().sum())
sentinel_count = sum(sentinels_hit.values())
total = len(series)
missing = nan_count + sentinel_count
reports.append(ColumnReport(
column=str(col),
dtype=str(series.dtype),
total=total,
missing=missing,
missing_pct=(missing / total * 100.0) if total else 0.0,
sentinels_found=sentinels_hit,
))
# For row-level stats use NaN sentinels in the selected columns.
if columns and len(df):
if sentinels:
mask = pd.DataFrame(index=df.index)
needles = {s.casefold() for s in sentinels}
for col in columns:
series = df[col]
if pdtypes.is_object_dtype(series) or pdtypes.is_string_dtype(series):
sentinel_mask = series.apply(
lambda v: isinstance(v, str)
and (
bool(_WHITESPACE_ONLY_RE.match(v))
or v.strip().casefold() in needles
)
)
mask[col] = series.isna() | sentinel_mask
else:
mask[col] = series.isna()
else:
mask = df[columns].isna()
rows_with_any = int(mask.any(axis=1).sum())
rows_complete = int((~mask.any(axis=1)).sum())
cells_missing = int(mask.values.sum())
cells_total = int(mask.size)
else:
rows_with_any = 0
rows_complete = len(df)
cells_missing = 0
cells_total = len(df) * len(columns)
return MissingProfile(
columns=reports,
rows_total=len(df),
cells_total=cells_total,
cells_missing=cells_missing,
rows_with_any_missing=rows_with_any,
rows_complete=rows_complete,
)
# ---------------------------------------------------------------------------
# Imputation
# ---------------------------------------------------------------------------
def _resolve_strategy(
col: str,
series: pd.Series,
options: MissingOptions,
) -> Strategy:
"""Effective strategy for *col*: per-column override → global → fallback.
If the column is non-numeric and the selected strategy is numeric-only,
fall back to ``options.categorical_strategy`` so the run doesn't crash
halfway through. The fallback is logged so the audit trail records
why a different strategy fired.
"""
strat: Strategy = options.column_strategies.get(col, options.strategy)
if strat in _NUMERIC_STRATEGIES and not pdtypes.is_numeric_dtype(series):
logger.debug(
"Column {!r}: strategy {!r} requires numeric dtype "
"(got {}); falling back to {!r}",
col, strat, series.dtype, options.categorical_strategy,
)
return options.categorical_strategy
return strat
def _fill_value_for(
col: str,
series: pd.Series,
strategy: Strategy,
options: MissingOptions,
) -> Any:
"""Compute the scalar fill for *series* under *strategy*.
Returns a sentinel ``object()`` when the strategy doesn't yield a
single scalar (ffill/bfill/interpolate handle the fill themselves).
"""
if strategy == "mean":
return series.mean()
if strategy == "median":
return series.median()
if strategy == "mode":
modes = series.mode(dropna=True)
return modes.iloc[0] if len(modes) else None
if strategy == "constant":
if col in options.column_fill_values:
return options.column_fill_values[col]
return options.fill_value
return _NO_SCALAR
_NO_SCALAR = object()
def _apply_fill(
df: pd.DataFrame,
col: str,
strategy: Strategy,
options: MissingOptions,
records: list[dict[str, Any]],
) -> int:
"""Apply *strategy* to a single column. Returns cells filled."""
series = df[col]
missing_mask = series.isna()
if not missing_mask.any():
return 0
if strategy == "ffill":
filled = series.ffill()
elif strategy == "bfill":
filled = series.bfill()
elif strategy == "interpolate":
# Interpolation is only defined for numeric series — guard so an
# accidentally-routed object column produces no output rather
# than a confusing TypeError.
if not pdtypes.is_numeric_dtype(series):
return 0
filled = series.interpolate(method="linear", limit_direction="both")
else:
# Skip mean/median computation entirely on all-NaN numeric columns
# so we don't trip numpy's "Mean of empty slice" RuntimeWarning.
if (
strategy in {"mean", "median"}
and pdtypes.is_numeric_dtype(series)
and series.dropna().empty
):
return 0
scalar = _fill_value_for(col, series, strategy, options)
if scalar is _NO_SCALAR:
return 0
if scalar is None or (isinstance(scalar, float) and pd.isna(scalar)):
# Nothing to fill with — e.g., all-NaN column under "mean".
logger.debug(
"Column {!r}: strategy {!r} produced no fill value (all-NaN?)",
col, strategy,
)
return 0
# Opt into pandas 2.x's future no-silent-downcast behaviour to
# avoid the FutureWarning fired when fillna would auto-downcast
# an object column. We then call infer_objects ourselves to
# preserve the dtype the user would have ended up with.
with pd.option_context("future.no_silent_downcasting", True):
filled = series.fillna(scalar)
if pdtypes.is_object_dtype(series):
filled = filled.infer_objects(copy=False)
cells = 0
for row_idx in np.flatnonzero(missing_mask.values):
old = series.iloc[row_idx]
new = filled.iloc[row_idx]
if pd.isna(new):
# ffill/bfill at a leading/trailing NaN run can leave NaN in
# place. Don't audit a no-op fill.
continue
records.append({
"row": int(row_idx),
"column": col,
"old": old,
"new": new,
"action": f"fill:{strategy}",
})
cells += 1
df[col] = filled
return cells
def _apply_drops(
df: pd.DataFrame,
columns: list[str],
strategy: Strategy,
options: MissingOptions,
records: list[dict[str, Any]],
) -> tuple[pd.DataFrame, int, list[str]]:
"""Drop rows / columns according to *strategy*.
Returns ``(new_df, rows_dropped, columns_dropped)``.
"""
out = df
rows_dropped = 0
cols_dropped: list[str] = []
# Drop semantics (consistent across rows and columns): a row/column
# is dropped when its missing fraction is *strictly greater* than the
# threshold. The default threshold of 1.0 therefore means "never
# drop" (no fraction can exceed 100%); 0.0 means "drop on any
# missing"; intermediate values trigger when the missing share rises
# above the chosen ceiling.
if strategy in {"drop_col", "drop_both"} and columns:
pct = out[columns].isna().mean()
to_drop = [c for c, frac in pct.items() if frac > options.col_drop_threshold]
if to_drop:
for c in to_drop:
records.append({
"row": -1,
"column": c,
"old": f"{int(out[c].isna().sum())} missing / {len(out)}",
"new": "",
"action": "drop_column",
})
out = out.drop(columns=to_drop)
cols_dropped = to_drop
columns = [c for c in columns if c not in to_drop]
if strategy in {"drop_row", "drop_both"} and columns:
sel = out[columns]
frac = sel.isna().mean(axis=1)
drop_mask = frac > options.row_drop_threshold
rows_dropped = int(drop_mask.sum())
if rows_dropped:
for row_idx in np.flatnonzero(drop_mask.values):
miss_cols = [c for c in columns if pd.isna(sel.iloc[row_idx][c])]
records.append({
"row": int(row_idx),
"column": ",".join(miss_cols),
"old": "",
"new": "",
"action": "drop_row",
})
out = out.loc[~drop_mask].reset_index(drop=True)
return out, rows_dropped, cols_dropped
def handle_missing(
df: pd.DataFrame,
options: Optional[MissingOptions] = None,
) -> MissingResult:
"""Detect and handle missing values in *df*.
Pipeline placement (recommended, not enforced)
----------------------------------------------
Run *after* the text cleaner (so NBSP-padded / zero-width-only
cells are correctly detected as missing) and the format
standardizer (so numeric imputation has numeric dtypes). Run
*before* the deduplicator (so dedup doesn't merge a row with a
missing email into a row that has one). See
``src.core.pipeline.SOFT_DEPENDENCIES``.
Pipeline:
1. Standardize disguised-null sentinels to ``NaN`` (audit-logged).
2. Apply column drops (if strategy includes ``drop_col``).
3. Apply row drops (if strategy includes ``drop_row``).
4. Apply per-column fills (mean/median/mode/constant/ffill/bfill/
interpolate). Per-column overrides win over the global strategy.
The input DataFrame is not mutated.
"""
ensure_dataframe(df, function="handle_missing")
options = options or MissingOptions()
options.validate()
profile_before = profile_missing(df, options)
columns = _select_columns(df, options)
logger.debug(
"handle_missing: rows={}, cols={}, strategy={}, scope_cols={}",
len(df), len(df.columns), options.strategy, len(columns),
)
records: list[dict[str, Any]] = []
sentinels_replaced = 0
# ------------------------------------------------------------------
# 1. Sentinel standardization
# ------------------------------------------------------------------
if options.standardize_sentinels and options.sentinels and columns:
out, sentinel_records, sentinels_replaced = _standardize_sentinels(
df, columns, options.sentinels,
)
records.extend(sentinel_records)
else:
out = df.copy()
# ------------------------------------------------------------------
# 2 + 3. Drops (column-first, then row)
# ------------------------------------------------------------------
rows_dropped = 0
columns_dropped: list[str] = []
global_strategy = options.strategy
if global_strategy in _DROP_STRATEGIES:
out, rows_dropped, columns_dropped = _apply_drops(
out, columns, global_strategy, options, records,
)
# Update column scope after potential drops.
columns = [c for c in columns if c not in columns_dropped]
# ------------------------------------------------------------------
# 4. Fills (per-column)
# ------------------------------------------------------------------
cells_filled = 0
strategy_per_column: dict[str, Strategy] = {}
for col in columns:
strat = _resolve_strategy(col, out[col], options)
strategy_per_column[col] = strat
if strat in _FILL_STRATEGIES:
cells_filled += _apply_fill(out, col, strat, options, records)
# ------------------------------------------------------------------
# Build audit + after-profile
# ------------------------------------------------------------------
changes_df = pd.DataFrame(
records, columns=["row", "column", "old", "new", "action"],
)
profile_after = profile_missing(out, options)
return MissingResult(
handled_df=out,
profile_before=profile_before,
profile_after=profile_after,
changes=changes_df,
rows_dropped=rows_dropped,
columns_dropped=columns_dropped,
cells_filled=cells_filled,
sentinels_standardized=sentinels_replaced,
columns_processed=columns,
strategy_per_column=strategy_per_column,
)

501
src/core/pipeline.py Normal file
View File

@@ -0,0 +1,501 @@
"""DataTools Pipeline Runner.
Chain the cleaning tools (text-clean, format-standardize, missing,
column-map, dedup) into a single orchestrated workflow. The pipeline
threads the DataFrame from one step to the next; each step's options
are JSON-serializable so the entire pipeline can be saved, shared, and
re-run on next week's export.
Design tenets
-------------
* **Recommended, not forced.** The recommended order
(text → format → missing → dedup, with column-map fitting either
end depending on use case) is encoded in
:data:`SOFT_DEPENDENCIES`. The runner WARNS on out-of-order
pipelines but never refuses to execute them — the user owns their
workflow.
* **Each step is opt-in / opt-out.** ``Step.enabled = False`` skips
the step without removing it from the saved configuration.
* **Adapters are tiny.** Each tool is wrapped by a small adapter that
bridges its native ``Options`` / ``Result`` shape to the pipeline's
uniform ``(df, options_dict) → (new_df, summary)`` contract.
Public API
----------
Types:
Step, Pipeline, StepResult, PipelineResult
Functions:
run_pipeline(df, pipeline) -> PipelineResult
validate_pipeline(pipeline) -> list[str]
recommended_pipeline(*, include=None, **opts) -> Pipeline
Constants:
TOOL_ADAPTERS — name → adapter callable
TOOL_NAMES — sorted list of recognised tool names
SOFT_DEPENDENCIES — list of (earlier, later, reason) tuples
"""
from __future__ import annotations
import json
import time
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Callable, Iterable, Optional
import pandas as pd
from loguru import logger
from .errors import (
ConfigError,
DataToolsError,
InputValidationError,
ensure_choice,
ensure_dataframe,
)
# ---------------------------------------------------------------------------
# Tool adapters — bridge each tool's native API to the pipeline contract
# ---------------------------------------------------------------------------
def _adapter_text_clean(
df: pd.DataFrame, options: dict[str, Any],
) -> tuple[pd.DataFrame, dict[str, Any]]:
from .text_clean import CleanOptions, clean_dataframe
opts = CleanOptions.from_dict(options) if options else CleanOptions()
res = clean_dataframe(df, opts)
return res.cleaned_df, {
"cells_total": res.cells_total,
"cells_changed": res.cells_changed,
"columns_processed": list(res.columns_processed),
}
def _adapter_format_standardize(
df: pd.DataFrame, options: dict[str, Any],
) -> tuple[pd.DataFrame, dict[str, Any]]:
from .format_standardize import StandardizeOptions, standardize_dataframe
opts = StandardizeOptions.from_dict(options) if options else StandardizeOptions()
res = standardize_dataframe(df, opts)
return res.standardized_df, {
"cells_total": res.cells_total,
"cells_changed": res.cells_changed,
"cells_unparseable": res.cells_unparseable,
"columns_processed": list(res.columns_processed),
}
def _adapter_missing(
df: pd.DataFrame, options: dict[str, Any],
) -> tuple[pd.DataFrame, dict[str, Any]]:
from .missing import MissingOptions, handle_missing
opts = MissingOptions.from_dict(options) if options else MissingOptions()
res = handle_missing(df, opts)
return res.handled_df, {
"sentinels_standardized": res.sentinels_standardized,
"cells_filled": res.cells_filled,
"rows_dropped": res.rows_dropped,
"columns_dropped": list(res.columns_dropped),
"columns_processed": list(res.columns_processed),
}
def _adapter_column_map(
df: pd.DataFrame, options: dict[str, Any],
) -> tuple[pd.DataFrame, dict[str, Any]]:
from .column_mapper import MapOptions, map_columns
opts = MapOptions.from_dict(options) if options else MapOptions()
res = map_columns(df, opts)
return res.mapped_df, {
"columns_renamed": res.columns_renamed,
"columns_dropped": list(res.columns_dropped),
"columns_added": list(res.columns_added),
"coercion_failures": dict(res.coercion_failures),
"missing_required_targets": list(res.missing_required_targets),
}
def _adapter_dedup(
df: pd.DataFrame, options: dict[str, Any],
) -> tuple[pd.DataFrame, dict[str, Any]]:
from .dedup import deduplicate, SurvivorRule
from .config import DeduplicationConfig
options = options or {}
survivor = options.get("survivor_rule", "first")
if isinstance(survivor, str):
try:
survivor = SurvivorRule(survivor)
except ValueError as e:
raise ConfigError(
f"Unknown survivor_rule {survivor!r}",
operation="pipeline.dedup",
cause=e,
suggestion=f"Valid: {[r.value for r in SurvivorRule]}",
) from e
# Optional explicit strategies via the same JSON shape as
# DeduplicationConfig: ``[{"columns": [{"column": "phone",
# "algorithm": "exact", "threshold": 100}, ...]}, ...]``.
raw_strategies = options.get("strategies")
explicit_strategies = None
if raw_strategies:
cfg = DeduplicationConfig.from_dict({"strategies": raw_strategies})
explicit_strategies = cfg.to_strategies()
res = deduplicate(
df,
strategies=explicit_strategies,
survivor_rule=survivor,
merge=options.get("merge", False),
preview=False, # pipeline always commits the dedup output
date_column=options.get("date_column"),
)
final = res.deduplicated_df if res.deduplicated_df is not None else df
return final, {
"input_rows": len(df),
"output_rows": len(final),
"duplicates_removed": len(df) - len(final),
"groups": len(res.match_groups) if res.match_groups else 0,
}
TOOL_ADAPTERS: dict[str, Callable[..., tuple[pd.DataFrame, dict[str, Any]]]] = {
"text_clean": _adapter_text_clean,
"format_standardize": _adapter_format_standardize,
"missing": _adapter_missing,
"column_map": _adapter_column_map,
"dedup": _adapter_dedup,
}
TOOL_NAMES: list[str] = sorted(TOOL_ADAPTERS)
# ---------------------------------------------------------------------------
# Soft dependencies
# ---------------------------------------------------------------------------
# Pairs of (earlier, later, reason) where running *earlier* before
# *later* is recommended. A reversal triggers a WARNING — never a
# block. The user owns their workflow.
SOFT_DEPENDENCIES: list[tuple[str, str, str]] = [
(
"text_clean", "format_standardize",
"format parsers (phone / currency / date) fail on smart-quote-"
"contaminated or NBSP-padded input — clean text first",
),
(
"text_clean", "missing",
"sentinel detection misses cells padded with NBSP / zero-width "
"characters — clean text first",
),
(
"text_clean", "dedup",
"fuzzy matching treats NBSP-padded values as different — "
"clean text first",
),
(
"format_standardize", "missing",
"numeric imputation needs numeric dtypes; canonical phones / "
"currencies improve sentinel detection",
),
(
"format_standardize", "dedup",
"canonical phones / lowercase emails enable cross-format "
"duplicate matching",
),
(
"missing", "dedup",
"deduping rows with mixed NaN sentinels produces brittle merges "
"— resolve missing values first",
),
]
# ---------------------------------------------------------------------------
# Step / Pipeline / Result dataclasses
# ---------------------------------------------------------------------------
@dataclass
class Step:
"""One step in a pipeline.
Attributes
----------
tool : Name of the tool to run. Must be a key of :data:`TOOL_ADAPTERS`.
options : JSON-serializable dict of tool-specific options. Each
adapter parses this through the tool's ``Options.from_dict``.
enabled : Skip the step (without removing it) when False.
name : Optional friendly label for logs / GUI rendering. Defaults
to the tool name.
"""
tool: str
options: dict[str, Any] = field(default_factory=dict)
enabled: bool = True
name: Optional[str] = None
def display_name(self) -> str:
return self.name or self.tool
def __post_init__(self) -> None:
if self.tool not in TOOL_ADAPTERS:
raise ConfigError(
f"Unknown tool {self.tool!r}",
operation="Step.__post_init__",
suggestion=f"Valid tools: {TOOL_NAMES}",
)
@dataclass
class Pipeline:
"""An ordered sequence of :class:`Step` records."""
steps: list[Step] = field(default_factory=list)
def to_dict(self) -> dict:
return {"steps": [asdict(s) for s in self.steps]}
def to_file(self, path: str | Path) -> Path:
out = Path(path)
out.write_text(json.dumps(self.to_dict(), indent=2, default=str))
return out
@classmethod
def from_dict(cls, data: dict) -> Pipeline:
if "steps" not in data:
raise ConfigError(
"Pipeline file must contain a 'steps' list",
operation="Pipeline.from_dict",
suggestion='Example: {"steps": [{"tool": "text_clean"}, ...]}',
)
steps: list[Step] = []
for raw in data["steps"]:
if "tool" not in raw:
raise ConfigError(
f"Step is missing 'tool': {raw!r}",
operation="Pipeline.from_dict",
)
steps.append(Step(
tool=raw["tool"],
options=dict(raw.get("options") or {}),
enabled=bool(raw.get("enabled", True)),
name=raw.get("name"),
))
return cls(steps=steps)
@classmethod
def from_file(cls, path: str | Path) -> Pipeline:
return cls.from_dict(json.loads(Path(path).read_text()))
@dataclass
class StepResult:
"""One step's outcome."""
step: Step
summary: dict[str, Any]
elapsed_seconds: float
skipped: bool = False
error: Optional[str] = None # rendered exception, not the live one
@dataclass
class PipelineResult:
"""Whole-run outcome."""
final_df: pd.DataFrame
step_results: list[StepResult]
total_elapsed: float
initial_rows: int
final_rows: int
warnings: list[str]
# ---------------------------------------------------------------------------
# Recommended pipeline + validation
# ---------------------------------------------------------------------------
# The single canonical default. Column-map is omitted: include it only
# when the caller needs header alignment (early) or schema enforcement
# (late). Adding it as an "auto" middle step would override the user's
# downstream column lookups without their having asked.
_DEFAULT_ORDER: list[str] = [
"text_clean",
"format_standardize",
"missing",
"dedup",
]
def recommended_pipeline(
*,
include: Optional[Iterable[str]] = None,
options: Optional[dict[str, dict[str, Any]]] = None,
) -> Pipeline:
"""Build the recommended pipeline.
Defaults to ``[text_clean, format_standardize, missing, dedup]`` —
the canonical workflow surfaced in DECISIONS.md and
``src.core.pipeline.SOFT_DEPENDENCIES``.
Parameters
----------
include
Names of tools to include, in the desired order. When None,
uses :data:`_DEFAULT_ORDER`. Pass ``["column_map", "text_clean",
...]`` to put column-map first (header-alignment use case) or
``[..., "column_map"]`` to put it last (schema-enforcement use
case).
options
Optional ``{tool_name: {option_dict}}`` to seed each step. A
missing entry uses the tool's default options.
"""
chosen = list(include) if include is not None else list(_DEFAULT_ORDER)
seed = options or {}
for t in chosen:
ensure_choice(
t, name="tool", choices=TOOL_NAMES,
function="recommended_pipeline",
)
return Pipeline(steps=[
Step(tool=t, options=dict(seed.get(t) or {}))
for t in chosen
])
def validate_pipeline(pipeline: Pipeline) -> list[str]:
"""Return a list of WARNING strings for soft-dependency violations.
Empty list = pipeline is in recommended order. Each warning is a
single human-readable sentence the CLI / GUI can surface verbatim.
Disabled steps are ignored.
"""
enabled = [s for s in pipeline.steps if s.enabled]
positions: dict[str, int] = {}
duplicates: list[str] = []
for i, s in enumerate(enabled):
if s.tool in positions:
# Multiple steps for the same tool is allowed (a user might
# text-clean twice with different scopes). Skip the dep
# check for the duplicate so we don't spam warnings.
duplicates.append(s.tool)
else:
positions[s.tool] = i
warnings: list[str] = []
for earlier, later, why in SOFT_DEPENDENCIES:
if earlier in positions and later in positions:
if positions[earlier] > positions[later]:
warnings.append(
f"step {later!r} runs BEFORE {earlier!r}{why}"
)
return warnings
# ---------------------------------------------------------------------------
# Execution
# ---------------------------------------------------------------------------
def run_pipeline(
df: pd.DataFrame,
pipeline: Pipeline,
*,
on_step_complete: Optional[Callable[[StepResult], None]] = None,
stop_on_error: bool = True,
) -> PipelineResult:
"""Execute *pipeline* against *df*.
The DataFrame from each step's adapter is passed to the next step;
the original input is never mutated. Soft-dependency warnings are
captured up-front and returned via ``PipelineResult.warnings`` so
the caller can surface them — the run proceeds regardless.
Parameters
----------
on_step_complete
Optional ``callable(StepResult)`` fired after each step. Useful
for live progress in the GUI.
stop_on_error
When True (default), the first failing step's exception
propagates and execution halts. Set False to continue past a
failing step using the previous step's output (the failed
step's ``StepResult.error`` holds the rendered exception).
"""
ensure_dataframe(df, function="run_pipeline")
if not isinstance(pipeline, Pipeline):
raise InputValidationError(
f"Expected Pipeline, got {type(pipeline).__name__}",
operation="run_pipeline",
)
warnings = validate_pipeline(pipeline)
if warnings:
for w in warnings:
logger.warning("pipeline order: {}", w)
initial_rows = len(df)
step_results: list[StepResult] = []
current = df
t_start = time.perf_counter()
for step in pipeline.steps:
if not step.enabled:
sr = StepResult(
step=step, summary={}, elapsed_seconds=0.0, skipped=True,
)
step_results.append(sr)
if on_step_complete:
_safe_call(on_step_complete, sr)
continue
adapter = TOOL_ADAPTERS[step.tool]
s_start = time.perf_counter()
try:
new_df, summary = adapter(current, step.options)
except Exception as e: # noqa: BLE001 — pipeline owns the error contract
elapsed = time.perf_counter() - s_start
err_msg = (
e.format() if isinstance(e, DataToolsError) else f"{type(e).__name__}: {e}"
)
sr = StepResult(
step=step, summary={}, elapsed_seconds=elapsed,
error=err_msg,
)
step_results.append(sr)
if on_step_complete:
_safe_call(on_step_complete, sr)
if stop_on_error:
raise
logger.warning(
"pipeline step {!r} failed; continuing with previous output",
step.display_name(),
)
continue
current = new_df
sr = StepResult(
step=step, summary=summary,
elapsed_seconds=time.perf_counter() - s_start,
)
step_results.append(sr)
if on_step_complete:
_safe_call(on_step_complete, sr)
return PipelineResult(
final_df=current,
step_results=step_results,
total_elapsed=time.perf_counter() - t_start,
initial_rows=initial_rows,
final_rows=len(current),
warnings=warnings,
)
def _safe_call(callback: Callable, *args: Any) -> None:
"""Run a user-supplied callback, logging but never propagating errors."""
try:
callback(*args)
except Exception: # noqa: BLE001 — progress callbacks are advisory
logger.opt(exception=True).debug("pipeline callback raised; ignoring")

View File

@@ -535,6 +535,15 @@ def clean_dataframe(df: pd.DataFrame, options: Optional[CleanOptions] = None) ->
Numeric, datetime, and boolean columns are skipped by default. The input
DataFrame is not mutated; a copy is returned in ``CleanResult.cleaned_df``.
Pipeline placement (recommended, not enforced)
----------------------------------------------
*Best run early.* Smart-quote, NBSP, and zero-width pollution
silently breaks downstream parsers — phone numbers fail on
smart-quote contamination, sentinel detection misses NBSP-padded
cells, and fuzzy dedup treats whitespace-padded values as
different. Running this tool before format / missing / dedup is
the standard order. See ``src.core.pipeline.SOFT_DEPENDENCIES``.
"""
from .errors import ensure_dataframe
ensure_dataframe(df, function="clean_dataframe")

468
src/gui/app_demo.py Normal file
View File

@@ -0,0 +1,468 @@
"""DataTools — public demo app (deploys to Streamlit Community Cloud).
This is a SEPARATE entry point from the main GUI (``src/gui/app.py``).
The full GUI is the paid product surface; this demo is the marketing
surface — a single page that runs one of three persona-specific
pipelines on a preloaded sample file, shows the BEFORE / AFTER
side-by-side, and converts the visitor to a Gumroad purchase.
Launch:
streamlit run src/gui/app_demo.py
URL routing:
https://demo.datatools.app/?p=shopify-pet (Shopify operator)
https://demo.datatools.app/?p=bookkeeper (Bookkeeper)
https://demo.datatools.app/?p=revops (RevOps agency)
Free / paid boundary (per docs/DEMO-PLAN.md §6):
- input rows capped at ``DEMO_ROW_CAP``
- input file size capped at ``DEMO_FILE_CAP_MB``
- download CSV gets a single trailing watermark row
- the pipeline editor is read-only — visitor sees it but can't change it
- no audit-log download (paid feature)
- no save-pipeline-JSON (paid feature)
The demo runs the *same engine* as the paid product. Caps are applied
at the surface layer only — when the buyer downloads and runs the paid
build, every cap disappears.
"""
from __future__ import annotations
import io
import json
import sys
import time
from pathlib import Path
from typing import Any
import pandas as pd
import streamlit as st
# Ensure project root is on sys.path so `src.core` imports work
_project_root = Path(__file__).resolve().parent.parent.parent
if str(_project_root) not in sys.path:
sys.path.insert(0, str(_project_root))
from src.core.pipeline import Pipeline, run_pipeline
# ---------------------------------------------------------------------------
# Free / paid boundary constants
# ---------------------------------------------------------------------------
DEMO_ROW_CAP: int = 100
DEMO_FILE_CAP_MB: int = 5
GUMROAD_BASE: str = "https://gumroad.com/l/datatools"
# ---------------------------------------------------------------------------
# Persona registry — single source of truth
# ---------------------------------------------------------------------------
DEMO_DIR = _project_root / "samples" / "demo"
PERSONAS: dict[str, dict[str, Any]] = {
"shopify-pet": {
"label": "Shopify pet operator",
"icon": "🛍️",
"h1": "Klaviyo-import-ready customer lists. **In 30 seconds. Locally.**",
"sub": (
"Your Shopify customer export has duplicates Excel can't catch, "
"international phones Excel can't parse, and disguised nulls "
"(`N/A`, `(blank)`, `?`) that break Klaviyo's import. "
"DataTools fixes all of it in one pass — and your data never "
"leaves your computer."
),
"data_file": "shopify_pet_customers.csv",
"pipeline_file": "shopify_pet_pipeline.json",
"cta": "Get DataTools for Shopify — $49 →",
"landing": "https://datatools.app/shopify/",
},
"bookkeeper": {
"label": "Bookkeeper / freelance accountant",
"icon": "📒",
"h1": "Reconcile messy bank exports. **Hand your client an audit trail.**",
"sub": (
"The Jan and Feb exports overlap; the same transaction posts twice. "
"Vendor names are *Amazon* / *amazon.com* / *AMAZON.COM*4F2X9* in "
"three rows. DataTools dedups on Date + Amount + fuzzy Vendor, "
"produces ISO dates and numeric amounts, and gives you a row-level "
"audit log to hand the client."
),
"data_file": "bookkeeper_bank_reconcile.csv",
"pipeline_file": "bookkeeper_bank_pipeline.json",
"cta": "Get DataTools for Bookkeepers — $49 →",
"landing": "https://datatools.app/bookkeeper/",
},
"revops": {
"label": "Marketing / RevOps agency",
"icon": "🪢",
"h1": "Dedupe lead lists across HubSpot, LinkedIn, and manual scrapes — **locally.**",
"sub": (
"The same prospect shows up in HubSpot as `alice@acme.com`, in "
"LinkedIn as `Alice.Johnson@acme.com`, and in your VA's manual "
"scrape as `alice@acme.com` again. Country is `USA` / `US` / "
"`United States`. DataTools fuzzy-matches across sources, "
"normalizes phones for 50+ countries, and merges survivors "
"with their most-complete fields — without uploading anything."
),
"data_file": "agency_combined_leads.csv",
"pipeline_file": "agency_leads_pipeline.json",
"cta": "Get DataTools for RevOps — $49 →",
"landing": "https://datatools.app/revops/",
},
}
DEFAULT_PERSONA = "shopify-pet"
# ---------------------------------------------------------------------------
# Page config + routing
# ---------------------------------------------------------------------------
st.set_page_config(
page_title="DataTools — try it live",
page_icon="🧹",
layout="wide",
initial_sidebar_state="collapsed",
)
# Strip Streamlit chrome that breaks the iframe-embed look on the
# landing pages.
st.markdown("""
<style>
#MainMenu, footer, header { visibility: hidden; }
.block-container { padding-top: 1.2rem; padding-bottom: 1rem; max-width: 1200px; }
[data-testid="stSidebarNav"] { display: none; }
section[data-testid="stSidebar"] { display: none; }
.stApp { background: #0f1115; color: #e8eaed; }
h1, h2, h3 { color: #e8eaed; letter-spacing: -0.01em; }
hr { border-color: #252a36; }
.demo-card {
background: #161922;
border: 1px solid #252a36;
border-radius: 12px;
padding: 18px;
}
.cta-block {
background: linear-gradient(135deg, #161922 0%, #1d212b 100%);
border: 1px solid #6ee7b7;
border-radius: 12px;
padding: 24px;
text-align: center;
}
.cta-block a {
display: inline-block;
background: #6ee7b7; color: #052e1a;
font-weight: 600; padding: 12px 22px;
border-radius: 8px; text-decoration: none;
font-size: 17px; margin-top: 12px;
}
.metric-pill {
display: inline-block;
background: #1d212b; border: 1px solid #252a36;
padding: 4px 10px; border-radius: 999px;
font-family: ui-monospace, monospace; font-size: 13px;
color: #6ee7b7; margin-right: 6px; margin-bottom: 4px;
}
</style>
""", unsafe_allow_html=True)
def _resolve_persona() -> str:
"""Read ``?p=<persona>`` from query string; fall back to default."""
try:
params = st.query_params
raw = params.get("p", DEFAULT_PERSONA)
except AttributeError:
# Older Streamlit versions
params = st.experimental_get_query_params()
raw = params.get("p", [DEFAULT_PERSONA])
raw = raw[0] if isinstance(raw, list) else raw
if raw not in PERSONAS:
return DEFAULT_PERSONA
return raw
persona_key = _resolve_persona()
persona = PERSONAS[persona_key]
# ---------------------------------------------------------------------------
# Header + persona switch
# ---------------------------------------------------------------------------
col_brand, col_switch = st.columns([3, 2])
with col_brand:
st.markdown(f"### 🧹 DataTools / for {persona['label']}")
with col_switch:
# Quick-switch dropdown for visitors landing on the wrong persona
new_choice = st.selectbox(
"Try a different demo",
options=list(PERSONAS),
format_func=lambda k: f"{PERSONAS[k]['icon']} {PERSONAS[k]['label']}",
index=list(PERSONAS).index(persona_key),
key="persona_switch",
label_visibility="collapsed",
)
if new_choice != persona_key:
st.query_params["p"] = new_choice
st.rerun()
st.markdown(f"## {persona['h1']}")
st.markdown(persona["sub"])
st.markdown("---")
# ---------------------------------------------------------------------------
# Load preloaded sample data + pipeline
# ---------------------------------------------------------------------------
@st.cache_data(show_spinner=False)
def _load_demo(data_file: str, pipeline_file: str) -> tuple[pd.DataFrame, Pipeline]:
df = pd.read_csv(DEMO_DIR / data_file, dtype=str, keep_default_na=False)
pipe = Pipeline.from_file(DEMO_DIR / pipeline_file)
return df, pipe
sample_df, sample_pipeline = _load_demo(persona["data_file"], persona["pipeline_file"])
def _read_uploaded(uploaded_file) -> tuple[pd.DataFrame, list[str]]:
"""Decode an uploaded file. Returns (df, warnings)."""
warnings: list[str] = []
raw = uploaded_file.getvalue()
size_mb = len(raw) / 1024 / 1024
if size_mb > DEMO_FILE_CAP_MB:
warnings.append(
f"Uploaded file is {size_mb:.1f} MB — demo capped at "
f"{DEMO_FILE_CAP_MB} MB. The paid product has no size limit."
)
return sample_df.copy(), warnings
suffix = Path(uploaded_file.name).suffix.lower()
bio = io.BytesIO(raw)
try:
if suffix in (".xlsx", ".xls"):
df = pd.read_excel(bio, dtype=str, keep_default_na=False)
else:
for enc in ("utf-8", "utf-8-sig", "latin-1"):
try:
bio.seek(0)
sep = "\t" if suffix == ".tsv" else ","
df = pd.read_csv(
bio, dtype=str, keep_default_na=False,
encoding=enc, sep=sep, on_bad_lines="warn",
)
break
except UnicodeDecodeError:
continue
else:
bio.seek(0)
df = pd.read_csv(bio, dtype=str, keep_default_na=False, encoding="latin-1")
except Exception as e:
warnings.append(f"Could not read your file ({type(e).__name__}). "
"Demo will run on the sample dataset.")
return sample_df.copy(), warnings
if len(df) > DEMO_ROW_CAP:
warnings.append(
f"Demo capped at {DEMO_ROW_CAP} rows — your file has {len(df):,}. "
f"Running on the first {DEMO_ROW_CAP} rows. The paid product has no row limit."
)
df = df.head(DEMO_ROW_CAP)
return df, warnings
# ---------------------------------------------------------------------------
# File source: preloaded sample (default) or user upload
# ---------------------------------------------------------------------------
st.markdown(f"#### Sample dataset preloaded · `{persona['data_file']}`")
with st.expander(
"Or replace with your own file (capped at "
f"{DEMO_ROW_CAP} rows / {DEMO_FILE_CAP_MB} MB for the demo)",
expanded=False,
):
uploaded = st.file_uploader(
"Your file",
type=["csv", "tsv", "xlsx", "xls"],
key="demo_user_file",
label_visibility="collapsed",
help=(
"Files larger than the cap are accepted but only the first "
f"{DEMO_ROW_CAP} rows are processed. The paid build runs on "
"1 GB+ files via streaming."
),
)
if uploaded is not None:
df_in, upload_warnings = _read_uploaded(uploaded)
for w in upload_warnings:
st.info(w)
using_sample = False
else:
df_in = sample_df.copy()
using_sample = True
# ---------------------------------------------------------------------------
# BEFORE preview
# ---------------------------------------------------------------------------
st.markdown(f"#### BEFORE — {len(df_in)} rows, {len(df_in.columns)} columns")
st.dataframe(df_in.head(10), use_container_width=True, hide_index=True)
st.markdown("---")
# ---------------------------------------------------------------------------
# Pipeline (read-only)
# ---------------------------------------------------------------------------
st.markdown("#### Pipeline (saved — paid version is editable)")
pipe_summary = "".join(
f"**{i + 1}.** {step.tool}"
for i, step in enumerate(sample_pipeline.steps)
)
st.markdown(pipe_summary)
# ---------------------------------------------------------------------------
# Run
# ---------------------------------------------------------------------------
run_clicked = st.button(
"▶ Run pipeline",
type="primary",
use_container_width=True,
key="demo_run_button",
)
if run_clicked:
with st.spinner("Running…"):
t0 = time.perf_counter()
try:
result = run_pipeline(df_in, sample_pipeline, stop_on_error=False)
except Exception as e:
from src.core.errors import format_for_user
st.error(f"Demo halted: {format_for_user(e)}")
st.stop()
elapsed = time.perf_counter() - t0
st.session_state["demo_result"] = result
st.session_state["demo_elapsed"] = elapsed
st.session_state["demo_persona"] = persona_key
result = st.session_state.get("demo_result")
elapsed = st.session_state.get("demo_elapsed", 0.0)
result_persona = st.session_state.get("demo_persona")
# Reset cached result when persona switches
if result is not None and result_persona != persona_key:
result = None
st.session_state.pop("demo_result", None)
# ---------------------------------------------------------------------------
# AFTER + metrics + CTA
# ---------------------------------------------------------------------------
if result is not None:
st.markdown("---")
st.markdown(
f"#### AFTER — {len(df_in)}{len(result.final_df)} rows · "
f"finished in {elapsed*1000:.0f} ms"
)
# Per-step metric pills
pills_html: list[str] = []
for sr in result.step_results:
if sr.skipped:
continue
if sr.error:
pills_html.append(
f'<span class="metric-pill" style="color:#fbbf24">'
f'{sr.step.tool}: error</span>'
)
continue
s = sr.summary
bits: list[str] = []
if "cells_changed" in s and s["cells_changed"]:
bits.append(f"{s['cells_changed']} cells")
if "sentinels_standardized" in s and s["sentinels_standardized"]:
bits.append(f"{s['sentinels_standardized']} sentinels")
if "duplicates_removed" in s and s["duplicates_removed"]:
bits.append(f"{s['duplicates_removed']} dupes merged")
if "columns_renamed" in s and s["columns_renamed"]:
bits.append(f"{s['columns_renamed']} renamed")
label = ", ".join(bits) if bits else "no-op"
pills_html.append(
f'<span class="metric-pill">{sr.step.tool}: {label}</span>'
)
st.markdown("".join(pills_html), unsafe_allow_html=True)
st.dataframe(result.final_df.head(10), use_container_width=True, hide_index=True)
# ----- Download with watermark row -----
watermark_row = pd.DataFrame([{
col: f"DataTools demo — buy at {persona['landing']}"
if i == 0 else ""
for i, col in enumerate(result.final_df.columns)
}])
out_df = pd.concat([result.final_df, watermark_row], ignore_index=True)
csv_bytes = out_df.to_csv(index=False).encode("utf-8-sig")
col_dl, col_cta = st.columns([1, 2])
with col_dl:
st.download_button(
"Download cleaned CSV (sample · watermarked)",
data=csv_bytes,
file_name=Path(persona["data_file"]).stem + "_cleaned_demo.csv",
mime="text/csv",
use_container_width=True,
)
with col_cta:
st.markdown(
f"""
<div class="cta-block">
<strong style="font-size: 18px;">Like what you see?</strong><br/>
Run this on YOUR full file — locally. No upload. No row limit. No watermark.<br/>
<a href="{GUMROAD_BASE}?from={persona_key}" rel="noopener">{persona['cta']}</a>
</div>
""",
unsafe_allow_html=True,
)
else:
# Pre-run state — show the buy block at the bottom anyway so the
# CTA is always visible above the fold once the visitor scrolls.
st.markdown(
f"""
<div class="cta-block" style="margin-top: 24px;">
<strong style="font-size: 18px;">Already convinced?</strong><br/>
Skip the demo and grab the full version. One-time payment, no subscription.<br/>
<a href="{GUMROAD_BASE}?from={persona_key}" rel="noopener">{persona['cta']}</a>
</div>
""",
unsafe_allow_html=True,
)
# ---------------------------------------------------------------------------
# Footer trust block
# ---------------------------------------------------------------------------
st.markdown("---")
col_t1, col_t2, col_t3 = st.columns(3)
with col_t1:
st.markdown("**🔒 Runs locally**\n\nThe paid product is desktop-only. Your data never leaves your computer.")
with col_t2:
st.markdown("**📋 Audit trail**\n\nEvery cell change row-logged with old / new / which rule fired.")
with col_t3:
st.markdown("**💰 One-time $49**\n\nNo subscription. Mac · Windows · Linux. Free updates for v1.x.")
st.caption(
f"Demo capped at {DEMO_ROW_CAP} rows · output watermarked with one trailing row · "
"running on free hosting. The paid product is uncapped and runs offline."
)

View File

@@ -1,111 +1,368 @@
"""DataTools Missing Value Handler — stub page."""
"""DataTools Missing Value Handler — Streamlit page."""
from __future__ import annotations
import io
import json
import sys
from pathlib import Path
import pandas as pd
import streamlit as st
_project_root = Path(__file__).resolve().parent.parent.parent.parent
if str(_project_root) not in sys.path:
sys.path.insert(0, str(_project_root))
from src.gui.components import hide_streamlit_chrome, require_normalization_gate
from src.gui.components import (
hide_streamlit_chrome,
pickup_or_upload,
require_normalization_gate,
)
from src.core.missing import (
DEFAULT_SENTINELS,
MissingOptions,
PRESETS,
handle_missing,
profile_missing,
)
hide_streamlit_chrome()
require_normalization_gate()
# ---------------------------------------------------------------------------
# Header
# ---------------------------------------------------------------------------
st.title("🕳️ Missing Value Handler")
st.caption("Detect, analyze, and handle missing values in your data.")
st.info("This tool is under development.")
# ---------------------------------------------------------------------------
# What this tool will do
# ---------------------------------------------------------------------------
st.markdown("""
**Features:**
- Detect disguised nulls (empty strings, "N/A", "n/a", "-", "NULL", "None", etc.)
- Missingness analysis: per-column counts, percentages, and patterns
- Visualize missing data heatmap
- Imputation strategies: drop rows/columns, fill with mean/median/mode, forward-fill, backward-fill
- Custom sentinel value replacement
- Before/after comparison
""")
st.divider()
# ---------------------------------------------------------------------------
# File upload (functional)
# ---------------------------------------------------------------------------
uploaded = st.file_uploader(
"Upload CSV or Excel file",
type=["csv", "tsv", "xlsx", "xls"],
help="Upload a file to preview. Processing is not yet available.",
key="missing_file_upload",
st.caption(
"Detect disguised nulls, profile missingness, and apply imputation or "
"drop strategies. Runs locally — your data never leaves this computer."
)
if uploaded is not None:
import pandas as pd
# ---------------------------------------------------------------------------
# File upload
# ---------------------------------------------------------------------------
uploaded = pickup_or_upload(
label="Upload CSV or Excel file",
key="missing_file_upload",
types=["csv", "tsv", "xlsx", "xls"],
)
if uploaded is None:
st.info("Upload a CSV, TSV, or Excel file to begin.")
st.stop()
@st.cache_data(show_spinner=False)
def _read_uploaded(name: str, data: bytes) -> pd.DataFrame:
"""Read the uploaded bytes into a DataFrame.
Unlike the text cleaner, we do *not* force ``dtype=str`` here: missing-
value handling is more useful when numeric columns are typed correctly
(so mean / median / interpolate work without manual coercion).
Sentinel strings are still detected because they survive in object
columns where any cell is non-numeric.
"""
suffix = Path(name).suffix.lower()
bio = io.BytesIO(data)
if suffix in (".xlsx", ".xls"):
return pd.read_excel(bio)
for enc in ("utf-8", "utf-8-sig", "latin-1"):
try:
if uploaded.name.endswith((".xlsx", ".xls")):
df = pd.read_excel(uploaded)
else:
df = pd.read_csv(uploaded)
st.subheader(f"Preview: {uploaded.name}")
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
st.dataframe(df.head(10), use_container_width=True)
except Exception as e:
bio.seek(0)
sep = "\t" if suffix == ".tsv" else ","
return pd.read_csv(bio, encoding=enc, sep=sep, on_bad_lines="warn")
except UnicodeDecodeError:
continue
bio.seek(0)
return pd.read_csv(bio, encoding="latin-1")
try:
df = _read_uploaded(uploaded.name, uploaded.getvalue())
except Exception as e:
from src.core.errors import format_for_user
st.error(
f"**Could not read `{uploaded.name}`**\n\n"
f"```\n{format_for_user(e)}\n```"
)
st.stop()
# ---------------------------------------------------------------------------
# Placeholder options
# ---------------------------------------------------------------------------
st.subheader("Detection Settings")
st.text_input(
"Null patterns (comma-separated)",
value="N/A, n/a, NA, -, NULL, None, empty, .",
disabled=True,
help="Values to treat as missing.",
)
st.subheader("Handling Strategy")
st.selectbox("Strategy", [
"Drop rows with any missing",
"Drop rows above threshold",
"Fill with mean (numeric)",
"Fill with median (numeric)",
"Fill with mode (categorical)",
"Forward-fill",
"Backward-fill",
"Custom value",
], disabled=True)
st.slider("Drop threshold (%)", 0, 100, 50, disabled=True, help="Drop rows missing more than this % of columns.")
st.subheader(f"Preview: {uploaded.name}")
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
st.dataframe(df.head(10), use_container_width=True)
st.divider()
st.button("Handle Missing Values", type="primary", use_container_width=True, disabled=True)
# ---------------------------------------------------------------------------
# Footer
# Initial profile (read-only)
# ---------------------------------------------------------------------------
st.subheader("Missingness profile")
initial_profile = profile_missing(df, MissingOptions())
prof_df = initial_profile.to_dataframe()
m1, m2, m3, m4 = st.columns(4)
m1.metric("Rows", initial_profile.rows_total)
m2.metric("Cells missing", initial_profile.cells_missing)
m3.metric("% cells missing", f"{initial_profile.cells_missing_pct:.1f}%")
m4.metric("Complete rows", initial_profile.rows_complete)
st.dataframe(prof_df, use_container_width=True, hide_index=True)
if initial_profile.cells_missing == 0:
st.success("No missing values or disguised nulls detected. Nothing to handle.")
st.divider()
# ---------------------------------------------------------------------------
# Options
# ---------------------------------------------------------------------------
st.subheader("Strategy")
preset_label = st.radio(
"Preset",
[
"detect-only (standardize sentinels to NaN, no fill or drop)",
"safe-fill (numeric → median, categorical → mode)",
"drop-incomplete (drop any row with missing)",
],
index=0,
help=(
"detect-only: replace 'N/A', '-', 'NULL', etc. with real NaN, then stop. "
"safe-fill: also fill — numeric columns with median, others with mode. "
"drop-incomplete: also drop every row that has any missing cell."
),
)
preset_key = preset_label.split(" ", 1)[0]
options = MissingOptions.from_preset(preset_key)
with st.expander("Advanced options"):
col_a, col_b = st.columns(2)
with col_a:
st.markdown("**Detection**")
options.standardize_sentinels = st.checkbox(
"Standardize disguised nulls to NaN",
value=options.standardize_sentinels,
help="Replace 'N/A', '-', 'NULL', whitespace-only cells, etc. with real NaN.",
)
sentinels_text = st.text_input(
"Sentinel values (comma-separated)",
value=", ".join(options.sentinels),
disabled=not options.standardize_sentinels,
help="Matched case-insensitively after stripping whitespace.",
)
options.sentinels = [
s.strip() for s in sentinels_text.split(",") if s.strip()
]
with col_b:
st.markdown("**Strategy override**")
strat_options = [
"(use preset)",
"none", "drop_row", "drop_col", "drop_both",
"mean", "median", "mode", "constant",
"ffill", "bfill", "interpolate",
]
strat_choice = st.selectbox(
"Global strategy",
strat_options,
index=0,
help=(
"drop_row / drop_col use the thresholds below. "
"mean / median / interpolate are numeric only — non-numeric "
"columns fall back to the categorical strategy."
),
)
if strat_choice != "(use preset)":
options.strategy = strat_choice # type: ignore[assignment]
cat_strat = st.selectbox(
"Categorical fallback (for non-numeric columns)",
["mode", "constant", "ffill", "bfill", "none"],
index=0,
)
options.categorical_strategy = cat_strat # type: ignore[assignment]
if options.strategy == "constant" or cat_strat == "constant":
fill_val = st.text_input(
"Constant fill value",
value="",
help="Used when strategy = constant. Leave blank to fill with empty string.",
)
options.fill_value = fill_val
st.markdown("**Drop thresholds**")
col_c, col_d = st.columns(2)
with col_c:
options.row_drop_threshold = st.slider(
"Row drop threshold (drop rows with ≥ this fraction missing across selected cols)",
0.0, 1.0, options.row_drop_threshold, 0.05,
)
with col_d:
options.col_drop_threshold = st.slider(
"Column drop threshold (drop columns with ≥ this fraction missing)",
0.0, 1.0, options.col_drop_threshold, 0.05,
)
st.markdown("**Scope**")
selected_cols = st.multiselect(
"Columns to handle (default: all)",
options=list(df.columns),
default=list(df.columns),
)
skip_cols = st.multiselect(
"Columns to skip",
options=list(df.columns),
default=[],
)
options.columns = selected_cols if selected_cols else None
options.skip_columns = list(skip_cols)
st.markdown("**Per-column strategy overrides** (optional)")
st.caption(
"Set a different strategy for specific columns. Leave any row blank to "
"use the global strategy."
)
per_col_overrides: dict[str, str] = {}
only_missing_cols = [
r.column for r in initial_profile.columns if r.has_missing
]
if only_missing_cols:
edit_df = pd.DataFrame({
"column": only_missing_cols,
"strategy": ["" for _ in only_missing_cols],
})
edited = st.data_editor(
edit_df,
use_container_width=True,
hide_index=True,
column_config={
"column": st.column_config.TextColumn("Column", disabled=True),
"strategy": st.column_config.SelectboxColumn(
"Override",
options=[
"", "drop_row", "drop_col",
"mean", "median", "mode", "constant",
"ffill", "bfill", "interpolate",
],
),
},
key="missing_per_col_editor",
)
for _, row in edited.iterrows():
if row["strategy"]:
per_col_overrides[row["column"]] = row["strategy"]
options.column_strategies = per_col_overrides # type: ignore[assignment]
# ---------------------------------------------------------------------------
# Run
# ---------------------------------------------------------------------------
st.divider()
st.caption(
"Runs locally. Your data never leaves this computer. "
"| DataTools v3.0"
)
if st.button("Handle Missing Values", type="primary", use_container_width=True):
with st.spinner("Handling..."):
try:
result = handle_missing(df, options)
except (ValueError, OSError) as e:
from src.core.errors import format_for_user
st.error(format_for_user(e))
st.stop()
st.session_state["missing_result"] = result
st.session_state["missing_input_name"] = uploaded.name
st.session_state["missing_options"] = options.to_dict()
result = st.session_state.get("missing_result")
if result is None:
st.info("Choose a strategy and click **Handle Missing Values** to run.")
st.stop()
# ---------------------------------------------------------------------------
# Results
# ---------------------------------------------------------------------------
st.subheader("Results")
m1, m2, m3, m4 = st.columns(4)
m1.metric("Sentinels → NaN", result.sentinels_standardized)
m2.metric("Cells filled", result.cells_filled)
m3.metric("Rows dropped", result.rows_dropped)
m4.metric("Columns dropped", len(result.columns_dropped))
if result.columns_dropped:
st.warning(f"Dropped columns: {', '.join(result.columns_dropped)}")
st.markdown("**Missingness — before vs. after**")
before = result.profile_before.to_dataframe().set_index("column")[
["missing", "missing_pct"]
].rename(columns={"missing": "before_missing", "missing_pct": "before_pct"})
after = result.profile_after.to_dataframe().set_index("column")[
["missing", "missing_pct"]
].rename(columns={"missing": "after_missing", "missing_pct": "after_pct"})
combined = before.join(after, how="outer").fillna(0)
st.dataframe(combined, use_container_width=True)
if result.strategy_per_column:
st.markdown("**Strategy applied per column**")
strat_df = pd.DataFrame(
[{"column": c, "strategy": s} for c, s in result.strategy_per_column.items()]
)
st.dataframe(strat_df, use_container_width=True, hide_index=True)
if not result.changes.empty:
st.markdown("**Audit (first 50 changes)**")
audit_view = result.changes.head(50).copy()
audit_view["row"] = audit_view["row"].apply(lambda x: "" if x == -1 else x + 1)
st.dataframe(audit_view, use_container_width=True, hide_index=True)
if len(result.changes) > 50:
st.caption(f"… and {len(result.changes) - 50} more (download the full audit below).")
st.markdown("**Handled preview (first 10 rows)**")
st.dataframe(result.handled_df.head(10), use_container_width=True)
# ---------------------------------------------------------------------------
# Downloads
# ---------------------------------------------------------------------------
st.divider()
stem = Path(st.session_state.get("missing_input_name", "input")).stem
dl_a, dl_b, dl_c = st.columns(3)
with dl_a:
handled_bytes = result.handled_df.to_csv(index=False).encode("utf-8-sig")
st.download_button(
"Download handled CSV",
data=handled_bytes,
file_name=f"{stem}_missing.csv",
mime="text/csv",
)
with dl_b:
if not result.changes.empty:
changes_bytes = result.changes.to_csv(index=False).encode("utf-8-sig")
st.download_button(
"Download changes audit",
data=changes_bytes,
file_name=f"{stem}_missing_changes.csv",
mime="text/csv",
)
with dl_c:
config_bytes = json.dumps(
st.session_state.get("missing_options", {}), indent=2, default=str,
).encode("utf-8")
st.download_button(
"Download config JSON",
data=config_bytes,
file_name="missing_config.json",
mime="application/json",
)
st.divider()
st.caption("Runs locally. Your data never leaves this computer. | DataTools v3.0")

View File

@@ -1,102 +1,413 @@
"""DataTools Column Mapper — stub page."""
"""DataTools Column Mapper — Streamlit page."""
from __future__ import annotations
import io
import json
import sys
from pathlib import Path
import pandas as pd
import streamlit as st
_project_root = Path(__file__).resolve().parent.parent.parent.parent
if str(_project_root) not in sys.path:
sys.path.insert(0, str(_project_root))
from src.gui.components import hide_streamlit_chrome, require_normalization_gate
from src.gui.components import (
hide_streamlit_chrome,
pickup_or_upload,
require_normalization_gate,
)
from src.core.column_mapper import (
MapOptions,
PRESETS,
TargetField,
TargetSchema,
infer_mapping,
map_columns,
)
hide_streamlit_chrome()
require_normalization_gate()
# ---------------------------------------------------------------------------
# Header
# ---------------------------------------------------------------------------
st.title("🗂️ Column Mapper")
st.caption("Rename columns, enforce a target schema, and coerce types.")
st.info("This tool is under development.")
# ---------------------------------------------------------------------------
# What this tool will do
# ---------------------------------------------------------------------------
st.markdown("""
**Features:**
- Rename columns via interactive mapping table
- Load a target schema (JSON/CSV) to auto-map columns
- Fuzzy column name matching for automatic suggestions
- Type coercion (string → int, string → date, etc.)
- Drop unmapped columns or keep as-is
- Reorder columns to match target schema
""")
st.divider()
# ---------------------------------------------------------------------------
# File upload (functional)
# ---------------------------------------------------------------------------
uploaded = st.file_uploader(
"Upload CSV or Excel file",
type=["csv", "tsv", "xlsx", "xls"],
help="Upload a file to preview. Processing is not yet available.",
key="colmap_file_upload",
st.caption(
"Rename columns, enforce a target schema, and coerce types. Runs locally — "
"your data never leaves this computer."
)
if uploaded is not None:
import pandas as pd
try:
if uploaded.name.endswith((".xlsx", ".xls")):
df = pd.read_excel(uploaded)
else:
df = pd.read_csv(uploaded)
st.subheader(f"Preview: {uploaded.name}")
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
st.dataframe(df.head(10), use_container_width=True)
st.subheader("Column Mapping")
st.caption("Map source columns to target names. (Interactive mapping coming soon.)")
mapping_data = pd.DataFrame({
"Source Column": df.columns.tolist(),
"Target Column": df.columns.tolist(),
"Type": ["auto"] * len(df.columns),
})
st.dataframe(mapping_data, use_container_width=True, hide_index=True)
except Exception as e:
# ---------------------------------------------------------------------------
# File upload
# ---------------------------------------------------------------------------
uploaded = pickup_or_upload(
label="Upload CSV or Excel file",
key="colmap_file_upload",
types=["csv", "tsv", "xlsx", "xls"],
)
if uploaded is None:
st.info("Upload a CSV, TSV, or Excel file to begin.")
st.stop()
@st.cache_data(show_spinner=False)
def _read_uploaded(name: str, data: bytes) -> pd.DataFrame:
suffix = Path(name).suffix.lower()
bio = io.BytesIO(data)
if suffix in (".xlsx", ".xls"):
return pd.read_excel(bio)
for enc in ("utf-8", "utf-8-sig", "latin-1"):
try:
bio.seek(0)
sep = "\t" if suffix == ".tsv" else ","
return pd.read_csv(bio, encoding=enc, sep=sep, on_bad_lines="warn")
except UnicodeDecodeError:
continue
bio.seek(0)
return pd.read_csv(bio, encoding="latin-1")
try:
df = _read_uploaded(uploaded.name, uploaded.getvalue())
except Exception as e:
from src.core.errors import format_for_user
st.error(
f"**Could not read `{uploaded.name}`**\n\n"
f"```\n{format_for_user(e)}\n```"
)
st.stop()
# ---------------------------------------------------------------------------
# Placeholder options
# ---------------------------------------------------------------------------
st.subheader("Schema Options")
st.file_uploader("Load target schema (JSON)", type=["json"], disabled=True, key="colmap_schema")
st.checkbox("Drop unmapped columns", value=False, disabled=True)
st.checkbox("Reorder to match schema", value=True, disabled=True)
st.subheader(f"Preview: {uploaded.name}")
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
st.dataframe(df.head(10), use_container_width=True)
st.divider()
st.button("Apply Column Mapping", type="primary", use_container_width=True, disabled=True)
# ---------------------------------------------------------------------------
# Footer
# Schema input
# ---------------------------------------------------------------------------
st.divider()
st.caption(
"Runs locally. Your data never leaves this computer. "
"| DataTools v3.0"
st.subheader("Target schema")
schema_mode = st.radio(
"How would you like to define the target schema?",
[
"Build interactively (start from current columns)",
"Upload schema JSON",
"Skip (rename / coerce only — no schema)",
],
index=0,
help=(
"An interactive build is fastest for one-off cleanup. Upload a JSON "
"when you have a fixed contract (a CRM import format, db schema). "
"Skip when you only want to rename or coerce specific columns."
),
)
schema: TargetSchema | None = None
if schema_mode.startswith("Upload"):
schema_file = st.file_uploader(
"Schema JSON",
type=["json"],
key="colmap_schema_upload",
help='Format: {"fields": [{"name": "email", "dtype": "string", "required": true, "aliases": ["EmailAddr"]}, ...]}',
)
if schema_file is not None:
try:
schema = TargetSchema.from_dict(json.loads(schema_file.getvalue()))
st.success(f"Loaded {len(schema.fields)} target field(s).")
except Exception as e:
from src.core.errors import format_for_user
st.error(f"**Could not parse schema**\n\n```\n{format_for_user(e)}\n```")
elif schema_mode.startswith("Build"):
st.caption(
"Edit the table to define your target schema. Add rows for fields the "
"input doesn't have yet (with a default), or remove rows for columns "
"you want to drop."
)
initial = pd.DataFrame({
"name": list(df.columns),
"dtype": ["auto"] * len(df.columns),
"required": [False] * len(df.columns),
"default": [""] * len(df.columns),
"aliases": [""] * len(df.columns),
})
edited = st.data_editor(
initial,
use_container_width=True,
num_rows="dynamic",
column_config={
"name": st.column_config.TextColumn("Target name"),
"dtype": st.column_config.SelectboxColumn(
"Type",
options=[
"auto", "string", "integer", "float",
"boolean", "date", "datetime", "category",
],
),
"required": st.column_config.CheckboxColumn("Required"),
"default": st.column_config.TextColumn("Default (for added cols)"),
"aliases": st.column_config.TextColumn(
"Aliases (comma-sep, helps fuzzy-match)",
),
},
key="colmap_schema_editor",
)
fields: list[TargetField] = []
for _, row in edited.iterrows():
name = str(row.get("name", "")).strip()
if not name:
continue
aliases = [
a.strip() for a in str(row.get("aliases", "") or "").split(",")
if a.strip()
]
default_raw = row.get("default")
default_val = (
default_raw if (default_raw not in (None, "", float("nan")))
else None
)
try:
if isinstance(default_val, float) and pd.isna(default_val):
default_val = None
except TypeError:
pass
fields.append(TargetField(
name=name,
dtype=str(row.get("dtype", "auto")), # type: ignore[arg-type]
required=bool(row.get("required", False)),
aliases=aliases,
default=default_val,
))
if fields:
schema = TargetSchema(fields=fields)
st.divider()
# ---------------------------------------------------------------------------
# Strategy
# ---------------------------------------------------------------------------
st.subheader("Strategy")
preset_label = st.radio(
"Preset",
[
"rename-only (just rename, leave types alone, keep extras)",
"lenient-schema (rename + coerce + reorder, keep extras)",
"strict-schema (rename + coerce + reorder, drop extras)",
],
index=0,
)
preset_key = preset_label.split(" ", 1)[0]
options = MapOptions.from_preset(preset_key)
options.schema = schema
with st.expander("Advanced options"):
col_a, col_b = st.columns(2)
with col_a:
options.unmapped = st.selectbox( # type: ignore[assignment]
"Unmapped source columns",
["keep", "drop", "error"],
index=["keep", "drop", "error"].index(options.unmapped),
)
options.coerce_types = st.checkbox(
"Coerce types per schema", value=options.coerce_types,
)
options.reorder_to_schema = st.checkbox(
"Reorder to schema order", value=options.reorder_to_schema,
)
with col_b:
options.auto_infer = st.checkbox(
"Auto-infer mapping (fuzzy match)", value=options.auto_infer,
)
options.fuzzy_threshold = st.slider(
"Fuzzy match threshold", 0.0, 1.0, options.fuzzy_threshold, 0.05,
)
options.enforce_required = st.checkbox(
"Enforce required fields", value=options.enforce_required,
)
# ---------------------------------------------------------------------------
# Mapping editor — show inferred and let user override
# ---------------------------------------------------------------------------
st.subheader("Mapping")
if schema is None:
st.caption(
"No schema — define explicit renames below (left blank means keep "
"the source name)."
)
rename_initial = pd.DataFrame({
"source": list(df.columns),
"target": list(df.columns),
})
rename_edited = st.data_editor(
rename_initial,
use_container_width=True,
column_config={
"source": st.column_config.TextColumn("Source", disabled=True),
"target": st.column_config.TextColumn("Target"),
},
hide_index=True,
key="colmap_rename_only_editor",
)
explicit_mapping: dict[str, str] = {}
for _, row in rename_edited.iterrows():
src = str(row["source"])
tgt = str(row["target"]).strip()
if tgt and tgt != src:
explicit_mapping[src] = tgt
options.mapping = explicit_mapping
else:
inferred = (
infer_mapping(df, schema, threshold=options.fuzzy_threshold)
if options.auto_infer else {}
)
target_options = ["(unmapped)"] + schema.field_names()
map_initial = pd.DataFrame({
"source": list(df.columns),
"target": [inferred.get(c, "(unmapped)") for c in df.columns],
"auto": [c in inferred for c in df.columns],
})
map_edited = st.data_editor(
map_initial,
use_container_width=True,
column_config={
"source": st.column_config.TextColumn("Source", disabled=True),
"target": st.column_config.SelectboxColumn(
"Target", options=target_options,
),
"auto": st.column_config.CheckboxColumn("Auto-suggested", disabled=True),
},
hide_index=True,
key="colmap_schema_mapping_editor",
)
explicit_mapping = {}
for _, row in map_edited.iterrows():
src = str(row["source"])
tgt = str(row["target"])
if tgt and tgt != "(unmapped)":
explicit_mapping[src] = tgt
options.mapping = explicit_mapping
# Disable auto-infer for the actual run since the editor already shows
# the user's resolved choices (they can manually re-select to add).
options.auto_infer = False
# ---------------------------------------------------------------------------
# Run
# ---------------------------------------------------------------------------
st.divider()
if st.button("Apply Column Mapping", type="primary", use_container_width=True):
with st.spinner("Mapping..."):
try:
result = map_columns(df, options)
except (ValueError, OSError) as e:
from src.core.errors import format_for_user
st.error(format_for_user(e))
st.stop()
st.session_state["colmap_result"] = result
st.session_state["colmap_input_name"] = uploaded.name
st.session_state["colmap_options"] = options.to_dict()
result = st.session_state.get("colmap_result")
if result is None:
st.info("Configure a mapping and click **Apply Column Mapping** to run.")
st.stop()
# ---------------------------------------------------------------------------
# Results
# ---------------------------------------------------------------------------
st.subheader("Results")
m1, m2, m3, m4 = st.columns(4)
m1.metric("Renamed", result.columns_renamed)
m2.metric("Dropped", len(result.columns_dropped))
m3.metric("Added", len(result.columns_added))
m4.metric(
"Coerce fails",
sum(result.coercion_failures.values()) if result.coercion_failures else 0,
)
if result.columns_dropped:
st.warning(f"Dropped columns: {', '.join(result.columns_dropped)}")
if result.columns_added:
st.info(f"Added (with defaults): {', '.join(result.columns_added)}")
if result.coercion_failures:
st.warning(
"Some cells could not be coerced and were left as NaN: "
+ ", ".join(f"{c} ({n})" for c, n in result.coercion_failures.items())
)
if result.mapping:
st.markdown("**Resolved mapping**")
map_df = pd.DataFrame(
[
{"source": s, "target": t, "auto": s in result.inferred_pairs}
for s, t in result.mapping.items()
],
)
st.dataframe(map_df, use_container_width=True, hide_index=True)
st.markdown("**Mapped preview (first 10 rows)**")
st.dataframe(result.mapped_df.head(10), use_container_width=True)
# ---------------------------------------------------------------------------
# Downloads
# ---------------------------------------------------------------------------
st.divider()
stem = Path(st.session_state.get("colmap_input_name", "input")).stem
dl_a, dl_b, dl_c = st.columns(3)
with dl_a:
mapped_bytes = result.mapped_df.to_csv(index=False).encode("utf-8-sig")
st.download_button(
"Download mapped CSV",
data=mapped_bytes,
file_name=f"{stem}_mapped.csv",
mime="text/csv",
)
with dl_b:
audit_bytes = json.dumps({
"mapping": result.mapping,
"inferred_pairs": result.inferred_pairs,
"columns_renamed": result.columns_renamed,
"columns_dropped": result.columns_dropped,
"columns_added": result.columns_added,
"coercion_failures": result.coercion_failures,
"unmapped_kept": result.unmapped_kept,
"missing_required_targets": result.missing_required_targets,
}, indent=2, default=str).encode("utf-8")
st.download_button(
"Download mapping audit",
data=audit_bytes,
file_name=f"{stem}_mapping.json",
mime="application/json",
)
with dl_c:
config_bytes = json.dumps(
st.session_state.get("colmap_options", {}), indent=2, default=str,
).encode("utf-8")
st.download_button(
"Download config JSON",
data=config_bytes,
file_name="column_map_config.json",
mime="application/json",
)
st.divider()
st.caption("Runs locally. Your data never leaves this computer. | DataTools v3.0")

View File

@@ -1,104 +1,370 @@
"""DataTools Pipeline Runner — stub page."""
"""DataTools Pipeline Runner — Streamlit page."""
from __future__ import annotations
import io
import json
import sys
from pathlib import Path
import pandas as pd
import streamlit as st
_project_root = Path(__file__).resolve().parent.parent.parent.parent
if str(_project_root) not in sys.path:
sys.path.insert(0, str(_project_root))
from src.gui.components import hide_streamlit_chrome, require_normalization_gate
from src.gui.components import (
hide_streamlit_chrome,
pickup_or_upload,
require_normalization_gate,
)
from src.core.pipeline import (
Pipeline,
SOFT_DEPENDENCIES,
Step,
TOOL_NAMES,
recommended_pipeline,
run_pipeline,
validate_pipeline,
)
hide_streamlit_chrome()
require_normalization_gate()
# ---------------------------------------------------------------------------
# Header
# ---------------------------------------------------------------------------
st.title("⚙️ Pipeline Runner")
st.caption("Chain tools in sequence and pass output between steps automatically.")
st.info("This tool is under development.")
# ---------------------------------------------------------------------------
# What this tool will do
# ---------------------------------------------------------------------------
st.markdown("""
**Features:**
- Select tools to run in sequence
- Recommended order: Text Cleaner → Format Standardizer → Missing Values → Deduplicator → Validator
- Each step's output feeds into the next step's input
- Per-step configuration overrides
- Progress tracking across all steps
- Final combined report
""")
st.divider()
# ---------------------------------------------------------------------------
# File upload (functional)
# ---------------------------------------------------------------------------
uploaded = st.file_uploader(
"Upload CSV or Excel file",
type=["csv", "tsv", "xlsx", "xls"],
help="Upload a file to preview. Processing is not yet available.",
key="pipeline_file_upload",
st.caption(
"Chain DataTools cleaning steps into one repeatable workflow. The "
"pipeline recommends an order; you stay in control."
)
if uploaded is not None:
import pandas as pd
# ---------------------------------------------------------------------------
# File upload
# ---------------------------------------------------------------------------
uploaded = pickup_or_upload(
label="Upload CSV or Excel file",
key="pipeline_file_upload",
types=["csv", "tsv", "xlsx", "xls"],
)
if uploaded is None:
st.info("Upload a CSV, TSV, or Excel file to begin.")
st.stop()
@st.cache_data(show_spinner=False)
def _read_uploaded(name: str, data: bytes) -> pd.DataFrame:
suffix = Path(name).suffix.lower()
bio = io.BytesIO(data)
if suffix in (".xlsx", ".xls"):
return pd.read_excel(bio)
for enc in ("utf-8", "utf-8-sig", "latin-1"):
try:
if uploaded.name.endswith((".xlsx", ".xls")):
df = pd.read_excel(uploaded)
else:
df = pd.read_csv(uploaded)
st.subheader(f"Preview: {uploaded.name}")
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
st.dataframe(df.head(10), use_container_width=True)
except Exception as e:
bio.seek(0)
sep = "\t" if suffix == ".tsv" else ","
return pd.read_csv(bio, encoding=enc, sep=sep, on_bad_lines="warn")
except UnicodeDecodeError:
continue
bio.seek(0)
return pd.read_csv(bio, encoding="latin-1")
try:
df = _read_uploaded(uploaded.name, uploaded.getvalue())
except Exception as e:
from src.core.errors import format_for_user
st.error(
f"**Could not read `{uploaded.name}`**\n\n"
f"```\n{format_for_user(e)}\n```"
)
st.stop()
# ---------------------------------------------------------------------------
# Pipeline steps (checklist)
# ---------------------------------------------------------------------------
st.subheader("Pipeline Steps")
st.caption("Select tools to include in the pipeline (recommended order):")
st.checkbox("1. Text Cleaner", value=True, disabled=True)
st.checkbox("2. Format Standardizer", value=True, disabled=True)
st.checkbox("3. Missing Value Handler", value=True, disabled=True)
st.checkbox("4. Column Mapper", value=False, disabled=True)
st.checkbox("5. Outlier Detector", value=False, disabled=True)
st.checkbox("6. Deduplicator", value=True, disabled=True)
st.checkbox("7. Multi-File Merger", value=False, disabled=True)
st.checkbox("8. Validator & Reporter", value=True, disabled=True)
st.subheader("Pipeline Configuration")
st.selectbox("On error", ["Stop pipeline", "Skip step and continue", "Prompt for decision"], disabled=True)
st.checkbox("Generate combined report at end", value=True, disabled=True)
st.subheader(f"Preview: {uploaded.name}")
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
st.dataframe(df.head(10), use_container_width=True)
st.divider()
st.button("Run Pipeline", type="primary", use_container_width=True, disabled=True)
# ---------------------------------------------------------------------------
# Footer
# Pipeline builder
# ---------------------------------------------------------------------------
st.divider()
st.caption(
"Runs locally. Your data never leaves this computer. "
"| DataTools v3.0"
st.subheader("Pipeline")
mode = st.radio(
"How would you like to define the pipeline?",
[
"Use the recommended default (text-clean → format → missing → dedup)",
"Build interactively",
"Upload a saved pipeline JSON",
],
index=0,
)
if "pipeline_rows" not in st.session_state:
default = recommended_pipeline()
st.session_state["pipeline_rows"] = pd.DataFrame([
{
"tool": s.tool, "enabled": s.enabled,
"options_json": json.dumps(s.options),
}
for s in default.steps
])
if mode.startswith("Use the recommended"):
default = recommended_pipeline()
st.session_state["pipeline_rows"] = pd.DataFrame([
{
"tool": s.tool, "enabled": s.enabled,
"options_json": json.dumps(s.options),
}
for s in default.steps
])
elif mode.startswith("Upload"):
pipeline_file = st.file_uploader(
"Pipeline JSON", type=["json"], key="pipeline_upload",
)
if pipeline_file is not None:
try:
data = json.loads(pipeline_file.getvalue())
uploaded_pipe = Pipeline.from_dict(data)
st.session_state["pipeline_rows"] = pd.DataFrame([
{
"tool": s.tool, "enabled": s.enabled,
"options_json": json.dumps(s.options),
}
for s in uploaded_pipe.steps
])
st.success(f"Loaded {len(uploaded_pipe.steps)} step(s).")
except Exception as e:
from src.core.errors import format_for_user
st.error(f"**Could not parse pipeline**\n\n```\n{format_for_user(e)}\n```")
st.caption(
"Edit the table to add, remove, reorder (drag the row index), enable, "
"or configure each step. Tool order is recommended, not enforced — "
"violations surface as warnings below the table."
)
edited = st.data_editor(
st.session_state["pipeline_rows"],
use_container_width=True,
num_rows="dynamic",
column_config={
"tool": st.column_config.SelectboxColumn(
"Tool", options=TOOL_NAMES, required=True,
),
"enabled": st.column_config.CheckboxColumn("Enabled"),
"options_json": st.column_config.TextColumn(
"Options (JSON)",
help='e.g. {"column_types": {"phone": "phone"}}',
),
},
key="pipeline_editor",
)
st.session_state["pipeline_rows"] = edited
# Build a Pipeline object from the editor state.
steps_list: list[Step] = []
parse_errors: list[str] = []
for i, row in edited.iterrows():
tool = row.get("tool")
if not tool or pd.isna(tool):
continue
raw_opts = row.get("options_json") or "{}"
if pd.isna(raw_opts):
raw_opts = "{}"
try:
opts = json.loads(raw_opts) if isinstance(raw_opts, str) else dict(raw_opts)
if not isinstance(opts, dict):
raise ValueError("options must be a JSON object")
except Exception as e:
parse_errors.append(f"Step {i + 1}: {e}")
continue
try:
steps_list.append(Step(
tool=str(tool),
options=opts,
enabled=bool(row.get("enabled", True)),
))
except Exception as e:
parse_errors.append(f"Step {i + 1}: {e}")
if parse_errors:
for err in parse_errors:
st.error(err)
current_pipeline = Pipeline(steps=steps_list) if steps_list else None
if current_pipeline is not None:
warnings = validate_pipeline(current_pipeline)
if warnings:
st.warning(
"Pipeline is out of recommended order:\n\n"
+ "\n".join(f"- {w}" for w in warnings)
+ "\n\nThe pipeline will still run — these are recommendations only."
)
with st.expander("Recommended tool order — why each step belongs where it does"):
st.markdown(
"\n".join(
f"- **{e}** before **{l}** — {why}"
for e, l, why in SOFT_DEPENDENCIES
)
)
st.divider()
# ---------------------------------------------------------------------------
# Run
# ---------------------------------------------------------------------------
run_disabled = current_pipeline is None or not current_pipeline.steps
if st.button(
"Run Pipeline",
type="primary",
use_container_width=True,
disabled=run_disabled,
):
progress = st.progress(0.0, text="Starting...")
log_box = st.empty()
log_lines: list[str] = []
total_enabled = sum(1 for s in current_pipeline.steps if s.enabled)
completed = [0]
def _on_step(sr) -> None:
completed[0] += 1
if sr.skipped:
log_lines.append(f"{sr.step.display_name()} (skipped)")
elif sr.error:
log_lines.append(
f"{sr.step.display_name()}{sr.error.splitlines()[0]}"
)
else:
log_lines.append(
f"{sr.step.display_name()}{sr.elapsed_seconds*1000:.0f} ms"
)
log_box.markdown("\n".join(log_lines))
progress.progress(
completed[0] / max(total_enabled, 1),
text=f"Step {completed[0]}/{total_enabled}",
)
try:
result = run_pipeline(
df, current_pipeline,
on_step_complete=_on_step,
stop_on_error=False,
)
except Exception as e:
from src.core.errors import format_for_user
st.error(f"**Pipeline halted**\n\n```\n{format_for_user(e)}\n```")
st.stop()
progress.progress(1.0, text="Done")
st.session_state["pipeline_result"] = result
st.session_state["pipeline_input_name"] = uploaded.name
result = st.session_state.get("pipeline_result")
if result is None:
st.info(
"Configure the pipeline above and click **Run Pipeline** to "
"execute it on your file."
)
st.stop()
# ---------------------------------------------------------------------------
# Results
# ---------------------------------------------------------------------------
st.subheader("Results")
m1, m2, m3, m4 = st.columns(4)
m1.metric("Initial rows", result.initial_rows)
m2.metric("Final rows", result.final_rows)
m3.metric("Steps run", sum(1 for s in result.step_results if not s.skipped))
m4.metric("Elapsed", f"{result.total_elapsed:.2f} s")
st.markdown("**Per-step summary**")
step_df = pd.DataFrame([
{
"step": sr.step.display_name(),
"status": (
"skipped" if sr.skipped
else "error" if sr.error
else "ok"
),
"elapsed_ms": int(sr.elapsed_seconds * 1000),
"summary": json.dumps(sr.summary, default=str)[:200],
"error": sr.error or "",
}
for sr in result.step_results
])
st.dataframe(step_df, use_container_width=True, hide_index=True)
st.markdown("**Output preview (first 10 rows)**")
st.dataframe(result.final_df.head(10), use_container_width=True)
# ---------------------------------------------------------------------------
# Downloads
# ---------------------------------------------------------------------------
st.divider()
stem = Path(st.session_state.get("pipeline_input_name", "input")).stem
dl_a, dl_b, dl_c = st.columns(3)
with dl_a:
bytes_csv = result.final_df.to_csv(index=False).encode("utf-8-sig")
st.download_button(
"Download cleaned CSV",
data=bytes_csv,
file_name=f"{stem}_pipeline.csv",
mime="text/csv",
)
with dl_b:
pipeline_bytes = json.dumps(
current_pipeline.to_dict() if current_pipeline else {"steps": []},
indent=2, default=str,
).encode("utf-8")
st.download_button(
"Download pipeline JSON",
data=pipeline_bytes,
file_name="pipeline.json",
mime="application/json",
help="Save this and pass --pipeline pipeline.json to the CLI to re-run on next week's file.",
)
with dl_c:
audit_bytes = json.dumps({
"warnings": result.warnings,
"initial_rows": result.initial_rows,
"final_rows": result.final_rows,
"total_elapsed_seconds": result.total_elapsed,
"steps": [
{
"tool": sr.step.tool,
"name": sr.step.display_name(),
"enabled": sr.step.enabled,
"skipped": sr.skipped,
"elapsed_seconds": sr.elapsed_seconds,
"summary": sr.summary,
"error": sr.error,
}
for sr in result.step_results
],
}, indent=2, default=str).encode("utf-8")
st.download_button(
"Download run audit",
data=audit_bytes,
file_name=f"{stem}_pipeline_audit.json",
mime="application/json",
)
st.divider()
st.caption("Runs locally. Your data never leaves this computer. | DataTools v3.0")

View File

@@ -78,7 +78,7 @@ TOOLS: list[Tool] = [
"Detect disguised nulls, missingness analysis, and imputation strategies."
),
page_slug="4_Missing_Values",
status="Coming Soon",
status="Ready",
),
Tool(
tool_id="05_column_mapper",
@@ -86,7 +86,7 @@ TOOLS: list[Tool] = [
name="Column Mapper",
description="Rename columns, enforce a target schema, and coerce types.",
page_slug="5_Column_Mapper",
status="Coming Soon",
status="Ready",
),
Tool(
tool_id="06_outlier_detector",
@@ -125,7 +125,7 @@ TOOLS: list[Tool] = [
"Chain tools in recommended order and pass output between steps."
),
page_slug="9_Pipeline_Runner",
status="Coming Soon",
status="Ready",
),
]

51
streamlit_app.py Normal file
View File

@@ -0,0 +1,51 @@
"""Streamlit Community Cloud entry point — public demo app.
This is the file Streamlit Community Cloud auto-discovers when you
deploy from this repository: leave the "Main file path" field at its
default (``streamlit_app.py``) and it just works.
Why this lives at the repo root, not in ``src/gui/``:
Streamlit auto-detects sibling files inside a ``pages/`` directory
next to the entry script and renders them as additional pages in
the sidebar. The full product GUI's pages live in
``src/gui/pages/`` — pointing the Cloud at ``src/gui/app_demo.py``
would inadvertently expose every paid-product page in the demo's
sidebar (or require URL-routing tricks to suppress them).
Anchoring the entry script at the repo root means there is no
``pages/`` neighbour and the demo stays single-page by
construction.
The actual demo UI is defined once in ``src/gui/app_demo.py`` so
local development still works the way it always did:
streamlit run src/gui/app_demo.py # local dev, identical UX
Cloud deploy uses this shim:
streamlit run streamlit_app.py # what Cloud invokes
"""
from __future__ import annotations
import sys
from pathlib import Path
# Put the repo root on sys.path so ``src.core`` and ``src.gui`` imports
# resolve cleanly. The demo module does this itself for the local-dev
# case, but the import order matters when this shim runs first on Cloud.
_HERE = Path(__file__).resolve().parent
if str(_HERE) not in sys.path:
sys.path.insert(0, str(_HERE))
# Executing the demo module top-to-bottom is the simplest way to share
# the UI between the two entry points without duplicating code or
# refactoring the demo into a function (Streamlit's idiom is
# script-as-page; converting it to a callable would fight the
# framework). ``runpy`` runs the file in this script's namespace so
# Streamlit's ``st.set_page_config`` / element registration sees the
# correct module.
import runpy
runpy.run_path(
str(_HERE / "src" / "gui" / "app_demo.py"),
run_name="__main__",
)

View File

@@ -0,0 +1,23 @@
# Column Mapper — corpus
Acceptance fixtures for `src/core/column_mapper.py`. Each `.csv` under
`test_data/` is paired with assertions in
`tests/test_column_mapper_corpus.py`.
## Use cases (target client profiles)
| File | Buyer profile | Tested behaviour |
|------|---------------|------------------|
| `uc01_crm_import.csv` + `schemas/uc01_crm_target.json` | Sales ops admin importing leads into Salesforce / HubSpot | Schema enforcement: rename via aliases, coerce types, drop extras, add `owner` default. |
| `uc02_vendor_{a,b,c}.csv` + `schemas/uc02_canonical.json` | Operator unifying vendor exports | Multi-source unification: each vendor uses different headers; auto-inference resolves them all. |
| `uc03_type_coercion.csv` + `schemas/uc03_types.json` | Analyst quick-fixing a mistyped CSV | Mixed-type coercion with documented per-column failure counts (bad rows survive as NaN). |
## Edge cases
| File | Stresses |
|------|----------|
| `ec01_duplicate_target.csv` | Mapping two source columns to the same target → InputValidationError. |
| `ec02_unicode_columns.csv` | Non-ASCII column names (Japanese) survive rename and coerce. |
| `ec03_whitespace_headers.csv` | Leading/trailing whitespace in headers still fuzzy-matches the schema. |
| `ec04_no_match.csv` | No source column scores above threshold → empty mapping, fallback unmapped strategy fires. |
| `ec05_required_missing.csv` | Required target field has no source column → InputValidationError unless `enforce_required=False`. |

View File

@@ -0,0 +1,13 @@
{
"fields": [
{"name": "first_name", "dtype": "string", "required": true, "aliases": ["First Name", "fname"]},
{"name": "last_name", "dtype": "string", "required": true, "aliases": ["Last Name", "lname"]},
{"name": "email", "dtype": "string", "required": true, "aliases": ["EmailAddr", "Email", "email_address"]},
{"name": "phone", "dtype": "string", "aliases": ["Phone", "phone_number"]},
{"name": "account_name", "dtype": "string", "aliases": ["Company", "Account"]},
{"name": "annual_rev", "dtype": "integer", "aliases": ["Annual Revenue", "revenue"]},
{"name": "lead_source", "dtype": "category","aliases": ["Lead Source", "source"]},
{"name": "created_date", "dtype": "date", "aliases": ["Created", "create_date"]},
{"name": "owner", "dtype": "string", "default": "unassigned"}
]
}

View File

@@ -0,0 +1,9 @@
{
"fields": [
{"name": "first_name", "dtype": "string", "required": true, "aliases": ["FirstName", "FName", "First Name"]},
{"name": "last_name", "dtype": "string", "required": true, "aliases": ["LastName", "Surname", "Last Name"]},
{"name": "email", "dtype": "string", "required": true, "aliases": ["Email", "E-mail", "email_addr", "EmailAddr"]},
{"name": "phone", "dtype": "string", "aliases": ["Phone Number", "Tel", "phone_number"]},
{"name": "country", "dtype": "string", "aliases": ["Country", "country_code", "Region"]}
]
}

View File

@@ -0,0 +1,10 @@
{
"fields": [
{"name": "id", "dtype": "integer", "required": true},
{"name": "age", "dtype": "integer"},
{"name": "active", "dtype": "boolean"},
{"name": "joined", "dtype": "date"},
{"name": "score", "dtype": "float"},
{"name": "notes", "dtype": "string"}
]
}

View File

@@ -0,0 +1,3 @@
a,b,c
1,2,3
4,5,6
1 a b c
2 1 2 3
3 4 5 6

View File

@@ -0,0 +1,3 @@
名前,Email,価格
Alice,a@x.com,100
Bob,b@x.com,200
1 名前 Email 価格
2 Alice a@x.com 100
3 Bob b@x.com 200

View File

@@ -0,0 +1,3 @@
First Name , Last Name ,EmailAddr
Alice,Johnson,alice@x.com
Bob,Smith,bob@x.com
1 First Name Last Name EmailAddr
2 Alice Johnson alice@x.com
3 Bob Smith bob@x.com

View File

@@ -0,0 +1,3 @@
xyz,abc,foobar
1,2,3
4,5,6
1 xyz abc foobar
2 1 2 3
3 4 5 6

View File

@@ -0,0 +1,3 @@
first_name,age
Alice,30
Bob,25
1 first_name age
2 Alice 30
3 Bob 25

View File

@@ -0,0 +1,4 @@
First Name,Last Name,EmailAddr,Phone,Company,Annual Revenue,Lead Source,Created
Alice,Johnson,alice@acme.com,555-1234,Acme Corp,1500000,LinkedIn,2025-12-04
Bob,Smith,bob@beta.com,555-5678,Beta LLC,250000,Webinar,2025-11-22
Carlos,Garcia,carlos@gamma.io,555-9012,Gamma Inc,4200000,Referral,2025-10-30
1 First Name Last Name EmailAddr Phone Company Annual Revenue Lead Source Created
2 Alice Johnson alice@acme.com 555-1234 Acme Corp 1500000 LinkedIn 2025-12-04
3 Bob Smith bob@beta.com 555-5678 Beta LLC 250000 Webinar 2025-11-22
4 Carlos Garcia carlos@gamma.io 555-9012 Gamma Inc 4200000 Referral 2025-10-30

View File

@@ -0,0 +1,3 @@
FirstName,LastName,Email,Phone Number,Country
Alice,Johnson,alice@vendor-a.com,555-1234,USA
Bob,Smith,bob@vendor-a.com,555-5678,USA
1 FirstName LastName Email Phone Number Country
2 Alice Johnson alice@vendor-a.com 555-1234 USA
3 Bob Smith bob@vendor-a.com 555-5678 USA

View File

@@ -0,0 +1,3 @@
first_name,surname,email_addr,phone,country_code
Carlos,Garcia,carlos@vendor-b.com,555-9012,USA
Diana,Lee,diana@vendor-b.com,555-7777,UK
1 first_name surname email_addr phone country_code
2 Carlos Garcia carlos@vendor-b.com 555-9012 USA
3 Diana Lee diana@vendor-b.com 555-7777 UK

View File

@@ -0,0 +1,3 @@
FName,Surname,E-mail,Tel,Region
Eve,Martinez,eve@vendor-c.com,555-9988,Bronx
Frank,Brown,frank@vendor-c.com,555-1111,Queens
1 FName Surname E-mail Tel Region
2 Eve Martinez eve@vendor-c.com 555-9988 Bronx
3 Frank Brown frank@vendor-c.com 555-1111 Queens

View File

@@ -0,0 +1,6 @@
id,age,active,joined,score,notes
1,30,true,2025-01-15,87.5,first
2,25,false,2025-02-22,not_a_number,second
3,not_a_number,yes,2025-03-08,76.0,third
4,40,no,bad_date,91.2,fourth
5,55,1,2025-05-01,82.0,fifth
1 id age active joined score notes
2 1 30 true 2025-01-15 87.5 first
3 2 25 false 2025-02-22 not_a_number second
4 3 not_a_number yes 2025-03-08 76.0 third
5 4 40 no bad_date 91.2 fourth
6 5 55 1 2025-05-01 82.0 fifth

View File

@@ -0,0 +1,21 @@
customer_id,name,phone,country,address,price
INT-001,Alice Johnson,(415) 555-1234,US,"1 Apple Park Way, Cupertino CA 95014",$1499.99
INT-002,Boris Petrov,+7 495 123 4567,RU,"Ulitsa Tverskaya 13, Moscow 125009",₽89500
INT-003,carlos garcia,+34 91 411 1111,ES,"Calle Gran Via 28, Madrid 28013","€1.299,00"
INT-004,JOHN BROWN,020 7946 0958,GB,"10 Downing Street, London SW1A 2AA","£950.00"
INT-005,marie dubois,01 42 86 82 00,FR,"Avenue des Champs-Elysees 100, Paris 75008","€2.499,50"
INT-006,Yuki Tanaka,03-3210-7000,JP,"Marunouchi 2-7-3, Chiyoda-ku Tokyo 100-0005",¥150000
INT-007,Anna Schmidt,030 12345678,DE,"Unter den Linden 5, Berlin 10117","€899,99"
INT-008,giovanni rossi,+39 06 6982,IT,"Via del Corso 320, Roma 00186","€1.450,00"
INT-009,Mei Wang,+86 10 1234 5678,CN,"东长安街 1号, 北京 100006",¥10000
INT-010,Priya Sharma,+91 11 2345 6789,IN,"Connaught Place, New Delhi 110001",₹85000
INT-011,Ahmed Hassan,+20 2 2735 0000,EG,"Tahrir Square, Cairo 11511",E£3500
INT-012,emily smith,+61 2 9374 4000,AU,"Sydney Opera House, Sydney NSW 2000","$2,199.00"
INT-013,Joao Silva,11 3071 0000,BR,"Avenida Paulista 1000, Sao Paulo 01310","R$ 1.299,90"
INT-014,Sofia Lopez,+52 55 5555 0000,MX,"Paseo de la Reforma 222, Ciudad de Mexico 06600","$1,500 MXN"
INT-015,Min-jun Kim,+82 2 2287 0114,KR,"Seoul Plaza, Seoul 04518",₩1500000
INT-016,Mehmet Yilmaz,+90 212 252 0000,TR,"Sultanahmet, Istanbul 34122","₺1.250"
INT-017,david cohen,+972 3 6957 0000,IL,"Dizengoff 50, Tel Aviv 6433222",₪450
INT-018,Hanna Kowalska,+48 22 658 4500,PL,"Marszalkowska 1, Warszawa 00-624","zł 350,00"
INT-019,Lars Nielsen,+45 33 12 88 88,DK,"Vesterbrogade 1, Copenhagen 1620","kr 950"
INT-020,Sven Eriksson,+46 8 506 600 00,SE,"Drottninggatan 1, Stockholm 11151","kr 1.250,50"
1 customer_id name phone country address price
2 INT-001 Alice Johnson (415) 555-1234 US 1 Apple Park Way, Cupertino CA 95014 $1499.99
3 INT-002 Boris Petrov +7 495 123 4567 RU Ulitsa Tverskaya 13, Moscow 125009 ₽89500
4 INT-003 carlos garcia +34 91 411 1111 ES Calle Gran Via 28, Madrid 28013 €1.299,00
5 INT-004 JOHN BROWN 020 7946 0958 GB 10 Downing Street, London SW1A 2AA £950.00
6 INT-005 marie dubois 01 42 86 82 00 FR Avenue des Champs-Elysees 100, Paris 75008 €2.499,50
7 INT-006 Yuki Tanaka 03-3210-7000 JP Marunouchi 2-7-3, Chiyoda-ku Tokyo 100-0005 ¥150000
8 INT-007 Anna Schmidt 030 12345678 DE Unter den Linden 5, Berlin 10117 €899,99
9 INT-008 giovanni rossi +39 06 6982 IT Via del Corso 320, Roma 00186 €1.450,00
10 INT-009 Mei Wang +86 10 1234 5678 CN 东长安街 1号, 北京 100006 ¥10000
11 INT-010 Priya Sharma +91 11 2345 6789 IN Connaught Place, New Delhi 110001 ₹85000
12 INT-011 Ahmed Hassan +20 2 2735 0000 EG Tahrir Square, Cairo 11511 E£3500
13 INT-012 emily smith +61 2 9374 4000 AU Sydney Opera House, Sydney NSW 2000 $2,199.00
14 INT-013 Joao Silva 11 3071 0000 BR Avenida Paulista 1000, Sao Paulo 01310 R$ 1.299,90
15 INT-014 Sofia Lopez +52 55 5555 0000 MX Paseo de la Reforma 222, Ciudad de Mexico 06600 $1,500 MXN
16 INT-015 Min-jun Kim +82 2 2287 0114 KR Seoul Plaza, Seoul 04518 ₩1500000
17 INT-016 Mehmet Yilmaz +90 212 252 0000 TR Sultanahmet, Istanbul 34122 ₺1.250
18 INT-017 david cohen +972 3 6957 0000 IL Dizengoff 50, Tel Aviv 6433222 ₪450
19 INT-018 Hanna Kowalska +48 22 658 4500 PL Marszalkowska 1, Warszawa 00-624 zł 350,00
20 INT-019 Lars Nielsen +45 33 12 88 88 DK Vesterbrogade 1, Copenhagen 1620 kr 950
21 INT-020 Sven Eriksson +46 8 506 600 00 SE Drottninggatan 1, Stockholm 11151 kr 1.250,50

View File

@@ -0,0 +1,35 @@
# Missing Value Handler — corpus
Acceptance fixtures for `src/core/missing.py`. Each `.csv` under
`test_data/` is paired with assertions in `tests/test_missing_corpus.py`.
Add a new case by dropping a CSV here and adding a parametrize entry to
the runner.
## Use cases (target client profiles)
| File | Buyer profile | Strategy under test |
|------|---------------|---------------------|
| `uc01_shopify_export.csv` | SMB / Shopify operator | `detect-only` |
| `uc02_marketing_audience.csv` | Marketing / RevOps analyst| `safe-fill` |
| `uc03_consultant_intake.csv` | Analyst / consultant | `drop-incomplete` + threshold |
## Edge cases
| File | What it stresses |
|------|------------------|
| `ec01_all_nan_column.csv` | column 100 % missing — fill must skip, drop_col must catch at threshold |
| `ec02_no_missing.csv` | clean file — must be a no-op |
| `ec03_zero_is_not_missing.csv` | numeric `0`, boolean `false`, `"0"` must NOT be treated as missing |
| `ec04_excel_errors.csv` | `#N/A`, `#NULL!`, `#VALUE!` Excel error sentinels |
| `ec05_unicode_whitespace.csv` | NBSP, tab-only, ideographic-space cells treated as whitespace |
| `ec06_mixed_dtypes.csv` | mixed numeric/string in same column — graceful degrade to mode |
| `ec07_real_data_with_padding.csv` | leading/trailing whitespace around real data must NOT be dropped |
| `ec08_single_row.csv` | one-row file — every operation must still work |
| `ec09_single_column.csv` | one-column file with header-only line + sentinels |
| `ec10_all_sentinel_variants.csv` | every `DEFAULT_SENTINELS` entry exercised in one file |
| `ec11_constant_per_column.csv` | `column_fill_values` differs per column |
| `ec12_drop_threshold_boundary.csv`| boundary values for `row_drop_threshold` (0.5, 0.99, 1.0) |
| `ec13_ffill_leading_nan.csv` | leading-NaN run survives ffill (no fabrication) |
| `ec14_interpolate_fallback.csv` | numeric-only strategy on string column triggers fallback |
| `ec15_headers_only.csv` | empty body — must not crash |
| `ec16_idempotent_apply.csv` | running `handle_missing` twice yields the same DataFrame |

View File

@@ -0,0 +1,5 @@
id,name,deprecated_field
1,Alice,
2,Bob,
3,Charlie,
4,Diana,
1 id name deprecated_field
2 1 Alice
3 2 Bob
4 3 Charlie
5 4 Diana

View File

@@ -0,0 +1,4 @@
id,name,age,city
1,Alice,30,NYC
2,Bob,25,LA
3,Charlie,35,SF
1 id name age city
2 1 Alice 30 NYC
3 2 Bob 25 LA
4 3 Charlie 35 SF

View File

@@ -0,0 +1,5 @@
id,active,balance,count,flag
1,true,0.00,0,0
2,false,150.50,3,1
3,true,0,5,0
4,true,75.25,0,1
1 id active balance count flag
2 1 true 0.00 0 0
3 2 false 150.50 3 1
4 3 true 0 5 0
5 4 true 75.25 0 1

View File

@@ -0,0 +1,7 @@
sku,price,units,supplier
A-100,19.99,5,Acme
A-101,#N/A,3,Beta
A-102,29.99,#NULL!,Gamma
A-103,#VALUE!,2,Delta
A-104,9.99,0,Acme
A-105,#N/A,#N/A,#NULL!
1 sku price units supplier
2 A-100 19.99 5 Acme
3 A-101 #N/A 3 Beta
4 A-102 29.99 #NULL! Gamma
5 A-103 #VALUE! 2 Delta
6 A-104 9.99 0 Acme
7 A-105 #N/A #N/A #NULL!

View File

@@ -0,0 +1,6 @@
id,note,value
1,hello,10
2, ,20
3, ,30
4,real,40
5, ,50
1 id note value
2 1 hello 10
3 2 20
4 3 30
5 4 real 40
6 5 50

View File

@@ -0,0 +1,6 @@
id,mixed_col,real_num
1,42,1.0
2,N/A,2.0
3,hello,
4,,4.0
5,99,5.0
1 id mixed_col real_num
2 1 42 1.0
3 2 N/A 2.0
4 3 hello
5 4 4.0
6 5 99 5.0

View File

@@ -0,0 +1,5 @@
id,name,city
1, Alice ,NYC
2, ,LA
3, Bob ,
4,Charlie, SF
1 id name city
2 1 Alice NYC
3 2 LA
4 3 Bob
5 4 Charlie SF

View File

@@ -0,0 +1,2 @@
id,name,age,city
1,Alice,N/A,
1 id name age city
2 1 Alice N/A

View File

@@ -0,0 +1,7 @@
value
10
N/A
20
" "
-
30
1 value
2 10
3 N/A
4 20
5
6 -
7 30

View File

@@ -0,0 +1,22 @@
case_id,sentinel_value
01,N/A
02,n/a
03,NA
04,na
05,NULL
06,null
07,None
08,nil
09,NaN
10,-
11,--
12,?
13,.
14,TBD
15,unknown
16,(blank)
17,(none)
18,#N/A
19,#NULL!
20,missing
21,real_value
1 case_id sentinel_value
2 01 N/A
3 02 n/a
4 03 NA
5 04 na
6 05 NULL
7 06 null
8 07 None
9 08 nil
10 09 NaN
11 10 -
12 11 --
13 12 ?
14 13 .
15 14 TBD
16 15 unknown
17 16 (blank)
18 17 (none)
19 18 #N/A
20 19 #NULL!
21 20 missing
22 21 real_value

View File

@@ -0,0 +1,6 @@
id,country,salary,department
1,USA,50000,Eng
2,,60000,Sales
3,UK,,Eng
4,USA,55000,
5,,,
1 id country salary department
2 1 USA 50000 Eng
3 2 60000 Sales
4 3 UK Eng
5 4 USA 55000
6 5

View File

@@ -0,0 +1,6 @@
id,a,b,c,d
1,1,2,3,4
2,,,3,4
3,,,,4
4,,,,
5,1,2,,
1 id a b c d
2 1 1 2 3 4
3 2 3 4
4 3 4
5 4
6 5 1 2

View File

@@ -0,0 +1,8 @@
date,price
2025-01-01,
2025-01-02,
2025-01-03,100.0
2025-01-04,
2025-01-05,
2025-01-06,150.0
2025-01-07,
1 date price
2 2025-01-01
3 2025-01-02
4 2025-01-03 100.0
5 2025-01-04
6 2025-01-05
7 2025-01-06 150.0
8 2025-01-07

View File

@@ -0,0 +1,6 @@
id,category,value
1,A,10.0
2,B,
3,C,30.0
4,,40.0
5,A,
1 id category value
2 1 A 10.0
3 2 B
4 3 C 30.0
5 4 40.0
6 5 A

View File

@@ -0,0 +1 @@
id,name,age,city
1 id name age city

View File

@@ -0,0 +1,5 @@
id,name,age
1,Alice,30
2,N/A,
3,Bob,25
4,,40
1 id name age
2 1 Alice 30
3 2 N/A
4 3 Bob 25
5 4 40

View File

@@ -0,0 +1,11 @@
customer_id,first_name,last_name,email,phone,city,total_orders,lifetime_value,last_order_date,tags
SHOP-001,Alice,Johnson,alice@shop.com,555-1234,Brooklyn,12,1240.50,2025-12-04,VIP
SHOP-002,Bob,Smith,bob@shop.com,N/A,Queens,5,420.00,2025-11-22,
SHOP-003,Carlos,Garcia,carlos@shop.com,555-5678,-,8,890.25,2025-12-15,Wholesale
SHOP-004,Diana,Lee,diana@shop.com,(555) 222-3344,Manhattan,NULL,1875.00,2025-10-30,VIP|Wholesale
SHOP-005,Eve,Martinez,,555-9988,Bronx,3,180.00,2025-09-15,
SHOP-006,Frank,Brown,frank@shop.com, ,Staten Island,15,2410.75,(blank),
SHOP-007,Grace,Davis,grace@shop.com,555-1111,Brooklyn,1,49.99,#N/A,New
SHOP-008,Henry,Wilson,henry@shop.com,n/a,Queens,7,675.00,2025-11-08,VIP
SHOP-009,Ivy,Chen,ivy@shop.com,555-7777,?,4,320.50,2025-10-12,
SHOP-010,Jack,Taylor,jack@shop.com,555-4444,Manhattan,(none),520.00,2025-12-01,Wholesale
1 customer_id first_name last_name email phone city total_orders lifetime_value last_order_date tags
2 SHOP-001 Alice Johnson alice@shop.com 555-1234 Brooklyn 12 1240.50 2025-12-04 VIP
3 SHOP-002 Bob Smith bob@shop.com N/A Queens 5 420.00 2025-11-22
4 SHOP-003 Carlos Garcia carlos@shop.com 555-5678 - 8 890.25 2025-12-15 Wholesale
5 SHOP-004 Diana Lee diana@shop.com (555) 222-3344 Manhattan NULL 1875.00 2025-10-30 VIP|Wholesale
6 SHOP-005 Eve Martinez 555-9988 Bronx 3 180.00 2025-09-15
7 SHOP-006 Frank Brown frank@shop.com Staten Island 15 2410.75 (blank)
8 SHOP-007 Grace Davis grace@shop.com 555-1111 Brooklyn 1 49.99 #N/A New
9 SHOP-008 Henry Wilson henry@shop.com n/a Queens 7 675.00 2025-11-08 VIP
10 SHOP-009 Ivy Chen ivy@shop.com 555-7777 ? 4 320.50 2025-10-12
11 SHOP-010 Jack Taylor jack@shop.com 555-4444 Manhattan (none) 520.00 2025-12-01 Wholesale

View File

@@ -0,0 +1,16 @@
contact_id,email,segment,region,age,ltv,score,last_engagement_days,source,consent
LEAD-001,a@mkt.com,Enterprise,NA-East,42,12400,87,3,LinkedIn,true
LEAD-002,b@mkt.com,SMB,NA-West,,3200,62,12,Google,true
LEAD-003,c@mkt.com,SMB,EU,29,1800,N/A,7,unknown,true
LEAD-004,d@mkt.com,Enterprise,NA-East,55,,91,1,Webinar,true
LEAD-005,e@mkt.com,Mid-Market,NA-West,38,5600,74,,Referral,true
LEAD-006,f@mkt.com,SMB,EU,,2100,55,21,-,
LEAD-007,g@mkt.com,Enterprise,APAC,47,9800,82,5,LinkedIn,true
LEAD-008,h@mkt.com,SMB,NA-East,33,2900,,9,Google,
LEAD-009,i@mkt.com,Mid-Market,EU,41,4750,68,15,NULL,true
LEAD-010,j@mkt.com,Enterprise,NA-West,,11200,89,2,Webinar,true
LEAD-011,k@mkt.com,SMB,APAC,28,1650,58,18,(blank),true
LEAD-012,l@mkt.com,Mid-Market,NA-East,36,5100,,11,Referral,true
LEAD-013,m@mkt.com,SMB,EU,31,2300,61,N/A,Google,true
LEAD-014,n@mkt.com,Enterprise,APAC,52,10500,93,4,LinkedIn,true
LEAD-015,o@mkt.com,SMB,NA-West,26,1400,49,25,?,
1 contact_id email segment region age ltv score last_engagement_days source consent
2 LEAD-001 a@mkt.com Enterprise NA-East 42 12400 87 3 LinkedIn true
3 LEAD-002 b@mkt.com SMB NA-West 3200 62 12 Google true
4 LEAD-003 c@mkt.com SMB EU 29 1800 N/A 7 unknown true
5 LEAD-004 d@mkt.com Enterprise NA-East 55 91 1 Webinar true
6 LEAD-005 e@mkt.com Mid-Market NA-West 38 5600 74 Referral true
7 LEAD-006 f@mkt.com SMB EU 2100 55 21 -
8 LEAD-007 g@mkt.com Enterprise APAC 47 9800 82 5 LinkedIn true
9 LEAD-008 h@mkt.com SMB NA-East 33 2900 9 Google
10 LEAD-009 i@mkt.com Mid-Market EU 41 4750 68 15 NULL true
11 LEAD-010 j@mkt.com Enterprise NA-West 11200 89 2 Webinar true
12 LEAD-011 k@mkt.com SMB APAC 28 1650 58 18 (blank) true
13 LEAD-012 l@mkt.com Mid-Market NA-East 36 5100 11 Referral true
14 LEAD-013 m@mkt.com SMB EU 31 2300 61 N/A Google true
15 LEAD-014 n@mkt.com Enterprise APAC 52 10500 93 4 LinkedIn true
16 LEAD-015 o@mkt.com SMB NA-West 26 1400 49 25 ?

View File

@@ -0,0 +1,13 @@
respondent_id,age,gender,zip,survey_q1,survey_q2,survey_q3,survey_q4,nps,comments,internal_id_legacy,beta_field
R-001,34,F,11201,4,5,3,4,9,"loved it",,
R-002,N/A,M,10001,,,,, ,,,
R-003,41,F,90210,5,4,5,5,10,"perfect",,
R-004,28,M,-,3,,,,7,,,
R-005,,,NULL,,,,,,,,
R-006,52,F,02101,4,4,4,4,8,"good experience",,
R-007,?,?,?,?,?,?,?,?,?,,
R-008,29,M,94102,5,5,5,5,10,"amazing",,
R-009,38,F,60601,2,3,2,2,5,"meh",,
R-010,(blank),(blank),(blank),(blank),(blank),(blank),(blank),(blank),(blank),,
R-011,45,M,30301,4,4,3,4,8,,,
R-012,33,F,11201,5,5,5,4,9,"will recommend",,
1 respondent_id age gender zip survey_q1 survey_q2 survey_q3 survey_q4 nps comments internal_id_legacy beta_field
2 R-001 34 F 11201 4 5 3 4 9 loved it
3 R-002 N/A M 10001
4 R-003 41 F 90210 5 4 5 5 10 perfect
5 R-004 28 M - 3 7
6 R-005 NULL
7 R-006 52 F 02101 4 4 4 4 8 good experience
8 R-007 ? ? ? ? ? ? ? ? ?
9 R-008 29 M 94102 5 5 5 5 10 amazing
10 R-009 38 F 60601 2 3 2 2 5 meh
11 R-010 (blank) (blank) (blank) (blank) (blank) (blank) (blank) (blank) (blank)
12 R-011 45 M 30301 4 4 3 4 8
13 R-012 33 F 11201 5 5 5 4 9 will recommend

View File

@@ -253,16 +253,20 @@ class TestEncodingOverride:
class TestEncodingDecodeFailedFromRepair:
def test_decode_replaced_action_surfaces_error_finding(self, tmp_path):
# Create a file with a UTF-8 BOM but cp1252 body bytes — utf-8-sig
# fails on byte 0x80 (€ in cp1252).
def test_lying_bom_recovered_and_flagged(self, tmp_path):
# File has a UTF-8 BOM but the body bytes are cp1252 (0x80 = € in
# cp1252; not a valid UTF-8 continuation byte). Detector should
# recover transparently to cp1252 and surface an
# ``encoding_lying_bom`` warn so the user knows.
f = tmp_path / "lying_bom.csv"
f.write_bytes(b"\xef\xbb\xbfid,name\n1,\x80100\n")
findings = analyze(f)
ids = {x.id for x in findings}
assert "encoding_decode_failed" in ids
bad = next(x for x in findings if x.id == "encoding_decode_failed")
assert bad.severity == "error"
assert "encoding_lying_bom" in ids
bad = next(x for x in findings if x.id == "encoding_lying_bom")
assert bad.severity == "warn"
# Decode should have succeeded — no replacement-character finding.
assert "encoding_decode_failed" not in ids
class TestMixedLineEndings:

374
tests/test_column_mapper.py Normal file
View File

@@ -0,0 +1,374 @@
"""Tests for src/core/column_mapper.py."""
from __future__ import annotations
import json
import numpy as np
import pandas as pd
import pytest
from src.core.errors import ConfigError, InputValidationError
from src.core.column_mapper import (
MapOptions,
PRESETS,
TargetField,
TargetSchema,
coerce_series,
infer_mapping,
map_columns,
)
# ---------------------------------------------------------------------------
# infer_mapping — fuzzy matcher
# ---------------------------------------------------------------------------
class TestInferMapping:
def test_exact_normalized_match(self):
df = pd.DataFrame({"First Name": [], "Last Name": []})
schema = TargetSchema(fields=[
TargetField(name="first_name"), TargetField(name="last_name"),
])
m = infer_mapping(df, schema)
assert m == {"First Name": "first_name", "Last Name": "last_name"}
def test_alias_match(self):
df = pd.DataFrame({"EmailAddr": []})
schema = TargetSchema(fields=[
TargetField(name="email", aliases=["EmailAddr", "email_address"]),
])
m = infer_mapping(df, schema)
assert m == {"EmailAddr": "email"}
def test_below_threshold_excluded(self):
df = pd.DataFrame({"xyz": []})
schema = TargetSchema(fields=[TargetField(name="email")])
m = infer_mapping(df, schema, threshold=0.6)
assert m == {}
def test_target_matched_at_most_once(self):
df = pd.DataFrame({"first_name": [], "fname": []})
schema = TargetSchema(fields=[TargetField(name="first_name")])
m = infer_mapping(df, schema)
# Exact match wins; "fname" stays unmapped.
assert m == {"first_name": "first_name"}
def test_threshold_zero_matches_anything(self):
df = pd.DataFrame({"a": [], "b": []})
schema = TargetSchema(fields=[TargetField(name="z")])
m = infer_mapping(df, schema, threshold=0.0)
assert len(m) == 1
# ---------------------------------------------------------------------------
# coerce_series
# ---------------------------------------------------------------------------
class TestCoerceSeries:
def test_integer_clean(self):
s = pd.Series(["1", "2", "3"])
out, fails = coerce_series(s, "integer")
assert list(out) == [1, 2, 3]
assert fails == 0
def test_integer_with_failure(self):
s = pd.Series(["1", "bad", "3"])
out, fails = coerce_series(s, "integer")
assert fails == 1
assert pd.isna(out.iloc[1])
def test_float_with_thousands_sep(self):
# Plain floats; thousands-sep handling is for format standardizer.
s = pd.Series(["1.5", "2.0", "3.25"])
out, fails = coerce_series(s, "float")
assert fails == 0
assert out.iloc[2] == 3.25
def test_boolean_truthy_falsy(self):
s = pd.Series(["true", "false", "Yes", "no", "1", "0"])
out, fails = coerce_series(s, "boolean")
assert fails == 0
assert list(out) == [True, False, True, False, True, False]
def test_boolean_unknown_value_fails(self):
s = pd.Series(["true", "maybe"])
out, fails = coerce_series(s, "boolean")
assert fails == 1
assert pd.isna(out.iloc[1])
def test_date_iso_format(self):
s = pd.Series(["2025-01-15", "2025-02-20"])
out, fails = coerce_series(s, "date")
assert fails == 0
assert out.iloc[0].year == 2025
def test_date_failure(self):
s = pd.Series(["2025-01-15", "garbage"])
out, fails = coerce_series(s, "date")
assert fails == 1
assert pd.isna(out.iloc[1])
def test_string_passthrough(self):
s = pd.Series([1, 2, 3])
out, fails = coerce_series(s, "string")
assert fails == 0
assert out.dtype.name == "string"
def test_auto_returns_unchanged(self):
s = pd.Series([1, 2])
out, fails = coerce_series(s, "auto")
assert fails == 0
assert out is s
def test_unknown_dtype_raises(self):
with pytest.raises(InputValidationError):
coerce_series(pd.Series([1]), "bogus") # type: ignore[arg-type]
# ---------------------------------------------------------------------------
# map_columns — explicit mapping
# ---------------------------------------------------------------------------
class TestMapColumnsExplicit:
def test_simple_rename(self):
df = pd.DataFrame({"a": [1], "b": [2]})
opts = MapOptions(mapping={"a": "alpha", "b": "beta"})
res = map_columns(df, opts)
assert list(res.mapped_df.columns) == ["alpha", "beta"]
assert res.columns_renamed == 2
def test_unknown_source_raises(self):
df = pd.DataFrame({"a": [1]})
opts = MapOptions(mapping={"missing": "x"})
with pytest.raises(InputValidationError):
map_columns(df, opts)
def test_duplicate_target_raises(self):
df = pd.DataFrame({"a": [1], "b": [2]})
opts = MapOptions(mapping={"a": "x", "b": "x"})
with pytest.raises(InputValidationError):
map_columns(df, opts)
def test_unmapped_keep(self):
df = pd.DataFrame({"a": [1], "b": [2]})
opts = MapOptions(mapping={"a": "alpha"}, unmapped="keep")
res = map_columns(df, opts)
assert "b" in res.mapped_df.columns
assert res.unmapped_kept == ["b"]
def test_unmapped_drop(self):
df = pd.DataFrame({"a": [1], "b": [2]})
opts = MapOptions(mapping={"a": "alpha"}, unmapped="drop")
res = map_columns(df, opts)
assert list(res.mapped_df.columns) == ["alpha"]
assert res.columns_dropped == ["b"]
def test_unmapped_error(self):
df = pd.DataFrame({"a": [1], "b": [2]})
opts = MapOptions(mapping={"a": "alpha"}, unmapped="error")
with pytest.raises(InputValidationError):
map_columns(df, opts)
# ---------------------------------------------------------------------------
# map_columns — schema + auto-inference
# ---------------------------------------------------------------------------
class TestMapColumnsWithSchema:
def test_auto_infer_renames(self):
df = pd.DataFrame({"First Name": ["A"], "Last Name": ["B"]})
schema = TargetSchema(fields=[
TargetField(name="first_name"), TargetField(name="last_name"),
])
opts = MapOptions(schema=schema, auto_infer=True)
res = map_columns(df, opts)
assert "first_name" in res.mapped_df.columns
assert "last_name" in res.mapped_df.columns
assert res.inferred_pairs == {"First Name": "first_name", "Last Name": "last_name"}
def test_explicit_overrides_inferred(self):
df = pd.DataFrame({"name": ["A"], "fname": ["B"]})
schema = TargetSchema(fields=[TargetField(name="first_name")])
opts = MapOptions(
schema=schema,
mapping={"fname": "first_name"},
auto_infer=True,
)
res = map_columns(df, opts)
assert res.mapping["fname"] == "first_name"
assert "name" not in res.mapping
def test_required_missing_raises(self):
df = pd.DataFrame({"first_name": ["A"]})
schema = TargetSchema(fields=[
TargetField(name="first_name", required=True),
TargetField(name="email", required=True),
])
opts = MapOptions(schema=schema, auto_infer=False, enforce_required=True)
with pytest.raises(InputValidationError):
map_columns(df, opts)
def test_required_missing_with_default_added(self):
df = pd.DataFrame({"first_name": ["A"]})
schema = TargetSchema(fields=[
TargetField(name="first_name", required=True),
TargetField(name="source", required=False, default="import"),
])
opts = MapOptions(schema=schema, auto_infer=False)
res = map_columns(df, opts)
assert "source" in res.mapped_df.columns
assert res.mapped_df.iloc[0]["source"] == "import"
assert res.columns_added == ["source"]
def test_required_missing_disabled(self):
df = pd.DataFrame({"first_name": ["A"]})
schema = TargetSchema(fields=[
TargetField(name="first_name", required=True),
TargetField(name="email", required=True),
])
opts = MapOptions(schema=schema, auto_infer=False, enforce_required=False)
res = map_columns(df, opts)
assert "email" in res.missing_required_targets
def test_reorder_to_schema(self):
df = pd.DataFrame({"z": [1], "a": [2], "m": [3]})
schema = TargetSchema(fields=[
TargetField(name="a"), TargetField(name="m"), TargetField(name="z"),
])
opts = MapOptions(schema=schema, auto_infer=True, reorder_to_schema=True)
res = map_columns(df, opts)
assert list(res.mapped_df.columns) == ["a", "m", "z"]
def test_coerce_types(self):
df = pd.DataFrame({"age": ["30", "bad", "40"], "active": ["true", "no", "yes"]})
schema = TargetSchema(fields=[
TargetField(name="age", dtype="integer"),
TargetField(name="active", dtype="boolean"),
])
opts = MapOptions(schema=schema, auto_infer=True, coerce_types=True)
res = map_columns(df, opts)
assert res.mapped_df["age"].iloc[0] == 30
assert res.mapped_df["active"].iloc[0] is True or res.mapped_df["active"].iloc[0]
assert res.coercion_failures == {"age": 1}
# ---------------------------------------------------------------------------
# Presets
# ---------------------------------------------------------------------------
class TestPresets:
def test_strict_schema_drops_and_coerces_and_reorders(self):
df = pd.DataFrame({"First Name": ["A"], "Email": ["a@x"], "extra": [1]})
schema = TargetSchema(fields=[
TargetField(name="first_name", required=True),
TargetField(name="email", required=True),
])
opts = MapOptions.from_preset("strict-schema")
opts.schema = schema
res = map_columns(df, opts)
assert list(res.mapped_df.columns) == ["first_name", "email"]
assert res.columns_dropped == ["extra"]
def test_lenient_keeps_extras(self):
df = pd.DataFrame({"First Name": ["A"], "extra": [1]})
schema = TargetSchema(fields=[TargetField(name="first_name")])
opts = MapOptions.from_preset("lenient-schema")
opts.schema = schema
res = map_columns(df, opts)
assert "extra" in res.mapped_df.columns
def test_unknown_preset(self):
with pytest.raises(ConfigError):
MapOptions.from_preset("does-not-exist")
# ---------------------------------------------------------------------------
# Schema serialization
# ---------------------------------------------------------------------------
class TestSchemaIO:
def test_roundtrip_dict(self):
schema = TargetSchema(fields=[
TargetField(name="x", dtype="integer", required=True, aliases=["X", "X "]),
TargetField(name="y", default="z"),
])
d = schema.to_dict()
loaded = TargetSchema.from_dict(d)
assert loaded.field_names() == ["x", "y"]
assert loaded.fields[0].required is True
assert loaded.fields[1].default == "z"
def test_from_dict_string_field(self):
# Allow shorthand: bare string defaults to dtype=auto.
loaded = TargetSchema.from_dict({"fields": ["a", "b"]})
assert loaded.field_names() == ["a", "b"]
def test_from_dict_unknown_dtype_raises(self):
with pytest.raises(ConfigError):
TargetSchema.from_dict({"fields": [{"name": "x", "dtype": "bogus"}]})
def test_from_dict_missing_name_raises(self):
with pytest.raises(ConfigError):
TargetSchema.from_dict({"fields": [{"dtype": "string"}]})
def test_options_roundtrip_to_file(self, tmp_path):
schema = TargetSchema(fields=[TargetField(name="x", dtype="string")])
opts = MapOptions(
schema=schema,
mapping={"a": "x"},
unmapped="drop",
coerce_types=True,
reorder_to_schema=True,
)
path = tmp_path / "cfg.json"
opts.to_file(path)
loaded = MapOptions.from_file(path)
assert loaded.mapping == {"a": "x"}
assert loaded.unmapped == "drop"
assert loaded.coerce_types is True
assert loaded.schema is not None
assert loaded.schema.field_names() == ["x"]
# ---------------------------------------------------------------------------
# Validation
# ---------------------------------------------------------------------------
class TestValidation:
def test_invalid_unmapped_strategy(self):
opts = MapOptions(unmapped="bogus") # type: ignore[arg-type]
with pytest.raises(InputValidationError):
opts.validate()
def test_threshold_out_of_range(self):
opts = MapOptions(fuzzy_threshold=1.5)
with pytest.raises(ConfigError):
opts.validate()
def test_non_dataframe_input(self):
with pytest.raises(InputValidationError):
map_columns([1, 2, 3]) # type: ignore[arg-type]
# ---------------------------------------------------------------------------
# Idempotency
# ---------------------------------------------------------------------------
class TestIdempotency:
def test_double_apply_is_stable(self):
df = pd.DataFrame({"First Name": ["A"], "Email": ["a@x"]})
schema = TargetSchema(fields=[
TargetField(name="first_name"),
TargetField(name="email"),
])
opts = MapOptions(schema=schema, auto_infer=True, reorder_to_schema=True)
first = map_columns(df, opts)
second = map_columns(first.mapped_df, opts)
pd.testing.assert_frame_equal(second.mapped_df, first.mapped_df)
def test_input_not_mutated(self):
df = pd.DataFrame({"a": [1], "b": [2]})
snapshot = df.copy(deep=True)
map_columns(df, MapOptions(mapping={"a": "x"}))
pd.testing.assert_frame_equal(df, snapshot)

View File

@@ -0,0 +1,240 @@
"""Acceptance corpus for the Column Mapper.
Loads every fixture in ``test-cases/column-mapper-corpus/test_data/``
and asserts the documented behaviour against the documented schema.
"""
from __future__ import annotations
import json
from pathlib import Path
import pandas as pd
import pytest
from src.core.errors import InputValidationError
from src.core.column_mapper import (
MapOptions,
TargetField,
TargetSchema,
map_columns,
)
CORPUS = Path(__file__).resolve().parents[1] / "test-cases" / "column-mapper-corpus"
TEST_DATA = CORPUS / "test_data"
SCHEMAS = CORPUS / "schemas"
def _read(name: str) -> pd.DataFrame:
return pd.read_csv(TEST_DATA / name)
def _schema(name: str) -> TargetSchema:
return TargetSchema.from_file(SCHEMAS / name)
# ---------------------------------------------------------------------------
# UC01 — CRM import
# ---------------------------------------------------------------------------
class TestUC01CrmImport:
def test_strict_schema_round_trip(self):
df = _read("uc01_crm_import.csv")
schema = _schema("uc01_crm_target.json")
opts = MapOptions.from_preset("strict-schema")
opts.schema = schema
res = map_columns(df, opts)
# Every required target is present after the run.
for f in schema.fields:
if f.required:
assert f.name in res.mapped_df.columns
# 'owner' default added.
assert "owner" in res.columns_added
assert (res.mapped_df["owner"] == "unassigned").all()
# No unmapped survivors (strict preset drops extras).
assert res.unmapped_kept == []
# Reordered to schema order.
expected_prefix = [f.name for f in schema.fields]
assert list(res.mapped_df.columns)[: len(expected_prefix)] == expected_prefix
def test_types_coerced_from_strings(self):
df = _read("uc01_crm_import.csv")
schema = _schema("uc01_crm_target.json")
opts = MapOptions.from_preset("strict-schema")
opts.schema = schema
res = map_columns(df, opts)
# annual_rev → integer (was numeric strings in the source).
assert pd.api.types.is_integer_dtype(res.mapped_df["annual_rev"])
# created_date → datetime64.
assert pd.api.types.is_datetime64_any_dtype(res.mapped_df["created_date"])
# ---------------------------------------------------------------------------
# UC02 — Multi-vendor unification
# ---------------------------------------------------------------------------
class TestUC02MultiVendor:
@pytest.mark.parametrize("vendor", ["a", "b", "c"])
def test_each_vendor_normalises_to_canonical(self, vendor):
df = _read(f"uc02_vendor_{vendor}.csv")
schema = _schema("uc02_canonical.json")
opts = MapOptions.from_preset("lenient-schema")
opts.schema = schema
opts.fuzzy_threshold = 0.5 # vendor C uses obscure aliases ("FName", "Tel")
res = map_columns(df, opts)
# Every required canonical field landed in the output.
for f in schema.fields:
if f.required:
assert f.name in res.mapped_df.columns, (
f"vendor {vendor}: missing {f.name}; mapping={res.mapping}"
)
def test_concatenated_vendors_share_schema(self):
# The point of unification: after each vendor goes through the
# mapper, the resulting frames stack cleanly.
schema = _schema("uc02_canonical.json")
opts = MapOptions.from_preset("strict-schema")
opts.schema = schema
opts.fuzzy_threshold = 0.5
frames = [
map_columns(_read(f"uc02_vendor_{v}.csv"), opts).mapped_df
for v in ("a", "b", "c")
]
unified = pd.concat(frames, ignore_index=True)
assert list(unified.columns) == [f.name for f in schema.fields]
# Total rows = sum of inputs.
assert len(unified) == sum(len(f) for f in frames)
# ---------------------------------------------------------------------------
# UC03 — Type coercion
# ---------------------------------------------------------------------------
class TestUC03TypeCoercion:
def test_documented_failures_are_reported(self):
df = _read("uc03_type_coercion.csv")
schema = _schema("uc03_types.json")
opts = MapOptions.from_preset("lenient-schema")
opts.schema = schema
res = map_columns(df, opts)
# Bad rows survive as NaN, with counts recorded.
assert res.coercion_failures.get("age") == 1
assert res.coercion_failures.get("score") == 1
assert res.coercion_failures.get("joined") == 1
def test_coerced_dtypes(self):
df = _read("uc03_type_coercion.csv")
schema = _schema("uc03_types.json")
opts = MapOptions.from_preset("lenient-schema")
opts.schema = schema
res = map_columns(df, opts)
out = res.mapped_df
assert pd.api.types.is_integer_dtype(out["id"])
assert out["active"].dtype.name == "boolean"
assert pd.api.types.is_datetime64_any_dtype(out["joined"])
# Float failures NaN-ify.
assert pd.isna(out["score"].iloc[1])
# ---------------------------------------------------------------------------
# Edge cases
# ---------------------------------------------------------------------------
class TestEC01DuplicateTarget:
def test_two_sources_to_same_target_raises(self):
df = _read("ec01_duplicate_target.csv")
opts = MapOptions(mapping={"a": "x", "b": "x"})
with pytest.raises(InputValidationError):
map_columns(df, opts)
class TestEC02UnicodeColumns:
def test_japanese_column_renamed(self):
df = _read("ec02_unicode_columns.csv")
opts = MapOptions(mapping={"名前": "name", "価格": "price"})
res = map_columns(df, opts)
assert "name" in res.mapped_df.columns
assert "price" in res.mapped_df.columns
# Email passes through (unmapped, kept by default).
assert "Email" in res.mapped_df.columns
class TestEC03WhitespaceHeaders:
def test_header_whitespace_does_not_block_match(self):
df = _read("ec03_whitespace_headers.csv")
schema = TargetSchema(fields=[
TargetField(name="first_name", aliases=["First Name"]),
TargetField(name="last_name", aliases=["Last Name"]),
TargetField(name="email", aliases=["EmailAddr"]),
])
opts = MapOptions(schema=schema, auto_infer=True)
res = map_columns(df, opts)
# All three columns should map despite the leading/trailing spaces.
assert len(res.mapping) == 3
class TestEC04NoMatch:
def test_zero_inferred_with_no_match(self):
df = _read("ec04_no_match.csv")
schema = TargetSchema(fields=[
TargetField(name="email"), TargetField(name="phone"),
])
opts = MapOptions(schema=schema, auto_infer=True, unmapped="keep")
res = map_columns(df, opts)
assert res.inferred_pairs == {}
# Source columns survive as-is under keep.
assert set(df.columns) <= set(res.mapped_df.columns)
def test_no_match_with_unmapped_error(self):
df = _read("ec04_no_match.csv")
schema = TargetSchema(fields=[TargetField(name="email")])
opts = MapOptions(
schema=schema, auto_infer=True, unmapped="error",
enforce_required=False,
)
with pytest.raises(InputValidationError):
map_columns(df, opts)
class TestEC05RequiredMissing:
def test_required_missing_raises(self):
df = _read("ec05_required_missing.csv")
schema = TargetSchema(fields=[
TargetField(name="first_name", required=True),
TargetField(name="email", required=True),
])
opts = MapOptions(schema=schema, auto_infer=True, enforce_required=True)
with pytest.raises(InputValidationError):
map_columns(df, opts)
def test_disable_enforce_surfaces_in_result(self):
df = _read("ec05_required_missing.csv")
schema = TargetSchema(fields=[
TargetField(name="first_name", required=True),
TargetField(name="email", required=True),
])
opts = MapOptions(schema=schema, auto_infer=True, enforce_required=False)
res = map_columns(df, opts)
assert "email" in res.missing_required_targets
# ---------------------------------------------------------------------------
# Whole-corpus property tests
# ---------------------------------------------------------------------------
ALL_FIXTURES = sorted(p.name for p in TEST_DATA.glob("*.csv"))
@pytest.mark.parametrize("fixture", ALL_FIXTURES)
def test_map_columns_does_not_mutate_input(fixture):
df = pd.read_csv(TEST_DATA / fixture)
snapshot = df.copy(deep=True)
try:
map_columns(df, MapOptions()) # identity run; default options.
except InputValidationError:
pass # ec01 / ec05 raise here — fine, mutation is what we care about.
pd.testing.assert_frame_equal(df, snapshot)

View File

@@ -169,8 +169,23 @@ class TestMojibake:
assert actual.equals(expected), "14 mojibake default (no repair) differs"
def test_fixed_variant(self):
# --fix-mojibake is Tier 2; the cleaner does not implement it. Mark xfail.
pytest.xfail("Mojibake auto-repair is Tier 2; not yet implemented (uses ftfy).")
"""Mojibake auto-repair (ftfy-backed) restores the original text.
Skipped automatically when ftfy is not installed — the engine
falls back to a no-op in that case and the diff would never close.
"""
try:
import ftfy # noqa: F401
except ImportError:
pytest.skip("ftfy not installed — install ftfy to enable mojibake repair")
from src.core.fixes import repair_mojibake
df = _read_csv_strict(TEST_DATA / "14_mojibake.csv")
expected = _read_csv_strict(EXPECTED / "14_mojibake__fixed.csv")
repaired, _ = repair_mojibake(df)
actual = repaired.reset_index(drop=True)
assert actual.equals(expected), "14 mojibake fixed variant differs"
class TestEmptyFile:

View File

@@ -14,12 +14,11 @@ What's tested
REJECT / LOW_CONFIDENCE.
3. The decoded DataFrame matches the canonical reference content.
Cases where the current implementation is known to fail (charset-
normalizer label drift on byte-equivalent encodings, ``repair_bytes``
NUL-strip destroying UTF-16, the "lying BOM" pathological case) are
marked ``xfail`` so they surface in the report as documented gaps.
A future fix that makes the case pass will flip xfail to xpass and the
test owner can drop the marker.
Detection arbiter (cp1250→cp1252, mac_iceland→mac_roman, lying-BOM
recovery) and a language-aware probe (Cyrillic / EE-Latin coverage)
together close every documented gap; the ``KNOWN_*_FAILURES`` dicts
below are kept empty as a tripwire — re-add an entry only when a real
limitation surfaces.
"""
from __future__ import annotations
@@ -41,27 +40,9 @@ REFERENCE_DIR = CORPUS / "reference"
# Known failures the analyzer does not yet handle correctly. Each entry
# has a one-line reason — drop the entry once a fix lands.
KNOWN_DETECTION_FAILURES = {
"E03_western_basic_cp1252.csv": "charset-normalizer returns cp1250 for byte-equivalent content",
"E04_western_basic_latin1.csv": "charset-normalizer returns cp1250 for byte-equivalent content",
"E05_western_basic_latin9.csv": "charset-normalizer returns cp1250 for byte-equivalent content",
"E06_western_basic_macroman.csv": "returns mac_iceland (same family) instead of mac_roman",
"E11_western_extended_cp1252.csv": "charset-normalizer returns cp1250 for cp1252 content",
"E15_eastern_european_iso88592.csv": "charset-normalizer returns cp1258 for ISO-8859-2 content",
"E18_cyrillic_koi8r.csv": "charset-normalizer returns shift_jis_2004 for KOI8-R content",
}
KNOWN_DETECTION_FAILURES: dict[str, str] = {}
KNOWN_DECODE_FAILURES = {
"E03_western_basic_cp1252.csv": "decoded as cp1250 — different mapping at 0xF1 (ñ vs ń)",
"E04_western_basic_latin1.csv": "decoded as cp1250 — different mapping at 0xF1",
"E05_western_basic_latin9.csv": "decoded as cp1250 — different mapping at 0xF1",
"E10_western_extended_utf8.csv": "byte-level smart-quote fold rewrites U+201C/U+201D to ASCII before parse",
"E11_western_extended_cp1252.csv": "wrong encoding + smart-quote fold",
"E12_western_extended_utf16le.csv": "byte-level smart-quote fold rewrites U+201C/U+201D before parse",
"E15_eastern_european_iso88592.csv": "wrong encoding (cp1258 != ISO-8859-2)",
"E18_cyrillic_koi8r.csv": "wrong encoding (shift_jis_2004 != KOI8-R)",
"E30_pathological_lying_bom.csv": "utf-8-sig fails on cp1252 body bytes; needs lying-BOM recovery",
}
KNOWN_DECODE_FAILURES: dict[str, str] = {}
def _normalize_encoding(name: str) -> str:
@@ -164,7 +145,12 @@ def _decodable_entries():
],
)
def test_decoded_matches_reference(entry):
df, _, _ = _load_for_analysis(CORPUS / entry["filename"], sample_rows=1000)
# The reference files preserve smart quotes — disable byte-level
# smart-quote folding so this round-trip identity test isn't
# confounded by the analyzer's deliberate parser-safety fold.
df, _, _ = _load_for_analysis(
CORPUS / entry["filename"], sample_rows=1000, fold_quotes=False,
)
ref_text = REFERENCES[entry["canonical_content_id"]]
ref_rows = list(csv.reader(io.StringIO(ref_text)))
if not ref_rows:

View File

@@ -230,8 +230,27 @@ class TestRepairMojibake:
class TestRepairMojibakeNoFtfy:
@pytest.mark.skipif(_HAS_FTFY, reason="ftfy installed — exercises the no-op path")
def test_returns_input_unchanged_without_ftfy(self):
def test_returns_input_unchanged_without_ftfy(self, monkeypatch):
"""Exercise the no-op path regardless of whether ftfy is installed.
``repair_mojibake`` lazy-imports ftfy inside the function body, so
we hide ``ftfy`` from ``sys.modules`` and from import resolution
before calling. The function must then degrade to ``(df, 0)``
without raising.
"""
import sys
import builtins
monkeypatch.delitem(sys.modules, "ftfy", raising=False)
real_import = builtins.__import__
def fake_import(name, *args, **kwargs):
if name == "ftfy" or name.startswith("ftfy."):
raise ImportError("ftfy hidden by test")
return real_import(name, *args, **kwargs)
monkeypatch.setattr(builtins, "__import__", fake_import)
df = pd.DataFrame({"x": ["café"]})
out, changed = repair_mojibake(df)
assert changed == 0

View File

@@ -0,0 +1,105 @@
"""Acceptance corpus for international format standardization.
Stresses the rework's three pillars on a single mixed-locale fixture:
* Per-row country column drives phone parsing.
* ``currency_decimal="auto"`` resolves comma-decimal locales.
* Streaming entry point handles the same content unchanged.
"""
from __future__ import annotations
from pathlib import Path
import pandas as pd
import pytest
from src.core.format_standardize import (
FieldType,
StandardizeOptions,
standardize_dataframe,
standardize_file,
)
CORPUS = Path(__file__).resolve().parents[1] / "test-cases" / "format-cleaner-corpus" / "international"
FIXTURE = CORPUS / "intl_phones_addresses.csv"
@pytest.fixture(scope="module")
def df():
return pd.read_csv(FIXTURE, dtype=str, keep_default_na=False)
@pytest.fixture(scope="module")
def options():
return StandardizeOptions(
column_types={
"name": FieldType.NAME,
"phone": FieldType.PHONE,
"price": FieldType.CURRENCY,
},
phone_country_column="country",
currency_preserve_code=True,
currency_decimal="auto",
)
class TestPhonesByRegion:
def test_every_row_lands_on_correct_e164_prefix(self, df, options):
# Each row's country column drives the per-row region used by
# phonenumbers.parse — the correct + prefix is the acceptance bar.
res = standardize_dataframe(df, options)
out = res.standardized_df
# ISO-2 → expected E.164 country code prefix
prefix_for_country = {
"US": "+1", "GB": "+44", "RU": "+7", "ES": "+34",
"FR": "+33", "JP": "+81", "DE": "+49", "IT": "+39",
"CN": "+86", "IN": "+91", "EG": "+20", "AU": "+61",
"BR": "+55", "MX": "+52", "KR": "+82", "TR": "+90",
"IL": "+972", "PL": "+48", "DK": "+45", "SE": "+46",
}
bad: list[tuple[str, str, str]] = []
for _, row in out.iterrows():
want = prefix_for_country[row["country"]]
got = row["phone"]
if not got.startswith(want):
bad.append((row["country"], want, got))
assert not bad, f"phone prefix mismatches: {bad}"
class TestCurrencyByLocale:
def test_eu_decimal_comma_resolves_under_auto(self, df, options):
res = standardize_dataframe(df, options)
# Spain, France, Germany, Italy, Brazil, Sweden all use decimal
# comma. Verify a clean numeric result post-standardization.
eu_idx = df.index[df["country"].isin(
["ES", "FR", "DE", "IT", "BR", "SE"]
)]
for i in eu_idx:
val = res.standardized_df.loc[i, "price"]
# Either ``CODE NNN.NN`` or bare ``NNN.NN`` — but the comma
# in the source must have become a dot in the output.
assert "," not in val, (
f"row {i} ({df.loc[i, 'country']}): comma persisted in {val!r}"
)
def test_brl_real_prefix_recognised(self, df, options):
res = standardize_dataframe(df, options)
br_row = res.standardized_df[res.standardized_df["country"] == "BR"].iloc[0]
assert "BRL" in br_row["price"]
class TestStreamingMatchesInMemory:
def test_same_output_via_streaming(self, tmp_path, df, options):
# Streaming the same fixture through standardize_file should
# produce a CSV byte-equivalent to the in-memory path.
in_mem = standardize_dataframe(df, options).standardized_df
out = tmp_path / "out.csv"
# Use a chunk size that splits the 20-row fixture mid-way.
res = standardize_file(FIXTURE, out, options, chunk_size=7)
assert res.rows_processed == len(df)
streamed = pd.read_csv(out, dtype=str, keep_default_na=False)
# Compare typed columns only — others pass through.
for col in options.column_types:
assert streamed[col].tolist() == in_mem[col].astype(str).tolist(), (
f"column {col} differs between in-memory and streaming"
)

View File

@@ -110,16 +110,16 @@ _DATE_EXPECTED_MDY: dict[str, object] = {
"FD13": PASSTHROUGH,
"FD14": PASSTHROUGH,
"FD15": PASSTHROUGH,
# excel serial → 2024-01-15 (xfail — not implemented)
# excel serial dates (numeric days since 1899-12-30)
"FD22": "2024-01-15",
"FD23": "2024-01-15",
# unix timestamp seconds / millis → 2024-01-15 (xfail)
# unix timestamps (seconds, milliseconds)
"FD24": "2024-01-15",
"FD25": "2024-01-15",
# partial precision — corpus preserves it
"FD26": "2024-01",
"FD27": "2024-01", # xfail — text precision
"FD28": "2024-Q1", # xfail — quarter
"FD27": "2024-01", # text precision month
"FD28": "2024-Q1", # quarter
"FD29": "2024",
# 2-digit year cutoff (per docs: 1969 wins over 2069)
"FD30": "1969-01-15",
@@ -135,7 +135,7 @@ _DATE_EXPECTED_MDY: dict[str, object] = {
"FD37": "2024-01-15",
# garbage → pass through (corpus 0.3 boundary table)
# FD38/39/40 → PASSTHROUGH default
# locale-specific month names (xfail — not shipped)
# locale-specific month names (en/fr/de via month_locales)
"FD41": "2024-01-15",
"FD42": "2024-01-15",
# timezone — corpus 3.3 says fixed-offset only

View File

@@ -0,0 +1,301 @@
"""Tests for the format-standardizer rework: cache, vectorized dispatch,
per-row country, audit cap, and streaming entry point."""
from __future__ import annotations
import csv
from pathlib import Path
import pandas as pd
import pytest
from src.core.format_standardize import (
FieldType,
StandardizeOptions,
StreamingStandardizeResult,
_normalize_region,
standardize_dataframe,
standardize_file,
)
# ---------------------------------------------------------------------------
# Per-row country / region
# ---------------------------------------------------------------------------
class TestPerRowCountry:
def test_phone_uses_per_row_country(self):
df = pd.DataFrame({
"phone": ["020 7946 0958", "03-3210-7000", "(415) 555-1234"],
"country": ["GB", "JP", "US"],
})
opts = StandardizeOptions(
column_types={"phone": FieldType.PHONE},
phone_country_column="country",
)
res = standardize_dataframe(df, opts)
out = res.standardized_df["phone"].tolist()
assert out[0].startswith("+44")
assert out[1].startswith("+81")
assert out[2].startswith("+1")
def test_phone_country_full_name_resolved(self):
df = pd.DataFrame({
"phone": ["020 7946 0958"],
"country": ["United Kingdom"],
})
opts = StandardizeOptions(
column_types={"phone": FieldType.PHONE},
phone_country_column="country",
)
res = standardize_dataframe(df, opts)
assert res.standardized_df["phone"].iloc[0].startswith("+44")
def test_blank_country_falls_back_to_default(self):
df = pd.DataFrame({
"phone": ["(415) 555-1234"],
"country": [""], # blank → use default region
})
opts = StandardizeOptions(
column_types={"phone": FieldType.PHONE},
phone_country_column="country",
phone_region="US",
)
res = standardize_dataframe(df, opts)
assert res.standardized_df["phone"].iloc[0] == "+14155551234"
def test_unknown_country_column_raises(self):
df = pd.DataFrame({"phone": ["x"]})
opts = StandardizeOptions(
column_types={"phone": FieldType.PHONE},
phone_country_column="missing_col",
)
from src.core.errors import InputValidationError
with pytest.raises(InputValidationError):
standardize_dataframe(df, opts)
class TestNormalizeRegion:
def test_iso2_passthrough(self):
assert _normalize_region("US") == "US"
assert _normalize_region("us") == "US"
assert _normalize_region(" jp ") == "JP"
def test_iso3_mapped(self):
assert _normalize_region("USA") == "US"
assert _normalize_region("GBR") == "GB"
assert _normalize_region("JPN") == "JP"
def test_full_name(self):
assert _normalize_region("United States") == "US"
assert _normalize_region("Japan") == "JP"
assert _normalize_region("Brazil") == "BR"
assert _normalize_region("brasil") == "BR"
assert _normalize_region("España") == "ES"
def test_blank_or_unknown(self):
assert _normalize_region("") is None
assert _normalize_region(" ") is None
assert _normalize_region(None) is None
assert _normalize_region("xyz-no-such-country") is None
# ---------------------------------------------------------------------------
# Audit cap
# ---------------------------------------------------------------------------
class TestAuditCap:
def test_cap_truncates_change_rows(self):
df = pd.DataFrame({
"phone": ["(415) 555-12{:02d}".format(i) for i in range(50)],
})
opts = StandardizeOptions(
column_types={"phone": FieldType.PHONE},
audit_max_rows=5,
)
res = standardize_dataframe(df, opts)
# cells_changed counts everything; the audit table is capped.
assert res.cells_changed == 50
assert len(res.changes) == 5
def test_unbounded_audit(self):
df = pd.DataFrame({
"phone": ["(415) 555-12{:02d}".format(i) for i in range(20)],
})
opts = StandardizeOptions(
column_types={"phone": FieldType.PHONE},
audit_max_rows=None,
)
res = standardize_dataframe(df, opts)
assert len(res.changes) == 20
# ---------------------------------------------------------------------------
# Cache + vectorized dispatch (correctness)
# ---------------------------------------------------------------------------
class TestCacheCorrectness:
def test_repeated_phone_consistent(self):
# 1000 copies of the same phone should produce identical output.
df = pd.DataFrame({"phone": ["(415) 555-1234"] * 1000})
opts = StandardizeOptions(
column_types={"phone": FieldType.PHONE},
audit_max_rows=None,
)
res = standardize_dataframe(df, opts)
assert (res.standardized_df["phone"] == "+14155551234").all()
assert res.cells_changed == 1000
def test_cache_disabled_still_works(self):
df = pd.DataFrame({"phone": ["(415) 555-1234", "020 7946 0958"]})
opts = StandardizeOptions(
column_types={"phone": FieldType.PHONE},
cache_size=0, # disabled
)
res = standardize_dataframe(df, opts)
assert res.standardized_df["phone"].iloc[0] == "+14155551234"
# ---------------------------------------------------------------------------
# Streaming standardize_file
# ---------------------------------------------------------------------------
class TestStandardizeFile:
def test_basic_streaming(self, tmp_path):
inp = tmp_path / "in.csv"
inp.write_text(
"phone,country,price\n"
"(415) 555-1234,US,$1500.00\n"
"020 7946 0958,GB,£99.99\n"
"03-3210-7000,JP,¥12000\n"
"+33 1 42 86 82 00,FR,€850.50\n"
)
out = tmp_path / "out.csv"
opts = StandardizeOptions(
column_types={"phone": FieldType.PHONE, "price": FieldType.CURRENCY},
phone_country_column="country",
currency_preserve_code=True,
)
res = standardize_file(inp, out, opts, chunk_size=2)
assert isinstance(res, StreamingStandardizeResult)
assert res.rows_processed == 4
assert res.chunks_processed == 2
assert out.exists()
out_df = pd.read_csv(out, dtype=str, keep_default_na=False)
assert out_df["phone"].iloc[0].startswith("+1")
assert out_df["phone"].iloc[1].startswith("+44")
assert out_df["phone"].iloc[2].startswith("+81")
assert out_df["phone"].iloc[3].startswith("+33")
def test_audit_capped_across_chunks(self, tmp_path):
# 60 rows, audit cap 10, chunks of 20 → audit must stop at 10.
inp = tmp_path / "in.csv"
rows = ["phone\n"] + [f"(415) 555-12{i:02d}\n" for i in range(60)]
inp.write_text("".join(rows))
out = tmp_path / "out.csv"
opts = StandardizeOptions(
column_types={"phone": FieldType.PHONE},
audit_max_rows=10,
)
res = standardize_file(inp, out, opts, chunk_size=20)
# Audit file exists and has exactly 10 data rows + 1 header.
audit_lines = res.audit_path.read_text().splitlines()
assert len(audit_lines) - 1 == 10
def test_audit_row_indices_are_global(self, tmp_path):
# Audit row numbers must reflect absolute file position, not chunk-local.
inp = tmp_path / "in.csv"
rows = ["phone\n"] + [f"(415) 555-12{i:02d}\n" for i in range(30)]
inp.write_text("".join(rows))
out = tmp_path / "out.csv"
opts = StandardizeOptions(
column_types={"phone": FieldType.PHONE},
audit_max_rows=None,
)
res = standardize_file(inp, out, opts, chunk_size=10)
audit = pd.read_csv(res.audit_path)
# Rows should be 0..29, monotonically increasing.
assert audit["row"].tolist() == list(range(30))
def test_progress_callback_fires(self, tmp_path):
inp = tmp_path / "in.csv"
inp.write_text("phone\n" + "\n".join("(415) 555-1234" for _ in range(20)) + "\n")
out = tmp_path / "out.csv"
opts = StandardizeOptions(column_types={"phone": FieldType.PHONE})
seen: list[tuple[int, int]] = []
def cb(rows, chunks):
seen.append((rows, chunks))
standardize_file(inp, out, opts, chunk_size=5, progress_callback=cb)
assert len(seen) == 4
assert seen[-1] == (20, 4)
def test_progress_callback_exception_does_not_abort(self, tmp_path):
inp = tmp_path / "in.csv"
inp.write_text("phone\n(415) 555-1234\n")
out = tmp_path / "out.csv"
opts = StandardizeOptions(column_types={"phone": FieldType.PHONE})
def bad_cb(*a, **k):
raise RuntimeError("boom")
# Must not raise.
res = standardize_file(inp, out, opts, chunk_size=1, progress_callback=bad_cb)
assert res.rows_processed == 1
def test_missing_input_raises_clean_error(self, tmp_path):
from src.core.errors import FileAccessError
opts = StandardizeOptions(column_types={"phone": FieldType.PHONE})
with pytest.raises(FileAccessError):
standardize_file(
tmp_path / "missing.csv",
tmp_path / "out.csv",
opts,
)
# ---------------------------------------------------------------------------
# International coverage smoke
# ---------------------------------------------------------------------------
class TestInternationalCoverage:
@pytest.mark.parametrize("number,country,prefix", [
("020 7946 0958", "GB", "+44"),
("03-3210-7000", "JP", "+81"),
("+49 30 12345678", "DE", "+49"),
("01 42 86 82 00", "FR", "+33"),
("+39 06 6982", "IT", "+39"),
("+34 91 411 1111", "ES", "+34"),
("+86 10 1234 5678", "CN", "+86"),
("+91 11 2345 6789", "IN", "+91"),
("+61 2 9374 4000", "AU", "+61"),
("11 3071 0000", "BR", "+55"),
("+52 55 5555 0000", "MX", "+52"),
("+82 2 2287 0114", "KR", "+82"),
])
def test_phone_via_per_row_region(self, number, country, prefix):
df = pd.DataFrame({"phone": [number], "country": [country]})
opts = StandardizeOptions(
column_types={"phone": FieldType.PHONE},
phone_country_column="country",
)
res = standardize_dataframe(df, opts)
out = res.standardized_df["phone"].iloc[0]
assert out.startswith(prefix), (
f"{number!r} ({country}): expected to start with {prefix}, got {out!r}"
)
@pytest.mark.parametrize("price,want_code", [
("$1,500.00", "USD"),
("€850,50", "EUR"),
("£99.99", "GBP"),
("¥12000", "JPY"),
("R$ 250,00", "BRL"),
("CHF 1200.00", "CHF"),
])
def test_currency_codes_detected(self, price, want_code):
df = pd.DataFrame({"price": [price]})
opts = StandardizeOptions(
column_types={"price": FieldType.CURRENCY},
currency_preserve_code=True,
currency_decimal="auto", # international mode
)
res = standardize_dataframe(df, opts)
assert want_code in res.standardized_df["price"].iloc[0]

View File

@@ -8,10 +8,8 @@ These cover edges that existing suites missed:
- ``analyze()`` with ``sample_rows >= len(df)`` (uses copy(), not head()).
- ``findings_by_tool`` on an empty list.
- BOM that appears mid-cell rather than at file start.
The collapse-whitespace heuristic for numeric/date/phone-shaped cells (spec
§4.17) is *not yet implemented* and is captured here as a known-gap xfail
so it's surfaced rather than silently missing.
- The collapse-whitespace heuristic for numeric/date/phone-shaped cells
(spec §4.17), now wired in via ``_smart_collapse_whitespace``.
"""
from __future__ import annotations

462
tests/test_missing.py Normal file
View File

@@ -0,0 +1,462 @@
"""Tests for src/core/missing.py."""
from __future__ import annotations
import json
import numpy as np
import pandas as pd
import pytest
from src.core.errors import ConfigError, InputValidationError
from src.core.missing import (
DEFAULT_SENTINELS,
MissingOptions,
PRESETS,
detect_sentinels,
handle_missing,
is_missing_like,
profile_missing,
)
# ---------------------------------------------------------------------------
# is_missing_like
# ---------------------------------------------------------------------------
class TestIsMissingLike:
def test_none(self):
assert is_missing_like(None)
def test_nan(self):
assert is_missing_like(np.nan)
def test_pd_nat(self):
assert is_missing_like(pd.NaT)
def test_empty_string(self):
assert is_missing_like("")
def test_whitespace_only(self):
assert is_missing_like(" ")
assert is_missing_like("\t\n ")
def test_default_sentinels(self):
for s in ("N/A", "n/a", "NULL", "null", "-", "--", "?", "TBD", "(blank)"):
assert is_missing_like(s), f"expected {s!r} to be missing-like"
def test_case_insensitive(self):
assert is_missing_like("N/A")
assert is_missing_like("n/A")
assert is_missing_like("NA")
assert is_missing_like("na")
def test_real_value_not_missing(self):
assert not is_missing_like("hello")
assert not is_missing_like("0")
assert not is_missing_like(0)
assert not is_missing_like(0.0)
def test_zero_is_not_missing(self):
# Common bug: treating 0 / "0" / False as missing.
assert not is_missing_like(0)
assert not is_missing_like(False)
def test_custom_sentinels_override(self):
assert is_missing_like("xx", sentinels=["xx"])
assert not is_missing_like("xx", sentinels=["zz"])
# ---------------------------------------------------------------------------
# detect_sentinels
# ---------------------------------------------------------------------------
class TestDetectSentinels:
def test_counts_by_label(self):
s = pd.Series(["alice", "N/A", "n/a", "NULL", " ", "", "bob"])
counts = detect_sentinels(s)
# "n/a" matches both 'N/A' and 'n/a' under casefold; the canonical
# label that wins is whichever is in the DEFAULT_SENTINELS list.
assert sum(v for k, v in counts.items() if k != "(whitespace)") == 3
assert counts["(whitespace)"] == 2
def test_skips_real_nan(self):
s = pd.Series(["a", np.nan, "N/A"])
counts = detect_sentinels(s)
assert sum(counts.values()) == 1
def test_no_sentinels_returns_empty(self):
s = pd.Series(["alice", "bob", "charlie"])
assert detect_sentinels(s) == {}
# ---------------------------------------------------------------------------
# profile_missing
# ---------------------------------------------------------------------------
class TestProfileMissing:
def test_basic(self):
df = pd.DataFrame({
"name": ["Alice", "Bob", "N/A", "", "Charlie"],
"age": [30, None, 25, 40, np.nan],
})
prof = profile_missing(df, MissingOptions())
assert prof.rows_total == 5
# name: '' + 'N/A' = 2 sentinels; age: 2 NaN
report_by_col = {r.column: r for r in prof.columns}
assert report_by_col["name"].missing == 2
assert report_by_col["age"].missing == 2
assert prof.cells_missing == 4
def test_complete_dataframe(self):
df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
prof = profile_missing(df, MissingOptions())
assert prof.cells_missing == 0
assert prof.rows_complete == 3
assert prof.rows_with_any_missing == 0
def test_to_dataframe_columns(self):
df = pd.DataFrame({"x": [1, None]})
prof = profile_missing(df, MissingOptions())
out = prof.to_dataframe()
assert set(out.columns) >= {"column", "missing", "missing_pct", "top_sentinel"}
def test_disabled_sentinels_only_counts_real_nan(self):
df = pd.DataFrame({"x": ["N/A", "alice", np.nan]})
opts = MissingOptions(standardize_sentinels=False)
prof = profile_missing(df, opts)
report_by_col = {r.column: r for r in prof.columns}
# Only the real NaN counts; 'N/A' is left alone.
assert report_by_col["x"].missing == 1
# ---------------------------------------------------------------------------
# handle_missing — sentinel standardization
# ---------------------------------------------------------------------------
class TestSentinelStandardization:
def test_replaces_sentinels_with_nan(self):
df = pd.DataFrame({"x": ["alice", "N/A", "-", " ", "bob"]})
res = handle_missing(df, MissingOptions(strategy="none"))
# 'N/A' + '-' + whitespace-only = 3
assert res.sentinels_standardized == 3
assert res.handled_df["x"].isna().sum() == 3
assert res.handled_df.iloc[0]["x"] == "alice"
assert res.handled_df.iloc[4]["x"] == "bob"
def test_audit_records_each_replacement(self):
df = pd.DataFrame({"x": ["alice", "N/A", "bob"]})
res = handle_missing(df, MissingOptions(strategy="none"))
assert len(res.changes) == 1
assert res.changes.iloc[0]["action"].startswith("standardize:")
def test_disabled_keeps_sentinels(self):
df = pd.DataFrame({"x": ["alice", "N/A", "bob"]})
opts = MissingOptions(standardize_sentinels=False, strategy="none")
res = handle_missing(df, opts)
assert res.sentinels_standardized == 0
assert res.handled_df.iloc[1]["x"] == "N/A"
def test_custom_sentinels_extend_default(self):
df = pd.DataFrame({"x": ["alice", "MISSING_DATA", "bob"]})
opts = MissingOptions(
sentinels=[*DEFAULT_SENTINELS, "MISSING_DATA"],
strategy="none",
)
res = handle_missing(df, opts)
assert res.sentinels_standardized == 1
# ---------------------------------------------------------------------------
# handle_missing — fill strategies
# ---------------------------------------------------------------------------
class TestFillStrategies:
@pytest.fixture
def numeric_df(self):
return pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, np.nan]})
def test_mean(self, numeric_df):
res = handle_missing(numeric_df, MissingOptions(strategy="mean"))
# mean of [1, 2, 4] = 7/3
filled = res.handled_df["x"].iloc[2]
assert abs(filled - 7.0 / 3.0) < 1e-9
assert res.cells_filled == 2
def test_median(self, numeric_df):
res = handle_missing(numeric_df, MissingOptions(strategy="median"))
# median of [1, 2, 4] = 2.0
assert res.handled_df["x"].iloc[2] == 2.0
def test_mode(self):
df = pd.DataFrame({"x": ["a", "a", "b", None, None]})
res = handle_missing(df, MissingOptions(strategy="mode"))
assert res.handled_df["x"].iloc[3] == "a"
assert res.handled_df["x"].iloc[4] == "a"
assert res.cells_filled == 2
def test_constant_scalar(self, numeric_df):
res = handle_missing(
numeric_df,
MissingOptions(strategy="constant", fill_value=99.0),
)
assert res.handled_df["x"].iloc[2] == 99.0
assert res.handled_df["x"].iloc[4] == 99.0
def test_constant_per_column(self):
df = pd.DataFrame({"a": [1, np.nan], "b": ["x", None]})
opts = MissingOptions(
strategy="constant",
column_fill_values={"a": 0, "b": "?"},
)
res = handle_missing(df, opts)
assert res.handled_df["a"].iloc[1] == 0
assert res.handled_df["b"].iloc[1] == "?"
def test_ffill(self):
df = pd.DataFrame({"x": [1.0, np.nan, np.nan, 4.0]})
res = handle_missing(df, MissingOptions(strategy="ffill"))
assert list(res.handled_df["x"]) == [1.0, 1.0, 1.0, 4.0]
def test_bfill(self):
df = pd.DataFrame({"x": [1.0, np.nan, np.nan, 4.0]})
res = handle_missing(df, MissingOptions(strategy="bfill"))
assert list(res.handled_df["x"]) == [1.0, 4.0, 4.0, 4.0]
def test_interpolate(self):
df = pd.DataFrame({"x": [1.0, np.nan, np.nan, 4.0]})
res = handle_missing(df, MissingOptions(strategy="interpolate"))
assert list(res.handled_df["x"]) == [1.0, 2.0, 3.0, 4.0]
def test_numeric_strategy_falls_back_for_categorical(self):
df = pd.DataFrame({"x": ["a", "a", None, "b"]})
opts = MissingOptions(strategy="median", categorical_strategy="mode")
res = handle_missing(df, opts)
assert res.strategy_per_column["x"] == "mode"
assert res.handled_df["x"].iloc[2] == "a"
def test_per_column_strategy_overrides_global(self):
df = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", None]})
opts = MissingOptions(
strategy="median",
column_strategies={"b": "constant"},
fill_value="??",
)
res = handle_missing(df, opts)
assert res.handled_df["a"].iloc[1] == 1.0 # median of [1.0]
assert res.handled_df["b"].iloc[1] == "??"
def test_all_nan_column_safely_skipped(self):
df = pd.DataFrame({"x": [np.nan, np.nan, np.nan]})
res = handle_missing(df, MissingOptions(strategy="mean"))
assert res.cells_filled == 0
assert res.handled_df["x"].isna().all()
# ---------------------------------------------------------------------------
# handle_missing — drops
# ---------------------------------------------------------------------------
class TestDropStrategies:
def test_drop_row_any_missing(self):
# Strict-greater: threshold 0.0 → drop any row with any missing.
df = pd.DataFrame({
"a": [1, 2, np.nan, 4],
"b": ["x", None, "z", "w"],
})
opts = MissingOptions(strategy="drop_row", row_drop_threshold=0.0)
res = handle_missing(df, opts)
# Rows 1 and 2 each have one missing cell; rows 0 and 3 are clean.
assert res.rows_dropped == 2
assert len(res.handled_df) == 2
def test_drop_row_default_threshold_never_drops(self):
# Default 1.0 = never drop — no fraction exceeds 100%.
df = pd.DataFrame({
"a": [1, 2, np.nan],
"b": ["x", "y", None],
})
opts = MissingOptions(strategy="drop_row") # threshold defaults to 1.0
res = handle_missing(df, opts)
assert res.rows_dropped == 0
def test_drop_row_partial_threshold(self):
df = pd.DataFrame({
"a": [1, np.nan, np.nan, np.nan],
"b": [10, 20, np.nan, np.nan],
"c": [100, 200, np.nan, 400],
})
# Strict-greater: threshold 0.5 → drop rows with > 50% missing.
opts = MissingOptions(strategy="drop_row", row_drop_threshold=0.5)
res = handle_missing(df, opts)
# row 0: 0/3, row 1: 1/3 (0.33) -> keep
# row 2: 3/3 (1.0) -> drop, row 3: 2/3 (0.67) -> drop
assert res.rows_dropped == 2
def test_drop_col_threshold(self):
df = pd.DataFrame({
"keep": [1, 2, 3, 4],
"drop_me": [np.nan, np.nan, np.nan, 1], # 75% missing
})
# Strict-greater: 0.5 → drop columns with > 50% missing.
opts = MissingOptions(strategy="drop_col", col_drop_threshold=0.5)
res = handle_missing(df, opts)
assert "drop_me" in res.columns_dropped
assert "keep" not in res.columns_dropped
def test_drop_both(self):
df = pd.DataFrame({
"keep": [1, 2, 3, 4, 5],
"drop_col": [np.nan] * 5,
"x": [1, np.nan, 3, np.nan, 5],
})
opts = MissingOptions(
strategy="drop_both",
col_drop_threshold=0.99, # >99% missing → drop column
row_drop_threshold=0.0, # any missing in remaining cols → drop row
)
res = handle_missing(df, opts)
# drop_col is 100% missing → dropped
assert "drop_col" in res.columns_dropped
# Remaining scope (keep + x): rows 1 and 3 have a missing x → drop.
assert res.rows_dropped == 2
def test_drop_audit_records_dropped_rows(self):
df = pd.DataFrame({"a": [1, np.nan], "b": [2, np.nan]})
# Drop the fully-missing row (frac > 0.99).
opts = MissingOptions(strategy="drop_row", row_drop_threshold=0.99)
res = handle_missing(df, opts)
drop_records = res.changes[res.changes["action"] == "drop_row"]
assert len(drop_records) == 1
# ---------------------------------------------------------------------------
# Scope: columns / skip_columns
# ---------------------------------------------------------------------------
class TestScope:
def test_columns_filter(self):
df = pd.DataFrame({"a": [np.nan, 2], "b": [np.nan, 4]})
opts = MissingOptions(columns=["a"], strategy="constant", fill_value=99)
res = handle_missing(df, opts)
assert res.handled_df["a"].iloc[0] == 99
# b should be untouched
assert pd.isna(res.handled_df["b"].iloc[0])
def test_skip_columns(self):
df = pd.DataFrame({"a": [np.nan, 2], "b": [np.nan, 4]})
opts = MissingOptions(skip_columns=["b"], strategy="constant", fill_value=99)
res = handle_missing(df, opts)
assert res.handled_df["a"].iloc[0] == 99
assert pd.isna(res.handled_df["b"].iloc[0])
def test_unknown_column_raises(self):
df = pd.DataFrame({"a": [1]})
opts = MissingOptions(columns=["does_not_exist"])
with pytest.raises(InputValidationError):
handle_missing(df, opts)
# ---------------------------------------------------------------------------
# Presets / config
# ---------------------------------------------------------------------------
class TestPresets:
def test_detect_only_does_not_fill(self):
df = pd.DataFrame({"x": ["alice", "N/A", "bob"]})
opts = MissingOptions.from_preset("detect-only")
res = handle_missing(df, opts)
assert res.sentinels_standardized == 1
assert res.cells_filled == 0
assert res.rows_dropped == 0
def test_safe_fill_fills(self):
df = pd.DataFrame({"age": [30, np.nan, 25, 40], "name": ["a", "a", None, "b"]})
opts = MissingOptions.from_preset("safe-fill")
res = handle_missing(df, opts)
assert res.cells_filled == 2
def test_drop_incomplete(self):
df = pd.DataFrame({"a": [1, np.nan, 3], "b": [10, 20, 30]})
opts = MissingOptions.from_preset("drop-incomplete")
res = handle_missing(df, opts)
assert res.rows_dropped == 1
def test_unknown_preset_raises(self):
with pytest.raises(ConfigError):
MissingOptions.from_preset("does-not-exist")
def test_roundtrip_to_file(self, tmp_path):
opts = MissingOptions.from_preset("safe-fill")
opts.column_strategies = {"age": "median"}
path = tmp_path / "cfg.json"
opts.to_file(path)
loaded = MissingOptions.from_file(path)
assert loaded.strategy == opts.strategy
assert loaded.column_strategies == opts.column_strategies
# ---------------------------------------------------------------------------
# Validation
# ---------------------------------------------------------------------------
class TestValidate:
def test_invalid_strategy(self):
opts = MissingOptions(strategy="bogus") # type: ignore[arg-type]
with pytest.raises(InputValidationError):
opts.validate()
def test_threshold_out_of_range(self):
opts = MissingOptions(row_drop_threshold=1.5)
with pytest.raises(ConfigError):
opts.validate()
def test_handle_missing_validates(self):
df = pd.DataFrame({"x": [1]})
opts = MissingOptions(strategy="bogus") # type: ignore[arg-type]
with pytest.raises(InputValidationError):
handle_missing(df, opts)
def test_non_dataframe_input(self):
with pytest.raises(InputValidationError):
handle_missing([1, 2, 3]) # type: ignore[arg-type]
# ---------------------------------------------------------------------------
# End-to-end realistic case
# ---------------------------------------------------------------------------
class TestEndToEnd:
def test_messy_customer_export(self):
df = pd.DataFrame({
"customer_id": [1, 2, 3, 4, 5, 6],
"name": ["Alice", "Bob", "N/A", " ", "Charlie", None],
"email": ["a@x.com", "-", "c@x.com", "d@x.com", "NULL", "f@x.com"],
"age": [30, np.nan, 25, 40, np.nan, 50],
})
opts = MissingOptions(
standardize_sentinels=True,
strategy="median",
categorical_strategy="constant",
fill_value="UNKNOWN",
)
res = handle_missing(df, opts)
# Sentinels: name "N/A"," ",None; email "-","NULL". (None is real-NaN, not sentinel.)
# Whitespace + 'N/A' on name = 2; '-' + 'NULL' on email = 2. Total = 4.
assert res.sentinels_standardized == 4
# name has 3 missing after standardize (N/A, " ", None) → constant fill
# email has 2 missing → constant fill
# age has 2 missing → median (32.5 of [30, 25, 40, 50])
assert res.cells_filled == 7
assert res.handled_df["name"].isna().sum() == 0
assert res.handled_df["email"].isna().sum() == 0
assert res.handled_df["age"].isna().sum() == 0
assert (res.handled_df["name"] == "UNKNOWN").sum() == 3
assert (res.handled_df["age"] == 35.0).sum() == 2 # median of [30, 25, 40, 50]
def test_input_not_mutated(self):
df = pd.DataFrame({"x": ["N/A", "alice", np.nan]})
df_copy = df.copy()
handle_missing(df, MissingOptions.from_preset("safe-fill"))
pd.testing.assert_frame_equal(df, df_copy)

View File

@@ -0,0 +1,463 @@
"""Acceptance corpus for the Missing Value Handler.
Loads every fixture in ``test-cases/missing-corpus/test_data/`` and
asserts the documented behaviour. The fixtures are split into:
* ``uc##`` — three target-client use cases (Shopify operator,
marketing analyst, consultant intake).
* ``ec##`` — edge cases the engine must handle without surprise:
all-NaN columns, zeros that aren't missing, Excel errors, unicode
whitespace, mixed dtypes, padding, single row/column, every default
sentinel, per-column constants, drop thresholds, leading-NaN ffill,
numeric-strategy fallback for non-numeric columns, headers-only,
idempotency.
Each test runs through the public API (``handle_missing``) so any
regression in the engine surfaces here. Fixture files double as living
documentation for what the tool is supposed to do.
"""
from __future__ import annotations
import io
from pathlib import Path
import numpy as np
import pandas as pd
import pytest
from src.core.missing import (
MissingOptions,
handle_missing,
is_missing_like,
profile_missing,
)
CORPUS = Path(__file__).resolve().parents[1] / "test-cases" / "missing-corpus"
TEST_DATA = CORPUS / "test_data"
def _read(name: str, *, dtype_str: bool = False) -> pd.DataFrame:
"""Load a corpus CSV.
By default we let pandas infer dtypes — that's the most realistic
intake path (Excel exports keep numeric columns numeric). A handful
of cases pass ``dtype_str=True`` to keep sentinels visible in
columns that would otherwise be coerced to float.
"""
path = TEST_DATA / name
if dtype_str:
return pd.read_csv(path, dtype=str, keep_default_na=False)
return pd.read_csv(path)
# ---------------------------------------------------------------------------
# Use case 1 — Shopify operator: detect-only
# ---------------------------------------------------------------------------
class TestUC01ShopifyExport:
"""SMB operator standardizes disguised nulls before reimporting."""
def test_detect_only_replaces_sentinels(self):
df = _read("uc01_shopify_export.csv", dtype_str=True)
opts = MissingOptions.from_preset("detect-only")
res = handle_missing(df, opts)
# Spot-check known sentinels from the fixture
assert res.sentinels_standardized > 0
assert res.cells_filled == 0
assert res.rows_dropped == 0
# Fields that contained 'N/A', '-', 'NULL', '(blank)', '#N/A',
# 'n/a', '?', '(none)' should now be NaN.
for row, col in [
(1, "phone"), # 'N/A'
(2, "city"), # '-'
(3, "total_orders"), # 'NULL'
(5, "phone"), # ' '
(5, "last_order_date"), # '(blank)'
(6, "last_order_date"), # '#N/A'
(7, "phone"), # 'n/a'
(8, "city"), # '?'
(9, "total_orders"), # '(none)'
]:
assert pd.isna(res.handled_df.iloc[row][col]), (
f"Expected NaN at row {row} col {col}, got "
f"{res.handled_df.iloc[row][col]!r}"
)
def test_real_values_preserved(self):
df = _read("uc01_shopify_export.csv", dtype_str=True)
res = handle_missing(df, MissingOptions.from_preset("detect-only"))
# First row should be untouched.
assert res.handled_df.iloc[0]["first_name"] == "Alice"
assert res.handled_df.iloc[0]["email"] == "alice@shop.com"
assert res.handled_df.iloc[0]["lifetime_value"] == "1240.50"
def test_audit_log_complete(self):
df = _read("uc01_shopify_export.csv", dtype_str=True)
res = handle_missing(df, MissingOptions.from_preset("detect-only"))
# One audit row per sentinel replacement.
assert len(res.changes) == res.sentinels_standardized
assert set(res.changes["action"].apply(lambda s: s.startswith("standardize:"))) == {True}
# ---------------------------------------------------------------------------
# Use case 2 — Marketing analyst: safe-fill
# ---------------------------------------------------------------------------
class TestUC02MarketingAudience:
"""Marketer fills numeric columns with median, categorical with mode."""
def test_safe_fill_clears_all_missing(self):
df = _read("uc02_marketing_audience.csv")
opts = MissingOptions.from_preset("safe-fill")
res = handle_missing(df, opts)
# Every cell in scope should be filled.
assert res.profile_after.cells_missing == 0
assert res.cells_filled > 0
def test_numeric_uses_median_categorical_uses_mode(self):
df = _read("uc02_marketing_audience.csv")
opts = MissingOptions.from_preset("safe-fill")
res = handle_missing(df, opts)
# 'age' is numeric → median strategy
assert res.strategy_per_column["age"] == "median"
# 'segment' / 'region' / 'source' are object → mode fallback
assert res.strategy_per_column["segment"] == "mode"
assert res.strategy_per_column["region"] == "mode"
def test_per_column_override(self):
df = _read("uc02_marketing_audience.csv")
opts = MissingOptions.from_preset("safe-fill")
opts.column_strategies = {"source": "constant"}
opts.column_fill_values = {"source": "unknown"}
res = handle_missing(df, opts)
# Cells previously holding sentinels in 'source' should now equal "unknown".
assert (res.handled_df["source"] == "unknown").sum() >= 3
def test_consent_real_false_not_dropped(self):
# 'consent' column has empty cells but also explicit "true"; mode fill
# must not silently change a real "true" to anything else.
df = _read("uc02_marketing_audience.csv")
res = handle_missing(df, MissingOptions.from_preset("safe-fill"))
original_trues = (df["consent"] == "true").sum()
result_trues = (res.handled_df["consent"] == "true").sum()
# Filled rows can become "true" (mode) but should not lose existing trues.
assert result_trues >= original_trues
# ---------------------------------------------------------------------------
# Use case 3 — Consultant intake: threshold drops + fill
# ---------------------------------------------------------------------------
class TestUC03ConsultantIntake:
"""Drop sparse columns and rows, then fill the survivors."""
def test_drop_col_removes_legacy_fields(self):
df = _read("uc03_consultant_intake.csv", dtype_str=True)
# internal_id_legacy and beta_field are 100% missing — drop them.
opts = MissingOptions(
standardize_sentinels=True,
strategy="drop_col",
col_drop_threshold=0.99,
)
res = handle_missing(df, opts)
assert "internal_id_legacy" in res.columns_dropped
assert "beta_field" in res.columns_dropped
def test_drop_row_removes_mostly_empty_respondents(self):
df = _read("uc03_consultant_intake.csv", dtype_str=True)
opts = MissingOptions(
standardize_sentinels=True,
strategy="drop_both",
col_drop_threshold=0.99, # drop the legacy / beta cols first
row_drop_threshold=0.5, # then drop rows with >50% missing
)
res = handle_missing(df, opts)
# R-002, R-005, R-007, R-010 are mostly-empty respondents.
assert res.rows_dropped >= 4
# Non-empty respondents survive.
kept_ids = set(res.handled_df["respondent_id"].tolist())
for survivor in ("R-001", "R-003", "R-006", "R-008", "R-009", "R-012"):
assert survivor in kept_ids
# ---------------------------------------------------------------------------
# Edge cases
# ---------------------------------------------------------------------------
class TestEC01AllNanColumn:
def test_fill_skips_all_nan_column(self):
df = _read("ec01_all_nan_column.csv")
res = handle_missing(df, MissingOptions(strategy="mean"))
# Mean of all-NaN is NaN — engine must NOT fabricate a value.
assert res.handled_df["deprecated_field"].isna().all()
assert res.cells_filled == 0
def test_drop_col_catches_all_nan(self):
df = _read("ec01_all_nan_column.csv")
res = handle_missing(
df, MissingOptions(strategy="drop_col", col_drop_threshold=0.99),
)
assert "deprecated_field" in res.columns_dropped
assert "name" not in res.columns_dropped
class TestEC02NoMissing:
def test_clean_file_is_noop(self):
df = _read("ec02_no_missing.csv")
res = handle_missing(df, MissingOptions.from_preset("safe-fill"))
assert res.sentinels_standardized == 0
assert res.cells_filled == 0
assert res.rows_dropped == 0
pd.testing.assert_frame_equal(res.handled_df, df)
class TestEC03ZeroIsNotMissing:
def test_zero_preserved(self):
df = _read("ec03_zero_is_not_missing.csv")
res = handle_missing(df, MissingOptions.from_preset("safe-fill"))
# Original zeros remain zero.
assert (res.handled_df["balance"] == 0).sum() == (df["balance"] == 0).sum()
assert (res.handled_df["count"] == 0).sum() == (df["count"] == 0).sum()
# No spurious changes recorded.
assert res.cells_filled == 0
assert res.sentinels_standardized == 0
def test_is_missing_like_zero_predicate(self):
# Direct predicate check — zeros, false, "0" must all be non-missing.
assert not is_missing_like(0)
assert not is_missing_like(0.0)
assert not is_missing_like(False)
assert not is_missing_like("0")
assert not is_missing_like("0.00")
class TestEC04ExcelErrors:
def test_excel_error_sentinels_recognized(self):
df = _read("ec04_excel_errors.csv", dtype_str=True)
res = handle_missing(df, MissingOptions(strategy="none"))
# 6 error sentinels in the fixture: #N/A, #NULL!, #VALUE!, #N/A, #N/A, #NULL!
assert res.sentinels_standardized == 6
class TestEC05UnicodeWhitespace:
def test_nbsp_and_ideographic_space_count_as_missing(self):
df = _read("ec05_unicode_whitespace.csv", dtype_str=True)
res = handle_missing(df, MissingOptions(strategy="none"))
# rows 1, 2, 4 contain NBSP / tab / ideographic space respectively
assert res.handled_df["note"].isna().sum() == 3
assert res.handled_df.iloc[0]["note"] == "hello"
assert res.handled_df.iloc[3]["note"] == "real"
class TestEC06MixedDtypes:
def test_mixed_column_falls_back_to_mode(self):
# Read with native dtypes so 'real_num' stays numeric.
df = _read("ec06_mixed_dtypes.csv")
opts = MissingOptions(
standardize_sentinels=True,
strategy="median",
categorical_strategy="mode",
)
res = handle_missing(df, opts)
# mixed_col holds 'N/A' / 'hello' alongside numbers → object dtype,
# median falls back to mode.
assert res.strategy_per_column["mixed_col"] == "mode"
# real_num is float dtype → median runs.
assert res.strategy_per_column["real_num"] == "median"
class TestEC07RealDataWithPadding:
def test_padded_real_data_not_treated_as_missing(self):
df = _read("ec07_real_data_with_padding.csv", dtype_str=True)
res = handle_missing(df, MissingOptions(strategy="none"))
# Only row 1 (name=" ") and row 2 (city=blank) should become NaN.
# " Alice ", " Bob ", " SF" must remain.
assert res.handled_df.iloc[0]["name"] == " Alice "
assert res.handled_df.iloc[2]["name"] == " Bob "
assert res.handled_df.iloc[3]["city"] == " SF"
class TestEC08SingleRow:
def test_single_row_handles_cleanly(self):
df = _read("ec08_single_row.csv", dtype_str=True)
# detect-only
res = handle_missing(df, MissingOptions(strategy="none"))
assert res.sentinels_standardized == 2 # 'N/A' + ''
# safe-fill on a one-row file: median/mode of a single value is itself.
res2 = handle_missing(df, MissingOptions.from_preset("safe-fill"))
assert res2.handled_df.iloc[0]["name"] == "Alice"
class TestEC09SingleColumn:
def test_single_column_works(self):
df = _read("ec09_single_column.csv", dtype_str=True)
res = handle_missing(df, MissingOptions(strategy="none"))
# 'N/A', whitespace-only ' ', '-' = 3 sentinels
assert res.sentinels_standardized == 3
assert res.handled_df["value"].isna().sum() == 3
class TestEC10AllSentinelVariants:
def test_every_default_sentinel_recognized(self):
df = _read("ec10_all_sentinel_variants.csv", dtype_str=True)
res = handle_missing(df, MissingOptions(strategy="none"))
# 20 sentinels + 1 real value
assert res.sentinels_standardized == 20
# The 'real_value' row stays.
assert (res.handled_df["sentinel_value"] == "real_value").sum() == 1
class TestEC11ConstantPerColumn:
def test_per_column_fill_values(self):
df = _read("ec11_constant_per_column.csv", dtype_str=True)
opts = MissingOptions(
strategy="constant",
column_fill_values={
"country": "USA",
"salary": "0",
"department": "Unassigned",
},
)
res = handle_missing(df, opts)
# Fixture has 1 UK row + 2 USA rows + 2 blanks. Filling blanks with
# "USA" yields 4 USA total; UK is preserved.
assert (res.handled_df["country"] == "USA").sum() == 4
assert (res.handled_df["country"] == "UK").sum() == 1
assert (res.handled_df["department"] == "Unassigned").sum() >= 2
class TestEC12DropThresholdBoundary:
def test_threshold_one_never_drops(self):
# threshold 1.0 + strict-greater = never drop.
df = _read("ec12_drop_threshold_boundary.csv")
opts = MissingOptions(strategy="drop_row", row_drop_threshold=1.0)
res = handle_missing(df, opts)
assert res.rows_dropped == 0
def test_threshold_just_under_one_drops_fully_missing(self):
# threshold 0.99: drop only fully-missing rows (frac > 0.99 → frac == 1.0).
df = _read("ec12_drop_threshold_boundary.csv")
opts = MissingOptions(
strategy="drop_row",
row_drop_threshold=0.99,
columns=["a", "b", "c", "d"], # exclude id from the scope
)
res = handle_missing(df, opts)
# Only row 3 (id=4, all four are NaN) qualifies.
assert res.rows_dropped == 1
def test_threshold_half_drops_majority_missing(self):
df = _read("ec12_drop_threshold_boundary.csv")
opts = MissingOptions(
strategy="drop_row",
row_drop_threshold=0.5,
columns=["a", "b", "c", "d"],
)
res = handle_missing(df, opts)
# Missing fractions across [a,b,c,d]:
# row 0: 0/4=0.0 keep
# row 1: 2/4=0.5 keep (strict >, not equal)
# row 2: 3/4=0.75 drop
# row 3: 4/4=1.0 drop
# row 4: 2/4=0.5 keep
assert res.rows_dropped == 2
def test_threshold_zero_drops_any_missing(self):
df = _read("ec12_drop_threshold_boundary.csv")
opts = MissingOptions(
strategy="drop_row",
row_drop_threshold=0.0,
columns=["a", "b", "c", "d"],
)
res = handle_missing(df, opts)
# Every body row except row 0 has at least one missing.
assert res.rows_dropped == 4
class TestEC13FfillLeadingNan:
def test_leading_nan_run_survives_ffill(self):
df = _read("ec13_ffill_leading_nan.csv")
res = handle_missing(df, MissingOptions(strategy="ffill"))
# First two rows (leading NaN) remain NaN — there's nothing to fill from.
assert pd.isna(res.handled_df["price"].iloc[0])
assert pd.isna(res.handled_df["price"].iloc[1])
# Mid-series gets filled forward.
assert res.handled_df["price"].iloc[3] == 100.0
assert res.handled_df["price"].iloc[4] == 100.0
# Trailing NaN gets filled by the last seen value.
assert res.handled_df["price"].iloc[6] == 150.0
class TestEC14InterpolateFallback:
def test_interpolate_on_non_numeric_falls_back(self):
df = _read("ec14_interpolate_fallback.csv", dtype_str=True)
opts = MissingOptions(
strategy="interpolate",
categorical_strategy="mode",
)
res = handle_missing(df, opts)
# All columns are object dtype here → fallback to mode.
assert res.strategy_per_column["category"] == "mode"
assert res.strategy_per_column["value"] == "mode"
class TestEC15HeadersOnly:
def test_empty_body_does_not_crash(self):
df = _read("ec15_headers_only.csv")
# All operations must be no-ops on an empty body.
for preset in ("detect-only", "safe-fill", "drop-incomplete"):
res = handle_missing(df, MissingOptions.from_preset(preset))
assert len(res.handled_df) == 0
assert res.cells_filled == 0
assert res.rows_dropped == 0
class TestEC16Idempotency:
def test_safe_fill_is_idempotent(self):
df = _read("ec16_idempotent_apply.csv", dtype_str=True)
opts = MissingOptions.from_preset("safe-fill")
first = handle_missing(df, opts)
second = handle_missing(first.handled_df, opts)
# Second pass should make no further changes.
pd.testing.assert_frame_equal(
second.handled_df.reset_index(drop=True),
first.handled_df.reset_index(drop=True),
)
assert second.cells_filled == 0
assert second.sentinels_standardized == 0
def test_detect_only_is_idempotent(self):
df = _read("ec16_idempotent_apply.csv", dtype_str=True)
opts = MissingOptions.from_preset("detect-only")
first = handle_missing(df, opts)
second = handle_missing(first.handled_df, opts)
assert second.sentinels_standardized == 0
# ---------------------------------------------------------------------------
# Whole-corpus property tests
# ---------------------------------------------------------------------------
ALL_FIXTURES = sorted(p.name for p in TEST_DATA.glob("*.csv"))
@pytest.mark.parametrize("fixture", ALL_FIXTURES)
def test_handle_missing_does_not_mutate_input(fixture):
"""Every fixture must leave the input DataFrame untouched."""
df = pd.read_csv(TEST_DATA / fixture, dtype=str, keep_default_na=False)
if df.empty and len(df.columns) == 0:
pytest.skip(f"{fixture}: completely empty file")
snapshot = df.copy(deep=True)
handle_missing(df, MissingOptions.from_preset("safe-fill"))
pd.testing.assert_frame_equal(df, snapshot)
@pytest.mark.parametrize("fixture", ALL_FIXTURES)
def test_profile_runs_on_every_fixture(fixture):
"""``profile_missing`` must succeed on every corpus file."""
df = pd.read_csv(TEST_DATA / fixture, dtype=str, keep_default_na=False)
prof = profile_missing(df, MissingOptions())
assert prof.rows_total == len(df)
assert prof.cells_total == len(df) * len(df.columns)

324
tests/test_pipeline.py Normal file
View File

@@ -0,0 +1,324 @@
"""Tests for src/core/pipeline.py."""
from __future__ import annotations
import json
import numpy as np
import pandas as pd
import pytest
from src.core.errors import ConfigError, InputValidationError
from src.core.pipeline import (
Pipeline,
PipelineResult,
SOFT_DEPENDENCIES,
Step,
StepResult,
TOOL_ADAPTERS,
TOOL_NAMES,
recommended_pipeline,
run_pipeline,
validate_pipeline,
)
# ---------------------------------------------------------------------------
# Step / Pipeline construction
# ---------------------------------------------------------------------------
class TestStep:
def test_unknown_tool_raises(self):
with pytest.raises(ConfigError):
Step(tool="bogus_tool")
def test_default_options_empty_dict(self):
s = Step(tool="text_clean")
assert s.options == {}
assert s.enabled is True
def test_display_name_falls_back_to_tool(self):
assert Step(tool="dedup").display_name() == "dedup"
assert Step(tool="dedup", name="Final dedup").display_name() == "Final dedup"
class TestPipelineSerialization:
def test_roundtrip_dict(self):
p = Pipeline(steps=[
Step("text_clean", {"trim": True}),
Step("dedup", {"survivor_rule": "first"}),
])
out = p.to_dict()
loaded = Pipeline.from_dict(out)
assert len(loaded.steps) == 2
assert loaded.steps[0].tool == "text_clean"
assert loaded.steps[1].options["survivor_rule"] == "first"
def test_roundtrip_file(self, tmp_path):
p = Pipeline(steps=[Step("text_clean")])
path = tmp_path / "p.json"
p.to_file(path)
loaded = Pipeline.from_file(path)
assert loaded.steps[0].tool == "text_clean"
def test_from_dict_missing_steps_key(self):
with pytest.raises(ConfigError):
Pipeline.from_dict({})
def test_from_dict_missing_tool(self):
with pytest.raises(ConfigError):
Pipeline.from_dict({"steps": [{"options": {}}]})
# ---------------------------------------------------------------------------
# recommended_pipeline
# ---------------------------------------------------------------------------
class TestRecommendedPipeline:
def test_default_order(self):
p = recommended_pipeline()
assert [s.tool for s in p.steps] == [
"text_clean", "format_standardize", "missing", "dedup",
]
def test_default_passes_validation(self):
p = recommended_pipeline()
assert validate_pipeline(p) == []
def test_include_overrides_default(self):
p = recommended_pipeline(include=["text_clean", "missing"])
assert [s.tool for s in p.steps] == ["text_clean", "missing"]
def test_options_seed_reaches_step(self):
p = recommended_pipeline(options={"text_clean": {"trim": False}})
assert p.steps[0].options == {"trim": False}
def test_unknown_tool_raises(self):
with pytest.raises(InputValidationError):
recommended_pipeline(include=["bogus"])
def test_can_place_column_map_first_or_last(self):
# Both placements must be acceptable per the docstring.
first = recommended_pipeline(include=[
"column_map", "text_clean", "format_standardize", "missing", "dedup",
])
last = recommended_pipeline(include=[
"text_clean", "format_standardize", "missing", "column_map", "dedup",
])
# No soft-dependency rule names column_map, so neither warns.
assert validate_pipeline(first) == []
assert validate_pipeline(last) == []
# ---------------------------------------------------------------------------
# validate_pipeline — soft dependencies
# ---------------------------------------------------------------------------
class TestValidatePipeline:
def test_in_order_no_warnings(self):
p = recommended_pipeline()
assert validate_pipeline(p) == []
def test_dedup_before_text_clean_warns(self):
p = Pipeline(steps=[Step("dedup"), Step("text_clean")])
ws = validate_pipeline(p)
assert len(ws) == 1
assert "dedup" in ws[0] and "text_clean" in ws[0]
def test_format_before_text_clean_warns(self):
p = Pipeline(steps=[Step("format_standardize"), Step("text_clean")])
ws = validate_pipeline(p)
assert any("format_standardize" in w for w in ws)
def test_disabled_steps_ignored(self):
# Disabled dedup-first should not trigger a warning.
p = Pipeline(steps=[
Step("dedup", enabled=False),
Step("text_clean"),
])
assert validate_pipeline(p) == []
def test_duplicate_tool_does_not_double_warn(self):
# text_clean twice (legitimate: two-pass cleaning) shouldn't
# generate redundant warnings.
p = Pipeline(steps=[
Step("text_clean"),
Step("text_clean"),
])
assert validate_pipeline(p) == []
# ---------------------------------------------------------------------------
# run_pipeline — execution
# ---------------------------------------------------------------------------
@pytest.fixture
def messy_df():
return pd.DataFrame({
"name": [" Alice ", "BOB", "N/A", "", "charlie "],
"phone": ["(415) 555-1234", "+44 20 7946 0958", "03-3210-7000", "", "(415) 555-1234"],
"country": ["US", "GB", "JP", "", "US"],
})
class TestRunPipeline:
def test_recommended_pipeline_runs_end_to_end(self, messy_df):
p = recommended_pipeline(options={
"format_standardize": {
"column_types": {"phone": "phone"},
"phone_country_column": "country",
},
"missing": {"strategy": "none"},
})
res = run_pipeline(messy_df, p)
assert isinstance(res, PipelineResult)
assert res.initial_rows == 5
# Dedup at the end removes the Alice/charlie duplicate (same phone).
assert res.final_rows < res.initial_rows
assert res.warnings == []
def test_initial_df_not_mutated(self, messy_df):
snapshot = messy_df.copy(deep=True)
run_pipeline(messy_df, recommended_pipeline())
pd.testing.assert_frame_equal(messy_df, snapshot)
def test_disabled_step_skipped(self, messy_df):
p = Pipeline(steps=[
Step("text_clean", enabled=False),
Step("missing", options={"strategy": "none"}),
])
res = run_pipeline(messy_df, p)
assert res.step_results[0].skipped is True
assert res.step_results[1].skipped is False
def test_step_results_ordered_and_timed(self, messy_df):
p = recommended_pipeline(options={
"missing": {"strategy": "none"},
})
res = run_pipeline(messy_df, p)
assert len(res.step_results) == 4
for sr in res.step_results:
assert sr.elapsed_seconds >= 0
assert [sr.step.tool for sr in res.step_results] == [
"text_clean", "format_standardize", "missing", "dedup",
]
def test_warnings_returned_but_run_proceeds(self, messy_df):
p = Pipeline(steps=[
Step("dedup"),
Step("text_clean"),
])
res = run_pipeline(messy_df, p)
assert res.warnings # warnings present
# Both steps still ran.
assert all(not sr.skipped for sr in res.step_results)
def test_progress_callback_fires_per_step(self, messy_df):
seen: list[StepResult] = []
p = Pipeline(steps=[
Step("text_clean"),
Step("missing", options={"strategy": "none"}),
])
run_pipeline(messy_df, p, on_step_complete=seen.append)
assert len(seen) == 2
assert all(isinstance(s, StepResult) for s in seen)
def test_progress_callback_exception_does_not_abort(self, messy_df):
def bad(_sr):
raise RuntimeError("boom")
p = Pipeline(steps=[Step("text_clean")])
# Must not raise.
res = run_pipeline(messy_df, p, on_step_complete=bad)
assert res.final_rows == 5
def test_stop_on_error_default(self, messy_df):
# Force an error by giving format_standardize a non-existent column.
p = Pipeline(steps=[
Step("format_standardize", options={
"column_types": {"does_not_exist": "phone"},
}),
])
with pytest.raises(InputValidationError):
run_pipeline(messy_df, p)
def test_continue_on_error_carries_previous_df(self, messy_df):
p = Pipeline(steps=[
Step("text_clean"),
Step("format_standardize", options={
"column_types": {"does_not_exist": "phone"},
}),
Step("missing", options={"strategy": "none"}),
])
res = run_pipeline(messy_df, p, stop_on_error=False)
# Step 2 errored, step 3 still ran.
assert res.step_results[1].error is not None
assert res.step_results[2].error is None
assert res.final_rows == 5
def test_non_dataframe_input(self):
with pytest.raises(InputValidationError):
run_pipeline([1, 2, 3], recommended_pipeline()) # type: ignore[arg-type]
# ---------------------------------------------------------------------------
# Per-tool adapter sanity
# ---------------------------------------------------------------------------
class TestAdapters:
@pytest.mark.parametrize("tool", TOOL_NAMES)
def test_adapter_with_default_options_runs(self, tool, messy_df):
# Each adapter must accept an empty options dict and return a
# (df, summary) pair.
out_df, summary = TOOL_ADAPTERS[tool](messy_df, {})
assert isinstance(out_df, pd.DataFrame)
assert isinstance(summary, dict)
def test_format_standardize_adapter_passes_column_types(self, messy_df):
out, summary = TOOL_ADAPTERS["format_standardize"](
messy_df, {"column_types": {"phone": "phone"}},
)
assert summary["columns_processed"] == ["phone"]
def test_dedup_adapter_with_unknown_survivor_rule_raises(self, messy_df):
with pytest.raises(ConfigError):
TOOL_ADAPTERS["dedup"](messy_df, {"survivor_rule": "bogus"})
# ---------------------------------------------------------------------------
# SOFT_DEPENDENCIES integrity
# ---------------------------------------------------------------------------
class TestSoftDependencies:
def test_every_pair_uses_known_tools(self):
for earlier, later, _ in SOFT_DEPENDENCIES:
assert earlier in TOOL_NAMES
assert later in TOOL_NAMES
def test_all_reasons_non_empty(self):
for _, _, why in SOFT_DEPENDENCIES:
assert why and isinstance(why, str)
# Reason should be a sentence — at least 20 chars.
assert len(why) > 20
def test_dependencies_form_a_dag(self):
# No cycles — there must exist a topological ordering of the
# tools such that every soft dependency (earlier, later)
# is satisfied. With 5 tools and 6 deps this is easy to verify.
from collections import defaultdict, deque
edges: dict[str, list[str]] = defaultdict(list)
in_degree: dict[str, int] = {t: 0 for t in TOOL_NAMES}
for e, l, _ in SOFT_DEPENDENCIES:
edges[e].append(l)
in_degree[l] += 1
queue = deque(t for t, d in in_degree.items() if d == 0)
order = []
while queue:
t = queue.popleft()
order.append(t)
for nxt in edges[t]:
in_degree[nxt] -= 1
if in_degree[nxt] == 0:
queue.append(nxt)
assert len(order) == len(TOOL_NAMES), (
f"SOFT_DEPENDENCIES contain a cycle; topo order={order}"
)