feat: 3 new tools, format streaming, distribution-ready demo + landing pages

Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.
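
A soft dependency graph with JSON round-tripping can be sketched roughly as follows. This is an illustration only, not the code in src/core/pipeline.py: the RECOMMENDED_BEFORE table, the step dictionaries, and every function name here are hypothetical.

```python
import json

# Hypothetical "recommended, not enforced" ordering hints:
# tool -> tools that should ideally have run earlier.
RECOMMENDED_BEFORE = {
    "format_standardize": {"text_clean"},
    "missing": {"text_clean"},
    "column_map": {"missing"},
    "dedup": {"format_standardize", "missing"},
}


def order_warnings(steps):
    """Return advisory messages; never raises, so the order stays a suggestion."""
    planned = {step["tool"] for step in steps}
    seen = set()
    warnings = []
    for step in steps:
        tool = step["tool"]
        for dep in sorted(RECOMMENDED_BEFORE.get(tool, ())):
            # Only warn when the dependency is in the pipeline but runs later.
            if dep in planned and dep not in seen:
                warnings.append(f"{tool}: recommended to run after {dep}")
        seen.add(tool)
    return warnings


def dump_pipeline(steps):
    """Serialise the pipeline so the same cleanup can be replayed later."""
    return json.dumps({"version": 1, "steps": steps}, indent=2)


def load_pipeline(text):
    """Inverse of dump_pipeline: return the list of steps."""
    return json.loads(text)["steps"]


steps = [
    {"tool": "text_clean", "options": {}},
    {"tool": "format_standardize", "options": {"currency_decimal": "auto"}},
    {"tool": "dedup", "options": {}},
]
print(order_warnings(steps))
print(load_pipeline(dump_pipeline(steps)) == steps)
```

Returning warnings instead of raising is what keeps the graph soft: an out-of-order pipeline still runs, and the caller decides whether to show the hints.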

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs
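
As a rough illustration of how chunked streaming, an LRU cache, and "auto" decimal detection fit together, here is a stdlib-only sketch. It is not the standardize_file() implementation: parse_amount, standardize_stream, the prefix list, and the "lone separator followed by exactly three digits means thousands" rule are all assumptions made for this example.

```python
import csv
import io
import re
from decimal import Decimal
from functools import lru_cache
from itertools import islice

# Assumed prefix set: the multi-char prefixes named above plus common symbols.
_PREFIX = re.compile(r"^(R\$|kr|zł|[$€£])\s*", re.IGNORECASE)


@lru_cache(maxsize=65536)
def parse_amount(raw):
    """Parse a currency string, guessing the decimal separator."""
    s = _PREFIX.sub("", raw.strip()).replace("\u00a0", "").replace(" ", "")
    last_comma, last_dot = s.rfind(","), s.rfind(".")
    if last_comma >= 0 and last_dot >= 0:
        # Both present: whichever comes last is the decimal separator.
        dec = "," if last_comma > last_dot else "."
    elif last_comma >= 0 or last_dot >= 0:
        sep = "," if last_comma >= 0 else "."
        tail = s[s.rfind(sep) + 1:]
        # Assumption: repeated separators, or exactly 3 trailing digits,
        # mean a thousands separator rather than a decimal one.
        dec = None if (s.count(sep) > 1 or len(tail) == 3) else sep
    else:
        dec = None
    for ch in ",.":
        if ch != dec:
            s = s.replace(ch, "")
    if dec:
        s = s.replace(dec, ".")
    return Decimal(s)


def iter_chunks(rows, size):
    """Yield lists of at most `size` rows so memory use stays bounded."""
    while chunk := list(islice(rows, size)):
        yield chunk


def standardize_stream(src, dst, column, chunk_rows=50_000):
    """Rewrite one currency column chunk by chunk; return the row count."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    total = 0
    for chunk in iter_chunks(reader, chunk_rows):
        for row in chunk:
            row[column] = str(parse_amount(row[column]))
        writer.writerows(chunk)
        total += len(chunk)
    return total


src = io.StringIO()
sample = csv.writer(src)
sample.writerows([["id", "amount"], ["1", "R$ 1.234,56"],
                  ["2", "$1,234.56"], ["3", "zł 12,50"]])
src.seek(0)
dst = io.StringIO()
print(standardize_stream(src, dst, "amount", chunk_rows=2))
print(dst.getvalue())
```

The cache pays off when a column repeats the same raw strings many times, which is common in real exports; the chunk size only bounds how many rows are held in memory at once.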

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.
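
One way to picture a tied-confidence arbiter is sketched below, assuming the detector hands back several candidate encodings with equal scores. The function names and the single Cyrillic probe are hypothetical; the real arbiter also uses an EE-Latin probe, which is not shown here.

```python
def cyrillic_coverage(text):
    """Fraction of alphabetic characters that fall in the Cyrillic block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0400" <= c <= "\u04ff" for c in letters) / len(letters)


def arbitrate(data, tied_candidates):
    """Pick the candidate whose decoding looks most like real Cyrillic text.

    Candidates that fail to decode are skipped. On equal coverage the
    earlier candidate wins, so the detector's original order is the
    final tie-break.
    """
    best, best_score = None, -1.0
    for encoding in tied_candidates:
        try:
            text = data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue
        score = cyrillic_coverage(text)
        if score > best_score:
            best, best_score = encoding, score
    return best


raw = "Привет мир".encode("cp1251")
print(arbitrate(raw, ["cp1252", "cp1251"]))
```

The same bytes decode without error under both code pages, so decodability alone cannot break the tie; the coverage probe does that by checking which decoding yields plausible text.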

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"
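
The strict-greater threshold and the all-NaN guard can be illustrated with a small stdlib-only sketch. The engine itself works on pandas objects; the function names and list-of-lists layout below are assumptions for the example only.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_sparse_rows(rows, threshold):
    """Drop a row only when its missing fraction is strictly greater
    than the threshold; a row sitting exactly on the threshold is kept."""
    kept = []
    for row in rows:
        fraction = sum(is_missing(v) for v in row) / len(row) if row else 0.0
        if not fraction > threshold:
            kept.append(row)
    return kept


def fill_mean(column):
    """Fill missing values with the column mean, skipping all-missing
    columns instead of computing a mean over nothing."""
    present = [v for v in column if not is_missing(v)]
    if not present:
        return list(column)
    mean = sum(present) / len(present)
    return [mean if is_missing(v) else v for v in column]


rows = [[1, None], [None, None], [1, 2]]
print(drop_sparse_rows(rows, 0.5))
print(fill_mean([1.0, None, 3.0]))
print(fill_mean([None, None]))
```

Applying the same strict comparison to columns gives the row/column consistency described above: a row or column that is exactly 50 % missing survives a 0.5 threshold in both directions.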

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:31:26 +00:00
parent d18b95880d
commit 966af8ef94
89 changed files with 12039 additions and 284 deletions

landing/revops/index.html Normal file

@@ -0,0 +1,352 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>DataTools for RevOps — Dedupe Lead Lists Across HubSpot, LinkedIn, and Manual Scrapes · $49</title>
<meta name="description" content="One tool to dedupe lead lists across HubSpot, LinkedIn, and manual scrapes. International phones (50+ country codes), per-row country normalization, fuzzy match across vendors, fully offline. $49 one-time." />
<meta name="keywords" content="dedupe lead list, hubspot deduplicate, linkedin lead cleanup, marketing data cleaning, revops csv tool, multi-vendor lead unification, international phone normalization" />
<link rel="canonical" href="https://datatools.app/revops/" />
<link rel="stylesheet" href="../_shared/styles.css" />
<!-- Persona accent: RevOps → vivid violet -->
<style>
:root {
--accent: #c4b5fd;
--accent-ink: #2e1065;
}
</style>
<meta property="og:title" content="DataTools for RevOps — Dedupe Lead Lists Across HubSpot, LinkedIn, and Manual Scrapes" />
<meta property="og:description" content="International phones, country normalization, fuzzy dedup with merge — one tool, no upload. $49 one-time." />
<meta property="og:type" content="product" />
<meta property="og:url" content="https://datatools.app/revops/" />
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "SoftwareApplication",
"name": "DataTools for RevOps",
"operatingSystem": "Windows, macOS, Linux",
"applicationCategory": "BusinessApplication",
"offers": {
"@type": "Offer",
"price": "49",
"priceCurrency": "USD"
},
"description": "Dedupe and unify lead lists across CRM, scraping, and manual sources. International phone normalization, per-row country, fuzzy match with merge. Six-tool data-cleaning bundle for RevOps and marketing agencies.",
"softwareVersion": "1.0"
}
</script>
</head>
<body>
<div class="buybar">
<div class="buybar-inner">
<div class="brand"><span class="brand-mark"></span> DataTools <span class="muted">/ for RevOps</span></div>
<div>
<span class="price-tag">$49 — one-time, no subscription</span>
<a class="btn" href="https://gumroad.com/l/datatools?from=revops" rel="noopener">Get DataTools →</a>
</div>
</div>
</div>
<section class="hero">
<div class="container">
<div class="eyebrow">For RevOps · marketing ops · agency lead-gen · audience-builders</div>
<h1>Dedupe lead lists across HubSpot, LinkedIn,<br /><strong>and manual scrapes — locally.</strong></h1>
<p class="lead">
The same prospect shows up as <code>alice@acme.com</code> in HubSpot,
<code>Alice.Johnson@acme.com</code> in LinkedIn Sales Navigator, and
<code>alice@acme.com</code> again from your VA's manual scrape. Their
phone is <code>(415) 555-1234</code> in one source and
<code>4155551234</code> in another. DataTools fuzzy-matches across
sources, normalizes phones to E.164 with per-row country awareness,
and produces one canonical lead per real person — without uploading
a single contact to a third-party tool.
</p>
<div class="cta-row">
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=revops" rel="noopener">Get DataTools — $49 →</a>
<a class="btn btn-ghost btn-large" href="#demo">Try the live demo ↓</a>
<span class="price-note">One-time payment · cross-platform · runs offline</span>
</div>
<div class="stats">
<div class="stat"><div class="num">50+</div><div class="label">country codes</div></div>
<div class="stat"><div class="num">3</div><div class="label">CRM sources unified</div></div>
<div class="stat"><div class="num">0</div><div class="label">cloud uploads ever</div></div>
</div>
</div>
</section>
<!-- ============= Pain points ============= -->
<section>
<div class="container">
<div class="eyebrow">If your last campaign launch was held up by data hygiene</div>
<h2>Six pains DataTools fixes before you import to HubSpot</h2>
<div class="grid">
<div class="card">
<span class="icon">💸</span>
<h3>HubSpot / Marketo / Iterable bills you for every duplicate contact</h3>
<p>10 k contacts → enterprise tier at $4–8 k/mo. 18 % cross-source duplicate rate from Apollo + ZoomInfo + LinkedIn means you're at 8.2 k unique people but paying for 10 k. Every month. Forever.</p>
<p class="muted"><strong>What it costs:</strong> $200–$800 per 1 k duplicate contacts, recurring every month.</p>
</div>
<div class="card">
<span class="icon">🚫</span>
<h3>Sender reputation tanks when you mail to invalid or duplicate addresses</h3>
<p>One bad sending session — to addresses your team scraped or imported without hygiene — and your domain reputation takes weeks to recover. Your good campaigns sit in spam folders during the recovery.</p>
<p class="muted"><strong>What it costs:</strong> catastrophic: the entire email programme is degraded for 2–6 weeks.</p>
</div>
<div class="card">
<span class="icon">⚖️</span>
<h3>GDPR makes uploading to a cloud cleaner a legal-review marathon</h3>
<p>Every cloud-based lead-cleaner needs you to upload your prospect list. Your legal team needs 4–8 weeks to bless that. DataTools is desktop-only: no upload, no DPA, no review, no delay.</p>
<p class="muted"><strong>What it costs:</strong> 4–8 weeks of legal-review delay per tool, every time.</p>
</div>
<div class="card">
<span class="icon">🪢</span>
<h3>Apollo + ZoomInfo + LinkedIn + manual scrapes all use different schemas</h3>
<p>Each export has its own column names, scoring scale, and country format. Unifying them by hand for one campaign costs 1–3 days. Doing it for every campaign is unsustainable.</p>
<p class="muted"><strong>What it costs:</strong> 1–3 days per campaign of manual unification + judgement calls that drift across team members.</p>
</div>
<div class="card">
<span class="icon">🛡️</span>
<h3>Suppression lists across 5+ marketing platforms get out of sync</h3>
<p>Each platform has its own suppression format. Out-of-sync lists let opted-out contacts slip through, triggering CAN-SPAM / GDPR exposure and the kind of "we got a complaint" email no one wants.</p>
<p class="muted"><strong>What it costs:</strong> compliance risk + churn-back cost + stakeholder trust.</p>
</div>
<div class="card">
<span class="icon">📞</span>
<h3>International dialer fails because phone formats vary</h3>
<p>A calling list spanning 15 countries with mixed formats means the dialer rejects 8–15 % of numbers, and your reps spend the day on "number invalid" tones instead of conversations.</p>
<p class="muted"><strong>What it costs:</strong> rep productivity × failure rate × team size.</p>
</div>
</div>
</div>
</section>
<section id="demo">
<div class="container">
<div class="eyebrow">Live demo · runs in your browser</div>
<h2>Try it on a real-looking 3-vendor lead list</h2>
<p>
The demo below loads a 25-row lead worksheet combining HubSpot,
LinkedIn Sales Navigator, and manual scraping — with the same prospect
appearing in two or three sources, country names spelled three
different ways (<code>USA</code>, <code>US</code>, <code>United
States</code>), and 13 different international phone formats. Click
<strong>Run pipeline</strong> and watch the 5-step pipeline (text
clean → format → missing → column map → dedup) collapse 25 rows to 19
with a single canonical record per prospect.
</p>
<div class="demo-frame">
<iframe
src="https://demo.datatools.app/?p=revops"
loading="lazy"
title="DataTools live demo — RevOps"
sandbox="allow-scripts allow-same-origin allow-downloads allow-forms"></iframe>
<div class="demo-caption">
Demo runs on free hosting. Capped at 100 input rows · output
watermarked. The paid product has no caps and runs entirely offline.
</div>
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">Built for the agency RevOps day</div>
<h2>Three workflows you do every campaign</h2>
<div class="grid">
<div class="card">
<span class="icon">🪢</span>
<h3>Email-list dedup across lead sources</h3>
<p>HubSpot exports + LinkedIn Sales Navigator + the VA's spreadsheet, all merged. Fuzzy match across email + phone + name catches the cross-source duplicates that broke your last campaign send.</p>
</div>
<div class="card">
<span class="icon">🌍</span>
<h3>Multi-platform audience reconciliation</h3>
<p>Build one canonical audience from Meta, Google Ads, LinkedIn, and your CRM. Each platform exports a different shape; column-mapper aligns them all, dedup merges the survivors with their most-complete fields.</p>
</div>
<div class="card">
<span class="icon">🛡️</span>
<h3>Suppression-list management</h3>
<p>Suppression lists need to dedupe across email + phone + first-party identifiers. Add a row, dedupe, ship the canonical CSV to every platform — without uploading the suppression list to any of them.</p>
</div>
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">If your campaigns target outside the US — almost everyone's do</div>
<h2>50+ country codes. Per-row country awareness.</h2>
<p>
Your HubSpot list has <code>(415) 555-1234</code>. Your scraped
list for the same prospect has <code>+1 415 555 1234</code>. Your
Italian prospect entered <code>+39 06 6982</code>. Your Brazilian
lead has <code>11 3071 0000</code>. Each comes from a row tagged
with its country — DataTools reads that column per row and parses
every phone correctly to E.164.
</p>
<ul class="bullets">
<li><strong>Per-row country column</strong> drives the parser, so no global default buckets UK numbers as malformed US.</li>
<li><strong>Country-name normalization</strong>: <code>USA</code> / <code>US</code> / <code>United States</code> all resolve to the same ISO-2 code.</li>
<li><strong>50+ country support</strong> via Google's libphonenumber, including KR, CN, IN, MX, BR, IL, TR, PL, DK, SE.</li>
<li><strong>Schema enforcement</strong> via the column-mapper: project to your CRM's required shape, coerce score columns to integers, reorder fields to match the import contract.</li>
</ul>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">For platforms that charge per contact</div>
<h2>Every duplicate you don't catch costs you for the life of the contract.</h2>
<p>
HubSpot prices on contacts. Klaviyo prices on contacts. Marketo,
Iterable, ActiveCampaign — all priced on contacts. Every duplicate
you don't catch is a recurring tax on your campaign. DataTools
catches them once, before import, with a fuzzy matcher that's
tuned to the cross-source noise you actually see.
</p>
<div class="callout">
<strong>Real numbers from the demo:</strong> 25 input rows from
three sources collapse to 19 — that's 6 duplicates the cross-source
noise was hiding. On a 50,000-row campaign list, that ratio
typically saves 12,000+ contacts a month, every month.
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">The thing every cloud cleaner can't say</div>
<h2>Your prospects' contact info never leaves your computer.</h2>
<p>
Cloud lead-cleaning tools require you to upload your audience.
That audience is your single most valuable agency asset — and once
it's on someone else's server, your client's privacy story is
no longer in your hands. DataTools is a desktop app. There is no
upload step.
</p>
<div class="terminal"><span class="prompt">$</span> python -m src.cli_pipeline campaign_q1.csv --pipeline revops_pipeline.json --apply
Reading campaign_q1.csv...
53,802 rows, 14 columns
Executing pipeline:
<span class="ok"></span> text_clean (160 ms) {cells_changed: 8,205}
<span class="ok"></span> format_standardize (1.4 s) {cells_changed: 41,889 — 50 country codes}
<span class="ok"></span> missing (140 ms) {sentinels_standardized: 6,710}
<span class="ok"></span> column_map (220 ms) {columns_renamed: 4, columns_added: 1}
<span class="ok"></span> dedup (4.8 s) {duplicates_removed: 12,344, merged: 12,344}
Initial rows: 53,802 → Final rows: 41,458
Total elapsed: 6.7 s
<span class="prompt">$</span> # 12,344 fewer contacts to pay for. For $49.</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">In the bundle</div>
<h2>Six tools. One pipeline. One $49 download.</h2>
<div class="grid">
<div class="card"><h3>1 · Deduplicator</h3><p>Fuzzy match across email + phone + name + company; merge survivors with most-complete fields.</p></div>
<div class="card"><h3>2 · Text Cleaner</h3><p>Smart quotes from copy-paste, NBSP from spreadsheet exports, BOM from Excel.</p></div>
<div class="card"><h3>3 · Format Standardizer</h3><p>E.164 phones with per-row country, canonical emails, name casing, ISO dates.</p></div>
<div class="card"><h3>4 · Missing Value Handler</h3><p>Detect <code>TBD</code>, <code>(unknown)</code>, <code></code> across vendor exports.</p></div>
<div class="card"><h3>5 · Column Mapper</h3><p>Project to your CRM's required schema, coerce score to integer, reorder for import.</p></div>
<div class="card"><h3>6 · Pipeline Runner</h3><p>Save the cleanup as JSON. Drop next campaign's combined export on it. Same dedup, automated.</p></div>
</div>
</div>
</section>
<section>
<div class="container">
<div class="eyebrow">Pricing — pay once, own it</div>
<h2>$49. No subscription. No per-campaign fee.</h2>
<div class="pricing">
<div class="card featured">
<div class="row"><div class="price">$49</div><div class="price-suffix">one-time</div></div>
<h3>DataTools for RevOps</h3>
<ul>
<li>All 6 tools, full pipeline</li>
<li>Mac · Windows · Linux installers</li>
<li>Code-signed (no Gatekeeper warnings)</li>
<li>Free updates for the v1.x line</li>
<li>Bonus: 3-source unification pipeline preset</li>
<li><strong>Use on any number of clients</strong> — no seat limits</li>
</ul>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=revops" rel="noopener">Buy on Gumroad →</a>
</div>
<div class="card">
<div class="row"><div class="price">$99</div><div class="price-suffix">one-time</div></div>
<h3>Full DataTools Suite</h3>
<p class="muted">Available when 3+ bundles ship. Includes everything in the RevOps pack plus the Shopify and Bookkeeper bundles. Save $48.</p>
<a class="btn btn-ghost btn-large" href="#" aria-disabled="true">Coming when ready</a>
</div>
</div>
</div>
</section>
<section>
<div class="container">
<h2>Questions</h2>
<details class="faq">
<summary>Does this replace HubSpot's deduplication?</summary>
<p>No — it cleans data <em>before</em> import to HubSpot (or LinkedIn, Marketo, Klaviyo, etc.). HubSpot's dedup runs on already-imported contacts; DataTools catches duplicates that haven't yet cost you a contract slot.</p>
</details>
<details class="faq">
<summary>Does it handle international phones correctly?</summary>
<p>Yes — via Google's libphonenumber, with 50+ country codes. The killer feature is per-row country: point a column at it (any column with values like <code>US</code>, <code>USA</code>, <code>United States</code>, <code>+1</code>, <code>JP</code>, <code>Japan</code>) and DataTools parses each row in its own region. No more UK numbers bucketed as malformed US.</p>
</details>
<details class="faq">
<summary>Can I use it on multiple clients without paying again?</summary>
<p>Yes. The licence is per-operator, not per-client. Run it on every agency client's lead list for the same $49.</p>
</details>
<details class="faq">
<summary>How does fuzzy match work across columns?</summary>
<p>Out of the box, the dedup engine builds default strategies from column names: typically exact match on email and phone, and Jaro-Winkler at 85% on name. You can override them via JSON by choosing which columns to match on, which algorithm to use, and what threshold to apply. Strategies are stored in the saved pipeline, so the next campaign uses the same rules.</p>
</details>
<details class="faq">
<summary>What does the audit trail look like?</summary>
<p>A row-by-row CSV: every modified cell with its original value, new value, and which rule fired. A separate JSON file describes the pipeline that produced it. Together they reproduce the cleanup deterministically — your client can verify it on their machine.</p>
</details>
<details class="faq">
<summary>What's your refund policy?</summary>
<p>Try the live demo above on the sample dataset before you buy. If DataTools doesn't fit your workflow within 14 days, email for a refund — no questions asked.</p>
</details>
</div>
</section>
<section>
<div class="container" style="text-align: center;">
<h2>Stop paying twice for the same contact.</h2>
<p class="lead" style="margin: 0 auto 28px;">One $49 download. Catches the cross-source duplicates HubSpot and LinkedIn can't see, normalizes phones for 50+ countries, and saves a pipeline you can re-run on next campaign's combined list.</p>
<a class="btn btn-large" href="https://gumroad.com/l/datatools?from=revops" rel="noopener">Get DataTools — $49 →</a>
</div>
</section>
<footer>
<div class="container">
<div>
<p><strong>DataTools</strong> — local data-cleaning for Shopify, bookkeepers, and RevOps teams.</p>
<p class="muted">© 2026 · Built solo · Shipped from a small office.</p>
</div>
<div>
<p>
<a href="../shopify-pet/">For Shopify operators</a> ·
<a href="../bookkeeper/">For bookkeepers</a><br />
<a href="https://gumroad.com/l/datatools?from=revops">Buy on Gumroad</a> ·
<a href="mailto:hello@datatools.app">Email support</a>
</p>
</div>
</div>
</footer>
</body>
</html>