docs+code: rename tool labels everywhere

Sweep follow-up to 93e43fc. Display labels now consistent across docs,
landing pages, CLI output, code comments, docstrings, and test prose.
Five parallel surfaces touched:

- docs (EN + ES): README, USER-GUIDE, CLI-REFERENCE, and 11 internal
  design/planning docs
- landing pages: index + bookkeeper/revops/shopify-pet
- src: CLI module docstrings, _TOOL_DISPLAY dicts in cli_analyze.py
  and gui/components/_legacy.py, core module headers, every tool
  page's module docstring
- tests: class/method/module docstrings and section-header comments
- test-cases READMEs

Page slugs (1_Deduplicator etc.), tool_id strings (01_deduplicator
etc.), Python class names (TestDeduplicatorWorkflow, FeatureFlag.*),
URL paths, anchor IDs, CSS classes, and asset filenames were left
intact since they're code identifiers / structural references.
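
A minimal sketch of the split described above, assuming the display labels live in a dict keyed by `tool_id`. Only `01_deduplicator` and the label strings come from this commit; the other keys and the `display_label` helper are hypothetical, and the real `_TOOL_DISPLAY` dicts may be shaped differently.

```python
# Hypothetical sketch: stable identifiers map to user-facing labels.
# Renaming a label edits only the values; the keys (tool_id) never change.
TOOL_DISPLAY = {
    "01_deduplicator": "Find Duplicates",              # tool_id named in this commit
    "02_text_cleaner": "Clean Text",                   # assumed tool_id
    "03_format_standardizer": "Standardize Formats",   # assumed tool_id
}


def display_label(tool_id: str) -> str:
    """Return the user-facing label, falling back to the raw identifier."""
    return TOOL_DISPLAY.get(tool_id, tool_id)
```

Keeping the identifier as the key is what lets page slugs, URL paths, and test class names stay untouched while the visible text changes.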

All 2033 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 19:50:09 +00:00
parent 93e43fc0d9
commit db5ec084da
57 changed files with 205 additions and 205 deletions

@@ -8,15 +8,15 @@ Limpieza local de CSV / Excel. CLI + GUI en el navegador, sin nube, sin ceremoni
| # | Herramienta | Estado |
|---|------|--------|
| 01 | **Eliminador de duplicados** — coincidencia exacta + difusa, 5 normalizadores, reglas de superviviente, auditoría | Listo |
| 02 | **Limpiador de texto** — espacios, caracteres tipográficos, BOM, finales de línea, mayúsculas/minúsculas | Listo |
| 03 | **Estandarizador de formatos** — fechas, teléfonos, correos, direcciones, nombres, monedas, booleanos | Listo |
| 04 | **Gestor de valores faltantes** — detección de nulos disfrazados, perfil, media/mediana/moda/ffill/bfill/interpolación, estrategias de descarte | Listo |
| 05 | **Mapeador de columnas** — autodetección difusa de renombrados, esquema objetivo con coerción de tipos, campos requeridos con valores por defecto, descartar/reordenar | Listo |
| 06 | Detector de valores atípicos | Próximamente |
| 07 | Combinador de varios archivos | Próximamente |
| 08 | Validador e informes | Próximamente |
| 09 | **Ejecutor de canalizaciones** — encadena herramientas en un orden recomendado (no forzado), guarda/carga JSON, automatiza limpiezas semanales | Listo |
| 01 | **Buscar duplicados** — coincidencia exacta + difusa, 5 normalizadores, reglas de superviviente, auditoría | Listo |
| 02 | **Limpiar texto** — espacios, caracteres tipográficos, BOM, finales de línea, mayúsculas/minúsculas | Listo |
| 03 | **Estandarizar formatos** — fechas, teléfonos, correos, direcciones, nombres, monedas, booleanos | Listo |
| 04 | **Corregir valores faltantes** — detección de nulos disfrazados, perfil, media/mediana/moda/ffill/bfill/interpolación, estrategias de descarte | Listo |
| 05 | **Mapear columnas** — autodetección difusa de renombrados, esquema objetivo con coerción de tipos, campos requeridos con valores por defecto, descartar/reordenar | Listo |
| 06 | Detectar valores atípicos | Próximamente |
| 07 | Combinar archivos | Próximamente |
| 08 | Verificación de calidad | Próximamente |
| 09 | **Flujos automatizados** — encadena herramientas en un orden recomendado (no forzado), guarda/carga JSON, automatiza limpiezas semanales | Listo |
## Descarga (usuarios no técnicos)

@@ -8,15 +8,15 @@ Local CSV / Excel cleaning. CLI + browser GUI, no cloud, no install ceremony. GU
| # | Tool | Status |
|---|------|--------|
| 01 | **Deduplicator** — exact + fuzzy match, 5 normalizers, survivor rules, audit | Ready |
| 02 | **Text Cleaner** — whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | **Format Standardizer** — dates, phones, emails, addresses, names, currencies, booleans | Ready |
| 04 | **Missing Value Handler** — disguised-null detection, profile, mean/median/mode/ffill/bfill/interpolate, drop strategies | Ready |
| 05 | **Column Mapper** — fuzzy auto-rename, target schema with type coercion, required fields with defaults, drop/reorder | Ready |
| 06 | Outlier Detector | Coming Soon |
| 07 | Multi-File Merger | Coming Soon |
| 08 | Validator & Reporter | Coming Soon |
| 09 | **Pipeline Runner** — chain tools with recommended (not forced) order, save/load JSON, automate weekly cleanups | Ready |
| 01 | **Find Duplicates** — exact + fuzzy match, 5 normalizers, survivor rules, audit | Ready |
| 02 | **Clean Text** — whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | **Standardize Formats** — dates, phones, emails, addresses, names, currencies, booleans | Ready |
| 04 | **Fix Missing Values** — disguised-null detection, profile, mean/median/mode/ffill/bfill/interpolate, drop strategies | Ready |
| 05 | **Map Columns** — fuzzy auto-rename, target schema with type coercion, required fields with defaults, drop/reorder | Ready |
| 06 | Find Unusual Values | Coming Soon |
| 07 | Combine Files | Coming Soon |
| 08 | Quality Check | Coming Soon |
| 09 | **Automated Workflows** — chain tools with recommended (not forced) order, save/load JSON, automate weekly cleanups | Ready |
## Download (non-technical users)

@@ -246,7 +246,7 @@ much state to trust:
4. Double-click the app icon.
5. Browser should open to http://127.0.0.1:850x within 5 seconds.
6. Drop samples/demo/shopify_pet_customers.csv into the
Pipeline Runner page; click Run; AFTER preview should appear.
Automated Workflows page; click Run; AFTER preview should appear.
7. Confirm in the network tab: zero outbound calls except to
127.0.0.1 and the Streamlit static asset paths (also local).
```

@@ -333,7 +333,7 @@ the attached `.dtlic` file.
| Tier | Features |
|------|---------|
| **lite** | Deduplicator, Text Cleaner, Format Standardizer |
| **lite** | Find Duplicates, Clean Text, Standardize Formats |
| **core** | All 9 tools |
| **pro** | All 9 tools + future Pro-only features |

@@ -47,7 +47,7 @@ Sell niche Python automation tools as one-time downloadable digital products. Ta
**Surface**: desktop install per OS (PyInstaller) with Streamlit GUI + CLI. Constrained demo on Streamlit Community Cloud.
## 4a. Lead bundle — Deduplicator
## 4a. Lead bundle — Find Duplicates
Highest pain density across all 4 personas. Feeds landing copy, demo design, feature priority. Tech spec: TECHNICAL.md §11.1.
@@ -208,7 +208,7 @@ Headroom enables optional ad spend ($100-200/mo) once a bundle has proven conver
## 13. Honest status (2026-05-01)
- 3 of 9 tools shipped (Dedup, Text Cleaner, Format Standardizer).
- 3 of 9 tools shipped (Find Duplicates, Clean Text, Standardize Formats).
- Cross-platform build pipeline designed, not yet built.
- macOS code signing not yet set up.
- Streamlit GUI shipped for the 3 ready tools.

@@ -8,15 +8,15 @@ Tres módulos de CLI, uno por cada herramienta Lista:
| Módulo | Comando | Propósito |
|--------|---------|---------|
| `src.cli` | `python -m src.cli FILE` | Eliminador de duplicados |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Limpiador de texto |
| `src.cli` | `python -m src.cli FILE` | Buscar duplicados |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Limpiar texto |
| `src.cli_analyze` | `python -m src.cli_analyze FILE` | Analizador (escaneo de solo lectura) |
Cada comando es **previsualización por defecto** — añade `--apply` para escribir la salida.
---
# Eliminador de duplicados
# Buscar duplicados
```
python -m src.cli ARCHIVO_ENTRADA [OPCIONES]
@@ -125,7 +125,7 @@ Registro: `logs/dedup_YYYYMMDD_HHMMSS.log`.
---
# Limpiador de texto
# Limpiar texto
```
python -m src.cli_text_clean ARCHIVO_ENTRADA [OPCIONES]
@@ -156,7 +156,7 @@ Higiene a nivel de carácter. Ver [TECHNICAL.md §10.2](TECHNICAL.md) (solo en i
- `--config RUTA` / `--save-config RUTA`.
### Archivo
- `--sheet`, `--encoding`, `--header-row` — iguales que en el Eliminador de duplicados.
- `--sheet`, `--encoding`, `--header-row` — iguales que en Buscar duplicados.
## Presets

@@ -6,15 +6,15 @@ Three CLI modules, one per Ready tool:
| Module | Command | Purpose |
|--------|---------|---------|
| `src.cli` | `python -m src.cli FILE` | Deduplicator |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Text Cleaner |
| `src.cli` | `python -m src.cli FILE` | Find Duplicates |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Clean Text |
| `src.cli_analyze` | `python -m src.cli_analyze FILE` | Analyzer (read-only scan) |
Every command is **preview-only by default** — add `--apply` to write output.
---
# Deduplicator
# Find Duplicates
```
python -m src.cli INPUT_FILE [OPTIONS]
@@ -123,7 +123,7 @@ Log: `logs/dedup_YYYYMMDD_HHMMSS.log`.
---
# Text Cleaner
# Clean Text
```
python -m src.cli_text_clean INPUT_FILE [OPTIONS]
@@ -154,7 +154,7 @@ Character-level hygiene. See [TECHNICAL.md §10.2](TECHNICAL.md) for the spec.
- `--config PATH` / `--save-config PATH`.
### File
- `--sheet`, `--encoding`, `--header-row` — same as Deduplicator.
- `--sheet`, `--encoding`, `--header-row` — same as Find Duplicates.
## Presets

@@ -67,7 +67,7 @@ Each candidate scored 1-5 on 6 dimensions. Total /30 → verdict.
**v1.2 rationale**:
- Buyer persona ("hates Excel work but can't code") won't learn a CLI. Refunds at this price.
- Deduplicator needs interactive review — not viable in pure CLI.
- Find Duplicates needs interactive review — not viable in pure CLI.
- Dual interface keeps CLI for automation without sacrificing primary buyer surface.
## 4a. Functional scope principle (v1.2)
@@ -170,13 +170,13 @@ $49-79/bundle · $149 full suite (when 3+ exist).
| Apr 28 (v1.3) | Add hosted browser demo as conversion lever | Direct consequence of Streamlit choice. See §5. |
| Apr 28 (v1.4) | Re-apply 04/06 boundary work (silent-drift recovery) | Stream B v1.2 content overwritten in parallel v1.3 work. Restored per no-silent-drift rule. |
| Apr 28 (v1.5) | Add `02_text_cleaner.py`; renumber 02-08 → 03-09 | Character-level hygiene had no clear owner. See TECHNICAL §10. |
| Apr 29 (v1.7) | Adopt Text Cleaner Tier 1/2/3 spec; lock `excel-hygiene` default | Promotes from stub to buildable v1 target. Full spec in TECHNICAL §11.2. |
| Apr 29 (v1.7) | Adopt Clean Text Tier 1/2/3 spec; lock `excel-hygiene` default | Promotes from stub to buildable v1 target. Full spec in TECHNICAL §11.2. |
| Apr 28 (v1.6) | Fold conversation-history content into docs (deduplicator spec, lead bundle use cases, full GUI matrix, 04/06 examples, Streamlit-to-SaaS reasoning) | No new decisions; promote at-risk analysis from chat history per no-silent-drift rule. |
| May 1 (v1.6) | Mark Format Standardizer **Ready** | 199-row buyer corpus passing; Tier 1 + most Tier 2 built. |
| May 1 (v1.6) | Mark Standardize Formats **Ready** | 199-row buyer corpus passing; Tier 1 + most Tier 2 built. |
| May 1 (v1.6) | Add `src/core/errors.py` structured hierarchy | Uniform helpful messages across CLI + GUI. See TECHNICAL §7. |
| May 13 (v1.6) | Ship in-house JSON i18n + EN/ES packs | Expand addressable market (Spanish-first buyers, LatAm bookkeepers) without a `gettext` build step. JSON packs editable by non-devs; parity test prevents drift. See TECHNICAL §10b. |
| May 13 (v1.6) | Ship licensing: 1-year HMAC-signed blobs, name+email registration, offline verification, tier-scaffolded for future SKUs | Unlock the lifetime-update business model without recurring infra. Honor-system DRM (HMAC + 30-day refund) — sufficient at $49. See §9b below. |
| May 13 (v1.6) | Add Lite SKU (Dedup + Text Cleaner + Format Standardizer) | Lower-priced entry point for buyers who only need the three universal tools. Per-tool feature gating + lock badges on the home grid surface the upgrade path. See §9b. |
| May 13 (v1.6) | Add Lite SKU (Find Duplicates + Clean Text + Standardize Formats) | Lower-priced entry point for buyers who only need the three universal tools. Per-tool feature gating + lock badges on the home grid surface the upgrade path. See §9b. |
| May 13 (v1.6) | Remove user-facing free trial | A 1-year all-features trial undercut the paid Lite SKU. Paid-only keeps tier economics clean. Internal ``_mint`` API still exists for tests and the seller's key generator. See §9b. |
| May 13 (v1.6) | Upgrade license crypto: HMAC → Ed25519 (asymmetric) | HMAC's symmetric secret was extractable from the shipped binary — anyone with the binary could mint blobs. Ed25519 splits sign (seller) from verify (binary), so binary compromise doesn't let an attacker forge licenses. Blob prefix bumped DTLIC1 → DTLIC2. See §9b. |
| May 13 (v1.6) | Add ``assert_production_safe`` tripwire | A shipped build with ``DATATOOLS_DEV_MODE=1`` or the in-source dev pubkey would silently defeat licensing. The tripwire refuses to boot such a build. No-op in source / pytest runs. See §9b. |
@@ -211,13 +211,13 @@ The 30-day refund window covers casual blob sharing from a different angle (anyo
- Number of devices the same blob is used on (no concurrent-use detection).
- Reverse-engineered re-signing of expired blobs (would require RSA / online check).
**Future SKUs**: the ``FEATURES_BY_TIER`` table in ``src/license/features.py`` is the single source of truth for "which tools each tier unlocks". Adding a PRO SKU that excludes the pipeline runner is a 1-line edit there + a 1-line edit at the gate site. No consumer-code churn.
**Future SKUs**: the ``FEATURES_BY_TIER`` table in ``src/license/features.py`` is the single source of truth for "which tools each tier unlocks". Adding a PRO SKU that excludes Automated Workflows is a 1-line edit there + a 1-line edit at the gate site. No consumer-code churn.
**v1.6 SKU lineup**:
| Tier | Tools unlocked | Notes |
|---|---|---|
| LITE | Deduplicator, Text Cleaner, Format Standardizer | Entry SKU. Three universal tools that handle the most common bookkeeping / RevOps / Klaviyo prep workflows. |
| LITE | Find Duplicates, Clean Text, Standardize Formats | Entry SKU. Three universal tools that handle the most common bookkeeping / RevOps / Klaviyo prep workflows. |
| CORE | All 9 tools | Full v1 suite. |
| PRO | All 9 tools (scaffolded) | Reserved for future per-feature carve-outs (e.g., scheduled pipelines, API access). |
| ENTERPRISE | All 9 tools (scaffolded) | Reserved for future bulk / multi-seat SKUs. |
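
As a rough illustration of the "single source of truth" idea, the tier table could look something like the sketch below. The real `FEATURES_BY_TIER` in `src/license/features.py` is not shown in this diff, so the `Tier` enum, the tool keys, and the `is_unlocked` helper are all assumptions made for the example.

```python
from enum import Enum


class Tier(Enum):  # hypothetical enum; tier names taken from the table above
    LITE = "lite"
    CORE = "core"
    PRO = "pro"
    ENTERPRISE = "enterprise"


# Hypothetical tool keys, one per tool in the 9-tool suite.
ALL_TOOLS = frozenset({
    "find_duplicates", "clean_text", "standardize_formats",
    "fix_missing_values", "map_columns", "find_unusual_values",
    "combine_files", "quality_check", "automated_workflows",
})

# One table decides which tools each tier unlocks.
FEATURES_BY_TIER = {
    Tier.LITE: frozenset({"find_duplicates", "clean_text", "standardize_formats"}),
    Tier.CORE: ALL_TOOLS,
    Tier.PRO: ALL_TOOLS,
    Tier.ENTERPRISE: ALL_TOOLS,
}


def is_unlocked(tier: Tier, tool: str) -> bool:
    """Gate check: every caller asks the table instead of hard-coding tiers."""
    return tool in FEATURES_BY_TIER[tier]
```

Under these assumptions, a carve-out such as a PRO tier without Automated Workflows would be a change to one table entry (for example `ALL_TOOLS - {"automated_workflows"}`), with no edits at the call sites that use `is_unlocked`.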

@@ -33,7 +33,7 @@ CLI (src/cli*.py) GUI (src/gui/app.py + pages/)
| `core.errors` | `DataToolsError` hierarchy, `ensure_dataframe()`, `ensure_choice()`, `wrap_file_read/write()`, `format_for_user()` |
| `core._constants` | `US_STATE_NAMES`, `US_STATE_CODES`, `USPS_EXPANSIONS`, `USPS_COMPRESSIONS` |
## Data flow — Deduplicator
## Data flow — Find Duplicates
```
read_file() # auto-detect encoding, delimiter, header

@@ -30,7 +30,7 @@ Status legend:
| ✓ | Item | Where it lives |
|---|------|----------------|
| 🟢 | 6 of 9 tools shipped (Dedup, Text, Format, Missing, Column-Map, Pipeline) | `src/core/`, `src/cli_*.py`, `src/gui/pages/` |
| 🟢 | Pipeline Runner (the retention multiplier per `PLAN.md` §2.6) | `src/core/pipeline.py`, `src/cli_pipeline.py`, `src/gui/pages/9_Pipeline_Runner.py` |
| 🟢 | Automated Workflows (the retention multiplier per `PLAN.md` §2.6) | `src/core/pipeline.py`, `src/cli_pipeline.py`, `src/gui/pages/9_Pipeline_Runner.py` |
| 🟢 | 1,729 passing tests · 0 skipped · 0 xfailed | `tests/` |
| 🟢 | 3 niche demo datasets + pre-tuned pipeline JSONs | `samples/demo/` |
| 🟢 | Streamlit demo app + Cloud entry shim | `streamlit_app.py`, `src/gui/app_demo.py` |

@@ -29,8 +29,8 @@ win.
| Asset | State |
|---|---|
| Tools 1–5 (Dedup, Text Clean, Format Standardize, Missing, Column Mapper) | Ready · 1,691 tests passing · 0 xfailed |
| Tools 6–9 (Outlier, Multi-File Merge, Validator, Pipeline) | Coming Soon |
| Tools 1–5 (Find Duplicates, Clean Text, Standardize Formats, Fix Missing Values, Map Columns) | Ready · 1,691 tests passing · 0 xfailed |
| Tools 6–9 (Find Unusual Values, Combine Files, Quality Check, Automated Workflows) | Coming Soon |
| PyInstaller installer pipeline | Not started |
| macOS code signing (Apple Dev Program) | Not started |
| Hosted browser demo (Streamlit Cloud) | Not deployed |
@@ -52,7 +52,7 @@ Tools 6–8 are blocked behind a **distribution gate**: no work on them
until the existing 5 tools have a paying customer + one external review
(BUSINESS.md §4 sequence rule, applied recursively inside the bundle).
**Exception granted 2026-05-01**: Tool 09 Pipeline Runner is built
**Exception granted 2026-05-01**: Tool 09 Automated Workflows is built
*now*. Rationale: the pipeline transforms the bundle from "5 tools you
buy" into "an automatable workflow you depend on." That conversion is
what produces retention and word-of-mouth — the only marketing channel
@@ -104,10 +104,10 @@ demo dataset.
| # | Pain | $ / time impact | Tools that fix it |
|---|------|-----------------|---|
| S1 | **Klaviyo / Mailchimp / Omnisend per-contact billing.** Subscriber list with 10–18 % duplicate rate (case drift, plus signs in Gmail addresses, multiple devices) → recurring overpay forever. | $30–300/mo per percent of dupes on a 50 k list — recurring | Dedup + Format Standardize (email canonicalization) + Pipeline (re-run weekly) |
| S2 | **Product feed rejected by Google Merchant Center / Meta Catalog.** Smart quotes in titles, NBSP in SKU, inconsistent attributes; campaign launch delayed 24–72 h while feed gets fixed. | 1–3 days delayed launch × campaign value | Text Cleaner + Format Standardize |
| S3 | **Multi-channel order consolidation.** Shopify + Etsy + Amazon + Faire + wholesale spreadsheet, each with a different column for "customer email" / "order total" / "ship country". | 4–8 hr / month manually merging | Column Mapper + Dedup + Pipeline |
| S2 | **Product feed rejected by Google Merchant Center / Meta Catalog.** Smart quotes in titles, NBSP in SKU, inconsistent attributes; campaign launch delayed 24–72 h while feed gets fixed. | 1–3 days delayed launch × campaign value | Clean Text + Standardize Formats |
| S3 | **Multi-channel order consolidation.** Shopify + Etsy + Amazon + Faire + wholesale spreadsheet, each with a different column for "customer email" / "order total" / "ship country". | 4–8 hr / month manually merging | Map Columns + Find Duplicates + Automated Workflows |
| S4 | **Subscription identity fragmentation.** Pet-box subscribers cancel and re-sub under a different email; cohort analysis says churn is 20 % when it's actually 12 % — pricing decisions wrong. | Mis-priced LTV → over- or under-paid acquisition | Dedup with `merge=true` survivor |
| S5 | **International tax / VAT MOSS compliance.** Country column is `UK` / `U.K.` / `United Kingdom` / `GB` in the same export; VAT report breaks. Phone formats per region break call-center routing. | Compliance penalty risk + ops friction | Format Standardize (per-row country) + Column Mapper |
| S5 | **International tax / VAT MOSS compliance.** Country column is `UK` / `U.K.` / `United Kingdom` / `GB` in the same export; VAT report breaks. Phone formats per region break call-center routing. | Compliance penalty risk + ops friction | Standardize Formats (per-row country) + Map Columns |
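
To make the "email canonicalization" fix for S1 concrete, here is a small sketch of the kind of normalization that collapses case drift and Gmail plus-tags into one key. It is an illustration only: the function name and the exact rules are assumptions, not the rule set Standardize Formats actually ships.

```python
GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}


def canonicalize_email(raw: str) -> str:
    """Return a comparison key for an address (hypothetical helper)."""
    email = raw.strip().lower()           # case drift and stray whitespace
    local, sep, domain = email.rpartition("@")
    if not sep:                           # not an address; leave it for review
        return email
    if domain in GMAIL_DOMAINS:
        local = local.split("+", 1)[0]    # Gmail ignores everything after '+'
    return f"{local}@{domain}"


subscribers = ["Jane.Doe+promo@Gmail.com", "jane.doe@gmail.com ", "bob+x@example.com"]
unique_keys = {canonicalize_email(s) for s in subscribers}
```

Here the two Gmail rows collapse to a single key, while the plus-tag on the non-Gmail address is kept because other providers may treat it as a distinct mailbox.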
#### Bookkeeper / freelance accountant
@@ -126,7 +126,7 @@ demo dataset.
| R1 | **HubSpot / Marketo / Iterable per-contact tier pricing.** 10 k contacts → enterprise tier at $4–8 k/mo. Every duplicate is a recurring tax. | $200–800 / month per 1 k duplicate contacts — recurring | Dedup with cross-source merge + Pipeline |
| R2 | **Email-deliverability / sender reputation.** Sending to invalid or duplicate addresses tanks reputation; recovery takes weeks. | Catastrophic — entire email programme degraded | Format Standardize (email canonicalization) + Missing (sentinel detection) |
| R3 | **GDPR / contact-data privacy.** Uploading lead data to a third-party cleaning SaaS is itself a GDPR concern; legal review blocks adoption. | Compliance risk + 4–8 wk legal-review delay | Local-only desktop app, zero outbound calls |
| R4 | **Multi-vendor lead-source unification.** Apollo, ZoomInfo, LinkedIn Sales Nav, manual scrapes — each export has different headers, scoring, country format. | 1–3 days per campaign of manual unification | Column Mapper (alias matching) + Format Standardize (per-row country) + Dedup |
| R4 | **Multi-vendor lead-source unification.** Apollo, ZoomInfo, LinkedIn Sales Nav, manual scrapes — each export has different headers, scoring, country format. | 1–3 days per campaign of manual unification | Map Columns (alias matching) + Standardize Formats (per-row country) + Find Duplicates |
| R5 | **Suppression-list management across 5+ platforms.** Each platform has its own format; un-deduped suppression lists let opt-outs slip through, triggering CAN-SPAM / GDPR exposure. | Compliance risk + churn-back cost | Pipeline saved as JSON, re-run on each new suppression batch |
### 2.4 Operationalize the moat the docs already name.
@@ -154,7 +154,7 @@ right after "runs locally."
Copy seed: *"Every change auditable. Hand the audit CSV to your client
with the cleaned file."*
### 2.6 The Pipeline Runner is the retention multiplier.
### 2.6 Automated Workflows is the retention multiplier.
A buyer with a saved pipeline isn't a one-off purchase — they're a
recurring user who recommends the product. This is exactly the
@@ -172,8 +172,8 @@ trigger DECISIONS.md §8 already names).
### 2.8 Dependency-aware pipeline UX.
Tools have soft execution-order preferences (Text Clean before Format
Standardize, Format before Dedup, Missing before Dedup). The Pipeline
Runner *recommends* the order, *warns* on reversals, and **never
Standardize, Format before Dedup, Missing before Dedup). Automated
Workflows *recommends* the order, *warns* on reversals, and **never
forces** — the user owns their workflow. Implementation: see
`src/core/pipeline.py` `SOFT_DEPENDENCIES`.
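
The recommend / warn / never-force behaviour can be sketched as below. The three pairs come from the paragraph above; the tuple-list shape, the tool keys, and the `order_warnings` helper are assumptions, since the actual `SOFT_DEPENDENCIES` in `src/core/pipeline.py` is not part of this diff.

```python
# Hypothetical shape: (runs_earlier, runs_later) preference pairs.
SOFT_DEPENDENCIES = [
    ("text_clean", "format_standardize"),
    ("format_standardize", "dedup"),
    ("missing", "dedup"),
]


def order_warnings(steps: list[str]) -> list[str]:
    """Return one warning per reversed pair; never reorder or reject."""
    position = {name: index for index, name in enumerate(steps)}
    warnings = []
    for earlier, later in SOFT_DEPENDENCIES:
        both_present = earlier in position and later in position
        if both_present and position[earlier] > position[later]:
            warnings.append(f"'{earlier}' usually runs before '{later}'")
    return warnings
```

The function only reports; the caller still runs `steps` in the order the user chose, which matches the "user owns their workflow" rule.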
@@ -184,7 +184,7 @@ forces** — the user owns their workflow. Implementation: see
| 1 | PyInstaller pipeline · Mac/Win unsigned installers · Apple Dev Program enrollment (1–2 wk lead) | `dist/datatools-mac.dmg` and `dist/datatools-win.exe` install on a clean machine |
| 2 | Demo deployed to Streamlit Cloud · landing page v1 with embedded demo · 3 persona datasets in the demo | Public URL serves a working pipeline run on a sample dataset in < 30 s |
| 3 | Gumroad listing live · share value-first in 3 niche communities (no pitch) · 1 long-tail SEO post for the lead persona | First listing impression captured · post not removed for self-promotion |
| 4 | Pipeline Runner v1.0 shipped (this week, 2026-05-01 — exception per §2.1) · v1.1 patch announced with Tool 09 + intl improvements | Pipeline saves/loads JSON · 3 demo pipelines preloaded |
| 4 | Automated Workflows v1.0 shipped (this week, 2026-05-01 — exception per §2.1) · v1.1 patch announced with Tool 09 + intl improvements | Pipeline saves/loads JSON · 3 demo pipelines preloaded |
| 5–8 | Bookkeeper landing page · agency landing page · second tool's promo cycle · priority-support tier added (defer purchase until §2.7 trigger) | Three live landing pages with distinct H1, demo dataset, conversion target |
| 9–13 | Tool 06–08 only **if** revenue trajectory supports continued investment · otherwise more market work on the existing 5 + 09 | Decision made on 13 Aug 2026 with revenue data, not feature ambition |
@@ -202,7 +202,7 @@ These flip the plan, not the underlying criteria:
## 5. Anti-temptations (things the plan refuses)
- **More tools before more buyers.** Locked. Exception only for Pipeline Runner per §2.1.
- **More tools before more buyers.** Locked. Exception only for Automated Workflows per §2.1.
- **SaaS pivot.** Recurring infra conflicts with the lifestyle constraint (DECISIONS.md §4).
- **Live chat / sales calls.** Conflicts with no-touch (DECISIONS.md §1 #8).
- **Custom integrations / one-off consulting.** $300/hr looks tempting; breaks the "build once, sell many" model that justifies the entire strategy.

@@ -144,7 +144,7 @@ Reading PLAN.md §3 + this doc together, the rough script:
| **M1** (June) | Installers · demo · 3 landing pages · Gumroad live | Whether the funnel mechanically works. Numbers will be noisy; just look for one purchase. |
| **M2** (July) | M1 + community posts in 3 niches + 1 SEO post | Which persona converts. Re-allocate effort to the highest-converting niche. |
| **M3** (August) | M2 + landing-page changes from M2 review | Whether intent-rate moved on the change. Decide tools 06–08 go/no-go. |
| **M4** (September) | M3 + first repeat-buyer signals | Whether the Pipeline Runner is producing retention as designed. |
| **M4** (September) | M3 + first repeat-buyer signals | Whether Automated Workflows is producing retention as designed. |
By end of M4, the data tells you whether the plan is producing
$1k–3k/mo (BUSINESS.md §6 6-month target) — extrapolated from the

@@ -21,8 +21,8 @@ project-root/
│ └── CLI-REFERENCE.md
├── src/
│ ├── core/ # shared logic — both CLI + GUI call into this
│ ├── cli.py # Deduplicator CLI
│ ├── cli_text_clean.py # Text Cleaner CLI
│ ├── cli.py # Find Duplicates CLI
│ ├── cli_text_clean.py # Clean Text CLI
│ ├── cli_analyze.py # Analyzer CLI
│ └── gui/
│ ├── app.py # Streamlit entry

@@ -76,7 +76,7 @@ Sample size: 1,000 rows (configurable).
- Full-DataFrame `auto_fix`: ~5 min (~30 µs/cell).
- Output write: ~10 s.
- Recommended RAM: 3–4× input size for the full-Apply path.
- **Format standardizer** (`standardize_dataframe`): ~2.7M rows/sec on
- **Standardize Formats** (`standardize_dataframe`): ~2.7M rows/sec on
cache-warm repetition-heavy columns (synthetic 1M-row in-memory
benchmark, 2 typed columns); the fused single-pass loop replaced a
3-pass ``.tolist()`` cycle, so per-call overhead is now dominated by
@@ -87,20 +87,20 @@ Sample size: 1,000 rows (configurable).
thread-pool scaffolding; on CPython 3.12 with the GIL it's
roughly neutral, but the API is ready for the free-threaded
(PEP 703) Python 3.13+ build where it will help.
- **Text cleaner** (`clean_dataframe`): ~1M rows/sec on
- **Clean Text** (`clean_dataframe`): ~1M rows/sec on
repetition-heavy columns (per-call string cache: the pipeline runs
once per *unique* cell value, not once per row).
- **Missing handler** (`handle_missing`): lazy-copy — when sentinel
- **Fix Missing Values** (`handle_missing`): lazy-copy — when sentinel
standardization runs but finds nothing, AND no drops AND no fills
apply, the input frame is returned as-is. On a clean 1 GB file this
saves the 1 GB allocation that the unconditional upfront copy used
to take.
- **Column mapper** (`map_columns`): rename + drop both already
- **Map Columns** (`map_columns`): rename + drop both already
return fresh frames; the explicit upfront `df.copy()` is now
removed and downstream mutating steps (schema-add, coerce) copy on
demand via `_ensure_owned()`. Rename-only and identity-mapping
paths run with zero explicit copies.
- **Deduplicator**:
- **Find Duplicates**:
- **Exact-only strategies** (every column uses `Algorithm.EXACT` at
threshold 100 — covers strong-key dedup like email/phone, the
fallback drop-duplicates path, and explicit "match on this exact
@@ -117,19 +117,19 @@ Sample size: 1,000 rows (configurable).
(the common dedup workload) skip re-parsing.
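
The per-call string cache mentioned for Clean Text above (work runs once per unique cell value, not once per row) can be illustrated with plain Python. This is a simplified sketch, not the `clean_dataframe` implementation: `clean_column` and its `clean_one` argument are hypothetical names, and the real code operates on DataFrame columns.

```python
from typing import Callable


def clean_column(values: list[str], clean_one: Callable[[str], str]) -> list[str]:
    """Apply clean_one to each cell, computing each distinct value only once."""
    cache: dict[str, str] = {}            # lives for this call only
    cleaned = []
    for value in values:
        if value not in cache:            # first time this exact string is seen
            cache[value] = clean_one(value)
        cleaned.append(cache[value])
    return cleaned


calls: list[str] = []


def strip_and_record(text: str) -> str:
    calls.append(text)                    # record how often real work happens
    return text.strip()


result = clean_column([" CA", " CA", "NY ", " CA"], strip_and_record)
```

On this four-row column the cleaning function runs twice, once per distinct string, which is why repetition-heavy columns benefit most.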
## 11. Tools
1. Deduplicator — Ready
2. Text Cleaner — Ready
3. Format Standardizer — Ready
4. Missing Value Handler — Ready
5. Column Mapper — Ready
6. Outlier Detector — Coming Soon
7. Multi-File Merger — Coming Soon
8. Validator & Reporter — Coming Soon
9. Pipeline Runner — Ready
1. Find Duplicates — Ready
2. Clean Text — Ready
3. Standardize Formats — Ready
4. Fix Missing Values — Ready
5. Map Columns — Ready
6. Find Unusual Values — Coming Soon
7. Combine Files — Coming Soon
8. Quality Check — Coming Soon
9. Automated Workflows — Ready
### 11.a Recommended pipeline order (soft, not enforced)
The Pipeline Runner ships with a `SOFT_DEPENDENCIES` table; the
Automated Workflows ships with a `SOFT_DEPENDENCIES` table; the
following ordering is the default and the basis of the warning
surface. Re-ordering is allowed; the runner emits a warning string
and proceeds.
@@ -214,7 +214,7 @@ and proceeds.
fresh blob without losing the embedded buyer identity. Tier may
change during renewal (Lite → Core upgrade path).
- **Tiers**:
- ``lite`` — Deduplicator + Text Cleaner + Format Standardizer.
- ``lite`` — Find Duplicates + Clean Text + Standardize Formats.
Buyer pays once, gets the three universally-useful tools.
- ``core`` — every Ready tool (all 9 in v1.6).
- ``pro``, ``enterprise`` — scaffolded for future SKUs; currently

@@ -34,8 +34,8 @@ src/
normalizers.py # Per-column normalizers for dedup matching
text_clean.py # clean_dataframe + smart_title_case
_constants.py # Shared USPS abbrevs + state names
cli.py # Deduplicator CLI (Typer)
cli_text_clean.py # Text Cleaner CLI
cli.py # Find Duplicates CLI (Typer)
cli_text_clean.py # Clean Text CLI
cli_analyze.py # Analyzer CLI (--json)
gui/
app.py # Streamlit entry point
@@ -192,7 +192,7 @@ GUI / CLI handlers use `format_for_user()` so the user always sees: file path, o
| Bundle | Status |
|--------|--------|
| Data Cleaning Mastery | 3/9 tools Ready (Dedup, Text Cleaner, Format Standardizer); 6 stubs |
| Data Cleaning Mastery | 3/9 tools Ready (Find Duplicates, Clean Text, Standardize Formats); 6 stubs |
| Automated Business Reporting | Not started |
| Ecommerce Data Pipeline | Not started |
| Small Business Finance | Not started |
@@ -214,12 +214,12 @@ Deliberately separate. Confluent original spec was wrong.
| Script | Owns |
|--------|------|
| 04 Missing Value Handler | "What's not there." Disguised nulls (`N/A`, `-`, sentinel codes), missingness patterns, imputation, drop-by-threshold. |
| 06 Outlier Detector | "What shouldn't be there." z-score / IQR / modified-z, multivariate (Isolation Forest, Mahalanobis), domain rules, winsorization. |
| 04 Fix Missing Values | "What's not there." Disguised nulls (`N/A`, `-`, sentinel codes), missingness patterns, imputation, drop-by-threshold. |
| 06 Find Unusual Values | "What shouldn't be there." z-score / IQR / modified-z, multivariate (Isolation Forest, Mahalanobis), domain rules, winsorization. |
**Run order**: 04 before 06. Outlier stats on data with `NaN` / sentinels are mathematically poisoned (means dragged, IQR widens, false negatives).
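
A small worked example of the poisoning effect, using only the standard library. The readings and the `-999` sentinel are made-up illustration data, not taken from the project's test corpus.

```python
import statistics

readings = [10.0, 11.0, 12.0, 13.0, 14.0]
with_sentinel = readings + [-999.0]   # -999 is a disguised null, not a measurement

clean_mean = statistics.mean(readings)            # centre of the real data
poisoned_mean = statistics.mean(with_sentinel)    # dragged far below every real value

clean_spread = statistics.stdev(readings)
poisoned_spread = statistics.stdev(with_sentinel) # inflated by the sentinel

# With the inflated spread, a z-score test on the poisoned column would need a
# value hundreds of units away from the mean before flagging it.
print(clean_mean, poisoned_mean)
print(round(clean_spread, 2), round(poisoned_spread, 2))
```

One sentinel among six values moves the mean from 12.0 to -156.5, so any threshold derived from these statistics stops describing the real data. Converting sentinels to nulls first (tool 04) avoids this.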
**Pipeline order** (Pipeline Runner enforces): 02 → 03 → 04 → 05 → 06 → 07 → 08. 01 is order-flexible.
**Pipeline order** (Automated Workflows enforces): 02 → 03 → 04 → 05 → 06 → 07 → 08. 01 is order-flexible.
**Contested cases**:
- Whitespace-only cell — 02 trims to empty; 04 then flags empty as null.

@@ -14,7 +14,7 @@ Introduce tu nombre completo y correo, pega el código de licencia del correo de
| Nivel | Herramientas |
|---|---|
| **Lite** | Eliminador de duplicados · Limpiador de texto · Estandarizador de formatos |
| **Lite** | Buscar duplicados · Limpiar texto · Estandarizar formatos |
| **Core** | Las 9 herramientas |
Un usuario Lite que abra una herramienta exclusiva de Core verá un mensaje "Actualiza tu licencia". La página de inicio también muestra una marca 🔒 Bloqueado en las tarjetas de las herramientas que tu nivel no incluye. Para actualizar, pega un código Core en la página Activar.
@@ -53,15 +53,15 @@ Matriz de soporte completa: [REQUIREMENTS.md](REQUIREMENTS.md) (solo en inglés)
| # | Herramienta | Propósito | Estado |
|---|------|---------|--------|
| 01 | Eliminador de duplicados | Coincidencia exacta + difusa, 5 normalizadores, auditoría | Listo |
| 02 | Limpiador de texto | Espacios, caracteres tipográficos, BOM, finales de línea, mayúsculas/minúsculas | Listo |
| 03 | Estandarizador de formatos | Fechas / teléfonos / correos / direcciones / nombres / monedas / booleanos | Listo |
| 04 | Gestor de valores faltantes | Nulos disfrazados, imputación, descarte por umbral | Próximamente |
| 05 | Mapeador de columnas | Renombrar + aplicar esquema | Próximamente |
| 06 | Detector de valores atípicos | z-score, IQR, multivariante | Próximamente |
| 07 | Combinador de varios archivos | Combina varios archivos | Próximamente |
| 08 | Validador e informes | Reglas + informe PDF/Excel | Próximamente |
| 09 | Ejecutor de canalizaciones | Lanzador multi-herramienta de un clic | Próximamente |
| 01 | Buscar duplicados | Coincidencia exacta + difusa, 5 normalizadores, auditoría | Listo |
| 02 | Limpiar texto | Espacios, caracteres tipográficos, BOM, finales de línea, mayúsculas/minúsculas | Listo |
| 03 | Estandarizar formatos | Fechas / teléfonos / correos / direcciones / nombres / monedas / booleanos | Listo |
| 04 | Corregir valores faltantes | Nulos disfrazados, imputación, descarte por umbral | Próximamente |
| 05 | Mapear columnas | Renombrar + aplicar esquema | Próximamente |
| 06 | Detectar valores atípicos | z-score, IQR, multivariante | Próximamente |
| 07 | Combinar archivos | Combina varios archivos | Próximamente |
| 08 | Verificación de calidad | Reglas + informe PDF/Excel | Próximamente |
| 09 | Flujos automatizados | Lanzador multi-herramienta de un clic | Próximamente |
**Datos de muestra** (`samples/`): `messy_sales.csv`, `bank_export.xlsx`.
@@ -89,17 +89,17 @@ Ayuda: `deduplicator --help`. Referencia completa: [CLI-REFERENCE.es.md](CLI-REF
### 3.3 Orden de ejecución (cuando uses las herramientas manualmente)
Si no usas el Ejecutor de canalizaciones, sigue este orden:
Si no usas Flujos automatizados, sigue este orden:
1. **02 Limpiador de texto** primero — normaliza espacios y caracteres especiales.
2. **03 Estandarizador de formatos** — fechas, teléfonos, etc. necesitan texto limpio.
3. **04 Gestor de valores faltantes** — códigos centinela se ocultan como números.
4. **05 Mapeador de columnas** — esquema antes que estadísticas de atípicos.
5. **06 Detector de valores atípicos** — necesita datos numéricos limpios. Calcular estadísticas con `NaN` o `-999` envenena los resultados.
6. **07 Combinador de varios archivos**, **08 Validador** según sea necesario.
7. **01 Eliminador de duplicados** es flexible en cuanto al orden (normaliza internamente para la coincidencia).
1. **02 Limpiar texto** primero — normaliza espacios y caracteres especiales.
2. **03 Estandarizar formatos** — fechas, teléfonos, etc. necesitan texto limpio.
3. **04 Corregir valores faltantes** — códigos centinela se ocultan como números.
4. **05 Mapear columnas** — esquema antes que estadísticas de atípicos.
5. **06 Detectar valores atípicos** — necesita datos numéricos limpios. Calcular estadísticas con `NaN` o `-999` envenena los resultados.
6. **07 Combinar archivos**, **08 Verificación de calidad** según sea necesario.
7. **01 Buscar duplicados** es flexible en cuanto al orden (normaliza internamente para la coincidencia).
El Ejecutor de canalizaciones aplica este orden automáticamente.
Flujos automatizados aplica este orden automáticamente.
### 3.4 Idioma


@@ -14,7 +14,7 @@ Enter your full name + email, paste the license blob from your purchase email (s
| Tier | Tools |
|---|---|
| **Lite** | Deduplicator · Text Cleaner · Format Standardizer |
| **Lite** | Find Duplicates · Clean Text · Standardize Formats |
| **Core** | All 9 tools |
A Lite user opening a Core-only tool sees an "Upgrade your license" prompt. The home page also shows a 🔒 Locked badge on tool cards your tier doesn't unlock. To upgrade, paste a Core blob on the Activate page.
@@ -53,15 +53,15 @@ Full numbered support matrix: [REQUIREMENTS.md](REQUIREMENTS.md).
| # | Tool | Purpose | Status |
|---|------|---------|--------|
| 01 | Deduplicator | Exact + fuzzy match, 5 normalizers, audit | Ready |
| 02 | Text Cleaner | Whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | Format Standardizer | Dates / phones / emails / addresses / names / currencies / booleans | Ready |
| 04 | Missing Value Handler | Disguised nulls, imputation, drop-by-threshold | Coming Soon |
| 05 | Column Mapper | Rename + enforce schema | Coming Soon |
| 06 | Outlier Detector | z-score, IQR, multivariate | Coming Soon |
| 07 | Multi-File Merger | Combine multiple files | Coming Soon |
| 08 | Validator & Reporter | Rules + PDF/Excel report | Coming Soon |
| 09 | Pipeline Runner | One-click multi-tool launcher | Coming Soon |
| 01 | Find Duplicates | Exact + fuzzy match, 5 normalizers, audit | Ready |
| 02 | Clean Text | Whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | Standardize Formats | Dates / phones / emails / addresses / names / currencies / booleans | Ready |
| 04 | Fix Missing Values | Disguised nulls, imputation, drop-by-threshold | Coming Soon |
| 05 | Map Columns | Rename + enforce schema | Coming Soon |
| 06 | Find Unusual Values | z-score, IQR, multivariate | Coming Soon |
| 07 | Combine Files | Combine multiple files | Coming Soon |
| 08 | Quality Check | Rules + PDF/Excel report | Coming Soon |
| 09 | Automated Workflows | One-click multi-tool launcher | Coming Soon |
**Sample data** (`samples/`): `messy_sales.csv`, `bank_export.xlsx`.
@@ -89,17 +89,17 @@ Get help: `deduplicator --help`. Full reference: [CLI-REFERENCE.md](CLI-REFERENC
### 3.3 Run order (when running tools manually)
If you skip the Pipeline Runner, follow this order:
If you skip Automated Workflows, follow this order:
1. **02 Text Cleaner** first — normalizes whitespace + special chars.
2. **03 Format Standardizer** — dates, phones, etc. need cleaned text.
3. **04 Missing Value Handler** — sentinel codes hide as numbers.
4. **05 Column Mapper** — schema before outlier stats.
5. **06 Outlier Detector** — needs clean numerics. Stats on data with `NaN` or `-999` are mathematically poisoned.
6. **07 Multi-File Merger**, **08 Validator** as needed.
7. **01 Deduplicator** is order-flexible (normalizes internally for matching).
1. **02 Clean Text** first — normalizes whitespace + special chars.
2. **03 Standardize Formats** — dates, phones, etc. need cleaned text.
3. **04 Fix Missing Values** — sentinel codes hide as numbers.
4. **05 Map Columns** — schema before outlier stats.
5. **06 Find Unusual Values** — needs clean numerics. Stats on data with `NaN` or `-999` are mathematically poisoned.
6. **07 Combine Files**, **08 Quality Check** as needed.
7. **01 Find Duplicates** is order-flexible (normalizes internally for matching).
The Pipeline Runner enforces this automatically.
Automated Workflows enforces this automatically.
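
The ordering rule above can be sketched as a small sort over tool ids. This is a hedged illustration, not the Automated Workflows code: `order_steps` and the constant names are hypothetical, the tool ids are the ones used in the display-name maps, and placing the order-flexible 01 step last is an assumption made for the example.

```python
# Fixed part of the documented order: 02 -> 03 -> 04 -> 05 -> 06 -> 07 -> 08.
RECOMMENDED_ORDER = [
    "02_text_cleaner",
    "03_format_standardizer",
    "04_missing_handler",
    "05_column_mapper",
    "06_outlier_detector",
    "07_multi_file_merger",
    "08_validator_reporter",
]

# 01 is order-flexible; this sketch chooses to run it last (an assumption).
ORDER_FLEXIBLE = "01_deduplicator"


def order_steps(selected):
    """Return the selected tool ids in the recommended run order."""
    rank = {tool_id: i for i, tool_id in enumerate(RECOMMENDED_ORDER)}
    unknown = [t for t in selected if t not in rank and t != ORDER_FLEXIBLE]
    if unknown:
        raise ValueError(f"unknown tool ids: {unknown}")
    ordered = sorted((t for t in selected if t in rank), key=rank.__getitem__)
    if ORDER_FLEXIBLE in selected:
        ordered.append(ORDER_FLEXIBLE)
    return ordered


print(order_steps(["06_outlier_detector", "01_deduplicator",
                   "02_text_cleaner", "04_missing_handler"]))
```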
### 3.4 Language


@@ -251,12 +251,12 @@ row,column,field_type,old,new
<div class="eyebrow">In the bundle</div>
<h2>Six tools. One pipeline. One $49 download.</h2>
<div class="grid">
<div class="card"><h3>1 · Deduplicator</h3><p>Fuzzy match (Jaro-Winkler), explicit strategies for Date+Amount+Vendor, survivor rules.</p></div>
<div class="card"><h3>2 · Text Cleaner</h3><p>Header whitespace, smart quotes from copy-paste, em-dash sentinels.</p></div>
<div class="card"><h3>3 · Format Standardizer</h3><p>ISO dates, numeric amounts (parens-negative), vendor casing, multi-currency.</p></div>
<div class="card"><h3>4 · Missing Value Handler</h3><p>Disguised-null detection: <code></code>, <code>N/A</code>, <code>(blank)</code>, <code>?</code>.</p></div>
<div class="card"><h3>5 · Column Mapper</h3><p>Project to your accounting tool's required schema, coerce types, drop extras.</p></div>
<div class="card"><h3>6 · Pipeline Runner</h3><p>Save the cleanup. Run it on next month's export with one command. Same audit, automated.</p></div>
<div class="card"><h3>1 · Find Duplicates</h3><p>Fuzzy match (Jaro-Winkler), explicit strategies for Date+Amount+Vendor, survivor rules.</p></div>
<div class="card"><h3>2 · Clean Text</h3><p>Header whitespace, smart quotes from copy-paste, em-dash sentinels.</p></div>
<div class="card"><h3>3 · Standardize Formats</h3><p>ISO dates, numeric amounts (parens-negative), vendor casing, multi-currency.</p></div>
<div class="card"><h3>4 · Fix Missing Values</h3><p>Disguised-null detection: <code></code>, <code>N/A</code>, <code>(blank)</code>, <code>?</code>.</p></div>
<div class="card"><h3>5 · Map Columns</h3><p>Project to your accounting tool's required schema, coerce types, drop extras.</p></div>
<div class="card"><h3>6 · Automated Workflows</h3><p>Save the cleanup. Run it on next month's export with one command. Same audit, automated.</p></div>
</div>
</div>
</section>


@@ -168,9 +168,9 @@
<h2>One engine. Same six tools. Same $49.</h2>
<p>
The persona pages above are positioning, not different products.
Whichever you buy, you get the full bundle: Deduplicator, Text
Cleaner, Format Standardizer, Missing-Value Handler, Column
Mapper, and Pipeline Runner — pre-tuned with a saved pipeline
Whichever you buy, you get the full bundle: Find Duplicates, Clean
Text, Standardize Formats, Fix Missing Values, Map Columns,
and Automated Workflows — pre-tuned with a saved pipeline
that matches your workflow.
</p>
<div class="grid">


@@ -165,7 +165,7 @@
<div class="card">
<span class="icon">🌍</span>
<h3>Multi-platform audience reconciliation</h3>
<p>Build one canonical audience from Meta, Google Ads, LinkedIn, and your CRM. Each platform exports a different shape; column-mapper aligns them all, dedup merges the survivors with their most-complete fields.</p>
<p>Build one canonical audience from Meta, Google Ads, LinkedIn, and your CRM. Each platform exports a different shape; Map Columns aligns them all, dedup merges the survivors with their most-complete fields.</p>
</div>
<div class="card">
<span class="icon">🛡️</span>
@@ -192,7 +192,7 @@
<li><strong>Per-row country column</strong> drives the parser, so no global default buckets UK numbers as malformed US.</li>
<li><strong>Country-name normalization</strong>: <code>USA</code> / <code>US</code> / <code>United States</code> all resolve to the same ISO-2 code.</li>
<li><strong>50+ country support</strong> via Google's libphonenumber, including KR, CN, IN, MX, BR, IL, TR, PL, DK, SE.</li>
<li><strong>Schema enforcement</strong> via the column-mapper: project to your CRM's required shape, coerce score columns to integers, reorder fields to match the import contract.</li>
<li><strong>Schema enforcement</strong> via Map Columns: project to your CRM's required shape, coerce score columns to integers, reorder fields to match the import contract.</li>
</ul>
</div>
</section>
@@ -249,12 +249,12 @@ Total elapsed: 6.7 s
<div class="eyebrow">In the bundle</div>
<h2>Six tools. One pipeline. One $49 download.</h2>
<div class="grid">
<div class="card"><h3>1 · Deduplicator</h3><p>Fuzzy match across email + phone + name + company; merge survivors with most-complete fields.</p></div>
<div class="card"><h3>2 · Text Cleaner</h3><p>Smart quotes from copy-paste, NBSP from spreadsheet exports, BOM from Excel.</p></div>
<div class="card"><h3>3 · Format Standardizer</h3><p>E.164 phones with per-row country, canonical emails, name casing, ISO dates.</p></div>
<div class="card"><h3>4 · Missing Value Handler</h3><p>Detect <code>TBD</code>, <code>(unknown)</code>, <code></code> across vendor exports.</p></div>
<div class="card"><h3>5 · Column Mapper</h3><p>Project to your CRM's required schema, coerce score to integer, reorder for import.</p></div>
<div class="card"><h3>6 · Pipeline Runner</h3><p>Save the cleanup as JSON. Drop next campaign's combined export on it. Same dedup, automated.</p></div>
<div class="card"><h3>1 · Find Duplicates</h3><p>Fuzzy match across email + phone + name + company; merge survivors with most-complete fields.</p></div>
<div class="card"><h3>2 · Clean Text</h3><p>Smart quotes from copy-paste, NBSP from spreadsheet exports, BOM from Excel.</p></div>
<div class="card"><h3>3 · Standardize Formats</h3><p>E.164 phones with per-row country, canonical emails, name casing, ISO dates.</p></div>
<div class="card"><h3>4 · Fix Missing Values</h3><p>Detect <code>TBD</code>, <code>(unknown)</code>, <code></code> across vendor exports.</p></div>
<div class="card"><h3>5 · Map Columns</h3><p>Project to your CRM's required schema, coerce score to integer, reorder for import.</p></div>
<div class="card"><h3>6 · Automated Workflows</h3><p>Save the cleanup as JSON. Drop next campaign's combined export on it. Same dedup, automated.</p></div>
</div>
</div>
</section>


@@ -178,7 +178,7 @@
<div class="card">
<span class="icon">🔗</span>
<h3>Multi-channel order consolidation</h3>
<p>Orders from Shopify + Etsy + a wholesale spreadsheet, each with a different column for "customer email." Column-mapper aligns them; dedup merges across channels.</p>
<p>Orders from Shopify + Etsy + a wholesale spreadsheet, each with a different column for "customer email." Map Columns aligns them; dedup merges across channels.</p>
</div>
<div class="card">
<span class="icon">⚙️</span>
@@ -270,12 +270,12 @@ Total elapsed: 4.2 s
<div class="eyebrow">In the bundle</div>
<h2>Six tools. One pipeline. One $49 download.</h2>
<div class="grid">
<div class="card"><h3>1 · Deduplicator</h3><p>Fuzzy match (Jaro-Winkler), 5 normalizers, survivor rules, interactive review.</p></div>
<div class="card"><h3>2 · Text Cleaner</h3><p>Whitespace, smart chars, NBSP, BOM, line endings, case ops.</p></div>
<div class="card"><h3>3 · Format Standardizer</h3><p>Dates, phones, emails, addresses, names, currencies, booleans.</p></div>
<div class="card"><h3>4 · Missing Value Handler</h3><p>Disguised-null detection, profile, mean/median/mode/ffill, drop strategies.</p></div>
<div class="card"><h3>5 · Column Mapper</h3><p>Fuzzy auto-rename, target schema, type coercion, required-field defaults.</p></div>
<div class="card"><h3>6 · Pipeline Runner</h3><p>Chain tools in recommended order, save/load JSON, automate weekly cleanups.</p></div>
<div class="card"><h3>1 · Find Duplicates</h3><p>Fuzzy match (Jaro-Winkler), 5 normalizers, survivor rules, interactive review.</p></div>
<div class="card"><h3>2 · Clean Text</h3><p>Whitespace, smart chars, NBSP, BOM, line endings, case ops.</p></div>
<div class="card"><h3>3 · Standardize Formats</h3><p>Dates, phones, emails, addresses, names, currencies, booleans.</p></div>
<div class="card"><h3>4 · Fix Missing Values</h3><p>Disguised-null detection, profile, mean/median/mode/ffill, drop strategies.</p></div>
<div class="card"><h3>5 · Map Columns</h3><p>Fuzzy auto-rename, target schema, type coercion, required-field defaults.</p></div>
<div class="card"><h3>6 · Automated Workflows</h3><p>Chain tools in recommended order, save/load JSON, automate weekly cleanups.</p></div>
</div>
</div>
</section>


@@ -45,15 +45,15 @@ app = typer.Typer(
# Tool id -> friendly display name. Kept in the CLI module since the GUI has
# its own version; both stay in lockstep with the actual script lineup.
_TOOL_DISPLAY = {
"01_deduplicator": "Deduplicator",
"02_text_cleaner": "Text Cleaner",
"03_format_standardizer": "Format Standardizer",
"04_missing_handler": "Missing Value Handler",
"05_column_mapper": "Column Mapper",
"06_outlier_detector": "Outlier Detector",
"07_multi_file_merger": "Multi-File Merger",
"08_validator_reporter": "Validator & Reporter",
"09_pipeline_runner": "Pipeline Runner",
"01_deduplicator": "Find Duplicates",
"02_text_cleaner": "Clean Text",
"03_format_standardizer": "Standardize Formats",
"04_missing_handler": "Fix Missing Values",
"05_column_mapper": "Map Columns",
"06_outlier_detector": "Find Unusual Values",
"07_multi_file_merger": "Combine Files",
"08_validator_reporter": "Quality Check",
"09_pipeline_runner": "Automated Workflows",
}
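
Because the CLI and the GUI each keep their own copy of this mapping, a drift check along these lines could keep the two in lockstep. It is a sketch only: `diff_display_maps` is a hypothetical helper, and the two small dicts stand in for the real `_TOOL_DISPLAY` and `TOOL_DISPLAY_NAMES` objects rather than importing them.

```python
def diff_display_maps(cli_map, gui_map):
    """Return {tool_id: (cli_label, gui_label)} for every mismatch.

    A tool id missing from one side is reported with None on that side.
    """
    mismatches = {}
    for tool_id in sorted(set(cli_map) | set(gui_map)):
        cli_label = cli_map.get(tool_id)
        gui_label = gui_map.get(tool_id)
        if cli_label != gui_label:
            mismatches[tool_id] = (cli_label, gui_label)
    return mismatches


# Stand-in data: the GUI copy was renamed, the CLI copy was only partly updated.
cli_labels = {"01_deduplicator": "Find Duplicates", "02_text_cleaner": "Text Cleaner"}
gui_labels = {"01_deduplicator": "Find Duplicates", "02_text_cleaner": "Clean Text"}

print(diff_display_maps(cli_labels, gui_labels))
```

In a test suite the same helper could be asserted to return an empty dict for the two real mappings.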


@@ -1,4 +1,4 @@
"""CLI for the DataTools Column Mapper (script 05).
"""CLI for the DataTools Map Columns tool (script 05).
Usage:
python -m src.cli_column_map input.csv # auto-mapping preview


@@ -1,4 +1,4 @@
"""CLI for the DataTools Format Standardizer (script 03).
"""CLI for the DataTools Standardize Formats tool (script 03).
Usage:
python -m src.cli_format input.csv \\


@@ -1,4 +1,4 @@
"""CLI for the DataTools Missing Value Handler (script 04).
"""CLI for the DataTools Fix Missing Values tool (script 04).
Usage:
python -m src.cli_missing input.csv # profile only


@@ -1,4 +1,4 @@
"""CLI for the DataTools Pipeline Runner (script 09).
"""CLI for the DataTools Automated Workflows tool (script 09).
Usage:
# Run the recommended default pipeline (text → format → missing → dedup):


@@ -1,4 +1,4 @@
"""DataTools Column Mapper.
"""DataTools Map Columns.
Rename columns, enforce a target schema, coerce types, drop / add /
reorder columns. Designed for the three buyer profiles the toolkit


@@ -1,4 +1,4 @@
"""DataTools Missing Value Handler.
"""DataTools Fix Missing Values.
Detects disguised nulls, profiles missingness per column, and applies
imputation or drop strategies with a full audit trail.


@@ -1,4 +1,4 @@
"""DataTools Pipeline Runner.
"""DataTools Automated Workflows.
Chain the cleaning tools (text-clean, format-standardize, missing,
column-map, dedup) into a single orchestrated workflow. The pipeline


@@ -1 +1 @@
"""Streamlit GUI for the DataTools Deduplicator."""
"""Streamlit GUI for DataTools."""


@@ -16,7 +16,7 @@ they need without dragging the entire kitchen-sink module:
dedup_review.py ← dedup match-group cards + review pipeline
shared.py ← chrome / file-pickup helpers used by every tool
A standalone Deduplicator build, for example, can ship without
A standalone Find Duplicates build, for example, can ship without
``findings.py`` and ``gate.py`` — those modules import the analyzer /
gate code that the Lite SKU does not include.


@@ -847,15 +847,15 @@ def _build_match_groups_csv(
# Tool id -> friendly display name. Single source of truth for the GUI; the
# CLI keeps its own copy so each entrypoint stays self-contained.
TOOL_DISPLAY_NAMES: dict[str, str] = {
"01_deduplicator": "Deduplicator",
"02_text_cleaner": "Text Cleaner",
"03_format_standardizer": "Format Standardizer",
"04_missing_handler": "Missing Value Handler",
"05_column_mapper": "Column Mapper",
"06_outlier_detector": "Outlier Detector",
"07_multi_file_merger": "Multi-File Merger",
"08_validator_reporter": "Validator & Reporter",
"09_pipeline_runner": "Pipeline Runner",
"01_deduplicator": "Find Duplicates",
"02_text_cleaner": "Clean Text",
"03_format_standardizer": "Standardize Formats",
"04_missing_handler": "Fix Missing Values",
"05_column_mapper": "Map Columns",
"06_outlier_detector": "Find Unusual Values",
"07_multi_file_merger": "Combine Files",
"08_validator_reporter": "Quality Check",
"09_pipeline_runner": "Automated Workflows",
}
_SEVERITY_ICON: dict[str, str] = {
@@ -1016,7 +1016,7 @@ def render_hidden_aware_preview(
) -> None:
"""Render a DataFrame preview that shows hidden characters in every cell.
Used for the Text Cleaner's "before" and "after" previews so the user
Used for the Clean Text tool's "before" and "after" previews so the user
can actually see the leading/trailing whitespace, NBSP padding,
zero-width characters, and smart punctuation that the cleaner is going
to remove (or just removed). A plain ``st.dataframe`` collapses outer


@@ -1,4 +1,4 @@
"""DataTools Deduplicator — full working tool page."""
"""DataTools Find Duplicates — full working tool page."""
from __future__ import annotations


@@ -1,4 +1,4 @@
"""DataTools Text Cleaner — Streamlit page."""
"""DataTools Clean Text — Streamlit page."""
from __future__ import annotations


@@ -1,4 +1,4 @@
"""DataTools Format Standardizer — Streamlit page."""
"""DataTools Standardize Formats — Streamlit page."""
from __future__ import annotations


@@ -1,4 +1,4 @@
"""DataTools Missing Value Handler — Streamlit page."""
"""DataTools Fix Missing Values — Streamlit page."""
from __future__ import annotations


@@ -1,4 +1,4 @@
"""DataTools Column Mapper — Streamlit page."""
"""DataTools Map Columns — Streamlit page."""
from __future__ import annotations


@@ -1,4 +1,4 @@
"""DataTools Outlier Detector — stub page."""
"""DataTools Find Unusual Values — stub page."""
from __future__ import annotations


@@ -1,4 +1,4 @@
"""DataTools Multi-File Merger — stub page."""
"""DataTools Combine Files — stub page."""
from __future__ import annotations


@@ -1,4 +1,4 @@
"""DataTools Validator & Reporter — stub page."""
"""DataTools Quality Check — stub page."""
from __future__ import annotations


@@ -1,4 +1,4 @@
"""DataTools Pipeline Runner — Streamlit page."""
"""DataTools Automated Workflows — Streamlit page."""
from __future__ import annotations


@@ -1,4 +1,4 @@
# Column Mapper — corpus
# Map Columns — corpus
Acceptance fixtures for `src/core/column_mapper.py`. Each `.csv` under
`test_data/` is paired with assertions in


@@ -1,4 +1,4 @@
# Missing Value Handler — corpus
# Fix Missing Values — corpus
Acceptance fixtures for `src/core/missing.py`. Each `.csv` under
`test_data/` is paired with assertions in `tests/test_missing_corpus.py`.


@@ -1,4 +1,4 @@
# Text Cleaner Test Corpus
# Clean Text Test Corpus
Test fixtures for `02_text_cleaner.py` (Excel & CSV Data Cleaning Mastery Bundle).


@@ -3,7 +3,7 @@
These exercise the chrome-level gate that ``hide_streamlit_chrome``
installs: when no valid license is on disk, every page renders the
activation form instead of the page body, and tool widgets do NOT
appear. We test against the Deduplicator page since it's the smallest
appear. We test against the Find Duplicates page since it's the smallest
real-world tool that depends on chrome.
The autouse fixture in ``tests/conftest.py`` sets


@@ -5,7 +5,7 @@ expander that houses every per-column / per-strategy knob. It's the
densest single widget surface in the GUI, so a session-state key drift
in there cascades into every dedup session.
We exercise it via the Deduplicator page (rendering ``config_panel``
We exercise it via the Find Duplicates page (rendering ``config_panel``
in isolation requires a fake Streamlit context). The page provides
the surrounding state; we poke widgets and verify their effects.
"""


@@ -2,7 +2,7 @@
``match_group_card`` from ``src.gui.components`` has two modes (decided
/ undecided) and a Confirm/Undo flow keyed by session_state. We test
each state by exercising the parent Deduplicator page end to end and
each state by exercising the parent Find Duplicates page end to end and
then poking at ``review_decisions`` directly.
Why not unit-test ``match_group_card`` in isolation? AppTest needs a


@@ -21,7 +21,7 @@ from .conftest import collected_text, stash_upload
# ---------------------------------------------------------------------------
class TestMalformedUploadErrors:
"""Bytes that look like a CSV but aren't parseable. The Deduplicator
"""Bytes that look like a CSV but aren't parseable. The Find Duplicates
page wraps ``read_file`` failures in an ``st.error`` with the file
name and the structured ``format_for_user`` output."""


@@ -11,7 +11,7 @@ exist, each pinned here:
3. **Upload + matching passed normalization** — gate is a no-op; the
page proceeds.
We exercise the gate via the Deduplicator page (any tool page would
We exercise the gate via the Find Duplicates page (any tool page would
work; dedup is the smallest one that doesn't depend on heavy widgets).
"""
@@ -27,7 +27,7 @@ from .conftest import (
)
# Deduplicator is our canary — it calls ``require_normalization_gate``
# Find Duplicates is our canary — it calls ``require_normalization_gate``
# on the second line of the module. If the gate blocks, the dedup-
# specific title shouldn't even render.
GATED_PAGE = "1_Deduplicator"


@@ -1,9 +1,9 @@
"""GUI tests for the Lite tier.
A Lite license unlocks Deduplicator, Text Cleaner, Format
Standardizer. Opening any other tool page (Missing Values, Column
Mapper, Pipeline Runner, etc.) must render an upgrade prompt and
short-circuit the page body.
A Lite license unlocks Find Duplicates, Clean Text, Standardize
Formats. Opening any other tool page (Fix Missing Values, Map
Columns, Automated Workflows, etc.) must render an upgrade prompt
and short-circuit the page body.
The home grid shows a 🔒 Locked badge on the cards for tools the
user's tier doesn't unlock.
@@ -104,7 +104,7 @@ class TestLiteHomeGridBadges:
):
home_app.run()
text = collected_text(home_app)
# Missing Value Handler is locked under Lite — its card should
# Fix Missing Values is locked under Lite — its card should
# have a 🔒 Locked badge.
# We assert the lock glyph appears alongside the locked tool's
# display name. Streamlit renders the markdown verbatim so the


@@ -19,7 +19,7 @@ from .conftest import collected_text, stash_upload
# ---------------------------------------------------------------------------
# Deduplicator
# Find Duplicates
# ---------------------------------------------------------------------------
class TestDeduplicatorWorkflow:
@@ -64,7 +64,7 @@ class TestDeduplicatorWorkflow:
# ---------------------------------------------------------------------------
# Text Cleaner
# Clean Text
# ---------------------------------------------------------------------------
class TestTextCleanerWorkflow:
@@ -96,7 +96,7 @@ class TestTextCleanerWorkflow:
# ---------------------------------------------------------------------------
# Format Standardizer
# Standardize Formats
# ---------------------------------------------------------------------------
class TestFormatStandardizerWorkflow:
@@ -110,7 +110,7 @@ class TestFormatStandardizerWorkflow:
# ---------------------------------------------------------------------------
# Missing Value Handler
# Fix Missing Values
# ---------------------------------------------------------------------------
class TestMissingValuesWorkflow:
@@ -124,7 +124,7 @@ class TestMissingValuesWorkflow:
# ---------------------------------------------------------------------------
# Column Mapper
# Map Columns
# ---------------------------------------------------------------------------
class TestColumnMapperWorkflow:
@@ -138,7 +138,7 @@ class TestColumnMapperWorkflow:
# ---------------------------------------------------------------------------
# Pipeline Runner
# Automated Workflows
# ---------------------------------------------------------------------------
class TestPipelineRunnerWorkflow:


@@ -41,8 +41,8 @@ class TestAnalyzeCli:
assert result.exit_code == 0
# The Rich table breaks lines; assert on stable substrings instead of
# full finding ids.
assert "Text Cleaner" in result.stdout
assert "Missing Value" in result.stdout
assert "Clean Text" in result.stdout
assert "Fix Missing Values" in result.stdout
# Severity column is rendered.
assert "warn" in result.stdout


@@ -1,4 +1,4 @@
"""Acceptance corpus for the Column Mapper.
"""Acceptance corpus for the Map Columns tool.
Loads every fixture in ``test-cases/column-mapper-corpus/test_data/``
and asserts the documented behaviour against the documented schema.


@@ -48,7 +48,7 @@ class TestAnalyzeCliE2E:
proc = _run("-m", "src.cli_analyze", str(CORPUS_KITCHEN_SINK))
assert proc.returncode == 0, proc.stderr
# Rich tables wrap; assert on stable substrings.
assert "Text Cleaner" in proc.stdout
assert "Clean Text" in proc.stdout
assert "csv_bom_stripped" in proc.stdout or "smart_quotes" in proc.stdout
def test_json_output_parses(self):


@@ -1,7 +1,7 @@
"""Tier-specific tests: Lite tier feature set + gating.
Lite unlocks exactly three tools — Deduplicator, Text Cleaner,
Format Standardizer — and locks the other six. We test:
Lite unlocks exactly three tools — Find Duplicates, Clean Text,
Standardize Formats — and locks the other six. We test:
- The features map for Lite returns the right three flags (and only
those three).


@@ -1,4 +1,4 @@
"""Acceptance corpus for the Missing Value Handler.
"""Acceptance corpus for the Fix Missing Values tool.
Loads every fixture in ``test-cases/missing-corpus/test_data/`` and
asserts the documented behaviour. The fixtures are split into:


@@ -25,7 +25,7 @@ from src.core import (
# ---------------------------------------------------------------------------
# Format Standardizer: single-tolist hot loop
# Standardize Formats: single-tolist hot loop
# ---------------------------------------------------------------------------
class TestStandardizerHotLoop:
@@ -93,7 +93,7 @@ class TestStandardizerHotLoop:
# ---------------------------------------------------------------------------
# Deduplicator: per-call normalizer cache
# Find Duplicates: per-call normalizer cache
# ---------------------------------------------------------------------------
class TestDedupNormalizerCache: