docs+code: rename tool labels everywhere

Sweep follow-up to 93e43fc. Display labels now consistent across docs,
landing pages, CLI output, code comments, docstrings, and test prose.
Five parallel surfaces touched:

- docs (EN + ES): README, USER-GUIDE, CLI-REFERENCE, and 11 internal
  design/planning docs
- landing pages: index + bookkeeper/revops/shopify-pet
- src: CLI module docstrings, _TOOL_DISPLAY dicts in cli_analyze.py
  and gui/components/_legacy.py, core module headers, every tool
  page's module docstring
- tests: class/method/module docstrings and section-header comments
- test-cases READMEs

Page slugs (1_Deduplicator etc.), tool_id strings (01_deduplicator
etc.), Python class names (TestDeduplicatorWorkflow, FeatureFlag.*),
URL paths, anchor IDs, CSS classes, and asset filenames were left
intact since they're code identifiers / structural references.
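
A minimal sketch of the split described above, assuming `_TOOL_DISPLAY` is a plain dict keyed by tool_id. Only the `01_deduplicator` id is quoted in this message; the `02_text_cleaner` key and the `display_label` helper are hypothetical and only illustrate that keys stay fixed while labels change:

```python
# Hypothetical sketch: tool_id keys are stable identifiers, values are display labels.
# Only the values change in a label rename; the keys are left untouched.
_TOOL_DISPLAY = {
    "01_deduplicator": "Find Duplicates",  # was "Deduplicator"
    "02_text_cleaner": "Clean Text",       # key assumed for illustration; was "Text Cleaner"
}


def display_label(tool_id):
    """Return the user-facing label, falling back to the raw id if unmapped."""
    return _TOOL_DISPLAY.get(tool_id, tool_id)
```

Lookups by tool_id keep working unchanged, which is why a label rename of this kind should not need to touch slugs, class names, or URLs.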

All 2033 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 19:50:09 +00:00
parent 93e43fc0d9
commit db5ec084da
57 changed files with 205 additions and 205 deletions


@@ -8,15 +8,15 @@ Limpieza local de CSV / Excel. CLI + GUI en el navegador, sin nube, sin ceremoni
| # | Herramienta | Estado | | # | Herramienta | Estado |
|---|------|--------| |---|------|--------|
| 01 | **Eliminador de duplicados** — coincidencia exacta + difusa, 5 normalizadores, reglas de superviviente, auditoría | Listo | | 01 | **Buscar duplicados** — coincidencia exacta + difusa, 5 normalizadores, reglas de superviviente, auditoría | Listo |
| 02 | **Limpiador de texto** — espacios, caracteres tipográficos, BOM, finales de línea, mayúsculas/minúsculas | Listo | | 02 | **Limpiar texto** — espacios, caracteres tipográficos, BOM, finales de línea, mayúsculas/minúsculas | Listo |
| 03 | **Estandarizador de formatos** — fechas, teléfonos, correos, direcciones, nombres, monedas, booleanos | Listo | | 03 | **Estandarizar formatos** — fechas, teléfonos, correos, direcciones, nombres, monedas, booleanos | Listo |
| 04 | **Gestor de valores faltantes** — detección de nulos disfrazados, perfil, media/mediana/moda/ffill/bfill/interpolación, estrategias de descarte | Listo | | 04 | **Corregir valores faltantes** — detección de nulos disfrazados, perfil, media/mediana/moda/ffill/bfill/interpolación, estrategias de descarte | Listo |
| 05 | **Mapeador de columnas** — autodetección difusa de renombrados, esquema objetivo con coerción de tipos, campos requeridos con valores por defecto, descartar/reordenar | Listo | | 05 | **Mapear columnas** — autodetección difusa de renombrados, esquema objetivo con coerción de tipos, campos requeridos con valores por defecto, descartar/reordenar | Listo |
| 06 | Detector de valores atípicos | Próximamente | | 06 | Detectar valores atípicos | Próximamente |
| 07 | Combinador de varios archivos | Próximamente | | 07 | Combinar archivos | Próximamente |
| 08 | Validador e informes | Próximamente | | 08 | Verificación de calidad | Próximamente |
| 09 | **Ejecutor de canalizaciones** — encadena herramientas en un orden recomendado (no forzado), guarda/carga JSON, automatiza limpiezas semanales | Listo | | 09 | **Flujos automatizados** — encadena herramientas en un orden recomendado (no forzado), guarda/carga JSON, automatiza limpiezas semanales | Listo |
## Descarga (usuarios no técnicos) ## Descarga (usuarios no técnicos)


@@ -8,15 +8,15 @@ Local CSV / Excel cleaning. CLI + browser GUI, no cloud, no install ceremony. GU
| # | Tool | Status | | # | Tool | Status |
|---|------|--------| |---|------|--------|
| 01 | **Deduplicator** — exact + fuzzy match, 5 normalizers, survivor rules, audit | Ready | | 01 | **Find Duplicates** — exact + fuzzy match, 5 normalizers, survivor rules, audit | Ready |
| 02 | **Text Cleaner** — whitespace, smart chars, BOM, line endings, case ops | Ready | | 02 | **Clean Text** — whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | **Format Standardizer** — dates, phones, emails, addresses, names, currencies, booleans | Ready | | 03 | **Standardize Formats** — dates, phones, emails, addresses, names, currencies, booleans | Ready |
| 04 | **Missing Value Handler** — disguised-null detection, profile, mean/median/mode/ffill/bfill/interpolate, drop strategies | Ready | | 04 | **Fix Missing Values** — disguised-null detection, profile, mean/median/mode/ffill/bfill/interpolate, drop strategies | Ready |
| 05 | **Column Mapper** — fuzzy auto-rename, target schema with type coercion, required fields with defaults, drop/reorder | Ready | | 05 | **Map Columns** — fuzzy auto-rename, target schema with type coercion, required fields with defaults, drop/reorder | Ready |
| 06 | Outlier Detector | Coming Soon | | 06 | Find Unusual Values | Coming Soon |
| 07 | Multi-File Merger | Coming Soon | | 07 | Combine Files | Coming Soon |
| 08 | Validator & Reporter | Coming Soon | | 08 | Quality Check | Coming Soon |
| 09 | **Pipeline Runner** — chain tools with recommended (not forced) order, save/load JSON, automate weekly cleanups | Ready | | 09 | **Automated Workflows** — chain tools with recommended (not forced) order, save/load JSON, automate weekly cleanups | Ready |
## Download (non-technical users) ## Download (non-technical users)


@@ -246,7 +246,7 @@ much state to trust:
4. Double-click the app icon. 4. Double-click the app icon.
5. Browser should open to http://127.0.0.1:850x within 5 seconds. 5. Browser should open to http://127.0.0.1:850x within 5 seconds.
6. Drop samples/demo/shopify_pet_customers.csv into the 6. Drop samples/demo/shopify_pet_customers.csv into the
Pipeline Runner page; click Run; AFTER preview should appear. Automated Workflows page; click Run; AFTER preview should appear.
7. Confirm in the network tab: zero outbound calls except to 7. Confirm in the network tab: zero outbound calls except to
127.0.0.1 and the Streamlit static asset paths (also local). 127.0.0.1 and the Streamlit static asset paths (also local).
``` ```


@@ -333,7 +333,7 @@ the attached `.dtlic` file.
| Tier | Features | | Tier | Features |
|------|---------| |------|---------|
| **lite** | Deduplicator, Text Cleaner, Format Standardizer | | **lite** | Find Duplicates, Clean Text, Standardize Formats |
| **core** | All 9 tools | | **core** | All 9 tools |
| **pro** | All 9 tools + future Pro-only features | | **pro** | All 9 tools + future Pro-only features |


@@ -47,7 +47,7 @@ Sell niche Python automation tools as one-time downloadable digital products. Ta
**Surface**: desktop install per OS (PyInstaller) with Streamlit GUI + CLI. Constrained demo on Streamlit Community Cloud. **Surface**: desktop install per OS (PyInstaller) with Streamlit GUI + CLI. Constrained demo on Streamlit Community Cloud.
## 4a. Lead bundle — Deduplicator ## 4a. Lead bundle — Find Duplicates
Highest pain density across all 4 personas. Feeds landing copy, demo design, feature priority. Tech spec: TECHNICAL.md §11.1. Highest pain density across all 4 personas. Feeds landing copy, demo design, feature priority. Tech spec: TECHNICAL.md §11.1.
@@ -208,7 +208,7 @@ Headroom enables optional ad spend ($100-200/mo) once a bundle has proven conver
## 13. Honest status (2026-05-01) ## 13. Honest status (2026-05-01)
- 3 of 9 tools shipped (Dedup, Text Cleaner, Format Standardizer). - 3 of 9 tools shipped (Find Duplicates, Clean Text, Standardize Formats).
- Cross-platform build pipeline designed, not yet built. - Cross-platform build pipeline designed, not yet built.
- macOS code signing not yet set up. - macOS code signing not yet set up.
- Streamlit GUI shipped for the 3 ready tools. - Streamlit GUI shipped for the 3 ready tools.


@@ -8,15 +8,15 @@ Tres módulos de CLI, uno por cada herramienta Lista:
| Módulo | Comando | Propósito | | Módulo | Comando | Propósito |
|--------|---------|---------| |--------|---------|---------|
| `src.cli` | `python -m src.cli FILE` | Eliminador de duplicados | | `src.cli` | `python -m src.cli FILE` | Buscar duplicados |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Limpiador de texto | | `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Limpiar texto |
| `src.cli_analyze` | `python -m src.cli_analyze FILE` | Analizador (escaneo de solo lectura) | | `src.cli_analyze` | `python -m src.cli_analyze FILE` | Analizador (escaneo de solo lectura) |
Cada comando es **previsualización por defecto** — añade `--apply` para escribir la salida. Cada comando es **previsualización por defecto** — añade `--apply` para escribir la salida.
--- ---
# Eliminador de duplicados # Buscar duplicados
``` ```
python -m src.cli ARCHIVO_ENTRADA [OPCIONES] python -m src.cli ARCHIVO_ENTRADA [OPCIONES]
@@ -125,7 +125,7 @@ Registro: `logs/dedup_YYYYMMDD_HHMMSS.log`.
--- ---
# Limpiador de texto # Limpiar texto
``` ```
python -m src.cli_text_clean ARCHIVO_ENTRADA [OPCIONES] python -m src.cli_text_clean ARCHIVO_ENTRADA [OPCIONES]
@@ -156,7 +156,7 @@ Higiene a nivel de carácter. Ver [TECHNICAL.md §10.2](TECHNICAL.md) (solo en i
- `--config RUTA` / `--save-config RUTA`. - `--config RUTA` / `--save-config RUTA`.
### Archivo ### Archivo
- `--sheet`, `--encoding`, `--header-row` — iguales que en el Eliminador de duplicados. - `--sheet`, `--encoding`, `--header-row` — iguales que en Buscar duplicados.
## Presets ## Presets


@@ -6,15 +6,15 @@ Three CLI modules, one per Ready tool:
| Module | Command | Purpose | | Module | Command | Purpose |
|--------|---------|---------| |--------|---------|---------|
| `src.cli` | `python -m src.cli FILE` | Deduplicator | | `src.cli` | `python -m src.cli FILE` | Find Duplicates |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Text Cleaner | | `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Clean Text |
| `src.cli_analyze` | `python -m src.cli_analyze FILE` | Analyzer (read-only scan) | | `src.cli_analyze` | `python -m src.cli_analyze FILE` | Analyzer (read-only scan) |
Every command is **preview-only by default** — add `--apply` to write output. Every command is **preview-only by default** — add `--apply` to write output.
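
The preview-by-default behaviour can be sketched with `argparse`. This is an illustration under assumptions, not the actual `src.cli` code: only the `--apply` flag comes from the text above, while `build_parser` and `run_mode` are hypothetical names.

```python
import argparse


def build_parser():
    """Build a parser where writing output is opt-in via --apply."""
    parser = argparse.ArgumentParser(prog="python -m src.cli")
    parser.add_argument("input_file", help="CSV or Excel file to process")
    # store_true defaults to False, so omitting the flag means preview-only.
    parser.add_argument("--apply", action="store_true",
                        help="write output instead of only previewing")
    return parser


def run_mode(argv):
    """Return 'apply' when --apply is present, otherwise 'preview'."""
    args = build_parser().parse_args(argv)
    return "apply" if args.apply else "preview"
```

With this shape, `run_mode(["contacts.csv"])` stays in preview and only `run_mode(["contacts.csv", "--apply"])` would reach the write path.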
--- ---
# Deduplicator # Find Duplicates
``` ```
python -m src.cli INPUT_FILE [OPTIONS] python -m src.cli INPUT_FILE [OPTIONS]
@@ -123,7 +123,7 @@ Log: `logs/dedup_YYYYMMDD_HHMMSS.log`.
--- ---
# Text Cleaner # Clean Text
``` ```
python -m src.cli_text_clean INPUT_FILE [OPTIONS] python -m src.cli_text_clean INPUT_FILE [OPTIONS]
@@ -154,7 +154,7 @@ Character-level hygiene. See [TECHNICAL.md §10.2](TECHNICAL.md) for the spec.
- `--config PATH` / `--save-config PATH`. - `--config PATH` / `--save-config PATH`.
### File ### File
- `--sheet`, `--encoding`, `--header-row` — same as Deduplicator. - `--sheet`, `--encoding`, `--header-row` — same as Find Duplicates.
## Presets ## Presets


@@ -67,7 +67,7 @@ Each candidate scored 1-5 on 6 dimensions. Total /30 → verdict.
**v1.2 rationale**: **v1.2 rationale**:
- Buyer persona ("hates Excel work but can't code") won't learn a CLI. Refunds at this price. - Buyer persona ("hates Excel work but can't code") won't learn a CLI. Refunds at this price.
- Deduplicator needs interactive review — not viable in pure CLI. - Find Duplicates needs interactive review — not viable in pure CLI.
- Dual interface keeps CLI for automation without sacrificing primary buyer surface. - Dual interface keeps CLI for automation without sacrificing primary buyer surface.
## 4a. Functional scope principle (v1.2) ## 4a. Functional scope principle (v1.2)
@@ -170,13 +170,13 @@ $49-79/bundle · $149 full suite (when 3+ exist).
| Apr 28 (v1.3) | Add hosted browser demo as conversion lever | Direct consequence of Streamlit choice. See §5. | | Apr 28 (v1.3) | Add hosted browser demo as conversion lever | Direct consequence of Streamlit choice. See §5. |
| Apr 28 (v1.4) | Re-apply 04/06 boundary work (silent-drift recovery) | Stream B v1.2 content overwritten in parallel v1.3 work. Restored per no-silent-drift rule. | | Apr 28 (v1.4) | Re-apply 04/06 boundary work (silent-drift recovery) | Stream B v1.2 content overwritten in parallel v1.3 work. Restored per no-silent-drift rule. |
| Apr 28 (v1.5) | Add `02_text_cleaner.py`; renumber 02-08 → 03-09 | Character-level hygiene had no clear owner. See TECHNICAL §10. | | Apr 28 (v1.5) | Add `02_text_cleaner.py`; renumber 02-08 → 03-09 | Character-level hygiene had no clear owner. See TECHNICAL §10. |
| Apr 29 (v1.7) | Adopt Text Cleaner Tier 1/2/3 spec; lock `excel-hygiene` default | Promotes from stub to buildable v1 target. Full spec in TECHNICAL §11.2. | | Apr 29 (v1.7) | Adopt Clean Text Tier 1/2/3 spec; lock `excel-hygiene` default | Promotes from stub to buildable v1 target. Full spec in TECHNICAL §11.2. |
| Apr 28 (v1.6) | Fold conversation-history content into docs (deduplicator spec, lead bundle use cases, full GUI matrix, 04/06 examples, Streamlit-to-SaaS reasoning) | No new decisions; promote at-risk analysis from chat history per no-silent-drift rule. | | Apr 28 (v1.6) | Fold conversation-history content into docs (deduplicator spec, lead bundle use cases, full GUI matrix, 04/06 examples, Streamlit-to-SaaS reasoning) | No new decisions; promote at-risk analysis from chat history per no-silent-drift rule. |
| May 1 (v1.6) | Mark Format Standardizer **Ready** | 199-row buyer corpus passing; Tier 1 + most Tier 2 built. | | May 1 (v1.6) | Mark Standardize Formats **Ready** | 199-row buyer corpus passing; Tier 1 + most Tier 2 built. |
| May 1 (v1.6) | Add `src/core/errors.py` structured hierarchy | Uniform helpful messages across CLI + GUI. See TECHNICAL §7. | | May 1 (v1.6) | Add `src/core/errors.py` structured hierarchy | Uniform helpful messages across CLI + GUI. See TECHNICAL §7. |
| May 13 (v1.6) | Ship in-house JSON i18n + EN/ES packs | Expand addressable market (Spanish-first buyers, LatAm bookkeepers) without a `gettext` build step. JSON packs editable by non-devs; parity test prevents drift. See TECHNICAL §10b. | | May 13 (v1.6) | Ship in-house JSON i18n + EN/ES packs | Expand addressable market (Spanish-first buyers, LatAm bookkeepers) without a `gettext` build step. JSON packs editable by non-devs; parity test prevents drift. See TECHNICAL §10b. |
| May 13 (v1.6) | Ship licensing: 1-year HMAC-signed blobs, name+email registration, offline verification, tier-scaffolded for future SKUs | Unlock the lifetime-update business model without recurring infra. Honor-system DRM (HMAC + 30-day refund) — sufficient at $49. See §9b below. | | May 13 (v1.6) | Ship licensing: 1-year HMAC-signed blobs, name+email registration, offline verification, tier-scaffolded for future SKUs | Unlock the lifetime-update business model without recurring infra. Honor-system DRM (HMAC + 30-day refund) — sufficient at $49. See §9b below. |
| May 13 (v1.6) | Add Lite SKU (Dedup + Text Cleaner + Format Standardizer) | Lower-priced entry point for buyers who only need the three universal tools. Per-tool feature gating + lock badges on the home grid surface the upgrade path. See §9b. | | May 13 (v1.6) | Add Lite SKU (Find Duplicates + Clean Text + Standardize Formats) | Lower-priced entry point for buyers who only need the three universal tools. Per-tool feature gating + lock badges on the home grid surface the upgrade path. See §9b. |
| May 13 (v1.6) | Remove user-facing free trial | A 1-year all-features trial undercut the paid Lite SKU. Paid-only keeps tier economics clean. Internal ``_mint`` API still exists for tests and the seller's key generator. See §9b. | | May 13 (v1.6) | Remove user-facing free trial | A 1-year all-features trial undercut the paid Lite SKU. Paid-only keeps tier economics clean. Internal ``_mint`` API still exists for tests and the seller's key generator. See §9b. |
| May 13 (v1.6) | Upgrade license crypto: HMAC → Ed25519 (asymmetric) | HMAC's symmetric secret was extractable from the shipped binary — anyone with the binary could mint blobs. Ed25519 splits sign (seller) from verify (binary), so binary compromise doesn't let an attacker forge licenses. Blob prefix bumped DTLIC1 → DTLIC2. See §9b. | | May 13 (v1.6) | Upgrade license crypto: HMAC → Ed25519 (asymmetric) | HMAC's symmetric secret was extractable from the shipped binary — anyone with the binary could mint blobs. Ed25519 splits sign (seller) from verify (binary), so binary compromise doesn't let an attacker forge licenses. Blob prefix bumped DTLIC1 → DTLIC2. See §9b. |
| May 13 (v1.6) | Add ``assert_production_safe`` tripwire | A shipped build with ``DATATOOLS_DEV_MODE=1`` or the in-source dev pubkey would silently defeat licensing. The tripwire refuses to boot such a build. No-op in source / pytest runs. See §9b. | | May 13 (v1.6) | Add ``assert_production_safe`` tripwire | A shipped build with ``DATATOOLS_DEV_MODE=1`` or the in-source dev pubkey would silently defeat licensing. The tripwire refuses to boot such a build. No-op in source / pytest runs. See §9b. |
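
The weakness behind the HMAC to Ed25519 row can be shown with the standard library alone. This is a sketch under assumptions: the secret value and the payload format are invented for illustration and do not reflect the real blob layout. With HMAC, verification needs the same secret that signing does, so anyone who extracts it from a shipped binary can mint tags that verify.

```python
import hashlib
import hmac

# Hypothetical secret; with HMAC the verifier (the shipped binary) must carry it.
SHARED_SECRET = b"demo-secret-embedded-in-binary"


def sign(payload, secret):
    """Compute an HMAC-SHA256 tag over the payload."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify(payload, tag, secret):
    """Constant-time check that the tag matches the payload."""
    return hmac.compare_digest(sign(payload, secret), tag)


# The seller signs a legitimate payload (format invented for this sketch).
legit = b"tier=lite;email=buyer@example.com"
legit_tag = sign(legit, SHARED_SECRET)

# Anyone holding the extracted secret can sign a payload of their own choosing.
forged = b"tier=core;email=attacker@example.com"
forged_tag = sign(forged, SHARED_SECRET)
```

Both tags pass `verify`, which is the forgery problem the row describes. An asymmetric scheme such as Ed25519 avoids it because the binary only carries the public verification key.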
@@ -211,13 +211,13 @@ The 30-day refund window covers casual blob sharing from a different angle (anyo
- Number of devices the same blob is used on (no concurrent-use detection). - Number of devices the same blob is used on (no concurrent-use detection).
- Reverse-engineered re-signing of expired blobs (would require RSA / online check). - Reverse-engineered re-signing of expired blobs (would require RSA / online check).
**Future SKUs**: the ``FEATURES_BY_TIER`` table in ``src/license/features.py`` is the single source of truth for "which tools each tier unlocks". Adding a PRO SKU that excludes the pipeline runner is a 1-line edit there + a 1-line edit at the gate site. No consumer-code churn. **Future SKUs**: the ``FEATURES_BY_TIER`` table in ``src/license/features.py`` is the single source of truth for "which tools each tier unlocks". Adding a PRO SKU that excludes Automated Workflows is a 1-line edit there + a 1-line edit at the gate site. No consumer-code churn.
**v1.6 SKU lineup**: **v1.6 SKU lineup**:
| Tier | Tools unlocked | Notes | | Tier | Tools unlocked | Notes |
|---|---|---| |---|---|---|
| LITE | Deduplicator, Text Cleaner, Format Standardizer | Entry SKU. Three universal tools that handle the most common bookkeeping / RevOps / Klaviyo prep workflows. | | LITE | Find Duplicates, Clean Text, Standardize Formats | Entry SKU. Three universal tools that handle the most common bookkeeping / RevOps / Klaviyo prep workflows. |
| CORE | All 9 tools | Full v1 suite. | | CORE | All 9 tools | Full v1 suite. |
| PRO | All 9 tools (scaffolded) | Reserved for future per-feature carve-outs (e.g., scheduled pipelines, API access). | | PRO | All 9 tools (scaffolded) | Reserved for future per-feature carve-outs (e.g., scheduled pipelines, API access). |
| ENTERPRISE | All 9 tools (scaffolded) | Reserved for future bulk / multi-seat SKUs. | | ENTERPRISE | All 9 tools (scaffolded) | Reserved for future bulk / multi-seat SKUs. |
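
A single-source-of-truth tier table along the lines described above might look like the following. Only the name `FEATURES_BY_TIER` and the tier lineup come from the doc; the `Tier` enum, the placeholder tool ids, and `is_unlocked` are assumptions made for this sketch.

```python
from enum import Enum


class Tier(Enum):
    LITE = "lite"
    CORE = "core"
    PRO = "pro"
    ENTERPRISE = "enterprise"


# Placeholder tool ids (tool_01 .. tool_09); the real identifiers may differ.
ALL_TOOLS = frozenset(f"tool_{n:02d}" for n in range(1, 10))
LITE_TOOLS = frozenset({"tool_01", "tool_02", "tool_03"})

# One table decides what each tier unlocks; gate sites only read from it.
FEATURES_BY_TIER = {
    Tier.LITE: LITE_TOOLS,
    Tier.CORE: ALL_TOOLS,
    Tier.PRO: ALL_TOOLS,
    Tier.ENTERPRISE: ALL_TOOLS,
}


def is_unlocked(tier, tool_id):
    """Return True when the given tier includes the tool."""
    return tool_id in FEATURES_BY_TIER[tier]
```

Carving a tool out of a tier then becomes an edit to one set in the table, which matches the "1-line edit" claim in spirit.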


@@ -33,7 +33,7 @@ CLI (src/cli*.py) GUI (src/gui/app.py + pages/)
| `core.errors` | `DataToolsError` hierarchy, `ensure_dataframe()`, `ensure_choice()`, `wrap_file_read/write()`, `format_for_user()` | | `core.errors` | `DataToolsError` hierarchy, `ensure_dataframe()`, `ensure_choice()`, `wrap_file_read/write()`, `format_for_user()` |
| `core._constants` | `US_STATE_NAMES`, `US_STATE_CODES`, `USPS_EXPANSIONS`, `USPS_COMPRESSIONS` | | `core._constants` | `US_STATE_NAMES`, `US_STATE_CODES`, `USPS_EXPANSIONS`, `USPS_COMPRESSIONS` |
## Data flow — Deduplicator ## Data flow — Find Duplicates
``` ```
read_file() # auto-detect encoding, delimiter, header read_file() # auto-detect encoding, delimiter, header


@@ -30,7 +30,7 @@ Status legend:
| ✓ | Item | Where it lives | | ✓ | Item | Where it lives |
|---|------|----------------| |---|------|----------------|
| 🟢 | 6 of 9 tools shipped (Dedup, Text, Format, Missing, Column-Map, Pipeline) | `src/core/`, `src/cli_*.py`, `src/gui/pages/` | | 🟢 | 6 of 9 tools shipped (Dedup, Text, Format, Missing, Column-Map, Pipeline) | `src/core/`, `src/cli_*.py`, `src/gui/pages/` |
| 🟢 | Pipeline Runner (the retention multiplier per `PLAN.md` §2.6) | `src/core/pipeline.py`, `src/cli_pipeline.py`, `src/gui/pages/9_Pipeline_Runner.py` | | 🟢 | Automated Workflows (the retention multiplier per `PLAN.md` §2.6) | `src/core/pipeline.py`, `src/cli_pipeline.py`, `src/gui/pages/9_Pipeline_Runner.py` |
| 🟢 | 1,729 passing tests · 0 skipped · 0 xfailed | `tests/` | | 🟢 | 1,729 passing tests · 0 skipped · 0 xfailed | `tests/` |
| 🟢 | 3 niche demo datasets + pre-tuned pipeline JSONs | `samples/demo/` | | 🟢 | 3 niche demo datasets + pre-tuned pipeline JSONs | `samples/demo/` |
| 🟢 | Streamlit demo app + Cloud entry shim | `streamlit_app.py`, `src/gui/app_demo.py` | | 🟢 | Streamlit demo app + Cloud entry shim | `streamlit_app.py`, `src/gui/app_demo.py` |


@@ -29,8 +29,8 @@ win.
| Asset | State | | Asset | State |
|---|---| |---|---|
| Tools 1-5 (Dedup, Text Clean, Format Standardize, Missing, Column Mapper) | Ready · 1,691 tests passing · 0 xfailed | | Tools 1-5 (Find Duplicates, Clean Text, Standardize Formats, Fix Missing Values, Map Columns) | Ready · 1,691 tests passing · 0 xfailed |
| Tools 6-9 (Outlier, Multi-File Merge, Validator, Pipeline) | Coming Soon | | Tools 6-9 (Find Unusual Values, Combine Files, Quality Check, Automated Workflows) | Coming Soon |
| PyInstaller installer pipeline | Not started | | PyInstaller installer pipeline | Not started |
| macOS code signing (Apple Dev Program) | Not started | | macOS code signing (Apple Dev Program) | Not started |
| Hosted browser demo (Streamlit Cloud) | Not deployed | | Hosted browser demo (Streamlit Cloud) | Not deployed |
@@ -52,7 +52,7 @@ Tools 6-8 are blocked behind a **distribution gate**: no work on them
until the existing 5 tools have a paying customer + one external review until the existing 5 tools have a paying customer + one external review
(BUSINESS.md §4 sequence rule, applied recursively inside the bundle). (BUSINESS.md §4 sequence rule, applied recursively inside the bundle).
**Exception granted 2026-05-01**: Tool 09 Pipeline Runner is built **Exception granted 2026-05-01**: Tool 09 Automated Workflows is built
*now*. Rationale: the pipeline transforms the bundle from "5 tools you *now*. Rationale: the pipeline transforms the bundle from "5 tools you
buy" into "an automatable workflow you depend on." That conversion is buy" into "an automatable workflow you depend on." That conversion is
what produces retention and word-of-mouth — the only marketing channel what produces retention and word-of-mouth — the only marketing channel
@@ -104,10 +104,10 @@ demo dataset.
| # | Pain | $ / time impact | Tools that fix it | | # | Pain | $ / time impact | Tools that fix it |
|---|------|-----------------|---| |---|------|-----------------|---|
| S1 | **Klaviyo / Mailchimp / Omnisend per-contact billing.** Subscriber list with 10-18 % duplicate rate (case drift, plus signs in Gmail addresses, multiple devices) → recurring overpay forever. | $30-300/mo per percent of dupes on a 50 k list — recurring | Dedup + Format Standardize (email canonicalization) + Pipeline (re-run weekly) | | S1 | **Klaviyo / Mailchimp / Omnisend per-contact billing.** Subscriber list with 10-18 % duplicate rate (case drift, plus signs in Gmail addresses, multiple devices) → recurring overpay forever. | $30-300/mo per percent of dupes on a 50 k list — recurring | Dedup + Format Standardize (email canonicalization) + Pipeline (re-run weekly) |
| S2 | **Product feed rejected by Google Merchant Center / Meta Catalog.** Smart quotes in titles, NBSP in SKU, inconsistent attributes; campaign launch delayed 24-72 h while feed gets fixed. | 1-3 days delayed launch × campaign value | Text Cleaner + Format Standardize | | S2 | **Product feed rejected by Google Merchant Center / Meta Catalog.** Smart quotes in titles, NBSP in SKU, inconsistent attributes; campaign launch delayed 24-72 h while feed gets fixed. | 1-3 days delayed launch × campaign value | Clean Text + Standardize Formats |
| S3 | **Multi-channel order consolidation.** Shopify + Etsy + Amazon + Faire + wholesale spreadsheet, each with a different column for "customer email" / "order total" / "ship country". | 4-8 hr / month manually merging | Column Mapper + Dedup + Pipeline | | S3 | **Multi-channel order consolidation.** Shopify + Etsy + Amazon + Faire + wholesale spreadsheet, each with a different column for "customer email" / "order total" / "ship country". | 4-8 hr / month manually merging | Map Columns + Find Duplicates + Automated Workflows |
| S4 | **Subscription identity fragmentation.** Pet-box subscribers cancel and re-sub under a different email; cohort analysis says churn is 20 % when it's actually 12 % — pricing decisions wrong. | Mis-priced LTV → over- or under-paid acquisition | Dedup with `merge=true` survivor | | S4 | **Subscription identity fragmentation.** Pet-box subscribers cancel and re-sub under a different email; cohort analysis says churn is 20 % when it's actually 12 % — pricing decisions wrong. | Mis-priced LTV → over- or under-paid acquisition | Dedup with `merge=true` survivor |
| S5 | **International tax / VAT MOSS compliance.** Country column is `UK` / `U.K.` / `United Kingdom` / `GB` in the same export; VAT report breaks. Phone formats per region break call-center routing. | Compliance penalty risk + ops friction | Format Standardize (per-row country) + Column Mapper | | S5 | **International tax / VAT MOSS compliance.** Country column is `UK` / `U.K.` / `United Kingdom` / `GB` in the same export; VAT report breaks. Phone formats per region break call-center routing. | Compliance penalty risk + ops friction | Standardize Formats (per-row country) + Map Columns |
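
The email canonicalization named in S1 (case drift, plus signs in Gmail addresses) can be sketched as below. This is an assumption-laden illustration, not the tool's actual normalizer: `canonical_email` and `PLUS_TAG_DOMAINS` are hypothetical names, and restricting plus-tag stripping to Gmail domains is a choice made here because S1 only mentions Gmail.

```python
# Domains where a "+tag" suffix is assumed to reach the same inbox (assumption).
PLUS_TAG_DOMAINS = {"gmail.com", "googlemail.com"}


def canonical_email(raw):
    """Trim, lowercase, and drop a +tag from the local part on known domains."""
    local, sep, domain = raw.strip().lower().partition("@")
    if not sep:
        # No "@" present: not an email, so return the cleaned string as-is.
        return local
    if domain in PLUS_TAG_DOMAINS:
        local = local.split("+", 1)[0]
    return f"{local}@{domain}"


emails = ["Jane.Doe@Gmail.com", "jane.doe+promo@gmail.com", " jane.doe@gmail.com "]
unique = sorted({canonical_email(e) for e in emails})
```

Here three raw rows collapse to one canonical address, which is the mechanism that would cut per-contact billing on a duplicated list.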
#### Bookkeeper / freelance accountant #### Bookkeeper / freelance accountant
@@ -126,7 +126,7 @@ demo dataset.
| R1 | **HubSpot / Marketo / Iterable per-contact tier pricing.** 10 k contacts → enterprise tier at $4-8 k/mo. Every duplicate is a recurring tax. | $200-800 / month per 1 k duplicate contacts — recurring | Dedup with cross-source merge + Pipeline | | R1 | **HubSpot / Marketo / Iterable per-contact tier pricing.** 10 k contacts → enterprise tier at $4-8 k/mo. Every duplicate is a recurring tax. | $200-800 / month per 1 k duplicate contacts — recurring | Dedup with cross-source merge + Pipeline |
| R2 | **Email-deliverability / sender reputation.** Sending to invalid or duplicate addresses tanks reputation; recovery takes weeks. | Catastrophic — entire email programme degraded | Format Standardize (email canonicalization) + Missing (sentinel detection) | | R2 | **Email-deliverability / sender reputation.** Sending to invalid or duplicate addresses tanks reputation; recovery takes weeks. | Catastrophic — entire email programme degraded | Format Standardize (email canonicalization) + Missing (sentinel detection) |
| R3 | **GDPR / contact-data privacy.** Uploading lead data to a third-party cleaning SaaS is itself a GDPR concern; legal review blocks adoption. | Compliance risk + 4-8 wk legal-review delay | Local-only desktop app, zero outbound calls | | R3 | **GDPR / contact-data privacy.** Uploading lead data to a third-party cleaning SaaS is itself a GDPR concern; legal review blocks adoption. | Compliance risk + 4-8 wk legal-review delay | Local-only desktop app, zero outbound calls |
| R4 | **Multi-vendor lead-source unification.** Apollo, ZoomInfo, LinkedIn Sales Nav, manual scrapes — each export has different headers, scoring, country format. | 1-3 days per campaign of manual unification | Column Mapper (alias matching) + Format Standardize (per-row country) + Dedup | | R4 | **Multi-vendor lead-source unification.** Apollo, ZoomInfo, LinkedIn Sales Nav, manual scrapes — each export has different headers, scoring, country format. | 1-3 days per campaign of manual unification | Map Columns (alias matching) + Standardize Formats (per-row country) + Find Duplicates |
| R5 | **Suppression-list management across 5+ platforms.** Each platform has its own format; un-deduped suppression lists let opt-outs slip through, triggering CAN-SPAM / GDPR exposure. | Compliance risk + churn-back cost | Pipeline saved as JSON, re-run on each new suppression batch | | R5 | **Suppression-list management across 5+ platforms.** Each platform has its own format; un-deduped suppression lists let opt-outs slip through, triggering CAN-SPAM / GDPR exposure. | Compliance risk + churn-back cost | Pipeline saved as JSON, re-run on each new suppression batch |
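
The country-format drift named in R4 can be illustrated with a small alias table. This is a sketch only: `COUNTRY_ALIASES` and `normalize_country` are hypothetical, the four spellings are the ones quoted elsewhere in this plan (`UK` / `U.K.` / `United Kingdom` / `GB`), and mapping them to the ISO 3166-1 alpha-2 code `GB` is an assumption about the desired target.

```python
# Alias table keyed by lowercased spelling; target is an ISO 3166-1 alpha-2 code.
COUNTRY_ALIASES = {
    "uk": "GB",
    "u.k.": "GB",
    "united kingdom": "GB",
    "gb": "GB",
}


def normalize_country(raw):
    """Map known spellings to one code; leave unknown values trimmed but unchanged."""
    cleaned = raw.strip()
    return COUNTRY_ALIASES.get(cleaned.lower(), cleaned)


column = ["UK", "U.K.", "United Kingdom", "GB", "France"]
normalized = [normalize_country(value) for value in column]
```

Unknown values pass through untouched rather than being guessed at, so a later review step can still see them.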
### 2.4 Operationalize the moat the docs already name. ### 2.4 Operationalize the moat the docs already name.
@@ -154,7 +154,7 @@ right after "runs locally."
Copy seed: *"Every change auditable. Hand the audit CSV to your client Copy seed: *"Every change auditable. Hand the audit CSV to your client
with the cleaned file."* with the cleaned file."*
### 2.6 The Pipeline Runner is the retention multiplier. ### 2.6 Automated Workflows is the retention multiplier.
A buyer with a saved pipeline isn't a one-off purchase — they're a A buyer with a saved pipeline isn't a one-off purchase — they're a
recurring user who recommends the product. This is exactly the recurring user who recommends the product. This is exactly the
@@ -172,8 +172,8 @@ trigger DECISIONS.md §8 already names).
### 2.8 Dependency-aware pipeline UX. ### 2.8 Dependency-aware pipeline UX.
Tools have soft execution-order preferences (Text Clean before Format Tools have soft execution-order preferences (Text Clean before Format
Standardize, Format before Dedup, Missing before Dedup). The Pipeline Standardize, Format before Dedup, Missing before Dedup). Automated
Runner *recommends* the order, *warns* on reversals, and **never Workflows *recommends* the order, *warns* on reversals, and **never
forces** — the user owns their workflow. Implementation: see forces** — the user owns their workflow. Implementation: see
`src/core/pipeline.py` `SOFT_DEPENDENCIES`. `src/core/pipeline.py` `SOFT_DEPENDENCIES`.
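
The recommend-and-warn behaviour could be sketched as follows. The name `SOFT_DEPENDENCIES` comes from the text above, but its shape as (earlier, later) pairs, the short tool keys, and the `order_warnings` helper are assumptions made for illustration:

```python
# Assumed shape: (earlier_tool, later_tool) pairs for the soft preferences above.
SOFT_DEPENDENCIES = [
    ("text_clean", "format_standardize"),
    ("format_standardize", "dedup"),
    ("missing", "dedup"),
]


def order_warnings(steps):
    """Return one warning per soft preference that the given step order reverses."""
    position = {name: index for index, name in enumerate(steps)}
    warnings = []
    for earlier, later in SOFT_DEPENDENCIES:
        both_present = earlier in position and later in position
        if both_present and position[earlier] > position[later]:
            warnings.append(f"{earlier} usually runs before {later}")
    return warnings
```

The function only reports reversals and never reorders `steps`, which mirrors the "never forces" rule: the caller decides whether to act on a warning.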
@@ -184,7 +184,7 @@ forces** — the user owns their workflow. Implementation: see
| 1 | PyInstaller pipeline · Mac/Win unsigned installers · Apple Dev Program enrollment (1-2 wk lead) | `dist/datatools-mac.dmg` and `dist/datatools-win.exe` install on a clean machine | | 1 | PyInstaller pipeline · Mac/Win unsigned installers · Apple Dev Program enrollment (1-2 wk lead) | `dist/datatools-mac.dmg` and `dist/datatools-win.exe` install on a clean machine |
| 2 | Demo deployed to Streamlit Cloud · landing page v1 with embedded demo · 3 persona datasets in the demo | Public URL serves a working pipeline run on a sample dataset in < 30 s | | 2 | Demo deployed to Streamlit Cloud · landing page v1 with embedded demo · 3 persona datasets in the demo | Public URL serves a working pipeline run on a sample dataset in < 30 s |
| 3 | Gumroad listing live · share value-first in 3 niche communities (no pitch) · 1 long-tail SEO post for the lead persona | First listing impression captured · post not removed for self-promotion | | 3 | Gumroad listing live · share value-first in 3 niche communities (no pitch) · 1 long-tail SEO post for the lead persona | First listing impression captured · post not removed for self-promotion |
| 4 | Pipeline Runner v1.0 shipped (this week, 2026-05-01 — exception per §2.1) · v1.1 patch announced with Tool 09 + intl improvements | Pipeline saves/loads JSON · 3 demo pipelines preloaded | | 4 | Automated Workflows v1.0 shipped (this week, 2026-05-01 — exception per §2.1) · v1.1 patch announced with Tool 09 + intl improvements | Pipeline saves/loads JSON · 3 demo pipelines preloaded |
| 5-8 | Bookkeeper landing page · agency landing page · second tool's promo cycle · priority-support tier added (defer purchase until §2.7 trigger) | Three live landing pages with distinct H1, demo dataset, conversion target | | 5-8 | Bookkeeper landing page · agency landing page · second tool's promo cycle · priority-support tier added (defer purchase until §2.7 trigger) | Three live landing pages with distinct H1, demo dataset, conversion target |
| 9-13 | Tool 06-08 only **if** revenue trajectory supports continued investment · otherwise more market work on the existing 5 + 09 | Decision made on 13 Aug 2026 with revenue data, not feature ambition | | 9-13 | Tool 06-08 only **if** revenue trajectory supports continued investment · otherwise more market work on the existing 5 + 09 | Decision made on 13 Aug 2026 with revenue data, not feature ambition |
@@ -202,7 +202,7 @@ These flip the plan, not the underlying criteria:
## 5. Anti-temptations (things the plan refuses) ## 5. Anti-temptations (things the plan refuses)
- **More tools before more buyers.** Locked. Exception only for Pipeline Runner per §2.1. - **More tools before more buyers.** Locked. Exception only for Automated Workflows per §2.1.
- **SaaS pivot.** Recurring infra conflicts with the lifestyle constraint (DECISIONS.md §4). - **SaaS pivot.** Recurring infra conflicts with the lifestyle constraint (DECISIONS.md §4).
- **Live chat / sales calls.** Conflicts with no-touch (DECISIONS.md §1 #8). - **Live chat / sales calls.** Conflicts with no-touch (DECISIONS.md §1 #8).
- **Custom integrations / one-off consulting.** $300/hr looks tempting; breaks the "build once, sell many" model that justifies the entire strategy. - **Custom integrations / one-off consulting.** $300/hr looks tempting; breaks the "build once, sell many" model that justifies the entire strategy.


@@ -144,7 +144,7 @@ Reading PLAN.md §3 + this doc together, the rough script:
| **M1** (June) | Installers · demo · 3 landing pages · Gumroad live | Whether the funnel mechanically works. Numbers will be noisy; just look for one purchase. | | **M1** (June) | Installers · demo · 3 landing pages · Gumroad live | Whether the funnel mechanically works. Numbers will be noisy; just look for one purchase. |
| **M2** (July) | M1 + community posts in 3 niches + 1 SEO post | Which persona converts. Re-allocate effort to the highest-converting niche. | | **M2** (July) | M1 + community posts in 3 niches + 1 SEO post | Which persona converts. Re-allocate effort to the highest-converting niche. |
| **M3** (August) | M2 + landing-page changes from M2 review | Whether intent-rate moved on the change. Decide tools 06–08 go/no-go. | | **M3** (August) | M2 + landing-page changes from M2 review | Whether intent-rate moved on the change. Decide tools 06–08 go/no-go. |
| **M4** (September) | M3 + first repeat-buyer signals | Whether the Pipeline Runner is producing retention as designed. | | **M4** (September) | M3 + first repeat-buyer signals | Whether Automated Workflows is producing retention as designed. |
By end of M4, the data tells you whether the plan is producing By end of M4, the data tells you whether the plan is producing
$1k–3k/mo (BUSINESS.md §6 6-month target) — extrapolated from the $1k–3k/mo (BUSINESS.md §6 6-month target) — extrapolated from the


@@ -21,8 +21,8 @@ project-root/
│ └── CLI-REFERENCE.md │ └── CLI-REFERENCE.md
├── src/ ├── src/
│ ├── core/ # shared logic — both CLI + GUI call into this │ ├── core/ # shared logic — both CLI + GUI call into this
│ ├── cli.py # Deduplicator CLI │ ├── cli.py # Find Duplicates CLI
│ ├── cli_text_clean.py # Text Cleaner CLI │ ├── cli_text_clean.py # Clean Text CLI
│ ├── cli_analyze.py # Analyzer CLI │ ├── cli_analyze.py # Analyzer CLI
│ └── gui/ │ └── gui/
│ ├── app.py # Streamlit entry │ ├── app.py # Streamlit entry


@@ -76,7 +76,7 @@ Sample size: 1,000 rows (configurable).
- Full-DataFrame `auto_fix`: ~5 min (~30 µs/cell). - Full-DataFrame `auto_fix`: ~5 min (~30 µs/cell).
- Output write: ~10 s. - Output write: ~10 s.
- Recommended RAM: 3–4× input size for the full-Apply path. - Recommended RAM: 3–4× input size for the full-Apply path.
- **Format standardizer** (`standardize_dataframe`): ~2.7M rows/sec on - **Standardize Formats** (`standardize_dataframe`): ~2.7M rows/sec on
cache-warm repetition-heavy columns (synthetic 1M-row in-memory cache-warm repetition-heavy columns (synthetic 1M-row in-memory
benchmark, 2 typed columns); the fused single-pass loop replaced a benchmark, 2 typed columns); the fused single-pass loop replaced a
3-pass ``.tolist()`` cycle, so per-call overhead is now dominated by 3-pass ``.tolist()`` cycle, so per-call overhead is now dominated by
@@ -87,20 +87,20 @@ Sample size: 1,000 rows (configurable).
thread-pool scaffolding; on CPython 3.12 with the GIL it's thread-pool scaffolding; on CPython 3.12 with the GIL it's
roughly neutral, but the API is ready for the free-threaded roughly neutral, but the API is ready for the free-threaded
(PEP 703) Python 3.13+ build where it will help. (PEP 703) Python 3.13+ build where it will help.
- **Text cleaner** (`clean_dataframe`): ~1M rows/sec on - **Clean Text** (`clean_dataframe`): ~1M rows/sec on
repetition-heavy columns (per-call string cache: the pipeline runs repetition-heavy columns (per-call string cache: the pipeline runs
once per *unique* cell value, not once per row). once per *unique* cell value, not once per row).
- **Missing handler** (`handle_missing`): lazy-copy — when sentinel - **Fix Missing Values** (`handle_missing`): lazy-copy — when sentinel
standardization runs but finds nothing, AND no drops AND no fills standardization runs but finds nothing, AND no drops AND no fills
apply, the input frame is returned as-is. On a clean 1 GB file this apply, the input frame is returned as-is. On a clean 1 GB file this
saves the 1 GB allocation that the unconditional upfront copy used saves the 1 GB allocation that the unconditional upfront copy used
to take. to take.
- **Column mapper** (`map_columns`): rename + drop both already - **Map Columns** (`map_columns`): rename + drop both already
return fresh frames; the explicit upfront `df.copy()` is now return fresh frames; the explicit upfront `df.copy()` is now
removed and downstream mutating steps (schema-add, coerce) copy on removed and downstream mutating steps (schema-add, coerce) copy on
demand via `_ensure_owned()`. Rename-only and identity-mapping demand via `_ensure_owned()`. Rename-only and identity-mapping
paths run with zero explicit copies. paths run with zero explicit copies.
- **Deduplicator**: - **Find Duplicates**:
- **Exact-only strategies** (every column uses `Algorithm.EXACT` at - **Exact-only strategies** (every column uses `Algorithm.EXACT` at
threshold 100 — covers strong-key dedup like email/phone, the threshold 100 — covers strong-key dedup like email/phone, the
fallback drop-duplicates path, and explicit "match on this exact fallback drop-duplicates path, and explicit "match on this exact
@@ -117,19 +117,19 @@ Sample size: 1,000 rows (configurable).
(the common dedup workload) skip re-parsing. (the common dedup workload) skip re-parsing.
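
The per-call string cache described for Clean Text above (the pipeline runs once per unique cell value, not once per row) can be illustrated with a minimal sketch. This is not the project's `clean_dataframe` code; `clean_column` and `clean_one` are hypothetical names, and a plain list stands in for a DataFrame column.

```python
from typing import Callable


def clean_column(values: list[str], clean_one: Callable[[str], str]) -> list[str]:
    """Apply clean_one to each value, computing it once per unique value.

    Hypothetical helper: the cache lives only for the duration of this call,
    so repetition-heavy columns pay for each distinct string a single time.
    """
    cache: dict[str, str] = {}
    cleaned: list[str] = []
    for value in values:
        if value not in cache:
            # First time this exact string is seen: run the (costly) cleaner.
            cache[value] = clean_one(value)
        cleaned.append(cache[value])
    return cleaned
```

On a column where most rows repeat a small set of strings, the cleaner runs only as many times as there are distinct strings, which is the effect the throughput figure above relies on.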
## 11. Tools ## 11. Tools
1. Deduplicator — Ready 1. Find Duplicates — Ready
2. Text Cleaner — Ready 2. Clean Text — Ready
3. Format Standardizer — Ready 3. Standardize Formats — Ready
4. Missing Value Handler — Ready 4. Fix Missing Values — Ready
5. Column Mapper — Ready 5. Map Columns — Ready
6. Outlier Detector — Coming Soon 6. Find Unusual Values — Coming Soon
7. Multi-File Merger — Coming Soon 7. Combine Files — Coming Soon
8. Validator & Reporter — Coming Soon 8. Quality Check — Coming Soon
9. Pipeline Runner — Ready 9. Automated Workflows — Ready
### 11.a Recommended pipeline order (soft, not enforced) ### 11.a Recommended pipeline order (soft, not enforced)
The Pipeline Runner ships with a `SOFT_DEPENDENCIES` table; the Automated Workflows ships with a `SOFT_DEPENDENCIES` table; the
following ordering is the default and the basis of the warning following ordering is the default and the basis of the warning
surface. Re-ordering is allowed; the runner emits a warning string surface. Re-ordering is allowed; the runner emits a warning string
and proceeds. and proceeds.
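
A rough sketch of the warn-and-proceed behaviour described above. The real shape of `SOFT_DEPENDENCIES` is not shown in this diff, so the mapping below (tool id to the tool ids that should ideally run earlier) and the `order_warnings` helper are assumptions made for illustration; only the tool id strings come from the project.

```python
# Assumed shape: tool_id -> tool_ids that should ideally run before it.
SOFT_DEPENDENCIES: dict[str, tuple[str, ...]] = {
    "03_format_standardizer": ("02_text_cleaner",),
    "04_missing_handler": ("03_format_standardizer",),
    "05_column_mapper": ("04_missing_handler",),
}


def order_warnings(steps: list[str]) -> list[str]:
    """Return one warning string per soft dependency scheduled too late.

    Hypothetical helper: it never raises and never reorders, so the caller
    can print the warnings and run the pipeline exactly as the user wrote it.
    """
    position = {tool_id: index for index, tool_id in enumerate(steps)}
    warnings: list[str] = []
    for tool_id, index in position.items():
        for dependency in SOFT_DEPENDENCIES.get(tool_id, ()):
            # Warn only when the dependency is present but runs afterwards.
            if dependency in position and position[dependency] > index:
                warnings.append(
                    f"{tool_id} runs before {dependency}; "
                    "the recommended order is the reverse"
                )
    return warnings
```

A dependency that is absent from the pipeline produces no warning in this sketch, which keeps partial pipelines quiet.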
@@ -214,7 +214,7 @@ and proceeds.
fresh blob without losing the embedded buyer identity. Tier may fresh blob without losing the embedded buyer identity. Tier may
change during renewal (Lite → Core upgrade path). change during renewal (Lite → Core upgrade path).
- **Tiers**: - **Tiers**:
- ``lite`` — Deduplicator + Text Cleaner + Format Standardizer. - ``lite`` — Find Duplicates + Clean Text + Standardize Formats.
Buyer pays once, gets the three universally-useful tools. Buyer pays once, gets the three universally-useful tools.
- ``core`` — every Ready tool (all 9 in v1.6). - ``core`` — every Ready tool (all 9 in v1.6).
- ``pro``, ``enterprise`` — scaffolded for future SKUs; currently - ``pro``, ``enterprise`` — scaffolded for future SKUs; currently


@@ -34,8 +34,8 @@ src/
normalizers.py # Per-column normalizers for dedup matching normalizers.py # Per-column normalizers for dedup matching
text_clean.py # clean_dataframe + smart_title_case text_clean.py # clean_dataframe + smart_title_case
_constants.py # Shared USPS abbrevs + state names _constants.py # Shared USPS abbrevs + state names
cli.py # Deduplicator CLI (Typer) cli.py # Find Duplicates CLI (Typer)
cli_text_clean.py # Text Cleaner CLI cli_text_clean.py # Clean Text CLI
cli_analyze.py # Analyzer CLI (--json) cli_analyze.py # Analyzer CLI (--json)
gui/ gui/
app.py # Streamlit entry point app.py # Streamlit entry point
@@ -192,7 +192,7 @@ GUI / CLI handlers use `format_for_user()` so the user always sees: file path, o
| Bundle | Status | | Bundle | Status |
|--------|--------| |--------|--------|
| Data Cleaning Mastery | 3/9 tools Ready (Dedup, Text Cleaner, Format Standardizer); 6 stubs | | Data Cleaning Mastery | 3/9 tools Ready (Find Duplicates, Clean Text, Standardize Formats); 6 stubs |
| Automated Business Reporting | Not started | | Automated Business Reporting | Not started |
| Ecommerce Data Pipeline | Not started | | Ecommerce Data Pipeline | Not started |
| Small Business Finance | Not started | | Small Business Finance | Not started |
@@ -214,12 +214,12 @@ Deliberately separate. Confluent original spec was wrong.
| Script | Owns | | Script | Owns |
|--------|------| |--------|------|
| 04 Missing Value Handler | "What's not there." Disguised nulls (`N/A`, `-`, sentinel codes), missingness patterns, imputation, drop-by-threshold. | | 04 Fix Missing Values | "What's not there." Disguised nulls (`N/A`, `-`, sentinel codes), missingness patterns, imputation, drop-by-threshold. |
| 06 Outlier Detector | "What shouldn't be there." z-score / IQR / modified-z, multivariate (Isolation Forest, Mahalanobis), domain rules, winsorization. | | 06 Find Unusual Values | "What shouldn't be there." z-score / IQR / modified-z, multivariate (Isolation Forest, Mahalanobis), domain rules, winsorization. |
**Run order**: 04 before 06. Outlier stats on data with `NaN` / sentinels are mathematically poisoned (means dragged, IQR widens, false negatives). **Run order**: 04 before 06. Outlier stats on data with `NaN` / sentinels are mathematically poisoned (means dragged, IQR widens, false negatives).
**Pipeline order** (Pipeline Runner enforces): 02 → 03 → 04 → 05 → 06 → 07 → 08. 01 is order-flexible. **Pipeline order** (Automated Workflows enforces): 02 → 03 → 04 → 05 → 06 → 07 → 08. 01 is order-flexible.
**Contested cases**: **Contested cases**:
- Whitespace-only cell — 02 trims to empty; 04 then flags empty as null. - Whitespace-only cell — 02 trims to empty; 04 then flags empty as null.


@@ -14,7 +14,7 @@ Introduce tu nombre completo y correo, pega el código de licencia del correo de
| Nivel | Herramientas | | Nivel | Herramientas |
|---|---| |---|---|
| **Lite** | Eliminador de duplicados · Limpiador de texto · Estandarizador de formatos | | **Lite** | Buscar duplicados · Limpiar texto · Estandarizar formatos |
| **Core** | Las 9 herramientas | | **Core** | Las 9 herramientas |
Un usuario Lite que abra una herramienta exclusiva de Core verá un mensaje "Actualiza tu licencia". La página de inicio también muestra una marca 🔒 Bloqueado en las tarjetas de las herramientas que tu nivel no incluye. Para actualizar, pega un código Core en la página Activar. Un usuario Lite que abra una herramienta exclusiva de Core verá un mensaje "Actualiza tu licencia". La página de inicio también muestra una marca 🔒 Bloqueado en las tarjetas de las herramientas que tu nivel no incluye. Para actualizar, pega un código Core en la página Activar.
@@ -53,15 +53,15 @@ Matriz de soporte completa: [REQUIREMENTS.md](REQUIREMENTS.md) (solo en inglés)
| # | Herramienta | Propósito | Estado | | # | Herramienta | Propósito | Estado |
|---|------|---------|--------| |---|------|---------|--------|
| 01 | Eliminador de duplicados | Coincidencia exacta + difusa, 5 normalizadores, auditoría | Listo | | 01 | Buscar duplicados | Coincidencia exacta + difusa, 5 normalizadores, auditoría | Listo |
| 02 | Limpiador de texto | Espacios, caracteres tipográficos, BOM, finales de línea, mayúsculas/minúsculas | Listo | | 02 | Limpiar texto | Espacios, caracteres tipográficos, BOM, finales de línea, mayúsculas/minúsculas | Listo |
| 03 | Estandarizador de formatos | Fechas / teléfonos / correos / direcciones / nombres / monedas / booleanos | Listo | | 03 | Estandarizar formatos | Fechas / teléfonos / correos / direcciones / nombres / monedas / booleanos | Listo |
| 04 | Gestor de valores faltantes | Nulos disfrazados, imputación, descarte por umbral | Próximamente | | 04 | Corregir valores faltantes | Nulos disfrazados, imputación, descarte por umbral | Próximamente |
| 05 | Mapeador de columnas | Renombrar + aplicar esquema | Próximamente | | 05 | Mapear columnas | Renombrar + aplicar esquema | Próximamente |
| 06 | Detector de valores atípicos | z-score, IQR, multivariante | Próximamente | | 06 | Detectar valores atípicos | z-score, IQR, multivariante | Próximamente |
| 07 | Combinador de varios archivos | Combina varios archivos | Próximamente | | 07 | Combinar archivos | Combina varios archivos | Próximamente |
| 08 | Validador e informes | Reglas + informe PDF/Excel | Próximamente | | 08 | Verificación de calidad | Reglas + informe PDF/Excel | Próximamente |
| 09 | Ejecutor de canalizaciones | Lanzador multi-herramienta de un clic | Próximamente | | 09 | Flujos automatizados | Lanzador multi-herramienta de un clic | Próximamente |
**Datos de muestra** (`samples/`): `messy_sales.csv`, `bank_export.xlsx`. **Datos de muestra** (`samples/`): `messy_sales.csv`, `bank_export.xlsx`.
@@ -89,17 +89,17 @@ Ayuda: `deduplicator --help`. Referencia completa: [CLI-REFERENCE.es.md](CLI-REF
### 3.3 Orden de ejecución (cuando uses las herramientas manualmente) ### 3.3 Orden de ejecución (cuando uses las herramientas manualmente)
Si no usas el Ejecutor de canalizaciones, sigue este orden: Si no usas Flujos automatizados, sigue este orden:
1. **02 Limpiador de texto** primero — normaliza espacios y caracteres especiales. 1. **02 Limpiar texto** primero — normaliza espacios y caracteres especiales.
2. **03 Estandarizador de formatos** — fechas, teléfonos, etc. necesitan texto limpio. 2. **03 Estandarizar formatos** — fechas, teléfonos, etc. necesitan texto limpio.
3. **04 Gestor de valores faltantes** — códigos centinela se ocultan como números. 3. **04 Corregir valores faltantes** — códigos centinela se ocultan como números.
4. **05 Mapeador de columnas** — esquema antes que estadísticas de atípicos. 4. **05 Mapear columnas** — esquema antes que estadísticas de atípicos.
5. **06 Detector de valores atípicos** — necesita datos numéricos limpios. Calcular estadísticas con `NaN` o `-999` envenena los resultados. 5. **06 Detectar valores atípicos** — necesita datos numéricos limpios. Calcular estadísticas con `NaN` o `-999` envenena los resultados.
6. **07 Combinador de varios archivos**, **08 Validador** según sea necesario. 6. **07 Combinar archivos**, **08 Verificación de calidad** según sea necesario.
7. **01 Eliminador de duplicados** es flexible en cuanto al orden (normaliza internamente para la coincidencia). 7. **01 Buscar duplicados** es flexible en cuanto al orden (normaliza internamente para la coincidencia).
El Ejecutor de canalizaciones aplica este orden automáticamente. Flujos automatizados aplica este orden automáticamente.
### 3.4 Idioma ### 3.4 Idioma


@@ -14,7 +14,7 @@ Enter your full name + email, paste the license blob from your purchase email (s
| Tier | Tools | | Tier | Tools |
|---|---| |---|---|
| **Lite** | Deduplicator · Text Cleaner · Format Standardizer | | **Lite** | Find Duplicates · Clean Text · Standardize Formats |
| **Core** | All 9 tools | | **Core** | All 9 tools |
A Lite user opening a Core-only tool sees an "Upgrade your license" prompt. The home page also shows a 🔒 Locked badge on tool cards your tier doesn't unlock. To upgrade, paste a Core blob on the Activate page. A Lite user opening a Core-only tool sees an "Upgrade your license" prompt. The home page also shows a 🔒 Locked badge on tool cards your tier doesn't unlock. To upgrade, paste a Core blob on the Activate page.
@@ -53,15 +53,15 @@ Full numbered support matrix: [REQUIREMENTS.md](REQUIREMENTS.md).
| # | Tool | Purpose | Status | | # | Tool | Purpose | Status |
|---|------|---------|--------| |---|------|---------|--------|
| 01 | Deduplicator | Exact + fuzzy match, 5 normalizers, audit | Ready | | 01 | Find Duplicates | Exact + fuzzy match, 5 normalizers, audit | Ready |
| 02 | Text Cleaner | Whitespace, smart chars, BOM, line endings, case ops | Ready | | 02 | Clean Text | Whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | Format Standardizer | Dates / phones / emails / addresses / names / currencies / booleans | Ready | | 03 | Standardize Formats | Dates / phones / emails / addresses / names / currencies / booleans | Ready |
| 04 | Missing Value Handler | Disguised nulls, imputation, drop-by-threshold | Coming Soon | | 04 | Fix Missing Values | Disguised nulls, imputation, drop-by-threshold | Coming Soon |
| 05 | Column Mapper | Rename + enforce schema | Coming Soon | | 05 | Map Columns | Rename + enforce schema | Coming Soon |
| 06 | Outlier Detector | z-score, IQR, multivariate | Coming Soon | | 06 | Find Unusual Values | z-score, IQR, multivariate | Coming Soon |
| 07 | Multi-File Merger | Combine multiple files | Coming Soon | | 07 | Combine Files | Combine multiple files | Coming Soon |
| 08 | Validator & Reporter | Rules + PDF/Excel report | Coming Soon | | 08 | Quality Check | Rules + PDF/Excel report | Coming Soon |
| 09 | Pipeline Runner | One-click multi-tool launcher | Coming Soon | | 09 | Automated Workflows | One-click multi-tool launcher | Coming Soon |
**Sample data** (`samples/`): `messy_sales.csv`, `bank_export.xlsx`. **Sample data** (`samples/`): `messy_sales.csv`, `bank_export.xlsx`.
@@ -89,17 +89,17 @@ Get help: `deduplicator --help`. Full reference: [CLI-REFERENCE.md](CLI-REFERENC
### 3.3 Run order (when running tools manually) ### 3.3 Run order (when running tools manually)
If you skip the Pipeline Runner, follow this order: If you skip Automated Workflows, follow this order:
1. **02 Text Cleaner** first — normalizes whitespace + special chars. 1. **02 Clean Text** first — normalizes whitespace + special chars.
2. **03 Format Standardizer** — dates, phones, etc. need cleaned text. 2. **03 Standardize Formats** — dates, phones, etc. need cleaned text.
3. **04 Missing Value Handler** — sentinel codes hide as numbers. 3. **04 Fix Missing Values** — sentinel codes hide as numbers.
4. **05 Column Mapper** — schema before outlier stats. 4. **05 Map Columns** — schema before outlier stats.
5. **06 Outlier Detector** — needs clean numerics. Stats on data with `NaN` or `-999` are mathematically poisoned. 5. **06 Find Unusual Values** — needs clean numerics. Stats on data with `NaN` or `-999` are mathematically poisoned.
6. **07 Multi-File Merger**, **08 Validator** as needed. 6. **07 Combine Files**, **08 Quality Check** as needed.
7. **01 Deduplicator** is order-flexible (normalizes internally for matching). 7. **01 Find Duplicates** is order-flexible (normalizes internally for matching).
The Pipeline Runner enforces this automatically. Automated Workflows enforces this automatically.
### 3.4 Language ### 3.4 Language


@@ -251,12 +251,12 @@ row,column,field_type,old,new
<div class="eyebrow">In the bundle</div> <div class="eyebrow">In the bundle</div>
<h2>Six tools. One pipeline. One $49 download.</h2> <h2>Six tools. One pipeline. One $49 download.</h2>
<div class="grid"> <div class="grid">
<div class="card"><h3>1 · Deduplicator</h3><p>Fuzzy match (Jaro-Winkler), explicit strategies for Date+Amount+Vendor, survivor rules.</p></div> <div class="card"><h3>1 · Find Duplicates</h3><p>Fuzzy match (Jaro-Winkler), explicit strategies for Date+Amount+Vendor, survivor rules.</p></div>
<div class="card"><h3>2 · Text Cleaner</h3><p>Header whitespace, smart quotes from copy-paste, em-dash sentinels.</p></div> <div class="card"><h3>2 · Clean Text</h3><p>Header whitespace, smart quotes from copy-paste, em-dash sentinels.</p></div>
<div class="card"><h3>3 · Format Standardizer</h3><p>ISO dates, numeric amounts (parens-negative), vendor casing, multi-currency.</p></div> <div class="card"><h3>3 · Standardize Formats</h3><p>ISO dates, numeric amounts (parens-negative), vendor casing, multi-currency.</p></div>
<div class="card"><h3>4 · Missing Value Handler</h3><p>Disguised-null detection: <code></code>, <code>N/A</code>, <code>(blank)</code>, <code>?</code>.</p></div> <div class="card"><h3>4 · Fix Missing Values</h3><p>Disguised-null detection: <code></code>, <code>N/A</code>, <code>(blank)</code>, <code>?</code>.</p></div>
<div class="card"><h3>5 · Column Mapper</h3><p>Project to your accounting tool's required schema, coerce types, drop extras.</p></div> <div class="card"><h3>5 · Map Columns</h3><p>Project to your accounting tool's required schema, coerce types, drop extras.</p></div>
<div class="card"><h3>6 · Pipeline Runner</h3><p>Save the cleanup. Run it on next month's export with one command. Same audit, automated.</p></div> <div class="card"><h3>6 · Automated Workflows</h3><p>Save the cleanup. Run it on next month's export with one command. Same audit, automated.</p></div>
</div> </div>
</div> </div>
</section> </section>


@@ -168,9 +168,9 @@
<h2>One engine. Same six tools. Same $49.</h2> <h2>One engine. Same six tools. Same $49.</h2>
<p> <p>
The persona pages above are positioning, not different products. The persona pages above are positioning, not different products.
Whichever you buy, you get the full bundle: Deduplicator, Text Whichever you buy, you get the full bundle: Find Duplicates, Clean
Cleaner, Format Standardizer, Missing-Value Handler, Column Text, Standardize Formats, Fix Missing Values, Map Columns,
Mapper, and Pipeline Runner — pre-tuned with a saved pipeline and Automated Workflows — pre-tuned with a saved pipeline
that matches your workflow. that matches your workflow.
</p> </p>
<div class="grid"> <div class="grid">


@@ -165,7 +165,7 @@
<div class="card"> <div class="card">
<span class="icon">🌍</span> <span class="icon">🌍</span>
<h3>Multi-platform audience reconciliation</h3> <h3>Multi-platform audience reconciliation</h3>
<p>Build one canonical audience from Meta, Google Ads, LinkedIn, and your CRM. Each platform exports a different shape; column-mapper aligns them all, dedup merges the survivors with their most-complete fields.</p> <p>Build one canonical audience from Meta, Google Ads, LinkedIn, and your CRM. Each platform exports a different shape; Map Columns aligns them all, dedup merges the survivors with their most-complete fields.</p>
</div> </div>
<div class="card"> <div class="card">
<span class="icon">🛡️</span> <span class="icon">🛡️</span>
@@ -192,7 +192,7 @@
<li><strong>Per-row country column</strong> drives the parser — no global default that bucks UK numbers as malformed US.</li> <li><strong>Per-row country column</strong> drives the parser — no global default that bucks UK numbers as malformed US.</li>
<li><strong>Country-name normalization</strong>: <code>USA</code> / <code>US</code> / <code>United States</code> all resolve to the same ISO-2 code.</li> <li><strong>Country-name normalization</strong>: <code>USA</code> / <code>US</code> / <code>United States</code> all resolve to the same ISO-2 code.</li>
<li><strong>50+ country support</strong> via Google's libphonenumber, including KR, CN, IN, MX, BR, IL, TR, PL, DK, SE.</li> <li><strong>50+ country support</strong> via Google's libphonenumber, including KR, CN, IN, MX, BR, IL, TR, PL, DK, SE.</li>
<li><strong>Schema enforcement</strong> via the column-mapper: project to your CRM's required shape, coerce score columns to integers, reorder fields to match the import contract.</li> <li><strong>Schema enforcement</strong> via Map Columns: project to your CRM's required shape, coerce score columns to integers, reorder fields to match the import contract.</li>
</ul> </ul>
</div> </div>
</section> </section>
@@ -249,12 +249,12 @@ Total elapsed: 6.7 s
<div class="eyebrow">In the bundle</div> <div class="eyebrow">In the bundle</div>
<h2>Six tools. One pipeline. One $49 download.</h2> <h2>Six tools. One pipeline. One $49 download.</h2>
<div class="grid"> <div class="grid">
<div class="card"><h3>1 · Deduplicator</h3><p>Fuzzy match across email + phone + name + company; merge survivors with most-complete fields.</p></div> <div class="card"><h3>1 · Find Duplicates</h3><p>Fuzzy match across email + phone + name + company; merge survivors with most-complete fields.</p></div>
<div class="card"><h3>2 · Text Cleaner</h3><p>Smart quotes from copy-paste, NBSP from spreadsheet exports, BOM from Excel.</p></div> <div class="card"><h3>2 · Clean Text</h3><p>Smart quotes from copy-paste, NBSP from spreadsheet exports, BOM from Excel.</p></div>
<div class="card"><h3>3 · Format Standardizer</h3><p>E.164 phones with per-row country, canonical emails, name casing, ISO dates.</p></div> <div class="card"><h3>3 · Standardize Formats</h3><p>E.164 phones with per-row country, canonical emails, name casing, ISO dates.</p></div>
<div class="card"><h3>4 · Missing Value Handler</h3><p>Detect <code>TBD</code>, <code>(unknown)</code>, <code></code> across vendor exports.</p></div> <div class="card"><h3>4 · Fix Missing Values</h3><p>Detect <code>TBD</code>, <code>(unknown)</code>, <code></code> across vendor exports.</p></div>
<div class="card"><h3>5 · Column Mapper</h3><p>Project to your CRM's required schema, coerce score to integer, reorder for import.</p></div> <div class="card"><h3>5 · Map Columns</h3><p>Project to your CRM's required schema, coerce score to integer, reorder for import.</p></div>
<div class="card"><h3>6 · Pipeline Runner</h3><p>Save the cleanup as JSON. Drop next campaign's combined export on it. Same dedup, automated.</p></div> <div class="card"><h3>6 · Automated Workflows</h3><p>Save the cleanup as JSON. Drop next campaign's combined export on it. Same dedup, automated.</p></div>
</div> </div>
</div> </div>
</section> </section>


@@ -178,7 +178,7 @@
<div class="card"> <div class="card">
<span class="icon">🔗</span> <span class="icon">🔗</span>
<h3>Multi-channel order consolidation</h3> <h3>Multi-channel order consolidation</h3>
<p>Orders from Shopify + Etsy + a wholesale spreadsheet, each with a different column for "customer email." Column-mapper aligns them; dedup merges across channels.</p> <p>Orders from Shopify + Etsy + a wholesale spreadsheet, each with a different column for "customer email." Map Columns aligns them; dedup merges across channels.</p>
</div> </div>
<div class="card"> <div class="card">
<span class="icon">⚙️</span> <span class="icon">⚙️</span>
@@ -270,12 +270,12 @@ Total elapsed: 4.2 s
<div class="eyebrow">In the bundle</div> <div class="eyebrow">In the bundle</div>
<h2>Six tools. One pipeline. One $49 download.</h2> <h2>Six tools. One pipeline. One $49 download.</h2>
<div class="grid"> <div class="grid">
<div class="card"><h3>1 · Deduplicator</h3><p>Fuzzy match (Jaro-Winkler), 5 normalizers, survivor rules, interactive review.</p></div> <div class="card"><h3>1 · Find Duplicates</h3><p>Fuzzy match (Jaro-Winkler), 5 normalizers, survivor rules, interactive review.</p></div>
<div class="card"><h3>2 · Text Cleaner</h3><p>Whitespace, smart chars, NBSP, BOM, line endings, case ops.</p></div> <div class="card"><h3>2 · Clean Text</h3><p>Whitespace, smart chars, NBSP, BOM, line endings, case ops.</p></div>
<div class="card"><h3>3 · Format Standardizer</h3><p>Dates, phones, emails, addresses, names, currencies, booleans.</p></div> <div class="card"><h3>3 · Standardize Formats</h3><p>Dates, phones, emails, addresses, names, currencies, booleans.</p></div>
<div class="card"><h3>4 · Missing Value Handler</h3><p>Disguised-null detection, profile, mean/median/mode/ffill, drop strategies.</p></div> <div class="card"><h3>4 · Fix Missing Values</h3><p>Disguised-null detection, profile, mean/median/mode/ffill, drop strategies.</p></div>
<div class="card"><h3>5 · Column Mapper</h3><p>Fuzzy auto-rename, target schema, type coercion, required-field defaults.</p></div> <div class="card"><h3>5 · Map Columns</h3><p>Fuzzy auto-rename, target schema, type coercion, required-field defaults.</p></div>
<div class="card"><h3>6 · Pipeline Runner</h3><p>Chain tools in recommended order, save/load JSON, automate weekly cleanups.</p></div> <div class="card"><h3>6 · Automated Workflows</h3><p>Chain tools in recommended order, save/load JSON, automate weekly cleanups.</p></div>
</div> </div>
</div> </div>
</section> </section>


@@ -45,15 +45,15 @@ app = typer.Typer(
# Tool id -> friendly display name. Kept in the CLI module since the GUI has # Tool id -> friendly display name. Kept in the CLI module since the GUI has
# its own version; both stay in lockstep with the actual script lineup. # its own version; both stay in lockstep with the actual script lineup.
_TOOL_DISPLAY = { _TOOL_DISPLAY = {
"01_deduplicator": "Deduplicator", "01_deduplicator": "Find Duplicates",
"02_text_cleaner": "Text Cleaner", "02_text_cleaner": "Clean Text",
"03_format_standardizer": "Format Standardizer", "03_format_standardizer": "Standardize Formats",
"04_missing_handler": "Missing Value Handler", "04_missing_handler": "Fix Missing Values",
"05_column_mapper": "Column Mapper", "05_column_mapper": "Map Columns",
"06_outlier_detector": "Outlier Detector", "06_outlier_detector": "Find Unusual Values",
"07_multi_file_merger": "Multi-File Merger", "07_multi_file_merger": "Combine Files",
"08_validator_reporter": "Validator & Reporter", "08_validator_reporter": "Quality Check",
"09_pipeline_runner": "Pipeline Runner", "09_pipeline_runner": "Automated Workflows",
} }
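
Because the CLI and the GUI each keep their own copy of the label table, a rename sweep like this one can leave the two out of step. One way to catch that is a small drift check, sketched below. `display_drift` is a hypothetical helper, and the two dicts are trimmed stand-ins rather than imports from the real modules.

```python
from typing import Optional

# Trimmed stand-ins for the CLI and GUI tables (assumption: same key scheme).
CLI_LABELS = {
    "01_deduplicator": "Find Duplicates",
    "02_text_cleaner": "Clean Text",
}
GUI_LABELS = {
    "01_deduplicator": "Find Duplicates",
    "02_text_cleaner": "Text Cleaner",  # stale label left behind on purpose
}


def display_drift(
    cli: dict[str, str], gui: dict[str, str]
) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Map each disagreeing tool id to its (cli_label, gui_label) pair.

    A label missing on one side is reported as None for that side.
    """
    drift: dict[str, tuple[Optional[str], Optional[str]]] = {}
    for tool_id in sorted(cli.keys() | gui.keys()):
        pair = (cli.get(tool_id), gui.get(tool_id))
        if pair[0] != pair[1]:
            drift[tool_id] = pair
    return drift
```

A test could assert that the returned dict is empty for the real tables, so a label changed in only one entrypoint fails fast.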


@@ -1,4 +1,4 @@
"""CLI for the DataTools Column Mapper (script 05). """CLI for the DataTools Map Columns tool (script 05).
Usage: Usage:
python -m src.cli_column_map input.csv # auto-mapping preview python -m src.cli_column_map input.csv # auto-mapping preview


@@ -1,4 +1,4 @@
"""CLI for the DataTools Format Standardizer (script 03). """CLI for the DataTools Standardize Formats tool (script 03).
Usage: Usage:
python -m src.cli_format input.csv \\ python -m src.cli_format input.csv \\


@@ -1,4 +1,4 @@
"""CLI for the DataTools Missing Value Handler (script 04). """CLI for the DataTools Fix Missing Values tool (script 04).
Usage: Usage:
python -m src.cli_missing input.csv # profile only python -m src.cli_missing input.csv # profile only


@@ -1,4 +1,4 @@
"""CLI for the DataTools Pipeline Runner (script 09). """CLI for the DataTools Automated Workflows tool (script 09).
Usage: Usage:
# Run the recommended default pipeline (text → format → missing → dedup): # Run the recommended default pipeline (text → format → missing → dedup):


@@ -1,4 +1,4 @@
"""DataTools Column Mapper. """DataTools Map Columns.
Rename columns, enforce a target schema, coerce types, drop / add / Rename columns, enforce a target schema, coerce types, drop / add /
reorder columns. Designed for the three buyer profiles the toolkit reorder columns. Designed for the three buyer profiles the toolkit


@@ -1,4 +1,4 @@
"""DataTools Missing Value Handler. """DataTools Fix Missing Values.
Detects disguised nulls, profiles missingness per column, and applies Detects disguised nulls, profiles missingness per column, and applies
imputation or drop strategies with a full audit trail. imputation or drop strategies with a full audit trail.

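Review note: the docstring above mentions detecting disguised nulls and profiling missingness per column. A minimal stdlib-only sketch of that idea, under stated assumptions: the token list `DISGUISED_NULLS` and both function names are hypothetical, and the real tool's token list and return types may differ.

```python
# Hypothetical token list; the real tool may recognise a different set.
DISGUISED_NULLS = frozenset({"", "na", "n/a", "null", "none", "-", "?"})


def is_disguised_null(value: object) -> bool:
    """True for None and for placeholder strings that stand in for 'no value'."""
    return value is None or str(value).strip().lower() in DISGUISED_NULLS


def missing_profile(rows: list[dict]) -> dict[str, float]:
    """Fraction of missing cells per column, in first-seen column order."""
    if not rows:
        return {}
    columns = list(dict.fromkeys(key for row in rows for key in row))
    return {
        col: sum(is_disguised_null(row.get(col)) for row in rows) / len(rows)
        for col in columns
    }
```

A key that is absent from a row counts as missing here, which is one possible convention rather than the documented one.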

@@ -1,4 +1,4 @@
"""DataTools Pipeline Runner. """DataTools Automated Workflows.
Chain the cleaning tools (text-clean, format-standardize, missing, Chain the cleaning tools (text-clean, format-standardize, missing,
column-map, dedup) into a single orchestrated workflow. The pipeline column-map, dedup) into a single orchestrated workflow. The pipeline

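Review note: the docstring above describes chaining the cleaning tools into one orchestrated workflow. A rough sketch of that shape, assuming each step is a plain function from rows to rows. `clean_text`, `dedup`, and `run_pipeline` are hypothetical illustrations and not the module's real entry points.

```python
from typing import Callable

Row = dict[str, str]
Step = Callable[[list[Row]], list[Row]]


def clean_text(rows: list[Row]) -> list[Row]:
    """Toy text-clean step: trim outer whitespace in every cell."""
    return [{key: value.strip() for key, value in row.items()} for row in rows]


def dedup(rows: list[Row]) -> list[Row]:
    """Toy dedup step: keep the first occurrence of each exact row."""
    seen: set[tuple] = set()
    kept: list[Row] = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def run_pipeline(
    rows: list[Row], steps: list[tuple[str, Step]]
) -> tuple[list[Row], list[str]]:
    """Apply each named step in order and record row counts as a simple audit log."""
    log: list[str] = []
    for name, step in steps:
        before = len(rows)
        rows = step(rows)
        log.append(f"{name}: {before} -> {len(rows)} rows")
    return rows, log
```

Step order matters in this sketch: running `clean_text` before `dedup` lets rows that differ only by padding collapse into one.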

@@ -1 +1 @@
"""Streamlit GUI for the DataTools Deduplicator.""" """Streamlit GUI for DataTools."""


@@ -16,7 +16,7 @@ they need without dragging the entire kitchen-sink module:
dedup_review.py ← dedup match-group cards + review pipeline dedup_review.py ← dedup match-group cards + review pipeline
shared.py ← chrome / file-pickup helpers used by every tool shared.py ← chrome / file-pickup helpers used by every tool
A standalone Deduplicator build, for example, can ship without A standalone Find Duplicates build, for example, can ship without
``findings.py`` and ``gate.py`` — those modules import the analyzer / ``findings.py`` and ``gate.py`` — those modules import the analyzer /
gate code that the Lite SKU does not include. gate code that the Lite SKU does not include.


@@ -847,15 +847,15 @@ def _build_match_groups_csv(
# Tool id -> friendly display name. Single source of truth for the GUI; the # Tool id -> friendly display name. Single source of truth for the GUI; the
# CLI keeps its own copy so each entrypoint stays self-contained. # CLI keeps its own copy so each entrypoint stays self-contained.
TOOL_DISPLAY_NAMES: dict[str, str] = { TOOL_DISPLAY_NAMES: dict[str, str] = {
"01_deduplicator": "Deduplicator", "01_deduplicator": "Find Duplicates",
"02_text_cleaner": "Text Cleaner", "02_text_cleaner": "Clean Text",
"03_format_standardizer": "Format Standardizer", "03_format_standardizer": "Standardize Formats",
"04_missing_handler": "Missing Value Handler", "04_missing_handler": "Fix Missing Values",
"05_column_mapper": "Column Mapper", "05_column_mapper": "Map Columns",
"06_outlier_detector": "Outlier Detector", "06_outlier_detector": "Find Unusual Values",
"07_multi_file_merger": "Multi-File Merger", "07_multi_file_merger": "Combine Files",
"08_validator_reporter": "Validator & Reporter", "08_validator_reporter": "Quality Check",
"09_pipeline_runner": "Pipeline Runner", "09_pipeline_runner": "Automated Workflows",
} }
_SEVERITY_ICON: dict[str, str] = { _SEVERITY_ICON: dict[str, str] = {
@@ -1016,7 +1016,7 @@ def render_hidden_aware_preview(
) -> None: ) -> None:
"""Render a DataFrame preview that shows hidden characters in every cell. """Render a DataFrame preview that shows hidden characters in every cell.
Used for the Text Cleaner's "before" and "after" previews so the user Used for the Clean Text tool's "before" and "after" previews so the user
can actually see the leading/trailing whitespace, NBSP padding, can actually see the leading/trailing whitespace, NBSP padding,
zero-width characters, and smart punctuation that the cleaner is going zero-width characters, and smart punctuation that the cleaner is going
to remove (or just removed). A plain ``st.dataframe`` collapses outer to remove (or just removed). A plain ``st.dataframe`` collapses outer

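Review note: the `render_hidden_aware_preview` docstring above talks about making leading/trailing whitespace, NBSP padding, and zero-width characters visible. A minimal sketch of one way to do that for a single cell follows. The marker glyphs and the helper name `reveal_hidden` are assumptions for illustration; this is not the function's actual implementation.

```python
# Hypothetical marker glyphs; the real preview may use different ones.
_INNER_MARKERS = str.maketrans({
    "\u00a0": "⍽",  # no-break space (NBSP)
    "\u200b": "∅",  # zero-width space
    "\t": "→",      # tab
})
_SPACE_MARKER = "·"


def reveal_hidden(cell: str) -> str:
    """Mark leading/trailing spaces and any NBSP, zero-width space, or tab."""
    if cell.strip(" ") == "":
        # Cell is empty or made only of spaces: mark every character.
        return _SPACE_MARKER * len(cell)
    lead = len(cell) - len(cell.lstrip(" "))
    trail = len(cell) - len(cell.rstrip(" "))
    core = cell[lead:len(cell) - trail]
    return (
        _SPACE_MARKER * lead
        + core.translate(_INNER_MARKERS)
        + _SPACE_MARKER * trail
    )
```

Inner ordinary spaces are left alone in this sketch so that normal multi-word values stay readable.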

@@ -1,4 +1,4 @@
"""DataTools Deduplicator — full working tool page.""" """DataTools Find Duplicates — full working tool page."""
from __future__ import annotations from __future__ import annotations


@@ -1,4 +1,4 @@
"""DataTools Text Cleaner — Streamlit page.""" """DataTools Clean Text — Streamlit page."""
from __future__ import annotations from __future__ import annotations


@@ -1,4 +1,4 @@
"""DataTools Format Standardizer — Streamlit page.""" """DataTools Standardize Formats — Streamlit page."""
from __future__ import annotations from __future__ import annotations


@@ -1,4 +1,4 @@
"""DataTools Missing Value Handler — Streamlit page.""" """DataTools Fix Missing Values — Streamlit page."""
from __future__ import annotations from __future__ import annotations


@@ -1,4 +1,4 @@
"""DataTools Column Mapper — Streamlit page.""" """DataTools Map Columns — Streamlit page."""
from __future__ import annotations from __future__ import annotations


@@ -1,4 +1,4 @@
"""DataTools Outlier Detector — stub page.""" """DataTools Find Unusual Values — stub page."""
from __future__ import annotations from __future__ import annotations


@@ -1,4 +1,4 @@
"""DataTools Multi-File Merger — stub page.""" """DataTools Combine Files — stub page."""
from __future__ import annotations from __future__ import annotations


@@ -1,4 +1,4 @@
"""DataTools Validator & Reporter — stub page.""" """DataTools Quality Check — stub page."""
from __future__ import annotations from __future__ import annotations


@@ -1,4 +1,4 @@
"""DataTools Pipeline Runner — Streamlit page.""" """DataTools Automated Workflows — Streamlit page."""
from __future__ import annotations from __future__ import annotations


@@ -1,4 +1,4 @@
# Column Mapper — corpus # Map Columns — corpus
Acceptance fixtures for `src/core/column_mapper.py`. Each `.csv` under Acceptance fixtures for `src/core/column_mapper.py`. Each `.csv` under
`test_data/` is paired with assertions in `test_data/` is paired with assertions in


@@ -1,4 +1,4 @@
# Missing Value Handler — corpus # Fix Missing Values — corpus
Acceptance fixtures for `src/core/missing.py`. Each `.csv` under Acceptance fixtures for `src/core/missing.py`. Each `.csv` under
`test_data/` is paired with assertions in `tests/test_missing_corpus.py`. `test_data/` is paired with assertions in `tests/test_missing_corpus.py`.


@@ -1,4 +1,4 @@
# Text Cleaner Test Corpus # Clean Text Test Corpus
Test fixtures for `02_text_cleaner.py` (Excel & CSV Data Cleaning Mastery Bundle). Test fixtures for `02_text_cleaner.py` (Excel & CSV Data Cleaning Mastery Bundle).


@@ -3,7 +3,7 @@
These exercise the chrome-level gate that ``hide_streamlit_chrome`` These exercise the chrome-level gate that ``hide_streamlit_chrome``
installs: when no valid license is on disk, every page renders the installs: when no valid license is on disk, every page renders the
activation form instead of the page body, and tool widgets do NOT activation form instead of the page body, and tool widgets do NOT
appear. We test against the Deduplicator page since it's the smallest appear. We test against the Find Duplicates page since it's the smallest
real-world tool that depends on chrome. real-world tool that depends on chrome.
The autouse fixture in ``tests/conftest.py`` sets The autouse fixture in ``tests/conftest.py`` sets


@@ -5,7 +5,7 @@ expander that houses every per-column / per-strategy knob. It's the
densest single widget surface in the GUI, so a session-state key drift densest single widget surface in the GUI, so a session-state key drift
in there cascades into every dedup session. in there cascades into every dedup session.
We exercise it via the Deduplicator page (rendering ``config_panel`` We exercise it via the Find Duplicates page (rendering ``config_panel``
in isolation requires a fake Streamlit context). The page provides in isolation requires a fake Streamlit context). The page provides
the surrounding state; we poke widgets and verify their effects. the surrounding state; we poke widgets and verify their effects.
""" """


@@ -2,7 +2,7 @@
``match_group_card`` from ``src.gui.components`` has two modes (decided ``match_group_card`` from ``src.gui.components`` has two modes (decided
/ undecided) and a Confirm/Undo flow keyed by session_state. We test / undecided) and a Confirm/Undo flow keyed by session_state. We test
each state by exercising the parent Deduplicator page end to end and each state by exercising the parent Find Duplicates page end to end and
then poking at ``review_decisions`` directly. then poking at ``review_decisions`` directly.
Why not unit-test ``match_group_card`` in isolation? AppTest needs a Why not unit-test ``match_group_card`` in isolation? AppTest needs a


@@ -21,7 +21,7 @@ from .conftest import collected_text, stash_upload
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
class TestMalformedUploadErrors: class TestMalformedUploadErrors:
"""Bytes that look like a CSV but aren't parseable. The Deduplicator """Bytes that look like a CSV but aren't parseable. The Find Duplicates
page wraps ``read_file`` failures in an ``st.error`` with the file page wraps ``read_file`` failures in an ``st.error`` with the file
name and the structured ``format_for_user`` output.""" name and the structured ``format_for_user`` output."""


@@ -11,7 +11,7 @@ exist, each pinned here:
3. **Upload + matching passed normalization** — gate is a no-op; the 3. **Upload + matching passed normalization** — gate is a no-op; the
page proceeds. page proceeds.
We exercise the gate via the Deduplicator page (any tool page would We exercise the gate via the Find Duplicates page (any tool page would
work; dedup is the smallest one that doesn't depend on heavy widgets). work; dedup is the smallest one that doesn't depend on heavy widgets).
""" """
@@ -27,7 +27,7 @@ from .conftest import (
) )
# Deduplicator is our canary — it calls ``require_normalization_gate`` # Find Duplicates is our canary — it calls ``require_normalization_gate``
# on the second line of the module. If the gate blocks, the dedup- # on the second line of the module. If the gate blocks, the dedup-
# specific title shouldn't even render. # specific title shouldn't even render.
GATED_PAGE = "1_Deduplicator" GATED_PAGE = "1_Deduplicator"


@@ -1,9 +1,9 @@
"""GUI tests for the Lite tier. """GUI tests for the Lite tier.
A Lite license unlocks Deduplicator, Text Cleaner, Format A Lite license unlocks Find Duplicates, Clean Text, Standardize
Standardizer. Opening any other tool page (Missing Values, Column Formats. Opening any other tool page (Fix Missing Values, Map
Mapper, Pipeline Runner, etc.) must render an upgrade prompt and Columns, Automated Workflows, etc.) must render an upgrade prompt
short-circuit the page body. and short-circuit the page body.
The home grid shows a 🔒 Locked badge on the cards for tools the The home grid shows a 🔒 Locked badge on the cards for tools the
user's tier doesn't unlock. user's tier doesn't unlock.
@@ -104,7 +104,7 @@ class TestLiteHomeGridBadges:
): ):
home_app.run() home_app.run()
text = collected_text(home_app) text = collected_text(home_app)
# Missing Value Handler is locked under Lite — its card should # Fix Missing Values is locked under Lite — its card should
# have a 🔒 Locked badge. # have a 🔒 Locked badge.
# We assert the lock glyph appears alongside the locked tool's # We assert the lock glyph appears alongside the locked tool's
# display name. Streamlit renders the markdown verbatim so the # display name. Streamlit renders the markdown verbatim so the


@@ -19,7 +19,7 @@ from .conftest import collected_text, stash_upload
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
# Deduplicator # Find Duplicates
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
class TestDeduplicatorWorkflow: class TestDeduplicatorWorkflow:
@@ -64,7 +64,7 @@ class TestDeduplicatorWorkflow:
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
# Text Cleaner # Clean Text
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
class TestTextCleanerWorkflow: class TestTextCleanerWorkflow:
@@ -96,7 +96,7 @@ class TestTextCleanerWorkflow:
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
# Format Standardizer # Standardize Formats
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
class TestFormatStandardizerWorkflow: class TestFormatStandardizerWorkflow:
@@ -110,7 +110,7 @@ class TestFormatStandardizerWorkflow:
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
# Missing Value Handler # Fix Missing Values
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
class TestMissingValuesWorkflow: class TestMissingValuesWorkflow:
@@ -124,7 +124,7 @@ class TestMissingValuesWorkflow:
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
# Column Mapper # Map Columns
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
class TestColumnMapperWorkflow: class TestColumnMapperWorkflow:
@@ -138,7 +138,7 @@ class TestColumnMapperWorkflow:
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
# Pipeline Runner # Automated Workflows
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
class TestPipelineRunnerWorkflow: class TestPipelineRunnerWorkflow:


@@ -41,8 +41,8 @@ class TestAnalyzeCli:
assert result.exit_code == 0 assert result.exit_code == 0
# The Rich table breaks lines; assert on stable substrings instead of # The Rich table breaks lines; assert on stable substrings instead of
# full finding ids. # full finding ids.
assert "Text Cleaner" in result.stdout assert "Clean Text" in result.stdout
assert "Missing Value" in result.stdout assert "Fix Missing Values" in result.stdout
# Severity column is rendered. # Severity column is rendered.
assert "warn" in result.stdout assert "warn" in result.stdout

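Review note: the comment in this hunk explains that the table output wraps lines, so the test asserts on stable substrings. A small stdlib-only sketch of why a multi-word label can stop matching once a column narrows, and of one possible whitespace-insensitive check. It models a single cell with `textwrap` as a stand-in; it does not use or reproduce the Rich API, and a real table would also place border characters between wrapped lines. `wrap_cell` and `contains_label` are hypothetical helpers.

```python
import textwrap


def wrap_cell(text: str, width: int) -> str:
    """Simulate a narrow table column by wrapping the cell text at `width`."""
    return "\n".join(textwrap.wrap(text, width=width))


def contains_label(output: str, label: str) -> bool:
    """Substring check that ignores how whitespace and line breaks fall."""
    return " ".join(label.split()) in " ".join(output.split())


# A 12-character column splits the three-word label across two lines.
wrapped = wrap_cell("Fix Missing Values", width=12)
```

With this input the plain `in` check on `wrapped` fails while `contains_label` still matches, which is the trade-off the test comment is pointing at.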

@@ -1,4 +1,4 @@
"""Acceptance corpus for the Column Mapper. """Acceptance corpus for the Map Columns tool.
Loads every fixture in ``test-cases/column-mapper-corpus/test_data/`` Loads every fixture in ``test-cases/column-mapper-corpus/test_data/``
and asserts the documented behaviour against the documented schema. and asserts the documented behaviour against the documented schema.


@@ -48,7 +48,7 @@ class TestAnalyzeCliE2E:
proc = _run("-m", "src.cli_analyze", str(CORPUS_KITCHEN_SINK)) proc = _run("-m", "src.cli_analyze", str(CORPUS_KITCHEN_SINK))
assert proc.returncode == 0, proc.stderr assert proc.returncode == 0, proc.stderr
# Rich tables wrap; assert on stable substrings. # Rich tables wrap; assert on stable substrings.
assert "Text Cleaner" in proc.stdout assert "Clean Text" in proc.stdout
assert "csv_bom_stripped" in proc.stdout or "smart_quotes" in proc.stdout assert "csv_bom_stripped" in proc.stdout or "smart_quotes" in proc.stdout
def test_json_output_parses(self): def test_json_output_parses(self):


@@ -1,7 +1,7 @@
"""Tier-specific tests: Lite tier feature set + gating. """Tier-specific tests: Lite tier feature set + gating.
Lite unlocks exactly three tools — Deduplicator, Text Cleaner, Lite unlocks exactly three tools — Find Duplicates, Clean Text,
Format Standardizer — and locks the other six. We test: Standardize Formats — and locks the other six. We test:
- The features map for Lite returns the right three flags (and only - The features map for Lite returns the right three flags (and only
those three). those three).

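Review note: the docstring above says Lite unlocks exactly three tools and locks the other six. A minimal sketch of a tier-to-tools map that makes the "exactly three" claim checkable. The `Tool` enum, the `TIER_TOOLS` dict, and the `"full"` tier name are hypothetical; the project's own `FeatureFlag` members and tier names are not shown in this diff and may look different.

```python
from enum import Enum


class Tool(Enum):
    """Hypothetical enum keyed by the tool_id strings that appear in the diff."""
    FIND_DUPLICATES = "01_deduplicator"
    CLEAN_TEXT = "02_text_cleaner"
    STANDARDIZE_FORMATS = "03_format_standardizer"
    FIX_MISSING_VALUES = "04_missing_handler"
    MAP_COLUMNS = "05_column_mapper"
    FIND_UNUSUAL_VALUES = "06_outlier_detector"
    COMBINE_FILES = "07_multi_file_merger"
    QUALITY_CHECK = "08_validator_reporter"
    AUTOMATED_WORKFLOWS = "09_pipeline_runner"


# Hypothetical tier map; "full" is an assumed name for an all-tools tier.
TIER_TOOLS: dict[str, frozenset] = {
    "lite": frozenset(
        {Tool.FIND_DUPLICATES, Tool.CLEAN_TEXT, Tool.STANDARDIZE_FORMATS}
    ),
    "full": frozenset(Tool),
}


def is_unlocked(tier: str, tool: Tool) -> bool:
    """Unknown tiers unlock nothing."""
    return tool in TIER_TOOLS.get(tier, frozenset())


def locked_tools(tier: str) -> list[Tool]:
    """Tools the given tier does not unlock, in enum declaration order."""
    return [tool for tool in Tool if not is_unlocked(tier, tool)]
```

Keying on the stable `tool_id` values rather than the display labels means a label rename like this commit's would not touch the gating logic.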

@@ -1,4 +1,4 @@
"""Acceptance corpus for the Missing Value Handler. """Acceptance corpus for the Fix Missing Values tool.
Loads every fixture in ``test-cases/missing-corpus/test_data/`` and Loads every fixture in ``test-cases/missing-corpus/test_data/`` and
asserts the documented behaviour. The fixtures are split into: asserts the documented behaviour. The fixtures are split into:


@@ -25,7 +25,7 @@ from src.core import (
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
# Format Standardizer: single-tolist hot loop # Standardize Formats: single-tolist hot loop
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
class TestStandardizerHotLoop: class TestStandardizerHotLoop:
@@ -93,7 +93,7 @@ class TestStandardizerHotLoop:
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
# Deduplicator: per-call normalizer cache # Find Duplicates: per-call normalizer cache
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
class TestDedupNormalizerCache: class TestDedupNormalizerCache: