docs+code: rename tool labels everywhere

Sweep follow-up to 93e43fc. Display labels now consistent across docs,
landing pages, CLI output, code comments, docstrings, and test prose.
Five parallel surfaces touched:

- docs (EN + ES): README, USER-GUIDE, CLI-REFERENCE, and 11 internal
  design/planning docs
- landing pages: index + bookkeeper/revops/shopify-pet
- src: CLI module docstrings, _TOOL_DISPLAY dicts in cli_analyze.py
  and gui/components/_legacy.py, core module headers, every tool
  page's module docstring
- tests: class/method/module docstrings and section-header comments
- test-cases READMEs

Page slugs (1_Deduplicator etc.), tool_id strings (01_deduplicator
etc.), Python class names (TestDeduplicatorWorkflow, FeatureFlag.*),
URL paths, anchor IDs, CSS classes, and asset filenames were left
intact since they're code identifiers / structural references.
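
The label/identifier split described above can be sketched as follows. This is an illustration only, not the actual contents of `_TOOL_DISPLAY`: `01_deduplicator` is the only tool_id named in this message, the other keys are assumed to follow the same pattern, and `display_name` is a hypothetical helper.

```python
# Hypothetical sketch: stable tool_id keys map to renameable display labels.
# Only "01_deduplicator" is confirmed by the commit message; the remaining
# keys are assumptions that follow the same NN_name pattern.
_TOOL_DISPLAY = {
    "01_deduplicator": "Find Duplicates",
    "02_text_cleaner": "Clean Text",
    "03_format_standardizer": "Standardize Formats",
}


def display_name(tool_id: str) -> str:
    """Return the user-facing label, falling back to the raw tool_id."""
    return _TOOL_DISPLAY.get(tool_id, tool_id)


print(display_name("01_deduplicator"))
```

Because callers look labels up by tool_id, a rename like this one only changes the dict values; slugs, class names, and URLs keep working.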

All 2033 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 19:50:09 +00:00
parent 93e43fc0d9
commit db5ec084da
57 changed files with 205 additions and 205 deletions

View File

@@ -333,7 +333,7 @@ the attached `.dtlic` file.
| Tier | Features |
|------|---------|
| **lite** | Deduplicator, Text Cleaner, Format Standardizer |
| **lite** | Find Duplicates, Clean Text, Standardize Formats |
| **core** | All 9 tools |
| **pro** | All 9 tools + future Pro-only features |

View File

@@ -47,7 +47,7 @@ Sell niche Python automation tools as one-time downloadable digital products. Ta
**Surface**: desktop install per OS (PyInstaller) with Streamlit GUI + CLI. Constrained demo on Streamlit Community Cloud.
## 4a. Lead bundle — Deduplicator
## 4a. Lead bundle — Find Duplicates
Highest pain density across all 4 personas. Feeds landing copy, demo design, feature priority. Tech spec: TECHNICAL.md §11.1.
@@ -208,7 +208,7 @@ Headroom enables optional ad spend ($100-200/mo) once a bundle has proven conver
## 13. Honest status (2026-05-01)
- 3 of 9 tools shipped (Dedup, Text Cleaner, Format Standardizer).
- 3 of 9 tools shipped (Find Duplicates, Clean Text, Standardize Formats).
- Cross-platform build pipeline designed, not yet built.
- macOS code signing not yet set up.
- Streamlit GUI shipped for the 3 ready tools.

View File

@@ -8,15 +8,15 @@ Tres módulos de CLI, uno por cada herramienta Lista:
| Módulo | Comando | Propósito |
|--------|---------|---------|
| `src.cli` | `python -m src.cli FILE` | Eliminador de duplicados |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Limpiador de texto |
| `src.cli` | `python -m src.cli FILE` | Buscar duplicados |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Limpiar texto |
| `src.cli_analyze` | `python -m src.cli_analyze FILE` | Analizador (escaneo de solo lectura) |
Cada comando es **previsualización por defecto** — añade `--apply` para escribir la salida.
---
# Eliminador de duplicados
# Buscar duplicados
```
python -m src.cli ARCHIVO_ENTRADA [OPCIONES]
@@ -125,7 +125,7 @@ Registro: `logs/dedup_YYYYMMDD_HHMMSS.log`.
---
# Limpiador de texto
# Limpiar texto
```
python -m src.cli_text_clean ARCHIVO_ENTRADA [OPCIONES]
@@ -156,7 +156,7 @@ Higiene a nivel de carácter. Ver [TECHNICAL.md §10.2](TECHNICAL.md) (solo en i
- `--config RUTA` / `--save-config RUTA`.
### Archivo
- `--sheet`, `--encoding`, `--header-row` — iguales que en el Eliminador de duplicados.
- `--sheet`, `--encoding`, `--header-row` — iguales que en Buscar duplicados.
## Presets

View File

@@ -6,15 +6,15 @@ Three CLI modules, one per Ready tool:
| Module | Command | Purpose |
|--------|---------|---------|
| `src.cli` | `python -m src.cli FILE` | Deduplicator |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Text Cleaner |
| `src.cli` | `python -m src.cli FILE` | Find Duplicates |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Clean Text |
| `src.cli_analyze` | `python -m src.cli_analyze FILE` | Analyzer (read-only scan) |
Every command is **preview-only by default** — add `--apply` to write output.
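
A minimal sketch of the preview-by-default behaviour, assuming a single boolean `--apply` flag. It uses the standard-library `argparse` as a stand-in rather than the project's own CLI framework, and the `example-tool` program name, `run` function, and returned messages are all hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser where writing output is opt-in via --apply."""
    parser = argparse.ArgumentParser(prog="example-tool")
    parser.add_argument("input_file")
    parser.add_argument(
        "--apply",
        action="store_true",  # defaults to False, so a bare run only previews
        help="write output instead of previewing",
    )
    return parser


def run(argv: list) -> str:
    """Return a message describing what the command would do."""
    args = build_parser().parse_args(argv)
    if args.apply:
        return f"write: cleaned output produced for {args.input_file}"
    return f"preview: {args.input_file} left untouched"


print(run(["data.csv"]))
print(run(["data.csv", "--apply"]))
```

Making the destructive path opt-in means a mistyped or incomplete command cannot overwrite a buyer's file.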
---
# Deduplicator
# Find Duplicates
```
python -m src.cli INPUT_FILE [OPTIONS]
@@ -123,7 +123,7 @@ Log: `logs/dedup_YYYYMMDD_HHMMSS.log`.
---
# Text Cleaner
# Clean Text
```
python -m src.cli_text_clean INPUT_FILE [OPTIONS]
@@ -154,7 +154,7 @@ Character-level hygiene. See [TECHNICAL.md §10.2](TECHNICAL.md) for the spec.
- `--config PATH` / `--save-config PATH`.
### File
- `--sheet`, `--encoding`, `--header-row` — same as Deduplicator.
- `--sheet`, `--encoding`, `--header-row` — same as Find Duplicates.
## Presets

View File

@@ -67,7 +67,7 @@ Each candidate scored 1-5 on 6 dimensions. Total /30 → verdict.
**v1.2 rationale**:
- Buyer persona ("hates Excel work but can't code") won't learn a CLI. Refunds at this price.
- Deduplicator needs interactive review — not viable in pure CLI.
- Find Duplicates needs interactive review — not viable in pure CLI.
- Dual interface keeps CLI for automation without sacrificing primary buyer surface.
## 4a. Functional scope principle (v1.2)
@@ -170,13 +170,13 @@ $49-79/bundle · $149 full suite (when 3+ exist).
| Apr 28 (v1.3) | Add hosted browser demo as conversion lever | Direct consequence of Streamlit choice. See §5. |
| Apr 28 (v1.4) | Re-apply 04/06 boundary work (silent-drift recovery) | Stream B v1.2 content overwritten in parallel v1.3 work. Restored per no-silent-drift rule. |
| Apr 28 (v1.5) | Add `02_text_cleaner.py`; renumber 02-08 → 03-09 | Character-level hygiene had no clear owner. See TECHNICAL §10. |
| Apr 29 (v1.7) | Adopt Text Cleaner Tier 1/2/3 spec; lock `excel-hygiene` default | Promotes from stub to buildable v1 target. Full spec in TECHNICAL §11.2. |
| Apr 29 (v1.7) | Adopt Clean Text Tier 1/2/3 spec; lock `excel-hygiene` default | Promotes from stub to buildable v1 target. Full spec in TECHNICAL §11.2. |
| Apr 28 (v1.6) | Fold conversation-history content into docs (deduplicator spec, lead bundle use cases, full GUI matrix, 04/06 examples, Streamlit-to-SaaS reasoning) | No new decisions; promote at-risk analysis from chat history per no-silent-drift rule. |
| May 1 (v1.6) | Mark Format Standardizer **Ready** | 199-row buyer corpus passing; Tier 1 + most Tier 2 built. |
| May 1 (v1.6) | Mark Standardize Formats **Ready** | 199-row buyer corpus passing; Tier 1 + most Tier 2 built. |
| May 1 (v1.6) | Add `src/core/errors.py` structured hierarchy | Uniform helpful messages across CLI + GUI. See TECHNICAL §7. |
| May 13 (v1.6) | Ship in-house JSON i18n + EN/ES packs | Expand addressable market (Spanish-first buyers, LatAm bookkeepers) without a `gettext` build step. JSON packs editable by non-devs; parity test prevents drift. See TECHNICAL §10b. |
| May 13 (v1.6) | Ship licensing: 1-year HMAC-signed blobs, name+email registration, offline verification, tier-scaffolded for future SKUs | Unlock the lifetime-update business model without recurring infra. Honor-system DRM (HMAC + 30-day refund) — sufficient at $49. See §9b below. |
| May 13 (v1.6) | Add Lite SKU (Dedup + Text Cleaner + Format Standardizer) | Lower-priced entry point for buyers who only need the three universal tools. Per-tool feature gating + lock badges on the home grid surface the upgrade path. See §9b. |
| May 13 (v1.6) | Add Lite SKU (Find Duplicates + Clean Text + Standardize Formats) | Lower-priced entry point for buyers who only need the three universal tools. Per-tool feature gating + lock badges on the home grid surface the upgrade path. See §9b. |
| May 13 (v1.6) | Remove user-facing free trial | A 1-year all-features trial undercut the paid Lite SKU. Paid-only keeps tier economics clean. Internal ``_mint`` API still exists for tests and the seller's key generator. See §9b. |
| May 13 (v1.6) | Upgrade license crypto: HMAC → Ed25519 (asymmetric) | HMAC's symmetric secret was extractable from the shipped binary — anyone with the binary could mint blobs. Ed25519 splits sign (seller) from verify (binary), so binary compromise doesn't let an attacker forge licenses. Blob prefix bumped DTLIC1 → DTLIC2. See §9b. |
| May 13 (v1.6) | Add ``assert_production_safe`` tripwire | A shipped build with ``DATATOOLS_DEV_MODE=1`` or the in-source dev pubkey would silently defeat licensing. The tripwire refuses to boot such a build. No-op in source / pytest runs. See §9b. |
@@ -211,13 +211,13 @@ The 30-day refund window covers casual blob sharing from a different angle (anyo
- Number of devices the same blob is used on (no concurrent-use detection).
- Reverse-engineered re-signing of expired blobs (would require RSA / online check).
**Future SKUs**: the ``FEATURES_BY_TIER`` table in ``src/license/features.py`` is the single source of truth for "which tools each tier unlocks". Adding a PRO SKU that excludes the pipeline runner is a 1-line edit there + a 1-line edit at the gate site. No consumer-code churn.
**Future SKUs**: the ``FEATURES_BY_TIER`` table in ``src/license/features.py`` is the single source of truth for "which tools each tier unlocks". Adding a PRO SKU that excludes Automated Workflows is a 1-line edit there + a 1-line edit at the gate site. No consumer-code churn.
**v1.6 SKU lineup**:
| Tier | Tools unlocked | Notes |
|---|---|---|
| LITE | Deduplicator, Text Cleaner, Format Standardizer | Entry SKU. Three universal tools that handle the most common bookkeeping / RevOps / Klaviyo prep workflows. |
| LITE | Find Duplicates, Clean Text, Standardize Formats | Entry SKU. Three universal tools that handle the most common bookkeeping / RevOps / Klaviyo prep workflows. |
| CORE | All 9 tools | Full v1 suite. |
| PRO | All 9 tools (scaffolded) | Reserved for future per-feature carve-outs (e.g., scheduled pipelines, API access). |
| ENTERPRISE | All 9 tools (scaffolded) | Reserved for future bulk / multi-seat SKUs. |
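
The single-source-of-truth idea behind `FEATURES_BY_TIER` might look roughly like the sketch below. The real table in `src/license/features.py` is not shown in this diff, so the `Tier` enum, the tool ids other than `01_deduplicator`, and the `is_unlocked` helper are assumptions made for illustration.

```python
from enum import Enum


class Tier(Enum):
    LITE = "lite"
    CORE = "core"
    PRO = "pro"
    ENTERPRISE = "enterprise"


# Hypothetical tool ids; only "01_deduplicator" appears in the source.
LITE_TOOLS = frozenset(
    {"01_deduplicator", "02_text_cleaner", "03_format_standardizer"}
)
ALL_TOOLS = LITE_TOOLS | frozenset(
    {
        "04_missing_handler",
        "05_column_mapper",
        "06_outlier_detector",
        "07_multi_file_merger",
        "08_validator",
        "09_pipeline_runner",
    }
)

# One table decides what each tier unlocks; gate sites only query it.
FEATURES_BY_TIER = {
    Tier.LITE: LITE_TOOLS,
    Tier.CORE: ALL_TOOLS,
    Tier.PRO: ALL_TOOLS,         # scaffolded: same as CORE for now
    Tier.ENTERPRISE: ALL_TOOLS,  # scaffolded: same as CORE for now
}


def is_unlocked(tier: Tier, tool_id: str) -> bool:
    """Return True if the given tier includes the given tool."""
    return tool_id in FEATURES_BY_TIER[tier]


print(is_unlocked(Tier.LITE, "01_deduplicator"))
print(is_unlocked(Tier.LITE, "09_pipeline_runner"))
```

Under this shape, carving a tool out of a future SKU is a change to one set in the table, which matches the "1-line edit" claim above.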

View File

@@ -33,7 +33,7 @@ CLI (src/cli*.py) GUI (src/gui/app.py + pages/)
| `core.errors` | `DataToolsError` hierarchy, `ensure_dataframe()`, `ensure_choice()`, `wrap_file_read/write()`, `format_for_user()` |
| `core._constants` | `US_STATE_NAMES`, `US_STATE_CODES`, `USPS_EXPANSIONS`, `USPS_COMPRESSIONS` |
## Data flow — Deduplicator
## Data flow — Find Duplicates
```
read_file() # auto-detect encoding, delimiter, header

View File

@@ -30,7 +30,7 @@ Status legend:
| ✓ | Item | Where it lives |
|---|------|----------------|
| 🟢 | 6 of 9 tools shipped (Dedup, Text, Format, Missing, Column-Map, Pipeline) | `src/core/`, `src/cli_*.py`, `src/gui/pages/` |
| 🟢 | Pipeline Runner (the retention multiplier per `PLAN.md` §2.6) | `src/core/pipeline.py`, `src/cli_pipeline.py`, `src/gui/pages/9_Pipeline_Runner.py` |
| 🟢 | Automated Workflows (the retention multiplier per `PLAN.md` §2.6) | `src/core/pipeline.py`, `src/cli_pipeline.py`, `src/gui/pages/9_Pipeline_Runner.py` |
| 🟢 | 1,729 passing tests · 0 skipped · 0 xfailed | `tests/` |
| 🟢 | 3 niche demo datasets + pre-tuned pipeline JSONs | `samples/demo/` |
| 🟢 | Streamlit demo app + Cloud entry shim | `streamlit_app.py`, `src/gui/app_demo.py` |

View File

@@ -29,8 +29,8 @@ win.
| Asset | State |
|---|---|
| Tools 1-5 (Dedup, Text Clean, Format Standardize, Missing, Column Mapper) | Ready · 1,691 tests passing · 0 xfailed |
| Tools 6-9 (Outlier, Multi-File Merge, Validator, Pipeline) | Coming Soon |
| Tools 1-5 (Find Duplicates, Clean Text, Standardize Formats, Fix Missing Values, Map Columns) | Ready · 1,691 tests passing · 0 xfailed |
| Tools 6-9 (Find Unusual Values, Combine Files, Quality Check, Automated Workflows) | Coming Soon |
| PyInstaller installer pipeline | Not started |
| macOS code signing (Apple Dev Program) | Not started |
| Hosted browser demo (Streamlit Cloud) | Not deployed |
@@ -52,7 +52,7 @@ Tools 6-8 are blocked behind a **distribution gate**: no work on them
until the existing 5 tools have a paying customer + one external review
(BUSINESS.md §4 sequence rule, applied recursively inside the bundle).
**Exception granted 2026-05-01**: Tool 09 Pipeline Runner is built
**Exception granted 2026-05-01**: Tool 09 Automated Workflows is built
*now*. Rationale: the pipeline transforms the bundle from "5 tools you
buy" into "an automatable workflow you depend on." That conversion is
what produces retention and word-of-mouth — the only marketing channel
@@ -104,10 +104,10 @@ demo dataset.
| # | Pain | $ / time impact | Tools that fix it |
|---|------|-----------------|---|
| S1 | **Klaviyo / Mailchimp / Omnisend per-contact billing.** Subscriber list with 10-18 % duplicate rate (case drift, plus signs in Gmail addresses, multiple devices) → recurring overpay forever. | $30-300/mo per percent of dupes on a 50 k list — recurring | Dedup + Format Standardize (email canonicalization) + Pipeline (re-run weekly) |
| S2 | **Product feed rejected by Google Merchant Center / Meta Catalog.** Smart quotes in titles, NBSP in SKU, inconsistent attributes; campaign launch delayed 24-72 h while feed gets fixed. | 1-3 days delayed launch × campaign value | Text Cleaner + Format Standardize |
| S3 | **Multi-channel order consolidation.** Shopify + Etsy + Amazon + Faire + wholesale spreadsheet, each with a different column for "customer email" / "order total" / "ship country". | 4-8 hr / month manually merging | Column Mapper + Dedup + Pipeline |
| S2 | **Product feed rejected by Google Merchant Center / Meta Catalog.** Smart quotes in titles, NBSP in SKU, inconsistent attributes; campaign launch delayed 24-72 h while feed gets fixed. | 1-3 days delayed launch × campaign value | Clean Text + Standardize Formats |
| S3 | **Multi-channel order consolidation.** Shopify + Etsy + Amazon + Faire + wholesale spreadsheet, each with a different column for "customer email" / "order total" / "ship country". | 4-8 hr / month manually merging | Map Columns + Find Duplicates + Automated Workflows |
| S4 | **Subscription identity fragmentation.** Pet-box subscribers cancel and re-sub under a different email; cohort analysis says churn is 20 % when it's actually 12 % — pricing decisions wrong. | Mis-priced LTV → over- or under-paid acquisition | Dedup with `merge=true` survivor |
| S5 | **International tax / VAT MOSS compliance.** Country column is `UK` / `U.K.` / `United Kingdom` / `GB` in the same export; VAT report breaks. Phone formats per region break call-center routing. | Compliance penalty risk + ops friction | Format Standardize (per-row country) + Column Mapper |
| S5 | **International tax / VAT MOSS compliance.** Country column is `UK` / `U.K.` / `United Kingdom` / `GB` in the same export; VAT report breaks. Phone formats per region break call-center routing. | Compliance penalty risk + ops friction | Standardize Formats (per-row country) + Map Columns |
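
The "email canonicalization" named in row S1 can be illustrated with a short sketch. It covers only the two drift sources that row mentions (letter case and plus-suffixes on Gmail addresses); `canonicalize_email` is a hypothetical name and this is not the product's implementation.

```python
def canonicalize_email(raw: str) -> str:
    """Normalize an address so trivially different spellings compare equal."""
    email = raw.strip().lower()
    local, sep, domain = email.rpartition("@")
    if not sep:
        # No "@" found: not an address, return the trimmed text unchanged.
        return email
    if domain == "gmail.com":
        # Gmail delivers "name+tag@gmail.com" to "name@gmail.com".
        local = local.split("+", 1)[0]
    return f"{local}@{domain}"


samples = [" Jane.Doe+promo@Gmail.com ", "jane.doe@gmail.com", "Bob+x@example.com"]
unique = {canonicalize_email(s) for s in samples}
print(sorted(unique))
```

The plus-suffix rule is applied to Gmail only, because other providers may treat `+` as a literal part of the mailbox name.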
#### Bookkeeper / freelance accountant
@@ -126,7 +126,7 @@ demo dataset.
| R1 | **HubSpot / Marketo / Iterable per-contact tier pricing.** 10 k contacts → enterprise tier at $4-8 k/mo. Every duplicate is a recurring tax. | $200-800 / month per 1 k duplicate contacts — recurring | Dedup with cross-source merge + Pipeline |
| R2 | **Email-deliverability / sender reputation.** Sending to invalid or duplicate addresses tanks reputation; recovery takes weeks. | Catastrophic — entire email programme degraded | Format Standardize (email canonicalization) + Missing (sentinel detection) |
| R3 | **GDPR / contact-data privacy.** Uploading lead data to a third-party cleaning SaaS is itself a GDPR concern; legal review blocks adoption. | Compliance risk + 4-8 wk legal-review delay | Local-only desktop app, zero outbound calls |
| R4 | **Multi-vendor lead-source unification.** Apollo, ZoomInfo, LinkedIn Sales Nav, manual scrapes — each export has different headers, scoring, country format. | 1-3 days per campaign of manual unification | Column Mapper (alias matching) + Format Standardize (per-row country) + Dedup |
| R4 | **Multi-vendor lead-source unification.** Apollo, ZoomInfo, LinkedIn Sales Nav, manual scrapes — each export has different headers, scoring, country format. | 1-3 days per campaign of manual unification | Map Columns (alias matching) + Standardize Formats (per-row country) + Find Duplicates |
| R5 | **Suppression-list management across 5+ platforms.** Each platform has its own format; un-deduped suppression lists let opt-outs slip through, triggering CAN-SPAM / GDPR exposure. | Compliance risk + churn-back cost | Pipeline saved as JSON, re-run on each new suppression batch |
### 2.4 Operationalize the moat the docs already name.
@@ -154,7 +154,7 @@ right after "runs locally."
Copy seed: *"Every change auditable. Hand the audit CSV to your client
with the cleaned file."*
### 2.6 The Pipeline Runner is the retention multiplier.
### 2.6 Automated Workflows is the retention multiplier.
A buyer with a saved pipeline isn't a one-off purchase — they're a
recurring user who recommends the product. This is exactly the
@@ -172,8 +172,8 @@ trigger DECISIONS.md §8 already names).
### 2.8 Dependency-aware pipeline UX.
Tools have soft execution-order preferences (Text Clean before Format
Standardize, Format before Dedup, Missing before Dedup). The Pipeline
Runner *recommends* the order, *warns* on reversals, and **never
Standardize, Format before Dedup, Missing before Dedup). Automated
Workflows *recommends* the order, *warns* on reversals, and **never
forces** — the user owns their workflow. Implementation: see
`src/core/pipeline.py` `SOFT_DEPENDENCIES`.
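
The recommend-and-warn behaviour described above could be sketched like this. The actual structure of `SOFT_DEPENDENCIES` in `src/core/pipeline.py` is not shown here, so the pair-list shape, the step names, and the `order_warnings` function are assumptions.

```python
from typing import List

# Hypothetical shape: (earlier, later) means "earlier usually runs first".
# The three pairs mirror the preferences listed in the paragraph above.
SOFT_DEPENDENCIES = [
    ("text_clean", "format_standardize"),
    ("format_standardize", "dedup"),
    ("missing", "dedup"),
]


def order_warnings(steps: List[str]) -> List[str]:
    """Return one warning per reversed soft dependency; never raise."""
    position = {name: index for index, name in enumerate(steps)}
    warnings = []
    for earlier, later in SOFT_DEPENDENCIES:
        both_present = earlier in position and later in position
        if both_present and position[earlier] > position[later]:
            warnings.append(f"'{earlier}' usually runs before '{later}'")
    return warnings


for message in order_warnings(["dedup", "format_standardize", "text_clean"]):
    print("warning:", message)
```

Returning strings instead of raising keeps the runner advisory: the caller can display the warnings and still execute the steps in the order the user chose.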
@@ -184,7 +184,7 @@ forces** — the user owns their workflow. Implementation: see
| 1 | PyInstaller pipeline · Mac/Win unsigned installers · Apple Dev Program enrollment (1-2 wk lead) | `dist/datatools-mac.dmg` and `dist/datatools-win.exe` install on a clean machine |
| 2 | Demo deployed to Streamlit Cloud · landing page v1 with embedded demo · 3 persona datasets in the demo | Public URL serves a working pipeline run on a sample dataset in < 30 s |
| 3 | Gumroad listing live · share value-first in 3 niche communities (no pitch) · 1 long-tail SEO post for the lead persona | First listing impression captured · post not removed for self-promotion |
| 4 | Pipeline Runner v1.0 shipped (this week, 2026-05-01 — exception per §2.1) · v1.1 patch announced with Tool 09 + intl improvements | Pipeline saves/loads JSON · 3 demo pipelines preloaded |
| 4 | Automated Workflows v1.0 shipped (this week, 2026-05-01 — exception per §2.1) · v1.1 patch announced with Tool 09 + intl improvements | Pipeline saves/loads JSON · 3 demo pipelines preloaded |
| 5-8 | Bookkeeper landing page · agency landing page · second tool's promo cycle · priority-support tier added (defer purchase until §2.7 trigger) | Three live landing pages with distinct H1, demo dataset, conversion target |
| 9-13 | Tool 06-08 only **if** revenue trajectory supports continued investment · otherwise more market work on the existing 5 + 09 | Decision made on 13 Aug 2026 with revenue data, not feature ambition |
@@ -202,7 +202,7 @@ These flip the plan, not the underlying criteria:
## 5. Anti-temptations (things the plan refuses)
- **More tools before more buyers.** Locked. Exception only for Pipeline Runner per §2.1.
- **More tools before more buyers.** Locked. Exception only for Automated Workflows per §2.1.
- **SaaS pivot.** Recurring infra conflicts with the lifestyle constraint (DECISIONS.md §4).
- **Live chat / sales calls.** Conflicts with no-touch (DECISIONS.md §1 #8).
- **Custom integrations / one-off consulting.** $300/hr looks tempting; breaks the "build once, sell many" model that justifies the entire strategy.

View File

@@ -144,7 +144,7 @@ Reading PLAN.md §3 + this doc together, the rough script:
| **M1** (June) | Installers · demo · 3 landing pages · Gumroad live | Whether the funnel mechanically works. Numbers will be noisy; just look for one purchase. |
| **M2** (July) | M1 + community posts in 3 niches + 1 SEO post | Which persona converts. Re-allocate effort to the highest-converting niche. |
| **M3** (August) | M2 + landing-page changes from M2 review | Whether intent-rate moved on the change. Decide tools 06-08 go/no-go. |
| **M4** (September) | M3 + first repeat-buyer signals | Whether the Pipeline Runner is producing retention as designed. |
| **M4** (September) | M3 + first repeat-buyer signals | Whether Automated Workflows is producing retention as designed. |
By end of M4, the data tells you whether the plan is producing
$1k-3k/mo (BUSINESS.md §6 6-month target) — extrapolated from the

View File

@@ -21,8 +21,8 @@ project-root/
│ └── CLI-REFERENCE.md
├── src/
│ ├── core/ # shared logic — both CLI + GUI call into this
│ ├── cli.py # Deduplicator CLI
│ ├── cli_text_clean.py # Text Cleaner CLI
│ ├── cli.py # Find Duplicates CLI
│ ├── cli_text_clean.py # Clean Text CLI
│ ├── cli_analyze.py # Analyzer CLI
│ └── gui/
│ ├── app.py # Streamlit entry

View File

@@ -76,7 +76,7 @@ Sample size: 1,000 rows (configurable).
- Full-DataFrame `auto_fix`: ~5 min (~30 µs/cell).
- Output write: ~10 s.
- Recommended RAM: 3-4× input size for the full-Apply path.
- **Format standardizer** (`standardize_dataframe`): ~2.7M rows/sec on
- **Standardize Formats** (`standardize_dataframe`): ~2.7M rows/sec on
cache-warm repetition-heavy columns (synthetic 1M-row in-memory
benchmark, 2 typed columns); the fused single-pass loop replaced a
3-pass ``.tolist()`` cycle, so per-call overhead is now dominated by
@@ -87,20 +87,20 @@ Sample size: 1,000 rows (configurable).
thread-pool scaffolding; on CPython 3.12 with the GIL it's
roughly neutral, but the API is ready for the free-threaded
(PEP 703) Python 3.13+ build where it will help.
- **Text cleaner** (`clean_dataframe`): ~1M rows/sec on
- **Clean Text** (`clean_dataframe`): ~1M rows/sec on
repetition-heavy columns (per-call string cache: the pipeline runs
once per *unique* cell value, not once per row).
- **Missing handler** (`handle_missing`): lazy-copy — when sentinel
- **Fix Missing Values** (`handle_missing`): lazy-copy — when sentinel
standardization runs but finds nothing, AND no drops AND no fills
apply, the input frame is returned as-is. On a clean 1 GB file this
saves the 1 GB allocation that the unconditional upfront copy used
to take.
- **Column mapper** (`map_columns`): rename + drop both already
- **Map Columns** (`map_columns`): rename + drop both already
return fresh frames; the explicit upfront `df.copy()` is now
removed and downstream mutating steps (schema-add, coerce) copy on
demand via `_ensure_owned()`. Rename-only and identity-mapping
paths run with zero explicit copies.
- **Deduplicator**:
- **Find Duplicates**:
- **Exact-only strategies** (every column uses `Algorithm.EXACT` at
threshold 100 — covers strong-key dedup like email/phone, the
fallback drop-duplicates path, and explicit "match on this exact
@@ -117,19 +117,19 @@ Sample size: 1,000 rows (configurable).
(the common dedup workload) skip re-parsing.
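
The per-call string cache mentioned for Clean Text in the list above (work done once per unique cell value, not once per row) can be sketched in plain Python. `clean_column` and `strip_and_collapse` are hypothetical names; the real `clean_dataframe` is not shown in this diff and may work differently.

```python
from typing import Callable, Dict, Iterable, List


def clean_column(values: Iterable[str], clean_one: Callable[[str], str]) -> List[str]:
    """Apply clean_one to each value, computing each distinct input once."""
    cache: Dict[str, str] = {}  # lives only for this call
    cleaned = []
    for value in values:
        if value not in cache:
            cache[value] = clean_one(value)
        cleaned.append(cache[value])
    return cleaned


calls = []  # records every time the expensive cleaner actually runs


def strip_and_collapse(text: str) -> str:
    calls.append(text)
    return " ".join(text.split())


column = ["  Acme  Inc ", "Beta LLC", "  Acme  Inc ", "  Acme  Inc "]
result = clean_column(column, strip_and_collapse)
print(result, "cleaner ran", len(calls), "times")
```

The saving scales with repetition: a column with few distinct values over many rows pays the cleaning cost only for the distinct values.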
## 11. Tools
1. Deduplicator — Ready
2. Text Cleaner — Ready
3. Format Standardizer — Ready
4. Missing Value Handler — Ready
5. Column Mapper — Ready
6. Outlier Detector — Coming Soon
7. Multi-File Merger — Coming Soon
8. Validator & Reporter — Coming Soon
9. Pipeline Runner — Ready
1. Find Duplicates — Ready
2. Clean Text — Ready
3. Standardize Formats — Ready
4. Fix Missing Values — Ready
5. Map Columns — Ready
6. Find Unusual Values — Coming Soon
7. Combine Files — Coming Soon
8. Quality Check — Coming Soon
9. Automated Workflows — Ready
### 11.a Recommended pipeline order (soft, not enforced)
The Pipeline Runner ships with a `SOFT_DEPENDENCIES` table; the
Automated Workflows ships with a `SOFT_DEPENDENCIES` table; the
following ordering is the default and the basis of the warning
surface. Re-ordering is allowed; the runner emits a warning string
and proceeds.
@@ -214,7 +214,7 @@ and proceeds.
fresh blob without losing the embedded buyer identity. Tier may
change during renewal (Lite → Core upgrade path).
- **Tiers**:
- ``lite`` — Deduplicator + Text Cleaner + Format Standardizer.
- ``lite`` — Find Duplicates + Clean Text + Standardize Formats.
Buyer pays once, gets the three universally-useful tools.
- ``core`` — every Ready tool (all 9 in v1.6).
- ``pro``, ``enterprise`` — scaffolded for future SKUs; currently

View File

@@ -34,8 +34,8 @@ src/
normalizers.py # Per-column normalizers for dedup matching
text_clean.py # clean_dataframe + smart_title_case
_constants.py # Shared USPS abbrevs + state names
cli.py # Deduplicator CLI (Typer)
cli_text_clean.py # Text Cleaner CLI
cli.py # Find Duplicates CLI (Typer)
cli_text_clean.py # Clean Text CLI
cli_analyze.py # Analyzer CLI (--json)
gui/
app.py # Streamlit entry point
@@ -192,7 +192,7 @@ GUI / CLI handlers use `format_for_user()` so the user always sees: file path, o
| Bundle | Status |
|--------|--------|
| Data Cleaning Mastery | 3/9 tools Ready (Dedup, Text Cleaner, Format Standardizer); 6 stubs |
| Data Cleaning Mastery | 3/9 tools Ready (Find Duplicates, Clean Text, Standardize Formats); 6 stubs |
| Automated Business Reporting | Not started |
| Ecommerce Data Pipeline | Not started |
| Small Business Finance | Not started |
@@ -214,12 +214,12 @@ Deliberately separate. Confluent original spec was wrong.
| Script | Owns |
|--------|------|
| 04 Missing Value Handler | "What's not there." Disguised nulls (`N/A`, `-`, sentinel codes), missingness patterns, imputation, drop-by-threshold. |
| 06 Outlier Detector | "What shouldn't be there." z-score / IQR / modified-z, multivariate (Isolation Forest, Mahalanobis), domain rules, winsorization. |
| 04 Fix Missing Values | "What's not there." Disguised nulls (`N/A`, `-`, sentinel codes), missingness patterns, imputation, drop-by-threshold. |
| 06 Find Unusual Values | "What shouldn't be there." z-score / IQR / modified-z, multivariate (Isolation Forest, Mahalanobis), domain rules, winsorization. |
**Run order**: 04 before 06. Outlier stats on data with `NaN` / sentinels are mathematically poisoned (means dragged, IQR widens, false negatives).
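
A small worked example of why 04 runs before 06, using invented readings and a hypothetical `-999` sentinel. It computes a plain z-score with the standard library; the tool's own outlier methods are not reproduced here.

```python
from statistics import mean, stdev

SENTINEL = -999  # hypothetical "missing" code left in a numeric column

# Invented data: eight normal readings, two sentinels, one true outlier (250).
readings = [102, 98, 101, 99, 100, 100, 97, 103, SENTINEL, SENTINEL, 250]


def z_score(value: float, data: list) -> float:
    """Distance of value from the sample mean, in sample standard deviations."""
    return (value - mean(data)) / stdev(data)


# With sentinels still present, the mean is dragged down and the spread
# balloons, so the genuine outlier looks unremarkable.
raw_z = z_score(250, readings)

# After the sentinels are removed, the same reading stands out.
cleaned = [x for x in readings if x != SENTINEL]
clean_z = z_score(250, cleaned)

print(f"z-score with sentinels:    {raw_z:.2f}")
print(f"z-score without sentinels: {clean_z:.2f}")
```

The sentinel rows act as extreme values themselves, which is the false-negative effect the paragraph above describes.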
**Pipeline order** (Pipeline Runner enforces): 02 → 03 → 04 → 05 → 06 → 07 → 08. 01 is order-flexible.
**Pipeline order** (Automated Workflows enforces): 02 → 03 → 04 → 05 → 06 → 07 → 08. 01 is order-flexible.
**Contested cases**:
- Whitespace-only cell — 02 trims to empty; 04 then flags empty as null.

View File

@@ -14,7 +14,7 @@ Introduce tu nombre completo y correo, pega el código de licencia del correo de
| Nivel | Herramientas |
|---|---|
| **Lite** | Eliminador de duplicados · Limpiador de texto · Estandarizador de formatos |
| **Lite** | Buscar duplicados · Limpiar texto · Estandarizar formatos |
| **Core** | Las 9 herramientas |
Un usuario Lite que abra una herramienta exclusiva de Core verá un mensaje "Actualiza tu licencia". La página de inicio también muestra una marca 🔒 Bloqueado en las tarjetas de las herramientas que tu nivel no incluye. Para actualizar, pega un código Core en la página Activar.
@@ -53,15 +53,15 @@ Matriz de soporte completa: [REQUIREMENTS.md](REQUIREMENTS.md) (solo en inglés)
| # | Herramienta | Propósito | Estado |
|---|------|---------|--------|
| 01 | Eliminador de duplicados | Coincidencia exacta + difusa, 5 normalizadores, auditoría | Listo |
| 02 | Limpiador de texto | Espacios, caracteres tipográficos, BOM, finales de línea, mayúsculas/minúsculas | Listo |
| 03 | Estandarizador de formatos | Fechas / teléfonos / correos / direcciones / nombres / monedas / booleanos | Listo |
| 04 | Gestor de valores faltantes | Nulos disfrazados, imputación, descarte por umbral | Próximamente |
| 05 | Mapeador de columnas | Renombrar + aplicar esquema | Próximamente |
| 06 | Detector de valores atípicos | z-score, IQR, multivariante | Próximamente |
| 07 | Combinador de varios archivos | Combina varios archivos | Próximamente |
| 08 | Validador e informes | Reglas + informe PDF/Excel | Próximamente |
| 09 | Ejecutor de canalizaciones | Lanzador multi-herramienta de un clic | Próximamente |
| 01 | Buscar duplicados | Coincidencia exacta + difusa, 5 normalizadores, auditoría | Listo |
| 02 | Limpiar texto | Espacios, caracteres tipográficos, BOM, finales de línea, mayúsculas/minúsculas | Listo |
| 03 | Estandarizar formatos | Fechas / teléfonos / correos / direcciones / nombres / monedas / booleanos | Listo |
| 04 | Corregir valores faltantes | Nulos disfrazados, imputación, descarte por umbral | Próximamente |
| 05 | Mapear columnas | Renombrar + aplicar esquema | Próximamente |
| 06 | Detectar valores atípicos | z-score, IQR, multivariante | Próximamente |
| 07 | Combinar archivos | Combina varios archivos | Próximamente |
| 08 | Verificación de calidad | Reglas + informe PDF/Excel | Próximamente |
| 09 | Flujos automatizados | Lanzador multi-herramienta de un clic | Próximamente |
**Datos de muestra** (`samples/`): `messy_sales.csv`, `bank_export.xlsx`.
@@ -89,17 +89,17 @@ Ayuda: `deduplicator --help`. Referencia completa: [CLI-REFERENCE.es.md](CLI-REF
### 3.3 Orden de ejecución (cuando uses las herramientas manualmente)
Si no usas el Ejecutor de canalizaciones, sigue este orden:
Si no usas Flujos automatizados, sigue este orden:
1. **02 Limpiador de texto** primero — normaliza espacios y caracteres especiales.
2. **03 Estandarizador de formatos** — fechas, teléfonos, etc. necesitan texto limpio.
3. **04 Gestor de valores faltantes** — códigos centinela se ocultan como números.
4. **05 Mapeador de columnas** — esquema antes que estadísticas de atípicos.
5. **06 Detector de valores atípicos** — necesita datos numéricos limpios. Calcular estadísticas con `NaN` o `-999` envenena los resultados.
6. **07 Combinador de varios archivos**, **08 Validador** según sea necesario.
7. **01 Eliminador de duplicados** es flexible en cuanto al orden (normaliza internamente para la coincidencia).
1. **02 Limpiar texto** primero — normaliza espacios y caracteres especiales.
2. **03 Estandarizar formatos** — fechas, teléfonos, etc. necesitan texto limpio.
3. **04 Corregir valores faltantes** — códigos centinela se ocultan como números.
4. **05 Mapear columnas** — esquema antes que estadísticas de atípicos.
5. **06 Detectar valores atípicos** — necesita datos numéricos limpios. Calcular estadísticas con `NaN` o `-999` envenena los resultados.
6. **07 Combinar archivos**, **08 Verificación de calidad** según sea necesario.
7. **01 Buscar duplicados** es flexible en cuanto al orden (normaliza internamente para la coincidencia).
El Ejecutor de canalizaciones aplica este orden automáticamente.
Flujos automatizados aplica este orden automáticamente.
### 3.4 Idioma

View File

@@ -14,7 +14,7 @@ Enter your full name + email, paste the license blob from your purchase email (s
| Tier | Tools |
|---|---|
| **Lite** | Deduplicator · Text Cleaner · Format Standardizer |
| **Lite** | Find Duplicates · Clean Text · Standardize Formats |
| **Core** | All 9 tools |
A Lite user opening a Core-only tool sees an "Upgrade your license" prompt. The home page also shows a 🔒 Locked badge on tool cards your tier doesn't unlock. To upgrade, paste a Core blob on the Activate page.
@@ -53,15 +53,15 @@ Full numbered support matrix: [REQUIREMENTS.md](REQUIREMENTS.md).
| # | Tool | Purpose | Status |
|---|------|---------|--------|
| 01 | Deduplicator | Exact + fuzzy match, 5 normalizers, audit | Ready |
| 02 | Text Cleaner | Whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | Format Standardizer | Dates / phones / emails / addresses / names / currencies / booleans | Ready |
| 04 | Missing Value Handler | Disguised nulls, imputation, drop-by-threshold | Coming Soon |
| 05 | Column Mapper | Rename + enforce schema | Coming Soon |
| 06 | Outlier Detector | z-score, IQR, multivariate | Coming Soon |
| 07 | Multi-File Merger | Combine multiple files | Coming Soon |
| 08 | Validator & Reporter | Rules + PDF/Excel report | Coming Soon |
| 09 | Pipeline Runner | One-click multi-tool launcher | Coming Soon |
| 01 | Find Duplicates | Exact + fuzzy match, 5 normalizers, audit | Ready |
| 02 | Clean Text | Whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | Standardize Formats | Dates / phones / emails / addresses / names / currencies / booleans | Ready |
| 04 | Fix Missing Values | Disguised nulls, imputation, drop-by-threshold | Coming Soon |
| 05 | Map Columns | Rename + enforce schema | Coming Soon |
| 06 | Find Unusual Values | z-score, IQR, multivariate | Coming Soon |
| 07 | Combine Files | Combine multiple files | Coming Soon |
| 08 | Quality Check | Rules + PDF/Excel report | Coming Soon |
| 09 | Automated Workflows | One-click multi-tool launcher | Coming Soon |
**Sample data** (`samples/`): `messy_sales.csv`, `bank_export.xlsx`.
@@ -89,17 +89,17 @@ Get help: `deduplicator --help`. Full reference: [CLI-REFERENCE.md](CLI-REFERENC
### 3.3 Run order (when running tools manually)
If you skip the Pipeline Runner, follow this order:
If you skip Automated Workflows, follow this order:
1. **02 Text Cleaner** first — normalizes whitespace + special chars.
2. **03 Format Standardizer** — dates, phones, etc. need cleaned text.
3. **04 Missing Value Handler** — sentinel codes hide as numbers.
4. **05 Column Mapper** — schema before outlier stats.
5. **06 Outlier Detector** — needs clean numerics. Stats on data with `NaN` or `-999` are mathematically poisoned.
6. **07 Multi-File Merger**, **08 Validator** as needed.
7. **01 Deduplicator** is order-flexible (normalizes internally for matching).
1. **02 Clean Text** first — normalizes whitespace + special chars.
2. **03 Standardize Formats** — dates, phones, etc. need cleaned text.
3. **04 Fix Missing Values** — sentinel codes hide as numbers.
4. **05 Map Columns** — schema before outlier stats.
5. **06 Find Unusual Values** — needs clean numerics. Stats on data with `NaN` or `-999` are mathematically poisoned.
6. **07 Combine Files**, **08 Quality Check** as needed.
7. **01 Find Duplicates** is order-flexible (normalizes internally for matching).
The Pipeline Runner enforces this automatically.
Automated Workflows enforces this automatically.
### 3.4 Language