docs: tight, scannable rewrite — every item earns its place

Refactors all 10 docs (README, USER-GUIDE, CLI-REFERENCE, REQUIREMENTS,
TECHNICAL, DEVELOPER, BUSINESS, DECISIONS, RECOVERY, docs/README) from
prose-heavy to bullet- and table-heavy. Same information, significantly
less reading load.

Net: 2600 → 1652 lines (~37% reduction) WHILE adding the new content
that landed since v1.6:

- Format Standardizer (3rd Ready tool)
- 199-row buyer corpus
- src/core/errors.py structured hierarchy + ensure_dataframe /
  ensure_choice / wrap_file_read|write / format_for_user helpers
- src/core/_constants.py shared USPS/state lookup tables
- Cross-tool audit fixes (NaN matching, removed_df schema, validation,
  enum-bounds checks, forward-compat config)
- Per-domain error_policy across format standardizers
- Inconsistent-date-format detector
- Excel header-row auto-detection + write_file delimiter param

Per-doc changes:

- README.md (175 → 71): 9-tool table at top, status column, 3 CLI
  entry points listed, dropped repeated marketing prose.
- docs/README.md (38 → 27): pure index — buyer-facing vs creator-only
  split + version footer.
- USER-GUIDE.md (208 → 118): tool table replaces script descriptions,
  troubleshooting compressed to bullets, gate explanation tightened.
- CLI-REFERENCE.md (451 → 235): collapsed flag tables, removed
  redundant intro text, kept full recipes section.
- REQUIREMENTS.md (146 → 129): 18 numbered sections (was 17), added
  §18 Error Handling, formatting tightened to single-line entries.
- TECHNICAL.md (570 → 350): collapsed §3 build pipeline tables, merged
  redundant §3.5-3.7 OS sections, added §7 (Error handling) +
  §11.3 (Format Standardizer spec) + §11.4-11.7 (analyzer / gate /
  Review page / repair_bytes promoted from §10.2.x sub-numbering).
- DEVELOPER.md (285 → 161): module map table replaces per-file prose,
  extension recipes condensed, new §Errors covers when to use each
  hierarchy class.
- BUSINESS.md (278 → 225): collapsed prose to tables (use cases,
  competitive landscape, costs, risks); honest-status updated.
- DECISIONS.md (269 → 189): scoring rubric + GUI matrix preserved,
  decision log compressed to single-line entries, added v1.6 entries
  (Format Standardizer Ready, errors module).
- RECOVERY.md (180 → 147): rebuild steps as numbered + tabular,
  external dependencies as one table, recovery priorities tightened.

No information removed; redundancy compressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:49:29 +00:00
parent 26b9771625
commit abb720997e
10 changed files with 1105 additions and 2053 deletions

184
README.md

@@ -1,175 +1,71 @@
# DataTools # DataTools
A bundle of Python data-cleaning tools for CSV and Excel files. Two scripts ship today; more are in build. Local CSV / Excel cleaning. CLI + browser GUI, no cloud, no install ceremony.
| # | Tool | What it does | ## Tools
|---|---|---|
| 01 | **Deduplicator** | Find and remove duplicate rows with exact + fuzzy matching, smart normalization, and interactive review. |
| 02 | **Text Cleaner** | Trim whitespace, fold smart quotes, strip invisible / control characters, normalize Unicode, normalize line endings, optional case conversion. |
## Deduplicator | # | Tool | Status |
|---|------|--------|
| 01 | **Deduplicator** — exact + fuzzy match, 5 normalizers, survivor rules, audit | Ready |
| 02 | **Text Cleaner** — whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | **Format Standardizer** — dates, phones, emails, addresses, names, currencies, booleans | Ready |
| 04 | Missing Value Handler | Coming Soon |
| 05 | Column Mapper | Coming Soon |
| 06 | Outlier Detector | Coming Soon |
| 07 | Multi-File Merger | Coming Soon |
| 08 | Validator & Reporter | Coming Soon |
| 09 | Pipeline Runner | Coming Soon |
## Features ## Install
- **Zero-config start** — auto-detects encoding, delimiters, headers, and match columns
- **Fuzzy matching** — Jaro-Winkler, Levenshtein, and token set ratio algorithms
- **5 built-in normalizers** — email (Gmail dot/plus), phone (E.164), name (titles/suffixes), address (USPS), string (whitespace/case)
- **Merge mode** — fill missing fields in the surviving row from removed duplicates
- **4 survivor rules** — keep first, last, most complete, or most recent row per group
- **Interactive review** — inspect match groups with inline checkboxes and column dropdowns, cherry-pick values, preview surviving rows live
- **Config profiles** — save and reload your settings as JSON for repeatable runs
- **Dual interface** — full CLI for automation, Streamlit GUI for visual review
- **Dry-run by default** — preview what would change before writing anything
- **Audit trail** — every run produces a match groups report and timestamped log
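The email normalizer in the feature list (Gmail dot/plus) can be illustrated with a minimal sketch. This is not the project's code: the function name `normalize_email` and the exact rules are assumptions, and it only covers the `gmail.com` domain.

```python
def normalize_email(raw: str) -> str:
    """Return a comparison key for an email address (hypothetical helper)."""
    cleaned = raw.strip().lower()
    local, sep, domain = cleaned.rpartition("@")
    if not sep:
        # No "@" found: fall back to the lowercased, trimmed input.
        return cleaned
    if domain == "gmail.com":
        # Gmail ignores dots in the local part and anything after a "+".
        local = local.split("+", 1)[0].replace(".", "")
    return f"{local}@{domain}"


emails = ["john@gmail.com", "John@Gmail.com", "j.ohn@gmail.com", "john+shop@gmail.com"]
keys = {normalize_email(e) for e in emails}
```

All four sample addresses collapse to a single key, so an exact match on the normalized column would group them. Addresses on other domains keep their dots and plus tags.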
## Quick Start
### Install
```bash ```bash
pip install -r requirements.txt pip install -r requirements.txt
``` ```
### CLI Python 3.10+ required.
```bash ## Run
# Preview duplicates (dry run — no files written)
python -m src.cli customers.csv
# Remove duplicates and save the result
python -m src.cli customers.csv --apply
# Fuzzy-match names at 80% similarity, merge missing fields
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge --apply
# Interactively review each match group
python -m src.cli customers.csv --review --apply
```
### GUI
**GUI** (recommended):
```bash ```bash
streamlit run src/gui/app.py streamlit run src/gui/app.py
``` ```
Upload a file, click **Find Duplicates**, review match groups side-by-side, then download the cleaned result. **CLI** — three entry points:
## CLI Usage Summary
```
python -m src.cli INPUT_FILE [OPTIONS]
Options:
--apply Write output files (default: preview only)
--output, -o PATH Output file path
--subset, -s COLS Columns to match on (comma-separated)
--key, -k COLS Strong-key columns for exact matching
--fuzzy COLS Columns to fuzzy-match
--algorithm, -a ALG levenshtein | jaro_winkler | token_set_ratio
--threshold, -t N Similarity threshold 0-100 (default: 85)
--normalize COL:TYPE Per-column normalizers (e.g., email:email,phone:phone)
--survivor RULE first | last | most-complete | most-recent
--merge Fill missing fields from removed duplicates
--review Interactively review each match group
--config PATH Load settings from a JSON config file
--save-config PATH Save current settings to JSON
--sheet NAME Excel sheet name or 0-based index
--encoding ENC Override auto-detected encoding
--header-row N 0-based header row index
--help Show full help
```
## Sample Output
```
$ python -m src.cli samples/messy_sales.csv
Reading messy_sales.csv...
50 rows, 8 columns
Finding duplicates...
──────────────────────────────────────────────────
File: messy_sales.csv
Rows in: 50
Rows out: 28
Removed: 22
Groups: 22
──────────────────────────────────────────────────
Match groups:
Group 1: rows [1, 2] → keep row 1 (confidence: 100.0%, matched on: email)
Group 2: rows [3, 4] → keep row 3 (confidence: 92.3%, matched on: name, phone)
...
This was a preview. Add --apply to write the output files.
```
## Output Files
When `--apply` is used, three files are produced:
| File | Contents |
|------|----------|
| `{input}_deduplicated.csv` | Cleaned data with duplicates removed |
| `{input}_removed.csv` | Rows that were removed |
| `{input}_match_groups.csv` | Audit trail: group ID, confidence, matched columns, survivor flag |
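The naming scheme in the table above can be sketched with `pathlib`. The helper name `dedup_output_paths` is hypothetical, and writing the outputs next to the input file is an assumption; the table only specifies the file-name pattern.

```python
from pathlib import Path


def dedup_output_paths(input_file: str) -> dict[str, Path]:
    """Derive the three --apply output paths from the input file name."""
    src = Path(input_file)
    stem = src.stem  # "customers" for "customers.csv"
    # Assumption: outputs are placed in the same directory as the input.
    return {
        "deduplicated": src.with_name(f"{stem}_deduplicated.csv"),
        "removed": src.with_name(f"{stem}_removed.csv"),
        "match_groups": src.with_name(f"{stem}_match_groups.csv"),
    }


paths = dedup_output_paths("data/customers.csv")
```

Because every derived name carries a suffix, none of the outputs can overwrite the input file.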
## Text Cleaner
Character-level hygiene for messy CSV / Excel input. Solves the dirty-data failure modes that silently break VLOOKUPs, dedup runs, and downstream imports:
- Trailing / leading whitespace and tabs in cells
- Non-breaking spaces (`U+00A0`) hiding inside text where regular spaces should be
- Smart quotes pasted from Word (`“` `”` `‘` `’` → `"` `"` `'` `'`)
- Em / en dashes, ellipsis, other typographic Unicode
- Zero-width and bidi-mark characters (`U+200B`, `U+200C`, `U+200D`, etc.)
- BOMs from Excel "Save As CSV UTF-8"
- Mixed line endings (`\r\n`, bare `\r`) inside multi-line cells
- Control characters (`U+0000`-`U+001F` minus `\t \n \r`)
- Optional Unicode NFC / NFKC normalization
- Optional per-column case conversion (UPPER / lower / smart Title / Sentence)
```bash ```bash
# Preview what would change (dry-run) python -m src.cli customers.csv [--apply] # dedup
python -m src.cli_text_clean samples/messy_text.csv python -m src.cli_text_clean messy.csv [--apply] # text clean
python -m src.cli_analyze any_file.csv [--json] # scan only
# Apply the safe defaults
python -m src.cli_text_clean samples/messy_text.csv --apply
# Title-case the name column, upper-case the SKU column
python -m src.cli_text_clean products.csv --case title:name,upper:sku --apply
# Just trim and collapse — nothing fancy
python -m src.cli_text_clean messy.csv --preset minimal --apply
``` ```
Three presets: `minimal` (trim + collapse only), `excel-hygiene` (default; everything safe ON), `paranoid` (adds lossy NFKC fold). Every CLI runs preview-only by default; add `--apply` to write output.
Outputs `{input}_cleaned.csv` plus a per-cell `{input}_changes.csv` audit (row, column, old, new, ops applied).
See [docs/CLI-REFERENCE.md](docs/CLI-REFERENCE.md#text-cleaner-cli) for every flag.
## Review & Normalize gate ## Review & Normalize gate
Every uploaded file passes through a CSV-normalization gate before any tool page sees it. The analyzer scans for ~15 issue types (whitespace pollution, NBSP / zero-width chars, mixed line endings, BOM artifacts, encoding misdetections, smart punctuation, dirty headers, null sentinels, mojibake, and more) and tags each finding by **confidence** (high / medium / low) and **fix action** (the algorithm in `src/core/fixes.py` that resolves it). Every uploaded file passes through a CSV-normalization gate before any tool sees it. The analyzer flags ~15 issue types (whitespace, NBSP / zero-width chars, BOM, encoding, smart punct, dirty headers, null sentinels, mojibake, …) tagged by **confidence** (high / medium / low) and **fix action**. The GUI shows each finding with Auto-fix / Skip / Customize, a live before/after preview, and an encoding-override picker. Tool pages refuse to load until the gate passes.
In the GUI, the **Review & Normalize** page renders one expandable card per finding with a decision control (Auto-fix / Skip / Customize), a live before-and-after preview, an encoding-override picker for misdetected codepages, and an Advanced output options block (encoding, delimiter, line terminator) for the download. Tool pages refuse to load until the gate passes. ## Output
See [docs/USER-GUIDE.md §3.3](docs/USER-GUIDE.md) for the user-facing walkthrough and [docs/TECHNICAL.md §10.2.1-10.2.4](docs/TECHNICAL.md) for the developer-facing API. Every run writes:
## Documentation - `{input}_<tool>.csv` — the cleaned data
- `{input}_changes.csv` (text cleaner) or `{input}_match_groups.csv` (dedup) — audit trail
- `logs/<tool>_YYYYMMDD_HHMMSS.log` — debug-level run log
- [Requirements](docs/REQUIREMENTS.md) — short-form numbered list: file size, codepages, delimiters, detectors, performance targets Original input file is never modified.
- [User Guide](docs/USER-GUIDE.md) — installation, GUI workflow, the Review & Normalize gate
- [CLI Reference](docs/CLI-REFERENCE.md) — every flag with examples and recipe sections
- [Technical](docs/TECHNICAL.md) — architecture, gate internals, finding schema, fix registry
- [Developer Guide](docs/DEVELOPER.md) — extending the bundle, adding fixes / detectors
## Requirements ## Docs
- Python 3.10+ - [User Guide](docs/USER-GUIDE.md) — install, GUI workflow, gate
- Dependencies: pandas, openpyxl, rapidfuzz, typer, phonenumbers, loguru, tqdm, charset-normalizer - [CLI Reference](docs/CLI-REFERENCE.md) — every flag with recipes
- [Requirements](docs/REQUIREMENTS.md) — file sizes, encodings, detectors, perf targets
- [Technical](docs/TECHNICAL.md) — architecture, gate internals, fix registry
- [Developer Guide](docs/DEVELOPER.md) — adding fixes / detectors / standardizers
## Dependencies
`pandas`, `openpyxl`, `rapidfuzz`, `phonenumbers`, `typer`, `loguru`, `charset-normalizer`, `streamlit`. Optional: `ftfy` for mojibake repair.
## License ## License
Proprietary. All rights reserved. Proprietary.

BUSINESS.md

@@ -1,278 +1,225 @@
# BUSINESS.md - Business Case & Marketing Strategy # Business
> **Creator-only document. Do not ship to buyers.** > Creator-only. Do not ship to buyers.
> **Version**: 1.6 · **Updated**: 2026-05-01 · **Owner**: Michael
**Version**: 1.6 ## 1. Executive summary
**Last updated**: April 28, 2026
**Owner**: Michael
--- Sell niche Python automation tools as one-time downloadable digital products. Target non-technical users who hate Excel/CSV grunt work but can't code. Distribute via Gumroad / Lemon Squeezy with automated delivery. Cross-platform from launch. Each bundle ships GUI (primary, browser-local) + CLI.
## 1. Executive Summary - **Pricing**: $49-79 per bundle · $149 full suite (when 3+ exist).
- **Goal**: lifestyle cashflow. No saleable-asset exit required.
Sell niche-specific Python automation tools as one-time downloadable digital products. Target non-technical users who hate repetitive Excel/CSV work but cannot code. Distribute via Gumroad / Lemon Squeezy / Stripe with automated instant delivery. Cross-platform from launch (Windows, macOS, Linux). Each bundle ships with both a GUI (primary surface for non-technical buyers, runs in the buyer's browser locally) and a CLI (for power users and automation). ## 2. Market opportunity
**Pricing model**: One-time purchase. Individual bundles $49-$79. Full suite $149. - Persistent, evergreen pain: data cleaning is universal.
- Low competition in vertical niches (Shopify pet-supplies feeds vs. generic CSV cleaners).
- ~100% gross margin after creation.
- Hosted browser demo as try-before-buy conversion lever (added v1.3).
**Goal**: Lifestyle cashflow. No saleable-asset exit required. **Timing reality**: marketplaces + community posts → days/weeks to first sale. Own-domain SEO is a 6-18 month compounding asset, not an early channel.
--- ## 3. Target customers
## 2. Market Opportunity **Primary**:
- Shopify owners (Pet Supplies = priority niche).
- Persistent, evergreen pain: manual data cleaning is universal across small business and freelance work. - Small business owners needing reporting + finance.
- Low competition in highly vertical niches (e.g., Shopify pet supplies feeds vs. generic CSV cleaners). - Freelancers / consultants handling client data.
- High margin: near-100% gross margin after initial creation.
- Distribution leverage: marketplace search + community presence + programmatic SEO + **hosted browser demo as a try-before-buy conversion surface** (added v1.3, see Section 7).
**Realistic distribution timeline note**: Marketplace listings (Gumroad, Lemon Squeezy directory) and niche community posts can produce paying customers within days to weeks. New-domain SEO will not produce traction inside 90 days. Plan early-stage distribution around marketplaces, communities, and a hosted demo; treat owned-domain SEO as a 6-18 month compounding asset.
---
## 3. Target Customers
Primary:
- Shopify store owners (Pet Supplies niche identified as priority).
- Small business owners needing reporting and finance automation.
- Freelancers and consultants who handle client data.
- Local marketing agencies. - Local marketing agencies.
Anti-personas (do not waste effort here): **Anti-personas**:
- Enterprise data teams (will build it themselves). - Enterprise data teams (build their own).
- Pure technical buyers (will pip install something free). - Pure technical buyers (`pip install` something free).
--- ## 4. Product strategy
## 4. Product Strategy **Lead**: Excel & CSV Data Cleaning Mastery Bundle (highest pain, broadest demand).
**Lead product**: Excel & CSV Data Cleaning Mastery Bundle (highest-pain, broadest demand). **Roadmap**:
1. Data Cleaning Mastery (in progress)
2. Automated Business Reporting
3. Ecommerce Data Pipeline
4. Small Business Finance
5. Marketing Public Data Aggregation
6. AI Ecommerce Aggregation — Shopify Pet Supplies
**Bundle roadmap**: **Sequence rule**: don't start bundle 2 until bundle 1 has paying customers + one external review. Five parallel skeletons is a known failure mode.
1. Data Cleaning Mastery (lead, in progress).
2. Automated Business Reporting.
3. Ecommerce Data Pipeline.
4. Small Business Finance.
5. Marketing Public Data Aggregation.
6. AI Ecommerce Aggregation - Shopify Pet Supplies (vertical niche play).
**Sequence rule**: Do not start bundle 2 until bundle 1 has paying customers and at least one external review. Building five skeleton bundles in parallel is a known failure mode. **Surface**: desktop install per OS (PyInstaller) with Streamlit GUI + CLI. Constrained demo on Streamlit Community Cloud.
**Product surface (locked v1.3)**: Each bundle ships as a desktop install (Windows / macOS / Linux) with both a Streamlit-based GUI and a CLI. A constrained version of the GUI is also deployed publicly to Streamlit Community Cloud as a free browser demo. See TECHNICAL.md Sections 1-3 and DECISIONS.md Section 4c for the full architecture. ## 4a. Lead bundle — Deduplicator
--- Highest pain density across all 4 personas. Feeds landing copy, demo design, feature priority. Tech spec: TECHNICAL.md §11.1.
## 4a. Lead Bundle Deep Dive: Deduplicator Use Cases & Competitive Position (added v1.6) ### Use cases by persona
The deduplicator is the lead because it has the highest pain density across all four target personas. This section captures the use-case map, competitive landscape, and market gap statement. It feeds landing page copy, demo dataset design, and feature prioritization. Companion technical spec is in TECHNICAL.md Section 10.1. **Shopify**:
1. Customer list cleanup (`john@gmail.com` vs `John@Gmail.com` vs `j.ohn@gmail.com`).
2. Product catalog dedup (SKU whitespace, near-identical names).
3. Abandoned-cart cleanup before re-engagement.
4. Order export consolidation across channels.
5. Subscriber-list hygiene before Klaviyo / Mailchimp import (per-contact pricing).
### Use cases by buyer persona **Bookkeeper**:
6. Bank export reconciliation across overlapping date ranges.
7. Vendor list consolidation across QB + spreadsheets + email.
8. Customer master cleanup pre-invoicing migration.
9. Expense report dedup (same receipt twice).
**Shopify store owner (priority niche)** **Freelancer**:
1. Customer list cleanup: same person with `john@gmail.com` and `John@Gmail.com` and `j.ohn@gmail.com` (Gmail ignores dots), or with two phone formats. 10. Pre-analysis cleanup of client dumps.
2. Product catalog dedup: same SKU listed with trailing whitespace, case differences, or near-identical names ("Dog Collar - Red - Large" vs "Dog Collar Red L"). 11. Survey response dedup (same respondent, multiple devices).
3. Abandoned cart cleanup before re-engagement campaign (don't email the same person 3 times).
4. Order export consolidation when pulling from Shopify + Amazon + manual entry.
5. Subscriber list hygiene before importing to Klaviyo / Mailchimp (every duplicate costs money on per-contact pricing).
**Small business / bookkeeper**
6. Bank export reconciliation: same transaction appearing in two exports across overlapping date ranges.
7. Vendor list consolidation across QuickBooks, spreadsheets, and email.
8. Customer master record cleanup before invoicing migration.
9. Expense report dedup where employees submit the same receipt twice.
**Freelancer / consultant**
10. Pre-analysis cleanup of client-supplied data dumps (almost always have dupes).
11. Survey response dedup (same respondent submitting twice from different devices).
12. Lead list cleanup before client handoff. 12. Lead list cleanup before client handoff.
**Marketing agency** **Marketing agency**:
13. Email list deduplication across multiple lead sources before campaign send. 13. Email-list dedup across lead sources.
14. Audience reconciliation when running multi-platform campaigns (Facebook + Google + organic forms). 14. Multi-platform audience reconciliation.
15. Suppression-list management (combine unsubscribes across lists). 15. Suppression-list management.
**Highest-pain, highest-frequency**: 1, 5, 6, 13. Build the feature set, sample dataset, and demo around these first. Landing page copy should lead with these scenarios. The hosted demo's pre-loaded dataset should make at least two of these cases obvious within ten seconds. **Highest pain × frequency**: 1, 5, 6, 13. Build feature set + demo dataset + landing copy around these.
### Competitive landscape ### Competitive landscape
| Tool | Price | Strength | Weakness vs. this product | | Tool | Price | Strength | Weakness |
|---|---|---|---| |------|-------|----------|----------|
| Excel "Remove Duplicates" | Free | Universal, zero install | Exact match only. No fuzzy. No audit log. | | Excel Remove Duplicates | Free | Universal, zero install | Exact only. No fuzzy. No audit. |
| Pandas `drop_duplicates` | Free | Powerful | Requires Python skill. Buyer doesn't have it. | | Pandas `drop_duplicates` | Free | Powerful | Requires Python. |
| OpenRefine | Free | Powerful clustering, fuzzy | Steep learning curve, dated GUI, intimidating for non-technical users. | | OpenRefine | Free | Powerful clustering | Steep curve, dated GUI. |
| Dedupe.io | ~$30+/mo SaaS | ML-based fuzzy | Recurring cost, cloud upload (privacy concern for client data), overkill for small jobs. | | Dedupe.io | $30+/mo | ML fuzzy | Recurring + cloud upload. |
| WinPure / Data Ladder | $300-2000+ | Enterprise-grade | Wrong price tier and complexity for solo operators. | | WinPure / Data Ladder | $300-2000+ | Enterprise | Wrong tier. |
| Power Query (Excel) | Free | Integrated | Exact match only, no fuzzy without M-code skill. | | Power Query | Free | Integrated | Exact only without M-code. |
### The market gap this product fills ### Market gap
The market has a hole between "Excel (too basic)" and "OpenRefine / Dedupe.io (too complex or expensive or cloud-bound)." That hole is: > Fuzzy match quality of OpenRefine, with the zero-learning UX of Excel, sold once for under $100, runs locally.
> Fuzzy match quality of OpenRefine, with the zero-learning-curve UX of Excel, sold once for under $100, runs locally on the buyer's machine. Defensible **only if** fuzzy matching works without docs. Mediocre fuzzy → loses to free Excel. Learning required → loses to free OpenRefine. Tier 1 spec (TECHNICAL.md §11.1) is the minimum viable feature set to occupy this gap.
This is a defensible position **only if** the product delivers fuzzy match quality the buyer can trust without reading documentation. If fuzzy is mediocre, the product loses to free Excel. If UX requires learning, it loses to free OpenRefine. The Tier 1 functional spec in TECHNICAL.md Section 10.1 is the minimum viable feature set to occupy this gap.
### Pricing sanity check (lead bundle specifically)
$49-$79 is correct for this position. Above $99 the buyer expects SaaS support (which conflicts with the no-touch constraint). Below $30 it competes with free and signals "toy." See Section 5 for full pricing rationale.
---
## 5. Pricing ## 5. Pricing
| Tier | Price | Notes | | Tier | Price | Notes |
|---|---|---| |------|-------|-------|
| Single bundle | $49 - $79 | Sweet spot for individual buyer impulse purchase | | Single bundle | $49-79 | Impulse-purchase sweet spot for solo operators |
| Full suite (when 3+ bundles exist) | $149 | Anchor price, drives bundle attach | | Full suite (3+ bundles) | $149 | Anchor; drives bundle attach |
Rationale: $49-$79 is below the threshold that triggers procurement / approval friction for solo operators. Above $99 the buyer expects a SaaS or human support. **Why**: < $99 avoids procurement friction. > $99 triggers SaaS-support expectations that conflict with no-touch. < $30 competes with free, signals "toy".
--- ## 6. Revenue targets
## 6. Revenue Targets (realistic, tiered)
Replacing the original "$50k/mo ceiling" target with evidence-based tiers for solo digital product sellers in this category:
| Horizon | Target | Notes | | Horizon | Target | Notes |
|---|---|---| |---------|--------|-------|
| First 90 days | First paying customer | Validates the funnel, not the business | | 90 days | First paying customer | Validates funnel, not business |
| 6 months | $1,000 - $3,000 / mo | Achievable with working lead bundle + marketplace presence + hosted demo | | 6 months | $1k-3k/mo | Lead bundle + marketplace + demo |
| 12 months | $5,000 / mo | Realistic 12-month goal. Triggers re-evaluation of the "fully async" constraint (see Section 8) | | 12 months | $5k/mo | Triggers "fully async" revisit |
| 24 months | $10,000 / mo | Stretch target. Requires either a hit product or 3+ bundles compounding | | 24 months | $10k/mo | Stretch. Needs hit product or 3+ bundles compounding |
$20k+/mo is achievable but requires a channel asset (audience, brand, content) that the current operator constraints exclude. Not a target. $20k+/mo achievable but requires audience/brand asset that operator constraints exclude.
--- ## 7. Marketing
## 7. Marketing Strategy ### Channels (priority order, early stage)
**Channels (in priority order, early stage)**: 1. **Hosted browser demo** — free Streamlit Community Cloud, linked from every listing. Direct conversion lever for digital downloads where buyers can't evaluate quality otherwise.
1. **Hosted browser demo** (added v1.3). Free, public Streamlit Community Cloud deployment of a constrained version of each bundle. Linked prominently from every Gumroad / Lemon Squeezy listing and the landing page as "Try it free in your browser." Direct conversion lever: prospects can validate quality before purchase, which is otherwise impossible for digital downloads at this price. 2. Marketplace listings — Gumroad search, Lemon Squeezy directory, GitHub.
2. Marketplace listings (Gumroad search, Lemon Squeezy directory, GitHub). 3. Niche communities — value-first posts in subreddits, Indie Hackers, niche Slack/Discord. Demo doubles as the shareable asset.
3. Niche community presence (subreddits, Indie Hackers, niche Slack/Discord) - value-first posts, not promotion. The hosted demo doubles as the asset shared in these posts. 4. Programmatic SEO — long-tail landing pages (compounds over months).
4. Programmatic SEO landing pages targeting long-tail keywords (compounds over months).
5. Strong GitHub README as discovery surface. 5. Strong GitHub README as discovery surface.
**Hosted demo design**: ### Demo design
- Same core engine as the paid product, GUI front-end only.
- Constrained: row limit (e.g., 100 rows on output), watermark on output files, sample dataset preloaded plus optional small-file upload (capped size).
- Persistent CTA on every page: "Like what you see? Get the full version for $49 ->" linking to Gumroad.
- No login or signup required to use the demo. Friction kills conversion.
- Hosted on Streamlit Community Cloud (free) at launch. Migrate to $5/mo VPS if rate limits or branding constraints become an issue.
**Target keywords (long-tail, low competition)**: - Same core engine as paid product, GUI-only.
- python csv cleaner bundle - Constraints: row limit (100), output watermark, sample dataset preloaded + small upload (capped).
- excel data cleaning scripts - Persistent CTA: *"Like what you see? Get the full version for $49 →"*.
- automated data deduplicator python - No login. Friction kills conversion.
- csv duplicate removal tool - Streamlit Community Cloud (free) at launch. $5/mo VPS if rate-limited.
- shopify product feed cleaner
**Funnel**: ### Target keywords
- Discovery (marketplace search / community post / SEO) -> Hosted demo (try-before-buy) -> Landing page -> Gumroad checkout -> Stripe payment -> automated email delivery -> upsell sequence to next bundle.
**Support model**: Self-serve documentation in every download. Email support only, no live chat, no calls. `python csv cleaner bundle` · `excel data cleaning scripts` · `automated data deduplicator python` · `csv duplicate removal tool` · `shopify product feed cleaner`.
--- ### Funnel
## 8. The "Fully Async, No-Touch" Constraint Discovery → Demo (try-before-buy) → Landing page → Gumroad → Stripe → automated email delivery → upsell sequence to next bundle.
The locked criteria require fully automated, no-touch marketing and sales. This is preserved as the long-term steady state. However: ### Support
**Revisit trigger**: When monthly recurring revenue reaches $5,000/mo. Self-serve docs in every download. Email only. No live chat, no calls.
**Why revisit**: At early stage, the no-touch constraint rules out the channels most likely to produce first traction (direct outreach to 50 Shopify pet operators, founder-led community participation, customer interviews). These are time-bounded activities, not permanent commitments. Strict adherence to "no-touch" before product-market fit may cost more revenue than it saves time. ## 8. The "fully async, no-touch" constraint
**Action at trigger**: Re-evaluate whether selective non-async activity (e.g., 2 hours/week of community participation, or a small founder audience build) would compound revenue faster than additional bundle development. Decision is yours; this document only flags the trigger. Locked criteria require automated, no-touch marketing + sales. Long-term steady state.
Until $5k/mo, operate under the locked async-only rule. **Revisit trigger**: $5k/mo MRR.
--- **Why**: pre-PMF, the no-touch rule excludes the channels most likely to produce first traction (founder outreach to 50 Shopify pet operators, community participation, customer interviews). Strict adherence may cost more revenue than it saves time.
## 9. Cost Structure **Action at trigger**: re-evaluate selective non-async (e.g., 2 hr/wk community participation) vs. additional bundle dev. Decision lives with the operator; this just flags the trigger.
**Recurring (monthly budget cap: $1,200)**: ## 9. Cost structure
| Item | Cost | Notes | Recurring monthly cap: **$1,200**.
|---|---|---|
| Gumroad / Lemon Squeezy fees | ~10% of revenue | Net of merchant fees, no flat cost |
| Domain | ~$15/yr | One-time annual |
| Hosting (landing pages) | $0 - $20/mo | Static hosting via Cloudflare Pages, Netlify, or GitHub Pages is free |
| Hosting (browser demos) | $0 at launch | Streamlit Community Cloud free tier. Plan for $5-10/mo VPS migration if scale or branding requires |
| Email service (transactional + sequences) | $0 - $30/mo | Free tier covers early volume |
| Apple Developer Program | $99/yr (~$8/mo) | Required for macOS code signing - see Section 10 |
| Inno Setup (Windows installer) | Free | One-time download |
| PyInstaller, Streamlit, Python tooling | Free | All open source |
| **Total fixed monthly** | **~$30-70/mo** | Well under $1,200 cap |
Headroom in the budget allows for optional ad spend ($100-200/mo) once a bundle has proven conversion data. | Item | Cost |
|------|------|
| Gumroad / Lemon Squeezy fees | ~10% of revenue |
| Domain | ~$15/yr |
| Landing-page hosting | $0-20/mo (static via Cloudflare/Netlify/GH Pages) |
| Demo hosting | $0 at launch (Streamlit Community Cloud); plan $5-10/mo VPS migration |
| Email service | $0-30/mo |
| Apple Developer Program | $99/yr (~$8/mo) |
| Inno Setup, PyInstaller, Python | Free |
| **Total fixed monthly** | **~$30-70/mo** |
--- Headroom enables optional ad spend ($100-200/mo) once a bundle has proven conversion data.
## 10. macOS Code Signing (Apple Developer Program) ## 10. macOS code signing
**Required cost**: $99/year, paid to Apple. **Cost**: $99/yr to Apple Developer Program. **Decision: pay it.**
**Why it's required**: **Why required**: macOS Gatekeeper hard-blocks unsigned apps with *"This app cannot be opened because the developer cannot be verified"* — the only obvious button is "Move to Trash." The bypass (right-click → Open) exists but the target buyer won't perform it.
macOS includes a security layer (Gatekeeper) that blocks unsigned applications by default. When a non-technical buyer downloads an unsigned `.app` or `.dmg`, macOS shows a hard-block dialog: *"This app cannot be opened because the developer cannot be verified."* The only obvious button is "Move to Trash."
The bypass exists (right-click > Open, then confirm in a second dialog), but the target buyer persona will not perform it. The likely outcomes for unsigned Mac builds: refund requests, support tickets, or silent abandonment. **What $99 buys**: code-signing certificate (removes hard block) + notarization service (removes "downloaded from internet" warning). Result: clean double-click experience.
**What the $99/yr buys**: **Setup**: Apple ID + government ID (individuals) or D-U-N-S number (orgs). First approval takes 1-2 weeks. Once approved, sign + notarize is automated in CI.
- A code signing certificate. Removes the hard block.
- Notarization service (included). Apple scans the binary and stamps it; this removes the secondary "downloaded from internet" warning too. Result: clean double-click-to-run experience.
**Setup notes**: ## 11. Risks & mitigation
- Requires Apple ID + government ID (for individuals) or D-U-N-S number (for organizations).
- First-time approval takes 1-2 weeks. Plan accordingly.
- Once approved, signing and notarization is automated in the build pipeline (see TECHNICAL.md).
**Decision**: Pay for it. The cost is trivial relative to the conversion-rate impact for the non-technical buyer persona.
---
## 11. Risks & Mitigation
| Risk | Mitigation | | Risk | Mitigation |
|---|---| |------|------------|
| Commoditization (free scripts on GitHub) | Niche verticals + polished GUI + cross-platform installers + hosted demo | | Free GitHub scripts commoditize | Niche verticals + polished GUI + cross-platform installers + hosted demo |
| Slow early traction | Lead with hosted demo + marketplaces + communities, not own-domain SEO | | Slow early traction | Lead with demo + marketplaces + communities, not own-domain SEO |
| Refund chargebacks | Clear scope on landing page, hosted demo lets buyers validate before purchase, working samples included | | Refund chargebacks | Clear scope on landing, demo lets buyers validate, working samples included |
| macOS install friction | Apple Developer Program ($99/yr), code signing + notarization | | macOS install friction | Apple Dev Program ($99/yr), code sign + notarize |
| Browser-launch UX confusion (GUI opens in browser locally) | Single sentence in installer welcome and email; persistent in-app "runs locally, no internet used" message; pywebview native-window wrap as v1.1 enhancement if needed | | Browser-launch UX confusion | One sentence in installer + email; persistent in-app "runs locally" message; pywebview wrap as v1.1 if needed |
| Customer support burden | Robust installers, idiot-proof docs, sample data included, hosted demo lets prospects self-evaluate | | Support burden | Robust installers, idiot-proof docs, sample data included |
| IP theft / resale | License file. Accept this is partial protection; focus on staying ahead via updates | | IP theft / resale | License file. Accept partial protection; focus on staying ahead via updates |
| Platform risk (Gumroad / Lemon Squeezy policy change) | Multi-marketplace from day one; own domain as fallback | | Marketplace policy change | Multi-marketplace day 1; own domain as fallback |
| Streamlit project direction change breaks desktop packaging | Low probability; flagged as criteria-relock trigger in DECISIONS.md Section 8 | | Streamlit direction change | Low probability; flagged as criteria-relock trigger in DECISIONS §8 |
--- ## 12. Success metrics (monthly)
## 12. Success Metrics
Tracked monthly:
- Units sold per bundle. - Units sold per bundle.
- Conversion rate (landing page -> purchase). - Conversion rate (landing → purchase).
- **Demo-to-purchase conversion rate** (added v1.3): hosted demo visits -> Gumroad clicks -> purchases. - **Demo-to-purchase rate** (added v1.3): demo visits → Gumroad clicks → purchases.
- Refund rate (target < 5%). - Refund rate (target < 5%).
- Support tickets per 100 sales (target < 10). - Support tickets / 100 sales (target < 10).
- Organic traffic to product pages. - Organic traffic to product pages.
- Per-platform install success rate (Windows, macOS, Linux). - Per-platform install success.
--- ## 13. Honest status (2026-05-01)
## 13. Honest Status (April 28, 2026) - 3 of 9 tools shipped (Dedup, Text Cleaner, Format Standardizer).
- Cross-platform build pipeline designed, not yet built.
- 1 of 9 scripts is real and tested (`01_deduplicator.py`). The other 8 are skeletons. **Expected at project start.** - macOS code signing not yet set up.
- Cross-platform build pipeline (PyInstaller-based) designed but not yet built. - Streamlit GUI shipped for the 3 ready tools.
- macOS code signing not yet set up (Apple Developer Program enrollment pending).
- Streamlit GUI not yet built (locked as the framework as of v1.3).
- Hosted demo not yet deployed. - Hosted demo not yet deployed.
- No paying customers yet. - No paying customers.
- No live landing page yet. - No live landing page.
**Next concrete steps before any marketing spend**: **Next concrete steps before marketing spend**:
1. Build the Streamlit GUI for the lead script (`01_deduplicator.py`). Apply UX standards from DECISIONS.md Section 4b. 1. Stand up the PyInstaller pipeline with Streamlit launcher (1-3 days first time).
2. Stand up the PyInstaller cross-platform build pipeline with Streamlit launcher (see TECHNICAL.md Sections 3.3 and 3.4). Budget 1-3 days for first-time Streamlit-PyInstaller integration. 2. Deploy constrained demo to Streamlit Community Cloud.
3. Deploy the constrained demo version to Streamlit Community Cloud. 3. Enroll in Apple Developer Program (start in parallel — 1-2 wk lead time).
4. Enroll in Apple Developer Program (1-2 week lead time - start in parallel with the above). 4. Single landing page for the lead bundle, demo prominently linked.
5. Stand up a single landing page for the lead bundle, with the hosted demo prominently linked. 5. Finish 2 more tools to Ready state (CLI + GUI).
6. Finish at least 2 more of the 9 scripts to working state with both CLI and GUI. 6. List on Gumroad with sample output proof, per-platform installers, demo link.
7. List on Gumroad with sample output proof, per-platform installer downloads, and hosted demo link.

@@ -1,431 +1,211 @@
# CLI Reference # CLI Reference
Complete command-line reference for the DataTools bundle. Three CLI modules, one per Ready tool:
DataTools ships two CLI modules so each script can be invoked independently:
| Module | Command | Purpose | | Module | Command | Purpose |
|---|---|---| |--------|---------|---------|
| `src.cli` | `python -m src.cli INPUT_FILE [OPTIONS]` | Deduplicator (script 01) | | `src.cli` | `python -m src.cli FILE` | Deduplicator |
| `src.cli_text_clean` | `python -m src.cli_text_clean INPUT_FILE [OPTIONS]` | Text cleaner (script 02) | | `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Text Cleaner |
| `src.cli_analyze` | `python -m src.cli_analyze FILE` | Analyzer (read-only scan) |
The deduplicator section is below; the text cleaner reference is in [Section: Text Cleaner CLI](#text-cleaner-cli). Every command is **preview-only by default** — add `--apply` to write output.
## Deduplicator ---
# Deduplicator
``` ```
python -m src.cli INPUT_FILE [OPTIONS] python -m src.cli INPUT_FILE [OPTIONS]
``` ```
## Arguments
| Argument | Required | Description |
|----------|----------|-------------|
| `INPUT_FILE` | Yes | Path to the CSV, delimited text, or Excel file to deduplicate |
## Options ## Options
### Core ### Core
- `--apply` — write output files (default: preview).
- `-o, --output PATH` — output path (default `{input}_deduplicated.csv`).
| Flag | Short | Default | Description | ### Column selection
|------|-------|---------|-------------| - `-s, --subset COLS` — comma-separated columns to match on (default: auto-detect).
| `--apply` | | `false` | Write output files. Without this flag, only a preview is shown. | - `-k, --key COLS` — strong-key columns; each becomes an independent exact-match strategy (`fb_id`, `ein`, `sku`).
| `--output` | `-o` | `{input}_deduplicated.csv` | Output file path. |
### Column Selection ### Fuzzy matching
- `--fuzzy COLS` — comma-separated columns to fuzzy-match.
| Flag | Short | Default | Description | - `-a, --algorithm ALG``levenshtein` / `jaro_winkler` (default) / `token_set_ratio`.
|------|-------|---------|-------------| - `-t, --threshold N` — similarity 0-100 (default 85).
| `--subset` | `-s` | auto-detect | Comma-separated columns to match on. When omitted, columns are auto-detected by name pattern (email, phone, name, address). |
| `--key` | `-k` | none | Comma-separated strong-key columns. Each becomes an independent exact-match strategy. Use for identifiers like `fb_id`, `ein`, `sku`. |
### Fuzzy Matching
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--fuzzy` | | none | Comma-separated columns to fuzzy-match. Other columns in the strategy use exact matching. |
| `--algorithm` | `-a` | `jaro_winkler` | Fuzzy algorithm: `levenshtein`, `jaro_winkler`, or `token_set_ratio`. |
| `--threshold` | `-t` | `85` | Similarity threshold 0-100. Lower values find more matches but increase false positives. |
### Normalization ### Normalization
- `--normalize COL:TYPE` — comma-separated `col:type` pairs. Types: `email`, `phone`, `name`, `address`, `string`.
| Flag | Short | Default | Description | | Type | Effect | Example |
|------|-------|---------|-------------| |------|--------|---------|
| `--normalize` | | auto-detect | Column normalizers as `col:type` pairs, comma-separated. Types: `email`, `phone`, `name`, `address`, `string`. | | `email` | lowercase, strip Gmail dots, strip `+tag` | `John.Doe+x@gmail.com` → `johndoe@gmail.com` |
| `phone` | E.164 (+ ext preserved) | `(555) 123-4567 ext 100` → `+15551234567;ext=100` |
| `name` | strip titles + suffixes + particles, case-fold | `Dr. Charles de Gaulle Jr.` → `charles gaulle` |
| `address` | USPS abbrevs + state name → 2-letter, case-fold | `123 Main Street, California` → `123 main st ca` |
| `string` | trim + collapse + case-fold | ` HELLO WORLD ` → `hello world` |
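The `email` row in the table above can be sketched roughly as follows. This is only an illustration, not the bundle's actual normalizer: the name `normalize_email` is hypothetical, and the choice to strip dots only for Gmail domains while stripping `+tag` for every domain is an assumption read from the one-line description.

```python
# Assumption: the domains for which dots in the local part are ignored.
GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}


def normalize_email(raw: str) -> str:
    """Hypothetical sketch of the `email` normalizer row."""
    value = raw.strip().lower()
    local, sep, domain = value.partition("@")
    if not sep:
        # Not an address; return it trimmed and lowercased, nothing more.
        return value
    local = local.split("+", 1)[0]  # strip the `+tag` suffix
    if domain in GMAIL_DOMAINS:
        local = local.replace(".", "")  # Gmail ignores dots in the local part
    return f"{local}@{domain}"


print(normalize_email("John.Doe+x@gmail.com"))  # johndoe@gmail.com
```

Two rows that normalize to the same string would then compare equal under the exact-match strategy for the email column.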
**Normalizer details:** ### Survivor selection
- `--survivor RULE` — `first` (default) / `last` / `most-complete` / `most-recent`.
- `--date-column COL` — required for `most-recent`.
- `--merge` — fill blanks in survivor from removed rows.
| Type | What it does | Example | ### Interactive review
|------|-------------|---------| - `--review` — prompt y/n/s per match group with side-by-side diff.
| `email` | Lowercase, strip Gmail dots, strip `+tag` suffixes | `John.Doe+tag@gmail.com` → `johndoe@gmail.com` |
| `phone` | Parse to E.164 format; fallback: digits only | `(555) 123-4567` → `+15551234567` |
| `name` | Strip titles (Dr., Mr.) and suffixes (Jr., PhD), case-fold | `Dr. John Smith Jr.` → `john smith` |
| `address` | USPS abbreviations (Street→St, Avenue→Ave), case-fold | `123 Main Street, Suite 4` → `123 main st ste 4` |
| `string` | Trim, collapse whitespace, case-fold | ` HELLO WORLD ` → `hello world` |
### Survivor Selection
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--survivor` | | `first` | Which row to keep per duplicate group. |
| `--date-column` | | none | Date column for the `most-recent` rule. |
| `--merge` | | `false` | Fill missing fields in the surviving row from removed duplicates. |
**Survivor rules:**
| Rule | Behavior |
|------|----------|
| `first` | Keep the first row encountered (lowest row number) |
| `last` | Keep the last row encountered (highest row number) |
| `most-complete` | Keep the row with the fewest blank/empty cells |
| `most-recent` | Keep the row with the latest date (requires `--date-column`) |
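As a rough sketch of how the survivor rules and `--merge` might combine, assuming each duplicate group is a list of plain dicts in original row order. The helper names (`is_blank`, `pick_survivor`, `resolve_group`) are hypothetical, ties are assumed to go to the earliest row, and dates are assumed to be ISO-formatted strings so they compare correctly as text.

```python
def is_blank(value) -> bool:
    """Treat None and whitespace-only strings as blank cells."""
    return value is None or str(value).strip() == ""


def pick_survivor(rows, rule="first", date_column=None) -> int:
    """Return the index of the row to keep within one duplicate group."""
    if rule == "first":
        return 0
    if rule == "last":
        return len(rows) - 1
    if rule == "most-complete":
        blanks = [sum(is_blank(v) for v in row.values()) for row in rows]
        return blanks.index(min(blanks))  # assumption: ties go to the earliest row
    if rule == "most-recent":
        if date_column is None:
            raise ValueError("most-recent requires a date column")
        dates = [row[date_column] or "" for row in rows]  # assumption: ISO strings
        return dates.index(max(dates))
    raise ValueError(f"unknown survivor rule: {rule}")


def resolve_group(rows, rule="first", date_column=None, merge=False) -> dict:
    """Pick the survivor, then optionally fill its blanks from removed rows."""
    keep = pick_survivor(rows, rule, date_column)
    survivor = dict(rows[keep])
    if merge:
        for i, row in enumerate(rows):  # removed rows are visited in row order
            if i == keep:
                continue
            for column, value in row.items():
                if is_blank(survivor.get(column)) and not is_blank(value):
                    survivor[column] = value
    return survivor


group = [
    {"name": "John Smith", "email": "", "phone": "555-123-4567"},
    {"name": "", "email": "john@example.com", "phone": ""},
]
print(resolve_group(group, rule="most-complete", merge=True))
```

The survivor never loses a non-empty field; merging only fills cells that were blank.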
### Interactive Review
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--review` | | `false` | Interactively review each match group. For each group, choose: merge (y), keep both (n), or skip remaining (s). |
### Configuration ### Configuration
- `--config PATH` — load all settings from JSON.
- `--save-config PATH` — save current settings to JSON.
| Flag | Short | Default | Description | ### File handling
|------|-------|---------|-------------| - `--sheet NAME|N` — Excel sheet name or 0-based index.
| `--config` | | none | Load all settings from a saved JSON config file. | - `--encoding ENC` — override auto-detected encoding.
| `--save-config` | | none | Save current settings to a JSON config file for reuse. | - `--header-row N` — 0-based header row.
### File Handling
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--sheet` | | first sheet | Excel sheet name or 0-based index. Ignored for CSV files. |
| `--encoding` | | auto-detect | Override auto-detected file encoding (e.g., `utf-8`, `windows-1252`). |
| `--header-row` | | auto-detect | 0-based row index for the header row. |
---
## Recipes ## Recipes
### 1. Basic Dedup (Auto-Detect)
Let the engine detect email, phone, name, and address columns automatically.
```bash ```bash
# Preview # Basic auto-detect dedup
python -m src.cli customers.csv python -m src.cli customers.csv [--apply]
# Apply # Fuzzy name match at 80%
python -m src.cli customers.csv --apply
```
The engine scans column names for patterns like `email`, `phone`, `name`, `address` and builds strategies automatically. Strong keys (email, phone) become standalone strategies; weak keys (name, address) are paired with strong keys.
### 2. Fuzzy Name Matching
Match rows where names are similar but not identical — catches typos, nickname variations, and formatting differences.
```bash
# Fuzzy-match on the "name" column at 80% similarity
python -m src.cli customers.csv --fuzzy name --threshold 80 --apply python -m src.cli customers.csv --fuzzy name --threshold 80 --apply
# Fuzzy-match on multiple columns # Multiple strong keys (OR logic)
python -m src.cli customers.csv --fuzzy name,address --threshold 85 --apply
# Use Levenshtein distance instead of Jaro-Winkler
python -m src.cli customers.csv --fuzzy name --algorithm levenshtein --threshold 80 --apply
```
**Algorithm comparison:**
- `jaro_winkler` (default) — best for short strings like names; weights early characters more heavily
- `levenshtein` — edit-distance ratio; works well for typos and transpositions
- `token_set_ratio` — best for addresses and long strings; ignores word order
### 3. Custom Strong Keys
Use specific identifier columns to find exact duplicates.
```bash
# Deduplicate by Facebook ID
python -m src.cli donors.csv --key fb_id --apply
# Multiple strong keys (each is independent — matched with OR)
python -m src.cli donors.csv --key fb_id,ein --apply python -m src.cli donors.csv --key fb_id,ein --apply
```
Strong keys are OR'd: a match on `fb_id` alone OR `ein` alone marks rows as duplicates. # Most-complete row + merge missing fields
### 4. Merge Mode
Keep the most complete row and fill any remaining blanks from the duplicates.
```bash
# Most complete row + merge missing fields
python -m src.cli contacts.csv --survivor most-complete --merge --apply python -m src.cli contacts.csv --survivor most-complete --merge --apply
# Keep most recent row and merge # Most-recent + merge
python -m src.cli contacts.csv --survivor most-recent --date-column updated_at --merge --apply python -m src.cli contacts.csv --survivor most-recent --date-column updated_at --merge --apply
```
**How merge works:** The survivor row keeps all its non-empty fields. For any blank/null fields, the engine fills from the removed rows (in row order). The result is a single row with maximum data retention. # Interactive review
### 5. Multi-Column Subset
Match on a specific combination of columns rather than auto-detecting.
```bash
# Exact match on email + phone only
python -m src.cli customers.csv --subset email,phone --apply
# Mix exact and fuzzy within a subset
python -m src.cli customers.csv --subset email,name --fuzzy name --threshold 85 --apply
```
When using `--subset`, all listed columns must match (AND logic) for a pair to be considered duplicates.
### 6. Save and Load Config Profiles
Save your settings for repeatable runs on similar files.
```bash
# Save settings to a file
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge \
--survivor most-complete --save-config customer_dedup.json
# Load saved settings
python -m src.cli new_customers.csv --config customer_dedup.json --apply
```
Config files are JSON. Example:
```json
{
"strategies": [],
"survivor_rule": "most_complete",
"merge": true,
"default_algorithm": "jaro_winkler",
"default_threshold": 80.0,
"fuzzy_columns": ["name"]
}
```
### 7. Interactive Review
Step through each match group and decide whether to merge.
```bash
python -m src.cli customers.csv --review --apply python -m src.cli customers.csv --review --apply
# Save / load profile
python -m src.cli customers.csv --fuzzy name --threshold 80 --save-config dedup.json
python -m src.cli new.csv --config dedup.json --apply
# Excel
python -m src.cli data.xlsx --sheet "Sales" --apply
``` ```
For each group, the CLI displays both rows side-by-side and prompts: ## Algorithms
``` - **`jaro_winkler`** (default) — best for short strings (names); weights early chars.
============================================================ - **`levenshtein`** — edit-distance ratio; typos and transpositions.
Match Group 1 — Confidence: 92.3% - **`token_set_ratio`** — best for addresses; ignores word order.
Matched on: name, phone
============================================================
Row 1: ## Auto-detection
name: John Smith
email: john@example.com
phone: (555) 123-4567
Row 2: When no `--subset` / `--fuzzy` flags, columns are detected by name:
name: Jon Smith
email:
phone: 555-123-4567
[y] Merge [n] Keep both [s] Skip remaining: | Pattern | Algorithm | Threshold | Normalizer | Key |
``` |---------|-----------|-----------|------------|-----|
| Email | exact | 100% | email | strong |
| Phone | exact | 100% | phone | strong |
| Name | jaro_winkler | 85% | name | weak |
| Address | token_set_ratio | 80% | address | weak |
- **y** — accept the match; merge/remove duplicate **Strategy rules**: strong keys → standalone OR; weak keys → AND-paired with each strong key; no strong keys → weak promoted to standalone; no patterns → exact match on all columns.
- **n** — reject the match; keep both rows
- **s** — skip all remaining groups (keep both for all)
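The strategy rules can be illustrated with a small sketch. It assumes a strategy is just a list of column names that must all match (AND) and that the list of strategies is evaluated with OR; `build_strategies` and its arguments are hypothetical names, not the engine's API.

```python
def build_strategies(strong, weak, all_columns):
    """Return a list of strategies; each strategy is a list of AND-ed columns."""
    if strong:
        # Each strong key is enough on its own.
        strategies = [[key] for key in strong]
        # Each weak key only counts when paired with a strong key.
        strategies += [[w, s] for w in weak for s in strong]
        return strategies
    if weak:
        # No strong keys: weak keys are promoted to standalone strategies.
        return [[key] for key in weak]
    # Nothing detected: exact match on every column.
    return [list(all_columns)]


columns = ["email", "phone", "name", "notes"]
print(build_strategies(["email", "phone"], ["name"], columns))
print(build_strategies([], ["name"], columns))
print(build_strategies([], [], columns))
```

Under this reading, a file with `email`, `phone`, and `name` columns yields four strategies: two standalone strong keys plus `name` paired with each of them.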
### 8. Excel Files and Multi-Sheet ## Output files (with `--apply`)
Work with Excel files directly — no CSV conversion needed. | File | Contents |
|------|----------|
| `{stem}_deduplicated.csv` | Cleaned data |
| `{stem}_removed.csv` | Removed rows |
| `{stem}_match_groups.csv` | `_group_id`, `_is_survivor`, `_confidence`, `_matched_on`, `_original_row` + originals |
```bash Log: `logs/dedup_YYYYMMDD_HHMMSS.log`.
# Deduplicate first sheet (default)
python -m src.cli data.xlsx --apply
# Specify sheet by name
python -m src.cli data.xlsx --sheet "Sales Data" --apply
# Specify sheet by index (0-based)
python -m src.cli data.xlsx --sheet 1 --apply
```
Output is always CSV by default. To write Excel output, use `-o`:
```bash
python -m src.cli data.xlsx -o cleaned.xlsx --apply
```
--- ---
## Auto-Detection Details # Text Cleaner
When no `--subset` or `--fuzzy` flags are provided, the engine scans column names and builds strategies:
| Column pattern | Detection regex | Algorithm | Threshold | Normalizer | Key type |
|---------------|----------------|-----------|-----------|------------|----------|
| Email | `e[-_]?mail` | exact | 100% | email | strong |
| Phone | `phone\|telephone\|mobile\|cell` | exact | 100% | phone | strong |
| Name | `^(name\|full_name\|customer_name\|...)$` | jaro_winkler | 85% | name | weak |
| Address | `address\|street\|addr` | token_set_ratio | 80% | address | weak |
**Strategy building rules:**
- Strong keys → standalone OR strategies (email match alone is enough)
- Weak keys → paired with each strong key via AND (name match requires email or phone match too)
- No strong keys found → weak keys promoted to standalone
- No patterns matched → exact match on all columns (equivalent to `drop_duplicates`)
## Output Files
When `--apply` is set, three files are written:
| File | Description |
|------|-------------|
| `{stem}_deduplicated.csv` | Cleaned DataFrame with duplicates removed |
| `{stem}_removed.csv` | Rows that were removed |
| `{stem}_match_groups.csv` | Audit trail with `_group_id`, `_is_survivor`, `_confidence`, `_matched_on`, `_original_row`, plus all original columns |
## Logging
Every run writes a timestamped log to `logs/dedup_YYYYMMDD_HHMMSS.log` with full debug-level details: strategies used, pair comparisons, survivor decisions, and merge actions.
---
# Text Cleaner CLI
Character-level hygiene for CSV / Excel files: whitespace trim and collapse, smart-character folding, Unicode normalization, BOM strip, control-char strip, line-ending normalization, optional case conversion. See TECHNICAL.md Section 10.2 for the full functional spec.
``` ```
python -m src.cli_text_clean INPUT_FILE [OPTIONS] python -m src.cli_text_clean INPUT_FILE [OPTIONS]
``` ```
## Arguments Character-level hygiene. See [TECHNICAL.md §10.2](TECHNICAL.md) for the spec.
| Argument | Required | Description |
|----------|----------|-------------|
| `INPUT_FILE` | Yes | Path to the CSV, TSV, or Excel file to clean |
## Options ## Options
### Core ### Core
- `--apply` — write output (default: preview).
| Flag | Short | Default | Description | - `-o, --output PATH` — output path (default `{input}_cleaned.csv`).
|------|-------|---------|-------------| - `--preset NAME` — `minimal` / `excel-hygiene` (default) / `paranoid`.
| `--apply` | | `false` | Write output files. Without this flag, only a preview is shown. |
| `--output` | `-o` | `{input}_cleaned.csv` | Output file path. |
| `--preset` | | `excel-hygiene` | Preset bundle of safe defaults. See [Presets](#presets). |
### Scope ### Scope
- `--columns COLS` — comma-separated columns to clean (default: all string columns).
- `--skip COLS` — exclude these columns.
| Flag | Default | Description | ### Per-op overrides (override the active preset)
|------|---------|-------------| - `--no-trim`, `--no-collapse`, `--no-nfc`, `--nfkc`, `--no-smart-chars`, `--no-zero-width`, `--no-bom`, `--no-control`, `--no-line-endings`.
| `--columns` | all string columns | Comma-separated columns to clean. |
| `--skip` | none | Comma-separated columns to skip even if they look like text. Useful for free-text notes columns you don't want touched. |
### Per-operation toggles ### Case
- `--case MODE` — `upper` / `lower` / `title` / `sentence`. Or per-column: `--case title:name,upper:sku`.
- Title case preserves all-caps tokens (`USA`) and lowercases mid-string particles (`of`, `and`).
These override the active preset. ### Audit + config
- `--full-changelog` — write every change (default caps to first 1000).
- `--config PATH` / `--save-config PATH`.
| Flag | Effect | ### File
|------|--------| - `--sheet`, `--encoding`, `--header-row` — same as Deduplicator.
| `--no-trim` | Disable leading/trailing whitespace strip |
| `--no-collapse` | Disable internal whitespace collapse |
| `--no-nfc` | Disable Unicode NFC normalization |
| `--nfkc` | Enable NFKC compatibility fold (lossy: `①` → `1`, `ﬁ` → `fi`) |
| `--no-smart-chars` | Disable smart-character folding (curly quotes, em/en-dash, NBSP, ellipsis) |
| `--no-zero-width` | Disable zero-width / invisible character strip |
| `--no-bom` | Disable leading BOM strip |
| `--no-control` | Disable control-character strip |
| `--no-line-endings` | Disable line-ending normalization |
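To show why NFKC is described as lossy while NFC is on by default, here is a short sketch using Python's standard `unicodedata` module. It demonstrates the two Unicode normalization forms themselves; how the text cleaner wires them in is not shown here.

```python
import unicodedata

decomposed = "Cafe\u0301"    # "e" followed by a combining acute accent
compat = "\u2460 \ufb01le"   # circled digit one, then the "fi" ligature + "le"

# NFC only composes canonically equivalent sequences; the text looks the same.
nfc_text = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(nfc_text))  # 5 4

# NFC leaves compatibility characters alone.
nfc_compat = unicodedata.normalize("NFC", compat)
print(nfc_compat == compat)  # True

# NFKC replaces them with plain equivalents, which cannot be undone.
nfkc_compat = unicodedata.normalize("NFKC", compat)
print(nfkc_compat)  # 1 file
```

After NFKC there is no way to tell whether the source cell held `①` or `1`, which is the sense in which the fold is lossy.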
### Case conversion
| Flag | Forms | Description |
|------|-------|-------------|
| `--case` | `upper`, `lower`, `title`, `sentence` | Apply this case to every selected column |
| `--case` | `mode:col[,mode:col]` | Per-column case (e.g., `--case title:name,upper:code`) |
Title case preserves all-caps tokens (`USA` stays `USA`) and lowercases mid-string particles (`of`, `and`, `the`, etc.).
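A minimal sketch of the title-case behaviour described above, under stated assumptions: the particle list is a guess (the reference only names `of`, `and`, `the`, "etc."), words are split on single spaces, and a leading particle is still capitalized. The name `title_case` is hypothetical.

```python
# Assumption: an illustrative particle list; the real list is not given here.
PARTICLES = {"of", "and", "the", "in", "on", "for", "to", "a", "an"}


def title_case(text: str) -> str:
    """Title-case text, keeping all-caps tokens and lowercasing mid-string particles."""
    result = []
    for position, word in enumerate(text.split(" ")):
        if len(word) > 1 and word.isupper():
            result.append(word)              # all-caps token such as USA stays as-is
        elif position > 0 and word.lower() in PARTICLES:
            result.append(word.lower())      # particle in the middle of the string
        else:
            result.append(word.capitalize())
    return " ".join(result)


print(title_case("the university of USA and canada"))
# The University of USA and Canada
```

Note that in this sketch an all-caps particle such as `OF` would be kept as-is, because the all-caps check runs first; the real tool may order the two rules differently.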
### Audit and config
| Flag | Default | Description |
|------|---------|-------------|
| `--full-changelog` | `false` | Write every cell change to the audit CSV (default caps to first 1000). |
| `--config` | none | Load options from a saved JSON config file. |
| `--save-config` | none | Save the current options to a JSON config file. |
### File format / encoding
| Flag | Default | Description |
|------|---------|-------------|
| `--sheet` | `0` | Excel sheet name or 0-based index. |
| `--encoding` | auto-detect | Override auto-detected file encoding. |
| `--header-row` | auto-detect | 0-based row index for the header. |
## Presets ## Presets
| Preset | What it does | | Preset | What it does |
|---|---| |--------|--------------|
| `minimal` | Trim + collapse whitespace only. Nothing else. | | `minimal` | Trim + collapse only. |
| `excel-hygiene` (default) | Trim, collapse, NFC, smart-character fold, zero-width strip, BOM strip, control strip, line-ending normalize. NFKC off. | | `excel-hygiene` (default) | Trim, collapse, NFC, smart-char fold, zero-width strip, BOM strip, control strip, line-ending normalize. |
| `paranoid` | All of `excel-hygiene` plus NFKC compatibility fold (lossy). | | `paranoid` | `excel-hygiene` + NFKC compatibility fold (lossy). |
## Output Files
When `--apply` is set:
| File | Description |
|------|-------------|
| `{stem}_cleaned.csv` | Cleaned DataFrame |
| `{stem}_changes.csv` | Per-cell audit: `row`, `column`, `old`, `new`, `ops_applied` (capped to 1000 rows by default; use `--full-changelog` for all) |
A timestamped log is always written to `logs/text_clean_YYYYMMDD_HHMMSS.log`.
## Recipes ## Recipes
```bash ```bash
# Preview what would change with the safe defaults # Safe defaults (preview, then apply)
python -m src.cli_text_clean messy.csv python -m src.cli_text_clean messy.csv [--apply]
# Apply the safe defaults # Just trim + collapse, leave Unicode alone
python -m src.cli_text_clean messy.csv --apply
# Just the basics — only trim and collapse, leave Unicode/quotes alone
python -m src.cli_text_clean messy.csv --preset minimal --apply python -m src.cli_text_clean messy.csv --preset minimal --apply
# Title-case the name column, upper-case the SKU column, leave others alone for case # Title-case names, upper-case SKUs
python -m src.cli_text_clean people.csv --case title:name,upper:sku --apply python -m src.cli_text_clean people.csv --case title:name,upper:sku --apply
# Clean only specific columns # Clean only specific columns
python -m src.cli_text_clean orders.csv --columns vendor,product --apply python -m src.cli_text_clean orders.csv --columns vendor,product --apply
# Skip a free-text notes column from cleaning # Skip a free-text notes column
python -m src.cli_text_clean tickets.csv --skip notes --apply python -m src.cli_text_clean tickets.csv --skip notes --apply
# Save the current settings as a profile and reload it later
python -m src.cli_text_clean messy.csv --preset minimal --case upper --save-config my.json
python -m src.cli_text_clean other.csv --config my.json --apply
``` ```
## Output files (with `--apply`)
| File | Contents |
|------|----------|
| `{stem}_cleaned.csv` | Cleaned data |
| `{stem}_changes.csv` | `row`, `column`, `old`, `new`, `ops_applied` (capped to 1000; `--full-changelog` removes cap) |
Log: `logs/text_clean_YYYYMMDD_HHMMSS.log`.
--- ---
## Analyzer (upload-time scan) # Analyzer
``` ```
python -m src.cli_analyze INPUT_FILE [OPTIONS] python -m src.cli_analyze INPUT_FILE [OPTIONS]
--sample-rows N Cap on rows scanned (default 1000)
--json Print findings as a JSON array on stdout
--strict Exit non-zero on any warn/error finding
``` ```
JSON output schema (one object per finding): Read-only scan; surfaces every detector finding without modifying the file.
## Options
- `--sample-rows N` — cap on rows scanned (default 1000).
- `--json` — print findings as a JSON array on stdout.
- `--strict` — exit non-zero on any warn/error finding.
## JSON schema (one object per finding)
```json ```json
{ {
@@ -442,10 +222,14 @@ JSON output schema (one object per finding):
} }
``` ```
- `severity` — `info` / `warn` / `error`. Only `error` blocks the GUI normalization gate. ## Field meanings
- `confidence` — `high` (round-trip-safe, eligible for one-click auto-fix), `medium` (preview before applying), `low` (heuristic, opt-in only). - `severity` — `info` / `warn` / `error`. Only `error` blocks the GUI gate.
- `fix_action` — stable id naming the algorithm in `src/core/fixes.py` that resolves the finding. Empty string for informational-only findings. - `confidence` — `high` (one-click), `medium` (preview), `low` (opt-in).
- `pre_applied` — `true` for fixes already applied during the byte-level read pass (BOM strip, NUL strip, line-ending normalize, byte-level smart-quote fold, transcode-to-UTF-8 from UTF-16/32). The GUI gate treats these as already-resolved; the CLI emits them so callers can audit what changed during read. - `fix_action` — id of the algorithm in `src/core/fixes.py`. Empty for informational-only.
- `pre_applied` — `true` for fixes already applied during the byte-level read pass.
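A caller consuming the `--json` output might triage findings along these lines. This is a sketch under assumptions: it uses only the four fields described above, treats `pre_applied` findings as already resolved, and the sample findings (including the `fix_action` ids) are made-up placeholders rather than real analyzer output. `triage` is a hypothetical helper.

```python
import json


def triage(findings):
    """Split findings into gate-blocking ones and one-click-fixable ones."""
    open_findings = [f for f in findings if not f.get("pre_applied", False)]
    blocking = [f for f in open_findings if f["severity"] == "error"]
    one_click = [
        f for f in open_findings
        if f["confidence"] == "high" and f.get("fix_action")
    ]
    return blocking, one_click


# Placeholder data for illustration only; not real analyzer output.
sample = json.loads("""
[
  {"severity": "error", "confidence": "medium", "fix_action": "example_fix_a", "pre_applied": false},
  {"severity": "warn",  "confidence": "high",   "fix_action": "example_fix_b", "pre_applied": false},
  {"severity": "info",  "confidence": "high",   "fix_action": "",              "pre_applied": false},
  {"severity": "error", "confidence": "high",   "fix_action": "example_fix_c", "pre_applied": true}
]
""")

blocking, one_click = triage(sample)
print(len(blocking), len(one_click))  # 1 1
```

In practice the list would come from parsing the analyzer's stdout rather than an inline string.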
The detector set covers smart punctuation, NBSP / Unicode whitespace, zero-width characters, dirty headers, whitespace padding, null-like sentinels, mojibake fingerprints (UTF-8-as-cp1252), mixed-case email columns, near-duplicate rows (case-and-padding stripped), leading-zero IDs (Excel hazard), mixed line endings, encoding decode failure (`encoding_decode_failed`), and U+FFFD presence in the loaded text (`encoding_uncertain`). New detectors plug in by appending one entry to `analyze.py` and one matching fix in `fixes.py`. ## Detectors
Smart punctuation, NBSP / Unicode whitespace, zero-width chars, dirty headers, whitespace padding, null-like sentinels, mojibake fingerprints, mixed-case email columns, inconsistent date formats, near-duplicate rows, leading-zero IDs, mixed line endings, encoding decode failure, U+FFFD presence.
Add a detector: append entry in `analyze.py` + matching fix in `fixes.py`. No other call sites change.

@@ -1,269 +1,189 @@
# DECISIONS.md - Locked Criteria, Scoring Rubric, Decision Log # Decisions
> **Creator-only document. Do not ship to buyers.** > Creator-only. Locked criteria, scoring rubric, decision log.
> **Version**: 1.6 · **Updated**: 2026-05-01
**Version**: 1.6 ## 1. Locked operating criteria
**Last updated**: April 28, 2026
This document captures the original locked operating criteria, the scoring framework used to select the product category, the platform-model evaluation, and key decisions with rationale. It exists so future-you (or a recovery rebuild) can reconstruct *why* the project is what it is, not just *what* it is.
---
## 1. Locked Operating Criteria
These are the constraints, targets, and goals the product strategy must satisfy. Established at project start. Any change to these requires an explicit re-lock.
### Constraints ### Constraints
1. Cash budget ≤ $1,200/mo recurring. No external funding.
| # | Criterion | Notes | 2. Time ≤ 10 hr/wk. Build-once assets preferred.
|---|---|---| 3. Skill set: database design, data pipelines, programming. Every opportunity must leverage these.
| 1 | Cash budget ≤ $1,200/month | Recurring monthly only; no large one-time capital, no external funding | 4. Network: none. Zero reliance on personal connections.
| 2 | Time available ≤ 10 hours/week | Strong preference for build-once assets generating revenue for years with minimal maintenance |
| 3 | Skill set: Database Design, Data Pipelines, Data Aggregation, Programming | Every opportunity must directly leverage these |
| 4 | Existing network: none | Zero reliance on personal connections for acquisition, sales, or operations |
### Targets ### Targets
5. First revenue: 15 days preferred, 90 days hard stop.
| # | Target | Notes | 6. Revenue ceiling: tiered (BUSINESS §6). Realistic 12-mo: $5k/mo.
|---|---|---| 7. Lifestyle cashflow goal. No saleable-asset exit required.
| 5 | Time to first revenue: 15 days preferred, 90 days hard stop | | 8. Distribution: fully async, no-touch. Revisit at $5k/mo.
| 6 | Revenue ceiling: tiered (see BUSINESS.md Section 6) | Revised from original $50k/mo. Realistic 12-month target: $5k/mo | 9. Work pattern: deep work + recovery. No real-time on-call.
| 7 | Lifestyle cashflow goal | Sustainable for several years, no saleable-asset exit required |
| 8 | Distribution: fully async, no-touch, automated | Revisit at $5k/mo (see BUSINESS.md Section 8) |
| 9 | Day-to-day work pattern: deep work + recovery periods | No real-time on-call or customer-facing constraints |
### Goals ### Goals
10. Escape 9-5 W2 employment without stability concerns. (Primary)
11. Free up time for retirement lifestyle, optional enjoyable work. (Secondary)
| # | Goal | Priority | ### Internal contradictions
|---|---|---|
| 10 | Escape 9-5 W2 employment without stability concerns | Primary |
| 11 | Free up time for retirement lifestyle, optional enjoyable work | Secondary |
### No Internal Contradictions "Fully async + 15-day-to-revenue + no network" is tight but workable. Caveat in BUSINESS §8: revisit async at $5k/mo.
The original criteria were checked for tension. The "fully async + 15-day to first revenue + no network" combination is tight but workable, with the caveat documented in BUSINESS.md Section 8 (revisit async constraint at $5k/mo). ## 2. Scoring rubric
--- Each candidate scored 1-5 on 6 dimensions. Total /30 → verdict.
## 2. Scoring Rubric
Every business candidate was scored 1-5 on six dimensions. Total /30, then mapped to verdict.
| Dimension | What it measures | | Dimension | What it measures |
|---|---| |-----------|------------------|
| Fit to locked criteria | Direct match to constraints 1-4 and targets 5-9. **Any 1 is a hard kill.** | | Fit to locked criteria | Direct match to constraints 1-4 + targets 5-9. **Any 1 = hard kill.** |
| Demand durability | Structural shift vs. trend peak. Will this still pay in 3 years? | | Demand durability | Structural shift vs. trend peak. Pays in 3 yr? |
| Defensibility | What stops the next entrant from copying it. | | Defensibility | What stops the next entrant. |
| Unit economics realism | CAC, payback period, gross margin, working capital. | | Unit economics realism | CAC, payback, gross margin, working capital. |
| Operator fit | Skills, capital, time, stomach for the work. | | Operator fit | Skills, capital, time, stomach. |
| Exit / cash-flow optionality | Multiple paths to revenue, optionality on later changes. | | Exit / cash-flow optionality | Multiple revenue paths. |
**Verdict mapping**: PURSUE / INVESTIGATE / PASS / KILL based on total score and any hard-kill dimension. **Verdict**: PURSUE / INVESTIGATE / PASS / KILL.
**Calibration note added in v1.1**: The original scoring inflated unit economics for the lead candidate by treating near-100% gross margin as 5/5 without accounting for CAC under the "no network" constraint. A more honest score for the Python Bundles category is 7.0-7.5/10, not 8.7/10. The strategy is still sound; the optimism just needed deflating. **v1.1 calibration**: original scoring inflated unit economics by treating ~100% gross margin as 5/5 without accounting for CAC under "no network." Honest score: 7.0-7.5/10 (was 8.7). Strategy still sound; optimism deflated.
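
A minimal sketch of the rubric arithmetic, assuming "Any 1 = hard kill" means a score of 1 on the fit dimension. The dimension keys and the `score_candidate` helper are hypothetical. The total-to-verdict thresholds are not stated in this document, so the sketch stops at the total and the hard-kill flag.

```python
# Hypothetical keys for the six rubric dimensions.
DIMENSIONS = (
    "fit_to_locked_criteria",
    "demand_durability",
    "defensibility",
    "unit_economics_realism",
    "operator_fit",
    "exit_cashflow_optionality",
)


def score_candidate(scores: dict) -> dict:
    """Sum six 1-5 scores (max 30) and flag the hard-kill condition."""
    for dim in DIMENSIONS:
        value = scores.get(dim)
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{dim} must be an integer from 1 to 5, got {value!r}")
    total = sum(scores[dim] for dim in DIMENSIONS)
    # Assumption: a 1 on fit-to-criteria kills the candidate regardless of total.
    hard_kill = scores["fit_to_locked_criteria"] == 1
    return {"total": total, "max": 5 * len(DIMENSIONS), "hard_kill": hard_kill}
```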
--- ## 3. Candidate evaluation
## 3. Candidate Evaluation Summary
Five candidates were evaluated against the locked criteria. Top three:
| Rank | Candidate | Score | Verdict | | Rank | Candidate | Score | Verdict |
|---|---|---|---| |------|-----------|-------|---------|
| 1 | Niche Python Automation Script Bundles | 8.7/10 (original) / ~7.5/10 (calibrated) | **PURSUE** | | 1 | Niche Python Automation Script Bundles | 8.7/10 → 7.5/10 (calibrated) | **PURSUE** |
| 2 | Curated Datasets | 8.7/10 | PURSUE (deferred) | | 2 | Curated Datasets | 8.7/10 | PURSUE (deferred) |
| 3 | Hosted Data Pipeline Micro-Tool | 8.3/10 | INVESTIGATE | | 3 | Hosted Data Pipeline Micro-Tool | 8.3/10 | INVESTIGATE |
**Why #1 was selected over #2**: **Why #1 over #2**: faster path to first revenue (digital download vs. ongoing curation pipeline). Lower ongoing maintenance. Direct programming leverage. Better fit for "build once, sell many."
- Faster path to first revenue (digital download vs. ongoing data curation pipeline).
- Lower ongoing maintenance after launch.
- Direct leverage of programming skills, not just data acquisition.
- Better fit for the "build once, sell many times" preference in criterion 2.
**Why others were ranked lower**: **Rejected**: Notion Templates (weak skill leverage), Query Optimizer SaaS (recurring infra conflicts with lifestyle/maintenance constraint).
- Notion Templates: weaker leverage of programming skills.
- Query Optimizer (SaaS): introduces hosting, support, and recurring infrastructure costs that conflict with the lifestyle / minimal maintenance constraint.
--- ## 4. Platform model
## 4. Platform Model Decision (How to Sell) | Model | Verdict |
|-------|---------|
| **Standalone tools, dual CLI + GUI (chosen)** | **CHOSEN** (revised v1.2). Build once, no hosting, no SaaS support. GUI captures non-tech buyer; CLI captures power users. |
| SaaS web app | Rejected. Recurring hosting + support conflicts with minimal-maintenance constraint. |
| CLI-only | Rejected (revised v1.2). Wrong fit for non-tech buyer; produces refunds. |
| Browser extension | Rejected. Sandbox limits, wrong tool for files. |
| Notion / Airtable templates | Rejected. Doesn't leverage programming. |
Models considered for the lead bundle: **v1.2 rationale**:
- Buyer persona ("hates Excel work but can't code") won't learn a CLI. Refunds at this price.
- Deduplicator needs interactive review — not viable in pure CLI.
- Dual interface keeps CLI for automation without sacrificing primary buyer surface.
| Model | Pros | Cons | Verdict | ## 4a. Functional scope principle (v1.2)
|---|---|---|---|
| **Standalone tools, dual CLI + GUI interface (chosen)** | Build once, sell forever. No hosting. No SaaS support burden. Direct skill match. GUI captures non-technical buyer; CLI captures power users and automation use cases. | Requires installer for non-technical buyers. Some platform friction (signing, etc.). GUI adds build cost vs. CLI-only. | **CHOSEN (revised v1.2)** |
| SaaS web app | Recurring revenue. Easy install. | Ongoing hosting cost, support burden, SaaS scrutiny. Conflicts with "minimal maintenance" criterion. | Rejected |
| CLI-only | Lowest build cost | Wrong fit for non-technical buyer persona. Will produce refunds. | Rejected (revised v1.2) |
| Browser extension | Easy install | Limited by browser sandbox. Wrong tool for data file processing. | Rejected |
| Notion / Airtable templates | Fast to ship | Doesn't leverage programming skills. Low defensibility. | Rejected |
**Decision (revised v1.2)**: Ship as standalone tools with **both** a CLI and a GUI front-end sharing the same core logic. Packaged with cross-platform installers (PyInstaller-based) so the buyer experience approximates a native app. GUI is no longer "deferred"; it is required at v1 launch. **Decision**: each script ships **complete coverage of the workflow it names**, including features Excel does free.
**Rationale for the v1.2 revision**: **Why**: one-stop shopping is the value. Forcing buyers to bounce between this product and Excel/OpenRefine for parts of one task defeats the value prop.
- The buyer persona ("hate repetitive Excel work but cannot code") will not learn a CLI. CLI-only at this price point produces refunds.
- The deduplicator specifically requires interactive review of fuzzy-match candidates. That UX is not viable in pure CLI.
- A dual-interface design keeps the CLI for power users and future automation/scheduling use cases without sacrificing the primary buyer experience.
--- **Anti-rule**: not license to scope-creep. Boundary = the named workflow. Dedup includes normalization + survivor + audit. NOT format conversion or charting (those belong to other scripts).
## 4a. Functional Scope Principle (added v1.2) ## 4b. UX standards for GUI (v1.2 — load-bearing)
**Decision**: Each script ships with **complete functional coverage of the problem it names**, including features available for free elsewhere (e.g., Excel's built-in exact-match dedup). | Standard | What it means |
|----------|---------------|
| Works out of the box | Drop file → useful result, zero config. |
| Sensible defaults visible | Every option has a default that works for the common case. |
| Progressive disclosure | Default view = file uploader + go button + results. Advanced in expander panes. |
| Plain-English labels | "Find duplicates" not "Apply Levenshtein at 0.85". Tooltips carry technical detail. |
| Visible safety | Dry-run / preview by default. Original input never modified. |
| No multi-step setup | Single window for the basic task. |
| Errors name problem + fix | "Column 'email' not found. Available: name, phone. Did you mean 'phone'?" not `KeyError`. |
| Identical core to CLI | No drift. Anything CLI does, GUI does (minus interactive review = GUI-natural). |
**Rationale**: The product is "one-stop shopping" for the buyer's data-cleaning workflow. Forcing a buyer to bounce between this product and Excel/OpenRefine/etc. for parts of a single task defeats the value proposition. A buyer cleaning a customer list expects exact dedup, fuzzy dedup, normalization, and survivor-merge in one tool. Splitting that across products is what they paid to avoid. **"Intuitive enough" test**: a non-technical user who's never seen the tool can complete the lead use case on first launch with no docs read.
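
A minimal sketch of the "errors name problem + fix" standard, using the standard library's `difflib.get_close_matches` to suggest a column. The function name `column_not_found_message` and the 0.6 cutoff are assumptions for illustration, not the project's actual helper.

```python
import difflib
from typing import Sequence


def column_not_found_message(requested: str, available: Sequence[str]) -> str:
    """Build a plain-English message: the problem, the options, and a suggestion."""
    parts = [
        f"Column '{requested}' not found.",
        f"Available: {', '.join(available)}.",
    ]
    # Suggest only when something is reasonably close; otherwise stay silent.
    close = difflib.get_close_matches(requested, list(available), n=1, cutoff=0.6)
    if close:
        parts.append(f"Did you mean '{close[0]}'?")
    return " ".join(parts)
```

For example, `column_not_found_message("emial", ["name", "email", "phone"])` would name the missing column, list the three available ones, and suggest `email`.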
**Consequence for design**: Do not omit a feature on the grounds that "Excel does this for free." If it belongs to the workflow, it belongs in the script. ## 4c. GUI framework: Streamlit (v1.3)
**Anti-rule**: This is not license to scope-creep. The boundary is "the workflow this script names." A deduplicator includes everything dedup-adjacent (normalization, survivor selection, audit). It does not include format conversion, charting, or anything outside the dedup workflow. Those belong to other scripts in the bundle.
---
## 4b. UX Standards for GUI Front-End (added v1.2)
The GUI is the primary buyer surface. These standards are load-bearing.
| Standard | What it means in practice |
|---|---|
| **Works out of the box** | Dropping any reasonable CSV / XLSX onto the GUI must produce a useful result with zero configuration. The buyer should never see a config screen on first run. |
| **Sensible defaults everywhere** | Every option has a default that works for the most common case. Defaults are visible (so the user understands what is being applied) but not blocking. |
| **Progressive disclosure** | Advanced options exist but are tucked behind an "Advanced" or "Settings" pane. The default view shows the minimum needed for a first run. |
| **Plain-English labels** | No technical jargon in primary UI. "Find duplicates" not "Apply Levenshtein matching with token_set_ratio threshold". Tooltips can carry the technical detail for users who want it. |
| **Visible safety** | Dry-run / preview by default. The user sees what *would* change before any file is written. Original input is never modified. |
| **No multi-step setup** | If the GUI requires more than a single window (file picker + go button + results view) to complete a basic task, it has failed this standard. |
| **Errors that name the problem and the fix** | "Column 'email' not found in this file. Available columns: name, phone, address. Did you mean 'phone'?" not "KeyError: 'email'". |
| **Identical core to CLI** | The GUI and CLI are two front-ends over the same library code. Anything the CLI can do, the GUI can do. Anything the GUI can do, the CLI can do (possibly minus interactive review). No drift. |
**Test for "intuitive enough"**: A non-technical person who has never seen the tool can complete the lead use case (dedup a customer list with one or more confidence levels) on first launch with no documentation read. If that test fails on real users, the GUI is not yet shippable.
---
## 4c. GUI Framework Decision: Streamlit (added v1.3)
**Chosen**: Streamlit.
### Frameworks evaluated
| Framework | Verdict | | Framework | Verdict |
|---|---| |-----------|---------|
| **Streamlit** | **CHOSEN** | | **Streamlit** | **CHOSEN** |
| Tkinter + CustomTkinter | Rejected (CustomTkinter maintenance status confirmed inactive: last release Jan 2024, ~28 months old as of decision date; Snyk classifies as Inactive project) | | Tkinter + CustomTkinter | Rejected — maintainer absent (last release Jan 2024, ~28 mo). Snyk: Inactive. |
| Plain Tkinter | Rejected (UX quality below what a $49-79 product justifies in 2026 without significant hand-styling work) | | Plain Tkinter | Rejected: UX gap unacceptable at $49-79 in 2026 without heavy hand-styling. |
| Flet | Rejected (ecosystem too young for a build-once-maintain-for-years product) | | Flet | Rejected: ecosystem too young for build-once-maintain-for-years. |
| PySide6 / Qt | Rejected (overkill for this product tier; steepest learning curve, largest bundles) | | PySide6 / Qt | Rejected: overkill, steepest learning curve, biggest bundles. |
| NiceGUI | Rejected (similar pattern to Streamlit but smaller community and less mature data-tool ergonomics) | | NiceGUI | Rejected — same browser tradeoff as Streamlit, smaller community + ecosystem. |
### Full evaluation matrix (added v1.6) ### Scored matrix (1-5, 5 = best for this product)
Promoted from chat-history-only into docs in v1.6 to lock the rejection reasoning against re-litigation. Scored 1-5 where 5 is best for *this specific product*. | Dimension | Tk | Tk+CTk | Streamlit | Flet | PySide6 | NiceGUI |
|-----------|----|----|-----------|------|---------|---------|
| Dimension | Tkinter | Tk+CTk | Streamlit | Flet | PySide6 | NiceGUI | | Non-tech UX | 1 | 3 | 4 | 4 | 5 | 4 |
|---|---|---|---|---|---|---| | Native window (no browser) | 5 | 5 | 1 | 5 | 5 | 1 |
| Non-tech UX quality (look + feel) | 1 | 3 | 4 | 4 | 5 | 4 | | Build speed v1 | 3 | 3 | 5 | 4 | 2 | 4 |
| "Native window opens" (no browser) | 5 | 5 | 1 | 5 | 5 | 1 | | Build speed per feature | 3 | 3 | 5 | 4 | 2 | 4 |
| Build speed for v1 | 3 | 3 | 5 | 4 | 2 | 4 | | PyInstaller compat | 5 | 4 | 2 | 3 | 3 | 2 |
| Build speed per added feature | 3 | 3 | 5 | 4 | 2 | 4 | | Bundle size (smaller better) | 5 | 4 | 1 | 3 | 2 | 1 |
| PyInstaller compatibility (low friction) | 5 | 4 | 2 | 3 | 3 | 2 | | Maintenance burden | 4 | 3 | 4 | 3 | 4 | 3 |
| Bundle size (smaller = better) | 5 | 4 | 1 | 3 | 2 | 1 | | Ecosystem maturity | 5 | 3 | 4 | 2 | 5 | 3 |
| Maintenance burden over time | 4 | 3 | 4 | 3 | 4 | 3 | | Solo-dev learning curve | 4 | 4 | 5 | 4 | 2 | 4 |
| Ecosystem maturity / longevity bet | 5 | 3 | 4 | 2 | 5 | 3 | | Drop-file-see-result fit | 3 | 3 | 5 | 4 | 4 | 5 |
| Solo dev learning curve | 4 | 4 | 5 | 4 | 2 | 4 |
| Suits "drop file, see result" pattern | 3 | 3 | 5 | 4 | 4 | 5 |
| **Total /50** | **38** | **37** | **38** | **36** | **34** | **35** | | **Total /50** | **38** | **37** | **38** | **36** | **34** | **35** |
**The total is misleading on purpose.** Equal totals hide that these options fail differently. Tkinter ties Streamlit on the sum but loses on look-and-feel and data-app fit (the dimensions that matter most for this product). The verdict is in the per-dimension story, not the sum. **Sums lie.** Tk ties Streamlit but loses on look-and-feel + data-app fit (the dimensions that matter). Verdict is per-dimension, not total.
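
A small arithmetic check of the matrix, with the cell scores copied from the ten dimension rows as printed. Summed this way the columns come to 38 / 35 / 36 / 36 / 34 / 31, which differs from the Total row for Tk+CTk (37), Streamlit (38), and NiceGUI (35). Either some cells or those totals may have been mistranscribed, so recheck them against the original scoring. The per-dimension reading is unaffected.

```python
FRAMEWORKS = ["Tk", "Tk+CTk", "Streamlit", "Flet", "PySide6", "NiceGUI"]

# Cell scores copied row by row from the matrix (1-5, 5 = best).
SCORES = {
    "Non-tech UX":                  [1, 3, 4, 4, 5, 4],
    "Native window (no browser)":   [5, 5, 1, 5, 5, 1],
    "Build speed v1":               [3, 3, 5, 4, 2, 4],
    "Build speed per feature":      [3, 3, 5, 4, 2, 4],
    "PyInstaller compat":           [5, 4, 2, 3, 3, 2],
    "Bundle size (smaller better)": [5, 4, 1, 3, 2, 1],
    "Maintenance burden":           [4, 3, 4, 3, 4, 3],
    "Ecosystem maturity":           [5, 3, 4, 2, 5, 3],
    "Solo-dev learning curve":      [4, 4, 5, 4, 2, 4],
    "Drop-file-see-result fit":     [3, 3, 5, 4, 4, 5],
}

# Column sums: one total per framework.
totals = {
    name: sum(row[i] for row in SCORES.values())
    for i, name in enumerate(FRAMEWORKS)
}

# Dimensions where a given framework scores the maximum of 5.
def top_dimensions(name: str) -> list:
    i = FRAMEWORKS.index(name)
    return [dim for dim, row in SCORES.items() if row[i] == 5]

print(totals)
print(top_dimensions("Streamlit"))
```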
**Per-option summary** (the substance behind the verdicts):
- **Plain Tkinter**: Smallest bundle (~30-50 MB added), most predictable PyInstaller behavior, will work in 10 years. Default widgets look like 1998. A non-technical buyer paying $49-79 and seeing a default Tk UI will feel cheated. Don't ship.
- **Tkinter + CustomTkinter**: Native window, ~50-80 MB added, modern look, mature PyInstaller story. Maintainer absent (last release Jan 2024). Multi-year product cannot bet UI layer on a library classified Inactive. The probable failure mode is a future macOS or Python update breaking the Tk layer with no upstream fix.
- **Streamlit**: Fastest to build for data tools. Tables, file uploads, dataframes are first-class. Mature ecosystem. Browser-launch UX is the real liability, mitigated by in-app messaging and the optional pywebview wrap (v1.1). Bundle size 300-500 MB. PyInstaller packaging fiddly first time, reusable after.
- **Flet**: Modern Flutter-based UI, native windows, looks great. Ecosystem too young for a build-once-maintain-for-years product. Breaking API changes between minor versions still happening. PyInstaller story less battle-tested.
- **PySide6 / Qt**: Industrial-grade, best widget set, native everything. Steepest learning curve, largest bundles, licensing care needed. Overkill for $49-79 product tier and burns the 10 hr/wk time budget on UI scaffolding instead of script features.
- **NiceGUI**: Similar pattern to Streamlit (Python-to-web). Smaller community, less mature data-tool ergonomics. Same browser-launch tradeoff without Streamlit's velocity advantage.
### Why Streamlit won ### Why Streamlit won
1. **Fastest build velocity for v1 and every subsequent bundle.** "Drop a CSV, see results" is the native Streamlit interaction pattern. Tables, filters, dataframes display well with minimal code. This compounds across the 9-script lead bundle and the future 5 bundles in the roadmap. 1. **Fastest build velocity** — "drop CSV, see results" is native. Tables, file uploads, dataframes are first-class. Compounds across 9-script lead + 5 future bundles.
2. **Lowest maintenance burden per added feature.** Active framework, large community, mature ecosystem. Bug fixes happen upstream, not on this project's time. 2. **Lowest maintenance burden** — active, large community, mature ecosystem. Bugs fixed upstream.
3. **Hosted browser demo as a marketing asset from day one.** A Streamlit app deploys to Streamlit Community Cloud (free) or a $5/mo VPS. The Gumroad landing page can offer "Try it free in your browser" with a sample dataset. For a $49-79 product where buyers cannot evaluate quality before purchase, a working demo can move conversion meaningfully. Tkinter family options cannot provide this. 3. **Hosted demo as marketing asset**: Streamlit Community Cloud (free) lets the landing page offer "Try free in browser" with sample data. Tk-family options can't.
4. **Future SaaS optionality** (expanded v1.6). Not a driver of this decision; the locked criteria reject SaaS. But if criteria ever evolve, Streamlit code converts to a hosted multi-user app in hours rather than weeks. Streamlit's session-state model, component patterns, and HTTP-server architecture are SaaS-native by default; the same code that runs the desktop bundle's local browser GUI runs unchanged on a hosted server (modulo authentication and per-user file isolation). Tkinter code, by contrast, would require a complete rewrite to become a hosted product. This is low-cost optionality: zero implementation effort now, meaningful flexibility later if the lifestyle-cashflow constraint ever lifts in favor of recurring revenue. 4. **Future SaaS optionality** — same code runs unchanged on a hosted server (modulo auth + per-user isolation). Tk would require rewrite. Zero implementation now, meaningful flexibility later.
### Tradeoffs accepted ### Tradeoffs accepted
1. **Browser-launch UX on the desktop install.** When a buyer double-clicks the desktop shortcut, their default browser opens to a localhost URL. This may briefly confuse non-technical buyers. **Mitigation**: a single sentence in the welcome dialog and install email explains that the data tool runs in the browser locally and uses no internet. If support tickets show this is a meaningful confusion driver, evaluate wrapping with pywebview (native window around the local Streamlit server) in v1.1. 1. **Browser-launch UX** — buyer double-click → default browser opens to localhost. Mitigated: install email + welcome dialog + persistent in-app message. Pywebview wrap is the v1.1 fallback if confusing.
2. **Larger bundle size**, ~300-500 MB vs. ~50 MB for Tkinter. Acceptable for marketplace download in 2026 with typical broadband. 2. **Bundle size**: ~300-500 MB vs. ~50 MB for Tk. Acceptable in 2026.
3. **PyInstaller packaging is fiddly** the first time. Budget 1-3 days for the one-time setup, then it's reusable across all subsequent bundles via a shared template. 3. **PyInstaller fiddly first time** — budget 1-3 days. Reusable across all bundles after.
4. **Streamlit's session re-run model is unusual.** Manageable for single-user data tools; would matter more if the SaaS optionality were exercised at scale. 4. **Streamlit's session re-run model** is unusual but manageable.
### Why CustomTkinter was rejected (the previously-favored option) ## 5. Distribution
A web check during this decision found that CustomTkinter's last PyPI release was 5.2.2 in January 2024. As of April 2026, that's roughly 28 months without a release, and Snyk classifies the project as Inactive. The library still works and remains popular (~115k weekly downloads, 13k+ GitHub stars), but the maintainer is effectively absent. For a product intended to ship to non-technical buyers and remain functional for years with minimal touch from the operator, betting the UI layer on an unmaintained library is an unacceptable risk: any future Python or macOS update that breaks the Tk underpinnings becomes the operator's problem to fix or fork. **Primary**: Marketplaces (Gumroad, Lemon Squeezy). Built-in traffic, async payments/delivery/refunds, listing in days.
This is the kind of dependency risk that matters most in a "build once, sell forever" product, where every hour spent firefighting a dependency break is an hour stolen from the next bundle. Own-domain SEO: long-term compounding asset (6-18 mo), not early-stage channel.
--- **v1.3 addition**: hosted browser demo as secondary distribution + primary conversion lever.
## 5. Distribution Channel Decision ## 6. Pricing
**Chosen primary**: Marketplace listings (Gumroad, Lemon Squeezy). $49-79/bundle · $149 full suite (when 3+ exist).
**Rationale**: Under the "no network + fully async + 90-day hard stop" constraints, marketplaces are the only channel that: - < $99 → no procurement friction for solo operators.
- Has built-in buyer traffic (no audience-building required). - > $99 → triggers SaaS-support expectations conflicting with no-touch.
- Handles payments, delivery, refunds asynchronously. - $49-79 → right unit economics + impulse-purchase territory.
- Allows listing in days, not months.
Own-domain SEO is treated as a long-term compounding asset (6-18 months to traction), not an early-stage channel. ## 7. Decision log
**Added v1.3**: A **hosted browser demo** of each bundle (deployed via Streamlit Community Cloud) becomes a secondary distribution surface and a primary conversion-rate lever on the landing page. Marketing details in BUSINESS.md Section 7.
---
## 6. Pricing Decision
**Chosen**: $49-$79 per bundle, $149 for full suite (when 3+ bundles exist).
**Rationale**:
- Below $99 threshold avoids procurement / approval friction for solo operator buyers.
- Above $99 raises buyer expectations (SaaS, human support) that conflict with the no-touch constraint.
- $49-$79 produces the right unit economics for marketplace fees + Stripe fees while remaining impulse-purchase territory.
---
## 7. Decision Log (Chronological)
| Date | Decision | Rationale | | Date | Decision | Rationale |
|---|---|---| |------|----------|-----------|
| April 2026 | Lock operating criteria | Project kickoff | | Apr 2026 | Lock operating criteria | Project kickoff |
| April 2026 | Select Python Automation Script Bundles as the product category | Highest score against locked criteria | | Apr 2026 | Python Bundles selected | Highest score |
| April 2026 | Choose CLI standalone over SaaS / GUI | Best fit for minimal maintenance + skill leverage | | Apr 2026 | Excel/CSV Cleaning as lead bundle | Highest pain, broadest demand |
| April 2026 | Pick Excel & CSV Data Cleaning Mastery as lead bundle | Highest pain, broadest demand, easiest demonstration | | Apr 2026 (v1.1) | PyInstaller cross-platform pipeline | Eliminates "install Python" friction |
| April 2026 | Initial install path: Inno Setup (Windows-only) | First-pass design | | Apr 2026 (v1.1) | Apple Developer Program ($99/yr) | Required for clean macOS install |
| April 2026 (revised v1.1) | **Switch to PyInstaller-based cross-platform pipeline** | Eliminates "install Python first" friction; expands TAM to Mac and Linux users | | Apr 2026 (v1.1) | Tiered revenue targets ($5k @ 12mo, $10k @ 24mo) | Original $50k unsupported by evidence |
| April 2026 (revised v1.1) | **Enroll in Apple Developer Program ($99/yr)** | Required for clean macOS install experience for non-technical buyers | | Apr 2026 (v1.1) | Tag "no-touch" for revisit at $5k/mo | Strict adherence pre-PMF may cost more revenue than it saves |
| April 2026 (revised v1.1) | **Replace $50k/mo target with tiered realistic targets** | Original target was unsupported by evidence base; tiered targets hit $5k at 12mo, $10k at 24mo | | Apr 28 (v1.2) | Functional scope: include workflow features even if free elsewhere | One-stop shopping is the value prop. See §4a. |
| April 2026 (revised v1.1) | **Tag "fully async no-touch" for revisit at $5k/mo** | Strict adherence pre-PMF may cost more revenue than it saves time | | Apr 28 (v1.2) | Promote GUI to required at v1; ship dual CLI + GUI | Buyer persona won't use CLI. See §4. |
| April 28, 2026 (v1.2) | **Functional scope: include all workflow-relevant features even if available free elsewhere** | One-stop shopping is the value proposition. Forcing buyers to bounce between products defeats the purpose. See Section 4a. | | Apr 28 (v1.2) | Lock UX standards (works OOTB, sensible defaults, progressive disclosure, dry-run) | Load-bearing for non-tech buyer. See §4b. |
| April 28, 2026 (v1.2) | **Promote GUI from "deferred" to required at v1 launch; ship dual CLI + GUI interface** | Buyer persona will not use CLI. Deduplicator specifically requires interactive review UX that CLI cannot deliver well. See Section 4. | | Apr 28 (v1.3) | Lock GUI framework as Streamlit | Fastest velocity, lowest maintenance, hosted demo, SaaS optionality. See §4c. |
| April 28, 2026 (v1.2) | **Lock UX standards for GUI: works out of the box, sensible defaults, progressive disclosure, plain-English labels, dry-run by default** | These are load-bearing for the non-technical buyer. Without them the GUI may exist but won't justify the price. See Section 4b. | | Apr 28 (v1.3) | Add hosted browser demo as conversion lever | Direct consequence of Streamlit choice. See §5. |
| April 28, 2026 (v1.3) | **Lock GUI framework as Streamlit; reject CustomTkinter (maintenance inactive), plain Tkinter (UX gap), Flet/PySide6/NiceGUI (each fails on a dimension that matters)** | Fastest build velocity, lowest maintenance burden, hosted browser demo as marketing asset, future SaaS optionality. Browser-launch UX accepted as a tradeoff with documented mitigation. See Section 4c. | | Apr 28 (v1.4) | Re-apply 04/06 boundary work (silent-drift recovery) | Stream B v1.2 content overwritten in parallel v1.3 work. Restored per no-silent-drift rule. |
| April 28, 2026 (v1.3) | **Add hosted browser demo as secondary distribution surface and conversion lever** | Direct consequence of Streamlit choice. See Section 5 and BUSINESS.md Section 7. | | Apr 28 (v1.5) | Add `02_text_cleaner.py`; renumber 02-08 → 03-09 | Character-level hygiene had no clear owner. See TECHNICAL §10. |
| April 28, 2026 (v1.4) | **Re-apply 03/05 script boundary work dropped during v1.3 merge (silent drift recovery)** | Stream B v1.2 content (sharpened 03/05 descriptions in USER-GUIDE, run-order rule, TECHNICAL.md Section 9 boundary spec, RECOVERY.md pointer) was overwritten when Stream A's parallel v1.3 Streamlit work was saved to project. Restoring per the doc's own no-silent-drift rule. 03 owns "what's not there" (missing values, sentinel codes, imputation), 05 owns "what shouldn't be there" (statistical outliers, domain rules, winsorization). 03 runs before 05 because outlier statistics on data containing NaN or sentinel codes are mathematically poisoned. See TECHNICAL.md Section 9. | | Apr 29 (v1.7) | Adopt Text Cleaner Tier 1/2/3 spec; lock `excel-hygiene` default | Promotes from stub to buildable v1 target. Full spec in TECHNICAL §11.2. |
| April 28, 2026 (v1.5) | **Add `02_text_cleaner.py` as new script; renumber 02-08 → 03-09** | Audit revealed character-level hygiene (whitespace trimming, multi-space collapse, Unicode normalization, BOM handling, line-ending normalization, special-character handling) had no clear owner. Was implicitly scattered: `01_deduplicator` normalizes internally for matching only (doesn't write back), `02_format_standardizer` (now 03) implies it but its named scope is dates/currencies/names/phones/addresses, `03_missing_value_handler` (now 04) only handles whitespace-only as disguised null. A buyer with trailing-space pollution had no obvious script to run. Per Section 4a (functional scope principle: one-stop shopping for the workflow), this was a real gap. Added as 02 because text cleaning is a pre-processing step that should run before format standardization, missing-value handling, and outlier detection. Kept 01 (deduplicator) at position 1 as the lead/working/marketing-flagship script; numbering does not strictly equal pipeline order, the orchestrator manages execution order. Renumber consequence: TECHNICAL.md Section 9 boundary references updated 03→04, 05→06; orchestrator references updated 08→09. New contested case documented in Section 9.3: whitespace-only cells (02 trims first, leaving empty string; 04 then detects empty strings as disguised null). Master orchestrator now 09. | | Apr 28 (v1.6) | Fold conversation-history content into docs (deduplicator spec, lead bundle use cases, full GUI matrix, 04/06 examples, Streamlit-to-SaaS reasoning) | No new decisions; promote at-risk analysis from chat history per no-silent-drift rule. |
| April 29, 2026 (v1.7) | **Adopt `02_text_cleaner.py` Tier 1/2/3 functional spec; lock `excel-hygiene` as default preset** | Promotes character-level hygiene from a stub to a buildable v1 target. Strategic framing: Excel/Power Query/OpenRefine fail this category for non-technical buyers; the gap is "one-click correctness for dirty-CSV failure modes that cause silent VLOOKUP misses." Spec covers 10 toggleable ops (trim, collapse, NFC, smart-char fold, zero-width strip, BOM strip, control strip, line-ending normalize, NFKC opt-in, per-column case), per-column scope control, dry-run-by-default, per-cell change audit, idempotency, three presets (`minimal`/`excel-hygiene`/`paranoid`), and JSON config save/load. Output shape mirrors deduplicator: `{input}_cleaned.csv`, `{input}_changes.csv`, `logs/text_clean_{ts}.log`. Boundary with adjacent scripts re-asserted: 02 trims whitespace-only cells to empty (04 then detects empty as null per Section 9.3); 02 is *write-time* and stays distinct from `01_deduplicator`'s match-time `normalize_string` helper. Smart-character fold defaults ON in `excel-hygiene` because demo value is highest there and dry-run preview makes the change visible before commit. NFKC stays opt-in (lossy). `ftfy` mojibake repair deferred to Tier 2 to avoid the 5MB dep without buyer demand. CLI ships as separate `src/cli_text_clean.py` module per the one-CLI-per-script pattern in TECHNICAL Section 3.2. Full spec in TECHNICAL.md Section 10.2. | | May 1 (v1.6) | Mark Format Standardizer **Ready** | 199-row buyer corpus passing; Tier 1 + most Tier 2 built. |
| April 28, 2026 (v1.6) | **Fold conversation-history content into docs: deduplicator functional spec, lead bundle use cases, competitive landscape, full GUI framework comparison matrix, concrete 04/06 boundary examples, expanded Streamlit-to-SaaS reasoning** | None of this represents new decisions; all of it represents prior analysis that lived only in chat history and was at risk of evaporating. Per the doc's own no-silent-drift rule (Section 8) and the v1.4 recovery story, valuable analysis must be promoted to docs to survive. Specifically: TECHNICAL.md gains Section 10 (per-script functional specs, starting with the deduplicator's 36-item tiered spec) which is the buildable target for the v1 launch GUI port; this also makes the gap between "currently working" (exact + basic fuzzy) and "v1 launch best-of-class" (Tier 1) explicit so the docs don't quietly overstate where the code is. Section 9.3 gains three concrete distinguishing examples (bank-export blank fees / $1M outlier / "999=refused") that prove 04 and 06 are distinct concerns. BUSINESS.md gains Section 4a (Lead Bundle Deep Dive: 15 use cases by persona, 6-row competitive landscape table, market gap statement) which feeds landing page copy and demo design. Section 4c gains a 10-dimension scored framework matrix and per-option summaries (locks the rejection reasoning against re-litigation), plus expanded point 4 on Streamlit-to-SaaS migration cost. RECOVERY.md updated to reference Section 10 in rebuild and priority steps. No structural decisions changed; this is pure capture work. | | May 1 (v1.6) | Add `src/core/errors.py` structured hierarchy | Uniform helpful messages across CLI + GUI. See TECHNICAL §7. |
--- ## 8. Re-lock triggers
## 8. What Would Trigger Re-Locking the Criteria These criteria are load-bearing. Triggers for explicit re-evaluation:
These criteria are load-bearing and not casually changed. Triggers for explicit re-evaluation: - $5k/mo MRR (revisit async constraint).
- $10k/mo MRR (revisit time-budget allocation).
- Marketplace shutdown (Gumroad / Lemon Squeezy policy).
- New skill that opens a higher-leverage product category.
- Burnout signal — time/recovery balance broken.
- Streamlit hard direction change breaking desktop packaging (low probability).
- Hitting the $5k/mo revenue tier (revisit async constraint). Any re-lock writes new criteria here with date + rationale. **No silent drift.**
- Hitting the $10k/mo revenue tier (revisit time-budget allocation).
- A platform shutting down (Gumroad / Lemon Squeezy policy change forcing channel migration).
- A new skill acquired that opens a higher-leverage product category.
- A burnout signal indicating the time / recovery balance is broken.
- Streamlit project taking a hard direction change that breaks the desktop-packaging path (low probability, but worth flagging).
Any re-lock requires writing the new criteria here with a date and rationale. No silent drift.


@@ -1,285 +1,161 @@
# Developer Guide # Developer Guide
Architecture, data flow, and extension guide for the DataTools Deduplicator. Architecture, data flow, extension points.
## Architecture ## Architecture
``` ```
CLI (src/cli.py) GUI (src/gui/app.py) CLI (src/cli*.py) GUI (src/gui/app.py + pages/)
│ │
│ flags → strategies │ widgets → strategies └──────────┐ ┌──────────┘
│ _interactive_review() │ match_group_card()
│ tqdm progress bar │ st.progress()
│ │
└──────────┐ ┌────────────────┘
│ │
▼ ▼ ▼ ▼
────────────────┐ ┌────────────────┐
core.dedup src/core/
deduplicate() │ └────────────────┘
└────────┬────────┘
┌────────────┼────────────┐
▼ ▼ ▼
core.io core.normalizers core.config
read/write normalize_*() save/load JSON
``` ```
**Key principle:** All business logic lives in `src/core/`. The CLI and GUI are thin wrappers that translate user input into `deduplicate()` arguments and display the `DeduplicationResult`. **Core/UI rule**: business logic in `core/` only. CLI + GUI translate user input → core call → display result.
## File-by-File Reference ## Module map
### src/core/dedup.py — Deduplication Engine | Module | Public surface |
|--------|----------------|
| `core.dedup` | `deduplicate()`, `MatchStrategy`, `ColumnMatchStrategy`, `Algorithm`, `SurvivorRule`, `DeduplicationResult`, `MatchResult`, `build_default_strategies()` |
| `core.normalizers` | `normalize_email/phone/name/address/string`, `NormalizerType`, `get_normalizer()` |
| `core.io` | `read_file()`, `write_file()`, `list_sheets()`, `detect_encoding/delimiter/header_row`, `repair_bytes()` |
| `core.config` | `DeduplicationConfig.from_file/to_file/to_strategies/to_survivor_rule` |
| `core.analyze` | `analyze()`, `Finding`, `findings_by_tool()`, `_NULL_LIKE` |
| `core.fixes` | `@register("fix_id")` decorator, `get_fix()`, `available_actions()` |
| `core.normalize` | `auto_fix()`, `apply_decisions()`, `NormalizationResult`, `is_normalized()` |
| `core.text_clean` | `clean_dataframe()`, `CleanOptions`, `CleanResult`, `smart_title_case()` |
| `core.format_standardize` | `standardize_dataframe()`, `StandardizeOptions`, `StandardizeResult`, `FieldType`, per-cell `standardize_*()` |
| `core.errors` | `DataToolsError` hierarchy, `ensure_dataframe()`, `ensure_choice()`, `wrap_file_read/write()`, `format_for_user()` |
| `core._constants` | `US_STATE_NAMES`, `US_STATE_CODES`, `USPS_EXPANSIONS`, `USPS_COMPRESSIONS` |
The central module. Contains: ## Data flow — Deduplicator
- **Enums:** `Algorithm` (4 fuzzy algorithms), `SurvivorRule` (4 selection rules)
- **Data classes:** `ColumnMatchStrategy`, `MatchStrategy`, `MatchResult`, `DeduplicationResult`
- **`deduplicate()`** — main entry point. Takes a DataFrame + optional strategies/rules, returns a `DeduplicationResult` with deduplicated DataFrame, removed rows, match groups, and log entries.
- **`build_default_strategies()`** — scans column names with regex patterns to auto-detect email, phone, name, and address columns. Builds strong/weak key strategies with appropriate algorithms and normalizers.
- **`_UnionFind`** — disjoint-set data structure for transitive closure. If A matches B and B matches C, all three end up in one group.
- **`_find_match_groups()`** — O(n^2) pairwise comparison. For each pair, tries all strategies (OR semantics). Feeds matches into union-find. Returns match groups with confidence scores.
- **`_select_survivor()`** — picks the row to keep based on the survivor rule.
- **`_merge_group()`** — fills blank fields in the survivor from loser rows.
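The transitive-closure behaviour described for `_UnionFind` can be sketched as below. This is a minimal illustration, not the project's implementation: the class name `UnionFind` and the `groups()` helper are assumptions made for the example.

```python
class UnionFind:
    """Disjoint-set with path halving (illustrative; not the project's _UnionFind)."""

    def __init__(self, n: int) -> None:
        # Every row index starts as its own group.
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        # Walk to the root, shortening the path as we go.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def groups(self) -> list[list[int]]:
        """Return only groups with 2+ members, i.e. candidate duplicate sets."""
        buckets: dict[int, list[int]] = {}
        for i in range(len(self.parent)):
            buckets.setdefault(self.find(i), []).append(i)
        return [members for members in buckets.values() if len(members) > 1]


uf = UnionFind(5)
uf.union(0, 1)  # row A matches row B
uf.union(1, 2)  # row B matches row C
match_groups = uf.groups()  # rows 0, 1, 2 share one group; rows 3 and 4 stay alone
```

Rows 0 and 2 were never compared directly, yet they land in the same group because both were merged with row 1.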
### src/core/normalizers.py — Text Normalization
Five normalizer functions, each `str → str`, idempotent, None-safe:
- **`normalize_email()`** — lowercase, strip Gmail dots, strip `+tag` suffixes
- **`normalize_phone()`** — parse with `phonenumbers` to E.164; fallback to digits-only
- **`normalize_name()`** — strip title prefixes (Dr., Mr.) and suffixes (Jr., PhD), case-fold
- **`normalize_address()`** — USPS abbreviations (Street→St, Avenue→Ave), case-fold
- **`normalize_string()`** — trim, collapse whitespace, case-fold
The `get_normalizer()` registry function maps `NormalizerType` enum values to functions.
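As a rough illustration of the `str → str`, idempotent, None-safe contract, a generic normalizer along the lines of `normalize_string()` might look like the following. The function name `normalize_string_sketch` is hypothetical, and returning `""` for non-string input is an assumption borrowed from the `normalize_company` recipe in this guide.

```python
import re
from typing import Optional

_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_string_sketch(value: Optional[str]) -> str:
    """Trim, collapse internal whitespace, case-fold (sketch only)."""
    if not isinstance(value, str):
        return ""  # assumed None-safe behaviour
    return _WHITESPACE_RUN.sub(" ", value.strip()).casefold()


once = normalize_string_sketch("  Acme   CORP \n")
twice = normalize_string_sketch(once)  # idempotent: a second pass changes nothing
```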
### src/core/io.py — File I/O
Auto-detection stack:
1. **`detect_encoding()`** — checks BOM, then uses `charset-normalizer` heuristics
2. **`detect_delimiter()`** — uses `csv.Sniffer` on first 20 lines
3. **`detect_header_row()`** — finds first row where all cells look like column names
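The first two detection steps can be approximated with the standard library alone. This is a sketch under assumptions: `bom_encoding` and `sniff_delimiter` are hypothetical names, the candidate delimiter set and the comma fallback are guesses, and the `charset-normalizer` heuristics used after the BOM check are not shown.

```python
import codecs
import csv
from typing import Optional

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def bom_encoding(raw: bytes) -> Optional[str]:
    """Return a codec name if the bytes start with a known BOM, else None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None


def sniff_delimiter(text: str, max_lines: int = 20, default: str = ",") -> str:
    """Guess the delimiter from the first lines; fall back to a default on failure."""
    sample = "".join(text.splitlines(keepends=True)[:max_lines])
    try:
        return csv.Sniffer().sniff(sample, delimiters=",\t;|").delimiter
    except csv.Error:
        return default


encoding = bom_encoding(b"\xef\xbb\xbfname,email\n")
delimiter = sniff_delimiter("name;email\nann;a@x.org\nbob;b@x.org\n")
```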
Main functions:
- **`read_file()`** — reads CSV/TSV/Excel with full auto-detection. Returns a DataFrame.
- **`write_file()`** — writes DataFrame to CSV or Excel. Uses `utf-8-sig` by default for Windows Excel compatibility.
- **`list_sheets()`** — returns sheet names from an Excel workbook.
### src/core/config.py — Configuration Profiles
Save/load deduplication settings as JSON:
- **`DeduplicationConfig`** — flat dataclass with all settings: strategies, survivor rule, merge flag, algorithm, threshold, normalizer map.
- **`.to_file()` / `.from_file()`** — JSON serialization
- **`.to_strategies()`** — converts config back to `MatchStrategy` objects for the engine
- **`.to_survivor_rule()`** — converts string to `SurvivorRule` enum
### src/cli.py — Command-Line Interface
Typer-based CLI with 17 options. Key responsibilities:
- Parse flags into strategies, survivor rule, and other config
- Set up logging (timestamped log files in `logs/`)
- Column name validation with fuzzy suggestions on typos
- `_interactive_review()` — side-by-side row display with y/n/s prompts
- Progress bar via `tqdm` for files > 10,000 rows
- Output formatting and file writing
### src/gui/app.py — Streamlit GUI
Single-page layout:
- File upload with instant preview and configurable delimiter (comma, tab, semicolon, pipe, or custom)
- Advanced options expander (column selection, fuzzy, normalizers, survivor rule, merge, config profiles)
- Find Duplicates button → runs `deduplicate()` with `progress_callback`
- Interactive review via `st.data_editor` with inline checkboxes and column dropdowns
- Batch actions: Accept All, Reject All, Clear Decisions
- Apply review decisions and download cleaned results
- Download buttons for deduplicated CSV, removed rows, and match groups report
### src/gui/components.py — Reusable GUI Widgets
- **`match_group_card()`** — expandable card with `st.data_editor`: inline Keep checkboxes per row, `SelectboxColumn` dropdowns for differing columns, and a live surviving rows preview
- **`config_panel()`** — the advanced options expander, returns settings dict with strategies, survivor rule, merge flag
- **`results_summary()`** — summary metrics and download buttons
- **`apply_review_decisions()`** — builds final DataFrames from user review decisions (merge, split, or keep-all per group) with column override support
## Data Flow
``` ```
Input File read_file() # auto-detect encoding, delimiter, header
▼ DataFrame
build_default_strategies() # if no explicit strategies
read_file() ← auto-detect encoding, delimiter, header # strong keys (email, phone) → standalone OR
# weak keys (name, address) → AND with strong
_apply_normalizations() # add _norm_* shadow columns
DataFrame
_find_match_groups() # O(n²) pair compare, OR strategies, union-find
build_default_strategies() ← (if no explicit strategies) [review_callback()] # optional interactive review
│ scan column names → regex patterns
│ strong keys: email, phone (standalone OR) _select_survivor() # per group: first/last/most-complete/most-recent
│ weak keys: name, address (AND with strong)
[_merge_group()] # optional: fill blanks from losers
_apply_normalizations() ← add _norm_* shadow columns
normalize_email(), normalize_phone(), etc. DeduplicationResult # deduplicated_df, removed_df, match_groups, log
_find_match_groups() ← O(n²) pairwise comparison
│ for each pair: try all strategies (OR)
│ _compute_similarity() per column
│ union-find for transitive closure
[review_callback()] ← optional: interactive review per group
│ True=accept, False=reject, None=skip
_select_survivor() ← per group: first/last/most-complete/most-recent
[_merge_group()] ← optional: fill blanks from losers
DeduplicationResult
├── deduplicated_df ← cleaned DataFrame (shadow cols dropped)
├── removed_df ← rows that were removed
├── match_groups ← list of MatchResult with confidence, columns
└── log_entries ← human-readable audit log
``` ```
## How to Add a Normalizer ## Extension recipes
1. **Add the function** in `src/core/normalizers.py`: ### Add a normalizer
```python 1. Add function to `core/normalizers.py`:
def normalize_company(value: Optional[str]) -> str: ```python
"""Strip legal suffixes (Inc, LLC, Corp), case-fold.""" def normalize_company(value: Optional[str]) -> str:
if not value or not isinstance(value, str): if not value or not isinstance(value, str): return ""
return "" name = value.strip().casefold()
name = value.strip().casefold() for sfx in ("inc", "llc", "corp", "ltd", "co"):
# Strip common suffixes name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
for suffix in ("inc", "llc", "corp", "ltd", "co"): return name
name = re.sub(rf"\b{suffix}\.?\s*$", "", name).strip() ```
return name 2. Register: add `COMPANY = "company"` to `NormalizerType` + entry in `_NORMALIZER_MAP`.
``` 3. Auto-detect (optional): add a `_COLUMN_TYPE_PATTERNS` row in `core/dedup.py`.
2. **Register it** in the same file: ### Add a fuzzy algorithm
```python 1. Add value to `Algorithm` enum in `core/dedup.py`.
class NormalizerType(str, Enum): 2. Add case in `_compute_similarity()`.
# ... existing types ... 3. Document the value in CLI help text.
COMPANY = "company" # ← add enum value
_NORMALIZER_MAP: dict[NormalizerType, Callable[[str], str]] = { ### Add a survivor rule
# ... existing entries ...
NormalizerType.COMPANY: normalize_company, # ← add mapping
}
```
3. **Add auto-detection pattern** in `src/core/dedup.py` (optional): 1. Add value to `SurvivorRule` enum.
2. Add branch in `_select_survivor()`.
3. Add CLI mapping.
```python ### Add a fix + detector (analyzer/gate)
_COLUMN_TYPE_PATTERNS = [
# ... existing patterns ...
(re.compile(r"company|organization|org_name", re.I),
NormalizerType.COMPANY, Algorithm.TOKEN_SET_RATIO, 85.0, False),
]
```
## How to Add a Matching Algorithm 1. **Detector** in `core/analyze.py`: add `_detect_<thing>(df) -> list[Finding]`, hook into the main `analyze()` pipeline. Emit Finding with a unique `fix_action` id.
2. **Fix** in `core/fixes.py`:
```python
@register("fix_id")
def my_fix(df, payload=None) -> tuple[pd.DataFrame, int]:
# ...
return out_df, cells_changed
```
3. **Constant** in `core/analyze.py`: add `FIX_<NAME> = "fix_id"` so the detector and fix can reference it.
1. **Add the enum value** in `src/core/dedup.py`: No other call sites change. Gate auto-discovers it via the registry.
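The reason no other call sites change is that a decorator-based registry records each fix at import time. A minimal sketch of that pattern follows; it is not the contents of `core/fixes.py`. The `_REGISTRY` dict, the duplicate-id check, and the demo fix operating on a plain list (rather than a DataFrame) are all assumptions made to keep the example self-contained.

```python
from typing import Any, Callable

FixFn = Callable[..., tuple[Any, int]]
_REGISTRY: dict[str, FixFn] = {}  # hypothetical module-level registry


def register(fix_id: str) -> Callable[[FixFn], FixFn]:
    """Decorator factory: store the function under fix_id, return it unchanged."""
    def decorator(fn: FixFn) -> FixFn:
        if fix_id in _REGISTRY:
            raise ValueError(f"duplicate fix id: {fix_id!r}")
        _REGISTRY[fix_id] = fn
        return fn
    return decorator


def get_fix(fix_id: str) -> FixFn:
    return _REGISTRY[fix_id]


@register("strip_whitespace_demo")
def strip_whitespace_demo(cells: list[str], payload=None) -> tuple[list[str], int]:
    # Same return shape as the recipe above: (new data, number of cells changed).
    cleaned = [c.strip() for c in cells]
    changed = sum(1 for old, new in zip(cells, cleaned) if old != new)
    return cleaned, changed


out, cells_changed = get_fix("strip_whitespace_demo")([" a", "b ", "c"])
```

A caller only needs the id string carried by a `Finding`, so detectors and fixes stay decoupled.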
```python ### Add a format-standardizer field type
class Algorithm(str, Enum):
# ... existing values ...
SOUNDEX = "soundex"
```
2. **Add the computation** in `_compute_similarity()`: 1. Add value to `FieldType` enum in `core/format_standardize.py`.
2. Add per-cell `standardize_<x>(value, *, …)` returning `(new_value, changed)`.
3. Add option fields to `StandardizeOptions` (with defaults that preserve existing behavior).
4. Wire into `_apply_field_type()` dispatcher (the `else` branch raises `AssertionError` — every enum value needs a branch).
5. Add validation entry in `StandardizeOptions.from_dict()` for any new enum-shaped option.
```python ## Errors
def _compute_similarity(val_a: str, val_b: str, algorithm: Algorithm) -> float:
# ... existing cases ...
if algorithm == Algorithm.SOUNDEX:
return 100.0 if _soundex(val_a) == _soundex(val_b) else 0.0
```
3. **Add the CLI flag value** in `src/cli.py` help text for `--algorithm`. Use `core/errors.py` instead of raw `ValueError` / `OSError`:
## How to Add a Survivor Strategy | Pattern | Use |
|---------|-----|
| Bad arg, wrong type, missing column | `InputValidationError` |
| Bad config / options file | `ConfigError` |
| File parses but isn't what we expected | `FileFormatError` |
| File I/O failure (perms, missing, disk full) | `FileAccessError` |
| Internal invariant broken (unreachable branch) | `AssertionError` |
1. **Add the enum value** in `src/core/dedup.py`: Helpers:
- `ensure_dataframe(value, function="my_func")` at every public entry that takes a df.
- `ensure_choice(value, name="mode", choices=[...])` at every entry that takes a literal.
- `wrap_file_read(path, "operation", exc)` / `wrap_file_write(...)` when wrapping `OSError`.
```python GUI / CLI handlers: use `format_for_user(exc, context="...")` to render.
class SurvivorRule(str, Enum):
# ... existing values ...
KEEP_LONGEST = "longest"
```
2. **Add the logic** in `_select_survivor()`: All `DataToolsError` subclasses extend stdlib `ValueError` or `OSError` so existing handlers still catch them.
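The backward-compatibility point can be illustrated with multiple inheritance. This is a sketch, not the real `core/errors.py`: which class pairs with `ValueError` versus `OSError` is an assumption, and the body of `ensure_choice` is a guess built around the call signature shown above.

```python
class DataToolsError(Exception):
    """Common base so callers can catch every project error at once (sketch)."""


class InputValidationError(DataToolsError, ValueError):
    """Assumed to extend ValueError so older `except ValueError` blocks still work."""


class FileAccessError(DataToolsError, OSError):
    """Assumed to extend OSError so older `except OSError` blocks still work."""


def ensure_choice(value, *, name: str, choices: list[str]):
    """Raise InputValidationError unless value is one of the allowed choices."""
    if value not in choices:
        raise InputValidationError(
            f"{name} must be one of {sorted(choices)}, got {value!r}"
        )
    return value


try:
    ensure_choice("fastest", name="survivor", choices=["first", "last"])
except ValueError as exc:  # a pre-existing stdlib handler still catches it
    caught = exc
```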
```python ## Tests
if rule == SurvivorRule.KEEP_LONGEST:
return max(indices, key=lambda i: len(str(df.iloc[i].to_dict())))
```
3. **Add to the CLI** survivor map in `src/cli.py`.
## Testing
### Run Tests
```bash ```bash
# All tests # All
pytest tests/ -q pytest -q
# By module
# Specific module pytest tests/test_dedup.py
pytest tests/test_dedup.py -q # Include slow / integration
pytest tests/test_normalizers.py -q pytest -m slow
pytest tests/test_io.py -q # Single test
pytest tests/test_config.py -q pytest tests/test_dedup.py::TestExactMatch::test_basic
pytest tests/test_cli.py -q
# Verbose with output
pytest tests/ -v
# Stop on first failure
pytest tests/ -x
``` ```
### Test Structure Test layout:
``` ```
tests/ tests/
├── conftest.py # Shared fixtures ├── conftest.py # fixtures
│ ├── sample_csv_path # Path to samples/messy_sales.csv ├── test_dedup.py · test_normalizers.py · test_io.py · test_config.py
│ ├── sample_df # Loaded sample CSV as DataFrame ├── test_analyze.py · test_normalize.py · test_text_clean.py
│ ├── simple_df # Small 5-row DataFrame with obvious duplicates ├── test_format_standardize.py
│ ├── merge_df # DataFrame with partial records ├── test_format_standardize_corpus.py # 199-row buyer corpus
│ └── tmp_csv # Temporary CSV from simple_df ├── test_audit_fixes.py · test_errors.py · test_fixes_unit.py
├── test_dedup.py # Engine tests: similarity, union-find, pairs, integration ├── test_corpus.py · test_encodings_corpus.py · test_fixtures_sweep.py
├── test_normalizers.py # Normalizer tests: all 5 types with edge cases └── test_cli.py · test_cli_*.py · test_e2e.py · test_install.py
├── test_io.py # I/O tests: encoding, delimiter, header, read/write
├── test_config.py # Config tests: serialization round-trip
└── test_cli.py # CLI tests: argument parsing, file handling
``` ```
### Writing Tests Fixture corpora: `test-cases/text-cleaner-corpus/` (21 files) · `test-cases/encodings-corpus/` (31 files) · `test-cases/format-cleaner-corpus/` (7 files + spec).
Follow existing patterns. Tests use pytest fixtures from `conftest.py`: ## Known limitations
```python - **Dedup is O(n²)** — no blocking. Works to ~50k rows. Future: partition by first letter / ZIP prefix.
def test_my_feature(simple_df): - **Single-threaded** — could benefit from `multiprocessing`.
"""Test description.""" - **Memory-bound** — entire file loaded into pandas. Streaming reads exist but not integrated with dedup engine.
result = deduplicate(simple_df, ...) - **No multi-sheet dedup** — each Excel sheet processed independently.
assert len(result.match_groups) == expected - **Phonenumbers minimum-length** — international numbers without country codes fall back to digits-only.
assert result.deduplicated_df.shape[0] == expected_rows
```
## Known Limitations
- **O(n^2) pairwise comparison** — no blocking or indexing. Works well up to ~50,000 rows. Beyond that, performance degrades quadratically. Future optimization: add blocking (partition by first letter, zip code prefix, etc.) to reduce comparison space.
- **No multi-sheet dedup** — each Excel sheet is processed independently. Cross-sheet deduplication is not supported.
- **Phone normalization requires valid-length numbers** — the `phonenumbers` library rejects numbers that are too short or too long for the detected region. Fallback is digits-only, which may produce false negatives for international numbers without country codes.
- **Single-threaded** — no parallel comparison. Could benefit from `multiprocessing` for large files.
- **Memory-bound** — entire file is loaded into a pandas DataFrame. Files larger than available RAM will fail. Chunked reading exists but is not integrated with the dedup engine.
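The blocking idea mentioned as a future optimization could look roughly like this. It is not implemented in the engine; `blocked_pairs` and the first-letter key are hypothetical and serve only to show how partitioning shrinks the comparison space.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Iterator


def blocked_pairs(
    values: list[str], block_key: Callable[[str], str]
) -> Iterator[tuple[int, int]]:
    """Yield index pairs only for rows that share a block key."""
    blocks: dict[str, list[int]] = defaultdict(list)
    for idx, value in enumerate(values):
        blocks[block_key(value)].append(idx)
    for members in blocks.values():
        # Pairwise comparison stays quadratic, but only within each block.
        yield from combinations(members, 2)


names = ["alice", "alicia", "bob", "bobby", "carol"]
pairs = list(blocked_pairs(names, block_key=lambda s: s[:1]))
all_pairs = list(combinations(range(len(names)), 2))
```

Here 2 candidate pairs replace the 10 a full pairwise pass would compare. The trade-off is recall: two records whose block keys differ (for example a typo in the first letter) are never compared, so the key has to be chosen conservatively.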


@@ -1,38 +1,27 @@
# Excel & CSV Data Cleaning Mastery Bundle # Excel & CSV Data Cleaning Mastery Bundle
**Ready-to-sell Python automation product.** 9 Python data-cleaning tools, every one with a CLI and a browser GUI. Local-only, no internet. Windows / macOS / Linux.
9 scripts for data cleaning, deduplication, text hygiene, formatting, merging, validation, and reporting.
Each script ships with both a GUI (runs in your browser locally, no internet needed) and a CLI. ## Quick Start
Cross-platform: Windows, macOS, Linux. 1. Download the installer for your OS from your purchase email.
2. Run it (no Python knowledge required).
3. Launch via the desktop shortcut → your default browser opens to a local page.
Full instructions: [USER-GUIDE.md](USER-GUIDE.md).
## Docs
**Buyer-facing** (ships with the product):
- [USER-GUIDE.md](USER-GUIDE.md) — install + per-tool walkthrough
**Creator-only** (do not ship):
- [BUSINESS.md](BUSINESS.md) — market, pricing, marketing
- [TECHNICAL.md](TECHNICAL.md) — architecture, build pipeline, standards
- [DECISIONS.md](DECISIONS.md) — locked criteria, decision log
- [RECOVERY.md](RECOVERY.md) — full rebuild guide
- [REQUIREMENTS.md](REQUIREMENTS.md) — numbered support matrix
--- ---
## Quick Start (for buyers) **Version**: 1.6 · **Updated**: 2026-05-01 · **Owner**: Michael
1. Download the installer for your operating system.
2. Run the installer. No Python knowledge required.
3. Launch via the desktop shortcut "Launch Bundle" (or the app icon on macOS, or the AppImage on Linux).
4. Your default browser opens to a local page where the data tool runs. Your data never leaves your computer.
Full instructions: see [USER-GUIDE.md](USER-GUIDE.md).
---
## Documentation Index
### Ships with the product (buyer-facing)
- [USER-GUIDE.md](USER-GUIDE.md) - Installation, script reference, usage examples for both GUI and CLI.
### Creator-only (do not ship to buyers)
- [BUSINESS.md](BUSINESS.md) - Business case, market analysis, pricing, marketing strategy (including the hosted browser demo as a conversion lever).
- [TECHNICAL.md](TECHNICAL.md) - Architecture (dual CLI + Streamlit GUI), build pipeline, dev standards.
- [DECISIONS.md](DECISIONS.md) - Locked criteria, scoring rubric, decisions log, rationale for product choices including the GUI framework decision.
- [RECOVERY.md](RECOVERY.md) - How to rebuild the entire project from scratch if lost.
---
**Version**: 1.6
**Last updated**: April 28, 2026
**Owner**: Michael


@@ -1,180 +1,147 @@
# RECOVERY.md - Full Project Recovery Guide # Recovery
> **Creator-only document. Do not ship to buyers.** > Creator-only. Full project rebuild guide.
> **Version**: 1.6 · **Updated**: 2026-05-01
**Version**: 1.6 If lost, this doc + the source ZIP rebuilds the project 100%.
**Last updated**: April 28, 2026
If the project is ever lost, this guide plus the source ZIP is enough to rebuild it 100%. ## 1. Project layout
---
## 1. What's in the Project
``` ```
project-root/ project-root/
├── README.md ├── README.md
├── BUSINESS.md # Creator only ├── docs/
├── TECHNICAL.md # Creator only │ ├── BUSINESS.md # creator-only
├── DECISIONS.md # Creator only - locked criteria, rationale, GUI framework decision ├── TECHNICAL.md # creator-only
├── USER-GUIDE.md # Ships to buyers │ ├── DECISIONS.md # creator-only — locked criteria + decision log
├── RECOVERY.md # Creator only (this file) │ ├── DEVELOPER.md # creator-only
├── RECOVERY.md # creator-only (this file)
├── scripts/ # The 9 .py source files (CLI entry points) │ ├── REQUIREMENTS.md
│ ├── 01_deduplicator.py # Working │ ├── USER-GUIDE.md # ships to buyers
│ ├── 02_text_cleaner.py │ └── CLI-REFERENCE.md
│ ├── 03_format_standardizer.py
│ ├── 04_missing_value_handler.py
│ ├── 05_column_mapper_enforcer.py
│ ├── 06_outlier_detector.py
│ ├── 07_multi_file_merger.py
│ ├── 08_validator_reporter.py
│ └── 09_master_orchestrator.py
├── src/ ├── src/
│ ├── core/ # Shared business logic - both CLI and GUI call into this │ ├── core/ # shared logic; both CLI + GUI call into this
│ ├── cli.py # Typer CLI front-end │ ├── cli.py # Deduplicator CLI
│ └── gui/ # Streamlit GUI front-end │ ├── cli_text_clean.py # Text Cleaner CLI
│ ├── app.py # Streamlit entry point │ ├── cli_analyze.py # Analyzer CLI
│ ├── pages/ # One Streamlit page per script in the bundle │ └── gui/
│ └── components.py # Shared widgets │ ├── app.py # Streamlit entry
│ ├── pages/ # one page per tool
├── samples/ │ └── components/ # shared widgets
│ ├── messy_sales.csv ├── samples/ # messy_sales.csv, bank_export.xlsx
│ └── bank_export.xlsx ├── test-cases/ # corpora: text-cleaner, encodings, format-cleaner
├── tests/ # pytest
├── demo/ ├── demo/streamlit_app.py # constrained Streamlit Community Cloud version
│ └── streamlit_app.py # Constrained version for Streamlit Community Cloud
├── build/ ├── build/
│ ├── pyinstaller.spec # Cross-platform build spec (handles GUI launcher + CLI binaries) │ ├── pyinstaller.spec # cross-platform build spec
│ ├── launcher.py # Starts local Streamlit server, opens default browser │ ├── launcher.py # starts Streamlit, opens browser
│ ├── windows/ │ ├── windows/installer.iss
│ │ └── installer.iss # Inno Setup wrapper │ ├── macos/{entitlements.plist, dmg_settings.py}
│ ├── macos/ │ └── linux/AppImage/
│ │ ├── entitlements.plist ├── ci/build.yml # GitHub Actions matrix build
│ │ └── dmg_settings.py
│ └── linux/
│ └── AppImage/ # AppImage build assets
├── ci/
│ └── build.yml # GitHub Actions cross-platform build
├── tests/
└── requirements.txt └── requirements.txt
``` ```
--- ## 2. Rebuild steps
## 2. Rebuild Steps
### From a complete ZIP backup ### From a complete ZIP backup
1. Unzip into a clean directory. 1. Unzip into a clean directory.
2. Push to a GitHub repository. 2. Push to GitHub.
3. The CI pipeline (`ci/build.yml`) builds Windows, macOS, and Linux artifacts on tagged releases. 3. Tag a release → CI builds Windows / macOS / Linux artifacts.
4. Connect the repo to Streamlit Community Cloud and point it at `demo/streamlit_app.py` to redeploy the hosted demo. 4. Connect repo to Streamlit Community Cloud → demo deploys.
5. For local builds: see Section 3. 5. Local builds: see §3.
6. Done.
### From documentation only (worst case) ### From documentation only (worst case)
1. Read `DECISIONS.md` to understand *why* the project is what it is. Section 4c locks the GUI framework as Streamlit; Section 4b locks the UX standards. These are non-negotiable. 1. Read **DECISIONS.md** to understand *why* the project is what it is. §4c locks Streamlit; §4b locks UX standards. **Non-negotiable.**
2. Read `TECHNICAL.md` Sections 2-3 for the build pipeline architecture, including the Streamlit launcher pattern in Section 3.4. 2. Read **TECHNICAL.md** §1-3 (architecture + build pipeline + Streamlit launcher pattern in §3.4).
3. Read `BUSINESS.md` for product strategy, which bundles to build, and the hosted demo as a marketing asset. 3. Read **BUSINESS.md** for product strategy + hosted demo as marketing asset.
4. Recreate scripts using the spec in `USER-GUIDE.md` Section 2 (script table), `TECHNICAL.md` Section 7 (per-bundle technical notes), `TECHNICAL.md` Section 9 (boundary between scripts 04 and 06 - do not relitigate this), and `TECHNICAL.md` Section 10 (per-script functional requirements; Section 10.1 is the v1 launch target for the deduplicator). 4. Recreate scripts using:
5. Set up the cross-platform build pipeline (Section 3 below). - USER-GUIDE.md §2 (script table)
6. Recreate installer configs per `TECHNICAL.md` Section 3. - TECHNICAL.md §10 (04/06 boundary — do not relitigate)
7. Build the constrained `demo/streamlit_app.py` for hosted deployment. Constraints: row limit, watermark, sample data only or strict file-size cap. - TECHNICAL.md §11 (per-script functional specs; §11.1-11.3 are the v1 launch targets for Ready tools).
5. Set up cross-platform build pipeline (§3 below).
6. Recreate installer configs per TECHNICAL.md §3.5-3.7.
7. Build constrained `demo/streamlit_app.py` (row limit, watermark, sample data).
--- ## 3. Local build setup
## 3. Local Build Setup (per platform) ### Common
```bash
### All platforms (common) pip install -r requirements.txt pyinstaller
- Install Python 3.11+. streamlit run src/gui/app.py # verify GUI
- `pip install -r requirements.txt pyinstaller` python -m src.cli --help # verify CLI
- Verify Streamlit app runs locally: `streamlit run src/gui/app.py` ```
- Verify CLI runs locally: `python -m src.cli --help`
### Windows ### Windows
- Install Inno Setup: https://jrsoftware.org/isinfo.php - Install Inno Setup: https://jrsoftware.org/isinfo.php
- Build: `pyinstaller build/pyinstaller.spec` - `pyinstaller build/pyinstaller.spec`
- Wrap in installer: open `build/windows/installer.iss` in Inno Setup, compile. - Open `build/windows/installer.iss` in Inno Setup, compile.
### macOS ### macOS
- Install Xcode command line tools: `xcode-select --install` 1. `xcode-select --install`
- Enroll in Apple Developer Program ($99/yr). Allow 1-2 weeks first time. 2. Enroll in Apple Developer Program ($99/yr; allow 1-2 weeks the first time).
- Generate Developer ID Application certificate, install in Keychain. 3. Generate Developer ID cert, install in Keychain.
- Generate app-specific password for `notarytool`. 4. Generate app-specific password for `notarytool`.
- Build: `pyinstaller build/pyinstaller.spec` 5. `pyinstaller build/pyinstaller.spec`
- Sign: `codesign --deep --force --options runtime --sign "Developer ID Application: [Name]" dist/BundleName.app` 6. `codesign --deep --force --options runtime --sign "Developer ID Application: [Name]" dist/App.app`
- Package as DMG. 7. Package as DMG.
- Notarize: `xcrun notarytool submit BundleName.dmg --wait` 8. `xcrun notarytool submit *.dmg --wait`
- Staple: `xcrun stapler staple BundleName.dmg` 9. `xcrun stapler staple *.dmg`
### Linux ### Linux
- Install AppImage tooling: download `appimagetool` from https://appimage.github.io - Download `appimagetool` from https://appimage.github.io
- Build: `pyinstaller build/pyinstaller.spec` - `pyinstaller build/pyinstaller.spec`
- Wrap as AppImage using `appimagetool` per the assets in `build/linux/AppImage/`. - Wrap as AppImage via assets in `build/linux/AppImage/`.
### Streamlit + PyInstaller specific notes ### Streamlit + PyInstaller notes
- A custom PyInstaller hook (`hook-streamlit.py`) is required to bundle Streamlit's data files correctly. - Custom `hook-streamlit.py` required.
- Hidden imports must include `streamlit`, `altair`, `pyarrow` (and their submodules where PyInstaller fails to detect them). - Hidden imports: `streamlit`, `altair`, `pyarrow` (and submodules where auto-detection fails).
- The launcher script (`build/launcher.py`) is the actual PyInstaller entry point, not the Streamlit script directly. - The PyInstaller entry point is `build/launcher.py`, **not** the Streamlit script directly.
- Budget 1-3 days the first time getting the Streamlit-PyInstaller spec right; it's reusable across all subsequent bundles. - Budget 1-3 days first time. Reusable across all bundles.
### CI build (recommended) ### CI build (recommended)
- Push the repo to GitHub. ```bash
- Tag a release: `git tag v1.0.0 && git push --tags` git tag v1.0.0 && git push --tags
- GitHub Actions runs the matrix build, produces all three artifacts. # GitHub Actions runs the matrix → 3 platform artifacts on Releases page.
- Manual step: download artifacts from the Releases page, upload to Gumroad / Lemon Squeezy. # Manual: download artifacts → upload to Gumroad / Lemon Squeezy.
```
### Hosted demo deployment (separate from desktop build) ### Hosted demo deployment
- Connect GitHub repo to Streamlit Community Cloud (one-time, free). - Connect GitHub repo to Streamlit Community Cloud (one-time, free).
- Configure the deployment to point at `demo/streamlit_app.py`. - Configure deployment → `demo/streamlit_app.py`.
- The demo updates automatically on git push to the configured branch. - Auto-updates on push to configured branch.
- Custom domain optional via CNAME (verify Streamlit Community Cloud current policy at recovery time). - Custom domain optional via CNAME.
--- ## 4. External dependencies
## 4. External Dependencies (re-acquire if lost)
| Item | Source | Cost | | Item | Source | Cost |
|---|---|---| |------|--------|------|
| Python | https://python.org/downloads | Free | | Python | python.org/downloads | Free |
| PyInstaller | `pip install pyinstaller` | Free | | PyInstaller, Streamlit, Python libs | `pip install -r requirements.txt` | Free |
| Streamlit | `pip install streamlit` | Free | | Inno Setup (Windows) | jrsoftware.org/isinfo.php | Free |
| Inno Setup (Windows) | https://jrsoftware.org/isinfo.php | Free | | Apple Developer Program (macOS) | developer.apple.com | $99/yr |
| Apple Developer Program (macOS signing) | https://developer.apple.com | $99/yr | | Xcode CLT (macOS) | `xcode-select --install` | Free |
| Xcode command line tools (macOS) | `xcode-select --install` | Free | | appimagetool (Linux) | appimage.github.io | Free |
| appimagetool (Linux) | https://appimage.github.io | Free | | GitHub Actions (CI) | github.com | Free tier covers all 3 OS runners |
| GitHub Actions (CI) | github.com | Free tier covers all three OS runners | | Streamlit Community Cloud | streamlit.io/cloud | Free |
| Streamlit Community Cloud (demo hosting) | streamlit.io/cloud | Free |
| Python libraries | See `requirements.txt`, `pip install -r requirements.txt` | Free |
--- ## 5. Backup recommendation
## 5. Backup Recommendation - **Primary**: GitHub repository (private). Source of truth.
- **Secondary**: ZIP of full project tree on cloud storage (Drive / Dropbox / S3).
- **Apple Developer credentials**: cert + app-specific password in a password manager. Re-issuable, not catastrophic.
- **Streamlit Community Cloud**: stored as GitHub OAuth link in Streamlit UI. Re-authorize from new account if lost.
- Back up after every meaningful change.
- **Always include RECOVERY.md + DECISIONS.md** — irreplaceable context.
- **Primary backup**: GitHub repository (private). Source is the source of truth. ## 6. Recovery priorities (under time pressure)
- **Secondary backup**: ZIP of the full project tree on cloud storage (Google Drive / Dropbox / S3).
- **Apple Developer credentials**: store certificate + app-specific password in a password manager. Losing these requires regenerating, not catastrophic.
- **Streamlit Community Cloud connection**: stored in Streamlit's UI as a GitHub OAuth link. Re-authorize from a new Streamlit account if lost.
- Back up after every meaningful code or doc change.
- Include this `RECOVERY.md` and `DECISIONS.md` in every backup. They contain the irreplaceable context.
--- 1. **`src/core/` + scripts** — without these there is no product.
2. **DECISIONS.md** — without this you'll re-litigate every settled call.
## 6. Recovery Priorities (if rebuilding under time pressure) 3. **TECHNICAL.md** §10 (04/06 boundary) + §11 (per-script specs). Without these you'll rebuild dedup with weaker fuzzy than the v1 spec demands and lose to free Excel.
4. **`src/gui/`** — primary buyer surface; without it the product reverts to CLI-only and the persona refunds.
If you only have time to rebuild part of the project, this is the order: 5. **PyInstaller spec + launcher + per-OS configs** — recreating the Streamlit-PyInstaller integration is 1-3 days.
6. **Apple Developer Program enrollment** — 1-2 wk lead. Start first if Mac matters.
1. **Source: `src/core/` and `scripts/`**. Without these there is no product. 7. **Hosted demo** — important marketing asset, not blocking for desktop sales.
2. **DECISIONS.md**. Without this you will re-litigate every settled decision (especially GUI framework, dual interface, UX standards) and probably get it wrong differently. 8. Doc files (USER-GUIDE, BUSINESS, README) — recoverable from memory + this guide.
3. **TECHNICAL.md**, especially Sections 9 (04/06 boundary) and 10 (per-script functional requirements). Without these you will rebuild the deduplicator with weaker fuzzy matching than the v1 launch spec demands and ship something that loses to free Excel. 9. CI config — nice to have, not blocking.
4. **Streamlit GUI source (`src/gui/`)**. The primary buyer surface; without it the product reverts to CLI-only and the buyer persona will refund.
5. **PyInstaller spec + launcher + per-OS build configs** (`build/`). Reproducing the Streamlit-PyInstaller integration from scratch is 1-3 days of work.
6. **Apple Developer Program enrollment**. 1-2 week lead time. Start this first if Mac distribution matters.
7. **Hosted demo (`demo/streamlit_app.py`)**. Important marketing asset but not blocking for desktop sales.
8. Documentation files (USER-GUIDE, BUSINESS, README). Recoverable from memory + this guide.
9. CI config (`ci/build.yml`). Nice to have, not blocking.


@@ -1,146 +1,129 @@
# REQUIREMENTS.md # Requirements
Numbered, categorized requirements list — short form. The companion to USER-GUIDE.md and TECHNICAL.md; updated with every shipped capability. Numbered support matrix. Updated with every shipped capability.
---
## 1. File handling ## 1. File handling
1.1 Size: ≤ 1 GB target (larger works, slower).
1.1 File size: ≤ 1 GB (target; bigger files work but the gate's full-DataFrame Apply pass scales linearly). 1.2 Read: CSV, TSV, XLSX, XLS.
1.2 Input formats: CSV, TSV, XLSX, XLS. 1.3 Write: CSV, TSV.
1.3 Output formats: CSV, TSV. 1.4 Excel: multi-sheet picker.
1.4 Excel: multi-sheet workbook picker. 1.5 Empty file: blocked with `empty_input` error finding.
1.5 Empty file: detected, blocks gate with `empty_input` error finding.
## 2. Input encodings (auto-detected) ## 2. Input encodings (auto-detected)
2.1 Unicode: UTF-8, UTF-8-BOM, UTF-16 LE/BE BOM, UTF-16 LE no-BOM.
2.1 Unicode: UTF-8, UTF-8 with BOM, UTF-16 LE/BE with BOM, UTF-16 LE without BOM (best-effort).
2.2 Western: cp1252, ISO-8859-1, ISO-8859-15, Mac Roman. 2.2 Western: cp1252, ISO-8859-1, ISO-8859-15, Mac Roman.
2.3 Eastern European: cp1250, ISO-8859-2. 2.3 Eastern European: cp1250, ISO-8859-2.
2.4 Cyrillic: cp1251, KOI8-R. 2.4 Cyrillic: cp1251, KOI8-R.
2.5 CJK: Shift_JIS / cp932, GB18030, Big5, EUC-KR / cp949. 2.5 CJK: Shift_JIS / cp932, GB18030, Big5, EUC-KR / cp949.
2.6 ASCII: detected as UTF-8 (byte-equivalent). 2.6 ASCII detected as UTF-8.
2.7 User override: any Python codec name typed in the Review page. 2.7 User override: any Python codec name.
2.8 BOM: stripped on read, never written. 2.8 BOM: stripped on read, never written.
2.9 Decode failure: surfaced as `encoding_decode_failed` (error severity). 2.9 Decode failure → `encoding_decode_failed` (error).
2.10 Replacement char (U+FFFD) in output: surfaced as `encoding_uncertain` (error). 2.10 U+FFFD in output → `encoding_uncertain` (error).
## 3. Output encodings ## 3. Output encodings
3.1 UTF-8 (default), UTF-8-BOM (Excel-friendly).
3.1 UTF-8 (default). 3.2 cp1252, ISO-8859-1/15, cp1250, ISO-8859-2, cp1251.
3.2 UTF-8 with BOM (Excel-friendly). 3.3 Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE.
3.3 cp1252, ISO-8859-1, ISO-8859-15, cp1250, ISO-8859-2, cp1251. 3.4 Lossy fallback: `?` + warning when codec can't represent a char.
3.4 Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE.
3.5 Lossy fallback: `?` replacement + warning shown when chosen codec can't represent a character.
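The lossy fallback described above behaves like Python's built-in `errors="replace"` handler on encode. A minimal sketch, assuming a hypothetical helper named `encode_lossy` (not the project's actual writer), that encodes text and reports whether any character had to be replaced so a warning can be shown:

```python
def encode_lossy(text: str, codec: str) -> tuple[bytes, bool]:
    """Encode text with the chosen codec.

    Returns (encoded_bytes, lossy). lossy is True when at least one
    character could not be represented and was written as '?'.
    Hypothetical helper for illustration only.
    """
    try:
        # Strict pass first: succeeds only if every character is representable.
        return text.encode(codec), False
    except UnicodeEncodeError:
        # Fallback: unencodable characters become '?'.
        return text.encode(codec, errors="replace"), True


data, lossy = encode_lossy("café ☃", "cp1252")
if lossy:
    print("warning: some characters were replaced with '?'")
```

Doing the strict pass first keeps the common case free of warnings and makes the lossy path explicit.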
## 4. Delimiters ## 4. Delimiters
4.1 Input auto-detect: `,`, `\t`, `;`, `|`.
4.1 Auto-detect (input): `,`, `\t`, `;`, `|`.
4.2 Output: `,` (default), `\t`, `;`, `|`. 4.2 Output: `,` (default), `\t`, `;`, `|`.
4.3 File extension: `.tsv` for tab, `.csv` otherwise. 4.3 Extension: `.tsv` for tab, `.csv` otherwise.
## 5. Line endings ## 5. Line endings
5.1 Read: LF / CRLF / bare CR — all normalized to LF.
5.1 Input: LF, CRLF, bare CR (all normalized to LF on read).
5.2 Embedded in quoted cells: also normalized to LF. 5.2 Embedded in quoted cells: also normalized to LF.
5.3 Output: LF (default), CRLF, CR. 5.3 Write: LF (default), CRLF, CR.
5.4 Mixed line endings: surfaced as `mixed_line_endings` finding. 5.4 Mixed → `mixed_line_endings` finding.
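The line-ending rules in this section can be sketched in a few lines. This is an illustration under assumptions, not the code in `src/core/io.py`: the function names `normalize_newlines` and `line_ending_kinds` are hypothetical, and the sketch works on already-decoded text.

```python
import re


def normalize_newlines(text: str) -> str:
    """Convert CRLF and bare CR to LF. CRLF is replaced first so that
    a CRLF pair does not turn into two LFs."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def line_ending_kinds(text: str) -> set[str]:
    """Return which line-ending styles occur in the text."""
    names = {"\r\n": "CRLF", "\r": "CR", "\n": "LF"}
    # The alternation tries CRLF first, so a pair counts once.
    return {names[m] for m in re.findall(r"\r\n|\r|\n", text)}


raw = "id,name\r\n1,Ann\r2,Bob\n"
mixed = len(line_ending_kinds(raw)) > 1   # would correspond to a mixed_line_endings finding
clean = normalize_newlines(raw)
```

Because the replacement runs over the whole text, line breaks embedded in quoted cells are normalized as well, which matches 5.2.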
## 6. Analyzer detectors ## 6. Analyzer detectors
6.1 File-level (audit log of read-time fixes): `csv_bom_stripped`, `csv_nul_stripped`, `csv_smart_quotes_folded`, `csv_line_endings_normalized`, `csv_transcoded_to_utf8`, `csv_unquoted_delimiters_repaired`, `csv_unrepairable_rows`. **File-level** (read-time fixes, audit-logged):
6.2 Cell-level: `smart_punctuation_in_data`, `nbsp_or_unicode_whitespace`, `zero_width_or_invisible`, `dirty_column_headers`, `whitespace_padding`, `null_like_sentinels`, `suspected_mojibake`, `mixed_case_email_column`, `near_duplicate_rows`, `leading_zero_ids`. - `csv_bom_stripped`, `csv_nul_stripped`, `csv_smart_quotes_folded`, `csv_line_endings_normalized`, `csv_transcoded_to_utf8`, `csv_unquoted_delimiters_repaired`, `csv_unrepairable_rows`.
6.3 Encoding integrity: `encoding_uncertain`, `encoding_decode_failed`, `empty_input`.
6.4 Sample size (default): 1,000 rows; configurable. **Cell-level**:
- `smart_punctuation_in_data`, `nbsp_or_unicode_whitespace`, `zero_width_or_invisible`, `dirty_column_headers`, `whitespace_padding`, `null_like_sentinels`, `suspected_mojibake`, `mixed_case_email_column`, `inconsistent_date_format`, `near_duplicate_rows`, `leading_zero_ids`.
**Encoding integrity**: `encoding_uncertain`, `encoding_decode_failed`, `empty_input`.
Sample size: 1,000 rows (configurable).
## 7. Finding fields ## 7. Finding fields
`id`, `severity` (info/warn/error), `confidence` (high/medium/low), `fix_action`, `pre_applied`, `tool`, `count`, `description`, `column`, `samples` (≤5).
7.1 `id` — stable identifier.
7.2 `severity` — info / warn / error (error blocks gate).
7.3 `confidence` — high / medium / low (auto-fixability).
7.4 `fix_action` — id of the algorithm in `src/core/fixes.py`.
7.5 `pre_applied` — true if fixed during read pass.
7.6 `tool` — owning tool id (or empty).
7.7 `count`, `description`, `column`, `samples` (≤5).
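The field list above maps naturally onto a small record type. The following is a sketch only: the field names come from this section, but the types, defaults, and the `blocks_gate` property are assumptions, and the real schema in `src/core/analyze.py` may differ.

```python
from dataclasses import dataclass, field
from typing import List, Literal, Optional


@dataclass
class Finding:
    id: str                                       # stable identifier
    severity: Literal["info", "warn", "error"]    # error blocks the gate
    confidence: Literal["high", "medium", "low"]  # auto-fixability
    fix_action: str = ""                          # id of the fix algorithm
    pre_applied: bool = False                     # True if fixed during the read pass
    tool: str = ""                                # owning tool id, or empty
    count: int = 0
    description: str = ""
    column: Optional[str] = None
    samples: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Keep at most 5 samples, as the schema specifies.
        self.samples = list(self.samples)[:5]

    @property
    def blocks_gate(self) -> bool:
        # Assumed helper: only error-severity findings block tool pages.
        return self.severity == "error"
```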
## 8. Confidence tiers ## 8. Confidence tiers
- **high** — round-trip safe, one-click auto-fix.
- **medium** — preview before applying.
- **low** — opt-in only, can corrupt if wrong.
- **error** — must resolve or waive before tool pages unlock.
8.1 **high** — round-trip safe; one-click auto-fix. ## 9. Decision actions
8.2 **medium** — preview before applying. - `auto` — apply registered fix.
8.3 **low** — opt-in only; can corrupt data if wrong. - `skip` — waive (audit-logged).
8.4 **error** — must resolve or waive before tool pages unlock. - `modified` — apply with custom payload.
## 9. Decision actions per finding
9.1 `auto` — apply the registered fix.
9.2 `skip` — waive (no change, audit-logged).
9.3 `modified` — apply with custom payload (e.g. user-edited null sentinels).
## 10. Performance (1 GB input) ## 10. Performance (1 GB input)
- Initial scan (sample): < 2 s · peak RSS ~110 MB.
- Full-file `repair_bytes`: 30-40 s.
- Full-DataFrame analyze: ~4 min (~25 µs/cell).
- Full-DataFrame `auto_fix`: ~5 min (~30 µs/cell).
- Output write: ~10 s.
- Recommended RAM: 4× input size for full-Apply path.
10.1 Initial scan (`analyze` sample-mode): < 2 s. ## 11. Tools
10.2 Peak RSS during initial scan: ~110 MB. 1. Deduplicator — Ready
10.3 Full-file `repair_bytes`: ~30-40 s (when triggered). 2. Text Cleaner — Ready
10.4 Full-DataFrame analyze: ~4 min (~25 µs/cell). 3. Format Standardizer — Ready
10.5 Full-DataFrame `auto_fix`: ~5 min (~30 µs/cell). 4. Missing Value Handler — Coming Soon
10.6 Output write: ~10 s for 1 GB UTF-8 CSV. 5. Column Mapper — Coming Soon
10.7 RAM headroom recommended: 4× input file size for the full-Apply path. 6. Outlier Detector — Coming Soon
7. Multi-File Merger — Coming Soon
## 11. Tools shipped 8. Validator & Reporter — Coming Soon
9. Pipeline Runner — Coming Soon
11.1 Deduplicator — Ready.
11.2 Text Cleaner — Ready.
11.3 Format Standardizer — Coming Soon.
11.4 Missing Value Handler — Coming Soon.
11.5 Column Mapper — Coming Soon.
11.6 Outlier Detector — Coming Soon.
11.7 Multi-File Merger — Coming Soon.
11.8 Validator & Reporter — Coming Soon.
11.9 Pipeline Runner — Coming Soon.
## 12. Gate (Review & Normalize) ## 12. Gate (Review & Normalize)
- Gates every tool page.
12.1 Gates every tool page; tool pages refuse to load until passed. - Auto-fix button: applies all `confidence=high` findings in one click.
12.2 Auto-fix button applies all `confidence=high` findings in one click. - Per-finding controls: Auto / Skip / Customize.
12.3 Per-finding controls: Auto-fix / Skip / Customize. - Live before/after preview (≤5 sample rows).
12.4 Live before/after preview per finding (≤5 sample rows). - Audit log per fix (id, decision, cells changed).
12.5 Audit log: every fix tagged with finding id, decision, cells changed. - Encoding-override picker (16 codepages + custom).
12.6 Encoding override picker (16 codepages + custom). - Advanced output expander: encoding + delimiter + line terminator.
12.7 Advanced output options expander: encoding + delimiter + line terminator. - Result keyed by upload SHA-256; survives reload, invalidated on re-upload.
12.8 Result keyed by upload SHA-256; survives page reloads, invalidated on re-upload.
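Keying the gate result by the upload's SHA-256 is what lets it survive a reload and become invalid on re-upload: the same bytes hash to the same key, while different bytes miss the cache. A minimal sketch, assuming an in-memory dictionary and hypothetical names (`upload_key`, `GateCache`); the real GUI may store results differently, for example in Streamlit session state.

```python
import hashlib
from typing import Any, Dict, Optional


def upload_key(data: bytes) -> str:
    """Content hash of the uploaded file, used as the cache key."""
    return hashlib.sha256(data).hexdigest()


class GateCache:
    """Hypothetical cache of gate results keyed by upload content."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def put(self, data: bytes, result: Any) -> None:
        self._results[upload_key(data)] = result

    def get(self, data: bytes) -> Optional[Any]:
        # Same bytes -> same key -> cached result.
        # Changed bytes -> different key -> None, so the gate runs again.
        return self._results.get(upload_key(data))


cache = GateCache()
cache.put(b"id,name\n1,Ann\n", {"passed": True})
```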
## 13. Interfaces ## 13. Interfaces
- **GUI**: Streamlit, browser-based, local, no internet.
13.1 GUI: Streamlit, runs locally, browser-based, no internet required. - **CLI**: `python -m src.cli` (dedup) · `src.cli_text_clean` · `src.cli_analyze`.
13.2 CLI: Typer apps — `python -m src.cli`, `src.cli_text_clean`, `src.cli_analyze`. - **Python API**: `from src.core import …` (analyze, repair_bytes, clean_dataframe, deduplicate, standardize_dataframe, …).
13.3 Python API: `from src.core import …` (analyze, repair_bytes, clean_dataframe, deduplicate, etc.). - **JSON output**: `--json` on `cli_analyze`.
13.4 JSON output: `--json` flag on `cli_analyze`; full Finding schema.
## 14. Platforms ## 14. Platforms
- Python ≥ 3.10.
14.1 Python: ≥ 3.10. - OS: Linux, macOS, Windows.
14.2 OS: Linux, macOS, Windows. - Browser: any modern browser.
14.3 Display: any modern browser (Streamlit GUI). - Network: not required at runtime.
14.4 Network: not required at runtime.
## 15. Dependencies ## 15. Dependencies
- **Core**: pandas, openpyxl, charset-normalizer, typer, loguru.
15.1 Core: pandas, openpyxl, charset-normalizer, typer, loguru. - **Dedup**: rapidfuzz, phonenumbers.
15.2 Dedup: rapidfuzz, phonenumbers. - **GUI**: streamlit.
15.3 GUI: streamlit. - **Optional**: ftfy (mojibake repair).
15.4 Optional: ftfy (mojibake repair, `repair_mojibake` fix). - **Dev**: pytest, tox.
15.5 Dev: pytest, tox.
## 16. Test coverage ## 16. Test coverage
- 1,230 tests passing, 4 skipped (ftfy not installed), 17 xfailed (documented).
16.1 Unit + integration: 765 tests passing. - Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases).
16.2 Documented gaps: 17 xfail (charset-normalizer label drift on byte-equivalent codepages, byte-level smart-quote fold expectation). - Run: `python run_tests.py [--tool …] [--fixtures] [--coverage]`.
16.3 Fixture corpora: 21 text-cleaner fixtures, 31 encoding fixtures, 9 reference UTF-8 files.
16.4 CI surface: `python run_tests.py [--tool …] [--fixtures] [--coverage]`.
## 17. Privacy / data handling ## 17. Privacy / data handling
- All processing local; no network calls in the data path.
- No telemetry.
- Original input never modified.
- Audit logs: `logs/` next to each run (timestamped).
17.1 All processing local; no network calls in the data path. ## 18. Error handling
17.2 No telemetry, no usage analytics shipped. - Structured hierarchy: `DataToolsError` → `InputValidationError`, `ConfigError`, `FileFormatError`, `FileAccessError`.
17.3 Original input file never modified — outputs go to a separate path. - Subclasses extend stdlib `ValueError` / `OSError` so existing handlers still catch them.
17.4 Audit logs written to `logs/` next to each run (timestamped). - Every error carries: message, file path, column, operation, suggestion, underlying cause.
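The hierarchy in §18 can be illustrated with a short sketch. The class names and the idea of extending `ValueError` / `OSError` come from this document; the constructor signature and attribute names below are assumptions, not the actual contents of `src/core/errors.py`.

```python
from typing import Optional


class DataToolsError(Exception):
    """Base error carrying context for friendly messages (assumed signature)."""

    def __init__(
        self,
        message: str,
        *,
        path: Optional[str] = None,
        column: Optional[str] = None,
        operation: Optional[str] = None,
        suggestion: Optional[str] = None,
        cause: Optional[BaseException] = None,
    ) -> None:
        super().__init__(message)
        self.message = message
        self.path = path
        self.column = column
        self.operation = operation
        self.suggestion = suggestion
        self.cause = cause


# Mixing in the stdlib types keeps existing `except ValueError` /
# `except OSError` handlers working.
class InputValidationError(DataToolsError, ValueError): ...
class ConfigError(DataToolsError, ValueError): ...
class FileFormatError(DataToolsError, ValueError): ...
class FileAccessError(DataToolsError, OSError): ...


try:
    raise FileAccessError(
        "cannot read input",
        path="in.csv",
        operation="read_file",
        suggestion="check file permissions",
    )
except OSError as err:  # a pre-existing handler still catches it
    caught = err
```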


@@ -1,570 +1,350 @@
# TECHNICAL.md - Technical Design, Build Pipeline, Standards # Technical
> **Creator-only document. Do not ship to buyers.** > Creator-only. Do not ship to buyers.
> **Version**: 1.6 · **Updated**: 2026-05-01
**Version**: 1.6 ## 1. Architecture
**Last updated**: April 28, 2026
--- - **Dual interface**: CLI + GUI, both wrapping the same `src/core/` library.
- **GUI**: Streamlit, runs as local web server, opens in default browser. No internet.
- **Runtime**: Python 3.10+ (bundled into installer; buyer never sees Python).
- **Cross-platform**: Windows, macOS, Linux from day one. PyInstaller per OS.
- **Core/UI rule**: business logic in `core/` only. CLI + GUI are thin front-ends.
## 1. Architecture Overview **Locks**:
- v1.2 — dual interface required (non-technical buyers won't use CLI).
- v1.3 — Streamlit chosen (over CustomTkinter inactive, plain Tk UX gap, Flet/PySide6/NiceGUI each fails one dimension). See DECISIONS.md §4c.
- Standalone tools with **dual interface**: CLI and GUI, both wrapping the same core library. ## 2. Repo layout
- GUI framework: **Streamlit**. Runs as a local web server, opens in the buyer's default browser. No internet used.
- Python 3.11+ runtime (bundled into the installer; the buyer never installs Python).
- Modular code, one concern per script. Core logic is library code; CLI and GUI are thin front-ends.
- Cross-platform from day one: Windows, macOS, Linux.
- PyInstaller produces standalone executables per OS. Buyer never sees Python, pip, venvs, or PATH.
- No internet required at runtime.
**Why dual interface (locked v1.2)**: The primary buyer persona is non-technical and will not use a CLI. The GUI is therefore the primary surface and is required at v1, not deferred. The CLI is retained for power users, automation, scheduled jobs, and future scripted workflows. Both share a single core; neither has features the other lacks (except interactive review, which only makes sense in GUI).
**Why Streamlit (locked v1.3)**: Fastest build velocity, lowest maintenance burden per added feature, hosted browser demo deployable as a marketing asset, future SaaS optionality. Selected over CustomTkinter (maintenance inactive since Jan 2024), plain Tkinter (UX gap at this price tier), Flet (ecosystem too young), PySide6 (overkill), and NiceGUI (smaller community). Full rationale in DECISIONS.md Section 4c.
This is a major change from the original Inno-Setup-only, CLI-only design. Rationale chain:
1. Requiring a buyer to install Python before using the product is the largest source of install friction (solved by PyInstaller in v1.1).
2. Requiring a non-technical buyer to use a CLI is the second-largest source of refund risk (solved by dual interface in v1.2).
3. Betting the GUI on an unmaintained library is the largest hidden technical risk (solved by Streamlit choice in v1.3).
---
## 2. Standard Bundle Structure (source repo)
Every bundle follows this layout in source. Core logic is shared, CLI and GUI are thin front-ends.
``` ```
bundle-name/ src/
├── src/ core/ # Shared logic. No UI code.
├── __init__.py analyze.py # Detectors + Finding schema
├── core/ # Shared business logic. No UI code here. config.py # DeduplicationConfig (JSON profiles)
├── __init__.py dedup.py # Match strategies, union-find, survivor selection
│ ├── dedup.py # (example) the actual algorithm errors.py # Structured error hierarchy + format_for_user
│ │ └── io.py # File I/O, encoding/delimiter detection, etc. fixes.py # Fix registry (one per fix_action)
├── cli.py # Command-line interface (Typer). Thin wrapper over core. format_standardize.py # Per-cell standardizers + DataFrame pipeline
└── gui/ # Streamlit front-end. Thin wrapper over core. io.py # read_file / write_file / repair_bytes
│ ├── __init__.py normalize.py # CSV-normalization gate
├── app.py # Main Streamlit entry point (st.set_page_config, layout) normalizers.py # Per-column normalizers for dedup matching
├── pages/ # Streamlit multi-page app (one page per script in the bundle) text_clean.py # clean_dataframe + smart_title_case
├── 1_Deduplicator.py _constants.py # Shared USPS abbrevs + state names
├── 2_Text_Cleaner.py cli.py # Deduplicator CLI (Typer)
│ ├── 3_Format_Standardizer.py cli_text_clean.py # Text Cleaner CLI
└── ... cli_analyze.py # Analyzer CLI (--json)
└── components.py # Reusable Streamlit widgets and helpers gui/
├── data_examples/ # Sample input files app.py # Streamlit entry point
├── tests/ # Unit tests (pytest). Tests target core, not UI. pages/ # One page per tool
├── build/ components/ # shared, dedup_review, findings, gate, _legacy
│ ├── pyinstaller.spec # PyInstaller build spec (handles both CLI + GUI entry points) build/ # PyInstaller spec, launcher, OS-specific configs
│ ├── launcher.py # Small launcher script: starts Streamlit server, opens browser demo/ # Constrained Streamlit Community Cloud version
│ ├── windows/ tests/ # pytest; targets core/, not UI
└── installer.iss # Inno Setup wrapper for Windows .exe installer test-cases/ # Fixture corpora (text-cleaner, encodings, format-cleaner)
│ ├── macos/
│ │ ├── entitlements.plist
│ │ └── dmg_settings.py # dmg-creation config
│ └── linux/
│ └── AppImage/ # AppImage build assets
├── demo/ # Stripped-down version for hosted browser demo
│ └── streamlit_app.py # Entry point for Streamlit Community Cloud deployment
├── requirements.txt
├── README_bundle.md # User-facing guide (covers both CLI and GUI usage)
├── LICENSE
└── ci/
└── build.yml # GitHub Actions cross-platform build
``` ```
**Core/UI separation rule**: A new feature is implemented in `core/` first, with tests. CLI and GUI both call into core. If a feature exists only in one front-end (e.g., interactive review only in GUI), the underlying capability still lives in core; only the presentation differs. **Demo subfolder**: row-limited, watermarked, file-size-capped Streamlit app for public deployment. Same core, different front-end constraints.
**Demo subfolder rule**: The `demo/` folder contains a constrained Streamlit app for public deployment to Streamlit Community Cloud. Constraints: row limit (e.g., 100 rows max output), no file save, watermark on output, sample dataset only or strict file-size cap. Same core library, different front-end constraints. ## 3. Build pipeline
---
## 3. Cross-Platform Build Pipeline
### 3.1 Tooling ### 3.1 Tooling
| Concern | Tool | | Concern | Tool |
|---|---| |---------|------|
| Bundling Python + scripts into a standalone binary | PyInstaller | | Bundling | PyInstaller |
| GUI framework | **Streamlit** | | GUI | Streamlit |
| Browser launch from launcher | Python `webbrowser` module (stdlib) | | CLI | Typer |
| CLI framework | Typer | | Browser launch | stdlib `webbrowser` |
| Windows installer wrapper | Inno Setup (free) | | Win installer | Inno Setup (free) |
| macOS bundle format | `.app` packaged in `.dmg` | | macOS sign+notarize | `codesign` + `notarytool` |
| macOS code signing & notarization | `codesign` + `notarytool` (built into Xcode command line tools) | | Linux | AppImage (primary) + tarball fallback |
| Linux distribution format | AppImage (primary) + plain tarball (fallback) | | CI | GitHub Actions matrix |
| CI / automated builds | GitHub Actions (free tier handles all three OS runners) | | Demo host | Streamlit Community Cloud (free) |
| Hosted demo | Streamlit Community Cloud (free) or $5/mo VPS |
### 3.2 Build Outputs (what the buyer downloads) ### 3.2 Build outputs
| OS | File | Buyer experience |
|----|------|------------------|
| Win | `*-Setup-1.0.exe` | Wizard → desktop shortcut "Launch Bundle" → browser opens. CLI on PATH. |
| macOS | `*-1.0.dmg` | Drag to Applications. Signed + notarized. |
| Linux | `*-1.0.AppImage` | `chmod +x`, double-click. |
| Platform | Output file | Buyer experience | ### 3.3 PyInstaller
|---|---|---|
| Windows | `BundleName-Setup-1.0.exe` | Double-click installer, click through wizard. Desktop shortcut "Launch Bundle" runs `launcher.py`, which starts the local Streamlit server and opens default browser to `http://localhost:8501`. CLI executables also installed and on PATH. |
| macOS | `BundleName-1.0.dmg` | Double-click DMG, drag app to Applications. Signed and notarized. Launching the app runs the launcher, which starts the local server and opens the browser. CLI binaries shipped in the app bundle. |
| Linux | `BundleName-1.0.AppImage` | Mark executable, double-click. AppImage runs the launcher, opens browser. Tarball fallback also includes CLI binaries. |
The **default buyer experience on every platform is**: double-click, browser opens, work done. The CLI is present, documented, and on PATH for users who want it. - `--onefile` for Linux, `--onedir` for Win/macOS (faster startup, easier signing).
- Two entry points: GUI launcher + CLI binaries.
- Streamlit hooks needed: `streamlit`, `altair`, `pyarrow` data dirs.
- Custom `hook-streamlit.py` per documented pattern.
- Budget: 1-3 days first time. Reusable after.
**Browser-launch UX mitigation** (per DECISIONS.md Section 4c tradeoff): The launcher script displays a brief "Starting your data tool..." console message before opening the browser. The Streamlit app's first page includes a one-line note: *"This tool runs locally in your browser and does not use the internet."* Install email reinforces the same message. ### 3.4 Streamlit launcher
### 3.3 PyInstaller Configuration 1. Find free port (don't hardcode 8501).
2. Set env: `STREAMLIT_SERVER_HEADLESS=true`, `STREAMLIT_BROWSER_GATHER_USAGE_STATS=false`, `STREAMLIT_SERVER_PORT={port}`.
3. Start Streamlit programmatically in a thread.
4. Poll port until ready.
5. Open browser to `http://localhost:{port}`.
6. Keep launcher alive while server runs.
Single `.spec` file per bundle, parameterized for OS. Key settings: Optional v1.1: wrap with `pywebview` to eliminate browser-launch UX. Defer until support tickets show meaningful confusion.
- `--onefile` for Linux (single AppImage), `--onedir` for Windows and macOS (faster startup, easier signing). ### 3.5 macOS pipeline
- All dependencies bundled. No internet required at runtime.
- Hidden imports declared explicitly for pandas/openpyxl/Streamlit edge cases (PyInstaller's auto-detection misses some).
- Icon files per platform (`.ico` for Windows, `.icns` for macOS, `.png` for Linux).
- **Two entry points per bundle**: the GUI launcher (default, what the desktop shortcut runs) and the CLI binaries.
- **Streamlit-specific PyInstaller hooks**: include the `streamlit` data directory, the `altair` data directory (Streamlit dependency), and the `pyarrow` C extensions. Add a custom hook file (`hook-streamlit.py`) per the documented pattern. Budget 1-3 days the first time getting the spec right; reuse across all subsequent bundles.
### 3.4 Streamlit Launcher Pattern
The launcher script handles starting the local Streamlit server in a way that survives PyInstaller bundling. Conceptual outline:
1. Find a free local port (avoid hardcoding 8501 in case of conflict).
2. Set Streamlit environment variables: `STREAMLIT_SERVER_HEADLESS=true`, `STREAMLIT_BROWSER_GATHER_USAGE_STATS=false`, `STREAMLIT_SERVER_PORT={port}`.
3. Start Streamlit programmatically (via `streamlit.web.cli.main_run` or `bootstrap.run`) in a background thread.
4. Wait for the port to accept connections (poll with timeout).
5. Open the buyer's default browser to `http://localhost:{port}` via `webbrowser.open()`.
6. Keep the launcher process alive while the server runs. Detect server shutdown and exit cleanly.
Optional v1.1 enhancement: replace step 5 with a `pywebview` window that wraps the local server. Eliminates the "default browser opens" UX surprise. Adds a dependency and some packaging complexity. Defer until support tickets show the browser-launch is causing meaningful confusion.
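Steps 1, 2, and 4 of the launcher outline can be sketched with the standard library alone. This is an illustration under assumptions, not the contents of `build/launcher.py`: the function names are hypothetical, and starting Streamlit itself (step 3) is left out because the exact entry point depends on the Streamlit version.

```python
import socket
import time
from typing import Dict


def find_free_port() -> int:
    """Step 1: ask the OS for an unused local port instead of hardcoding 8501."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 lets the OS choose
        return sock.getsockname()[1]


def streamlit_env(port: int) -> Dict[str, str]:
    """Step 2: environment variables named in the outline."""
    return {
        "STREAMLIT_SERVER_HEADLESS": "true",
        "STREAMLIT_BROWSER_GATHER_USAGE_STATS": "false",
        "STREAMLIT_SERVER_PORT": str(port),
    }


def wait_for_port(port: int, timeout: float = 30.0, interval: float = 0.2) -> bool:
    """Step 4: poll until the port accepts connections or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection(("127.0.0.1", port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False

# Step 5 would then call webbrowser.open(f"http://localhost:{port}")
# once wait_for_port(port) returns True.
```

There is a small window between `find_free_port` returning and the server binding the port in which another process could take it; for a single-user desktop launcher that risk is usually acceptable.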
### 3.5 macOS Signing & Notarization Pipeline
Required setup (one-time):
1. Enroll in Apple Developer Program ($99/yr - see BUSINESS.md Section 10).
2. Generate Developer ID Application certificate via Apple Developer portal.
3. Install certificate in macOS keychain on the build machine (or store as encrypted GitHub Actions secret for CI).
4. Generate an app-specific password for `notarytool`.
Build-time flow (automated):
1. PyInstaller produces unsigned `.app`. 1. PyInstaller produces unsigned `.app`.
2. `codesign --deep --force --options runtime --sign "Developer ID Application: [Your Name]" BundleName.app` 2. `codesign --deep --force --options runtime --sign "Developer ID Application: ..." App.app`.
3. Package into `.dmg`. 3. Package as `.dmg`.
4. Submit `.dmg` to Apple notary service: `xcrun notarytool submit BundleName.dmg --wait`. 4. `xcrun notarytool submit *.dmg --wait`.
5. Staple the notarization ticket: `xcrun stapler staple BundleName.dmg`. 5. `xcrun stapler staple *.dmg`.
6. Output is the final, distributable `.dmg`.
Buyers on macOS see no Gatekeeper warnings. Clean install. Setup: Apple Developer Program ($99/yr), Developer ID cert in Keychain, app-specific password.
### 3.6 Windows Pipeline ### 3.6-3.7 Win + Linux
1. PyInstaller produces `BundleName/` folder with launcher `BundleName.exe` (which opens the GUI in browser) plus CLI binaries plus dependencies. - **Win**: PyInstaller `--onedir` → Inno Setup wraps → installer adds Start Menu, desktop shortcut, PATH entries. Optional code-signing cert ($200-400/yr) if SmartScreen friction.
2. Inno Setup script wraps the folder into `BundleName-Setup-1.0.exe`. - **Linux**: PyInstaller → `appimagetool` wraps. `.tar.gz` fallback for distros where AppImage fails.
3. Installer creates Start Menu entry, desktop shortcut (launches GUI), optional Add/Remove Programs entry, and adds CLI binaries to PATH.
4. Optional Windows code signing certificate (~$200-400/yr from a CA) eliminates SmartScreen warnings. **Not required at launch**; revisit if SmartScreen friction shows up in support tickets.
### 3.7 Linux Pipeline ### 3.8 CI matrix
1. PyInstaller produces single-file binaries per entry point.
2. Wrap in AppImage using `appimagetool` (free, well-documented). AppImage runs the launcher as the default target.
3. Provide a plain `.tar.gz` fallback for users on distributions where AppImage fails. Tarball includes both GUI launcher and CLI binaries plus a `run.sh`.
4. No signing required on Linux; users execute `chmod +x` then double-click or run.
### 3.8 CI Build Matrix
GitHub Actions builds all three platforms on tagged release:
```yaml ```yaml
# Conceptual, full file lives in ci/build.yml
strategy: strategy:
matrix: matrix:
os: [windows-latest, macos-latest, ubuntu-latest] os: [windows-latest, macos-latest, ubuntu-latest]
``` ```
Result: one git tag triggers three platform builds. Artifacts upload to GitHub Releases. Manual step: copy artifacts to Gumroad / Lemon Squeezy product page. Tag a release → 3 platform artifacts upload to GitHub Releases. Manual: copy to Gumroad / Lemon Squeezy.
### 3.9 Hosted Demo Deployment ### 3.9 Hosted demo
Separate from the desktop build pipeline. The `demo/streamlit_app.py` entry point is deployed to Streamlit Community Cloud: `demo/streamlit_app.py` → Streamlit Community Cloud. Configure deployment in Streamlit UI. Custom domain via CNAME (verify policy at deploy time). Fall back to $5/mo VPS if rate limits / branding constraints hit.
1. Connect the GitHub repo to Streamlit Community Cloud (one-time).
2. Configure the app to deploy from the `demo/` folder, main branch.
3. Set deployment-time environment variables (e.g., row limits, watermark flag).
4. App is publicly accessible at a `*.streamlit.app` URL. Link from Gumroad landing page.
5. Optional: custom domain via CNAME (free with Streamlit Community Cloud as of last check; verify before locking).
If Streamlit Community Cloud is ever unsuitable (rate limits, bandwidth, branding requirements), fall back to a $5/mo VPS running the demo via Docker. Same `demo/streamlit_app.py`, different host.
---
## 4. Libraries ## 4. Libraries
| Purpose | Library | | Purpose | Library |
|---|---| |---------|---------|
| GUI framework | Streamlit | | GUI | streamlit |
| CLI framework | Typer | | CLI | typer |
| Data manipulation | pandas, openpyxl, numpy | | Data | pandas, openpyxl, numpy |
| Fuzzy string matching | rapidfuzz | | Fuzzy match | rapidfuzz |
| File encoding detection | charset-normalizer | | Phone parsing | phonenumbers |
| Encoding detect | charset-normalizer |
| Logging | loguru | | Logging | loguru |
| Progress bars | tqdm (CLI), `st.progress` (GUI) | | Mojibake (optional) | ftfy |
| Validation | pydantic (optional) | | Reports | reportlab |
| Reports | ReportLab (PDF), pandas styling (Excel) |
| Optional native window wrap | pywebview (deferred to v1.1 if needed) |
## 5. Coding standards
### 5.1 Code
- PEP 8 + type hints on public functions.
- Docstrings on every module + public function.
- `pathlib.Path` for paths, never string concat.
- All I/O explicitly UTF-8-aware.
- No platform-specific shell calls.
- pytest for `core/`, not UI.
- Errors raise via `src.core.errors` hierarchy (Section 7).
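The path, encoding, and error bullets above can be combined in one small I/O helper. The sketch below is illustrative only: the two exception classes are stand-ins named after the hierarchy in `src/core/errors.py`, and the helper names and the cp1252 fallback are assumptions (the project itself lists charset-normalizer for detection).

```python
from pathlib import Path


class DataToolsError(Exception):
    """Stand-in for the project's base error class."""


class FileAccessError(DataToolsError, OSError):
    """I/O failure; still caught by existing `except OSError` handlers."""


def read_text_utf8_aware(path: Path) -> str:
    """Decode as UTF-8 (BOM tolerated), falling back to cp1252."""
    try:
        raw = path.read_bytes()
    except OSError as exc:
        # Re-raise with the file path attached so the user sees what failed.
        raise FileAccessError(f"Could not read {path}: {exc}") from exc
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Hypothetical fallback for legacy Windows exports.
        return raw.decode("cp1252", errors="replace")


def write_text_utf8(path: Path, text: str) -> None:
    """Always write UTF-8, regardless of the platform default."""
    path.write_text(text, encoding="utf-8")
```

Passing `encoding=` explicitly on every write is what keeps Windows from silently falling back to its locale default.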
### 5.2 GUI UX (load-bearing per DECISIONS.md §4b)
- **Works out of the box** — drop file → useful result with zero config.
- **Sensible defaults visible everywhere**.
- **Progressive disclosure** — basic = file uploader + run button + results; rest in `st.expander`.
- **Plain-English labels**; technical detail in `help=` tooltip.
- **Dry-run / preview by default**.
- **Identical core to CLI**.
- **Local-first messaging** — "runs locally in your browser, no internet" line on every page.
### 5.3 Functional scope (load-bearing per DECISIONS.md §4a)
- Each script ships **complete coverage of the workflow it names**, including features Excel does for free.
- Boundary = the named workflow. Dedup includes normalization + survivor + audit; not format conversion or charting.
## 6. System requirements
**Buyer runtime**: Win 10/11 64-bit · macOS 11+ · Linux glibc 2020+ · modern browser · ~400-500 MB disk · no internet.
**Developer**: Python 3.10+ · PyInstaller · Inno Setup (Win) · Xcode CLT (macOS) · Apple Developer Program $99/yr · Git + GitHub.
`requirements.txt` (current bundle, v1.3):

```
streamlit>=1.30
pandas
openpyxl
numpy
typer
rapidfuzz
charset-normalizer
loguru
tqdm
reportlab
pyarrow  # Streamlit dependency, declare explicitly for PyInstaller clarity
altair   # Streamlit dependency, declare explicitly for PyInstaller clarity
```

---

## 5. Coding Standards

### 5.1 Code Standards

- PEP 8 + type hints on all public functions.
- Docstrings on every module and public function.
- `--help` output (CLI) that a non-technical user can act on.
- Graceful error handling with human-readable messages, not stack traces. Errors must name the problem AND the likely fix where possible (e.g., "Column 'email' not found. Available columns: name, phone. Did you mean 'phone'?").
- All file paths handled via `pathlib.Path`, never string concatenation. Cross-platform correctness depends on this.
- All file I/O explicitly UTF-8-aware: detect encoding on input (charset-normalizer), write UTF-8 on output. Windows defaults to cp1252 otherwise.
- No platform-specific shell calls. If absolutely needed, branch on `sys.platform`.
- Unit tests for core logic (pytest). Tests target `core/`, not UI front-ends. Tests run on all three OS runners in CI.
- Semantic versioning per bundle.
- **Core/UI separation**: never put business logic in `cli.py` or `gui/`. If a CLI command and a GUI button do "the same thing," they call the same function in `core/`.

### 5.2 UX Standards (GUI / Streamlit) - load-bearing per DECISIONS.md Section 4b

- **Works out of the box**: dropping a file into the Streamlit `st.file_uploader` must produce a useful result with zero configuration.
- **Sensible defaults visible everywhere**: every `st.selectbox`, `st.slider`, `st.checkbox` has a default, the default is shown, the user is not forced through a config screen on first run.
- **Progressive disclosure**: basic view shows file uploader + go button + results. Advanced options live in `st.expander("Advanced options")` panes.
- **Plain-English labels**: no technical jargon in primary UI. `help=` parameter on widgets carries technical detail for users who want it.
- **Dry-run / preview by default**: user sees what would change before any file is written. Original input is never modified.
- **Single-page completion**: basic task completes on a single Streamlit page. Multi-page apps are for separate scripts in the bundle, not for multi-step wizards within one script.
- **Identical core to CLI**: any capability available in CLI is available in GUI, and vice versa. The only legitimate divergence is interactive review (GUI-natural) and scripted/scheduled execution (CLI-natural).
- **Local-first messaging**: every GUI page includes the line *"This tool runs locally in your browser and does not use the internet"* in a small, persistent location (e.g., footer or sidebar).

### 5.3 Functional Scope Standard - load-bearing per DECISIONS.md Section 4a

- Each script ships with **complete coverage of the workflow it names**, including features available free elsewhere (e.g., exact-match dedup).
- Scope boundary is the workflow, not "things adjacent to the workflow." A deduplicator includes normalization, survivor selection, audit. It does not include format conversion or charting; those belong elsewhere in the bundle.

---

## 6. System Requirements

**For buyers (runtime)**:

- Windows: Windows 10 or 11, 64-bit.
- macOS: macOS 11 (Big Sur) or later, Apple Silicon or Intel.
- Linux: any glibc-based distribution from 2020 onward (Ubuntu 20.04+, Fedora 33+, etc.).
- A modern default browser (Chrome, Edge, Firefox, Safari from the last 3 years). Used to display the local GUI; no internet required.
- ~400-500 MB free disk space (Streamlit packaging is larger than alternatives; this is an accepted tradeoff per DECISIONS.md Section 4c).
- No internet required after install. No Python install required ever.

**For developers (you)**:

- Python 3.11+.
- PyInstaller, Inno Setup (Windows builds), Xcode command line tools (macOS builds).
- Apple Developer Program membership ($99/yr) for macOS distribution.
- Git + GitHub account (for CI builds and Streamlit Community Cloud deployment of demos).

---

## 7. Error handling (`src/core/errors.py`)

Structured hierarchy for friendly messages + maintainable trace context:

```
DataToolsError                      # base; carries path/column/operation/suggestion/cause
InputValidationError(ValueError)    # bad arg / wrong type
ConfigError(ValueError)             # bad config / options
FileFormatError(ValueError)         # file isn't what we expected
FileAccessError(OSError)            # I/O failure (perms, disk, missing)
```

**Subclassing rule**: every subclass extends a stdlib base (`ValueError` or `OSError`) so existing `except OSError` / `except ValueError` handlers still catch them.

**Helpers**:
- `ensure_dataframe(value, function=...)` — uniform DataFrame guard at every public entry.
- `ensure_choice(value, name=, choices=)` — uniform enum/literal guard.
- `wrap_file_read(path, op, exc)` / `wrap_file_write(...)` — tag OSError with file path + Windows-aware permission tip.
- `format_for_user(exc, context=)` — single string for `st.error()` / CLI stderr.

GUI / CLI handlers use `format_for_user()` so the user always sees: file path, operation, underlying error class, recovery suggestion.

## 8. Per-bundle status

| Bundle | Status |
|--------|--------|
| Data Cleaning Mastery | 3/9 tools Ready (Dedup, Text Cleaner, Format Standardizer); 6 stubs |
| Automated Business Reporting | Not started |
| Ecommerce Data Pipeline | Not started |
| Small Business Finance | Not started |
| Marketing Public Data Aggregation | Not started |
| AI Ecommerce Aggregation (Shopify Pet) | Not started |

## 9. Open decisions

- **pywebview wrap** — defer until support tickets show browser-launch confusion.
- **Win code signing** — defer until SmartScreen drives volume. Cost ~$200-400/yr.
- **Auto-update mechanism** — none at launch. Email-delivered updates. Revisit at 100+ buyers/bundle.
- **Demo hosting migration** — Streamlit Community Cloud → $5/mo VPS if rate/brand limits hit.
- **Code obfuscation** — none; license text + bundle complexity sufficient at $49-79.
- **Telemetry** — none. Consider opt-in privacy-respecting only post-launch.

## 10. Script boundaries — 04 (Missing Values) vs 06 (Outliers)

Deliberately separate. The original spec conflated the two and was wrong.

| Script | Owns |
|--------|------|
| 04 Missing Value Handler | "What's not there." Disguised nulls (`N/A`, `-`, sentinel codes), missingness patterns, imputation, drop-by-threshold. |
| 06 Outlier Detector | "What shouldn't be there." z-score / IQR / modified-z, multivariate (Isolation Forest, Mahalanobis), domain rules, winsorization. |

**Run order**: 04 before 06. Outlier stats on data with `NaN` / sentinels are mathematically poisoned (means dragged, IQR widens, false negatives).

**Pipeline order** (Pipeline Runner enforces): 02 → 03 → 04 → 05 → 06 → 07 → 08. 01 is order-flexible.

**Contested cases**:
- Whitespace-only cell — 02 trims to empty; 04 then flags empty as null.
- `-999` sentinel — 04 converts to `NaN` first; 06 then computes stats.
- Suspicious-but-plausible (age 110) — 06 territory.
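The run-order rule (04 before 06) is easiest to see with numbers. The sketch below uses only the standard library and a plain z-score; the function names, the sentinel set, the threshold of 2.0, and the sample ages are illustrative assumptions, not the tools' actual API.

```python
import statistics

SENTINELS = {-999}  # hypothetical sentinel codes for this example


def neutralize_sentinels(values, sentinels=SENTINELS):
    """Script 04's job: turn sentinel codes into None (missing)."""
    return [None if v in sentinels else v for v in values]


def zscore_outliers(values, threshold=2.0):
    """Script 06's job: flag present values whose |z| exceeds the threshold."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    stdev = statistics.pstdev(present)
    if stdev == 0:
        return []
    return [v for v in present if abs(v - mean) / stdev > threshold]


ages = [34, 29, 41, 38, -999, 36, 33, 110, 31, 35]

# Run on raw data, the sentinel drags the mean and inflates the spread,
# so only the sentinel itself is flagged and age 110 hides behind it.
flagged_raw = zscore_outliers(ages)

# With sentinels neutralized first, the statistics recover and 110 surfaces.
flagged_clean = zscore_outliers(neutralize_sentinels(ages))
```

Here `flagged_raw` contains only the sentinel, while `flagged_clean` contains only the age of 110, which is the false negative the run order is meant to prevent.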
## 7. Per-Bundle Technical Notes

| Bundle | Status | Tech notes |
|---|---|---|
| Data Cleaning Mastery | Lead, 1/9 scripts complete (CLI only; needs Streamlit GUI port) | Cleaning, dedup, text hygiene, standardization, missing-value handling, outlier detection, type coercion, reporting. Scripts 04 (missing values) and 06 (outliers) are deliberately separate concerns; 04 runs first to neutralize sentinel codes before 06 computes statistics (see Section 9). Script 02 (text cleaner) runs first in the pipeline to normalize whitespace and special characters before any other operation. v1.3 spec: Streamlit GUI required at launch, with hosted demo deployed to Streamlit Community Cloud. |
| Automated Business Reporting | Not started | Aggregation -> styled PDF/Excel output |
| Ecommerce Data Pipeline | Not started | Extract -> clean -> export workflow |
| Small Business Finance | Not started | Bookkeeping summaries, simple reconciliation |
| Marketing Public Data Aggregation | Not started | Public API + scraping with respect for robots.txt and ToS |
| AI Ecommerce Aggregation (Shopify Pet) | Not started | Optional LLM enhancement, requires API key from buyer |

---

## 8. Open Technical Decisions

GUI framework choice is now **closed** (Streamlit, locked v1.3 - see DECISIONS.md Section 4c).

Remaining open items:

- **pywebview wrap of Streamlit launcher**: Optional v1.1 enhancement to eliminate the "browser opens" UX surprise. Defer until support tickets show meaningful buyer confusion. Cost: extra dependency, more PyInstaller complexity. Benefit: native-window UX.
- **Windows code signing**: Currently unsigned. Revisit if SmartScreen warnings drive support volume. Cost: ~$200-400/yr.
- **Auto-update mechanism**: None at launch. Email-delivered version updates. Revisit at 100+ paying customers per bundle.
- **Demo deployment hosting**: Streamlit Community Cloud at launch (free). Migrate to $5/mo VPS if rate limits, bandwidth, or branding constraints become an issue.
- **Code obfuscation**: Currently relying on license text + PyInstaller bundling. Decompilation is possible but unlikely for $49-79 products. Accept the risk.
- **Telemetry**: None. Consider privacy-respecting opt-in usage telemetry post-launch to inform roadmap, but only if explicit and disclosed.

---

## 11. Per-script functional specs

Specs live in this section as scripts enter active build. Each follows the Tier 1/2/3 structure with explicit strategic framing (what's the market gap given some of this is free elsewhere).

### 11.1 `01_deduplicator.py` — Smart duplicate removal

**Status**: Ready. Tier 1 mostly built. Streamlit GUI port complete.

**Market gap**: fuzzy match quality of OpenRefine, with the zero-learning UX of Excel, sold once for under $100, runs locally.

**Tier 1**:
- **Input**: auto-detect encoding (UTF-8, UTF-8-BOM, Latin-1, cp1252) · delimiter · header row · CSV/TSV/XLSX/XLS · multi-sheet picker · streaming for files > RAM.
- **Matching**: exact + 3 fuzzy algos (Levenshtein / Jaro-Winkler / token-set) · per-column normalizers (5 types) · configurable threshold per strategy · multi-strategy OR.
- **Survivor**: keep first / last / most-complete / most-recent · merge mode (fill blanks from losers).
- **Trust**: dry-run preview by default · interactive review for gray-zone matches · confidence score per match · match-group export.
- **Audit**: timestamped log · removed-rows separate file · input never modified · idempotent.
- **Config**: save/load JSON profiles · sensible auto-detect defaults.
- **UX**: human `--help` · progress bar > 10k rows · errors name row + column + value + suggestion.

**Tier 2**: numeric/date tolerance · phonetic match (Soundex, Metaphone) · blocking/indexing · watch-folder.

**Tier 3**: ML scoring · cross-file dedup · cron · Shopify/Klaviyo API direct.
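To make "per-column normalizers" and "multi-strategy OR" concrete, here is a small sketch of one rule: match if email is exact, or if the name is a close fuzzy match and the phone is exact. It deliberately uses the standard library's `difflib.SequenceMatcher` as a stand-in scorer so the example runs without dependencies; the shipped tool lists rapidfuzz, and the function names, row shape, and 90-point threshold here are assumptions.

```python
from difflib import SequenceMatcher


def normalize_email(value: str) -> str:
    """Match-time normalizer: trim and lowercase."""
    return value.strip().lower()


def normalize_name(value: str) -> str:
    """Match-time normalizer: lowercase and collapse internal whitespace."""
    return " ".join(value.lower().split())


def similarity(a: str, b: str) -> float:
    """0-100 similarity score (stand-in for a rapidfuzz scorer)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def is_duplicate(row_a: dict, row_b: dict, name_threshold: float = 90.0) -> bool:
    """Strategy 1: exact email. Strategy 2: fuzzy name AND exact phone."""
    email_a = normalize_email(row_a["email"])
    email_b = normalize_email(row_b["email"])
    if email_a and email_a == email_b:
        return True
    name_score = similarity(normalize_name(row_a["name"]), normalize_name(row_b["name"]))
    return name_score >= name_threshold and row_a["phone"] == row_b["phone"]
```

Normalizing only at match time leaves the stored values untouched, which is the distinction the spec draws between dedup normalization and the Text Cleaner's write-time policy.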
## 9. Script Boundaries: 04 (Missing Values) vs 06 (Outliers)

The two scripts are deliberately separate. Original spec ("missing-value handler also does basic outlier flagging") was wrong: it conflated two different statistical operations and would have produced overlapping CLI flags, confused buyers, and brittle code.

### 9.1 Boundary

**`04_missing_value_handler.py` owns "what's not there"**:

- Detect disguised nulls: `NaN`, empty string, `"N/A"`, `"-"`, `"unknown"`, whitespace-only, sentinel codes (`-999`, `9999`, etc.).
- Missingness pattern analysis (which columns co-miss).
- Imputation strategies: mean, median, mode, forward-fill, KNN (optional), regression (optional).
- Required-field enforcement (drop rows missing a required column).
- Drop rows or columns by missingness threshold.

**`06_outlier_detector.py` owns "what shouldn't be there"**:

- Univariate statistical detection: z-score, IQR, modified z-score (MAD-based).
- Multivariate detection: Isolation Forest, Mahalanobis distance.
- Domain-rule violations (age > 120, negative quantity, future dates in historical data).
- Winsorization / capping as optional remediation.
- Distribution shape diagnostics.

### 9.2 Run Order

04 must run before 06. Reason: outlier statistics computed on data still containing `NaN` or sentinel codes are mathematically poisoned. Means and standard deviations get dragged, IQR widens, false negatives explode.

The Master Orchestrator (script 09) enforces this order. CLI users running scripts manually get a warning printed by 06 if it detects unhandled sentinel patterns in the input.

Pipeline-wide order enforced by the orchestrator: `02_text_cleaner` → `03_format_standardizer` → `04_missing_value_handler` → `05_column_mapper_enforcer` → `06_outlier_detector` → `07_multi_file_merger` → `08_validator_reporter`. Script `01_deduplicator` is order-flexible; it normalizes whitespace and case internally for matching purposes regardless of upstream cleaning, so it can run before or after `02_text_cleaner` depending on the buyer's workflow.

### 9.3 Contested Cases

**Use cases that prove 04 and 06 are distinct concerns** (not just naming differences):

- *Bank export with blank fee columns*: pure 04 job. The fees aren't outliers, they're missing. Imputation or drop-by-threshold is the right tool.
- *Sales data with one $1M order in a $50-average column*: pure 06 job. Nothing is missing; one row is statistically extreme. Z-score or IQR catches it.
- *Survey data where `999` means "refused to answer"*: needs both, in order. 04 converts `999` to `NaN` per `--sentinels`, then 06 computes statistics on the cleaned column.

Sentinel values like `-999` are *both* disguised missing *and* statistical outliers. Resolution: 04 owns sentinel detection and converts them to `NaN` (or imputes per user choice) before 06 sees the data. Sentinel patterns are configurable in 04 via `--sentinels` flag.

Suspicious-but-plausible values (e.g., age = 110): 06's territory. Not missing; just rare.

Whitespace-only cells (e.g., `" "`) are a contested case between 02 (text cleaner) and 04 (missing value handler). Resolution: 02 trims first, leaving an empty string. 04 then detects empty strings as disguised nulls per its existing logic. This means 02 must run before 04 in any pipeline that uses both. The orchestrator enforces this; CLI users get a warning if 04 detects whitespace-only cells suggesting 02 was skipped.

### 9.4 Shared Plumbing

Both scripts emit:

- A flagged-row report with column, row index, original value, action taken.
- A timestamped log file in `logs/`.
- An optional cleaned output file.

Report and log formats are identical between the two scripts. Implemented via shared helpers in `src/core/` to avoid drift.

---

### 11.2 `02_text_cleaner.py` — Character-level hygiene

**Status**: Ready. Tier 1 built.

**Market gap**: one-click correctness for the dirty-CSV failure modes that cause silent VLOOKUP misses.

**Boundary**:
- 02 — whitespace, Unicode normalize, smart-char fold, BOM, line endings, zero-width, control chars, case ops. Writes to disk.
- 03 — dates, currencies, names, phones, addresses (display formatting).
- 04 — disguised nulls.
- 01 — `normalize_string` is *match-time* only, distinct from 02's *write-time* policy.

**Tier 1 ops** (each toggleable; defaults shown for `excel-hygiene`):
1. Trim leading/trailing whitespace — ON
2. Collapse internal whitespace runs — ON
3. NFC normalize — ON
4. NFKC compatibility fold — OFF (lossy, opt-in via `paranoid` preset)
5. Smart-char fold (curly quotes, em/en-dash, NBSP, ellipsis) — ON
6. Zero-width / invisible char strip — ON
7. BOM strip — ON
8. Control-char strip (preserve `\t\n\r`) — ON
9. Line-ending normalize (CRLF/CR → LF inside cells) — ON
10. Case conversion (UPPER / lower / Title / Sentence) — OFF, per-column

**Scope**: per-column selection · skip-list · operates on string-typed columns only.

**Trust**: dry-run by default · per-cell change log (capped 1000, `--full-changelog` removes cap) · 3 output files mirroring dedup · idempotent.

**Config**: 3 presets (`minimal` / `excel-hygiene` (default) / `paranoid`) · save/load JSON.

### 11.3 `03_format_standardizer.py` — Per-domain canonical forms

**Status**: Ready. Full Tier 1 + most Tier 2 built. 199-row buyer corpus passing.

**Market gap**: unify dates / phones / emails / addresses / names / currencies / booleans across messy ETL inputs without buyer writing code.

**Domains**:

| Domain | Default canonical | Notable handling |
|--------|-------------------|------------------|
| Date | ISO 8601 (`YYYY-MM-DD`) | MDY/DMY, Excel serial, Unix timestamp (s + ms), longform months, year-month, quarter notation, French/German/Spanish month dictionaries (opt-in), buried-date regex, error sentinels for invalid dates |
| Phone | E.164 + `;ext=N` | libphonenumber, 001 international prefix handling, error sentinels for placeholders / multi-number / contamination |
| Email | lowercase + trim | display-name extraction, mailto/angle-bracket strip, smart-quote unwrap, optional `--gmail-canonical` mode |
| Address | USPS-canonical (`expand=False`) or expanded (`expand=True`) | state-name → 2-letter, multi-line collapse, PO Box normalize, state-code preservation regardless of input case |
| Name | smart Title Case | Mc/Mac/O'/D' inner caps, hyphen segments, particle lowercasing (von/van/de/da), comma-format reversal, period stripping for titles/suffixes/initials, PhD/MD acronym preservation, conservative mode |
| Currency | bare number (dot decimal) | auto-detect EU vs US separators, space-thousands, Swiss apostrophe, accounting parens, optional ISO code preservation |
| Boolean | `True`/`False` (configurable) | accepts `yes`/`no`/`y`/`n`/`1`/`0`/`on`/`off` |

**Per-domain `error_policy`**: `"passthrough"` (default) keeps the original; `"sentinel"` emits `<error: <reason>>` for cases like Feb 30, double @, percentages mistaken for currency, etc.

**Pipeline**: `standardize_dataframe(df, options)` runs per-column with `column_types: dict[str, FieldType]`. Returns `StandardizeResult` with `cells_changed`, `cells_unparseable`, change audit. Warns when > 10% of typed cells fail to parse.

**Presets**: `us-default`, `european`, `uk`, `iso-strict`, `legacy-us`. Custom abbreviations via `extra_abbreviations`.

### 11.4 Upload-time analyzer (`src/core/analyze.py`)

Read-only advisory pass on every upload. Emits `Finding` objects:

| Field | Meaning |
|-------|---------|
| `id` | Stable identifier (never localized) |
| `severity` | `info` / `warn` / `error` (only `error` blocks gate) |
| `confidence` | `high` (round-trip safe) / `medium` (preview) / `low` (heuristic) |
| `fix_action` | id of algorithm in `fixes.py` (empty for informational-only) |
| `pre_applied` | `true` if fix already ran during read pass |
| `tool` | owning tool id (or empty for file-level) |
| `count` | cells / rows affected |
| `description` | one-sentence human summary |
| `column` | column name (None for file-level) |
| `samples` | up to 5 `(row, col, value)` examples |

Entry point: `analyze(source, *, sample_rows=1000, repair_result=None, encoding_override=None)`. `encoding_override` skips charset detection — the hook that lets the Review page recover from misdetections.
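The field table above maps naturally onto a dataclass. The following is a sketch of what such a record could look like, not the definition in `src/core/analyze.py`: the field names come from the table, while the defaults, the type annotations, the two helper functions, and the example finding ids are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Finding:
    id: str                       # stable identifier, never localized
    severity: str                 # "info" / "warn" / "error"
    confidence: str               # "high" / "medium" / "low"
    description: str              # one-sentence human summary
    fix_action: str = ""          # empty means informational-only
    pre_applied: bool = False     # True if the fix already ran during the read pass
    tool: str = ""                # owning tool id, empty for file-level findings
    count: int = 0                # cells / rows affected
    column: Optional[str] = None  # None for file-level findings
    samples: list = field(default_factory=list)  # up to 5 (row, col, value) tuples


def blocks_gate(findings) -> bool:
    """Only `error` severity blocks the gate, per the table."""
    return any(f.severity == "error" for f in findings)


def is_informational(finding: Finding) -> bool:
    """A finding with no fix_action has nothing to apply."""
    return finding.fix_action == ""
```

Keeping `id` and `fix_action` as plain strings means a finding can be serialized or logged without importing the fix functions it refers to.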
### 11.5 CSV-normalization gate (`src/core/normalize.py`, `fixes.py`)

Two paths:

1. **Auto-fix**: `auto_fix(df, findings)` applies every `confidence="high"` finding whose `fix_action` is registered.
2. **Per-finding decisions**: `apply_decisions(df, findings, decisions)` accepts `Decision(finding_id, action, payload)` with action `"auto"|"skip"|"modified"`.

Returns `NormalizationResult` with `cleaned_df`, `cleaned_bytes` (UTF-8 CSV), `applied`, `skipped_findings`, `pending_findings`, `blocking_findings`.

`is_normalized(findings, result)` re-runs `analyze()` against cleaned bytes; returns False if any high-confidence detector still fires (the strict contract tool pages depend on).

**Fix registry**: `@register("fix_id")` decorates `(df, payload) → (new_df, n_cells_changed)`. New fix = one entry in `analyze.py` `FIX_*` constants + one detector emitting that `fix_action` + one registered function. No other call sites change.

### 11.6 Review page (`src/gui/pages/0_Review.py`)

1. Detected encoding + override picker (16 codepages + custom).
2. One expandable card per finding (sorted by severity then confidence) with: decision radio (Auto/Skip/Customize), live before/after preview built by running the registered fix on `Finding.samples`, payload editor for fixes that take user input.
3. Apply persists `NormalizationResult` keyed by upload SHA-256; tool pages refuse to load until hash matches.
4. `⚙️ Advanced output options` expander: per-download encoding + delimiter + line terminator. `_build_output_bytes()` returns `(bytes, error_message)`; lossy fallbacks emit a warning the page surfaces.

Gates the entire tool sidebar via `require_normalization_gate()` in `src/gui/components/_legacy.py`.

### 11.7 Pre-parse repair (`src/core/io.py::repair_bytes`)

Byte-level pre-parse pass. **Order is meaningful**:

1. **Wide-encoding transcode** (UTF-16/32 → UTF-8) — must run first or NUL strip below shreds UTF-16.

## 10. Per-Script Functional Requirements

This section captures the full functional spec for each script, beyond the one-line description in USER-GUIDE.md Section 2. Specs answer "what does v1 need to ship to be best-of-class for the target buyer." Promoted from chat-history-only into docs in v1.6 to prevent silent drift.

**Note on script status**: a script labeled "Working" in the bundle status table means it has CLI execution and basic correctness, NOT that it implements every Tier 1 item below. Tier 1 is the v1 launch target; the current code may implement a subset.

### 10.1 `01_deduplicator.py` - Smart duplicate removal

**Current implementation status**: `01_deduplicator.py` exists and works for exact match plus basic fuzzy with configurable subset columns and timestamped logs (the description in USER-GUIDE.md Section 2 reflects current state). It does NOT yet implement most Tier 1 items below. Tier 1 is the v1 launch target, not current state. The Streamlit GUI port is the natural moment to close this gap.

**Strategic framing**: Excel's built-in Remove Duplicates handles exact match for free. Pandas `drop_duplicates()` handles it for free in code. A $49-$79 dedup tool that ships "exact + basic fuzzy" loses to Excel for free or to OpenRefine for free. The fuzzy matching has to be the product, not a checkbox. The market gap this script targets is "fuzzy match quality of OpenRefine, with the zero-learning-curve UX of Excel, sold once for under $100, runs locally" (see BUSINESS.md Section 4a).

#### Tier 1: Must-ship for v1 to be best-of-class

**Input handling**

1. Auto-detect file encoding (UTF-8, UTF-8-BOM, Latin-1, Windows-1252). Failure to handle this correctly is the #1 reason CSV tools crash on real-world business data.
2. Auto-detect delimiter (comma, tab, semicolon, pipe).
3. Read CSV, TSV, XLSX, XLS. For XLSX, support multi-sheet workbooks (let user pick or process each).
4. Handle files larger than RAM via chunked / streaming processing. A 500MB customer export should not crash the tool.
5. Header row detection with manual override.

**Matching**

6. Exact match with configurable subset columns.
7. Fuzzy match algorithms: Levenshtein, Jaro-Winkler, token-set ratio (rapidfuzz library). Multiple algorithms, not one. Different data types match better with different algorithms.
8. Per-column normalization before comparison:
   - Email: lowercase, strip whitespace, strip Gmail dots, strip `+tag` suffixes.
   - Phone: strip formatting and country codes, normalize to E.164.
   - Name: strip titles (Mr/Ms/Dr), strip suffixes (Jr/III), collapse whitespace, optional case-fold.
   - Address: USPS-style abbreviation normalization (St/Street, Ave/Avenue, Apt/#).
   - Generic string: trim, collapse internal whitespace, optional case-fold.
9. Configurable similarity threshold (e.g., 85%, 90%, 95%) per match strategy.
10. Multi-strategy matching with OR logic: "match if email is exact OR (name fuzzy >90% AND phone exact)." This is what real-world dedup actually requires. Single-strategy match handles maybe 40% of cases.

**Survivor selection (which row to keep when duplicates are found)**

11. Configurable survivor rules: keep first, keep last, keep most-complete (fewest empty cells), keep most-recent (date column), keep manually-selected.
12. Merge mode: instead of deleting losers, fill missing fields in survivor from losers. Common real ask: combine partial records into one complete record.

**Trust and review**
13. Dry-run / preview mode by default. Output shows what *would* be merged before any file is written. Non-negotiable for trust. Aligns with Section 5.2 visible-safety standard.
14. Interactive review mode for uncertain matches. For matches in the gray zone (e.g., 75-90% similarity), prompt user yes/no/skip with side-by-side diff. This is what justifies a paid product over free Excel. GUI-natural; CLI gets a reduced-form prompt loop.
15. Confidence score on every fuzzy match in the output.
16. Match group export: separate file showing every group of matched rows so user can audit.
**Audit and safety**
17. Full timestamped log of every action: which rows matched, on which strategy, with what score, which row survived, which fields were merged.
18. Removed-duplicates exported to a separate file (never silently destroyed).
19. Original input file is never modified. Output is always a new file.
20. Idempotency: running the tool twice on the same input with the same config produces the same output.
**Configuration**
21. Save / load configuration profiles. A user who deduplicates a Shopify customer export weekly should configure once, not every run.
22. Sensible defaults that work on a typical messy CSV with zero configuration. The first run must produce a useful result with no flags. Per DECISIONS.md Section 4b UX standards.
**UX**
23. `--help` (CLI) written for non-technical users with concrete examples, not a flag list.
24. Progress bar for files over ~10K rows.
25. Error messages name the row number, column, and value that caused the problem. No raw stack traces. Per Section 5.1.
26. Sample data (`samples/messy_sales.csv`) must demonstrate fuzzy match scenarios, not just exact dupes. Otherwise the demo doesn't sell.
#### Tier 2: Worth-considering for v1.1
27. Numeric tolerance for matching (prices within $0.01 considered same).
28. Date tolerance for matching (dates within N days considered same).
29. Phonetic matching (Soundex, Metaphone) for name fields with common misspellings.
30. Blocking / indexing for performance on large files (compare only rows that share a first letter or first three characters of a key field). Without this, fuzzy match on 100K rows is O(n²) and unusable. Move to Tier 1 if early buyers report performance complaints.
31. Watch-folder mode: auto-process any file dropped into a folder.
32. Geolocation-aware address matching (optional, requires bundled USPS data or third-party API).
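Item 30's blocking idea can be sketched in a few lines: group rows by a short key prefix and compare only rows that share a block, so the candidate set stays well below the all-pairs count. The function name, the row shape, and the 3-character default are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(rows, key, prefix_len=3):
    """Yield index pairs of rows whose key field shares its first few characters."""
    blocks = defaultdict(list)
    for idx, row in enumerate(rows):
        block_key = row[key].strip().lower()[:prefix_len]
        blocks[block_key].append(idx)
    # Only rows inside the same block are ever compared.
    for members in blocks.values():
        yield from combinations(members, 2)


rows = [
    {"name": "Smith, John"},
    {"name": "smith, jon"},
    {"name": "Jones, Al"},
    {"name": "Smyth, J"},
]
pairs = list(candidate_pairs(rows, "name"))
```

For these four rows the all-pairs approach would score six comparisons, while blocking yields one. The cost is recall: "Smyth" lands in a different block from "Smith", so a typo inside the prefix is never compared, which is why the prefix length needs to be tunable.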
#### Tier 3: Optional / later
33. Machine-learning-based match scoring (Dedupe.io territory; high complexity, marginal gain for this price point).
34. Multi-table joins for cross-file dedup.
35. Schedule / cron integration.
36. Direct Shopify / Klaviyo / Mailchimp API integration to dedupe in place. This would be a real differentiator for the Shopify niche specifically and is probably the right v2 direction if early sales validate the niche.
### 10.2 `02_text_cleaner.py` - Character-level hygiene
**Current implementation status**: Stub only. `src/gui/pages/2_Text_Cleaner.py` is a placeholder UI with disabled controls. No `src/core/text_clean.py`, no CLI, no tests. Tier 1 below is the v1 launch target; nothing in this section is built yet.
**Strategic framing**: Excel and the OS provide effectively nothing here. Find/Replace fixes one character at a time. Power Query's "Clean" strips control chars but ignores BOMs, smart quotes, NBSPs, and zero-width chars. OpenRefine has the operations buried under "Common transforms" where the buyer never finds them. Pandas users call `df.applymap(str.strip)` and miss everything else.
The market gap this script fills: **one-click correctness for the dirty-CSV failure modes that cause "why won't this VLOOKUP match?"** Trailing spaces, NBSP-in-place-of-space, smart quotes pasted from Word, mojibake, BOMs from Excel's "Save As CSV UTF-8". The buyer doesn't know they need this script until it fixes a problem they have spent two hours on. Demo value is high: the before/after diff sells itself.
**Boundary clarification** (cross-references Section 9):
- 02 owns whitespace, Unicode normalization, smart-character folding, BOM strip, line-ending normalization, zero-width strip, control-char strip, case ops. Writes cleaned values back to disk.
- 03 (format standardizer) owns dates, currencies, names, phones, addresses.
- 04 (missing values) owns disguised nulls (`N/A`, `-`, `unknown`, sentinel codes). Whitespace-only cells: 02 trims first to empty string; 04 then detects empty as null (per Section 9.3).
- 01 (deduplicator) has its own `normalize_string` helper for *match-time* case-folding. That is a match-time policy and stays distinct from 02's *write-time* policy. The two will not be merged; 02 may use lower-level helpers but does not aggressively case-fold cleaned output by default.
#### Tier 1: Must-ship for v1 to be best-of-class
**Operations** (each independently toggleable; defaults given for the `excel-hygiene` preset)
1. Whitespace trim - leading/trailing on every cell. Default ON.
2. Internal whitespace collapse - multi-space and tabs-in-cells to single space. Default ON.
3. Unicode NFC normalization - combining-character forms folded to canonical (e.g., `e + U+0301` to single `é`). Default ON.
4. Unicode NFKC normalization - compat fold (`①` to `1`, `ﬁ` to `fi`). Default OFF, lossy, opt-in only. Not part of any preset other than `paranoid`.
5. Smart-character folding - curly quotes to ASCII, em/en-dash to hyphen, ellipsis `…` to `...`, NBSP `U+00A0` to space. Default ON.
6. Zero-width / invisible character strip - `U+200B`, `U+200C`, `U+200D`, `U+2060`, mid-string `U+FEFF`. Default ON.
7. BOM strip - `U+FEFF` at the start of the first cell of the first column (covers the case where the I/O layer didn't catch it). Default ON.
8. Control character strip - `U+0000`-`U+001F` and `U+007F`, *except* preserve `\t`, `\n`, `\r`. Default ON.
9. Line-ending normalization - within multi-line cells, `\r\n` and bare `\r` to `\n`. Default ON.
10. Case conversion - UPPER / lower / Title / Sentence. Default OFF, per-column. Title case is "smart": preserves all-caps tokens (`USA`, `NASA`) and lowercases mid-string particles (`of`, `and`, `the`).
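A compact sketch of the default-ON operations (1, 2, 3, 5, 6, 7, 8, 9) applied to a single cell is shown below. It is an illustration under assumptions, not the shipped `text_clean` module: the function name, the order of steps, and the exact character tables are choices made for this example. Invisible characters are stripped before NFC so that a base letter and its combining mark are adjacent when normalization runs.

```python
import re
import unicodedata

SMART_CHARS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # ellipsis
    "\u00a0": " ",                  # NBSP
}
# Zero-width characters plus U+FEFF (covers a leading BOM as well).
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")
# C0 controls and DEL, excluding \t (09), \n (0A), \r (0D).
CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_cell(text: str) -> str:
    text = ZERO_WIDTH.sub("", text)                        # ops 6 and 7
    text = CONTROL.sub("", text)                           # op 8
    text = unicodedata.normalize("NFC", text)              # op 3
    for src, dst in SMART_CHARS.items():                   # op 5
        text = text.replace(src, dst)
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # op 9
    text = re.sub(r"[ \t]+", " ", text)                    # op 2
    return text.strip()                                    # op 1
```

Running `clean_cell` on its own output should return the same string; the spec's idempotency requirement is exactly this property, checked per operation and per preset.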
**Scope control**
11. Per-column selection - by default operate on string-typed columns only; numeric / datetime columns pass through untouched. User can pick columns explicitly via `--columns`.
12. Skip-list - exclude specific columns via `--skip` even if they match the string-dtype filter (e.g., free-text notes columns).
**Trust and audit**
13. Dry-run preview by default. Output shows N cells that would change in column X. `--apply` writes. Non-negotiable for trust. Same standard as the deduplicator.
14. Per-cell change log: `{input}_changes.csv` with (row, column, old, new, ops_applied). Capped to first N rows by default to avoid 50MB audit files; `--full-changelog` removes the cap.
15. Three output files on `--apply`: `{input}_cleaned.csv`, `{input}_changes.csv`, `logs/text_clean_{ts}.log`. Mirrors the deduplicator output shape.
16. Original input file is never modified.
17. Idempotency: `clean(clean(x)) == clean(x)` for every individual op and every preset. Asserted as a property test.
**Configuration**
18. Presets: `--preset excel-hygiene` (everything safe ON, NFKC OFF, case OFF), `--preset minimal` (only trim + collapse), `--preset paranoid` (everything including NFKC). Buyers should not have to learn 9 flags. Default preset when no flag given: `excel-hygiene`.
19. Save / load JSON config. Same shape and reuse pattern as `DeduplicationConfig`.
**UX**
20. `--help` written for non-technical users with concrete examples, not a flag dump. Per DECISIONS.md Section 4b.
21. Progress bar for files over ~10K rows.
22. Error messages name the row, column, and value that caused the problem. No raw stack traces.
23. Sample data (`samples/messy_text.csv`) demonstrates: smart quotes from Excel, NBSP-vs-space, BOM, mixed line endings, zero-width chars. The before/after diff is the demo.
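The dry-run and change-log requirements above can be sketched as a planning pass that never mutates its input. Everything here is hypothetical: `plan_changes`, `changes_to_csv`, the two stand-in ops, and the use of plain dicts instead of a DataFrame are assumptions made to keep the example self-contained.

```python
import csv
import io
import re

# Two stand-in ops as (name, function) pairs; the real tool defines ten.
OPS = [
    ("trim", str.strip),
    ("collapse", lambda s: re.sub(r"[ \t]{2,}", " ", s)),
]


def plan_changes(rows, columns, ops=OPS, cap=1000):
    """Return (row, column, old, new, ops_applied) records; rows are not modified."""
    changes = []
    for row_idx, row in enumerate(rows):
        for col in columns:
            old = row[col]
            new, applied = old, []
            for name, fn in ops:
                result = fn(new)
                if result != new:
                    applied.append(name)
                    new = result
            if new != old:
                changes.append((row_idx, col, old, new, "+".join(applied)))
                # cap=None plays the role of --full-changelog.
                if cap is not None and len(changes) >= cap:
                    return changes
    return changes


def changes_to_csv(changes) -> str:
    """Serialize change records in the (row, column, old, new, ops_applied) shape."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["row", "column", "old", "new", "ops_applied"])
    writer.writerows(changes)
    return buf.getvalue()
```

A dry run would print a summary of `plan_changes(...)` and stop; `--apply` would write the cleaned file and the change log from the same records, so the preview and the committed result cannot drift apart.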
#### Tier 2: Worth-considering for v1.1
24. Custom regex find/replace - power-user escape hatch, per-column.
25. Diacritic strip (`José` to `Jose`). Lossy; opt-in only.
26. Mojibake auto-repair - detect `Ã©` to `é` patterns (UTF-8 read as Latin-1 then re-encoded) and fix. Standard tool: `ftfy`. Promote to Tier 1 if early buyers report this.
27. Punctuation normalization - all Unicode dash/quote/space variants folded; runs of punctuation collapsed.
28. Profile detector - scan a file and recommend which ops to enable based on what's actually present. Lowers config friction further.
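For item 26, the simplest case (UTF-8 bytes that were decoded as Latin-1) can be reversed with the standard library alone. This is only a sketch under that one assumption; `repair_mojibake` is a hypothetical name, and it does not cover cp1252 misreads or multi-layer damage, which is where a dedicated tool such as `ftfy` earns its place.

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of 'UTF-8 bytes decoded as Latin-1'; return text unchanged otherwise."""
    try:
        # Re-encode to recover the original bytes, then decode them as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text holds characters outside Latin-1, or the recovered
        # bytes are not valid UTF-8, so the text was probably fine already.
        return text
```

Correctly decoded text such as `café` fails the round trip and passes through untouched, which is why the repair is safe to attempt but still heuristic: a string that happens to round-trip cleanly would be rewritten.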
#### Tier 3: Optional / later
29. Locale-aware case conversion (Turkish dotted/dotless `i`, German `ß`).
30. Custom character-class strip rules (regex-class).
31. Streaming / chunked write for very large files (defer until a buyer reports it).
#### Open decisions captured at spec time
- Smart-character folding default ON in `excel-hygiene` accepted as the right tradeoff: highest-impact use case, dry-run preview makes the change visible before commit.
- NFKC stays Tier 1 but OFF by default and excluded from `excel-hygiene`. Lossy by design.
- CLI surface: separate `src/cli_text_clean.py` module, matching the "one CLI binary per script on PATH" pattern in Section 3.2. Not a subcommand on the existing dedup Typer app.
- `ftfy` dependency deferred to Tier 2 (~5MB). Revisit if mojibake reports come in.
### 10.2.1 Upload-time analyzer (`src/core/analyze.py`)
The analyzer is a read-only, advisory pass that runs on every uploaded file before any tool page sees it. It produces a list of `Finding` objects, each carrying:
| Field | Type | Meaning |
|---|---|---|
| `id` | str | Stable identifier (`smart_punctuation_in_data`, `mixed_line_endings`, …). Never localized. |
| `severity` | `info` / `warn` / `error` | UX urgency. `error` is the only level that blocks the gate. |
| `confidence` | `high` / `medium` / `low` | Auto-fixability. **High** is round-trip safe, **medium** has known false-positive shapes, **low** is heuristic and opt-in. |
| `fix_action` | str | Stable id naming the algorithm in `src/core/fixes.py` that resolves this finding. Empty for informational-only findings. |
| `pre_applied` | bool | True when the fix already ran during the read pass (BOM strip, NUL strip, byte-level smart-quote fold). The gate treats these as already-resolved. |
| `tool` | str | Tool id that owns this concern (`02_text_cleaner`, `04_missing_handler`). Empty for file-level findings. |
| `count` | int | Cells / rows affected. |
| `description` | str | One-sentence human summary (banners, tooltips). |
| `column` | str / None | Column name when scoped to one column. |
| `samples` | list[(row, col, value)] | Up to 5 examples for the GUI to render. |
`analyze(source, *, sample_rows=1000, repair_result=None, encoding_override=None)` is the public entry point. `source` is a DataFrame or a path; `encoding_override` skips charset detection and uses the user's chosen codepage instead — this is the hook that lets the Review page recover from misdetections (cp1252-vs-cp1250 ambiguity, KOI8-R surfacing as Shift_JIS).
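The field table above maps naturally onto a dataclass. The sketch below is an assumption about shape only: the field order, the defaults, and the two helper predicates (`blocks_gate`, `auto_fixable`) are hypothetical and are not taken from `src/core/analyze.py`.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional


@dataclass
class Finding:
    id: str                                         # stable, never localized
    severity: Literal["info", "warn", "error"]
    confidence: Literal["high", "medium", "low"]
    description: str                                # one-sentence human summary
    fix_action: str = ""                            # empty for informational findings
    pre_applied: bool = False                       # fix already ran during the read pass
    tool: str = ""                                  # empty for file-level findings
    count: int = 0                                  # cells / rows affected
    column: Optional[str] = None
    samples: list = field(default_factory=list)     # up to 5 (row, col, value) tuples


def blocks_gate(finding: Finding) -> bool:
    """Only `error` severity blocks the gate."""
    return finding.severity == "error"


def auto_fixable(finding: Finding) -> bool:
    """High confidence, has a fix, and was not already resolved at read time."""
    return (
        finding.confidence == "high"
        and bool(finding.fix_action)
        and not finding.pre_applied
    )
```

Keeping `severity` and `confidence` as separate fields is what lets a finding be urgent but not auto-fixable, or trivially fixable but only informational.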
### 10.2.2 CSV-normalization gate (`src/core/normalize.py`, `src/core/fixes.py`)
A file enters tool pages only after passing the gate. The gate has two paths:
1. **Auto-fix**: `auto_fix(df, findings)` applies every `confidence="high"` finding whose `fix_action` is registered in `fixes.py`.
2. **Per-finding decisions**: `apply_decisions(df, findings, decisions)` accepts an explicit list of `Decision(finding_id, action, payload)` where action is `"auto" | "skip" | "modified"`.
Output is a `NormalizationResult` with:
- `cleaned_df` — the DataFrame after every applied fix.
- `cleaned_bytes` — UTF-8 CSV serialization for the download.
- `applied`, `skipped_findings`, `pending_findings`, `blocking_findings` — audit log + gate status.
`is_normalized(findings, result)` re-runs `analyze()` against the cleaned bytes and returns False if any high-confidence detector still fires — that's the strict contract tool pages depend on.
`fixes.py` is a registry: `@register("fix_id")` decorates a `(df, payload) -> (new_df, n_cells_changed)` function. Adding a new fix means appending one entry to `analyze.py`'s `FIX_*` constants, one detector that emits a Finding with that `fix_action`, and one registered function in `fixes.py`. No other call sites change.
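The registry pattern described above can be sketched as follows. To stay dependency-free, a plain dict of column lists stands in for the DataFrame and findings are plain dicts; the fix id `trim_whitespace` and the body of `auto_fix` are assumptions for illustration, not the contents of `fixes.py`.

```python
FIXES = {}  # fix_id -> function(table, payload) -> (new_table, n_cells_changed)


def register(fix_id):
    """Decorator that records a fix function under a stable id."""
    def decorator(fn):
        if fix_id in FIXES:
            raise ValueError(f"duplicate fix id: {fix_id}")
        FIXES[fix_id] = fn
        return fn
    return decorator


@register("trim_whitespace")  # hypothetical fix id
def trim_whitespace(table, payload):
    changed = 0
    out = {}
    for col, values in table.items():
        new_values = []
        for value in values:
            new_value = value.strip() if isinstance(value, str) else value
            changed += new_value != value
            new_values.append(new_value)
        out[col] = new_values
    return out, changed


def auto_fix(table, findings):
    """Apply every high-confidence finding whose fix_action is registered."""
    applied = []
    for finding in findings:
        fn = FIXES.get(finding["fix_action"])
        if finding["confidence"] == "high" and fn is not None:
            table, n_changed = fn(table, {})
            applied.append((finding["id"], n_changed))
    return table, applied
```

Because call sites only look fixes up by id, adding a fix touches the registry and the detector that emits the id, and nothing else.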
### 10.2.3 Review page (`src/gui/pages/0_Review.py`)
Streamlit page that orchestrates the gate visually. Gates the entire tool sidebar via `require_normalization_gate()` in `src/gui/components.py`, which every tool page calls right after `hide_streamlit_chrome()`.
The page:
1. Surfaces the detected encoding plus an override picker (16 common codepages + custom-text fallback).
2. Renders one expandable card per finding, sorted by severity then confidence, with a decision radio (Auto / Skip / Customize), a live before/after preview built by running the registered fix on each `Finding.samples` value, and a payload editor for fixes that take user input (e.g. custom null-sentinel list for `replace_null_sentinels`).
3. Apply button persists a `NormalizationResult` keyed by upload SHA-256; tool pages refuse to load until the hash matches.
4. After apply, an `⚙️ Advanced output options` expander offers per-download encoding, delimiter, and line-terminator selection. The helper `_build_output_bytes(df, *, encoding, delimiter, line_terminator)` returns `(bytes, error_message)`; when the chosen encoding can't represent a character, it falls back to `errors="replace"` and returns a warning that the page surfaces.
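The encode-with-fallback behaviour in step 4 might look roughly like the sketch below. It is not the real `_build_output_bytes`: the name `build_output_bytes` is hypothetical, it takes a header and row lists rather than a DataFrame, and the warning wording is invented.

```python
import csv
import io


def build_output_bytes(header, rows, *, encoding="utf-8", delimiter=",",
                       line_terminator="\n"):
    """Return (bytes, warning); warning is None when the encoding was lossless."""
    buf = io.StringIO(newline="")  # newline="" keeps the chosen terminator intact
    writer = csv.writer(buf, delimiter=delimiter, lineterminator=line_terminator)
    writer.writerow(header)
    writer.writerows(rows)
    text = buf.getvalue()
    try:
        return text.encode(encoding), None
    except UnicodeEncodeError as exc:
        bad = text[exc.start]  # first character the codec could not represent
        warning = (
            f"{encoding} cannot represent {bad!r} (U+{ord(bad):04X}); "
            "unrepresentable characters were replaced with '?'"
        )
        return text.encode(encoding, errors="replace"), warning
```

Trying a strict encode first means the lossy path is only taken when it is actually needed, and the warning can name the offending character instead of failing the download.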
### 10.2.4 Pre-parse repair (`src/core/io.py::repair_bytes`)
Byte-level pre-parse pass. Order is meaningful and each step is independently toggleable:
1. **Wide-encoding transcode** — UTF-16/UTF-32 → UTF-8. Has to run first because the byte-level NUL strip below would shred UTF-16 data (UTF-16 ASCII chars carry NUL as half of every 16-bit unit). Records `transcode_to_utf8` audit action; the analyzer surfaces it as a `csv_transcoded_to_utf8` info finding.
2. **UTF-8 BOM strip** (file start only).
3. **NUL strip** — only meaningful after step 1, so it flags genuine corruption (truncated C strings, half-binary exports) rather than encoding artifacts.
4. **Line-ending normalize** — CRLF and bare CR → LF. Bare CR confuses the C parser; the text-cleaner contract also calls for LF inside multi-line cells.
5. **Byte-level smart-quote fold** — curly / guillemet / double-prime → ASCII `"`. Only structural double-quote-equivalents; single curly quotes are deferred to the cell-level cleaner.
6. **Per-row delimiter repair** — when one row has +1 field and the merge candidate is currency-shaped (`$1,500.00` etc.), merge and quote.
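The ordering constraint between the first four steps can be illustrated with a short sketch. It is not the real `repair_bytes`: wide encodings are detected here by BOM only, steps 5 and 6 are omitted, and every audit-action name other than `transcode_to_utf8` is hypothetical.

```python
import codecs


def repair_bytes_sketch(raw: bytes):
    """Run steps 1-4 in order; return (repaired_bytes, audit_actions)."""
    actions = []
    # Step 1: transcode UTF-32 / UTF-16 to UTF-8. UTF-32 is checked first
    # because the UTF-32 LE BOM begins with the UTF-16 LE BOM.
    for bom, codec in (
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ):
        if raw.startswith(bom):
            raw = raw.decode(codec).encode("utf-8")  # the codec consumes the BOM
            actions.append("transcode_to_utf8")
            break
    # Step 2: UTF-8 BOM at file start only.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
        actions.append("strip_bom")
    # Step 3: any NUL left after step 1 is corruption, not encoding padding.
    if b"\x00" in raw:
        raw = raw.replace(b"\x00", b"")
        actions.append("strip_nul")
    # Step 4: CRLF and bare CR become LF.
    if b"\r" in raw:
        raw = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
        actions.append("normalize_line_endings")
    return raw, actions
```

Swapping steps 1 and 3 would strip the zero byte out of every ASCII character in a UTF-16 file, which is the failure the ordering exists to prevent.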
`detect_encoding()` tries strict UTF-8 first and returns `"utf-8"` if the bytes decode cleanly. This was added because charset-normalizer fingerprints small files dominated by short non-ASCII sequences (e.g. zero-width chars at U+200B-class) as `mac_latin2`; if the bytes are valid UTF-8, that is the right answer regardless of the label.
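A minimal sketch of the strict-UTF-8-first rule. The `fallback` parameter is a hypothetical stand-in for the statistical detector, and its `cp1252` default is a placeholder, not the project's behaviour.

```python
def detect_encoding_sketch(raw: bytes, fallback=lambda data: "cp1252") -> str:
    """Return "utf-8" when the bytes decode strictly; otherwise defer to `fallback`."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Not valid UTF-8, so a statistical guess is the best available answer.
        return fallback(raw)
    return "utf-8"
```

Strict decoding is a yes/no test rather than a guess, so it is safe to let it override a detector's label on short inputs.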
### 10.3 - 10.9 (Future)
Functional specs for scripts 03 through 09 to be added when each script enters active build. The deduplicator (10.1) and text cleaner (10.2) specs are the template; specs for other scripts should follow the same Tier 1 / Tier 2 / Tier 3 structure with explicit strategic framing (what's the market gap this script fills, given that some of its functionality is available free elsewhere).


@@ -1,208 +1,118 @@
# USER-GUIDE.md - Excel & CSV Data Cleaning Mastery Bundle # User Guide
**Version**: 1.6 **Version**: 1.6 · **Updated**: 2026-05-01
**Last updated**: April 28, 2026
Thank you for purchasing the Data Cleaning Mastery Bundle. This guide covers installation and every script included. ## 1. Install
--- You don't need Python — the bundle is self-contained.
## 1. Installation

| OS | File | How |
|----|------|-----|
| Windows | `BundleName-Setup-1.0.exe` | Double-click installer → desktop shortcut. |
| macOS | `BundleName-1.0.dmg` | Mount, drag to Applications. Signed + notarized. |
| Linux | `BundleName-1.0.AppImage` | `chmod +x`, double-click. (`.tar.gz` fallback available.) |
The bundle is fully self-contained. **You do not need to install Python.** Launching opens your default browser to a local page (`http://localhost:8501`).
### Windows ### How the GUI works
1. Download `BundleName-Setup-1.0.exe` from your purchase email. - Runs locally on your machine. **No internet, no upload.**
2. Double-click the installer. - Browser is just the display surface. Closing it stops the underlying program.
3. Follow the wizard. The installer creates a desktop shortcut named "Launch Bundle" and an entry in Start Menu. - Prefer the terminal? Every tool ships with a CLI too (Section 3).
4. Launch via the desktop shortcut. Your default browser will open to a local page (typically `http://localhost:8501`) where the data tool runs.
### macOS ### System requirements
1. Download `BundleName-1.0.dmg` from your purchase email. - Windows 10/11 (64-bit), macOS 11+, modern Linux (2020+).
2. Double-click the `.dmg` to mount it. - Modern browser (Chrome, Edge, Firefox, Safari, last 3 years).
3. Drag the Bundle app into the Applications folder.
4. Launch from Applications, Spotlight, or Launchpad. Your default browser will open to a local page where the data tool runs.
The app is signed and notarized by Apple, so it opens cleanly with no security warnings.
### Linux
1. Download `BundleName-1.0.AppImage` from your purchase email.
2. Make it executable: `chmod +x BundleName-1.0.AppImage`
3. Double-click to run, or execute from a terminal. Your default browser will open to a local page where the data tool runs.
If AppImage doesn't work on your distribution, a `.tar.gz` fallback is available in your purchase email. Extract it and run `./run.sh` from the extracted folder.
### How the GUI works (important to know)
This tool runs in your browser **locally on your computer**. When you launch it, a small program starts a local server on your machine and opens your default browser to view it. This is normal and expected.
- **No internet is required.** Your data never leaves your computer.
- **Your data is not uploaded anywhere.** All processing happens on your machine.
- The browser is just the display surface. Closing the browser closes the GUI; the underlying program also stops.
If you prefer the command line, every script also ships as a CLI tool. See Section 3.
### Requirements
- Windows: Windows 10 or 11 (64-bit).
- macOS: macOS 11 Big Sur or later (Apple Silicon or Intel).
- Linux: any modern 64-bit distribution from 2020 onward.
- A modern default browser (Chrome, Edge, Firefox, or Safari from the last 3 years).
- ~400-500 MB free disk space.
- Internet connection: not required.
For the full short-form numbered list of what's supported (file sizes, code pages, delimiters, performance targets, detector list, etc.), see [REQUIREMENTS.md](REQUIREMENTS.md). Full numbered support matrix: [REQUIREMENTS.md](REQUIREMENTS.md).
--- ## 2. What's included
## 2. What's Included

| # | Tool | Purpose | Status |
|---|------|---------|--------|
| 01 | Deduplicator | Exact + fuzzy match, 5 normalizers, audit | Ready |
| 02 | Text Cleaner | Whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | Format Standardizer | Dates / phones / emails / addresses / names / currencies / booleans | Ready |
| 04 | Missing Value Handler | Disguised nulls, imputation, drop-by-threshold | Coming Soon |
| 05 | Column Mapper | Rename + enforce schema | Coming Soon |
| 06 | Outlier Detector | z-score, IQR, multivariate | Coming Soon |
| 07 | Multi-File Merger | Combine multiple files | Coming Soon |
| 08 | Validator & Reporter | Rules + PDF/Excel report | Coming Soon |
| 09 | Pipeline Runner | One-click multi-tool launcher | Coming Soon |

**Sample data** (`samples/`): `messy_sales.csv`, `bank_export.xlsx`.

**Scripts (in the `scripts/` folder)**:

| # | Script | Purpose | Status |
|---|---|---|---|
| 01 | `01_deduplicator.py` | Smart duplicate removal: exact match + basic fuzzy, configurable subset columns, full logs | Working |
| 02 | `02_text_cleaner.py` | Character-level hygiene: trim leading/trailing whitespace, collapse internal multi-spaces, strip non-printable characters, Unicode normalization (smart quotes, em-dashes, accents), remove zero-width characters, BOM handling, line-ending normalization, case operations | Working |
| 03 | `03_format_standardizer.py` | Standardize dates, currencies, names, phone numbers, addresses | Skeleton |
| 04 | `04_missing_value_handler.py` | Detect and handle missing values: disguised nulls (`N/A`, `-`, blanks, sentinel codes), imputation (mean/median/mode/forward-fill), required-field enforcement, drop-by-threshold | Skeleton |
| 05 | `05_column_mapper_enforcer.py` | Rename columns, enforce a target schema | Skeleton |
| 06 | `06_outlier_detector.py` | Detect and flag statistical outliers (z-score, IQR, modified z-score), multivariate detection, domain-rule violations, optional winsorization | Skeleton |
| 07 | `07_multi_file_merger.py` | Merge multiple CSV or Excel files into one | Skeleton |
| 08 | `08_validator_reporter.py` | Validate data against rules, output PDF or Excel report | Skeleton |
| 09 | `09_master_orchestrator.py` | One-click launcher menu, calls any other script | Skeleton |
**Sample data (in the `samples/` folder)**:
- `messy_sales.csv` - intentionally dirty sales data for testing.
- `bank_export.xlsx` - sample bank export for testing missing-value handling and outlier detection.
---
## 3. Usage
You have two ways to use the bundle: the GUI (recommended for most users) or the CLI (for power users and automation). ### 3.1 GUI (recommended)
### 3.1 GUI usage (recommended) 1. Launch the bundle.
2. Pick a tool from the sidebar.
3. Drop your file (or select a sample).
4. Defaults are pre-filled — click **Run** to preview.
5. Click **Save Output** to write the cleaned file.
1. Launch the bundle via the desktop shortcut, app icon, or AppImage. Advanced options are tucked in expander panes. The original file is never modified.
2. Your browser opens to the bundle's home page.
3. Select the script you want to use from the sidebar (Deduplicator, Format Standardizer, etc.).
4. Drop your file into the file uploader, or select from the included samples.
5. Sensible defaults are pre-filled. Click "Run" to see a preview of what the script will do.
6. Review the preview. If it looks right, click "Save Output" to write the cleaned file.
The GUI is designed to work out of the box with zero configuration. Advanced options are tucked into expandable "Advanced" panes for users who want them.
### 3.2 CLI usage
All scripts are also CLI tools with `--help` output.
**Basic usage** (from a terminal):
Windows (the bundle adds CLI tools to your PATH):
```
deduplicator samples\messy_sales.csv
```
macOS / Linux:
```
deduplicator samples/messy_sales.csv
```
**With options**:
```
deduplicator samples/messy_sales.csv --output cleaned.csv --subset email,phone
```
**Get help on any script**:
```
deduplicator --help
```
**Recommended run order**: If you are running scripts individually, run `02_text_cleaner` first to normalize whitespace and special characters, then `04_missing_value_handler` *before* `06_outlier_detector`. Outlier detection on data still containing blanks or sentinel codes (like `-999`) produces unreliable results because missing-value placeholders distort the statistics (means get dragged, IQR widens, false negatives explode). The Master Orchestrator (script 09) runs them in the correct order automatically.
### 3.2 CLI
```bash
deduplicator customers.csv [--apply]
text-cleaner messy.csv [--apply]
format-standardize feed.csv [--apply]
```
Get help: `deduplicator --help`. Full reference: [CLI-REFERENCE.md](CLI-REFERENCE.md).
### 3.3 Run order (when running tools manually)
If you skip the Pipeline Runner, follow this order:
1. **02 Text Cleaner** first — normalizes whitespace + special chars.
2. **03 Format Standardizer** — dates, phones, etc. need cleaned text.
3. **04 Missing Value Handler** — sentinel codes hide as numbers.
4. **05 Column Mapper** — schema before outlier stats.
5. **06 Outlier Detector** — needs clean numerics. Stats on data with `NaN` or `-999` are mathematically poisoned.
6. **07 Multi-File Merger**, **08 Validator** as needed.
7. **01 Deduplicator** is order-flexible (normalizes internally for matching).

The Pipeline Runner enforces this automatically.
## 4. Review & Normalize gate
--- Every uploaded file is scanned before any tool sees it.
## 3.3 Review & Normalize gate **Confidence tiers**:
- **High** — round-trip safe. One-click "Auto-fix high-confidence" applies them all.
- **Medium** — usually right, occasional false positives. Preview first.
- **Low** — heuristic. Off by default; opt in per finding.
- **Error** — blocks the gate (empty file, U+FFFD, unrepairable rows).
Before any tool page accepts a file, the file passes through a **CSV-normalization gate**. The gate scans every uploaded file, surfaces every data-quality issue our analyzer can detect, and lets you choose how to handle each one before downstream tools see the data. **Encoding override**: when the picker reports `encoding_uncertain` or you spot mojibake (`é`) or `<60>` chars, choose the right codepage at the top of the page (cp1252 for Western Excel, KOI8-R for older Russian, Big5 for traditional Chinese, …) → **Re-analyze**.
### How it works **Advanced output**: an `⚙️` expander on the download lets you tune encoding, delimiter, and line terminator. The download filename auto-adjusts (`.tsv` for tab, `.csv` otherwise).
1. Upload a file on the home page. The analyzer scans it and counts findings by confidence tier. ## 5. Output
2. Click any tool. If the file hasn't been normalized yet, you're redirected to the **Review & Normalize** page.
3. The page shows every finding grouped by severity and confidence, with a per-finding decision control.
### Confidence tiers Every run writes:
- **Cleaned file** next to the input (or wherever you specify).
- **Audit file** (per-cell changes for text/format tools, match groups for dedup).
- **Timestamped log** in `logs/`.
- **High** — round-trip-safe algorithmic fix (BOM strip, whitespace trim, NBSP / zero-width strip, smart-quote fold, line-ending normalize, header cleanup). One-click "Auto-fix high-confidence" applies them all. Original input is never modified.
- **Medium** — right call in the common case but with known false-positive shapes. Examples: lowercasing the email column, replacing null-like sentinels (`N/A`, `-`, `nan`), repairing unquoted-currency rows. Preview the change before applying.
- **Low** — heuristic fixes that can corrupt data when wrong. Mojibake repair (`cafÃ©` → `café`), mixed-encoding detection. Off by default; you opt in per finding.
- **Error** — blocking. Empty file, unrepairable rows, U+FFFD replacement characters. Cannot enter the tool pages until resolved or explicitly waived.
### Encoding override ## 6. Troubleshooting
When the analyzer reports `encoding_uncertain` or you spot mojibake (`é`) or `<60>` characters in the findings list, use the **File encoding** picker at the top of the Review page. Pick the right code page (cp1252 for Western Excel exports, KOI8-R for older Russian data, Big5 for traditional Chinese, etc.) or type a custom one, then click **Re-analyze**. Findings refresh against the corrected decode. - **GUI won't launch / browser doesn't open** — wait 10-15 s; manually visit `http://localhost:8501`. Port-in-use error → close other instances.
- **Why does my browser open?** — local web app pattern (same as Jupyter, RStudio). Nothing leaves your machine.
- **Windows SmartScreen** — click "More info" → "Run anyway". Standard for non-EV-signed software.
- **macOS "App is damaged"** — re-download (file likely corrupted in transit).
- **Linux AppImage won't run** — `chmod +x file.AppImage`. Missing FUSE → `sudo apt install libfuse2` or use `.tar.gz`.
- **Slow on large file** — over ~100k rows takes longer; progress bar shows. Multi-million rows → use the CLI directly.
- **Need help** — email the address on your purchase receipt.
The picker is hidden for `.xlsx` files since Excel stores text as Unicode internally. ## 7. License
### Advanced output options Single-user. See `LICENSE.txt`.
After applying decisions, an `⚙️ Advanced output options` expander on the download appears. Three dropdowns let you tune the output file format:
- **Encoding (code page)** — UTF-8 (default), UTF-8 with BOM (Excel-friendly), Windows-1252, Latin-1, Latin-9, cp1250, ISO-8859-2, cp1251, Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE.
- **Delimiter** — comma (default), tab, semicolon, pipe.
- **Line terminator** — LF (default), CRLF (Windows), CR.
The download filename auto-adjusts the extension (`.tsv` for tab, otherwise `.csv`). When the chosen encoding can't represent a character (Cyrillic content into cp1252, Asian script into Latin-1), the page shows a warning naming the offending character and falls back to `?` replacement so the download still works.
---
## 4. Output
Every script writes:
- A cleaned output file next to the input (or wherever you specify).
- A timestamped log file in the `logs/` folder showing what changed and why.
Reports from `validator_reporter` go to the `reports/` folder as PDF or Excel.
The GUI also displays the output preview in-browser before any file is written. The original input file is never modified.
---
## 5. Troubleshooting
**The GUI won't launch / browser doesn't open**:
1. Wait 10-15 seconds after double-clicking. The local server takes a moment to start the first time.
2. If the browser doesn't open automatically, manually visit `http://localhost:8501` in your browser.
3. If you see a "port in use" error, another program is using port 8501. Close other instances of the bundle and try again.
**"Why is my browser opening?" / "Why does this need internet?"**:
This tool runs as a local web app. The browser is just the display; nothing is uploaded, nothing leaves your computer. No internet connection is used after install. This is the same approach used by many modern data tools (Jupyter notebooks, RStudio, etc.).
**Windows: "Windows protected your PC" SmartScreen warning**:
Click "More info" then "Run anyway." This is a standard warning for software without an extended-validation Windows code signing certificate.
**macOS: "App is damaged and cannot be opened"**:
This usually indicates the download was corrupted. Re-download from the link in your purchase email.
**Linux: AppImage will not run**:
Make sure it is executable: `chmod +x BundleName-1.0.AppImage`. If it still fails, your distribution may be missing FUSE; install with `sudo apt install libfuse2` (Debian/Ubuntu) or use the `.tar.gz` fallback.
**Script throws an error about a file**:
Check the log file in the `logs/` folder. The log explains exactly what went wrong and which row of input data triggered it.
**The GUI feels slow on a large file**:
Files over ~100,000 rows take longer to process. The GUI shows a progress bar. If you have very large files (millions of rows) consider using the CLI directly, which is faster for batch jobs.
**Need help**: Email the address on your purchase receipt.
---
## 6. License
Single-user license. Do not redistribute. See `LICENSE.txt` in the install folder.