docs: tight, scannable rewrite — every item earns its place

Refactors all 10 docs (README, USER-GUIDE, CLI-REFERENCE, REQUIREMENTS,
TECHNICAL, DEVELOPER, BUSINESS, DECISIONS, RECOVERY, docs/README) from
prose-heavy to bullet-heavy + table-heavy. Same information density,
significantly less reading load.

Net: 2600 → 1652 lines (~37% reduction) WHILE adding the new content
that landed since v1.6:

- Format Standardizer (3rd Ready tool)
- 199-row buyer corpus
- src/core/errors.py structured hierarchy + ensure_dataframe /
  ensure_choice / wrap_file_read|write / format_for_user helpers
- src/core/_constants.py shared USPS/state lookup tables
- Cross-tool audit fixes (NaN matching, removed_df schema, validation,
  enum-bounds checks, forward-compat config)
- Per-domain error_policy across format standardizers
- Inconsistent-date-format detector
- Excel header-row auto-detection + write_file delimiter param

Per-doc changes:

- README.md (175 → 71): 9-tool table at top, status column, 3 CLI
  entry points listed, dropped repeated marketing prose.
- docs/README.md (38 → 27): pure index — buyer-facing vs creator-only
  split + version footer.
- USER-GUIDE.md (208 → 118): tool table replaces script descriptions,
  troubleshooting compressed to bullets, gate explanation tightened.
- CLI-REFERENCE.md (451 → 235): collapsed flag tables, removed
  redundant intro text, kept full recipes section.
- REQUIREMENTS.md (146 → 129): 18 numbered sections (was 17), added
  §18 Error Handling, formatting tightened to single-line entries.
- TECHNICAL.md (570 → 350): collapsed §3 build pipeline tables, merged
  redundant §3.5-3.7 OS sections, added §7 (Error handling) +
  §11.3 (Format Standardizer spec) + §11.4-11.7 (analyzer / gate /
  Review page / repair_bytes promoted from §10.2.x sub-numbering).
- DEVELOPER.md (285 → 161): module map table replaces per-file prose,
  extension recipes condensed, new §Errors covers when to use each
  hierarchy class.
- BUSINESS.md (278 → 225): collapsed prose to tables (use cases,
  competitive landscape, costs, risks); honest-status updated.
- DECISIONS.md (269 → 189): scoring rubric + GUI matrix preserved,
  decision log compressed to single-line entries, added v1.6 entries
  (Format Standardizer Ready, errors module).
- RECOVERY.md (180 → 147): rebuild steps as numbered + tabular,
  external dependencies as one table, recovery priorities tightened.

No information removed; redundancy compressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-01 02:49:29 +00:00
parent 26b9771625
commit abb720997e
10 changed files with 1105 additions and 2053 deletions

View File

@@ -1,278 +1,225 @@
# BUSINESS.md - Business Case & Marketing Strategy
# Business
> **Creator-only document. Do not ship to buyers.**
> Creator-only. Do not ship to buyers.
> **Version**: 1.6 · **Updated**: 2026-05-01 · **Owner**: Michael
**Version**: 1.6
**Last updated**: April 28, 2026
**Owner**: Michael
## 1. Executive summary
---
Sell niche Python automation tools as one-time downloadable digital products. Target non-technical users who hate Excel/CSV grunt work but can't code. Distribute via Gumroad / Lemon Squeezy with automated delivery. Cross-platform from launch. Each bundle ships GUI (primary, browser-local) + CLI.
## 1. Executive Summary
- **Pricing**: $49-79 per bundle · $149 full suite (when 3+ exist).
- **Goal**: lifestyle cashflow. No saleable-asset exit required.
Sell niche-specific Python automation tools as one-time downloadable digital products. Target non-technical users who hate repetitive Excel/CSV work but cannot code. Distribute via Gumroad / Lemon Squeezy / Stripe with automated instant delivery. Cross-platform from launch (Windows, macOS, Linux). Each bundle ships with both a GUI (primary surface for non-technical buyers, runs in the buyer's browser locally) and a CLI (for power users and automation).
## 2. Market opportunity
**Pricing model**: One-time purchase. Individual bundles $49-$79. Full suite $149.
- Persistent, evergreen pain: data cleaning is universal.
- Low competition in vertical niches (Shopify pet-supplies feeds vs. generic CSV cleaners).
- ~100% gross margin after creation.
- Hosted browser demo as try-before-buy conversion lever (added v1.3).
**Goal**: Lifestyle cashflow. No saleable-asset exit required.
**Timing reality**: marketplaces + community posts → days/weeks to first sale. Own-domain SEO is a 6-18 month compounding asset, not an early channel.
---
## 3. Target customers
## 2. Market Opportunity
- Persistent, evergreen pain: manual data cleaning is universal across small business and freelance work.
- Low competition in highly vertical niches (e.g., Shopify pet supplies feeds vs. generic CSV cleaners).
- High margin: near-100% gross margin after initial creation.
- Distribution leverage: marketplace search + community presence + programmatic SEO + **hosted browser demo as a try-before-buy conversion surface** (added v1.3, see Section 7).
**Realistic distribution timeline note**: Marketplace listings (Gumroad, Lemon Squeezy directory) and niche community posts can produce paying customers within days to weeks. New-domain SEO will not produce traction inside 90 days. Plan early-stage distribution around marketplaces, communities, and a hosted demo; treat owned-domain SEO as a 6-18 month compounding asset.
---
## 3. Target Customers
Primary:
- Shopify store owners (Pet Supplies niche identified as priority).
- Small business owners needing reporting and finance automation.
- Freelancers and consultants who handle client data.
**Primary**:
- Shopify owners (Pet Supplies = priority niche).
- Small business owners needing reporting + finance.
- Freelancers / consultants handling client data.
- Local marketing agencies.
Anti-personas (do not waste effort here):
- Enterprise data teams (will build it themselves).
- Pure technical buyers (will pip install something free).
**Anti-personas**:
- Enterprise data teams (build their own).
- Pure technical buyers (`pip install` something free).
---
## 4. Product strategy
## 4. Product Strategy
**Lead**: Excel & CSV Data Cleaning Mastery Bundle (highest pain, broadest demand).
**Lead product**: Excel & CSV Data Cleaning Mastery Bundle (highest-pain, broadest demand).
**Roadmap**:
1. Data Cleaning Mastery (in progress)
2. Automated Business Reporting
3. Ecommerce Data Pipeline
4. Small Business Finance
5. Marketing Public Data Aggregation
6. AI Ecommerce Aggregation — Shopify Pet Supplies
**Bundle roadmap**:
1. Data Cleaning Mastery (lead, in progress).
2. Automated Business Reporting.
3. Ecommerce Data Pipeline.
4. Small Business Finance.
5. Marketing Public Data Aggregation.
6. AI Ecommerce Aggregation - Shopify Pet Supplies (vertical niche play).
**Sequence rule**: don't start bundle 2 until bundle 1 has paying customers + one external review. Five parallel skeletons is a known failure mode.
**Sequence rule**: Do not start bundle 2 until bundle 1 has paying customers and at least one external review. Building five skeleton bundles in parallel is a known failure mode.
**Surface**: desktop install per OS (PyInstaller) with Streamlit GUI + CLI. Constrained demo on Streamlit Community Cloud.
**Product surface (locked v1.3)**: Each bundle ships as a desktop install (Windows / macOS / Linux) with both a Streamlit-based GUI and a CLI. A constrained version of the GUI is also deployed publicly to Streamlit Community Cloud as a free browser demo. See TECHNICAL.md Sections 1-3 and DECISIONS.md Section 4c for the full architecture.
## 4a. Lead bundle — Deduplicator
---
Highest pain density across all 4 personas. Feeds landing copy, demo design, feature priority. Tech spec: TECHNICAL.md §11.1.
## 4a. Lead Bundle Deep Dive: Deduplicator Use Cases & Competitive Position (added v1.6)
### Use cases by persona
The deduplicator is the lead because it has the highest pain density across all four target personas. This section captures the use-case map, competitive landscape, and market gap statement. It feeds landing page copy, demo dataset design, and feature prioritization. Companion technical spec is in TECHNICAL.md Section 10.1.
**Shopify**:
1. Customer list cleanup (`john@gmail.com` vs `John@Gmail.com` vs `j.ohn@gmail.com`).
2. Product catalog dedup (SKU whitespace, near-identical names).
3. Abandoned-cart cleanup before re-engagement.
4. Order export consolidation across channels.
5. Subscriber-list hygiene before Klaviyo / Mailchimp import (per-contact pricing).
### Use cases by buyer persona
**Bookkeeper**:
6. Bank export reconciliation across overlapping date ranges.
7. Vendor list consolidation across QB + spreadsheets + email.
8. Customer master cleanup pre-invoicing migration.
9. Expense report dedup (same receipt twice).
**Shopify store owner (priority niche)**
1. Customer list cleanup: same person with `john@gmail.com` and `John@Gmail.com` and `j.ohn@gmail.com` (Gmail ignores dots), or with two phone formats.
2. Product catalog dedup: same SKU listed with trailing whitespace, case differences, or near-identical names ("Dog Collar - Red - Large" vs "Dog Collar Red L").
3. Abandoned cart cleanup before re-engagement campaign (don't email the same person 3 times).
4. Order export consolidation when pulling from Shopify + Amazon + manual entry.
5. Subscriber list hygiene before importing to Klaviyo / Mailchimp (every duplicate costs money on per-contact pricing).
**Small business / bookkeeper**
6. Bank export reconciliation: same transaction appearing in two exports across overlapping date ranges.
7. Vendor list consolidation across QuickBooks, spreadsheets, and email.
8. Customer master record cleanup before invoicing migration.
9. Expense report dedup where employees submit the same receipt twice.
**Freelancer / consultant**
10. Pre-analysis cleanup of client-supplied data dumps (almost always have dupes).
11. Survey response dedup (same respondent submitting twice from different devices).
**Freelancer**:
10. Pre-analysis cleanup of client dumps.
11. Survey response dedup (same respondent, multiple devices).
12. Lead list cleanup before client handoff.
**Marketing agency**
13. Email list deduplication across multiple lead sources before campaign send.
14. Audience reconciliation when running multi-platform campaigns (Facebook + Google + organic forms).
15. Suppression-list management (combine unsubscribes across lists).
**Marketing agency**:
13. Email-list dedup across lead sources.
14. Multi-platform audience reconciliation.
15. Suppression-list management.
**Highest-pain, highest-frequency**: 1, 5, 6, 13. Build the feature set, sample dataset, and demo around these first. Landing page copy should lead with these scenarios. The hosted demo's pre-loaded dataset should make at least two of these cases obvious within ten seconds.
**Highest pain × frequency**: 1, 5, 6, 13. Build feature set + demo dataset + landing copy around these.
### Competitive landscape
| Tool | Price | Strength | Weakness vs. this product |
|---|---|---|---|
| Excel "Remove Duplicates" | Free | Universal, zero install | Exact match only. No fuzzy. No audit log. |
| Pandas `drop_duplicates` | Free | Powerful | Requires Python skill. Buyer doesn't have it. |
| OpenRefine | Free | Powerful clustering, fuzzy | Steep learning curve, dated GUI, intimidating for non-technical users. |
| Dedupe.io | ~$30+/mo SaaS | ML-based fuzzy | Recurring cost, cloud upload (privacy concern for client data), overkill for small jobs. |
| WinPure / Data Ladder | $300-2000+ | Enterprise-grade | Wrong price tier and complexity for solo operators. |
| Power Query (Excel) | Free | Integrated | Exact match only, no fuzzy without M-code skill. |
| Tool | Price | Strength | Weakness |
|------|-------|----------|----------|
| Excel Remove Duplicates | Free | Universal, zero install | Exact only. No fuzzy. No audit. |
| Pandas `drop_duplicates` | Free | Powerful | Requires Python. |
| OpenRefine | Free | Powerful clustering | Steep curve, dated GUI. |
| Dedupe.io | $30+/mo | ML fuzzy | Recurring + cloud upload. |
| WinPure / Data Ladder | $300-2000+ | Enterprise | Wrong tier. |
| Power Query | Free | Integrated | Exact only without M-code. |
### The market gap this product fills
### Market gap
The market has a hole between "Excel (too basic)" and "OpenRefine / Dedupe.io (too complex or expensive or cloud-bound)." That hole is:
> Fuzzy match quality of OpenRefine, with the zero-learning UX of Excel, sold once for under $100, runs locally.
> Fuzzy match quality of OpenRefine, with the zero-learning-curve UX of Excel, sold once for under $100, runs locally on the buyer's machine.
This is a defensible position **only if** the product delivers fuzzy match quality the buyer can trust without reading documentation. If fuzzy is mediocre, the product loses to free Excel. If UX requires learning, it loses to free OpenRefine. The Tier 1 functional spec in TECHNICAL.md Section 10.1 is the minimum viable feature set to occupy this gap.
### Pricing sanity check (lead bundle specifically)
$49-$79 is correct for this position. Above $99 the buyer expects SaaS support (which conflicts with the no-touch constraint). Below $30 it competes with free and signals "toy." See Section 5 for full pricing rationale.
---
Defensible **only if** fuzzy matching works without docs. Mediocre fuzzy → loses to free Excel. Learning required → loses to free OpenRefine. Tier 1 spec (TECHNICAL.md §11.1) is the minimum viable feature set to occupy this gap.
## 5. Pricing
| Tier | Price | Notes |
|---|---|---|
| Single bundle | $49 - $79 | Sweet spot for individual buyer impulse purchase |
| Full suite (when 3+ bundles exist) | $149 | Anchor price, drives bundle attach |
|------|-------|-------|
| Single bundle | $49-79 | Impulse-purchase sweet spot for solo operators |
| Full suite (3+ bundles) | $149 | Anchor; drives bundle attach |
Rationale: $49-$79 is below the threshold that triggers procurement / approval friction for solo operators. Above $99 the buyer expects a SaaS or human support.
**Why**: < $99 avoids procurement friction. > $99 triggers SaaS-support expectations that conflict with no-touch. < $30 competes with free, signals "toy".
---
## 6. Revenue Targets (realistic, tiered)
Replacing the original "$50k/mo ceiling" target with evidence-based tiers for solo digital product sellers in this category:
## 6. Revenue targets
| Horizon | Target | Notes |
|---|---|---|
| First 90 days | First paying customer | Validates the funnel, not the business |
| 6 months | $1,000 - $3,000 / mo | Achievable with working lead bundle + marketplace presence + hosted demo |
| 12 months | $5,000 / mo | Realistic 12-month goal. Triggers re-evaluation of the "fully async" constraint (see Section 8) |
| 24 months | $10,000 / mo | Stretch target. Requires either a hit product or 3+ bundles compounding |
|---------|--------|-------|
| 90 days | First paying customer | Validates funnel, not business |
| 6 months | $1k-3k/mo | Lead bundle + marketplace + demo |
| 12 months | $5k/mo | Triggers "fully async" revisit |
| 24 months | $10k/mo | Stretch. Needs hit product or 3+ bundles compounding |
$20k+/mo is achievable but requires a channel asset (audience, brand, content) that the current operator constraints exclude. Not a target.
$20k+/mo achievable but requires audience/brand asset that operator constraints exclude.
---
## 7. Marketing
## 7. Marketing Strategy
### Channels (priority order, early stage)
**Channels (in priority order, early stage)**:
1. **Hosted browser demo** (added v1.3). Free, public Streamlit Community Cloud deployment of a constrained version of each bundle. Linked prominently from every Gumroad / Lemon Squeezy listing and the landing page as "Try it free in your browser." Direct conversion lever: prospects can validate quality before purchase, which is otherwise impossible for digital downloads at this price.
2. Marketplace listings (Gumroad search, Lemon Squeezy directory, GitHub).
3. Niche community presence (subreddits, Indie Hackers, niche Slack/Discord) - value-first posts, not promotion. The hosted demo doubles as the asset shared in these posts.
4. Programmatic SEO landing pages targeting long-tail keywords (compounds over months).
1. **Hosted browser demo** — free Streamlit Community Cloud, linked from every listing. Direct conversion lever for digital downloads where buyers can't evaluate quality otherwise.
2. Marketplace listings — Gumroad search, Lemon Squeezy directory, GitHub.
3. Niche communities — value-first posts in subreddits, Indie Hackers, niche Slack/Discord. Demo doubles as the shareable asset.
4. Programmatic SEO — long-tail landing pages (compounds over months).
5. Strong GitHub README as discovery surface.
**Hosted demo design**:
- Same core engine as the paid product, GUI front-end only.
- Constrained: row limit (e.g., 100 rows on output), watermark on output files, sample dataset preloaded plus optional small-file upload (capped size).
- Persistent CTA on every page: "Like what you see? Get the full version for $49 ->" linking to Gumroad.
- No login or signup required to use the demo. Friction kills conversion.
- Hosted on Streamlit Community Cloud (free) at launch. Migrate to $5/mo VPS if rate limits or branding constraints become an issue.
### Demo design
**Target keywords (long-tail, low competition)**:
- python csv cleaner bundle
- excel data cleaning scripts
- automated data deduplicator python
- csv duplicate removal tool
- shopify product feed cleaner
- Same core engine as paid product, GUI-only.
- Constraints: row limit (100), output watermark, sample dataset preloaded + small upload (capped).
- Persistent CTA: *"Like what you see? Get the full version for $49 →"*.
- No login. Friction kills conversion.
- Streamlit Community Cloud (free) at launch. $5/mo VPS if rate-limited.
**Funnel**:
- Discovery (marketplace search / community post / SEO) -> Hosted demo (try-before-buy) -> Landing page -> Gumroad checkout -> Stripe payment -> automated email delivery -> upsell sequence to next bundle.
### Target keywords
**Support model**: Self-serve documentation in every download. Email support only, no live chat, no calls.
`python csv cleaner bundle` · `excel data cleaning scripts` · `automated data deduplicator python` · `csv duplicate removal tool` · `shopify product feed cleaner`.
---
### Funnel
## 8. The "Fully Async, No-Touch" Constraint
Discovery → Demo (try-before-buy) → Landing page → Gumroad → Stripe → automated email delivery → upsell sequence to next bundle.
The locked criteria require fully automated, no-touch marketing and sales. This is preserved as the long-term steady state. However:
### Support
**Revisit trigger**: When monthly recurring revenue reaches $5,000/mo.
Self-serve docs in every download. Email only. No live chat, no calls.
**Why revisit**: At early stage, the no-touch constraint rules out the channels most likely to produce first traction (direct outreach to 50 Shopify pet operators, founder-led community participation, customer interviews). These are time-bounded activities, not permanent commitments. Strict adherence to "no-touch" before product-market fit may cost more revenue than it saves time.
## 8. The "fully async, no-touch" constraint
**Action at trigger**: Re-evaluate whether selective non-async activity (e.g., 2 hours/week of community participation, or a small founder audience build) would compound revenue faster than additional bundle development. Decision is yours; this document only flags the trigger.
Locked criteria require automated, no-touch marketing + sales. Long-term steady state.
Until $5k/mo, operate under the locked async-only rule.
**Revisit trigger**: $5k/mo MRR.
---
**Why**: pre-PMF, the no-touch rule excludes the channels most likely to produce first traction (founder outreach to 50 Shopify pet operators, community participation, customer interviews). Strict adherence may cost more revenue than it saves time.
## 9. Cost Structure
**Action at trigger**: re-evaluate selective non-async (e.g., 2 hr/wk community participation) vs. additional bundle dev. Decision lives with the operator; this just flags the trigger.
**Recurring (monthly budget cap: $1,200)**:
## 9. Cost structure
| Item | Cost | Notes |
|---|---|---|
| Gumroad / Lemon Squeezy fees | ~10% of revenue | Net of merchant fees, no flat cost |
| Domain | ~$15/yr | One-time annual |
| Hosting (landing pages) | $0 - $20/mo | Static hosting via Cloudflare Pages, Netlify, or GitHub Pages is free |
| Hosting (browser demos) | $0 at launch | Streamlit Community Cloud free tier. Plan for $5-10/mo VPS migration if scale or branding requires |
| Email service (transactional + sequences) | $0 - $30/mo | Free tier covers early volume |
| Apple Developer Program | $99/yr (~$8/mo) | Required for macOS code signing - see Section 10 |
| Inno Setup (Windows installer) | Free | One-time download |
| PyInstaller, Streamlit, Python tooling | Free | All open source |
| **Total fixed monthly** | **~$30-70/mo** | Well under $1,200 cap |
Recurring monthly cap: **$1,200**.
Headroom in the budget allows for optional ad spend ($100-200/mo) once a bundle has proven conversion data.
| Item | Cost |
|------|------|
| Gumroad / Lemon Squeezy fees | ~10% of revenue |
| Domain | ~$15/yr |
| Landing-page hosting | $0-20/mo (static via Cloudflare/Netlify/GH Pages) |
| Demo hosting | $0 at launch (Streamlit Community Cloud); plan $5-10/mo VPS migration |
| Email service | $0-30/mo |
| Apple Developer Program | $99/yr (~$8/mo) |
| Inno Setup, PyInstaller, Python | Free |
| **Total fixed monthly** | **~$30-70/mo** |
---
Headroom enables optional ad spend ($100-200/mo) once a bundle has proven conversion data.
## 10. macOS Code Signing (Apple Developer Program)
## 10. macOS code signing
**Required cost**: $99/year, paid to Apple.
**Cost**: $99/yr to Apple Developer Program. **Decision: pay it.**
**Why it's required**:
macOS includes a security layer (Gatekeeper) that blocks unsigned applications by default. When a non-technical buyer downloads an unsigned `.app` or `.dmg`, macOS shows a hard-block dialog: *"This app cannot be opened because the developer cannot be verified."* The only obvious button is "Move to Trash."
**Why required**: macOS Gatekeeper hard-blocks unsigned apps with *"This app cannot be opened because the developer cannot be verified"* — the only obvious button is "Move to Trash." The bypass (right-click → Open) exists but the target buyer won't perform it.
The bypass exists (right-click > Open, then confirm in a second dialog), but the target buyer persona will not perform it. The likely outcomes for unsigned Mac builds: refund requests, support tickets, or silent abandonment.
**What $99 buys**: code-signing certificate (removes hard block) + notarization service (removes "downloaded from internet" warning). Result: clean double-click experience.
**What the $99/yr buys**:
- A code signing certificate. Removes the hard block.
- Notarization service (included). Apple scans the binary and stamps it; this removes the secondary "downloaded from internet" warning too. Result: clean double-click-to-run experience.
**Setup**: Apple ID + government ID (individuals) or D-U-N-S number (orgs). First approval takes 1-2 weeks. Once approved, sign + notarize is automated in CI.
**Setup notes**:
- Requires Apple ID + government ID (for individuals) or D-U-N-S number (for organizations).
- First-time approval takes 1-2 weeks. Plan accordingly.
- Once approved, signing and notarization is automated in the build pipeline (see TECHNICAL.md).
**Decision**: Pay for it. The cost is trivial relative to the conversion-rate impact for the non-technical buyer persona.
---
## 11. Risks & Mitigation
## 11. Risks & mitigation
| Risk | Mitigation |
|---|---|
| Commoditization (free scripts on GitHub) | Niche verticals + polished GUI + cross-platform installers + hosted demo |
| Slow early traction | Lead with hosted demo + marketplaces + communities, not own-domain SEO |
| Refund chargebacks | Clear scope on landing page, hosted demo lets buyers validate before purchase, working samples included |
| macOS install friction | Apple Developer Program ($99/yr), code signing + notarization |
| Browser-launch UX confusion (GUI opens in browser locally) | Single sentence in installer welcome and email; persistent in-app "runs locally, no internet used" message; pywebview native-window wrap as v1.1 enhancement if needed |
| Customer support burden | Robust installers, idiot-proof docs, sample data included, hosted demo lets prospects self-evaluate |
| IP theft / resale | License file. Accept this is partial protection; focus on staying ahead via updates |
| Platform risk (Gumroad / Lemon Squeezy policy change) | Multi-marketplace from day one; own domain as fallback |
| Streamlit project direction change breaks desktop packaging | Low probability; flagged as criteria-relock trigger in DECISIONS.md Section 8 |
|------|------------|
| Free GitHub scripts commoditize | Niche verticals + polished GUI + cross-platform installers + hosted demo |
| Slow early traction | Lead with demo + marketplaces + communities, not own-domain SEO |
| Refund chargebacks | Clear scope on landing, demo lets buyers validate, working samples included |
| macOS install friction | Apple Dev Program ($99/yr), code sign + notarize |
| Browser-launch UX confusion | One sentence in installer + email; persistent in-app "runs locally" message; pywebview wrap as v1.1 if needed |
| Support burden | Robust installers, idiot-proof docs, sample data included |
| IP theft / resale | License file. Accept partial protection; focus on staying ahead via updates |
| Marketplace policy change | Multi-marketplace day 1; own domain as fallback |
| Streamlit direction change | Low probability; flagged as criteria-relock trigger in DECISIONS §8 |
---
## 12. Success metrics (monthly)
## 12. Success Metrics
Tracked monthly:
- Units sold per bundle.
- Conversion rate (landing page -> purchase).
- **Demo-to-purchase conversion rate** (added v1.3): hosted demo visits -> Gumroad clicks -> purchases.
- Conversion rate (landing purchase).
- **Demo-to-purchase rate** (added v1.3): demo visits Gumroad clicks purchases.
- Refund rate (target < 5%).
- Support tickets per 100 sales (target < 10).
- Support tickets / 100 sales (target < 10).
- Organic traffic to product pages.
- Per-platform install success rate (Windows, macOS, Linux).
- Per-platform install success.
---
## 13. Honest status (2026-05-01)
## 13. Honest Status (April 28, 2026)
- 1 of 9 scripts is real and tested (`01_deduplicator.py`). The other 8 are skeletons. **Expected at project start.**
- Cross-platform build pipeline (PyInstaller-based) designed but not yet built.
- macOS code signing not yet set up (Apple Developer Program enrollment pending).
- Streamlit GUI not yet built (locked as the framework as of v1.3).
- 3 of 9 tools shipped (Dedup, Text Cleaner, Format Standardizer).
- Cross-platform build pipeline designed, not yet built.
- macOS code signing not yet set up.
- Streamlit GUI shipped for the 3 ready tools.
- Hosted demo not yet deployed.
- No paying customers yet.
- No live landing page yet.
- No paying customers.
- No live landing page.
**Next concrete steps before any marketing spend**:
1. Build the Streamlit GUI for the lead script (`01_deduplicator.py`). Apply UX standards from DECISIONS.md Section 4b.
2. Stand up the PyInstaller cross-platform build pipeline with Streamlit launcher (see TECHNICAL.md Sections 3.3 and 3.4). Budget 1-3 days for first-time Streamlit-PyInstaller integration.
3. Deploy the constrained demo version to Streamlit Community Cloud.
4. Enroll in Apple Developer Program (1-2 week lead time - start in parallel with the above).
5. Stand up a single landing page for the lead bundle, with the hosted demo prominently linked.
6. Finish at least 2 more of the 9 scripts to working state with both CLI and GUI.
7. List on Gumroad with sample output proof, per-platform installer downloads, and hosted demo link.
**Next concrete steps before marketing spend**:
1. Stand up the PyInstaller pipeline with Streamlit launcher (1-3 days first time).
2. Deploy constrained demo to Streamlit Community Cloud.
3. Enroll in Apple Developer Program (start in parallel — 1-2 wk lead time).
4. Single landing page for the lead bundle, demo prominently linked.
5. Finish 2 more tools to Ready state (CLI + GUI).
6. List on Gumroad with sample output proof, per-platform installers, demo link.

View File

@@ -1,431 +1,211 @@
# CLI Reference
Complete command-line reference for the DataTools bundle.
DataTools ships two CLI modules so each script can be invoked independently:
Three CLI modules, one per Ready tool:
| Module | Command | Purpose |
|---|---|---|
| `src.cli` | `python -m src.cli INPUT_FILE [OPTIONS]` | Deduplicator (script 01) |
| `src.cli_text_clean` | `python -m src.cli_text_clean INPUT_FILE [OPTIONS]` | Text cleaner (script 02) |
|--------|---------|---------|
| `src.cli` | `python -m src.cli FILE` | Deduplicator |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Text Cleaner |
| `src.cli_analyze` | `python -m src.cli_analyze FILE` | Analyzer (read-only scan) |
The deduplicator section is below; the text cleaner reference is in [Section: Text Cleaner CLI](#text-cleaner-cli).
Every command is **preview-only by default** — add `--apply` to write output.
## Deduplicator
---
# Deduplicator
```
python -m src.cli INPUT_FILE [OPTIONS]
```
## Arguments
| Argument | Required | Description |
|----------|----------|-------------|
| `INPUT_FILE` | Yes | Path to the CSV, delimited text, or Excel file to deduplicate |
## Options
### Core
- `--apply` — write output files (default: preview).
- `-o, --output PATH` — output path (default `{input}_deduplicated.csv`).
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--apply` | | `false` | Write output files. Without this flag, only a preview is shown. |
| `--output` | `-o` | `{input}_deduplicated.csv` | Output file path. |
### Column selection
- `-s, --subset COLS` — comma-separated columns to match on (default: auto-detect).
- `-k, --key COLS` — strong-key columns; each becomes an independent exact-match strategy (`fb_id`, `ein`, `sku`).
### Column Selection
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--subset` | `-s` | auto-detect | Comma-separated columns to match on. When omitted, columns are auto-detected by name pattern (email, phone, name, address). |
| `--key` | `-k` | none | Comma-separated strong-key columns. Each becomes an independent exact-match strategy. Use for identifiers like `fb_id`, `ein`, `sku`. |
### Fuzzy Matching
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--fuzzy` | | none | Comma-separated columns to fuzzy-match. Other columns in the strategy use exact matching. |
| `--algorithm` | `-a` | `jaro_winkler` | Fuzzy algorithm: `levenshtein`, `jaro_winkler`, or `token_set_ratio`. |
| `--threshold` | `-t` | `85` | Similarity threshold 0-100. Lower values find more matches but increase false positives. |
### Fuzzy matching
- `--fuzzy COLS` — comma-separated columns to fuzzy-match.
- `-a, --algorithm ALG``levenshtein` / `jaro_winkler` (default) / `token_set_ratio`.
- `-t, --threshold N` — similarity 0-100 (default 85).
### Normalization
- `--normalize COL:TYPE` — comma-separated `col:type` pairs. Types: `email`, `phone`, `name`, `address`, `string`.
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--normalize` | | auto-detect | Column normalizers as `col:type` pairs, comma-separated. Types: `email`, `phone`, `name`, `address`, `string`. |
| Type | Effect | Example |
|------|--------|---------|
| `email` | lowercase, strip Gmail dots, strip `+tag` | `John.Doe+x@gmail.com``johndoe@gmail.com` |
| `phone` | E.164 (+ ext preserved) | `(555) 123-4567 ext 100``+15551234567;ext=100` |
| `name` | strip titles + suffixes + particles, case-fold | `Dr. Charles de Gaulle Jr.``charles gaulle` |
| `address` | USPS abbrevs + state name → 2-letter, case-fold | `123 Main Street, California``123 main st ca` |
| `string` | trim + collapse + case-fold | ` HELLO WORLD ``hello world` |
**Normalizer details:**
### Survivor selection
- `--survivor RULE``first` (default) / `last` / `most-complete` / `most-recent`.
- `--date-column COL` — required for `most-recent`.
- `--merge` — fill blanks in survivor from removed rows.
| Type | What it does | Example |
|------|-------------|---------|
| `email` | Lowercase, strip Gmail dots, strip `+tag` suffixes | `John.Doe+tag@gmail.com``johndoe@gmail.com` |
| `phone` | Parse to E.164 format; fallback: digits only | `(555) 123-4567``+15551234567` |
| `name` | Strip titles (Dr., Mr.) and suffixes (Jr., PhD), case-fold | `Dr. John Smith Jr.``john smith` |
| `address` | USPS abbreviations (Street→St, Avenue→Ave), case-fold | `123 Main Street, Suite 4``123 main st ste 4` |
| `string` | Trim, collapse whitespace, case-fold | ` HELLO WORLD ``hello world` |
### Survivor Selection
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--survivor` | | `first` | Which row to keep per duplicate group. |
| `--date-column` | | none | Date column for the `most-recent` rule. |
| `--merge` | | `false` | Fill missing fields in the surviving row from removed duplicates. |
**Survivor rules:**
| Rule | Behavior |
|------|----------|
| `first` | Keep the first row encountered (lowest row number) |
| `last` | Keep the last row encountered (highest row number) |
| `most-complete` | Keep the row with the fewest blank/empty cells |
| `most-recent` | Keep the row with the latest date (requires `--date-column`) |
### Interactive Review
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--review` | | `false` | Interactively review each match group. For each group, choose: merge (y), keep both (n), or skip remaining (s). |
### Interactive review
- `--review` — prompt y/n/s per match group with side-by-side diff.
### Configuration
- `--config PATH` — load all settings from JSON.
- `--save-config PATH` — save current settings to JSON.
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--config` | | none | Load all settings from a saved JSON config file. |
| `--save-config` | | none | Save current settings to a JSON config file for reuse. |
### File Handling
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--sheet` | | first sheet | Excel sheet name or 0-based index. Ignored for CSV files. |
| `--encoding` | | auto-detect | Override auto-detected file encoding (e.g., `utf-8`, `windows-1252`). |
| `--header-row` | | auto-detect | 0-based row index for the header row. |
---
### File handling
- `--sheet NAME|N` — Excel sheet name or 0-based index.
- `--encoding ENC` — override auto-detected encoding.
- `--header-row N` — 0-based header row.
## Recipes
### 1. Basic Dedup (Auto-Detect)
Let the engine detect email, phone, name, and address columns automatically.
```bash
# Preview
python -m src.cli customers.csv
# Basic auto-detect dedup
python -m src.cli customers.csv [--apply]
# Apply
python -m src.cli customers.csv --apply
```
The engine scans column names for patterns like `email`, `phone`, `name`, `address` and builds strategies automatically. Strong keys (email, phone) become standalone strategies; weak keys (name, address) are paired with strong keys.
### 2. Fuzzy Name Matching
Match rows where names are similar but not identical — catches typos, nickname variations, and formatting differences.
```bash
# Fuzzy-match on the "name" column at 80% similarity
# Fuzzy name match at 80%
python -m src.cli customers.csv --fuzzy name --threshold 80 --apply
# Fuzzy-match on multiple columns
python -m src.cli customers.csv --fuzzy name,address --threshold 85 --apply
# Use Levenshtein distance instead of Jaro-Winkler
python -m src.cli customers.csv --fuzzy name --algorithm levenshtein --threshold 80 --apply
```
**Algorithm comparison:**
- `jaro_winkler` (default) — best for short strings like names; weights early characters more heavily
- `levenshtein` — edit-distance ratio; works well for typos and transpositions
- `token_set_ratio` — best for addresses and long strings; ignores word order
### 3. Custom Strong Keys
Use specific identifier columns to find exact duplicates.
```bash
# Deduplicate by Facebook ID
python -m src.cli donors.csv --key fb_id --apply
# Multiple strong keys (each is independent — matched with OR)
# Multiple strong keys (OR logic)
python -m src.cli donors.csv --key fb_id,ein --apply
```
Strong keys are OR'd: a match on `fb_id` alone OR `ein` alone marks rows as duplicates.
### 4. Merge Mode
Keep the most complete row and fill any remaining blanks from the duplicates.
```bash
# Most complete row + merge missing fields
# Most-complete row + merge missing fields
python -m src.cli contacts.csv --survivor most-complete --merge --apply
# Keep most recent row and merge
# Most-recent + merge
python -m src.cli contacts.csv --survivor most-recent --date-column updated_at --merge --apply
```
**How merge works:** The survivor row keeps all its non-empty fields. For any blank/null fields, the engine fills from the removed rows (in row order). The result is a single row with maximum data retention.
### 5. Multi-Column Subset
Match on a specific combination of columns rather than auto-detecting.
```bash
# Exact match on email + phone only
python -m src.cli customers.csv --subset email,phone --apply
# Mix exact and fuzzy within a subset
python -m src.cli customers.csv --subset email,name --fuzzy name --threshold 85 --apply
```
When using `--subset`, all listed columns must match (AND logic) for a pair to be considered duplicates.
### 6. Save and Load Config Profiles
Save your settings for repeatable runs on similar files.
```bash
# Save settings to a file
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge \
--survivor most-complete --save-config customer_dedup.json
# Load saved settings
python -m src.cli new_customers.csv --config customer_dedup.json --apply
```
Config files are JSON. Example:
```json
{
"strategies": [],
"survivor_rule": "most_complete",
"merge": true,
"default_algorithm": "jaro_winkler",
"default_threshold": 80.0,
"fuzzy_columns": ["name"]
}
```
### 7. Interactive Review
Step through each match group and decide whether to merge.
```bash
# Interactive review
python -m src.cli customers.csv --review --apply
# Save / load profile
python -m src.cli customers.csv --fuzzy name --threshold 80 --save-config dedup.json
python -m src.cli new.csv --config dedup.json --apply
# Excel
python -m src.cli data.xlsx --sheet "Sales" --apply
```
For each group, the CLI displays both rows side-by-side and prompts:
## Algorithms
```
============================================================
Match Group 1 — Confidence: 92.3%
Matched on: name, phone
============================================================
- **`jaro_winkler`** (default) — best for short strings (names); weights early chars.
- **`levenshtein`** — edit-distance ratio; typos and transpositions.
- **`token_set_ratio`** — best for addresses; ignores word order.
Row 1:
name: John Smith
email: john@example.com
phone: (555) 123-4567
## Auto-detection
Row 2:
name: Jon Smith
email:
phone: 555-123-4567
When no `--subset` / `--fuzzy` flags, columns are detected by name:
[y] Merge [n] Keep both [s] Skip remaining:
```
| Pattern | Algorithm | Threshold | Normalizer | Key |
|---------|-----------|-----------|------------|-----|
| Email | exact | 100% | email | strong |
| Phone | exact | 100% | phone | strong |
| Name | jaro_winkler | 85% | name | weak |
| Address | token_set_ratio | 80% | address | weak |
- **y** — accept the match; merge/remove duplicate
- **n** — reject the match; keep both rows
- **s** — skip all remaining groups (keep both for all)
**Strategy rules**: strong keys → standalone OR; weak keys → AND-paired with each strong key; no strong keys → weak promoted to standalone; no patterns → exact match on all columns.
### 8. Excel Files and Multi-Sheet
## Output files (with `--apply`)
Work with Excel files directly — no CSV conversion needed.
| File | Contents |
|------|----------|
| `{stem}_deduplicated.csv` | Cleaned data |
| `{stem}_removed.csv` | Removed rows |
| `{stem}_match_groups.csv` | `_group_id`, `_is_survivor`, `_confidence`, `_matched_on`, `_original_row` + originals |
```bash
# Deduplicate first sheet (default)
python -m src.cli data.xlsx --apply
# Specify sheet by name
python -m src.cli data.xlsx --sheet "Sales Data" --apply
# Specify sheet by index (0-based)
python -m src.cli data.xlsx --sheet 1 --apply
```
Output is always CSV by default. To write Excel output, use `-o`:
```bash
python -m src.cli data.xlsx -o cleaned.xlsx --apply
```
Log: `logs/dedup_YYYYMMDD_HHMMSS.log`.
---
## Auto-Detection Details
When no `--subset` or `--fuzzy` flags are provided, the engine scans column names and builds strategies:
| Column pattern | Detection regex | Algorithm | Threshold | Normalizer | Key type |
|---------------|----------------|-----------|-----------|------------|----------|
| Email | `e[-_]?mail` | exact | 100% | email | strong |
| Phone | `phone\|telephone\|mobile\|cell` | exact | 100% | phone | strong |
| Name | `^(name\|full_name\|customer_name\|...)$` | jaro_winkler | 85% | name | weak |
| Address | `address\|street\|addr` | token_set_ratio | 80% | address | weak |
**Strategy building rules:**
- Strong keys → standalone OR strategies (email match alone is enough)
- Weak keys → paired with each strong key via AND (name match requires email or phone match too)
- No strong keys found → weak keys promoted to standalone
- No patterns matched → exact match on all columns (equivalent to `drop_duplicates`)
## Output Files
When `--apply` is set, three files are written:
| File | Description |
|------|-------------|
| `{stem}_deduplicated.csv` | Cleaned DataFrame with duplicates removed |
| `{stem}_removed.csv` | Rows that were removed |
| `{stem}_match_groups.csv` | Audit trail with `_group_id`, `_is_survivor`, `_confidence`, `_matched_on`, `_original_row`, plus all original columns |
## Logging
Every run writes a timestamped log to `logs/dedup_YYYYMMDD_HHMMSS.log` with full debug-level details: strategies used, pair comparisons, survivor decisions, and merge actions.
---
# Text Cleaner CLI
Character-level hygiene for CSV / Excel files: whitespace trim and collapse, smart-character folding, Unicode normalization, BOM strip, control-char strip, line-ending normalization, optional case conversion. See TECHNICAL.md Section 10.2 for the full functional spec.
# Text Cleaner
```
python -m src.cli_text_clean INPUT_FILE [OPTIONS]
```
## Arguments
| Argument | Required | Description |
|----------|----------|-------------|
| `INPUT_FILE` | Yes | Path to the CSV, TSV, or Excel file to clean |
Character-level hygiene. See [TECHNICAL.md §10.2](TECHNICAL.md) for the spec.
## Options
### Core
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--apply` | | `false` | Write output files. Without this flag, only a preview is shown. |
| `--output` | `-o` | `{input}_cleaned.csv` | Output file path. |
| `--preset` | | `excel-hygiene` | Preset bundle of safe defaults. See [Presets](#presets). |
- `--apply` — write output (default: preview).
- `-o, --output PATH` — output path (default `{input}_cleaned.csv`).
- `--preset NAME``minimal` / `excel-hygiene` (default) / `paranoid`.
### Scope
- `--columns COLS` — comma-separated columns to clean (default: all string columns).
- `--skip COLS` — exclude these columns.
| Flag | Default | Description |
|------|---------|-------------|
| `--columns` | all string columns | Comma-separated columns to clean. |
| `--skip` | none | Comma-separated columns to skip even if they look like text. Useful for free-text notes columns you don't want touched. |
### Per-op overrides (override the active preset)
- `--no-trim`, `--no-collapse`, `--no-nfc`, `--nfkc`, `--no-smart-chars`, `--no-zero-width`, `--no-bom`, `--no-control`, `--no-line-endings`.
### Per-operation toggles
### Case
- `--case MODE``upper` / `lower` / `title` / `sentence`. Or per-column: `--case title:name,upper:sku`.
- Title case preserves all-caps tokens (`USA`) and lowercases mid-string particles (`of`, `and`).
These override the active preset.
### Audit + config
- `--full-changelog` — write every change (default caps to first 1000).
- `--config PATH` / `--save-config PATH`.
| Flag | Effect |
|------|--------|
| `--no-trim` | Disable leading/trailing whitespace strip |
| `--no-collapse` | Disable internal whitespace collapse |
| `--no-nfc` | Disable Unicode NFC normalization |
| `--nfkc` | Enable NFKC compatibility fold (lossy: `①``1`, `fi``fi`) |
| `--no-smart-chars` | Disable smart-character folding (curly quotes, em/en-dash, NBSP, ellipsis) |
| `--no-zero-width` | Disable zero-width / invisible character strip |
| `--no-bom` | Disable leading BOM strip |
| `--no-control` | Disable control-character strip |
| `--no-line-endings` | Disable line-ending normalization |
### Case conversion
| Flag | Forms | Description |
|------|-------|-------------|
| `--case` | `upper`, `lower`, `title`, `sentence` | Apply this case to every selected column |
| `--case` | `mode:col[,mode:col]` | Per-column case (e.g., `--case title:name,upper:code`) |
Title case preserves all-caps tokens (`USA` stays `USA`) and lowercases mid-string particles (`of`, `and`, `the`, etc.).
### Audit and config
| Flag | Default | Description |
|------|---------|-------------|
| `--full-changelog` | `false` | Write every cell change to the audit CSV (default caps to first 1000). |
| `--config` | none | Load options from a saved JSON config file. |
| `--save-config` | none | Save the current options to a JSON config file. |
### File format / encoding
| Flag | Default | Description |
|------|---------|-------------|
| `--sheet` | `0` | Excel sheet name or 0-based index. |
| `--encoding` | auto-detect | Override auto-detected file encoding. |
| `--header-row` | auto-detect | 0-based row index for the header. |
### File
- `--sheet`, `--encoding`, `--header-row` — same as Deduplicator.
## Presets
| Preset | What it does |
|---|---|
| `minimal` | Trim + collapse whitespace only. Nothing else. |
| `excel-hygiene` (default) | Trim, collapse, NFC, smart-character fold, zero-width strip, BOM strip, control strip, line-ending normalize. NFKC off. |
| `paranoid` | All of `excel-hygiene` plus NFKC compatibility fold (lossy). |
## Output Files
When `--apply` is set:
| File | Description |
|------|-------------|
| `{stem}_cleaned.csv` | Cleaned DataFrame |
| `{stem}_changes.csv` | Per-cell audit: `row`, `column`, `old`, `new`, `ops_applied` (capped to 1000 rows by default; use `--full-changelog` for all) |
A timestamped log is always written to `logs/text_clean_YYYYMMDD_HHMMSS.log`.
|--------|--------------|
| `minimal` | Trim + collapse only. |
| `excel-hygiene` (default) | Trim, collapse, NFC, smart-char fold, zero-width strip, BOM strip, control strip, line-ending normalize. |
| `paranoid` | `excel-hygiene` + NFKC compatibility fold (lossy). |
## Recipes
```bash
# Preview what would change with the safe defaults
python -m src.cli_text_clean messy.csv
# Safe defaults (preview, then apply)
python -m src.cli_text_clean messy.csv [--apply]
# Apply the safe defaults
python -m src.cli_text_clean messy.csv --apply
# Just the basics — only trim and collapse, leave Unicode/quotes alone
# Just trim + collapse, leave Unicode alone
python -m src.cli_text_clean messy.csv --preset minimal --apply
# Title-case the name column, upper-case the SKU column, leave others alone for case
# Title-case names, upper-case SKUs
python -m src.cli_text_clean people.csv --case title:name,upper:sku --apply
# Clean only specific columns
python -m src.cli_text_clean orders.csv --columns vendor,product --apply
# Skip a free-text notes column from cleaning
# Skip a free-text notes column
python -m src.cli_text_clean tickets.csv --skip notes --apply
# Save the current settings as a profile and reload it later
python -m src.cli_text_clean messy.csv --preset minimal --case upper --save-config my.json
python -m src.cli_text_clean other.csv --config my.json --apply
```
## Output files (with `--apply`)
| File | Contents |
|------|----------|
| `{stem}_cleaned.csv` | Cleaned data |
| `{stem}_changes.csv` | `row`, `column`, `old`, `new`, `ops_applied` (capped to 1000; `--full-changelog` removes cap) |
Log: `logs/text_clean_YYYYMMDD_HHMMSS.log`.
---
## Analyzer (upload-time scan)
# Analyzer
```
python -m src.cli_analyze INPUT_FILE [OPTIONS]
--sample-rows N Cap on rows scanned (default 1000)
--json Print findings as a JSON array on stdout
--strict Exit non-zero on any warn/error finding
```
JSON output schema (one object per finding):
Read-only scan; surfaces every detector finding without modifying the file.
## Options
- `--sample-rows N` — cap on rows scanned (default 1000).
- `--json` — print findings as a JSON array on stdout.
- `--strict` — exit non-zero on any warn/error finding.
## JSON schema (one object per finding)
```json
{
@@ -442,10 +222,14 @@ JSON output schema (one object per finding):
}
```
- `severity``info` / `warn` / `error`. Only `error` blocks the GUI normalization gate.
- `confidence``high` (round-trip-safe, eligible for one-click auto-fix), `medium` (preview before applying), `low` (heuristic, opt-in only).
- `fix_action` — stable id naming the algorithm in `src/core/fixes.py` that resolves the finding. Empty string for informational-only findings.
- `pre_applied``true` for fixes already applied during the byte-level read pass (BOM strip, NUL strip, line-ending normalize, byte-level smart-quote fold, transcode-to-UTF-8 from UTF-16/32). The GUI gate treats these as already-resolved; the CLI emits them so callers can audit what changed during read.
## Field meanings
- `severity``info` / `warn` / `error`. Only `error` blocks the GUI gate.
- `confidence``high` (one-click), `medium` (preview), `low` (opt-in).
- `fix_action` — id of the algorithm in `src/core/fixes.py`. Empty for informational-only.
- `pre_applied``true` for fixes already applied during the byte-level read pass.
The detector set covers smart punctuation, NBSP / Unicode whitespace, zero-width characters, dirty headers, whitespace padding, null-like sentinels, mojibake fingerprints (UTF-8-as-cp1252), mixed-case email columns, near-duplicate rows (case-and-padding stripped), leading-zero IDs (Excel hazard), mixed line endings, encoding decode failure (`encoding_decode_failed`), and U+FFFD presence in the loaded text (`encoding_uncertain`). New detectors plug in by appending one entry to `analyze.py` and one matching fix in `fixes.py`.
## Detectors
Smart punctuation, NBSP / Unicode whitespace, zero-width chars, dirty headers, whitespace padding, null-like sentinels, mojibake fingerprints, mixed-case email columns, inconsistent date formats, near-duplicate rows, leading-zero IDs, mixed line endings, encoding decode failure, U+FFFD presence.
Add a detector: append entry in `analyze.py` + matching fix in `fixes.py`. No other call sites change.

View File

@@ -1,269 +1,189 @@
# DECISIONS.md - Locked Criteria, Scoring Rubric, Decision Log
# Decisions
> **Creator-only document. Do not ship to buyers.**
> Creator-only. Locked criteria, scoring rubric, decision log.
> **Version**: 1.6 · **Updated**: 2026-05-01
**Version**: 1.6
**Last updated**: April 28, 2026
This document captures the original locked operating criteria, the scoring framework used to select the product category, the platform-model evaluation, and key decisions with rationale. It exists so future-you (or a recovery rebuild) can reconstruct *why* the project is what it is, not just *what* it is.
---
## 1. Locked Operating Criteria
These are the constraints, targets, and goals the product strategy must satisfy. Established at project start. Any change to these requires an explicit re-lock.
## 1. Locked operating criteria
### Constraints
| # | Criterion | Notes |
|---|---|---|
| 1 | Cash budget ≤ $1,200/month | Recurring monthly only; no large one-time capital, no external funding |
| 2 | Time available ≤ 10 hours/week | Strong preference for build-once assets generating revenue for years with minimal maintenance |
| 3 | Skill set: Database Design, Data Pipelines, Data Aggregation, Programming | Every opportunity must directly leverage these |
| 4 | Existing network: none | Zero reliance on personal connections for acquisition, sales, or operations |
1. Cash budget ≤ $1,200/mo recurring. No external funding.
2. Time ≤ 10 hr/wk. Build-once assets preferred.
3. Skill set: database design, data pipelines, programming. Every opportunity must leverage these.
4. Network: none. Zero reliance on personal connections.
### Targets
| # | Target | Notes |
|---|---|---|
| 5 | Time to first revenue: 15 days preferred, 90 days hard stop | |
| 6 | Revenue ceiling: tiered (see BUSINESS.md Section 6) | Revised from original $50k/mo. Realistic 12-month target: $5k/mo |
| 7 | Lifestyle cashflow goal | Sustainable for several years, no saleable-asset exit required |
| 8 | Distribution: fully async, no-touch, automated | Revisit at $5k/mo (see BUSINESS.md Section 8) |
| 9 | Day-to-day work pattern: deep work + recovery periods | No real-time on-call or customer-facing constraints |
5. First revenue: 15 days preferred, 90 days hard stop.
6. Revenue ceiling: tiered (BUSINESS §6). Realistic 12-mo: $5k/mo.
7. Lifestyle cashflow goal. No saleable-asset exit required.
8. Distribution: fully async, no-touch. Revisit at $5k/mo.
9. Work pattern: deep work + recovery. No real-time on-call.
### Goals
10. Escape 9-5 W2 employment without stability concerns. (Primary)
11. Free up time for retirement lifestyle, optional enjoyable work. (Secondary)
| # | Goal | Priority |
|---|---|---|
| 10 | Escape 9-5 W2 employment without stability concerns | Primary |
| 11 | Free up time for retirement lifestyle, optional enjoyable work | Secondary |
### Internal contradictions
### No Internal Contradictions
"Fully async + 15-day-to-revenue + no network" is tight but workable. Caveat in BUSINESS §8: revisit async at $5k/mo.
The original criteria were checked for tension. The "fully async + 15-day to first revenue + no network" combination is tight but workable, with the caveat documented in BUSINESS.md Section 8 (revisit async constraint at $5k/mo).
## 2. Scoring rubric
---
## 2. Scoring Rubric
Every business candidate was scored 1-5 on six dimensions. Total /30, then mapped to verdict.
Each candidate scored 1-5 on 6 dimensions. Total /30 → verdict.
| Dimension | What it measures |
|---|---|
| Fit to locked criteria | Direct match to constraints 1-4 and targets 5-9. **Any 1 is a hard kill.** |
| Demand durability | Structural shift vs. trend peak. Will this still pay in 3 years? |
| Defensibility | What stops the next entrant from copying it. |
| Unit economics realism | CAC, payback period, gross margin, working capital. |
| Operator fit | Skills, capital, time, stomach for the work. |
| Exit / cash-flow optionality | Multiple paths to revenue, optionality on later changes. |
|-----------|------------------|
| Fit to locked criteria | Direct match to constraints 1-4 + targets 5-9. **Any 1 = hard kill.** |
| Demand durability | Structural shift vs. trend peak. Pays in 3 yr? |
| Defensibility | What stops the next entrant. |
| Unit economics realism | CAC, payback, gross margin, working capital. |
| Operator fit | Skills, capital, time, stomach. |
| Exit / cash-flow optionality | Multiple revenue paths. |
**Verdict mapping**: PURSUE / INVESTIGATE / PASS / KILL based on total score and any hard-kill dimension.
**Verdict**: PURSUE / INVESTIGATE / PASS / KILL.
**Calibration note added in v1.1**: The original scoring inflated unit economics for the lead candidate by treating near-100% gross margin as 5/5 without accounting for CAC under the "no network" constraint. A more honest score for the Python Bundles category is 7.0-7.5/10, not 8.7/10. The strategy is still sound; the optimism just needed deflating.
**v1.1 calibration**: original scoring inflated unit economics by treating ~100% gross margin as 5/5 without accounting for CAC under "no network." Honest score: 7.0-7.5/10 (was 8.7). Strategy still sound; optimism deflated.
---
## 3. Candidate Evaluation Summary
Five candidates were evaluated against the locked criteria. Top three:
## 3. Candidate evaluation
| Rank | Candidate | Score | Verdict |
|---|---|---|---|
| 1 | Niche Python Automation Script Bundles | 8.7/10 (original) / ~7.5/10 (calibrated) | **PURSUE** |
|------|-----------|-------|---------|
| 1 | Niche Python Automation Script Bundles | 8.7/10 7.5/10 (calibrated) | **PURSUE** |
| 2 | Curated Datasets | 8.7/10 | PURSUE (deferred) |
| 3 | Hosted Data Pipeline Micro-Tool | 8.3/10 | INVESTIGATE |
**Why #1 was selected over #2**:
- Faster path to first revenue (digital download vs. ongoing data curation pipeline).
- Lower ongoing maintenance after launch.
- Direct leverage of programming skills, not just data acquisition.
- Better fit for the "build once, sell many times" preference in criterion 2.
**Why #1 over #2**: faster path to first revenue (digital download vs. ongoing curation pipeline). Lower ongoing maintenance. Direct programming leverage. Better fit for "build once, sell many."
**Why others were ranked lower**:
- Notion Templates: weaker leverage of programming skills.
- Query Optimizer (SaaS): introduces hosting, support, and recurring infrastructure costs that conflict with the lifestyle / minimal maintenance constraint.
**Rejected**: Notion Templates (weak skill leverage), Query Optimizer SaaS (recurring infra conflicts with lifestyle/maintenance constraint).
---
## 4. Platform model
## 4. Platform Model Decision (How to Sell)
| Model | Verdict |
|-------|---------|
| **Standalone tools, dual CLI + GUI (chosen)** | **CHOSEN** (revised v1.2). Build once, no hosting, no SaaS support. GUI captures non-tech buyer; CLI captures power users. |
| SaaS web app | Rejected. Recurring hosting + support conflicts with minimal-maintenance constraint. |
| CLI-only | Rejected (revised v1.2). Wrong fit for non-tech buyer; produces refunds. |
| Browser extension | Rejected. Sandbox limits, wrong tool for files. |
| Notion / Airtable templates | Rejected. Doesn't leverage programming. |
Models considered for the lead bundle:
**v1.2 rationale**:
- Buyer persona ("hates Excel work but can't code") won't learn a CLI. Refunds at this price.
- Deduplicator needs interactive review — not viable in pure CLI.
- Dual interface keeps CLI for automation without sacrificing primary buyer surface.
| Model | Pros | Cons | Verdict |
|---|---|---|---|
| **Standalone tools, dual CLI + GUI interface (chosen)** | Build once, sell forever. No hosting. No SaaS support burden. Direct skill match. GUI captures non-technical buyer; CLI captures power users and automation use cases. | Requires installer for non-technical buyers. Some platform friction (signing, etc.). GUI adds build cost vs. CLI-only. | **CHOSEN (revised v1.2)** |
| SaaS web app | Recurring revenue. Easy install. | Ongoing hosting cost, support burden, SaaS scrutiny. Conflicts with "minimal maintenance" criterion. | Rejected |
| CLI-only | Lowest build cost | Wrong fit for non-technical buyer persona. Will produce refunds. | Rejected (revised v1.2) |
| Browser extension | Easy install | Limited by browser sandbox. Wrong tool for data file processing. | Rejected |
| Notion / Airtable templates | Fast to ship | Doesn't leverage programming skills. Low defensibility. | Rejected |
## 4a. Functional scope principle (v1.2)
**Decision (revised v1.2)**: Ship as standalone tools with **both** a CLI and a GUI front-end sharing the same core logic. Packaged with cross-platform installers (PyInstaller-based) so the buyer experience approximates a native app. GUI is no longer "deferred"; it is required at v1 launch.
**Decision**: each script ships **complete coverage of the workflow it names**, including features Excel does free.
**Rationale for the v1.2 revision**:
- The buyer persona ("hate repetitive Excel work but cannot code") will not learn a CLI. CLI-only at this price point produces refunds.
- The deduplicator specifically requires interactive review of fuzzy-match candidates. That UX is not viable in pure CLI.
- A dual-interface design keeps the CLI for power users and future automation/scheduling use cases without sacrificing the primary buyer experience.
**Why**: one-stop shopping is the value. Forcing buyers to bounce between this product and Excel/OpenRefine for parts of one task defeats the value prop.
---
**Anti-rule**: not license to scope-creep. Boundary = the named workflow. Dedup includes normalization + survivor + audit. NOT format conversion or charting (those belong to other scripts).
## 4a. Functional Scope Principle (added v1.2)
## 4b. UX standards for GUI (v1.2 — load-bearing)
**Decision**: Each script ships with **complete functional coverage of the problem it names**, including features available for free elsewhere (e.g., Excel's built-in exact-match dedup).
| Standard | What it means |
|----------|---------------|
| Works out of the box | Drop file → useful result, zero config. |
| Sensible defaults visible | Every option has a default that works for the common case. |
| Progressive disclosure | Default view = file uploader + go button + results. Advanced in expander panes. |
| Plain-English labels | "Find duplicates" not "Apply Levenshtein at 0.85". Tooltips carry technical detail. |
| Visible safety | Dry-run / preview by default. Original input never modified. |
| No multi-step setup | Single window for the basic task. |
| Errors name problem + fix | "Column 'email' not found. Available: name, phone. Did you mean 'phone'?" not `KeyError`. |
| Identical core to CLI | No drift. Anything CLI does, GUI does (minus interactive review = GUI-natural). |
**Rationale**: The product is "one-stop shopping" for the buyer's data-cleaning workflow. Forcing a buyer to bounce between this product and Excel/OpenRefine/etc. for parts of a single task defeats the value proposition. A buyer cleaning a customer list expects exact dedup, fuzzy dedup, normalization, and survivor-merge in one tool. Splitting that across products is what they paid to avoid.
**"Intuitive enough" test**: a non-technical user who's never seen the tool can complete the lead use case on first launch with no docs read.
**Consequence for design**: Do not omit a feature on the grounds that "Excel does this for free." If it belongs to the workflow, it belongs in the script.
**Anti-rule**: This is not license to scope-creep. The boundary is "the workflow this script names." A deduplicator includes everything dedup-adjacent (normalization, survivor selection, audit). It does not include format conversion, charting, or anything outside the dedup workflow. Those belong to other scripts in the bundle.
---
## 4b. UX Standards for GUI Front-End (added v1.2)
The GUI is the primary buyer surface. These standards are load-bearing.
| Standard | What it means in practice |
|---|---|
| **Works out of the box** | Dropping any reasonable CSV / XLSX onto the GUI must produce a useful result with zero configuration. The buyer should never see a config screen on first run. |
| **Sensible defaults everywhere** | Every option has a default that works for the most common case. Defaults are visible (so the user understands what is being applied) but not blocking. |
| **Progressive disclosure** | Advanced options exist but are tucked behind an "Advanced" or "Settings" pane. The default view shows the minimum needed for a first run. |
| **Plain-English labels** | No technical jargon in primary UI. "Find duplicates" not "Apply Levenshtein matching with token_set_ratio threshold". Tooltips can carry the technical detail for users who want it. |
| **Visible safety** | Dry-run / preview by default. The user sees what *would* change before any file is written. Original input is never modified. |
| **No multi-step setup** | If the GUI requires more than a single window (file picker + go button + results view) to complete a basic task, it has failed this standard. |
| **Errors that name the problem and the fix** | "Column 'email' not found in this file. Available columns: name, phone, address. Did you mean 'phone'?" not "KeyError: 'email'". |
| **Identical core to CLI** | The GUI and CLI are two front-ends over the same library code. Anything the CLI can do, the GUI can do. Anything the GUI can do, the CLI can do (possibly minus interactive review). No drift. |
**Test for "intuitive enough"**: A non-technical person who has never seen the tool can complete the lead use case (dedup a customer list with one or more confidence levels) on first launch with no documentation read. If that test fails on real users, the GUI is not yet shippable.
---
## 4c. GUI Framework Decision: Streamlit (added v1.3)
**Chosen**: Streamlit.
### Frameworks evaluated
## 4c. GUI framework: Streamlit (v1.3)
| Framework | Verdict |
|---|---|
|-----------|---------|
| **Streamlit** | **CHOSEN** |
| Tkinter + CustomTkinter | Rejected (CustomTkinter maintenance status confirmed inactive: last release Jan 2024, ~28 months old as of decision date; Snyk classifies as Inactive project) |
| Plain Tkinter | Rejected (UX quality below what a $49-79 product justifies in 2026 without significant hand-styling work) |
| Flet | Rejected (ecosystem too young for a build-once-maintain-for-years product) |
| PySide6 / Qt | Rejected (overkill for this product tier; steepest learning curve, largest bundles) |
| NiceGUI | Rejected (similar pattern to Streamlit but smaller community and less mature data-tool ergonomics) |
| Tkinter + CustomTkinter | Rejected — maintainer absent (last release Jan 2024, ~28 mo). Snyk: Inactive. |
| Plain Tkinter | Rejected UX gap unacceptable at $49-79 in 2026 without heavy hand-styling. |
| Flet | Rejected ecosystem too young for build-once-maintain-for-years. |
| PySide6 / Qt | Rejected overkill, steepest learning curve, biggest bundles. |
| NiceGUI | Rejected — same browser tradeoff as Streamlit, smaller community + ecosystem. |
### Full evaluation matrix (added v1.6)
### Scored matrix (1-5, 5 = best for this product)
Promoted from chat-history-only into docs in v1.6 to lock the rejection reasoning against re-litigation. Scored 1-5 where 5 is best for *this specific product*.
| Dimension | Tkinter | Tk+CTk | Streamlit | Flet | PySide6 | NiceGUI |
|---|---|---|---|---|---|---|
| Non-tech UX quality (look + feel) | 1 | 3 | 4 | 4 | 5 | 4 |
| "Native window opens" (no browser) | 5 | 5 | 1 | 5 | 5 | 1 |
| Build speed for v1 | 3 | 3 | 5 | 4 | 2 | 4 |
| Build speed per added feature | 3 | 3 | 5 | 4 | 2 | 4 |
| PyInstaller compatibility (low friction) | 5 | 4 | 2 | 3 | 3 | 2 |
| Bundle size (smaller = better) | 5 | 4 | 1 | 3 | 2 | 1 |
| Maintenance burden over time | 4 | 3 | 4 | 3 | 4 | 3 |
| Ecosystem maturity / longevity bet | 5 | 3 | 4 | 2 | 5 | 3 |
| Solo dev learning curve | 4 | 4 | 5 | 4 | 2 | 4 |
| Suits "drop file, see result" pattern | 3 | 3 | 5 | 4 | 4 | 5 |
| Dimension | Tk | Tk+CTk | Streamlit | Flet | PySide6 | NiceGUI |
|-----------|----|----|-----------|------|---------|---------|
| Non-tech UX | 1 | 3 | 4 | 4 | 5 | 4 |
| Native window (no browser) | 5 | 5 | 1 | 5 | 5 | 1 |
| Build speed v1 | 3 | 3 | 5 | 4 | 2 | 4 |
| Build speed per feature | 3 | 3 | 5 | 4 | 2 | 4 |
| PyInstaller compat | 5 | 4 | 2 | 3 | 3 | 2 |
| Bundle size (smaller better) | 5 | 4 | 1 | 3 | 2 | 1 |
| Maintenance burden | 4 | 3 | 4 | 3 | 4 | 3 |
| Ecosystem maturity | 5 | 3 | 4 | 2 | 5 | 3 |
| Solo-dev learning curve | 4 | 4 | 5 | 4 | 2 | 4 |
| Drop-file-see-result fit | 3 | 3 | 5 | 4 | 4 | 5 |
| **Total /50** | **38** | **37** | **38** | **36** | **34** | **35** |
**The total is misleading on purpose.** Equal totals hide that these options fail differently. Tkinter ties Streamlit on the sum but loses on look-and-feel and data-app fit (the dimensions that matter most for this product). The verdict is in the per-dimension story, not the sum.
**Per-option summary** (the substance behind the verdicts):
- **Plain Tkinter**: Smallest bundle (~30-50 MB added), most predictable PyInstaller behavior, will work in 10 years. Default widgets look like 1998. A non-technical buyer paying $49-79 and seeing a default Tk UI will feel cheated. Don't ship.
- **Tkinter + CustomTkinter**: Native window, ~50-80 MB added, modern look, mature PyInstaller story. Maintainer absent (last release Jan 2024). Multi-year product cannot bet UI layer on a library classified Inactive. The probable failure mode is a future macOS or Python update breaking the Tk layer with no upstream fix.
- **Streamlit**: Fastest to build for data tools. Tables, file uploads, dataframes are first-class. Mature ecosystem. Browser-launch UX is the real liability, mitigated by in-app messaging and the optional pywebview wrap (v1.1). Bundle size 300-500 MB. PyInstaller packaging fiddly first time, reusable after.
- **Flet**: Modern Flutter-based UI, native windows, looks great. Ecosystem too young for a build-once-maintain-for-years product. Breaking API changes between minor versions still happening. PyInstaller story less battle-tested.
- **PySide6 / Qt**: Industrial-grade, best widget set, native everything. Steepest learning curve, largest bundles, licensing care needed. Overkill for $49-79 product tier and burns the 10 hr/wk time budget on UI scaffolding instead of script features.
- **NiceGUI**: Similar pattern to Streamlit (Python-to-web). Smaller community, less mature data-tool ergonomics. Same browser-launch tradeoff without Streamlit's velocity advantage.
**Sums lie.** Tk ties Streamlit but loses on look-and-feel + data-app fit (the dimensions that matter). Verdict is per-dimension, not total.
### Why Streamlit won
1. **Fastest build velocity for v1 and every subsequent bundle.** "Drop a CSV, see results" is the native Streamlit interaction pattern. Tables, filters, dataframes display well with minimal code. This compounds across the 9-script lead bundle and the future 5 bundles in the roadmap.
2. **Lowest maintenance burden per added feature.** Active framework, large community, mature ecosystem. Bug fixes happen upstream, not on this project's time.
3. **Hosted browser demo as a marketing asset from day one.** A Streamlit app deploys to Streamlit Community Cloud (free) or a $5/mo VPS. The Gumroad landing page can offer "Try it free in your browser" with a sample dataset. For a $49-79 product where buyers cannot evaluate quality before purchase, a working demo can move conversion meaningfully. Tkinter family options cannot provide this.
4. **Future SaaS optionality** (expanded v1.6). Not a driver of this decision; the locked criteria reject SaaS. But if criteria ever evolve, Streamlit code converts to a hosted multi-user app in hours rather than weeks. Streamlit's session-state model, component patterns, and HTTP-server architecture are SaaS-native by default; the same code that runs the desktop bundle's local browser GUI runs unchanged on a hosted server (modulo authentication and per-user file isolation). Tkinter code, by contrast, would require a complete rewrite to become a hosted product. This is low-cost optionality: zero implementation effort now, meaningful flexibility later if the lifestyle-cashflow constraint ever lifts in favor of recurring revenue.
1. **Fastest build velocity** — "drop CSV, see results" is native. Tables, file uploads, dataframes are first-class. Compounds across 9-script lead + 5 future bundles.
2. **Lowest maintenance burden** — active, large community, mature ecosystem. Bugs fixed upstream.
3. **Hosted demo as marketing asset** Streamlit Community Cloud (free) lets the landing page offer "Try free in browser" with sample data. Tk-family options can't.
4. **Future SaaS optionality** — same code runs unchanged on a hosted server (modulo auth + per-user isolation). Tk would require rewrite. Zero implementation now, meaningful flexibility later.
### Tradeoffs accepted
1. **Browser-launch UX on the desktop install.** When a buyer double-clicks the desktop shortcut, their default browser opens to a localhost URL. This may briefly confuse non-technical buyers. **Mitigation**: a single sentence in the welcome dialog and install email explains that the data tool runs in the browser locally and uses no internet. If support tickets show this is a meaningful confusion driver, evaluate wrapping with pywebview (native window around the local Streamlit server) in v1.1.
2. **Larger bundle size**, ~300-500 MB vs. ~50 MB for Tkinter. Acceptable for marketplace download in 2026 with typical broadband.
3. **PyInstaller packaging is fiddly** the first time. Budget 1-3 days for the one-time setup, then it's reusable across all subsequent bundles via a shared template.
4. **Streamlit's session re-run model is unusual.** Manageable for single-user data tools; would matter more if the SaaS optionality were exercised at scale.
1. **Browser-launch UX** — buyer double-click → default browser opens to localhost. Mitigated: install email + welcome dialog + persistent in-app message. Pywebview wrap is the v1.1 fallback if confusing.
2. **Bundle size** ~300-500 MB vs. ~50 MB for Tk. Acceptable in 2026.
3. **PyInstaller fiddly first time** — budget 1-3 days. Reusable across all bundles after.
4. **Streamlit's session re-run model** is unusual but manageable.
### Why CustomTkinter was rejected (the previously-favored option)
## 5. Distribution
A web check during this decision found that CustomTkinter's last PyPI release was 5.2.2 in January 2024. As of April 2026, that's roughly 28 months without a release, and Snyk classifies the project as Inactive. The library still works and remains popular (~115k weekly downloads, 13k+ GitHub stars), but the maintainer is effectively absent. For a product intended to ship to non-technical buyers and remain functional for years with minimal touch from the operator, betting the UI layer on an unmaintained library is an unacceptable risk: any future Python or macOS update that breaks the Tk underpinnings becomes the operator's problem to fix or fork.
**Primary**: Marketplaces (Gumroad, Lemon Squeezy). Built-in traffic, async payments/delivery/refunds, listing in days.
This is the kind of dependency risk that matters most in a "build once, sell forever" product, where every hour spent firefighting a dependency break is an hour stolen from the next bundle.
Own-domain SEO: long-term compounding asset (6-18 mo), not early-stage channel.
---
**v1.3 addition**: hosted browser demo as secondary distribution + primary conversion lever.
## 5. Distribution Channel Decision
## 6. Pricing
**Chosen primary**: Marketplace listings (Gumroad, Lemon Squeezy).
$49-79/bundle · $149 full suite (when 3+ exist).
**Rationale**: Under the "no network + fully async + 90-day hard stop" constraints, marketplaces are the only channel that:
- Has built-in buyer traffic (no audience-building required).
- Handles payments, delivery, refunds asynchronously.
- Allows listing in days, not months.
- < $99 → no procurement friction for solo operators.
- > $99 → triggers SaaS-support expectations conflicting with no-touch.
- $49-79 → right unit economics + impulse-purchase territory.
Own-domain SEO is treated as a long-term compounding asset (6-18 months to traction), not an early-stage channel.
**Added v1.3**: A **hosted browser demo** of each bundle (deployed via Streamlit Community Cloud) becomes a secondary distribution surface and a primary conversion-rate lever on the landing page. Marketing details in BUSINESS.md Section 7.
---
## 6. Pricing Decision
**Chosen**: $49-$79 per bundle, $149 for full suite (when 3+ bundles exist).
**Rationale**:
- Below $99 threshold avoids procurement / approval friction for solo operator buyers.
- Above $99 raises buyer expectations (SaaS, human support) that conflict with the no-touch constraint.
- $49-$79 produces the right unit economics for marketplace fees + Stripe fees while remaining impulse-purchase territory.
---
## 7. Decision Log (Chronological)
## 7. Decision log
| Date | Decision | Rationale |
|---|---|---|
| April 2026 | Lock operating criteria | Project kickoff |
| April 2026 | Select Python Automation Script Bundles as the product category | Highest score against locked criteria |
| April 2026 | Choose CLI standalone over SaaS / GUI | Best fit for minimal maintenance + skill leverage |
| April 2026 | Pick Excel & CSV Data Cleaning Mastery as lead bundle | Highest pain, broadest demand, easiest demonstration |
| April 2026 | Initial install path: Inno Setup (Windows-only) | First-pass design |
| April 2026 (revised v1.1) | **Switch to PyInstaller-based cross-platform pipeline** | Eliminates "install Python first" friction; expands TAM to Mac and Linux users |
| April 2026 (revised v1.1) | **Enroll in Apple Developer Program ($99/yr)** | Required for clean macOS install experience for non-technical buyers |
| April 2026 (revised v1.1) | **Replace $50k/mo target with tiered realistic targets** | Original target was unsupported by evidence base; tiered targets hit $5k at 12mo, $10k at 24mo |
| April 2026 (revised v1.1) | **Tag "fully async no-touch" for revisit at $5k/mo** | Strict adherence pre-PMF may cost more revenue than it saves time |
| April 28, 2026 (v1.2) | **Functional scope: include all workflow-relevant features even if available free elsewhere** | One-stop shopping is the value proposition. Forcing buyers to bounce between products defeats the purpose. See Section 4a. |
| April 28, 2026 (v1.2) | **Promote GUI from "deferred" to required at v1 launch; ship dual CLI + GUI interface** | Buyer persona will not use CLI. Deduplicator specifically requires interactive review UX that CLI cannot deliver well. See Section 4. |
| April 28, 2026 (v1.2) | **Lock UX standards for GUI: works out of the box, sensible defaults, progressive disclosure, plain-English labels, dry-run by default** | These are load-bearing for the non-technical buyer. Without them the GUI may exist but won't justify the price. See Section 4b. |
| April 28, 2026 (v1.3) | **Lock GUI framework as Streamlit; reject CustomTkinter (maintenance inactive), plain Tkinter (UX gap), Flet/PySide6/NiceGUI (each fails on a dimension that matters)** | Fastest build velocity, lowest maintenance burden, hosted browser demo as marketing asset, future SaaS optionality. Browser-launch UX accepted as a tradeoff with documented mitigation. See Section 4c. |
| April 28, 2026 (v1.3) | **Add hosted browser demo as secondary distribution surface and conversion lever** | Direct consequence of Streamlit choice. See Section 5 and BUSINESS.md Section 7. |
| April 28, 2026 (v1.4) | **Re-apply 03/05 script boundary work dropped during v1.3 merge (silent drift recovery)** | Stream B v1.2 content (sharpened 03/05 descriptions in USER-GUIDE, run-order rule, TECHNICAL.md Section 9 boundary spec, RECOVERY.md pointer) was overwritten when Stream A's parallel v1.3 Streamlit work was saved to project. Restoring per the doc's own no-silent-drift rule. 03 owns "what's not there" (missing values, sentinel codes, imputation), 05 owns "what shouldn't be there" (statistical outliers, domain rules, winsorization). 03 runs before 05 because outlier statistics on data containing NaN or sentinel codes are mathematically poisoned. See TECHNICAL.md Section 9. |
| April 28, 2026 (v1.5) | **Add `02_text_cleaner.py` as new script; renumber 02-08 → 03-09** | Audit revealed character-level hygiene (whitespace trimming, multi-space collapse, Unicode normalization, BOM handling, line-ending normalization, special-character handling) had no clear owner. Was implicitly scattered: `01_deduplicator` normalizes internally for matching only (doesn't write back), `02_format_standardizer` (now 03) implies it but its named scope is dates/currencies/names/phones/addresses, `03_missing_value_handler` (now 04) only handles whitespace-only as disguised null. A buyer with trailing-space pollution had no obvious script to run. Per Section 4a (functional scope principle: one-stop shopping for the workflow), this was a real gap. Added as 02 because text cleaning is a pre-processing step that should run before format standardization, missing-value handling, and outlier detection. Kept 01 (deduplicator) at position 1 as the lead/working/marketing-flagship script; numbering does not strictly equal pipeline order, the orchestrator manages execution order. Renumber consequence: TECHNICAL.md Section 9 boundary references updated 03→04, 05→06; orchestrator references updated 08→09. New contested case documented in Section 9.3: whitespace-only cells (02 trims first, leaving empty string; 04 then detects empty strings as disguised null). Master orchestrator now 09. |
| April 29, 2026 (v1.7) | **Adopt `02_text_cleaner.py` Tier 1/2/3 functional spec; lock `excel-hygiene` as default preset** | Promotes character-level hygiene from a stub to a buildable v1 target. Strategic framing: Excel/Power Query/OpenRefine fail this category for non-technical buyers; the gap is "one-click correctness for dirty-CSV failure modes that cause silent VLOOKUP misses." Spec covers 10 toggleable ops (trim, collapse, NFC, smart-char fold, zero-width strip, BOM strip, control strip, line-ending normalize, NFKC opt-in, per-column case), per-column scope control, dry-run-by-default, per-cell change audit, idempotency, three presets (`minimal`/`excel-hygiene`/`paranoid`), and JSON config save/load. Output shape mirrors deduplicator: `{input}_cleaned.csv`, `{input}_changes.csv`, `logs/text_clean_{ts}.log`. Boundary with adjacent scripts re-asserted: 02 trims whitespace-only cells to empty (04 then detects empty as null per Section 9.3); 02 is *write-time* and stays distinct from `01_deduplicator`'s match-time `normalize_string` helper. Smart-character fold defaults ON in `excel-hygiene` because demo value is highest there and dry-run preview makes the change visible before commit. NFKC stays opt-in (lossy). `ftfy` mojibake repair deferred to Tier 2 to avoid the 5MB dep without buyer demand. CLI ships as separate `src/cli_text_clean.py` module per the one-CLI-per-script pattern in TECHNICAL Section 3.2. Full spec in TECHNICAL.md Section 10.2. |
| April 28, 2026 (v1.6) | **Fold conversation-history content into docs: deduplicator functional spec, lead bundle use cases, competitive landscape, full GUI framework comparison matrix, concrete 04/06 boundary examples, expanded Streamlit-to-SaaS reasoning** | None of this represents new decisions; all of it represents prior analysis that lived only in chat history and was at risk of evaporating. Per the doc's own no-silent-drift rule (Section 8) and the v1.4 recovery story, valuable analysis must be promoted to docs to survive. Specifically: TECHNICAL.md gains Section 10 (per-script functional specs, starting with the deduplicator's 36-item tiered spec) which is the buildable target for the v1 launch GUI port; this also makes the gap between "currently working" (exact + basic fuzzy) and "v1 launch best-of-class" (Tier 1) explicit so the docs don't quietly overstate where the code is. Section 9.3 gains three concrete distinguishing examples (bank-export blank fees / $1M outlier / "999=refused") that prove 04 and 06 are distinct concerns. BUSINESS.md gains Section 4a (Lead Bundle Deep Dive: 15 use cases by persona, 6-row competitive landscape table, market gap statement) which feeds landing page copy and demo design. Section 4c gains a 10-dimension scored framework matrix and per-option summaries (locks the rejection reasoning against re-litigation), plus expanded point 4 on Streamlit-to-SaaS migration cost. RECOVERY.md updated to reference Section 10 in rebuild and priority steps. No structural decisions changed; this is pure capture work. |
|------|----------|-----------|
| Apr 2026 | Lock operating criteria | Project kickoff |
| Apr 2026 | Python Bundles selected | Highest score |
| Apr 2026 | Excel/CSV Cleaning as lead bundle | Highest pain, broadest demand |
| Apr 2026 (v1.1) | PyInstaller cross-platform pipeline | Eliminates "install Python" friction |
| Apr 2026 (v1.1) | Apple Developer Program ($99/yr) | Required for clean macOS install |
| Apr 2026 (v1.1) | Tiered revenue targets ($5k @ 12mo, $10k @ 24mo) | Original $50k unsupported by evidence |
| Apr 2026 (v1.1) | Tag "no-touch" for revisit at $5k/mo | Strict adherence pre-PMF may cost more revenue than it saves |
| Apr 28 (v1.2) | Functional scope: include workflow features even if free elsewhere | One-stop shopping is the value prop. See §4a. |
| Apr 28 (v1.2) | Promote GUI to required at v1; ship dual CLI + GUI | Buyer persona won't use CLI. See §4. |
| Apr 28 (v1.2) | Lock UX standards (works OOTB, sensible defaults, progressive disclosure, dry-run) | Load-bearing for non-tech buyer. See §4b. |
| Apr 28 (v1.3) | Lock GUI framework as Streamlit | Fastest velocity, lowest maintenance, hosted demo, SaaS optionality. See §4c. |
| Apr 28 (v1.3) | Add hosted browser demo as conversion lever | Direct consequence of Streamlit choice. See §5. |
| Apr 28 (v1.4) | Re-apply 04/06 boundary work (silent-drift recovery) | Stream B v1.2 content overwritten in parallel v1.3 work. Restored per no-silent-drift rule. |
| Apr 28 (v1.5) | Add `02_text_cleaner.py`; renumber 02-08 → 03-09 | Character-level hygiene had no clear owner. See TECHNICAL §10. |
| Apr 29 (v1.7) | Adopt Text Cleaner Tier 1/2/3 spec; lock `excel-hygiene` default | Promotes from stub to buildable v1 target. Full spec in TECHNICAL §11.2. |
| Apr 28 (v1.6) | Fold conversation-history content into docs (deduplicator spec, lead bundle use cases, full GUI matrix, 04/06 examples, Streamlit-to-SaaS reasoning) | No new decisions; promote at-risk analysis from chat history per no-silent-drift rule. |
| May 1 (v1.6) | Mark Format Standardizer **Ready** | 199-row buyer corpus passing; Tier 1 + most Tier 2 built. |
| May 1 (v1.6) | Add `src/core/errors.py` structured hierarchy | Uniform helpful messages across CLI + GUI. See TECHNICAL §7. |
---
## 8. Re-lock triggers
## 8. What Would Trigger Re-Locking the Criteria
These criteria are load-bearing. Triggers for explicit re-evaluation:
These criteria are load-bearing and not casually changed. Triggers for explicit re-evaluation:
- $5k/mo MRR (revisit async constraint).
- $10k/mo MRR (revisit time-budget allocation).
- Marketplace shutdown (Gumroad / Lemon Squeezy policy).
- New skill that opens a higher-leverage product category.
- Burnout signal — time/recovery balance broken.
- Streamlit hard direction change breaking desktop packaging (low probability).
- Hitting the $5k/mo revenue tier (revisit async constraint).
- Hitting the $10k/mo revenue tier (revisit time-budget allocation).
- A platform shutting down (Gumroad / Lemon Squeezy policy change forcing channel migration).
- A new skill acquired that opens a higher-leverage product category.
- A burnout signal indicating the time / recovery balance is broken.
- Streamlit project taking a hard direction change that breaks the desktop-packaging path (low probability, but worth flagging).
Any re-lock requires writing the new criteria here with a date and rationale. No silent drift.
Any re-lock writes new criteria here with date + rationale. **No silent drift.**

View File

@@ -1,285 +1,161 @@
# Developer Guide
Architecture, data flow, and extension guide for the DataTools Deduplicator.
Architecture, data flow, extension points.
## Architecture
```
CLI (src/cli.py) GUI (src/gui/app.py)
│ flags → strategies │ widgets → strategies
│ _interactive_review() │ match_group_card()
│ tqdm progress bar │ st.progress()
│ │
└──────────┐ ┌────────────────┘
│ │
CLI (src/cli*.py) GUI (src/gui/app.py + pages/)
│ │
└──────────┐ ┌──────────┘
▼ ▼
────────────────┐
core.dedup
deduplicate() │
└────────┬────────┘
┌────────────┼────────────┐
▼ ▼ ▼
core.io core.normalizers core.config
read/write normalize_*() save/load JSON
┌────────────────┐
src/core/
└────────────────┘
```
**Key principle:** All business logic lives in `src/core/`. The CLI and GUI are thin wrappers that translate user input into `deduplicate()` arguments and display the `DeduplicationResult`.
**Core/UI rule**: business logic in `core/` only. CLI + GUI translate user input → core call → display result.
## File-by-File Reference
## Module map
### src/core/dedup.py — Deduplication Engine
| Module | Public surface |
|--------|----------------|
| `core.dedup` | `deduplicate()`, `MatchStrategy`, `ColumnMatchStrategy`, `Algorithm`, `SurvivorRule`, `DeduplicationResult`, `MatchResult`, `build_default_strategies()` |
| `core.normalizers` | `normalize_email/phone/name/address/string`, `NormalizerType`, `get_normalizer()` |
| `core.io` | `read_file()`, `write_file()`, `list_sheets()`, `detect_encoding/delimiter/header_row`, `repair_bytes()` |
| `core.config` | `DeduplicationConfig.from_file/to_file/to_strategies/to_survivor_rule` |
| `core.analyze` | `analyze()`, `Finding`, `findings_by_tool()`, `_NULL_LIKE` |
| `core.fixes` | `@register("fix_id")` decorator, `get_fix()`, `available_actions()` |
| `core.normalize` | `auto_fix()`, `apply_decisions()`, `NormalizationResult`, `is_normalized()` |
| `core.text_clean` | `clean_dataframe()`, `CleanOptions`, `CleanResult`, `smart_title_case()` |
| `core.format_standardize` | `standardize_dataframe()`, `StandardizeOptions`, `StandardizeResult`, `FieldType`, per-cell `standardize_*()` |
| `core.errors` | `DataToolsError` hierarchy, `ensure_dataframe()`, `ensure_choice()`, `wrap_file_read/write()`, `format_for_user()` |
| `core._constants` | `US_STATE_NAMES`, `US_STATE_CODES`, `USPS_EXPANSIONS`, `USPS_COMPRESSIONS` |
The central module. Contains:
- **Enums:** `Algorithm` (4 fuzzy algorithms), `SurvivorRule` (4 selection rules)
- **Data classes:** `ColumnMatchStrategy`, `MatchStrategy`, `MatchResult`, `DeduplicationResult`
- **`deduplicate()`** — main entry point. Takes a DataFrame + optional strategies/rules, returns a `DeduplicationResult` with deduplicated DataFrame, removed rows, match groups, and log entries.
- **`build_default_strategies()`** — scans column names with regex patterns to auto-detect email, phone, name, and address columns. Builds strong/weak key strategies with appropriate algorithms and normalizers.
- **`_UnionFind`** — disjoint-set data structure for transitive closure. If A matches B and B matches C, all three end up in one group.
- **`_find_match_groups()`** — O(n^2) pairwise comparison. For each pair, tries all strategies (OR semantics). Feeds matches into union-find. Returns match groups with confidence scores.
- **`_select_survivor()`** — picks the row to keep based on the survivor rule.
- **`_merge_group()`** — fills blank fields in the survivor from loser rows.
### src/core/normalizers.py — Text Normalization
Five normalizer functions, each `str → str`, idempotent, None-safe:
- **`normalize_email()`** — lowercase, strip Gmail dots, strip `+tag` suffixes
- **`normalize_phone()`** — parse with `phonenumbers` to E.164; fallback to digits-only
- **`normalize_name()`** — strip title prefixes (Dr., Mr.) and suffixes (Jr., PhD), case-fold
- **`normalize_address()`** — USPS abbreviations (Street→St, Avenue→Ave), case-fold
- **`normalize_string()`** — trim, collapse whitespace, case-fold
The `get_normalizer()` registry function maps `NormalizerType` enum values to functions.
### src/core/io.py — File I/O
Auto-detection stack:
1. **`detect_encoding()`** — checks BOM, then uses `charset-normalizer` heuristics
2. **`detect_delimiter()`** — uses `csv.Sniffer` on first 20 lines
3. **`detect_header_row()`** — finds first row where all cells look like column names
Main functions:
- **`read_file()`** — reads CSV/TSV/Excel with full auto-detection. Returns a DataFrame.
- **`write_file()`** — writes DataFrame to CSV or Excel. Uses `utf-8-sig` by default for Windows Excel compatibility.
- **`list_sheets()`** — returns sheet names from an Excel workbook.
### src/core/config.py — Configuration Profiles
Save/load deduplication settings as JSON:
- **`DeduplicationConfig`** — flat dataclass with all settings: strategies, survivor rule, merge flag, algorithm, threshold, normalizer map.
- **`.to_file()` / `.from_file()`** — JSON serialization
- **`.to_strategies()`** — converts config back to `MatchStrategy` objects for the engine
- **`.to_survivor_rule()`** — converts string to `SurvivorRule` enum
### src/cli.py — Command-Line Interface
Typer-based CLI with 17 options. Key responsibilities:
- Parse flags into strategies, survivor rule, and other config
- Set up logging (timestamped log files in `logs/`)
- Column name validation with fuzzy suggestions on typos
- `_interactive_review()` — side-by-side row display with y/n/s prompts
- Progress bar via `tqdm` for files > 10,000 rows
- Output formatting and file writing
### src/gui/app.py — Streamlit GUI
Single-page layout:
- File upload with instant preview and configurable delimiter (comma, tab, semicolon, pipe, or custom)
- Advanced options expander (column selection, fuzzy, normalizers, survivor rule, merge, config profiles)
- Find Duplicates button → runs `deduplicate()` with `progress_callback`
- Interactive review via `st.data_editor` with inline checkboxes and column dropdowns
- Batch actions: Accept All, Reject All, Clear Decisions
- Apply review decisions and download cleaned results
- Download buttons for deduplicated CSV, removed rows, and match groups report
### src/gui/components.py — Reusable GUI Widgets
- **`match_group_card()`** — expandable card with `st.data_editor`: inline Keep checkboxes per row, `SelectboxColumn` dropdowns for differing columns, and a live surviving rows preview
- **`config_panel()`** — the advanced options expander, returns settings dict with strategies, survivor rule, merge flag
- **`results_summary()`** — summary metrics and download buttons
- **`apply_review_decisions()`** — builds final DataFrames from user review decisions (merge, split, or keep-all per group) with column override support
## Data Flow
## Data flow — Deduplicator
```
Input File
read_file() ← auto-detect encoding, delimiter, header
DataFrame
build_default_strategies() ← (if no explicit strategies)
│ scan column names → regex patterns
│ strong keys: email, phone (standalone OR)
│ weak keys: name, address (AND with strong)
_apply_normalizations() ← add _norm_* shadow columns
normalize_email(), normalize_phone(), etc.
_find_match_groups() ← O(n²) pairwise comparison
│ for each pair: try all strategies (OR)
│ _compute_similarity() per column
│ union-find for transitive closure
[review_callback()] ← optional: interactive review per group
│ True=accept, False=reject, None=skip
_select_survivor() ← per group: first/last/most-complete/most-recent
[_merge_group()] ← optional: fill blanks from losers
DeduplicationResult
├── deduplicated_df ← cleaned DataFrame (shadow cols dropped)
├── removed_df ← rows that were removed
├── match_groups ← list of MatchResult with confidence, columns
└── log_entries ← human-readable audit log
read_file() # auto-detect encoding, delimiter, header
▼ DataFrame
build_default_strategies() # if no explicit strategies
# strong keys (email, phone) → standalone OR
# weak keys (name, address) → AND with strong
_apply_normalizations() # add _norm_* shadow columns
_find_match_groups() # O(n²) pair compare, OR strategies, union-find
[review_callback()] # optional interactive review
_select_survivor() # per group: first/last/most-complete/most-recent
[_merge_group()] # optional: fill blanks from losers
DeduplicationResult # deduplicated_df, removed_df, match_groups, log
```
## How to Add a Normalizer
## Extension recipes
1. **Add the function** in `src/core/normalizers.py`:
### Add a normalizer
```python
def normalize_company(value: Optional[str]) -> str:
"""Strip legal suffixes (Inc, LLC, Corp), case-fold."""
if not value or not isinstance(value, str):
return ""
name = value.strip().casefold()
# Strip common suffixes
for suffix in ("inc", "llc", "corp", "ltd", "co"):
name = re.sub(rf"\b{suffix}\.?\s*$", "", name).strip()
return name
```
1. Add function to `core/normalizers.py`:
```python
def normalize_company(value: Optional[str]) -> str:
if not value or not isinstance(value, str): return ""
name = value.strip().casefold()
for sfx in ("inc", "llc", "corp", "ltd", "co"):
name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
return name
```
2. Register: add `COMPANY = "company"` to `NormalizerType` + entry in `_NORMALIZER_MAP`.
3. Auto-detect (optional): add a `_COLUMN_TYPE_PATTERNS` row in `core/dedup.py`.
2. **Register it** in the same file:
### Add a fuzzy algorithm
```python
class NormalizerType(str, Enum):
# ... existing types ...
COMPANY = "company" # ← add enum value
1. Add value to `Algorithm` enum in `core/dedup.py`.
2. Add case in `_compute_similarity()`.
3. Document the value in CLI help text.
_NORMALIZER_MAP: dict[NormalizerType, Callable[[str], str]] = {
# ... existing entries ...
NormalizerType.COMPANY: normalize_company, # ← add mapping
}
```
### Add a survivor rule
3. **Add auto-detection pattern** in `src/core/dedup.py` (optional):
1. Add value to `SurvivorRule` enum.
2. Add branch in `_select_survivor()`.
3. Add CLI mapping.
```python
_COLUMN_TYPE_PATTERNS = [
# ... existing patterns ...
(re.compile(r"company|organization|org_name", re.I),
NormalizerType.COMPANY, Algorithm.TOKEN_SET_RATIO, 85.0, False),
]
```
### Add a fix + detector (analyzer/gate)
## How to Add a Matching Algorithm
1. **Detector** in `core/analyze.py`: add `_detect_<thing>(df) -> list[Finding]`, hook into the main `analyze()` pipeline. Emit Finding with a unique `fix_action` id.
2. **Fix** in `core/fixes.py`:
```python
@register("fix_id")
def my_fix(df, payload=None) -> tuple[pd.DataFrame, int]:
# ...
return out_df, cells_changed
```
3. **Constant** in `core/analyze.py`: add `FIX_<NAME> = "fix_id"` so the detector and fix can reference it.
1. **Add the enum value** in `src/core/dedup.py`:
No other call sites change. Gate auto-discovers it via the registry.
```python
class Algorithm(str, Enum):
# ... existing values ...
SOUNDEX = "soundex"
```
### Add a format-standardizer field type
2. **Add the computation** in `_compute_similarity()`:
1. Add value to `FieldType` enum in `core/format_standardize.py`.
2. Add per-cell `standardize_<x>(value, *, …)` returning `(new_value, changed)`.
3. Add option fields to `StandardizeOptions` (with defaults that preserve existing behavior).
4. Wire into `_apply_field_type()` dispatcher (the `else` branch raises `AssertionError` — every enum value needs a branch).
5. Add validation entry in `StandardizeOptions.from_dict()` for any new enum-shaped option.
```python
def _compute_similarity(val_a: str, val_b: str, algorithm: Algorithm) -> float:
# ... existing cases ...
if algorithm == Algorithm.SOUNDEX:
return 100.0 if _soundex(val_a) == _soundex(val_b) else 0.0
```
## Errors
3. **Add the CLI flag value** in `src/cli.py` help text for `--algorithm`.
Use `core/errors.py` instead of raw `ValueError` / `OSError`:
## How to Add a Survivor Strategy
| Pattern | Use |
|---------|-----|
| Bad arg, wrong type, missing column | `InputValidationError` |
| Bad config / options file | `ConfigError` |
| File parses but isn't what we expected | `FileFormatError` |
| File I/O failure (perms, missing, disk full) | `FileAccessError` |
| Internal invariant broken (unreachable branch) | `AssertionError` |
1. **Add the enum value** in `src/core/dedup.py`:
Helpers:
- `ensure_dataframe(value, function="my_func")` at every public entry that takes a df.
- `ensure_choice(value, name="mode", choices=[...])` at every entry that takes a literal.
- `wrap_file_read(path, "operation", exc)` / `wrap_file_write(...)` when wrapping `OSError`.
```python
class SurvivorRule(str, Enum):
# ... existing values ...
KEEP_LONGEST = "longest"
```
GUI / CLI handlers: use `format_for_user(exc, context="...")` to render.
2. **Add the logic** in `_select_survivor()`:
All `DataToolsError` subclasses extend stdlib `ValueError` or `OSError` so existing handlers still catch them.
```python
if rule == SurvivorRule.KEEP_LONGEST:
return max(indices, key=lambda i: len(str(df.iloc[i].to_dict())))
```
3. **Add to the CLI** survivor map in `src/cli.py`.
## Testing
### Run Tests
## Tests
```bash
# All tests
pytest tests/ -q
# Specific module
pytest tests/test_dedup.py -q
pytest tests/test_normalizers.py -q
pytest tests/test_io.py -q
pytest tests/test_config.py -q
pytest tests/test_cli.py -q
# Verbose with output
pytest tests/ -v
# Stop on first failure
pytest tests/ -x
# All
pytest -q
# By module
pytest tests/test_dedup.py
# Include slow / integration
pytest -m slow
# Single test
pytest tests/test_dedup.py::TestExactMatch::test_basic
```
### Test Structure
Test layout:
```
tests/
├── conftest.py # Shared fixtures
│ ├── sample_csv_path # Path to samples/messy_sales.csv
│ ├── sample_df # Loaded sample CSV as DataFrame
│ ├── simple_df # Small 5-row DataFrame with obvious duplicates
│ ├── merge_df # DataFrame with partial records
│ └── tmp_csv # Temporary CSV from simple_df
├── test_dedup.py # Engine tests: similarity, union-find, pairs, integration
── test_normalizers.py # Normalizer tests: all 5 types with edge cases
├── test_io.py # I/O tests: encoding, delimiter, header, read/write
├── test_config.py # Config tests: serialization round-trip
└── test_cli.py # CLI tests: argument parsing, file handling
├── conftest.py # fixtures
├── test_dedup.py · test_normalizers.py · test_io.py · test_config.py
├── test_analyze.py · test_normalize.py · test_text_clean.py
├── test_format_standardize.py
├── test_format_standardize_corpus.py # 199-row buyer corpus
├── test_audit_fixes.py · test_errors.py · test_fixes_unit.py
├── test_corpus.py · test_encodings_corpus.py · test_fixtures_sweep.py
── test_cli.py · test_cli_*.py · test_e2e.py · test_install.py
```
### Writing Tests
Fixture corpora: `test-cases/text-cleaner-corpus/` (21 files) · `test-cases/encodings-corpus/` (31 files) · `test-cases/format-cleaner-corpus/` (7 files + spec).
Follow existing patterns. Tests use pytest fixtures from `conftest.py`:
## Known limitations
```python
def test_my_feature(simple_df):
"""Test description."""
result = deduplicate(simple_df, ...)
assert len(result.match_groups) == expected
assert result.deduplicated_df.shape[0] == expected_rows
```
## Known Limitations
- **O(n^2) pairwise comparison** — no blocking or indexing. Works well up to ~50,000 rows. Beyond that, performance degrades quadratically. Future optimization: add blocking (partition by first letter, zip code prefix, etc.) to reduce comparison space.
- **No multi-sheet dedup** — each Excel sheet is processed independently. Cross-sheet deduplication is not supported.
- **Phone normalization requires valid-length numbers** — the `phonenumbers` library rejects numbers that are too short or too long for the detected region. Fallback is digits-only, which may produce false negatives for international numbers without country codes.
- **Single-threaded** — no parallel comparison. Could benefit from `multiprocessing` for large files.
- **Memory-bound** — entire file is loaded into a pandas DataFrame. Files larger than available RAM will fail. Chunked reading exists but is not integrated with the dedup engine.
- **Dedup is O(n²)** — no blocking. Works to ~50k rows. Future: partition by first letter / ZIP prefix.
- **Single-threaded** — could benefit from `multiprocessing`.
- **Memory-bound** — entire file loaded into pandas. Streaming reads exist but not integrated with dedup engine.
- **No multi-sheet dedup** — each Excel sheet processed independently.
- **Phonenumbers minimum-length** — international numbers without country codes fall back to digits-only.

View File

@@ -1,38 +1,27 @@
# Excel & CSV Data Cleaning Mastery Bundle
**Ready-to-sell Python automation product.**
9 scripts for data cleaning, deduplication, text hygiene, formatting, merging, validation, and reporting.
9 Python data-cleaning tools, every one with a CLI and a browser GUI. Local-only, no internet. Windows / macOS / Linux.
Each script ships with both a GUI (runs in your browser locally, no internet needed) and a CLI.
## Quick Start
Cross-platform: Windows, macOS, Linux.
1. Download the installer for your OS from your purchase email.
2. Run it (no Python knowledge required).
3. Launch via the desktop shortcut → your default browser opens to a local page.
Full instructions: [USER-GUIDE.md](USER-GUIDE.md).
## Docs
**Buyer-facing** (ships with the product):
- [USER-GUIDE.md](USER-GUIDE.md) — install + per-tool walkthrough
**Creator-only** (do not ship):
- [BUSINESS.md](BUSINESS.md) — market, pricing, marketing
- [TECHNICAL.md](TECHNICAL.md) — architecture, build pipeline, standards
- [DECISIONS.md](DECISIONS.md) — locked criteria, decision log
- [RECOVERY.md](RECOVERY.md) — full rebuild guide
- [REQUIREMENTS.md](REQUIREMENTS.md) — numbered support matrix
---
## Quick Start (for buyers)
1. Download the installer for your operating system.
2. Run the installer. No Python knowledge required.
3. Launch via the desktop shortcut "Launch Bundle" (or the app icon on macOS, or the AppImage on Linux).
4. Your default browser opens to a local page where the data tool runs. Your data never leaves your computer.
Full instructions: see [USER-GUIDE.md](USER-GUIDE.md).
---
## Documentation Index
### Ships with the product (buyer-facing)
- [USER-GUIDE.md](USER-GUIDE.md) - Installation, script reference, usage examples for both GUI and CLI.
### Creator-only (do not ship to buyers)
- [BUSINESS.md](BUSINESS.md) - Business case, market analysis, pricing, marketing strategy (including the hosted browser demo as a conversion lever).
- [TECHNICAL.md](TECHNICAL.md) - Architecture (dual CLI + Streamlit GUI), build pipeline, dev standards.
- [DECISIONS.md](DECISIONS.md) - Locked criteria, scoring rubric, decisions log, rationale for product choices including the GUI framework decision.
- [RECOVERY.md](RECOVERY.md) - How to rebuild the entire project from scratch if lost.
---
**Version**: 1.6
**Last updated**: April 28, 2026
**Owner**: Michael
**Version**: 1.6 · **Updated**: 2026-05-01 · **Owner**: Michael

View File

@@ -1,180 +1,147 @@
# RECOVERY.md - Full Project Recovery Guide
# Recovery
> **Creator-only document. Do not ship to buyers.**
> Creator-only. Full project rebuild guide.
> **Version**: 1.6 · **Updated**: 2026-05-01
**Version**: 1.6
**Last updated**: April 28, 2026
If lost, this doc + the source ZIP rebuilds the project 100%.
If the project is ever lost, this guide plus the source ZIP is enough to rebuild it 100%.
---
## 1. What's in the Project
## 1. Project layout
```
project-root/
├── README.md
├── BUSINESS.md # Creator only
├── TECHNICAL.md # Creator only
├── DECISIONS.md # Creator only - locked criteria, rationale, GUI framework decision
├── USER-GUIDE.md # Ships to buyers
├── RECOVERY.md # Creator only (this file)
├── scripts/ # The 9 .py source files (CLI entry points)
│ ├── 01_deduplicator.py # Working
── 02_text_cleaner.py
│ ├── 03_format_standardizer.py
│ ├── 04_missing_value_handler.py
│ ├── 05_column_mapper_enforcer.py
│ ├── 06_outlier_detector.py
│ ├── 07_multi_file_merger.py
│ ├── 08_validator_reporter.py
│ └── 09_master_orchestrator.py
├── docs/
│ ├── BUSINESS.md # creator-only
├── TECHNICAL.md # creator-only
│ ├── DECISIONS.md # creator-only — locked criteria + decision log
│ ├── DEVELOPER.md # creator-only
├── RECOVERY.md # creator-only (this file)
│ ├── REQUIREMENTS.md
│ ├── USER-GUIDE.md # ships to buyers
── CLI-REFERENCE.md
├── src/
│ ├── core/ # Shared business logic - both CLI and GUI call into this
│ ├── cli.py # Typer CLI front-end
── gui/ # Streamlit GUI front-end
├── app.py # Streamlit entry point
├── pages/ # One Streamlit page per script in the bundle
── components.py # Shared widgets
├── samples/
│ ├── messy_sales.csv
│ └── bank_export.xlsx
├── demo/
│ └── streamlit_app.py # Constrained version for Streamlit Community Cloud
│ ├── core/ # shared logic both CLI + GUI call into this
│ ├── cli.py # Deduplicator CLI
── cli_text_clean.py # Text Cleaner CLI
├── cli_analyze.py # Analyzer CLI
└── gui/
── app.py # Streamlit entry
├── pages/ # one page per tool
│ └── components/ # shared widgets
├── samples/ # messy_sales.csv, bank_export.xlsx
├── test-cases/ # corpora: text-cleaner, encodings, format-cleaner
├── tests/ # pytest
├── demo/streamlit_app.py # constrained Streamlit Community Cloud version
├── build/
│ ├── pyinstaller.spec # Cross-platform build spec (handles GUI launcher + CLI binaries)
│ ├── launcher.py # Starts local Streamlit server, opens default browser
│ ├── windows/
│ └── installer.iss # Inno Setup wrapper
── macos/
│ │ ├── entitlements.plist
│ │ └── dmg_settings.py
│ └── linux/
│ └── AppImage/ # AppImage build assets
├── ci/
│ └── build.yml # GitHub Actions cross-platform build
├── tests/
│ ├── pyinstaller.spec # cross-platform build spec
│ ├── launcher.py # starts Streamlit, opens browser
│ ├── windows/installer.iss
├── macos/{entitlements.plist, dmg_settings.py}
── linux/AppImage/
├── ci/build.yml # GitHub Actions matrix build
└── requirements.txt
```
---
## 2. Rebuild Steps
## 2. Rebuild steps
### From a complete ZIP backup
1. Unzip into a clean directory.
2. Push to a GitHub repository.
3. The CI pipeline (`ci/build.yml`) builds Windows, macOS, and Linux artifacts on tagged releases.
4. Connect the repo to Streamlit Community Cloud and point it at `demo/streamlit_app.py` to redeploy the hosted demo.
5. For local builds: see Section 3.
6. Done.
2. Push to GitHub.
3. Tag a release → CI builds Windows / macOS / Linux artifacts.
4. Connect repo to Streamlit Community Cloud → demo deploys.
5. Local builds: see §3.
### From documentation only (worst case)
1. Read `DECISIONS.md` to understand *why* the project is what it is. Section 4c locks the GUI framework as Streamlit; Section 4b locks the UX standards. These are non-negotiable.
2. Read `TECHNICAL.md` Sections 2-3 for the build pipeline architecture, including the Streamlit launcher pattern in Section 3.4.
3. Read `BUSINESS.md` for product strategy, which bundles to build, and the hosted demo as a marketing asset.
4. Recreate scripts using the spec in `USER-GUIDE.md` Section 2 (script table), `TECHNICAL.md` Section 7 (per-bundle technical notes), `TECHNICAL.md` Section 9 (boundary between scripts 04 and 06 - do not relitigate this), and `TECHNICAL.md` Section 10 (per-script functional requirements; Section 10.1 is the v1 launch target for the deduplicator).
5. Set up the cross-platform build pipeline (Section 3 below).
6. Recreate installer configs per `TECHNICAL.md` Section 3.
7. Build the constrained `demo/streamlit_app.py` for hosted deployment. Constraints: row limit, watermark, sample data only or strict file-size cap.
1. Read **DECISIONS.md** understand *why* the project is what it is. §4c locks Streamlit; §4b locks UX standards. **Non-negotiable.**
2. Read **TECHNICAL.md** §1-3 (architecture + build pipeline + Streamlit launcher pattern in §3.4).
3. Read **BUSINESS.md** for product strategy + hosted demo as marketing asset.
4. Recreate scripts using:
- USER-GUIDE.md §2 (script table)
- TECHNICAL.md §10 (04/06 boundary — do not relitigate)
- TECHNICAL.md §11 (per-script functional specs; §11.1-11.3 are the v1 launch targets for Ready tools).
5. Set up cross-platform build pipeline (§3 below).
6. Recreate installer configs per TECHNICAL.md §3.5-3.7.
7. Build constrained `demo/streamlit_app.py` (row limit, watermark, sample data).
---
## 3. Local build setup
## 3. Local Build Setup (per platform)
### All platforms (common)
- Install Python 3.11+.
- `pip install -r requirements.txt pyinstaller`
- Verify Streamlit app runs locally: `streamlit run src/gui/app.py`
- Verify CLI runs locally: `python -m src.cli --help`
### Common
```bash
pip install -r requirements.txt pyinstaller
streamlit run src/gui/app.py # verify GUI
python -m src.cli --help # verify CLI
```
### Windows
- Install Inno Setup: https://jrsoftware.org/isinfo.php
- Build: `pyinstaller build/pyinstaller.spec`
- Wrap in installer: open `build/windows/installer.iss` in Inno Setup, compile.
- `pyinstaller build/pyinstaller.spec`
- Open `build/windows/installer.iss` in Inno Setup, compile.
### macOS
- Install Xcode command line tools: `xcode-select --install`
- Enroll in Apple Developer Program ($99/yr). Allow 1-2 weeks first time.
- Generate Developer ID Application certificate, install in Keychain.
- Generate app-specific password for `notarytool`.
- Build: `pyinstaller build/pyinstaller.spec`
- Sign: `codesign --deep --force --options runtime --sign "Developer ID Application: [Name]" dist/BundleName.app`
- Package as DMG.
- Notarize: `xcrun notarytool submit BundleName.dmg --wait`
- Staple: `xcrun stapler staple BundleName.dmg`
1. `xcode-select --install`
2. Enroll in Apple Developer Program ($99/yr 1-2 wk first time).
3. Generate Developer ID cert, install in Keychain.
4. Generate app-specific password for `notarytool`.
5. `pyinstaller build/pyinstaller.spec`
6. `codesign --deep --force --options runtime --sign "Developer ID Application: [Name]" dist/App.app`
7. Package as DMG.
8. `xcrun notarytool submit *.dmg --wait`
9. `xcrun stapler staple *.dmg`
### Linux
- Install AppImage tooling: download `appimagetool` from https://appimage.github.io
- Build: `pyinstaller build/pyinstaller.spec`
- Wrap as AppImage using `appimagetool` per the assets in `build/linux/AppImage/`.
- Download `appimagetool` from https://appimage.github.io
- `pyinstaller build/pyinstaller.spec`
- Wrap as AppImage via assets in `build/linux/AppImage/`.
### Streamlit + PyInstaller specific notes
- A custom PyInstaller hook (`hook-streamlit.py`) is required to bundle Streamlit's data files correctly.
- Hidden imports must include `streamlit`, `altair`, `pyarrow` (and their submodules where PyInstaller fails to detect them).
- The launcher script (`build/launcher.py`) is the actual PyInstaller entry point, not the Streamlit script directly.
- Budget 1-3 days the first time getting the Streamlit-PyInstaller spec right; it's reusable across all subsequent bundles.
### Streamlit + PyInstaller notes
- Custom `hook-streamlit.py` required.
- Hidden imports: `streamlit`, `altair`, `pyarrow` (and submodules where auto-detection fails).
- The PyInstaller entry point is `build/launcher.py`, **not** the Streamlit script directly.
- Budget 1-3 days first time. Reusable across all bundles.
### CI build (recommended)
- Push the repo to GitHub.
- Tag a release: `git tag v1.0.0 && git push --tags`
- GitHub Actions runs the matrix build, produces all three artifacts.
- Manual step: download artifacts from the Releases page, upload to Gumroad / Lemon Squeezy.
```bash
git tag v1.0.0 && git push --tags
# GitHub Actions runs the matrix → 3 platform artifacts on Releases page.
# Manual: download upload to Gumroad / Lemon Squeezy.
```
### Hosted demo deployment (separate from desktop build)
### Hosted demo deployment
- Connect GitHub repo to Streamlit Community Cloud (one-time, free).
- Configure the deployment to point at `demo/streamlit_app.py`.
- The demo updates automatically on git push to the configured branch.
- Custom domain optional via CNAME (verify Streamlit Community Cloud current policy at recovery time).
- Configure deployment `demo/streamlit_app.py`.
- Auto-updates on push to configured branch.
- Custom domain optional via CNAME.
---
## 4. External Dependencies (re-acquire if lost)
## 4. External dependencies
| Item | Source | Cost |
|---|---|---|
| Python | https://python.org/downloads | Free |
| PyInstaller | `pip install pyinstaller` | Free |
| Streamlit | `pip install streamlit` | Free |
| Inno Setup (Windows) | https://jrsoftware.org/isinfo.php | Free |
| Apple Developer Program (macOS signing) | https://developer.apple.com | $99/yr |
| Xcode command line tools (macOS) | `xcode-select --install` | Free |
| appimagetool (Linux) | https://appimage.github.io | Free |
| GitHub Actions (CI) | github.com | Free tier covers all three OS runners |
| Streamlit Community Cloud (demo hosting) | streamlit.io/cloud | Free |
| Python libraries | See `requirements.txt`, `pip install -r requirements.txt` | Free |
|------|--------|------|
| Python | python.org/downloads | Free |
| PyInstaller, Streamlit, Python libs | `pip install -r requirements.txt` | Free |
| Inno Setup (Windows) | jrsoftware.org/isinfo.php | Free |
| Apple Developer Program (macOS) | developer.apple.com | $99/yr |
| Xcode CLT (macOS) | `xcode-select --install` | Free |
| appimagetool (Linux) | appimage.github.io | Free |
| GitHub Actions (CI) | github.com | Free tier covers all 3 OS runners |
| Streamlit Community Cloud | streamlit.io/cloud | Free |
---
## 5. Backup recommendation
## 5. Backup Recommendation
- **Primary**: GitHub repository (private). Source of truth.
- **Secondary**: ZIP of full project tree on cloud storage (Drive / Dropbox / S3).
- **Apple Developer credentials**: cert + app-specific password in a password manager. Re-issuable, not catastrophic.
- **Streamlit Community Cloud**: stored as GitHub OAuth link in Streamlit UI. Re-authorize from new account if lost.
- Back up after every meaningful change.
- **Always include RECOVERY.md + DECISIONS.md** — irreplaceable context.
- **Primary backup**: GitHub repository (private). Source is the source of truth.
- **Secondary backup**: ZIP of the full project tree on cloud storage (Google Drive / Dropbox / S3).
- **Apple Developer credentials**: store certificate + app-specific password in a password manager. Losing these requires regenerating, not catastrophic.
- **Streamlit Community Cloud connection**: stored in Streamlit's UI as a GitHub OAuth link. Re-authorize from a new Streamlit account if lost.
- Back up after every meaningful code or doc change.
- Include this `RECOVERY.md` and `DECISIONS.md` in every backup. They contain the irreplaceable context.
## 6. Recovery priorities (under time pressure)
---
## 6. Recovery Priorities (if rebuilding under time pressure)
If you only have time to rebuild part of the project, this is the order:
1. **Source: `src/core/` and `scripts/`**. Without these there is no product.
2. **DECISIONS.md**. Without this you will re-litigate every settled decision (especially GUI framework, dual interface, UX standards) and probably get it wrong differently.
3. **TECHNICAL.md**, especially Sections 9 (04/06 boundary) and 10 (per-script functional requirements). Without these you will rebuild the deduplicator with weaker fuzzy matching than the v1 launch spec demands and ship something that loses to free Excel.
4. **Streamlit GUI source (`src/gui/`)**. The primary buyer surface; without it the product reverts to CLI-only and the buyer persona will refund.
5. **PyInstaller spec + launcher + per-OS build configs** (`build/`). Reproducing the Streamlit-PyInstaller integration from scratch is 1-3 days of work.
6. **Apple Developer Program enrollment**. 1-2 week lead time. Start this first if Mac distribution matters.
7. **Hosted demo (`demo/streamlit_app.py`)**. Important marketing asset but not blocking for desktop sales.
8. Documentation files (USER-GUIDE, BUSINESS, README). Recoverable from memory + this guide.
9. CI config (`ci/build.yml`). Nice to have, not blocking.
1. **`src/core/` + scripts** — without these there is no product.
2. **DECISIONS.md** — without this you'll re-litigate every settled call.
3. **TECHNICAL.md** §10 (04/06 boundary) + §11 (per-script specs). Without these you'll rebuild dedup with weaker fuzzy than the v1 spec demands and lose to free Excel.
4. **`src/gui/`** — primary buyer surface; without it the product reverts to CLI-only and the persona refunds.
5. **PyInstaller spec + launcher + per-OS configs** — recreating the Streamlit-PyInstaller integration is 1-3 days.
6. **Apple Developer Program enrollment** — 1-2 wk lead. Start first if Mac matters.
7. **Hosted demo** — important marketing asset, not blocking for desktop sales.
8. Doc files (USER-GUIDE, BUSINESS, README) — recoverable from memory + this guide.
9. CI config — nice to have, not blocking.

View File

@@ -1,146 +1,129 @@
# REQUIREMENTS.md
# Requirements
Numbered, categorized requirements list — short form. The companion to USER-GUIDE.md and TECHNICAL.md; updated with every shipped capability.
---
Numbered support matrix. Updated with every shipped capability.
## 1. File handling
1.1 File size: ≤ 1 GB (target; bigger files work but the gate's full-DataFrame Apply pass scales linearly).
1.2 Input formats: CSV, TSV, XLSX, XLS.
1.3 Output formats: CSV, TSV.
1.4 Excel: multi-sheet workbook picker.
1.5 Empty file: detected, blocks gate with `empty_input` error finding.
1.1 Size: ≤ 1 GB target (larger works, slower).
1.2 Read: CSV, TSV, XLSX, XLS.
1.3 Write: CSV, TSV.
1.4 Excel: multi-sheet picker.
1.5 Empty file: blocked with `empty_input` error finding.
## 2. Input encodings (auto-detected)
2.1 Unicode: UTF-8, UTF-8 with BOM, UTF-16 LE/BE with BOM, UTF-16 LE without BOM (best-effort).
2.1 Unicode: UTF-8, UTF-8-BOM, UTF-16 LE/BE BOM, UTF-16 LE no-BOM.
2.2 Western: cp1252, ISO-8859-1, ISO-8859-15, Mac Roman.
2.3 Eastern European: cp1250, ISO-8859-2.
2.4 Cyrillic: cp1251, KOI8-R.
2.5 CJK: Shift_JIS / cp932, GB18030, Big5, EUC-KR / cp949.
2.6 ASCII: detected as UTF-8 (byte-equivalent).
2.7 User override: any Python codec name typed in the Review page.
2.6 ASCII detected as UTF-8.
2.7 User override: any Python codec name.
2.8 BOM: stripped on read, never written.
2.9 Decode failure: surfaced as `encoding_decode_failed` (error severity).
2.10 Replacement char (U+FFFD) in output: surfaced as `encoding_uncertain` (error).
2.9 Decode failure `encoding_decode_failed` (error).
2.10 U+FFFD in output `encoding_uncertain` (error).
## 3. Output encodings
3.1 UTF-8 (default).
3.2 UTF-8 with BOM (Excel-friendly).
3.3 cp1252, ISO-8859-1, ISO-8859-15, cp1250, ISO-8859-2, cp1251.
3.4 Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE.
3.5 Lossy fallback: `?` replacement + warning shown when chosen codec can't represent a character.
3.1 UTF-8 (default), UTF-8-BOM (Excel-friendly).
3.2 cp1252, ISO-8859-1/15, cp1250, ISO-8859-2, cp1251.
3.3 Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE.
3.4 Lossy fallback: `?` + warning when codec can't represent a char.
## 4. Delimiters
4.1 Auto-detect (input): `,`, `\t`, `;`, `|`.
4.1 Input auto-detect: `,`, `\t`, `;`, `|`.
4.2 Output: `,` (default), `\t`, `;`, `|`.
4.3 File extension: `.tsv` for tab, `.csv` otherwise.
4.3 Extension: `.tsv` for tab, `.csv` otherwise.
## 5. Line endings
5.1 Input: LF, CRLF, bare CR (all normalized to LF on read).
5.1 Read: LF / CRLF / bare CR — all normalized to LF.
5.2 Embedded in quoted cells: also normalized to LF.
5.3 Output: LF (default), CRLF, CR.
5.4 Mixed line endings: surfaced as `mixed_line_endings` finding.
5.3 Write: LF (default), CRLF, CR.
5.4 Mixed `mixed_line_endings` finding.
## 6. Analyzer detectors
6.1 File-level (audit log of read-time fixes): `csv_bom_stripped`, `csv_nul_stripped`, `csv_smart_quotes_folded`, `csv_line_endings_normalized`, `csv_transcoded_to_utf8`, `csv_unquoted_delimiters_repaired`, `csv_unrepairable_rows`.
6.2 Cell-level: `smart_punctuation_in_data`, `nbsp_or_unicode_whitespace`, `zero_width_or_invisible`, `dirty_column_headers`, `whitespace_padding`, `null_like_sentinels`, `suspected_mojibake`, `mixed_case_email_column`, `near_duplicate_rows`, `leading_zero_ids`.
6.3 Encoding integrity: `encoding_uncertain`, `encoding_decode_failed`, `empty_input`.
6.4 Sample size (default): 1,000 rows; configurable.
**File-level** (read-time fixes, audit-logged):
- `csv_bom_stripped`, `csv_nul_stripped`, `csv_smart_quotes_folded`, `csv_line_endings_normalized`, `csv_transcoded_to_utf8`, `csv_unquoted_delimiters_repaired`, `csv_unrepairable_rows`.
**Cell-level**:
- `smart_punctuation_in_data`, `nbsp_or_unicode_whitespace`, `zero_width_or_invisible`, `dirty_column_headers`, `whitespace_padding`, `null_like_sentinels`, `suspected_mojibake`, `mixed_case_email_column`, `inconsistent_date_format`, `near_duplicate_rows`, `leading_zero_ids`.
**Encoding integrity**: `encoding_uncertain`, `encoding_decode_failed`, `empty_input`.
Sample size: 1,000 rows (configurable).
## 7. Finding fields
7.1 `id` — stable identifier.
7.2 `severity` — info / warn / error (error blocks gate).
7.3 `confidence` — high / medium / low (auto-fixability).
7.4 `fix_action` — id of the algorithm in `src/core/fixes.py`.
7.5 `pre_applied` — true if fixed during read pass.
7.6 `tool` — owning tool id (or empty).
7.7 `count`, `description`, `column`, `samples` (≤5).
`id`, `severity` (info/warn/error), `confidence` (high/medium/low), `fix_action`, `pre_applied`, `tool`, `count`, `description`, `column`, `samples` (≤5).
## 8. Confidence tiers
- **high** — round-trip safe, one-click auto-fix.
- **medium** — preview before applying.
- **low** — opt-in only, can corrupt if wrong.
- **error** — must resolve or waive before tool pages unlock.
8.1 **high** — round-trip safe; one-click auto-fix.
8.2 **medium** — preview before applying.
8.3 **low** — opt-in only; can corrupt data if wrong.
8.4 **error** — must resolve or waive before tool pages unlock.
## 9. Decision actions per finding
9.1 `auto` — apply the registered fix.
9.2 `skip` — waive (no change, audit-logged).
9.3 `modified` — apply with custom payload (e.g. user-edited null sentinels).
## 9. Decision actions
- `auto` — apply registered fix.
- `skip` — waive (audit-logged).
- `modified` — apply with custom payload.
## 10. Performance (1 GB input)
- Initial scan (sample): < 2 s · peak RSS ~110 MB.
- Full-file `repair_bytes`: 3040 s.
- Full-DataFrame analyze: ~4 min (~25 µs/cell).
- Full-DataFrame `auto_fix`: ~5 min (~30 µs/cell).
- Output write: ~10 s.
- Recommended RAM: 4× input size for full-Apply path.
10.1 Initial scan (`analyze` sample-mode): < 2 s.
10.2 Peak RSS during initial scan: ~110 MB.
10.3 Full-file `repair_bytes`: ~3040 s (when triggered).
10.4 Full-DataFrame analyze: ~4 min (~25 µs/cell).
10.5 Full-DataFrame `auto_fix`: ~5 min (~30 µs/cell).
10.6 Output write: ~10 s for 1 GB UTF-8 CSV.
10.7 RAM headroom recommended: 4× input file size for the full-Apply path.
## 11. Tools shipped
11.1 Deduplicator — Ready.
11.2 Text Cleaner — Ready.
11.3 Format Standardizer — Coming Soon.
11.4 Missing Value Handler — Coming Soon.
11.5 Column Mapper — Coming Soon.
11.6 Outlier Detector — Coming Soon.
11.7 Multi-File Merger — Coming Soon.
11.8 Validator & Reporter — Coming Soon.
11.9 Pipeline Runner — Coming Soon.
## 11. Tools
1. Deduplicator — Ready
2. Text Cleaner — Ready
3. Format Standardizer — Ready
4. Missing Value Handler — Coming Soon
5. Column Mapper — Coming Soon
6. Outlier Detector — Coming Soon
7. Multi-File Merger — Coming Soon
8. Validator & Reporter — Coming Soon
9. Pipeline Runner — Coming Soon
## 12. Gate (Review & Normalize)
12.1 Gates every tool page; tool pages refuse to load until passed.
12.2 Auto-fix button applies all `confidence=high` findings in one click.
12.3 Per-finding controls: Auto-fix / Skip / Customize.
12.4 Live before/after preview per finding (≤5 sample rows).
12.5 Audit log: every fix tagged with finding id, decision, cells changed.
12.6 Encoding override picker (16 codepages + custom).
12.7 Advanced output options expander: encoding + delimiter + line terminator.
12.8 Result keyed by upload SHA-256; survives page reloads, invalidated on re-upload.
- Gates every tool page.
- Auto-fix button: applies all `confidence=high` findings in one click.
- Per-finding controls: Auto / Skip / Customize.
- Live before/after preview (≤5 sample rows).
- Audit log per fix (id, decision, cells changed).
- Encoding-override picker (16 codepages + custom).
- Advanced output expander: encoding + delimiter + line terminator.
- Result keyed by upload SHA-256; survives reload, invalidated on re-upload.
## 13. Interfaces
13.1 GUI: Streamlit, runs locally, browser-based, no internet required.
13.2 CLI: Typer apps — `python -m src.cli`, `src.cli_text_clean`, `src.cli_analyze`.
13.3 Python API: `from src.core import …` (analyze, repair_bytes, clean_dataframe, deduplicate, etc.).
13.4 JSON output: `--json` flag on `cli_analyze`; full Finding schema.
- **GUI**: Streamlit, browser-based, local, no internet.
- **CLI**: `python -m src.cli` (dedup) · `src.cli_text_clean` · `src.cli_analyze`.
- **Python API**: `from src.core import …` (analyze, repair_bytes, clean_dataframe, deduplicate, standardize_dataframe, …).
- **JSON output**: `--json` on `cli_analyze`.
## 14. Platforms
14.1 Python: ≥ 3.10.
14.2 OS: Linux, macOS, Windows.
14.3 Display: any modern browser (Streamlit GUI).
14.4 Network: not required at runtime.
- Python ≥ 3.10.
- OS: Linux, macOS, Windows.
- Browser: any modern browser.
- Network: not required at runtime.
## 15. Dependencies
15.1 Core: pandas, openpyxl, charset-normalizer, typer, loguru.
15.2 Dedup: rapidfuzz, phonenumbers.
15.3 GUI: streamlit.
15.4 Optional: ftfy (mojibake repair, `repair_mojibake` fix).
15.5 Dev: pytest, tox.
- **Core**: pandas, openpyxl, charset-normalizer, typer, loguru.
- **Dedup**: rapidfuzz, phonenumbers.
- **GUI**: streamlit.
- **Optional**: ftfy (mojibake repair).
- **Dev**: pytest, tox.
## 16. Test coverage
16.1 Unit + integration: 765 tests passing.
16.2 Documented gaps: 17 xfail (charset-normalizer label drift on byte-equivalent codepages, byte-level smart-quote fold expectation).
16.3 Fixture corpora: 21 text-cleaner fixtures, 31 encoding fixtures, 9 reference UTF-8 files.
16.4 CI surface: `python run_tests.py [--tool …] [--fixtures] [--coverage]`.
- 1,230 tests passing, 4 skipped (ftfy not installed), 17 xfailed (documented).
- Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases).
- Run: `python run_tests.py [--tool …] [--fixtures] [--coverage]`.
## 17. Privacy / data handling
- All processing local; no network calls in the data path.
- No telemetry.
- Original input never modified.
- Audit logs: `logs/` next to each run (timestamped).
17.1 All processing local; no network calls in the data path.
17.2 No telemetry, no usage analytics shipped.
17.3 Original input file never modified — outputs go to a separate path.
17.4 Audit logs written to `logs/` next to each run (timestamped).
## 18. Error handling
- Structured hierarchy: `DataToolsError``InputValidationError`, `ConfigError`, `FileFormatError`, `FileAccessError`.
- Subclasses extend stdlib `ValueError` / `OSError` so existing handlers still catch them.
- Every error carries: message, file path, column, operation, suggestion, underlying cause.

View File

@@ -1,570 +1,350 @@
# TECHNICAL.md - Technical Design, Build Pipeline, Standards
# Technical
> **Creator-only document. Do not ship to buyers.**
> Creator-only. Do not ship to buyers.
> **Version**: 1.6 · **Updated**: 2026-05-01
**Version**: 1.6
**Last updated**: April 28, 2026
## 1. Architecture
---
- **Dual interface**: CLI + GUI, both wrapping the same `src/core/` library.
- **GUI**: Streamlit, runs as local web server, opens in default browser. No internet.
- **Runtime**: Python 3.10+ (bundled into installer; buyer never sees Python).
- **Cross-platform**: Windows, macOS, Linux from day one. PyInstaller per OS.
- **Core/UI rule**: business logic in `core/` only. CLI + GUI are thin front-ends.
## 1. Architecture Overview
**Locks**:
- v1.2 — dual interface required (non-technical buyers won't use CLI).
- v1.3 — Streamlit chosen (over CustomTkinter inactive, plain Tk UX gap, Flet/PySide6/NiceGUI each fails one dimension). See DECISIONS.md §4c.
- Standalone tools with **dual interface**: CLI and GUI, both wrapping the same core library.
- GUI framework: **Streamlit**. Runs as a local web server, opens in the buyer's default browser. No internet used.
- Python 3.11+ runtime (bundled into the installer; the buyer never installs Python).
- Modular code, one concern per script. Core logic is library code; CLI and GUI are thin front-ends.
- Cross-platform from day one: Windows, macOS, Linux.
- PyInstaller produces standalone executables per OS. Buyer never sees Python, pip, venvs, or PATH.
- No internet required at runtime.
**Why dual interface (locked v1.2)**: The primary buyer persona is non-technical and will not use a CLI. The GUI is therefore the primary surface and is required at v1, not deferred. The CLI is retained for power users, automation, scheduled jobs, and future scripted workflows. Both share a single core; neither has features the other lacks (except interactive review, which only makes sense in GUI).
**Why Streamlit (locked v1.3)**: Fastest build velocity, lowest maintenance burden per added feature, hosted browser demo deployable as a marketing asset, future SaaS optionality. Selected over CustomTkinter (maintenance inactive since Jan 2024), plain Tkinter (UX gap at this price tier), Flet (ecosystem too young), PySide6 (overkill), and NiceGUI (smaller community). Full rationale in DECISIONS.md Section 4c.
This is a major change from the original Inno-Setup-only, CLI-only design. Rationale chain:
1. Requiring a buyer to install Python before using the product is the largest source of install friction (solved by PyInstaller in v1.1).
2. Requiring a non-technical buyer to use a CLI is the second-largest source of refund risk (solved by dual interface in v1.2).
3. Betting the GUI on an unmaintained library is the largest hidden technical risk (solved by Streamlit choice in v1.3).
---
## 2. Standard Bundle Structure (source repo)
Every bundle follows this layout in source. Core logic is shared, CLI and GUI are thin front-ends.
## 2. Repo layout
```
bundle-name/
├── src/
├── __init__.py
├── core/ # Shared business logic. No UI code here.
├── __init__.py
│ ├── dedup.py # (example) the actual algorithm
│ │ └── io.py # File I/O, encoding/delimiter detection, etc.
├── cli.py # Command-line interface (Typer). Thin wrapper over core.
└── gui/ # Streamlit front-end. Thin wrapper over core.
│ ├── __init__.py
├── app.py # Main Streamlit entry point (st.set_page_config, layout)
├── pages/ # Streamlit multi-page app (one page per script in the bundle)
├── 1_Deduplicator.py
├── 2_Text_Cleaner.py
│ ├── 3_Format_Standardizer.py
└── ...
└── components.py # Reusable Streamlit widgets and helpers
├── data_examples/ # Sample input files
├── tests/ # Unit tests (pytest). Tests target core, not UI.
├── build/
│ ├── pyinstaller.spec # PyInstaller build spec (handles both CLI + GUI entry points)
│ ├── launcher.py # Small launcher script: starts Streamlit server, opens browser
│ ├── windows/
└── installer.iss # Inno Setup wrapper for Windows .exe installer
│ ├── macos/
│ │ ├── entitlements.plist
│ │ └── dmg_settings.py # dmg-creation config
│ └── linux/
│ └── AppImage/ # AppImage build assets
├── demo/ # Stripped-down version for hosted browser demo
│ └── streamlit_app.py # Entry point for Streamlit Community Cloud deployment
├── requirements.txt
├── README_bundle.md # User-facing guide (covers both CLI and GUI usage)
├── LICENSE
└── ci/
└── build.yml # GitHub Actions cross-platform build
src/
core/ # Shared logic. No UI code.
analyze.py # Detectors + Finding schema
config.py # DeduplicationConfig (JSON profiles)
dedup.py # Match strategies, union-find, survivor selection
errors.py # Structured error hierarchy + format_for_user
fixes.py # Fix registry (one per fix_action)
format_standardize.py # Per-cell standardizers + DataFrame pipeline
io.py # read_file / write_file / repair_bytes
normalize.py # CSV-normalization gate
normalizers.py # Per-column normalizers for dedup matching
text_clean.py # clean_dataframe + smart_title_case
_constants.py # Shared USPS abbrevs + state names
cli.py # Deduplicator CLI (Typer)
cli_text_clean.py # Text Cleaner CLI
cli_analyze.py # Analyzer CLI (--json)
gui/
app.py # Streamlit entry point
pages/ # One page per tool
components/ # shared, dedup_review, findings, gate, _legacy
build/ # PyInstaller spec, launcher, OS-specific configs
demo/ # Constrained Streamlit Community Cloud version
tests/ # pytest; targets core/, not UI
test-cases/ # Fixture corpora (text-cleaner, encodings, format-cleaner)
```
**Core/UI separation rule**: A new feature is implemented in `core/` first, with tests. CLI and GUI both call into core. If a feature exists only in one front-end (e.g., interactive review only in GUI), the underlying capability still lives in core; only the presentation differs.
**Demo subfolder**: row-limited, watermarked, file-size-capped Streamlit app for public deployment. Same core, different front-end constraints.
**Demo subfolder rule**: The `demo/` folder contains a constrained Streamlit app for public deployment to Streamlit Community Cloud. Constraints: row limit (e.g., 100 rows max output), no file save, watermark on output, sample dataset only or strict file-size cap. Same core library, different front-end constraints.
---
## 3. Cross-Platform Build Pipeline
## 3. Build pipeline
### 3.1 Tooling
| Concern | Tool |
|---|---|
| Bundling Python + scripts into a standalone binary | PyInstaller |
| GUI framework | **Streamlit** |
| Browser launch from launcher | Python `webbrowser` module (stdlib) |
| CLI framework | Typer |
| Windows installer wrapper | Inno Setup (free) |
| macOS bundle format | `.app` packaged in `.dmg` |
| macOS code signing & notarization | `codesign` + `notarytool` (built into Xcode command line tools) |
| Linux distribution format | AppImage (primary) + plain tarball (fallback) |
| CI / automated builds | GitHub Actions (free tier handles all three OS runners) |
| Hosted demo | Streamlit Community Cloud (free) or $5/mo VPS |
|---------|------|
| Bundling | PyInstaller |
| GUI | Streamlit |
| CLI | Typer |
| Browser launch | stdlib `webbrowser` |
| Win installer | Inno Setup (free) |
| macOS sign+notarize | `codesign` + `notarytool` |
| Linux | AppImage (primary) + tarball fallback |
| CI | GitHub Actions matrix |
| Demo host | Streamlit Community Cloud (free) |
### 3.2 Build Outputs (what the buyer downloads)
### 3.2 Build outputs
| OS | File | Buyer experience |
|----|------|------------------|
| Win | `*-Setup-1.0.exe` | Wizard → desktop shortcut "Launch Bundle" → browser opens. CLI on PATH. |
| macOS | `*-1.0.dmg` | Drag to Applications. Signed + notarized. |
| Linux | `*-1.0.AppImage` | `chmod +x`, double-click. |
| Platform | Output file | Buyer experience |
|---|---|---|
| Windows | `BundleName-Setup-1.0.exe` | Double-click installer, click through wizard. Desktop shortcut "Launch Bundle" runs `launcher.py`, which starts the local Streamlit server and opens default browser to `http://localhost:8501`. CLI executables also installed and on PATH. |
| macOS | `BundleName-1.0.dmg` | Double-click DMG, drag app to Applications. Signed and notarized. Launching the app runs the launcher, which starts the local server and opens the browser. CLI binaries shipped in the app bundle. |
| Linux | `BundleName-1.0.AppImage` | Mark executable, double-click. AppImage runs the launcher, opens browser. Tarball fallback also includes CLI binaries. |
### 3.3 PyInstaller
The **default buyer experience on every platform is**: double-click, browser opens, work done. The CLI is present, documented, and on PATH for users who want it.
- `--onefile` for Linux, `--onedir` for Win/macOS (faster startup, easier signing).
- Two entry points: GUI launcher + CLI binaries.
- Streamlit hooks needed: `streamlit`, `altair`, `pyarrow` data dirs.
- Custom `hook-streamlit.py` per documented pattern.
- Budget: 1-3 days first time. Reusable after.
**Browser-launch UX mitigation** (per DECISIONS.md Section 4c tradeoff): The launcher script displays a brief "Starting your data tool..." console message before opening the browser. The Streamlit app's first page includes a one-line note: *"This tool runs locally in your browser and does not use the internet."* Install email reinforces the same message.
### 3.4 Streamlit launcher
### 3.3 PyInstaller Configuration
1. Find free port (don't hardcode 8501).
2. Set env: `STREAMLIT_SERVER_HEADLESS=true`, `STREAMLIT_BROWSER_GATHER_USAGE_STATS=false`, `STREAMLIT_SERVER_PORT={port}`.
3. Start Streamlit programmatically in a thread.
4. Poll port until ready.
5. Open browser to `http://localhost:{port}`.
6. Keep launcher alive while server runs.
Single `.spec` file per bundle, parameterized for OS. Key settings:
Optional v1.1: wrap with `pywebview` to eliminate browser-launch UX. Defer until support tickets show meaningful confusion.
- `--onefile` for Linux (single AppImage), `--onedir` for Windows and macOS (faster startup, easier signing).
- All dependencies bundled. No internet required at runtime.
- Hidden imports declared explicitly for pandas/openpyxl/Streamlit edge cases (PyInstaller's auto-detection misses some).
- Icon files per platform (`.ico` for Windows, `.icns` for macOS, `.png` for Linux).
- **Two entry points per bundle**: the GUI launcher (default, what the desktop shortcut runs) and the CLI binaries.
- **Streamlit-specific PyInstaller hooks**: include the `streamlit` data directory, the `altair` data directory (Streamlit dependency), and the `pyarrow` C extensions. Add a custom hook file (`hook-streamlit.py`) per the documented pattern. Budget 1-3 days the first time getting the spec right; reuse across all subsequent bundles.
### 3.5 macOS pipeline
### 3.4 Streamlit Launcher Pattern
The launcher script handles starting the local Streamlit server in a way that survives PyInstaller bundling. Conceptual outline:
1. Find a free local port (avoid hardcoding 8501 in case of conflict).
2. Set Streamlit environment variables: `STREAMLIT_SERVER_HEADLESS=true`, `STREAMLIT_BROWSER_GATHER_USAGE_STATS=false`, `STREAMLIT_SERVER_PORT={port}`.
3. Start Streamlit programmatically (via `streamlit.web.cli.main_run` or `bootstrap.run`) in a background thread.
4. Wait for the port to accept connections (poll with timeout).
5. Open the buyer's default browser to `http://localhost:{port}` via `webbrowser.open()`.
6. Keep the launcher process alive while the server runs. Detect server shutdown and exit cleanly.
Optional v1.1 enhancement: replace step 5 with a `pywebview` window that wraps the local server. Eliminates the "default browser opens" UX surprise. Adds a dependency and some packaging complexity. Defer until support tickets show the browser-launch is causing meaningful confusion.
### 3.5 macOS Signing & Notarization Pipeline
Required setup (one-time):
1. Enroll in Apple Developer Program ($99/yr - see BUSINESS.md Section 10).
2. Generate Developer ID Application certificate via Apple Developer portal.
3. Install certificate in macOS keychain on the build machine (or store as encrypted GitHub Actions secret for CI).
4. Generate an app-specific password for `notarytool`.
Build-time flow (automated):
1. PyInstaller produces unsigned `.app`.
2. `codesign --deep --force --options runtime --sign "Developer ID Application: [Your Name]" BundleName.app`
3. Package into `.dmg`.
4. Submit `.dmg` to Apple notary service: `xcrun notarytool submit BundleName.dmg --wait`.
5. Staple the notarization ticket: `xcrun stapler staple BundleName.dmg`.
6. Output is the final, distributable `.dmg`.
2. `codesign --deep --force --options runtime --sign "Developer ID Application: ..." App.app`.
3. Package as `.dmg`.
4. `xcrun notarytool submit *.dmg --wait`.
5. `xcrun stapler staple *.dmg`.
Buyers on macOS see no Gatekeeper warnings. Clean install.
Setup: Apple Developer Program ($99/yr), Developer ID cert in Keychain, app-specific password.
### 3.6 Windows Pipeline
### 3.6-3.7 Win + Linux
1. PyInstaller produces `BundleName/` folder with launcher `BundleName.exe` (which opens the GUI in browser) plus CLI binaries plus dependencies.
2. Inno Setup script wraps the folder into `BundleName-Setup-1.0.exe`.
3. Installer creates Start Menu entry, desktop shortcut (launches GUI), optional Add/Remove Programs entry, and adds CLI binaries to PATH.
4. Optional Windows code signing certificate (~$200-400/yr from a CA) eliminates SmartScreen warnings. **Not required at launch**; revisit if SmartScreen friction shows up in support tickets.
- **Win**: PyInstaller `--onedir` → Inno Setup wraps → installer adds Start Menu, desktop shortcut, PATH entries. Optional code-signing cert ($200-400/yr) if SmartScreen friction.
- **Linux**: PyInstaller → `appimagetool` wraps. `.tar.gz` fallback for distros where AppImage fails.
### 3.7 Linux Pipeline
1. PyInstaller produces single-file binaries per entry point.
2. Wrap in AppImage using `appimagetool` (free, well-documented). AppImage runs the launcher as the default target.
3. Provide a plain `.tar.gz` fallback for users on distributions where AppImage fails. Tarball includes both GUI launcher and CLI binaries plus a `run.sh`.
4. No signing required on Linux; users execute `chmod +x` then double-click or run.
### 3.8 CI Build Matrix
GitHub Actions builds all three platforms on tagged release:
### 3.8 CI matrix
```yaml
# Conceptual, full file lives in ci/build.yml
strategy:
matrix:
os: [windows-latest, macos-latest, ubuntu-latest]
```
Result: one git tag triggers three platform builds. Artifacts upload to GitHub Releases. Manual step: copy artifacts to Gumroad / Lemon Squeezy product page.
Tag a release → 3 platform artifacts upload to GitHub Releases. Manual: copy to Gumroad / Lemon Squeezy.
### 3.9 Hosted Demo Deployment
### 3.9 Hosted demo
Separate from the desktop build pipeline. The `demo/streamlit_app.py` entry point is deployed to Streamlit Community Cloud:
1. Connect the GitHub repo to Streamlit Community Cloud (one-time).
2. Configure the app to deploy from the `demo/` folder, main branch.
3. Set deployment-time environment variables (e.g., row limits, watermark flag).
4. App is publicly accessible at a `*.streamlit.app` URL. Link from Gumroad landing page.
5. Optional: custom domain via CNAME (free with Streamlit Community Cloud as of last check; verify before locking).
If Streamlit Community Cloud is ever unsuitable (rate limits, bandwidth, branding requirements), fall back to a $5/mo VPS running the demo via Docker. Same `demo/streamlit_app.py`, different host.
---
`demo/streamlit_app.py` → Streamlit Community Cloud. Configure deployment in Streamlit UI. Custom domain via CNAME (verify policy at deploy time). Fall back to $5/mo VPS if rate limits / branding constraints hit.
## 4. Libraries
| Purpose | Library |
|---|---|
| GUI framework | Streamlit |
| CLI framework | Typer |
| Data manipulation | pandas, openpyxl, numpy |
| Fuzzy string matching | rapidfuzz |
| File encoding detection | charset-normalizer |
|---------|---------|
| GUI | streamlit |
| CLI | typer |
| Data | pandas, openpyxl, numpy |
| Fuzzy match | rapidfuzz |
| Phone parsing | phonenumbers |
| Encoding detect | charset-normalizer |
| Logging | loguru |
| Progress bars | tqdm (CLI), `st.progress` (GUI) |
| Validation | pydantic (optional) |
| Reports | ReportLab (PDF), pandas styling (Excel) |
| Optional native window wrap | pywebview (deferred to v1.1 if needed) |
| Mojibake (optional) | ftfy |
| Reports | reportlab |
`requirements.txt` (current bundle, v1.3):
## 5. Coding standards
### 5.1 Code
- PEP 8 + type hints on public functions.
- Docstrings on every module + public function.
- `pathlib.Path` for paths, never string concat.
- All I/O explicitly UTF-8-aware.
- No platform-specific shell calls.
- pytest for `core/`, not UI.
- Errors raise via `src.core.errors` hierarchy (Section 7).
### 5.2 GUI UX (load-bearing per DECISIONS.md §4b)
- **Works out of the box** — drop file → useful result with zero config.
- **Sensible defaults visible everywhere**.
- **Progressive disclosure** — basic = file uploader + run button + results; rest in `st.expander`.
- **Plain-English labels**; technical detail in `help=` tooltip.
- **Dry-run / preview by default**.
- **Identical core to CLI**.
- **Local-first messaging** — "runs locally in your browser, no internet" line on every page.
### 5.3 Functional scope (load-bearing per DECISIONS.md §4a)
- Each script ships **complete coverage of the workflow it names**, including features Excel does for free.
- Boundary = the named workflow. Dedup includes normalization + survivor + audit; not format conversion or charting.
## 6. System requirements
**Buyer runtime**: Win 10/11 64-bit · macOS 11+ · Linux glibc 2020+ · modern browser · ~400-500 MB disk · no internet.
**Developer**: Python 3.10+ · PyInstaller · Inno Setup (Win) · Xcode CLT (macOS) · Apple Developer Program $99/yr · Git + GitHub.
## 7. Error handling (`src/core/errors.py`)
Structured hierarchy for friendly messages + maintainable trace context:
```
streamlit>=1.30
pandas
openpyxl
numpy
typer
rapidfuzz
charset-normalizer
loguru
tqdm
reportlab
pyarrow # Streamlit dependency, declare explicitly for PyInstaller clarity
altair # Streamlit dependency, declare explicitly for PyInstaller clarity
DataToolsError # base; carries path/column/operation/suggestion/cause
InputValidationError(ValueError) # bad arg / wrong type
ConfigError(ValueError) # bad config / options
FileFormatError(ValueError) # file isn't what we expected
FileAccessError(OSError) # I/O failure (perms, disk, missing)
```
---
**Subclassing rule**: every subclass extends a stdlib base (`ValueError` or `OSError`) so existing `except OSError` / `except ValueError` handlers still catch them.
## 5. Coding Standards
**Helpers**:
- `ensure_dataframe(value, function=...)` — uniform DataFrame guard at every public entry.
- `ensure_choice(value, name=, choices=)` — uniform enum/literal guard.
- `wrap_file_read(path, op, exc)` / `wrap_file_write(...)` — tag OSError with file path + Windows-aware permission tip.
- `format_for_user(exc, context=)` — single string for `st.error()` / CLI stderr.
### 5.1 Code Standards
GUI / CLI handlers use `format_for_user()` so the user always sees: file path, operation, underlying error class, recovery suggestion.
- PEP 8 + type hints on all public functions.
- Docstrings on every module and public function.
- `--help` output (CLI) that a non-technical user can act on.
- Graceful error handling with human-readable messages, not stack traces. Errors must name the problem AND the likely fix where possible (e.g., "Column 'email' not found. Available columns: name, phone. Did you mean 'phone'?").
- All file paths handled via `pathlib.Path`, never string concatenation. Cross-platform correctness depends on this.
- All file I/O explicitly UTF-8-aware: detect encoding on input (charset-normalizer), write UTF-8 on output. Windows defaults to cp1252 otherwise.
- No platform-specific shell calls. If absolutely needed, branch on `sys.platform`.
- Unit tests for core logic (pytest). Tests target `core/`, not UI front-ends. Tests run on all three OS runners in CI.
- Semantic versioning per bundle.
- **Core/UI separation**: never put business logic in `cli.py` or `gui/`. If a CLI command and a GUI button do "the same thing," they call the same function in `core/`.
## 8. Per-bundle status
### 5.2 UX Standards (GUI / Streamlit) - load-bearing per DECISIONS.md Section 4b
| Bundle | Status |
|--------|--------|
| Data Cleaning Mastery | 3/9 tools Ready (Dedup, Text Cleaner, Format Standardizer); 6 stubs |
| Automated Business Reporting | Not started |
| Ecommerce Data Pipeline | Not started |
| Small Business Finance | Not started |
| Marketing Public Data Aggregation | Not started |
| AI Ecommerce Aggregation (Shopify Pet) | Not started |
- **Works out of the box**: dropping a file into the Streamlit `st.file_uploader` must produce a useful result with zero configuration.
- **Sensible defaults visible everywhere**: every `st.selectbox`, `st.slider`, `st.checkbox` has a default, the default is shown, the user is not forced through a config screen on first run.
- **Progressive disclosure**: basic view shows file uploader + go button + results. Advanced options live in `st.expander("Advanced options")` panes.
- **Plain-English labels**: no technical jargon in primary UI. `help=` parameter on widgets carries technical detail for users who want it.
- **Dry-run / preview by default**: user sees what would change before any file is written. Original input is never modified.
- **Single-page completion**: basic task completes on a single Streamlit page. Multi-page apps are for separate scripts in the bundle, not for multi-step wizards within one script.
- **Identical core to CLI**: any capability available in CLI is available in GUI, and vice versa. The only legitimate divergence is interactive review (GUI-natural) and scripted/scheduled execution (CLI-natural).
- **Local-first messaging**: every GUI page includes the line *"This tool runs locally in your browser and does not use the internet"* in a small, persistent location (e.g., footer or sidebar).
## 9. Open decisions
### 5.3 Functional Scope Standard - load-bearing per DECISIONS.md Section 4a
- **pywebview wrap** — defer until support tickets show browser-launch confusion.
- **Win code signing** — defer until SmartScreen drives volume. Cost ~$200-400/yr.
- **Auto-update mechanism** — none at launch. Email-delivered updates. Revisit at 100+ buyers/bundle.
- **Demo hosting migration** — Streamlit Community Cloud → $5/mo VPS if rate/brand limits hit.
- **Code obfuscation** — none; license text + bundle complexity sufficient at $49-79.
- **Telemetry** — none. Consider opt-in privacy-respecting only post-launch.
- Each script ships with **complete coverage of the workflow it names**, including features available free elsewhere (e.g., exact-match dedup).
- Scope boundary is the workflow, not "things adjacent to the workflow." A deduplicator includes normalization, survivor selection, audit. It does not include format conversion or charting; those belong elsewhere in the bundle.
## 10. Script boundaries — 04 (Missing Values) vs 06 (Outliers)
---
Deliberately separate. Confluent original spec was wrong.
## 6. System Requirements
| Script | Owns |
|--------|------|
| 04 Missing Value Handler | "What's not there." Disguised nulls (`N/A`, `-`, sentinel codes), missingness patterns, imputation, drop-by-threshold. |
| 06 Outlier Detector | "What shouldn't be there." z-score / IQR / modified-z, multivariate (Isolation Forest, Mahalanobis), domain rules, winsorization. |
**For buyers (runtime)**:
- Windows: Windows 10 or 11, 64-bit.
- macOS: macOS 11 (Big Sur) or later, Apple Silicon or Intel.
- Linux: any glibc-based distribution from 2020 onward (Ubuntu 20.04+, Fedora 33+, etc.).
- A modern default browser (Chrome, Edge, Firefox, Safari from the last 3 years). Used to display the local GUI; no internet required.
- ~400-500 MB free disk space (Streamlit packaging is larger than alternatives; this is an accepted tradeoff per DECISIONS.md Section 4c).
- No internet required after install. No Python install required ever.
**Run order**: 04 before 06. Outlier stats on data with `NaN` / sentinels are mathematically poisoned (means dragged, IQR widens, false negatives).
**For developers (you)**:
- Python 3.11+.
- PyInstaller, Inno Setup (Windows builds), Xcode command line tools (macOS builds).
- Apple Developer Program membership ($99/yr) for macOS distribution.
- Git + GitHub account (for CI builds and Streamlit Community Cloud deployment of demos).
**Pipeline order** (Pipeline Runner enforces): 02 → 03 → 04 → 05 → 06 → 07 → 08. 01 is order-flexible.
---
**Contested cases**:
- Whitespace-only cell — 02 trims to empty; 04 then flags empty as null.
- `-999` sentinel — 04 converts to `NaN` first; 06 then computes stats.
- Suspicious-but-plausible (age 110) — 06 territory.
## 7. Per-Bundle Technical Notes
## 11. Per-script functional specs
| Bundle | Status | Tech notes |
|---|---|---|
| Data Cleaning Mastery | Lead, 1/9 scripts complete (CLI only; needs Streamlit GUI port) | Cleaning, dedup, text hygiene, standardization, missing-value handling, outlier detection, type coercion, reporting. Scripts 04 (missing values) and 06 (outliers) are deliberately separate concerns; 04 runs first to neutralize sentinel codes before 06 computes statistics (see Section 9). Script 02 (text cleaner) runs first in the pipeline to normalize whitespace and special characters before any other operation. v1.3 spec: Streamlit GUI required at launch, with hosted demo deployed to Streamlit Community Cloud. |
| Automated Business Reporting | Not started | Aggregation -> styled PDF/Excel output |
| Ecommerce Data Pipeline | Not started | Extract -> clean -> export workflow |
| Small Business Finance | Not started | Bookkeeping summaries, simple reconciliation |
| Marketing Public Data Aggregation | Not started | Public API + scraping with respect for robots.txt and ToS |
| AI Ecommerce Aggregation (Shopify Pet) | Not started | Optional LLM enhancement, requires API key from buyer |
Specs live in this section as scripts enter active build. Each follows the Tier 1/2/3 structure with explicit strategic framing (what's the market gap given some of this is free elsewhere).
---
### 11.1 `01_deduplicator.py` — Smart duplicate removal
## 8. Open Technical Decisions
**Status**: Ready. Tier 1 mostly built. Streamlit GUI port complete.
GUI framework choice is now **closed** (Streamlit, locked v1.3 - see DECISIONS.md Section 4c).
**Market gap**: fuzzy match quality of OpenRefine, with the zero-learning UX of Excel, sold once for under $100, runs locally.
Remaining open items:
**Tier 1**:
- **Input**: auto-detect encoding (UTF-8, UTF-8-BOM, Latin-1, cp1252) · delimiter · header row · CSV/TSV/XLSX/XLS · multi-sheet picker · streaming for files > RAM.
- **Matching**: exact + 3 fuzzy algos (Levenshtein / Jaro-Winkler / token-set) · per-column normalizers (5 types) · configurable threshold per strategy · multi-strategy OR.
- **Survivor**: keep first / last / most-complete / most-recent · merge mode (fill blanks from losers).
- **Trust**: dry-run preview by default · interactive review for gray-zone matches · confidence score per match · match-group export.
- **Audit**: timestamped log · removed-rows separate file · input never modified · idempotent.
- **Config**: save/load JSON profiles · sensible auto-detect defaults.
- **UX**: human `--help` · progress bar > 10k rows · errors name row + column + value + suggestion.
- **pywebview wrap of Streamlit launcher**: Optional v1.1 enhancement to eliminate the "browser opens" UX surprise. Defer until support tickets show meaningful buyer confusion. Cost: extra dependency, more PyInstaller complexity. Benefit: native-window UX.
- **Windows code signing**: Currently unsigned. Revisit if SmartScreen warnings drive support volume. Cost: ~$200-400/yr.
- **Auto-update mechanism**: None at launch. Email-delivered version updates. Revisit at 100+ paying customers per bundle.
- **Demo deployment hosting**: Streamlit Community Cloud at launch (free). Migrate to $5/mo VPS if rate limits, bandwidth, or branding constraints become an issue.
- **Code obfuscation**: Currently relying on license text + PyInstaller bundling. Decompilation is possible but unlikely for $49-79 products. Accept the risk.
- **Telemetry**: None. Consider privacy-respecting opt-in usage telemetry post-launch to inform roadmap, but only if explicit and disclosed.
**Tier 2**: numeric/date tolerance · phonetic match (Soundex, Metaphone) · blocking/indexing · watch-folder.
---
**Tier 3**: ML scoring · cross-file dedup · cron · Shopify/Klaviyo API direct.
## 9. Script Boundaries: 04 (Missing Values) vs 06 (Outliers)
### 11.2 `02_text_cleaner.py` — Character-level hygiene
The two scripts are deliberately separate. Original spec ("missing-value handler also does basic outlier flagging") was wrong: it conflated two different statistical operations and would have produced overlapping CLI flags, confused buyers, and brittle code.
**Status**: Ready. Tier 1 built.
### 9.1 Boundary
**Market gap**: one-click correctness for the dirty-CSV failure modes that cause silent VLOOKUP misses.
**`04_missing_value_handler.py` owns "what's not there"**:
- Detect disguised nulls: `NaN`, empty string, `"N/A"`, `"-"`, `"unknown"`, whitespace-only, sentinel codes (`-999`, `9999`, etc.).
- Missingness pattern analysis (which columns co-miss).
- Imputation strategies: mean, median, mode, forward-fill, KNN (optional), regression (optional).
- Required-field enforcement (drop rows missing a required column).
- Drop rows or columns by missingness threshold.
**Boundary**:
- 02 — whitespace, Unicode normalize, smart-char fold, BOM, line endings, zero-width, control chars, case ops. Writes to disk.
- 03 — dates, currencies, names, phones, addresses (display formatting).
- 04 — disguised nulls.
- 01 — `normalize_string` is *match-time* only, distinct from 02's *write-time* policy.
**`06_outlier_detector.py` owns "what shouldn't be there"**:
- Univariate statistical detection: z-score, IQR, modified z-score (MAD-based).
- Multivariate detection: Isolation Forest, Mahalanobis distance.
- Domain-rule violations (age > 120, negative quantity, future dates in historical data).
- Winsorization / capping as optional remediation.
- Distribution shape diagnostics.
**Tier 1 ops** (each toggleable; defaults shown for `excel-hygiene`):
1. Trim leading/trailing whitespace — ON
2. Collapse internal whitespace runs — ON
3. NFC normalize — ON
4. NFKC compatibility fold — OFF (lossy, opt-in via `paranoid` preset)
5. Smart-char fold (curly quotes, em/en-dash, NBSP, ellipsis) — ON
6. Zero-width / invisible char strip — ON
7. BOM strip — ON
8. Control-char strip (preserve `\t\n\r`) — ON
9. Line-ending normalize (CRLF/CR → LF inside cells) — ON
10. Case conversion (UPPER / lower / Title / Sentence) — OFF, per-column
### 9.2 Run Order
**Scope**: per-column selection · skip-list · operates on string-typed columns only.
04 must run before 06. Reason: outlier statistics computed on data still containing `NaN` or sentinel codes are mathematically poisoned. Means and standard deviations get dragged, IQR widens, false negatives explode.
**Trust**: dry-run by default · per-cell change log (capped 1000, `--full-changelog` removes cap) · 3 output files mirroring dedup · idempotent.
The Master Orchestrator (script 09) enforces this order. CLI users running scripts manually get a warning printed by 06 if it detects unhandled sentinel patterns in the input.
**Config**: 3 presets (`minimal` / `excel-hygiene` (default) / `paranoid`) · save/load JSON.
Pipeline-wide order enforced by the orchestrator: `02_text_cleaner``03_format_standardizer``04_missing_value_handler``05_column_mapper_enforcer``06_outlier_detector``07_multi_file_merger``08_validator_reporter`. Script `01_deduplicator` is order-flexible; it normalizes whitespace and case internally for matching purposes regardless of upstream cleaning, so it can run before or after `02_text_cleaner` depending on the buyer's workflow.
### 11.3 `03_format_standardizer.py` — Per-domain canonical forms
### 9.3 Contested Cases
**Status**: Ready. Full Tier 1 + most Tier 2 built. 199-row buyer corpus passing.
**Use cases that prove 04 and 06 are distinct concerns** (not just naming differences):
**Market gap**: unify dates / phones / emails / addresses / names / currencies / booleans across messy ETL inputs without buyer writing code.
- *Bank export with blank fee columns*: pure 04 job. The fees aren't outliers, they're missing. Imputation or drop-by-threshold is the right tool.
- *Sales data with one $1M order in a $50-average column*: pure 06 job. Nothing is missing; one row is statistically extreme. Z-score or IQR catches it.
- *Survey data where `999` means "refused to answer"*: needs both, in order. 04 converts `999` to `NaN` per `--sentinels`, then 06 computes statistics on the cleaned column.
**Domains**:
| Domain | Default canonical | Notable handling |
|--------|-------------------|------------------|
| Date | ISO 8601 (`YYYY-MM-DD`) | MDY/DMY, Excel serial, Unix timestamp (s + ms), longform months, year-month, quarter notation, French/German/Spanish month dictionaries (opt-in), buried-date regex, error sentinels for invalid dates |
| Phone | E.164 + `;ext=N` | libphonenumber, 001 international prefix handling, error sentinels for placeholders / multi-number / contamination |
| Email | lowercase + trim | display-name extraction, mailto/angle-bracket strip, smart-quote unwrap, optional `--gmail-canonical` mode |
| Address | USPS-canonical (`expand=False`) or expanded (`expand=True`) | state-name → 2-letter, multi-line collapse, PO Box normalize, state-code preservation regardless of input case |
| Name | smart Title Case | Mc/Mac/O'/D' inner caps, hyphen segments, particle lowercasing (von/van/de/da), comma-format reversal, period stripping for titles/suffixes/initials, PhD/MD acronym preservation, conservative mode |
| Currency | bare number (dot decimal) | auto-detect EU vs US separators, space-thousands, Swiss apostrophe, accounting parens, optional ISO code preservation |
| Boolean | `True`/`False` (configurable) | accepts `yes`/`no`/`y`/`n`/`1`/`0`/`on`/`off` |
Sentinel values like `-999` are *both* disguised missing *and* statistical outliers. Resolution: 04 owns sentinel detection and converts them to `NaN` (or imputes per user choice) before 06 sees the data. Sentinel patterns are configurable in 04 via `--sentinels` flag.
**Per-domain `error_policy`**: `"passthrough"` (default) keeps the original; `"sentinel"` emits `<error: <reason>>` for cases like Feb 30, double @, percentages mistaken for currency, etc.
Suspicious-but-plausible values (e.g., age = 110): 06's territory. Not missing; just rare.
**Pipeline**: `standardize_dataframe(df, options)` runs per-column with `column_types: dict[str, FieldType]`. Returns `StandardizeResult` with `cells_changed`, `cells_unparseable`, change audit. Warns when > 10% of typed cells fail to parse.
Whitespace-only cells (e.g., `" "`) are a contested case between 02 (text cleaner) and 04 (missing value handler). Resolution: 02 trims first, leaving an empty string. 04 then detects empty strings as disguised nulls per its existing logic. This means 02 must run before 04 in any pipeline that uses both. The orchestrator enforces this; CLI users get a warning if 04 detects whitespace-only cells suggesting 02 was skipped.
**Presets**: `us-default`, `european`, `uk`, `iso-strict`, `legacy-us`. Custom abbreviations via `extra_abbreviations`.
### 9.4 Shared Plumbing
### 11.4 Upload-time analyzer (`src/core/analyze.py`)
Both scripts emit:
- A flagged-row report with column, row index, original value, action taken.
- A timestamped log file in `logs/`.
- An optional cleaned output file.
Read-only advisory pass on every upload. Emits `Finding` objects:
Report and log formats are identical between the two scripts. Implemented via shared helpers in `src/core/` to avoid drift.
| Field | Meaning |
|-------|---------|
| `id` | Stable identifier (never localized) |
| `severity` | `info` / `warn` / `error` (only `error` blocks gate) |
| `confidence` | `high` (round-trip safe) / `medium` (preview) / `low` (heuristic) |
| `fix_action` | id of algorithm in `fixes.py` (empty for informational-only) |
| `pre_applied` | `true` if fix already ran during read pass |
| `tool` | owning tool id (or empty for file-level) |
| `count` | cells / rows affected |
| `description` | one-sentence human summary |
| `column` | column name (None for file-level) |
| `samples` | up to 5 `(row, col, value)` examples |
---
Entry point: `analyze(source, *, sample_rows=1000, repair_result=None, encoding_override=None)`. `encoding_override` skips charset detection — the hook that lets the Review page recover from misdetections.
## 10. Per-Script Functional Requirements
### 11.5 CSV-normalization gate (`src/core/normalize.py`, `fixes.py`)
This section captures the full functional spec for each script, beyond the one-line description in USER-GUIDE.md Section 2. Specs answer "what does v1 need to ship to be best-of-class for the target buyer." Promoted from chat-history-only into docs in v1.6 to prevent silent drift.
Two paths:
1. **Auto-fix**`auto_fix(df, findings)` applies every `confidence="high"` finding whose `fix_action` is registered.
2. **Per-finding decisions**`apply_decisions(df, findings, decisions)` accepts `Decision(finding_id, action, payload)` with action `"auto"|"skip"|"modified"`.
**Note on script status**: a script labeled "Working" in the bundle status table means it has CLI execution and basic correctness, NOT that it implements every Tier 1 item below. Tier 1 is the v1 launch target; the current code may implement a subset.
Returns `NormalizationResult` with `cleaned_df`, `cleaned_bytes` (UTF-8 CSV), `applied`, `skipped_findings`, `pending_findings`, `blocking_findings`.
### 10.1 `01_deduplicator.py` - Smart duplicate removal
`is_normalized(findings, result)` re-runs `analyze()` against cleaned bytes; returns False if any high-confidence detector still fires (the strict contract tool pages depend on).
**Current implementation status**: `01_deduplicator.py` exists and works for exact match plus basic fuzzy with configurable subset columns and timestamped logs (the description in USER-GUIDE.md Section 2 reflects current state). It does NOT yet implement most Tier 1 items below. Tier 1 is the v1 launch target, not current state. The Streamlit GUI port is the natural moment to close this gap.
**Fix registry**: `@register("fix_id")` decorates `(df, payload) → (new_df, n_cells_changed)`. New fix = one entry in `analyze.py` `FIX_*` constants + one detector emitting that `fix_action` + one registered function. No other call sites change.
**Strategic framing**: Excel's built-in Remove Duplicates handles exact match for free. Pandas `drop_duplicates()` handles it for free in code. A $49-$79 dedup tool that ships "exact + basic fuzzy" loses to Excel for free or to OpenRefine for free. The fuzzy matching has to be the product, not a checkbox. The market gap this script targets is "fuzzy match quality of OpenRefine, with the zero-learning-curve UX of Excel, sold once for under $100, runs locally" (see BUSINESS.md Section 4a).
### 11.6 Review page (`src/gui/pages/0_Review.py`)
#### Tier 1: Must-ship for v1 to be best-of-class
1. Detected encoding + override picker (16 codepages + custom).
2. One expandable card per finding (sorted by severity then confidence) with: decision radio (Auto/Skip/Customize), live before/after preview built by running the registered fix on `Finding.samples`, payload editor for fixes that take user input.
3. Apply persists `NormalizationResult` keyed by upload SHA-256; tool pages refuse to load until hash matches.
4. `⚙️ Advanced output options` expander: per-download encoding + delimiter + line terminator. `_build_output_bytes()` returns `(bytes, error_message)`; lossy fallbacks emit a warning the page surfaces.
**Input handling**
1. Auto-detect file encoding (UTF-8, UTF-8-BOM, Latin-1, Windows-1252). Failure to handle this correctly is the #1 reason CSV tools crash on real-world business data.
2. Auto-detect delimiter (comma, tab, semicolon, pipe).
3. Read CSV, TSV, XLSX, XLS. For XLSX, support multi-sheet workbooks (let user pick or process each).
4. Handle files larger than RAM via chunked / streaming processing. A 500MB customer export should not crash the tool.
5. Header row detection with manual override.
Gates the entire tool sidebar via `require_normalization_gate()` in `src/gui/components/_legacy.py`.
**Matching**
6. Exact match with configurable subset columns.
7. Fuzzy match algorithms: Levenshtein, Jaro-Winkler, token-set ratio (rapidfuzz library). Multiple algorithms, not one. Different data types match better with different algorithms.
8. Per-column normalization before comparison:
- Email: lowercase, strip whitespace, strip Gmail dots, strip `+tag` suffixes.
- Phone: strip formatting and country codes, normalize to E.164.
- Name: strip titles (Mr/Ms/Dr), strip suffixes (Jr/III), collapse whitespace, optional case-fold.
- Address: USPS-style abbreviation normalization (St/Street, Ave/Avenue, Apt/#).
- Generic string: trim, collapse internal whitespace, optional case-fold.
9. Configurable similarity threshold (e.g., 85%, 90%, 95%) per match strategy.
10. Multi-strategy matching with OR logic: "match if email is exact OR (name fuzzy >90% AND phone exact)." This is what real-world dedup actually requires. Single-strategy match handles maybe 40% of cases.
### 11.7 Pre-parse repair (`src/core/io.py::repair_bytes`)
**Survivor selection (which row to keep when duplicates are found)**
11. Configurable survivor rules: keep first, keep last, keep most-complete (fewest empty cells), keep most-recent (date column), keep manually-selected.
12. Merge mode: instead of deleting losers, fill missing fields in survivor from losers. Common real ask: combine partial records into one complete record.
Byte-level pre-parse pass. **Order is meaningful**:
**Trust and review**
13. Dry-run / preview mode by default. Output shows what *would* be merged before any file is written. Non-negotiable for trust. Aligns with Section 5.2 visible-safety standard.
14. Interactive review mode for uncertain matches. For matches in the gray zone (e.g., 75-90% similarity), prompt user yes/no/skip with side-by-side diff. This is what justifies a paid product over free Excel. GUI-natural; CLI gets a reduced-form prompt loop.
15. Confidence score on every fuzzy match in the output.
16. Match group export: separate file showing every group of matched rows so user can audit.
**Audit and safety**
17. Full timestamped log of every action: which rows matched, on which strategy, with what score, which row survived, which fields were merged.
18. Removed-duplicates exported to a separate file (never silently destroyed).
19. Original input file is never modified. Output is always a new file.
20. Idempotency: running the tool twice on the same input with the same config produces the same output.
**Configuration**
21. Save / load configuration profiles. A user who deduplicates a Shopify customer export weekly should configure once, not every run.
22. Sensible defaults that work on a typical messy CSV with zero configuration. The first run must produce a useful result with no flags. Per DECISIONS.md Section 4b UX standards.
**UX**
23. `--help` (CLI) written for non-technical users with concrete examples, not a flag list.
24. Progress bar for files over ~10K rows.
25. Error messages name the row number, column, and value that caused the problem. No raw stack traces. Per Section 5.1.
26. Sample data (`samples/messy_sales.csv`) must demonstrate fuzzy match scenarios, not just exact dupes. Otherwise the demo doesn't sell.
#### Tier 2: Worth-considering for v1.1
27. Numeric tolerance for matching (prices within $0.01 considered same).
28. Date tolerance for matching (dates within N days considered same).
29. Phonetic matching (Soundex, Metaphone) for name fields with common misspellings.
30. Blocking / indexing for performance on large files (compare only rows that share a first letter or first three characters of a key field). Without this, fuzzy match on 100K rows is O(n²) and unusable. Move to Tier 1 if early buyers report performance complaints.
31. Watch-folder mode: auto-process any file dropped into a folder.
32. Geolocation-aware address matching (optional, requires bundled USPS data or third-party API).
#### Tier 3: Optional / later
33. Machine-learning-based match scoring (Dedupe.io territory; high complexity, marginal gain for this price point).
34. Multi-table joins for cross-file dedup.
35. Schedule / cron integration.
36. Direct Shopify / Klaviyo / Mailchimp API integration to dedupe in place. This would be a real differentiator for the Shopify niche specifically and is probably the right v2 direction if early sales validate the niche.
### 10.2 `02_text_cleaner.py` - Character-level hygiene
**Current implementation status**: Stub only. `src/gui/pages/2_Text_Cleaner.py` is a placeholder UI with disabled controls. No `src/core/text_clean.py`, no CLI, no tests. Tier 1 below is the v1 launch target; nothing in this section is built yet.
**Strategic framing**: Excel and the OS provide effectively nothing here. Find/Replace fixes one character at a time. Power Query's "Clean" strips control chars but ignores BOMs, smart quotes, NBSPs, and zero-width chars. OpenRefine has the operations buried under "Common transforms" where the buyer never finds them. Pandas users `df.applymap(str.strip)` and miss everything else.
The market gap this script fills: **one-click correctness for the dirty-CSV failure modes that cause "why won't this VLOOKUP match?"** Trailing spaces, NBSP-in-place-of-space, smart quotes pasted from Word, mojibake, BOMs from Excel's "Save As CSV UTF-8". The buyer doesn't know they need this script until it fixes a problem they have spent two hours on. Demo value is high: the before/after diff sells itself.
**Boundary clarification** (cross-references Section 9):
- 02 owns whitespace, Unicode normalization, smart-character folding, BOM strip, line-ending normalization, zero-width strip, control-char strip, case ops. Writes cleaned values back to disk.
- 03 (format standardizer) owns dates, currencies, names, phones, addresses.
- 04 (missing values) owns disguised nulls (`N/A`, `-`, `unknown`, sentinel codes). Whitespace-only cells: 02 trims first to empty string; 04 then detects empty as null (per Section 9.3).
- 01 (deduplicator) has its own `normalize_string` helper for *match-time* case-folding. That is a match-time policy and stays distinct from 02's *write-time* policy. The two will not be merged; 02 may use lower-level helpers but does not aggressively case-fold cleaned output by default.
#### Tier 1: Must-ship for v1 to be best-of-class
**Operations** (each independently toggleable; defaults given for the `excel-hygiene` preset)
1. Whitespace trim - leading/trailing on every cell. Default ON.
2. Internal whitespace collapse - multi-space and tabs-in-cells to single space. Default ON.
3. Unicode NFC normalization - combining-character forms folded to canonical (e.g., `e + U+0301` to single `é`). Default ON.
4. Unicode NFKC normalization - compat fold (`①` to `1`, `fi` to `fi`). Default OFF, lossy, opt-in only. Not part of any preset other than `paranoid`.
5. Smart-character folding - curly quotes to ASCII, em/en-dash to hyphen, ellipsis `…` to `...`, NBSP `U+00A0` to space. Default ON.
6. Zero-width / invisible character strip - `U+200B`, `U+200C`, `U+200D`, `U+2060`, mid-string `U+FEFF`. Default ON.
7. BOM strip - `U+FEFF` at the start of the first cell of the first column (covers the case where the I/O layer didn't catch it). Default ON.
8. Control character strip - `U+0000`-`U+001F` and `U+007F`, *except* preserve `\t`, `\n`, `\r`. Default ON.
9. Line-ending normalization - within multi-line cells, `\r\n` and bare `\r` to `\n`. Default ON.
10. Case conversion - UPPER / lower / Title / Sentence. Default OFF, per-column. Title case is "smart": preserves all-caps tokens (`USA`, `NASA`) and lowercases mid-string particles (`of`, `and`, `the`).
**Scope control**
11. Per-column selection - by default operate on string-typed columns only; numeric / datetime columns pass through untouched. User can pick columns explicitly via `--columns`.
12. Skip-list - exclude specific columns via `--skip` even if they match the string-dtype filter (e.g., free-text notes columns).
**Trust and audit**
13. Dry-run preview by default. Output shows N cells that would change in column X. `--apply` writes. Non-negotiable for trust. Same standard as the deduplicator.
14. Per-cell change log: `{input}_changes.csv` with (row, column, old, new, ops_applied). Capped to first N rows by default to avoid 50MB audit files; `--full-changelog` removes the cap.
15. Three output files on `--apply`: `{input}_cleaned.csv`, `{input}_changes.csv`, `logs/text_clean_{ts}.log`. Mirrors the deduplicator output shape.
16. Original input file is never modified.
17. Idempotency: `clean(clean(x)) == clean(x)` for every individual op and every preset. Asserted as a property test.
**Configuration**
18. Presets: `--preset excel-hygiene` (everything safe ON, NFKC OFF, case OFF), `--preset minimal` (only trim + collapse), `--preset paranoid` (everything including NFKC). Buyers should not have to learn 9 flags. Default preset when no flag given: `excel-hygiene`.
19. Save / load JSON config. Same shape and reuse pattern as `DeduplicationConfig`.
**UX**
20. `--help` written for non-technical users with concrete examples, not a flag dump. Per DECISIONS.md Section 4b.
21. Progress bar for files over ~10K rows.
22. Error messages name the row, column, and value that caused the problem. No raw stack traces.
23. Sample data (`samples/messy_text.csv`) demonstrates: smart quotes from Excel, NBSP-vs-space, BOM, mixed line endings, zero-width chars. The before/after diff is the demo.
#### Tier 2: Worth-considering for v1.1
24. Custom regex find/replace - power-user escape hatch, per-column.
25. Diacritic strip (`José` to `Jose`). Lossy; opt-in only.
26. Mojibake auto-repair - detect `é` to `é` patterns (UTF-8 read as Latin-1 then re-encoded) and fix. Standard tool: `ftfy`. Promote to Tier 1 if early buyers report this.
27. Punctuation normalization - all Unicode dash/quote/space variants folded; runs of punctuation collapsed.
28. Profile detector - scan a file and recommend which ops to enable based on what's actually present. Lowers config friction further.
#### Tier 3: Optional / later
29. Locale-aware case conversion (Turkish dotted/dotless `i`, German `ß`).
30. Custom character-class strip rules (regex-class).
31. Streaming / chunked write for very large files (defer until a buyer reports it).
#### Open decisions captured at spec time
- Smart-character folding default ON in `excel-hygiene` accepted as the right tradeoff: highest-impact use case, dry-run preview makes the change visible before commit.
- NFKC stays Tier 1 but OFF by default and excluded from `excel-hygiene`. Lossy by design.
- CLI surface: separate `src/cli_text_clean.py` module, matching the "one CLI binary per script on PATH" pattern in Section 3.2. Not a subcommand on the existing dedup Typer app.
- `ftfy` dependency deferred to Tier 2 (~5MB). Revisit if mojibake reports come in.
### 10.2.1 Upload-time analyzer (`src/core/analyze.py`)
The analyzer is a read-only, advisory pass that runs on every uploaded file before any tool page sees it. It produces a list of `Finding` objects, each carrying:
| Field | Type | Meaning |
|---|---|---|
| `id` | str | Stable identifier (`smart_punctuation_in_data`, `mixed_line_endings`, …). Never localized. |
| `severity` | `info` / `warn` / `error` | UX urgency. `error` is the only level that blocks the gate. |
| `confidence` | `high` / `medium` / `low` | Auto-fixability. **High** is round-trip safe, **medium** has known false-positive shapes, **low** is heuristic and opt-in. |
| `fix_action` | str | Stable id naming the algorithm in `src/core/fixes.py` that resolves this finding. Empty for informational-only findings. |
| `pre_applied` | bool | True when the fix already ran during the read pass (BOM strip, NUL strip, byte-level smart-quote fold). The gate treats these as already-resolved. |
| `tool` | str | Tool id that owns this concern (`02_text_cleaner`, `04_missing_handler`). Empty for file-level findings. |
| `count` | int | Cells / rows affected. |
| `description` | str | One-sentence human summary (banners, tooltips). |
| `column` | str / None | Column name when scoped to one column. |
| `samples` | list[(row, col, value)] | Up to 5 examples for the GUI to render. |
`analyze(source, *, sample_rows=1000, repair_result=None, encoding_override=None)` is the public entry point. `source` is a DataFrame or a path; `encoding_override` skips charset detection and uses the user's chosen codepage instead — this is the hook that lets the Review page recover from misdetections (cp1252-vs-cp1250 ambiguity, KOI8-R surfacing as Shift_JIS).
### 10.2.2 CSV-normalization gate (`src/core/normalize.py`, `src/core/fixes.py`)
A file enters tool pages only after passing the gate. The gate has two paths:
1. **Auto-fix**`auto_fix(df, findings)` applies every `confidence="high"` finding whose `fix_action` is registered in `fixes.py`.
2. **Per-finding decisions**`apply_decisions(df, findings, decisions)` accepts an explicit list of `Decision(finding_id, action, payload)` where action is `"auto" | "skip" | "modified"`.
Output is a `NormalizationResult` with:
- `cleaned_df` — the DataFrame after every applied fix.
- `cleaned_bytes` — UTF-8 CSV serialization for the download.
- `applied`, `skipped_findings`, `pending_findings`, `blocking_findings` — audit log + gate status.
`is_normalized(findings, result)` re-runs `analyze()` against the cleaned bytes and returns False if any high-confidence detector still fires — that's the strict contract tool pages depend on.
`fixes.py` is a registry: `@register("fix_id")` decorates a `(df, payload) -> (new_df, n_cells_changed)` function. Adding a new fix means appending one entry to `analyze.py`'s `FIX_*` constants, one detector that emits a Finding with that `fix_action`, and one registered function in `fixes.py`. No other call sites change.
### 10.2.3 Review page (`src/gui/pages/0_Review.py`)
Streamlit page that orchestrates the gate visually. Gates the entire tool sidebar via `require_normalization_gate()` in `src/gui/components.py`, which every tool page calls right after `hide_streamlit_chrome()`.
The page:
1. Surfaces the detected encoding plus an override picker (16 common codepages + custom-text fallback).
2. Renders one expandable card per finding, sorted by severity then confidence, with a decision radio (Auto / Skip / Customize), a live before/after preview built by running the registered fix on each `Finding.samples` value, and a payload editor for fixes that take user input (e.g. custom null-sentinel list for `replace_null_sentinels`).
3. Apply button persists a `NormalizationResult` keyed by upload SHA-256; tool pages refuse to load until the hash matches.
4. After apply, an `⚙️ Advanced output options` expander offers per-download encoding, delimiter, and line-terminator selection. The helper `_build_output_bytes(df, *, encoding, delimiter, line_terminator)` returns `(bytes, error_message)` — when the chosen encoding can't represent a character, falls back to `errors="replace"` and returns a warning the page surfaces.
### 10.2.4 Pre-parse repair (`src/core/io.py::repair_bytes`)
Byte-level pre-parse pass. Order is meaningful and each step is independently toggleable:
1. **Wide-encoding transcode** — UTF-16/UTF-32 → UTF-8. Has to run first because the byte-level NUL strip below would shred UTF-16 data (UTF-16 ASCII chars carry NUL as half of every 16-bit unit). Records `transcode_to_utf8` audit action; the analyzer surfaces it as a `csv_transcoded_to_utf8` info finding.
1. **Wide-encoding transcode** (UTF-16/32 → UTF-8) — must run first or NUL strip below shreds UTF-16.
2. **UTF-8 BOM strip** (file start only).
3. **NUL strip** — only meaningful after step 1, so genuine corruption (truncated C strings, half-binary exports) rather than encoding artifacts.
4. **Line-ending normalize** — CRLF and bare CR → LF. Bare CR confuses the C parser; the text-cleaner contract also calls for LF inside multi-line cells.
5. **Byte-level smart-quote fold** — curly / guillemet / double-prime → ASCII `"`. Only structural double-quote-equivalents; single curly quotes are deferred to the cell-level cleaner.
6. **Per-row delimiter repair** — when one row has +1 field and the merge candidate is currency-shaped (`$1,500.00` etc.), merge and quote.
3. **NUL strip** — only meaningful after step 1, so flags genuine corruption.
4. **Line-ending normalize** — CRLF + bare CR → LF.
5. **Byte-level smart-quote fold** — curly / guillemet / double-prime → ASCII `"` (only structural double-quote-equivalents; single curlies deferred to cell-level).
6. **Per-row delimiter repair** — when a row has +1 field and merge candidate is currency-shaped (`$1,500.00`), merge + quote.
`detect_encoding()` tries strict UTF-8 first and returns `"utf-8"` if the bytes decode cleanly. This was added because charset-normalizer fingerprints small files dominated by short non-ASCII sequences (e.g. zero-width chars at U+200B-class) as `mac_latin2` but if the bytes are valid UTF-8, that's the right answer regardless of label.
### 10.3 - 10.9 (Future)
Functional specs for scripts 03 through 09 to be added when each script enters active build. The deduplicator (10.1) and text cleaner (10.2) specs are the template; specs for other scripts should follow the same Tier 1 / Tier 2 / Tier 3 structure with explicit strategic framing (what's the market gap this script fills, given that some of its functionality is available free elsewhere).
`detect_encoding()` tries strict UTF-8 first — charset-normalizer mislabels short-non-ASCII files as `mac_latin2`, but valid UTF-8 bytes mean UTF-8 regardless of label.

View File

@@ -1,208 +1,118 @@
# USER-GUIDE.md - Excel & CSV Data Cleaning Mastery Bundle
# User Guide
**Version**: 1.6
**Last updated**: April 28, 2026
**Version**: 1.6 · **Updated**: 2026-05-01
Thank you for purchasing the Data Cleaning Mastery Bundle. This guide covers installation and every script included.
## 1. Install
---
You don't need Python — the bundle is self-contained.
## 1. Installation
| OS | File | How |
|----|------|-----|
| Windows | `BundleName-Setup-1.0.exe` | Double-click installer → desktop shortcut. |
| macOS | `BundleName-1.0.dmg` | Mount, drag to Applications. Signed + notarized. |
| Linux | `BundleName-1.0.AppImage` | `chmod +x`, double-click. (`.tar.gz` fallback available.) |
The bundle is fully self-contained. **You do not need to install Python.**
Launching opens your default browser to a local page (`http://localhost:8501`).
### Windows
### How the GUI works
1. Download `BundleName-Setup-1.0.exe` from your purchase email.
2. Double-click the installer.
3. Follow the wizard. The installer creates a desktop shortcut named "Launch Bundle" and an entry in Start Menu.
4. Launch via the desktop shortcut. Your default browser will open to a local page (typically `http://localhost:8501`) where the data tool runs.
- Runs locally on your machine. **No internet, no upload.**
- Browser is just the display surface. Closing it stops the underlying program.
- Prefer the terminal? Every tool ships with a CLI too (Section 3).
### macOS
### System requirements
1. Download `BundleName-1.0.dmg` from your purchase email.
2. Double-click the `.dmg` to mount it.
3. Drag the Bundle app into the Applications folder.
4. Launch from Applications, Spotlight, or Launchpad. Your default browser will open to a local page where the data tool runs.
The app is signed and notarized by Apple, so it opens cleanly with no security warnings.
### Linux
1. Download `BundleName-1.0.AppImage` from your purchase email.
2. Make it executable: `chmod +x BundleName-1.0.AppImage`
3. Double-click to run, or execute from a terminal. Your default browser will open to a local page where the data tool runs.
If AppImage doesn't work on your distribution, a `.tar.gz` fallback is available in your purchase email. Extract it and run `./run.sh` from the extracted folder.
### How the GUI works (important to know)
This tool runs in your browser **locally on your computer**. When you launch it, a small program starts a local server on your machine and opens your default browser to view it. This is normal and expected.
- **No internet is required.** Your data never leaves your computer.
- **Your data is not uploaded anywhere.** All processing happens on your machine.
- The browser is just the display surface. Closing the browser closes the GUI; the underlying program also stops.
If you prefer the command line, every script also ships as a CLI tool. See Section 3.
### Requirements
- Windows: Windows 10 or 11 (64-bit).
- macOS: macOS 11 Big Sur or later (Apple Silicon or Intel).
- Linux: any modern 64-bit distribution from 2020 onward.
- A modern default browser (Chrome, Edge, Firefox, or Safari from the last 3 years).
- Windows 10/11 (64-bit), macOS 11+, modern Linux (2020+).
- Modern browser (Chrome, Edge, Firefox, Safari, last 3 years).
- ~400-500 MB free disk space.
- Internet connection: not required.
For the full short-form numbered list of what's supported (file sizes, code pages, delimiters, performance targets, detector list, etc.), see [REQUIREMENTS.md](REQUIREMENTS.md).
Full numbered support matrix: [REQUIREMENTS.md](REQUIREMENTS.md).
---
## 2. What's included
## 2. What's Included
| # | Tool | Purpose | Status |
|---|------|---------|--------|
| 01 | Deduplicator | Exact + fuzzy match, 5 normalizers, audit | Ready |
| 02 | Text Cleaner | Whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | Format Standardizer | Dates / phones / emails / addresses / names / currencies / booleans | Ready |
| 04 | Missing Value Handler | Disguised nulls, imputation, drop-by-threshold | Coming Soon |
| 05 | Column Mapper | Rename + enforce schema | Coming Soon |
| 06 | Outlier Detector | z-score, IQR, multivariate | Coming Soon |
| 07 | Multi-File Merger | Combine multiple files | Coming Soon |
| 08 | Validator & Reporter | Rules + PDF/Excel report | Coming Soon |
| 09 | Pipeline Runner | One-click multi-tool launcher | Coming Soon |
**Scripts (in the `scripts/` folder)**:
| # | Script | Purpose | Status |
|---|---|---|---|
| 01 | `01_deduplicator.py` | Smart duplicate removal: exact match + basic fuzzy, configurable subset columns, full logs | Working |
| 02 | `02_text_cleaner.py` | Character-level hygiene: trim leading/trailing whitespace, collapse internal multi-spaces, strip non-printable characters, Unicode normalization (smart quotes, em-dashes, accents), remove zero-width characters, BOM handling, line-ending normalization, case operations | Working |
| 03 | `03_format_standardizer.py` | Standardize dates, currencies, names, phone numbers, addresses | Skeleton |
| 04 | `04_missing_value_handler.py` | Detect and handle missing values: disguised nulls (`N/A`, `-`, blanks, sentinel codes), imputation (mean/median/mode/forward-fill), required-field enforcement, drop-by-threshold | Skeleton |
| 05 | `05_column_mapper_enforcer.py` | Rename columns, enforce a target schema | Skeleton |
| 06 | `06_outlier_detector.py` | Detect and flag statistical outliers (z-score, IQR, modified z-score), multivariate detection, domain-rule violations, optional winsorization | Skeleton |
| 07 | `07_multi_file_merger.py` | Merge multiple CSV or Excel files into one | Skeleton |
| 08 | `08_validator_reporter.py` | Validate data against rules, output PDF or Excel report | Skeleton |
| 09 | `09_master_orchestrator.py` | One-click launcher menu, calls any other script | Skeleton |
**Sample data (in the `samples/` folder)**:
- `messy_sales.csv` - intentionally dirty sales data for testing.
- `bank_export.xlsx` - sample bank export for testing missing-value handling and outlier detection.
---
**Sample data** (`samples/`): `messy_sales.csv`, `bank_export.xlsx`.
## 3. Usage
You have two ways to use the bundle: the GUI (recommended for most users) or the CLI (for power users and automation).
### 3.1 GUI (recommended)
### 3.1 GUI usage (recommended)
1. Launch the bundle.
2. Pick a tool from the sidebar.
3. Drop your file (or select a sample).
4. Defaults are pre-filled — click **Run** to preview.
5. Click **Save Output** to write the cleaned file.
1. Launch the bundle via the desktop shortcut, app icon, or AppImage.
2. Your browser opens to the bundle's home page.
3. Select the script you want to use from the sidebar (Deduplicator, Format Standardizer, etc.).
4. Drop your file into the file uploader, or select from the included samples.
5. Sensible defaults are pre-filled. Click "Run" to see a preview of what the script will do.
6. Review the preview. If it looks right, click "Save Output" to write the cleaned file.
Advanced options are tucked in expander panes. The original file is never modified.
The GUI is designed to work out of the box with zero configuration. Advanced options are tucked into expandable "Advanced" panes for users who want them.
### 3.2 CLI
### 3.2 CLI usage
All scripts are also CLI tools with `--help` output.
**Basic usage** (from a terminal):
Windows (the bundle adds CLI tools to your PATH):
```
deduplicator samples\messy_sales.csv
```bash
deduplicator customers.csv [--apply]
text-cleaner messy.csv [--apply]
format-standardize feed.csv [--apply]
```
macOS / Linux:
```
deduplicator samples/messy_sales.csv
```
Get help: `deduplicator --help`. Full reference: [CLI-REFERENCE.md](CLI-REFERENCE.md).
**With options**:
### 3.3 Run order (when running tools manually)
```
deduplicator samples/messy_sales.csv --output cleaned.csv --subset email,phone
```
If you skip the Pipeline Runner, follow this order:
**Get help on any script**:
1. **02 Text Cleaner** first — normalizes whitespace + special chars.
2. **03 Format Standardizer** — dates, phones, etc. need cleaned text.
3. **04 Missing Value Handler** — sentinel codes hide as numbers.
4. **05 Column Mapper** — schema before outlier stats.
5. **06 Outlier Detector** — needs clean numerics. Stats on data with `NaN` or `-999` are mathematically poisoned.
6. **07 Multi-File Merger**, **08 Validator** as needed.
7. **01 Deduplicator** is order-flexible (normalizes internally for matching).
```
deduplicator --help
```
The Pipeline Runner enforces this automatically.
**Recommended run order**: If you are running scripts individually, run `02_text_cleaner` first to normalize whitespace and special characters, then `04_missing_value_handler` *before* `06_outlier_detector`. Outlier detection on data still containing blanks or sentinel codes (like `-999`) produces unreliable results because missing-value placeholders distort the statistics (means get dragged, IQR widens, false negatives explode). The Master Orchestrator (script 09) runs them in the correct order automatically.
## 4. Review & Normalize gate
---
Every uploaded file is scanned before any tool sees it.
## 3.3 Review & Normalize gate
**Confidence tiers**:
- **High** — round-trip safe. One-click "Auto-fix high-confidence" applies them all.
- **Medium** — usually right, occasional false positives. Preview first.
- **Low** — heuristic. Off by default; opt in per finding.
- **Error** — blocks the gate (empty file, U+FFFD, unrepairable rows).
Before any tool page accepts a file, the file passes through a **CSV-normalization gate**. The gate scans every uploaded file, surfaces every data-quality issue our analyzer can detect, and lets you choose how to handle each one before downstream tools see the data.
**Encoding override**: when the picker reports `encoding_uncertain` or you spot mojibake (`é`) or `<60>` chars, choose the right codepage at the top of the page (cp1252 for Western Excel, KOI8-R for older Russian, Big5 for traditional Chinese, …) → **Re-analyze**.
### How it works
**Advanced output**: an `⚙️` expander on the download lets you tune encoding, delimiter, and line terminator. The download filename auto-adjusts (`.tsv` for tab, `.csv` otherwise).
1. Upload a file on the home page. The analyzer scans it and counts findings by confidence tier.
2. Click any tool. If the file hasn't been normalized yet, you're redirected to the **Review & Normalize** page.
3. The page shows every finding grouped by severity and confidence, with a per-finding decision control.
## 5. Output
### Confidence tiers
Every run writes:
- **Cleaned file** next to the input (or wherever you specify).
- **Audit file** (per-cell changes for text/format tools, match groups for dedup).
- **Timestamped log** in `logs/`.
- **High** — round-trip-safe algorithmic fix (BOM strip, whitespace trim, NBSP / zero-width strip, smart-quote fold, line-ending normalize, header cleanup). One-click "Auto-fix high-confidence" applies them all.
- **Medium** — right call in the common case but with known false-positive shapes. Examples: lowercasing the email column, replacing null-like sentinels (`N/A`, `-`, `nan`), repairing unquoted-currency rows. Preview the change before applying.
- **Low** — heuristic fixes that can corrupt data when wrong. Mojibake repair (`café``café`), mixed-encoding detection. Off by default; you opt in per finding.
- **Error** — blocking. Empty file, unrepairable rows, U+FFFD replacement characters. Cannot enter the tool pages until resolved or explicitly waived.
Original input is never modified.
### Encoding override
## 6. Troubleshooting
When the analyzer reports `encoding_uncertain` or you spot mojibake (`é`) or `<60>` characters in the findings list, use the **File encoding** picker at the top of the Review page. Pick the right code page (cp1252 for Western Excel exports, KOI8-R for older Russian data, Big5 for traditional Chinese, etc.) or type a custom one, then click **Re-analyze**. Findings refresh against the corrected decode.
- **GUI won't launch / browser doesn't open** — wait 10-15 s; manually visit `http://localhost:8501`. Port-in-use error → close other instances.
- **Why does my browser open?** — local web app pattern (same as Jupyter, RStudio). Nothing leaves your machine.
- **Windows SmartScreen** — click "More info" → "Run anyway". Standard for non-EV-signed software.
- **macOS "App is damaged"** — re-download (file likely corrupted in transit).
- **Linux AppImage won't run** — `chmod +x file.AppImage`. Missing FUSE → `sudo apt install libfuse2` or use `.tar.gz`.
- **Slow on large file** — over ~100k rows takes longer; progress bar shows. Multi-million rows → use the CLI directly.
- **Need help** — email the address on your purchase receipt.
The picker is hidden for `.xlsx` files since Excel stores text as Unicode internally.
## 7. License
### Advanced output options
After applying decisions, an `⚙️ Advanced output options` expander on the download appears. Three dropdowns let you tune the output file format:
- **Encoding (code page)** — UTF-8 (default), UTF-8 with BOM (Excel-friendly), Windows-1252, Latin-1, Latin-9, cp1250, ISO-8859-2, cp1251, Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE.
- **Delimiter** — comma (default), tab, semicolon, pipe.
- **Line terminator** — LF (default), CRLF (Windows), CR.
The download filename auto-adjusts the extension (`.tsv` for tab, otherwise `.csv`). When the chosen encoding can't represent a character (Cyrillic content into cp1252, Asian script into Latin-1), the page shows a warning naming the offending character and falls back to `?` replacement so the download still works.
---
## 4. Output
Every script writes:
- A cleaned output file next to the input (or wherever you specify).
- A timestamped log file in the `logs/` folder showing what changed and why.
Reports from `validator_reporter` go to the `reports/` folder as PDF or Excel.
The GUI also displays the output preview in-browser before any file is written. The original input file is never modified.
---
## 5. Troubleshooting
**The GUI won't launch / browser doesn't open**:
1. Wait 10-15 seconds after double-clicking. The local server takes a moment to start the first time.
2. If the browser doesn't open automatically, manually visit `http://localhost:8501` in your browser.
3. If you see a "port in use" error, another program is using port 8501. Close other instances of the bundle and try again.
**"Why is my browser opening?" / "Why does this need internet?"**:
This tool runs as a local web app. The browser is just the display; nothing is uploaded, nothing leaves your computer. No internet connection is used after install. This is the same approach used by many modern data tools (Jupyter notebooks, RStudio, etc.).
**Windows: "Windows protected your PC" SmartScreen warning**:
Click "More info" then "Run anyway." This is a standard warning for software without an extended-validation Windows code signing certificate.
**macOS: "App is damaged and cannot be opened"**:
This usually indicates the download was corrupted. Re-download from the link in your purchase email.
**Linux: AppImage will not run**:
Make sure it is executable: `chmod +x BundleName-1.0.AppImage`. If it still fails, your distribution may be missing FUSE; install with `sudo apt install libfuse2` (Debian/Ubuntu) or use the `.tar.gz` fallback.
**Script throws an error about a file**:
Check the log file in the `logs/` folder. The log explains exactly what went wrong and which row of input data triggered it.
**The GUI feels slow on a large file**:
Files over ~100,000 rows take longer to process. The GUI shows a progress bar. If you have very large files (millions of rows) consider using the CLI directly, which is faster for batch jobs.
**Need help**: Email the address on your purchase receipt.
---
## 6. License
Single-user license. Do not redistribute. See `LICENSE.txt` in the install folder.
Single-user. See `LICENSE.txt`.