docs: add project documentation files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
2026-04-28 22:02:07 +00:00
parent a23f7a9b6f
commit 0613dc420c
7 changed files with 1370 additions and 0 deletions

278
docs/BUSINESS.md Normal file
View File

@@ -0,0 +1,278 @@
# BUSINESS.md - Business Case & Marketing Strategy
> **Creator-only document. Do not ship to buyers.**
**Version**: 1.6
**Last updated**: April 28, 2026
**Owner**: Michael
---
## 1. Executive Summary
Sell niche-specific Python automation tools as one-time downloadable digital products. Target non-technical users who hate repetitive Excel/CSV work but cannot code. Distribute via Gumroad / Lemon Squeezy / Stripe with automated instant delivery. Cross-platform from launch (Windows, macOS, Linux). Each bundle ships with both a GUI (primary surface for non-technical buyers, runs in the buyer's browser locally) and a CLI (for power users and automation).
**Pricing model**: One-time purchase. Individual bundles $49-$79. Full suite $149.
**Goal**: Lifestyle cashflow. No saleable-asset exit required.
---
## 2. Market Opportunity
- Persistent, evergreen pain: manual data cleaning is universal across small business and freelance work.
- Low competition in highly vertical niches (e.g., Shopify pet supplies feeds vs. generic CSV cleaners).
- High margin: near-100% gross margin after initial creation.
- Distribution leverage: marketplace search + community presence + programmatic SEO + **hosted browser demo as a try-before-buy conversion surface** (added v1.3, see Section 7).
**Realistic distribution timeline note**: Marketplace listings (Gumroad, Lemon Squeezy directory) and niche community posts can produce paying customers within days to weeks. New-domain SEO will not produce traction inside 90 days. Plan early-stage distribution around marketplaces, communities, and a hosted demo; treat owned-domain SEO as a 6-18 month compounding asset.
---
## 3. Target Customers
Primary:
- Shopify store owners (Pet Supplies niche identified as priority).
- Small business owners needing reporting and finance automation.
- Freelancers and consultants who handle client data.
- Local marketing agencies.
Anti-personas (do not waste effort here):
- Enterprise data teams (will build it themselves).
- Pure technical buyers (will pip install something free).
---
## 4. Product Strategy
**Lead product**: Excel & CSV Data Cleaning Mastery Bundle (highest-pain, broadest demand).
**Bundle roadmap**:
1. Data Cleaning Mastery (lead, in progress).
2. Automated Business Reporting.
3. Ecommerce Data Pipeline.
4. Small Business Finance.
5. Marketing Public Data Aggregation.
6. AI Ecommerce Aggregation - Shopify Pet Supplies (vertical niche play).
**Sequence rule**: Do not start bundle 2 until bundle 1 has paying customers and at least one external review. Building five skeleton bundles in parallel is a known failure mode.
**Product surface (locked v1.3)**: Each bundle ships as a desktop install (Windows / macOS / Linux) with both a Streamlit-based GUI and a CLI. A constrained version of the GUI is also deployed publicly to Streamlit Community Cloud as a free browser demo. See TECHNICAL.md Sections 1-3 and DECISIONS.md Section 4c for the full architecture.
---
## 4a. Lead Bundle Deep Dive: Deduplicator Use Cases & Competitive Position (added v1.6)
The deduplicator is the lead because it has the highest pain density across all four target personas. This section captures the use-case map, competitive landscape, and market gap statement. It feeds landing page copy, demo dataset design, and feature prioritization. Companion technical spec is in TECHNICAL.md Section 10.1.
### Use cases by buyer persona
**Shopify store owner (priority niche)**
1. Customer list cleanup: same person with `john@gmail.com` and `John@Gmail.com` and `j.ohn@gmail.com` (Gmail ignores dots), or with two phone formats.
2. Product catalog dedup: same SKU listed with trailing whitespace, case differences, or near-identical names ("Dog Collar - Red - Large" vs "Dog Collar Red L").
3. Abandoned cart cleanup before re-engagement campaign (don't email the same person 3 times).
4. Order export consolidation when pulling from Shopify + Amazon + manual entry.
5. Subscriber list hygiene before importing to Klaviyo / Mailchimp (every duplicate costs money on per-contact pricing).
**Small business / bookkeeper**
6. Bank export reconciliation: same transaction appearing in two exports across overlapping date ranges.
7. Vendor list consolidation across QuickBooks, spreadsheets, and email.
8. Customer master record cleanup before invoicing migration.
9. Expense report dedup where employees submit the same receipt twice.
**Freelancer / consultant**
10. Pre-analysis cleanup of client-supplied data dumps (almost always have dupes).
11. Survey response dedup (same respondent submitting twice from different devices).
12. Lead list cleanup before client handoff.
**Marketing agency**
13. Email list deduplication across multiple lead sources before campaign send.
14. Audience reconciliation when running multi-platform campaigns (Facebook + Google + organic forms).
15. Suppression-list management (combine unsubscribes across lists).
**Highest-pain, highest-frequency**: 1, 5, 6, 13. Build the feature set, sample dataset, and demo around these first. Landing page copy should lead with these scenarios. The hosted demo's pre-loaded dataset should make at least two of these cases obvious within ten seconds.
### Competitive landscape
| Tool | Price | Strength | Weakness vs. this product |
|---|---|---|---|
| Excel "Remove Duplicates" | Free | Universal, zero install | Exact match only. No fuzzy. No audit log. |
| Pandas `drop_duplicates` | Free | Powerful | Requires Python skill. Buyer doesn't have it. |
| OpenRefine | Free | Powerful clustering, fuzzy | Steep learning curve, dated GUI, intimidating for non-technical users. |
| Dedupe.io | ~$30+/mo SaaS | ML-based fuzzy | Recurring cost, cloud upload (privacy concern for client data), overkill for small jobs. |
| WinPure / Data Ladder | $300-2000+ | Enterprise-grade | Wrong price tier and complexity for solo operators. |
| Power Query (Excel) | Free | Integrated | Exact match only, no fuzzy without M-code skill. |
### The market gap this product fills
The market has a hole between "Excel (too basic)" and "OpenRefine / Dedupe.io (too complex or expensive or cloud-bound)." That hole is:
> Fuzzy match quality of OpenRefine, with the zero-learning-curve UX of Excel, sold once for under $100, runs locally on the buyer's machine.
This is a defensible position **only if** the product delivers fuzzy match quality the buyer can trust without reading documentation. If fuzzy is mediocre, the product loses to free Excel. If UX requires learning, it loses to free OpenRefine. The Tier 1 functional spec in TECHNICAL.md Section 10.1 is the minimum viable feature set to occupy this gap.
### Pricing sanity check (lead bundle specifically)
$49-$79 is correct for this position. Above $99 the buyer expects SaaS support (which conflicts with the no-touch constraint). Below $30 it competes with free and signals "toy." See Section 5 for full pricing rationale.
---
## 5. Pricing
| Tier | Price | Notes |
|---|---|---|
| Single bundle | $49 - $79 | Sweet spot for individual buyer impulse purchase |
| Full suite (when 3+ bundles exist) | $149 | Anchor price, drives bundle attach |
Rationale: $49-$79 is below the threshold that triggers procurement / approval friction for solo operators. Above $99 the buyer expects a SaaS or human support.
---
## 6. Revenue Targets (realistic, tiered)
Replacing the original "$50k/mo ceiling" target with evidence-based tiers for solo digital product sellers in this category:
| Horizon | Target | Notes |
|---|---|---|
| First 90 days | First paying customer | Validates the funnel, not the business |
| 6 months | $1,000 - $3,000 / mo | Achievable with working lead bundle + marketplace presence + hosted demo |
| 12 months | $5,000 / mo | Realistic 12-month goal. Triggers re-evaluation of the "fully async" constraint (see Section 8) |
| 24 months | $10,000 / mo | Stretch target. Requires either a hit product or 3+ bundles compounding |
$20k+/mo is achievable but requires a channel asset (audience, brand, content) that the current operator constraints exclude. Not a target.
---
## 7. Marketing Strategy
**Channels (in priority order, early stage)**:
1. **Hosted browser demo** (added v1.3). Free, public Streamlit Community Cloud deployment of a constrained version of each bundle. Linked prominently from every Gumroad / Lemon Squeezy listing and the landing page as "Try it free in your browser." Direct conversion lever: prospects can validate quality before purchase, which is otherwise impossible for digital downloads at this price.
2. Marketplace listings (Gumroad search, Lemon Squeezy directory, GitHub).
3. Niche community presence (subreddits, Indie Hackers, niche Slack/Discord) - value-first posts, not promotion. The hosted demo doubles as the asset shared in these posts.
4. Programmatic SEO landing pages targeting long-tail keywords (compounds over months).
5. Strong GitHub README as discovery surface.
**Hosted demo design**:
- Same core engine as the paid product, GUI front-end only.
- Constrained: row limit (e.g., 100 rows on output), watermark on output files, sample dataset preloaded plus optional small-file upload (capped size).
- Persistent CTA on every page: "Like what you see? Get the full version for $49 ->" linking to Gumroad.
- No login or signup required to use the demo. Friction kills conversion.
- Hosted on Streamlit Community Cloud (free) at launch. Migrate to $5/mo VPS if rate limits or branding constraints become an issue.
**Target keywords (long-tail, low competition)**:
- python csv cleaner bundle
- excel data cleaning scripts
- automated data deduplicator python
- csv duplicate removal tool
- shopify product feed cleaner
**Funnel**:
- Discovery (marketplace search / community post / SEO) -> Hosted demo (try-before-buy) -> Landing page -> Gumroad checkout -> Stripe payment -> automated email delivery -> upsell sequence to next bundle.
**Support model**: Self-serve documentation in every download. Email support only, no live chat, no calls.
---
## 8. The "Fully Async, No-Touch" Constraint
The locked criteria require fully automated, no-touch marketing and sales. This is preserved as the long-term steady state. However:
**Revisit trigger**: When monthly recurring revenue reaches $5,000/mo.
**Why revisit**: At early stage, the no-touch constraint rules out the channels most likely to produce first traction (direct outreach to 50 Shopify pet operators, founder-led community participation, customer interviews). These are time-bounded activities, not permanent commitments. Strict adherence to "no-touch" before product-market fit may cost more revenue than it saves time.
**Action at trigger**: Re-evaluate whether selective non-async activity (e.g., 2 hours/week of community participation, or a small founder audience build) would compound revenue faster than additional bundle development. Decision is yours; this document only flags the trigger.
Until $5k/mo, operate under the locked async-only rule.
---
## 9. Cost Structure
**Recurring (monthly budget cap: $1,200)**:
| Item | Cost | Notes |
|---|---|---|
| Gumroad / Lemon Squeezy fees | ~10% of revenue | Net of merchant fees, no flat cost |
| Domain | ~$15/yr | One-time annual |
| Hosting (landing pages) | $0 - $20/mo | Static hosting via Cloudflare Pages, Netlify, or GitHub Pages is free |
| Hosting (browser demos) | $0 at launch | Streamlit Community Cloud free tier. Plan for $5-10/mo VPS migration if scale or branding requires |
| Email service (transactional + sequences) | $0 - $30/mo | Free tier covers early volume |
| Apple Developer Program | $99/yr (~$8/mo) | Required for macOS code signing - see Section 10 |
| Inno Setup (Windows installer) | Free | One-time download |
| PyInstaller, Streamlit, Python tooling | Free | All open source |
| **Total fixed monthly** | **~$30-70/mo** | Well under $1,200 cap |
Headroom in the budget allows for optional ad spend ($100-200/mo) once a bundle has proven conversion data.
---
## 10. macOS Code Signing (Apple Developer Program)
**Required cost**: $99/year, paid to Apple.
**Why it's required**:
macOS includes a security layer (Gatekeeper) that blocks unsigned applications by default. When a non-technical buyer downloads an unsigned `.app` or `.dmg`, macOS shows a hard-block dialog: *"This app cannot be opened because the developer cannot be verified."* The only obvious button is "Move to Trash."
The bypass exists (right-click > Open, then confirm in a second dialog), but the target buyer persona will not perform it. The likely outcomes for unsigned Mac builds: refund requests, support tickets, or silent abandonment.
**What the $99/yr buys**:
- A code signing certificate. Removes the hard block.
- Notarization service (included). Apple scans the binary and stamps it; this removes the secondary "downloaded from internet" warning too. Result: clean double-click-to-run experience.
**Setup notes**:
- Requires Apple ID + government ID (for individuals) or D-U-N-S number (for organizations).
- First-time approval takes 1-2 weeks. Plan accordingly.
- Once approved, signing and notarization is automated in the build pipeline (see TECHNICAL.md).
**Decision**: Pay for it. The cost is trivial relative to the conversion-rate impact for the non-technical buyer persona.
---
## 11. Risks & Mitigation
| Risk | Mitigation |
|---|---|
| Commoditization (free scripts on GitHub) | Niche verticals + polished GUI + cross-platform installers + hosted demo |
| Slow early traction | Lead with hosted demo + marketplaces + communities, not own-domain SEO |
| Refund chargebacks | Clear scope on landing page, hosted demo lets buyers validate before purchase, working samples included |
| macOS install friction | Apple Developer Program ($99/yr), code signing + notarization |
| Browser-launch UX confusion (GUI opens in browser locally) | Single sentence in installer welcome and email; persistent in-app "runs locally, no internet used" message; pywebview native-window wrap as v1.1 enhancement if needed |
| Customer support burden | Robust installers, idiot-proof docs, sample data included, hosted demo lets prospects self-evaluate |
| IP theft / resale | License file. Accept this is partial protection; focus on staying ahead via updates |
| Platform risk (Gumroad / Lemon Squeezy policy change) | Multi-marketplace from day one; own domain as fallback |
| Streamlit project direction change breaks desktop packaging | Low probability; flagged as criteria-relock trigger in DECISIONS.md Section 8 |
---
## 12. Success Metrics
Tracked monthly:
- Units sold per bundle.
- Conversion rate (landing page -> purchase).
- **Demo-to-purchase conversion rate** (added v1.3): hosted demo visits -> Gumroad clicks -> purchases.
- Refund rate (target < 5%).
- Support tickets per 100 sales (target < 10).
- Organic traffic to product pages.
- Per-platform install success rate (Windows, macOS, Linux).
---
## 13. Honest Status (April 28, 2026)
- 1 of 9 scripts is real and tested (`01_deduplicator.py`). The other 8 are skeletons. **Expected at project start.**
- Cross-platform build pipeline (PyInstaller-based) designed but not yet built.
- macOS code signing not yet set up (Apple Developer Program enrollment pending).
- Streamlit GUI not yet built (locked as the framework as of v1.3).
- Hosted demo not yet deployed.
- No paying customers yet.
- No live landing page yet.
**Next concrete steps before any marketing spend**:
1. Build the Streamlit GUI for the lead script (`01_deduplicator.py`). Apply UX standards from DECISIONS.md Section 4b.
2. Stand up the PyInstaller cross-platform build pipeline with Streamlit launcher (see TECHNICAL.md Sections 3.3 and 3.4). Budget 1-3 days for first-time Streamlit-PyInstaller integration.
3. Deploy the constrained demo version to Streamlit Community Cloud.
4. Enroll in Apple Developer Program (1-2 week lead time - start in parallel with the above).
5. Stand up a single landing page for the lead bundle, with the hosted demo prominently linked.
6. Finish at least 2 more of the 9 scripts to working state with both CLI and GUI.
7. List on Gumroad with sample output proof, per-platform installer downloads, and hosted demo link.

268
docs/DECISIONS.md Normal file
View File

@@ -0,0 +1,268 @@
# DECISIONS.md - Locked Criteria, Scoring Rubric, Decision Log
> **Creator-only document. Do not ship to buyers.**
**Version**: 1.6
**Last updated**: April 28, 2026
This document captures the original locked operating criteria, the scoring framework used to select the product category, the platform-model evaluation, and key decisions with rationale. It exists so future-you (or a recovery rebuild) can reconstruct *why* the project is what it is, not just *what* it is.
---
## 1. Locked Operating Criteria
These are the constraints, targets, and goals the product strategy must satisfy. Established at project start. Any change to these requires an explicit re-lock.
### Constraints
| # | Criterion | Notes |
|---|---|---|
| 1 | Cash budget ≤ $1,200/month | Recurring monthly only; no large one-time capital, no external funding |
| 2 | Time available ≤ 10 hours/week | Strong preference for build-once assets generating revenue for years with minimal maintenance |
| 3 | Skill set: Database Design, Data Pipelines, Data Aggregation, Programming | Every opportunity must directly leverage these |
| 4 | Existing network: none | Zero reliance on personal connections for acquisition, sales, or operations |
### Targets
| # | Target | Notes |
|---|---|---|
| 5 | Time to first revenue: 15 days preferred, 90 days hard stop | |
| 6 | Revenue ceiling: tiered (see BUSINESS.md Section 6) | Revised from original $50k/mo. Realistic 12-month target: $5k/mo |
| 7 | Lifestyle cashflow goal | Sustainable for several years, no saleable-asset exit required |
| 8 | Distribution: fully async, no-touch, automated | Revisit at $5k/mo (see BUSINESS.md Section 8) |
| 9 | Day-to-day work pattern: deep work + recovery periods | No real-time on-call or customer-facing constraints |
### Goals
| # | Goal | Priority |
|---|---|---|
| 10 | Escape 9-5 W2 employment without stability concerns | Primary |
| 11 | Free up time for retirement lifestyle, optional enjoyable work | Secondary |
### No Internal Contradictions
The original criteria were checked for tension. The "fully async + 15-day to first revenue + no network" combination is tight but workable, with the caveat documented in BUSINESS.md Section 8 (revisit async constraint at $5k/mo).
---
## 2. Scoring Rubric
Every business candidate was scored 1-5 on six dimensions. Total /30, then mapped to verdict.
| Dimension | What it measures |
|---|---|
| Fit to locked criteria | Direct match to constraints 1-4 and targets 5-9. **Any 1 is a hard kill.** |
| Demand durability | Structural shift vs. trend peak. Will this still pay in 3 years? |
| Defensibility | What stops the next entrant from copying it. |
| Unit economics realism | CAC, payback period, gross margin, working capital. |
| Operator fit | Skills, capital, time, stomach for the work. |
| Exit / cash-flow optionality | Multiple paths to revenue, optionality on later changes. |
**Verdict mapping**: PURSUE / INVESTIGATE / PASS / KILL based on total score and any hard-kill dimension.
**Calibration note added in v1.1**: The original scoring inflated unit economics for the lead candidate by treating near-100% gross margin as 5/5 without accounting for CAC under the "no network" constraint. A more honest score for the Python Bundles category is 7.0-7.5/10, not 8.7/10. The strategy is still sound; the optimism just needed deflating.
---
## 3. Candidate Evaluation Summary
Five candidates were evaluated against the locked criteria. Top three:
| Rank | Candidate | Score | Verdict |
|---|---|---|---|
| 1 | Niche Python Automation Script Bundles | 8.7/10 (original) / ~7.5/10 (calibrated) | **PURSUE** |
| 2 | Curated Datasets | 8.7/10 | PURSUE (deferred) |
| 3 | Hosted Data Pipeline Micro-Tool | 8.3/10 | INVESTIGATE |
**Why #1 was selected over #2**:
- Faster path to first revenue (digital download vs. ongoing data curation pipeline).
- Lower ongoing maintenance after launch.
- Direct leverage of programming skills, not just data acquisition.
- Better fit for the "build once, sell many times" preference in criterion 2.
**Why others were ranked lower**:
- Notion Templates: weaker leverage of programming skills.
- Query Optimizer (SaaS): introduces hosting, support, and recurring infrastructure costs that conflict with the lifestyle / minimal maintenance constraint.
---
## 4. Platform Model Decision (How to Sell)
Models considered for the lead bundle:
| Model | Pros | Cons | Verdict |
|---|---|---|---|
| **Standalone tools, dual CLI + GUI interface (chosen)** | Build once, sell forever. No hosting. No SaaS support burden. Direct skill match. GUI captures non-technical buyer; CLI captures power users and automation use cases. | Requires installer for non-technical buyers. Some platform friction (signing, etc.). GUI adds build cost vs. CLI-only. | **CHOSEN (revised v1.2)** |
| SaaS web app | Recurring revenue. Easy install. | Ongoing hosting cost, support burden, SaaS scrutiny. Conflicts with "minimal maintenance" criterion. | Rejected |
| CLI-only | Lowest build cost | Wrong fit for non-technical buyer persona. Will produce refunds. | Rejected (revised v1.2) |
| Browser extension | Easy install | Limited by browser sandbox. Wrong tool for data file processing. | Rejected |
| Notion / Airtable templates | Fast to ship | Doesn't leverage programming skills. Low defensibility. | Rejected |
**Decision (revised v1.2)**: Ship as standalone tools with **both** a CLI and a GUI front-end sharing the same core logic. Packaged with cross-platform installers (PyInstaller-based) so the buyer experience approximates a native app. GUI is no longer "deferred"; it is required at v1 launch.
**Rationale for the v1.2 revision**:
- The buyer persona ("hate repetitive Excel work but cannot code") will not learn a CLI. CLI-only at this price point produces refunds.
- The deduplicator specifically requires interactive review of fuzzy-match candidates. That UX is not viable in pure CLI.
- A dual-interface design keeps the CLI for power users and future automation/scheduling use cases without sacrificing the primary buyer experience.
---
## 4a. Functional Scope Principle (added v1.2)
**Decision**: Each script ships with **complete functional coverage of the problem it names**, including features available for free elsewhere (e.g., Excel's built-in exact-match dedup).
**Rationale**: The product is "one-stop shopping" for the buyer's data-cleaning workflow. Forcing a buyer to bounce between this product and Excel/OpenRefine/etc. for parts of a single task defeats the value proposition. A buyer cleaning a customer list expects exact dedup, fuzzy dedup, normalization, and survivor-merge in one tool. Splitting that across products is what they paid to avoid.
**Consequence for design**: Do not omit a feature on the grounds that "Excel does this for free." If it belongs to the workflow, it belongs in the script.
**Anti-rule**: This is not license to scope-creep. The boundary is "the workflow this script names." A deduplicator includes everything dedup-adjacent (normalization, survivor selection, audit). It does not include format conversion, charting, or anything outside the dedup workflow. Those belong to other scripts in the bundle.
---
## 4b. UX Standards for GUI Front-End (added v1.2)
The GUI is the primary buyer surface. These standards are load-bearing.
| Standard | What it means in practice |
|---|---|
| **Works out of the box** | Dropping any reasonable CSV / XLSX onto the GUI must produce a useful result with zero configuration. The buyer should never see a config screen on first run. |
| **Sensible defaults everywhere** | Every option has a default that works for the most common case. Defaults are visible (so the user understands what is being applied) but not blocking. |
| **Progressive disclosure** | Advanced options exist but are tucked behind an "Advanced" or "Settings" pane. The default view shows the minimum needed for a first run. |
| **Plain-English labels** | No technical jargon in primary UI. "Find duplicates" not "Apply Levenshtein matching with token_set_ratio threshold". Tooltips can carry the technical detail for users who want it. |
| **Visible safety** | Dry-run / preview by default. The user sees what *would* change before any file is written. Original input is never modified. |
| **No multi-step setup** | If the GUI requires more than a single window (file picker + go button + results view) to complete a basic task, it has failed this standard. |
| **Errors that name the problem and the fix** | "Column 'email' not found in this file. Available columns: name, phone, address. Did you mean 'phone'?" not "KeyError: 'email'". |
| **Identical core to CLI** | The GUI and CLI are two front-ends over the same library code. Anything the CLI can do, the GUI can do. Anything the GUI can do, the CLI can do (possibly minus interactive review). No drift. |
**Test for "intuitive enough"**: A non-technical person who has never seen the tool can complete the lead use case (dedup a customer list with one or more confidence levels) on first launch with no documentation read. If that test fails on real users, the GUI is not yet shippable.
---
## 4c. GUI Framework Decision: Streamlit (added v1.3)
**Chosen**: Streamlit.
### Frameworks evaluated
| Framework | Verdict |
|---|---|
| **Streamlit** | **CHOSEN** |
| Tkinter + CustomTkinter | Rejected (CustomTkinter maintenance status confirmed inactive: last release Jan 2024, ~28 months old as of decision date; Snyk classifies as Inactive project) |
| Plain Tkinter | Rejected (UX quality below what a $49-79 product justifies in 2026 without significant hand-styling work) |
| Flet | Rejected (ecosystem too young for a build-once-maintain-for-years product) |
| PySide6 / Qt | Rejected (overkill for this product tier; steepest learning curve, largest bundles) |
| NiceGUI | Rejected (similar pattern to Streamlit but smaller community and less mature data-tool ergonomics) |
### Full evaluation matrix (added v1.6)
Promoted from chat-history-only into docs in v1.6 to lock the rejection reasoning against re-litigation. Scored 1-5 where 5 is best for *this specific product*.
| Dimension | Tkinter | Tk+CTk | Streamlit | Flet | PySide6 | NiceGUI |
|---|---|---|---|---|---|---|
| Non-tech UX quality (look + feel) | 1 | 3 | 4 | 4 | 5 | 4 |
| "Native window opens" (no browser) | 5 | 5 | 1 | 5 | 5 | 1 |
| Build speed for v1 | 3 | 3 | 5 | 4 | 2 | 4 |
| Build speed per added feature | 3 | 3 | 5 | 4 | 2 | 4 |
| PyInstaller compatibility (low friction) | 5 | 4 | 2 | 3 | 3 | 2 |
| Bundle size (smaller = better) | 5 | 4 | 1 | 3 | 2 | 1 |
| Maintenance burden over time | 4 | 3 | 4 | 3 | 4 | 3 |
| Ecosystem maturity / longevity bet | 5 | 3 | 4 | 2 | 5 | 3 |
| Solo dev learning curve | 4 | 4 | 5 | 4 | 2 | 4 |
| Suits "drop file, see result" pattern | 3 | 3 | 5 | 4 | 4 | 5 |
| **Total /50** | **38** | **37** | **38** | **36** | **34** | **35** |
**The total is misleading on purpose.** Equal totals hide that these options fail differently. Tkinter ties Streamlit on the sum but loses on look-and-feel and data-app fit (the dimensions that matter most for this product). The verdict is in the per-dimension story, not the sum.
**Per-option summary** (the substance behind the verdicts):
- **Plain Tkinter**: Smallest bundle (~30-50 MB added), most predictable PyInstaller behavior, will work in 10 years. Default widgets look like 1998. A non-technical buyer paying $49-79 and seeing a default Tk UI will feel cheated. Don't ship.
- **Tkinter + CustomTkinter**: Native window, ~50-80 MB added, modern look, mature PyInstaller story. Maintainer absent (last release Jan 2024). Multi-year product cannot bet UI layer on a library classified Inactive. The probable failure mode is a future macOS or Python update breaking the Tk layer with no upstream fix.
- **Streamlit**: Fastest to build for data tools. Tables, file uploads, dataframes are first-class. Mature ecosystem. Browser-launch UX is the real liability, mitigated by in-app messaging and the optional pywebview wrap (v1.1). Bundle size 300-500 MB. PyInstaller packaging fiddly first time, reusable after.
- **Flet**: Modern Flutter-based UI, native windows, looks great. Ecosystem too young for a build-once-maintain-for-years product. Breaking API changes between minor versions still happening. PyInstaller story less battle-tested.
- **PySide6 / Qt**: Industrial-grade, best widget set, native everything. Steepest learning curve, largest bundles, licensing care needed. Overkill for $49-79 product tier and burns the 10 hr/wk time budget on UI scaffolding instead of script features.
- **NiceGUI**: Similar pattern to Streamlit (Python-to-web). Smaller community, less mature data-tool ergonomics. Same browser-launch tradeoff without Streamlit's velocity advantage.
### Why Streamlit won
1. **Fastest build velocity for v1 and every subsequent bundle.** "Drop a CSV, see results" is the native Streamlit interaction pattern. Tables, filters, dataframes display well with minimal code. This compounds across the 9-script lead bundle and the future 5 bundles in the roadmap.
2. **Lowest maintenance burden per added feature.** Active framework, large community, mature ecosystem. Bug fixes happen upstream, not on this project's time.
3. **Hosted browser demo as a marketing asset from day one.** A Streamlit app deploys to Streamlit Community Cloud (free) or a $5/mo VPS. The Gumroad landing page can offer "Try it free in your browser" with a sample dataset. For a $49-79 product where buyers cannot evaluate quality before purchase, a working demo can move conversion meaningfully. Tkinter family options cannot provide this.
4. **Future SaaS optionality** (expanded v1.6). Not a driver of this decision; the locked criteria reject SaaS. But if criteria ever evolve, Streamlit code converts to a hosted multi-user app in hours rather than weeks. Streamlit's session-state model, component patterns, and HTTP-server architecture are SaaS-native by default; the same code that runs the desktop bundle's local browser GUI runs unchanged on a hosted server (modulo authentication and per-user file isolation). Tkinter code, by contrast, would require a complete rewrite to become a hosted product. This is low-cost optionality: zero implementation effort now, meaningful flexibility later if the lifestyle-cashflow constraint ever lifts in favor of recurring revenue.
### Tradeoffs accepted
1. **Browser-launch UX on the desktop install.** When a buyer double-clicks the desktop shortcut, their default browser opens to a localhost URL. This may briefly confuse non-technical buyers. **Mitigation**: a single sentence in the welcome dialog and install email explains that the data tool runs in the browser locally and uses no internet. If support tickets show this is a meaningful confusion driver, evaluate wrapping with pywebview (native window around the local Streamlit server) in v1.1.
2. **Larger bundle size**, ~300-500 MB vs. ~50 MB for Tkinter. Acceptable for marketplace download in 2026 with typical broadband.
3. **PyInstaller packaging is fiddly** the first time. Budget 1-3 days for the one-time setup, then it's reusable across all subsequent bundles via a shared template.
4. **Streamlit's session re-run model is unusual.** Manageable for single-user data tools; would matter more if the SaaS optionality were exercised at scale.
### Why CustomTkinter was rejected (the previously-favored option)
A web check during this decision found that CustomTkinter's last PyPI release was 5.2.2 in January 2024. As of April 2026, that's roughly 28 months without a release, and Snyk classifies the project as Inactive. The library still works and remains popular (~115k weekly downloads, 13k+ GitHub stars), but the maintainer is effectively absent. For a product intended to ship to non-technical buyers and remain functional for years with minimal touch from the operator, betting the UI layer on an unmaintained library is an unacceptable risk: any future Python or macOS update that breaks the Tk underpinnings becomes the operator's problem to fix or fork.
This is the kind of dependency risk that matters most in a "build once, sell forever" product, where every hour spent firefighting a dependency break is an hour stolen from the next bundle.
---
## 5. Distribution Channel Decision
**Chosen primary**: Marketplace listings (Gumroad, Lemon Squeezy).
**Rationale**: Under the "no network + fully async + 90-day hard stop" constraints, marketplaces are the only channel that:
- Has built-in buyer traffic (no audience-building required).
- Handles payments, delivery, refunds asynchronously.
- Allows listing in days, not months.
Own-domain SEO is treated as a long-term compounding asset (6-18 months to traction), not an early-stage channel.
**Added v1.3**: A **hosted browser demo** of each bundle (deployed via Streamlit Community Cloud) becomes a secondary distribution surface and a primary conversion-rate lever on the landing page. Marketing details in BUSINESS.md Section 7.
---
## 6. Pricing Decision
**Chosen**: $49-$79 per bundle, $149 for full suite (when 3+ bundles exist).
**Rationale**:
- Below $99 threshold avoids procurement / approval friction for solo operator buyers.
- Above $99 raises buyer expectations (SaaS, human support) that conflict with the no-touch constraint.
- $49-$79 produces the right unit economics for marketplace fees + Stripe fees while remaining impulse-purchase territory.
---
## 7. Decision Log (Chronological)
| Date | Decision | Rationale |
|---|---|---|
| April 2026 | Lock operating criteria | Project kickoff |
| April 2026 | Select Python Automation Script Bundles as the product category | Highest score against locked criteria |
| April 2026 | Choose CLI standalone over SaaS / GUI | Best fit for minimal maintenance + skill leverage |
| April 2026 | Pick Excel & CSV Data Cleaning Mastery as lead bundle | Highest pain, broadest demand, easiest demonstration |
| April 2026 | Initial install path: Inno Setup (Windows-only) | First-pass design |
| April 2026 (revised v1.1) | **Switch to PyInstaller-based cross-platform pipeline** | Eliminates "install Python first" friction; expands TAM to Mac and Linux users |
| April 2026 (revised v1.1) | **Enroll in Apple Developer Program ($99/yr)** | Required for clean macOS install experience for non-technical buyers |
| April 2026 (revised v1.1) | **Replace $50k/mo target with tiered realistic targets** | Original target was unsupported by evidence base; tiered targets hit $5k at 12mo, $10k at 24mo |
| April 2026 (revised v1.1) | **Tag "fully async no-touch" for revisit at $5k/mo** | Strict adherence pre-PMF may cost more revenue than it saves time |
| April 28, 2026 (v1.2) | **Functional scope: include all workflow-relevant features even if available free elsewhere** | One-stop shopping is the value proposition. Forcing buyers to bounce between products defeats the purpose. See Section 4a. |
| April 28, 2026 (v1.2) | **Promote GUI from "deferred" to required at v1 launch; ship dual CLI + GUI interface** | Buyer persona will not use CLI. Deduplicator specifically requires interactive review UX that CLI cannot deliver well. See Section 4. |
| April 28, 2026 (v1.2) | **Lock UX standards for GUI: works out of the box, sensible defaults, progressive disclosure, plain-English labels, dry-run by default** | These are load-bearing for the non-technical buyer. Without them the GUI may exist but won't justify the price. See Section 4b. |
| April 28, 2026 (v1.3) | **Lock GUI framework as Streamlit; reject CustomTkinter (maintenance inactive), plain Tkinter (UX gap), Flet/PySide6/NiceGUI (each fails on a dimension that matters)** | Fastest build velocity, lowest maintenance burden, hosted browser demo as marketing asset, future SaaS optionality. Browser-launch UX accepted as a tradeoff with documented mitigation. See Section 4c. |
| April 28, 2026 (v1.3) | **Add hosted browser demo as secondary distribution surface and conversion lever** | Direct consequence of Streamlit choice. See Section 5 and BUSINESS.md Section 7. |
| April 28, 2026 (v1.4) | **Re-apply 03/05 script boundary work dropped during v1.3 merge (silent drift recovery)** | Stream B v1.2 content (sharpened 03/05 descriptions in USER-GUIDE, run-order rule, TECHNICAL.md Section 9 boundary spec, RECOVERY.md pointer) was overwritten when Stream A's parallel v1.3 Streamlit work was saved to project. Restoring per the doc's own no-silent-drift rule. 03 owns "what's not there" (missing values, sentinel codes, imputation), 05 owns "what shouldn't be there" (statistical outliers, domain rules, winsorization). 03 runs before 05 because outlier statistics on data containing NaN or sentinel codes are mathematically poisoned. See TECHNICAL.md Section 9. |
| April 28, 2026 (v1.5) | **Add `02_text_cleaner.py` as new script; renumber 02-08 → 03-09** | Audit revealed character-level hygiene (whitespace trimming, multi-space collapse, Unicode normalization, BOM handling, line-ending normalization, special-character handling) had no clear owner. Was implicitly scattered: `01_deduplicator` normalizes internally for matching only (doesn't write back), `02_format_standardizer` (now 03) implies it but its named scope is dates/currencies/names/phones/addresses, `03_missing_value_handler` (now 04) only handles whitespace-only as disguised null. A buyer with trailing-space pollution had no obvious script to run. Per Section 4a (functional scope principle: one-stop shopping for the workflow), this was a real gap. Added as 02 because text cleaning is a pre-processing step that should run before format standardization, missing-value handling, and outlier detection. Kept 01 (deduplicator) at position 1 as the lead/working/marketing-flagship script; numbering does not strictly equal pipeline order, the orchestrator manages execution order. Renumber consequence: TECHNICAL.md Section 9 boundary references updated 03→04, 05→06; orchestrator references updated 08→09. New contested case documented in Section 9.3: whitespace-only cells (02 trims first, leaving empty string; 04 then detects empty strings as disguised null). Master orchestrator now 09. |
| April 28, 2026 (v1.6) | **Fold conversation-history content into docs: deduplicator functional spec, lead bundle use cases, competitive landscape, full GUI framework comparison matrix, concrete 04/06 boundary examples, expanded Streamlit-to-SaaS reasoning** | None of this represents new decisions; all of it represents prior analysis that lived only in chat history and was at risk of evaporating. Per the doc's own no-silent-drift rule (Section 8) and the v1.4 recovery story, valuable analysis must be promoted to docs to survive. Specifically: TECHNICAL.md gains Section 10 (per-script functional specs, starting with the deduplicator's 36-item tiered spec) which is the buildable target for the v1 launch GUI port; this also makes the gap between "currently working" (exact + basic fuzzy) and "v1 launch best-of-class" (Tier 1) explicit so the docs don't quietly overstate where the code is. Section 9.3 gains three concrete distinguishing examples (bank-export blank fees / $1M outlier / "999=refused") that prove 04 and 06 are distinct concerns. BUSINESS.md gains Section 4a (Lead Bundle Deep Dive: 15 use cases by persona, 6-row competitive landscape table, market gap statement) which feeds landing page copy and demo design. Section 4c gains a 10-dimension scored framework matrix and per-option summaries (locks the rejection reasoning against re-litigation), plus expanded point 4 on Streamlit-to-SaaS migration cost. RECOVERY.md updated to reference Section 10 in rebuild and priority steps. No structural decisions changed; this is pure capture work. |
---
## 8. What Would Trigger Re-Locking the Criteria
These criteria are load-bearing and not casually changed. Triggers for explicit re-evaluation:
- Hitting the $5k/mo revenue tier (revisit async constraint).
- Hitting the $10k/mo revenue tier (revisit time-budget allocation).
- A platform shutting down (Gumroad / Lemon Squeezy policy change forcing channel migration).
- A new skill acquired that opens a higher-leverage product category.
- A burnout signal indicating the time / recovery balance is broken.
- Streamlit project taking a hard direction change that breaks the desktop-packaging path (low probability, but worth flagging).
Any re-lock requires writing the new criteria here with a date and rationale. No silent drift.

38
docs/README.md Normal file
View File

@@ -0,0 +1,38 @@
# Excel & CSV Data Cleaning Mastery Bundle
**Ready-to-sell Python automation product.**
9 scripts for data cleaning, deduplication, text hygiene, formatting, merging, validation, and reporting.
Each script ships with both a GUI (runs in your browser locally, no internet needed) and a CLI.
Cross-platform: Windows, macOS, Linux.
---
## Quick Start (for buyers)
1. Download the installer for your operating system.
2. Run the installer. No Python knowledge required.
3. Launch via the desktop shortcut "Launch Bundle" (or the app icon on macOS, or the AppImage on Linux).
4. Your default browser opens to a local page where the data tool runs. Your data never leaves your computer.
Full instructions: see [USER-GUIDE.md](USER-GUIDE.md).
---
## Documentation Index
### Ships with the product (buyer-facing)
- [USER-GUIDE.md](USER-GUIDE.md) - Installation, script reference, usage examples for both GUI and CLI.
### Creator-only (do not ship to buyers)
- [BUSINESS.md](BUSINESS.md) - Business case, market analysis, pricing, marketing strategy (including the hosted browser demo as a conversion lever).
- [TECHNICAL.md](TECHNICAL.md) - Architecture (dual CLI + Streamlit GUI), build pipeline, dev standards.
- [DECISIONS.md](DECISIONS.md) - Locked criteria, scoring rubric, decisions log, rationale for product choices including the GUI framework decision.
- [RECOVERY.md](RECOVERY.md) - How to rebuild the entire project from scratch if lost.
---
**Version**: 1.6
**Last updated**: April 28, 2026
**Owner**: Michael

180
docs/RECOVERY.md Normal file
View File

@@ -0,0 +1,180 @@
# RECOVERY.md - Full Project Recovery Guide
> **Creator-only document. Do not ship to buyers.**
**Version**: 1.6
**Last updated**: April 28, 2026
If the project is ever lost, this guide plus the source ZIP is enough to rebuild it 100%.
---
## 1. What's in the Project
```
project-root/
├── README.md
├── BUSINESS.md # Creator only
├── TECHNICAL.md # Creator only
├── DECISIONS.md # Creator only - locked criteria, rationale, GUI framework decision
├── USER-GUIDE.md # Ships to buyers
├── RECOVERY.md # Creator only (this file)
├── scripts/ # The 9 .py source files (CLI entry points)
│ ├── 01_deduplicator.py # Working
│ ├── 02_text_cleaner.py
│ ├── 03_format_standardizer.py
│ ├── 04_missing_value_handler.py
│ ├── 05_column_mapper_enforcer.py
│ ├── 06_outlier_detector.py
│ ├── 07_multi_file_merger.py
│ ├── 08_validator_reporter.py
│ └── 09_master_orchestrator.py
├── src/
│ ├── core/ # Shared business logic - both CLI and GUI call into this
│ ├── cli.py # Typer CLI front-end
│ └── gui/ # Streamlit GUI front-end
│ ├── app.py # Streamlit entry point
│ ├── pages/ # One Streamlit page per script in the bundle
│ └── components.py # Shared widgets
├── samples/
│ ├── messy_sales.csv
│ └── bank_export.xlsx
├── demo/
│ └── streamlit_app.py # Constrained version for Streamlit Community Cloud
├── build/
│ ├── pyinstaller.spec # Cross-platform build spec (handles GUI launcher + CLI binaries)
│ ├── launcher.py # Starts local Streamlit server, opens default browser
│ ├── windows/
│ │ └── installer.iss # Inno Setup wrapper
│ ├── macos/
│ │ ├── entitlements.plist
│ │ └── dmg_settings.py
│ └── linux/
│ └── AppImage/ # AppImage build assets
├── ci/
│ └── build.yml # GitHub Actions cross-platform build
├── tests/
└── requirements.txt
```
---
## 2. Rebuild Steps
### From a complete ZIP backup
1. Unzip into a clean directory.
2. Push to a GitHub repository.
3. The CI pipeline (`ci/build.yml`) builds Windows, macOS, and Linux artifacts on tagged releases.
4. Connect the repo to Streamlit Community Cloud and point it at `demo/streamlit_app.py` to redeploy the hosted demo.
5. For local builds: see Section 3.
6. Done.
### From documentation only (worst case)
1. Read `DECISIONS.md` to understand *why* the project is what it is. Section 4c locks the GUI framework as Streamlit; Section 4b locks the UX standards. These are non-negotiable.
2. Read `TECHNICAL.md` Sections 2-3 for the build pipeline architecture, including the Streamlit launcher pattern in Section 3.4.
3. Read `BUSINESS.md` for product strategy, which bundles to build, and the hosted demo as a marketing asset.
4. Recreate scripts using the spec in `USER-GUIDE.md` Section 2 (script table), `TECHNICAL.md` Section 7 (per-bundle technical notes), `TECHNICAL.md` Section 9 (boundary between scripts 04 and 06 - do not relitigate this), and `TECHNICAL.md` Section 10 (per-script functional requirements; Section 10.1 is the v1 launch target for the deduplicator).
5. Set up the cross-platform build pipeline (Section 3 below).
6. Recreate installer configs per `TECHNICAL.md` Section 3.
7. Build the constrained `demo/streamlit_app.py` for hosted deployment. Constraints: row limit, watermark, sample data only or strict file-size cap.
---
## 3. Local Build Setup (per platform)
### All platforms (common)
- Install Python 3.11+.
- `pip install -r requirements.txt pyinstaller`
- Verify Streamlit app runs locally: `streamlit run src/gui/app.py`
- Verify CLI runs locally: `python -m src.cli --help`
### Windows
- Install Inno Setup: https://jrsoftware.org/isinfo.php
- Build: `pyinstaller build/pyinstaller.spec`
- Wrap in installer: open `build/windows/installer.iss` in Inno Setup, compile.
### macOS
- Install Xcode command line tools: `xcode-select --install`
- Enroll in Apple Developer Program ($99/yr). Allow 1-2 weeks first time.
- Generate Developer ID Application certificate, install in Keychain.
- Generate app-specific password for `notarytool`.
- Build: `pyinstaller build/pyinstaller.spec`
- Sign: `codesign --deep --force --options runtime --sign "Developer ID Application: [Name]" dist/BundleName.app`
- Package as DMG.
- Notarize: `xcrun notarytool submit BundleName.dmg --wait`
- Staple: `xcrun stapler staple BundleName.dmg`
### Linux
- Install AppImage tooling: download `appimagetool` from https://appimage.github.io
- Build: `pyinstaller build/pyinstaller.spec`
- Wrap as AppImage using `appimagetool` per the assets in `build/linux/AppImage/`.
### Streamlit + PyInstaller specific notes
- A custom PyInstaller hook (`hook-streamlit.py`) is required to bundle Streamlit's data files correctly.
- Hidden imports must include `streamlit`, `altair`, `pyarrow` (and their submodules where PyInstaller fails to detect them).
- The launcher script (`build/launcher.py`) is the actual PyInstaller entry point, not the Streamlit script directly.
- Budget 1-3 days the first time getting the Streamlit-PyInstaller spec right; it's reusable across all subsequent bundles.
### CI build (recommended)
- Push the repo to GitHub.
- Tag a release: `git tag v1.0.0 && git push --tags`
- GitHub Actions runs the matrix build, produces all three artifacts.
- Manual step: download artifacts from the Releases page, upload to Gumroad / Lemon Squeezy.
### Hosted demo deployment (separate from desktop build)
- Connect GitHub repo to Streamlit Community Cloud (one-time, free).
- Configure the deployment to point at `demo/streamlit_app.py`.
- The demo updates automatically on git push to the configured branch.
- Custom domain optional via CNAME (verify Streamlit Community Cloud current policy at recovery time).
---
## 4. External Dependencies (re-acquire if lost)
| Item | Source | Cost |
|---|---|---|
| Python | https://python.org/downloads | Free |
| PyInstaller | `pip install pyinstaller` | Free |
| Streamlit | `pip install streamlit` | Free |
| Inno Setup (Windows) | https://jrsoftware.org/isinfo.php | Free |
| Apple Developer Program (macOS signing) | https://developer.apple.com | $99/yr |
| Xcode command line tools (macOS) | `xcode-select --install` | Free |
| appimagetool (Linux) | https://appimage.github.io | Free |
| GitHub Actions (CI) | github.com | Free tier covers all three OS runners |
| Streamlit Community Cloud (demo hosting) | streamlit.io/cloud | Free |
| Python libraries | See `requirements.txt`, `pip install -r requirements.txt` | Free |
---
## 5. Backup Recommendation
- **Primary backup**: GitHub repository (private). Source is the source of truth.
- **Secondary backup**: ZIP of the full project tree on cloud storage (Google Drive / Dropbox / S3).
- **Apple Developer credentials**: store certificate + app-specific password in a password manager. Losing these requires regenerating, not catastrophic.
- **Streamlit Community Cloud connection**: stored in Streamlit's UI as a GitHub OAuth link. Re-authorize from a new Streamlit account if lost.
- Back up after every meaningful code or doc change.
- Include this `RECOVERY.md` and `DECISIONS.md` in every backup. They contain the irreplaceable context.
---
## 6. Recovery Priorities (if rebuilding under time pressure)
If you only have time to rebuild part of the project, this is the order:
1. **Source: `src/core/` and `scripts/`**. Without these there is no product.
2. **DECISIONS.md**. Without this you will re-litigate every settled decision (especially GUI framework, dual interface, UX standards) and probably get it wrong differently.
3. **TECHNICAL.md**, especially Sections 9 (04/06 boundary) and 10 (per-script functional requirements). Without these you will rebuild the deduplicator with weaker fuzzy matching than the v1 launch spec demands and ship something that loses to free Excel.
4. **Streamlit GUI source (`src/gui/`)**. The primary buyer surface; without it the product reverts to CLI-only and the buyer persona will refund.
5. **PyInstaller spec + launcher + per-OS build configs** (`build/`). Reproducing the Streamlit-PyInstaller integration from scratch is 1-3 days of work.
6. **Apple Developer Program enrollment**. 1-2 week lead time. Start this first if Mac distribution matters.
7. **Hosted demo (`demo/streamlit_app.py`)**. Important marketing asset but not blocking for desktop sales.
8. Documentation files (USER-GUIDE, BUSINESS, README). Recoverable from memory + this guide.
9. CI config (`ci/build.yml`). Nice to have, not blocking.

435
docs/TECHNICAL.md Normal file
View File

@@ -0,0 +1,435 @@
# TECHNICAL.md - Technical Design, Build Pipeline, Standards
> **Creator-only document. Do not ship to buyers.**
**Version**: 1.6
**Last updated**: April 28, 2026
---
## 1. Architecture Overview
- Standalone tools with **dual interface**: CLI and GUI, both wrapping the same core library.
- GUI framework: **Streamlit**. Runs as a local web server, opens in the buyer's default browser. No internet used.
- Python 3.11+ runtime (bundled into the installer; the buyer never installs Python).
- Modular code, one concern per script. Core logic is library code; CLI and GUI are thin front-ends.
- Cross-platform from day one: Windows, macOS, Linux.
- PyInstaller produces standalone executables per OS. Buyer never sees Python, pip, venvs, or PATH.
- No internet required at runtime.
**Why dual interface (locked v1.2)**: The primary buyer persona is non-technical and will not use a CLI. The GUI is therefore the primary surface and is required at v1, not deferred. The CLI is retained for power users, automation, scheduled jobs, and future scripted workflows. Both share a single core; neither has features the other lacks (except interactive review, which only makes sense in GUI).
**Why Streamlit (locked v1.3)**: Fastest build velocity, lowest maintenance burden per added feature, hosted browser demo deployable as a marketing asset, future SaaS optionality. Selected over CustomTkinter (maintenance inactive since Jan 2024), plain Tkinter (UX gap at this price tier), Flet (ecosystem too young), PySide6 (overkill), and NiceGUI (smaller community). Full rationale in DECISIONS.md Section 4c.
This is a major change from the original Inno-Setup-only, CLI-only design. Rationale chain:
1. Requiring a buyer to install Python before using the product is the largest source of install friction (solved by PyInstaller in v1.1).
2. Requiring a non-technical buyer to use a CLI is the second-largest source of refund risk (solved by dual interface in v1.2).
3. Betting the GUI on an unmaintained library is the largest hidden technical risk (solved by Streamlit choice in v1.3).
---
## 2. Standard Bundle Structure (source repo)
Every bundle follows this layout in source. Core logic is shared, CLI and GUI are thin front-ends.
```
bundle-name/
├── src/
│ ├── __init__.py
│ ├── core/ # Shared business logic. No UI code here.
│ │ ├── __init__.py
│ │ ├── dedup.py # (example) the actual algorithm
│ │ └── io.py # File I/O, encoding/delimiter detection, etc.
│ ├── cli.py # Command-line interface (Typer). Thin wrapper over core.
│ └── gui/ # Streamlit front-end. Thin wrapper over core.
│ ├── __init__.py
│ ├── app.py # Main Streamlit entry point (st.set_page_config, layout)
│ ├── pages/ # Streamlit multi-page app (one page per script in the bundle)
│ │ ├── 1_Deduplicator.py
│ │ ├── 2_Text_Cleaner.py
│ │ ├── 3_Format_Standardizer.py
│ │ └── ...
│ └── components.py # Reusable Streamlit widgets and helpers
├── data_examples/ # Sample input files
├── tests/ # Unit tests (pytest). Tests target core, not UI.
├── build/
│ ├── pyinstaller.spec # PyInstaller build spec (handles both CLI + GUI entry points)
│ ├── launcher.py # Small launcher script: starts Streamlit server, opens browser
│ ├── windows/
│ │ └── installer.iss # Inno Setup wrapper for Windows .exe installer
│ ├── macos/
│ │ ├── entitlements.plist
│ │ └── dmg_settings.py # dmg-creation config
│ └── linux/
│ └── AppImage/ # AppImage build assets
├── demo/ # Stripped-down version for hosted browser demo
│ └── streamlit_app.py # Entry point for Streamlit Community Cloud deployment
├── requirements.txt
├── README_bundle.md # User-facing guide (covers both CLI and GUI usage)
├── LICENSE
└── ci/
└── build.yml # GitHub Actions cross-platform build
```
**Core/UI separation rule**: A new feature is implemented in `core/` first, with tests. CLI and GUI both call into core. If a feature exists only in one front-end (e.g., interactive review only in GUI), the underlying capability still lives in core; only the presentation differs.
**Demo subfolder rule**: The `demo/` folder contains a constrained Streamlit app for public deployment to Streamlit Community Cloud. Constraints: row limit (e.g., 100 rows max output), no file save, watermark on output, sample dataset only or strict file-size cap. Same core library, different front-end constraints.
---
## 3. Cross-Platform Build Pipeline
### 3.1 Tooling
| Concern | Tool |
|---|---|
| Bundling Python + scripts into a standalone binary | PyInstaller |
| GUI framework | **Streamlit** |
| Browser launch from launcher | Python `webbrowser` module (stdlib) |
| CLI framework | Typer |
| Windows installer wrapper | Inno Setup (free) |
| macOS bundle format | `.app` packaged in `.dmg` |
| macOS code signing & notarization | `codesign` + `notarytool` (built into Xcode command line tools) |
| Linux distribution format | AppImage (primary) + plain tarball (fallback) |
| CI / automated builds | GitHub Actions (free tier handles all three OS runners) |
| Hosted demo | Streamlit Community Cloud (free) or $5/mo VPS |
### 3.2 Build Outputs (what the buyer downloads)
| Platform | Output file | Buyer experience |
|---|---|---|
| Windows | `BundleName-Setup-1.0.exe` | Double-click installer, click through wizard. Desktop shortcut "Launch Bundle" runs `launcher.py`, which starts the local Streamlit server and opens default browser to `http://localhost:8501`. CLI executables also installed and on PATH. |
| macOS | `BundleName-1.0.dmg` | Double-click DMG, drag app to Applications. Signed and notarized. Launching the app runs the launcher, which starts the local server and opens the browser. CLI binaries shipped in the app bundle. |
| Linux | `BundleName-1.0.AppImage` | Mark executable, double-click. AppImage runs the launcher, opens browser. Tarball fallback also includes CLI binaries. |
The **default buyer experience on every platform is**: double-click, browser opens, work done. The CLI is present, documented, and on PATH for users who want it.
**Browser-launch UX mitigation** (per DECISIONS.md Section 4c tradeoff): The launcher script displays a brief "Starting your data tool..." console message before opening the browser. The Streamlit app's first page includes a one-line note: *"This tool runs locally in your browser and does not use the internet."* Install email reinforces the same message.
### 3.3 PyInstaller Configuration
Single `.spec` file per bundle, parameterized for OS. Key settings:
- `--onefile` for Linux (single AppImage), `--onedir` for Windows and macOS (faster startup, easier signing).
- All dependencies bundled. No internet required at runtime.
- Hidden imports declared explicitly for pandas/openpyxl/Streamlit edge cases (PyInstaller's auto-detection misses some).
- Icon files per platform (`.ico` for Windows, `.icns` for macOS, `.png` for Linux).
- **Two entry points per bundle**: the GUI launcher (default, what the desktop shortcut runs) and the CLI binaries.
- **Streamlit-specific PyInstaller hooks**: include the `streamlit` data directory, the `altair` data directory (Streamlit dependency), and the `pyarrow` C extensions. Add a custom hook file (`hook-streamlit.py`) per the documented pattern. Budget 1-3 days the first time getting the spec right; reuse across all subsequent bundles.
### 3.4 Streamlit Launcher Pattern
The launcher script handles starting the local Streamlit server in a way that survives PyInstaller bundling. Conceptual outline:
1. Find a free local port (avoid hardcoding 8501 in case of conflict).
2. Set Streamlit environment variables: `STREAMLIT_SERVER_HEADLESS=true`, `STREAMLIT_BROWSER_GATHER_USAGE_STATS=false`, `STREAMLIT_SERVER_PORT={port}`.
3. Start Streamlit programmatically (via `streamlit.web.cli.main_run` or `bootstrap.run`) in a background thread.
4. Wait for the port to accept connections (poll with timeout).
5. Open the buyer's default browser to `http://localhost:{port}` via `webbrowser.open()`.
6. Keep the launcher process alive while the server runs. Detect server shutdown and exit cleanly.
Optional v1.1 enhancement: replace step 5 with a `pywebview` window that wraps the local server. Eliminates the "default browser opens" UX surprise. Adds a dependency and some packaging complexity. Defer until support tickets show the browser-launch is causing meaningful confusion.
### 3.5 macOS Signing & Notarization Pipeline
Required setup (one-time):
1. Enroll in Apple Developer Program ($99/yr - see BUSINESS.md Section 10).
2. Generate Developer ID Application certificate via Apple Developer portal.
3. Install certificate in macOS keychain on the build machine (or store as encrypted GitHub Actions secret for CI).
4. Generate an app-specific password for `notarytool`.
Build-time flow (automated):
1. PyInstaller produces unsigned `.app`.
2. `codesign --deep --force --options runtime --sign "Developer ID Application: [Your Name]" BundleName.app`
3. Package into `.dmg`.
4. Submit `.dmg` to Apple notary service: `xcrun notarytool submit BundleName.dmg --wait`.
5. Staple the notarization ticket: `xcrun stapler staple BundleName.dmg`.
6. Output is the final, distributable `.dmg`.
Buyers on macOS see no Gatekeeper warnings. Clean install.
### 3.6 Windows Pipeline
1. PyInstaller produces `BundleName/` folder with launcher `BundleName.exe` (which opens the GUI in browser) plus CLI binaries plus dependencies.
2. Inno Setup script wraps the folder into `BundleName-Setup-1.0.exe`.
3. Installer creates Start Menu entry, desktop shortcut (launches GUI), optional Add/Remove Programs entry, and adds CLI binaries to PATH.
4. Optional Windows code signing certificate (~$200-400/yr from a CA) eliminates SmartScreen warnings. **Not required at launch**; revisit if SmartScreen friction shows up in support tickets.
### 3.7 Linux Pipeline
1. PyInstaller produces single-file binaries per entry point.
2. Wrap in AppImage using `appimagetool` (free, well-documented). AppImage runs the launcher as the default target.
3. Provide a plain `.tar.gz` fallback for users on distributions where AppImage fails. Tarball includes both GUI launcher and CLI binaries plus a `run.sh`.
4. No signing required on Linux; users execute `chmod +x` then double-click or run.
### 3.8 CI Build Matrix
GitHub Actions builds all three platforms on tagged release:
```yaml
# Conceptual, full file lives in ci/build.yml
strategy:
matrix:
os: [windows-latest, macos-latest, ubuntu-latest]
```
Result: one git tag triggers three platform builds. Artifacts upload to GitHub Releases. Manual step: copy artifacts to Gumroad / Lemon Squeezy product page.
### 3.9 Hosted Demo Deployment
Separate from the desktop build pipeline. The `demo/streamlit_app.py` entry point is deployed to Streamlit Community Cloud:
1. Connect the GitHub repo to Streamlit Community Cloud (one-time).
2. Configure the app to deploy from the `demo/` folder, main branch.
3. Set deployment-time environment variables (e.g., row limits, watermark flag).
4. App is publicly accessible at a `*.streamlit.app` URL. Link from Gumroad landing page.
5. Optional: custom domain via CNAME (free with Streamlit Community Cloud as of last check; verify before locking).
If Streamlit Community Cloud is ever unsuitable (rate limits, bandwidth, branding requirements), fall back to a $5/mo VPS running the demo via Docker. Same `demo/streamlit_app.py`, different host.
---
## 4. Libraries
| Purpose | Library |
|---|---|
| GUI framework | Streamlit |
| CLI framework | Typer |
| Data manipulation | pandas, openpyxl, numpy |
| Fuzzy string matching | rapidfuzz |
| File encoding detection | charset-normalizer |
| Logging | loguru |
| Progress bars | tqdm (CLI), `st.progress` (GUI) |
| Validation | pydantic (optional) |
| Reports | ReportLab (PDF), pandas styling (Excel) |
| Optional native window wrap | pywebview (deferred to v1.1 if needed) |
`requirements.txt` (current bundle, v1.3):
```
streamlit>=1.30
pandas
openpyxl
numpy
typer
rapidfuzz
charset-normalizer
loguru
tqdm
reportlab
pyarrow # Streamlit dependency, declare explicitly for PyInstaller clarity
altair # Streamlit dependency, declare explicitly for PyInstaller clarity
```
---
## 5. Coding Standards
### 5.1 Code Standards
- PEP 8 + type hints on all public functions.
- Docstrings on every module and public function.
- `--help` output (CLI) that a non-technical user can act on.
- Graceful error handling with human-readable messages, not stack traces. Errors must name the problem AND the likely fix where possible (e.g., "Column 'email' not found. Available columns: name, phone. Did you mean 'phone'?").
- All file paths handled via `pathlib.Path`, never string concatenation. Cross-platform correctness depends on this.
- All file I/O explicitly UTF-8-aware: detect encoding on input (charset-normalizer), write UTF-8 on output. Windows defaults to cp1252 otherwise.
- No platform-specific shell calls. If absolutely needed, branch on `sys.platform`.
- Unit tests for core logic (pytest). Tests target `core/`, not UI front-ends. Tests run on all three OS runners in CI.
- Semantic versioning per bundle.
- **Core/UI separation**: never put business logic in `cli.py` or `gui/`. If a CLI command and a GUI button do "the same thing," they call the same function in `core/`.
### 5.2 UX Standards (GUI / Streamlit) - load-bearing per DECISIONS.md Section 4b
- **Works out of the box**: dropping a file into the Streamlit `st.file_uploader` must produce a useful result with zero configuration.
- **Sensible defaults visible everywhere**: every `st.selectbox`, `st.slider`, `st.checkbox` has a default, the default is shown, the user is not forced through a config screen on first run.
- **Progressive disclosure**: basic view shows file uploader + go button + results. Advanced options live in `st.expander("Advanced options")` panes.
- **Plain-English labels**: no technical jargon in primary UI. `help=` parameter on widgets carries technical detail for users who want it.
- **Dry-run / preview by default**: user sees what would change before any file is written. Original input is never modified.
- **Single-page completion**: basic task completes on a single Streamlit page. Multi-page apps are for separate scripts in the bundle, not for multi-step wizards within one script.
- **Identical core to CLI**: any capability available in CLI is available in GUI, and vice versa. The only legitimate divergence is interactive review (GUI-natural) and scripted/scheduled execution (CLI-natural).
- **Local-first messaging**: every GUI page includes the line *"This tool runs locally in your browser and does not use the internet"* in a small, persistent location (e.g., footer or sidebar).
### 5.3 Functional Scope Standard - load-bearing per DECISIONS.md Section 4a
- Each script ships with **complete coverage of the workflow it names**, including features available free elsewhere (e.g., exact-match dedup).
- Scope boundary is the workflow, not "things adjacent to the workflow." A deduplicator includes normalization, survivor selection, audit. It does not include format conversion or charting; those belong elsewhere in the bundle.
---
## 6. System Requirements
**For buyers (runtime)**:
- Windows: Windows 10 or 11, 64-bit.
- macOS: macOS 11 (Big Sur) or later, Apple Silicon or Intel.
- Linux: any glibc-based distribution from 2020 onward (Ubuntu 20.04+, Fedora 33+, etc.).
- A modern default browser (Chrome, Edge, Firefox, Safari from the last 3 years). Used to display the local GUI; no internet required.
- ~400-500 MB free disk space (Streamlit packaging is larger than alternatives; this is an accepted tradeoff per DECISIONS.md Section 4c).
- No internet required after install. No Python install required ever.
**For developers (you)**:
- Python 3.11+.
- PyInstaller, Inno Setup (Windows builds), Xcode command line tools (macOS builds).
- Apple Developer Program membership ($99/yr) for macOS distribution.
- Git + GitHub account (for CI builds and Streamlit Community Cloud deployment of demos).
---
## 7. Per-Bundle Technical Notes
| Bundle | Status | Tech notes |
|---|---|---|
| Data Cleaning Mastery | Lead, 1/9 scripts complete (CLI only; needs Streamlit GUI port) | Cleaning, dedup, text hygiene, standardization, missing-value handling, outlier detection, type coercion, reporting. Scripts 04 (missing values) and 06 (outliers) are deliberately separate concerns; 04 runs first to neutralize sentinel codes before 06 computes statistics (see Section 9). Script 02 (text cleaner) runs first in the pipeline to normalize whitespace and special characters before any other operation. v1.3 spec: Streamlit GUI required at launch, with hosted demo deployed to Streamlit Community Cloud. |
| Automated Business Reporting | Not started | Aggregation -> styled PDF/Excel output |
| Ecommerce Data Pipeline | Not started | Extract -> clean -> export workflow |
| Small Business Finance | Not started | Bookkeeping summaries, simple reconciliation |
| Marketing Public Data Aggregation | Not started | Public API + scraping with respect for robots.txt and ToS |
| AI Ecommerce Aggregation (Shopify Pet) | Not started | Optional LLM enhancement, requires API key from buyer |
---
## 8. Open Technical Decisions
GUI framework choice is now **closed** (Streamlit, locked v1.3 - see DECISIONS.md Section 4c).
Remaining open items:
- **pywebview wrap of Streamlit launcher**: Optional v1.1 enhancement to eliminate the "browser opens" UX surprise. Defer until support tickets show meaningful buyer confusion. Cost: extra dependency, more PyInstaller complexity. Benefit: native-window UX.
- **Windows code signing**: Currently unsigned. Revisit if SmartScreen warnings drive support volume. Cost: ~$200-400/yr.
- **Auto-update mechanism**: None at launch. Email-delivered version updates. Revisit at 100+ paying customers per bundle.
- **Demo deployment hosting**: Streamlit Community Cloud at launch (free). Migrate to $5/mo VPS if rate limits, bandwidth, or branding constraints become an issue.
- **Code obfuscation**: Currently relying on license text + PyInstaller bundling. Decompilation is possible but unlikely for $49-79 products. Accept the risk.
- **Telemetry**: None. Consider privacy-respecting opt-in usage telemetry post-launch to inform roadmap, but only if explicit and disclosed.
---
## 9. Script Boundaries: 04 (Missing Values) vs 06 (Outliers)
The two scripts are deliberately separate. Original spec ("missing-value handler also does basic outlier flagging") was wrong: it conflated two different statistical operations and would have produced overlapping CLI flags, confused buyers, and brittle code.
### 9.1 Boundary
**`04_missing_value_handler.py` owns "what's not there"**:
- Detect disguised nulls: `NaN`, empty string, `"N/A"`, `"-"`, `"unknown"`, whitespace-only, sentinel codes (`-999`, `9999`, etc.).
- Missingness pattern analysis (which columns co-miss).
- Imputation strategies: mean, median, mode, forward-fill, KNN (optional), regression (optional).
- Required-field enforcement (drop rows missing a required column).
- Drop rows or columns by missingness threshold.
**`06_outlier_detector.py` owns "what shouldn't be there"**:
- Univariate statistical detection: z-score, IQR, modified z-score (MAD-based).
- Multivariate detection: Isolation Forest, Mahalanobis distance.
- Domain-rule violations (age > 120, negative quantity, future dates in historical data).
- Winsorization / capping as optional remediation.
- Distribution shape diagnostics.
### 9.2 Run Order
04 must run before 06. Reason: outlier statistics computed on data still containing `NaN` or sentinel codes are mathematically poisoned. Means and standard deviations get dragged, IQR widens, false negatives explode.
The Master Orchestrator (script 09) enforces this order. CLI users running scripts manually get a warning printed by 06 if it detects unhandled sentinel patterns in the input.
Pipeline-wide order enforced by the orchestrator: `02_text_cleaner``03_format_standardizer``04_missing_value_handler``05_column_mapper_enforcer``06_outlier_detector``07_multi_file_merger``08_validator_reporter`. Script `01_deduplicator` is order-flexible; it normalizes whitespace and case internally for matching purposes regardless of upstream cleaning, so it can run before or after `02_text_cleaner` depending on the buyer's workflow.
### 9.3 Contested Cases
**Use cases that prove 04 and 06 are distinct concerns** (not just naming differences):
- *Bank export with blank fee columns*: pure 04 job. The fees aren't outliers, they're missing. Imputation or drop-by-threshold is the right tool.
- *Sales data with one $1M order in a $50-average column*: pure 06 job. Nothing is missing; one row is statistically extreme. Z-score or IQR catches it.
- *Survey data where `999` means "refused to answer"*: needs both, in order. 04 converts `999` to `NaN` per `--sentinels`, then 06 computes statistics on the cleaned column.
Sentinel values like `-999` are *both* disguised missing *and* statistical outliers. Resolution: 04 owns sentinel detection and converts them to `NaN` (or imputes per user choice) before 06 sees the data. Sentinel patterns are configurable in 04 via `--sentinels` flag.
Suspicious-but-plausible values (e.g., age = 110): 06's territory. Not missing; just rare.
Whitespace-only cells (e.g., `" "`) are a contested case between 02 (text cleaner) and 04 (missing value handler). Resolution: 02 trims first, leaving an empty string. 04 then detects empty strings as disguised nulls per its existing logic. This means 02 must run before 04 in any pipeline that uses both. The orchestrator enforces this; CLI users get a warning if 04 detects whitespace-only cells suggesting 02 was skipped.
### 9.4 Shared Plumbing
Both scripts emit:
- A flagged-row report with column, row index, original value, action taken.
- A timestamped log file in `logs/`.
- An optional cleaned output file.
Report and log formats are identical between the two scripts. Implemented via shared helpers in `src/core/` to avoid drift.
---
## 10. Per-Script Functional Requirements
This section captures the full functional spec for each script, beyond the one-line description in USER-GUIDE.md Section 2. Specs answer "what does v1 need to ship to be best-of-class for the target buyer." Promoted from chat-history-only into docs in v1.6 to prevent silent drift.
**Note on script status**: a script labeled "Working" in the bundle status table means it has CLI execution and basic correctness, NOT that it implements every Tier 1 item below. Tier 1 is the v1 launch target; the current code may implement a subset.
### 10.1 `01_deduplicator.py` - Smart duplicate removal
**Current implementation status**: `01_deduplicator.py` exists and works for exact match plus basic fuzzy with configurable subset columns and timestamped logs (the description in USER-GUIDE.md Section 2 reflects current state). It does NOT yet implement most Tier 1 items below. Tier 1 is the v1 launch target, not current state. The Streamlit GUI port is the natural moment to close this gap.
**Strategic framing**: Excel's built-in Remove Duplicates handles exact match for free. Pandas `drop_duplicates()` handles it for free in code. A $49-$79 dedup tool that ships "exact + basic fuzzy" loses to Excel for free or to OpenRefine for free. The fuzzy matching has to be the product, not a checkbox. The market gap this script targets is "fuzzy match quality of OpenRefine, with the zero-learning-curve UX of Excel, sold once for under $100, runs locally" (see BUSINESS.md Section 4a).
#### Tier 1: Must-ship for v1 to be best-of-class
**Input handling**
1. Auto-detect file encoding (UTF-8, UTF-8-BOM, Latin-1, Windows-1252). Failure to handle this correctly is the #1 reason CSV tools crash on real-world business data.
2. Auto-detect delimiter (comma, tab, semicolon, pipe).
3. Read CSV, TSV, XLSX, XLS. For XLSX, support multi-sheet workbooks (let user pick or process each).
4. Handle files larger than RAM via chunked / streaming processing. A 500MB customer export should not crash the tool.
5. Header row detection with manual override.
**Matching**
6. Exact match with configurable subset columns.
7. Fuzzy match algorithms: Levenshtein, Jaro-Winkler, token-set ratio (rapidfuzz library). Multiple algorithms, not one. Different data types match better with different algorithms.
8. Per-column normalization before comparison:
- Email: lowercase, strip whitespace, strip Gmail dots, strip `+tag` suffixes.
- Phone: strip formatting and country codes, normalize to E.164.
- Name: strip titles (Mr/Ms/Dr), strip suffixes (Jr/III), collapse whitespace, optional case-fold.
- Address: USPS-style abbreviation normalization (St/Street, Ave/Avenue, Apt/#).
- Generic string: trim, collapse internal whitespace, optional case-fold.
9. Configurable similarity threshold (e.g., 85%, 90%, 95%) per match strategy.
10. Multi-strategy matching with OR logic: "match if email is exact OR (name fuzzy >90% AND phone exact)." This is what real-world dedup actually requires. Single-strategy match handles maybe 40% of cases.
**Survivor selection (which row to keep when duplicates are found)**
11. Configurable survivor rules: keep first, keep last, keep most-complete (fewest empty cells), keep most-recent (date column), keep manually-selected.
12. Merge mode: instead of deleting losers, fill missing fields in survivor from losers. Common real ask: combine partial records into one complete record.
**Trust and review**
13. Dry-run / preview mode by default. Output shows what *would* be merged before any file is written. Non-negotiable for trust. Aligns with Section 5.2 visible-safety standard.
14. Interactive review mode for uncertain matches. For matches in the gray zone (e.g., 75-90% similarity), prompt user yes/no/skip with side-by-side diff. This is what justifies a paid product over free Excel. GUI-natural; CLI gets a reduced-form prompt loop.
15. Confidence score on every fuzzy match in the output.
16. Match group export: separate file showing every group of matched rows so user can audit.
**Audit and safety**
17. Full timestamped log of every action: which rows matched, on which strategy, with what score, which row survived, which fields were merged.
18. Removed-duplicates exported to a separate file (never silently destroyed).
19. Original input file is never modified. Output is always a new file.
20. Idempotency: running the tool twice on the same input with the same config produces the same output.
**Configuration**
21. Save / load configuration profiles. A user who deduplicates a Shopify customer export weekly should configure once, not every run.
22. Sensible defaults that work on a typical messy CSV with zero configuration. The first run must produce a useful result with no flags. Per DECISIONS.md Section 4b UX standards.
**UX**
23. `--help` (CLI) written for non-technical users with concrete examples, not a flag list.
24. Progress bar for files over ~10K rows.
25. Error messages name the row number, column, and value that caused the problem. No raw stack traces. Per Section 5.1.
26. Sample data (`samples/messy_sales.csv`) must demonstrate fuzzy match scenarios, not just exact dupes. Otherwise the demo doesn't sell.
#### Tier 2: Worth-considering for v1.1
27. Numeric tolerance for matching (prices within $0.01 considered same).
28. Date tolerance for matching (dates within N days considered same).
29. Phonetic matching (Soundex, Metaphone) for name fields with common misspellings.
30. Blocking / indexing for performance on large files (compare only rows that share a first letter or first three characters of a key field). Without this, fuzzy match on 100K rows is O(n²) and unusable. Move to Tier 1 if early buyers report performance complaints.
31. Watch-folder mode: auto-process any file dropped into a folder.
32. Geolocation-aware address matching (optional, requires bundled USPS data or third-party API).
#### Tier 3: Optional / later
33. Machine-learning-based match scoring (Dedupe.io territory; high complexity, marginal gain for this price point).
34. Multi-table joins for cross-file dedup.
35. Schedule / cron integration.
36. Direct Shopify / Klaviyo / Mailchimp API integration to dedupe in place. This would be a real differentiator for the Shopify niche specifically and is probably the right v2 direction if early sales validate the niche.
### 10.2 - 10.9 (Future)
Functional specs for scripts 02 through 09 to be added when each script enters active build. The deduplicator spec is the template; specs for other scripts should follow the same Tier 1 / Tier 2 / Tier 3 structure with explicit strategic framing (what's the market gap this script fills, given that some of its functionality is available free elsewhere).

171
docs/USER-GUIDE.md Normal file
View File

@@ -0,0 +1,171 @@
# USER-GUIDE.md - Excel & CSV Data Cleaning Mastery Bundle
**Version**: 1.6
**Last updated**: April 28, 2026
Thank you for purchasing the Data Cleaning Mastery Bundle. This guide covers installation and every script included.
---
## 1. Installation
The bundle is fully self-contained. **You do not need to install Python.**
### Windows
1. Download `BundleName-Setup-1.0.exe` from your purchase email.
2. Double-click the installer.
3. Follow the wizard. The installer creates a desktop shortcut named "Launch Bundle" and an entry in Start Menu.
4. Launch via the desktop shortcut. Your default browser will open to a local page (typically `http://localhost:8501`) where the data tool runs.
### macOS
1. Download `BundleName-1.0.dmg` from your purchase email.
2. Double-click the `.dmg` to mount it.
3. Drag the Bundle app into the Applications folder.
4. Launch from Applications, Spotlight, or Launchpad. Your default browser will open to a local page where the data tool runs.
The app is signed and notarized by Apple, so it opens cleanly with no security warnings.
### Linux
1. Download `BundleName-1.0.AppImage` from your purchase email.
2. Make it executable: `chmod +x BundleName-1.0.AppImage`
3. Double-click to run, or execute from a terminal. Your default browser will open to a local page where the data tool runs.
If AppImage doesn't work on your distribution, a `.tar.gz` fallback is available in your purchase email. Extract it and run `./run.sh` from the extracted folder.
### How the GUI works (important to know)
This tool runs in your browser **locally on your computer**. When you launch it, a small program starts a local server on your machine and opens your default browser to view it. This is normal and expected.
- **No internet is required.** Your data never leaves your computer.
- **Your data is not uploaded anywhere.** All processing happens on your machine.
- The browser is just the display surface. Closing the browser closes the GUI; the underlying program also stops.
If you prefer the command line, every script also ships as a CLI tool. See Section 3.
### Requirements
- Windows: Windows 10 or 11 (64-bit).
- macOS: macOS 11 Big Sur or later (Apple Silicon or Intel).
- Linux: any modern 64-bit distribution from 2020 onward.
- A modern default browser (Chrome, Edge, Firefox, or Safari from the last 3 years).
- ~400-500 MB free disk space.
- Internet connection: not required.
---
## 2. What's Included
**Scripts (in the `scripts/` folder)**:
| # | Script | Purpose | Status |
|---|---|---|---|
| 01 | `01_deduplicator.py` | Smart duplicate removal: exact match + basic fuzzy, configurable subset columns, full logs | Working |
| 02 | `02_text_cleaner.py` | Character-level hygiene: trim leading/trailing whitespace, collapse internal multi-spaces, strip non-printable characters, Unicode normalization (smart quotes, em-dashes, accents), remove zero-width characters, BOM handling, line-ending normalization, case operations | Skeleton |
| 03 | `03_format_standardizer.py` | Standardize dates, currencies, names, phone numbers, addresses | Skeleton |
| 04 | `04_missing_value_handler.py` | Detect and handle missing values: disguised nulls (`N/A`, `-`, blanks, sentinel codes), imputation (mean/median/mode/forward-fill), required-field enforcement, drop-by-threshold | Skeleton |
| 05 | `05_column_mapper_enforcer.py` | Rename columns, enforce a target schema | Skeleton |
| 06 | `06_outlier_detector.py` | Detect and flag statistical outliers (z-score, IQR, modified z-score), multivariate detection, domain-rule violations, optional winsorization | Skeleton |
| 07 | `07_multi_file_merger.py` | Merge multiple CSV or Excel files into one | Skeleton |
| 08 | `08_validator_reporter.py` | Validate data against rules, output PDF or Excel report | Skeleton |
| 09 | `09_master_orchestrator.py` | One-click launcher menu, calls any other script | Skeleton |
**Sample data (in the `samples/` folder)**:
- `messy_sales.csv` - intentionally dirty sales data for testing.
- `bank_export.xlsx` - sample bank export for testing missing-value handling and outlier detection.
---
## 3. Usage
You have two ways to use the bundle: the GUI (recommended for most users) or the CLI (for power users and automation).
### 3.1 GUI usage (recommended)
1. Launch the bundle via the desktop shortcut, app icon, or AppImage.
2. Your browser opens to the bundle's home page.
3. Select the script you want to use from the sidebar (Deduplicator, Format Standardizer, etc.).
4. Drop your file into the file uploader, or select from the included samples.
5. Sensible defaults are pre-filled. Click "Run" to see a preview of what the script will do.
6. Review the preview. If it looks right, click "Save Output" to write the cleaned file.
The GUI is designed to work out of the box with zero configuration. Advanced options are tucked into expandable "Advanced" panes for users who want them.
### 3.2 CLI usage
All scripts are also CLI tools with `--help` output.
**Basic usage** (from a terminal):
Windows (the bundle adds CLI tools to your PATH):
```
deduplicator samples\messy_sales.csv
```
macOS / Linux:
```
deduplicator samples/messy_sales.csv
```
**With options**:
```
deduplicator samples/messy_sales.csv --output cleaned.csv --subset email,phone
```
**Get help on any script**:
```
deduplicator --help
```
**Recommended run order**: If you are running scripts individually, run `02_text_cleaner` first to normalize whitespace and special characters, then `04_missing_value_handler` *before* `06_outlier_detector`. Outlier detection on data still containing blanks or sentinel codes (like `-999`) produces unreliable results because missing-value placeholders distort the statistics (means get dragged, IQR widens, false negatives explode). The Master Orchestrator (script 09) runs them in the correct order automatically.
---
## 4. Output
Every script writes:
- A cleaned output file next to the input (or wherever you specify).
- A timestamped log file in the `logs/` folder showing what changed and why.
Reports from `validator_reporter` go to the `reports/` folder as PDF or Excel.
The GUI also displays the output preview in-browser before any file is written. The original input file is never modified.
---
## 5. Troubleshooting
**The GUI won't launch / browser doesn't open**:
1. Wait 10-15 seconds after double-clicking. The local server takes a moment to start the first time.
2. If the browser doesn't open automatically, manually visit `http://localhost:8501` in your browser.
3. If you see a "port in use" error, another program is using port 8501. Close other instances of the bundle and try again.
**"Why is my browser opening?" / "Why does this need internet?"**:
This tool runs as a local web app. The browser is just the display; nothing is uploaded, nothing leaves your computer. No internet connection is used after install. This is the same approach used by many modern data tools (Jupyter notebooks, RStudio, etc.).
**Windows: "Windows protected your PC" SmartScreen warning**:
Click "More info" then "Run anyway." This is a standard warning for software without an extended-validation Windows code signing certificate.
**macOS: "App is damaged and cannot be opened"**:
This usually indicates the download was corrupted. Re-download from the link in your purchase email.
**Linux: AppImage will not run**:
Make sure it is executable: `chmod +x BundleName-1.0.AppImage`. If it still fails, your distribution may be missing FUSE; install with `sudo apt install libfuse2` (Debian/Ubuntu) or use the `.tar.gz` fallback.
**Script throws an error about a file**:
Check the log file in the `logs/` folder. The log explains exactly what went wrong and which row of input data triggered it.
**The GUI feels slow on a large file**:
Files over ~100,000 rows take longer to process. The GUI shows a progress bar. If you have very large files (millions of rows) consider using the CLI directly, which is faster for batch jobs.
**Need help**: Email the address on your purchase receipt.
---
## 6. License
Single-user license. Do not redistribute. See `LICENSE.txt` in the install folder.

BIN
docs/project_files_1.6.zip Normal file

Binary file not shown.