Files

Michael abb720997e docs: tight, scannable rewrite — every item earns its place

Refactors all 10 docs (README, USER-GUIDE, CLI-REFERENCE, REQUIREMENTS,
TECHNICAL, DEVELOPER, BUSINESS, DECISIONS, RECOVERY, docs/README) from
prose-heavy to bullet-heavy + table-heavy. Same information density,
significantly less reading load.

Net: 2600 → 1652 lines (~37% reduction) WHILE adding the new content
that landed since v1.6:

- Format Standardizer (3rd Ready tool)
- 199-row buyer corpus
- src/core/errors.py structured hierarchy + ensure_dataframe /
  ensure_choice / wrap_file_read|write / format_for_user helpers
- src/core/_constants.py shared USPS/state lookup tables
- Cross-tool audit fixes (NaN matching, removed_df schema, validation,
  enum-bounds checks, forward-compat config)
- Per-domain error_policy across format standardizers
- Inconsistent-date-format detector
- Excel header-row auto-detection + write_file delimiter param

Per-doc changes:

- README.md (175 → 71): 9-tool table at top, status column, 3 CLI
  entry points listed, dropped repeated marketing prose.
- docs/README.md (38 → 27): pure index — buyer-facing vs creator-only
  split + version footer.
- USER-GUIDE.md (208 → 118): tool table replaces script descriptions,
  troubleshooting compressed to bullets, gate explanation tightened.
- CLI-REFERENCE.md (451 → 235): collapsed flag tables, removed
  redundant intro text, kept full recipes section.
- REQUIREMENTS.md (146 → 129): 18 numbered sections (was 17), added
  §18 Error Handling, formatting tightened to single-line entries.
- TECHNICAL.md (570 → 350): collapsed §3 build pipeline tables, merged
  redundant §3.5-3.7 OS sections, added §7 (Error handling) +
  §11.3 (Format Standardizer spec) + §11.4-11.7 (analyzer / gate /
  Review page / repair_bytes promoted from §10.2.x sub-numbering).
- DEVELOPER.md (285 → 161): module map table replaces per-file prose,
  extension recipes condensed, new §Errors covers when to use each
  hierarchy class.
- BUSINESS.md (278 → 225): collapsed prose to tables (use cases,
  competitive landscape, costs, risks); honest-status updated.
- DECISIONS.md (269 → 189): scoring rubric + GUI matrix preserved,
  decision log compressed to single-line entries, added v1.6 entries
  (Format Standardizer Ready, errors module).
- RECOVERY.md (180 → 147): rebuild steps as numbered + tabular,
  external dependencies as one table, recovery priorities tightened.

No information removed; redundancy compressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-01 02:49:29 +00:00

18 KiB

Raw Blame History

Technical

Creator-only. Do not ship to buyers. Version: 1.6 · Updated: 2026-05-01

1. Architecture

Dual interface: CLI + GUI, both wrapping the same src/core/ library.
GUI: Streamlit, runs as local web server, opens in default browser. No internet.
Runtime: Python 3.10+ (bundled into installer; buyer never sees Python).
Cross-platform: Windows, macOS, Linux from day one. PyInstaller per OS.
Core/UI rule: business logic in core/ only. CLI + GUI are thin front-ends.

Locks:

v1.2 — dual interface required (non-technical buyers won't use CLI).
v1.3 — Streamlit chosen (over CustomTkinter inactive, plain Tk UX gap, Flet/PySide6/NiceGUI each fails one dimension). See DECISIONS.md §4c.

2. Repo layout

src/
  core/                  # Shared logic. No UI code.
    analyze.py           # Detectors + Finding schema
    config.py            # DeduplicationConfig (JSON profiles)
    dedup.py             # Match strategies, union-find, survivor selection
    errors.py            # Structured error hierarchy + format_for_user
    fixes.py             # Fix registry (one per fix_action)
    format_standardize.py # Per-cell standardizers + DataFrame pipeline
    io.py                # read_file / write_file / repair_bytes
    normalize.py         # CSV-normalization gate
    normalizers.py       # Per-column normalizers for dedup matching
    text_clean.py        # clean_dataframe + smart_title_case
    _constants.py        # Shared USPS abbrevs + state names
  cli.py                 # Deduplicator CLI (Typer)
  cli_text_clean.py      # Text Cleaner CLI
  cli_analyze.py         # Analyzer CLI (--json)
  gui/
    app.py               # Streamlit entry point
    pages/               # One page per tool
    components/          # shared, dedup_review, findings, gate, _legacy
build/                   # PyInstaller spec, launcher, OS-specific configs
demo/                    # Constrained Streamlit Community Cloud version
tests/                   # pytest; targets core/, not UI
test-cases/              # Fixture corpora (text-cleaner, encodings, format-cleaner)

Demo subfolder: row-limited, watermarked, file-size-capped Streamlit app for public deployment. Same core, different front-end constraints.

3. Build pipeline

3.1 Tooling

Concern	Tool
Bundling	PyInstaller
GUI	Streamlit
CLI	Typer
Browser launch	stdlib `webbrowser`
Win installer	Inno Setup (free)
macOS sign+notarize	`codesign` + `notarytool`
Linux	AppImage (primary) + tarball fallback
CI	GitHub Actions matrix
Demo host	Streamlit Community Cloud (free)

3.2 Build outputs

OS	File	Buyer experience
Win	`*-Setup-1.0.exe`	Wizard → desktop shortcut "Launch Bundle" → browser opens. CLI on PATH.
macOS	`*-1.0.dmg`	Drag to Applications. Signed + notarized.
Linux	`*-1.0.AppImage`	`chmod +x`, double-click.

3.3 PyInstaller

--onefile for Linux, --onedir for Win/macOS (faster startup, easier signing).
Two entry points: GUI launcher + CLI binaries.
Streamlit hooks needed: streamlit, altair, pyarrow data dirs.
Custom hook-streamlit.py per documented pattern.
Budget: 1-3 days first time. Reusable after.

3.4 Streamlit launcher

Find free port (don't hardcode 8501).
Set env: STREAMLIT_SERVER_HEADLESS=true, STREAMLIT_BROWSER_GATHER_USAGE_STATS=false, STREAMLIT_SERVER_PORT={port}.
Start Streamlit programmatically in a thread.
Poll port until ready.
Open browser to http://localhost:{port}.
Keep launcher alive while server runs.

Optional v1.1: wrap with pywebview to eliminate browser-launch UX. Defer until support tickets show meaningful confusion.

3.5 macOS pipeline

PyInstaller produces unsigned .app.
codesign --deep --force --options runtime --sign "Developer ID Application: ..." App.app.
Package as .dmg.
xcrun notarytool submit *.dmg --wait.
xcrun stapler staple *.dmg.

Setup: Apple Developer Program ($99/yr), Developer ID cert in Keychain, app-specific password.

3.6-3.7 Win + Linux

Win: PyInstaller --onedir → Inno Setup wraps → installer adds Start Menu, desktop shortcut, PATH entries. Optional code-signing cert ($200-400/yr) if SmartScreen friction.
Linux: PyInstaller → appimagetool wraps. .tar.gz fallback for distros where AppImage fails.

3.8 CI matrix

strategy:
  matrix:
    os: [windows-latest, macos-latest, ubuntu-latest]

Tag a release → 3 platform artifacts upload to GitHub Releases. Manual: copy to Gumroad / Lemon Squeezy.

3.9 Hosted demo

demo/streamlit_app.py → Streamlit Community Cloud. Configure deployment in Streamlit UI. Custom domain via CNAME (verify policy at deploy time). Fall back to $5/mo VPS if rate limits / branding constraints hit.

4. Libraries

Purpose	Library
GUI	streamlit
CLI	typer
Data	pandas, openpyxl, numpy
Fuzzy match	rapidfuzz
Phone parsing	phonenumbers
Encoding detect	charset-normalizer
Logging	loguru
Mojibake (optional)	ftfy
Reports	reportlab

5. Coding standards

5.1 Code

PEP 8 + type hints on public functions.
Docstrings on every module + public function.
pathlib.Path for paths, never string concat.
All I/O explicitly UTF-8-aware.
No platform-specific shell calls.
pytest for core/, not UI.
Errors raise via src.core.errors hierarchy (Section 7).

5.2 GUI UX (load-bearing per DECISIONS.md §4b)

Works out of the box — drop file → useful result with zero config.
Sensible defaults visible everywhere.
Progressive disclosure — basic = file uploader + run button + results; rest in st.expander.
Plain-English labels; technical detail in help= tooltip.
Dry-run / preview by default.
Identical core to CLI.
Local-first messaging — "runs locally in your browser, no internet" line on every page.

5.3 Functional scope (load-bearing per DECISIONS.md §4a)

Each script ships complete coverage of the workflow it names, including features Excel does for free.
Boundary = the named workflow. Dedup includes normalization + survivor + audit; not format conversion or charting.

6. System requirements

Buyer runtime: Win 10/11 64-bit · macOS 11+ · Linux glibc 2020+ · modern browser · ~400-500 MB disk · no internet.

Developer: Python 3.10+ · PyInstaller · Inno Setup (Win) · Xcode CLT (macOS) · Apple Developer Program $99/yr · Git + GitHub.

7. Error handling (`src/core/errors.py`)

Structured hierarchy for friendly messages + maintainable trace context:

DataToolsError              # base; carries path/column/operation/suggestion/cause
  InputValidationError(ValueError)   # bad arg / wrong type
  ConfigError(ValueError)            # bad config / options
  FileFormatError(ValueError)        # file isn't what we expected
  FileAccessError(OSError)           # I/O failure (perms, disk, missing)

Subclassing rule: every subclass extends a stdlib base (ValueError or OSError) so existing except OSError / except ValueError handlers still catch them.

Helpers:

ensure_dataframe(value, function=...) — uniform DataFrame guard at every public entry.
ensure_choice(value, name=, choices=) — uniform enum/literal guard.
wrap_file_read(path, op, exc) / wrap_file_write(...) — tag OSError with file path + Windows-aware permission tip.
format_for_user(exc, context=) — single string for st.error() / CLI stderr.

GUI / CLI handlers use format_for_user() so the user always sees: file path, operation, underlying error class, recovery suggestion.

8. Per-bundle status

Bundle	Status
Data Cleaning Mastery	3/9 tools Ready (Dedup, Text Cleaner, Format Standardizer); 6 stubs
Automated Business Reporting	Not started
Ecommerce Data Pipeline	Not started
Small Business Finance	Not started
Marketing Public Data Aggregation	Not started
AI Ecommerce Aggregation (Shopify Pet)	Not started

9. Open decisions

pywebview wrap — defer until support tickets show browser-launch confusion.
Win code signing — defer until SmartScreen drives volume. Cost ~$200-400/yr.
Auto-update mechanism — none at launch. Email-delivered updates. Revisit at 100+ buyers/bundle.
Demo hosting migration — Streamlit Community Cloud → $5/mo VPS if rate/brand limits hit.
Code obfuscation — none; license text + bundle complexity sufficient at $49-79.
Telemetry — none. Consider opt-in privacy-respecting only post-launch.

10. Script boundaries — 04 (Missing Values) vs 06 (Outliers)

Deliberately separate. Confluent original spec was wrong.

Script	Owns
04 Missing Value Handler	"What's not there." Disguised nulls (`N/A`, `-`, sentinel codes), missingness patterns, imputation, drop-by-threshold.
06 Outlier Detector	"What shouldn't be there." z-score / IQR / modified-z, multivariate (Isolation Forest, Mahalanobis), domain rules, winsorization.

Run order: 04 before 06. Outlier stats on data with NaN / sentinels are mathematically poisoned (means dragged, IQR widens, false negatives).

Pipeline order (Pipeline Runner enforces): 02 → 03 → 04 → 05 → 06 → 07 → 08. 01 is order-flexible.

Contested cases:

Whitespace-only cell — 02 trims to empty; 04 then flags empty as null.
-999 sentinel — 04 converts to NaN first; 06 then computes stats.
Suspicious-but-plausible (age 110) — 06 territory.

11. Per-script functional specs

Specs live in this section as scripts enter active build. Each follows the Tier 1/2/3 structure with explicit strategic framing (what's the market gap given some of this is free elsewhere).

11.1 `01_deduplicator.py` — Smart duplicate removal

Status: Ready. Tier 1 mostly built. Streamlit GUI port complete.

Market gap: fuzzy match quality of OpenRefine, with the zero-learning UX of Excel, sold once for under $100, runs locally.

Tier 1:

Input: auto-detect encoding (UTF-8, UTF-8-BOM, Latin-1, cp1252) · delimiter · header row · CSV/TSV/XLSX/XLS · multi-sheet picker · streaming for files > RAM.
Matching: exact + 3 fuzzy algos (Levenshtein / Jaro-Winkler / token-set) · per-column normalizers (5 types) · configurable threshold per strategy · multi-strategy OR.
Survivor: keep first / last / most-complete / most-recent · merge mode (fill blanks from losers).
Trust: dry-run preview by default · interactive review for gray-zone matches · confidence score per match · match-group export.
Audit: timestamped log · removed-rows separate file · input never modified · idempotent.
Config: save/load JSON profiles · sensible auto-detect defaults.
UX: human --help · progress bar > 10k rows · errors name row + column + value + suggestion.

Tier 2: numeric/date tolerance · phonetic match (Soundex, Metaphone) · blocking/indexing · watch-folder.

Tier 3: ML scoring · cross-file dedup · cron · Shopify/Klaviyo API direct.

11.2 `02_text_cleaner.py` — Character-level hygiene

Status: Ready. Tier 1 built.

Market gap: one-click correctness for the dirty-CSV failure modes that cause silent VLOOKUP misses.

Boundary:

02 — whitespace, Unicode normalize, smart-char fold, BOM, line endings, zero-width, control chars, case ops. Writes to disk.
03 — dates, currencies, names, phones, addresses (display formatting).
04 — disguised nulls.
01 — normalize_string is match-time only, distinct from 02's write-time policy.

Tier 1 ops (each toggleable; defaults shown for excel-hygiene):

Trim leading/trailing whitespace — ON
Collapse internal whitespace runs — ON
NFC normalize — ON
NFKC compatibility fold — OFF (lossy, opt-in via paranoid preset)
Smart-char fold (curly quotes, em/en-dash, NBSP, ellipsis) — ON
Zero-width / invisible char strip — ON
BOM strip — ON
Control-char strip (preserve \t\n\r) — ON
Line-ending normalize (CRLF/CR → LF inside cells) — ON
Case conversion (UPPER / lower / Title / Sentence) — OFF, per-column

Scope: per-column selection · skip-list · operates on string-typed columns only.

Trust: dry-run by default · per-cell change log (capped 1000, --full-changelog removes cap) · 3 output files mirroring dedup · idempotent.

Config: 3 presets (minimal / excel-hygiene (default) / paranoid) · save/load JSON.

11.3 `03_format_standardizer.py` — Per-domain canonical forms

Status: Ready. Full Tier 1 + most Tier 2 built. 199-row buyer corpus passing.

Market gap: unify dates / phones / emails / addresses / names / currencies / booleans across messy ETL inputs without buyer writing code.

Domains:

Domain	Default canonical	Notable handling
Date	ISO 8601 (`YYYY-MM-DD`)	MDY/DMY, Excel serial, Unix timestamp (s + ms), longform months, year-month, quarter notation, French/German/Spanish month dictionaries (opt-in), buried-date regex, error sentinels for invalid dates
Phone	E.164 + `;ext=N`	libphonenumber, 001 international prefix handling, error sentinels for placeholders / multi-number / contamination
Email	lowercase + trim	display-name extraction, mailto/angle-bracket strip, smart-quote unwrap, optional `--gmail-canonical` mode
Address	USPS-canonical (`expand=False`) or expanded (`expand=True`)	state-name → 2-letter, multi-line collapse, PO Box normalize, state-code preservation regardless of input case
Name	smart Title Case	Mc/Mac/O'/D' inner caps, hyphen segments, particle lowercasing (von/van/de/da), comma-format reversal, period stripping for titles/suffixes/initials, PhD/MD acronym preservation, conservative mode
Currency	bare number (dot decimal)	auto-detect EU vs US separators, space-thousands, Swiss apostrophe, accounting parens, optional ISO code preservation
Boolean	`True`/`False` (configurable)	accepts `yes`/`no`/`y`/`n`/`1`/`0`/`on`/`off`

Per-domain error_policy: "passthrough" (default) keeps the original; "sentinel" emits <error: <reason>> for cases like Feb 30, double @, percentages mistaken for currency, etc.

Pipeline: standardize_dataframe(df, options) runs per-column with column_types: dict[str, FieldType]. Returns StandardizeResult with cells_changed, cells_unparseable, change audit. Warns when > 10% of typed cells fail to parse.

Presets: us-default, european, uk, iso-strict, legacy-us. Custom abbreviations via extra_abbreviations.

11.4 Upload-time analyzer (`src/core/analyze.py`)

Read-only advisory pass on every upload. Emits Finding objects:

Field	Meaning
`id`	Stable identifier (never localized)
`severity`	`info` / `warn` / `error` (only `error` blocks gate)
`confidence`	`high` (round-trip safe) / `medium` (preview) / `low` (heuristic)
`fix_action`	id of algorithm in `fixes.py` (empty for informational-only)
`pre_applied`	`true` if fix already ran during read pass
`tool`	owning tool id (or empty for file-level)
`count`	cells / rows affected
`description`	one-sentence human summary
`column`	column name (None for file-level)
`samples`	up to 5 `(row, col, value)` examples

Entry point: analyze(source, *, sample_rows=1000, repair_result=None, encoding_override=None). encoding_override skips charset detection — the hook that lets the Review page recover from misdetections.

11.5 CSV-normalization gate (`src/core/normalize.py`, `fixes.py`)

Two paths:

Auto-fix — auto_fix(df, findings) applies every confidence="high" finding whose fix_action is registered.
Per-finding decisions — apply_decisions(df, findings, decisions) accepts Decision(finding_id, action, payload) with action "auto"|"skip"|"modified".

Returns NormalizationResult with cleaned_df, cleaned_bytes (UTF-8 CSV), applied, skipped_findings, pending_findings, blocking_findings.

is_normalized(findings, result) re-runs analyze() against cleaned bytes; returns False if any high-confidence detector still fires (the strict contract tool pages depend on).

Fix registry: @register("fix_id") decorates (df, payload) → (new_df, n_cells_changed). New fix = one entry in analyze.py FIX_* constants + one detector emitting that fix_action + one registered function. No other call sites change.

11.6 Review page (`src/gui/pages/0_Review.py`)

Detected encoding + override picker (16 codepages + custom).
One expandable card per finding (sorted by severity then confidence) with: decision radio (Auto/Skip/Customize), live before/after preview built by running the registered fix on Finding.samples, payload editor for fixes that take user input.
Apply persists NormalizationResult keyed by upload SHA-256; tool pages refuse to load until hash matches.
⚙️ Advanced output options expander: per-download encoding + delimiter + line terminator. _build_output_bytes() returns (bytes, error_message); lossy fallbacks emit a warning the page surfaces.

Gates the entire tool sidebar via require_normalization_gate() in src/gui/components/_legacy.py.

11.7 Pre-parse repair (`src/core/io.py::repair_bytes`)

Byte-level pre-parse pass. Order is meaningful:

Wide-encoding transcode (UTF-16/32 → UTF-8) — must run first or NUL strip below shreds UTF-16.
UTF-8 BOM strip (file start only).
NUL strip — only meaningful after step 1, so flags genuine corruption.
Line-ending normalize — CRLF + bare CR → LF.
Byte-level smart-quote fold — curly / guillemet / double-prime → ASCII " (only structural double-quote-equivalents; single curlies deferred to cell-level).
Per-row delimiter repair — when a row has +1 field and merge candidate is currency-shaped ($1,500.00), merge + quote.

detect_encoding() tries strict UTF-8 first — charset-normalizer mislabels short-non-ASCII files as mac_latin2, but valid UTF-8 bytes mean UTF-8 regardless of label.

18 KiB Raw Blame History