README + USER-GUIDE describe the sidebar picker and current coverage (home + shared chrome, per-tool bodies pending). DEVELOPER gains a how-to for adding packs and keys with the parity-test guarantee. TECHNICAL §10b records the in-house-JSON architecture and locks in the no-gettext decision (also logged in DECISIONS). REQUIREMENTS reflects the new interface surface and updated test count. COPY.md adds a "Language claim" slot so landing/email work can pick it up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
21 KiB
Technical
Creator-only. Do not ship to buyers. Version: 1.6 · Updated: 2026-05-01
1. Architecture
- Dual interface: CLI + GUI, both wrapping the same
src/core/library. - GUI: Streamlit, runs as local web server, opens in default browser. No internet.
- Runtime: Python 3.10+ (bundled into installer; buyer never sees Python).
- Cross-platform: Windows, macOS, Linux from day one. PyInstaller per OS.
- Core/UI rule: business logic in
core/only. CLI + GUI are thin front-ends.
Locks:
- v1.2 — dual interface required (non-technical buyers won't use CLI).
- v1.3 — Streamlit chosen (over CustomTkinter inactive, plain Tk UX gap, Flet/PySide6/NiceGUI each fails one dimension). See DECISIONS.md §4c.
2. Repo layout
src/
core/ # Shared logic. No UI code.
analyze.py # Detectors + Finding schema
config.py # DeduplicationConfig (JSON profiles)
dedup.py # Match strategies, union-find, survivor selection
errors.py # Structured error hierarchy + format_for_user
fixes.py # Fix registry (one per fix_action)
format_standardize.py # Per-cell standardizers + DataFrame pipeline
io.py # read_file / write_file / repair_bytes
normalize.py # CSV-normalization gate
normalizers.py # Per-column normalizers for dedup matching
text_clean.py # clean_dataframe + smart_title_case
_constants.py # Shared USPS abbrevs + state names
cli.py # Deduplicator CLI (Typer)
cli_text_clean.py # Text Cleaner CLI
cli_analyze.py # Analyzer CLI (--json)
gui/
app.py # Streamlit entry point
pages/ # One page per tool
components/ # shared, dedup_review, findings, gate, _legacy
i18n/ # GUI language packs (JSON-backed, in-house lookup)
__init__.py # t() · current_language() · render_language_selector()
packs/ # en.json, es.json, … (one file per language)
build/ # PyInstaller spec, launcher, OS-specific configs
demo/ # Constrained Streamlit Community Cloud version
tests/ # pytest; targets core/, not UI
test-cases/ # Fixture corpora (text-cleaner, encodings, format-cleaner)
Demo subfolder: row-limited, watermarked, file-size-capped Streamlit app for public deployment. Same core, different front-end constraints.
3. Build pipeline
3.1 Tooling
| Concern | Tool |
|---|---|
| Bundling | PyInstaller |
| GUI | Streamlit |
| CLI | Typer |
| Browser launch | stdlib webbrowser |
| Win installer | Inno Setup (free) |
| macOS sign+notarize | codesign + notarytool |
| Linux | AppImage (primary) + tarball fallback |
| CI | GitHub Actions matrix |
| Demo host | Streamlit Community Cloud (free) |
3.2 Build outputs
| OS | File | Buyer experience |
|---|---|---|
| Win | *-Setup-1.0.exe |
Wizard → desktop shortcut "Launch Bundle" → browser opens. CLI on PATH. |
| macOS | *-1.0.dmg |
Drag to Applications. Signed + notarized. |
| Linux | *-1.0.AppImage |
chmod +x, double-click. |
3.3 PyInstaller
--onefilefor Linux,--onedirfor Win/macOS (faster startup, easier signing).- Two entry points: GUI launcher + CLI binaries.
- Streamlit hooks needed:
streamlit,altair,pyarrowdata dirs. - Custom
hook-streamlit.pyper documented pattern. - Budget: 1-3 days first time. Reusable after.
3.4 Streamlit launcher
- Find free port (don't hardcode 8501).
- Set env:
STREAMLIT_SERVER_HEADLESS=true,STREAMLIT_BROWSER_GATHER_USAGE_STATS=false,STREAMLIT_SERVER_PORT={port}. - Start Streamlit programmatically in a thread.
- Poll port until ready.
- Open browser to
http://localhost:{port}. - Keep launcher alive while server runs.
Optional v1.1: wrap with pywebview to eliminate browser-launch UX. Defer until support tickets show meaningful confusion.
3.5 macOS pipeline
- PyInstaller produces unsigned
.app. codesign --deep --force --options runtime --sign "Developer ID Application: ..." App.app.- Package as
.dmg. xcrun notarytool submit *.dmg --wait.xcrun stapler staple *.dmg.
Setup: Apple Developer Program ($99/yr), Developer ID cert in Keychain, app-specific password.
3.6-3.7 Win + Linux
- Win: PyInstaller
--onedir→ Inno Setup wraps → installer adds Start Menu, desktop shortcut, PATH entries. Optional code-signing cert ($200-400/yr) if SmartScreen friction. - Linux: PyInstaller →
appimagetoolwraps..tar.gzfallback for distros where AppImage fails.
3.8 CI matrix
strategy:
matrix:
os: [windows-latest, macos-latest, ubuntu-latest]
Tag a release → 3 platform artifacts upload to GitHub Releases. Manual: copy to Gumroad / Lemon Squeezy.
3.9 Hosted demo
demo/streamlit_app.py → Streamlit Community Cloud. Configure deployment in Streamlit UI. Custom domain via CNAME (verify policy at deploy time). Fall back to $5/mo VPS if rate limits / branding constraints hit.
4. Libraries
| Purpose | Library |
|---|---|
| GUI | streamlit |
| CLI | typer |
| Data | pandas, openpyxl, numpy |
| Fuzzy match | rapidfuzz |
| Phone parsing | phonenumbers |
| Encoding detect | charset-normalizer |
| Logging | loguru |
| Mojibake (optional) | ftfy |
| Reports | reportlab |
5. Coding standards
5.1 Code
- PEP 8 + type hints on public functions.
- Docstrings on every module + public function.
pathlib.Pathfor paths, never string concat.- All I/O explicitly UTF-8-aware.
- No platform-specific shell calls.
- pytest for
core/, not UI. - Errors raise via
src.core.errorshierarchy (Section 7).
5.2 GUI UX (load-bearing per DECISIONS.md §4b)
- Works out of the box — drop file → useful result with zero config.
- Sensible defaults visible everywhere.
- Progressive disclosure — basic = file uploader + run button + results; rest in
st.expander. - Plain-English labels; technical detail in
help=tooltip. - Dry-run / preview by default.
- Identical core to CLI.
- Local-first messaging — "runs locally in your browser, no internet" line on every page.
5.3 Functional scope (load-bearing per DECISIONS.md §4a)
- Each script ships complete coverage of the workflow it names, including features Excel does for free.
- Boundary = the named workflow. Dedup includes normalization + survivor + audit; not format conversion or charting.
6. System requirements
Buyer runtime: Win 10/11 64-bit · macOS 11+ · Linux glibc 2020+ · modern browser · ~400-500 MB disk · no internet.
Developer: Python 3.10+ · PyInstaller · Inno Setup (Win) · Xcode CLT (macOS) · Apple Developer Program $99/yr · Git + GitHub.
7. Error handling (src/core/errors.py)
Structured hierarchy for friendly messages + maintainable trace context:
DataToolsError # base; carries path/column/operation/suggestion/cause
InputValidationError(ValueError) # bad arg / wrong type
ConfigError(ValueError) # bad config / options
FileFormatError(ValueError) # file isn't what we expected
FileAccessError(OSError) # I/O failure (perms, disk, missing)
Subclassing rule: every subclass extends a stdlib base (ValueError or OSError) so existing except OSError / except ValueError handlers still catch them.
Helpers:
ensure_dataframe(value, function=...)— uniform DataFrame guard at every public entry.ensure_choice(value, name=, choices=)— uniform enum/literal guard.wrap_file_read(path, op, exc)/wrap_file_write(...)— tag OSError with file path + Windows-aware permission tip.format_for_user(exc, context=)— single string forst.error()/ CLI stderr.
GUI / CLI handlers use format_for_user() so the user always sees: file path, operation, underlying error class, recovery suggestion.
8. Per-bundle status
| Bundle | Status |
|---|---|
| Data Cleaning Mastery | 3/9 tools Ready (Dedup, Text Cleaner, Format Standardizer); 6 stubs |
| Automated Business Reporting | Not started |
| Ecommerce Data Pipeline | Not started |
| Small Business Finance | Not started |
| Marketing Public Data Aggregation | Not started |
| AI Ecommerce Aggregation (Shopify Pet) | Not started |
9. Open decisions
- pywebview wrap — defer until support tickets show browser-launch confusion.
- Win code signing — defer until SmartScreen drives volume. Cost ~$200-400/yr.
- Auto-update mechanism — none at launch. Email-delivered updates. Revisit at 100+ buyers/bundle.
- Demo hosting migration — Streamlit Community Cloud → $5/mo VPS if rate/brand limits hit.
- Code obfuscation — none; license text + bundle complexity sufficient at $49-79.
- Telemetry — none. Consider opt-in privacy-respecting only post-launch.
10. Script boundaries — 04 (Missing Values) vs 06 (Outliers)
Deliberately separate. Confluent original spec was wrong.
| Script | Owns |
|---|---|
| 04 Missing Value Handler | "What's not there." Disguised nulls (N/A, -, sentinel codes), missingness patterns, imputation, drop-by-threshold. |
| 06 Outlier Detector | "What shouldn't be there." z-score / IQR / modified-z, multivariate (Isolation Forest, Mahalanobis), domain rules, winsorization. |
Run order: 04 before 06. Outlier stats on data with NaN / sentinels are mathematically poisoned (means dragged, IQR widens, false negatives).
Pipeline order (Pipeline Runner enforces): 02 → 03 → 04 → 05 → 06 → 07 → 08. 01 is order-flexible.
Contested cases:
- Whitespace-only cell — 02 trims to empty; 04 then flags empty as null.
-999sentinel — 04 converts toNaNfirst; 06 then computes stats.- Suspicious-but-plausible (age 110) — 06 territory.
10b. GUI internationalization (i18n)
The GUI uses an in-house, JSON-backed translation layer at src/i18n/. No gettext / babel / .po pipeline — the surface is small enough that a 100-line module + per-language JSON file is a better fit than a build-time toolchain.
Resolution model: t(key, lang=None, **fmt) walks a dotted key (home.title, tools.01_deduplicator.name) through a nested dict. Fallback chain: requested lang → English (canonical) → the literal key. Missing format placeholders return the raw template rather than raising so a translation file cannot crash the UI.
Active language is stored in st.session_state["ui_lang"]. Reading it outside a Streamlit run (tests, scripts) silently falls back to English, keeping the module importable without Streamlit context.
Picker placement: hide_streamlit_chrome() calls render_language_selector() on every page that hides Streamlit's default chrome — i.e., the entire app. One mount point, every page picks it up.
Pack parity is a tested invariant: tests/test_lang_packs.py::TestPackParity fails CI when en.json and another pack diverge in either direction. This catches translation drift at PR time rather than from buyer reports.
Farewell overlay: the shutdown screen's JS payload interpolates pack strings into an innerHTML inside a JS single-quoted string. _js_html_safe() in components/_legacy.py escapes both the JS string terminator (') and HTML special chars (< > &). The test TestFarewellEscape pins this; never bypass it.
Why not gettext: zero compiled artifacts in the PyInstaller bundle, no build step before tests run, no .po/.mo round-trip for translators (anyone can edit JSON), and the same lookup works in unit tests without process state. Locked in because the surface won't grow large enough to need the alternative, and the alternative breaks the "drop a file, run pytest, ship" loop.
11. Per-script functional specs
Specs live in this section as scripts enter active build. Each follows the Tier 1/2/3 structure with explicit strategic framing (what's the market gap given some of this is free elsewhere).
11.1 01_deduplicator.py — Smart duplicate removal
Status: Ready. Tier 1 mostly built. Streamlit GUI port complete.
Market gap: fuzzy match quality of OpenRefine, with the zero-learning UX of Excel, sold once for under $100, runs locally.
Tier 1:
- Input: auto-detect encoding (UTF-8, UTF-8-BOM, Latin-1, cp1252) · delimiter · header row · CSV/TSV/XLSX/XLS · multi-sheet picker · streaming for files > RAM.
- Matching: exact + 3 fuzzy algos (Levenshtein / Jaro-Winkler / token-set) · per-column normalizers (5 types) · configurable threshold per strategy · multi-strategy OR.
- Survivor: keep first / last / most-complete / most-recent · merge mode (fill blanks from losers).
- Trust: dry-run preview by default · interactive review for gray-zone matches · confidence score per match · match-group export.
- Audit: timestamped log · removed-rows separate file · input never modified · idempotent.
- Config: save/load JSON profiles · sensible auto-detect defaults.
- UX: human
--help· progress bar > 10k rows · errors name row + column + value + suggestion.
Tier 2: numeric/date tolerance · phonetic match (Soundex, Metaphone) · blocking/indexing · watch-folder.
Tier 3: ML scoring · cross-file dedup · cron · Shopify/Klaviyo API direct.
11.2 02_text_cleaner.py — Character-level hygiene
Status: Ready. Tier 1 built.
Market gap: one-click correctness for the dirty-CSV failure modes that cause silent VLOOKUP misses.
Boundary:
- 02 — whitespace, Unicode normalize, smart-char fold, BOM, line endings, zero-width, control chars, case ops. Writes to disk.
- 03 — dates, currencies, names, phones, addresses (display formatting).
- 04 — disguised nulls.
- 01 —
normalize_stringis match-time only, distinct from 02's write-time policy.
Tier 1 ops (each toggleable; defaults shown for excel-hygiene):
- Trim leading/trailing whitespace — ON
- Collapse internal whitespace runs — ON
- NFC normalize — ON
- NFKC compatibility fold — OFF (lossy, opt-in via
paranoidpreset) - Smart-char fold (curly quotes, em/en-dash, NBSP, ellipsis) — ON
- Zero-width / invisible char strip — ON
- BOM strip — ON
- Control-char strip (preserve
\t\n\r) — ON - Line-ending normalize (CRLF/CR → LF inside cells) — ON
- Case conversion (UPPER / lower / Title / Sentence) — OFF, per-column
Scope: per-column selection · skip-list · operates on string-typed columns only.
Trust: dry-run by default · per-cell change log (capped 1000, --full-changelog removes cap) · 3 output files mirroring dedup · idempotent.
Config: 3 presets (minimal / excel-hygiene (default) / paranoid) · save/load JSON.
11.3 03_format_standardizer.py — Per-domain canonical forms
Status: Ready. Full Tier 1 + most Tier 2 built. 199-row buyer corpus passing.
Market gap: unify dates / phones / emails / addresses / names / currencies / booleans across messy ETL inputs without buyer writing code.
Domains:
| Domain | Default canonical | Notable handling |
|---|---|---|
| Date | ISO 8601 (YYYY-MM-DD) |
MDY/DMY, Excel serial, Unix timestamp (s + ms), longform months, year-month, quarter, ISO week date (2024-W03-1), ISO ordinal (2024-015), RFC 2822, CJK separators (2024年01月15日), fullwidth digits, named-TZ resolution (EST/PST/JST/…), two_digit_year_cutoff |
| Phone | E.164 + ;ext=N |
libphonenumber, 001 international prefix, error sentinels for placeholders / multi-number / contamination |
| lowercase + trim | display-name extraction, mailto/angle-bracket strip, smart-quote unwrap, BIDI/RTL override strip (security), optional --gmail-canonical |
|
| Address | USPS-canonical (expand=False) or expanded (expand=True) |
state/province-name → code for US/CA/AU/DE, UK postcode detection, multi-line collapse, PO Box normalize, state-code preservation regardless of input case |
| Name | smart Title Case | Mc/Mac/O'/D' inner caps, Arabic al-/el- lowercase, particle lowercasing (von/van/de/da/bin/ibn/ben), East Asian honorific suffixes (-san/-sama/-ssi), comma reversal (skippable via family_first), period stripping for titles/suffixes/initials, PhD/MD/Mag/Habil acronyms |
| Currency | bare number (dot decimal) | auto-detect EU vs US separators, space-thousands, Swiss apostrophe, accounting parens, optional ISO code preservation |
| Boolean | True/False (configurable) |
accepts yes/no/y/n/1/0/on/off |
International coverage (added v1.7):
- Date locales: English (default) plus opt-in French / German / Spanish / Portuguese / Italian / Dutch / Russian month + weekday recognition.
- Currency symbols: $, €, £, ¥, ₹, ₩, ₽, ₪, ₺, ¢ + ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN), ₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS).
- ISO 4217 codes: 23 baseline (USD, EUR, …) plus ~30 emerging-market additions (SAR, AED, ARS, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS, KES, HUF, CZK, RON, UAH, …).
- Address jurisdictions: US, Canada (13 provinces/territories), Australia (8 states), Germany (16 Bundesländer), UK (postcode shape).
- Address PO Box: English, German (
Postfach), French (Boîte postale), Spanish (Apartado), Italian (Casella postale), Portuguese (Caixa postal).
Per-domain error_policy: "passthrough" (default) keeps the original; "sentinel" emits <error: <reason>> for cases like Feb 30, double @, percentages mistaken for currency, etc.
Pipeline: standardize_dataframe(df, options) runs per-column with column_types: dict[str, FieldType]. Returns StandardizeResult with cells_changed, cells_unparseable, change audit. Warns when > 10% of typed cells fail to parse.
Presets: us-default, european, uk, iso-strict, legacy-us. Custom abbreviations via extra_abbreviations. Per-column culture flags: name_family_first (East Asian), address_state_to_code (any of 4 supported jurisdictions), date_month_locales (list of 8 supported codes).
11.4 Upload-time analyzer (src/core/analyze.py)
Read-only advisory pass on every upload. Emits Finding objects:
| Field | Meaning |
|---|---|
id |
Stable identifier (never localized) |
severity |
info / warn / error (only error blocks gate) |
confidence |
high (round-trip safe) / medium (preview) / low (heuristic) |
fix_action |
id of algorithm in fixes.py (empty for informational-only) |
pre_applied |
true if fix already ran during read pass |
tool |
owning tool id (or empty for file-level) |
count |
cells / rows affected |
description |
one-sentence human summary |
column |
column name (None for file-level) |
samples |
up to 5 (row, col, value) examples |
Entry point: analyze(source, *, sample_rows=1000, repair_result=None, encoding_override=None). encoding_override skips charset detection — the hook that lets the Review page recover from misdetections.
11.5 CSV-normalization gate (src/core/normalize.py, fixes.py)
Two paths:
- Auto-fix —
auto_fix(df, findings)applies everyconfidence="high"finding whosefix_actionis registered. - Per-finding decisions —
apply_decisions(df, findings, decisions)acceptsDecision(finding_id, action, payload)with action"auto"|"skip"|"modified".
Returns NormalizationResult with cleaned_df, cleaned_bytes (UTF-8 CSV), applied, skipped_findings, pending_findings, blocking_findings.
is_normalized(findings, result) re-runs analyze() against cleaned bytes; returns False if any high-confidence detector still fires (the strict contract tool pages depend on).
Fix registry: @register("fix_id") decorates (df, payload) → (new_df, n_cells_changed). New fix = one entry in analyze.py FIX_* constants + one detector emitting that fix_action + one registered function. No other call sites change.
11.6 Review page (src/gui/pages/0_Review.py)
- Detected encoding + override picker (16 codepages + custom).
- One expandable card per finding (sorted by severity then confidence) with: decision radio (Auto/Skip/Customize), live before/after preview built by running the registered fix on
Finding.samples, payload editor for fixes that take user input. - Apply persists
NormalizationResultkeyed by upload SHA-256; tool pages refuse to load until hash matches. ⚙️ Advanced output optionsexpander: per-download encoding + delimiter + line terminator._build_output_bytes()returns(bytes, error_message); lossy fallbacks emit a warning the page surfaces.
Gates the entire tool sidebar via require_normalization_gate() in src/gui/components/_legacy.py.
11.7 Pre-parse repair (src/core/io.py::repair_bytes)
Byte-level pre-parse pass. Order is meaningful:
- Wide-encoding transcode (UTF-16/32 → UTF-8) — must run first or NUL strip below shreds UTF-16.
- UTF-8 BOM strip (file start only).
- NUL strip — only meaningful after step 1, so flags genuine corruption.
- Line-ending normalize — CRLF + bare CR → LF.
- Byte-level smart-quote fold — curly / guillemet / double-prime → ASCII
"(only structural double-quote-equivalents; single curlies deferred to cell-level). - Per-row delimiter repair — when a row has +1 field and merge candidate is currency-shaped (
$1,500.00), merge + quote.
detect_encoding() tries strict UTF-8 first — charset-normalizer mislabels short-non-ASCII files as mac_latin2, but valid UTF-8 bytes mean UTF-8 regardless of label.