# Technical > Creator-only. Do not ship to buyers. > **Version**: 1.6 · **Updated**: 2026-05-01 For the end-to-end picture (desktop app + license server + storefronts + email), see `ARCHITECTURE.md`. This doc focuses on desktop internals. ## 1. Architecture - **Dual interface**: CLI + GUI, both wrapping the same `src/core/` library. - **GUI**: Streamlit, runs as local web server, opens in default browser. No internet. - **Runtime**: Python 3.10+ (bundled into installer; buyer never sees Python). - **Cross-platform**: Windows, macOS, Linux from day one. PyInstaller per OS. - **Core/UI rule**: business logic in `core/` only. CLI + GUI are thin front-ends. **Locks**: - v1.2 — dual interface required (non-technical buyers won't use CLI). - v1.3 — Streamlit chosen (over CustomTkinter inactive, plain Tk UX gap, Flet/PySide6/NiceGUI each fails one dimension). See DECISIONS.md §4c. ## 2. Repo layout ``` src/ core/ # Shared logic. No UI code. analyze.py # Detectors + Finding schema config.py # DeduplicationConfig (JSON profiles) dedup.py # Match strategies, union-find, survivor selection errors.py # Structured error hierarchy + format_for_user fixes.py # Fix registry (one per fix_action) format_standardize.py # Per-cell standardizers + DataFrame pipeline io.py # read_file / write_file / repair_bytes normalize.py # CSV-normalization gate normalizers.py # Per-column normalizers for dedup matching text_clean.py # clean_dataframe + smart_title_case _constants.py # Shared USPS abbrevs + state names cli.py # Find Duplicates CLI (Typer) cli_text_clean.py # Clean Text CLI cli_analyze.py # Analyzer CLI (--json) gui/ app.py # Streamlit entry point pages/ # One page per tool components/ # shared, dedup_review, findings, gate, _legacy i18n/ # GUI language packs (JSON-backed, in-house lookup) __init__.py # t() · current_language() · render_language_selector() packs/ # en.json, es.json, … (one file per language) build/ # PyInstaller spec, launcher, OS-specific configs demo/ # Constrained Streamlit Community Cloud version tests/ # pytest; targets core/, not UI test-cases/ # Fixture corpora (text-cleaner, encodings, format-cleaner) ``` **Demo subfolder**: row-limited, watermarked, file-size-capped Streamlit app for public deployment. Same core, different front-end constraints. ## 3. Build pipeline ### 3.1 Tooling | Concern | Tool | |---------|------| | Bundling | PyInstaller | | GUI | Streamlit | | CLI | Typer | | Browser launch | stdlib `webbrowser` | | Win installer | Inno Setup (free) | | macOS sign+notarize | `codesign` + `notarytool` | | Linux | AppImage (primary) + tarball fallback | | CI | GitHub Actions matrix | | Demo host | Streamlit Community Cloud (free) | ### 3.2 Build outputs | OS | File | Buyer experience | |----|------|------------------| | Win | `*-Setup-1.0.exe` | Wizard → desktop shortcut "Launch Bundle" → browser opens. CLI on PATH. | | macOS | `*-1.0.dmg` | Drag to Applications. Signed + notarized. | | Linux | `*-1.0.AppImage` | `chmod +x`, double-click. | ### 3.3 PyInstaller - `--onefile` for Linux, `--onedir` for Win/macOS (faster startup, easier signing). - Two entry points: GUI launcher + CLI binaries. - Streamlit hooks needed: `streamlit`, `altair`, `pyarrow` data dirs. - Custom `hook-streamlit.py` per documented pattern. - Budget: 1-3 days first time. Reusable after. ### 3.4 Streamlit launcher 1. Find free port (don't hardcode 8501). 2. Set env: `STREAMLIT_SERVER_HEADLESS=true`, `STREAMLIT_BROWSER_GATHER_USAGE_STATS=false`, `STREAMLIT_SERVER_PORT={port}`. 3. Start Streamlit programmatically in a thread. 4. Poll port until ready. 5. Open browser to `http://localhost:{port}`. 6. Keep launcher alive while server runs. Optional v1.1: wrap with `pywebview` to eliminate browser-launch UX. Defer until support tickets show meaningful confusion. ### 3.5 macOS pipeline 1. PyInstaller produces unsigned `.app`. 2. `codesign --deep --force --options runtime --sign "Developer ID Application: ..." App.app`. 3. Package as `.dmg`. 4. `xcrun notarytool submit *.dmg --wait`. 5. `xcrun stapler staple *.dmg`. Setup: Apple Developer Program ($99/yr), Developer ID cert in Keychain, app-specific password. ### 3.6-3.7 Win + Linux - **Win**: PyInstaller `--onedir` → Inno Setup wraps → installer adds Start Menu, desktop shortcut, PATH entries. Optional code-signing cert ($200-400/yr) if SmartScreen friction. - **Linux**: PyInstaller → `appimagetool` wraps. `.tar.gz` fallback for distros where AppImage fails. ### 3.8 CI matrix ```yaml strategy: matrix: os: [windows-latest, macos-latest, ubuntu-latest] ``` Tag a release → 3 platform artifacts upload to GitHub Releases. Manual: copy to Gumroad / Lemon Squeezy. ### 3.9 Hosted demo `demo/streamlit_app.py` → Streamlit Community Cloud. Configure deployment in Streamlit UI. Custom domain via CNAME (verify policy at deploy time). Fall back to $5/mo VPS if rate limits / branding constraints hit. ### 3.10 Bundled Tesseract (PDF Extractor OCR) Frozen builds ship Tesseract 5.5 + `eng.traineddata` inside the PyInstaller bundle so scanned PDFs work without a separate install. Per-platform binary URLs pinned in `build/tesseract.py`; tessdata vendored at `build/vendor/tessdata/eng.traineddata`. License attribution in `LICENSE_TESSERACT.txt` at the repo root. **Discovery order at runtime** (see `docs/DEVELOPER.md` for the full Path layout): 1. `DATATOOLS_TESSERACT_BIN` env var override. 2. Bundled path under `sys._MEIPASS / "tesseract" /` (frozen bundles only). 3. `tesseract` on `PATH` (source / pip developer environments). 4. Windows well-known locations. ## 4. Libraries | Purpose | Library | |---------|---------| | GUI | streamlit | | CLI | typer | | Data | pandas, openpyxl, numpy | | Fuzzy match | rapidfuzz | | Phone parsing | phonenumbers | | Encoding detect | charset-normalizer | | Logging | loguru | | Mojibake (optional) | ftfy | | Reports | reportlab | ## 5. Coding standards ### 5.1 Code - PEP 8 + type hints on public functions. - Docstrings on every module + public function. - `pathlib.Path` for paths, never string concat. - All I/O explicitly UTF-8-aware. - No platform-specific shell calls. - pytest for `core/`, not UI. - Errors raise via `src.core.errors` hierarchy (Section 7). ### 5.2 GUI UX (load-bearing per DECISIONS.md §4b) - **Works out of the box** — drop file → useful result with zero config. - **Sensible defaults visible everywhere**. - **Progressive disclosure** — basic = file uploader + run button + results; rest in `st.expander`. - **Plain-English labels**; technical detail in `help=` tooltip. - **Dry-run / preview by default**. - **Identical core to CLI**. - **Local-first messaging** — "runs locally in your browser, no internet" line on every page. ### 5.3 Functional scope (load-bearing per DECISIONS.md §4a) - Each script ships **complete coverage of the workflow it names**, including features Excel does for free. - Boundary = the named workflow. Dedup includes normalization + survivor + audit; not format conversion or charting. ## 6. System requirements **Buyer runtime**: Win 10/11 64-bit · macOS 11+ · Linux glibc 2020+ · modern browser · ~400-500 MB disk · no internet. **Developer**: Python 3.10+ · PyInstaller · Inno Setup (Win) · Xcode CLT (macOS) · Apple Developer Program $99/yr · Git + GitHub. ## 7. Error handling (`src/core/errors.py`) Structured hierarchy for friendly messages + maintainable trace context: ``` DataToolsError # base; carries path/column/operation/suggestion/cause InputValidationError(ValueError) # bad arg / wrong type ConfigError(ValueError) # bad config / options FileFormatError(ValueError) # file isn't what we expected FileAccessError(OSError) # I/O failure (perms, disk, missing) ``` **Subclassing rule**: every subclass extends a stdlib base (`ValueError` or `OSError`) so existing `except OSError` / `except ValueError` handlers still catch them. **Helpers**: - `ensure_dataframe(value, function=...)` — uniform DataFrame guard at every public entry. - `ensure_choice(value, name=, choices=)` — uniform enum/literal guard. - `wrap_file_read(path, op, exc)` / `wrap_file_write(...)` — tag OSError with file path + Windows-aware permission tip. - `format_for_user(exc, context=)` — single string for `st.error()` / CLI stderr. GUI / CLI handlers use `format_for_user()` so the user always sees: file path, operation, underlying error class, recovery suggestion. ## 8. Per-bundle status | Bundle | Status | |--------|--------| | Data Cleaning Mastery | 3/9 tools Ready (Find Duplicates, Clean Text, Standardize Formats); 6 stubs | | Automated Business Reporting | Not started | | Ecommerce Data Pipeline | Not started | | Small Business Finance | Not started | | Marketing Public Data Aggregation | Not started | | AI Ecommerce Aggregation (Shopify Pet) | Not started | ## 9. Open decisions - **pywebview wrap** — defer until support tickets show browser-launch confusion. - **Win code signing** — defer until SmartScreen drives volume. Cost ~$200-400/yr. - **Auto-update mechanism** — none at launch. Email-delivered updates. Revisit at 100+ buyers/bundle. - **Demo hosting migration** — Streamlit Community Cloud → $5/mo VPS if rate/brand limits hit. - **Code obfuscation** — none; license text + bundle complexity sufficient at $49-79. - **Telemetry** — none. Consider opt-in privacy-respecting only post-launch. ## 10. Script boundaries — 04 (Missing Values) vs 06 (Outliers) Deliberately separate. Confluent original spec was wrong. | Script | Owns | |--------|------| | 04 Fix Missing Values | "What's not there." Disguised nulls (`N/A`, `-`, sentinel codes), missingness patterns, imputation, drop-by-threshold. | | 06 Find Unusual Values | "What shouldn't be there." z-score / IQR / modified-z, multivariate (Isolation Forest, Mahalanobis), domain rules, winsorization. | **Run order**: 04 before 06. Outlier stats on data with `NaN` / sentinels are mathematically poisoned (means dragged, IQR widens, false negatives). **Pipeline order** (Automated Workflows enforces): 02 → 03 → 04 → 05 → 06 → 07 → 08. 01 is order-flexible. **Contested cases**: - Whitespace-only cell — 02 trims to empty; 04 then flags empty as null. - `-999` sentinel — 04 converts to `NaN` first; 06 then computes stats. - Suspicious-but-plausible (age 110) — 06 territory. ## 10b. GUI internationalization (i18n) The GUI uses an in-house, JSON-backed translation layer at `src/i18n/`. **No** `gettext` / `babel` / `.po` pipeline — the surface is small enough that a 100-line module + per-language JSON file is a better fit than a build-time toolchain. **Resolution model**: `t(key, lang=None, **fmt)` walks a dotted key (`home.title`, `tools.01_deduplicator.name`) through a nested dict. Fallback chain: requested lang → English (canonical) → the literal key. Missing format placeholders return the raw template rather than raising so a translation file cannot crash the UI. **Active language** is stored in `st.session_state["ui_lang"]`. Reading it outside a Streamlit run (tests, scripts) silently falls back to English, keeping the module importable without Streamlit context. **Picker placement**: `hide_streamlit_chrome()` calls `render_language_selector()` on every page that hides Streamlit's default chrome — i.e., the entire app. One mount point, every page picks it up. **Pack parity** is a tested invariant: `tests/test_lang_packs.py::TestPackParity` fails CI when `en.json` and another pack diverge in either direction. This catches translation drift at PR time rather than from buyer reports. **Farewell overlay**: the shutdown screen's JS payload interpolates pack strings into an `innerHTML` inside a JS single-quoted string. `_js_html_safe()` in `components/_legacy.py` escapes both the JS string terminator (`'`) and HTML special chars (`< > &`). The test `TestFarewellEscape` pins this; never bypass it. **Why not gettext**: zero compiled artifacts in the PyInstaller bundle, no build step before tests run, no `.po`/`.mo` round-trip for translators (anyone can edit JSON), and the same lookup works in unit tests without process state. Locked in because the surface won't grow large enough to need the alternative, and the alternative breaks the "drop a file, run pytest, ship" loop. ## 10c. GUI chrome — sidebar nav indicator swap Streamlit's `st.Page`-driven sidebar renders section headers with a Material Symbols ligature (`expand_more` / `expand_less`). The header element is not a button and carries no `aria-expanded`, so a pure-CSS swap can't follow open/closed state. We replace the glyph with plain typographic `+` / `−` (U+2212) via JS: - **CSS** (`components/_legacy.py`, `_HIDE_CHROME_CSS`) drops the Material Symbols font on `[data-testid="stIconMaterial"]` inside `[data-testid="stNavSectionHeader"]` so the rewritten character renders as normal text rather than re-resolving as an icon name. - **JS** (`_SWAP_NAV_SECTION_INDICATOR_JS`) walks each section header, reads the icon's text node, and rewrites `expand_more` → `+` / `expand_less` → `−`. A MutationObserver re-runs the swap when Streamlit re-renders the sidebar (RAF-throttled so a burst of mutations is one swap). The script ships through the same component-iframe bundle as the brand injector and upload-button rename inside `hide_streamlit_chrome()` — one iframe per page, three DOM mutations. ## 11. Per-script functional specs Specs live in this section as scripts enter active build. Each follows the Tier 1/2/3 structure with explicit strategic framing (what's the market gap given some of this is free elsewhere). ### 11.1 `01_deduplicator.py` — Smart duplicate removal **Status**: Ready. Tier 1 mostly built. Streamlit GUI port complete. **Market gap**: fuzzy match quality of OpenRefine, with the zero-learning UX of Excel, sold once for under $100, runs locally. **Tier 1**: - **Input**: auto-detect encoding (UTF-8, UTF-8-BOM, Latin-1, cp1252) · delimiter · header row · CSV/TSV/XLSX/XLS · multi-sheet picker · streaming for files > RAM. - **Matching**: exact + 3 fuzzy algos (Levenshtein / Jaro-Winkler / token-set) · per-column normalizers (5 types) · configurable threshold per strategy · multi-strategy OR. - **Survivor**: keep first / last / most-complete / most-recent · merge mode (fill blanks from losers). - **Trust**: dry-run preview by default · interactive review for gray-zone matches · confidence score per match · match-group export. - **Audit**: timestamped log · removed-rows separate file · input never modified · idempotent. - **Config**: save/load JSON profiles · sensible auto-detect defaults. - **UX**: human `--help` · progress bar > 10k rows · errors name row + column + value + suggestion. **Tier 2**: numeric/date tolerance · phonetic match (Soundex, Metaphone) · blocking/indexing · watch-folder. **Tier 3**: ML scoring · cross-file dedup · cron · Shopify/Klaviyo API direct. ### 11.2 `02_text_cleaner.py` — Character-level hygiene **Status**: Ready. Tier 1 built. **Market gap**: one-click correctness for the dirty-CSV failure modes that cause silent VLOOKUP misses. **Boundary**: - 02 — whitespace, Unicode normalize, smart-char fold, BOM, line endings, zero-width, control chars, case ops. Writes to disk. - 03 — dates, currencies, names, phones, addresses (display formatting). - 04 — disguised nulls. - 01 — `normalize_string` is *match-time* only, distinct from 02's *write-time* policy. **Tier 1 ops** (each toggleable; defaults shown for `excel-hygiene`): 1. Trim leading/trailing whitespace — ON 2. Collapse internal whitespace runs — ON 3. NFC normalize — ON 4. NFKC compatibility fold — OFF (lossy, opt-in via `paranoid` preset) 5. Smart-char fold (curly quotes, em/en-dash, NBSP, ellipsis) — ON 6. Zero-width / invisible char strip — ON 7. BOM strip — ON 8. Control-char strip (preserve `\t\n\r`) — ON 9. Line-ending normalize (CRLF/CR → LF inside cells) — ON 10. Case conversion (UPPER / lower / Title / Sentence) — OFF, per-column **Scope**: per-column selection · skip-list · operates on string-typed columns only. **Trust**: dry-run by default · per-cell change log (capped 1000, `--full-changelog` removes cap) · 3 output files mirroring dedup · idempotent. **Config**: 3 presets (`minimal` / `excel-hygiene` (default) / `paranoid`) · save/load JSON. ### 11.3 `03_format_standardizer.py` — Per-domain canonical forms **Status**: Ready. Full Tier 1 + most Tier 2 built. 199-row buyer corpus passing. **Market gap**: unify dates / phones / emails / addresses / names / currencies / booleans across messy ETL inputs without buyer writing code. **Domains**: | Domain | Default canonical | Notable handling | |--------|-------------------|------------------| | Date | ISO 8601 (`YYYY-MM-DD`) | MDY/DMY, Excel serial, Unix timestamp (s + ms), longform months, year-month, quarter, ISO week date (`2024-W03-1`), ISO ordinal (`2024-015`), RFC 2822, CJK separators (`2024年01月15日`), fullwidth digits, named-TZ resolution (EST/PST/JST/…), `two_digit_year_cutoff` | | Phone | E.164 + `;ext=N` | libphonenumber, 001 international prefix, error sentinels for placeholders / multi-number / contamination | | Email | lowercase + trim | display-name extraction, mailto/angle-bracket strip, smart-quote unwrap, BIDI/RTL override strip (security), optional `--gmail-canonical` | | Address | USPS-canonical (`expand=False`) or expanded (`expand=True`) | state/province-name → code for US/CA/AU/DE, UK postcode detection, multi-line collapse, PO Box normalize, state-code preservation regardless of input case | | Name | smart Title Case | Mc/Mac/O'/D' inner caps, Arabic `al-`/`el-` lowercase, particle lowercasing (von/van/de/da/bin/ibn/ben), East Asian honorific suffixes (`-san`/`-sama`/`-ssi`), comma reversal (skippable via `family_first`), period stripping for titles/suffixes/initials, PhD/MD/Mag/Habil acronyms | | Currency | bare number (dot decimal) | auto-detect EU vs US separators, space-thousands, Swiss apostrophe, accounting parens, optional ISO code preservation | | Boolean | `True`/`False` (configurable) | accepts `yes`/`no`/`y`/`n`/`1`/`0`/`on`/`off` | **International coverage** (added v1.7): - **Date locales**: English (default) plus opt-in French / German / Spanish / Portuguese / Italian / Dutch / Russian month + weekday recognition. - **Currency symbols**: $, €, £, ¥, ₹, ₩, ₽, ₪, ₺, ¢ + ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN), ₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS). - **ISO 4217 codes**: 23 baseline (USD, EUR, …) plus ~30 emerging-market additions (SAR, AED, ARS, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS, KES, HUF, CZK, RON, UAH, …). - **Address jurisdictions**: US, Canada (13 provinces/territories), Australia (8 states), Germany (16 Bundesländer), UK (postcode shape). - **Address PO Box**: English, German (`Postfach`), French (`Boîte postale`), Spanish (`Apartado`), Italian (`Casella postale`), Portuguese (`Caixa postal`). **Per-domain `error_policy`**: `"passthrough"` (default) keeps the original; `"sentinel"` emits `>` for cases like Feb 30, double @, percentages mistaken for currency, etc. **Pipeline**: `standardize_dataframe(df, options)` runs per-column with `column_types: dict[str, FieldType]`. Returns `StandardizeResult` with `cells_changed`, `cells_unparseable`, change audit. Warns when > 10% of typed cells fail to parse. **Presets**: `us-default`, `european`, `uk`, `iso-strict`, `legacy-us`. Custom abbreviations via `extra_abbreviations`. Per-column culture flags: `name_family_first` (East Asian), `address_state_to_code` (any of 4 supported jurisdictions), `date_month_locales` (list of 8 supported codes). ### 11.4 Upload-time analyzer (`src/core/analyze.py`) Read-only advisory pass on every upload. Emits `Finding` objects: | Field | Meaning | |-------|---------| | `id` | Stable identifier (never localized) | | `severity` | `info` / `warn` / `error` (only `error` blocks gate) | | `confidence` | `high` (round-trip safe) / `medium` (preview) / `low` (heuristic) | | `fix_action` | id of algorithm in `fixes.py` (empty for informational-only) | | `pre_applied` | `true` if fix already ran during read pass | | `tool` | owning tool id (or empty for file-level) | | `count` | cells / rows affected | | `description` | one-sentence human summary | | `column` | column name (None for file-level) | | `samples` | up to 5 `(row, col, value)` examples | Entry point: `analyze(source, *, sample_rows=1000, repair_result=None, encoding_override=None)`. `encoding_override` skips charset detection — the hook that lets the Review page recover from misdetections. ### 11.5 CSV-normalization gate (`src/core/normalize.py`, `fixes.py`) Two paths: 1. **Auto-fix** — `auto_fix(df, findings)` applies every `confidence="high"` finding whose `fix_action` is registered. 2. **Per-finding decisions** — `apply_decisions(df, findings, decisions)` accepts `Decision(finding_id, action, payload)` with action `"auto"|"skip"|"modified"`. Returns `NormalizationResult` with `cleaned_df`, `cleaned_bytes` (UTF-8 CSV), `applied`, `skipped_findings`, `pending_findings`, `blocking_findings`. `is_normalized(findings, result)` re-runs `analyze()` against cleaned bytes; returns False if any high-confidence detector still fires (the strict contract tool pages depend on). **Fix registry**: `@register("fix_id")` decorates `(df, payload) → (new_df, n_cells_changed)`. New fix = one entry in `analyze.py` `FIX_*` constants + one detector emitting that `fix_action` + one registered function. No other call sites change. ### 11.6 Review page (`src/gui/pages/0_Review.py`) 1. Detected encoding + override picker (16 codepages + custom). 2. One expandable card per finding (sorted by severity then confidence) with: decision radio (Auto/Skip/Customize), live before/after preview built by running the registered fix on `Finding.samples`, payload editor for fixes that take user input. 3. Apply persists `NormalizationResult` keyed by upload SHA-256; tool pages refuse to load until hash matches. 4. `⚙️ Advanced output options` expander: per-download encoding + delimiter + line terminator. `_build_output_bytes()` returns `(bytes, error_message)`; lossy fallbacks emit a warning the page surfaces. Gates the entire tool sidebar via `require_normalization_gate()` in `src/gui/components/_legacy.py`. ### 11.7 Pre-parse repair (`src/core/io.py::repair_bytes`) Byte-level pre-parse pass. **Order is meaningful**: 1. **Wide-encoding transcode** (UTF-16/32 → UTF-8) — must run first or NUL strip below shreds UTF-16. 2. **UTF-8 BOM strip** (file start only). 3. **NUL strip** — only meaningful after step 1, so flags genuine corruption. 4. **Line-ending normalize** — CRLF + bare CR → LF. 5. **Byte-level smart-quote fold** — curly / guillemet / double-prime → ASCII `"` (only structural double-quote-equivalents; single curlies deferred to cell-level). 6. **Per-row delimiter repair** — when a row has +1 field and merge candidate is currency-shaped (`$1,500.00`), merge + quote. `detect_encoding()` tries strict UTF-8 first — charset-normalizer mislabels short-non-ASCII files as `mac_latin2`, but valid UTF-8 bytes mean UTF-8 regardless of label.