datatools-dev/docs/TECHNICAL.md

# Technical

> Creator-only. Do not ship to buyers.
> **Version**: 1.6 · **Updated**: 2026-05-01

## 1. Architecture

- **Dual interface**: CLI + GUI, both wrapping the same `src/core/` library.
- **GUI**: Streamlit, runs as local web server, opens in default browser. No internet.
- **Runtime**: Python 3.10+ (bundled into installer; buyer never sees Python).
- **Cross-platform**: Windows, macOS, Linux from day one. PyInstaller per OS.
- **Core/UI rule**: business logic in `core/` only. CLI + GUI are thin front-ends.

**Locks**:
- v1.2 — dual interface required (non-technical buyers won't use CLI).
- v1.3 — Streamlit chosen (over CustomTkinter inactive, plain Tk UX gap, Flet/PySide6/NiceGUI each fails one dimension). See DECISIONS.md §4c.

## 2. Repo layout

```
src/
  core/                  # Shared logic. No UI code.
    analyze.py           # Detectors + Finding schema
    config.py            # DeduplicationConfig (JSON profiles)
    dedup.py             # Match strategies, union-find, survivor selection
    errors.py            # Structured error hierarchy + format_for_user
    fixes.py             # Fix registry (one per fix_action)
    format_standardize.py # Per-cell standardizers + DataFrame pipeline
    io.py                # read_file / write_file / repair_bytes
    normalize.py         # CSV-normalization gate
    normalizers.py       # Per-column normalizers for dedup matching
    text_clean.py        # clean_dataframe + smart_title_case
    _constants.py        # Shared USPS abbrevs + state names
  cli.py                 # Deduplicator CLI (Typer)
  cli_text_clean.py      # Text Cleaner CLI
  cli_analyze.py         # Analyzer CLI (--json)
  gui/
    app.py               # Streamlit entry point
    pages/               # One page per tool
    components/          # shared, dedup_review, findings, gate, _legacy
build/                   # PyInstaller spec, launcher, OS-specific configs
demo/                    # Constrained Streamlit Community Cloud version
tests/                   # pytest; targets core/, not UI
test-cases/              # Fixture corpora (text-cleaner, encodings, format-cleaner)
```

**Demo subfolder**: row-limited, watermarked, file-size-capped Streamlit app for public deployment. Same core, different front-end constraints.

## 3. Build pipeline

### 3.1 Tooling
| Concern | Tool |
|---------|------|
| Bundling | PyInstaller |
| GUI | Streamlit |
| CLI | Typer |
| Browser launch | stdlib `webbrowser` |
| Win installer | Inno Setup (free) |
| macOS sign+notarize | `codesign` + `notarytool` |
| Linux | AppImage (primary) + tarball fallback |
| CI | GitHub Actions matrix |
| Demo host | Streamlit Community Cloud (free) |

### 3.2 Build outputs
| OS | File | Buyer experience |
|----|------|------------------|
| Win | `*-Setup-1.0.exe` | Wizard → desktop shortcut "Launch Bundle" → browser opens. CLI on PATH. |
| macOS | `*-1.0.dmg` | Drag to Applications. Signed + notarized. |
| Linux | `*-1.0.AppImage` | `chmod +x`, double-click. |

### 3.3 PyInstaller

- `--onefile` for Linux, `--onedir` for Win/macOS (faster startup, easier signing).
- Two entry points: GUI launcher + CLI binaries.
- Streamlit hooks needed: `streamlit`, `altair`, `pyarrow` data dirs.
- Custom `hook-streamlit.py` per documented pattern.
- Budget: 1-3 days first time. Reusable after.

### 3.4 Streamlit launcher

1. Find free port (don't hardcode 8501).
2. Set env: `STREAMLIT_SERVER_HEADLESS=true`, `STREAMLIT_BROWSER_GATHER_USAGE_STATS=false`, `STREAMLIT_SERVER_PORT={port}`.
3. Start Streamlit programmatically in a thread.
4. Poll port until ready.
5. Open browser to `http://localhost:{port}`.
6. Keep launcher alive while server runs.

Optional v1.1: wrap with `pywebview` to eliminate browser-launch UX. Defer until support tickets show meaningful confusion.

### 3.5 macOS pipeline

1. PyInstaller produces unsigned `.app`.
2. `codesign --deep --force --options runtime --sign "Developer ID Application: ..." App.app`.
3. Package as `.dmg`.
4. `xcrun notarytool submit *.dmg --wait`.
5. `xcrun stapler staple *.dmg`.

Setup: Apple Developer Program ($99/yr), Developer ID cert in Keychain, app-specific password.

### 3.6-3.7 Win + Linux

- **Win**: PyInstaller `--onedir` → Inno Setup wraps → installer adds Start Menu, desktop shortcut, PATH entries. Optional code-signing cert ($200-400/yr) if SmartScreen friction.
- **Linux**: PyInstaller → `appimagetool` wraps. `.tar.gz` fallback for distros where AppImage fails.

### 3.8 CI matrix

```yaml
strategy:
  matrix:
    os: [windows-latest, macos-latest, ubuntu-latest]
```

Tag a release → 3 platform artifacts upload to GitHub Releases. Manual: copy to Gumroad / Lemon Squeezy.

### 3.9 Hosted demo

`demo/streamlit_app.py` → Streamlit Community Cloud. Configure deployment in Streamlit UI. Custom domain via CNAME (verify policy at deploy time). Fall back to $5/mo VPS if rate limits / branding constraints hit.

## 4. Libraries

| Purpose | Library |
|---------|---------|
| GUI | streamlit |
| CLI | typer |
| Data | pandas, openpyxl, numpy |
| Fuzzy match | rapidfuzz |
| Phone parsing | phonenumbers |
| Encoding detect | charset-normalizer |
| Logging | loguru |
| Mojibake (optional) | ftfy |
| Reports | reportlab |

## 5. Coding standards

### 5.1 Code
- PEP 8 + type hints on public functions.
- Docstrings on every module + public function.
- `pathlib.Path` for paths, never string concat.
- All I/O explicitly UTF-8-aware.
- No platform-specific shell calls.
- pytest for `core/`, not UI.
- Errors raise via `src.core.errors` hierarchy (Section 7).

### 5.2 GUI UX (load-bearing per DECISIONS.md §4b)
- **Works out of the box** — drop file → useful result with zero config.
- **Sensible defaults visible everywhere**.
- **Progressive disclosure** — basic = file uploader + run button + results; rest in `st.expander`.
- **Plain-English labels**; technical detail in `help=` tooltip.
- **Dry-run / preview by default**.
- **Identical core to CLI**.
- **Local-first messaging** — "runs locally in your browser, no internet" line on every page.

### 5.3 Functional scope (load-bearing per DECISIONS.md §4a)
- Each script ships **complete coverage of the workflow it names**, including features Excel does for free.
- Boundary = the named workflow. Dedup includes normalization + survivor + audit; not format conversion or charting.

## 6. System requirements

**Buyer runtime**: Win 10/11 64-bit · macOS 11+ · Linux glibc 2020+ · modern browser · ~400-500 MB disk · no internet.

**Developer**: Python 3.10+ · PyInstaller · Inno Setup (Win) · Xcode CLT (macOS) · Apple Developer Program $99/yr · Git + GitHub.

## 7. Error handling (`src/core/errors.py`)

Structured hierarchy for friendly messages + maintainable trace context:

```
DataToolsError              # base; carries path/column/operation/suggestion/cause
  InputValidationError(ValueError)   # bad arg / wrong type
  ConfigError(ValueError)            # bad config / options
  FileFormatError(ValueError)        # file isn't what we expected
  FileAccessError(OSError)           # I/O failure (perms, disk, missing)
```

**Subclassing rule**: every subclass extends a stdlib base (`ValueError` or `OSError`) so existing `except OSError` / `except ValueError` handlers still catch them.

**Helpers**:
- `ensure_dataframe(value, function=...)` — uniform DataFrame guard at every public entry.
- `ensure_choice(value, name=, choices=)` — uniform enum/literal guard.
- `wrap_file_read(path, op, exc)` / `wrap_file_write(...)` — tag OSError with file path + Windows-aware permission tip.
- `format_for_user(exc, context=)` — single string for `st.error()` / CLI stderr.

GUI / CLI handlers use `format_for_user()` so the user always sees: file path, operation, underlying error class, recovery suggestion.

## 8. Per-bundle status

| Bundle | Status |
|--------|--------|
| Data Cleaning Mastery | 3/9 tools Ready (Dedup, Text Cleaner, Format Standardizer); 6 stubs |
| Automated Business Reporting | Not started |
| Ecommerce Data Pipeline | Not started |
| Small Business Finance | Not started |
| Marketing Public Data Aggregation | Not started |
| AI Ecommerce Aggregation (Shopify Pet) | Not started |

## 9. Open decisions

- **pywebview wrap** — defer until support tickets show browser-launch confusion.
- **Win code signing** — defer until SmartScreen drives volume. Cost ~$200-400/yr.
- **Auto-update mechanism** — none at launch. Email-delivered updates. Revisit at 100+ buyers/bundle.
- **Demo hosting migration** — Streamlit Community Cloud → $5/mo VPS if rate/brand limits hit.
- **Code obfuscation** — none; license text + bundle complexity sufficient at $49-79.
- **Telemetry** — none. Consider opt-in privacy-respecting only post-launch.

## 10. Script boundaries — 04 (Missing Values) vs 06 (Outliers)

Deliberately separate. Confluent original spec was wrong.

| Script | Owns |
|--------|------|
| 04 Missing Value Handler | "What's not there." Disguised nulls (`N/A`, `-`, sentinel codes), missingness patterns, imputation, drop-by-threshold. |
| 06 Outlier Detector | "What shouldn't be there." z-score / IQR / modified-z, multivariate (Isolation Forest, Mahalanobis), domain rules, winsorization. |

**Run order**: 04 before 06. Outlier stats on data with `NaN` / sentinels are mathematically poisoned (means dragged, IQR widens, false negatives).

**Pipeline order** (Pipeline Runner enforces): 02 → 03 → 04 → 05 → 06 → 07 → 08. 01 is order-flexible.

**Contested cases**:
- Whitespace-only cell — 02 trims to empty; 04 then flags empty as null.
- `-999` sentinel — 04 converts to `NaN` first; 06 then computes stats.
- Suspicious-but-plausible (age 110) — 06 territory.

## 11. Per-script functional specs

Specs live in this section as scripts enter active build. Each follows the Tier 1/2/3 structure with explicit strategic framing (what's the market gap given some of this is free elsewhere).

### 11.1 `01_deduplicator.py` — Smart duplicate removal

**Status**: Ready. Tier 1 mostly built. Streamlit GUI port complete.

**Market gap**: fuzzy match quality of OpenRefine, with the zero-learning UX of Excel, sold once for under $100, runs locally.

**Tier 1**:
- **Input**: auto-detect encoding (UTF-8, UTF-8-BOM, Latin-1, cp1252) · delimiter · header row · CSV/TSV/XLSX/XLS · multi-sheet picker · streaming for files > RAM.
- **Matching**: exact + 3 fuzzy algos (Levenshtein / Jaro-Winkler / token-set) · per-column normalizers (5 types) · configurable threshold per strategy · multi-strategy OR.
- **Survivor**: keep first / last / most-complete / most-recent · merge mode (fill blanks from losers).
- **Trust**: dry-run preview by default · interactive review for gray-zone matches · confidence score per match · match-group export.
- **Audit**: timestamped log · removed-rows separate file · input never modified · idempotent.
- **Config**: save/load JSON profiles · sensible auto-detect defaults.
- **UX**: human `--help` · progress bar > 10k rows · errors name row + column + value + suggestion.

**Tier 2**: numeric/date tolerance · phonetic match (Soundex, Metaphone) · blocking/indexing · watch-folder.

**Tier 3**: ML scoring · cross-file dedup · cron · Shopify/Klaviyo API direct.

### 11.2 `02_text_cleaner.py` — Character-level hygiene

**Status**: Ready. Tier 1 built.

**Market gap**: one-click correctness for the dirty-CSV failure modes that cause silent VLOOKUP misses.

**Boundary**:
- 02 — whitespace, Unicode normalize, smart-char fold, BOM, line endings, zero-width, control chars, case ops. Writes to disk.
- 03 — dates, currencies, names, phones, addresses (display formatting).
- 04 — disguised nulls.
- 01 — `normalize_string` is *match-time* only, distinct from 02's *write-time* policy.

**Tier 1 ops** (each toggleable; defaults shown for `excel-hygiene`):
1. Trim leading/trailing whitespace — ON
2. Collapse internal whitespace runs — ON
3. NFC normalize — ON
4. NFKC compatibility fold — OFF (lossy, opt-in via `paranoid` preset)
5. Smart-char fold (curly quotes, em/en-dash, NBSP, ellipsis) — ON
6. Zero-width / invisible char strip — ON
7. BOM strip — ON
8. Control-char strip (preserve `\t\n\r`) — ON
9. Line-ending normalize (CRLF/CR → LF inside cells) — ON
10. Case conversion (UPPER / lower / Title / Sentence) — OFF, per-column

**Scope**: per-column selection · skip-list · operates on string-typed columns only.

**Trust**: dry-run by default · per-cell change log (capped 1000, `--full-changelog` removes cap) · 3 output files mirroring dedup · idempotent.

**Config**: 3 presets (`minimal` / `excel-hygiene` (default) / `paranoid`) · save/load JSON.

### 11.3 `03_format_standardizer.py` — Per-domain canonical forms

**Status**: Ready. Full Tier 1 + most Tier 2 built. 199-row buyer corpus passing.

**Market gap**: unify dates / phones / emails / addresses / names / currencies / booleans across messy ETL inputs without buyer writing code.

**Domains**:
| Domain | Default canonical | Notable handling |
|--------|-------------------|------------------|
| Date | ISO 8601 (`YYYY-MM-DD`) | MDY/DMY, Excel serial, Unix timestamp (s + ms), longform months, year-month, quarter, ISO week date (`2024-W03-1`), ISO ordinal (`2024-015`), RFC 2822, CJK separators (`2024年01月15日`), fullwidth digits, named-TZ resolution (EST/PST/JST/…), `two_digit_year_cutoff` |
| Phone | E.164 + `;ext=N` | libphonenumber, 001 international prefix, error sentinels for placeholders / multi-number / contamination |
| Email | lowercase + trim | display-name extraction, mailto/angle-bracket strip, smart-quote unwrap, BIDI/RTL override strip (security), optional `--gmail-canonical` |
| Address | USPS-canonical (`expand=False`) or expanded (`expand=True`) | state/province-name → code for US/CA/AU/DE, UK postcode detection, multi-line collapse, PO Box normalize, state-code preservation regardless of input case |
| Name | smart Title Case | Mc/Mac/O'/D' inner caps, Arabic `al-`/`el-` lowercase, particle lowercasing (von/van/de/da/bin/ibn/ben), East Asian honorific suffixes (`-san`/`-sama`/`-ssi`), comma reversal (skippable via `family_first`), period stripping for titles/suffixes/initials, PhD/MD/Mag/Habil acronyms |
| Currency | bare number (dot decimal) | auto-detect EU vs US separators, space-thousands, Swiss apostrophe, accounting parens, optional ISO code preservation |
| Boolean | `True`/`False` (configurable) | accepts `yes`/`no`/`y`/`n`/`1`/`0`/`on`/`off` |

**International coverage** (added v1.7):
- **Date locales**: English (default) plus opt-in French / German / Spanish / Portuguese / Italian / Dutch / Russian month + weekday recognition.
- **Currency symbols**: $, €, £, ¥, ₹, ₩, ₽, ₪, ₺, ¢ + ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN), ₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS).
- **ISO 4217 codes**: 23 baseline (USD, EUR, …) plus ~30 emerging-market additions (SAR, AED, ARS, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS, KES, HUF, CZK, RON, UAH, …).
- **Address jurisdictions**: US, Canada (13 provinces/territories), Australia (8 states), Germany (16 Bundesländer), UK (postcode shape).
- **Address PO Box**: English, German (`Postfach`), French (`Boîte postale`), Spanish (`Apartado`), Italian (`Casella postale`), Portuguese (`Caixa postal`).

**Per-domain `error_policy`**: `"passthrough"` (default) keeps the original; `"sentinel"` emits `<error: <reason>>` for cases like Feb 30, double @, percentages mistaken for currency, etc.

**Pipeline**: `standardize_dataframe(df, options)` runs per-column with `column_types: dict[str, FieldType]`. Returns `StandardizeResult` with `cells_changed`, `cells_unparseable`, change audit. Warns when > 10% of typed cells fail to parse.

**Presets**: `us-default`, `european`, `uk`, `iso-strict`, `legacy-us`. Custom abbreviations via `extra_abbreviations`. Per-column culture flags: `name_family_first` (East Asian), `address_state_to_code` (any of 4 supported jurisdictions), `date_month_locales` (list of 8 supported codes).

### 11.4 Upload-time analyzer (`src/core/analyze.py`)

Read-only advisory pass on every upload. Emits `Finding` objects:

| Field | Meaning |
|-------|---------|
| `id` | Stable identifier (never localized) |
| `severity` | `info` / `warn` / `error` (only `error` blocks gate) |
| `confidence` | `high` (round-trip safe) / `medium` (preview) / `low` (heuristic) |
| `fix_action` | id of algorithm in `fixes.py` (empty for informational-only) |
| `pre_applied` | `true` if fix already ran during read pass |
| `tool` | owning tool id (or empty for file-level) |
| `count` | cells / rows affected |
| `description` | one-sentence human summary |
| `column` | column name (None for file-level) |
| `samples` | up to 5 `(row, col, value)` examples |

Entry point: `analyze(source, *, sample_rows=1000, repair_result=None, encoding_override=None)`. `encoding_override` skips charset detection — the hook that lets the Review page recover from misdetections.

### 11.5 CSV-normalization gate (`src/core/normalize.py`, `fixes.py`)

Two paths:
1. **Auto-fix** — `auto_fix(df, findings)` applies every `confidence="high"` finding whose `fix_action` is registered.
2. **Per-finding decisions** — `apply_decisions(df, findings, decisions)` accepts `Decision(finding_id, action, payload)` with action `"auto"|"skip"|"modified"`.

Returns `NormalizationResult` with `cleaned_df`, `cleaned_bytes` (UTF-8 CSV), `applied`, `skipped_findings`, `pending_findings`, `blocking_findings`.

`is_normalized(findings, result)` re-runs `analyze()` against cleaned bytes; returns False if any high-confidence detector still fires (the strict contract tool pages depend on).

**Fix registry**: `@register("fix_id")` decorates `(df, payload) → (new_df, n_cells_changed)`. New fix = one entry in `analyze.py` `FIX_*` constants + one detector emitting that `fix_action` + one registered function. No other call sites change.

### 11.6 Review page (`src/gui/pages/0_Review.py`)

1. Detected encoding + override picker (16 codepages + custom).
2. One expandable card per finding (sorted by severity then confidence) with: decision radio (Auto/Skip/Customize), live before/after preview built by running the registered fix on `Finding.samples`, payload editor for fixes that take user input.
3. Apply persists `NormalizationResult` keyed by upload SHA-256; tool pages refuse to load until hash matches.
4. `⚙️ Advanced output options` expander: per-download encoding + delimiter + line terminator. `_build_output_bytes()` returns `(bytes, error_message)`; lossy fallbacks emit a warning the page surfaces.

Gates the entire tool sidebar via `require_normalization_gate()` in `src/gui/components/_legacy.py`.

### 11.7 Pre-parse repair (`src/core/io.py::repair_bytes`)

Byte-level pre-parse pass. **Order is meaningful**:

1. **Wide-encoding transcode** (UTF-16/32 → UTF-8) — must run first or NUL strip below shreds UTF-16.
2. **UTF-8 BOM strip** (file start only).
3. **NUL strip** — only meaningful after step 1, so flags genuine corruption.
4. **Line-ending normalize** — CRLF + bare CR → LF.
5. **Byte-level smart-quote fold** — curly / guillemet / double-prime → ASCII `"` (only structural double-quote-equivalents; single curlies deferred to cell-level).
6. **Per-row delimiter repair** — when a row has +1 field and merge candidate is currency-shaped (`$1,500.00`), merge + quote.

`detect_encoding()` tries strict UTF-8 first — charset-normalizer mislabels short-non-ASCII files as `mac_latin2`, but valid UTF-8 bytes mean UTF-8 regardless of label.