Files
datatools-dev/docs/TECHNICAL.md
Michael fd9606c67b build: drop the local Python release method, return to CI-only installer builds
Removes the single-command Python packaging method (build/make_release.py
+ build/build_portable_zip.py + build/macos/build_zip.sh) and the portable
.zip artifacts it produced. Release builds go back to the original GitHub
Actions process: the CI matrix builds one installer per platform (.dmg /
.exe / .AppImage) on tag push and attaches them to a GitHub Release.

Tesseract OCR bundling is preserved: the fetch helpers the workflow depends
on (fetch_tessdata, fetch_tesseract_for_platform) are extracted into a
standalone build/tesseract.py, which build.yml now imports.

Docs (README, build/README, DEVELOPER, TECHNICAL, USER-GUIDE, vendor README,
es translations) updated to drop the portable-zip flavor and point at the
new module.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 17:47:36 +00:00

23 KiB
Raw Permalink Blame History

Technical

Creator-only. Do not ship to buyers. Version: 1.6 · Updated: 2026-05-01

For the end-to-end picture (desktop app + license server + storefronts

  • email), see ARCHITECTURE.md. This doc focuses on desktop internals.

1. Architecture

  • Dual interface: CLI + GUI, both wrapping the same src/core/ library.
  • GUI: Streamlit, runs as local web server, opens in default browser. No internet.
  • Runtime: Python 3.10+ (bundled into installer; buyer never sees Python).
  • Cross-platform: Windows, macOS, Linux from day one. PyInstaller per OS.
  • Core/UI rule: business logic in core/ only. CLI + GUI are thin front-ends.

Locks:

  • v1.2 — dual interface required (non-technical buyers won't use CLI).
  • v1.3 — Streamlit chosen (over CustomTkinter inactive, plain Tk UX gap, Flet/PySide6/NiceGUI each fails one dimension). See DECISIONS.md §4c.

2. Repo layout

src/
  core/                  # Shared logic. No UI code.
    analyze.py           # Detectors + Finding schema
    config.py            # DeduplicationConfig (JSON profiles)
    dedup.py             # Match strategies, union-find, survivor selection
    errors.py            # Structured error hierarchy + format_for_user
    fixes.py             # Fix registry (one per fix_action)
    format_standardize.py # Per-cell standardizers + DataFrame pipeline
    io.py                # read_file / write_file / repair_bytes
    normalize.py         # CSV-normalization gate
    normalizers.py       # Per-column normalizers for dedup matching
    text_clean.py        # clean_dataframe + smart_title_case
    _constants.py        # Shared USPS abbrevs + state names
  cli.py                 # Find Duplicates CLI (Typer)
  cli_text_clean.py      # Clean Text CLI
  cli_analyze.py         # Analyzer CLI (--json)
  gui/
    app.py               # Streamlit entry point
    pages/               # One page per tool
    components/          # shared, dedup_review, findings, gate, _legacy
  i18n/                  # GUI language packs (JSON-backed, in-house lookup)
    __init__.py          # t() · current_language() · render_language_selector()
    packs/               # en.json, es.json, … (one file per language)
build/                   # PyInstaller spec, launcher, OS-specific configs
demo/                    # Constrained Streamlit Community Cloud version
tests/                   # pytest; targets core/, not UI
test-cases/              # Fixture corpora (text-cleaner, encodings, format-cleaner)

Demo subfolder: row-limited, watermarked, file-size-capped Streamlit app for public deployment. Same core, different front-end constraints.

3. Build pipeline

3.1 Tooling

Concern Tool
Bundling PyInstaller
GUI Streamlit
CLI Typer
Browser launch stdlib webbrowser
Win installer Inno Setup (free)
macOS sign+notarize codesign + notarytool
Linux AppImage (primary) + tarball fallback
CI GitHub Actions matrix
Demo host Streamlit Community Cloud (free)

3.2 Build outputs

OS File Buyer experience
Win *-Setup-1.0.exe Wizard → desktop shortcut "Launch Bundle" → browser opens. CLI on PATH.
macOS *-1.0.dmg Drag to Applications. Signed + notarized.
Linux *-1.0.AppImage chmod +x, double-click.

3.3 PyInstaller

  • --onefile for Linux, --onedir for Win/macOS (faster startup, easier signing).
  • Two entry points: GUI launcher + CLI binaries.
  • Streamlit hooks needed: streamlit, altair, pyarrow data dirs.
  • Custom hook-streamlit.py per documented pattern.
  • Budget: 1-3 days first time. Reusable after.

3.4 Streamlit launcher

  1. Find free port (don't hardcode 8501).
  2. Set env: STREAMLIT_SERVER_HEADLESS=true, STREAMLIT_BROWSER_GATHER_USAGE_STATS=false, STREAMLIT_SERVER_PORT={port}.
  3. Start Streamlit programmatically in a thread.
  4. Poll port until ready.
  5. Open browser to http://localhost:{port}.
  6. Keep launcher alive while server runs.

Optional v1.1: wrap with pywebview to eliminate browser-launch UX. Defer until support tickets show meaningful confusion.

3.5 macOS pipeline

  1. PyInstaller produces unsigned .app.
  2. codesign --deep --force --options runtime --sign "Developer ID Application: ..." App.app.
  3. Package as .dmg.
  4. xcrun notarytool submit *.dmg --wait.
  5. xcrun stapler staple *.dmg.

Setup: Apple Developer Program ($99/yr), Developer ID cert in Keychain, app-specific password.

3.6-3.7 Win + Linux

  • Win: PyInstaller --onedir → Inno Setup wraps → installer adds Start Menu, desktop shortcut, PATH entries. Optional code-signing cert ($200-400/yr) if SmartScreen friction.
  • Linux: PyInstaller → appimagetool wraps. .tar.gz fallback for distros where AppImage fails.

3.8 CI matrix

strategy:
  matrix:
    os: [windows-latest, macos-latest, ubuntu-latest]

Tag a release → 3 platform artifacts upload to GitHub Releases. Manual: copy to Gumroad / Lemon Squeezy.

3.9 Hosted demo

demo/streamlit_app.py → Streamlit Community Cloud. Configure deployment in Streamlit UI. Custom domain via CNAME (verify policy at deploy time). Fall back to $5/mo VPS if rate limits / branding constraints hit.

3.10 Bundled Tesseract (PDF Extractor OCR)

Frozen builds ship Tesseract 5.5 + eng.traineddata inside the PyInstaller bundle so scanned PDFs work without a separate install. Per-platform binary URLs pinned in build/tesseract.py; tessdata vendored at build/vendor/tessdata/eng.traineddata. License attribution in LICENSE_TESSERACT.txt at the repo root.

Discovery order at runtime (see docs/DEVELOPER.md for the full Path layout):

  1. DATATOOLS_TESSERACT_BIN env var override.
  2. Bundled path under sys._MEIPASS / "tesseract" / (frozen bundles only).
  3. tesseract on PATH (source / pip developer environments).
  4. Windows well-known locations.

4. Libraries

Purpose Library
GUI streamlit
CLI typer
Data pandas, openpyxl, numpy
Fuzzy match rapidfuzz
Phone parsing phonenumbers
Encoding detect charset-normalizer
Logging loguru
Mojibake (optional) ftfy
Reports reportlab

5. Coding standards

5.1 Code

  • PEP 8 + type hints on public functions.
  • Docstrings on every module + public function.
  • pathlib.Path for paths, never string concat.
  • All I/O explicitly UTF-8-aware.
  • No platform-specific shell calls.
  • pytest for core/, not UI.
  • Errors raise via src.core.errors hierarchy (Section 7).

5.2 GUI UX (load-bearing per DECISIONS.md §4b)

  • Works out of the box — drop file → useful result with zero config.
  • Sensible defaults visible everywhere.
  • Progressive disclosure — basic = file uploader + run button + results; rest in st.expander.
  • Plain-English labels; technical detail in help= tooltip.
  • Dry-run / preview by default.
  • Identical core to CLI.
  • Local-first messaging — "runs locally in your browser, no internet" line on every page.

5.3 Functional scope (load-bearing per DECISIONS.md §4a)

  • Each script ships complete coverage of the workflow it names, including features Excel does for free.
  • Boundary = the named workflow. Dedup includes normalization + survivor + audit; not format conversion or charting.

6. System requirements

Buyer runtime: Win 10/11 64-bit · macOS 11+ · Linux glibc 2020+ · modern browser · ~400-500 MB disk · no internet.

Developer: Python 3.10+ · PyInstaller · Inno Setup (Win) · Xcode CLT (macOS) · Apple Developer Program $99/yr · Git + GitHub.

7. Error handling (src/core/errors.py)

Structured hierarchy for friendly messages + maintainable trace context:

DataToolsError              # base; carries path/column/operation/suggestion/cause
  InputValidationError(ValueError)   # bad arg / wrong type
  ConfigError(ValueError)            # bad config / options
  FileFormatError(ValueError)        # file isn't what we expected
  FileAccessError(OSError)           # I/O failure (perms, disk, missing)

Subclassing rule: every subclass extends a stdlib base (ValueError or OSError) so existing except OSError / except ValueError handlers still catch them.

Helpers:

  • ensure_dataframe(value, function=...) — uniform DataFrame guard at every public entry.
  • ensure_choice(value, name=, choices=) — uniform enum/literal guard.
  • wrap_file_read(path, op, exc) / wrap_file_write(...) — tag OSError with file path + Windows-aware permission tip.
  • format_for_user(exc, context=) — single string for st.error() / CLI stderr.

GUI / CLI handlers use format_for_user() so the user always sees: file path, operation, underlying error class, recovery suggestion.

8. Per-bundle status

Bundle Status
Data Cleaning Mastery 3/9 tools Ready (Find Duplicates, Clean Text, Standardize Formats); 6 stubs
Automated Business Reporting Not started
Ecommerce Data Pipeline Not started
Small Business Finance Not started
Marketing Public Data Aggregation Not started
AI Ecommerce Aggregation (Shopify Pet) Not started

9. Open decisions

  • pywebview wrap — defer until support tickets show browser-launch confusion.
  • Win code signing — defer until SmartScreen drives volume. Cost ~$200-400/yr.
  • Auto-update mechanism — none at launch. Email-delivered updates. Revisit at 100+ buyers/bundle.
  • Demo hosting migration — Streamlit Community Cloud → $5/mo VPS if rate/brand limits hit.
  • Code obfuscation — none; license text + bundle complexity sufficient at $49-79.
  • Telemetry — none. Consider opt-in privacy-respecting only post-launch.

10. Script boundaries — 04 (Missing Values) vs 06 (Outliers)

Deliberately separate. Confluent original spec was wrong.

Script Owns
04 Fix Missing Values "What's not there." Disguised nulls (N/A, -, sentinel codes), missingness patterns, imputation, drop-by-threshold.
06 Find Unusual Values "What shouldn't be there." z-score / IQR / modified-z, multivariate (Isolation Forest, Mahalanobis), domain rules, winsorization.

Run order: 04 before 06. Outlier stats on data with NaN / sentinels are mathematically poisoned (means dragged, IQR widens, false negatives).

Pipeline order (Automated Workflows enforces): 02 → 03 → 04 → 05 → 06 → 07 → 08. 01 is order-flexible.

Contested cases:

  • Whitespace-only cell — 02 trims to empty; 04 then flags empty as null.
  • -999 sentinel — 04 converts to NaN first; 06 then computes stats.
  • Suspicious-but-plausible (age 110) — 06 territory.

10b. GUI internationalization (i18n)

The GUI uses an in-house, JSON-backed translation layer at src/i18n/. No gettext / babel / .po pipeline — the surface is small enough that a 100-line module + per-language JSON file is a better fit than a build-time toolchain.

Resolution model: t(key, lang=None, **fmt) walks a dotted key (home.title, tools.01_deduplicator.name) through a nested dict. Fallback chain: requested lang → English (canonical) → the literal key. Missing format placeholders return the raw template rather than raising so a translation file cannot crash the UI.

Active language is stored in st.session_state["ui_lang"]. Reading it outside a Streamlit run (tests, scripts) silently falls back to English, keeping the module importable without Streamlit context.

Picker placement: hide_streamlit_chrome() calls render_language_selector() on every page that hides Streamlit's default chrome — i.e., the entire app. One mount point, every page picks it up.

Pack parity is a tested invariant: tests/test_lang_packs.py::TestPackParity fails CI when en.json and another pack diverge in either direction. This catches translation drift at PR time rather than from buyer reports.

Farewell overlay: the shutdown screen's JS payload interpolates pack strings into an innerHTML inside a JS single-quoted string. _js_html_safe() in components/_legacy.py escapes both the JS string terminator (') and HTML special chars (< > &). The test TestFarewellEscape pins this; never bypass it.

Why not gettext: zero compiled artifacts in the PyInstaller bundle, no build step before tests run, no .po/.mo round-trip for translators (anyone can edit JSON), and the same lookup works in unit tests without process state. Locked in because the surface won't grow large enough to need the alternative, and the alternative breaks the "drop a file, run pytest, ship" loop.

10c. GUI chrome — sidebar nav indicator swap

Streamlit's st.Page-driven sidebar renders section headers with a Material Symbols ligature (expand_more / expand_less). The header element is not a button and carries no aria-expanded, so a pure-CSS swap can't follow open/closed state. We replace the glyph with plain typographic + / (U+2212) via JS:

  • CSS (components/_legacy.py, _HIDE_CHROME_CSS) drops the Material Symbols font on [data-testid="stIconMaterial"] inside [data-testid="stNavSectionHeader"] so the rewritten character renders as normal text rather than re-resolving as an icon name.
  • JS (_SWAP_NAV_SECTION_INDICATOR_JS) walks each section header, reads the icon's text node, and rewrites expand_more+ / expand_less. A MutationObserver re-runs the swap when Streamlit re-renders the sidebar (RAF-throttled so a burst of mutations is one swap).

The script ships through the same component-iframe bundle as the brand injector and upload-button rename inside hide_streamlit_chrome() — one iframe per page, three DOM mutations.

11. Per-script functional specs

Specs live in this section as scripts enter active build. Each follows the Tier 1/2/3 structure with explicit strategic framing (what's the market gap given some of this is free elsewhere).

11.1 01_deduplicator.py — Smart duplicate removal

Status: Ready. Tier 1 mostly built. Streamlit GUI port complete.

Market gap: fuzzy match quality of OpenRefine, with the zero-learning UX of Excel, sold once for under $100, runs locally.

Tier 1:

  • Input: auto-detect encoding (UTF-8, UTF-8-BOM, Latin-1, cp1252) · delimiter · header row · CSV/TSV/XLSX/XLS · multi-sheet picker · streaming for files > RAM.
  • Matching: exact + 3 fuzzy algos (Levenshtein / Jaro-Winkler / token-set) · per-column normalizers (5 types) · configurable threshold per strategy · multi-strategy OR.
  • Survivor: keep first / last / most-complete / most-recent · merge mode (fill blanks from losers).
  • Trust: dry-run preview by default · interactive review for gray-zone matches · confidence score per match · match-group export.
  • Audit: timestamped log · removed-rows separate file · input never modified · idempotent.
  • Config: save/load JSON profiles · sensible auto-detect defaults.
  • UX: human --help · progress bar > 10k rows · errors name row + column + value + suggestion.

Tier 2: numeric/date tolerance · phonetic match (Soundex, Metaphone) · blocking/indexing · watch-folder.

Tier 3: ML scoring · cross-file dedup · cron · Shopify/Klaviyo API direct.

11.2 02_text_cleaner.py — Character-level hygiene

Status: Ready. Tier 1 built.

Market gap: one-click correctness for the dirty-CSV failure modes that cause silent VLOOKUP misses.

Boundary:

  • 02 — whitespace, Unicode normalize, smart-char fold, BOM, line endings, zero-width, control chars, case ops. Writes to disk.
  • 03 — dates, currencies, names, phones, addresses (display formatting).
  • 04 — disguised nulls.
  • 01 — normalize_string is match-time only, distinct from 02's write-time policy.

Tier 1 ops (each toggleable; defaults shown for excel-hygiene):

  1. Trim leading/trailing whitespace — ON
  2. Collapse internal whitespace runs — ON
  3. NFC normalize — ON
  4. NFKC compatibility fold — OFF (lossy, opt-in via paranoid preset)
  5. Smart-char fold (curly quotes, em/en-dash, NBSP, ellipsis) — ON
  6. Zero-width / invisible char strip — ON
  7. BOM strip — ON
  8. Control-char strip (preserve \t\n\r) — ON
  9. Line-ending normalize (CRLF/CR → LF inside cells) — ON
  10. Case conversion (UPPER / lower / Title / Sentence) — OFF, per-column

Scope: per-column selection · skip-list · operates on string-typed columns only.

Trust: dry-run by default · per-cell change log (capped 1000, --full-changelog removes cap) · 3 output files mirroring dedup · idempotent.

Config: 3 presets (minimal / excel-hygiene (default) / paranoid) · save/load JSON.

11.3 03_format_standardizer.py — Per-domain canonical forms

Status: Ready. Full Tier 1 + most Tier 2 built. 199-row buyer corpus passing.

Market gap: unify dates / phones / emails / addresses / names / currencies / booleans across messy ETL inputs without buyer writing code.

Domains:

Domain Default canonical Notable handling
Date ISO 8601 (YYYY-MM-DD) MDY/DMY, Excel serial, Unix timestamp (s + ms), longform months, year-month, quarter, ISO week date (2024-W03-1), ISO ordinal (2024-015), RFC 2822, CJK separators (2024年01月15日), fullwidth digits, named-TZ resolution (EST/PST/JST/…), two_digit_year_cutoff
Phone E.164 + ;ext=N libphonenumber, 001 international prefix, error sentinels for placeholders / multi-number / contamination
Email lowercase + trim display-name extraction, mailto/angle-bracket strip, smart-quote unwrap, BIDI/RTL override strip (security), optional --gmail-canonical
Address USPS-canonical (expand=False) or expanded (expand=True) state/province-name → code for US/CA/AU/DE, UK postcode detection, multi-line collapse, PO Box normalize, state-code preservation regardless of input case
Name smart Title Case Mc/Mac/O'/D' inner caps, Arabic al-/el- lowercase, particle lowercasing (von/van/de/da/bin/ibn/ben), East Asian honorific suffixes (-san/-sama/-ssi), comma reversal (skippable via family_first), period stripping for titles/suffixes/initials, PhD/MD/Mag/Habil acronyms
Currency bare number (dot decimal) auto-detect EU vs US separators, space-thousands, Swiss apostrophe, accounting parens, optional ISO code preservation
Boolean True/False (configurable) accepts yes/no/y/n/1/0/on/off

International coverage (added v1.7):

  • Date locales: English (default) plus opt-in French / German / Spanish / Portuguese / Italian / Dutch / Russian month + weekday recognition.
  • Currency symbols: $, €, £, ¥, ₹, ₩, ₽, ₪, ₺, ¢ + ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN), ₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS).
  • ISO 4217 codes: 23 baseline (USD, EUR, …) plus ~30 emerging-market additions (SAR, AED, ARS, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS, KES, HUF, CZK, RON, UAH, …).
  • Address jurisdictions: US, Canada (13 provinces/territories), Australia (8 states), Germany (16 Bundesländer), UK (postcode shape).
  • Address PO Box: English, German (Postfach), French (Boîte postale), Spanish (Apartado), Italian (Casella postale), Portuguese (Caixa postal).

Per-domain error_policy: "passthrough" (default) keeps the original; "sentinel" emits <error: <reason>> for cases like Feb 30, double @, percentages mistaken for currency, etc.

Pipeline: standardize_dataframe(df, options) runs per-column with column_types: dict[str, FieldType]. Returns StandardizeResult with cells_changed, cells_unparseable, change audit. Warns when > 10% of typed cells fail to parse.

Presets: us-default, european, uk, iso-strict, legacy-us. Custom abbreviations via extra_abbreviations. Per-column culture flags: name_family_first (East Asian), address_state_to_code (any of 4 supported jurisdictions), date_month_locales (list of 8 supported codes).

11.4 Upload-time analyzer (src/core/analyze.py)

Read-only advisory pass on every upload. Emits Finding objects:

Field Meaning
id Stable identifier (never localized)
severity info / warn / error (only error blocks gate)
confidence high (round-trip safe) / medium (preview) / low (heuristic)
fix_action id of algorithm in fixes.py (empty for informational-only)
pre_applied true if fix already ran during read pass
tool owning tool id (or empty for file-level)
count cells / rows affected
description one-sentence human summary
column column name (None for file-level)
samples up to 5 (row, col, value) examples

Entry point: analyze(source, *, sample_rows=1000, repair_result=None, encoding_override=None). encoding_override skips charset detection — the hook that lets the Review page recover from misdetections.

11.5 CSV-normalization gate (src/core/normalize.py, fixes.py)

Two paths:

  1. Auto-fixauto_fix(df, findings) applies every confidence="high" finding whose fix_action is registered.
  2. Per-finding decisionsapply_decisions(df, findings, decisions) accepts Decision(finding_id, action, payload) with action "auto"|"skip"|"modified".

Returns NormalizationResult with cleaned_df, cleaned_bytes (UTF-8 CSV), applied, skipped_findings, pending_findings, blocking_findings.

is_normalized(findings, result) re-runs analyze() against cleaned bytes; returns False if any high-confidence detector still fires (the strict contract tool pages depend on).

Fix registry: @register("fix_id") decorates (df, payload) → (new_df, n_cells_changed). New fix = one entry in analyze.py FIX_* constants + one detector emitting that fix_action + one registered function. No other call sites change.

11.6 Review page (src/gui/pages/0_Review.py)

  1. Detected encoding + override picker (16 codepages + custom).
  2. One expandable card per finding (sorted by severity then confidence) with: decision radio (Auto/Skip/Customize), live before/after preview built by running the registered fix on Finding.samples, payload editor for fixes that take user input.
  3. Apply persists NormalizationResult keyed by upload SHA-256; tool pages refuse to load until hash matches.
  4. ⚙️ Advanced output options expander: per-download encoding + delimiter + line terminator. _build_output_bytes() returns (bytes, error_message); lossy fallbacks emit a warning the page surfaces.

Gates the entire tool sidebar via require_normalization_gate() in src/gui/components/_legacy.py.

11.7 Pre-parse repair (src/core/io.py::repair_bytes)

Byte-level pre-parse pass. Order is meaningful:

  1. Wide-encoding transcode (UTF-16/32 → UTF-8) — must run first or NUL strip below shreds UTF-16.
  2. UTF-8 BOM strip (file start only).
  3. NUL strip — only meaningful after step 1, so flags genuine corruption.
  4. Line-ending normalize — CRLF + bare CR → LF.
  5. Byte-level smart-quote fold — curly / guillemet / double-prime → ASCII " (only structural double-quote-equivalents; single curlies deferred to cell-level).
  6. Per-row delimiter repair — when a row has +1 field and merge candidate is currency-shaped ($1,500.00), merge + quote.

detect_encoding() tries strict UTF-8 first — charset-normalizer mislabels short-non-ASCII files as mac_latin2, but valid UTF-8 bytes mean UTF-8 regardless of label.