Files
datatools-dev/docs/TECHNICAL.md
Michael d18b95880d feat(format-i18n): broaden international coverage across all domains
Closes ~17 high-value international gaps surfaced by parallel review.
Adds 93 regression tests; full project suite now 1323 / 0 / 17 (passed
/ failed / xfailed).

DATES
- Adds Portuguese, Italian, Dutch, Russian month dictionaries to the
  opt-in ``month_locales`` set (now: en, fr, de, es, pt, it, nl, ru).
- Adds localized weekday recognition for those locales — "Lundi",
  "Montag", "lunedì", "понедельник", etc. all strip cleanly before
  format matching.
- New CJK separator normalization: Japanese ``2024年01月15日`` and
  fullwidth digits ``2024/01/15`` fold to ASCII before parsing.
- New named-timezone resolution: EST/PST/JST/CET/IST/GMT/etc. map to
  fixed UTC offsets via ``_NAMED_TZ_OFFSETS`` so the trailing TZ
  doesn't block format matching.
- New ISO 8601 extended formats: week date (``2024-W03-1``) and
  ordinal date (``2024-015``), plus RFC 2822 mail-header form
  (``Mon, 15 Jan 2024 10:30:00``).
- New ``two_digit_year_cutoff`` parameter on ``standardize_date()`` —
  defaults to Python's stdlib 69; lower it for birth-year columns
  where most subjects were born ≤ 1999.

NAMES
- Particles set extended with Arabic patronymic markers (bin, ibn,
  bint, abu, abd, al, al-, el-) and Hebrew (ben, bat, ha, ha-).
- Title set extended with German (Herr, Frau), French (M., Mme,
  Mlle), Spanish (Sr., Sra., Srta., Don, Doña), Italian (Sig., Sig.ra,
  Dott.), Portuguese.
- Acronym map extended with international academic credentials
  (Dipl, Ing, Mag, Habil, MSc, BSc, LLB, LLM).
- New East Asian honorific suffix handler: ``Tanaka-san``,
  ``Lee-ssi``, ``Park-nim`` keep the suffix lowercase after the
  hyphen instead of being title-cased into ``Tanaka-San``.
- Hyphenated-segment handler now keeps Arabic prefixes ``al-`` /
  ``el-`` lowercase per Arabic transliteration convention.
- New ``family_first`` parameter on ``standardize_name()`` and matching
  ``name_family_first`` field on ``StandardizeOptions`` — set
  per-column for East Asian data to skip Western comma-format reversal
  (``Kim, Min-jae`` stays ``Kim, …`` instead of becoming ``Min-jae Kim``).

CURRENCY
- Symbol map extended: ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN),
  ₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS) — covers SE Asia, Africa,
  Eastern Europe, Latin America gaps.
- ISO 4217 code list extended from 23 to ~50: SAR, AED, QAR, KWD,
  BHD, OMR, ARS, CLP, COP, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS,
  KES, HUF, CZK, RON, UAH, KZT, etc.

EMAIL
- New BIDI / RTL override stripping (``standardize_email``):
  U+202A-U+202E and U+2066-U+2069 stripped from every email. These
  are a known phishing vector — ``alice‮@example.com`` displays as
  ``alice@elpmaxe.com`` to RTL-aware renderers.

ADDRESS
- Canadian provinces: 13 codes + names → 2-letter (Ontario → ON).
- UK postcode pattern recognition (``SW1A 2AA`` shape).
- Australian states: 8 codes + names (NSW, VIC, QLD, … + full names).
- German Bundesland: 16 codes + names (Bayern → BY, etc.).
- International PO Box variants: ``Postfach`` (DE), ``Boîte postale``
  (FR), ``Apartado`` (ES), ``Casella postale`` (IT), ``Caixa postal``
  (PT) — all fold to canonical ``PO Box``.
- ``_INTL_STATE_CODES`` now combines US/CA/AU/DE codes; the position
  check that preserves state codes regardless of input case applies
  to all four jurisdictions.
- ``_is_state_code_position`` postal pattern broadened to recognize
  US ZIP, AU 4-digit, CA first half, and UK outward code.

CONSTANTS
- ``src/core/_constants.py`` gains: ``CA_PROVINCE_CODES`` /
  ``CA_PROVINCE_NAMES``, ``AU_STATE_CODES`` / ``AU_STATE_NAMES``,
  ``DE_STATE_CODES`` / ``DE_STATE_NAMES``, ``POSTAL_PATTERNS``
  (us/ca/uk/de/au/fr), ``INTL_PO_BOX_PATTERNS`` (per-language regex),
  ``INTL_STREET_SUFFIXES`` (de/fr/es/it/uk dictionaries — ready for
  use when address takes a `country_hint` parameter in a future pass).

DOCS
- TECHNICAL.md §11.3 domain table updated with the new handling per
  domain plus a new "International coverage" sub-section listing the
  supported locales / symbols / jurisdictions.

DEFERRED (out of scope or rare)
- Alternative calendars (Japanese era, Hijri, Hebrew, Buddhist) —
  corpus § 3.5 marks out of scope.
- Persian/Arabic-Indic digit conversion — rare in tabular data.
- Trailing-minus RTL currency convention.
- Punycode ↔ Unicode IDN normalization.
- Mixed-country phone column auto-detection (user can override
  ``default_region`` per column).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 03:06:03 +00:00

19 KiB

Technical

Creator-only. Do not ship to buyers. Version: 1.6 · Updated: 2026-05-01

1. Architecture

  • Dual interface: CLI + GUI, both wrapping the same src/core/ library.
  • GUI: Streamlit, runs as local web server, opens in default browser. No internet.
  • Runtime: Python 3.10+ (bundled into installer; buyer never sees Python).
  • Cross-platform: Windows, macOS, Linux from day one. PyInstaller per OS.
  • Core/UI rule: business logic in core/ only. CLI + GUI are thin front-ends.

Locks:

  • v1.2 — dual interface required (non-technical buyers won't use CLI).
  • v1.3 — Streamlit chosen (over CustomTkinter inactive, plain Tk UX gap, Flet/PySide6/NiceGUI each fails one dimension). See DECISIONS.md §4c.

2. Repo layout

src/
  core/                  # Shared logic. No UI code.
    analyze.py           # Detectors + Finding schema
    config.py            # DeduplicationConfig (JSON profiles)
    dedup.py             # Match strategies, union-find, survivor selection
    errors.py            # Structured error hierarchy + format_for_user
    fixes.py             # Fix registry (one per fix_action)
    format_standardize.py # Per-cell standardizers + DataFrame pipeline
    io.py                # read_file / write_file / repair_bytes
    normalize.py         # CSV-normalization gate
    normalizers.py       # Per-column normalizers for dedup matching
    text_clean.py        # clean_dataframe + smart_title_case
    _constants.py        # Shared USPS abbrevs + state names
  cli.py                 # Deduplicator CLI (Typer)
  cli_text_clean.py      # Text Cleaner CLI
  cli_analyze.py         # Analyzer CLI (--json)
  gui/
    app.py               # Streamlit entry point
    pages/               # One page per tool
    components/          # shared, dedup_review, findings, gate, _legacy
build/                   # PyInstaller spec, launcher, OS-specific configs
demo/                    # Constrained Streamlit Community Cloud version
tests/                   # pytest; targets core/, not UI
test-cases/              # Fixture corpora (text-cleaner, encodings, format-cleaner)

Demo subfolder: row-limited, watermarked, file-size-capped Streamlit app for public deployment. Same core, different front-end constraints.

3. Build pipeline

3.1 Tooling

Concern Tool
Bundling PyInstaller
GUI Streamlit
CLI Typer
Browser launch stdlib webbrowser
Win installer Inno Setup (free)
macOS sign+notarize codesign + notarytool
Linux AppImage (primary) + tarball fallback
CI GitHub Actions matrix
Demo host Streamlit Community Cloud (free)

3.2 Build outputs

OS File Buyer experience
Win *-Setup-1.0.exe Wizard → desktop shortcut "Launch Bundle" → browser opens. CLI on PATH.
macOS *-1.0.dmg Drag to Applications. Signed + notarized.
Linux *-1.0.AppImage chmod +x, double-click.

3.3 PyInstaller

  • --onefile for Linux, --onedir for Win/macOS (faster startup, easier signing).
  • Two entry points: GUI launcher + CLI binaries.
  • Streamlit hooks needed: streamlit, altair, pyarrow data dirs.
  • Custom hook-streamlit.py per documented pattern.
  • Budget: 1-3 days first time. Reusable after.

3.4 Streamlit launcher

  1. Find free port (don't hardcode 8501).
  2. Set env: STREAMLIT_SERVER_HEADLESS=true, STREAMLIT_BROWSER_GATHER_USAGE_STATS=false, STREAMLIT_SERVER_PORT={port}.
  3. Start Streamlit programmatically in a thread.
  4. Poll port until ready.
  5. Open browser to http://localhost:{port}.
  6. Keep launcher alive while server runs.

Optional v1.1: wrap with pywebview to eliminate browser-launch UX. Defer until support tickets show meaningful confusion.

3.5 macOS pipeline

  1. PyInstaller produces unsigned .app.
  2. codesign --deep --force --options runtime --sign "Developer ID Application: ..." App.app.
  3. Package as .dmg.
  4. xcrun notarytool submit *.dmg --wait.
  5. xcrun stapler staple *.dmg.

Setup: Apple Developer Program ($99/yr), Developer ID cert in Keychain, app-specific password.

3.6-3.7 Win + Linux

  • Win: PyInstaller --onedir → Inno Setup wraps → installer adds Start Menu, desktop shortcut, PATH entries. Optional code-signing cert ($200-400/yr) if SmartScreen friction.
  • Linux: PyInstaller → appimagetool wraps. .tar.gz fallback for distros where AppImage fails.

3.8 CI matrix

strategy:
  matrix:
    os: [windows-latest, macos-latest, ubuntu-latest]

Tag a release → 3 platform artifacts upload to GitHub Releases. Manual: copy to Gumroad / Lemon Squeezy.

3.9 Hosted demo

demo/streamlit_app.py → Streamlit Community Cloud. Configure deployment in Streamlit UI. Custom domain via CNAME (verify policy at deploy time). Fall back to $5/mo VPS if rate limits / branding constraints hit.

4. Libraries

Purpose Library
GUI streamlit
CLI typer
Data pandas, openpyxl, numpy
Fuzzy match rapidfuzz
Phone parsing phonenumbers
Encoding detect charset-normalizer
Logging loguru
Mojibake (optional) ftfy
Reports reportlab

5. Coding standards

5.1 Code

  • PEP 8 + type hints on public functions.
  • Docstrings on every module + public function.
  • pathlib.Path for paths, never string concat.
  • All I/O explicitly UTF-8-aware.
  • No platform-specific shell calls.
  • pytest for core/, not UI.
  • Errors raise via src.core.errors hierarchy (Section 7).

5.2 GUI UX (load-bearing per DECISIONS.md §4b)

  • Works out of the box — drop file → useful result with zero config.
  • Sensible defaults visible everywhere.
  • Progressive disclosure — basic = file uploader + run button + results; rest in st.expander.
  • Plain-English labels; technical detail in help= tooltip.
  • Dry-run / preview by default.
  • Identical core to CLI.
  • Local-first messaging — "runs locally in your browser, no internet" line on every page.

5.3 Functional scope (load-bearing per DECISIONS.md §4a)

  • Each script ships complete coverage of the workflow it names, including features Excel does for free.
  • Boundary = the named workflow. Dedup includes normalization + survivor + audit; not format conversion or charting.

6. System requirements

Buyer runtime: Win 10/11 64-bit · macOS 11+ · Linux glibc 2020+ · modern browser · ~400-500 MB disk · no internet.

Developer: Python 3.10+ · PyInstaller · Inno Setup (Win) · Xcode CLT (macOS) · Apple Developer Program $99/yr · Git + GitHub.

7. Error handling (src/core/errors.py)

Structured hierarchy for friendly messages + maintainable trace context:

DataToolsError              # base; carries path/column/operation/suggestion/cause
  InputValidationError(ValueError)   # bad arg / wrong type
  ConfigError(ValueError)            # bad config / options
  FileFormatError(ValueError)        # file isn't what we expected
  FileAccessError(OSError)           # I/O failure (perms, disk, missing)

Subclassing rule: every subclass extends a stdlib base (ValueError or OSError) so existing except OSError / except ValueError handlers still catch them.

Helpers:

  • ensure_dataframe(value, function=...) — uniform DataFrame guard at every public entry.
  • ensure_choice(value, name=, choices=) — uniform enum/literal guard.
  • wrap_file_read(path, op, exc) / wrap_file_write(...) — tag OSError with file path + Windows-aware permission tip.
  • format_for_user(exc, context=) — single string for st.error() / CLI stderr.

GUI / CLI handlers use format_for_user() so the user always sees: file path, operation, underlying error class, recovery suggestion.

8. Per-bundle status

Bundle Status
Data Cleaning Mastery 3/9 tools Ready (Dedup, Text Cleaner, Format Standardizer); 6 stubs
Automated Business Reporting Not started
Ecommerce Data Pipeline Not started
Small Business Finance Not started
Marketing Public Data Aggregation Not started
AI Ecommerce Aggregation (Shopify Pet) Not started

9. Open decisions

  • pywebview wrap — defer until support tickets show browser-launch confusion.
  • Win code signing — defer until SmartScreen drives volume. Cost ~$200-400/yr.
  • Auto-update mechanism — none at launch. Email-delivered updates. Revisit at 100+ buyers/bundle.
  • Demo hosting migration — Streamlit Community Cloud → $5/mo VPS if rate/brand limits hit.
  • Code obfuscation — none; license text + bundle complexity sufficient at $49-79.
  • Telemetry — none. Consider opt-in privacy-respecting only post-launch.

10. Script boundaries — 04 (Missing Values) vs 06 (Outliers)

Deliberately separate. Confluent original spec was wrong.

Script Owns
04 Missing Value Handler "What's not there." Disguised nulls (N/A, -, sentinel codes), missingness patterns, imputation, drop-by-threshold.
06 Outlier Detector "What shouldn't be there." z-score / IQR / modified-z, multivariate (Isolation Forest, Mahalanobis), domain rules, winsorization.

Run order: 04 before 06. Outlier stats on data with NaN / sentinels are mathematically poisoned (means dragged, IQR widens, false negatives).

Pipeline order (Pipeline Runner enforces): 02 → 03 → 04 → 05 → 06 → 07 → 08. 01 is order-flexible.

Contested cases:

  • Whitespace-only cell — 02 trims to empty; 04 then flags empty as null.
  • -999 sentinel — 04 converts to NaN first; 06 then computes stats.
  • Suspicious-but-plausible (age 110) — 06 territory.

11. Per-script functional specs

Specs live in this section as scripts enter active build. Each follows the Tier 1/2/3 structure with explicit strategic framing (what's the market gap given some of this is free elsewhere).

11.1 01_deduplicator.py — Smart duplicate removal

Status: Ready. Tier 1 mostly built. Streamlit GUI port complete.

Market gap: fuzzy match quality of OpenRefine, with the zero-learning UX of Excel, sold once for under $100, runs locally.

Tier 1:

  • Input: auto-detect encoding (UTF-8, UTF-8-BOM, Latin-1, cp1252) · delimiter · header row · CSV/TSV/XLSX/XLS · multi-sheet picker · streaming for files > RAM.
  • Matching: exact + 3 fuzzy algos (Levenshtein / Jaro-Winkler / token-set) · per-column normalizers (5 types) · configurable threshold per strategy · multi-strategy OR.
  • Survivor: keep first / last / most-complete / most-recent · merge mode (fill blanks from losers).
  • Trust: dry-run preview by default · interactive review for gray-zone matches · confidence score per match · match-group export.
  • Audit: timestamped log · removed-rows separate file · input never modified · idempotent.
  • Config: save/load JSON profiles · sensible auto-detect defaults.
  • UX: human --help · progress bar > 10k rows · errors name row + column + value + suggestion.

Tier 2: numeric/date tolerance · phonetic match (Soundex, Metaphone) · blocking/indexing · watch-folder.

Tier 3: ML scoring · cross-file dedup · cron · Shopify/Klaviyo API direct.

11.2 02_text_cleaner.py — Character-level hygiene

Status: Ready. Tier 1 built.

Market gap: one-click correctness for the dirty-CSV failure modes that cause silent VLOOKUP misses.

Boundary:

  • 02 — whitespace, Unicode normalize, smart-char fold, BOM, line endings, zero-width, control chars, case ops. Writes to disk.
  • 03 — dates, currencies, names, phones, addresses (display formatting).
  • 04 — disguised nulls.
  • 01 — normalize_string is match-time only, distinct from 02's write-time policy.

Tier 1 ops (each toggleable; defaults shown for excel-hygiene):

  1. Trim leading/trailing whitespace — ON
  2. Collapse internal whitespace runs — ON
  3. NFC normalize — ON
  4. NFKC compatibility fold — OFF (lossy, opt-in via paranoid preset)
  5. Smart-char fold (curly quotes, em/en-dash, NBSP, ellipsis) — ON
  6. Zero-width / invisible char strip — ON
  7. BOM strip — ON
  8. Control-char strip (preserve \t\n\r) — ON
  9. Line-ending normalize (CRLF/CR → LF inside cells) — ON
  10. Case conversion (UPPER / lower / Title / Sentence) — OFF, per-column

Scope: per-column selection · skip-list · operates on string-typed columns only.

Trust: dry-run by default · per-cell change log (capped 1000, --full-changelog removes cap) · 3 output files mirroring dedup · idempotent.

Config: 3 presets (minimal / excel-hygiene (default) / paranoid) · save/load JSON.

11.3 03_format_standardizer.py — Per-domain canonical forms

Status: Ready. Full Tier 1 + most Tier 2 built. 199-row buyer corpus passing.

Market gap: unify dates / phones / emails / addresses / names / currencies / booleans across messy ETL inputs without buyer writing code.

Domains:

Domain Default canonical Notable handling
Date ISO 8601 (YYYY-MM-DD) MDY/DMY, Excel serial, Unix timestamp (s + ms), longform months, year-month, quarter, ISO week date (2024-W03-1), ISO ordinal (2024-015), RFC 2822, CJK separators (2024年01月15日), fullwidth digits, named-TZ resolution (EST/PST/JST/…), two_digit_year_cutoff
Phone E.164 + ;ext=N libphonenumber, 001 international prefix, error sentinels for placeholders / multi-number / contamination
Email lowercase + trim display-name extraction, mailto/angle-bracket strip, smart-quote unwrap, BIDI/RTL override strip (security), optional --gmail-canonical
Address USPS-canonical (expand=False) or expanded (expand=True) state/province-name → code for US/CA/AU/DE, UK postcode detection, multi-line collapse, PO Box normalize, state-code preservation regardless of input case
Name smart Title Case Mc/Mac/O'/D' inner caps, Arabic al-/el- lowercase, particle lowercasing (von/van/de/da/bin/ibn/ben), East Asian honorific suffixes (-san/-sama/-ssi), comma reversal (skippable via family_first), period stripping for titles/suffixes/initials, PhD/MD/Mag/Habil acronyms
Currency bare number (dot decimal) auto-detect EU vs US separators, space-thousands, Swiss apostrophe, accounting parens, optional ISO code preservation
Boolean True/False (configurable) accepts yes/no/y/n/1/0/on/off

International coverage (added v1.7):

  • Date locales: English (default) plus opt-in French / German / Spanish / Portuguese / Italian / Dutch / Russian month + weekday recognition.
  • Currency symbols: $, €, £, ¥, ₹, ₩, ₽, ₪, ₺, ¢ + ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN), ₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS).
  • ISO 4217 codes: 23 baseline (USD, EUR, …) plus ~30 emerging-market additions (SAR, AED, ARS, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS, KES, HUF, CZK, RON, UAH, …).
  • Address jurisdictions: US, Canada (13 provinces/territories), Australia (8 states), Germany (16 Bundesländer), UK (postcode shape).
  • Address PO Box: English, German (Postfach), French (Boîte postale), Spanish (Apartado), Italian (Casella postale), Portuguese (Caixa postal).

Per-domain error_policy: "passthrough" (default) keeps the original; "sentinel" emits <error: <reason>> for cases like Feb 30, double @, percentages mistaken for currency, etc.

Pipeline: standardize_dataframe(df, options) runs per-column with column_types: dict[str, FieldType]. Returns StandardizeResult with cells_changed, cells_unparseable, change audit. Warns when > 10% of typed cells fail to parse.

Presets: us-default, european, uk, iso-strict, legacy-us. Custom abbreviations via extra_abbreviations. Per-column culture flags: name_family_first (East Asian), address_state_to_code (any of 4 supported jurisdictions), date_month_locales (list of 8 supported codes).

11.4 Upload-time analyzer (src/core/analyze.py)

Read-only advisory pass on every upload. Emits Finding objects:

Field Meaning
id Stable identifier (never localized)
severity info / warn / error (only error blocks gate)
confidence high (round-trip safe) / medium (preview) / low (heuristic)
fix_action id of algorithm in fixes.py (empty for informational-only)
pre_applied true if fix already ran during read pass
tool owning tool id (or empty for file-level)
count cells / rows affected
description one-sentence human summary
column column name (None for file-level)
samples up to 5 (row, col, value) examples

Entry point: analyze(source, *, sample_rows=1000, repair_result=None, encoding_override=None). encoding_override skips charset detection — the hook that lets the Review page recover from misdetections.

11.5 CSV-normalization gate (src/core/normalize.py, fixes.py)

Two paths:

  1. Auto-fixauto_fix(df, findings) applies every confidence="high" finding whose fix_action is registered.
  2. Per-finding decisionsapply_decisions(df, findings, decisions) accepts Decision(finding_id, action, payload) with action "auto"|"skip"|"modified".

Returns NormalizationResult with cleaned_df, cleaned_bytes (UTF-8 CSV), applied, skipped_findings, pending_findings, blocking_findings.

is_normalized(findings, result) re-runs analyze() against cleaned bytes; returns False if any high-confidence detector still fires (the strict contract tool pages depend on).

Fix registry: @register("fix_id") decorates (df, payload) → (new_df, n_cells_changed). New fix = one entry in analyze.py FIX_* constants + one detector emitting that fix_action + one registered function. No other call sites change.

11.6 Review page (src/gui/pages/0_Review.py)

  1. Detected encoding + override picker (16 codepages + custom).
  2. One expandable card per finding (sorted by severity then confidence) with: decision radio (Auto/Skip/Customize), live before/after preview built by running the registered fix on Finding.samples, payload editor for fixes that take user input.
  3. Apply persists NormalizationResult keyed by upload SHA-256; tool pages refuse to load until hash matches.
  4. ⚙️ Advanced output options expander: per-download encoding + delimiter + line terminator. _build_output_bytes() returns (bytes, error_message); lossy fallbacks emit a warning the page surfaces.

Gates the entire tool sidebar via require_normalization_gate() in src/gui/components/_legacy.py.

11.7 Pre-parse repair (src/core/io.py::repair_bytes)

Byte-level pre-parse pass. Order is meaningful:

  1. Wide-encoding transcode (UTF-16/32 → UTF-8) — must run first or NUL strip below shreds UTF-16.
  2. UTF-8 BOM strip (file start only).
  3. NUL strip — only meaningful after step 1, so flags genuine corruption.
  4. Line-ending normalize — CRLF + bare CR → LF.
  5. Byte-level smart-quote fold — curly / guillemet / double-prime → ASCII " (only structural double-quote-equivalents; single curlies deferred to cell-level).
  6. Per-row delimiter repair — when a row has +1 field and merge candidate is currency-shaped ($1,500.00), merge + quote.

detect_encoding() tries strict UTF-8 first — charset-normalizer mislabels short-non-ASCII files as mac_latin2, but valid UTF-8 bytes mean UTF-8 regardless of label.