Closes ~17 high-value international gaps surfaced by parallel review. Adds 93 regression tests; full project suite now 1323 / 0 / 17 (passed / failed / xfailed). DATES - Adds Portuguese, Italian, Dutch, Russian month dictionaries to the opt-in ``month_locales`` set (now: en, fr, de, es, pt, it, nl, ru). - Adds localized weekday recognition for those locales — "Lundi", "Montag", "lunedì", "понедельник", etc. all strip cleanly before format matching. - New CJK separator normalization: Japanese ``2024年01月15日`` and fullwidth digits ``2024/01/15`` fold to ASCII before parsing. - New named-timezone resolution: EST/PST/JST/CET/IST/GMT/etc. map to fixed UTC offsets via ``_NAMED_TZ_OFFSETS`` so the trailing TZ doesn't block format matching. - New ISO 8601 extended formats: week date (``2024-W03-1``) and ordinal date (``2024-015``), plus RFC 2822 mail-header form (``Mon, 15 Jan 2024 10:30:00``). - New ``two_digit_year_cutoff`` parameter on ``standardize_date()`` — defaults to Python's stdlib 69; lower it for birth-year columns where most subjects were born ≤ 1999. NAMES - Particles set extended with Arabic patronymic markers (bin, ibn, bint, abu, abd, al, al-, el-) and Hebrew (ben, bat, ha, ha-). - Title set extended with German (Herr, Frau), French (M., Mme, Mlle), Spanish (Sr., Sra., Srta., Don, Doña), Italian (Sig., Sig.ra, Dott.), Portuguese. - Acronym map extended with international academic credentials (Dipl, Ing, Mag, Habil, MSc, BSc, LLB, LLM). - New East Asian honorific suffix handler: ``Tanaka-san``, ``Lee-ssi``, ``Park-nim`` keep the suffix lowercase after the hyphen instead of being title-cased into ``Tanaka-San``. - Hyphenated-segment handler now keeps Arabic prefixes ``al-`` / ``el-`` lowercase per Arabic transliteration convention. - New ``family_first`` parameter on ``standardize_name()`` and matching ``name_family_first`` field on ``StandardizeOptions`` — set per-column for East Asian data to skip Western comma-format reversal (``Kim, Min-jae`` stays ``Kim, …`` instead of becoming ``Min-jae Kim``). CURRENCY - Symbol map extended: ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN), ₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS) — covers SE Asia, Africa, Eastern Europe, Latin America gaps. - ISO 4217 code list extended from 23 to ~50: SAR, AED, QAR, KWD, BHD, OMR, ARS, CLP, COP, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS, KES, HUF, CZK, RON, UAH, KZT, etc. EMAIL - New BIDI / RTL override stripping (``standardize_email``): U+202A-U+202E and U+2066-U+2069 stripped from every email. These are a known phishing vector — ``alice@example.com`` displays as ``alice@elpmaxe.com`` to RTL-aware renderers. ADDRESS - Canadian provinces: 13 codes + names → 2-letter (Ontario → ON). - UK postcode pattern recognition (``SW1A 2AA`` shape). - Australian states: 8 codes + names (NSW, VIC, QLD, … + full names). - German Bundesland: 16 codes + names (Bayern → BY, etc.). - International PO Box variants: ``Postfach`` (DE), ``Boîte postale`` (FR), ``Apartado`` (ES), ``Casella postale`` (IT), ``Caixa postal`` (PT) — all fold to canonical ``PO Box``. - ``_INTL_STATE_CODES`` now combines US/CA/AU/DE codes; the position check that preserves state codes regardless of input case applies to all four jurisdictions. - ``_is_state_code_position`` postal pattern broadened to recognize US ZIP, AU 4-digit, CA first half, and UK outward code. CONSTANTS - ``src/core/_constants.py`` gains: ``CA_PROVINCE_CODES`` / ``CA_PROVINCE_NAMES``, ``AU_STATE_CODES`` / ``AU_STATE_NAMES``, ``DE_STATE_CODES`` / ``DE_STATE_NAMES``, ``POSTAL_PATTERNS`` (us/ca/uk/de/au/fr), ``INTL_PO_BOX_PATTERNS`` (per-language regex), ``INTL_STREET_SUFFIXES`` (de/fr/es/it/uk dictionaries — ready for use when address takes a `country_hint` parameter in a future pass). DOCS - TECHNICAL.md §11.3 domain table updated with the new handling per domain plus a new "International coverage" sub-section listing the supported locales / symbols / jurisdictions. DEFERRED (out of scope or rare) - Alternative calendars (Japanese era, Hijri, Hebrew, Buddhist) — corpus § 3.5 marks out of scope. - Persian/Arabic-Indic digit conversion — rare in tabular data. - Trailing-minus RTL currency convention. - Punycode ↔ Unicode IDN normalization. - Mixed-country phone column auto-detection (user can override ``default_region`` per column). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
19 KiB
Technical
Creator-only. Do not ship to buyers. Version: 1.6 · Updated: 2026-05-01
1. Architecture
- Dual interface: CLI + GUI, both wrapping the same
src/core/library. - GUI: Streamlit, runs as local web server, opens in default browser. No internet.
- Runtime: Python 3.10+ (bundled into installer; buyer never sees Python).
- Cross-platform: Windows, macOS, Linux from day one. PyInstaller per OS.
- Core/UI rule: business logic in
core/only. CLI + GUI are thin front-ends.
Locks:
- v1.2 — dual interface required (non-technical buyers won't use CLI).
- v1.3 — Streamlit chosen (over CustomTkinter inactive, plain Tk UX gap, Flet/PySide6/NiceGUI each fails one dimension). See DECISIONS.md §4c.
2. Repo layout
src/
core/ # Shared logic. No UI code.
analyze.py # Detectors + Finding schema
config.py # DeduplicationConfig (JSON profiles)
dedup.py # Match strategies, union-find, survivor selection
errors.py # Structured error hierarchy + format_for_user
fixes.py # Fix registry (one per fix_action)
format_standardize.py # Per-cell standardizers + DataFrame pipeline
io.py # read_file / write_file / repair_bytes
normalize.py # CSV-normalization gate
normalizers.py # Per-column normalizers for dedup matching
text_clean.py # clean_dataframe + smart_title_case
_constants.py # Shared USPS abbrevs + state names
cli.py # Deduplicator CLI (Typer)
cli_text_clean.py # Text Cleaner CLI
cli_analyze.py # Analyzer CLI (--json)
gui/
app.py # Streamlit entry point
pages/ # One page per tool
components/ # shared, dedup_review, findings, gate, _legacy
build/ # PyInstaller spec, launcher, OS-specific configs
demo/ # Constrained Streamlit Community Cloud version
tests/ # pytest; targets core/, not UI
test-cases/ # Fixture corpora (text-cleaner, encodings, format-cleaner)
Demo subfolder: row-limited, watermarked, file-size-capped Streamlit app for public deployment. Same core, different front-end constraints.
3. Build pipeline
3.1 Tooling
| Concern | Tool |
|---|---|
| Bundling | PyInstaller |
| GUI | Streamlit |
| CLI | Typer |
| Browser launch | stdlib webbrowser |
| Win installer | Inno Setup (free) |
| macOS sign+notarize | codesign + notarytool |
| Linux | AppImage (primary) + tarball fallback |
| CI | GitHub Actions matrix |
| Demo host | Streamlit Community Cloud (free) |
3.2 Build outputs
| OS | File | Buyer experience |
|---|---|---|
| Win | *-Setup-1.0.exe |
Wizard → desktop shortcut "Launch Bundle" → browser opens. CLI on PATH. |
| macOS | *-1.0.dmg |
Drag to Applications. Signed + notarized. |
| Linux | *-1.0.AppImage |
chmod +x, double-click. |
3.3 PyInstaller
--onefilefor Linux,--onedirfor Win/macOS (faster startup, easier signing).- Two entry points: GUI launcher + CLI binaries.
- Streamlit hooks needed:
streamlit,altair,pyarrowdata dirs. - Custom
hook-streamlit.pyper documented pattern. - Budget: 1-3 days first time. Reusable after.
3.4 Streamlit launcher
- Find free port (don't hardcode 8501).
- Set env:
STREAMLIT_SERVER_HEADLESS=true,STREAMLIT_BROWSER_GATHER_USAGE_STATS=false,STREAMLIT_SERVER_PORT={port}. - Start Streamlit programmatically in a thread.
- Poll port until ready.
- Open browser to
http://localhost:{port}. - Keep launcher alive while server runs.
Optional v1.1: wrap with pywebview to eliminate browser-launch UX. Defer until support tickets show meaningful confusion.
3.5 macOS pipeline
- PyInstaller produces unsigned
.app. codesign --deep --force --options runtime --sign "Developer ID Application: ..." App.app.- Package as
.dmg. xcrun notarytool submit *.dmg --wait.xcrun stapler staple *.dmg.
Setup: Apple Developer Program ($99/yr), Developer ID cert in Keychain, app-specific password.
3.6-3.7 Win + Linux
- Win: PyInstaller
--onedir→ Inno Setup wraps → installer adds Start Menu, desktop shortcut, PATH entries. Optional code-signing cert ($200-400/yr) if SmartScreen friction. - Linux: PyInstaller →
appimagetoolwraps..tar.gzfallback for distros where AppImage fails.
3.8 CI matrix
strategy:
matrix:
os: [windows-latest, macos-latest, ubuntu-latest]
Tag a release → 3 platform artifacts upload to GitHub Releases. Manual: copy to Gumroad / Lemon Squeezy.
3.9 Hosted demo
demo/streamlit_app.py → Streamlit Community Cloud. Configure deployment in Streamlit UI. Custom domain via CNAME (verify policy at deploy time). Fall back to $5/mo VPS if rate limits / branding constraints hit.
4. Libraries
| Purpose | Library |
|---|---|
| GUI | streamlit |
| CLI | typer |
| Data | pandas, openpyxl, numpy |
| Fuzzy match | rapidfuzz |
| Phone parsing | phonenumbers |
| Encoding detect | charset-normalizer |
| Logging | loguru |
| Mojibake (optional) | ftfy |
| Reports | reportlab |
5. Coding standards
5.1 Code
- PEP 8 + type hints on public functions.
- Docstrings on every module + public function.
pathlib.Pathfor paths, never string concat.- All I/O explicitly UTF-8-aware.
- No platform-specific shell calls.
- pytest for
core/, not UI. - Errors raise via
src.core.errorshierarchy (Section 7).
5.2 GUI UX (load-bearing per DECISIONS.md §4b)
- Works out of the box — drop file → useful result with zero config.
- Sensible defaults visible everywhere.
- Progressive disclosure — basic = file uploader + run button + results; rest in
st.expander. - Plain-English labels; technical detail in
help=tooltip. - Dry-run / preview by default.
- Identical core to CLI.
- Local-first messaging — "runs locally in your browser, no internet" line on every page.
5.3 Functional scope (load-bearing per DECISIONS.md §4a)
- Each script ships complete coverage of the workflow it names, including features Excel does for free.
- Boundary = the named workflow. Dedup includes normalization + survivor + audit; not format conversion or charting.
6. System requirements
Buyer runtime: Win 10/11 64-bit · macOS 11+ · Linux glibc 2020+ · modern browser · ~400-500 MB disk · no internet.
Developer: Python 3.10+ · PyInstaller · Inno Setup (Win) · Xcode CLT (macOS) · Apple Developer Program $99/yr · Git + GitHub.
7. Error handling (src/core/errors.py)
Structured hierarchy for friendly messages + maintainable trace context:
DataToolsError # base; carries path/column/operation/suggestion/cause
InputValidationError(ValueError) # bad arg / wrong type
ConfigError(ValueError) # bad config / options
FileFormatError(ValueError) # file isn't what we expected
FileAccessError(OSError) # I/O failure (perms, disk, missing)
Subclassing rule: every subclass extends a stdlib base (ValueError or OSError) so existing except OSError / except ValueError handlers still catch them.
Helpers:
ensure_dataframe(value, function=...)— uniform DataFrame guard at every public entry.ensure_choice(value, name=, choices=)— uniform enum/literal guard.wrap_file_read(path, op, exc)/wrap_file_write(...)— tag OSError with file path + Windows-aware permission tip.format_for_user(exc, context=)— single string forst.error()/ CLI stderr.
GUI / CLI handlers use format_for_user() so the user always sees: file path, operation, underlying error class, recovery suggestion.
8. Per-bundle status
| Bundle | Status |
|---|---|
| Data Cleaning Mastery | 3/9 tools Ready (Dedup, Text Cleaner, Format Standardizer); 6 stubs |
| Automated Business Reporting | Not started |
| Ecommerce Data Pipeline | Not started |
| Small Business Finance | Not started |
| Marketing Public Data Aggregation | Not started |
| AI Ecommerce Aggregation (Shopify Pet) | Not started |
9. Open decisions
- pywebview wrap — defer until support tickets show browser-launch confusion.
- Win code signing — defer until SmartScreen drives volume. Cost ~$200-400/yr.
- Auto-update mechanism — none at launch. Email-delivered updates. Revisit at 100+ buyers/bundle.
- Demo hosting migration — Streamlit Community Cloud → $5/mo VPS if rate/brand limits hit.
- Code obfuscation — none; license text + bundle complexity sufficient at $49-79.
- Telemetry — none. Consider opt-in privacy-respecting only post-launch.
10. Script boundaries — 04 (Missing Values) vs 06 (Outliers)
Deliberately separate. Confluent original spec was wrong.
| Script | Owns |
|---|---|
| 04 Missing Value Handler | "What's not there." Disguised nulls (N/A, -, sentinel codes), missingness patterns, imputation, drop-by-threshold. |
| 06 Outlier Detector | "What shouldn't be there." z-score / IQR / modified-z, multivariate (Isolation Forest, Mahalanobis), domain rules, winsorization. |
Run order: 04 before 06. Outlier stats on data with NaN / sentinels are mathematically poisoned (means dragged, IQR widens, false negatives).
Pipeline order (Pipeline Runner enforces): 02 → 03 → 04 → 05 → 06 → 07 → 08. 01 is order-flexible.
Contested cases:
- Whitespace-only cell — 02 trims to empty; 04 then flags empty as null.
-999sentinel — 04 converts toNaNfirst; 06 then computes stats.- Suspicious-but-plausible (age 110) — 06 territory.
11. Per-script functional specs
Specs live in this section as scripts enter active build. Each follows the Tier 1/2/3 structure with explicit strategic framing (what's the market gap given some of this is free elsewhere).
11.1 01_deduplicator.py — Smart duplicate removal
Status: Ready. Tier 1 mostly built. Streamlit GUI port complete.
Market gap: fuzzy match quality of OpenRefine, with the zero-learning UX of Excel, sold once for under $100, runs locally.
Tier 1:
- Input: auto-detect encoding (UTF-8, UTF-8-BOM, Latin-1, cp1252) · delimiter · header row · CSV/TSV/XLSX/XLS · multi-sheet picker · streaming for files > RAM.
- Matching: exact + 3 fuzzy algos (Levenshtein / Jaro-Winkler / token-set) · per-column normalizers (5 types) · configurable threshold per strategy · multi-strategy OR.
- Survivor: keep first / last / most-complete / most-recent · merge mode (fill blanks from losers).
- Trust: dry-run preview by default · interactive review for gray-zone matches · confidence score per match · match-group export.
- Audit: timestamped log · removed-rows separate file · input never modified · idempotent.
- Config: save/load JSON profiles · sensible auto-detect defaults.
- UX: human
--help· progress bar > 10k rows · errors name row + column + value + suggestion.
Tier 2: numeric/date tolerance · phonetic match (Soundex, Metaphone) · blocking/indexing · watch-folder.
Tier 3: ML scoring · cross-file dedup · cron · Shopify/Klaviyo API direct.
11.2 02_text_cleaner.py — Character-level hygiene
Status: Ready. Tier 1 built.
Market gap: one-click correctness for the dirty-CSV failure modes that cause silent VLOOKUP misses.
Boundary:
- 02 — whitespace, Unicode normalize, smart-char fold, BOM, line endings, zero-width, control chars, case ops. Writes to disk.
- 03 — dates, currencies, names, phones, addresses (display formatting).
- 04 — disguised nulls.
- 01 —
normalize_stringis match-time only, distinct from 02's write-time policy.
Tier 1 ops (each toggleable; defaults shown for excel-hygiene):
- Trim leading/trailing whitespace — ON
- Collapse internal whitespace runs — ON
- NFC normalize — ON
- NFKC compatibility fold — OFF (lossy, opt-in via
paranoidpreset) - Smart-char fold (curly quotes, em/en-dash, NBSP, ellipsis) — ON
- Zero-width / invisible char strip — ON
- BOM strip — ON
- Control-char strip (preserve
\t\n\r) — ON - Line-ending normalize (CRLF/CR → LF inside cells) — ON
- Case conversion (UPPER / lower / Title / Sentence) — OFF, per-column
Scope: per-column selection · skip-list · operates on string-typed columns only.
Trust: dry-run by default · per-cell change log (capped 1000, --full-changelog removes cap) · 3 output files mirroring dedup · idempotent.
Config: 3 presets (minimal / excel-hygiene (default) / paranoid) · save/load JSON.
11.3 03_format_standardizer.py — Per-domain canonical forms
Status: Ready. Full Tier 1 + most Tier 2 built. 199-row buyer corpus passing.
Market gap: unify dates / phones / emails / addresses / names / currencies / booleans across messy ETL inputs without buyer writing code.
Domains:
| Domain | Default canonical | Notable handling |
|---|---|---|
| Date | ISO 8601 (YYYY-MM-DD) |
MDY/DMY, Excel serial, Unix timestamp (s + ms), longform months, year-month, quarter, ISO week date (2024-W03-1), ISO ordinal (2024-015), RFC 2822, CJK separators (2024年01月15日), fullwidth digits, named-TZ resolution (EST/PST/JST/…), two_digit_year_cutoff |
| Phone | E.164 + ;ext=N |
libphonenumber, 001 international prefix, error sentinels for placeholders / multi-number / contamination |
| lowercase + trim | display-name extraction, mailto/angle-bracket strip, smart-quote unwrap, BIDI/RTL override strip (security), optional --gmail-canonical |
|
| Address | USPS-canonical (expand=False) or expanded (expand=True) |
state/province-name → code for US/CA/AU/DE, UK postcode detection, multi-line collapse, PO Box normalize, state-code preservation regardless of input case |
| Name | smart Title Case | Mc/Mac/O'/D' inner caps, Arabic al-/el- lowercase, particle lowercasing (von/van/de/da/bin/ibn/ben), East Asian honorific suffixes (-san/-sama/-ssi), comma reversal (skippable via family_first), period stripping for titles/suffixes/initials, PhD/MD/Mag/Habil acronyms |
| Currency | bare number (dot decimal) | auto-detect EU vs US separators, space-thousands, Swiss apostrophe, accounting parens, optional ISO code preservation |
| Boolean | True/False (configurable) |
accepts yes/no/y/n/1/0/on/off |
International coverage (added v1.7):
- Date locales: English (default) plus opt-in French / German / Spanish / Portuguese / Italian / Dutch / Russian month + weekday recognition.
- Currency symbols: $, €, £, ¥, ₹, ₩, ₽, ₪, ₺, ¢ + ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN), ₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS).
- ISO 4217 codes: 23 baseline (USD, EUR, …) plus ~30 emerging-market additions (SAR, AED, ARS, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS, KES, HUF, CZK, RON, UAH, …).
- Address jurisdictions: US, Canada (13 provinces/territories), Australia (8 states), Germany (16 Bundesländer), UK (postcode shape).
- Address PO Box: English, German (
Postfach), French (Boîte postale), Spanish (Apartado), Italian (Casella postale), Portuguese (Caixa postal).
Per-domain error_policy: "passthrough" (default) keeps the original; "sentinel" emits <error: <reason>> for cases like Feb 30, double @, percentages mistaken for currency, etc.
Pipeline: standardize_dataframe(df, options) runs per-column with column_types: dict[str, FieldType]. Returns StandardizeResult with cells_changed, cells_unparseable, change audit. Warns when > 10% of typed cells fail to parse.
Presets: us-default, european, uk, iso-strict, legacy-us. Custom abbreviations via extra_abbreviations. Per-column culture flags: name_family_first (East Asian), address_state_to_code (any of 4 supported jurisdictions), date_month_locales (list of 8 supported codes).
11.4 Upload-time analyzer (src/core/analyze.py)
Read-only advisory pass on every upload. Emits Finding objects:
| Field | Meaning |
|---|---|
id |
Stable identifier (never localized) |
severity |
info / warn / error (only error blocks gate) |
confidence |
high (round-trip safe) / medium (preview) / low (heuristic) |
fix_action |
id of algorithm in fixes.py (empty for informational-only) |
pre_applied |
true if fix already ran during read pass |
tool |
owning tool id (or empty for file-level) |
count |
cells / rows affected |
description |
one-sentence human summary |
column |
column name (None for file-level) |
samples |
up to 5 (row, col, value) examples |
Entry point: analyze(source, *, sample_rows=1000, repair_result=None, encoding_override=None). encoding_override skips charset detection — the hook that lets the Review page recover from misdetections.
11.5 CSV-normalization gate (src/core/normalize.py, fixes.py)
Two paths:
- Auto-fix —
auto_fix(df, findings)applies everyconfidence="high"finding whosefix_actionis registered. - Per-finding decisions —
apply_decisions(df, findings, decisions)acceptsDecision(finding_id, action, payload)with action"auto"|"skip"|"modified".
Returns NormalizationResult with cleaned_df, cleaned_bytes (UTF-8 CSV), applied, skipped_findings, pending_findings, blocking_findings.
is_normalized(findings, result) re-runs analyze() against cleaned bytes; returns False if any high-confidence detector still fires (the strict contract tool pages depend on).
Fix registry: @register("fix_id") decorates (df, payload) → (new_df, n_cells_changed). New fix = one entry in analyze.py FIX_* constants + one detector emitting that fix_action + one registered function. No other call sites change.
11.6 Review page (src/gui/pages/0_Review.py)
- Detected encoding + override picker (16 codepages + custom).
- One expandable card per finding (sorted by severity then confidence) with: decision radio (Auto/Skip/Customize), live before/after preview built by running the registered fix on
Finding.samples, payload editor for fixes that take user input. - Apply persists
NormalizationResultkeyed by upload SHA-256; tool pages refuse to load until hash matches. ⚙️ Advanced output optionsexpander: per-download encoding + delimiter + line terminator._build_output_bytes()returns(bytes, error_message); lossy fallbacks emit a warning the page surfaces.
Gates the entire tool sidebar via require_normalization_gate() in src/gui/components/_legacy.py.
11.7 Pre-parse repair (src/core/io.py::repair_bytes)
Byte-level pre-parse pass. Order is meaningful:
- Wide-encoding transcode (UTF-16/32 → UTF-8) — must run first or NUL strip below shreds UTF-16.
- UTF-8 BOM strip (file start only).
- NUL strip — only meaningful after step 1, so flags genuine corruption.
- Line-ending normalize — CRLF + bare CR → LF.
- Byte-level smart-quote fold — curly / guillemet / double-prime → ASCII
"(only structural double-quote-equivalents; single curlies deferred to cell-level). - Per-row delimiter repair — when a row has +1 field and merge candidate is currency-shaped (
$1,500.00), merge + quote.
detect_encoding() tries strict UTF-8 first — charset-normalizer mislabels short-non-ASCII files as mac_latin2, but valid UTF-8 bytes mean UTF-8 regardless of label.