datatools-dev

Author	SHA1	Message	Date
Michael	966af8ef94	feat: 3 new tools, format streaming, distribution-ready demo + landing pages

Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler  src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper          src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner        src/core/pipeline.py + cli_pipeline.py + GUI
     with soft tool-dependency graph (recommended, not enforced) and JSON save/load for repeatable weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM) via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py: Streamlit Community Cloud entry shim
  • src/gui/app_demo.py: single-page demo, ?p=<persona> routing, 100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/: 3 niche datasets + pre-tuned pipeline JSONs
  • landing/: 4 static HTML pages (apex chooser + 3 niche), shared CSS, deploy.py URL-substitution script, auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md: full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0 xfailed

Tier-1 corpora added:
  • missing-corpus        3 use cases + 16 edge cases
  • column-mapper-corpus  3 use cases + 5 edge cases
  • format-cleaner intl   20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 22:31:26 +00:00
Michael	d18b95880d	feat(format-i18n): broaden international coverage across all domains

Closes ~17 high-value international gaps surfaced by parallel review. Adds 93 regression tests; full project suite now 1323 / 0 / 17 (passed / failed / xfailed).

DATES
- Adds Portuguese, Italian, Dutch, Russian month dictionaries to the opt-in ``month_locales`` set (now: en, fr, de, es, pt, it, nl, ru).
- Adds localized weekday recognition for those locales: "Lundi", "Montag", "lunedì", "понедельник", etc. all strip cleanly before format matching.
- New CJK separator normalization: Japanese ``2024年01月15日`` and fullwidth digits ``２０２４/０１/１５`` fold to ASCII before parsing.
- New named-timezone resolution: EST/PST/JST/CET/IST/GMT/etc. map to fixed UTC offsets via ``_NAMED_TZ_OFFSETS`` so the trailing TZ doesn't block format matching.
- New ISO 8601 extended formats: week date (``2024-W03-1``) and ordinal date (``2024-015``), plus RFC 2822 mail-header form (``Mon, 15 Jan 2024 10:30:00``).
- New ``two_digit_year_cutoff`` parameter on ``standardize_date()``; defaults to Python's stdlib 69; lower it for birth-year columns where most subjects were born ≤ 1999.

NAMES
- Particles set extended with Arabic patronymic markers (bin, ibn, bint, abu, abd, al, al-, el-) and Hebrew (ben, bat, ha, ha-).
- Title set extended with German (Herr, Frau), French (M., Mme, Mlle), Spanish (Sr., Sra., Srta., Don, Doña), Italian (Sig., Sig.ra, Dott.), Portuguese.
- Acronym map extended with international academic credentials (Dipl, Ing, Mag, Habil, MSc, BSc, LLB, LLM).
- New East Asian honorific suffix handler: ``Tanaka-san``, ``Lee-ssi``, ``Park-nim`` keep the suffix lowercase after the hyphen instead of being title-cased into ``Tanaka-San``.
- Hyphenated-segment handler now keeps Arabic prefixes ``al-`` / ``el-`` lowercase per Arabic transliteration convention.
- New ``family_first`` parameter on ``standardize_name()`` and matching ``name_family_first`` field on ``StandardizeOptions``; set per-column for East Asian data to skip Western comma-format reversal (``Kim, Min-jae`` stays ``Kim, …`` instead of becoming ``Min-jae Kim``).

CURRENCY
- Symbol map extended: ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN), ₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS); covers SE Asia, Africa, Eastern Europe, Latin America gaps.
- ISO 4217 code list extended from 23 to ~50: SAR, AED, QAR, KWD, BHD, OMR, ARS, CLP, COP, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS, KES, HUF, CZK, RON, UAH, KZT, etc.

EMAIL
- New BIDI / RTL override stripping (``standardize_email``): U+202A-U+202E and U+2066-U+2069 stripped from every email. These are a known phishing vector: ``alice<U+202E>@example.com`` displays as ``alice@elpmaxe.com`` to RTL-aware renderers.

ADDRESS
- Canadian provinces: 13 codes + names → 2-letter (Ontario → ON).
- UK postcode pattern recognition (``SW1A 2AA`` shape).
- Australian states: 8 codes + names (NSW, VIC, QLD, … + full names).
- German Bundesland: 16 codes + names (Bayern → BY, etc.).
- International PO Box variants: ``Postfach`` (DE), ``Boîte postale`` (FR), ``Apartado`` (ES), ``Casella postale`` (IT), ``Caixa postal`` (PT); all fold to canonical ``PO Box``.
- ``_INTL_STATE_CODES`` now combines US/CA/AU/DE codes; the position check that preserves state codes regardless of input case applies to all four jurisdictions.
- ``_is_state_code_position`` postal pattern broadened to recognize US ZIP, AU 4-digit, CA first half, and UK outward code.

CONSTANTS
- ``src/core/_constants.py`` gains: ``CA_PROVINCE_CODES`` / ``CA_PROVINCE_NAMES``, ``AU_STATE_CODES`` / ``AU_STATE_NAMES``, ``DE_STATE_CODES`` / ``DE_STATE_NAMES``, ``POSTAL_PATTERNS`` (us/ca/uk/de/au/fr), ``INTL_PO_BOX_PATTERNS`` (per-language regex), ``INTL_STREET_SUFFIXES`` (de/fr/es/it/uk dictionaries, ready for use when address takes a `country_hint` parameter in a future pass).

DOCS
- TECHNICAL.md §11.3 domain table updated with the new handling per domain plus a new "International coverage" sub-section listing the supported locales / symbols / jurisdictions.

DEFERRED (out of scope or rare)
- Alternative calendars (Japanese era, Hijri, Hebrew, Buddhist): corpus § 3.5 marks these out of scope.
- Persian/Arabic-Indic digit conversion: rare in tabular data.
- Trailing-minus RTL currency convention.
- Punycode ↔ Unicode IDN normalization.
- Mixed-country phone column auto-detection (user can override ``default_region`` per column).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 03:06:03 +00:00
Michael	26b9771625	feat(errors): structured error hierarchy + helpful messages everywhere

Introduces src/core/errors.py with a small structured error hierarchy that every public entry point now uses. Each error carries the context a user needs to fix it and the context a maintainer needs to trace it.

The hierarchy:
  DataToolsError (base; formats path, column, operation, suggestion)
    InputValidationError (extends ValueError; bad arg / wrong type)
    ConfigError (extends ValueError; bad config / options)
    FileFormatError (extends ValueError; file is not what we expected)
    FileAccessError (extends OSError; file I/O failure)

Subclassing the stdlib bases means existing `except OSError` / `except ValueError` handlers still catch them, so there is no breaking change.

Helpers:
- ensure_dataframe(value, function=...): uniform DataFrame guard
- ensure_choice(value, name=, choices=): uniform enum/literal guard
- wrap_file_read(path, op, exc): tag OSError with hint + path
- wrap_file_write(path, op, exc): same, with Windows-aware tip
- format_for_user(exc, context=): user-facing string for st.error / stderr

Library hardening:
- io.read_file: missing files surface FileAccessError listing whether the parent directory exists, and the suggestion to check the path.
- io.read_file: chunk_size <= 0 now raises InputValidationError with a positive-integer suggestion.
- io._read_excel: openpyxl BadZipFile / InvalidFileException / pandas ValueError ("sheet not found") wrapped as FileFormatError listing the path and a "list sheets with list_sheets()" hint.
- io._detect_excel_header_row: bare except narrowed to specific openpyxl exceptions; falls back gracefully and logs at debug so the real error surfaces from pd.read_excel.
- io.write_file: OSError / PermissionError on to_csv/to_excel wrapped with file path and Windows-aware "file may be open in another program" hint.
- dedup._parse_date: bare `except Exception` narrowed to (TypeError, ValueError, OutOfBoundsDatetime); failed values logged at debug for survivor-selection forensics.
- dedup._select_survivor: KEEP_MOST_RECENT now raises InputValidationError instead of silently falling back to keep_first.
- dedup.deduplicate: input validation errors are InputValidationError with operation/column/suggestion fields.
- format_standardize.from_dict: invalid FieldType for a column raises ConfigError naming the column AND the bad value AND listing valid values; same for date_order / phone_format / etc.
- format_standardize.from_file: OSError / JSON decode wrapped with path AND line/column where parsing failed.
- format_standardize.to_file: TypeError on json.dumps wrapped as ConfigError with the suspected source (extra_abbreviations).
- format_standardize._apply_field_type: dispatcher's "unknown field type" branch now raises AssertionError (it's an internal invariant, not user error: a new enum value was added without a branch).
- format_standardize._resolve_column_types: missing-column error now InputValidationError with a "check for typos / unparsed header" suggestion.
- format_standardize.standardize_dataframe: ensure_dataframe at entry.
- text_clean.clean_dataframe: ensure_dataframe at entry.
- config.to_strategies: invalid Algorithm/NormalizerType wrapped as ConfigError naming the strategy index AND the column.
- config.to_survivor_rule: invalid SurvivorRule wrapped as ConfigError listing valid values.
- config.from_file: OSError / JSON decode wrapped (mirror of StandardizeOptions.from_file).
- fixes.repair_mojibake: ImportError on ftfy now logged at info level with the underlying ImportError so a corrupt-package vs not-installed distinction is visible in the logs.
- normalizers.normalize_phone: phonenumbers.NumberParseException now logged at debug when the digits-only fallback drops extension / country-code information, which gives a trail when matching results look wrong.

GUI / CLI surfaces:
- All 9 page handlers (`except Exception as e: st.error(...)`) now use format_for_user(), which renders DataToolsError fields nicely and falls back to "ClassName: message" for unrecognized errors.
- 2_Text_Cleaner and 3_Format_Standardizer additionally distinguish UnicodeDecodeError with a "re-save as UTF-8" suggestion before the generic handler.
- cli.py's "Error reading file" handler now uses format_for_user() and includes the input path in the prefix.

Tests:
- tests/test_errors.py: 22 new tests covering base class formatting, stdlib inheritance, ensure_dataframe / ensure_choice helpers, wrap_file_read / wrap_file_write, format_for_user behavior, and end-to-end integration (missing file, missing dir, bad JSON, bad algorithm, bad enum, missing column).
- tests/test_audit_fixes.py + tests/test_io.py: updated 4 tests for the new exception types (InputValidationError replaces TypeError, FileAccessError extends OSError).

Full project suite: 1230 passed, 4 skipped, 17 xfailed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 02:35:42 +00:00
Michael	2eece6467d	refactor: dedup, consolidate, harden public APIs across core modules

Closes 16 high-value findings from a parallel cross-module review.

Refactors:
- New src/core/_constants.py centralizes USPS street-suffix abbreviations, US state names, and 2-letter postal codes, giving one source of truth for both normalize_address (matching keys) and standardize_address (display formatting). Eliminates ~80 lines of duplicated dicts across normalizers.py and format_standardize.py.
- format_standardize.py: collapse 4 identical nested _err() helpers into one shared _err_or_passthrough() module function; drop a dead duplicate `return _err("not a phone number")` branch in standardize_phone.
- format_standardize.py: precompile per-locale month-name regexes (_MONTH_LOCALE_PATTERNS) and per-state-name regexes (_STATE_NAME_PATTERNS) at import time; they were rebuilt on every cell, a measurable hot path on million-row inputs.
- dedup.py: extract _is_missing(value) helper; one definition of "this cell is None / NaN / pd.NA" instead of two.
- fixes.py: extract _is_string_column(ser) helper; one dtype check instead of three duplicates across _apply_to_strings, _vectorized_translate, _vectorized_regex_sub.

Production-readiness:
- format_standardize.standardize_dataframe now logs a warning when more than 10% of typed cells are unparseable, which surfaces the silently-broken-pipeline failure mode.
- StandardizeOptions.from_dict validates date_order / phone_format / currency_decimal / name_case / boolean_style / *_error_policy enum values up front, with a clear error message instead of a deep crash inside the per-cell function.
- StandardizeOptions.from_file and DeduplicationConfig.from_file wrap read + json.loads with descriptive OSError / ValueError messages including the file path.
- standardize_date(month_locales=...) validates locale codes against the available set instead of silently passing through unknown ones.
- io.read_file rejects chunk_size <= 0 (was silently failing inside pandas) and logs the resolved suffix + chunk_size at info level so data-pipeline runs are debuggable.
- io.read_file's FileNotFoundError gains explanatory context.
- io.write_file, text_clean.clean_dataframe, and dedup.deduplicate now reject non-DataFrame inputs with clear TypeError instead of cryptic pandas tracebacks downstream.
- dedup.deduplicate validates that survivor_rule=KEEP_MOST_RECENT has a usable date_column up front; the helper _select_survivor now raises (instead of silently falling back to keep_first) when called directly with bad arguments.
- dedup.deduplicate gains a structured no-op return when strategies is empty after auto-detection; this preserves schema instead of crashing.
- analyze._detect_inconsistent_date_format narrows its bare except to (TypeError, ValueError) and logs a debug line so genuine bugs don't hide behind silent skip.

Tests:
- tests/test_audit_fixes.py grows by 11 cases covering the new validation paths (chunk_size, DataFrame guards, KEEP_MOST_RECENT date_column, enum validation, locale validation, JSON error wrapping).

Full project suite: 1208 passed, 4 skipped, 17 xfailed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 02:23:09 +00:00
Michael	4adeb5c7f3	feat(format): per-cell standardizers + 199-row buyer corpus

Adds src/core/format_standardize.py, a per-cell standardizer for dates, phones, emails, addresses, names, currencies, booleans, wired through StandardizeOptions / standardize_dataframe with FieldType registry.

Includes:
- Date parser handles ISO/US/EU/longform/excel-serial/unix-timestamp/partial-precision/quarter notation; opt-in French/German/Spanish month dictionaries via month_locales.
- Phone via libphonenumber with extension preservation (;ext=N), 001 international prefix handling, error sentinels for placeholders / multi-number cells.
- Email lowercase/trim/mailto/angle-bracket strip with optional --gmail-canonical mode.
- Address USPS abbreviation expansion or compression (expand=False per corpus § 6.3), state-name → 2-letter conversion, multi-line collapse, PO Box normalization, state-code preservation regardless of input case.
- Name handler: Mc/Mac/O'/D' inner caps, hyphen segments, particle lowercasing (von/van/de/da), comma-format reversal, period stripping for titles/suffixes/initials, PhD/MD acronym preservation, conservative mode for mixed-case input.
- Currency: auto-detect EU vs US separators, space-thousands, Swiss apostrophe, accounting parens, optional ISO code preservation, error sentinels for percentages/ranges/word-values/ambiguous separators.
- Per-domain error_policy ("passthrough" | "sentinel") for surfacing malformed values as <error: reason> per corpus § 0.3.

Test corpus from Business/DataTools/test-cases-format-cleaner copied to test-cases/format-cleaner-corpus/: 7 fixtures plus FORMATS-CASES.md. tests/test_format_standardize_corpus.py drives all 199 rows through the per-cell standardizers; 0 xfailed.

Wires the GUI page (3_Format_Standardizer.py) to "Ready" status.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 02:11:24 +00:00
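
Commit 26b9771625 states that the new errors subclass stdlib bases so that existing `except OSError` / `except ValueError` handlers keep working. The following is a minimal sketch of that idea, not the contents of src/core/errors.py: the class names and the four context fields come from the commit message, while the constructor signature, the `__str__` layout, and the `load()` helper are assumptions made for illustration.

```python
class DataToolsError(Exception):
    """Base error carrying optional context fields."""

    def __init__(self, message, *, path=None, column=None,
                 operation=None, suggestion=None):
        # Keyword-only signature is an assumption for this sketch.
        super().__init__(message)
        self.message = message
        self.path = path
        self.column = column
        self.operation = operation
        self.suggestion = suggestion

    def __str__(self):
        # Append only the context fields that were actually supplied.
        parts = [self.message]
        for label in ("path", "column", "operation", "suggestion"):
            value = getattr(self, label)
            if value is not None:
                parts.append(f"{label}: {value}")
        return " | ".join(parts)


class InputValidationError(DataToolsError, ValueError):
    """Bad argument or wrong type."""


class ConfigError(DataToolsError, ValueError):
    """Bad config or options."""


class FileFormatError(DataToolsError, ValueError):
    """File is not what was expected."""


class FileAccessError(DataToolsError, OSError):
    """File I/O failure."""


def load(path):
    # Hypothetical caller that raises the structured error.
    raise FileAccessError("file not found", path=path,
                          suggestion="check the path")


try:
    load("data.csv")
except OSError as exc:  # a handler written before the change still matches
    caught = str(exc)
```

Listing `DataToolsError` first in the bases keeps its `__str__` ahead of the stdlib one in the method resolution order, while `isinstance` checks against `ValueError` or `OSError` still succeed.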

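Commit d18b95880d describes stripping the BIDI control ranges U+202A-U+202E and U+2066-U+2069 from every email. One way this could look is sketched below; the function name `strip_bidi_controls` and the regex-based approach are assumptions, and the actual `standardize_email` implementation may differ.

```python
import re

# The two code-point ranges named in the commit message:
# U+202A-U+202E (embeddings / overrides) and U+2066-U+2069 (isolates).
_BIDI_CONTROLS = re.compile("[\u202a-\u202e\u2066-\u2069]")


def strip_bidi_controls(value: str) -> str:
    """Remove BIDI embedding, override, and isolate characters."""
    return _BIDI_CONTROLS.sub("", value)


# The address from the commit message, with the override written as an escape.
cleaned = strip_bidi_controls("alice\u202e@example.com")
```

Here `cleaned` compares equal to the plain ASCII address, so downstream matching no longer depends on how a renderer treats the override.
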
5 Commits
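
The ``two_digit_year_cutoff`` parameter from commit d18b95880d can be read as a pivot: two-digit years at or above the cutoff land in the 1900s, the rest in the 2000s, which matches the stdlib ``%y`` convention at the default of 69. The helper below is a hypothetical sketch of that rule; its name and the exact boundary handling are assumptions, not the code in ``standardize_date()``.

```python
def expand_two_digit_year(yy: int, cutoff: int = 69) -> int:
    """Map a two-digit year to four digits using a pivot.

    Values >= cutoff become 19yy; values below it become 20yy.
    """
    if not 0 <= yy <= 99:
        raise ValueError(f"expected a two-digit year, got {yy!r}")
    return 1900 + yy if yy >= cutoff else 2000 + yy


default_low = expand_two_digit_year(68)                # below the default pivot
default_high = expand_two_digit_year(69)               # at the default pivot
birth_year = expand_two_digit_year(45, cutoff=30)      # lowered pivot
```

With the default pivot, 45 would resolve to 2045; lowering the cutoff to 30 resolves it to 1945, which is the birth-year case the commit message mentions.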