feat(format-i18n): broaden international coverage across all domains

Closes ~17 high-value international gaps surfaced by parallel review.
Adds 93 regression tests; full project suite now 1323 / 0 / 17 (passed
/ failed / xfailed).

DATES
- Adds Portuguese, Italian, Dutch, Russian month dictionaries to the
  opt-in ``month_locales`` set (now: en, fr, de, es, pt, it, nl, ru).
- Adds localized weekday recognition for those locales — "Lundi",
  "Montag", "lunedì", "понедельник", etc. all strip cleanly before
  format matching.
- New CJK separator normalization: Japanese ``2024年01月15日`` and
  fullwidth digits ``2024/01/15`` fold to ASCII before parsing.
- New named-timezone resolution: EST/PST/JST/CET/IST/GMT/etc. map to
  fixed UTC offsets via ``_NAMED_TZ_OFFSETS`` so the trailing TZ
  doesn't block format matching.
- New ISO 8601 extended formats: week date (``2024-W03-1``) and
  ordinal date (``2024-015``), plus RFC 2822 mail-header form
  (``Mon, 15 Jan 2024 10:30:00``).
- New ``two_digit_year_cutoff`` parameter on ``standardize_date()`` —
  defaults to Python's stdlib 69; lower it for birth-year columns
  where most subjects were born ≤ 1999.

NAMES
- Particles set extended with Arabic patronymic markers (bin, ibn,
  bint, abu, abd, al, al-, el-) and Hebrew (ben, bat, ha, ha-).
- Title set extended with German (Herr, Frau), French (M., Mme,
  Mlle), Spanish (Sr., Sra., Srta., Don, Doña), Italian (Sig., Sig.ra,
  Dott.), Portuguese.
- Acronym map extended with international academic credentials
  (Dipl, Ing, Mag, Habil, MSc, BSc, LLB, LLM).
- New East Asian honorific suffix handler: ``Tanaka-san``,
  ``Lee-ssi``, ``Park-nim`` keep the suffix lowercase after the
  hyphen instead of being title-cased into ``Tanaka-San``.
- Hyphenated-segment handler now keeps Arabic prefixes ``al-`` /
  ``el-`` lowercase per Arabic transliteration convention.
- New ``family_first`` parameter on ``standardize_name()`` and matching
  ``name_family_first`` field on ``StandardizeOptions`` — set
  per-column for East Asian data to skip Western comma-format reversal
  (``Kim, Min-jae`` stays ``Kim, …`` instead of becoming ``Min-jae Kim``).

CURRENCY
- Symbol map extended: ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN),
  ₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS) — covers SE Asia, Africa,
  Eastern Europe, Latin America gaps.
- ISO 4217 code list extended from 23 to ~50: SAR, AED, QAR, KWD,
  BHD, OMR, ARS, CLP, COP, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS,
  KES, HUF, CZK, RON, UAH, KZT, etc.

EMAIL
- New BIDI / RTL override stripping (``standardize_email``):
  U+202A-U+202E and U+2066-U+2069 stripped from every email. These
  are a known phishing vector — ``alice‮@example.com`` displays as
  ``alice@elpmaxe.com`` to RTL-aware renderers.

ADDRESS
- Canadian provinces: 13 codes + names → 2-letter (Ontario → ON).
- UK postcode pattern recognition (``SW1A 2AA`` shape).
- Australian states: 8 codes + names (NSW, VIC, QLD, … + full names).
- German Bundesland: 16 codes + names (Bayern → BY, etc.).
- International PO Box variants: ``Postfach`` (DE), ``Boîte postale``
  (FR), ``Apartado`` (ES), ``Casella postale`` (IT), ``Caixa postal``
  (PT) — all fold to canonical ``PO Box``.
- ``_INTL_STATE_CODES`` now combines US/CA/AU/DE codes; the position
  check that preserves state codes regardless of input case applies
  to all four jurisdictions.
- ``_is_state_code_position`` postal pattern broadened to recognize
  US ZIP, AU 4-digit, CA first half, and UK outward code.

CONSTANTS
- ``src/core/_constants.py`` gains: ``CA_PROVINCE_CODES`` /
  ``CA_PROVINCE_NAMES``, ``AU_STATE_CODES`` / ``AU_STATE_NAMES``,
  ``DE_STATE_CODES`` / ``DE_STATE_NAMES``, ``POSTAL_PATTERNS``
  (us/ca/uk/de/au/fr), ``INTL_PO_BOX_PATTERNS`` (per-language regex),
  ``INTL_STREET_SUFFIXES`` (de/fr/es/it/uk dictionaries — ready for
  use when address takes a `country_hint` parameter in a future pass).

DOCS
- TECHNICAL.md §11.3 domain table updated with the new handling per
  domain plus a new "International coverage" sub-section listing the
  supported locales / symbols / jurisdictions.

DEFERRED (out of scope or rare)
- Alternative calendars (Japanese era, Hijri, Hebrew, Buddhist) —
  corpus § 3.5 marks out of scope.
- Persian/Arabic-Indic digit conversion — rare in tabular data.
- Trailing-minus RTL currency convention.
- Punycode ↔ Unicode IDN normalization.
- Mixed-country phone column auto-detection (user can override
  ``default_region`` per column).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-01 03:06:03 +00:00
parent abb720997e
commit d18b95880d
4 changed files with 920 additions and 52 deletions

View File

@@ -282,19 +282,26 @@ Specs live in this section as scripts enter active build. Each follows the Tier
**Domains**:
| Domain | Default canonical | Notable handling |
|--------|-------------------|------------------|
| Date | ISO 8601 (`YYYY-MM-DD`) | MDY/DMY, Excel serial, Unix timestamp (s + ms), longform months, year-month, quarter notation, French/German/Spanish month dictionaries (opt-in), buried-date regex, error sentinels for invalid dates |
| Phone | E.164 + `;ext=N` | libphonenumber, 001 international prefix handling, error sentinels for placeholders / multi-number / contamination |
| Email | lowercase + trim | display-name extraction, mailto/angle-bracket strip, smart-quote unwrap, optional `--gmail-canonical` mode |
| Address | USPS-canonical (`expand=False`) or expanded (`expand=True`) | state-name → 2-letter, multi-line collapse, PO Box normalize, state-code preservation regardless of input case |
| Name | smart Title Case | Mc/Mac/O'/D' inner caps, hyphen segments, particle lowercasing (von/van/de/da), comma-format reversal, period stripping for titles/suffixes/initials, PhD/MD acronym preservation, conservative mode |
| Date | ISO 8601 (`YYYY-MM-DD`) | MDY/DMY, Excel serial, Unix timestamp (s + ms), longform months, year-month, quarter, ISO week date (`2024-W03-1`), ISO ordinal (`2024-015`), RFC 2822, CJK separators (`2024年01月15日`), fullwidth digits, named-TZ resolution (EST/PST/JST/…), `two_digit_year_cutoff` |
| Phone | E.164 + `;ext=N` | libphonenumber, 001 international prefix, error sentinels for placeholders / multi-number / contamination |
| Email | lowercase + trim | display-name extraction, mailto/angle-bracket strip, smart-quote unwrap, BIDI/RTL override strip (security), optional `--gmail-canonical` |
| Address | USPS-canonical (`expand=False`) or expanded (`expand=True`) | state/province-name → code for US/CA/AU/DE, UK postcode detection, multi-line collapse, PO Box normalize, state-code preservation regardless of input case |
| Name | smart Title Case | Mc/Mac/O'/D' inner caps, Arabic `al-`/`el-` lowercase, particle lowercasing (von/van/de/da/bin/ibn/ben), East Asian honorific suffixes (`-san`/`-sama`/`-ssi`), comma reversal (skippable via `family_first`), period stripping for titles/suffixes/initials, PhD/MD/Mag/Habil acronyms |
| Currency | bare number (dot decimal) | auto-detect EU vs US separators, space-thousands, Swiss apostrophe, accounting parens, optional ISO code preservation |
| Boolean | `True`/`False` (configurable) | accepts `yes`/`no`/`y`/`n`/`1`/`0`/`on`/`off` |
**International coverage** (added v1.7):
- **Date locales**: English (default) plus opt-in French / German / Spanish / Portuguese / Italian / Dutch / Russian month + weekday recognition.
- **Currency symbols**: $, €, £, ¥, ₹, ₩, ₽, ₪, ₺, ¢ + ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN), ₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS).
- **ISO 4217 codes**: 23 baseline (USD, EUR, …) plus ~30 emerging-market additions (SAR, AED, ARS, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS, KES, HUF, CZK, RON, UAH, …).
- **Address jurisdictions**: US, Canada (13 provinces/territories), Australia (8 states), Germany (16 Bundesländer), UK (postcode shape).
- **Address PO Box**: English, German (`Postfach`), French (`Boîte postale`), Spanish (`Apartado`), Italian (`Casella postale`), Portuguese (`Caixa postal`).
**Per-domain `error_policy`**: `"passthrough"` (default) keeps the original; `"sentinel"` emits `<error: <reason>>` for cases like Feb 30, double @, percentages mistaken for currency, etc.
**Pipeline**: `standardize_dataframe(df, options)` runs per-column with `column_types: dict[str, FieldType]`. Returns `StandardizeResult` with `cells_changed`, `cells_unparseable`, change audit. Warns when > 10% of typed cells fail to parse.
**Presets**: `us-default`, `european`, `uk`, `iso-strict`, `legacy-us`. Custom abbreviations via `extra_abbreviations`.
**Presets**: `us-default`, `european`, `uk`, `iso-strict`, `legacy-us`. Custom abbreviations via `extra_abbreviations`. Per-column culture flags: `name_family_first` (East Asian), `address_state_to_code` (any of 4 supported jurisdictions), `date_month_locales` (list of 8 supported codes).
### 11.4 Upload-time analyzer (`src/core/analyze.py`)