feat(format-i18n): broaden international coverage across all domains

Closes ~17 high-value international gaps surfaced by parallel review. Adds 93 regression tests; full project suite now 1323 / 0 / 17 (passed / failed / xfailed). DATES - Adds Portuguese, Italian, Dutch, Russian month dictionaries to the opt-in ``month_locales`` set (now: en, fr, de, es, pt, it, nl, ru). - Adds localized weekday recognition for those locales — "Lundi", "Montag", "lunedì", "понедельник", etc. all strip cleanly before format matching. - New CJK separator normalization: Japanese ``2024年01月15日`` and fullwidth digits ``２０２４/０１/１５`` fold to ASCII before parsing. - New named-timezone resolution: EST/PST/JST/CET/IST/GMT/etc. map to fixed UTC offsets via ``_NAMED_TZ_OFFSETS`` so the trailing TZ doesn't block format matching. - New ISO 8601 extended formats: week date (``2024-W03-1``) and ordinal date (``2024-015``), plus RFC 2822 mail-header form (``Mon, 15 Jan 2024 10:30:00``). - New ``two_digit_year_cutoff`` parameter on ``standardize_date()`` — defaults to Python's stdlib 69; lower it for birth-year columns where most subjects were born ≤ 1999. NAMES - Particles set extended with Arabic patronymic markers (bin, ibn, bint, abu, abd, al, al-, el-) and Hebrew (ben, bat, ha, ha-). - Title set extended with German (Herr, Frau), French (M., Mme, Mlle), Spanish (Sr., Sra., Srta., Don, Doña), Italian (Sig., Sig.ra, Dott.), Portuguese. - Acronym map extended with international academic credentials (Dipl, Ing, Mag, Habil, MSc, BSc, LLB, LLM). - New East Asian honorific suffix handler: ``Tanaka-san``, ``Lee-ssi``, ``Park-nim`` keep the suffix lowercase after the hyphen instead of being title-cased into ``Tanaka-San``. - Hyphenated-segment handler now keeps Arabic prefixes ``al-`` / ``el-`` lowercase per Arabic transliteration convention. - New ``family_first`` parameter on ``standardize_name()`` and matching ``name_family_first`` field on ``StandardizeOptions`` — set per-column for East Asian data to skip Western comma-format reversal (``Kim, Min-jae`` stays ``Kim, …`` instead of becoming ``Min-jae Kim``). CURRENCY - Symbol map extended: ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN), ₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS) — covers SE Asia, Africa, Eastern Europe, Latin America gaps. - ISO 4217 code list extended from 23 to ~50: SAR, AED, QAR, KWD, BHD, OMR, ARS, CLP, COP, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS, KES, HUF, CZK, RON, UAH, KZT, etc. EMAIL - New BIDI / RTL override stripping (``standardize_email``): U+202A-U+202E and U+2066-U+2069 stripped from every email. These are a known phishing vector — ``alice‮@example.com`` displays as ``alice@elpmaxe.com`` to RTL-aware renderers. ADDRESS - Canadian provinces: 13 codes + names → 2-letter (Ontario → ON). - UK postcode pattern recognition (``SW1A 2AA`` shape). - Australian states: 8 codes + names (NSW, VIC, QLD, … + full names). - German Bundesland: 16 codes + names (Bayern → BY, etc.). - International PO Box variants: ``Postfach`` (DE), ``Boîte postale`` (FR), ``Apartado`` (ES), ``Casella postale`` (IT), ``Caixa postal`` (PT) — all fold to canonical ``PO Box``. - ``_INTL_STATE_CODES`` now combines US/CA/AU/DE codes; the position check that preserves state codes regardless of input case applies to all four jurisdictions. - ``_is_state_code_position`` postal pattern broadened to recognize US ZIP, AU 4-digit, CA first half, and UK outward code. CONSTANTS - ``src/core/_constants.py`` gains: ``CA_PROVINCE_CODES`` / ``CA_PROVINCE_NAMES``, ``AU_STATE_CODES`` / ``AU_STATE_NAMES``, ``DE_STATE_CODES`` / ``DE_STATE_NAMES``, ``POSTAL_PATTERNS`` (us/ca/uk/de/au/fr), ``INTL_PO_BOX_PATTERNS`` (per-language regex), ``INTL_STREET_SUFFIXES`` (de/fr/es/it/uk dictionaries — ready for use when address takes a `country_hint` parameter in a future pass). DOCS - TECHNICAL.md §11.3 domain table updated with the new handling per domain plus a new "International coverage" sub-section listing the supported locales / symbols / jurisdictions. DEFERRED (out of scope or rare) - Alternative calendars (Japanese era, Hijri, Hebrew, Buddhist) — corpus § 3.5 marks out of scope. - Persian/Arabic-Indic digit conversion — rare in tabular data. - Trailing-minus RTL currency convention. - Punycode ↔ Unicode IDN normalization. - Mixed-country phone column auto-detection (user can override ``default_region`` per column). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 03:06:03 +00:00
parent abb720997e
commit d18b95880d
4 changed files with 920 additions and 52 deletions
--- a/docs/TECHNICAL.md
+++ b/docs/TECHNICAL.md
@@ -282,19 +282,26 @@ Specs live in this section as scripts enter active build. Each follows the Tier
 **Domains**:
 | Domain | Default canonical | Notable handling |
 |--------|-------------------|------------------|
-| Date | ISO 8601 (`YYYY-MM-DD`) | MDY/DMY, Excel serial, Unix timestamp (s + ms), longform months, year-month, quarter notation, French/German/Spanish month dictionaries (opt-in), buried-date regex, error sentinels for invalid dates |
-| Phone | E.164 + `;ext=N` | libphonenumber, 001 international prefix handling, error sentinels for placeholders / multi-number / contamination |
-| Email | lowercase + trim | display-name extraction, mailto/angle-bracket strip, smart-quote unwrap, optional `--gmail-canonical` mode |
-| Address | USPS-canonical (`expand=False`) or expanded (`expand=True`) | state-name → 2-letter, multi-line collapse, PO Box normalize, state-code preservation regardless of input case |
-| Name | smart Title Case | Mc/Mac/O'/D' inner caps, hyphen segments, particle lowercasing (von/van/de/da), comma-format reversal, period stripping for titles/suffixes/initials, PhD/MD acronym preservation, conservative mode |
+| Date | ISO 8601 (`YYYY-MM-DD`) | MDY/DMY, Excel serial, Unix timestamp (s + ms), longform months, year-month, quarter, ISO week date (`2024-W03-1`), ISO ordinal (`2024-015`), RFC 2822, CJK separators (`2024年01月15日`), fullwidth digits, named-TZ resolution (EST/PST/JST/…), `two_digit_year_cutoff` |
+| Phone | E.164 + `;ext=N` | libphonenumber, 001 international prefix, error sentinels for placeholders / multi-number / contamination |
+| Email | lowercase + trim | display-name extraction, mailto/angle-bracket strip, smart-quote unwrap, BIDI/RTL override strip (security), optional `--gmail-canonical` |
+| Address | USPS-canonical (`expand=False`) or expanded (`expand=True`) | state/province-name → code for US/CA/AU/DE, UK postcode detection, multi-line collapse, PO Box normalize, state-code preservation regardless of input case |
+| Name | smart Title Case | Mc/Mac/O'/D' inner caps, Arabic `al-`/`el-` lowercase, particle lowercasing (von/van/de/da/bin/ibn/ben), East Asian honorific suffixes (`-san`/`-sama`/`-ssi`), comma reversal (skippable via `family_first`), period stripping for titles/suffixes/initials, PhD/MD/Mag/Habil acronyms |
 | Currency | bare number (dot decimal) | auto-detect EU vs US separators, space-thousands, Swiss apostrophe, accounting parens, optional ISO code preservation |
 | Boolean | `True`/`False` (configurable) | accepts `yes`/`no`/`y`/`n`/`1`/`0`/`on`/`off` |

+**International coverage** (added v1.7):
+- **Date locales**: English (default) plus opt-in French / German / Spanish / Portuguese / Italian / Dutch / Russian month + weekday recognition.
+- **Currency symbols**: $, €, £, ¥, ₹, ₩, ₽, ₪, ₺, ¢ + ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN), ₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS).
+- **ISO 4217 codes**: 23 baseline (USD, EUR, …) plus ~30 emerging-market additions (SAR, AED, ARS, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS, KES, HUF, CZK, RON, UAH, …).
+- **Address jurisdictions**: US, Canada (13 provinces/territories), Australia (8 states), Germany (16 Bundesländer), UK (postcode shape).
+- **Address PO Box**: English, German (`Postfach`), French (`Boîte postale`), Spanish (`Apartado`), Italian (`Casella postale`), Portuguese (`Caixa postal`).
+
 **Per-domain `error_policy`**: `"passthrough"` (default) keeps the original; `"sentinel"` emits `<error: <reason>>` for cases like Feb 30, double @, percentages mistaken for currency, etc.

 **Pipeline**: `standardize_dataframe(df, options)` runs per-column with `column_types: dict[str, FieldType]`. Returns `StandardizeResult` with `cells_changed`, `cells_unparseable`, change audit. Warns when > 10% of typed cells fail to parse.

-**Presets**: `us-default`, `european`, `uk`, `iso-strict`, `legacy-us`. Custom abbreviations via `extra_abbreviations`.
+**Presets**: `us-default`, `european`, `uk`, `iso-strict`, `legacy-us`. Custom abbreviations via `extra_abbreviations`. Per-column culture flags: `name_family_first` (East Asian), `address_state_to_code` (any of 4 supported jurisdictions), `date_month_locales` (list of 8 supported codes).

 ### 11.4 Upload-time analyzer (`src/core/analyze.py`)