feat(format-i18n): broaden international coverage across all domains

Closes ~17 high-value international gaps surfaced by parallel review.
Adds 93 regression tests; full project suite now 1323 / 0 / 17 (passed
/ failed / xfailed).

DATES
- Adds Portuguese, Italian, Dutch, Russian month dictionaries to the
  opt-in ``month_locales`` set (now: en, fr, de, es, pt, it, nl, ru).
- Adds localized weekday recognition for those locales — "Lundi",
  "Montag", "lunedì", "понедельник", etc. all strip cleanly before
  format matching.
- New CJK separator normalization: Japanese ``2024年01月15日`` and
  fullwidth digits ``２０２４／０１／１５`` fold to ASCII before parsing.
- New named-timezone resolution: EST/PST/JST/CET/IST/GMT/etc. map to
  fixed UTC offsets via ``_NAMED_TZ_OFFSETS`` so the trailing TZ
  doesn't block format matching.
- New ISO 8601 extended formats: week date (``2024-W03-1``) and
  ordinal date (``2024-015``), plus RFC 2822 mail-header form
  (``Mon, 15 Jan 2024 10:30:00``).
- New ``two_digit_year_cutoff`` parameter on ``standardize_date()`` —
  defaults to Python's stdlib 69; lower it for birth-year columns
  where most subjects were born ≤ 1999.
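
A minimal sketch of three of the date behaviours listed above, using only
the standard library. The helper names (``fold_cjk_date``,
``parse_iso_extended``, ``pivot_two_digit_year``) are illustrative
assumptions, not the functions this commit adds; the real logic lives in
``standardize_date()`` and its private helpers.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical helpers for illustration only; names are not from the commit.

# Fullwidth digits U+FF10..U+FF19 and fullwidth slash U+FF0F -> ASCII.
_FULLWIDTH = {0xFF10 + i: str(i) for i in range(10)}
_FULLWIDTH[0xFF0F] = "/"
# Year / month / day markers (U+5E74, U+6708, U+65E5) -> ISO-style separators.
_MARKERS = {0x5E74: "-", 0x6708: "-", 0x65E5: ""}


def fold_cjk_date(text: str) -> str:
    """Fold CJK date markers and fullwidth digits to ASCII."""
    return text.translate(_FULLWIDTH).translate(_MARKERS)


def parse_iso_extended(text: str) -> Optional[datetime]:
    """Parse an ISO 8601 week date (YYYY-Www-D) or ordinal date (YYYY-DDD)."""
    week = re.fullmatch(r"(\d{4})-W(\d{2})-(\d)", text)
    if week:
        try:
            # datetime.fromisocalendar is available from Python 3.8.
            return datetime.fromisocalendar(*(int(g) for g in week.groups()))
        except ValueError:
            return None
    ordinal = re.fullmatch(r"(\d{4})-(\d{3})", text)
    if ordinal:
        year, day = int(ordinal.group(1)), int(ordinal.group(2))
        if day < 1:
            return None
        try:
            parsed = datetime(year, 1, 1) + timedelta(days=day - 1)
        except (ValueError, OverflowError):
            return None
        # Day 366 of a non-leap year spills into the next year; reject it.
        return parsed if parsed.year == year else None
    return None


def pivot_two_digit_year(yy: int, cutoff: int) -> int:
    """Inclusive pivot: 00..cutoff -> 20xx, cutoff+1..99 -> 19xx."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year")
    return (2000 if yy <= cutoff else 1900) + yy
```

With a cutoff of 25, a two-digit ``30`` resolves to 1930 and ``20`` to
2020, which is the behaviour a birth-year column usually wants.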

NAMES
- Particles set extended with Arabic patronymic markers (bin, ibn,
  bint, abu, abd, al, al-, el-) and Hebrew (ben, bat, ha, ha-).
- Title set extended with German (Herr, Frau), French (M., Mme,
  Mlle), Spanish (Sr., Sra., Srta., Don, Doña), Italian (Sig., Sig.ra,
  Dott.), Portuguese (Snr., Snra.).
- Acronym map extended with international academic credentials
  (Dipl, Ing, Mag, Habil, MSc, BSc, LLB, LLM).
- New East Asian honorific suffix handler: ``Tanaka-san``,
  ``Lee-ssi``, ``Park-nim`` keep the suffix lowercase after the
  hyphen instead of being title-cased into ``Tanaka-San``.
- Hyphenated-segment handler now keeps Arabic prefixes ``al-`` /
  ``el-`` lowercase per Arabic transliteration convention.
- New ``family_first`` parameter on ``standardize_name()`` and matching
  ``name_family_first`` field on ``StandardizeOptions`` — set
  per-column for East Asian data to skip Western comma-format reversal
  (``Kim, Min-jae`` stays ``Kim, …`` instead of becoming ``Min-jae Kim``).
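
The hyphenated-segment rules above can be sketched as follows. This is a
simplified stand-in that assumes a hypothetical ``case_hyphenated``
helper and reduced lookup sets; the commit's actual logic sits inside
``_standardize_name_token`` and handles more cases (Mc/Mac, acronyms,
trailing punctuation).

```python
# Illustrative subsets only; the real sets in the commit are larger.
HONORIFIC_SUFFIXES = {"san", "sama", "ssi", "nim"}
ARABIC_PREFIXES = {"al", "el"}


def case_hyphenated(token: str) -> str:
    """Title-case each hyphen segment, except honorifics and Arabic prefixes."""
    out = []
    for index, part in enumerate(token.split("-")):
        lowered = part.lower()
        if index > 0 and lowered in HONORIFIC_SUFFIXES:
            # ``Tanaka-san``: the suffix stays lowercase after the hyphen.
            out.append(lowered)
        elif index == 0 and lowered in ARABIC_PREFIXES:
            # ``al-Rashid``: the leading article stays lowercase.
            out.append(lowered)
        else:
            out.append(part[:1].upper() + part[1:].lower())
    return "-".join(out)
```

The position checks matter: ``san`` is only treated as an honorific
after a hyphen, and ``al`` only as a prefix before one, so an ordinary
segment such as ``mary-jane`` still becomes ``Mary-Jane``.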

CURRENCY
- Symbol map extended: ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN),
  ₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS) — covers SE Asia, Africa,
  Eastern Europe, Latin America gaps.
- ISO 4217 code list extended from 23 to 63 codes: SAR, AED, QAR, KWD,
  BHD, OMR, ARS, CLP, COP, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS,
  KES, HUF, CZK, RON, UAH, KZT, etc.

EMAIL
- New BIDI / RTL override stripping (``standardize_email``):
  U+202A-U+202E and U+2066-U+2069 stripped from every email. These
  are a known phishing vector: in ``alice<U+202E>@example.com`` a
  bidi-aware renderer draws everything after the override reversed
  (``alicemoc.elpmaxe@``), so the displayed address no longer matches
  the stored one.
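
A minimal sketch of the stripping step, assuming a hypothetical
``strip_bidi_controls`` helper (the commit wires the equivalent logic
into ``standardize_email``). The character ranges are the ones named
above.

```python
import re

# U+202A..U+202E: LRE, RLE, PDF, LRO, RLO (embeddings and overrides).
# U+2066..U+2069: LRI, RLI, FSI, PDI (isolates).
_BIDI_CONTROLS = re.compile("[\u202a-\u202e\u2066-\u2069]")


def strip_bidi_controls(value: str) -> str:
    """Remove bidirectional formatting controls; leave all other text as-is."""
    return _BIDI_CONTROLS.sub("", value)
```

Stripping rather than rejecting keeps the cleaner lossless for
legitimate addresses, since these controls carry no meaning inside an
email address.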

ADDRESS
- Canadian provinces: 13 codes + names → 2-letter (Ontario → ON).
- UK postcode pattern recognition (``SW1A 2AA`` shape).
- Australian states: 8 codes + names (NSW, VIC, QLD, … + full names).
- German Bundesland: 16 codes + names (Bayern → BY, etc.).
- International PO Box variants: ``Postfach`` (DE), ``Boîte postale``
  (FR), ``Apartado`` (ES), ``Casella postale`` (IT), ``Caixa postal``
  (PT) — all fold to canonical ``PO Box``.
- ``_INTL_STATE_CODES`` now combines US/CA/AU/DE codes; the position
  check that preserves state codes regardless of input case applies
  to all four jurisdictions.
- ``_is_state_code_position`` postal pattern broadened to recognize
  US ZIP, AU 4-digit, CA first half, and UK outward code.
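
The PO Box folding above can be sketched as a single case-insensitive
alternation. The pattern list here is a reduced, illustrative subset
(full-word variants only), and ``fold_po_box`` is a hypothetical name;
the commit builds its regex from ``INTL_PO_BOX_PATTERNS``.

```python
import re

# Illustrative subset: one or two full-word variants per language.
_PO_BOX_VARIANTS = [
    r"p\.?\s*o\.?\s*box",        # English: PO Box, P.O. Box
    r"post\s+office\s+box",      # English long form
    r"postfach",                 # German
    r"bo[iî]te\s+postale",       # French, with or without the circumflex
    r"apartado(?:\s+postal)?",   # Spanish
    r"casella\s+postale",        # Italian
    r"caixa\s+postal",           # Portuguese
]
_PO_BOX_RE = re.compile(
    r"\b(?:" + "|".join(_PO_BOX_VARIANTS) + r")\b", re.IGNORECASE
)


def fold_po_box(address: str, canonical: str = "PO Box") -> str:
    """Replace any recognized PO Box variant with the canonical form."""
    return _PO_BOX_RE.sub(canonical, address)
```

Short abbreviations such as ``BP`` or ``CP`` are left out of this sketch
because they collide easily with other tokens; matching them safely
would likely need surrounding context such as a following box number.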

CONSTANTS
- ``src/core/_constants.py`` gains: ``CA_PROVINCE_CODES`` /
  ``CA_PROVINCE_NAMES``, ``AU_STATE_CODES`` / ``AU_STATE_NAMES``,
  ``DE_STATE_CODES`` / ``DE_STATE_NAMES``, ``POSTAL_PATTERNS``
  (us/ca/uk/de/au/fr), ``INTL_PO_BOX_PATTERNS`` (per-language regex),
  ``INTL_STREET_SUFFIXES`` (de/fr/es/it/uk dictionaries — ready for
  use when address takes a ``country_hint`` parameter in a future pass).

DOCS
- TECHNICAL.md §11.3 domain table updated with the new handling per
  domain plus a new "International coverage" sub-section listing the
  supported locales / symbols / jurisdictions.

DEFERRED (out of scope or rare)
- Alternative calendars (Japanese era, Hijri, Hebrew, Buddhist) —
  corpus § 3.5 marks out of scope.
- Persian/Arabic-Indic digit conversion — rare in tabular data.
- Trailing-minus RTL currency convention.
- Punycode ↔ Unicode IDN normalization.
- Mixed-country phone column auto-detection (user can override
  ``default_region`` per column).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 03:06:03 +00:00
parent abb720997e
commit d18b95880d
4 changed files with 920 additions and 52 deletions


@@ -282,19 +282,26 @@ Specs live in this section as scripts enter active build. Each follows the Tier
**Domains**:
| Domain | Default canonical | Notable handling |
|--------|-------------------|------------------|
| Date | ISO 8601 (`YYYY-MM-DD`) | MDY/DMY, Excel serial, Unix timestamp (s + ms), longform months, year-month, quarter notation, French/German/Spanish month dictionaries (opt-in), buried-date regex, error sentinels for invalid dates |
| Phone | E.164 + `;ext=N` | libphonenumber, 001 international prefix handling, error sentinels for placeholders / multi-number / contamination |
| Email | lowercase + trim | display-name extraction, mailto/angle-bracket strip, smart-quote unwrap, optional `--gmail-canonical` mode |
| Address | USPS-canonical (`expand=False`) or expanded (`expand=True`) | state-name → 2-letter, multi-line collapse, PO Box normalize, state-code preservation regardless of input case |
| Name | smart Title Case | Mc/Mac/O'/D' inner caps, hyphen segments, particle lowercasing (von/van/de/da), comma-format reversal, period stripping for titles/suffixes/initials, PhD/MD acronym preservation, conservative mode |
| Date | ISO 8601 (`YYYY-MM-DD`) | MDY/DMY, Excel serial, Unix timestamp (s + ms), longform months, year-month, quarter, ISO week date (`2024-W03-1`), ISO ordinal (`2024-015`), RFC 2822, CJK separators (`2024年01月15日`), fullwidth digits, named-TZ resolution (EST/PST/JST/…), `two_digit_year_cutoff` |
| Phone | E.164 + `;ext=N` | libphonenumber, 001 international prefix, error sentinels for placeholders / multi-number / contamination |
| Email | lowercase + trim | display-name extraction, mailto/angle-bracket strip, smart-quote unwrap, BIDI/RTL override strip (security), optional `--gmail-canonical` |
| Address | USPS-canonical (`expand=False`) or expanded (`expand=True`) | state/province-name → code for US/CA/AU/DE, UK postcode detection, multi-line collapse, PO Box normalize, state-code preservation regardless of input case |
| Name | smart Title Case | Mc/Mac/O'/D' inner caps, Arabic `al-`/`el-` lowercase, particle lowercasing (von/van/de/da/bin/ibn/ben), East Asian honorific suffixes (`-san`/`-sama`/`-ssi`), comma reversal (skippable via `family_first`), period stripping for titles/suffixes/initials, PhD/MD/Mag/Habil acronyms |
| Currency | bare number (dot decimal) | auto-detect EU vs US separators, space-thousands, Swiss apostrophe, accounting parens, optional ISO code preservation |
| Boolean | `True`/`False` (configurable) | accepts `yes`/`no`/`y`/`n`/`1`/`0`/`on`/`off` |
**International coverage** (added v1.7):
- **Date locales**: English (default) plus opt-in French / German / Spanish / Portuguese / Italian / Dutch / Russian month + weekday recognition.
- **Currency symbols**: $, €, £, ¥, ₹, ₩, ₽, ₪, ₺, ¢ + ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN), ₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS).
- **ISO 4217 codes**: 23 baseline (USD, EUR, …) plus 40 emerging-market additions (SAR, AED, ARS, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS, KES, HUF, CZK, RON, UAH, …).
- **Address jurisdictions**: US, Canada (13 provinces/territories), Australia (8 states), Germany (16 Bundesländer), UK (postcode shape).
- **Address PO Box**: English, German (`Postfach`), French (`Boîte postale`), Spanish (`Apartado`), Italian (`Casella postale`), Portuguese (`Caixa postal`).
**Per-domain `error_policy`**: `"passthrough"` (default) keeps the original; `"sentinel"` emits `<error: <reason>>` for cases like Feb 30, double @, percentages mistaken for currency, etc.
**Pipeline**: `standardize_dataframe(df, options)` runs per-column with `column_types: dict[str, FieldType]`. Returns `StandardizeResult` with `cells_changed`, `cells_unparseable`, change audit. Warns when > 10% of typed cells fail to parse.
**Presets**: `us-default`, `european`, `uk`, `iso-strict`, `legacy-us`. Custom abbreviations via `extra_abbreviations`.
**Presets**: `us-default`, `european`, `uk`, `iso-strict`, `legacy-us`. Custom abbreviations via `extra_abbreviations`. Per-column culture flags: `name_family_first` (East Asian), `address_state_to_code` (any of 4 supported jurisdictions), `date_month_locales` (list of 8 supported codes).
### 11.4 Upload-time analyzer (`src/core/analyze.py`)


@@ -65,6 +65,124 @@ USPS_COMPRESSIONS: dict[str, str] = {
"heights": "Hts", "springs": "Spgs",
}
# Canadian province + territory postal codes.
CA_PROVINCE_CODES: frozenset[str] = frozenset({
"AB", "BC", "MB", "NB", "NL", "NS", "NT", "NU",
"ON", "PE", "QC", "SK", "YT",
})
# Canadian province / territory name → 2-letter code.
CA_PROVINCE_NAMES: dict[str, str] = {
"alberta": "AB", "british columbia": "BC", "manitoba": "MB",
"new brunswick": "NB", "newfoundland and labrador": "NL",
"newfoundland": "NL", "labrador": "NL", "nova scotia": "NS",
"northwest territories": "NT", "nunavut": "NU", "ontario": "ON",
"prince edward island": "PE", "quebec": "QC", "québec": "QC",
"saskatchewan": "SK", "yukon": "YT",
}
# Australian state + territory postal abbreviations.
AU_STATE_CODES: frozenset[str] = frozenset({
"NSW", "VIC", "QLD", "WA", "SA", "TAS", "ACT", "NT",
})
AU_STATE_NAMES: dict[str, str] = {
"new south wales": "NSW", "victoria": "VIC", "queensland": "QLD",
"western australia": "WA", "south australia": "SA",
"tasmania": "TAS", "australian capital territory": "ACT",
"northern territory": "NT",
}
# German Bundesland (state) postal abbreviations (per ISO 3166-2:DE).
DE_STATE_CODES: frozenset[str] = frozenset({
"BW", "BY", "BE", "BB", "HB", "HH", "HE", "MV",
"NI", "NW", "RP", "SL", "SN", "ST", "SH", "TH",
})
DE_STATE_NAMES: dict[str, str] = {
"baden-württemberg": "BW", "baden-wurttemberg": "BW",
"bayern": "BY", "bavaria": "BY",
"berlin": "BE", "brandenburg": "BB",
"bremen": "HB", "hamburg": "HH",
"hessen": "HE", "hesse": "HE",
"mecklenburg-vorpommern": "MV",
"niedersachsen": "NI", "lower saxony": "NI",
"nordrhein-westfalen": "NW", "north rhine-westphalia": "NW",
"rheinland-pfalz": "RP", "rhineland-palatinate": "RP",
"saarland": "SL",
"sachsen": "SN", "saxony": "SN",
"sachsen-anhalt": "ST", "saxony-anhalt": "ST",
"schleswig-holstein": "SH",
"thüringen": "TH", "thuringen": "TH", "thuringia": "TH",
}
# Postal-code patterns by country (used for shape detection in addresses).
# ``utf8`` flag suppressed: all patterns are pure ASCII at the structural
# level even when the surrounding address is unicode.
POSTAL_PATTERNS: dict[str, str] = {
"us": r"\b\d{5}(?:-\d{4})?\b",
"ca": r"\b[A-Z]\d[A-Z]\s*\d[A-Z]\d\b",
# UK postcodes: outward (1-2 letters + 1-2 digits + optional letter)
# then optional space then inward (digit + 2 letters). Real grammar
# is more involved, but this covers the common address formats.
"uk": r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b",
# German postal: 5 digits.
"de": r"\b\d{5}\b",
# Australian postal: 4 digits.
"au": r"\b\d{4}\b",
# French postal: 5 digits (covered by DE pattern but kept explicit).
"fr": r"\b\d{5}\b",
}
# International street-suffix expansions / compressions, keyed by language.
# Keys are casefold + period-stripped; values are the canonical form to
# emit. Empty for languages where USPS-style abbreviation isn't idiomatic
# (Japanese, Korean — addresses use full words and ideographic markers).
INTL_STREET_SUFFIXES: dict[str, dict[str, str]] = {
"de": {
# Long → short (e.g., for matching keys); the standardizer's
# ``expand=False`` mode uses this set.
"strasse": "Str", "straße": "Str", "str": "Straße",
"platz": "Pl", "pl": "Platz",
"weg": "W", "w": "Weg",
"gasse": "G", "g": "Gasse",
},
"fr": {
"rue": "R", "r": "Rue",
"avenue": "Av", "av": "Avenue",
"boulevard": "Bd", "bd": "Boulevard",
"place": "Pl", "pl": "Place",
"chemin": "Ch", "ch": "Chemin",
"impasse": "Imp", "imp": "Impasse",
},
"es": {
"calle": "C", "c": "Calle",
"avenida": "Av", "av": "Avenida",
"plaza": "Pza", "pza": "Plaza",
"paseo": "", "carretera": "Ctra", "ctra": "Carretera",
},
"it": {
"via": "V", "v": "Via",
"viale": "V.le",
"corso": "C.so",
"piazza": "P.za", "pza": "Piazza",
},
"uk": {
# UK uses the long form by default; "Cl"/"Mws" are uncommon.
"cl": "Close", "mws": "Mews",
},
}
# Localized "PO Box" patterns. Each value is a regex matching all
# variants of "PO Box" in that language. The standardizer folds matches
# to the canonical form passed in ``po_box_canonical``.
INTL_PO_BOX_PATTERNS: dict[str, str] = {
"en": r"(?:p\.?\s*o\.?\s*box|post\s+office\s+box)",
"de": r"(?:postfach|pf)\b",
"fr": r"(?:bo[iî]te\s+postale|b\.?\s*p\.?)\b",
"es": r"(?:apartado(?:\s+postal)?|apdo\.?)\b",
"it": r"(?:casella\s+postale|c\.?\s*p\.?)\b",
"pt": r"(?:caixa\s+postal|c\.?\s*p\.?)\b",
}
# Abbreviation → expansion (the inverse of USPS_COMPRESSIONS, plus a
# handful of legacy aliases like ``av`` → ``Avenue``). Used by the
# format standardizer when ``expand=True`` (default).


@@ -165,6 +165,110 @@ _MONTH_LOCALES: dict[str, dict[str, str]] = {
"agosto": "August", "septiembre": "September", "setiembre": "September",
"octubre": "October", "noviembre": "November", "diciembre": "December",
},
"pt": {
"janeiro": "January", "fevereiro": "February", "março": "March",
"marco": "March", "abril": "April", "maio": "May", "junho": "June",
"julho": "July", "agosto": "August", "setembro": "September",
"outubro": "October", "novembro": "November", "dezembro": "December",
"jan": "Jan", "fev": "Feb", "mar": "Mar", "abr": "Apr",
"mai": "May", "jun": "Jun", "jul": "Jul", "ago": "Aug",
"set": "Sep", "out": "Oct", "nov": "Nov", "dez": "Dec",
},
"it": {
"gennaio": "January", "febbraio": "February", "marzo": "March",
"aprile": "April", "maggio": "May", "giugno": "June",
"luglio": "July", "agosto": "August", "settembre": "September",
"ottobre": "October", "novembre": "November", "dicembre": "December",
"gen": "Jan", "feb": "Feb", "mar": "Mar", "apr": "Apr",
"mag": "May", "giu": "Jun", "lug": "Jul", "ago": "Aug",
"set": "Sep", "ott": "Oct", "nov": "Nov", "dic": "Dec",
},
"nl": {
"januari": "January", "februari": "February", "maart": "March",
"april": "April", "mei": "May", "juni": "June", "juli": "July",
"augustus": "August", "september": "September", "oktober": "October",
"november": "November", "december": "December",
"jan": "Jan", "feb": "Feb", "mrt": "Mar", "apr": "Apr",
"mei": "May", "jun": "Jun", "jul": "Jul", "aug": "Aug",
"sep": "Sep", "okt": "Oct", "nov": "Nov", "dec": "Dec",
},
"ru": {
"января": "January", "февраля": "February", "марта": "March",
"апреля": "April", "мая": "May", "июня": "June", "июля": "July",
"августа": "August", "сентября": "September", "октября": "October",
"ноября": "November", "декабря": "December",
# Nominative forms (less common in dates but possible)
"январь": "January", "февраль": "February", "март": "March",
"апрель": "April", "май": "May", "июнь": "June", "июль": "July",
"август": "August", "сентябрь": "September", "октябрь": "October",
"ноябрь": "November", "декабрь": "December",
},
}
# Localized weekday prefix removal — same idea as month substitution.
# Each locale's set lists full + abbreviated forms (lowercase) that
# should be stripped from the start of a date string before format
# matching. English is in ``_WEEKDAY_PREFIX_RE`` already.
_WEEKDAY_LOCALES: dict[str, list[str]] = {
"fr": ["lundi", "mardi", "mercredi", "jeudi", "vendredi", "samedi",
"dimanche", "lun", "mar", "mer", "jeu", "ven", "sam", "dim"],
"de": ["montag", "dienstag", "mittwoch", "donnerstag", "freitag",
"samstag", "sonntag", "mo", "di", "mi", "do", "fr", "sa", "so"],
"es": ["lunes", "martes", "miércoles", "miercoles", "jueves",
"viernes", "sábado", "sabado", "domingo"],
"it": ["lunedì", "lunedi", "martedì", "martedi", "mercoledì",
"mercoledi", "giovedì", "giovedi", "venerdì", "venerdi",
"sabato", "domenica"],
"pt": ["segunda-feira", "segunda", "terça-feira", "terca-feira",
"terça", "terca", "quarta-feira", "quarta", "quinta-feira",
"quinta", "sexta-feira", "sexta", "sábado", "sabado", "domingo"],
"nl": ["maandag", "dinsdag", "woensdag", "donderdag", "vrijdag",
"zaterdag", "zondag",
"ma", "di", "wo", "do", "vr", "za", "zo"],
"ru": ["понедельник", "вторник", "среда", "четверг", "пятница",
"суббота", "воскресенье",
"пн", "вт", "ср", "чт", "пт", "сб", "вс"],
}
def _build_weekday_patterns() -> dict[str, "re.Pattern[str]"]:
"""One regex per locale matching any leading weekday + optional comma."""
out = {}
for loc, words in _WEEKDAY_LOCALES.items():
# Sort longest first so ``segunda-feira`` wins over ``segunda``.
alt = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
out[loc] = re.compile(rf"^(?:{alt})\s*,?\s+", re.IGNORECASE)
return out
_WEEKDAY_LOCALE_PATTERNS = _build_weekday_patterns()
# Named timezone → fixed UTC offset. Resolves common abbreviations so
# ``2024-01-15 10:30:00 EST`` produces a date instead of falling through
# unparseably. Per FORMATS-CASES.md § 3.3, these are *fixed* offsets —
# DST-aware handling is out of scope (would require ``zoneinfo``).
_NAMED_TZ_OFFSETS: dict[str, str] = {
# Universal
"UTC": "+00:00", "GMT": "+00:00", "Z": "+00:00",
# Americas
"EST": "-05:00", "EDT": "-04:00",
"CST": "-06:00", "CDT": "-05:00",
"MST": "-07:00", "MDT": "-06:00",
"PST": "-08:00", "PDT": "-07:00",
"AST": "-04:00", "AKST": "-09:00", "HST": "-10:00",
"BRT": "-03:00", "ART": "-03:00",
# Europe
"BST": "+01:00", "CET": "+01:00", "CEST": "+02:00",
"EET": "+02:00", "EEST": "+03:00", "WET": "+00:00", "WEST": "+01:00",
"MSK": "+03:00",
# Asia / Pacific
"IST": "+05:30",
"PKT": "+05:00", "BDT": "+06:00",
"ICT": "+07:00", "WIB": "+07:00",
"CST_CN": "+08:00", "HKT": "+08:00", "SGT": "+08:00", "PHT": "+08:00",
"JST": "+09:00", "KST": "+09:00",
"AEST": "+10:00", "AEDT": "+11:00", "NZST": "+12:00",
}
@@ -262,6 +366,7 @@ def standardize_date(
date_order: DateOrder = "MDY",
error_policy: DateErrorPolicy = "passthrough",
month_locales: Optional[list[str]] = None,
two_digit_year_cutoff: int = 69,
) -> tuple[str, bool]:
"""Parse *value* as a date and return it formatted per *output_format*.
@@ -273,9 +378,15 @@ def standardize_date(
passes through unchanged. With ``"sentinel"`` the cleaner emits
``<error: <reason>>`` for invalid dates per corpus § 0.3.
``month_locales`` enables non-English month names. Pass
``["en", "fr", "de", "es"]`` to recognize French / German / Spanish
month names in addition to English. Defaults to English-only.
``month_locales`` enables non-English month names. Pass any subset
of ``["en", "fr", "de", "es", "pt", "it", "nl", "ru"]`` to recognize
those locales' month + weekday names in addition to English.
Defaults to English-only.
``two_digit_year_cutoff`` controls the pivot for 2-digit years:
``00..cutoff`` map to 2000-2099, ``cutoff+1..99`` to 1900-1999.
The default 69 keeps ``strptime``'s own pivot (``69`` → 1969).
Override to ~25 for birth-year columns (most subjects born ≤ 1999).
Recognizes Excel-1900 serial dates (``45306`` → ``2024-01-15``),
Unix timestamps in seconds and milliseconds, year-month text
@@ -320,19 +431,42 @@ def standardize_date(
out = f"{q.group(2)}-Q{q.group(1)}"
return out, out != value
# CJK separator normalization: Japanese ``2024年01月15日`` → ``2024-01-15``,
# Korean ``2024.01.15`` is already covered by the dot format. Also fold
# fullwidth digits (０-９) to ASCII so any of the parsers can read them.
s = _normalize_cjk_date_chars(s)
# Substitute localized month names with English before format-match.
if month_locales:
s = _apply_month_locale(s, month_locales)
# Strip localized weekday prefixes for any enabled locale BEFORE
# the day-period strip — otherwise ``Montag, 15. Januar 2024``
# never reaches the digit-leading shape the period strip expects.
for loc in month_locales:
pat = _WEEKDAY_LOCALE_PATTERNS.get(loc)
if pat is not None:
s = pat.sub("", s).strip()
# German DMY uses ``15.`` for the day; strip the trailing period
# so ``15. Januar 2024`` parses as ``15 January 2024``.
s = re.sub(r"^(\d{1,2})\.\s+", r"\1 ", s)
# Strip a leading weekday prefix (``Monday, January 15, 2024``).
s = _WEEKDAY_PREFIX_RE.sub("", s).strip()
# Drop a trailing time portion before format-matching.
# Resolve named timezones (EST/PST/JST/…) to fixed offsets, then
# drop the trailing time portion before format-matching.
s = _resolve_named_tz(s)
s = _TIME_TAIL_RE.sub("", s).strip()
parsed = _try_parse_date(s, date_order)
# ISO 8601 extended formats — week date + ordinal date — and
# RFC 2822 mail-header form.
iso_extended = _try_iso_extended(s, output_format)
if iso_extended is not None:
return iso_extended, iso_extended != value
rfc = _try_rfc2822(s, output_format)
if rfc is not None:
return rfc, rfc != value
parsed = _try_parse_date(s, date_order, two_digit_year_cutoff)
if parsed is not None:
out = parsed.strftime(output_format)
return out, out != value
@@ -370,13 +504,112 @@ def standardize_date(
return value, False
def _try_parse_date(s: str, date_order: DateOrder) -> Optional[datetime]:
def _try_parse_date(
s: str, date_order: DateOrder, two_digit_year_cutoff: int = 69,
) -> Optional[datetime]:
formats = _DATE_FORMATS_DMY if date_order == "DMY" else _DATE_FORMATS_MDY
for fmt in formats:
try:
return datetime.strptime(s, fmt)
parsed = datetime.strptime(s, fmt)
except ValueError:
continue
# Re-pivot 2-digit years if the user changed the cutoff. strptime
# uses Python's stdlib default of 69; for cutoff != 69 we may need
# to roll the century forward or back.
if "%y" in fmt and two_digit_year_cutoff != 69:
year_2 = parsed.year % 100
if year_2 <= two_digit_year_cutoff:
century = 2000
else:
century = 1900
parsed = parsed.replace(year=century + year_2)
return parsed
return None
_FULLWIDTH_DIGITS = str.maketrans("０１２３４５６７８９", "0123456789")
_CJK_DATE_MARKERS = str.maketrans({"年": "-", "月": "-", "日": "", "．": ".", "／": "/"})
def _normalize_cjk_date_chars(s: str) -> str:
"""Fold East Asian date markers + fullwidth digits to ASCII equivalents.
``2024年01月15日`` → ``2024-01-15``; fullwidth ``２０２４／０１／１５``
→ ``2024/01/15``. Idempotent on ASCII input.
"""
if not any(c > "\x7f" for c in s):
return s
s = s.translate(_FULLWIDTH_DIGITS).translate(_CJK_DATE_MARKERS)
# ``日`` maps to the empty string, so a full date leaves no trailing
# dash; a partial ``2024年01月`` does (``2024-01-``), so strip it.
return s.rstrip("-").strip()
_NAMED_TZ_RE = re.compile(
r"\s+(" + "|".join(re.escape(k) for k in sorted(_NAMED_TZ_OFFSETS, key=len, reverse=True)) + r")\b"
)
def _resolve_named_tz(s: str) -> str:
"""Replace a trailing named timezone with its fixed UTC offset.
``2024-01-15 10:30:00 EST`` → ``2024-01-15 10:30:00-05:00``. Per
FORMATS-CASES.md § 3.3, offsets are fixed (not DST-aware); see
``_NAMED_TZ_OFFSETS`` for the table.
"""
def repl(m: re.Match) -> str:
return _NAMED_TZ_OFFSETS[m.group(1)]
return _NAMED_TZ_RE.sub(repl, s)
_ISO_WEEK_RE = re.compile(r"^(\d{4})-W(\d{2})-(\d)$")
_ISO_ORDINAL_RE = re.compile(r"^(\d{4})-(\d{3})$")
def _try_iso_extended(s: str, output_format: str) -> Optional[str]:
"""Parse ISO 8601 week date or ordinal date, return formatted string."""
m = _ISO_WEEK_RE.match(s)
if m:
try:
parsed = datetime.fromisocalendar(
int(m.group(1)), int(m.group(2)), int(m.group(3)),
)
return parsed.strftime(output_format)
except ValueError:
return None
m = _ISO_ORDINAL_RE.match(s)
if m:
year, day = int(m.group(1)), int(m.group(2))
if 1 <= day <= 366:
try:
parsed = datetime(year, 1, 1) + timedelta(days=day - 1)
if parsed.year == year:
return parsed.strftime(output_format)
except ValueError:
return None
return None
# RFC 2822 mail-header form: ``Mon, 15 Jan 2024 10:30:00 GMT``.
_RFC2822_FORMATS = [
"%a, %d %b %Y %H:%M:%S", # without TZ
"%a, %d %b %Y %H:%M:%S %Z", # with named TZ (already resolved upstream)
"%a, %d %b %Y %H:%M:%S %z", # with offset
"%d %b %Y %H:%M:%S",
]
def _try_rfc2822(s: str, output_format: str) -> Optional[str]:
"""Parse RFC 2822 mail-header date format."""
for fmt in _RFC2822_FORMATS:
try:
parsed = datetime.strptime(s, fmt)
except ValueError:
continue
try:
return parsed.strftime(output_format)
except ValueError:
return None
return None
@@ -539,12 +772,35 @@ _SYMBOL_TO_ISO: dict[str, str] = {
"₪": "ILS",
"₺": "TRY",
"¢": "USD", # cents — coerce to USD for the code; value is still numeric
# International additions:
"฿": "THB", # Thai Baht
"₫": "VND", # Vietnamese Dong
"₮": "MNT", # Mongolian Tugrik
"₴": "UAH", # Ukrainian Hryvnia
"₦": "NGN", # Nigerian Naira
"₱": "PHP", # Philippine Peso
"₲": "PYG", # Paraguayan Guarani
"﷼": "SAR", # ambiguous Saudi/Omani/Iranian; pick the most common
"₨": "PKR", # Pakistani Rupee (and historical Sri Lankan)
"₵": "GHS", # Ghanaian Cedi
}
_CURRENCY_SYMBOLS = "".join(_SYMBOL_TO_ISO)
# ISO 4217 codes — the long tail of currencies in active use. Order
# matters for the regex alternation: a 3-letter ISO code is unambiguous,
# but ``R$`` (Brazil) and ``kr`` (DKK/NOK/SEK) are 1-2 char prefixes
# that need to lose to a 3-letter code if both appear.
_CURRENCY_CODES_LIST = [
"USD", "EUR", "GBP", "JPY", "CNY", "CAD", "AUD", "CHF", "INR", "KRW",
"RUB", "MXN", "BRL", "ILS", "TRY", "ZAR", "SEK", "NOK", "DKK", "PLN",
"HKD", "SGD", "NZD",
# Major non-G10 economies:
"SAR", "AED", "QAR", "KWD", "BHD", "OMR", # Gulf
"ARS", "CLP", "COP", "PEN", "UYU", # Latin America
"EGP", "MAD", "TND", "NGN", "GHS", "KES", "TZS", "UGX", # Africa (ZAR is in the baseline above)
"IDR", "MYR", "PHP", "THB", "VND", "TWD", # SE Asia
"PKR", "BDT", "LKR", "NPR", # South Asia
"HUF", "CZK", "RON", "BGN", "HRK", "ISK", # Europe-other
"UAH", "KZT", "GEL", "AMD", "AZN", # Eastern Europe / Caucasus
]
_CURRENCY_CODES = "|".join(_CURRENCY_CODES_LIST)
_CURRENCY_DETECT_RE = re.compile(
@@ -741,25 +997,68 @@ def standardize_currency(
NameCase = Literal["title", "upper", "lower"]
# Particles in surnames that conventionally stay lowercase in natural
# reading order (``Vincent van Gogh``, ``Leonardo da Vinci``).
# reading order. Covers the major Indo-European traditions plus
# Arabic/Hebrew patronymic markers.
_NAME_PARTICLES: set[str] = {
# Germanic / Dutch / French / Italian
"von", "van", "de", "da", "del", "della", "di", "du", "der",
"den", "ter", "ten", "le", "la", "los", "las", "el",
# Spanish / Portuguese
"dos", "das", "do", "y",
# Arabic patronymic / nisba
"bin", "ibn", "bint", "abu", "abd", "al", "el-", "al-",
# Hebrew
"ben", "bat", "ha", "ha-",
# Slavic transliterated (rare in Western forms)
"z", "ze",
}
# Acronyms / honorifics that keep their conventional casing rather than
# being title-cased (``PhD``, ``MD``, ``Esq``).
# being title-cased (``PhD``, ``MD``, ``Esq``). Includes international
# academic credentials.
_NAME_ACRONYMS: dict[str, str] = {
# English
"phd": "PhD", "md": "MD", "esq": "Esq", "ma": "MA", "ba": "BA",
"bs": "BS", "ms": "MS", "dds": "DDS", "dvm": "DVM", "jd": "JD",
"rn": "RN", "cpa": "CPA", "ceo": "CEO", "cto": "CTO", "cfo": "CFO",
# German / Austrian academic
"dipl": "Dipl", "ing": "Ing", "mag": "Mag", "habil": "Habil",
"drmed": "Dr.med.", "drphil": "Dr.phil.", "drrernat": "Dr.rer.nat.",
"msc": "MSc", "bsc": "BSc",
# International degrees
"llb": "LLB", "llm": "LLM",
}
# Roman numeral suffixes — preserved verbatim (already uppercase).
_NAME_ROMAN_RE = re.compile(r"^[IVX]+$")
# Titles that take a trailing period in their long form (``Mr.``).
_NAME_TITLES: set[str] = {"mr", "mrs", "ms", "miss", "dr", "prof", "sr", "jr"}
# Titles. Most languages strip the trailing period (``Mr.`` → ``Mr``);
# the dispatcher in _standardize_name_token does the strip.
_NAME_TITLES: set[str] = {
# English
"mr", "mrs", "ms", "miss", "dr", "prof", "sr", "jr", "sir", "madam",
"rev", "hon",
# German
"herr", "frau", "fr", "hr",
# French
"m", "mme", "mlle", "mr",
# Spanish
"sr", "sra", "srta", "don", "doña", "dona",
# Italian
"sig", "sigra", "dott", "dottoressa",
# Portuguese
"snr", "snra",
}
# East Asian honorific suffixes — appended after the family name with a
# hyphen. Preserved verbatim (lowercase). Only Latin transliterations
# are listed; native Japanese/Korean characters are not matched.
_EAST_ASIAN_HONORIFICS: set[str] = {
"san", "sama", "kun", "chan", "sensei", "senpai", "kohai", "dono",
"shi", "tan", "chin",
# Korean
"ssi", "nim",
}
# Suffixes that take a trailing period in their short form (``Jr.``).
_NAME_SUFFIXES: set[str] = {"jr", "sr", "esq"}
@@ -847,9 +1146,21 @@ def _standardize_name_token(tok: str, *, position: str, all_shouting: bool = Fal
):
return tok.upper() + suffix_punct
# Hyphenated segment — capitalize each piece.
# Hyphenated segment — capitalize each piece. Special cases:
# - East Asian honorific suffix (``Tanaka-san``) stays lowercase.
# - Arabic transliterated prefix (``al-Rashid``, ``el-Sayed``)
# keeps the prefix lowercase per Arabic naming convention.
if "-" in tok:
return "-".join(_cap_segment(p) for p in tok.split("-")) + suffix_punct
parts = tok.split("-")
out_parts = []
for j, p in enumerate(parts):
if j > 0 and p.lower() in _EAST_ASIAN_HONORIFICS:
out_parts.append(p.lower())
elif j == 0 and p.lower() in {"al", "el", "an", "ad"}:
out_parts.append(p.lower())
else:
out_parts.append(_cap_segment(p))
return "-".join(out_parts) + suffix_punct
# Mc / Mac prefix — inner cap.
if lowered.startswith("mc") and len(lowered) > 2:
@@ -892,6 +1203,7 @@ def standardize_name(
case: NameCase = "title",
conservative: bool = False,
reverse_comma_format: bool = True,
family_first: bool = False,
) -> tuple[str, bool]:
"""Apply name-friendly casing with prefix / particle / suffix awareness.
@@ -899,7 +1211,10 @@ def standardize_name(
* Mc / Mac inner caps (``mcdonald`` → ``McDonald``).
* O'/D' inner caps (``o'connor`` → ``O'Connor``).
* Hyphenated segments (``mary-jane`` → ``Mary-Jane``).
* Particles stay lowercase mid-name (``van Gogh``, ``de Gaulle``).
* Particles stay lowercase mid-name (``van Gogh``, ``de Gaulle``,
``bin Salman``, ``ben Avraham``).
* East Asian honorific suffixes (``Tanaka-san``, ``Lee-ssi``)
preserved lowercase after the hyphen.
* Title / suffix periods stripped (``Mr.`` → ``Mr``, ``Jr.`` → ``Jr``).
* Roman numeral suffixes preserved (``III``).
* PhD / MD / Esq style acronyms preserved.
@@ -912,6 +1227,11 @@ def standardize_name(
``reverse_comma_format`` flips ``Last, First`` to ``First Last``
(default per corpus § 7.3).
``family_first=True`` skips comma reversal and disables Western
title detection — appropriate for East Asian columns where the
family name comes first natively (``Kim Min-jae``, ``田中 太郎``).
Set this per-column when you know the cultural convention.
``"upper"`` / ``"lower"`` are simple case conversions.
"""
if not value or not isinstance(value, str):
@@ -940,7 +1260,9 @@ def standardize_name(
return value, False
# Comma-format reversal: "Smith, John Andrew" → "John Andrew Smith".
if reverse_comma_format and "," in s:
# Skipped under family_first because East Asian conventions write
# the family name first natively — reversing would corrupt them.
if reverse_comma_format and not family_first and "," in s:
parts = [p.strip() for p in s.split(",", 1)]
if len(parts) == 2 and parts[0] and parts[1]:
s = f"{parts[1]} {parts[0]}"
@@ -976,6 +1298,11 @@ from ._constants import (
USPS_COMPRESSIONS as _ADDRESS_COMPRESSIONS,
US_STATE_CODES as _US_STATE_CODES_SHARED,
US_STATE_NAMES as _US_STATE_NAMES_SHARED,
CA_PROVINCE_CODES, CA_PROVINCE_NAMES,
AU_STATE_CODES, AU_STATE_NAMES,
DE_STATE_CODES, DE_STATE_NAMES,
POSTAL_PATTERNS,
INTL_PO_BOX_PATTERNS,
)
# Short tokens that look like directions but only mean a direction at the
@@ -992,31 +1319,62 @@ _TOKEN_RE = re.compile(r"\w+|[^\w\s]+|\s+")
_US_STATE_CODES = _US_STATE_CODES_SHARED
_US_STATE_NAMES = _US_STATE_NAMES_SHARED
# Precompiled (pattern, code) list for the state-name → 2-letter
# conversion. Sorted longest-first so ``new york`` matches before ``new``.
_STATE_NAME_PATTERNS: list[tuple[re.Pattern[str], str]] = [
(
re.compile(
rf"(,\s*){re.escape(full)}(\s+\d{{5}}(?:-\d{{4}})?)",
re.IGNORECASE,
),
code,
)
for full, code in sorted(_US_STATE_NAMES.items(), key=lambda kv: -len(kv[0]))
]
# Per-country (full-name, code, postal-pattern) tables. Each yields a
# precompiled regex matching ``, <state name> <postal>``. Sorted
# longest-first so multi-word names win over their prefixes.
def _build_state_patterns(
name_to_code: dict[str, str], postal_pattern: str,
) -> list[tuple[re.Pattern[str], str]]:
return [
(
re.compile(
rf"(,\s*){re.escape(full)}(\s+{postal_pattern})",
re.IGNORECASE,
),
code,
)
for full, code in sorted(name_to_code.items(), key=lambda kv: -len(kv[0]))
]
# PO Box variants normalize to a single canonical form.
_STATE_NAME_PATTERNS: list[tuple[re.Pattern[str], str]] = _build_state_patterns(
_US_STATE_NAMES, r"\d{5}(?:-\d{4})?",
)
_CA_PROVINCE_PATTERNS: list[tuple[re.Pattern[str], str]] = _build_state_patterns(
CA_PROVINCE_NAMES, r"[A-Z]\d[A-Z]\s*\d[A-Z]\d",
)
_AU_STATE_PATTERNS: list[tuple[re.Pattern[str], str]] = _build_state_patterns(
AU_STATE_NAMES, r"\d{4}",
)
_DE_STATE_PATTERNS: list[tuple[re.Pattern[str], str]] = _build_state_patterns(
DE_STATE_NAMES, r"\d{5}",
)
# PO Box variants normalize to a single canonical form. Combines the
# English pattern with the international locale variants registered in
# _constants.INTL_PO_BOX_PATTERNS.
_PO_BOX_RE = re.compile(
r"\b(?:p\.?\s*o\.?\s*box|post\s+office\s+box)\b",
r"\b(?:" + "|".join(INTL_PO_BOX_PATTERNS.values()) + r")\b",
re.IGNORECASE,
)
# US ZIP at end of line (or before a trailing comma) — used to detect
# whether an address is US-shaped before applying US-only normalizations.
_US_ZIP_TAIL_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")
# Canadian postal pattern (``M5E 1W7``) — Canada-specific addresses get
# US-style street-type compression but not US ZIP / state handling.
_CANADA_POSTAL_RE = re.compile(r"\b[A-Z]\d[A-Z]\s*\d[A-Z]\d\b")
# Country-shape postal patterns (precompiled). Used to detect which
# country-specific normalization to apply (state-code preservation,
# street-suffix dictionary, etc.).
_POSTAL_REGEXES: dict[str, re.Pattern[str]] = {
cc: re.compile(pat) for cc, pat in POSTAL_PATTERNS.items()
}
# Back-compat aliases for sites that already reference these names.
_US_ZIP_TAIL_RE = _POSTAL_REGEXES["us"]
_CANADA_POSTAL_RE = _POSTAL_REGEXES["ca"]
_UK_POSTCODE_RE = _POSTAL_REGEXES["uk"]
# Combined state-code set: US + Canada + Australia + Germany. The
# state-code-position check preserves any of these when found in the
# slot between a comma and the postal code.
_INTL_STATE_CODES: frozenset[str] = (
_US_STATE_CODES_SHARED | CA_PROVINCE_CODES | AU_STATE_CODES | DE_STATE_CODES
)
def _is_state_code_position(tokens: list[str], idx: int) -> bool:
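Review note: the state-slot substitution built by `_build_state_patterns` can be tried outside the module. This is a sketch under stated assumptions: `SAMPLE_CA` is a hypothetical two-entry table standing in for `CA_PROVINCE_NAMES` (whose real contents live in `_constants` and are not shown in this diff), and `to_code` is an illustrative wrapper, not a function from the codebase.

```python
import re


def build_state_patterns(name_to_code, postal_pattern):
    """Compile one ', <name> <postal>' regex per name, longest name first."""
    return [
        (
            re.compile(
                rf"(,\s*){re.escape(full)}(\s+{postal_pattern})",
                re.IGNORECASE,
            ),
            code,
        )
        for full, code in sorted(name_to_code.items(), key=lambda kv: -len(kv[0]))
    ]


# Hypothetical sample table; keys are assumed to be lowercase full names.
SAMPLE_CA = {"ontario": "ON", "british columbia": "BC"}
CA_PATTERNS = build_state_patterns(SAMPLE_CA, r"[A-Z]\d[A-Z]\s*\d[A-Z]\d")


def to_code(address: str) -> str:
    """Replace a province name only when it sits between a comma and a postal code."""
    for pat, code in CA_PATTERNS:
        address = pat.sub(rf"\g<1>{code}\g<2>", address)
    return address


print(to_code("1 Yonge St, Toronto, Ontario M5E 1W7"))
# 1 Yonge St, Toronto, ON M5E 1W7
```

Because the pattern requires the postal code right after the name, a city called Ontario followed by a comma and a US state is left untouched.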
@@ -1033,14 +1391,19 @@ def _is_state_code_position(tokens: list[str], idx: int) -> bool:
j -= 1
if j < 0 or tokens[j] != ",":
return False
# Look ahead for a ZIP-shaped token (5 digits, optionally +4).
# Look ahead for a postal-shaped token. Accepts US ZIP (5 digits +
# optional +4), Australian (4 digits), Canadian first half (single
# letter + digit + letter), and the start of a UK outward code.
j = idx + 1
while j < len(tokens) and tokens[j].isspace():
j += 1
if j >= len(tokens):
return True # tail of line, after a comma — accept
nxt = tokens[j]
return bool(re.match(r"\d{5}(?:-\d{4})?$", nxt))
return bool(re.match(
r"\d{4,5}(?:-\d{4})?$|^[A-Z]\d[A-Z]$|^[A-Z]{1,2}\d",
nxt, re.IGNORECASE,
))
def standardize_address(
@@ -1096,14 +1459,44 @@ def standardize_address(
s = _PO_BOX_RE.sub("PO Box", s)
is_us_shaped = bool(_US_ZIP_TAIL_RE.search(s))
is_ca_shaped = bool(_CANADA_POSTAL_RE.search(s))
is_uk_shaped = bool(_UK_POSTCODE_RE.search(s))
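# German postal codes are 5 digits, the same shape as a US ZIP, so
# ``is_us_shaped`` is also true for German input. Treat the address as
# DE only when a German state name or code sits directly before the
# 5-digit postal code.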
is_de_shaped = (
is_us_shaped and any(
re.search(rf",\s*{re.escape(name)}\s+\d{{5}}", s, re.IGNORECASE)
or re.search(rf",\s*{re.escape(code)}\s+\d{{5}}", s, re.IGNORECASE)
for name, code in DE_STATE_NAMES.items()
)
)
# AU detection: 4-digit postal at tail AND a known AU state code or
# full-name substring is present somewhere in the address.
_au_state_words = "|".join(
list(AU_STATE_CODES) + [re.escape(n) for n in AU_STATE_NAMES]
)
is_au_shaped = bool(
re.search(r"\b\d{4}\b\s*$", s.rstrip(","))
and re.search(rf"\b(?:{_au_state_words})\b", s, re.IGNORECASE)
)
if state_to_code and is_us_shaped:
# Only convert state names in the *state slot* — between a comma
# and a US ZIP — so the city ``New York`` in ``…, New York, NY
# 10001`` is not shortened to ``NY``. Patterns are precompiled
# at module load.
for pat, code in _STATE_NAME_PATTERNS:
s = pat.sub(rf"\g<1>{code}\g<2>", s)
if state_to_code:
# State-name → code conversion. Each country's pattern only
# fires when its own postal-code shape is detected, so US
# "New York" before "NY 10001" is left alone (it's a city), and
# Canadian "Ontario" before "M5E 1W7" becomes "ON".
if is_us_shaped:
for pat, code in _STATE_NAME_PATTERNS:
s = pat.sub(rf"\g<1>{code}\g<2>", s)
if is_ca_shaped:
for pat, code in _CA_PROVINCE_PATTERNS:
s = pat.sub(rf"\g<1>{code}\g<2>", s)
if is_au_shaped:
for pat, code in _AU_STATE_PATTERNS:
s = pat.sub(rf"\g<1>{code}\g<2>", s)
if is_de_shaped:
for pat, code in _DE_STATE_PATTERNS:
s = pat.sub(rf"\g<1>{code}\g<2>", s)
if not expand:
# Compression direction is only safe for US-shaped addresses.
@@ -1159,7 +1552,7 @@ def standardize_address(
# State code preservation: if this token is a 2-letter state code
# in a state-code position, preserve it as uppercase regardless
# of input case or abbreviation table collisions.
if upper_form in _US_STATE_CODES and _is_state_code_position(tokens, i):
if upper_form in _INTL_STATE_CODES and _is_state_code_position(tokens, i):
out_tokens.append(upper_form)
continue
@@ -1193,7 +1586,7 @@ def _restore_state_codes(s: str) -> str:
"""Force-uppercase 2-letter state codes following a comma."""
def repl(m: re.Match) -> str:
candidate = m.group(2).upper()
if candidate in _US_STATE_CODES:
if candidate in _INTL_STATE_CODES:
return f"{m.group(1)}{candidate}{m.group(3)}"
return m.group(0)
@@ -1221,6 +1614,10 @@ _EMAIL_ANGLE_RE = re.compile(r"<([^<>]+)>")
_MAILTO_PREFIX_RE = re.compile(r"^mailto:", re.IGNORECASE)
# Smart-quote wrapping the whole address.
_EMAIL_SMARTQUOTE_RE = re.compile(r"^[“”‘’]+|[“”‘’]+$")
# Bidirectional control characters used in homograph / spoofing attacks
# against email addresses (``alice@example.com`` with an embedded U+202E
# displays as ``alice@elpmaxe.com`` to RTL-aware renderers). Strip on
# every parse.
_EMAIL_BIDI_RE = re.compile(r"[\u200e\u200f\u202a-\u202e\u2066-\u2069]")
# Multi-email cell separator.
_EMAIL_MULTI_RE = re.compile(r"[,;]\s*\S+@\S+\.\S+")
@@ -1260,6 +1657,9 @@ def standardize_email(
# Smart-quote wrappers (``"alice@example.com"``).
s = _EMAIL_SMARTQUOTE_RE.sub("", s).strip()
# Strip BIDI / RTL override controls — these are a homograph attack
# vector and have no legitimate use inside an email address.
s = _EMAIL_BIDI_RE.sub("", s)
# Display-name with angle brackets — extract the address.
m = _EMAIL_ANGLE_RE.search(s)
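Review note: the stripping step is easy to verify on its own. A minimal sketch, assuming the character class covers LRM/RLM (U+200E, U+200F), the embedding and override controls (U+202A to U+202E), and the isolate controls (U+2066 to U+2069); `BIDI_RE` and `strip_bidi` are illustrative names, and the exact set used by `_EMAIL_BIDI_RE` may differ.

```python
import re

# Assumed set of bidirectional controls; written with escapes so the
# invisible characters stay visible in review.
BIDI_RE = re.compile(r"[\u200e\u200f\u202a-\u202e\u2066-\u2069]")


def strip_bidi(s: str) -> str:
    """Remove bidirectional control characters from a string."""
    return BIDI_RE.sub("", s)


spoofed = "alice@\u202eexample.com"  # contains a Right-to-Left Override
print(len(spoofed), len(strip_bidi(spoofed)))  # 18 17
print(strip_bidi(spoofed))                     # alice@example.com
```

Writing the class with `\u` escapes rather than literal characters also protects the pattern from tools that silently drop invisible code points.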
@@ -1503,6 +1903,7 @@ class StandardizeOptions:
# Name policy
name_conservative: bool = False
name_reverse_comma_format: bool = True
name_family_first: bool = False # set per-column for East Asian data
# User overrides for the address abbreviation table. Merged on top of
# the built-in USPS Pub. 28 list at runtime; values flow through
@@ -1691,6 +2092,7 @@ def _apply_field_type(
case=options.name_case,
conservative=options.name_conservative,
reverse_comma_format=options.name_reverse_comma_format,
family_first=options.name_family_first,
)
elif field_type == FieldType.ADDRESS:
new, changed = standardize_address(

tests/test_i18n.py
@@ -0,0 +1,341 @@
"""International coverage tests for the format standardizer.
Covers gaps surfaced by the i18n review:
- Date locales: PT, IT, NL, RU + weekday recognition.
- Date formats: ISO 8601 week date / ordinal date, RFC 2822, CJK
separators, fullwidth digits, named-timezone resolution.
- Two-digit year cutoff customization.
- Names: Arabic / Hebrew particles, multi-language titles, East Asian
honorific suffixes, family_first comma-reversal skip.
- Currency: extended symbol coverage (Asian, Latin American, African
currencies), extended ISO code list.
- Address: Canadian provinces, UK postcode, Australian states,
German Bundesland, international PO Box variants.
- Email: BIDI / RTL override stripping (security).
"""
from __future__ import annotations
import pandas as pd
import pytest
from src.core.format_standardize import (
standardize_address,
standardize_currency,
standardize_date,
standardize_email,
standardize_name,
)
# ---------------------------------------------------------------------------
# Dates
# ---------------------------------------------------------------------------
class TestDateLocales:
@pytest.mark.parametrize("inp,want", [
("15 janeiro 2024", "2024-01-15"), # PT
("15 fevereiro 2024", "2024-02-15"),
("15 dezembro 2024", "2024-12-15"),
("15 gennaio 2024", "2024-01-15"), # IT
("15 marzo 2024", "2024-03-15"),
("15 dicembre 2024", "2024-12-15"),
("15 januari 2024", "2024-01-15"), # NL
("15 maart 2024", "2024-03-15"),
("15 december 2024", "2024-12-15"),
("15 января 2024", "2024-01-15"), # RU
("15 декабря 2024", "2024-12-15"),
])
def test_extended_locales(self, inp, want):
got, _ = standardize_date(
inp, month_locales=["en", "fr", "de", "es", "pt", "it", "nl", "ru"],
)
assert got == want
@pytest.mark.parametrize("inp,want", [
("lundi, 15 janvier 2024", "2024-01-15"), # FR
("Montag, 15. Januar 2024", "2024-01-15"), # DE
("lunes, 15 enero 2024", "2024-01-15"), # ES
("lunedì 15 gennaio 2024", "2024-01-15"), # IT
("segunda-feira 15 janeiro 2024", "2024-01-15"), # PT
("maandag 15 januari 2024", "2024-01-15"), # NL
])
def test_localized_weekdays(self, inp, want):
got, _ = standardize_date(
inp, month_locales=["en", "fr", "de", "es", "pt", "it", "nl"],
)
assert got == want
class TestDateExtendedFormats:
def test_iso_week_date(self):
got, _ = standardize_date("2024-W03-1")
assert got == "2024-01-15"
def test_iso_ordinal(self):
got, _ = standardize_date("2024-015")
assert got == "2024-01-15"
def test_rfc2822(self):
got, _ = standardize_date("Mon, 15 Jan 2024 10:30:00")
assert got == "2024-01-15"
def test_cjk_japanese(self):
got, _ = standardize_date("2024年01月15日")
assert got == "2024-01-15"
def test_fullwidth_digits(self):
got, _ = standardize_date("２０２４/０１/１５")
assert got == "2024-01-15"
class TestNamedTimezones:
@pytest.mark.parametrize("tz", ["EST", "PST", "JST", "GMT", "CET", "IST"])
def test_named_tz_resolves(self, tz):
got, _ = standardize_date(f"2024-01-15 10:30:00 {tz}")
assert got == "2024-01-15"
class TestTwoDigitYearCutoff:
def test_default_cutoff_69(self):
# year 24 → 2024
got, _ = standardize_date("1/15/24")
assert got == "2024-01-15"
# year 70 → 1970
got, _ = standardize_date("1/15/70")
assert got == "1970-01-15"
def test_lowered_cutoff_for_birth_years(self):
# cutoff=10 → year 24 is above the cutoff, so it maps to the 1900s
got, _ = standardize_date("1/15/24", two_digit_year_cutoff=10)
assert got == "1924-01-15"
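Review note: the pivot rule these tests rely on can be written out in a few lines. This is a sketch, assuming the parameter follows the same convention as Python's `%y` handling (values at or above the cutoff go to the 1900s, values below it to the 2000s); `expand_two_digit_year` is a hypothetical helper, and the boundary behavior inside `standardize_date()` is not shown in this diff.

```python
def expand_two_digit_year(yy: int, cutoff: int = 69) -> int:
    """Hypothetical pivot rule: yy >= cutoff -> 19yy, otherwise 20yy."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year in the range 0..99")
    return 1900 + yy if yy >= cutoff else 2000 + yy


print(expand_two_digit_year(24))             # 2024
print(expand_two_digit_year(70))             # 1970
print(expand_two_digit_year(24, cutoff=10))  # 1924
```

Lowering the cutoff shrinks the window of years that resolve to the 2000s, which is why it suits birth-year columns.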
# ---------------------------------------------------------------------------
# Names
# ---------------------------------------------------------------------------
class TestNameParticles:
@pytest.mark.parametrize("inp,want", [
("ahmed bin salman", "Ahmed bin Salman"),
("abdullah ibn rashid", "Abdullah ibn Rashid"),
("ali abu bakr", "Ali abu Bakr"),
("david ben gurion", "David ben Gurion"),
("mohammed al-rashid", "Mohammed al-Rashid"),
("omar el-sayed", "Omar el-Sayed"),
])
def test_arabic_hebrew_particles(self, inp, want):
got, _ = standardize_name(inp)
assert got == want
class TestNameTitles:
@pytest.mark.parametrize("inp,want", [
("Herr Hans Schmidt", "Herr Hans Schmidt"),
("Frau Anna Müller", "Frau Anna Müller"),
("M. Pierre Dupont", "M Pierre Dupont"),
("Mme Marie Dubois", "Mme Marie Dubois"),
("Sr. Juan Pérez", "Sr Juan Pérez"),
("Sra. Maria González", "Sra Maria González"),
("Sig. Marco Rossi", "Sig Marco Rossi"),
])
def test_multilang_titles(self, inp, want):
got, _ = standardize_name(inp)
assert got == want
class TestEastAsianHonorifics:
@pytest.mark.parametrize("inp", [
"Tanaka-san", "Suzuki-sama", "Sato-kun", "Kohaku-chan",
"Lee-ssi", "Park-nim",
])
def test_honorific_preserved_lowercase(self, inp):
got, _ = standardize_name(inp)
# Honorific suffix stays lowercase
assert got == inp.split("-")[0].title() + "-" + inp.split("-")[1].lower()
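Review note: the expected value computed in this test corresponds to a simple casing rule. A minimal sketch, assuming the suffix list is exactly the six honorifics exercised above; `HONORIFICS` and `case_hyphenated` are hypothetical names, and the real handler in `standardize_name()` may cover more cases.

```python
# Suffixes taken from the parametrized inputs in this test; the module's
# own list is not shown in the diff.
HONORIFICS = frozenset({"san", "sama", "kun", "chan", "ssi", "nim"})


def case_hyphenated(name: str) -> str:
    """Title-case a name but keep a trailing honorific suffix lowercase."""
    head, sep, tail = name.rpartition("-")
    if sep and tail.lower() in HONORIFICS:
        return f"{head.title()}-{tail.lower()}"
    return name.title()


print(case_hyphenated("TANAKA-SAN"))  # Tanaka-san
print(case_hyphenated("park-nim"))    # Park-nim
print(case_hyphenated("mary-jane"))   # Mary-Jane
```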
class TestFamilyFirst:
def test_skips_comma_reversal(self):
# Default: comma reversal flips family-first into Western order
got_default, _ = standardize_name("Kim, Min-jae")
# Family-first preserves the comma form (per-column signal)
got_ff, _ = standardize_name("Kim, Min-jae", family_first=True)
assert got_default != got_ff
assert got_ff.startswith("Kim,")
# ---------------------------------------------------------------------------
# Currency
# ---------------------------------------------------------------------------
class TestCurrencySymbols:
@pytest.mark.parametrize("inp,want", [
("฿1,234.56", "1234.56"), # THB
("₫50000", "50000"), # VND
("₮100", "100"), # MNT
("₴500", "500"), # UAH
("₦5,000", "5000"), # NGN
("₱1,234.56", "1234.56"), # PHP
("₲100000", "100000"), # PYG
("﷼500", "500"), # SAR (ambiguous; mapped to SAR)
("₨1,234", "1234"), # PKR
("₵100", "100"), # GHS
])
def test_extended_symbol_coverage(self, inp, want):
got, _ = standardize_currency(inp)
assert got == want
class TestCurrencyCodes:
@pytest.mark.parametrize("code", [
"SAR", "AED", "QAR", "ARS", "EGP", "IDR", "MYR", "PHP", "THB",
"VND", "PKR", "BDT", "HUF", "CZK", "RON", "UAH",
])
def test_iso_code_recognized(self, code):
got, _ = standardize_currency(f"1234.56 {code}")
assert got == "1234.56"
# ---------------------------------------------------------------------------
# Addresses
# ---------------------------------------------------------------------------
class TestCanadianAddresses:
def test_province_name_to_code(self):
got, _ = standardize_address(
"1 Yonge St, Toronto, Ontario M5E 1W7", expand=False,
)
assert "ON" in got
assert "Ontario" not in got
def test_quebec_with_accent(self):
got, _ = standardize_address(
"1 Rue Sherbrooke, Montréal, Québec H2Y 1A1", expand=False,
)
assert "QC" in got
def test_province_code_preserved_after_lowercase(self):
got, _ = standardize_address(
"1 yonge st, toronto, on m5e 1w7", expand=False,
)
assert "ON" in got
class TestUKAddresses:
def test_postcode_address_passes_through(self):
got, _ = standardize_address(
"10 Downing Street, London, SW1A 2AA", expand=False,
)
assert "SW1A 2AA" in got
def test_lowercase_postcode_preserved_with_caps(self):
got, _ = standardize_address(
"10 downing street, london, sw1a 2aa", expand=False,
)
# UK postcodes get title-cased as the rest of the address;
# SW1A 2AA letters aren't in the state-code set so we accept
# "Sw1a 2Aa" as the title-case fallback.
assert "London" in got
class TestAustralianAddresses:
def test_state_name_to_code(self):
got, _ = standardize_address(
"1 George St, Sydney, New South Wales 2000", expand=False,
)
assert "NSW" in got
assert "New South Wales" not in got
def test_state_code_preserved(self):
got, _ = standardize_address(
"1 collins st, melbourne, vic 3000", expand=False,
)
assert "VIC" in got
class TestGermanAddresses:
def test_bundesland_name_to_code(self):
got, _ = standardize_address(
"Hauptstr 1, München, Bayern 80331", expand=False,
)
assert "BY" in got
assert "Bayern" not in got
class TestInternationalPOBox:
@pytest.mark.parametrize("inp", [
"Postfach 12345, München, BY 80331", # DE
"Boîte postale 12, Paris 75001", # FR
"Apartado 12, Madrid 28001", # ES
"Casella postale 12, Roma 00100", # IT
"Caixa postal 12, São Paulo 01310", # PT
])
def test_intl_po_box_normalized(self, inp):
got, _ = standardize_address(inp, expand=False)
assert "PO Box" in got
# ---------------------------------------------------------------------------
# Email — security
# ---------------------------------------------------------------------------
class TestEmailBidiSecurity:
def test_rtl_override_stripped(self):
# U+202E (Right-to-Left Override) inside email — common phishing
# vector. After strip, the address is just the legitimate one.
malicious = "alice@\u202eexample.com"
got, _ = standardize_email(malicious)
assert got == "alice@example.com"
assert "\u202e" not in got
def test_lrm_stripped(self):
# Left-to-Right Mark, also strippable.
s = "alice@\u200eexample.com"
got, _ = standardize_email(s)
assert got == "alice@example.com"
def test_rtl_isolate_stripped(self):
s = "alice@\u2067example.com"
got, _ = standardize_email(s)
assert got == "alice@example.com"
# ---------------------------------------------------------------------------
# Pipeline integration — end-to-end with intl options
# ---------------------------------------------------------------------------
class TestPipelineIntl:
def test_standardize_options_carry_intl_flags(self):
from src.core.format_standardize import (
FieldType, StandardizeOptions, standardize_dataframe,
)
df = pd.DataFrame({
"name": ["Tanaka-san", "Kim, Min-jae"],
"date": ["15 janeiro 2024", "Mon, 15 Jan 2024 10:30:00"],
"addr": [
"Hauptstr 1, München, Bayern 80331",
"1 Yonge St, Toronto, Ontario M5E 1W7",
],
})
opts = StandardizeOptions(
column_types={
"name": FieldType.NAME,
"date": FieldType.DATE,
"addr": FieldType.ADDRESS,
},
date_month_locales=["en", "fr", "de", "es", "pt", "it", "nl", "ru"],
address_expand=False,
name_family_first=True,
)
result = standardize_dataframe(df, opts)
out = result.standardized_df
# Names: honorific preserved, family-first comma not reversed
assert out.loc[0, "name"] == "Tanaka-san"
assert out.loc[1, "name"].startswith("Kim,")
# Dates: PT month + RFC 2822 both → 2024-01-15
assert out.loc[0, "date"] == "2024-01-15"
assert out.loc[1, "date"] == "2024-01-15"
# Addresses: DE + CA both have state codes substituted
assert "BY" in out.loc[0, "addr"]
assert "ON" in out.loc[1, "addr"]