feat(format-i18n): broaden international coverage across all domains

Closes ~17 high-value international gaps surfaced by parallel review.
Adds 93 regression tests; full project suite now 1323 / 0 / 17
(passed / failed / xfailed).

DATES
- Adds Portuguese, Italian, Dutch, Russian month dictionaries to the
  opt-in ``month_locales`` set (now: en, fr, de, es, pt, it, nl, ru).
- Adds localized weekday recognition for those locales: "Lundi",
  "Montag", "lunedì", "понедельник", etc. all strip cleanly before
  format matching.
- New CJK separator normalization: Japanese ``2024年01月15日`` and
  fullwidth digits ``2024/01/15`` fold to ASCII before
  parsing.
- New named-timezone resolution: EST/PST/JST/CET/IST/GMT/etc. map to
  fixed UTC offsets via ``_NAMED_TZ_OFFSETS`` so the trailing TZ doesn't
  block format matching.
- New ISO 8601 extended formats: week date (``2024-W03-1``) and ordinal
  date (``2024-015``), plus RFC 2822 mail-header form
  (``Mon, 15 Jan 2024 10:30:00``).
- New ``two_digit_year_cutoff`` parameter on ``standardize_date()``.
  Defaults to Python's stdlib 69; lower it for birth-year columns where
  most subjects were born ≤ 1999.

NAMES
- Particles set extended with Arabic patronymic markers (bin, ibn, bint,
  abu, abd, al, al-, el-) and Hebrew (ben, bat, ha, ha-).
- Title set extended with German (Herr, Frau), French (M., Mme, Mlle),
  Spanish (Sr., Sra., Srta., Don, Doña), Italian (Sig., Sig.ra, Dott.),
  Portuguese.
- Acronym map extended with international academic credentials (Dipl,
  Ing, Mag, Habil, MSc, BSc, LLB, LLM).
- New East Asian honorific suffix handler: ``Tanaka-san``, ``Lee-ssi``,
  ``Park-nim`` keep the suffix lowercase after the hyphen instead of
  being title-cased into ``Tanaka-San``.
- Hyphenated-segment handler now keeps Arabic prefixes ``al-`` / ``el-``
  lowercase per Arabic transliteration convention.
- New ``family_first`` parameter on ``standardize_name()`` and matching
  ``name_family_first`` field on ``StandardizeOptions``. Set it
  per-column for East Asian data to skip Western comma-format reversal
  (``Kim, Min-jae`` stays ``Kim, …`` instead of becoming ``Min-jae Kim``).

CURRENCY
- Symbol map extended: ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN), ₱(PHP),
  ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS). Covers SE Asia, Africa, Eastern
  Europe, Latin America gaps.
- ISO 4217 code list extended from 23 to ~50: SAR, AED, QAR, KWD, BHD,
  OMR, ARS, CLP, COP, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS, KES, HUF,
  CZK, RON, UAH, KZT, etc.

EMAIL
- New BIDI / RTL override stripping (``standardize_email``): U+202A-U+202E
  and U+2066-U+2069 stripped from every email. These are a known
  phishing vector: with an override embedded, ``alice@example.com`` can
  display as ``alice@elpmaxe.com`` in RTL-aware renderers.

ADDRESS
- Canadian provinces: 13 codes + names → 2-letter (Ontario → ON).
- UK postcode pattern recognition (``SW1A 2AA`` shape).
- Australian states: 8 codes + names (NSW, VIC, QLD, … + full names).
- German Bundesland: 16 codes + names (Bayern → BY, etc.).
- International PO Box variants: ``Postfach`` (DE), ``Boîte postale``
  (FR), ``Apartado`` (ES), ``Casella postale`` (IT), ``Caixa postal``
  (PT). All fold to canonical ``PO Box``.
- ``_INTL_STATE_CODES`` now combines US/CA/AU/DE codes; the position
  check that preserves state codes regardless of input case applies to
  all four jurisdictions.
- ``_is_state_code_position`` postal pattern broadened to recognize US
  ZIP, AU 4-digit, CA first half, and UK outward code.

CONSTANTS
- ``src/core/_constants.py`` gains: ``CA_PROVINCE_CODES`` /
  ``CA_PROVINCE_NAMES``, ``AU_STATE_CODES`` / ``AU_STATE_NAMES``,
  ``DE_STATE_CODES`` / ``DE_STATE_NAMES``, ``POSTAL_PATTERNS``
  (us/ca/uk/de/au/fr), ``INTL_PO_BOX_PATTERNS`` (per-language regex),
  ``INTL_STREET_SUFFIXES`` (de/fr/es/it/uk dictionaries, ready for use
  when address takes a `country_hint` parameter in a future pass).

DOCS
- TECHNICAL.md §11.3 domain table updated with the new handling per
  domain plus a new "International coverage" sub-section listing the
  supported locales / symbols / jurisdictions.

DEFERRED (out of scope or rare)
- Alternative calendars (Japanese era, Hijri, Hebrew, Buddhist): corpus
  § 3.5 marks out of scope.
- Persian/Arabic-Indic digit conversion: rare in tabular data.
- Trailing-minus RTL currency convention.
- Punycode ↔ Unicode IDN normalization.
- Mixed-country phone column auto-detection (user can override
  ``default_region`` per column).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@@ -165,6 +165,110 @@ _MONTH_LOCALES: dict[str, dict[str, str]] = {
        "agosto": "August", "septiembre": "September", "setiembre": "September",
        "octubre": "October", "noviembre": "November", "diciembre": "December",
    },
    "pt": {
        "janeiro": "January", "fevereiro": "February", "março": "March",
        "marco": "March", "abril": "April", "maio": "May", "junho": "June",
        "julho": "July", "agosto": "August", "setembro": "September",
        "outubro": "October", "novembro": "November", "dezembro": "December",
        "jan": "Jan", "fev": "Feb", "mar": "Mar", "abr": "Apr",
        "mai": "May", "jun": "Jun", "jul": "Jul", "ago": "Aug",
        "set": "Sep", "out": "Oct", "nov": "Nov", "dez": "Dec",
    },
    "it": {
        "gennaio": "January", "febbraio": "February", "marzo": "March",
        "aprile": "April", "maggio": "May", "giugno": "June",
        "luglio": "July", "agosto": "August", "settembre": "September",
        "ottobre": "October", "novembre": "November", "dicembre": "December",
        "gen": "Jan", "feb": "Feb", "mar": "Mar", "apr": "Apr",
        "mag": "May", "giu": "Jun", "lug": "Jul", "ago": "Aug",
        "set": "Sep", "ott": "Oct", "nov": "Nov", "dic": "Dec",
    },
    "nl": {
        "januari": "January", "februari": "February", "maart": "March",
        "april": "April", "mei": "May", "juni": "June", "juli": "July",
        "augustus": "August", "september": "September", "oktober": "October",
        "november": "November", "december": "December",
        "jan": "Jan", "feb": "Feb", "mrt": "Mar", "apr": "Apr",
        "mei": "May", "jun": "Jun", "jul": "Jul", "aug": "Aug",
        "sep": "Sep", "okt": "Oct", "nov": "Nov", "dec": "Dec",
    },
    "ru": {
        "января": "January", "февраля": "February", "марта": "March",
        "апреля": "April", "мая": "May", "июня": "June", "июля": "July",
        "августа": "August", "сентября": "September", "октября": "October",
        "ноября": "November", "декабря": "December",
        # Nominative forms (less common in dates but possible)
        "январь": "January", "февраль": "February", "март": "March",
        "апрель": "April", "май": "May", "июнь": "June", "июль": "July",
        "август": "August", "сентябрь": "September", "октябрь": "October",
        "ноябрь": "November", "декабрь": "December",
    },
}

# Localized weekday prefix removal: same idea as month substitution.
# Each locale's set lists full + abbreviated forms (lowercase) that
# should be stripped from the start of a date string before format
# matching. English is in ``_WEEKDAY_PREFIX_RE`` already.
_WEEKDAY_LOCALES: dict[str, list[str]] = {
    "fr": ["lundi", "mardi", "mercredi", "jeudi", "vendredi", "samedi",
           "dimanche", "lun", "mar", "mer", "jeu", "ven", "sam", "dim"],
    "de": ["montag", "dienstag", "mittwoch", "donnerstag", "freitag",
           "samstag", "sonntag", "mo", "di", "mi", "do", "fr", "sa", "so"],
    "es": ["lunes", "martes", "miércoles", "miercoles", "jueves",
           "viernes", "sábado", "sabado", "domingo"],
    "it": ["lunedì", "lunedi", "martedì", "martedi", "mercoledì",
           "mercoledi", "giovedì", "giovedi", "venerdì", "venerdi",
           "sabato", "domenica"],
    "pt": ["segunda-feira", "segunda", "terça-feira", "terca-feira",
           "terça", "terca", "quarta-feira", "quarta", "quinta-feira",
           "quinta", "sexta-feira", "sexta", "sábado", "sabado", "domingo"],
    "nl": ["maandag", "dinsdag", "woensdag", "donderdag", "vrijdag",
           "zaterdag", "zondag",
           "ma", "di", "wo", "do", "vr", "za", "zo"],
    "ru": ["понедельник", "вторник", "среда", "четверг", "пятница",
           "суббота", "воскресенье",
           "пн", "вт", "ср", "чт", "пт", "сб", "вс"],
}


def _build_weekday_patterns() -> dict[str, "re.Pattern[str]"]:
    """One regex per locale matching any leading weekday + optional comma."""
    out = {}
    for loc, words in _WEEKDAY_LOCALES.items():
        # Sort longest first so ``segunda-feira`` wins over ``segunda``.
        alt = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
        out[loc] = re.compile(rf"^(?:{alt})\s*,?\s+", re.IGNORECASE)
    return out


_WEEKDAY_LOCALE_PATTERNS = _build_weekday_patterns()


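As a rough illustration of the weekday-prefix stripping above, the sketch below compiles one anchored alternation per locale and applies it to a date string. The two-locale table and the names `build_patterns` and `strip_weekday` are assumptions made for this example; the module's own table is `_WEEKDAY_LOCALES`.

```python
import re

# Hypothetical two-locale subset of the weekday table, for illustration only.
WEEKDAYS = {
    "de": ["montag", "dienstag"],
    "pt": ["segunda-feira", "segunda"],
}


def build_patterns(table):
    """Compile one anchored, case-insensitive prefix regex per locale."""
    patterns = {}
    for loc, words in table.items():
        # Longest alternative first, so "segunda-feira" is tried before "segunda".
        alt = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
        patterns[loc] = re.compile(rf"^(?:{alt})\s*,?\s+", re.IGNORECASE)
    return patterns


PATTERNS = build_patterns(WEEKDAYS)


def strip_weekday(s, locales):
    """Remove a leading weekday name for any of the enabled locales."""
    for loc in locales:
        pat = PATTERNS.get(loc)
        if pat is not None:
            s = pat.sub("", s).strip()
    return s
```

Because every pattern is anchored at `^` and requires trailing whitespace, a string that does not start with a listed weekday passes through unchanged.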
# Named timezone → fixed UTC offset. Resolves common abbreviations so
# ``2024-01-15 10:30:00 EST`` produces a date instead of falling through
# unparseably. Per FORMATS-CASES.md § 3.3, these are *fixed* offsets;
# DST-aware handling is out of scope (would require ``zoneinfo``).
_NAMED_TZ_OFFSETS: dict[str, str] = {
    # Universal
    "UTC": "+00:00", "GMT": "+00:00", "Z": "+00:00",
    # Americas
    "EST": "-05:00", "EDT": "-04:00",
    "CST": "-06:00", "CDT": "-05:00",
    "MST": "-07:00", "MDT": "-06:00",
    "PST": "-08:00", "PDT": "-07:00",
    "AST": "-04:00", "AKST": "-09:00", "HST": "-10:00",
    "BRT": "-03:00", "ART": "-03:00",
    # Europe
    "BST": "+01:00", "CET": "+01:00", "CEST": "+02:00",
    "EET": "+02:00", "EEST": "+03:00", "WET": "+00:00", "WEST": "+01:00",
    "MSK": "+03:00",
    # Asia / Pacific
    "IST": "+05:30",
    "PKT": "+05:00", "BDT": "+06:00",
    "ICT": "+07:00", "WIB": "+07:00",
    "CST_CN": "+08:00", "HKT": "+08:00", "SGT": "+08:00", "PHT": "+08:00",
    "JST": "+09:00", "KST": "+09:00",
    "AEST": "+10:00", "AEDT": "+11:00", "NZST": "+12:00",
}


@@ -262,6 +366,7 @@ def standardize_date(
    date_order: DateOrder = "MDY",
    error_policy: DateErrorPolicy = "passthrough",
    month_locales: Optional[list[str]] = None,
    two_digit_year_cutoff: int = 69,
) -> tuple[str, bool]:
    """Parse *value* as a date and return it formatted per *output_format*.

@@ -273,9 +378,15 @@ def standardize_date(
    passes through unchanged. With ``"sentinel"`` the cleaner emits
    ``<error: <reason>>`` for invalid dates per corpus § 0.3.

    ``month_locales`` enables non-English month names. Pass
    ``["en", "fr", "de", "es"]`` to recognize French / German / Spanish
    month names in addition to English. Defaults to English-only.
    ``month_locales`` enables non-English month names. Pass any subset
    of ``["en", "fr", "de", "es", "pt", "it", "nl", "ru"]`` to recognize
    those locales' month + weekday names in addition to English.
    Defaults to English-only.

    ``two_digit_year_cutoff`` controls the pivot for 2-digit years:
    years ``00..cutoff`` map to 2000-2099, ``cutoff+1..99`` map to
    1900-1999. The default 69 leaves ``strptime``'s own pivot in
    place (``00..68`` → 2000-2068, ``69..99`` → 1969-1999). Override
    to ~25 for birth-year columns where most subjects were born ≤ 1999.

    Recognizes Excel-1900 serial dates (``45306`` → ``2024-01-15``),
    Unix timestamps in seconds and milliseconds, year-month text
@@ -320,19 +431,42 @@ def standardize_date(
        out = f"{q.group(2)}-Q{q.group(1)}"
        return out, out != value

    # CJK separator normalization: Japanese ``2024年01月15日`` → ``2024-01-15``,
    # Korean ``2024.01.15`` is already covered by the dot format. Also fold
    # fullwidth digits (0-9) to ASCII so any of the parsers can read them.
    s = _normalize_cjk_date_chars(s)

    # Substitute localized month names with English before format-match.
    if month_locales:
        s = _apply_month_locale(s, month_locales)
        # Strip localized weekday prefixes for any enabled locale BEFORE
        # the day-period strip; otherwise ``Montag, 15. Januar 2024``
        # never reaches the digit-leading shape the period strip expects.
        for loc in month_locales:
            pat = _WEEKDAY_LOCALE_PATTERNS.get(loc)
            if pat is not None:
                s = pat.sub("", s).strip()
        # German DMY uses ``15.`` for the day; strip the trailing period
        # so ``15. Januar 2024`` parses as ``15 January 2024``.
        s = re.sub(r"^(\d{1,2})\.\s+", r"\1 ", s)

    # Strip a leading weekday prefix (``Monday, January 15, 2024``).
    s = _WEEKDAY_PREFIX_RE.sub("", s).strip()
    # Drop a trailing time portion before format-matching.
    # Resolve named timezones (EST/PST/JST/…) to fixed offsets, then
    # drop the trailing time portion before format-matching.
    s = _resolve_named_tz(s)
    s = _TIME_TAIL_RE.sub("", s).strip()

    parsed = _try_parse_date(s, date_order)
    # ISO 8601 extended formats (week date + ordinal date) and
    # RFC 2822 mail-header form.
    iso_extended = _try_iso_extended(s, output_format)
    if iso_extended is not None:
        return iso_extended, iso_extended != value
    rfc = _try_rfc2822(s, output_format)
    if rfc is not None:
        return rfc, rfc != value

    parsed = _try_parse_date(s, date_order, two_digit_year_cutoff)
    if parsed is not None:
        out = parsed.strftime(output_format)
        return out, out != value
@@ -370,13 +504,112 @@ def standardize_date(
    return value, False


def _try_parse_date(s: str, date_order: DateOrder) -> Optional[datetime]:
def _try_parse_date(
    s: str, date_order: DateOrder, two_digit_year_cutoff: int = 69,
) -> Optional[datetime]:
    formats = _DATE_FORMATS_DMY if date_order == "DMY" else _DATE_FORMATS_MDY
    for fmt in formats:
        try:
            return datetime.strptime(s, fmt)
            parsed = datetime.strptime(s, fmt)
        except ValueError:
            continue
        # Re-pivot 2-digit years if the user changed the cutoff. strptime
        # uses Python's stdlib default of 69; for cutoff != 69 we may need
        # to roll the century forward or back.
        if "%y" in fmt and two_digit_year_cutoff != 69:
            year_2 = parsed.year % 100
            if year_2 <= two_digit_year_cutoff:
                century = 2000
            else:
                century = 1900
            parsed = parsed.replace(year=century + year_2)
        return parsed
    return None


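The re-pivot step can be tried in isolation. This is a minimal sketch, assuming a standalone helper named `pivot_two_digit_year` (hypothetical; in the diff the logic lives inline in `_try_parse_date`). It shows a birth-year style cutoff moving a year that `strptime` places in the 2000s back into the 1900s.

```python
from datetime import datetime


def pivot_two_digit_year(parsed: datetime, cutoff: int) -> datetime:
    """Re-map a strptime-parsed 2-digit year around a custom cutoff.

    Years 00..cutoff land in the 2000s; cutoff+1..99 land in the 1900s.
    """
    year_2 = parsed.year % 100
    century = 2000 if year_2 <= cutoff else 1900
    return parsed.replace(year=century + year_2)


# strptime on its own maps 00-68 to 20xx and 69-99 to 19xx.
stdlib_parsed = datetime.strptime("03/15/45", "%m/%d/%y")
birth_parsed = pivot_two_digit_year(stdlib_parsed, cutoff=25)
```

With `cutoff=25`, the year `45` moves from 2045 to 1945 while `10` would stay in 2010.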
_FULLWIDTH_DIGITS = str.maketrans("0123456789", "0123456789")
_CJK_DATE_MARKERS = str.maketrans({"年": "-", "月": "-", "日": "", ".": ".", "/": "/"})


def _normalize_cjk_date_chars(s: str) -> str:
    """Fold East Asian date markers + fullwidth digits to ASCII equivalents.

    ``2024年01月15日`` → ``2024-01-15``; fullwidth ``2024/01/15``
    → ``2024/01/15``. Idempotent on ASCII input.
    """
    if not any(c > "\x7f" for c in s):
        return s
    s = s.translate(_FULLWIDTH_DIGITS).translate(_CJK_DATE_MARKERS)
    # ``日`` maps to the empty string, so a full date leaves no residue;
    # a partial ``2024年01月`` would end in a dash, which is stripped here.
    return s.rstrip("-").strip()


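A self-contained sketch of the same folding idea follows. It builds the translation table from code points rather than literal fullwidth characters, so it survives copy-and-paste through tools that normalize Unicode. The names `TABLE` and `normalize_cjk_date` are illustrative assumptions, not the module's identifiers.

```python
# Fullwidth digits U+FF10..U+FF19 mapped to ASCII 0-9.
FULLWIDTH_DIGITS = {0xFF10 + i: str(i) for i in range(10)}

# Date markers: the year and month markers act as separators, the day
# marker is dropped; fullwidth "." (U+FF0E) and "/" (U+FF0F) fold to ASCII.
CJK_MARKERS = {
    ord("年"): "-",
    ord("月"): "-",
    ord("日"): "",
    0xFF0E: ".",
    0xFF0F: "/",
}

TABLE = {**FULLWIDTH_DIGITS, **CJK_MARKERS}


def normalize_cjk_date(s: str) -> str:
    """Fold East Asian date markers and fullwidth digits to ASCII."""
    if s.isascii():
        return s  # nothing to fold
    # A year-month string ends in a separator after translation; strip it.
    return s.translate(TABLE).rstrip("-").strip()
```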
_NAMED_TZ_RE = re.compile(
    r"\s+(" + "|".join(re.escape(k) for k in sorted(_NAMED_TZ_OFFSETS, key=len, reverse=True)) + r")\b"
)


def _resolve_named_tz(s: str) -> str:
    """Replace a trailing named timezone with its fixed UTC offset.

    ``2024-01-15 10:30:00 EST`` → ``2024-01-15 10:30:00-05:00``. Per
    FORMATS-CASES.md § 3.3, offsets are fixed (not DST-aware); see
    ``_NAMED_TZ_OFFSETS`` for the table.
    """
    def repl(m: re.Match) -> str:
        return _NAMED_TZ_OFFSETS[m.group(1)]
    return _NAMED_TZ_RE.sub(repl, s)


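To make the substitution concrete, here is a small sketch with a four-entry offset table. `TZ_OFFSETS`, `TZ_RE`, and `resolve_named_tz` are stand-in names for this example, and the table is a subset chosen only to show the longest-first ordering.

```python
import re

# Hypothetical subset of the abbreviation table.
TZ_OFFSETS = {"UTC": "+00:00", "EST": "-05:00", "CET": "+01:00", "CEST": "+02:00"}

# Longest abbreviations first, so "CEST" is tried before "CET".
_ALTERNATION = "|".join(
    re.escape(name) for name in sorted(TZ_OFFSETS, key=len, reverse=True)
)
TZ_RE = re.compile(r"\s+(" + _ALTERNATION + r")\b")


def resolve_named_tz(s: str) -> str:
    """Swap a whitespace-separated abbreviation for its fixed offset."""
    return TZ_RE.sub(lambda m: TZ_OFFSETS[m.group(1)], s)
```

The leading whitespace is consumed by the match, so the offset attaches directly to the time, as in the docstring's `10:30:00-05:00` example.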
_ISO_WEEK_RE = re.compile(r"^(\d{4})-W(\d{2})-(\d)$")
_ISO_ORDINAL_RE = re.compile(r"^(\d{4})-(\d{3})$")


def _try_iso_extended(s: str, output_format: str) -> Optional[str]:
    """Parse ISO 8601 week date or ordinal date, return formatted string."""
    m = _ISO_WEEK_RE.match(s)
    if m:
        try:
            parsed = datetime.fromisocalendar(
                int(m.group(1)), int(m.group(2)), int(m.group(3)),
            )
            return parsed.strftime(output_format)
        except ValueError:
            return None
    m = _ISO_ORDINAL_RE.match(s)
    if m:
        year, day = int(m.group(1)), int(m.group(2))
        if 1 <= day <= 366:
            try:
                parsed = datetime(year, 1, 1) + timedelta(days=day - 1)
                if parsed.year == year:
                    return parsed.strftime(output_format)
            except ValueError:
                return None
    return None


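The two ISO 8601 shapes can be checked against known dates with a short sketch. The function name `parse_iso_extended` and the choice to return a `date` rather than a formatted string are assumptions for this example.

```python
import re
from datetime import date, timedelta
from typing import Optional

ISO_WEEK = re.compile(r"^(\d{4})-W(\d{2})-(\d)$")
ISO_ORDINAL = re.compile(r"^(\d{4})-(\d{3})$")


def parse_iso_extended(s: str) -> Optional[date]:
    """Parse an ISO 8601 week date or ordinal date; None if neither fits."""
    m = ISO_WEEK.match(s)
    if m:
        try:
            return date.fromisocalendar(int(m[1]), int(m[2]), int(m[3]))
        except ValueError:
            return None  # e.g. week 54 or weekday 0
    m = ISO_ORDINAL.match(s)
    if m:
        year, day = int(m[1]), int(m[2])
        if year < 1 or day < 1:
            return None
        try:
            d = date(year, 1, 1) + timedelta(days=day - 1)
        except (ValueError, OverflowError):
            return None
        # Day 366 of a non-leap year spills into the next year: reject it.
        return d if d.year == year else None
    return None
```

Both `2024-W03-1` and `2024-015` resolve to 15 January 2024, which matches the examples in the commit message.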
# RFC 2822 mail-header form: ``Mon, 15 Jan 2024 10:30:00 GMT``.
_RFC2822_FORMATS = [
    "%a, %d %b %Y %H:%M:%S",  # without TZ
    "%a, %d %b %Y %H:%M:%S %Z",  # with named TZ (already resolved upstream)
    "%a, %d %b %Y %H:%M:%S %z",  # with offset
    "%d %b %Y %H:%M:%S",
]


def _try_rfc2822(s: str, output_format: str) -> Optional[str]:
    """Parse RFC 2822 mail-header date format."""
    for fmt in _RFC2822_FORMATS:
        try:
            parsed = datetime.strptime(s, fmt)
        except ValueError:
            continue
        try:
            return parsed.strftime(output_format)
        except ValueError:
            return None
    return None


@@ -539,12 +772,35 @@ _SYMBOL_TO_ISO: dict[str, str] = {
    "₪": "ILS",
    "₺": "TRY",
    "¢": "USD",  # cents: coerce to USD for the code; value is still numeric
    # International additions:
    "฿": "THB",  # Thai Baht
    "₫": "VND",  # Vietnamese Dong
    "₮": "MNT",  # Mongolian Tugrik
    "₴": "UAH",  # Ukrainian Hryvnia
    "₦": "NGN",  # Nigerian Naira
    "₱": "PHP",  # Philippine Peso
    "₲": "PYG",  # Paraguayan Guarani
    "﷼": "SAR",  # ambiguous Saudi/Omani/Iranian; pick the most common
    "₨": "PKR",  # Pakistani Rupee (and historical Sri Lankan)
    "₵": "GHS",  # Ghanaian Cedi
}
_CURRENCY_SYMBOLS = "".join(_SYMBOL_TO_ISO)
# ISO 4217 codes: the long tail of currencies in active use. Order
# matters for the regex alternation: a 3-letter ISO code is unambiguous,
# but ``R$`` (Brazil) and ``kr`` (DKK/NOK/SEK) are 1-2 char prefixes
# that need to lose to a 3-letter code if both appear.
_CURRENCY_CODES_LIST = [
    "USD", "EUR", "GBP", "JPY", "CNY", "CAD", "AUD", "CHF", "INR", "KRW",
    "RUB", "MXN", "BRL", "ILS", "TRY", "ZAR", "SEK", "NOK", "DKK", "PLN",
    "HKD", "SGD", "NZD",
    # Major non-G10 economies:
    "SAR", "AED", "QAR", "KWD", "BHD", "OMR",  # Gulf
    "ARS", "CLP", "COP", "PEN", "UYU",  # Latin America
    "EGP", "MAD", "TND", "NGN", "GHS", "KES", "ZAR", "TZS", "UGX",  # Africa
    "IDR", "MYR", "PHP", "THB", "VND", "TWD",  # SE Asia
    "PKR", "BDT", "LKR", "NPR",  # South Asia
    "HUF", "CZK", "RON", "BGN", "HRK", "ISK",  # Europe-other
    "UAH", "KZT", "GEL", "AMD", "AZN",  # Eastern Europe / Caucasus
]
_CURRENCY_CODES = "|".join(_CURRENCY_CODES_LIST)
_CURRENCY_DETECT_RE = re.compile(
@@ -741,25 +997,68 @@ def standardize_currency(
NameCase = Literal["title", "upper", "lower"]

# Particles in surnames that conventionally stay lowercase in natural
# reading order (``Vincent van Gogh``, ``Leonardo da Vinci``).
# reading order. Covers the major Indo-European traditions plus
# Arabic/Hebrew patronymic markers.
_NAME_PARTICLES: set[str] = {
    # Germanic / Dutch / French / Italian
    "von", "van", "de", "da", "del", "della", "di", "du", "der",
    "den", "ter", "ten", "le", "la", "los", "las", "el",
    # Spanish / Portuguese
    "dos", "das", "do", "y",
    # Arabic patronymic / nisba
    "bin", "ibn", "bint", "abu", "abd", "al", "el-", "al-",
    # Hebrew
    "ben", "bat", "ha", "ha-",
    # Slavic transliterated (rare in Western forms)
    "z", "ze",
}

# Acronyms / honorifics that keep their conventional casing rather than
# being title-cased (``PhD``, ``MD``, ``Esq``).
# being title-cased (``PhD``, ``MD``, ``Esq``). Includes international
# academic credentials.
_NAME_ACRONYMS: dict[str, str] = {
    # English
    "phd": "PhD", "md": "MD", "esq": "Esq", "ma": "MA", "ba": "BA",
    "bs": "BS", "ms": "MS", "dds": "DDS", "dvm": "DVM", "jd": "JD",
    "rn": "RN", "cpa": "CPA", "ceo": "CEO", "cto": "CTO", "cfo": "CFO",
    # German / Austrian academic
    "dipl": "Dipl", "ing": "Ing", "mag": "Mag", "habil": "Habil",
    "drmed": "Dr.med.", "drphil": "Dr.phil.", "drrernat": "Dr.rer.nat.",
    "msc": "MSc", "bsc": "BSc",
    # International degrees
    "llb": "LLB", "llm": "LLM",
}

# Roman numeral suffixes: preserved verbatim (already uppercase).
_NAME_ROMAN_RE = re.compile(r"^[IVX]+$")

# Titles that take a trailing period in their long form (``Mr.``).
_NAME_TITLES: set[str] = {"mr", "mrs", "ms", "miss", "dr", "prof", "sr", "jr"}
# Titles. Most languages strip the trailing period (``Mr.`` → ``Mr``);
# the dispatcher in _standardize_name_token does the strip.
_NAME_TITLES: set[str] = {
    # English
    "mr", "mrs", "ms", "miss", "dr", "prof", "sr", "jr", "sir", "madam",
    "rev", "hon",
    # German
    "herr", "frau", "fr", "hr",
    # French
    "m", "mme", "mlle", "mr",
    # Spanish
    "sr", "sra", "srta", "don", "doña", "dona",
    # Italian
    "sig", "sigra", "dott", "dottoressa",
    # Portuguese
    "snr", "snra",
}

# East Asian honorific suffixes, appended after the family name with a
# hyphen. Preserved verbatim (lowercase). Only Latin transliterations
# are listed here.
_EAST_ASIAN_HONORIFICS: set[str] = {
    "san", "sama", "kun", "chan", "sensei", "senpai", "kohai", "dono",
    "shi", "tan", "chin",
    # Korean
    "ssi", "nim",
}

# Suffixes that take a trailing period in their short form (``Jr.``).
_NAME_SUFFIXES: set[str] = {"jr", "sr", "esq"}
@@ -847,9 +1146,21 @@ def _standardize_name_token(tok: str, *, position: str, all_shouting: bool = Fal
    ):
        return tok.upper() + suffix_punct

    # Hyphenated segment: capitalize each piece.
    # Hyphenated segment: capitalize each piece. Special cases:
    #   - East Asian honorific suffix (``Tanaka-san``) stays lowercase.
    #   - Arabic transliterated prefix (``al-Rashid``, ``el-Sayed``)
    #     keeps the prefix lowercase per Arabic naming convention.
    if "-" in tok:
        return "-".join(_cap_segment(p) for p in tok.split("-")) + suffix_punct
        parts = tok.split("-")
        out_parts = []
        for j, p in enumerate(parts):
            if j > 0 and p.lower() in _EAST_ASIAN_HONORIFICS:
                out_parts.append(p.lower())
            elif j == 0 and p.lower() in {"al", "el", "an", "ad"}:
                out_parts.append(p.lower())
            else:
                out_parts.append(_cap_segment(p))
        return "-".join(out_parts) + suffix_punct

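The hyphen handling can be sketched on its own as below. The sets are small hypothetical subsets, and the plain capitalize step stands in for `_cap_segment`, whose behavior is not shown in this diff.

```python
# Hypothetical subsets for illustration.
HONORIFICS = {"san", "sama", "ssi", "nim"}
ARABIC_PREFIXES = {"al", "el"}


def case_hyphenated(tok: str) -> str:
    """Title-case each hyphen-separated piece, with two exceptions."""
    out = []
    for j, piece in enumerate(tok.split("-")):
        low = piece.lower()
        if j > 0 and low in HONORIFICS:
            out.append(low)  # trailing honorific stays lowercase
        elif j == 0 and low in ARABIC_PREFIXES:
            out.append(low)  # leading Arabic article stays lowercase
        else:
            # Stand-in for _cap_segment: first letter upper, rest lower.
            out.append(piece[:1].upper() + piece[1:].lower())
    return "-".join(out)
```

The position checks matter: an honorific is only recognized after the first segment, and a prefix only in the first segment, so `San-Martin` would still be capitalized normally.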
    # Mc / Mac prefix: inner cap.
    if lowered.startswith("mc") and len(lowered) > 2:
@@ -892,6 +1203,7 @@ def standardize_name(
    case: NameCase = "title",
    conservative: bool = False,
    reverse_comma_format: bool = True,
    family_first: bool = False,
) -> tuple[str, bool]:
    """Apply name-friendly casing with prefix / particle / suffix awareness.

@@ -899,7 +1211,10 @@ def standardize_name(
    * Mc / Mac inner caps (``mcdonald`` → ``McDonald``).
    * O'/D' inner caps (``o'connor`` → ``O'Connor``).
    * Hyphenated segments (``mary-jane`` → ``Mary-Jane``).
    * Particles stay lowercase mid-name (``van Gogh``, ``de Gaulle``).
    * Particles stay lowercase mid-name (``van Gogh``, ``de Gaulle``,
      ``bin Salman``, ``ben Avraham``).
    * East Asian honorific suffixes (``Tanaka-san``, ``Lee-ssi``)
      preserved lowercase after the hyphen.
    * Title / suffix periods stripped (``Mr.`` → ``Mr``, ``Jr.`` → ``Jr``).
    * Roman numeral suffixes preserved (``III``).
    * PhD / MD / Esq style acronyms preserved.
@@ -912,6 +1227,11 @@ def standardize_name(
    ``reverse_comma_format`` flips ``Last, First`` to ``First Last``
    (default per corpus § 7.3).

    ``family_first=True`` skips comma reversal and disables Western
    title detection; appropriate for East Asian columns where the
    family name comes first natively (``Kim Min-jae``, ``田中 太郎``).
    Set this per-column when you know the cultural convention.

    ``"upper"`` / ``"lower"`` are simple case conversions.
    """
    if not value or not isinstance(value, str):
@@ -940,7 +1260,9 @@ def standardize_name(
        return value, False

    # Comma-format reversal: "Smith, John Andrew" → "John Andrew Smith".
    if reverse_comma_format and "," in s:
    # Skipped under family_first because East Asian conventions write
    # the family name first natively; reversing would corrupt them.
    if reverse_comma_format and not family_first and "," in s:
        parts = [p.strip() for p in s.split(",", 1)]
        if len(parts) == 2 and parts[0] and parts[1]:
            s = f"{parts[1]} {parts[0]}"
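The effect of the new guard can be seen with a stripped-down sketch of the reversal step. `reorder_name` is a hypothetical helper that isolates only this branch; casing and the rest of `standardize_name` are left out.

```python
def reorder_name(s: str, reverse_comma_format: bool = True,
                 family_first: bool = False) -> str:
    """Flip 'Last, First' to 'First Last' unless family_first is set."""
    if reverse_comma_format and not family_first and "," in s:
        parts = [p.strip() for p in s.split(",", 1)]
        # Only reverse when both sides of the comma are non-empty.
        if len(parts) == 2 and parts[0] and parts[1]:
            return f"{parts[1]} {parts[0]}"
    return s
```

With `family_first=True` the comma form is returned untouched, which is the behavior the commit message describes for `Kim, Min-jae`.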
@@ -976,6 +1298,11 @@ from ._constants import (
    USPS_COMPRESSIONS as _ADDRESS_COMPRESSIONS,
    US_STATE_CODES as _US_STATE_CODES_SHARED,
    US_STATE_NAMES as _US_STATE_NAMES_SHARED,
    CA_PROVINCE_CODES, CA_PROVINCE_NAMES,
    AU_STATE_CODES, AU_STATE_NAMES,
    DE_STATE_CODES, DE_STATE_NAMES,
    POSTAL_PATTERNS,
    INTL_PO_BOX_PATTERNS,
)

# Short tokens that look like directions but only mean a direction at the
@@ -992,31 +1319,62 @@ _TOKEN_RE = re.compile(r"\w+|[^\w\s]+|\s+")
_US_STATE_CODES = _US_STATE_CODES_SHARED
_US_STATE_NAMES = _US_STATE_NAMES_SHARED

# Precompiled (pattern, code) list for the state-name → 2-letter
# conversion. Sorted longest-first so ``new york`` matches before ``new``.
_STATE_NAME_PATTERNS: list[tuple[re.Pattern[str], str]] = [
    (
        re.compile(
            rf"(,\s*){re.escape(full)}(\s+\d{{5}}(?:-\d{{4}})?)",
            re.IGNORECASE,
        ),
        code,
    )
    for full, code in sorted(_US_STATE_NAMES.items(), key=lambda kv: -len(kv[0]))
]
# Per-country (full-name, code, postal-pattern) tables. Each yields a
# precompiled regex matching ``, <state name> <postal>``. Sorted
# longest-first so multi-word names win over their prefixes.
def _build_state_patterns(
    name_to_code: dict[str, str], postal_pattern: str,
) -> list[tuple[re.Pattern[str], str]]:
    return [
        (
            re.compile(
                rf"(,\s*){re.escape(full)}(\s+{postal_pattern})",
                re.IGNORECASE,
            ),
            code,
        )
        for full, code in sorted(name_to_code.items(), key=lambda kv: -len(kv[0]))
    ]

# PO Box variants normalize to a single canonical form.

_STATE_NAME_PATTERNS: list[tuple[re.Pattern[str], str]] = _build_state_patterns(
    _US_STATE_NAMES, r"\d{5}(?:-\d{4})?",
)
_CA_PROVINCE_PATTERNS: list[tuple[re.Pattern[str], str]] = _build_state_patterns(
    CA_PROVINCE_NAMES, r"[A-Z]\d[A-Z]\s*\d[A-Z]\d",
)
_AU_STATE_PATTERNS: list[tuple[re.Pattern[str], str]] = _build_state_patterns(
    AU_STATE_NAMES, r"\d{4}",
)
_DE_STATE_PATTERNS: list[tuple[re.Pattern[str], str]] = _build_state_patterns(
    DE_STATE_NAMES, r"\d{5}",
)

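A short sketch shows how one of these per-country pattern lists rewrites the state slot. The two-entry `CA_NAMES` mapping and the `names_to_codes` helper are assumptions for this example; the real tables come from `_constants`.

```python
import re


def build_state_patterns(name_to_code, postal_pattern):
    """Compile ', <name> <postal>' patterns, longest names first."""
    return [
        (
            re.compile(
                rf"(,\s*){re.escape(full)}(\s+{postal_pattern})",
                re.IGNORECASE,
            ),
            code,
        )
        for full, code in sorted(name_to_code.items(), key=lambda kv: -len(kv[0]))
    ]


# Hypothetical subset of the Canadian province table.
CA_NAMES = {"ontario": "ON", "british columbia": "BC"}
CA_PATTERNS = build_state_patterns(CA_NAMES, r"[A-Z]\d[A-Z]\s*\d[A-Z]\d")


def names_to_codes(s, patterns):
    """Replace a full name in the state slot with its code."""
    for pat, code in patterns:
        # Keep the comma (group 1) and the postal code (group 2) as-is.
        s = pat.sub(rf"\g<1>{code}\g<2>", s)
    return s
```

Requiring the postal code right after the name is what keeps a city that shares a province or state name from being abbreviated.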
# PO Box variants normalize to a single canonical form. Combines the
# English pattern with the international locale variants registered in
# _constants.INTL_PO_BOX_PATTERNS.
_PO_BOX_RE = re.compile(
    r"\b(?:p\.?\s*o\.?\s*box|post\s+office\s+box)\b",
    r"\b(?:" + "|".join(INTL_PO_BOX_PATTERNS.values()) + r")\b",
    re.IGNORECASE,
)

# US ZIP at end of line (or before a trailing comma): used to detect
# whether an address is US-shaped before applying US-only normalizations.
_US_ZIP_TAIL_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")
# Canadian postal pattern (``M5E 1W7``): Canada-specific addresses get
# US-style street-type compression but not US ZIP / state handling.
_CANADA_POSTAL_RE = re.compile(r"\b[A-Z]\d[A-Z]\s*\d[A-Z]\d\b")
# Country-shape postal patterns (precompiled). Used to detect which
# country-specific normalization to apply (state-code preservation,
# street-suffix dictionary, etc.).
_POSTAL_REGEXES: dict[str, re.Pattern[str]] = {
    cc: re.compile(pat) for cc, pat in POSTAL_PATTERNS.items()
}
# Back-compat aliases for sites that already reference these names.
_US_ZIP_TAIL_RE = _POSTAL_REGEXES["us"]
_CANADA_POSTAL_RE = _POSTAL_REGEXES["ca"]
_UK_POSTCODE_RE = _POSTAL_REGEXES["uk"]

# Combined state-code set: US + Canada + Australia + Germany. The
# state-code-position check preserves any of these when found in the
# slot between a comma and the postal code.
_INTL_STATE_CODES: frozenset[str] = (
    _US_STATE_CODES_SHARED | CA_PROVINCE_CODES | AU_STATE_CODES | DE_STATE_CODES
)


def _is_state_code_position(tokens: list[str], idx: int) -> bool:
@@ -1033,14 +1391,19 @@ def _is_state_code_position(tokens: list[str], idx: int) -> bool:
        j -= 1
    if j < 0 or tokens[j] != ",":
        return False
    # Look ahead for a ZIP-shaped token (5 digits, optionally +4).
    # Look ahead for a postal-shaped token. Accepts US ZIP (5 digits +
    # optional +4), Australian (4 digits), Canadian first half (single
    # letter + digit + letter), and the start of a UK outward code.
    j = idx + 1
    while j < len(tokens) and tokens[j].isspace():
        j += 1
    if j >= len(tokens):
        return True  # tail of line, after a comma: accept
    nxt = tokens[j]
    return bool(re.match(r"\d{5}(?:-\d{4})?$", nxt))
    return bool(re.match(
        r"\d{4,5}(?:-\d{4})?$|^[A-Z]\d[A-Z]$|^[A-Z]{1,2}\d",
        nxt, re.IGNORECASE,
    ))
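To see which tokens the broadened look-ahead accepts, the alternation can be run against a few sample tokens. `POSTAL_SHAPE_RE` and `looks_postal` are names invented for this sketch; the pattern itself is copied from the hunk above.

```python
import re

# Same alternation as the look-ahead above: US ZIP or AU 4-digit,
# Canadian first half, or the start of a UK outward code.
POSTAL_SHAPE_RE = re.compile(
    r"\d{4,5}(?:-\d{4})?$|^[A-Z]\d[A-Z]$|^[A-Z]{1,2}\d",
    re.IGNORECASE,
)


def looks_postal(token: str) -> bool:
    """True when the token starts like one of the supported postal shapes."""
    return bool(POSTAL_SHAPE_RE.match(token))
```

The last alternative is deliberately loose: any token beginning with one or two letters followed by a digit passes, so it also covers the Canadian shape.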


def standardize_address(
@@ -1096,14 +1459,44 @@ def standardize_address(
    s = _PO_BOX_RE.sub("PO Box", s)

    is_us_shaped = bool(_US_ZIP_TAIL_RE.search(s))
    is_ca_shaped = bool(_CANADA_POSTAL_RE.search(s))
    is_uk_shaped = bool(_UK_POSTCODE_RE.search(s))
    # German postal is just 5 digits, the same shape as a US ZIP, so we
    # treat the address as DE only when a Bundesland name or code sits
    # in the state slot right before those 5 digits.
    is_de_shaped = (
        is_us_shaped and any(
            re.search(rf",\s*{re.escape(name)}\s+\d{{5}}", s, re.IGNORECASE)
            or re.search(rf",\s*{re.escape(code)}\s+\d{{5}}", s, re.IGNORECASE)
            for name, code in DE_STATE_NAMES.items()
        )
    )
    # AU detection: 4-digit postal at tail AND a known AU state code or
    # full-name substring is present somewhere in the address.
    _au_state_words = "|".join(
        list(AU_STATE_CODES) + [re.escape(n) for n in AU_STATE_NAMES]
    )
    is_au_shaped = bool(
        re.search(r"\b\d{4}\b\s*$", s.rstrip(","))
        and re.search(rf"\b(?:{_au_state_words})\b", s, re.IGNORECASE)
    )

    if state_to_code and is_us_shaped:
        # Only convert state names in the *state slot* (between a comma
        # and a US ZIP) so the city ``New York`` in ``…, New York, NY
        # 10001`` is not shortened to ``NY``. Patterns are precompiled
        # at module load.
        for pat, code in _STATE_NAME_PATTERNS:
            s = pat.sub(rf"\g<1>{code}\g<2>", s)
    if state_to_code:
        # State-name → code conversion. Each country's pattern only
        # fires when its own postal-code shape is detected, so US
        # "New York" before "NY 10001" is left alone (it's a city), and
        # Canadian "Ontario" before "M5E 1W7" becomes "ON".
        if is_us_shaped:
            for pat, code in _STATE_NAME_PATTERNS:
                s = pat.sub(rf"\g<1>{code}\g<2>", s)
        if is_ca_shaped:
            for pat, code in _CA_PROVINCE_PATTERNS:
                s = pat.sub(rf"\g<1>{code}\g<2>", s)
        if is_au_shaped:
            for pat, code in _AU_STATE_PATTERNS:
                s = pat.sub(rf"\g<1>{code}\g<2>", s)
        if is_de_shaped:
            for pat, code in _DE_STATE_PATTERNS:
                s = pat.sub(rf"\g<1>{code}\g<2>", s)

    if not expand:
        # Compression direction is only safe for US-shaped addresses.
@@ -1159,7 +1552,7 @@ def standardize_address(
        # State code preservation: if this token is a 2-letter state code
        # in a state-code position, preserve it as uppercase regardless
        # of input case or abbreviation table collisions.
        if upper_form in _US_STATE_CODES and _is_state_code_position(tokens, i):
        if upper_form in _INTL_STATE_CODES and _is_state_code_position(tokens, i):
            out_tokens.append(upper_form)
            continue

@@ -1193,7 +1586,7 @@ def _restore_state_codes(s: str) -> str:
    """Force-uppercase 2-letter state codes following a comma."""
    def repl(m: re.Match) -> str:
        candidate = m.group(2).upper()
        if candidate in _US_STATE_CODES:
        if candidate in _INTL_STATE_CODES:
            return f"{m.group(1)}{candidate}{m.group(3)}"
        return m.group(0)

@@ -1221,6 +1614,10 @@ _EMAIL_ANGLE_RE = re.compile(r"<([^<>]+)>")
_MAILTO_PREFIX_RE = re.compile(r"^mailto:", re.IGNORECASE)
# Smart-quote wrapping the whole address.
_EMAIL_SMARTQUOTE_RE = re.compile(r"^[“”‘’]+|[“”‘’]+$")
# Bidirectional control characters used in homograph / spoofing attacks
# against email addresses (with an override embedded, ``alice@example.com``
# can display as ``alice@elpmaxe.com`` in RTL-aware renderers). Strip on
# every parse. Ranges: U+202A-U+202E and U+2066-U+2069.
_EMAIL_BIDI_RE = re.compile(r"[\u202A-\u202E\u2066-\u2069]")
# Multi-email cell separator.
_EMAIL_MULTI_RE = re.compile(r"[,;]\s*\S+@\S+\.\S+")


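A minimal sketch of the stripping step, using the two code-point ranges named in the commit message. `BIDI_CONTROLS` and `strip_bidi` are illustrative names, not the module's.

```python
import re

# U+202A-U+202E: embeddings and overrides; U+2066-U+2069: isolates.
BIDI_CONTROLS = re.compile("[\u202a-\u202e\u2066-\u2069]")


def strip_bidi(s: str) -> str:
    """Drop bidirectional control characters from a string."""
    return BIDI_CONTROLS.sub("", s)


# An address carrying an invisible right-to-left override (U+202E).
spoofed = "alice@\u202eexample.com"
cleaned = strip_bidi(spoofed)
```

The control characters are invisible in most renderers, so comparing `len(spoofed)` with `len(cleaned)` is a quick way to confirm that something was removed.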
@@ -1260,6 +1657,9 @@ def standardize_email(

    # Smart-quote wrappers (``"alice@example.com"``).
    s = _EMAIL_SMARTQUOTE_RE.sub("", s).strip()
    # Strip BIDI / RTL override controls; these are a homograph attack
    # vector and have no legitimate use inside an email address.
    s = _EMAIL_BIDI_RE.sub("", s)

    # Display-name with angle brackets: extract the address.
    m = _EMAIL_ANGLE_RE.search(s)
@@ -1503,6 +1903,7 @@ class StandardizeOptions:
    # Name policy
    name_conservative: bool = False
    name_reverse_comma_format: bool = True
    name_family_first: bool = False  # set per-column for East Asian data

    # User overrides for the address abbreviation table. Merged on top of
    # the built-in USPS Pub. 28 list at runtime; values flow through
@@ -1691,6 +2092,7 @@ def _apply_field_type(
            case=options.name_case,
            conservative=options.name_conservative,
            reverse_comma_format=options.name_reverse_comma_format,
            family_first=options.name_family_first,
        )
    elif field_type == FieldType.ADDRESS:
        new, changed = standardize_address(