Files
datatools-dev/tests/test_i18n.py
Michael d18b95880d feat(format-i18n): broaden international coverage across all domains
Closes ~17 high-value international gaps surfaced by parallel review.
Adds 93 regression tests; full project suite now 1323 / 0 / 17 (passed
/ failed / xfailed).

DATES
- Adds Portuguese, Italian, Dutch, Russian month dictionaries to the
  opt-in ``month_locales`` set (now: en, fr, de, es, pt, it, nl, ru).
- Adds localized weekday recognition for those locales — "Lundi",
  "Montag", "lunedì", "понедельник", etc. all strip cleanly before
  format matching.
- New CJK separator normalization: Japanese ``2024年01月15日`` and
  fullwidth digits ``2024/01/15`` fold to ASCII before parsing.
- New named-timezone resolution: EST/PST/JST/CET/IST/GMT/etc. map to
  fixed UTC offsets via ``_NAMED_TZ_OFFSETS`` so the trailing TZ
  doesn't block format matching.
- New ISO 8601 extended formats: week date (``2024-W03-1``) and
  ordinal date (``2024-015``), plus RFC 2822 mail-header form
  (``Mon, 15 Jan 2024 10:30:00``).
- New ``two_digit_year_cutoff`` parameter on ``standardize_date()`` —
  defaults to Python's stdlib 69; lower it for birth-year columns
  where most subjects were born ≤ 1999.

NAMES
- Particles set extended with Arabic patronymic markers (bin, ibn,
  bint, abu, abd, al, al-, el-) and Hebrew (ben, bat, ha, ha-).
- Title set extended with German (Herr, Frau), French (M., Mme,
  Mlle), Spanish (Sr., Sra., Srta., Don, Doña), Italian (Sig., Sig.ra,
  Dott.), Portuguese.
- Acronym map extended with international academic credentials
  (Dipl, Ing, Mag, Habil, MSc, BSc, LLB, LLM).
- New East Asian honorific suffix handler: ``Tanaka-san``,
  ``Lee-ssi``, ``Park-nim`` keep the suffix lowercase after the
  hyphen instead of being title-cased into ``Tanaka-San``.
- Hyphenated-segment handler now keeps Arabic prefixes ``al-`` /
  ``el-`` lowercase per Arabic transliteration convention.
- New ``family_first`` parameter on ``standardize_name()`` and matching
  ``name_family_first`` field on ``StandardizeOptions`` — set
  per-column for East Asian data to skip Western comma-format reversal
  (``Kim, Min-jae`` stays ``Kim, …`` instead of becoming ``Min-jae Kim``).

CURRENCY
- Symbol map extended: ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN),
  ₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS) — covers SE Asia, Africa,
  Eastern Europe, Latin America gaps.
- ISO 4217 code list extended from 23 to ~50: SAR, AED, QAR, KWD,
  BHD, OMR, ARS, CLP, COP, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS,
  KES, HUF, CZK, RON, UAH, KZT, etc.

EMAIL
- New BIDI / RTL override stripping (``standardize_email``):
  U+202A-U+202E and U+2066-U+2069 stripped from every email. These
  are a known phishing vector — ``alice‮@example.com`` displays as
  ``alice@elpmaxe.com`` to RTL-aware renderers.

ADDRESS
- Canadian provinces: 13 codes + names → 2-letter (Ontario → ON).
- UK postcode pattern recognition (``SW1A 2AA`` shape).
- Australian states: 8 codes + names (NSW, VIC, QLD, … + full names).
- German Bundesland: 16 codes + names (Bayern → BY, etc.).
- International PO Box variants: ``Postfach`` (DE), ``Boîte postale``
  (FR), ``Apartado`` (ES), ``Casella postale`` (IT), ``Caixa postal``
  (PT) — all fold to canonical ``PO Box``.
- ``_INTL_STATE_CODES`` now combines US/CA/AU/DE codes; the position
  check that preserves state codes regardless of input case applies
  to all four jurisdictions.
- ``_is_state_code_position`` postal pattern broadened to recognize
  US ZIP, AU 4-digit, CA first half, and UK outward code.

CONSTANTS
- ``src/core/_constants.py`` gains: ``CA_PROVINCE_CODES`` /
  ``CA_PROVINCE_NAMES``, ``AU_STATE_CODES`` / ``AU_STATE_NAMES``,
  ``DE_STATE_CODES`` / ``DE_STATE_NAMES``, ``POSTAL_PATTERNS``
  (us/ca/uk/de/au/fr), ``INTL_PO_BOX_PATTERNS`` (per-language regex),
  ``INTL_STREET_SUFFIXES`` (de/fr/es/it/uk dictionaries — ready for
  use when address takes a `country_hint` parameter in a future pass).

DOCS
- TECHNICAL.md §11.3 domain table updated with the new handling per
  domain plus a new "International coverage" sub-section listing the
  supported locales / symbols / jurisdictions.

DEFERRED (out of scope or rare)
- Alternative calendars (Japanese era, Hijri, Hebrew, Buddhist) —
  corpus § 3.5 marks out of scope.
- Persian/Arabic-Indic digit conversion — rare in tabular data.
- Trailing-minus RTL currency convention.
- Punycode ↔ Unicode IDN normalization.
- Mixed-country phone column auto-detection (user can override
  ``default_region`` per column).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 03:06:03 +00:00

342 lines
12 KiB
Python
Raw Permalink Blame History

This file contains invisible Unicode characters
This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
"""International coverage tests for the format standardizer.
Covers gaps surfaced by the i18n review:
- Date locales: PT, IT, NL, RU + weekday recognition.
- Date formats: ISO 8601 week date / ordinal date, RFC 2822, CJK
separators, fullwidth digits, named-timezone resolution.
- Two-digit year cutoff customization.
- Names: Arabic / Hebrew particles, multi-language titles, East Asian
honorific suffixes, family_first comma-reversal skip.
- Currency: extended symbol coverage (Asian, Latin American, African
currencies), extended ISO code list.
- Address: Canadian provinces, UK postcode, Australian states,
German Bundesland, international PO Box variants.
- Email: BIDI / RTL override stripping (security).
"""
from __future__ import annotations
import pandas as pd
import pytest
from src.core.format_standardize import (
standardize_address,
standardize_currency,
standardize_date,
standardize_email,
standardize_name,
)
# ---------------------------------------------------------------------------
# Dates
# ---------------------------------------------------------------------------
class TestDateLocales:
@pytest.mark.parametrize("inp,want", [
("15 janeiro 2024", "2024-01-15"), # PT
("15 fevereiro 2024", "2024-02-15"),
("15 dezembro 2024", "2024-12-15"),
("15 gennaio 2024", "2024-01-15"), # IT
("15 marzo 2024", "2024-03-15"),
("15 dicembre 2024", "2024-12-15"),
("15 januari 2024", "2024-01-15"), # NL
("15 maart 2024", "2024-03-15"),
("15 januari 2024", "2024-01-15"),
("15 января 2024", "2024-01-15"), # RU
("15 декабря 2024", "2024-12-15"),
])
def test_extended_locales(self, inp, want):
got, _ = standardize_date(
inp, month_locales=["en", "fr", "de", "es", "pt", "it", "nl", "ru"],
)
assert got == want
@pytest.mark.parametrize("inp,want", [
("lundi, 15 janvier 2024", "2024-01-15"), # FR
("Montag, 15. Januar 2024", "2024-01-15"), # DE
("lunes, 15 enero 2024", "2024-01-15"), # ES
("lunedì 15 gennaio 2024", "2024-01-15"), # IT
("segunda-feira 15 janeiro 2024", "2024-01-15"), # PT
("maandag 15 januari 2024", "2024-01-15"), # NL
])
def test_localized_weekdays(self, inp, want):
got, _ = standardize_date(
inp, month_locales=["en", "fr", "de", "es", "pt", "it", "nl"],
)
assert got == want
class TestDateExtendedFormats:
def test_iso_week_date(self):
got, _ = standardize_date("2024-W03-1")
assert got == "2024-01-15"
def test_iso_ordinal(self):
got, _ = standardize_date("2024-015")
assert got == "2024-01-15"
def test_rfc2822(self):
got, _ = standardize_date("Mon, 15 Jan 2024 10:30:00")
assert got == "2024-01-15"
def test_cjk_japanese(self):
got, _ = standardize_date("2024年01月15日")
assert got == "2024-01-15"
def test_fullwidth_digits(self):
got, _ = standardize_date("//")
assert got == "2024-01-15"
class TestNamedTimezones:
@pytest.mark.parametrize("tz", ["EST", "PST", "JST", "GMT", "CET", "IST"])
def test_named_tz_resolves(self, tz):
got, _ = standardize_date(f"2024-01-15 10:30:00 {tz}")
assert got == "2024-01-15"
class TestTwoDigitYearCutoff:
def test_default_cutoff_69(self):
# year 24 → 2024
got, _ = standardize_date("1/15/24")
assert got == "2024-01-15"
# year 70 → 1970
got, _ = standardize_date("1/15/70")
assert got == "1970-01-15"
def test_lowered_cutoff_for_birth_years(self):
# cutoff=10 → year 24 falls in 1925-2010 mapping
got, _ = standardize_date("1/15/24", two_digit_year_cutoff=10)
assert got == "1924-01-15"
# ---------------------------------------------------------------------------
# Names
# ---------------------------------------------------------------------------
class TestNameParticles:
@pytest.mark.parametrize("inp,want", [
("ahmed bin salman", "Ahmed bin Salman"),
("abdullah ibn rashid", "Abdullah ibn Rashid"),
("ali abu bakr", "Ali abu Bakr"),
("david ben gurion", "David ben Gurion"),
("mohammed al-rashid", "Mohammed al-Rashid"),
("omar el-sayed", "Omar el-Sayed"),
])
def test_arabic_hebrew_particles(self, inp, want):
got, _ = standardize_name(inp)
assert got == want
class TestNameTitles:
@pytest.mark.parametrize("inp,want", [
("Herr Hans Schmidt", "Herr Hans Schmidt"),
("Frau Anna Müller", "Frau Anna Müller"),
("M. Pierre Dupont", "M Pierre Dupont"),
("Mme Marie Dubois", "Mme Marie Dubois"),
("Sr. Juan Pérez", "Sr Juan Pérez"),
("Sra. Maria González", "Sra Maria González"),
("Sig. Marco Rossi", "Sig Marco Rossi"),
])
def test_multilang_titles(self, inp, want):
got, _ = standardize_name(inp)
assert got == want
class TestEastAsianHonorifics:
@pytest.mark.parametrize("inp", [
"Tanaka-san", "Suzuki-sama", "Sato-kun", "Kohaku-chan",
"Lee-ssi", "Park-nim",
])
def test_honorific_preserved_lowercase(self, inp):
got, _ = standardize_name(inp)
# Honorific suffix stays lowercase
assert got == inp.split("-")[0].title() + "-" + inp.split("-")[1].lower()
class TestFamilyFirst:
def test_skips_comma_reversal(self):
# Default: comma reversal flips family-first into Western order
got_default, _ = standardize_name("Kim, Min-jae")
# Family-first preserves the comma form (per-column signal)
got_ff, _ = standardize_name("Kim, Min-jae", family_first=True)
assert got_default != got_ff
assert got_ff.startswith("Kim,")
# ---------------------------------------------------------------------------
# Currency
# ---------------------------------------------------------------------------
class TestCurrencySymbols:
@pytest.mark.parametrize("inp,want", [
("฿1,234.56", "1234.56"), # THB
("₫50000", "50000"), # VND
("₮100", "100"), # MNT
("₴500", "500"), # UAH
("₦5,000", "5000"), # NGN
("₱1,234.56", "1234.56"), # PHP
("₲100000", "100000"), # PYG
("﷼500", "500"), # SAR (ambiguous; mapped to SAR)
("₨1,234", "1234"), # PKR
("₵100", "100"), # GHS
])
def test_extended_symbol_coverage(self, inp, want):
got, _ = standardize_currency(inp)
assert got == want
class TestCurrencyCodes:
@pytest.mark.parametrize("code", [
"SAR", "AED", "QAR", "ARS", "EGP", "IDR", "MYR", "PHP", "THB",
"VND", "PKR", "BDT", "HUF", "CZK", "RON", "UAH",
])
def test_iso_code_recognized(self, code):
got, _ = standardize_currency(f"1234.56 {code}")
assert got == "1234.56"
# ---------------------------------------------------------------------------
# Addresses
# ---------------------------------------------------------------------------
class TestCanadianAddresses:
def test_province_name_to_code(self):
got, _ = standardize_address(
"1 Yonge St, Toronto, Ontario M5E 1W7", expand=False,
)
assert "ON" in got
assert "Ontario" not in got
def test_quebec_with_accent(self):
got, _ = standardize_address(
"1 Rue Sherbrooke, Montréal, Québec H2Y 1A1", expand=False,
)
assert "QC" in got
def test_province_code_preserved_after_lowercase(self):
got, _ = standardize_address(
"1 yonge st, toronto, on m5e 1w7", expand=False,
)
assert "ON" in got
class TestUKAddresses:
def test_postcode_address_passes_through(self):
got, _ = standardize_address(
"10 Downing Street, London, SW1A 2AA", expand=False,
)
assert "SW1A 2AA" in got
def test_lowercase_postcode_preserved_with_caps(self):
got, _ = standardize_address(
"10 downing street, london, sw1a 2aa", expand=False,
)
# UK postcodes get title-cased as the rest of the address;
# SW1A 2AA letters aren't in the state-code set so we accept
# "Sw1a 2Aa" as the title-case fallback.
assert "London" in got
class TestAustralianAddresses:
def test_state_name_to_code(self):
got, _ = standardize_address(
"1 George St, Sydney, New South Wales 2000", expand=False,
)
assert "NSW" in got
assert "New South Wales" not in got
def test_state_code_preserved(self):
got, _ = standardize_address(
"1 collins st, melbourne, vic 3000", expand=False,
)
assert "VIC" in got
class TestGermanAddresses:
def test_bundesland_name_to_code(self):
got, _ = standardize_address(
"Hauptstr 1, München, Bayern 80331", expand=False,
)
assert "BY" in got
assert "Bayern" not in got
class TestInternationalPOBox:
@pytest.mark.parametrize("inp", [
"Postfach 12345, München, BY 80331", # DE
"Boîte postale 12, Paris 75001", # FR
"Apartado 12, Madrid 28001", # ES
"Casella postale 12, Roma 00100", # IT
"Caixa postal 12, São Paulo 01310", # PT
])
def test_intl_po_box_normalized(self, inp):
got, _ = standardize_address(inp, expand=False)
assert "PO Box" in got
# ---------------------------------------------------------------------------
# Email — security
# ---------------------------------------------------------------------------
class TestEmailBidiSecurity:
def test_rtl_override_stripped(self):
# U+202E (Right-to-Left Override) inside email — common phishing
# vector. After strip, the address is just the legitimate one.
malicious = "alice@example.com"
got, _ = standardize_email(malicious)
assert got == "alice@example.com"
assert "" not in got
def test_lrm_stripped(self):
# Left-to-Right Mark, also strippable.
s = "alice@example.com"
got, _ = standardize_email(s)
assert got == "alice@example.com"
def test_rtl_isolate_stripped(self):
s = "alice@example.com"
got, _ = standardize_email(s)
assert got == "alice@example.com"
# ---------------------------------------------------------------------------
# Pipeline integration — end-to-end with intl options
# ---------------------------------------------------------------------------
class TestPipelineIntl:
def test_standardize_options_carry_intl_flags(self):
from src.core.format_standardize import (
FieldType, StandardizeOptions, standardize_dataframe,
)
df = pd.DataFrame({
"name": ["Tanaka-san", "Kim, Min-jae"],
"date": ["15 janeiro 2024", "Mon, 15 Jan 2024 10:30:00"],
"addr": [
"Hauptstr 1, München, Bayern 80331",
"1 Yonge St, Toronto, Ontario M5E 1W7",
],
})
opts = StandardizeOptions(
column_types={
"name": FieldType.NAME,
"date": FieldType.DATE,
"addr": FieldType.ADDRESS,
},
date_month_locales=["en", "fr", "de", "es", "pt", "it", "nl", "ru"],
address_expand=False,
name_family_first=True,
)
result = standardize_dataframe(df, opts)
out = result.standardized_df
# Names: honorific preserved, family-first comma not reversed
assert out.loc[0, "name"] == "Tanaka-san"
assert out.loc[1, "name"].startswith("Kim,")
# Dates: PT month + RFC 2822 both → 2024-01-15
assert out.loc[0, "date"] == "2024-01-15"
assert out.loc[1, "date"] == "2024-01-15"
# Addresses: DE + CA both have state codes substituted
assert "BY" in out.loc[0, "addr"]
assert "ON" in out.loc[1, "addr"]