Closes ~17 high-value international gaps surfaced by parallel review. Adds 93 regression tests; full project suite now 1323 / 0 / 17 (passed / failed / xfailed). DATES - Adds Portuguese, Italian, Dutch, Russian month dictionaries to the opt-in ``month_locales`` set (now: en, fr, de, es, pt, it, nl, ru). - Adds localized weekday recognition for those locales — "Lundi", "Montag", "lunedì", "понедельник", etc. all strip cleanly before format matching. - New CJK separator normalization: Japanese ``2024年01月15日`` and fullwidth digits ``2024/01/15`` fold to ASCII before parsing. - New named-timezone resolution: EST/PST/JST/CET/IST/GMT/etc. map to fixed UTC offsets via ``_NAMED_TZ_OFFSETS`` so the trailing TZ doesn't block format matching. - New ISO 8601 extended formats: week date (``2024-W03-1``) and ordinal date (``2024-015``), plus RFC 2822 mail-header form (``Mon, 15 Jan 2024 10:30:00``). - New ``two_digit_year_cutoff`` parameter on ``standardize_date()`` — defaults to Python's stdlib 69; lower it for birth-year columns where most subjects were born ≤ 1999. NAMES - Particles set extended with Arabic patronymic markers (bin, ibn, bint, abu, abd, al, al-, el-) and Hebrew (ben, bat, ha, ha-). - Title set extended with German (Herr, Frau), French (M., Mme, Mlle), Spanish (Sr., Sra., Srta., Don, Doña), Italian (Sig., Sig.ra, Dott.), Portuguese. - Acronym map extended with international academic credentials (Dipl, Ing, Mag, Habil, MSc, BSc, LLB, LLM). - New East Asian honorific suffix handler: ``Tanaka-san``, ``Lee-ssi``, ``Park-nim`` keep the suffix lowercase after the hyphen instead of being title-cased into ``Tanaka-San``. - Hyphenated-segment handler now keeps Arabic prefixes ``al-`` / ``el-`` lowercase per Arabic transliteration convention. - New ``family_first`` parameter on ``standardize_name()`` and matching ``name_family_first`` field on ``StandardizeOptions`` — set per-column for East Asian data to skip Western comma-format reversal (``Kim, Min-jae`` stays ``Kim, …`` instead of becoming ``Min-jae Kim``). CURRENCY - Symbol map extended: ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN), ₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS) — covers SE Asia, Africa, Eastern Europe, Latin America gaps. - ISO 4217 code list extended from 23 to ~50: SAR, AED, QAR, KWD, BHD, OMR, ARS, CLP, COP, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS, KES, HUF, CZK, RON, UAH, KZT, etc. EMAIL - New BIDI / RTL override stripping (``standardize_email``): U+202A-U+202E and U+2066-U+2069 stripped from every email. These are a known phishing vector — ``alice@example.com`` displays as ``alice@elpmaxe.com`` to RTL-aware renderers. ADDRESS - Canadian provinces: 13 codes + names → 2-letter (Ontario → ON). - UK postcode pattern recognition (``SW1A 2AA`` shape). - Australian states: 8 codes + names (NSW, VIC, QLD, … + full names). - German Bundesland: 16 codes + names (Bayern → BY, etc.). - International PO Box variants: ``Postfach`` (DE), ``Boîte postale`` (FR), ``Apartado`` (ES), ``Casella postale`` (IT), ``Caixa postal`` (PT) — all fold to canonical ``PO Box``. - ``_INTL_STATE_CODES`` now combines US/CA/AU/DE codes; the position check that preserves state codes regardless of input case applies to all four jurisdictions. - ``_is_state_code_position`` postal pattern broadened to recognize US ZIP, AU 4-digit, CA first half, and UK outward code. CONSTANTS - ``src/core/_constants.py`` gains: ``CA_PROVINCE_CODES`` / ``CA_PROVINCE_NAMES``, ``AU_STATE_CODES`` / ``AU_STATE_NAMES``, ``DE_STATE_CODES`` / ``DE_STATE_NAMES``, ``POSTAL_PATTERNS`` (us/ca/uk/de/au/fr), ``INTL_PO_BOX_PATTERNS`` (per-language regex), ``INTL_STREET_SUFFIXES`` (de/fr/es/it/uk dictionaries — ready for use when address takes a `country_hint` parameter in a future pass). DOCS - TECHNICAL.md §11.3 domain table updated with the new handling per domain plus a new "International coverage" sub-section listing the supported locales / symbols / jurisdictions. DEFERRED (out of scope or rare) - Alternative calendars (Japanese era, Hijri, Hebrew, Buddhist) — corpus § 3.5 marks out of scope. - Persian/Arabic-Indic digit conversion — rare in tabular data. - Trailing-minus RTL currency convention. - Punycode ↔ Unicode IDN normalization. - Mixed-country phone column auto-detection (user can override ``default_region`` per column). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
342 lines
12 KiB
Python
342 lines
12 KiB
Python
"""International coverage tests for the format standardizer.
|
||
|
||
Covers gaps surfaced by the i18n review:
|
||
- Date locales: PT, IT, NL, RU + weekday recognition.
|
||
- Date formats: ISO 8601 week date / ordinal date, RFC 2822, CJK
|
||
separators, fullwidth digits, named-timezone resolution.
|
||
- Two-digit year cutoff customization.
|
||
- Names: Arabic / Hebrew particles, multi-language titles, East Asian
|
||
honorific suffixes, family_first comma-reversal skip.
|
||
- Currency: extended symbol coverage (Asian, Latin American, African
|
||
currencies), extended ISO code list.
|
||
- Address: Canadian provinces, UK postcode, Australian states,
|
||
German Bundesland, international PO Box variants.
|
||
- Email: BIDI / RTL override stripping (security).
|
||
"""
|
||
|
||
from __future__ import annotations
|
||
|
||
import pandas as pd
|
||
import pytest
|
||
|
||
from src.core.format_standardize import (
|
||
standardize_address,
|
||
standardize_currency,
|
||
standardize_date,
|
||
standardize_email,
|
||
standardize_name,
|
||
)
|
||
|
||
|
||
# ---------------------------------------------------------------------------
|
||
# Dates
|
||
# ---------------------------------------------------------------------------
|
||
|
||
class TestDateLocales:
|
||
@pytest.mark.parametrize("inp,want", [
|
||
("15 janeiro 2024", "2024-01-15"), # PT
|
||
("15 fevereiro 2024", "2024-02-15"),
|
||
("15 dezembro 2024", "2024-12-15"),
|
||
("15 gennaio 2024", "2024-01-15"), # IT
|
||
("15 marzo 2024", "2024-03-15"),
|
||
("15 dicembre 2024", "2024-12-15"),
|
||
("15 januari 2024", "2024-01-15"), # NL
|
||
("15 maart 2024", "2024-03-15"),
|
||
("15 januari 2024", "2024-01-15"),
|
||
("15 января 2024", "2024-01-15"), # RU
|
||
("15 декабря 2024", "2024-12-15"),
|
||
])
|
||
def test_extended_locales(self, inp, want):
|
||
got, _ = standardize_date(
|
||
inp, month_locales=["en", "fr", "de", "es", "pt", "it", "nl", "ru"],
|
||
)
|
||
assert got == want
|
||
|
||
@pytest.mark.parametrize("inp,want", [
|
||
("lundi, 15 janvier 2024", "2024-01-15"), # FR
|
||
("Montag, 15. Januar 2024", "2024-01-15"), # DE
|
||
("lunes, 15 enero 2024", "2024-01-15"), # ES
|
||
("lunedì 15 gennaio 2024", "2024-01-15"), # IT
|
||
("segunda-feira 15 janeiro 2024", "2024-01-15"), # PT
|
||
("maandag 15 januari 2024", "2024-01-15"), # NL
|
||
])
|
||
def test_localized_weekdays(self, inp, want):
|
||
got, _ = standardize_date(
|
||
inp, month_locales=["en", "fr", "de", "es", "pt", "it", "nl"],
|
||
)
|
||
assert got == want
|
||
|
||
|
||
class TestDateExtendedFormats:
|
||
def test_iso_week_date(self):
|
||
got, _ = standardize_date("2024-W03-1")
|
||
assert got == "2024-01-15"
|
||
|
||
def test_iso_ordinal(self):
|
||
got, _ = standardize_date("2024-015")
|
||
assert got == "2024-01-15"
|
||
|
||
def test_rfc2822(self):
|
||
got, _ = standardize_date("Mon, 15 Jan 2024 10:30:00")
|
||
assert got == "2024-01-15"
|
||
|
||
def test_cjk_japanese(self):
|
||
got, _ = standardize_date("2024年01月15日")
|
||
assert got == "2024-01-15"
|
||
|
||
def test_fullwidth_digits(self):
|
||
got, _ = standardize_date("2024/01/15")
|
||
assert got == "2024-01-15"
|
||
|
||
|
||
class TestNamedTimezones:
|
||
@pytest.mark.parametrize("tz", ["EST", "PST", "JST", "GMT", "CET", "IST"])
|
||
def test_named_tz_resolves(self, tz):
|
||
got, _ = standardize_date(f"2024-01-15 10:30:00 {tz}")
|
||
assert got == "2024-01-15"
|
||
|
||
|
||
class TestTwoDigitYearCutoff:
|
||
def test_default_cutoff_69(self):
|
||
# year 24 → 2024
|
||
got, _ = standardize_date("1/15/24")
|
||
assert got == "2024-01-15"
|
||
# year 70 → 1970
|
||
got, _ = standardize_date("1/15/70")
|
||
assert got == "1970-01-15"
|
||
|
||
def test_lowered_cutoff_for_birth_years(self):
|
||
# cutoff=10 → year 24 falls in 1925-2010 mapping
|
||
got, _ = standardize_date("1/15/24", two_digit_year_cutoff=10)
|
||
assert got == "1924-01-15"
|
||
|
||
|
||
# ---------------------------------------------------------------------------
|
||
# Names
|
||
# ---------------------------------------------------------------------------
|
||
|
||
class TestNameParticles:
|
||
@pytest.mark.parametrize("inp,want", [
|
||
("ahmed bin salman", "Ahmed bin Salman"),
|
||
("abdullah ibn rashid", "Abdullah ibn Rashid"),
|
||
("ali abu bakr", "Ali abu Bakr"),
|
||
("david ben gurion", "David ben Gurion"),
|
||
("mohammed al-rashid", "Mohammed al-Rashid"),
|
||
("omar el-sayed", "Omar el-Sayed"),
|
||
])
|
||
def test_arabic_hebrew_particles(self, inp, want):
|
||
got, _ = standardize_name(inp)
|
||
assert got == want
|
||
|
||
|
||
class TestNameTitles:
|
||
@pytest.mark.parametrize("inp,want", [
|
||
("Herr Hans Schmidt", "Herr Hans Schmidt"),
|
||
("Frau Anna Müller", "Frau Anna Müller"),
|
||
("M. Pierre Dupont", "M Pierre Dupont"),
|
||
("Mme Marie Dubois", "Mme Marie Dubois"),
|
||
("Sr. Juan Pérez", "Sr Juan Pérez"),
|
||
("Sra. Maria González", "Sra Maria González"),
|
||
("Sig. Marco Rossi", "Sig Marco Rossi"),
|
||
])
|
||
def test_multilang_titles(self, inp, want):
|
||
got, _ = standardize_name(inp)
|
||
assert got == want
|
||
|
||
|
||
class TestEastAsianHonorifics:
|
||
@pytest.mark.parametrize("inp", [
|
||
"Tanaka-san", "Suzuki-sama", "Sato-kun", "Kohaku-chan",
|
||
"Lee-ssi", "Park-nim",
|
||
])
|
||
def test_honorific_preserved_lowercase(self, inp):
|
||
got, _ = standardize_name(inp)
|
||
# Honorific suffix stays lowercase
|
||
assert got == inp.split("-")[0].title() + "-" + inp.split("-")[1].lower()
|
||
|
||
|
||
class TestFamilyFirst:
|
||
def test_skips_comma_reversal(self):
|
||
# Default: comma reversal flips family-first into Western order
|
||
got_default, _ = standardize_name("Kim, Min-jae")
|
||
# Family-first preserves the comma form (per-column signal)
|
||
got_ff, _ = standardize_name("Kim, Min-jae", family_first=True)
|
||
assert got_default != got_ff
|
||
assert got_ff.startswith("Kim,")
|
||
|
||
|
||
# ---------------------------------------------------------------------------
|
||
# Currency
|
||
# ---------------------------------------------------------------------------
|
||
|
||
class TestCurrencySymbols:
|
||
@pytest.mark.parametrize("inp,want", [
|
||
("฿1,234.56", "1234.56"), # THB
|
||
("₫50000", "50000"), # VND
|
||
("₮100", "100"), # MNT
|
||
("₴500", "500"), # UAH
|
||
("₦5,000", "5000"), # NGN
|
||
("₱1,234.56", "1234.56"), # PHP
|
||
("₲100000", "100000"), # PYG
|
||
("﷼500", "500"), # SAR (ambiguous; mapped to SAR)
|
||
("₨1,234", "1234"), # PKR
|
||
("₵100", "100"), # GHS
|
||
])
|
||
def test_extended_symbol_coverage(self, inp, want):
|
||
got, _ = standardize_currency(inp)
|
||
assert got == want
|
||
|
||
|
||
class TestCurrencyCodes:
|
||
@pytest.mark.parametrize("code", [
|
||
"SAR", "AED", "QAR", "ARS", "EGP", "IDR", "MYR", "PHP", "THB",
|
||
"VND", "PKR", "BDT", "HUF", "CZK", "RON", "UAH",
|
||
])
|
||
def test_iso_code_recognized(self, code):
|
||
got, _ = standardize_currency(f"1234.56 {code}")
|
||
assert got == "1234.56"
|
||
|
||
|
||
# ---------------------------------------------------------------------------
|
||
# Addresses
|
||
# ---------------------------------------------------------------------------
|
||
|
||
class TestCanadianAddresses:
|
||
def test_province_name_to_code(self):
|
||
got, _ = standardize_address(
|
||
"1 Yonge St, Toronto, Ontario M5E 1W7", expand=False,
|
||
)
|
||
assert "ON" in got
|
||
assert "Ontario" not in got
|
||
|
||
def test_quebec_with_accent(self):
|
||
got, _ = standardize_address(
|
||
"1 Rue Sherbrooke, Montréal, Québec H2Y 1A1", expand=False,
|
||
)
|
||
assert "QC" in got
|
||
|
||
def test_province_code_preserved_after_lowercase(self):
|
||
got, _ = standardize_address(
|
||
"1 yonge st, toronto, on m5e 1w7", expand=False,
|
||
)
|
||
assert "ON" in got
|
||
|
||
|
||
class TestUKAddresses:
|
||
def test_postcode_address_passes_through(self):
|
||
got, _ = standardize_address(
|
||
"10 Downing Street, London, SW1A 2AA", expand=False,
|
||
)
|
||
assert "SW1A 2AA" in got
|
||
|
||
def test_lowercase_postcode_preserved_with_caps(self):
|
||
got, _ = standardize_address(
|
||
"10 downing street, london, sw1a 2aa", expand=False,
|
||
)
|
||
# UK postcodes get title-cased as the rest of the address;
|
||
# SW1A 2AA letters aren't in the state-code set so we accept
|
||
# "Sw1a 2Aa" as the title-case fallback.
|
||
assert "London" in got
|
||
|
||
|
||
class TestAustralianAddresses:
|
||
def test_state_name_to_code(self):
|
||
got, _ = standardize_address(
|
||
"1 George St, Sydney, New South Wales 2000", expand=False,
|
||
)
|
||
assert "NSW" in got
|
||
assert "New South Wales" not in got
|
||
|
||
def test_state_code_preserved(self):
|
||
got, _ = standardize_address(
|
||
"1 collins st, melbourne, vic 3000", expand=False,
|
||
)
|
||
assert "VIC" in got
|
||
|
||
|
||
class TestGermanAddresses:
|
||
def test_bundesland_name_to_code(self):
|
||
got, _ = standardize_address(
|
||
"Hauptstr 1, München, Bayern 80331", expand=False,
|
||
)
|
||
assert "BY" in got
|
||
assert "Bayern" not in got
|
||
|
||
|
||
class TestInternationalPOBox:
|
||
@pytest.mark.parametrize("inp", [
|
||
"Postfach 12345, München, BY 80331", # DE
|
||
"Boîte postale 12, Paris 75001", # FR
|
||
"Apartado 12, Madrid 28001", # ES
|
||
"Casella postale 12, Roma 00100", # IT
|
||
"Caixa postal 12, São Paulo 01310", # PT
|
||
])
|
||
def test_intl_po_box_normalized(self, inp):
|
||
got, _ = standardize_address(inp, expand=False)
|
||
assert "PO Box" in got
|
||
|
||
|
||
# ---------------------------------------------------------------------------
|
||
# Email — security
|
||
# ---------------------------------------------------------------------------
|
||
|
||
class TestEmailBidiSecurity:
|
||
def test_rtl_override_stripped(self):
|
||
# U+202E (Right-to-Left Override) inside email — common phishing
|
||
# vector. After strip, the address is just the legitimate one.
|
||
malicious = "alice@example.com"
|
||
got, _ = standardize_email(malicious)
|
||
assert got == "alice@example.com"
|
||
assert "" not in got
|
||
|
||
def test_lrm_stripped(self):
|
||
# Left-to-Right Mark, also strippable.
|
||
s = "alice@example.com"
|
||
got, _ = standardize_email(s)
|
||
assert got == "alice@example.com"
|
||
|
||
def test_rtl_isolate_stripped(self):
|
||
s = "alice@example.com"
|
||
got, _ = standardize_email(s)
|
||
assert got == "alice@example.com"
|
||
|
||
|
||
# ---------------------------------------------------------------------------
|
||
# Pipeline integration — end-to-end with intl options
|
||
# ---------------------------------------------------------------------------
|
||
|
||
class TestPipelineIntl:
|
||
def test_standardize_options_carry_intl_flags(self):
|
||
from src.core.format_standardize import (
|
||
FieldType, StandardizeOptions, standardize_dataframe,
|
||
)
|
||
df = pd.DataFrame({
|
||
"name": ["Tanaka-san", "Kim, Min-jae"],
|
||
"date": ["15 janeiro 2024", "Mon, 15 Jan 2024 10:30:00"],
|
||
"addr": [
|
||
"Hauptstr 1, München, Bayern 80331",
|
||
"1 Yonge St, Toronto, Ontario M5E 1W7",
|
||
],
|
||
})
|
||
opts = StandardizeOptions(
|
||
column_types={
|
||
"name": FieldType.NAME,
|
||
"date": FieldType.DATE,
|
||
"addr": FieldType.ADDRESS,
|
||
},
|
||
date_month_locales=["en", "fr", "de", "es", "pt", "it", "nl", "ru"],
|
||
address_expand=False,
|
||
name_family_first=True,
|
||
)
|
||
result = standardize_dataframe(df, opts)
|
||
out = result.standardized_df
|
||
# Names: honorific preserved, family-first comma not reversed
|
||
assert out.loc[0, "name"] == "Tanaka-san"
|
||
assert out.loc[1, "name"].startswith("Kim,")
|
||
# Dates: PT month + RFC 2822 both → 2024-01-15
|
||
assert out.loc[0, "date"] == "2024-01-15"
|
||
assert out.loc[1, "date"] == "2024-01-15"
|
||
# Addresses: DE + CA both have state codes substituted
|
||
assert "BY" in out.loc[0, "addr"]
|
||
assert "ON" in out.loc[1, "addr"]
|