datatools-dev/tests/test_normalizers.py
Michael b23a27d4e3 fix: cross-tool audit findings + alignment with format standardizer
Closes 12 bugs and 8 gaps surfaced by parallel audits across all core
modules, plus aligns the dedup-side normalizers with the new
format_standardize behavior where they had silently diverged.

Bugs (data integrity / correctness):
- dedup: NaN/None values matched as duplicates because str(None)='None'.
  Two rows with missing email silently merged.
- dedup: removed_df had 0 columns when nothing was removed; downstream
  code expecting matching schema broke. Now preserves column shape.
- dedup: ColumnMatchStrategy threshold accepted any value; out-of-range
  silently broke matching. Validated to [0, 100] in __post_init__.
- dedup: strategy referencing a missing column was silently skipped.
  Now raises ValueError listing available columns.
- fixes: replace_null_sentinels crashed on non-string sentinels (int/None
  from JSON payload). Coerced to str.
- fixes: _vectorized_regex_sub raised raw re.error on bad patterns. Now
  wraps as ValueError with clear message.
- io: detect_header_row mis-identified all-empty and metadata-only rows
  as headers (all([]) is True). Now requires ≥2 non-empty cells.
- config: from_dict crashed when JSON had unknown fields, breaking
  forward compat. Now filters to known fields.
- analyze: mixed-case email detector flagged all-None columns because
  str(None)='None' contains both upper- and lowercase letters. Now drops
  NaN before stringifying.
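
Two of the bugs above come from plain Python behavior: str(None) is the ordinary string 'None', and all() of an empty list is True. The sketch below illustrates both guards under stated assumptions. The helper names (match_key, is_duplicate, looks_like_header) are hypothetical and are not the project's actual functions; only the "at least 2 non-empty cells" rule is taken from the message above.

```python
import math


def match_key(value):
    """Return a comparison key, or None when the value is missing."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    return str(value).strip().lower()


def is_duplicate(a, b):
    """Two values match only if both are present and their keys are equal."""
    key_a, key_b = match_key(a), match_key(b)
    return key_a is not None and key_a == key_b


def looks_like_header(cells, min_non_empty=2):
    """Treat a row as a header candidate only if enough cells are non-empty."""
    non_empty = [c for c in cells if c is not None and str(c).strip()]
    return len(non_empty) >= min_non_empty


# The two pitfalls, stated as facts about Python itself:
assert str(None) == "None"  # a missing value becomes an ordinary string
assert all([]) is True      # "every cell is non-empty" holds vacuously
```

With these guards, two rows that both lack an email no longer compare equal, and an all-empty row is never picked as the header.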

New features and gap closures:
- io: _detect_excel_header_row mirrors detect_header_row for Excel via
  openpyxl read-only; _read_excel uses it when header_row=None.
- io: write_file gains delimiter + encoding params; .tsv extension
  defaults to tab.
- normalizers: normalize_phone preserves extensions as ;ext=N suffix.
- normalizers: normalize_address folds spelled-out US state names to
  2-letter codes (California ≡ CA).
- normalizers: normalize_name drops surname particles (van, de, von)
  so "Charles de Gaulle" ≡ "Charles Gaulle" for matching.
- analyze: new _detect_inconsistent_date_format detector flags columns
  with mixed ISO/US/EU date shapes; routes to format standardizer.
- analyze: _NULL_LIKE recognizes "<na>" (pd.NA repr).
- analyze: duplicate-row finding renamed count → n_extra (rows that
  would actually be removed) with clarified description.
- dedup: group_confidence no longer falsely 100.0 when transitive group
  members lack a recorded direct pair; falls back to 100.0 only when
  truly no pairs were observed.
- dedup: MatchResult / DeduplicationResult docstrings clarify that
  row_indices refer to the input frame's positional index (output index
  is reset).
- text_clean: visualize_hidden_html(None) now returns None (matches
  visualize_hidden_text); strip_bom strips at most one BOM per call;
  sentence_case dead elif branch removed.
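
As a rough illustration of the particle handling described above, the following sketch builds a matching key by dropping surname particles. It is not the project's normalize_name: the function name name_key, the three-entry particle set (only the particles named in this message), and the fallback for names made up entirely of particles are all assumptions.

```python
# Hypothetical particle list: only the three particles the message names.
PARTICLES = frozenset({"van", "de", "von"})


def name_key(name):
    """Build a case-folded matching key with surname particles removed."""
    if not name:
        return ""
    tokens = name.casefold().split()
    kept = [t for t in tokens if t not in PARTICLES]
    # If every token was a particle, keep the original tokens rather than
    # returning an empty key that would match every other empty key.
    return " ".join(kept or tokens)
```

Because the output contains no particles (or is returned unchanged in the fallback case), applying name_key to its own result gives the same key, which is the idempotence property the tests check for normalize_name.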

Tests:
- tests/test_audit_fixes.py — 28 regression tests, one or more per
  numbered finding, named after BUG/GAP/NIT tags so future readers
  can trace each test back to its audit.
- tests/test_fixes_unit.py — 26 isolated unit tests for previously
  integration-only fix functions (trim_whitespace, strip_nbsp,
  strip_zero_width, normalize_line_endings, clean_headers,
  repair_mojibake — last skipped if ftfy unavailable).
- tests/test_io.py — adds CSV / TSV / semicolon / UTF-8-BOM round-trip
  tests + Excel auto-header-detection tests.
- tests/test_normalizers.py — adds 8 tests for the alignment work
  above (phone extension, state names, particles).

Adds .claude/ to .gitignore (agent worktrees + local settings).

Full project suite: 1197 passed, 4 skipped, 17 xfailed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:11:57 +00:00


"""Tests for src.core.normalizers."""

import pytest

from src.core.normalizers import (
    NormalizerType,
    get_normalizer,
    normalize_email,
    normalize_phone,
    normalize_name,
    normalize_address,
    normalize_string,
)


class TestNormalizeEmail:
    def test_basic_lowercase(self):
        assert normalize_email("John@Example.COM") == "john@example.com"

    def test_strip_whitespace(self):
        assert normalize_email(" alice@test.com ") == "alice@test.com"

    def test_strip_gmail_dots(self):
        assert normalize_email("j.o.h.n@gmail.com") == "john@gmail.com"

    def test_strip_plus_tag(self):
        assert normalize_email("alice+promo@test.com") == "alice@test.com"

    def test_gmail_dots_and_plus(self):
        assert normalize_email("j.smith+tag@gmail.com") == "jsmith@gmail.com"

    def test_non_gmail_keeps_dots(self):
        assert normalize_email("j.smith@company.com") == "j.smith@company.com"

    def test_empty(self):
        assert normalize_email("") == ""
        assert normalize_email(None) == ""

    def test_no_at_sign(self):
        assert normalize_email("not-an-email") == "not-an-email"

    def test_idempotent(self):
        result = normalize_email("J.Smith+tag@Gmail.com")
        assert normalize_email(result) == result


class TestNormalizePhone:
    def test_us_formatted(self):
        assert normalize_phone("(555) 123-4567") == "+15551234567"

    def test_dashes(self):
        assert normalize_phone("555-123-4567") == "+15551234567"

    def test_dots(self):
        assert normalize_phone("555.123.4567") == "+15551234567"

    def test_with_country_code(self):
        assert normalize_phone("+1 555-123-4567") == "+15551234567"

    def test_digits_only_input(self):
        assert normalize_phone("5551234567") == "+15551234567"

    def test_empty(self):
        assert normalize_phone("") == ""
        assert normalize_phone(None) == ""

    def test_invalid_fallback_digits(self):
        # Very short number that phonenumbers rejects
        result = normalize_phone("123")
        assert result == "123"

    def test_idempotent(self):
        result = normalize_phone("(555) 123-4567")
        assert normalize_phone(result) == result


class TestNormalizeName:
    def test_strip_mr(self):
        assert normalize_name("Mr. John Smith") == "john smith"

    def test_strip_dr(self):
        assert normalize_name("Dr. Jane Doe") == "jane doe"

    def test_strip_suffix(self):
        assert normalize_name("Robert Brown Jr.") == "robert brown"

    def test_strip_numeral_suffix(self):
        assert normalize_name("James Wilson III") == "james wilson"

    def test_title_and_suffix(self):
        assert normalize_name("Dr. Michael Williams III") == "michael williams"

    def test_collapse_whitespace(self):
        # Runs of spaces inside and around the name collapse to single spaces.
        assert normalize_name("  John   Smith  ") == "john smith"

    def test_case_fold(self):
        assert normalize_name("JOHN SMITH") == "john smith"

    def test_empty(self):
        assert normalize_name("") == ""
        assert normalize_name(None) == ""

    def test_idempotent(self):
        result = normalize_name("Mr. John Smith Jr.")
        assert normalize_name(result) == result


class TestNormalizeAddress:
    def test_street_abbreviation(self):
        assert normalize_address("123 Main Street") == "123 main st"

    def test_avenue_abbreviation(self):
        assert normalize_address("456 Oak Avenue") == "456 oak ave"

    def test_boulevard_abbreviation(self):
        assert normalize_address("789 Pine Boulevard") == "789 pine blvd"

    def test_apartment(self):
        assert normalize_address("123 Main St Apartment 4") == "123 main st apt 4"

    def test_direction(self):
        assert normalize_address("111 First Street North") == "111 first st n"

    def test_collapse_whitespace(self):
        # Runs of spaces inside and around the address collapse to single spaces.
        assert normalize_address("  123   Main   Street  ") == "123 main st"

    def test_empty(self):
        assert normalize_address("") == ""
        assert normalize_address(None) == ""

    def test_idempotent(self):
        result = normalize_address("123 Main Street Apartment 4")
        assert normalize_address(result) == result


class TestNormalizeString:
    def test_trim_and_casefold(self):
        assert normalize_string(" Hello World ") == "hello world"

    def test_collapse_whitespace(self):
        # Runs of spaces between tokens collapse to single spaces.
        assert normalize_string("a  b   c") == "a b c"

    def test_empty(self):
        assert normalize_string("") == ""
        assert normalize_string(None) == ""


class TestGetNormalizer:
    def test_get_by_enum(self):
        fn = get_normalizer(NormalizerType.EMAIL)
        assert fn("TEST@Gmail.com") == "test@gmail.com"

    def test_get_by_string(self):
        fn = get_normalizer("phone")
        assert fn("(555) 123-4567") == "+15551234567"

    def test_unknown_raises(self):
        with pytest.raises(ValueError):
            get_normalizer("unknown_type")


# ---------------------------------------------------------------------------
# Alignment with format_standardize: extension preservation, state codes,
# particle handling. See audit GAPs 15/16/17.
# ---------------------------------------------------------------------------


class TestNormalizerAudit:
    def test_phone_extension_preserved(self):
        # Two records with different extensions must NOT normalize to
        # the same key: they're different people at the same business.
        a = normalize_phone("+15551234567 ext 100")
        b = normalize_phone("+15551234567 ext 200")
        assert a != b
        assert a == "+15551234567;ext=100"

    def test_phone_no_extension_unchanged(self):
        assert normalize_phone("+15551234567") == "+15551234567"

    def test_address_state_name_to_code(self):
        # "California" and "CA" produce the same matching key.
        a = normalize_address("123 Main St, Los Angeles, California 90001")
        b = normalize_address("123 Main St, Los Angeles, CA 90001")
        assert a == b

    def test_address_multiword_state_name(self):
        a = normalize_address("100 Beacon St, Boston, Massachusetts 02101")
        b = normalize_address("100 Beacon St, Boston, MA 02101")
        assert a == b

    def test_address_does_not_butcher_city_named_after_state(self):
        # "New York" appearing as a city should still fold to "ny".
        # This is intentional for matching keys (we want ``New York, NY``
        # and ``NY, NY`` to be the same record) even though the
        # standardizer (display) would preserve the city name.
        out = normalize_address("123 Main St, New York, NY 10001")
        assert "ny" in out

    def test_name_particle_dropped(self):
        # "Charles de Gaulle" and "Charles Gaulle" produce the same key.
        assert normalize_name("Charles de Gaulle") == normalize_name("Charles Gaulle")

    def test_name_van_dropped(self):
        assert normalize_name("Vincent van Gogh") == normalize_name("Vincent Gogh")

    def test_name_particle_idempotent(self):
        out = normalize_name("Vincent van Gogh")
        assert normalize_name(out) == out