Files
datatools-dev/tests/test_audit_fixes.py
Michael b23a27d4e3 fix: cross-tool audit findings + alignment with format standardizer
Closes 12 bugs and 8 gaps surfaced by parallel audits across all core
modules, plus aligns the dedup-side normalizers with the new
format_standardize behavior where they had silently diverged.

Bugs (data integrity / correctness):
- dedup: NaN/None values matched as duplicates because str(None)='None'.
  Two rows with missing email silently merged.
- dedup: removed_df had 0 columns when nothing was removed; downstream
  code expecting matching schema broke. Now preserves column shape.
- dedup: ColumnMatchStrategy threshold accepted any value; out-of-range
  silently broke matching. Validated to [0, 100] in __post_init__.
- dedup: strategy referencing a missing column was silently skipped.
  Now raises ValueError listing available columns.
- fixes: replace_null_sentinels crashed on non-string sentinels (int/None
  from JSON payload). Coerced to str.
- fixes: _vectorized_regex_sub raised raw re.error on bad patterns. Now
  wraps as ValueError with clear message.
- io: detect_header_row mis-identified all-empty and metadata-only rows
  as headers (all([]) is True). Now requires ≥2 non-empty cells.
- config: from_dict crashed when JSON had unknown fields, breaking
  forward compat. Now filters to known fields.
- analyze: mixed-case email detector flagged all-None columns because
  str(None)='None' contains both N and one. Now drops NaN before stringify.

New features and gap closures:
- io: _detect_excel_header_row mirrors detect_header_row for Excel via
  openpyxl read-only; _read_excel uses it when header_row=None.
- io: write_file gains delimiter + encoding params; .tsv extension
  defaults to tab.
- normalizers: normalize_phone preserves extensions as ;ext=N suffix.
- normalizers: normalize_address folds spelled-out US state names to
  2-letter codes (California ≡ CA).
- normalizers: normalize_name drops surname particles (van, de, von)
  so "Charles de Gaulle" ≡ "Charles Gaulle" for matching.
- analyze: new _detect_inconsistent_date_format detector flags columns
  with mixed ISO/US/EU date shapes; routes to format standardizer.
- analyze: _NULL_LIKE recognizes "<na>" (pd.NA repr).
- analyze: duplicate-row finding renamed count → n_extra (rows that
  would actually be removed) with clarified description.
- dedup: group_confidence no longer falsely 100.0 when transitive group
  members lack a recorded direct pair; falls back to 100.0 only when
  truly no pairs were observed.
- dedup: MatchResult / DeduplicationResult docstrings clarify that
  row_indices refer to the input frame's positional index (output index
  is reset).
- text_clean: visualize_hidden_html(None) now returns None (matches
  visualize_hidden_text); strip_bom strips at most one BOM per call;
  sentence_case dead elif branch removed.

Tests:
- tests/test_audit_fixes.py — 28 regression tests, one or more per
  numbered finding, named after BUG/GAP/NIT tags so future readers
  can trace each test back to its audit.
- tests/test_fixes_unit.py — 26 isolated unit tests for previously
  integration-only fix functions (trim_whitespace, strip_nbsp,
  strip_zero_width, normalize_line_endings, clean_headers,
  repair_mojibake — last skipped if ftfy unavailable).
- tests/test_io.py — adds CSV / TSV / semicolon / UTF-8-BOM round-trip
  tests + Excel auto-header-detection tests.
- tests/test_normalizers.py — adds 8 tests for the alignment work
  above (phone extension, state names, particles).

Adds .claude/ to .gitignore (agent worktrees + local settings).

Full project suite: 1197 passed, 4 skipped, 17 xfailed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:11:57 +00:00

304 lines
12 KiB
Python
Raw Blame History

This file contains invisible Unicode characters
This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
"""Regression tests for bugs surfaced by the cross-tool audit.
Each test pins a specific behavioral bug or gap that an audit
identified. Test names match the BUG-N / GAP-N tags in the audit
notes so a future reader can trace why each test exists.
"""
from __future__ import annotations
import json
from pathlib import Path
import numpy as np
import pandas as pd
import pytest
from src.core.analyze import _NULL_LIKE, _detect_mixed_case_email
import src.core.fixes as f
from src.core.config import (
ColumnStrategyConfig,
DeduplicationConfig,
StrategyConfig,
)
from src.core.dedup import (
Algorithm,
ColumnMatchStrategy,
MatchStrategy,
deduplicate,
)
from src.core.io import detect_header_row
from src.core.text_clean import sentence_case, smart_title_case, strip_bom
# ---------------------------------------------------------------------------
# BUG-1: dedup NaN values must not match as duplicates
# ---------------------------------------------------------------------------
class TestDedupNaNHandling:
def test_two_nan_emails_do_not_match(self):
# Both rows have NaN for email; no other matching column. Without
# the fix, str(NaN) == "nan" would match exactly and the rows
# would silently merge.
df = pd.DataFrame({
"id": [1, 2],
"email": [np.nan, np.nan],
})
strategies = [MatchStrategy(column_strategies=[
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT,
threshold=100.0),
])]
result = deduplicate(df, strategies=strategies)
assert len(result.deduplicated_df) == 2
assert len(result.match_groups) == 0
def test_one_nan_one_real_does_not_match(self):
df = pd.DataFrame({
"email": [np.nan, "alice@example.com"],
})
strategies = [MatchStrategy(column_strategies=[
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT),
])]
result = deduplicate(df, strategies=strategies)
assert len(result.deduplicated_df) == 2
def test_none_does_not_match_string_none(self):
df = pd.DataFrame({
"name": [None, "None"],
})
strategies = [MatchStrategy(column_strategies=[
ColumnMatchStrategy(column="name", algorithm=Algorithm.EXACT),
])]
result = deduplicate(df, strategies=strategies)
assert len(result.deduplicated_df) == 2
# ---------------------------------------------------------------------------
# BUG-2: removed_df must preserve column schema even when empty
# ---------------------------------------------------------------------------
class TestDedupRemovedDfSchema:
def test_empty_removed_df_has_same_columns(self):
df = pd.DataFrame({
"name": ["alice", "bob", "carol"],
"email": ["a@x.com", "b@x.com", "c@x.com"],
})
strategies = [MatchStrategy(column_strategies=[
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT),
])]
result = deduplicate(df, strategies=strategies)
# No duplicates → empty removed_df, but columns must match.
assert len(result.removed_df) == 0
assert list(result.removed_df.columns) == list(result.deduplicated_df.columns)
# ---------------------------------------------------------------------------
# GAP-3: missing column reference should raise
# ---------------------------------------------------------------------------
class TestDedupMissingColumn:
def test_missing_column_raises(self):
df = pd.DataFrame({"email": ["a@x.com"]})
strategies = [MatchStrategy(column_strategies=[
ColumnMatchStrategy(column="e_mail", algorithm=Algorithm.EXACT),
])]
with pytest.raises(ValueError, match="not present in the input"):
deduplicate(df, strategies=strategies)
# ---------------------------------------------------------------------------
# GAP-4: threshold must be in [0, 100]
# ---------------------------------------------------------------------------
class TestThresholdValidation:
def test_negative_threshold_rejected(self):
with pytest.raises(ValueError, match=r"\[0, 100\]"):
ColumnMatchStrategy(column="x", threshold=-1)
def test_over_hundred_rejected(self):
with pytest.raises(ValueError, match=r"\[0, 100\]"):
ColumnMatchStrategy(column="x", threshold=101)
def test_zero_and_hundred_allowed(self):
ColumnMatchStrategy(column="x", threshold=0)
ColumnMatchStrategy(column="x", threshold=100)
def test_non_numeric_rejected(self):
with pytest.raises(TypeError):
ColumnMatchStrategy(column="x", threshold="high") # type: ignore[arg-type]
# ---------------------------------------------------------------------------
# BUG-9: replace_null_sentinels must coerce non-string sentinels
# ---------------------------------------------------------------------------
class TestReplaceNullSentinelsTypes:
def test_int_sentinels_do_not_crash(self):
df = pd.DataFrame({"x": ["0", "5", ""]})
out, _ = f.replace_null_sentinels(df, {"sentinels": [0, "5"]})
assert out.loc[0, "x"] == "" # "0" matched int 0 stringified
assert out.loc[1, "x"] == "" # "5" matched
assert out.loc[2, "x"] == "" # already empty
def test_none_sentinel_skipped(self):
df = pd.DataFrame({"x": ["a", "b"]})
# Should not crash on None entry in the sentinel list.
out, _ = f.replace_null_sentinels(df, {"sentinels": ["a", None]})
assert out.loc[0, "x"] == ""
assert out.loc[1, "x"] == "b"
# ---------------------------------------------------------------------------
# BUG-10: malformed regex should raise ValueError, not re.error
# ---------------------------------------------------------------------------
class TestVectorizedRegexErrorHandling:
def test_malformed_pattern_raises_valueerror(self):
df = pd.DataFrame({"x": ["abc"]})
with pytest.raises(ValueError, match="Invalid regex pattern"):
f._vectorized_regex_sub(df, "[invalid", "")
# ---------------------------------------------------------------------------
# NIT-12: strip_bom strips at most one BOM
# ---------------------------------------------------------------------------
class TestStripBomSingleChar:
def test_strips_one_leading_bom(self):
assert strip_bom("hello") == "hello"
def test_does_not_strip_multiple_consecutive_boms(self):
# Per docstring: "at most one BOM". Second BOM stays so the
# caller can see something odd happened.
assert strip_bom("hello") == "hello"
def test_no_bom_unchanged(self):
assert strip_bom("hello") == "hello"
def test_non_string_passthrough(self):
assert strip_bom(None) is None # type: ignore[arg-type]
# ---------------------------------------------------------------------------
# Smart title case — particle behavior at boundaries (regression / docs)
# ---------------------------------------------------------------------------
class TestSmartTitleCaseBoundaries:
def test_first_word_particle_capitalized(self):
# "a" at index 0 is a particle but must capitalize as the first
# word of a title.
assert smart_title_case("a story") == "A Story"
def test_last_word_particle_capitalized(self):
# "to" at the end is the last word; must capitalize.
assert smart_title_case("things to") == "Things To"
def test_mid_string_particles_lowercase(self):
assert smart_title_case("the cat in the hat") == "The Cat in the Hat"
# ---------------------------------------------------------------------------
# NIT-14: sentence_case dead branch removed — regression guard
# ---------------------------------------------------------------------------
class TestSentenceCaseUnchanged:
def test_basic(self):
assert sentence_case("hello. world.") == "Hello. World."
def test_open_paren_does_not_consume_trigger(self):
# The dead-branch removal didn't change behavior; this is a
# regression guard that opening punctuation still doesn't
# capitalize itself but doesn't reset the trigger either.
assert sentence_case('hello. "world"') == 'Hello. "World"'
# ---------------------------------------------------------------------------
# BUG-18: detect_header_row must not pick all-empty rows
# ---------------------------------------------------------------------------
class TestDetectHeaderRowEmptyRows:
def test_all_empty_first_row_skipped(self, tmp_path: Path):
# First row is all-empty — the header is on row 1.
p = tmp_path / "blank_first.csv"
p.write_text(",,\nname,email,phone\nalice,a@x.com,555\n")
assert detect_header_row(p) == 1
def test_pure_header_at_row_zero(self, tmp_path: Path):
p = tmp_path / "normal.csv"
p.write_text("name,email,phone\nalice,a@x.com,555\n")
assert detect_header_row(p) == 0
# ---------------------------------------------------------------------------
# BUG-20: config.from_dict must accept unknown fields (forward compat)
# ---------------------------------------------------------------------------
class TestConfigForwardCompat:
def test_extra_field_in_column_config_ignored(self, tmp_path: Path):
# Simulate a config file written by a future version with an
# extra ``priority`` field.
config_dict = {
"strategies": [{
"columns": [{
"column": "email",
"algorithm": "exact",
"threshold": 100.0,
"normalizer": None,
"priority": 5, # future field — must not crash
}],
}],
"survivor_rule": "first",
"merge": False,
}
loaded = DeduplicationConfig.from_dict(config_dict)
assert len(loaded.strategies) == 1
assert loaded.strategies[0].columns[0].column == "email"
def test_roundtrip_then_reload_with_extra(self, tmp_path: Path):
cfg = DeduplicationConfig(
strategies=[StrategyConfig(columns=[
ColumnStrategyConfig(column="email"),
])],
)
path = tmp_path / "cfg.json"
cfg.to_file(path)
# Manually inject an unknown field to simulate forward-compat.
data = json.loads(path.read_text())
data["strategies"][0]["columns"][0]["future_thing"] = "abc"
path.write_text(json.dumps(data))
loaded = DeduplicationConfig.from_file(path)
assert loaded.strategies[0].columns[0].column == "email"
# ---------------------------------------------------------------------------
# BUG-22: mixed-case email detector must not flag all-None columns
# ---------------------------------------------------------------------------
class TestMixedCaseEmailFalsePositive:
def test_all_none_email_column_no_finding(self):
df = pd.DataFrame({
"email": [None, None, None],
})
findings = _detect_mixed_case_email(df)
assert findings == []
def test_real_mixed_case_still_flagged(self):
df = pd.DataFrame({
"email": ["Alice@X.com", "bob@y.com"],
})
findings = _detect_mixed_case_email(df)
assert len(findings) == 1
assert findings[0].column == "email"
# ---------------------------------------------------------------------------
# NIT-24: <NA> recognized as a null-like sentinel
# ---------------------------------------------------------------------------
class TestNullLikeIncludesPandasNA:
def test_pd_na_string_repr_recognized(self):
# str(pd.NA) → "<NA>" — when a DataFrame is loaded with
# keep_default_na=False, pandas NA values appear as the literal
# string "<NA>" and the analyzer should flag them.
assert "<na>" in _NULL_LIKE