datatools-dev/tests/test_fixes_unit.py
Michael 966af8ef94 feat: 3 new tools, format streaming, distribution-ready demo + landing pages
Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:31:26 +00:00

258 lines
8.5 KiB
Python
"""Isolated unit tests for individual fix functions in src.core.fixes.
The integration tests at tests/test_normalize.py exercise these
functions through the full analyze→fix pipeline. These tests pin each
function's behavior in isolation so a regression surfaces close to the
broken function rather than at the pipeline output.
"""
from __future__ import annotations

import pandas as pd
import pytest

from src.core.fixes import (
    clean_headers,
    normalize_line_endings,
    repair_mojibake,
    strip_nbsp,
    strip_zero_width,
    trim_whitespace,
)

# ---------------------------------------------------------------------------
# trim_whitespace
# ---------------------------------------------------------------------------
class TestTrimWhitespace:
    def test_strips_leading_trailing(self):
        df = pd.DataFrame({"x": [" hello ", " world "]})
        out, changed = trim_whitespace(df)
        assert list(out["x"]) == ["hello", "world"]
        assert changed == 2

    def test_collapses_internal_runs(self):
        # Runs of several spaces collapse to a single space.
        df = pd.DataFrame({"x": ["a   b    c"]})
        out, _ = trim_whitespace(df)
        assert out.loc[0, "x"] == "a b c"

    def test_preserves_internal_in_structured(self):
        # Phone-shaped strings keep internal spacing (often semantic).
        df = pd.DataFrame({"x": ["(555) 123-4567"]})
        out, changed = trim_whitespace(df)
        assert out.loc[0, "x"] == "(555) 123-4567"
        assert changed == 0

    def test_empty_df(self):
        df = pd.DataFrame({"x": []})
        out, changed = trim_whitespace(df)
        assert len(out) == 0
        assert changed == 0

    def test_no_string_columns(self):
        df = pd.DataFrame({"n": [1, 2, 3]})
        out, changed = trim_whitespace(df)
        assert changed == 0
        assert list(out["n"]) == [1, 2, 3]

    def test_nan_preserved(self):
        df = pd.DataFrame({"x": [" ok ", None]})
        out, _ = trim_whitespace(df)
        assert out.loc[0, "x"] == "ok"
        # None either passes through unchanged or becomes an empty string.
        assert out.loc[1, "x"] is None or out.loc[1, "x"] == ""

    def test_idempotent(self):
        df = pd.DataFrame({"x": [" hello world "]})
        out1, _ = trim_whitespace(df)
        out2, changed2 = trim_whitespace(out1)
        assert changed2 == 0
        assert list(out2["x"]) == list(out1["x"])

# ---------------------------------------------------------------------------
# strip_nbsp
# ---------------------------------------------------------------------------
class TestStripNbsp:
    def test_replaces_nbsp_with_ascii_space(self):
        # U+00A0 (no-break space) written as an escape so it stays visible.
        df = pd.DataFrame({"x": ["a\u00a0b"]})
        out, changed = strip_nbsp(df)
        assert out.loc[0, "x"] == "a b"
        assert changed == 1

    def test_no_change_when_clean(self):
        df = pd.DataFrame({"x": ["a b c"]})
        out, changed = strip_nbsp(df)
        assert changed == 0

    def test_other_unicode_spaces(self):
        # Em space (U+2003), thin space (U+2009)
        df = pd.DataFrame({"x": ["a\u2003b\u2009c"]})
        out, _ = strip_nbsp(df)
        assert out.loc[0, "x"] == "a b c"

    def test_idempotent(self):
        df = pd.DataFrame({"x": ["a\u00a0\u00a0b"]})
        out1, _ = strip_nbsp(df)
        out2, changed2 = strip_nbsp(out1)
        assert changed2 == 0

# ---------------------------------------------------------------------------
# strip_zero_width
# ---------------------------------------------------------------------------
class TestStripZeroWidth:
    def test_removes_zero_width_space(self):
        # U+200B (zero-width space) written as an escape so it stays visible.
        df = pd.DataFrame({"x": ["a\u200bb"]})
        out, changed = strip_zero_width(df)
        assert out.loc[0, "x"] == "ab"
        assert changed == 1

    def test_removes_zero_width_joiner(self):
        # U+200D (zero-width joiner)
        df = pd.DataFrame({"x": ["a\u200db"]})
        out, _ = strip_zero_width(df)
        assert out.loc[0, "x"] == "ab"

    def test_clean_passthrough(self):
        df = pd.DataFrame({"x": ["clean"]})
        out, changed = strip_zero_width(df)
        assert changed == 0

    def test_idempotent(self):
        df = pd.DataFrame({"x": ["a\u200bb\u200dc"]})
        out1, _ = strip_zero_width(df)
        out2, changed2 = strip_zero_width(out1)
        assert changed2 == 0

# ---------------------------------------------------------------------------
# normalize_line_endings
# ---------------------------------------------------------------------------
class TestNormalizeLineEndings:
    def test_crlf_to_lf(self):
        df = pd.DataFrame({"x": ["line1\r\nline2"]})
        out, changed = normalize_line_endings(df)
        assert out.loc[0, "x"] == "line1\nline2"
        assert changed == 1

    def test_bare_cr_to_lf(self):
        df = pd.DataFrame({"x": ["line1\rline2"]})
        out, _ = normalize_line_endings(df)
        assert out.loc[0, "x"] == "line1\nline2"

    def test_already_lf_unchanged(self):
        df = pd.DataFrame({"x": ["line1\nline2"]})
        out, changed = normalize_line_endings(df)
        assert changed == 0

    def test_idempotent(self):
        df = pd.DataFrame({"x": ["a\r\nb\rc"]})
        out1, _ = normalize_line_endings(df)
        out2, changed2 = normalize_line_endings(out1)
        assert changed2 == 0

# ---------------------------------------------------------------------------
# clean_headers
# ---------------------------------------------------------------------------
class TestCleanHeaders:
    def test_strips_bom_from_header(self):
        # U+FEFF (byte-order mark) glued to the first header.
        df = pd.DataFrame({"\ufeffname": [1], "email": [2]})
        out, changed = clean_headers(df)
        assert "name" in out.columns
        assert "\ufeffname" not in out.columns
        assert changed >= 1

    def test_strips_nbsp_from_header(self):
        # U+00A0 (no-break space) inside the header becomes an ASCII space.
        df = pd.DataFrame({"first\u00a0name": [1]})
        out, _ = clean_headers(df)
        assert "first name" in out.columns

    def test_strips_trailing_whitespace_from_header(self):
        df = pd.DataFrame({"Email ": [1]})
        out, _ = clean_headers(df)
        assert "Email" in out.columns
        assert "Email " not in out.columns

    def test_non_string_label_preserved(self):
        df = pd.DataFrame({0: [1], 1: [2]})
        out, changed = clean_headers(df)
        assert list(out.columns) == [0, 1]
        assert changed == 0

    def test_clean_headers_idempotent(self):
        df = pd.DataFrame({"name": [1]})
        out1, _ = clean_headers(df)
        out2, changed2 = clean_headers(out1)
        assert changed2 == 0
        assert list(out2.columns) == list(out1.columns)

# ---------------------------------------------------------------------------
# repair_mojibake
# ---------------------------------------------------------------------------
_HAS_FTFY = True
try:
    import ftfy  # noqa: F401
except ImportError:
    _HAS_FTFY = False


@pytest.mark.skipif(not _HAS_FTFY, reason="ftfy library not installed — fix is a no-op")
class TestRepairMojibake:
    def test_classic_cafe_repair(self):
        # "café" encoded as UTF-8 but decoded as Latin-1.
        df = pd.DataFrame({"x": ["caf\u00c3\u00a9"]})
        out, changed = repair_mojibake(df)
        assert out.loc[0, "x"] == "café"
        assert changed == 1

    def test_clean_text_unchanged(self):
        df = pd.DataFrame({"x": ["café"]})
        out, changed = repair_mojibake(df)
        assert changed == 0

    def test_no_string_columns(self):
        df = pd.DataFrame({"n": [1, 2]})
        out, changed = repair_mojibake(df)
        assert changed == 0

    def test_idempotent(self):
        df = pd.DataFrame({"x": ["caf\u00c3\u00a9"]})
        out1, _ = repair_mojibake(df)
        out2, changed2 = repair_mojibake(out1)
        assert changed2 == 0


class TestRepairMojibakeNoFtfy:
    def test_returns_input_unchanged_without_ftfy(self, monkeypatch):
        """Exercise the no-op path regardless of whether ftfy is installed.

        ``repair_mojibake`` lazy-imports ftfy inside the function body, so
        we hide ``ftfy`` from ``sys.modules`` and from import resolution
        before calling. The function must then degrade to ``(df, 0)``
        without raising.
        """
        import builtins
        import sys

        monkeypatch.delitem(sys.modules, "ftfy", raising=False)
        real_import = builtins.__import__

        def fake_import(name, *args, **kwargs):
            if name == "ftfy" or name.startswith("ftfy."):
                raise ImportError("ftfy hidden by test")
            return real_import(name, *args, **kwargs)

        monkeypatch.setattr(builtins, "__import__", fake_import)
        # Mojibake input must come back untouched when ftfy is unavailable.
        df = pd.DataFrame({"x": ["caf\u00c3\u00a9"]})
        out, changed = repair_mojibake(df)
        assert changed == 0
        assert out.loc[0, "x"] == "caf\u00c3\u00a9"
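
The import-hiding trick in the last test also works outside pytest. The sketch below is a minimal standard-library version of the same idea, not code from this repository: `hide_module` and `normalize_or_passthrough` are hypothetical names, and the stdlib module `unicodedata` stands in for an optional dependency such as ftfy. It assumes the function under test performs its import lazily, inside the function body, as the docstring above says `repair_mojibake` does.

```python
import builtins
import sys
from contextlib import contextmanager


@contextmanager
def hide_module(name):
    """Make `import <name>` raise ImportError inside the with-block (hypothetical helper)."""
    real_import = builtins.__import__
    # Drop cached copies (package and submodules) so nothing resolves from sys.modules.
    saved = {
        key: mod
        for key, mod in sys.modules.items()
        if key == name or key.startswith(name + ".")
    }
    for key in saved:
        del sys.modules[key]

    def fake_import(mod, *args, **kwargs):
        if mod == name or mod.startswith(name + "."):
            raise ImportError(f"{name} hidden")
        return real_import(mod, *args, **kwargs)

    builtins.__import__ = fake_import
    try:
        yield
    finally:
        # Always restore the real importer and the cached modules.
        builtins.__import__ = real_import
        sys.modules.update(saved)


def normalize_or_passthrough(text):
    """Hypothetical fix: NFC-normalize if the dependency imports, else no-op."""
    try:
        import unicodedata  # stand-in for an optional dependency
    except ImportError:
        return text, 0
    fixed = unicodedata.normalize("NFC", text)
    return fixed, int(fixed != text)


if __name__ == "__main__":
    with hide_module("unicodedata"):
        print(normalize_or_passthrough("e\u0301"))  # dependency hidden: input returned as-is
    print(normalize_or_passthrough("e\u0301"))      # dependency visible: normalized
```

In a pytest suite, `monkeypatch` is the better choice because it undoes the patch even if the test errors during setup; the context manager is only convenient for ad-hoc scripts or REPL checks.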