Files
datatools-dev/tests/test_gap_coverage.py
Michael 4687cf87b4 test: single-command runner, cross-platform automation, fixture auto-discovery
Adds a top-level test infrastructure layer addressing four needs at once:
a single command to run anything, cross-platform automation, install/e2e
sanity, and zero-config pickup of new fixtures dropped into test-cases/.

Top-level runner — run_tests.py
  python run_tests.py                # everything (default)
  python run_tests.py --tool dedup   # one tool's tests
  python run_tests.py --unit         # category scopes
  python run_tests.py --e2e          # end-to-end CLI
  python run_tests.py --install      # import / dependency sanity
  python run_tests.py --fixtures     # corpus + dropped-file sweep
  python run_tests.py --coverage     # term-missing report
  python run_tests.py --quick        # skip @pytest.mark.slow
Tools: analyze, cli, config, dedup, io, normalizers, text_clean.

Cross-platform — tox.ini
  Envs for py310-py313 plus install / e2e / fixtures / coverage / lint.
  Forces UTF-8 (PYTHONUTF8=1, PYTHONIOENCODING=utf-8) so identical fixture
  bytes parse the same on Linux/macOS/Windows.

Shared config — pytest.ini
  testpaths, python_files conventions, custom markers (slow, e2e, install,
  fixture_sweep), warning filters that fail on our own DeprecationWarnings
  while tolerating third-party ones.

New test layers
  tests/test_install.py — required deps import; project modules import;
    src.core public API surface; CLI --help exits 0; streamlit app.py
    parses as valid Python; run_tests.py --help works.
  tests/test_e2e.py — CLI roundtrips: cli_analyze table + JSON, cli_text_clean
    --apply writes a real file with NBSP/smart-quote folded, dedup CLI
    removes duplicates, run_tests.py self-tests.
  tests/test_fixtures_sweep.py — parametrizes over every CSV/TSV/XLSX
    inside test-cases/ (excluding text-cleaner-corpus/, which has its own
    suite). Each fixture must: load through repair_bytes, run analyze()
    cleanly, and survive clean_dataframe() with row/col counts unchanged
    plus idempotency. Drop a CSV in, re-run — no test code changes needed.
  tests/test_gap_coverage.py — closes audit gaps: clean_headers=False
    toggle, repair_bytes with tab/semicolon delimiters, BOM+NUL+smart-
    quote combined-fix scenario, analyze() over an XLSX path, sample_rows
    larger than the DataFrame, mid-cell BOM, findings_by_tool edges, plus
    a strict xfail documenting the known §4.17 numeric/phone whitespace
    heuristic gap.

Test count
  Before: 288 passed + 1 xfailed
  After:  475 passed + 2 xfailed (the second xfail is the documented
          collapse_whitespace gap on phone-shaped cells; spec §4.17 calls
          for a heuristic that hasn't been implemented yet).

Functional gaps surfaced (not fixed in this commit):
  - Text cleaner: collapse_whitespace runs unconditionally on every string
    cell, including phone/numeric/date-shaped ones. Spec §4.17 requires a
    skip heuristic. Captured as strict xfail so the gap stays visible.
  - io.read_file does not run pre-parse repair; only analyze() and direct
    callers of read_csv_repaired() get it. CLI tool pages and the dedup
    CLI miss the safety net.
  - Analyzer has no mixed_line_endings detector or near_duplicate_rows
    detector; both planned but require additional plumbing.
  - GUI tool pages each have their own uploader instead of picking up the
    home-page upload through session_state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:01:06 +00:00

162 lines
6.3 KiB
Python
Raw Blame History

This file contains invisible Unicode characters
This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
"""Tests added to close gaps surfaced by the test audit.
These cover edges that existing suites missed:
- ``CleanOptions.clean_headers=False`` toggle (added but not directly tested).
- ``repair_bytes`` with non-comma delimiters and combined-fix scenarios.
- ``analyze()`` over a path-based Excel file.
- ``analyze()`` with ``sample_rows >= len(df)`` (uses copy(), not head()).
- ``findings_by_tool`` on an empty list.
- BOM that appears mid-cell rather than at file start.
The collapse-whitespace heuristic for numeric/date/phone-shaped cells (spec
§4.17) is *not yet implemented* and is captured here as a known-gap xfail
so it's surfaced rather than silently missing.
"""
from __future__ import annotations
import io
import pandas as pd
import pytest
from src.core.analyze import analyze, findings_by_tool
from src.core.io import RepairAction, repair_bytes
from src.core.text_clean import CleanOptions, clean_dataframe
# ---------------------------------------------------------------------------
# clean_headers toggle
# ---------------------------------------------------------------------------
class TestCleanHeadersToggle:
def test_default_cleans_headers(self):
df = pd.DataFrame({" id ": [1], "Email": ["a@b.com"]})
result = clean_dataframe(df)
assert list(result.cleaned_df.columns) == ["id", "Email"]
def test_disable_preserves_dirty_headers(self):
df = pd.DataFrame({" id ": [1], "Email": ["a@b.com"]})
result = clean_dataframe(df, CleanOptions(clean_headers=False))
assert list(result.cleaned_df.columns) == [" id ", "Email"]
def test_disable_still_cleans_data_cells(self):
df = pd.DataFrame({"name": [" Alice ", "Bob "]})
result = clean_dataframe(df, CleanOptions(clean_headers=False))
assert result.cleaned_df["name"].tolist() == ["Alice", "Bob"]
# ---------------------------------------------------------------------------
# repair_bytes — non-comma delimiters and combined fixes
# ---------------------------------------------------------------------------
class TestRepairBytesDelimiters:
def test_tab_delimited_smart_quote_fold(self):
raw = "id\tnote\n1\t“hi”\n".encode("utf-8")
result = repair_bytes(raw, delimiter="\t")
text = result.repaired_bytes.decode("utf-8")
assert "" not in text and "" not in text
assert "\t" in text # delimiter preserved
def test_semicolon_delimited_unrepairable_extras(self):
raw = b"id;a;b\n1;foo;bar\n2;1;2;3;4\n"
result = repair_bytes(raw, delimiter=";")
# Extra-field row with no clear merge candidate is logged unrepairable.
assert 3 in result.unrepairable_lines
class TestRepairBytesCombinedFixes:
def test_bom_plus_nul_plus_smart_quotes(self):
raw = (
b"\xef\xbb\xbf"
b"id,note\n"
b"1,Hel\x00lo \xe2\x80\x9cworld\xe2\x80\x9d\n"
)
result = repair_bytes(raw)
kinds = {a.kind for a in result.actions}
assert {"strip_bom", "strip_nul", "fold_smart_quote"} <= kinds
# Resulting bytes parse cleanly.
df = pd.read_csv(io.BytesIO(result.repaired_bytes))
assert df.iloc[0]["note"] == 'Hello "world"'
# ---------------------------------------------------------------------------
# analyze() — path-based Excel and large-sample edges
# ---------------------------------------------------------------------------
class TestAnalyzeXlsxPath:
def test_excel_path_runs_without_repair(self, tmp_path):
path = tmp_path / "small.xlsx"
df = pd.DataFrame({
"id": ["1", "2"],
"name": [" Alice ", "Bob"], # padding in xlsx
})
df.to_excel(path, index=False, engine="openpyxl")
findings = analyze(path)
ids = {f.id for f in findings}
assert "whitespace_padding" in ids
# Excel skips csv_* findings — no pre-parse repair on xlsx.
assert not any(i.startswith("csv_") for i in ids)
class TestAnalyzeSampleRowsEdge:
def test_sample_rows_larger_than_df(self):
df = pd.DataFrame({"x": [" pad ", "clean"]})
# sample_rows=1000 but df has only 2 rows; must not crash.
findings = analyze(df, sample_rows=1000)
assert any(f.id == "whitespace_padding" for f in findings)
class TestAnalyzeMidCellBom:
def test_bom_inside_cell_treated_as_zero_width(self):
df = pd.DataFrame({"name": ["Hello"]})
findings = analyze(df)
assert any(f.id == "zero_width_or_invisible" for f in findings)
# ---------------------------------------------------------------------------
# findings_by_tool — edge cases
# ---------------------------------------------------------------------------
class TestFindingsByToolEdges:
def test_empty_list_returns_empty_dict(self):
assert findings_by_tool([]) == {}
def test_only_toolless_findings_returns_empty_dict(self):
from src.core.analyze import Finding
# Construct a Finding with no tool — like csv_unrepairable_rows.
f = Finding(
id="x", severity="info", tool="", count=1,
description="d",
)
assert findings_by_tool([f]) == {}
# ---------------------------------------------------------------------------
# Known gap: collapse_whitespace on numeric/date/phone-shaped cells
# ---------------------------------------------------------------------------
class TestNumericPhoneWhitespaceGap:
"""Spec §4.17: ``collapse_whitespace`` should NOT collapse internal
whitespace in cells that look numeric, dated, or phone-shaped.
Currently unconditional. Marked xfail so the suite tracks the gap
without silently allowing regressions on the cells that *do* get
correctly collapsed.
"""
@pytest.mark.xfail(
reason=(
"Heuristic not yet implemented — collapse_whitespace runs on every "
"string cell, including phone-shaped ones. See TEST-CASES.md §4.17."
),
strict=True,
)
def test_phone_internal_double_space_preserved(self):
df = pd.DataFrame({"phone": ["(555) 123-4567"]}) # double space inside
result = clean_dataframe(df)
# Spec requires the double space to survive because the cell looks
# phone-shaped. Today the cleaner collapses it.
assert result.cleaned_df.iloc[0]["phone"] == "(555) 123-4567"