test: single-command runner, cross-platform automation, fixture auto-discovery

Adds a top-level test infrastructure layer addressing four needs at once:
a single command to run anything, cross-platform automation, install/e2e
sanity, and zero-config pickup of new fixtures dropped into test-cases/.

Top-level runner — run_tests.py
  python run_tests.py                # everything (default)
  python run_tests.py --tool dedup   # one tool's tests
  python run_tests.py --unit         # category scopes
  python run_tests.py --e2e          # end-to-end CLI
  python run_tests.py --install      # import / dependency sanity
  python run_tests.py --fixtures     # corpus + dropped-file sweep
  python run_tests.py --coverage     # term-missing report
  python run_tests.py --quick        # skip @pytest.mark.slow
Tools: analyze, cli, config, dedup, io, normalizers, text_clean.
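
A minimal sketch of how the flags above could translate into pytest
arguments. The function name, the "-k" keyword selection for --tool, and the
definition of --unit as "no scope marker" are assumptions for illustration;
run_tests.py itself may be organised differently. The marker names come from
pytest.ini and the coverage options are the standard pytest-cov ones.

```python
import argparse

TOOLS = ["analyze", "cli", "config", "dedup", "io", "normalizers", "text_clean"]
# Runner flag -> pytest marker (marker names as registered in pytest.ini).
SCOPE_MARKERS = {"e2e": "e2e", "install": "install", "fixtures": "fixture_sweep"}


def build_pytest_args(argv):
    """Translate runner flags into a pytest argument list (illustrative only)."""
    parser = argparse.ArgumentParser(prog="run_tests.py")
    parser.add_argument("--tool", choices=TOOLS)
    parser.add_argument("--unit", action="store_true")
    for flag in SCOPE_MARKERS:
        parser.add_argument(f"--{flag}", action="store_true")
    parser.add_argument("--coverage", action="store_true")
    parser.add_argument("--quick", action="store_true")
    opts = parser.parse_args(argv)

    args = ["tests"]
    if opts.tool:
        # Assumption: a tool's tests can be selected by keyword match.
        args += ["-k", opts.tool]

    selected = [m for flag, m in SCOPE_MARKERS.items() if getattr(opts, flag)]
    if opts.unit:
        # Assumption: "unit" means anything that carries no scope marker.
        selected.append("not (e2e or install or fixture_sweep)")
    expr = " or ".join(selected)
    if opts.quick:
        # --quick narrows whatever was selected by dropping slow tests.
        expr = f"({expr}) and not slow" if expr else "not slow"
    if expr:
        args += ["-m", expr]

    if opts.coverage:
        args += ["--cov=src", "--cov-report=term-missing"]
    return args
```

With no flags the sketch returns just ["tests"], which matches the
"everything (default)" behaviour listed above.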

Cross-platform — tox.ini
  Envs for py310-py313 plus install / e2e / fixtures / coverage / lint.
  Forces UTF-8 (PYTHONUTF8=1, PYTHONIOENCODING=utf-8) so identical fixture
  bytes parse the same on Linux/macOS/Windows.
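
Why the UTF-8 forcing matters, as a small standalone illustration (not code
from this commit): the same bytes decode to different text when the default
encoding is a legacy code page such as cp1252, which is what an unconfigured
Windows environment may fall back to.

```python
import sys

# The same fixture bytes, decoded two ways.
raw = "café".encode("utf-8")

as_utf8 = raw.decode("utf-8")      # what PYTHONUTF8=1 makes the default
as_cp1252 = raw.decode("cp1252")   # a typical legacy Windows code page

print(repr(as_utf8))     # 'café'
print(repr(as_cp1252))   # 'cafÃ©'

# 1 when the interpreter runs in UTF-8 mode (PYTHONUTF8=1 or -X utf8).
print(sys.flags.utf8_mode)
```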

Shared config — pytest.ini
  testpaths, python_files conventions, custom markers (slow, e2e, install,
  fixture_sweep), warning filters that fail on our own DeprecationWarnings
  while tolerating third-party ones.

New test layers
  tests/test_install.py — required deps import; project modules import;
    src.core public API surface; CLI --help exits 0; streamlit app.py
    parses as valid Python; run_tests.py --help works.
  tests/test_e2e.py — CLI roundtrips: cli_analyze table + JSON, cli_text_clean
    --apply writes a real file with NBSP/smart-quote folded, dedup CLI
    removes duplicates, run_tests.py self-tests.
  tests/test_fixtures_sweep.py — parametrizes over every CSV/TSV/XLSX
    inside test-cases/ (excluding text-cleaner-corpus/, which has its own
    suite). Each fixture must: load through repair_bytes, run analyze()
    cleanly, and survive clean_dataframe() with row/col counts unchanged
    plus idempotency. Drop a CSV in, re-run — no test code changes needed.
  tests/test_gap_coverage.py — closes audit gaps: clean_headers=False
    toggle, repair_bytes with tab/semicolon delimiters, BOM+NUL+smart-
    quote combined-fix scenario, analyze() over an XLSX path, sample_rows
    larger than the DataFrame, mid-cell BOM, findings_by_tool edges, plus
    a strict xfail documenting the known §4.17 numeric/phone whitespace
    heuristic gap.
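
The zero-config pickup in tests/test_fixtures_sweep.py could be built on a
discovery helper along these lines. This is a sketch under assumptions: the
helper name and constants are hypothetical, and the real sweep may discover
files differently.

```python
from pathlib import Path

FIXTURE_SUFFIXES = {".csv", ".tsv", ".xlsx"}
EXCLUDED_DIRS = {"text-cleaner-corpus"}  # has its own suite


def discover_fixtures(root):
    """Return every CSV/TSV/XLSX file under root, skipping excluded dirs."""
    root = Path(root)
    found = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in FIXTURE_SUFFIXES:
            continue
        # Skip anything that sits below an excluded directory.
        if EXCLUDED_DIRS & set(path.relative_to(root).parts[:-1]):
            continue
        found.append(path)
    return found
```

A list like this can be passed to pytest.mark.parametrize at collection
time, e.g. parametrize("path", discover_fixtures("test-cases"),
ids=lambda p: p.name), so a newly dropped file becomes a new test case
without touching test code.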

Test count
  Before: 288 passed + 1 xfailed
  After:  475 passed + 2 xfailed (the second xfail is the documented
          collapse_whitespace gap on phone-shaped cells; spec §4.17 calls
          for a heuristic that hasn't been implemented yet).

Functional gaps surfaced (not fixed in this commit):
  - Text cleaner: collapse_whitespace runs unconditionally on every string
    cell, including phone/numeric/date-shaped ones. Spec §4.17 requires a
    skip heuristic. Captured as strict xfail so the gap stays visible.
  - io.read_file does not run pre-parse repair; only analyze() and direct
    callers of read_csv_repaired() get it. GUI tool pages and the dedup
    CLI miss the safety net.
  - Analyzer has no mixed_line_endings detector or near_duplicate_rows
    detector; both are planned but require additional plumbing.
  - GUI tool pages each have their own uploader instead of picking up the
    home-page upload through session_state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:01:06 +00:00
parent a8943f29eb
commit 4687cf87b4
7 changed files with 897 additions and 0 deletions

tests/test_gap_coverage.py

@@ -0,0 +1,161 @@
"""Tests added to close gaps surfaced by the test audit.
These cover edges that existing suites missed:
- ``CleanOptions.clean_headers=False`` toggle (added but not directly tested).
- ``repair_bytes`` with non-comma delimiters and combined-fix scenarios.
- ``analyze()`` over a path-based Excel file.
- ``analyze()`` with ``sample_rows >= len(df)`` (uses copy(), not head()).
- ``findings_by_tool`` on an empty list.
- BOM that appears mid-cell rather than at file start.
The collapse-whitespace heuristic for numeric/date/phone-shaped cells (spec
§4.17) is *not yet implemented* and is captured here as a known-gap xfail
so it's surfaced rather than silently missing.
"""
from __future__ import annotations

import io

import pandas as pd
import pytest

from src.core.analyze import analyze, findings_by_tool
from src.core.io import RepairAction, repair_bytes
from src.core.text_clean import CleanOptions, clean_dataframe

# ---------------------------------------------------------------------------
# clean_headers toggle
# ---------------------------------------------------------------------------
class TestCleanHeadersToggle:
    def test_default_cleans_headers(self):
        df = pd.DataFrame({" id ": [1], "Email": ["a@b.com"]})
        result = clean_dataframe(df)
        assert list(result.cleaned_df.columns) == ["id", "Email"]

    def test_disable_preserves_dirty_headers(self):
        df = pd.DataFrame({" id ": [1], "Email": ["a@b.com"]})
        result = clean_dataframe(df, CleanOptions(clean_headers=False))
        assert list(result.cleaned_df.columns) == [" id ", "Email"]

    def test_disable_still_cleans_data_cells(self):
        df = pd.DataFrame({"name": [" Alice ", "Bob "]})
        result = clean_dataframe(df, CleanOptions(clean_headers=False))
        assert result.cleaned_df["name"].tolist() == ["Alice", "Bob"]

# ---------------------------------------------------------------------------
# repair_bytes — non-comma delimiters and combined fixes
# ---------------------------------------------------------------------------
class TestRepairBytesDelimiters:
    def test_tab_delimited_smart_quote_fold(self):
        raw = "id\tnote\n1\t“hi”\n".encode("utf-8")
        result = repair_bytes(raw, delimiter="\t")
        text = result.repaired_bytes.decode("utf-8")
        # Both curly quotes (U+201C, U+201D) must have been folded away.
        assert "\u201c" not in text and "\u201d" not in text
        assert "\t" in text  # delimiter preserved

    def test_semicolon_delimited_unrepairable_extras(self):
        raw = b"id;a;b\n1;foo;bar\n2;1;2;3;4\n"
        result = repair_bytes(raw, delimiter=";")
        # Extra-field row with no clear merge candidate is logged unrepairable.
        assert 3 in result.unrepairable_lines

class TestRepairBytesCombinedFixes:
    def test_bom_plus_nul_plus_smart_quotes(self):
        raw = (
            b"\xef\xbb\xbf"
            b"id,note\n"
            b"1,Hel\x00lo \xe2\x80\x9cworld\xe2\x80\x9d\n"
        )
        result = repair_bytes(raw)
        kinds = {a.kind for a in result.actions}
        assert {"strip_bom", "strip_nul", "fold_smart_quote"} <= kinds
        # Resulting bytes parse cleanly.
        df = pd.read_csv(io.BytesIO(result.repaired_bytes))
        assert df.iloc[0]["note"] == 'Hello "world"'

# ---------------------------------------------------------------------------
# analyze() — path-based Excel and large-sample edges
# ---------------------------------------------------------------------------
class TestAnalyzeXlsxPath:
    def test_excel_path_runs_without_repair(self, tmp_path):
        path = tmp_path / "small.xlsx"
        df = pd.DataFrame({
            "id": ["1", "2"],
            "name": [" Alice ", "Bob"],  # padding in xlsx
        })
        df.to_excel(path, index=False, engine="openpyxl")
        findings = analyze(path)
        ids = {f.id for f in findings}
        assert "whitespace_padding" in ids
        # Excel skips csv_* findings: there is no pre-parse repair on xlsx.
        assert not any(i.startswith("csv_") for i in ids)

class TestAnalyzeSampleRowsEdge:
    def test_sample_rows_larger_than_df(self):
        df = pd.DataFrame({"x": [" pad ", "clean"]})
        # sample_rows=1000 but df has only 2 rows; must not crash.
        findings = analyze(df, sample_rows=1000)
        assert any(f.id == "whitespace_padding" for f in findings)

class TestAnalyzeMidCellBom:
    def test_bom_inside_cell_treated_as_zero_width(self):
        # U+FEFF placed mid-cell, written as an escape so it stays visible.
        df = pd.DataFrame({"name": ["Hel\ufefflo"]})
        findings = analyze(df)
        assert any(f.id == "zero_width_or_invisible" for f in findings)

# ---------------------------------------------------------------------------
# findings_by_tool — edge cases
# ---------------------------------------------------------------------------
class TestFindingsByToolEdges:
    def test_empty_list_returns_empty_dict(self):
        assert findings_by_tool([]) == {}

    def test_only_toolless_findings_returns_empty_dict(self):
        from src.core.analyze import Finding

        # Construct a Finding with no tool, like csv_unrepairable_rows.
        f = Finding(
            id="x", severity="info", tool="", count=1,
            description="d",
        )
        assert findings_by_tool([f]) == {}

# ---------------------------------------------------------------------------
# Known gap: collapse_whitespace on numeric/date/phone-shaped cells
# ---------------------------------------------------------------------------
class TestNumericPhoneWhitespaceGap:
    """Spec §4.17: ``collapse_whitespace`` should NOT collapse internal
    whitespace in cells that look numeric, dated, or phone-shaped.

    Currently unconditional. Marked xfail so the suite tracks the gap
    without silently allowing regressions on the cells that *do* get
    correctly collapsed.
    """

    @pytest.mark.xfail(
        reason=(
            "Heuristic not yet implemented: collapse_whitespace runs on every "
            "string cell, including phone-shaped ones. See TEST-CASES.md §4.17."
        ),
        strict=True,
    )
    def test_phone_internal_double_space_preserved(self):
        df = pd.DataFrame({"phone": ["(555)  123-4567"]})  # double space inside
        result = clean_dataframe(df)
        # Spec requires the double space to survive because the cell looks
        # phone-shaped. Today the cleaner collapses it.
        assert result.cleaned_df.iloc[0]["phone"] == "(555)  123-4567"