test: single-command runner, cross-platform automation, fixture auto-discovery

Adds a top-level test infrastructure layer addressing four needs at once: a
single command to run anything, cross-platform automation, install/e2e
sanity, and zero-config pickup of new fixtures dropped into test-cases/.

Top-level runner: run_tests.py

    python run_tests.py                 # everything (default)
    python run_tests.py --tool dedup    # one tool's tests
    python run_tests.py --unit          # category scopes
    python run_tests.py --e2e           # end-to-end CLI
    python run_tests.py --install       # import / dependency sanity
    python run_tests.py --fixtures      # corpus + dropped-file sweep
    python run_tests.py --coverage      # term-missing report
    python run_tests.py --quick         # skip @pytest.mark.slow

    Tools: analyze, cli, config, dedup, io, normalizers, text_clean.

Cross-platform: tox.ini

    Envs for py310-py313 plus install / e2e / fixtures / coverage / lint.
    Forces UTF-8 (PYTHONUTF8=1, PYTHONIOENCODING=utf-8) so identical fixture
    bytes parse the same on Linux/macOS/Windows.

Shared config: pytest.ini

    testpaths, python_files conventions, custom markers (slow, e2e, install,
    fixture_sweep), warning filters that fail on our own DeprecationWarnings
    while tolerating third-party ones.

New test layers

    tests/test_install.py: required deps import; project modules import;
        src.core public API surface; CLI --help exits 0; streamlit app.py
        parses as valid Python; run_tests.py --help works.
    tests/test_e2e.py: CLI roundtrips: cli_analyze table + JSON,
        cli_text_clean --apply writes a real file with NBSP/smart-quote
        folded, dedup CLI removes duplicates, run_tests.py self-tests.
    tests/test_fixtures_sweep.py: parametrizes over every CSV/TSV/XLSX
        inside test-cases/ (excluding text-cleaner-corpus/, which has its
        own suite). Each fixture must: load through repair_bytes, run
        analyze() cleanly, and survive clean_dataframe() with row/col counts
        unchanged plus idempotency. Drop a CSV in, re-run; no test code
        changes needed.
    tests/test_gap_coverage.py: closes audit gaps: clean_headers=False
        toggle, repair_bytes with tab/semicolon delimiters, BOM+NUL+
        smart-quote combined-fix scenario, analyze() over an XLSX path,
        sample_rows larger than the DataFrame, mid-cell BOM,
        findings_by_tool edges, plus a strict xfail documenting the known
        §4.17 numeric/phone whitespace heuristic gap.

Test count

    Before: 288 passed + 1 xfailed
    After:  475 passed + 2 xfailed (the second xfail is the documented
            collapse_whitespace gap on phone-shaped cells; spec §4.17 calls
            for a heuristic that hasn't been implemented yet).

Functional gaps surfaced (not fixed in this commit):

- Text cleaner: collapse_whitespace runs unconditionally on every string
  cell, including phone/numeric/date-shaped ones. Spec §4.17 requires a skip
  heuristic. Captured as strict xfail so the gap stays visible.
- io.read_file does not run pre-parse repair; only analyze() and direct
  callers of read_csv_repaired() get it. CLI tool pages and the dedup CLI
  miss the safety net.
- Analyzer has no mixed_line_endings detector or near_duplicate_rows
  detector; both planned but require additional plumbing.
- GUI tool pages each have their own uploader instead of picking up the
  home-page upload through session_state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
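The diff shown on this page does not include run_tests.py itself, so the following is only a minimal sketch of how the flags listed in the message could be translated into pytest arguments. The function names, the tests/test_<tool>.py layout, the meaning of --unit, and the use of pytest-cov options are all assumptions made for illustration; only the flag names, tool names, and marker names come from the commit message.

```python
"""Hypothetical flag-to-pytest translation; NOT the real run_tests.py."""
from __future__ import annotations

import argparse

# Tool names are taken from the commit message.
TOOLS = ("analyze", "cli", "config", "dedup", "io", "normalizers", "text_clean")

# CLI flag -> pytest marker (marker names are the ones listed for pytest.ini).
SCOPE_MARKERS = {"e2e": "e2e", "install": "install", "fixtures": "fixture_sweep"}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="run_tests.py")
    parser.add_argument("--tool", choices=TOOLS)
    for scope in ("unit", "e2e", "install", "fixtures"):
        parser.add_argument(f"--{scope}", action="store_true")
    parser.add_argument("--coverage", action="store_true")
    parser.add_argument("--quick", action="store_true")
    return parser


def build_pytest_args(argv: list[str]) -> list[str]:
    """Translate runner flags into a pytest argument list."""
    ns = build_parser().parse_args(argv)
    args: list[str] = []
    marker_terms: list[str] = []

    if ns.tool:
        # Assumed layout: one test module per tool.
        args.append(f"tests/test_{ns.tool}.py")

    selected = [marker for flag, marker in SCOPE_MARKERS.items() if getattr(ns, flag)]
    if selected:
        marker_terms.append("(" + " or ".join(selected) + ")")
    if ns.unit:
        # Assumption: "unit" means every test without a category marker.
        marker_terms.append("not e2e and not install and not fixture_sweep")
    if ns.quick:
        marker_terms.append("not slow")

    if marker_terms:
        args += ["-m", " and ".join(marker_terms)]
    if ns.coverage:
        # These two options are provided by the pytest-cov plugin.
        args += ["--cov=src", "--cov-report=term-missing"]
    return args
```

Under these assumptions a runner would finish with something like `subprocess.call([sys.executable, "-m", "pytest", *build_pytest_args(sys.argv[1:])])`, and running with no flags passes no filters, which matches the "everything (default)" behaviour described above.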
2026-04-29 16:01:06 +00:00
parent a8943f29eb
commit 4687cf87b4
7 changed files with 897 additions and 0 deletions
--- a/tests/test_fixtures_sweep.py
+++ b/tests/test_fixtures_sweep.py
@@ -0,0 +1,156 @@
+"""Automated sweep over every fixture in ``test-cases/``.
+
+Drop a new CSV/TSV/XLSX into ``test-cases/`` and the sweep picks it up the
+next time pytest runs — no test code changes required. Each fixture goes
+through three smoke tests:
+
+1. **Pre-parse repair runs cleanly.** Byte-level repair (BOM, NUL, smart
+   quotes, rogue delimiters) must not crash, and produced bytes must be
+   valid for ``pd.read_csv``.
+2. **Analyzer runs cleanly.** ``analyze()`` must produce a list of
+   :class:`Finding` objects without raising.
+3. **Text cleaner preserves schema and is idempotent.** Default-config
+   ``clean_dataframe`` must keep row and column counts unchanged, and
+   cleaning its own output a second time must change nothing.
+
+The sweep skips files inside ``text-cleaner-corpus/`` because that subdir
+has its own dedicated test (``test_corpus.py``) with byte-exact expected
+outputs.
+"""
+
+from __future__ import annotations
+
+import io
+from pathlib import Path
+
+import pandas as pd
+import pytest
+
+from src.core.analyze import Finding, analyze
+from src.core.io import detect_delimiter, detect_encoding, repair_bytes
+from src.core.text_clean import clean_dataframe
+
+
+TEST_CASES_DIR = Path(__file__).resolve().parent.parent / "test-cases"
+
+# Subdirectories in test-cases/ that are exercised by their own dedicated
+# tests. The sweep ignores these so we don't double-test or fight expected
+# byte-exact outputs.
+_EXCLUDED_SUBDIRS = {"text-cleaner-corpus"}
+
+# File suffixes we know how to load.
+_SUPPORTED_SUFFIXES = {".csv", ".tsv", ".xlsx", ".xls"}
+
+
+def _discover_fixtures() -> list[Path]:
+    """Return every fixture file under test-cases/ that the sweep should run.
+
+    Supported files directly inside test-cases/ are picked up, as are files
+    at any depth inside top-level subdirectories that are not excluded.
+    """
+    if not TEST_CASES_DIR.is_dir():
+        return []
+    out: list[Path] = []
+    for entry in sorted(TEST_CASES_DIR.iterdir()):
+        if entry.is_dir():
+            if entry.name in _EXCLUDED_SUBDIRS:
+                continue
+            for sub in sorted(entry.rglob("*")):
+                if sub.is_file() and sub.suffix.lower() in _SUPPORTED_SUFFIXES:
+                    out.append(sub)
+            continue
+        if entry.is_file() and entry.suffix.lower() in _SUPPORTED_SUFFIXES:
+            out.append(entry)
+    return out
+
+
+_FIXTURES = _discover_fixtures()
+
+
+def _fixture_id(path: Path) -> str:
+    """Pretty pytest id derived from the filename, keeping subdirs visible."""
+    rel = path.relative_to(TEST_CASES_DIR)
+    return rel.as_posix()  # forward slashes on every OS keep ids stable
+
+
+# Only the marker is applied module-wide. A module-level skipif would also
+# skip test_at_least_one_fixture_present below and hide an empty test-cases/.
+# When no fixtures exist, pytest skips the empty parametrize set on its own
+# (default empty_parameter_set_mark), so the class needs no explicit skipif.
+pytestmark = [
+    pytest.mark.fixture_sweep,
+]
+
+
+def _read_with_repair(path: Path) -> tuple[pd.DataFrame, object | None]:
+    """Read *path* with the same robust pipeline analyze() uses.
+
+    Returns ``(df, repair_result)`` where repair_result is None for Excel.
+    """
+    suffix = path.suffix.lower()
+    if suffix in (".xlsx", ".xls"):
+        # Let pandas pick the engine from the suffix; openpyxl cannot read .xls.
+        df = pd.read_excel(path, dtype=str, keep_default_na=False)
+        return df, None
+    enc = detect_encoding(path)
+    delim = detect_delimiter(path, enc)
+    raw = path.read_bytes()
+    repair = repair_bytes(raw, encoding=enc, delimiter=delim)
+    df = pd.read_csv(
+        io.BytesIO(repair.repaired_bytes),
+        encoding="utf-8", delimiter=delim,
+        dtype=str, keep_default_na=False, on_bad_lines="warn",
+    )
+    return df, repair
+
+
+@pytest.mark.parametrize("fixture", _FIXTURES, ids=[_fixture_id(p) for p in _FIXTURES])
+class TestFixtureSweep:
+    """Smoke tests that every fixture in ``test-cases/`` must pass."""
+
+    def test_repair_and_load(self, fixture: Path) -> None:
+        df, _ = _read_with_repair(fixture)
+        assert isinstance(df, pd.DataFrame), f"{fixture.name}: did not return a DataFrame"
+        assert len(df.columns) >= 1, f"{fixture.name}: zero columns after parse"
+
+    def test_analyze_runs(self, fixture: Path) -> None:
+        df, repair = _read_with_repair(fixture)
+        findings = analyze(df, repair_result=repair)
+        assert isinstance(findings, list)
+        for f in findings:
+            assert isinstance(f, Finding), (
+                f"{fixture.name}: analyze() returned a non-Finding ({type(f)})"
+            )
+
+    def test_text_cleaner_preserves_schema(self, fixture: Path) -> None:
+        df, _ = _read_with_repair(fixture)
+        before_rows = len(df)
+        before_cols = len(df.columns)
+        result = clean_dataframe(df)
+        assert len(result.cleaned_df) == before_rows, (
+            f"{fixture.name}: row count changed "
+            f"({before_rows} -> {len(result.cleaned_df)})"
+        )
+        assert len(result.cleaned_df.columns) == before_cols, (
+            f"{fixture.name}: column count changed "
+            f"({before_cols} -> {len(result.cleaned_df.columns)})"
+        )
+
+    def test_text_cleaner_idempotent(self, fixture: Path) -> None:
+        df, _ = _read_with_repair(fixture)
+        once = clean_dataframe(df).cleaned_df.reset_index(drop=True)
+        twice = clean_dataframe(once).cleaned_df.reset_index(drop=True)
+        assert once.equals(twice), (
+            f"{fixture.name}: clean(clean(x)) != clean(x); cleaner is not idempotent"
+        )
+
+
+def test_at_least_one_fixture_present() -> None:
+    """Smoke check: every project should ship at least one fixture so the
+    sweep is not silently skipped on a clean checkout. Adjust the threshold
+    only if intentionally moving fixtures elsewhere."""
+    assert len(_FIXTURES) > 0, (
+        "No fixtures found under test-cases/. "
+        "Drop a CSV or XLSX file into the directory and re-run."
+    )
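The discovery and idempotency checks in this module can be tried in isolation with the standard library alone. The sketch below is not the module's code: `discover`, `toy_clean`, and `is_idempotent` are hypothetical stand-ins (in particular, `toy_clean` is not `clean_dataframe`), and the directory contents are created in a temporary folder purely for illustration.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

SUPPORTED = {".csv", ".tsv", ".xlsx"}
EXCLUDED = {"text-cleaner-corpus"}  # top-level subdirs with their own suite


def discover(root: Path) -> list[Path]:
    """Return supported files under root, skipping excluded top-level subdirs."""
    if not root.is_dir():
        return []
    found: list[Path] = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if rel.parts[0] in EXCLUDED:
            continue
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            found.append(path)
    return found


def toy_clean(cell: str) -> str:
    """Stand-in cleaner: fold NBSP to a space, then collapse whitespace runs."""
    return " ".join(cell.replace("\u00a0", " ").split())


def is_idempotent(cells: list[str]) -> bool:
    """True when cleaning already-cleaned cells changes nothing."""
    once = [toy_clean(c) for c in cells]
    twice = [toy_clean(c) for c in once]
    return once == twice


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "text-cleaner-corpus").mkdir()
    (root / "a.csv").write_text("x\n1\n", encoding="utf-8")
    (root / "sub" / "b.TSV").write_text("x\n1\n", encoding="utf-8")
    (root / "notes.txt").write_text("ignored", encoding="utf-8")
    (root / "text-cleaner-corpus" / "c.csv").write_text("x\n", encoding="utf-8")
    # POSIX-style relative paths give the same ids on every platform.
    ids = [p.relative_to(root).as_posix() for p in discover(root)]
```

Passing a list like `ids` to `pytest.mark.parametrize(..., ids=...)` is what makes each dropped file show up as its own test case; the suffix comparison is lower-cased so a file such as `b.TSV` is still collected.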