test: single-command runner, cross-platform automation, fixture auto-discovery

Adds a top-level test infrastructure layer addressing four needs at once:
a single command to run anything, cross-platform automation, install/e2e
sanity, and zero-config pickup of new fixtures dropped into test-cases/.

Top-level runner: run_tests.py
  python run_tests.py               # everything (default)
  python run_tests.py --tool dedup  # one tool's tests
  python run_tests.py --unit        # category scopes
  python run_tests.py --e2e         # end-to-end CLI
  python run_tests.py --install     # import / dependency sanity
  python run_tests.py --fixtures    # corpus + dropped-file sweep
  python run_tests.py --coverage    # term-missing report
  python run_tests.py --quick       # skip @pytest.mark.slow
  Tools: analyze, cli, config, dedup, io, normalizers, text_clean.
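
A minimal sketch of how a runner like this could translate its flags into pytest arguments. This is not the actual run_tests.py: the function name `build_pytest_args`, the `-k` keyword selection for `--tool`, the marker expression used for `--unit`, and the pytest-cov options are all assumptions made for illustration. Only the flag names, the tool list, and the test file names come from the commit message.

```python
import argparse

# Test files named in the commit message; mapping flags to them is an assumption.
SCOPE_PATHS = {
    "e2e": "tests/test_e2e.py",
    "install": "tests/test_install.py",
    "fixtures": "tests/test_fixtures_sweep.py",
}
TOOLS = ["analyze", "cli", "config", "dedup", "io", "normalizers", "text_clean"]


def build_pytest_args(argv: list[str]) -> list[str]:
    """Translate runner flags into a pytest argument list (hypothetical)."""
    parser = argparse.ArgumentParser(prog="run_tests.py")
    parser.add_argument("--tool", choices=TOOLS)
    for flag in ("unit", "e2e", "install", "fixtures", "coverage", "quick"):
        parser.add_argument(f"--{flag}", action="store_true")
    ns = parser.parse_args(argv)

    args: list[str] = []
    markers: list[str] = []
    if ns.tool:
        # Assumption: a tool's tests can be selected by keyword.
        args += ["-k", ns.tool]
    if ns.unit:
        # Assumption: "unit" means everything without a category marker.
        markers.append("not e2e and not install and not fixture_sweep")
    for scope, path in SCOPE_PATHS.items():
        if getattr(ns, scope):
            args.append(path)
    if ns.quick:
        markers.append("not slow")
    if markers:
        # pytest honours only one -m option, so combine the expressions.
        args += ["-m", " and ".join(f"({m})" for m in markers)]
    if ns.coverage:
        # Assumption: coverage is collected through the pytest-cov plugin.
        args += ["--cov=src", "--cov-report=term-missing"]
    return args
```

With no flags the list is empty, which leaves pytest to run everything under its configured testpaths. Merging marker expressions into a single `-m` keeps `--unit --quick` from silently dropping one of the two filters.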

Cross-platform: tox.ini
  Envs for py310-py313 plus install / e2e / fixtures / coverage / lint.
  Forces UTF-8 (PYTHONUTF8=1, PYTHONIOENCODING=utf-8) so identical fixture
  bytes parse the same on Linux/macOS/Windows.
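
A small illustration of why the UTF-8 override matters. Without Python's UTF-8 mode, text I/O falls back to the locale encoding, which on many Windows setups is cp1252. The explicit `decode` calls below stand in for that implicit default, and the sample string is made up for the example.

```python
# The same fixture bytes, decoded two ways.
raw = "caf\u00e9\u00a0au lait".encode("utf-8")  # e-acute and NBSP, 14 bytes

as_utf8 = raw.decode("utf-8")      # what PYTHONUTF8=1 gives on every platform
as_cp1252 = raw.decode("cp1252")   # what a cp1252 locale default would give

# Each two-byte UTF-8 sequence turns into two unrelated characters under
# cp1252, so the decoded text differs and is two characters longer.
print(len(as_utf8), len(as_cp1252))
print(as_utf8 == as_cp1252)
```

A fixture that is clean under UTF-8 can therefore look like it contains mojibake under a different default, which would make analyzer findings platform-dependent.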

Shared config: pytest.ini
  testpaths, python_files conventions, custom markers (slow, e2e, install,
  fixture_sweep), warning filters that fail on our own DeprecationWarnings
  while tolerating third-party ones.
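
The warning policy can be sketched with the standard `warnings` module, whose filters use the same action/category/module fields that pytest's `filterwarnings` option accepts. The regex `src(\.|$)` and the helper name `raises_for` are assumptions for illustration; the actual pytest.ini entries are not shown in this commit.

```python
import warnings

OWN_PACKAGE = r"src(\.|$)"  # assumed pattern for the project's own modules


def raises_for(module_name: str) -> bool:
    """Return True if a DeprecationWarning from module_name becomes an error."""
    with warnings.catch_warnings():
        warnings.resetwarnings()
        # Later filters take precedence, so the broad rule goes in first.
        warnings.filterwarnings("ignore", category=DeprecationWarning)
        warnings.filterwarnings(
            "error", category=DeprecationWarning, module=OWN_PACKAGE
        )
        try:
            # warn_explicit lets the example name the issuing module directly.
            warnings.warn_explicit(
                "old API",
                DeprecationWarning,
                filename=f"{module_name}.py",
                lineno=1,
                module=module_name,
            )
        except DeprecationWarning:
            return True
        return False


print(raises_for("src.core.io"))        # own code: escalated to an error
print(raises_for("pandas.core.frame"))  # third-party: ignored
```

The ordering is the design point: a specific "error" rule layered over a general "ignore" rule means a dependency upgrade cannot break the suite, while deprecated calls in project code still fail fast.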

New test layers
  tests/test_install.py: required deps import; project modules import;
    src.core public API surface; CLI --help exits 0; streamlit app.py
    parses as valid Python; run_tests.py --help works.
  tests/test_e2e.py: CLI roundtrips: cli_analyze table + JSON,
    cli_text_clean --apply writes a real file with NBSP/smart-quote
    folded, dedup CLI removes duplicates, run_tests.py self-tests.
  tests/test_fixtures_sweep.py: parametrizes over every CSV/TSV/XLSX/XLS
    inside test-cases/ (excluding text-cleaner-corpus/, which has its own
    suite). Each fixture must load through repair_bytes, run analyze()
    cleanly, and survive clean_dataframe() with row/column counts
    unchanged; clean_dataframe() must also be idempotent on it. Drop a
    CSV in and re-run; no test code changes needed.
  tests/test_gap_coverage.py: closes audit gaps: clean_headers=False
    toggle, repair_bytes with tab/semicolon delimiters, BOM+NUL+smart-quote
    combined-fix scenario, analyze() over an XLSX path, sample_rows
    larger than the DataFrame, mid-cell BOM, findings_by_tool edges, plus
    a strict xfail documenting the known §4.17 numeric/phone whitespace
    heuristic gap.
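
The schema and idempotency contract that the fixture sweep enforces can be shown on a toy cleaner. `toy_clean` below is a hypothetical stand-in for clean_dataframe, and plain nested lists replace the DataFrame so the sketch has no dependencies.

```python
import re


def toy_clean(rows: list[list[str]]) -> list[list[str]]:
    """Hypothetical cleaner: fold NBSP into spaces, collapse runs, strip."""
    return [
        [re.sub(r"[ \u00a0]+", " ", cell).strip() for cell in row]
        for row in rows
    ]


def check_cleaner(rows: list[list[str]], clean=toy_clean) -> list[list[str]]:
    """Apply the sweep's two contracts to any cleaner with this shape."""
    once = clean(rows)
    twice = clean(once)
    # Schema: same number of rows, same number of cells in every row.
    assert len(once) == len(rows), "row count changed"
    assert all(len(a) == len(b) for a, b in zip(once, rows)), "column count changed"
    # Idempotency: cleaning already-clean data must be a no-op.
    assert once == twice, "clean(clean(x)) != clean(x)"
    return once


print(check_cleaner([["a\u00a0 b ", " c"], ["plain", ""]]))
```

Idempotency is a cheap property to test because it needs no expected-output file: any fixture dropped into the directory can be checked against it without further setup.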

Test count
  Before: 288 passed + 1 xfailed
  After:  475 passed + 2 xfailed (the second xfail is the documented
          collapse_whitespace gap on phone-shaped cells; spec §4.17 calls
          for a heuristic that hasn't been implemented yet).

Functional gaps surfaced (not fixed in this commit):
- Text cleaner: collapse_whitespace runs unconditionally on every string
  cell, including phone/numeric/date-shaped ones. Spec §4.17 requires a
  skip heuristic. Captured as strict xfail so the gap stays visible.
- io.read_file does not run pre-parse repair; only analyze() and direct
  callers of read_csv_repaired() get it. CLI tool pages and the dedup
  CLI miss the safety net.
- Analyzer has no mixed_line_endings detector or near_duplicate_rows
  detector; both planned but require additional plumbing.
- GUI tool pages each have their own uploader instead of picking up the
  home-page upload through session_state.
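
To make the first gap concrete, here is one shape a skip heuristic could take. It is purely illustrative: the text of spec §4.17 is not part of this commit, so the regex, the three-digit threshold, and both function names are assumptions, and date-shaped cells are left out entirely.

```python
import re

# Assumption: a cell is "phone- or numeric-shaped" if it consists only of
# digits, whitespace, and common separators, and has at least three digits.
_PHONE_OR_NUMERIC = re.compile(r"^[\s+()\-.\d]+$")


def looks_phone_or_numeric(cell: str) -> bool:
    digits = sum(ch.isdigit() for ch in cell)
    return digits >= 3 and bool(_PHONE_OR_NUMERIC.match(cell))


def collapse_whitespace_guarded(cell: str) -> str:
    """Collapse whitespace runs, leaving phone/numeric-shaped cells untouched."""
    if looks_phone_or_numeric(cell):
        return cell
    return re.sub(r"\s+", " ", cell)


print(repr(collapse_whitespace_guarded("hello   world")))
print(repr(collapse_whitespace_guarded("+1  (555)  010-0000")))
```

The guard runs before any rewriting, so spacing inside a phone number or a grouped numeric value such as "1 000  000" is preserved while ordinary text is still normalized.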
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tests/test_fixtures_sweep.py
@@ -0,0 +1,156 @@
"""Automated sweep over every fixture in ``test-cases/``.

Drop a new CSV/TSV/XLSX into ``test-cases/`` and the sweep picks it up the
next time pytest runs; no test code changes required. Each fixture goes
through four smoke tests:

1. **Pre-parse repair runs cleanly.** Byte-level repair (BOM, NUL, smart
   quotes, rogue delimiters) must not crash, and the repaired bytes must be
   valid for ``pd.read_csv``.
2. **Analyzer runs cleanly.** ``analyze()`` must produce a list of
   :class:`Finding` objects without raising.
3. **Text cleaner runs cleanly and preserves schema.** Default-config
   ``clean_dataframe`` must not change row count and must return the same
   number of columns it started with.
4. **Text cleaner is idempotent.** Cleaning already-cleaned output must
   change nothing.

The sweep skips files inside ``text-cleaner-corpus/`` because that subdir
has its own dedicated test (``test_corpus.py``) with byte-exact expected
outputs.
"""

from __future__ import annotations

import io
from pathlib import Path

import pandas as pd
import pytest

from src.core.analyze import Finding, analyze
from src.core.io import detect_delimiter, detect_encoding, repair_bytes
from src.core.text_clean import clean_dataframe

TEST_CASES_DIR = Path(__file__).resolve().parent.parent / "test-cases"

# Subdirectories in test-cases/ that are exercised by their own dedicated
# tests. The sweep ignores these so we don't double-test or fight expected
# byte-exact outputs.
_EXCLUDED_SUBDIRS = {"text-cleaner-corpus"}

# File suffixes we know how to load.
_SUPPORTED_SUFFIXES = {".csv", ".tsv", ".xlsx", ".xls"}


def _discover_fixtures() -> list[Path]:
    """Return every fixture file under test-cases/ that the sweep should run.

    Files directly inside test-cases/ are picked up, and every non-excluded
    subdirectory is searched recursively; files in excluded subdirectories
    are not.
    """
    if not TEST_CASES_DIR.is_dir():
        return []
    out: list[Path] = []
    for entry in sorted(TEST_CASES_DIR.iterdir()):
        if entry.is_dir():
            if entry.name in _EXCLUDED_SUBDIRS:
                continue
            for sub in sorted(entry.rglob("*")):
                if sub.is_file() and sub.suffix.lower() in _SUPPORTED_SUFFIXES:
                    out.append(sub)
            continue
        if entry.is_file() and entry.suffix.lower() in _SUPPORTED_SUFFIXES:
            out.append(entry)
    return out


_FIXTURES = _discover_fixtures()


def _fixture_id(path: Path) -> str:
    """Pretty pytest id derived from the filename, keeping subdirs visible."""
    rel = path.relative_to(TEST_CASES_DIR)
    # as_posix() keeps ids identical on Windows and POSIX (forward slashes).
    return rel.as_posix()


# No module-level skipif here: it would also skip
# test_at_least_one_fixture_present, the guard that is meant to fail when
# test-cases/ is empty. With no fixtures, pytest already skips the sweep
# class on its own because its parameter set is empty.
pytestmark = [pytest.mark.fixture_sweep]


def _read_with_repair(path: Path) -> tuple[pd.DataFrame, object | None]:
    """Read *path* with the same robust pipeline analyze() uses.

    Returns ``(df, repair_result)`` where repair_result is None for Excel.
    """
    suffix = path.suffix.lower()
    if suffix in (".xlsx", ".xls"):
        # openpyxl cannot read legacy .xls; let pandas pick the engine there.
        engine = "openpyxl" if suffix == ".xlsx" else None
        df = pd.read_excel(path, dtype=str, keep_default_na=False, engine=engine)
        return df, None
    enc = detect_encoding(path)
    delim = detect_delimiter(path, enc)
    raw = path.read_bytes()
    repair = repair_bytes(raw, encoding=enc, delimiter=delim)
    df = pd.read_csv(
        io.BytesIO(repair.repaired_bytes),
        encoding="utf-8", delimiter=delim,
        dtype=str, keep_default_na=False, on_bad_lines="warn",
    )
    return df, repair


@pytest.mark.parametrize("fixture", _FIXTURES, ids=[_fixture_id(p) for p in _FIXTURES])
class TestFixtureSweep:
    """Smoke tests that every fixture in ``test-cases/`` must pass."""

    def test_repair_and_load(self, fixture: Path) -> None:
        df, _ = _read_with_repair(fixture)
        assert isinstance(df, pd.DataFrame), f"{fixture.name}: did not return a DataFrame"
        assert len(df.columns) >= 1, f"{fixture.name}: zero columns after parse"

    def test_analyze_runs(self, fixture: Path) -> None:
        df, repair = _read_with_repair(fixture)
        findings = analyze(df, repair_result=repair)
        assert isinstance(findings, list)
        for f in findings:
            assert isinstance(f, Finding), (
                f"{fixture.name}: analyze() returned a non-Finding ({type(f)})"
            )

    def test_text_cleaner_preserves_schema(self, fixture: Path) -> None:
        df, _ = _read_with_repair(fixture)
        before_rows = len(df)
        before_cols = len(df.columns)
        result = clean_dataframe(df)
        assert len(result.cleaned_df) == before_rows, (
            f"{fixture.name}: row count changed "
            f"({before_rows} -> {len(result.cleaned_df)})"
        )
        assert len(result.cleaned_df.columns) == before_cols, (
            f"{fixture.name}: column count changed "
            f"({before_cols} -> {len(result.cleaned_df.columns)})"
        )

    def test_text_cleaner_idempotent(self, fixture: Path) -> None:
        df, _ = _read_with_repair(fixture)
        once = clean_dataframe(df).cleaned_df.reset_index(drop=True)
        twice = clean_dataframe(once).cleaned_df.reset_index(drop=True)
        assert once.equals(twice), (
            f"{fixture.name}: clean(clean(x)) != clean(x); cleaner is not idempotent"
        )


def test_at_least_one_fixture_present() -> None:
    """Smoke check: every project should ship at least one fixture so the
    sweep is not silently skipped on a clean checkout. Adjust the threshold
    only if intentionally moving fixtures elsewhere."""
    assert len(_FIXTURES) > 0, (
        "No fixtures found under test-cases/. "
        "Drop a CSV or XLSX file into the directory and re-run."
    )