test: single-command runner, cross-platform automation, fixture auto-discovery

Adds a top-level test infrastructure layer addressing four needs at once:
a single command to run anything, cross-platform automation, install/e2e
sanity, and zero-config pickup of new fixtures dropped into test-cases/.

Top-level runner: run_tests.py

    python run_tests.py                # everything (default)
    python run_tests.py --tool dedup   # one tool's tests
    python run_tests.py --unit         # category scopes
    python run_tests.py --e2e          # end-to-end CLI
    python run_tests.py --install      # import / dependency sanity
    python run_tests.py --fixtures     # corpus + dropped-file sweep
    python run_tests.py --coverage     # term-missing report
    python run_tests.py --quick        # skip @pytest.mark.slow

Tools: analyze, cli, config, dedup, io, normalizers, text_clean.

Cross-platform: tox.ini

Envs for py310-py313 plus install / e2e / fixtures / coverage / lint.
Forces UTF-8 (PYTHONUTF8=1, PYTHONIOENCODING=utf-8) so identical fixture
bytes parse the same on Linux/macOS/Windows.

Shared config: pytest.ini

testpaths, python_files conventions, custom markers (slow, e2e, install,
fixture_sweep), warning filters that fail on our own DeprecationWarnings
while tolerating third-party ones.

New test layers

- tests/test_install.py: required deps import; project modules import;
  src.core public API surface; CLI --help exits 0; streamlit app.py
  parses as valid Python; run_tests.py --help works.
- tests/test_e2e.py: CLI roundtrips: cli_analyze table + JSON,
  cli_text_clean --apply writes a real file with NBSP/smart-quote folded,
  dedup CLI removes duplicates, run_tests.py self-tests.
- tests/test_fixtures_sweep.py: parametrizes over every CSV/TSV/XLSX
  inside test-cases/ (excluding text-cleaner-corpus/, which has its own
  suite). Each fixture must: load through repair_bytes, run analyze()
  cleanly, and survive clean_dataframe() with row/col counts unchanged
  plus idempotency. Drop a CSV in, re-run; no test code changes needed.
- tests/test_gap_coverage.py: closes audit gaps: clean_headers=False
  toggle, repair_bytes with tab/semicolon delimiters, BOM+NUL+smart-quote
  combined-fix scenario, analyze() over an XLSX path, sample_rows larger
  than the DataFrame, mid-cell BOM, findings_by_tool edges, plus a strict
  xfail documenting the known §4.17 numeric/phone whitespace heuristic
  gap.

Test count

Before: 288 passed + 1 xfailed
After:  475 passed + 2 xfailed

The second xfail is the documented collapse_whitespace gap on
phone-shaped cells; spec §4.17 calls for a heuristic that hasn't been
implemented yet.

Functional gaps surfaced (not fixed in this commit):

- Text cleaner: collapse_whitespace runs unconditionally on every string
  cell, including phone/numeric/date-shaped ones. Spec §4.17 requires a
  skip heuristic. Captured as strict xfail so the gap stays visible.
- io.read_file does not run pre-parse repair; only analyze() and direct
  callers of read_csv_repaired() get it. CLI tool pages and the dedup CLI
  miss the safety net.
- Analyzer has no mixed_line_endings detector or near_duplicate_rows
  detector; both planned but require additional plumbing.
- GUI tool pages each have their own uploader instead of picking up the
  home-page upload through session_state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
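
The zero-config fixture pickup described for tests/test_fixtures_sweep.py could be sketched roughly as below. This is not the code from that file, which is not shown in this diff; `discover_fixtures`, `EXCLUDED_DIRS`, and `FIXTURE_SUFFIXES` are hypothetical names, and only the test-cases/ layout and the text-cleaner-corpus/ exclusion come from the commit message.

```python
from pathlib import Path

# Assumptions taken from the commit message: fixtures are CSV/TSV/XLSX files
# under test-cases/, and text-cleaner-corpus/ is skipped because it has its
# own suite. The names below are illustrative, not the project's.
EXCLUDED_DIRS = {"text-cleaner-corpus"}
FIXTURE_SUFFIXES = {".csv", ".tsv", ".xlsx"}


def discover_fixtures(root: Path) -> list[Path]:
    """Return every fixture file under root, skipping excluded directories."""
    if not root.is_dir():
        # A missing directory yields no fixtures instead of a collection error.
        return []
    found = []
    for path in root.rglob("*"):
        if not path.is_file() or path.suffix.lower() not in FIXTURE_SUFFIXES:
            continue
        # Only directory components are checked, never the file name itself.
        parent_parts = path.relative_to(root).parts[:-1]
        if any(part in EXCLUDED_DIRS for part in parent_parts):
            continue
        found.append(path)
    # Sorting keeps the order stable across platforms and runs.
    return sorted(found)
```

A list built this way can be handed to `pytest.mark.parametrize`, so pytest generates one test per file at collection time and a newly dropped CSV is picked up on the next run without editing any test code.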
2026-04-29 16:01:06 +00:00
parent a8943f29eb
commit 4687cf87b4
7 changed files with 897 additions and 0 deletions
--- a/tests/test_gap_coverage.py
+++ b/tests/test_gap_coverage.py
@@ -0,0 +1,161 @@
+"""Tests added to close gaps surfaced by the test audit.
+
+These cover edges that existing suites missed:
+
+- ``CleanOptions.clean_headers=False`` toggle (added but not directly tested).
+- ``repair_bytes`` with non-comma delimiters and combined-fix scenarios.
+- ``analyze()`` over a path-based Excel file.
+- ``analyze()`` with ``sample_rows >= len(df)`` (uses copy(), not head()).
+- ``findings_by_tool`` on an empty list.
+- BOM that appears mid-cell rather than at file start.
+
+The collapse-whitespace heuristic for numeric/date/phone-shaped cells (spec
+§4.17) is *not yet implemented* and is captured here as a known-gap xfail
+so it's surfaced rather than silently missing.
+"""
+
+from __future__ import annotations
+
+import io
+
+import pandas as pd
+import pytest
+
+from src.core.analyze import analyze, findings_by_tool
+from src.core.io import RepairAction, repair_bytes
+from src.core.text_clean import CleanOptions, clean_dataframe
+
+
+# ---------------------------------------------------------------------------
+# clean_headers toggle
+# ---------------------------------------------------------------------------
+
+class TestCleanHeadersToggle:
+    def test_default_cleans_headers(self):
+        df = pd.DataFrame({"  id  ": [1], "Email": ["a@b.com"]})
+        result = clean_dataframe(df)
+        assert list(result.cleaned_df.columns) == ["id", "Email"]
+
+    def test_disable_preserves_dirty_headers(self):
+        df = pd.DataFrame({"  id  ": [1], "Email": ["a@b.com"]})
+        result = clean_dataframe(df, CleanOptions(clean_headers=False))
+        assert list(result.cleaned_df.columns) == ["  id  ", "Email"]
+
+    def test_disable_still_cleans_data_cells(self):
+        df = pd.DataFrame({"name": ["  Alice  ", "Bob "]})
+        result = clean_dataframe(df, CleanOptions(clean_headers=False))
+        assert result.cleaned_df["name"].tolist() == ["Alice", "Bob"]
+
+
+# ---------------------------------------------------------------------------
+# repair_bytes — non-comma delimiters and combined fixes
+# ---------------------------------------------------------------------------
+
+class TestRepairBytesDelimiters:
+    def test_tab_delimited_smart_quote_fold(self):
+        raw = "id\tnote\n1\t“hi”\n".encode("utf-8")
+        result = repair_bytes(raw, delimiter="\t")
+        text = result.repaired_bytes.decode("utf-8")
+        assert "“" not in text and "”" not in text
+        assert "\t" in text  # delimiter preserved
+
+    def test_semicolon_delimited_unrepairable_extras(self):
+        raw = b"id;a;b\n1;foo;bar\n2;1;2;3;4\n"
+        result = repair_bytes(raw, delimiter=";")
+        # Extra-field row with no clear merge candidate is logged unrepairable.
+        assert 3 in result.unrepairable_lines
+
+
+class TestRepairBytesCombinedFixes:
+    def test_bom_plus_nul_plus_smart_quotes(self):
+        raw = (
+            b"\xef\xbb\xbf"
+            b"id,note\n"
+            b"1,Hel\x00lo \xe2\x80\x9cworld\xe2\x80\x9d\n"
+        )
+        result = repair_bytes(raw)
+        kinds = {a.kind for a in result.actions}
+        assert {"strip_bom", "strip_nul", "fold_smart_quote"} <= kinds
+        # Resulting bytes parse cleanly.
+        df = pd.read_csv(io.BytesIO(result.repaired_bytes))
+        assert df.iloc[0]["note"] == 'Hello "world"'
+
+
+# ---------------------------------------------------------------------------
+# analyze() — path-based Excel and large-sample edges
+# ---------------------------------------------------------------------------
+
+class TestAnalyzeXlsxPath:
+    def test_excel_path_runs_without_repair(self, tmp_path):
+        path = tmp_path / "small.xlsx"
+        df = pd.DataFrame({
+            "id": ["1", "2"],
+            "name": ["  Alice  ", "Bob"],   # padding in xlsx
+        })
+        df.to_excel(path, index=False, engine="openpyxl")
+        findings = analyze(path)
+        ids = {f.id for f in findings}
+        assert "whitespace_padding" in ids
+        # Excel skips csv_* findings — no pre-parse repair on xlsx.
+        assert not any(i.startswith("csv_") for i in ids)
+
+
+class TestAnalyzeSampleRowsEdge:
+    def test_sample_rows_larger_than_df(self):
+        df = pd.DataFrame({"x": ["  pad  ", "clean"]})
+        # sample_rows=1000 but df has only 2 rows; must not crash.
+        findings = analyze(df, sample_rows=1000)
+        assert any(f.id == "whitespace_padding" for f in findings)
+
+
+class TestAnalyzeMidCellBom:
+    def test_bom_inside_cell_treated_as_zero_width(self):
+        df = pd.DataFrame({"name": ["Hel\ufefflo"]})  # U+FEFF (BOM) mid-cell, escaped so it stays visible
+        findings = analyze(df)
+        assert any(f.id == "zero_width_or_invisible" for f in findings)
+
+
+# ---------------------------------------------------------------------------
+# findings_by_tool — edge cases
+# ---------------------------------------------------------------------------
+
+class TestFindingsByToolEdges:
+    def test_empty_list_returns_empty_dict(self):
+        assert findings_by_tool([]) == {}
+
+    def test_only_toolless_findings_returns_empty_dict(self):
+        from src.core.analyze import Finding
+        # Construct a Finding with no tool — like csv_unrepairable_rows.
+        f = Finding(
+            id="x", severity="info", tool="", count=1,
+            description="d",
+        )
+        assert findings_by_tool([f]) == {}
+
+
+# ---------------------------------------------------------------------------
+# Known gap: collapse_whitespace on numeric/date/phone-shaped cells
+# ---------------------------------------------------------------------------
+
+class TestNumericPhoneWhitespaceGap:
+    """Spec §4.17: ``collapse_whitespace`` should NOT collapse internal
+    whitespace in cells that look numeric, dated, or phone-shaped.
+
+    Currently unconditional. Marked as a strict xfail so the suite tracks
+    the gap: the test is expected to fail today, and once the heuristic
+    lands the unexpected pass fails the run until this marker is removed.
+    """
+
+    @pytest.mark.xfail(
+        reason=(
+            "Heuristic not yet implemented — collapse_whitespace runs on every "
+            "string cell, including phone-shaped ones. See TEST-CASES.md §4.17."
+        ),
+        strict=True,
+    )
+    def test_phone_internal_double_space_preserved(self):
+        df = pd.DataFrame({"phone": ["(555)  123-4567"]})  # double space inside
+        result = clean_dataframe(df)
+        # Spec requires the double space to survive because the cell looks
+        # phone-shaped. Today the cleaner collapses it.
+        assert result.cleaned_df.iloc[0]["phone"] == "(555)  123-4567"
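
For context on the xfail above, one possible shape for the missing §4.17 skip heuristic is sketched here. It is purely illustrative: the spec text is not part of this diff, so the pattern, the seven-digit threshold, and the names `looks_phone_shaped` and `collapse_whitespace_with_skip` are all assumptions, not the project's API.

```python
import re

# Hypothetical rule: a cell is "phone-shaped" when it consists only of digits,
# whitespace, and common phone punctuation, and has at least seven digits.
_PHONE_CHARS = re.compile(r"[\d\s()+\-.]+")
_WHITESPACE_RUN = re.compile(r"\s{2,}")


def looks_phone_shaped(value: str) -> bool:
    """Guess whether a cell holds a phone number (illustrative heuristic)."""
    digit_count = sum(ch.isdigit() for ch in value)
    return digit_count >= 7 and _PHONE_CHARS.fullmatch(value) is not None


def collapse_whitespace_with_skip(value: str) -> str:
    """Collapse runs of whitespace, leaving phone-shaped cells untouched."""
    if looks_phone_shaped(value):
        return value
    return _WHITESPACE_RUN.sub(" ", value)
```

Under these assumptions `"(555)  123-4567"` keeps its double space while `"Alice   Smith"` collapses to a single space. A real implementation would also need to cover the numeric and date shapes the spec mentions.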