Adds a top-level test infrastructure layer addressing four needs at once:
a single command to run anything, cross-platform automation, install/e2e
sanity, and zero-config pickup of new fixtures dropped into test-cases/.
Top-level runner — run_tests.py
    python run_tests.py                 # everything (default)
    python run_tests.py --tool dedup    # one tool's tests
    python run_tests.py --unit          # category scopes
    python run_tests.py --e2e           # end-to-end CLI
    python run_tests.py --install       # import / dependency sanity
    python run_tests.py --fixtures      # corpus + dropped-file sweep
    python run_tests.py --coverage      # term-missing report
    python run_tests.py --quick         # skip @pytest.mark.slow
Tools: analyze, cli, config, dedup, io, normalizers, text_clean.
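For orientation, a minimal sketch of how flags like these could be translated into pytest arguments. It is not the actual run_tests.py: the helper name build_pytest_args, the SCOPE_MARKERS mapping, the use of -k for --tool, and the --cov=src target (which needs pytest-cov) are all assumptions made for illustration.

```python
import argparse

TOOLS = ["analyze", "cli", "config", "dedup", "io", "normalizers", "text_clean"]

# Assumed mapping from runner flag to pytest marker expression (hypothetical).
SCOPE_MARKERS = {
    "unit": "not (e2e or install or fixture_sweep)",
    "e2e": "e2e",
    "install": "install",
    "fixtures": "fixture_sweep",
}


def build_pytest_args(argv):
    """Translate runner flags into a pytest argument list."""
    parser = argparse.ArgumentParser(prog="run_tests.py")
    parser.add_argument("--tool", choices=TOOLS)
    for flag in (*SCOPE_MARKERS, "coverage", "quick"):
        parser.add_argument(f"--{flag}", action="store_true")
    ns = parser.parse_args(argv)

    args = []
    if ns.tool:
        # Assumption: the tool name appears in the test ids, so -k can select it.
        args += ["-k", ns.tool]
    # Combine every requested scope into a single -m expression.
    scopes = [marker for flag, marker in SCOPE_MARKERS.items() if getattr(ns, flag)]
    expr = " or ".join(scopes)
    if ns.quick:
        expr = f"({expr}) and not slow" if expr else "not slow"
    if expr:
        args += ["-m", expr]
    if ns.coverage:
        args += ["--cov=src", "--cov-report=term-missing"]
    return args
```

With no flags the list is empty, so pytest falls back to whatever pytest.ini selects, which matches the "everything (default)" behaviour described above.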
Cross-platform — tox.ini
Envs for py310-py313 plus install / e2e / fixtures / coverage / lint.
Forces UTF-8 (PYTHONUTF8=1, PYTHONIOENCODING=utf-8) so identical fixture
bytes parse the same on Linux/macOS/Windows.
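Why the forced encoding matters, as a small illustration: the same fixture bytes decode differently depending on the default codec. The helper name decode_fixture is made up for this example; cp1252 stands in for a typical non-UTF-8 Windows default.

```python
def decode_fixture(raw: bytes, encoding: str) -> str:
    """Decode fixture bytes with an explicit codec instead of the platform default."""
    return raw.decode(encoding, errors="replace")


raw = "caf\u00e9".encode("utf-8")          # b'caf\xc3\xa9'
as_utf8 = decode_fixture(raw, "utf-8")      # what PYTHONUTF8=1 aims for everywhere
as_cp1252 = decode_fixture(raw, "cp1252")   # what a legacy default would produce

# The two bytes of the UTF-8 "e-acute" become two separate cp1252 characters.
print(repr(as_utf8), repr(as_cp1252))
```

Pinning the interpreter to UTF-8 removes that platform dependence for any code path that relies on the default encoding.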
Shared config — pytest.ini
testpaths, python_files conventions, custom markers (slow, e2e, install,
fixture_sweep), warning filters that fail on our own DeprecationWarnings
while tolerating third-party ones.
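The warning policy can be sketched with the standard warnings module. This is an illustration only: the real rules live in pytest.ini's filterwarnings option, and the package name "src" plus the helpers install_filters and emit are assumptions for this example.

```python
import re
import warnings


def install_filters(own_package: str = "src") -> None:
    """Ignore DeprecationWarnings in general, but escalate our own to errors."""
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    # filterwarnings() inserts at the front of the filter list, so this
    # later, more specific rule is checked first.
    warnings.filterwarnings(
        "error",
        category=DeprecationWarning,
        module=re.escape(own_package) + r"(\.|$)",
    )


def emit(module: str) -> str:
    """Issue a DeprecationWarning attributed to `module` and report the outcome."""
    with warnings.catch_warnings():
        warnings.resetwarnings()
        install_filters()
        try:
            warnings.warn_explicit(
                "old API", DeprecationWarning, "demo.py", 1, module=module
            )
        except DeprecationWarning:
            return "error"
        return "ignored"
```

Here emit("src.core.io") reports "error" while emit("thirdparty.lib") reports "ignored", which is the split the shared config is meant to enforce.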
New test layers
- tests/test_install.py: required deps import; project modules import;
  src.core public API surface; CLI --help exits 0; streamlit app.py
  parses as valid Python; run_tests.py --help works.
- tests/test_e2e.py: CLI roundtrips (cli_analyze table + JSON;
  cli_text_clean --apply writes a real file with NBSP/smart-quote
  folded; dedup CLI removes duplicates; run_tests.py self-tests).
- tests/test_fixtures_sweep.py: parametrizes over every CSV/TSV/XLSX
  inside test-cases/ (excluding text-cleaner-corpus/, which has its own
  suite). Each fixture must load through repair_bytes, run analyze()
  cleanly, and survive clean_dataframe() with row/col counts unchanged
  plus idempotency. Drop a CSV in, re-run; no test code changes needed.
- tests/test_gap_coverage.py: closes audit gaps: clean_headers=False
  toggle, repair_bytes with tab/semicolon delimiters, BOM+NUL+smart-quote
  combined-fix scenario, analyze() over an XLSX path, sample_rows
  larger than the DataFrame, mid-cell BOM, findings_by_tool edges, plus
  a strict xfail documenting the known §4.17 numeric/phone whitespace
  heuristic gap.
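The zero-config pickup in the fixture sweep comes down to discovering files at collection time. A rough sketch follows, assuming a helper named discover_fixtures (hypothetical; the real test module may structure this differently).

```python
import tempfile
from pathlib import Path

FIXTURE_SUFFIXES = {".csv", ".tsv", ".xlsx"}
EXCLUDED_DIRS = {"text-cleaner-corpus"}  # has its own suite


def discover_fixtures(root: Path) -> list[Path]:
    """Return every CSV/TSV/XLSX under root, skipping excluded directories."""
    found = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in FIXTURE_SUFFIXES:
            continue
        if EXCLUDED_DIRS & set(path.relative_to(root).parts):
            continue
        found.append(path)
    return found


# Small self-contained demo against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "text-cleaner-corpus").mkdir()
    (root / "a.csv").write_text("id\n1\n")
    (root / "notes.md").write_text("not a fixture")
    (root / "text-cleaner-corpus" / "b.csv").write_text("id\n1\n")
    demo_names = [p.name for p in discover_fixtures(root)]

# In a test module the list would feed parametrization, for example:
#   @pytest.mark.parametrize("fixture", discover_fixtures(Path("test-cases")), ids=str)
```

Because the list is rebuilt on every collection, a newly dropped file becomes a new test case without touching test code.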
Test count
Before: 288 passed + 1 xfailed
After: 475 passed + 2 xfailed (the second xfail is the documented
collapse_whitespace gap on phone-shaped cells; spec §4.17 calls
for a heuristic that hasn't been implemented yet).
Functional gaps surfaced (not fixed in this commit):
- Text cleaner: collapse_whitespace runs unconditionally on every string
cell, including phone/numeric/date-shaped ones. Spec §4.17 requires a
skip heuristic. Captured as strict xfail so the gap stays visible.
- io.read_file does not run pre-parse repair; only analyze() and direct
callers of read_csv_repaired() get it. CLI tool pages and the dedup
CLI miss the safety net.
- Analyzer has no mixed_line_endings detector or near_duplicate_rows
detector; both planned but require additional plumbing.
- GUI tool pages each have their own uploader instead of picking up the
home-page upload through session_state.
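To make the first gap concrete, here is one possible shape of a skip heuristic. It is a sketch under assumptions, not the rule from spec §4.17: the pattern _SHAPED and the function name collapse_whitespace_with_skip are invented for illustration.

```python
import re

# Hypothetical shape test: only digits, whitespace, and common phone/date/number
# punctuation. The actual §4.17 rules are not reproduced here.
_SHAPED = re.compile(r"^[\d\s()+\-./:,]+$")


def collapse_whitespace_with_skip(value: str) -> str:
    """Trim the cell; collapse internal whitespace unless it looks numeric/phone/date."""
    stripped = value.strip()
    if _SHAPED.match(stripped) and any(ch.isdigit() for ch in stripped):
        return stripped  # leave internal spacing untouched
    return re.sub(r"\s+", " ", stripped)
```

Under this sketch a phone-shaped cell keeps its internal double space while ordinary text is still collapsed, which is the behaviour the strict xfail expects once a real heuristic lands.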
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
162 lines
6.3 KiB
Python
"""Tests added to close gaps surfaced by the test audit.

These cover edges that existing suites missed:

- ``CleanOptions.clean_headers=False`` toggle (added but not directly tested).
- ``repair_bytes`` with non-comma delimiters and combined-fix scenarios.
- ``analyze()`` over a path-based Excel file.
- ``analyze()`` with ``sample_rows >= len(df)`` (uses copy(), not head()).
- ``findings_by_tool`` on an empty list.
- BOM that appears mid-cell rather than at file start.

The collapse-whitespace heuristic for numeric/date/phone-shaped cells (spec
§4.17) is *not yet implemented* and is captured here as a known-gap xfail
so it's surfaced rather than silently missing.
"""

from __future__ import annotations

import io

import pandas as pd
import pytest

from src.core.analyze import analyze, findings_by_tool
from src.core.io import RepairAction, repair_bytes
from src.core.text_clean import CleanOptions, clean_dataframe


# ---------------------------------------------------------------------------
# clean_headers toggle
# ---------------------------------------------------------------------------

class TestCleanHeadersToggle:
    def test_default_cleans_headers(self):
        df = pd.DataFrame({" id ": [1], "Email": ["a@b.com"]})
        result = clean_dataframe(df)
        assert list(result.cleaned_df.columns) == ["id", "Email"]

    def test_disable_preserves_dirty_headers(self):
        df = pd.DataFrame({" id ": [1], "Email": ["a@b.com"]})
        result = clean_dataframe(df, CleanOptions(clean_headers=False))
        assert list(result.cleaned_df.columns) == [" id ", "Email"]

    def test_disable_still_cleans_data_cells(self):
        df = pd.DataFrame({"name": [" Alice ", "Bob "]})
        result = clean_dataframe(df, CleanOptions(clean_headers=False))
        assert result.cleaned_df["name"].tolist() == ["Alice", "Bob"]


# ---------------------------------------------------------------------------
# repair_bytes: non-comma delimiters and combined fixes
# ---------------------------------------------------------------------------

class TestRepairBytesDelimiters:
    def test_tab_delimited_smart_quote_fold(self):
        raw = "id\tnote\n1\t“hi”\n".encode("utf-8")
        result = repair_bytes(raw, delimiter="\t")
        text = result.repaired_bytes.decode("utf-8")
        assert "“" not in text and "”" not in text
        assert "\t" in text  # delimiter preserved

    def test_semicolon_delimited_unrepairable_extras(self):
        raw = b"id;a;b\n1;foo;bar\n2;1;2;3;4\n"
        result = repair_bytes(raw, delimiter=";")
        # Extra-field row with no clear merge candidate is logged unrepairable.
        assert 3 in result.unrepairable_lines


class TestRepairBytesCombinedFixes:
    def test_bom_plus_nul_plus_smart_quotes(self):
        raw = (
            b"\xef\xbb\xbf"
            b"id,note\n"
            b"1,Hel\x00lo \xe2\x80\x9cworld\xe2\x80\x9d\n"
        )
        result = repair_bytes(raw)
        kinds = {a.kind for a in result.actions}
        assert {"strip_bom", "strip_nul", "fold_smart_quote"} <= kinds
        # Resulting bytes parse cleanly.
        df = pd.read_csv(io.BytesIO(result.repaired_bytes))
        assert df.iloc[0]["note"] == 'Hello "world"'


# ---------------------------------------------------------------------------
# analyze(): path-based Excel and large-sample edges
# ---------------------------------------------------------------------------

class TestAnalyzeXlsxPath:
    def test_excel_path_runs_without_repair(self, tmp_path):
        path = tmp_path / "small.xlsx"
        df = pd.DataFrame({
            "id": ["1", "2"],
            "name": [" Alice ", "Bob"],  # padding in xlsx
        })
        df.to_excel(path, index=False, engine="openpyxl")
        findings = analyze(path)
        ids = {f.id for f in findings}
        assert "whitespace_padding" in ids
        # Excel skips csv_* findings: no pre-parse repair on xlsx.
        assert not any(i.startswith("csv_") for i in ids)


class TestAnalyzeSampleRowsEdge:
    def test_sample_rows_larger_than_df(self):
        df = pd.DataFrame({"x": [" pad ", "clean"]})
        # sample_rows=1000 but df has only 2 rows; must not crash.
        findings = analyze(df, sample_rows=1000)
        assert any(f.id == "whitespace_padding" for f in findings)


class TestAnalyzeMidCellBom:
    def test_bom_inside_cell_treated_as_zero_width(self):
        # U+FEFF is written as an escape so the invisible character cannot be
        # lost when the file is copied or rendered.
        df = pd.DataFrame({"name": ["Hel\ufefflo"]})
        findings = analyze(df)
        assert any(f.id == "zero_width_or_invisible" for f in findings)


# ---------------------------------------------------------------------------
# findings_by_tool: edge cases
# ---------------------------------------------------------------------------

class TestFindingsByToolEdges:
    def test_empty_list_returns_empty_dict(self):
        assert findings_by_tool([]) == {}

    def test_only_toolless_findings_returns_empty_dict(self):
        from src.core.analyze import Finding
        # Construct a Finding with no tool, like csv_unrepairable_rows.
        f = Finding(
            id="x", severity="info", tool="", count=1,
            description="d",
        )
        assert findings_by_tool([f]) == {}


# ---------------------------------------------------------------------------
# Known gap: collapse_whitespace on numeric/date/phone-shaped cells
# ---------------------------------------------------------------------------

class TestNumericPhoneWhitespaceGap:
    """Spec §4.17: ``collapse_whitespace`` should NOT collapse internal
    whitespace in cells that look numeric, dated, or phone-shaped.

    Currently unconditional. Marked xfail so the suite tracks the gap
    without silently allowing regressions on the cells that *do* get
    correctly collapsed.
    """

    @pytest.mark.xfail(
        reason=(
            "Heuristic not yet implemented: collapse_whitespace runs on every "
            "string cell, including phone-shaped ones. See TEST-CASES.md §4.17."
        ),
        strict=True,
    )
    def test_phone_internal_double_space_preserved(self):
        df = pd.DataFrame({"phone": ["(555)  123-4567"]})  # double space inside
        result = clean_dataframe(df)
        # Spec requires the double space to survive because the cell looks
        # phone-shaped. Today the cleaner collapses it.
        assert result.cleaned_df.iloc[0]["phone"] == "(555)  123-4567"