test(junk-corpus): pathological-input stress suite for the analyzer
Build a corpus of 35 deliberately-broken files (empty bytes, NUL
bytes, mojibake, UTF-16 without BOM, mismatched columns, unescaped
quotes, corrupt zip, etc.) and pin the analyzer's stability contract
against them.
Files land in ``test-cases/junk-corpus/test_data/``. The generator
``make_junk_corpus.py`` is deterministic except for two files
(``random_bytes.csv`` and ``png_magic_as_csv.csv``) that draw from
``secrets.token_bytes``; the tests read the committed copies, so those
bytes stay fixed until someone re-runs the generator. README documents
the categories and how to add new shapes.
``tests/test_junk_corpus.py`` parametrizes over every file in the
corpus and asserts:

1. ``_run_analysis_on_upload`` never raises — exceptions must be
   caught and surfaced as a synthetic ``Finding`` with
   severity="error". This was the user-reported crash for
   13_non_latin_scripts.csv that the previous fix in ae9d4a2
   defensively wrapped; the corpus now stops the regression
   from re-landing on a different shape.
2. Every Finding in the result list is well-formed (string id,
   valid severity, non-empty description).
3. A high-risk subset (empty.csv, only_bom.csv, only_nul.csv,
   corrupt_xlsx.xlsx) MUST surface at least one error-level
   Finding — otherwise the GUI would render "no issues found"
   for a structurally broken file.
4. Error-level Finding descriptions are at least 20 chars so the
   UI banner gives the user something to act on.
Also exclude ``junk-corpus`` from ``tests/test_fixtures_sweep.py``
since that sweep is happy-path (round-trip the text cleaner) and
fights with files designed to break it. The contract is enforced
by the dedicated junk-corpus test, not the sweep.
Runtime: 12 s for the junk-corpus tests, 30 s for the full
project suite (was 19 s without these). 2118 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
63
test-cases/junk-corpus/README.md
Normal file
@@ -0,0 +1,63 @@
# Junk Corpus — pathological-input stress tests

This corpus exists to make the upload analyzer prove it can survive any
file a user (or an adversary) might drop on it. Every file under
`test_data/` is deliberately broken in a different way: empty bytes,
NUL bytes, mojibake, UTF-16 without a BOM, mismatched columns,
unescaped quotes, corrupt `.xlsx`, and so on.

The contract enforced by `tests/test_junk_corpus.py`:

1. `_run_analysis_on_upload(file)` MUST NOT raise. Errors are caught
   and surfaced as a synthetic `Finding` with severity `"error"`.
2. The return is always a `list[Finding]` (possibly empty for files
   the analyzer judges clean).
3. A specific subset of files (`empty.csv`, `only_bom.csv`,
   `only_nul.csv`, `corrupt_xlsx.xlsx`) MUST produce at least one
   error-level Finding so the GUI shows a red banner instead of
   silently rendering "no issues found".

## Why this matters

In a multi-file home-page upload, one bad file used to bubble a
Python traceback up through the page chrome and kill every other
file's analysis. The defensive wrap in `_run_analysis_on_upload` plus
this stress test together prevent that regression.

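A minimal sketch of the wrap-and-surface pattern described above, under stated assumptions: `Finding` here is a stand-in dataclass with only the three fields the tests check (`id`, `severity`, `description`), and `safe_analyze`, `analyze_all`, `strict_analyzer`, and the `upload_error` id are hypothetical names. It is not the project's actual `_run_analysis_on_upload`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Finding:
    """Stand-in for the project's Finding; only the fields the tests check."""
    id: str
    severity: str  # "info", "warn", or "error"
    description: str


def safe_analyze(name: str, data: bytes,
                 analyze: Callable[[bytes], list[Finding]]) -> list[Finding]:
    """Run ``analyze`` and convert any exception into one error Finding."""
    try:
        return analyze(data)
    except Exception as exc:  # deliberately broad: nothing may escape
        return [Finding(
            id="upload_error",  # hypothetical id
            severity="error",
            description=f"Could not analyze {name!r}: {type(exc).__name__}: {exc}",
        )]


def analyze_all(files: dict[str, bytes],
                analyze: Callable[[bytes], list[Finding]]) -> dict[str, list[Finding]]:
    """Analyze every file; one bad file cannot abort the others."""
    return {name: safe_analyze(name, data, analyze) for name, data in files.items()}


def strict_analyzer(data: bytes) -> list[Finding]:
    """Toy analyzer that raises on empty input, like an unguarded parser."""
    if not data:
        raise ValueError("no bytes to parse")
    return []


results = analyze_all({"empty.csv": b"", "ok.csv": b"id\n1\n"}, strict_analyzer)
```

In this sketch the empty file yields a single error-level entry while the healthy file still gets its own (empty) result list, which is the behavior the multi-file upload needs.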
## Regenerating the corpus

```bash
python test-cases/junk-corpus/make_junk_corpus.py
```

The generator writes 35 files into `test_data/`. They are small
(< 100 KB each) and committed to the repo so the stress test runs
without depending on a regenerate step.

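The generator draws `random_bytes.csv` and the tail of `png_magic_as_csv.csv` from `secrets.token_bytes`, so a regenerate rewrites those two files with fresh bytes. If byte-for-byte reproducible regeneration is ever wanted, one option is a seeded PRNG. This is a sketch of an alternative, not what the generator currently does; `seeded_bytes` and the seed value are hypothetical.

```python
import random


def seeded_bytes(n: int, seed: int = 20240101) -> bytes:
    """Return ``n`` pseudo-random bytes that are identical on every run.

    Uses a private ``random.Random`` instance so the global PRNG state is
    untouched. ``Random.randbytes`` requires Python 3.9 or newer.
    """
    return random.Random(seed).randbytes(n)


first = seeded_bytes(2048)
second = seeded_bytes(2048)
```

A seeded stream is fine for test fixtures but should not be used anywhere the bytes need to be unpredictable.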
## Adding a new pathological shape

1. Add a `write(...)` call to `make_junk_corpus.py`.
2. Re-run that script to materialize the file on disk.
3. (Optional) Add the filename to `_MUST_BE_ERROR` in
   `tests/test_junk_corpus.py` if "no findings" would be a silent
   failure for that shape.

## What's already covered

| Category | Files |
|---|---|
| Empty / near-empty | `empty.csv`, `only_whitespace.csv`, `only_bom.csv`, `only_nul.csv`, `just_newlines.csv`, `header_only.csv` |
| Random / binary garbage | `random_bytes.csv`, `png_magic_as_csv.csv` |
| Truncated or huge | `truncated_mid_row.csv`, `one_huge_line.csv`, `massive_columns.csv`, `single_column.csv` |
| Wrong delimiter | `tsv_as_csv.csv`, `mixed_delimiters.csv` |
| Encoding chaos | `utf16_le_no_bom.csv`, `utf16_be_with_bom.csv`, `utf32_le.csv`, `mojibake.csv`, `invalid_utf8.csv`, `cp1252_smart_quotes.csv` |
| Quoting / shape | `unescaped_quotes.csv`, `embedded_newlines.csv`, `mismatched_columns.csv`, `duplicate_headers.csv`, `empty_header_names.csv`, `trailing_commas.csv` |
| Content | `all_nulls.csv`, `very_wide_cell.csv`, `all_same_row.csv` |
| Extension confusion | `no_extension`, `weird_extension.foo`, `double_extension.csv.txt` |
| Excel pathologies | `corrupt_xlsx.xlsx`, `excel_empty.xlsx`, `excel_header_only.xlsx` |

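To make the "Encoding chaos" row concrete, the sketch below decodes a few of those byte shapes with the wrong codec, using only the standard library. It illustrates what a loader is up against; it makes no claim about how the analyzer itself detects encodings.

```python
text = "café"

# Latin-1 bytes are not valid UTF-8: a strict decode raises.
latin1 = text.encode("latin-1")  # b'caf\xe9'
try:
    latin1.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lossy decode substitutes U+FFFD for the undecodable byte.
lossy = latin1.decode("utf-8", errors="replace")

# Classic mojibake: UTF-8 bytes read back as Latin-1.
mojibake = text.encode("utf-8").decode("latin-1")

# UTF-16-LE without a BOM looks like ASCII interleaved with NUL bytes.
utf16 = "id".encode("utf-16-le")

print(strict_ok, repr(lossy), repr(mojibake), utf16)
```

Each variable corresponds to a different failure a loader can hit: a hard decode error, silent replacement characters, plausible-looking but wrong text, and embedded NUL bytes.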
## Manually loading a junk file in the GUI

The files are real on-disk artifacts. Drag any of them into the home
page uploader to verify the GUI renders a sensible error (or clean
findings, for files the analyzer is OK with) instead of crashing.
231
test-cases/junk-corpus/make_junk_corpus.py
Normal file
@@ -0,0 +1,231 @@
"""Generate a corpus of pathological files for stress-testing the upload
analyzer.

Each file in ``test_data/`` is deliberately broken in a different way:
empty bytes, NUL bytes, mojibake, UTF-16 without BOM, mismatched columns,
unescaped quotes, etc. The goal is to make sure ``_run_analysis_on_upload``
returns a clean error Finding (never a Python traceback) for any of them,
in any combination, on every operating system the GUI ships on.

Run::

    python test-cases/junk-corpus/make_junk_corpus.py

The matching pytest at ``tests/test_junk_corpus.py`` iterates every file
in ``test_data/`` and asserts the analyzer either returns findings or an
error Finding — never raises.
"""

from __future__ import annotations

import io
import os
import secrets
import struct
import zipfile
from pathlib import Path


_HERE = Path(__file__).resolve().parent
_OUT = _HERE / "test_data"


def write(name: str, data: bytes) -> None:
    """Write *data* to ``test_data/name`` and report the size."""
    path = _OUT / name
    path.write_bytes(data)
    print(f" {name:<40} {len(data):>10} bytes")


def _valid_xlsx_bytes(*, sheet_xml: str) -> bytes:
    """Build a minimal but valid .xlsx (zip with the required parts).

    ``sheet_xml`` is the inner ``<sheetData>`` content; the rest of the
    workbook scaffolding is filled in around it. Good enough for pandas
    to load.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr(
            "[Content_Types].xml",
            '<?xml version="1.0"?>'
            '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
            '<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
            '<Default Extension="xml" ContentType="application/xml"/>'
            '<Override PartName="/xl/workbook.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>'
            '<Override PartName="/xl/worksheets/sheet1.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/>'
            "</Types>",
        )
        z.writestr(
            "_rels/.rels",
            '<?xml version="1.0"?>'
            '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
            '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="xl/workbook.xml"/>'
            "</Relationships>",
        )
        z.writestr(
            "xl/_rels/workbook.xml.rels",
            '<?xml version="1.0"?>'
            '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
            '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet" Target="worksheets/sheet1.xml"/>'
            "</Relationships>",
        )
        z.writestr(
            "xl/workbook.xml",
            '<?xml version="1.0"?>'
            '<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"'
            ' xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">'
            '<sheets><sheet name="Sheet1" sheetId="1" r:id="rId1"/></sheets>'
            "</workbook>",
        )
        z.writestr(
            "xl/worksheets/sheet1.xml",
            '<?xml version="1.0"?>'
            '<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
            f"<sheetData>{sheet_xml}</sheetData>"
            "</worksheet>",
        )
    return buf.getvalue()


def main() -> None:
    _OUT.mkdir(parents=True, exist_ok=True)
    print(f"Writing junk corpus to {_OUT}")

    # ---- Empty / near-empty -------------------------------------------------
    write("empty.csv", b"")
    write("only_whitespace.csv", b" \t\n \n\t \n")
    write("only_bom.csv", b"\xef\xbb\xbf")
    write("only_nul.csv", b"\x00" * 64)
    write("just_newlines.csv", b"\n\n\n\n\n")
    write("header_only.csv", b"id,name,note\n")

    # ---- Random / binary garbage -------------------------------------------
    write("random_bytes.csv", secrets.token_bytes(2048))
    # Bytes that look like a PNG signature plus garbage; would mislead any
    # naive file-type sniffer.
    write("png_magic_as_csv.csv", b"\x89PNG\r\n\x1a\n" + secrets.token_bytes(512))

    # ---- Truncated / structurally damaged ----------------------------------
    write(
        "truncated_mid_row.csv",
        b"id,name,note\n1,alice,hello\n2,bob,wor",  # row 2 ends mid-cell
    )
    write(
        "one_huge_line.csv",
        b"a," * 5_000,  # 10KB single line, no newline anywhere
    )
    write(
        "massive_columns.csv",
        (",".join(f"c{i}" for i in range(500)) + "\n"
         + ",".join("x" for _ in range(500)) + "\n").encode(),
    )
    write(
        "single_column.csv",
        b"\n".join([b"id"] + [str(i).encode() for i in range(20)]) + b"\n",
    )

    # ---- Wrong / misleading delimiter --------------------------------------
    write(
        "tsv_as_csv.csv",
        b"id\tname\tnote\n1\talice\thi\n2\tbob\tworld\n",
    )
    write(
        "mixed_delimiters.csv",
        b"id,name\tnote;extra|tail\n1,alice\thi;x|y\n",
    )

    # ---- Encoding chaos ----------------------------------------------------
    sample_text = "id,name,note\n1,café,hello\n2,naïve,world\n"
    write("utf16_le_no_bom.csv", sample_text.encode("utf-16-le"))
    write("utf16_be_with_bom.csv", b"\xfe\xff" + sample_text.encode("utf-16-be"))
    write("utf32_le.csv", sample_text.encode("utf-32-le"))
    # Latin-1 bytes: é and ï become the single bytes 0xE9 / 0xEF, which are
    # invalid as UTF-8 here (a strict decode fails; a lossy one garbles them).
    write("mojibake.csv", sample_text.encode("latin-1"))
    # Bytes that can never appear in valid UTF-8 (0xFF, 0xFE, 0xFD)
    write("invalid_utf8.csv", b"id,name\n1,\xff\xfe\xfd,hello\n")
    # cp1252-encoded smart quotes in column values. cp1252 ascribes
    # smart-quote glyphs to bytes 0x91-0x94; the surrounding ASCII +
    # accented "é" is just there to keep the value realistic.
    write(
        "cp1252_smart_quotes.csv",
        b"id,quote\n1,"
        + "café ".encode("cp1252")
        + b"\x93smart\x94 \x91quote\x92"
        + b"\n",
    )

    # ---- Quoting and field-shape pathologies -------------------------------
    write(
        "unescaped_quotes.csv",
        b'id,note\n1,"this has " unescaped quote"\n2,"normal"\n',
    )
    write(
        "embedded_newlines.csv",
        b'id,note\n1,"line one\nline two"\n2,"single line"\n',
    )
    write(
        "mismatched_columns.csv",
        b"id,name,note\n1,alice,hi\n2,bob\n3,carol,hi,extra,fields\n",
    )
    write(
        "duplicate_headers.csv",
        b"col,col,col\n1,2,3\n4,5,6\n",
    )
    write(
        "empty_header_names.csv",
        b",,,\n1,2,3,4\n5,6,7,8\n",
    )
    write(
        "trailing_commas.csv",
        b"id,name,note,\n1,alice,hi,\n2,bob,wo,\n",
    )

    # ---- Content pathologies ----------------------------------------------
    write(
        "all_nulls.csv",
        b"id,name,note\nNULL,NULL,NULL\nN/A,NA,(null)\nNone,nan,?\n",
    )
    write(
        "very_wide_cell.csv",
        b'id,blob\n1,"' + b"x" * 10_000 + b'"\n',
    )
    write(
        "all_same_row.csv",
        b"id,name,note\n" + b"1,alice,hello\n" * 100,
    )

    # ---- Extension confusion ----------------------------------------------
    write("no_extension", b"id,name,note\n1,alice,hi\n")
    write(
        "weird_extension.foo",
        b"id,name,note\n1,alice,hi\n",
    )
    write(
        "double_extension.csv.txt",
        b"id,name,note\n1,alice,hi\n",
    )

    # ---- Excel-specific pathologies ----------------------------------------
    # Not a real zip — pandas/openpyxl should error cleanly.
    write("corrupt_xlsx.xlsx", b"PK\x03\x04 not really a zip file")
    # Valid xlsx with an entirely empty sheet.
    write("excel_empty.xlsx", _valid_xlsx_bytes(sheet_xml=""))
    # Valid xlsx with one row of headers and no data.
    write(
        "excel_header_only.xlsx",
        _valid_xlsx_bytes(
            sheet_xml=(
                '<row r="1">'
                '<c r="A1" t="inlineStr"><is><t>id</t></is></c>'
                '<c r="B1" t="inlineStr"><is><t>name</t></is></c>'
                "</row>"
            ),
        ),
    )

    print(f"\nWrote {len(list(_OUT.iterdir()))} files.")


if __name__ == "__main__":
    main()
4
test-cases/junk-corpus/test_data/all_nulls.csv
Normal file
@@ -0,0 +1,4 @@
id,name,note
NULL,NULL,NULL
N/A,NA,(null)
None,nan,?
101
test-cases/junk-corpus/test_data/all_same_row.csv
Normal file
101
test-cases/junk-corpus/test_data/all_same_row.csv
Normal file
@@ -0,0 +1,101 @@
|
||||
id,name,note
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
1,alice,hello
|
||||
|
1
test-cases/junk-corpus/test_data/corrupt_xlsx.xlsx
Normal file
@@ -0,0 +1 @@
PK not really a zip file
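The `PK\x03\x04` prefix is the zip local-file-header signature, but nothing after it forms a valid archive. As a small illustration using only the standard library (and not the pandas/openpyxl path the analyzer takes), `zipfile` rejects these bytes:

```python
import io
import zipfile

data = b"PK\x03\x04 not really a zip file"

# is_zipfile looks for the end-of-central-directory record, which is absent.
looks_like_zip = zipfile.is_zipfile(io.BytesIO(data))

# Opening the archive fails with BadZipFile rather than returning garbage.
try:
    zipfile.ZipFile(io.BytesIO(data))
    opened = True
except zipfile.BadZipFile:
    opened = False
```

A sniffer that only checks the first four bytes would accept this file, which is what makes it a useful corpus entry.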
2
test-cases/junk-corpus/test_data/cp1252_smart_quotes.csv
Normal file
@@ -0,0 +1,2 @@
id,quote
1,café “smart” ‘quote’
2
test-cases/junk-corpus/test_data/double_extension.csv.txt
Normal file
@@ -0,0 +1,2 @@
id,name,note
1,alice,hi
3
test-cases/junk-corpus/test_data/duplicate_headers.csv
Normal file
@@ -0,0 +1,3 @@
col,col,col
1,2,3
4,5,6
4
test-cases/junk-corpus/test_data/embedded_newlines.csv
Normal file
@@ -0,0 +1,4 @@
id,note
1,"line one
line two"
2,"single line"
0
test-cases/junk-corpus/test_data/empty.csv
Normal file
3
test-cases/junk-corpus/test_data/empty_header_names.csv
Normal file
@@ -0,0 +1,3 @@
,,,
1,2,3,4
5,6,7,8
BIN
test-cases/junk-corpus/test_data/excel_empty.xlsx
Normal file
Binary file not shown.
BIN
test-cases/junk-corpus/test_data/excel_header_only.xlsx
Normal file
Binary file not shown.
1
test-cases/junk-corpus/test_data/header_only.csv
Normal file
@@ -0,0 +1 @@
id,name,note
2
test-cases/junk-corpus/test_data/invalid_utf8.csv
Normal file
@@ -0,0 +1,2 @@
id,name
1,ÿþý,hello
5
test-cases/junk-corpus/test_data/just_newlines.csv
Normal file
@@ -0,0 +1,5 @@
2
test-cases/junk-corpus/test_data/massive_columns.csv
Normal file
@@ -0,0 +1,2 @@
c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,c17,c18,c19,c20,c21,c22,c23,c24,c25,c26,c27,c28,c29,c30,c31,c32,c33,c34,c35,c36,c37,c38,c39,c40,c41,c42,c43,c44,c45,c46,c47,c48,c49,c50,c51,c52,c53,c54,c55,c56,c57,c58,c59,c60,c61,c62,c63,c64,c65,c66,c67,c68,c69,c70,c71,c72,c73,c74,c75,c76,c77,c78,c79,c80,c81,c82,c83,c84,c85,c86,c87,c88,c89,c90,c91,c92,c93,c94,c95,c96,c97,c98,c99,c100,c101,c102,c103,c104,c105,c106,c107,c108,c109,c110,c111,c112,c113,c114,c115,c116,c117,c118,c119,c120,c121,c122,c123,c124,c125,c126,c127,c128,c129,c130,c131,c132,c133,c134,c135,c136,c137,c138,c139,c140,c141,c142,c143,c144,c145,c146,c147,c148,c149,c150,c151,c152,c153,c154,c155,c156,c157,c158,c159,c160,c161,c162,c163,c164,c165,c166,c167,c168,c169,c170,c171,c172,c173,c174,c175,c176,c177,c178,c179,c180,c181,c182,c183,c184,c185,c186,c187,c188,c189,c190,c191,c192,c193,c194,c195,c196,c197,c198,c199,c200,c201,c202,c203,c204,c205,c206,c207,c208,c209,c210,c211,c212,c213,c214,c215,c216,c217,c218,c219,c220,c221,c222,c223,c224,c225,c226,c227,c228,c229,c230,c231,c232,c233,c234,c235,c236,c237,c238,c239,c240,c241,c242,c243,c244,c245,c246,c247,c248,c249,c250,c251,c252,c253,c254,c255,c256,c257,c258,c259,c260,c261,c262,c263,c264,c265,c266,c267,c268,c269,c270,c271,c272,c273,c274,c275,c276,c277,c278,c279,c280,c281,c282,c283,c284,c285,c286,c287,c288,c289,c290,c291,c292,c293,c294,c295,c296,c297,c298,c299,c300,c301,c302,c303,c304,c305,c306,c307,c308,c309,c310,c311,c312,c313,c314,c315,c316,c317,c318,c319,c320,c321,c322,c323,c324,c325,c326,c327,c328,c329,c330,c331,c332,c333,c334,c335,c336,c337,c338,c339,c340,c341,c342,c343,c344,c345,c346,c347,c348,c349,c350,c351,c352,c353,c354,c355,c356,c357,c358,c359,c360,c361,c362,c363,c364,c365,c366,c367,c368,c369,c370,c371,c372,c373,c374,c375,c376,c377,c378,c379,c380,c381,c382,c383,c384,c385,c386,c387,c388,c389,c390,c391,c392,c393,c394,c395,c396,c397,c398,c399,c400,c401,c402,c403,c404,c405,c406,c407,c408,c409,c410,c411,c412,c413,c414,c415,c416,c417,c418,c419,c420,c421,c422,c423,c424,c425,c426,c427,c428,c429,c430,c431,c432,c433,c434,c435,c436,c437,c438,c439,c440,c441,c442,c443,c444,c445,c446,c447,c448,c449,c450,c451,c452,c453,c454,c455,c456,c457,c458,c459,c460,c461,c462,c463,c464,c465,c466,c467,c468,c469,c470,c471,c472,c473,c474,c475,c476,c477,c478,c479,c480,c481,c482,c483,c484,c485,c486,c487,c488,c489,c490,c491,c492,c493,c494,c495,c496,c497,c498,c499
x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x
4
test-cases/junk-corpus/test_data/mismatched_columns.csv
Normal file
@@ -0,0 +1,4 @@
id,name,note
1,alice,hi
2,bob
3,carol,hi,extra,fields
2
test-cases/junk-corpus/test_data/mixed_delimiters.csv
Normal file
@@ -0,0 +1,2 @@
id,name	note;extra|tail
1,alice	hi;x|y
3
test-cases/junk-corpus/test_data/mojibake.csv
Normal file
@@ -0,0 +1,3 @@
id,name,note
1,café,hello
2,naïve,world
2
test-cases/junk-corpus/test_data/no_extension
Normal file
@@ -0,0 +1,2 @@
id,name,note
1,alice,hi
1
test-cases/junk-corpus/test_data/one_huge_line.csv
Normal file
File diff suppressed because one or more lines are too long
1
test-cases/junk-corpus/test_data/only_bom.csv
Normal file
@@ -0,0 +1 @@
BIN
test-cases/junk-corpus/test_data/only_nul.csv
Normal file
Binary file not shown.
3
test-cases/junk-corpus/test_data/only_whitespace.csv
Normal file
@@ -0,0 +1,3 @@
BIN
test-cases/junk-corpus/test_data/png_magic_as_csv.csv
Normal file
Binary file not shown.
BIN
test-cases/junk-corpus/test_data/random_bytes.csv
Normal file
Binary file not shown.
21
test-cases/junk-corpus/test_data/single_column.csv
Normal file
@@ -0,0 +1,21 @@
id
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
3
test-cases/junk-corpus/test_data/trailing_commas.csv
Normal file
@@ -0,0 +1,3 @@
id,name,note,
1,alice,hi,
2,bob,wo,
3
test-cases/junk-corpus/test_data/truncated_mid_row.csv
Normal file
@@ -0,0 +1,3 @@
id,name,note
1,alice,hello
2,bob,wor
3
test-cases/junk-corpus/test_data/tsv_as_csv.csv
Normal file
@@ -0,0 +1,3 @@
id	name	note
1	alice	hi
2	bob	world
3
test-cases/junk-corpus/test_data/unescaped_quotes.csv
Normal file
@@ -0,0 +1,3 @@
id,note
1,"this has " unescaped quote"
2,"normal"
BIN
test-cases/junk-corpus/test_data/utf16_be_with_bom.csv
Normal file
Binary file not shown.
BIN
test-cases/junk-corpus/test_data/utf16_le_no_bom.csv
Normal file
Binary file not shown.
BIN
test-cases/junk-corpus/test_data/utf32_le.csv
Normal file
Binary file not shown.
2
test-cases/junk-corpus/test_data/very_wide_cell.csv
Normal file
File diff suppressed because one or more lines are too long
2
test-cases/junk-corpus/test_data/weird_extension.foo
Normal file
@@ -0,0 +1,2 @@
id,name,note
1,alice,hi
tests/test_fixtures_sweep.py
@@ -35,8 +35,10 @@ TEST_CASES_DIR = Path(__file__).resolve().parent.parent / "test-cases"
 
 # Subdirectories in test-cases/ that are exercised by their own dedicated
 # tests. The sweep ignores these so we don't double-test or fight expected
-# byte-exact outputs.
-_EXCLUDED_SUBDIRS = {"text-cleaner-corpus"}
+# byte-exact outputs. ``junk-corpus`` is intentionally pathological —
+# files there are designed to break the cleaner/analyzer; the contract is
+# enforced by ``tests/test_junk_corpus.py``, not this happy-path sweep.
+_EXCLUDED_SUBDIRS = {"text-cleaner-corpus", "junk-corpus"}
 
 # File suffixes we know how to load.
 _SUPPORTED_SUFFIXES = {".csv", ".tsv", ".xlsx", ".xls"}
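How the sweep consumes `_EXCLUDED_SUBDIRS` is not shown in this hunk. A minimal sketch of one way such a set could filter fixture discovery, assuming a hypothetical `iter_fixture_files` helper and a throwaway directory tree (neither is taken from `tests/test_fixtures_sweep.py`):

```python
import tempfile
from pathlib import Path

_EXCLUDED_SUBDIRS = {"text-cleaner-corpus", "junk-corpus"}
_SUPPORTED_SUFFIXES = {".csv", ".tsv", ".xlsx", ".xls"}


def iter_fixture_files(root: Path) -> list[Path]:
    """Return loadable files under ``root``, skipping excluded subdirs."""
    found = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in _SUPPORTED_SUFFIXES:
            continue
        # Skip anything that lives below an excluded top-level directory.
        if path.relative_to(root).parts[0] in _EXCLUDED_SUBDIRS:
            continue
        found.append(path)
    return found


# Build a tiny tree to exercise the filter.
root = Path(tempfile.mkdtemp())
for rel in ("happy/ok.csv", "junk-corpus/test_data/empty.csv", "happy/notes.txt"):
    target = root / rel
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(b"id\n1\n")

names = [p.relative_to(root).as_posix() for p in iter_fixture_files(root)]
```

Only `happy/ok.csv` survives in this sketch: the junk-corpus file is excluded by directory and `notes.txt` by suffix.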
156
tests/test_junk_corpus.py
Normal file
@@ -0,0 +1,156 @@
"""Stress-test the upload analyzer against a corpus of pathological files.

Every file under ``test-cases/junk-corpus/test_data/`` is fed through
``_run_analysis_on_upload`` — the same path the GUI takes when a user
drops a file on the home page. The contract under test is:

* The call never raises. Errors must surface as a synthetic ``Finding``
  with severity ``"error"``, not a Python traceback that the page
  chrome bubbles up to the user.
* The return is always a list of :class:`Finding` (possibly empty for
  files the analyzer judges clean).
* Specific high-risk files (empty bytes, corrupt zip, etc.) MUST
  produce at least one error-level Finding so the UI shows a red
  banner rather than silently rendering "no issues found".

To add a new pathological shape:

1. Edit ``test-cases/junk-corpus/make_junk_corpus.py`` to write the new
   file under ``test_data/``.
2. Re-run that script to materialize the file on disk.
3. (Optional) Add the filename to ``_MUST_BE_ERROR`` below if the file
   represents a state where "no findings" would be a silent failure.
"""

from __future__ import annotations

from pathlib import Path

import pytest

from src.core.analyze import Finding
from src.gui.components._legacy import _run_analysis_on_upload


_CORPUS = Path(__file__).resolve().parent.parent / "test-cases" / "junk-corpus" / "test_data"


class _FakeUpload:
    """Duck-type the Streamlit ``UploadedFile`` interface from a path."""

    def __init__(self, path: Path) -> None:
        self.name = path.name
        self._bytes = path.read_bytes()

    def getvalue(self) -> bytes:
        return self._bytes


def _corpus_files() -> list[Path]:
    files = sorted(p for p in _CORPUS.iterdir() if p.is_file())
    if not files:
        raise RuntimeError(
            f"Junk corpus is empty. Run "
            f"`python test-cases/junk-corpus/make_junk_corpus.py` "
            f"to generate {_CORPUS}."
        )
    return files


# Files where "zero findings" would be a silent failure — these are
# structurally broken enough that the analyzer MUST flag them. The
# error-level Finding is what shows the user a red banner instead of
# the misleading "no issues found" success path.
_MUST_BE_ERROR = {
    "empty.csv",
    "only_bom.csv",
    "only_nul.csv",
    "corrupt_xlsx.xlsx",
}


@pytest.mark.parametrize(
    "path",
    _corpus_files(),
    ids=lambda p: p.name,
)
class TestJunkCorpus:
    """Every pathological file must round-trip through the analyzer
    without raising. The error message format is checked separately
    via :func:`TestJunkCorpus.test_error_findings_have_a_description`.
    """

    def test_no_exception_propagates(self, path: Path) -> None:
        upload = _FakeUpload(path)
        # The point of the test: any exception from analyze() / pandas /
        # repair_bytes / openpyxl SHOULD have been caught and turned
        # into an error Finding by ``_run_analysis_on_upload``. If this
        # raises, the home page would crash on this file in production.
        findings = _run_analysis_on_upload(upload)
        assert isinstance(findings, list), (
            f"{path.name}: expected list[Finding], got {type(findings).__name__}"
        )

    def test_findings_are_well_formed(self, path: Path) -> None:
        upload = _FakeUpload(path)
        findings = _run_analysis_on_upload(upload)
        for f in findings:
            assert isinstance(f, Finding), (
                f"{path.name}: non-Finding in result list: {f!r}"
            )
            assert isinstance(f.id, str) and f.id, (
                f"{path.name}: Finding has empty id"
            )
            assert f.severity in ("info", "warn", "error"), (
                f"{path.name}: Finding has bad severity {f.severity!r}"
            )
            assert isinstance(f.description, str) and f.description, (
                f"{path.name}: Finding has empty description"
            )

    def test_must_be_error_files_actually_flag(self, path: Path) -> None:
        if path.name not in _MUST_BE_ERROR:
            pytest.skip(f"{path.name} is allowed to pass clean")
        upload = _FakeUpload(path)
        findings = _run_analysis_on_upload(upload)
        errors = [f for f in findings if f.severity == "error"]
        assert errors, (
            f"{path.name} should surface at least one error-level "
            f"Finding so the UI shows a red banner; got {len(findings)} "
            f"findings (none of severity 'error')."
        )

    def test_error_findings_have_a_description(self, path: Path) -> None:
        """Error findings must carry a description the user can act on.

        For an empty / corrupt file the description is the ONLY thing
        the user sees — it has to name the file or include enough
        context that they can fix the underlying problem.
        """
        upload = _FakeUpload(path)
        findings = _run_analysis_on_upload(upload)
        for f in findings:
            if f.severity != "error":
                continue
            # The synthetic error Findings always interpolate the file
            # name; analyzer-generated errors include the column or a
            # description that mentions what was wrong.
            assert len(f.description) >= 20, (
                f"{path.name}: error Finding description is too short "
                f"to be useful: {f.description!r}"
            )


def test_corpus_contains_expected_shapes() -> None:
    """Sanity-check that the corpus generator wrote the files we rely
    on for the must-be-error matrix. If somebody renames a file in
    ``make_junk_corpus.py`` without updating ``_MUST_BE_ERROR``, this
    test catches it before the per-file parametrization silently
    skips the must-be-error assertion."""
    names = {p.name for p in _corpus_files()}
    missing = _MUST_BE_ERROR - names
    assert not missing, (
        f"_MUST_BE_ERROR references files that don't exist in the "
        f"corpus: {sorted(missing)}. Regenerate the corpus or update "
        f"_MUST_BE_ERROR."
    )