test(junk-corpus): pathological-input stress suite for the analyzer

Build a corpus of 35 deliberately-broken files (empty bytes, NUL
bytes, mojibake, UTF-16 without BOM, mismatched columns, unescaped
quotes, corrupt zip, etc.) and pin the analyzer's stability contract
against them.

Files land in ``test-cases/junk-corpus/test_data/``. The generator
``make_junk_corpus.py`` builds every file from fixed literals except
``random_bytes.csv`` and ``png_magic_as_csv.csv``, which draw from
``secrets.token_bytes``. Regenerating rewrites those two files, so the
committed bytes are the stable reference the tests run against.
README documents the categories and how to add new shapes.

``tests/test_junk_corpus.py`` parametrizes over every file in the
corpus and asserts:

1. ``_run_analysis_on_upload`` never raises — exceptions must be
   caught and surfaced as a synthetic ``Finding`` with
   severity="error". This was the user-reported crash for
   13_non_latin_scripts.csv that the previous fix in ae9d4a2
   defensively wrapped; the corpus now stops the regression
   from re-landing on a different shape.
2. Every Finding in the result list is well-formed (string id,
   valid severity, non-empty description).
3. A high-risk subset (empty.csv, only_bom.csv, only_nul.csv,
   corrupt_xlsx.xlsx) MUST surface at least one error-level
   Finding — otherwise the GUI would render "no issues found"
   for a structurally broken file.
4. Error-level Finding descriptions are at least 20 chars so the
   UI banner gives the user something to act on.

Also exclude ``junk-corpus`` from ``tests/test_fixtures_sweep.py``
since that sweep is happy-path (round-trip the text cleaner) and
fights with files designed to break it. The contract is enforced
by the dedicated junk-corpus test, not the sweep.

Runtime: 12 s for the junk-corpus tests, 30 s for the full
project suite (was 19 s without these). 2118 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:35:22 +00:00
parent ae9d4a2db5
commit 696996c119
39 changed files with 637 additions and 2 deletions


@@ -0,0 +1,63 @@
# Junk Corpus — pathological-input stress tests
This corpus exists to make the upload analyzer prove it can survive any
file a user (or an adversary) might drop on it. Every file under
`test_data/` is deliberately broken in a different way: empty bytes,
NUL bytes, mojibake, UTF-16 without a BOM, mismatched columns,
unescaped quotes, corrupt `.xlsx`, and so on.

The contract enforced by `tests/test_junk_corpus.py`:

1. `_run_analysis_on_upload(file)` MUST NOT raise. Errors are caught
   and surfaced as a synthetic `Finding` with severity `"error"`.
2. The return is always a `list[Finding]` (possibly empty for files
   the analyzer judges clean).
3. A specific subset of files (`empty.csv`, `only_bom.csv`,
   `only_nul.csv`, `corrupt_xlsx.xlsx`) MUST produce at least one
   error-level Finding so the GUI shows a red banner instead of
   silently rendering "no issues found".
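
The first contract point can be sketched as a small wrapper. This is an illustration only: `Finding` here is a hypothetical stand-in dataclass, and `safe_analyze` and `strict_utf8_analyzer` are made-up names, not the project's real `_run_analysis_on_upload`.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Finding:
    """Hypothetical stand-in for the project's Finding type."""
    id: str
    severity: str  # "info", "warn", or "error"
    description: str


def safe_analyze(
    name: str,
    data: bytes,
    analyze: Callable[[bytes], list[Finding]],
) -> list[Finding]:
    """Run *analyze* and convert any exception into one error Finding."""
    try:
        return list(analyze(data))
    except Exception as exc:  # deliberately broad: nothing may escape
        return [
            Finding(
                id="analysis_failed",
                severity="error",
                description=f"Could not analyze {name!r}: {type(exc).__name__}: {exc}",
            )
        ]


def strict_utf8_analyzer(data: bytes) -> list[Finding]:
    """Toy analyzer (assumption) that raises on undecodable input."""
    data.decode("utf-8")  # raises UnicodeDecodeError on invalid bytes
    return []


result = safe_analyze("invalid_utf8.csv", b"id\n\xff\xfe\n", strict_utf8_analyzer)
print(result[0].severity, "-", result[0].description)
```

Putting the file name into the description keeps the message long enough for the user to act on.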
## Why this matters
In a multi-file home-page upload, one bad file used to bubble a
Python traceback up through the page chrome and kill every other
file's analysis. The defensive wrap in `_run_analysis_on_upload` plus
this stress test together prevent that regression.
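
A minimal sketch of the isolation described above, assuming a toy `analyze_one` that raises on empty input. Both function names are hypothetical and the real GUI code may be structured differently.

```python
from __future__ import annotations


def analyze_one(name: str, data: bytes) -> list[str]:
    """Toy per-file analysis (assumption): raises on empty input."""
    if not data:
        raise ValueError("file is empty")
    return []


def analyze_batch(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Analyze every file; a failure in one never aborts the others."""
    results: dict[str, list[str]] = {}
    for name, data in files.items():
        try:
            results[name] = analyze_one(name, data)
        except Exception as exc:
            # Record the failure against this file only and keep going.
            results[name] = [f"error: could not analyze {name}: {exc}"]
    return results


batch = {
    "good.csv": b"id,name\n1,alice\n",
    "empty.csv": b"",
    "also_good.csv": b"id\n1\n",
}
report = analyze_batch(batch)
for name, messages in report.items():
    print(name, messages)
```

Every uploaded file gets an entry in the report, including those that come after the broken one.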
## Regenerating the corpus
```bash
python test-cases/junk-corpus/make_junk_corpus.py
```
The generator writes 35 files into `test_data/`. Each is small
(< 100 KB) and committed to the repo, so the stress test runs
without depending on a regenerate step.
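
The generator draws `random_bytes.csv` and the tail of `png_magic_as_csv.csv` from `secrets.token_bytes`, so a regeneration will presumably rewrite those two files with different bytes. If byte-stable regeneration is ever wanted, one option is a seeded generator. This is not what the script currently does. `seeded_bytes` and the seed values below are illustrative assumptions, and `randbytes` needs Python 3.9 or newer.

```python
import random

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def seeded_bytes(n: int, seed: int) -> bytes:
    """Return n pseudo-random bytes that are identical on every run."""
    # random.Random(seed) is a private generator, so the global random
    # state is left untouched.
    return random.Random(seed).randbytes(n)


random_blob = seeded_bytes(2048, seed=1)
fake_png = PNG_MAGIC + seeded_bytes(512, seed=2)
print(len(random_blob), len(fake_png))
```

The trade-off is that seeded output is predictable. That is fine for test fixtures but unsuitable for anything security-related.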
## Adding a new pathological shape

1. Add a `write(...)` call to `make_junk_corpus.py`.
2. Re-run that script to materialize the file on disk.
3. (Optional) Add the filename to `_MUST_BE_ERROR` in
   `tests/test_junk_corpus.py` if "no findings" would be a silent
   failure for that shape.
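
As a rough sketch of step 1, here is a new shape written into a temporary directory. The file name `bom_then_nul.csv` is invented for illustration. This `write` takes an explicit output directory, whereas the generator's own helper takes only a name and the bytes.

```python
import tempfile
from pathlib import Path


def write(out_dir: Path, name: str, data: bytes) -> Path:
    """Write *data* to out_dir/name and return the resulting path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / name
    path.write_bytes(data)
    return path


# Hypothetical new shape: a UTF-8 BOM followed by nothing but NUL bytes.
new_shape = b"\xef\xbb\xbf" + b"\x00" * 8

with tempfile.TemporaryDirectory() as tmp:
    new_file = write(Path(tmp) / "test_data", "bom_then_nul.csv", new_shape)
    written = new_file.read_bytes()

print(len(written), "bytes written")
```

A file like this, where an empty result would be misleading, is a candidate for `_MUST_BE_ERROR`.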
## What's already covered
| Category | Files |
|---|---|
| Empty / near-empty | `empty.csv`, `only_whitespace.csv`, `only_bom.csv`, `only_nul.csv`, `just_newlines.csv`, `header_only.csv` |
| Random / binary garbage | `random_bytes.csv`, `png_magic_as_csv.csv` |
| Truncated or huge | `truncated_mid_row.csv`, `one_huge_line.csv`, `massive_columns.csv`, `single_column.csv` |
| Wrong delimiter | `tsv_as_csv.csv`, `mixed_delimiters.csv` |
| Encoding chaos | `utf16_le_no_bom.csv`, `utf16_be_with_bom.csv`, `utf32_le.csv`, `mojibake.csv`, `invalid_utf8.csv`, `cp1252_smart_quotes.csv` |
| Quoting / shape | `unescaped_quotes.csv`, `embedded_newlines.csv`, `mismatched_columns.csv`, `duplicate_headers.csv`, `empty_header_names.csv`, `trailing_commas.csv` |
| Content | `all_nulls.csv`, `very_wide_cell.csv`, `all_same_row.csv` |
| Extension confusion | `no_extension`, `weird_extension.foo`, `double_extension.csv.txt` |
| Excel pathologies | `corrupt_xlsx.xlsx`, `excel_empty.xlsx`, `excel_header_only.xlsx` |
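
To see why the "Encoding chaos" row matters, the snippet below shows how the same text behaves under the wrong decoder. It is a standalone illustration using only the standard library. The sample string is made up and the analyzer's actual decoding strategy is not shown here.

```python
sample = "id,name\n1,café\n"

# Latin-1 bytes are not valid UTF-8 once a non-ASCII character appears.
latin1 = sample.encode("latin-1")
try:
    latin1.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decode survives but substitutes U+FFFD for the bad byte.
lenient = latin1.decode("utf-8", errors="replace")

# UTF-16-LE without a BOM interleaves NUL bytes with ASCII characters.
utf16 = sample.encode("utf-16-le")

# Classic mojibake: UTF-8 bytes read back as Latin-1.
classic_mojibake = sample.encode("utf-8").decode("latin-1")

print(strict_ok, repr(lenient), utf16[:8], repr(classic_mojibake))
```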
## Manually loading a junk file in the GUI
The files are real on-disk artifacts. Drag any of them into the home
page uploader to verify the GUI renders a sensible error (or clean
findings, for files the analyzer is OK with) instead of crashing.


@@ -0,0 +1,231 @@
"""Generate a corpus of pathological files for stress-testing the upload
analyzer.

Each file in ``test_data/`` is deliberately broken in a different way:
empty bytes, NUL bytes, mojibake, UTF-16 without BOM, mismatched columns,
unescaped quotes, etc. The goal is to make sure ``_run_analysis_on_upload``
returns a clean error Finding (never a Python traceback) for any of them,
in any combination, on every operating system the GUI ships on.

Run::

    python test-cases/junk-corpus/make_junk_corpus.py

The matching pytest at ``tests/test_junk_corpus.py`` iterates every file
in ``test_data/`` and asserts the analyzer either returns findings or an
error Finding, and never raises.
"""
from __future__ import annotations

import io
import os
import secrets
import struct
import zipfile
from pathlib import Path

_HERE = Path(__file__).resolve().parent
_OUT = _HERE / "test_data"


def write(name: str, data: bytes) -> None:
    """Write *data* to ``test_data/name`` and report the size."""
    path = _OUT / name
    path.write_bytes(data)
    print(f" {name:<40} {len(data):>10} bytes")


def _valid_xlsx_bytes(*, sheet_xml: str) -> bytes:
    """Build a minimal but valid .xlsx (zip with the required parts).

    ``sheet_xml`` is the inner ``<sheetData>`` content; the rest of the
    workbook scaffolding is filled in around it. Good enough for pandas
    to load.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr(
            "[Content_Types].xml",
            '<?xml version="1.0"?>'
            '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
            '<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
            '<Default Extension="xml" ContentType="application/xml"/>'
            '<Override PartName="/xl/workbook.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>'
            '<Override PartName="/xl/worksheets/sheet1.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/>'
            "</Types>",
        )
        z.writestr(
            "_rels/.rels",
            '<?xml version="1.0"?>'
            '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
            '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="xl/workbook.xml"/>'
            "</Relationships>",
        )
        z.writestr(
            "xl/_rels/workbook.xml.rels",
            '<?xml version="1.0"?>'
            '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
            '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet" Target="worksheets/sheet1.xml"/>'
            "</Relationships>",
        )
        z.writestr(
            "xl/workbook.xml",
            '<?xml version="1.0"?>'
            '<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"'
            ' xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">'
            '<sheets><sheet name="Sheet1" sheetId="1" r:id="rId1"/></sheets>'
            "</workbook>",
        )
        z.writestr(
            "xl/worksheets/sheet1.xml",
            '<?xml version="1.0"?>'
            '<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
            f"<sheetData>{sheet_xml}</sheetData>"
            "</worksheet>",
        )
    return buf.getvalue()


def main() -> None:
    _OUT.mkdir(parents=True, exist_ok=True)
    print(f"Writing junk corpus to {_OUT}")

    # ---- Empty / near-empty -------------------------------------------------
    write("empty.csv", b"")
    write("only_whitespace.csv", b" \t\n \n\t \n")
    write("only_bom.csv", b"\xef\xbb\xbf")
    write("only_nul.csv", b"\x00" * 64)
    write("just_newlines.csv", b"\n\n\n\n\n")
    write("header_only.csv", b"id,name,note\n")

    # ---- Random / binary garbage -------------------------------------------
    write("random_bytes.csv", secrets.token_bytes(2048))
    # Bytes that look like a PNG signature plus garbage; would mislead any
    # naive file-type sniffer.
    write("png_magic_as_csv.csv", b"\x89PNG\r\n\x1a\n" + secrets.token_bytes(512))

    # ---- Truncated / structurally damaged ----------------------------------
    write(
        "truncated_mid_row.csv",
        b"id,name,note\n1,alice,hello\n2,bob,wor",  # row 2 ends mid-cell
    )
    write(
        "one_huge_line.csv",
        b"a," * 5_000,  # 10KB single line, no newline anywhere
    )
    write(
        "massive_columns.csv",
        (",".join(f"c{i}" for i in range(500)) + "\n"
         + ",".join("x" for _ in range(500)) + "\n").encode(),
    )
    write(
        "single_column.csv",
        b"\n".join([b"id"] + [str(i).encode() for i in range(20)]) + b"\n",
    )

    # ---- Wrong / misleading delimiter --------------------------------------
    write(
        "tsv_as_csv.csv",
        b"id\tname\tnote\n1\talice\thi\n2\tbob\tworld\n",
    )
    write(
        "mixed_delimiters.csv",
        b"id,name\tnote;extra|tail\n1,alice\thi;x|y\n",
    )

    # ---- Encoding chaos ----------------------------------------------------
    sample_text = "id,name,note\n1,café,hello\n2,naïve,world\n"
    write("utf16_le_no_bom.csv", sample_text.encode("utf-16-le"))
    write("utf16_be_with_bom.csv", b"\xfe\xff" + sample_text.encode("utf-16-be"))
    write("utf32_le.csv", sample_text.encode("utf-32-le"))
    # Latin-1 bytes. 0xE9 (é) and 0xEF (ï) are invalid as UTF-8 in this
    # position, so a strict UTF-8 decode raises and a lenient one yields
    # U+FFFD replacement characters.
    write("mojibake.csv", sample_text.encode("latin-1"))
    # Bytes that can never appear in well-formed UTF-8 (0xFF, 0xFE, 0xFD).
    write("invalid_utf8.csv", b"id,name\n1,\xff\xfe\xfd,hello\n")
    # cp1252-encoded smart quotes in column values. cp1252 assigns
    # smart-quote glyphs to bytes 0x91-0x94; the surrounding ASCII +
    # accented "é" is just there to keep the value realistic.
    write(
        "cp1252_smart_quotes.csv",
        b"id,quote\n1,"
        + "café ".encode("cp1252")
        + b"\x93smart\x94 \x91quote\x92"
        + b"\n",
    )

    # ---- Quoting and field-shape pathologies -------------------------------
    write(
        "unescaped_quotes.csv",
        b'id,note\n1,"this has " unescaped quote"\n2,"normal"\n',
    )
    write(
        "embedded_newlines.csv",
        b'id,note\n1,"line one\nline two"\n2,"single line"\n',
    )
    write(
        "mismatched_columns.csv",
        b"id,name,note\n1,alice,hi\n2,bob\n3,carol,hi,extra,fields\n",
    )
    write(
        "duplicate_headers.csv",
        b"col,col,col\n1,2,3\n4,5,6\n",
    )
    write(
        "empty_header_names.csv",
        b",,,\n1,2,3,4\n5,6,7,8\n",
    )
    write(
        "trailing_commas.csv",
        b"id,name,note,\n1,alice,hi,\n2,bob,wo,\n",
    )

    # ---- Content pathologies ----------------------------------------------
    write(
        "all_nulls.csv",
        b"id,name,note\nNULL,NULL,NULL\nN/A,NA,(null)\nNone,nan,?\n",
    )
    write(
        "very_wide_cell.csv",
        b'id,blob\n1,"' + b"x" * 10_000 + b'"\n',
    )
    write(
        "all_same_row.csv",
        b"id,name,note\n" + b"1,alice,hello\n" * 100,
    )

    # ---- Extension confusion ----------------------------------------------
    write("no_extension", b"id,name,note\n1,alice,hi\n")
    write(
        "weird_extension.foo",
        b"id,name,note\n1,alice,hi\n",
    )
    write(
        "double_extension.csv.txt",
        b"id,name,note\n1,alice,hi\n",
    )

    # ---- Excel-specific pathologies ----------------------------------------
    # Not a real zip; pandas/openpyxl should error cleanly.
    write("corrupt_xlsx.xlsx", b"PK\x03\x04 not really a zip file")
    # Valid xlsx with an entirely empty sheet.
    write("excel_empty.xlsx", _valid_xlsx_bytes(sheet_xml=""))
    # Valid xlsx with one row of headers and no data.
    write(
        "excel_header_only.xlsx",
        _valid_xlsx_bytes(
            sheet_xml=(
                '<row r="1">'
                '<c r="A1" t="inlineStr"><is><t>id</t></is></c>'
                '<c r="B1" t="inlineStr"><is><t>name</t></is></c>'
                "</row>"
            ),
        ),
    )

    print(f"\nWrote {len(list(_OUT.iterdir()))} files.")


if __name__ == "__main__":
    main()


@@ -0,0 +1,4 @@
id,name,note
NULL,NULL,NULL
N/A,NA,(null)
None,nan,?

@@ -0,0 +1,101 @@
id,name,note
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello
1,alice,hello

@@ -0,0 +1 @@
PK not really a zip file


@@ -0,0 +1,2 @@
id,quote
1,café “smart” ‘quote’


@@ -0,0 +1,2 @@
id,name,note
1,alice,hi


@@ -0,0 +1,3 @@
col,col,col
1,2,3
4,5,6

@@ -0,0 +1,4 @@
id,note
1,"line one
line two"
2,"single line"

@@ -0,0 +1,3 @@
,,,
1,2,3,4
5,6,7,8

Binary file not shown.


@@ -0,0 +1 @@
id,name,note

@@ -0,0 +1,2 @@
id,name
1,ÿþý,hello

@@ -0,0 +1,5 @@


@@ -0,0 +1,2 @@
c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,c17,c18,c19,c20,c21,c22,c23,c24,c25,c26,c27,c28,c29,c30,c31,c32,c33,c34,c35,c36,c37,c38,c39,c40,c41,c42,c43,c44,c45,c46,c47,c48,c49,c50,c51,c52,c53,c54,c55,c56,c57,c58,c59,c60,c61,c62,c63,c64,c65,c66,c67,c68,c69,c70,c71,c72,c73,c74,c75,c76,c77,c78,c79,c80,c81,c82,c83,c84,c85,c86,c87,c88,c89,c90,c91,c92,c93,c94,c95,c96,c97,c98,c99,c100,c101,c102,c103,c104,c105,c106,c107,c108,c109,c110,c111,c112,c113,c114,c115,c116,c117,c118,c119,c120,c121,c122,c123,c124,c125,c126,c127,c128,c129,c130,c131,c132,c133,c134,c135,c136,c137,c138,c139,c140,c141,c142,c143,c144,c145,c146,c147,c148,c149,c150,c151,c152,c153,c154,c155,c156,c157,c158,c159,c160,c161,c162,c163,c164,c165,c166,c167,c168,c169,c170,c171,c172,c173,c174,c175,c176,c177,c178,c179,c180,c181,c182,c183,c184,c185,c186,c187,c188,c189,c190,c191,c192,c193,c194,c195,c196,c197,c198,c199,c200,c201,c202,c203,c204,c205,c206,c207,c208,c209,c210,c211,c212,c213,c214,c215,c216,c217,c218,c219,c220,c221,c222,c223,c224,c225,c226,c227,c228,c229,c230,c231,c232,c233,c234,c235,c236,c237,c238,c239,c240,c241,c242,c243,c244,c245,c246,c247,c248,c249,c250,c251,c252,c253,c254,c255,c256,c257,c258,c259,c260,c261,c262,c263,c264,c265,c266,c267,c268,c269,c270,c271,c272,c273,c274,c275,c276,c277,c278,c279,c280,c281,c282,c283,c284,c285,c286,c287,c288,c289,c290,c291,c292,c293,c294,c295,c296,c297,c298,c299,c300,c301,c302,c303,c304,c305,c306,c307,c308,c309,c310,c311,c312,c313,c314,c315,c316,c317,c318,c319,c320,c321,c322,c323,c324,c325,c326,c327,c328,c329,c330,c331,c332,c333,c334,c335,c336,c337,c338,c339,c340,c341,c342,c343,c344,c345,c346,c347,c348,c349,c350,c351,c352,c353,c354,c355,c356,c357,c358,c359,c360,c361,c362,c363,c364,c365,c366,c367,c368,c369,c370,c371,c372,c373,c374,c375,c376,c377,c378,c379,c380,c381,c382,c383,c384,c385,c386,c387,c388,c389,c390,c391,c392,c393,c394,c395,c396,c397,c398,c399,c400,c401,c402,c403,c404,c405,c406,c407,c408,c409,c410,c411,c412,c413,c414,c415,c416,c417,c418,c419,c420,c421,
c422,c423,c424,c425,c426,c427,c428,c429,c430,c431,c432,c433,c434,c435,c436,c437,c438,c439,c440,c441,c442,c443,c444,c445,c446,c447,c448,c449,c450,c451,c452,c453,c454,c455,c456,c457,c458,c459,c460,c461,c462,c463,c464,c465,c466,c467,c468,c469,c470,c471,c472,c473,c474,c475,c476,c477,c478,c479,c480,c481,c482,c483,c484,c485,c486,c487,c488,c489,c490,c491,c492,c493,c494,c495,c496,c497,c498,c499
x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x

@@ -0,0 +1,4 @@
id,name,note
1,alice,hi
2,bob
3,carol,hi,extra,fields

@@ -0,0 +1,2 @@
id,name note;extra|tail
1,alice hi;x|y

@@ -0,0 +1,3 @@
id,name,note
1,café,hello
2,naïve,world

@@ -0,0 +1,2 @@
id,name,note
1,alice,hi

File diff suppressed because one or more lines are too long


@@ -0,0 +1 @@

Binary file not shown.

@@ -0,0 +1,3 @@

Binary file not shown.


Binary file not shown.

@@ -0,0 +1,21 @@
id
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

@@ -0,0 +1,3 @@
id,name,note,
1,alice,hi,
2,bob,wo,

@@ -0,0 +1,3 @@
id,name,note
1,alice,hello
2,bob,wor

@@ -0,0 +1,3 @@
id name note
1 alice hi
2 bob world

@@ -0,0 +1,3 @@
id,note
1,"this has " unescaped quote"
2,"normal"

Binary file not shown.
1 id name note
2 1 café hello
3 2 naïve world

Binary file not shown.

Binary file not shown.

File diff suppressed because one or more lines are too long


@@ -0,0 +1,2 @@
id,name,note
1,alice,hi


@@ -35,8 +35,10 @@ TEST_CASES_DIR = Path(__file__).resolve().parent.parent / "test-cases"
 # Subdirectories in test-cases/ that are exercised by their own dedicated
 # tests. The sweep ignores these so we don't double-test or fight expected
-# byte-exact outputs.
-_EXCLUDED_SUBDIRS = {"text-cleaner-corpus"}
+# byte-exact outputs. ``junk-corpus`` is intentionally pathological:
+# files there are designed to break the cleaner/analyzer; the contract is
+# enforced by ``tests/test_junk_corpus.py``, not this happy-path sweep.
+_EXCLUDED_SUBDIRS = {"text-cleaner-corpus", "junk-corpus"}
 
 # File suffixes we know how to load.
 _SUPPORTED_SUFFIXES = {".csv", ".tsv", ".xlsx", ".xls"}

tests/test_junk_corpus.py

@@ -0,0 +1,156 @@
"""Stress-test the upload analyzer against a corpus of pathological files.

Every file under ``test-cases/junk-corpus/test_data/`` is fed through
``_run_analysis_on_upload``, the same path the GUI takes when a user
drops a file on the home page. The contract under test is:

* The call never raises. Errors must surface as a synthetic ``Finding``
  with severity ``"error"``, not a Python traceback that the page
  chrome bubbles up to the user.
* The return is always a list of :class:`Finding` (possibly empty for
  files the analyzer judges clean).
* Specific high-risk files (empty bytes, corrupt zip, etc.) MUST
  produce at least one error-level Finding so the UI shows a red
  banner rather than silently rendering "no issues found".

To add a new pathological shape:

1. Edit ``test-cases/junk-corpus/make_junk_corpus.py`` to write the new
   file under ``test_data/``.
2. Re-run that script to materialize the file on disk.
3. (Optional) Add the filename to ``_MUST_BE_ERROR`` below if the file
   represents a state where "no findings" would be a silent failure.
"""
from __future__ import annotations

from pathlib import Path

import pytest

from src.core.analyze import Finding
from src.gui.components._legacy import _run_analysis_on_upload

_CORPUS = Path(__file__).resolve().parent.parent / "test-cases" / "junk-corpus" / "test_data"


class _FakeUpload:
    """Duck-type the Streamlit ``UploadedFile`` interface from a path."""

    def __init__(self, path: Path) -> None:
        self.name = path.name
        self._bytes = path.read_bytes()

    def getvalue(self) -> bytes:
        return self._bytes


def _corpus_files() -> list[Path]:
    files = sorted(p for p in _CORPUS.iterdir() if p.is_file())
    if not files:
        raise RuntimeError(
            f"Junk corpus is empty. Run "
            f"`python test-cases/junk-corpus/make_junk_corpus.py` "
            f"to generate {_CORPUS}."
        )
    return files
# Files where "zero findings" would be a silent failure — these are
# structurally broken enough that the analyzer MUST flag them. The
# error-level Finding is what shows the user a red banner instead of
# the misleading "no issues found" success path.
_MUST_BE_ERROR = {
"empty.csv",
"only_bom.csv",
"only_nul.csv",
"corrupt_xlsx.xlsx",
}
@pytest.mark.parametrize(
"path",
_corpus_files(),
ids=lambda p: p.name,
)
class TestJunkCorpus:
    """Every pathological file must round-trip through the analyzer
    without raising. The error message format is checked separately
    via :func:`TestJunkCorpus.test_error_findings_have_a_description`.
    """

    def test_no_exception_propagates(self, path: Path) -> None:
        upload = _FakeUpload(path)
        # The point of the test: any exception from analyze() / pandas /
        # repair_bytes / openpyxl SHOULD have been caught and turned
        # into an error Finding by ``_run_analysis_on_upload``. If this
        # raises, the home page would crash on this file in production.
        findings = _run_analysis_on_upload(upload)
        assert isinstance(findings, list), (
            f"{path.name}: expected list[Finding], got {type(findings).__name__}"
        )

    def test_findings_are_well_formed(self, path: Path) -> None:
        upload = _FakeUpload(path)
        findings = _run_analysis_on_upload(upload)
        for f in findings:
            assert isinstance(f, Finding), (
                f"{path.name}: non-Finding in result list: {f!r}"
            )
            assert isinstance(f.id, str) and f.id, (
                f"{path.name}: Finding has empty id"
            )
            assert f.severity in ("info", "warn", "error"), (
                f"{path.name}: Finding has bad severity {f.severity!r}"
            )
            assert isinstance(f.description, str) and f.description, (
                f"{path.name}: Finding has empty description"
            )

    def test_must_be_error_files_actually_flag(self, path: Path) -> None:
        if path.name not in _MUST_BE_ERROR:
            pytest.skip(f"{path.name} is allowed to pass clean")
        upload = _FakeUpload(path)
        findings = _run_analysis_on_upload(upload)
        errors = [f for f in findings if f.severity == "error"]
        assert errors, (
            f"{path.name} should surface at least one error-level "
            f"Finding so the UI shows a red banner; got {len(findings)} "
            f"findings (none of severity 'error')."
        )

    def test_error_findings_have_a_description(self, path: Path) -> None:
        """Error findings must carry a description the user can act on.

        For an empty / corrupt file the description is the ONLY thing
        the user sees, so it has to name the file or include enough
        context that they can fix the underlying problem.
        """
        upload = _FakeUpload(path)
        findings = _run_analysis_on_upload(upload)
        for f in findings:
            if f.severity != "error":
                continue
            # The synthetic error Findings always interpolate the file
            # name; analyzer-generated errors include the column or a
            # description that mentions what was wrong.
            assert len(f.description) >= 20, (
                f"{path.name}: error Finding description is too short "
                f"to be useful: {f.description!r}"
            )


def test_corpus_contains_expected_shapes() -> None:
    """Sanity-check that the corpus generator wrote the files we rely
    on for the must-be-error matrix. If somebody renames a file in
    ``make_junk_corpus.py`` without updating ``_MUST_BE_ERROR``, this
    test catches it before the per-file parametrization silently
    skips the must-be-error assertion."""
    names = {p.name for p in _corpus_files()}
    missing = _MUST_BE_ERROR - names
    assert not missing, (
        f"_MUST_BE_ERROR references files that don't exist in the "
        f"corpus: {sorted(missing)}. Regenerate the corpus or update "
        f"_MUST_BE_ERROR."
    )