test(junk-corpus): pathological-input stress suite for the analyzer

Build a corpus of 35 deliberately broken files (empty bytes, NUL bytes, mojibake, UTF-16 without BOM, mismatched columns, unescaped quotes, corrupt zip, etc.) and pin the analyzer's stability contract against them.

Files land in ``test-cases/junk-corpus/test_data/``. The generator ``make_junk_corpus.py`` is deterministic except for two files, ``random_bytes.csv`` and ``png_magic_as_csv.csv``, which draw from ``secrets.token_bytes``. The tests read the committed bytes, so their inputs stay stable unless the corpus is regenerated. README documents the categories and how to add new shapes.

``tests/test_junk_corpus.py`` parametrizes over every file in the corpus and asserts:

1. ``_run_analysis_on_upload`` never raises; exceptions must be caught and surfaced as a synthetic ``Finding`` with severity="error". This was the user-reported crash for 13_non_latin_scripts.csv that the previous fix in ae9d4a2 defensively wrapped; the corpus now stops the regression from re-landing on a different shape.
2. Every Finding in the result list is well-formed (string id, valid severity, non-empty description).
3. A high-risk subset (empty.csv, only_bom.csv, only_nul.csv, corrupt_xlsx.xlsx) MUST surface at least one error-level Finding; otherwise the GUI would render "no issues found" for a structurally broken file.
4. Error-level Finding descriptions are at least 20 chars so the UI banner gives the user something to act on.

Also exclude ``junk-corpus`` from ``tests/test_fixtures_sweep.py`` since that sweep is happy-path (round-trip the text cleaner) and fights with files designed to break it. The contract is enforced by the dedicated junk-corpus test, not the sweep.

Runtime: 12 s for the junk-corpus tests, 30 s for the full project suite (was 19 s without these). 2118 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:35:22 +00:00
parent ae9d4a2db5
commit 696996c119
39 changed files with 637 additions and 2 deletions
--- a/test-cases/junk-corpus/README.md
+++ b/test-cases/junk-corpus/README.md
@@ -0,0 +1,63 @@
+# Junk Corpus — pathological-input stress tests
+
+This corpus exists to make the upload analyzer prove it can survive any
+file a user (or an adversary) might drop on it. Every file under
+`test_data/` is deliberately broken in a different way: empty bytes,
+NUL bytes, mojibake, UTF-16 without a BOM, mismatched columns,
+unescaped quotes, corrupt `.xlsx`, and so on.
+
+The contract enforced by `tests/test_junk_corpus.py`:
+
+1. `_run_analysis_on_upload(file)` MUST NOT raise. Errors are caught
+   and surfaced as a synthetic `Finding` with severity `"error"`.
+2. The return is always a `list[Finding]` (possibly empty for files
+   the analyzer judges clean).
+3. A specific subset of files (`empty.csv`, `only_bom.csv`,
+   `only_nul.csv`, `corrupt_xlsx.xlsx`) MUST produce at least one
+   error-level Finding so the GUI shows a red banner instead of
+   silently rendering "no issues found".
+
+## Why this matters
+
+In a multi-file home-page upload, one bad file used to bubble a
+Python traceback up through the page chrome and kill every other
+file's analysis. The defensive wrap in `_run_analysis_on_upload` plus
+this stress test together prevent that regression.
+
+## Regenerating the corpus
+
+```bash
+python test-cases/junk-corpus/make_junk_corpus.py
+```
+
+The generator writes 35 files into `test_data/`. They are small
+(< 100 KB each) and committed to the repo so the stress test runs
+without depending on a regenerate step.
+
+## Adding a new pathological shape
+
+1. Add a `write(...)` call to `make_junk_corpus.py`.
+2. Re-run that script to materialize the file on disk.
+3. (Optional) Add the filename to `_MUST_BE_ERROR` in
+   `tests/test_junk_corpus.py` if "no findings" would be a silent
+   failure for that shape.
+
+## What's already covered
+
+| Category | Files |
+|---|---|
+| Empty / near-empty | `empty.csv`, `only_whitespace.csv`, `only_bom.csv`, `only_nul.csv`, `just_newlines.csv`, `header_only.csv` |
+| Random / binary garbage | `random_bytes.csv`, `png_magic_as_csv.csv` |
+| Truncated or huge | `truncated_mid_row.csv`, `one_huge_line.csv`, `massive_columns.csv`, `single_column.csv` |
+| Wrong delimiter | `tsv_as_csv.csv`, `mixed_delimiters.csv` |
+| Encoding chaos | `utf16_le_no_bom.csv`, `utf16_be_with_bom.csv`, `utf32_le.csv`, `mojibake.csv`, `invalid_utf8.csv`, `cp1252_smart_quotes.csv` |
+| Quoting / shape | `unescaped_quotes.csv`, `embedded_newlines.csv`, `mismatched_columns.csv`, `duplicate_headers.csv`, `empty_header_names.csv`, `trailing_commas.csv` |
+| Content | `all_nulls.csv`, `very_wide_cell.csv`, `all_same_row.csv` |
+| Extension confusion | `no_extension`, `weird_extension.foo`, `double_extension.csv.txt` |
+| Excel pathologies | `corrupt_xlsx.xlsx`, `excel_empty.xlsx`, `excel_header_only.xlsx` |
+
+## Manually loading a junk file in the GUI
+
+The files are real on-disk artifacts. Drag any of them into the home
+page uploader to verify the GUI renders a sensible error (or clean
+findings, for files the analyzer is OK with) instead of crashing.
--- a/test-cases/junk-corpus/make_junk_corpus.py
+++ b/test-cases/junk-corpus/make_junk_corpus.py
@@ -0,0 +1,231 @@
+"""Generate a corpus of pathological files for stress-testing the upload
+analyzer.
+
+Each file in ``test_data/`` is deliberately broken in a different way:
+empty bytes, NUL bytes, mojibake, UTF-16 without BOM, mismatched columns,
+unescaped quotes, etc. The goal is to make sure ``_run_analysis_on_upload``
+returns a clean error Finding (never a Python traceback) for any of them,
+in any combination, on every operating system the GUI ships on.
+
+Run::
+
+    python test-cases/junk-corpus/make_junk_corpus.py
+
+The matching pytest at ``tests/test_junk_corpus.py`` iterates every file
+in ``test_data/`` and asserts the analyzer either returns findings or an
+error Finding — never raises.
+"""
+
+from __future__ import annotations
+
+import io
+import os
+import secrets
+import struct
+import zipfile
+from pathlib import Path
+
+
+_HERE = Path(__file__).resolve().parent
+_OUT = _HERE / "test_data"
+
+
+def write(name: str, data: bytes) -> None:
+    """Write *data* to ``test_data/name`` and report the size."""
+    path = _OUT / name
+    path.write_bytes(data)
+    print(f"  {name:<40} {len(data):>10} bytes")
+
+
+def _valid_xlsx_bytes(*, sheet_xml: str) -> bytes:
+    """Build a minimal but valid .xlsx (zip with the required parts).
+
+    ``sheet_xml`` is the inner ``<sheetData>`` content; the rest of the
+    workbook scaffolding is filled in around it. Good enough for pandas
+    to load.
+    """
+    buf = io.BytesIO()
+    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
+        z.writestr(
+            "[Content_Types].xml",
+            '<?xml version="1.0"?>'
+            '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
+            '<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
+            '<Default Extension="xml" ContentType="application/xml"/>'
+            '<Override PartName="/xl/workbook.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>'
+            '<Override PartName="/xl/worksheets/sheet1.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/>'
+            "</Types>",
+        )
+        z.writestr(
+            "_rels/.rels",
+            '<?xml version="1.0"?>'
+            '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
+            '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="xl/workbook.xml"/>'
+            "</Relationships>",
+        )
+        z.writestr(
+            "xl/_rels/workbook.xml.rels",
+            '<?xml version="1.0"?>'
+            '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
+            '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet" Target="worksheets/sheet1.xml"/>'
+            "</Relationships>",
+        )
+        z.writestr(
+            "xl/workbook.xml",
+            '<?xml version="1.0"?>'
+            '<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"'
+            ' xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">'
+            '<sheets><sheet name="Sheet1" sheetId="1" r:id="rId1"/></sheets>'
+            "</workbook>",
+        )
+        z.writestr(
+            "xl/worksheets/sheet1.xml",
+            '<?xml version="1.0"?>'
+            '<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
+            f"<sheetData>{sheet_xml}</sheetData>"
+            "</worksheet>",
+        )
+    return buf.getvalue()
+
+
+def main() -> None:
+    _OUT.mkdir(parents=True, exist_ok=True)
+    print(f"Writing junk corpus to {_OUT}")
+
+    # ---- Empty / near-empty -------------------------------------------------
+    write("empty.csv", b"")
+    write("only_whitespace.csv", b"   \t\n  \n\t  \n")
+    write("only_bom.csv", b"\xef\xbb\xbf")
+    write("only_nul.csv", b"\x00" * 64)
+    write("just_newlines.csv", b"\n\n\n\n\n")
+    write("header_only.csv", b"id,name,note\n")
+
+    # ---- Random / binary garbage -------------------------------------------
+    write("random_bytes.csv", secrets.token_bytes(2048))
+    # Bytes that look like a PNG signature plus garbage; would mislead any
+    # naive file-type sniffer.
+    write("png_magic_as_csv.csv", b"\x89PNG\r\n\x1a\n" + secrets.token_bytes(512))
+
+    # ---- Truncated / structurally damaged ----------------------------------
+    write(
+        "truncated_mid_row.csv",
+        b"id,name,note\n1,alice,hello\n2,bob,wor",  # row 2 ends mid-cell
+    )
+    write(
+        "one_huge_line.csv",
+        b"a," * 5_000,  # 10KB single line, no newline anywhere
+    )
+    write(
+        "massive_columns.csv",
+        (",".join(f"c{i}" for i in range(500)) + "\n"
+         + ",".join("x" for _ in range(500)) + "\n").encode(),
+    )
+    write(
+        "single_column.csv",
+        b"\n".join([b"id"] + [str(i).encode() for i in range(20)]) + b"\n",
+    )
+
+    # ---- Wrong / misleading delimiter --------------------------------------
+    write(
+        "tsv_as_csv.csv",
+        b"id\tname\tnote\n1\talice\thi\n2\tbob\tworld\n",
+    )
+    write(
+        "mixed_delimiters.csv",
+        b"id,name\tnote;extra|tail\n1,alice\thi;x|y\n",
+    )
+
+    # ---- Encoding chaos ----------------------------------------------------
+    sample_text = "id,name,note\n1,café,hello\n2,naïve,world\n"
+    write("utf16_le_no_bom.csv", sample_text.encode("utf-16-le"))
+    write("utf16_be_with_bom.csv", b"\xfe\xff" + sample_text.encode("utf-16-be"))
+    write("utf32_le.csv", sample_text.encode("utf-32-le"))
+    # Latin-1 é/ï (0xE9, 0xEF) is invalid UTF-8: strict decode fails, errors="replace" yields U+FFFD.
+    write("mojibake.csv", sample_text.encode("latin-1"))
+    # Bytes that aren't valid UTF-8 (0xFF, 0xFE, 0xFD never occur in well-formed UTF-8)
+    write("invalid_utf8.csv", b"id,name\n1,\xff\xfe\xfd,hello\n")
+    # cp1252-encoded smart quotes in column values. cp1252 maps the
+    # smart-quote glyphs to bytes 0x91-0x94; the surrounding ASCII and
+    # the accented "é" are just there to keep the value realistic.
+    write(
+        "cp1252_smart_quotes.csv",
+        b"id,quote\n1,"
+        + "café ".encode("cp1252")
+        + b"\x93smart\x94 \x91quote\x92"
+        + b"\n",
+    )
+
+    # ---- Quoting and field-shape pathologies -------------------------------
+    write(
+        "unescaped_quotes.csv",
+        b'id,note\n1,"this has " unescaped quote"\n2,"normal"\n',
+    )
+    write(
+        "embedded_newlines.csv",
+        b'id,note\n1,"line one\nline two"\n2,"single line"\n',
+    )
+    write(
+        "mismatched_columns.csv",
+        b"id,name,note\n1,alice,hi\n2,bob\n3,carol,hi,extra,fields\n",
+    )
+    write(
+        "duplicate_headers.csv",
+        b"col,col,col\n1,2,3\n4,5,6\n",
+    )
+    write(
+        "empty_header_names.csv",
+        b",,,\n1,2,3,4\n5,6,7,8\n",
+    )
+    write(
+        "trailing_commas.csv",
+        b"id,name,note,\n1,alice,hi,\n2,bob,wo,\n",
+    )
+
+    # ---- Content pathologies ----------------------------------------------
+    write(
+        "all_nulls.csv",
+        b"id,name,note\nNULL,NULL,NULL\nN/A,NA,(null)\nNone,nan,?\n",
+    )
+    write(
+        "very_wide_cell.csv",
+        b'id,blob\n1,"' + b"x" * 10_000 + b'"\n',
+    )
+    write(
+        "all_same_row.csv",
+        b"id,name,note\n" + b"1,alice,hello\n" * 100,
+    )
+
+    # ---- Extension confusion ----------------------------------------------
+    write("no_extension", b"id,name,note\n1,alice,hi\n")
+    write(
+        "weird_extension.foo",
+        b"id,name,note\n1,alice,hi\n",
+    )
+    write(
+        "double_extension.csv.txt",
+        b"id,name,note\n1,alice,hi\n",
+    )
+
+    # ---- Excel-specific pathologies ----------------------------------------
+    # Not a real zip — pandas/openpyxl should error cleanly.
+    write("corrupt_xlsx.xlsx", b"PK\x03\x04 not really a zip file")
+    # Valid xlsx with an entirely empty sheet.
+    write("excel_empty.xlsx", _valid_xlsx_bytes(sheet_xml=""))
+    # Valid xlsx with one row of headers and no data.
+    write(
+        "excel_header_only.xlsx",
+        _valid_xlsx_bytes(
+            sheet_xml=(
+                '<row r="1">'
+                '<c r="A1" t="inlineStr"><is><t>id</t></is></c>'
+                '<c r="B1" t="inlineStr"><is><t>name</t></is></c>'
+                "</row>"
+            ),
+        ),
+    )
+
+    print(f"\nWrote {len(list(_OUT.iterdir()))} files.")
+
+
+if __name__ == "__main__":
+    main()
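
`secrets.token_bytes` returns fresh bytes on every call, so rerunning the generator as written rewrites `random_bytes.csv` and `png_magic_as_csv.csv` with different content. If byte-stable regeneration is wanted, one option is a seeded `random.Random`, sketched below. This is an alternative and not what the commit does; the helper name `stable_bytes` and the seed values are made up for illustration, and `Random.randbytes` needs Python 3.9 or later.

```python
import random


def stable_bytes(n: int, *, seed: int) -> bytes:
    """Return *n* pseudo-random bytes that are identical for a given seed.

    Hypothetical helper; fine for test fixtures, not for anything
    security-sensitive.
    """
    return random.Random(seed).randbytes(n)


# Possible replacements for the two secrets.token_bytes calls.
random_blob = stable_bytes(2048, seed=1)
png_blob = b"\x89PNG\r\n\x1a\n" + stable_bytes(512, seed=2)

if __name__ == "__main__":
    # Same seed, same bytes on every run.
    print(random_blob == stable_bytes(2048, seed=1))
    print(len(png_blob))
```

The trade-off is that the "random" file becomes one fixed shape, which matches how the committed corpus already behaves in the tests.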
--- a/test-cases/junk-corpus/test_data/all_nulls.csv
+++ b/test-cases/junk-corpus/test_data/all_nulls.csv
@@ -0,0 +1,4 @@
+id,name,note
+NULL,NULL,NULL
+N/A,NA,(null)
+None,nan,?
--- a/test-cases/junk-corpus/test_data/all_same_row.csv
+++ b/test-cases/junk-corpus/test_data/all_same_row.csv
@@ -0,0 +1,101 @@
+id,name,note
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
+1,alice,hello
--- a/test-cases/junk-corpus/test_data/corrupt_xlsx.xlsx
+++ b/test-cases/junk-corpus/test_data/corrupt_xlsx.xlsx
@@ -0,0 +1 @@
+PK not really a zip file
--- a/test-cases/junk-corpus/test_data/cp1252_smart_quotes.csv
+++ b/test-cases/junk-corpus/test_data/cp1252_smart_quotes.csv
@@ -0,0 +1,2 @@
+id,quote
+1,café “smart” ‘quote’
--- a/test-cases/junk-corpus/test_data/double_extension.csv.txt
+++ b/test-cases/junk-corpus/test_data/double_extension.csv.txt
@@ -0,0 +1,2 @@
+id,name,note
+1,alice,hi
--- a/test-cases/junk-corpus/test_data/duplicate_headers.csv
+++ b/test-cases/junk-corpus/test_data/duplicate_headers.csv
@@ -0,0 +1,3 @@
+col,col,col
+1,2,3
+4,5,6
--- a/test-cases/junk-corpus/test_data/embedded_newlines.csv
+++ b/test-cases/junk-corpus/test_data/embedded_newlines.csv
@@ -0,0 +1,4 @@
+id,note
+1,"line one
+line two"
+2,"single line"
--- a/test-cases/junk-corpus/test_data/empty.csv
+++ b/test-cases/junk-corpus/test_data/empty.csv
--- a/test-cases/junk-corpus/test_data/empty_header_names.csv
+++ b/test-cases/junk-corpus/test_data/empty_header_names.csv
@@ -0,0 +1,3 @@
+,,,
+1,2,3,4
+5,6,7,8
--- a/test-cases/junk-corpus/test_data/excel_empty.xlsx
+++ b/test-cases/junk-corpus/test_data/excel_empty.xlsx
--- a/test-cases/junk-corpus/test_data/excel_header_only.xlsx
+++ b/test-cases/junk-corpus/test_data/excel_header_only.xlsx
--- a/test-cases/junk-corpus/test_data/header_only.csv
+++ b/test-cases/junk-corpus/test_data/header_only.csv
@@ -0,0 +1 @@
+id,name,note
--- a/test-cases/junk-corpus/test_data/invalid_utf8.csv
+++ b/test-cases/junk-corpus/test_data/invalid_utf8.csv
@@ -0,0 +1,2 @@
+id,name
+1,ÿþý,hello
--- a/test-cases/junk-corpus/test_data/just_newlines.csv
+++ b/test-cases/junk-corpus/test_data/just_newlines.csv
@@ -0,0 +1,5 @@
+
+
+
+
+
--- a/test-cases/junk-corpus/test_data/massive_columns.csv
+++ b/test-cases/junk-corpus/test_data/massive_columns.csv
@@ -0,0 +1,2 @@
+c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,c17,c18,c19,c20,c21,c22,c23,c24,c25,c26,c27,c28,c29,c30,c31,c32,c33,c34,c35,c36,c37,c38,c39,c40,c41,c42,c43,c44,c45,c46,c47,c48,c49,c50,c51,c52,c53,c54,c55,c56,c57,c58,c59,c60,c61,c62,c63,c64,c65,c66,c67,c68,c69,c70,c71,c72,c73,c74,c75,c76,c77,c78,c79,c80,c81,c82,c83,c84,c85,c86,c87,c88,c89,c90,c91,c92,c93,c94,c95,c96,c97,c98,c99,c100,c101,c102,c103,c104,c105,c106,c107,c108,c109,c110,c111,c112,c113,c114,c115,c116,c117,c118,c119,c120,c121,c122,c123,c124,c125,c126,c127,c128,c129,c130,c131,c132,c133,c134,c135,c136,c137,c138,c139,c140,c141,c142,c143,c144,c145,c146,c147,c148,c149,c150,c151,c152,c153,c154,c155,c156,c157,c158,c159,c160,c161,c162,c163,c164,c165,c166,c167,c168,c169,c170,c171,c172,c173,c174,c175,c176,c177,c178,c179,c180,c181,c182,c183,c184,c185,c186,c187,c188,c189,c190,c191,c192,c193,c194,c195,c196,c197,c198,c199,c200,c201,c202,c203,c204,c205,c206,c207,c208,c209,c210,c211,c212,c213,c214,c215,c216,c217,c218,c219,c220,c221,c222,c223,c224,c225,c226,c227,c228,c229,c230,c231,c232,c233,c234,c235,c236,c237,c238,c239,c240,c241,c242,c243,c244,c245,c246,c247,c248,c249,c250,c251,c252,c253,c254,c255,c256,c257,c258,c259,c260,c261,c262,c263,c264,c265,c266,c267,c268,c269,c270,c271,c272,c273,c274,c275,c276,c277,c278,c279,c280,c281,c282,c283,c284,c285,c286,c287,c288,c289,c290,c291,c292,c293,c294,c295,c296,c297,c298,c299,c300,c301,c302,c303,c304,c305,c306,c307,c308,c309,c310,c311,c312,c313,c314,c315,c316,c317,c318,c319,c320,c321,c322,c323,c324,c325,c326,c327,c328,c329,c330,c331,c332,c333,c334,c335,c336,c337,c338,c339,c340,c341,c342,c343,c344,c345,c346,c347,c348,c349,c350,c351,c352,c353,c354,c355,c356,c357,c358,c359,c360,c361,c362,c363,c364,c365,c366,c367,c368,c369,c370,c371,c372,c373,c374,c375,c376,c377,c378,c379,c380,c381,c382,c383,c384,c385,c386,c387,c388,c389,c390,c391,c392,c393,c394,c395,c396,c397,c398,c399,c400,c401,c402,c403,c404,c405,c406,c407,c408,c409,c410,c411,c412,c413,c414,c415,c416,c417,c418,c419,c420,c421,c422,c423,c424,c425,c426,c427,c428,c429,c430,c431,c432,c433,c434,c435,c436,c437,c438,c439,c440,c441,c442,c443,c444,c445,c446,c447,c448,c449,c450,c451,c452,c453,c454,c455,c456,c457,c458,c459,c460,c461,c462,c463,c464,c465,c466,c467,c468,c469,c470,c471,c472,c473,c474,c475,c476,c477,c478,c479,c480,c481,c482,c483,c484,c485,c486,c487,c488,c489,c490,c491,c492,c493,c494,c495,c496,c497,c498,c499
+x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x
--- a/test-cases/junk-corpus/test_data/mismatched_columns.csv
+++ b/test-cases/junk-corpus/test_data/mismatched_columns.csv
@@ -0,0 +1,4 @@
+id,name,note
+1,alice,hi
+2,bob
+3,carol,hi,extra,fields
--- a/test-cases/junk-corpus/test_data/mixed_delimiters.csv
+++ b/test-cases/junk-corpus/test_data/mixed_delimiters.csv
@@ -0,0 +1,2 @@
+id,name	note;extra|tail
+1,alice	hi;x|y
--- a/test-cases/junk-corpus/test_data/mojibake.csv
+++ b/test-cases/junk-corpus/test_data/mojibake.csv
@@ -0,0 +1,3 @@
+id,name,note
+1,café,hello
+2,naïve,world
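
The diff view above appears to render these bytes as Latin-1, which is why `café` looks intact. The sketch below, using only the standard library, shows what the same bytes do under UTF-8 and, for contrast, how the familiar `Ã©` pattern arises from the opposite mistake (UTF-8 bytes read as Latin-1). `sample_text` is copied from the generator; the other names are illustrative.

```python
sample_text = "id,name,note\n1,café,hello\n2,naïve,world\n"

# What mojibake.csv contains: one byte per character, é -> 0xE9, ï -> 0xEF.
latin1_bytes = sample_text.encode("latin-1")

# Strict UTF-8 decoding rejects the file outright...
try:
    latin1_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ...and lenient decoding substitutes U+FFFD for each bad byte.
lenient = latin1_bytes.decode("utf-8", errors="replace")

# The "Ã©" pattern comes from the reverse mistake: UTF-8 read as Latin-1.
reverse = "café".encode("utf-8").decode("latin-1")

if __name__ == "__main__":
    print(strict_ok)
    print(lenient.count("\ufffd"))
    print(reverse)
```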
--- a/test-cases/junk-corpus/test_data/no_extension
+++ b/test-cases/junk-corpus/test_data/no_extension
@@ -0,0 +1,2 @@
+id,name,note
+1,alice,hi
--- a/test-cases/junk-corpus/test_data/one_huge_line.csv
+++ b/test-cases/junk-corpus/test_data/one_huge_line.csv
--- a/test-cases/junk-corpus/test_data/only_bom.csv
+++ b/test-cases/junk-corpus/test_data/only_bom.csv
@@ -0,0 +1 @@
+
--- a/test-cases/junk-corpus/test_data/only_nul.csv
+++ b/test-cases/junk-corpus/test_data/only_nul.csv
--- a/test-cases/junk-corpus/test_data/only_whitespace.csv
+++ b/test-cases/junk-corpus/test_data/only_whitespace.csv
@@ -0,0 +1,3 @@
+   	
+  
+	  
--- a/test-cases/junk-corpus/test_data/png_magic_as_csv.csv
+++ b/test-cases/junk-corpus/test_data/png_magic_as_csv.csv
--- a/test-cases/junk-corpus/test_data/random_bytes.csv
+++ b/test-cases/junk-corpus/test_data/random_bytes.csv
--- a/test-cases/junk-corpus/test_data/single_column.csv
+++ b/test-cases/junk-corpus/test_data/single_column.csv
@@ -0,0 +1,21 @@
+id
+0
+1
+2
+3
+4
+5
+6
+7
+8
+9
+10
+11
+12
+13
+14
+15
+16
+17
+18
+19
--- a/test-cases/junk-corpus/test_data/trailing_commas.csv
+++ b/test-cases/junk-corpus/test_data/trailing_commas.csv
@@ -0,0 +1,3 @@
+id,name,note,
+1,alice,hi,
+2,bob,wo,
--- a/test-cases/junk-corpus/test_data/truncated_mid_row.csv
+++ b/test-cases/junk-corpus/test_data/truncated_mid_row.csv
@@ -0,0 +1,3 @@
+id,name,note
+1,alice,hello
+2,bob,wor
--- a/test-cases/junk-corpus/test_data/tsv_as_csv.csv
+++ b/test-cases/junk-corpus/test_data/tsv_as_csv.csv
@@ -0,0 +1,3 @@
+id	name	note
+1	alice	hi
+2	bob	world
--- a/test-cases/junk-corpus/test_data/unescaped_quotes.csv
+++ b/test-cases/junk-corpus/test_data/unescaped_quotes.csv
@@ -0,0 +1,3 @@
+id,note
+1,"this has " unescaped quote"
+2,"normal"
--- a/test-cases/junk-corpus/test_data/utf16_be_with_bom.csv
+++ b/test-cases/junk-corpus/test_data/utf16_be_with_bom.csv
--- a/test-cases/junk-corpus/test_data/utf16_le_no_bom.csv
+++ b/test-cases/junk-corpus/test_data/utf16_le_no_bom.csv
--- a/test-cases/junk-corpus/test_data/utf32_le.csv
+++ b/test-cases/junk-corpus/test_data/utf32_le.csv
--- a/test-cases/junk-corpus/test_data/very_wide_cell.csv
+++ b/test-cases/junk-corpus/test_data/very_wide_cell.csv
--- a/test-cases/junk-corpus/test_data/weird_extension.foo
+++ b/test-cases/junk-corpus/test_data/weird_extension.foo
@@ -0,0 +1,2 @@
+id,name,note
+1,alice,hi
--- a/tests/test_fixtures_sweep.py
+++ b/tests/test_fixtures_sweep.py
@@ -35,8 +35,10 @@ TEST_CASES_DIR = Path(__file__).resolve().parent.parent / "test-cases"

 # Subdirectories in test-cases/ that are exercised by their own dedicated
 # tests. The sweep ignores these so we don't double-test or fight expected
-# byte-exact outputs.
-_EXCLUDED_SUBDIRS = {"text-cleaner-corpus"}
+# byte-exact outputs. ``junk-corpus`` is intentionally pathological —
+# files there are designed to break the cleaner/analyzer; the contract is
+# enforced by ``tests/test_junk_corpus.py``, not this happy-path sweep.
+_EXCLUDED_SUBDIRS = {"text-cleaner-corpus", "junk-corpus"}

 # File suffixes we know how to load.
 _SUPPORTED_SUFFIXES = {".csv", ".tsv", ".xlsx", ".xls"}
--- a/tests/test_junk_corpus.py
+++ b/tests/test_junk_corpus.py
@@ -0,0 +1,156 @@
+"""Stress-test the upload analyzer against a corpus of pathological files.
+
+Every file under ``test-cases/junk-corpus/test_data/`` is fed through
+``_run_analysis_on_upload`` — the same path the GUI takes when a user
+drops a file on the home page. The contract under test is:
+
+* The call never raises. Errors must surface as a synthetic ``Finding``
+  with severity ``"error"``, not a Python traceback that the page
+  chrome bubbles up to the user.
+* The return is always a list of :class:`Finding` (possibly empty for
+  files the analyzer judges clean).
+* Specific high-risk files (empty bytes, corrupt zip, etc.) MUST
+  produce at least one error-level Finding so the UI shows a red
+  banner rather than silently rendering "no issues found".
+
+To add a new pathological shape:
+
+1. Edit ``test-cases/junk-corpus/make_junk_corpus.py`` to write the new
+   file under ``test_data/``.
+2. Re-run that script to materialize the file on disk.
+3. (Optional) Add the filename to ``_MUST_BE_ERROR`` below if the file
+   represents a state where "no findings" would be a silent failure.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pytest
+
+from src.core.analyze import Finding
+from src.gui.components._legacy import _run_analysis_on_upload
+
+
+_CORPUS = Path(__file__).resolve().parent.parent / "test-cases" / "junk-corpus" / "test_data"
+
+
+class _FakeUpload:
+    """Duck-type the Streamlit ``UploadedFile`` interface from a path."""
+
+    def __init__(self, path: Path) -> None:
+        self.name = path.name
+        self._bytes = path.read_bytes()
+
+    def getvalue(self) -> bytes:
+        return self._bytes
+
+
+def _corpus_files() -> list[Path]:
+    files = sorted(p for p in _CORPUS.iterdir() if p.is_file())
+    if not files:
+        raise RuntimeError(
+            f"Junk corpus is empty. Run "
+            f"`python test-cases/junk-corpus/make_junk_corpus.py` "
+            f"to generate {_CORPUS}."
+        )
+    return files
+
+
+# Files where "zero findings" would be a silent failure — these are
+# structurally broken enough that the analyzer MUST flag them. The
+# error-level Finding is what shows the user a red banner instead of
+# the misleading "no issues found" success path.
+_MUST_BE_ERROR = {
+    "empty.csv",
+    "only_bom.csv",
+    "only_nul.csv",
+    "corrupt_xlsx.xlsx",
+}
+
+
+@pytest.mark.parametrize(
+    "path",
+    _corpus_files(),
+    ids=lambda p: p.name,
+)
+class TestJunkCorpus:
+    """Every pathological file must round-trip through the analyzer
+    without raising. The error message format is checked separately
+    via :meth:`TestJunkCorpus.test_error_findings_have_a_description`.
+    """
+
+    def test_no_exception_propagates(self, path: Path) -> None:
+        upload = _FakeUpload(path)
+        # The point of the test: any exception from analyze() / pandas /
+        # repair_bytes / openpyxl SHOULD have been caught and turned
+        # into an error Finding by ``_run_analysis_on_upload``. If this
+        # raises, the home page would crash on this file in production.
+        findings = _run_analysis_on_upload(upload)
+        assert isinstance(findings, list), (
+            f"{path.name}: expected list[Finding], got {type(findings).__name__}"
+        )
+
+    def test_findings_are_well_formed(self, path: Path) -> None:
+        upload = _FakeUpload(path)
+        findings = _run_analysis_on_upload(upload)
+        for f in findings:
+            assert isinstance(f, Finding), (
+                f"{path.name}: non-Finding in result list: {f!r}"
+            )
+            assert isinstance(f.id, str) and f.id, (
+                f"{path.name}: Finding has empty id"
+            )
+            assert f.severity in ("info", "warn", "error"), (
+                f"{path.name}: Finding has bad severity {f.severity!r}"
+            )
+            assert isinstance(f.description, str) and f.description, (
+                f"{path.name}: Finding has empty description"
+            )
+
+    def test_must_be_error_files_actually_flag(self, path: Path) -> None:
+        if path.name not in _MUST_BE_ERROR:
+            pytest.skip(f"{path.name} is allowed to pass clean")
+        upload = _FakeUpload(path)
+        findings = _run_analysis_on_upload(upload)
+        errors = [f for f in findings if f.severity == "error"]
+        assert errors, (
+            f"{path.name} should surface at least one error-level "
+            f"Finding so the UI shows a red banner; got {len(findings)} "
+            f"findings (none of severity 'error')."
+        )
+
+    def test_error_findings_have_a_description(self, path: Path) -> None:
+        """Error findings must carry a description the user can act on.
+
+        For an empty / corrupt file the description is the ONLY thing
+        the user sees — it has to name the file or include enough
+        context that they can fix the underlying problem.
+        """
+        upload = _FakeUpload(path)
+        findings = _run_analysis_on_upload(upload)
+        for f in findings:
+            if f.severity != "error":
+                continue
+            # The synthetic error Findings always interpolate the file
+            # name; analyzer-generated errors include the column or a
+            # description that mentions what was wrong.
+            assert len(f.description) >= 20, (
+                f"{path.name}: error Finding description is too short "
+                f"to be useful: {f.description!r}"
+            )
+
+
+def test_corpus_contains_expected_shapes() -> None:
+    """Sanity-check that the corpus generator wrote the files we rely
+    on for the must-be-error matrix. If somebody renames a file in
+    ``make_junk_corpus.py`` without updating ``_MUST_BE_ERROR``, this
+    test catches it before the per-file parametrization silently
+    skips the must-be-error assertion."""
+    names = {p.name for p in _corpus_files()}
+    missing = _MUST_BE_ERROR - names
+    assert not missing, (
+        f"_MUST_BE_ERROR references files that don't exist in the "
+        f"corpus: {sorted(missing)}. Regenerate the corpus or update "
+        f"_MUST_BE_ERROR."
+    )