# Junk Corpus: pathological-input stress tests

This corpus exists to make the upload analyzer prove it can survive any file a user (or an adversary) might drop on it. Every file under `test_data/` is deliberately broken in a different way: empty bytes, NUL bytes, mojibake, UTF-16 without a BOM, mismatched columns, unescaped quotes, corrupt `.xlsx`, and so on.

The contract enforced by `tests/test_junk_corpus.py`:

1. `_run_analysis_on_upload(file)` MUST NOT raise. Errors are caught and surfaced as a synthetic `Finding` with severity `"error"`.
2. The return is always a `list[Finding]` (possibly empty for files the analyzer judges clean).
3. A specific subset of files (`empty.csv`, `only_bom.csv`, `only_nul.csv`, `corrupt_xlsx.xlsx`) MUST produce at least one error-level Finding so the GUI shows a red banner instead of silently rendering "no issues found".

## Why this matters

In a multi-file home-page upload, one bad file used to bubble a Python traceback up through the page chrome and kill every other file's analysis. The defensive wrap in `_run_analysis_on_upload` plus this stress test together prevent that regression.

## Regenerating the corpus

```bash
python test-cases/junk-corpus/make_junk_corpus.py
```

The generator writes 35-ish files into `test_data/`. They are small (< 100 KB each) and committed to the repo so the stress test runs without depending on a regenerate step.

## Adding a new pathological shape

1. Add a `write(...)` call to `make_junk_corpus.py`.
2. Re-run that script to materialize the file on disk.
3. (Optional) Add the filename to `_MUST_BE_ERROR` in `tests/test_junk_corpus.py` if "no findings" would be a silent failure for that shape.
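
Step 1 above might look like the following minimal sketch. The `write(name, data, out_dir)` signature and the `OUT_DIR` constant are assumptions made for illustration, so the real helper in `make_junk_corpus.py` may differ. The CR-only shape is a hypothetical example of a new file, not one that exists in the corpus.

```python
from pathlib import Path

# Assumption: the generator writes into a test_data/ directory.
OUT_DIR = Path("test_data")


def write(name: str, data: bytes, out_dir: Path = OUT_DIR) -> Path:
    """Write raw bytes so no text encoding or newline translation is applied."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / name
    path.write_bytes(data)
    return path


def lone_cr_rows() -> bytes:
    """Hypothetical new shape: rows terminated by CR only, with no LF."""
    return b"id,name\r1,alpha\r2,beta\r"
```

Calling `write("lone_cr_line_endings.csv", lone_cr_rows())` would then materialize the file. Writing bytes instead of text keeps the pathology intact, because text-mode writes could normalize line endings or re-encode the content.
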
## What's already covered

| Category | Files |
|---|---|
| Empty / near-empty | `empty.csv`, `only_whitespace.csv`, `only_bom.csv`, `only_nul.csv`, `just_newlines.csv`, `header_only.csv` |
| Random / binary garbage | `random_bytes.csv`, `png_magic_as_csv.csv` |
| Truncated or huge | `truncated_mid_row.csv`, `one_huge_line.csv`, `massive_columns.csv`, `single_column.csv` |
| Wrong delimiter | `tsv_as_csv.csv`, `mixed_delimiters.csv` |
| Encoding chaos | `utf16_le_no_bom.csv`, `utf16_be_with_bom.csv`, `utf32_le.csv`, `mojibake.csv`, `invalid_utf8.csv`, `cp1252_smart_quotes.csv` |
| Quoting / shape | `unescaped_quotes.csv`, `embedded_newlines.csv`, `mismatched_columns.csv`, `duplicate_headers.csv`, `empty_header_names.csv`, `trailing_commas.csv` |
| Content | `all_nulls.csv`, `very_wide_cell.csv`, `all_same_row.csv` |
| Extension confusion | `no_extension`, `weird_extension.foo`, `double_extension.csv.txt` |
| Excel pathologies | `corrupt_xlsx.xlsx`, `excel_empty.xlsx`, `excel_header_only.xlsx` |

## Manually loading a junk file in the GUI

The files are real on-disk artifacts. Drag any of them into the home page uploader to verify the GUI renders a sensible error (or clean findings, for files the analyzer is OK with) instead of crashing.
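
The same check can also be scripted without the GUI. The sketch below approximates the contract under stated assumptions and is not the project's implementation: `Finding` is a stand-in dataclass, `analyze` is any hypothetical analyzer callable, and `run_safely` only imitates the defensive wrap in `_run_analysis_on_upload`.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Finding:
    """Stand-in for the project's Finding type (field names are assumptions)."""
    severity: str
    message: str


def run_safely(path: Path, analyze: Callable[[Path], list[Finding]]) -> list[Finding]:
    """Never raise: turn any analyzer failure into one error-level Finding."""
    try:
        return list(analyze(path))
    except Exception as exc:  # deliberately broad: one bad file must not stop the rest
        return [Finding("error", f"{path.name}: {type(exc).__name__}: {exc}")]


def check_corpus(
    corpus_dir: Path,
    analyze: Callable[[Path], list[Finding]],
    must_be_error: frozenset[str] = frozenset(),
) -> list[str]:
    """Return names of files that lack a required error-level Finding."""
    silent = []
    for path in sorted(p for p in corpus_dir.iterdir() if p.is_file()):
        findings = run_safely(path, analyze)
        has_error = any(f.severity == "error" for f in findings)
        if path.name in must_be_error and not has_error:
            silent.append(path.name)
    return silent
```

An empty list from `check_corpus` means every file was analyzed without raising and every name in `must_be_error` produced at least one error-level Finding. Any name it returns is a file that would have rendered as "no issues found".
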