Michael 696996c119 test(junk-corpus): pathological-input stress suite for the analyzer
Build a corpus of 35 deliberately-broken files (empty bytes, NUL
bytes, mojibake, UTF-16 without BOM, mismatched columns, unescaped
quotes, corrupt zip, etc.) and pin the analyzer's stability contract
against them.

Files land in ``test-cases/junk-corpus/test_data/``. The generator
``make_junk_corpus.py`` produces them deterministically, with one
exception: a single random sample uses ``secrets.token_bytes``. Its
bytes stay stable in practice because the committed copy, captured at
commit time, is what the tests read. README documents the categories
and how to add new shapes.

``tests/test_junk_corpus.py`` parametrizes over every file in the
corpus and asserts:

1. ``_run_analysis_on_upload`` never raises — exceptions must be
   caught and surfaced as a synthetic ``Finding`` with
   severity="error". This was the user-reported crash for
   13_non_latin_scripts.csv that the previous fix in ae9d4a2
   defensively wrapped; the corpus now stops the regression
   from re-landing on a different shape.
2. Every Finding in the result list is well-formed (string id,
   valid severity, non-empty description).
3. A high-risk subset (empty.csv, only_bom.csv, only_nul.csv,
   corrupt_xlsx.xlsx) MUST surface at least one error-level
   Finding — otherwise the GUI would render "no issues found"
   for a structurally broken file.
4. Error-level Finding descriptions are at least 20 chars so the
   UI banner gives the user something to act on.

Also exclude ``junk-corpus`` from ``tests/test_fixtures_sweep.py``
since that sweep is happy-path (round-trip the text cleaner) and
fights with files designed to break it. The contract is enforced
by the dedicated junk-corpus test, not the sweep.

Runtime: 12 s for the junk-corpus tests, 30 s for the full
project suite (was 19 s without these). 2118 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:35:22 +00:00

# Junk Corpus — pathological-input stress tests
This corpus exists to make the upload analyzer prove it can survive any
file a user (or an adversary) might drop on it. Every file under
`test_data/` is deliberately broken in a different way: empty bytes,
NUL bytes, mojibake, UTF-16 without a BOM, mismatched columns,
unescaped quotes, corrupt `.xlsx`, and so on.
The contract enforced by `tests/test_junk_corpus.py`:
1. `_run_analysis_on_upload(file)` MUST NOT raise. Errors are caught
   and surfaced as a synthetic `Finding` with severity `"error"`.
2. The return value is always a `list[Finding]` (possibly empty for
   files the analyzer judges clean).
3. A specific subset of files (`empty.csv`, `only_bom.csv`,
   `only_nul.csv`, `corrupt_xlsx.xlsx`) MUST produce at least one
   error-level `Finding` so the GUI shows a red banner instead of
   silently rendering "no issues found".
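
As a rough illustration of items 2 and 3, the check below validates one file's result. It is a sketch, not the code in `tests/test_junk_corpus.py`: the `Finding` dataclass is a hypothetical stand-in for the project's type, `check_contract` is an invented helper name, and `MUST_BE_ERROR` only mirrors the `_MUST_BE_ERROR` list mentioned later in this README.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """Hypothetical stand-in; the project's real Finding may differ."""
    id: str
    severity: str
    description: str


# Files for which an empty result would be a silent failure (contract item 3).
MUST_BE_ERROR = {"empty.csv", "only_bom.csv", "only_nul.csv", "corrupt_xlsx.xlsx"}


def check_contract(filename: str, findings: object) -> None:
    """Assert contract items 2 and 3 for the result of one corpus file."""
    # Item 2: the result is always a list of Finding objects (may be empty).
    assert isinstance(findings, list), f"{filename}: result is not a list"
    for finding in findings:
        assert isinstance(finding, Finding), f"{filename}: non-Finding item"

    # Item 3: high-risk files must surface at least one error-level Finding.
    if filename in MUST_BE_ERROR:
        assert any(f.severity == "error" for f in findings), (
            f"{filename}: expected at least one error-level Finding"
        )
```

An empty list passes for a file like `mojibake.csv` but fails for `empty.csv`, which is the distinction item 3 exists to enforce.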
## Why this matters
In a multi-file home-page upload, one bad file used to bubble a
Python traceback up through the page chrome and kill every other
file's analysis. The defensive wrap in `_run_analysis_on_upload` plus
this stress test together prevent that regression.
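
The general shape of such a defensive wrap can be sketched as follows. This is an assumption about the pattern, not the body of `_run_analysis_on_upload`: `Finding`, `run_safely`, `analyze_all`, and the `"analysis-crashed"` id are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Finding:
    """Hypothetical stand-in; the project's real Finding may differ."""
    id: str
    severity: str
    description: str


def run_safely(analyze: Callable[[str], Iterable[Finding]], path: str) -> list:
    """Run analyze(path); turn any exception into one error-level Finding."""
    try:
        return list(analyze(path))
    except Exception as exc:  # deliberately broad: no input may crash the page
        return [
            Finding(
                id="analysis-crashed",  # hypothetical id
                severity="error",
                description=f"Could not analyze {path}: {type(exc).__name__}: {exc}",
            )
        ]


def analyze_all(analyze: Callable[[str], Iterable[Finding]], paths: list) -> dict:
    """Analyze each file independently so one bad file cannot abort the rest."""
    return {path: run_safely(analyze, path) for path in paths}
```

Because each file is wrapped separately, a failure in one upload becomes a per-file error entry while the other files still get their own results.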
## Regenerating the corpus
```bash
python test-cases/junk-corpus/make_junk_corpus.py
```
The generator currently writes 35 files into `test_data/`. They are
small (< 100 KB each) and committed to the repo so the stress test
runs without depending on a regenerate step.
## Adding a new pathological shape
1. Add a `write(...)` call to `make_junk_corpus.py`.
2. Re-run that script to materialize the file on disk.
3. (Optional) Add the filename to `_MUST_BE_ERROR` in
   `tests/test_junk_corpus.py` if "no findings" would be a silent
   failure for that shape.
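
For orientation, step 1 might look roughly like the sketch below. The real signature of `write(...)` in `make_junk_corpus.py` is not shown in this README, so the three-argument form here is an assumption, and `only_cr.csv` is a made-up shape used purely as an example.

```python
import tempfile
from pathlib import Path


def write(out_dir: Path, name: str, data: bytes) -> Path:
    """Write raw bytes so no encoding or newline translation alters the payload."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / name
    target.write_bytes(data)
    return target


# A temporary directory stands in for test-cases/junk-corpus/test_data/
# so that running this sketch does not touch the real corpus.
out_dir = Path(tempfile.mkdtemp()) / "test_data"

# Hypothetical new shape: a file whose only content is a lone carriage return.
new_file = write(out_dir, "only_cr.csv", b"\r")
```

Writing with `write_bytes` rather than text mode matters for a corpus like this, since text mode could normalize exactly the bytes a shape is meant to preserve.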
## What's already covered
| Category | Files |
|---|---|
| Empty / near-empty | `empty.csv`, `only_whitespace.csv`, `only_bom.csv`, `only_nul.csv`, `just_newlines.csv`, `header_only.csv` |
| Random / binary garbage | `random_bytes.csv`, `png_magic_as_csv.csv` |
| Truncated or huge | `truncated_mid_row.csv`, `one_huge_line.csv`, `massive_columns.csv`, `single_column.csv` |
| Wrong delimiter | `tsv_as_csv.csv`, `mixed_delimiters.csv` |
| Encoding chaos | `utf16_le_no_bom.csv`, `utf16_be_with_bom.csv`, `utf32_le.csv`, `mojibake.csv`, `invalid_utf8.csv`, `cp1252_smart_quotes.csv` |
| Quoting / shape | `unescaped_quotes.csv`, `embedded_newlines.csv`, `mismatched_columns.csv`, `duplicate_headers.csv`, `empty_header_names.csv`, `trailing_commas.csv` |
| Content | `all_nulls.csv`, `very_wide_cell.csv`, `all_same_row.csv` |
| Extension confusion | `no_extension`, `weird_extension.foo`, `double_extension.csv.txt` |
| Excel pathologies | `corrupt_xlsx.xlsx`, `excel_empty.xlsx`, `excel_header_only.xlsx` |
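
To make a few of these rows concrete, the sketch below feeds plausible payloads through a strict UTF-8 decode, the kind of happy-path assumption the corpus is meant to break. The byte strings are assumptions about what such files could contain; the actual bytes written by `make_junk_corpus.py` may differ.

```python
import codecs

# Assumed payloads for illustration only.
samples = {
    "only_bom.csv": codecs.BOM_UTF8,
    "only_nul.csv": b"\x00" * 16,
    "utf16_le_no_bom.csv": "id,name\n1,Zoe\n".encode("utf-16-le"),
    "invalid_utf8.csv": b"id,name\n1,caf\xe9\n",  # Latin-1 byte, not valid UTF-8
}


def naive_decode(data: bytes) -> str:
    """Strict UTF-8 decode with no fallback."""
    return data.decode("utf-8")


outcomes = {}
for name, data in samples.items():
    try:
        text = naive_decode(data)
    except UnicodeDecodeError as exc:
        outcomes[name] = f"failed: {exc.reason}"
    else:
        # A successful decode can still be garbage, e.g. interleaved NULs.
        outcomes[name] = "decoded, NULs present" if "\x00" in text else "decoded"

for name, outcome in outcomes.items():
    print(f"{name}: {outcome}")
```

Only one of the four samples raises. The others decode without complaint into content that is useless as a table, which is why the contract asks for explicit error-level findings rather than relying on exceptions alone.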
## Manually loading a junk file in the GUI
The files are real on-disk artifacts. Drag any of them into the home
page uploader to verify the GUI renders a sensible error (or clean
findings, for files the analyzer is OK with) instead of crashing.