Build a corpus of 35 deliberately broken files (empty bytes, NUL
bytes, mojibake, UTF-16 without BOM, mismatched columns, unescaped
quotes, corrupt zip, etc.) and pin the analyzer's stability contract
against them.

Files land in ``test-cases/junk-corpus/test_data/``. The generator
``make_junk_corpus.py`` is deterministic except for one sample that
draws from ``secrets.token_bytes``; because the generated files are
committed, the tests always run against the bytes captured at commit
time. The README documents the categories and how to add new shapes.

``tests/test_junk_corpus.py`` parametrizes over every file in the
corpus and asserts:
1. ``_run_analysis_on_upload`` never raises. Exceptions must be
   caught and surfaced as a synthetic ``Finding`` with
   severity="error". This was the user-reported crash on
   13_non_latin_scripts.csv that the previous fix in ae9d4a2
   defensively wrapped; the corpus now keeps the regression
   from re-landing on a different shape.
2. Every Finding in the result list is well-formed (string id,
   valid severity, non-empty description).
3. A high-risk subset (empty.csv, only_bom.csv, only_nul.csv,
   corrupt_xlsx.xlsx) MUST surface at least one error-level
   Finding. Otherwise the GUI would render "no issues found"
   for a structurally broken file.
4. Error-level Finding descriptions are at least 20 chars so the
   UI banner gives the user something to act on.

Also exclude ``junk-corpus`` from ``tests/test_fixtures_sweep.py``,
since that sweep covers the happy path (round-tripping the text
cleaner) and conflicts with files designed to break it. The contract
is enforced by the dedicated junk-corpus test, not the sweep.

Runtime: 12 s for the junk-corpus tests, 30 s for the full
project suite (was 19 s without these). 2118 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Junk Corpus: pathological-input stress tests

This corpus exists to make the upload analyzer prove it can survive any
file a user (or an adversary) might drop on it. Every file under
`test_data/` is deliberately broken in a different way: empty bytes,
NUL bytes, mojibake, UTF-16 without a BOM, mismatched columns,
unescaped quotes, corrupt `.xlsx`, and so on.

The contract enforced by `tests/test_junk_corpus.py`:

1. `_run_analysis_on_upload(file)` MUST NOT raise. Errors are caught
   and surfaced as a synthetic `Finding` with severity `"error"`.
2. The return is always a `list[Finding]` (possibly empty for files
   the analyzer judges clean).
3. A specific subset of files (`empty.csv`, `only_bom.csv`,
   `only_nul.csv`, `corrupt_xlsx.xlsx`) MUST produce at least one
   error-level Finding so the GUI shows a red banner instead of
   silently rendering "no issues found".

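
The three rules above can be sketched as a single per-file check. This is an illustration only, not the contents of `tests/test_junk_corpus.py`: the `Finding` dataclass, `check_contract`, and `toy_analyze` are hypothetical stand-ins, and the real test calls `_run_analysis_on_upload` instead of a toy analyzer.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Finding:
    """Hypothetical stand-in for the project's Finding type."""
    id: str
    severity: str
    description: str


# Files for which "no findings" would be a silent failure (rule 3).
MUST_BE_ERROR = {"empty.csv", "only_bom.csv", "only_nul.csv", "corrupt_xlsx.xlsx"}


def check_contract(path: Path, analyze: Callable[[Path], list]) -> list:
    """Assert the three contract rules for one file and return its findings."""
    # Rule 1: the analyzer must not raise.
    try:
        findings = analyze(path)
    except Exception as exc:
        raise AssertionError(f"{path.name}: analyzer raised {exc!r}") from exc

    # Rule 2: the result is always a list of Finding objects (may be empty).
    assert isinstance(findings, list), f"{path.name}: expected a list"
    assert all(isinstance(f, Finding) for f in findings), path.name

    # Rule 3: structurally broken files must surface an error-level Finding.
    if path.name in MUST_BE_ERROR:
        assert any(f.severity == "error" for f in findings), (
            f"{path.name}: expected at least one error-level Finding"
        )
    return findings


def toy_analyze(path: Path) -> list:
    """Toy analyzer used only for this sketch: flags zero-byte files."""
    if path.stat().st_size == 0:
        return [Finding("empty-file", "error", "The uploaded file contains no bytes.")]
    return []
```

A test runner would call `check_contract` once per file under `test_data/`, for example through pytest parametrization over the directory listing.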
## Why this matters

In a multi-file home-page upload, one bad file used to bubble a
Python traceback up through the page chrome and kill every other
file's analysis. The defensive wrap in `_run_analysis_on_upload` plus
this stress test together prevent that regression.

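
A minimal sketch of the defensive-wrap idea, assuming a per-file analyzer that takes raw bytes. The names `run_safely`, `analyze_batch`, and `strict_utf8_analyze`, and the `Finding` fields, are hypothetical and are not taken from the project's code; the point is only that each file's failure is converted into an error-level `Finding` so the other files still get analyzed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Finding:
    """Hypothetical stand-in for the project's Finding type."""
    id: str
    severity: str
    description: str


def run_safely(analyze: Callable[[bytes], list], name: str, data: bytes) -> list:
    """Run one file's analysis; turn any exception into an error Finding."""
    try:
        return analyze(data)
    except Exception as exc:  # deliberately broad: nothing may escape to the page
        return [
            Finding(
                id="analysis-crashed",
                severity="error",
                description=f"Could not analyze {name}: {type(exc).__name__}: {exc}",
            )
        ]


def analyze_batch(analyze: Callable[[bytes], list], uploads: dict) -> dict:
    """Analyze each upload independently so one bad file cannot stop the rest."""
    return {name: run_safely(analyze, name, data) for name, data in uploads.items()}


def strict_utf8_analyze(data: bytes) -> list:
    """Toy analyzer for this sketch: raises on bytes that are not valid UTF-8."""
    data.decode("utf-8")
    return []
```

With this shape, a batch containing one undecodable file yields an error `Finding` for that file and normal results for the others.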
## Regenerating the corpus

```bash
python test-cases/junk-corpus/make_junk_corpus.py
```

The generator currently writes 35 files into `test_data/`. They are
small (< 100 KB each) and committed to the repo so the stress test runs
without depending on a regenerate step.

## Adding a new pathological shape

1. Add a `write(...)` call to `make_junk_corpus.py`.
2. Re-run that script to materialize the file on disk.
3. (Optional) Add the filename to `_MUST_BE_ERROR` in
   `tests/test_junk_corpus.py` if "no findings" would be a silent
   failure for that shape.

## What's already covered

| Category | Files |
|---|---|
| Empty / near-empty | `empty.csv`, `only_whitespace.csv`, `only_bom.csv`, `only_nul.csv`, `just_newlines.csv`, `header_only.csv` |
| Random / binary garbage | `random_bytes.csv`, `png_magic_as_csv.csv` |
| Truncated or huge | `truncated_mid_row.csv`, `one_huge_line.csv`, `massive_columns.csv`, `single_column.csv` |
| Wrong delimiter | `tsv_as_csv.csv`, `mixed_delimiters.csv` |
| Encoding chaos | `utf16_le_no_bom.csv`, `utf16_be_with_bom.csv`, `utf32_le.csv`, `mojibake.csv`, `invalid_utf8.csv`, `cp1252_smart_quotes.csv` |
| Quoting / shape | `unescaped_quotes.csv`, `embedded_newlines.csv`, `mismatched_columns.csv`, `duplicate_headers.csv`, `empty_header_names.csv`, `trailing_commas.csv` |
| Content | `all_nulls.csv`, `very_wide_cell.csv`, `all_same_row.csv` |
| Extension confusion | `no_extension`, `weird_extension.foo`, `double_extension.csv.txt` |
| Excel pathologies | `corrupt_xlsx.xlsx`, `excel_empty.xlsx`, `excel_header_only.xlsx` |

## Manually loading a junk file in the GUI

The files are real on-disk artifacts. Drag any of them into the home
page uploader to verify the GUI renders a sensible error (or clean
findings, for files the analyzer is OK with) instead of crashing.