Build a corpus of 35 deliberately-broken files (empty bytes, NUL
bytes, mojibake, UTF-16 without BOM, mismatched columns, unescaped
quotes, corrupt zip, etc.) and pin the analyzer's stability contract
against them.
Files land in ``test-cases/junk-corpus/test_data/``. The generator
``make_junk_corpus.py`` produces them deterministically (one random
sample uses ``secrets.token_bytes`` — committed bytes are stable
across regenerations because the byte stream is captured at commit
time). README documents the categories and how to add new shapes.
``tests/test_junk_corpus.py`` parametrizes over every file in the
corpus and asserts:
1. ``_run_analysis_on_upload`` never raises — exceptions must be
caught and surfaced as a synthetic ``Finding`` with
severity="error". The user-reported crash on
13_non_latin_scripts.csv was one instance of this; the fix in
ae9d4a2 added the defensive wrap, and the corpus now keeps the
regression from re-landing on a different shape.
2. Every Finding in the result list is well-formed (string id,
valid severity, non-empty description).
3. A high-risk subset (empty.csv, only_bom.csv, only_nul.csv,
corrupt_xlsx.xlsx) MUST surface at least one error-level
Finding — otherwise the GUI would render "no issues found"
for a structurally broken file.
4. Error-level Finding descriptions are at least 20 chars so the
UI banner gives the user something to act on.
Also exclude ``junk-corpus`` from ``tests/test_fixtures_sweep.py``
since that sweep is happy-path (round-trip the text cleaner) and
fights with files designed to break it. The contract is enforced
by the dedicated junk-corpus test, not the sweep.
Runtime: 12 s for the junk-corpus tests, 30 s for the full
project suite (was 19 s without these). 2118 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Junk Corpus — pathological-input stress tests
This corpus exists to make the upload analyzer prove it can survive any
file a user (or an adversary) might drop on it. Every file under
test_data/ is deliberately broken in a different way: empty bytes,
NUL bytes, mojibake, UTF-16 without a BOM, mismatched columns,
unescaped quotes, corrupt .xlsx, and so on.
The contract enforced by tests/test_junk_corpus.py:
- _run_analysis_on_upload(file) MUST NOT raise. Errors are caught and surfaced as a synthetic Finding with severity "error".
- The return is always a list[Finding] (possibly empty for files the analyzer judges clean).
- A specific subset of files (empty.csv, only_bom.csv, only_nul.csv, corrupt_xlsx.xlsx) MUST produce at least one error-level Finding so the GUI shows a red banner instead of silently rendering "no issues found".
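The contract above can be sketched as a small checker. This is only an illustration: the Finding stand-in, its field names, the VALID_SEVERITIES vocabulary, and the check_contract helper are assumptions made for this sketch, not the project's actual types or test code.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the project's Finding type; field names are assumed.
@dataclass
class Finding:
    id: str
    severity: str
    description: str


# Assumed severity vocabulary; the README only confirms that "error" exists.
VALID_SEVERITIES = {"info", "warning", "error"}

# The subset named in the contract that must never come back "clean".
MUST_BE_ERROR = {"empty.csv", "only_bom.csv", "only_nul.csv", "corrupt_xlsx.xlsx"}


def check_contract(filename, findings):
    """Return a list of contract violations; an empty list means the result passes."""
    if not isinstance(findings, list):
        return [f"{filename}: result is not a list"]

    problems = []
    for item in findings:
        if not isinstance(item, Finding):
            problems.append(f"{filename}: non-Finding item in result")
            continue
        if item.severity not in VALID_SEVERITIES:
            problems.append(f"{filename}: invalid severity {item.severity!r}")
        if not item.description.strip():
            problems.append(f"{filename}: empty description")

    has_error = any(getattr(item, "severity", None) == "error" for item in findings)
    if filename in MUST_BE_ERROR and not has_error:
        problems.append(f"{filename}: expected at least one error-level Finding")
    return problems
```

An empty result is acceptable for a file such as mojibake.csv but is a violation for empty.csv, which is the distinction the third rule draws.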
Why this matters
In a multi-file home-page upload, one bad file used to bubble a
Python traceback up through the page chrome and kill every other
file's analysis. The defensive wrap in _run_analysis_on_upload plus
this stress test together prevent that regression.
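A minimal sketch of what such a defensive wrap might look like, so that one failing file cannot take down the others. The real body of _run_analysis_on_upload is not shown in this README; the Finding stand-in, the "analysis-crashed" id, and the analyze_safely / analyze_many helpers are all hypothetical names used for illustration.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the project's Finding type; field names are assumed.
@dataclass
class Finding:
    id: str
    severity: str
    description: str


def analyze_safely(analyze, path):
    """Run analyze(path); convert any exception into a synthetic error Finding."""
    try:
        return list(analyze(path))
    except Exception as exc:  # deliberately broad: nothing may escape to the page
        return [
            Finding(
                id="analysis-crashed",  # assumed id, for illustration only
                severity="error",
                description=f"Could not analyze {path}: {type(exc).__name__}: {exc}",
            )
        ]


def analyze_many(analyze, paths):
    """Analyze each file independently so one failure does not affect the rest."""
    return {path: analyze_safely(analyze, path) for path in paths}
```

With a wrapper of this shape, a multi-file upload still gets a result for every file: the broken one shows an error-level Finding and the remaining files are analyzed as usual.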
Regenerating the corpus
python test-cases/junk-corpus/make_junk_corpus.py
The generator writes 35-ish files into test_data/. They are small
(< 100 KB each) and committed to the repo so the stress test runs
without depending on a regenerate step.
Adding a new pathological shape
- Add a write(...) call to make_junk_corpus.py.
- Re-run that script to materialize the file on disk.
- (Optional) Add the filename to _MUST_BE_ERROR in tests/test_junk_corpus.py if "no findings" would be a silent failure for that shape.
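The first step might look roughly like the sketch below. The actual signature of write(...) in make_junk_corpus.py is not shown here, so the (out_dir, name, payload) form, the demo function, and the lone_cr_line_endings.csv filename are assumptions for illustration.

```python
from pathlib import Path


def write(out_dir: Path, name: str, payload: bytes) -> Path:
    """Write raw bytes to out_dir/name (assumed signature; the real helper may differ)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / name
    target.write_bytes(payload)
    return target


def demo(out_dir: Path) -> list:
    """Materialize a few pathological shapes and return their paths."""
    return [
        # A UTF-8 byte order mark and nothing else.
        write(out_dir, "only_bom.csv", b"\xef\xbb\xbf"),
        # UTF-16 little-endian text; the "utf-16-le" codec emits no BOM.
        write(out_dir, "utf16_le_no_bom.csv", "a,b\n1,2\n".encode("utf-16-le")),
        # Hypothetical new shape: bare carriage returns as line endings.
        write(out_dir, "lone_cr_line_endings.csv", b"a,b\r1,2\r3,4\r"),
    ]
```

Writing bytes rather than text keeps the payload exact, which matters for shapes whose whole point is an unusual encoding or stray control bytes.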
What's already covered
| Category | Files |
|---|---|
| Empty / near-empty | empty.csv, only_whitespace.csv, only_bom.csv, only_nul.csv, just_newlines.csv, header_only.csv |
| Random / binary garbage | random_bytes.csv, png_magic_as_csv.csv |
| Truncated or huge | truncated_mid_row.csv, one_huge_line.csv, massive_columns.csv, single_column.csv |
| Wrong delimiter | tsv_as_csv.csv, mixed_delimiters.csv |
| Encoding chaos | utf16_le_no_bom.csv, utf16_be_with_bom.csv, utf32_le.csv, mojibake.csv, invalid_utf8.csv, cp1252_smart_quotes.csv |
| Quoting / shape | unescaped_quotes.csv, embedded_newlines.csv, mismatched_columns.csv, duplicate_headers.csv, empty_header_names.csv, trailing_commas.csv |
| Content | all_nulls.csv, very_wide_cell.csv, all_same_row.csv |
| Extension confusion | no_extension, weird_extension.foo, double_extension.csv.txt |
| Excel pathologies | corrupt_xlsx.xlsx, excel_empty.xlsx, excel_header_only.xlsx |
Manually loading a junk file in the GUI
The files are real on-disk artifacts. Drag any of them into the home page uploader to verify the GUI renders a sensible error (or clean findings, for files the analyzer is OK with) instead of crashing.