datatools-dev/test-cases/junk-corpus/README.md
Michael 696996c119 test(junk-corpus): pathological-input stress suite for the analyzer
Build a corpus of 35 deliberately-broken files (empty bytes, NUL
bytes, mojibake, UTF-16 without BOM, mismatched columns, unescaped
quotes, corrupt zip, etc.) and pin the analyzer's stability contract
against them.

Files land in ``test-cases/junk-corpus/test_data/``. The generator
``make_junk_corpus.py`` produces them deterministically. The one
random sample was drawn with ``secrets.token_bytes``, but its bytes
were captured at commit time, so the committed file stays stable
across regenerations. The README documents the categories and how to
add new shapes.

``tests/test_junk_corpus.py`` parametrizes over every file in the
corpus and asserts:

1. ``_run_analysis_on_upload`` never raises — exceptions must be
   caught and surfaced as a synthetic ``Finding`` with
   severity="error". This was the user-reported crash for
   13_non_latin_scripts.csv that the previous fix in ae9d4a2
   defensively wrapped; the corpus now stops the regression
   from re-landing on a different shape.
2. Every Finding in the result list is well-formed (string id,
   valid severity, non-empty description).
3. A high-risk subset (empty.csv, only_bom.csv, only_nul.csv,
   corrupt_xlsx.xlsx) MUST surface at least one error-level
   Finding — otherwise the GUI would render "no issues found"
   for a structurally broken file.
4. Error-level Finding descriptions are at least 20 chars so the
   UI banner gives the user something to act on.

Also exclude ``junk-corpus`` from ``tests/test_fixtures_sweep.py``
since that sweep is happy-path (round-trip the text cleaner) and
fights with files designed to break it. The contract is enforced
by the dedicated junk-corpus test, not the sweep.

Runtime: 12 s for the junk-corpus tests, 30 s for the full
project suite (was 19 s without these). 2118 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:35:22 +00:00

Junk Corpus — pathological-input stress tests

This corpus exists to make the upload analyzer prove it can survive any file a user (or an adversary) might drop on it. Every file under test_data/ is deliberately broken in a different way: empty bytes, NUL bytes, mojibake, UTF-16 without a BOM, mismatched columns, unescaped quotes, corrupt .xlsx, and so on.

The contract enforced by tests/test_junk_corpus.py:

  1. _run_analysis_on_upload(file) MUST NOT raise. Errors are caught and surfaced as a synthetic Finding with severity "error".
  2. The return value is always a list[Finding] (possibly empty for files the analyzer judges clean).
  3. A specific subset of files (empty.csv, only_bom.csv, only_nul.csv, corrupt_xlsx.xlsx) MUST produce at least one error-level Finding so the GUI shows a red banner instead of silently rendering "no issues found".
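
The three rules above can be sketched as a small checker. This is only an illustration, not the code in tests/test_junk_corpus.py: the Finding dataclass is a hypothetical stand-in for the project's own type, and check_contract is an invented helper name. Only the four filenames and the "error" severity come from the contract itself.

```python
from dataclasses import dataclass

# Files for which an empty result would be a silent failure (taken from rule 3).
MUST_BE_ERROR = {"empty.csv", "only_bom.csv", "only_nul.csv", "corrupt_xlsx.xlsx"}


@dataclass
class Finding:
    """Hypothetical stand-in for the project's Finding type."""
    id: str
    severity: str
    description: str


def check_contract(filename, findings):
    """Return a list of contract violations for one analyzed file (empty if none)."""
    # Rule 2: the analyzer must hand back a list.
    if not isinstance(findings, list):
        return [f"{filename}: result is not a list"]

    # Rule 2, continued: every item in the list must be a Finding.
    problems = [
        f"{filename}: item {i} is not a Finding"
        for i, item in enumerate(findings)
        if not isinstance(item, Finding)
    ]

    # Rule 3: structurally broken files must surface at least one error.
    has_error = any(
        isinstance(item, Finding) and item.severity == "error" for item in findings
    )
    if filename in MUST_BE_ERROR and not has_error:
        problems.append(f"{filename}: expected at least one error-level Finding")
    return problems
```

Rule 1 (never raise) is not checked here because it is a property of the call itself: a parametrized test would simply call the analyzer on each corpus file and let any escaping exception fail the test.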

Why this matters

In a multi-file home-page upload, one bad file used to bubble a Python traceback up through the page chrome and kill every other file's analysis. The defensive wrap in _run_analysis_on_upload plus this stress test together prevent that regression.
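
A minimal sketch of what such a defensive wrap might look like, assuming the analyzer is a plain callable that takes a path and returns findings. The names analyze_safely and analyze_many, the Finding dataclass, and the "analysis-crashed" id are all hypothetical; the real _run_analysis_on_upload may be structured differently.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """Hypothetical stand-in for the project's Finding type."""
    id: str
    severity: str
    description: str


def analyze_safely(analyze, path):
    """Call analyze(path); turn any exception into a single error-level Finding."""
    try:
        return list(analyze(path))
    except Exception as exc:  # deliberate catch-all: one bad file must not escape
        return [
            Finding(
                id="analysis-crashed",  # assumed id, for illustration only
                severity="error",
                description=f"Could not analyze {path}: {type(exc).__name__}: {exc}",
            )
        ]


def analyze_many(analyze, paths):
    """Analyze each upload independently so one failure cannot affect the others."""
    return {path: analyze_safely(analyze, path) for path in paths}
```

Wrapping each file separately, rather than the whole batch, is what keeps the other files' results intact when one of them fails.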

Regenerating the corpus

python test-cases/junk-corpus/make_junk_corpus.py

The generator writes 35 files into test_data/. Each is small (< 100 KB) and committed to the repo, so the stress test runs without a regeneration step.

Adding a new pathological shape

  1. Add a write(...) call to make_junk_corpus.py.
  2. Re-run that script to materialize the file on disk.
  3. (Optional) Add the filename to _MUST_BE_ERROR in tests/test_junk_corpus.py if "no findings" would be a silent failure for that shape.
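
As an illustration of step 1, here is a sketch of what a write(...) helper and a new call could look like. The signature shown is an assumption, since the real helper in make_junk_corpus.py may take different arguments, and cr_only_line_endings.csv is a made-up example shape that is not part of the current corpus.

```python
import tempfile
from pathlib import Path


def write(out_dir, name, data):
    """Hypothetical helper: write raw bytes to out_dir/name and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / name
    # Bytes, not text, so the platform cannot normalize encodings or newlines.
    target.write_bytes(data)
    return target


# Example new shape (hypothetical): rows separated by bare carriage returns.
CR_ONLY = b"id,name\r1,alpha\r2,beta\r"

if __name__ == "__main__":
    # Demo in a temporary directory so the real test_data/ is left untouched.
    with tempfile.TemporaryDirectory() as tmp:
        path = write(tmp, "cr_only_line_endings.csv", CR_ONLY)
        print(path.name, path.stat().st_size, "bytes")
```

Writing bytes rather than text keeps the broken shape exactly as specified on every platform.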

What's already covered

| Category | Files |
| --- | --- |
| Empty / near-empty | empty.csv, only_whitespace.csv, only_bom.csv, only_nul.csv, just_newlines.csv, header_only.csv |
| Random / binary garbage | random_bytes.csv, png_magic_as_csv.csv |
| Truncated or huge | truncated_mid_row.csv, one_huge_line.csv, massive_columns.csv, single_column.csv |
| Wrong delimiter | tsv_as_csv.csv, mixed_delimiters.csv |
| Encoding chaos | utf16_le_no_bom.csv, utf16_be_with_bom.csv, utf32_le.csv, mojibake.csv, invalid_utf8.csv, cp1252_smart_quotes.csv |
| Quoting / shape | unescaped_quotes.csv, embedded_newlines.csv, mismatched_columns.csv, duplicate_headers.csv, empty_header_names.csv, trailing_commas.csv |
| Content | all_nulls.csv, very_wide_cell.csv, all_same_row.csv |
| Extension confusion | no_extension, weird_extension.foo, double_extension.csv.txt |
| Excel pathologies | corrupt_xlsx.xlsx, excel_empty.xlsx, excel_header_only.xlsx |

Manually loading a junk file in the GUI

The files are real on-disk artifacts. Drag any of them into the home page uploader to verify the GUI renders a sensible error (or clean findings, for files the analyzer is OK with) instead of crashing.