datatools-dev/test-cases/junk-corpus/README.md

# Junk Corpus: pathological-input stress tests

This corpus exists to make the upload analyzer prove it can survive any
file a user (or an adversary) might drop on it. Every file under
`test_data/` is deliberately pathological in a different way: empty
content, NUL bytes, mojibake, UTF-16 without a BOM, mismatched columns,
unescaped quotes, corrupt `.xlsx`, and so on.

The contract enforced by `tests/test_junk_corpus.py`:

1. `_run_analysis_on_upload(file)` MUST NOT raise. Errors are caught
   and surfaced as a synthetic `Finding` with severity `"error"`.
2. The return value is always a `list[Finding]` (possibly empty for
   files the analyzer judges clean).
3. A specific subset of files (`empty.csv`, `only_bom.csv`,
   `only_nul.csv`, `corrupt_xlsx.xlsx`) MUST produce at least one
   error-level Finding so the GUI shows a red banner instead of
   silently rendering "no issues found".
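
The three rules above can be expressed as one check per file. The following is a minimal sketch, not the code in `tests/test_junk_corpus.py`: the `Finding` dataclass, the `MUST_BE_ERROR` set, and the `contract_violations` helper are hypothetical stand-ins, and `analyze` represents whatever callable wraps `_run_analysis_on_upload`.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Finding:
    """Hypothetical stand-in for the project's Finding type."""
    severity: str
    message: str


# Filenames taken from rule 3; the name of this set is an assumption.
MUST_BE_ERROR = {"empty.csv", "only_bom.csv", "only_nul.csv", "corrupt_xlsx.xlsx"}


def contract_violations(path: Path, analyze: Callable[[Path], object]) -> list[str]:
    """Return a description of the first contract rule `analyze` breaks, if any."""
    try:
        findings = analyze(path)
    except Exception as exc:  # rule 1: the analyzer must not raise
        return [f"{path.name}: raised {type(exc).__name__}"]

    # Rule 2: the result is always a list of Finding objects.
    if not isinstance(findings, list) or not all(isinstance(f, Finding) for f in findings):
        return [f"{path.name}: result is not a list of Finding"]

    # Rule 3: selected files must not come back looking clean.
    if path.name in MUST_BE_ERROR and not any(f.severity == "error" for f in findings):
        return [f"{path.name}: expected at least one error-level Finding"]
    return []
```

A test could then loop over every file in `test_data/` and assert that `contract_violations(...)` returns an empty list.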

## Why this matters

In a multi-file home-page upload, one bad file used to bubble a
Python traceback up through the page chrome and kill every other
file's analysis. The defensive wrap in `_run_analysis_on_upload` now
confines the failure to the offending file, and this stress test keeps
that behavior from regressing.
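
To illustrate the idea of a defensive wrap, here is a minimal sketch under stated assumptions. It is not the body of `_run_analysis_on_upload`: `Finding`, `run_safely`, and `analyze_batch` are hypothetical names, and `analyze` stands for any per-file analysis callable that may raise.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Finding:
    """Hypothetical stand-in for the project's Finding type."""
    severity: str
    message: str


def run_safely(analyze: Callable[[str], Iterable[Finding]], filename: str) -> list[Finding]:
    """Run one analysis; convert any exception into a synthetic error Finding."""
    try:
        return list(analyze(filename))
    except Exception as exc:
        detail = f"{type(exc).__name__}: {exc}"
        return [Finding("error", f"{filename}: analysis failed ({detail})")]


def analyze_batch(analyze, filenames) -> dict[str, list[Finding]]:
    """Analyze each file independently so one failure cannot abort the rest."""
    return {name: run_safely(analyze, name) for name in filenames}
```

Because the `try`/`except` sits around each file rather than around the whole batch, a crash on one upload only affects that file's entry in the result.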

## Regenerating the corpus

```bash
python test-cases/junk-corpus/make_junk_corpus.py
```

The generator writes roughly 35 files into `test_data/`. They are small
(< 100 KB each) and committed to the repo so the stress test runs
without depending on a regeneration step.
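
If the size budget ever needs checking after a regenerate, a short script along these lines could flag offenders. This is a sketch only: the `oversized` helper is hypothetical, and reading "100 KB" as 100 × 1024 bytes is an assumption.

```python
from pathlib import Path

# Assumption: "100 KB" is interpreted as 100 * 1024 bytes.
MAX_BYTES = 100 * 1024


def oversized(corpus_dir: Path, limit: int = MAX_BYTES) -> list[str]:
    """Return the sorted names of regular files at or above the size limit."""
    return sorted(
        p.name
        for p in corpus_dir.iterdir()
        if p.is_file() and p.stat().st_size >= limit
    )
```

Calling `oversized(Path("test-cases/junk-corpus/test_data"))` from the repo root would list any file that should be trimmed before committing.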

## Adding a new pathological shape

1. Add a `write(...)` call to `make_junk_corpus.py`.
2. Re-run that script to materialize the file on disk.
3. (Optional) Add the filename to `_MUST_BE_ERROR` in
   `tests/test_junk_corpus.py` if "no findings" would be a silent
   failure for that shape.

## What's already covered

| Category | Files |
|---|---|
| Empty / near-empty | `empty.csv`, `only_whitespace.csv`, `only_bom.csv`, `only_nul.csv`, `just_newlines.csv`, `header_only.csv` |
| Random / binary garbage | `random_bytes.csv`, `png_magic_as_csv.csv` |
| Truncated or huge | `truncated_mid_row.csv`, `one_huge_line.csv`, `massive_columns.csv`, `single_column.csv` |
| Wrong delimiter | `tsv_as_csv.csv`, `mixed_delimiters.csv` |
| Encoding chaos | `utf16_le_no_bom.csv`, `utf16_be_with_bom.csv`, `utf32_le.csv`, `mojibake.csv`, `invalid_utf8.csv`, `cp1252_smart_quotes.csv` |
| Quoting / shape | `unescaped_quotes.csv`, `embedded_newlines.csv`, `mismatched_columns.csv`, `duplicate_headers.csv`, `empty_header_names.csv`, `trailing_commas.csv` |
| Content | `all_nulls.csv`, `very_wide_cell.csv`, `all_same_row.csv` |
| Extension confusion | `no_extension`, `weird_extension.foo`, `double_extension.csv.txt` |
| Excel pathologies | `corrupt_xlsx.xlsx`, `excel_empty.xlsx`, `excel_header_only.xlsx` |

## Manually loading a junk file in the GUI

The files are real on-disk artifacts. Drag any of them into the home
page uploader to verify the GUI renders a sensible error (or clean
findings, for files the analyzer accepts) instead of crashing.