datatools-dev/test-cases/encodings-corpus/ENCODINGS-CASES.md
Michael 82d7fef21e feat(gate): CSV-normalization gate with confidence-tiered findings
Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.

Core (src/core/):
  - analyze.py: Finding gains confidence, fix_action, pre_applied; new
    detectors for encoding_uncertain, encoding_decode_failed; new top-
    level encoding_override parameter.
  - fixes.py: registry of fix algorithms keyed by fix_action id.
  - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
    the NormalizationResult / Decision dataclasses the gate consumes.
  - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
    transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
    and normalizes line endings (fixes bare-CR parser crash); empty file
    handled gracefully instead of EmptyDataError traceback.

GUI (src/gui/):
  - pages/0_Review.py: gate page with per-finding decision controls,
    encoding override picker (16 codepages + custom), and Advanced output
    options (encoding, delimiter, line terminator) on the download.
  - components.py: require_normalization_gate() helper.
  - pages/1-9: gate guard wired on every tool page.

Test corpora:
  - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
    UTF-8 files + manifest, synced from Business/DataTools.
  - test-cases/text-cleaner-corpus/test_data/17: synced malformed input
    (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
  - test_normalize.py (48): finding fields, fix registry, auto_fix scope,
    decision paths, gate idempotency, output-options helper.
  - test_encodings_corpus.py (90, 16 xfailed): parametric detection +
    decode + analyzer-no-crash sweep against the manifest.
  - test_analyze.py: encoding override + encoding_uncertain detectors.
  - test_corpus.py: pre-parse repair in the strict reader.

run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:35:27 +00:00

# ENCODINGS-CASES.md - Code Page / Encoding Test Corpus
**Version**: 1.0
**Last updated**: April 29, 2026
**Companion to**: TEST-CASES.md and QUOTE-CASES.md.
## Why this is a separate corpus
Files 01-23 in the main corpus test the **transformation layer**: given a Python `str` already in memory, what does the cleaner do to it. Encoding tests are about the **I/O layer** that runs *before* the transformation layer ever sees data: given a sequence of bytes on disk, can the reader correctly turn them into a Python `str` in the first place?
These are different failures:
- A transformation bug produces wrong-but-valid output (curly quotes that should have been folded, whitespace that should have been trimmed).
- An I/O bug produces *garbage* (mojibake) or *crashes* the reader entirely. The cleaner never gets to apply any transformation rule because the input never decoded.
Per TECHNICAL.md Section 9, encoding handling lives in `src/core/io.py`, separate from any individual cleaning script. This corpus tests that module.
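
Both I/O failure modes can be reproduced in a few lines. This is a minimal illustration with a made-up sample string; it is not code from `src/core/io.py`:

```python
# Sample text is illustrative only; it is not taken from the corpus fixtures.
text = "café"

# Variant 1: UTF-8 bytes decoded as cp1252 yield mojibake, with no error raised.
mojibake = text.encode("utf-8").decode("cp1252")
print(mojibake)  # cafÃ©

# Variant 2: cp1252 bytes decoded as strict UTF-8 raise instead.
try:
    text.encode("cp1252").decode("utf-8")
    crashed = False
except UnicodeDecodeError:
    crashed = True
print(crashed)  # True
```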
---
## 1. Layout
```
test_data/encodings/
├── E01_western_basic_utf8.csv ... E26_korean_euckr.csv
├── E27_pathological_ascii_only.csv ... E31_pathological_mixed_concat.csv
├── expected_detection.csv # Manifest: ground truth + acceptable detection
├── detector_baseline.csv # What charset-normalizer actually returns
└── reference/
    ├── WESTERN_BASIC.utf8.txt
    ├── WESTERN_EXTENDED.utf8.txt
    ├── EASTERN_EUROPEAN.utf8.txt
    ├── CYRILLIC.utf8.txt
    ├── JAPANESE.utf8.txt
    ├── CHINESE_SIMPLIFIED.utf8.txt
    ├── CHINESE_TRADITIONAL.utf8.txt
    ├── KOREAN.utf8.txt
    └── ASCII_ONLY.utf8.txt
```
Every encoded file has a `canonical_content_id` linking it to one of the 9 reference files in `reference/`. After correct decoding (and BOM stripping if applicable), the encoded file's content must equal the corresponding reference file byte-for-byte.
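
BOM stripping is where this comparison most often goes wrong, because some Python codecs consume the BOM and others keep it as U+FEFF. A small sketch with an invented payload (not a corpus fixture) shows the difference:

```python
import codecs

payload = "id,name\n1,José\n"  # illustrative content only

utf8_bom = codecs.BOM_UTF8 + payload.encode("utf-8")
utf16_bom = codecs.BOM_UTF16_LE + payload.encode("utf-16-le")

# Plain "utf-8" keeps the BOM as a leading U+FEFF; "utf-8-sig" strips it.
kept = utf8_bom.decode("utf-8")
stripped = utf8_bom.decode("utf-8-sig")

# The generic "utf-16" codec consumes the BOM; "utf-16-le" keeps it.
from_utf16 = utf16_bom.decode("utf-16")
kept_utf16 = utf16_bom.decode("utf-16-le")

print(kept == payload, stripped == payload, from_utf16 == payload)  # False True True
```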
---
## 2. Coverage matrix
The corpus uses 9 distinct content sets, each chosen to exercise a specific encoding family. Cross-coverage is enforced by content design: Cyrillic content cannot be encoded as cp1252; Western extended content (with euro/em-dash) cannot be encoded as Latin-1; etc. Attempting those encodings would either error or substitute, both of which are themselves test cases.
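
The "cannot be encoded" constraint is easy to check directly. The helper below is a hypothetical convenience, not part of the generator; it reports whether a string fits a target encoding and shows what silent substitution looks like:

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the target encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(can_encode("Привет", "cp1252"))    # False: no Cyrillic in cp1252
print(can_encode("\u20ac", "latin-1"))   # False: no euro sign in Latin-1
print(can_encode("\u20ac", "cp1252"))    # True: euro is 0x80 in cp1252

# errors="replace" substitutes instead of raising, which loses the content.
substituted = "Привет".encode("cp1252", errors="replace")
print(substituted)  # b'??????'
```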
| Content family | What it contains | Encodings covered |
|---|---|---|
| WESTERN_BASIC | ASCII + accented Latin-1 chars (é, ü, ñ, ç) | UTF-8, UTF-8 with BOM, cp1252, ISO-8859-1, ISO-8859-15, Mac Roman, UTF-16 LE/BE with BOM, UTF-16 LE without BOM |
| WESTERN_EXTENDED | Above + euro sign, smart quotes, em-dash | UTF-8, cp1252, UTF-16 LE (NOT Latin-1: chars don't exist there) |
| EASTERN_EUROPEAN | Czech, Polish, Hungarian, Slovak accents | UTF-8, cp1250, ISO-8859-2 |
| CYRILLIC | Russian | UTF-8, cp1251, KOI8-R |
| JAPANESE | Kanji + kana | UTF-8, Shift_JIS |
| CHINESE_SIMPLIFIED | Mainland China characters | UTF-8, GB18030 |
| CHINESE_TRADITIONAL | Taiwan/HK characters | UTF-8, Big5 |
| KOREAN | Hangul | UTF-8, EUC-KR |
| ASCII_ONLY | Pure ASCII | One file; encoding genuinely ambiguous |
---
## 3. Per-file index
### Group A — WESTERN_BASIC (single content, 9 encodings)
This group's purpose is mainly to test detector behavior on the most common Western encodings. Because the content uses only ASCII plus Latin-1 characters at 0xA0 and above, and avoids the eight positions where ISO-8859-15 departs from Latin-1 (such as 0xA4, which is the euro sign in Latin-9), **cp1252 / ISO-8859-1 / ISO-8859-15 produce byte-identical output for this content**. The detector cannot meaningfully distinguish among them; any of them is a correct answer.
| File | Encoding | Notes |
|---|---|---|
| E01 | UTF-8 | Modern default |
| E02 | UTF-8 with BOM | Excel "CSV UTF-8" export. Reader must strip the BOM. |
| E03 | cp1252 | Excel default "CSV" on US/UK/Western Windows |
| E04 | ISO-8859-1 | Latin-1. Identical bytes to cp1252 for this content. |
| E05 | ISO-8859-15 | Latin-9. Identical to Latin-1 here (no euro). |
| E06 | Mac Roman | Different byte mappings; distinguishable |
| E07 | UTF-16 LE with BOM | Excel "Unicode Text" export |
| E08 | UTF-16 BE with BOM | Less common but spec'd |
| E09 | UTF-16 LE without BOM | Detection unreliable; document failure mode |
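
The byte-identity claim for this group can be confirmed with a short check. The sample string below is an assumption based on the characters named in the coverage matrix, not the actual fixture content:

```python
sample = "é,ü,ñ,ç"  # assumed sample; the real WESTERN_BASIC content may differ

encoded = {
    enc: sample.encode(enc)
    for enc in ("cp1252", "iso8859-1", "iso8859-15", "mac_roman")
}

# The three Western single-byte encodings agree byte-for-byte on these letters.
identical = encoded["cp1252"] == encoded["iso8859-1"] == encoded["iso8859-15"]

# Mac Roman places the same letters at different byte values.
mac_differs = encoded["mac_roman"] != encoded["cp1252"]

# Latin-9 diverges from Latin-1 only at a few positions, for example 0xA4.
a4_latin1 = b"\xa4".decode("iso8859-1")    # currency sign
a4_latin9 = b"\xa4".decode("iso8859-15")   # euro sign

print(identical, mac_differs, a4_latin1 != a4_latin9)  # True True True
```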
### Group B — WESTERN_EXTENDED (3 encodings)
This is the cleanest **cp1252-vs-Latin-1 discriminator** in the corpus. The content uses bytes 0x80-0x9F (where cp1252 puts euro, smart quotes, em-dash) — exactly the range Latin-1 leaves undefined. A reader that misidentifies this file as Latin-1 will produce control characters or replacement chars; correct identification as cp1252 yields readable text.
| File | Encoding | Notes |
|---|---|---|
| E10 | UTF-8 | Reference |
| E11 | cp1252 | The discriminator file |
| E12 | UTF-16 LE with BOM | Same content, sanity check |
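
A sketch of why the 0x80-0x9F range discriminates, using hand-built bytes rather than the E11 fixture itself. Note that Python's `latin-1` codec never raises; it maps these bytes to C1 control characters, so the failure is silent:

```python
import unicodedata

# Hand-built bytes: left quote, "Deluxe", right quote, em dash, euro sign.
raw = b"\x93Deluxe\x94 \x97 \x80 5"

as_cp1252 = raw.decode("cp1252")    # readable punctuation
as_latin1 = raw.decode("latin-1")   # decodes without error, but wrongly

# Count control characters (category Cc) produced by the wrong decode.
c1_controls = [ch for ch in as_latin1 if unicodedata.category(ch) == "Cc"]

print(len(c1_controls))  # 4
```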
### Group C — EASTERN_EUROPEAN (3 encodings)
| File | Encoding | Notes |
|---|---|---|
| E13 | UTF-8 | Reference |
| E14 | cp1250 | Polish/Czech/Hungarian Windows default |
| E15 | ISO-8859-2 | Latin-2; distinct byte mappings from cp1250 |
### Group D — CYRILLIC (3 encodings)
| File | Encoding | Notes |
|---|---|---|
| E16 | UTF-8 | Reference |
| E17 | cp1251 | Russian Windows default |
| E18 | KOI8-R | Older Russian Unix encoding; distinct bytes from cp1251 |
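
The "distinct bytes" notes in Groups C and D matter because a wrong guess within the same language family still decodes without error. The sketch below uses a hypothetical helper and invented sample strings to show the effect:

```python
def cross_decode(text: str, written_as: str, read_as: str) -> str:
    """Encode text one way and decode it another, as a misdetecting reader would."""
    return text.encode(written_as).decode(read_as, errors="replace")

# Cyrillic: KOI8-R bytes read as cp1251 are still Cyrillic letters, just the wrong ones.
wrong_cyrillic = cross_decode("Привет", "koi8-r", "cp1251")
print(wrong_cyrillic != "Привет", "\ufffd" in wrong_cyrillic)  # True False

# Eastern European: some letters share a byte in cp1250 and ISO-8859-2, others do not.
print(cross_decode("č", "cp1250", "iso8859-2") == "č")  # True
print(cross_decode("š", "cp1250", "iso8859-2") == "š")  # False
```

Because nothing raises, only a comparison against the reference content catches this class of error.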
### Group E — CJK (8 files, 4 languages × 2 encodings each)
| File | Encoding | Notes |
|---|---|---|
| E19 | UTF-8 (Japanese) | Reference |
| E20 | Shift_JIS | Japanese Excel default; cp932 is the MS extended variant |
| E21 | UTF-8 (Chinese simplified) | Reference |
| E22 | GB18030 | Mainland China; superset of GBK and GB2312 |
| E23 | UTF-8 (Chinese traditional) | Reference |
| E24 | Big5 | Taiwan/HK; cp950 is the MS variant |
| E25 | UTF-8 (Korean) | Reference |
| E26 | EUC-KR | Korean Windows default; cp949 is the MS variant |
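
For characters inside the shared repertoire, each base encoding and its Microsoft or superset variant produce the same bytes, which is why a detector returning the variant label is still acceptable. The sample strings here are illustrative; whether the fixture content stays inside the shared repertoire is an assumption worth verifying against the reference files:

```python
# (sample text, base encoding, variant or superset encoding)
PAIRS = [
    ("日本語", "shift_jis", "cp932"),
    ("中文", "gb2312", "gb18030"),
    ("한국어", "euc_kr", "cp949"),
]

# True where both encodings yield identical bytes for the sample.
results = {
    variant: text.encode(base) == text.encode(variant)
    for text, base, variant in PAIRS
}
print(results)
```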
### Group F — Pathological (5 files)
These are the cases that crash readers, produce silent corruption, or expose the limits of encoding detection. The expected behavior is **that the reader fails informatively**, not that it succeeds.
| File | Pathology | What should happen |
|---|---|---|
| E27 | ASCII only — encoding genuinely ambiguous | Detector picks any of ASCII/UTF-8/cp1252/Latin-1; all decode identically. Test that the reader doesn't OVER-confidently commit to one when ambiguous. |
| E28 | Invalid UTF-8 byte sequence (0xC3 0x28 mid-file) | Strict UTF-8 decode raises UnicodeDecodeError. Reader should fall back to a single-byte encoding and warn, not silently substitute. |
| E29 | Truncated UTF-8 multibyte at EOF | Strict decode raises "unexpected end of data". Reader should error with a clear "file appears truncated" message, not silently produce U+FFFD. |
| E30 | "Lying BOM": UTF-8 BOM on cp1252 body | utf-8-sig decoder errors at the first cp1252 high byte (0x80-0xFF) that does not form a valid UTF-8 sequence. Reader should detect the lie and recover by stripping BOM and trying cp1252; warn the user. |
| E31 | Mixed encoding concatenation (cp1252 + UTF-8) | NO single encoding decodes the whole file. UTF-8 errors on cp1252 bytes; cp1252 mojibakes the UTF-8 bytes. Reader should refuse and tell the user the file contains mixed encodings. |
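
One way to tell these failures apart is to inspect the `UnicodeDecodeError` itself. The classifier below is a sketch with hypothetical category names, not the reader's actual policy. It is also only a heuristic: a cp1252 file that happens to end in an accented letter would be reported as truncated.

```python
import codecs

def classify_utf8(data: bytes) -> str:
    """Return 'ok', 'truncated', 'lying_bom', or 'invalid' for a UTF-8 candidate."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    try:
        body.decode("utf-8")
        return "ok"
    except UnicodeDecodeError as exc:
        # An incomplete multibyte sequence at the very end suggests truncation.
        if exc.reason == "unexpected end of data" and exc.end == len(body):
            return "truncated"
        # A BOM promising UTF-8 on a body that is not UTF-8 is the "lying BOM" case.
        return "lying_bom" if has_bom else "invalid"

print(classify_utf8(b"id,name\n1,caf\xc3\x28\n"))                        # invalid
print(classify_utf8("café".encode("utf-8")[:-1]))                        # truncated
print(classify_utf8(codecs.BOM_UTF8 + "café, 5 €".encode("cp1252")))     # lying_bom
```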
---
## 4. Manifest files
### `expected_detection.csv` — ground truth + acceptable detection answers
7 columns:
- `filename` — the encoded test file
- `canonical_content_id` — links to the reference content
- `encoding` — the actual encoding used by the generator (ground truth)
- `has_bom` — whether the file has a BOM
- `byte_length` — file size in bytes
- `expected_detection` — pipe-separated list of detector answers that should be considered correct. Includes fuzzy markers (`AMBIGUOUS`, `UNRELIABLE`, `REJECT`, `LOW_CONFIDENCE`) for cases where any reasonable detector behavior is acceptable.
- `decode_notes` — human-readable explanation of expected behavior
Use this as the primary reference when validating your reader.
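
A sketch of how the `expected_detection` field might be parsed. Detectors and manifests often spell the same encoding differently (`utf_8` vs `utf-8`, `windows-1252` vs `cp1252`), so this version normalizes labels through Python's codec registry. The function names are hypothetical, and treating `AMBIGUOUS` and `UNRELIABLE` as "accept any label" is an assumption about the markers' semantics:

```python
import codecs

FUZZY_MARKERS = {"AMBIGUOUS", "UNRELIABLE", "REJECT", "LOW_CONFIDENCE"}

def canonical(label: str) -> str:
    """Normalize an encoding label; unknown labels pass through lowercased."""
    label = label.strip()
    try:
        return codecs.lookup(label).name
    except LookupError:
        return label.lower()

def parse_expected(field: str):
    """Split a pipe-separated field into (encoding labels, fuzzy markers)."""
    parts = [p.strip() for p in field.split("|") if p.strip()]
    markers = {p for p in parts if p in FUZZY_MARKERS}
    labels = {canonical(p) for p in parts if p not in FUZZY_MARKERS}
    return labels, markers

def is_acceptable(detected: str, field: str) -> bool:
    labels, markers = parse_expected(field)
    return canonical(detected) in labels or bool(markers & {"AMBIGUOUS", "UNRELIABLE"})

print(is_acceptable("windows-1252", "cp1252|latin-1"))  # True
print(is_acceptable("shift_jis", "koi8-r"))             # False
```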
### `detector_baseline.csv` — what charset-normalizer actually returns
Recorded during fixture generation against the version of `charset-normalizer` installed at that time. 6 columns:
- `filename`, `ground_truth_encoding`, `charset_normalizer_returns`, `cn_aliases`, `cn_language`, `cn_chaos_score`
This is **not authoritative** — different detector versions return different labels. It exists so you can see typical detector output without having to run charset-normalizer yourself, and so you have a baseline to compare against if you're testing a different detector or a newer version.
### `reference/*.utf8.txt` — canonical decoded content
One UTF-8 file per content family. After your reader decodes any encoded file in the corpus and strips any BOM, the result should equal the corresponding reference file's content byte-for-byte.
---
## 5. Observed charset-normalizer behavior
Recorded against `charset-normalizer` 3.x. Some of these are known detector quirks worth understanding before you debug your own code:
### Cases where charset-normalizer is reliably correct
- All UTF-8 files (E01, E02, E10, E13, E16, E19, E21, E23, E25): detected as `utf_8`.
- All UTF-16 with BOM (E07, E08, E12): detected as `utf_16` (loses LE/BE distinction in label, recoverable from BOM).
- E14 (cp1250 Eastern European): correctly detected.
- E17 (cp1251 Cyrillic): correctly detected.
- E20 (Shift_JIS Japanese): returns `cp932` (the MS extended variant; equivalent for this content).
- E22 (GB18030 Chinese): correctly detected.
- E24 (Big5 Chinese traditional): correctly detected.
- E26 (EUC-KR Korean): returns `cp949` (the MS variant; equivalent for this content).
- E27 (ASCII): correctly detected as `ascii`.
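
Recovering the LE/BE distinction from the BOM, as mentioned for the UTF-16 files, takes only a prefix check. This sketch uses a hypothetical helper name; the one subtlety is ordering, since the UTF-32 LE BOM starts with the same two bytes as the UTF-16 LE BOM:

```python
import codecs

# Longest BOMs first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

sample = codecs.BOM_UTF16_BE + "id,name".encode("utf-16-be")
encoding, skip = sniff_bom(sample)
print(encoding, sample[skip:].decode(encoding))  # utf-16-be id,name
```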
### Cases where charset-normalizer mislabels but produces the right decoded content
These return a wrong-sounding name but decode to the correct characters because the encodings are byte-equivalent for this specific content:
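- **E03, E04, E05** (cp1252, Latin-1, Latin-9 with WESTERN_BASIC content): all returned as `cp1250`. The decoded chars are correct only where cp1250 shares a code point with the true encoding. That holds for ASCII and for many accented letters (é at 0xE9, ü at 0xFC, ç at 0xE7), but not for all of them: 0xF1 is ñ in cp1252 and Latin-1 but ń in cp1250. Treat the label as misleading and compare the decoded text against the reference file before concluding the data is fine.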
- **E06** (Mac Roman): returned as `mac_iceland`. Same family, identical for our content.
- **E11** (cp1252 with WESTERN_EXTENDED): returned as `cp1250`. The two code pages agree on the euro sign (0x80), smart quotes (0x91-0x94), and dashes (0x96-0x97), so the punctuation survives, but they disagree on many letters in 0xA0-0xFF, so accented characters can still come out wrong. Verify that your reader actually re-decodes with the returned label and check the output; don't assume a "matching" label means correct content.
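
Whether two labels are interchangeable depends on the bytes actually present, so it is worth checking per file rather than per encoding pair. A minimal sketch with a hypothetical helper and hand-built byte strings:

```python
def decodes_identically(data: bytes, label_a: str, label_b: str) -> bool:
    """True if both labels decode the payload to the same text."""
    try:
        return data.decode(label_a) == data.decode(label_b)
    except UnicodeDecodeError:
        return False

# 0xE9 is é under both cp1252 and cp1250, so the label difference is harmless here.
print(decodes_identically(b"caf\xe9", "cp1252", "cp1250"))    # True

# 0xF1 is ñ under cp1252 but ń under cp1250, so the same label swap corrupts this one.
print(decodes_identically(b"ma\xf1ana", "cp1252", "cp1250"))  # False
```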
### Cases where charset-normalizer is wrong
- **E15** (ISO-8859-2 Eastern European): returned as `cp1258` (Vietnamese encoding). Wrong family entirely. Probable cause: the chaos heuristic doesn't penalize cp1258 for the byte distribution in the test content.
- **E18** (KOI8-R Cyrillic): returned as `shift_jis_2004` (Japanese!). Bytes in KOI8-R's high-bit range happen to look like valid Shift_JIS multibyte sequences for this content. **High-confidence misdetection** — this is the one to plan a fallback for in your reader.
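
One possible fallback for the E18 style of misdetection is a script sanity check: if the decoded text is dominated by a script that makes no sense for the expected data, distrust the label. This is a rough heuristic sketched under that assumption, using the first word of each character's Unicode name; it is not something the corpus or `src/core/io.py` is known to implement:

```python
import unicodedata
from collections import Counter

def dominant_script(text: str):
    """Return the most common script-like name prefix among non-ASCII letters."""
    counts = Counter()
    for ch in text:
        if ch.isalpha() and ord(ch) > 0x7F:
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split(" ")[0]] += 1
    return counts.most_common(1)[0][0] if counts else None

koi8_bytes = "Привет".encode("koi8-r")
right = koi8_bytes.decode("koi8-r")
wrong = koi8_bytes.decode("shift_jis", errors="replace")

print(dominant_script(right))                 # CYRILLIC
print(dominant_script(wrong) == "CYRILLIC")   # False
```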
### Pathological cases
- **E28-E31**: charset-normalizer returns various labels (`cp1257`, `cp1250`, `cp1252`, `cp1250`). For pathological inputs, the *label* is less important than the *behavior*: does your reader detect that something is wrong (low confidence, multiple candidate encodings, or a decode error after detection) and warn the user? The `expected_detection` field accepts any label paired with appropriate warning behavior.
### Implication for your reader
Don't trust charset-normalizer's label blindly. The robust pattern:
1. Run charset-normalizer.
2. Try to decode the entire file with the returned encoding.
3. If decode succeeds, sanity-check the result: does it contain replacement characters (U+FFFD)? Does it contain control characters in unexpected places (which suggest cp1252-vs-Latin-1 ambiguity decoded wrong)?
4. If it fails or smells wrong, try a small candidate set (utf-8, cp1252, latin-1) and pick the one with the cleanest result.
5. When confidence is low, log a warning and let the user override via a `--encoding` flag.
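
The steps above can be sketched as follows. To stay self-contained, the detector's label is passed in as an argument instead of calling charset-normalizer, and the function names are hypothetical. It also simplifies step 4: it returns the first candidate that decodes cleanly instead of scoring all of them.

```python
import unicodedata

def suspicious(text: str) -> bool:
    """True if text has U+FFFD or control characters other than tab, CR, LF."""
    return any(
        ch == "\ufffd" or (unicodedata.category(ch) == "Cc" and ch not in "\t\r\n")
        for ch in text
    )

def decode_with_fallback(data: bytes, detected=None,
                         candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding, warnings); raise ValueError if nothing is clean."""
    warnings = []
    order = ([detected] if detected else []) + [c for c in candidates if c != detected]
    for enc in order:
        try:
            text = data.decode(enc)
        except (UnicodeDecodeError, LookupError):
            warnings.append(f"{enc}: decode failed")
            continue
        if suspicious(text):
            warnings.append(f"{enc}: decoded but contains U+FFFD or control characters")
            continue
        return text, enc, warnings
    raise ValueError("no candidate decoded cleanly: " + "; ".join(warnings))

data = "café €".encode("cp1252")
text, used, notes = decode_with_fallback(data, detected="latin-1")
print(used, len(notes))  # cp1252 2
```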
---
## 6. Suggested test workflow
```python
import csv
from pathlib import Path

from src.core.io import detect_encoding, read_csv  # your reader

CORPUS = Path("test_data/encodings")

# Load ground-truth manifest (assumed here to be UTF-8 encoded)
with (CORPUS / "expected_detection.csv").open(encoding="utf-8", newline="") as f:
    manifest = list(csv.DictReader(f))

# Load reference content, keyed by content id (e.g. "WESTERN_BASIC")
references = {
    p.stem.replace(".utf8", ""): p.read_text(encoding="utf-8")
    for p in (CORPUS / "reference").glob("*.utf8.txt")
}

# Test 1: detection - your detector returns an acceptable answer
for entry in manifest:
    if entry["canonical_content_id"] in references:  # skip pure pathological
        detected = detect_encoding(CORPUS / entry["filename"])
        acceptable = [e.strip() for e in entry["expected_detection"].split("|")]
        assert detected in acceptable or any(
            marker in entry["expected_detection"]
            for marker in ["AMBIGUOUS", "UNRELIABLE"]
        ), f"{entry['filename']}: detected {detected} not in {acceptable}"

# Test 2: decoded content matches reference
for entry in manifest:
    cid = entry["canonical_content_id"]
    if cid not in references:
        continue  # pathological case
    decoded = read_csv(CORPUS / entry["filename"])
    assert decoded == references[cid], f"{entry['filename']}: content mismatch"

# Test 3: pathological cases produce warnings, not silent corruption
for entry in manifest:
    cid = entry["canonical_content_id"]
    if cid in references:
        continue
    # Reader must either raise a clear error OR succeed with a logged warning
    # The exact behavior is a policy choice; document it and test against it
```
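
Test 3 is left open above because the behavior is a policy choice. One way to express "raise or warn, but never stay silent" is sketched below. It assumes the reader reports problems through Python's `warnings` module; a reader that uses `logging` would need a log capture instead. The stub readers are hypothetical stand-ins for `read_csv`:

```python
import warnings

def flags_problem(read_fn, path) -> bool:
    """True if read_fn(path) raises, or returns after emitting a warning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        try:
            read_fn(path)
        except Exception:  # narrow this to your reader's documented error types
            return True
    return len(caught) > 0

# Hypothetical stub readers illustrating the three possible behaviors.
def silent_reader(path):
    return "garbage"

def warning_reader(path):
    warnings.warn("file contains mixed encodings")
    return "partial"

def raising_reader(path):
    raise ValueError("file appears truncated")

print(flags_problem(silent_reader, "E31.csv"))   # False: this is the failure to catch
print(flags_problem(warning_reader, "E31.csv"))  # True
print(flags_problem(raising_reader, "E29.csv"))  # True
```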
---
## 7. What this corpus does NOT cover
Listed so the gaps are explicit:
1. **Big files**. Every fixture is small (under 1 KB). Detection on a 500MB cp1252 export may behave differently because charset-normalizer samples; if your reader processes giant files, add a separate large-file detection test.
2. **Streaming detection**. Detector is run on the full bytes here. If your reader decodes in chunks (for memory reasons on huge files), encoding-detection-at-stream-start is its own test surface.
3. **Languages and scripts not represented here**: Thai, Hebrew, Arabic, Vietnamese (cp1258), Greek (cp1253), Turkish (cp1254). Add per-language fixtures if your buyers use these locales. The generator script is parameterized; adding a new content family is a few-line change.
4. **Extended grapheme handling**. This corpus tests encoding detection and byte-to-string conversion. It does NOT test grapheme-cluster boundaries (multi-codepoint emoji, family ZWJ sequences). Those are the cleaner's territory in the main corpus, case 13.
5. **Encoding errors during WRITE**. The corpus tests reading. If your tool writes output in a non-UTF-8 encoding for any reason, write-side encoding correctness needs separate fixtures.
6. **Filename / path encoding issues**. Some filesystems mangle non-ASCII filenames (older Windows, NFS configs). Out of scope for the cleaner; that's a deployment problem.
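
For the streaming gap in item 2, the main hazard is a multibyte character split across a chunk boundary. Python's incremental decoders buffer the partial sequence, as this sketch with a hypothetical helper shows. It covers decoding only; detection at stream start remains a separate problem.

```python
import codecs

def decode_chunks(chunks, encoding="utf-8"):
    """Yield decoded text per chunk, carrying partial characters across boundaries."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    for chunk in chunks:
        yield decoder.decode(chunk)
    # final=True raises UnicodeDecodeError if the stream ended mid-character.
    yield decoder.decode(b"", final=True)

data = "naïve café".encode("utf-8")
chunks = [data[:3], data[3:]]  # the split lands inside the two-byte "ï"

joined = "".join(decode_chunks(chunks))
print(joined == "naïve café")  # True
```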
---
## 8. How to extend the corpus
Add a new content family:
```python
# In generate_encoding_test_files.py:
THAI = "id,name,city\n1,สมชาย,กรุงเทพ\n..."
# Then add encoding lines:
write_encoded("E32_thai_utf8.csv", "THAI", THAI, "utf-8", ...)
write_encoded("E33_thai_cp874.csv", "THAI", THAI, "cp874", ...)
```
Add reference content to the `references` dict at the bottom of the generator. Re-run the generator. The manifest and detector baseline will refresh automatically.
For a new pathological case: construct the raw bytes by hand and use `write_raw()`. Document the failure mode in the `decode_notes` field.
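
Before handing hand-built bytes to `write_raw()`, it helps to confirm they fail the way you intend. The sketch below builds an invented example, an overlong UTF-8 sequence (0xC0 0xAF), and checks it with a hypothetical helper. The signature of `write_raw()` is not shown in this document, so the call itself is left out:

```python
def fails_strict_utf8(data: bytes) -> bool:
    """True if a strict UTF-8 decode rejects the payload."""
    try:
        data.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True

# Invented case: 0xC0 is never a valid UTF-8 lead byte.
header = b"id,path\n"
row = b"1,a" + b"\xc0\xaf" + b"b\n"
raw = header + row

print(fails_strict_utf8(raw))     # True: the pathology is present
print(fails_strict_utf8(header))  # False: the surrounding bytes are clean
```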
Continue numbering: `E32`, `E33`, etc. Reserve `E9#` if you need a "destructive" subcategory paralleling the malformed CSV corpus.