datatools-dev/test-cases/encodings-corpus/ENCODINGS-CASES.md
Michael 82d7fef21e feat(gate): CSV-normalization gate with confidence-tiered findings
Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.

Core (src/core/):
  - analyze.py: Finding gains confidence, fix_action, pre_applied; new
    detectors for encoding_uncertain, encoding_decode_failed; new top-
    level encoding_override parameter.
  - fixes.py: registry of fix algorithms keyed by fix_action id.
  - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
    the NormalizationResult / Decision dataclasses the gate consumes.
  - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
    transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
    and normalizes line endings (fixes bare-CR parser crash); empty file
    handled gracefully instead of EmptyDataError traceback.

GUI (src/gui/):
  - pages/0_Review.py: gate page with per-finding decision controls,
    encoding override picker (16 codepages + custom), and Advanced output
    options (encoding, delimiter, line terminator) on the download.
  - components.py: require_normalization_gate() helper.
  - pages/1-9: gate guard wired on every tool page.

Test corpora:
  - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
    UTF-8 files + manifest, synced from Business/DataTools.
  - test-cases/text-cleaner-corpus/test_data/17: synced malformed input
    (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
  - test_normalize.py (48): finding fields, fix registry, auto_fix scope,
    decision paths, gate idempotency, output-options helper.
  - test_encodings_corpus.py (90, 16 xfailed): parametric detection +
    decode + analyzer-no-crash sweep against the manifest.
  - test_analyze.py: encoding override + encoding_uncertain detectors.
  - test_corpus.py: pre-parse repair in the strict reader.

run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:35:27 +00:00


ENCODINGS-CASES.md - Code Page / Encoding Test Corpus

Version: 1.0

Last updated: April 29, 2026

Companion to: TEST-CASES.md and QUOTE-CASES.md.

Why this is a separate corpus

Files 01-23 in the main corpus test the transformation layer: given a Python str already in memory, what does the cleaner do to it? Encoding tests are about the I/O layer that runs before the transformation layer ever sees data: given a sequence of bytes on disk, can the reader correctly turn them into a Python str in the first place?

These are different failures:

  • A transformation bug produces wrong-but-valid output (curly quotes that should have been folded, whitespace that should have been trimmed).
  • An I/O bug produces garbage (mojibake) or crashes the reader entirely. The cleaner never gets to apply any transformation rule because the input never decoded.
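Both forms of I/O failure are easy to reproduce with the standard library. The sketch below uses an illustrative sample string (not a corpus fixture): decoding with the wrong single-byte code page yields mojibake, while strict UTF-8 decoding of non-UTF-8 bytes fails outright.

```python
# Illustrative sample only; not one of the corpus fixtures.
text = "caf\u00e9"  # "café"

# Silent form: UTF-8 bytes decoded as cp1252 produce mojibake.
utf8_bytes = text.encode("utf-8")        # b'caf\xc3\xa9'
mojibake = utf8_bytes.decode("cp1252")   # 'cafÃ©'

# Loud form: cp1252 bytes are not valid UTF-8, so a strict decode raises.
cp1252_bytes = text.encode("cp1252")     # b'caf\xe9'
try:
    cp1252_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(repr(mojibake), decode_failed)
```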

Per TECHNICAL.md Section 9, encoding handling lives in src/core/io.py, separate from any individual cleaning script. This corpus tests that module.


1. Layout

test_data/encodings/
├── E01_western_basic_utf8.csv             ... E26_korean_euckr.csv
├── E27_pathological_ascii_only.csv        ... E31_pathological_mixed_concat.csv
├── expected_detection.csv                 # Manifest: ground truth + acceptable detection
├── detector_baseline.csv                  # What charset-normalizer actually returns
└── reference/
    ├── WESTERN_BASIC.utf8.txt
    ├── WESTERN_EXTENDED.utf8.txt
    ├── EASTERN_EUROPEAN.utf8.txt
    ├── CYRILLIC.utf8.txt
    ├── JAPANESE.utf8.txt
    ├── CHINESE_SIMPLIFIED.utf8.txt
    ├── CHINESE_TRADITIONAL.utf8.txt
    ├── KOREAN.utf8.txt
    └── ASCII_ONLY.utf8.txt

Every encoded file has a canonical_content_id linking it to one of the 9 reference files in reference/. After correct decoding (and BOM stripping if applicable), the encoded file's content must equal the corresponding reference file byte-for-byte.
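A minimal sketch of that equality check, working on raw bytes so that no newline translation interferes. The function name and signature are hypothetical, not part of src/core/io.py; pass it the result of `Path.read_bytes()` for the fixture and the reference.

```python
def matches_reference(encoded: bytes, encoding: str, reference_utf8: bytes) -> bool:
    """Decode `encoded` with `encoding` and compare it to the UTF-8 reference.

    Hypothetical helper for illustration; not the project's API.
    """
    text = encoded.decode(encoding)  # strict: raises on undecodable bytes
    # Codecs such as plain "utf-8" or "utf-16-le" keep a BOM as U+FEFF; drop it.
    text = text.removeprefix("\ufeff")  # Python 3.9+
    # Re-encoding to UTF-8 makes this a byte-for-byte comparison.
    return text.encode("utf-8") == reference_utf8
```

The strict decode is deliberate: a fixture that only decodes with `errors="replace"` should count as a failure, not as a near match.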


2. Coverage matrix

The corpus uses 9 distinct content sets, each chosen to exercise a specific encoding family. Cross-coverage is enforced by content design: Cyrillic content cannot be encoded as cp1252; Western extended content (with euro/em-dash) cannot be encoded as Latin-1; etc. Attempting those encodings would either error or substitute, both of which are themselves test cases.

| Content family | What it contains | Encodings covered |
|---|---|---|
| WESTERN_BASIC | ASCII + accented Latin-1 chars (é, ü, ñ, ç) | UTF-8, UTF-8 with BOM, cp1252, ISO-8859-1, ISO-8859-15, Mac Roman, UTF-16 LE/BE with BOM, UTF-16 LE without BOM |
| WESTERN_EXTENDED | Above + euro sign, smart quotes, em-dash | UTF-8, cp1252, UTF-16 LE (NOT Latin-1: chars don't exist there) |
| EASTERN_EUROPEAN | Czech, Polish, Hungarian, Slovak accents | UTF-8, cp1250, ISO-8859-2 |
| CYRILLIC | Russian | UTF-8, cp1251, KOI8-R |
| JAPANESE | Kanji + kana | UTF-8, Shift_JIS |
| CHINESE_SIMPLIFIED | Mainland China characters | UTF-8, GB18030 |
| CHINESE_TRADITIONAL | Taiwan/HK characters | UTF-8, Big5 |
| KOREAN | Hangul | UTF-8, EUC-KR |
| ASCII_ONLY | Pure ASCII | One file; encoding genuinely ambiguous |
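The cross-coverage constraint can be checked directly with strict encoding. The sketch below uses illustrative strings rather than the corpus content, and `can_encode` is a hypothetical helper name.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in `text` exists in `encoding`."""
    try:
        text.encode(encoding)  # error handling is strict by default
        return True
    except UnicodeEncodeError:
        return False


cyrillic = "\u041f\u0440\u0438\u0432\u0435\u0442"  # "Привет"
extended = "\u20ac\u2014"                          # euro sign + em-dash

print(can_encode(cyrillic, "cp1251"), can_encode(cyrillic, "cp1252"))
print(can_encode(extended, "cp1252"), can_encode(extended, "latin-1"))
```

A generator that encoded with `errors="replace"` instead would silently substitute `?`, which is the substitution case the paragraph above mentions.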

3. Per-file index

Group A — WESTERN_BASIC (single content, 9 encodings)

This group's purpose is mainly to test detector behavior on the most common Western encodings. Because the content uses only ASCII + Latin-1 characters in the 0xA0+ range, cp1252 / ISO-8859-1 / ISO-8859-15 produce byte-identical output for this content. The detector cannot meaningfully distinguish among them; any of them is a correct answer.

| File | Encoding | Notes |
|---|---|---|
| E01 | UTF-8 | Modern default |
| E02 | UTF-8 with BOM | Excel "CSV UTF-8" export. Reader must strip the BOM. |
| E03 | cp1252 | Excel default "CSV" on US/UK/Western Windows |
| E04 | ISO-8859-1 | Latin-1. Identical bytes to cp1252 for this content. |
| E05 | ISO-8859-15 | Latin-9. Identical to Latin-1 here (no euro). |
| E06 | Mac Roman | Different byte mappings; distinguishable |
| E07 | UTF-16 LE with BOM | Excel "Unicode Text" export |
| E08 | UTF-16 BE with BOM | Less common but spec'd |
| E09 | UTF-16 LE without BOM | Detection unreliable; document failure mode |
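The byte-identity claim for E03-E05 can be confirmed in a few lines. This sketch uses an illustrative sample with the same character classes as WESTERN_BASIC, not the fixture itself.

```python
# Illustrative sample: é ü ñ ç, the accented letters named in the coverage matrix.
sample = "\u00e9 \u00fc \u00f1 \u00e7"

encoded = {
    enc: sample.encode(enc)
    for enc in ("cp1252", "iso8859-1", "iso8859-15", "mac_roman")
}

# The three ISO/Windows code pages agree byte-for-byte on these characters.
identical = encoded["cp1252"] == encoded["iso8859-1"] == encoded["iso8859-15"]
# Mac Roman places the same letters at different byte values.
mac_differs = encoded["mac_roman"] != encoded["cp1252"]

print(identical, mac_differs)
```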

Group B — WESTERN_EXTENDED (3 encodings)

This is the cleanest cp1252-vs-Latin-1 discriminator in the corpus. The content uses bytes 0x80-0x9F, where cp1252 puts the euro sign, smart quotes, and em-dash, and where ISO-8859-1 has only C1 control codes and no printable characters. A reader that misidentifies this file as Latin-1 will produce control characters or replacement chars; correct identification as cp1252 yields readable text.

| File | Encoding | Notes |
|---|---|---|
| E10 | UTF-8 | Reference |
| E11 | cp1252 | The discriminator file |
| E12 | UTF-16 LE with BOM | Same content, sanity check |
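A short sketch of why E11 discriminates. Python's latin-1 codec never raises, so the misdetection shows up as C1 control characters in the decoded text rather than as an exception. The sample string is illustrative, not the fixture content.

```python
def c1_controls(text: str) -> list:
    """Return the C1 control characters (U+0080-U+009F) found in `text`."""
    return [ch for ch in text if "\x80" <= ch <= "\x9f"]


# Euro sign plus smart quotes, encoded the way a cp1252 export would store them.
raw = "\u20ac100 \u201cquoted\u201d".encode("cp1252")

as_cp1252 = raw.decode("cp1252")   # readable text
as_latin1 = raw.decode("latin-1")  # decodes without error, but wrongly

print(len(c1_controls(as_cp1252)), len(c1_controls(as_latin1)))
```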

Group C — EASTERN_EUROPEAN (3 encodings)

| File | Encoding | Notes |
|---|---|---|
| E13 | UTF-8 | Reference |
| E14 | cp1250 | Polish/Czech/Hungarian Windows default |
| E15 | ISO-8859-2 | Latin-2; distinct byte mappings from cp1250 |

Group D — CYRILLIC (3 encodings)

| File | Encoding | Notes |
|---|---|---|
| E16 | UTF-8 | Reference |
| E17 | cp1251 | Russian Windows default |
| E18 | KOI8-R | Older Russian Unix encoding; distinct bytes from cp1251 |
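Groups C and D each pair a Windows code page with an older standard that stores the same letters at different byte values. The sketch below illustrates this with two sample words (not corpus content): the encoded bytes differ, and decoding with the sibling encoding succeeds without error but yields different text.

```python
# (sample word, Windows code page, older standard); samples are illustrative.
pairs = [
    ("\u041f\u0440\u0438\u0432\u0435\u0442", "cp1251", "koi8_r"),  # "Привет"
    ("\u0141\u00f3d\u017a", "cp1250", "iso8859-2"),                # "Łódź"
]

results = []
for word, windows_enc, older_enc in pairs:
    windows_bytes = word.encode(windows_enc)
    older_bytes = word.encode(older_enc)
    # Decoding with the sibling encoding is the silent-corruption case.
    cross_decoded = windows_bytes.decode(older_enc, errors="replace")
    results.append((windows_bytes != older_bytes, cross_decoded != word))

print(results)
```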

Group E — CJK (8 files, 4 languages × 2 encodings each)

| File | Encoding | Notes |
|---|---|---|
| E19 | UTF-8 (Japanese) | Reference |
| E20 | Shift_JIS | Japanese Excel default; cp932 is the MS extended variant |
| E21 | UTF-8 (Chinese simplified) | Reference |
| E22 | GB18030 | Mainland China; a superset of GBK and GB2312 |
| E23 | UTF-8 (Chinese traditional) | Reference |
| E24 | Big5 | Taiwan/HK; cp950 is the MS variant |
| E25 | UTF-8 (Korean) | Reference |
| E26 | EUC-KR | Korean Windows default; cp949 is the MS variant |

Group F — Pathological (5 files)

These are the cases that crash readers, produce silent corruption, or expose the limits of encoding detection. The expected behavior is that the reader fails informatively, not that it succeeds.

| File | Pathology | What should happen |
|---|---|---|
| E27 | ASCII only; encoding genuinely ambiguous | Detector picks any of ASCII/UTF-8/cp1252/Latin-1; all decode identically. Test that the reader doesn't OVER-confidently commit to one when ambiguous. |
| E28 | Invalid UTF-8 byte sequence (0xC3 0x28 mid-file) | Strict UTF-8 decode raises UnicodeDecodeError. Reader should fall back to a single-byte encoding and warn, not silently substitute. |
| E29 | Truncated UTF-8 multibyte at EOF | Strict decode raises "unexpected end of data". Reader should error with a clear "file appears truncated" message, not silently produce U+FFFD. |
| E30 | "Lying BOM": UTF-8 BOM on cp1252 body | The utf-8-sig decoder errors at the first non-ASCII cp1252 byte that does not form a valid UTF-8 sequence. Reader should detect the lie and recover by stripping the BOM and trying cp1252; warn the user. |
| E31 | Mixed encoding concatenation (cp1252 + UTF-8) | No single encoding decodes the whole file correctly. UTF-8 errors on cp1252 bytes; cp1252 mojibakes the UTF-8 bytes. Reader should refuse and tell the user the file contains mixed encodings. |
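The strict-decode failures for E28-E30 can be reproduced with hand-built bytes. The sketch below constructs E28-, E29-, and E30-style inputs (illustrative stand-ins, not the fixtures) and reports the reason CPython attaches to the UnicodeDecodeError. The helper names are hypothetical.

```python
import codecs


def strict_utf8_error(data: bytes):
    """Return the UnicodeDecodeError reason for a strict UTF-8 decode, or None."""
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as exc:
        return exc.reason


def recover_lying_bom(data: bytes) -> str:
    """E30-style recovery: drop a UTF-8 BOM, then decode the body as cp1252."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("cp1252")


invalid_mid = b"name\n\xc3\x28\nend\n"                        # E28-style
truncated = "caf\u00e9".encode("utf-8")[:-1]                  # E29-style
lying_bom = codecs.BOM_UTF8 + "caf\u00e9\n".encode("cp1252")  # E30-style

print(strict_utf8_error(invalid_mid))
print(strict_utf8_error(truncated))
print(repr(recover_lying_bom(lying_bom)))
```

The two distinct reasons are what let a reader tell a truncated file apart from a wrongly detected one, provided the error is raised at the very end of the data.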

4. Manifest files

expected_detection.csv — ground truth + acceptable detection answers

7 columns:

  • filename — the encoded test file
  • canonical_content_id — links to the reference content
  • encoding — the actual encoding used by the generator (ground truth)
  • has_bom — whether the file has a BOM
  • byte_length — file size in bytes
  • expected_detection — pipe-separated list of detector answers that should be considered correct. Includes fuzzy markers (AMBIGUOUS, UNRELIABLE, REJECT, LOW_CONFIDENCE) for cases where any reasonable detector behavior is acceptable.
  • decode_notes — human-readable explanation of expected behavior

Use this as the primary reference when validating your reader.
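One possible way to load the manifest so that concrete labels and fuzzy markers are separated up front. This is a sketch: the `ManifestEntry` type and `parse_manifest` function are hypothetical, the serialization of `has_bom` is an assumption, and the sample row is made up for illustration rather than copied from the real file.

```python
import csv
import io
from dataclasses import dataclass

FUZZY_MARKERS = frozenset({"AMBIGUOUS", "UNRELIABLE", "REJECT", "LOW_CONFIDENCE"})


@dataclass
class ManifestEntry:
    filename: str
    canonical_content_id: str
    encoding: str
    has_bom: bool
    byte_length: int
    labels: frozenset   # concrete detector answers accepted as correct
    markers: frozenset  # fuzzy markers present in expected_detection
    decode_notes: str


def parse_manifest(handle):
    """Parse expected_detection.csv rows into ManifestEntry objects."""
    entries = []
    for row in csv.DictReader(handle):
        tokens = {t.strip() for t in row["expected_detection"].split("|") if t.strip()}
        entries.append(ManifestEntry(
            filename=row["filename"],
            canonical_content_id=row["canonical_content_id"],
            encoding=row["encoding"],
            # Assumption: has_bom is written as a truthy word; adjust to the real file.
            has_bom=row["has_bom"].strip().lower() in {"true", "1", "yes"},
            byte_length=int(row["byte_length"]),
            labels=frozenset(tokens - FUZZY_MARKERS),
            markers=frozenset(tokens & FUZZY_MARKERS),
            decode_notes=row["decode_notes"],
        ))
    return entries


# Made-up row for illustration only.
SAMPLE = io.StringIO(
    "filename,canonical_content_id,encoding,has_bom,byte_length,"
    "expected_detection,decode_notes\n"
    "example.csv,ASCII_ONLY,ascii,False,42,ascii|utf_8|AMBIGUOUS,illustrative row\n"
)
entry = parse_manifest(SAMPLE)[0]
print(sorted(entry.labels), sorted(entry.markers))
```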

detector_baseline.csv — what charset-normalizer actually returns

Recorded during fixture generation against the version of charset-normalizer installed at that time. 6 columns:

  • filename, ground_truth_encoding, charset_normalizer_returns, cn_aliases, cn_language, cn_chaos_score

This is not authoritative — different detector versions return different labels. It exists so you can see typical detector output without having to run charset-normalizer yourself, and so you have a baseline to compare against if you're testing a different detector or a newer version.
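Because labels vary between detectors and versions (utf_8 vs UTF-8, latin-1 vs ISO-8859-1), it can help to compare them through Python's codec registry rather than as raw strings. A minimal sketch with hypothetical helper names:

```python
import codecs


def canonical_codec(label: str):
    """Return Python's canonical codec name for `label`, or None if unknown."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


def same_codec(a: str, b: str) -> bool:
    """True when both labels resolve to the same Python codec."""
    ca, cb = canonical_codec(a), canonical_codec(b)
    return ca is not None and ca == cb


print(same_codec("utf_8", "UTF-8"), same_codec("latin-1", "ISO-8859-1"))
print(same_codec("cp932", "shift_jis"))  # related, but distinct codecs
```

This only collapses aliases of one codec. Content-equivalent but distinct codecs (cp932 vs Shift_JIS) still need the pipe-separated alternatives in expected_detection.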

reference/*.utf8.txt — canonical decoded content

One UTF-8 file per content family. After your reader decodes any encoded file in the corpus and strips any BOM, the result should equal the corresponding reference file's content byte-for-byte.


5. Observed charset-normalizer behavior

Recorded against charset-normalizer 3.x. Some of these are known detector quirks worth understanding before you debug your own code:

Cases where charset-normalizer is reliably correct

  • All UTF-8 files (E01, E02, E10, E13, E16, E19, E21, E23, E25): detected as utf_8.
  • All UTF-16 with BOM (E07, E08, E12): detected as utf_16 (loses LE/BE distinction in label, recoverable from BOM).
  • E14 (cp1250 Eastern European): correctly detected.
  • E17 (cp1251 Cyrillic): correctly detected.
  • E20 (Shift_JIS Japanese): returns cp932 (the MS extended variant; equivalent for this content).
  • E22 (GB18030 Chinese): correctly detected.
  • E24 (Big5 Chinese traditional): correctly detected.
  • E26 (EUC-KR Korean): returns cp949 (the MS variant; equivalent for this content).
  • E27 (ASCII): correctly detected as ascii.
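Recovering the byte order that the utf_16 label drops takes only a BOM check. A sketch using the constants in the standard codecs module; the function name is hypothetical.

```python
import codecs


def utf16_byte_order(data: bytes):
    """Return 'utf-16-le' or 'utf-16-be' based on the BOM, or None without one."""
    # The UTF-32 LE BOM begins with the UTF-16 LE BOM bytes, so rule it out first.
    # (A UTF-16 LE file whose first character is U+0000 would be misread here.)
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return None
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None


le_sample = codecs.BOM_UTF16_LE + "id".encode("utf-16-le")
be_sample = codecs.BOM_UTF16_BE + "id".encode("utf-16-be")
print(utf16_byte_order(le_sample), utf16_byte_order(be_sample))
```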

Cases where charset-normalizer mislabels but produces the right decoded content

These return a wrong-sounding name but decode to the correct characters because the encodings are byte-equivalent for this specific content:

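  • E03, E04, E05 (cp1252, Latin-1, Latin-9 with WESTERN_BASIC content): all returned as cp1250. The decoded chars are correct only where cp1250 agrees with the other three encodings. It does for é, ü, and ç, which sit at the same bytes, but not for every position at 0xA0 and above: 0xF1 is ñ in cp1252 and Latin-1 but ń in cp1250. If the content includes ñ as listed in the coverage matrix, compare the decoded text against the reference rather than relying on the label.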
  • E06 (Mac Roman): returned as mac_iceland. Same family, identical for our content.
  • E11 (cp1252 with WESTERN_EXTENDED): returned as cp1250. cp1250 maps the euro sign (0x80), smart quotes (0x91-0x94), and em-dash (0x97) to the same characters as cp1252, so the punctuation decodes correctly; the two code pages diverge on some accented letters instead (0xF1 is ñ in cp1252 but ń in cp1250). Verify that your reader actually re-decodes with the returned label and check the output; don't assume a plausible label means correct content.
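Whether a returned label is content-equivalent to the true encoding can be checked by decoding the same bytes both ways. The sketch below works for single-byte encodings only; the helper name and sample string are illustrative.

```python
def decode_differences(data: bytes, enc_a: str, enc_b: str) -> list:
    """List (byte, char_a, char_b) for each distinct byte that decodes differently.

    Only meaningful for single-byte encodings, where each byte maps to one char.
    """
    diffs = []
    for value in sorted(set(data)):
        single = bytes([value])
        char_a = single.decode(enc_a, errors="replace")
        char_b = single.decode(enc_b, errors="replace")
        if char_a != char_b:
            diffs.append((hex(value), char_a, char_b))
    return diffs


# Euro sign, smart quotes, and two accented letters, as cp1252 bytes.
sample = "\u20ac5 \u201cni\u00f1o\u201d caf\u00e9".encode("cp1252")
print(decode_differences(sample, "cp1252", "cp1250"))
```

An empty list means the mislabel is harmless for that file; any entry is a character the reader would corrupt by trusting the label.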

Cases where charset-normalizer is wrong

  • E15 (ISO-8859-2 Eastern European): returned as cp1258 (Vietnamese encoding). Wrong family entirely. Probable cause: the chaos heuristic doesn't penalize cp1258 for the byte distribution in the test content.
  • E18 (KOI8-R Cyrillic): returned as shift_jis_2004 (Japanese!). Bytes in KOI8-R's high-bit range happen to look like valid Shift_JIS multibyte sequences for this content. High-confidence misdetection — this is the one to plan a fallback for in your reader.

Pathological cases

  • E28-E31: charset-normalizer returns various labels (cp1257, cp1250, cp1252, cp1250). For pathological inputs, the label is less important than the behavior: does your reader detect that something is wrong (low confidence, multiple candidate encodings, or a decode error after detection) and warn the user? The expected_detection field accepts any label paired with appropriate warning behavior.

Implication for your reader

Don't trust charset-normalizer's label blindly. The robust pattern:

  1. Run charset-normalizer.
  2. Try to decode the entire file with the returned encoding.
  3. If decode succeeds, sanity-check the result: does it contain replacement characters (U+FFFD)? Does it contain control characters in unexpected places (which suggests cp1252 content was decoded as Latin-1)?
  4. If it fails or smells wrong, try a small candidate set (utf-8, cp1252, latin-1) and pick the one with the cleanest result.
  5. When confidence is low, log a warning and let the user override via a --encoding flag.
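The steps above can be sketched as follows. The detector call is left out so the example stays standard-library only: `detected` stands for whatever label step 1 returned. The function names, the scoring rule, and the shape of the return value are assumptions for illustration, not the behavior of src/core/io.py.

```python
import unicodedata
from typing import List, Optional, Sequence, Tuple

FALLBACKS = ("utf-8", "cp1252", "latin-1")


def suspicious_count(text: str) -> int:
    """Count U+FFFD and control characters other than tab, CR, and LF."""
    return sum(
        1 for ch in text
        if ch == "\ufffd"
        or (unicodedata.category(ch) == "Cc" and ch not in "\t\r\n")
    )


def decode_with_fallback(
    data: bytes, detected: Optional[str], fallbacks: Sequence[str] = FALLBACKS
) -> Tuple[str, str, List[str]]:
    """Return (text, encoding_used, notes); non-empty notes mean 'warn the user'."""
    notes: List[str] = []
    if detected:
        try:
            text = data.decode(detected)  # step 2: strict decode of the whole file
            if suspicious_count(text) == 0:  # step 3: sanity check
                return text, detected, notes
            notes.append(f"{detected}: decoded but contains suspicious characters")
        except (UnicodeDecodeError, LookupError) as exc:
            notes.append(f"{detected}: {exc}")

    best = None  # step 4: cleanest result among the candidates
    for enc in fallbacks:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue
        score = suspicious_count(text)
        if best is None or score < best[0]:
            best = (score, text, enc)
    if best is None:
        raise ValueError("no candidate encoding decoded the data")
    notes.append(f"fell back to {best[2]}")  # step 5: surface as a warning
    return best[1], best[2], notes
```

Note the limit of this check: a decode that succeeds into the wrong script without control characters, as in the E18 misdetection, passes it. Catching that case needs an extra signal such as an expected language or script.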

6. Suggested test workflow

import csv
from pathlib import Path
from src.core.io import detect_encoding, read_csv  # your reader

CORPUS = Path("test_data/encodings")

# Load ground-truth manifest
with (CORPUS / "expected_detection.csv").open(newline="", encoding="utf-8") as f:
    manifest = list(csv.DictReader(f))

# Load reference content
references = {
    p.stem.replace(".utf8", ""): p.read_text(encoding="utf-8")
    for p in (CORPUS / "reference").glob("*.utf8.txt")
}

# Test 1: detection - your detector returns an acceptable answer
for entry in manifest:
    if entry["canonical_content_id"] in references:  # skip pure pathological
        detected = detect_encoding(CORPUS / entry["filename"])
        acceptable = [e.strip() for e in entry["expected_detection"].split("|")]
        assert detected in acceptable or any(
            marker in entry["expected_detection"]
            for marker in ["AMBIGUOUS", "UNRELIABLE"]
        ), f"{entry['filename']}: detected {detected} not in {acceptable}"

# Test 2: decoded content matches reference
for entry in manifest:
    cid = entry["canonical_content_id"]
    if cid not in references:
        continue  # pathological case
    decoded = read_csv(CORPUS / entry["filename"])
    assert decoded == references[cid], f"{entry['filename']}: content mismatch"

# Test 3: pathological cases produce warnings, not silent corruption
for entry in manifest:
    cid = entry["canonical_content_id"]
    if cid in references:
        continue
    # Reader must either raise a clear error OR succeed with a logged warning
    # The exact behavior is a policy choice; document it and test against it

7. What this corpus does NOT cover

Listed so the gaps are explicit:

  1. Big files. Every fixture is small (under 1 KB). Detection on a 500 MB cp1252 export may behave differently because charset-normalizer samples the input; if your reader processes very large files, add a separate large-file detection test.
  2. Streaming detection. Detector is run on the full bytes here. If your reader decodes in chunks (for memory reasons on huge files), encoding-detection-at-stream-start is its own test surface.
  3. Languages and scripts not represented here: Thai, Hebrew, Arabic, Vietnamese (cp1258), Greek (cp1253), Turkish (cp1254). Add per-language fixtures if your buyers use these locales. The generator script is parameterized; adding a new content family is a few-line change.
  4. Extended grapheme handling. This corpus tests encoding detection and byte-to-string conversion. It does NOT test grapheme-cluster boundaries (multi-codepoint emoji, family ZWJ sequences). Those are the cleaner's territory in the main corpus, case 13.
  5. Encoding errors during WRITE. The corpus tests reading. If your tool writes output in a non-UTF-8 encoding for any reason, write-side encoding correctness needs separate fixtures.
  6. Filename / path encoding issues. Some filesystems mangle non-ASCII filenames (older Windows, NFS configs). Out of scope for the cleaner; that's a deployment problem.
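To illustrate why chunked decoding (gap 2) is its own test surface: a chunk boundary can fall in the middle of a multibyte character, so each chunk cannot be decoded independently. A minimal sketch using the standard library's incremental decoder; the function name is hypothetical and the encoding is assumed to be already known.

```python
import codecs


def decode_chunks(chunks, encoding: str):
    """Yield decoded text from an iterable of byte chunks, strictly."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    for chunk in chunks:
        text = decoder.decode(chunk)  # buffers any incomplete trailing sequence
        if text:
            yield text
    # final=True raises UnicodeDecodeError if the input ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


data = "na\u00efve caf\u00e9".encode("utf-8")
one_byte_chunks = [data[i:i + 1] for i in range(len(data))]
print("".join(decode_chunks(one_byte_chunks, "utf-8")))
```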

8. How to extend the corpus

Add a new content family:

# In generate_encoding_test_files.py:
THAI = "id,name,city\n1,สมชาย,กรุงเทพ\n..."

# Then add encoding lines:
write_encoded("E32_thai_utf8.csv", "THAI", THAI, "utf-8", ...)
write_encoded("E33_thai_cp874.csv", "THAI", THAI, "cp874", ...)

Add reference content to the references dict at the bottom of the generator. Re-run the generator. The manifest and detector baseline will refresh automatically.

For a new pathological case: construct the raw bytes by hand and use write_raw(). Document the failure mode in the decode_notes field.
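As an example of constructing raw bytes by hand, the sketch below builds an E31-style mixed-encoding payload and checks that it really exhibits the failure mode before it would be written out. `write_raw_sketch` is a hypothetical stand-in; the real write_raw() signature in the generator may differ.

```python
from pathlib import Path


def build_mixed_concat() -> bytes:
    """cp1252 rows followed by UTF-8 rows: no single encoding fits both."""
    first = "id,name\n1,caf\u00e9\n".encode("cp1252")
    second = "2,na\u00efve\n".encode("utf-8")
    return first + second


def write_raw_sketch(path: Path, data: bytes) -> int:
    """Hypothetical stand-in for the generator's write_raw(); writes bytes as-is."""
    return path.write_bytes(data)


payload = build_mixed_concat()

# Confirm the pathology: strict UTF-8 fails, and cp1252 produces mojibake.
try:
    payload.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
as_cp1252 = payload.decode("cp1252")

print(utf8_ok, "na\u00c3\u00afve" in as_cp1252)
```

Checking the payload this way gives you the wording for the decode_notes field directly from observed behavior.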

Continue numbering: E32, E33, etc. Reserve E9# if you need a "destructive" subcategory paralleling the malformed CSV corpus.