Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.
Core (src/core/):
- analyze.py: Finding gains confidence, fix_action, pre_applied; new
detectors for encoding_uncertain, encoding_decode_failed; new top-
level encoding_override parameter.
- fixes.py: registry of fix algorithms keyed by fix_action id.
- normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
the NormalizationResult / Decision dataclasses the gate consumes.
- io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
and normalizes line endings (fixes bare-CR parser crash); empty file
handled gracefully instead of EmptyDataError traceback.
GUI (src/gui/):
- pages/0_Review.py: gate page with per-finding decision controls,
encoding override picker (16 codepages + custom), and Advanced output
options (encoding, delimiter, line terminator) on the download.
- components.py: require_normalization_gate() helper.
- pages/1-9: gate guard wired on every tool page.
Test corpora:
- test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
UTF-8 files + manifest, synced from Business/DataTools.
- test-cases/text-cleaner-corpus/test_data/17: synced malformed input
(unquoted $1,500.00) for the unquoted-delimiter detector.
Tests (94 new):
- test_normalize.py (48): finding fields, fix registry, auto_fix scope,
decision paths, gate idempotency, output-options helper.
- test_encodings_corpus.py (90, 16 xfailed): parametric detection +
decode + analyzer-no-crash sweep against the manifest.
- test_analyze.py: encoding override + encoding_uncertain detectors.
- test_corpus.py: pre-parse repair in the strict reader.
run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.
Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.
Suite: 765 passed, 17 xfailed (was 458 passed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ENCODINGS-CASES.md - Code Page / Encoding Test Corpus
Version: 1.0 | Last updated: April 29, 2026 | Companion to: TEST-CASES.md and QUOTE-CASES.md
Why this is a separate corpus
Files 01-23 in the main corpus test the transformation layer: given a Python str already in memory, what does the cleaner do to it. Encoding tests are about the I/O layer that runs before the transformation layer ever sees data: given a sequence of bytes on disk, can the reader correctly turn them into a Python str in the first place?
These are different failures:
- A transformation bug produces wrong-but-valid output (curly quotes that should have been folded, whitespace that should have been trimmed).
- An I/O bug produces garbage (mojibake) or crashes the reader entirely. The cleaner never gets to apply any transformation rule because the input never decoded.
Per TECHNICAL.md Section 9, encoding handling lives in src/core/io.py, separate from any individual cleaning script. This corpus tests that module.
1. Layout
test_data/encodings/
├── E01_western_basic_utf8.csv ... E26_korean_euckr.csv
├── E27_pathological_ascii_only.csv ... E31_pathological_mixed_concat.csv
├── expected_detection.csv # Manifest: ground truth + acceptable detection
├── detector_baseline.csv # What charset-normalizer actually returns
└── reference/
    ├── WESTERN_BASIC.utf8.txt
    ├── WESTERN_EXTENDED.utf8.txt
    ├── EASTERN_EUROPEAN.utf8.txt
    ├── CYRILLIC.utf8.txt
    ├── JAPANESE.utf8.txt
    ├── CHINESE_SIMPLIFIED.utf8.txt
    ├── CHINESE_TRADITIONAL.utf8.txt
    ├── KOREAN.utf8.txt
    └── ASCII_ONLY.utf8.txt
Every encoded file has a canonical_content_id linking it to one of the 9 reference files in reference/. After correct decoding (and BOM stripping if applicable), the encoded file's content must equal the corresponding reference file byte-for-byte.
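The equality rule above can be sketched in a few lines. This is a minimal illustration using only the standard library; the helper name decode_for_comparison and the sample content are hypothetical and are not taken from the corpus or from src/core/io.py.

```python
BOM = "\ufeff"

def decode_for_comparison(data: bytes, encoding: str) -> str:
    """Decode bytes and drop one leading BOM character if the codec left it in."""
    text = data.decode(encoding)
    return text[1:] if text.startswith(BOM) else text

# Hypothetical stand-in for a reference file's content.
reference = "id,name\n1,Jos\u00e9\n"

variants = [
    (reference.encode("utf-8"), "utf-8"),
    (reference.encode("utf-8-sig"), "utf-8"),            # BOM survives a plain utf-8 decode
    (reference.encode("utf-16"), "utf-16"),              # codec consumes the BOM itself
    (b"\xff\xfe" + reference.encode("utf-16-le"), "utf-16-le"),  # BOM survives
    (reference.encode("cp1252"), "cp1252"),
]

for data, encoding in variants:
    decoded = decode_for_comparison(data, encoding)
    # "Byte-for-byte" here means: re-encoded as UTF-8, the bytes match the reference.
    assert decoded.encode("utf-8") == reference.encode("utf-8"), encoding
```

Stripping the BOM after decoding, rather than before, keeps the check independent of which codec name the reader happened to use.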
2. Coverage matrix
The corpus uses 9 distinct content sets, each chosen to exercise a specific encoding family. Cross-coverage is enforced by content design: Cyrillic content cannot be encoded as cp1252; Western extended content (with euro/em-dash) cannot be encoded as Latin-1; etc. Attempting those encodings would either error or substitute, both of which are themselves test cases.
| Content family | What it contains | Encodings covered |
|---|---|---|
| WESTERN_BASIC | ASCII + accented Latin-1 chars (é, ü, ñ, ç) | UTF-8, UTF-8 with BOM, cp1252, ISO-8859-1, ISO-8859-15, Mac Roman, UTF-16 LE/BE with BOM, UTF-16 LE without BOM |
| WESTERN_EXTENDED | Above + euro sign, smart quotes, em-dash | UTF-8, cp1252, UTF-16 LE (NOT Latin-1: chars don't exist there) |
| EASTERN_EUROPEAN | Czech, Polish, Hungarian, Slovak accents | UTF-8, cp1250, ISO-8859-2 |
| CYRILLIC | Russian | UTF-8, cp1251, KOI8-R |
| JAPANESE | Kanji + kana | UTF-8, Shift_JIS |
| CHINESE_SIMPLIFIED | Mainland China characters | UTF-8, GB18030 |
| CHINESE_TRADITIONAL | Taiwan/HK characters | UTF-8, Big5 |
| KOREAN | Hangul | UTF-8, EUC-KR |
| ASCII_ONLY | Pure ASCII | One file; encoding genuinely ambiguous |
3. Per-file index
Group A — WESTERN_BASIC (single content, 9 encodings)
This group's purpose is mainly to test detector behavior on the most common Western encodings. Because the content uses only ASCII + Latin-1 characters in the 0xA0+ range, cp1252 / ISO-8859-1 / ISO-8859-15 produce byte-identical output for this content. The detector cannot meaningfully distinguish among them; any of them is a correct answer.
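The byte-identity claim is easy to confirm with Python's built-in codecs. The sample string below is an assumption standing in for the WESTERN_BASIC content; it uses only the accented characters the coverage matrix lists.

```python
# Hypothetical sample limited to ASCII plus e-acute, u-umlaut, n-tilde, c-cedilla.
sample = "caf\u00e9,M\u00fcller,se\u00f1or,gar\u00e7on"

encoded = {name: sample.encode(name) for name in ("cp1252", "iso-8859-1", "iso-8859-15")}

# All three single-byte encodings produce the same bytes for this content.
identical = len(set(encoded.values())) == 1

# Mac Roman maps the same characters to different bytes, so it is distinguishable.
mac_differs = sample.encode("mac_roman") != encoded["cp1252"]

# The three only diverge once a character such as the euro sign appears.
euro_cp1252 = "\u20ac".encode("cp1252")        # b"\x80"
euro_latin9 = "\u20ac".encode("iso-8859-15")   # b"\xa4"

print(identical, mac_differs, euro_cp1252, euro_latin9)
```

Because the bytes are identical, no detector can recover which of the three was "really" used, which is why the manifest accepts any of them.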
| File | Encoding | Notes |
|---|---|---|
| E01 | UTF-8 | Modern default |
| E02 | UTF-8 with BOM | Excel "CSV UTF-8" export. Reader must strip the BOM. |
| E03 | cp1252 | Excel default "CSV" on US/UK/Western Windows |
| E04 | ISO-8859-1 | Latin-1. Identical bytes to cp1252 for this content. |
| E05 | ISO-8859-15 | Latin-9. Identical to Latin-1 here (no euro). |
| E06 | Mac Roman | Different byte mappings; distinguishable |
| E07 | UTF-16 LE with BOM | Excel "Unicode Text" export |
| E08 | UTF-16 BE with BOM | Less common but spec'd |
| E09 | UTF-16 LE without BOM | Detection unreliable; document failure mode |
Group B — WESTERN_EXTENDED (3 encodings)
This is the cleanest cp1252-vs-Latin-1 discriminator in the corpus. The content uses bytes 0x80-0x9F (where cp1252 puts euro, smart quotes, em-dash) — exactly the range Latin-1 leaves undefined. A reader that misidentifies this file as Latin-1 will produce control characters or replacement chars; correct identification as cp1252 yields readable text.
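A short sketch of why this content discriminates, using the standard library only. The sample line is an assumption, not the actual fixture content.

```python
# Hypothetical sample: euro sign, em-dash, and curly double quotes.
sample = "price,\u20ac5 \u2014 \u201cspecial\u201d"

cp1252_bytes = sample.encode("cp1252")

# Decoding as Latin-1 never raises, but bytes 0x80-0x9F become C1 control characters.
misread = cp1252_bytes.decode("latin-1")
c1_controls = [ch for ch in misread if 0x80 <= ord(ch) <= 0x9F]

# Decoding as cp1252 round-trips cleanly.
correct = cp1252_bytes.decode("cp1252")

# The content cannot be written as Latin-1 at all.
try:
    sample.encode("latin-1")
    latin1_encodable = True
except UnicodeEncodeError:
    latin1_encodable = False

print(len(c1_controls), correct == sample, latin1_encodable)
```

Counting C1 control characters in the decoded text is a cheap signal that a cp1252 file was read as Latin-1.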
| File | Encoding | Notes |
|---|---|---|
| E10 | UTF-8 | Reference |
| E11 | cp1252 | The discriminator file |
| E12 | UTF-16 LE with BOM | Same content, sanity check |
Group C — EASTERN_EUROPEAN (3 encodings)
| File | Encoding | Notes |
|---|---|---|
| E13 | UTF-8 | Reference |
| E14 | cp1250 | Polish/Czech/Hungarian Windows default |
| E15 | ISO-8859-2 | Latin-2; distinct byte mappings from cp1250 |
Group D — CYRILLIC (3 encodings)
| File | Encoding | Notes |
|---|---|---|
| E16 | UTF-8 | Reference |
| E17 | cp1251 | Russian Windows default |
| E18 | KOI8-R | Older Russian Unix encoding; distinct bytes from cp1251 |
Group E — CJK (8 files, 4 languages × 2 encodings each)
| File | Encoding | Notes |
|---|---|---|
| E19 | UTF-8 (Japanese) | Reference |
| E20 | Shift_JIS | Japanese Excel default; cp932 is the MS extended variant |
| E21 | UTF-8 (Chinese simplified) | Reference |
| E22 | GB18030 | Mainland China; supersets GBK and GB2312 |
| E23 | UTF-8 (Chinese traditional) | Reference |
| E24 | Big5 | Taiwan/HK; cp950 is the MS variant |
| E25 | UTF-8 (Korean) | Reference |
| E26 | EUC-KR | Korean Windows default; cp949 is the MS variant |
Group F — Pathological (5 files)
These are the cases that crash readers, produce silent corruption, or expose the limits of encoding detection. The expected behavior is that the reader fails informatively, not that it succeeds.
| File | Pathology | What should happen |
|---|---|---|
| E27 | ASCII only — encoding genuinely ambiguous | Detector picks any of ASCII/UTF-8/cp1252/Latin-1; all decode identically. Test that the reader doesn't OVER-confidently commit to one when ambiguous. |
| E28 | Invalid UTF-8 byte sequence (0xC3 0x28 mid-file) | Strict UTF-8 decode raises UnicodeDecodeError. Reader should fall back to a single-byte encoding and warn, not silently substitute. |
| E29 | Truncated UTF-8 multibyte at EOF | Strict decode raises "unexpected end of data". Reader should error with a clear "file appears truncated" message, not silently produce U+FFFD. |
| E30 | "Lying BOM" — UTF-8 BOM on cp1252 body | utf-8-sig decoder errors at first cp1252 byte in 0x80-0x9F range. Reader should detect the lie and recover by stripping BOM and trying cp1252; warn the user. |
| E31 | Mixed encoding concatenation (cp1252 + UTF-8) | NO single encoding decodes the whole file. UTF-8 errors on cp1252 bytes; cp1252 mojibakes the UTF-8 bytes. Reader should refuse and tell the user the file contains mixed encodings. |
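The four decode-level pathologies in the table (E28 to E31) can be reproduced with hand-built bytes. The byte strings below are illustrative assumptions, not the fixture contents, and try_decode is a hypothetical helper.

```python
def try_decode(data: bytes, encoding: str):
    """Return (text, None) on success, or (None, reason) when strict decoding fails."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc.reason

# E28-style: 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
invalid = b"caf\xc3\x28 au lait"

# E29-style: the final multibyte character is cut off at end of file.
truncated = "caf\u00e9".encode("utf-8")[:-1]

# E30-style: UTF-8 BOM glued onto a cp1252 body.
lying_bom = b"\xef\xbb\xbf" + "\u20ac100".encode("cp1252")

# E31-style: a cp1252 chunk followed by a UTF-8 chunk.
mixed = "caf\u00e9,".encode("cp1252") + "na\u00efve".encode("utf-8")

print(try_decode(invalid, "utf-8"))        # fails: invalid continuation byte
print(try_decode(truncated, "utf-8"))      # fails: unexpected end of data
print(try_decode(lying_bom, "utf-8-sig"))  # fails: invalid start byte

# Recovery for the lying BOM: strip the three BOM bytes, then decode as cp1252.
recovered = lying_bom[3:].decode("cp1252")

# Mixed file: UTF-8 fails outright, cp1252 "succeeds" but mojibakes the UTF-8 half.
print(try_decode(mixed, "utf-8"))
print(try_decode(mixed, "cp1252"))
```

The exception's reason attribute is what lets a reader distinguish a truncated file from an invalid sequence when choosing the message to show.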
4. Manifest files
expected_detection.csv — ground truth + acceptable detection answers
7 columns:
- filename: the encoded test file
- canonical_content_id: links to the reference content
- encoding: the actual encoding used by the generator (ground truth)
- has_bom: whether the file has a BOM
- byte_length: file size in bytes
- expected_detection: pipe-separated list of detector answers that should be considered correct. Includes fuzzy markers (AMBIGUOUS, UNRELIABLE, REJECT, LOW_CONFIDENCE) for cases where any reasonable detector behavior is acceptable.
- decode_notes: human-readable explanation of expected behavior
Use this as the primary reference when validating your reader.
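One way to consume the expected_detection column is to separate concrete encoding labels from the fuzzy markers. The sketch below assumes the column layout described above; the sample row, its filename, and the helper parse_expected are made up for illustration and do not come from the real manifest.

```python
import csv
import io

FUZZY_MARKERS = {"AMBIGUOUS", "UNRELIABLE", "REJECT", "LOW_CONFIDENCE"}

def parse_expected(field: str):
    """Split a pipe-separated expected_detection value into (labels, markers)."""
    tokens = [t.strip() for t in field.split("|") if t.strip()]
    labels = {t for t in tokens if t not in FUZZY_MARKERS}
    markers = {t for t in tokens if t in FUZZY_MARKERS}
    return labels, markers

# Illustrative manifest with one invented row (not real corpus data).
sample_manifest = io.StringIO(
    "filename,canonical_content_id,encoding,has_bom,byte_length,"
    "expected_detection,decode_notes\n"
    "EXX_example.csv,WESTERN_BASIC,cp1252,False,120,"
    "cp1252|latin-1|AMBIGUOUS,illustrative row\n"
)

rows = list(csv.DictReader(sample_manifest))
labels, markers = parse_expected(rows[0]["expected_detection"])
print(sorted(labels), sorted(markers))
```

Keeping markers in their own set avoids substring checks against the raw field, which would also match a marker name embedded in a longer token.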
detector_baseline.csv — what charset-normalizer actually returns
Recorded during fixture generation against the version of charset-normalizer installed at that time. 6 columns:
filename,ground_truth_encoding,charset_normalizer_returns,cn_aliases,cn_language,cn_chaos_score
This is not authoritative — different detector versions return different labels. It exists so you can see typical detector output without having to run charset-normalizer yourself, and so you have a baseline to compare against if you're testing a different detector or a newer version.
reference/*.utf8.txt — canonical decoded content
One UTF-8 file per content family. After your reader decodes any encoded file in the corpus and strips any BOM, the result should equal the corresponding reference file's content byte-for-byte.
5. Observed charset-normalizer behavior
Recorded against charset-normalizer 3.x. Some of these are known detector quirks worth understanding before you debug your own code:
Cases where charset-normalizer is reliably correct
- All UTF-8 files (E01, E02, E10, E13, E16, E19, E21, E23, E25): detected as utf_8.
- All UTF-16 with BOM (E07, E08, E12): detected as utf_16 (loses LE/BE distinction in label, recoverable from BOM).
- E14 (cp1250 Eastern European): correctly detected.
- E17 (cp1251 Cyrillic): correctly detected.
- E20 (Shift_JIS Japanese): returns cp932 (the MS extended variant; equivalent for this content).
- E22 (GB18030 Chinese): correctly detected.
- E24 (Big5 Chinese traditional): correctly detected.
- E26 (EUC-KR Korean): returns cp949 (the MS variant; equivalent for this content).
- E27 (ASCII): correctly detected as ascii.
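Because detector labels such as utf_8 or cp932 rarely match the manifest's spelling, it can help to compare labels through Python's codec registry and, for variant pairs, to compare the encoded bytes for the content at hand. The helper names and sample strings below are assumptions for illustration.

```python
import codecs

def same_codec(a: str, b: str) -> bool:
    """True when two labels resolve to the same Python codec (alias-insensitive)."""
    return codecs.lookup(a).name == codecs.lookup(b).name

def equivalent_for(content: str, a: str, b: str) -> bool:
    """True when both encodings produce identical bytes for this specific content."""
    try:
        return content.encode(a) == content.encode(b)
    except UnicodeEncodeError:
        return False

# Spelling differences disappear after lookup.
print(same_codec("UTF-8", "utf_8"))
print(same_codec("utf_16", "UTF-16"))

# shift_jis and cp932 are distinct codecs, yet agree on ordinary kanji and kana.
print(same_codec("shift_jis", "cp932"))
print(equivalent_for("\u65e5\u672c\u8a9e,\u30ab\u30ca", "shift_jis", "cp932"))

# Likewise euc_kr and cp949 agree on common Hangul.
print(equivalent_for("\ud55c\uad6d\uc5b4", "euc_kr", "cp949"))
```

The second helper is the stronger check: it tests the property that matters for the corpus, namely that the decoded content is the same.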
Cases where charset-normalizer mislabels but produces the right decoded content
These return a wrong-sounding name but decode to the correct characters because the encodings are byte-equivalent for this specific content:
- E03, E04, E05 (cp1252, Latin-1, Latin-9 with WESTERN_BASIC content): all returned as cp1250. The decoded chars are reported correct because, for the ASCII + Latin-1 characters this content uses, all four encodings produce identical results. The label is misleading but the data is fine.
- E06 (Mac Roman): returned as mac_iceland. Same family, identical for our content.
- E11 (cp1252 with WESTERN_EXTENDED): returned as cp1250. Surprising. cp1250 shares the euro sign (0x80), smart quotes, and em-dash positions with cp1252, but the two code pages differ for a number of accented letters (for example, byte 0xF1 is ñ in cp1252 and ń in cp1250), so the decoded text can still be wrong even when the label looks plausible. Verify that your reader actually re-decodes with the returned label and check the output; don't assume a "matching" label means correct content.
Cases where charset-normalizer is wrong
- E15 (ISO-8859-2 Eastern European): returned as cp1258 (Vietnamese encoding). Wrong family entirely. Probable cause: the chaos heuristic doesn't penalize cp1258 for the byte distribution in the test content.
- E18 (KOI8-R Cyrillic): returned as shift_jis_2004 (Japanese!). Bytes in KOI8-R's high-bit range happen to look like valid Shift_JIS multibyte sequences for this content. This is a high-confidence misdetection, and the one to plan a fallback for in your reader.
Pathological cases
- E28-E31: charset-normalizer returns various labels (cp1257, cp1250, cp1252, cp1250). For pathological inputs, the label is less important than the behavior: does your reader detect that something is wrong (low confidence, multiple candidate encodings, or a decode error after detection) and warn the user? The expected_detection field accepts any label paired with appropriate warning behavior.
Implication for your reader
Don't trust charset-normalizer's label blindly. The robust pattern:
- Run charset-normalizer.
- Try to decode the entire file with the returned encoding.
- If decode succeeds, sanity-check the result: does it contain replacement characters (U+FFFD)? Does it contain control characters in unexpected places (which suggest cp1252-vs-Latin-1 ambiguity decoded wrong)?
- If it fails or smells wrong, try a small candidate set (utf-8, cp1252, latin-1) and pick the one with the cleanest result.
- When confidence is low, log a warning and let the user override via a --encoding flag.
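The steps above can be sketched as follows. This is a minimal illustration, not the implementation in src/core/io.py: the function names, the scoring rule, and the returned warning flag are all assumptions. The detector's answer is passed in as a plain label so the sketch stays standard-library only; in practice it could come from charset-normalizer, for example via from_bytes(data).best().

```python
FALLBACK_ENCODINGS = ("utf-8", "cp1252", "latin-1")

def smell_score(text: str) -> int:
    """Count replacement characters and C1 controls, both signs of a wrong decode."""
    return sum(1 for ch in text if ch == "\ufffd" or 0x80 <= ord(ch) <= 0x9F)

def decode_with_fallback(data: bytes, detected=None):
    """Return (text, encoding, warn). warn is True whenever the detector was not trusted."""
    candidates = []
    for name in (detected, *FALLBACK_ENCODINGS):
        if name and name not in candidates:
            candidates.append(name)

    results = []
    for name in candidates:
        try:
            text = data.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # decode failed or label unknown to Python: try the next one
        if name == detected and smell_score(text) == 0:
            return text, name, False  # detector's answer decodes cleanly
        results.append((smell_score(text), text, name))

    if not results:
        raise ValueError("no candidate encoding could decode the data")
    # Pick the cleanest result; ties go to the earlier candidate.
    _, text, name = min(results, key=lambda item: item[0])
    return text, name, True

# A cp1252 file mislabelled as Latin-1 is recovered, with the warning flag set.
data = "\u20ac5 caf\u00e9".encode("cp1252")
print(decode_with_fallback(data, detected="latin-1"))
```

The order of FALLBACK_ENCODINGS matters: latin-1 accepts any byte sequence, so it belongs last, and UTF-8 comes first because a successful strict UTF-8 decode is rarely accidental.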
6. Suggested test workflow
import csv
from pathlib import Path

from src.core.io import detect_encoding, read_csv  # your reader

CORPUS = Path("test_data/encodings")

# Load ground-truth manifest (assumed UTF-8; newline="" is what the csv module expects)
with (CORPUS / "expected_detection.csv").open(encoding="utf-8", newline="") as f:
    manifest = list(csv.DictReader(f))

# Load reference content, keyed by content family (e.g. "WESTERN_BASIC")
references = {
    p.stem.replace(".utf8", ""): p.read_text(encoding="utf-8")
    for p in (CORPUS / "reference").glob("*.utf8.txt")
}

# Test 1: detection - your detector returns an acceptable answer
for entry in manifest:
    if entry["canonical_content_id"] in references:  # skip pure pathological
        detected = detect_encoding(CORPUS / entry["filename"])
        acceptable = [e.strip() for e in entry["expected_detection"].split("|")]
        assert detected in acceptable or any(
            marker in entry["expected_detection"]
            for marker in ["AMBIGUOUS", "UNRELIABLE"]
        ), f"{entry['filename']}: detected {detected} not in {acceptable}"

# Test 2: decoded content matches reference
for entry in manifest:
    cid = entry["canonical_content_id"]
    if cid not in references:
        continue  # pathological case
    decoded = read_csv(CORPUS / entry["filename"])
    assert decoded == references[cid], f"{entry['filename']}: content mismatch"

# Test 3: pathological cases produce warnings, not silent corruption
for entry in manifest:
    cid = entry["canonical_content_id"]
    if cid in references:
        continue
    # Reader must either raise a clear error OR succeed with a logged warning
    # The exact behavior is a policy choice; document it and test against it
7. What this corpus does NOT cover
Listed so the gaps are explicit:
- Big files. Every fixture is small (under 1 KB). Detection on a 500MB cp1252 export may behave differently because charset-normalizer samples; if your reader processes giant files, add a separate large-file detection test.
- Streaming detection. Detector is run on the full bytes here. If your reader decodes in chunks (for memory reasons on huge files), encoding-detection-at-stream-start is its own test surface.
- Languages with complex scripts not represented here: Thai, Hebrew, Arabic, Vietnamese (cp1258), Greek (cp1253), Turkish (cp1254). Add per-language fixtures if your buyers use these locales. The generator script is parameterized; adding a new content family is a few-line change.
- Extended grapheme handling. This corpus tests encoding detection and byte-to-string conversion. It does NOT test grapheme-cluster boundaries (multi-codepoint emoji, family ZWJ sequences). Those are the cleaner's territory in the main corpus, case 13.
- Encoding errors during WRITE. The corpus tests reading. If your tool writes output in a non-UTF-8 encoding for any reason, write-side encoding correctness needs separate fixtures.
- Filename / path encoding issues. Some filesystems mangle non-ASCII filenames (older Windows, NFS configs). Out of scope for the cleaner; that's a deployment problem.
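For the streaming gap noted above, the main hazard once an encoding is chosen is a multibyte character split across chunk boundaries. The sketch below shows one way to handle that with the standard library's incremental decoders; decode_chunks is a hypothetical helper, and the three-byte chunk size is chosen only to force a split.

```python
import codecs

def decode_chunks(chunks, encoding="utf-8"):
    """Yield decoded text for each chunk, buffering incomplete multibyte sequences."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True raises UnicodeDecodeError if the stream ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

data = "na\u00efve caf\u00e9".encode("utf-8")
chunks = [data[i:i + 3] for i in range(0, len(data), 3)]  # splits one character in half

# Decoding a chunk on its own fails, because its last byte starts an unfinished character.
try:
    chunks[0].decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

streamed = "".join(decode_chunks(chunks))
print(naive_ok, streamed == "na\u00efve caf\u00e9")
```

The final flush also gives a streaming reader the same truncated-file signal that E29 exercises for whole-file decoding.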
8. How to extend the corpus
Add a new content family:
# In generate_encoding_test_files.py:
THAI = "id,name,city\n1,สมชาย,กรุงเทพ\n..."
# Then add encoding lines:
write_encoded("E32_thai_utf8.csv", "THAI", THAI, "utf-8", ...)
write_encoded("E33_thai_cp874.csv", "THAI", THAI, "cp874", ...)
Add reference content to the references dict at the bottom of the generator. Re-run the generator. The manifest and detector baseline will refresh automatically.
For a new pathological case: construct the raw bytes by hand and use write_raw(). Document the failure mode in the decode_notes field.
Continue numbering: E32, E33, etc. Reserve E9# if you need a "destructive" subcategory paralleling the malformed CSV corpus.