feat(gate): CSV-normalization gate with confidence-tiered findings

Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.

Core (src/core/):
- analyze.py: Finding gains confidence, fix_action, pre_applied; new
  detectors for encoding_uncertain, encoding_decode_failed; new
  top-level encoding_override parameter.
- fixes.py: registry of fix algorithms keyed by fix_action id.
- normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
  the NormalizationResult / Decision dataclasses the gate consumes.
- io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
  transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
  and normalizes line endings (fixes bare-CR parser crash); empty file
  handled gracefully instead of EmptyDataError traceback.

GUI (src/gui/):
- pages/0_Review.py: gate page with per-finding decision controls,
  encoding override picker (16 codepages + custom), and Advanced output
  options (encoding, delimiter, line terminator) on the download.
- components.py: require_normalization_gate() helper.
- pages/1-9: gate guard wired on every tool page.

Test corpora:
- test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
  UTF-8 files + manifest, synced from Business/DataTools.
- test-cases/text-cleaner-corpus/test_data/17: synced malformed input
  (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
- test_normalize.py (48): finding fields, fix registry, auto_fix scope,
  decision paths, gate idempotency, output-options helper.
- test_encodings_corpus.py (90, 16 xfailed): parametric detection +
  decode + analyzer-no-crash sweep against the manifest.
- test_analyze.py: encoding override + encoding_uncertain detectors.
- test_corpus.py: pre-parse repair in the strict reader.

run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test-cases/encodings-corpus/ENCODINGS-CASES.md
@@ -0,0 +1,284 @@
# ENCODINGS-CASES.md - Code Page / Encoding Test Corpus

**Version**: 1.0
**Last updated**: April 29, 2026
**Companion to**: TEST-CASES.md and QUOTE-CASES.md.

## Why this is a separate corpus

Files 01-23 in the main corpus test the **transformation layer**: given a Python `str` already in memory, what does the cleaner do to it. Encoding tests are about the **I/O layer** that runs *before* the transformation layer ever sees data: given a sequence of bytes on disk, can the reader correctly turn them into a Python `str` in the first place?

These are different failures:

- A transformation bug produces wrong-but-valid output (curly quotes that should have been folded, whitespace that should have been trimmed).
- An I/O bug produces *garbage* (mojibake) or *crashes* the reader entirely. The cleaner never gets to apply any transformation rule because the input never decoded.

Per TECHNICAL.md Section 9, encoding handling lives in `src/core/io.py`, separate from any individual cleaning script. This corpus tests that module.

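The two I/O failure modes above can be reproduced in a few lines. This is a minimal sketch using only the standard library; the sample string is an illustrative assumption and is not taken from any fixture in the corpus.

```python
# Illustrative sample text (an assumption, not fixture content).
text = "café, 5 €"

# Failure mode 1: cp1252 bytes read as UTF-8 make a strict reader crash.
cp1252_bytes = text.encode("cp1252")
try:
    cp1252_bytes.decode("utf-8")
    crashed = False
except UnicodeDecodeError:
    crashed = True

# Failure mode 2: UTF-8 bytes read as cp1252 decode without error,
# but every multi-byte character turns into mojibake.
mojibake = text.encode("utf-8").decode("cp1252")

print(crashed)
print(mojibake)
```

Neither outcome is something a transformation rule could repair afterwards, which is why the reader is tested on its own.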

---

## 1. Layout

```
test_data/encodings/
├── E01_western_basic_utf8.csv ... E26_korean_euckr.csv
├── E27_pathological_ascii_only.csv ... E31_pathological_mixed_concat.csv
├── expected_detection.csv       # Manifest: ground truth + acceptable detection
├── detector_baseline.csv        # What charset-normalizer actually returns
└── reference/
    ├── WESTERN_BASIC.utf8.txt
    ├── WESTERN_EXTENDED.utf8.txt
    ├── EASTERN_EUROPEAN.utf8.txt
    ├── CYRILLIC.utf8.txt
    ├── JAPANESE.utf8.txt
    ├── CHINESE_SIMPLIFIED.utf8.txt
    ├── CHINESE_TRADITIONAL.utf8.txt
    ├── KOREAN.utf8.txt
    └── ASCII_ONLY.utf8.txt
```

Every encoded file has a `canonical_content_id` linking it to one of the 9 reference files in `reference/`. After correct decoding (and BOM stripping if applicable), the encoded file's content must equal the corresponding reference file byte-for-byte.

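The "decode, strip the BOM, compare to the reference" rule can be sketched as follows. The helper name `decode_without_bom` and the sample text are assumptions made for illustration; the project's own reader in `src/core/io.py` may be structured differently.

```python
import codecs

# BOMs are checked longest-first so that a UTF-32 LE BOM (FF FE 00 00)
# is not mistaken for a UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_without_bom(raw: bytes, fallback: str) -> str:
    """Hypothetical helper: honor a BOM if present, else use `fallback`."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding)
    return raw.decode(fallback)


# Illustrative reference text (an assumption, not a corpus file).
reference = "id,name\n1,José\n"
utf8_bom = codecs.BOM_UTF8 + reference.encode("utf-8")
utf16_le = codecs.BOM_UTF16_LE + reference.encode("utf-16-le")
cp1252 = reference.encode("cp1252")

results = [
    decode_without_bom(utf8_bom, "utf-8"),
    decode_without_bom(utf16_le, "utf-8"),
    decode_without_bom(cp1252, "cp1252"),
]
print(all(r == reference for r in results))
```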

---

## 2. Coverage matrix

The corpus uses 9 distinct content sets, each chosen to exercise a specific encoding family. Cross-coverage is enforced by content design: Cyrillic content cannot be encoded as cp1252; Western extended content (with euro/em-dash) cannot be encoded as Latin-1; etc. Attempting those encodings would either error or substitute, both of which are themselves test cases.

| Content family | What it contains | Encodings covered |
|---|---|---|
| WESTERN_BASIC | ASCII + accented Latin-1 chars (é, ü, ñ, ç) | UTF-8, UTF-8 with BOM, cp1252, ISO-8859-1, ISO-8859-15, Mac Roman, UTF-16 LE/BE with BOM, UTF-16 LE without BOM |
| WESTERN_EXTENDED | Above + euro sign, smart quotes, em-dash | UTF-8, cp1252, UTF-16 LE (NOT Latin-1: chars don't exist there) |
| EASTERN_EUROPEAN | Czech, Polish, Hungarian, Slovak accents | UTF-8, cp1250, ISO-8859-2 |
| CYRILLIC | Russian | UTF-8, cp1251, KOI8-R |
| JAPANESE | Kanji + kana | UTF-8, Shift_JIS |
| CHINESE_SIMPLIFIED | Mainland China characters | UTF-8, GB18030 |
| CHINESE_TRADITIONAL | Taiwan/HK characters | UTF-8, Big5 |
| KOREAN | Hangul | UTF-8, EUC-KR |
| ASCII_ONLY | Pure ASCII | One file; encoding genuinely ambiguous |

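The "error or substitute" behavior mentioned before the table is easy to observe directly. A small sketch, using illustrative strings rather than the corpus content:

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Illustrative samples (assumptions, not the corpus content sets).
cyrillic = "Привет"
euro = "5 €"

print(can_encode(cyrillic, "cp1251"))   # Cyrillic code page
print(can_encode(cyrillic, "cp1252"))   # Western code page
print(can_encode(euro, "cp1252"))
print(can_encode(euro, "latin-1"))

# With errors="replace" the encoder substitutes "?" instead of raising.
substituted = cyrillic.encode("cp1252", errors="replace")
print(substituted)
```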

---

## 3. Per-file index

### Group A - WESTERN_BASIC (single content, 9 encodings)

This group's purpose is mainly to test detector behavior on the most common Western encodings. Because the content uses only ASCII + Latin-1 characters in the 0xA0+ range, and none of them sits at one of the eight positions where ISO-8859-15 differs from ISO-8859-1 (such as 0xA4, the euro sign in Latin-9), **cp1252 / ISO-8859-1 / ISO-8859-15 produce byte-identical output for this content**. The detector cannot meaningfully distinguish among them; any of them is a correct answer.

| File | Encoding | Notes |
|---|---|---|
| E01 | UTF-8 | Modern default |
| E02 | UTF-8 with BOM | Excel "CSV UTF-8" export. Reader must strip the BOM. |
| E03 | cp1252 | Excel default "CSV" on US/UK/Western Windows |
| E04 | ISO-8859-1 | Latin-1. Identical bytes to cp1252 for this content. |
| E05 | ISO-8859-15 | Latin-9. Identical to Latin-1 here (no euro). |
| E06 | Mac Roman | Different byte mappings; distinguishable |
| E07 | UTF-16 LE with BOM | Excel "Unicode Text" export |
| E08 | UTF-16 BE with BOM | Less common but spec'd |
| E09 | UTF-16 LE without BOM | Detection unreliable; document failure mode |

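The byte-identity claim for Group A can be checked with a short sketch. The sample below uses the four accented letters named in the coverage matrix; the real fixture content is assumed to be richer than this.

```python
# Accented letters listed for WESTERN_BASIC (the full fixture text is not shown here).
sample = "é ü ñ ç"

single_byte = {
    name: sample.encode(name)
    for name in ("cp1252", "iso-8859-1", "iso-8859-15")
}

# All three code pages produce the same bytes for this sample.
identical = len(set(single_byte.values())) == 1

# Mac Roman places the same letters at different byte values.
mac_differs = sample.encode("mac_roman") != single_byte["cp1252"]

print(identical, mac_differs)
```

This is why E03, E04, and E05 accept several detector answers while E06 does not.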

### Group B - WESTERN_EXTENDED (3 encodings)

This is the cleanest **cp1252-vs-Latin-1 discriminator** in the corpus. The content uses bytes 0x80-0x9F, where cp1252 puts the euro sign, smart quotes, and em-dash, and which ISO-8859-1 reserves for C1 control codes instead of printable characters. A reader that misidentifies this file as Latin-1 will produce control characters or replacement chars; correct identification as cp1252 yields readable text.

| File | Encoding | Notes |
|---|---|---|
| E10 | UTF-8 | Reference |
| E11 | cp1252 | The discriminator file |
| E12 | UTF-16 LE with BOM | Same content, sanity check |

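A minimal sketch of the discriminator, assuming an illustrative string that uses the same character classes as WESTERN_EXTENDED (euro sign, smart quotes, em-dash):

```python
# Euro sign, curly double quotes, and an em-dash (U+2014), written as escapes.
original = "\u20ac \u201cquoted\u201d \u2014"
raw = original.encode("cp1252")  # every special character lands in 0x80-0x9F


def has_c1_controls(text: str) -> bool:
    """True if the text contains code points U+0080..U+009F."""
    return any(0x80 <= ord(ch) <= 0x9F for ch in text)


as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")  # never raises, but yields C1 controls

print(as_cp1252 == original, has_c1_controls(as_cp1252))
print(as_latin1 == original, has_c1_controls(as_latin1))
```

Because the Latin-1 decode succeeds silently, a check such as `has_c1_controls` is one way for a reader to notice the misidentification.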

### Group C - EASTERN_EUROPEAN (3 encodings)

| File | Encoding | Notes |
|---|---|---|
| E13 | UTF-8 | Reference |
| E14 | cp1250 | Polish/Czech/Hungarian Windows default |
| E15 | ISO-8859-2 | Latin-2; distinct byte mappings from cp1250 |

### Group D - CYRILLIC (3 encodings)

| File | Encoding | Notes |
|---|---|---|
| E16 | UTF-8 | Reference |
| E17 | cp1251 | Russian Windows default |
| E18 | KOI8-R | Older Russian Unix encoding; distinct bytes from cp1251 |

### Group E - CJK (8 files, 4 languages × 2 encodings each)

| File | Encoding | Notes |
|---|---|---|
| E19 | UTF-8 (Japanese) | Reference |
| E20 | Shift_JIS | Japanese Excel default; cp932 is the MS extended variant |
| E21 | UTF-8 (Chinese simplified) | Reference |
| E22 | GB18030 | Mainland China; a superset of GBK and GB2312 |
| E23 | UTF-8 (Chinese traditional) | Reference |
| E24 | Big5 | Taiwan/HK; cp950 is the MS variant |
| E25 | UTF-8 (Korean) | Reference |
| E26 | EUC-KR | Korean Windows default; cp949 is the MS variant |

### Group F - Pathological (5 files)

These are the cases that crash readers, produce silent corruption, or expose the limits of encoding detection. The expected behavior is **that the reader fails informatively**, not that it succeeds.

| File | Pathology | What should happen |
|---|---|---|
| E27 | ASCII only; encoding genuinely ambiguous | Detector picks any of ASCII/UTF-8/cp1252/Latin-1; all decode identically. Test that the reader doesn't OVER-confidently commit to one when ambiguous. |
| E28 | Invalid UTF-8 byte sequence (0xC3 0x28 mid-file) | Strict UTF-8 decode raises UnicodeDecodeError. Reader should fall back to a single-byte encoding and warn, not silently substitute. |
| E29 | Truncated UTF-8 multibyte at EOF | Strict decode raises "unexpected end of data". Reader should error with a clear "file appears truncated" message, not silently produce U+FFFD. |
| E30 | "Lying BOM": UTF-8 BOM on cp1252 body | `utf-8-sig` decoder errors at the first non-ASCII cp1252 byte that does not form a valid UTF-8 sequence. Reader should detect the lie and recover by stripping the BOM and trying cp1252; warn the user. |
| E31 | Mixed encoding concatenation (cp1252 + UTF-8) | NO single encoding decodes the whole file. UTF-8 errors on cp1252 bytes; cp1252 mojibakes the UTF-8 bytes. Reader should refuse and tell the user the file contains mixed encodings. |

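The Group F pathologies can be reproduced by hand to see what a strict decoder reports. This is a sketch under stated assumptions: the byte strings below are hand-built stand-ins in the style of E28-E31, not the fixture files themselves, and `decode_error` is a hypothetical helper.

```python
def decode_error(raw: bytes, encoding: str = "utf-8"):
    """Return the decoder's error reason, or None if `raw` decodes cleanly."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


# E28-style: 0xC3 opens a two-byte sequence, 0x28 is not a continuation byte.
invalid = b"id,name\n1,caf\xc3\x28\n"

# E29-style: cut the final byte of a two-byte character at end of file.
truncated = "id,name\n1,café".encode("utf-8")[:-1]

# E30-style: UTF-8 BOM glued onto a cp1252 body.
lying_bom = b"\xef\xbb\xbf" + "id,name\n1,café\n".encode("cp1252")

# E31-style: one cp1252 line followed by one UTF-8 line.
mixed = "café\n".encode("cp1252") + "café\n".encode("utf-8")

print(decode_error(invalid))
print(decode_error(truncated))
print(decode_error(lying_bom, "utf-8-sig"))
print(decode_error(mixed), repr(mixed.decode("cp1252")))
```

The last line shows the E31 trap: UTF-8 rejects the file, while cp1252 accepts it and silently mangles the second line.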

---

## 4. Manifest files

### `expected_detection.csv` - ground truth + acceptable detection answers

7 columns:
- `filename`: the encoded test file
- `canonical_content_id`: links to the reference content
- `encoding`: the actual encoding used by the generator (ground truth)
- `has_bom`: whether the file has a BOM
- `byte_length`: file size in bytes
- `expected_detection`: pipe-separated list of detector answers that should be considered correct. Includes fuzzy markers (`AMBIGUOUS`, `UNRELIABLE`, `REJECT`, `LOW_CONFIDENCE`) for cases where any reasonable detector behavior is acceptable.
- `decode_notes`: human-readable explanation of expected behavior

Use this as the primary reference when validating your reader.

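One way to consume the `expected_detection` field is to split it into concrete labels and fuzzy markers, and to compare labels by codec identity so that spellings such as `utf-8` and `utf_8` match. This is a sketch: the function names and the example field value are assumptions, and the real manifest may spell its labels differently.

```python
import codecs

FUZZY_MARKERS = {"AMBIGUOUS", "UNRELIABLE", "REJECT", "LOW_CONFIDENCE"}


def parse_expected_detection(field: str):
    """Split a pipe-separated field into (encoding labels, fuzzy markers)."""
    tokens = [t.strip() for t in field.split("|") if t.strip()]
    labels = [t for t in tokens if t not in FUZZY_MARKERS]
    markers = {t for t in tokens if t in FUZZY_MARKERS}
    return labels, markers


def same_codec(a: str, b: str) -> bool:
    """True if two labels resolve to the same Python codec."""
    try:
        return codecs.lookup(a).name == codecs.lookup(b).name
    except LookupError:
        return False


# Hypothetical field value, for illustration only.
labels, markers = parse_expected_detection("cp1252 | latin-1 | AMBIGUOUS")
print(labels, markers)
print(same_codec("UTF-8", "utf_8"), same_codec("cp1252", "latin-1"))
```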

### `detector_baseline.csv` - what charset-normalizer actually returns

Recorded during fixture generation against the version of `charset-normalizer` installed at that time. 6 columns:
- `filename`, `ground_truth_encoding`, `charset_normalizer_returns`, `cn_aliases`, `cn_language`, `cn_chaos_score`

This is **not authoritative**: different detector versions return different labels. It exists so you can see typical detector output without having to run charset-normalizer yourself, and so you have a baseline to compare against if you're testing a different detector or a newer version.

### `reference/*.utf8.txt` - canonical decoded content

One UTF-8 file per content family. After your reader decodes any encoded file in the corpus and strips any BOM, the result should equal the corresponding reference file's content byte-for-byte.

---

## 5. Observed charset-normalizer behavior

Recorded against `charset-normalizer` 3.x. Some of these are known detector quirks worth understanding before you debug your own code:

### Cases where charset-normalizer is reliably correct

- All UTF-8 files (E01, E02, E10, E13, E16, E19, E21, E23, E25): detected as `utf_8`.
- All UTF-16 with BOM (E07, E08, E12): detected as `utf_16` (loses LE/BE distinction in label, recoverable from BOM).
- E14 (cp1250 Eastern European): correctly detected.
- E17 (cp1251 Cyrillic): correctly detected.
- E20 (Shift_JIS Japanese): returns `cp932` (the MS extended variant; equivalent for this content).
- E22 (GB18030 Chinese): correctly detected.
- E24 (Big5 Chinese traditional): correctly detected.
- E26 (EUC-KR Korean): returns `cp949` (the MS variant; equivalent for this content).
- E27 (ASCII): correctly detected as `ascii`.

### Cases where charset-normalizer mislabels but produces the right decoded content

These return a wrong-sounding name. The decoded text is still correct wherever the returned code page and the true one assign the same character to every byte the content uses:

- **E03, E04, E05** (cp1252, Latin-1, Latin-9 with WESTERN_BASIC content): all returned as `cp1250`. Decoding is correct for every character that cp1250 and the true encoding place at the same byte (é, ü, and ç, for example). The code pages do not agree everywhere, though: 0xF1 is ñ in cp1252 but ń in cp1250, so compare the re-decoded text against the reference file instead of relying on the label.
- **E06** (Mac Roman): returned as `mac_iceland`. Same family, identical for our content.
- **E11** (cp1252 with WESTERN_EXTENDED): returned as `cp1250`. cp1250 matches cp1252 for the euro sign (0x80), the smart quotes (0x91-0x94), and the em-dash (0x97), so those characters survive; the two code pages diverge for many letters in 0xA0-0xFF. Verify that your reader actually re-decodes with the returned label and check the output; don't assume a plausible label means correct content.

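Whether a mislabel is harmless can be checked character by character instead of assumed. The sketch below compares what a returned label would decode against the true text; `differing_chars` is a hypothetical helper and the sample strings are illustrative, not fixture content. It relies on both encodings being single-byte so that positions line up.

```python
def differing_chars(text: str, true_encoding: str, label: str):
    """Encode with the true encoding, decode with the detector's label,
    and list the (expected, actual) character pairs that disagree.
    Assumes both encodings are single-byte."""
    raw = text.encode(true_encoding)
    decoded = raw.decode(label, errors="replace")
    return [(a, b) for a, b in zip(text, decoded) if a != b]


# Characters that cp1250 and cp1252 place at the same byte values.
print(differing_chars("€ é ü ç", "cp1252", "cp1250"))

# A character where the two code pages disagree (byte 0xF1).
print(differing_chars("ñ", "cp1252", "cp1250"))
```

Running this kind of comparison against the reference file is a cheap way to decide whether a surprising label needs a fallback.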

### Cases where charset-normalizer is wrong

- **E15** (ISO-8859-2 Eastern European): returned as `cp1258` (Vietnamese encoding). Wrong family entirely. Probable cause: the chaos heuristic doesn't penalize cp1258 for the byte distribution in the test content.
- **E18** (KOI8-R Cyrillic): returned as `shift_jis_2004` (Japanese!). Bytes in KOI8-R's high-bit range happen to look like valid Shift_JIS multibyte sequences for this content. **High-confidence misdetection**: this is the one to plan a fallback for in your reader.

### Pathological cases

- **E28-E31**: charset-normalizer returns various labels (`cp1257`, `cp1250`, `cp1252`, `cp1250`). For pathological inputs, the *label* is less important than the *behavior*: does your reader detect that something is wrong (low confidence, multiple candidate encodings, or a decode error after detection) and warn the user? The `expected_detection` field accepts any label paired with appropriate warning behavior.

### Implication for your reader

Don't trust charset-normalizer's label blindly. The robust pattern:

1. Run charset-normalizer.
2. Try to decode the entire file with the returned encoding.
3. If decode succeeds, sanity-check the result: does it contain replacement characters (U+FFFD)? Does it contain C1 control characters (U+0080-U+009F), which suggest that cp1252 bytes were decoded as Latin-1?
4. If it fails or smells wrong, try a small candidate set (utf-8, cp1252, latin-1) and pick the one with the cleanest result.
5. When confidence is low, log a warning and let the user override via a `--encoding` flag.

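Steps 2-5 of the pattern above can be sketched with the standard library alone. The detector's answer is passed in as a plain string so the sketch does not depend on any particular detector API. The names `looks_clean` and `decode_with_fallback` are assumptions, and "cleanest result" is simplified here to "first candidate that passes the sanity check".

```python
def looks_clean(text: str) -> bool:
    """Step 3: no U+FFFD and no C1 control characters (U+0080..U+009F)."""
    return "\ufffd" not in text and not any(
        0x80 <= ord(ch) <= 0x9F for ch in text
    )


def decode_with_fallback(raw, detected, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_used, warn).

    `warn` is True when the detector's label was not the one used,
    which is the point where a reader would log a warning (step 5).
    """
    tried = []
    for encoding in (detected, *candidates):
        if encoding is None or encoding in tried:
            continue
        tried.append(encoding)
        try:
            text = raw.decode(encoding)          # step 2 / step 4
        except (UnicodeDecodeError, LookupError):
            continue
        if looks_clean(text):
            return text, encoding, encoding != detected
    raise ValueError(f"no clean decode among {tried}; ask the user for an override")


raw = "café €".encode("cp1252")
print(decode_with_fallback(raw, "cp1252"))
print(decode_with_fallback(raw, "latin-1"))
```

A real reader would probably score candidates instead of taking the first clean one, and would route the `ValueError` to whatever override mechanism it offers.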

---

## 6. Suggested test workflow

```python
import csv
from pathlib import Path

from src.core.io import detect_encoding, read_csv  # your reader

CORPUS = Path("test_data/encodings")

# Load ground-truth manifest (assumed to be UTF-8; newline="" is what the
# csv module expects when it is handed an open file)
with (CORPUS / "expected_detection.csv").open(encoding="utf-8", newline="") as f:
    manifest = list(csv.DictReader(f))

# Load reference content, keyed by content family (e.g. "WESTERN_BASIC")
references = {
    p.stem.replace(".utf8", ""): p.read_text(encoding="utf-8")
    for p in (CORPUS / "reference").glob("*.utf8.txt")
}

# Test 1: detection - your detector returns an acceptable answer
for entry in manifest:
    if entry["canonical_content_id"] in references:  # skip pure pathological
        detected = detect_encoding(CORPUS / entry["filename"])
        acceptable = [e.strip() for e in entry["expected_detection"].split("|")]
        assert detected in acceptable or any(
            marker in entry["expected_detection"]
            for marker in ["AMBIGUOUS", "UNRELIABLE"]
        ), f"{entry['filename']}: detected {detected} not in {acceptable}"

# Test 2: decoded content matches reference
for entry in manifest:
    cid = entry["canonical_content_id"]
    if cid not in references:
        continue  # pathological case
    decoded = read_csv(CORPUS / entry["filename"])
    assert decoded == references[cid], f"{entry['filename']}: content mismatch"

# Test 3: pathological cases produce warnings, not silent corruption
for entry in manifest:
    cid = entry["canonical_content_id"]
    if cid in references:
        continue
    # Reader must either raise a clear error OR succeed with a logged warning
    # The exact behavior is a policy choice; document it and test against it
```

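Test 3 is left as a policy decision in the workflow. One possible shape for it is sketched below, assuming the policy is "raise a typed error or emit a warning". `hypothetical_reader` is a stand-in written for this example, not the project's `read_csv`, and the input bytes are hand-built in the style of E28.

```python
import warnings


def hypothetical_reader(raw: bytes) -> str:
    """Stand-in reader: strict UTF-8 first, then cp1252 with a warning."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        warnings.warn("not valid UTF-8; fell back to cp1252", UserWarning)
        return raw.decode("cp1252")


def fails_informatively(reader, raw: bytes) -> bool:
    """True if the reader raised a clear error or emitted a warning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        try:
            reader(raw)
        except (UnicodeError, ValueError):
            return True
        return len(caught) > 0


def silent_reader(raw: bytes) -> str:
    """Counter-example: hides the problem behind U+FFFD substitution."""
    return raw.decode("utf-8", errors="replace")


bad = b"id,name\n1,caf\xc3\x28\n"
print(fails_informatively(hypothetical_reader, bad))
print(fails_informatively(silent_reader, bad))
```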

---

## 7. What this corpus does NOT cover

Listed so the gaps are explicit:

1. **Big files**. Every fixture is small (under 1 KB). Detection on a 500 MB cp1252 export may behave differently because charset-normalizer samples; if your reader processes giant files, add a separate large-file detection test.
2. **Streaming detection**. Detector is run on the full bytes here. If your reader decodes in chunks (for memory reasons on huge files), encoding-detection-at-stream-start is its own test surface.
3. **Languages with complex scripts not represented here**: Thai, Hebrew, Arabic, Vietnamese (cp1258), Greek (cp1253), Turkish (cp1254). Add per-language fixtures if your buyers use these locales. The generator script is parameterized; adding a new content family is a few-line change.
4. **Extended grapheme handling**. This corpus tests encoding detection and byte-to-string conversion. It does NOT test grapheme-cluster boundaries (multi-codepoint emoji, family ZWJ sequences). Those are the cleaner's territory in the main corpus, case 13.
5. **Encoding errors during WRITE**. The corpus tests reading. If your tool writes output in a non-UTF-8 encoding for any reason, write-side encoding correctness needs separate fixtures.
6. **Filename / path encoding issues**. Some filesystems mangle non-ASCII filenames (older Windows, NFS configs). Out of scope for the cleaner; that's a deployment problem.


---

## 8. How to extend the corpus

Add a new content family:

```python
# In generate_encoding_test_files.py:
THAI = "id,name,city\n1,สมชาย,กรุงเทพ\n..."

# Then add encoding lines:
write_encoded("E32_thai_utf8.csv", "THAI", THAI, "utf-8", ...)
write_encoded("E33_thai_cp874.csv", "THAI", THAI, "cp874", ...)
```

Add reference content to the `references` dict at the bottom of the generator. Re-run the generator. The manifest and detector baseline will refresh automatically.

For a new pathological case: construct the raw bytes by hand and use `write_raw()`. Document the failure mode in the `decode_notes` field.

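Before adding a new family to the generator, it can help to confirm that the content round-trips through each target encoding. The sketch below does that for a Thai sample in a temporary directory. `write_fixture` is a hypothetical helper written for this example and is not the generator's `write_encoded()` or `write_raw()`, whose signatures are not shown here; the sample text is likewise illustrative.

```python
import tempfile
from pathlib import Path

# Illustrative two-line sample (an assumption, not the generator's constant).
THAI_SAMPLE = "id,name,city\n1,สมชาย,กรุงเทพ\n"


def write_fixture(directory, filename: str, text: str, encoding: str) -> Path:
    """Hypothetical helper: write `text` to disk in the given encoding."""
    path = Path(directory) / filename
    path.write_bytes(text.encode(encoding))
    return path


with tempfile.TemporaryDirectory() as tmp:
    utf8_path = write_fixture(tmp, "thai_utf8.csv", THAI_SAMPLE, "utf-8")
    cp874_path = write_fixture(tmp, "thai_cp874.csv", THAI_SAMPLE, "cp874")

    # Both files must decode back to the same reference text.
    utf8_text = utf8_path.read_bytes().decode("utf-8")
    cp874_text = cp874_path.read_bytes().decode("cp874")
    utf8_size = utf8_path.stat().st_size
    cp874_size = cp874_path.stat().st_size

print(utf8_text == cp874_text == THAI_SAMPLE)
print(utf8_size > cp874_size)  # Thai is 3 bytes per char in UTF-8, 1 in cp874
```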

Continue numbering: `E32`, `E33`, etc. Reserve `E9#` if you need a "destructive" subcategory paralleling the malformed CSV corpus.