feat(gate): CSV-normalization gate with confidence-tiered findings

Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.

Core (src/core/):
  - analyze.py: Finding gains confidence, fix_action, pre_applied; new
    detectors for encoding_uncertain, encoding_decode_failed; new top-
    level encoding_override parameter.
  - fixes.py: registry of fix algorithms keyed by fix_action id.
  - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
    the NormalizationResult / Decision dataclasses the gate consumes.
  - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
    transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
    and normalizes line endings (fixes bare-CR parser crash); empty file
    handled gracefully instead of EmptyDataError traceback.

GUI (src/gui/):
  - pages/0_Review.py: gate page with per-finding decision controls,
    encoding override picker (16 codepages + custom), and Advanced output
    options (encoding, delimiter, line terminator) on the download.
  - components.py: require_normalization_gate() helper.
  - pages/1-9: gate guard wired on every tool page.

Test corpora:
  - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
    UTF-8 files + manifest, synced from Business/DataTools.
  - test-cases/text-cleaner-corpus/test_data/17: synced malformed input
    (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
  - test_normalize.py (48): finding fields, fix registry, auto_fix scope,
    decision paths, gate idempotency, output-options helper.
  - test_encodings_corpus.py (90, 16 xfailed): parametric detection +
    decode + analyzer-no-crash sweep against the manifest.
  - test_analyze.py: encoding override + encoding_uncertain detectors.
  - test_corpus.py: pre-parse repair in the strict reader.

run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:35:27 +00:00
parent e9c490ae1b
commit 82d7fef21e
68 changed files with 2883 additions and 34 deletions

View File

@@ -0,0 +1,5 @@
id,name,city,note
1,Alice,New York,plain ASCII
2,Café Müller,Köln,Latin-1 accents
3,Naïve Façade,Zürich,more accents
4,España,Düsseldorf,Spanish n-tilde

View File

@@ -0,0 +1,5 @@
id,name,city,note
1,Alice,New York,plain ASCII
2,Café Müller,Köln,Latin-1 accents
3,Naïve Façade,Zürich,more accents
4,España,Düsseldorf,Spanish n-tilde

View File

@@ -0,0 +1,5 @@
id,name,city,note
1,Alice,New York,plain ASCII
2,Café Müller,Köln,Latin-1 accents
3,Naďve Façade,Zürich,more accents
4,Espańa,Düsseldorf,Spanish n-tilde

View File

@@ -0,0 +1,5 @@
id,name,city,note
1,Alice,New York,plain ASCII
2,Café Müller,Köln,Latin-1 accents
3,Naďve Façade,Zürich,more accents
4,Espańa,Düsseldorf,Spanish n-tilde

View File

@@ -0,0 +1,5 @@
id,name,city,note
1,Alice,New York,plain ASCII
2,Café Müller,Köln,Latin-1 accents
3,Naďve Façade,Zürich,more accents
4,Espańa,Düsseldorf,Spanish n-tilde

View File

@@ -0,0 +1,5 @@
id,name,city,note
1,Alice,New York,plain ASCII
2,CafŽ Mźller,Kšln,Latin-1 accents
3,Na•ve FaŤade,Zźrich,more accents
4,Espaa,Dźsseldorf,Spanish n-tilde
1 id name city note
2 1 Alice New York plain ASCII
3 2 CafŽ Mźller Kšln Latin-1 accents
4 3 Na•ve FaŤade Zźrich more accents
5 4 Espa–a Dźsseldorf Spanish n-tilde

Binary file not shown.
1 id name city note
2 1 Alice New York plain ASCII
3 2 Café Müller Köln Latin-1 accents
4 3 Naïve Façade Zürich more accents
5 4 España Düsseldorf Spanish n-tilde

Binary file not shown.
1 id name city note
2 1 Alice New York plain ASCII
3 2 Café Müller Köln Latin-1 accents
4 3 Naïve Façade Zürich more accents
5 4 España Düsseldorf Spanish n-tilde

Binary file not shown.
1 i�d�,�n�a�m�e�,�c�i�t�y�,�n�o�t�e�
2 �1�,�A�l�i�c�e�,�N�e�w� �Y�o�r�k�,�p�l�a�i�n� �A�S�C�I�I�
3 �2�,�C�a�f�é� �M�ü�l�l�e�r�,�K�ö�l�n�,�L�a�t�i�n�-�1� �a�c�c�e�n�t�s�
4 �3�,�N�a�ï�v�e� �F�a�ç�a�d�e�,�Z�ü�r�i�c�h�,�m�o�r�e� �a�c�c�e�n�t�s�
5 �4�,�E�s�p�a�ñ�a�,�D�ü�s�s�e�l�d�o�r�f�,�S�p�a�n�i�s�h� �n�-�t�i�l�d�e�
6

View File

@@ -0,0 +1,5 @@
id,name,note
1,€100 product,euro sign U+20AC
2,“smart” quotes,curly U+201C and U+201D
3,café — résumé,em-dash U+2014
4,quote’s ok,smart apostrophe U+2019
1 id name note
2 1 €100 product euro sign U+20AC
3 2 “smart” quotes curly U+201C and U+201D
4 3 café — résumé em-dash U+2014
5 4 quote’s ok smart apostrophe U+2019

View File

@@ -0,0 +1,5 @@
id,name,note
1,€100 product,euro sign U+20AC
2,“smart” quotes,curly U+201C and U+201D
3,café — résumé,em-dash U+2014
4,quote’s ok,smart apostrophe U+2019
1 id name note
2 1 €100 product euro sign U+20AC
3 2 “smart” quotes curly U+201C and U+201D
4 3 café — résumé em-dash U+2014
5 4 quote’s ok smart apostrophe U+2019

Binary file not shown.
1 id name note
2 1 €100 product euro sign U+20AC
3 2 “smart” quotes curly U+201C and U+201D
4 3 café — résumé em-dash U+2014
5 4 quote’s ok smart apostrophe U+2019

View File

@@ -0,0 +1,5 @@
id,name,city,language
1,Příliš,Praha,Czech
2,Żółć,Warszawa,Polish
3,Tűrő,Budapest,Hungarian
4,Spaňski,Bratislava,Slovak

View File

@@ -0,0 +1,5 @@
id,name,city,language
1,Příliš,Praha,Czech
2,Żółć,Warszawa,Polish
3,Tűrő,Budapest,Hungarian
4,Spaňski,Bratislava,Slovak

View File

@@ -0,0 +1,5 @@
id,name,city,language
1,Příliš,Praha,Czech
2,Żółć,Warszawa,Polish
3,Tűrő,Budapest,Hungarian
4,Spaňski,Bratislava,Slovak

View File

@@ -0,0 +1,4 @@
id,name,city
1,Иван,Москва
2,Анна,Санкт-Петербург
3,Дмитрий,Новосибирск

View File

@@ -0,0 +1,4 @@
id,name,city
1,Иван,Москва
2,Анна,Санкт-Петербург
3,Дмитрий,Новосибирск

View File

@@ -0,0 +1,4 @@
id,name,city
1,י<EFBFBD><EFBFBD><EFBFBD>,ם<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>
2,ב<EFBFBD><EFBFBD><EFBFBD>,ף<EFBFBD><EFBFBD><EFBFBD><EFBFBD><><D7A0><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>
3,ה<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>,מ<EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>
1 id name city
2 1 י��� ם�����
3 2 ב��� ף����-נ��������
4 3 ה������ מ����������

View File

@@ -0,0 +1,4 @@
id,name,city
1,田中太郎,東京
2,鈴木花子,大阪
3,Alice Smith,横浜

View File

@@ -0,0 +1,4 @@
id,name,city
1,“c¾˜Y,“Œ‹ž
2,—éØ‰ÔŽq,å<EFBFBD>ã
3,Alice Smith,‰¡•l
1 id name city
2 1 “c’†‘¾˜Y “Œ‹ž
3 2 —é–Ø‰ÔŽq ‘å�ã
4 3 Alice Smith ‰¡•l

View File

@@ -0,0 +1,4 @@
id,name,city
1,张三,北京
2,李四,上海
3,Alice Smith,深圳

View File

@@ -0,0 +1,4 @@
id,name,city
1,张三,北京
2,李四,上海
3,Alice Smith,深圳

View File

@@ -0,0 +1,4 @@
id,name,city
1,張三,台北
2,李四,香港
3,Alice Smith,新竹

View File

@@ -0,0 +1,4 @@
id,name,city
1,張三,台北
2,李四,香港
3,Alice Smith,新竹

View File

@@ -0,0 +1,4 @@
id,name,city
1,김철수,서울
2,박영희,부산
3,Alice Smith,인천

View File

@@ -0,0 +1,4 @@
id,name,city
1,김철수,서울
2,박영희,부산
3,Alice Smith,인천

View File

@@ -0,0 +1,4 @@
id,name,city
1,Alice,New York
2,Bob,Chicago
3,Carol,San Francisco

View File

@@ -0,0 +1,4 @@
id,name,city
1,Alice,New York
2,BÃ(b,Chicago
3,Carol,San Francisco

View File

@@ -0,0 +1,4 @@
id,name,city
1,Alice,New York
2,Bob,Chicago
3,<EFBFBD>
1 id,name,city
2 1,Alice,New York
3 2,Bob,Chicago
4 3,

View File

@@ -0,0 +1,5 @@
id,name,note
1,€100 product,euro sign U+20AC
2,“smart” quotes,curly U+201C and U+201D
3,café — résumé,em-dash U+2014
4,quote’s ok,smart apostrophe U+2019
1 id name note
2 1 €100 product euro sign U+20AC
3 2 “smart” quotes curly U+201C and U+201D
4 3 café — résumé em-dash U+2014
5 4 quote’s ok smart apostrophe U+2019

View File

@@ -0,0 +1,4 @@
id,name,city
1,Müller,Köln
2,Müller,Köln
3,Alice,New York

View File

@@ -0,0 +1,284 @@
# ENCODINGS-CASES.md - Code Page / Encoding Test Corpus
**Version**: 1.0
**Last updated**: April 29, 2026
**Companion to**: TEST-CASES.md and QUOTE-CASES.md.
## Why this is a separate corpus
Files 01-23 in the main corpus test the **transformation layer**: given a Python `str` already in memory, what does the cleaner do to it. Encoding tests are about the **I/O layer** that runs *before* the transformation layer ever sees data: given a sequence of bytes on disk, can the reader correctly turn them into a Python `str` in the first place?
These are different failures:
- A transformation bug produces wrong-but-valid output (curly quotes that should have been folded, whitespace that should have been trimmed).
- An I/O bug produces *garbage* (mojibake) or *crashes* the reader entirely. The cleaner never gets to apply any transformation rule because the input never decoded.
Per TECHNICAL.md Section 9, encoding handling lives in `src/core/io.py`, separate from any individual cleaning script. This corpus tests that module.
---
## 1. Layout
```
test_data/encodings/
├── E01_western_basic_utf8.csv ... E26_korean_euckr.csv
├── E27_pathological_ascii_only.csv ... E31_pathological_mixed_concat.csv
├── expected_detection.csv # Manifest: ground truth + acceptable detection
├── detector_baseline.csv # What charset-normalizer actually returns
└── reference/
    ├── WESTERN_BASIC.utf8.txt
    ├── WESTERN_EXTENDED.utf8.txt
    ├── EASTERN_EUROPEAN.utf8.txt
    ├── CYRILLIC.utf8.txt
    ├── JAPANESE.utf8.txt
    ├── CHINESE_SIMPLIFIED.utf8.txt
    ├── CHINESE_TRADITIONAL.utf8.txt
    ├── KOREAN.utf8.txt
    └── ASCII_ONLY.utf8.txt
```
Every encoded file has a `canonical_content_id` linking it to one of the 9 reference files in `reference/`. After correct decoding (and BOM stripping if applicable), the encoded file's text must equal the decoded text of the corresponding reference file exactly; re-encoded as UTF-8, the two are identical byte-for-byte.
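
A minimal sketch of that comparison, assuming nothing about the real reader: `matches_reference` is a hypothetical helper, and the sample string is illustrative rather than taken from the fixtures. It compares decoded text, so a BOM has to be removed first, either by the codec or explicitly.

```python
import codecs

def matches_reference(raw: bytes, encoding: str, reference_utf8: bytes) -> bool:
    """Decode raw with encoding, drop a leading BOM, compare to the reference text."""
    text = raw.decode(encoding)
    # Codecs such as "utf-8-sig" and "utf-16" consume the BOM themselves;
    # "utf-8" and the endian-specific UTF-16 codecs leave it as U+FEFF.
    if text.startswith("\ufeff"):
        text = text[1:]
    return text == reference_utf8.decode("utf-8")

# Illustrative content (hypothetical, not a corpus fixture).
sample = "id,name\n1,Caf\u00e9\n"
reference = sample.encode("utf-8")

print(matches_reference(codecs.BOM_UTF8 + reference, "utf-8", reference))
print(matches_reference(sample.encode("cp1252"), "cp1252", reference))
print(matches_reference(sample.encode("cp1252"), "mac-roman", reference))
```

The last call decodes cp1252 bytes with the wrong codec; it does not raise, which is why the comparison against the reference is needed at all.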
---
## 2. Coverage matrix
The corpus uses 9 distinct content sets, each chosen to exercise a specific encoding family. Cross-coverage is enforced by content design: Cyrillic content cannot be encoded as cp1252; Western extended content (with euro/em-dash) cannot be encoded as Latin-1; etc. Attempting those encodings would either error or substitute, both of which are themselves test cases.
| Content family | What it contains | Encodings covered |
|---|---|---|
| WESTERN_BASIC | ASCII + accented Latin-1 chars (é, ü, ñ, ç) | UTF-8, UTF-8 with BOM, cp1252, ISO-8859-1, ISO-8859-15, Mac Roman, UTF-16 LE/BE with BOM, UTF-16 LE without BOM |
| WESTERN_EXTENDED | Above + euro sign, smart quotes, em-dash | UTF-8, cp1252, UTF-16 LE (NOT Latin-1: chars don't exist there) |
| EASTERN_EUROPEAN | Czech, Polish, Hungarian, Slovak accents | UTF-8, cp1250, ISO-8859-2 |
| CYRILLIC | Russian | UTF-8, cp1251, KOI8-R |
| JAPANESE | Kanji + kana | UTF-8, Shift_JIS |
| CHINESE_SIMPLIFIED | Mainland China characters | UTF-8, GB18030 |
| CHINESE_TRADITIONAL | Taiwan/HK characters | UTF-8, Big5 |
| KOREAN | Hangul | UTF-8, EUC-KR |
| ASCII_ONLY | Pure ASCII | One file; encoding genuinely ambiguous |
---
## 3. Per-file index
### Group A — WESTERN_BASIC (single content, 9 encodings)
This group's purpose is mainly to test detector behavior on the most common Western encodings. Because the content uses only ASCII + Latin-1 characters in the 0xA0+ range, **cp1252 / ISO-8859-1 / ISO-8859-15 produce byte-identical output for this content**. The detector cannot meaningfully distinguish among them; any of them is a correct answer.
| File | Encoding | Notes |
|---|---|---|
| E01 | UTF-8 | Modern default |
| E02 | UTF-8 with BOM | Excel "CSV UTF-8" export. Reader must strip the BOM. |
| E03 | cp1252 | Excel default "CSV" on US/UK/Western Windows |
| E04 | ISO-8859-1 | Latin-1. Identical bytes to cp1252 for this content. |
| E05 | ISO-8859-15 | Latin-9. Identical to Latin-1 here (no euro). |
| E06 | Mac Roman | Different byte mappings; distinguishable |
| E07 | UTF-16 LE with BOM | Excel "Unicode Text" export |
| E08 | UTF-16 BE with BOM | Less common but spec'd |
| E09 | UTF-16 LE without BOM | Detection unreliable; document failure mode |
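
The byte-identity claim for this group is easy to confirm with Python's built-in codecs. The sketch below uses a hypothetical helper, `encode_with`, and an illustrative string containing the same accented letters as the fixtures (not the fixture text itself).

```python
def encode_with(content: str, encodings):
    """Map each encoding name to the bytes it produces for content."""
    return {name: content.encode(name) for name in encodings}

# Illustrative sample: the accented letters used by the WESTERN_BASIC content.
content = "Caf\u00e9 M\u00fcller,K\u00f6ln,Na\u00efve Fa\u00e7ade,Espa\u00f1a"
encoded = encode_with(content, ["cp1252", "latin-1", "iso8859-15", "mac-roman"])

# cp1252, Latin-1, and Latin-9 agree on every byte for this content.
same_bytes = encoded["cp1252"] == encoded["latin-1"] == encoded["iso8859-15"]

# Mac Roman places the same letters at different byte values.
mac_differs = encoded["mac-roman"] != encoded["cp1252"]

print(same_bytes, mac_differs)
```

Because the three single-byte encodings cannot be told apart from the bytes, a test for E03-E05 can only check the decoded text, not the label.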
### Group B — WESTERN_EXTENDED (3 encodings)
This is the cleanest **cp1252-vs-Latin-1 discriminator** in the corpus. The content uses bytes 0x80-0x9F (where cp1252 puts euro, smart quotes, em-dash) — exactly the range Latin-1 leaves undefined. A reader that misidentifies this file as Latin-1 will produce control characters or replacement chars; correct identification as cp1252 yields readable text.
| File | Encoding | Notes |
|---|---|---|
| E10 | UTF-8 | Reference |
| E11 | cp1252 | The discriminator file |
| E12 | UTF-16 LE with BOM | Same content, sanity check |
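
To illustrate the discriminator, the sketch below decodes cp1252 bytes both ways and counts C1 control characters in the result. The helper `c1_controls` and the sample string are assumptions made for the example; Python's `latin-1` codec maps bytes 0x80-0x9F to U+0080-U+009F, so the wrong decode does not raise.

```python
def c1_controls(text: str) -> int:
    """Count C1 control characters (U+0080..U+009F), a sign of cp1252 read as Latin-1."""
    return sum(1 for ch in text if "\u0080" <= ch <= "\u009f")

# Euro sign, curly quotes, and em-dash, written as escapes for clarity.
original = "\u20ac100 \u201csmart\u201d \u2014 ok"
raw = original.encode("cp1252")

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")  # never raises: every byte maps to U+0000..U+00FF

print(c1_controls(as_cp1252), c1_controls(as_latin1))
```

A non-zero count after a Latin-1 decode is a cheap signal that the file was probably cp1252.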
### Group C — EASTERN_EUROPEAN (3 encodings)
| File | Encoding | Notes |
|---|---|---|
| E13 | UTF-8 | Reference |
| E14 | cp1250 | Polish/Czech/Hungarian Windows default |
| E15 | ISO-8859-2 | Latin-2; distinct byte mappings from cp1250 |
### Group D — CYRILLIC (3 encodings)
| File | Encoding | Notes |
|---|---|---|
| E16 | UTF-8 | Reference |
| E17 | cp1251 | Russian Windows default |
| E18 | KOI8-R | Older Russian Unix encoding; distinct bytes from cp1251 |
### Group E — CJK (8 files, 4 languages × 2 encodings each)
| File | Encoding | Notes |
|---|---|---|
| E19 | UTF-8 (Japanese) | Reference |
| E20 | Shift_JIS | Japanese Excel default; cp932 is the MS extended variant |
| E21 | UTF-8 (Chinese simplified) | Reference |
| E22 | GB18030 | Mainland China; supersets GBK and GB2312 |
| E23 | UTF-8 (Chinese traditional) | Reference |
| E24 | Big5 | Taiwan/HK; cp950 is the MS variant |
| E25 | UTF-8 (Korean) | Reference |
| E26 | EUC-KR | Korean Windows default; cp949 is the MS variant |
### Group F — Pathological (5 files)
These are the cases that crash readers, produce silent corruption, or expose the limits of encoding detection. The expected behavior is **that the reader fails informatively**, not that it succeeds.
| File | Pathology | What should happen |
|---|---|---|
| E27 | ASCII only — encoding genuinely ambiguous | Detector picks any of ASCII/UTF-8/cp1252/Latin-1; all decode identically. Test that the reader doesn't OVER-confidently commit to one when ambiguous. |
| E28 | Invalid UTF-8 byte sequence (0xC3 0x28 mid-file) | Strict UTF-8 decode raises UnicodeDecodeError. Reader should fall back to a single-byte encoding and warn, not silently substitute. |
| E29 | Truncated UTF-8 multibyte at EOF | Strict decode raises "unexpected end of data". Reader should error with a clear "file appears truncated" message, not silently produce U+FFFD. |
| E30 | "Lying BOM" — UTF-8 BOM on cp1252 body | utf-8-sig decoder errors at first cp1252 byte in 0x80-0x9F range. Reader should detect the lie and recover by stripping BOM and trying cp1252; warn the user. |
| E31 | Mixed encoding concatenation (cp1252 + UTF-8) | NO single encoding decodes the whole file. UTF-8 errors on cp1252 bytes; cp1252 mojibakes the UTF-8 bytes. Reader should refuse and tell the user the file contains mixed encodings. |
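
The failure modes in the table can be reproduced with a few hand-built byte strings. This is a sketch under stated assumptions: `decode_failure` is a hypothetical helper, and the byte strings imitate E28-E31 rather than copying the fixtures.

```python
import codecs

def decode_failure(raw: bytes, encoding: str = "utf-8"):
    """Return the UnicodeDecodeError reason, or None if raw decodes cleanly."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.reason
    return None

# E28-style: 0xC3 is a two-byte lead, but 0x28 is not a continuation byte.
invalid_mid_file = b"2,B\xc3(b,Chicago\n"

# E29-style: the final multibyte character lost its last byte.
truncated_at_eof = "3,caf\u00e9".encode("utf-8")[:-1]

# E30-style: a UTF-8 BOM in front of a cp1252 body.
lying_bom = codecs.BOM_UTF8 + "\u20ac100".encode("cp1252")

# E31-style: a cp1252 segment followed by a UTF-8 segment.
mixed = "caf\u00e9,".encode("cp1252") + "caf\u00e9\n".encode("utf-8")

print(decode_failure(invalid_mid_file))
print(decode_failure(truncated_at_eof))
print(decode_failure(lying_bom, "utf-8-sig"))
print(decode_failure(mixed), repr(mixed.decode("cp1252")))
```

The mixed case shows why E31 is the hardest: strict UTF-8 fails, while cp1252 succeeds silently and turns the UTF-8 segment into mojibake.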
---
## 4. Manifest files
### `expected_detection.csv` — ground truth + acceptable detection answers
7 columns:
- `filename` — the encoded test file
- `canonical_content_id` — links to the reference content
- `encoding` — the actual encoding used by the generator (ground truth)
- `has_bom` — whether the file has a BOM
- `byte_length` — file size in bytes
- `expected_detection` — pipe-separated list of detector answers that should be considered correct. Includes fuzzy markers (`AMBIGUOUS`, `UNRELIABLE`, `REJECT`, `LOW_CONFIDENCE`) for cases where any reasonable detector behavior is acceptable.
- `decode_notes` — human-readable explanation of expected behavior
Use this as the primary reference when validating your reader.
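
One way to evaluate a detector answer against the `expected_detection` field is sketched below. The helper names are hypothetical, and two policy choices are assumptions: labels are normalized through Python's codec registry so that `utf_8` and `utf-8` compare equal, and the presence of any fuzzy marker makes every label acceptable.

```python
import codecs

FUZZY_MARKERS = {"AMBIGUOUS", "UNRELIABLE", "REJECT", "LOW_CONFIDENCE"}

def canonical(label: str) -> str:
    """Normalize an encoding label via the codec registry when the label is known."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return label.strip().lower()

def is_acceptable(detected: str, expected_field: str) -> bool:
    """Check a detector label against a pipe-separated expected_detection value."""
    tokens = [t.strip() for t in expected_field.split("|") if t.strip()]
    if FUZZY_MARKERS.intersection(tokens):
        return True  # assumption: a fuzzy marker means any label passes
    return canonical(detected) in {canonical(t) for t in tokens}

print(is_acceptable("windows-1252", "cp1252|windows-1252|iso-8859-1|latin-1"))
print(is_acceptable("cp1250", "cp1252|windows-1252|iso-8859-1|latin-1"))
print(is_acceptable("cp1250", "utf-16|utf-16-le|UNRELIABLE"))
```

Normalizing through `codecs.lookup` avoids listing every spelling in the manifest, at the cost of treating registry aliases (for example `latin-1` and `iso-8859-1`) as the same answer.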
### `detector_baseline.csv` — what charset-normalizer actually returns
Recorded during fixture generation against the version of `charset-normalizer` installed at that time. 6 columns:
- `filename`, `ground_truth_encoding`, `charset_normalizer_returns`, `cn_aliases`, `cn_language`, `cn_chaos_score`
This is **not authoritative** — different detector versions return different labels. It exists so you can see typical detector output without having to run charset-normalizer yourself, and so you have a baseline to compare against if you're testing a different detector or a newer version.
### `reference/*.utf8.txt` — canonical decoded content
One UTF-8 file per content family. After your reader decodes any encoded file in the corpus and strips any BOM, the result, re-encoded as UTF-8, should equal the corresponding reference file byte-for-byte.
---
## 5. Observed charset-normalizer behavior
Recorded against `charset-normalizer` 3.x. Some of these are known detector quirks worth understanding before you debug your own code:
### Cases where charset-normalizer is reliably correct
- All UTF-8 files (E01, E02, E10, E13, E16, E19, E21, E23, E25): detected as `utf_8`.
- All UTF-16 with BOM (E07, E08, E12): detected as `utf_16` (loses LE/BE distinction in label, recoverable from BOM).
- E14 (cp1250 Eastern European): correctly detected.
- E17 (cp1251 Cyrillic): correctly detected.
- E20 (Shift_JIS Japanese): returns `cp932` (the MS extended variant; equivalent for this content).
- E22 (GB18030 Chinese): correctly detected.
- E24 (Big5 Chinese traditional): correctly detected.
- E26 (EUC-KR Korean): returns `cp949` (the MS variant; equivalent for this content).
- E27 (ASCII): correctly detected as `ascii`.
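
When the detector reports only `utf_16`, the byte order can be recovered from the BOM. The sketch below uses the BOM constants from the standard `codecs` module; `sniff_bom` is a hypothetical helper, not something the corpus or `src/core/io.py` is known to provide.

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(raw: bytes):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0) if absent."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return None, 0

raw = codecs.BOM_UTF16_BE + "id,name".encode("utf-16-be")
encoding, skip = sniff_bom(raw)
print(encoding, raw[skip:].decode(encoding))
```

Returning the BOM length lets the caller slice it off and decode with the endian-specific codec, which would otherwise keep U+FEFF as the first character.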
### Cases where charset-normalizer mislabels and the decoded content needs checking
These return a wrong-sounding name. Whether the decode is still correct depends on whether the returned encoding agrees with the true one for every byte the content actually uses:
- **E03, E04, E05** (cp1252, Latin-1, Latin-9 with WESTERN_BASIC content): all returned as `cp1250`. cp1250 agrees with the three true encodings for é, ü, ö, and ç, but not for ï (0xEF, which cp1250 maps to ď) or ñ (0xF1, which it maps to ń), so decoding with the returned label yields "Naďve" and "Espańa". The label is wrong and the data is subtly wrong with it.
- **E06** (Mac Roman): returned as `mac_iceland`. Same family, identical for our content.
- **E11** (cp1252 with WESTERN_EXTENDED): returned as `cp1250`. This one happens to decode correctly: cp1250 puts the euro sign at 0x80, the smart quotes at 0x91-0x94, and the em-dash at 0x97, the same positions as cp1252, and é is 0xE9 in both. That is a property of this content, not of the label. Verify that your reader actually re-decodes with the returned label and compares the output against the reference; don't assume that a plausible label means correct content.
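
The effect of a `cp1250` label on cp1252 bytes can be checked directly with Python's built-in code page tables. In this sketch `redecode` is a hypothetical helper and both strings are illustrative samples built from the characters the two content families use.

```python
def redecode(text: str, true_encoding: str, label: str) -> str:
    """Encode with the true encoding, then decode with the detector's label."""
    return text.encode(true_encoding).decode(label)

# Accented letters from the WESTERN_BASIC family.
basic = "Caf\u00e9 M\u00fcller,K\u00f6ln,Na\u00efve Fa\u00e7ade,Espa\u00f1a"

# Euro sign, curly quotes, em-dash, and curly apostrophe from WESTERN_EXTENDED.
extended = "\u20ac100 \u201csmart\u201d caf\u00e9 \u2014 r\u00e9sum\u00e9 quote\u2019s"

basic_via_cp1250 = redecode(basic, "cp1252", "cp1250")
extended_via_cp1250 = redecode(extended, "cp1252", "cp1250")

print(basic_via_cp1250 == basic)        # two letters change
print(extended_via_cp1250 == extended)  # survives unchanged
print(basic_via_cp1250)
```

Neither decode raises and neither produces U+FFFD or control characters, so only a comparison against known-good text catches the substitution in the first sample.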
### Cases where charset-normalizer is wrong
- **E15** (ISO-8859-2 Eastern European): returned as `cp1258` (Vietnamese encoding). Wrong family entirely. Probable cause: the chaos heuristic doesn't penalize cp1258 for the byte distribution in the test content.
- **E18** (KOI8-R Cyrillic): returned as `shift_jis_2004` (Japanese!). Bytes in KOI8-R's high-bit range happen to look like valid Shift_JIS multibyte sequences for this content. **High-confidence misdetection** — this is the one to plan a fallback for in your reader.
### Pathological cases
- **E28-E31**: charset-normalizer returns various labels (`cp1257`, `cp1250`, `cp1252`, `cp1250`). For pathological inputs, the *label* is less important than the *behavior*: does your reader detect that something is wrong (low confidence, multiple candidate encodings, or a decode error after detection) and warn the user? The `expected_detection` field accepts any label paired with appropriate warning behavior.
### Implication for your reader
Don't trust charset-normalizer's label blindly. The robust pattern:
1. Run charset-normalizer.
2. Try to decode the entire file with the returned encoding.
3. If decode succeeds, sanity-check the result: does it contain replacement characters (U+FFFD)? Does it contain control characters in unexpected places (which suggest cp1252-vs-Latin-1 ambiguity decoded wrong)?
4. If it fails or smells wrong, try a small candidate set (utf-8, cp1252, latin-1) and pick the one with the cleanest result.
5. When confidence is low, log a warning and let the user override via a `--encoding` flag.
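
The steps above can be sketched as follows. This is an illustration under assumptions, not the behavior of `src/core/io.py`: the detector's answer is passed in as `guessed` so the example has no third-party dependency, and `suspicion_score` and `decode_with_fallback` are hypothetical names.

```python
def suspicion_score(text: str) -> int:
    """Count replacement characters and C1 controls in decoded text."""
    return sum(1 for ch in text if ch == "\ufffd" or "\u0080" <= ch <= "\u009f")

def decode_with_fallback(raw: bytes, guessed: str,
                         candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding, warned); warned is True if the guess was not used cleanly."""
    best = None
    for position, encoding in enumerate([guessed, *candidates]):
        try:
            text = raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # step 4: move on to the next candidate
        score = suspicion_score(text)
        if score == 0:
            return text, encoding, position > 0
        if best is None or score < best[0]:
            best = (score, text, encoding)
    if best is None:
        raise ValueError("no candidate encoding could decode the input")
    # Every successful decode looked suspicious: return the cleanest and warn.
    return best[1], best[2], True

raw = "\u20ac100 caf\u00e9".encode("cp1252")
text, encoding, warned = decode_with_fallback(raw, "latin-1")
print(encoding, warned)
```

The score only catches replacement characters and control characters. A mislabel that swaps one printable letter for another (cp1250 for cp1252, for example) passes it, which is why step 5's user override remains necessary.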
---
## 6. Suggested test workflow
```python
import csv
from pathlib import Path

from src.core.io import detect_encoding, read_csv  # your reader

CORPUS = Path("test_data/encodings")

# Load ground-truth manifest (explicit encoding so the platform default is not used)
with (CORPUS / "expected_detection.csv").open(encoding="utf-8", newline="") as f:
    manifest = list(csv.DictReader(f))

# Load reference content, keyed by content id (e.g. "WESTERN_BASIC")
references = {
    p.stem.replace(".utf8", ""): p.read_text(encoding="utf-8")
    for p in (CORPUS / "reference").glob("*.utf8.txt")
}

# Test 1: detection - your detector returns an acceptable answer
for entry in manifest:
    if entry["canonical_content_id"] in references:  # skip pure pathological
        detected = detect_encoding(CORPUS / entry["filename"])
        acceptable = [e.strip() for e in entry["expected_detection"].split("|")]
        assert detected in acceptable or any(
            marker in entry["expected_detection"]
            for marker in ["AMBIGUOUS", "UNRELIABLE"]
        ), f"{entry['filename']}: detected {detected} not in {acceptable}"

# Test 2: decoded content matches reference
for entry in manifest:
    cid = entry["canonical_content_id"]
    if cid not in references:
        continue  # pathological case
    decoded = read_csv(CORPUS / entry["filename"])
    assert decoded == references[cid], f"{entry['filename']}: content mismatch"

# Test 3: pathological cases produce warnings, not silent corruption
for entry in manifest:
    cid = entry["canonical_content_id"]
    if cid in references:
        continue
    # Reader must either raise a clear error OR succeed with a logged warning
    # The exact behavior is a policy choice; document it and test against it
```
---
## 7. What this corpus does NOT cover
Listed so the gaps are explicit:
1. **Big files**. Every fixture is small (under 1 KB). Detection on a 500MB cp1252 export may behave differently because charset-normalizer samples; if your reader processes giant files, add a separate large-file detection test.
2. **Streaming detection**. Detector is run on the full bytes here. If your reader decodes in chunks (for memory reasons on huge files), encoding-detection-at-stream-start is its own test surface.
3. **Languages with complex scripts not represented here**: Thai, Hebrew, Arabic, Vietnamese (cp1258), Greek (cp1253), Turkish (cp1254). Add per-language fixtures if your buyers use these locales. The generator script is parameterized; adding a new content family is a few-line change.
4. **Extended grapheme handling**. This corpus tests encoding detection and byte-to-string conversion. It does NOT test grapheme-cluster boundaries (multi-codepoint emoji, family ZWJ sequences). Those are the cleaner's territory in the main corpus, case 13.
5. **Encoding errors during WRITE**. The corpus tests reading. If your tool writes output in a non-UTF-8 encoding for any reason, write-side encoding correctness needs separate fixtures.
6. **Filename / path encoding issues**. Some filesystems mangle non-ASCII filenames (older Windows, NFS configs). Out of scope for the cleaner; that's a deployment problem.
---
## 8. How to extend the corpus
Add a new content family:
```python
# In generate_encoding_test_files.py:
THAI = "id,name,city\n1,สมชาย,กรุงเทพ\n..."
# Then add encoding lines:
write_encoded("E32_thai_utf8.csv", "THAI", THAI, "utf-8", ...)
write_encoded("E33_thai_cp874.csv", "THAI", THAI, "cp874", ...)
```
Add reference content to the `references` dict at the bottom of the generator. Re-run the generator. The manifest and detector baseline will refresh automatically.
For a new pathological case: construct the raw bytes by hand and use `write_raw()`. Document the failure mode in the `decode_notes` field.
Continue numbering: `E32`, `E33`, etc. Reserve `E9#` if you need a "destructive" subcategory paralleling the malformed CSV corpus.

View File

@@ -0,0 +1,32 @@
filename,ground_truth_encoding,charset_normalizer_returns,cn_aliases,cn_language,cn_chaos_score
E01_western_basic_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Turkish,0.000
E02_western_basic_utf8bom.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Turkish,0.000
E03_western_basic_cp1252.csv,cp1252,cp1250,"1250, windows_1250",Turkish,0.000
E04_western_basic_latin1.csv,iso-8859-1,cp1250,"1250, windows_1250",Turkish,0.000
E05_western_basic_latin9.csv,iso-8859-15,cp1250,"1250, windows_1250",Turkish,0.000
E06_western_basic_macroman.csv,mac-roman,mac_iceland,maciceland,Turkish,0.000
E07_western_basic_utf16le.csv,utf-16-le,utf_16,"u16, utf16",Turkish,0.000
E08_western_basic_utf16be.csv,utf-16-be,utf_16,"u16, utf16",Turkish,0.000
E09_western_basic_utf16le_nobom.csv,utf-16-le,utf_16_le,"unicodelittleunmarked, utf_16le",Turkish,0.000
E10_western_extended_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",French,0.013
E11_western_extended_cp1252.csv,cp1252,cp1250,"1250, windows_1250",French,0.013
E12_western_extended_utf16le.csv,utf-16-le,utf_16,"u16, utf16",French,0.013
E13_eastern_european_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Spanish,0.042
E14_eastern_european_cp1250.csv,cp1250,cp1250,"1250, windows_1250",Spanish,0.042
E15_eastern_european_iso88592.csv,iso-8859-2,cp1258,"1258, windows_1258",German,0.000
E16_cyrillic_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Ukrainian,0.059
E17_cyrillic_cp1251.csv,cp1251,cp1251,"1251, windows_1251",Ukrainian,0.059
E18_cyrillic_koi8r.csv,koi8-r,shift_jis_2004,"shiftjis2004, sjis_2004, s_jis_2004",Japanese,0.066
E19_japanese_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Italian,0.000
E20_japanese_shiftjis.csv,shift_jis,cp932,"932, ms932, mskanji, ms_kanji",Japanese,0.000
E21_chinese_simplified_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Unknown,0.000
E22_chinese_simplified_gb18030.csv,gb18030,gb18030,gb18030_2000,Chinese,0.000
E23_chinese_traditional_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Unknown,0.060
E24_chinese_traditional_big5.csv,big5,big5,"big5_tw, csbig5, x_mac_trad_chinese",Chinese,0.060
E25_korean_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Unknown,0.000
E26_korean_euckr.csv,euc-kr,cp949,"949, ms949, uhc",Korean,0.000
E27_pathological_ascii_only.csv,ascii,ascii,"646, ansi_x3.4_1968, ansi_x3_4_1968, ansi_x3.4_1986, cp367, csascii, ibm367, iso646_us, iso_646.irv_1991, iso_ir_6, us, us_ascii",English,0.000
E28_pathological_invalid_utf8.csv,invalid-utf8,cp1257,"1257, windows_1257",Croatian,0.000
E29_pathological_truncated_utf8.csv,invalid-utf8-truncated,cp1250,"1250, windows_1250",Polish,0.000
E30_pathological_lying_bom.csv,cp1252-with-utf8-bom,cp1252,"1252, windows_1252",French,0.013
E31_pathological_mixed_concat.csv,cp1252+utf8-concatenated,cp1250,"1250, windows_1250",German,0.000

View File

@@ -0,0 +1,32 @@
filename,canonical_content_id,encoding,has_bom,byte_length,expected_detection,decode_notes
E01_western_basic_utf8.csv,WESTERN_BASIC,utf-8,no,161,utf_8|utf-8,UTF-8 no BOM. Modern default.
E02_western_basic_utf8bom.csv,WESTERN_BASIC,utf-8,yes,164,utf_8|utf_8_sig|utf-8|utf-8-sig,UTF-8 with BOM. Excel CSV UTF-8 export. BOM must be stripped on read.
E03_western_basic_cp1252.csv,WESTERN_BASIC,cp1252,no,153,cp1252|windows-1252|iso-8859-1|latin-1,"Western single-byte. For this content (no euro, no smart quotes, no em-dash), cp1252 and Latin-1 encode to IDENTICAL bytes. Detector cannot distinguish - any of cp1252/Latin-1/Latin-9 is a correct answer."
E04_western_basic_latin1.csv,WESTERN_BASIC,iso-8859-1,no,153,iso-8859-1|latin-1|cp1252|latin_1,Latin-1. Identical bytes to cp1252 for this content. Detector ambiguity is expected and acceptable.
E05_western_basic_latin9.csv,WESTERN_BASIC,iso-8859-15,no,153,iso-8859-15|latin-9|iso-8859-1|cp1252,"Latin-9. For this content with no euro sign, decodes identically to Latin-1. Detector may pick any."
E06_western_basic_macroman.csv,WESTERN_BASIC,mac-roman,no,153,mac-roman|macroman,"Mac Roman. Different byte values for the accented chars vs cp1252/Latin-1, so this one is distinguishable."
E07_western_basic_utf16le.csv,WESTERN_BASIC,utf-16-le,yes,308,utf-16|utf-16-le|utf_16|utf_16_le,UTF-16 LE with BOM. Excel 'Unicode Text' export.
E08_western_basic_utf16be.csv,WESTERN_BASIC,utf-16-be,yes,308,utf-16|utf-16-be|utf_16|utf_16_be,UTF-16 BE with BOM. Less common but valid.
E09_western_basic_utf16le_nobom.csv,WESTERN_BASIC,utf-16-le,no,306,utf-16|utf-16-le|UNRELIABLE,"UTF-16 LE without BOM. Detection is heuristic and unreliable; bytes look like 'every other byte is null' for ASCII-heavy content, which charset-normalizer may or may not catch. If detector returns wrong encoding here, that is the buyer's responsibility to manually specify - flag in error message."
E10_western_extended_utf8.csv,WESTERN_EXTENDED,utf-8,no,167,utf_8|utf-8,"UTF-8. Has euro, smart quotes, em-dash."
E11_western_extended_cp1252.csv,WESTERN_EXTENDED,cp1252,no,154,cp1252|windows-1252,"cp1252. Content uses 0x80-0x9F range (euro, smart quotes, em-dash). Decoding as Latin-1 produces C1 control characters (U+0080-U+009F) in their place - this file is the cleanest cp1252-vs-Latin-1 discriminator."
E12_western_extended_utf16le.csv,WESTERN_EXTENDED,utf-16-le,yes,310,utf-16|utf-16-le,UTF-16 LE with BOM. Same content as E10/E11.
E13_eastern_european_utf8.csv,EASTERN_EUROPEAN,utf-8,no,130,utf_8|utf-8,UTF-8 baseline for Czech/Polish/Hungarian/Slovak content.
E14_eastern_european_cp1250.csv,EASTERN_EUROPEAN,cp1250,no,120,cp1250|windows-1250,"cp1250. Decoding as cp1252 produces mojibake (e.g. Polish l-with-stroke, U+0142, is byte 0xB3 in cp1250, which cp1252 decodes as superscript three, U+00B3). Real distinguishing test."
E15_eastern_european_iso88592.csv,EASTERN_EUROPEAN,iso-8859-2,no,120,iso-8859-2|latin-2|iso8859_2,ISO-8859-2 / Latin-2. Different byte assignments than cp1250 for the same characters.
E16_cyrillic_utf8.csv,CYRILLIC,utf-8,no,118,utf_8|utf-8,UTF-8 baseline for Russian content.
E17_cyrillic_cp1251.csv,CYRILLIC,cp1251,no,72,cp1251|windows-1251,cp1251. The dominant Russian Windows encoding.
E18_cyrillic_koi8r.csv,CYRILLIC,koi8-r,no,72,koi8-r|koi8_r,KOI8-R. Older Unix Russian encoding. Distinct byte patterns from cp1251.
E19_japanese_utf8.csv,JAPANESE,utf-8,no,78,utf_8|utf-8,UTF-8 baseline for Japanese content.
E20_japanese_shiftjis.csv,JAPANESE,shift_jis,no,64,shift_jis|shift-jis|cp932|sjis,Shift_JIS. Excel on Japanese Windows defaults to this. cp932 is Microsoft's extended variant; either name is acceptable.
E21_chinese_simplified_utf8.csv,CHINESE_SIMPLIFIED,utf-8,no,66,utf_8|utf-8,UTF-8 baseline for simplified Chinese.
E22_chinese_simplified_gb18030.csv,CHINESE_SIMPLIFIED,gb18030,no,56,gb18030|gbk|gb2312,GB18030. Mainland China default. GB18030 is a superset of GBK and GBK is a superset of GB2312; for this content any is acceptable.
E23_chinese_traditional_utf8.csv,CHINESE_TRADITIONAL,utf-8,no,66,utf_8|utf-8,UTF-8 baseline for traditional Chinese.
E24_chinese_traditional_big5.csv,CHINESE_TRADITIONAL,big5,no,56,big5|big5_hkscs|cp950,Big5. Taiwan and Hong Kong default. cp950 is Microsoft's variant.
E25_korean_utf8.csv,KOREAN,utf-8,no,72,utf_8|utf-8,UTF-8 baseline for Korean.
E26_korean_euckr.csv,KOREAN,euc-kr,no,60,euc-kr|euc_kr|cp949,EUC-KR. Korean Windows default. cp949 is the MS variant.
E27_pathological_ascii_only.csv,ASCII_ONLY,ascii,no,66,ascii|utf_8|utf-8|cp1252|iso-8859-1|AMBIGUOUS,"Pure ASCII. Multiple encodings produce identical bytes for this content. Any of ASCII, UTF-8, cp1252, Latin-1 is a correct detection answer because all four decode to the same string. Detector confidence should be high; specific label is interchangeable."
E28_pathological_invalid_utf8.csv,INVALID_UTF8,invalid-utf8,no,67,cp1252|iso-8859-1|REJECT_UTF8,File starts as if UTF-8 but contains an invalid byte sequence (0xC3 0x28). A strict UTF-8 decoder errors. Detector should reject UTF-8 and fall back to a single-byte encoding; cp1252 will produce mojibake but parse without error. Cleaner should warn the user that encoding detection was uncertain.
E29_pathological_truncated_utf8.csv,TRUNCATED_UTF8,invalid-utf8-truncated,no,47,utf_8_with_errors|cp1252|REJECT,"Valid UTF-8 throughout, but the last byte (0xE4) starts a 3-byte sequence that's never completed. Strict UTF-8 decoder errors at EOF. errors='replace' produces \ufffd. Real-world cause: file was truncated by a transfer interruption or a buggy export. Cleaner should treat as corrupt-input error, not silent data loss."
E30_pathological_lying_bom.csv,WESTERN_EXTENDED,cp1252-with-utf8-bom,yes (lying),157,utf_8_FAILS|cp1252|AMBIGUOUS,File has UTF-8 BOM but body is cp1252. A strict UTF-8 decoder will hit 0x80 (euro in cp1252) as a continuation byte with no lead byte and error out. Better detectors recover by ignoring the BOM and trying cp1252. Cleaner should warn 'BOM suggested UTF-8 but content decoded as cp1252' so the user knows their file is lying about itself.
E31_pathological_mixed_concat.csv,MIXED_CONCAT,cp1252+utf8-concatenated,no,60,LOW_CONFIDENCE|cp1252|utf_8|REJECT,"First half cp1252, second half UTF-8. No single encoding decodes both halves correctly. UTF-8 decoder errors on row 1. cp1252 decoder produces mojibake on rows 2-3. charset-normalizer detection confidence should be low. Right behavior for the cleaner: refuse to process and tell the user the file contains mixed encodings."
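
The `expected_detection` column packs several acceptable answers into one pipe-separated field, mixing codec names in various spellings with uppercase markers such as `UNRELIABLE` or `REJECT`. Below is a minimal sketch of how a test might compare a detector's label against that field. The helper names `canonical` and `detection_acceptable` are hypothetical, and treating anything `codecs.lookup` cannot resolve as a non-codec marker is an assumption, not necessarily what test_encodings_corpus.py does.

```python
import codecs
from typing import Optional


def canonical(label: str) -> Optional[str]:
    """Return Python's canonical codec name for `label`, or None if no codec matches."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        # Markers like UNRELIABLE / REJECT (and spellings Python does not know) land here.
        return None


def detection_acceptable(detected: str, expected_field: str) -> bool:
    """True if `detected` names the same codec as any alternative in the manifest field."""
    want = canonical(detected)
    if want is None:
        return False
    accepted = {canonical(alt) for alt in expected_field.split("|")}
    return want in accepted
```

Normalizing through `codecs.lookup` means `utf_8`, `utf-8` and `UTF8` all compare equal, so the manifest would not strictly need to list every spelling. Under this sketch, a detector that labels the KOI8-R fixture as `shift_jis_2004` fails the check, while `cp932` for the Shift_JIS fixture passes because the manifest lists it explicitly.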

@@ -0,0 +1,4 @@
id,name,city
1,Alice,New York
2,Bob,Chicago
3,Carol,San Francisco

@@ -0,0 +1,4 @@
id,name,city
1,张三,北京
2,李四,上海
3,Alice Smith,深圳

@@ -0,0 +1,4 @@
id,name,city
1,張三,台北
2,李四,香港
3,Alice Smith,新竹

@@ -0,0 +1,4 @@
id,name,city
1,Иван,Москва
2,Анна,Санкт-Петербург
3,Дмитрий,Новосибирск

@@ -0,0 +1,5 @@
id,name,city,language
1,Příliš,Praha,Czech
2,Żółć,Warszawa,Polish
3,Tűrő,Budapest,Hungarian
4,Spaňski,Bratislava,Slovak

@@ -0,0 +1,4 @@
id,name,city
1,田中太郎,東京
2,鈴木花子,大阪
3,Alice Smith,横浜

@@ -0,0 +1,4 @@
id,name,city
1,김철수,서울
2,박영희,부산
3,Alice Smith,인천

@@ -0,0 +1,5 @@
id,name,city,note
1,Alice,New York,plain ASCII
2,Café Müller,Köln,Latin-1 accents
3,Naïve Façade,Zürich,more accents
4,España,Düsseldorf,Spanish n-tilde
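
The manifest notes for E03 to E06 say this content encodes to the same bytes in cp1252, Latin-1 and Latin-9, but not in Mac Roman. A quick check with Python's built-in codecs follows, assuming the fixture text is exactly the rows shown above with LF line endings (the real fixtures' line endings are not visible in this diff).

```python
# Rows copied from the reference file above; LF endings are an assumption.
TEXT = (
    "id,name,city,note\n"
    "1,Alice,New York,plain ASCII\n"
    "2,Café Müller,Köln,Latin-1 accents\n"
    "3,Naïve Façade,Zürich,more accents\n"
    "4,España,Düsseldorf,Spanish n-tilde\n"
)

encoded = {
    name: TEXT.encode(name)
    for name in ("cp1252", "latin-1", "iso8859-15", "mac-roman")
}

# The three ISO/Windows code pages agree byte for byte on this content...
same_bytes = encoded["cp1252"] == encoded["latin-1"] == encoded["iso8859-15"]
# ...while Mac Roman places the accented letters at different byte values.
macroman_differs = encoded["mac-roman"] != encoded["cp1252"]

print(same_bytes, macroman_differs)
print(len(encoded["cp1252"]), len(TEXT.encode("utf-8")))
```

Under the LF assumption the two lengths come out as 153 and 161, which matches the `byte_length` values the manifest records for the single-byte and UTF-8 variants of WESTERN_BASIC.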

View File

@@ -0,0 +1,5 @@
id,name,note
1,€100 product,euro sign U+20AC
2,“smart” quotes,curly U+201C and U+201D
3,café — résumé,em-dash U+2014
4,quote’s ok,smart apostrophe U+2019
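
The decode_notes for E11 and E30 both rest on how cp1252 uses the 0x80-0x9F range. The sketch below illustrates that with Python's built-in codecs. It uses a one-line sample built from the characters above rather than the real fixture bytes, and the variable names are illustrative only.

```python
import codecs

sample = "€100 “smart” café — résumé"

# cp1252 maps the euro, curly quotes and em-dash into 0x80-0x9F.
cp1252_bytes = sample.encode("cp1252")
high_range = sorted(b for b in set(cp1252_bytes) if 0x80 <= b <= 0x9F)

# Python's Latin-1 codec never fails to decode, but it turns those bytes
# into C1 control characters instead of the intended punctuation.
as_latin1 = cp1252_bytes.decode("latin-1")
has_c1_controls = any(0x80 <= ord(ch) <= 0x9F for ch in as_latin1)

# A "lying BOM" file: UTF-8 BOM followed by a cp1252 body (the E30 shape).
lying = codecs.BOM_UTF8 + cp1252_bytes
try:
    lying.decode("utf-8-sig")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False

# Recovery as the manifest describes it: drop the BOM, decode the body as cp1252.
recovered = lying[len(codecs.BOM_UTF8):].decode("cp1252")

print([hex(b) for b in high_range], has_c1_controls, strict_utf8_ok)
```

Because the Latin-1 decode succeeds silently, a check for C1 control characters in the decoded text is one possible signal that a file labelled Latin-1 is really cp1252.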