Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.
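
A minimal sketch of the triage described above, assuming a simplified Finding shape. The confidence, fix_action, and pre_applied fields come from this change; the severity and waived fields, the two registry entries, the bom_present finding code, and the signatures of auto_fix and is_blocked are illustrative assumptions, not the code in src/core/.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Finding:
    code: str
    severity: str                      # assumed levels: "error", "warning", "info"
    confidence: str                    # "high", "medium", or "low"
    fix_action: Optional[str] = None   # key into the fix registry
    pre_applied: bool = False          # set once the gate has auto-applied the fix
    waived: bool = False               # hypothetical: user chose to waive the finding


# Hypothetical registry: fix algorithms keyed by fix_action id.
FIXES: Dict[str, Callable[[str], str]] = {
    "strip_bom": lambda text: text.lstrip("\ufeff"),
    "normalize_newlines": lambda text: text.replace("\r\n", "\n").replace("\r", "\n"),
}


def auto_fix(text: str, findings: List[Finding]) -> Tuple[str, List[Finding]]:
    """Apply high-confidence fixes only; return the text and the findings left for review."""
    pending = []
    for finding in findings:
        fix = FIXES.get(finding.fix_action) if finding.fix_action else None
        if finding.confidence == "high" and fix is not None:
            text = fix(text)
            finding.pre_applied = True
        else:
            pending.append(finding)
    return text, pending


def is_blocked(findings: List[Finding]) -> bool:
    """Tool pages stay blocked while any error-level finding is neither fixed nor waived."""
    return any(
        f.severity == "error" and not (f.pre_applied or f.waived) for f in findings
    )
```

In this sketch a medium- or low-confidence finding is never touched automatically; it is only returned for review, which matches the split between auto-applied and user-reviewed fixes described above.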

Core (src/core/):
- analyze.py: Finding gains confidence, fix_action, and pre_applied;
  new detectors for encoding_uncertain and encoding_decode_failed; new
  top-level encoding_override parameter.
- fixes.py: registry of fix algorithms keyed by fix_action id.
- normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
  the NormalizationResult / Decision dataclasses the gate consumes.
- io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
  transcodes UTF-16/32 to UTF-8 before stripping NULs (fixes UTF-16
  corruption) and normalizes line endings (fixes bare-CR parser crash);
  an empty file is now handled gracefully instead of raising an
  EmptyDataError traceback.
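
The ordering in the io.py bullet matters: stripping NUL bytes from raw UTF-16 or UTF-32 input destroys the code units, so transcoding has to come first. The following is a rough sketch of that ordering, not the actual repair_bytes. The BOM-only detection of UTF-16/32 and the cp1252 fallback are assumptions made to keep the example self-contained.

```python
import codecs

# UTF-32 BOMs are checked first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def repair_bytes(raw: bytes) -> bytes:
    """Return UTF-8 bytes with BOM removed, NULs stripped, and LF line endings."""
    if not raw:
        return b""  # empty input is passed through instead of failing later in the parser

    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # Step 1: transcode while the multi-byte code units are still intact.
            text = raw[len(bom):].decode(encoding, errors="replace")
            break
    else:
        try:
            text = raw.decode("utf-8")  # strict UTF-8 first
        except UnicodeDecodeError:
            # Assumed fallback for the sketch; a real detector would be consulted here.
            text = raw.decode("cp1252", errors="replace")

    # Step 2: only now is it safe to drop stray NUL characters.
    text = text.replace("\x00", "")
    # Step 3: CRLF and bare CR both become LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.encode("utf-8")
```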

GUI (src/gui/):
- pages/0_Review.py: gate page with per-finding decision controls,
  encoding override picker (16 codepages + custom), and Advanced output
  options (encoding, delimiter, line terminator) on the download.
- components.py: require_normalization_gate() helper.
- pages/1-9: gate guard wired on every tool page.
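
For illustration, the three Advanced output options map naturally onto the standard csv module. This is a sketch under that assumption; render_download and its parameter names are hypothetical and are not taken from pages/0_Review.py.

```python
import csv
import io
from typing import Iterable, Sequence


def render_download(
    rows: Iterable[Sequence[str]],
    encoding: str = "utf-8",
    delimiter: str = ",",
    line_terminator: str = "\r\n",
) -> bytes:
    """Serialize rows with the chosen delimiter and line terminator, then encode."""
    buffer = io.StringIO(newline="")  # no newline translation; csv.writer controls it
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator=line_terminator)
    writer.writerows(rows)
    # Strict encoding: a character the target codepage cannot represent raises
    # UnicodeEncodeError rather than being silently replaced.
    return buffer.getvalue().encode(encoding, errors="strict")
```

With the defaults, a field that contains the delimiter (such as 1,5) is quoted by csv.writer; switching the delimiter to a semicolon leaves the same field unquoted.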

Test corpora:
- test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
  UTF-8 files + manifest, synced from Business/DataTools.
- test-cases/text-cleaner-corpus/test_data/17: synced malformed input
  (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
- test_normalize.py (48): finding fields, fix registry, auto_fix scope,
  decision paths, gate idempotency, output-options helper.
- test_encodings_corpus.py (90, 16 xfailed): parametric detection +
  decode + analyzer-no-crash sweep against the manifest.
- test_analyze.py: encoding override + encoding_uncertain detectors.
- test_corpus.py: pre-parse repair in the strict reader.
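
One way a detection check against the manifest could work is to compare codec labels after alias normalization, since the expected_detection column lists several spellings of the same codec alongside sentinel tokens such as AMBIGUOUS. The sketch below uses only the standard codecs module; the helper names are hypothetical and this is not the logic in test_encodings_corpus.py.

```python
import codecs
from typing import Optional


def canonical(label: str) -> Optional[str]:
    """Map a codec label to Python's canonical name, or None if it is not a codec."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


def detection_matches(detected: str, expected_detection: str) -> bool:
    """True if the detected label is any codec accepted by a pipe-separated manifest cell."""
    accepted = {canonical(token) for token in expected_detection.split("|")}
    accepted.discard(None)  # sentinel tokens (AMBIGUOUS, REJECT, ...) are not codecs
    return canonical(detected) in accepted
```

Normalizing both sides means utf_8 and UTF-8, or cp1252 and windows-1252, compare equal without the manifest having to enumerate every spelling.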

run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

| filename | canonical_content_id | encoding | has_bom | byte_length | expected_detection | decode_notes |
|---|---|---|---|---|---|---|
| E01_western_basic_utf8.csv | WESTERN_BASIC | utf-8 | no | 161 | utf_8\|utf-8 | UTF-8 no BOM. Modern default. |
| E02_western_basic_utf8bom.csv | WESTERN_BASIC | utf-8 | yes | 164 | utf_8\|utf_8_sig\|utf-8\|utf-8-sig | UTF-8 with BOM. Excel CSV UTF-8 export. BOM must be stripped on read. |
| E03_western_basic_cp1252.csv | WESTERN_BASIC | cp1252 | no | 153 | cp1252\|windows-1252\|iso-8859-1\|latin-1 | Western single-byte. For this content (no euro, no smart quotes, no em-dash), cp1252 and Latin-1 produce IDENTICAL decoded bytes. Detector cannot distinguish - any of cp1252/Latin-1/Latin-9 is a correct answer. |
| E04_western_basic_latin1.csv | WESTERN_BASIC | iso-8859-1 | no | 153 | iso-8859-1\|latin-1\|cp1252\|latin_1 | Latin-1. Identical bytes to cp1252 for this content. Detector ambiguity is expected and acceptable. |
| E05_western_basic_latin9.csv | WESTERN_BASIC | iso-8859-15 | no | 153 | iso-8859-15\|latin-9\|iso-8859-1\|cp1252 | Latin-9. For this content with no euro sign, decodes identically to Latin-1. Detector may pick any. |
| E06_western_basic_macroman.csv | WESTERN_BASIC | mac-roman | no | 153 | mac-roman\|macroman | Mac Roman. Different byte values for the accented chars vs cp1252/Latin-1, so this one is distinguishable. |
| E07_western_basic_utf16le.csv | WESTERN_BASIC | utf-16-le | yes | 308 | utf-16\|utf-16-le\|utf_16\|utf_16_le | UTF-16 LE with BOM. Excel 'Unicode Text' export. |
| E08_western_basic_utf16be.csv | WESTERN_BASIC | utf-16-be | yes | 308 | utf-16\|utf-16-be\|utf_16\|utf_16_be | UTF-16 BE with BOM. Less common but valid. |
| E09_western_basic_utf16le_nobom.csv | WESTERN_BASIC | utf-16-le | no | 306 | utf-16\|utf-16-le\|UNRELIABLE | UTF-16 LE without BOM. Detection is heuristic and unreliable; bytes look like 'every other byte is null' for ASCII-heavy content, which charset-normalizer may or may not catch. If detector returns wrong encoding here, that is the buyer's responsibility to manually specify - flag in error message. |
| E10_western_extended_utf8.csv | WESTERN_EXTENDED | utf-8 | no | 167 | utf_8\|utf-8 | UTF-8. Has euro, smart quotes, em-dash. |
| E11_western_extended_cp1252.csv | WESTERN_EXTENDED | cp1252 | no | 154 | cp1252\|windows-1252 | cp1252. Content uses 0x80-0x9F range (euro, smart quotes, em-dash). Decoding as Latin-1 produces control characters or replacement chars - this file is the cleanest cp1252-vs-Latin-1 discriminator. |
| E12_western_extended_utf16le.csv | WESTERN_EXTENDED | utf-16-le | yes | 310 | utf-16\|utf-16-le | UTF-16 LE with BOM. Same content as E10/E11. |
| E13_eastern_european_utf8.csv | EASTERN_EUROPEAN | utf-8 | no | 130 | utf_8\|utf-8 | UTF-8 baseline for Czech/Polish/Hungarian/Slovak content. |
| E14_eastern_european_cp1250.csv | EASTERN_EUROPEAN | cp1250 | no | 120 | cp1250\|windows-1250 | cp1250. Decoding as cp1252 produces mojibake (Polish slash-l would become U+0142 vs ascii letter, etc.). Real distinguishing test. |
| E15_eastern_european_iso88592.csv | EASTERN_EUROPEAN | iso-8859-2 | no | 120 | iso-8859-2\|latin-2\|iso8859_2 | ISO-8859-2 / Latin-2. Different byte assignments than cp1250 for the same characters. |
| E16_cyrillic_utf8.csv | CYRILLIC | utf-8 | no | 118 | utf_8\|utf-8 | UTF-8 baseline for Russian content. |
| E17_cyrillic_cp1251.csv | CYRILLIC | cp1251 | no | 72 | cp1251\|windows-1251 | cp1251. The dominant Russian Windows encoding. |
| E18_cyrillic_koi8r.csv | CYRILLIC | koi8-r | no | 72 | koi8-r\|koi8_r | KOI8-R. Older Unix Russian encoding. Distinct byte patterns from cp1251. |
| E19_japanese_utf8.csv | JAPANESE | utf-8 | no | 78 | utf_8\|utf-8 | UTF-8 baseline for Japanese content. |
| E20_japanese_shiftjis.csv | JAPANESE | shift_jis | no | 64 | shift_jis\|shift-jis\|cp932\|sjis | Shift_JIS. Excel on Japanese Windows defaults to this. cp932 is Microsoft's extended variant; either name is acceptable. |
| E21_chinese_simplified_utf8.csv | CHINESE_SIMPLIFIED | utf-8 | no | 66 | utf_8\|utf-8 | UTF-8 baseline for simplified Chinese. |
| E22_chinese_simplified_gb18030.csv | CHINESE_SIMPLIFIED | gb18030 | no | 56 | gb18030\|gbk\|gb2312 | GB18030. Mainland China default. GB18030 supersets GBK supersets GB2312; for this content any is acceptable. |
| E23_chinese_traditional_utf8.csv | CHINESE_TRADITIONAL | utf-8 | no | 66 | utf_8\|utf-8 | UTF-8 baseline for traditional Chinese. |
| E24_chinese_traditional_big5.csv | CHINESE_TRADITIONAL | big5 | no | 56 | big5\|big5_hkscs\|cp950 | Big5. Taiwan and Hong Kong default. cp950 is Microsoft's variant. |
| E25_korean_utf8.csv | KOREAN | utf-8 | no | 72 | utf_8\|utf-8 | UTF-8 baseline for Korean. |
| E26_korean_euckr.csv | KOREAN | euc-kr | no | 60 | euc-kr\|euc_kr\|cp949 | EUC-KR. Korean Windows default. cp949 is the MS variant. |
| E27_pathological_ascii_only.csv | ASCII_ONLY | ascii | no | 66 | ascii\|utf_8\|utf-8\|cp1252\|iso-8859-1\|AMBIGUOUS | Pure ASCII. Multiple encodings produce identical bytes for this content. Any of ASCII, UTF-8, cp1252, Latin-1 is a correct detection answer because all four decode to the same string. Detector confidence should be high; specific label is interchangeable. |
| E28_pathological_invalid_utf8.csv | INVALID_UTF8 | invalid-utf8 | no | 67 | cp1252\|iso-8859-1\|REJECT_UTF8 | File starts as if UTF-8 but contains an invalid byte sequence (0xC3 0x28). A strict UTF-8 decoder errors. Detector should reject UTF-8 and fall back to a single-byte encoding; cp1252 will produce mojibake but parse without error. Cleaner should warn the user that encoding detection was uncertain. |
| E29_pathological_truncated_utf8.csv | TRUNCATED_UTF8 | invalid-utf8-truncated | no | 47 | utf_8_with_errors\|cp1252\|REJECT | Valid UTF-8 throughout, but the last byte (0xE4) starts a 3-byte sequence that's never completed. Strict UTF-8 decoder errors at EOF. errors='replace' produces \ufffd. Real-world cause: file was truncated by a transfer interruption or a buggy export. Cleaner should treat as corrupt-input error, not silent data loss. |
| E30_pathological_lying_bom.csv | WESTERN_EXTENDED | cp1252-with-utf8-bom | yes (lying) | 157 | utf_8_FAILS\|cp1252\|AMBIGUOUS | File has UTF-8 BOM but body is cp1252. UTF-8 decoder will see 0x80 (euro in cp1252) as an invalid UTF-8 continuation byte and error out. Better detectors recover by ignoring the BOM and trying cp1252. Cleaner should warn 'BOM suggested UTF-8 but content decoded as cp1252' so the user knows their file is lying about itself. |
| E31_pathological_mixed_concat.csv | MIXED_CONCAT | cp1252+utf8-concatenated | no | 60 | LOW_CONFIDENCE\|cp1252\|utf_8\|REJECT | First half cp1252, second half UTF-8. No single encoding decodes both halves correctly. UTF-8 decoder errors on row 1. cp1252 decoder produces mojibake on rows 2-3. charset-normalizer detection confidence should be low. Right behavior for the cleaner: refuse to process and tell the user the file contains mixed encodings. |
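
The pathological rows E28 to E30 describe three different ways a file can fail strict UTF-8 decoding. As a rough illustration, an incremental decoder can tell a truncated trailing sequence apart from an invalid byte in mid-stream. The function name and the returned labels are hypothetical, and the byte strings in the usage note are small stand-ins rather than the corpus fixtures.

```python
import codecs


def classify_utf8(raw: bytes) -> str:
    """Classify input as 'valid', 'invalid', 'truncated', or 'lying_bom'."""
    has_bom = raw.startswith(codecs.BOM_UTF8)
    body = raw[len(codecs.BOM_UTF8):] if has_bom else raw

    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates an incomplete sequence at the very end,
        # so an error here means a bad byte somewhere in the stream.
        decoder.decode(body, final=False)
    except UnicodeDecodeError:
        return "lying_bom" if has_bom else "invalid"

    try:
        # Flushing with final=True fails only if bytes are still buffered,
        # i.e. the input stopped partway through a multi-byte sequence.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return "truncated"
    return "valid"
```

For example, a body containing 0xC3 0x28 classifies as invalid, a body ending in a lone 0xE4 as truncated, and a UTF-8 BOM followed by a cp1252 euro byte (0x80) as lying_bom. A mixed-encoding file like E31 would also land in invalid here, so separating it out would need a further check.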