Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.
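
The routing rule above can be sketched as follows. This is a minimal illustration, not the code in normalize.py: the `severity` and `waived` fields, the `triage()` helper, and the example finding codes and fix_action ids are assumptions made for the sketch; only confidence, fix_action, and pre_applied are named in this commit.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Finding:
    code: str
    severity: str                     # assumed levels: "error", "warning", "info"
    confidence: str                   # "high", "medium", or "low"
    fix_action: Optional[str] = None  # id of a fix in the registry, if any
    pre_applied: bool = False
    waived: bool = False              # hypothetical flag set by a user decision


def triage(findings: List[Finding]) -> Tuple[List[Finding], List[Finding], bool]:
    """Split findings into auto-apply and review lists and report blocking."""
    auto_apply: List[Finding] = []
    needs_review: List[Finding] = []
    for f in findings:
        # Only a high-confidence finding that names a fix is applied silently.
        if f.confidence == "high" and f.fix_action is not None:
            auto_apply.append(f)
        else:
            needs_review.append(f)
    # Tool pages stay blocked while any reviewed error-level finding is unwaived.
    blocked = any(f.severity == "error" and not f.waived for f in needs_review)
    return auto_apply, needs_review, blocked


findings = [
    Finding("utf8_bom", "warning", "high", fix_action="strip_bom"),
    Finding("encoding_uncertain", "error", "medium", fix_action="reencode"),
]
auto_apply, needs_review, blocked = triage(findings)
```

In this sketch, waiving the remaining error-level finding and calling `triage()` again clears the block.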
Core (src/core/):
- analyze.py: Finding gains confidence, fix_action, pre_applied; new
detectors for encoding_uncertain, encoding_decode_failed; new top-
level encoding_override parameter.
- fixes.py: registry of fix algorithms keyed by fix_action id.
- normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
the NormalizationResult / Decision dataclasses the gate consumes.
- io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
transcodes UTF-16/32 to UTF-8 before stripping NULs (fixes UTF-16
corruption) and normalizes line endings (fixes a parser crash on bare
CR); an empty file is now handled gracefully instead of raising an
EmptyDataError traceback.
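
Why the order of operations in io.py matters: ASCII text in UTF-16 carries a NUL byte in every code unit, so stripping NULs before transcoding destroys the encoding. The sketch below shows one way the pre-parse repair could be ordered. It is an assumption-laden illustration, not the project's repair_bytes: the names `is_strict_utf8` and `repair_bytes_sketch` are hypothetical, and detection here relies only on BOMs.

```python
import codecs

# UTF-32 BOMs are checked first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def is_strict_utf8(raw: bytes) -> bool:
    """Return True only if the bytes decode as UTF-8 with no errors."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def repair_bytes_sketch(raw: bytes) -> bytes:
    """Transcode BOM-marked UTF-16/32, then strip NULs, then fix line endings."""
    if not raw:
        return b""  # empty input passes through instead of raising
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            # The generic codecs consume the BOM and choose the byte order.
            raw = raw.decode(codec, errors="replace").encode("utf-8")
            break
    raw = raw.replace(b"\x00", b"")  # safe now: input is no longer UTF-16/32
    # Convert CRLF first so a bare CR is not turned into a doubled newline.
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

For example, `repair_bytes_sketch("a,b\r1,2\r".encode("utf-16"))` yields UTF-8 bytes with LF line endings. BOM-less UTF-16 (as in fixture E09) needs a real detector and is out of scope for this sketch.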
GUI (src/gui/):
- pages/0_Review.py: gate page with per-finding decision controls,
encoding override picker (16 codepages + custom), and Advanced output
options (encoding, delimiter, line terminator) on the download.
- components.py: require_normalization_gate() helper.
- pages/1-9: gate guard wired on every tool page.
Test corpora:
- test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
UTF-8 files + manifest, synced from Business/DataTools.
- test-cases/text-cleaner-corpus/test_data/17: synced malformed input
(unquoted $1,500.00) for the unquoted-delimiter detector.
Tests (94 new):
- test_normalize.py (48): finding fields, fix registry, auto_fix scope,
decision paths, gate idempotency, output-options helper.
- test_encodings_corpus.py (90, 16 xfailed): parametric detection +
decode + analyzer-no-crash sweep against the manifest.
- test_analyze.py: encoding override + encoding_uncertain detectors.
- test_corpus.py: pre-parse repair in the strict reader.
run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.
Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.
Suite: 765 passed, 17 xfailed (was 458 passed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| filename | ground_truth_encoding | charset_normalizer_returns | cn_aliases | cn_language | cn_chaos_score |
|---|---|---|---|---|---|
| E01_western_basic_utf8.csv | utf-8 | utf_8 | u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001 | Turkish | 0.000 |
| E02_western_basic_utf8bom.csv | utf-8 | utf_8 | u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001 | Turkish | 0.000 |
| E03_western_basic_cp1252.csv | cp1252 | cp1250 | 1250, windows_1250 | Turkish | 0.000 |
| E04_western_basic_latin1.csv | iso-8859-1 | cp1250 | 1250, windows_1250 | Turkish | 0.000 |
| E05_western_basic_latin9.csv | iso-8859-15 | cp1250 | 1250, windows_1250 | Turkish | 0.000 |
| E06_western_basic_macroman.csv | mac-roman | mac_iceland | maciceland | Turkish | 0.000 |
| E07_western_basic_utf16le.csv | utf-16-le | utf_16 | u16, utf16 | Turkish | 0.000 |
| E08_western_basic_utf16be.csv | utf-16-be | utf_16 | u16, utf16 | Turkish | 0.000 |
| E09_western_basic_utf16le_nobom.csv | utf-16-le | utf_16_le | unicodelittleunmarked, utf_16le | Turkish | 0.000 |
| E10_western_extended_utf8.csv | utf-8 | utf_8 | u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001 | French | 0.013 |
| E11_western_extended_cp1252.csv | cp1252 | cp1250 | 1250, windows_1250 | French | 0.013 |
| E12_western_extended_utf16le.csv | utf-16-le | utf_16 | u16, utf16 | French | 0.013 |
| E13_eastern_european_utf8.csv | utf-8 | utf_8 | u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001 | Spanish | 0.042 |
| E14_eastern_european_cp1250.csv | cp1250 | cp1250 | 1250, windows_1250 | Spanish | 0.042 |
| E15_eastern_european_iso88592.csv | iso-8859-2 | cp1258 | 1258, windows_1258 | German | 0.000 |
| E16_cyrillic_utf8.csv | utf-8 | utf_8 | u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001 | Ukrainian | 0.059 |
| E17_cyrillic_cp1251.csv | cp1251 | cp1251 | 1251, windows_1251 | Ukrainian | 0.059 |
| E18_cyrillic_koi8r.csv | koi8-r | shift_jis_2004 | shiftjis2004, sjis_2004, s_jis_2004 | Japanese | 0.066 |
| E19_japanese_utf8.csv | utf-8 | utf_8 | u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001 | Italian | 0.000 |
| E20_japanese_shiftjis.csv | shift_jis | cp932 | 932, ms932, mskanji, ms_kanji | Japanese | 0.000 |
| E21_chinese_simplified_utf8.csv | utf-8 | utf_8 | u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001 | Unknown | 0.000 |
| E22_chinese_simplified_gb18030.csv | gb18030 | gb18030 | gb18030_2000 | Chinese | 0.000 |
| E23_chinese_traditional_utf8.csv | utf-8 | utf_8 | u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001 | Unknown | 0.060 |
| E24_chinese_traditional_big5.csv | big5 | big5 | big5_tw, csbig5, x_mac_trad_chinese | Chinese | 0.060 |
| E25_korean_utf8.csv | utf-8 | utf_8 | u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001 | Unknown | 0.000 |
| E26_korean_euckr.csv | euc-kr | cp949 | 949, ms949, uhc | Korean | 0.000 |
| E27_pathological_ascii_only.csv | ascii | ascii | 646, ansi_x3.4_1968, ansi_x3_4_1968, ansi_x3.4_1986, cp367, csascii, ibm367, iso646_us, iso_646.irv_1991, iso_ir_6, us, us_ascii | English | 0.000 |
| E28_pathological_invalid_utf8.csv | invalid-utf8 | cp1257 | 1257, windows_1257 | Croatian | 0.000 |
| E29_pathological_truncated_utf8.csv | invalid-utf8-truncated | cp1250 | 1250, windows_1250 | Polish | 0.000 |
| E30_pathological_lying_bom.csv | cp1252-with-utf8-bom | cp1252 | 1252, windows_1252 | French | 0.013 |
| E31_pathological_mixed_concat.csv | cp1252+utf8-concatenated | cp1250 | 1250, windows_1250 | German | 0.000 |
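
The manifest mixes spellings (utf-8 vs utf_8) and uses labels that are not codec names at all (invalid-utf8, cp1252-with-utf8-bom), so a row cannot be scored by plain string comparison. One possible way to compare the ground_truth_encoding and charset_normalizer_returns columns is to resolve both through Python's codec registry. The helper name `same_codec` is hypothetical and is not taken from test_encodings_corpus.py.

```python
import codecs
from typing import Optional


def same_codec(expected: str, detected: str) -> Optional[bool]:
    """Compare two encoding labels by their canonical Python codec name.

    Returns None when the expected label is not a real codec name, which is
    how the pathological rows of the manifest are labelled.
    """
    try:
        want = codecs.lookup(expected).name
    except LookupError:
        return None  # descriptive label, needs a case-specific assertion
    try:
        got = codecs.lookup(detected).name
    except LookupError:
        return False
    return want == got


rows = [
    ("utf-8", "utf_8"),          # E01: same codec, different spelling
    ("cp1252", "cp1250"),        # E03: genuine misdetection
    ("invalid-utf8", "cp1257"),  # E28: label is not a codec
]
results = [same_codec(expected, detected) for expected, detected in rows]
```

Strict name equality is deliberately conservative: it would also count utf-16-le vs utf_16 (E07) as a mismatch even though the generic utf_16 codec reads the byte order from the BOM, so a real test would likely need an equivalence table for such pairs.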