feat(gate): CSV-normalization gate with confidence-tiered findings

Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.
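
The tiering described above can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: the field names confidence, fix_action, and pre_applied and the function name auto_fix come from the message, while the code and severity fields, the FIXES entry, the is_blocked helper, and all signatures are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple


@dataclass
class Finding:
    code: str                      # assumed identifier for the finding
    severity: str                  # assumed values: "error", "warning", "info"
    confidence: str                # "high", "medium", or "low"
    fix_action: Optional[str] = None
    pre_applied: bool = False


# Hypothetical registry: fix_action id -> function over the CSV text.
FIXES: Dict[str, Callable[[str], str]] = {
    "strip_trailing_whitespace": lambda text: "\n".join(
        line.rstrip() for line in text.split("\n")
    ),
}


def auto_fix(text: str, findings: List[Finding]) -> Tuple[str, List[Finding]]:
    """Apply high-confidence fixes; return the text and findings left for review."""
    pending = []
    for finding in findings:
        fix = FIXES.get(finding.fix_action) if finding.fix_action else None
        if finding.confidence == "high" and fix is not None:
            text = fix(text)
            finding.pre_applied = True
        else:
            pending.append(finding)
    return text, pending


def is_blocked(pending: List[Finding], waived: FrozenSet[str] = frozenset()) -> bool:
    """Tool pages stay blocked while an unwaived error-level finding remains."""
    return any(f.severity == "error" and f.code not in waived for f in pending)
```

In this sketch a medium- or low-confidence finding is never applied automatically, even when a fix is registered for it; it is only returned for review.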

Core (src/core/):
- analyze.py: Finding gains confidence, fix_action, pre_applied; new
  detectors for encoding_uncertain, encoding_decode_failed; new
  top-level encoding_override parameter.
- fixes.py: registry of fix algorithms keyed by fix_action id.
- normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
  the NormalizationResult / Decision dataclasses the gate consumes.
- io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
  transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
  and normalizes line endings (fixes bare-CR parser crash); empty file
  handled gracefully instead of EmptyDataError traceback.
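
Why the order of operations in the io.py item matters can be shown with a small sketch. It reuses the names detect_encoding and repair_bytes from the message, but the bodies, the BOM table, and the cp1252 fallback are assumptions for illustration; the real detector may consult other signals and also handles BOM-less UTF-16, which this sketch does not.

```python
import codecs

# UTF-32 BOMs are checked before UTF-16 because the UTF-32-LE BOM
# starts with the same two bytes as the UTF-16-LE BOM.
_WIDE_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_encoding(raw: bytes, fallback: str = "cp1252") -> str:
    """Try strict UTF-8 first, then wide-encoding BOMs, then an assumed fallback."""
    try:
        raw.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    for bom, name in _WIDE_BOMS:
        if raw.startswith(bom):
            return name
    return fallback


def repair_bytes(raw: bytes) -> bytes:
    """Transcode wide encodings, then strip NULs, then normalize line endings."""
    encoding = detect_encoding(raw)
    if encoding in ("utf-16", "utf-32"):
        # Stripping NULs first would delete half of every UTF-16 code unit.
        raw = raw.decode(encoding).encode("utf-8")
    raw = raw.replace(b"\x00", b"")
    # CRLF first, so a bare CR is the only CR left to convert.
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

Stripping NUL bytes before transcoding would turn BOM-prefixed UTF-16 text into a stray BOM followed by mangled bytes, which is presumably the corruption the item refers to.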

GUI (src/gui/):
- pages/0_Review.py: gate page with per-finding decision controls,
  encoding override picker (16 codepages + custom), and Advanced output
  options (encoding, delimiter, line terminator) on the download.
- components.py: require_normalization_gate() helper.
- pages/1-9: gate guard wired on every tool page.
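
A guard of the kind wired into the tool pages might look like the sketch below. Only the name require_normalization_gate is taken from the message; the session mapping, its keys (uploaded_csv, is_normalized), and the GateNotPassed exception are hypothetical, and the real helper likely redirects or renders a notice through the GUI framework rather than raising.

```python
from typing import Any, MutableMapping


class GateNotPassed(RuntimeError):
    """Raised when a tool page is opened before the gate has been cleared."""


def require_normalization_gate(session: MutableMapping[str, Any]) -> Any:
    """Return the normalized upload, or refuse if the gate was not passed."""
    data = session.get("uploaded_csv")          # hypothetical session key
    if data is None:
        raise GateNotPassed("Upload a CSV first.")
    if not session.get("is_normalized", False):  # hypothetical session key
        raise GateNotPassed("Resolve or waive findings on the Review page first.")
    return data
```

Each tool page would call the guard as its first statement, so no page-specific logic runs on data that has not been through the gate.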

Test corpora:
- test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
  UTF-8 files + manifest, synced from Business/DataTools.
- test-cases/text-cleaner-corpus/test_data/17: synced malformed input
  (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
- test_normalize.py (48): finding fields, fix registry, auto_fix scope,
  decision paths, gate idempotency, output-options helper.
- test_encodings_corpus.py (90, 16 xfailed): parametric detection +
  decode + analyzer-no-crash sweep against the manifest.
- test_analyze.py: encoding override + encoding_uncertain detectors.
- test_corpus.py: pre-parse repair in the strict reader.

run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
32
test-cases/encodings-corpus/detector_baseline.csv
Normal file
@@ -0,0 +1,32 @@
filename,ground_truth_encoding,charset_normalizer_returns,cn_aliases,cn_language,cn_chaos_score
E01_western_basic_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Turkish,0.000
E02_western_basic_utf8bom.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Turkish,0.000
E03_western_basic_cp1252.csv,cp1252,cp1250,"1250, windows_1250",Turkish,0.000
E04_western_basic_latin1.csv,iso-8859-1,cp1250,"1250, windows_1250",Turkish,0.000
E05_western_basic_latin9.csv,iso-8859-15,cp1250,"1250, windows_1250",Turkish,0.000
E06_western_basic_macroman.csv,mac-roman,mac_iceland,maciceland,Turkish,0.000
E07_western_basic_utf16le.csv,utf-16-le,utf_16,"u16, utf16",Turkish,0.000
E08_western_basic_utf16be.csv,utf-16-be,utf_16,"u16, utf16",Turkish,0.000
E09_western_basic_utf16le_nobom.csv,utf-16-le,utf_16_le,"unicodelittleunmarked, utf_16le",Turkish,0.000
E10_western_extended_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",French,0.013
E11_western_extended_cp1252.csv,cp1252,cp1250,"1250, windows_1250",French,0.013
E12_western_extended_utf16le.csv,utf-16-le,utf_16,"u16, utf16",French,0.013
E13_eastern_european_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Spanish,0.042
E14_eastern_european_cp1250.csv,cp1250,cp1250,"1250, windows_1250",Spanish,0.042
E15_eastern_european_iso88592.csv,iso-8859-2,cp1258,"1258, windows_1258",German,0.000
E16_cyrillic_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Ukrainian,0.059
E17_cyrillic_cp1251.csv,cp1251,cp1251,"1251, windows_1251",Ukrainian,0.059
E18_cyrillic_koi8r.csv,koi8-r,shift_jis_2004,"shiftjis2004, sjis_2004, s_jis_2004",Japanese,0.066
E19_japanese_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Italian,0.000
E20_japanese_shiftjis.csv,shift_jis,cp932,"932, ms932, mskanji, ms_kanji",Japanese,0.000
E21_chinese_simplified_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Unknown,0.000
E22_chinese_simplified_gb18030.csv,gb18030,gb18030,gb18030_2000,Chinese,0.000
E23_chinese_traditional_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Unknown,0.060
E24_chinese_traditional_big5.csv,big5,big5,"big5_tw, csbig5, x_mac_trad_chinese",Chinese,0.060
E25_korean_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Unknown,0.000
E26_korean_euckr.csv,euc-kr,cp949,"949, ms949, uhc",Korean,0.000
E27_pathological_ascii_only.csv,ascii,ascii,"646, ansi_x3.4_1968, ansi_x3_4_1968, ansi_x3.4_1986, cp367, csascii, ibm367, iso646_us, iso_646.irv_1991, iso_ir_6, us, us_ascii",English,0.000
E28_pathological_invalid_utf8.csv,invalid-utf8,cp1257,"1257, windows_1257",Croatian,0.000
E29_pathological_truncated_utf8.csv,invalid-utf8-truncated,cp1250,"1250, windows_1250",Polish,0.000
E30_pathological_lying_bom.csv,cp1252-with-utf8-bom,cp1252,"1252, windows_1252",French,0.013
E31_pathological_mixed_concat.csv,cp1252+utf8-concatenated,cp1250,"1250, windows_1250",German,0.000
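
One way to read this baseline is to compare each ground-truth label with what the detector returned after normalizing both through Python's codec registry. The sketch below is an illustration only and is not part of the commit: the helper names canonical and mismatches are hypothetical, rows whose ground truth is not a real codec name (the pathological labels) are skipped, and the comparison is deliberately strict, so utf-16-le and utf_16 count as different.

```python
import codecs
import csv
import io
from typing import List, Optional, Tuple


def canonical(name: str) -> Optional[str]:
    """Return Python's canonical codec name, or None for non-codec labels."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def mismatches(csv_text: str) -> List[Tuple[str, str, Optional[str]]]:
    """List (filename, truth, detected) for rows where the detector disagrees."""
    result = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        truth = canonical(row["ground_truth_encoding"])
        if truth is None:
            continue  # e.g. "invalid-utf8": no single correct answer exists
        detected = canonical(row["charset_normalizer_returns"])
        if truth != detected:
            result.append((row["filename"], truth, detected))
    return result
```

Passing the file's text to mismatches() yields the fixtures where the baseline detector's answer differs from the ground truth.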