feat(gate): CSV-normalization gate with confidence-tiered findings

Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.
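
The tiering described above can be sketched as follows. This is a minimal
illustration, not the code in src/core/: the Finding fields confidence,
fix_action, and pre_applied come from this commit, while the id and severity
fields, the FIXES entries, and gate_blocked() are hypothetical names chosen
for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Finding:
    id: str                  # hypothetical identifier used for waivers
    severity: str            # assumed levels: "error", "warning", "info"
    confidence: str          # "high", "medium", or "low"
    fix_action: str = ""     # key into the fix registry; empty if no fix
    pre_applied: bool = False


# Hypothetical registry: fix_action id -> function that transforms the text.
FIXES: Dict[str, Callable[[str], str]] = {
    "strip_bom": lambda text: text.lstrip("\ufeff"),
    "normalize_newlines": lambda text: text.replace("\r\n", "\n").replace("\r", "\n"),
}


def auto_fix(text: str, findings: List[Finding]) -> str:
    """Apply only high-confidence fixes and mark them as pre_applied."""
    for f in findings:
        if f.confidence == "high" and f.fix_action in FIXES:
            text = FIXES[f.fix_action](text)
            f.pre_applied = True
    return text


def gate_blocked(findings: List[Finding], waived: Set[str]) -> bool:
    """True while any error-level finding is neither fixed nor waived."""
    return any(
        f.severity == "error" and not f.pre_applied and f.id not in waived
        for f in findings
    )


findings = [
    Finding("f1", "warning", "high", "strip_bom"),
    Finding("f2", "error", "medium"),
]
cleaned = auto_fix("\ufeffa,b\n1,2\n", findings)
```

In this sketch the medium-confidence error f2 keeps the gate closed until
the user waives it, e.g. gate_blocked(findings, {"f2"}) returns False.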

Core (src/core/):
  - analyze.py: Finding gains confidence, fix_action, pre_applied; new
    detectors for encoding_uncertain and encoding_decode_failed; new
    top-level encoding_override parameter.
  - fixes.py: registry of fix algorithms keyed by fix_action id.
  - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
    the NormalizationResult / Decision dataclasses the gate consumes.
  - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
    transcodes UTF-16/32 to UTF-8 before stripping NULs (fixes UTF-16
    corruption) and normalizes line endings (fixes bare-CR parser crash);
    an empty file is now handled gracefully instead of surfacing an
    EmptyDataError traceback.
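
The ordering in the io.py item matters: stripping NUL bytes from UTF-16 data
before transcoding destroys every non-ASCII code unit. A minimal sketch of
that ordering follows, assuming BOM-based UTF-16/32 detection and a cp1252
fallback; both are illustrative assumptions, and the function bodies are not
the actual io.py implementation.

```python
import codecs

# UTF-32 BOMs are listed first because the UTF-32-LE BOM begins with the
# same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_encoding(raw: bytes, fallback: str = "cp1252") -> str:
    """Try strict UTF-8 first, then BOMs, then an assumed fallback."""
    try:
        raw.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return fallback


def repair_bytes(raw: bytes) -> bytes:
    """Transcode UTF-16/32 to UTF-8, then strip NULs, then fix newlines."""
    if not raw:
        return b""  # empty input is returned as-is rather than raising
    for bom, name in _BOMS:
        if raw.startswith(bom):
            # The "utf-16"/"utf-32" codecs consume the BOM while decoding.
            raw = raw.decode(name, errors="replace").encode("utf-8")
            break
    raw = raw.replace(b"\x00", b"")
    # Convert CRLF first so a bare CR pass does not double the newlines.
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


utf16_bare_cr = "a,b\r1,2\r".encode("utf-16")
repaired = repair_bytes(utf16_bare_cr)
```

Here repaired is plain UTF-8 with LF line endings, so a strict CSV parser
sees neither embedded NULs nor bare carriage returns.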

GUI (src/gui/):
  - pages/0_Review.py: gate page with per-finding decision controls,
    encoding override picker (16 codepages + custom), and Advanced output
    options (encoding, delimiter, line terminator) on the download.
  - components.py: require_normalization_gate() helper.
  - pages/1-9: gate guard wired on every tool page.

Test corpora:
  - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
    UTF-8 files + manifest, synced from Business/DataTools.
  - test-cases/text-cleaner-corpus/test_data/17: synced malformed input
    (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
  - test_normalize.py (48): finding fields, fix registry, auto_fix scope,
    decision paths, gate idempotency, output-options helper.
  - test_encodings_corpus.py (90, 16 xfailed): parametric detection +
    decode + analyzer-no-crash sweep against the manifest.
  - test_analyze.py: encoding override + encoding_uncertain detectors.
  - test_corpus.py: pre-parse repair in the strict reader.

run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:35:27 +00:00
parent e9c490ae1b
commit 82d7fef21e
68 changed files with 2883 additions and 34 deletions


@@ -0,0 +1,32 @@
filename,ground_truth_encoding,charset_normalizer_returns,cn_aliases,cn_language,cn_chaos_score
E01_western_basic_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Turkish,0.000
E02_western_basic_utf8bom.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Turkish,0.000
E03_western_basic_cp1252.csv,cp1252,cp1250,"1250, windows_1250",Turkish,0.000
E04_western_basic_latin1.csv,iso-8859-1,cp1250,"1250, windows_1250",Turkish,0.000
E05_western_basic_latin9.csv,iso-8859-15,cp1250,"1250, windows_1250",Turkish,0.000
E06_western_basic_macroman.csv,mac-roman,mac_iceland,maciceland,Turkish,0.000
E07_western_basic_utf16le.csv,utf-16-le,utf_16,"u16, utf16",Turkish,0.000
E08_western_basic_utf16be.csv,utf-16-be,utf_16,"u16, utf16",Turkish,0.000
E09_western_basic_utf16le_nobom.csv,utf-16-le,utf_16_le,"unicodelittleunmarked, utf_16le",Turkish,0.000
E10_western_extended_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",French,0.013
E11_western_extended_cp1252.csv,cp1252,cp1250,"1250, windows_1250",French,0.013
E12_western_extended_utf16le.csv,utf-16-le,utf_16,"u16, utf16",French,0.013
E13_eastern_european_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Spanish,0.042
E14_eastern_european_cp1250.csv,cp1250,cp1250,"1250, windows_1250",Spanish,0.042
E15_eastern_european_iso88592.csv,iso-8859-2,cp1258,"1258, windows_1258",German,0.000
E16_cyrillic_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Ukrainian,0.059
E17_cyrillic_cp1251.csv,cp1251,cp1251,"1251, windows_1251",Ukrainian,0.059
E18_cyrillic_koi8r.csv,koi8-r,shift_jis_2004,"shiftjis2004, sjis_2004, s_jis_2004",Japanese,0.066
E19_japanese_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Italian,0.000
E20_japanese_shiftjis.csv,shift_jis,cp932,"932, ms932, mskanji, ms_kanji",Japanese,0.000
E21_chinese_simplified_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Unknown,0.000
E22_chinese_simplified_gb18030.csv,gb18030,gb18030,gb18030_2000,Chinese,0.000
E23_chinese_traditional_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Unknown,0.060
E24_chinese_traditional_big5.csv,big5,big5,"big5_tw, csbig5, x_mac_trad_chinese",Chinese,0.060
E25_korean_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Unknown,0.000
E26_korean_euckr.csv,euc-kr,cp949,"949, ms949, uhc",Korean,0.000
E27_pathological_ascii_only.csv,ascii,ascii,"646, ansi_x3.4_1968, ansi_x3_4_1968, ansi_x3.4_1986, cp367, csascii, ibm367, iso646_us, iso_646.irv_1991, iso_ir_6, us, us_ascii",English,0.000
E28_pathological_invalid_utf8.csv,invalid-utf8,cp1257,"1257, windows_1257",Croatian,0.000
E29_pathological_truncated_utf8.csv,invalid-utf8-truncated,cp1250,"1250, windows_1250",Polish,0.000
E30_pathological_lying_bom.csv,cp1252-with-utf8-bom,cp1252,"1252, windows_1252",French,0.013
E31_pathological_mixed_concat.csv,cp1252+utf8-concatenated,cp1250,"1250, windows_1250",German,0.000
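
The manifest spells the same codec differently in its two encoding columns
(utf-8 versus utf_8), so a raw string comparison would under-count matches.
One way to compare them is to canonicalize both names with the standard
library's codecs.lookup. The helper names below are hypothetical, and this
is not necessarily how test_encodings_corpus.py scores detection; the sample
rows are copied from the manifest with the trailing columns dropped.

```python
import codecs
import csv
import io
from typing import Optional


def canonical(name: str) -> Optional[str]:
    """Return Python's canonical codec name, or None for non-codec labels."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        # Labels such as "invalid-utf8" describe a fixture, not a codec.
        return None


def detection_matches(row: dict) -> bool:
    """True when the detector's guess names the same codec as ground truth."""
    truth = canonical(row["ground_truth_encoding"])
    guess = canonical(row["charset_normalizer_returns"])
    return truth is not None and truth == guess


SAMPLE = """filename,ground_truth_encoding,charset_normalizer_returns
E01_western_basic_utf8.csv,utf-8,utf_8
E03_western_basic_cp1252.csv,cp1252,cp1250
E28_pathological_invalid_utf8.csv,invalid-utf8,cp1257
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
results = {r["filename"]: detection_matches(r) for r in rows}
```

This strict comparison treats related codecs (for example shift_jis and its
superset cp932) as mismatches; a looser policy would need an explicit
allow-list of acceptable substitutes.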