datatools-dev/test-cases/encodings-corpus/detector_baseline.csv at 82d7fef21e55c03de0a362ffd37c80a7650190c3

Michael 82d7fef21e feat(gate): CSV-normalization gate with confidence-tiered findings

Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.
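
The triage rule described above can be sketched roughly as follows. This is an illustration only: the Finding shape, the severity names, and the triage() and gate_is_open() helpers are hypothetical and are not taken from analyze.py or normalize.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    # Hypothetical shape; the real Finding in analyze.py may differ.
    id: str
    severity: str                 # assumed levels: "error", "warning", "info"
    confidence: str               # "high", "medium", or "low"
    fix_action: Optional[str] = None
    resolved: bool = False
    waived: bool = False


def triage(findings):
    """Split findings into auto-applied fixes and ones needing user review."""
    auto, review = [], []
    for f in findings:
        if f.confidence == "high" and f.fix_action is not None:
            f.resolved = True     # stand-in for actually running the fix
            auto.append(f)
        else:
            review.append(f)
    return auto, review


def gate_is_open(findings):
    """Tool pages stay blocked while any error-level finding is still open."""
    return all(f.resolved or f.waived
               for f in findings if f.severity == "error")
```

In this sketch a waived error-level finding opens the gate just as a resolved one does, which matches the "resolved or waived" wording above.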

Core (src/core/):
  - analyze.py: Finding gains confidence, fix_action, pre_applied; new
    detectors for encoding_uncertain, encoding_decode_failed; new top-
    level encoding_override parameter.
  - fixes.py: registry of fix algorithms keyed by fix_action id.
  - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
    the NormalizationResult / Decision dataclasses the gate consumes.
  - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
    transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
    and normalizes line endings (fixes bare-CR parser crash); empty file
    handled gracefully instead of EmptyDataError traceback.
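
A minimal sketch of the io.py ordering described above (strict UTF-8 first, transcode before NUL-strip, then line-ending normalization). The function names sketch_detect_encoding and sketch_repair_bytes, the BOM table, and the cp1252 fallback are assumptions made for illustration; the real detect_encoding and repair_bytes may work differently, and BOM-less UTF-16 is not handled here.

```python
import codecs

# Longest BOMs first so UTF-32-LE (FF FE 00 00) is not read as UTF-16-LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sketch_detect_encoding(raw, fallback="cp1252"):
    """BOM sniff, then strict UTF-8, then an assumed single-byte fallback."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    try:
        raw.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback


def sketch_repair_bytes(raw):
    """Return UTF-8 bytes with LF line endings and no NUL characters."""
    if not raw:
        return b""                       # empty file: nothing to repair
    text = raw.decode(sketch_detect_encoding(raw), errors="replace")
    # Transcoding happens before the NUL strip: removing 0x00 bytes from raw
    # UTF-16 would destroy every two-byte code unit with a zero byte.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = text.replace("\x00", "")
    return text.encode("utf-8")
```

Stripping NULs only after decoding is the point of the ordering: in UTF-16 the zero bytes are part of the code units, not stray padding.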

GUI (src/gui/):
  - pages/0_Review.py: gate page with per-finding decision controls,
    encoding override picker (16 codepages + custom), and Advanced output
    options (encoding, delimiter, line terminator) on the download.
  - components.py: require_normalization_gate() helper.
  - pages/1-9: gate guard wired on every tool page.

Test corpora:
  - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
    UTF-8 files + manifest, synced from Business/DataTools.
  - test-cases/text-cleaner-corpus/test_data/17: synced malformed input
    (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
  - test_normalize.py (48): finding fields, fix registry, auto_fix scope,
    decision paths, gate idempotency, output-options helper.
  - test_encodings_corpus.py (90, 16 xfailed): parametric detection +
    decode + analyzer-no-crash sweep against the manifest.
  - test_analyze.py: encoding override + encoding_uncertain detectors.
  - test_corpus.py: pre-parse repair in the strict reader.

run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-29 20:35:27 +00:00

2.9 KiB

filename	ground_truth_encoding	charset_normalizer_returns	cn_aliases	cn_language	cn_chaos_score
E01_western_basic_utf8.csv	utf-8	utf_8	u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001	Turkish	0.000
E02_western_basic_utf8bom.csv	utf-8	utf_8	u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001	Turkish	0.000
E03_western_basic_cp1252.csv	cp1252	cp1250	1250, windows_1250	Turkish	0.000
E04_western_basic_latin1.csv	iso-8859-1	cp1250	1250, windows_1250	Turkish	0.000
E05_western_basic_latin9.csv	iso-8859-15	cp1250	1250, windows_1250	Turkish	0.000
E06_western_basic_macroman.csv	mac-roman	mac_iceland	maciceland	Turkish	0.000
E07_western_basic_utf16le.csv	utf-16-le	utf_16	u16, utf16	Turkish	0.000
E08_western_basic_utf16be.csv	utf-16-be	utf_16	u16, utf16	Turkish	0.000
E09_western_basic_utf16le_nobom.csv	utf-16-le	utf_16_le	unicodelittleunmarked, utf_16le	Turkish	0.000
E10_western_extended_utf8.csv	utf-8	utf_8	u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001	French	0.013
E11_western_extended_cp1252.csv	cp1252	cp1250	1250, windows_1250	French	0.013
E12_western_extended_utf16le.csv	utf-16-le	utf_16	u16, utf16	French	0.013
E13_eastern_european_utf8.csv	utf-8	utf_8	u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001	Spanish	0.042
E14_eastern_european_cp1250.csv	cp1250	cp1250	1250, windows_1250	Spanish	0.042
E15_eastern_european_iso88592.csv	iso-8859-2	cp1258	1258, windows_1258	German	0.000
E16_cyrillic_utf8.csv	utf-8	utf_8	u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001	Ukrainian	0.059
E17_cyrillic_cp1251.csv	cp1251	cp1251	1251, windows_1251	Ukrainian	0.059
E18_cyrillic_koi8r.csv	koi8-r	shift_jis_2004	shiftjis2004, sjis_2004, s_jis_2004	Japanese	0.066
E19_japanese_utf8.csv	utf-8	utf_8	u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001	Italian	0.000
E20_japanese_shiftjis.csv	shift_jis	cp932	932, ms932, mskanji, ms_kanji	Japanese	0.000
E21_chinese_simplified_utf8.csv	utf-8	utf_8	u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001	Unknown	0.000
E22_chinese_simplified_gb18030.csv	gb18030	gb18030	gb18030_2000	Chinese	0.000
E23_chinese_traditional_utf8.csv	utf-8	utf_8	u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001	Unknown	0.060
E24_chinese_traditional_big5.csv	big5	big5	big5_tw, csbig5, x_mac_trad_chinese	Chinese	0.060
E25_korean_utf8.csv	utf-8	utf_8	u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001	Unknown	0.000
E26_korean_euckr.csv	euc-kr	cp949	949, ms949, uhc	Korean	0.000
E27_pathological_ascii_only.csv	ascii	ascii	646, ansi_x3.4_1968, ansi_x3_4_1968, ansi_x3.4_1986, cp367, csascii, ibm367, iso646_us, iso_646.irv_1991, iso_ir_6, us, us_ascii	English	0.000
E28_pathological_invalid_utf8.csv	invalid-utf8	cp1257	1257, windows_1257	Croatian	0.000
E29_pathological_truncated_utf8.csv	invalid-utf8-truncated	cp1250	1250, windows_1250	Polish	0.000
E30_pathological_lying_bom.csv	cp1252-with-utf8-bom	cp1252	1252, windows_1252	French	0.013
E31_pathological_mixed_concat.csv	cp1252+utf8-concatenated	cp1250	1250, windows_1250	German	0.000
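
The ground_truth_encoding and charset_normalizer_returns columns spell codec names differently (utf-8 versus utf_8), so comparing them as raw strings would undercount matches. One possible way to compare them, sketched here with the standard-library codecs registry, is to canonicalize both labels first. The canonical() and same_codec() helpers are hypothetical and are not part of the test suite described above.

```python
import codecs


def canonical(label):
    """Return Python's canonical codec name, or None if the label is not a codec."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        # Descriptive labels such as "invalid-utf8" are not real codec names.
        return None


def same_codec(ground_truth, detected):
    """True only when both labels resolve to the same Python codec."""
    a, b = canonical(ground_truth), canonical(detected)
    return a is not None and a == b
```

This is a strict comparison: a BOM-aware answer such as utf_16 does not equal utf-16-le, and cp932 does not equal shift_jis, so a harness that wants to accept near-equivalents would need its own equivalence table on top of this.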
