datatools-dev/test-cases/encodings-corpus/detector_baseline.csv
Michael 82d7fef21e feat(gate): CSV-normalization gate with confidence-tiered findings
Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.
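The tiering described above can be sketched roughly as follows. This is a minimal illustration, not the code in analyze.py or normalize.py: the `Finding` fields `confidence`, `fix_action`, and `pre_applied` come from this commit, while `severity`, `resolved`, `waived`, the helper names `triage` and `gate_is_open`, and the `bom_present` / `strip_bom` / `reencode` identifiers are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Finding:
    code: str
    severity: str                      # assumed values: "error", "warning", "info"
    confidence: str                    # "high", "medium", or "low"
    fix_action: Optional[str] = None   # id of the fix to run, if any
    pre_applied: bool = False          # assumed to mean "auto-applied by the gate"
    resolved: bool = False             # hypothetical bookkeeping field
    waived: bool = False               # hypothetical bookkeeping field


def triage(findings: List[Finding]) -> Tuple[List[Finding], List[Finding]]:
    """Split findings into auto-applied ones and ones left for user review."""
    auto, review = [], []
    for f in findings:
        if f.confidence == "high" and f.fix_action is not None:
            # High-confidence fixes are applied without asking.
            f.pre_applied = True
            f.resolved = True
            auto.append(f)
        else:
            # Medium/low confidence (or no fix available) goes to the user.
            review.append(f)
    return auto, review


def gate_is_open(findings: List[Finding]) -> bool:
    """Tool pages stay blocked while any error-level finding is open."""
    return all(f.resolved or f.waived
               for f in findings if f.severity == "error")


findings = [
    Finding("bom_present", "warning", "high", "strip_bom"),
    Finding("encoding_uncertain", "error", "medium", "reencode"),
]
auto, review = triage(findings)
print([f.code for f in auto], [f.code for f in review], gate_is_open(findings))
```

Waiving the remaining error-level finding (`findings[1].waived = True`) is what would open the gate in this sketch.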

Core (src/core/):
  - analyze.py: Finding gains confidence, fix_action, pre_applied; new
    detectors for encoding_uncertain, encoding_decode_failed; new top-
    level encoding_override parameter.
  - fixes.py: registry of fix algorithms keyed by fix_action id.
  - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
    the NormalizationResult / Decision dataclasses the gate consumes.
  - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
    transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
    and normalizes line endings (fixes bare-CR parser crash); empty file
    handled gracefully instead of EmptyDataError traceback.
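The ordering in the io.py change matters because UTF-16 stores every ASCII character with a 0x00 byte, so stripping NULs before decoding mangles the stream. A rough sketch of that ordering, assuming BOM-based detection of UTF-16/32 only; the function names carry a `_sketch` suffix to make clear they are illustrative and not the real `detect_encoding` / `repair_bytes` signatures:

```python
import codecs
from typing import Optional

# UTF-32 BOMs are checked before UTF-16 because the UTF-32-LE BOM
# (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_encoding_sketch(raw: bytes) -> Optional[str]:
    """Try strict UTF-8 first, then fall back to a BOM check."""
    try:
        raw.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None  # a legacy codepage; detection is out of scope here


def repair_bytes_sketch(raw: bytes) -> bytes:
    """Transcode UTF-16/32 to UTF-8, then strip NULs, then fix line endings."""
    if not raw:
        return b""  # empty input passes through without raising
    enc = detect_encoding_sketch(raw)
    if enc in ("utf-16", "utf-32"):
        # The generic codecs consume the BOM and pick the byte order from it.
        raw = raw.decode(enc, errors="replace").encode("utf-8")
    raw = raw.replace(b"\x00", b"")                         # NUL-strip
    raw = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")  # incl. bare CR
    return raw


sample = codecs.BOM_UTF16_LE + "a,\u00e9\r1,2\r".encode("utf-16-le")
print(repair_bytes_sketch(sample))
```

BOM-less UTF-16 (such as the E09 fixture) is not handled by this sketch and would need a heuristic or an explicit override.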

GUI (src/gui/):
  - pages/0_Review.py: gate page with per-finding decision controls,
    encoding override picker (16 codepages + custom), and Advanced output
    options (encoding, delimiter, line terminator) on the download.
  - components.py: require_normalization_gate() helper.
  - pages/1-9: gate guard wired on every tool page.

Test corpora:
  - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
    UTF-8 files + manifest, synced from Business/DataTools.
  - test-cases/text-cleaner-corpus/test_data/17: synced malformed input
    (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
  - test_normalize.py (48): finding fields, fix registry, auto_fix scope,
    decision paths, gate idempotency, output-options helper.
  - test_encodings_corpus.py (90, 16 xfailed): parametric detection +
    decode + analyzer-no-crash sweep against the manifest.
  - test_analyze.py: encoding override + encoding_uncertain detectors.
  - test_corpus.py: pre-parse repair in the strict reader.

run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:35:27 +00:00

2.9 KiB

filename,ground_truth_encoding,charset_normalizer_returns,cn_aliases,cn_language,cn_chaos_score
E01_western_basic_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Turkish,0.000
E02_western_basic_utf8bom.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Turkish,0.000
E03_western_basic_cp1252.csv,cp1252,cp1250,"1250, windows_1250",Turkish,0.000
E04_western_basic_latin1.csv,iso-8859-1,cp1250,"1250, windows_1250",Turkish,0.000
E05_western_basic_latin9.csv,iso-8859-15,cp1250,"1250, windows_1250",Turkish,0.000
E06_western_basic_macroman.csv,mac-roman,mac_iceland,maciceland,Turkish,0.000
E07_western_basic_utf16le.csv,utf-16-le,utf_16,"u16, utf16",Turkish,0.000
E08_western_basic_utf16be.csv,utf-16-be,utf_16,"u16, utf16",Turkish,0.000
E09_western_basic_utf16le_nobom.csv,utf-16-le,utf_16_le,"unicodelittleunmarked, utf_16le",Turkish,0.000
E10_western_extended_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",French,0.013
E11_western_extended_cp1252.csv,cp1252,cp1250,"1250, windows_1250",French,0.013
E12_western_extended_utf16le.csv,utf-16-le,utf_16,"u16, utf16",French,0.013
E13_eastern_european_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Spanish,0.042
E14_eastern_european_cp1250.csv,cp1250,cp1250,"1250, windows_1250",Spanish,0.042
E15_eastern_european_iso88592.csv,iso-8859-2,cp1258,"1258, windows_1258",German,0.000
E16_cyrillic_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Ukrainian,0.059
E17_cyrillic_cp1251.csv,cp1251,cp1251,"1251, windows_1251",Ukrainian,0.059
E18_cyrillic_koi8r.csv,koi8-r,shift_jis_2004,"shiftjis2004, sjis_2004, s_jis_2004",Japanese,0.066
E19_japanese_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Italian,0.000
E20_japanese_shiftjis.csv,shift_jis,cp932,"932, ms932, mskanji, ms_kanji",Japanese,0.000
E21_chinese_simplified_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Unknown,0.000
E22_chinese_simplified_gb18030.csv,gb18030,gb18030,gb18030_2000,Chinese,0.000
E23_chinese_traditional_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Unknown,0.060
E24_chinese_traditional_big5.csv,big5,big5,"big5_tw, csbig5, x_mac_trad_chinese",Chinese,0.060
E25_korean_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Unknown,0.000
E26_korean_euckr.csv,euc-kr,cp949,"949, ms949, uhc",Korean,0.000
E27_pathological_ascii_only.csv,ascii,ascii,"646, ansi_x3.4_1968, ansi_x3_4_1968, ansi_x3.4_1986, cp367, csascii, ibm367, iso646_us, iso_646.irv_1991, iso_ir_6, us, us_ascii",English,0.000
E28_pathological_invalid_utf8.csv,invalid-utf8,cp1257,"1257, windows_1257",Croatian,0.000
E29_pathological_truncated_utf8.csv,invalid-utf8-truncated,cp1250,"1250, windows_1250",Polish,0.000
E30_pathological_lying_bom.csv,cp1252-with-utf8-bom,cp1252,"1252, windows_1252",French,0.013
E31_pathological_mixed_concat.csv,cp1252+utf8-concatenated,cp1250,"1250, windows_1250",German,0.000
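
One way to read this baseline is to compare the ground-truth column with the detector's answer after normalizing both through Python's codec registry, so that spellings such as `utf-8` and `utf_8` count as the same codec. The sketch below embeds five rows copied from the table above rather than reading the file; the helper names `canonical` and `score` are hypothetical. Pathological labels such as `invalid-utf8` are not real codec names and are set aside rather than scored.

```python
import codecs
import csv
import io

# Five rows copied from the baseline, first three columns only.
SAMPLE = """filename,ground_truth_encoding,charset_normalizer_returns
E01_western_basic_utf8.csv,utf-8,utf_8
E03_western_basic_cp1252.csv,cp1252,cp1250
E17_cyrillic_cp1251.csv,cp1251,cp1251
E18_cyrillic_koi8r.csv,koi8-r,shift_jis_2004
E28_pathological_invalid_utf8.csv,invalid-utf8,cp1257
"""


def canonical(name):
    """Return Python's canonical codec name, or None if it is not a codec."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def score(rows):
    """Bucket filenames into exact hits, misses, and unscoreable labels."""
    hits, misses, skipped = [], [], []
    for row in rows:
        truth = canonical(row["ground_truth_encoding"])
        guess = canonical(row["charset_normalizer_returns"])
        if truth is None:
            skipped.append(row["filename"])   # label is not a codec name
        elif truth == guess:
            hits.append(row["filename"])
        else:
            misses.append(row["filename"])
    return hits, misses, skipped


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
hits, misses, skipped = score(rows)
print(len(hits), "hits,", len(misses), "misses,", len(skipped), "skipped")
```

Exact codec equality is a strict criterion: near-equivalent answers (for example `cp932` for a `shift_jis` file) would count as misses under it, so a real comparison might want an equivalence table on top.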