feat(gate): CSV-normalization gate with confidence-tiered findings

Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.
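
A minimal sketch of the triage described above, not the code in
normalize.py. The Finding shape shown here, the severity values, the
decision strings, and the helper names triage() and gate_is_open() are
assumptions made for illustration; only confidence, fix_action and
pre_applied are named in this message.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Finding:
    # Hypothetical shape: only confidence, fix_action and pre_applied
    # come from the commit message; id and severity are assumed.
    id: str
    severity: str                 # assumed: "error", "warning" or "info"
    confidence: str               # "high", "medium" or "low"
    fix_action: Optional[str] = None
    pre_applied: bool = False


def triage(findings: List[Finding]) -> Tuple[List[Finding], List[Finding]]:
    """Split findings into auto-applied ones and ones needing review."""
    auto, review = [], []
    for f in findings:
        if f.confidence == "high" and f.fix_action is not None:
            f.pre_applied = True   # the gate applies these without asking
            auto.append(f)
        else:
            review.append(f)       # medium/low confidence goes to the user
    return auto, review


def gate_is_open(review: List[Finding], decisions: Dict[str, str]) -> bool:
    """Tool pages stay blocked while an error-level finding is undecided."""
    return all(
        decisions.get(f.id) in ("resolved", "waived")
        for f in review
        if f.severity == "error"
    )
```

With one high-confidence fix and one low-confidence error, the first is
pre-applied and the gate stays closed until the second is resolved or
waived.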

Core (src/core/):
  - analyze.py: Finding gains confidence, fix_action, pre_applied; new
    detectors for encoding_uncertain, encoding_decode_failed; new top-
    level encoding_override parameter.
  - fixes.py: registry of fix algorithms keyed by fix_action id.
  - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
    the NormalizationResult / Decision dataclasses the gate consumes.
  - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
    transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
    and normalizes line endings (fixes bare-CR parser crash); empty file
    handled gracefully instead of EmptyDataError traceback.
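
The ordering fix in io.py can be illustrated with a short sketch. This
is not the project's implementation: repair_bytes_sketch() and
detect_encoding_sketch() are hypothetical names, the BOM-only UTF-16/32
check and the choice of "\n" as the normalized line ending are
assumptions, and the cp1252 fallback merely stands in for whatever
detector the real code calls.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32 LE BOM starts with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def repair_bytes_sketch(raw: bytes) -> bytes:
    """Transcode wide encodings first, then strip NULs and fix newlines."""
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            # The "utf-16"/"utf-32" codecs read the BOM to pick endianness
            # and drop it from the decoded text.
            raw = raw.decode(codec).encode("utf-8")
            break
    # Stripping NULs before transcoding would turn UTF-16 "é" (E9 00 in
    # little-endian) into a lone 0xE9 byte, which is not valid UTF-8.
    raw = raw.replace(b"\x00", b"")
    # Bare CR and CRLF both become LF (assumed target line ending).
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def detect_encoding_sketch(raw: bytes, fallback: str = "cp1252") -> str:
    """Try strict UTF-8 first; only fall back when it fails."""
    try:
        raw.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback
```

An empty input passes through both helpers unchanged, which mirrors the
graceful empty-file handling mentioned above.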

GUI (src/gui/):
  - pages/0_Review.py: gate page with per-finding decision controls,
    encoding override picker (16 codepages + custom), and Advanced output
    options (encoding, delimiter, line terminator) on the download.
  - components.py: require_normalization_gate() helper.
  - pages/1-9: gate guard wired on every tool page.

Test corpora:
  - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
    UTF-8 files + manifest, synced from Business/DataTools.
  - test-cases/text-cleaner-corpus/test_data/17: synced malformed input
    (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
  - test_normalize.py (48): finding fields, fix registry, auto_fix scope,
    decision paths, gate idempotency, output-options helper.
  - test_encodings_corpus.py (90, 16 xfailed): parametric detection +
    decode + analyzer-no-crash sweep against the manifest.
  - test_analyze.py: encoding override + encoding_uncertain detectors.
  - test_corpus.py: pre-parse repair in the strict reader.

run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:35:27 +00:00
parent e9c490ae1b
commit 82d7fef21e
68 changed files with 2883 additions and 34 deletions


@@ -0,0 +1,32 @@
filename,canonical_content_id,encoding,has_bom,byte_length,expected_detection,decode_notes
E01_western_basic_utf8.csv,WESTERN_BASIC,utf-8,no,161,utf_8|utf-8,UTF-8 no BOM. Modern default.
E02_western_basic_utf8bom.csv,WESTERN_BASIC,utf-8,yes,164,utf_8|utf_8_sig|utf-8|utf-8-sig,UTF-8 with BOM. Excel CSV UTF-8 export. BOM must be stripped on read.
E03_western_basic_cp1252.csv,WESTERN_BASIC,cp1252,no,153,cp1252|windows-1252|iso-8859-1|latin-1,"Western single-byte. For this content (no euro, no smart quotes, no em-dash), cp1252 and Latin-1 produce IDENTICAL bytes and decode to the same text. Detector cannot distinguish - any of cp1252/Latin-1/Latin-9 is a correct answer."
E04_western_basic_latin1.csv,WESTERN_BASIC,iso-8859-1,no,153,iso-8859-1|latin-1|cp1252|latin_1,Latin-1. Identical bytes to cp1252 for this content. Detector ambiguity is expected and acceptable.
E05_western_basic_latin9.csv,WESTERN_BASIC,iso-8859-15,no,153,iso-8859-15|latin-9|iso-8859-1|cp1252,"Latin-9. For this content with no euro sign, decodes identically to Latin-1. Detector may pick any."
E06_western_basic_macroman.csv,WESTERN_BASIC,mac-roman,no,153,mac-roman|macroman,"Mac Roman. Different byte values for the accented chars vs cp1252/Latin-1, so this one is distinguishable."
E07_western_basic_utf16le.csv,WESTERN_BASIC,utf-16-le,yes,308,utf-16|utf-16-le|utf_16|utf_16_le,UTF-16 LE with BOM. Excel 'Unicode Text' export.
E08_western_basic_utf16be.csv,WESTERN_BASIC,utf-16-be,yes,308,utf-16|utf-16-be|utf_16|utf_16_be,UTF-16 BE with BOM. Less common but valid.
E09_western_basic_utf16le_nobom.csv,WESTERN_BASIC,utf-16-le,no,306,utf-16|utf-16-le|UNRELIABLE,"UTF-16 LE without BOM. Detection is heuristic and unreliable; bytes look like 'every other byte is null' for ASCII-heavy content, which charset-normalizer may or may not catch. If detector returns wrong encoding here, that is the buyer's responsibility to manually specify - flag in error message."
E10_western_extended_utf8.csv,WESTERN_EXTENDED,utf-8,no,167,utf_8|utf-8,"UTF-8. Has euro, smart quotes, em-dash."
E11_western_extended_cp1252.csv,WESTERN_EXTENDED,cp1252,no,154,cp1252|windows-1252,"cp1252. Content uses 0x80-0x9F range (euro, smart quotes, em-dash). Decoding as Latin-1 produces control characters or replacement chars - this file is the cleanest cp1252-vs-Latin-1 discriminator."
E12_western_extended_utf16le.csv,WESTERN_EXTENDED,utf-16-le,yes,310,utf-16|utf-16-le,UTF-16 LE with BOM. Same content as E10/E11.
E13_eastern_european_utf8.csv,EASTERN_EUROPEAN,utf-8,no,130,utf_8|utf-8,UTF-8 baseline for Czech/Polish/Hungarian/Slovak content.
E14_eastern_european_cp1250.csv,EASTERN_EUROPEAN,cp1250,no,120,cp1250|windows-1250,"cp1250. Decoding as cp1252 produces mojibake (for example, byte 0xB3 is U+0142, the Polish l with stroke, in cp1250 but U+00B3, superscript three, in cp1252). Real distinguishing test."
E15_eastern_european_iso88592.csv,EASTERN_EUROPEAN,iso-8859-2,no,120,iso-8859-2|latin-2|iso8859_2,ISO-8859-2 / Latin-2. Different byte assignments than cp1250 for the same characters.
E16_cyrillic_utf8.csv,CYRILLIC,utf-8,no,118,utf_8|utf-8,UTF-8 baseline for Russian content.
E17_cyrillic_cp1251.csv,CYRILLIC,cp1251,no,72,cp1251|windows-1251,cp1251. The dominant Russian Windows encoding.
E18_cyrillic_koi8r.csv,CYRILLIC,koi8-r,no,72,koi8-r|koi8_r,KOI8-R. Older Unix Russian encoding. Distinct byte patterns from cp1251.
E19_japanese_utf8.csv,JAPANESE,utf-8,no,78,utf_8|utf-8,UTF-8 baseline for Japanese content.
E20_japanese_shiftjis.csv,JAPANESE,shift_jis,no,64,shift_jis|shift-jis|cp932|sjis,Shift_JIS. Excel on Japanese Windows defaults to this. cp932 is Microsoft's extended variant; either name is acceptable.
E21_chinese_simplified_utf8.csv,CHINESE_SIMPLIFIED,utf-8,no,66,utf_8|utf-8,UTF-8 baseline for simplified Chinese.
E22_chinese_simplified_gb18030.csv,CHINESE_SIMPLIFIED,gb18030,no,56,gb18030|gbk|gb2312,GB18030. Mainland China default. GB18030 supersets GBK supersets GB2312; for this content any is acceptable.
E23_chinese_traditional_utf8.csv,CHINESE_TRADITIONAL,utf-8,no,66,utf_8|utf-8,UTF-8 baseline for traditional Chinese.
E24_chinese_traditional_big5.csv,CHINESE_TRADITIONAL,big5,no,56,big5|big5_hkscs|cp950,Big5. Taiwan and Hong Kong default. cp950 is Microsoft's variant.
E25_korean_utf8.csv,KOREAN,utf-8,no,72,utf_8|utf-8,UTF-8 baseline for Korean.
E26_korean_euckr.csv,KOREAN,euc-kr,no,60,euc-kr|euc_kr|cp949,EUC-KR. Korean Windows default. cp949 is the MS variant.
E27_pathological_ascii_only.csv,ASCII_ONLY,ascii,no,66,ascii|utf_8|utf-8|cp1252|iso-8859-1|AMBIGUOUS,"Pure ASCII. Multiple encodings produce identical bytes for this content. Any of ASCII, UTF-8, cp1252, Latin-1 is a correct detection answer because all four decode to the same string. Detector confidence should be high; specific label is interchangeable."
E28_pathological_invalid_utf8.csv,INVALID_UTF8,invalid-utf8,no,67,cp1252|iso-8859-1|REJECT_UTF8,File starts as if UTF-8 but contains an invalid byte sequence (0xC3 0x28). A strict UTF-8 decoder errors. Detector should reject UTF-8 and fall back to a single-byte encoding; cp1252 will produce mojibake but parse without error. Cleaner should warn the user that encoding detection was uncertain.
E29_pathological_truncated_utf8.csv,TRUNCATED_UTF8,invalid-utf8-truncated,no,47,utf_8_with_errors|cp1252|REJECT,"Valid UTF-8 throughout, but the last byte (0xE4) starts a 3-byte sequence that's never completed. Strict UTF-8 decoder errors at EOF. errors='replace' produces \ufffd. Real-world cause: file was truncated by a transfer interruption or a buggy export. Cleaner should treat as corrupt-input error, not silent data loss."
E30_pathological_lying_bom.csv,WESTERN_EXTENDED,cp1252-with-utf8-bom,yes (lying),157,utf_8_FAILS|cp1252|AMBIGUOUS,File has UTF-8 BOM but body is cp1252. UTF-8 decoder will see 0x80 (euro in cp1252) as an invalid UTF-8 continuation byte and error out. Better detectors recover by ignoring the BOM and trying cp1252. Cleaner should warn 'BOM suggested UTF-8 but content decoded as cp1252' so the user knows their file is lying about itself.
E31_pathological_mixed_concat.csv,MIXED_CONCAT,cp1252+utf8-concatenated,no,60,LOW_CONFIDENCE|cp1252|utf_8|REJECT,"First half cp1252, second half UTF-8. No single encoding decodes both halves correctly. UTF-8 decoder errors on row 1. cp1252 decoder produces mojibake on rows 2-3. charset-normalizer detection confidence should be low. Right behavior for the cleaner: refuse to process and tell the user the file contains mixed encodings."
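
One way a test could consume the expected_detection column is sketched
below. It is an illustration, not the logic in test_encodings_corpus.py:
canonical() and detection_ok() are hypothetical helpers, and treating
uppercase tokens such as UNRELIABLE, AMBIGUOUS or REJECT as non-codec
sentinels is an assumption about how the manifest is meant to be read.

```python
import codecs
from typing import Optional


def canonical(name: str) -> Optional[str]:
    """Return Python's canonical codec name, or None if it is not a codec."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None   # sentinel tokens such as UNRELIABLE land here


def detection_ok(detected: str, expected_detection: str) -> bool:
    """True if the detected label matches any pipe-separated alternative."""
    accepted = {canonical(tok) for tok in expected_detection.split("|")}
    accepted.discard(None)        # ignore sentinels and unknown aliases
    return canonical(detected) in accepted
```

Normalizing through codecs.lookup() means a detector that reports
"windows-1252" or "latin_1" matches a row that lists "cp1252" or
"iso-8859-1", so the spelling of the label does not matter.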