feat(gate): CSV-normalization gate with confidence-tiered findings

Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence
(high/medium/low) and a fix_action; the gate auto-applies
high-confidence fixes, surfaces medium/low ones for user review, and
blocks tool pages on error-level findings until resolved or waived.

Core (src/core/):
- analyze.py: Finding gains confidence, fix_action, pre_applied; new
  detectors for encoding_uncertain, encoding_decode_failed; new
  top-level encoding_override parameter.
- fixes.py: registry of fix algorithms keyed by fix_action id.
- normalize.py: auto_fix(), apply_decisions(), is_normalized(), and the
  NormalizationResult / Decision dataclasses the gate consumes.
- io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
  transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16
  corruption) and normalizes line endings (fixes bare-CR parser crash);
  empty file handled gracefully instead of EmptyDataError traceback.

GUI (src/gui/):
- pages/0_Review.py: gate page with per-finding decision controls,
  encoding override picker (16 codepages + custom), and Advanced output
  options (encoding, delimiter, line terminator) on the download.
- components.py: require_normalization_gate() helper.
- pages/1-9: gate guard wired on every tool page.

Test corpora:
- test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
  UTF-8 files + manifest, synced from Business/DataTools.
- test-cases/text-cleaner-corpus/test_data/17: synced malformed input
  (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
- test_normalize.py (48): finding fields, fix registry, auto_fix scope,
  decision paths, gate idempotency, output-options helper.
- test_encodings_corpus.py (90, 16 xfailed): parametric detection +
  decode + analyzer-no-crash sweep against the manifest.
- test_analyze.py: encoding override + encoding_uncertain detectors.
- test_corpus.py: pre-parse repair in the strict reader.

run_tests.py: new aliases --tool normalize, --tool encodings,
--tool gate; encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
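
Reviewer note: the confidence-tiered gate described in the message can be pictured roughly as below. This is a minimal sketch, not the code in normalize.py. `FindingSketch`, `triage`, the `"warning"` severity, and the `strip_whitespace` fix id are hypothetical names chosen for illustration. Only the `confidence`, `fix_action`, `pre_applied`, and `severity == "error"` concepts come from the commit itself.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass
class FindingSketch:
    """Reduced, hypothetical stand-in for the analyzer's Finding."""
    id: str
    severity: str                     # "error" appears in the tests; other levels are assumed
    confidence: str                   # "high", "medium", or "low"
    fix_action: Optional[str] = None  # key into the fix registry, if a fix exists
    pre_applied: bool = False


def triage(
    findings: Iterable[FindingSketch],
    waived: Optional[Set[str]] = None,
) -> Tuple[List[FindingSketch], List[FindingSketch], bool]:
    """Split findings into auto-fixable and needs-review, and decide blocking.

    Assumed policy: a finding is auto-applied only when it is high-confidence
    and has a fix_action; everything else goes to the user. Tool pages stay
    blocked while any unwaived error-level finding is still awaiting review.
    """
    waived = waived or set()
    auto: List[FindingSketch] = []
    review: List[FindingSketch] = []
    for f in findings:
        if f.confidence == "high" and f.fix_action:
            auto.append(f)
        else:
            review.append(f)
    blocked = any(f.severity == "error" and f.id not in waived for f in review)
    return auto, review, blocked


findings = [
    FindingSketch("trailing_whitespace", "warning", "high", "strip_whitespace"),
    FindingSketch("encoding_uncertain", "error", "low"),
]
auto, review, blocked = triage(findings)
```

With these two findings, the whitespace fix is applied automatically, while `encoding_uncertain` stays in review and keeps the gate closed until the user resolves or waives it.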
2026-04-29 20:35:27 +00:00
parent e9c490ae1b
commit 82d7fef21e
68 changed files with 2883 additions and 34 deletions
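
Reviewer note: the ordering fix in io.py (transcode UTF-16/32 before stripping NULs, then normalize line endings) matters because UTF-16 encodes every ASCII character with a NUL byte beside it. Stripping NULs first leaves the BOM and any non-ASCII code units behind as invalid UTF-8. The sketch below illustrates that ordering under simplifying assumptions: it only recognizes BOM-marked input, and `repair_bytes_sketch` is a hypothetical name, not the project's `repair_bytes`.

```python
import codecs

# UTF-32 BOMs must be checked before UTF-16 ones: the UTF-32-LE BOM
# (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
_BOM_CODECS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def repair_bytes_sketch(raw: bytes) -> bytes:
    """Hypothetical pre-parse repair: transcode, strip NULs, fix newlines."""
    # 1. Transcode BOM-marked UTF-16/32 to UTF-8. The "utf-16" and "utf-32"
    #    codecs consume the BOM and pick the byte order from it.
    for bom, codec in _BOM_CODECS:
        if raw.startswith(bom):
            raw = raw.decode(codec).encode("utf-8")
            break
    # 2. Only now is it safe to drop stray NUL bytes.
    raw = raw.replace(b"\x00", b"")
    # 3. Normalize CRLF and bare CR to LF so the CSV parser sees one style.
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


utf16_input = "id,name\r1,Zoë\r".encode("utf-16")
repaired = repair_bytes_sketch(utf16_input)
```

For contrast, `utf16_input.replace(b"\x00", b"")` on its own keeps the two BOM bytes and a lone `0xE9` for `ë`, which no longer decodes as UTF-8.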
--- a/tests/test_analyze.py
+++ b/tests/test_analyze.py
@@ -204,6 +204,67 @@ class TestNearDuplicates:
 # Mixed line endings
 # ---------------------------------------------------------------------------

+class TestEncodingUncertainty:
+    def test_replacement_chars_in_data_flagged(self):
+        df = pd.DataFrame({"name": ["Caf\ufffd", "Ber\ufffdin"]})
+        findings = analyze(df)
+        f = next(f for f in findings if f.id == "encoding_uncertain")
+        assert f.severity == "error"
+        assert f.confidence == "low"
+        assert f.count == 2
+
+    def test_replacement_chars_in_header_flagged(self):
+        df = pd.DataFrame({"emai\ufffdl": ["a@x.com"]})
+        findings = analyze(df)
+        ids = {f.id for f in findings}
+        assert "encoding_uncertain" in ids
+
+    def test_clean_data_no_finding(self):
+        df = pd.DataFrame({"name": ["Alice", "Bob"]})
+        findings = analyze(df)
+        assert "encoding_uncertain" not in {f.id for f in findings}
+
+
+class TestEncodingOverride:
+    def test_override_corrects_misdetected_codepage(self, tmp_path):
+        # WESTERN_BASIC bytes encoded as cp1252; charset-normalizer guesses
+        # cp1250, which gets 0xF1 wrong (ń vs ñ).
+        f = tmp_path / "cp1252.csv"
+        f.write_bytes("id,name\n1,España\n".encode("cp1252"))
+
+        from src.core.analyze import _load_for_analysis
+        df_auto, _, _ = _load_for_analysis(f, sample_rows=10)
+        df_overridden, _, _ = _load_for_analysis(
+            f, sample_rows=10, encoding_override="cp1252",
+        )
+        # Override yields the correct character.
+        assert df_overridden["name"].iloc[0] == "España"
+
+    def test_override_propagates_through_top_level_analyze(self, tmp_path):
+        f = tmp_path / "koi8.csv"
+        # KOI8-R Cyrillic; default detection guesses Shift_JIS.
+        f.write_bytes("id,name\n1,Иван\n".encode("koi8-r"))
+        # With the override the analyzer should produce zero findings
+        # against this clean fixture (no mojibake, no U+FFFD).
+        findings = analyze(f, encoding_override="koi8-r")
+        ids = {x.id for x in findings}
+        assert "encoding_uncertain" not in ids
+        assert "encoding_decode_failed" not in ids
+
+
+class TestEncodingDecodeFailedFromRepair:
+    def test_decode_replaced_action_surfaces_error_finding(self, tmp_path):
+        # Create a file with a UTF-8 BOM but cp1252 body bytes — utf-8-sig
+        # fails on byte 0x80 (€ in cp1252).
+        f = tmp_path / "lying_bom.csv"
+        f.write_bytes(b"\xef\xbb\xbfid,name\n1,\x80100\n")
+        findings = analyze(f)
+        ids = {x.id for x in findings}
+        assert "encoding_decode_failed" in ids
+        bad = next(x for x in findings if x.id == "encoding_decode_failed")
+        assert bad.severity == "error"
+
+
 class TestMixedLineEndings:
    def test_crlf_plus_lf_flagged(self, tmp_path):
        f = tmp_path / "mixed.csv"