feat(gate): CSV-normalization gate with confidence-tiered findings

Adds a Review & Normalize page that sits between upload and every tool page. The analyzer now tags each finding with confidence (high/medium/low) and a fix_action; the gate auto-applies high-confidence fixes, surfaces medium/low ones for user review, and blocks tool pages on error-level findings until resolved or waived. Core (src/core/): - analyze.py: Finding gains confidence, fix_action, pre_applied; new detectors for encoding_uncertain, encoding_decode_failed; new top- level encoding_override parameter. - fixes.py: registry of fix algorithms keyed by fix_action id. - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and the NormalizationResult / Decision dataclasses the gate consumes. - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption) and normalizes line endings (fixes bare-CR parser crash); empty file handled gracefully instead of EmptyDataError traceback. GUI (src/gui/): - pages/0_Review.py: gate page with per-finding decision controls, encoding override picker (16 codepages + custom), and Advanced output options (encoding, delimiter, line terminator) on the download. - components.py: require_normalization_gate() helper. - pages/1-9: gate guard wired on every tool page. Test corpora: - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference UTF-8 files + manifest, synced from Business/DataTools. - test-cases/text-cleaner-corpus/test_data/17: synced malformed input (unquoted $1,500.00) for the unquoted-delimiter detector. Tests (94 new): - test_normalize.py (48): finding fields, fix registry, auto_fix scope, decision paths, gate idempotency, output-options helper. - test_encodings_corpus.py (90, 16 xfailed): parametric detection + decode + analyzer-no-crash sweep against the manifest. - test_analyze.py: encoding override + encoding_uncertain detectors. - test_corpus.py: pre-parse repair in the strict reader. run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate; encodings corpus added to --fixtures category. Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema, gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds the analyzer JSON schema with the new fields; README links to all of it. Suite: 765 passed, 17 xfailed (was 458 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:35:27 +00:00
parent e9c490ae1b
commit 82d7fef21e
68 changed files with 2883 additions and 34 deletions
--- a/docs/CLI-REFERENCE.md
+++ b/docs/CLI-REFERENCE.md
@@ -412,3 +412,40 @@ python -m src.cli_text_clean tickets.csv --skip notes --apply
 python -m src.cli_text_clean messy.csv --preset minimal --case upper --save-config my.json
 python -m src.cli_text_clean other.csv --config my.json --apply
 ```
+
+---
+
+## Analyzer (upload-time scan)
+
+```
+python -m src.cli_analyze INPUT_FILE [OPTIONS]
+
+  --sample-rows N       Cap on rows scanned (default 1000)
+  --json                Print findings as a JSON array on stdout
+  --strict              Exit non-zero on any warn/error finding
+```
+
+JSON output schema (one object per finding):
+
+```json
+{
+  "id": "smart_punctuation_in_data",
+  "severity": "warn",
+  "confidence": "high",
+  "fix_action": "fold_smart_punctuation",
+  "pre_applied": false,
+  "tool": "02_text_cleaner",
+  "count": 17,
+  "description": "17 cell(s) contain curly quotes…",
+  "column": null,
+  "samples": [{"row": 3, "column": "name", "value": "“Alice”"}]
+}
+```
+
+- `severity` — `info` / `warn` / `error`. Only `error` blocks the GUI normalization gate.
+- `confidence` — `high` (round-trip-safe, eligible for one-click auto-fix), `medium` (preview before applying), `low` (heuristic, opt-in only).
+- `fix_action` — stable id naming the algorithm in `src/core/fixes.py` that resolves the finding. Empty string for informational-only findings.
+- `pre_applied` — `true` for fixes already applied during the byte-level read pass (BOM strip, NUL strip, line-ending normalize, byte-level smart-quote fold, transcode-to-UTF-8 from UTF-16/32). The GUI gate treats these as already-resolved; the CLI emits them so callers can audit what changed during read.
+
+The detector set covers smart punctuation, NBSP / Unicode whitespace, zero-width characters, dirty headers, whitespace padding, null-like sentinels, mojibake fingerprints (UTF-8-as-cp1252), mixed-case email columns, near-duplicate rows (case-and-padding stripped), leading-zero IDs (Excel hazard), mixed line endings, encoding decode failure (`encoding_decode_failed`), and U+FFFD presence in the loaded text (`encoding_uncertain`). New detectors plug in by appending one entry to `analyze.py` and one matching fix in `fixes.py`.
+