feat(gate): CSV-normalization gate with confidence-tiered findings

Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.

Core (src/core/):
- analyze.py: Finding gains confidence, fix_action, pre_applied; new
  detectors for encoding_uncertain, encoding_decode_failed; new
  top-level encoding_override parameter.
- fixes.py: registry of fix algorithms keyed by fix_action id.
- normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
  the NormalizationResult / Decision dataclasses the gate consumes.
- io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
  transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
  and normalizes line endings (fixes bare-CR parser crash); empty file
  handled gracefully instead of EmptyDataError traceback.

GUI (src/gui/):
- pages/0_Review.py: gate page with per-finding decision controls,
  encoding override picker (16 codepages + custom), and Advanced output
  options (encoding, delimiter, line terminator) on the download.
- components.py: require_normalization_gate() helper.
- pages/1-9: gate guard wired on every tool page.

Test corpora:
- test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
  UTF-8 files + manifest, synced from Business/DataTools.
- test-cases/text-cleaner-corpus/test_data/17: synced malformed input
  (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
- test_normalize.py (48): finding fields, fix registry, auto_fix scope,
  decision paths, gate idempotency, output-options helper.
- test_encodings_corpus.py (90, 16 xfailed): parametric detection +
  decode + analyzer-no-crash sweep against the manifest.
- test_analyze.py: encoding override + encoding_uncertain detectors.
- test_corpus.py: pre-parse repair in the strict reader.

run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -125,6 +125,41 @@ deduplicator --help
---

## 3.3 Review & Normalize gate

Before any tool page accepts a file, the file passes through a **CSV-normalization gate**. The gate scans every uploaded file, surfaces every data-quality issue our analyzer can detect, and lets you choose how to handle each one before downstream tools see the data.

### How it works

1. Upload a file on the home page. The analyzer scans it and counts findings by confidence tier.
2. Click any tool. If the file hasn't been normalized yet, you're redirected to the **Review & Normalize** page.
3. The page shows every finding grouped by severity and confidence, with a per-finding decision control.
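The routing in the steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's code: `Finding`, `Session`, `count_by_confidence`, and `route` are hypothetical names, and the real `require_normalization_gate()` helper may work differently.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Finding:
    """Hypothetical, simplified finding record (not the analyzer's schema)."""
    code: str
    confidence: str  # "high", "medium", or "low"


@dataclass
class Session:
    """Hypothetical per-upload state carried between pages."""
    findings: list = field(default_factory=list)
    normalized: bool = False


def count_by_confidence(findings):
    """Step 1: tally findings per confidence tier for the upload summary."""
    return Counter(f.confidence for f in findings)


def route(session, requested_page):
    """Step 2: send files that have not been normalized to the review page."""
    if not session.normalized:
        return "Review & Normalize"
    return requested_page


session = Session(findings=[
    Finding("bom", "high"),
    Finding("trailing_whitespace", "high"),
    Finding("null_sentinel", "medium"),
])
print(count_by_confidence(session.findings)["high"])  # 2
print(route(session, "Deduplicator"))                 # Review & Normalize
session.normalized = True
print(route(session, "Deduplicator"))                 # Deduplicator
```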

### Confidence tiers

- **High** — round-trip-safe algorithmic fix (BOM strip, whitespace trim, NBSP / zero-width strip, smart-quote fold, line-ending normalize, header cleanup). One-click "Auto-fix high-confidence" applies them all.
- **Medium** — right call in the common case but with known false-positive shapes. Examples: lowercasing the email column, replacing null-like sentinels (`N/A`, `-`, `nan`), repairing unquoted-currency rows. Preview the change before applying.
- **Low** — heuristic fixes that can corrupt data when wrong. Examples: mojibake repair (`cafÃ©` → `café`) and mixed-encoding detection. Off by default; you opt in per finding.
- **Error** — blocking. Empty file, unrepairable rows, U+FFFD replacement characters. Cannot enter the tool pages until resolved or explicitly waived.
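One way to picture the tiers is as a lookup from tier to default behaviour, plus a check that keeps tool pages locked while error-level findings remain. The sketch below is illustrative only: the `tier` field, the action names, and both function names are assumptions rather than the analyzer's actual schema, and it treats "error" as a fourth tier the way the list above does.

```python
from dataclasses import dataclass

# Default handling per tier, following the list above (names are hypothetical).
DEFAULT_ACTION = {
    "high": "auto_apply",  # applied by the one-click auto-fix
    "medium": "preview",   # shown for review before applying
    "low": "skip",         # off unless the user opts in
    "error": "block",      # must be resolved or waived
}


@dataclass
class Finding:
    code: str
    tier: str
    resolved: bool = False
    waived: bool = False


def default_action(finding):
    """Return the default handling for a finding's tier."""
    return DEFAULT_ACTION[finding.tier]


def tools_unlocked(findings):
    """Tool pages open only when every error is resolved or waived."""
    return all(f.resolved or f.waived for f in findings if f.tier == "error")


findings = [Finding("bom", "high"), Finding("replacement_char", "error")]
print(default_action(findings[0]))  # auto_apply
print(tools_unlocked(findings))     # False
findings[1].waived = True
print(tools_unlocked(findings))     # True
```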

### Encoding override

When the analyzer reports `encoding_uncertain` or you spot mojibake (`é`) or `<60>` characters in the findings list, use the **File encoding** picker at the top of the Review page. Pick the right code page (cp1252 for Western Excel exports, KOI8-R for older Russian data, Big5 for traditional Chinese, etc.) or type a custom one, then click **Re-analyze**. Findings refresh against the corrected decode.
|
||||
|
||||
The picker is hidden for `.xlsx` files since Excel stores text as Unicode internally.

### Advanced output options

After you apply your decisions, an `⚙️ Advanced output options` expander appears on the download. Its three dropdowns let you tune the output file format:

- **Encoding (code page)** — UTF-8 (default), UTF-8 with BOM (Excel-friendly), Windows-1252, Latin-1, Latin-9, cp1250, ISO-8859-2, cp1251, Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE.
- **Delimiter** — comma (default), tab, semicolon, pipe.
- **Line terminator** — LF (default), CRLF (Windows), CR.
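The three options map onto standard-library calls roughly as follows. This is a minimal sketch, not the page's implementation: `render_csv` and `download_name` are hypothetical names, and it assumes characters the target code page cannot represent are replaced with `?` and reported back for a warning.

```python
import csv
import io


def _encodable(ch, encoding):
    """Return True if the target encoding can represent this character."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def render_csv(rows, encoding="utf-8", delimiter=",", lineterminator="\n"):
    """Serialize rows, then encode them, replacing unencodable characters with '?'."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, delimiter=delimiter, lineterminator=lineterminator)
    writer.writerows(rows)
    text = buf.getvalue()
    # Collect the characters that will be lost so the caller can warn about them.
    offending = sorted({ch for ch in text if not _encodable(ch, encoding)})
    return text.encode(encoding, errors="replace"), offending


def download_name(stem, delimiter):
    """Use .tsv for tab-delimited output and .csv for everything else."""
    return f"{stem}.tsv" if delimiter == "\t" else f"{stem}.csv"


data, lost = render_csv([["name"], ["Жук"]], encoding="cp1252")
print(data)                          # b'name\n???\n'
print(lost)                          # ['Ж', 'к', 'у']
print(download_name("clean", "\t"))  # clean.tsv
```

In Python, the "UTF-8 with BOM" choice corresponds to the `utf-8-sig` codec, which writes the byte-order mark automatically.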

The download filename's extension adjusts automatically (`.tsv` for tab, otherwise `.csv`). When the chosen encoding can't represent a character (for example, Cyrillic content written as cp1252, or an Asian script written as Latin-1), the page shows a warning naming the offending character and falls back to `?` replacement so the download still works.

---

## 4. Output

Every script writes: