feat(gate): CSV-normalization gate with confidence-tiered findings

Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.

Core (src/core/):
- analyze.py: Finding gains confidence, fix_action, pre_applied; new
  detectors for encoding_uncertain, encoding_decode_failed; new
  top-level encoding_override parameter.
- fixes.py: registry of fix algorithms keyed by fix_action id.
- normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
  the NormalizationResult / Decision dataclasses the gate consumes.
- io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
  transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
  and normalizes line endings (fixes bare-CR parser crash); empty file
  handled gracefully instead of EmptyDataError traceback.

GUI (src/gui/):
- pages/0_Review.py: gate page with per-finding decision controls,
  encoding override picker (16 codepages + custom), and Advanced output
  options (encoding, delimiter, line terminator) on the download.
- components.py: require_normalization_gate() helper.
- pages/1-9: gate guard wired on every tool page.

Test corpora:
- test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
  UTF-8 files + manifest, synced from Business/DataTools.
- test-cases/text-cleaner-corpus/test_data/17: synced malformed input
  (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
- test_normalize.py (48): finding fields, fix registry, auto_fix scope,
  decision paths, gate idempotency, output-options helper.
- test_encodings_corpus.py (90, 16 xfailed): parametric detection +
  decode + analyzer-no-crash sweep against the manifest.
- test_analyze.py: encoding override + encoding_uncertain detectors.
- test_corpus.py: pre-parse repair in the strict reader.

run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -412,3 +412,40 @@ python -m src.cli_text_clean tickets.csv --skip notes --apply
python -m src.cli_text_clean messy.csv --preset minimal --case upper --save-config my.json
python -m src.cli_text_clean other.csv --config my.json --apply
```

---
## Analyzer (upload-time scan)

```
python -m src.cli_analyze INPUT_FILE [OPTIONS]

  --sample-rows N   Cap on rows scanned (default 1000)
  --json            Print findings as a JSON array on stdout
  --strict          Exit non-zero on any warn/error finding
```

JSON output schema (one object per finding):

```json
{
  "id": "smart_punctuation_in_data",
  "severity": "warn",
  "confidence": "high",
  "fix_action": "fold_smart_punctuation",
  "pre_applied": false,
  "tool": "02_text_cleaner",
  "count": 17,
  "description": "17 cell(s) contain curly quotes…",
  "column": null,
  "samples": [{"row": 3, "column": "name", "value": "“Alice”"}]
}
```

- `severity` — `info` / `warn` / `error`. Only `error` blocks the GUI normalization gate.
- `confidence` — `high` (round-trip-safe, eligible for one-click auto-fix), `medium` (preview before applying), `low` (heuristic, opt-in only).
- `fix_action` — stable id naming the algorithm in `src/core/fixes.py` that resolves the finding. Empty string for informational-only findings.
- `pre_applied` — `true` for fixes already applied during the byte-level read pass (BOM strip, NUL strip, line-ending normalize, byte-level smart-quote fold, transcode-to-UTF-8 from UTF-16/32). The GUI gate treats these as already-resolved; the CLI emits them so callers can audit what changed during read.

The detector set covers smart punctuation, NBSP / Unicode whitespace, zero-width characters, dirty headers, whitespace padding, null-like sentinels, mojibake fingerprints (UTF-8-as-cp1252), mixed-case email columns, near-duplicate rows (case-and-padding stripped), leading-zero IDs (Excel hazard), mixed line endings, encoding decode failure (`encoding_decode_failed`), and U+FFFD presence in the loaded text (`encoding_uncertain`). New detectors plug in by appending one entry to `analyze.py` and one matching fix in `fixes.py`.
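As an illustration of how a caller might consume the `--json` array, the sketch below sorts findings into buckets using the `severity`, `confidence`, `fix_action`, and `pre_applied` fields described above. The sample payload, the `triage` helper, and the bucket names are hypothetical; the field values for `encoding_uncertain` and `csv_transcoded_to_utf8` are illustrative guesses, not taken from real analyzer output.

```python
import json
from collections import defaultdict

# Hypothetical stand-in for what `--json` might print on stdout.
SAMPLE_STDOUT = """
[
  {"id": "smart_punctuation_in_data", "severity": "warn", "confidence": "high",
   "fix_action": "fold_smart_punctuation", "pre_applied": false, "count": 17},
  {"id": "encoding_uncertain", "severity": "error", "confidence": "low",
   "fix_action": "", "pre_applied": false, "count": 2},
  {"id": "csv_transcoded_to_utf8", "severity": "info", "confidence": "high",
   "fix_action": "", "pre_applied": true, "count": 1}
]
"""


def triage(findings):
    """Group finding ids by how a caller would probably want to treat them."""
    buckets = defaultdict(list)
    for finding in findings:
        if finding.get("pre_applied"):
            # Already fixed during the read pass; kept for auditing only.
            buckets["already_applied"].append(finding["id"])
        elif finding["severity"] == "error":
            # Only error-level findings block the gate.
            buckets["blocking"].append(finding["id"])
        elif finding["confidence"] == "high" and finding["fix_action"]:
            buckets["auto_fixable"].append(finding["id"])
        else:
            buckets["needs_review"].append(finding["id"])
    return dict(buckets)


report = triage(json.loads(SAMPLE_STDOUT))
print(report)
```

In a real script the JSON text would come from running `python -m src.cli_analyze INPUT_FILE --json` and reading its stdout rather than from a literal string.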
@@ -505,6 +505,66 @@ The market gap this script fills: **one-click correctness for the dirty-CSV fail
- CLI surface: separate `src/cli_text_clean.py` module, matching the "one CLI binary per script on PATH" pattern in Section 3.2. Not a subcommand on the existing dedup Typer app.
- `ftfy` dependency deferred to Tier 2 (~5MB). Revisit if mojibake reports come in.

### 10.2.1 Upload-time analyzer (`src/core/analyze.py`)

The analyzer is a read-only, advisory pass that runs on every uploaded file before any tool page sees it. It produces a list of `Finding` objects, each carrying:

| Field | Type | Meaning |
|---|---|---|
| `id` | str | Stable identifier (`smart_punctuation_in_data`, `mixed_line_endings`, …). Never localized. |
| `severity` | `info` / `warn` / `error` | UX urgency. `error` is the only level that blocks the gate. |
| `confidence` | `high` / `medium` / `low` | Auto-fixability. **High** is round-trip safe, **medium** has known false-positive shapes, **low** is heuristic and opt-in. |
| `fix_action` | str | Stable id naming the algorithm in `src/core/fixes.py` that resolves this finding. Empty for informational-only findings. |
| `pre_applied` | bool | True when the fix already ran during the read pass (BOM strip, NUL strip, byte-level smart-quote fold). The gate treats these as already-resolved. |
| `tool` | str | Tool id that owns this concern (`02_text_cleaner`, `04_missing_handler`). Empty for file-level findings. |
| `count` | int | Cells / rows affected. |
| `description` | str | One-sentence human summary (banners, tooltips). |
| `column` | str / None | Column name when scoped to one column. |
| `samples` | list[(row, col, value)] | Up to 5 examples for the GUI to render. |

`analyze(source, *, sample_rows=1000, repair_result=None, encoding_override=None)` is the public entry point. `source` is a DataFrame or a path; `encoding_override` skips charset detection and uses the user's chosen codepage instead — this is the hook that lets the Review page recover from misdetections (cp1252-vs-cp1250 ambiguity, KOI8-R surfacing as Shift_JIS).
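To make the cp1252-vs-cp1250 ambiguity concrete, here is a small standalone sketch of why an override is needed. It does not call the real `analyze()`; `decode_with_override` is a hypothetical helper that only mimics the "override wins over detection" rule described above.

```python
from typing import Optional


def decode_with_override(raw: bytes, detected: str,
                         encoding_override: Optional[str] = None) -> str:
    """Decode bytes, preferring the user's chosen codepage over the detected one."""
    return raw.decode(encoding_override or detected, errors="replace")


original = "Łódź"
raw = original.encode("cp1250")  # a Central European export

# A detector that guesses cp1252 decodes without any error, but to the wrong text.
misread = decode_with_override(raw, detected="cp1252")

# The override recovers the intended text from the same bytes.
fixed = decode_with_override(raw, detected="cp1252", encoding_override="cp1250")

print(repr(misread), repr(fixed))
```

Because the wrong decode raises no exception and produces no U+FFFD, nothing in the bytes themselves signals the mistake, which is why the choice is left to the user.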
### 10.2.2 CSV-normalization gate (`src/core/normalize.py`, `src/core/fixes.py`)

A file enters tool pages only after passing the gate. The gate has two paths:

1. **Auto-fix** — `auto_fix(df, findings)` applies every `confidence="high"` finding whose `fix_action` is registered in `fixes.py`.
2. **Per-finding decisions** — `apply_decisions(df, findings, decisions)` accepts an explicit list of `Decision(finding_id, action, payload)` where action is `"auto" | "skip" | "modified"`.

Output is a `NormalizationResult` with:

- `cleaned_df` — the DataFrame after every applied fix.
- `cleaned_bytes` — UTF-8 CSV serialization for the download.
- `applied`, `skipped_findings`, `pending_findings`, `blocking_findings` — audit log + gate status.

`is_normalized(findings, result)` re-runs `analyze()` against the cleaned bytes and returns False if any high-confidence detector still fires — that's the strict contract tool pages depend on.

`fixes.py` is a registry: `@register("fix_id")` decorates a `(df, payload) -> (new_df, n_cells_changed)` function. Adding a new fix means appending one entry to `analyze.py`'s `FIX_*` constants, one detector that emits a Finding with that `fix_action`, and one registered function in `fixes.py`. No other call sites change.
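The registry pattern and the auto-fix path described above can be sketched as follows. This is a dependency-free approximation, not the code in `fixes.py`: a list of dicts stands in for the pandas DataFrame, findings are plain dicts, and the `mixed_case_email` / `lowercase_email` ids are made up for the example.

```python
from typing import Callable, Dict, List, Tuple

Table = List[Dict[str, str]]  # stand-in for a pandas DataFrame
FixFn = Callable[[Table, dict], Tuple[Table, int]]

FIXES: Dict[str, FixFn] = {}


def register(fix_id: str) -> Callable[[FixFn], FixFn]:
    """Decorator that stores a fix function under its fix_action id."""
    def wrap(fn: FixFn) -> FixFn:
        FIXES[fix_id] = fn
        return fn
    return wrap


@register("fold_smart_punctuation")
def fold_smart_punctuation(rows: Table, payload: dict) -> Tuple[Table, int]:
    """Fold curly quotes to ASCII; return (new_rows, n_cells_changed)."""
    mapping = str.maketrans({"\u201c": '"', "\u201d": '"',
                             "\u2018": "'", "\u2019": "'"})
    changed = 0
    out = []
    for row in rows:
        new_row = {}
        for col, value in row.items():
            folded = value.translate(mapping)
            changed += folded != value  # True counts as 1
            new_row[col] = folded
        out.append(new_row)
    return out, changed


def auto_fix(rows: Table, findings: List[dict]) -> Tuple[Table, List[str]]:
    """Apply every high-confidence finding whose fix_action is registered."""
    applied = []
    for finding in findings:
        fn = FIXES.get(finding["fix_action"])
        if finding["confidence"] == "high" and fn is not None:
            rows, _ = fn(rows, {})
            applied.append(finding["id"])
    return rows, applied


rows = [{"name": "\u201cAlice\u201d"}, {"name": "Bob"}]
findings = [
    {"id": "smart_punctuation_in_data", "confidence": "high",
     "fix_action": "fold_smart_punctuation"},
    {"id": "mixed_case_email", "confidence": "medium",
     "fix_action": "lowercase_email"},  # hypothetical ids, not registered here
]
cleaned, applied = auto_fix(rows, findings)
print(cleaned, applied)
```

Keying the lookup on `fix_action` is what keeps call sites unchanged when a fix is added: the gate only needs the id carried by the finding.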
### 10.2.3 Review page (`src/gui/pages/0_Review.py`)

Streamlit page that orchestrates the gate visually. Gates the entire tool sidebar via `require_normalization_gate()` in `src/gui/components.py`, which every tool page calls right after `hide_streamlit_chrome()`.

The page:

1. Surfaces the detected encoding plus an override picker (16 common codepages + custom-text fallback).
2. Renders one expandable card per finding, sorted by severity then confidence, with a decision radio (Auto / Skip / Customize), a live before/after preview built by running the registered fix on each `Finding.samples` value, and a payload editor for fixes that take user input (e.g. custom null-sentinel list for `replace_null_sentinels`).
3. Apply button persists a `NormalizationResult` keyed by upload SHA-256; tool pages refuse to load until the hash matches.
4. After apply, an `⚙️ Advanced output options` expander offers per-download encoding, delimiter, and line-terminator selection. The helper `_build_output_bytes(df, *, encoding, delimiter, line_terminator)` returns `(bytes, error_message)` — when the chosen encoding can't represent a character, falls back to `errors="replace"` and returns a warning the page surfaces.
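The fallback behavior of the output helper in step 4 can be approximated with the standard library alone. The function below is a rough sketch under stated assumptions, not the real `_build_output_bytes`: it takes a list of rows instead of a DataFrame, and the wording of the warning is invented.

```python
import csv
import io
from typing import List, Optional, Tuple


def build_output_bytes(rows: List[List[str]], *, encoding: str = "utf-8",
                       delimiter: str = ",", line_terminator: str = "\n"
                       ) -> Tuple[bytes, Optional[str]]:
    """Serialize rows to CSV bytes; return (data, warning_or_None)."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, delimiter=delimiter, lineterminator=line_terminator)
    writer.writerows(rows)
    text = buf.getvalue()
    try:
        return text.encode(encoding), None
    except UnicodeEncodeError as exc:
        # Name the first offending character, then degrade to '?' replacement
        # so the download still works.
        bad = text[exc.start]
        warning = f"{encoding} cannot represent {bad!r}; replaced with '?'"
        return text.encode(encoding, errors="replace"), warning


rows = [["name", "city"], ["Иван", "Москва"]]
data, warning = build_output_bytes(rows, encoding="cp1252",
                                   delimiter="\t", line_terminator="\r\n")
print(data, warning)
```

Returning the warning alongside the bytes, rather than raising, lets the page show the message and still offer the file.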
### 10.2.4 Pre-parse repair (`src/core/io.py::repair_bytes`)

Byte-level pre-parse pass. Order is meaningful and each step is independently toggleable:

1. **Wide-encoding transcode** — UTF-16/UTF-32 → UTF-8. Has to run first because the byte-level NUL strip below would shred UTF-16 data (UTF-16 ASCII chars carry NUL as half of every 16-bit unit). Records a `transcode_to_utf8` audit action; the analyzer surfaces it as a `csv_transcoded_to_utf8` info finding.
2. **UTF-8 BOM strip** (file start only).
3. **NUL strip** — runs after step 1, so any NUL bytes still present are genuine corruption (truncated C strings, half-binary exports) rather than encoding artifacts.
4. **Line-ending normalize** — CRLF and bare CR → LF. Bare CR confuses the C parser; the text-cleaner contract also calls for LF inside multi-line cells.
5. **Byte-level smart-quote fold** — curly / guillemet / double-prime → ASCII `"`. Only structural double-quote-equivalents; single curly quotes are deferred to the cell-level cleaner.
6. **Per-row delimiter repair** — when a row has exactly one extra field and the merge candidate is currency-shaped (`$1,500.00` etc.), merge the split fields and quote the result.

`detect_encoding()` tries strict UTF-8 first and returns `"utf-8"` if the bytes decode cleanly. This was added because charset-normalizer fingerprints small files dominated by short non-ASCII sequences (e.g. zero-width characters such as U+200B) as `mac_latin2`. If the bytes are valid UTF-8, then UTF-8 is the right answer regardless of the detector's label.
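A minimal sketch of the strict-UTF-8-first rule, assuming the statistical detector is passed in as a callable. The `fallback` parameter is an assumption made so the example runs without charset-normalizer installed; the real `detect_encoding()` signature may differ.

```python
from typing import Callable


def detect_encoding(raw: bytes, fallback: Callable[[bytes], str]) -> str:
    """Return "utf-8" when the bytes decode strictly, else defer to the fallback."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return fallback(raw)
    return "utf-8"


# Small file dominated by zero-width characters: valid UTF-8, so the
# (simulated) misfingerprint from the fallback is never consulted.
zero_width = "id\u200b,name\u200b\n".encode("utf-8")
print(detect_encoding(zero_width, lambda b: "mac_latin2"))

# Invalid UTF-8 (a lone 0xE9 byte): the fallback's answer is used.
legacy = b"caf\xe9\n"
print(detect_encoding(legacy, lambda b: "cp1252"))
```

The strict decode is cheap and has no false positives for valid UTF-8 input, so running it before any heuristic costs little.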
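Returning to the ordered `repair_bytes` steps listed earlier in this section, the sketch below walks through steps 1 to 4 to show why the transcode has to precede the NUL strip. It is a simplification under several assumptions: wide encodings are recognized only by their BOM, steps 5 and 6 are omitted, and every action name other than `transcode_to_utf8` is invented.

```python
import codecs
from typing import List, Tuple

# UTF-32 is checked first because its little-endian BOM begins with the
# UTF-16 little-endian BOM bytes.
_WIDE_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def repair_bytes_sketch(raw: bytes) -> Tuple[bytes, List[str]]:
    """Apply steps 1-4 in order and return (repaired_bytes, audit_actions)."""
    actions: List[str] = []

    # 1. Wide-encoding transcode (BOM-detected only in this sketch).
    for bom, codec in _WIDE_BOMS:
        if raw.startswith(bom):
            raw = raw.decode(codec).encode("utf-8")  # the codec consumes the BOM
            actions.append("transcode_to_utf8")
            break

    # 2. UTF-8 BOM strip, file start only.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
        actions.append("strip_bom")

    # 3. NUL strip: after step 1, remaining NULs are not encoding artifacts.
    if b"\x00" in raw:
        raw = raw.replace(b"\x00", b"")
        actions.append("strip_nul")

    # 4. Line-ending normalize: CRLF first, then any bare CR.
    if b"\r" in raw:
        raw = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
        actions.append("normalize_line_endings")

    return raw, actions


utf16_input = codecs.BOM_UTF16_LE + "a,b\r\n1,2\r".encode("utf-16-le")
fixed, actions = repair_bytes_sketch(utf16_input)
print(fixed, actions)
```

With the transcode first, the UTF-16 input never reaches the NUL strip as wide bytes, so `strip_nul` does not appear in the audit list for a clean UTF-16 file.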
### 10.3 - 10.9 (Future)

Functional specs for scripts 03 through 09 to be added when each script enters active build. The deduplicator (10.1) and text cleaner (10.2) specs are the template; specs for other scripts should follow the same Tier 1 / Tier 2 / Tier 3 structure with explicit strategic framing (what's the market gap this script fills, given that some of its functionality is available free elsewhere).
@@ -125,6 +125,41 @@ deduplicator --help

---
## 3.3 Review & Normalize gate

Before any tool page accepts a file, the file passes through a **CSV-normalization gate**. The gate scans every uploaded file, surfaces every data-quality issue our analyzer can detect, and lets you choose how to handle each one before downstream tools see the data.

### How it works

1. Upload a file on the home page. The analyzer scans it and counts findings by confidence tier.
2. Click any tool. If the file hasn't been normalized yet, you're redirected to the **Review & Normalize** page.
3. The page shows every finding grouped by severity and confidence, with a per-finding decision control.
### Confidence tiers

- **High** — round-trip-safe algorithmic fix (BOM strip, whitespace trim, NBSP / zero-width strip, smart-quote fold, line-ending normalize, header cleanup). One-click "Auto-fix high-confidence" applies them all.
- **Medium** — right call in the common case but with known false-positive shapes. Examples: lowercasing the email column, replacing null-like sentinels (`N/A`, `-`, `nan`), repairing unquoted-currency rows. Preview the change before applying.
- **Low** — heuristic fixes that can corrupt data when wrong. Mojibake repair (`cafÃ©` → `café`), mixed-encoding detection. Off by default; you opt in per finding.
- **Error** — blocking. Empty file, unrepairable rows, U+FFFD replacement characters. The file cannot enter the tool pages until each error is resolved or explicitly waived.
### Encoding override
|
||||
|
||||
When the analyzer reports `encoding_uncertain` or you spot mojibake (`é`) or `<60>` characters in the findings list, use the **File encoding** picker at the top of the Review page. Pick the right code page (cp1252 for Western Excel exports, KOI8-R for older Russian data, Big5 for traditional Chinese, etc.) or type a custom one, then click **Re-analyze**. Findings refresh against the corrected decode.
|
||||
|
||||
The picker is hidden for `.xlsx` files since Excel stores text as Unicode internally.
|
||||
|
||||
### Advanced output options

After you apply your decisions, an `⚙️ Advanced output options` expander appears next to the download. Three dropdowns let you tune the output file format:

- **Encoding (code page)** — UTF-8 (default), UTF-8 with BOM (Excel-friendly), Windows-1252, Latin-1, Latin-9, cp1250, ISO-8859-2, cp1251, Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE.
- **Delimiter** — comma (default), tab, semicolon, pipe.
- **Line terminator** — LF (default), CRLF (Windows), CR.

The download filename auto-adjusts the extension (`.tsv` for tab, otherwise `.csv`). When the chosen encoding can't represent a character (Cyrillic content into cp1252, Asian script into Latin-1), the page shows a warning naming the offending character and falls back to `?` replacement so the download still works.

---
## 4. Output

Every script writes: