Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.
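The gate policy described above can be sketched roughly as follows. This is a minimal illustration, not the code in normalize.py: the resolved and waived fields and the triage() / tools_blocked() helpers are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Finding:
    # confidence and fix_action come from the commit text;
    # resolved and waived are hypothetical bookkeeping fields.
    id: str
    severity: str                      # "error", "warn", or "info"
    confidence: str                    # "high", "medium", or "low"
    fix_action: Optional[str] = None
    resolved: bool = False
    waived: bool = False


def triage(findings: List[Finding]) -> Tuple[List[Finding], List[Finding]]:
    """Split findings into (auto-applied, needs user review)."""
    auto, review = [], []
    for finding in findings:
        if finding.confidence == "high" and finding.fix_action:
            auto.append(finding)       # safe to apply without asking
        else:
            review.append(finding)     # medium/low: surface to the user
    return auto, review


def tools_blocked(findings: List[Finding]) -> bool:
    """Tool pages stay blocked while any error-level finding is open."""
    return any(
        f.severity == "error" and not (f.resolved or f.waived)
        for f in findings
    )
```

Keeping triage() free of side effects makes a gate like this easy to re-run after each user decision.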
Core (src/core/):
- analyze.py: Finding gains confidence, fix_action, pre_applied; new
detectors for encoding_uncertain, encoding_decode_failed; new
top-level encoding_override parameter.
- fixes.py: registry of fix algorithms keyed by fix_action id.
- normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
the NormalizationResult / Decision dataclasses the gate consumes.
- io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
and normalizes line endings (fixes bare-CR parser crash); an empty file
is now handled gracefully instead of raising an EmptyDataError traceback.
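Why the repair order in io.py matters can be shown with a small sketch. It assumes BOM-based detection, which may differ from what detect_encoding actually does, and repair_bytes_sketch is a hypothetical stand-in for repair_bytes.

```python
import codecs

# UTF-32 BOMs are checked first: the UTF-32 LE BOM begins with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def repair_bytes_sketch(raw: bytes) -> bytes:
    # 1. Transcode UTF-16/32 to UTF-8 first. In those encodings NUL
    #    bytes are legitimate halves of code units, so stripping them
    #    before decoding would corrupt the text.
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            raw = raw.decode(encoding).encode("utf-8")
            break
    # 2. Only now is it safe to drop stray NUL bytes.
    raw = raw.replace(b"\x00", b"")
    # 3. Normalize CRLF and bare CR to LF so the parser sees one style.
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

An empty input passes through all three steps unchanged, so the caller can check for it before handing the bytes to the CSV parser.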
GUI (src/gui/):
- pages/0_Review.py: gate page with per-finding decision controls,
encoding override picker (16 codepages + custom), and Advanced output
options (encoding, delimiter, line terminator) on the download.
- components.py: require_normalization_gate() helper.
- pages/1-9: gate guard wired on every tool page.
Test corpora:
- test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
UTF-8 files + manifest, synced from Business/DataTools.
- test-cases/text-cleaner-corpus/test_data/17: synced malformed input
(unquoted $1,500.00) for the unquoted-delimiter detector.
Tests (94 new):
- test_normalize.py (48): finding fields, fix registry, auto_fix scope,
decision paths, gate idempotency, output-options helper.
- test_encodings_corpus.py (90, 16 xfailed): parametric detection +
decode + analyzer-no-crash sweep against the manifest.
- test_analyze.py: encoding override + encoding_uncertain detectors.
- test_corpus.py: pre-parse repair in the strict reader.
run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.
Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.
Suite: 765 passed, 17 xfailed (was 458 passed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two more detectors close the analyzer gap list:
mixed_line_endings (warn, tool=02): scans raw bytes for combinations of
CRLF / LF / bare CR. This is a common failure pattern after multi-source
concatenation (Windows + macOS + Linux exports stitched together). The
detector operates on raw bytes only, so DataFrame-mode analyze() skips it
because raw bytes aren't available. _load_for_analysis now returns the raw bytes
alongside the DataFrame and repair result so the detector has them.
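A byte-level scan of this kind might look like the sketch below. The function names are hypothetical and the real detector may count or report differently.

```python
def count_line_endings(raw: bytes) -> dict:
    """Count each line-ending style in raw bytes."""
    crlf = raw.count(b"\r\n")
    # Every CRLF contains one CR and one LF, so subtract those out
    # to get the standalone counts.
    lf = raw.count(b"\n") - crlf
    cr = raw.count(b"\r") - crlf
    return {"CRLF": crlf, "LF": lf, "CR": cr}


def has_mixed_line_endings(raw: bytes) -> bool:
    """True when more than one line-ending style is present."""
    counts = count_line_endings(raw)
    return sum(1 for n in counts.values() if n > 0) > 1
```

Working on bytes rather than decoded text means the check sees the file before any universal-newline translation has hidden the differences.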
near_duplicate_rows (info, tool=01): cheap dedup signal — strip and
lowercase every string column, then count df.duplicated(). Catches the
most common case (same customer entered twice with subtle formatting
differences) without paying for fuzzy matching. Anything more
sophisticated stays in tool 01.
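The strip-and-lowercase signal can be sketched with pandas as follows. count_near_duplicate_rows is a hypothetical name, and the element-wise normalization is one way to do it, not necessarily the one in analyze.py.

```python
import pandas as pd


def _normalize(value):
    # Only strings are touched; numbers, dates, and NaN pass through.
    return value.strip().lower() if isinstance(value, str) else value


def count_near_duplicate_rows(df: pd.DataFrame) -> int:
    """Count rows that repeat an earlier row after normalization."""
    normalized = df.apply(lambda col: col.map(_normalize))
    # duplicated() marks every repeat after the first occurrence.
    return int(normalized.duplicated().sum())
```

Note that exact duplicates are counted too, since they are still duplicates after normalization.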
Six new tests cover both detectors plus the dataframe-mode skip path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure, advisory scan over an uploaded file or DataFrame that returns a list of
Finding objects naming each issue, the affected count, and which downstream
tool can fix it. The GUI uses this to badge tool nav items at upload; the CLI
will print findings as a table or JSON.
src/core/analyze.py:
Finding dataclass (id, severity, tool, count, description, column, samples)
analyze(source, *, sample_rows=1000, repair_result=None) -> list[Finding]
- source: DataFrame, path, or str. A path scans only the first
sample_rows rows (1000 by default).
- When source is a path, runs the same pre-parse repair the tool pages
will use; the resulting RepairResult is auto-surfaced as csv_*
findings. A caller-supplied repair_result wins so non-default repair
flags are respected.
Detectors (each independent, samples capped at 5):
- smart_punctuation_in_data -> 02
- nbsp_or_unicode_whitespace -> 02
- zero_width_or_invisible -> 02
- dirty_column_headers -> 02
- whitespace_padding -> 02
- null_like_sentinels -> 04
- suspected_mojibake -> 02 (Tier 2)
- mixed_case_email_column -> 02 case op
- leading_zero_ids -> informational, no tool
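As an illustration of one independent detector with capped samples, here is a rough sketch for zero_width_or_invisible. The character set, the function name, and the dict return shape are assumptions made for this example; the real detector returns a Finding.

```python
from typing import Iterable, Optional

# Assumed character set: common zero-width and invisible code points.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
MAX_SAMPLES = 5  # samples are capped at 5 per the commit text


def detect_zero_width(values: Iterable[object]) -> Optional[dict]:
    """Return a finding-like dict, or None when nothing is found."""
    count = 0
    samples = []
    for value in values:
        if isinstance(value, str) and any(ch in ZERO_WIDTH for ch in value):
            count += 1
            if len(samples) < MAX_SAMPLES:
                samples.append(value)
    if count == 0:
        return None
    return {
        "id": "zero_width_or_invisible",
        "tool": "02_text_cleaner",
        "count": count,
        "samples": samples,
    }
```

The count reflects every affected value, while the samples list stays small enough to show in a sidebar or a JSON report.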
Helpers: findings_by_tool() for sidebar grouping, to_dict() for JSON.
Detectors are decoupled from the GUI display layer — they emit stable tool
ids ("02_text_cleaner") and the GUI maps those to display names.
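The Finding shape and the two helpers could look roughly like this. The field names come from the list above, but the types, defaults, and the DISPLAY_NAMES mapping are assumptions, and to_dict() is shown as a method only for illustration.

```python
from collections import defaultdict
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class Finding:
    id: str
    severity: str
    tool: Optional[str]            # stable tool id, or None if no tool
    count: int
    description: str
    column: Optional[str] = None
    samples: List[str] = field(default_factory=list)

    def to_dict(self) -> dict:
        # Plain dict, ready for json.dumps().
        return asdict(self)


def findings_by_tool(findings: List[Finding]) -> Dict[Optional[str], List[Finding]]:
    """Group findings by stable tool id for sidebar badging."""
    grouped = defaultdict(list)
    for finding in findings:
        grouped[finding.tool].append(finding)
    return dict(grouped)


# Hypothetical GUI-side mapping from stable ids to display names.
DISPLAY_NAMES = {"02_text_cleaner": "Text Cleaner"}
```

Because detectors emit only the stable id, a tool can be renamed in the GUI by editing the mapping without touching any detector.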
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>