diff --git a/README.md b/README.md index 4c77721..dd1aaa0 100644 --- a/README.md +++ b/README.md @@ -149,10 +149,20 @@ Outputs `{input}_cleaned.csv` plus a per-cell `{input}_changes.csv` audit (row, See [docs/CLI-REFERENCE.md](docs/CLI-REFERENCE.md#text-cleaner-cli) for every flag. +## Review & Normalize gate + +Every uploaded file passes through a CSV-normalization gate before any tool page sees it. The analyzer scans for ~15 issue types — whitespace pollution, NBSP / zero-width chars, mixed line endings, BOM artifacts, encoding misdetections, smart punctuation, dirty headers, null sentinels, mojibake, and more — and tags each finding by **confidence** (high / medium / low) and **fix action** (the algorithm in `src/core/fixes.py` that resolves it). + +In the GUI, the **Review & Normalize** page renders one expandable card per finding with a decision control (Auto-fix / Skip / Customize), a live before-and-after preview, an encoding-override picker for misdetected codepages, and an Advanced output options block (encoding, delimiter, line terminator) for the download. Tool pages refuse to load until the gate passes. + +See [docs/USER-GUIDE.md §3.3](docs/USER-GUIDE.md) for the user-facing walkthrough and [docs/TECHNICAL.md §10.2.1–10.2.4](docs/TECHNICAL.md) for the developer-facing API. 
+ ## Documentation +- [User Guide](docs/USER-GUIDE.md) — installation, GUI workflow, the Review & Normalize gate - [CLI Reference](docs/CLI-REFERENCE.md) — every flag with examples and recipe sections -- [Developer Guide](docs/DEVELOPER.md) — architecture, data flow, how to extend +- [Technical](docs/TECHNICAL.md) — architecture, gate internals, finding schema, fix registry +- [Developer Guide](docs/DEVELOPER.md) — extending the bundle, adding fixes / detectors ## Requirements diff --git a/docs/CLI-REFERENCE.md b/docs/CLI-REFERENCE.md index bb44a1d..57bb591 100644 --- a/docs/CLI-REFERENCE.md +++ b/docs/CLI-REFERENCE.md @@ -412,3 +412,40 @@ python -m src.cli_text_clean tickets.csv --skip notes --apply python -m src.cli_text_clean messy.csv --preset minimal --case upper --save-config my.json python -m src.cli_text_clean other.csv --config my.json --apply ``` + +--- + +## Analyzer (upload-time scan) + +``` +python -m src.cli_analyze INPUT_FILE [OPTIONS] + + --sample-rows N Cap on rows scanned (default 1000) + --json Print findings as a JSON array on stdout + --strict Exit non-zero on any warn/error finding +``` + +JSON output schema (one object per finding): + +```json +{ + "id": "smart_punctuation_in_data", + "severity": "warn", + "confidence": "high", + "fix_action": "fold_smart_punctuation", + "pre_applied": false, + "tool": "02_text_cleaner", + "count": 17, + "description": "17 cell(s) contain curly quotes…", + "column": null, + "samples": [{"row": 3, "column": "name", "value": "“Alice”"}] +} +``` + +- `severity` — `info` / `warn` / `error`. Only `error` blocks the GUI normalization gate. +- `confidence` — `high` (round-trip-safe, eligible for one-click auto-fix), `medium` (preview before applying), `low` (heuristic, opt-in only). +- `fix_action` — stable id naming the algorithm in `src/core/fixes.py` that resolves the finding. Empty string for informational-only findings. 
+- `pre_applied` — `true` for fixes already applied during the byte-level read pass (BOM strip, NUL strip, line-ending normalize, byte-level smart-quote fold, transcode-to-UTF-8 from UTF-16/32). The GUI gate treats these as already-resolved; the CLI emits them so callers can audit what changed during read. + +The detector set covers smart punctuation, NBSP / Unicode whitespace, zero-width characters, dirty headers, whitespace padding, null-like sentinels, mojibake fingerprints (UTF-8-as-cp1252), mixed-case email columns, near-duplicate rows (case-and-padding stripped), leading-zero IDs (Excel hazard), mixed line endings, encoding decode failure (`encoding_decode_failed`), and U+FFFD presence in the loaded text (`encoding_uncertain`). New detectors plug in by appending one entry to `analyze.py` and one matching fix in `fixes.py`. + diff --git a/docs/TECHNICAL.md b/docs/TECHNICAL.md index 9d39ac3..ac88e90 100644 --- a/docs/TECHNICAL.md +++ b/docs/TECHNICAL.md @@ -505,6 +505,66 @@ The market gap this script fills: **one-click correctness for the dirty-CSV fail - CLI surface: separate `src/cli_text_clean.py` module, matching the "one CLI binary per script on PATH" pattern in Section 3.2. Not a subcommand on the existing dedup Typer app. - `ftfy` dependency deferred to Tier 2 (~5MB). Revisit if mojibake reports come in. +### 10.2.1 Upload-time analyzer (`src/core/analyze.py`) + +The analyzer is a read-only, advisory pass that runs on every uploaded file before any tool page sees it. It produces a list of `Finding` objects, each carrying: + +| Field | Type | Meaning | +|---|---|---| +| `id` | str | Stable identifier (`smart_punctuation_in_data`, `mixed_line_endings`, …). Never localized. | +| `severity` | `info` / `warn` / `error` | UX urgency. `error` is the only level that blocks the gate. | +| `confidence` | `high` / `medium` / `low` | Auto-fixability. **High** is round-trip safe, **medium** has known false-positive shapes, **low** is heuristic and opt-in. 
| +| `fix_action` | str | Stable id naming the algorithm in `src/core/fixes.py` that resolves this finding. Empty for informational-only findings. | +| `pre_applied` | bool | True when the fix already ran during the read pass (BOM strip, NUL strip, byte-level smart-quote fold). The gate treats these as already-resolved. | +| `tool` | str | Tool id that owns this concern (`02_text_cleaner`, `04_missing_handler`). Empty for file-level findings. | +| `count` | int | Cells / rows affected. | +| `description` | str | One-sentence human summary (banners, tooltips). | +| `column` | str / None | Column name when scoped to one column. | +| `samples` | list[(row, col, value)] | Up to 5 examples for the GUI to render. | + +`analyze(source, *, sample_rows=1000, repair_result=None, encoding_override=None)` is the public entry point. `source` is a DataFrame or a path; `encoding_override` skips charset detection and uses the user's chosen codepage instead — this is the hook that lets the Review page recover from misdetections (cp1252-vs-cp1250 ambiguity, KOI8-R surfacing as Shift_JIS). + +### 10.2.2 CSV-normalization gate (`src/core/normalize.py`, `src/core/fixes.py`) + +A file enters tool pages only after passing the gate. The gate has two paths: + +1. **Auto-fix** — `auto_fix(df, findings)` applies every `confidence="high"` finding whose `fix_action` is registered in `fixes.py`. +2. **Per-finding decisions** — `apply_decisions(df, findings, decisions)` accepts an explicit list of `Decision(finding_id, action, payload)` where action is `"auto" | "skip" | "modified"`. + +Output is a `NormalizationResult` with: + +- `cleaned_df` — the DataFrame after every applied fix. +- `cleaned_bytes` — UTF-8 CSV serialization for the download. +- `applied`, `skipped_findings`, `pending_findings`, `blocking_findings` — audit log + gate status. 
+ +`is_normalized(findings, result)` re-runs `analyze()` against the cleaned bytes and returns False if any high-confidence detector still fires — that's the strict contract tool pages depend on. + +`fixes.py` is a registry: `@register("fix_id")` decorates a `(df, payload) -> (new_df, n_cells_changed)` function. Adding a new fix means appending one entry to `analyze.py`'s `FIX_*` constants, one detector that emits a Finding with that `fix_action`, and one registered function in `fixes.py`. No other call sites change. + +### 10.2.3 Review page (`src/gui/pages/0_Review.py`) + +Streamlit page that orchestrates the gate visually. Gates the entire tool sidebar via `require_normalization_gate()` in `src/gui/components.py`, which every tool page calls right after `hide_streamlit_chrome()`. + +The page: + +1. Surfaces the detected encoding plus an override picker (16 common codepages + custom-text fallback). +2. Renders one expandable card per finding, sorted by severity then confidence, with a decision radio (Auto / Skip / Customize), a live before/after preview built by running the registered fix on each `Finding.samples` value, and a payload editor for fixes that take user input (e.g. custom null-sentinel list for `replace_null_sentinels`). +3. Apply button persists a `NormalizationResult` keyed by upload SHA-256; tool pages refuse to load until the hash matches. +4. After apply, an `⚙️ Advanced output options` expander offers per-download encoding, delimiter, and line-terminator selection. The helper `_build_output_bytes(df, *, encoding, delimiter, line_terminator)` returns `(bytes, error_message)` — when the chosen encoding can't represent a character, falls back to `errors="replace"` and returns a warning the page surfaces. + +### 10.2.4 Pre-parse repair (`src/core/io.py::repair_bytes`) + +Byte-level pre-parse pass. Order is meaningful and each step is independently toggleable: + +1. **Wide-encoding transcode** — UTF-16/UTF-32 → UTF-8. 
Has to run first because the byte-level NUL strip below would shred UTF-16 data (UTF-16 ASCII chars carry NUL as half of every 16-bit unit). Records `transcode_to_utf8` audit action; the analyzer surfaces it as a `csv_transcoded_to_utf8` info finding. +2. **UTF-8 BOM strip** (file start only). +3. **NUL strip** — only meaningful after step 1, so what it removes is genuine corruption (truncated C strings, half-binary exports) rather than encoding artifacts. +4. **Line-ending normalize** — CRLF and bare CR → LF. Bare CR confuses the C parser; the text-cleaner contract also calls for LF inside multi-line cells. +5. **Byte-level smart-quote fold** — curly / guillemet / double-prime → ASCII `"`. Only structural double-quote-equivalents; single curly quotes are deferred to the cell-level cleaner. +6. **Per-row delimiter repair** — when one row has one extra field and the merge candidate is currency-shaped (`$1,500.00` etc.), merge and quote. + +`detect_encoding()` tries strict UTF-8 first and returns `"utf-8"` if the bytes decode cleanly. This was added because charset-normalizer fingerprints small files dominated by short non-ASCII sequences (e.g. zero-width chars at U+200B-class) as `mac_latin2` — but if the bytes are valid UTF-8, then UTF-8 is the right answer regardless of the detector's label. + ### 10.3 - 10.9 (Future) Functional specs for scripts 03 through 09 to be added when each script enters active build. The deduplicator (10.1) and text cleaner (10.2) specs are the template; specs for other scripts should follow the same Tier 1 / Tier 2 / Tier 3 structure with explicit strategic framing (what's the market gap this script fills, given that some of its functionality is available free elsewhere). diff --git a/docs/USER-GUIDE.md b/docs/USER-GUIDE.md index 0a045c0..60a8609 100644 --- a/docs/USER-GUIDE.md +++ b/docs/USER-GUIDE.md @@ -125,6 +125,41 @@ deduplicator --help --- +## 3.3 Review & Normalize gate + +Before any tool page accepts a file, the file passes through a **CSV-normalization gate**. 
The gate scans every uploaded file, surfaces every data-quality issue our analyzer can detect, and lets you choose how to handle each one before downstream tools see the data. + +### How it works + +1. Upload a file on the home page. The analyzer scans it and counts findings by confidence tier. +2. Click any tool. If the file hasn't been normalized yet, you're redirected to the **Review & Normalize** page. +3. The page shows every finding grouped by severity and confidence, with a per-finding decision control. + +### Confidence tiers + +- **High** — round-trip-safe algorithmic fix (BOM strip, whitespace trim, NBSP / zero-width strip, smart-quote fold, line-ending normalize, header cleanup). One-click "Auto-fix high-confidence" applies them all. +- **Medium** — right call in the common case but with known false-positive shapes. Examples: lowercasing the email column, replacing null-like sentinels (`N/A`, `-`, `nan`), repairing unquoted-currency rows. Preview the change before applying. +- **Low** — heuristic fixes that can corrupt data when wrong. Mojibake repair (`cafÃ©` → `café`), mixed-encoding detection. Off by default; you opt in per finding. +- **Error** — blocking. Empty file, unrepairable rows, U+FFFD replacement characters. Cannot enter the tool pages until resolved or explicitly waived. + +### Encoding override + +When the analyzer reports `encoding_uncertain` or you spot mojibake (`Ã©`) or `�` characters in the findings list, use the **File encoding** picker at the top of the Review page. Pick the right code page (cp1252 for Western Excel exports, KOI8-R for older Russian data, Big5 for traditional Chinese, etc.) or type a custom one, then click **Re-analyze**. Findings refresh against the corrected decode. + +The picker is hidden for `.xlsx` files since Excel stores text as Unicode internally. + +### Advanced output options + +After applying decisions, an `⚙️ Advanced output options` expander appears on the download. 
Three dropdowns let you tune the output file format: + +- **Encoding (code page)** — UTF-8 (default), UTF-8 with BOM (Excel-friendly), Windows-1252, Latin-1, Latin-9, cp1250, ISO-8859-2, cp1251, Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE. +- **Delimiter** — comma (default), tab, semicolon, pipe. +- **Line terminator** — LF (default), CRLF (Windows), CR. + +The download filename auto-adjusts the extension (`.tsv` for tab, otherwise `.csv`). When the chosen encoding can't represent a character (Cyrillic content into cp1252, Asian script into Latin-1), the page shows a warning naming the offending character and falls back to `?` replacement so the download still works. + +--- + ## 4. Output Every script writes: diff --git a/run_tests.py b/run_tests.py index d801ad0..2b7daea 100755 --- a/run_tests.py +++ b/run_tests.py @@ -52,13 +52,20 @@ _TOOL_MAP: dict[str, str] = { "cli": "test_cli or test_cli_text_clean or test_cli_analyze", "config": "test_config", "normalizers": "test_normalizers", + "normalize": "test_normalize", + "encodings": "test_encodings_corpus or test_io", + "gate": "test_normalize", } _CATEGORY_PATHS: dict[str, list[str]] = { "unit": ["tests/"], # all tests are unit unless marked otherwise "e2e": ["tests/test_e2e.py"], "install": ["tests/test_install.py"], - "fixtures": ["tests/test_corpus.py", "tests/test_fixtures_sweep.py"], + "fixtures": [ + "tests/test_corpus.py", + "tests/test_fixtures_sweep.py", + "tests/test_encodings_corpus.py", + ], } diff --git a/src/core/analyze.py b/src/core/analyze.py index ad50bed..4561aee 100644 --- a/src/core/analyze.py +++ b/src/core/analyze.py @@ -25,6 +25,7 @@ from pandas.api import types as pdtypes from .io import RepairResult, repair_bytes, detect_encoding, detect_delimiter Severity = Literal["info", "warn", "error"] +Confidence = Literal["high", "medium", "low"] # Tool identifiers — match the 0N_ convention used by the script set. 
@@ -35,6 +36,29 @@ TOOL_DEDUPLICATOR = "01_deduplicator" TOOL_FORMAT_STANDARDIZER = "03_format_standardizer" +# Stable fix-action ids. These name the algorithm that resolves a finding; +# the normalize layer dispatches on this id. Keep in sync with fixes.py. +FIX_TRIM_WHITESPACE = "trim_whitespace" +FIX_STRIP_NBSP = "strip_nbsp_unicode_whitespace" +FIX_STRIP_ZERO_WIDTH = "strip_zero_width" +FIX_FOLD_SMART_PUNCT = "fold_smart_punctuation" +FIX_CLEAN_HEADERS = "clean_headers" +FIX_NORMALIZE_LINE_ENDINGS = "normalize_line_endings" +FIX_STRIP_BOM = "strip_bom" +FIX_STRIP_NUL = "strip_nul" +FIX_FOLD_SMART_QUOTES_BYTE = "fold_smart_quotes_byte" +FIX_REPAIR_UNQUOTED_DELIM = "repair_unquoted_delimiters" +FIX_LOWERCASE_EMAIL = "lowercase_email_column" +FIX_REPLACE_NULL_SENTINELS = "replace_null_sentinels" +FIX_REPAIR_MOJIBAKE = "repair_mojibake" +FIX_NONE = "" # informational — nothing to apply + +# Replacement character (U+FFFD) inserted when a decoder gave up on a byte. +# Anything more than a tiny ratio of it in the loaded text is a strong +# signal that the encoding was wrong. +_REPLACEMENT_CHAR = "�" + + @dataclass class Finding: """One issue the analyzer surfaced. @@ -47,6 +71,16 @@ class Finding: severity ``"info"`` (FYI), ``"warn"`` (likely needs cleanup), ``"error"`` (will block downstream work). + confidence + ``"high"`` — round-trip-safe algorithmic fix, eligible for auto-fix. + ``"medium"`` — right call in the common case but has known + false-positive shapes; user should preview before applying. + ``"low"`` — heuristic; the wrong call corrupts data; opt-in only. + Independent of severity: a ``warn`` finding can be high-confidence + (NBSP strip) and an ``info`` finding can be low-confidence (mojibake). + fix_action + Stable id naming the algorithm that resolves this finding. Empty + string for informational findings with no associated fix. tool Tool id that can address the finding, or empty string for purely informational findings. 
@@ -69,6 +103,13 @@ class Finding: description: str column: Optional[str] = None samples: list[tuple[int, str, str]] = field(default_factory=list) + confidence: Confidence = "high" + fix_action: str = FIX_NONE + # True when the fix already ran during the pre-parse repair pass + # (e.g. BOM strip, byte-level smart-quote fold). The gate treats these + # as already-resolved; the review page still surfaces them so the + # user can see what was auto-applied during read. + pre_applied: bool = False # --------------------------------------------------------------------------- @@ -139,6 +180,8 @@ def _detect_smart_punctuation(df: pd.DataFrame) -> list[Finding]: f"regex patterns." ), samples=sample_rows, + confidence="high", + fix_action=FIX_FOLD_SMART_PUNCT, )] @@ -172,6 +215,8 @@ def _detect_invisible_chars(df: pd.DataFrame) -> list[Finding]: f"join keys." ), samples=nbsp_samples, + confidence="high", + fix_action=FIX_STRIP_NBSP, )) if zw_cells: findings.append(Finding( @@ -184,6 +229,8 @@ def _detect_invisible_chars(df: pd.DataFrame) -> list[Finding]: f"characters (ZWSP, ZWJ, soft hyphen, BOM, bidi marks)." ), samples=zw_samples, + confidence="high", + fix_action=FIX_STRIP_ZERO_WIDTH, )) # Headers carry the same risks; flag separately so the user sees that # df["Email"] vs df["Email​"] is the issue. @@ -208,6 +255,8 @@ def _detect_invisible_chars(df: pd.DataFrame) -> list[Finding]: f"df['col'] lookups." ), samples=[(0, h, h) for h in bad_headers[:5]], + confidence="high", + fix_action=FIX_CLEAN_HEADERS, )) return findings @@ -235,6 +284,8 @@ def _detect_whitespace_padding(df: pd.DataFrame) -> list[Finding]: f"multi-space internal runs. Common cause of failed joins." ), samples=samples, + confidence="high", + fix_action=FIX_TRIM_WHITESPACE, )] @@ -264,6 +315,8 @@ def _detect_null_like_sentinels(df: pd.DataFrame) -> list[Finding]: f"counts as missing in the missing-value handler." 
), samples=samples, + confidence="medium", + fix_action=FIX_REPLACE_NULL_SENTINELS, )] @@ -290,6 +343,8 @@ def _detect_mojibake(df: pd.DataFrame) -> list[Finding]: f"patterns (Ã©, â€™, etc.). Auto-repair is opt-in (Tier 2)." ), samples=samples, + confidence="low", + fix_action=FIX_REPAIR_MOJIBAKE, )] @@ -316,6 +371,8 @@ def _detect_mixed_case_email(df: pd.DataFrame) -> list[Finding]: ), column=col, samples=samples, + confidence="medium", + fix_action=FIX_LOWERCASE_EMAIL, )) return findings @@ -362,6 +419,8 @@ def _detect_near_duplicates(df: pd.DataFrame) -> list[Finding]: f"Run the deduplicator to merge or remove." ), samples=samples, + confidence="medium", + fix_action=FIX_NONE, # routed to dedup tool, not auto-fixed here )] @@ -397,23 +456,60 @@ def _detect_leading_zero_ids(df: pd.DataFrame) -> list[Finding]: ), column=str(col), samples=samples, + confidence="low", + fix_action=FIX_NONE, # informational only )) return findings +def _count_row_terminators(raw: bytes) -> tuple[int, int, int]: + """Count CRLF / LF / CR sequences that act as *row* terminators. + + Walks the bytes tracking quoted-region state so that line breaks + inside multi-line quoted cells (e.g. an address column) are not + counted. Without this, files that legitimately have CRLF at row + boundaries plus LF inside quoted cells get false-positive + ``mixed_line_endings`` findings. + """ + n_crlf = n_lf = n_cr = 0 + in_quotes = False + i = 0 + n = len(raw) + while i < n: + b = raw[i] + if b == 0x22: # ASCII double quote — toggles quoted region. + # Doubled quote inside a quoted cell is an escape, not an exit. 
+ if in_quotes and i + 1 < n and raw[i + 1] == 0x22: + i += 2 + continue + in_quotes = not in_quotes + i += 1 + continue + if not in_quotes: + if b == 0x0D: # CR + if i + 1 < n and raw[i + 1] == 0x0A: + n_crlf += 1 + i += 2 + continue + n_cr += 1 + elif b == 0x0A: # LF + n_lf += 1 + i += 1 + return n_crlf, n_lf, n_cr + + def _detect_mixed_line_endings(raw: bytes) -> list[Finding]: - """Flag files that mix CRLF, LF, and bare CR line terminators. + """Flag files that mix CRLF, LF, and bare CR row terminators. Mixed endings are a classic disaster pattern after multi-source concat - (Windows + macOS + Linux exports stitched together). Operates on raw + (Windows + macOS + Linux exports stitched together). Counts only the + terminators that act as row separators, so embedded newlines inside + quoted multi-line cells don't create false positives. Operates on raw bytes only — DataFrame-mode :func:`analyze` skips this detector. """ if not raw: return [] - n_crlf = raw.count(b"\r\n") - # Count standalone \r and \n (not part of \r\n) by subtracting overlaps. - n_lf = raw.count(b"\n") - n_crlf - n_cr = raw.count(b"\r") - n_crlf + n_crlf, n_lf, n_cr = _count_row_terminators(raw) kinds_present = sum(1 for n in (n_crlf, n_lf, n_cr) if n > 0) if kinds_present <= 1: return [] @@ -434,6 +530,53 @@ def _detect_mixed_line_endings(raw: bytes) -> list[Finding]: f"({', '.join(breakdown)}). Naive splits on one style produce " f"ghost rows or merged lines. Run the text cleaner to normalize." ), + confidence="high", + fix_action=FIX_NORMALIZE_LINE_ENDINGS, + )] + + +def _detect_encoding_uncertainty(df: pd.DataFrame) -> list[Finding]: + """Flag DataFrames whose loaded text contains U+FFFD replacement chars. + + The replacement character is what Python's decoder substitutes for + bytes it could not interpret under ``errors="replace"``. 
Any non-zero + count is a strong signal that the encoding picked by the loader was + wrong for at least part of the file — classic lying-BOM, mixed-encoding, + or wrong-codepage symptom. The user has to pick: re-upload with an + explicit encoding, or accept the loss. + """ + affected_cells = 0 + sample_rows: list[tuple[int, str, str]] = [] + bad_headers: list[str] = [] + for col in df.columns: + if isinstance(col, str) and _REPLACEMENT_CHAR in col: + bad_headers.append(col) + for row_idx, val in enumerate(df[col].tolist()): + if isinstance(val, str) and _REPLACEMENT_CHAR in val: + affected_cells += 1 + if len(sample_rows) < 5: + sample_rows.append((row_idx, str(col), val)) + if not affected_cells and not bad_headers: + return [] + location = [] + if affected_cells: + location.append(f"{affected_cells} cell(s)") + if bad_headers: + location.append(f"{len(bad_headers)} header(s)") + return [Finding( + id="encoding_uncertain", + severity="error", + tool="", + count=affected_cells + len(bad_headers), + description=( + f"{' and '.join(location)} contain U+FFFD replacement characters, " + f"which means the file's encoding could not be decoded cleanly. " + f"Re-upload with an explicit encoding (e.g. cp1252, latin-1) " + f"or fix the source. Continuing risks silent data loss." + ), + samples=sample_rows, + confidence="low", + fix_action=FIX_NONE, )] @@ -455,6 +598,9 @@ def _findings_from_repair(repair: RepairResult) -> list[Finding]: tool=TOOL_TEXT_CLEANER, count=1, description="UTF-8 BOM at file start was removed before parsing.", + confidence="high", + fix_action=FIX_STRIP_BOM, + pre_applied=True, )) if "strip_nul" in summary: nul_action = next(a for a in repair.actions if a.kind == "strip_nul") @@ -467,6 +613,9 @@ def _findings_from_repair(repair: RepairResult) -> list[Finding]: f"Embedded NUL bytes in the file were stripped before " f"parsing ({nul_action.detail})." 
), + confidence="high", + fix_action=FIX_STRIP_NUL, + pre_applied=True, )) if "fold_smart_quote" in summary: action = next(a for a in repair.actions if a.kind == "fold_smart_quote") @@ -479,6 +628,55 @@ def _findings_from_repair(repair: RepairResult) -> list[Finding]: f"Smart double quotes were folded to ASCII before parsing " f"({action.detail})." ), + confidence="high", + fix_action=FIX_FOLD_SMART_QUOTES_BYTE, + pre_applied=True, + )) + if "normalize_line_endings" in summary: + action = next(a for a in repair.actions if a.kind == "normalize_line_endings") + findings.append(Finding( + id="csv_line_endings_normalized", + severity="info", + tool=TOOL_TEXT_CLEANER, + count=1, + description=( + f"Line endings were normalized to LF before parsing " + f"({action.detail})." + ), + confidence="high", + fix_action=FIX_NORMALIZE_LINE_ENDINGS, + pre_applied=True, + )) + if "transcode_to_utf8" in summary: + action = next(a for a in repair.actions if a.kind == "transcode_to_utf8") + findings.append(Finding( + id="csv_transcoded_to_utf8", + severity="info", + tool="", + count=1, + description=( + f"File was transcoded from a wide encoding to UTF-8 before " + f"parsing ({action.detail})." + ), + confidence="high", + fix_action=FIX_NONE, + pre_applied=True, + )) + if "decode_replaced" in summary: + action = next(a for a in repair.actions if a.kind == "decode_replaced") + findings.append(Finding( + id="encoding_decode_failed", + severity="error", + tool="", + count=1, + description=( + f"Some bytes could not be decoded under the detected " + f"encoding ({action.detail}). Replacement characters " + f"(U+FFFD) were inserted; the file likely uses a different " + f"encoding or mixes encodings. Re-upload with --encoding." 
+ ), + confidence="low", + fix_action=FIX_NONE, )) if "quote_unquoted_delim" in summary: n = summary["quote_unquoted_delim"] @@ -491,6 +689,9 @@ def _findings_from_repair(repair: RepairResult) -> list[Finding]: f"{n} row(s) had a delimiter inside an unquoted field " f"(e.g. '$1,500.00') and were merged during pre-parse repair." ), + confidence="medium", + fix_action=FIX_REPAIR_UNQUOTED_DELIM, + pre_applied=True, )) if repair.unrepairable_lines: n = len(repair.unrepairable_lines) @@ -504,6 +705,8 @@ def _findings_from_repair(repair: RepairResult) -> list[Finding]: f"left as-is. Inspect lines: " f"{repair.unrepairable_lines[:10]}" ), + confidence="low", + fix_action=FIX_NONE, )) return findings @@ -517,6 +720,7 @@ def analyze( *, sample_rows: int = 1000, repair_result: Optional[RepairResult] = None, + encoding_override: Optional[str] = None, ) -> list[Finding]: """Run all detectors against *source* and return a list of findings. @@ -533,11 +737,17 @@ def analyze( Optional :class:`RepairResult` from a prior pre-parse pass; used to synthesize ``csv_*`` findings so the user sees what the parser quietly fixed. + encoding_override + When set, skip charset detection and decode with this encoding + instead. Used by the Review page to let the user correct + misdetections (cp1250-vs-cp1252 ambiguity, KOI8-R surfacing as + Shift_JIS, etc.). Only applies when *source* is a path. """ raw_for_byte_scan: Optional[bytes] = None if isinstance(source, (str, Path)): df, internal_repair, raw_for_byte_scan = _load_for_analysis( Path(source), sample_rows=sample_rows, + encoding_override=encoding_override, ) # Caller-supplied repair_result wins over the internally produced one, # since the caller may have used non-default repair flags. 
@@ -547,10 +757,36 @@ def analyze( df = source.head(sample_rows).copy() if len(source) > sample_rows else source.copy() findings: list[Finding] = [] + if raw_for_byte_scan is not None and not raw_for_byte_scan.strip(): + findings.append(Finding( + id="empty_input", + severity="error", + tool="", + count=0, + description="Input file is empty (zero bytes or whitespace only).", + confidence="low", + fix_action=FIX_NONE, + )) + return findings + if df.empty and df.columns.empty and raw_for_byte_scan is not None: + # Non-empty bytes but the parser couldn't extract a header row. + findings.append(Finding( + id="empty_input", + severity="error", + tool="", + count=0, + description=( + "Input file has no parseable rows or columns " + "(only line endings, BOM, or whitespace)." + ), + confidence="low", + fix_action=FIX_NONE, + )) if repair_result is not None: findings.extend(_findings_from_repair(repair_result)) if raw_for_byte_scan is not None: findings.extend(_detect_mixed_line_endings(raw_for_byte_scan)) + findings.extend(_detect_encoding_uncertainty(df)) findings.extend(_detect_smart_punctuation(df)) findings.extend(_detect_invisible_chars(df)) findings.extend(_detect_whitespace_padding(df)) @@ -563,7 +799,7 @@ def analyze( def _load_for_analysis( - path: Path, *, sample_rows: int, + path: Path, *, sample_rows: int, encoding_override: Optional[str] = None, ) -> tuple[pd.DataFrame, Optional[RepairResult], Optional[bytes]]: """Read just enough of *path* to scan, with the same robust pre-parse repair the tool pages will use. @@ -571,6 +807,12 @@ def _load_for_analysis( Returns ``(df, repair_result, raw_bytes)``. The repair result and raw bytes are *None* for Excel files since the byte-level repair step (BOM/NUL/smart-quote folding) and line-ending scan are CSV-specific. + An empty CSV returns an empty DataFrame plus the (empty) raw bytes; + the caller synthesizes an ``empty_input`` finding from that. 
+ + When *encoding_override* is set, it replaces the detected encoding + entirely — the user has explicitly told us what the file is. The + delimiter is still detected (it's separate from encoding choice). """ suffix = path.suffix.lower() if suffix in (".xlsx", ".xls"): @@ -579,17 +821,24 @@ def _load_for_analysis( nrows=sample_rows, ) return df, None, None - enc = detect_encoding(path) - delim = detect_delimiter(path, enc) raw = path.read_bytes() + if not raw.strip(): + return pd.DataFrame(), None, raw + enc = encoding_override or detect_encoding(path) + delim = detect_delimiter(path, enc) repair = repair_bytes(raw, encoding=enc, delimiter=delim) import io as _io - df = pd.read_csv( - _io.BytesIO(repair.repaired_bytes), - encoding="utf-8", delimiter=delim, - dtype=str, keep_default_na=False, on_bad_lines="warn", - nrows=sample_rows, - ) + try: + df = pd.read_csv( + _io.BytesIO(repair.repaired_bytes), + encoding="utf-8", delimiter=delim, + dtype=str, keep_default_na=False, on_bad_lines="warn", + nrows=sample_rows, + ) + except pd.errors.EmptyDataError: + # File is non-empty bytes but had no parseable columns (e.g. only + # whitespace, only a BOM, only line endings). Treat as empty. + return pd.DataFrame(), repair, raw return df, repair, raw @@ -598,6 +847,9 @@ def to_dict(finding: Finding) -> dict[str, Any]: return { "id": finding.id, "severity": finding.severity, + "confidence": finding.confidence, + "fix_action": finding.fix_action, + "pre_applied": finding.pre_applied, "tool": finding.tool, "count": finding.count, "description": finding.description, diff --git a/src/core/fixes.py b/src/core/fixes.py new file mode 100644 index 0000000..421fc7e --- /dev/null +++ b/src/core/fixes.py @@ -0,0 +1,296 @@ +"""Registry of fix algorithms keyed by ``fix_action`` id. + +Every :class:`~src.core.analyze.Finding` declares a ``fix_action`` naming +the algorithm that resolves it. The normalize layer dispatches on that id +into this registry. 
Each fix function takes a DataFrame plus an optional +``payload`` dict (for fixes that need user-supplied parameters, e.g. the +custom null-sentinel list) and returns ``(new_df, n_cells_changed)``. + +Fixes here operate on the DataFrame after the byte-level pre-parse repair +has already run (BOM, NUL, line endings, smart-quote bytes, unquoted +delimiters). Anything in this layer is reversible from the audit log; a +lossy fix (e.g. mojibake repair) is gated to ``confidence="low"`` and +requires explicit user opt-in via the review page. +""" + +from __future__ import annotations + +import re +import unicodedata +from typing import Any, Callable, Optional + +import pandas as pd + +from .text_clean import ( + _SMART_TRANS, + _ZERO_WIDTH_RE, + _CONTROL_RE, + _WHITESPACE_RUN_RE, + _looks_structured, + strip_bom, + normalize_line_endings as _norm_le_str, +) +# The package __init__ re-exports the analyze() function under the name +# `analyze`, which shadows the submodule attribute. Reach the module via +# sys.modules to get its private constants and FIX_* identifiers. +import sys as _sys +import src.core.analyze # noqa: F401 (registers the submodule) +_a = _sys.modules["src.core.analyze"] + +# NBSP / Unicode-whitespace -> ASCII space. Mirrors the analyzer's +# detection set (analyze._NBSP_LIKE_CHARS) so what the detector flags is +# exactly what this fix replaces. 
+_NBSP_TRANS = str.maketrans({c: " " for c in _a._NBSP_LIKE_CHARS}) + + +FixFn = Callable[[pd.DataFrame, Optional[dict]], tuple[pd.DataFrame, int]] + +_REGISTRY: dict[str, FixFn] = {} + + +def register(action_id: str) -> Callable[[FixFn], FixFn]: + def deco(fn: FixFn) -> FixFn: + _REGISTRY[action_id] = fn + return fn + return deco + + +def get_fix(action_id: str) -> Optional[FixFn]: + return _REGISTRY.get(action_id) + + +def available_actions() -> list[str]: + return sorted(_REGISTRY) + + +# --------------------------------------------------------------------------- +# Helpers +# --------------------------------------------------------------------------- + +def _apply_to_strings( + df: pd.DataFrame, fn: Callable[[str], str], *, include_headers: bool = False, +) -> tuple[pd.DataFrame, int]: + """Apply *fn* to every string cell. Returns (new_df, cells_changed). + + Headers are left untouched unless *include_headers* is true. By default + the dedicated header-cleaning fix owns that scope so the gate's audit + log records header changes separately.
+ """ + out = df.copy() + changed = 0 + for col in out.columns: + if not pd.api.types.is_object_dtype(out[col]) and not pd.api.types.is_string_dtype(out[col]): + continue + new_col = [] + for v in out[col]: + if isinstance(v, str): + nv = fn(v) + if nv != v: + changed += 1 + new_col.append(nv) + else: + new_col.append(v) + out[col] = new_col + if include_headers: + new_headers = [] + for h in out.columns: + if isinstance(h, str): + nh = fn(h) + if nh != h: + changed += 1 + new_headers.append(nh) + else: + new_headers.append(h) + out.columns = new_headers + return out, changed + + +# --------------------------------------------------------------------------- +# High-confidence fixes +# --------------------------------------------------------------------------- + +@register(_a.FIX_TRIM_WHITESPACE) +def trim_whitespace(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]: + """Strip leading/trailing whitespace; collapse internal runs in text cells. + + Numeric/date/phone-shaped cells get only outer trim — internal spacing + in those is often semantic (`1 234`, `(555) 123-4567`). 
+ """ + def fix(s: str) -> str: + trimmed = s.strip() + if not trimmed or _looks_structured(trimmed): + return trimmed + return _WHITESPACE_RUN_RE.sub(" ", trimmed) + return _apply_to_strings(df, fix) + + +@register(_a.FIX_STRIP_NBSP) +def strip_nbsp(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]: + """Replace NBSP and other Unicode spaces with ASCII space.""" + def fix(s: str) -> str: + return s.translate(_NBSP_TRANS) + return _apply_to_strings(df, fix) + + +@register(_a.FIX_STRIP_ZERO_WIDTH) +def strip_zero_width(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]: + """Remove zero-width and invisible characters from cells.""" + def fix(s: str) -> str: + return _ZERO_WIDTH_RE.sub("", s) + return _apply_to_strings(df, fix) + + +@register(_a.FIX_FOLD_SMART_PUNCT) +def fold_smart_punctuation(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]: + """ASCII-fy curly quotes, em/en dashes, ellipsis, primes.""" + def fix(s: str) -> str: + return s.translate(_SMART_TRANS) + return _apply_to_strings(df, fix) + + +@register(_a.FIX_CLEAN_HEADERS) +def clean_headers(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]: + """Apply the same per-cell hygiene to column headers. + + Fixes the df['Email'] vs df['Email '] class of bug. + """ + def fix(s: str) -> str: + s = strip_bom(s) + s = s.translate(_NBSP_TRANS) + s = _ZERO_WIDTH_RE.sub("", s) + s = s.translate(_SMART_TRANS) + s = _CONTROL_RE.sub("", s) + return s.strip() + out = df.copy() + new_headers = [] + changed = 0 + for h in out.columns: + if isinstance(h, str): + nh = fix(h) + if nh != h: + changed += 1 + new_headers.append(nh) + else: + new_headers.append(h) + out.columns = new_headers + return out, changed + + +@register(_a.FIX_NORMALIZE_LINE_ENDINGS) +def normalize_line_endings(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]: + """Normalize CRLF / bare CR inside cells to LF. 
+ + File-level line endings are handled by ``repair_bytes`` before parsing; + this fix covers embedded multi-line cells (case 11 in the corpus). + """ + return _apply_to_strings(df, _norm_le_str) + + +# --------------------------------------------------------------------------- +# Already-applied fixes (no-op at this layer; kept so the audit log is +# uniform and the gate can reason about them) +# --------------------------------------------------------------------------- + +@register(_a.FIX_STRIP_BOM) +def strip_bom_noop(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]: + """BOM is stripped during read by repair_bytes; nothing to do here.""" + return df, 0 + + +@register(_a.FIX_STRIP_NUL) +def strip_nul_noop(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]: + """NUL is stripped during read by repair_bytes.""" + return df, 0 + + +@register(_a.FIX_FOLD_SMART_QUOTES_BYTE) +def fold_smart_quotes_byte_noop(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]: + """Byte-level smart-quote fold runs in repair_bytes.""" + return df, 0 + + +@register(_a.FIX_REPAIR_UNQUOTED_DELIM) +def repair_unquoted_delim_noop(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]: + """Per-row delimiter repair runs in repair_bytes.""" + return df, 0 + + +# --------------------------------------------------------------------------- +# Medium-confidence fixes (require user confirmation in the review flow) +# --------------------------------------------------------------------------- + +@register(_a.FIX_LOWERCASE_EMAIL) +def lowercase_email(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]: + """Lowercase values in the column named in *payload['column']*. + + Defaults to lowercasing every column whose name matches the email + heuristic if no payload is given. 
+ """ + out = df.copy() + payload = payload or {} + target_cols: list[str] + if "column" in payload: + target_cols = [payload["column"]] + else: + target_cols = [ + c for c in out.columns + if isinstance(c, str) and _a._EMAIL_LIKE_COL.search(c) + ] + changed = 0 + for col in target_cols: + if col not in out.columns: + continue + new_col = [] + for v in out[col]: + if isinstance(v, str): + nv = v.lower() + if nv != v: + changed += 1 + new_col.append(nv) + else: + new_col.append(v) + out[col] = new_col + return out, changed + + +@register(_a.FIX_REPLACE_NULL_SENTINELS) +def replace_null_sentinels(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]: + """Replace user-approved null-like sentinel strings with empty string. + + Payload: ``{"sentinels": ["N/A", "n/a", "nan", ...]}``. Defaults to + the analyzer's built-in set when no payload is given. Comparison is + case-insensitive, whitespace-trimmed. + """ + payload = payload or {} + sentinels = payload.get("sentinels") + if sentinels is None: + sentinels = list(_a._NULL_LIKE) + sentinel_set = {s.strip().lower() for s in sentinels} + + def fix(s: str) -> str: + return "" if s.strip().lower() in sentinel_set else s + + return _apply_to_strings(df, fix) + + +# --------------------------------------------------------------------------- +# Low-confidence fixes (off by default; user-only) +# --------------------------------------------------------------------------- + +@register(_a.FIX_REPAIR_MOJIBAKE) +def repair_mojibake(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]: + """Heuristic UTF-8-as-cp1252 mojibake repair via ftfy when available. + + Falls back to a no-op (returning ``(df, 0)``) when ftfy is not + installed; the review page surfaces that as "library missing — install + ftfy to enable" so we never silently corrupt data with a hand-rolled + heuristic. 
+ """ + try: + import ftfy # type: ignore + except ImportError: + return df, 0 + + def fix(s: str) -> str: + return ftfy.fix_text(s) + + return _apply_to_strings(df, fix) diff --git a/src/core/io.py b/src/core/io.py index dd45b87..3795ac8 100644 --- a/src/core/io.py +++ b/src/core/io.py @@ -34,6 +34,16 @@ def detect_encoding(path: Path, sample_bytes: int = 65_536) -> str: if raw[:2] in (b"\xff\xfe", b"\xfe\xff"): return "utf-16" + # Strict UTF-8 wins. charset_normalizer fingerprints small files + # dominated by short non-ASCII sequences (e.g. zero-width chars at + # U+200B-class) as mac_latin2 / cp1250 / similar — but if the bytes + # decode cleanly as UTF-8, that's the right answer regardless. + try: + raw.decode("utf-8") + return "utf-8" + except UnicodeDecodeError: + pass + result = from_bytes(raw).best() if result is None: return "utf-8" @@ -416,6 +426,7 @@ def repair_bytes( fold_quotes: bool = True, strip_nul: bool = True, repair_delims: bool = True, + normalize_line_endings: bool = True, ) -> RepairResult: """Pre-parse repair on a raw delimited file. @@ -423,8 +434,11 @@ def repair_bytes( 1. Strip a leading UTF-8 BOM. 2. Strip embedded NUL bytes (the C parser truncates fields at NUL). - 3. Fold smart double quotes (curly, guillemet, double-prime) to ASCII ``"``. - 4. Per-row repair when one rogue delimiter is embedded in a field that + 3. Normalize line endings (CRLF and bare CR to LF). Bare CR confuses + the C parser ("new-line character seen in unquoted field"); the + text-cleaner contract also calls for LF inside multi-line cells. + 4. Fold smart double quotes (curly, guillemet, double-prime) to ASCII ``"``. + 5. Per-row repair when one rogue delimiter is embedded in a field that looks like currency or thousands-grouped digits — quote that field. 
Single curly quotes and other punctuation are deferred to the cell-level @@ -434,12 +448,41 @@ def repair_bytes( unrepairable: list[int] = [] data = raw + # If the input is a UTF-16 / UTF-32 byte stream, transcode it to UTF-8 + # up front. UTF-16 ASCII codepoints carry NUL as half of every 16-bit + # unit, so the byte-level NUL-strip below would shred the file. Doing + # the transcode here means the rest of the repair pipeline operates + # on UTF-8 bytes regardless of the source encoding. + enc_norm = encoding.lower().replace("-", "_") if encoding else "" + is_wide = enc_norm.startswith(("utf_16", "utf_32")) + # UTF-16 LE without a BOM that survives detection lands here too. + if is_wide: + try: + decoded = data.decode(encoding) + except (UnicodeDecodeError, LookupError): + decoded = data.decode("utf-8", errors="replace") + actions.append(RepairAction( + kind="decode_replaced", line=None, + detail=f"decode errors under {encoding}; replaced with U+FFFD", + )) + # Strip a leading UTF-16 BOM (decoded as U+FEFF) if present. + if decoded and decoded[0] == "\ufeff": + decoded = decoded[1:] + data = decoded.encode("utf-8") + actions.append(RepairAction( + kind="transcode_to_utf8", line=None, + detail=f"transcoded {encoding} -> utf-8 ({len(raw)}B -> {len(data)}B)", + )) + encoding = "utf-8" # downstream steps now operate on UTF-8 + # 1. BOM if data.startswith(b"\xef\xbb\xbf"): data = data[3:] actions.append(RepairAction(kind="strip_bom", line=None, detail="UTF-8 BOM removed")) - # 2. NUL + # 2. NUL — only meaningful for single-byte / UTF-8 encodings. We've + # already transcoded UTF-16/32 to UTF-8 above, so NUL here is genuine + # corruption (truncated C strings, half-binary exports), not encoding. if strip_nul and b"\x00" in data: before = data.count(b"\x00") data = data.replace(b"\x00", b"") @@ -448,6 +491,26 @@ def repair_bytes( detail=f"removed {before} NUL byte(s)", )) + # 3. Line endings: CRLF and bare CR -> LF. CRLF first so we don't + # double-substitute.
Done at the byte layer so it survives through + # any subsequent decode failure. + if normalize_line_endings and (b"\r" in data): + n_crlf = data.count(b"\r\n") + data = data.replace(b"\r\n", b"\n") + n_cr = data.count(b"\r") + if n_cr: + data = data.replace(b"\r", b"\n") + if n_crlf or n_cr: + parts = [] + if n_crlf: + parts.append(f"{n_crlf} CRLF") + if n_cr: + parts.append(f"{n_cr} bare CR") + actions.append(RepairAction( + kind="normalize_line_endings", line=None, + detail=f"normalized {', '.join(parts)} to LF", + )) + # Decode for character-level work. try: text = data.decode(encoding) diff --git a/src/core/normalize.py b/src/core/normalize.py new file mode 100644 index 0000000..17d49c5 --- /dev/null +++ b/src/core/normalize.py @@ -0,0 +1,249 @@ +"""CSV-normalization gate. + +A file enters the tool pages only after passing the gate. The gate has +two paths: + +1. **Auto-fix** — apply every algorithm flagged ``confidence="high"``. +2. **Review** — show the user a preview of medium/low-confidence findings + and accept an explicit per-finding decision before applying. + +The gate produces a :class:`NormalizationResult` containing the cleaned +DataFrame, the bytes representation, and a structured audit log of every +fix that ran. Tool pages are guarded by :func:`is_normalized` against +the result and the original list of findings. +""" + +from __future__ import annotations + +import io +from dataclasses import dataclass, field +from pathlib import Path +from typing import Literal, Optional + +import pandas as pd + +from .analyze import Finding, analyze +from .fixes import get_fix + + +DecisionAction = Literal["auto", "skip", "modified"] + + +@dataclass +class Decision: + """One user-recorded choice for a finding. + + Attributes + ---------- + finding_id + The :class:`Finding` id this decision applies to. 
+ action + ``"auto"`` to run the registered fix as-is, ``"skip"`` to leave + it alone (the gate logs it as waived), ``"modified"`` to run the + fix with a custom payload (e.g. user-edited null sentinel list). + payload + Optional kwargs forwarded to the fix function. Required for + ``"modified"``; ignored for ``"skip"``. + """ + + finding_id: str + action: DecisionAction + payload: Optional[dict] = None + + +@dataclass +class FixApplied: + """One fix that ran during a gate pass.""" + + finding_id: str + fix_action: str + cells_changed: int + decision: DecisionAction + + +@dataclass +class NormalizationResult: + """Output of a gate pass. + + Attributes + ---------- + cleaned_df + DataFrame after every applied fix. The downstream tool pages + consume this directly. + cleaned_bytes + UTF-8 encoded CSV of *cleaned_df* — the canonical artifact for + round-tripping into another tool that re-parses. + applied + Audit log of fixes that ran. + skipped_findings + Findings the user explicitly waived (decision = ``"skip"``). + pending_findings + Findings still requiring a user decision before the gate is + considered passed. Empty on a successful gate pass. + blocking_findings + Severity=error findings that have no decision and no auto-fix. + Non-empty means the gate is blocked and the file cannot enter + tool pages. 
+ """ + + cleaned_df: pd.DataFrame + cleaned_bytes: bytes + applied: list[FixApplied] = field(default_factory=list) + skipped_findings: list[Finding] = field(default_factory=list) + pending_findings: list[Finding] = field(default_factory=list) + blocking_findings: list[Finding] = field(default_factory=list) + + @property + def passed(self) -> bool: + return not self.pending_findings and not self.blocking_findings + + +def _df_to_bytes(df: pd.DataFrame) -> bytes: + buf = io.StringIO() + df.to_csv(buf, index=False, lineterminator="\n") + return buf.getvalue().encode("utf-8") + + +def _is_actionable(f: Finding) -> bool: + """Does this finding still need attention from the gate? + + Pre-applied fixes (BOM strip, etc. — already done during read) are + not actionable. Findings without a registered fix_action are not + actionable here either; severity=error ones become blockers. + """ + if f.pre_applied: + return False + if not f.fix_action: + return False + return get_fix(f.fix_action) is not None + + +def auto_fix( + df: pd.DataFrame, findings: list[Finding], +) -> NormalizationResult: + """Apply every fix flagged ``confidence="high"``. + + Returns a :class:`NormalizationResult`. Medium / low / unknown + confidence findings are surfaced as ``pending_findings`` and the + result is *not* considered passed until the user decides on them. + """ + decisions: list[Decision] = [ + Decision(finding_id=f.id, action="auto") + for f in findings + if _is_actionable(f) and f.confidence == "high" + ] + return apply_decisions(df, findings, decisions) + + +def apply_decisions( + df: pd.DataFrame, findings: list[Finding], decisions: list[Decision], +) -> NormalizationResult: + """Apply *decisions* to *df* in finding order. 
+ + Findings with no matching decision are categorized: + + * ``severity=error`` -> ``blocking_findings`` + * Otherwise -> ``pending_findings`` (user still owes us a decision) + + Pre-applied findings are recorded once in the audit log with + ``cells_changed=0`` so callers can render "what was already done." + """ + decision_by_id = {d.finding_id: d for d in decisions} + + out = df.copy() + applied: list[FixApplied] = [] + skipped: list[Finding] = [] + pending: list[Finding] = [] + blocking: list[Finding] = [] + + for f in findings: + if f.pre_applied: + applied.append(FixApplied( + finding_id=f.id, + fix_action=f.fix_action, + cells_changed=0, + decision="auto", + )) + continue + + decision = decision_by_id.get(f.id) + if decision is None: + if f.severity == "error": + blocking.append(f) + elif _is_actionable(f): + pending.append(f) + # else: informational with no fix; ignore. + continue + + if decision.action == "skip": + skipped.append(f) + continue + + fix_fn = get_fix(f.fix_action) + if fix_fn is None: + # Decision references a fix we don't have; treat as pending. + pending.append(f) + continue + + payload = decision.payload + # Per-column fixes (lowercase_email) can carry the column from + # the finding when the user didn't override it. + if f.column and (payload is None or "column" not in payload): + payload = {**(payload or {}), "column": f.column} + + out, changed = fix_fn(out, payload) + applied.append(FixApplied( + finding_id=f.id, + fix_action=f.fix_action, + cells_changed=changed, + decision=decision.action, + )) + + return NormalizationResult( + cleaned_df=out, + cleaned_bytes=_df_to_bytes(out), + applied=applied, + skipped_findings=skipped, + pending_findings=pending, + blocking_findings=blocking, + ) + + +def is_normalized( + findings: list[Finding], result: Optional[NormalizationResult], +) -> bool: + """True iff *result* satisfies the gate against *findings*. 
+ + The gate passes when: + + * A result exists, and + * It has no blocking findings, and + * It has no pending (undecided) actionable findings. + + Re-run analysis on the cleaned bytes to confirm the high-confidence + detectors no longer fire — that's the contract the tool pages rely + on. Callers who want the cheap check can pass ``result.passed`` + directly; this function is the strict version. + """ + if result is None: + return False + if not result.passed: + return False + # Re-analyze the cleaned bytes; high-confidence detectors must be silent. + rerun = analyze(result.cleaned_df) + for f in rerun: + if f.confidence == "high" and _is_actionable(f): + return False + return True + + +def gate_summary(result: NormalizationResult) -> dict: + """One-line-per-key summary suitable for logging or the CLI.""" + return { + "passed": result.passed, + "fixes_applied": len(result.applied), + "cells_changed": sum(a.cells_changed for a in result.applied), + "skipped": [f.id for f in result.skipped_findings], + "pending": [f.id for f in result.pending_findings], + "blocking": [f.id for f in result.blocking_findings], + } diff --git a/src/gui/components.py b/src/gui/components.py index 59c47a3..f02b6a0 100644 --- a/src/gui/components.py +++ b/src/gui/components.py @@ -1096,6 +1096,49 @@ class _StashedUpload: return self._data +def require_normalization_gate() -> None: + """Block the calling tool page until the upload has passed the gate. + + Tool pages should call this immediately after their imports. When the + current session upload has not been normalized — no + ``normalization_result``, the result is for a different upload, or the + result didn't pass — the user is shown a banner and a button to jump + to the Review page; the rest of the page is short-circuited via + ``st.stop()``. + + Pages that genuinely don't need a clean dataframe (rare) can opt out + by simply not calling this. 
+ """ + import hashlib + has_upload = st.session_state.get("home_uploaded_bytes") is not None + if not has_upload: + # No upload yet — let the page's own uploader handle it; the gate + # will kick in once a file is present. + return + + upload_hash = hashlib.sha256( + st.session_state["home_uploaded_bytes"] + ).hexdigest() + result = st.session_state.get("normalization_result") + matched = ( + result is not None + and st.session_state.get("normalization_for") == upload_hash + and getattr(result, "passed", False) + ) + if matched: + return + + name = st.session_state.get("home_uploaded_name", "the uploaded file") + st.warning( + f"**{name}** must pass the CSV-normalization gate before you can " + f"use this tool. Open the Review page to apply the fixes our " + f"analyzer recommends." + ) + if st.button("Go to Review & Normalize", type="primary"): + st.switch_page("pages/0_Review.py") + st.stop() + + def pickup_or_upload( *, label: str, diff --git a/src/gui/pages/0_Review.py b/src/gui/pages/0_Review.py new file mode 100644 index 0000000..0d0fd5f --- /dev/null +++ b/src/gui/pages/0_Review.py @@ -0,0 +1,675 @@ +"""Review & normalize gate page. + +Sits between the home-page upload and every tool page. Walks the user +through every analyzer finding, lets them auto-fix, preview, customize, +or skip each one, and produces a :class:`NormalizationResult` stashed in +session state. Tool pages refuse to load until this gate has passed. + +State contract +-------------- +Session state read: +* ``home_uploaded_bytes`` / ``home_uploaded_name`` — current upload. +* ``home_findings`` — list of :class:`Finding` from the home-page scan. +* ``review_decisions`` — dict[finding_id, Decision]; user's choices so far. + +Session state written: +* ``review_decisions`` — updated as the user flips controls. +* ``normalization_result`` — :class:`NormalizationResult` after Apply. +* ``normalization_for`` — content hash of the upload the result is for. 
+""" + +from __future__ import annotations + +import hashlib +import io +import sys +from pathlib import Path +from typing import Optional + +import pandas as pd +import streamlit as st + +# Project root on sys.path (mirrors app.py). +_project_root = Path(__file__).resolve().parent.parent.parent.parent +if str(_project_root) not in sys.path: + sys.path.insert(0, str(_project_root)) + +from src.core.analyze import Finding, analyze +from src.core.fixes import get_fix +from src.core.io import detect_encoding, repair_bytes +from src.core.normalize import ( + Decision, + NormalizationResult, + apply_decisions, + auto_fix, + gate_summary, + is_normalized, +) +from src.gui.components import hide_streamlit_chrome + + +# Common single-byte and multi-byte encodings the user might pick to +# correct a misdetection. Ordered by frequency in real-world Western / +# multilingual data; keep the list short — too many options just adds +# noise. The user can type a custom encoding via the "Other" entry. +_OVERRIDE_ENCODINGS = [ + "(detected)", + "utf-8", + "utf-8-sig", + "cp1252", + "iso-8859-1", + "iso-8859-15", + "cp1250", + "iso-8859-2", + "cp1251", + "koi8-r", + "mac-roman", + "shift_jis", + "cp932", + "gb18030", + "big5", + "euc-kr", + "cp949", + "utf-16", + "utf-16-le", + "utf-16-be", + "Other…", +] + + +st.set_page_config(page_title="Review & Normalize", page_icon="🛡️", layout="wide") +hide_streamlit_chrome() + + +# --------------------------------------------------------------------------- +# Helpers +# --------------------------------------------------------------------------- + +def _upload_hash() -> Optional[str]: + data = st.session_state.get("home_uploaded_bytes") + if not data: + return None + return hashlib.sha256(data).hexdigest() + + +def _detected_encoding_for_session() -> Optional[str]: + """Run charset detection on the session bytes via a tmp file.""" + data = st.session_state.get("home_uploaded_bytes") + name = st.session_state.get("home_uploaded_name") or 
"tmp.csv" + if not data: + return None + import tempfile + suffix = "." + name.rsplit(".", 1)[-1] if "." in name else ".csv" + with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as fh: + fh.write(data) + tmp_path = Path(fh.name) + try: + return detect_encoding(tmp_path) + finally: + tmp_path.unlink(missing_ok=True) + + +def _load_df_from_session(encoding_override: Optional[str] = None) -> Optional[pd.DataFrame]: + """Re-parse the session upload through the same pipeline the home page + uses, so the review page operates on identical bytes. + + When *encoding_override* is set, decode with that encoding instead of + UTF-8. The override flows into ``repair_bytes`` so the wide-encoding + transcode and decode_replaced fallback both honor the user's choice. + """ + data = st.session_state.get("home_uploaded_bytes") + name = st.session_state.get("home_uploaded_name") or "" + if not data: + return None + suffix = name.rsplit(".", 1)[-1].lower() if "." in name else "" + if suffix in ("xlsx", "xls"): + return pd.read_excel(io.BytesIO(data), dtype=str, keep_default_na=False) + delim = "\t" if suffix == "tsv" else "," + if delim == ",": + head = data[:4096].decode("utf-8", errors="replace") + for cand in ("\t", ";", "|"): + if head.count(cand) > head.count(",") * 1.5: + delim = cand + break + enc = encoding_override or "utf-8" + repair = repair_bytes(data, encoding=enc, delimiter=delim) + return pd.read_csv( + io.BytesIO(repair.repaired_bytes), + encoding="utf-8", delimiter=delim, + dtype=str, keep_default_na=False, on_bad_lines="warn", + ) + + +def _run_analysis_with_override(encoding_override: Optional[str]) -> list[Finding]: + """Re-run analyze() on the session upload with an encoding override. + + Mirrors components._run_analysis_on_upload but writes the bytes to a + tempfile so analyze() goes through the path-based loader (which is + where the encoding_override hook lives — DataFrame-mode analysis has + nothing to override). 
+ """ + data = st.session_state.get("home_uploaded_bytes") + name = st.session_state.get("home_uploaded_name") or "tmp.csv" + if not data: + return [] + import tempfile + suffix = "." + name.rsplit(".", 1)[-1] if "." in name else ".csv" + with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as fh: + fh.write(data) + tmp_path = Path(fh.name) + try: + return analyze(tmp_path, encoding_override=encoding_override) + finally: + tmp_path.unlink(missing_ok=True) + + +def _confidence_pill(c: str) -> str: + """Streamlit-markdown pill for the confidence tier.""" + palette = {"high": "green", "medium": "orange", "low": "red"} + return f":{palette.get(c, 'gray')}-background[**{c.upper()}**]" + + +def _severity_pill(s: str) -> str: + palette = {"info": "blue", "warn": "orange", "error": "red"} + return f":{palette.get(s, 'gray')}-background[**{s}**]" + + +# --------------------------------------------------------------------------- +# Output options (Advanced — re-encode the cleaned DataFrame for download) +# --------------------------------------------------------------------------- + +# (label_shown_to_user, codec_passed_to_pandas) +_OUTPUT_ENCODINGS = [ + ("UTF-8 (recommended)", "utf-8"), + ("UTF-8 with BOM (Excel)", "utf-8-sig"), + ("Windows-1252 (Western Europe)", "cp1252"), + ("ISO-8859-1 / Latin-1", "iso-8859-1"), + ("ISO-8859-15 / Latin-9", "iso-8859-15"), + ("Windows-1250 (Central Europe)", "cp1250"), + ("ISO-8859-2 / Latin-2", "iso-8859-2"), + ("Windows-1251 (Cyrillic)", "cp1251"), + ("Shift_JIS (Japanese)", "shift_jis"), + ("GB18030 (Chinese)", "gb18030"), + ("Big5 (Traditional Chinese)", "big5"), + ("EUC-KR (Korean)", "euc-kr"), + ("UTF-16 LE with BOM", "utf-16"), +] + +_OUTPUT_DELIMITERS = [ + ("Comma ,", ","), + ("Tab \\t", "\t"), + ("Semicolon ;", ";"), + ("Pipe |", "|"), +] + +_OUTPUT_LINE_TERMINATORS = [ + ("LF — \\n (Unix / web / git default)", "\n"), + ("CRLF — \\r\\n (Windows / classic Excel)", "\r\n"), + ("CR — \\r (classic Mac, very rare)", "\r"), 
+] + + +def _build_output_bytes( + df: pd.DataFrame, + *, + encoding: str, + delimiter: str, + line_terminator: str, +) -> tuple[bytes, Optional[str]]: + """Serialize *df* with the user's output options. + + Returns ``(bytes, error_message)``. ``error_message`` is non-None when + the chosen encoding cannot represent at least one cell — characters + that don't exist in the target codepage are replaced with ``?`` so + the user still gets a download, plus a warning telling them which + target was lossy. + """ + buf = io.StringIO() + df.to_csv(buf, index=False, sep=delimiter, lineterminator=line_terminator) + text = buf.getvalue() + try: + return text.encode(encoding), None + except UnicodeEncodeError: + # Find the first character that fails so the message is useful. + bad: Optional[str] = None + for ch in text: + try: + ch.encode(encoding) + except UnicodeEncodeError: + bad = ch + break + msg = ( + f"Some characters cannot be represented in {encoding}" + + (f" (first offender: {bad!r})" if bad else "") + + ". Falling back to '?' replacement; non-Latin content will be lost." + ) + return text.encode(encoding, errors="replace"), msg + + +def _preview_table(f: Finding, decision_action: str, payload: Optional[dict]) -> Optional[pd.DataFrame]: + """Build a before/after preview from finding samples. + + Runs the registered fix function on each sample value individually so + the user sees exactly what would change. Returns None when no preview + is meaningful (no samples, or no fix registered). + """ + if not f.samples: + return None + fix_fn = get_fix(f.fix_action) + if fix_fn is None: + # No fix to preview; show samples as-is. + return pd.DataFrame( + [{"row": r, "column": c, "value": v} for r, c, v in f.samples] + ) + rows = [] + for r, col, val in f.samples: + # Run the fix on a tiny single-cell DataFrame so payload semantics + # (e.g. lowercase_email's column targeting) are honored. 
+ mini = pd.DataFrame({col: [val]}) + try: + new_df, _ = fix_fn(mini, payload) + new_val = new_df[col].iloc[0] + except Exception as e: + new_val = f"<error: {e}>" + rows.append({"row": r, "column": col, "before": val, "after": new_val}) + return pd.DataFrame(rows) + + +# --------------------------------------------------------------------------- +# Page body +# --------------------------------------------------------------------------- + +st.title("🛡️ Review & Normalize") +st.caption( + "Every finding is shown below with the algorithm that would fix it. " + "Auto-fix the high-confidence ones in one click; preview or customize " + "the rest before applying." +) + +# Pre-flight: nothing to review without an upload. +findings: list[Finding] = st.session_state.get("home_findings") or [] +upload_name = st.session_state.get("home_uploaded_name") + +if not upload_name: + st.warning("No file uploaded. Go back to the home page and upload a CSV or Excel file first.") + if st.button("Back to home"): + st.switch_page("app.py") + st.stop() + +# ---- Encoding picker -------------------------------------------------------- +# +# Charset detection misfires on small files, byte-equivalent codepages +# (cp1252 vs Latin-1 vs cp1250), and content where every byte happens to +# decode under the wrong encoding (KOI8-R bytes that look like Shift_JIS). +# When the user spots mojibake or U+FFFD chars in the findings list, this +# picker is the escape hatch — pick the right encoding, re-run the analyzer. + +with st.container(border=True): + detected_enc = _detected_encoding_for_session() + current_override = st.session_state.get("encoding_override") + suffix = (st.session_state.get("home_uploaded_name") or "") + suffix = suffix.rsplit(".", 1)[-1].lower() if "." in suffix else "" + is_excel = suffix in ("xlsx", "xls") + + st.markdown("**File encoding**") + if is_excel: + st.caption( + "Excel files store text as Unicode internally — encoding override " + "doesn't apply. Skip this section."
+ ) + else: + cap_parts = [f"Detected: `{detected_enc or 'unknown'}`"] + if current_override: + cap_parts.append(f"Currently using: `{current_override}`") + st.caption( + " · ".join(cap_parts) + + " · Override only if you see mojibake (e.g. `Ã©` for `é`) or U+FFFD" + " (`�`) in the findings below." + ) + + col_pick, col_custom, col_apply = st.columns([2, 2, 1]) + + with col_pick: + current_label = current_override or "(detected)" + try: + idx = _OVERRIDE_ENCODINGS.index(current_label) + except ValueError: + idx = _OVERRIDE_ENCODINGS.index("Other…") + chosen = st.selectbox( + "Encoding", + options=_OVERRIDE_ENCODINGS, + index=idx, + key="encoding_override_select", + label_visibility="collapsed", + ) + + custom_value: Optional[str] = None + with col_custom: + if chosen == "Other…": + custom_value = st.text_input( + "Custom encoding (e.g. `cp1257`, `iso-8859-9`)", + value=current_override if current_override and current_override not in _OVERRIDE_ENCODINGS else "", + key="encoding_override_custom", + label_visibility="collapsed", + placeholder="cp1257", + ) + + with col_apply: + if st.button("Re-analyze", use_container_width=True): + if chosen == "(detected)": + new_override = None + elif chosen == "Other…": + new_override = (custom_value or "").strip() or None + else: + new_override = chosen + + # Sanity-check the override actually decodes the bytes. + data = st.session_state.get("home_uploaded_bytes") or b"" + if new_override is not None: + try: + data.decode(new_override, errors="strict") + decode_ok = True + decode_err = None + except (UnicodeDecodeError, LookupError) as e: + decode_ok = False + decode_err = str(e) + else: + decode_ok = True + decode_err = None + + if not decode_ok: + st.warning( + f"`{new_override}` cannot decode this file: {decode_err}. " + f"Re-running anyway with replacement-character fallback so " + f"you can see where the failures are." + ) + + # Re-run analysis with the override and refresh session state. 
+ st.session_state["encoding_override"] = new_override + st.session_state["home_findings"] = _run_analysis_with_override(new_override) + # Drop any prior gate result; the user must re-apply. + st.session_state.pop("normalization_result", None) + st.session_state.pop("normalization_for", None) + st.session_state.pop("review_decisions", None) + st.rerun() + +# Reload findings — the picker above may have just rewritten them. +findings = st.session_state.get("home_findings") or [] + +if not findings: + st.success("✓ No findings to review. The file is already clean — open any tool to begin.") + st.stop() + + +# ---- Top-line counters ------------------------------------------------------- + +n_high = sum(1 for f in findings if f.confidence == "high" and not f.pre_applied and f.fix_action) +n_medium = sum(1 for f in findings if f.confidence == "medium" and not f.pre_applied) +n_low = sum(1 for f in findings if f.confidence == "low" and not f.pre_applied) +n_pre = sum(1 for f in findings if f.pre_applied) +n_block = sum(1 for f in findings if f.severity == "error") + +c1, c2, c3, c4, c5 = st.columns(5) +c1.metric("High confidence", n_high, help="Round-trip safe — eligible for auto-fix.") +c2.metric("Medium", n_medium, help="Right call in the common case; preview before applying.") +c3.metric("Low", n_low, help="Heuristic — opt in only.") +c4.metric("Already applied", n_pre, help="Fixed during the read pass (BOM, NUL, line endings).") +c5.metric("Blocking", n_block, help="Severity = error; must be resolved or waived.") + +st.divider() + + +# ---- Top-level controls ------------------------------------------------------ + +decisions_state: dict = st.session_state.setdefault("review_decisions", {}) + +bar_left, bar_mid, bar_right = st.columns([1.2, 1.2, 3]) + +with bar_left: + if st.button("✨ Auto-fix high-confidence", type="primary", use_container_width=True): + for f in findings: + if ( + not f.pre_applied + and f.confidence == "high" + and f.fix_action + and 
get_fix(f.fix_action) is not None + ): + decisions_state[f.id] = Decision(finding_id=f.id, action="auto") + st.rerun() + +with bar_mid: + if st.button("Skip everything (not recommended)", use_container_width=True): + for f in findings: + if not f.pre_applied: + decisions_state[f.id] = Decision(finding_id=f.id, action="skip") + st.rerun() + + +# ---- Per-finding cards ------------------------------------------------------- + +# Sort: blocking first, then high (unfixed), medium, low, pre-applied. +def _sort_key(f: Finding) -> tuple: + severity_rank = {"error": 0, "warn": 1, "info": 2}[f.severity] + confidence_rank = {"high": 0, "medium": 1, "low": 2}[f.confidence] + return (int(f.pre_applied), severity_rank, confidence_rank, f.id) + + +for f in sorted(findings, key=_sort_key): + decision = decisions_state.get(f.id) + decision_action = decision.action if decision else ( + "auto" if (f.pre_applied or (f.confidence == "high" and f.fix_action)) else "skip" + ) + + title_bits = [ + _severity_pill(f.severity), + _confidence_pill(f.confidence), + f"**{f.id}**", + f"({f.count})", + ] + if f.pre_applied: + title_bits.append(":gray-background[applied during read]") + + with st.expander(" ".join(title_bits), expanded=(f.severity == "error")): + st.caption(f.description) + if f.tool: + st.caption(f"Owned by: `{f.tool}`") + + if f.pre_applied: + st.info("This was already applied during the file read pass — no decision needed.") + continue + + if not f.fix_action: + if f.severity == "error": + st.error( + "Blocking finding with no auto-fix. Choose **Skip / waive** to " + "acknowledge and proceed (not recommended), or fix the file outside " + "DataTools and re-upload." + ) + else: + st.info("Informational only — no fix to apply.") + + # Decision radio + choice_labels = { + "auto": "Auto-fix with our algorithm", + "skip": "Skip / waive (no change)", + } + # Customize is offered for fixes that take a meaningful payload. 
+ if f.fix_action in ("replace_null_sentinels",): + choice_labels["modified"] = "Customize" + + chosen = st.radio( + "Decision", + options=list(choice_labels.keys()), + index=list(choice_labels.keys()).index(decision_action) + if decision_action in choice_labels else 0, + format_func=lambda k: choice_labels[k], + key=f"decision_{f.id}", + horizontal=True, + ) + + # Customize payload editor (only for the modified action) + payload: Optional[dict] = None + if chosen == "modified" and f.fix_action == "replace_null_sentinels": + default_sentinels = ", ".join(sorted([ + "n/a", "na", "nan", "null", "none", "-", "--", "tbd", "unknown", + ])) + text = st.text_area( + "Sentinels (comma-separated, case-insensitive):", + value=(decision.payload or {}).get( + "sentinels_raw", default_sentinels, + ) if decision else default_sentinels, + key=f"sentinels_{f.id}", + ) + sentinels = [s.strip() for s in text.split(",") if s.strip()] + payload = {"sentinels": sentinels, "sentinels_raw": text} + + # Persist + decisions_state[f.id] = Decision( + finding_id=f.id, action=chosen, payload=payload, + ) + + # Preview + if chosen != "skip" and f.samples: + preview = _preview_table(f, chosen, payload) + if preview is not None and not preview.empty: + st.markdown("**Preview** (showing up to 5 affected cells)") + st.dataframe(preview, use_container_width=True, hide_index=True) + +st.divider() + + +# ---- Apply ------------------------------------------------------------------ + +bottom_left, bottom_mid, bottom_right = st.columns([1, 1, 3]) + +with bottom_left: + apply_clicked = st.button( + "✅ Apply & enter tools", type="primary", use_container_width=True, + disabled=not decisions_state, + ) + +with bottom_mid: + reset_clicked = st.button("Reset all decisions", use_container_width=True) + +if reset_clicked: + st.session_state.pop("review_decisions", None) + st.session_state.pop("normalization_result", None) + st.session_state.pop("normalization_for", None) + st.rerun() + +if apply_clicked: + df 
= _load_df_from_session( + encoding_override=st.session_state.get("encoding_override") + ) + if df is None: + st.error("Could not re-read the uploaded file. Try re-uploading.") + st.stop() + decisions_list = [d for d in decisions_state.values() if isinstance(d, Decision)] + result = apply_decisions(df, findings, decisions_list) + st.session_state["normalization_result"] = result + st.session_state["normalization_for"] = _upload_hash() + + summary = gate_summary(result) + if result.passed and is_normalized(findings, result): + st.success( + f"✓ Gate passed — {summary['fixes_applied']} fix(es) applied, " + f"{summary['cells_changed']} cell(s) changed. You can now open any tool." + ) + elif result.blocking_findings: + st.error( + f"Gate blocked by error-level findings: " + f"{', '.join(b.id for b in result.blocking_findings)}. " + f"Resolve or waive them above before continuing." + ) + elif result.pending_findings: + st.warning( + f"Pending decisions remain on: " + f"{', '.join(f.id for f in result.pending_findings)}. " + f"Choose Auto-fix or Skip for each before continuing." 
+ ) + +# Persisted summary (re-render on reload) +result: Optional[NormalizationResult] = st.session_state.get("normalization_result") +if result is not None and st.session_state.get("normalization_for") == _upload_hash(): + with st.expander("Audit log"): + if result.applied: + st.markdown("**Applied fixes**") + st.dataframe( + pd.DataFrame([ + { + "finding": a.finding_id, + "fix_action": a.fix_action, + "decision": a.decision, + "cells_changed": a.cells_changed, + } + for a in result.applied + ]), + use_container_width=True, hide_index=True, + ) + if result.skipped_findings: + st.markdown("**Skipped (waived by user)**") + st.write([f.id for f in result.skipped_findings]) + if result.passed: + st.markdown("---") + st.markdown("**Download normalized file**") + with st.expander("⚙️ Advanced output options"): + st.caption( + "Defaults match what the analyzer normalized to: UTF-8, " + "comma-separated, LF line endings. Override only if your " + "destination tool requires a specific format." + ) + + col_enc, col_delim, col_le = st.columns(3) + with col_enc: + enc_choice = st.selectbox( + "Encoding (code page)", + options=[label for label, _ in _OUTPUT_ENCODINGS], + index=0, + key="output_encoding_select", + ) + out_encoding = next( + codec for label, codec in _OUTPUT_ENCODINGS if label == enc_choice + ) + + with col_delim: + delim_choice = st.selectbox( + "Delimiter", + options=[label for label, _ in _OUTPUT_DELIMITERS], + index=0, + key="output_delim_select", + ) + out_delim = next( + ch for label, ch in _OUTPUT_DELIMITERS if label == delim_choice + ) + + with col_le: + le_choice = st.selectbox( + "Line terminator", + options=[label for label, _ in _OUTPUT_LINE_TERMINATORS], + index=0, + key="output_le_select", + ) + out_le = next( + ch for label, ch in _OUTPUT_LINE_TERMINATORS if label == le_choice + ) + + data, encode_warn = _build_output_bytes( + result.cleaned_df, + encoding=out_encoding, + delimiter=out_delim, + line_terminator=out_le, + ) + if encode_warn: + 
st.warning(encode_warn) + + ext = "tsv" if out_delim == "\t" else "csv" + mime = "text/tab-separated-values" if out_delim == "\t" else "text/csv" + file_name = f"{Path(upload_name).stem}.normalized.{ext}" + + st.download_button( + f"⬇️ Download {file_name}", + data=data, + file_name=file_name, + mime=mime, + type="primary", + ) diff --git a/src/gui/pages/1_Deduplicator.py b/src/gui/pages/1_Deduplicator.py index 6fa8760..4da19b5 100644 --- a/src/gui/pages/1_Deduplicator.py +++ b/src/gui/pages/1_Deduplicator.py @@ -22,10 +22,12 @@ from src.gui.components import ( hide_streamlit_chrome, match_group_card, pickup_or_upload, + require_normalization_gate, results_summary, ) hide_streamlit_chrome() +require_normalization_gate() # --------------------------------------------------------------------------- # Session state defaults diff --git a/src/gui/pages/2_Text_Cleaner.py b/src/gui/pages/2_Text_Cleaner.py index e9ef09f..80ba7e6 100644 --- a/src/gui/pages/2_Text_Cleaner.py +++ b/src/gui/pages/2_Text_Cleaner.py @@ -18,6 +18,7 @@ from src.gui.components import ( hide_streamlit_chrome, pickup_or_upload, render_hidden_aware_preview, + require_normalization_gate, ) from src.core.text_clean import ( PRESETS, @@ -28,6 +29,7 @@ from src.core.text_clean import ( ) hide_streamlit_chrome() +require_normalization_gate() # --------------------------------------------------------------------------- diff --git a/src/gui/pages/3_Format_Standardizer.py b/src/gui/pages/3_Format_Standardizer.py index 2976325..3511f38 100644 --- a/src/gui/pages/3_Format_Standardizer.py +++ b/src/gui/pages/3_Format_Standardizer.py @@ -11,9 +11,10 @@ _project_root = Path(__file__).resolve().parent.parent.parent.parent if str(_project_root) not in sys.path: sys.path.insert(0, str(_project_root)) -from src.gui.components import hide_streamlit_chrome +from src.gui.components import hide_streamlit_chrome, require_normalization_gate hide_streamlit_chrome() +require_normalization_gate() # 
--------------------------------------------------------------------------- # Header diff --git a/src/gui/pages/4_Missing_Values.py b/src/gui/pages/4_Missing_Values.py index c34b1eb..8a181ed 100644 --- a/src/gui/pages/4_Missing_Values.py +++ b/src/gui/pages/4_Missing_Values.py @@ -11,9 +11,10 @@ _project_root = Path(__file__).resolve().parent.parent.parent.parent if str(_project_root) not in sys.path: sys.path.insert(0, str(_project_root)) -from src.gui.components import hide_streamlit_chrome +from src.gui.components import hide_streamlit_chrome, require_normalization_gate hide_streamlit_chrome() +require_normalization_gate() # --------------------------------------------------------------------------- # Header diff --git a/src/gui/pages/5_Column_Mapper.py b/src/gui/pages/5_Column_Mapper.py index df11527..d36cc05 100644 --- a/src/gui/pages/5_Column_Mapper.py +++ b/src/gui/pages/5_Column_Mapper.py @@ -11,9 +11,10 @@ _project_root = Path(__file__).resolve().parent.parent.parent.parent if str(_project_root) not in sys.path: sys.path.insert(0, str(_project_root)) -from src.gui.components import hide_streamlit_chrome +from src.gui.components import hide_streamlit_chrome, require_normalization_gate hide_streamlit_chrome() +require_normalization_gate() # --------------------------------------------------------------------------- # Header diff --git a/src/gui/pages/6_Outlier_Detector.py b/src/gui/pages/6_Outlier_Detector.py index c342ff1..02fbdc7 100644 --- a/src/gui/pages/6_Outlier_Detector.py +++ b/src/gui/pages/6_Outlier_Detector.py @@ -11,9 +11,10 @@ _project_root = Path(__file__).resolve().parent.parent.parent.parent if str(_project_root) not in sys.path: sys.path.insert(0, str(_project_root)) -from src.gui.components import hide_streamlit_chrome +from src.gui.components import hide_streamlit_chrome, require_normalization_gate hide_streamlit_chrome() +require_normalization_gate() # --------------------------------------------------------------------------- # Header 
diff --git a/src/gui/pages/7_Multi_File_Merger.py b/src/gui/pages/7_Multi_File_Merger.py index 8a22e65..7b28fc1 100644 --- a/src/gui/pages/7_Multi_File_Merger.py +++ b/src/gui/pages/7_Multi_File_Merger.py @@ -11,9 +11,10 @@ _project_root = Path(__file__).resolve().parent.parent.parent.parent if str(_project_root) not in sys.path: sys.path.insert(0, str(_project_root)) -from src.gui.components import hide_streamlit_chrome +from src.gui.components import hide_streamlit_chrome, require_normalization_gate hide_streamlit_chrome() +require_normalization_gate() # --------------------------------------------------------------------------- # Header diff --git a/src/gui/pages/8_Validator_Reporter.py b/src/gui/pages/8_Validator_Reporter.py index 614ec4c..6a6b2cf 100644 --- a/src/gui/pages/8_Validator_Reporter.py +++ b/src/gui/pages/8_Validator_Reporter.py @@ -11,9 +11,10 @@ _project_root = Path(__file__).resolve().parent.parent.parent.parent if str(_project_root) not in sys.path: sys.path.insert(0, str(_project_root)) -from src.gui.components import hide_streamlit_chrome +from src.gui.components import hide_streamlit_chrome, require_normalization_gate hide_streamlit_chrome() +require_normalization_gate() # --------------------------------------------------------------------------- # Header diff --git a/src/gui/pages/9_Pipeline_Runner.py b/src/gui/pages/9_Pipeline_Runner.py index 7346887..8057e80 100644 --- a/src/gui/pages/9_Pipeline_Runner.py +++ b/src/gui/pages/9_Pipeline_Runner.py @@ -11,9 +11,10 @@ _project_root = Path(__file__).resolve().parent.parent.parent.parent if str(_project_root) not in sys.path: sys.path.insert(0, str(_project_root)) -from src.gui.components import hide_streamlit_chrome +from src.gui.components import hide_streamlit_chrome, require_normalization_gate hide_streamlit_chrome() +require_normalization_gate() # --------------------------------------------------------------------------- # Header diff --git 
a/test-cases/encodings-corpus/E01_western_basic_utf8.csv b/test-cases/encodings-corpus/E01_western_basic_utf8.csv new file mode 100644 index 0000000..54b281c --- /dev/null +++ b/test-cases/encodings-corpus/E01_western_basic_utf8.csv @@ -0,0 +1,5 @@ +id,name,city,note +1,Alice,New York,plain ASCII +2,Café Müller,Köln,Latin-1 accents +3,Naïve Façade,Zürich,more accents +4,España,Düsseldorf,Spanish n-tilde diff --git a/test-cases/encodings-corpus/E02_western_basic_utf8bom.csv b/test-cases/encodings-corpus/E02_western_basic_utf8bom.csv new file mode 100644 index 0000000..5fe8b5f --- /dev/null +++ b/test-cases/encodings-corpus/E02_western_basic_utf8bom.csv @@ -0,0 +1,5 @@ +id,name,city,note +1,Alice,New York,plain ASCII +2,Café Müller,Köln,Latin-1 accents +3,Naïve Façade,Zürich,more accents +4,España,Düsseldorf,Spanish n-tilde diff --git a/test-cases/encodings-corpus/E03_western_basic_cp1252.csv b/test-cases/encodings-corpus/E03_western_basic_cp1252.csv new file mode 100644 index 0000000..5bb0225 --- /dev/null +++ b/test-cases/encodings-corpus/E03_western_basic_cp1252.csv @@ -0,0 +1,5 @@ +id,name,city,note +1,Alice,New York,plain ASCII +2,Caf Mller,Kln,Latin-1 accents +3,Nave Faade,Zrich,more accents +4,Espaa,Dsseldorf,Spanish n-tilde diff --git a/test-cases/encodings-corpus/E04_western_basic_latin1.csv b/test-cases/encodings-corpus/E04_western_basic_latin1.csv new file mode 100644 index 0000000..5bb0225 --- /dev/null +++ b/test-cases/encodings-corpus/E04_western_basic_latin1.csv @@ -0,0 +1,5 @@ +id,name,city,note +1,Alice,New York,plain ASCII +2,Caf Mller,Kln,Latin-1 accents +3,Nave Faade,Zrich,more accents +4,Espaa,Dsseldorf,Spanish n-tilde diff --git a/test-cases/encodings-corpus/E05_western_basic_latin9.csv b/test-cases/encodings-corpus/E05_western_basic_latin9.csv new file mode 100644 index 0000000..5bb0225 --- /dev/null +++ b/test-cases/encodings-corpus/E05_western_basic_latin9.csv @@ -0,0 +1,5 @@ +id,name,city,note +1,Alice,New York,plain ASCII +2,Caf 
Mller,Kln,Latin-1 accents +3,Nave Faade,Zrich,more accents +4,Espaa,Dsseldorf,Spanish n-tilde diff --git a/test-cases/encodings-corpus/E06_western_basic_macroman.csv b/test-cases/encodings-corpus/E06_western_basic_macroman.csv new file mode 100644 index 0000000..98feebe --- /dev/null +++ b/test-cases/encodings-corpus/E06_western_basic_macroman.csv @@ -0,0 +1,5 @@ +id,name,city,note +1,Alice,New York,plain ASCII +2,Caf Mller,Kln,Latin-1 accents +3,Nave Faade,Zrich,more accents +4,Espaa,Dsseldorf,Spanish n-tilde diff --git a/test-cases/encodings-corpus/E07_western_basic_utf16le.csv b/test-cases/encodings-corpus/E07_western_basic_utf16le.csv new file mode 100644 index 0000000..172f8a2 Binary files /dev/null and b/test-cases/encodings-corpus/E07_western_basic_utf16le.csv differ diff --git a/test-cases/encodings-corpus/E08_western_basic_utf16be.csv b/test-cases/encodings-corpus/E08_western_basic_utf16be.csv new file mode 100644 index 0000000..bc56321 Binary files /dev/null and b/test-cases/encodings-corpus/E08_western_basic_utf16be.csv differ diff --git a/test-cases/encodings-corpus/E09_western_basic_utf16le_nobom.csv b/test-cases/encodings-corpus/E09_western_basic_utf16le_nobom.csv new file mode 100644 index 0000000..c14d47b Binary files /dev/null and b/test-cases/encodings-corpus/E09_western_basic_utf16le_nobom.csv differ diff --git a/test-cases/encodings-corpus/E10_western_extended_utf8.csv b/test-cases/encodings-corpus/E10_western_extended_utf8.csv new file mode 100644 index 0000000..d204c4b --- /dev/null +++ b/test-cases/encodings-corpus/E10_western_extended_utf8.csv @@ -0,0 +1,5 @@ +id,name,note +1,€100 product,euro sign U+20AC +2,“smart” quotes,curly U+201C and U+201D +3,café — résumé,em-dash U+2014 +4,quote’s ok,smart apostrophe U+2019 diff --git a/test-cases/encodings-corpus/E11_western_extended_cp1252.csv b/test-cases/encodings-corpus/E11_western_extended_cp1252.csv new file mode 100644 index 0000000..587aff9 --- /dev/null +++ 
b/test-cases/encodings-corpus/E11_western_extended_cp1252.csv @@ -0,0 +1,5 @@ +id,name,note +1,100 product,euro sign U+20AC +2,smart quotes,curly U+201C and U+201D +3,caf rsum,em-dash U+2014 +4,quotes ok,smart apostrophe U+2019 diff --git a/test-cases/encodings-corpus/E12_western_extended_utf16le.csv b/test-cases/encodings-corpus/E12_western_extended_utf16le.csv new file mode 100644 index 0000000..a89a33b Binary files /dev/null and b/test-cases/encodings-corpus/E12_western_extended_utf16le.csv differ diff --git a/test-cases/encodings-corpus/E13_eastern_european_utf8.csv b/test-cases/encodings-corpus/E13_eastern_european_utf8.csv new file mode 100644 index 0000000..f5f3f92 --- /dev/null +++ b/test-cases/encodings-corpus/E13_eastern_european_utf8.csv @@ -0,0 +1,5 @@ +id,name,city,language +1,Příliš,Praha,Czech +2,Żółć,Warszawa,Polish +3,Tűrő,Budapest,Hungarian +4,Spaňski,Bratislava,Slovak diff --git a/test-cases/encodings-corpus/E14_eastern_european_cp1250.csv b/test-cases/encodings-corpus/E14_eastern_european_cp1250.csv new file mode 100644 index 0000000..a8c1b19 --- /dev/null +++ b/test-cases/encodings-corpus/E14_eastern_european_cp1250.csv @@ -0,0 +1,5 @@ +id,name,city,language +1,Pli,Praha,Czech +2,,Warszawa,Polish +3,Tr,Budapest,Hungarian +4,Spaski,Bratislava,Slovak diff --git a/test-cases/encodings-corpus/E15_eastern_european_iso88592.csv b/test-cases/encodings-corpus/E15_eastern_european_iso88592.csv new file mode 100644 index 0000000..927febf --- /dev/null +++ b/test-cases/encodings-corpus/E15_eastern_european_iso88592.csv @@ -0,0 +1,5 @@ +id,name,city,language +1,Pli,Praha,Czech +2,,Warszawa,Polish +3,Tr,Budapest,Hungarian +4,Spaski,Bratislava,Slovak diff --git a/test-cases/encodings-corpus/E16_cyrillic_utf8.csv b/test-cases/encodings-corpus/E16_cyrillic_utf8.csv new file mode 100644 index 0000000..d4ad079 --- /dev/null +++ b/test-cases/encodings-corpus/E16_cyrillic_utf8.csv @@ -0,0 +1,4 @@ +id,name,city +1,Иван,Москва +2,Анна,Санкт-Петербург 
+3,Дмитрий,Новосибирск diff --git a/test-cases/encodings-corpus/E17_cyrillic_cp1251.csv b/test-cases/encodings-corpus/E17_cyrillic_cp1251.csv new file mode 100644 index 0000000..e49142a --- /dev/null +++ b/test-cases/encodings-corpus/E17_cyrillic_cp1251.csv @@ -0,0 +1,4 @@ +id,name,city +1,, +2,,- +3,, diff --git a/test-cases/encodings-corpus/E18_cyrillic_koi8r.csv b/test-cases/encodings-corpus/E18_cyrillic_koi8r.csv new file mode 100644 index 0000000..d260d9b --- /dev/null +++ b/test-cases/encodings-corpus/E18_cyrillic_koi8r.csv @@ -0,0 +1,4 @@ +id,name,city +1,, +2,,- +3,, diff --git a/test-cases/encodings-corpus/E19_japanese_utf8.csv b/test-cases/encodings-corpus/E19_japanese_utf8.csv new file mode 100644 index 0000000..5a854f4 --- /dev/null +++ b/test-cases/encodings-corpus/E19_japanese_utf8.csv @@ -0,0 +1,4 @@ +id,name,city +1,田中太郎,東京 +2,鈴木花子,大阪 +3,Alice Smith,横浜 diff --git a/test-cases/encodings-corpus/E20_japanese_shiftjis.csv b/test-cases/encodings-corpus/E20_japanese_shiftjis.csv new file mode 100644 index 0000000..c60057d --- /dev/null +++ b/test-cases/encodings-corpus/E20_japanese_shiftjis.csv @@ -0,0 +1,4 @@ +id,name,city +1,cY, +2,؉Ԏq, +3,Alice Smith,l diff --git a/test-cases/encodings-corpus/E21_chinese_simplified_utf8.csv b/test-cases/encodings-corpus/E21_chinese_simplified_utf8.csv new file mode 100644 index 0000000..300df3e --- /dev/null +++ b/test-cases/encodings-corpus/E21_chinese_simplified_utf8.csv @@ -0,0 +1,4 @@ +id,name,city +1,张三,北京 +2,李四,上海 +3,Alice Smith,深圳 diff --git a/test-cases/encodings-corpus/E22_chinese_simplified_gb18030.csv b/test-cases/encodings-corpus/E22_chinese_simplified_gb18030.csv new file mode 100644 index 0000000..c8f7a53 --- /dev/null +++ b/test-cases/encodings-corpus/E22_chinese_simplified_gb18030.csv @@ -0,0 +1,4 @@ +id,name,city +1,, +2,,Ϻ +3,Alice Smith, diff --git a/test-cases/encodings-corpus/E23_chinese_traditional_utf8.csv b/test-cases/encodings-corpus/E23_chinese_traditional_utf8.csv new file mode 100644 index 
0000000..60a5859 --- /dev/null +++ b/test-cases/encodings-corpus/E23_chinese_traditional_utf8.csv @@ -0,0 +1,4 @@ +id,name,city +1,張三,台北 +2,李四,香港 +3,Alice Smith,新竹 diff --git a/test-cases/encodings-corpus/E24_chinese_traditional_big5.csv b/test-cases/encodings-corpus/E24_chinese_traditional_big5.csv new file mode 100644 index 0000000..8702249 --- /dev/null +++ b/test-cases/encodings-corpus/E24_chinese_traditional_big5.csv @@ -0,0 +1,4 @@ +id,name,city +1,iT,x_ +2,|, +3,Alice Smith,s diff --git a/test-cases/encodings-corpus/E25_korean_utf8.csv b/test-cases/encodings-corpus/E25_korean_utf8.csv new file mode 100644 index 0000000..abb4304 --- /dev/null +++ b/test-cases/encodings-corpus/E25_korean_utf8.csv @@ -0,0 +1,4 @@ +id,name,city +1,김철수,서울 +2,박영희,부산 +3,Alice Smith,인천 diff --git a/test-cases/encodings-corpus/E26_korean_euckr.csv b/test-cases/encodings-corpus/E26_korean_euckr.csv new file mode 100644 index 0000000..13ccbff --- /dev/null +++ b/test-cases/encodings-corpus/E26_korean_euckr.csv @@ -0,0 +1,4 @@ +id,name,city +1,ö, +2,ڿ,λ +3,Alice Smith,õ diff --git a/test-cases/encodings-corpus/E27_pathological_ascii_only.csv b/test-cases/encodings-corpus/E27_pathological_ascii_only.csv new file mode 100644 index 0000000..8f21db1 --- /dev/null +++ b/test-cases/encodings-corpus/E27_pathological_ascii_only.csv @@ -0,0 +1,4 @@ +id,name,city +1,Alice,New York +2,Bob,Chicago +3,Carol,San Francisco diff --git a/test-cases/encodings-corpus/E28_pathological_invalid_utf8.csv b/test-cases/encodings-corpus/E28_pathological_invalid_utf8.csv new file mode 100644 index 0000000..d8443aa --- /dev/null +++ b/test-cases/encodings-corpus/E28_pathological_invalid_utf8.csv @@ -0,0 +1,4 @@ +id,name,city +1,Alice,New York +2,B(b,Chicago +3,Carol,San Francisco diff --git a/test-cases/encodings-corpus/E29_pathological_truncated_utf8.csv b/test-cases/encodings-corpus/E29_pathological_truncated_utf8.csv new file mode 100644 index 0000000..9c304c8 --- /dev/null +++ 
b/test-cases/encodings-corpus/E29_pathological_truncated_utf8.csv @@ -0,0 +1,4 @@ +id,name,city +1,Alice,New York +2,Bob,Chicago +3, \ No newline at end of file diff --git a/test-cases/encodings-corpus/E30_pathological_lying_bom.csv b/test-cases/encodings-corpus/E30_pathological_lying_bom.csv new file mode 100644 index 0000000..57df065 --- /dev/null +++ b/test-cases/encodings-corpus/E30_pathological_lying_bom.csv @@ -0,0 +1,5 @@ +id,name,note +1,100 product,euro sign U+20AC +2,smart quotes,curly U+201C and U+201D +3,caf rsum,em-dash U+2014 +4,quotes ok,smart apostrophe U+2019 diff --git a/test-cases/encodings-corpus/E31_pathological_mixed_concat.csv b/test-cases/encodings-corpus/E31_pathological_mixed_concat.csv new file mode 100644 index 0000000..706f863 --- /dev/null +++ b/test-cases/encodings-corpus/E31_pathological_mixed_concat.csv @@ -0,0 +1,4 @@ +id,name,city +1,Mller,Kln +2,Müller,Köln +3,Alice,New York diff --git a/test-cases/encodings-corpus/ENCODINGS-CASES.md b/test-cases/encodings-corpus/ENCODINGS-CASES.md new file mode 100644 index 0000000..b4ef1f0 --- /dev/null +++ b/test-cases/encodings-corpus/ENCODINGS-CASES.md @@ -0,0 +1,284 @@ +# ENCODINGS-CASES.md - Code Page / Encoding Test Corpus + +**Version**: 1.0 +**Last updated**: April 29, 2026 +**Companion to**: TEST-CASES.md and QUOTE-CASES.md. + +## Why this is a separate corpus + +Files 01-23 in the main corpus test the **transformation layer**: given a Python `str` already in memory, what does the cleaner do to it. Encoding tests are about the **I/O layer** that runs *before* the transformation layer ever sees data: given a sequence of bytes on disk, can the reader correctly turn them into a Python `str` in the first place? + +These are different failures: + +- A transformation bug produces wrong-but-valid output (curly quotes that should have been folded, whitespace that should have been trimmed). +- An I/O bug produces *garbage* (mojibake) or *crashes* the reader entirely. 
The cleaner never gets to apply any transformation rule because the input never decoded. + +Per TECHNICAL.md Section 9, encoding handling lives in `src/core/io.py`, separate from any individual cleaning script. This corpus tests that module. + +--- + +## 1. Layout + +``` +test_data/encodings/ +├── E01_western_basic_utf8.csv ... E26_korean_euckr.csv +├── E27_pathological_ascii_only.csv ... E31_pathological_mixed_concat.csv +├── expected_detection.csv # Manifest: ground truth + acceptable detection +├── detector_baseline.csv # What charset-normalizer actually returns +└── reference/ + ├── WESTERN_BASIC.utf8.txt + ├── WESTERN_EXTENDED.utf8.txt + ├── EASTERN_EUROPEAN.utf8.txt + ├── CYRILLIC.utf8.txt + ├── JAPANESE.utf8.txt + ├── CHINESE_SIMPLIFIED.utf8.txt + ├── CHINESE_TRADITIONAL.utf8.txt + ├── KOREAN.utf8.txt + └── ASCII_ONLY.utf8.txt +``` + +Every encoded file has a `canonical_content_id` linking it to one of the 9 reference files in `reference/`. After correct decoding (and BOM stripping if applicable), the encoded file's content must equal the corresponding reference file byte-for-byte. + +--- + +## 2. Coverage matrix + +The corpus uses 9 distinct content sets, each chosen to exercise a specific encoding family. Cross-coverage is enforced by content design: Cyrillic content cannot be encoded as cp1252; Western extended content (with euro/em-dash) cannot be encoded as Latin-1; etc. Attempting those encodings would either error or substitute, both of which are themselves test cases. 
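The cross-coverage constraint described above can be checked directly with Python's built-in codecs. This is a minimal sketch, not part of `src/core/io.py`: `can_encode` is a hypothetical helper, and the two sample strings are simply copied from corpus rows (E16 and E10).

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True when every character of *text* exists in *encoding*."""
    try:
        text.encode(encoding)  # strict mode raises instead of substituting
        return True
    except UnicodeEncodeError:
        return False


# Sample values copied from the corpus (illustrative only).
cyrillic = "Иван"
western_extended = "€100 product"

print(can_encode(cyrillic, "cp1251"))            # True: cp1251 covers Cyrillic
print(can_encode(cyrillic, "cp1252"))            # False: no Cyrillic in cp1252
print(can_encode(western_extended, "cp1252"))    # True: euro sign is byte 0x80
print(can_encode(western_extended, "latin-1"))   # False: Latin-1 has no euro sign

# The "substitute" outcome: each unencodable character becomes "?".
substituted = cyrillic.encode("cp1252", errors="replace")
print(substituted)                               # b'????'
```

Both failure modes (the raised `UnicodeEncodeError` and the silent `?` substitution) are what a generator would hit if it tried to produce an invalid content/encoding pair, which is why those pairs are absent from the matrix.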
+ +| Content family | What it contains | Encodings covered | +|---|---|---| +| WESTERN_BASIC | ASCII + accented Latin-1 chars (é, ü, ñ, ç) | UTF-8, UTF-8 with BOM, cp1252, ISO-8859-1, ISO-8859-15, Mac Roman, UTF-16 LE/BE with BOM, UTF-16 LE without BOM | +| WESTERN_EXTENDED | Above + euro sign, smart quotes, em-dash | UTF-8, cp1252, UTF-16 LE (NOT Latin-1: chars don't exist there) | +| EASTERN_EUROPEAN | Czech, Polish, Hungarian, Slovak accents | UTF-8, cp1250, ISO-8859-2 | +| CYRILLIC | Russian | UTF-8, cp1251, KOI8-R | +| JAPANESE | Kanji + kana | UTF-8, Shift_JIS | +| CHINESE_SIMPLIFIED | Mainland China characters | UTF-8, GB18030 | +| CHINESE_TRADITIONAL | Taiwan/HK characters | UTF-8, Big5 | +| KOREAN | Hangul | UTF-8, EUC-KR | +| ASCII_ONLY | Pure ASCII | One file; encoding genuinely ambiguous | + +--- + +## 3. Per-file index + +### Group A — WESTERN_BASIC (single content, 9 encodings) + +This group's purpose is mainly to test detector behavior on the most common Western encodings. Because the content uses only ASCII + Latin-1 characters in the 0xA0+ range, **cp1252 / ISO-8859-1 / ISO-8859-15 produce byte-identical output for this content**. The detector cannot meaningfully distinguish among them; any of them is a correct answer. + +| File | Encoding | Notes | +|---|---|---| +| E01 | UTF-8 | Modern default | +| E02 | UTF-8 with BOM | Excel "CSV UTF-8" export. Reader must strip the BOM. | +| E03 | cp1252 | Excel default "CSV" on US/UK/Western Windows | +| E04 | ISO-8859-1 | Latin-1. Identical bytes to cp1252 for this content. | +| E05 | ISO-8859-15 | Latin-9. Identical to Latin-1 here (no euro). 
| +| E06 | Mac Roman | Different byte mappings; distinguishable | +| E07 | UTF-16 LE with BOM | Excel "Unicode Text" export | +| E08 | UTF-16 BE with BOM | Less common but spec'd | +| E09 | UTF-16 LE without BOM | Detection unreliable; document failure mode | + +### Group B — WESTERN_EXTENDED (3 encodings) + +This is the cleanest **cp1252-vs-Latin-1 discriminator** in the corpus. The content uses bytes 0x80-0x9F (where cp1252 puts euro, smart quotes, em-dash) — exactly the range Latin-1 leaves undefined. A reader that misidentifies this file as Latin-1 will produce control characters or replacement chars; correct identification as cp1252 yields readable text. + +| File | Encoding | Notes | +|---|---|---| +| E10 | UTF-8 | Reference | +| E11 | cp1252 | The discriminator file | +| E12 | UTF-16 LE with BOM | Same content, sanity check | + +### Group C — EASTERN_EUROPEAN (3 encodings) + +| File | Encoding | Notes | +|---|---|---| +| E13 | UTF-8 | Reference | +| E14 | cp1250 | Polish/Czech/Hungarian Windows default | +| E15 | ISO-8859-2 | Latin-2; distinct byte mappings from cp1250 | + +### Group D — CYRILLIC (3 encodings) + +| File | Encoding | Notes | +|---|---|---| +| E16 | UTF-8 | Reference | +| E17 | cp1251 | Russian Windows default | +| E18 | KOI8-R | Older Russian Unix encoding; distinct bytes from cp1251 | + +### Group E — CJK (8 files, 4 languages × 2 encodings each) + +| File | Encoding | Notes | +|---|---|---| +| E19 | UTF-8 (Japanese) | Reference | +| E20 | Shift_JIS | Japanese Excel default; cp932 is the MS extended variant | +| E21 | UTF-8 (Chinese simplified) | Reference | +| E22 | GB18030 | Mainland China; supersets GBK and GB2312 | +| E23 | UTF-8 (Chinese traditional) | Reference | +| E24 | Big5 | Taiwan/HK; cp950 is the MS variant | +| E25 | UTF-8 (Korean) | Reference | +| E26 | EUC-KR | Korean Windows default; cp949 is the MS variant | + +### Group F — Pathological (5 files) + +These are the cases that crash readers, produce silent corruption, or 
expose the limits of encoding detection. The expected behavior is **that the reader fails informatively**, not that it succeeds. + +| File | Pathology | What should happen | +|---|---|---| +| E27 | ASCII only — encoding genuinely ambiguous | Detector picks any of ASCII/UTF-8/cp1252/Latin-1; all decode identically. Test that the reader doesn't OVER-confidently commit to one when ambiguous. | +| E28 | Invalid UTF-8 byte sequence (0xC3 0x28 mid-file) | Strict UTF-8 decode raises UnicodeDecodeError. Reader should fall back to a single-byte encoding and warn, not silently substitute. | +| E29 | Truncated UTF-8 multibyte at EOF | Strict decode raises "unexpected end of data". Reader should error with a clear "file appears truncated" message, not silently produce U+FFFD. | +| E30 | "Lying BOM" — UTF-8 BOM on cp1252 body | utf-8-sig decoder errors at first cp1252 byte in 0x80-0x9F range. Reader should detect the lie and recover by stripping BOM and trying cp1252; warn the user. | +| E31 | Mixed encoding concatenation (cp1252 + UTF-8) | NO single encoding decodes the whole file. UTF-8 errors on cp1252 bytes; cp1252 mojibakes the UTF-8 bytes. Reader should refuse and tell the user the file contains mixed encodings. | + +--- + +## 4. Manifest files + +### `expected_detection.csv` — ground truth + acceptable detection answers + +7 columns: +- `filename` — the encoded test file +- `canonical_content_id` — links to the reference content +- `encoding` — the actual encoding used by the generator (ground truth) +- `has_bom` — whether the file has a BOM +- `byte_length` — file size in bytes +- `expected_detection` — pipe-separated list of detector answers that should be considered correct. Includes fuzzy markers (`AMBIGUOUS`, `UNRELIABLE`, `REJECT`, `LOW_CONFIDENCE`) for cases where any reasonable detector behavior is acceptable. +- `decode_notes` — human-readable explanation of expected behavior + +Use this as the primary reference when validating your reader. 
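
The `expected_detection` column can be consumed along these lines. This is a minimal sketch, not part of the corpus tooling: `normalize_label`, `parse_expected`, and `is_acceptable` are hypothetical helper names, and it assumes labels should be compared case-insensitively with hyphens and underscores treated alike, and that any token starting with one of the four fuzzy markers (for example `REJECT_UTF8`) counts as that marker.

```python
from __future__ import annotations

# Fuzzy markers named in the manifest description above.
FUZZY_MARKERS = ("AMBIGUOUS", "UNRELIABLE", "REJECT", "LOW_CONFIDENCE")


def normalize_label(name: str) -> str:
    """Fold case and separators so 'UTF-8' and 'utf_8' compare equal."""
    return name.strip().lower().replace("-", "_")


def parse_expected(field: str) -> tuple[set[str], set[str]]:
    """Split a pipe-separated field into (encoding labels, fuzzy markers)."""
    labels: set[str] = set()
    markers: set[str] = set()
    for token in field.split("|"):
        token = token.strip()
        if not token:
            continue
        # Assumption: marker variants such as REJECT_UTF8 count as markers.
        if token.startswith(FUZZY_MARKERS):
            markers.add(token)
        else:
            labels.add(normalize_label(token))
    return labels, markers


def is_acceptable(detected: str, field: str) -> bool:
    """True if the detected label is listed, or the row carries a fuzzy marker."""
    labels, markers = parse_expected(field)
    return bool(markers) or normalize_label(detected) in labels
```

Treating any fuzzy marker as "accept every label" is the most permissive reading; a stricter harness might additionally require that the reader emitted a warning for those rows.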
+ +### `detector_baseline.csv` — what charset-normalizer actually returns + +Recorded during fixture generation against the version of `charset-normalizer` installed at that time. 6 columns: +- `filename`, `ground_truth_encoding`, `charset_normalizer_returns`, `cn_aliases`, `cn_language`, `cn_chaos_score` + +This is **not authoritative** — different detector versions return different labels. It exists so you can see typical detector output without having to run charset-normalizer yourself, and so you have a baseline to compare against if you're testing a different detector or a newer version. + +### `reference/*.utf8.txt` — canonical decoded content + +One UTF-8 file per content family. After your reader decodes any encoded file in the corpus and strips any BOM, the result should equal the corresponding reference file's content byte-for-byte. + +--- + +## 5. Observed charset-normalizer behavior + +Recorded against `charset-normalizer` 3.x. Some of these are known detector quirks worth understanding before you debug your own code: + +### Cases where charset-normalizer is reliably correct + +- All UTF-8 files (E01, E02, E10, E13, E16, E19, E21, E23, E25): detected as `utf_8`. +- All UTF-16 with BOM (E07, E08, E12): detected as `utf_16` (loses LE/BE distinction in label, recoverable from BOM). +- E14 (cp1250 Eastern European): correctly detected. +- E17 (cp1251 Cyrillic): correctly detected. +- E20 (Shift_JIS Japanese): returns `cp932` (the MS extended variant; equivalent for this content). +- E22 (GB18030 Chinese): correctly detected. +- E24 (Big5 Chinese traditional): correctly detected. +- E26 (EUC-KR Korean): returns `cp949` (the MS variant; equivalent for this content). +- E27 (ASCII): correctly detected as `ascii`. 
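
The `utf_16` label drops the byte order, but the BOM still carries it. A minimal sketch of recovering the concrete codec from the leading bytes, using only the standard library; `sniff_bom` is a hypothetical helper name, and the sketch only knows the BOMs that appear in this corpus (UTF-8 and UTF-16, not UTF-32).

```python
import codecs
from typing import Optional, Tuple

# BOMs present in this corpus, paired with the codec to use on the
# bytes that follow the BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(raw: bytes) -> Tuple[Optional[str], int]:
    """Return (codec name, BOM length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return None, 0


# Usage: strip the BOM, then decode the rest with the concrete codec.
raw = codecs.BOM_UTF16_BE + "id,name\n".encode("utf-16-be")
name, bom_len = sniff_bom(raw)
text = raw[bom_len:].decode(name) if name else None
```

A BOM is only a hint: the "lying BOM" fixture (E30) shows a case where the bytes after the BOM do not match the encoding it announces, so the decode step still needs error handling.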
+
+### Cases where charset-normalizer returns a neighboring label
+
+These return a wrong-sounding name from a neighboring code page. Whether the decoded characters are still correct depends on whether the returned encoding is byte-equivalent to the real one for this specific content:
+
+- **E03, E04, E05** (cp1252, Latin-1, Latin-9 with WESTERN_BASIC content): all returned as `cp1250`. cp1252, Latin-1, and Latin-9 are byte-identical for ASCII + Latin-1 chars in the 0xA0+ range, but cp1250 is not: it agrees on é, ü, ö, and ç, yet maps 0xF1 to ń (not ñ) and 0xEF to ď (not ï). Decoding with the returned label therefore corrupts "España" and "Naïve". The label is misleading and the data is subtly wrong.
+- **E06** (Mac Roman): returned as `mac_iceland`. Same family, identical for our content.
+- **E11** (cp1252 with WESTERN_EXTENDED): returned as `cp1250`. Surprising, but harmless for this file: cp1250 agrees with cp1252 on every byte the content uses (euro at 0x80, curly quotes at 0x92-0x94, em-dash at 0x97, é at 0xE9). That agreement is a coincidence of the content, not a property of the label. Verify that your reader actually re-decodes with the returned label and check the output; don't assume a "matching" label means correct content.
+
+### Cases where charset-normalizer is wrong
+
+- **E15** (ISO-8859-2 Eastern European): returned as `cp1258` (Vietnamese encoding). Wrong family entirely. Probable cause: the chaos heuristic doesn't penalize cp1258 for the byte distribution in the test content.
+- **E18** (KOI8-R Cyrillic): returned as `shift_jis_2004` (Japanese!). Bytes in KOI8-R's high-bit range happen to look like valid Shift_JIS multibyte sequences for this content. **High-confidence misdetection** — this is the one to plan a fallback for in your reader.
+
+### Pathological cases
+
+- **E28-E31**: charset-normalizer returns various labels (`cp1257`, `cp1250`, `cp1252`, `cp1250`). For pathological inputs, the *label* is less important than the *behavior*: does your reader detect that something is wrong (low confidence, multiple candidate encodings, or a decode error after detection) and warn the user? The `expected_detection` field accepts any label paired with appropriate warning behavior.
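
For the pathological inputs above, a strict UTF-8 pass before (or after) detection is one cheap way to notice that something is wrong. The sketch below is an assumption about how a reader might do this, not the corpus's own tooling: `classify_utf8` is a hypothetical name, and it uses the standard library's incremental decoder to tell a sequence cut off at end of file (E29) apart from bytes that are invalid mid-stream (E28, and the body of E30).

```python
import codecs


def classify_utf8(raw: bytes) -> str:
    """Classify a byte string as 'ok', 'truncated', or 'invalid' UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        # final=False tolerates an incomplete multibyte sequence at the
        # very end, but still raises on bytes that can never be valid.
        decoder.decode(raw, final=False)
    except UnicodeDecodeError:
        return "invalid"
    try:
        # Flushing with final=True raises if a sequence was left unfinished.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return "truncated"
    return "ok"
```

A reader could map `truncated` to a "file appears truncated" error and `invalid` to a warned fallback onto a single-byte encoding; which policy to apply remains a choice for the reader, as the section notes.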
+ +### Implication for your reader + +Don't trust charset-normalizer's label blindly. The robust pattern: + +1. Run charset-normalizer. +2. Try to decode the entire file with the returned encoding. +3. If decode succeeds, sanity-check the result: does it contain replacement characters (U+FFFD)? Does it contain control characters in unexpected places (which suggest cp1252-vs-Latin-1 ambiguity decoded wrong)? +4. If it fails or smells wrong, try a small candidate set (utf-8, cp1252, latin-1) and pick the one with the cleanest result. +5. When confidence is low, log a warning and let the user override via a `--encoding` flag. + +--- + +## 6. Suggested test workflow + +```python +import csv +from pathlib import Path +from src.core.io import detect_encoding, read_csv # your reader + +CORPUS = Path("test_data/encodings") + +# Load ground-truth manifest +with (CORPUS / "expected_detection.csv").open() as f: + manifest = list(csv.DictReader(f)) + +# Load reference content +references = { + p.stem.replace(".utf8", ""): p.read_text(encoding="utf-8") + for p in (CORPUS / "reference").glob("*.utf8.txt") +} + +# Test 1: detection - your detector returns an acceptable answer +for entry in manifest: + if entry["canonical_content_id"] in references: # skip pure pathological + detected = detect_encoding(CORPUS / entry["filename"]) + acceptable = [e.strip() for e in entry["expected_detection"].split("|")] + assert detected in acceptable or any( + marker in entry["expected_detection"] + for marker in ["AMBIGUOUS", "UNRELIABLE"] + ), f"{entry['filename']}: detected {detected} not in {acceptable}" + +# Test 2: decoded content matches reference +for entry in manifest: + cid = entry["canonical_content_id"] + if cid not in references: + continue # pathological case + decoded = read_csv(CORPUS / entry["filename"]) + assert decoded == references[cid], f"{entry['filename']}: content mismatch" + +# Test 3: pathological cases produce warnings, not silent corruption +for entry in manifest: + 
cid = entry["canonical_content_id"] + if cid in references: + continue + # Reader must either raise a clear error OR succeed with a logged warning + # The exact behavior is a policy choice; document it and test against it +``` + +--- + +## 7. What this corpus does NOT cover + +Listed so the gaps are explicit: + +1. **Big files**. Every fixture is small (under 1 KB). Detection on a 500MB cp1252 export may behave differently because charset-normalizer samples; if your reader processes giant files, add a separate large-file detection test. +2. **Streaming detection**. Detector is run on the full bytes here. If your reader decodes in chunks (for memory reasons on huge files), encoding-detection-at-stream-start is its own test surface. +3. **Languages with complex scripts not represented here**: Thai, Hebrew, Arabic, Vietnamese (cp1258), Greek (cp1253), Turkish (cp1254). Add per-language fixtures if your buyers use these locales. The generator script is parameterized; adding a new content family is a few-line change. +4. **Extended grapheme handling**. This corpus tests encoding detection and byte-to-string conversion. It does NOT test grapheme-cluster boundaries (multi-codepoint emoji, family ZWJ sequences). Those are the cleaner's territory in the main corpus, case 13. +5. **Encoding errors during WRITE**. The corpus tests reading. If your tool writes output in a non-UTF-8 encoding for any reason, write-side encoding correctness needs separate fixtures. +6. **Filename / path encoding issues**. Some filesystems mangle non-ASCII filenames (older Windows, NFS configs). Out of scope for the cleaner; that's a deployment problem. + +--- + +## 8. How to extend the corpus + +Add a new content family: + +```python +# In generate_encoding_test_files.py: +THAI = "id,name,city\n1,สมชาย,กรุงเทพ\n..." + +# Then add encoding lines: +write_encoded("E32_thai_utf8.csv", "THAI", THAI, "utf-8", ...) +write_encoded("E33_thai_cp874.csv", "THAI", THAI, "cp874", ...) 
+``` + +Add reference content to the `references` dict at the bottom of the generator. Re-run the generator. The manifest and detector baseline will refresh automatically. + +For a new pathological case: construct the raw bytes by hand and use `write_raw()`. Document the failure mode in the `decode_notes` field. + +Continue numbering: `E32`, `E33`, etc. Reserve `E9#` if you need a "destructive" subcategory paralleling the malformed CSV corpus. diff --git a/test-cases/encodings-corpus/detector_baseline.csv b/test-cases/encodings-corpus/detector_baseline.csv new file mode 100644 index 0000000..1cd4864 --- /dev/null +++ b/test-cases/encodings-corpus/detector_baseline.csv @@ -0,0 +1,32 @@ +filename,ground_truth_encoding,charset_normalizer_returns,cn_aliases,cn_language,cn_chaos_score +E01_western_basic_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Turkish,0.000 +E02_western_basic_utf8bom.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Turkish,0.000 +E03_western_basic_cp1252.csv,cp1252,cp1250,"1250, windows_1250",Turkish,0.000 +E04_western_basic_latin1.csv,iso-8859-1,cp1250,"1250, windows_1250",Turkish,0.000 +E05_western_basic_latin9.csv,iso-8859-15,cp1250,"1250, windows_1250",Turkish,0.000 +E06_western_basic_macroman.csv,mac-roman,mac_iceland,maciceland,Turkish,0.000 +E07_western_basic_utf16le.csv,utf-16-le,utf_16,"u16, utf16",Turkish,0.000 +E08_western_basic_utf16be.csv,utf-16-be,utf_16,"u16, utf16",Turkish,0.000 +E09_western_basic_utf16le_nobom.csv,utf-16-le,utf_16_le,"unicodelittleunmarked, utf_16le",Turkish,0.000 +E10_western_extended_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",French,0.013 +E11_western_extended_cp1252.csv,cp1252,cp1250,"1250, windows_1250",French,0.013 +E12_western_extended_utf16le.csv,utf-16-le,utf_16,"u16, utf16",French,0.013 +E13_eastern_european_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Spanish,0.042 +E14_eastern_european_cp1250.csv,cp1250,cp1250,"1250, 
windows_1250",Spanish,0.042 +E15_eastern_european_iso88592.csv,iso-8859-2,cp1258,"1258, windows_1258",German,0.000 +E16_cyrillic_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Ukrainian,0.059 +E17_cyrillic_cp1251.csv,cp1251,cp1251,"1251, windows_1251",Ukrainian,0.059 +E18_cyrillic_koi8r.csv,koi8-r,shift_jis_2004,"shiftjis2004, sjis_2004, s_jis_2004",Japanese,0.066 +E19_japanese_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Italian,0.000 +E20_japanese_shiftjis.csv,shift_jis,cp932,"932, ms932, mskanji, ms_kanji",Japanese,0.000 +E21_chinese_simplified_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Unknown,0.000 +E22_chinese_simplified_gb18030.csv,gb18030,gb18030,gb18030_2000,Chinese,0.000 +E23_chinese_traditional_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Unknown,0.060 +E24_chinese_traditional_big5.csv,big5,big5,"big5_tw, csbig5, x_mac_trad_chinese",Chinese,0.060 +E25_korean_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Unknown,0.000 +E26_korean_euckr.csv,euc-kr,cp949,"949, ms949, uhc",Korean,0.000 +E27_pathological_ascii_only.csv,ascii,ascii,"646, ansi_x3.4_1968, ansi_x3_4_1968, ansi_x3.4_1986, cp367, csascii, ibm367, iso646_us, iso_646.irv_1991, iso_ir_6, us, us_ascii",English,0.000 +E28_pathological_invalid_utf8.csv,invalid-utf8,cp1257,"1257, windows_1257",Croatian,0.000 +E29_pathological_truncated_utf8.csv,invalid-utf8-truncated,cp1250,"1250, windows_1250",Polish,0.000 +E30_pathological_lying_bom.csv,cp1252-with-utf8-bom,cp1252,"1252, windows_1252",French,0.013 +E31_pathological_mixed_concat.csv,cp1252+utf8-concatenated,cp1250,"1250, windows_1250",German,0.000 diff --git a/test-cases/encodings-corpus/expected_detection.csv b/test-cases/encodings-corpus/expected_detection.csv new file mode 100644 index 0000000..8818797 --- /dev/null +++ b/test-cases/encodings-corpus/expected_detection.csv @@ -0,0 +1,32 @@ 
+filename,canonical_content_id,encoding,has_bom,byte_length,expected_detection,decode_notes +E01_western_basic_utf8.csv,WESTERN_BASIC,utf-8,no,161,utf_8|utf-8,UTF-8 no BOM. Modern default. +E02_western_basic_utf8bom.csv,WESTERN_BASIC,utf-8,yes,164,utf_8|utf_8_sig|utf-8|utf-8-sig,UTF-8 with BOM. Excel CSV UTF-8 export. BOM must be stripped on read. +E03_western_basic_cp1252.csv,WESTERN_BASIC,cp1252,no,153,cp1252|windows-1252|iso-8859-1|latin-1,"Western single-byte. For this content (no euro, no smart quotes, no em-dash), cp1252 and Latin-1 produce IDENTICAL decoded bytes. Detector cannot distinguish - any of cp1252/Latin-1/Latin-9 is a correct answer." +E04_western_basic_latin1.csv,WESTERN_BASIC,iso-8859-1,no,153,iso-8859-1|latin-1|cp1252|latin_1,Latin-1. Identical bytes to cp1252 for this content. Detector ambiguity is expected and acceptable. +E05_western_basic_latin9.csv,WESTERN_BASIC,iso-8859-15,no,153,iso-8859-15|latin-9|iso-8859-1|cp1252,"Latin-9. For this content with no euro sign, decodes identically to Latin-1. Detector may pick any." +E06_western_basic_macroman.csv,WESTERN_BASIC,mac-roman,no,153,mac-roman|macroman,"Mac Roman. Different byte values for the accented chars vs cp1252/Latin-1, so this one is distinguishable." +E07_western_basic_utf16le.csv,WESTERN_BASIC,utf-16-le,yes,308,utf-16|utf-16-le|utf_16|utf_16_le,UTF-16 LE with BOM. Excel 'Unicode Text' export. +E08_western_basic_utf16be.csv,WESTERN_BASIC,utf-16-be,yes,308,utf-16|utf-16-be|utf_16|utf_16_be,UTF-16 BE with BOM. Less common but valid. +E09_western_basic_utf16le_nobom.csv,WESTERN_BASIC,utf-16-le,no,306,utf-16|utf-16-le|UNRELIABLE,"UTF-16 LE without BOM. Detection is heuristic and unreliable; bytes look like 'every other byte is null' for ASCII-heavy content, which charset-normalizer may or may not catch. If detector returns wrong encoding here, that is the buyer's responsibility to manually specify - flag in error message." 
+E10_western_extended_utf8.csv,WESTERN_EXTENDED,utf-8,no,167,utf_8|utf-8,"UTF-8. Has euro, smart quotes, em-dash." +E11_western_extended_cp1252.csv,WESTERN_EXTENDED,cp1252,no,154,cp1252|windows-1252,"cp1252. Content uses 0x80-0x9F range (euro, smart quotes, em-dash). Decoding as Latin-1 produces control characters or replacement chars - this file is the cleanest cp1252-vs-Latin-1 discriminator." +E12_western_extended_utf16le.csv,WESTERN_EXTENDED,utf-16-le,yes,310,utf-16|utf-16-le,UTF-16 LE with BOM. Same content as E10/E11. +E13_eastern_european_utf8.csv,EASTERN_EUROPEAN,utf-8,no,130,utf_8|utf-8,UTF-8 baseline for Czech/Polish/Hungarian/Slovak content. +E14_eastern_european_cp1250.csv,EASTERN_EUROPEAN,cp1250,no,120,cp1250|windows-1250,"cp1250. Decoding as cp1252 produces mojibake (Polish slash-l would become U+0142 vs ascii letter, etc.). Real distinguishing test." +E15_eastern_european_iso88592.csv,EASTERN_EUROPEAN,iso-8859-2,no,120,iso-8859-2|latin-2|iso8859_2,ISO-8859-2 / Latin-2. Different byte assignments than cp1250 for the same characters. +E16_cyrillic_utf8.csv,CYRILLIC,utf-8,no,118,utf_8|utf-8,UTF-8 baseline for Russian content. +E17_cyrillic_cp1251.csv,CYRILLIC,cp1251,no,72,cp1251|windows-1251,cp1251. The dominant Russian Windows encoding. +E18_cyrillic_koi8r.csv,CYRILLIC,koi8-r,no,72,koi8-r|koi8_r,KOI8-R. Older Unix Russian encoding. Distinct byte patterns from cp1251. +E19_japanese_utf8.csv,JAPANESE,utf-8,no,78,utf_8|utf-8,UTF-8 baseline for Japanese content. +E20_japanese_shiftjis.csv,JAPANESE,shift_jis,no,64,shift_jis|shift-jis|cp932|sjis,Shift_JIS. Excel on Japanese Windows defaults to this. cp932 is Microsoft's extended variant; either name is acceptable. +E21_chinese_simplified_utf8.csv,CHINESE_SIMPLIFIED,utf-8,no,66,utf_8|utf-8,UTF-8 baseline for simplified Chinese. +E22_chinese_simplified_gb18030.csv,CHINESE_SIMPLIFIED,gb18030,no,56,gb18030|gbk|gb2312,GB18030. Mainland China default. 
GB18030 supersets GBK supersets GB2312; for this content any is acceptable. +E23_chinese_traditional_utf8.csv,CHINESE_TRADITIONAL,utf-8,no,66,utf_8|utf-8,UTF-8 baseline for traditional Chinese. +E24_chinese_traditional_big5.csv,CHINESE_TRADITIONAL,big5,no,56,big5|big5_hkscs|cp950,Big5. Taiwan and Hong Kong default. cp950 is Microsoft's variant. +E25_korean_utf8.csv,KOREAN,utf-8,no,72,utf_8|utf-8,UTF-8 baseline for Korean. +E26_korean_euckr.csv,KOREAN,euc-kr,no,60,euc-kr|euc_kr|cp949,EUC-KR. Korean Windows default. cp949 is the MS variant. +E27_pathological_ascii_only.csv,ASCII_ONLY,ascii,no,66,ascii|utf_8|utf-8|cp1252|iso-8859-1|AMBIGUOUS,"Pure ASCII. Multiple encodings produce identical bytes for this content. Any of ASCII, UTF-8, cp1252, Latin-1 is a correct detection answer because all four decode to the same string. Detector confidence should be high; specific label is interchangeable." +E28_pathological_invalid_utf8.csv,INVALID_UTF8,invalid-utf8,no,67,cp1252|iso-8859-1|REJECT_UTF8,File starts as if UTF-8 but contains an invalid byte sequence (0xC3 0x28). A strict UTF-8 decoder errors. Detector should reject UTF-8 and fall back to a single-byte encoding; cp1252 will produce mojibake but parse without error. Cleaner should warn the user that encoding detection was uncertain. +E29_pathological_truncated_utf8.csv,TRUNCATED_UTF8,invalid-utf8-truncated,no,47,utf_8_with_errors|cp1252|REJECT,"Valid UTF-8 throughout, but the last byte (0xE4) starts a 3-byte sequence that's never completed. Strict UTF-8 decoder errors at EOF. errors='replace' produces \ufffd. Real-world cause: file was truncated by a transfer interruption or a buggy export. Cleaner should treat as corrupt-input error, not silent data loss." +E30_pathological_lying_bom.csv,WESTERN_EXTENDED,cp1252-with-utf8-bom,yes (lying),157,utf_8_FAILS|cp1252|AMBIGUOUS,File has UTF-8 BOM but body is cp1252. UTF-8 decoder will see 0x80 (euro in cp1252) as an invalid UTF-8 continuation byte and error out. 
Better detectors recover by ignoring the BOM and trying cp1252. Cleaner should warn 'BOM suggested UTF-8 but content decoded as cp1252' so the user knows their file is lying about itself. +E31_pathological_mixed_concat.csv,MIXED_CONCAT,cp1252+utf8-concatenated,no,60,LOW_CONFIDENCE|cp1252|utf_8|REJECT,"First half cp1252, second half UTF-8. No single encoding decodes both halves correctly. UTF-8 decoder errors on row 1. cp1252 decoder produces mojibake on rows 2-3. charset-normalizer detection confidence should be low. Right behavior for the cleaner: refuse to process and tell the user the file contains mixed encodings." diff --git a/test-cases/encodings-corpus/reference/ASCII_ONLY.utf8.txt b/test-cases/encodings-corpus/reference/ASCII_ONLY.utf8.txt new file mode 100644 index 0000000..8f21db1 --- /dev/null +++ b/test-cases/encodings-corpus/reference/ASCII_ONLY.utf8.txt @@ -0,0 +1,4 @@ +id,name,city +1,Alice,New York +2,Bob,Chicago +3,Carol,San Francisco diff --git a/test-cases/encodings-corpus/reference/CHINESE_SIMPLIFIED.utf8.txt b/test-cases/encodings-corpus/reference/CHINESE_SIMPLIFIED.utf8.txt new file mode 100644 index 0000000..300df3e --- /dev/null +++ b/test-cases/encodings-corpus/reference/CHINESE_SIMPLIFIED.utf8.txt @@ -0,0 +1,4 @@ +id,name,city +1,张三,北京 +2,李四,上海 +3,Alice Smith,深圳 diff --git a/test-cases/encodings-corpus/reference/CHINESE_TRADITIONAL.utf8.txt b/test-cases/encodings-corpus/reference/CHINESE_TRADITIONAL.utf8.txt new file mode 100644 index 0000000..60a5859 --- /dev/null +++ b/test-cases/encodings-corpus/reference/CHINESE_TRADITIONAL.utf8.txt @@ -0,0 +1,4 @@ +id,name,city +1,張三,台北 +2,李四,香港 +3,Alice Smith,新竹 diff --git a/test-cases/encodings-corpus/reference/CYRILLIC.utf8.txt b/test-cases/encodings-corpus/reference/CYRILLIC.utf8.txt new file mode 100644 index 0000000..d4ad079 --- /dev/null +++ b/test-cases/encodings-corpus/reference/CYRILLIC.utf8.txt @@ -0,0 +1,4 @@ +id,name,city +1,Иван,Москва +2,Анна,Санкт-Петербург +3,Дмитрий,Новосибирск diff 
--git a/test-cases/encodings-corpus/reference/EASTERN_EUROPEAN.utf8.txt b/test-cases/encodings-corpus/reference/EASTERN_EUROPEAN.utf8.txt new file mode 100644 index 0000000..f5f3f92 --- /dev/null +++ b/test-cases/encodings-corpus/reference/EASTERN_EUROPEAN.utf8.txt @@ -0,0 +1,5 @@ +id,name,city,language +1,Příliš,Praha,Czech +2,Żółć,Warszawa,Polish +3,Tűrő,Budapest,Hungarian +4,Spaňski,Bratislava,Slovak diff --git a/test-cases/encodings-corpus/reference/JAPANESE.utf8.txt b/test-cases/encodings-corpus/reference/JAPANESE.utf8.txt new file mode 100644 index 0000000..5a854f4 --- /dev/null +++ b/test-cases/encodings-corpus/reference/JAPANESE.utf8.txt @@ -0,0 +1,4 @@ +id,name,city +1,田中太郎,東京 +2,鈴木花子,大阪 +3,Alice Smith,横浜 diff --git a/test-cases/encodings-corpus/reference/KOREAN.utf8.txt b/test-cases/encodings-corpus/reference/KOREAN.utf8.txt new file mode 100644 index 0000000..abb4304 --- /dev/null +++ b/test-cases/encodings-corpus/reference/KOREAN.utf8.txt @@ -0,0 +1,4 @@ +id,name,city +1,김철수,서울 +2,박영희,부산 +3,Alice Smith,인천 diff --git a/test-cases/encodings-corpus/reference/WESTERN_BASIC.utf8.txt b/test-cases/encodings-corpus/reference/WESTERN_BASIC.utf8.txt new file mode 100644 index 0000000..54b281c --- /dev/null +++ b/test-cases/encodings-corpus/reference/WESTERN_BASIC.utf8.txt @@ -0,0 +1,5 @@ +id,name,city,note +1,Alice,New York,plain ASCII +2,Café Müller,Köln,Latin-1 accents +3,Naïve Façade,Zürich,more accents +4,España,Düsseldorf,Spanish n-tilde diff --git a/test-cases/encodings-corpus/reference/WESTERN_EXTENDED.utf8.txt b/test-cases/encodings-corpus/reference/WESTERN_EXTENDED.utf8.txt new file mode 100644 index 0000000..d204c4b --- /dev/null +++ b/test-cases/encodings-corpus/reference/WESTERN_EXTENDED.utf8.txt @@ -0,0 +1,5 @@ +id,name,note +1,€100 product,euro sign U+20AC +2,“smart” quotes,curly U+201C and U+201D +3,café — résumé,em-dash U+2014 +4,quote’s ok,smart apostrophe U+2019 diff --git a/test-cases/text-cleaner-corpus/test_data/17_preserve_intended.csv 
b/test-cases/text-cleaner-corpus/test_data/17_preserve_intended.csv index 17409c9..d4121bd 100644 --- a/test-cases/text-cleaner-corpus/test_data/17_preserve_intended.csv +++ b/test-cases/text-cleaner-corpus/test_data/17_preserve_intended.csv @@ -1,4 +1,4 @@ id,price,european_number,date,phone,quantity 1, 100 ,1 234,2024-01-15,(555) 123-4567,42 -2," $1,500.00 ",12 345,15/01/2024,555.123.4567,7 +2, $1,500.00 ,12 345,15/01/2024,555.123.4567,7 3, N/A ,nan,Jan 15 2024,+1 555 123 4567,0 diff --git a/tests/test_analyze.py b/tests/test_analyze.py index 66335af..70113e6 100644 --- a/tests/test_analyze.py +++ b/tests/test_analyze.py @@ -204,6 +204,67 @@ class TestNearDuplicates: # Mixed line endings # --------------------------------------------------------------------------- +class TestEncodingUncertainty: + def test_replacement_chars_in_data_flagged(self): + df = pd.DataFrame({"name": ["Caf�", "Ber�in"]}) + findings = analyze(df) + f = next(f for f in findings if f.id == "encoding_uncertain") + assert f.severity == "error" + assert f.confidence == "low" + assert f.count == 2 + + def test_replacement_chars_in_header_flagged(self): + df = pd.DataFrame({"emai�l": ["a@x.com"]}) + findings = analyze(df) + ids = {f.id for f in findings} + assert "encoding_uncertain" in ids + + def test_clean_data_no_finding(self): + df = pd.DataFrame({"name": ["Alice", "Bob"]}) + findings = analyze(df) + assert "encoding_uncertain" not in {f.id for f in findings} + + +class TestEncodingOverride: + def test_override_corrects_misdetected_codepage(self, tmp_path): + # WESTERN_BASIC bytes encoded as cp1252; charset-normalizer guesses + # cp1250, which gets 0xF1 wrong (ń vs ñ). 
+ f = tmp_path / "cp1252.csv" + f.write_bytes("id,name\n1,España\n".encode("cp1252")) + + from src.core.analyze import _load_for_analysis + df_auto, _, _ = _load_for_analysis(f, sample_rows=10) + df_overridden, _, _ = _load_for_analysis( + f, sample_rows=10, encoding_override="cp1252", + ) + # Override yields the correct character. + assert df_overridden["name"].iloc[0] == "España" + + def test_override_propagates_through_top_level_analyze(self, tmp_path): + f = tmp_path / "koi8.csv" + # KOI8-R Cyrillic; default detection guesses Shift_JIS. + f.write_bytes("id,name\n1,Иван\n".encode("koi8-r")) + # With the override the analyzer should produce zero findings + # against this clean fixture (no mojibake, no U+FFFD). + findings = analyze(f, encoding_override="koi8-r") + ids = {x.id for x in findings} + assert "encoding_uncertain" not in ids + assert "encoding_decode_failed" not in ids + + +class TestEncodingDecodeFailedFromRepair: + def test_decode_replaced_action_surfaces_error_finding(self, tmp_path): + # Create a file with a UTF-8 BOM but cp1252 body bytes — utf-8-sig + # fails on byte 0x80 (€ in cp1252). + f = tmp_path / "lying_bom.csv" + f.write_bytes(b"\xef\xbb\xbfid,name\n1,\x80100\n") + findings = analyze(f) + ids = {x.id for x in findings} + assert "encoding_decode_failed" in ids + bad = next(x for x in findings if x.id == "encoding_decode_failed") + assert bad.severity == "error" + + class TestMixedLineEndings: def test_crlf_plus_lf_flagged(self, tmp_path): f = tmp_path / "mixed.csv" diff --git a/tests/test_corpus.py b/tests/test_corpus.py index 33a545e..f70687a 100644 --- a/tests/test_corpus.py +++ b/tests/test_corpus.py @@ -51,14 +51,24 @@ DEFAULT_CASES = [ def _read_csv_strict(path: Path) -> pd.DataFrame: """Read a corpus CSV file, treating all cells as strings. - NUL bytes are stripped from the raw file before parsing because the - pandas C engine truncates fields at NUL while the python engine is - too strict about embedded literal double quotes. 
Stripping NUL is - the file-level pre-clean step the spec describes for case 06. + Applies only the structural pre-parse fixes that are required to make + the file parseable at all — NUL stripping (case 06), line-ending + normalization (cases 09/10), and unquoted-currency repair (case 17). + Character-level folds that the cleaner itself owns (smart quotes, + NBSP, etc.) are deliberately left alone so the cleaner's own behavior + is what's under test. """ - raw = path.read_bytes().replace(b"\x00", b"") + raw = path.read_bytes() + # NUL stripping + raw = raw.replace(b"\x00", b"") + # Line endings: CRLF -> LF, then bare CR -> LF. + raw = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n") + # Per-row repair (handles unquoted '$1,500.00' in case 17). + from src.core.io import _repair_rows + text = raw.decode("utf-8-sig") + text, _, _ = _repair_rows(text, ",") return pd.read_csv( - io.BytesIO(raw), dtype=str, keep_default_na=False, encoding="utf-8-sig", + io.StringIO(text), dtype=str, keep_default_na=False, ) diff --git a/tests/test_encodings_corpus.py b/tests/test_encodings_corpus.py new file mode 100644 index 0000000..8027740 --- /dev/null +++ b/tests/test_encodings_corpus.py @@ -0,0 +1,184 @@ +"""Run the analyzer + detector against the code-page test corpus. + +Fixtures live in ``test-cases/encodings-corpus/`` (synced from +``Business/DataTools/test-case-code-page-variations``). Each test runs +against one fixture and uses the corpus manifest +(``expected_detection.csv``) for ground truth. + +What's tested +------------- +1. ``analyze()`` does not crash on any fixture — every encoded file + produces a Finding list (possibly empty), never an exception. +2. ``detect_encoding()`` returns one of the manifest's accepted answers, + OR the manifest itself flagged the case as AMBIGUOUS / UNRELIABLE / + REJECT / LOW_CONFIDENCE. +3. The decoded DataFrame matches the canonical reference content. 
+ +Cases where the current implementation is known to fail (charset- +normalizer label drift on byte-equivalent encodings, ``repair_bytes`` +NUL-strip destroying UTF-16, the "lying BOM" pathological case) are +marked ``xfail`` so they surface in the report as documented gaps. +A future fix that makes the case pass will flip xfail to xpass and the +test owner can drop the marker. +""" + +from __future__ import annotations + +import csv +import io +from pathlib import Path + +import pandas as pd +import pytest + +from src.core.analyze import analyze, _load_for_analysis +from src.core.io import detect_encoding + + +CORPUS = Path(__file__).parent.parent / "test-cases" / "encodings-corpus" +MANIFEST = CORPUS / "expected_detection.csv" +REFERENCE_DIR = CORPUS / "reference" + +# Known failures the analyzer does not yet handle correctly. Each entry +# has a one-line reason — drop the entry once a fix lands. +KNOWN_DETECTION_FAILURES = { + "E03_western_basic_cp1252.csv": "charset-normalizer returns cp1250 for byte-equivalent content", + "E04_western_basic_latin1.csv": "charset-normalizer returns cp1250 for byte-equivalent content", + "E05_western_basic_latin9.csv": "charset-normalizer returns cp1250 for byte-equivalent content", + "E06_western_basic_macroman.csv": "returns mac_iceland (same family) instead of mac_roman", + "E11_western_extended_cp1252.csv": "charset-normalizer returns cp1250 for cp1252 content", + "E15_eastern_european_iso88592.csv": "charset-normalizer returns cp1258 for ISO-8859-2 content", + "E18_cyrillic_koi8r.csv": "charset-normalizer returns shift_jis_2004 for KOI8-R content", +} + +KNOWN_DECODE_FAILURES = { + "E03_western_basic_cp1252.csv": "decoded as cp1250 — different mapping at 0xF1 (ñ vs ń)", + "E04_western_basic_latin1.csv": "decoded as cp1250 — different mapping at 0xF1", + "E05_western_basic_latin9.csv": "decoded as cp1250 — different mapping at 0xF1", + "E10_western_extended_utf8.csv": "byte-level smart-quote fold rewrites U+201C/U+201D to 
ASCII before parse", + "E11_western_extended_cp1252.csv": "wrong encoding + smart-quote fold", + "E12_western_extended_utf16le.csv": "byte-level smart-quote fold rewrites U+201C/U+201D before parse", + "E15_eastern_european_iso88592.csv": "wrong encoding (cp1258 != ISO-8859-2)", + "E18_cyrillic_koi8r.csv": "wrong encoding (shift_jis_2004 != KOI8-R)", + "E30_pathological_lying_bom.csv": "utf-8-sig fails on cp1252 body bytes; needs lying-BOM recovery", +} + + +def _normalize_encoding(name: str) -> str: + return name.lower().replace("-", "_").replace(" ", "_") + + +def _load_manifest() -> list[dict]: + if not MANIFEST.exists(): + return [] + with MANIFEST.open() as fh: + return list(csv.DictReader(fh)) + + +def _load_references() -> dict[str, str]: + if not REFERENCE_DIR.exists(): + return {} + return { + p.stem.replace(".utf8", ""): p.read_text(encoding="utf-8") + for p in REFERENCE_DIR.glob("*.utf8.txt") + } + + +MANIFEST_ENTRIES = _load_manifest() +REFERENCES = _load_references() + + +def _entry_id(entry: dict) -> str: + return entry["filename"] + + +# --------------------------------------------------------------------------- +# 1. Analyzer never crashes +# --------------------------------------------------------------------------- + +@pytest.mark.parametrize("entry", MANIFEST_ENTRIES, ids=_entry_id) +def test_analyzer_does_not_crash(entry): + findings = analyze(CORPUS / entry["filename"], sample_rows=1000) + # Either empty or a list of Findings — but never raises. + assert isinstance(findings, list) + + +# --------------------------------------------------------------------------- +# 2. 
detect_encoding returns an acceptable answer +# --------------------------------------------------------------------------- + +def _detection_marker(entry): + fname = entry["filename"] + if fname in KNOWN_DETECTION_FAILURES: + return pytest.mark.xfail( + reason=KNOWN_DETECTION_FAILURES[fname], strict=False, + ) + return () + + +@pytest.mark.parametrize( + "entry", + [ + pytest.param(e, marks=_detection_marker(e), id=_entry_id(e)) + for e in MANIFEST_ENTRIES + ], +) +def test_detect_encoding_accepted(entry): + accepted_raw = entry["expected_detection"] + # Manifest fuzzy markers — any answer is acceptable. + if any(m in accepted_raw for m in ("AMBIGUOUS", "UNRELIABLE", "REJECT", "LOW_CONFIDENCE")): + # Just call to ensure no exception. + detect_encoding(CORPUS / entry["filename"]) + return + accepted = {_normalize_encoding(s.strip()) for s in accepted_raw.split("|") if s.strip()} + detected = detect_encoding(CORPUS / entry["filename"]) + detected_n = _normalize_encoding(detected) + assert detected_n in accepted, ( + f"{entry['filename']}: detected {detected!r} not in {sorted(accepted)}" + ) + + +# --------------------------------------------------------------------------- +# 3. 
Decoded content matches the canonical reference +# --------------------------------------------------------------------------- + +def _decode_marker(entry): + fname = entry["filename"] + if fname in KNOWN_DECODE_FAILURES: + return pytest.mark.xfail( + reason=KNOWN_DECODE_FAILURES[fname], strict=False, + ) + return () + + +def _decodable_entries(): + """Skip pathological cases that have no canonical reference.""" + return [e for e in MANIFEST_ENTRIES if e["canonical_content_id"] in REFERENCES] + + +@pytest.mark.parametrize( + "entry", + [ + pytest.param(e, marks=_decode_marker(e), id=_entry_id(e)) + for e in _decodable_entries() + ], +) +def test_decoded_matches_reference(entry): + df, _, _ = _load_for_analysis(CORPUS / entry["filename"], sample_rows=1000) + ref_text = REFERENCES[entry["canonical_content_id"]] + ref_rows = list(csv.reader(io.StringIO(ref_text))) + if not ref_rows: + pytest.skip("empty reference") + + # First row = headers in the reference; compare data rows to df rows. + ref_data = ref_rows[1:] + assert len(df) >= len(ref_data), ( + f"{entry['filename']}: parsed {len(df)} rows, reference has {len(ref_data)}" + ) + for r, ref_row in enumerate(ref_data): + for c, ref_cell in enumerate(ref_row): + actual = str(df.iloc[r, c]) + assert actual == ref_cell, ( + f"{entry['filename']}: row {r} col {c}: " + f"got {actual!r}, expected {ref_cell!r}" + ) diff --git a/tests/test_normalize.py b/tests/test_normalize.py new file mode 100644 index 0000000..cd0805f --- /dev/null +++ b/tests/test_normalize.py @@ -0,0 +1,349 @@ +"""Tests for the CSV-normalization gate. + +Covers: +* ``Finding.confidence`` and ``Finding.fix_action`` field defaults. +* ``auto_fix`` applies every high-confidence finding and leaves + medium/low ones pending. +* ``apply_decisions`` honors per-finding skip / modified payloads. +* ``is_normalized`` re-checks high-confidence detectors after a fix pass. 
+* The full corpus auto-fix sweep: every fixture either passes the gate + or has its remaining medium/low findings declared in pending. +""" + +from __future__ import annotations + +from pathlib import Path + +import pandas as pd +import pytest + +from src.core.analyze import ( + Finding, + analyze, + _load_for_analysis, + FIX_FOLD_SMART_PUNCT, + FIX_LOWERCASE_EMAIL, + FIX_REPLACE_NULL_SENTINELS, + FIX_NONE, +) +from src.core.fixes import get_fix, available_actions +from src.core.normalize import ( + Decision, + NormalizationResult, + auto_fix, + apply_decisions, + is_normalized, + gate_summary, +) + + +CORPUS = Path(__file__).parent.parent / "test-cases" / "text-cleaner-corpus" / "test_data" + + +# --------------------------------------------------------------------------- +# Field defaults +# --------------------------------------------------------------------------- + +class TestFindingFields: + def test_default_confidence_is_high(self): + f = Finding(id="x", severity="warn", tool="", count=1, description="d") + assert f.confidence == "high" + + def test_default_fix_action_is_empty(self): + f = Finding(id="x", severity="warn", tool="", count=1, description="d") + assert f.fix_action == "" + + def test_pre_applied_default_false(self): + f = Finding(id="x", severity="warn", tool="", count=1, description="d") + assert f.pre_applied is False + + def test_smart_punct_finding_carries_fix_action(self): + df = pd.DataFrame({"x": ["“hello”"]}) + findings = analyze(df) + smart = next(f for f in findings if f.id == "smart_punctuation_in_data") + assert smart.confidence == "high" + assert smart.fix_action == FIX_FOLD_SMART_PUNCT + + def test_mojibake_finding_is_low_confidence(self): + # UTF-8 bytes of "café" mis-decoded as Latin-1; escaped so the + # mojibake literal survives editors that silently repair it. 
+ df = pd.DataFrame({"x": ["caf\u00c3\u00a9"]}) + findings = analyze(df) + moji = next(f for f in findings if f.id == "suspected_mojibake") + assert moji.confidence == "low" + + +# --------------------------------------------------------------------------- +# Fix registry +# 
--------------------------------------------------------------------------- + +class TestFixRegistry: + def test_high_confidence_fixes_registered(self): + actions = available_actions() + assert FIX_FOLD_SMART_PUNCT in actions + assert FIX_LOWERCASE_EMAIL in actions + assert FIX_REPLACE_NULL_SENTINELS in actions + + def test_get_fix_returns_callable(self): + fn = get_fix(FIX_FOLD_SMART_PUNCT) + assert callable(fn) + + def test_get_fix_unknown_returns_none(self): + assert get_fix("not_a_real_action") is None + + +# --------------------------------------------------------------------------- +# auto_fix +# --------------------------------------------------------------------------- + +class TestAutoFix: + def test_applies_high_confidence_only(self): + df = pd.DataFrame({ + "name": [" Alice ", "Bob "], # whitespace + NBSP -> high + "email": ["A@X.com", "b@x.com"], # mixed case -> medium + }) + findings = analyze(df) + result = auto_fix(df, findings) + + # whitespace_padding and nbsp_or_unicode_whitespace should be applied. + applied_ids = {a.finding_id for a in result.applied} + assert "whitespace_padding" in applied_ids + assert "nbsp_or_unicode_whitespace" in applied_ids + + # mixed_case_email_column is medium -> pending. 
+ pending_ids = {f.id for f in result.pending_findings} + assert "mixed_case_email_column" in pending_ids + + def test_cells_actually_changed(self): + df = pd.DataFrame({"x": [" hi ", "ok"]}) + findings = analyze(df) + result = auto_fix(df, findings) + assert result.cleaned_df["x"].tolist() == ["hi", "ok"] + + def test_no_findings_no_fixes(self): + df = pd.DataFrame({"id": ["1", "2"], "name": ["a", "b"]}) + findings = analyze(df) + result = auto_fix(df, findings) + assert result.applied == [] + assert result.passed is True + + def test_blocks_on_severity_error(self, tmp_path): + f = tmp_path / "empty.csv" + f.write_bytes(b"") + findings = analyze(f) + df, _, _ = _load_for_analysis(f, sample_rows=1000) + result = auto_fix(df, findings) + assert any(b.id == "empty_input" for b in result.blocking_findings) + assert result.passed is False + + +# --------------------------------------------------------------------------- +# apply_decisions +# --------------------------------------------------------------------------- + +class TestApplyDecisions: + def test_skip_decision_records_skipped(self): + df = pd.DataFrame({"x": ["“smart”"]}) + findings = analyze(df) + decisions = [Decision(finding_id="smart_punctuation_in_data", action="skip")] + result = apply_decisions(df, findings, decisions) + assert any(s.id == "smart_punctuation_in_data" for s in result.skipped_findings) + # And the smart quotes survived. + assert "“" in result.cleaned_df["x"].iloc[0] + + def test_auto_decision_runs_fix(self): + df = pd.DataFrame({"x": ["“smart”"]}) + findings = analyze(df) + decisions = [Decision(finding_id="smart_punctuation_in_data", action="auto")] + result = apply_decisions(df, findings, decisions) + assert result.cleaned_df["x"].iloc[0] == '"smart"' + + def test_modified_decision_uses_payload(self): + df = pd.DataFrame({"status": ["ACTIVE", "TBD", "TBD", "active"]}) + findings = analyze(df) + # Restrict the null-sentinel set to only "TBD" via payload. 
+ decisions = [Decision( + finding_id="null_like_sentinels", + action="modified", + payload={"sentinels": ["TBD"]}, + )] + # null_like_sentinels needs to be present for the decision to apply. + if not any(f.id == "null_like_sentinels" for f in findings): + pytest.skip("analyzer didn't surface null sentinels for this fixture") + result = apply_decisions(df, findings, decisions) + assert result.cleaned_df["status"].tolist() == ["ACTIVE", "", "", "active"] + + def test_lowercase_email_uses_finding_column(self): + df = pd.DataFrame({ + "email": ["ALICE@X.com", "bob@x.com"], + "name": ["Alice", "Bob"], + }) + findings = analyze(df) + decisions = [Decision(finding_id="mixed_case_email_column", action="auto")] + if not any(f.id == "mixed_case_email_column" for f in findings): + pytest.skip("analyzer didn't surface mixed-case email") + result = apply_decisions(df, findings, decisions) + assert result.cleaned_df["email"].tolist() == ["alice@x.com", "bob@x.com"] + # Other columns untouched. + assert result.cleaned_df["name"].tolist() == ["Alice", "Bob"] + + def test_undecided_medium_finding_stays_pending(self): + df = pd.DataFrame({"email": ["A@X.com", "b@x.com"]}) + findings = analyze(df) + result = apply_decisions(df, findings, decisions=[]) + if not any(f.id == "mixed_case_email_column" for f in findings): + pytest.skip("analyzer didn't surface mixed-case email") + assert any(f.id == "mixed_case_email_column" for f in result.pending_findings) + + +# --------------------------------------------------------------------------- +# is_normalized +# --------------------------------------------------------------------------- + +class TestIsNormalized: + def test_clean_dataframe_passes(self): + df = pd.DataFrame({"id": ["1"], "name": ["Alice"]}) + findings = analyze(df) + result = auto_fix(df, findings) + assert is_normalized(findings, result) is True + + def test_unnormalized_after_skip_high_confidence(self): + df = pd.DataFrame({"x": [" padded "]}) + findings = analyze(df) + # 
Skip the only high-confidence fix. + decisions = [Decision(finding_id="whitespace_padding", action="skip")] + result = apply_decisions(df, findings, decisions) + # Re-analysis still finds the issue, so gate is not normalized. + assert is_normalized(findings, result) is False + + def test_pending_medium_blocks_gate(self): + df = pd.DataFrame({"email": ["A@X.com", "b@x.com"]}) + findings = analyze(df) + result = auto_fix(df, findings) + # auto_fix leaves medium pending -> gate not passed. + if any(f.id == "mixed_case_email_column" for f in findings): + assert is_normalized(findings, result) is False + + def test_none_result_not_normalized(self): + assert is_normalized([], None) is False + + +# --------------------------------------------------------------------------- +# Corpus sweep — every fixture either passes or has declared pending +# --------------------------------------------------------------------------- + +CORPUS_FILES = sorted(CORPUS.glob("*.csv")) if CORPUS.exists() else [] + +# Fixtures that will have pending medium/low findings after auto_fix. +EXPECTED_PENDING_AFTER_AUTOFIX = { + "11_embedded_newlines": {"mixed_case_email_column"}, + "12_case_variations": {"mixed_case_email_column"}, + "14_mojibake": {"suspected_mojibake"}, + "17_preserve_intended": {"null_like_sentinels"}, + "20_kitchen_sink": {"mixed_case_email_column"}, +} + +# Fixtures that block the gate via severity=error findings. 
+EXPECTED_BLOCKING = { + "18_empty_file": {"empty_input"}, +} + + +@pytest.mark.parametrize("path", CORPUS_FILES, ids=lambda p: p.stem) +def test_corpus_auto_fix_state(path): + """Every corpus fixture either passes auto_fix or has its remaining + pending/blocking findings declared in the expected sets above.""" + findings = analyze(path, sample_rows=1000) + df, _, _ = _load_for_analysis(path, sample_rows=1000) + result = auto_fix(df, findings) + + pending_ids = {f.id for f in result.pending_findings} + blocking_ids = {f.id for f in result.blocking_findings} + + expected_pending = EXPECTED_PENDING_AFTER_AUTOFIX.get(path.stem, set()) + expected_blocking = EXPECTED_BLOCKING.get(path.stem, set()) + + assert pending_ids == expected_pending, ( + f"{path.name}: pending {pending_ids} != expected {expected_pending}" + ) + assert blocking_ids == expected_blocking, ( + f"{path.name}: blocking {blocking_ids} != expected {expected_blocking}" + ) + + +def test_corpus_auto_fix_idempotent(): + """Running auto_fix on its own output yields the same bytes.""" + if not CORPUS_FILES: + pytest.skip("corpus not present") + path = CORPUS / "20_kitchen_sink.csv" + findings = analyze(path, sample_rows=1000) + df, _, _ = _load_for_analysis(path, sample_rows=1000) + r1 = auto_fix(df, findings) + # Re-analyze the cleaned frame and run again. + f2 = analyze(r1.cleaned_df) + r2 = auto_fix(r1.cleaned_df, f2) + assert r1.cleaned_bytes == r2.cleaned_bytes + + +# --------------------------------------------------------------------------- +# Output options (Review page download flow) +# --------------------------------------------------------------------------- + +class TestOutputOptions: + """Contract for the Review page's _build_output_bytes helper (download flow). + + The real helper is not imported here, because the page itself runs + Streamlit code at module load; the function shape is copied below as a + compact spec so a future refactor that moves the helper into + core/io.py can keep the same contract. 
+ """ + + @staticmethod + def _build(df, *, encoding, delimiter, line_terminator): + import io as _io + buf = _io.StringIO() + df.to_csv(buf, index=False, sep=delimiter, lineterminator=line_terminator) + text = buf.getvalue() + try: + return text.encode(encoding), None + except UnicodeEncodeError: + return text.encode(encoding, errors="replace"), "lossy" + + def test_utf8_with_bom_starts_with_bom(self): + df = pd.DataFrame({"x": ["a"]}) + data, _ = self._build(df, encoding="utf-8-sig", delimiter=",", line_terminator="\n") + assert data.startswith(b"\xef\xbb\xbf") + + def test_crlf_line_terminator(self): + df = pd.DataFrame({"x": ["a", "b"]}) + data, _ = self._build(df, encoding="utf-8", delimiter=",", line_terminator="\r\n") + assert b"\r\n" in data + assert b"\nb" not in data.replace(b"\r\n", b"") + + def test_tab_delimiter(self): + df = pd.DataFrame({"a": ["x"], "b": ["y"]}) + data, _ = self._build(df, encoding="utf-8", delimiter="\t", line_terminator="\n") + assert data.startswith(b"a\tb\n") + + def test_cp1252_single_byte_accents(self): + df = pd.DataFrame({"name": ["José"]}) + data, _ = self._build(df, encoding="cp1252", delimiter=",", line_terminator="\n") + # 'é' is single byte 0xE9 in cp1252 (vs 0xC3 0xA9 in UTF-8) + assert b"\xe9" in data + assert b"\xc3\xa9" not in data + + def test_lossy_codepage_returns_warning(self): + df = pd.DataFrame({"name": ["Иван"]}) # Cyrillic + data, warn = self._build(df, encoding="cp1252", delimiter=",", line_terminator="\n") + assert warn is not None + assert b"?" in data # replacement chars + + +class TestGateSummary: + def test_summary_keys(self): + df = pd.DataFrame({"x": [" hi "]}) + findings = analyze(df) + result = auto_fix(df, findings) + s = gate_summary(result) + assert set(s.keys()) == { + "passed", "fixes_applied", "cells_changed", + "skipped", "pending", "blocking", + }