diff --git a/README.md b/README.md
index 4c77721..dd1aaa0 100644
--- a/README.md
+++ b/README.md
@@ -149,10 +149,20 @@ Outputs `{input}_cleaned.csv` plus a per-cell `{input}_changes.csv` audit (row,
 
 See [docs/CLI-REFERENCE.md](docs/CLI-REFERENCE.md#text-cleaner-cli) for every flag.
 
+## Review & Normalize gate
+
+Every uploaded file passes through a CSV-normalization gate before any tool page sees it. The analyzer scans for ~15 issue types — whitespace pollution, NBSP / zero-width chars, mixed line endings, BOM artifacts, encoding misdetections, smart punctuation, dirty headers, null sentinels, mojibake, and more — and tags each finding by **confidence** (high / medium / low) and **fix action** (the algorithm in `src/core/fixes.py` that resolves it).
+
+In the GUI, the **Review & Normalize** page renders one expandable card per finding with a decision control (Auto-fix / Skip / Customize), a live before-and-after preview, an encoding-override picker for misdetected codepages, and an Advanced output options block (encoding, delimiter, line terminator) for the download. Tool pages refuse to load until the gate passes.
+
+See [docs/USER-GUIDE.md §3.3](docs/USER-GUIDE.md) for the user-facing walkthrough and [docs/TECHNICAL.md §10.2.1–10.2.4](docs/TECHNICAL.md) for the developer-facing API.
+
 ## Documentation
 
+- [User Guide](docs/USER-GUIDE.md) — installation, GUI workflow, the Review & Normalize gate
 - [CLI Reference](docs/CLI-REFERENCE.md) — every flag with examples and recipe sections
-- [Developer Guide](docs/DEVELOPER.md) — architecture, data flow, how to extend
+- [Technical](docs/TECHNICAL.md) — architecture, gate internals, finding schema, fix registry
+- [Developer Guide](docs/DEVELOPER.md) — extending the bundle, adding fixes / detectors
 
 ## Requirements
 
diff --git a/docs/CLI-REFERENCE.md b/docs/CLI-REFERENCE.md
index bb44a1d..57bb591 100644
--- a/docs/CLI-REFERENCE.md
+++ b/docs/CLI-REFERENCE.md
@@ -412,3 +412,40 @@ python -m src.cli_text_clean tickets.csv --skip notes --apply
 python -m src.cli_text_clean messy.csv --preset minimal --case upper --save-config my.json
 python -m src.cli_text_clean other.csv --config my.json --apply
 ```
+
+---
+
+## Analyzer (upload-time scan)
+
+```
+python -m src.cli_analyze INPUT_FILE [OPTIONS]
+
+  --sample-rows N       Cap on rows scanned (default 1000)
+  --json                Print findings as a JSON array on stdout
+  --strict              Exit non-zero on any warn/error finding
+```
+
+JSON output schema (one object per finding):
+
+```json
+{
+  "id": "smart_punctuation_in_data",
+  "severity": "warn",
+  "confidence": "high",
+  "fix_action": "fold_smart_punctuation",
+  "pre_applied": false,
+  "tool": "02_text_cleaner",
+  "count": 17,
+  "description": "17 cell(s) contain curly quotes…",
+  "column": null,
+  "samples": [{"row": 3, "column": "name", "value": "“Alice”"}]
+}
+```
+
+- `severity` — `info` / `warn` / `error`. Only `error` blocks the GUI normalization gate.
+- `confidence` — `high` (round-trip-safe, eligible for one-click auto-fix), `medium` (preview before applying), `low` (heuristic, opt-in only).
+- `fix_action` — stable id naming the algorithm in `src/core/fixes.py` that resolves the finding. Empty string for informational-only findings.
+- `pre_applied` — `true` for fixes already applied during the byte-level read pass (BOM strip, NUL strip, line-ending normalize, byte-level smart-quote fold, transcode-to-UTF-8 from UTF-16/32). The GUI gate treats these as already-resolved; the CLI emits them so callers can audit what changed during read.
+
+The detector set covers smart punctuation, NBSP / Unicode whitespace, zero-width characters, dirty headers, whitespace padding, null-like sentinels, mojibake fingerprints (UTF-8-as-cp1252), mixed-case email columns, near-duplicate rows (case-and-padding stripped), leading-zero IDs (Excel hazard), mixed line endings, encoding decode failure (`encoding_decode_failed`), and U+FFFD presence in the loaded text (`encoding_uncertain`). New detectors plug in by appending one entry to `analyze.py` and one matching fix in `fixes.py`.
+
diff --git a/docs/TECHNICAL.md b/docs/TECHNICAL.md
index 9d39ac3..ac88e90 100644
--- a/docs/TECHNICAL.md
+++ b/docs/TECHNICAL.md
@@ -505,6 +505,66 @@ The market gap this script fills: **one-click correctness for the dirty-CSV fail
 - CLI surface: separate `src/cli_text_clean.py` module, matching the "one CLI binary per script on PATH" pattern in Section 3.2. Not a subcommand on the existing dedup Typer app.
 - `ftfy` dependency deferred to Tier 2 (~5MB). Revisit if mojibake reports come in.
 
+### 10.2.1 Upload-time analyzer (`src/core/analyze.py`)
+
+The analyzer is a read-only, advisory pass that runs on every uploaded file before any tool page sees it. It produces a list of `Finding` objects, each carrying:
+
+| Field | Type | Meaning |
+|---|---|---|
+| `id` | str | Stable identifier (`smart_punctuation_in_data`, `mixed_line_endings`, …). Never localized. |
+| `severity` | `info` / `warn` / `error` | UX urgency. `error` is the only level that blocks the gate. |
+| `confidence` | `high` / `medium` / `low` | Auto-fixability. **High** is round-trip safe, **medium** has known false-positive shapes, **low** is heuristic and opt-in. |
+| `fix_action` | str | Stable id naming the algorithm in `src/core/fixes.py` that resolves this finding. Empty for informational-only findings. |
+| `pre_applied` | bool | True when the fix already ran during the read pass (BOM strip, NUL strip, byte-level smart-quote fold). The gate treats these as already-resolved. |
+| `tool` | str | Tool id that owns this concern (`02_text_cleaner`, `04_missing_handler`). Empty for file-level findings. |
+| `count` | int | Cells / rows affected. |
+| `description` | str | One-sentence human summary (banners, tooltips). |
+| `column` | str / None | Column name when scoped to one column. |
+| `samples` | list[(row, col, value)] | Up to 5 examples for the GUI to render. |
+
+`analyze(source, *, sample_rows=1000, repair_result=None, encoding_override=None)` is the public entry point. `source` is a DataFrame or a path; `encoding_override` skips charset detection and uses the user's chosen codepage instead — this is the hook that lets the Review page recover from misdetections (cp1252-vs-cp1250 ambiguity, KOI8-R surfacing as Shift_JIS).
+
+### 10.2.2 CSV-normalization gate (`src/core/normalize.py`, `src/core/fixes.py`)
+
+A file enters tool pages only after passing the gate. The gate has two paths:
+
+1. **Auto-fix** — `auto_fix(df, findings)` applies every `confidence="high"` finding whose `fix_action` is registered in `fixes.py`.
+2. **Per-finding decisions** — `apply_decisions(df, findings, decisions)` accepts an explicit list of `Decision(finding_id, action, payload)` where action is `"auto" | "skip" | "modified"`.
+
+Output is a `NormalizationResult` with:
+
+- `cleaned_df` — the DataFrame after every applied fix.
+- `cleaned_bytes` — UTF-8 CSV serialization for the download.
+- `applied`, `skipped_findings`, `pending_findings`, `blocking_findings` — audit log + gate status.
+
+`is_normalized(findings, result)` re-runs `analyze()` against the cleaned bytes and returns False if any high-confidence detector still fires — that's the strict contract tool pages depend on.
+
+`fixes.py` is a registry: `@register("fix_id")` decorates a `(df, payload) -> (new_df, n_cells_changed)` function. Adding a new fix means appending one entry to `analyze.py`'s `FIX_*` constants, one detector that emits a Finding with that `fix_action`, and one registered function in `fixes.py`. No other call sites change.
+
+### 10.2.3 Review page (`src/gui/pages/0_Review.py`)
+
+Streamlit page that orchestrates the gate visually. Gates the entire tool sidebar via `require_normalization_gate()` in `src/gui/components.py`, which every tool page calls right after `hide_streamlit_chrome()`.
+
+The page:
+
+1. Surfaces the detected encoding plus an override picker (16 common codepages + custom-text fallback).
+2. Renders one expandable card per finding, sorted by severity then confidence, with a decision radio (Auto / Skip / Customize), a live before/after preview built by running the registered fix on each `Finding.samples` value, and a payload editor for fixes that take user input (e.g. custom null-sentinel list for `replace_null_sentinels`).
+3. Apply button persists a `NormalizationResult` keyed by upload SHA-256; tool pages refuse to load until the hash matches.
+4. After apply, an `⚙️ Advanced output options` expander offers per-download encoding, delimiter, and line-terminator selection. The helper `_build_output_bytes(df, *, encoding, delimiter, line_terminator)` returns `(bytes, error_message)` — when the chosen encoding can't represent a character, falls back to `errors="replace"` and returns a warning the page surfaces.
+
+### 10.2.4 Pre-parse repair (`src/core/io.py::repair_bytes`)
+
+Byte-level pre-parse pass. Order is meaningful and each step is independently toggleable:
+
+1. **Wide-encoding transcode** — UTF-16/UTF-32 → UTF-8. Has to run first because the byte-level NUL strip below would shred UTF-16 data (UTF-16 ASCII chars carry NUL as half of every 16-bit unit). Records `transcode_to_utf8` audit action; the analyzer surfaces it as a `csv_transcoded_to_utf8` info finding.
+2. **UTF-8 BOM strip** (file start only).
+3. **NUL strip** — only meaningful after step 1, so genuine corruption (truncated C strings, half-binary exports) rather than encoding artifacts.
+4. **Line-ending normalize** — CRLF and bare CR → LF. Bare CR confuses the C parser; the text-cleaner contract also calls for LF inside multi-line cells.
+5. **Byte-level smart-quote fold** — curly / guillemet / double-prime → ASCII `"`. Only structural double-quote-equivalents; single curly quotes are deferred to the cell-level cleaner.
+6. **Per-row delimiter repair** — when one row has +1 field and the merge candidate is currency-shaped (`$1,500.00` etc.), merge and quote.
+
+`detect_encoding()` tries strict UTF-8 first and returns `"utf-8"` if the bytes decode cleanly. This was added because charset-normalizer fingerprints small files dominated by short non-ASCII sequences (e.g. zero-width chars at U+200B-class) as `mac_latin2` — but if the bytes are valid UTF-8, that's the right answer regardless of label.
+
 ### 10.3 - 10.9 (Future)
 
 Functional specs for scripts 03 through 09 to be added when each script enters active build. The deduplicator (10.1) and text cleaner (10.2) specs are the template; specs for other scripts should follow the same Tier 1 / Tier 2 / Tier 3 structure with explicit strategic framing (what's the market gap this script fills, given that some of its functionality is available free elsewhere).
diff --git a/docs/USER-GUIDE.md b/docs/USER-GUIDE.md
index 0a045c0..60a8609 100644
--- a/docs/USER-GUIDE.md
+++ b/docs/USER-GUIDE.md
@@ -125,6 +125,41 @@ deduplicator --help
 
 ---
 
+## 3.3 Review & Normalize gate
+
+Before any tool page accepts a file, the file passes through a **CSV-normalization gate**. The gate scans every uploaded file, surfaces every data-quality issue our analyzer can detect, and lets you choose how to handle each one before downstream tools see the data.
+
+### How it works
+
+1. Upload a file on the home page. The analyzer scans it and counts findings by confidence tier.
+2. Click any tool. If the file hasn't been normalized yet, you're redirected to the **Review & Normalize** page.
+3. The page shows every finding grouped by severity and confidence, with a per-finding decision control.
+
+### Confidence tiers
+
+- **High** — round-trip-safe algorithmic fix (BOM strip, whitespace trim, NBSP / zero-width strip, smart-quote fold, line-ending normalize, header cleanup). One-click "Auto-fix high-confidence" applies them all.
+- **Medium** — right call in the common case but with known false-positive shapes. Examples: lowercasing the email column, replacing null-like sentinels (`N/A`, `-`, `nan`), repairing unquoted-currency rows. Preview the change before applying.
+- **Low** — heuristic fixes that can corrupt data when wrong. Mojibake repair (`café` → `café`), mixed-encoding detection. Off by default; you opt in per finding.
+- **Error** — blocking. Empty file, unrepairable rows, U+FFFD replacement characters. Cannot enter the tool pages until resolved or explicitly waived.
+
+### Encoding override
+
+When the analyzer reports `encoding_uncertain` or you spot mojibake (`Ã©`) or `�` characters in the findings list, use the **File encoding** picker at the top of the Review page. Pick the right code page (cp1252 for Western Excel exports, KOI8-R for older Russian data, Big5 for traditional Chinese, etc.) or type a custom one, then click **Re-analyze**. Findings refresh against the corrected decode.
+
+The picker is hidden for `.xlsx` files since Excel stores text as Unicode internally.
+
+### Advanced output options
+
+After applying decisions, an `⚙️ Advanced output options` expander on the download appears. Three dropdowns let you tune the output file format:
+
+- **Encoding (code page)** — UTF-8 (default), UTF-8 with BOM (Excel-friendly), Windows-1252, Latin-1, Latin-9, cp1250, ISO-8859-2, cp1251, Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE.
+- **Delimiter** — comma (default), tab, semicolon, pipe.
+- **Line terminator** — LF (default), CRLF (Windows), CR.
+
+The download filename auto-adjusts the extension (`.tsv` for tab, otherwise `.csv`). When the chosen encoding can't represent a character (Cyrillic content into cp1252, Asian script into Latin-1), the page shows a warning naming the offending character and falls back to `?` replacement so the download still works.
+
+---
+
 ## 4. Output
 
 Every script writes:
diff --git a/run_tests.py b/run_tests.py
index d801ad0..2b7daea 100755
--- a/run_tests.py
+++ b/run_tests.py
@@ -52,13 +52,20 @@ _TOOL_MAP: dict[str, str] = {
     "cli": "test_cli or test_cli_text_clean or test_cli_analyze",
     "config": "test_config",
     "normalizers": "test_normalizers",
+    "normalize": "test_normalize",
+    "encodings": "test_encodings_corpus or test_io",
+    "gate": "test_normalize",
 }
 
 _CATEGORY_PATHS: dict[str, list[str]] = {
     "unit": ["tests/"],          # all tests are unit unless marked otherwise
     "e2e": ["tests/test_e2e.py"],
     "install": ["tests/test_install.py"],
-    "fixtures": ["tests/test_corpus.py", "tests/test_fixtures_sweep.py"],
+    "fixtures": [
+        "tests/test_corpus.py",
+        "tests/test_fixtures_sweep.py",
+        "tests/test_encodings_corpus.py",
+    ],
 }
 
 
diff --git a/src/core/analyze.py b/src/core/analyze.py
index ad50bed..4561aee 100644
--- a/src/core/analyze.py
+++ b/src/core/analyze.py
@@ -25,6 +25,7 @@ from pandas.api import types as pdtypes
 from .io import RepairResult, repair_bytes, detect_encoding, detect_delimiter
 
 Severity = Literal["info", "warn", "error"]
+Confidence = Literal["high", "medium", "low"]
 
 
 # Tool identifiers — match the 0N_<name> convention used by the script set.
@@ -35,6 +36,29 @@ TOOL_DEDUPLICATOR = "01_deduplicator"
 TOOL_FORMAT_STANDARDIZER = "03_format_standardizer"
 
 
+# Stable fix-action ids. These name the algorithm that resolves a finding;
+# the normalize layer dispatches on this id. Keep in sync with fixes.py.
+FIX_TRIM_WHITESPACE = "trim_whitespace"
+FIX_STRIP_NBSP = "strip_nbsp_unicode_whitespace"
+FIX_STRIP_ZERO_WIDTH = "strip_zero_width"
+FIX_FOLD_SMART_PUNCT = "fold_smart_punctuation"
+FIX_CLEAN_HEADERS = "clean_headers"
+FIX_NORMALIZE_LINE_ENDINGS = "normalize_line_endings"
+FIX_STRIP_BOM = "strip_bom"
+FIX_STRIP_NUL = "strip_nul"
+FIX_FOLD_SMART_QUOTES_BYTE = "fold_smart_quotes_byte"
+FIX_REPAIR_UNQUOTED_DELIM = "repair_unquoted_delimiters"
+FIX_LOWERCASE_EMAIL = "lowercase_email_column"
+FIX_REPLACE_NULL_SENTINELS = "replace_null_sentinels"
+FIX_REPAIR_MOJIBAKE = "repair_mojibake"
+FIX_NONE = ""  # informational — nothing to apply
+
+# Replacement character (U+FFFD) inserted when a decoder gave up on a byte.
+# Anything more than a tiny ratio of it in the loaded text is a strong
+# signal that the encoding was wrong.
+_REPLACEMENT_CHAR = "�"
+
+
 @dataclass
 class Finding:
     """One issue the analyzer surfaced.
@@ -47,6 +71,16 @@ class Finding:
     severity
         ``"info"`` (FYI), ``"warn"`` (likely needs cleanup),
         ``"error"`` (will block downstream work).
+    confidence
+        ``"high"`` — round-trip-safe algorithmic fix, eligible for auto-fix.
+        ``"medium"`` — right call in the common case but has known
+        false-positive shapes; user should preview before applying.
+        ``"low"`` — heuristic; the wrong call corrupts data; opt-in only.
+        Independent of severity: a ``warn`` finding can be high-confidence
+        (NBSP strip) and an ``info`` finding can be low-confidence (mojibake).
+    fix_action
+        Stable id naming the algorithm that resolves this finding. Empty
+        string for informational findings with no associated fix.
     tool
         Tool id that can address the finding, or empty string for purely
         informational findings.
@@ -69,6 +103,13 @@ class Finding:
     description: str
     column: Optional[str] = None
     samples: list[tuple[int, str, str]] = field(default_factory=list)
+    confidence: Confidence = "high"
+    fix_action: str = FIX_NONE
+    # True when the fix already ran during the pre-parse repair pass
+    # (e.g. BOM strip, byte-level smart-quote fold). The gate treats these
+    # as already-resolved; the review page still surfaces them so the
+    # user can see what was auto-applied during read.
+    pre_applied: bool = False
 
 
 # ---------------------------------------------------------------------------
@@ -139,6 +180,8 @@ def _detect_smart_punctuation(df: pd.DataFrame) -> list[Finding]:
             f"regex patterns."
         ),
         samples=sample_rows,
+        confidence="high",
+        fix_action=FIX_FOLD_SMART_PUNCT,
     )]
 
 
@@ -172,6 +215,8 @@ def _detect_invisible_chars(df: pd.DataFrame) -> list[Finding]:
                 f"join keys."
             ),
             samples=nbsp_samples,
+            confidence="high",
+            fix_action=FIX_STRIP_NBSP,
         ))
     if zw_cells:
         findings.append(Finding(
@@ -184,6 +229,8 @@ def _detect_invisible_chars(df: pd.DataFrame) -> list[Finding]:
                 f"characters (ZWSP, ZWJ, soft hyphen, BOM, bidi marks)."
             ),
             samples=zw_samples,
+            confidence="high",
+            fix_action=FIX_STRIP_ZERO_WIDTH,
         ))
     # Headers carry the same risks; flag separately so the user sees that
     # df["Email"] vs df["Email​"] is the issue.
@@ -208,6 +255,8 @@ def _detect_invisible_chars(df: pd.DataFrame) -> list[Finding]:
                 f"df['col'] lookups."
             ),
             samples=[(0, h, h) for h in bad_headers[:5]],
+            confidence="high",
+            fix_action=FIX_CLEAN_HEADERS,
         ))
     return findings
 
@@ -235,6 +284,8 @@ def _detect_whitespace_padding(df: pd.DataFrame) -> list[Finding]:
             f"multi-space internal runs. Common cause of failed joins."
         ),
         samples=samples,
+        confidence="high",
+        fix_action=FIX_TRIM_WHITESPACE,
     )]
 
 
@@ -264,6 +315,8 @@ def _detect_null_like_sentinels(df: pd.DataFrame) -> list[Finding]:
             f"counts as missing in the missing-value handler."
         ),
         samples=samples,
+        confidence="medium",
+        fix_action=FIX_REPLACE_NULL_SENTINELS,
     )]
 
 
@@ -290,6 +343,8 @@ def _detect_mojibake(df: pd.DataFrame) -> list[Finding]:
             f"patterns (Ã©, â€™, etc.). Auto-repair is opt-in (Tier 2)."
         ),
         samples=samples,
+        confidence="low",
+        fix_action=FIX_REPAIR_MOJIBAKE,
     )]
 
 
@@ -316,6 +371,8 @@ def _detect_mixed_case_email(df: pd.DataFrame) -> list[Finding]:
                 ),
                 column=col,
                 samples=samples,
+                confidence="medium",
+                fix_action=FIX_LOWERCASE_EMAIL,
             ))
     return findings
 
@@ -362,6 +419,8 @@ def _detect_near_duplicates(df: pd.DataFrame) -> list[Finding]:
             f"Run the deduplicator to merge or remove."
         ),
         samples=samples,
+        confidence="medium",
+        fix_action=FIX_NONE,  # routed to dedup tool, not auto-fixed here
     )]
 
 
@@ -397,23 +456,60 @@ def _detect_leading_zero_ids(df: pd.DataFrame) -> list[Finding]:
                 ),
                 column=str(col),
                 samples=samples,
+                confidence="low",
+                fix_action=FIX_NONE,  # informational only
             ))
     return findings
 
 
+def _count_row_terminators(raw: bytes) -> tuple[int, int, int]:
+    """Count CRLF / LF / CR sequences that act as *row* terminators.
+
+    Walks the bytes tracking quoted-region state so that line breaks
+    inside multi-line quoted cells (e.g. an address column) are not
+    counted. Without this, files that legitimately have CRLF at row
+    boundaries plus LF inside quoted cells get false-positive
+    ``mixed_line_endings`` findings.
+    """
+    n_crlf = n_lf = n_cr = 0
+    in_quotes = False
+    i = 0
+    n = len(raw)
+    while i < n:
+        b = raw[i]
+        if b == 0x22:  # ASCII double quote — toggles quoted region.
+            # Doubled quote inside a quoted cell is an escape, not an exit.
+            if in_quotes and i + 1 < n and raw[i + 1] == 0x22:
+                i += 2
+                continue
+            in_quotes = not in_quotes
+            i += 1
+            continue
+        if not in_quotes:
+            if b == 0x0D:  # CR
+                if i + 1 < n and raw[i + 1] == 0x0A:
+                    n_crlf += 1
+                    i += 2
+                    continue
+                n_cr += 1
+            elif b == 0x0A:  # LF
+                n_lf += 1
+        i += 1
+    return n_crlf, n_lf, n_cr
+
+
 def _detect_mixed_line_endings(raw: bytes) -> list[Finding]:
-    """Flag files that mix CRLF, LF, and bare CR line terminators.
+    """Flag files that mix CRLF, LF, and bare CR row terminators.
 
     Mixed endings are a classic disaster pattern after multi-source concat
-    (Windows + macOS + Linux exports stitched together). Operates on raw
+    (Windows + macOS + Linux exports stitched together). Counts only the
+    terminators that act as row separators, so embedded newlines inside
+    quoted multi-line cells don't create false positives. Operates on raw
     bytes only — DataFrame-mode :func:`analyze` skips this detector.
     """
     if not raw:
         return []
-    n_crlf = raw.count(b"\r\n")
-    # Count standalone \r and \n (not part of \r\n) by subtracting overlaps.
-    n_lf = raw.count(b"\n") - n_crlf
-    n_cr = raw.count(b"\r") - n_crlf
+    n_crlf, n_lf, n_cr = _count_row_terminators(raw)
     kinds_present = sum(1 for n in (n_crlf, n_lf, n_cr) if n > 0)
     if kinds_present <= 1:
         return []
@@ -434,6 +530,53 @@ def _detect_mixed_line_endings(raw: bytes) -> list[Finding]:
             f"({', '.join(breakdown)}). Naive splits on one style produce "
             f"ghost rows or merged lines. Run the text cleaner to normalize."
         ),
+        confidence="high",
+        fix_action=FIX_NORMALIZE_LINE_ENDINGS,
+    )]
+
+
+def _detect_encoding_uncertainty(df: pd.DataFrame) -> list[Finding]:
+    """Flag DataFrames whose loaded text contains U+FFFD replacement chars.
+
+    The replacement character is what Python's decoder substitutes for
+    bytes it could not interpret under ``errors="replace"``. Any non-zero
+    count is a strong signal that the encoding picked by the loader was
+    wrong for at least part of the file — classic lying-BOM, mixed-encoding,
+    or wrong-codepage symptom. The user has to pick: re-upload with an
+    explicit encoding, or accept the loss.
+    """
+    affected_cells = 0
+    sample_rows: list[tuple[int, str, str]] = []
+    bad_headers: list[str] = []
+    for col in df.columns:
+        if isinstance(col, str) and _REPLACEMENT_CHAR in col:
+            bad_headers.append(col)
+        for row_idx, val in enumerate(df[col].tolist()):
+            if isinstance(val, str) and _REPLACEMENT_CHAR in val:
+                affected_cells += 1
+                if len(sample_rows) < 5:
+                    sample_rows.append((row_idx, str(col), val))
+    if not affected_cells and not bad_headers:
+        return []
+    location = []
+    if affected_cells:
+        location.append(f"{affected_cells} cell(s)")
+    if bad_headers:
+        location.append(f"{len(bad_headers)} header(s)")
+    return [Finding(
+        id="encoding_uncertain",
+        severity="error",
+        tool="",
+        count=affected_cells + len(bad_headers),
+        description=(
+            f"{' and '.join(location)} contain U+FFFD replacement characters, "
+            f"which means the file's encoding could not be decoded cleanly. "
+            f"Re-upload with an explicit encoding (e.g. cp1252, latin-1) "
+            f"or fix the source. Continuing risks silent data loss."
+        ),
+        samples=sample_rows,
+        confidence="low",
+        fix_action=FIX_NONE,
     )]
 
 
@@ -455,6 +598,9 @@ def _findings_from_repair(repair: RepairResult) -> list[Finding]:
             tool=TOOL_TEXT_CLEANER,
             count=1,
             description="UTF-8 BOM at file start was removed before parsing.",
+            confidence="high",
+            fix_action=FIX_STRIP_BOM,
+            pre_applied=True,
         ))
     if "strip_nul" in summary:
         nul_action = next(a for a in repair.actions if a.kind == "strip_nul")
@@ -467,6 +613,9 @@ def _findings_from_repair(repair: RepairResult) -> list[Finding]:
                 f"Embedded NUL bytes in the file were stripped before "
                 f"parsing ({nul_action.detail})."
             ),
+            confidence="high",
+            fix_action=FIX_STRIP_NUL,
+            pre_applied=True,
         ))
     if "fold_smart_quote" in summary:
         action = next(a for a in repair.actions if a.kind == "fold_smart_quote")
@@ -479,6 +628,55 @@ def _findings_from_repair(repair: RepairResult) -> list[Finding]:
                 f"Smart double quotes were folded to ASCII before parsing "
                 f"({action.detail})."
             ),
+            confidence="high",
+            fix_action=FIX_FOLD_SMART_QUOTES_BYTE,
+            pre_applied=True,
+        ))
+    if "normalize_line_endings" in summary:
+        action = next(a for a in repair.actions if a.kind == "normalize_line_endings")
+        findings.append(Finding(
+            id="csv_line_endings_normalized",
+            severity="info",
+            tool=TOOL_TEXT_CLEANER,
+            count=1,
+            description=(
+                f"Line endings were normalized to LF before parsing "
+                f"({action.detail})."
+            ),
+            confidence="high",
+            fix_action=FIX_NORMALIZE_LINE_ENDINGS,
+            pre_applied=True,
+        ))
+    if "transcode_to_utf8" in summary:
+        action = next(a for a in repair.actions if a.kind == "transcode_to_utf8")
+        findings.append(Finding(
+            id="csv_transcoded_to_utf8",
+            severity="info",
+            tool="",
+            count=1,
+            description=(
+                f"File was transcoded from a wide encoding to UTF-8 before "
+                f"parsing ({action.detail})."
+            ),
+            confidence="high",
+            fix_action=FIX_NONE,
+            pre_applied=True,
+        ))
+    if "decode_replaced" in summary:
+        action = next(a for a in repair.actions if a.kind == "decode_replaced")
+        findings.append(Finding(
+            id="encoding_decode_failed",
+            severity="error",
+            tool="",
+            count=1,
+            description=(
+                f"Some bytes could not be decoded under the detected "
+                f"encoding ({action.detail}). Replacement characters "
+                f"(U+FFFD) were inserted; the file likely uses a different "
+                f"encoding or mixes encodings. Re-upload with --encoding."
+            ),
+            confidence="low",
+            fix_action=FIX_NONE,
         ))
     if "quote_unquoted_delim" in summary:
         n = summary["quote_unquoted_delim"]
@@ -491,6 +689,9 @@ def _findings_from_repair(repair: RepairResult) -> list[Finding]:
                 f"{n} row(s) had a delimiter inside an unquoted field "
                 f"(e.g. '$1,500.00') and were merged during pre-parse repair."
             ),
+            confidence="medium",
+            fix_action=FIX_REPAIR_UNQUOTED_DELIM,
+            pre_applied=True,
         ))
     if repair.unrepairable_lines:
         n = len(repair.unrepairable_lines)
@@ -504,6 +705,8 @@ def _findings_from_repair(repair: RepairResult) -> list[Finding]:
                 f"left as-is. Inspect lines: "
                 f"{repair.unrepairable_lines[:10]}"
             ),
+            confidence="low",
+            fix_action=FIX_NONE,
         ))
     return findings
 
@@ -517,6 +720,7 @@ def analyze(
     *,
     sample_rows: int = 1000,
     repair_result: Optional[RepairResult] = None,
+    encoding_override: Optional[str] = None,
 ) -> list[Finding]:
     """Run all detectors against *source* and return a list of findings.
 
@@ -533,11 +737,17 @@ def analyze(
         Optional :class:`RepairResult` from a prior pre-parse pass; used
         to synthesize ``csv_*`` findings so the user sees what the parser
         quietly fixed.
+    encoding_override
+        When set, skip charset detection and decode with this encoding
+        instead. Used by the Review page to let the user correct
+        misdetections (cp1250-vs-cp1252 ambiguity, KOI8-R surfacing as
+        Shift_JIS, etc.). Only applies when *source* is a path.
     """
     raw_for_byte_scan: Optional[bytes] = None
     if isinstance(source, (str, Path)):
         df, internal_repair, raw_for_byte_scan = _load_for_analysis(
             Path(source), sample_rows=sample_rows,
+            encoding_override=encoding_override,
         )
         # Caller-supplied repair_result wins over the internally produced one,
         # since the caller may have used non-default repair flags.
@@ -547,10 +757,36 @@ def analyze(
         df = source.head(sample_rows).copy() if len(source) > sample_rows else source.copy()
 
     findings: list[Finding] = []
+    if raw_for_byte_scan is not None and not raw_for_byte_scan.strip():
+        findings.append(Finding(
+            id="empty_input",
+            severity="error",
+            tool="",
+            count=0,
+            description="Input file is empty (zero bytes or whitespace only).",
+            confidence="low",
+            fix_action=FIX_NONE,
+        ))
+        return findings
+    if df.empty and df.columns.empty and raw_for_byte_scan is not None:
+        # Non-empty bytes but the parser couldn't extract a header row.
+        findings.append(Finding(
+            id="empty_input",
+            severity="error",
+            tool="",
+            count=0,
+            description=(
+                "Input file has no parseable rows or columns "
+                "(only line endings, BOM, or whitespace)."
+            ),
+            confidence="low",
+            fix_action=FIX_NONE,
+        ))
     if repair_result is not None:
         findings.extend(_findings_from_repair(repair_result))
     if raw_for_byte_scan is not None:
         findings.extend(_detect_mixed_line_endings(raw_for_byte_scan))
+    findings.extend(_detect_encoding_uncertainty(df))
     findings.extend(_detect_smart_punctuation(df))
     findings.extend(_detect_invisible_chars(df))
     findings.extend(_detect_whitespace_padding(df))
@@ -563,7 +799,7 @@ def analyze(
 
 
 def _load_for_analysis(
-    path: Path, *, sample_rows: int,
+    path: Path, *, sample_rows: int, encoding_override: Optional[str] = None,
 ) -> tuple[pd.DataFrame, Optional[RepairResult], Optional[bytes]]:
     """Read just enough of *path* to scan, with the same robust pre-parse
     repair the tool pages will use.
@@ -571,6 +807,12 @@ def _load_for_analysis(
     Returns ``(df, repair_result, raw_bytes)``. The repair result and raw
     bytes are *None* for Excel files since the byte-level repair step
     (BOM/NUL/smart-quote folding) and line-ending scan are CSV-specific.
+    An empty CSV returns an empty DataFrame plus the (empty) raw bytes;
+    the caller synthesizes an ``empty_input`` finding from that.
+
+    When *encoding_override* is set, it replaces the detected encoding
+    entirely — the user has explicitly told us what the file is. The
+    delimiter is still detected (it's separate from encoding choice).
     """
     suffix = path.suffix.lower()
     if suffix in (".xlsx", ".xls"):
@@ -579,17 +821,24 @@ def _load_for_analysis(
             nrows=sample_rows,
         )
         return df, None, None
-    enc = detect_encoding(path)
-    delim = detect_delimiter(path, enc)
     raw = path.read_bytes()
+    if not raw.strip():
+        return pd.DataFrame(), None, raw
+    enc = encoding_override or detect_encoding(path)
+    delim = detect_delimiter(path, enc)
     repair = repair_bytes(raw, encoding=enc, delimiter=delim)
     import io as _io
-    df = pd.read_csv(
-        _io.BytesIO(repair.repaired_bytes),
-        encoding="utf-8", delimiter=delim,
-        dtype=str, keep_default_na=False, on_bad_lines="warn",
-        nrows=sample_rows,
-    )
+    try:
+        df = pd.read_csv(
+            _io.BytesIO(repair.repaired_bytes),
+            encoding="utf-8", delimiter=delim,
+            dtype=str, keep_default_na=False, on_bad_lines="warn",
+            nrows=sample_rows,
+        )
+    except pd.errors.EmptyDataError:
+        # File is non-empty bytes but had no parseable columns (e.g. only
+        # whitespace, only a BOM, only line endings). Treat as empty.
+        return pd.DataFrame(), repair, raw
     return df, repair, raw
 
 
@@ -598,6 +847,9 @@ def to_dict(finding: Finding) -> dict[str, Any]:
     return {
         "id": finding.id,
         "severity": finding.severity,
+        "confidence": finding.confidence,
+        "fix_action": finding.fix_action,
+        "pre_applied": finding.pre_applied,
         "tool": finding.tool,
         "count": finding.count,
         "description": finding.description,
diff --git a/src/core/fixes.py b/src/core/fixes.py
new file mode 100644
index 0000000..421fc7e
--- /dev/null
+++ b/src/core/fixes.py
@@ -0,0 +1,296 @@
+"""Registry of fix algorithms keyed by ``fix_action`` id.
+
+Every :class:`~src.core.analyze.Finding` declares a ``fix_action`` naming
+the algorithm that resolves it. The normalize layer dispatches on that id
+into this registry. Each fix function takes a DataFrame plus an optional
+``payload`` dict (for fixes that need user-supplied parameters, e.g. the
+custom null-sentinel list) and returns ``(new_df, n_cells_changed)``.
+
+Fixes here operate on the DataFrame after the byte-level pre-parse repair
+has already run (BOM, NUL, line endings, smart-quote bytes, unquoted
+delimiters). Anything in this layer is reversible from the audit log; a
+lossy fix (e.g. mojibake repair) is gated to ``confidence="low"`` and
+requires explicit user opt-in via the review page.
+"""
+
+from __future__ import annotations
+
+import re
+import unicodedata
+from typing import Any, Callable, Optional
+
+import pandas as pd
+
+from .text_clean import (
+    _SMART_TRANS,
+    _ZERO_WIDTH_RE,
+    _CONTROL_RE,
+    _WHITESPACE_RUN_RE,
+    _looks_structured,
+    strip_bom,
+    normalize_line_endings as _norm_le_str,
+)
+# The package __init__ re-exports the analyze() function under the name
+# `analyze`, which shadows the submodule attribute. Reach the module via
+# sys.modules to get its private constants and FIX_* identifiers.
+import sys as _sys
+import src.core.analyze  # noqa: F401  (registers the submodule)
+_a = _sys.modules["src.core.analyze"]
+
+# NBSP / Unicode-whitespace -> ASCII space. Mirrors the analyzer's
+# detection set (analyze._NBSP_LIKE_CHARS) so what the detector flags is
+# exactly what this fix replaces.
+_NBSP_TRANS = str.maketrans({c: " " for c in _a._NBSP_LIKE_CHARS})
+
+
+FixFn = Callable[[pd.DataFrame, Optional[dict]], tuple[pd.DataFrame, int]]
+
+_REGISTRY: dict[str, FixFn] = {}
+
+
+def register(action_id: str) -> Callable[[FixFn], FixFn]:
+    def deco(fn: FixFn) -> FixFn:
+        _REGISTRY[action_id] = fn
+        return fn
+    return deco
+
+
+def get_fix(action_id: str) -> Optional[FixFn]:
+    return _REGISTRY.get(action_id)
+
+
+def available_actions() -> list[str]:
+    return sorted(_REGISTRY)
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+def _apply_to_strings(
+    df: pd.DataFrame, fn: Callable[[str], str], *, include_headers: bool = False,
+) -> tuple[pd.DataFrame, int]:
+    """Apply *fn* to every string cell. Returns (new_df, cells_changed).
+
+    Headers are not touched here — the dedicated header-cleaning fix owns
+    that scope so the gate's audit log records header changes separately.
+    """
+    out = df.copy()
+    changed = 0
+    for col in out.columns:
+        if not pd.api.types.is_object_dtype(out[col]) and not pd.api.types.is_string_dtype(out[col]):
+            continue
+        new_col = []
+        for v in out[col]:
+            if isinstance(v, str):
+                nv = fn(v)
+                if nv != v:
+                    changed += 1
+                new_col.append(nv)
+            else:
+                new_col.append(v)
+        out[col] = new_col
+    if include_headers:
+        new_headers = []
+        for h in out.columns:
+            if isinstance(h, str):
+                nh = fn(h)
+                if nh != h:
+                    changed += 1
+                new_headers.append(nh)
+            else:
+                new_headers.append(h)
+        out.columns = new_headers
+    return out, changed
+
+
+# ---------------------------------------------------------------------------
+# High-confidence fixes
+# ---------------------------------------------------------------------------
+
+@register(_a.FIX_TRIM_WHITESPACE)
+def trim_whitespace(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]:
+    """Strip leading/trailing whitespace; collapse internal runs in text cells.
+
+    Numeric/date/phone-shaped cells get only outer trim — internal spacing
+    in those is often semantic (`1 234`, `(555) 123-4567`).
+    """
+    def fix(s: str) -> str:
+        trimmed = s.strip()
+        if not trimmed or _looks_structured(trimmed):
+            return trimmed
+        return _WHITESPACE_RUN_RE.sub(" ", trimmed)
+    return _apply_to_strings(df, fix)
+
+
+@register(_a.FIX_STRIP_NBSP)
+def strip_nbsp(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]:
+    """Replace NBSP and other Unicode spaces with ASCII space."""
+    def fix(s: str) -> str:
+        return s.translate(_NBSP_TRANS)
+    return _apply_to_strings(df, fix)
+
+
+@register(_a.FIX_STRIP_ZERO_WIDTH)
+def strip_zero_width(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]:
+    """Remove zero-width and invisible characters from cells."""
+    def fix(s: str) -> str:
+        return _ZERO_WIDTH_RE.sub("", s)
+    return _apply_to_strings(df, fix)
+
+
+@register(_a.FIX_FOLD_SMART_PUNCT)
+def fold_smart_punctuation(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]:
+    """ASCII-fy curly quotes, em/en dashes, ellipsis, primes."""
+    def fix(s: str) -> str:
+        return s.translate(_SMART_TRANS)
+    return _apply_to_strings(df, fix)
+
+
+@register(_a.FIX_CLEAN_HEADERS)
+def clean_headers(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]:
+    """Apply the same per-cell hygiene to column headers.
+
+    Fixes the df['Email'] vs df['Email '] class of bug.
+    """
+    def fix(s: str) -> str:
+        s = strip_bom(s)
+        s = s.translate(_NBSP_TRANS)
+        s = _ZERO_WIDTH_RE.sub("", s)
+        s = s.translate(_SMART_TRANS)
+        s = _CONTROL_RE.sub("", s)
+        return s.strip()
+    out = df.copy()
+    new_headers = []
+    changed = 0
+    for h in out.columns:
+        if isinstance(h, str):
+            nh = fix(h)
+            if nh != h:
+                changed += 1
+            new_headers.append(nh)
+        else:
+            new_headers.append(h)
+    out.columns = new_headers
+    return out, changed
+
+
+@register(_a.FIX_NORMALIZE_LINE_ENDINGS)
+def normalize_line_endings(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]:
+    """Normalize CRLF / bare CR inside cells to LF.
+
+    File-level line endings are handled by ``repair_bytes`` before parsing;
+    this fix covers embedded multi-line cells (case 11 in the corpus).
+    """
+    return _apply_to_strings(df, _norm_le_str)
+
+
+# ---------------------------------------------------------------------------
+# Already-applied fixes (no-op at this layer; kept so the audit log is
+# uniform and the gate can reason about them)
+# ---------------------------------------------------------------------------
+
+@register(_a.FIX_STRIP_BOM)
+def strip_bom_noop(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]:
+    """BOM is stripped during read by repair_bytes; nothing to do here."""
+    return df, 0
+
+
+@register(_a.FIX_STRIP_NUL)
+def strip_nul_noop(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]:
+    """NUL is stripped during read by repair_bytes."""
+    return df, 0
+
+
+@register(_a.FIX_FOLD_SMART_QUOTES_BYTE)
+def fold_smart_quotes_byte_noop(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]:
+    """Byte-level smart-quote fold runs in repair_bytes."""
+    return df, 0
+
+
+@register(_a.FIX_REPAIR_UNQUOTED_DELIM)
+def repair_unquoted_delim_noop(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]:
+    """Per-row delimiter repair runs in repair_bytes."""
+    return df, 0
+
+
+# ---------------------------------------------------------------------------
+# Medium-confidence fixes (require user confirmation in the review flow)
+# ---------------------------------------------------------------------------
+
+@register(_a.FIX_LOWERCASE_EMAIL)
+def lowercase_email(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]:
+    """Lowercase values in the column named in *payload['column']*.
+
+    Defaults to lowercasing every column whose name matches the email
+    heuristic if no payload is given.
+    """
+    out = df.copy()
+    payload = payload or {}
+    target_cols: list[str]
+    if "column" in payload:
+        target_cols = [payload["column"]]
+    else:
+        target_cols = [
+            c for c in out.columns
+            if isinstance(c, str) and _a._EMAIL_LIKE_COL.search(c)
+        ]
+    changed = 0
+    for col in target_cols:
+        if col not in out.columns:
+            continue
+        new_col = []
+        for v in out[col]:
+            if isinstance(v, str):
+                nv = v.lower()
+                if nv != v:
+                    changed += 1
+                new_col.append(nv)
+            else:
+                new_col.append(v)
+        out[col] = new_col
+    return out, changed
+
+
+@register(_a.FIX_REPLACE_NULL_SENTINELS)
+def replace_null_sentinels(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]:
+    """Replace user-approved null-like sentinel strings with empty string.
+
+    Payload: ``{"sentinels": ["N/A", "n/a", "nan", ...]}``. Defaults to
+    the analyzer's built-in set when no payload is given. Comparison is
+    case-insensitive, whitespace-trimmed.
+    """
+    payload = payload or {}
+    sentinels = payload.get("sentinels")
+    if sentinels is None:
+        sentinels = list(_a._NULL_LIKE)
+    sentinel_set = {s.strip().lower() for s in sentinels}
+
+    def fix(s: str) -> str:
+        return "" if s.strip().lower() in sentinel_set else s
+
+    return _apply_to_strings(df, fix)
+
+
+# ---------------------------------------------------------------------------
+# Low-confidence fixes (off by default; user-only)
+# ---------------------------------------------------------------------------
+
+@register(_a.FIX_REPAIR_MOJIBAKE)
+def repair_mojibake(df: pd.DataFrame, payload: Optional[dict] = None) -> tuple[pd.DataFrame, int]:
+    """Heuristic UTF-8-as-cp1252 mojibake repair via ftfy when available.
+
+    Falls back to a no-op (returning ``(df, 0)``) when ftfy is not
+    installed; the review page surfaces that as "library missing — install
+    ftfy to enable" so we never silently corrupt data with a hand-rolled
+    heuristic.
+    """
+    try:
+        import ftfy  # type: ignore
+    except ImportError:
+        return df, 0
+
+    def fix(s: str) -> str:
+        return ftfy.fix_text(s)
+
+    return _apply_to_strings(df, fix)
diff --git a/src/core/io.py b/src/core/io.py
index dd45b87..3795ac8 100644
--- a/src/core/io.py
+++ b/src/core/io.py
@@ -34,6 +34,16 @@ def detect_encoding(path: Path, sample_bytes: int = 65_536) -> str:
     if raw[:2] in (b"\xff\xfe", b"\xfe\xff"):
         return "utf-16"
 
+    # Strict UTF-8 wins. charset_normalizer fingerprints small files
+    # dominated by short non-ASCII sequences (e.g. zero-width chars at
+    # U+200B-class) as mac_latin2 / cp1250 / similar — but if the bytes
+    # decode cleanly as UTF-8, that's the right answer regardless.
+    try:
+        raw.decode("utf-8")
+        return "utf-8"
+    except UnicodeDecodeError:
+        pass
+
     result = from_bytes(raw).best()
     if result is None:
         return "utf-8"
@@ -416,6 +426,7 @@ def repair_bytes(
     fold_quotes: bool = True,
     strip_nul: bool = True,
     repair_delims: bool = True,
+    normalize_line_endings: bool = True,
 ) -> RepairResult:
     """Pre-parse repair on a raw delimited file.
 
@@ -423,8 +434,11 @@ def repair_bytes(
 
     1. Strip a leading UTF-8 BOM.
     2. Strip embedded NUL bytes (the C parser truncates fields at NUL).
-    3. Fold smart double quotes (curly, guillemet, double-prime) to ASCII ``"``.
-    4. Per-row repair when one rogue delimiter is embedded in a field that
+    3. Normalize line endings (CRLF and bare CR to LF). Bare CR confuses
+       the C parser ("new-line character seen in unquoted field"); the
+       text-cleaner contract also calls for LF inside multi-line cells.
+    4. Fold smart double quotes (curly, guillemet, double-prime) to ASCII ``"``.
+    5. Per-row repair when one rogue delimiter is embedded in a field that
        looks like currency or thousands-grouped digits — quote that field.
 
     Single curly quotes and other punctuation are deferred to the cell-level
@@ -434,12 +448,41 @@ def repair_bytes(
     unrepairable: list[int] = []
     data = raw
 
+    # If the input is a UTF-16 / UTF-32 byte stream, transcode it to UTF-8
+    # up front. UTF-16 ASCII codepoints carry NUL as half of every 16-bit
+    # unit, so the byte-level NUL-strip below would shred the file. Doing
+    # the transcode here means the rest of the repair pipeline operates
+    # on UTF-8 bytes regardless of the source encoding.
+    enc_norm = encoding.lower().replace("-", "_") if encoding else ""
+    is_wide = enc_norm.startswith(("utf_16", "utf_32"))
+    # UTF-16 LE without a BOM that survives detection lands here too.
+    if is_wide:
+        try:
+            decoded = data.decode(encoding)
+        except (UnicodeDecodeError, LookupError):
+            decoded = data.decode("utf-8", errors="replace")
+            actions.append(RepairAction(
+                kind="decode_replaced", line=None,
+                detail=f"decode errors under {encoding}; replaced with U+FFFD",
+            ))
+        # Strip a leading UTF-16 BOM (decoded as U+FEFF) if present.
+        if decoded and decoded[0] == "﻿":
+            decoded = decoded[1:]
+        data = decoded.encode("utf-8")
+        actions.append(RepairAction(
+            kind="transcode_to_utf8", line=None,
+            detail=f"transcoded {encoding} -> utf-8 ({len(raw)}B -> {len(data)}B)",
+        ))
+        encoding = "utf-8"  # downstream steps now operate on UTF-8
+
     # 1. BOM
     if data.startswith(b"\xef\xbb\xbf"):
         data = data[3:]
         actions.append(RepairAction(kind="strip_bom", line=None, detail="UTF-8 BOM removed"))
 
-    # 2. NUL
+    # 2. NUL — only meaningful for single-byte / UTF-8 encodings. We've
+    # already transcoded UTF-16/32 to UTF-8 above, so NUL here is genuine
+    # corruption (truncated C strings, half-binary exports), not encoding.
     if strip_nul and b"\x00" in data:
         before = data.count(b"\x00")
         data = data.replace(b"\x00", b"")
@@ -448,6 +491,26 @@ def repair_bytes(
             detail=f"removed {before} NUL byte(s)",
         ))
 
+    # 3. Line endings: CRLF and bare CR -> LF. CRLF first so we don't
+    # double-substitute. Done at the byte layer so it survives through
+    # any subsequent decode failure.
+    if normalize_line_endings and (b"\r" in data):
+        n_crlf = data.count(b"\r\n")
+        data = data.replace(b"\r\n", b"\n")
+        n_cr = data.count(b"\r")
+        if n_cr:
+            data = data.replace(b"\r", b"\n")
+        if n_crlf or n_cr:
+            parts = []
+            if n_crlf:
+                parts.append(f"{n_crlf} CRLF")
+            if n_cr:
+                parts.append(f"{n_cr} bare CR")
+            actions.append(RepairAction(
+                kind="normalize_line_endings", line=None,
+                detail=f"normalized {', '.join(parts)} to LF",
+            ))
+
     # Decode for character-level work.
     try:
         text = data.decode(encoding)
diff --git a/src/core/normalize.py b/src/core/normalize.py
new file mode 100644
index 0000000..17d49c5
--- /dev/null
+++ b/src/core/normalize.py
@@ -0,0 +1,249 @@
+"""CSV-normalization gate.
+
+A file enters the tool pages only after passing the gate. The gate has
+two paths:
+
+1. **Auto-fix** — apply every algorithm flagged ``confidence="high"``.
+2. **Review** — show the user a preview of medium/low-confidence findings
+   and accept an explicit per-finding decision before applying.
+
+The gate produces a :class:`NormalizationResult` containing the cleaned
+DataFrame, the bytes representation, and a structured audit log of every
+fix that ran. Tool pages are guarded by :func:`is_normalized` against
+the result and the original list of findings.
+"""
+
+from __future__ import annotations
+
+import io
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Literal, Optional
+
+import pandas as pd
+
+from .analyze import Finding, analyze
+from .fixes import get_fix
+
+
+DecisionAction = Literal["auto", "skip", "modified"]
+
+
+@dataclass
+class Decision:
+    """One user-recorded choice for a finding.
+
+    Attributes
+    ----------
+    finding_id
+        The :class:`Finding` id this decision applies to.
+    action
+        ``"auto"`` to run the registered fix as-is, ``"skip"`` to leave
+        it alone (the gate logs it as waived), ``"modified"`` to run the
+        fix with a custom payload (e.g. user-edited null sentinel list).
+    payload
+        Optional kwargs forwarded to the fix function. Required for
+        ``"modified"``; ignored for ``"skip"``.
+    """
+
+    finding_id: str
+    action: DecisionAction
+    payload: Optional[dict] = None
+
+
+@dataclass
+class FixApplied:
+    """One fix that ran during a gate pass."""
+
+    finding_id: str
+    fix_action: str
+    cells_changed: int
+    decision: DecisionAction
+
+
+@dataclass
+class NormalizationResult:
+    """Output of a gate pass.
+
+    Attributes
+    ----------
+    cleaned_df
+        DataFrame after every applied fix. The downstream tool pages
+        consume this directly.
+    cleaned_bytes
+        UTF-8 encoded CSV of *cleaned_df* — the canonical artifact for
+        round-tripping into another tool that re-parses.
+    applied
+        Audit log of fixes that ran.
+    skipped_findings
+        Findings the user explicitly waived (decision = ``"skip"``).
+    pending_findings
+        Findings still requiring a user decision before the gate is
+        considered passed. Empty on a successful gate pass.
+    blocking_findings
+        Severity=error findings that have no decision and no auto-fix.
+        Non-empty means the gate is blocked and the file cannot enter
+        tool pages.
+    """
+
+    cleaned_df: pd.DataFrame
+    cleaned_bytes: bytes
+    applied: list[FixApplied] = field(default_factory=list)
+    skipped_findings: list[Finding] = field(default_factory=list)
+    pending_findings: list[Finding] = field(default_factory=list)
+    blocking_findings: list[Finding] = field(default_factory=list)
+
+    @property
+    def passed(self) -> bool:
+        return not self.pending_findings and not self.blocking_findings
+
+
+def _df_to_bytes(df: pd.DataFrame) -> bytes:
+    buf = io.StringIO()
+    df.to_csv(buf, index=False, lineterminator="\n")
+    return buf.getvalue().encode("utf-8")
+
+
+def _is_actionable(f: Finding) -> bool:
+    """Does this finding still need attention from the gate?
+
+    Pre-applied fixes (BOM strip, etc. — already done during read) are
+    not actionable. Findings without a registered fix_action are not
+    actionable here either; severity=error ones become blockers.
+    """
+    if f.pre_applied:
+        return False
+    if not f.fix_action:
+        return False
+    return get_fix(f.fix_action) is not None
+
+
+def auto_fix(
+    df: pd.DataFrame, findings: list[Finding],
+) -> NormalizationResult:
+    """Apply every fix flagged ``confidence="high"``.
+
+    Returns a :class:`NormalizationResult`. Medium / low / unknown
+    confidence findings are surfaced as ``pending_findings`` and the
+    result is *not* considered passed until the user decides on them.
+    """
+    decisions: list[Decision] = [
+        Decision(finding_id=f.id, action="auto")
+        for f in findings
+        if _is_actionable(f) and f.confidence == "high"
+    ]
+    return apply_decisions(df, findings, decisions)
+
+
+def apply_decisions(
+    df: pd.DataFrame, findings: list[Finding], decisions: list[Decision],
+) -> NormalizationResult:
+    """Apply *decisions* to *df* in finding order.
+
+    Findings with no matching decision are categorized:
+
+    * ``severity=error`` -> ``blocking_findings``
+    * Otherwise -> ``pending_findings`` (user still owes us a decision)
+
+    Pre-applied findings are recorded once in the audit log with
+    ``cells_changed=0`` so callers can render "what was already done."
+    """
+    decision_by_id = {d.finding_id: d for d in decisions}
+
+    out = df.copy()
+    applied: list[FixApplied] = []
+    skipped: list[Finding] = []
+    pending: list[Finding] = []
+    blocking: list[Finding] = []
+
+    for f in findings:
+        if f.pre_applied:
+            applied.append(FixApplied(
+                finding_id=f.id,
+                fix_action=f.fix_action,
+                cells_changed=0,
+                decision="auto",
+            ))
+            continue
+
+        decision = decision_by_id.get(f.id)
+        if decision is None:
+            if f.severity == "error":
+                blocking.append(f)
+            elif _is_actionable(f):
+                pending.append(f)
+            # else: informational with no fix; ignore.
+            continue
+
+        if decision.action == "skip":
+            skipped.append(f)
+            continue
+
+        fix_fn = get_fix(f.fix_action)
+        if fix_fn is None:
+            # Decision references a fix we don't have; treat as pending.
+            pending.append(f)
+            continue
+
+        payload = decision.payload
+        # Per-column fixes (lowercase_email) can carry the column from
+        # the finding when the user didn't override it.
+        if f.column and (payload is None or "column" not in payload):
+            payload = {**(payload or {}), "column": f.column}
+
+        out, changed = fix_fn(out, payload)
+        applied.append(FixApplied(
+            finding_id=f.id,
+            fix_action=f.fix_action,
+            cells_changed=changed,
+            decision=decision.action,
+        ))
+
+    return NormalizationResult(
+        cleaned_df=out,
+        cleaned_bytes=_df_to_bytes(out),
+        applied=applied,
+        skipped_findings=skipped,
+        pending_findings=pending,
+        blocking_findings=blocking,
+    )
+
+
+def is_normalized(
+    findings: list[Finding], result: Optional[NormalizationResult],
+) -> bool:
+    """True iff *result* satisfies the gate against *findings*.
+
+    The gate passes when:
+
+    * A result exists, and
+    * It has no blocking findings, and
+    * It has no pending (undecided) actionable findings.
+
+    Re-run analysis on the cleaned bytes to confirm the high-confidence
+    detectors no longer fire — that's the contract the tool pages rely
+    on. Callers who want the cheap check can pass ``result.passed``
+    directly; this function is the strict version.
+    """
+    if result is None:
+        return False
+    if not result.passed:
+        return False
+    # Re-analyze the cleaned bytes; high-confidence detectors must be silent.
+    rerun = analyze(result.cleaned_df)
+    for f in rerun:
+        if f.confidence == "high" and _is_actionable(f):
+            return False
+    return True
+
+
+def gate_summary(result: NormalizationResult) -> dict:
+    """One-line-per-key summary suitable for logging or the CLI."""
+    return {
+        "passed": result.passed,
+        "fixes_applied": len(result.applied),
+        "cells_changed": sum(a.cells_changed for a in result.applied),
+        "skipped": [f.id for f in result.skipped_findings],
+        "pending": [f.id for f in result.pending_findings],
+        "blocking": [f.id for f in result.blocking_findings],
+    }
diff --git a/src/gui/components.py b/src/gui/components.py
index 59c47a3..f02b6a0 100644
--- a/src/gui/components.py
+++ b/src/gui/components.py
@@ -1096,6 +1096,49 @@ class _StashedUpload:
         return self._data
 
 
+def require_normalization_gate() -> None:
+    """Block the calling tool page until the upload has passed the gate.
+
+    Tool pages should call this immediately after their imports. When the
+    current session upload has not been normalized — no
+    ``normalization_result``, the result is for a different upload, or the
+    result didn't pass — the user is shown a banner and a button to jump
+    to the Review page; the rest of the page is short-circuited via
+    ``st.stop()``.
+
+    Pages that genuinely don't need a clean dataframe (rare) can opt out
+    by simply not calling this.
+    """
+    import hashlib
+    has_upload = st.session_state.get("home_uploaded_bytes") is not None
+    if not has_upload:
+        # No upload yet — let the page's own uploader handle it; the gate
+        # will kick in once a file is present.
+        return
+
+    upload_hash = hashlib.sha256(
+        st.session_state["home_uploaded_bytes"]
+    ).hexdigest()
+    result = st.session_state.get("normalization_result")
+    matched = (
+        result is not None
+        and st.session_state.get("normalization_for") == upload_hash
+        and getattr(result, "passed", False)
+    )
+    if matched:
+        return
+
+    name = st.session_state.get("home_uploaded_name", "the uploaded file")
+    st.warning(
+        f"**{name}** must pass the CSV-normalization gate before you can "
+        f"use this tool. Open the Review page to apply the fixes our "
+        f"analyzer recommends."
+    )
+    if st.button("Go to Review & Normalize", type="primary"):
+        st.switch_page("pages/0_Review.py")
+    st.stop()
+
+
 def pickup_or_upload(
     *,
     label: str,
diff --git a/src/gui/pages/0_Review.py b/src/gui/pages/0_Review.py
new file mode 100644
index 0000000..0d0fd5f
--- /dev/null
+++ b/src/gui/pages/0_Review.py
@@ -0,0 +1,675 @@
+"""Review & normalize gate page.
+
+Sits between the home-page upload and every tool page. Walks the user
+through every analyzer finding, lets them auto-fix, preview, customize,
+or skip each one, and produces a :class:`NormalizationResult` stashed in
+session state. Tool pages refuse to load until this gate has passed.
+
+State contract
+--------------
+Session state read:
+* ``home_uploaded_bytes`` / ``home_uploaded_name`` — current upload.
+* ``home_findings`` — list of :class:`Finding` from the home-page scan.
+* ``review_decisions`` — dict[finding_id, Decision]; user's choices so far.
+
+Session state written:
+* ``review_decisions`` — updated as the user flips controls.
+* ``normalization_result`` — :class:`NormalizationResult` after Apply.
+* ``normalization_for`` — content hash of the upload the result is for.
+"""
+
+from __future__ import annotations
+
+import hashlib
+import io
+import sys
+from pathlib import Path
+from typing import Optional
+
+import pandas as pd
+import streamlit as st
+
+# Project root on sys.path (mirrors app.py).
+_project_root = Path(__file__).resolve().parent.parent.parent.parent
+if str(_project_root) not in sys.path:
+    sys.path.insert(0, str(_project_root))
+
+from src.core.analyze import Finding, analyze
+from src.core.fixes import get_fix
+from src.core.io import detect_encoding, repair_bytes
+from src.core.normalize import (
+    Decision,
+    NormalizationResult,
+    apply_decisions,
+    auto_fix,
+    gate_summary,
+    is_normalized,
+)
+from src.gui.components import hide_streamlit_chrome
+
+
+# Common single-byte and multi-byte encodings the user might pick to
+# correct a misdetection. Ordered by frequency in real-world Western /
+# multilingual data; keep the list short — too many options just adds
+# noise. The user can type a custom encoding via the "Other" entry.
+_OVERRIDE_ENCODINGS = [
+    "(detected)",
+    "utf-8",
+    "utf-8-sig",
+    "cp1252",
+    "iso-8859-1",
+    "iso-8859-15",
+    "cp1250",
+    "iso-8859-2",
+    "cp1251",
+    "koi8-r",
+    "mac-roman",
+    "shift_jis",
+    "cp932",
+    "gb18030",
+    "big5",
+    "euc-kr",
+    "cp949",
+    "utf-16",
+    "utf-16-le",
+    "utf-16-be",
+    "Other…",
+]
+
+
+st.set_page_config(page_title="Review & Normalize", page_icon="🛡️", layout="wide")
+hide_streamlit_chrome()
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+def _upload_hash() -> Optional[str]:
+    data = st.session_state.get("home_uploaded_bytes")
+    if not data:
+        return None
+    return hashlib.sha256(data).hexdigest()
+
+
+def _detected_encoding_for_session() -> Optional[str]:
+    """Run charset detection on the session bytes via a tmp file."""
+    data = st.session_state.get("home_uploaded_bytes")
+    name = st.session_state.get("home_uploaded_name") or "tmp.csv"
+    if not data:
+        return None
+    import tempfile
+    suffix = "." + name.rsplit(".", 1)[-1] if "." in name else ".csv"
+    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as fh:
+        fh.write(data)
+        tmp_path = Path(fh.name)
+    try:
+        return detect_encoding(tmp_path)
+    finally:
+        tmp_path.unlink(missing_ok=True)
+
+
+def _load_df_from_session(encoding_override: Optional[str] = None) -> Optional[pd.DataFrame]:
+    """Re-parse the session upload through the same pipeline the home page
+    uses, so the review page operates on identical bytes.
+
+    When *encoding_override* is set, decode with that encoding instead of
+    UTF-8. The override flows into ``repair_bytes`` so the wide-encoding
+    transcode and decode_replaced fallback both honor the user's choice.
+    """
+    data = st.session_state.get("home_uploaded_bytes")
+    name = st.session_state.get("home_uploaded_name") or ""
+    if not data:
+        return None
+    suffix = name.rsplit(".", 1)[-1].lower() if "." in name else ""
+    if suffix in ("xlsx", "xls"):
+        return pd.read_excel(io.BytesIO(data), dtype=str, keep_default_na=False)
+    delim = "\t" if suffix == "tsv" else ","
+    if delim == ",":
+        head = data[:4096].decode("utf-8", errors="replace")
+        for cand in ("\t", ";", "|"):
+            if head.count(cand) > head.count(",") * 1.5:
+                delim = cand
+                break
+    enc = encoding_override or "utf-8"
+    repair = repair_bytes(data, encoding=enc, delimiter=delim)
+    return pd.read_csv(
+        io.BytesIO(repair.repaired_bytes),
+        encoding="utf-8", delimiter=delim,
+        dtype=str, keep_default_na=False, on_bad_lines="warn",
+    )
+
+
+def _run_analysis_with_override(encoding_override: Optional[str]) -> list[Finding]:
+    """Re-run analyze() on the session upload with an encoding override.
+
+    Mirrors components._run_analysis_on_upload but writes the bytes to a
+    tempfile so analyze() goes through the path-based loader (which is
+    where the encoding_override hook lives — DataFrame-mode analysis has
+    nothing to override).
+    """
+    data = st.session_state.get("home_uploaded_bytes")
+    name = st.session_state.get("home_uploaded_name") or "tmp.csv"
+    if not data:
+        return []
+    import tempfile
+    suffix = "." + name.rsplit(".", 1)[-1] if "." in name else ".csv"
+    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as fh:
+        fh.write(data)
+        tmp_path = Path(fh.name)
+    try:
+        return analyze(tmp_path, encoding_override=encoding_override)
+    finally:
+        tmp_path.unlink(missing_ok=True)
+
+
+def _confidence_pill(c: str) -> str:
+    """Streamlit-markdown pill for the confidence tier."""
+    palette = {"high": "green", "medium": "orange", "low": "red"}
+    return f":{palette.get(c, 'gray')}-background[**{c.upper()}**]"
+
+
+def _severity_pill(s: str) -> str:
+    palette = {"info": "blue", "warn": "orange", "error": "red"}
+    return f":{palette.get(s, 'gray')}-background[**{s}**]"
+
+
+# ---------------------------------------------------------------------------
+# Output options (Advanced — re-encode the cleaned DataFrame for download)
+# ---------------------------------------------------------------------------
+
+# (label_shown_to_user, codec_passed_to_pandas)
+_OUTPUT_ENCODINGS = [
+    ("UTF-8 (recommended)", "utf-8"),
+    ("UTF-8 with BOM (Excel)", "utf-8-sig"),
+    ("Windows-1252 (Western Europe)", "cp1252"),
+    ("ISO-8859-1 / Latin-1", "iso-8859-1"),
+    ("ISO-8859-15 / Latin-9", "iso-8859-15"),
+    ("Windows-1250 (Central Europe)", "cp1250"),
+    ("ISO-8859-2 / Latin-2", "iso-8859-2"),
+    ("Windows-1251 (Cyrillic)", "cp1251"),
+    ("Shift_JIS (Japanese)", "shift_jis"),
+    ("GB18030 (Chinese)", "gb18030"),
+    ("Big5 (Traditional Chinese)", "big5"),
+    ("EUC-KR (Korean)", "euc-kr"),
+    ("UTF-16 LE with BOM", "utf-16"),
+]
+
+_OUTPUT_DELIMITERS = [
+    ("Comma  ,", ","),
+    ("Tab  \\t", "\t"),
+    ("Semicolon  ;", ";"),
+    ("Pipe  |", "|"),
+]
+
+_OUTPUT_LINE_TERMINATORS = [
+    ("LF — \\n (Unix / web / git default)", "\n"),
+    ("CRLF — \\r\\n (Windows / classic Excel)", "\r\n"),
+    ("CR — \\r (classic Mac, very rare)", "\r"),
+]
+
+
+def _build_output_bytes(
+    df: pd.DataFrame,
+    *,
+    encoding: str,
+    delimiter: str,
+    line_terminator: str,
+) -> tuple[bytes, Optional[str]]:
+    """Serialize *df* with the user's output options.
+
+    Returns ``(bytes, error_message)``. ``error_message`` is non-None when
+    the chosen encoding cannot represent at least one cell — characters
+    that don't exist in the target codepage are replaced with ``?`` so
+    the user still gets a download, plus a warning telling them which
+    target was lossy.
+    """
+    buf = io.StringIO()
+    df.to_csv(buf, index=False, sep=delimiter, lineterminator=line_terminator)
+    text = buf.getvalue()
+    try:
+        return text.encode(encoding), None
+    except UnicodeEncodeError:
+        # Find the first character that fails so the message is useful.
+        bad: Optional[str] = None
+        for ch in text:
+            try:
+                ch.encode(encoding)
+            except UnicodeEncodeError:
+                bad = ch
+                break
+        msg = (
+            f"Some characters cannot be represented in {encoding}"
+            + (f" (first offender: {bad!r})" if bad else "")
+            + ". Falling back to '?' replacement; non-Latin content will be lost."
+        )
+        return text.encode(encoding, errors="replace"), msg
+
+
+def _preview_table(f: Finding, decision_action: str, payload: Optional[dict]) -> Optional[pd.DataFrame]:
+    """Build a before/after preview from finding samples.
+
+    Runs the registered fix function on each sample value individually so
+    the user sees exactly what would change. Returns None when no preview
+    is meaningful (no samples, or no fix registered).
+    """
+    if not f.samples:
+        return None
+    fix_fn = get_fix(f.fix_action)
+    if fix_fn is None:
+        # No fix to preview; show samples as-is.
+        return pd.DataFrame(
+            [{"row": r, "column": c, "value": v} for r, c, v in f.samples]
+        )
+    rows = []
+    for r, col, val in f.samples:
+        # Run the fix on a tiny single-cell DataFrame so payload semantics
+        # (e.g. lowercase_email's column targeting) are honored.
+        mini = pd.DataFrame({col: [val]})
+        try:
+            new_df, _ = fix_fn(mini, payload)
+            new_val = new_df[col].iloc[0]
+        except Exception as e:
+            new_val = f"<preview error: {e}>"
+        rows.append({"row": r, "column": col, "before": val, "after": new_val})
+    return pd.DataFrame(rows)
+
+
+# ---------------------------------------------------------------------------
+# Page body
+# ---------------------------------------------------------------------------
+
+st.title("🛡️ Review & Normalize")
+st.caption(
+    "Every finding is shown below with the algorithm that would fix it. "
+    "Auto-fix the high-confidence ones in one click; preview or customize "
+    "the rest before applying."
+)
+
+# Pre-flight: nothing to review without an upload.
+findings: list[Finding] = st.session_state.get("home_findings") or []
+upload_name = st.session_state.get("home_uploaded_name")
+
+if not upload_name:
+    st.warning("No file uploaded. Go back to the home page and upload a CSV or Excel file first.")
+    if st.button("Back to home"):
+        st.switch_page("app.py")
+    st.stop()
+
+# ---- Encoding picker --------------------------------------------------------
+#
+# Charset detection misfires on small files, byte-equivalent codepages
+# (cp1252 vs Latin-1 vs cp1250), and content where every byte happens to
+# decode under the wrong encoding (KOI8-R bytes that look like Shift_JIS).
+# When the user spots mojibake or U+FFFD chars in the findings list, this
+# picker is the escape hatch — pick the right encoding, re-run the analyzer.
+
+with st.container(border=True):
+    detected_enc = _detected_encoding_for_session()
+    current_override = st.session_state.get("encoding_override")
+    suffix = (st.session_state.get("home_uploaded_name") or "")
+    suffix = suffix.rsplit(".", 1)[-1].lower() if "." in suffix else ""
+    is_excel = suffix in ("xlsx", "xls")
+
+    st.markdown("**File encoding**")
+    if is_excel:
+        st.caption(
+            "Excel files store text as Unicode internally — encoding override "
+            "doesn't apply. Skip this section."
+        )
+    else:
+        cap_parts = [f"Detected: `{detected_enc or 'unknown'}`"]
+        if current_override:
+            cap_parts.append(f"Currently using: `{current_override}`")
+        st.caption(
+            " · ".join(cap_parts)
+            + " · Override only if you see mojibake (e.g. `Ã©` for `é`) or U+FFFD"
+            " (`�`) in the findings below."
+        )
+
+        col_pick, col_custom, col_apply = st.columns([2, 2, 1])
+
+        with col_pick:
+            current_label = current_override or "(detected)"
+            try:
+                idx = _OVERRIDE_ENCODINGS.index(current_label)
+            except ValueError:
+                idx = _OVERRIDE_ENCODINGS.index("Other…")
+            chosen = st.selectbox(
+                "Encoding",
+                options=_OVERRIDE_ENCODINGS,
+                index=idx,
+                key="encoding_override_select",
+                label_visibility="collapsed",
+            )
+
+        custom_value: Optional[str] = None
+        with col_custom:
+            if chosen == "Other…":
+                custom_value = st.text_input(
+                    "Custom encoding (e.g. `cp1257`, `iso-8859-9`)",
+                    value=current_override if current_override and current_override not in _OVERRIDE_ENCODINGS else "",
+                    key="encoding_override_custom",
+                    label_visibility="collapsed",
+                    placeholder="cp1257",
+                )
+
+        with col_apply:
+            if st.button("Re-analyze", use_container_width=True):
+                if chosen == "(detected)":
+                    new_override = None
+                elif chosen == "Other…":
+                    new_override = (custom_value or "").strip() or None
+                else:
+                    new_override = chosen
+
+                # Sanity-check the override actually decodes the bytes.
+                data = st.session_state.get("home_uploaded_bytes") or b""
+                if new_override is not None:
+                    try:
+                        data.decode(new_override, errors="strict")
+                        decode_ok = True
+                        decode_err = None
+                    except (UnicodeDecodeError, LookupError) as e:
+                        decode_ok = False
+                        decode_err = str(e)
+                else:
+                    decode_ok = True
+                    decode_err = None
+
+                if not decode_ok:
+                    st.warning(
+                        f"`{new_override}` cannot decode this file: {decode_err}. "
+                        f"Re-running anyway with replacement-character fallback so "
+                        f"you can see where the failures are."
+                    )
+
+                # Re-run analysis with the override and refresh session state.
+                st.session_state["encoding_override"] = new_override
+                st.session_state["home_findings"] = _run_analysis_with_override(new_override)
+                # Drop any prior gate result; the user must re-apply.
+                st.session_state.pop("normalization_result", None)
+                st.session_state.pop("normalization_for", None)
+                st.session_state.pop("review_decisions", None)
+                st.rerun()
+
+# Reload findings — the picker above may have just rewritten them.
+findings = st.session_state.get("home_findings") or []
+
+if not findings:
+    st.success("✓ No findings to review. The file is already clean — open any tool to begin.")
+    st.stop()
+
+
+# ---- Top-line counters -------------------------------------------------------
+
+n_high = sum(1 for f in findings if f.confidence == "high" and not f.pre_applied and f.fix_action)
+n_medium = sum(1 for f in findings if f.confidence == "medium" and not f.pre_applied)
+n_low = sum(1 for f in findings if f.confidence == "low" and not f.pre_applied)
+n_pre = sum(1 for f in findings if f.pre_applied)
+n_block = sum(1 for f in findings if f.severity == "error")
+
+c1, c2, c3, c4, c5 = st.columns(5)
+c1.metric("High confidence", n_high, help="Round-trip safe — eligible for auto-fix.")
+c2.metric("Medium", n_medium, help="Right call in the common case; preview before applying.")
+c3.metric("Low", n_low, help="Heuristic — opt in only.")
+c4.metric("Already applied", n_pre, help="Fixed during the read pass (BOM, NUL, line endings).")
+c5.metric("Blocking", n_block, help="Severity = error; must be resolved or waived.")
+
+st.divider()
+
+
+# ---- Top-level controls ------------------------------------------------------
+
+decisions_state: dict = st.session_state.setdefault("review_decisions", {})
+
+bar_left, bar_mid, bar_right = st.columns([1.2, 1.2, 3])
+
+with bar_left:
+    if st.button("✨ Auto-fix high-confidence", type="primary", use_container_width=True):
+        for f in findings:
+            if (
+                not f.pre_applied
+                and f.confidence == "high"
+                and f.fix_action
+                and get_fix(f.fix_action) is not None
+            ):
+                decisions_state[f.id] = Decision(finding_id=f.id, action="auto")
+        st.rerun()
+
+with bar_mid:
+    if st.button("Skip everything (not recommended)", use_container_width=True):
+        for f in findings:
+            if not f.pre_applied:
+                decisions_state[f.id] = Decision(finding_id=f.id, action="skip")
+        st.rerun()
+
+
+# ---- Per-finding cards -------------------------------------------------------
+
+# Sort: blocking first, then high (unfixed), medium, low, pre-applied.
+def _sort_key(f: Finding) -> tuple:
+    severity_rank = {"error": 0, "warn": 1, "info": 2}[f.severity]
+    confidence_rank = {"high": 0, "medium": 1, "low": 2}[f.confidence]
+    return (int(f.pre_applied), severity_rank, confidence_rank, f.id)
+
+
+for f in sorted(findings, key=_sort_key):
+    decision = decisions_state.get(f.id)
+    decision_action = decision.action if decision else (
+        "auto" if (f.pre_applied or (f.confidence == "high" and f.fix_action)) else "skip"
+    )
+
+    title_bits = [
+        _severity_pill(f.severity),
+        _confidence_pill(f.confidence),
+        f"**{f.id}**",
+        f"({f.count})",
+    ]
+    if f.pre_applied:
+        title_bits.append(":gray-background[applied during read]")
+
+    with st.expander(" ".join(title_bits), expanded=(f.severity == "error")):
+        st.caption(f.description)
+        if f.tool:
+            st.caption(f"Owned by: `{f.tool}`")
+
+        if f.pre_applied:
+            st.info("This was already applied during the file read pass — no decision needed.")
+            continue
+
+        if not f.fix_action:
+            if f.severity == "error":
+                st.error(
+                    "Blocking finding with no auto-fix. Choose **Skip / waive** to "
+                    "acknowledge and proceed (not recommended), or fix the file outside "
+                    "DataTools and re-upload."
+                )
+            else:
+                st.info("Informational only — no fix to apply.")
+
+        # Decision radio
+        choice_labels = {
+            "auto": "Auto-fix with our algorithm",
+            "skip": "Skip / waive (no change)",
+        }
+        # Customize is offered for fixes that take a meaningful payload.
+        if f.fix_action in ("replace_null_sentinels",):
+            choice_labels["modified"] = "Customize"
+
+        chosen = st.radio(
+            "Decision",
+            options=list(choice_labels.keys()),
+            index=list(choice_labels.keys()).index(decision_action)
+                if decision_action in choice_labels else 0,
+            format_func=lambda k: choice_labels[k],
+            key=f"decision_{f.id}",
+            horizontal=True,
+        )
+
+        # Customize payload editor (only for the modified action)
+        payload: Optional[dict] = None
+        if chosen == "modified" and f.fix_action == "replace_null_sentinels":
+            default_sentinels = ", ".join(sorted([
+                "n/a", "na", "nan", "null", "none", "-", "--", "tbd", "unknown",
+            ]))
+            text = st.text_area(
+                "Sentinels (comma-separated, case-insensitive):",
+                value=(decision.payload or {}).get(
+                    "sentinels_raw", default_sentinels,
+                ) if decision else default_sentinels,
+                key=f"sentinels_{f.id}",
+            )
+            sentinels = [s.strip() for s in text.split(",") if s.strip()]
+            payload = {"sentinels": sentinels, "sentinels_raw": text}
+
+        # Persist
+        decisions_state[f.id] = Decision(
+            finding_id=f.id, action=chosen, payload=payload,
+        )
+
+        # Preview
+        if chosen != "skip" and f.samples:
+            preview = _preview_table(f, chosen, payload)
+            if preview is not None and not preview.empty:
+                st.markdown("**Preview** (showing up to 5 affected cells)")
+                st.dataframe(preview, use_container_width=True, hide_index=True)
+
+st.divider()
+
+
+# ---- Apply ------------------------------------------------------------------
+
+bottom_left, bottom_mid, bottom_right = st.columns([1, 1, 3])
+
+with bottom_left:
+    apply_clicked = st.button(
+        "✅ Apply & enter tools", type="primary", use_container_width=True,
+        disabled=not decisions_state,
+    )
+
+with bottom_mid:
+    reset_clicked = st.button("Reset all decisions", use_container_width=True)
+
+if reset_clicked:
+    st.session_state.pop("review_decisions", None)
+    st.session_state.pop("normalization_result", None)
+    st.session_state.pop("normalization_for", None)
+    st.rerun()
+
+if apply_clicked:
+    df = _load_df_from_session(
+        encoding_override=st.session_state.get("encoding_override")
+    )
+    if df is None:
+        st.error("Could not re-read the uploaded file. Try re-uploading.")
+        st.stop()
+    decisions_list = [d for d in decisions_state.values() if isinstance(d, Decision)]
+    result = apply_decisions(df, findings, decisions_list)
+    st.session_state["normalization_result"] = result
+    st.session_state["normalization_for"] = _upload_hash()
+
+    summary = gate_summary(result)
+    if result.passed and is_normalized(findings, result):
+        st.success(
+            f"✓ Gate passed — {summary['fixes_applied']} fix(es) applied, "
+            f"{summary['cells_changed']} cell(s) changed. You can now open any tool."
+        )
+    elif result.blocking_findings:
+        st.error(
+            f"Gate blocked by error-level findings: "
+            f"{', '.join(b.id for b in result.blocking_findings)}. "
+            f"Resolve or waive them above before continuing."
+        )
+    elif result.pending_findings:
+        st.warning(
+            f"Pending decisions remain on: "
+            f"{', '.join(f.id for f in result.pending_findings)}. "
+            f"Choose Auto-fix or Skip for each before continuing."
+        )
+
+# Persisted summary (re-render on reload)
+result: Optional[NormalizationResult] = st.session_state.get("normalization_result")
+if result is not None and st.session_state.get("normalization_for") == _upload_hash():
+    with st.expander("Audit log"):
+        if result.applied:
+            st.markdown("**Applied fixes**")
+            st.dataframe(
+                pd.DataFrame([
+                    {
+                        "finding": a.finding_id,
+                        "fix_action": a.fix_action,
+                        "decision": a.decision,
+                        "cells_changed": a.cells_changed,
+                    }
+                    for a in result.applied
+                ]),
+                use_container_width=True, hide_index=True,
+            )
+        if result.skipped_findings:
+            st.markdown("**Skipped (waived by user)**")
+            st.write([f.id for f in result.skipped_findings])
+        if result.passed:
+            st.markdown("---")
+            st.markdown("**Download normalized file**")
+            with st.expander("⚙️  Advanced output options"):
+                st.caption(
+                    "Defaults match what the analyzer normalized to: UTF-8, "
+                    "comma-separated, LF line endings. Override only if your "
+                    "destination tool requires a specific format."
+                )
+
+                col_enc, col_delim, col_le = st.columns(3)
+                with col_enc:
+                    enc_choice = st.selectbox(
+                        "Encoding (code page)",
+                        options=[label for label, _ in _OUTPUT_ENCODINGS],
+                        index=0,
+                        key="output_encoding_select",
+                    )
+                    out_encoding = next(
+                        codec for label, codec in _OUTPUT_ENCODINGS if label == enc_choice
+                    )
+
+                with col_delim:
+                    delim_choice = st.selectbox(
+                        "Delimiter",
+                        options=[label for label, _ in _OUTPUT_DELIMITERS],
+                        index=0,
+                        key="output_delim_select",
+                    )
+                    out_delim = next(
+                        ch for label, ch in _OUTPUT_DELIMITERS if label == delim_choice
+                    )
+
+                with col_le:
+                    le_choice = st.selectbox(
+                        "Line terminator",
+                        options=[label for label, _ in _OUTPUT_LINE_TERMINATORS],
+                        index=0,
+                        key="output_le_select",
+                    )
+                    out_le = next(
+                        ch for label, ch in _OUTPUT_LINE_TERMINATORS if label == le_choice
+                    )
+
+            data, encode_warn = _build_output_bytes(
+                result.cleaned_df,
+                encoding=out_encoding,
+                delimiter=out_delim,
+                line_terminator=out_le,
+            )
+            if encode_warn:
+                st.warning(encode_warn)
+
+            ext = "tsv" if out_delim == "\t" else "csv"
+            mime = "text/tab-separated-values" if out_delim == "\t" else "text/csv"
+            file_name = f"{Path(upload_name).stem}.normalized.{ext}"
+
+            st.download_button(
+                f"⬇️  Download {file_name}",
+                data=data,
+                file_name=file_name,
+                mime=mime,
+                type="primary",
+            )
diff --git a/src/gui/pages/1_Deduplicator.py b/src/gui/pages/1_Deduplicator.py
index 6fa8760..4da19b5 100644
--- a/src/gui/pages/1_Deduplicator.py
+++ b/src/gui/pages/1_Deduplicator.py
@@ -22,10 +22,12 @@ from src.gui.components import (
     hide_streamlit_chrome,
     match_group_card,
     pickup_or_upload,
+    require_normalization_gate,
     results_summary,
 )
 
 hide_streamlit_chrome()
+require_normalization_gate()
 
 # ---------------------------------------------------------------------------
 # Session state defaults
diff --git a/src/gui/pages/2_Text_Cleaner.py b/src/gui/pages/2_Text_Cleaner.py
index e9ef09f..80ba7e6 100644
--- a/src/gui/pages/2_Text_Cleaner.py
+++ b/src/gui/pages/2_Text_Cleaner.py
@@ -18,6 +18,7 @@ from src.gui.components import (
     hide_streamlit_chrome,
     pickup_or_upload,
     render_hidden_aware_preview,
+    require_normalization_gate,
 )
 from src.core.text_clean import (
     PRESETS,
@@ -28,6 +29,7 @@ from src.core.text_clean import (
 )
 
 hide_streamlit_chrome()
+require_normalization_gate()
 
 
 # ---------------------------------------------------------------------------
diff --git a/src/gui/pages/3_Format_Standardizer.py b/src/gui/pages/3_Format_Standardizer.py
index 2976325..3511f38 100644
--- a/src/gui/pages/3_Format_Standardizer.py
+++ b/src/gui/pages/3_Format_Standardizer.py
@@ -11,9 +11,10 @@ _project_root = Path(__file__).resolve().parent.parent.parent.parent
 if str(_project_root) not in sys.path:
     sys.path.insert(0, str(_project_root))
 
-from src.gui.components import hide_streamlit_chrome
+from src.gui.components import hide_streamlit_chrome, require_normalization_gate
 
 hide_streamlit_chrome()
+require_normalization_gate()
 
 # ---------------------------------------------------------------------------
 # Header
diff --git a/src/gui/pages/4_Missing_Values.py b/src/gui/pages/4_Missing_Values.py
index c34b1eb..8a181ed 100644
--- a/src/gui/pages/4_Missing_Values.py
+++ b/src/gui/pages/4_Missing_Values.py
@@ -11,9 +11,10 @@ _project_root = Path(__file__).resolve().parent.parent.parent.parent
 if str(_project_root) not in sys.path:
     sys.path.insert(0, str(_project_root))
 
-from src.gui.components import hide_streamlit_chrome
+from src.gui.components import hide_streamlit_chrome, require_normalization_gate
 
 hide_streamlit_chrome()
+require_normalization_gate()
 
 # ---------------------------------------------------------------------------
 # Header
diff --git a/src/gui/pages/5_Column_Mapper.py b/src/gui/pages/5_Column_Mapper.py
index df11527..d36cc05 100644
--- a/src/gui/pages/5_Column_Mapper.py
+++ b/src/gui/pages/5_Column_Mapper.py
@@ -11,9 +11,10 @@ _project_root = Path(__file__).resolve().parent.parent.parent.parent
 if str(_project_root) not in sys.path:
     sys.path.insert(0, str(_project_root))
 
-from src.gui.components import hide_streamlit_chrome
+from src.gui.components import hide_streamlit_chrome, require_normalization_gate
 
 hide_streamlit_chrome()
+require_normalization_gate()
 
 # ---------------------------------------------------------------------------
 # Header
diff --git a/src/gui/pages/6_Outlier_Detector.py b/src/gui/pages/6_Outlier_Detector.py
index c342ff1..02fbdc7 100644
--- a/src/gui/pages/6_Outlier_Detector.py
+++ b/src/gui/pages/6_Outlier_Detector.py
@@ -11,9 +11,10 @@ _project_root = Path(__file__).resolve().parent.parent.parent.parent
 if str(_project_root) not in sys.path:
     sys.path.insert(0, str(_project_root))
 
-from src.gui.components import hide_streamlit_chrome
+from src.gui.components import hide_streamlit_chrome, require_normalization_gate
 
 hide_streamlit_chrome()
+require_normalization_gate()
 
 # ---------------------------------------------------------------------------
 # Header
diff --git a/src/gui/pages/7_Multi_File_Merger.py b/src/gui/pages/7_Multi_File_Merger.py
index 8a22e65..7b28fc1 100644
--- a/src/gui/pages/7_Multi_File_Merger.py
+++ b/src/gui/pages/7_Multi_File_Merger.py
@@ -11,9 +11,10 @@ _project_root = Path(__file__).resolve().parent.parent.parent.parent
 if str(_project_root) not in sys.path:
     sys.path.insert(0, str(_project_root))
 
-from src.gui.components import hide_streamlit_chrome
+from src.gui.components import hide_streamlit_chrome, require_normalization_gate
 
 hide_streamlit_chrome()
+require_normalization_gate()
 
 # ---------------------------------------------------------------------------
 # Header
diff --git a/src/gui/pages/8_Validator_Reporter.py b/src/gui/pages/8_Validator_Reporter.py
index 614ec4c..6a6b2cf 100644
--- a/src/gui/pages/8_Validator_Reporter.py
+++ b/src/gui/pages/8_Validator_Reporter.py
@@ -11,9 +11,10 @@ _project_root = Path(__file__).resolve().parent.parent.parent.parent
 if str(_project_root) not in sys.path:
     sys.path.insert(0, str(_project_root))
 
-from src.gui.components import hide_streamlit_chrome
+from src.gui.components import hide_streamlit_chrome, require_normalization_gate
 
 hide_streamlit_chrome()
+require_normalization_gate()
 
 # ---------------------------------------------------------------------------
 # Header
diff --git a/src/gui/pages/9_Pipeline_Runner.py b/src/gui/pages/9_Pipeline_Runner.py
index 7346887..8057e80 100644
--- a/src/gui/pages/9_Pipeline_Runner.py
+++ b/src/gui/pages/9_Pipeline_Runner.py
@@ -11,9 +11,10 @@ _project_root = Path(__file__).resolve().parent.parent.parent.parent
 if str(_project_root) not in sys.path:
     sys.path.insert(0, str(_project_root))
 
-from src.gui.components import hide_streamlit_chrome
+from src.gui.components import hide_streamlit_chrome, require_normalization_gate
 
 hide_streamlit_chrome()
+require_normalization_gate()
 
 # ---------------------------------------------------------------------------
 # Header
diff --git a/test-cases/encodings-corpus/E01_western_basic_utf8.csv b/test-cases/encodings-corpus/E01_western_basic_utf8.csv
new file mode 100644
index 0000000..54b281c
--- /dev/null
+++ b/test-cases/encodings-corpus/E01_western_basic_utf8.csv
@@ -0,0 +1,5 @@
+id,name,city,note
+1,Alice,New York,plain ASCII
+2,Café Müller,Köln,Latin-1 accents
+3,Naïve Façade,Zürich,more accents
+4,España,Düsseldorf,Spanish n-tilde
diff --git a/test-cases/encodings-corpus/E02_western_basic_utf8bom.csv b/test-cases/encodings-corpus/E02_western_basic_utf8bom.csv
new file mode 100644
index 0000000..5fe8b5f
--- /dev/null
+++ b/test-cases/encodings-corpus/E02_western_basic_utf8bom.csv
@@ -0,0 +1,5 @@
+﻿id,name,city,note
+1,Alice,New York,plain ASCII
+2,Café Müller,Köln,Latin-1 accents
+3,Naïve Façade,Zürich,more accents
+4,España,Düsseldorf,Spanish n-tilde
diff --git a/test-cases/encodings-corpus/E03_western_basic_cp1252.csv b/test-cases/encodings-corpus/E03_western_basic_cp1252.csv
new file mode 100644
index 0000000..5bb0225
--- /dev/null
+++ b/test-cases/encodings-corpus/E03_western_basic_cp1252.csv
@@ -0,0 +1,5 @@
+id,name,city,note
+1,Alice,New York,plain ASCII
+2,Caf� M�ller,K�ln,Latin-1 accents
+3,Na�ve Fa�ade,Z�rich,more accents
+4,Espa�a,D�sseldorf,Spanish n-tilde
diff --git a/test-cases/encodings-corpus/E04_western_basic_latin1.csv b/test-cases/encodings-corpus/E04_western_basic_latin1.csv
new file mode 100644
index 0000000..5bb0225
--- /dev/null
+++ b/test-cases/encodings-corpus/E04_western_basic_latin1.csv
@@ -0,0 +1,5 @@
+id,name,city,note
+1,Alice,New York,plain ASCII
+2,Caf� M�ller,K�ln,Latin-1 accents
+3,Na�ve Fa�ade,Z�rich,more accents
+4,Espa�a,D�sseldorf,Spanish n-tilde
diff --git a/test-cases/encodings-corpus/E05_western_basic_latin9.csv b/test-cases/encodings-corpus/E05_western_basic_latin9.csv
new file mode 100644
index 0000000..5bb0225
--- /dev/null
+++ b/test-cases/encodings-corpus/E05_western_basic_latin9.csv
@@ -0,0 +1,5 @@
+id,name,city,note
+1,Alice,New York,plain ASCII
+2,Caf� M�ller,K�ln,Latin-1 accents
+3,Na�ve Fa�ade,Z�rich,more accents
+4,Espa�a,D�sseldorf,Spanish n-tilde
diff --git a/test-cases/encodings-corpus/E06_western_basic_macroman.csv b/test-cases/encodings-corpus/E06_western_basic_macroman.csv
new file mode 100644
index 0000000..98feebe
--- /dev/null
+++ b/test-cases/encodings-corpus/E06_western_basic_macroman.csv
@@ -0,0 +1,5 @@
+id,name,city,note
+1,Alice,New York,plain ASCII
+2,Caf� M�ller,K�ln,Latin-1 accents
+3,Na�ve Fa�ade,Z�rich,more accents
+4,Espa�a,D�sseldorf,Spanish n-tilde
diff --git a/test-cases/encodings-corpus/E07_western_basic_utf16le.csv b/test-cases/encodings-corpus/E07_western_basic_utf16le.csv
new file mode 100644
index 0000000..172f8a2
Binary files /dev/null and b/test-cases/encodings-corpus/E07_western_basic_utf16le.csv differ
diff --git a/test-cases/encodings-corpus/E08_western_basic_utf16be.csv b/test-cases/encodings-corpus/E08_western_basic_utf16be.csv
new file mode 100644
index 0000000..bc56321
Binary files /dev/null and b/test-cases/encodings-corpus/E08_western_basic_utf16be.csv differ
diff --git a/test-cases/encodings-corpus/E09_western_basic_utf16le_nobom.csv b/test-cases/encodings-corpus/E09_western_basic_utf16le_nobom.csv
new file mode 100644
index 0000000..c14d47b
Binary files /dev/null and b/test-cases/encodings-corpus/E09_western_basic_utf16le_nobom.csv differ
diff --git a/test-cases/encodings-corpus/E10_western_extended_utf8.csv b/test-cases/encodings-corpus/E10_western_extended_utf8.csv
new file mode 100644
index 0000000..d204c4b
--- /dev/null
+++ b/test-cases/encodings-corpus/E10_western_extended_utf8.csv
@@ -0,0 +1,5 @@
+id,name,note
+1,€100 product,euro sign U+20AC
+2,“smart” quotes,curly U+201C and U+201D
+3,café — résumé,em-dash U+2014
+4,quote’s ok,smart apostrophe U+2019
diff --git a/test-cases/encodings-corpus/E11_western_extended_cp1252.csv b/test-cases/encodings-corpus/E11_western_extended_cp1252.csv
new file mode 100644
index 0000000..587aff9
--- /dev/null
+++ b/test-cases/encodings-corpus/E11_western_extended_cp1252.csv
@@ -0,0 +1,5 @@
+id,name,note
+1,�100 product,euro sign U+20AC
+2,�smart� quotes,curly U+201C and U+201D
+3,caf� � r�sum�,em-dash U+2014
+4,quote�s ok,smart apostrophe U+2019
diff --git a/test-cases/encodings-corpus/E12_western_extended_utf16le.csv b/test-cases/encodings-corpus/E12_western_extended_utf16le.csv
new file mode 100644
index 0000000..a89a33b
Binary files /dev/null and b/test-cases/encodings-corpus/E12_western_extended_utf16le.csv differ
diff --git a/test-cases/encodings-corpus/E13_eastern_european_utf8.csv b/test-cases/encodings-corpus/E13_eastern_european_utf8.csv
new file mode 100644
index 0000000..f5f3f92
--- /dev/null
+++ b/test-cases/encodings-corpus/E13_eastern_european_utf8.csv
@@ -0,0 +1,5 @@
+id,name,city,language
+1,Příliš,Praha,Czech
+2,Żółć,Warszawa,Polish
+3,Tűrő,Budapest,Hungarian
+4,Spaňski,Bratislava,Slovak
diff --git a/test-cases/encodings-corpus/E14_eastern_european_cp1250.csv b/test-cases/encodings-corpus/E14_eastern_european_cp1250.csv
new file mode 100644
index 0000000..a8c1b19
--- /dev/null
+++ b/test-cases/encodings-corpus/E14_eastern_european_cp1250.csv
@@ -0,0 +1,5 @@
+id,name,city,language
+1,P��li�,Praha,Czech
+2,���,Warszawa,Polish
+3,T�r�,Budapest,Hungarian
+4,Spa�ski,Bratislava,Slovak
diff --git a/test-cases/encodings-corpus/E15_eastern_european_iso88592.csv b/test-cases/encodings-corpus/E15_eastern_european_iso88592.csv
new file mode 100644
index 0000000..927febf
--- /dev/null
+++ b/test-cases/encodings-corpus/E15_eastern_european_iso88592.csv
@@ -0,0 +1,5 @@
+id,name,city,language
+1,P��li�,Praha,Czech
+2,���,Warszawa,Polish
+3,T�r�,Budapest,Hungarian
+4,Spa�ski,Bratislava,Slovak
diff --git a/test-cases/encodings-corpus/E16_cyrillic_utf8.csv b/test-cases/encodings-corpus/E16_cyrillic_utf8.csv
new file mode 100644
index 0000000..d4ad079
--- /dev/null
+++ b/test-cases/encodings-corpus/E16_cyrillic_utf8.csv
@@ -0,0 +1,4 @@
+id,name,city
+1,Иван,Москва
+2,Анна,Санкт-Петербург
+3,Дмитрий,Новосибирск
diff --git a/test-cases/encodings-corpus/E17_cyrillic_cp1251.csv b/test-cases/encodings-corpus/E17_cyrillic_cp1251.csv
new file mode 100644
index 0000000..e49142a
--- /dev/null
+++ b/test-cases/encodings-corpus/E17_cyrillic_cp1251.csv
@@ -0,0 +1,4 @@
+id,name,city
+1,����,������
+2,����,�����-���������
+3,�������,�����������
diff --git a/test-cases/encodings-corpus/E18_cyrillic_koi8r.csv b/test-cases/encodings-corpus/E18_cyrillic_koi8r.csv
new file mode 100644
index 0000000..d260d9b
--- /dev/null
+++ b/test-cases/encodings-corpus/E18_cyrillic_koi8r.csv
@@ -0,0 +1,4 @@
+id,name,city
+1,����,������
+2,����,�����-���������
+3,�������,�����������
diff --git a/test-cases/encodings-corpus/E19_japanese_utf8.csv b/test-cases/encodings-corpus/E19_japanese_utf8.csv
new file mode 100644
index 0000000..5a854f4
--- /dev/null
+++ b/test-cases/encodings-corpus/E19_japanese_utf8.csv
@@ -0,0 +1,4 @@
+id,name,city
+1,田中太郎,東京
+2,鈴木花子,大阪
+3,Alice Smith,横浜
diff --git a/test-cases/encodings-corpus/E20_japanese_shiftjis.csv b/test-cases/encodings-corpus/E20_japanese_shiftjis.csv
new file mode 100644
index 0000000..c60057d
--- /dev/null
+++ b/test-cases/encodings-corpus/E20_japanese_shiftjis.csv
@@ -0,0 +1,4 @@
+id,name,city
+1,�c�����Y,����
+2,��؉Ԏq,���
+3,Alice Smith,���l
diff --git a/test-cases/encodings-corpus/E21_chinese_simplified_utf8.csv b/test-cases/encodings-corpus/E21_chinese_simplified_utf8.csv
new file mode 100644
index 0000000..300df3e
--- /dev/null
+++ b/test-cases/encodings-corpus/E21_chinese_simplified_utf8.csv
@@ -0,0 +1,4 @@
+id,name,city
+1,张三,北京
+2,李四,上海
+3,Alice Smith,深圳
diff --git a/test-cases/encodings-corpus/E22_chinese_simplified_gb18030.csv b/test-cases/encodings-corpus/E22_chinese_simplified_gb18030.csv
new file mode 100644
index 0000000..c8f7a53
--- /dev/null
+++ b/test-cases/encodings-corpus/E22_chinese_simplified_gb18030.csv
@@ -0,0 +1,4 @@
+id,name,city
+1,����,����
+2,����,�Ϻ�
+3,Alice Smith,����
diff --git a/test-cases/encodings-corpus/E23_chinese_traditional_utf8.csv b/test-cases/encodings-corpus/E23_chinese_traditional_utf8.csv
new file mode 100644
index 0000000..60a5859
--- /dev/null
+++ b/test-cases/encodings-corpus/E23_chinese_traditional_utf8.csv
@@ -0,0 +1,4 @@
+id,name,city
+1,張三,台北
+2,李四,香港
+3,Alice Smith,新竹
diff --git a/test-cases/encodings-corpus/E24_chinese_traditional_big5.csv b/test-cases/encodings-corpus/E24_chinese_traditional_big5.csv
new file mode 100644
index 0000000..8702249
--- /dev/null
+++ b/test-cases/encodings-corpus/E24_chinese_traditional_big5.csv
@@ -0,0 +1,4 @@
+id,name,city
+1,�i�T,�x�_
+2,���|,����
+3,Alice Smith,�s��
diff --git a/test-cases/encodings-corpus/E25_korean_utf8.csv b/test-cases/encodings-corpus/E25_korean_utf8.csv
new file mode 100644
index 0000000..abb4304
--- /dev/null
+++ b/test-cases/encodings-corpus/E25_korean_utf8.csv
@@ -0,0 +1,4 @@
+id,name,city
+1,김철수,서울
+2,박영희,부산
+3,Alice Smith,인천
diff --git a/test-cases/encodings-corpus/E26_korean_euckr.csv b/test-cases/encodings-corpus/E26_korean_euckr.csv
new file mode 100644
index 0000000..13ccbff
--- /dev/null
+++ b/test-cases/encodings-corpus/E26_korean_euckr.csv
@@ -0,0 +1,4 @@
+id,name,city
+1,��ö��,����
+2,�ڿ���,�λ�
+3,Alice Smith,��õ
diff --git a/test-cases/encodings-corpus/E27_pathological_ascii_only.csv b/test-cases/encodings-corpus/E27_pathological_ascii_only.csv
new file mode 100644
index 0000000..8f21db1
--- /dev/null
+++ b/test-cases/encodings-corpus/E27_pathological_ascii_only.csv
@@ -0,0 +1,4 @@
+id,name,city
+1,Alice,New York
+2,Bob,Chicago
+3,Carol,San Francisco
diff --git a/test-cases/encodings-corpus/E28_pathological_invalid_utf8.csv b/test-cases/encodings-corpus/E28_pathological_invalid_utf8.csv
new file mode 100644
index 0000000..d8443aa
--- /dev/null
+++ b/test-cases/encodings-corpus/E28_pathological_invalid_utf8.csv
@@ -0,0 +1,4 @@
+id,name,city
+1,Alice,New York
+2,B�(b,Chicago
+3,Carol,San Francisco
diff --git a/test-cases/encodings-corpus/E29_pathological_truncated_utf8.csv b/test-cases/encodings-corpus/E29_pathological_truncated_utf8.csv
new file mode 100644
index 0000000..9c304c8
--- /dev/null
+++ b/test-cases/encodings-corpus/E29_pathological_truncated_utf8.csv
@@ -0,0 +1,4 @@
+id,name,city
+1,Alice,New York
+2,Bob,Chicago
+3,�
\ No newline at end of file
diff --git a/test-cases/encodings-corpus/E30_pathological_lying_bom.csv b/test-cases/encodings-corpus/E30_pathological_lying_bom.csv
new file mode 100644
index 0000000..57df065
--- /dev/null
+++ b/test-cases/encodings-corpus/E30_pathological_lying_bom.csv
@@ -0,0 +1,5 @@
+﻿id,name,note
+1,�100 product,euro sign U+20AC
+2,�smart� quotes,curly U+201C and U+201D
+3,caf� � r�sum�,em-dash U+2014
+4,quote�s ok,smart apostrophe U+2019
diff --git a/test-cases/encodings-corpus/E31_pathological_mixed_concat.csv b/test-cases/encodings-corpus/E31_pathological_mixed_concat.csv
new file mode 100644
index 0000000..706f863
--- /dev/null
+++ b/test-cases/encodings-corpus/E31_pathological_mixed_concat.csv
@@ -0,0 +1,4 @@
+id,name,city
+1,M�ller,K�ln
+2,Müller,Köln
+3,Alice,New York
diff --git a/test-cases/encodings-corpus/ENCODINGS-CASES.md b/test-cases/encodings-corpus/ENCODINGS-CASES.md
new file mode 100644
index 0000000..b4ef1f0
--- /dev/null
+++ b/test-cases/encodings-corpus/ENCODINGS-CASES.md
@@ -0,0 +1,284 @@
+# ENCODINGS-CASES.md - Code Page / Encoding Test Corpus
+
+**Version**: 1.0
+**Last updated**: April 29, 2026
+**Companion to**: TEST-CASES.md and QUOTE-CASES.md.
+
+## Why this is a separate corpus
+
+Files 01-23 in the main corpus test the **transformation layer**: given a Python `str` already in memory, what does the cleaner do to it. Encoding tests are about the **I/O layer** that runs *before* the transformation layer ever sees data: given a sequence of bytes on disk, can the reader correctly turn them into a Python `str` in the first place?
+
+These are different failures:
+
+- A transformation bug produces wrong-but-valid output (curly quotes that should have been folded, whitespace that should have been trimmed).
+- An I/O bug produces *garbage* (mojibake) or *crashes* the reader entirely. The cleaner never gets to apply any transformation rule because the input never decoded.
+
+Per TECHNICAL.md Section 9, encoding handling lives in `src/core/io.py`, separate from any individual cleaning script. This corpus tests that module.
+
+---
+
+## 1. Layout
+
+```
+test_data/encodings/
+├── E01_western_basic_utf8.csv             ... E26_korean_euckr.csv
+├── E27_pathological_ascii_only.csv        ... E31_pathological_mixed_concat.csv
+├── expected_detection.csv                 # Manifest: ground truth + acceptable detection
+├── detector_baseline.csv                  # What charset-normalizer actually returns
+└── reference/
+    ├── WESTERN_BASIC.utf8.txt
+    ├── WESTERN_EXTENDED.utf8.txt
+    ├── EASTERN_EUROPEAN.utf8.txt
+    ├── CYRILLIC.utf8.txt
+    ├── JAPANESE.utf8.txt
+    ├── CHINESE_SIMPLIFIED.utf8.txt
+    ├── CHINESE_TRADITIONAL.utf8.txt
+    ├── KOREAN.utf8.txt
+    └── ASCII_ONLY.utf8.txt
+```
+
+Every encoded file has a `canonical_content_id` linking it to one of the 9 reference files in `reference/`. After correct decoding (and BOM stripping if applicable), the encoded file's content must equal the corresponding reference file byte-for-byte.
+
+---
+
+## 2. Coverage matrix
+
+The corpus uses 9 distinct content sets, each chosen to exercise a specific encoding family. Cross-coverage is enforced by content design: Cyrillic content cannot be encoded as cp1252; Western extended content (with euro/em-dash) cannot be encoded as Latin-1; etc. Attempting those encodings would either error or substitute, both of which are themselves test cases.
+
+| Content family | What it contains | Encodings covered |
+|---|---|---|
+| WESTERN_BASIC | ASCII + accented Latin-1 chars (é, ü, ñ, ç) | UTF-8, UTF-8 with BOM, cp1252, ISO-8859-1, ISO-8859-15, Mac Roman, UTF-16 LE/BE with BOM, UTF-16 LE without BOM |
+| WESTERN_EXTENDED | Above + euro sign, smart quotes, em-dash | UTF-8, cp1252, UTF-16 LE (NOT Latin-1: chars don't exist there) |
+| EASTERN_EUROPEAN | Czech, Polish, Hungarian, Slovak accents | UTF-8, cp1250, ISO-8859-2 |
+| CYRILLIC | Russian | UTF-8, cp1251, KOI8-R |
+| JAPANESE | Kanji + kana | UTF-8, Shift_JIS |
+| CHINESE_SIMPLIFIED | Mainland China characters | UTF-8, GB18030 |
+| CHINESE_TRADITIONAL | Taiwan/HK characters | UTF-8, Big5 |
+| KOREAN | Hangul | UTF-8, EUC-KR |
+| ASCII_ONLY | Pure ASCII | One file; encoding genuinely ambiguous |
+
+---
+
+## 3. Per-file index
+
+### Group A — WESTERN_BASIC (single content, 9 encodings)
+
+This group's purpose is mainly to test detector behavior on the most common Western encodings. Because the content uses only ASCII + Latin-1 characters in the 0xA0+ range, **cp1252 / ISO-8859-1 / ISO-8859-15 produce byte-identical output for this content**. The detector cannot meaningfully distinguish among them; any of them is a correct answer.
+
+| File | Encoding | Notes |
+|---|---|---|
+| E01 | UTF-8 | Modern default |
+| E02 | UTF-8 with BOM | Excel "CSV UTF-8" export. Reader must strip the BOM. |
+| E03 | cp1252 | Excel default "CSV" on US/UK/Western Windows |
+| E04 | ISO-8859-1 | Latin-1. Identical bytes to cp1252 for this content. |
+| E05 | ISO-8859-15 | Latin-9. Identical to Latin-1 here (no euro). |
+| E06 | Mac Roman | Different byte mappings; distinguishable |
+| E07 | UTF-16 LE with BOM | Excel "Unicode Text" export |
+| E08 | UTF-16 BE with BOM | Less common but spec'd |
+| E09 | UTF-16 LE without BOM | Detection unreliable; document failure mode |
+
+### Group B — WESTERN_EXTENDED (3 encodings)
+
+This is the cleanest **cp1252-vs-Latin-1 discriminator** in the corpus. The content uses bytes 0x80-0x9F (where cp1252 puts euro, smart quotes, em-dash) — exactly the range Latin-1 leaves undefined. A reader that misidentifies this file as Latin-1 will produce control characters or replacement chars; correct identification as cp1252 yields readable text.
+
+| File | Encoding | Notes |
+|---|---|---|
+| E10 | UTF-8 | Reference |
+| E11 | cp1252 | The discriminator file |
+| E12 | UTF-16 LE with BOM | Same content, sanity check |
+
+### Group C — EASTERN_EUROPEAN (3 encodings)
+
+| File | Encoding | Notes |
+|---|---|---|
+| E13 | UTF-8 | Reference |
+| E14 | cp1250 | Polish/Czech/Hungarian Windows default |
+| E15 | ISO-8859-2 | Latin-2; distinct byte mappings from cp1250 |
+
+### Group D — CYRILLIC (3 encodings)
+
+| File | Encoding | Notes |
+|---|---|---|
+| E16 | UTF-8 | Reference |
+| E17 | cp1251 | Russian Windows default |
+| E18 | KOI8-R | Older Russian Unix encoding; distinct bytes from cp1251 |
+
+### Group E — CJK (8 files, 4 languages × 2 encodings each)
+
+| File | Encoding | Notes |
+|---|---|---|
+| E19 | UTF-8 (Japanese) | Reference |
+| E20 | Shift_JIS | Japanese Excel default; cp932 is the MS extended variant |
+| E21 | UTF-8 (Chinese simplified) | Reference |
+| E22 | GB18030 | Mainland China; supersets GBK and GB2312 |
+| E23 | UTF-8 (Chinese traditional) | Reference |
+| E24 | Big5 | Taiwan/HK; cp950 is the MS variant |
+| E25 | UTF-8 (Korean) | Reference |
+| E26 | EUC-KR | Korean Windows default; cp949 is the MS variant |
+
+### Group F — Pathological (5 files)
+
+These are the cases that crash readers, produce silent corruption, or expose the limits of encoding detection. The expected behavior is **that the reader fails informatively**, not that it succeeds.
+
+| File | Pathology | What should happen |
+|---|---|---|
+| E27 | ASCII only — encoding genuinely ambiguous | Detector picks any of ASCII/UTF-8/cp1252/Latin-1; all decode identically. Test that the reader doesn't OVER-confidently commit to one when ambiguous. |
+| E28 | Invalid UTF-8 byte sequence (0xC3 0x28 mid-file) | Strict UTF-8 decode raises UnicodeDecodeError. Reader should fall back to a single-byte encoding and warn, not silently substitute. |
+| E29 | Truncated UTF-8 multibyte at EOF | Strict decode raises "unexpected end of data". Reader should error with a clear "file appears truncated" message, not silently produce U+FFFD. |
+| E30 | "Lying BOM" — UTF-8 BOM on cp1252 body | utf-8-sig decoder errors at first cp1252 byte in 0x80-0x9F range. Reader should detect the lie and recover by stripping BOM and trying cp1252; warn the user. |
+| E31 | Mixed encoding concatenation (cp1252 + UTF-8) | NO single encoding decodes the whole file. UTF-8 errors on cp1252 bytes; cp1252 mojibakes the UTF-8 bytes. Reader should refuse and tell the user the file contains mixed encodings. |
+
+---
+
+## 4. Manifest files
+
+### `expected_detection.csv` — ground truth + acceptable detection answers
+
+7 columns:
+- `filename` — the encoded test file
+- `canonical_content_id` — links to the reference content
+- `encoding` — the actual encoding used by the generator (ground truth)
+- `has_bom` — whether the file has a BOM
+- `byte_length` — file size in bytes
+- `expected_detection` — pipe-separated list of detector answers that should be considered correct. Includes fuzzy markers (`AMBIGUOUS`, `UNRELIABLE`, `REJECT`, `LOW_CONFIDENCE`) for cases where any reasonable detector behavior is acceptable.
+- `decode_notes` — human-readable explanation of expected behavior
+
+Use this as the primary reference when validating your reader.
+
+### `detector_baseline.csv` — what charset-normalizer actually returns
+
+Recorded during fixture generation against the version of `charset-normalizer` installed at that time. 6 columns:
+- `filename`, `ground_truth_encoding`, `charset_normalizer_returns`, `cn_aliases`, `cn_language`, `cn_chaos_score`
+
+This is **not authoritative** — different detector versions return different labels. It exists so you can see typical detector output without having to run charset-normalizer yourself, and so you have a baseline to compare against if you're testing a different detector or a newer version.
+
+### `reference/*.utf8.txt` — canonical decoded content
+
+One UTF-8 file per content family. After your reader decodes any encoded file in the corpus and strips any BOM, the result should equal the corresponding reference file's content byte-for-byte.
+
+---
+
+## 5. Observed charset-normalizer behavior
+
+Recorded against `charset-normalizer` 3.x. Some of these are known detector quirks worth understanding before you debug your own code:
+
+### Cases where charset-normalizer is reliably correct
+
+- All UTF-8 files (E01, E02, E10, E13, E16, E19, E21, E23, E25): detected as `utf_8`.
+- All UTF-16 with BOM (E07, E08, E12): detected as `utf_16` (loses LE/BE distinction in label, recoverable from BOM).
+- E14 (cp1250 Eastern European): correctly detected.
+- E17 (cp1251 Cyrillic): correctly detected.
+- E20 (Shift_JIS Japanese): returns `cp932` (the MS extended variant; equivalent for this content).
+- E22 (GB18030 Chinese): correctly detected.
+- E24 (Big5 Chinese traditional): correctly detected.
+- E26 (EUC-KR Korean): returns `cp949` (the MS variant; equivalent for this content).
+- E27 (ASCII): correctly detected as `ascii`.
+
+### Cases where charset-normalizer mislabels but produces the right decoded content
+
+These return a wrong-sounding name but decode to the correct characters because the encodings are byte-equivalent for this specific content:
+
+- **E03, E04, E05** (cp1252, Latin-1, Latin-9 with WESTERN_BASIC content): all returned as `cp1250`. The decoded chars are correct because for ASCII + Latin-1 chars in the 0xA0+ range, all four encodings produce identical results. The label is misleading but the data is fine.
+- **E06** (Mac Roman): returned as `mac_iceland`. Same family, identical for our content.
+- **E11** (cp1252 with WESTERN_EXTENDED): returned as `cp1250`. Surprising — `cp1250` does NOT have euro at 0x80 (it has Cyrillic-adjacent chars), so the actual decoded euro sign would be wrong. Verify your reader actually re-decodes with the returned label and check the output, don't assume a "matching" label means correct content.
+
+### Cases where charset-normalizer is wrong
+
+- **E15** (ISO-8859-2 Eastern European): returned as `cp1258` (Vietnamese encoding). Wrong family entirely. Probable cause: the chaos heuristic doesn't penalize cp1258 for the byte distribution in the test content.
+- **E18** (KOI8-R Cyrillic): returned as `shift_jis_2004` (Japanese!). Bytes in KOI8-R's high-bit range happen to look like valid Shift_JIS multibyte sequences for this content. **High-confidence misdetection** — this is the one to plan a fallback for in your reader.
+
+### Pathological cases
+
+- **E28-E31**: charset-normalizer returns various labels (`cp1257`, `cp1250`, `cp1252`, `cp1250`). For pathological inputs, the *label* is less important than the *behavior*: does your reader detect that something is wrong (low confidence, multiple candidate encodings, or a decode error after detection) and warn the user? The `expected_detection` field accepts any label paired with appropriate warning behavior.
+
+### Implication for your reader
+
+Don't trust charset-normalizer's label blindly. The robust pattern:
+
+1. Run charset-normalizer.
+2. Try to decode the entire file with the returned encoding.
+3. If decode succeeds, sanity-check the result: does it contain replacement characters (U+FFFD)? Does it contain control characters in unexpected places (which suggest cp1252-vs-Latin-1 ambiguity decoded wrong)?
+4. If it fails or smells wrong, try a small candidate set (utf-8, cp1252, latin-1) and pick the one with the cleanest result.
+5. When confidence is low, log a warning and let the user override via a `--encoding` flag.
+
+---
+
+## 6. Suggested test workflow
+
+```python
+import csv
+from pathlib import Path
+from src.core.io import detect_encoding, read_csv  # your reader
+
+CORPUS = Path("test_data/encodings")
+
+# Load ground-truth manifest
+with (CORPUS / "expected_detection.csv").open() as f:
+    manifest = list(csv.DictReader(f))
+
+# Load reference content
+references = {
+    p.stem.replace(".utf8", ""): p.read_text(encoding="utf-8")
+    for p in (CORPUS / "reference").glob("*.utf8.txt")
+}
+
+# Test 1: detection - your detector returns an acceptable answer
+for entry in manifest:
+    if entry["canonical_content_id"] in references:  # skip pure pathological
+        detected = detect_encoding(CORPUS / entry["filename"])
+        acceptable = [e.strip() for e in entry["expected_detection"].split("|")]
+        assert detected in acceptable or any(
+            marker in entry["expected_detection"]
+            for marker in ["AMBIGUOUS", "UNRELIABLE"]
+        ), f"{entry['filename']}: detected {detected} not in {acceptable}"
+
+# Test 2: decoded content matches reference
+for entry in manifest:
+    cid = entry["canonical_content_id"]
+    if cid not in references:
+        continue  # pathological case
+    decoded = read_csv(CORPUS / entry["filename"])
+    assert decoded == references[cid], f"{entry['filename']}: content mismatch"
+
+# Test 3: pathological cases produce warnings, not silent corruption
+for entry in manifest:
+    cid = entry["canonical_content_id"]
+    if cid in references:
+        continue
+    # Reader must either raise a clear error OR succeed with a logged warning
+    # The exact behavior is a policy choice; document it and test against it
+```
+
+---
+
+## 7. What this corpus does NOT cover
+
+Listed so the gaps are explicit:
+
+1. **Big files**. Every fixture is small (under 1 KB). Detection on a 500MB cp1252 export may behave differently because charset-normalizer samples; if your reader processes giant files, add a separate large-file detection test.
+2. **Streaming detection**. Detector is run on the full bytes here. If your reader decodes in chunks (for memory reasons on huge files), encoding-detection-at-stream-start is its own test surface.
+3. **Languages with complex scripts not represented here**: Thai, Hebrew, Arabic, Vietnamese (cp1258), Greek (cp1253), Turkish (cp1254). Add per-language fixtures if your buyers use these locales. The generator script is parameterized; adding a new content family is a few-line change.
+4. **Extended grapheme handling**. This corpus tests encoding detection and byte-to-string conversion. It does NOT test grapheme-cluster boundaries (multi-codepoint emoji, family ZWJ sequences). Those are the cleaner's territory in the main corpus, case 13.
+5. **Encoding errors during WRITE**. The corpus tests reading. If your tool writes output in a non-UTF-8 encoding for any reason, write-side encoding correctness needs separate fixtures.
+6. **Filename / path encoding issues**. Some filesystems mangle non-ASCII filenames (older Windows, NFS configs). Out of scope for the cleaner; that's a deployment problem.
+
+---
+
+## 8. How to extend the corpus
+
+Add a new content family:
+
+```python
+# In generate_encoding_test_files.py:
+THAI = "id,name,city\n1,สมชาย,กรุงเทพ\n..."
+
+# Then add encoding lines:
+write_encoded("E32_thai_utf8.csv", "THAI", THAI, "utf-8", ...)
+write_encoded("E33_thai_cp874.csv", "THAI", THAI, "cp874", ...)
+```
+
+Add reference content to the `references` dict at the bottom of the generator. Re-run the generator. The manifest and detector baseline will refresh automatically.
+
+For a new pathological case: construct the raw bytes by hand and use `write_raw()`. Document the failure mode in the `decode_notes` field.
+
+Continue numbering: `E32`, `E33`, etc. Reserve `E9#` if you need a "destructive" subcategory paralleling the malformed CSV corpus.
diff --git a/test-cases/encodings-corpus/detector_baseline.csv b/test-cases/encodings-corpus/detector_baseline.csv
new file mode 100644
index 0000000..1cd4864
--- /dev/null
+++ b/test-cases/encodings-corpus/detector_baseline.csv
@@ -0,0 +1,32 @@
+filename,ground_truth_encoding,charset_normalizer_returns,cn_aliases,cn_language,cn_chaos_score
+E01_western_basic_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Turkish,0.000
+E02_western_basic_utf8bom.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Turkish,0.000
+E03_western_basic_cp1252.csv,cp1252,cp1250,"1250, windows_1250",Turkish,0.000
+E04_western_basic_latin1.csv,iso-8859-1,cp1250,"1250, windows_1250",Turkish,0.000
+E05_western_basic_latin9.csv,iso-8859-15,cp1250,"1250, windows_1250",Turkish,0.000
+E06_western_basic_macroman.csv,mac-roman,mac_iceland,maciceland,Turkish,0.000
+E07_western_basic_utf16le.csv,utf-16-le,utf_16,"u16, utf16",Turkish,0.000
+E08_western_basic_utf16be.csv,utf-16-be,utf_16,"u16, utf16",Turkish,0.000
+E09_western_basic_utf16le_nobom.csv,utf-16-le,utf_16_le,"unicodelittleunmarked, utf_16le",Turkish,0.000
+E10_western_extended_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",French,0.013
+E11_western_extended_cp1252.csv,cp1252,cp1250,"1250, windows_1250",French,0.013
+E12_western_extended_utf16le.csv,utf-16-le,utf_16,"u16, utf16",French,0.013
+E13_eastern_european_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Spanish,0.042
+E14_eastern_european_cp1250.csv,cp1250,cp1250,"1250, windows_1250",Spanish,0.042
+E15_eastern_european_iso88592.csv,iso-8859-2,cp1258,"1258, windows_1258",German,0.000
+E16_cyrillic_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Ukrainian,0.059
+E17_cyrillic_cp1251.csv,cp1251,cp1251,"1251, windows_1251",Ukrainian,0.059
+E18_cyrillic_koi8r.csv,koi8-r,shift_jis_2004,"shiftjis2004, sjis_2004, s_jis_2004",Japanese,0.066
+E19_japanese_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Italian,0.000
+E20_japanese_shiftjis.csv,shift_jis,cp932,"932, ms932, mskanji, ms_kanji",Japanese,0.000
+E21_chinese_simplified_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Unknown,0.000
+E22_chinese_simplified_gb18030.csv,gb18030,gb18030,gb18030_2000,Chinese,0.000
+E23_chinese_traditional_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Unknown,0.060
+E24_chinese_traditional_big5.csv,big5,big5,"big5_tw, csbig5, x_mac_trad_chinese",Chinese,0.060
+E25_korean_utf8.csv,utf-8,utf_8,"u8, utf, utf8, utf8_ucs2, utf8_ucs4, cp65001",Unknown,0.000
+E26_korean_euckr.csv,euc-kr,cp949,"949, ms949, uhc",Korean,0.000
+E27_pathological_ascii_only.csv,ascii,ascii,"646, ansi_x3.4_1968, ansi_x3_4_1968, ansi_x3.4_1986, cp367, csascii, ibm367, iso646_us, iso_646.irv_1991, iso_ir_6, us, us_ascii",English,0.000
+E28_pathological_invalid_utf8.csv,invalid-utf8,cp1257,"1257, windows_1257",Croatian,0.000
+E29_pathological_truncated_utf8.csv,invalid-utf8-truncated,cp1250,"1250, windows_1250",Polish,0.000
+E30_pathological_lying_bom.csv,cp1252-with-utf8-bom,cp1252,"1252, windows_1252",French,0.013
+E31_pathological_mixed_concat.csv,cp1252+utf8-concatenated,cp1250,"1250, windows_1250",German,0.000
diff --git a/test-cases/encodings-corpus/expected_detection.csv b/test-cases/encodings-corpus/expected_detection.csv
new file mode 100644
index 0000000..8818797
--- /dev/null
+++ b/test-cases/encodings-corpus/expected_detection.csv
@@ -0,0 +1,32 @@
+filename,canonical_content_id,encoding,has_bom,byte_length,expected_detection,decode_notes
+E01_western_basic_utf8.csv,WESTERN_BASIC,utf-8,no,161,utf_8|utf-8,UTF-8 no BOM. Modern default.
+E02_western_basic_utf8bom.csv,WESTERN_BASIC,utf-8,yes,164,utf_8|utf_8_sig|utf-8|utf-8-sig,UTF-8 with BOM. Excel CSV UTF-8 export. BOM must be stripped on read.
+E03_western_basic_cp1252.csv,WESTERN_BASIC,cp1252,no,153,cp1252|windows-1252|iso-8859-1|latin-1,"Western single-byte. For this content (no euro, no smart quotes, no em-dash), cp1252 and Latin-1 produce IDENTICAL decoded bytes. Detector cannot distinguish - any of cp1252/Latin-1/Latin-9 is a correct answer."
+E04_western_basic_latin1.csv,WESTERN_BASIC,iso-8859-1,no,153,iso-8859-1|latin-1|cp1252|latin_1,Latin-1. Identical bytes to cp1252 for this content. Detector ambiguity is expected and acceptable.
+E05_western_basic_latin9.csv,WESTERN_BASIC,iso-8859-15,no,153,iso-8859-15|latin-9|iso-8859-1|cp1252,"Latin-9. For this content with no euro sign, decodes identically to Latin-1. Detector may pick any."
+E06_western_basic_macroman.csv,WESTERN_BASIC,mac-roman,no,153,mac-roman|macroman,"Mac Roman. Different byte values for the accented chars vs cp1252/Latin-1, so this one is distinguishable."
+E07_western_basic_utf16le.csv,WESTERN_BASIC,utf-16-le,yes,308,utf-16|utf-16-le|utf_16|utf_16_le,UTF-16 LE with BOM. Excel 'Unicode Text' export.
+E08_western_basic_utf16be.csv,WESTERN_BASIC,utf-16-be,yes,308,utf-16|utf-16-be|utf_16|utf_16_be,UTF-16 BE with BOM. Less common but valid.
+E09_western_basic_utf16le_nobom.csv,WESTERN_BASIC,utf-16-le,no,306,utf-16|utf-16-le|UNRELIABLE,"UTF-16 LE without BOM. Detection is heuristic and unreliable; bytes look like 'every other byte is null' for ASCII-heavy content, which charset-normalizer may or may not catch. If detector returns wrong encoding here, that is the buyer's responsibility to manually specify - flag in error message."
+E10_western_extended_utf8.csv,WESTERN_EXTENDED,utf-8,no,167,utf_8|utf-8,"UTF-8. Has euro, smart quotes, em-dash."
+E11_western_extended_cp1252.csv,WESTERN_EXTENDED,cp1252,no,154,cp1252|windows-1252,"cp1252. Content uses 0x80-0x9F range (euro, smart quotes, em-dash). Decoding as Latin-1 produces control characters or replacement chars - this file is the cleanest cp1252-vs-Latin-1 discriminator."
+E12_western_extended_utf16le.csv,WESTERN_EXTENDED,utf-16-le,yes,310,utf-16|utf-16-le,UTF-16 LE with BOM. Same content as E10/E11.
+E13_eastern_european_utf8.csv,EASTERN_EUROPEAN,utf-8,no,130,utf_8|utf-8,UTF-8 baseline for Czech/Polish/Hungarian/Slovak content.
+E14_eastern_european_cp1250.csv,EASTERN_EUROPEAN,cp1250,no,120,cp1250|windows-1250,"cp1250. Decoding as cp1252 produces mojibake (Polish slash-l would become U+0142 vs ascii letter, etc.). Real distinguishing test."
+E15_eastern_european_iso88592.csv,EASTERN_EUROPEAN,iso-8859-2,no,120,iso-8859-2|latin-2|iso8859_2,ISO-8859-2 / Latin-2. Different byte assignments than cp1250 for the same characters.
+E16_cyrillic_utf8.csv,CYRILLIC,utf-8,no,118,utf_8|utf-8,UTF-8 baseline for Russian content.
+E17_cyrillic_cp1251.csv,CYRILLIC,cp1251,no,72,cp1251|windows-1251,cp1251. The dominant Russian Windows encoding.
+E18_cyrillic_koi8r.csv,CYRILLIC,koi8-r,no,72,koi8-r|koi8_r,KOI8-R. Older Unix Russian encoding. Distinct byte patterns from cp1251.
+E19_japanese_utf8.csv,JAPANESE,utf-8,no,78,utf_8|utf-8,UTF-8 baseline for Japanese content.
+E20_japanese_shiftjis.csv,JAPANESE,shift_jis,no,64,shift_jis|shift-jis|cp932|sjis,Shift_JIS. Excel on Japanese Windows defaults to this. cp932 is Microsoft's extended variant; either name is acceptable.
+E21_chinese_simplified_utf8.csv,CHINESE_SIMPLIFIED,utf-8,no,66,utf_8|utf-8,UTF-8 baseline for simplified Chinese.
+E22_chinese_simplified_gb18030.csv,CHINESE_SIMPLIFIED,gb18030,no,56,gb18030|gbk|gb2312,GB18030. Mainland China default. GB18030 supersets GBK supersets GB2312; for this content any is acceptable.
+E23_chinese_traditional_utf8.csv,CHINESE_TRADITIONAL,utf-8,no,66,utf_8|utf-8,UTF-8 baseline for traditional Chinese.
+E24_chinese_traditional_big5.csv,CHINESE_TRADITIONAL,big5,no,56,big5|big5_hkscs|cp950,Big5. Taiwan and Hong Kong default. cp950 is Microsoft's variant.
+E25_korean_utf8.csv,KOREAN,utf-8,no,72,utf_8|utf-8,UTF-8 baseline for Korean.
+E26_korean_euckr.csv,KOREAN,euc-kr,no,60,euc-kr|euc_kr|cp949,EUC-KR. Korean Windows default. cp949 is the MS variant.
+E27_pathological_ascii_only.csv,ASCII_ONLY,ascii,no,66,ascii|utf_8|utf-8|cp1252|iso-8859-1|AMBIGUOUS,"Pure ASCII. Multiple encodings produce identical bytes for this content. Any of ASCII, UTF-8, cp1252, Latin-1 is a correct detection answer because all four decode to the same string. Detector confidence should be high; specific label is interchangeable."
+E28_pathological_invalid_utf8.csv,INVALID_UTF8,invalid-utf8,no,67,cp1252|iso-8859-1|REJECT_UTF8,File starts as if UTF-8 but contains an invalid byte sequence (0xC3 0x28). A strict UTF-8 decoder errors. Detector should reject UTF-8 and fall back to a single-byte encoding; cp1252 will produce mojibake but parse without error. Cleaner should warn the user that encoding detection was uncertain.
+E29_pathological_truncated_utf8.csv,TRUNCATED_UTF8,invalid-utf8-truncated,no,47,utf_8_with_errors|cp1252|REJECT,"Valid UTF-8 throughout, but the last byte (0xE4) starts a 3-byte sequence that's never completed. Strict UTF-8 decoder errors at EOF. errors='replace' produces \ufffd. Real-world cause: file was truncated by a transfer interruption or a buggy export. Cleaner should treat as corrupt-input error, not silent data loss."
+E30_pathological_lying_bom.csv,WESTERN_EXTENDED,cp1252-with-utf8-bom,yes (lying),157,utf_8_FAILS|cp1252|AMBIGUOUS,File has UTF-8 BOM but body is cp1252. UTF-8 decoder will see 0x80 (euro in cp1252) as an invalid UTF-8 continuation byte and error out. Better detectors recover by ignoring the BOM and trying cp1252. Cleaner should warn 'BOM suggested UTF-8 but content decoded as cp1252' so the user knows their file is lying about itself.
+E31_pathological_mixed_concat.csv,MIXED_CONCAT,cp1252+utf8-concatenated,no,60,LOW_CONFIDENCE|cp1252|utf_8|REJECT,"First half cp1252, second half UTF-8. No single encoding decodes both halves correctly. UTF-8 decoder errors on row 1. cp1252 decoder produces mojibake on rows 2-3. charset-normalizer detection confidence should be low. Right behavior for the cleaner: refuse to process and tell the user the file contains mixed encodings."
diff --git a/test-cases/encodings-corpus/reference/ASCII_ONLY.utf8.txt b/test-cases/encodings-corpus/reference/ASCII_ONLY.utf8.txt
new file mode 100644
index 0000000..8f21db1
--- /dev/null
+++ b/test-cases/encodings-corpus/reference/ASCII_ONLY.utf8.txt
@@ -0,0 +1,4 @@
+id,name,city
+1,Alice,New York
+2,Bob,Chicago
+3,Carol,San Francisco
diff --git a/test-cases/encodings-corpus/reference/CHINESE_SIMPLIFIED.utf8.txt b/test-cases/encodings-corpus/reference/CHINESE_SIMPLIFIED.utf8.txt
new file mode 100644
index 0000000..300df3e
--- /dev/null
+++ b/test-cases/encodings-corpus/reference/CHINESE_SIMPLIFIED.utf8.txt
@@ -0,0 +1,4 @@
+id,name,city
+1,张三,北京
+2,李四,上海
+3,Alice Smith,深圳
diff --git a/test-cases/encodings-corpus/reference/CHINESE_TRADITIONAL.utf8.txt b/test-cases/encodings-corpus/reference/CHINESE_TRADITIONAL.utf8.txt
new file mode 100644
index 0000000..60a5859
--- /dev/null
+++ b/test-cases/encodings-corpus/reference/CHINESE_TRADITIONAL.utf8.txt
@@ -0,0 +1,4 @@
+id,name,city
+1,張三,台北
+2,李四,香港
+3,Alice Smith,新竹
diff --git a/test-cases/encodings-corpus/reference/CYRILLIC.utf8.txt b/test-cases/encodings-corpus/reference/CYRILLIC.utf8.txt
new file mode 100644
index 0000000..d4ad079
--- /dev/null
+++ b/test-cases/encodings-corpus/reference/CYRILLIC.utf8.txt
@@ -0,0 +1,4 @@
+id,name,city
+1,Иван,Москва
+2,Анна,Санкт-Петербург
+3,Дмитрий,Новосибирск
diff --git a/test-cases/encodings-corpus/reference/EASTERN_EUROPEAN.utf8.txt b/test-cases/encodings-corpus/reference/EASTERN_EUROPEAN.utf8.txt
new file mode 100644
index 0000000..f5f3f92
--- /dev/null
+++ b/test-cases/encodings-corpus/reference/EASTERN_EUROPEAN.utf8.txt
@@ -0,0 +1,5 @@
+id,name,city,language
+1,Příliš,Praha,Czech
+2,Żółć,Warszawa,Polish
+3,Tűrő,Budapest,Hungarian
+4,Spaňski,Bratislava,Slovak
diff --git a/test-cases/encodings-corpus/reference/JAPANESE.utf8.txt b/test-cases/encodings-corpus/reference/JAPANESE.utf8.txt
new file mode 100644
index 0000000..5a854f4
--- /dev/null
+++ b/test-cases/encodings-corpus/reference/JAPANESE.utf8.txt
@@ -0,0 +1,4 @@
+id,name,city
+1,田中太郎,東京
+2,鈴木花子,大阪
+3,Alice Smith,横浜
diff --git a/test-cases/encodings-corpus/reference/KOREAN.utf8.txt b/test-cases/encodings-corpus/reference/KOREAN.utf8.txt
new file mode 100644
index 0000000..abb4304
--- /dev/null
+++ b/test-cases/encodings-corpus/reference/KOREAN.utf8.txt
@@ -0,0 +1,4 @@
+id,name,city
+1,김철수,서울
+2,박영희,부산
+3,Alice Smith,인천
diff --git a/test-cases/encodings-corpus/reference/WESTERN_BASIC.utf8.txt b/test-cases/encodings-corpus/reference/WESTERN_BASIC.utf8.txt
new file mode 100644
index 0000000..54b281c
--- /dev/null
+++ b/test-cases/encodings-corpus/reference/WESTERN_BASIC.utf8.txt
@@ -0,0 +1,5 @@
+id,name,city,note
+1,Alice,New York,plain ASCII
+2,Café Müller,Köln,Latin-1 accents
+3,Naïve Façade,Zürich,more accents
+4,España,Düsseldorf,Spanish n-tilde
diff --git a/test-cases/encodings-corpus/reference/WESTERN_EXTENDED.utf8.txt b/test-cases/encodings-corpus/reference/WESTERN_EXTENDED.utf8.txt
new file mode 100644
index 0000000..d204c4b
--- /dev/null
+++ b/test-cases/encodings-corpus/reference/WESTERN_EXTENDED.utf8.txt
@@ -0,0 +1,5 @@
+id,name,note
+1,€100 product,euro sign U+20AC
+2,“smart” quotes,curly U+201C and U+201D
+3,café — résumé,em-dash U+2014
+4,quote’s ok,smart apostrophe U+2019
diff --git a/test-cases/text-cleaner-corpus/test_data/17_preserve_intended.csv b/test-cases/text-cleaner-corpus/test_data/17_preserve_intended.csv
index 17409c9..d4121bd 100644
--- a/test-cases/text-cleaner-corpus/test_data/17_preserve_intended.csv
+++ b/test-cases/text-cleaner-corpus/test_data/17_preserve_intended.csv
@@ -1,4 +1,4 @@
 id,price,european_number,date,phone,quantity
 1,  100  ,1 234,2024-01-15,(555) 123-4567,42
-2,"  $1,500.00  ",12 345,15/01/2024,555.123.4567,7
+2,  $1,500.00  ,12 345,15/01/2024,555.123.4567,7
 3,  N/A  ,nan,Jan 15 2024,+1 555 123 4567,0
diff --git a/tests/test_analyze.py b/tests/test_analyze.py
index 66335af..70113e6 100644
--- a/tests/test_analyze.py
+++ b/tests/test_analyze.py
@@ -204,6 +204,67 @@ class TestNearDuplicates:
 # Mixed line endings
 # ---------------------------------------------------------------------------
 
+class TestEncodingUncertainty:
+    def test_replacement_chars_in_data_flagged(self):
+        df = pd.DataFrame({"name": ["Caf�", "Ber�in"]})
+        findings = analyze(df)
+        f = next(f for f in findings if f.id == "encoding_uncertain")
+        assert f.severity == "error"
+        assert f.confidence == "low"
+        assert f.count == 2
+
+    def test_replacement_chars_in_header_flagged(self):
+        df = pd.DataFrame({"emai�l": ["a@x.com"]})
+        findings = analyze(df)
+        ids = {f.id for f in findings}
+        assert "encoding_uncertain" in ids
+
+    def test_clean_data_no_finding(self):
+        df = pd.DataFrame({"name": ["Alice", "Bob"]})
+        findings = analyze(df)
+        assert "encoding_uncertain" not in {f.id for f in findings}
+
+
+class TestEncodingOverride:
+    def test_override_corrects_misdetected_codepage(self, tmp_path):
+        # WESTERN_BASIC bytes encoded as cp1252; charset-normalizer guesses
+        # cp1250, which gets 0xF1 wrong (ń vs ñ).
+        f = tmp_path / "cp1252.csv"
+        f.write_bytes("id,name\n1,España\n".encode("cp1252"))
+
+        from src.core.analyze import _load_for_analysis
+        df_auto, _, _ = _load_for_analysis(f, sample_rows=10)
+        df_overridden, _, _ = _load_for_analysis(
+            f, sample_rows=10, encoding_override="cp1252",
+        )
+        # Override yields the correct character.
+        assert df_overridden["name"].iloc[0] == "España"
+
+    def test_override_propagates_through_top_level_analyze(self, tmp_path):
+        f = tmp_path / "koi8.csv"
+        # KOI8-R Cyrillic; default detection guesses Shift_JIS.
+        f.write_bytes("id,name\n1,Иван\n".encode("koi8-r"))
+        # With the override the analyzer should produce zero findings
+        # against this clean fixture (no mojibake, no U+FFFD).
+        findings = analyze(f, encoding_override="koi8-r")
+        ids = {x.id for x in findings}
+        assert "encoding_uncertain" not in ids
+        assert "encoding_decode_failed" not in ids
+
+
+class TestEncodingDecodeFailedFromRepair:
+    def test_decode_replaced_action_surfaces_error_finding(self, tmp_path):
+        # Create a file with a UTF-8 BOM but cp1252 body bytes — utf-8-sig
+        # fails on byte 0x80 (€ in cp1252).
+        f = tmp_path / "lying_bom.csv"
+        f.write_bytes(b"\xef\xbb\xbfid,name\n1,\x80100\n")
+        findings = analyze(f)
+        ids = {x.id for x in findings}
+        assert "encoding_decode_failed" in ids
+        bad = next(x for x in findings if x.id == "encoding_decode_failed")
+        assert bad.severity == "error"
+
+
 class TestMixedLineEndings:
     def test_crlf_plus_lf_flagged(self, tmp_path):
         f = tmp_path / "mixed.csv"
diff --git a/tests/test_corpus.py b/tests/test_corpus.py
index 33a545e..f70687a 100644
--- a/tests/test_corpus.py
+++ b/tests/test_corpus.py
@@ -51,14 +51,24 @@ DEFAULT_CASES = [
 def _read_csv_strict(path: Path) -> pd.DataFrame:
     """Read a corpus CSV file, treating all cells as strings.
 
-    NUL bytes are stripped from the raw file before parsing because the
-    pandas C engine truncates fields at NUL while the python engine is
-    too strict about embedded literal double quotes. Stripping NUL is
-    the file-level pre-clean step the spec describes for case 06.
+    Applies only the structural pre-parse fixes that are required to make
+    the file parseable at all — NUL stripping (case 06), line-ending
+    normalization (cases 09/10), and unquoted-currency repair (case 17).
+    Character-level folds that the cleaner itself owns (smart quotes,
+    NBSP, etc.) are deliberately left alone so the cleaner's own behavior
+    is what's under test.
     """
-    raw = path.read_bytes().replace(b"\x00", b"")
+    raw = path.read_bytes()
+    # NUL stripping
+    raw = raw.replace(b"\x00", b"")
+    # Line endings: CRLF -> LF, then bare CR -> LF.
+    raw = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
+    # Per-row repair (handles unquoted '$1,500.00' in case 17).
+    from src.core.io import _repair_rows
+    text = raw.decode("utf-8-sig")
+    text, _, _ = _repair_rows(text, ",")
     return pd.read_csv(
-        io.BytesIO(raw), dtype=str, keep_default_na=False, encoding="utf-8-sig",
+        io.StringIO(text), dtype=str, keep_default_na=False,
     )
 
 
diff --git a/tests/test_encodings_corpus.py b/tests/test_encodings_corpus.py
new file mode 100644
index 0000000..8027740
--- /dev/null
+++ b/tests/test_encodings_corpus.py
@@ -0,0 +1,184 @@
+"""Run the analyzer + detector against the code-page test corpus.
+
+Fixtures live in ``test-cases/encodings-corpus/`` (synced from
+``Business/DataTools/test-case-code-page-variations``). Each test runs
+against one fixture and uses the corpus manifest
+(``expected_detection.csv``) for ground truth.
+
+What's tested
+-------------
+1. ``analyze()`` does not crash on any fixture — every encoded file
+   produces a Finding list (possibly empty), never an exception.
+2. ``detect_encoding()`` returns one of the manifest's accepted answers,
+   OR the manifest itself flagged the case as AMBIGUOUS / UNRELIABLE /
+   REJECT / LOW_CONFIDENCE.
+3. The decoded DataFrame matches the canonical reference content.
+
+Cases where the current implementation is known to fail (charset-
+normalizer label drift on byte-equivalent encodings, ``repair_bytes``
+NUL-strip destroying UTF-16, the "lying BOM" pathological case) are
+marked ``xfail`` so they surface in the report as documented gaps.
+A future fix that makes the case pass will flip xfail to xpass and the
+test owner can drop the marker.
+"""
+
+from __future__ import annotations
+
+import csv
+import io
+from pathlib import Path
+
+import pandas as pd
+import pytest
+
+from src.core.analyze import analyze, _load_for_analysis
+from src.core.io import detect_encoding
+
+
+CORPUS = Path(__file__).parent.parent / "test-cases" / "encodings-corpus"
+MANIFEST = CORPUS / "expected_detection.csv"
+REFERENCE_DIR = CORPUS / "reference"
+
+# Known failures the analyzer does not yet handle correctly. Each entry
+# has a one-line reason — drop the entry once a fix lands.
+KNOWN_DETECTION_FAILURES = {
+    "E03_western_basic_cp1252.csv": "charset-normalizer returns cp1250 for byte-equivalent content",
+    "E04_western_basic_latin1.csv": "charset-normalizer returns cp1250 for byte-equivalent content",
+    "E05_western_basic_latin9.csv": "charset-normalizer returns cp1250 for byte-equivalent content",
+    "E06_western_basic_macroman.csv": "returns mac_iceland (same family) instead of mac_roman",
+    "E11_western_extended_cp1252.csv": "charset-normalizer returns cp1250 for cp1252 content",
+    "E15_eastern_european_iso88592.csv": "charset-normalizer returns cp1258 for ISO-8859-2 content",
+    "E18_cyrillic_koi8r.csv": "charset-normalizer returns shift_jis_2004 for KOI8-R content",
+}
+
+KNOWN_DECODE_FAILURES = {
+    "E03_western_basic_cp1252.csv": "decoded as cp1250 — different mapping at 0xF1 (ñ vs ń)",
+    "E04_western_basic_latin1.csv": "decoded as cp1250 — different mapping at 0xF1",
+    "E05_western_basic_latin9.csv": "decoded as cp1250 — different mapping at 0xF1",
+    "E10_western_extended_utf8.csv": "byte-level smart-quote fold rewrites U+201C/U+201D to ASCII before parse",
+    "E11_western_extended_cp1252.csv": "wrong encoding + smart-quote fold",
+    "E12_western_extended_utf16le.csv": "byte-level smart-quote fold rewrites U+201C/U+201D before parse",
+    "E15_eastern_european_iso88592.csv": "wrong encoding (cp1258 != ISO-8859-2)",
+    "E18_cyrillic_koi8r.csv": "wrong encoding (shift_jis_2004 != KOI8-R)",
+    "E30_pathological_lying_bom.csv": "utf-8-sig fails on cp1252 body bytes; needs lying-BOM recovery",
+}
+
+
+def _normalize_encoding(name: str) -> str:
+    return name.lower().replace("-", "_").replace(" ", "_")
+
+
+def _load_manifest() -> list[dict]:
+    if not MANIFEST.exists():
+        return []
+    with MANIFEST.open() as fh:
+        return list(csv.DictReader(fh))
+
+
+def _load_references() -> dict[str, str]:
+    if not REFERENCE_DIR.exists():
+        return {}
+    return {
+        p.stem.replace(".utf8", ""): p.read_text(encoding="utf-8")
+        for p in REFERENCE_DIR.glob("*.utf8.txt")
+    }
+
+
+MANIFEST_ENTRIES = _load_manifest()
+REFERENCES = _load_references()
+
+
+def _entry_id(entry: dict) -> str:
+    return entry["filename"]
+
+
+# ---------------------------------------------------------------------------
+# 1. Analyzer never crashes
+# ---------------------------------------------------------------------------
+
+@pytest.mark.parametrize("entry", MANIFEST_ENTRIES, ids=_entry_id)
+def test_analyzer_does_not_crash(entry):
+    findings = analyze(CORPUS / entry["filename"], sample_rows=1000)
+    # Either empty or a list of Findings — but never raises.
+    assert isinstance(findings, list)
+
+
+# ---------------------------------------------------------------------------
+# 2. detect_encoding returns an acceptable answer
+# ---------------------------------------------------------------------------
+
+def _detection_marker(entry):
+    fname = entry["filename"]
+    if fname in KNOWN_DETECTION_FAILURES:
+        return pytest.mark.xfail(
+            reason=KNOWN_DETECTION_FAILURES[fname], strict=False,
+        )
+    return ()
+
+
+@pytest.mark.parametrize(
+    "entry",
+    [
+        pytest.param(e, marks=_detection_marker(e), id=_entry_id(e))
+        for e in MANIFEST_ENTRIES
+    ],
+)
+def test_detect_encoding_accepted(entry):
+    accepted_raw = entry["expected_detection"]
+    # Manifest fuzzy markers — any answer is acceptable.
+    if any(m in accepted_raw for m in ("AMBIGUOUS", "UNRELIABLE", "REJECT", "LOW_CONFIDENCE")):
+        # Just call to ensure no exception.
+        detect_encoding(CORPUS / entry["filename"])
+        return
+    accepted = {_normalize_encoding(s.strip()) for s in accepted_raw.split("|") if s.strip()}
+    detected = detect_encoding(CORPUS / entry["filename"])
+    detected_n = _normalize_encoding(detected)
+    assert detected_n in accepted, (
+        f"{entry['filename']}: detected {detected!r} not in {sorted(accepted)}"
+    )
+
+
+# ---------------------------------------------------------------------------
+# 3. Decoded content matches the canonical reference
+# ---------------------------------------------------------------------------
+
+def _decode_marker(entry):
+    fname = entry["filename"]
+    if fname in KNOWN_DECODE_FAILURES:
+        return pytest.mark.xfail(
+            reason=KNOWN_DECODE_FAILURES[fname], strict=False,
+        )
+    return ()
+
+
+def _decodable_entries():
+    """Skip pathological cases that have no canonical reference."""
+    return [e for e in MANIFEST_ENTRIES if e["canonical_content_id"] in REFERENCES]
+
+
+@pytest.mark.parametrize(
+    "entry",
+    [
+        pytest.param(e, marks=_decode_marker(e), id=_entry_id(e))
+        for e in _decodable_entries()
+    ],
+)
+def test_decoded_matches_reference(entry):
+    df, _, _ = _load_for_analysis(CORPUS / entry["filename"], sample_rows=1000)
+    ref_text = REFERENCES[entry["canonical_content_id"]]
+    ref_rows = list(csv.reader(io.StringIO(ref_text)))
+    if not ref_rows:
+        pytest.skip("empty reference")
+
+    # First row = headers in the reference; compare data rows to df rows.
+    ref_data = ref_rows[1:]
+    assert len(df) >= len(ref_data), (
+        f"{entry['filename']}: parsed {len(df)} rows, reference has {len(ref_data)}"
+    )
+    for r, ref_row in enumerate(ref_data):
+        for c, ref_cell in enumerate(ref_row):
+            actual = str(df.iloc[r, c])
+            assert actual == ref_cell, (
+                f"{entry['filename']}: row {r} col {c}: "
+                f"got {actual!r}, expected {ref_cell!r}"
+            )
diff --git a/tests/test_normalize.py b/tests/test_normalize.py
new file mode 100644
index 0000000..cd0805f
--- /dev/null
+++ b/tests/test_normalize.py
@@ -0,0 +1,349 @@
+"""Tests for the CSV-normalization gate.
+
+Covers:
+* ``Finding.confidence`` and ``Finding.fix_action`` field defaults.
+* ``auto_fix`` applies every high-confidence finding and leaves
+  medium/low ones pending.
+* ``apply_decisions`` honors per-finding skip / modified payloads.
+* ``is_normalized`` re-checks high-confidence detectors after a fix pass.
+* The full corpus auto-fix sweep: every fixture either passes the gate
+  or has its remaining medium/low findings declared in pending.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pandas as pd
+import pytest
+
+from src.core.analyze import (
+    Finding,
+    analyze,
+    _load_for_analysis,
+    FIX_FOLD_SMART_PUNCT,
+    FIX_LOWERCASE_EMAIL,
+    FIX_REPLACE_NULL_SENTINELS,
+    FIX_NONE,
+)
+from src.core.fixes import get_fix, available_actions
+from src.core.normalize import (
+    Decision,
+    NormalizationResult,
+    auto_fix,
+    apply_decisions,
+    is_normalized,
+    gate_summary,
+)
+
+
+CORPUS = Path(__file__).parent.parent / "test-cases" / "text-cleaner-corpus" / "test_data"
+
+
+# ---------------------------------------------------------------------------
+# Field defaults
+# ---------------------------------------------------------------------------
+
+class TestFindingFields:
+    def test_default_confidence_is_high(self):
+        f = Finding(id="x", severity="warn", tool="", count=1, description="d")
+        assert f.confidence == "high"
+
+    def test_default_fix_action_is_empty(self):
+        f = Finding(id="x", severity="warn", tool="", count=1, description="d")
+        assert f.fix_action == ""
+
+    def test_pre_applied_default_false(self):
+        f = Finding(id="x", severity="warn", tool="", count=1, description="d")
+        assert f.pre_applied is False
+
+    def test_smart_punct_finding_carries_fix_action(self):
+        df = pd.DataFrame({"x": ["“hello”"]})
+        findings = analyze(df)
+        smart = next(f for f in findings if f.id == "smart_punctuation_in_data")
+        assert smart.confidence == "high"
+        assert smart.fix_action == FIX_FOLD_SMART_PUNCT
+
+    def test_mojibake_finding_is_low_confidence(self):
+        df = pd.DataFrame({"x": ["cafÃ©"]})
+        findings = analyze(df)
+        moji = next(f for f in findings if f.id == "suspected_mojibake")
+        assert moji.confidence == "low"
+
+
+# ---------------------------------------------------------------------------
+# Fix registry
+# ---------------------------------------------------------------------------
+
+class TestFixRegistry:
+    def test_high_confidence_fixes_registered(self):
+        actions = available_actions()
+        assert FIX_FOLD_SMART_PUNCT in actions
+        assert FIX_LOWERCASE_EMAIL in actions
+        assert FIX_REPLACE_NULL_SENTINELS in actions
+
+    def test_get_fix_returns_callable(self):
+        fn = get_fix(FIX_FOLD_SMART_PUNCT)
+        assert callable(fn)
+
+    def test_get_fix_unknown_returns_none(self):
+        assert get_fix("not_a_real_action") is None
+
+
+# ---------------------------------------------------------------------------
+# auto_fix
+# ---------------------------------------------------------------------------
+
+class TestAutoFix:
+    def test_applies_high_confidence_only(self):
+        df = pd.DataFrame({
+            "name": ["  Alice  ", "Bob "],   # whitespace + NBSP -> high
+            "email": ["A@X.com", "b@x.com"],       # mixed case -> medium
+        })
+        findings = analyze(df)
+        result = auto_fix(df, findings)
+
+        # whitespace_padding and nbsp_or_unicode_whitespace should be applied.
+        applied_ids = {a.finding_id for a in result.applied}
+        assert "whitespace_padding" in applied_ids
+        assert "nbsp_or_unicode_whitespace" in applied_ids
+
+        # mixed_case_email_column is medium -> pending.
+        pending_ids = {f.id for f in result.pending_findings}
+        assert "mixed_case_email_column" in pending_ids
+
+    def test_cells_actually_changed(self):
+        df = pd.DataFrame({"x": ["  hi  ", "ok"]})
+        findings = analyze(df)
+        result = auto_fix(df, findings)
+        assert result.cleaned_df["x"].tolist() == ["hi", "ok"]
+
+    def test_no_findings_no_fixes(self):
+        df = pd.DataFrame({"id": ["1", "2"], "name": ["a", "b"]})
+        findings = analyze(df)
+        result = auto_fix(df, findings)
+        assert result.applied == []
+        assert result.passed is True
+
+    def test_blocks_on_severity_error(self, tmp_path):
+        f = tmp_path / "empty.csv"
+        f.write_bytes(b"")
+        findings = analyze(f)
+        df, _, _ = _load_for_analysis(f, sample_rows=1000)
+        result = auto_fix(df, findings)
+        assert any(b.id == "empty_input" for b in result.blocking_findings)
+        assert result.passed is False
+
+
+# ---------------------------------------------------------------------------
+# apply_decisions
+# ---------------------------------------------------------------------------
+
+class TestApplyDecisions:
+    def test_skip_decision_records_skipped(self):
+        df = pd.DataFrame({"x": ["“smart”"]})
+        findings = analyze(df)
+        decisions = [Decision(finding_id="smart_punctuation_in_data", action="skip")]
+        result = apply_decisions(df, findings, decisions)
+        assert any(s.id == "smart_punctuation_in_data" for s in result.skipped_findings)
+        # And the smart quotes survived.
+        assert "“" in result.cleaned_df["x"].iloc[0]
+
+    def test_auto_decision_runs_fix(self):
+        df = pd.DataFrame({"x": ["“smart”"]})
+        findings = analyze(df)
+        decisions = [Decision(finding_id="smart_punctuation_in_data", action="auto")]
+        result = apply_decisions(df, findings, decisions)
+        assert result.cleaned_df["x"].iloc[0] == '"smart"'
+
+    def test_modified_decision_uses_payload(self):
+        df = pd.DataFrame({"status": ["ACTIVE", "TBD", "TBD", "active"]})
+        findings = analyze(df)
+        # Restrict the null-sentinel set to only "TBD" via payload.
+        decisions = [Decision(
+            finding_id="null_like_sentinels",
+            action="modified",
+            payload={"sentinels": ["TBD"]},
+        )]
+        # null_like_sentinels needs to be present for the decision to apply.
+        if not any(f.id == "null_like_sentinels" for f in findings):
+            pytest.skip("analyzer didn't surface null sentinels for this fixture")
+        result = apply_decisions(df, findings, decisions)
+        assert result.cleaned_df["status"].tolist() == ["ACTIVE", "", "", "active"]
+
+    def test_lowercase_email_uses_finding_column(self):
+        df = pd.DataFrame({
+            "email": ["ALICE@X.com", "bob@x.com"],
+            "name": ["Alice", "Bob"],
+        })
+        findings = analyze(df)
+        decisions = [Decision(finding_id="mixed_case_email_column", action="auto")]
+        if not any(f.id == "mixed_case_email_column" for f in findings):
+            pytest.skip("analyzer didn't surface mixed-case email")
+        result = apply_decisions(df, findings, decisions)
+        assert result.cleaned_df["email"].tolist() == ["alice@x.com", "bob@x.com"]
+        # Other columns untouched.
+        assert result.cleaned_df["name"].tolist() == ["Alice", "Bob"]
+
+    def test_undecided_medium_finding_stays_pending(self):
+        df = pd.DataFrame({"email": ["A@X.com", "b@x.com"]})
+        findings = analyze(df)
+        result = apply_decisions(df, findings, decisions=[])
+        if not any(f.id == "mixed_case_email_column" for f in findings):
+            pytest.skip("analyzer didn't surface mixed-case email")
+        assert any(f.id == "mixed_case_email_column" for f in result.pending_findings)
+
+
+# ---------------------------------------------------------------------------
+# is_normalized
+# ---------------------------------------------------------------------------
+
+class TestIsNormalized:
+    def test_clean_dataframe_passes(self):
+        df = pd.DataFrame({"id": ["1"], "name": ["Alice"]})
+        findings = analyze(df)
+        result = auto_fix(df, findings)
+        assert is_normalized(findings, result) is True
+
+    def test_unnormalized_after_skip_high_confidence(self):
+        df = pd.DataFrame({"x": ["  padded  "]})
+        findings = analyze(df)
+        # Skip the only high-confidence fix.
+        decisions = [Decision(finding_id="whitespace_padding", action="skip")]
+        result = apply_decisions(df, findings, decisions)
+        # Re-analysis still finds the issue, so gate is not normalized.
+        assert is_normalized(findings, result) is False
+
+    def test_pending_medium_blocks_gate(self):
+        df = pd.DataFrame({"email": ["A@X.com", "b@x.com"]})
+        findings = analyze(df)
+        result = auto_fix(df, findings)
+        # auto_fix leaves medium pending -> gate not passed.
+        if any(f.id == "mixed_case_email_column" for f in findings):
+            assert is_normalized(findings, result) is False
+
+    def test_none_result_not_normalized(self):
+        assert is_normalized([], None) is False
+
+
+# ---------------------------------------------------------------------------
+# Corpus sweep — every fixture either passes or has declared pending
+# ---------------------------------------------------------------------------
+
+CORPUS_FILES = sorted(CORPUS.glob("*.csv")) if CORPUS.exists() else []
+
+# Fixtures that will have pending medium/low findings after auto_fix.
+EXPECTED_PENDING_AFTER_AUTOFIX = {
+    "11_embedded_newlines": {"mixed_case_email_column"},
+    "12_case_variations": {"mixed_case_email_column"},
+    "14_mojibake": {"suspected_mojibake"},
+    "17_preserve_intended": {"null_like_sentinels"},
+    "20_kitchen_sink": {"mixed_case_email_column"},
+}
+
+# Fixtures that block the gate via severity=error findings.
+EXPECTED_BLOCKING = {
+    "18_empty_file": {"empty_input"},
+}
+
+
+@pytest.mark.parametrize("path", CORPUS_FILES, ids=lambda p: p.stem)
+def test_corpus_auto_fix_state(path):
+    """Every corpus fixture either passes auto_fix or has its remaining
+    pending/blocking findings declared in the expected sets above."""
+    findings = analyze(path, sample_rows=1000)
+    df, _, _ = _load_for_analysis(path, sample_rows=1000)
+    result = auto_fix(df, findings)
+
+    pending_ids = {f.id for f in result.pending_findings}
+    blocking_ids = {f.id for f in result.blocking_findings}
+
+    expected_pending = EXPECTED_PENDING_AFTER_AUTOFIX.get(path.stem, set())
+    expected_blocking = EXPECTED_BLOCKING.get(path.stem, set())
+
+    assert pending_ids == expected_pending, (
+        f"{path.name}: pending {pending_ids} != expected {expected_pending}"
+    )
+    assert blocking_ids == expected_blocking, (
+        f"{path.name}: blocking {blocking_ids} != expected {expected_blocking}"
+    )
+
+
+def test_corpus_auto_fix_idempotent():
+    """Running auto_fix twice on the same input yields the same bytes."""
+    if not CORPUS_FILES:
+        pytest.skip("corpus not present")
+    path = CORPUS / "20_kitchen_sink.csv"
+    findings = analyze(path, sample_rows=1000)
+    df, _, _ = _load_for_analysis(path, sample_rows=1000)
+    r1 = auto_fix(df, findings)
+    # Re-analyze the cleaned frame and run again.
+    f2 = analyze(r1.cleaned_df)
+    r2 = auto_fix(r1.cleaned_df, f2)
+    assert r1.cleaned_bytes == r2.cleaned_bytes
+
+
+# ---------------------------------------------------------------------------
+# gate_summary
+# ---------------------------------------------------------------------------
+
+class TestOutputOptions:
+    """The Review page's _build_output_bytes helper for the download flow.
+
+    Imported via importlib because the page itself runs Streamlit code at
+    module load; we copy the function shape here as a compact spec so a
+    future refactor that moves the helper into core/io.py can keep the
+    same contract.
+    """
+
+    @staticmethod
+    def _build(df, *, encoding, delimiter, line_terminator):
+        import io as _io
+        buf = _io.StringIO()
+        df.to_csv(buf, index=False, sep=delimiter, lineterminator=line_terminator)
+        text = buf.getvalue()
+        try:
+            return text.encode(encoding), None
+        except UnicodeEncodeError:
+            return text.encode(encoding, errors="replace"), "lossy"
+
+    def test_utf8_with_bom_starts_with_bom(self):
+        df = pd.DataFrame({"x": ["a"]})
+        data, _ = self._build(df, encoding="utf-8-sig", delimiter=",", line_terminator="\n")
+        assert data.startswith(b"\xef\xbb\xbf")
+
+    def test_crlf_line_terminator(self):
+        df = pd.DataFrame({"x": ["a", "b"]})
+        data, _ = self._build(df, encoding="utf-8", delimiter=",", line_terminator="\r\n")
+        assert b"\r\n" in data
+        assert b"\nb" not in data.replace(b"\r\n", b"")
+
+    def test_tab_delimiter(self):
+        df = pd.DataFrame({"a": ["x"], "b": ["y"]})
+        data, _ = self._build(df, encoding="utf-8", delimiter="\t", line_terminator="\n")
+        assert data.startswith(b"a\tb\n")
+
+    def test_cp1252_single_byte_accents(self):
+        df = pd.DataFrame({"name": ["José"]})
+        data, _ = self._build(df, encoding="cp1252", delimiter=",", line_terminator="\n")
+        # 'é' is single byte 0xE9 in cp1252 (vs 0xC3 0xA9 in UTF-8)
+        assert b"\xe9" in data
+        assert b"\xc3\xa9" not in data
+
+    def test_lossy_codepage_returns_warning(self):
+        df = pd.DataFrame({"name": ["Иван"]})  # Cyrillic
+        data, warn = self._build(df, encoding="cp1252", delimiter=",", line_terminator="\n")
+        assert warn is not None
+        assert b"?" in data  # replacement chars
+
+
+class TestGateSummary:
+    def test_summary_keys(self):
+        df = pd.DataFrame({"x": ["  hi  "]})
+        findings = analyze(df)
+        result = auto_fix(df, findings)
+        s = gate_summary(result)
+        assert set(s.keys()) == {
+            "passed", "fixes_applied", "cells_changed",
+            "skipped", "pending", "blocking",
+        }