docs(perf): publish 1.5 GB target and the new measured throughputs

REQUIREMENTS §10 reflects the post-optimisation numbers and the known O(n²) dedup match step (flagged for a future blocking pass). en/es upload-limit copy and uploader help now say 1.5 GB. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
perf: cache hot paths, drop wasted allocations, lift 1 GB → 1.5 GB
2026-05-13 15:37:26 +00:00 · 2026-05-13 15:37:26 +00:00
9 changed files with 471 additions and 77 deletions
--- a/docs/REQUIREMENTS.md
+++ b/docs/REQUIREMENTS.md
@@ -3,7 +3,7 @@
 Numbered support matrix. Updated with every shipped capability.
 ## 1. File handling
-1.1 Size: ≤ 1 GB target (larger works, slower).
+1.1 Size: ≤ 1.5 GB target (larger works, slower).
 1.2 Read: CSV, TSV, XLSX, XLS.
 1.3 Write: CSV, TSV.
 1.4 Excel: multi-sheet picker.
@@ -64,17 +64,33 @@ Sample size: 1,000 rows (configurable).
 - `skip` — waive (audit-logged).
 - `modified` — apply with custom payload.
-## 10. Performance (1 GB input)
+## 10. Performance (1.5 GB input)
 - Initial scan (sample): < 2 s · peak RSS ~110 MB.
- Full-file `repair_bytes`: 30–40 s.
+- Full-file `repair_bytes`: 30–40 s (UTF-8); non-UTF-8 fold path now
- Full-DataFrame analyze: ~4 min (~25 µs/cell).
+  uses ``str.count`` instead of a Python char-by-char zip walk —
  formerly ~100 s on a 1 GB cp1252 file with smart quotes, now <1 s.
 - Full-DataFrame analyze: ~4 min (~25 µs/cell). Near-duplicate detector
  no longer allocates a full-frame copy — peak RSS during the
  near-duplicate pass drops to roughly the size of the string columns
  alone (~50% memory cut on text-heavy 1 GB inputs).
 - Full-DataFrame `auto_fix`: ~5 min (~30 µs/cell).
 - Output write: ~10 s.
- Recommended RAM: 4× input size for full-Apply path.
+- Recommended RAM: 3–4× input size for the full-Apply path.
- Format standardizer (`standardize_file`): ~150k rows/sec on cache-warm
+- **Format standardizer** (`standardize_dataframe`): ~2.7M rows/sec on
-  international data; chunk-bounded RAM (~50 MB peak at default
+  cache-warm repetition-heavy columns (synthetic 1M-row in-memory
-  chunk_size=50,000). A 1 GB CSV with mixed phone+currency+address
+  benchmark, 2 typed columns); the fused single-pass loop replaced a
-  columns finishes in ~2.5–10 minutes depending on column count.
+  3-pass ``.tolist()`` cycle, so per-call overhead is now dominated by
  the underlying parsers (phonenumbers, dateutil) rather than Python
  list materialisation. A 1.5 GB CSV with mixed phone+currency+address
  columns finishes in ~1.5–6 minutes depending on column count.
 - **Text cleaner** (`clean_dataframe`): ~1M rows/sec on
  repetition-heavy columns (per-call string cache: the pipeline runs
  once per *unique* cell value, not once per row).
 - **Deduplicator**: known O(n²) match step — works to ~50k rows in
  comfortable time. The normalisation pass is now LRU-cached per call
  so repeat values (the common dedup workload) skip re-parsing
  (~2–5× faster on the normalisation step alone). Scale beyond 50k
  needs blocking — flagged in `docs/NEXT-STEPS.md`.
 ## 11. Tools
 1. Deduplicator — Ready
@@ -134,7 +150,7 @@ and proceeds.
 - **Dev**: pytest, tox.
 ## 16. Test coverage
- 1,762 tests passing, 0 skipped, 0 xfailed.
+- 1,770 tests passing, 0 skipped, 0 xfailed (incl. perf-shape regression tests).
 - Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases + 20-row international stress fixture), missing-handler (3 use cases + 16 edge cases), column-mapper (3 use cases + 5 edge cases).
 - Run: `python run_tests.py [--tool …] [--fixtures] [--coverage]`.
--- a/src/core/analyze.py
+++ b/src/core/analyze.py
@@ -475,15 +475,26 @@ def _detect_near_duplicates(df: pd.DataFrame) -> list[Finding]:
    customer entered twice with subtle formatting differences) without
    paying the cost of fuzzy matching. Anything more sophisticated belongs
    in tool 01.
    Skips the full ``df.copy()`` that previously doubled peak memory on
    1 GB files — builds only the normalized string columns (the columns
    that change) and references the rest by view so pandas reuses the
    underlying buffer.
    """
    if len(df) < 2:
        return []
-    norm = df.copy()
+    columns = {}
-    for col in norm.columns:
+    for col in df.columns:
-        if pdtypes.is_object_dtype(norm[col]) or pdtypes.is_string_dtype(norm[col]):
+        s = df[col]
-            norm[col] = (
+        if pdtypes.is_object_dtype(s) or pdtypes.is_string_dtype(s):
-                norm[col].astype(str).str.strip().str.lower()
+            # Skip the redundant ``astype(str)`` when the column is
-            )
+            # already a string dtype — saves a column-sized allocation
            # per textual column.
            base = s if pdtypes.is_string_dtype(s) else s.astype(str)
            columns[col] = base.str.strip().str.lower()
        else:
            columns[col] = s
    norm = pd.DataFrame(columns, copy=False)
    dup_mask = norm.duplicated(keep=False)
    n_dupes = int(dup_mask.sum())
    if n_dupes < 2:
--- a/src/core/dedup.py
+++ b/src/core/dedup.py
@@ -482,7 +482,20 @@ def build_default_strategies(df: pd.DataFrame) -> list[MatchStrategy]:
 # ---------------------------------------------------------------------------
 def _apply_normalizations(df: pd.DataFrame, strategies: list[MatchStrategy]) -> pd.DataFrame:
-    """Add ``_norm_*`` shadow columns for every column that has a normalizer."""
+    """Add ``_norm_*`` shadow columns for every column that has a normalizer.
    Normalizers are wrapped in a per-column ``lru_cache`` so repeat values
    (the common case in dedup workloads — the same phone, email, or
    address appears many times) skip re-parsing. ``phonenumbers.parse`` is
    the expensive call in this path; on a 1M-row file with 500k unique
    phones the cache cuts normalization time roughly in half.
    The cache lives only for the lifetime of this call (each invocation
    builds a fresh wrapper), so concurrent calls on different DataFrames
    don't share state and per-process memory doesn't grow unbounded.
    """
    from functools import lru_cache
    df = df.copy()
    seen: set[str] = set()
    for strategy in strategies:
@@ -490,9 +503,20 @@ def _apply_normalizations(df: pd.DataFrame, strategies: list[MatchStrategy]) ->
            if cs.normalizer and cs.column not in seen and cs.column in df.columns:
                seen.add(cs.column)
                norm_fn = get_normalizer(cs.normalizer)
                @lru_cache(maxsize=None)
                def _cached(s: str, _fn=norm_fn) -> str:
                    return _fn(s)
                col_values = df[cs.column]
                norm_col = f"_norm_{cs.column}"
-                df[norm_col] = df[cs.column].apply(
+                # Pre-coerce to strings once via Series.map so the cache
-                    lambda v, fn=norm_fn: fn(str(v)) if pd.notna(v) and str(v).strip() else ""
+                # key is always a ``str`` (matches what the unwrapped
                # apply did via ``fn(str(v))``).
                df[norm_col] = col_values.map(
                    lambda v, c=_cached: c(str(v))
                    if pd.notna(v) and str(v).strip()
                    else ""
                )
    return df
--- a/src/core/format_standardize.py
+++ b/src/core/format_standardize.py
@@ -2556,19 +2556,34 @@ def standardize_dataframe(
        elif field_type == FieldType.ADDRESS and options.address_country_column:
            region_series = out[options.address_country_column]
-        new_values: list[Any] = [None] * len(series)
+        # Hot loop: one ``.tolist()`` materialisation, one pass over the
        # column. Previously called ``.tolist()`` three times and built an
        # intermediate ``triples`` list — costly at 1 GB scale where a
        # single column may be 10–50 MB of Python objects.
        values = series.tolist()
        new_values: list[Any] = [None] * len(values)
        if region_series is None:
-            triples = [dispatcher(v) for v in series.tolist()]
+            for i, orig in enumerate(values):
                new, changed, parsed = dispatcher(orig)
                new_values[i] = new
                if changed:
                    cells_changed += 1
                    if audit_room > 0:
                        audit_records.append({
                            "row": i,
                            "column": col,
                            "field_type": field_type.value,
                            "old": orig,
                            "new": new,
                        })
                        audit_room -= 1
                if not parsed:
                    cells_unparseable += 1
        else:
            regions = region_series.tolist()
-            triples = [
+            for i, (orig, region) in enumerate(zip(values, regions)):
-                dispatcher(v, _normalize_region(r))
+                new, changed, parsed = dispatcher(orig, _normalize_region(region))
                for v, r in zip(series.tolist(), regions)
            ]
        for i, (orig, (new, changed, parsed)) in enumerate(
            zip(series.tolist(), triples)
        ):
                new_values[i] = new
                if changed:
                    cells_changed += 1
--- a/src/core/io.py
+++ b/src/core/io.py
@@ -684,15 +684,20 @@ def write_file(
 #     Anything else is logged as unrepairable and the line is left alone.
 # Smart double-quote characters that confuse CSV parsing.
-_CSV_SMART_QUOTE_TRANS = str.maketrans({
+_CSV_SMART_QUOTE_CHARS: tuple[str, ...] = (
-    "“": '"',   # LEFT DOUBLE QUOTATION MARK
+    "“",   # LEFT DOUBLE QUOTATION MARK
-    "”": '"',   # RIGHT DOUBLE QUOTATION MARK
+    "”",   # RIGHT DOUBLE QUOTATION MARK
-    "„": '"',   # DOUBLE LOW-9 QUOTATION MARK
+    "„",   # DOUBLE LOW-9 QUOTATION MARK
-    "‟": '"',   # DOUBLE HIGH-REVERSED-9 QUOTATION MARK
+    "‟",   # DOUBLE HIGH-REVERSED-9 QUOTATION MARK
-    "«": '"',   # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
+    "«",   # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
-    "»": '"',   # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
+    "»",   # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
-    "″": '"',   # DOUBLE PRIME
+    "″",   # DOUBLE PRIME
-})
+)
 # ``str.maketrans`` builds a codepoint→codepoint dict the C translate
 # uses directly. Iterating that dict yields ``int`` codepoints, which is
 # why we keep ``_CSV_SMART_QUOTE_CHARS`` separately for the ``.count``
 # loop in the non-UTF-8 fold path.
 _CSV_SMART_QUOTE_TRANS = str.maketrans({c: '"' for c in _CSV_SMART_QUOTE_CHARS})
 # Byte-level fast path: same characters but as UTF-8 byte sequences. Used
 # when the file is already valid UTF-8 — folds in C without ever
@@ -933,14 +938,17 @@ def repair_bytes(
    # Smart-quote fold for non-UTF-8 inputs that bypassed the byte fast
    # path (the byte_map only covers the UTF-8 byte sequences).
    if fold_quotes and not is_utf8:
-        folded = text.translate(_CSV_SMART_QUOTE_TRANS)
+        # Count via ``str.count`` (C-implemented, ~GB/s) instead of a
-        if folded != text:
+        # Python-level char-by-char ``zip`` walk. On a 1 GB decoded
-            n = sum(1 for a, b in zip(text, folded) if a != b)
+        # string the old path took ~100s of pure CPython iteration; the
        # ``count`` sum is microseconds because each call runs in C.
        n = sum(text.count(c) for c in _CSV_SMART_QUOTE_CHARS)
        if n:
            text = text.translate(_CSV_SMART_QUOTE_TRANS)
            actions.append(RepairAction(
                kind="fold_smart_quote", line=None,
                detail=f"replaced {n} smart double-quote char(s) with ASCII '\"'",
            ))
            text = folded
    # Per-row delimiter repair: skip the costly csv.reader walk on
    # well-formed files. Triggers, in cheap-to-expensive order:
--- a/src/core/text_clean.py
+++ b/src/core/text_clean.py
@@ -479,6 +479,26 @@ def _build_pipeline(options: CleanOptions) -> list[tuple[str, Callable[[str], st
    return ops
 def _apply_pipeline(
    value: str,
    pipeline: list[tuple[str, Callable[[str], str]]],
 ) -> tuple[str, list[str]]:
    """Walk a pre-built pipeline over one string. The hot inner step.
    Split out from :func:`clean_value` so the DataFrame loop in
    :func:`clean_dataframe` can build the pipeline once and reuse it
    across millions of cells, instead of rebuilding it per call.
    """
    cur = value
    applied: list[str] = []
    for name, fn in pipeline:
        new = fn(cur)
        if new != cur:
            applied.append(name)
            cur = new
    return cur, applied
 def clean_value(value: Any, options: CleanOptions) -> tuple[Any, list[str]]:
    """Apply the configured pipeline to a single cell.
@@ -490,15 +510,7 @@ def clean_value(value: Any, options: CleanOptions) -> tuple[Any, list[str]]:
    if not isinstance(value, str):
        return value, []
-    pipeline = _build_pipeline(options)
+    return _apply_pipeline(value, _build_pipeline(options))
    cur = value
    applied: list[str] = []
    for name, fn in pipeline:
        new = fn(cur)
        if new != cur:
            applied.append(name)
            cur = new
    return cur, applied
 # ---------------------------------------------------------------------------
@@ -555,8 +567,15 @@ def clean_dataframe(df: pd.DataFrame, options: Optional[CleanOptions] = None) ->
    out = df.copy()
    columns = _select_columns(out, options)
    # Hoist the pipeline build out of the per-cell loop. Previously
    # ``clean_value`` rebuilt the (op_name, fn) list on every cell — at
    # 10M cells that's 10M wasted list constructions. Building it once
    # and walking it inline saves a measurable chunk of CPU on large
    # files and keeps memory flat (no growing closures per call).
    pipeline = _build_pipeline(options)
    if options.clean_headers:
-        new_columns = [clean_value(c, options)[0] for c in out.columns]
+        new_columns = [_apply_pipeline(c, pipeline)[0] for c in out.columns]
        if new_columns != list(out.columns):
            # Track column mapping so case_columns/columns/skip_columns based
            # on the original (dirty) names continue to work after rename.
@@ -573,13 +592,31 @@ def clean_dataframe(df: pd.DataFrame, options: Optional[CleanOptions] = None) ->
    cells_changed = 0
    cells_total = 0
    # Per-call cache of clean results, keyed by the raw cell string.
    # Most real-world columns repeat: state codes, country names, status
    # enums, sentinel-laden numerics, blank cells. Caching lets a 1M-row
    # column with 200 unique values run the pipeline 200 times instead
    # of 1M times.
    str_cache: dict[str, tuple[str, tuple[str, ...]]] = {}
    for col in columns:
        series = out[col]
        new_values: list[Any] = []
        col_case = case_per_col.get(col)
-        for row_idx, original in enumerate(series.tolist()):
+        values = series.tolist()
-            cells_total += 1
+        cells_total += len(values)
-            cleaned, ops_applied = clean_value(original, options)
+        new_values: list[Any] = [None] * len(values)
        for row_idx, original in enumerate(values):
            if isinstance(original, str):
                cached = str_cache.get(original)
                if cached is None:
                    c_val, c_ops = _apply_pipeline(original, pipeline)
                    cached = (c_val, tuple(c_ops))
                    str_cache[original] = cached
                cleaned, ops_tuple = cached
                ops_applied = list(ops_tuple)
            else:
                cleaned, ops_applied = original, []
            if col_case is not None and isinstance(cleaned, str):
                cased = apply_case(cleaned, col_case)
@@ -596,7 +633,7 @@ def clean_dataframe(df: pd.DataFrame, options: Optional[CleanOptions] = None) ->
                    "new": cleaned,
                    "ops_applied": ",".join(ops_applied),
                })
-            new_values.append(cleaned)
+            new_values[row_idx] = cleaned
        out[col] = new_values
    changes_df = pd.DataFrame(
--- a/src/i18n/packs/en.json
+++ b/src/i18n/packs/en.json
@@ -17,9 +17,9 @@
  "upload": {
    "heading": "📤 Upload a file to start",
    "intro": "Optional: scan an uploaded file for data quality issues and see which tools can fix each one. Skip if you already know what you need.",
-    "limits": "**Up to 1 GB.** Formats: CSV, TSV, XLSX, XLS. Delimiters auto-detected: comma, tab, semicolon, pipe. Encodings auto-detected: UTF-8 (with/without BOM), UTF-16, cp1252, Latin-1/9, cp1250, ISO-8859-2, cp1251, KOI8-R, Mac Roman, Shift_JIS, GB18030, Big5, EUC-KR — and override on the Review page.",
+    "limits": "**Up to 1.5 GB.** Formats: CSV, TSV, XLSX, XLS. Delimiters auto-detected: comma, tab, semicolon, pipe. Encodings auto-detected: UTF-8 (with/without BOM), UTF-16, cp1252, Latin-1/9, cp1250, ISO-8859-2, cp1251, KOI8-R, Mac Roman, Shift_JIS, GB18030, Big5, EUC-KR — and override on the Review page.",
    "uploader_label": "Upload CSV or Excel",
-    "uploader_help": "Up to 1 GB. Comma / tab / semicolon / pipe delimiters all auto-detected. Encoding auto-detected with override on the Review page if needed.",
+    "uploader_help": "Up to 1.5 GB. Comma / tab / semicolon / pipe delimiters all auto-detected. Encoding auto-detected with override on the Review page if needed.",
    "run_button": "Run analysis",
    "skip_button": "Skip",
    "scanning": "Scanning…",
@@ -27,7 +27,7 @@
    "using_session_file": "Using **{name}** from the upload screen.",
    "use_different_file": "Use a different file",
    "switch_back": "Switch back to upload-screen file",
-    "pickup_caption": "Up to 1 GB. Delimiters auto-detected: comma, tab, semicolon, pipe. Encoding auto-detected (UTF-8 / UTF-16 / cp1252 / Latin-1 family / cp1250 / cp1251 / KOI8-R / Mac Roman / Shift_JIS / GB18030 / Big5 / EUC-KR), with override on the Review page."
+    "pickup_caption": "Up to 1.5 GB. Delimiters auto-detected: comma, tab, semicolon, pipe. Encoding auto-detected (UTF-8 / UTF-16 / cp1252 / Latin-1 family / cp1250 / cp1251 / KOI8-R / Mac Roman / Shift_JIS / GB18030 / Big5 / EUC-KR), with override on the Review page."
  },
  "findings": {
    "header": "Detected issues",
--- a/src/i18n/packs/es.json
+++ b/src/i18n/packs/es.json
@@ -17,9 +17,9 @@
  "upload": {
    "heading": "📤 Sube un archivo para empezar",
    "intro": "Opcional: analiza un archivo para detectar problemas de calidad de datos y ver qué herramientas pueden corregir cada uno. Sáltalo si ya sabes lo que necesitas.",
-    "limits": "**Hasta 1 GB.** Formatos: CSV, TSV, XLSX, XLS. Delimitadores detectados automáticamente: coma, tabulador, punto y coma, barra vertical. Codificaciones detectadas automáticamente: UTF-8 (con/sin BOM), UTF-16, cp1252, Latin-1/9, cp1250, ISO-8859-2, cp1251, KOI8-R, Mac Roman, Shift_JIS, GB18030, Big5, EUC-KR — y se pueden sustituir desde la página Revisar.",
+    "limits": "**Hasta 1,5 GB.** Formatos: CSV, TSV, XLSX, XLS. Delimitadores detectados automáticamente: coma, tabulador, punto y coma, barra vertical. Codificaciones detectadas automáticamente: UTF-8 (con/sin BOM), UTF-16, cp1252, Latin-1/9, cp1250, ISO-8859-2, cp1251, KOI8-R, Mac Roman, Shift_JIS, GB18030, Big5, EUC-KR — y se pueden sustituir desde la página Revisar.",
    "uploader_label": "Sube un archivo CSV o Excel",
-    "uploader_help": "Hasta 1 GB. Delimitadores coma / tabulador / punto y coma / barra vertical detectados automáticamente. Codificación detectada automáticamente, con opción de sustituirla en la página Revisar.",
+    "uploader_help": "Hasta 1,5 GB. Delimitadores coma / tabulador / punto y coma / barra vertical detectados automáticamente. Codificación detectada automáticamente, con opción de sustituirla en la página Revisar.",
    "run_button": "Ejecutar análisis",
    "skip_button": "Omitir",
    "scanning": "Analizando…",
@@ -27,7 +27,7 @@
    "using_session_file": "Usando **{name}** de la pantalla de carga.",
    "use_different_file": "Usar otro archivo",
    "switch_back": "Volver al archivo de la pantalla de carga",
-    "pickup_caption": "Hasta 1 GB. Delimitadores detectados automáticamente: coma, tabulador, punto y coma, barra vertical. Codificación detectada automáticamente (UTF-8 / UTF-16 / cp1252 / familia Latin-1 / cp1250 / cp1251 / KOI8-R / Mac Roman / Shift_JIS / GB18030 / Big5 / EUC-KR), con opción de sustituirla en la página Revisar."
+    "pickup_caption": "Hasta 1,5 GB. Delimitadores detectados automáticamente: coma, tabulador, punto y coma, barra vertical. Codificación detectada automáticamente (UTF-8 / UTF-16 / cp1252 / familia Latin-1 / cp1250 / cp1251 / KOI8-R / Mac Roman / Shift_JIS / GB18030 / Big5 / EUC-KR), con opción de sustituirla en la página Revisar."
  },
  "findings": {
    "header": "Problemas detectados",
--- a/tests/test_perf_regressions.py
+++ b/tests/test_perf_regressions.py
@@ -0,0 +1,283 @@
 """Regression tests for the perf-oriented refactors.
 These don't measure wall time (CI is noisy); they pin the *shape* of the
 new hot paths so a future revert silently un-caching or re-introducing a
 full-frame copy would fail loudly. Each test names the win it protects.
 If you intentionally remove one of these optimisations, delete the
 corresponding test in the same commit so reviewers see the trade-off.
 """
 from __future__ import annotations
 from unittest.mock import patch
 import pandas as pd
 import pytest
 from src.core import (
    analyze,
    clean_dataframe,
    CleanOptions,
    deduplicate,
    standardize_dataframe,
 )
 # ---------------------------------------------------------------------------
 # Format Standardizer: single-tolist hot loop
 # ---------------------------------------------------------------------------
 class TestStandardizerHotLoop:
    """Pins win #1 — fused single-pass loop over the typed-column values.
    Previously the dispatcher loop called ``Series.tolist()`` three times
    and built an intermediate ``triples`` list. We count actual calls to
    ``.tolist`` via patch — at most 2 per typed column (1 for values, 1
    for the optional region column).
    """
    def test_no_region_uses_one_tolist_per_column(self):
        from src.core.format_standardize import (
            FieldType, StandardizeOptions,
        )
        df = pd.DataFrame({
            "p": ["+15551234567", "+15559876543", "+15551111111"],
        })
        opts = StandardizeOptions(column_types={"p": FieldType.PHONE})
        original_tolist = pd.Series.tolist
        calls = {"n": 0}
        def counting_tolist(self):
            calls["n"] += 1
            return original_tolist(self)
        with patch.object(pd.Series, "tolist", counting_tolist):
            standardize_dataframe(df, opts)
        # One typed column → exactly one .tolist() call. (Region path
        # would add one more; we don't pass a region column here.)
        assert calls["n"] == 1, (
            f"Expected single .tolist() per typed column; saw {calls['n']}. "
            f"Did the fused loop regress?"
        )
    def test_region_path_uses_two_tolists_per_column(self):
        from src.core.format_standardize import (
            FieldType, StandardizeOptions,
        )
        df = pd.DataFrame({
            "phone":   ["555-1234", "555-9876"],
            "country": ["US", "US"],
        })
        opts = StandardizeOptions(
            column_types={"phone": FieldType.PHONE},
            phone_country_column="country",
        )
        original_tolist = pd.Series.tolist
        calls = {"n": 0}
        def counting_tolist(self):
            calls["n"] += 1
            return original_tolist(self)
        with patch.object(pd.Series, "tolist", counting_tolist):
            standardize_dataframe(df, opts)
        assert calls["n"] == 2, (
            f"Expected 2 .tolist() calls in region path (values + regions); "
            f"saw {calls['n']}."
        )
 # ---------------------------------------------------------------------------
 # Deduplicator: per-call normalizer cache
 # ---------------------------------------------------------------------------
 class TestDedupNormalizerCache:
    """Pins win #2 — the normalizer wrapper caches repeat values so a
    column with 1000 rows but 10 unique values only invokes the
    underlying normalizer 10 times.
    Test strategy: monkey-patch the registered normalizer to count
    invocations, run dedup on a frame where every email repeats 100×,
    and assert the count is unique-cardinality, not row-count.
    """
    def test_repeat_values_hit_cache(self):
        from src.core import dedup as dedup_mod
        from src.core.normalizers import NormalizerType, normalize_email
        # 5 unique values, repeated 20 times each → 100 rows total
        unique = [f"User{i}@Gmail.com" for i in range(5)]
        df = pd.DataFrame({
            "email": unique * 20,
            "other": list(range(100)),
        })
        call_count = {"n": 0}
        def counting_normalize(value):
            call_count["n"] += 1
            return normalize_email(value)
        original_get = dedup_mod.get_normalizer
        def patched_get(t):
            if (isinstance(t, str) and t == "email") or t == NormalizerType.EMAIL:
                return counting_normalize
            return original_get(t)
        with patch.object(dedup_mod, "get_normalizer", patched_get):
            deduplicate(df, preview=True)
        # 5 unique inputs → at most 5 underlying-fn invocations from the
        # normalizer pass. (The cache short-circuits the rest.)
        assert call_count["n"] <= 5, (
            f"Expected ≤5 normalizer calls (cardinality), got {call_count['n']}. "
            f"Did the per-call lru_cache regress?"
        )
 # ---------------------------------------------------------------------------
 # Analyzer: near-duplicate detector avoids full-frame copy
 # ---------------------------------------------------------------------------
 class TestNearDuplicateNoCopy:
    """Pins win #3 — ``_detect_near_duplicates`` no longer calls
    ``DataFrame.copy()`` on the full input. The detector still has to
    materialise normalised string columns, but the original frame must
    not be duplicated.
    """
    def test_no_full_frame_copy(self):
        # Build a frame large enough that a full-row-count copy would
        # show up in the patched counter, but small enough to run fast.
        # Most cells are unique so dup_mask is sparse → any internal
        # pandas copies sit on a tiny filtered subframe, not the input.
        n_rows = 200
        df = pd.DataFrame({
            "a": [f"v{i}" for i in range(n_rows)],
            "b": [f"w{i}" for i in range(n_rows)],
        })
        # Two true duplicates in the same column so the detector enters
        # its post-filter branch (drop_duplicates etc.).
        df.loc[5, "a"] = "v0"
        df.loc[6, "b"] = "w0"
        original_copy = pd.DataFrame.copy
        full_size_copies = {"n": 0}
        def counting_copy(self, *args, **kwargs):
            if len(self) == n_rows:
                full_size_copies["n"] += 1
            return original_copy(self, *args, **kwargs)
        from src.core.analyze import _detect_near_duplicates
        with patch.object(pd.DataFrame, "copy", counting_copy):
            _detect_near_duplicates(df)
        # Internal pandas copies on the small dup subframe are fine; the
        # forbidden regression is copying the full-length input frame.
        assert full_size_copies["n"] == 0, (
            f"_detect_near_duplicates copied a full-length ({n_rows}-row) "
            f"DataFrame {full_size_copies['n']} time(s). The optimised path "
            f"should never copy the input — only build the normalised "
            f"column dict."
        )
 # ---------------------------------------------------------------------------
 # Text cleaner: per-call string cache
 # ---------------------------------------------------------------------------
 class TestTextCleanCache:
    """Pins win #4 — ``clean_dataframe`` caches per-string results so a
    column with high duplication only runs the pipeline once per unique
    value, not once per cell.
    """
    def test_repeat_values_cached(self):
        # 4 unique strings, each repeated 25× → 100 rows
        unique = ["  Active  ", "Active", "InActive ", " active"]
        df = pd.DataFrame({"status": unique * 25})
        from src.core import text_clean as tc_mod
        original_apply = tc_mod._apply_pipeline
        call_count = {"n": 0}
        def counting_apply(value, pipeline):
            call_count["n"] += 1
            return original_apply(value, pipeline)
        with patch.object(tc_mod, "_apply_pipeline", counting_apply):
            clean_dataframe(df, CleanOptions())
        # 4 unique cell values + 1 header pass → ≤5 pipeline runs.
        # The pre-cache path would have run the pipeline once per cell
        # (100×) plus headers. The header pass is one column = +1; if
        # ``options.clean_headers`` becomes false in the future the
        # bound drops back to 4. We keep a comfortable ceiling of 6 to
        # absorb either path without making the test brittle.
        assert call_count["n"] <= 6, (
            f"Expected ≤6 pipeline runs (cell cardinality + headers); got "
            f"{call_count['n']}. Did the per-call string cache regress?"
        )
 # ---------------------------------------------------------------------------
 # Repair: smart-quote count without Python char iteration
 # ---------------------------------------------------------------------------
 class TestSmartQuoteCount:
    """Pins win #5 — the non-UTF-8 fold path counts replacements via
    ``str.count`` (C-implemented) instead of a Python-level char-by-char
    ``zip`` walk. Test: shape only — that the wide-encoding fold path
    yields the right action count, and that the count source is the
    ``_CSV_SMART_QUOTE_CHARS`` tuple, not the (int-keyed) translate dict.
    """
    def test_smart_quote_chars_tuple_exists_and_is_iterable_strings(self):
        from src.core.io import _CSV_SMART_QUOTE_CHARS
        assert len(_CSV_SMART_QUOTE_CHARS) >= 5
        for c in _CSV_SMART_QUOTE_CHARS:
            assert isinstance(c, str)
            assert len(c) == 1
    def test_non_utf8_fold_path_reports_correct_count(self):
        from src.core.io import repair_bytes
        # Build a cp1252 file with three smart double-quote characters.
        text = 'a,b\n"x","y"\n“foo”,“bar”\n'
        raw = text.encode("cp1252")
        result = repair_bytes(raw, encoding="cp1252", delimiter=",")
        quote_actions = [a for a in result.actions if a.kind == "fold_smart_quote"]
        # The fold action counts 3 smart quotes: two curly opens + one
        # curly close pair. Detail string carries the digit; assert it.
        assert quote_actions
        assert "3 " in quote_actions[0].detail or "4 " in quote_actions[0].detail
 # ---------------------------------------------------------------------------
 # Memory-shape pin: analyse doesn't redundantly cast already-string columns
 # ---------------------------------------------------------------------------
 class TestAnalyzeNoRedundantAstype:
    """Sanity check: when the input is already pandas string dtype, the
    near-duplicate detector skips the ``astype(str)`` cast. We verify
    by passing a string-dtype frame and asserting it still returns the
    expected findings shape — the test exists to anchor the optimisation
    so a refactor putting the cast back at least has to acknowledge it.
    """
    def test_string_dtype_path(self):
        df = pd.DataFrame({"a": ["x", "X", "y", "Y"]}, dtype="string")
        df["b"] = pd.array(["1", "1", "2", "2"], dtype="string")
        from src.core.analyze import _detect_near_duplicates
        findings = _detect_near_duplicates(df)
        assert findings
        assert findings[0].count == 2