docs(perf): publish the dedup/parallel/lazy-copy wins and limits

REQUIREMENTS §10 carries the new measured numbers and the dedup blocking trade-off note. DEVELOPER known-limitations is rewritten to reflect that exact-only dedup is now O(n), fuzzy-blocking is opt-in, and column-parallelism is scaffolding for free-threaded Python.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:54:25 +00:00
parent 64452dd783
commit d0423a8912
2 changed files with 44 additions and 10 deletions
--- a/docs/REQUIREMENTS.md
+++ b/docs/REQUIREMENTS.md
@@ -83,14 +83,38 @@ Sample size: 1,000 rows (configurable).
  the underlying parsers (phonenumbers, dateutil) rather than Python
  list materialisation. A 1.5 GB CSV with mixed phone+currency+address
  columns finishes in ~1.5–6 minutes depending on column count.
+  `StandardizeOptions.parallel_columns` (default 1, serial) lands the
+  thread-pool scaffolding; on CPython 3.12 with the GIL it's
+  roughly neutral, but the API is ready for the free-threaded
+  (PEP 703) Python 3.13+ build where it will help.
 - **Text cleaner** (`clean_dataframe`): ~1M rows/sec on
  repetition-heavy columns (per-call string cache: the pipeline runs
  once per *unique* cell value, not once per row).
-- **Deduplicator**: known O(n²) match step — works to ~50k rows in
-  comfortable time. The normalisation pass is now LRU-cached per call
-  so repeat values (the common dedup workload) skip re-parsing
-  (~2–5× faster on the normalisation step alone). Scale beyond 50k
-  needs blocking — flagged in `docs/NEXT-STEPS.md`.
+- **Missing handler** (`handle_missing`): lazy-copy — when sentinel
+  standardization runs but finds nothing, AND no drops AND no fills
+  apply, the input frame is returned as-is. On a clean 1 GB file this
+  saves the 1 GB allocation that the unconditional upfront copy used
+  to take.
+- **Column mapper** (`map_columns`): rename + drop both already
+  return fresh frames; the explicit upfront `df.copy()` is now
+  removed and downstream mutating steps (schema-add, coerce) copy on
+  demand via `_ensure_owned()`. Rename-only and identity-mapping
+  paths run with zero explicit copies.
+- **Deduplicator**:
+  - **Exact-only strategies** (every column uses `Algorithm.EXACT` at
+    threshold 100 — covers strong-key dedup like email/phone, the
+    fallback drop-duplicates path, and explicit "match on this exact
+    column" calls) now run in **O(n)** via groupby. Measured: 10k
+    rows on an email-exact strategy → 73 ms (was ~30 minutes via the
+    old O(n²) pair compare).
+  - **Fuzzy strategies** still pair-compare. Opt in to **prefix
+    blocking** via `deduplicate(..., blocking_columns=['name'],
+    blocking_prefix_len=1)` to partition pairs by a cheap key.
+    Measured: 5k rows fuzzy-name dedup → 25.6s with blocking vs.
+    179s without (7× faster). Trade-off: cross-block matches are
+    missed; lower `blocking_prefix_len` widens blocks.
+  - Normalisation pass remains LRU-cached per call so repeat values
+    (the common dedup workload) skip re-parsing.

 ## 11. Tools
 1. Deduplicator — Ready
@@ -150,7 +174,7 @@ and proceeds.
 - **Dev**: pytest, tox.

 ## 16. Test coverage
-- 1,770 tests passing, 0 skipped, 0 xfailed (incl. perf-shape regression tests).
+- 1,777 tests passing, 0 skipped, 0 xfailed (incl. 15 perf-shape regression tests).
 - Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases + 20-row international stress fixture), missing-handler (3 use cases + 16 edge cases), column-mapper (3 use cases + 5 edge cases).
 - Run: `python run_tests.py [--tool …] [--fixtures] [--coverage]`.
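
The Deduplicator notes in the §10 hunk describe two ideas: exact-key grouping in a single pass, and prefix blocking for fuzzy matching. The sketch below illustrates both using only the standard library. It is not the project's `deduplicate` implementation: the function names `exact_duplicate_groups` and `fuzzy_pairs` are hypothetical, and `difflib.SequenceMatcher` stands in for whatever fuzzy algorithm the tool actually uses.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def exact_duplicate_groups(rows, key):
    """Group row indices by an exact key in one pass (hash grouping, O(n))."""
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[row[key]].append(idx)
    # Only keys seen more than once are duplicate groups.
    return [ids for ids in groups.values() if len(ids) > 1]


def fuzzy_pairs(rows, key, threshold=0.8, prefix_len=None):
    """Pair-compare rows on `key`.

    With prefix_len set, rows are partitioned by the first prefix_len
    characters and only rows in the same block are compared.
    """
    blocks = defaultdict(list)
    for idx, row in enumerate(rows):
        value = row[key].lower()
        # No prefix_len means one big block, i.e. the full O(n^2) compare.
        block_key = value[:prefix_len] if prefix_len else ""
        blocks[block_key].append((idx, value))

    matches = []
    for members in blocks.values():
        for (i, a), (j, b) in combinations(members, 2):
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                matches.append((i, j))
    return sorted(matches)


rows = [
    {"email": "a@x.com", "name": "Jon Smith"},
    {"email": "b@x.com", "name": "John Smith"},
    {"email": "a@x.com", "name": "Ann Lee"},
    {"email": "c@x.com", "name": "Don Smith"},
]

exact = exact_duplicate_groups(rows, "email")
unblocked = fuzzy_pairs(rows, "name")
blocked = fuzzy_pairs(rows, "name", prefix_len=1)
```

On this toy data, `blocked` contains only the pair `(0, 1)`, while `unblocked` also finds `(0, 3)` ("Jon Smith" vs. "Don Smith"). That lost pair is the cross-block miss the trade-off note refers to: the two names differ in their first character, so a one-character prefix puts them in different blocks.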