Three follow-on wins from the audit, each with shape-pinning tests.
1. Dedup blocking
- Exact-only strategies (every column EXACT @ 100 — covers strong-
key dedup like email/phone, the drop-duplicates fallback, and
explicit "match on this exact column" calls) now route through
an O(n) groupby fast path. Lossless; no API change required.
Measured: 10k-row email-exact dedup → 73 ms (was ~30 minutes
via the O(n²) pair compare).
- Fuzzy strategies still pair-compare, with opt-in prefix blocking
via deduplicate(..., blocking_columns=[...], blocking_prefix_len=1).
Measured: 5k-row fuzzy-name → 25.6 s with blocking vs 179 s
without (7x). Trade-off: rows in different blocks are never
compared, so cross-block matches are missed.
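Sketch of the two dedup paths described above, in plain Python rather than pandas so it runs standalone. The helper names (exact_duplicate_groups, blocked_candidate_pairs) and the lower-casing of the prefix are illustrative assumptions, not the project's actual code; only the idea (hash-group on exact keys, pair-compare only inside a prefix block) comes from this commit.

```python
from collections import defaultdict
from itertools import combinations


def exact_duplicate_groups(rows, key_columns):
    """Group row indices by their exact key tuple in a single O(n) pass."""
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[tuple(row[col] for col in key_columns)].append(idx)
    # Only keys seen more than once form duplicate clusters.
    return [idxs for idxs in groups.values() if len(idxs) > 1]


def blocked_candidate_pairs(rows, blocking_columns, prefix_len=1):
    """Yield index pairs whose blocking columns share the same prefix."""
    blocks = defaultdict(list)
    for idx, row in enumerate(rows):
        # Assumption: prefixes are compared case-insensitively.
        key = tuple(
            str(row[col])[:prefix_len].lower() for col in blocking_columns
        )
        blocks[key].append(idx)
    for idxs in blocks.values():
        # Pairs are formed only inside a block; cross-block pairs are skipped.
        yield from combinations(idxs, 2)


rows = [
    {"name": "Jon Smith", "email": "jon@example.com"},
    {"name": "John Smith", "email": "jon@example.com"},
    {"name": "Dawn Smith", "email": "dawn@example.com"},
    {"name": "Don Smith", "email": "don@example.com"},
]
exact_groups = exact_duplicate_groups(rows, ["email"])
candidate_pairs = list(blocked_candidate_pairs(rows, ["name"]))
print(exact_groups)     # rows 0 and 1 share an email
print(candidate_pairs)  # 2 candidate pairs instead of the 6 a full compare needs
```

With prefix_len=1 a pair like "Jon" / "Don" is never scored, which is the cross-block miss called out in the trade-off.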
2. Column-parallel standardize
- StandardizeOptions.parallel_columns (default 1) lands a
ThreadPoolExecutor over the column loop. Output order and
audit-record order are preserved deterministically via a merge
step keyed off column_types order. Documented caveat: under
CPython 3.12's GIL the win is roughly neutral (phonenumbers/dateutil
hold the GIL); the API is ready for free-threaded builds of
Python 3.13+.
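A minimal sketch of the column-parallel shape, assuming a dict-of-lists stand-in for the frame. standardize_columns, standardize_one and strip_lower are hypothetical names; only parallel_columns, column_types and the ThreadPoolExecutor come from this commit. Determinism here relies on Executor.map returning results in input order, which may differ from the merge step the real code uses.

```python
from concurrent.futures import ThreadPoolExecutor


def standardize_columns(columns, column_types, standardize_one,
                        parallel_columns=1):
    """Standardize typed columns, optionally across threads.

    columns: dict of column name -> list of values
    column_types: dict of column name -> type label (its order drives audit order)
    standardize_one: callable(name, values, kind) -> (new_values, audit_records)
    """
    names = [name for name in column_types if name in columns]

    def work(name):
        return standardize_one(name, columns[name], column_types[name])

    if parallel_columns <= 1:
        results = [work(name) for name in names]
    else:
        with ThreadPoolExecutor(max_workers=parallel_columns) as pool:
            # map() yields results in input order, not completion order.
            results = list(pool.map(work, names))

    out = dict(columns)  # shallow copy keeps the original column order
    audit = []
    for name, (values, records) in zip(names, results):
        out[name] = values
        audit.extend(records)
    return out, audit


def strip_lower(name, values, kind):
    """Toy per-column standardizer that records which cells changed."""
    new_values = [v.strip().lower() for v in values]
    records = [
        (name, i)
        for i, (old, new) in enumerate(zip(values, new_values))
        if old != new
    ]
    return new_values, records


columns = {"email": [" A@X.COM", "b@x.com"], "city": ["Oslo ", "Rome"]}
column_types = {"city": "text", "email": "email"}
serial = standardize_columns(columns, column_types, strip_lower, 1)
threaded = standardize_columns(columns, column_types, strip_lower, 4)
print(serial == threaded)
```

The serial and threaded runs produce identical output and audit order, which is the property the shape-pinning tests are meant to hold.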
3. Lazy-copy in missing / column_mapper
- _standardize_sentinels now builds per-column changes in a dict
and only materialises the output frame when at least one column
actually changed. On a clean 1 GB file this skips a 1 GB
allocation.
- handle_missing carries an out_is_owned flag, copying on demand
before any mutating step. No-op runs return the input frame.
- map_columns drops the unconditional upfront df.copy(); rename
and drop both return fresh frames already, and schema-add /
coerce trigger _ensure_owned() lazily.
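The lazy-copy idea above can be sketched without pandas as follows. The dict-of-lists frame, the SENTINELS set and the function name are assumptions for illustration; the real _standardize_sentinels operates on a DataFrame and may differ in detail.

```python
SENTINELS = {"", "N/A", "null", "-"}  # hypothetical sentinel strings


def standardize_sentinels(frame):
    """Replace sentinel strings with None, allocating only when needed.

    frame: dict of column name -> list of values (stand-in for a DataFrame)
    """
    changes = {}
    for name, values in frame.items():
        if any(v in SENTINELS for v in values if isinstance(v, str)):
            changes[name] = [
                None if isinstance(v, str) and v in SENTINELS else v
                for v in values
            ]
    if not changes:
        # Clean input: hand back the same object, no new frame is built.
        return frame
    # Untouched columns are shared by reference; changed ones are new lists.
    return {**frame, **changes}


clean = {"a": ["x", "y"], "b": [1, 2]}
dirty = {"a": ["x", "N/A"], "b": [1, 2]}
result_clean = standardize_sentinels(clean)
result_dirty = standardize_sentinels(dirty)
print(result_clean is clean)  # no-op run returns the input itself
print(result_dirty["a"])
```

The input is never mutated: a dirty column is rebuilt as a new list, so callers holding the original frame still see their own data.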
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five targeted wins driven by an end-to-end audit, with shape-pinning
regression tests so reverts are loud:
- format_standardize: fuse the dispatcher loop into one pass. It was
calling Series.tolist() three times per typed column and materialising
an intermediate triples list; now one tolist, one walk. On a
synthetic 1M-row phone+email frame this measures ~2.7M rows/sec
(vs. the previous doc target of 150k rows/sec).
- dedup: wrap normalizers in a per-call lru_cache so repeat phones /
emails / addresses skip re-parsing. phonenumbers.parse is the
expensive call; ~2–5x faster on the normalisation step for realistic
workloads.
- analyze: _detect_near_duplicates no longer copies the full input
frame; builds only the normalised string columns via a dict and
references non-string columns by view. Skips the redundant
astype(str) when a column is already pandas string dtype.
- text_clean: hoist _build_pipeline out of the per-cell loop and add a
per-call string cache so 100k repeats of "Active" only run the
pipeline once. ~1M rows/sec on repetition-heavy columns.
- io.repair_bytes: the non-UTF-8 smart-quote fold path used a
Python-level zip walk over the entire decoded string to count
replacements. That is now sum(text.count(c) ...), which runs in
C at ~GB/s. Was a latent ~100 s on a 1 GB cp1252 file; now <1 s.
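Two of the bullets above (the per-call lru_cache around normalizers and the str.count replacement tally) are easy to show in isolation. This is a sketch under assumptions: fold_smart_quotes, normalize_all, digits_only and the SMART_QUOTES table are hypothetical names, and the real code uses phonenumbers.parse rather than the toy normalizer here.

```python
from functools import lru_cache

# Hypothetical fold table: curly quotes mapped to ASCII equivalents.
SMART_QUOTES = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}


def fold_smart_quotes(text):
    """Fold curly quotes to ASCII and report how many were replaced."""
    # str.count scans in C, so the tally needs no Python-level character walk.
    replaced = sum(text.count(ch) for ch in SMART_QUOTES)
    return text.translate(str.maketrans(SMART_QUOTES)), replaced


def normalize_all(values, normalizer, maxsize=None):
    """Apply normalizer to every value, computing each distinct value once."""
    # The cache is created per call, so nothing leaks between runs.
    # Values must be hashable for lru_cache to accept them.
    cached = lru_cache(maxsize=maxsize)(normalizer)
    return [cached(v) for v in values]


calls = []


def digits_only(raw):
    """Toy stand-in for an expensive parser; records each real invocation."""
    calls.append(raw)
    return "".join(ch for ch in raw if ch.isdigit())


phones = ["(555) 010-0000", "555-010-0000", "(555) 010-0000"]
normalized = normalize_all(phones, digits_only)
folded, replaced = fold_smart_quotes("\u201cit\u2019s\u201d")
print(len(calls), "parser calls for", len(phones), "values")
print(folded, replaced)
```

The repeated phone string hits the cache, so the parser runs once per distinct input, which is where the speedup on repetition-heavy columns comes from.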
Updates REQUIREMENTS §10 with measured numbers and bumps the buyer-
facing upload limit from 1 GB to 1.5 GB across the i18n packs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>