Five targeted performance fixes driven by an end-to-end audit, with
shape-pinning regression tests so that any revert fails loudly:
- format_standardize: fuse the dispatcher loop into one pass. It was
calling Series.tolist() three times per typed column and materialising
an intermediate list of triples; now one tolist(), one walk. On a
synthetic 1M-row phone+email frame this measures ~2.7M rows/sec
(the previously documented target was 150k rows/sec).
- dedup: wrap normalizers in a per-call lru_cache so repeated phones /
emails / addresses skip re-parsing. phonenumbers.parse is the
expensive call; the normalisation step is ~2–5x faster on realistic
workloads.
- analyze: _detect_near_duplicates no longer copies the full input
frame; it builds only the normalised string columns via a dict and
references non-string columns by view. It also skips the redundant
astype(str) when a column already has pandas string dtype.
- text_clean: hoist _build_pipeline out of the per-cell loop and add a
per-call string cache so 100k repeats of "Active" only run the
pipeline once. ~1M rows/sec on repetition-heavy columns.
- io.repair_bytes: the non-UTF-8 smart-quote fold path counted
replacements with a Python-level zip walk over the entire decoded
string. Replaced with sum(text.count(c) ...), which runs in C at
~GB/s. This was a latent ~100 s cost on a 1 GB cp1252 file; now <1 s.
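
The io.repair_bytes change above can be illustrated with a minimal
sketch. SMART_QUOTES and all three function names are hypothetical
stand-ins, not the project's actual code; the point is only the
difference between a per-character Python loop and one str.count scan
per target character.

```python
from __future__ import annotations

# Hypothetical fold table: typographic quotes -> ASCII equivalents.
SMART_QUOTES = {
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
}
_FOLD_TABLE = str.maketrans(SMART_QUOTES)


def count_by_zip_walk(before: str, after: str) -> int:
    """Slow variant: compare both strings character by character in Python."""
    return sum(1 for a, b in zip(before, after) if a != b)


def count_by_str_count(text: str) -> int:
    """Fast variant: one C-level str.count scan per smart-quote character."""
    return sum(text.count(c) for c in SMART_QUOTES)


def fold_smart_quotes(text: str) -> tuple[str, int]:
    """Return the folded text and the number of characters replaced."""
    replaced = count_by_str_count(text)
    return text.translate(_FOLD_TABLE), replaced
```

Both counters agree whenever the fold is a one-to-one character
substitution, which is what this sketch assumes.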

Updates REQUIREMENTS §10 with measured numbers and bumps the
buyer-facing upload limit from 1 GB to 1.5 GB across the i18n packs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>