perf: cache hot paths, drop wasted allocations, lift 1 GB → 1.5 GB
Five targeted wins driven by an end-to-end audit, with shape-pinning
regression tests so reverts are loud:

- format_standardize: fuse the dispatcher loop into one pass. It was
  calling Series.tolist() three times per typed column and materialising
  an intermediate triples list; now one tolist, one walk. On a synthetic
  1M-row phone+email frame this measures ~2.7M rows/sec (vs. the previous
  150k/sec doc target).
- dedup: wrap normalizers in a per-call lru_cache so repeat phones /
  emails / addresses skip re-parsing. phonenumbers.parse is the expensive
  call; ~2–5x faster on the normalisation step for realistic workloads.
- analyze: _detect_near_duplicates no longer copies the full input frame;
  builds only the normalised string columns via a dict and references
  non-string columns by view. Skips the redundant astype(str) when a
  column is already pandas string dtype.
- text_clean: hoist _build_pipeline out of the per-cell loop and add a
  per-call string cache so 100k repeats of "Active" only run the pipeline
  once. ~1M rows/sec on repetition-heavy columns.
- io.repair_bytes: the non-UTF-8 smart-quote fold path used a Python-level
  zip walk over the entire decoded string to count replacements; replaced
  with sum(text.count(c) ...) which runs in C at ~GB/s. Was a latent ~100s
  on a 1 GB cp1252 file; now <1s.

Updates REQUIREMENTS §10 with measured numbers and bumps the buyer-facing
upload limit from 1 GB to 1.5 GB across the i18n packs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
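The per-call cache described in the dedup and text_clean items can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `normalize_phone` and `normalize_column` are hypothetical names, and the regex is a cheap stand-in for an expensive parser such as `phonenumbers.parse`.

```python
import re
from functools import lru_cache


def normalize_phone(raw: str) -> str:
    """Hypothetical stand-in for an expensive normalizer: keep digits only."""
    return re.sub(r"\D", "", raw)


def normalize_column(values: list[str]) -> tuple[list[str], int]:
    """Normalize every value, parsing each distinct input string only once.

    The cache is created inside the call, so it is discarded when the call
    returns and cannot grow across unrelated inputs.
    """
    cached = lru_cache(maxsize=None)(normalize_phone)
    normalized = [cached(v) for v in values]
    # cache_info().misses is the number of times the real normalizer ran.
    return normalized, cached.cache_info().misses


if __name__ == "__main__":
    out, parses = normalize_column(
        ["(555) 010-0000", "555-010-0000", "(555) 010-0000"]
    )
    print(out, parses)
```

The cache key is the raw input string, so two spellings of the same number still cost one parse each; only exact repeats are skipped. How much this helps therefore depends on how repetitive the column is.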
@@ -684,15 +684,20 @@ def write_file(
 # Anything else is logged as unrepairable and the line is left alone.
 
 # Smart double-quote characters that confuse CSV parsing.
-_CSV_SMART_QUOTE_TRANS = str.maketrans({
-    "“": '"', # LEFT DOUBLE QUOTATION MARK
-    "”": '"', # RIGHT DOUBLE QUOTATION MARK
-    "„": '"', # DOUBLE LOW-9 QUOTATION MARK
-    "‟": '"', # DOUBLE HIGH-REVERSED-9 QUOTATION MARK
-    "«": '"', # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
-    "»": '"', # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
-    "″": '"', # DOUBLE PRIME
-})
+_CSV_SMART_QUOTE_CHARS: tuple[str, ...] = (
+    "“", # LEFT DOUBLE QUOTATION MARK
+    "”", # RIGHT DOUBLE QUOTATION MARK
+    "„", # DOUBLE LOW-9 QUOTATION MARK
+    "‟", # DOUBLE HIGH-REVERSED-9 QUOTATION MARK
+    "«", # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
+    "»", # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
+    "″", # DOUBLE PRIME
+)
+# ``str.maketrans`` builds a codepoint→codepoint dict the C translate
+# uses directly. Iterating that dict yields ``int`` codepoints, which is
+# why we keep ``_CSV_SMART_QUOTE_CHARS`` separately for the ``.count``
+# loop in the non-UTF-8 fold path.
+_CSV_SMART_QUOTE_TRANS = str.maketrans({c: '"' for c in _CSV_SMART_QUOTE_CHARS})
 
 # Byte-level fast path: same characters but as UTF-8 byte sequences. Used
 # when the file is already valid UTF-8 — folds in C without ever
@@ -933,14 +938,17 @@ def repair_bytes(
     # Smart-quote fold for non-UTF-8 inputs that bypassed the byte fast
     # path (the byte_map only covers the UTF-8 byte sequences).
     if fold_quotes and not is_utf8:
-        folded = text.translate(_CSV_SMART_QUOTE_TRANS)
-        if folded != text:
-            n = sum(1 for a, b in zip(text, folded) if a != b)
+        # Count via ``str.count`` (C-implemented, ~GB/s) instead of a
+        # Python-level char-by-char ``zip`` walk. On a 1 GB decoded
+        # string the old path took ~100s of pure CPython iteration; the
+        # ``count`` sum is microseconds because each call runs in C.
+        n = sum(text.count(c) for c in _CSV_SMART_QUOTE_CHARS)
+        if n:
+            text = text.translate(_CSV_SMART_QUOTE_TRANS)
             actions.append(RepairAction(
                 kind="fold_smart_quote", line=None,
                 detail=f"replaced {n} smart double-quote char(s) with ASCII '\"'",
             ))
-            text = folded
 
     # Per-row delimiter repair: skip the costly csv.reader walk on
     # well-formed files. Triggers, in cheap-to-expensive order:
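As a rough check on the change above, the sketch below compares the two counting strategies on a small string. It is an illustration under stated assumptions, not code from the repository: `SMART_QUOTES`, `TRANS`, `count_by_zip` and `count_by_str_count` are hypothetical names that mirror the tuple and translation table in the diff. The two counts agree because every mapping replaces one character with one different character, so the folded string has the same length and differs from the original exactly at the smart-quote positions.

```python
# Same seven characters as in the diff, written as escapes for clarity.
SMART_QUOTES: tuple[str, ...] = (
    "\u201c",  # LEFT DOUBLE QUOTATION MARK
    "\u201d",  # RIGHT DOUBLE QUOTATION MARK
    "\u201e",  # DOUBLE LOW-9 QUOTATION MARK
    "\u201f",  # DOUBLE HIGH-REVERSED-9 QUOTATION MARK
    "\u00ab",  # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    "\u00bb",  # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    "\u2033",  # DOUBLE PRIME
)
TRANS = str.maketrans({c: '"' for c in SMART_QUOTES})


def count_by_zip(text: str) -> int:
    """Old approach: translate, then compare character by character in Python."""
    folded = text.translate(TRANS)
    return sum(1 for a, b in zip(text, folded) if a != b)


def count_by_str_count(text: str) -> int:
    """New approach: one C-level scan of the string per smart-quote character."""
    return sum(text.count(c) for c in SMART_QUOTES)


if __name__ == "__main__":
    sample = '\u201cname\u201d,\u00abcity\u00bb,"plain"'
    print(count_by_zip(sample), count_by_str_count(sample))
    # The table's keys are int code points, not one-character strings.
    print(all(isinstance(k, int) for k in TRANS))
```

Each `str.count` call still scans the whole string, so the total cost stays linear in the input size (seven passes here); the saving comes from doing those passes in C rather than in a Python-level loop.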