perf: cache hot paths, drop wasted allocations, lift upload limit 1 GB → 1.5 GB

Five targeted wins driven by an end-to-end audit, with shape-pinning
regression tests so reverts are loud:

- format_standardize: fuse the dispatcher loop into one pass. It was
  calling Series.tolist() three times per typed column and
  materialising an intermediate triples list; now one tolist, one walk.
  On a synthetic 1M-row phone+email frame this measures ~2.7M rows/sec
  (vs. the previous 150k/sec doc target).
- dedup: wrap normalizers in a per-call lru_cache so repeat phones /
  emails / addresses skip re-parsing. phonenumbers.parse is the
  expensive call; ~2–5x faster on the normalisation step for realistic
  workloads.
- analyze: _detect_near_duplicates no longer copies the full input
  frame; builds only the normalised string columns via a dict and
  references non-string columns by view. Skips the redundant
  astype(str) when a column is already pandas string dtype.
- text_clean: hoist _build_pipeline out of the per-cell loop and add a
  per-call string cache so 100k repeats of "Active" only run the
  pipeline once. ~1M rows/sec on repetition-heavy columns.
- io.repair_bytes: the non-UTF-8 smart-quote fold path used a
  Python-level zip walk over the entire decoded string to count
  replacements; replaced with sum(text.count(c) ...), which runs in C
  at ~GB/s. Was a latent ~100s on a 1 GB cp1252 file; now <1s.

Updates REQUIREMENTS §10 with measured numbers and bumps the
buyer-facing upload limit from 1 GB to 1.5 GB across the i18n packs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
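
Two of the techniques named in the message, the per-call lru_cache around a normalizer and the C-level str.count tally, can be sketched roughly as follows. This is an illustration only, not the code from this commit: slow_normalize, normalize_all, SMART_QUOTES and count_smart_quotes are hypothetical names, and the digit-stripping normalizer merely stands in for an expensive parser such as phonenumbers.parse.

```python
from functools import lru_cache
from typing import Callable, Iterable


def slow_normalize(raw: str) -> str:
    # Hypothetical stand-in for an expensive normalizer: keep digits only.
    return "".join(ch for ch in raw if ch.isdigit())


def normalize_all(
    values: Iterable[str],
    normalizer: Callable[[str], str],
    maxsize: int = 65536,
) -> list[str]:
    # Build the cache inside the call so it is discarded when the call
    # returns and never carries entries over from one input to the next.
    cached = lru_cache(maxsize=maxsize)(normalizer)
    return [cached(v) for v in values]


# Assumed set of characters the smart-quote fold would replace.
SMART_QUOTES = "\u2018\u2019\u201c\u201d"


def count_smart_quotes(text: str) -> int:
    # str.count scans the string in C; a handful of such scans is far
    # cheaper than stepping through every character in a Python loop.
    return sum(text.count(c) for c in SMART_QUOTES)
```

Scoping the cache to a single call trades some cross-call reuse for bounded memory and no stale entries, which seems to be the intent of "per-call" in the message.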
@@ -2556,33 +2556,48 @@ def standardize_dataframe(
         elif field_type == FieldType.ADDRESS and options.address_country_column:
             region_series = out[options.address_country_column]
 
-        new_values: list[Any] = [None] * len(series)
+        # Hot loop: one ``.tolist()`` materialisation, one pass over the
+        # column. Previously called ``.tolist()`` three times and built an
+        # intermediate ``triples`` list — costly at 1 GB scale where a
+        # single column may be 10–50 MB of Python objects.
+        values = series.tolist()
+        new_values: list[Any] = [None] * len(values)
+
         if region_series is None:
-            triples = [dispatcher(v) for v in series.tolist()]
+            for i, orig in enumerate(values):
+                new, changed, parsed = dispatcher(orig)
+                new_values[i] = new
+                if changed:
+                    cells_changed += 1
+                    if audit_room > 0:
+                        audit_records.append({
+                            "row": i,
+                            "column": col,
+                            "field_type": field_type.value,
+                            "old": orig,
+                            "new": new,
+                        })
+                        audit_room -= 1
+                if not parsed:
+                    cells_unparseable += 1
         else:
             regions = region_series.tolist()
-            triples = [
-                dispatcher(v, _normalize_region(r))
-                for v, r in zip(series.tolist(), regions)
-            ]
-
-        for i, (orig, (new, changed, parsed)) in enumerate(
-            zip(series.tolist(), triples)
-        ):
-            new_values[i] = new
-            if changed:
-                cells_changed += 1
-                if audit_room > 0:
-                    audit_records.append({
-                        "row": i,
-                        "column": col,
-                        "field_type": field_type.value,
-                        "old": orig,
-                        "new": new,
-                    })
-                    audit_room -= 1
-            if not parsed:
-                cells_unparseable += 1
+            for i, (orig, region) in enumerate(zip(values, regions)):
+                new, changed, parsed = dispatcher(orig, _normalize_region(region))
+                new_values[i] = new
+                if changed:
+                    cells_changed += 1
+                    if audit_room > 0:
+                        audit_records.append({
+                            "row": i,
+                            "column": col,
+                            "field_type": field_type.value,
+                            "old": orig,
+                            "new": new,
+                        })
+                        audit_room -= 1
+                if not parsed:
+                    cells_unparseable += 1
         out[col] = new_values
 
     changes_df = pd.DataFrame(
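
The shape of the fused loop in this hunk can be tried in isolation with a small sketch like the one below. It assumes, as the diff suggests, a dispatcher that returns a (new, changed, parsed) triple; toy_dispatcher, standardize_values and audit_cap are hypothetical names used only for illustration, and a plain list stands in for the result of Series.tolist().

```python
from typing import Any, Callable

Dispatcher = Callable[[Any], tuple[Any, bool, bool]]


def toy_dispatcher(value: Any) -> tuple[Any, bool, bool]:
    # Hypothetical dispatcher: strips whitespace from strings and
    # reports anything that is not a string as unparseable.
    if not isinstance(value, str):
        return value, False, False
    new = value.strip()
    return new, new != value, True


def standardize_values(
    values: list[Any],
    dispatcher: Dispatcher,
    audit_cap: int = 2,
) -> tuple[list[Any], int, int, list[dict[str, Any]]]:
    # One walk over the already-materialised list: dispatch, store the
    # result, and update counters in the same iteration, so no
    # intermediate list of triples is ever built.
    new_values: list[Any] = [None] * len(values)
    audit: list[dict[str, Any]] = []
    cells_changed = 0
    cells_unparseable = 0
    for i, orig in enumerate(values):
        new, changed, parsed = dispatcher(orig)
        new_values[i] = new
        if changed:
            cells_changed += 1
            if len(audit) < audit_cap:  # cap audit growth on large inputs
                audit.append({"row": i, "old": orig, "new": new})
        if not parsed:
            cells_unparseable += 1
    return new_values, cells_changed, cells_unparseable, audit
```

Calling standardize_values([" a", "b", None], toy_dispatcher) walks the list once and returns the cleaned values together with the changed and unparseable counts and the capped audit records.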