docs(perf): publish the dedup/parallel/lazy-copy wins and limits

REQUIREMENTS §10 carries the new measured numbers and the dedup blocking trade-off note. DEVELOPER known-limitations is rewritten to reflect that exact-only dedup is now O(n), fuzzy-blocking is opt-in, and column-parallelism is scaffolding for free-threaded Python.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -185,8 +185,18 @@ Fixture corpora: `test-cases/text-cleaner-corpus/` (21 files) · `test-cases/enc
 
 ## Known limitations
 
-- **Dedup is O(n²)** — no blocking. Works to ~50k rows. Future: partition by first letter / ZIP prefix.
-- **Single-threaded** — could benefit from `multiprocessing`.
-- **Memory-bound** — entire file loaded into pandas. Streaming reads exist but not integrated with dedup engine.
+- **Dedup pair-compare is O(n²)** for fuzzy strategies. Exact-only
+  strategies (every column uses `Algorithm.EXACT` at threshold 100)
+  now route through an O(n) groupby fast path automatically — no API
+  change. Fuzzy strategies can opt into prefix blocking via
+  `deduplicate(..., blocking_columns=[...], blocking_prefix_len=1)`
+  to partition pairs by a cheap key (trades recall for speed).
+- **Threading is opt-in for format_standardize** —
+  `StandardizeOptions.parallel_columns > 1` uses a thread pool.
+  On CPython 3.12 the GIL caps the win at roughly neutral; the
+  scaffolding is in place for free-threaded Python 3.13+.
+- **Memory-bound** — entire file loaded into pandas. Streaming reads
+  exist but not integrated with the dedup engine.
 - **No multi-sheet dedup** — each Excel sheet processed independently.
-- **Phonenumbers minimum-length** — international numbers without country codes fall back to digits-only.
+- **Phonenumbers minimum-length** — international numbers without
+  country codes fall back to digits-only.
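Review note: the exact-only fast path described in this hunk can be pictured as a single hashing pass over the key columns. The sketch below is a stdlib-only illustration under stated assumptions, not the project's code: `exact_duplicate_groups` and `pairwise_duplicate_pairs` are hypothetical names, rows are plain dicts rather than a pandas frame, and the real implementation is only described in the diff as "via groupby".

```python
from collections import defaultdict
from itertools import combinations


def exact_duplicate_groups(rows, key_columns):
    """Group row indices by exact key in one pass (expected O(n))."""
    buckets = defaultdict(list)
    for idx, row in enumerate(rows):
        key = tuple(row[col] for col in key_columns)
        buckets[key].append(idx)
    # Only keys seen more than once are duplicate groups.
    return [idxs for idxs in buckets.values() if len(idxs) > 1]


def pairwise_duplicate_pairs(rows, key_columns):
    """Reference O(n^2) pair compare, shown for contrast only."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(rows), 2)
        if all(a[col] == b[col] for col in key_columns)
    ]


rows = [
    {"email": "a@example.com", "name": "Ann"},
    {"email": "b@example.com", "name": "Bob"},
    {"email": "a@example.com", "name": "Ann B."},
    {"email": "c@example.com", "name": "Cy"},
    {"email": "b@example.com", "name": "Robert"},
]

groups = exact_duplicate_groups(rows, ["email"])
pairs = pairwise_duplicate_pairs(rows, ["email"])
print(groups)  # duplicate groups as lists of row indices
print(pairs)   # the same duplicates expressed as index pairs
```

Both functions find the same duplicates here; the difference is that the grouped version touches each row once, while the pairwise version compares every row against every other row.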
@@ -83,14 +83,38 @@ Sample size: 1,000 rows (configurable).
   the underlying parsers (phonenumbers, dateutil) rather than Python
   list materialisation. A 1.5 GB CSV with mixed phone+currency+address
   columns finishes in ~1.5–6 minutes depending on column count.
+  `StandardizeOptions.parallel_columns` (default 1, serial) lands the
+  thread-pool scaffolding; on CPython 3.12 with the GIL it's
+  roughly neutral, but the API is ready for the free-threaded
+  (PEP 703) Python 3.13+ build where it will help.
 - **Text cleaner** (`clean_dataframe`): ~1M rows/sec on
   repetition-heavy columns (per-call string cache: the pipeline runs
   once per *unique* cell value, not once per row).
-- **Deduplicator**: known O(n²) match step — works to ~50k rows in
-  comfortable time. The normalisation pass is now LRU-cached per call
-  so repeat values (the common dedup workload) skip re-parsing
-  (~2–5× faster on the normalisation step alone). Scale beyond 50k
-  needs blocking — flagged in `docs/NEXT-STEPS.md`.
+- **Missing handler** (`handle_missing`): lazy-copy — when sentinel
+  standardization runs but finds nothing, AND no drops AND no fills
+  apply, the input frame is returned as-is. On a clean 1 GB file this
+  saves the 1 GB allocation that the unconditional upfront copy used
+  to take.
+- **Column mapper** (`map_columns`): rename + drop both already
+  return fresh frames; the explicit upfront `df.copy()` is now
+  removed and downstream mutating steps (schema-add, coerce) copy on
+  demand via `_ensure_owned()`. Rename-only and identity-mapping
+  paths run with zero explicit copies.
+- **Deduplicator**:
+  - **Exact-only strategies** (every column uses `Algorithm.EXACT` at
+    threshold 100 — covers strong-key dedup like email/phone, the
+    fallback drop-duplicates path, and explicit "match on this exact
+    column" calls) now run in **O(n)** via groupby. Measured: 10k
+    rows on an email-exact strategy → 73 ms (was ~30 minutes via the
+    old O(n²) pair compare).
+  - **Fuzzy strategies** still pair-compare. Opt in to **prefix
+    blocking** via `deduplicate(..., blocking_columns=['name'],
+    blocking_prefix_len=1)` to partition pairs by a cheap key.
+    Measured: 5k rows fuzzy-name dedup → 25.6s with blocking vs.
+    179s without (7× faster). Trade-off: cross-block matches are
+    missed; lower `blocking_prefix_len` widens blocks.
+  - Normalisation pass remains LRU-cached per call so repeat values
+    (the common dedup workload) skip re-parsing.
 
 ## 11. Tools
 1. Deduplicator — Ready
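Review note: the prefix-blocking trade-off described in this hunk (fewer comparisons, but cross-block matches are missed) can be reproduced in miniature. This is a hedged sketch and not the project's `deduplicate`: `fuzzy_pairs` and `similar` are hypothetical helpers, and `difflib.SequenceMatcher` with a 0.8 threshold stands in for whatever fuzzy algorithm and threshold the tool actually uses.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similar(a, b, threshold=0.8):
    """Stand-in fuzzy comparison (assumption: not the project's algorithm)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def fuzzy_pairs(names, prefix_len=None):
    """Return (sorted matching index pairs, number of comparisons made)."""
    if prefix_len is None:
        # No blocking: one block holding every row.
        blocks = [list(range(len(names)))]
    else:
        # Blocking: rows are only compared within the same prefix bucket.
        by_prefix = defaultdict(list)
        for idx, name in enumerate(names):
            by_prefix[name[:prefix_len].lower()].append(idx)
        blocks = list(by_prefix.values())

    matches, comparisons = [], 0
    for block in blocks:
        for i, j in combinations(block, 2):
            comparisons += 1
            if similar(names[i], names[j]):
                matches.append((i, j))
    return sorted(matches), comparisons


names = [
    "Katherine Smith",
    "Katherine Smyth",
    "Catherine Smith",
    "Robert Jones",
    "Roberta Jones",
]

all_matches, all_cmp = fuzzy_pairs(names)
blocked_matches, blocked_cmp = fuzzy_pairs(names, prefix_len=1)
missed = sorted(set(all_matches) - set(blocked_matches))
print(all_cmp, blocked_cmp)  # comparisons without vs. with blocking
print(missed)                # matches lost because they span two blocks
```

With a one-character prefix, "Catherine Smith" lands in a different block from the two "Katherine" rows, so those pairs are never compared. That is the recall cost the commit message calls the blocking trade-off.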
@@ -150,7 +174,7 @@ and proceeds.
 - **Dev**: pytest, tox.
 
 ## 16. Test coverage
-- 1,770 tests passing, 0 skipped, 0 xfailed (incl. perf-shape regression tests).
+- 1,777 tests passing, 0 skipped, 0 xfailed (incl. 15 perf-shape regression tests).
 - Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases + 20-row international stress fixture), missing-handler (3 use cases + 16 edge cases), column-mapper (3 use cases + 5 edge cases).
 - Run: `python run_tests.py [--tool …] [--fixtures] [--coverage]`.
 
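Review note: the diff does not show what the perf-shape regression tests assert. One check that the lazy-copy behaviour described for `handle_missing` lends itself to is an identity test: a clean input should come back as the same object, and a dirty input should come back as a copy with the original untouched. The sketch below assumes a list-of-dicts stand-in for the frame; `handle_missing_sketch` and `SENTINELS` are hypothetical and are not the project's API.

```python
import copy

# Hypothetical sentinel strings; the project's real list is not shown in this diff.
SENTINELS = frozenset({"N/A", "NULL", ""})


def handle_missing_sketch(rows, sentinels=SENTINELS):
    """Replace sentinel strings with None, copying only if a write is needed."""
    owned = None  # stays None until the first write is required
    for r, row in enumerate(rows):
        for col, value in row.items():
            if value in sentinels:
                if owned is None:
                    owned = copy.deepcopy(rows)  # copy on first write only
                owned[r][col] = None
    # Nothing to change: hand back the caller's object, no allocation.
    return rows if owned is None else owned


clean = [{"name": "Ann", "zip": "12345"}]
dirty = [{"name": "Bob", "zip": "N/A"}]

clean_out = handle_missing_sketch(clean)
dirty_out = handle_missing_sketch(dirty)

print(clean_out is clean)   # identity check for the no-op path
print(dirty_out is dirty)   # the mutating path returns a separate object
```

An identity assertion like this is deterministic, which makes it a more stable regression guard than a wall-clock timing threshold.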