docs(perf): publish the dedup/parallel/lazy-copy wins and limits
REQUIREMENTS §10 carries the new measured numbers and the dedup blocking trade-off note. DEVELOPER known-limitations is rewritten to reflect that exact-only dedup is now O(n), fuzzy-blocking is opt-in, and column-parallelism is scaffolding for free-threaded Python.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -185,8 +185,18 @@ Fixture corpora: `test-cases/text-cleaner-corpus/` (21 files) · `test-cases/enc
 
 ## Known limitations
 
-- **Dedup is O(n²)** — no blocking. Works to ~50k rows. Future: partition by first letter / ZIP prefix.
-- **Single-threaded** — could benefit from `multiprocessing`.
-- **Memory-bound** — entire file loaded into pandas. Streaming reads exist but not integrated with dedup engine.
+- **Dedup pair-compare is O(n²)** for fuzzy strategies. Exact-only
+strategies (every column uses `Algorithm.EXACT` at threshold 100)
+now route through an O(n) groupby fast path automatically — no API
+change. Fuzzy strategies can opt into prefix blocking via
+`deduplicate(..., blocking_columns=[...], blocking_prefix_len=1)`
+to partition pairs by a cheap key (trades recall for speed).
+- **Threading is opt-in for format_standardize** —
+`StandardizeOptions.parallel_columns > 1` uses a thread pool.
+On CPython 3.12 the GIL caps the win at roughly neutral; the
+scaffolding is in place for free-threaded Python 3.13+.
+- **Memory-bound** — entire file loaded into pandas. Streaming reads
+exist but not integrated with the dedup engine.
 - **No multi-sheet dedup** — each Excel sheet processed independently.
-- **Phonenumbers minimum-length** — international numbers without country codes fall back to digits-only.
+- **Phonenumbers minimum-length** — international numbers without
+country codes fall back to digits-only.
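
Reviewer note on the dedup bullet: the sketch below illustrates the two ideas it names, a single-pass grouping for exact matches and prefix blocking for fuzzy matches. It is not the project's implementation. The helper names `exact_duplicate_groups` and `blocked_fuzzy_pairs`, the sample rows, the 0.85 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration; only the `blocking_prefix_len` parameter name is taken from the diff.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def exact_duplicate_groups(rows, columns):
    """Group row indices by their exact key in one pass (expected O(n))."""
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[tuple(row[c] for c in columns)].append(idx)
    # Only keys seen more than once are duplicate groups.
    return [ids for ids in groups.values() if len(ids) > 1]


def blocked_fuzzy_pairs(rows, column, blocking_prefix_len=1, threshold=0.85):
    """Fuzzy-compare only rows that share a lowercase prefix of `column`."""
    blocks = defaultdict(list)
    for idx, row in enumerate(rows):
        blocks[row[column][:blocking_prefix_len].lower()].append(idx)

    matches = []
    for ids in blocks.values():
        # Pairwise comparison is still quadratic, but only within a block.
        for a, b in combinations(ids, 2):
            left, right = rows[a][column].lower(), rows[b][column].lower()
            if SequenceMatcher(None, left, right).ratio() >= threshold:
                matches.append((a, b))
    return matches


# Hypothetical sample data.
rows = [
    {"name": "Jon Smith", "zip": "10001"},
    {"name": "Jon Smith", "zip": "10001"},
    {"name": "John Smith", "zip": "10001"},
    {"name": "Kathy Lee", "zip": "94110"},
    {"name": "Cathy Lee", "zip": "94110"},
]

exact_groups = exact_duplicate_groups(rows, ["name", "zip"])
fuzzy_pairs = blocked_fuzzy_pairs(rows, "name")
```

In this sample, "Kathy Lee" and "Cathy Lee" fall into different one-letter blocks and are never compared, which is the recall cost the bullet mentions.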
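
Reviewer note on the threading bullet: a minimal sketch of opt-in column parallelism with a thread pool, assuming a table is a plain dict of column name to list of strings. `standardize_column` and `standardize_table` are hypothetical stand-ins, not the project's `format_standardize` code; only the `parallel_columns` name comes from the diff.

```python
from concurrent.futures import ThreadPoolExecutor


def standardize_column(values):
    """Hypothetical per-column cleaner: trim and collapse whitespace."""
    return [" ".join(v.split()) for v in values]


def standardize_table(table, parallel_columns=1):
    """Clean every column, using a thread pool only when parallel_columns > 1."""
    if parallel_columns <= 1:
        return {name: standardize_column(vals) for name, vals in table.items()}
    with ThreadPoolExecutor(max_workers=parallel_columns) as pool:
        # map() preserves input order, so results line up with the keys.
        cleaned = pool.map(standardize_column, table.values())
        return dict(zip(table.keys(), cleaned))


# Hypothetical sample data.
table = {"name": ["  Ada   Lovelace "], "city": ["New   York"]}
serial = standardize_table(table)
threaded = standardize_table(table, parallel_columns=2)
```

Both paths return the same result. For pure-Python work like this, threads on a GIL-enabled interpreter mostly take turns rather than run in parallel, which is consistent with the "roughly neutral" figure in the bullet.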