diff --git a/docs/DEVELOPER.md b/docs/DEVELOPER.md
index a9b0554..05122ab 100644
--- a/docs/DEVELOPER.md
+++ b/docs/DEVELOPER.md
@@ -185,8 +185,18 @@ Fixture corpora: `test-cases/text-cleaner-corpus/` (21 files) · `test-cases/enc
 
 ## Known limitations
 
-- **Dedup is O(n²)** — no blocking. Works to ~50k rows. Future: partition by first letter / ZIP prefix.
-- **Single-threaded** — could benefit from `multiprocessing`.
-- **Memory-bound** — entire file loaded into pandas. Streaming reads exist but not integrated with dedup engine.
+- **Dedup pair-compare is O(n²)** for fuzzy strategies. Exact-only
+  strategies (every column uses `Algorithm.EXACT` at threshold 100)
+  now route through an O(n) groupby fast path automatically — no API
+  change. Fuzzy strategies can opt into prefix blocking via
+  `deduplicate(..., blocking_columns=[...], blocking_prefix_len=1)`
+  to partition pairs by a cheap key (trades recall for speed).
+- **Threading is opt-in for format_standardize** —
+  `StandardizeOptions.parallel_columns > 1` uses a thread pool.
+  On CPython 3.12 the GIL caps the win at roughly neutral; the
+  scaffolding is in place for free-threaded Python 3.13+.
+- **Memory-bound** — entire file loaded into pandas. Streaming reads
+  exist but not integrated with the dedup engine.
 - **No multi-sheet dedup** — each Excel sheet processed independently.
-- **Phonenumbers minimum-length** — international numbers without country codes fall back to digits-only.
+- **Phonenumbers minimum-length** — international numbers without
+  country codes fall back to digits-only.
diff --git a/docs/REQUIREMENTS.md b/docs/REQUIREMENTS.md
index f9472af..2b5c3d0 100644
--- a/docs/REQUIREMENTS.md
+++ b/docs/REQUIREMENTS.md
@@ -83,14 +83,38 @@ Sample size: 1,000 rows (configurable).
   the underlying parsers (phonenumbers, dateutil) rather than Python
   list materialisation. A 1.5 GB CSV with mixed phone+currency+address
   columns finishes in ~1.5–6 minutes depending on column count.
+  `StandardizeOptions.parallel_columns` (default 1, serial) lands the
+  thread-pool scaffolding; on CPython 3.12 with the GIL it's
+  roughly neutral, but the API is ready for the free-threaded
+  (PEP 703) Python 3.13+ build where it will help.
 - **Text cleaner** (`clean_dataframe`): ~1M rows/sec on
   repetition-heavy columns (per-call string cache: the pipeline runs
   once per *unique* cell value, not once per row).
-- **Deduplicator**: known O(n²) match step — works to ~50k rows in
-  comfortable time. The normalisation pass is now LRU-cached per call
-  so repeat values (the common dedup workload) skip re-parsing
-  (~2–5× faster on the normalisation step alone). Scale beyond 50k
-  needs blocking — flagged in `docs/NEXT-STEPS.md`.
+- **Missing handler** (`handle_missing`): lazy-copy — when sentinel
+  standardization runs but finds nothing, AND no drops AND no fills
+  apply, the input frame is returned as-is. On a clean 1 GB file this
+  saves the 1 GB allocation that the unconditional upfront copy used
+  to take.
+- **Column mapper** (`map_columns`): rename + drop both already
+  return fresh frames; the explicit upfront `df.copy()` is now
+  removed and downstream mutating steps (schema-add, coerce) copy on
+  demand via `_ensure_owned()`. Rename-only and identity-mapping
+  paths run with zero explicit copies.
+- **Deduplicator**:
+  - **Exact-only strategies** (every column uses `Algorithm.EXACT` at
+    threshold 100 — covers strong-key dedup like email/phone, the
+    fallback drop-duplicates path, and explicit "match on this exact
+    column" calls) now run in **O(n)** via groupby. Measured: 10k
+    rows on an email-exact strategy → 73 ms (was ~30 minutes via the
+    old O(n²) pair compare).
+  - **Fuzzy strategies** still pair-compare. Opt in to **prefix
+    blocking** via `deduplicate(..., blocking_columns=['name'],
+    blocking_prefix_len=1)` to partition pairs by a cheap key.
+    Measured: 5k rows fuzzy-name dedup → 25.6s with blocking vs.
+    179s without (7× faster). Trade-off: cross-block matches are
+    missed; lower `blocking_prefix_len` widens blocks.
+  - Normalisation pass remains LRU-cached per call so repeat values
+    (the common dedup workload) skip re-parsing.
 
 ## 11. Tools
 1. Deduplicator — Ready
@@ -150,7 +174,7 @@ and proceeds.
 - **Dev**: pytest, tox.
 
 ## 16. Test coverage
-- 1,770 tests passing, 0 skipped, 0 xfailed (incl. perf-shape regression tests).
+- 1,777 tests passing, 0 skipped, 0 xfailed (incl. 15 perf-shape regression tests).
 - Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases + 20-row international stress fixture), missing-handler (3 use cases + 16 edge cases), column-mapper (3 use cases + 5 edge cases).
 - Run: `python run_tests.py [--tool …] [--fixtures] [--coverage]`.
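
Note on the two dedup paths described in this diff: the sketch below illustrates the general idea behind an O(n) exact-key grouping and behind prefix blocking for fuzzy matching. It is not the project's implementation. The function names `exact_groups` and `blocked_fuzzy_pairs`, the `threshold` value, the sample rows, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration; the real `deduplicate(...)` may work differently, and whether it accepts a prefix length of 0 is not stated in the docs.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def exact_groups(rows, key):
    """Hypothetical helper: group row indices by an exact key in one pass (O(n))."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[key]].append(i)
    # Only keys seen more than once form a duplicate group.
    return [idx for idx in groups.values() if len(idx) > 1]


def blocked_fuzzy_pairs(rows, key, prefix_len=1, threshold=0.85):
    """Hypothetical helper: pair-compare only rows sharing a cheap prefix key."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        # The blocking key is the lowercased first `prefix_len` characters.
        blocks[row[key][:prefix_len].lower()].append(i)

    matches = []
    for members in blocks.values():
        # Pairs are still compared quadratically, but only inside each block.
        for a, b in combinations(members, 2):
            ratio = SequenceMatcher(
                None, rows[a][key].lower(), rows[b][key].lower()
            ).ratio()
            if ratio >= threshold:
                matches.append((a, b))
    return matches


# Illustrative data only.
rows = [
    {"name": "Jon Smith", "email": "jon@example.com"},
    {"name": "John Smith", "email": "jon@example.com"},
    {"name": "Kathy Lee", "email": "kathy@example.com"},
    {"name": "Cathy Lee", "email": "cathy@example.com"},
]

exact = exact_groups(rows, "email")
blocked = blocked_fuzzy_pairs(rows, "name", prefix_len=1)
unblocked = blocked_fuzzy_pairs(rows, "name", prefix_len=0)
```

With `prefix_len=1`, "Kathy Lee" and "Cathy Lee" land in different blocks and are never compared, which is the recall trade-off the docs mention. In this sketch, `prefix_len=0` puts every row in one block and recovers that pair at the cost of comparing all pairs.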