3 Commits

Author SHA1 Message Date
db5ec084da docs+code: rename tool labels everywhere
Sweep follow-up to 93e43fc. Display labels now consistent across docs,
landing pages, CLI output, code comments, docstrings, and test prose.
Five parallel surfaces touched:

- docs (EN + ES): README, USER-GUIDE, CLI-REFERENCE, and 11 internal
  design/planning docs
- landing pages: index + bookkeeper/revops/shopify-pet
- src: CLI module docstrings, _TOOL_DISPLAY dicts in cli_analyze.py
  and gui/components/_legacy.py, core module headers, every tool
  page's module docstring
- tests: class/method/module docstrings and section-header comments
- test-cases READMEs

Page slugs (1_Deduplicator etc.), tool_id strings (01_deduplicator
etc.), Python class names (TestDeduplicatorWorkflow, FeatureFlag.*),
URL paths, anchor IDs, CSS classes, and asset filenames were left
intact since they're code identifiers / structural references.

All 2033 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 19:50:09 +00:00
64452dd783 perf: dedup blocking, column-parallel scaffolding, lazy-copy pipelines
Three follow-on wins from the audit, each with shape-pinning tests.

1. Dedup blocking
   - Exact-only strategies (every column EXACT @ 100 — covers strong-
     key dedup like email/phone, the drop-duplicates fallback, and
     explicit "match on this exact column" calls) now route through
     an O(n) groupby fast path. Lossless; no API change required.
     Measured: 10k-row email-exact dedup → 73 ms (was ~30 minutes
     via the O(n²) pair compare).
   - Fuzzy strategies still pair-compare, with opt-in prefix blocking
     via deduplicate(..., blocking_columns=[...], blocking_prefix_len=1).
     Measured: 5k-row fuzzy-name → 25.6s with blocking vs 179s
     without (7x). Trade-off: cross-block matches missed.

2. Column-parallel standardize
   - StandardizeOptions.parallel_columns (default 1) lands a
     ThreadPoolExecutor over the column loop. Output order and
     audit-record order are preserved deterministically via a merge
     step keyed off column_types order. Honest doc: under CPython
     3.12's GIL the win is roughly neutral (phonenumbers/dateutil
     hold the GIL); the API is ready for free-threaded Python 3.13+.

3. Lazy-copy in missing / column_mapper
   - _standardize_sentinels now builds per-column changes in a dict
     and only materialises the output frame when at least one column
     actually changed. On a clean 1 GB file this skips a 1 GB
     allocation.
   - handle_missing carries an out_is_owned flag, copying on demand
     before any mutating step. No-op runs return the input frame.
   - map_columns drops the unconditional upfront df.copy(); rename
     and drop both return fresh frames already, and schema-add /
     coerce trigger _ensure_owned() lazily.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:54:25 +00:00
5b672370a6 perf: cache hot paths, drop wasted allocations, lift 1 GB → 1.5 GB
Five targeted wins driven by an end-to-end audit, with shape-pinning
regression tests so reverts are loud:

- format_standardize: fuse the dispatcher loop into one pass — was
  calling Series.tolist() three times per typed column and materialising
  an intermediate triples list; now one tolist, one walk. On a
  synthetic 1M-row phone+email frame this measures ~2.7M rows/sec
  (vs. the previous 150k/sec doc target).
- dedup: wrap normalizers in a per-call lru_cache so repeat phones /
  emails / addresses skip re-parsing. phonenumbers.parse is the
  expensive call; ~2–5x faster on the normalisation step for realistic
  workloads.
- analyze: _detect_near_duplicates no longer copies the full input
  frame; builds only the normalised string columns via a dict and
  references non-string columns by view. Skips the redundant
  astype(str) when a column is already pandas string dtype.
- text_clean: hoist _build_pipeline out of the per-cell loop and add a
  per-call string cache so 100k repeats of "Active" only run the
  pipeline once. ~1M rows/sec on repetition-heavy columns.
- io.repair_bytes: the non-UTF-8 smart-quote fold path used a
  Python-level zip walk over the entire decoded string to count
  replacements — replaced with sum(text.count(c) ...) which runs in
  C at ~GB/s. Was a latent ~100s on a 1 GB cp1252 file; now <1s.

Updates REQUIREMENTS §10 with measured numbers and bumps the buyer-
facing upload limit from 1 GB to 1.5 GB across the i18n packs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:37:26 +00:00