datatools-dev/tests at d0423a891229ac9474e1d8f407dbcf2e9ec159b1

giteadmin/datatools-dev

Michael 64452dd783 perf: dedup blocking, column-parallel scaffolding, lazy-copy pipelines

Three follow-on wins from the audit, each with shape-pinning tests.

1. Dedup blocking
   - Exact-only strategies (every column EXACT @ 100 — covers strong-
     key dedup like email/phone, the drop-duplicates fallback, and
     explicit "match on this exact column" calls) now route through
     an O(n) groupby fast path. Lossless; no API change required.
     Measured: 10k-row email-exact dedup → 73 ms (was ~30 minutes
     via the O(n²) pair compare).
   - Fuzzy strategies still pair-compare, with opt-in prefix blocking
     via deduplicate(..., blocking_columns=[...], blocking_prefix_len=1).
     Measured: 5k-row fuzzy-name → 25.6 s with blocking vs 179 s
     without (about 7x faster). Trade-off: rows that land in different
     blocks are never compared, so cross-block matches are missed.
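
The two dedup paths above can be sketched in plain Python. This is an illustration under assumptions, not the project's code: rows are plain dicts standing in for a DataFrame, and exact_duplicate_groups and blocked_candidate_pairs are hypothetical names. Lower-casing the blocking prefix is also an assumption; the commit does not say how the prefix is normalised.

```python
from collections import defaultdict
from itertools import combinations


def exact_duplicate_groups(rows, key_columns):
    """Group row indices that agree exactly on every key column (one O(n) pass)."""
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[tuple(row[col] for col in key_columns)].append(idx)
    # Only keys seen more than once form a duplicate group.
    return [idxs for idxs in groups.values() if len(idxs) > 1]


def blocked_candidate_pairs(rows, blocking_columns, prefix_len=1):
    """Yield index pairs only for rows whose blocking-key prefixes are equal."""
    blocks = defaultdict(list)
    for idx, row in enumerate(rows):
        # Assumption: prefixes are compared case-insensitively.
        key = tuple(str(row[col])[:prefix_len].lower() for col in blocking_columns)
        blocks[key].append(idx)
    for idxs in blocks.values():
        # The quadratic pair compare now runs per block, not over all rows.
        yield from combinations(idxs, 2)


rows = [
    {"email": "a@x.com", "name": "Ann Lee"},
    {"email": "b@x.com", "name": "Anne Lee"},
    {"email": "a@x.com", "name": "Bob Ray"},
    {"email": "c@x.com", "name": "Rob Ray"},
]
exact_groups = exact_duplicate_groups(rows, ["email"])
pairs = list(blocked_candidate_pairs(rows, ["name"], prefix_len=1))
print(exact_groups)  # [[0, 2]]
print(pairs)         # [(0, 1)]
```

In this toy data "Bob Ray" and "Rob Ray" fall into different blocks, so a fuzzy matcher never sees that pair. That is the cross-block miss the trade-off refers to.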

2. Column-parallel standardize
   - StandardizeOptions.parallel_columns (default 1) runs the column
     loop on a ThreadPoolExecutor. Output order and audit-record
     order stay deterministic via a merge step keyed off
     column_types order. Caveat, documented as such: under CPython
     3.12's GIL the win is roughly neutral (phonenumbers/dateutil
     hold the GIL); the API is ready for free-threaded Python 3.13+.

3. Lazy-copy in missing / column_mapper
   - _standardize_sentinels now builds per-column changes in a dict
     and only materialises the output frame when at least one column
     actually changed. On a clean 1 GB file this skips a 1 GB
     allocation.
   - handle_missing carries an out_is_owned flag, copying on demand
     before any mutating step. No-op runs return the input frame.
   - map_columns drops the unconditional upfront df.copy(); rename
     and drop both return fresh frames already, and schema-add /
     coerce trigger _ensure_owned() lazily.
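
The lazy-copy pattern can be sketched as follows, again with a dict of lists standing in for a DataFrame. The function bodies, the sentinel set, and the fill_values parameter are assumptions for illustration; only the names _standardize_sentinels, handle_missing and out_is_owned come from the commit.

```python
def standardize_sentinels(table, sentinels=("N/A", "null", "")):
    """Replace sentinel strings with None, allocating output only if needed."""
    changes = {}
    for name, values in table.items():
        if any(v in sentinels for v in values):
            changes[name] = [None if v in sentinels else v for v in values]
    if not changes:
        return table  # clean input: return the same object, no allocation
    out = dict(table)  # shallow copy; unchanged columns are shared
    out.update(changes)
    return out


def handle_missing(table, fill_values):
    """Fill None cells per column, copying only before the first mutation."""
    out, out_is_owned = table, False
    for name, fill in fill_values.items():
        if name in out and any(v is None for v in out[name]):
            if not out_is_owned:
                out, out_is_owned = dict(table), True  # copy on demand
            out[name] = [fill if v is None else v for v in out[name]]
    return out


clean = {"a": ["1", "2"], "b": ["x", "y"]}
dirty = {"a": ["1", "N/A"], "b": ["x", "y"]}

same = standardize_sentinels(clean)
fixed = standardize_sentinels(dirty)
filled = handle_missing(fixed, {"a": "0"})
print(same is clean)  # True
print(filled["a"])    # ['1', '0']
```

Returning the input object on a no-op run means callers must treat the result as read-only unless they take their own copy. That is the contract an ownership flag such as out_is_owned makes explicit.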

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-13 15:54:25 +00:00

__init__.py

feat: add documentation, Streamlit GUI, and full source tree

2026-04-28 23:06:39 +00:00

conftest.py

feat: add documentation, Streamlit GUI, and full source tree

2026-04-28 23:06:39 +00:00

test_analyze.py

feat: 3 new tools, format streaming, distribution-ready demo + landing pages

2026-05-01 22:31:26 +00:00

test_audit_fixes.py

feat(errors): structured error hierarchy + helpful messages everywhere

2026-05-01 02:35:42 +00:00

test_cli_analyze.py

feat(cli): src.cli_analyze — Typer CLI for the analyzer

2026-04-29 15:53:11 +00:00

test_cli_text_clean.py

feat: implement text cleaner (script 02) with CLI, GUI, and tests

2026-04-29 15:14:15 +00:00

test_cli.py

feat: add documentation, Streamlit GUI, and full source tree

2026-04-28 23:06:39 +00:00

test_column_mapper_corpus.py

feat: 3 new tools, format streaming, distribution-ready demo + landing pages

2026-05-01 22:31:26 +00:00

test_column_mapper.py

feat: 3 new tools, format streaming, distribution-ready demo + landing pages

2026-05-01 22:31:26 +00:00

test_config.py

feat: add documentation, Streamlit GUI, and full source tree

2026-04-28 23:06:39 +00:00

test_corpus.py

feat: 3 new tools, format streaming, distribution-ready demo + landing pages

2026-05-01 22:31:26 +00:00

test_dedup.py

feat: add documentation, Streamlit GUI, and full source tree

2026-04-28 23:06:39 +00:00

test_e2e.py

test: single-command runner, cross-platform automation, fixture auto-discovery

2026-04-29 16:01:06 +00:00

test_encodings_corpus.py

feat: 3 new tools, format streaming, distribution-ready demo + landing pages

2026-05-01 22:31:26 +00:00

test_errors.py

feat(errors): structured error hierarchy + helpful messages everywhere

2026-05-01 02:35:42 +00:00

test_fixes_unit.py

feat: 3 new tools, format streaming, distribution-ready demo + landing pages

2026-05-01 22:31:26 +00:00

test_fixtures_sweep.py

test: single-command runner, cross-platform automation, fixture auto-discovery

2026-04-29 16:01:06 +00:00

test_format_intl_corpus.py

feat: 3 new tools, format streaming, distribution-ready demo + landing pages

2026-05-01 22:31:26 +00:00

test_format_standardize_corpus.py

feat: 3 new tools, format streaming, distribution-ready demo + landing pages

2026-05-01 22:31:26 +00:00

test_format_standardize.py

feat(format): per-cell standardizers + 199-row buyer corpus

2026-05-01 02:11:24 +00:00

test_format_streaming.py

feat: 3 new tools, format streaming, distribution-ready demo + landing pages

2026-05-01 22:31:26 +00:00

test_gap_coverage.py

feat: 3 new tools, format streaming, distribution-ready demo + landing pages

2026-05-01 22:31:26 +00:00

test_i18n.py

feat(format-i18n): broaden international coverage across all domains

2026-05-01 03:06:03 +00:00

test_install.py

test: single-command runner, cross-platform automation, fixture auto-discovery

2026-04-29 16:01:06 +00:00

test_io.py

feat(errors): structured error hierarchy + helpful messages everywhere

2026-05-01 02:35:42 +00:00

test_lang_packs.py

feat(i18n): add language-pack scaffold with English and Spanish

2026-05-13 15:11:30 +00:00

test_missing_corpus.py

feat: 3 new tools, format streaming, distribution-ready demo + landing pages

2026-05-01 22:31:26 +00:00

test_missing.py

feat: 3 new tools, format streaming, distribution-ready demo + landing pages

2026-05-01 22:31:26 +00:00

test_normalize.py

feat(gate): CSV-normalization gate with confidence-tiered findings

2026-04-29 20:35:27 +00:00

test_normalizers.py

fix: cross-tool audit findings + alignment with format standardizer

2026-05-01 02:11:57 +00:00

test_perf_regressions.py

perf: dedup blocking, column-parallel scaffolding, lazy-copy pipelines

2026-05-13 15:54:25 +00:00

test_pipeline.py

feat: 3 new tools, format streaming, distribution-ready demo + landing pages

2026-05-01 22:31:26 +00:00

test_text_clean.py

fix: cross-tool audit findings + alignment with format standardizer

2026-05-01 02:11:57 +00:00