From e5f632bcd6b75accbd142175a2926c8810785747 Mon Sep 17 00:00:00 2001
From: Michael
Date: Wed, 13 May 2026 15:37:26 +0000
Subject: [PATCH] docs(perf): publish 1.5 GB target and the new measured throughputs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

REQUIREMENTS §10 reflects the post-optimisation numbers and the known
O(n²) dedup match step (flagged for a future blocking pass).
en/es upload-limit copy and uploader help now say 1.5 GB.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/REQUIREMENTS.md   | 36 ++++++++++++++++++++++++++----------
 src/i18n/packs/en.json |  6 +++---
 src/i18n/packs/es.json |  6 +++---
 3 files changed, 32 insertions(+), 16 deletions(-)

diff --git a/docs/REQUIREMENTS.md b/docs/REQUIREMENTS.md
index 193c81d..f9472af 100644
--- a/docs/REQUIREMENTS.md
+++ b/docs/REQUIREMENTS.md
@@ -3,7 +3,7 @@
 Numbered support matrix. Updated with every shipped capability.
 
 ## 1. File handling
-1.1 Size: ≤ 1 GB target (larger works, slower).
+1.1 Size: ≤ 1.5 GB target (larger works, slower).
 1.2 Read: CSV, TSV, XLSX, XLS.
 1.3 Write: CSV, TSV.
 1.4 Excel: multi-sheet picker.
@@ -64,17 +64,33 @@ Sample size: 1,000 rows (configurable).
 - `skip` — waive (audit-logged).
 - `modified` — apply with custom payload.
 
-## 10. Performance (1 GB input)
+## 10. Performance (1.5 GB input)
 - Initial scan (sample): < 2 s · peak RSS ~110 MB.
-- Full-file `repair_bytes`: 30–40 s.
-- Full-DataFrame analyze: ~4 min (~25 µs/cell).
+- Full-file `repair_bytes`: 30–40 s (UTF-8); non-UTF-8 fold path now
+  uses ``str.count`` instead of a Python char-by-char zip walk —
+  formerly ~100 s on a 1 GB cp1252 file with smart quotes, now <1 s.
+- Full-DataFrame analyze: ~4 min (~25 µs/cell). Near-duplicate detector
+  no longer allocates a full-frame copy — peak RSS during the
+  near-duplicate pass drops to roughly the size of the string columns
+  alone (~50% memory cut on text-heavy 1 GB inputs).
 - Full-DataFrame `auto_fix`: ~5 min (~30 µs/cell).
 - Output write: ~10 s.
-- Recommended RAM: 4× input size for full-Apply path.
-- Format standardizer (`standardize_file`): ~150k rows/sec on cache-warm
-  international data; chunk-bounded RAM (~50 MB peak at default
-  chunk_size=50,000). A 1 GB CSV with mixed phone+currency+address
-  columns finishes in ~2.5–10 minutes depending on column count.
+- Recommended RAM: 3–4× input size for the full-Apply path.
+- **Format standardizer** (`standardize_dataframe`): ~2.7M rows/sec on
+  cache-warm repetition-heavy columns (synthetic 1M-row in-memory
+  benchmark, 2 typed columns); the fused single-pass loop replaced a
+  3-pass ``.tolist()`` cycle, so per-call overhead is now dominated by
+  the underlying parsers (phonenumbers, dateutil) rather than Python
+  list materialisation. A 1.5 GB CSV with mixed phone+currency+address
+  columns finishes in ~1.5–6 minutes depending on column count.
+- **Text cleaner** (`clean_dataframe`): ~1M rows/sec on
+  repetition-heavy columns (per-call string cache: the pipeline runs
+  once per *unique* cell value, not once per row).
+- **Deduplicator**: known O(n²) match step — works to ~50k rows in
+  comfortable time. The normalisation pass is now LRU-cached per call
+  so repeat values (the common dedup workload) skip re-parsing
+  (~2–5× faster on the normalisation step alone). Scale beyond 50k
+  needs blocking — flagged in `docs/NEXT-STEPS.md`.
 
 ## 11. Tools
 1. Deduplicator — Ready
@@ -134,7 +150,7 @@ and proceeds.
 - **Dev**: pytest, tox.
 
 ## 16. Test coverage
-- 1,762 tests passing, 0 skipped, 0 xfailed.
+- 1,770 tests passing, 0 skipped, 0 xfailed (incl. perf-shape regression tests).
 - Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases + 20-row international stress fixture), missing-handler (3 use cases + 16 edge cases), column-mapper (3 use cases + 5 edge cases).
 - Run: `python run_tests.py [--tool …] [--fixtures] [--coverage]`.
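Two of the §10 notes above can be illustrated briefly: running a cleaning pipeline once per unique cell value, and the blocking pass that the deduplicator note defers to `docs/NEXT-STEPS.md`. The sketch below is a minimal, hypothetical version of both ideas. `apply_per_unique` and `blocked_pairs` are illustrative names that do not come from this codebase, and the project's actual `clean_dataframe` and deduplicator may work differently.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterator, List, Sequence, Tuple


def apply_per_unique(values: Sequence[str], fn: Callable[[str], str]) -> List[str]:
    """Apply fn to each value, but call it only once per distinct value.

    Hypothetical helper: the cache lives for one call only, so memory is
    bounded by the number of unique values in this column.
    """
    cache: Dict[str, str] = {}
    out: List[str] = []
    for value in values:
        if value not in cache:
            cache[value] = fn(value)  # the expensive pipeline runs here
        out.append(cache[value])
    return out


def blocked_pairs(
    rows: Sequence[str], key: Callable[[str], Hashable]
) -> Iterator[Tuple[int, int]]:
    """Yield candidate index pairs, comparing rows only within the same block.

    Hypothetical helper: rows are grouped by a cheap blocking key, so the
    quadratic comparison happens per block rather than over the whole input.
    """
    blocks: Dict[Hashable, List[int]] = defaultdict(list)
    for index, row in enumerate(rows):
        blocks[key(row)].append(index)
    for indices in blocks.values():
        for a in range(len(indices)):
            for b in range(a + 1, len(indices)):
                yield indices[a], indices[b]
```

With a blocking key such as the first letter of a name, four rows split into blocks of three and one produce three candidate pairs instead of six. The trade-off is that true duplicates landing in different blocks are never compared, so the choice of key matters.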
diff --git a/src/i18n/packs/en.json b/src/i18n/packs/en.json
index 23dd408..935d11a 100644
--- a/src/i18n/packs/en.json
+++ b/src/i18n/packs/en.json
@@ -17,9 +17,9 @@
 "upload": {
 "heading": "📤 Upload a file to start",
 "intro": "Optional: scan an uploaded file for data quality issues and see which tools can fix each one. Skip if you already know what you need.",
-"limits": "**Up to 1 GB.** Formats: CSV, TSV, XLSX, XLS. Delimiters auto-detected: comma, tab, semicolon, pipe. Encodings auto-detected: UTF-8 (with/without BOM), UTF-16, cp1252, Latin-1/9, cp1250, ISO-8859-2, cp1251, KOI8-R, Mac Roman, Shift_JIS, GB18030, Big5, EUC-KR — and override on the Review page.",
+"limits": "**Up to 1.5 GB.** Formats: CSV, TSV, XLSX, XLS. Delimiters auto-detected: comma, tab, semicolon, pipe. Encodings auto-detected: UTF-8 (with/without BOM), UTF-16, cp1252, Latin-1/9, cp1250, ISO-8859-2, cp1251, KOI8-R, Mac Roman, Shift_JIS, GB18030, Big5, EUC-KR — and override on the Review page.",
 "uploader_label": "Upload CSV or Excel",
-"uploader_help": "Up to 1 GB. Comma / tab / semicolon / pipe delimiters all auto-detected. Encoding auto-detected with override on the Review page if needed.",
+"uploader_help": "Up to 1.5 GB. Comma / tab / semicolon / pipe delimiters all auto-detected. Encoding auto-detected with override on the Review page if needed.",
 "run_button": "Run analysis",
 "skip_button": "Skip",
 "scanning": "Scanning…",
@@ -27,7 +27,7 @@
 "using_session_file": "Using **{name}** from the upload screen.",
 "use_different_file": "Use a different file",
 "switch_back": "Switch back to upload-screen file",
-"pickup_caption": "Up to 1 GB. Delimiters auto-detected: comma, tab, semicolon, pipe. Encoding auto-detected (UTF-8 / UTF-16 / cp1252 / Latin-1 family / cp1250 / cp1251 / KOI8-R / Mac Roman / Shift_JIS / GB18030 / Big5 / EUC-KR), with override on the Review page."
+"pickup_caption": "Up to 1.5 GB. Delimiters auto-detected: comma, tab, semicolon, pipe. Encoding auto-detected (UTF-8 / UTF-16 / cp1252 / Latin-1 family / cp1250 / cp1251 / KOI8-R / Mac Roman / Shift_JIS / GB18030 / Big5 / EUC-KR), with override on the Review page."
 },
 "findings": {
 "header": "Detected issues",
diff --git a/src/i18n/packs/es.json b/src/i18n/packs/es.json
index ac6bfc7..55f25e4 100644
--- a/src/i18n/packs/es.json
+++ b/src/i18n/packs/es.json
@@ -17,9 +17,9 @@
 "upload": {
 "heading": "📤 Sube un archivo para empezar",
 "intro": "Opcional: analiza un archivo para detectar problemas de calidad de datos y ver qué herramientas pueden corregir cada uno. Sáltalo si ya sabes lo que necesitas.",
-"limits": "**Hasta 1 GB.** Formatos: CSV, TSV, XLSX, XLS. Delimitadores detectados automáticamente: coma, tabulador, punto y coma, barra vertical. Codificaciones detectadas automáticamente: UTF-8 (con/sin BOM), UTF-16, cp1252, Latin-1/9, cp1250, ISO-8859-2, cp1251, KOI8-R, Mac Roman, Shift_JIS, GB18030, Big5, EUC-KR — y se pueden sustituir desde la página Revisar.",
+"limits": "**Hasta 1,5 GB.** Formatos: CSV, TSV, XLSX, XLS. Delimitadores detectados automáticamente: coma, tabulador, punto y coma, barra vertical. Codificaciones detectadas automáticamente: UTF-8 (con/sin BOM), UTF-16, cp1252, Latin-1/9, cp1250, ISO-8859-2, cp1251, KOI8-R, Mac Roman, Shift_JIS, GB18030, Big5, EUC-KR — y se pueden sustituir desde la página Revisar.",
 "uploader_label": "Sube un archivo CSV o Excel",
-"uploader_help": "Hasta 1 GB. Delimitadores coma / tabulador / punto y coma / barra vertical detectados automáticamente. Codificación detectada automáticamente, con opción de sustituirla en la página Revisar.",
+"uploader_help": "Hasta 1,5 GB. Delimitadores coma / tabulador / punto y coma / barra vertical detectados automáticamente. Codificación detectada automáticamente, con opción de sustituirla en la página Revisar.",
 "run_button": "Ejecutar análisis",
 "skip_button": "Omitir",
 "scanning": "Analizando…",
@@ -27,7 +27,7 @@
 "using_session_file": "Usando **{name}** de la pantalla de carga.",
 "use_different_file": "Usar otro archivo",
 "switch_back": "Volver al archivo de la pantalla de carga",
-"pickup_caption": "Hasta 1 GB. Delimitadores detectados automáticamente: coma, tabulador, punto y coma, barra vertical. Codificación detectada automáticamente (UTF-8 / UTF-16 / cp1252 / familia Latin-1 / cp1250 / cp1251 / KOI8-R / Mac Roman / Shift_JIS / GB18030 / Big5 / EUC-KR), con opción de sustituirla en la página Revisar."
+"pickup_caption": "Hasta 1,5 GB. Delimitadores detectados automáticamente: coma, tabulador, punto y coma, barra vertical. Codificación detectada automáticamente (UTF-8 / UTF-16 / cp1252 / familia Latin-1 / cp1250 / cp1251 / KOI8-R / Mac Roman / Shift_JIS / GB18030 / Big5 / EUC-KR), con opción de sustituirla en la página Revisar."
 },
 "findings": {
 "header": "Problemas detectados",