REQUIREMENTS §16 updates the test count (1777 → 1916) and breaks out the GUI subset. DEVELOPER's Tests section gains the 'gui' marker recipes and the new tests/gui/ tree under test layout, plus a short 'GUI test layer' explainer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9.5 KiB
Requirements
Numbered support matrix. Updated with every shipped capability.
1. File handling
1.1 Size: ≤ 1.5 GB target (larger works, slower).
1.2 Read: CSV, TSV, XLSX, XLS.
1.3 Write: CSV, TSV.
1.4 Excel: multi-sheet picker.
1.5 Empty file: blocked with empty_input error finding.
2. Input encodings (auto-detected)
2.1 Unicode: UTF-8, UTF-8-BOM, UTF-16 LE/BE BOM, UTF-16 LE no-BOM.
2.2 Western: cp1252, ISO-8859-1, ISO-8859-15, Mac Roman.
2.3 Eastern European: cp1250, ISO-8859-2.
2.4 Cyrillic: cp1251, KOI8-R.
2.5 CJK: Shift_JIS / cp932, GB18030, Big5, EUC-KR / cp949.
2.6 ASCII → detected as UTF-8.
2.7 User override: any Python codec name.
2.8 BOM: stripped on read, never written.
2.9 Decode failure → encoding_decode_failed (error).
2.10 U+FFFD in output → encoding_uncertain (error).
3. Output encodings
3.1 UTF-8 (default), UTF-8-BOM (Excel-friendly).
3.2 cp1252, ISO-8859-1/15, cp1250, ISO-8859-2, cp1251.
3.3 Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE.
3.4 Lossy fallback: ? + warning when codec can't represent a char.
4. Delimiters
4.1 Input auto-detect: ,, \t, ;, |.
4.2 Output: , (default), \t, ;, |.
4.3 Extension: .tsv for tab, .csv otherwise.
5. Line endings
5.1 Read: LF / CRLF / bare CR — all normalized to LF.
5.2 Embedded in quoted cells: also normalized to LF.
5.3 Write: LF (default), CRLF, CR.
5.4 Mixed → mixed_line_endings finding.
6. Analyzer detectors
File-level (read-time fixes, audit-logged):
csv_bom_stripped,csv_nul_stripped,csv_smart_quotes_folded,csv_line_endings_normalized,csv_transcoded_to_utf8,csv_unquoted_delimiters_repaired,csv_unrepairable_rows.
Cell-level:
smart_punctuation_in_data,nbsp_or_unicode_whitespace,zero_width_or_invisible,dirty_column_headers,whitespace_padding,null_like_sentinels,suspected_mojibake,mixed_case_email_column,inconsistent_date_format,near_duplicate_rows,leading_zero_ids.
Encoding integrity: encoding_uncertain, encoding_decode_failed, encoding_lying_bom, empty_input.
Sample size: 1,000 rows (configurable).
7. Finding fields
id, severity (info/warn/error), confidence (high/medium/low), fix_action, pre_applied, tool, count, description, column, samples (≤5).
8. Confidence tiers
- high — round-trip safe, one-click auto-fix.
- medium — preview before applying.
- low — opt-in only, can corrupt if wrong.
- error — must resolve or waive before tool pages unlock.
9. Decision actions
auto— apply registered fix.skip— waive (audit-logged).modified— apply with custom payload.
10. Performance (1.5 GB input)
- Initial scan (sample): < 2 s · peak RSS ~110 MB.
- Full-file
repair_bytes: 30–40 s (UTF-8); non-UTF-8 fold path now usesstr.countinstead of a Python char-by-char zip walk — formerly ~100 s on a 1 GB cp1252 file with smart quotes, now <1 s. - Full-DataFrame analyze: ~4 min (~25 µs/cell). Near-duplicate detector no longer allocates a full-frame copy — peak RSS during the near-duplicate pass drops to roughly the size of the string columns alone (~50% memory cut on text-heavy 1 GB inputs).
- Full-DataFrame
auto_fix: ~5 min (~30 µs/cell). - Output write: ~10 s.
- Recommended RAM: 3–4× input size for the full-Apply path.
- Format standardizer (
standardize_dataframe): ~2.7M rows/sec on cache-warm repetition-heavy columns (synthetic 1M-row in-memory benchmark, 2 typed columns); the fused single-pass loop replaced a 3-pass.tolist()cycle, so per-call overhead is now dominated by the underlying parsers (phonenumbers, dateutil) rather than Python list materialisation. A 1.5 GB CSV with mixed phone+currency+address columns finishes in ~1.5–6 minutes depending on column count.StandardizeOptions.parallel_columns(default 1, serial) lands the thread-pool scaffolding; on CPython 3.12 with the GIL it's roughly neutral, but the API is ready for the free-threaded (PEP 703) Python 3.13+ build where it will help. - Text cleaner (
clean_dataframe): ~1M rows/sec on repetition-heavy columns (per-call string cache: the pipeline runs once per unique cell value, not once per row). - Missing handler (
handle_missing): lazy-copy — when sentinel standardization runs but finds nothing, AND no drops AND no fills apply, the input frame is returned as-is. On a clean 1 GB file this saves the 1 GB allocation that the unconditional upfront copy used to take. - Column mapper (
map_columns): rename + drop both already return fresh frames; the explicit upfrontdf.copy()is now removed and downstream mutating steps (schema-add, coerce) copy on demand via_ensure_owned(). Rename-only and identity-mapping paths run with zero explicit copies. - Deduplicator:
- Exact-only strategies (every column uses
Algorithm.EXACTat threshold 100 — covers strong-key dedup like email/phone, the fallback drop-duplicates path, and explicit "match on this exact column" calls) now run in O(n) via groupby. Measured: 10k rows on an email-exact strategy → 73 ms (was ~30 minutes via the old O(n²) pair compare). - Fuzzy strategies still pair-compare. Opt in to prefix
blocking via
deduplicate(..., blocking_columns=['name'], blocking_prefix_len=1)to partition pairs by a cheap key. Measured: 5k rows fuzzy-name dedup → 25.6s with blocking vs. 179s without (7× faster). Trade-off: cross-block matches are missed; lowerblocking_prefix_lenwidens blocks. - Normalisation pass remains LRU-cached per call so repeat values (the common dedup workload) skip re-parsing.
- Exact-only strategies (every column uses
11. Tools
- Deduplicator — Ready
- Text Cleaner — Ready
- Format Standardizer — Ready
- Missing Value Handler — Ready
- Column Mapper — Ready
- Outlier Detector — Coming Soon
- Multi-File Merger — Coming Soon
- Validator & Reporter — Coming Soon
- Pipeline Runner — Ready
11.a Recommended pipeline order (soft, not enforced)
The Pipeline Runner ships with a SOFT_DEPENDENCIES table; the
following ordering is the default and the basis of the warning
surface. Re-ordering is allowed; the runner emits a warning string
and proceeds.
| # | Tool | Why this slot |
|---|---|---|
| 1 | column_map (optional, for header alignment) | Multi-vendor unification — rename early so downstream tools see canonical headers |
| 2 | text_clean | NBSP / smart quotes / zero-width pollution silently breaks downstream parsers |
| 3 | format_standardize | Phones / dates / currencies → canonical form before missing detection and dedup |
| 4 | missing | Sentinel detection, imputation, drop strategies — needs canonical types |
| 5 | column_map (optional, for schema enforcement) | Project to target schema, coerce, drop extras AFTER cleaning |
| 6 | dedup | Fuzzy matching is most accurate on canonicalised, sentinel-laundered data |
12. Gate (Review & Normalize)
- Gates every tool page.
- Auto-fix button: applies all
confidence=highfindings in one click. - Per-finding controls: Auto / Skip / Customize.
- Live before/after preview (≤5 sample rows).
- Audit log per fix (id, decision, cells changed).
- Encoding-override picker (16 codepages + custom).
- Advanced output expander: encoding + delimiter + line terminator.
- Result keyed by upload SHA-256; survives reload, invalidated on re-upload.
13. Interfaces
- GUI: Streamlit, browser-based, local, no internet. Sidebar language picker (English, Español).
- CLI:
python -m src.cli(dedup) ·src.cli_text_clean·src.cli_format·src.cli_missing·src.cli_column_map·src.cli_pipeline·src.cli_analyze. (CLI output is English-only.) - Python API:
from src.core import …(analyze, repair_bytes, clean_dataframe, deduplicate, standardize_dataframe, …). - JSON output:
--jsononcli_analyze. - Language packs:
from src.i18n import t, LANGUAGES. Add<code>.jsontosrc/i18n/packs/+ entry inLANGUAGESto add a language.
14. Platforms
- Python ≥ 3.10.
- OS: Linux, macOS, Windows.
- Browser: any modern browser.
- Network: not required at runtime.
15. Dependencies
- Core: pandas, openpyxl, charset-normalizer, typer, loguru.
- Dedup: rapidfuzz, phonenumbers.
- GUI: streamlit.
- Optional: ftfy (mojibake repair).
- Dev: pytest, tox.
16. Test coverage
- 1,916 tests passing, 0 skipped, 0 xfailed.
- 1,777 core + CLI tests (run with
pytest -m 'not gui'for a quick loop). - 139 GUI tests under
tests/gui/driving Streamlit pages viaAppTest(smoke + EN/ES localization, chrome, gate, workflows, dedup review, advanced panels, error paths, findings panel). Markedgui. - Includes 15 perf-shape regression tests.
- 1,777 core + CLI tests (run with
- Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases + 20-row international stress fixture), missing-handler (3 use cases + 16 edge cases), column-mapper (3 use cases + 5 edge cases).
- Run:
python run_tests.py [--tool …] [--fixtures] [--coverage].
17. Privacy / data handling
- All processing local; no network calls in the data path.
- No telemetry.
- Original input never modified.
- Audit logs:
logs/next to each run (timestamped).
18. Error handling
- Structured hierarchy:
DataToolsError→InputValidationError,ConfigError,FileFormatError,FileAccessError. - Subclasses extend stdlib
ValueError/OSErrorso existing handlers still catch them. - Every error carries: message, file path, column, operation, suggestion, underlying cause.