Requirements
Numbered support matrix. Updated with every shipped capability.
1. File handling
1.1 Size: ≤ 1.5 GB target (larger works, slower).
1.2 Read: CSV, TSV, XLSX, XLS.
1.3 Write: CSV, TSV.
1.4 Excel: multi-sheet picker.
1.5 Empty file: blocked with empty_input error finding.
2. Input encodings (auto-detected)
2.1 Unicode: UTF-8, UTF-8-BOM, UTF-16 LE/BE BOM, UTF-16 LE no-BOM.
2.2 Western: cp1252, ISO-8859-1, ISO-8859-15, Mac Roman.
2.3 Eastern European: cp1250, ISO-8859-2.
2.4 Cyrillic: cp1251, KOI8-R.
2.5 CJK: Shift_JIS / cp932, GB18030, Big5, EUC-KR / cp949.
2.6 ASCII → detected as UTF-8.
2.7 User override: any Python codec name.
2.8 BOM: stripped on read, never written.
2.9 Decode failure → encoding_decode_failed (error).
2.10 U+FFFD in output → encoding_uncertain (error).
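The BOM and failure rules in 2.8 to 2.10 can be sketched as follows. This is a minimal illustration, not the project's detector: the function name `decode_with_findings` and its return shape are assumptions, and it only handles BOM sniffing, a UTF-8 default, and a user override (the statistical detection of legacy codepages is left out).

```python
import codecs
from typing import Optional

# BOMs for the Unicode encodings listed in 2.1.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_findings(raw: bytes, override: Optional[str] = None) -> tuple[str, list[str]]:
    """Hypothetical helper: decode bytes and report finding ids."""
    findings: list[str] = []
    encoding = override or "utf-8"
    for bom, name in _BOMS:
        if raw.startswith(bom):
            raw = raw[len(bom):]          # 2.8: BOM is stripped on read
            if override is None:
                encoding = name
            findings.append("csv_bom_stripped")
            break
    try:
        text = raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # 2.9: undecodable bytes (or an unknown codec name) are an error
        return "", findings + ["encoding_decode_failed"]
    if "\ufffd" in text:
        # 2.10: a replacement character means the decode is not trustworthy
        findings.append("encoding_uncertain")
    return text, findings
```

For example, `decode_with_findings(b"caf\xe9")` reports `encoding_decode_failed`, while the same bytes with `override="cp1252"` decode cleanly.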
3. Output encodings
3.1 UTF-8 (default), UTF-8-BOM (Excel-friendly).
3.2 cp1252, ISO-8859-1/15, cp1250, ISO-8859-2, cp1251.
3.3 Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE.
3.4 Lossy fallback: ? + warning when codec can't represent a char.
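The lossy fallback in 3.4 matches what Python's built-in `errors="replace"` handler does when encoding. A minimal sketch, assuming a hypothetical helper `encode_lossy` and an illustrative warning string (neither is taken from the project):

```python
def _can_encode(ch: str, codec: str) -> bool:
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def encode_lossy(text: str, codec: str) -> tuple[bytes, list[str]]:
    """Hypothetical helper: encode text, falling back to '?' with a warning."""
    try:
        return text.encode(codec), []
    except UnicodeEncodeError:
        # Collect the distinct characters the target codec cannot represent.
        lost = sorted({ch for ch in text if not _can_encode(ch, codec)})
        warning = f"{len(lost)} distinct character(s) not representable in {codec}; replaced with '?'"
        # errors="replace" substitutes '?' for each unencodable character.
        return text.encode(codec, errors="replace"), [warning]
```

The strict encode is tried first so that the common lossless case pays no per-character cost.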
4. Delimiters
4.1 Input auto-detect: ,, \t, ;, |.
4.2 Output: , (default), \t, ;, |.
4.3 Extension: .tsv for tab, .csv otherwise.
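One simple way to auto-detect among the four delimiters in 4.1 is to pick the candidate that appears the same non-zero number of times on every sampled line. The sketch below is an assumption about approach, not the project's detector: `detect_delimiter` and `output_extension` are hypothetical names, and quoted cells containing delimiters are deliberately ignored.

```python
CANDIDATES = [",", "\t", ";", "|"]


def detect_delimiter(sample: str, max_lines: int = 50) -> str:
    """Hypothetical helper: choose the delimiter with a consistent per-line count."""
    lines = [ln for ln in sample.splitlines() if ln][:max_lines]
    best, best_count = ",", 0  # fall back to comma when nothing matches
    for delim in CANDIDATES:
        counts = {ln.count(delim) for ln in lines}
        # A real delimiter yields the same non-zero count on every line.
        if len(counts) == 1:
            count = counts.pop()
            if count > best_count:
                best, best_count = delim, count
    return best


def output_extension(delimiter: str) -> str:
    """Rule 4.3: .tsv for tab, .csv otherwise."""
    return ".tsv" if delimiter == "\t" else ".csv"
```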
5. Line endings
5.1 Read: LF / CRLF / bare CR — all normalized to LF.
5.2 Embedded in quoted cells: also normalized to LF.
5.3 Write: LF (default), CRLF, CR.
5.4 Mixed → mixed_line_endings finding.
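The read and write rules above can be sketched in a few lines. The function names are hypothetical; the sketch operates on the whole decoded text, so line breaks embedded in quoted cells are normalized too (5.2).

```python
import re


def normalize_line_endings(text: str) -> tuple[str, list[str]]:
    """Hypothetical helper: fold CRLF and bare CR to LF, flagging mixed input."""
    # Match CRLF before bare CR so a CRLF pair is counted once.
    kinds = set(re.findall(r"\r\n|\r|\n", text))
    findings = ["mixed_line_endings"] if len(kinds) > 1 else []
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized, findings


def with_terminator(text: str, terminator: str = "\n") -> str:
    """Write side (5.3): expects LF-normalized text; terminator is LF, CRLF, or CR."""
    return text.replace("\n", terminator)
```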
6. Analyzer detectors
File-level (read-time fixes, audit-logged):
csv_bom_stripped, csv_nul_stripped, csv_smart_quotes_folded, csv_line_endings_normalized, csv_transcoded_to_utf8, csv_unquoted_delimiters_repaired, csv_unrepairable_rows.
Cell-level:
smart_punctuation_in_data, nbsp_or_unicode_whitespace, zero_width_or_invisible, dirty_column_headers, whitespace_padding, null_like_sentinels, suspected_mojibake, mixed_case_email_column, inconsistent_date_format, near_duplicate_rows, leading_zero_ids.
Encoding integrity: encoding_uncertain, encoding_decode_failed, encoding_lying_bom, empty_input.
Sample size: 1,000 rows (configurable).
7. Finding fields
id, severity (info/warn/error), confidence (high/medium/low), fix_action, pre_applied, tool, count, description, column, samples (≤5).
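As an illustration, the fields above map naturally onto a dataclass. The field names come from the list; the Python types, the optional defaults, and the truncation of `samples` in `__post_init__` are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional


@dataclass
class Finding:
    id: str
    severity: Literal["info", "warn", "error"]
    confidence: Literal["high", "medium", "low"]
    fix_action: Optional[str]      # assumed: name of the registered fix, if any
    pre_applied: bool              # assumed: True for read-time fixes already applied
    tool: str
    count: int
    description: str
    column: Optional[str] = None   # assumed: None for file-level findings
    samples: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The field list caps samples at 5.
        self.samples = list(self.samples)[:5]
```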
8. Confidence tiers
- high — round-trip safe, one-click auto-fix.
- medium — preview before applying.
- low — opt-in only, can corrupt if wrong.
- error — must resolve or waive before tool pages unlock.
9. Decision actions
- auto: apply registered fix.
- skip: waive (audit-logged).
- modified: apply with custom payload.
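A minimal sketch of how the three actions might be dispatched. Everything here is hypothetical: the `FIXES` registry, the two example fixes, the sentinel set, and the audit-record keys (modelled on the "id, decision, cells changed" audit log described for the gate).

```python
from typing import Optional

DEFAULT_SENTINELS = frozenset({"N/A", "null", "-"})  # illustrative values only


def _strip_padding(cell: str, payload=None) -> str:
    return cell.strip()


def _blank_sentinels(cell: str, payload=None) -> str:
    sentinels = payload if payload is not None else DEFAULT_SENTINELS
    return "" if cell in sentinels else cell


# Hypothetical registry: finding id -> fix function.
FIXES = {
    "whitespace_padding": _strip_padding,
    "null_like_sentinels": _blank_sentinels,
}


def apply_decision(finding_id: str, action: str, cells: list[str],
                   payload: Optional[object] = None) -> tuple[list[str], dict]:
    if action not in ("auto", "skip", "modified"):
        raise ValueError(f"unknown action: {action!r}")
    if action == "skip":
        fixed = list(cells)  # waived: data untouched, decision still logged
    else:
        fix = FIXES[finding_id]
        custom = payload if action == "modified" else None
        fixed = [fix(cell, custom) for cell in cells]
    changed = sum(1 for before, after in zip(cells, fixed) if before != after)
    return fixed, {"id": finding_id, "decision": action, "cells_changed": changed}
```

With `modified`, the payload replaces the fix's defaults; in this sketch it is a custom sentinel set.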
10. Performance (1.5 GB input)
- Initial scan (sample): < 2 s · peak RSS ~110 MB.
- Full-file repair_bytes: 30–40 s (UTF-8). The non-UTF-8 fold path now uses str.count instead of a Python char-by-char zip walk: formerly ~100 s on a 1 GB cp1252 file with smart quotes, now <1 s.
- Full-DataFrame analyze: ~4 min (~25 µs/cell). The near-duplicate detector no longer allocates a full-frame copy; peak RSS during the near-duplicate pass drops to roughly the size of the string columns alone (~50% memory cut on text-heavy 1 GB inputs).
- Full-DataFrame auto_fix: ~5 min (~30 µs/cell).
- Output write: ~10 s.
- Recommended RAM: 3–4× input size for the full-Apply path.
- Format standardizer (standardize_dataframe): ~2.7M rows/sec on cache-warm repetition-heavy columns (synthetic 1M-row in-memory benchmark, 2 typed columns). The fused single-pass loop replaced a 3-pass .tolist() cycle, so per-call overhead is now dominated by the underlying parsers (phonenumbers, dateutil) rather than Python list materialisation. A 1.5 GB CSV with mixed phone+currency+address columns finishes in ~1.5–6 minutes depending on column count.
- Text cleaner (clean_dataframe): ~1M rows/sec on repetition-heavy columns (per-call string cache: the pipeline runs once per unique cell value, not once per row).
- Deduplicator: known O(n²) match step; works to ~50k rows in comfortable time. The normalisation pass is now LRU-cached per call so repeat values (the common dedup workload) skip re-parsing (~2–5× faster on the normalisation step alone). Scale beyond 50k needs blocking, flagged in docs/NEXT-STEPS.md.
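To illustrate the two deduplicator ideas above, the sketch below combines a per-call LRU-cached normalisation with a simple blocking pass. It is not the project's implementation: the normaliser, the three-character blocking key, and the function names are assumptions chosen for brevity.

```python
from collections import defaultdict
from functools import lru_cache
from itertools import combinations


def _normalize(value: str) -> str:
    # Stand-in normaliser: case-fold and collapse whitespace.
    return " ".join(value.casefold().split())


def candidate_pairs(values: list[str], key_len: int = 3) -> list[tuple[int, int]]:
    """Return index pairs worth comparing, instead of all n*(n-1)/2 pairs."""
    # Fresh cache per call: repeated values are normalised only once.
    norm = lru_cache(maxsize=None)(_normalize)
    blocks: dict[str, list[int]] = defaultdict(list)
    for idx, value in enumerate(values):
        blocks[norm(value)[:key_len]].append(idx)  # hypothetical blocking key
    pairs: list[tuple[int, int]] = []
    for members in blocks.values():
        # The quadratic comparison now runs inside each block only.
        pairs.extend(combinations(members, 2))
    return sorted(pairs)
```

For five values that fall into blocks of three and two, this yields 4 candidate pairs instead of 10. The trade-off is recall: two records whose keys differ (for example, a typo in the first characters) are never compared, so the key choice matters.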
11. Tools
- Deduplicator — Ready
- Text Cleaner — Ready
- Format Standardizer — Ready
- Missing Value Handler — Ready
- Column Mapper — Ready
- Outlier Detector — Coming Soon
- Multi-File Merger — Coming Soon
- Validator & Reporter — Coming Soon
- Pipeline Runner — Ready
11.a Recommended pipeline order (soft, not enforced)
The Pipeline Runner ships with a SOFT_DEPENDENCIES table; the
following ordering is the default and the basis of the warning
surface. Re-ordering is allowed; the runner emits a warning string
and proceeds.
| # | Tool | Why this slot |
|---|---|---|
| 1 | column_map (optional, for header alignment) | Multi-vendor unification — rename early so downstream tools see canonical headers |
| 2 | text_clean | NBSP / smart quotes / zero-width pollution silently breaks downstream parsers |
| 3 | format_standardize | Phones / dates / currencies → canonical form before missing detection and dedup |
| 4 | missing | Sentinel detection, imputation, drop strategies — needs canonical types |
| 5 | column_map (optional, for schema enforcement) | Project to target schema, coerce, drop extras AFTER cleaning |
| 6 | dedup | Fuzzy matching is most accurate on canonicalised, sentinel-laundered data |
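The warn-and-proceed behaviour can be sketched as a lookup against a dependency table. The shape of `SOFT_DEPENDENCIES` shown here (tool name mapped to the tools that should run earlier) and the warning wording are assumptions; only the table's name and the tool identifiers come from this document.

```python
# Hypothetical shape: tool -> tools recommended to run before it.
SOFT_DEPENDENCIES = {
    "format_standardize": ["text_clean"],
    "missing": ["format_standardize"],
    "dedup": ["text_clean", "format_standardize", "missing"],
}


def order_warnings(steps: list[str]) -> list[str]:
    """Return one warning per out-of-order pair; never raises or blocks."""
    warnings = []
    for pos, tool in enumerate(steps):
        later = steps[pos + 1:]
        for dep in SOFT_DEPENDENCIES.get(tool, []):
            # Only warn when the dependency is present but scheduled later.
            if dep in later:
                warnings.append(
                    f"{tool} runs before {dep}; recommended order is {dep} first"
                )
    return warnings
```

A pipeline that omits a recommended predecessor entirely produces no warning in this sketch, since the ordering is soft.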
12. Gate (Review & Normalize)
- Gates every tool page.
- Auto-fix button: applies all confidence=high findings in one click.
- Per-finding controls: Auto / Skip / Customize.
- Live before/after preview (≤5 sample rows).
- Audit log per fix (id, decision, cells changed).
- Encoding-override picker (16 codepages + custom).
- Advanced output expander: encoding + delimiter + line terminator.
- Result keyed by upload SHA-256; survives reload, invalidated on re-upload.
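Keying the gate result by the upload's SHA-256 can be sketched as below. The class and function names are hypothetical, and the in-memory dict stands in for whatever store the GUI actually uses; hashing in chunks keeps memory flat for large uploads.

```python
import hashlib
import io
from typing import BinaryIO, Optional


def upload_key(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash the upload in 1 MiB chunks and return the hex digest."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


class GateResultCache:
    """Hypothetical store: identical bytes hit, any changed upload misses."""

    def __init__(self) -> None:
        self._results: dict[str, dict] = {}

    def put(self, upload: bytes, result: dict) -> str:
        key = upload_key(io.BytesIO(upload))
        self._results[key] = result
        return key

    def get(self, upload: bytes) -> Optional[dict]:
        return self._results.get(upload_key(io.BytesIO(upload)))
```

Because the key depends only on content, a page reload with the same file finds the stored result, while a re-upload of different bytes misses and forces a fresh review.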
13. Interfaces
- GUI: Streamlit, browser-based, local, no internet. Sidebar language picker (English, Español).
- CLI: python -m src.cli (dedup) · src.cli_text_clean · src.cli_format · src.cli_missing · src.cli_column_map · src.cli_pipeline · src.cli_analyze. (CLI output is English-only.)
- Python API: from src.core import … (analyze, repair_bytes, clean_dataframe, deduplicate, standardize_dataframe, …).
- JSON output: --json on cli_analyze.
- Language packs: from src.i18n import t, LANGUAGES. Add <code>.json to src/i18n/packs/ + entry in LANGUAGES to add a language.
14. Platforms
- Python ≥ 3.10.
- OS: Linux, macOS, Windows.
- Browser: any modern browser.
- Network: not required at runtime.
15. Dependencies
- Core: pandas, openpyxl, charset-normalizer, typer, loguru.
- Dedup: rapidfuzz, phonenumbers.
- GUI: streamlit.
- Optional: ftfy (mojibake repair).
- Dev: pytest, tox.
16. Test coverage
- 1,770 tests passing, 0 skipped, 0 xfailed (incl. perf-shape regression tests).
- Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases + 20-row international stress fixture), missing-handler (3 use cases + 16 edge cases), column-mapper (3 use cases + 5 edge cases).
- Run: python run_tests.py [--tool …] [--fixtures] [--coverage].
17. Privacy / data handling
- All processing local; no network calls in the data path.
- No telemetry.
- Original input never modified.
- Audit logs: logs/ next to each run (timestamped).
18. Error handling
- Structured hierarchy: DataToolsError → InputValidationError, ConfigError, FileFormatError, FileAccessError.
- Subclasses extend stdlib ValueError / OSError so existing handlers still catch them.
- Every error carries: message, file path, column, operation, suggestion, underlying cause.
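The hierarchy can be sketched with multiple inheritance. The class names are from the list above; which subclass pairs with ValueError versus OSError, and the constructor signature, are assumptions made for illustration.

```python
from typing import Optional


class DataToolsError(Exception):
    """Base error carrying the structured context fields."""

    def __init__(self, message: str, *, file_path: Optional[str] = None,
                 column: Optional[str] = None, operation: Optional[str] = None,
                 suggestion: Optional[str] = None,
                 cause: Optional[BaseException] = None) -> None:
        super().__init__(message)
        self.message = message
        self.file_path = file_path
        self.column = column
        self.operation = operation
        self.suggestion = suggestion
        self.cause = cause


# Assumed pairing: validation, config and format problems are ValueErrors.
class InputValidationError(DataToolsError, ValueError): ...
class ConfigError(DataToolsError, ValueError): ...
class FileFormatError(DataToolsError, ValueError): ...


# Assumed pairing: access problems are OSErrors.
class FileAccessError(DataToolsError, OSError): ...


def read_or_explain(path: str) -> bytes:
    """Example of wrapping a stdlib failure while keeping the cause."""
    try:
        with open(path, "rb") as handle:
            return handle.read()
    except OSError as exc:
        raise FileAccessError(
            "cannot read input", file_path=path, operation="read",
            suggestion="check that the file exists and is readable", cause=exc,
        ) from exc
```

A caller with an existing `except OSError:` block still catches `FileAccessError`, and code aware of the hierarchy can catch `DataToolsError` to read the extra fields.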