Requirements
Numbered support matrix. Updated with every shipped capability.
1. File handling
1.1 Size: ≤ 1 GB target (larger works, slower).
1.2 Read: CSV, TSV, XLSX, XLS.
1.3 Write: CSV, TSV.
1.4 Excel: multi-sheet picker.
1.5 Empty file: blocked with empty_input error finding.
2. Input encodings (auto-detected)
2.1 Unicode: UTF-8, UTF-8-BOM, UTF-16 LE/BE BOM, UTF-16 LE no-BOM.
2.2 Western: cp1252, ISO-8859-1, ISO-8859-15, Mac Roman.
2.3 Eastern European: cp1250, ISO-8859-2.
2.4 Cyrillic: cp1251, KOI8-R.
2.5 CJK: Shift_JIS / cp932, GB18030, Big5, EUC-KR / cp949.
2.6 ASCII → detected as UTF-8.
2.7 User override: any Python codec name.
2.8 BOM: stripped on read, never written.
2.9 Decode failure → encoding_decode_failed (error).
2.10 U+FFFD in output → encoding_uncertain (error).
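The BOM rules in 2.1 and 2.8 can be sketched with the standard library alone. The helper names sniff_bom and decode_strip_bom are hypothetical, not the project's API, and BOM-less detection (2.1 to 2.6) is out of scope for this sketch.

```python
import codecs
from typing import Optional, Tuple

# BOM signatures for the BOM-carrying encodings in 2.1 (illustrative table).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(raw: bytes) -> Tuple[Optional[str], bytes]:
    """Return (encoding, payload without BOM); encoding is None if no BOM."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, raw[len(bom):]
    return None, raw


def decode_strip_bom(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode bytes, dropping any BOM so it never reaches the output."""
    encoding, payload = sniff_bom(raw)
    return payload.decode(encoding or fallback)
```

A strict decode like this raises UnicodeDecodeError on bad bytes, which is the kind of failure 2.9 reports as encoding_decode_failed.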
3. Output encodings
3.1 UTF-8 (default), UTF-8-BOM (Excel-friendly).
3.2 cp1252, ISO-8859-1/15, cp1250, ISO-8859-2, cp1251.
3.3 Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE.
3.4 Lossy fallback: ? + warning when codec can't represent a char.
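A minimal sketch of the lossy fallback in 3.4, assuming it behaves like Python's built-in "replace" error handler. The function name encode_lossy is hypothetical.

```python
import warnings


def encode_lossy(text: str, codec: str) -> bytes:
    """Encode text; characters the codec cannot represent become '?'."""
    try:
        # Strict first: if everything fits, nothing is lost and nothing is warned.
        return text.encode(codec)
    except UnicodeEncodeError as exc:
        bad = text[exc.start:exc.end]
        warnings.warn(f"{codec} cannot represent {bad!r}; writing '?'", stacklevel=2)
        return text.encode(codec, errors="replace")
```

For example, encoding "café Ω" to cp1252 keeps the é and turns the Ω into a question mark, with one warning.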
4. Delimiters
4.1 Input auto-detect: comma (,), tab (\t), semicolon (;), pipe (|).
4.2 Output: comma (default), tab, semicolon, pipe.
4.3 Extension: .tsv for tab, .csv otherwise.
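One way to approximate the delimiter rules above is the standard library's csv.Sniffer, restricted to the four candidates. This is a sketch under that assumption; the project's own detector may work differently, and detect_delimiter and output_extension are hypothetical names.

```python
import csv

CANDIDATES = ",\t;|"  # the four delimiters listed in 4.1


def detect_delimiter(sample: str, default: str = ",") -> str:
    """Guess the delimiter from a text sample, falling back to a default."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=CANDIDATES).delimiter
    except csv.Error:
        # Sniffer could not decide (e.g. single-column input).
        return default


def output_extension(delimiter: str) -> str:
    """Rule 4.3: .tsv for tab-delimited output, .csv for everything else."""
    return ".tsv" if delimiter == "\t" else ".csv"
```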
5. Line endings
5.1 Read: LF / CRLF / bare CR — all normalized to LF.
5.2 Embedded in quoted cells: also normalized to LF.
5.3 Write: LF (default), CRLF, CR.
5.4 Mixed → mixed_line_endings finding.
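The read-side rules in 5.1 and 5.4 can be sketched as follows. The function names are hypothetical; the sketch only shows how normalization and the mixed-endings check relate.

```python
from __future__ import annotations

import re

_STYLES = {"\r\n": "CRLF", "\r": "CR", "\n": "LF"}


def classify_line_endings(text: str) -> set[str]:
    """Return the terminator styles present in text: 'CRLF', 'CR', 'LF'."""
    # CRLF must be tried first so it is not counted as CR followed by LF.
    return {_STYLES[m.group()] for m in re.finditer(r"\r\n|\r|\n", text)}


def normalize_to_lf(text: str) -> tuple[str, bool]:
    """Return (LF-normalized text, True if more than one style was present)."""
    mixed = len(classify_line_endings(text)) > 1
    return text.replace("\r\n", "\n").replace("\r", "\n"), mixed
```

A True flag here corresponds to the condition that would raise a mixed_line_endings finding.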
6. Analyzer detectors
File-level (read-time fixes, audit-logged):
csv_bom_stripped, csv_nul_stripped, csv_smart_quotes_folded, csv_line_endings_normalized, csv_transcoded_to_utf8, csv_unquoted_delimiters_repaired, csv_unrepairable_rows.
Cell-level:
smart_punctuation_in_data, nbsp_or_unicode_whitespace, zero_width_or_invisible, dirty_column_headers, whitespace_padding, null_like_sentinels, suspected_mojibake, mixed_case_email_column, inconsistent_date_format, near_duplicate_rows, leading_zero_ids.
Encoding integrity: encoding_uncertain, encoding_decode_failed, encoding_lying_bom, empty_input.
Sample size: 1,000 rows (configurable).
7. Finding fields
id, severity (info/warn/error), confidence (high/medium/low), fix_action, pre_applied, tool, count, description, column, samples (≤5).
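The field list above maps naturally onto a dataclass. The sketch below is illustrative only: the class name, the Python types, and the sample values (such as the "strip" fix action) are assumptions, not the project's actual definition.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Literal, Optional


@dataclass
class Finding:
    id: str
    severity: Literal["info", "warn", "error"]
    confidence: Literal["high", "medium", "low"]
    fix_action: Optional[str]   # assumed: name of the registered fix, if any
    pre_applied: bool           # assumed: True for read-time fixes already applied
    tool: str
    count: int
    description: str
    column: Optional[str] = None
    samples: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce the "samples (<=5)" cap from the field list.
        self.samples = list(self.samples)[:5]


example = Finding(
    id="whitespace_padding", severity="warn", confidence="high",
    fix_action="strip", pre_applied=False, tool="text_clean", count=12,
    description="Leading or trailing whitespace", column="name",
    samples=[f" value {i} " for i in range(8)],
)
```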
8. Confidence tiers
- high — round-trip safe, one-click auto-fix.
- medium — preview before applying.
- low — opt-in only, can corrupt if wrong.
- error — must resolve or waive before tool pages unlock.
9. Decision actions
- auto: apply registered fix.
- skip: waive (audit-logged).
- modified: apply with custom payload.
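A rough sketch of how the three decision actions could be dispatched for a single cell. Everything here (apply_decision, strip_padding, the audit-entry shape) is hypothetical and only illustrates the auto / skip / modified semantics.

```python
from __future__ import annotations

from typing import Callable, Optional


def apply_decision(value: str, action: str, fix: Callable[..., str],
                   payload: Optional[dict] = None,
                   audit: Optional[list] = None) -> str:
    """Apply one decision to one cell value and record it in the audit list."""
    if action == "skip":
        result = value                          # waived: value left untouched
    elif action == "auto":
        result = fix(value)                     # registered fix, default settings
    elif action == "modified":
        result = fix(value, **(payload or {}))  # registered fix, custom payload
    else:
        raise ValueError(f"unknown decision action: {action!r}")
    if audit is not None:
        audit.append({"decision": action, "changed": result != value})
    return result


def strip_padding(value: str, chars: Optional[str] = None) -> str:
    """Example fix: strip whitespace, or the given characters."""
    return value.strip(chars)
```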
10. Performance (1 GB input)
- Initial scan (sample): < 2 s · peak RSS ~110 MB.
- Full-file repair_bytes: 30–40 s.
- Full-DataFrame analyze: ~4 min (~25 µs/cell).
- Full-DataFrame auto_fix: ~5 min (~30 µs/cell).
- Output write: ~10 s.
- Recommended RAM: 4× input size for full-Apply path.
- Format standardizer (standardize_file): ~150k rows/sec on cache-warm international data; chunk-bounded RAM (~50 MB peak at default chunk_size=50,000). A 1 GB CSV with mixed phone+currency+address columns finishes in ~2.5–10 minutes depending on column count.
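The per-cell rates above can be turned into a back-of-the-envelope estimator for other inputs. This is a sketch: the function names are hypothetical, and the 10 million cell count is an illustrative figure, not a measured property of the benchmark file.

```python
def estimate_seconds(cells: int, micros_per_cell: float) -> float:
    """Wall-clock estimate from a cell count and a per-cell cost in microseconds."""
    return cells * micros_per_cell / 1_000_000


def recommended_ram_bytes(input_bytes: int, factor: int = 4) -> int:
    """The '4x input size' rule of thumb for the full-Apply path."""
    return factor * input_bytes


cells = 10_000_000  # hypothetical cell count
analyze_minutes = estimate_seconds(cells, 25) / 60   # about 4.2 minutes
auto_fix_minutes = estimate_seconds(cells, 30) / 60  # 5 minutes
```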
11. Tools
- Deduplicator — Ready
- Text Cleaner — Ready
- Format Standardizer — Ready
- Missing Value Handler — Ready
- Column Mapper — Ready
- Outlier Detector — Coming Soon
- Multi-File Merger — Coming Soon
- Validator & Reporter — Coming Soon
- Pipeline Runner — Ready
11.a Recommended pipeline order (soft, not enforced)
The Pipeline Runner ships with a SOFT_DEPENDENCIES table. The ordering
below is the default and determines which warnings the runner raises.
Re-ordering is allowed: the runner emits a warning string and proceeds.
| # | Tool | Why this slot |
|---|---|---|
| 1 | column_map (optional, for header alignment) | Multi-vendor unification — rename early so downstream tools see canonical headers |
| 2 | text_clean | NBSP / smart quotes / zero-width pollution silently breaks downstream parsers |
| 3 | format_standardize | Phones / dates / currencies → canonical form before missing detection and dedup |
| 4 | missing | Sentinel detection, imputation, drop strategies — needs canonical types |
| 5 | column_map (optional, for schema enforcement) | Project to target schema, coerce, drop extras AFTER cleaning |
| 6 | dedup | Fuzzy matching is most accurate on canonicalised, sentinel-laundered data |
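The soft-ordering check can be sketched as below. The SOFT_DEPS mapping is a hypothetical stand-in derived from the table, not the real SOFT_DEPENDENCIES structure, and column_map is left out because it is optional in two different slots.

```python
from __future__ import annotations

# Hypothetical stand-in: tool -> tools that are recommended to run before it.
SOFT_DEPS = {
    "format_standardize": {"text_clean"},
    "missing": {"text_clean", "format_standardize"},
    "dedup": {"text_clean", "format_standardize", "missing"},
}


def order_warnings(steps: list[str]) -> list[str]:
    """Return one warning per out-of-order pair; never blocks the run."""
    warnings = []
    for i, tool in enumerate(steps):
        for later in steps[i + 1:]:
            if later in SOFT_DEPS.get(tool, ()):
                warnings.append(
                    f"{tool} runs before {later}; recommended order is {later} first"
                )
    return warnings
```

Because the function only returns strings, a caller can log them and continue, which matches the "warn and proceed" behavior described above.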
12. Gate (Review & Normalize)
- Gates every tool page.
- Auto-fix button: applies all confidence=high findings in one click.
- Per-finding controls: Auto / Skip / Customize.
- Live before/after preview (≤5 sample rows).
- Audit log per fix (id, decision, cells changed).
- Encoding-override picker (16 codepages + custom).
- Advanced output expander: encoding + delimiter + line terminator.
- Result keyed by upload SHA-256; survives reload, invalidated on re-upload.
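Deriving a SHA-256 key from an upload can be sketched with hashlib. The function name is hypothetical, and how the gate stores or invalidates results is not shown; this only illustrates computing the key without loading a large file into memory at once.

```python
import hashlib
import io
from typing import BinaryIO


def upload_sha256(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in 1 MiB chunks and return the hex digest."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


key = upload_sha256(io.BytesIO(b"a,b\n1,2\n"))
```

Identical bytes always yield the same key, and any change to the content yields a different one.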
13. Interfaces
- GUI: Streamlit, browser-based, local, no internet. Sidebar language picker (English, Español).
- CLI: python -m src.cli (dedup) · src.cli_text_clean · src.cli_format · src.cli_missing · src.cli_column_map · src.cli_pipeline · src.cli_analyze. (CLI output is English-only.)
- Python API: from src.core import … (analyze, repair_bytes, clean_dataframe, deduplicate, standardize_dataframe, …).
- JSON output: --json on cli_analyze.
- Language packs: from src.i18n import t, LANGUAGES. To add a language, add <code>.json to src/i18n/packs/ and an entry in LANGUAGES.
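A JSON-pack lookup of this kind might look like the sketch below. It does not reproduce src.i18n: the key names, the in-memory PACKS dict (standing in for the JSON files), the English fallback, and the functions translate and parity_gaps are all assumptions made for illustration.

```python
from __future__ import annotations

# Stand-in for the contents of en.json and es.json; keys are invented.
PACKS = {
    "en": {"sidebar.language": "Language", "home.title": "Home"},
    "es": {"sidebar.language": "Idioma", "home.title": "Inicio"},
}


def translate(key: str, lang: str, packs: dict = PACKS, default_lang: str = "en") -> str:
    """Look up key in the requested pack, then the default pack, then echo the key."""
    text = packs.get(lang, {}).get(key)
    if text is None:
        text = packs[default_lang].get(key, key)
    return text


def parity_gaps(packs: dict, reference: str = "en") -> dict[str, list[str]]:
    """Keys that differ between each pack and the reference pack."""
    ref = set(packs[reference])
    return {lang: sorted(ref ^ set(pack))
            for lang, pack in packs.items() if set(pack) != ref}
```

A check like parity_gaps returning an empty dict is one way a test suite could confirm that a new pack defines exactly the same keys as the English one.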
14. Platforms
- Python ≥ 3.10.
- OS: Linux, macOS, Windows.
- Browser: any modern browser.
- Network: not required at runtime.
15. Dependencies
- Core: pandas, openpyxl, charset-normalizer, typer, loguru.
- Dedup: rapidfuzz, phonenumbers.
- GUI: streamlit.
- Optional: ftfy (mojibake repair).
- Dev: pytest, tox.
16. Test coverage
- 1,762 tests passing, 0 skipped, 0 xfailed.
- Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases + 20-row international stress fixture), missing-handler (3 use cases + 16 edge cases), column-mapper (3 use cases + 5 edge cases).
- Run: python run_tests.py [--tool …] [--fixtures] [--coverage].
17. Privacy / data handling
- All processing local; no network calls in the data path.
- No telemetry.
- Original input never modified.
- Audit logs: logs/ next to each run (timestamped).
18. Error handling
- Structured hierarchy: DataToolsError → InputValidationError, ConfigError, FileFormatError, FileAccessError.
- Subclasses extend stdlib ValueError / OSError so existing handlers still catch them.
- Every error carries: message, file path, column, operation, suggestion, underlying cause.
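The hierarchy can be sketched as follows. The class names come from the list above, but the constructor signature, attribute names, and the choice of which subclass extends ValueError versus OSError are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Optional


class DataToolsError(Exception):
    """Base error carrying the context fields listed above."""

    def __init__(self, message: str, *, file_path: Optional[str] = None,
                 column: Optional[str] = None, operation: Optional[str] = None,
                 suggestion: Optional[str] = None,
                 cause: Optional[BaseException] = None) -> None:
        super().__init__(message)
        self.message = message
        self.file_path = file_path
        self.column = column
        self.operation = operation
        self.suggestion = suggestion
        self.cause = cause


# Assumed mapping: validation-style errors extend ValueError,
# file-access errors extend OSError.
class InputValidationError(DataToolsError, ValueError): ...
class ConfigError(DataToolsError, ValueError): ...
class FileFormatError(DataToolsError, ValueError): ...
class FileAccessError(DataToolsError, OSError): ...


try:
    raise FileAccessError("cannot read input", file_path="in.csv",
                          operation="read", suggestion="check permissions")
except OSError as err:  # a pre-existing stdlib handler still catches it
    caught = err
```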