
Michael 38011872e1 docs(i18n): document language packs across user, dev, and marketing docs

README + USER-GUIDE describe the sidebar picker and current coverage
(home + shared chrome, per-tool bodies pending). DEVELOPER gains a
how-to for adding packs and keys with the parity-test guarantee.
TECHNICAL §10b records the in-house-JSON architecture and locks in the
no-gettext decision (also logged in DECISIONS). REQUIREMENTS reflects
the new interface surface and updated test count. COPY.md adds a
"Language claim" slot so landing/email work can pick it up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-13 15:16:24 +00:00


Requirements

Numbered support matrix. Updated with every shipped capability.

1. File handling

1.1 Size: ≤ 1 GB target (larger works, slower).
1.2 Read: CSV, TSV, XLSX, XLS.
1.3 Write: CSV, TSV.
1.4 Excel: multi-sheet picker.
1.5 Empty file: blocked with empty_input error finding.

2. Input encodings (auto-detected)

2.1 Unicode: UTF-8, UTF-8-BOM, UTF-16 LE/BE BOM, UTF-16 LE no-BOM.
2.2 Western: cp1252, ISO-8859-1, ISO-8859-15, Mac Roman.
2.3 Eastern European: cp1250, ISO-8859-2.
2.4 Cyrillic: cp1251, KOI8-R.
2.5 CJK: Shift_JIS / cp932, GB18030, Big5, EUC-KR / cp949.
2.6 ASCII → detected as UTF-8.
2.7 User override: any Python codec name.
2.8 BOM: stripped on read, never written.
2.9 Decode failure → encoding_decode_failed (error).
2.10 U+FFFD in output → encoding_uncertain (error).

3. Output encodings

3.1 UTF-8 (default), UTF-8-BOM (Excel-friendly).
3.2 cp1252, ISO-8859-1/15, cp1250, ISO-8859-2, cp1251.
3.3 Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE.
3.4 Lossy fallback: characters the codec can't represent are written as ? and a warning is raised.
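
The lossy fallback in 3.4 matches what Python's built-in `errors="replace"` does when encoding. A minimal sketch, assuming a hypothetical `encode_lossy` helper that also counts the replaced characters so a caller could phrase the warning:

```python
def encode_lossy(text: str, codec: str) -> tuple[bytes, int]:
    """Encode text; return (bytes, number of characters replaced by '?')."""
    try:
        return text.encode(codec), 0  # fast path: everything is representable
    except UnicodeEncodeError:
        pass
    # Count the characters the codec cannot represent, one at a time.
    lost = 0
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            lost += 1
    # errors="replace" substitutes '?' for each unencodable character.
    return text.encode(codec, errors="replace"), lost
```

For example, `encode_lossy("naïve ✓", "cp1252")` keeps the `ï` (cp1252 can represent it) and replaces only the check mark.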

4. Delimiters

4.1 Input auto-detect: comma (,), tab (\t), semicolon (;), pipe (|).
4.2 Output: comma (default), tab, semicolon, pipe.
4.3 Extension: .tsv for tab, .csv otherwise.
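
One way to restrict auto-detection to the four delimiters in 4.1 is the standard library's `csv.Sniffer`. This is a sketch under that assumption; the project may use a different detector, and both function names are hypothetical.

```python
import csv

CANDIDATES = ",\t;|"  # the four delimiters listed in 4.1


def detect_delimiter(sample: str) -> str:
    """Guess the delimiter from a text sample, falling back to a comma."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=CANDIDATES).delimiter
    except csv.Error:
        return ","  # assumed fallback when the sample is inconclusive


def output_extension(delimiter: str) -> str:
    """4.3: .tsv for tab-delimited output, .csv for everything else."""
    return ".tsv" if delimiter == "\t" else ".csv"
```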

5. Line endings

5.1 Read: LF / CRLF / bare CR, all normalized to LF.
5.2 Embedded in quoted cells: also normalized to LF.
5.3 Write: LF (default), CRLF, CR.
5.4 Mixed → mixed_line_endings finding.
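
The normalization in 5.1 and the mixed-ending check in 5.4 can be sketched with a single regular expression. `normalize_line_endings` is a hypothetical name; because it works on the whole text, line breaks inside quoted cells (5.2) are normalized as well.

```python
import re

# CRLF must come first so it is matched as one ending, not as CR then LF.
_EOL = re.compile(r"\r\n|\r|\n")
_NAMES = {"\r\n": "CRLF", "\r": "CR", "\n": "LF"}


def normalize_line_endings(text: str) -> tuple[str, set[str]]:
    """Return (text with LF endings only, set of ending styles that were seen)."""
    seen = {_NAMES[m.group()] for m in _EOL.finditer(text)}
    return _EOL.sub("\n", text), seen
```

A caller could treat `len(seen) > 1` as the condition for the mixed_line_endings finding.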

6. Analyzer detectors

File-level (read-time fixes, audit-logged):

csv_bom_stripped, csv_nul_stripped, csv_smart_quotes_folded, csv_line_endings_normalized, csv_transcoded_to_utf8, csv_unquoted_delimiters_repaired, csv_unrepairable_rows.

Cell-level:

smart_punctuation_in_data, nbsp_or_unicode_whitespace, zero_width_or_invisible, dirty_column_headers, whitespace_padding, null_like_sentinels, suspected_mojibake, mixed_case_email_column, inconsistent_date_format, near_duplicate_rows, leading_zero_ids.

Encoding integrity: encoding_uncertain, encoding_decode_failed, encoding_lying_bom, empty_input.

Sample size: 1,000 rows (configurable).

7. Finding fields

id, severity (info/warn/error), confidence (high/medium/low), fix_action, pre_applied, tool, count, description, column, samples (≤5).

8. Confidence tiers

high — round-trip safe, one-click auto-fix.
medium — preview before applying.
low — opt-in only, can corrupt if wrong.
error — must resolve or waive before tool pages unlock.

9. Decision actions

auto — apply registered fix.
skip — waive (audit-logged).
modified — apply with custom payload.
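
Sections 8 and 9 together imply two small rules: which findings a one-click auto-fix may touch, and when the gate can open. The sketch below models findings as plain dicts; both function names are hypothetical, and treating any recorded decision on an error finding as "resolved or waived" is an assumption.

```python
VALID_ACTIONS = {"auto", "skip", "modified"}  # the decision actions in section 9


def auto_fix_candidates(findings: list[dict]) -> list[dict]:
    """Findings a one-click auto-fix would apply: confidence == 'high' only."""
    return [f for f in findings if f["confidence"] == "high"]


def gate_unlocked(findings: list[dict], decisions: dict[str, str]) -> bool:
    """True when every error-severity finding has a recorded decision."""
    for f in findings:
        if f["severity"] == "error" and decisions.get(f["id"]) not in VALID_ACTIONS:
            return False
    return True
```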

10. Performance (1 GB input)

Initial scan (sample): < 2 s · peak RSS ~110 MB.
Full-file repair_bytes: 30–40 s.
Full-DataFrame analyze: ~4 min (~25 µs/cell).
Full-DataFrame auto_fix: ~5 min (~30 µs/cell).
Output write: ~10 s.
Recommended RAM: 4× input size for full-Apply path.
Format standardizer (standardize_file): ~150k rows/sec on cache-warm international data; chunk-bounded RAM (~50 MB peak at default chunk_size=50,000). A 1 GB CSV with mixed phone+currency+address columns finishes in ~2.5–10 minutes depending on column count.

11. Tools

Deduplicator — Ready
Text Cleaner — Ready
Format Standardizer — Ready
Missing Value Handler — Ready
Column Mapper — Ready
Outlier Detector — Coming Soon
Multi-File Merger — Coming Soon
Validator & Reporter — Coming Soon
Pipeline Runner — Ready

11.a Recommended pipeline order (soft, not enforced)

The Pipeline Runner ships with a SOFT_DEPENDENCIES table. The ordering below is the default, and the runner's ordering warnings are derived from it. Re-ordering is allowed: the runner emits a warning string and proceeds.

#	Tool	Why this slot
1	column_map (optional, for header alignment)	Multi-vendor unification — rename early so downstream tools see canonical headers
2	text_clean	NBSP / smart quotes / zero-width pollution silently breaks downstream parsers
3	format_standardize	Phones / dates / currencies → canonical form before missing detection and dedup
4	missing	Sentinel detection, imputation, drop strategies — needs canonical types
5	column_map (optional, for schema enforcement)	Project to target schema, coerce, drop extras AFTER cleaning
6	dedup	Fuzzy matching is most accurate on canonicalised, sentinel-laundered data
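
A soft ordering check of this kind can be sketched as follows. This is not the project's SOFT_DEPENDENCIES table: `RECOMMENDED_ORDER` and `order_warnings` are hypothetical, and column_map is left out of the ranking because the table allows it in two slots.

```python
from itertools import combinations

# Tools with a single recommended slot, in the order given by the table above.
RECOMMENDED_ORDER = ["text_clean", "format_standardize", "missing", "dedup"]


def order_warnings(steps: list[str]) -> list[str]:
    """Return one warning per pair of steps that runs against the recommended order."""
    rank = {name: i for i, name in enumerate(RECOMMENDED_ORDER)}
    ranked = [s for s in steps if s in rank]  # ignore tools without a fixed slot
    warnings = []
    for earlier, later in combinations(ranked, 2):
        if rank[earlier] > rank[later]:
            warnings.append(
                f"{earlier} runs before {later}; recommended: {later} first"
            )
    return warnings  # soft check: the caller warns and proceeds
```

Returning strings instead of raising keeps the check advisory, which matches the "soft, not enforced" wording of this section.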

12. Gate (Review & Normalize)

Gates every tool page.
Auto-fix button: applies all confidence=high findings in one click.
Per-finding controls: Auto / Skip / Customize.
Live before/after preview (≤5 sample rows).
Audit log per fix (id, decision, cells changed).
Encoding-override picker (16 codepages + custom).
Advanced output expander: encoding + delimiter + line terminator.
Result keyed by upload SHA-256; survives reload, invalidated on re-upload.
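
Keying the gate result by the upload's SHA-256 can be sketched with `hashlib`. The in-process dict below is a hypothetical stand-in for whatever store the GUI uses, and `gate_result` is an illustrative name.

```python
import hashlib
from typing import Callable

_results: dict[str, dict] = {}  # hypothetical cache: content hash -> gate result


def upload_key(data: bytes) -> str:
    """Hex SHA-256 digest of the uploaded bytes."""
    return hashlib.sha256(data).hexdigest()


def gate_result(data: bytes, analyze: Callable[[bytes], dict]) -> dict:
    """Reuse the stored result for identical bytes; analyze anything new."""
    key = upload_key(data)
    if key not in _results:
        _results[key] = analyze(data)
    return _results[key]
```

Because the key depends only on content, the same bytes map to the same result after a page reload, while an upload with different bytes gets a fresh analysis.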

13. Interfaces

GUI: Streamlit, browser-based, local, no internet. Sidebar language picker (English, Español).
CLI: python -m src.cli (dedup) · src.cli_text_clean · src.cli_format · src.cli_missing · src.cli_column_map · src.cli_pipeline · src.cli_analyze. (CLI output is English-only.)
Python API: from src.core import … (analyze, repair_bytes, clean_dataframe, deduplicate, standardize_dataframe, …).
JSON output: --json on cli_analyze.
Language packs: from src.i18n import t, LANGUAGES. To add a language, add <code>.json to src/i18n/packs/ and an entry to LANGUAGES.
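
To show the general shape of a JSON-pack lookup and a key-parity check, here is a self-contained sketch. It does not reproduce src.i18n: the pack contents, the keys, the `translate` and `missing_keys` names, and the fall-back-to-English behaviour are all assumptions, and the packs are inlined as dicts where the project reads JSON files.

```python
# Hypothetical in-memory packs standing in for the JSON files on disk.
PACKS = {
    "en": {"sidebar.language": "Language", "home.title": "Home"},
    "es": {"sidebar.language": "Idioma"},
}


def translate(key: str, lang: str = "en") -> str:
    """Look up key in the requested pack, then English, then return the key itself."""
    for pack in (PACKS.get(lang, {}), PACKS["en"]):
        if key in pack:
            return pack[key]
    return key


def missing_keys(lang: str) -> set[str]:
    """Keys present in the English pack but absent from another pack."""
    return set(PACKS["en"]) - set(PACKS[lang])
```

A parity test would assert that `missing_keys(lang)` is empty for every shipped language; in this toy data the Spanish pack deliberately lacks one key.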

14. Platforms

Python ≥ 3.10.
OS: Linux, macOS, Windows.
Browser: any modern browser.
Network: not required at runtime.

15. Dependencies

Core: pandas, openpyxl, charset-normalizer, typer, loguru.
Dedup: rapidfuzz, phonenumbers.
GUI: streamlit.
Optional: ftfy (mojibake repair).
Dev: pytest, tox.

16. Test coverage

1,762 tests passing, 0 skipped, 0 xfailed.
Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases + 20-row international stress fixture), missing-handler (3 use cases + 16 edge cases), column-mapper (3 use cases + 5 edge cases).
Run: python run_tests.py [--tool …] [--fixtures] [--coverage].

17. Privacy / data handling

All processing local; no network calls in the data path.
No telemetry.
Original input never modified.
Audit logs: logs/ next to each run (timestamped).

18. Error handling

Structured hierarchy: DataToolsError → InputValidationError, ConfigError, FileFormatError, FileAccessError.
Subclasses extend stdlib ValueError / OSError so existing handlers still catch them.
Every error carries: message, file path, column, operation, suggestion, underlying cause.
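
The hierarchy can be pictured with the sketch below. The class names come from this section; which subclass extends ValueError and which extends OSError, the constructor signature, and the attribute names are assumptions. The underlying cause is carried by Python's own `raise ... from ...` chaining.

```python
class DataToolsError(Exception):
    """Base error carrying the context fields listed above (attribute names assumed)."""

    def __init__(self, message, *, file_path=None, column=None,
                 operation=None, suggestion=None):
        super().__init__(message)
        self.message = message
        self.file_path = file_path
        self.column = column
        self.operation = operation
        self.suggestion = suggestion


# Assumed mapping: data and config problems are ValueErrors, I/O problems are OSErrors.
class InputValidationError(DataToolsError, ValueError): ...
class ConfigError(DataToolsError, ValueError): ...
class FileFormatError(DataToolsError, ValueError): ...
class FileAccessError(DataToolsError, OSError): ...


def read_bytes(path: str) -> bytes:
    """Wrap a low-level failure while keeping it as the underlying cause."""
    try:
        with open(path, "rb") as fh:
            return fh.read()
    except OSError as exc:
        raise FileAccessError(
            f"cannot read {path}", file_path=path, operation="read",
            suggestion="check that the file exists and is readable",
        ) from exc
```

Code written as `except OSError:` or `except ValueError:` before the hierarchy existed still catches these errors, which is the compatibility property described above.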
