datatools-dev/docs/REQUIREMENTS.md
Michael 38011872e1 docs(i18n): document language packs across user, dev, and marketing docs
README + USER-GUIDE describe the sidebar picker and current coverage
(home + shared chrome, per-tool bodies pending). DEVELOPER gains a
how-to for adding packs and keys with the parity-test guarantee.
TECHNICAL §10b records the in-house-JSON architecture and locks in the
no-gettext decision (also logged in DECISIONS). REQUIREMENTS reflects
the new interface surface and updated test count. COPY.md adds a
"Language claim" slot so landing/email work can pick it up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:16:24 +00:00

# Requirements
Numbered support matrix. Updated with every shipped capability.
## 1. File handling
1.1 Size: ≤ 1 GB target (larger works, slower).
1.2 Read: CSV, TSV, XLSX, XLS.
1.3 Write: CSV, TSV.
1.4 Excel: multi-sheet picker.
1.5 Empty file: blocked with `empty_input` error finding.
## 2. Input encodings (auto-detected)
2.1 Unicode: UTF-8, UTF-8-BOM, UTF-16 LE/BE BOM, UTF-16 LE no-BOM.
2.2 Western: cp1252, ISO-8859-1, ISO-8859-15, Mac Roman.
2.3 Eastern European: cp1250, ISO-8859-2.
2.4 Cyrillic: cp1251, KOI8-R.
2.5 CJK: Shift_JIS / cp932, GB18030, Big5, EUC-KR / cp949.
2.6 ASCII → detected as UTF-8.
2.7 User override: any Python codec name.
2.8 BOM: stripped on read, never written.
2.9 Decode failure → `encoding_decode_failed` (error).
2.10 U+FFFD in output → `encoding_uncertain` (error).
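
Items 2.1 and 2.8 can be illustrated with a minimal BOM-sniffing sketch built on the standard library only. `sniff_bom` is a hypothetical helper, not the project's detector; it covers only the three BOMs listed in 2.1 and leaves no-BOM detection to the real encoding pipeline.

```python
import codecs
from typing import Optional

# BOMs named in 2.1, paired with the codec that decodes the remaining bytes.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(raw: bytes) -> tuple[Optional[str], bytes]:
    """Return (codec name or None, payload with any leading BOM removed)."""
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            return codec, raw[len(bom):]
    return None, raw  # no BOM: the caller falls back to auto-detection
```

Because the BOM is stripped before decoding, the decoded text never starts with U+FEFF, which matches the "stripped on read" rule.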
## 3. Output encodings
3.1 UTF-8 (default), UTF-8-BOM (Excel-friendly).
3.2 cp1252, ISO-8859-1/15, cp1250, ISO-8859-2, cp1251.
3.3 Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE.
3.4 Lossy fallback: `?` + warning when codec can't represent a char.
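
The lossy fallback in 3.4 corresponds to Python's `errors="replace"` encoding mode, which substitutes `?` for unrepresentable characters. The sketch below is one way this could be wired up; `encode_lossy` and the warning text are illustrative assumptions, not the tool's actual function or message.

```python
import warnings


def encode_lossy(text: str, codec: str) -> bytes:
    """Encode strictly if possible; otherwise warn and substitute '?'."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        warnings.warn(f"{codec} cannot represent every character; using '?'")
        return text.encode(codec, errors="replace")
```

For example, encoding `"a→b"` to cp1252 would yield `b"a?b"` plus one warning, while `"café"` encodes cleanly.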
## 4. Delimiters
4.1 Input auto-detect: `,`, `\t`, `;`, `|`.
4.2 Output: `,` (default), `\t`, `;`, `|`.
4.3 Extension: `.tsv` for tab, `.csv` otherwise.
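
A rough sketch of the delimiter rules above, assuming a simple "same count on every line" heuristic. The real detector is not documented here and presumably handles quoted fields, which this version ignores; `detect_delimiter` and `output_extension` are hypothetical names.

```python
CANDIDATES = (",", "\t", ";", "|")  # the four delimiters listed in 4.1


def detect_delimiter(sample: str) -> str:
    """Pick the candidate that occurs most often with a consistent per-line count."""
    lines = [line for line in sample.splitlines() if line]
    best, best_count = ",", 0  # comma is the fallback when nothing matches
    for cand in CANDIDATES:
        counts = {line.count(cand) for line in lines}
        if len(counts) == 1:  # identical count on every non-empty line
            count = counts.pop()
            if count > best_count:
                best, best_count = cand, count
    return best


def output_extension(delimiter: str) -> str:
    """Rule 4.3: .tsv for tab, .csv for everything else."""
    return ".tsv" if delimiter == "\t" else ".csv"
```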
## 5. Line endings
5.1 Read: LF / CRLF / bare CR — all normalized to LF.
5.2 Embedded in quoted cells: also normalized to LF.
5.3 Write: LF (default), CRLF, CR.
5.4 Mixed → `mixed_line_endings` finding.
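
The normalization in 5.1 and the mixed-ending check in 5.4 can be sketched with plain string operations. Both helper names are assumptions for illustration; the actual implementation may work on bytes and record the result as a finding.

```python
def line_ending_styles(text: str) -> set[str]:
    """Return which of CRLF, CR, LF occur in the text."""
    styles = set()
    if "\r\n" in text:
        styles.add("CRLF")
    rest = text.replace("\r\n", "")  # remove CRLF pairs so bare CR/LF stand out
    if "\r" in rest:
        styles.add("CR")
    if "\n" in rest:
        styles.add("LF")
    return styles


def normalize_to_lf(text: str) -> str:
    """Convert CRLF first, then any remaining bare CR, to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

A file would count as mixed when `line_ending_styles` returns more than one style.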
## 6. Analyzer detectors
**File-level** (read-time fixes, audit-logged):
- `csv_bom_stripped`, `csv_nul_stripped`, `csv_smart_quotes_folded`, `csv_line_endings_normalized`, `csv_transcoded_to_utf8`, `csv_unquoted_delimiters_repaired`, `csv_unrepairable_rows`.
**Cell-level**:
- `smart_punctuation_in_data`, `nbsp_or_unicode_whitespace`, `zero_width_or_invisible`, `dirty_column_headers`, `whitespace_padding`, `null_like_sentinels`, `suspected_mojibake`, `mixed_case_email_column`, `inconsistent_date_format`, `near_duplicate_rows`, `leading_zero_ids`.
**Encoding integrity**: `encoding_uncertain`, `encoding_decode_failed`, `encoding_lying_bom`, `empty_input`.
Sample size: 1,000 rows (configurable).
## 7. Finding fields
`id`, `severity` (info/warn/error), `confidence` (high/medium/low), `fix_action`, `pre_applied`, `tool`, `count`, `description`, `column`, `samples` (≤5).
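
As a rough picture of the record shape, the fields above could map onto a dataclass like the one below. The field names come from the list; the types, defaults, and the truncation in `__post_init__` are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional


@dataclass
class Finding:
    id: str
    severity: Literal["info", "warn", "error"]
    confidence: Literal["high", "medium", "low"]
    description: str
    count: int = 0
    fix_action: Optional[str] = None   # assumed: name of the registered fix
    pre_applied: bool = False          # assumed: True for read-time fixes
    tool: Optional[str] = None
    column: Optional[str] = None
    samples: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Keep at most five samples, matching the documented cap.
        self.samples = list(self.samples)[:5]
```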
## 8. Confidence tiers
- **high** — round-trip safe, one-click auto-fix.
- **medium** — preview before applying.
- **low** — opt-in only, can corrupt if wrong.
- **error** (a severity, not a confidence value; see §7) — must resolve or waive before tool pages unlock.
## 9. Decision actions
- `auto` — apply registered fix.
- `skip` — waive (audit-logged).
- `modified` — apply with custom payload.
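
The three actions could be dispatched roughly as follows. This is a sketch under stated assumptions: `apply_decision`, the `fix` callable signature, and the shape of the returned audit entry are all hypothetical and only mirror the behaviour described above.

```python
from typing import Callable, Optional


def apply_decision(
    cells: list[str],
    action: str,
    fix: Callable[..., str],
    payload: Optional[dict] = None,
) -> tuple[list[str], dict]:
    """Apply one decision to a list of cells and return (cells, audit entry)."""
    if action == "skip":
        fixed = list(cells)                     # waived: data is left untouched
    elif action == "auto":
        fixed = [fix(c) for c in cells]         # registered fix, default options
    elif action == "modified":
        fixed = [fix(c, **(payload or {})) for c in cells]  # custom payload
    else:
        raise ValueError(f"unknown decision action: {action!r}")
    changed = sum(before != after for before, after in zip(cells, fixed))
    return fixed, {"decision": action, "cells_changed": changed}
```

Returning an audit entry for `skip` as well reflects the note that waivers are audit-logged.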
## 10. Performance (1 GB input)
- Initial scan (sample): < 2 s · peak RSS ~110 MB.
- Full-file `repair_bytes`: 30–40 s.
- Full-DataFrame analyze: ~4 min (~25 µs/cell).
- Full-DataFrame `auto_fix`: ~5 min (~30 µs/cell).
- Output write: ~10 s.
- Recommended RAM: 4× input size for full-Apply path.
- Format standardizer (`standardize_file`): ~150k rows/sec on cache-warm
international data; chunk-bounded RAM (~50 MB peak at default
chunk_size=50,000). A 1 GB CSV with mixed phone+currency+address
columns finishes in ~2.5–10 minutes depending on column count.
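
The per-cell figures above make a back-of-envelope estimate possible for other inputs. The helpers below are illustrative only; the cell counts in the usage note are implied by the stated timings, not documented benchmark parameters.

```python
def estimate_seconds(cells: int, us_per_cell: float) -> float:
    """Total time for a per-cell cost given in microseconds."""
    return cells * us_per_cell / 1_000_000


def recommended_ram_bytes(input_bytes: int, factor: int = 4) -> int:
    """The '4x input size' rule for the full-Apply path."""
    return input_bytes * factor
```

At ~25 µs/cell, a 4-minute analyze run corresponds to roughly 9.6 million cells; at ~30 µs/cell, a 5-minute `auto_fix` run corresponds to roughly 10 million cells. A 1 GiB input suggests about 4 GiB of RAM.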
## 11. Tools
1. Deduplicator — Ready
2. Text Cleaner — Ready
3. Format Standardizer — Ready
4. Missing Value Handler — Ready
5. Column Mapper — Ready
6. Outlier Detector — Coming Soon
7. Multi-File Merger — Coming Soon
8. Validator & Reporter — Coming Soon
9. Pipeline Runner — Ready
### 11.a Recommended pipeline order (soft, not enforced)
The Pipeline Runner ships with a `SOFT_DEPENDENCIES` table; the
following ordering is the default and the basis of the warning
surface. Re-ordering is allowed; the runner emits a warning string
and proceeds.
| # | Tool | Why this slot |
|---|------|---------------|
| 1 | column_map (optional, for header alignment) | Multi-vendor unification — rename early so downstream tools see canonical headers |
| 2 | text_clean | NBSP / smart quotes / zero-width pollution silently breaks downstream parsers |
| 3 | format_standardize | Phones / dates / currencies → canonical form before missing detection and dedup |
| 4 | missing | Sentinel detection, imputation, drop strategies — needs canonical types |
| 5 | column_map (optional, for schema enforcement) | Project to target schema, coerce, drop extras AFTER cleaning |
| 6 | dedup | Fuzzy matching is most accurate on canonicalised, sentinel-laundered data |
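
The soft ordering check can be pictured as follows. The structure of the real `SOFT_DEPENDENCIES` table is not shown in this document, so this sketch assumes a plain ranked list and leaves out `column_map`, which is valid in two slots. `order_warnings` is a hypothetical name.

```python
from itertools import combinations

# Recommended order from the table above, minus the optional column_map slots.
RECOMMENDED_ORDER = ["text_clean", "format_standardize", "missing", "dedup"]


def order_warnings(steps: list[str]) -> list[str]:
    """Return one warning per pair of steps that runs against the recommended order."""
    rank = {tool: i for i, tool in enumerate(RECOMMENDED_ORDER)}
    ranked = [s for s in steps if s in rank]  # ignore tools without a fixed slot
    messages = []
    for first, second in combinations(ranked, 2):
        if rank[first] > rank[second]:
            messages.append(f"{first} runs before {second}; recommended: {second} first")
    return messages
```

The function only reports; in keeping with the soft policy, a caller would log the strings and run the pipeline anyway.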
## 12. Gate (Review & Normalize)
- Gates every tool page.
- Auto-fix button: applies all `confidence=high` findings in one click.
- Per-finding controls: Auto / Skip / Customize.
- Live before/after preview (≤5 sample rows).
- Audit log per fix (id, decision, cells changed).
- Encoding-override picker (16 codepages + custom).
- Advanced output expander: encoding + delimiter + line terminator.
- Result keyed by upload SHA-256; survives reload, invalidated on re-upload.
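
Keying the gate result by content hash can be sketched with `hashlib`. The in-memory `GateResultCache` class is an assumption for illustration; how the real result is persisted across reloads is not described here.

```python
import hashlib
from typing import Any, Optional


def upload_key(data: bytes) -> str:
    """SHA-256 hex digest of the uploaded bytes."""
    return hashlib.sha256(data).hexdigest()


class GateResultCache:
    """Maps upload digest -> gate result; a different upload misses the cache."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}

    def put(self, data: bytes, result: Any) -> None:
        self._results[upload_key(data)] = result

    def get(self, data: bytes) -> Optional[Any]:
        return self._results.get(upload_key(data))
```

Because the key depends only on the bytes, a page reload with the same upload finds the stored result, while uploading different content produces a new key.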
## 13. Interfaces
- **GUI**: Streamlit, browser-based, local, no internet. Sidebar language picker (English, Español).
- **CLI**: `python -m src.cli` (dedup) · `src.cli_text_clean` · `src.cli_format` · `src.cli_missing` · `src.cli_column_map` · `src.cli_pipeline` · `src.cli_analyze`. (CLI output is English-only.)
- **Python API**: `from src.core import …` (analyze, repair_bytes, clean_dataframe, deduplicate, standardize_dataframe, …).
- **JSON output**: `--json` on `cli_analyze`.
- **Language packs**: `from src.i18n import t, LANGUAGES`. Add `<code>.json` to `src/i18n/packs/` + entry in `LANGUAGES` to add a language.
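
A minimal picture of how JSON language packs and a key-parity check might fit together. Everything here is a stand-in: the packs are inlined instead of read from `src/i18n/packs/`, the keys are invented, `translate` is not the real `t` signature, and the English fallback is an assumed behaviour.

```python
import json

# Hypothetical pack contents; real packs live in <code>.json files.
PACKS = {
    "en": json.loads('{"sidebar.language": "Language", "home.title": "Home"}'),
    "es": json.loads('{"sidebar.language": "Idioma"}'),
}


def translate(key: str, lang: str = "en") -> str:
    """Look up a key, falling back to English and then to the key itself."""
    pack = PACKS.get(lang, PACKS["en"])
    return pack.get(key, PACKS["en"].get(key, key))


def missing_keys(lang: str) -> set[str]:
    """Keys present in the English pack but absent from `lang`."""
    return set(PACKS["en"]) - set(PACKS[lang])
```

A parity test in this style would assert that `missing_keys` is empty for every shipped language.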
## 14. Platforms
- Python ≥ 3.10.
- OS: Linux, macOS, Windows.
- Browser: any modern browser.
- Network: not required at runtime.
## 15. Dependencies
- **Core**: pandas, openpyxl, charset-normalizer, typer, loguru.
- **Dedup**: rapidfuzz, phonenumbers.
- **GUI**: streamlit.
- **Optional**: ftfy (mojibake repair).
- **Dev**: pytest, tox.
## 16. Test coverage
- 1,762 tests passing, 0 skipped, 0 xfailed.
- Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases + 20-row international stress fixture), missing-handler (3 use cases + 16 edge cases), column-mapper (3 use cases + 5 edge cases).
- Run: `python run_tests.py [--tool …] [--fixtures] [--coverage]`.
## 17. Privacy / data handling
- All processing local; no network calls in the data path.
- No telemetry.
- Original input never modified.
- Audit logs: `logs/` next to each run (timestamped).
## 18. Error handling
- Structured hierarchy: `DataToolsError` → `InputValidationError`, `ConfigError`, `FileFormatError`, `FileAccessError`.
- Subclasses extend stdlib `ValueError` / `OSError` so existing handlers still catch them.
- Every error carries: message, file path, column, operation, suggestion, underlying cause.
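
One way such a hierarchy could be laid out is sketched below. The class names come from the list above; which subclass extends `ValueError` versus `OSError`, the constructor signature, and the `read_bytes` helper are assumptions. The underlying cause is carried through standard exception chaining (`raise ... from`).

```python
from typing import Optional


class DataToolsError(Exception):
    """Base class holding the context fields every error carries."""

    def __init__(self, message: str, *, path: Optional[str] = None,
                 column: Optional[str] = None, operation: Optional[str] = None,
                 suggestion: Optional[str] = None) -> None:
        super().__init__(message)
        self.message = message
        self.path = path
        self.column = column
        self.operation = operation
        self.suggestion = suggestion


# Assumed mapping: validation-style errors are ValueErrors, access errors are OSErrors.
class InputValidationError(DataToolsError, ValueError): ...
class ConfigError(DataToolsError, ValueError): ...
class FileFormatError(DataToolsError, ValueError): ...
class FileAccessError(DataToolsError, OSError): ...


def read_bytes(path: str) -> bytes:
    """Read a file, wrapping OS-level failures in FileAccessError."""
    try:
        with open(path, "rb") as fh:
            return fh.read()
    except OSError as exc:
        raise FileAccessError(
            f"cannot read {path}", path=path, operation="read",
            suggestion="check that the file exists and is readable",
        ) from exc  # original exception stays available as __cause__
```

With this layout, a caller that already has `except OSError:` or `except ValueError:` keeps working, and newer code can catch `DataToolsError` to reach the extra fields.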