datatools-dev/docs/REQUIREMENTS.md
Michael 3dd924474c docs: short-form numbered requirements list
New docs/REQUIREMENTS.md catalogs every shipped capability in 17 numbered
categories — file handling, input/output encodings, delimiters, line
endings, detectors, finding schema, confidence tiers, decisions,
performance targets (1 GB), tools, gate behavior, interfaces, platforms,
deps, test coverage, privacy. Linked from README and USER-GUIDE so a
buyer / integrator can scan compliance in under a minute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 21:19:21 +00:00

# REQUIREMENTS.md
Numbered, categorized requirements list — short form. The companion to USER-GUIDE.md and TECHNICAL.md; updated with every shipped capability.
---
## 1. File handling
1.1 File size: ≤ 1 GB (target; larger files work, but the gate's full-DataFrame Apply pass scales linearly with file size).
1.2 Input formats: CSV, TSV, XLSX, XLS.
1.3 Output formats: CSV, TSV.
1.4 Excel: multi-sheet workbook picker.
1.5 Empty file: detected, blocks gate with `empty_input` error finding.
## 2. Input encodings (auto-detected)
2.1 Unicode: UTF-8, UTF-8 with BOM, UTF-16 LE/BE with BOM, UTF-16 LE without BOM (best-effort).
2.2 Western: cp1252, ISO-8859-1, ISO-8859-15, Mac Roman.
2.3 Eastern European: cp1250, ISO-8859-2.
2.4 Cyrillic: cp1251, KOI8-R.
2.5 CJK: Shift_JIS / cp932, GB18030, Big5, EUC-KR / cp949.
2.6 ASCII: detected as UTF-8 (byte-equivalent).
2.7 User override: any Python codec name typed in the Review page.
2.8 BOM: stripped on read, never written.
2.9 Decode failure: surfaced as `encoding_decode_failed` (error severity).
2.10 Replacement char (U+FFFD) in output: surfaced as `encoding_uncertain` (error).
## 3. Output encodings
3.1 UTF-8 (default).
3.2 UTF-8 with BOM (Excel-friendly).
3.3 cp1252, ISO-8859-1, ISO-8859-15, cp1250, ISO-8859-2, cp1251.
3.4 Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE.
3.5 Lossy fallback: `?` replacement + warning shown when chosen codec can't represent a character.
## 4. Delimiters
4.1 Auto-detect (input): `,`, `\t`, `;`, `|`.
4.2 Output: `,` (default), `\t`, `;`, `|`.
4.3 File extension: `.tsv` for tab, `.csv` otherwise.
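
One simple way to auto-detect among the four delimiters in 4.1 is to prefer the candidate that appears on every sampled line with a consistent count. The sketch below is an assumption about how such a detector could work, not the shipped algorithm; it ignores quoting, and `guess_delimiter` and `output_extension` are hypothetical names.

```python
from collections import Counter

CANDIDATES = [",", "\t", ";", "|"]


def guess_delimiter(sample: str, default: str = ",") -> str:
    """Pick the candidate present on every sampled line with the most consistent count."""
    lines = [ln for ln in sample.split("\n") if ln][:50]
    best, best_score = default, 0
    for delim in CANDIDATES:
        counts = [ln.count(delim) for ln in lines]
        if not counts or min(counts) == 0:
            continue  # a real delimiter should appear on every line
        # Most common per-line count, and how many lines agree with it.
        modal_count, agreeing = Counter(counts).most_common(1)[0]
        score = agreeing * modal_count
        if score > best_score:
            best, best_score = delim, score
    return best


def output_extension(delimiter: str) -> str:
    """Rule 4.3: .tsv for tab, .csv otherwise."""
    return ".tsv" if delimiter == "\t" else ".csv"
```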
## 5. Line endings
5.1 Input: LF, CRLF, bare CR (all normalized to LF on read).
5.2 Embedded in quoted cells: also normalized to LF.
5.3 Output: LF (default), CRLF, CR.
5.4 Mixed line endings: surfaced as `mixed_line_endings` finding.
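
The normalization in 5.1 and 5.2 and the mixed-endings check in 5.4 can be expressed with one regular expression. A minimal sketch, assuming the text has already been decoded; `normalize_line_endings` is a hypothetical name:

```python
import re

# CRLF is listed first so it is matched as one terminator, not as CR then LF.
_EOL = re.compile(r"\r\n|\r|\n")


def normalize_line_endings(text: str) -> tuple[str, bool]:
    """Return (text with every terminator rewritten to LF, True if endings were mixed)."""
    kinds = set(_EOL.findall(text))
    return _EOL.sub("\n", text), len(kinds) > 1
```

Because the substitution runs over the whole text, terminators embedded in quoted cells are normalized along with record separators.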
## 6. Analyzer detectors
6.1 File-level (audit log of read-time fixes): `csv_bom_stripped`, `csv_nul_stripped`, `csv_smart_quotes_folded`, `csv_line_endings_normalized`, `csv_transcoded_to_utf8`, `csv_unquoted_delimiters_repaired`, `csv_unrepairable_rows`.
6.2 Cell-level: `smart_punctuation_in_data`, `nbsp_or_unicode_whitespace`, `zero_width_or_invisible`, `dirty_column_headers`, `whitespace_padding`, `null_like_sentinels`, `suspected_mojibake`, `mixed_case_email_column`, `near_duplicate_rows`, `leading_zero_ids`.
6.3 Encoding integrity: `encoding_uncertain`, `encoding_decode_failed`, `empty_input`.
6.4 Sample size (default): 1,000 rows; configurable.
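
To illustrate how cell-level detectors can run over a bounded sample (6.4), here is a sketch covering three of the ids from 6.2. The character sets and the function name `scan_sample` are assumptions made for the example; the shipped detectors may cover more code points and return full findings, not bare counts.

```python
from itertools import islice
from typing import Iterable

# Assumed character sets, for illustration only.
NBSP_LIKE = {"\u00a0", "\u2007", "\u202f", "\u3000"}
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def scan_sample(rows: Iterable[list[str]], sample_size: int = 1000) -> dict[str, int]:
    """Count affected cells per detector id in the first `sample_size` rows."""
    counts = {
        "whitespace_padding": 0,
        "nbsp_or_unicode_whitespace": 0,
        "zero_width_or_invisible": 0,
    }
    for row in islice(rows, sample_size):
        for cell in row:
            # Strip only ASCII space/tab so NBSP is reported by its own detector.
            if cell != cell.strip(" \t"):
                counts["whitespace_padding"] += 1
            if any(ch in NBSP_LIKE for ch in cell):
                counts["nbsp_or_unicode_whitespace"] += 1
            if any(ch in ZERO_WIDTH for ch in cell):
                counts["zero_width_or_invisible"] += 1
    return counts
```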
## 7. Finding fields
7.1 `id` — stable identifier.
7.2 `severity` — info / warn / error (error blocks gate).
7.3 `confidence` — high / medium / low (auto-fixability).
7.4 `fix_action` — id of the algorithm in `src/core/fixes.py`.
7.5 `pre_applied` — true if fixed during read pass.
7.6 `tool` — owning tool id (or empty).
7.7 `count`, `description`, `column`, `samples` (≤5).
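
The fields above map naturally onto a dataclass. The sketch below uses the field names from 7.1 to 7.7, but the types, defaults, and the `blocks_gate` helper are assumptions; the actual class in the codebase may differ.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Finding:
    id: str
    severity: str              # "info" | "warn" | "error"
    confidence: str            # "high" | "medium" | "low"
    fix_action: str = ""       # id of a fix algorithm, empty if none
    pre_applied: bool = False  # True if fixed during the read pass
    tool: str = ""             # owning tool id, or empty
    count: int = 0
    description: str = ""
    column: str = ""
    samples: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # 7.7 caps samples at five entries.
        self.samples = list(self.samples)[:5]

    @property
    def blocks_gate(self) -> bool:
        # 7.2: error severity blocks the gate.
        return self.severity == "error"


finding = Finding("whitespace_padding", "warn", "high", samples=list("abcdefg"))
payload = json.dumps(asdict(finding))  # plain-JSON form of the finding
```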
## 8. Confidence tiers
8.1 **high** — round-trip safe; one-click auto-fix.
8.2 **medium** — preview before applying.
8.3 **low** — opt-in only; can corrupt data if wrong.
8.4 **error** — a severity (see 7.2), not a confidence tier; must be resolved or waived before tool pages unlock.
## 9. Decision actions per finding
9.1 `auto` — apply the registered fix.
9.2 `skip` — waive (no change, audit-logged).
9.3 `modified` — apply with custom payload (e.g. user-edited null sentinels).
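
The three decision actions can be sketched as a small dispatcher over a fix registry. Everything named here (`FIXES`, the two fix functions, `apply_decision`, the audit entry layout) is hypothetical and only illustrates the auto / skip / modified semantics; the real registry lives in `src/core/fixes.py`.

```python
from typing import Optional


def strip_whitespace(cells: list[str]) -> list[str]:
    return [c.strip() for c in cells]


def blank_null_sentinels(cells: list[str], sentinels=("N/A", "null", "NULL")) -> list[str]:
    return ["" if c in sentinels else c for c in cells]


# Hypothetical registry: fix_action id -> callable.
FIXES = {
    "strip_whitespace": strip_whitespace,
    "blank_null_sentinels": blank_null_sentinels,
}


def apply_decision(cells: list[str], fix_action: str, decision: str,
                   payload: Optional[dict] = None) -> tuple[list[str], dict]:
    """Apply one decision; return (new cells, audit-log entry)."""
    if decision == "skip":
        fixed = list(cells)                                  # waived: no change
    elif decision == "auto":
        fixed = FIXES[fix_action](cells)                     # registered fix, defaults
    elif decision == "modified":
        fixed = FIXES[fix_action](cells, **(payload or {}))  # user-supplied payload
    else:
        raise ValueError(f"unknown decision: {decision!r}")
    changed = sum(before != after for before, after in zip(cells, fixed))
    return fixed, {"fix_action": fix_action, "decision": decision, "cells_changed": changed}
```

Returning an audit entry even for `skip` reflects 9.2, where a waiver changes nothing but is still logged.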
## 10. Performance (1 GB input)
10.1 Initial scan (`analyze` sample-mode): < 2 s.
10.2 Peak RSS during initial scan: ~110 MB.
10.3 Full-file `repair_bytes`: ~30–40 s (when triggered).
10.4 Full-DataFrame analyze: ~4 min (~25 µs/cell).
10.5 Full-DataFrame `auto_fix`: ~5 min (~30 µs/cell).
10.6 Output write: ~10 s for 1 GB UTF-8 CSV.
10.7 RAM headroom recommended: 4× input file size for the full-Apply path.
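
As a consistency check on the figures above, the per-cell rates in 10.4 and 10.5 can be converted back into a cell count, and 10.7 into a memory budget. This is plain arithmetic on the stated numbers; the helper names are hypothetical and "1 GB" is taken as 10^9 bytes for the example.

```python
def implied_cells(seconds: float, us_per_cell: float) -> float:
    """Number of cells processed in `seconds` at `us_per_cell` microseconds each."""
    return seconds * 1_000_000 / us_per_cell


def recommended_ram(input_bytes: int, factor: int = 4) -> int:
    """10.7: headroom of `factor` times the input size for the full-Apply path."""
    return input_bytes * factor


analyze_cells = implied_cells(4 * 60, 25)    # 9,600,000 cells
auto_fix_cells = implied_cells(5 * 60, 30)   # 10,000,000 cells
ram_for_1gb = recommended_ram(10**9)         # 4,000,000,000 bytes
```

Both rates point to a benchmark file of roughly 10 million cells, so the two timings are mutually consistent.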
## 11. Tools shipped
11.1 Deduplicator — Ready.
11.2 Text Cleaner — Ready.
11.3 Format Standardizer — Coming Soon.
11.4 Missing Value Handler — Coming Soon.
11.5 Column Mapper — Coming Soon.
11.6 Outlier Detector — Coming Soon.
11.7 Multi-File Merger — Coming Soon.
11.8 Validator & Reporter — Coming Soon.
11.9 Pipeline Runner — Coming Soon.
## 12. Gate (Review & Normalize)
12.1 Gates every tool page; tool pages refuse to load until passed.
12.2 Auto-fix button applies all `confidence=high` findings in one click.
12.3 Per-finding controls: Auto-fix / Skip / Customize.
12.4 Live before/after preview per finding (≤5 sample rows).
12.5 Audit log: every fix tagged with finding id, decision, cells changed.
12.6 Encoding override picker (16 codepages + custom).
12.7 Advanced output options expander: encoding + delimiter + line terminator.
12.8 Result keyed by upload SHA-256; survives page reloads, invalidated on re-upload.
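
The behaviour in 12.8 (result keyed by upload SHA-256, kept across reloads, dropped on re-upload) can be modelled as a single-slot cache. This is a sketch under that reading; `GateResultCache` is a hypothetical class, and how the real app persists state across Streamlit reruns is not shown.

```python
import hashlib
from typing import Any, Optional


class GateResultCache:
    """Single-slot cache: holds the gate result for the most recent upload only."""

    def __init__(self) -> None:
        self._key: Optional[str] = None
        self._result: Any = None

    @staticmethod
    def key_for(upload: bytes) -> str:
        return hashlib.sha256(upload).hexdigest()

    def store(self, upload: bytes, result: Any) -> None:
        # Storing a new upload overwrites (invalidates) the previous result.
        self._key, self._result = self.key_for(upload), result

    def lookup(self, upload: bytes) -> Any:
        # Same bytes give the same digest, so a page reload finds the result again.
        return self._result if self.key_for(upload) == self._key else None
```

Keying on content instead of filename means a re-upload of a changed file under the same name is correctly treated as new input.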
## 13. Interfaces
13.1 GUI: Streamlit, runs locally, browser-based, no internet required.
13.2 CLI: Typer apps — `python -m src.cli`, `src.cli_text_clean`, `src.cli_analyze`.
13.3 Python API: `from src.core import …` (analyze, repair_bytes, clean_dataframe, deduplicate, etc.).
13.4 JSON output: `--json` flag on `cli_analyze`; full Finding schema.
## 14. Platforms
14.1 Python: ≥ 3.10.
14.2 OS: Linux, macOS, Windows.
14.3 Display: any modern browser (Streamlit GUI).
14.4 Network: not required at runtime.
## 15. Dependencies
15.1 Core: pandas, openpyxl, charset-normalizer, typer, loguru.
15.2 Dedup: rapidfuzz, phonenumbers.
15.3 GUI: streamlit.
15.4 Optional: ftfy (mojibake repair, `repair_mojibake` fix).
15.5 Dev: pytest, tox.
## 16. Test coverage
16.1 Unit + integration: 765 tests passing.
16.2 Documented gaps: 17 xfail (charset-normalizer label drift on byte-equivalent codepages, byte-level smart-quote fold expectation).
16.3 Fixture corpora: 21 text-cleaner fixtures, 31 encoding fixtures, 9 reference UTF-8 files.
16.4 CI surface: `python run_tests.py [--tool …] [--fixtures] [--coverage]`.
## 17. Privacy / data handling
17.1 All processing local; no network calls in the data path.
17.2 No telemetry, no usage analytics shipped.
17.3 Original input file never modified — outputs go to a separate path.
17.4 Audit logs written to `logs/` next to each run (timestamped).