datatools-dev/docs/REQUIREMENTS.md
Michael abb720997e docs: tight, scannable rewrite — every item earns its place
Refactors all 10 docs (README, USER-GUIDE, CLI-REFERENCE, REQUIREMENTS,
TECHNICAL, DEVELOPER, BUSINESS, DECISIONS, RECOVERY, docs/README) from
prose-heavy to bullet-heavy + table-heavy. Same information density,
significantly less reading load.

Net: 2600 → 1652 lines (~37% reduction) WHILE adding the new content
that landed since v1.6:

- Format Standardizer (3rd Ready tool)
- 199-row buyer corpus
- src/core/errors.py structured hierarchy + ensure_dataframe /
  ensure_choice / wrap_file_read|write / format_for_user helpers
- src/core/_constants.py shared USPS/state lookup tables
- Cross-tool audit fixes (NaN matching, removed_df schema, validation,
  enum-bounds checks, forward-compat config)
- Per-domain error_policy across format standardizers
- Inconsistent-date-format detector
- Excel header-row auto-detection + write_file delimiter param

Per-doc changes:

- README.md (175 → 71): 9-tool table at top, status column, 3 CLI
  entry points listed, dropped repeated marketing prose.
- docs/README.md (38 → 27): pure index — buyer-facing vs creator-only
  split + version footer.
- USER-GUIDE.md (208 → 118): tool table replaces script descriptions,
  troubleshooting compressed to bullets, gate explanation tightened.
- CLI-REFERENCE.md (451 → 235): collapsed flag tables, removed
  redundant intro text, kept full recipes section.
- REQUIREMENTS.md (146 → 129): 18 numbered sections (was 17), added
  §18 Error Handling, formatting tightened to single-line entries.
- TECHNICAL.md (570 → 350): collapsed §3 build pipeline tables, merged
  redundant §3.5-3.7 OS sections, added §7 (Error handling) +
  §11.3 (Format Standardizer spec) + §11.4-11.7 (analyzer / gate /
  Review page / repair_bytes promoted from §10.2.x sub-numbering).
- DEVELOPER.md (285 → 161): module map table replaces per-file prose,
  extension recipes condensed, new §Errors covers when to use each
  hierarchy class.
- BUSINESS.md (278 → 225): collapsed prose to tables (use cases,
  competitive landscape, costs, risks); honest-status updated.
- DECISIONS.md (269 → 189): scoring rubric + GUI matrix preserved,
  decision log compressed to single-line entries, added v1.6 entries
  (Format Standardizer Ready, errors module).
- RECOVERY.md (180 → 147): rebuild steps as numbered + tabular,
  external dependencies as one table, recovery priorities tightened.

No information removed; redundancy compressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:49:29 +00:00


Requirements

Numbered support matrix. Updated with every shipped capability.

1. File handling

1.1 Size: ≤ 1 GB target (larger files work, but slower).
1.2 Read: CSV, TSV, XLSX, XLS.
1.3 Write: CSV, TSV.
1.4 Excel: multi-sheet picker.
1.5 Empty file: blocked with an empty_input error finding.

2. Input encodings (auto-detected)

2.1 Unicode: UTF-8, UTF-8-BOM, UTF-16 LE/BE BOM, UTF-16 LE no-BOM.
2.2 Western: cp1252, ISO-8859-1, ISO-8859-15, Mac Roman.
2.3 Eastern European: cp1250, ISO-8859-2.
2.4 Cyrillic: cp1251, KOI8-R.
2.5 CJK: Shift_JIS / cp932, GB18030, Big5, EUC-KR / cp949.
2.6 ASCII → detected as UTF-8.
2.7 User override: any Python codec name.
2.8 BOM: stripped on read, never written.
2.9 Decode failure → encoding_decode_failed (error).
2.10 U+FFFD in output → encoding_uncertain (error).
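
Items 2.6 to 2.9 can be illustrated with a minimal sketch. It uses only the standard library; `decode_with_bom` is a hypothetical helper name, and the project's real detector (which per §15 relies on charset-normalizer) is not shown in this document.

```python
import codecs
from typing import Optional, Tuple

# BOM signatures to check. Illustrative only, not the project's detector.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom(raw: bytes, override: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, codec_name), stripping a leading BOM if one is present.

    A UnicodeDecodeError raised here corresponds to the situation the
    requirements report as encoding_decode_failed.
    """
    if override is not None:
        # 2.7: the user may name any Python codec explicitly.
        return raw.decode(override), override
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            # 2.8: the BOM is dropped on read.
            return raw[len(bom):].decode(codec), codec
    # 2.6: ASCII is a subset of UTF-8, so pure-ASCII input lands here too.
    return raw.decode("utf-8"), "utf-8"
```

Legacy code pages (2.2 to 2.5) cannot be identified from a signature and need statistical detection, which this sketch deliberately leaves out.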

3. Output encodings

3.1 UTF-8 (default), UTF-8-BOM (Excel-friendly).
3.2 cp1252, ISO-8859-1/15, cp1250, ISO-8859-2, cp1251.
3.3 Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE.
3.4 Lossy fallback: writes ? and emits a warning when the codec can't represent a character.
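
The lossy fallback in 3.4 matches what Python's built-in `errors="replace"` handler does when encoding. A minimal sketch, assuming a hypothetical helper named `encode_lossy` (the project's actual write path is not shown here):

```python
import warnings


def encode_lossy(text: str, codec: str) -> bytes:
    """Encode text, substituting '?' for unrepresentable characters."""
    try:
        # Fast path: everything fits in the target codec.
        return text.encode(codec)
    except UnicodeEncodeError as exc:
        warnings.warn(
            f"{codec} cannot represent {exc.object[exc.start:exc.end]!r}; "
            "writing '?' instead",
            stacklevel=2,
        )
        # errors="replace" writes '?' for each character the codec lacks.
        return text.encode(codec, errors="replace")
```

For example, a snowman character (U+2603) has no cp1252 mapping, so `encode_lossy("a\u2603b", "cp1252")` produces `b"a?b"` and one warning.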

4. Delimiters

4.1 Input auto-detect: comma (,), tab (\t), semicolon (;), pipe (|).
4.2 Output: comma (default), tab, semicolon, pipe.
4.3 Extension: .tsv for tab, .csv otherwise.
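
One simple way to auto-detect among the four candidate delimiters is to pick the one that appears the same non-zero number of times on every sampled line. This is a hedged sketch of that heuristic, not the project's detector; `guess_delimiter` and `output_extension` are hypothetical names, and the heuristic ignores quoting.

```python
CANDIDATES = (",", "\t", ";", "|")


def guess_delimiter(sample: str, default: str = ",") -> str:
    """Pick the candidate with a consistent, highest per-line count."""
    lines = [ln for ln in sample.splitlines() if ln]
    best, best_count = default, 0
    for delim in CANDIDATES:
        counts = {ln.count(delim) for ln in lines}
        if len(counts) == 1:  # same count on every line
            (count,) = counts
            if count > best_count:
                best, best_count = delim, count
    return best


def output_extension(delimiter: str) -> str:
    """4.3: tab-delimited output gets .tsv, everything else .csv."""
    return ".tsv" if delimiter == "\t" else ".csv"
```

A production detector would also need to respect quoted fields, where a delimiter character can legitimately appear inside a cell.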

5. Line endings

5.1 Read: LF / CRLF / bare CR, all normalized to LF.
5.2 Embedded in quoted cells: also normalized to LF.
5.3 Write: LF (default), CRLF, CR.
5.4 Mixed → mixed_line_endings finding.
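
The normalization in 5.1 and the mixed-ending check in 5.4 can be sketched with the standard library. The function names are hypothetical, and the text is assumed to have been read with `newline=""` so that Python has not already translated the line endings.

```python
import re
from typing import Set

_EOL = re.compile(r"\r\n|\r|\n")  # CRLF must be tried before bare CR


def line_ending_kinds(text: str) -> Set[str]:
    """Return the distinct line terminators found in the text."""
    return set(_EOL.findall(text))


def has_mixed_line_endings(text: str) -> bool:
    """True when more than one terminator style is present (see 5.4)."""
    return len(line_ending_kinds(text)) > 1


def normalize_to_lf(text: str) -> str:
    """Convert CRLF and bare CR to LF (see 5.1)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

Replacing CRLF first matters: converting bare CR first would turn each CRLF into two LFs.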

6. Analyzer detectors

File-level (read-time fixes, audit-logged):

  • csv_bom_stripped, csv_nul_stripped, csv_smart_quotes_folded, csv_line_endings_normalized, csv_transcoded_to_utf8, csv_unquoted_delimiters_repaired, csv_unrepairable_rows.

Cell-level:

  • smart_punctuation_in_data, nbsp_or_unicode_whitespace, zero_width_or_invisible, dirty_column_headers, whitespace_padding, null_like_sentinels, suspected_mojibake, mixed_case_email_column, inconsistent_date_format, near_duplicate_rows, leading_zero_ids.

Encoding integrity: encoding_uncertain, encoding_decode_failed, empty_input.

Sample size: 1,000 rows (configurable).
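
To make the sampling idea concrete, here is a rough sketch of how three of the cell-level detectors named above might count hits over the first N values of a column. The detection rules and the `NULL_LIKE` set are assumptions made for illustration; the project's actual rules are not given in this document.

```python
from itertools import islice
from typing import Dict, Iterable

# Assumed sentinel spellings; the real list is not documented here.
NULL_LIKE = {"na", "n/a", "null", "none", "nan"}


def scan_column(values: Iterable[str], sample_size: int = 1000) -> Dict[str, int]:
    """Count detector hits in the first `sample_size` cells of a column."""
    counts = {
        "whitespace_padding": 0,
        "null_like_sentinels": 0,
        "leading_zero_ids": 0,
    }
    for value in islice(values, sample_size):
        stripped = value.strip()
        if value != stripped:
            counts["whitespace_padding"] += 1
        if stripped.lower() in NULL_LIKE:
            counts["null_like_sentinels"] += 1
        # All-digit strings starting with 0 lose information if cast to int.
        if len(stripped) > 1 and stripped.isdigit() and stripped.startswith("0"):
            counts["leading_zero_ids"] += 1
    return counts
```

Scanning only a sample keeps the initial pass fast, at the cost of missing issues that occur only beyond the sampled rows.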

7. Finding fields

id, severity (info/warn/error), confidence (high/medium/low), fix_action, pre_applied, tool, count, description, column, samples (≤5).
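
The field list above maps naturally onto a dataclass. The sketch below is an assumption about shape only: the field names and allowed values come from the list, while the Python types, defaults, and validation are guesses for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SEVERITIES = ("info", "warn", "error")
CONFIDENCES = ("high", "medium", "low")
MAX_SAMPLES = 5


@dataclass
class Finding:
    id: str
    severity: str
    confidence: str
    fix_action: Optional[str]      # type is an assumption
    pre_applied: bool
    tool: Optional[str]            # type is an assumption
    count: int
    description: str
    column: Optional[str] = None   # assumed None for file-level findings
    samples: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if self.confidence not in CONFIDENCES:
            raise ValueError(f"unknown confidence: {self.confidence!r}")
        # Keep at most five samples, as the field list specifies.
        self.samples = list(self.samples)[:MAX_SAMPLES]
```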

8. Confidence tiers

  • high — round-trip safe, one-click auto-fix.
  • medium — preview before applying.
  • low — opt-in only, can corrupt if wrong.
  • error — must resolve or waive before tool pages unlock.

9. Decision actions

  • auto — apply registered fix.
  • skip — waive (audit-logged).
  • modified — apply with custom payload.
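
The three decision actions can be read as a small dispatch over a registry of fixes. The following is a sketch under assumptions: the `FIXES` registry, the `apply_decision` name, and the payload shape (`{"chars": ...}`) are all hypothetical and exist only to show the control flow.

```python
from typing import Callable, Dict, Optional


def _strip_padding(value: str, payload: Optional[dict]) -> str:
    # With no payload, strip ordinary whitespace; a custom payload may
    # name the characters to strip instead (hypothetical payload shape).
    chars = payload.get("chars") if payload else None
    return value.strip(chars)


# Hypothetical registry: finding id -> fix function.
FIXES: Dict[str, Callable[[str, Optional[dict]], str]] = {
    "whitespace_padding": _strip_padding,
}


def apply_decision(value: str, finding_id: str, action: str,
                   payload: Optional[dict] = None) -> str:
    """Apply one decision (auto / skip / modified) to a single cell."""
    if action == "skip":
        return value                      # waived: data left untouched
    fix = FIXES[finding_id]
    if action == "auto":
        return fix(value, None)           # registered fix, default settings
    if action == "modified":
        return fix(value, payload)        # registered fix, custom payload
    raise ValueError(f"unknown action: {action!r}")
```

A real implementation would also write each decision to the audit log, which this sketch omits.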

10. Performance (1 GB input)

  • Initial scan (sample): < 2 s · peak RSS ~110 MB.
  • Full-file repair_bytes: 30–40 s.
  • Full-DataFrame analyze: ~4 min (~25 µs/cell).
  • Full-DataFrame auto_fix: ~5 min (~30 µs/cell).
  • Output write: ~10 s.
  • Recommended RAM: 4× input size for full-Apply path.
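
The per-cell figures above allow a back-of-envelope estimate for other file shapes. At ~25 µs/cell, the ~4 min analyze time corresponds to roughly 9.6 million cells (240 s ÷ 25 µs). The helper names below are hypothetical, and the estimate assumes cost scales linearly with cell count, which the document does not state.

```python
ANALYZE_US_PER_CELL = 25    # from the figures above
AUTO_FIX_US_PER_CELL = 30   # from the figures above
RAM_MULTIPLIER = 4          # "4x input size" for the full-Apply path


def estimate_seconds(rows: int, cols: int, us_per_cell: float) -> float:
    """Linear estimate: cells x microseconds per cell, in seconds."""
    return rows * cols * us_per_cell / 1_000_000


def recommended_ram_bytes(input_bytes: int) -> int:
    """Recommended RAM for the full-Apply path."""
    return RAM_MULTIPLIER * input_bytes
```

For instance, a 1,000,000-row by 10-column frame would be estimated at 250 s for a full analyze pass under this linear assumption.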

11. Tools

  1. Deduplicator — Ready
  2. Text Cleaner — Ready
  3. Format Standardizer — Ready
  4. Missing Value Handler — Coming Soon
  5. Column Mapper — Coming Soon
  6. Outlier Detector — Coming Soon
  7. Multi-File Merger — Coming Soon
  8. Validator & Reporter — Coming Soon
  9. Pipeline Runner — Coming Soon

12. Gate (Review & Normalize)

  • Gates every tool page.
  • Auto-fix button: applies all confidence=high findings in one click.
  • Per-finding controls: Auto / Skip / Customize.
  • Live before/after preview (≤5 sample rows).
  • Audit log per fix (id, decision, cells changed).
  • Encoding-override picker (16 codepages + custom).
  • Advanced output expander: encoding + delimiter + line terminator.
  • Result keyed by upload SHA-256; survives reload, invalidated on re-upload.
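
Keying the review result by the upload's SHA-256 can be sketched as a content-addressed lookup. `ReviewCache` is a hypothetical class, and the in-memory dict stands in for whatever session storage the GUI actually uses.

```python
import hashlib
from typing import Any, Dict, Optional


class ReviewCache:
    """Stores one review result per distinct upload content."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    @staticmethod
    def key_for(upload: bytes) -> str:
        # The hex digest of the raw bytes identifies the upload.
        return hashlib.sha256(upload).hexdigest()

    def put(self, upload: bytes, result: Any) -> None:
        self._results[self.key_for(upload)] = result

    def get(self, upload: bytes) -> Optional[Any]:
        # Different bytes hash to a different key, so a changed
        # re-upload misses and must be reviewed again.
        return self._results.get(self.key_for(upload))
```

Under this sketch a byte-identical re-upload would still hit the cache; whether the project treats that case as invalidated is not stated here.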

13. Interfaces

  • GUI: Streamlit, browser-based, local, no internet.
  • CLI: python -m src.cli (dedup) · src.cli_text_clean · src.cli_analyze.
  • Python API: from src.core import … (analyze, repair_bytes, clean_dataframe, deduplicate, standardize_dataframe, …).
  • JSON output: --json on cli_analyze.

14. Platforms

  • Python ≥ 3.10.
  • OS: Linux, macOS, Windows.
  • Browser: any modern browser.
  • Network: not required at runtime.

15. Dependencies

  • Core: pandas, openpyxl, charset-normalizer, typer, loguru.
  • Dedup: rapidfuzz, phonenumbers.
  • GUI: streamlit.
  • Optional: ftfy (mojibake repair).
  • Dev: pytest, tox.

16. Test coverage

  • 1,230 tests passing, 4 skipped (ftfy not installed), 17 xfailed (documented).
  • Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases).
  • Run: python run_tests.py [--tool …] [--fixtures] [--coverage].

17. Privacy / data handling

  • All processing local; no network calls in the data path.
  • No telemetry.
  • Original input never modified.
  • Audit logs: logs/ next to each run (timestamped).

18. Error handling

  • Structured hierarchy: DataToolsError → InputValidationError, ConfigError, FileFormatError, FileAccessError.
  • Subclasses extend stdlib ValueError / OSError so existing handlers still catch them.
  • Every error carries: message, file path, column, operation, suggestion, underlying cause.
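
A hierarchy with these properties can be sketched as follows. The class names come from the list above; which subclass extends ValueError versus OSError, the constructor signature, and the attribute names are assumptions, and the real src/core/errors.py may differ.

```python
from typing import Optional


class DataToolsError(Exception):
    """Base class carrying structured context for user-facing messages."""

    def __init__(self, message: str, *, path: Optional[str] = None,
                 column: Optional[str] = None,
                 operation: Optional[str] = None,
                 suggestion: Optional[str] = None) -> None:
        super().__init__(message)
        self.message = message
        self.path = path
        self.column = column
        self.operation = operation
        self.suggestion = suggestion
        # The underlying cause is attached with `raise ... from exc`
        # and is then available as __cause__.


# Assumed mapping onto stdlib bases, so existing handlers keep working.
class InputValidationError(DataToolsError, ValueError): ...
class ConfigError(DataToolsError, ValueError): ...
class FileFormatError(DataToolsError, ValueError): ...
class FileAccessError(DataToolsError, OSError): ...


def read_text(path: str) -> str:
    """Example wrapper: convert a raw OSError into a structured one."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError as exc:
        raise FileAccessError(
            f"cannot read {path}", path=path, operation="read",
            suggestion="check that the file exists and is readable",
        ) from exc
```

Because FileAccessError also derives from OSError in this sketch, a caller with an existing `except OSError:` block still catches it, while newer code can catch DataToolsError and read the extra fields.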