Files
datatools-dev/docs/REQUIREMENTS.md
Michael ee0b1f6f6b docs: design notes for future PDF→CSV tool
New ``docs/FUTURE-TOOLS.md`` captures post-launch tool ideas with a
consistent shape — What / Why / Can we ship now / Approach / GUI
sketch / Effort / Risks / Ship criteria. Resting place for things
the new-tool freeze in ``PLAN.md`` §2.1 refuses to build but that
keep coming up.

First entry: **#10 PDF → CSV extractor** (bank statements et al.).

Key facts captured:

- **Current state**: no PDF infrastructure exists. Zero PDF
  dependencies in requirements.txt; zero PDF-touching code under
  ``src/``. The only "PDF" string in the codebase is the planned-
  output copy for the Quality Check tool, unrelated to extraction.
- **Library picks**: pdfplumber as the extraction core (BSD-3,
  no native compiler, gives coordinate-aware text), Tesseract via
  pytesseract as the OCR fallback for scanned PDFs,
  streamlit-drawable-canvas as the region-picker component.
- **GUI sketch**: user draws a header strip + a row template on a
  rendered page; the tool applies that template across N pages,
  saves the template by layout fingerprint for next month's
  statement, emits CSV.
- **Effort phased A–E**: 3–4 weeks for a text-only MVP; 6–10
  weeks for a polished version with multi-page template recall;
  +2–3 weeks if scanned-PDF OCR is required.
- **Difficulty**: medium-hard. The pieces are well-trodden; the
  combination (region selection that persists across pages and
  across documents with similar layouts) is where the engineering
  goes.
- **Ship criteria**: ≥1 paying customer + ≥3 paid or ≥5 demo
  emails asking for PDF extraction + the bookkeeper niche
  converting at least one customer first. None have fired.

Cross-references added:

- ``docs/REQUIREMENTS.md`` §11: pointer to FUTURE-TOOLS.md for
  parked tool ideas, with a one-paragraph summary of #10.
- ``docs/PLAN.md`` §2.1: notes that the freeze parks future tools
  in FUTURE-TOOLS.md and explicitly names #10 as the current
  highest-pressure entry.
- ``docs/NEXT-STEPS.md`` Phase 5 "what NOT to build" table: a new
  row for the PDF tool tied to the same ship-trigger language.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 01:52:42 +00:00

263 lines
13 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Requirements
Numbered support matrix. Updated with every shipped capability.
## 1. File handling
1.1 Size: ≤ 1.5 GB target (larger works, slower).
1.2 Read: CSV, TSV, XLSX, XLS.
1.3 Write: CSV, TSV.
1.4 Excel: multi-sheet picker.
1.5 Empty file: blocked with `empty_input` error finding.
## 2. Input encodings (auto-detected)
2.1 Unicode: UTF-8, UTF-8-BOM, UTF-16 LE/BE BOM, UTF-16 LE no-BOM.
2.2 Western: cp1252, ISO-8859-1, ISO-8859-15, Mac Roman.
2.3 Eastern European: cp1250, ISO-8859-2.
2.4 Cyrillic: cp1251, KOI8-R.
2.5 CJK: Shift_JIS / cp932, GB18030, Big5, EUC-KR / cp949.
2.6 ASCII → detected as UTF-8.
2.7 User override: any Python codec name.
2.8 BOM: stripped on read, never written.
2.9 Decode failure → `encoding_decode_failed` (error).
2.10 U+FFFD in output → `encoding_uncertain` (error).
## 3. Output encodings
3.1 UTF-8 (default), UTF-8-BOM (Excel-friendly).
3.2 cp1252, ISO-8859-1/15, cp1250, ISO-8859-2, cp1251.
3.3 Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE.
3.4 Lossy fallback: `?` + warning when codec can't represent a char.
## 4. Delimiters
4.1 Input auto-detect: `,`, `\t`, `;`, `|`.
4.2 Output: `,` (default), `\t`, `;`, `|`.
4.3 Extension: `.tsv` for tab, `.csv` otherwise.
## 5. Line endings
5.1 Read: LF / CRLF / bare CR — all normalized to LF.
5.2 Embedded in quoted cells: also normalized to LF.
5.3 Write: LF (default), CRLF, CR.
5.4 Mixed → `mixed_line_endings` finding.
## 6. Analyzer detectors
**File-level** (read-time fixes, audit-logged):
- `csv_bom_stripped`, `csv_nul_stripped`, `csv_smart_quotes_folded`, `csv_line_endings_normalized`, `csv_transcoded_to_utf8`, `csv_unquoted_delimiters_repaired`, `csv_unrepairable_rows`.
**Cell-level**:
- `smart_punctuation_in_data`, `nbsp_or_unicode_whitespace`, `zero_width_or_invisible`, `dirty_column_headers`, `whitespace_padding`, `null_like_sentinels`, `suspected_mojibake`, `mixed_case_email_column`, `inconsistent_date_format`, `near_duplicate_rows`, `leading_zero_ids`.
**Encoding integrity**: `encoding_uncertain`, `encoding_decode_failed`, `encoding_lying_bom`, `empty_input`.
Sample size: 1,000 rows (configurable).
## 7. Finding fields
`id`, `severity` (info/warn/error), `confidence` (high/medium/low), `fix_action`, `pre_applied`, `tool`, `count`, `description`, `column`, `samples` (≤5).
## 8. Confidence tiers
- **high** — round-trip safe, one-click auto-fix.
- **medium** — preview before applying.
- **low** — opt-in only, can corrupt if wrong.
- **error** — must resolve or waive before tool pages unlock.
## 9. Decision actions
- `auto` — apply registered fix.
- `skip` — waive (audit-logged).
- `modified` — apply with custom payload.
## 10. Performance (1.5 GB input)
- Initial scan (sample): < 2 s · peak RSS ~110 MB.
- Full-file `repair_bytes`: 3040 s (UTF-8); non-UTF-8 fold path now
uses ``str.count`` instead of a Python char-by-char zip walk —
formerly ~100 s on a 1 GB cp1252 file with smart quotes, now <1 s.
- Full-DataFrame analyze: ~4 min (~25 µs/cell). Near-duplicate detector
no longer allocates a full-frame copy — peak RSS during the
near-duplicate pass drops to roughly the size of the string columns
alone (~50% memory cut on text-heavy 1 GB inputs).
- Full-DataFrame `auto_fix`: ~5 min (~30 µs/cell).
- Output write: ~10 s.
- Recommended RAM: 34× input size for the full-Apply path.
- **Standardize Formats** (`standardize_dataframe`): ~2.7M rows/sec on
cache-warm repetition-heavy columns (synthetic 1M-row in-memory
benchmark, 2 typed columns); the fused single-pass loop replaced a
3-pass ``.tolist()`` cycle, so per-call overhead is now dominated by
the underlying parsers (phonenumbers, dateutil) rather than Python
list materialisation. A 1.5 GB CSV with mixed phone+currency+address
columns finishes in ~1.56 minutes depending on column count.
`StandardizeOptions.parallel_columns` (default 1, serial) lands the
thread-pool scaffolding; on CPython 3.12 with the GIL it's
roughly neutral, but the API is ready for the free-threaded
(PEP 703) Python 3.13+ build where it will help.
- **Clean Text** (`clean_dataframe`): ~1M rows/sec on
repetition-heavy columns (per-call string cache: the pipeline runs
once per *unique* cell value, not once per row).
- **Fix Missing Values** (`handle_missing`): lazy-copy — when sentinel
standardization runs but finds nothing, AND no drops AND no fills
apply, the input frame is returned as-is. On a clean 1 GB file this
saves the 1 GB allocation that the unconditional upfront copy used
to take.
- **Map Columns** (`map_columns`): rename + drop both already
return fresh frames; the explicit upfront `df.copy()` is now
removed and downstream mutating steps (schema-add, coerce) copy on
demand via `_ensure_owned()`. Rename-only and identity-mapping
paths run with zero explicit copies.
- **Find Duplicates**:
- **Exact-only strategies** (every column uses `Algorithm.EXACT` at
threshold 100 — covers strong-key dedup like email/phone, the
fallback drop-duplicates path, and explicit "match on this exact
column" calls) now run in **O(n)** via groupby. Measured: 10k
rows on an email-exact strategy → 73 ms (was ~30 minutes via the
old O(n²) pair compare).
- **Fuzzy strategies** still pair-compare. Opt in to **prefix
blocking** via `deduplicate(..., blocking_columns=['name'],
blocking_prefix_len=1)` to partition pairs by a cheap key.
Measured: 5k rows fuzzy-name dedup → 25.6s with blocking vs.
179s without (7× faster). Trade-off: cross-block matches are
missed; lower `blocking_prefix_len` widens blocks.
- Normalisation pass remains LRU-cached per call so repeat values
(the common dedup workload) skip re-parsing.
## 11. Tools
1. Find Duplicates — Ready
2. Clean Text — Ready
3. Standardize Formats — Ready
4. Fix Missing Values — Ready
5. Map Columns — Ready
6. Find Unusual Values — Coming Soon
7. Combine Files — Coming Soon
8. Quality Check — Coming Soon
9. Automated Workflows — Ready
**Future / not in v1.** Tool ideas captured for after-launch consideration
live in `docs/FUTURE-TOOLS.md` — entries there are gated by the new-tool
freeze in `PLAN.md` §2.1 and don't ship without a paying-customer +
repeated-demand signal. Currently parked there:
- **#10. PDF → CSV extractor** (bank statements + similar). No PDF
dependency exists in the repo today; this tool would need pdfplumber,
streamlit-drawable-canvas, and a templates store. Estimated 34 weeks
for a text-only MVP, 610 weeks for the polished version with
multi-page template recall.
### 11.a Recommended pipeline order (soft, not enforced)
Automated Workflows ships with a `SOFT_DEPENDENCIES` table; the
following ordering is the default and the basis of the warning
surface. Re-ordering is allowed; the runner emits a warning string
and proceeds.
| # | Tool | Why this slot |
|---|------|---------------|
| 1 | column_map (optional, for header alignment) | Multi-vendor unification — rename early so downstream tools see canonical headers |
| 2 | text_clean | NBSP / smart quotes / zero-width pollution silently breaks downstream parsers |
| 3 | format_standardize | Phones / dates / currencies → canonical form before missing detection and dedup |
| 4 | missing | Sentinel detection, imputation, drop strategies — needs canonical types |
| 5 | column_map (optional, for schema enforcement) | Project to target schema, coerce, drop extras AFTER cleaning |
| 6 | dedup | Fuzzy matching is most accurate on canonicalised, sentinel-laundered data |
## 12. Gate (Review & Normalize)
- Gates every tool page.
- Auto-fix button: applies all `confidence=high` findings in one click.
- Per-finding controls: Auto / Skip / Customize.
- Live before/after preview (≤5 sample rows).
- Audit log per fix (id, decision, cells changed).
- Encoding-override picker (16 codepages + custom).
- Advanced output expander: encoding + delimiter + line terminator.
- Result keyed by upload SHA-256; survives reload, invalidated on re-upload.
## 13. Interfaces
- **GUI**: Streamlit, browser-based, local, no internet. Sidebar language picker (English, Español).
- **CLI**: `python -m src.cli` (dedup) · `src.cli_text_clean` · `src.cli_format` · `src.cli_missing` · `src.cli_column_map` · `src.cli_pipeline` · `src.cli_analyze`. (CLI output is English-only.)
- **Python API**: `from src.core import …` (analyze, repair_bytes, clean_dataframe, deduplicate, standardize_dataframe, …).
- **JSON output**: `--json` on `cli_analyze`.
- **Language packs**: `from src.i18n import t, LANGUAGES`. Add `<code>.json` to `src/i18n/packs/` + entry in `LANGUAGES` to add a language.
## 14. Platforms
- Python ≥ 3.10.
- OS: Linux, macOS, Windows.
- Browser: any modern browser.
- Network: not required at runtime.
## 15. Dependencies
- **Core**: pandas, openpyxl, charset-normalizer, typer, loguru.
- **Dedup**: rapidfuzz, phonenumbers.
- **GUI**: streamlit.
- **Optional**: ftfy (mojibake repair).
- **Dev**: pytest, tox.
## 16. Test coverage
- 2,033 tests passing, 0 skipped, 0 xfailed.
- 1,868 core + CLI tests (run with `pytest -m 'not gui'` for a quick loop).
Includes 49 license-layer unit tests (Ed25519 sign/verify, dev-key
derivation, production-safe tripwire, schema), 25 license-CLI
tests, and 17 Lite-tier feature-map + guard tests.
- 165 GUI tests under `tests/gui/` driving Streamlit pages via `AppTest`
(smoke + EN/ES localization, chrome, gate, workflows, dedup review,
advanced panels, error paths, findings panel, activation +
license gate, Lite-tier per-page lock behaviour). Marked `gui`.
- Includes 15 perf-shape regression tests.
- Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases + 20-row international stress fixture), missing-handler (3 use cases + 16 edge cases), column-mapper (3 use cases + 5 edge cases).
- Run: `python run_tests.py [--tool …] [--fixtures] [--coverage]`.
## 17. Privacy / data handling
- All processing local; no network calls in the data path.
- No telemetry.
- Original input never modified.
- Audit logs: `logs/` next to each run (timestamped).
## 17a. Licensing
- **Storage**: ``~/.datatools/license.json`` (or
``$DATATOOLS_LICENSE_PATH`` override). Signed with Ed25519
(asymmetric).
- **Crypto**: Ed25519. The seller holds the private key; every
shipped binary embeds only the public key. A motivated reverse
engineer who pulls everything out of the binary still can't sign
new licenses. Keys are 32 bytes raw, exposed as hex via
``DATATOOLS_LICENSE_PRIVKEY`` (seller-side) and
``DATATOOLS_LICENSE_PUBKEY`` (build-time bake-in).
- **Activation**: buyer pastes a base64-encoded license blob
(``DTLIC1:...``) on first launch; app verifies the signature
offline + matches the buyer-entered name/email to the embedded
values.
- **No free trial**: every license requires a paid blob from the
seller. The user-facing trial flow (button + ``license_cli trial``
subcommand) was removed in v1.6 to keep paid-tier economics clean.
- **Lifetime**: every license is 1 year by default. Renewal applies a
fresh blob without losing the embedded buyer identity. Tier may
change during renewal (Lite → Core upgrade path).
- **Tiers**:
- ``lite`` — Find Duplicates + Clean Text + Standardize Formats.
Buyer pays once, gets the three universally-useful tools.
- ``core`` — every Ready tool (all 9 in v1.6).
- ``pro``, ``enterprise`` — scaffolded for future SKUs; currently
mirror Core. Add per-SKU restrictions by editing
``FEATURES_BY_TIER`` in ``src/license/features.py``.
- ``trial`` — kept in the enum for backwards compat with any
field-tested trial licenses but no longer issuable.
- **Feature flags**: every tool has a stable feature id matching its
``tool_id`` in :mod:`src.gui.tools_registry`. Adding a future per-
tool SKU is a one-line change to ``FEATURES_BY_TIER`` — no consumer
code edits.
- **Per-tool gating**: each tool page (GUI) and tool CLI calls
``require_feature(FeatureFlag.<TOOL>)`` at entry. GUI shows an
upgrade prompt + button to the Activate page; CLI prints a
message naming the locked feature and exits with code 2.
- **Lock badge**: the home grid shows a red 🔒 Locked pill on tool
cards the current tier doesn't unlock.
- **Dev bypass**: ``DATATOOLS_DEV_MODE=1`` skips every check (used by
the test suite and during development). **Refused in shipped
builds** by the production-safe tripwire.
- **Production-safe tripwire**: ``assert_production_safe()`` runs at
startup in every frozen build. Refuses to boot when ``DEV_MODE``
is set or the verification key is still the embedded dev key
(i.e., the build pipeline forgot to override
``DATATOOLS_LICENSE_PUBKEY``). No-op in source / pytest runs.
- **No internet**: signature verification is fully offline. The
shipped binary embeds only the public key; the private key never
leaves the seller. See ``docs/DECISIONS.md`` for the threat-model
discussion.
## 18. Error handling
- Structured hierarchy: `DataToolsError` → `InputValidationError`, `ConfigError`, `FileFormatError`, `FileAccessError`.
- Subclasses extend stdlib `ValueError` / `OSError` so existing handlers still catch them.
- Every error carries: message, file path, column, operation, suggestion, underlying cause.