Files
datatools-dev/docs/REQUIREMENTS.md
Michael e534fb4989 sec(license): Ed25519 sigs + production-safe tripwire
Two coupled hardening upgrades.

1. Asymmetric signatures (HMAC → Ed25519)

The previous HMAC scheme used a symmetric secret that any motivated
reverse engineer could pull out of the shipped binary and use to
mint blobs for any tier / name / email. With Ed25519, the binary
ships only the public verification key; the signing key never
leaves the seller's environment, so binary compromise no longer
yields forgery.

- src/license/crypto.py rewritten around
  cryptography.hazmat.primitives.asymmetric.ed25519. Same public
  API surface (sign/verify/encode_blob/decode_blob), same canonical
  JSON encoding — drop-in for the manager / cli / GUI layers.
- DATATOOLS_LICENSE_PRIVKEY (seller-side) and
  DATATOOLS_LICENSE_PUBKEY (build-time) env vars supply the keys;
  the in-source dev keypair (src/license/_dev_keypair.py)
  deterministically derives from a seed phrase for repro builds and
  tests.
- Blob prefix bumped DTLIC1: → DTLIC2:. Decoding a DTLIC1 blob
  surfaces a clear "old format" error rather than a confusing
  signature mismatch.
- scripts/generate_keypair.py mints fresh production keypairs for
  the seller (run once, stash the private key offline). Adds
  cryptography>=41,<46 to requirements.txt (was an undeclared
  transitive dep).

2. Production-safe tripwire

assert_production_safe() refuses to boot a frozen / shipped build
when either:

- DATATOOLS_DEV_MODE=1 is set (would unconditionally bypass every
  license check — fine in source/test but catastrophic in a buyer
  install).
- The active verification key is still the embedded dev key (the
  build pipeline forgot to set DATATOOLS_LICENSE_PUBKEY).

No-op in source / pytest runs (sys.frozen is unset) so test
fixtures and dev workflows keep working without ceremony. Called
from src/cli_license_guard.guard() and from hide_streamlit_chrome
— so it fires on every CLI invocation and every GUI page load.

Tests: 49 license-layer unit tests (was 40); added Ed25519
wrong-key rejection, dev-keypair seed pin, blob v2 prefix, v1
rejection with clear message, and four production-safe scenarios
(no-op in source, fires on DEV_MODE in frozen, fires on dev key in
frozen, passes in frozen with prod pubkey). Total: 2024 → 2033.

Docs (REQUIREMENTS §17a, DEVELOPER licensing recipe, DECISIONS
§9b + decision log) updated with the new threat-model write-up,
key-storage workflow, and tripwire behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:34:48 +00:00

252 lines
12 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Requirements
Numbered support matrix. Updated with every shipped capability.
## 1. File handling
1.1 Size: ≤ 1.5 GB target (larger works, slower).
1.2 Read: CSV, TSV, XLSX, XLS.
1.3 Write: CSV, TSV.
1.4 Excel: multi-sheet picker.
1.5 Empty file: blocked with `empty_input` error finding.
## 2. Input encodings (auto-detected)
2.1 Unicode: UTF-8, UTF-8-BOM, UTF-16 LE/BE BOM, UTF-16 LE no-BOM.
2.2 Western: cp1252, ISO-8859-1, ISO-8859-15, Mac Roman.
2.3 Eastern European: cp1250, ISO-8859-2.
2.4 Cyrillic: cp1251, KOI8-R.
2.5 CJK: Shift_JIS / cp932, GB18030, Big5, EUC-KR / cp949.
2.6 ASCII → detected as UTF-8.
2.7 User override: any Python codec name.
2.8 BOM: stripped on read, never written.
2.9 Decode failure → `encoding_decode_failed` (error).
2.10 U+FFFD in output → `encoding_uncertain` (error).
## 3. Output encodings
3.1 UTF-8 (default), UTF-8-BOM (Excel-friendly).
3.2 cp1252, ISO-8859-1/15, cp1250, ISO-8859-2, cp1251.
3.3 Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE.
3.4 Lossy fallback: `?` + warning when codec can't represent a char.
## 4. Delimiters
4.1 Input auto-detect: `,`, `\t`, `;`, `|`.
4.2 Output: `,` (default), `\t`, `;`, `|`.
4.3 Extension: `.tsv` for tab, `.csv` otherwise.
## 5. Line endings
5.1 Read: LF / CRLF / bare CR — all normalized to LF.
5.2 Embedded in quoted cells: also normalized to LF.
5.3 Write: LF (default), CRLF, CR.
5.4 Mixed → `mixed_line_endings` finding.
## 6. Analyzer detectors
**File-level** (read-time fixes, audit-logged):
- `csv_bom_stripped`, `csv_nul_stripped`, `csv_smart_quotes_folded`, `csv_line_endings_normalized`, `csv_transcoded_to_utf8`, `csv_unquoted_delimiters_repaired`, `csv_unrepairable_rows`.
**Cell-level**:
- `smart_punctuation_in_data`, `nbsp_or_unicode_whitespace`, `zero_width_or_invisible`, `dirty_column_headers`, `whitespace_padding`, `null_like_sentinels`, `suspected_mojibake`, `mixed_case_email_column`, `inconsistent_date_format`, `near_duplicate_rows`, `leading_zero_ids`.
**Encoding integrity**: `encoding_uncertain`, `encoding_decode_failed`, `encoding_lying_bom`, `empty_input`.
Sample size: 1,000 rows (configurable).
## 7. Finding fields
`id`, `severity` (info/warn/error), `confidence` (high/medium/low), `fix_action`, `pre_applied`, `tool`, `count`, `description`, `column`, `samples` (≤5).
## 8. Confidence tiers
- **high** — round-trip safe, one-click auto-fix.
- **medium** — preview before applying.
- **low** — opt-in only, can corrupt if wrong.
- **error** — must resolve or waive before tool pages unlock.
## 9. Decision actions
- `auto` — apply registered fix.
- `skip` — waive (audit-logged).
- `modified` — apply with custom payload.
## 10. Performance (1.5 GB input)
- Initial scan (sample): < 2 s · peak RSS ~110 MB.
- Full-file `repair_bytes`: 3040 s (UTF-8); non-UTF-8 fold path now
uses ``str.count`` instead of a Python char-by-char zip walk —
formerly ~100 s on a 1 GB cp1252 file with smart quotes, now <1 s.
- Full-DataFrame analyze: ~4 min (~25 µs/cell). Near-duplicate detector
no longer allocates a full-frame copy — peak RSS during the
near-duplicate pass drops to roughly the size of the string columns
alone (~50% memory cut on text-heavy 1 GB inputs).
- Full-DataFrame `auto_fix`: ~5 min (~30 µs/cell).
- Output write: ~10 s.
- Recommended RAM: 34× input size for the full-Apply path.
- **Format standardizer** (`standardize_dataframe`): ~2.7M rows/sec on
cache-warm repetition-heavy columns (synthetic 1M-row in-memory
benchmark, 2 typed columns); the fused single-pass loop replaced a
3-pass ``.tolist()`` cycle, so per-call overhead is now dominated by
the underlying parsers (phonenumbers, dateutil) rather than Python
list materialisation. A 1.5 GB CSV with mixed phone+currency+address
columns finishes in ~1.56 minutes depending on column count.
`StandardizeOptions.parallel_columns` (default 1, serial) lands the
thread-pool scaffolding; on CPython 3.12 with the GIL it's
roughly neutral, but the API is ready for the free-threaded
(PEP 703) Python 3.13+ build where it will help.
- **Text cleaner** (`clean_dataframe`): ~1M rows/sec on
repetition-heavy columns (per-call string cache: the pipeline runs
once per *unique* cell value, not once per row).
- **Missing handler** (`handle_missing`): lazy-copy — when sentinel
standardization runs but finds nothing, AND no drops AND no fills
apply, the input frame is returned as-is. On a clean 1 GB file this
saves the 1 GB allocation that the unconditional upfront copy used
to take.
- **Column mapper** (`map_columns`): rename + drop both already
return fresh frames; the explicit upfront `df.copy()` is now
removed and downstream mutating steps (schema-add, coerce) copy on
demand via `_ensure_owned()`. Rename-only and identity-mapping
paths run with zero explicit copies.
- **Deduplicator**:
- **Exact-only strategies** (every column uses `Algorithm.EXACT` at
threshold 100 — covers strong-key dedup like email/phone, the
fallback drop-duplicates path, and explicit "match on this exact
column" calls) now run in **O(n)** via groupby. Measured: 10k
rows on an email-exact strategy → 73 ms (was ~30 minutes via the
old O(n²) pair compare).
- **Fuzzy strategies** still pair-compare. Opt in to **prefix
blocking** via `deduplicate(..., blocking_columns=['name'],
blocking_prefix_len=1)` to partition pairs by a cheap key.
Measured: 5k rows fuzzy-name dedup → 25.6s with blocking vs.
179s without (7× faster). Trade-off: cross-block matches are
missed; lower `blocking_prefix_len` widens blocks.
- Normalisation pass remains LRU-cached per call so repeat values
(the common dedup workload) skip re-parsing.
## 11. Tools
1. Deduplicator — Ready
2. Text Cleaner — Ready
3. Format Standardizer — Ready
4. Missing Value Handler — Ready
5. Column Mapper — Ready
6. Outlier Detector — Coming Soon
7. Multi-File Merger — Coming Soon
8. Validator & Reporter — Coming Soon
9. Pipeline Runner — Ready
### 11.a Recommended pipeline order (soft, not enforced)
The Pipeline Runner ships with a `SOFT_DEPENDENCIES` table; the
following ordering is the default and the basis of the warning
surface. Re-ordering is allowed; the runner emits a warning string
and proceeds.
| # | Tool | Why this slot |
|---|------|---------------|
| 1 | column_map (optional, for header alignment) | Multi-vendor unification — rename early so downstream tools see canonical headers |
| 2 | text_clean | NBSP / smart quotes / zero-width pollution silently breaks downstream parsers |
| 3 | format_standardize | Phones / dates / currencies → canonical form before missing detection and dedup |
| 4 | missing | Sentinel detection, imputation, drop strategies — needs canonical types |
| 5 | column_map (optional, for schema enforcement) | Project to target schema, coerce, drop extras AFTER cleaning |
| 6 | dedup | Fuzzy matching is most accurate on canonicalised, sentinel-laundered data |
## 12. Gate (Review & Normalize)
- Gates every tool page.
- Auto-fix button: applies all `confidence=high` findings in one click.
- Per-finding controls: Auto / Skip / Customize.
- Live before/after preview (≤5 sample rows).
- Audit log per fix (id, decision, cells changed).
- Encoding-override picker (16 codepages + custom).
- Advanced output expander: encoding + delimiter + line terminator.
- Result keyed by upload SHA-256; survives reload, invalidated on re-upload.
## 13. Interfaces
- **GUI**: Streamlit, browser-based, local, no internet. Sidebar language picker (English, Español).
- **CLI**: `python -m src.cli` (dedup) · `src.cli_text_clean` · `src.cli_format` · `src.cli_missing` · `src.cli_column_map` · `src.cli_pipeline` · `src.cli_analyze`. (CLI output is English-only.)
- **Python API**: `from src.core import …` (analyze, repair_bytes, clean_dataframe, deduplicate, standardize_dataframe, …).
- **JSON output**: `--json` on `cli_analyze`.
- **Language packs**: `from src.i18n import t, LANGUAGES`. Add `<code>.json` to `src/i18n/packs/` + entry in `LANGUAGES` to add a language.
## 14. Platforms
- Python ≥ 3.10.
- OS: Linux, macOS, Windows.
- Browser: any modern browser.
- Network: not required at runtime.
## 15. Dependencies
- **Core**: pandas, openpyxl, charset-normalizer, typer, loguru.
- **Dedup**: rapidfuzz, phonenumbers.
- **GUI**: streamlit.
- **Optional**: ftfy (mojibake repair).
- **Dev**: pytest, tox.
## 16. Test coverage
- 2,033 tests passing, 0 skipped, 0 xfailed.
- 1,868 core + CLI tests (run with `pytest -m 'not gui'` for a quick loop).
Includes 49 license-layer unit tests (Ed25519 sign/verify, dev-key
derivation, production-safe tripwire, schema), 25 license-CLI
tests, and 17 Lite-tier feature-map + guard tests.
- 165 GUI tests under `tests/gui/` driving Streamlit pages via `AppTest`
(smoke + EN/ES localization, chrome, gate, workflows, dedup review,
advanced panels, error paths, findings panel, activation +
license gate, Lite-tier per-page lock behaviour). Marked `gui`.
- Includes 15 perf-shape regression tests.
- Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases + 20-row international stress fixture), missing-handler (3 use cases + 16 edge cases), column-mapper (3 use cases + 5 edge cases).
- Run: `python run_tests.py [--tool …] [--fixtures] [--coverage]`.
## 17. Privacy / data handling
- All processing local; no network calls in the data path.
- No telemetry.
- Original input never modified.
- Audit logs: `logs/` next to each run (timestamped).
## 17a. Licensing
- **Storage**: ``~/.datatools/license.json`` (or
``$DATATOOLS_LICENSE_PATH`` override). Signed with Ed25519
(asymmetric).
- **Crypto**: Ed25519. The seller holds the private key; every
shipped binary embeds only the public key. A motivated reverse
engineer who pulls everything out of the binary still can't sign
new licenses. Keys are 32 bytes raw, exposed as hex via
``DATATOOLS_LICENSE_PRIVKEY`` (seller-side) and
``DATATOOLS_LICENSE_PUBKEY`` (build-time bake-in).
- **Activation**: buyer pastes a base64-encoded license blob
(``DTLIC1:...``) on first launch; app verifies the signature
offline + matches the buyer-entered name/email to the embedded
values.
- **No free trial**: every license requires a paid blob from the
seller. The user-facing trial flow (button + ``license_cli trial``
subcommand) was removed in v1.6 to keep paid-tier economics clean.
- **Lifetime**: every license is 1 year by default. Renewal applies a
fresh blob without losing the embedded buyer identity. Tier may
change during renewal (Lite → Core upgrade path).
- **Tiers**:
- ``lite`` — Deduplicator + Text Cleaner + Format Standardizer.
Buyer pays once, gets the three universally-useful tools.
- ``core`` — every Ready tool (all 9 in v1.6).
- ``pro``, ``enterprise`` — scaffolded for future SKUs; currently
mirror Core. Add per-SKU restrictions by editing
``FEATURES_BY_TIER`` in ``src/license/features.py``.
- ``trial`` — kept in the enum for backwards compat with any
field-tested trial licenses but no longer issuable.
- **Feature flags**: every tool has a stable feature id matching its
``tool_id`` in :mod:`src.gui.tools_registry`. Adding a future per-
tool SKU is a one-line change to ``FEATURES_BY_TIER`` — no consumer
code edits.
- **Per-tool gating**: each tool page (GUI) and tool CLI calls
``require_feature(FeatureFlag.<TOOL>)`` at entry. GUI shows an
upgrade prompt + button to the Activate page; CLI prints a
message naming the locked feature and exits with code 2.
- **Lock badge**: the home grid shows a red 🔒 Locked pill on tool
cards the current tier doesn't unlock.
- **Dev bypass**: ``DATATOOLS_DEV_MODE=1`` skips every check (used by
the test suite and during development). **Refused in shipped
builds** by the production-safe tripwire.
- **Production-safe tripwire**: ``assert_production_safe()`` runs at
startup in every frozen build. Refuses to boot when ``DEV_MODE``
is set or the verification key is still the embedded dev key
(i.e., the build pipeline forgot to override
``DATATOOLS_LICENSE_PUBKEY``). No-op in source / pytest runs.
- **No internet**: signature verification is fully offline. The
shipped binary embeds only the public key; the private key never
leaves the seller. See ``docs/DECISIONS.md`` for the threat-model
discussion.
## 18. Error handling
- Structured hierarchy: `DataToolsError` → `InputValidationError`, `ConfigError`, `FileFormatError`, `FileAccessError`.
- Subclasses extend stdlib `ValueError` / `OSError` so existing handlers still catch them.
- Every error carries: message, file path, column, operation, suggestion, underlying cause.