Files
datatools-dev/docs/REQUIREMENTS.md
Michael d32b58e61a feat(license): add Lite SKU; remove user-facing free trial
Two coupled changes:

1. Lite tier
   - New Tier.LITE in src/license/schema.py.
   - FEATURES_BY_TIER[Tier.LITE] = {Deduplicator, Text Cleaner,
     Format Standardizer}. The three universally-useful tools that
     cover the most common bookkeeping / RevOps / Klaviyo prep
     workflows. Other six tools require Core.
   - i18n: license.tier_lite, license.feature_locked_title,
     license.feature_locked_body, license.upgrade_link,
     license.status_locked (en + es).
   - Per-tool feature gate at every GUI tool page
     (require_feature_or_render_upgrade) and every tool CLI
     (guard(feature=...)). A locked tool renders an upgrade
     prompt + Manage-license button (GUI) or exits with code 2
     (CLI).
   - Home grid: tool cards the user's tier doesn't unlock get a
     red 🔒 Locked badge in place of green Ready.

2. Trial removed
   - Activation form's "Start 1-year trial" button removed.
   - license_cli's `trial` subcommand removed.
   - activation.trial_button / activation.trial_help i18n keys
     dropped (pack parity test stays green).
   - Tier.TRIAL stays in the enum (back-compat with any field-
     tested trial licenses); LicenseManager._mint stays internal
     for tests and the seller's key generator.
   - Decision logged in DECISIONS §9b: a 1-year all-features
     trial undercuts paid Lite; paid-only keeps tier economics
     clean.

Tests (+29 net): +17 Lite-tier unit/guard tests + 13 Lite-tier
GUI tests + 1 trial-absent assertion - 2 trial CLI tests - 1
trial GUI button test. Total: 1995 → 2024.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:19:30 +00:00

238 lines
12 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Requirements
Numbered support matrix. Updated with every shipped capability.
## 1. File handling
1.1 Size: ≤ 1.5 GB target (larger works, slower).
1.2 Read: CSV, TSV, XLSX, XLS.
1.3 Write: CSV, TSV.
1.4 Excel: multi-sheet picker.
1.5 Empty file: blocked with `empty_input` error finding.
## 2. Input encodings (auto-detected)
2.1 Unicode: UTF-8, UTF-8-BOM, UTF-16 LE/BE BOM, UTF-16 LE no-BOM.
2.2 Western: cp1252, ISO-8859-1, ISO-8859-15, Mac Roman.
2.3 Eastern European: cp1250, ISO-8859-2.
2.4 Cyrillic: cp1251, KOI8-R.
2.5 CJK: Shift_JIS / cp932, GB18030, Big5, EUC-KR / cp949.
2.6 ASCII → detected as UTF-8.
2.7 User override: any Python codec name.
2.8 BOM: stripped on read, never written.
2.9 Decode failure → `encoding_decode_failed` (error).
2.10 U+FFFD in output → `encoding_uncertain` (error).
## 3. Output encodings
3.1 UTF-8 (default), UTF-8-BOM (Excel-friendly).
3.2 cp1252, ISO-8859-1/15, cp1250, ISO-8859-2, cp1251.
3.3 Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE.
3.4 Lossy fallback: `?` + warning when codec can't represent a char.
## 4. Delimiters
4.1 Input auto-detect: `,`, `\t`, `;`, `|`.
4.2 Output: `,` (default), `\t`, `;`, `|`.
4.3 Extension: `.tsv` for tab, `.csv` otherwise.
## 5. Line endings
5.1 Read: LF / CRLF / bare CR — all normalized to LF.
5.2 Embedded in quoted cells: also normalized to LF.
5.3 Write: LF (default), CRLF, CR.
5.4 Mixed → `mixed_line_endings` finding.
## 6. Analyzer detectors
**File-level** (read-time fixes, audit-logged):
- `csv_bom_stripped`, `csv_nul_stripped`, `csv_smart_quotes_folded`, `csv_line_endings_normalized`, `csv_transcoded_to_utf8`, `csv_unquoted_delimiters_repaired`, `csv_unrepairable_rows`.
**Cell-level**:
- `smart_punctuation_in_data`, `nbsp_or_unicode_whitespace`, `zero_width_or_invisible`, `dirty_column_headers`, `whitespace_padding`, `null_like_sentinels`, `suspected_mojibake`, `mixed_case_email_column`, `inconsistent_date_format`, `near_duplicate_rows`, `leading_zero_ids`.
**Encoding integrity**: `encoding_uncertain`, `encoding_decode_failed`, `encoding_lying_bom`, `empty_input`.
Sample size: 1,000 rows (configurable).
## 7. Finding fields
`id`, `severity` (info/warn/error), `confidence` (high/medium/low), `fix_action`, `pre_applied`, `tool`, `count`, `description`, `column`, `samples` (≤5).
## 8. Confidence tiers
- **high** — round-trip safe, one-click auto-fix.
- **medium** — preview before applying.
- **low** — opt-in only, can corrupt if wrong.
- **error** — must resolve or waive before tool pages unlock.
## 9. Decision actions
- `auto` — apply registered fix.
- `skip` — waive (audit-logged).
- `modified` — apply with custom payload.
## 10. Performance (1.5 GB input)
- Initial scan (sample): < 2 s · peak RSS ~110 MB.
- Full-file `repair_bytes`: 3040 s (UTF-8); non-UTF-8 fold path now
uses ``str.count`` instead of a Python char-by-char zip walk —
formerly ~100 s on a 1 GB cp1252 file with smart quotes, now <1 s.
- Full-DataFrame analyze: ~4 min (~25 µs/cell). Near-duplicate detector
no longer allocates a full-frame copy — peak RSS during the
near-duplicate pass drops to roughly the size of the string columns
alone (~50% memory cut on text-heavy 1 GB inputs).
- Full-DataFrame `auto_fix`: ~5 min (~30 µs/cell).
- Output write: ~10 s.
- Recommended RAM: 34× input size for the full-Apply path.
- **Format standardizer** (`standardize_dataframe`): ~2.7M rows/sec on
cache-warm repetition-heavy columns (synthetic 1M-row in-memory
benchmark, 2 typed columns); the fused single-pass loop replaced a
3-pass ``.tolist()`` cycle, so per-call overhead is now dominated by
the underlying parsers (phonenumbers, dateutil) rather than Python
list materialisation. A 1.5 GB CSV with mixed phone+currency+address
columns finishes in ~1.56 minutes depending on column count.
`StandardizeOptions.parallel_columns` (default 1, serial) lands the
thread-pool scaffolding; on CPython 3.12 with the GIL it's
roughly neutral, but the API is ready for the free-threaded
(PEP 703) Python 3.13+ build where it will help.
- **Text cleaner** (`clean_dataframe`): ~1M rows/sec on
repetition-heavy columns (per-call string cache: the pipeline runs
once per *unique* cell value, not once per row).
- **Missing handler** (`handle_missing`): lazy-copy — when sentinel
standardization runs but finds nothing, AND no drops AND no fills
apply, the input frame is returned as-is. On a clean 1 GB file this
saves the 1 GB allocation that the unconditional upfront copy used
to take.
- **Column mapper** (`map_columns`): rename + drop both already
return fresh frames; the explicit upfront `df.copy()` is now
removed and downstream mutating steps (schema-add, coerce) copy on
demand via `_ensure_owned()`. Rename-only and identity-mapping
paths run with zero explicit copies.
- **Deduplicator**:
- **Exact-only strategies** (every column uses `Algorithm.EXACT` at
threshold 100 — covers strong-key dedup like email/phone, the
fallback drop-duplicates path, and explicit "match on this exact
column" calls) now run in **O(n)** via groupby. Measured: 10k
rows on an email-exact strategy → 73 ms (was ~30 minutes via the
old O(n²) pair compare).
- **Fuzzy strategies** still pair-compare. Opt in to **prefix
blocking** via `deduplicate(..., blocking_columns=['name'],
blocking_prefix_len=1)` to partition pairs by a cheap key.
Measured: 5k rows fuzzy-name dedup → 25.6s with blocking vs.
179s without (7× faster). Trade-off: cross-block matches are
missed; lower `blocking_prefix_len` widens blocks.
- Normalisation pass remains LRU-cached per call so repeat values
(the common dedup workload) skip re-parsing.
## 11. Tools
1. Deduplicator — Ready
2. Text Cleaner — Ready
3. Format Standardizer — Ready
4. Missing Value Handler — Ready
5. Column Mapper — Ready
6. Outlier Detector — Coming Soon
7. Multi-File Merger — Coming Soon
8. Validator & Reporter — Coming Soon
9. Pipeline Runner — Ready
### 11.a Recommended pipeline order (soft, not enforced)
The Pipeline Runner ships with a `SOFT_DEPENDENCIES` table; the
following ordering is the default and the basis of the warning
surface. Re-ordering is allowed; the runner emits a warning string
and proceeds.
| # | Tool | Why this slot |
|---|------|---------------|
| 1 | column_map (optional, for header alignment) | Multi-vendor unification — rename early so downstream tools see canonical headers |
| 2 | text_clean | NBSP / smart quotes / zero-width pollution silently breaks downstream parsers |
| 3 | format_standardize | Phones / dates / currencies → canonical form before missing detection and dedup |
| 4 | missing | Sentinel detection, imputation, drop strategies — needs canonical types |
| 5 | column_map (optional, for schema enforcement) | Project to target schema, coerce, drop extras AFTER cleaning |
| 6 | dedup | Fuzzy matching is most accurate on canonicalised, sentinel-laundered data |
## 12. Gate (Review & Normalize)
- Gates every tool page.
- Auto-fix button: applies all `confidence=high` findings in one click.
- Per-finding controls: Auto / Skip / Customize.
- Live before/after preview (≤5 sample rows).
- Audit log per fix (id, decision, cells changed).
- Encoding-override picker (16 codepages + custom).
- Advanced output expander: encoding + delimiter + line terminator.
- Result keyed by upload SHA-256; survives reload, invalidated on re-upload.
## 13. Interfaces
- **GUI**: Streamlit, browser-based, local, no internet. Sidebar language picker (English, Español).
- **CLI**: `python -m src.cli` (dedup) · `src.cli_text_clean` · `src.cli_format` · `src.cli_missing` · `src.cli_column_map` · `src.cli_pipeline` · `src.cli_analyze`. (CLI output is English-only.)
- **Python API**: `from src.core import …` (analyze, repair_bytes, clean_dataframe, deduplicate, standardize_dataframe, …).
- **JSON output**: `--json` on `cli_analyze`.
- **Language packs**: `from src.i18n import t, LANGUAGES`. Add `<code>.json` to `src/i18n/packs/` + entry in `LANGUAGES` to add a language.
## 14. Platforms
- Python ≥ 3.10.
- OS: Linux, macOS, Windows.
- Browser: any modern browser.
- Network: not required at runtime.
## 15. Dependencies
- **Core**: pandas, openpyxl, charset-normalizer, typer, loguru.
- **Dedup**: rapidfuzz, phonenumbers.
- **GUI**: streamlit.
- **Optional**: ftfy (mojibake repair).
- **Dev**: pytest, tox.
## 16. Test coverage
- 2,024 tests passing, 0 skipped, 0 xfailed.
- 1,859 core + CLI tests (run with `pytest -m 'not gui'` for a quick loop).
Includes 40 license-layer unit tests, 25 license-CLI tests, and
17 Lite-tier feature-map + guard tests.
- 165 GUI tests under `tests/gui/` driving Streamlit pages via `AppTest`
(smoke + EN/ES localization, chrome, gate, workflows, dedup review,
advanced panels, error paths, findings panel, activation +
license gate, Lite-tier per-page lock behaviour). Marked `gui`.
- Includes 15 perf-shape regression tests.
- Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases + 20-row international stress fixture), missing-handler (3 use cases + 16 edge cases), column-mapper (3 use cases + 5 edge cases).
- Run: `python run_tests.py [--tool …] [--fixtures] [--coverage]`.
## 17. Privacy / data handling
- All processing local; no network calls in the data path.
- No telemetry.
- Original input never modified.
- Audit logs: `logs/` next to each run (timestamped).
## 17a. Licensing
- **Storage**: ``~/.datatools/license.json`` (or
``$DATATOOLS_LICENSE_PATH`` override). Signed locally with
HMAC-SHA256.
- **Activation**: buyer pastes a base64-encoded license blob
(``DTLIC1:...``) on first launch; app verifies the signature
offline + matches the buyer-entered name/email to the embedded
values.
- **No free trial**: every license requires a paid blob from the
seller. The user-facing trial flow (button + ``license_cli trial``
subcommand) was removed in v1.6 to keep paid-tier economics clean.
- **Lifetime**: every license is 1 year by default. Renewal applies a
fresh blob without losing the embedded buyer identity. Tier may
change during renewal (Lite → Core upgrade path).
- **Tiers**:
- ``lite`` — Deduplicator + Text Cleaner + Format Standardizer.
Buyer pays once, gets the three universally-useful tools.
- ``core`` — every Ready tool (all 9 in v1.6).
- ``pro``, ``enterprise`` — scaffolded for future SKUs; currently
mirror Core. Add per-SKU restrictions by editing
``FEATURES_BY_TIER`` in ``src/license/features.py``.
- ``trial`` — kept in the enum for backwards compat with any
field-tested trial licenses but no longer issuable.
- **Feature flags**: every tool has a stable feature id matching its
``tool_id`` in :mod:`src.gui.tools_registry`. Adding a future per-
tool SKU is a one-line change to ``FEATURES_BY_TIER`` — no consumer
code edits.
- **Per-tool gating**: each tool page (GUI) and tool CLI calls
``require_feature(FeatureFlag.<TOOL>)`` at entry. GUI shows an
upgrade prompt + button to the Activate page; CLI prints a
message naming the locked feature and exits with code 2.
- **Lock badge**: the home grid shows a red 🔒 Locked pill on tool
cards the current tier doesn't unlock.
- **Dev bypass**: ``DATATOOLS_DEV_MODE=1`` skips every check (used by
the test suite and during development).
- **No internet**: signature verification is fully offline. The
shipped binary embeds the verification secret; see
``docs/DECISIONS.md`` for the threat-model discussion.
## 18. Error handling
- Structured hierarchy: `DataToolsError` → `InputValidationError`, `ConfigError`, `FileFormatError`, `FileAccessError`.
- Subclasses extend stdlib `ValueError` / `OSError` so existing handlers still catch them.
- Every error carries: message, file path, column, operation, suggestion, underlying cause.