New ``docs/FUTURE-TOOLS.md`` captures post-launch tool ideas with a consistent shape — What / Why / Can we ship now / Approach / GUI sketch / Effort / Risks / Ship criteria. Resting place for things the new-tool freeze in ``PLAN.md`` §2.1 refuses to build but that keep coming up. First entry: **#10 PDF → CSV extractor** (bank statements et al.). Key facts captured: - **Current state**: no PDF infrastructure exists. Zero PDF dependencies in requirements.txt; zero PDF-touching code under ``src/``. The only "PDF" string in the codebase is the planned- output copy for the Quality Check tool, unrelated to extraction. - **Library picks**: pdfplumber as the extraction core (BSD-3, no native compiler, gives coordinate-aware text), Tesseract via pytesseract as the OCR fallback for scanned PDFs, streamlit-drawable-canvas as the region-picker component. - **GUI sketch**: user draws a header strip + a row template on a rendered page; the tool applies that template across N pages, saves the template by layout fingerprint for next month's statement, emits CSV. - **Effort phased A–E**: 3–4 weeks for a text-only MVP; 6–10 weeks for a polished version with multi-page template recall; +2–3 weeks if scanned-PDF OCR is required. - **Difficulty**: medium-hard. The pieces are well-trodden; the combination (region selection that persists across pages and across documents with similar layouts) is where the engineering goes. - **Ship criteria**: ≥1 paying customer + ≥3 paid or ≥5 demo emails asking for PDF extraction + the bookkeeper niche converting at least one customer first. None have fired. Cross-references added: - ``docs/REQUIREMENTS.md`` §11: pointer to FUTURE-TOOLS.md for parked tool ideas, with a one-paragraph summary of #10. - ``docs/PLAN.md`` §2.1: notes that the freeze parks future tools in FUTURE-TOOLS.md and explicitly names #10 as the current highest-pressure entry. - ``docs/NEXT-STEPS.md`` Phase 5 "what NOT to build" table: a new row for the PDF tool tied to the same ship-trigger language. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
13 KiB
Requirements
Numbered support matrix. Updated with every shipped capability.
1. File handling
1.1 Size: ≤ 1.5 GB target (larger works, slower).
1.2 Read: CSV, TSV, XLSX, XLS.
1.3 Write: CSV, TSV.
1.4 Excel: multi-sheet picker.
1.5 Empty file: blocked with empty_input error finding.
2. Input encodings (auto-detected)
2.1 Unicode: UTF-8, UTF-8-BOM, UTF-16 LE/BE BOM, UTF-16 LE no-BOM.
2.2 Western: cp1252, ISO-8859-1, ISO-8859-15, Mac Roman.
2.3 Eastern European: cp1250, ISO-8859-2.
2.4 Cyrillic: cp1251, KOI8-R.
2.5 CJK: Shift_JIS / cp932, GB18030, Big5, EUC-KR / cp949.
2.6 ASCII → detected as UTF-8.
2.7 User override: any Python codec name.
2.8 BOM: stripped on read, never written.
2.9 Decode failure → encoding_decode_failed (error).
2.10 U+FFFD in output → encoding_uncertain (error).
3. Output encodings
3.1 UTF-8 (default), UTF-8-BOM (Excel-friendly).
3.2 cp1252, ISO-8859-1/15, cp1250, ISO-8859-2, cp1251.
3.3 Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE.
3.4 Lossy fallback: ? + warning when codec can't represent a char.
4. Delimiters
4.1 Input auto-detect: ,, \t, ;, |.
4.2 Output: , (default), \t, ;, |.
4.3 Extension: .tsv for tab, .csv otherwise.
5. Line endings
5.1 Read: LF / CRLF / bare CR — all normalized to LF.
5.2 Embedded in quoted cells: also normalized to LF.
5.3 Write: LF (default), CRLF, CR.
5.4 Mixed → mixed_line_endings finding.
6. Analyzer detectors
File-level (read-time fixes, audit-logged):
csv_bom_stripped,csv_nul_stripped,csv_smart_quotes_folded,csv_line_endings_normalized,csv_transcoded_to_utf8,csv_unquoted_delimiters_repaired,csv_unrepairable_rows.
Cell-level:
smart_punctuation_in_data,nbsp_or_unicode_whitespace,zero_width_or_invisible,dirty_column_headers,whitespace_padding,null_like_sentinels,suspected_mojibake,mixed_case_email_column,inconsistent_date_format,near_duplicate_rows,leading_zero_ids.
Encoding integrity: encoding_uncertain, encoding_decode_failed, encoding_lying_bom, empty_input.
Sample size: 1,000 rows (configurable).
7. Finding fields
id, severity (info/warn/error), confidence (high/medium/low), fix_action, pre_applied, tool, count, description, column, samples (≤5).
8. Confidence tiers
- high — round-trip safe, one-click auto-fix.
- medium — preview before applying.
- low — opt-in only, can corrupt if wrong.
- error — must resolve or waive before tool pages unlock.
9. Decision actions
auto— apply registered fix.skip— waive (audit-logged).modified— apply with custom payload.
10. Performance (1.5 GB input)
- Initial scan (sample): < 2 s · peak RSS ~110 MB.
- Full-file
repair_bytes: 30–40 s (UTF-8); non-UTF-8 fold path now usesstr.countinstead of a Python char-by-char zip walk — formerly ~100 s on a 1 GB cp1252 file with smart quotes, now <1 s. - Full-DataFrame analyze: ~4 min (~25 µs/cell). Near-duplicate detector no longer allocates a full-frame copy — peak RSS during the near-duplicate pass drops to roughly the size of the string columns alone (~50% memory cut on text-heavy 1 GB inputs).
- Full-DataFrame
auto_fix: ~5 min (~30 µs/cell). - Output write: ~10 s.
- Recommended RAM: 3–4× input size for the full-Apply path.
- Standardize Formats (
standardize_dataframe): ~2.7M rows/sec on cache-warm repetition-heavy columns (synthetic 1M-row in-memory benchmark, 2 typed columns); the fused single-pass loop replaced a 3-pass.tolist()cycle, so per-call overhead is now dominated by the underlying parsers (phonenumbers, dateutil) rather than Python list materialisation. A 1.5 GB CSV with mixed phone+currency+address columns finishes in ~1.5–6 minutes depending on column count.StandardizeOptions.parallel_columns(default 1, serial) lands the thread-pool scaffolding; on CPython 3.12 with the GIL it's roughly neutral, but the API is ready for the free-threaded (PEP 703) Python 3.13+ build where it will help. - Clean Text (
clean_dataframe): ~1M rows/sec on repetition-heavy columns (per-call string cache: the pipeline runs once per unique cell value, not once per row). - Fix Missing Values (
handle_missing): lazy-copy — when sentinel standardization runs but finds nothing, AND no drops AND no fills apply, the input frame is returned as-is. On a clean 1 GB file this saves the 1 GB allocation that the unconditional upfront copy used to take. - Map Columns (
map_columns): rename + drop both already return fresh frames; the explicit upfrontdf.copy()is now removed and downstream mutating steps (schema-add, coerce) copy on demand via_ensure_owned(). Rename-only and identity-mapping paths run with zero explicit copies. - Find Duplicates:
- Exact-only strategies (every column uses
Algorithm.EXACTat threshold 100 — covers strong-key dedup like email/phone, the fallback drop-duplicates path, and explicit "match on this exact column" calls) now run in O(n) via groupby. Measured: 10k rows on an email-exact strategy → 73 ms (was ~30 minutes via the old O(n²) pair compare). - Fuzzy strategies still pair-compare. Opt in to prefix
blocking via
deduplicate(..., blocking_columns=['name'], blocking_prefix_len=1)to partition pairs by a cheap key. Measured: 5k rows fuzzy-name dedup → 25.6s with blocking vs. 179s without (7× faster). Trade-off: cross-block matches are missed; lowerblocking_prefix_lenwidens blocks. - Normalisation pass remains LRU-cached per call so repeat values (the common dedup workload) skip re-parsing.
- Exact-only strategies (every column uses
11. Tools
- Find Duplicates — Ready
- Clean Text — Ready
- Standardize Formats — Ready
- Fix Missing Values — Ready
- Map Columns — Ready
- Find Unusual Values — Coming Soon
- Combine Files — Coming Soon
- Quality Check — Coming Soon
- Automated Workflows — Ready
Future / not in v1. Tool ideas captured for after-launch consideration
live in docs/FUTURE-TOOLS.md — entries there are gated by the new-tool
freeze in PLAN.md §2.1 and don't ship without a paying-customer +
repeated-demand signal. Currently parked there:
- #10. PDF → CSV extractor (bank statements + similar). No PDF dependency exists in the repo today; this tool would need pdfplumber, streamlit-drawable-canvas, and a templates store. Estimated 3–4 weeks for a text-only MVP, 6–10 weeks for the polished version with multi-page template recall.
11.a Recommended pipeline order (soft, not enforced)
Automated Workflows ships with a SOFT_DEPENDENCIES table; the
following ordering is the default and the basis of the warning
surface. Re-ordering is allowed; the runner emits a warning string
and proceeds.
| # | Tool | Why this slot |
|---|---|---|
| 1 | column_map (optional, for header alignment) | Multi-vendor unification — rename early so downstream tools see canonical headers |
| 2 | text_clean | NBSP / smart quotes / zero-width pollution silently breaks downstream parsers |
| 3 | format_standardize | Phones / dates / currencies → canonical form before missing detection and dedup |
| 4 | missing | Sentinel detection, imputation, drop strategies — needs canonical types |
| 5 | column_map (optional, for schema enforcement) | Project to target schema, coerce, drop extras AFTER cleaning |
| 6 | dedup | Fuzzy matching is most accurate on canonicalised, sentinel-laundered data |
12. Gate (Review & Normalize)
- Gates every tool page.
- Auto-fix button: applies all
confidence=highfindings in one click. - Per-finding controls: Auto / Skip / Customize.
- Live before/after preview (≤5 sample rows).
- Audit log per fix (id, decision, cells changed).
- Encoding-override picker (16 codepages + custom).
- Advanced output expander: encoding + delimiter + line terminator.
- Result keyed by upload SHA-256; survives reload, invalidated on re-upload.
13. Interfaces
- GUI: Streamlit, browser-based, local, no internet. Sidebar language picker (English, Español).
- CLI:
python -m src.cli(dedup) ·src.cli_text_clean·src.cli_format·src.cli_missing·src.cli_column_map·src.cli_pipeline·src.cli_analyze. (CLI output is English-only.) - Python API:
from src.core import …(analyze, repair_bytes, clean_dataframe, deduplicate, standardize_dataframe, …). - JSON output:
--jsononcli_analyze. - Language packs:
from src.i18n import t, LANGUAGES. Add<code>.jsontosrc/i18n/packs/+ entry inLANGUAGESto add a language.
14. Platforms
- Python ≥ 3.10.
- OS: Linux, macOS, Windows.
- Browser: any modern browser.
- Network: not required at runtime.
15. Dependencies
- Core: pandas, openpyxl, charset-normalizer, typer, loguru.
- Dedup: rapidfuzz, phonenumbers.
- GUI: streamlit.
- Optional: ftfy (mojibake repair).
- Dev: pytest, tox.
16. Test coverage
- 2,033 tests passing, 0 skipped, 0 xfailed.
- 1,868 core + CLI tests (run with
pytest -m 'not gui'for a quick loop). Includes 49 license-layer unit tests (Ed25519 sign/verify, dev-key derivation, production-safe tripwire, schema), 25 license-CLI tests, and 17 Lite-tier feature-map + guard tests. - 165 GUI tests under
tests/gui/driving Streamlit pages viaAppTest(smoke + EN/ES localization, chrome, gate, workflows, dedup review, advanced panels, error paths, findings panel, activation + license gate, Lite-tier per-page lock behaviour). Markedgui. - Includes 15 perf-shape regression tests.
- 1,868 core + CLI tests (run with
- Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases + 20-row international stress fixture), missing-handler (3 use cases + 16 edge cases), column-mapper (3 use cases + 5 edge cases).
- Run:
python run_tests.py [--tool …] [--fixtures] [--coverage].
17. Privacy / data handling
- All processing local; no network calls in the data path.
- No telemetry.
- Original input never modified.
- Audit logs:
logs/next to each run (timestamped).
17a. Licensing
- Storage:
~/.datatools/license.json(or$DATATOOLS_LICENSE_PATHoverride). Signed with Ed25519 (asymmetric). - Crypto: Ed25519. The seller holds the private key; every
shipped binary embeds only the public key. A motivated reverse
engineer who pulls everything out of the binary still can't sign
new licenses. Keys are 32 bytes raw, exposed as hex via
DATATOOLS_LICENSE_PRIVKEY(seller-side) andDATATOOLS_LICENSE_PUBKEY(build-time bake-in). - Activation: buyer pastes a base64-encoded license blob
(
DTLIC1:...) on first launch; app verifies the signature offline + matches the buyer-entered name/email to the embedded values. - No free trial: every license requires a paid blob from the
seller. The user-facing trial flow (button +
license_cli trialsubcommand) was removed in v1.6 to keep paid-tier economics clean. - Lifetime: every license is 1 year by default. Renewal applies a fresh blob without losing the embedded buyer identity. Tier may change during renewal (Lite → Core upgrade path).
- Tiers:
lite— Find Duplicates + Clean Text + Standardize Formats. Buyer pays once, gets the three universally-useful tools.core— every Ready tool (all 9 in v1.6).pro,enterprise— scaffolded for future SKUs; currently mirror Core. Add per-SKU restrictions by editingFEATURES_BY_TIERinsrc/license/features.py.trial— kept in the enum for backwards compat with any field-tested trial licenses but no longer issuable.
- Feature flags: every tool has a stable feature id matching its
tool_idin :mod:src.gui.tools_registry. Adding a future per- tool SKU is a one-line change toFEATURES_BY_TIER— no consumer code edits. - Per-tool gating: each tool page (GUI) and tool CLI calls
require_feature(FeatureFlag.<TOOL>)at entry. GUI shows an upgrade prompt + button to the Activate page; CLI prints a message naming the locked feature and exits with code 2. - Lock badge: the home grid shows a red 🔒 Locked pill on tool cards the current tier doesn't unlock.
- Dev bypass:
DATATOOLS_DEV_MODE=1skips every check (used by the test suite and during development). Refused in shipped builds by the production-safe tripwire. - Production-safe tripwire:
assert_production_safe()runs at startup in every frozen build. Refuses to boot whenDEV_MODEis set or the verification key is still the embedded dev key (i.e., the build pipeline forgot to overrideDATATOOLS_LICENSE_PUBKEY). No-op in source / pytest runs. - No internet: signature verification is fully offline. The
shipped binary embeds only the public key; the private key never
leaves the seller. See
docs/DECISIONS.mdfor the threat-model discussion.
18. Error handling
- Structured hierarchy:
DataToolsError→InputValidationError,ConfigError,FileFormatError,FileAccessError. - Subclasses extend stdlib
ValueError/OSErrorso existing handlers still catch them. - Every error carries: message, file path, column, operation, suggestion, underlying cause.