Files

Michael ee0b1f6f6b docs: design notes for future PDF→CSV tool

New ``docs/FUTURE-TOOLS.md`` captures post-launch tool ideas with a
consistent shape — What / Why / Can we ship now / Approach / GUI
sketch / Effort / Risks / Ship criteria. Resting place for things
the new-tool freeze in ``PLAN.md`` §2.1 refuses to build but that
keep coming up.

First entry: **#10 PDF → CSV extractor** (bank statements et al.).

Key facts captured:

- **Current state**: no PDF infrastructure exists. Zero PDF
  dependencies in requirements.txt; zero PDF-touching code under
  ``src/``. The only "PDF" string in the codebase is the planned-
  output copy for the Quality Check tool, unrelated to extraction.
- **Library picks**: pdfplumber as the extraction core (BSD-3,
  no native compiler, gives coordinate-aware text), Tesseract via
  pytesseract as the OCR fallback for scanned PDFs,
  streamlit-drawable-canvas as the region-picker component.
- **GUI sketch**: user draws a header strip + a row template on a
  rendered page; the tool applies that template across N pages,
  saves the template by layout fingerprint for next month's
  statement, emits CSV.
- **Effort phased A–E**: 3–4 weeks for a text-only MVP; 6–10
  weeks for a polished version with multi-page template recall;
  +2–3 weeks if scanned-PDF OCR is required.
- **Difficulty**: medium-hard. The pieces are well-trodden; the
  combination (region selection that persists across pages and
  across documents with similar layouts) is where the engineering
  goes.
- **Ship criteria**: ≥1 paying customer + ≥3 paid or ≥5 demo
  emails asking for PDF extraction + the bookkeeper niche
  converting at least one customer first. None have fired.

Cross-references added:

- ``docs/REQUIREMENTS.md`` §11: pointer to FUTURE-TOOLS.md for
  parked tool ideas, with a one-paragraph summary of #10.
- ``docs/PLAN.md`` §2.1: notes that the freeze parks future tools
  in FUTURE-TOOLS.md and explicitly names #10 as the current
  highest-pressure entry.
- ``docs/NEXT-STEPS.md`` Phase 5 "what NOT to build" table: a new
  row for the PDF tool tied to the same ship-trigger language.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-17 01:52:42 +00:00

13 KiB

Raw Blame History

Requirements

Numbered support matrix. Updated with every shipped capability.

1. File handling

1.1 Size: ≤ 1.5 GB target (larger works, slower). 1.2 Read: CSV, TSV, XLSX, XLS. 1.3 Write: CSV, TSV. 1.4 Excel: multi-sheet picker. 1.5 Empty file: blocked with empty_input error finding.

2. Input encodings (auto-detected)

2.1 Unicode: UTF-8, UTF-8-BOM, UTF-16 LE/BE BOM, UTF-16 LE no-BOM. 2.2 Western: cp1252, ISO-8859-1, ISO-8859-15, Mac Roman. 2.3 Eastern European: cp1250, ISO-8859-2. 2.4 Cyrillic: cp1251, KOI8-R. 2.5 CJK: Shift_JIS / cp932, GB18030, Big5, EUC-KR / cp949. 2.6 ASCII → detected as UTF-8. 2.7 User override: any Python codec name. 2.8 BOM: stripped on read, never written. 2.9 Decode failure → encoding_decode_failed (error). 2.10 U+FFFD in output → encoding_uncertain (error).

3. Output encodings

3.1 UTF-8 (default), UTF-8-BOM (Excel-friendly). 3.2 cp1252, ISO-8859-1/15, cp1250, ISO-8859-2, cp1251. 3.3 Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE. 3.4 Lossy fallback: ? + warning when codec can't represent a char.

4. Delimiters

4.1 Input auto-detect: ,, \t, ;, |. 4.2 Output: , (default), \t, ;, |. 4.3 Extension: .tsv for tab, .csv otherwise.

5. Line endings

5.1 Read: LF / CRLF / bare CR — all normalized to LF. 5.2 Embedded in quoted cells: also normalized to LF. 5.3 Write: LF (default), CRLF, CR. 5.4 Mixed → mixed_line_endings finding.

6. Analyzer detectors

File-level (read-time fixes, audit-logged):

csv_bom_stripped, csv_nul_stripped, csv_smart_quotes_folded, csv_line_endings_normalized, csv_transcoded_to_utf8, csv_unquoted_delimiters_repaired, csv_unrepairable_rows.

Cell-level:

smart_punctuation_in_data, nbsp_or_unicode_whitespace, zero_width_or_invisible, dirty_column_headers, whitespace_padding, null_like_sentinels, suspected_mojibake, mixed_case_email_column, inconsistent_date_format, near_duplicate_rows, leading_zero_ids.

Encoding integrity: encoding_uncertain, encoding_decode_failed, encoding_lying_bom, empty_input.

Sample size: 1,000 rows (configurable).

7. Finding fields

id, severity (info/warn/error), confidence (high/medium/low), fix_action, pre_applied, tool, count, description, column, samples (≤5).

8. Confidence tiers

high — round-trip safe, one-click auto-fix.
medium — preview before applying.
low — opt-in only, can corrupt if wrong.
error — must resolve or waive before tool pages unlock.

9. Decision actions

auto — apply registered fix.
skip — waive (audit-logged).
modified — apply with custom payload.

10. Performance (1.5 GB input)

Initial scan (sample): < 2 s · peak RSS ~110 MB.
Full-file repair_bytes: 30–40 s (UTF-8); non-UTF-8 fold path now uses str.count instead of a Python char-by-char zip walk — formerly ~100 s on a 1 GB cp1252 file with smart quotes, now <1 s.
Full-DataFrame analyze: ~4 min (~25 µs/cell). Near-duplicate detector no longer allocates a full-frame copy — peak RSS during the near-duplicate pass drops to roughly the size of the string columns alone (~50% memory cut on text-heavy 1 GB inputs).
Full-DataFrame auto_fix: ~5 min (~30 µs/cell).
Output write: ~10 s.
Recommended RAM: 3–4× input size for the full-Apply path.
Standardize Formats (standardize_dataframe): ~2.7M rows/sec on cache-warm repetition-heavy columns (synthetic 1M-row in-memory benchmark, 2 typed columns); the fused single-pass loop replaced a 3-pass .tolist() cycle, so per-call overhead is now dominated by the underlying parsers (phonenumbers, dateutil) rather than Python list materialisation. A 1.5 GB CSV with mixed phone+currency+address columns finishes in ~1.5–6 minutes depending on column count. StandardizeOptions.parallel_columns (default 1, serial) lands the thread-pool scaffolding; on CPython 3.12 with the GIL it's roughly neutral, but the API is ready for the free-threaded (PEP 703) Python 3.13+ build where it will help.
Clean Text (clean_dataframe): ~1M rows/sec on repetition-heavy columns (per-call string cache: the pipeline runs once per unique cell value, not once per row).
Fix Missing Values (handle_missing): lazy-copy — when sentinel standardization runs but finds nothing, AND no drops AND no fills apply, the input frame is returned as-is. On a clean 1 GB file this saves the 1 GB allocation that the unconditional upfront copy used to take.
Map Columns (map_columns): rename + drop both already return fresh frames; the explicit upfront df.copy() is now removed and downstream mutating steps (schema-add, coerce) copy on demand via _ensure_owned(). Rename-only and identity-mapping paths run with zero explicit copies.
Find Duplicates:
- Exact-only strategies (every column uses Algorithm.EXACT at threshold 100 — covers strong-key dedup like email/phone, the fallback drop-duplicates path, and explicit "match on this exact column" calls) now run in O(n) via groupby. Measured: 10k rows on an email-exact strategy → 73 ms (was ~30 minutes via the old O(n²) pair compare).
- Fuzzy strategies still pair-compare. Opt in to prefix blocking via deduplicate(..., blocking_columns=['name'], blocking_prefix_len=1) to partition pairs by a cheap key. Measured: 5k rows fuzzy-name dedup → 25.6s with blocking vs. 179s without (7× faster). Trade-off: cross-block matches are missed; lower blocking_prefix_len widens blocks.
- Normalisation pass remains LRU-cached per call so repeat values (the common dedup workload) skip re-parsing.

11. Tools

Find Duplicates — Ready
Clean Text — Ready
Standardize Formats — Ready
Fix Missing Values — Ready
Map Columns — Ready
Find Unusual Values — Coming Soon
Combine Files — Coming Soon
Quality Check — Coming Soon
Automated Workflows — Ready

Future / not in v1. Tool ideas captured for after-launch consideration live in docs/FUTURE-TOOLS.md — entries there are gated by the new-tool freeze in PLAN.md §2.1 and don't ship without a paying-customer + repeated-demand signal. Currently parked there:

#10. PDF → CSV extractor (bank statements + similar). No PDF dependency exists in the repo today; this tool would need pdfplumber, streamlit-drawable-canvas, and a templates store. Estimated 3–4 weeks for a text-only MVP, 6–10 weeks for the polished version with multi-page template recall.

11.a Recommended pipeline order (soft, not enforced)

Automated Workflows ships with a SOFT_DEPENDENCIES table; the following ordering is the default and the basis of the warning surface. Re-ordering is allowed; the runner emits a warning string and proceeds.

#	Tool	Why this slot
1	column_map (optional, for header alignment)	Multi-vendor unification — rename early so downstream tools see canonical headers
2	text_clean	NBSP / smart quotes / zero-width pollution silently breaks downstream parsers
3	format_standardize	Phones / dates / currencies → canonical form before missing detection and dedup
4	missing	Sentinel detection, imputation, drop strategies — needs canonical types
5	column_map (optional, for schema enforcement)	Project to target schema, coerce, drop extras AFTER cleaning
6	dedup	Fuzzy matching is most accurate on canonicalised, sentinel-laundered data

12. Gate (Review & Normalize)

Gates every tool page.
Auto-fix button: applies all confidence=high findings in one click.
Per-finding controls: Auto / Skip / Customize.
Live before/after preview (≤5 sample rows).
Audit log per fix (id, decision, cells changed).
Encoding-override picker (16 codepages + custom).
Advanced output expander: encoding + delimiter + line terminator.
Result keyed by upload SHA-256; survives reload, invalidated on re-upload.

13. Interfaces

GUI: Streamlit, browser-based, local, no internet. Sidebar language picker (English, Español).
CLI: python -m src.cli (dedup) · src.cli_text_clean · src.cli_format · src.cli_missing · src.cli_column_map · src.cli_pipeline · src.cli_analyze. (CLI output is English-only.)
Python API: from src.core import … (analyze, repair_bytes, clean_dataframe, deduplicate, standardize_dataframe, …).
JSON output: --json on cli_analyze.
Language packs: from src.i18n import t, LANGUAGES. Add <code>.json to src/i18n/packs/ + entry in LANGUAGES to add a language.

14. Platforms

Python ≥ 3.10.
OS: Linux, macOS, Windows.
Browser: any modern browser.
Network: not required at runtime.

15. Dependencies

Core: pandas, openpyxl, charset-normalizer, typer, loguru.
Dedup: rapidfuzz, phonenumbers.
GUI: streamlit.
Optional: ftfy (mojibake repair).
Dev: pytest, tox.

16. Test coverage

2,033 tests passing, 0 skipped, 0 xfailed.
- 1,868 core + CLI tests (run with pytest -m 'not gui' for a quick loop). Includes 49 license-layer unit tests (Ed25519 sign/verify, dev-key derivation, production-safe tripwire, schema), 25 license-CLI tests, and 17 Lite-tier feature-map + guard tests.
- 165 GUI tests under tests/gui/ driving Streamlit pages via AppTest (smoke + EN/ES localization, chrome, gate, workflows, dedup review, advanced panels, error paths, findings panel, activation + license gate, Lite-tier per-page lock behaviour). Marked gui.
- Includes 15 perf-shape regression tests.
Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases + 20-row international stress fixture), missing-handler (3 use cases + 16 edge cases), column-mapper (3 use cases + 5 edge cases).
Run: python run_tests.py [--tool …] [--fixtures] [--coverage].

17. Privacy / data handling

All processing local; no network calls in the data path.
No telemetry.
Original input never modified.
Audit logs: logs/ next to each run (timestamped).

17a. Licensing

Storage: ~/.datatools/license.json (or $DATATOOLS_LICENSE_PATH override). Signed with Ed25519 (asymmetric).
Crypto: Ed25519. The seller holds the private key; every shipped binary embeds only the public key. A motivated reverse engineer who pulls everything out of the binary still can't sign new licenses. Keys are 32 bytes raw, exposed as hex via DATATOOLS_LICENSE_PRIVKEY (seller-side) and DATATOOLS_LICENSE_PUBKEY (build-time bake-in).
Activation: buyer pastes a base64-encoded license blob (DTLIC1:...) on first launch; app verifies the signature offline + matches the buyer-entered name/email to the embedded values.
No free trial: every license requires a paid blob from the seller. The user-facing trial flow (button + license_cli trial subcommand) was removed in v1.6 to keep paid-tier economics clean.
Lifetime: every license is 1 year by default. Renewal applies a fresh blob without losing the embedded buyer identity. Tier may change during renewal (Lite → Core upgrade path).
Tiers:
- lite — Find Duplicates + Clean Text + Standardize Formats. Buyer pays once, gets the three universally-useful tools.
- core — every Ready tool (all 9 in v1.6).
- pro, enterprise — scaffolded for future SKUs; currently mirror Core. Add per-SKU restrictions by editing FEATURES_BY_TIER in src/license/features.py.
- trial — kept in the enum for backwards compat with any field-tested trial licenses but no longer issuable.
Feature flags: every tool has a stable feature id matching its tool_id in :mod:src.gui.tools_registry. Adding a future per- tool SKU is a one-line change to FEATURES_BY_TIER — no consumer code edits.
Per-tool gating: each tool page (GUI) and tool CLI calls require_feature(FeatureFlag.<TOOL>) at entry. GUI shows an upgrade prompt + button to the Activate page; CLI prints a message naming the locked feature and exits with code 2.
Lock badge: the home grid shows a red 🔒 Locked pill on tool cards the current tier doesn't unlock.
Dev bypass: DATATOOLS_DEV_MODE=1 skips every check (used by the test suite and during development). Refused in shipped builds by the production-safe tripwire.
Production-safe tripwire: assert_production_safe() runs at startup in every frozen build. Refuses to boot when DEV_MODE is set or the verification key is still the embedded dev key (i.e., the build pipeline forgot to override DATATOOLS_LICENSE_PUBKEY). No-op in source / pytest runs.
No internet: signature verification is fully offline. The shipped binary embeds only the public key; the private key never leaves the seller. See docs/DECISIONS.md for the threat-model discussion.

18. Error handling

Structured hierarchy: DataToolsError → InputValidationError, ConfigError, FileFormatError, FileAccessError.
Subclasses extend stdlib ValueError / OSError so existing handlers still catch them.
Every error carries: message, file path, column, operation, suggestion, underlying cause.

13 KiB Raw Blame History Unescape Escape