New ``docs/FUTURE-TOOLS.md`` captures post-launch tool ideas with a consistent shape: What / Why / Can we ship now / Approach / GUI sketch / Effort / Risks / Ship criteria. Resting place for things the new-tool freeze in ``PLAN.md`` §2.1 refuses to build but that keep coming up.

First entry: **#10 PDF → CSV extractor** (bank statements et al.). Key facts captured:

- **Current state**: no PDF infrastructure exists. Zero PDF dependencies in requirements.txt; zero PDF-touching code under ``src/``. The only "PDF" string in the codebase is the planned-output copy for the Quality Check tool, unrelated to extraction.
- **Library picks**: pdfplumber as the extraction core (MIT, no native compiler, gives coordinate-aware text), Tesseract via pytesseract as the OCR fallback for scanned PDFs, streamlit-drawable-canvas as the region-picker component.
- **GUI sketch**: user draws a header strip + a row template on a rendered page; the tool applies that template across N pages, saves the template by layout fingerprint for next month's statement, emits CSV.
- **Effort phased A–E**: 3–4 weeks for a text-only MVP; 6–10 weeks for a polished version with multi-page template recall; +2–3 weeks if scanned-PDF OCR is required.
- **Difficulty**: medium-hard. The pieces are well-trodden; the combination (region selection that persists across pages and across documents with similar layouts) is where the engineering goes.
- **Ship criteria**: ≥1 paying customer + ≥3 paid or ≥5 demo emails asking for PDF extraction + the bookkeeper niche converting at least one customer first. None have fired.

Cross-references added:

- ``docs/REQUIREMENTS.md`` §11: pointer to FUTURE-TOOLS.md for parked tool ideas, with a one-paragraph summary of #10.
- ``docs/PLAN.md`` §2.1: notes that the freeze parks future tools in FUTURE-TOOLS.md and explicitly names #10 as the current highest-pressure entry.
- ``docs/NEXT-STEPS.md`` Phase 5 "what NOT to build" table: a new row for the PDF tool tied to the same ship-trigger language.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Future tools — design notes
Creator-only. Specs for tools the strategic plan refuses to build right now but that surface repeatedly enough to be worth documenting once instead of re-thinking from scratch every time a customer asks.
Status of these tools: post-launch, post-revenue. See PLAN.md §2.1: new-tool development is frozen until DataTools has a paying customer and a repeated demand signal for the same idea. This file is the resting place for those ideas in the meantime; nothing here ships unless a future decision says it does.
Each entry follows the same shape: What it does, Why someone would want it, Can we ship it now?, Approach, GUI sketch, Effort, Risks/unknowns, Ship criteria (the signal that overrides the freeze).
10. PDF → CSV extractor (bank statements + similar)
What it does
Takes a PDF (typically a bank statement, expense report, paystub, invoice, or any document where humans-but-not-computers can read a table) and turns the tabular content into a CSV that the rest of DataTools can consume.
The user shows the tool where the data lives by drawing rectangles on a rendered preview of the first page; the tool then applies those region templates to every page of the document (and remembers the template so the same template can be re-applied to next month's statement without re-clicking).
Why someone would want it
Bookkeepers, accountants, and any small-business operator who:
- Gets bank/credit-card statements only as PDFs (most US banks; many European ones).
- Wants to import transactions into QuickBooks / Xero / a spreadsheet without paying $10–$30/month for a SaaS converter (Docparser, Rossum, Hubdoc) or relying on a Python script they can't maintain.
- Has 12 months × N accounts of statements to back-fill into a ledger.
This is the most-requested DataTools adjacency in the casual feedback we
have so far. It maps tightly onto the bookkeeper niche identified in
PLAN.md §2.3 — that persona is exactly who needs PDF extraction and is
exactly the kind of operator who'd pay for a one-time desktop tool over a
recurring SaaS subscription.
Can we ship it now?
No. Current state, verified 2026-05-17:
- No PDF dependency in requirements.txt or requirements-dev.txt.
- No PDF-touching code anywhere under src/. The single string-mention of "PDF" in the codebase is in the output copy for the Quality Check tool ("generate PDF/Excel quality reports"), unrelated to extraction.
- No region-selection / canvas component in the Streamlit GUI today.
Building this requires net-new infrastructure on three axes (libraries, extraction core, region-picker UI). Estimates below.
Approach (technical)
PDFs split cleanly into two populations and the strategy differs:
- Native / text-layer PDFs: text is stored as text, just laid out visually. Most modern bank statements are this. Solvable with coordinate-aware text extraction:
  - pdfplumber (MIT, on top of pdfminer.six): gives (x0, y0, x1, y1, text) per character/word/line for each page. Mature, well tested, single dependency, no native compiler. First-choice.
  - pypdf (BSD-3): text-only, no positions. Too coarse for statement parsing; useful only for "the whole document as one big string."
  - camelot-py (MIT): purpose-built for table extraction. Heavier (needs ghostscript and tk/opencv for some modes), and assumes the table grid is already visible. Worth evaluating as a fallback for documents with explicit ruled tables.
- Scanned / image-only PDFs: pixels from a scanner; no text layer. Less common from major banks today but still happens with old PDFs and receipts. Needs OCR:
  - pytesseract wrapping the Tesseract binary (Apache-2). The OCR is good for English on clean scans, mediocre on receipts. Detect with pdfplumber: a page with no extractable characters (page.chars is empty) but at least one image object (page.images) is image-only → OCR fallback.
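A minimal sketch of the image-only check, assuming each page object exposes `chars` and `images` lists the way a pdfplumber page does. The function names are hypothetical, and the stand-in pages below only mimic that shape so the example runs without a PDF on disk.

```python
from types import SimpleNamespace


def is_image_only(page):
    """Return True when a page has images but no extractable characters.

    `page` is assumed to expose `chars` and `images` lists, as a
    pdfplumber page does; any object with those two attributes works.
    """
    return len(page.chars) == 0 and len(page.images) > 0


def pages_needing_ocr(pages):
    """Return 1-based page numbers that should be routed to the OCR path."""
    return [n for n, page in enumerate(pages, start=1) if is_image_only(page)]


# Stand-in pages for illustration; real ones would come from
# pdfplumber.open(path).pages.
text_page = SimpleNamespace(chars=[{"text": "D"}], images=[])
scanned_page = SimpleNamespace(chars=[], images=[{"name": "Im0"}])

print(pages_needing_ocr([text_page, scanned_page]))
```

A page with neither characters nor images (a blank page) is deliberately not flagged, so it would not be sent to Tesseract for nothing.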
The extraction core would be a state machine:
- Render page to an image (pdfplumber.Page.to_image() returns a PageImage wrapping a PIL image at a chosen resolution).
- User draws a header region and per-row regions (or marks a single table bounding box + column dividers) on the preview.
- For each PDF page, crop the corresponding pixel region (or pdf coordinate region), pull the text in that crop, and apply per-region parsing (date, amount, description).
- Emit one CSV row per detected statement row.
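The crop-and-parse steps above could be sketched as follows for the "single table bounding box + column dividers" variant. It assumes words arrive as dicts with `x0`, `x1`, `top`, `bottom`, and `text` keys (the shape pdfplumber's word extraction produces); the helper names, the row tolerance, and the sample coordinates are illustrative assumptions.

```python
import csv
import io


def words_in_bbox(words, bbox):
    """Keep words whose centre lies inside bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    kept = []
    for w in words:
        cx = (w["x0"] + w["x1"]) / 2
        cy = (w["top"] + w["bottom"]) / 2
        if x0 <= cx <= x1 and top <= cy <= bottom:
            kept.append(w)
    return kept


def group_rows(words, tolerance=3.0):
    """Cluster words into visual rows by comparing `top` coordinates."""
    rows = []
    for w in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if rows and abs(w["top"] - rows[-1][0]["top"]) <= tolerance:
            rows[-1].append(w)
        else:
            rows.append([w])
    return rows


def split_columns(row, dividers):
    """Assign each word to a column by counting dividers left of its x0."""
    cells = [[] for _ in range(len(dividers) + 1)]
    for w in sorted(row, key=lambda w: w["x0"]):
        col = sum(1 for d in dividers if w["x0"] >= d)
        cells[col].append(w["text"])
    return [" ".join(c) for c in cells]


# Hypothetical words from one statement page, in PDF points.
words = [
    {"text": "Date", "x0": 50, "x1": 75, "top": 100, "bottom": 110},
    {"text": "10/03", "x0": 50, "x1": 80, "top": 120, "bottom": 130},
    {"text": "AMAZON.COM", "x0": 120, "x1": 190, "top": 120.5, "bottom": 130.5},
    {"text": "-45.67", "x0": 400, "x1": 440, "top": 120, "bottom": 130},
    {"text": "10/05", "x0": 50, "x1": 80, "top": 140, "bottom": 150},
    {"text": "COFFEE", "x0": 120, "x1": 160, "top": 140, "bottom": 150},
    {"text": "SHOP", "x0": 165, "x1": 195, "top": 140, "bottom": 150},
    {"text": "-4.50", "x0": 405, "x1": 440, "top": 140, "bottom": 150},
]
table_bbox = (40, 115, 460, 200)   # drawn table region, header excluded
dividers = [100, 380]              # x positions separating the three columns

rows = [split_columns(r, dividers)
        for r in group_rows(words_in_bbox(words, table_bbox))]

buf = io.StringIO()
csv.writer(buf).writerows([["Date", "Description", "Amount"], *rows])
print(buf.getvalue())
```

Working in PDF coordinates rather than cropped pixels keeps the text-layer path independent of the preview DPI; per-region parsing (dates, amounts) would then run on the resulting cell strings.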
Bank-statement-specific niceties — implementable as templates on top of the generic engine:
- Recurring-template store: save "Chase visa October layout" once; the next month's PDF lands on the same template automatically. JSON file in ~/.datatools/templates/ keyed by a layout fingerprint (page size + header text hash).
- Multi-page row stitching: a row that wraps across pages gets merged back together based on date-column continuity.
- Currency / sign inference: a column that mixes $1,234.56 and ($45.00) is already handled by the (now-existing) Standardize Formats analyzer rules.
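A sketch of the recurring-template store, assuming the fingerprint is a SHA-256 over the rounded page size plus whitespace- and case-normalised header text. The exact normalisation, the 16-character key length, and the template fields are assumptions for illustration; a temporary directory stands in for the real template folder.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def layout_fingerprint(page_width, page_height, header_text):
    """Hash page size plus normalised header text into a short stable key."""
    normalised = " ".join(header_text.lower().split())
    payload = f"{round(page_width)}x{round(page_height)}|{normalised}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def save_template(store_dir, fingerprint, template):
    """Write the template as <fingerprint>.json and return the file path."""
    store_dir = Path(store_dir)
    store_dir.mkdir(parents=True, exist_ok=True)
    path = store_dir / f"{fingerprint}.json"
    path.write_text(json.dumps(template, indent=2), encoding="utf-8")
    return path


def load_template(store_dir, fingerprint):
    """Return the saved template dict, or None if this layout is unknown."""
    path = Path(store_dir) / f"{fingerprint}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    # October: the user draws regions once and saves them.
    fp_oct = layout_fingerprint(612, 792, "Date  Description   Amount")
    save_template(tmp, fp_oct, {"name": "Chase Visa Oct 2025",
                                "dividers": [100, 380]})
    # November: same page size and header, so the template is recalled.
    fp_nov = layout_fingerprint(612.0, 792.0, "date description amount")
    recalled = load_template(tmp, fp_nov)

print(fp_oct == fp_nov, recalled["name"])
```

An unknown fingerprint returns `None` rather than a nearest match, which leaves the "is this close enough?" decision to an explicit heuristic instead of hiding it in the store.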
GUI sketch
The hardest part of the whole project. Streamlit doesn't ship a native "draw rectangles on an image" widget. Options:
- streamlit-drawable-canvas: community component (MIT-licensed). Lets the user draw freehand rectangles on top of a background image. Returns the rectangle coordinates as JSON. Active maintenance. First-choice for the region picker.
- streamlit-cropper: single-rectangle crop tool. Good if we only needed the table bbox; too limited for "header region + column dividers + repeating-row template."
- Custom React component: fully tailored UX but adds a build toolchain DataTools doesn't have today. Last resort.
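Whichever picker is chosen, the drawn rectangles arrive in preview pixels and have to be mapped back to PDF points before extraction. The sketch below assumes the canvas JSON follows the Fabric.js object shape (an `objects` list whose rectangles carry `left`, `top`, `width`, `height`, and optional `scaleX` / `scaleY`) and that the preview is displayed at its rendered pixel size; the function name and sample values are hypothetical.

```python
def canvas_rects_to_pdf_bboxes(json_data, dpi):
    """Convert drawn rectangles (pixels) to bboxes in PDF points.

    Returns a list of (x0, top, x1, bottom) tuples. PDF user space is
    72 points per inch, so a preview rendered at `dpi` scales by 72/dpi.
    """
    scale = 72.0 / dpi
    bboxes = []
    for obj in json_data.get("objects", []):
        if obj.get("type") != "rect":
            continue  # ignore freehand paths, lines, and other shapes
        width = obj["width"] * obj.get("scaleX", 1)
        height = obj["height"] * obj.get("scaleY", 1)
        x0 = obj["left"] * scale
        top = obj["top"] * scale
        bboxes.append((x0, top, x0 + width * scale, top + height * scale))
    return bboxes


# Hypothetical canvas payload for a preview rendered at 144 DPI.
drawn = {
    "objects": [
        {"type": "rect", "left": 100, "top": 240, "width": 800, "height": 40},
        {"type": "rect", "left": 100, "top": 300, "width": 400, "height": 40,
         "scaleX": 2, "scaleY": 1},
        {"type": "path", "left": 0, "top": 0, "width": 10, "height": 10},
    ]
}

print(canvas_rects_to_pdf_bboxes(drawn, dpi=144))
```

Storing regions in PDF points rather than pixels means a saved template stays valid if the preview resolution changes later.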
Sketch of the proposed page (under "Transformations" in the sidebar section):
🧾 PDF → CSV (Beta)
─────────────────────────────────────────────────────────────────────
Upload a PDF [ Browse… ]
(statement / invoice / form — text-based PDFs work best)
[ ▸ Preview: October-statement.pdf · 3 pages ]
┌────────────────────────────────────────────────┐
│ CHASE BANK │
│ Statement period Oct 1–31, 2025 │
│ ┌─[1: header strip — drawn in red]──────────┐ │
│ │ Date Description Amount │ │
│ └────────────────────────────────────────────┘ │
│ ┌─[2: row template — drawn in green]────────┐ │
│ │ 10/03 AMAZON.COM #42… -45.67 │ │
│ └────────────────────────────────────────────┘ │
│ ⋮ (more transactions) │
└────────────────────────────────────────────────┘
Columns: [Date] [Description] [Amount] [+ Add column]
Apply template to: ( ) Only this page
(•) All pages with this layout
( ) All pages (force)
[ Save template as… Chase Visa Oct 2025 ]
[ Run extraction → CSV ]
After "Run extraction": the standard tool-page result layout (preview table, "Saved to ~/Downloads/_extracted.csv", "Open Downloads folder" — matching the other Ready tools).
The template save/recall is what makes this a one-time setup instead of a per-document chore — bookkeepers don't want to re-draw rectangles every month.
Effort estimate
| Phase | Scope | Estimate | Risk |
|---|---|---|---|
| A. Backend, native PDFs only | pdfplumber-based extraction, hard-coded region passed via a JSON config (no GUI) | 1–2 weeks | Low — straightforward use of pdfplumber. |
| B. Region-picker GUI | streamlit-drawable-canvas, multi-region drawing, per-region role assignment (date / amount / description) | 2–3 weeks | Medium — the canvas component has quirks; persisting region state across reruns is non-trivial. |
| C. Multi-page application + template persistence | Apply one page's template to N pages, save/load templates, layout fingerprint | 1–2 weeks | Medium — "is the next page the same layout?" is a real perception problem; we'll need a heuristic. |
| D. Scanned-PDF OCR fallback | Detect image-only pages, run Tesseract, merge OCR text into the extraction path | 2–3 weeks | High — OCR accuracy is variable; we'd want a quality threshold + a "fail this page noisily" path. Bundling Tesseract with the PyInstaller build is its own packaging headache. |
| E. Bank-statement specifics | Cross-page row stitching, currency-sign inference, multi-account splits | 1–2 weeks | Medium — every bank's idea of a "statement" differs. Templates absorb most of the variance. |
Realistic total for a polished v1: 6–10 calendar weeks of focused work (text-PDFs + GUI + templates + statement-specific niceties). Add another 2–3 weeks if scanned PDFs are required at launch.
Minimum viable extract (just text PDFs, single-region drawing, no template recall, no OCR): 3–4 weeks. Worth scoping a beta at that level before committing to the full surface.
Difficulty rating
Medium-hard. Not because any single piece is novel — pdfplumber + streamlit-drawable-canvas are well-trodden libraries — but because the combination (point-and-click region selection that persists across multiple PDF pages and across documents with similar layouts) is where most of the engineering goes. The "every bank does it slightly differently" reality makes templates a hard requirement rather than a nice-to-have, and templates raise the design effort.
Risks / unknowns
- Scanned-PDF coverage: if a meaningful slice of the addressable market sends image-only PDFs (older statements, scanned receipts), shipping text-only extraction limits the audience. Decide via the first 10–20 user requests.
- PyInstaller packaging of Tesseract: bundling the OCR binary into the desktop build is non-trivial. May force a "Tesseract not found — install it separately" path on first launch, which hurts the "one-click install" story.
- Bank layout drift: a template captured today can stop working next month if the bank redesigns its statement. Layout-fingerprint detection has to fail loudly rather than silently produce garbage.
- PII surface: bank statements are some of the most sensitive documents the user might touch. The "runs locally — your data never leaves this computer" guarantee is even more load-bearing here than for CSVs. No telemetry, no cloud OCR services, hard line.
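For the layout-drift risk, one way to "fail loudly" is to re-read the header region on every run and refuse to extract when the expected column labels are gone. This is a sketch under that assumption; `LayoutDriftError` and `check_header` are hypothetical names, and a real check would likely combine this with the layout fingerprint.

```python
class LayoutDriftError(Exception):
    """Raised when a saved template no longer matches the page it is applied to."""


def check_header(expected_columns, found_header_text):
    """Raise if any expected column label is missing from the header text.

    Comparison is case-insensitive and ignores runs of whitespace.
    """
    normalised = " ".join(found_header_text.lower().split())
    missing = [c for c in expected_columns if c.lower() not in normalised]
    if missing:
        raise LayoutDriftError(
            f"header no longer matches template; missing columns: {missing}"
        )


expected = ["Date", "Description", "Amount"]

# Same layout as when the template was saved: passes silently.
check_header(expected, "Date   Description   Amount")

# Redesigned statement: extraction stops instead of emitting garbage rows.
try:
    check_header(expected, "Posted  Details  Debit  Credit")
except LayoutDriftError as exc:
    print("refusing to extract:", exc)
```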
Ship criteria
Before this tool re-enters active development, all of these need to be true:
- DataTools has shipped to ≥1 paying customer (the PLAN.md §2.1 freeze condition).
- At least 3 paying customers OR 5 demo-traffic emails have explicitly asked for PDF extraction. Below that signal, build something else.
- The bookkeeper niche (per PLAN.md §2.3) has at least one converted customer. That's the persona who actually needs this tool, and confirming they pay before building a tool aimed squarely at them is the discipline the freeze exists to enforce.
If those three trip, the minimum-viable beta (3–4 weeks) goes first: text PDFs + single-region drawing, i.e. Phase A plus a single-region slice of Phase B, so we can see real user behaviour before committing to the full template surface.
(placeholder for additional future-tool entries)
Add new entries above this line. Keep the same shape: What / Why / Can we ship now / Approach / GUI / Effort / Risks / Ship criteria. The shape is what makes "is this idea ready" a factual question instead of an opinion.