Files

Michael ee0b1f6f6b docs: design notes for future PDF→CSV tool

New ``docs/FUTURE-TOOLS.md`` captures post-launch tool ideas with a
consistent shape — What / Why / Can we ship now / Approach / GUI
sketch / Effort / Risks / Ship criteria. Resting place for things
the new-tool freeze in ``PLAN.md`` §2.1 refuses to build but that
keep coming up.

First entry: **#10 PDF → CSV extractor** (bank statements et al.).

Key facts captured:

- **Current state**: no PDF infrastructure exists. Zero PDF
  dependencies in requirements.txt; zero PDF-touching code under
  ``src/``. The only "PDF" string in the codebase is the planned-
  output copy for the Quality Check tool, unrelated to extraction.
- **Library picks**: pdfplumber as the extraction core (BSD-3,
  no native compiler, gives coordinate-aware text), Tesseract via
  pytesseract as the OCR fallback for scanned PDFs,
  streamlit-drawable-canvas as the region-picker component.
- **GUI sketch**: user draws a header strip + a row template on a
  rendered page; the tool applies that template across N pages,
  saves the template by layout fingerprint for next month's
  statement, emits CSV.
- **Effort phased A–E**: 3–4 weeks for a text-only MVP; 6–10
  weeks for a polished version with multi-page template recall;
  +2–3 weeks if scanned-PDF OCR is required.
- **Difficulty**: medium-hard. The pieces are well-trodden; the
  combination (region selection that persists across pages and
  across documents with similar layouts) is where the engineering
  goes.
- **Ship criteria**: ≥1 paying customer + ≥3 paid or ≥5 demo
  emails asking for PDF extraction + the bookkeeper niche
  converting at least one customer first. None have fired.

Cross-references added:

- ``docs/REQUIREMENTS.md`` §11: pointer to FUTURE-TOOLS.md for
  parked tool ideas, with a one-paragraph summary of #10.
- ``docs/PLAN.md`` §2.1: notes that the freeze parks future tools
  in FUTURE-TOOLS.md and explicitly names #10 as the current
  highest-pressure entry.
- ``docs/NEXT-STEPS.md`` Phase 5 "what NOT to build" table: a new
  row for the PDF tool tied to the same ship-trigger language.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-17 01:52:42 +00:00

12 KiB

Raw Permalink Blame History

Future tools — design notes

Creator-only. Specs for tools the strategic plan refuses to build right now but that surface repeatedly enough to be worth documenting once instead of re-thinking from scratch every time a customer asks. Status of these tools: post-launch, post-revenue. See PLAN.md §2.1 — new-tool development is frozen until DataTools has a paying customer and a repeated demand signal for the same idea. This file is the resting place for those ideas in the meantime; nothing here ships unless a future decision says it does.

Each entry follows the same shape: What it does, Why someone would want it, Can we ship it now?, Approach, GUI sketch, Effort, Risks/unknowns, Ship criteria (the signal that overrides the freeze).

10. PDF → CSV extractor (bank statements + similar)

What it does

Takes a PDF (typically a bank statement, expense report, paystub, invoice, or any document where humans-but-not-computers can read a table) and turns the tabular content into a CSV that the rest of DataTools can consume.

The user shows the tool where the data lives by drawing rectangles on a rendered preview of the first page; the tool then applies those region templates to every page of the document (and remembers the template so the same template can be re-applied to next month's statement without re-clicking).

Why someone would want it

Bookkeepers, accountants, and any small-business operator who:

Gets bank/credit-card statements only as PDFs (most US banks; many European ones).
Wants to import transactions into QuickBooks / Xero / a spreadsheet without paying $10–$30/month for a SaaS converter (Docparser, Rossum, Hubdoc) or relying on a Python script they can't maintain.
Has 12 months × N accounts of statements to back-fill into a ledger.

This is the most-requested DataTools adjacency in the casual feedback we have so far. It maps tightly onto the bookkeeper niche identified in PLAN.md §2.3 — that persona is exactly who needs PDF extraction and is exactly the kind of operator who'd pay for a one-time desktop tool over a recurring SaaS subscription.

Can we ship it now?

No. Current state, verified 2026-05-17:

No PDF dependency in requirements.txt or requirements-dev.txt.
No PDF-touching code anywhere under src/. The single string-mention of "PDF" in the codebase is in the output copy for the Quality Check tool ("generate PDF/Excel quality reports"), unrelated to extraction.
No region-selection / canvas component in the Streamlit GUI today.

Building this requires net-new infrastructure on three axes (libraries, extraction core, region-picker UI). Estimates below.

Approach (technical)

PDFs split cleanly into two populations and the strategy differs:

Native / text-layer PDFs — text is stored as text, just laid out visually. Most modern bank statements are this. Solvable with coordinate-aware text extraction:
- pdfplumber (BSD-3, on top of pdfminer.six) — gives (x0, y0, x1, y1, text) per character/word/line for each page. Mature, well tested, single dependency, no native compiler. First-choice.
- pypdf (BSD-3) — text-only, no positions. Too coarse for statement parsing; useful only for "the whole document as one big string."
- camelot-py (MIT) — purpose-built for table extraction. Heavier (needs ghostscript and tk/opencv for some modes), and assumes the table grid is already visible. Worth evaluating as a fallback for documents with explicit ruled tables.
Scanned / image-only PDFs — pixels of a scanner; no text layer. Less common from major banks today but still happens with old PDFs and receipts. Needs OCR:
- pytesseract wrapping the Tesseract binary (Apache-2). The OCR is good for English on clean scans, mediocre on receipts. Detect with pdfplumber: a page where every character is in a glyph "image" object means the page is image-only → OCR fallback.

The extraction core would be a state machine:

Render page to an image (pdfplumber.Page.to_image() returns a PIL image at a chosen DPI).
User draws a header region and per-row regions (or marks a single table bounding box + column dividers) on the preview.
For each PDF page, crop the corresponding pixel region (or pdf coordinate region), pull the text in that crop, and apply per-region parsing (date, amount, description).
Emit one CSV row per detected statement row.

Bank-statement-specific niceties — implementable as templates on top of the generic engine:

Recurring-template store: save "Chase visa October layout" once, the next month's PDF lands on the same template automatically. JSON file in ~/.datatools/templates/ keyed by a layout fingerprint (page size + header text hash).
Multi-page row stitching: a row that wraps across pages gets merged back together based on date-column continuity.
Currency / sign inference: a column that mixes $1,234.56 and ($45.00) — already handled by the (now-existing) Standardize Formats analyzer rules.

GUI sketch

The hardest part of the whole project. Streamlit doesn't ship a native "draw rectangles on an image" widget. Options:

streamlit-drawable-canvas — community component (MIT-licensed). Lets the user draw freehand rectangles on top of a background image. Returns the rectangle coordinates as JSON. Active maintenance. First-choice for the region picker.
streamlit-cropper — single-rectangle crop tool. Good if we only needed the table bbox; too limited for "header region + column dividers + repeating-row template."
Custom React component — fully tailored UX but adds a build toolchain DataTools doesn't have today. Last resort.

Sketch of the proposed page (under "Transformations" in the sidebar section):

🧾  PDF → CSV (Beta)
─────────────────────────────────────────────────────────────────────
Upload a PDF                                  [ Browse… ]
  (statement / invoice / form — text-based PDFs work best)

[ ▸ Preview: October-statement.pdf  ·  3 pages ]
  ┌────────────────────────────────────────────────┐
  │  CHASE BANK                                    │
  │  Statement period Oct 1–31, 2025               │
  │  ┌─[1: header strip — drawn in red]──────────┐ │
  │  │  Date    Description          Amount      │ │
  │  └────────────────────────────────────────────┘ │
  │  ┌─[2: row template — drawn in green]────────┐ │
  │  │  10/03   AMAZON.COM #42…       -45.67     │ │
  │  └────────────────────────────────────────────┘ │
  │      ⋮ (more transactions)                     │
  └────────────────────────────────────────────────┘

Columns:  [Date]  [Description]  [Amount]      [+ Add column]

Apply template to:   ( ) Only this page
                     (•) All pages with this layout
                     ( ) All pages (force)

[ Save template as…  Chase Visa Oct 2025 ]

[ Run extraction → CSV ]

After "Run extraction": the standard tool-page result layout (preview table, "Saved to ~/Downloads/_extracted.csv", "Open Downloads folder" — matching the other Ready tools).

The template save/recall is what makes this a one-time setup instead of a per-document chore — bookkeepers don't want to re-draw rectangles every month.

Effort estimate

Phase	Scope	Estimate	Risk
A. Backend, native PDFs only	pdfplumber-based extraction, hard-coded region passed via a JSON config (no GUI)	1–2 weeks	Low — straightforward use of pdfplumber.
B. Region-picker GUI	streamlit-drawable-canvas, multi-region drawing, per-region role assignment (date / amount / description)	2–3 weeks	Medium — the canvas component has quirks; persisting region state across reruns is non-trivial.
C. Multi-page application + template persistence	Apply one page's template to N pages, save/load templates, layout fingerprint	1–2 weeks	Medium — "is the next page the same layout?" is a real perception problem; we'll need a heuristic.
D. Scanned-PDF OCR fallback	Detect image-only pages, run Tesseract, merge OCR text into the extraction path	2–3 weeks	High — OCR accuracy is variable; we'd want a quality threshold + a "fail this page noisily" path. Bundling Tesseract with the PyInstaller build is its own packaging headache.
E. Bank-statement specifics	Cross-page row stitching, currency-sign inference, multi-account splits	1–2 weeks	Medium — every bank's idea of a "statement" differs. Templates absorb most of the variance.

Realistic total for a polished v1: 6–10 calendar weeks of focused work (text-PDFs + GUI + templates + statement-specific niceties). Add another 2–3 weeks if scanned PDFs are required at launch.

Minimum viable extract (just text PDFs, single-region drawing, no template recall, no OCR): 3–4 weeks. Worth scoping a beta at that level before committing to the full surface.

Difficulty rating

Medium-hard. Not because any single piece is novel — pdfplumber + streamlit-drawable-canvas are well-trodden libraries — but because the combination (point-and-click region selection that persists across multiple PDF pages and across documents with similar layouts) is where most of the engineering goes. The "every bank does it slightly differently" reality makes templates a hard requirement rather than a nice-to-have, and templates raise the design effort.

Risks / unknowns

Scanned-PDF coverage: if a meaningful slice of the addressable market sends image-only PDFs (older statements, scanned receipts), shipping text-only extraction limits the audience. Decide via the first 10–20 user requests.
PyInstaller packaging of Tesseract: bundling the OCR binary into the desktop build is non-trivial. May force a "Tesseract not found — install it separately" path on first launch, which hurts the "one- click install" story.
Bank layout drift: a template captured today can stop working next month if the bank redesigns its statement. Layout-fingerprint detection has to fail loudly rather than silently produce garbage.
PII surface: bank statements are some of the most sensitive documents the user might touch. The "runs locally — your data never leaves this computer" guarantee is even more load-bearing here than for CSVs. No telemetry, no cloud OCR services, hard line.

Ship criteria

Before this tool re-enters active development, all of these need to be true:

DataTools has shipped to ≥1 paying customer (the PLAN.md §2.1 freeze condition).
At least 3 paying customers OR 5 demo-traffic emails have explicitly asked for PDF extraction. Below that signal, build something else.
The bookkeeper niche (per PLAN.md §2.3) has at least one converted customer — that's the persona who actually needs this tool, and confirming they pay before building a tool aimed squarely at them is the discipline the freeze exists to enforce.

If those three trip, the Phase A minimum-viable beta (3–4 weeks) goes first — text PDFs + single-region drawing — so we can see real user behaviour before committing to the full template surface.

(placeholder for additional future-tool entries)

Add new entries above this line. Keep the same shape: What / Why / Can we ship now / Approach / GUI / Effort / Risks / Ship criteria. The shape is what makes "is this idea ready" a factual question instead of an opinion.

12 KiB Raw Permalink Blame History Unescape Escape