datatools-dev/docs/FUTURE-TOOLS.md

# Future tools — design notes

> Creator-only. Specs for tools the strategic plan refuses to build right now
> but that surface repeatedly enough to be worth documenting once instead of
> re-thinking from scratch every time a customer asks.
> **Status of these tools**: post-launch, post-revenue. See `PLAN.md` §2.1 —
> new-tool development is frozen until DataTools has a paying customer and a
> repeated demand signal for the same idea. This file is the resting place
> for those ideas in the meantime; nothing here ships unless a future
> decision says it does.

Each entry follows the same shape: **What it does**, **Why someone would
want it**, **Can we ship it now?**, **Approach**, **GUI sketch**, **Effort**,
**Risks/unknowns**, **Ship criteria** (the signal that overrides the freeze).

---

## 10. PDF → CSV extractor (bank statements + similar)

### What it does

Takes a PDF (typically a bank statement, expense report, paystub, invoice,
or any document where humans-but-not-computers can read a table) and turns
the tabular content into a CSV that the rest of DataTools can consume.

The user shows the tool **where** the data lives by drawing rectangles on
a rendered preview of the first page; the tool then applies those region
templates to every page of the document (and remembers the template so the
same template can be re-applied to next month's statement without
re-clicking).

### Why someone would want it

Bookkeepers, accountants, and any small-business operator who:

- Gets bank/credit-card statements only as PDFs (most US banks; many
  European ones).
- Wants to import transactions into QuickBooks / Xero / a spreadsheet
  without paying $10–$30/month for a SaaS converter (Docparser,
  Rossum, Hubdoc) or relying on a Python script they can't maintain.
- Has 12 months × N accounts of statements to back-fill into a
  ledger.

This is the most-requested DataTools adjacency in the casual feedback we
have so far. It maps tightly onto the **bookkeeper niche** identified in
`PLAN.md` §2.3 — that persona is exactly who needs PDF extraction and is
exactly the kind of operator who'd pay for a one-time desktop tool over a
recurring SaaS subscription.

### Can we ship it now?

**No.** Current state, verified 2026-05-17:

- No PDF dependency in `requirements.txt` or `requirements-dev.txt`.
- No PDF-touching code anywhere under `src/`. The single
  string-mention of "PDF" in the codebase is in the **output** copy for
  the Quality Check tool ("generate PDF/Excel quality reports"),
  unrelated to extraction.
- No region-selection / canvas component in the Streamlit GUI today.

Building this requires net-new infrastructure on three axes (libraries,
extraction core, region-picker UI). Estimates below.

### Approach (technical)

PDFs split cleanly into two populations and the strategy differs:

1. **Native / text-layer PDFs** — text is stored as text, just laid out
   visually. Most modern bank statements are this. Solvable with
   coordinate-aware text extraction:

   - **`pdfplumber`** (BSD-3, on top of `pdfminer.six`) — gives `(x0, y0,
     x1, y1, text)` per character/word/line for each page. Mature, well
     tested, single dependency, no native compiler. **First-choice.**
   - **`pypdf`** (BSD-3) — text-only, no positions. Too coarse for
     statement parsing; useful only for "the whole document as one big
     string."
   - **`camelot-py`** (MIT) — purpose-built for table extraction.
     Heavier (needs `ghostscript` and `tk`/`opencv` for some modes),
     and assumes the table grid is already visible. Worth evaluating
     as a fallback for documents with explicit ruled tables.

2. **Scanned / image-only PDFs** — pixels of a scanner; no text layer.
   Less common from major banks today but still happens with old PDFs
   and receipts. Needs OCR:

   - **`pytesseract`** wrapping the **Tesseract** binary (Apache-2). The
     OCR is good for English on clean scans, mediocre on receipts.
     Detect with `pdfplumber`: a page where every character is in a
     glyph "image" object means the page is image-only → OCR fallback.

The extraction core would be a state machine:

1. Render page to an image (`pdfplumber.Page.to_image()` returns a PIL
   image at a chosen DPI).
2. User draws a header region and per-row regions (or marks a single
   table bounding box + column dividers) on the preview.
3. For each PDF page, crop the corresponding pixel region (or pdf
   coordinate region), pull the text in that crop, and apply per-region
   parsing (date, amount, description).
4. Emit one CSV row per detected statement row.

Bank-statement-specific niceties — implementable as templates on top of
the generic engine:

- Recurring-template store: save "Chase visa October layout" once, the
  next month's PDF lands on the same template automatically. JSON file
  in `~/.datatools/templates/` keyed by a layout fingerprint (page
  size + header text hash).
- Multi-page row stitching: a row that wraps across pages gets merged
  back together based on date-column continuity.
- Currency / sign inference: a column that mixes `$1,234.56` and
  `($45.00)` — already handled by the (now-existing) Standardize
  Formats analyzer rules.

### GUI sketch

The hardest part of the whole project. Streamlit doesn't ship a native
"draw rectangles on an image" widget. Options:

- **`streamlit-drawable-canvas`** — community component (MIT-licensed).
  Lets the user draw freehand rectangles on top of a background image.
  Returns the rectangle coordinates as JSON. Active maintenance.
  **First-choice for the region picker.**
- **`streamlit-cropper`** — single-rectangle crop tool. Good if we only
  needed the table bbox; too limited for "header region + column
  dividers + repeating-row template."
- **Custom React component** — fully tailored UX but adds a build
  toolchain DataTools doesn't have today. Last resort.

Sketch of the proposed page (under "Transformations" in the sidebar
section):

```
🧾  PDF → CSV (Beta)
─────────────────────────────────────────────────────────────────────
Upload a PDF                                  [ Browse… ]
  (statement / invoice / form — text-based PDFs work best)

[ ▸ Preview: October-statement.pdf  ·  3 pages ]
  ┌────────────────────────────────────────────────┐
  │  CHASE BANK                                    │
  │  Statement period Oct 1–31, 2025               │
  │  ┌─[1: header strip — drawn in red]──────────┐ │
  │  │  Date    Description          Amount      │ │
  │  └────────────────────────────────────────────┘ │
  │  ┌─[2: row template — drawn in green]────────┐ │
  │  │  10/03   AMAZON.COM #42…       -45.67     │ │
  │  └────────────────────────────────────────────┘ │
  │      ⋮ (more transactions)                     │
  └────────────────────────────────────────────────┘

Columns:  [Date]  [Description]  [Amount]      [+ Add column]

Apply template to:   ( ) Only this page
                     (•) All pages with this layout
                     ( ) All pages (force)

[ Save template as…  Chase Visa Oct 2025 ]

[ Run extraction → CSV ]
```

After "Run extraction": the standard tool-page result layout (preview
table, "Saved to ~/Downloads/<name>_extracted.csv", "Open Downloads
folder" — matching the other Ready tools).

The **template save/recall** is what makes this a one-time setup
instead of a per-document chore — bookkeepers don't want to re-draw
rectangles every month.

### Effort estimate

| Phase | Scope | Estimate | Risk |
|---|---|---|---|
| **A. Backend, native PDFs only** | pdfplumber-based extraction, hard-coded region passed via a JSON config (no GUI) | **1–2 weeks** | Low — straightforward use of pdfplumber. |
| **B. Region-picker GUI** | streamlit-drawable-canvas, multi-region drawing, per-region role assignment (date / amount / description) | **2–3 weeks** | Medium — the canvas component has quirks; persisting region state across reruns is non-trivial. |
| **C. Multi-page application + template persistence** | Apply one page's template to N pages, save/load templates, layout fingerprint | **1–2 weeks** | Medium — "is the next page the same layout?" is a real perception problem; we'll need a heuristic. |
| **D. Scanned-PDF OCR fallback** | Detect image-only pages, run Tesseract, merge OCR text into the extraction path | **2–3 weeks** | High — OCR accuracy is variable; we'd want a quality threshold + a "fail this page noisily" path. Bundling Tesseract with the PyInstaller build is its own packaging headache. |
| **E. Bank-statement specifics** | Cross-page row stitching, currency-sign inference, multi-account splits | **1–2 weeks** | Medium — every bank's idea of a "statement" differs. Templates absorb most of the variance. |

**Realistic total for a polished v1**: 6–10 calendar weeks of focused work
(text-PDFs + GUI + templates + statement-specific niceties). Add another
2–3 weeks if scanned PDFs are required at launch.

**Minimum viable extract** (just text PDFs, single-region drawing, no
template recall, no OCR): **3–4 weeks**. Worth scoping a beta at that
level before committing to the full surface.

### Difficulty rating

**Medium-hard.** Not because any single piece is novel — pdfplumber +
streamlit-drawable-canvas are well-trodden libraries — but because the
*combination* (point-and-click region selection that persists across
multiple PDF pages and across documents with similar layouts) is where
most of the engineering goes. The "every bank does it slightly
differently" reality makes templates a hard requirement rather than a
nice-to-have, and templates raise the design effort.

### Risks / unknowns

- **Scanned-PDF coverage**: if a meaningful slice of the addressable
  market sends image-only PDFs (older statements, scanned receipts),
  shipping text-only extraction limits the audience. Decide via the
  first 10–20 user requests.
- **PyInstaller packaging of Tesseract**: bundling the OCR binary into
  the desktop build is non-trivial. May force a "Tesseract not found —
  install it separately" path on first launch, which hurts the "one-
  click install" story.
- **Bank layout drift**: a template captured today can stop working
  next month if the bank redesigns its statement. Layout-fingerprint
  detection has to fail loudly rather than silently produce garbage.
- **PII surface**: bank statements are some of the most sensitive
  documents the user might touch. The "runs locally — your data never
  leaves this computer" guarantee is even more load-bearing here than
  for CSVs. No telemetry, no cloud OCR services, hard line.

### Ship criteria

Before this tool re-enters active development, all of these need to be
true:

- DataTools has shipped to **≥1 paying customer** (the `PLAN.md` §2.1
  freeze condition).
- **At least 3 paying customers OR 5 demo-traffic emails** have
  explicitly asked for PDF extraction. Below that signal, build
  something else.
- The bookkeeper niche (per `PLAN.md` §2.3) has at least one converted
  customer — that's the persona who actually needs this tool, and
  confirming they pay before building a tool aimed squarely at them
  is the discipline the freeze exists to enforce.

If those three trip, the **Phase A minimum-viable beta (3–4 weeks)**
goes first — text PDFs + single-region drawing — so we can see real
user behaviour before committing to the full template surface.

---

## (placeholder for additional future-tool entries)

Add new entries above this line. Keep the same shape:
What / Why / Can we ship now / Approach / GUI / Effort / Risks /
Ship criteria. The shape is what makes "is this idea ready" a
factual question instead of an opinion.