New ``docs/FUTURE-TOOLS.md`` captures post-launch tool ideas with a consistent shape — What / Why / Can we ship now / Approach / GUI sketch / Effort / Risks / Ship criteria. Resting place for things the new-tool freeze in ``PLAN.md`` §2.1 refuses to build but that keep coming up. First entry: **#10 PDF → CSV extractor** (bank statements et al.). Key facts captured: - **Current state**: no PDF infrastructure exists. Zero PDF dependencies in requirements.txt; zero PDF-touching code under ``src/``. The only "PDF" string in the codebase is the planned- output copy for the Quality Check tool, unrelated to extraction. - **Library picks**: pdfplumber as the extraction core (BSD-3, no native compiler, gives coordinate-aware text), Tesseract via pytesseract as the OCR fallback for scanned PDFs, streamlit-drawable-canvas as the region-picker component. - **GUI sketch**: user draws a header strip + a row template on a rendered page; the tool applies that template across N pages, saves the template by layout fingerprint for next month's statement, emits CSV. - **Effort phased A–E**: 3–4 weeks for a text-only MVP; 6–10 weeks for a polished version with multi-page template recall; +2–3 weeks if scanned-PDF OCR is required. - **Difficulty**: medium-hard. The pieces are well-trodden; the combination (region selection that persists across pages and across documents with similar layouts) is where the engineering goes. - **Ship criteria**: ≥1 paying customer + ≥3 paid or ≥5 demo emails asking for PDF extraction + the bookkeeper niche converting at least one customer first. None have fired. Cross-references added: - ``docs/REQUIREMENTS.md`` §11: pointer to FUTURE-TOOLS.md for parked tool ideas, with a one-paragraph summary of #10. - ``docs/PLAN.md`` §2.1: notes that the freeze parks future tools in FUTURE-TOOLS.md and explicitly names #10 as the current highest-pressure entry. - ``docs/NEXT-STEPS.md`` Phase 5 "what NOT to build" table: a new row for the PDF tool tied to the same ship-trigger language. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
245 lines
12 KiB
Markdown
245 lines
12 KiB
Markdown
# Future tools — design notes
|
||
|
||
> Creator-only. Specs for tools the strategic plan refuses to build right now
|
||
> but that surface repeatedly enough to be worth documenting once instead of
|
||
> re-thinking from scratch every time a customer asks.
|
||
> **Status of these tools**: post-launch, post-revenue. See `PLAN.md` §2.1 —
|
||
> new-tool development is frozen until DataTools has a paying customer and a
|
||
> repeated demand signal for the same idea. This file is the resting place
|
||
> for those ideas in the meantime; nothing here ships unless a future
|
||
> decision says it does.
|
||
|
||
Each entry follows the same shape: **What it does**, **Why someone would
|
||
want it**, **Can we ship it now?**, **Approach**, **GUI sketch**, **Effort**,
|
||
**Risks/unknowns**, **Ship criteria** (the signal that overrides the freeze).
|
||
|
||
---
|
||
|
||
## 10. PDF → CSV extractor (bank statements + similar)
|
||
|
||
### What it does
|
||
|
||
Takes a PDF (typically a bank statement, expense report, paystub, invoice,
|
||
or any document where humans-but-not-computers can read a table) and turns
|
||
the tabular content into a CSV that the rest of DataTools can consume.
|
||
|
||
The user shows the tool **where** the data lives by drawing rectangles on
|
||
a rendered preview of the first page; the tool then applies those region
|
||
templates to every page of the document (and remembers the template so the
|
||
same template can be re-applied to next month's statement without
|
||
re-clicking).
|
||
|
||
### Why someone would want it
|
||
|
||
Bookkeepers, accountants, and any small-business operator who:
|
||
|
||
- Gets bank/credit-card statements only as PDFs (most US banks; many
|
||
European ones).
|
||
- Wants to import transactions into QuickBooks / Xero / a spreadsheet
|
||
without paying $10–$30/month for a SaaS converter (Docparser,
|
||
Rossum, Hubdoc) or relying on a Python script they can't maintain.
|
||
- Has 12 months × N accounts of statements to back-fill into a
|
||
ledger.
|
||
|
||
This is the most-requested DataTools adjacency in the casual feedback we
|
||
have so far. It maps tightly onto the **bookkeeper niche** identified in
|
||
`PLAN.md` §2.3 — that persona is exactly who needs PDF extraction and is
|
||
exactly the kind of operator who'd pay for a one-time desktop tool over a
|
||
recurring SaaS subscription.
|
||
|
||
### Can we ship it now?
|
||
|
||
**No.** Current state, verified 2026-05-17:
|
||
|
||
- No PDF dependency in `requirements.txt` or `requirements-dev.txt`.
|
||
- No PDF-touching code anywhere under `src/`. The single
|
||
string-mention of "PDF" in the codebase is in the **output** copy for
|
||
the Quality Check tool ("generate PDF/Excel quality reports"),
|
||
unrelated to extraction.
|
||
- No region-selection / canvas component in the Streamlit GUI today.
|
||
|
||
Building this requires net-new infrastructure on three axes (libraries,
|
||
extraction core, region-picker UI). Estimates below.
|
||
|
||
### Approach (technical)
|
||
|
||
PDFs split cleanly into two populations and the strategy differs:
|
||
|
||
1. **Native / text-layer PDFs** — text is stored as text, just laid out
|
||
visually. Most modern bank statements are this. Solvable with
|
||
coordinate-aware text extraction:
|
||
|
||
- **`pdfplumber`** (BSD-3, on top of `pdfminer.six`) — gives `(x0, y0,
|
||
x1, y1, text)` per character/word/line for each page. Mature, well
|
||
tested, single dependency, no native compiler. **First-choice.**
|
||
- **`pypdf`** (BSD-3) — text-only, no positions. Too coarse for
|
||
statement parsing; useful only for "the whole document as one big
|
||
string."
|
||
- **`camelot-py`** (MIT) — purpose-built for table extraction.
|
||
Heavier (needs `ghostscript` and `tk`/`opencv` for some modes),
|
||
and assumes the table grid is already visible. Worth evaluating
|
||
as a fallback for documents with explicit ruled tables.
|
||
|
||
2. **Scanned / image-only PDFs** — pixels of a scanner; no text layer.
|
||
Less common from major banks today but still happens with old PDFs
|
||
and receipts. Needs OCR:
|
||
|
||
- **`pytesseract`** wrapping the **Tesseract** binary (Apache-2). The
|
||
OCR is good for English on clean scans, mediocre on receipts.
|
||
Detect with `pdfplumber`: a page where every character is in a
|
||
glyph "image" object means the page is image-only → OCR fallback.
|
||
|
||
The extraction core would be a state machine:
|
||
|
||
1. Render page to an image (`pdfplumber.Page.to_image()` returns a PIL
|
||
image at a chosen DPI).
|
||
2. User draws a header region and per-row regions (or marks a single
|
||
table bounding box + column dividers) on the preview.
|
||
3. For each PDF page, crop the corresponding pixel region (or pdf
|
||
coordinate region), pull the text in that crop, and apply per-region
|
||
parsing (date, amount, description).
|
||
4. Emit one CSV row per detected statement row.
|
||
|
||
Bank-statement-specific niceties — implementable as templates on top of
|
||
the generic engine:
|
||
|
||
- Recurring-template store: save "Chase visa October layout" once, the
|
||
next month's PDF lands on the same template automatically. JSON file
|
||
in `~/.datatools/templates/` keyed by a layout fingerprint (page
|
||
size + header text hash).
|
||
- Multi-page row stitching: a row that wraps across pages gets merged
|
||
back together based on date-column continuity.
|
||
- Currency / sign inference: a column that mixes `$1,234.56` and
|
||
`($45.00)` — already handled by the (now-existing) Standardize
|
||
Formats analyzer rules.
|
||
|
||
### GUI sketch
|
||
|
||
The hardest part of the whole project. Streamlit doesn't ship a native
|
||
"draw rectangles on an image" widget. Options:
|
||
|
||
- **`streamlit-drawable-canvas`** — community component (MIT-licensed).
|
||
Lets the user draw freehand rectangles on top of a background image.
|
||
Returns the rectangle coordinates as JSON. Active maintenance.
|
||
**First-choice for the region picker.**
|
||
- **`streamlit-cropper`** — single-rectangle crop tool. Good if we only
|
||
needed the table bbox; too limited for "header region + column
|
||
dividers + repeating-row template."
|
||
- **Custom React component** — fully tailored UX but adds a build
|
||
toolchain DataTools doesn't have today. Last resort.
|
||
|
||
Sketch of the proposed page (under "Transformations" in the sidebar
|
||
section):
|
||
|
||
```
|
||
🧾 PDF → CSV (Beta)
|
||
─────────────────────────────────────────────────────────────────────
|
||
Upload a PDF [ Browse… ]
|
||
(statement / invoice / form — text-based PDFs work best)
|
||
|
||
[ ▸ Preview: October-statement.pdf · 3 pages ]
|
||
┌────────────────────────────────────────────────┐
|
||
│ CHASE BANK │
|
||
│ Statement period Oct 1–31, 2025 │
|
||
│ ┌─[1: header strip — drawn in red]──────────┐ │
|
||
│ │ Date Description Amount │ │
|
||
│ └────────────────────────────────────────────┘ │
|
||
│ ┌─[2: row template — drawn in green]────────┐ │
|
||
│ │ 10/03 AMAZON.COM #42… -45.67 │ │
|
||
│ └────────────────────────────────────────────┘ │
|
||
│ ⋮ (more transactions) │
|
||
└────────────────────────────────────────────────┘
|
||
|
||
Columns: [Date] [Description] [Amount] [+ Add column]
|
||
|
||
Apply template to: ( ) Only this page
|
||
(•) All pages with this layout
|
||
( ) All pages (force)
|
||
|
||
[ Save template as… Chase Visa Oct 2025 ]
|
||
|
||
[ Run extraction → CSV ]
|
||
```
|
||
|
||
After "Run extraction": the standard tool-page result layout (preview
|
||
table, "Saved to ~/Downloads/<name>_extracted.csv", "Open Downloads
|
||
folder" — matching the other Ready tools).
|
||
|
||
The **template save/recall** is what makes this a one-time setup
|
||
instead of a per-document chore — bookkeepers don't want to re-draw
|
||
rectangles every month.
|
||
|
||
### Effort estimate
|
||
|
||
| Phase | Scope | Estimate | Risk |
|
||
|---|---|---|---|
|
||
| **A. Backend, native PDFs only** | pdfplumber-based extraction, hard-coded region passed via a JSON config (no GUI) | **1–2 weeks** | Low — straightforward use of pdfplumber. |
|
||
| **B. Region-picker GUI** | streamlit-drawable-canvas, multi-region drawing, per-region role assignment (date / amount / description) | **2–3 weeks** | Medium — the canvas component has quirks; persisting region state across reruns is non-trivial. |
|
||
| **C. Multi-page application + template persistence** | Apply one page's template to N pages, save/load templates, layout fingerprint | **1–2 weeks** | Medium — "is the next page the same layout?" is a real perception problem; we'll need a heuristic. |
|
||
| **D. Scanned-PDF OCR fallback** | Detect image-only pages, run Tesseract, merge OCR text into the extraction path | **2–3 weeks** | High — OCR accuracy is variable; we'd want a quality threshold + a "fail this page noisily" path. Bundling Tesseract with the PyInstaller build is its own packaging headache. |
|
||
| **E. Bank-statement specifics** | Cross-page row stitching, currency-sign inference, multi-account splits | **1–2 weeks** | Medium — every bank's idea of a "statement" differs. Templates absorb most of the variance. |
|
||
|
||
**Realistic total for a polished v1**: 6–10 calendar weeks of focused work
|
||
(text-PDFs + GUI + templates + statement-specific niceties). Add another
|
||
2–3 weeks if scanned PDFs are required at launch.
|
||
|
||
**Minimum viable extract** (just text PDFs, single-region drawing, no
|
||
template recall, no OCR): **3–4 weeks**. Worth scoping a beta at that
|
||
level before committing to the full surface.
|
||
|
||
### Difficulty rating
|
||
|
||
**Medium-hard.** Not because any single piece is novel — pdfplumber +
|
||
streamlit-drawable-canvas are well-trodden libraries — but because the
|
||
*combination* (point-and-click region selection that persists across
|
||
multiple PDF pages and across documents with similar layouts) is where
|
||
most of the engineering goes. The "every bank does it slightly
|
||
differently" reality makes templates a hard requirement rather than a
|
||
nice-to-have, and templates raise the design effort.
|
||
|
||
### Risks / unknowns
|
||
|
||
- **Scanned-PDF coverage**: if a meaningful slice of the addressable
|
||
market sends image-only PDFs (older statements, scanned receipts),
|
||
shipping text-only extraction limits the audience. Decide via the
|
||
first 10–20 user requests.
|
||
- **PyInstaller packaging of Tesseract**: bundling the OCR binary into
|
||
the desktop build is non-trivial. May force a "Tesseract not found —
|
||
install it separately" path on first launch, which hurts the "one-
|
||
click install" story.
|
||
- **Bank layout drift**: a template captured today can stop working
|
||
next month if the bank redesigns its statement. Layout-fingerprint
|
||
detection has to fail loudly rather than silently produce garbage.
|
||
- **PII surface**: bank statements are some of the most sensitive
|
||
documents the user might touch. The "runs locally — your data never
|
||
leaves this computer" guarantee is even more load-bearing here than
|
||
for CSVs. No telemetry, no cloud OCR services, hard line.
|
||
|
||
### Ship criteria
|
||
|
||
Before this tool re-enters active development, all of these need to be
|
||
true:
|
||
|
||
- DataTools has shipped to **≥1 paying customer** (the `PLAN.md` §2.1
|
||
freeze condition).
|
||
- **At least 3 paying customers OR 5 demo-traffic emails** have
|
||
explicitly asked for PDF extraction. Below that signal, build
|
||
something else.
|
||
- The bookkeeper niche (per `PLAN.md` §2.3) has at least one converted
|
||
customer — that's the persona who actually needs this tool, and
|
||
confirming they pay before building a tool aimed squarely at them
|
||
is the discipline the freeze exists to enforce.
|
||
|
||
If those three trip, the **Phase A minimum-viable beta (3–4 weeks)**
|
||
goes first — text PDFs + single-region drawing — so we can see real
|
||
user behaviour before committing to the full template surface.
|
||
|
||
---
|
||
|
||
## (placeholder for additional future-tool entries)
|
||
|
||
Add new entries above this line. Keep the same shape:
|
||
What / Why / Can we ship now / Approach / GUI / Effort / Risks /
|
||
Ship criteria. The shape is what makes "is this idea ready" a
|
||
factual question instead of an opinion.
|