docs: design notes for future PDF→CSV tool

New ``docs/FUTURE-TOOLS.md`` captures post-launch tool ideas with a consistent shape: What / Why / Can we ship now / Approach / GUI sketch / Effort / Risks / Ship criteria. It is the resting place for things the new-tool freeze in ``PLAN.md`` §2.1 refuses to build but that keep coming up.

First entry: **#10 PDF → CSV extractor** (bank statements et al.). Key facts captured:

- **Current state**: no PDF infrastructure exists. Zero PDF dependencies in requirements.txt; zero PDF-touching code under ``src/``. The only "PDF" string in the codebase is the planned-output copy for the Quality Check tool, unrelated to extraction.
- **Library picks**: pdfplumber as the extraction core (MIT, no native compiler, gives coordinate-aware text), Tesseract via pytesseract as the OCR fallback for scanned PDFs, streamlit-drawable-canvas as the region-picker component.
- **GUI sketch**: user draws a header strip + a row template on a rendered page; the tool applies that template across N pages, saves the template by layout fingerprint for next month's statement, emits CSV.
- **Effort phased A–E**: 3–4 weeks for a text-only MVP; 6–10 weeks for a polished version with multi-page template recall; +2–3 weeks if scanned-PDF OCR is required.
- **Difficulty**: medium-hard. The pieces are well-trodden; the combination (region selection that persists across pages and across documents with similar layouts) is where the engineering goes.
- **Ship criteria**: ≥1 paying customer + ≥3 paying customers or ≥5 demo emails asking for PDF extraction + the bookkeeper niche converting at least one customer first. None have fired.

Cross-references added:

- ``docs/REQUIREMENTS.md`` §11: pointer to FUTURE-TOOLS.md for parked tool ideas, with a one-paragraph summary of #10.
- ``docs/PLAN.md`` §2.1: notes that the freeze parks future tools in FUTURE-TOOLS.md and explicitly names #10 as the current highest-pressure entry.
- ``docs/NEXT-STEPS.md`` Phase 5 "what NOT to build" table: a new row for the PDF tool tied to the same ship-trigger language.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs/FUTURE-TOOLS.md
@@ -0,0 +1,244 @@
# Future tools: design notes

> Creator-only. Specs for tools the strategic plan refuses to build right now
> but that surface repeatedly enough to be worth documenting once instead of
> re-thinking from scratch every time a customer asks.
>
> **Status of these tools**: post-launch, post-revenue. See `PLAN.md` §2.1:
> new-tool development is frozen until DataTools has a paying customer and a
> repeated demand signal for the same idea. This file is the resting place
> for those ideas in the meantime; nothing here ships unless a future
> decision says it does.

Each entry follows the same shape: **What it does**, **Why someone would
want it**, **Can we ship it now?**, **Approach**, **GUI sketch**, **Effort**,
**Risks/unknowns**, **Ship criteria** (the signal that overrides the freeze).

---

## 10. PDF → CSV extractor (bank statements + similar)

### What it does

Takes a PDF (typically a bank statement, expense report, paystub, invoice,
or any document where humans-but-not-computers can read a table) and turns
the tabular content into a CSV that the rest of DataTools can consume.

The user shows the tool **where** the data lives by drawing rectangles on
a rendered preview of the first page; the tool then applies those region
templates to every page of the document, and saves the template so it can
be re-applied to next month's statement without re-drawing.

### Why someone would want it

Bookkeepers, accountants, and any small-business operator who:

- Gets bank/credit-card statements only as PDFs (most US banks; many
  European ones).
- Wants to import transactions into QuickBooks / Xero / a spreadsheet
  without paying $10–$30/month for a SaaS converter (Docparser,
  Rossum, Hubdoc) or relying on a Python script they can't maintain.
- Has 12 months × N accounts of statements to back-fill into a
  ledger.

This is the most-requested DataTools adjacency in the casual feedback we
have so far. It maps tightly onto the **bookkeeper niche** identified in
`PLAN.md` §2.3: that persona is exactly who needs PDF extraction and is
exactly the kind of operator who'd pay for a one-time desktop tool over a
recurring SaaS subscription.

### Can we ship it now?

**No.** Current state, verified 2026-05-17:

- No PDF dependency in `requirements.txt` or `requirements-dev.txt`.
- No PDF-touching code anywhere under `src/`. The single
  string-mention of "PDF" in the codebase is in the **output** copy for
  the Quality Check tool ("generate PDF/Excel quality reports"),
  unrelated to extraction.
- No region-selection / canvas component in the Streamlit GUI today.

Building this requires net-new infrastructure on three axes (libraries,
extraction core, region-picker UI). Estimates below.

### Approach (technical)

PDFs split cleanly into two populations and the strategy differs:

1. **Native / text-layer PDFs**: text is stored as text, just laid out
   visually. Most modern bank statements are this. Solvable with
   coordinate-aware text extraction:

   - **`pdfplumber`** (MIT, on top of `pdfminer.six`): gives `(x0, top,
     x1, bottom, text)` per character/word/line for each page, with
     `top`/`bottom` measured from the top of the page. Mature, well
     tested, few dependencies, no native compiler. **First-choice.**
   - **`pypdf`** (BSD-3): plain text with no positions by default. Too
     coarse for statement parsing; useful only for "the whole document
     as one big string."
   - **`camelot-py`** (MIT): purpose-built for table extraction.
     Heavier (needs `ghostscript` and `tk`/`opencv` for some modes),
     and assumes the table grid is already visible. Worth evaluating
     as a fallback for documents with explicit ruled tables.

2. **Scanned / image-only PDFs**: pixels from a scanner; no text layer.
   Less common from major banks today but still happens with old PDFs
   and receipts. Needs OCR:

   - **`pytesseract`** wrapping the **Tesseract** binary (Apache-2). The
     OCR is good for English on clean scans, mediocre on receipts.
     Detect with `pdfplumber`: a page with no extractable characters
     (`page.chars` is empty) but at least one embedded image
     (`page.images`) is very likely image-only → OCR fallback.
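
The image-only check above can be sketched as a small pure function. This is a minimal sketch, assuming the inputs are the lengths of pdfplumber's `page.chars` and `page.images` lists; `classify_page`, `classify_pdf`, and the `min_chars` threshold are hypothetical names, not existing DataTools code.

```python
def classify_page(n_chars: int, n_images: int, min_chars: int = 1) -> str:
    """Return 'text', 'image_only', or 'empty' for one PDF page.

    n_chars and n_images are the lengths of pdfplumber's page.chars and
    page.images lists. min_chars is a hypothetical tuning knob for pages
    that carry only a stray character or two on top of a scan.
    """
    if n_chars >= min_chars:
        return "text"
    if n_images > 0:
        return "image_only"  # candidate for the OCR fallback
    return "empty"


def classify_pdf(path: str) -> list:
    """Classify every page of a PDF (requires pdfplumber to be installed)."""
    import pdfplumber  # imported lazily so classify_page stays dependency-free

    with pdfplumber.open(path) as pdf:
        return [classify_page(len(p.chars), len(p.images)) for p in pdf.pages]
```

Keeping the decision separate from the pdfplumber call makes the rule easy to unit-test and to tune once real statements are available.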

The extraction core would be a four-step pipeline:

1. Render the page to an image (`pdfplumber.Page.to_image()` returns a
   `PageImage` at a chosen resolution; its `.original` attribute is the
   PIL image).
2. User draws a header region and per-row regions (or marks a single
   table bounding box + column dividers) on the preview.
3. For each PDF page, crop the corresponding region (pixel coordinates
   converted to PDF coordinates), pull the text in that crop, and apply
   per-region parsing (date, amount, description).
4. Emit one CSV row per detected statement row.
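
Steps 3 and 4 can be sketched as follows: group the words inside a region into lines by vertical position, then assign each word to a column by its left edge. This is a sketch under stated assumptions: the word dicts have the `text`, `x0`, and `top` keys that pdfplumber's `Page.extract_words()` returns, and `words_to_rows`, `extract_region`, `dividers`, and `y_tolerance` are hypothetical names.

```python
from bisect import bisect_right


def words_to_rows(words, dividers, y_tolerance=3.0):
    """Group word dicts into table rows of column strings.

    words: dicts with 'text', 'x0', 'top' keys (the shape of pdfplumber's
        Page.extract_words() output).
    dividers: sorted x positions separating columns, so N dividers
        produce N + 1 columns.
    y_tolerance: max vertical offset (PDF points) from the first word of
        a line for another word to count as part of that line.
    """
    rows, current, current_top = [], None, None
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if current is None or word["top"] - current_top > y_tolerance:
            # Start a new row with one empty cell per column.
            current = [[] for _ in range(len(dividers) + 1)]
            current_top = word["top"]
            rows.append(current)
        current[bisect_right(dividers, word["x0"])].append(word)

    def cell_text(cell):
        # Restore left-to-right reading order inside each cell.
        return " ".join(w["text"] for w in sorted(cell, key=lambda w: w["x0"]))

    return [[cell_text(cell) for cell in row] for row in rows]


def extract_region(page, bbox, dividers):
    """page is a pdfplumber Page; bbox is (x0, top, x1, bottom) in PDF points."""
    return words_to_rows(page.crop(bbox).extract_words(), dividers)
```

The rows returned here are plain lists of strings, so they can be handed to `csv.writer` directly; per-column parsing (dates, amounts) would sit between the two.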

Bank-statement-specific niceties, implementable as templates on top of
the generic engine:

- Recurring-template store: save "Chase Visa October layout" once, and
  next month's PDF lands on the same template automatically. JSON file
  in `~/.datatools/templates/` keyed by a layout fingerprint (page
  size + header text hash).
- Multi-page row stitching: a row that wraps across pages gets merged
  back together based on date-column continuity.
- Currency / sign inference: a column that mixes `$1,234.56` and
  `($45.00)` is already handled by the (now-existing) Standardize
  Formats analyzer rules.
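
The recurring-template store can be sketched with the standard library alone. This is an illustration, not a settled design: the fingerprint recipe (rounded page size plus whitespace-normalised header text, hashed with SHA-256), the 16-character key length, and the function names are all assumptions.

```python
import hashlib
import json
from pathlib import Path


def layout_fingerprint(page_width, page_height, header_text):
    """Hash page size plus normalised header text into a short key."""
    normalised = " ".join(header_text.lower().split())
    payload = f"{round(page_width)}x{round(page_height)}|{normalised}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def save_template(store_dir, fingerprint, template):
    """Write one template as <store_dir>/<fingerprint>.json."""
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    path = store / f"{fingerprint}.json"
    path.write_text(json.dumps(template, indent=2), encoding="utf-8")
    return path


def load_template(store_dir, fingerprint):
    """Return the saved template, or None when this layout is unknown."""
    path = Path(store_dir) / f"{fingerprint}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

In use, `store_dir` would be the `~/.datatools/templates/` directory named above. Normalising case and whitespace keeps the key stable when the extracted header text differs only in spacing.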

### GUI sketch

The hardest part of the whole project. Streamlit doesn't ship a native
"draw rectangles on an image" widget. Options:

- **`streamlit-drawable-canvas`**: community component (MIT-licensed).
  Lets the user draw rectangles on top of a background image and
  returns the rectangle coordinates as JSON. Maintenance status and
  compatibility with current Streamlit releases should be re-verified
  before committing. **First-choice for the region picker.**
- **`streamlit-cropper`**: single-rectangle crop tool. Good if we only
  needed the table bbox; too limited for "header region + column
  dividers + repeating-row template."
- **Custom React component**: fully tailored UX but adds a build
  toolchain DataTools doesn't have today. Last resort.

Sketch of the proposed page (under the "Transformations" section of the
sidebar):

```
🧾 PDF → CSV (Beta)
─────────────────────────────────────────────────────────────────────
Upload a PDF   [ Browse… ]
(statement / invoice / form; text-based PDFs work best)

[ ▸ Preview: October-statement.pdf · 3 pages ]
┌────────────────────────────────────────────────┐
│ CHASE BANK                                     │
│ Statement period Oct 1–31, 2025                │
│ ┌─[1: header strip, drawn in red]────────────┐ │
│ │ Date   Description                  Amount │ │
│ └────────────────────────────────────────────┘ │
│ ┌─[2: row template, drawn in green]──────────┐ │
│ │ 10/03  AMAZON.COM #42…              -45.67 │ │
│ └────────────────────────────────────────────┘ │
│ ⋮ (more transactions)                          │
└────────────────────────────────────────────────┘

Columns: [Date] [Description] [Amount] [+ Add column]

Apply template to:  ( ) Only this page
                    (•) All pages with this layout
                    ( ) All pages (force)

[ Save template as… Chase Visa Oct 2025 ]

[ Run extraction → CSV ]
```

After "Run extraction", the page shows the standard tool-page result
layout (preview table, "Saved to ~/Downloads/<name>_extracted.csv",
"Open Downloads folder"), matching the other Ready tools.

The **template save/recall** is what makes this a one-time setup
instead of a per-document chore: bookkeepers don't want to re-draw
rectangles every month.
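
Whatever component ends up as the region picker, its rectangles arrive in preview-image pixels while the extraction core works in PDF points (72 per inch). The conversion can be sketched as below, assuming the preview was rendered at a known resolution and that rectangles use the fabric.js-style keys (`left`, `top`, `width`, `height`, optional `scaleX`/`scaleY`) that streamlit-drawable-canvas reports. The function name and the `display_scale` parameter are hypothetical.

```python
def canvas_rect_to_pdf_bbox(rect, resolution=150, display_scale=1.0):
    """Convert a drawn rectangle (pixels) to (x0, top, x1, bottom) in PDF points.

    rect: dict with 'left', 'top', 'width', 'height' and optional
        'scaleX' / 'scaleY' (set when the user resizes a rectangle).
    resolution: DPI used to render the preview image.
    display_scale: extra factor if the image was resized for display
        (hypothetical; 1.0 means the image is shown at rendered size).
    """
    scale = 72.0 / (resolution * display_scale)  # pixels -> PDF points
    width = rect["width"] * rect.get("scaleX", 1.0)
    height = rect["height"] * rect.get("scaleY", 1.0)
    x0 = rect["left"] * scale
    top = rect["top"] * scale
    return (x0, top, x0 + width * scale, top + height * scale)
```

Because pdfplumber also measures `top` from the top edge of the page, the result can be passed to `Page.crop()` without flipping the y-axis. Storing templates in PDF points rather than pixels keeps them valid if the preview resolution changes later.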

### Effort estimate

| Phase | Scope | Estimate | Risk |
|---|---|---|---|
| **A. Backend, native PDFs only** | pdfplumber-based extraction, hard-coded region passed via a JSON config (no GUI) | **1–2 weeks** | Low: straightforward use of pdfplumber. |
| **B. Region-picker GUI** | streamlit-drawable-canvas, multi-region drawing, per-region role assignment (date / amount / description) | **2–3 weeks** | Medium: the canvas component has quirks; persisting region state across reruns is non-trivial. |
| **C. Multi-page application + template persistence** | Apply one page's template to N pages, save/load templates, layout fingerprint | **1–2 weeks** | Medium: "is the next page the same layout?" is a real perception problem; we'll need a heuristic. |
| **D. Scanned-PDF OCR fallback** | Detect image-only pages, run Tesseract, merge OCR text into the extraction path | **2–3 weeks** | High: OCR accuracy is variable; we'd want a quality threshold + a "fail this page noisily" path. Bundling Tesseract with the PyInstaller build is its own packaging headache. |
| **E. Bank-statement specifics** | Cross-page row stitching, currency-sign inference, multi-account splits | **1–2 weeks** | Medium: every bank's idea of a "statement" differs. Templates absorb most of the variance. |

**Realistic total for a polished v1**: 6–10 calendar weeks of focused work
(text-PDFs + GUI + templates + statement-specific niceties). Add another
2–3 weeks if scanned PDFs are required at launch.

**Minimum viable extract** (just text PDFs, single-region drawing, no
template recall, no OCR): **3–4 weeks**. Worth scoping a beta at that
level before committing to the full surface.
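
As a cross-check, the per-phase ranges from the table can be summed directly. Phases A + B + C + E come to 5–9 weeks, so the 6–10 week headline appears to carry roughly a week of slack; that reading is an inference from the numbers, not something the table states.

```python
# Per-phase (low, high) estimates in weeks, copied from the table above.
PHASES = {"A": (1, 2), "B": (2, 3), "C": (1, 2), "D": (2, 3), "E": (1, 2)}


def total_weeks(phase_ids):
    """Sum the low and high estimates for the given phase letters."""
    low = sum(PHASES[p][0] for p in phase_ids)
    high = sum(PHASES[p][1] for p in phase_ids)
    return low, high


print(total_weeks("ABCE"))   # (5, 9)  polished v1 without OCR
print(total_weeks("ABCDE"))  # (7, 12) with the scanned-PDF fallback
```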

### Difficulty rating

**Medium-hard.** No single piece is novel (pdfplumber and
streamlit-drawable-canvas are well-trodden libraries), but the
*combination* (point-and-click region selection that persists across
multiple PDF pages and across documents with similar layouts) is where
most of the engineering goes. The "every bank does it slightly
differently" reality makes templates a hard requirement rather than a
nice-to-have, and templates raise the design effort.

### Risks / unknowns

- **Scanned-PDF coverage**: if a meaningful slice of the addressable
  market sends image-only PDFs (older statements, scanned receipts),
  shipping text-only extraction limits the audience. Decide via the
  first 10–20 user requests.
- **PyInstaller packaging of Tesseract**: bundling the OCR binary into
  the desktop build is non-trivial. May force a "Tesseract not found:
  install it separately" path on first launch, which hurts the
  "one-click install" story.
- **Bank layout drift**: a template captured today can stop working
  next month if the bank redesigns its statement. Layout-fingerprint
  detection has to fail loudly rather than silently produce garbage.
- **PII surface**: bank statements are some of the most sensitive
  documents the user might touch. The "runs locally, your data never
  leaves this computer" guarantee is even more load-bearing here than
  for CSVs. No telemetry, no cloud OCR services, hard line.
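
The "fail loudly" requirement for layout drift can be sketched as a guard that runs before any rows are extracted. This is one possible check, assuming the template stores the column names it expects in the header strip; `LayoutDriftError` and `check_header` are hypothetical names.

```python
class LayoutDriftError(Exception):
    """Raised when a page no longer matches the saved template."""


def check_header(expected_columns, header_text):
    """Raise LayoutDriftError if any expected column name is missing.

    expected_columns: column names saved with the template.
    header_text: text extracted from the header region of the new page.
    """
    found = set(header_text.lower().split())
    missing = [c for c in expected_columns if c.lower() not in found]
    if missing:
        raise LayoutDriftError(
            f"header no longer matches template; missing columns: {missing}"
        )
```

Raising instead of returning an empty result means the GUI has to surface the problem, so a redesigned statement cannot quietly produce a plausible-looking but wrong CSV.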

### Ship criteria

Before this tool enters active development, all of these need to be
true:

- DataTools has shipped to **≥1 paying customer** (the `PLAN.md` §2.1
  freeze condition).
- **At least 3 paying customers OR 5 demo-traffic emails** have
  explicitly asked for PDF extraction. Below that signal, build
  something else.
- The bookkeeper niche (per `PLAN.md` §2.3) has at least one converted
  customer. That's the persona who actually needs this tool, and
  confirming they pay before building a tool aimed squarely at them
  is the discipline the freeze exists to enforce.

If all three conditions are met, the **minimum-viable beta (3–4 weeks)**
goes first: Phase A plus a single-region slice of Phase B (text PDFs +
single-region drawing), so we can see real
user behaviour before committing to the full template surface.
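
The three criteria combine as an AND of conditions, with an OR inside the demand signal. A minimal sketch of that rule follows; the function and parameter names are hypothetical.

```python
def ship_gate(paying_customers, paying_requesters, demo_requests,
              bookkeeper_customers):
    """Return True only when all three ship criteria hold.

    paying_requesters: paying customers who explicitly asked for PDF extraction.
    demo_requests: demo-traffic emails that explicitly asked for it.
    """
    has_revenue = paying_customers >= 1
    has_demand = paying_requesters >= 3 or demo_requests >= 5
    niche_converted = bookkeeper_customers >= 1
    return has_revenue and has_demand and niche_converted
```

For example, one paying customer plus five demo emails is still not enough until a bookkeeper has converted: `ship_gate(1, 0, 5, 0)` is `False`, while `ship_gate(1, 0, 5, 1)` is `True`.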

---

## (placeholder for additional future-tool entries)

Add new entries above this line. Keep the same shape:
What / Why / Can we ship now / Approach / GUI / Effort / Risks /
Ship criteria. The shape is what makes "is this idea ready" a
factual question instead of an opinion.
@@ -269,6 +269,7 @@ moves until $5k/mo MRR:
| | Why locked |
|---|---|
| ❌ More tools (06–08) | `PLAN.md` §2.1 distribution-gate. Tool 09 was the exception; no others until first paid customer + one external review. |
| ❌ Tool #10 PDF → CSV (the most-asked-for adjacency) | Parked in `docs/FUTURE-TOOLS.md` with full design + 3–4 wk MVP / 6–10 wk polished estimate. Ship trigger: paying customer + ≥3 paid or ≥5 demo emails asking for PDF + the bookkeeper niche converting first. None have fired yet. |
| ❌ SaaS pivot | `DECISIONS.md` §4: recurring infra conflicts with the lifestyle constraint. |
| ❌ Live chat / sales calls | `DECISIONS.md` §1 #8: no-touch is locked until $5k/mo. |
| ❌ Custom integrations / one-off consulting | Breaks "build once, sell many." |

@@ -58,6 +58,14 @@ buy" into "an automatable workflow you depend on." That conversion is
what produces retention and word-of-mouth, the only marketing channel
that scales under the no-network/no-touch constraint.

**Parked behind the freeze**: post-launch tool ideas are captured in
`docs/FUTURE-TOOLS.md` with feasibility, GUI sketch, effort estimate,
and ship criteria for each. Currently parked: **#10 PDF → CSV
extractor** (bank statements et al.), gated on a paying customer +
≥3 paying customers or ≥5 demo emails explicitly asking for PDF
extraction, with the bookkeeper niche converting at least one customer
first. None of those triggers have fired yet.

### 2.2 The demo *is* the product. Make it embarrassingly good.

- Three persona-tagged sample datasets, not one generic CSV: Shopify

@@ -127,6 +127,17 @@ Sample size: 1,000 rows (configurable).
8. Quality Check: Coming Soon
9. Automated Workflows: Ready

**Future / not in v1.** Tool ideas captured for after-launch consideration
live in `docs/FUTURE-TOOLS.md`. Entries there are gated by the new-tool
freeze in `PLAN.md` §2.1 and don't ship without a paying-customer +
repeated-demand signal. Currently parked there:

- **#10. PDF → CSV extractor** (bank statements + similar). No PDF
  dependency exists in the repo today; this tool would need pdfplumber,
  streamlit-drawable-canvas, and a templates store. Estimated 3–4 weeks
  for a text-only MVP, 6–10 weeks for the polished version with
  multi-page template recall.

### 11.a Recommended pipeline order (soft, not enforced)

Automated Workflows ships with a `SOFT_DEPENDENCIES` table; the