docs: design notes for future PDF→CSV tool

New ``docs/FUTURE-TOOLS.md`` captures post-launch tool ideas with a consistent shape: What / Why / Can we ship now / Approach / GUI sketch / Effort / Risks / Ship criteria. It is the resting place for things the new-tool freeze in ``PLAN.md`` §2.1 refuses to build but that keep coming up.

First entry: **#10 PDF → CSV extractor** (bank statements et al.). Key facts captured:

- **Current state**: no PDF infrastructure exists. Zero PDF dependencies in requirements.txt; zero PDF-touching code under ``src/``. The only "PDF" string in the codebase is the planned-output copy for the Quality Check tool, unrelated to extraction.
- **Library picks**: pdfplumber as the extraction core (MIT, no native compiler, gives coordinate-aware text), Tesseract via pytesseract as the OCR fallback for scanned PDFs, streamlit-drawable-canvas as the region-picker component.
- **GUI sketch**: user draws a header strip + a row template on a rendered page; the tool applies that template across N pages, saves the template by layout fingerprint for next month's statement, emits CSV.
- **Effort phased A–E**: 3–4 weeks for a text-only MVP; 6–10 weeks for a polished version with multi-page template recall; +2–3 weeks if scanned-PDF OCR is required.
- **Difficulty**: medium-hard. The pieces are well-trodden; the combination (region selection that persists across pages and across documents with similar layouts) is where the engineering goes.
- **Ship criteria**: ≥1 paying customer + ≥3 paid or ≥5 demo emails asking for PDF extraction + the bookkeeper niche converting at least one customer first. None have fired.

Cross-references added:

- ``docs/REQUIREMENTS.md`` §11: pointer to FUTURE-TOOLS.md for parked tool ideas, with a one-paragraph summary of #10.
- ``docs/PLAN.md`` §2.1: notes that the freeze parks future tools in FUTURE-TOOLS.md and explicitly names #10 as the current highest-pressure entry.
- ``docs/NEXT-STEPS.md`` Phase 5 "what NOT to build" table: a new row for the PDF tool tied to the same ship-trigger language.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
244
docs/FUTURE-TOOLS.md
Normal file
@@ -0,0 +1,244 @@
# Future tools: design notes

> Creator-only. Specs for tools the strategic plan refuses to build right now
> but that surface repeatedly enough to be worth documenting once instead of
> re-thinking from scratch every time a customer asks.
> **Status of these tools**: post-launch, post-revenue. See `PLAN.md` §2.1:
> new-tool development is frozen until DataTools has a paying customer and a
> repeated demand signal for the same idea. This file is the resting place
> for those ideas in the meantime; nothing here ships unless a future
> decision says it does.

Each entry follows the same shape: **What it does**, **Why someone would
want it**, **Can we ship it now?**, **Approach**, **GUI sketch**, **Effort**,
**Risks/unknowns**, **Ship criteria** (the signal that overrides the freeze).

---

## 10. PDF → CSV extractor (bank statements + similar)

### What it does

Takes a PDF (typically a bank statement, expense report, paystub, invoice,
or any document whose table a human can read but a program cannot) and turns
the tabular content into a CSV that the rest of DataTools can consume.

The user shows the tool **where** the data lives by drawing rectangles on
a rendered preview of the first page; the tool then applies those region
templates to every page of the document and remembers the template, so it
can be re-applied to next month's statement without re-drawing.

### Why someone would want it

Bookkeepers, accountants, and any small-business operator who:

- Gets bank/credit-card statements only as PDFs (most US banks; many
  European ones).
- Wants to import transactions into QuickBooks / Xero / a spreadsheet
  without paying $10–$30/month for a SaaS converter (Docparser,
  Rossum, Hubdoc) or relying on a Python script they can't maintain.
- Has 12 months × N accounts of statements to back-fill into a
  ledger.

This is the most-requested DataTools adjacency in the casual feedback we
have so far. It maps tightly onto the **bookkeeper niche** identified in
`PLAN.md` §2.3: that persona is exactly who needs PDF extraction and is
exactly the kind of operator who'd pay for a one-time desktop tool over a
recurring SaaS subscription.

### Can we ship it now?

**No.** Current state, verified 2026-05-17:

- No PDF dependency in `requirements.txt` or `requirements-dev.txt`.
- No PDF-touching code anywhere under `src/`. The only mention of "PDF"
  in the codebase is in the **output** copy for the Quality Check tool
  ("generate PDF/Excel quality reports"), unrelated to extraction.
- No region-selection / canvas component in the Streamlit GUI today.

Building this requires net-new infrastructure on three axes (libraries,
extraction core, region-picker UI). Estimates below.

### Approach (technical)

PDFs split cleanly into two populations, and the strategy differs for each:

1. **Native / text-layer PDFs**: text is stored as text, just laid out
   visually. Most modern bank statements are this. Solvable with
   coordinate-aware text extraction:

   - **`pdfplumber`** (MIT, on top of `pdfminer.six`): gives a bounding
     box (`x0`, `top`, `x1`, `bottom`) plus `text` per character, word,
     or line on each page. Mature, well tested, pip-installable with no
     native compiler. **First choice.**
   - **`pypdf`** (BSD-3): plain text extraction without per-word
     positions by default. Too coarse for statement parsing; useful only
     for "the whole document as one big string."
   - **`camelot-py`** (MIT): purpose-built for table extraction.
     Heavier (needs `ghostscript` and `tk`/`opencv` for some modes),
     and assumes the table grid is already visible. Worth evaluating
     as a fallback for documents with explicit ruled tables.

2. **Scanned / image-only PDFs**: pixels from a scanner; no text layer.
   Less common from major banks today but still happens with old PDFs
   and receipts. Needs OCR:

   - **`pytesseract`** wrapping the **Tesseract** binary (Apache-2). The
     OCR is good for English on clean scans, mediocre on receipts.
     Detect with `pdfplumber`: a page with no entries in `page.chars`
     but at least one in `page.images` is image-only and goes to the
     OCR fallback.
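
The native-versus-scanned decision above can be sketched as a small pure function. This is a minimal illustration, assuming the caller has already read `len(page.chars)` and `len(page.images)` from a pdfplumber page; the function name `classify_page` and the `min_chars` threshold are hypothetical and not part of any library.

```python
def classify_page(n_chars: int, n_images: int, min_chars: int = 20) -> str:
    """Classify one page from counts read off a pdfplumber page.

    n_chars   -- len(page.chars): characters present in the text layer
    n_images  -- len(page.images): embedded raster images on the page
    min_chars -- illustrative threshold (an assumption, not a tuned value)
    """
    if n_chars >= min_chars:
        return "text"      # usable text layer: coordinate-aware extraction
    if n_images > 0:
        return "scanned"   # almost no text but pixels present: OCR fallback
    return "empty"         # nothing to extract on this page


if __name__ == "__main__":
    print(classify_page(n_chars=1500, n_images=1))
    print(classify_page(n_chars=0, n_images=1))
```

A threshold rather than a strict zero check tolerates scanned pages that carry a few stray text-layer characters, such as a stamped page number.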

The extraction core would be a state machine:

1. Render the page to an image (`pdfplumber.Page.to_image()` returns a
   `PageImage` wrapping a PIL image at a chosen resolution in DPI).
2. User draws a header region and per-row regions (or marks a single
   table bounding box + column dividers) on the preview.
3. For each PDF page, crop the corresponding pixel region (or PDF
   coordinate region), pull the text in that crop, and apply per-region
   parsing (date, amount, description).
4. Emit one CSV row per detected statement row.
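
Steps 3 and 4 can be sketched without touching a PDF at all. The example below assumes its input is a list of word dicts shaped like the output of pdfplumber's `page.crop(bbox).extract_words()` (keys `text`, `x0`, `top`); the helper names, the `row_tolerance` default, and the divider positions are hypothetical.

```python
import csv
import io


def words_to_rows(words, dividers, row_tolerance=3.0):
    """Group words into rows by vertical position, then into columns.

    words     -- dicts with "text", "x0", "top" (PDF points, top-left origin)
    dividers  -- sorted x positions separating adjacent columns
    """
    rows = []
    for w in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # Words whose tops sit within the tolerance share a row.
        if rows and abs(w["top"] - rows[-1]["top"]) <= row_tolerance:
            rows[-1]["words"].append(w)
        else:
            rows.append({"top": w["top"], "words": [w]})

    table = []
    for row in rows:
        cells = [[] for _ in range(len(dividers) + 1)]
        for w in sorted(row["words"], key=lambda w: w["x0"]):
            # Column index = number of dividers to the left of the word.
            col = sum(1 for d in dividers if w["x0"] >= d)
            cells[col].append(w["text"])
        table.append([" ".join(c) for c in cells])
    return table


def rows_to_csv(table):
    """Serialize the table as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(table)
    return buf.getvalue()


if __name__ == "__main__":
    sample = [
        {"text": "10/03", "x0": 40, "top": 100.0},
        {"text": "AMAZON.COM", "x0": 90, "top": 100.4},
        {"text": "#42", "x0": 150, "top": 100.2},
        {"text": "-45.67", "x0": 300, "top": 100.0},
    ]
    print(rows_to_csv(words_to_rows(sample, dividers=[80, 280])))
```

Per-region parsing (dates, amounts) would run on the cell strings afterwards; it is left out here to keep the sketch short.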

Bank-statement-specific niceties, implementable as templates on top of
the generic engine:

- Recurring-template store: save "Chase Visa October layout" once, and
  next month's PDF lands on the same template automatically. JSON file
  in `~/.datatools/templates/` keyed by a layout fingerprint (page
  size + header text hash).
- Multi-page row stitching: a row that wraps across pages gets merged
  back together based on date-column continuity.
- Currency / sign inference: a column that mixes `$1,234.56` and
  `($45.00)` is already handled by the (now-existing) Standardize
  Formats analyzer rules.
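
One possible shape for the recurring-template store is sketched below. The text above only specifies "page size + header text hash", so everything else is an assumption: the whitespace and case normalization, the truncated SHA-256 key, and the helper names are illustrative choices.

```python
import hashlib
import json
from pathlib import Path


def layout_fingerprint(page_width, page_height, header_text):
    """Hash the page size plus normalized header text into a short key."""
    normalized = " ".join(header_text.split()).lower()
    key = f"{round(page_width)}x{round(page_height)}|{normalized}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def save_template(store_dir: Path, fingerprint: str, template: dict) -> Path:
    """Write one template as <fingerprint>.json inside store_dir."""
    store_dir.mkdir(parents=True, exist_ok=True)
    path = store_dir / f"{fingerprint}.json"
    path.write_text(json.dumps(template, indent=2), encoding="utf-8")
    return path


def load_template(store_dir: Path, fingerprint: str):
    """Return the stored template, or None when no layout matches."""
    path = store_dir / f"{fingerprint}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


if __name__ == "__main__":
    store = Path.home() / ".datatools" / "templates"
    fp = layout_fingerprint(612, 792, "Date  Description  Amount")
    print(fp, load_template(store, fp))
```

Normalizing the header text keeps the fingerprint stable when spacing differs slightly between months; a bank redesign that renames a column would change the key and produce a miss rather than a wrong match.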

### GUI sketch

The hardest part of the whole project. Streamlit doesn't ship a native
"draw rectangles on an image" widget. Options:

- **`streamlit-drawable-canvas`**: community component (MIT-licensed).
  Lets the user draw rectangles on top of a background image and
  returns their coordinates as JSON. Re-check its maintenance status
  before adopting it.
  **First choice for the region picker.**
- **`streamlit-cropper`**: single-rectangle crop tool. Good if we only
  needed the table bbox; too limited for "header region + column
  dividers + repeating-row template."
- **Custom React component**: fully tailored UX but adds a build
  toolchain DataTools doesn't have today. Last resort.

Sketch of the proposed page (in the "Transformations" section of the
sidebar):

```
🧾 PDF → CSV (Beta)
─────────────────────────────────────────────────────────────────────
Upload a PDF   [ Browse… ]
(statement / invoice / form; text-based PDFs work best)

[ ▸ Preview: October-statement.pdf · 3 pages ]
┌────────────────────────────────────────────────┐
│  CHASE BANK                                    │
│  Statement period Oct 1–31, 2025               │
│ ┌─[1: header strip, drawn in red]────────────┐ │
│ │ Date    Description               Amount   │ │
│ └────────────────────────────────────────────┘ │
│ ┌─[2: row template, drawn in green]──────────┐ │
│ │ 10/03   AMAZON.COM #42…           -45.67   │ │
│ └────────────────────────────────────────────┘ │
│  ⋮ (more transactions)                         │
└────────────────────────────────────────────────┘

Columns:  [Date] [Description] [Amount]   [+ Add column]

Apply template to:   ( ) Only this page
                     (•) All pages with this layout
                     ( ) All pages (force)

[ Save template as…  Chase Visa Oct 2025 ]

[ Run extraction → CSV ]
```

After "Run extraction": the standard tool-page result layout (preview
table, "Saved to ~/Downloads/<name>_extracted.csv", "Open Downloads
folder"), matching the other Ready tools.

The **template save/recall** is what makes this a one-time setup
instead of a per-document chore: bookkeepers don't want to re-draw
rectangles every month.
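
The rectangles drawn in the picker still have to be mapped back to PDF coordinates before extraction. The sketch below assumes the canvas returns fabric.js-style objects (`type`, `left`, `top`, `width`, `height`, `scaleX`, `scaleY`) under an `objects` key, as streamlit-drawable-canvas does in its `json_data`, and that the preview image is shown at its native pixel size. The function name is hypothetical.

```python
def canvas_rects_to_pdf_bboxes(json_data, dpi):
    """Convert drawn rectangles from preview pixels to PDF points.

    json_data -- dict with an "objects" list of fabric.js-style shapes
    dpi       -- resolution the page preview was rendered at
    Returns (x0, top, x1, bottom) tuples with a top-left origin, the
    bbox order pdfplumber's page.crop() expects.
    """
    scale = 72.0 / dpi  # PDF user space is 72 points per inch
    bboxes = []
    for obj in (json_data or {}).get("objects", []):
        if obj.get("type") != "rect":
            continue  # ignore lines, circles, and other shapes
        left, top = obj["left"], obj["top"]
        # Resizing a shape in fabric.js changes scaleX/scaleY, not width/height.
        width = obj["width"] * obj.get("scaleX", 1)
        height = obj["height"] * obj.get("scaleY", 1)
        bboxes.append((left * scale, top * scale,
                       (left + width) * scale, (top + height) * scale))
    return bboxes


if __name__ == "__main__":
    drawn = {"objects": [{"type": "rect", "left": 100, "top": 200,
                          "width": 300, "height": 50,
                          "scaleX": 1, "scaleY": 2}]}
    print(canvas_rects_to_pdf_bboxes(drawn, dpi=144))
```

If the preview is resized to fit the browser column, the displayed-to-rendered ratio would need to be folded into `scale` as well.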

### Effort estimate

| Phase | Scope | Estimate | Risk |
|---|---|---|---|
| **A. Backend, native PDFs only** | pdfplumber-based extraction, hard-coded region passed via a JSON config (no GUI) | **1–2 weeks** | Low: straightforward use of pdfplumber. |
| **B. Region-picker GUI** | streamlit-drawable-canvas, multi-region drawing, per-region role assignment (date / amount / description) | **2–3 weeks** | Medium: the canvas component has quirks; persisting region state across reruns is non-trivial. |
| **C. Multi-page application + template persistence** | Apply one page's template to N pages, save/load templates, layout fingerprint | **1–2 weeks** | Medium: "is the next page the same layout?" is a real perception problem; we'll need a heuristic. |
| **D. Scanned-PDF OCR fallback** | Detect image-only pages, run Tesseract, merge OCR text into the extraction path | **2–3 weeks** | High: OCR accuracy is variable; we'd want a quality threshold + a "fail this page noisily" path. Bundling Tesseract with the PyInstaller build is its own packaging headache. |
| **E. Bank-statement specifics** | Cross-page row stitching, currency-sign inference, multi-account splits | **1–2 weeks** | Medium: every bank's idea of a "statement" differs. Templates absorb most of the variance. |

**Realistic total for a polished v1**: 6–10 calendar weeks of focused work
(text-PDFs + GUI + templates + statement-specific niceties, i.e. phases
A, B, C, and E). Add another 2–3 weeks (phase D) if scanned PDFs are
required at launch.

**Minimum viable extract** (just text PDFs, single-region drawing, no
template recall, no OCR): **3–4 weeks**. Worth scoping a beta at that
level before committing to the full surface.
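
As a cross-check on the totals, the per-phase ranges from the table can be summed directly. This is plain arithmetic on the table's numbers. Phases A, B, C, and E add up to 5–9 weeks, about a week under the quoted 6–10 calendar weeks; the source does not say where the difference comes from, and integration slack is only one possible reading.

```python
# (low, high) estimates in weeks, copied from the phase table.
PHASES = {
    "A": (1, 2),  # backend, native PDFs only
    "B": (2, 3),  # region-picker GUI
    "C": (1, 2),  # multi-page application + template persistence
    "D": (2, 3),  # scanned-PDF OCR fallback
    "E": (1, 2),  # bank-statement specifics
}


def total_weeks(phase_ids):
    """Sum the low and high ends of the selected phases."""
    low = sum(PHASES[p][0] for p in phase_ids)
    high = sum(PHASES[p][1] for p in phase_ids)
    return low, high


if __name__ == "__main__":
    print("polished v1 (A+B+C+E):", total_weeks("ABCE"))
    print("with OCR (A-E):", total_weeks("ABCDE"))
```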

### Difficulty rating

**Medium-hard.** No single piece is novel (pdfplumber and
streamlit-drawable-canvas are well-trodden libraries), but the
*combination* (point-and-click region selection that persists across
multiple PDF pages and across documents with similar layouts) is where
most of the engineering goes. The "every bank does it slightly
differently" reality makes templates a hard requirement rather than a
nice-to-have, and templates raise the design effort.

### Risks / unknowns

- **Scanned-PDF coverage**: if a meaningful slice of the addressable
  market sends image-only PDFs (older statements, scanned receipts),
  shipping text-only extraction limits the audience. Decide via the
  first 10–20 user requests.
- **PyInstaller packaging of Tesseract**: bundling the OCR binary into
  the desktop build is non-trivial. May force a "Tesseract not found:
  install it separately" path on first launch, which hurts the
  "one-click install" story.
- **Bank layout drift**: a template captured today can stop working
  next month if the bank redesigns its statement. Layout-fingerprint
  detection has to fail loudly rather than silently produce garbage.
- **PII surface**: bank statements are some of the most sensitive
  documents the user might touch. The "runs locally, your data never
  leaves this computer" guarantee is even more load-bearing here than
  for CSVs. No telemetry, no cloud OCR services, hard line.
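
Two of the risks above reduce to small guard functions. This is a sketch under assumptions: the exception class and function names are hypothetical, and the Tesseract lookup only checks the `PATH` via `shutil.which`, which would not see a binary bundled elsewhere in a PyInstaller build.

```python
import shutil


class LayoutMismatchError(Exception):
    """Raised when a page does not match the saved template's layout."""


def require_matching_layout(expected_fp, actual_fp, page_number):
    """Fail loudly instead of extracting with a stale template."""
    if expected_fp != actual_fp:
        raise LayoutMismatchError(
            f"page {page_number}: layout fingerprint {actual_fp!r} does not "
            f"match template {expected_fp!r}; redraw the regions"
        )


def find_tesseract():
    """Return the path to the tesseract binary on PATH, or None."""
    return shutil.which("tesseract")


if __name__ == "__main__":
    if find_tesseract() is None:
        print("Tesseract not found: install it separately")
    require_matching_layout("abc123", "abc123", page_number=1)
```

Raising an exception per page lets the caller decide whether to stop the whole run or skip the page and report it, while never writing rows from a mismatched layout.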

### Ship criteria

Before this tool re-enters active development, all of these need to be
true:

- DataTools has shipped to **≥1 paying customer** (the `PLAN.md` §2.1
  freeze condition).
- **At least 3 paying customers OR 5 demo-traffic emails** have
  explicitly asked for PDF extraction. Below that signal, build
  something else.
- The bookkeeper niche (per `PLAN.md` §2.3) has at least one converted
  customer. That's the persona who actually needs this tool, and
  confirming they pay before building a tool aimed squarely at them
  is the discipline the freeze exists to enforce.

If those three trip, the **minimum-viable beta (3–4 weeks)** goes
first: Phase A plus single-region drawing on text PDFs, so we can see
real user behaviour before committing to the full template surface.
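
The three conditions combine as a plain boolean check, which is what makes the gate a factual question. The sketch below encodes them as stated; the dataclass and its field names are hypothetical, and the counts would come from wherever customer and email records are kept.

```python
from dataclasses import dataclass


@dataclass
class DemandSignal:
    paying_customers: int             # all paying DataTools customers
    paying_customers_asking_pdf: int  # of those, how many asked for PDF extraction
    demo_emails_asking_pdf: int       # demo-traffic emails asking for it
    bookkeeper_customers: int         # converted customers in the bookkeeper niche


def freeze_lifted(s: DemandSignal) -> bool:
    """True only when all three ship criteria hold."""
    has_revenue = s.paying_customers >= 1
    has_demand = (s.paying_customers_asking_pdf >= 3
                  or s.demo_emails_asking_pdf >= 5)
    niche_converted = s.bookkeeper_customers >= 1
    return has_revenue and has_demand and niche_converted


if __name__ == "__main__":
    today = DemandSignal(0, 0, 0, 0)  # "None have fired" per the commit message
    print(freeze_lifted(today))
```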

---

## (placeholder for additional future-tool entries)

Add new entries above this line. Keep the same shape:
What / Why / Can we ship now / Approach / GUI / Effort / Risks /
Ship criteria. The shape is what makes "is this idea ready" a
factual question instead of an opinion.
@@ -269,6 +269,7 @@ moves until $5k/mo MRR:
| | Why locked |
|---|---|
| ❌ More tools (06–08) | `PLAN.md` §2.1 distribution-gate. Tool 09 was the exception; no others until first paid customer + one external review. |
| ❌ Tool #10 PDF → CSV (the most-asked-for adjacency) | Parked in `docs/FUTURE-TOOLS.md` with full design + 3–4 wk MVP / 6–10 wk polished estimate. Ship trigger: paying customer + ≥3 paid or ≥5 demo emails asking for PDF + the bookkeeper niche converting first. None have fired yet. |
| ❌ SaaS pivot | `DECISIONS.md` §4 — recurring infra conflicts with the lifestyle constraint. |
| ❌ Live chat / sales calls | `DECISIONS.md` §1 #8 — no-touch is locked until $5k/mo. |
| ❌ Custom integrations / one-off consulting | Breaks "build once, sell many." |
@@ -58,6 +58,14 @@ buy" into "an automatable workflow you depend on." That conversion is
what produces retention and word-of-mouth — the only marketing channel
that scales under the no-network/no-touch constraint.

**Parked behind the freeze**: post-launch tool ideas are captured in
`docs/FUTURE-TOOLS.md` with feasibility, GUI sketch, effort estimate,
and ship criteria for each. Currently parked: **#10 PDF → CSV
extractor** (bank statements et al.) — gated on a paying customer +
≥3 paying customers or ≥5 demo emails explicitly asking for PDF
extraction, with the bookkeeper niche converting at least one customer
first. None of those triggers have fired yet.

### 2.2 The demo *is* the product. Make it embarrassingly good.

- Three persona-tagged sample datasets, not one generic CSV: Shopify
@@ -127,6 +127,17 @@ Sample size: 1,000 rows (configurable).
8. Quality Check — Coming Soon
9. Automated Workflows — Ready

**Future / not in v1.** Tool ideas captured for after-launch consideration
live in `docs/FUTURE-TOOLS.md` — entries there are gated by the new-tool
freeze in `PLAN.md` §2.1 and don't ship without a paying-customer +
repeated-demand signal. Currently parked there:

- **#10. PDF → CSV extractor** (bank statements + similar). No PDF
  dependency exists in the repo today; this tool would need pdfplumber,
  streamlit-drawable-canvas, and a templates store. Estimated 3–4 weeks
  for a text-only MVP, 6–10 weeks for the polished version with
  multi-page template recall.

### 11.a Recommended pipeline order (soft, not enforced)

Automated Workflows ships with a `SOFT_DEPENDENCIES` table; the