docs: design notes for future PDF→CSV tool

New ``docs/FUTURE-TOOLS.md`` captures post-launch tool ideas with a
consistent shape — What / Why / Can we ship now / Approach / GUI
sketch / Effort / Risks / Ship criteria. Resting place for things
the new-tool freeze in ``PLAN.md`` §2.1 refuses to build but that
keep coming up.

First entry: **#10 PDF → CSV extractor** (bank statements et al.).

Key facts captured:

- **Current state**: no PDF infrastructure exists. Zero PDF
  dependencies in requirements.txt; zero PDF-touching code under
  ``src/``. The only "PDF" string in the codebase is the planned-
  output copy for the Quality Check tool, unrelated to extraction.
- **Library picks**: pdfplumber as the extraction core (MIT,
  no native compiler, gives coordinate-aware text), Tesseract via
  pytesseract as the OCR fallback for scanned PDFs,
  streamlit-drawable-canvas as the region-picker component.
- **GUI sketch**: user draws a header strip + a row template on a
  rendered page; the tool applies that template across N pages,
  saves the template by layout fingerprint for next month's
  statement, emits CSV.
- **Effort phased A–E**: 3–4 weeks for a text-only MVP; 6–10
  weeks for a polished version with multi-page template recall;
  +2–3 weeks if scanned-PDF OCR is required.
- **Difficulty**: medium-hard. The pieces are well-trodden; the
  combination (region selection that persists across pages and
  across documents with similar layouts) is where the engineering
  goes.
- **Ship criteria**: ≥1 paying customer + ≥3 paid or ≥5 demo
  emails asking for PDF extraction + the bookkeeper niche
  converting at least one customer first. None have fired.

Cross-references added:

- ``docs/REQUIREMENTS.md`` §11: pointer to FUTURE-TOOLS.md for
  parked tool ideas, with a one-paragraph summary of #10.
- ``docs/PLAN.md`` §2.1: notes that the freeze parks future tools
  in FUTURE-TOOLS.md and explicitly names #10 as the current
  highest-pressure entry.
- ``docs/NEXT-STEPS.md`` Phase 5 "what NOT to build" table: a new
  row for the PDF tool tied to the same ship-trigger language.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 01:52:42 +00:00
parent c73d716d06
commit ee0b1f6f6b
4 changed files with 264 additions and 0 deletions

docs/FUTURE-TOOLS.md Normal file

@@ -0,0 +1,244 @@
# Future tools — design notes
> Creator-only. Specs for tools the strategic plan refuses to build right now
> but that surface repeatedly enough to be worth documenting once instead of
> re-thinking from scratch every time a customer asks.
> **Status of these tools**: post-launch, post-revenue. See `PLAN.md` §2.1 —
> new-tool development is frozen until DataTools has a paying customer and a
> repeated demand signal for the same idea. This file is the resting place
> for those ideas in the meantime; nothing here ships unless a future
> decision says it does.
Each entry follows the same shape: **What it does**, **Why someone would
want it**, **Can we ship it now?**, **Approach**, **GUI sketch**, **Effort**,
**Risks/unknowns**, **Ship criteria** (the signal that overrides the freeze).
---
## 10. PDF → CSV extractor (bank statements + similar)
### What it does
Takes a PDF (typically a bank statement, expense report, paystub, invoice,
or any document where humans-but-not-computers can read a table) and turns
the tabular content into a CSV that the rest of DataTools can consume.
The user shows the tool **where** the data lives by drawing rectangles on
a rendered preview of the first page; the tool then applies those region
templates to every page of the document (and remembers the template so the
same template can be re-applied to next month's statement without
re-clicking).
### Why someone would want it
Bookkeepers, accountants, and any small-business operator who:
- Gets bank/credit-card statements only as PDFs (most US banks; many
European ones).
- Wants to import transactions into QuickBooks / Xero / a spreadsheet
without paying $10–$30/month for a SaaS converter (Docparser,
Rossum, Hubdoc) or relying on a Python script they can't maintain.
- Has 12 months × N accounts of statements to back-fill into a
ledger.
This is the most-requested DataTools adjacency in the casual feedback we
have so far. It maps tightly onto the **bookkeeper niche** identified in
`PLAN.md` §2.3 — that persona is exactly who needs PDF extraction and is
exactly the kind of operator who'd pay for a one-time desktop tool over a
recurring SaaS subscription.
### Can we ship it now?
**No.** Current state, verified 2026-05-17:
- No PDF dependency in `requirements.txt` or `requirements-dev.txt`.
- No PDF-touching code anywhere under `src/`. The single
string-mention of "PDF" in the codebase is in the **output** copy for
the Quality Check tool ("generate PDF/Excel quality reports"),
unrelated to extraction.
- No region-selection / canvas component in the Streamlit GUI today.
Building this requires net-new infrastructure on three axes (libraries,
extraction core, region-picker UI). Estimates below.
### Approach (technical)
PDFs split cleanly into two populations and the strategy differs:
1. **Native / text-layer PDFs** — text is stored as text, just laid out
visually. Most modern bank statements are this. Solvable with
coordinate-aware text extraction:
- **`pdfplumber`** (MIT, on top of `pdfminer.six`) — gives `(x0, y0,
x1, y1, text)` per character/word/line for each page. Mature, well
tested, single dependency, no native compiler. **First-choice.**
- **`pypdf`** (BSD-3) — text-only, no positions. Too coarse for
statement parsing; useful only for "the whole document as one big
string."
- **`camelot-py`** (MIT) — purpose-built for table extraction.
Heavier (needs `ghostscript` and `tk`/`opencv` for some modes),
and assumes the table grid is already visible. Worth evaluating
as a fallback for documents with explicit ruled tables.
2. **Scanned / image-only PDFs** — pixels of a scanner; no text layer.
Less common from major banks today but still happens with old PDFs
and receipts. Needs OCR:
- **`pytesseract`** wrapping the **Tesseract** binary (Apache-2). The
OCR is good for English on clean scans, mediocre on receipts.
Detect with `pdfplumber`: a page that yields no extractable characters
but contains one or more image objects is image-only → OCR fallback.
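
The image-only check can be sketched as a small classifier. This is a minimal sketch, assuming the caller supplies the number of text characters and embedded images found on a page (with pdfplumber these would typically be `len(page.chars)` and `len(page.images)`); the function name `classify_page` and its labels are hypothetical.

```python
def classify_page(num_chars: int, num_images: int) -> str:
    """Classify a page as 'text', 'image-only', or 'empty'.

    Assumption: the counts come from the PDF library, for example
    len(page.chars) and len(page.images) in pdfplumber.
    """
    if num_chars > 0:
        return "text"        # a text layer exists: use coordinate-aware extraction
    if num_images > 0:
        return "image-only"  # pixels but no text layer: route to the OCR fallback
    return "empty"           # nothing to extract on this page
```

A page with a scanned image plus an invisible OCR text layer would be classified as `"text"` here, which is usually the desired outcome.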
The extraction core would be a state machine:
1. Render page to an image (`pdfplumber.Page.to_image()` returns a
`PageImage` wrapping a PIL image at a chosen resolution).
2. User draws a header region and per-row regions (or marks a single
table bounding box + column dividers) on the preview.
3. For each PDF page, crop the corresponding pixel region (or pdf
coordinate region), pull the text in that crop, and apply per-region
parsing (date, amount, description).
4. Emit one CSV row per detected statement row.
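
Steps 3 and 4 above can be sketched without any PDF library. This is a minimal sketch, assuming each word is a dict with `text`, `x0`, `x1`, `top`, and `bottom` keys in PDF points (the shape `pdfplumber`'s `extract_words()` produces); the helper names, the row tolerance, and the column dividers are hypothetical.

```python
import bisect
import csv
import io
from typing import Any, Dict, List, Sequence, Tuple

Word = Dict[str, Any]                     # assumed keys: text, x0, x1, top, bottom
BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom) in PDF points


def words_in_region(words: Sequence[Word], bbox: BBox) -> List[Word]:
    """Keep only words that lie fully inside the user-drawn region."""
    x0, top, x1, bottom = bbox
    return [w for w in words
            if w["x0"] >= x0 and w["x1"] <= x1
            and w["top"] >= top and w["bottom"] <= bottom]


def group_rows(words: Sequence[Word], tolerance: float = 3.0) -> List[List[Word]]:
    """Group words into rows when their top edges are within `tolerance` points."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if rows and abs(word["top"] - rows[-1][0]["top"]) <= tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return rows


def split_columns(row: Sequence[Word], dividers: Sequence[float]) -> List[str]:
    """Assign each word to a column by its left edge; `dividers` must be sorted."""
    cells: List[List[str]] = [[] for _ in range(len(dividers) + 1)]
    for word in sorted(row, key=lambda w: w["x0"]):
        cells[bisect.bisect_right(dividers, word["x0"])].append(str(word["text"]))
    return [" ".join(cell) for cell in cells]


def extract_rows(words: Sequence[Word], bbox: BBox,
                 dividers: Sequence[float]) -> List[List[str]]:
    """Crop to the region, group into rows, then split each row into columns."""
    return [split_columns(row, dividers)
            for row in group_rows(words_in_region(words, bbox))]


def rows_to_csv(rows: Sequence[Sequence[str]]) -> str:
    """Serialize the extracted rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Per-region parsing (date, amount, description) would run on the string cells this returns; the sketch stops at raw text on purpose.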
Bank-statement-specific niceties — implementable as templates on top of
the generic engine:
- Recurring-template store: save "Chase Visa October layout" once, and
next month's PDF lands on the same template automatically. JSON file
in `~/.datatools/templates/` keyed by a layout fingerprint (page
size + header text hash).
- Multi-page row stitching: a row that wraps across pages gets merged
back together based on date-column continuity.
- Currency / sign inference: a column that mixes `$1,234.56` and
`($45.00)` — already handled by the (now-existing) Standardize
Formats analyzer rules.
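
The recurring-template store could look roughly like the following. This is a minimal sketch, assuming the fingerprint is a truncated SHA-256 over the rounded page size plus whitespace-normalized, lowercased header text, and that each template is one JSON file named after its fingerprint; the function names and the template's contents are hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Dict, Optional


def layout_fingerprint(page_width: float, page_height: float, header_text: str) -> str:
    """Derive a stable key from page size and header text.

    Whitespace and letter case are normalized so trivial extraction
    differences do not change the key.
    """
    normalized = " ".join(header_text.split()).lower()
    payload = f"{round(page_width)}x{round(page_height)}|{normalized}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def save_template(directory: Path, fingerprint: str, template: Dict[str, Any]) -> Path:
    """Write the template as <fingerprint>.json, creating the directory if needed."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{fingerprint}.json"
    path.write_text(json.dumps(template, indent=2), encoding="utf-8")
    return path


def load_template(directory: Path, fingerprint: str) -> Optional[Dict[str, Any]]:
    """Return the stored template, or None when no template matches the fingerprint."""
    path = directory / f"{fingerprint}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

In use, the directory would be something like `Path.home() / ".datatools" / "templates"`. Returning `None` on a miss leaves the caller free to prompt the user to draw a new template rather than guessing.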
### GUI sketch
The hardest part of the whole project. Streamlit doesn't ship a native
"draw rectangles on an image" widget. Options:
- **`streamlit-drawable-canvas`** — community component (MIT-licensed).
Lets the user draw freehand rectangles on top of a background image.
Returns the rectangle coordinates as JSON. Active maintenance.
**First-choice for the region picker.**
- **`streamlit-cropper`** — single-rectangle crop tool. Good if we only
needed the table bbox; too limited for "header region + column
dividers + repeating-row template."
- **Custom React component** — fully tailored UX but adds a build
toolchain DataTools doesn't have today. Last resort.
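
Whichever picker is chosen, the rectangles come back in pixels of the rendered preview, while text extraction works in PDF points (72 per inch). A minimal sketch of the conversion, assuming each rectangle object carries Fabric.js-style `left`, `top`, `width`, `height`, and optional `scaleX` / `scaleY` keys, and that the preview is displayed at its rendered pixel size; the helper name `canvas_rect_to_pdf_bbox` is hypothetical.

```python
from typing import Any, Dict, Tuple


def canvas_rect_to_pdf_bbox(rect: Dict[str, Any],
                            resolution: float = 150.0) -> Tuple[float, float, float, float]:
    """Convert a drawn rectangle (pixels) to (x0, top, x1, bottom) in PDF points.

    Assumptions: the page was rendered at `resolution` pixels per inch, the
    canvas origin is the page's top-left corner, and stroke width is ignored.
    """
    left = float(rect["left"])
    top = float(rect["top"])
    # Resizing a shape on the canvas changes its scale factors, not width/height.
    width = float(rect["width"]) * float(rect.get("scaleX", 1.0))
    height = float(rect["height"]) * float(rect.get("scaleY", 1.0))

    def to_points(pixels: float) -> float:
        return pixels * 72.0 / resolution

    return (to_points(left), to_points(top),
            to_points(left + width), to_points(top + height))
```

If the preview is shown scaled down to fit the page width, the pixel values would need dividing by that display scale first.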
Sketch of the proposed page (under the "Transformations" section of
the sidebar):
```
🧾 PDF → CSV (Beta)
─────────────────────────────────────────────────────────────────────
Upload a PDF [ Browse… ]
(statement / invoice / form — text-based PDFs work best)
[ ▸ Preview: October-statement.pdf · 3 pages ]
┌────────────────────────────────────────────────┐
│ CHASE BANK │
│ Statement period Oct 1–31, 2025 │
│ ┌─[1: header strip — drawn in red]──────────┐ │
│ │ Date Description Amount │ │
│ └────────────────────────────────────────────┘ │
│ ┌─[2: row template — drawn in green]────────┐ │
│ │ 10/03 AMAZON.COM #42… -45.67 │ │
│ └────────────────────────────────────────────┘ │
│ ⋮ (more transactions) │
└────────────────────────────────────────────────┘
Columns: [Date] [Description] [Amount] [+ Add column]
Apply template to: ( ) Only this page
(•) All pages with this layout
( ) All pages (force)
[ Save template as… Chase Visa Oct 2025 ]
[ Run extraction → CSV ]
```
After "Run extraction": the standard tool-page result layout (preview
table, "Saved to ~/Downloads/<name>_extracted.csv", "Open Downloads
folder" — matching the other Ready tools).
The **template save/recall** is what makes this a one-time setup
instead of a per-document chore — bookkeepers don't want to re-draw
rectangles every month.
### Effort estimate
| Phase | Scope | Estimate | Risk |
|---|---|---|---|
| **A. Backend, native PDFs only** | pdfplumber-based extraction, hard-coded region passed via a JSON config (no GUI) | **1–2 weeks** | Low — straightforward use of pdfplumber. |
| **B. Region-picker GUI** | streamlit-drawable-canvas, multi-region drawing, per-region role assignment (date / amount / description) | **2–3 weeks** | Medium — the canvas component has quirks; persisting region state across reruns is non-trivial. |
| **C. Multi-page application + template persistence** | Apply one page's template to N pages, save/load templates, layout fingerprint | **1–2 weeks** | Medium — "is the next page the same layout?" is a real perception problem; we'll need a heuristic. |
| **D. Scanned-PDF OCR fallback** | Detect image-only pages, run Tesseract, merge OCR text into the extraction path | **2–3 weeks** | High — OCR accuracy is variable; we'd want a quality threshold + a "fail this page noisily" path. Bundling Tesseract with the PyInstaller build is its own packaging headache. |
| **E. Bank-statement specifics** | Cross-page row stitching, currency-sign inference, multi-account splits | **1–2 weeks** | Medium — every bank's idea of a "statement" differs. Templates absorb most of the variance. |
**Realistic total for a polished v1**: 6–10 calendar weeks of focused work
(text-PDFs + GUI + templates + statement-specific niceties). Add another
2–3 weeks if scanned PDFs are required at launch.
**Minimum viable extract** (just text PDFs, single-region drawing, no
template recall, no OCR): **3–4 weeks**. Worth scoping a beta at that
level before committing to the full surface.
### Difficulty rating
**Medium-hard.** Not because any single piece is novel — pdfplumber +
streamlit-drawable-canvas are well-trodden libraries — but because the
*combination* (point-and-click region selection that persists across
multiple PDF pages and across documents with similar layouts) is where
most of the engineering goes. The "every bank does it slightly
differently" reality makes templates a hard requirement rather than a
nice-to-have, and templates raise the design effort.
### Risks / unknowns
- **Scanned-PDF coverage**: if a meaningful slice of the addressable
market sends image-only PDFs (older statements, scanned receipts),
shipping text-only extraction limits the audience. Decide via the
first 10–20 user requests.
- **PyInstaller packaging of Tesseract**: bundling the OCR binary into
the desktop build is non-trivial. May force a "Tesseract not found —
install it separately" path on first launch, which hurts the "one-
click install" story.
- **Bank layout drift**: a template captured today can stop working
next month if the bank redesigns its statement. Layout-fingerprint
detection has to fail loudly rather than silently produce garbage.
- **PII surface**: bank statements are some of the most sensitive
documents the user might touch. The "runs locally — your data never
leaves this computer" guarantee is even more load-bearing here than
for CSVs. No telemetry, no cloud OCR services, hard line.
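
For the layout-drift risk above, "fail loudly" could mean validating extracted rows before any CSV is written. This is a minimal sketch, assuming each row is a `(date, description, amount)` triple of strings; the regular expressions, the 90% threshold, and the names `LayoutDriftError` and `check_rows` are hypothetical.

```python
import re
from typing import Sequence

DATE_RE = re.compile(r"^\d{1,2}/\d{1,2}(/\d{2,4})?$")             # 10/03 or 10/03/2025
AMOUNT_RE = re.compile(r"^\(?-?\$?\d[\d,]*(\.\d{2})?\)?$")         # -45.67, $1,234.56, ($45.00)


class LayoutDriftError(ValueError):
    """Raised when extracted rows no longer look like statement rows."""


def check_rows(rows: Sequence[Sequence[str]], min_valid_ratio: float = 0.9) -> float:
    """Return the share of plausible rows, or raise if it falls below the threshold."""
    if not rows:
        raise LayoutDriftError("no rows extracted; the template may not match this layout")
    valid = sum(
        1 for date, _description, amount in rows
        if DATE_RE.match(date.strip()) and AMOUNT_RE.match(amount.strip())
    )
    ratio = valid / len(rows)
    if ratio < min_valid_ratio:
        raise LayoutDriftError(
            f"only {valid} of {len(rows)} rows parsed as date + amount; "
            "the statement layout may have changed"
        )
    return ratio
```

Raising an exception rather than returning a partial result keeps a drifted template from silently producing a plausible-looking but wrong CSV.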
### Ship criteria
Before this tool re-enters active development, all of these need to be
true:
- DataTools has shipped to **≥1 paying customer** (the `PLAN.md` §2.1
freeze condition).
- **At least 3 paying customers OR 5 demo-traffic emails** have
explicitly asked for PDF extraction. Below that signal, build
something else.
- The bookkeeper niche (per `PLAN.md` §2.3) has at least one converted
customer — that's the persona who actually needs this tool, and
confirming they pay before building a tool aimed squarely at them
is the discipline the freeze exists to enforce.
If those three trip, the **Phase A minimum-viable beta (3–4 weeks)**
goes first — text PDFs + single-region drawing — so we can see real
user behaviour before committing to the full template surface.
---
## (placeholder for additional future-tool entries)
Add new entries above this line. Keep the same shape:
What / Why / Can we ship now / Approach / GUI / Effort / Risks /
Ship criteria. The shape is what makes "is this idea ready" a
factual question instead of an opinion.


@@ -269,6 +269,7 @@ moves until $5k/mo MRR:
| | Why locked |
|---|---|
| ❌ More tools (06–08) | `PLAN.md` §2.1 distribution-gate. Tool 09 was the exception; no others until first paid customer + one external review. |
| ❌ Tool #10 PDF → CSV (the most-asked-for adjacency) | Parked in `docs/FUTURE-TOOLS.md` with full design + 3–4 wk MVP / 6–10 wk polished estimate. Ship trigger: paying customer + ≥3 paid or ≥5 demo emails asking for PDF + the bookkeeper niche converting first. None have fired yet. |
| ❌ SaaS pivot | `DECISIONS.md` §4 — recurring infra conflicts with the lifestyle constraint. |
| ❌ Live chat / sales calls | `DECISIONS.md` §1 #8 — no-touch is locked until $5k/mo. |
| ❌ Custom integrations / one-off consulting | Breaks "build once, sell many." |


@@ -58,6 +58,14 @@ buy" into "an automatable workflow you depend on." That conversion is
what produces retention and word-of-mouth — the only marketing channel
that scales under the no-network/no-touch constraint.
**Parked behind the freeze**: post-launch tool ideas are captured in
`docs/FUTURE-TOOLS.md` with feasibility, GUI sketch, effort estimate,
and ship criteria for each. Currently parked: **#10 PDF → CSV
extractor** (bank statements et al.) — gated on a paying customer +
≥3 paying customers or ≥5 demo emails explicitly asking for PDF
extraction, with the bookkeeper niche converting at least one customer
first. None of those triggers have fired yet.
### 2.2 The demo *is* the product. Make it embarrassingly good.
- Three persona-tagged sample datasets, not one generic CSV: Shopify


@@ -127,6 +127,17 @@ Sample size: 1,000 rows (configurable).
8. Quality Check — Coming Soon
9. Automated Workflows — Ready
**Future / not in v1.** Tool ideas captured for after-launch consideration
live in `docs/FUTURE-TOOLS.md` — entries there are gated by the new-tool
freeze in `PLAN.md` §2.1 and don't ship without a paying-customer +
repeated-demand signal. Currently parked there:
- **#10. PDF → CSV extractor** (bank statements + similar). No PDF
dependency exists in the repo today; this tool would need pdfplumber,
streamlit-drawable-canvas, and a templates store. Estimated 3–4 weeks
for a text-only MVP, 6–10 weeks for the polished version with
multi-page template recall.
### 11.a Recommended pipeline order (soft, not enforced)
Automated Workflows ships with a `SOFT_DEPENDENCIES` table; the