docs: design notes for future PDF→CSV tool

New ``docs/FUTURE-TOOLS.md`` captures post-launch tool ideas with a consistent shape — What / Why / Can we ship now / Approach / GUI sketch / Effort / Risks / Ship criteria. Resting place for things the new-tool freeze in ``PLAN.md`` §2.1 refuses to build but that keep coming up. First entry: **#10 PDF → CSV extractor** (bank statements et al.). Key facts captured: - **Current state**: no PDF infrastructure exists. Zero PDF dependencies in requirements.txt; zero PDF-touching code under ``src/``. The only "PDF" string in the codebase is the planned- output copy for the Quality Check tool, unrelated to extraction. - **Library picks**: pdfplumber as the extraction core (BSD-3, no native compiler, gives coordinate-aware text), Tesseract via pytesseract as the OCR fallback for scanned PDFs, streamlit-drawable-canvas as the region-picker component. - **GUI sketch**: user draws a header strip + a row template on a rendered page; the tool applies that template across N pages, saves the template by layout fingerprint for next month's statement, emits CSV. - **Effort phased A–E**: 3–4 weeks for a text-only MVP; 6–10 weeks for a polished version with multi-page template recall; +2–3 weeks if scanned-PDF OCR is required. - **Difficulty**: medium-hard. The pieces are well-trodden; the combination (region selection that persists across pages and across documents with similar layouts) is where the engineering goes. - **Ship criteria**: ≥1 paying customer + ≥3 paid or ≥5 demo emails asking for PDF extraction + the bookkeeper niche converting at least one customer first. None have fired. Cross-references added: - ``docs/REQUIREMENTS.md`` §11: pointer to FUTURE-TOOLS.md for parked tool ideas, with a one-paragraph summary of #10. - ``docs/PLAN.md`` §2.1: notes that the freeze parks future tools in FUTURE-TOOLS.md and explicitly names #10 as the current highest-pressure entry. - ``docs/NEXT-STEPS.md`` Phase 5 "what NOT to build" table: a new row for the PDF tool tied to the same ship-trigger language. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 01:52:42 +00:00
parent c73d716d06
commit ee0b1f6f6b
4 changed files with 264 additions and 0 deletions
--- a/docs/FUTURE-TOOLS.md
+++ b/docs/FUTURE-TOOLS.md
@@ -0,0 +1,244 @@
+# Future tools — design notes
+
+> Creator-only. Specs for tools the strategic plan refuses to build right now
+> but that surface repeatedly enough to be worth documenting once instead of
+> re-thinking from scratch every time a customer asks.
+> **Status of these tools**: post-launch, post-revenue. See `PLAN.md` §2.1 —
+> new-tool development is frozen until DataTools has a paying customer and a
+> repeated demand signal for the same idea. This file is the resting place
+> for those ideas in the meantime; nothing here ships unless a future
+> decision says it does.
+
+Each entry follows the same shape: **What it does**, **Why someone would
+want it**, **Can we ship it now?**, **Approach**, **GUI sketch**, **Effort**,
+**Risks/unknowns**, **Ship criteria** (the signal that overrides the freeze).
+
+---
+
+## 10. PDF → CSV extractor (bank statements + similar)
+
+### What it does
+
+Takes a PDF (typically a bank statement, expense report, paystub, invoice,
+or any document where humans-but-not-computers can read a table) and turns
+the tabular content into a CSV that the rest of DataTools can consume.
+
+The user shows the tool **where** the data lives by drawing rectangles on
+a rendered preview of the first page; the tool then applies those region
+templates to every page of the document (and remembers the template so the
+same template can be re-applied to next month's statement without
+re-clicking).
+
+### Why someone would want it
+
+Bookkeepers, accountants, and any small-business operator who:
+
+- Gets bank/credit-card statements only as PDFs (most US banks; many
+  European ones).
+- Wants to import transactions into QuickBooks / Xero / a spreadsheet
+  without paying $10–$30/month for a SaaS converter (Docparser,
+  Rossum, Hubdoc) or relying on a Python script they can't maintain.
+- Has 12 months × N accounts of statements to back-fill into a
+  ledger.
+
+This is the most-requested DataTools adjacency in the casual feedback we
+have so far. It maps tightly onto the **bookkeeper niche** identified in
+`PLAN.md` §2.3 — that persona is exactly who needs PDF extraction and is
+exactly the kind of operator who'd pay for a one-time desktop tool over a
+recurring SaaS subscription.
+
+### Can we ship it now?
+
+**No.** Current state, verified 2026-05-17:
+
+- No PDF dependency in `requirements.txt` or `requirements-dev.txt`.
+- No PDF-touching code anywhere under `src/`. The single
+  string-mention of "PDF" in the codebase is in the **output** copy for
+  the Quality Check tool ("generate PDF/Excel quality reports"),
+  unrelated to extraction.
+- No region-selection / canvas component in the Streamlit GUI today.
+
+Building this requires net-new infrastructure on three axes (libraries,
+extraction core, region-picker UI). Estimates below.
+
+### Approach (technical)
+
+PDFs split cleanly into two populations and the strategy differs:
+
+1. **Native / text-layer PDFs** — text is stored as text, just laid out
+   visually. Most modern bank statements are this. Solvable with
+   coordinate-aware text extraction:
+
+   - **`pdfplumber`** (BSD-3, on top of `pdfminer.six`) — gives `(x0, y0,
+     x1, y1, text)` per character/word/line for each page. Mature, well
+     tested, single dependency, no native compiler. **First-choice.**
+   - **`pypdf`** (BSD-3) — text-only, no positions. Too coarse for
+     statement parsing; useful only for "the whole document as one big
+     string."
+   - **`camelot-py`** (MIT) — purpose-built for table extraction.
+     Heavier (needs `ghostscript` and `tk`/`opencv` for some modes),
+     and assumes the table grid is already visible. Worth evaluating
+     as a fallback for documents with explicit ruled tables.
+
+2. **Scanned / image-only PDFs** — pixels of a scanner; no text layer.
+   Less common from major banks today but still happens with old PDFs
+   and receipts. Needs OCR:
+
+   - **`pytesseract`** wrapping the **Tesseract** binary (Apache-2). The
+     OCR is good for English on clean scans, mediocre on receipts.
+     Detect with `pdfplumber`: a page where every character is in a
+     glyph "image" object means the page is image-only → OCR fallback.
+
+The extraction core would be a state machine:
+
+1. Render page to an image (`pdfplumber.Page.to_image()` returns a PIL
+   image at a chosen DPI).
+2. User draws a header region and per-row regions (or marks a single
+   table bounding box + column dividers) on the preview.
+3. For each PDF page, crop the corresponding pixel region (or pdf
+   coordinate region), pull the text in that crop, and apply per-region
+   parsing (date, amount, description).
+4. Emit one CSV row per detected statement row.
+
+Bank-statement-specific niceties — implementable as templates on top of
+the generic engine:
+
+- Recurring-template store: save "Chase visa October layout" once, the
+  next month's PDF lands on the same template automatically. JSON file
+  in `~/.datatools/templates/` keyed by a layout fingerprint (page
+  size + header text hash).
+- Multi-page row stitching: a row that wraps across pages gets merged
+  back together based on date-column continuity.
+- Currency / sign inference: a column that mixes `$1,234.56` and
+  `($45.00)` — already handled by the (now-existing) Standardize
+  Formats analyzer rules.
+
+### GUI sketch
+
+The hardest part of the whole project. Streamlit doesn't ship a native
+"draw rectangles on an image" widget. Options:
+
+- **`streamlit-drawable-canvas`** — community component (MIT-licensed).
+  Lets the user draw freehand rectangles on top of a background image.
+  Returns the rectangle coordinates as JSON. Active maintenance.
+  **First-choice for the region picker.**
+- **`streamlit-cropper`** — single-rectangle crop tool. Good if we only
+  needed the table bbox; too limited for "header region + column
+  dividers + repeating-row template."
+- **Custom React component** — fully tailored UX but adds a build
+  toolchain DataTools doesn't have today. Last resort.
+
+Sketch of the proposed page (under "Transformations" in the sidebar
+section):
+
+```
+🧾  PDF → CSV (Beta)
+─────────────────────────────────────────────────────────────────────
+Upload a PDF                                  [ Browse… ]
+  (statement / invoice / form — text-based PDFs work best)
+
+[ ▸ Preview: October-statement.pdf  ·  3 pages ]
+  ┌────────────────────────────────────────────────┐
+  │  CHASE BANK                                    │
+  │  Statement period Oct 1–31, 2025               │
+  │  ┌─[1: header strip — drawn in red]──────────┐ │
+  │  │  Date    Description          Amount      │ │
+  │  └────────────────────────────────────────────┘ │
+  │  ┌─[2: row template — drawn in green]────────┐ │
+  │  │  10/03   AMAZON.COM #42…       -45.67     │ │
+  │  └────────────────────────────────────────────┘ │
+  │      ⋮ (more transactions)                     │
+  └────────────────────────────────────────────────┘
+
+Columns:  [Date]  [Description]  [Amount]      [+ Add column]
+
+Apply template to:   ( ) Only this page
+                     (•) All pages with this layout
+                     ( ) All pages (force)
+
+[ Save template as…  Chase Visa Oct 2025 ]
+
+[ Run extraction → CSV ]
+```
+
+After "Run extraction": the standard tool-page result layout (preview
+table, "Saved to ~/Downloads/<name>_extracted.csv", "Open Downloads
+folder" — matching the other Ready tools).
+
+The **template save/recall** is what makes this a one-time setup
+instead of a per-document chore — bookkeepers don't want to re-draw
+rectangles every month.
+
+### Effort estimate
+
+| Phase | Scope | Estimate | Risk |
+|---|---|---|---|
+| **A. Backend, native PDFs only** | pdfplumber-based extraction, hard-coded region passed via a JSON config (no GUI) | **1–2 weeks** | Low — straightforward use of pdfplumber. |
+| **B. Region-picker GUI** | streamlit-drawable-canvas, multi-region drawing, per-region role assignment (date / amount / description) | **2–3 weeks** | Medium — the canvas component has quirks; persisting region state across reruns is non-trivial. |
+| **C. Multi-page application + template persistence** | Apply one page's template to N pages, save/load templates, layout fingerprint | **1–2 weeks** | Medium — "is the next page the same layout?" is a real perception problem; we'll need a heuristic. |
+| **D. Scanned-PDF OCR fallback** | Detect image-only pages, run Tesseract, merge OCR text into the extraction path | **2–3 weeks** | High — OCR accuracy is variable; we'd want a quality threshold + a "fail this page noisily" path. Bundling Tesseract with the PyInstaller build is its own packaging headache. |
+| **E. Bank-statement specifics** | Cross-page row stitching, currency-sign inference, multi-account splits | **1–2 weeks** | Medium — every bank's idea of a "statement" differs. Templates absorb most of the variance. |
+
+**Realistic total for a polished v1**: 6–10 calendar weeks of focused work
+(text-PDFs + GUI + templates + statement-specific niceties). Add another
+2–3 weeks if scanned PDFs are required at launch.
+
+**Minimum viable extract** (just text PDFs, single-region drawing, no
+template recall, no OCR): **3–4 weeks**. Worth scoping a beta at that
+level before committing to the full surface.
+
+### Difficulty rating
+
+**Medium-hard.** Not because any single piece is novel — pdfplumber +
+streamlit-drawable-canvas are well-trodden libraries — but because the
+*combination* (point-and-click region selection that persists across
+multiple PDF pages and across documents with similar layouts) is where
+most of the engineering goes. The "every bank does it slightly
+differently" reality makes templates a hard requirement rather than a
+nice-to-have, and templates raise the design effort.
+
+### Risks / unknowns
+
+- **Scanned-PDF coverage**: if a meaningful slice of the addressable
+  market sends image-only PDFs (older statements, scanned receipts),
+  shipping text-only extraction limits the audience. Decide via the
+  first 10–20 user requests.
+- **PyInstaller packaging of Tesseract**: bundling the OCR binary into
+  the desktop build is non-trivial. May force a "Tesseract not found —
+  install it separately" path on first launch, which hurts the "one-
+  click install" story.
+- **Bank layout drift**: a template captured today can stop working
+  next month if the bank redesigns its statement. Layout-fingerprint
+  detection has to fail loudly rather than silently produce garbage.
+- **PII surface**: bank statements are some of the most sensitive
+  documents the user might touch. The "runs locally — your data never
+  leaves this computer" guarantee is even more load-bearing here than
+  for CSVs. No telemetry, no cloud OCR services, hard line.
+
+### Ship criteria
+
+Before this tool re-enters active development, all of these need to be
+true:
+
+- DataTools has shipped to **≥1 paying customer** (the `PLAN.md` §2.1
+  freeze condition).
+- **At least 3 paying customers OR 5 demo-traffic emails** have
+  explicitly asked for PDF extraction. Below that signal, build
+  something else.
+- The bookkeeper niche (per `PLAN.md` §2.3) has at least one converted
+  customer — that's the persona who actually needs this tool, and
+  confirming they pay before building a tool aimed squarely at them
+  is the discipline the freeze exists to enforce.
+
+If those three trip, the **Phase A minimum-viable beta (3–4 weeks)**
+goes first — text PDFs + single-region drawing — so we can see real
+user behaviour before committing to the full template surface.
+
+---
+
+## (placeholder for additional future-tool entries)
+
+Add new entries above this line. Keep the same shape:
+What / Why / Can we ship now / Approach / GUI / Effort / Risks /
+Ship criteria. The shape is what makes "is this idea ready" a
+factual question instead of an opinion.
--- a/docs/NEXT-STEPS.md
+++ b/docs/NEXT-STEPS.md
@@ -269,6 +269,7 @@ moves until $5k/mo MRR:
 | | Why locked |
 |---|---|
 | ❌ More tools (06–08) | `PLAN.md` §2.1 distribution-gate. Tool 09 was the exception; no others until first paid customer + one external review. |
+| ❌ Tool #10 PDF → CSV (the most-asked-for adjacency) | Parked in `docs/FUTURE-TOOLS.md` with full design + 3–4 wk MVP / 6–10 wk polished estimate. Ship trigger: paying customer + ≥3 paid or ≥5 demo emails asking for PDF + the bookkeeper niche converting first. None have fired yet. |
 | ❌ SaaS pivot | `DECISIONS.md` §4 — recurring infra conflicts with the lifestyle constraint. |
 | ❌ Live chat / sales calls | `DECISIONS.md` §1 #8 — no-touch is locked until $5k/mo. |
 | ❌ Custom integrations / one-off consulting | Breaks "build once, sell many." |
--- a/docs/PLAN.md
+++ b/docs/PLAN.md
@@ -58,6 +58,14 @@ buy" into "an automatable workflow you depend on." That conversion is
 what produces retention and word-of-mouth — the only marketing channel
 that scales under the no-network/no-touch constraint.

+**Parked behind the freeze**: post-launch tool ideas are captured in
+`docs/FUTURE-TOOLS.md` with feasibility, GUI sketch, effort estimate,
+and ship criteria for each. Currently parked: **#10 PDF → CSV
+extractor** (bank statements et al.) — gated on a paying customer +
+≥3 paying customers or ≥5 demo emails explicitly asking for PDF
+extraction, with the bookkeeper niche converting at least one customer
+first. None of those triggers have fired yet.
+
 ### 2.2 The demo *is* the product. Make it embarrassingly good.

 - Three persona-tagged sample datasets, not one generic CSV: Shopify
--- a/docs/REQUIREMENTS.md
+++ b/docs/REQUIREMENTS.md
@@ -127,6 +127,17 @@ Sample size: 1,000 rows (configurable).
 8. Quality Check — Coming Soon
 9. Automated Workflows — Ready

+**Future / not in v1.** Tool ideas captured for after-launch consideration
+live in `docs/FUTURE-TOOLS.md` — entries there are gated by the new-tool
+freeze in `PLAN.md` §2.1 and don't ship without a paying-customer +
+repeated-demand signal. Currently parked there:
+
+- **#10. PDF → CSV extractor** (bank statements + similar). No PDF
+  dependency exists in the repo today; this tool would need pdfplumber,
+  streamlit-drawable-canvas, and a templates store. Estimated 3–4 weeks
+  for a text-only MVP, 6–10 weeks for the polished version with
+  multi-page template recall.
+
 ### 11.a Recommended pipeline order (soft, not enforced)

 Automated Workflows ships with a `SOFT_DEPENDENCIES` table; the