docs: design notes for future PDF→CSV tool

New ``docs/FUTURE-TOOLS.md`` captures post-launch tool ideas with a
consistent shape: What / Why / Can we ship now / Approach / GUI sketch /
Effort / Risks / Ship criteria. It is the resting place for things the
new-tool freeze in ``PLAN.md`` §2.1 refuses to build but that keep
coming up.

First entry: **#10 PDF → CSV extractor** (bank statements et al.).
Key facts captured:

- **Current state**: no PDF infrastructure exists. Zero PDF
  dependencies in requirements.txt; zero PDF-touching code under
  ``src/``. The only "PDF" string in the codebase is the planned-output
  copy for the Quality Check tool, unrelated to extraction.
- **Library picks**: pdfplumber as the extraction core (MIT, no native
  compiler, gives coordinate-aware text), Tesseract via pytesseract as
  the OCR fallback for scanned PDFs, streamlit-drawable-canvas as the
  region-picker component.
- **GUI sketch**: user draws a header strip + a row template on a
  rendered page; the tool applies that template across N pages, saves
  the template by layout fingerprint for next month's statement, emits
  CSV.
- **Effort phased A–E**: 3–4 weeks for a text-only MVP; 6–10 weeks for
  a polished version with multi-page template recall; +2–3 weeks if
  scanned-PDF OCR is required.
- **Difficulty**: medium-hard. The pieces are well-trodden; the
  combination (region selection that persists across pages and across
  documents with similar layouts) is where the engineering goes.
- **Ship criteria**: ≥1 paying customer + ≥3 paid or ≥5 demo emails
  asking for PDF extraction + the bookkeeper niche converting at least
  one customer first. None have fired.

Cross-references added:

- ``docs/REQUIREMENTS.md`` §11: pointer to FUTURE-TOOLS.md for parked
  tool ideas, with a one-paragraph summary of #10.
- ``docs/PLAN.md`` §2.1: notes that the freeze parks future tools in
  FUTURE-TOOLS.md and explicitly names #10 as the current
  highest-pressure entry.
- ``docs/NEXT-STEPS.md`` Phase 5 "what NOT to build" table: a new row
  for the PDF tool tied to the same ship-trigger language.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 01:52:42 +00:00
parent c73d716d06
commit ee0b1f6f6b
4 changed files with 264 additions and 0 deletions
--- a/docs/FUTURE-TOOLS.md
+++ b/docs/FUTURE-TOOLS.md
@@ -0,0 +1,244 @@
 # Future tools — design notes
 > Creator-only. Specs for tools the strategic plan refuses to build right now
 > but that surface repeatedly enough to be worth documenting once instead of
 > re-thinking from scratch every time a customer asks.
 > **Status of these tools**: post-launch, post-revenue. See `PLAN.md` §2.1 —
 > new-tool development is frozen until DataTools has a paying customer and a
 > repeated demand signal for the same idea. This file is the resting place
 > for those ideas in the meantime; nothing here ships unless a future
 > decision says it does.
 Each entry follows the same shape: **What it does**, **Why someone would
 want it**, **Can we ship it now?**, **Approach**, **GUI sketch**, **Effort**,
 **Risks/unknowns**, **Ship criteria** (the signal that overrides the freeze).
 ---
 ## 10. PDF → CSV extractor (bank statements + similar)
 ### What it does
 Takes a PDF (typically a bank statement, expense report, paystub, invoice,
 or any document where humans-but-not-computers can read a table) and turns
 the tabular content into a CSV that the rest of DataTools can consume.
 The user shows the tool **where** the data lives by drawing rectangles on
 a rendered preview of the first page; the tool then applies those region
 templates to every page of the document and saves the template so it can
 be re-applied to next month's statement without re-clicking.
 ### Why someone would want it
 Bookkeepers, accountants, and any small-business operator who:
 - Gets bank/credit-card statements only as PDFs (most US banks; many
  European ones).
 - Wants to import transactions into QuickBooks / Xero / a spreadsheet
  without paying $10–$30/month for a SaaS converter (Docparser,
  Rossum, Hubdoc) or relying on a Python script they can't maintain.
 - Has 12 months × N accounts of statements to back-fill into a
  ledger.
 This is the most-requested DataTools adjacency in the casual feedback we
 have so far. It maps tightly onto the **bookkeeper niche** identified in
 `PLAN.md` §2.3 — that persona is exactly who needs PDF extraction and is
 exactly the kind of operator who'd pay for a one-time desktop tool over a
 recurring SaaS subscription.
 ### Can we ship it now?
 **No.** Current state, verified 2026-05-17:
 - No PDF dependency in `requirements.txt` or `requirements-dev.txt`.
 - No PDF-touching code anywhere under `src/`. The single
  string-mention of "PDF" in the codebase is in the **output** copy for
  the Quality Check tool ("generate PDF/Excel quality reports"),
  unrelated to extraction.
 - No region-selection / canvas component in the Streamlit GUI today.
 Building this requires net-new infrastructure on three axes (libraries,
 extraction core, region-picker UI). Estimates below.
 ### Approach (technical)
 PDFs split cleanly into two populations and the strategy differs:
 1. **Native / text-layer PDFs** — text is stored as text, just laid out
   visually. Most modern bank statements are this. Solvable with
   coordinate-aware text extraction:
   - **`pdfplumber`** (MIT, on top of `pdfminer.six`): gives `(x0, y0,
     x1, y1, text)` per character/word/line for each page. Mature, well
     tested, installs without a native compiler. **First-choice.**
   - **`pypdf`** (BSD-3) — text-only, no positions. Too coarse for
     statement parsing; useful only for "the whole document as one big
     string."
   - **`camelot-py`** (MIT) — purpose-built for table extraction.
     Heavier (needs `ghostscript` and `tk`/`opencv` for some modes),
     and assumes the table grid is already visible. Worth evaluating
     as a fallback for documents with explicit ruled tables.
 2. **Scanned / image-only PDFs**: pixels from a scanner, with no text layer.
   Less common from major banks today but still happens with old PDFs
   and receipts. Needs OCR:
   - **`pytesseract`** wrapping the **Tesseract** binary (Apache-2). The
     OCR is good for English on clean scans, mediocre on receipts.
     Detect with `pdfplumber`: a page with no extractable characters
     (`page.chars` is empty) but at least one embedded image
     (`page.images` is non-empty) is image-only and goes to the OCR
     fallback.
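A minimal sketch of that detection rule, written against plain counts so it runs without a PDF on hand. `needs_ocr` and its `min_chars` threshold are hypothetical names for illustration, not existing DataTools code; the counts are assumed to come from `len(page.chars)` and `len(page.images)` on a pdfplumber page.

```python
def needs_ocr(num_chars: int, num_images: int, min_chars: int = 1) -> bool:
    """Return True when a page looks image-only and should go to OCR.

    Assumption: the caller passes len(page.chars) and len(page.images)
    from a pdfplumber page. `min_chars` is an illustrative threshold.
    """
    has_text_layer = num_chars >= min_chars
    has_images = num_images > 0
    # A blank page (no text, no images) is not worth sending to OCR.
    return has_images and not has_text_layer


scanned_page = needs_ocr(num_chars=0, num_images=1)
native_page = needs_ocr(num_chars=1800, num_images=2)
blank_page = needs_ocr(num_chars=0, num_images=0)
```

Raising `min_chars` would also catch scans that carry a few stray text objects (a stamped page number, say); whether that is needed is something only real statements can answer.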
 The extraction core would be a state machine:
 1. Render page to an image (pdfplumber's `Page.to_image(resolution=...)`
   returns a `PageImage`; its `.original` attribute is the PIL image at
   the chosen resolution).
 2. User draws a header region and per-row regions (or marks a single
   table bounding box + column dividers) on the preview.
 3. For each PDF page, crop the corresponding pixel region (or PDF
   coordinate region), pull the text in that crop, and apply per-region
   parsing (date, amount, description).
 4. Emit one CSV row per detected statement row.
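Steps 3 and 4 could look roughly like the sketch below. It assumes words arrive as dicts with `text`, `x0`, `x1`, `top`, and `bottom` keys (the shape pdfplumber's `extract_words()` produces) and that the user's drawing has already been reduced to a table bounding box plus column-divider x-positions. All function names and the sample coordinates are hypothetical.

```python
import csv
import io


def words_in_region(words, bbox):
    """Keep words whose centre falls inside bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    kept = []
    for w in words:
        cx = (w["x0"] + w["x1"]) / 2
        cy = (w["top"] + w["bottom"]) / 2
        if x0 <= cx <= x1 and top <= cy <= bottom:
            kept.append(w)
    return kept


def group_rows(words, tolerance=3.0):
    """Cluster words into text lines by comparing `top` to the line's first word."""
    rows = []
    for w in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if rows and abs(w["top"] - rows[-1][0]["top"]) <= tolerance:
            rows[-1].append(w)
        else:
            rows.append([w])
    return rows


def split_columns(row, dividers):
    """Assign each word to a column: count the dividers to its left."""
    cells = [[] for _ in range(len(dividers) + 1)]
    for w in sorted(row, key=lambda w: w["x0"]):
        col = sum(1 for d in dividers if w["x0"] >= d)
        cells[col].append(w["text"])
    return [" ".join(c) for c in cells]


def region_to_csv(words, bbox, dividers, header):
    """Emit one CSV row per detected text line inside the region."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    for row in group_rows(words_in_region(words, bbox)):
        writer.writerow(split_columns(row, dividers))
    return buf.getvalue()


def word(text, x0, top, width=30, height=10):
    """Build a fake word dict for the demo (hypothetical coordinates)."""
    return {"text": text, "x0": x0, "x1": x0 + width,
            "top": top, "bottom": top + height}


sample_words = [
    word("CHASE", 50, 20),  # page banner, outside the table region
    word("10/03", 50, 100), word("AMAZON.COM", 120, 100, width=80),
    word("#42", 205, 101), word("-45.67", 400, 100),
    word("10/05", 50, 115), word("PAYROLL", 120, 115, width=60),
    word("2,500.00", 400, 116, width=50),
]
csv_text = region_to_csv(
    sample_words,
    bbox=(40, 90, 500, 200),
    dividers=[110, 390],
    header=["Date", "Description", "Amount"],
)
```

Per-region parsing (dates, amounts) is left out on purpose; the sketch only covers getting text into the right cells.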
 Bank-statement-specific niceties — implementable as templates on top of
 the generic engine:
 - Recurring-template store: save "Chase Visa October layout" once, and
  next month's PDF lands on the same template automatically. JSON file
  in `~/.datatools/templates/` keyed by a layout fingerprint (page
  size + header text hash).
 - Multi-page row stitching: a row that wraps across pages gets merged
  back together based on date-column continuity.
 - Currency / sign inference: a column that mixes `$1,234.56` and
  `($45.00)` — already handled by the (now-existing) Standardize
  Formats analyzer rules.
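The recurring-template store could be sketched as below. The fingerprint recipe (rounded page size plus a hash of whitespace-normalised header text) follows the description above, but the function names, the 16-character key length, and the template fields are assumptions for illustration. The demo writes to a temporary directory rather than `~/.datatools/templates/`.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def layout_fingerprint(page_width, page_height, header_text):
    """Hash page size plus normalised header text into a short, stable key."""
    normalised = " ".join(header_text.lower().split())
    payload = f"{round(page_width)}x{round(page_height)}|{normalised}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def save_template(store_dir, fingerprint, template):
    """Write the template as JSON, named after its layout fingerprint."""
    store_dir.mkdir(parents=True, exist_ok=True)
    path = store_dir / f"{fingerprint}.json"
    path.write_text(json.dumps(template, indent=2), encoding="utf-8")
    return path


def load_template(store_dir, fingerprint):
    """Return the saved template for this layout, or None if unknown."""
    path = store_dir / f"{fingerprint}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "templates"
    # October: the user draws regions once and saves them.
    fp_october = layout_fingerprint(612, 792, "Date  Description   Amount")
    save_template(store, fp_october, {
        "name": "Chase Visa Oct 2025",
        "table_bbox": [40, 90, 500, 700],
        "dividers": [110, 390],
    })
    # November: same page size and header, so the same key comes back.
    fp_november = layout_fingerprint(612.0, 792.0, "date description amount")
    recalled = load_template(store, fp_november)
    # A different layout (A4, other header) has no template yet.
    missing = load_template(store, layout_fingerprint(595, 842, "Datum Betrag"))
```

An exact hash only matches identical headers. If banks tweak header wording between months, a fuzzier comparison would be needed, which is part of what Phase C has to work out.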
 ### GUI sketch
 The hardest part of the whole project. Streamlit doesn't ship a native
 "draw rectangles on an image" widget. Options:
 - **`streamlit-drawable-canvas`**: community component (MIT-licensed).
  Lets the user draw rectangles on top of a background image and
  returns their coordinates as JSON. Maintenance activity should be
  re-verified before adoption.
  **First-choice for the region picker.**
 - **`streamlit-cropper`** — single-rectangle crop tool. Good if we only
  needed the table bbox; too limited for "header region + column
  dividers + repeating-row template."
 - **Custom React component** — fully tailored UX but adds a build
  toolchain DataTools doesn't have today. Last resort.
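Whichever picker wins, its pixel rectangles have to be mapped back to PDF coordinates before extraction. A minimal sketch, assuming the component reports fabric.js-style rect objects with `left`, `top`, `width`, `height`, and optional `scaleX`/`scaleY` keys, and that the preview is shown at its native render size. The function name is hypothetical.

```python
def canvas_rect_to_pdf_bbox(rect, render_dpi):
    """Convert a rectangle drawn on the rendered page (pixels) to PDF points.

    Returns (x0, top, x1, bottom) with the origin at the page's top-left,
    the bbox convention pdfplumber uses for cropping.
    """
    scale = 72.0 / render_dpi  # PDF user space is 72 points per inch
    # Assumption: resized rectangles keep their original width/height
    # and carry the resize as a scale factor.
    width_px = rect["width"] * rect.get("scaleX", 1.0)
    height_px = rect["height"] * rect.get("scaleY", 1.0)
    x0 = rect["left"] * scale
    top = rect["top"] * scale
    x1 = (rect["left"] + width_px) * scale
    bottom = (rect["top"] + height_px) * scale
    return (x0, top, x1, bottom)


# Header strip drawn on a preview rendered at 144 DPI, then stretched 1.5x wide.
header_rect = {"left": 100, "top": 200, "width": 400, "height": 20, "scaleX": 1.5}
header_bbox = canvas_rect_to_pdf_bbox(header_rect, render_dpi=144)
```

If the preview is shrunk to fit the Streamlit column, that display ratio has to be folded into `scale` as well.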
 Sketch of the proposed page (under the "Transformations" section of the
 sidebar):
 ```
 🧾  PDF → CSV (Beta)
 ─────────────────────────────────────────────────────────────────────
 Upload a PDF                                  [ Browse… ]
  (statement / invoice / form — text-based PDFs work best)
 [ ▸ Preview: October-statement.pdf  ·  3 pages ]
  ┌────────────────────────────────────────────────┐
  │  CHASE BANK                                    │
  │  Statement period Oct 1–31, 2025               │
  │  ┌─[1: header strip — drawn in red]──────────┐ │
  │  │  Date    Description          Amount      │ │
  │  └────────────────────────────────────────────┘ │
  │  ┌─[2: row template — drawn in green]────────┐ │
  │  │  10/03   AMAZON.COM #42…       -45.67     │ │
  │  └────────────────────────────────────────────┘ │
  │      ⋮ (more transactions)                     │
  └────────────────────────────────────────────────┘
 Columns:  [Date]  [Description]  [Amount]      [+ Add column]
 Apply template to:   ( ) Only this page
                     (•) All pages with this layout
                     ( ) All pages (force)
 [ Save template as…  Chase Visa Oct 2025 ]
 [ Run extraction → CSV ]
 ```
 After "Run extraction": the standard tool-page result layout (preview
 table, "Saved to ~/Downloads/<name>_extracted.csv", "Open Downloads
 folder" — matching the other Ready tools).
 The **template save/recall** is what makes this a one-time setup
 instead of a per-document chore — bookkeepers don't want to re-draw
 rectangles every month.
 ### Effort estimate
 | Phase | Scope | Estimate | Risk |
 |---|---|---|---|
 | **A. Backend, native PDFs only** | pdfplumber-based extraction, hard-coded region passed via a JSON config (no GUI) | **1–2 weeks** | Low — straightforward use of pdfplumber. |
 | **B. Region-picker GUI** | streamlit-drawable-canvas, multi-region drawing, per-region role assignment (date / amount / description) | **2–3 weeks** | Medium — the canvas component has quirks; persisting region state across reruns is non-trivial. |
 | **C. Multi-page application + template persistence** | Apply one page's template to N pages, save/load templates, layout fingerprint | **1–2 weeks** | Medium — "is the next page the same layout?" is a real perception problem; we'll need a heuristic. |
 | **D. Scanned-PDF OCR fallback** | Detect image-only pages, run Tesseract, merge OCR text into the extraction path | **2–3 weeks** | High — OCR accuracy is variable; we'd want a quality threshold + a "fail this page noisily" path. Bundling Tesseract with the PyInstaller build is its own packaging headache. |
 | **E. Bank-statement specifics** | Cross-page row stitching, currency-sign inference, multi-account splits | **1–2 weeks** | Medium — every bank's idea of a "statement" differs. Templates absorb most of the variance. |
 **Realistic total for a polished v1**: 6–10 calendar weeks of focused work
 (text-PDFs + GUI + templates + statement-specific niceties). Add another
 2–3 weeks if scanned PDFs are required at launch.
 **Minimum viable extract** (just text PDFs, single-region drawing, no
 template recall, no OCR): **3–4 weeks**. Worth scoping a beta at that
 level before committing to the full surface.
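As a cross-check, the per-phase ranges from the table can be tallied directly. The sketch below only sums the table's own numbers. The stated 6–10 week total sits one week above the raw A+B+C+E sum at both ends; reading that gap as integration slack is an assumption, not something the table states.

```python
# (low, high) week estimates per phase, copied from the effort table.
PHASES = {"A": (1, 2), "B": (2, 3), "C": (1, 2), "D": (2, 3), "E": (1, 2)}


def total(phases):
    """Sum the low and high ends of the chosen phases."""
    low = sum(PHASES[p][0] for p in phases)
    high = sum(PHASES[p][1] for p in phases)
    return (low, high)


polished_raw = total("ABCE")   # text PDFs + GUI + templates + statement niceties
with_ocr_raw = total("ABCDE")  # the same, plus the scanned-PDF fallback
ocr_delta = (with_ocr_raw[0] - polished_raw[0], with_ocr_raw[1] - polished_raw[1])
```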
 ### Difficulty rating
 **Medium-hard.** Not because any single piece is novel — pdfplumber +
 streamlit-drawable-canvas are well-trodden libraries — but because the
 *combination* (point-and-click region selection that persists across
 multiple PDF pages and across documents with similar layouts) is where
 most of the engineering goes. The "every bank does it slightly
 differently" reality makes templates a hard requirement rather than a
 nice-to-have, and templates raise the design effort.
 ### Risks / unknowns
 - **Scanned-PDF coverage**: if a meaningful slice of the addressable
  market sends image-only PDFs (older statements, scanned receipts),
  shipping text-only extraction limits the audience. Decide via the
  first 10–20 user requests.
 - **PyInstaller packaging of Tesseract**: bundling the OCR binary into
  the desktop build is non-trivial. May force a "Tesseract not found,
  install it separately" path on first launch, which hurts the
  "one-click install" story.
 - **Bank layout drift**: a template captured today can stop working
  next month if the bank redesigns its statement. Layout-fingerprint
  detection has to fail loudly rather than silently produce garbage.
 - **PII surface**: bank statements are some of the most sensitive
  documents the user might touch. The "runs locally — your data never
  leaves this computer" guarantee is even more load-bearing here than
  for CSVs. No telemetry, no cloud OCR services, hard line.
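"Fail loudly" for layout drift could take the form of a sanity gate on the extracted rows. The sketch below assumes rows are `(date, description, amount)` string tuples with `MM/DD/YYYY` dates; the exception name, the helper names, and the 90% threshold are all hypothetical.

```python
from datetime import datetime


class LayoutDriftError(Exception):
    """Raised when extracted rows look too broken to trust."""


def _is_date(text, fmt="%m/%d/%Y"):
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


def _is_amount(text):
    cleaned = text.replace("$", "").replace(",", "").strip()
    # Accounting-style negatives: (45.00) means -45.00.
    if cleaned.startswith("(") and cleaned.endswith(")"):
        cleaned = "-" + cleaned[1:-1]
    try:
        float(cleaned)
        return True
    except ValueError:
        return False


def check_rows(rows, min_valid_ratio=0.9):
    """Return the share of parseable rows, or raise if it is too low."""
    if not rows:
        raise LayoutDriftError("no rows extracted; template may be stale")
    valid = sum(1 for date, _desc, amount in rows
                if _is_date(date) and _is_amount(amount))
    ratio = valid / len(rows)
    if ratio < min_valid_ratio:
        raise LayoutDriftError(
            f"only {valid}/{len(rows)} rows parsed; template may be stale")
    return ratio


good_rows = [
    ("10/03/2025", "AMAZON.COM #42", "-45.67"),
    ("10/05/2025", "PAYROLL", "$2,500.00"),
    ("10/09/2025", "REFUND", "($45.00)"),
]
# What a shifted column layout might produce: text in the wrong cells.
drifted_rows = [
    ("Description", "10/03/2025", "AMAZON.COM"),
    ("Page 2 of 3", "", ""),
]
good_ratio = check_rows(good_rows)
```

Raising an exception, rather than writing a CSV with a warning attached, keeps a stale template from quietly feeding bad numbers into a ledger.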
 ### Ship criteria
 Before this tool re-enters active development, all of these need to be
 true:
 - DataTools has shipped to **≥1 paying customer** (the `PLAN.md` §2.1
  freeze condition).
 - **At least 3 paying customers OR 5 demo-traffic emails** have
  explicitly asked for PDF extraction. Below that signal, build
  something else.
 - The bookkeeper niche (per `PLAN.md` §2.3) has at least one converted
  customer — that's the persona who actually needs this tool, and
  confirming they pay before building a tool aimed squarely at them
  is the discipline the freeze exists to enforce.
 If all three are met, the **minimum-viable beta (3–4 weeks)** goes
 first: text PDFs + single-region drawing, i.e. Phase A plus a
 single-region slice of Phase B, so we can see real user behaviour
 before committing to the full template surface.
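The three criteria reduce to a boolean check, sketched here to show that the gate is mechanical. The dataclass and field names are hypothetical; the thresholds are the ones listed above.

```python
from dataclasses import dataclass


@dataclass
class DemandSignal:
    paying_customers: int        # all DataTools customers who have paid
    paid_pdf_requests: int       # paying customers who asked for PDF extraction
    demo_pdf_requests: int       # demo-traffic emails asking for PDF extraction
    bookkeeper_customers: int    # converted customers in the bookkeeper niche


def pdf_tool_unfrozen(s: DemandSignal) -> bool:
    """True only when all three ship criteria hold."""
    has_revenue = s.paying_customers >= 1
    has_demand = s.paid_pdf_requests >= 3 or s.demo_pdf_requests >= 5
    niche_converted = s.bookkeeper_customers >= 1
    return has_revenue and has_demand and niche_converted


today = pdf_tool_unfrozen(DemandSignal(0, 0, 0, 0))
demand_but_no_niche = pdf_tool_unfrozen(DemandSignal(2, 1, 5, 0))
all_met = pdf_tool_unfrozen(DemandSignal(4, 3, 0, 1))
```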
 ---
 ## (placeholder for additional future-tool entries)
 Add new entries above this line. Keep the same shape:
 What / Why / Can we ship now / Approach / GUI / Effort / Risks /
 Ship criteria. The shape is what makes "is this idea ready" a
 factual question instead of an opinion.
--- a/docs/NEXT-STEPS.md
+++ b/docs/NEXT-STEPS.md
@@ -269,6 +269,7 @@ moves until $5k/mo MRR:
 | | Why locked |
 |---|---|
 | ❌ More tools (06–08) | `PLAN.md` §2.1 distribution-gate. Tool 09 was the exception; no others until first paid customer + one external review. |
 | ❌ Tool #10 PDF → CSV (the most-asked-for adjacency) | Parked in `docs/FUTURE-TOOLS.md` with full design + 3–4 wk MVP / 6–10 wk polished estimate. Ship trigger: paying customer + ≥3 paid or ≥5 demo emails asking for PDF + the bookkeeper niche converting first. None have fired yet. |
 | ❌ SaaS pivot | `DECISIONS.md` §4 — recurring infra conflicts with the lifestyle constraint. |
 | ❌ Live chat / sales calls | `DECISIONS.md` §1 #8 — no-touch is locked until $5k/mo. |
 | ❌ Custom integrations / one-off consulting | Breaks "build once, sell many." |
--- a/docs/PLAN.md
+++ b/docs/PLAN.md
@@ -58,6 +58,14 @@ buy" into "an automatable workflow you depend on." That conversion is
 what produces retention and word-of-mouth — the only marketing channel
 that scales under the no-network/no-touch constraint.
 **Parked behind the freeze**: post-launch tool ideas are captured in
 `docs/FUTURE-TOOLS.md` with feasibility, GUI sketch, effort estimate,
 and ship criteria for each. Currently parked: **#10 PDF → CSV
 extractor** (bank statements et al.) — gated on a paying customer +
 ≥3 paying customers or ≥5 demo emails explicitly asking for PDF
 extraction, with the bookkeeper niche converting at least one customer
 first. None of those triggers have fired yet.
 ### 2.2 The demo *is* the product. Make it embarrassingly good.
 - Three persona-tagged sample datasets, not one generic CSV: Shopify
--- a/docs/REQUIREMENTS.md
+++ b/docs/REQUIREMENTS.md
@@ -127,6 +127,17 @@ Sample size: 1,000 rows (configurable).
 8. Quality Check — Coming Soon
 9. Automated Workflows — Ready
 **Future / not in v1.** Tool ideas captured for after-launch consideration
 live in `docs/FUTURE-TOOLS.md` — entries there are gated by the new-tool
 freeze in `PLAN.md` §2.1 and don't ship without a paying-customer +
 repeated-demand signal. Currently parked there:
 - **#10. PDF → CSV extractor** (bank statements + similar). No PDF
  dependency exists in the repo today; this tool would need pdfplumber,
  streamlit-drawable-canvas, and a templates store. Estimated 3–4 weeks
  for a text-only MVP, 6–10 weeks for the polished version with
  multi-page template recall.
 ### 11.a Recommended pipeline order (soft, not enforced)
 Automated Workflows ships with a `SOFT_DEPENDENCIES` table; the