diff --git a/docs/FUTURE-TOOLS.md b/docs/FUTURE-TOOLS.md new file mode 100644 index 0000000..c1a77c4 --- /dev/null +++ b/docs/FUTURE-TOOLS.md @@ -0,0 +1,244 @@ +# Future tools — design notes + +> Creator-only. Specs for tools the strategic plan refuses to build right now +> but that surface repeatedly enough to be worth documenting once instead of +> re-thinking from scratch every time a customer asks. +> **Status of these tools**: post-launch, post-revenue. See `PLAN.md` §2.1 — +> new-tool development is frozen until DataTools has a paying customer and a +> repeated demand signal for the same idea. This file is the resting place +> for those ideas in the meantime; nothing here ships unless a future +> decision says it does. + +Each entry follows the same shape: **What it does**, **Why someone would +want it**, **Can we ship it now?**, **Approach**, **GUI sketch**, **Effort**, +**Risks/unknowns**, **Ship criteria** (the signal that overrides the freeze). + +--- + +## 10. PDF → CSV extractor (bank statements + similar) + +### What it does + +Takes a PDF (typically a bank statement, expense report, paystub, invoice, +or any document where humans-but-not-computers can read a table) and turns +the tabular content into a CSV that the rest of DataTools can consume. + +The user shows the tool **where** the data lives by drawing rectangles on +a rendered preview of the first page; the tool then applies those region +templates to every page of the document (and remembers the template so the +same template can be re-applied to next month's statement without +re-clicking). + +### Why someone would want it + +Bookkeepers, accountants, and any small-business operator who: + +- Gets bank/credit-card statements only as PDFs (most US banks; many + European ones). +- Wants to import transactions into QuickBooks / Xero / a spreadsheet + without paying $10–$30/month for a SaaS converter (Docparser, + Rossum, Hubdoc) or relying on a Python script they can't maintain. +- Has 12 months × N accounts of statements to back-fill into a + ledger. + +This is the most-requested DataTools adjacency in the casual feedback we +have so far. It maps tightly onto the **bookkeeper niche** identified in +`PLAN.md` §2.3 — that persona is exactly who needs PDF extraction and is +exactly the kind of operator who'd pay for a one-time desktop tool over a +recurring SaaS subscription. + +### Can we ship it now? + +**No.** Current state, verified 2026-05-17: + +- No PDF dependency in `requirements.txt` or `requirements-dev.txt`. +- No PDF-touching code anywhere under `src/`. The single + string-mention of "PDF" in the codebase is in the **output** copy for + the Quality Check tool ("generate PDF/Excel quality reports"), + unrelated to extraction. +- No region-selection / canvas component in the Streamlit GUI today. + +Building this requires net-new infrastructure on three axes (libraries, +extraction core, region-picker UI). Estimates below. + +### Approach (technical) + +PDFs split cleanly into two populations and the strategy differs: + +1. **Native / text-layer PDFs** — text is stored as text, just laid out + visually. Most modern bank statements are this. Solvable with + coordinate-aware text extraction: + + - **`pdfplumber`** (BSD-3, on top of `pdfminer.six`) — gives `(x0, y0, + x1, y1, text)` per character/word/line for each page. Mature, well + tested, single dependency, no native compiler. **First-choice.** + - **`pypdf`** (BSD-3) — text-only, no positions. Too coarse for + statement parsing; useful only for "the whole document as one big + string." + - **`camelot-py`** (MIT) — purpose-built for table extraction. + Heavier (needs `ghostscript` and `tk`/`opencv` for some modes), + and assumes the table grid is already visible. Worth evaluating + as a fallback for documents with explicit ruled tables. + +2. **Scanned / image-only PDFs** — pixels of a scanner; no text layer. + Less common from major banks today but still happens with old PDFs + and receipts. Needs OCR: + + - **`pytesseract`** wrapping the **Tesseract** binary (Apache-2). The + OCR is good for English on clean scans, mediocre on receipts. + Detect with `pdfplumber`: a page where every character is in a + glyph "image" object means the page is image-only → OCR fallback. + +The extraction core would be a state machine: + +1. Render page to an image (`pdfplumber.Page.to_image()` returns a PIL + image at a chosen DPI). +2. User draws a header region and per-row regions (or marks a single + table bounding box + column dividers) on the preview. +3. For each PDF page, crop the corresponding pixel region (or pdf + coordinate region), pull the text in that crop, and apply per-region + parsing (date, amount, description). +4. Emit one CSV row per detected statement row. + +Bank-statement-specific niceties — implementable as templates on top of +the generic engine: + +- Recurring-template store: save "Chase visa October layout" once, the + next month's PDF lands on the same template automatically. JSON file + in `~/.datatools/templates/` keyed by a layout fingerprint (page + size + header text hash). +- Multi-page row stitching: a row that wraps across pages gets merged + back together based on date-column continuity. +- Currency / sign inference: a column that mixes `$1,234.56` and + `($45.00)` — already handled by the (now-existing) Standardize + Formats analyzer rules. + +### GUI sketch + +The hardest part of the whole project. Streamlit doesn't ship a native +"draw rectangles on an image" widget. Options: + +- **`streamlit-drawable-canvas`** — community component (MIT-licensed). + Lets the user draw freehand rectangles on top of a background image. + Returns the rectangle coordinates as JSON. Active maintenance. + **First-choice for the region picker.** +- **`streamlit-cropper`** — single-rectangle crop tool. Good if we only + needed the table bbox; too limited for "header region + column + dividers + repeating-row template." +- **Custom React component** — fully tailored UX but adds a build + toolchain DataTools doesn't have today. Last resort. + +Sketch of the proposed page (under "Transformations" in the sidebar +section): + +``` +🧾 PDF → CSV (Beta) +───────────────────────────────────────────────────────────────────── +Upload a PDF [ Browse… ] + (statement / invoice / form — text-based PDFs work best) + +[ ▸ Preview: October-statement.pdf · 3 pages ] + ┌────────────────────────────────────────────────┐ + │ CHASE BANK │ + │ Statement period Oct 1–31, 2025 │ + │ ┌─[1: header strip — drawn in red]──────────┐ │ + │ │ Date Description Amount │ │ + │ └────────────────────────────────────────────┘ │ + │ ┌─[2: row template — drawn in green]────────┐ │ + │ │ 10/03 AMAZON.COM #42… -45.67 │ │ + │ └────────────────────────────────────────────┘ │ + │ ⋮ (more transactions) │ + └────────────────────────────────────────────────┘ + +Columns: [Date] [Description] [Amount] [+ Add column] + +Apply template to: ( ) Only this page + (•) All pages with this layout + ( ) All pages (force) + +[ Save template as… Chase Visa Oct 2025 ] + +[ Run extraction → CSV ] +``` + +After "Run extraction": the standard tool-page result layout (preview +table, "Saved to ~/Downloads/_extracted.csv", "Open Downloads +folder" — matching the other Ready tools). + +The **template save/recall** is what makes this a one-time setup +instead of a per-document chore — bookkeepers don't want to re-draw +rectangles every month. + +### Effort estimate + +| Phase | Scope | Estimate | Risk | +|---|---|---|---| +| **A. Backend, native PDFs only** | pdfplumber-based extraction, hard-coded region passed via a JSON config (no GUI) | **1–2 weeks** | Low — straightforward use of pdfplumber. | +| **B. Region-picker GUI** | streamlit-drawable-canvas, multi-region drawing, per-region role assignment (date / amount / description) | **2–3 weeks** | Medium — the canvas component has quirks; persisting region state across reruns is non-trivial. | +| **C. Multi-page application + template persistence** | Apply one page's template to N pages, save/load templates, layout fingerprint | **1–2 weeks** | Medium — "is the next page the same layout?" is a real perception problem; we'll need a heuristic. | +| **D. Scanned-PDF OCR fallback** | Detect image-only pages, run Tesseract, merge OCR text into the extraction path | **2–3 weeks** | High — OCR accuracy is variable; we'd want a quality threshold + a "fail this page noisily" path. Bundling Tesseract with the PyInstaller build is its own packaging headache. | +| **E. Bank-statement specifics** | Cross-page row stitching, currency-sign inference, multi-account splits | **1–2 weeks** | Medium — every bank's idea of a "statement" differs. Templates absorb most of the variance. | + +**Realistic total for a polished v1**: 6–10 calendar weeks of focused work +(text-PDFs + GUI + templates + statement-specific niceties). Add another +2–3 weeks if scanned PDFs are required at launch. + +**Minimum viable extract** (just text PDFs, single-region drawing, no +template recall, no OCR): **3–4 weeks**. Worth scoping a beta at that +level before committing to the full surface. + +### Difficulty rating + +**Medium-hard.** Not because any single piece is novel — pdfplumber + +streamlit-drawable-canvas are well-trodden libraries — but because the +*combination* (point-and-click region selection that persists across +multiple PDF pages and across documents with similar layouts) is where +most of the engineering goes. The "every bank does it slightly +differently" reality makes templates a hard requirement rather than a +nice-to-have, and templates raise the design effort. + +### Risks / unknowns + +- **Scanned-PDF coverage**: if a meaningful slice of the addressable + market sends image-only PDFs (older statements, scanned receipts), + shipping text-only extraction limits the audience. Decide via the + first 10–20 user requests. +- **PyInstaller packaging of Tesseract**: bundling the OCR binary into + the desktop build is non-trivial. May force a "Tesseract not found — + install it separately" path on first launch, which hurts the "one- + click install" story. +- **Bank layout drift**: a template captured today can stop working + next month if the bank redesigns its statement. Layout-fingerprint + detection has to fail loudly rather than silently produce garbage. +- **PII surface**: bank statements are some of the most sensitive + documents the user might touch. The "runs locally — your data never + leaves this computer" guarantee is even more load-bearing here than + for CSVs. No telemetry, no cloud OCR services, hard line. + +### Ship criteria + +Before this tool re-enters active development, all of these need to be +true: + +- DataTools has shipped to **≥1 paying customer** (the `PLAN.md` §2.1 + freeze condition). +- **At least 3 paying customers OR 5 demo-traffic emails** have + explicitly asked for PDF extraction. Below that signal, build + something else. +- The bookkeeper niche (per `PLAN.md` §2.3) has at least one converted + customer — that's the persona who actually needs this tool, and + confirming they pay before building a tool aimed squarely at them + is the discipline the freeze exists to enforce. + +If those three trip, the **Phase A minimum-viable beta (3–4 weeks)** +goes first — text PDFs + single-region drawing — so we can see real +user behaviour before committing to the full template surface. + +--- + +## (placeholder for additional future-tool entries) + +Add new entries above this line. Keep the same shape: +What / Why / Can we ship now / Approach / GUI / Effort / Risks / +Ship criteria. The shape is what makes "is this idea ready" a +factual question instead of an opinion. diff --git a/docs/NEXT-STEPS.md b/docs/NEXT-STEPS.md index e5f2122..f9fc5c4 100644 --- a/docs/NEXT-STEPS.md +++ b/docs/NEXT-STEPS.md @@ -269,6 +269,7 @@ moves until $5k/mo MRR: | | Why locked | |---|---| | ❌ More tools (06–08) | `PLAN.md` §2.1 distribution-gate. Tool 09 was the exception; no others until first paid customer + one external review. | +| ❌ Tool #10 PDF → CSV (the most-asked-for adjacency) | Parked in `docs/FUTURE-TOOLS.md` with full design + 3–4 wk MVP / 6–10 wk polished estimate. Ship trigger: paying customer + ≥3 paid or ≥5 demo emails asking for PDF + the bookkeeper niche converting first. None have fired yet. | | ❌ SaaS pivot | `DECISIONS.md` §4 — recurring infra conflicts with the lifestyle constraint. | | ❌ Live chat / sales calls | `DECISIONS.md` §1 #8 — no-touch is locked until $5k/mo. | | ❌ Custom integrations / one-off consulting | Breaks "build once, sell many." | diff --git a/docs/PLAN.md b/docs/PLAN.md index c2f4e57..db3da28 100644 --- a/docs/PLAN.md +++ b/docs/PLAN.md @@ -58,6 +58,14 @@ buy" into "an automatable workflow you depend on." That conversion is what produces retention and word-of-mouth — the only marketing channel that scales under the no-network/no-touch constraint. +**Parked behind the freeze**: post-launch tool ideas are captured in +`docs/FUTURE-TOOLS.md` with feasibility, GUI sketch, effort estimate, +and ship criteria for each. Currently parked: **#10 PDF → CSV +extractor** (bank statements et al.) — gated on a paying customer + +≥3 paying customers or ≥5 demo emails explicitly asking for PDF +extraction, with the bookkeeper niche converting at least one customer +first. None of those triggers have fired yet. + ### 2.2 The demo *is* the product. Make it embarrassingly good. - Three persona-tagged sample datasets, not one generic CSV: Shopify diff --git a/docs/REQUIREMENTS.md b/docs/REQUIREMENTS.md index 0078908..35fdef5 100644 --- a/docs/REQUIREMENTS.md +++ b/docs/REQUIREMENTS.md @@ -127,6 +127,17 @@ Sample size: 1,000 rows (configurable). 8. Quality Check — Coming Soon 9. Automated Workflows — Ready +**Future / not in v1.** Tool ideas captured for after-launch consideration +live in `docs/FUTURE-TOOLS.md` — entries there are gated by the new-tool +freeze in `PLAN.md` §2.1 and don't ship without a paying-customer + +repeated-demand signal. Currently parked there: + +- **#10. PDF → CSV extractor** (bank statements + similar). No PDF + dependency exists in the repo today; this tool would need pdfplumber, + streamlit-drawable-canvas, and a templates store. Estimated 3–4 weeks + for a text-only MVP, 6–10 weeks for the polished version with + multi-page template recall. + ### 11.a Recommended pipeline order (soft, not enforced) Automated Workflows ships with a `SOFT_DEPENDENCIES` table; the