feat(pdf): tool page with Extract / Build / Manage modes

Phase 3/6. Wires the PDF Extractor into the GUI as a new "transformations" tool with three modes selected by a horizontal radio at the top of the page: **Extract** — pick a saved template, upload one or more statement PDFs (single + batch shipping together to keep the common case one-step), get a previewed DataFrame + CSV download. Per-file row counts and warnings are surfaced; failures on one file don't kill the whole batch. The combined CSV gets a ``source_file`` first column so the accountant can sort/filter by statement. **Build template** — load an existing template or start fresh, upload a sample PDF, edit every schema field across four tabs (Pages & table / Columns / Parsing / Save). A live preview below re-runs ``apply_template`` against the sample on each re-render so the user sees their changes hit rows immediately. The column- boundary editor is text-input ("comma-separated x-positions") for now — replaced by the drawable-canvas visual picker in commit 5. **Manage templates** — list with rename / delete / export (downloads the canonical JSON) / import (uploads someone else's JSON, validated through ``template_from_json``). Heavy work (``extract_pages_auto``) only runs on explicit user action (Extract / a new sample upload), and the parsed Page list is cached in ``st.session_state`` so widget-edit reruns don't re-parse the PDF. Logging: tool runs and template saves both hit the audit log via ``log_event("tool_run", …)``, matching every other tool's instrumentation pattern. Registered in ``tools_registry.py`` under ``transformations`` with status ``Ready`` and the picture-as-pdf Material icon. i18n keys added for en + es ("PDF to CSV" / "PDF a CSV"). OCR is wired in this commit — ``extract_pages_auto`` already falls back through ``pytesseract`` when the binary is available, and the warning strings it returns surface as ``st.info`` / ``st.warning`` per-file. Commit 6 will polish the OCR UX with a status row. Next commits build on this page: 4 — batch progress + cancellation + per-file error grouping 5 — drawable-canvas visual picker replaces text x-positions 6 — OCR availability banner + scanned-page indicators Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 22:49:44 +00:00
parent aea520d2f7
commit 2f349e8191
4 changed files with 608 additions and 0 deletions
--- a/src/gui/pages/10_PDF_Extractor.py
+++ b/src/gui/pages/10_PDF_Extractor.py
@@ -0,0 +1,584 @@
 """PDF Extractor — extract bank-statement transactions to CSV.
 Three modes:
 - **Extract** (daily workflow): pick a saved template, upload a
  PDF, get a CSV preview + download.
 - **Build template**: upload a sample PDF, configure how the
  table is identified, save the template for reuse.
 - **Manage templates**: list / rename / delete / export / import.
 The expensive step is ``extract_pages_auto`` (PDF I/O + word
 extraction + optional OCR). It runs only on explicit user action
 ("Extract" / "Preview"), and results are stashed in session_state
 so re-renders from form-field edits don't re-parse the PDF. Heavy
 work off Streamlit's rerun-on-every-widget path.
 """
 from __future__ import annotations
 import io
 import sys
 from datetime import datetime
 from pathlib import Path
 import pandas as pd
 import streamlit as st
 _project_root = Path(__file__).resolve().parent.parent.parent.parent
 if str(_project_root) not in sys.path:
    sys.path.insert(0, str(_project_root))
 from src.audit import log_event, log_page_open
 from src.gui.components import hide_streamlit_chrome, render_sticky_footer
 from src.pdf_extract import apply_template, extract_pages_auto
 from src.pdf_templates import (
    SCHEMA_VERSION,
    VALID_TARGETS,
    delete_template,
    list_templates,
    load_template,
    new_template,
    save_template,
    slugify,
    template_from_json,
    template_to_json,
    validate_template,
 )
 log_page_open("10_PDF_Extractor")
 _ICON_PATH = str(Path(__file__).parent.parent / "assets" / "datatools_icon_256.png")
 st.set_page_config(
    page_title="PDF to CSV · DataTools",
    page_icon=_ICON_PATH,
    layout="wide",
 )
 hide_streamlit_chrome()
 render_sticky_footer()
 # ---------------------------------------------------------------------------
 # Session-state keys (centralized so the build / extract flows agree on names)
 # ---------------------------------------------------------------------------
 K_MODE = "pdf_mode"
 K_CURRENT_TEMPLATE = "pdf_tpl_current"
 K_SAMPLE_BYTES = "pdf_tpl_sample_bytes"
 K_SAMPLE_NAME = "pdf_tpl_sample_name"
 K_SAMPLE_PAGES = "pdf_tpl_sample_pages"
 K_EXTRACT_DF = "pdf_extract_df"
 K_EXTRACT_WARNINGS = "pdf_extract_warnings"
 K_EXTRACT_FILES = "pdf_extract_files"
 def _get_or_init(key: str, default):
    if key not in st.session_state:
        st.session_state[key] = default
    return st.session_state[key]
 # ---------------------------------------------------------------------------
 # Page header + mode selector
 # ---------------------------------------------------------------------------
 st.markdown("# PDF to CSV")
 st.caption(
    "Extract transaction tables from bank-statement PDFs. Build one "
    "template per source (bank + account type), then reuse it for "
    "every statement that follows the same layout."
 )
 mode = st.radio(
    "Mode",
    ["Extract", "Build template", "Manage templates"],
    horizontal=True,
    key=K_MODE,
 )
 st.divider()
 # ===========================================================================
 # Extract mode
 # ===========================================================================
 def _render_extract_mode() -> None:
    templates = list_templates()
    if not templates:
        st.info(
            "No templates yet. Switch to **Build template** to create your "
            "first one — you'll need a sample PDF from the source bank."
        )
        return
    options = {f"{t['name']}  ·  {t['slug']}": t["slug"] for t in templates}
    label = st.selectbox("Template", list(options.keys()))
    slug = options[label]
    uploads = st.file_uploader(
        "Statement PDF(s)",
        type=["pdf"],
        accept_multiple_files=True,
        help=(
            "Drop one or more statements from the same source. Rows from "
            "every file are combined into a single CSV, tagged with the "
            "source filename."
        ),
    )
    run = st.button("Extract", type="primary", disabled=not uploads)
    if run and uploads:
        try:
            tpl = load_template(slug)
        except Exception as e:
            st.error(f"Couldn't load template {slug!r}: {e}")
            return
        per_file_frames: list[pd.DataFrame] = []
        all_warnings: list[str] = []
        files_meta: list[dict] = []
        progress = st.progress(0.0, text="Reading PDFs…")
        for i, up in enumerate(uploads, start=1):
            try:
                pdf_bytes = up.read()
                pages, warns = extract_pages_auto(pdf_bytes, allow_ocr=True)
                df = apply_template(pages, tpl)
                df.insert(0, "source_file", up.name)
                per_file_frames.append(df)
                files_meta.append({
                    "file": up.name,
                    "rows": len(df),
                    "pages": len(pages),
                })
                for w in warns:
                    all_warnings.append(f"[{up.name}] {w}")
            except Exception as e:
                all_warnings.append(
                    f"[{up.name}] extraction failed: "
                    f"{type(e).__name__}: {e}"
                )
                files_meta.append({
                    "file": up.name, "rows": 0, "pages": 0, "error": str(e),
                })
            progress.progress(i / len(uploads), text=f"Read {i}/{len(uploads)}")
        progress.empty()
        if per_file_frames:
            combined = pd.concat(per_file_frames, ignore_index=True)
        else:
            combined = pd.DataFrame()
        st.session_state[K_EXTRACT_DF] = combined
        st.session_state[K_EXTRACT_WARNINGS] = all_warnings
        st.session_state[K_EXTRACT_FILES] = files_meta
        log_event(
            "tool_run",
            "PDF Extractor run",
            page="10_PDF_Extractor",
            template=slug,
            files=len(uploads),
            rows=len(combined),
        )
    df = st.session_state.get(K_EXTRACT_DF)
    if isinstance(df, pd.DataFrame):
        warnings = st.session_state.get(K_EXTRACT_WARNINGS, []) or []
        files_meta = st.session_state.get(K_EXTRACT_FILES, []) or []
        if files_meta:
            st.markdown("#### Per-file summary")
            st.dataframe(
                pd.DataFrame(files_meta),
                hide_index=True,
                use_container_width=True,
            )
        if warnings:
            with st.expander(f"Warnings ({len(warnings)})", expanded=False):
                for w in warnings:
                    st.warning(w)
        if df.empty:
            st.info(
                "No rows were extracted. Re-check the template's header "
                "text, column boundaries, and end markers in **Build "
                "template** mode against a sample PDF."
            )
        else:
            st.markdown(f"#### Extracted rows ({len(df):,})")
            st.dataframe(df, hide_index=True, use_container_width=True)
            csv_bytes = df.to_csv(index=False).encode("utf-8")
            ts = datetime.now().strftime("%Y%m%d-%H%M%S")
            st.download_button(
                "Download CSV",
                data=csv_bytes,
                file_name=f"transactions-{slug}-{ts}.csv",
                mime="text/csv",
                type="primary",
            )
 # ===========================================================================
 # Build-template mode
 # ===========================================================================
 def _ensure_sample_loaded() -> bool:
    """Side-bar uploader for the sample PDF. Returns True if a sample
    is loaded and parsed (pages cached in session_state)."""
    up = st.file_uploader(
        "Sample statement",
        type=["pdf"],
        help=(
            "Used to drive the live preview while you build the "
            "template — pick a representative statement from this "
            "source."
        ),
        key="pdf_tpl_sample_uploader",
    )
    if up is not None and up.name != st.session_state.get(K_SAMPLE_NAME):
        pdf_bytes = up.read()
        try:
            pages, warns = extract_pages_auto(pdf_bytes, allow_ocr=True)
        except Exception as e:
            st.error(f"Couldn't read PDF: {type(e).__name__}: {e}")
            return False
        st.session_state[K_SAMPLE_BYTES] = pdf_bytes
        st.session_state[K_SAMPLE_NAME] = up.name
        st.session_state[K_SAMPLE_PAGES] = pages
        for w in warns:
            st.info(w)
    return bool(st.session_state.get(K_SAMPLE_PAGES))
 def _render_columns_editor(tpl: dict) -> None:
    """Edit the column mapping (source index → target field) and the
    boundary x-positions in one place."""
    st.markdown("##### Columns")
    boundaries = list(tpl["table"].get("column_boundaries") or [])
    bounds_text = st.text_input(
        "Column boundaries (x-positions, comma-separated)",
        value=", ".join(str(int(b)) for b in boundaries),
        help=(
            "N boundaries create N+1 columns. The visual picker in "
            "the next phase will set these for you — until then you "
            "can read x-positions from the page-preview hover tip "
            "below, or trial-and-error against the live preview."
        ),
    )
    try:
        tpl["table"]["column_boundaries"] = sorted(
            float(x.strip()) for x in bounds_text.split(",") if x.strip()
        )
    except ValueError:
        st.warning("Column boundaries must be numbers.")
    n_cols = len(tpl["table"]["column_boundaries"]) + 1
    st.caption(f"{n_cols} source column(s) defined.")
    # Column mapping: one row per output column the user wants.
    columns_state = tpl.get("columns") or []
    if not columns_state:
        # Seed a reasonable default the first time.
        columns_state = [
            {"source": 0, "target": "date"},
            {"source": 1, "target": "description"},
            {"source": 2, "target": "amount"},
        ][:n_cols]
    targets = ["date", "description", "amount", "amount_debit",
               "amount_credit", "balance", "type"]
    new_columns: list[dict] = []
    for i, col in enumerate(columns_state):
        c1, c2, c3 = st.columns([2, 3, 1])
        src = c1.number_input(
            f"Source #{i + 1}",
            min_value=0,
            max_value=max(n_cols - 1, 0),
            value=min(int(col.get("source", 0)), max(n_cols - 1, 0)),
            step=1,
            key=f"src_{i}",
        )
        tgt_default = col.get("target", "")
        if tgt_default not in targets:
            targets_ext = targets + [tgt_default] if tgt_default else targets
        else:
            targets_ext = targets
        tgt = c2.selectbox(
            f"Target #{i + 1}",
            targets_ext,
            index=(targets_ext.index(tgt_default) if tgt_default in targets_ext else 0),
            key=f"tgt_{i}",
        )
        keep = c3.checkbox("Keep", value=True, key=f"keep_{i}")
        if keep:
            new_columns.append({"source": int(src), "target": tgt})
    if st.button("+ Add column", key="add_col"):
        new_columns.append({"source": n_cols - 1 if n_cols else 0, "target": ""})
        st.rerun()
    tpl["columns"] = new_columns
 def _render_build_form(tpl: dict) -> None:
    """Render every editable field on the template, in tabs."""
    t1, t2, t3, t4 = st.tabs(["Pages & table", "Columns", "Parsing", "Save"])
    with t1:
        c1, c2 = st.columns(2)
        with c1:
            tpl["name"] = st.text_input("Template name", value=tpl.get("name", ""))
            tpl["slug"] = slugify(tpl["name"])
            tpl["notes"] = st.text_area("Notes", value=tpl.get("notes", ""), height=70)
            tpl["pages"]["range"] = st.text_input(
                "Pages",
                value=tpl["pages"].get("range", "all"),
                help='"all", "1-3", "2,4", "3-" all work.',
            )
            tpl["pages"]["skip_matching"] = st.text_input(
                "Skip pages matching (regex, optional)",
                value=tpl["pages"].get("skip_matching", ""),
                help='e.g. "Page \\d+ of" to skip cover pages.',
            )
        with c2:
            tpl["table"]["header_text"] = st.text_input(
                "Header text (transactions table)",
                value=tpl["table"].get("header_text", ""),
                help=(
                    "Words from the header row of the transactions table, "
                    "e.g. \"Date Description Amount Balance\". Extraction "
                    "starts on the row AFTER this match."
                ),
            )
            ends = "\n".join(tpl["table"].get("end_markers") or [])
            new_ends = st.text_area(
                "End markers (one regex per line)",
                value=ends,
                help='e.g. "Closing balance", "Page \\d+ of".',
                height=80,
            )
            tpl["table"]["end_markers"] = [
                line.strip() for line in new_ends.splitlines() if line.strip()
            ]
            skips = "\n".join(tpl["table"].get("skip_rows_matching") or [])
            new_skips = st.text_area(
                "Skip rows matching (one regex per line, optional)",
                value=skips,
                help='Common entries: "Total", "Subtotal", "^Page ".',
                height=80,
            )
            tpl["table"]["skip_rows_matching"] = [
                line.strip() for line in new_skips.splitlines() if line.strip()
            ]
            tpl["table"]["y_tolerance"] = st.number_input(
                "Row y-tolerance (pts)",
                min_value=0.5,
                max_value=20.0,
                value=float(tpl["table"].get("y_tolerance", 3.0)),
                step=0.5,
                help=(
                    "How close two words' y-positions must be to be on the "
                    "same row. Bump up if rows are getting split, down if "
                    "rows are merging."
                ),
            )
    with t2:
        _render_columns_editor(tpl)
    with t3:
        c1, c2 = st.columns(2)
        with c1:
            tpl["parse"]["date_format"] = st.text_input(
                "Date format",
                value=tpl["parse"].get("date_format", "%m/%d/%Y"),
                help=(
                    "Python strftime format. Common: %m/%d/%Y (US), "
                    "%d/%m/%Y (EU), %Y-%m-%d (ISO)."
                ),
            )
            tpl["parse"]["currency_strip"] = st.text_input(
                "Currency symbols to strip",
                value=tpl["parse"].get("currency_strip", "$"),
            )
            tpl["parse"]["decimal_separator"] = st.text_input(
                "Decimal separator",
                value=tpl["parse"].get("decimal_separator", "."),
                max_chars=1,
            )
            tpl["parse"]["thousands_separator"] = st.text_input(
                "Thousands separator",
                value=tpl["parse"].get("thousands_separator", ","),
                max_chars=1,
            )
        with c2:
            tpl["parse"]["amount_negative_in_parens"] = st.checkbox(
                "Parens = negative amount",
                value=bool(tpl["parse"].get("amount_negative_in_parens", True)),
            )
            tpl["parse"]["merge_multiline_description"] = st.checkbox(
                "Merge multi-line descriptions",
                value=bool(tpl["parse"].get("merge_multiline_description", True)),
                help=(
                    "Rows with no date attach to the previous row's "
                    "description — handles wrapped vendor names."
                ),
            )
    with t4:
        ok, errors = validate_template(tpl)
        if errors:
            for err in errors:
                st.error(err)
        c1, c2 = st.columns([1, 3])
        with c1:
            save_btn = st.button("Save template", type="primary", disabled=not ok)
        with c2:
            st.caption(
                f"Will save as: ``{tpl.get('slug') or '—'}``  "
                f"(folder: ``~/.datatools/pdf_templates/``)"
            )
        if save_btn:
            try:
                slug = save_template(tpl)
                st.success(f"Saved as **{slug}**. Switch to Extract mode to use it.")
                log_event(
                    "tool_run",
                    "PDF Extractor template saved",
                    page="10_PDF_Extractor",
                    template=slug,
                )
            except Exception as e:
                st.error(f"Save failed: {e}")
 def _render_preview(tpl: dict) -> None:
    """Below-the-fold live preview against the cached sample pages."""
    pages = st.session_state.get(K_SAMPLE_PAGES)
    if not pages:
        return
    st.divider()
    st.markdown("##### Live preview")
    try:
        df = apply_template(pages, tpl)
    except Exception as e:
        st.error(f"Preview failed: {type(e).__name__}: {e}")
        return
    if df.empty:
        st.info(
            "Template doesn't match any rows yet. Common fixes: tighten "
            "the header text, add an end marker, adjust column "
            "boundaries."
        )
    else:
        st.caption(f"{len(df)} row(s) from {len(pages)} page(s)")
        st.dataframe(df.head(50), hide_index=True, use_container_width=True)
 def _render_build_mode() -> None:
    # Optionally load an existing template into the form
    templates = list_templates()
    c1, c2, c3 = st.columns([2, 2, 1])
    with c1:
        existing_label = "— start from scratch —"
        choices = [existing_label] + [
            f"{t['name']}  ·  {t['slug']}" for t in templates
        ]
        picked = st.selectbox("Load existing", choices, key="build_load_pick")
    with c2:
        if st.button("Load", disabled=picked == existing_label, key="build_load_btn"):
            slug = picked.split("  ·  ")[-1]
            try:
                st.session_state[K_CURRENT_TEMPLATE] = load_template(slug)
                st.rerun()
            except Exception as e:
                st.error(f"Load failed: {e}")
    with c3:
        if st.button("New", key="build_new_btn"):
            st.session_state[K_CURRENT_TEMPLATE] = new_template("New template")
            st.rerun()
    tpl = _get_or_init(K_CURRENT_TEMPLATE, new_template("New template"))
    if not _ensure_sample_loaded():
        st.info(
            "Upload a sample statement from this source to drive the live "
            "preview. Your template is built against the sample's layout."
        )
        return
    _render_build_form(tpl)
    _render_preview(tpl)
 # ===========================================================================
 # Manage-templates mode
 # ===========================================================================
 def _render_manage_mode() -> None:
    templates = list_templates()
    st.markdown("##### Import a template")
    up = st.file_uploader(
        "Template JSON",
        type=["json"],
        key="manage_import_uploader",
        help="Paste a colleague's exported JSON file here to add it to your library.",
    )
    if up is not None:
        try:
            imported = template_from_json(up.read().decode("utf-8"))
            save_template(imported)
            st.success(f"Imported **{imported['name']}** (slug `{imported['slug']}`).")
            st.rerun()
        except Exception as e:
            st.error(f"Import failed: {e}")
    st.divider()
    st.markdown("##### Existing templates")
    if not templates:
        st.caption("No templates yet — build one in **Build template** mode.")
        return
    for t in templates:
        slug = t["slug"]
        with st.container(border=True):
            c1, c2, c3, c4 = st.columns([3, 3, 2, 2])
            with c1:
                st.markdown(f"**{t['name']}**")
                st.caption(f"`{slug}`")
            with c2:
                st.caption(f"Updated: {t.get('updated_at', '—')}")
                if t.get("notes"):
                    st.caption(t["notes"])
            with c3:
                try:
                    full = load_template(slug)
                    payload = template_to_json(full)
                    st.download_button(
                        "Export",
                        data=payload.encode("utf-8"),
                        file_name=f"{slug}.json",
                        mime="application/json",
                        key=f"export_{slug}",
                    )
                except Exception as e:
                    st.error(f"Read failed: {e}")
            with c4:
                if st.button("Delete", key=f"del_{slug}"):
                    delete_template(slug)
                    st.success(f"Deleted `{slug}`.")
                    st.rerun()
 # ===========================================================================
 # Dispatch
 # ===========================================================================
 if mode == "Extract":
    _render_extract_mode()
 elif mode == "Build template":
    _render_build_mode()
 elif mode == "Manage templates":
    _render_manage_mode()
--- a/src/gui/tools_registry.py
+++ b/src/gui/tools_registry.py
@@ -145,6 +145,18 @@ TOOLS: list[Tool] = [
        status="Ready",
        section="automations",
    ),
    Tool(
        tool_id="10_pdf_extractor",
        icon=":material/picture_as_pdf:",
        name="PDF to CSV",
        description=(
            "Extract bank-statement transactions from PDFs using reusable "
            "per-source templates."
        ),
        page_slug="10_PDF_Extractor",
        status="Ready",
        section="transformations",
    ),
 ]
--- a/src/i18n/packs/en.json
+++ b/src/i18n/packs/en.json
@@ -158,6 +158,12 @@
      "description": "Chain tools in recommended order and pass output between steps.",
      "page_title": "Automated Workflows",
      "page_caption": "Chain DataTools cleaning steps into one repeatable workflow. The pipeline recommends an order; you stay in control."
    },
    "10_pdf_extractor": {
      "name": "PDF to CSV",
      "description": "Extract bank-statement transactions from PDFs using reusable per-source templates.",
      "page_title": "PDF to CSV",
      "page_caption": "Extract transaction tables from bank-statement PDFs. Build one template per source and reuse it for every statement that follows the same layout. Runs locally — your data never leaves this computer."
    }
  },
  "nav": {
--- a/src/i18n/packs/es.json
+++ b/src/i18n/packs/es.json
@@ -158,6 +158,12 @@
      "description": "Encadena herramientas en el orden recomendado y pasa la salida entre pasos.",
      "page_title": "Flujos automatizados",
      "page_caption": "Encadena pasos de limpieza de DataTools en un flujo repetible. La canalización recomienda un orden; tú mantienes el control."
    },
    "10_pdf_extractor": {
      "name": "PDF a CSV",
      "description": "Extrae transacciones de extractos bancarios en PDF usando plantillas reutilizables por origen.",
      "page_title": "PDF a CSV",
      "page_caption": "Extrae tablas de transacciones de extractos bancarios en PDF. Crea una plantilla por origen y reutilízala para cada extracto que siga el mismo formato. Se ejecuta localmente — tus datos no salen de este equipo."
    }
  },
  "nav": {