10 Commits

Author SHA1 Message Date
41ab2166ef build(ci): wire macOS code signing + notarization into release workflow
Add a guarded "Sign & notarize macOS app" step to build.yml that signs
dist/DataTools.app with the Developer ID (hardened runtime + entitlements
+ secure timestamp), notarizes via notarytool, and staples the ticket —
running before DMG packaging. The step exits 0 with a warning when the
MACOS_* secrets are absent, so dry-run dispatches still produce an
(unsigned) build.
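
The guarded step described above could be sketched as follows. This is a minimal illustration, not the contents of build.yml: the four secret names are hypothetical (the message only says they share the MACOS_ prefix), and the zip step is an assumption added because notarytool takes an archive rather than a bare .app.

```python
from typing import Mapping

# Hypothetical secret names; the commit only states the MACOS_ prefix.
REQUIRED = ("MACOS_SIGN_IDENTITY", "MACOS_APPLE_ID",
            "MACOS_TEAM_ID", "MACOS_APP_PASSWORD")


def signing_commands(env: Mapping[str, str],
                     app: str = "dist/DataTools.app",
                     entitlements: str = "build/macos/entitlements.plist"):
    """Return the sign / notarize / staple commands, or [] if a secret is absent."""
    if any(not env.get(name) for name in REQUIRED):
        return []  # caller prints a warning and exits 0, leaving the build unsigned
    archive = app + ".zip"  # assumption: submit a zip of the app for notarization
    return [
        # Hardened runtime + entitlements + secure timestamp, as listed above.
        ["codesign", "--force", "--options", "runtime", "--timestamp",
         "--entitlements", entitlements,
         "--sign", env["MACOS_SIGN_IDENTITY"], app],
        ["ditto", "-c", "-k", "--keepParent", app, archive],
        ["xcrun", "notarytool", "submit", archive,
         "--apple-id", env["MACOS_APPLE_ID"],
         "--team-id", env["MACOS_TEAM_ID"],
         "--password", env["MACOS_APP_PASSWORD"], "--wait"],
        ["xcrun", "stapler", "staple", app],
    ]
```

Returning an empty list keeps the "secrets absent" case a normal outcome instead of an error, which matches the dry-run behaviour the step is meant to preserve.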

Add build/macos/entitlements.plist with the hardened-runtime entitlements
a frozen PyInstaller/CPython app needs (JIT memory, library-validation
disabled for bundled .so/.dylib + Tesseract). Update build/README.md to
reflect that macOS signing is now wired and only needs the secrets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:56:17 +00:00
fd9606c67b build: drop the local Python release method, return to CI-only installer builds
Removes the single-command Python packaging method (build/make_release.py
+ build/build_portable_zip.py + build/macos/build_zip.sh) and the portable
.zip artifacts it produced. Release builds go back to the original GitHub
Actions process: the CI matrix builds one installer per platform (.dmg /
.exe / .AppImage) on tag push and attaches them to a GitHub Release.

Tesseract OCR bundling is preserved: the fetch helpers the workflow depends
on (fetch_tessdata, fetch_tesseract_for_platform) are extracted into a
standalone build/tesseract.py, which build.yml now imports.

Docs (README, build/README, DEVELOPER, TECHNICAL, USER-GUIDE, vendor README,
es translations) updated to drop the portable-zip flavor and point at the
new module.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 17:47:36 +00:00
b703911df3 docs: reflect bundled Tesseract on every install surface
- NEW LICENSE_TESSERACT.txt at the repo root: header noting it covers
  the bundled Tesseract OCR binary (Apache 2.0, upstream
  tesseract-ocr/tesseract, copyright Google + contributors) and the
  eng.traineddata from tessdata_best (also Apache 2.0). Clarifies
  DataTools itself remains proprietary. Full canonical Apache 2.0
  license text included.
- README.md + README.es.md (Download section): bumped size estimate
  ~200 MB → ~300 MB, added a short paragraph stating Tesseract OCR
  is bundled (no separate install required), with a link to the new
  license file.
- docs/USER-GUIDE.md + docs/USER-GUIDE.es.md (§1.6 System
  requirements): bumped disk estimate, added a paragraph stating
  Tesseract 5.5 + eng.traineddata ship inside every installer /
  portable / AppImage, with a source-install fallback hint pointing
  developers to DEVELOPER.md.
- docs/DEVELOPER.md: new "PDF Extractor — bundled Tesseract" section
  documenting the runtime layout (sys._MEIPASS / tesseract / …),
  discovery order, source of bytes (build/vendor/tessdata + per-
  platform fetch in make_release.py), version pin, update recipe.
- docs/TECHNICAL.md: new §3.10 "Bundled Tesseract (PDF Extractor
  OCR)" — short version of the discovery order for the build
  pipeline section.
- build/README.md: distribution-outputs paragraph now lists
  Tesseract among bundled deps with the ~250-300 MB estimate; new
  "Tesseract bundling" section: layout diagram, resolver order,
  source of bytes + 5.5.0 pin, update steps, license-file ref.

Out-of-scope gaps noted by the docs sweep:
- docs/FUTURE-TOOLS.md §D still describes Tesseract bundling as a
  high-risk packaging headache; now superseded. Worth a one-line
  "(resolved — bundled as of v1.x)" callout in a future pass.
- USER-GUIDE §2 "What's included" table doesn't list PDF Extractor
  at all (it shipped in b8aff86…967d3f6). Separate gap to close.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:20:50 +00:00
93ccada974 build: bundle Tesseract 5.5.0 + tessdata into every release artifact
End users no longer have to install Tesseract separately for OCR on
scanned PDFs — the engine ships inside the installer, portable .zip,
and AppImage for all three platforms.

Per-platform fetch in build/make_release.py (run before PyInstaller):
- Windows: download UB-Mannheim installer 5.5.0.20241111, extract
  with 7-Zip, copy tesseract.exe + required DLLs into the staging dir.
- macOS: ``brew install tesseract``, copy binary + every Homebrew-
  prefixed dylib resolved via otool -L (recurse one level for
  transitive deps), then install_name_tool rewrites IDs / load paths
  to @loader_path/... so the bundle is relocatable.
- Linux: ``apt-get install tesseract-ocr libtesseract5``, copy binary
  + every non-system .so from ldd output, patchelf --set-rpath '$ORIGIN'.
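
The Linux step (pick the non-system libraries out of ldd output) can be illustrated with a small parser. This is a sketch only: the function name and the list of libraries treated as "system" are assumptions, since the message does not say which ones are skipped.

```python
import re

# Assumption: libraries left to the host system. Illustrative, not the real list.
SYSTEM_PREFIXES = ("linux-vdso", "ld-linux", "libc.so", "libm.so",
                   "libdl.so", "libpthread.so", "librt.so")

# Matches lines such as: "libfoo.so.1 => /usr/lib/libfoo.so.1 (0x00007f...)"
_LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(\S+)\s+\(0x[0-9a-f]+\)\s*$")


def bundle_candidates(ldd_output):
    """Return resolved paths of the shared libraries that should be copied."""
    paths = []
    for line in ldd_output.splitlines():
        match = _LDD_LINE.match(line)
        if not match:
            continue  # vdso, the dynamic loader, and "not found" entries
        name, path = match.groups()
        if name.startswith(SYSTEM_PREFIXES):
            continue
        paths.append(path)
    return paths
```

Each returned path would then be copied next to the binary before patchelf sets the rpath to '$ORIGIN'.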

Wire-up:
- build/datatools.spec reads DATATOOLS_TESS_STAGING env var (set by
  make_release) and adds the staging dir + tessdata + the
  LICENSE_TESSERACT.txt Apache 2.0 attribution to PyInstaller datas
  so they land at <bundle>/tesseract/{tesseract[.exe],tessdata/}
  and the license sits at the bundle root. Soft-warns when staging
  is empty so dev spec runs still complete.
- English tessdata pulled by fetch_tessdata() from
  tesseract-ocr/tessdata_best (eng.traineddata, ~16 MB). Cached at
  build/vendor/tessdata/.
- .github/workflows/build.yml: actions/cache@v4 step keyed on
  ``tesseract-${{ runner.os }}-5.5.0-tessdata_best-v1`` caches the
  staging dir and the vendored tessdata across runs; apt installs
  patchelf on the Linux runner; PyInstaller step now receives the
  DATATOOLS_TESS_STAGING env var.
- .gitignore: build/_tesseract/ and the .traineddata blob.
- TESSERACT_SKIP_FETCH=1 honored for offline / manual stages.
- Installer / .dmg / .zip / AppImage scripts: one-line comments
  confirming Tesseract rides along automatically via PyInstaller's
  datas (no extra packaging steps required in those scripts).
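
As a rough sketch of the spec-side wiring described above: the helper below builds PyInstaller-style datas tuples (source path, destination directory inside the bundle) from the DATATOOLS_TESS_STAGING variable. The function name, the warning text, and the exact destination folders are assumptions chosen to match the stated bundle layout; the real datatools.spec may be organised differently.

```python
import os
import warnings
from pathlib import Path


def tesseract_datas(environ=os.environ, repo_root="."):
    """Return (source, dest_dir) tuples for the Tesseract parts of the bundle."""
    datas = []
    staging = environ.get("DATATOOLS_TESS_STAGING", "")
    if staging and Path(staging).is_dir() and any(Path(staging).iterdir()):
        datas.append((staging, "tesseract"))
    else:
        # Soft warning only, so developer spec runs still complete.
        warnings.warn("Tesseract staging dir is empty; OCR will not be bundled")
    tessdata = Path(repo_root) / "build" / "vendor" / "tessdata"
    if tessdata.is_dir():
        datas.append((str(tessdata), "tesseract/tessdata"))
    license_file = Path(repo_root) / "LICENSE_TESSERACT.txt"
    if license_file.is_file():
        datas.append((str(license_file), "."))  # license sits at the bundle root
    return datas
```

Passing the environment in as a parameter keeps the helper easy to exercise without touching the real process environment.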

Bundle-size delta: ~50-70 MB on disk per platform, ~25-40 MB post-
compression. Net installer size ~250-300 MB (was ~120 MB) — accepted
tradeoff for zero end-user OCR setup.

This reverses the earlier "don't bundle Tesseract" decision (option A).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:20:33 +00:00
9c426194b1 build: add single-command release script + portable zip artifacts
One-developer workflow: ``python build/make_release.py`` on each
target OS produces both the installer and a portable .zip for that
platform. Preflight checks PyInstaller / Pillow / iscc / hdiutil /
ditto / appimagetool and bails with install hints if anything is
missing — no half-built dist/.

New scripts:
- build/make_release.py   — orchestrator, auto-detects host OS.
- build/generate_icons.py — icon.ico / icon.icns / icon.png from
  src/gui/assets/datatools_icon_256.png (Pillow ships ICO + ICNS
  writers; no platform tooling needed).
- build/build_portable_zip.py — Win/Linux portable zip via stdlib.
- build/macos/build_zip.sh — Mac portable .app via ditto so
  bundle metadata survives.

installer.iss now adds: Quick Launch task (opt-in, legacy Win 7),
App Paths registry entry (Win+R "DataTools" works), SetupIconFile,
UninstallDisplayIcon, AppSupportURL, AppUpdatesURL.

CI workflow uploads installer + portable per platform and attaches
both to GitHub Releases on tag push.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 19:30:17 +00:00
bece2b4030 refactor(pdf): rip out templates; heuristic scan + selectable table
User feedback: the template / visual-picker / mode-dispatch
implementation was too complex for the actual workflow.
Statements drift between months, the canvas state didn't survive
multi-page navigation, and accountants don't want to maintain
per-bank configuration just to convert PDFs to CSV.

Start-over design — one public function, one page, no
persistence:

  ``scan_pdf_for_transactions(pdf_bytes) → (rows, warnings)``

A row is "any text line with a date pattern AND at least one
amount pattern." Each detected row is a dict shaped::

    {
      "date": "2026-01-15",
      "description": "Coffee Shop",
      "amount_1": -4.50,
      "amount_2": 1000.00,   # if a second amount was found
      "page": 1,
      "raw": "01/15/2026 Coffee Shop (4.50) 1,000.00",
      "source_file": "chase-jan-2026.pdf",
    }

Multi-line descriptions still merge (no-date no-amount lines
attach to the previous transaction). Multi-PDF batches share a
single combined table with a ``source_file`` column.
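
The row rule and the multi-line merge can be sketched on plain text lines. This is an illustration under stated assumptions, not the code in ``src/pdf_extract.py``: the helper name ``scan_text_lines`` is hypothetical, it takes already-extracted lines instead of PDF bytes, it returns rows only (no warnings), and it recognises only MM/DD/YYYY dates and amounts with two decimals.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
AMOUNT_RE = re.compile(r"\(?-?\$?\d+(?:,\d{3})*\.\d{2}\)?")


def _parse_amount(token, parens_negative=True):
    negative = token.startswith("-") or (parens_negative and token.startswith("("))
    digits = token.strip("()").replace("$", "").replace(",", "").lstrip("-")
    return -float(digits) if negative else float(digits)


def scan_text_lines(lines, page=1, source_file="", parens_negative=True):
    """Detect rows: a line with a date AND at least one amount after it."""
    rows = []
    for line in lines:
        text = line.strip()
        date_match = DATE_RE.search(text)
        has_amount = AMOUNT_RE.search(text) is not None
        if date_match:
            amounts = list(AMOUNT_RE.finditer(text, date_match.end()))
            try:
                date = datetime.strptime(date_match.group(), "%m/%d/%Y").date()
            except ValueError:
                continue  # looked like a date but is not a valid one
            if not amounts:
                continue
            row = {
                "date": date.isoformat(),
                "description": text[date_match.end():amounts[0].start()].strip(),
                "page": page,
                "raw": text,
                "source_file": source_file,
            }
            for index, match in enumerate(amounts, start=1):
                row[f"amount_{index}"] = _parse_amount(match.group(), parens_negative)
            rows.append(row)
        elif rows and text and not has_amount:
            # No date and no amount: continuation of the previous description.
            rows[-1]["description"] += " " + text
    return rows
```

Lines that carry an amount but no date (for example a statement total) are neither rows nor continuations in this sketch, so they are dropped.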

**Page UX:**

- Upload PDF(s) → optional Options expander (parens-negative,
  use-OCR) → click Scan → see all detected rows in an
  ``st.data_editor``.
- The editor has an ``Include`` checkbox column (default on),
  plus user-editable date / description / amount cells and a
  read-only ``raw`` column showing the original PDF text for
  verification.
- A ``Columns to include in CSV`` multiselect hides
  ``page`` / ``raw`` from the download by default; user can
  re-add either.
- Download CSV gets only the checked rows.

No template save/load. No visual picker. No mode dispatch. No
column boundaries. No schema migration. No per-bank
configuration files.

**Deletions:**

- ``src/pdf_templates.py`` — template storage layer
- ``src/gui/_drawable_canvas_compat.py`` — Streamlit compat shim
  for the canvas (no canvas now)
- ``tests/test_pdf_templates.py``, ``test_pdf_row_heuristic.py``,
  ``test_drawable_canvas_compat.py`` — covered the removed APIs
- ``build/hooks/hook-streamlit_drawable_canvas.py`` — hook for
  the removed dep
- ``streamlit-drawable-canvas==0.9.3`` from ``requirements.txt``
- The drawable-canvas references in ``build/datatools.spec``

**``src/pdf_extract.py``** shrinks from ~30 helper functions to
~10. Keeps: value parsers, row clusterer, date/amount token
finders, OCR pipeline, dependency guards. The one new public
function ``scan_pdf_for_transactions`` glues them together.

**Tests** (59 passing): the unit layer keeps full coverage of
the building blocks; the smoke layer pins the end-to-end PDF
roundtrip, OCR discovery, dependency-import behavior, and the
multi-line-description merge. The fpdf2-generated fixture PDF
still drives the real-PDF test.

Rollback: ``git revert HEAD`` brings back the template system if
needed — but the simpler model should make that unlikely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:57:30 +00:00
538e23d219 build(pdf): bundle PDF deps in installers + pin versions + smoke tests
Three changes prepare the next tagged release so end users get
the PDF Extractor without ever touching pip.

**Exact-pin the new deps** (``requirements.txt``):

  pdfplumber==0.11.9
  pypdfium2==5.8.0
  pytesseract==0.3.13
  streamlit-drawable-canvas==0.9.3

Tight pins are the right call for these because the GUI's
visual-picker geometry + the parsing-pipeline word positions
depend on stable internal behavior — a quiet upstream tweak to
``extract_words`` or ``page.render`` would re-break the tool on
the next CI build. Bumping requires a deliberate edit plus a CI
run, rather than happening silently when ``pip install`` resolves
to whatever version upstream last published.

Existing deps stay on their current ``>=X.Y,<X+1`` ranges; the
user's "tight pin" concern is specifically about the PDF stack.

**Wire the new deps into the PyInstaller bundle** (``build/``):

- ``datatools.spec`` — add ``collect_submodules`` for pdfplumber,
  pdfminer, pypdfium2, streamlit_drawable_canvas, PIL,
  pytesseract; add ``collect_data_files`` for pypdfium2 (PDFium
  native ``.dll``/``.so``/``.dylib``), streamlit_drawable_canvas
  (frontend JS bundle), pdfminer (Adobe CMap tables).
- ``hooks/hook-pypdfium2.py`` — belt-and-braces hook that uses
  ``collect_dynamic_libs`` to force-include the PDFium binary.
  Without this the visual picker silently fails on installed
  builds with a ``FileNotFoundError`` for the shared library.
- ``hooks/hook-streamlit_drawable_canvas.py`` — collects the
  built JS frontend so the canvas iframe loads under the bundled
  Streamlit server instead of rendering blank.

**Tesseract is intentionally NOT bundled** (option A from the
design discussion). Modern bank statements are text-based;
bundling Tesseract would ~triple installer size for a long-tail
case. The in-app banner directs users to install it from
``UB-Mannheim/tesseract`` if they need OCR. Decision is captured
in the ``project-pdf-installer-pending`` memory note.

**Smoke tests** (``tests/test_pdf_extract_smoke.py``, 17 tests)
add the layer above the pure unit tests:

- ``TestDependencyImports`` — each dep imports cleanly
- ``TestRealPdfRoundTrip`` — generates a tiny statement PDF in
  memory with ``fpdf2`` (test-only dep in
  ``requirements-dev.txt``), runs ``extract_pages`` +
  ``apply_template``, asserts 3 rows out with the right signed
  amounts. Catches "the build succeeded but pdfplumber breaks at
  runtime."
- ``TestRenderPageImage`` — exercises ``pypdfium2.render`` so the
  hook-bundled native lib gets a real call. This is the most
  common installer-bug signature (missing .dll) and the test
  catches it before users do.
- ``TestPdfDependencyMissing`` — monkeypatches ``__import__`` to
  simulate a stripped install; confirms the typed exception +
  actionable hint round-trip.
- ``TestPinnedVersionsMatchInstalled`` — parametrized over all
  four pinned dists; uses ``importlib.metadata`` rather than
  ``__version__`` because pypdfium2 doesn't expose it directly.
  Trips if someone bumps the pin without reinstalling.
- ``TestOcrAvailability`` — confirms ``ocr_available()`` returns
  ``(bool, str)`` and ``extract_pages_auto(allow_ocr=False)``
  skips OCR cleanly.
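
The pinned-version check could look roughly like this. It is a sketch of the idea behind ``TestPinnedVersionsMatchInstalled``, not the test itself: both helper names are hypothetical, and only plain ``name==version`` lines are recognised.

```python
import re
from importlib import metadata

_PIN = re.compile(r"^\s*([A-Za-z0-9._-]+)\s*==\s*([^\s#;]+)")


def parse_pins(requirements_text):
    """Return {distribution: version} for every exact ``name==version`` line."""
    pins = {}
    for line in requirements_text.splitlines():
        match = _PIN.match(line)
        if match:
            pins[match.group(1)] = match.group(2)
    return pins


def mismatched_pins(pins):
    """Return {name: (pinned, installed_or_None)} where the environment differs."""
    problems = {}
    for name, pinned in pins.items():
        try:
            # importlib.metadata works even when a package has no __version__.
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            problems[name] = (pinned, installed)
    return problems
```

A test built on this would assert that ``mismatched_pins(parse_pins(...))`` is empty for the contents of ``requirements.txt``, so a bumped pin without a reinstall fails fast.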

All 81 PDF + audit tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:10:43 +00:00
db5ec084da docs+code: rename tool labels everywhere
Sweep follow-up to 93e43fc. Display labels now consistent across docs,
landing pages, CLI output, code comments, docstrings, and test prose.
Five parallel surfaces touched:

- docs (EN + ES): README, USER-GUIDE, CLI-REFERENCE, and 11 internal
  design/planning docs
- landing pages: index + bookkeeper/revops/shopify-pet
- src: CLI module docstrings, _TOOL_DISPLAY dicts in cli_analyze.py
  and gui/components/_legacy.py, core module headers, every tool
  page's module docstring
- tests: class/method/module docstrings and section-header comments
- test-cases READMEs

Page slugs (1_Deduplicator etc.), tool_id strings (01_deduplicator
etc.), Python class names (TestDeduplicatorWorkflow, FeatureFlag.*),
URL paths, anchor IDs, CSS classes, and asset filenames were left
intact since they're code identifiers / structural references.

All 2033 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 19:50:09 +00:00
4706ed571e build: wire desktop-bundle pipeline (CI matrix + per-platform installers)
Stand up the seamless-download path for non-technical buyers:

* .github/workflows/build.yml — matrix CI (mac/win/linux) that builds
  PyInstaller bundles and packages them per platform on tag push,
  attaching the resulting installers to a GitHub Release.
* build/installer.iss — Inno Setup script for the Windows installer
  (per-user install, optional desktop shortcut, runs on finish).
* build/macos/build_dmg.sh — wraps DataTools.app into a .dmg with a
  drag-to-/Applications layout.
* build/appimage/{AppRun,datatools.desktop,build.sh} — AppImage recipe.
* src/__init__.py — single source of truth for __version__; the spec
  reads it (was hardcoded), CI passes it through to all packagers.
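
One possible way for a spec or packaging script to read the version without importing the package is shown below. This is an assumption about the mechanism (the message only says the spec reads ``__version__`` from src/__init__.py), and ``read_version`` is a hypothetical helper name.

```python
import re
from pathlib import Path

_VERSION = re.compile(r"""^__version__\s*=\s*["']([^"']+)["']""", re.MULTILINE)


def read_version(init_path="src/__init__.py"):
    """Extract the __version__ string literal from a Python source file."""
    text = Path(init_path).read_text(encoding="utf-8")
    match = _VERSION.search(text)
    if match is None:
        raise RuntimeError(f"__version__ not found in {init_path}")
    return match.group(1)
```

Parsing the file as text avoids importing application code at build time, which keeps the packagers independent of the runtime dependencies.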

Buyer download path now lives in the top-level README. Per-build
README documents the Phase 2 step (signing/notarization) that needs
the owner's Apple Developer + Windows code-signing credentials —
those are intentionally not in CI yet because they require setup
outside this repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:58:43 +00:00
e1f364f010 feat: Tier B operator scaffolding — bundle, copy SoT, posts, emails
Pick up and finish yesterday's cut-off Tier B pass.

- build/: PyInstaller scaffold (datatools.spec + launcher.py +
  hook-streamlit.py + README) — folder-mode bundle, locked
  127.0.0.1, per-OS recipe
- marketing/COPY.md: single source of truth for every customer-facing
  string — landing H1/sub/CTAs, demo CTAs, email subjects, Gumroad
  listing, banned phrases
- marketing/community-posts/: 9 drafts (3 posts × 3 niches:
  bookkeeper, revops, shopify-pet) — story / tip / soft-offer
- marketing/emails/: 18 drafts (Gumroad delivery + 5-touch
  onboarding × 3 niches), per-niche segmentation guidance
- docs/NEXT-STEPS.md: flip 2.2 / 2.4 / 3.1 / 3.4 to done with
  pointers to the new assets; add Phase 0 inventory rows
- .gitignore: narrow `build/` ignore so PyInstaller spec + launcher
  + hooks get tracked, only generated artifacts (build/build/,
  build/__pycache__/, build/dist/) stay ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 14:04:37 +00:00