Commit Graph

163 Commits

Author SHA1 Message Date
7ad19ac7f4 feat(nav,i18n): sticky footer with Back-to-Home + localized tool headers
Two unrelated UX issues addressed in one sweep across all nine tool
pages because they share the same edit surface.

(1) Sticky footer replaces the top + bottom back-link buttons.

Reported: a big white empty footer space at the bottom of every page;
the Back to Home button at the top scrolled out of view on long pages.

New ``render_sticky_footer()`` helper in ``components/_legacy.py``
injects a fixed-position bar at ``bottom: 0`` of the viewport with:

- A border-top so it visually reads as a non-movable bar.
- A semi-transparent background (rgba 0.96 + ``backdrop-filter: blur``)
  so content underneath shows through faintly when the user scrolls.
- A styled ``<a href="home">`` anchor (not an ``st.button``) because
  Streamlit widgets can't be CSS-positioned reliably — Streamlit owns
  the widget's DOM container and re-mounts it on every rerun. A real
  anchor sits exactly where the CSS puts it and triggers Streamlit's
  URL routing to the home page.
- ``padding-bottom: 3.5rem`` on the main container so the last widget
  isn't hidden behind the bar.

Called once per tool page, immediately after ``hide_streamlit_chrome()``
so it renders even on pages that ``st.stop()`` early before any other
content runs. The old top-and-bottom ``back_to_home_link()`` calls are
removed from every tool page; their entry/exit points were dropping
the button when the script short-circuited.

(2) Tool-page headers now localize.

Reported: switching the sidebar language picker to Spanish left the
tool page's title + caption in English. Root cause: every page had
hard-coded ``st.title("✂️ Clean Text")`` / ``st.caption("Trim
whitespace...")`` strings.

Added per-tool ``tools.<id>.page_title`` and
``tools.<id>.page_caption`` keys to ``en.json`` and ``es.json`` for
all nine tools. Routed each page's title/caption call through ``t()``.
Verified: with ``ui_lang=es`` set, the Clean Text page now renders
"✂️ Limpiar texto" + the Spanish caption.

Updated ``tests/gui/test_smoke.py::EXPECTED_SUBSTRINGS`` so the
``es`` column for each tool page asserts the actual Spanish string
(was a duplicate of the English string back when the page bodies
were English-only).

2220 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 00:42:45 +00:00
84e4665ab0 fix(home): make per-file Remove button reliable
Reported: the "✕" buttons on the uploaded file list removed files
inconsistently — some clicks took, some didn't.

Two compounding causes:

1. ``key=f"_home_remove_{name}"`` embedded the raw filename in the
   Streamlit widget key. Streamlit's widget-identity machinery
   normalizes keys differently across reruns when they contain
   spaces, dots, brackets, or non-ASCII characters, so a button's
   identity could shift between the render where the user clicked
   it and the rerun that should have processed the click. The click
   was registered, but the post-rerun render produced a new widget
   under a new effective key, and the original click was "lost".

2. The handler mutated ``home_uploads`` mid-loop while subsequent
   iterations were still creating buttons. ``st.rerun()`` raises
   synchronously, but if ANOTHER button's state changed in the same
   pass (e.g. a stale click held over from a fast double-tap), the
   ordering of state-mutation vs widget-key-update vs rerun could
   race.

Fixes:

- Stable widget keys: ``f"_home_remove_{sha1(name)[:10]}"``. The
  hash is identifier-safe regardless of spaces / dots / Unicode in
  the filename. Verified across "sample with spaces.csv",
  "sample.csv", and "日本語.csv" — three sequential Remove clicks
  each remove exactly one file with no clicks lost.

- Two-phase capture: the loop collects the target ``to_remove``
  filename, finishes rendering every other row at consistent widget
  identity, THEN mutates state once and reruns. No more mid-loop
  ``del`` racing other widgets' click handlers.

- Wider click target: column ratio ``[8, 1]`` (was ``[12, 1]``) and
  ``use_container_width=True`` on the Remove button so the click
  surface fills the entire column. Label changed to "Remove" for
  the same reason — "✕" is a thin glyph that compressed the
  hit-test region.

2220 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 00:34:20 +00:00
4685bb4289 style(chrome): tighter vertical rhythm — less whitespace across screens
Reported: too much whitespace between widgets, dividers, and headings.

Compact-spacing CSS layer added to ``_HIDE_CHROME_CSS`` (so it applies
on every page that calls ``hide_streamlit_chrome``):

- ``[data-testid="stVerticalBlock"]`` and ``stHorizontalBlock`` gap
  trimmed from Streamlit's default ~1rem to 0.5rem.
- Heading margins (h1-h4) tightened — h1/h2/h3 used to leave 1-1.5rem
  above; now 0.25-0.5rem.
- ``hr`` (``st.divider()``) drops from 1rem above+below to 0.4rem.
- Markdown paragraphs and captions: 0.25rem bottom margin instead of
  the default 1rem.
- Expander summary padding reduced (0.35rem top/bottom).
- File-uploader, button, and metric tiles: trimmed internal padding.

Also slimmed the main-container padding from 1rem top / Streamlit
default bottom (~6rem) to 0.5rem top / 0.75rem bottom.

The existing ``zoom: 0.85`` on ``.stApp`` is kept — the user wanted
*less white space*, not *smaller content*, and dropping zoom would
shrink type alongside everything else.

2220 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 00:28:58 +00:00
e96d5901f4 fix(close): graceful about:blank fallback + display-mode aware hint
Reported: user asked whether we can send Alt+F4 / Ctrl+W to the
browser from JavaScript to force-close a tab.

Honest answer that's now baked into the hint message: NO. Synthesized
keyboard events from page JS only reach DOM event listeners, not the
browser chrome or the OS. There is no flag, API, or trick that lets
a page close a tab the user opened themselves. The page CAN close a
window it opened (window.opener trail) or one whose display-mode is
``standalone`` (Chrome/Edge ``--app=URL``) — that's what
``python -m src.gui`` arranges, and that's the path that actually
closes the window without a manual Ctrl+W.

Improvements landed:

1. ``isStandalone(win)`` detects Chrome --app windows up front
   (``matchMedia('(display-mode: standalone)').matches``). In a
   regular tab the manual hint surfaces immediately on the
   "Close this window" click; in --app mode we only show it if the
   close attempt actually fails.

2. ``fallbackToBlank(win)`` navigates the tab to ``about:blank``
   via ``location.replace`` (no history pollution) so the user
   sees a clean empty tab instead of the farewell overlay frozen
   over Streamlit's connection-error banner. They still have to
   Ctrl+W the blank tab, but the screen is no longer a misleading
   "did it close or not?" mess. Fires 250 ms after a failed close
   in --app mode (very rare path), or 1.5 s in a regular tab so
   the user has time to read the hint.

3. Hint message rewritten in en + es to explain WHY the close is
   blocked (browser security — not something we can override), to
   acknowledge the Alt+F4 / Ctrl+W question directly (those don't
   work either, for the same reason), and to point at
   ``python -m src.gui`` as the path that gives a clean auto-close.

2220 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 00:07:51 +00:00
ecfc52499f fix(home): persist upload list across page navigation
Reported: clicking "Back to Home" from a tool page returned the user
to an empty home — their previously-uploaded files were gone.

Root cause: Streamlit's ``st.file_uploader`` widget state does not
reliably survive ``st.switch_page``. The widget gets unmounted on
navigation, and its ``UploadedFile`` objects don't always re-attach
on remount. The home page was treating the widget's return value as
the source of truth, so after navigation the list was empty.

Fix: introduce a session-state stash keyed by filename
(``home_uploads: dict[str, {"bytes": bytes, "size": int}]``) and
treat it as the source of truth for everything downstream — the
active-file pickup keys for tool pages, the per-file findings
cache, and the rendered file list. The widget is reduced to its
narrow role of capturing NEW uploads, which we merge into the stash
without ever removing.

Per-file remove: a "✕" button next to each filename drops just that
file (and its findings). The widget's own "✕" is bypassed by our
rendering, since trusting it would let the widget's state diverge
from the stash.

Clear-results button is unchanged: it wipes only the analysis cache,
leaving uploaded files intact (per the user's "persistent until
cleared" requirement — removal is per-file via "✕").

Tool-page compatibility: the singular ``home_uploaded_{name,size,
bytes}`` keys still get populated from the first entry in the stash
on every render, so ``pickup_or_upload`` on a tool page keeps
finding the active upload. When the user removes the active file,
those keys are cleared so the next render repopulates from whatever
file is now first.

``_StashedUpload`` is a small duck type ( ``.name``, ``.size``,
``.getvalue()`` ) so ``_run_analysis_on_upload`` accepts entries
restored from the stash without changes.

2220 tests pass. Smoke-verified via AppTest: pre-stashed
``home_uploads`` renders the file list with per-file remove buttons,
and the persistent state survives a simulated navigation round-trip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 00:04:12 +00:00
21fd8a4cd7 fix(nav): switch_page resolves correctly + bottom-of-page back link
Two issues, same fix surface.

(1) Reported crash on Back-to-Home:

    StreamlitAPIException: Could not find page: app.py.

``st.switch_page("app.py")`` doesn't work under ``st.navigation`` —
the entry script is the nav manager itself and is not a registered
page. The fix needs to pass an ``st.Page`` object whose script
identity matches one registered in the nav.

First-pass attempt (``from src.gui.app import _home_page``) hit a
worse failure: importing ``app.py`` from inside a tool-page render
re-executes the nav setup with the WRONG "main script" context, so
every ``st.Page("pages/N_foo.py", ...)`` call in ``_build_navigation``
fails with "file could not be found".

Extract the home renderer into its own module ``src/gui/_home.py``
which has no top-level Streamlit side effects. Both the nav manager
and the back-link helper import ``_home_page`` from there. The Page
object built at click time has the same callable identity as the one
registered, so ``st.switch_page`` resolves it.

(2) Reported UX: the back button scrolled out of view on long pages.

Add a second ``back_to_home_link(key="_back_to_home_link_bottom")``
call near the footer of every tool page (1-9). The unique key avoids
widget-id collision with the top instance. Coming-Soon stubs get it
unconditionally; Ready tools render it only after a result exists
because the page short-circuits with ``st.stop()`` before then —
when no result is on screen the page is short enough that the top
link is sufficient.

2220 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:58:33 +00:00
42f8d78dd5 fix(downloads): drop /select on Windows — opens wrong folder
Reported: clicking "Open Downloads folder" was opening the Documents
folder instead of Downloads. Root cause is the classic Windows
gotcha: when the path contains a space (e.g.
``C:\Users\Michael Dombaugh\Downloads``), Python's
``subprocess.Popen`` packs the ``/select,...`` argument into a single
quoted token, and Explorer's ``/select`` argument parser does NOT
accept that form — it silently falls back to whatever the user's
default Explorer view is (typically Documents).

Resolution paths considered:

- ``shell=True`` with a hand-built command string — works but opens
  the door to shell-injection if a file_name ever contained a quote
  or special char.
- ``cmd /c start "" explorer /select,...`` — same parsing issue.
- ctypes ShellExecuteW — pulls in a Windows-only dependency.
- **Skip /select. Open the folder directly.** ✓

Going with the last. ``explorer <folder>`` reliably opens the folder
regardless of spaces in the path; the user finds the freshly-saved
file by its name. The previous "highlight the file" nicety wasn't
worth the path-parsing fragility — every user folder on Windows is
``C:\Users\<name>`` and every Windows username can contain a space.

macOS keeps the ``open -R <file>`` reveal-in-Finder path because
macOS argument parsing is sane and that's a strict UX win.

2220 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:45:47 +00:00
0f89d7ba66 fix(downloads): use explorer /select on Windows + show open feedback
Reported: clicking "Open Downloads folder" did nothing visible. The
previous implementation called ``os.startfile(folder)`` on Windows,
which is known to silently no-op or open Explorer behind the active
window in some configurations (Streamlit running headless, no
foreground rights inherited by the click handler thread, etc.).

Switch to the more reliable ``explorer /select,<file>`` form:

- Opens Explorer with the just-saved file pre-highlighted instead of
  just navigating to the folder — better UX than the old behavior.
- explorer.exe is a real GUI process that's spawned in the user's
  session with foreground rights, so it shows up on top.
- Fallback chain on Windows: ``/select`` first, then plain
  ``explorer <folder>``, then ``os.startfile`` as a last resort.

macOS upgraded the same way: ``open -R <file>`` reveals in Finder
rather than opening the directory.

Linux: no reliable cross-distro reveal, so ``xdg-open <folder>``.

Plus user feedback at the call site:

- On successful dispatch: ``st.toast("Opening <folder>", icon="📂")``
  — confirms we tried, in case the window comes up behind the
  browser.
- On dispatch failure: ``st.warning`` with the full path the user
  can copy/paste into their file manager manually.

2220 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:25:06 +00:00
b9147f3b66 fix(downloads): save server-side to ~/Downloads + open-folder link
Switch the download mechanic from "browser <a download> with a data:
URL" to "write the bytes directly to the user's Downloads folder and
show them the exact path". DataTools runs as a local Streamlit app,
so the "server" IS the user's machine — there's no reason to go
through the browser save dialog at all.

Flow:

1. Click "Download <something>" button (rendered as a regular
   ``st.button``, so no widget-collision issues).
2. Bytes are written to ``Path.home() / "Downloads" / file_name``
   (overwriting any same-named file).
3. The page reruns and renders a success caption with the absolute
   path the file landed at.
4. An "📂 Open Downloads folder" button appears. Clicking it pops the
   OS file manager via ``os.startfile`` (Windows), ``open`` (macOS),
   or ``xdg-open`` (Linux).

Why this is better than the previous HTML-data-URL helper:

- Unambiguous about where the file went — user sees the full path,
  not "wherever your browser was configured to save".
- The data: URL approach base64-inflated the page payload by 33% and
  bloated for large outputs; server-side write is byte-for-byte.
- No more browser-side widget collision class of bug.
- The save action is a real Streamlit button, so the existing widget
  semantics (disabled, help tooltip, key isolation) work without
  workarounds.

API surface unchanged. New canonical name ``local_download_button``;
``html_download_button`` is kept as a back-compat alias that points
at the same implementation — every existing call site continues to
work without edits.

Tests are protected from polluting the developer's home dir via a
``DATATOOLS_DOWNLOADS_DIR`` env var override returned by the new
``_downloads_dir()`` helper. Smoke verified end-to-end via AppTest:
click → file appears in tmp dir → success banner shows path →
open-folder button renders.

2220 tests pass, 91 skipped, 35 s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:48:28 +00:00
5128d35961 fix(text-cleaner): hoist show_hidden + stress-test all tool pages
Reported crash: clicking "Clean Text" with mojibake.csv (a junk corpus
file that the cleaner ran on but produced zero changes) blew up the
results render with

    NameError: name 'show_hidden' is not defined

at the cleaned-preview block. ``show_hidden`` was defined inside
``if result.cells_changed:`` and referenced unconditionally below.

Fix on the page itself: hoist the ``show_hidden = st.toggle(...)``
declaration out of the conditional so it's always in scope for the
downstream cleaned-preview render. One toggle now drives both the
Examples table (which only renders when there are changes) AND the
cleaned preview (which always renders).

Generalized regression net: ``tests/test_junk_corpus_tool_pages.py``.
For nine representative junk files (empty, only_nul, mojibake,
invalid_utf8, utf16_le_no_bom, mismatched_columns, all_nulls,
corrupt_xlsx, single_column) and every Ready/Coming-Soon tool page,
the test:

1. Stashes the junk bytes as the home upload via session_state.
2. Runs the page through AppTest, asserts ``app.exception`` is empty.
3. If the page exposes a deterministic primary-action button label,
   clicks it and asserts no exception on the post-click render.

Pages that catch a bad file at read time and short-circuit via
``st.error`` + ``st.stop`` are correctly skipped from the
primary-action half (the button isn't rendered). A genuine crash
shows up as ``app.exception`` carrying a Python traceback — exactly
what the user reported, exactly what we now catch.

162 tests collected, 102 passed, 60 skipped. 4 seconds.

Full suite: 2220 passed, 91 skipped, 35 s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:41:14 +00:00
696996c119 test(junk-corpus): pathological-input stress suite for the analyzer
Build a corpus of 35 deliberately-broken files (empty bytes, NUL
bytes, mojibake, UTF-16 without BOM, mismatched columns, unescaped
quotes, corrupt zip, etc.) and pin the analyzer's stability contract
against them.

Files land in ``test-cases/junk-corpus/test_data/``. The generator
``make_junk_corpus.py`` produces them deterministically (one random
sample uses ``secrets.token_bytes`` — committed bytes are stable
across regenerations because the byte stream is captured at commit
time). README documents the categories and how to add new shapes.

``tests/test_junk_corpus.py`` parametrizes over every file in the
corpus and asserts:

1. ``_run_analysis_on_upload`` never raises — exceptions must be
   caught and surfaced as a synthetic ``Finding`` with
   severity="error". This was the user-reported crash for
   13_non_latin_scripts.csv that the previous fix in ae9d4a2
   defensively wrapped; the corpus now stops the regression
   from re-landing on a different shape.
2. Every Finding in the result list is well-formed (string id,
   valid severity, non-empty description).
3. A high-risk subset (empty.csv, only_bom.csv, only_nul.csv,
   corrupt_xlsx.xlsx) MUST surface at least one error-level
   Finding — otherwise the GUI would render "no issues found"
   for a structurally broken file.
4. Error-level Finding descriptions are at least 20 chars so the
   UI banner gives the user something to act on.

Also exclude ``junk-corpus`` from ``tests/test_fixtures_sweep.py``
since that sweep is happy-path (round-trip the text cleaner) and
fights with files designed to break it. The contract is enforced
by the dedicated junk-corpus test, not the sweep.

Runtime: 12 s for the junk-corpus tests, 30 s for the full
project suite (was 19 s without these). 2118 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:35:22 +00:00
ae9d4a2db5 fix(home): defensive analysis errors don't crash the whole page
Reported: uploading 13_non_latin_scripts.csv made the home page bubble
a ``pandas.errors.EmptyDataError`` traceback up through the page
chrome instead of surfacing as a per-file error. In a multi-file
analysis run that kills every other file's results too, which is
worse than the symptom itself.

Wrap ``_run_analysis_on_upload`` in proper error handling:

- Empty bytes ``getvalue() == b""`` short-circuits with a synthetic
  error Finding telling the user the upload was zero-byte and to
  re-upload.
- Empty ``repair.repaired_bytes`` (file was all NULs / BOM / stripped
  to nothing) likewise surfaces as a synthetic Finding rather than
  reaching pd.read_csv.
- ``pd.errors.EmptyDataError`` from pandas is caught and rendered as
  a Finding that names the file, its byte size, and suggests opening
  it in a text editor to verify the header row matches the data row
  delimiter.
- Any other exception during read/analyze is caught and surfaces as
  a Finding via ``format_for_user`` so the user gets a clean message,
  not a Python traceback.

Each file in a multi-file run now stands alone: a bad file produces
one red banner in its own card, every other file analyzes normally.

The 13_non_latin_scripts.csv corpus file is 249 bytes of valid UTF-8
on disk and parses cleanly under the same code path locally — the
user's specific symptom is likely a zero-byte upload (browser /
network / Python 3.14 + Streamlit edge case). The new ``empty_upload``
finding will name the bytes count so they can confirm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:22:10 +00:00
ef9f8b5de4 fix(close): Edge fallback + better tryClose + honest hint
There is no JavaScript override for browser tab-close security:
``window.close()`` only succeeds on windows JS opened (Chrome --app
windows qualify; a regular browser tab does not). What we can do is
make the --app path easier to hit and the failure case more
actionable.

Three changes:

1. ``src/gui/__main__.py`` — extend browser detection. PATH lookup
   now also looks for ``msedge`` / ``microsoft-edge``; Windows install
   candidates include the Edge install path; macOS candidates include
   Edge and Chromium. Edge is Chromium-based, supports ``--app``, and
   ships on every Windows 10+ machine — so users without Chrome no
   longer fall through to the regular browser tab. When the fallback
   IS hit, print a warning to stderr explaining why Close-from-page
   will require Ctrl+W. Renamed ``_find_chrome`` to
   ``_find_app_browser`` to reflect the broader scope.

2. ``_FAREWELL_SCRIPT_TEMPLATE`` in ``components/_legacy.py`` —
   factor close attempts into a ``tryClose`` helper that runs three
   escalating tries: standard ``win.close()``, the
   ``win.open('', '_self')`` history-rewrite trick (no-op in modern
   Chrome but free), and ``win.top.close()``. Auto-close on paint AND
   the manual button now both call this helper. Skip the manual hint
   if the close eventually succeeded between the click and the 250 ms
   timeout.

3. ``quit.close_hint`` in en/es i18n packs — rewrite the message to
   tell the user honestly that this is a browser security restriction,
   tell them the Ctrl+W keystroke that works, and point them at
   ``python -m src.gui`` for the auto-closing app-mode experience.

2008 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:17:18 +00:00
aeead05e4c fix(downloads): swap st.download_button for an HTML <a download> helper
Reported symptom: only the FIRST download button in a multi-button
row pops the browser save dialog. The second and third do nothing on
click. Affects every tool page that exposes (cleaned + audit + config)
downloads.

Root cause is ``st.download_button`` itself — when several render in
the same script pass, the click-to-bytes wiring on the browser side
mis-routes and only one button's data is actually exposed. Explicit
``key`` arguments don't fix it; ``use_container_width=True`` doesn't
help either; we confirmed this in the Text Cleaner reverts.

Replace the widget with a real ``<a download="file" href="data:...">``
anchor rendered via ``st.markdown(..., unsafe_allow_html=True)``.
Bypasses Streamlit's widget machinery entirely; behaves identically to
a native browser download. Side benefit: clicking it does NOT trigger
a script rerun, so other in-flight UI state survives.

New helper ``html_download_button`` lives in
``src/gui/components/_legacy.py`` (exported from ``components``). API:

    html_download_button(
        label, data,
        *, file_name, mime="application/octet-stream",
        disabled=False, help=None, use_container_width=True,
    )

Translation pattern applied across every tool page (and shared
``results_summary`` / ``config_panel`` widgets in ``_legacy.py``):

- ``st.download_button(`` -> ``html_download_button(``
- ``data=foo_bytes`` kwarg -> positional second arg
- ``key="..."`` -> dropped (helper has no widget identity)
- ``use_container_width=True`` -> dropped (default)
- ``disabled=`` and ``help=`` pass through unchanged
- Pre-computed byte buffers kept where they were

Total: 17 sites replaced (3 in Text Cleaner, 3 in Format
Standardizer, 3 in Fix Missing Values, 3 in Map Columns, 3 in
Automated Workflows, 2 in Find Duplicates page + 4 in shared
_legacy.py widgets used by Find Duplicates).

Caveat: data: URLs balloon by 33% (base64). Fine for tool output
sizes we ship; if a future result topped a few hundred MB we'd want a
Blob-URL fallback.

The marketing demo at src/gui/app_demo.py keeps its single
st.download_button — single button, no collision, no need to switch.

2008 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:13:41 +00:00
6415be8bf4 feat(tools): unified post-run UX across all Ready tool pages
Apply the Clean Text page's post-run UX pattern to every other Ready
tool page (Find Duplicates, Standardize Formats, Fix Missing Values,
Map Columns, Automated Workflows) for consistency and ease of use.

Per page:

1. Preview wrapped in ``st.expander(f"Preview: {filename}",
   expanded=not _has_result)``. Open before a result exists, folded
   afterwards.

2. Options / configuration controls wrapped in
   ``st.expander("Options", expanded=not _has_result)``. Inner
   sub-expanders preserved (Streamlit 1.36+ supports nesting).

3. After the primary action stashes the result, set a one-shot
   ``_<tool>_scroll_to_results`` flag in session state and call
   ``st.rerun()`` so the preview + options expanders see the new
   state on the next pass and collapse themselves.

4. ``<div id="<tool>-results-anchor" style="height:1px">`` placed
   immediately before the Results subheader.

5. End-of-page: pop the scroll flag and inject a tiny
   ``streamlit.components.v1.html`` iframe whose ``<script>`` calls
   ``scrollIntoView`` on the parent document's anchor. One-shot, so
   unrelated reruns (toggling Show-hidden, etc.) don't yank the
   viewport.

6. Download buttons hardened against the multi-button Streamlit
   footgun: byte buffers pre-computed outside the column scopes,
   explicit unique ``key="<tool>_dl_<purpose>"`` per button,
   ``use_container_width=True``, and previously-conditional buttons
   now render unconditionally with ``disabled=True`` + a help
   tooltip when the underlying data is empty so layout stays steady.

Per-page judgment calls (already noted in agent reports):

- Find Duplicates: sheet picker and delimiter selector kept OUTSIDE
  expanders (the user still needs to see them when a file fails to
  parse).
- Fix Missing Values: missingness profile wrapped INSIDE the Options
  expander together with Strategy — the Results section already
  shows a before/after missingness comparison that supersedes the
  static input profile.
- Map Columns: all three subsections (Target schema, Strategy,
  Mapping) wrapped under one outer Options expander, matching the
  Text Cleaner pattern.
- Automated Workflows: inner "Recommended tool order" expander stays
  nested inside the outer Options wrap; Run button stays outside
  Options so the user can re-run after tweaking the (collapsed)
  editor.

2008 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:04:37 +00:00
d1aaf3c2b9 feat(quit): close-window button + manual hint on the farewell overlay
The farewell overlay already attempted ``window.top.close()`` after a
Close click — but browsers only honour that for tabs that JS opened
(Chrome --app windows qualify; a regular browser tab does not). For
users whose Chrome wasn't auto-detected and who fall back to
``webbrowser.open``, the overlay stays put and they had no in-page
way to close.

Add to the overlay HTML:
- A "Close this window" button (uses the user-gesture path, which has
  slightly looser browser rules than auto-close).
- A hidden hint paragraph that reveals itself 250 ms after the
  button is clicked IF the window is still here, telling the user to
  press Ctrl+W (⌘W on Mac).

Wired through the existing _farewell_script template + ``_js_html_safe``
escaping so neither label can break out of the JS string literal.

New i18n keys (en + es): ``quit.close_window_button`` and
``quit.close_hint``.

The existing auto-close attempt remains — Chrome --app users still get
their window closed without touching the button.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:59:17 +00:00
27f0648093 fix(text-cleaner): make all three download buttons actually fire
Only "Download cleaned CSV" was working; "Download changes audit" and
"Download config JSON" did nothing on click.

The symptom is the classic Streamlit footgun for multiple
``st.download_button`` widgets in adjacent columns: without an explicit
``key`` argument the auto-derived widget IDs can collide, especially
when one button is conditionally rendered, and only the first button
in source order actually fires on click. Same goes for unstable
``data`` bytes recomputed inside the ``with col:`` block — the widget
identity can drift between renders.

Robustness pattern applied:
- Compute all three byte buffers up front, outside the columns, so the
  ``data`` parameter is the same object across reruns.
- Pass an explicit unique ``key`` ("textclean_dl_cleaned" /
  "textclean_dl_changes" / "textclean_dl_config") to each button.
- Render the changes button unconditionally with ``disabled=True`` and
  a help tooltip when ``result.changes.empty`` — instead of hiding it.
  Layout stays steady and the empty case is self-explanatory.
- ``use_container_width=True`` so the three buttons size identically
  inside their columns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:56:52 +00:00
0a61d52200 feat(text-cleaner): collapse options + auto-scroll to Results on run
After clicking Clean Text the user was left at the bottom of the
script with the Options block still expanded and no viewport movement
— they had to scroll to find the Results.

- Wrap the whole Options block in an outer ``st.expander("Options",
  expanded=not _has_result)``. After the Clean Text rerun, both
  Preview AND Options collapse, leaving the primary action button +
  Results as the only prominent elements above the fold. The inner
  Advanced-options expander is preserved as a nested expander
  (supported in Streamlit 1.36+; this repo pins 1.35+).
- Add a 1px anchor div ``#textclean-results-anchor`` immediately
  before the Results subheader.
- On Clean Text click, set a one-shot ``_textclean_scroll_to_results``
  flag in session state; on the next render, pop the flag and inject
  a tiny ``st.components.v1.html`` iframe whose ``<script>`` calls
  ``scrollIntoView`` on the parent document's anchor. One-shot so
  re-renders triggered by other widgets (Show-hidden toggle, etc.)
  don't jerk the viewport back to the top of Results.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:50:43 +00:00
ca14ce2952 feat(text-cleaner): collapse preview on run + full hidden-char audit
Two small UX fixes on the Clean Text page:

1. The input preview is now wrapped in an ``st.expander`` whose
   default-expanded state is ``not has_result``. Clicking the
   "Clean Text" primary button stashes the result and calls
   ``st.rerun()`` so the next pass sees the result in session state
   and the expander folds — the Results section becomes the primary
   visual focus. User can re-expand manually to re-inspect the source.

2. The Examples (changes audit) table's Before/After columns were
   calling ``visualize_hidden_html`` WITHOUT ``mark_outer_whitespace``,
   so leading/trailing whitespace — which is exactly what the cleaner
   most often removes — was invisible. Pass ``mark_outer_whitespace=True``
   to match the input-preview rendering. Column-name cell now mirrors
   that flag too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:43:52 +00:00
502a72cd46 feat(nav): ← Back to Home link on every tool page
Multi-file workflow: a user uploads several files on Home, clicks
"Open <Tool>" on one file's findings, lands on a tool page. The
sidebar lets them get back to Home, but a top-of-page back affordance
is more discoverable and keeps the hand in the same screen region as
the upload list they're working through.

- New ``back_to_home_link()`` helper in components/_legacy.py renders
  a secondary button that calls ``st.switch_page("app.py")`` — under
  ``st.navigation`` that routes to the default (Home) page.
- Wired into every tool page (1-9) directly after
  ``hide_streamlit_chrome()`` and BEFORE the license gate so a Lite
  user who lands on a locked tool can navigate away without paying.
- New i18n key ``nav.back_to_home`` ("← Back to Home" /
  "← Volver al inicio") in en/es packs.

2008 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:38:01 +00:00
604debb9a9 revert(home): keep per-tool grouping for per-file findings
Restoring ``render_findings_panel`` on the home page. Previous commit
(c575efd) inlined a flat renderer that dropped the per-tool grouping
and the "Open <Tool>" jump links — that was an over-correction. The
user only wanted the bottom tool-card grid gone (already removed in
ff2eaeb). The grouping inside the findings panel is what lets a user
land on a specific finding and one-click into the cleaner that fixes
it; without it they'd have to guess which sidebar entry to open.

Tool-card grid stays removed. Sidebar nav is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:31:36 +00:00
c575efd26e fix(home): render findings flat — drop per-tool grouping
The home page was calling ``render_findings_panel``, which groups
findings by tool into expanders and renders an "Open <Tool>" page
link under each. After uploading a file, the user still saw a tool
list (just under a different shape) — defeating the earlier cleanup
that removed the tool-cards grid.

Inline a flat renderer in ``_home_page``: per uploaded file, render
the filename header + severity summary + a flat list of findings via
``_render_one_finding`` directly. No expanders, no tool names as
section headers, no per-tool page-link buttons. Tool discovery
happens in the sidebar.

``render_findings_panel`` itself is unchanged — it still groups by
tool and remains tested via the findings-panel harness, but is no
longer used on the home page.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:22:20 +00:00
175389219f fix(gui): translate sidebar tool names when language changes
The sidebar nav was passing ``tool.name`` (the registry's English
field) to ``st.Page``, so the tool entries stayed in English even
after the user picked Spanish from the language selector. Section
headers were already i18n-driven; tool entries were not.

Switch to ``tool_name(tool_id)`` which routes through ``t(...)`` and
picks up the active language from session state. Verified: with
``ui_lang=es`` the sidebar renders Buscar duplicados / Limpiar texto /
Mapear columnas / etc. instead of the English fallbacks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:19:15 +00:00
c568aec8a7 feat(gui): one-click Close in its own bottom sidebar section
Close is now a direct shutdown trigger: visiting the Close page (the
sidebar entry) fires shutdown_app() immediately — no confirm step, no
intermediate body. The farewell overlay paints and os._exit(0) lands
~1s later from a daemon thread.

Layout: Close moved into its own bottom-of-sidebar section so the
destructive action is visually separated from Account/Activate.

- New shutdown_app() in components/_legacy.py replaces quit_button.
  os._exit thread is skipped when "pytest" is in sys.modules so the
  test suite doesn't suicide on rendering 99_Close.
- pages/99_Close.py shrinks to set_page_config + chrome + shutdown_app.
- app.py nav grows a new "Close" section header (new
  nav.section_close key in en/es packs) pinned at the bottom of the
  navigation dict.

Tests updated:
- TestQuitButtonRenders → TestClosePageShutsDownImmediately.
  Assert the shutdown caption renders + no confirm button exists.
- test_smoke EXPECTED_SUBSTRINGS["99_Close"] now pins
  "Shutting down" / "Cerrando" (the visible page body) instead of
  the removed page title.

2008 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:17:14 +00:00
ff2eaeb6c4 feat(home): multi-file upload + per-file analysis, drop tool grid
Home is now upload + analysis only. The page accepts multiple files in
one go, analyzes each independently, and renders findings grouped by
filename in bordered containers. The 3-section tool-card grid is gone —
discovery happens via the sidebar now.

Mechanics:
- file_uploader uses accept_multiple_files=True. Each file's findings
  cache in session_state["home_findings_by_file"] keyed by filename so
  removing a file via Streamlit's "x" button drops its findings too,
  and re-clicking Run only re-analyzes pending files.
- The first uploaded file is mirrored into the singular
  home_uploaded_{name,bytes,size} keys so tool pages continue to pick
  up an "active" upload through pickup_or_upload — no tool-page changes.
- New i18n keys: upload.intro_multi, upload.uploader_label_multi,
  upload.clear_results, upload.empty_state. upload.heading text is
  updated to "Upload one or more files to start" (EN + ES).

Dropped tests pinning the tool grid:
- TestHomeToolGridLocalization (test_chrome.py)
- test_home_tool_card_uses_es_name (test_smoke.py)
- TestLiteHomeGridBadges (test_lite_tier.py — locked-card lock-badge
  assertions; locking is still enforced per-tool-page via
  require_feature_or_render_upgrade)

2009 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:12:48 +00:00
dad744f17f refactor(gui): drop Review page + normalization gate
Home is now the only entry point: the "Run analysis" button on the
upload section IS the review step (findings render inline via
render_findings_panel). Tool pages no longer gate on a passed
normalization — running the analyzer is sufficient context.

Removed:
- src/gui/pages/0_Review.py
- src/gui/components/gate.py (re-export seam)
- require_normalization_gate() in src/gui/components/_legacy.py
- "review" section enum in tools_registry.py
- Data Review entry in app.py navigation
- require_normalization_gate() calls + imports in all nine tool pages
- tests/gui/test_gate.py (whole file)
- TestReviewWorkflow in tests/gui/test_workflows.py
- 0_Review entry in tests/gui/test_smoke.py PAGE_SLUGS
- stash_upload's normalization_result+normalization_for stashing
- stash_upload_without_gate (was the gate's negative-path helper)

2017 tests pass (16 retired with the gate flow).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 20:04:33 +00:00
fc6c22c6a7 feat(review): inline file uploader instead of redirect home
When a user lands on Review without an upload, show a file uploader
on the page itself and auto-run the analyzer once a file is picked,
rather than bouncing them to the home page with a "Back to home"
button.

Auto-analyze is the right default here: the user is already on the
Review page, so they've implicitly committed to a scan. Stashing the
bytes in the same session-state keys the home page uses keeps the
rest of the flow (encoding picker, gate, tool pages) unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 19:57:01 +00:00
db5ec084da docs+code: rename tool labels everywhere
Sweep follow-up to 93e43fc. Display labels now consistent across docs,
landing pages, CLI output, code comments, docstrings, and test prose.
Five parallel surfaces touched:

- docs (EN + ES): README, USER-GUIDE, CLI-REFERENCE, and 11 internal
  design/planning docs
- landing pages: index + bookkeeper/revops/shopify-pet
- src: CLI module docstrings, _TOOL_DISPLAY dicts in cli_analyze.py
  and gui/components/_legacy.py, core module headers, every tool
  page's module docstring
- tests: class/method/module docstrings and section-header comments
- test-cases READMEs

Page slugs (1_Deduplicator etc.), tool_id strings (01_deduplicator
etc.), Python class names (TestDeduplicatorWorkflow, FeatureFlag.*),
URL paths, anchor IDs, CSS classes, and asset filenames were left
intact since they're code identifiers / structural references.

All 2033 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 19:50:09 +00:00
93e43fc0d9 feat(gui): sidebar sections + non-technical tool labels
Sidebar nav now groups tools under Data Review / Data Cleaners /
Transformations / Automations via st.navigation, replacing the flat
auto-discovered list. Tool display names switch to action-first
phrasing (Find Duplicates, Fix Missing Values, Find Unusual Values,
Standardize Formats, Clean Text, Quality Check, Map Columns, Combine
Files, Automated Workflows) in EN + ES packs and on each page's H1.

The Data Cleaners section follows the requested order: Missing
Values → Outliers → Text Cleaner → Format Standardizer → Deduplicator
→ Quality Check. (Text Cleaner kept inside cleaners since the request
didn't list it but the tool still ships.) Registry now carries a
section field; helpers added: tools_in_section(), section_label().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 19:36:01 +00:00
624f99653e docs(arch): end-to-end system + tech-stack diagrams
New ARCHITECTURE.md pulls the desktop app (TECHNICAL.md) and the
license server (LICENSE-SERVER.md) into a single picture — the two
were never reconciled into an end-to-end view before.

Contents:
  §1. System diagram (ASCII) showing operator laptop, license
      server stack (nginx → FastAPI → Postgres), Postmark, Gumroad,
      and the buyer's machine — with the three primary flows
      (sale, manual mint, offline activation) traced through it.
  §2. Tech stack diagram, layered: desktop / server / operator /
      external SaaS, with version pins.
  §3. Trust + isolation boundaries table — what crosses each one
      and what the threat model is.
  §4. "Where things are stored" — paths, tables, files.
  §5. Pointers to the deeper per-component docs.

ASCII over Mermaid since the repo's Gitea version is unknown and
plain text renders in every viewer / IDE / raw `cat`.

LICENSE-SERVER.md status flipped from "design proposal, not built"
to "deployed (PR 1 + PR 2 code merged)" — that was stale since
the PR 1 deploy yesterday.

TECHNICAL.md and ADMIN.md gain one-line pointers to ARCHITECTURE.md
so people land at the unified view when looking for "how does it
all fit together".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:59:05 +00:00
86ad21db79 docs(license): PR 2 deploy + operator instructions
ADMIN.md gains a "Running a Gumroad webhook" section: how the URL
secret works, how to add a SKU to products.yaml, how to inspect
gumroad_events (recent activity + failures-only queries), how to
replay a failed delivery, and how to test without buyers via
Gumroad's "Send Test Ping" button.

The deployed-vs-queued matrix flips Gumroad + Postmark to
"code merged, deploy pending" so it's clear the bits exist on
main but the live box still runs PR 1.

SETUP-LICENSE-SERVER.md §3 commits the eventual compose.yml shape
with PR 2 environment + secrets lines included but commented out,
ready to uncomment at deploy time. The §3 chown step already covers
the new secret files because it uses `chmod 400 secrets/*` /
`chown 10001:10001 secrets/*`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:33:53 +00:00
2bbaba954b feat(server): Gumroad webhook receiver + Postmark email (PR 2)
Wires the second source-adapter (Gumroad) plus the email delivery
that lets the server fulfill a sale end-to-end without operator
intervention.

Auth model: Gumroad doesn't HMAC the body, so we use their
recommended URL-secret pattern (?secret=...). Wrong/missing secret
returns 404 — no signal to a prober that the endpoint exists.

Webhook flow (server/app/routes/webhooks.py):
  1. audit-log the raw payload (gumroad_events row) BEFORE anything
     else, so a later failure leaves us replayable
  2. parse via GumroadAdapter (server/app/adapters/gumroad.py)
  3. mint_from_sale — UNIQUE(source, source_order_id) dedups
     duplicate webhook retries
  4. send the license email
  5. mark gumroad_events.processed = true

Always returns 200 once auth passes. Non-2xx would trigger Gumroad's
3-day retry storm; we'd rather record the failure on the audit row
and replay manually after fixing whatever surfaced.

Product → tier mapping is per-source YAML at
server/config/products.yaml (lru_cached). Adding a SKU = edit yaml,
restart api. Unmapped product_id is an error on the audit row, not
a crash.

EmailService (server/app/email.py): provider-agnostic interface with
Postmark as the first implementation. When POSTMARK_TOKEN is unset
the factory returns LoggingEmailService instead, so the webhook
exercises end-to-end before Postmark is provisioned.

48 unit tests (was 21) including:
- Gumroad secret verify with constant-time compare
- Sale parsing: amount-in-cents, name fallback from email,
  test=true tagging, missing-required fields, offer codes
- Product mapping lookups
- Email rendering text + HTML, HTML-escapes user input
- Postmark client via httpx.MockTransport (success and 4xx)
- Webhook end-to-end: secret check, audit log, idempotency on
  retry, unmapped product, email failure keeps license

Smoke test (server/scripts/smoke.sh) extended to POST a synthetic
Ping payload, verify the row + audit log, prove wrong-secret is
rejected, prove duplicate sale_id stays one row.

SQLite-test compatibility:
- BigInteger primary key uses with_variant(Integer, "sqlite") since
  SQLite only autoincrements INTEGER PRIMARY KEY.
- python-multipart pulled in for FastAPI Form parsing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:33:43 +00:00
b5cd74d474 docs(admin): live deployment section for the running license server
Documents the post-deploy state of PR 1: live URLs (datatools and
licenses subdomains on unalogix.com), the on-box filesystem layout
under /srv/datatools-license/, where the admin token lives and how
to retrieve / rotate it, the laptop-side SSH-tunnel + admin_cli
mint workflow, inspection commands (logs, psql, container status),
restart / rebuild procedures, manual backup commands until cron
lands, the production-key rotation outline, and a deployed-vs-queued
capability matrix.

Secrets are NEVER pasted into this doc — the admin token's literal
value lives only on disk (mode 400, UID 10001). Committing it to
git would mean permanent leakage via history even after rotation;
documenting its location + rotation procedure achieves the same
operational outcome without the residual exposure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:19:57 +00:00
1cf69dd23b docs(license): runbook fixes from PR 1 self-host deploy
Two real-world footguns surfaced during the first live deploy:

1. docker-compose's uid/gid/mode long-form on file-based secrets is
   silently ignored — that's a swarm-mode-only feature. The
   container app user (UID 10001 from the Dockerfile) cannot read
   a mode-400 file whose host UID it doesn't match. Fix is to
   chown the secret files to 10001 directly; host-side access
   control stays gated by the parent dir's mode 750.

2. nginx 1.24 (Ubuntu 24.04 default) rejects the standalone
   "http2 on;" directive (that arrived in 1.25). Use the legacy
   "listen 443 ssl http2;" combined form. Noted prominently so the
   next deploy doesn't trip on it.

Also realigned §3's compose example to what actually got deployed
for PR 1 — only pg_password + admin_token secrets, postmark /
gumroad / license_privkey commented out as PR 2 / production-key
follow-ups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:17:05 +00:00
673b902377 feat(license): datatools-admin CLI for the mint API
New operator CLI at src/admin_cli.py: mint, list, revoke, ping —
talks to the server's /internal/* endpoints over a local SSH tunnel.
Stdlib-only on the desktop side (urllib + typer), no new top-level
deps. Auth via $DATATOOLS_ADMIN_TOKEN.

scripts/generate_license.py is now annotated as a break-glass tool
for when the server is unreachable — routine work goes through the
new CLI so the authoritative `licenses` row is created.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 00:47:01 +00:00
bab2c9468c feat(server): mint API + Postgres schema + manual adapter (PR 1)
Source-agnostic license issuance service. FastAPI app fronts a
Postgres `licenses` table; the only currently-wired source is
`manual` (operator mints via /internal/mint). Gumroad webhook
adapter lands in PR 2.

Key design points:

- Signing reuses src/license/crypto.py via a COPY into the image
  (single source of truth — blobs minted server-side verify against
  the same embedded pubkey on the buyer's machine).
- Source adapter Protocol (app/adapters/base.py) is the seam for
  Gumroad / Lemon Squeezy / Stripe in later PRs; Mint API speaks
  only SaleEvent / RefundEvent.
- (source, source_order_id) UNIQUE composite gives idempotent
  webhook retries without double-mint.
- JSONB type uses with_variant(JSON, 'sqlite') so the same models
  drive both Postgres prod and SQLite tests (no testcontainers dep).
- Bearer-token auth on /internal/*; the IP-loopback guard was
  removed after the docker bridge made it fight legitimate prod
  traffic (nginx defense + Bearer remain).
- Secrets resolved via *_FILE env vars pointing at
  /run/secrets/<name>, so passwords never appear in `docker inspect`.

21 unit tests (SQLite in-memory, StaticPool) plus a real-Postgres
docker-compose smoke test in server/scripts/smoke.sh that builds the
image, runs the alembic migration, mints a license, verifies the
signature against the host dev pubkey, and checks the DB row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 00:46:54 +00:00
4179cb5156 docs(license): self-hosted server runbook + multi-tenancy plan
Adds SETUP-LICENSE-SERVER.md — end-to-end install runbook for the
license server on the existing invixiom box (Ubuntu 24.04). Covers
DNS, system packages, Postgres + API in Docker, dedicated system
user, secrets layout under /srv/datatools-license/secrets (mode
400), nginx config in a separate sites-available/unalogix file,
Let's Encrypt cert issuance, smoke tests, backups, monitoring, key
rotation, and rollback.

Multi-tenancy is explicit at every layer: separate DNS zone
(unalogix.com vs invixiom.com), separate nginx file, separate TLS
cert, dedicated backend ports (8090 for the API, 5433 for Postgres,
both localhost-only), separate docker compose project and volume.
No invixiom service is touched.

LICENSE-SERVER.md updated: hosting choice moved from "Fly.io /
Render" (rejected) to self-hosted (decided). Points at the new
runbook for ops specifics.

ADMIN.md pointer table updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 22:57:53 +00:00
52e04f63a9 docs(license): design proposal for online issuance & record-keeping
Forward-looking design doc — not implemented. Describes the smallest
useful server that replaces the manual mint-and-paste workflow:
Gumroad webhook → Mint API (KMS-held private key) → Postgres
licenses table, plus a self-service renewal/re-delivery portal.

The desktop app is deliberately untouched across all three migration
phases: activation stays fully offline and continues to verify blobs
against the embedded pubkey, preserving the DECISIONS.md §9b promise
that buyer machines never phone home.

Schema is intentionally a superset of the local issuance JSONL log
(ADMIN.md), so Phase 1 migration is a flat INSERT per row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 22:26:24 +00:00
23c51fd759 feat(license): local issuance log for minted blobs
generate_license.py now appends every minted license to
~/.datatools-creator/issued.jsonl (overridable via env). This is the
creator-side system of record until the server-side flow lands.

The full blob is stored alongside name/email/tier/expiry so buyers
who lose their delivery email can be re-served without re-minting.
File is created mode 600 and lives outside the buyer-facing
~/.datatools/ dir so it never gets bundled into a shipped install.

Log failures are non-fatal (warning to stderr) — the mint already
succeeded by the time we try to log, and forcing a re-mint after a
log error would invalidate any device the buyer had activated. Pass
--no-log for test mints.

ADMIN.md adds a "Customer record-keeping" section with the path,
schema, jq one-liners, and migration note pointing at the upcoming
LICENSE-SERVER.md design doc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 22:25:19 +00:00
65e17e0a70 docs(admin): internal license operations reference
Creator-only ADMIN.md covering keypair generation, blob minting,
dev vs. production key model, tier matrix, and recovery if the
private key is lost. Includes a TL;DR for minting a dev license
against the in-tree keypair.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 22:10:16 +00:00
e534fb4989 sec(license): Ed25519 sigs + production-safe tripwire
Two coupled hardening upgrades.

1. Asymmetric signatures (HMAC → Ed25519)

The previous HMAC scheme used a symmetric secret that any motivated
reverse engineer could pull out of the shipped binary and use to
mint blobs for any tier / name / email. With Ed25519, the binary
ships only the public verification key; the signing key never
leaves the seller's environment, so binary compromise no longer
yields forgery.

- src/license/crypto.py rewritten around
  cryptography.hazmat.primitives.asymmetric.ed25519. Same public
  API surface (sign/verify/encode_blob/decode_blob), same canonical
  JSON encoding — drop-in for the manager / cli / GUI layers.
- DATATOOLS_LICENSE_PRIVKEY (seller-side) and
  DATATOOLS_LICENSE_PUBKEY (build-time) env vars supply the keys;
  the in-source dev keypair (src/license/_dev_keypair.py)
  deterministically derives from a seed phrase for repro builds and
  tests.
- Blob prefix bumped DTLIC1: → DTLIC2:. Decoding a DTLIC1 blob
  surfaces a clear "old format" error rather than a confusing
  signature mismatch.
- scripts/generate_keypair.py mints fresh production keypairs for
  the seller (run once, stash the private key offline). Adds
  cryptography>=41,<46 to requirements.txt (was an undeclared
  transitive dep).

2. Production-safe tripwire

assert_production_safe() refuses to boot a frozen / shipped build
when either:

- DATATOOLS_DEV_MODE=1 is set (would unconditionally bypass every
  license check — fine in source/test but catastrophic in a buyer
  install).
- The active verification key is still the embedded dev key (the
  build pipeline forgot to set DATATOOLS_LICENSE_PUBKEY).

No-op in source / pytest runs (sys.frozen is unset) so test
fixtures and dev workflows keep working without ceremony. Called
from src/cli_license_guard.guard() and from hide_streamlit_chrome
— so it fires on every CLI invocation and every GUI page load.

Tests: 49 license-layer unit tests (was 40); added Ed25519
wrong-key rejection, dev-keypair seed pin, blob v2 prefix, v1
rejection with clear message, and four production-safe scenarios
(no-op in source, fires on DEV_MODE in frozen, fires on dev key in
frozen, passes in frozen with prod pubkey). Total: 2024 → 2033.

Docs (REQUIREMENTS §17a, DEVELOPER licensing recipe, DECISIONS
§9b + decision log) updated with the new threat-model write-up,
key-storage workflow, and tripwire behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:34:48 +00:00
d32b58e61a feat(license): add Lite SKU; remove user-facing free trial
Two coupled changes:

1. Lite tier
   - New Tier.LITE in src/license/schema.py.
   - FEATURES_BY_TIER[Tier.LITE] = {Deduplicator, Text Cleaner,
     Format Standardizer}. The three universally-useful tools that
     cover the most common bookkeeping / RevOps / Klaviyo prep
     workflows. Other six tools require Core.
   - i18n: license.tier_lite, license.feature_locked_title,
     license.feature_locked_body, license.upgrade_link,
     license.status_locked (en + es).
   - Per-tool feature gate at every GUI tool page
     (require_feature_or_render_upgrade) and every tool CLI
     (guard(feature=...)). A locked tool renders an upgrade
     prompt + Manage-license button (GUI) or exits with code 2
     (CLI).
   - Home grid: tool cards the user's tier doesn't unlock get a
     red 🔒 Locked badge in place of green Ready.

2. Trial removed
   - Activation form's "Start 1-year trial" button removed.
   - license_cli's `trial` subcommand removed.
   - activation.trial_button / activation.trial_help i18n keys
     dropped (pack parity test stays green).
   - Tier.TRIAL stays in the enum (back-compat with any field-
     tested trial licenses); LicenseManager._mint stays internal
     for tests and the seller's key generator.
   - Decision logged in DECISIONS §9b: a 1-year all-features
     trial undercuts paid Lite; paid-only keeps tier economics
     clean.

Tests (+29 net): +17 Lite-tier unit/guard tests + 13 Lite-tier
GUI tests + 1 trial-absent assertion - 2 trial CLI tests - 1
trial GUI button test. Total: 1995 → 2024.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:19:30 +00:00
e612c751a8 docs(license): document activation flow, tier system, dev bypass
- USER-GUIDE EN + ES gain a §0 "First launch — activation" section
  covering paid blob activation, 1-year trial, renewal, file
  location, and device-swap.
- REQUIREMENTS §17a "Licensing" — storage path, activation model,
  lifetime, tier list, dev bypass env var. Test count: 1995.
- DEVELOPER gains a "Licensing" recipe in the Extension recipes
  section: public API, feature-flag add, tier add, minting via the
  creator-only script.
- DECISIONS §9b — log the offline-HMAC choice with the threat-model
  trade-off (motivated piracy not stopped; honor-system + 30-day
  refund covers casual sharing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:54:30 +00:00
e435103113 feat(license): registration + 1-year licenses + tier scaffolding
A complete offline licensing layer (no internet at any step):

Core
- src/license/ — schema (License, Tier, FeatureFlag), HMAC crypto,
  JSON storage, LicenseManager singleton with activate/renew/
  deactivate/issue_trial. Tier-scaffolded so future SKUs can carve
  per-tool feature sets without consumer-code edits.
- scripts/generate_license.py — creator-only key generator. Mints a
  DTLIC1: blob the buyer pastes into the activation page.

GUI
- New activation form component (src/gui/components/activation.py).
- hide_streamlit_chrome() now inline-renders the activation form when
  no valid license is present (every page short-circuits to the form
  until activated).
- Sidebar shows tier + days remaining; renewal warning under 30 days.
- New pages/_Activate.py for revisiting the form after activation.

CLI
- src/license_cli.py — activate / renew / status / trial / deactivate
  commands. Exempt from the guard.
- src/cli_license_guard.py — drop-in guard call added to every tool
  CLI's main(). Lets --help through; respects DATATOOLS_DEV_MODE.

i18n
- New activation.* and license.* keys in en.json + es.json
  (page title, form labels, status badges, renewal warnings, error
  messages). Pack parity test stays green.

Test infrastructure
- tests/conftest.py autouse fixture sets DATATOOLS_DEV_MODE=1 so the
  existing 1916 tests continue to pass.
- isolated_license_path / activated_license_manager /
  unactivated_license_manager fixtures for tests that want to drive
  the real check.

Tests (+79)
- tests/test_license.py (40): schema, crypto roundtrip, blob
  encode/decode, tier→feature mapping, activation flow, name/email
  mismatch rejection, tamper detection, expiration, renewal,
  dev-mode bypass.
- tests/test_license_cli.py (26): every license_cli command +
  subprocess tests confirming every tool CLI refuses to run without
  a license, --help always works, DEV_MODE bypasses.
- tests/gui/test_activation.py (13): gate blocks without license,
  passes with trial, activation form submission unlocks the gate,
  sidebar status, renewal warning, i18n.

Total: 1916 → 1995 tests. All pass under the strict warning filter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:54:23 +00:00
b2c7b94fe9 fix: clear all latent deprecation + resource warnings
Three real issues surfaced when running the suite with strict warnings:

1. src/core/format_standardize.py: ``datetime.utcfromtimestamp`` is
   deprecated in CPython 3.12 and slated for removal. Replace with
   ``datetime.fromtimestamp(ts, tz=timezone.utc)``. Output for the
   date-only format codes we use is byte-identical.

2. src/core/io.py: ``list_sheets`` leaked the openpyxl file handle by
   returning ``xl.sheet_names`` from an unclosed ``pd.ExcelFile``.
   Wrap in a ``with`` block so the FD closes deterministically — also
   prevents the Windows-only "file is locked" repro path.

3. tests/test_corpus.py: ``TestXlsxPollution.workbook`` fixture
   returned the bare ``pd.ExcelFile`` instead of yielding + closing.
   Convert to a yield-and-finally pattern so the class-scoped handle
   isn't leaked across the whole test file.

Also harden pytest.ini's warning policy: escalate
``ResourceWarning`` from ``src`` to an error, alongside the existing
``DeprecationWarning`` rule. Third-party warnings stay filtered — we
can't fix pandas/openpyxl/streamlit churn from here.

All 1916 tests pass under the strict filter; full and split runs
(``pytest``, ``pytest -m 'not gui'``, ``pytest -m gui``) all clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:28:48 +00:00
070e3c9f06 docs(gui): document the new GUI test layer
REQUIREMENTS §16 updates the test count (1777 → 1916) and breaks out
the GUI subset. DEVELOPER's Tests section gains the 'gui' marker
recipes and the new tests/gui/ tree under test layout, plus a short
'GUI test layer' explainer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:13:40 +00:00
35d46a0c1a test(gui): add Streamlit AppTest layer (139 tests)
Until now every test ran against core or the CLI; the Streamlit GUI
was verified by hand. This commit adds tests/gui/ — 139 AppTest-
driven tests behind a 'gui' marker so the quick loop
(``pytest -m 'not gui'``) stays at 1777 tests / ~10s while
``pytest`` runs everything (1916 / ~14s).

Coverage:
- test_smoke.py (59): every page renders in EN and ES, expected
  substring present, sidebar selector mounted.
- test_chrome.py (18): language selector flips session state and
  re-renders; quit button + farewell strings localize; tool-card
  names use the active language.
- test_gate.py (9): require_normalization_gate no-op / warning /
  short-circuit / hash-mismatch invariants; warning + button
  localized.
- test_workflows.py (14): happy path per Ready tool — stash
  upload, render, find primary action, verify result lands in
  session state.
- test_dedup_review.py (8): Accept All / Reject All / Clear
  Decisions wire through to review_decisions; apply_review_decisions
  semantics (keep-all, merge, column override).
- test_advanced_panels.py (15): config_panel widget defaults and
  options (algorithm, threshold, survivor rule, merge, multiselects,
  config save/load).
- test_errors.py (4): garbage / empty / single-column uploads don't
  crash; duplicate-target mapping raises InputValidationError.
- test_findings_panel.py (12): driven via a small standalone harness
  page so we test the component without faking a file_uploader. EN
  + ES strings, per-tool grouping, open-tool button label, untargeted
  expander, severity summary.

Shared infrastructure in tests/gui/conftest.py:
- ``stash_upload`` / ``stash_upload_without_gate`` — populate
  session_state to pre-pass or block the gate.
- ``with_language`` — set ``ui_lang`` before run().
- ``collected_text`` — flatten title/caption/markdown/etc. into
  one string for substring assertions.
- Auto-marking: every test in tests/gui/ gets ``@pytest.mark.gui``
  via ``pytest_collection_modifyitems``, so the marker isn't
  per-test boilerplate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:13:40 +00:00
d0423a8912 docs(perf): publish the dedup/parallel/lazy-copy wins and limits
REQUIREMENTS §10 carries the new measured numbers and the dedup
blocking trade-off note. DEVELOPER known-limitations is rewritten to
reflect that exact-only dedup is now O(n), fuzzy-blocking is opt-in,
and column-parallelism is scaffolding for free-threaded Python.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:54:25 +00:00
64452dd783 perf: dedup blocking, column-parallel scaffolding, lazy-copy pipelines
Three follow-on wins from the audit, each with shape-pinning tests.

1. Dedup blocking
   - Exact-only strategies (every column EXACT @ 100 — covers strong-
     key dedup like email/phone, the drop-duplicates fallback, and
     explicit "match on this exact column" calls) now route through
     an O(n) groupby fast path. Lossless; no API change required.
     Measured: 10k-row email-exact dedup → 73 ms (was ~30 minutes
     via the O(n²) pair compare).
   - Fuzzy strategies still pair-compare, with opt-in prefix blocking
     via deduplicate(..., blocking_columns=[...], blocking_prefix_len=1).
     Measured: 5k-row fuzzy-name → 25.6s with blocking vs 179s
     without (7x). Trade-off: cross-block matches missed.

2. Column-parallel standardize
   - StandardizeOptions.parallel_columns (default 1) lands a
     ThreadPoolExecutor over the column loop. Output order and
     audit-record order are preserved deterministically via a merge
     step keyed off column_types order. Honest doc: under CPython
     3.12's GIL the win is roughly neutral (phonenumbers/dateutil
     hold the GIL); the API is ready for free-threaded Python 3.13+.

3. Lazy-copy in missing / column_mapper
   - _standardize_sentinels now builds per-column changes in a dict
     and only materialises the output frame when at least one column
     actually changed. On a clean 1 GB file this skips a 1 GB
     allocation.
   - handle_missing carries an out_is_owned flag, copying on demand
     before any mutating step. No-op runs return the input frame.
   - map_columns drops the unconditional upfront df.copy(); rename
     and drop both return fresh frames already, and schema-add /
     coerce trigger _ensure_owned() lazily.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:54:25 +00:00
e5f632bcd6 docs(perf): publish 1.5 GB target and the new measured throughputs
REQUIREMENTS §10 reflects the post-optimisation numbers and the
known O(n²) dedup match step (flagged for a future blocking pass).
en/es upload-limit copy and uploader help now say 1.5 GB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:37:26 +00:00