datatools-dev

Author	SHA1	Message	Date
Michael	7ad19ac7f4	feat(nav,i18n): sticky footer with Back-to-Home + localized tool headers Two unrelated UX issues addressed in one sweep across all nine tool pages because they share the same edit surface. (1) Sticky footer replaces the top + bottom back-link buttons. Reported: a big white empty footer space at the bottom of every page; the Back to Home button at the top scrolled out of view on long pages. New ``render_sticky_footer()`` helper in ``components/_legacy.py`` injects a fixed-position bar at ``bottom: 0`` of the viewport with: - A border-top so it visually reads as a non-movable bar. - A semi-transparent background (rgba 0.96 + ``backdrop-filter: blur``) so content underneath shows through faintly when the user scrolls. - A styled ``<a href="home">`` anchor (not an ``st.button``) because Streamlit widgets can't be CSS-positioned reliably — Streamlit owns the widget's DOM container and re-mounts it on every rerun. A real anchor sits exactly where the CSS puts it and triggers Streamlit's URL routing to the home page. - ``padding-bottom: 3.5rem`` on the main container so the last widget isn't hidden behind the bar. Called once per tool page, immediately after ``hide_streamlit_chrome()`` so it renders even on pages that ``st.stop()`` early before any other content runs. The old top-and-bottom ``back_to_home_link()`` calls are removed from every tool page; their entry/exit points were dropping the button when the script short-circuited. (2) Tool-page headers now localize. Reported: switching the sidebar language picker to Spanish left the tool page's title + caption in English. Root cause: every page had hard-coded ``st.title("✂️ Clean Text")`` / ``st.caption("Trim whitespace...")`` strings. Added per-tool ``tools.<id>.page_title`` and ``tools.<id>.page_caption`` keys to ``en.json`` and ``es.json`` for all nine tools. Routed each page's title/caption call through ``t()``. 
Verified: with ``ui_lang=es`` set, the Clean Text page now renders "✂️ Limpiar texto" + the Spanish caption. Updated ``tests/gui/test_smoke.py::EXPECTED_SUBSTRINGS`` so the ``es`` column for each tool page asserts the actual Spanish string (was a duplicate of the English string back when the page bodies were English-only). 2220 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 00:42:45 +00:00
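The header localization in 7ad19ac7f4 can be pictured with a minimal sketch. It assumes a dotted-key lookup over nested dicts with an English fallback; the `PACKS` dict, the tool id `text_cleaner`, and the fallback order are hypothetical stand-ins, and the repository's real `t()` may behave differently.

```python
# Hypothetical in-memory stand-ins for en.json / es.json.
PACKS = {
    "en": {"tools": {"text_cleaner": {"page_title": "✂️ Clean Text"}}},
    "es": {"tools": {"text_cleaner": {"page_title": "✂️ Limpiar texto"}}},
}


def t(key: str, lang: str = "en") -> str:
    """Resolve a dotted key in `lang`, fall back to English, then to the key itself."""
    for pack in (PACKS.get(lang, {}), PACKS["en"]):
        node = pack
        for part in key.split("."):
            if not isinstance(node, dict) or part not in node:
                node = None
                break
            node = node[part]
        if isinstance(node, str):
            return node
    return key
```

With a lookup like this, a page would call `t("tools.text_cleaner.page_title", lang)` instead of passing a hard-coded string to `st.title`.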
Michael	84e4665ab0	fix(home): make per-file Remove button reliable Reported: the "✕" buttons on the uploaded file list removed files inconsistently — some clicks took, some didn't. Two compounding causes: 1. ``key=f"_home_remove_{name}"`` embedded the raw filename in the Streamlit widget key. Streamlit's widget-identity machinery normalizes keys differently across reruns when they contain spaces, dots, brackets, or non-ASCII characters, so a button's identity could shift between the render where the user clicked it and the rerun that should have processed the click. The click was registered, but the post-rerun render produced a new widget under a new effective key, and the original click was "lost". 2. The handler mutated ``home_uploads`` mid-loop while subsequent iterations were still creating buttons. ``st.rerun()`` raises synchronously, but if ANOTHER button's state changed in the same pass (e.g. a stale click held over from a fast double-tap), the ordering of state-mutation vs widget-key-update vs rerun could race. Fixes: - Stable widget keys: ``f"_home_remove_{sha1(name)[:10]}"``. The hash is identifier-safe regardless of spaces / dots / Unicode in the filename. Verified across "sample with spaces.csv", "sample.csv", and "日本語.csv" — three sequential Remove clicks each remove exactly one file with no clicks lost. - Two-phase capture: the loop collects the target ``to_remove`` filename, finishes rendering every other row at consistent widget identity, THEN mutates state once and reruns. No more mid-loop ``del`` racing other widgets' click handlers. - Wider click target: column ratio ``[8, 1]`` (was ``[12, 1]``) and ``use_container_width=True`` on the Remove button so the click surface fills the entire column. Label changed to "Remove" for the same reason — "✕" is a thin glyph that compressed the hit-test region. 2220 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 00:34:20 +00:00
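The two fixes in 84e4665ab0 (hashed widget keys and two-phase removal) can be sketched without Streamlit. The function names and the `clicked_keys` set are hypothetical; the set stands in for whichever buttons the framework reports as clicked during a render pass.

```python
import hashlib


def remove_key(name: str) -> str:
    """Build a widget key from a filename hash so the suffix is plain hex."""
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    return f"_home_remove_{digest[:10]}"


def render_rows(uploads: dict, clicked_keys: set) -> dict:
    """Two-phase removal: visit every row first, then mutate once."""
    to_remove = None
    for name in uploads:  # phase 1: render each row, only record the target
        if to_remove is None and remove_key(name) in clicked_keys:
            to_remove = name
    if to_remove is not None:  # phase 2: a single mutation after the loop
        uploads = {k: v for k, v in uploads.items() if k != to_remove}
    return uploads
```

In the real page the second phase would be followed by `st.rerun()`; the point of the sketch is only that nothing is deleted while rows are still being rendered.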
Michael	4685bb4289	style(chrome): tighter vertical rhythm — less whitespace across screens Reported: too much whitespace between widgets, dividers, and headings. Compact-spacing CSS layer added to ``_HIDE_CHROME_CSS`` (so it applies on every page that calls ``hide_streamlit_chrome``): - ``[data-testid="stVerticalBlock"]`` and ``stHorizontalBlock`` gap trimmed from Streamlit's default ~1rem to 0.5rem. - Heading margins (h1-h4) tightened — h1/h2/h3 used to leave 1-1.5rem above; now 0.25-0.5rem. - ``hr`` (``st.divider()``) drops from 1rem above+below to 0.4rem. - Markdown paragraphs and captions: 0.25rem bottom margin instead of the default 1rem. - Expander summary padding reduced (0.35rem top/bottom). - File-uploader, button, and metric tiles: trimmed internal padding. Also slimmed the main-container padding from 1rem top / Streamlit default bottom (~6rem) to 0.5rem top / 0.75rem bottom. The existing ``zoom: 0.85`` on ``.stApp`` is kept — the user wanted less white space, not smaller content, and dropping zoom would shrink type alongside everything else. 2220 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 00:28:58 +00:00
Michael	e96d5901f4	fix(close): graceful about:blank fallback + display-mode aware hint Reported: user asked whether we can send Alt+F4 / Ctrl+W to the browser from JavaScript to force-close a tab. Honest answer that's now baked into the hint message: NO. Synthesized keyboard events from page JS only reach DOM event listeners, not the browser chrome or the OS. There is no flag, API, or trick that lets a page close a tab the user opened themselves. The page CAN close a window it opened (window.opener trail) or one whose display-mode is ``standalone`` (Chrome/Edge ``--app=URL``) — that's what ``python -m src.gui`` arranges, and that's the path that actually closes the window without a manual Ctrl+W. Improvements landed: 1. ``isStandalone(win)`` detects Chrome --app windows up front (``matchMedia('(display-mode: standalone)').matches``). In a regular tab the manual hint surfaces immediately on the "Close this window" click; in --app mode we only show it if the close attempt actually fails. 2. ``fallbackToBlank(win)`` navigates the tab to ``about:blank`` via ``location.replace`` (no history pollution) so the user sees a clean empty tab instead of the farewell overlay frozen over Streamlit's connection-error banner. They still have to Ctrl+W the blank tab, but the screen is no longer a misleading "did it close or not?" mess. Fires 250 ms after a failed close in --app mode (very rare path), or 1.5 s in a regular tab so the user has time to read the hint. 3. Hint message rewritten in en + es to explain WHY the close is blocked (browser security — not something we can override), to acknowledge the Alt+F4 / Ctrl+W question directly (those don't work either, for the same reason), and to point at ``python -m src.gui`` as the path that gives a clean auto-close. 2220 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 00:07:51 +00:00
Michael	ecfc52499f	fix(home): persist upload list across page navigation Reported: clicking "Back to Home" from a tool page returned the user to an empty home — their previously-uploaded files were gone. Root cause: Streamlit's ``st.file_uploader`` widget state does not reliably survive ``st.switch_page``. The widget gets unmounted on navigation, and its ``UploadedFile`` objects don't always re-attach on remount. The home page was treating the widget's return value as the source of truth, so after navigation the list was empty. Fix: introduce a session-state stash keyed by filename (``home_uploads: dict[str, {"bytes": bytes, "size": int}]``) and treat it as the source of truth for everything downstream — the active-file pickup keys for tool pages, the per-file findings cache, and the rendered file list. The widget is reduced to its narrow role of capturing NEW uploads, which we merge into the stash without ever removing. Per-file remove: a "✕" button next to each filename drops just that file (and its findings). The widget's own "✕" is bypassed by our rendering, since trusting it would let the widget's state diverge from the stash. Clear-results button is unchanged: it wipes only the analysis cache, leaving uploaded files intact (per the user's "persistent until cleared" requirement — removal is per-file via "✕"). Tool-page compatibility: the singular ``home_uploaded_{name,size, bytes}`` keys still get populated from the first entry in the stash on every render, so ``pickup_or_upload`` on a tool page keeps finding the active upload. When the user removes the active file, those keys are cleared so the next render repopulates from whatever file is now first. ``_StashedUpload`` is a small duck type ( ``.name``, ``.size``, ``.getvalue()`` ) so ``_run_analysis_on_upload`` accepts entries restored from the stash without changes. 2220 tests pass. 
Smoke-verified via AppTest: pre-stashed ``home_uploads`` renders the file list with per-file remove buttons, and the persistent state survives a simulated navigation round-trip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 00:04:12 +00:00
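A rough sketch of the stash described in ecfc52499f, under the assumption that the widget hands back objects with `.name` and `.getvalue()`. `StashedUpload` and `merge_uploads` are illustrative names (the commit's own duck type is `_StashedUpload`), and a plain dict stands in for session state.

```python
from dataclasses import dataclass


@dataclass
class StashedUpload:
    """Duck type exposing the .name / .size / .getvalue() surface."""
    name: str
    data: bytes

    @property
    def size(self) -> int:
        return len(self.data)

    def getvalue(self) -> bytes:
        return self.data


def merge_uploads(stash: dict, new_uploads: list) -> dict:
    """Merge newly captured uploads into the stash; never drop existing entries."""
    for up in new_uploads:
        data = up.getvalue()
        stash[up.name] = {"bytes": data, "size": len(data)}
    return stash
```

Because the merge only adds or overwrites, an empty widget value after navigation leaves the stash untouched.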
Michael	21fd8a4cd7	fix(nav): switch_page resolves correctly + bottom-of-page back link Two issues, same fix surface. (1) Reported crash on Back-to-Home: StreamlitAPIException: Could not find page: app.py. ``st.switch_page("app.py")`` doesn't work under ``st.navigation`` — the entry script is the nav manager itself and is not a registered page. The fix needs to pass an ``st.Page`` object whose script identity matches one registered in the nav. First-pass attempt (``from src.gui.app import _home_page``) hit a worse failure: importing ``app.py`` from inside a tool-page render re-executes the nav setup with the WRONG "main script" context, so every ``st.Page("pages/N_foo.py", ...)`` call in ``_build_navigation`` fails with "file could not be found". Extract the home renderer into its own module ``src/gui/_home.py`` which has no top-level Streamlit side effects. Both the nav manager and the back-link helper import ``_home_page`` from there. The Page object built at click time has the same callable identity as the one registered, so ``st.switch_page`` resolves it. (2) Reported UX: the back button scrolled out of view on long pages. Add a second ``back_to_home_link(key="_back_to_home_link_bottom")`` call near the footer of every tool page (1-9). The unique key avoids widget-id collision with the top instance. Coming-Soon stubs get it unconditionally; Ready tools render it only after a result exists because the page short-circuits with ``st.stop()`` before then — when no result is on screen the page is short enough that the top link is sufficient. 2220 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 23:58:33 +00:00
Michael	42f8d78dd5	fix(downloads): drop /select on Windows — opens wrong folder Reported: clicking "Open Downloads folder" was opening the Documents folder instead of Downloads. Root cause is the classic Windows gotcha: when the path contains a space (e.g. ``C:\Users\Michael Dombaugh\Downloads``), Python's ``subprocess.Popen`` packs the ``/select,...`` argument into a single quoted token, and Explorer's ``/select`` argument parser does NOT accept that form — it silently falls back to whatever the user's default Explorer view is (typically Documents). Resolution paths considered: - ``shell=True`` with a hand-built command string — works but opens the door to shell-injection if a file_name ever contained a quote or special char. - ``cmd /c start "" explorer /select,...`` — same parsing issue. - ctypes ShellExecuteW — pulls in a Windows-only dependency. - Skip /select. Open the folder directly. ✓ Going with the last. ``explorer <folder>`` reliably opens the folder regardless of spaces in the path; the user finds the freshly-saved file by its name. The previous "highlight the file" nicety wasn't worth the path-parsing fragility — every user folder on Windows is ``C:\Users\<name>`` and every Windows username can contain a space. macOS keeps the ``open -R <file>`` reveal-in-Finder path because macOS argument parsing is sane and that's a strict UX win. 2220 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 23:45:47 +00:00
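The per-platform choice that 42f8d78dd5 settles on can be expressed as a small pure function. `open_folder_command` is a hypothetical name, and the platform strings follow `sys.platform` conventions.

```python
def open_folder_command(folder: str, file_path: str, platform: str) -> list:
    """Return the argv list for showing a saved file's location."""
    if platform.startswith("win"):
        return ["explorer", folder]       # no /select, so spaces in the path are harmless
    if platform == "darwin":
        return ["open", "-R", file_path]  # reveal the file in Finder
    return ["xdg-open", folder]           # Linux: open the containing folder
```

A caller would pass the result to `subprocess.Popen(...)` as a list, which avoids `shell=True` and the injection risk the commit mentions.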
Michael	0f89d7ba66	fix(downloads): use explorer /select on Windows + show open feedback Reported: clicking "Open Downloads folder" did nothing visible. The previous implementation called ``os.startfile(folder)`` on Windows, which is known to silently no-op or open Explorer behind the active window in some configurations (Streamlit running headless, no foreground rights inherited by the click handler thread, etc.). Switch to the more reliable ``explorer /select,<file>`` form: - Opens Explorer with the just-saved file pre-highlighted instead of just navigating to the folder — better UX than the old behavior. - explorer.exe is a real GUI process that's spawned in the user's session with foreground rights, so it shows up on top. - Fallback chain on Windows: ``/select`` first, then plain ``explorer <folder>``, then ``os.startfile`` as a last resort. macOS upgraded the same way: ``open -R <file>`` reveals in Finder rather than opening the directory. Linux: no reliable cross-distro reveal, so ``xdg-open <folder>``. Plus user feedback at the call site: - On successful dispatch: ``st.toast("Opening <folder>", icon="📂")`` — confirms we tried, in case the window comes up behind the browser. - On dispatch failure: ``st.warning`` with the full path the user can copy/paste into their file manager manually. 2220 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 23:25:06 +00:00
Michael	b9147f3b66	fix(downloads): save server-side to ~/Downloads + open-folder link Switch the download mechanism from "browser <a download> with a data: URL" to "write the bytes directly to the user's Downloads folder and show them the exact path". DataTools runs as a local Streamlit app, so the "server" IS the user's machine; there's no reason to go through the browser save dialog at all. Flow: 1. Click the "Download <something>" button (rendered as a regular ``st.button``, so no widget-collision issues). 2. Bytes are written to ``Path.home() / "Downloads" / file_name`` (overwriting any same-named file). 3. The page reruns and renders a success caption with the absolute path the file landed at. 4. An "📂 Open Downloads folder" button appears. Clicking it pops the OS file manager via ``os.startfile`` (Windows), ``open`` (macOS), or ``xdg-open`` (Linux). Why this is better than the previous HTML-data-URL helper: - Unambiguous about where the file went: the user sees the full path, not "wherever your browser was configured to save". - The data: URL approach inflated the page payload by about 33% (base64), which bloated large outputs; the server-side write is byte-for-byte. - Eliminates the browser-side widget-collision class of bug. - The save action is a real Streamlit button, so the existing widget semantics (disabled, help tooltip, key isolation) work without workarounds. API surface unchanged. New canonical name ``local_download_button``; ``html_download_button`` is kept as a back-compat alias that points at the same implementation, so every existing call site continues to work without edits. Tests are protected from polluting the developer's home dir via a ``DATATOOLS_DOWNLOADS_DIR`` env var override returned by the new ``_downloads_dir()`` helper. Smoke verified end-to-end via AppTest: click → file appears in tmp dir → success banner shows path → open-folder button renders. 2220 tests pass, 91 skipped, 35 s. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 21:48:28 +00:00
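The server-side save in b9147f3b66 and its test override can be sketched as below. The env var name comes from the commit; the function names are illustrative, and stripping directory components from `file_name` is an added precaution rather than something the commit describes.

```python
import os
from pathlib import Path


def downloads_dir() -> Path:
    """Return the DATATOOLS_DOWNLOADS_DIR override if set, else ~/Downloads."""
    override = os.environ.get("DATATOOLS_DOWNLOADS_DIR")
    return Path(override) if override else Path.home() / "Downloads"


def save_download(data: bytes, file_name: str) -> Path:
    """Write bytes into the downloads directory, overwriting any same-named file."""
    target_dir = downloads_dir()
    target_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the final path component (assumption: callers pass bare file names).
    target = target_dir / Path(file_name).name
    target.write_bytes(data)
    return target.resolve()
```

A test can point `DATATOOLS_DOWNLOADS_DIR` at a temporary directory, so nothing lands in the developer's real Downloads folder.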
Michael	5128d35961	fix(text-cleaner): hoist show_hidden + stress-test all tool pages Reported crash: clicking "Clean Text" with mojibake.csv (a junk corpus file that the cleaner ran on but produced zero changes) blew up the results render with NameError: name 'show_hidden' is not defined at the cleaned-preview block. ``show_hidden`` was defined inside ``if result.cells_changed:`` and referenced unconditionally below. Fix on the page itself: hoist the ``show_hidden = st.toggle(...)`` declaration out of the conditional so it's always in scope for the downstream cleaned-preview render. One toggle now drives both the Examples table (which only renders when there are changes) AND the cleaned preview (which always renders). Generalized regression net: ``tests/test_junk_corpus_tool_pages.py``. For nine representative junk files (empty, only_nul, mojibake, invalid_utf8, utf16_le_no_bom, mismatched_columns, all_nulls, corrupt_xlsx, single_column) and every Ready/Coming-Soon tool page, the test: 1. Stashes the junk bytes as the home upload via session_state. 2. Runs the page through AppTest, asserts ``app.exception`` is empty. 3. If the page exposes a deterministic primary-action button label, clicks it and asserts no exception on the post-click render. Pages that catch a bad file at read time and short-circuit via ``st.error`` + ``st.stop`` are correctly skipped from the primary-action half (the button isn't rendered). A genuine crash shows up as ``app.exception`` carrying a Python traceback — exactly what the user reported, exactly what we now catch. 162 tests collected, 102 passed, 60 skipped. 4 seconds. Full suite: 2220 passed, 91 skipped, 35 s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 21:41:14 +00:00
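The scoping bug fixed in 5128d35961 reduces to a name bound in only one branch. The sketch below shows the hoisted form; `render_results` and its return value are hypothetical stand-ins for the page's render block, with `toggle_value` playing the role of `st.toggle(...)`.

```python
def render_results(cells_changed: int, toggle_value: bool = False) -> list:
    """Return the sections that would render for a given result."""
    sections = []
    show_hidden = toggle_value  # hoisted: bound on every path, not only when cells changed
    if cells_changed:
        sections.append(("examples", show_hidden))
    # The cleaned preview always renders, so it must not depend on the branch above.
    sections.append(("cleaned_preview", show_hidden))
    return sections
```

Had the assignment stayed inside the `if`, a run with zero changes would reach the final line with `show_hidden` unbound, which is the reported crash.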
Michael	696996c119	test(junk-corpus): pathological-input stress suite for the analyzer Build a corpus of 35 deliberately-broken files (empty bytes, NUL bytes, mojibake, UTF-16 without BOM, mismatched columns, unescaped quotes, corrupt zip, etc.) and pin the analyzer's stability contract against them. Files land in ``test-cases/junk-corpus/test_data/``. The generator ``make_junk_corpus.py`` produces them deterministically (one random sample uses ``secrets.token_bytes`` — committed bytes are stable across regenerations because the byte stream is captured at commit time). README documents the categories and how to add new shapes. ``tests/test_junk_corpus.py`` parametrizes over every file in the corpus and asserts: 1. ``_run_analysis_on_upload`` never raises — exceptions must be caught and surfaced as a synthetic ``Finding`` with severity="error". This was the user-reported crash for 13_non_latin_scripts.csv that the previous fix in `ae9d4a2` defensively wrapped; the corpus now stops the regression from re-landing on a different shape. 2. Every Finding in the result list is well-formed (string id, valid severity, non-empty description). 3. A high-risk subset (empty.csv, only_bom.csv, only_nul.csv, corrupt_xlsx.xlsx) MUST surface at least one error-level Finding — otherwise the GUI would render "no issues found" for a structurally broken file. 4. Error-level Finding descriptions are at least 20 chars so the UI banner gives the user something to act on. Also exclude ``junk-corpus`` from ``tests/test_fixtures_sweep.py`` since that sweep is happy-path (round-trip the text cleaner) and fights with files designed to break it. The contract is enforced by the dedicated junk-corpus test, not the sweep. Runtime: 12 s for the junk-corpus tests, 30 s for the full project suite (was 19 s without these). 2118 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 21:35:22 +00:00
Michael	ae9d4a2db5	fix(home): defensive analysis errors don't crash the whole page Reported: uploading 13_non_latin_scripts.csv made the home page bubble a ``pandas.errors.EmptyDataError`` traceback up through the page chrome instead of surfacing as a per-file error. In a multi-file analysis run that kills every other file's results too, which is worse than the symptom itself. Wrap ``_run_analysis_on_upload`` in proper error handling: - Empty bytes ``getvalue() == b""`` short-circuits with a synthetic error Finding telling the user the upload was zero-byte and to re-upload. - Empty ``repair.repaired_bytes`` (file was all NULs / BOM / stripped to nothing) likewise surfaces as a synthetic Finding rather than reaching pd.read_csv. - ``pd.errors.EmptyDataError`` from pandas is caught and rendered as a Finding that names the file, its byte size, and suggests opening it in a text editor to verify the header row matches the data row delimiter. - Any other exception during read/analyze is caught and surfaces as a Finding via ``format_for_user`` so the user gets a clean message, not a Python traceback. Each file in a multi-file run now stands alone: a bad file produces one red banner in its own card, every other file analyzes normally. The 13_non_latin_scripts.csv corpus file is 249 bytes of valid UTF-8 on disk and parses cleanly under the same code path locally — the user's specific symptom is likely a zero-byte upload (browser / network / Python 3.14 + Streamlit edge case). The new ``empty_upload`` finding will name the bytes count so they can confirm. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 21:22:10 +00:00
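The per-file isolation in ae9d4a2db5 amounts to converting failures into data. A minimal sketch follows, assuming a `Finding` with the three fields the later junk-corpus tests check; `analyze_safely` and the `analysis_failed` id are hypothetical, while `empty_upload` is the id named in the commit.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    id: str
    severity: str
    description: str


def analyze_safely(name: str, data: bytes, analyze) -> list:
    """Run `analyze(data)` and turn any failure into an error-level Finding."""
    if data == b"":
        return [Finding("empty_upload", "error",
                        f"{name} arrived as 0 bytes; please upload it again.")]
    try:
        return analyze(data)
    except Exception as exc:  # deliberately broad: one bad file must not sink the rest
        return [Finding("analysis_failed", "error",
                        f"{name} ({len(data)} bytes) could not be analyzed: {exc}")]
```

Calling this once per file keeps a multi-file run going: each file yields its own list of findings regardless of what the others did.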
Michael	ef9f8b5de4	fix(close): Edge fallback + better tryClose + honest hint There is no JavaScript override for browser tab-close security: ``window.close()`` only succeeds on windows JS opened (Chrome --app windows qualify; a regular browser tab does not). What we can do is make the --app path easier to hit and the failure case more actionable. Three changes: 1. ``src/gui/__main__.py`` — extend browser detection. PATH lookup now also looks for ``msedge`` / ``microsoft-edge``; Windows install candidates include the Edge install path; macOS candidates include Edge and Chromium. Edge is Chromium-based, supports ``--app``, and ships on every Windows 10+ machine — so users without Chrome no longer fall through to the regular browser tab. When the fallback IS hit, print a warning to stderr explaining why Close-from-page will require Ctrl+W. Renamed ``_find_chrome`` to ``_find_app_browser`` to reflect the broader scope. 2. ``_FAREWELL_SCRIPT_TEMPLATE`` in ``components/_legacy.py`` — factor close attempts into a ``tryClose`` helper that runs three escalating tries: standard ``win.close()``, the ``win.open('', '_self')`` history-rewrite trick (no-op in modern Chrome but free), and ``win.top.close()``. Auto-close on paint AND the manual button now both call this helper. Skip the manual hint if the close eventually succeeded between the click and the 250 ms timeout. 3. ``quit.close_hint`` in en/es i18n packs — rewrite the message to tell the user honestly that this is a browser security restriction, tell them the Ctrl+W keystroke that works, and point them at ``python -m src.gui`` for the auto-closing app-mode experience. 2008 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 21:17:18 +00:00
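The broadened browser detection in ef9f8b5de4 can be approximated with a PATH scan. `msedge` and `microsoft-edge` come from the commit; the Chrome and Chromium binary names, the function names, and the injectable `which` parameter are assumptions made for this sketch.

```python
import shutil

# Edge names are from the commit; the Chrome/Chromium names are assumed.
CANDIDATES = ("google-chrome", "chrome", "chromium", "msedge", "microsoft-edge")


def find_app_browser(which=shutil.which):
    """Return the first candidate found on PATH, or None for the plain-tab fallback."""
    for name in CANDIDATES:
        path = which(name)
        if path:
            return path
    return None


def app_mode_argv(browser: str, url: str) -> list:
    """Build the argv that opens `url` as a chromeless app window."""
    return [browser, f"--app={url}"]
```

When `find_app_browser()` returns `None`, the launcher would fall back to `webbrowser.open` and warn that closing from the page will need Ctrl+W.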
Michael	aeead05e4c	fix(downloads): swap st.download_button for an HTML <a download> helper Reported symptom: only the FIRST download button in a multi-button row pops the browser save dialog. The second and third do nothing on click. Affects every tool page that exposes (cleaned + audit + config) downloads. Root cause is ``st.download_button`` itself — when several render in the same script pass, the click-to-bytes wiring on the browser side mis-routes and only one button's data is actually exposed. Explicit ``key`` arguments don't fix it; ``use_container_width=True`` doesn't help either; we confirmed this in the Text Cleaner reverts. Replace the widget with a real ``<a download="file" href="data:...">`` anchor rendered via ``st.markdown(..., unsafe_allow_html=True)``. Bypasses Streamlit's widget machinery entirely; behaves identically to a native browser download. Side benefit: clicking it does NOT trigger a script rerun, so other in-flight UI state survives. New helper ``html_download_button`` lives in ``src/gui/components/_legacy.py`` (exported from ``components``). API: html_download_button( label, data, *, file_name, mime="application/octet-stream", disabled=False, help=None, use_container_width=True, ) Translation pattern applied across every tool page (and shared ``results_summary`` / ``config_panel`` widgets in ``_legacy.py``): - ``st.download_button(`` -> ``html_download_button(`` - ``data=foo_bytes`` kwarg -> positional second arg - ``key="..."`` -> dropped (helper has no widget identity) - ``use_container_width=True`` -> dropped (default) - ``disabled=`` and ``help=`` pass through unchanged - Pre-computed byte buffers kept where they were Total: 17 sites replaced (3 in Text Cleaner, 3 in Format Standardizer, 3 in Fix Missing Values, 3 in Map Columns, 3 in Automated Workflows, 2 in Find Duplicates page + 4 in shared _legacy.py widgets used by Find Duplicates). Caveat: data: URLs balloon by 33% (base64). 
Fine for tool output sizes we ship; if a future result topped a few hundred MB we'd want a Blob-URL fallback. The marketing demo at src/gui/app_demo.py keeps its single st.download_button — single button, no collision, no need to switch. 2008 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 21:13:41 +00:00
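The anchor-based helper from aeead05e4c and its 33% caveat can be illustrated as follows. This is a sketch, not the repository's `html_download_button`: `download_anchor` and `base64_size` are hypothetical names, and the real helper also handles `disabled`, `help`, and styling.

```python
import base64
import html


def download_anchor(label: str, data: bytes, file_name: str,
                    mime: str = "application/octet-stream") -> str:
    """Build an <a download> element whose href is a base64 data: URL."""
    payload = base64.b64encode(data).decode("ascii")
    return (f'<a download="{html.escape(file_name, quote=True)}" '
            f'href="data:{mime};base64,{payload}">{html.escape(label)}</a>')


def base64_size(n_bytes: int) -> int:
    """Encoded length with padding: 4 output characters per 3 input bytes."""
    return 4 * ((n_bytes + 2) // 3)
```

Since every 3 input bytes become 4 characters, a 300 MB result would put roughly 400 MB of text into the page, which is the scale at which the commit suggests a Blob-URL fallback.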
Michael	6415be8bf4	feat(tools): unified post-run UX across all Ready tool pages Apply the Clean Text page's post-run UX pattern to every other Ready tool page (Find Duplicates, Standardize Formats, Fix Missing Values, Map Columns, Automated Workflows) for consistency and ease of use. Per page: 1. Preview wrapped in ``st.expander(f"Preview: {filename}", expanded=not _has_result)``. Open before a result exists, folded afterwards. 2. Options / configuration controls wrapped in ``st.expander("Options", expanded=not _has_result)``. Inner sub-expanders preserved (Streamlit 1.36+ supports nesting). 3. After the primary action stashes the result, set a one-shot ``_<tool>_scroll_to_results`` flag in session state and call ``st.rerun()`` so the preview + options expanders see the new state on the next pass and collapse themselves. 4. ``<div id="<tool>-results-anchor" style="height:1px">`` placed immediately before the Results subheader. 5. End-of-page: pop the scroll flag and inject a tiny ``streamlit.components.v1.html`` iframe whose ``<script>`` calls ``scrollIntoView`` on the parent document's anchor. One-shot, so unrelated reruns (toggling Show-hidden, etc.) don't yank the viewport. 6. Download buttons hardened against the multi-button Streamlit footgun: byte buffers pre-computed outside the column scopes, explicit unique ``key="<tool>_dl_<purpose>"`` per button, ``use_container_width=True``, and previously-conditional buttons now render unconditionally with ``disabled=True`` + a help tooltip when the underlying data is empty so layout stays steady. Per-page judgment calls (already noted in agent reports): - Find Duplicates: sheet picker and delimiter selector kept OUTSIDE expanders (the user still needs to see them when a file fails to parse). - Fix Missing Values: missingness profile wrapped INSIDE the Options expander together with Strategy — the Results section already shows a before/after missingness comparison that supersedes the static input profile. 
- Map Columns: all three subsections (Target schema, Strategy, Mapping) wrapped under one outer Options expander, matching the Text Cleaner pattern. - Automated Workflows: inner "Recommended tool order" expander stays nested inside the outer Options wrap; Run button stays outside Options so the user can re-run after tweaking the (collapsed) editor. 2008 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 21:04:37 +00:00
Michael	d1aaf3c2b9	feat(quit): close-window button + manual hint on the farewell overlay The farewell overlay already attempted ``window.top.close()`` after a Close click — but browsers only honour that for tabs that JS opened (Chrome --app windows qualify; a regular browser tab does not). For users whose Chrome wasn't auto-detected and who fall back to ``webbrowser.open``, the overlay stays put and they had no in-page way to close. Add to the overlay HTML: - A "Close this window" button (uses the user-gesture path, which has slightly looser browser rules than auto-close). - A hidden hint paragraph that reveals itself 250 ms after the button is clicked IF the window is still here, telling the user to press Ctrl+W (⌘W on Mac). Wired through the existing _farewell_script template + ``_js_html_safe`` escaping so neither label can break out of the JS string literal. New i18n keys (en + es): ``quit.close_window_button`` and ``quit.close_hint``. The existing auto-close attempt remains — Chrome --app users still get their window closed without touching the button. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 20:59:17 +00:00
Michael	27f0648093	fix(text-cleaner): make all three download buttons actually fire Only "Download cleaned CSV" was working; "Download changes audit" and "Download config JSON" did nothing on click. The symptom is the classic Streamlit footgun for multiple ``st.download_button`` widgets in adjacent columns: without an explicit ``key`` argument the auto-derived widget IDs can collide, especially when one button is conditionally rendered, and only the first button in source order actually fires on click. Same goes for unstable ``data`` bytes recomputed inside the ``with col:`` block — the widget identity can drift between renders. Robustness pattern applied: - Compute all three byte buffers up front, outside the columns, so the ``data`` parameter is the same object across reruns. - Pass an explicit unique ``key`` ("textclean_dl_cleaned" / "textclean_dl_changes" / "textclean_dl_config") to each button. - Render the changes button unconditionally with ``disabled=True`` and a help tooltip when ``result.changes.empty`` — instead of hiding it. Layout stays steady and the empty case is self-explanatory. - ``use_container_width=True`` so the three buttons size identically inside their columns. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 20:56:52 +00:00
Michael	0a61d52200	feat(text-cleaner): collapse options + auto-scroll to Results on run After clicking Clean Text the user was left at the bottom of the script with the Options block still expanded and no viewport movement — they had to scroll to find the Results. - Wrap the whole Options block in an outer ``st.expander("Options", expanded=not _has_result)``. After the Clean Text rerun, both Preview AND Options collapse, leaving the primary action button + Results as the only prominent elements above the fold. The inner Advanced-options expander is preserved as a nested expander (supported in Streamlit 1.36+; this repo pins 1.35+). - Add a 1px anchor div ``#textclean-results-anchor`` immediately before the Results subheader. - On Clean Text click, set a one-shot ``_textclean_scroll_to_results`` flag in session state; on the next render, pop the flag and inject a tiny ``st.components.v1.html`` iframe whose ``<script>`` calls ``scrollIntoView`` on the parent document's anchor. One-shot so re-renders triggered by other widgets (Show-hidden toggle, etc.) don't jerk the viewport back to the top of Results. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 20:50:43 +00:00
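The one-shot scroll flag from 0a61d52200 is a set-then-pop pattern. In the sketch below a plain dict stands in for `st.session_state`, and the two function names are hypothetical; only the flag name is taken from the commit.

```python
SCROLL_FLAG = "_textclean_scroll_to_results"


def on_clean_click(state: dict) -> None:
    """Arm the flag when the primary action runs."""
    state[SCROLL_FLAG] = True


def should_scroll(state: dict) -> bool:
    """Consume the flag: True only on the first render after it was set."""
    return bool(state.pop(SCROLL_FLAG, False))
```

Because `pop` removes the key, a rerun triggered by an unrelated widget sees no flag and leaves the viewport alone.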
Michael	ca14ce2952	feat(text-cleaner): collapse preview on run + full hidden-char audit Two small UX fixes on the Clean Text page: 1. The input preview is now wrapped in an ``st.expander`` whose default-expanded state is ``not has_result``. Clicking the "Clean Text" primary button stashes the result and calls ``st.rerun()`` so the next pass sees the result in session state and the expander folds — the Results section becomes the primary visual focus. User can re-expand manually to re-inspect the source. 2. The Examples (changes audit) table's Before/After columns were calling ``visualize_hidden_html`` WITHOUT ``mark_outer_whitespace``, so leading/trailing whitespace — which is exactly what the cleaner most often removes — was invisible. Pass ``mark_outer_whitespace=True`` to match the input-preview rendering. Column-name cell now mirrors that flag too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 20:43:52 +00:00
Michael	502a72cd46	feat(nav): ← Back to Home link on every tool page Multi-file workflow: a user uploads several files on Home, clicks "Open <Tool>" on one file's findings, lands on a tool page. The sidebar lets them get back to Home, but a top-of-page back affordance is more discoverable and keeps the hand in the same screen region as the upload list they're working through. - New ``back_to_home_link()`` helper in components/_legacy.py renders a secondary button that calls ``st.switch_page("app.py")`` — under ``st.navigation`` that routes to the default (Home) page. - Wired into every tool page (1-9) directly after ``hide_streamlit_chrome()`` and BEFORE the license gate so a Lite user who lands on a locked tool can navigate away without paying. - New i18n key ``nav.back_to_home`` ("← Back to Home" / "← Volver al inicio") in en/es packs. 2008 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 20:38:01 +00:00
Michael	604debb9a9	revert(home): keep per-tool grouping for per-file findings Restoring ``render_findings_panel`` on the home page. Previous commit (`c575efd`) inlined a flat renderer that dropped the per-tool grouping and the "Open <Tool>" jump links — that was an over-correction. The user only wanted the bottom tool-card grid gone (already removed in `ff2eaeb`). The grouping inside the findings panel is what lets a user land on a specific finding and one-click into the cleaner that fixes it; without it they'd have to guess which sidebar entry to open. Tool-card grid stays removed. Sidebar nav is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 20:31:36 +00:00
Michael	c575efd26e	fix(home): render findings flat — drop per-tool grouping The home page was calling ``render_findings_panel``, which groups findings by tool into expanders and renders an "Open <Tool>" page link under each. After uploading a file, the user still saw a tool list (just under a different shape) — defeating the earlier cleanup that removed the tool-cards grid. Inline a flat renderer in ``_home_page``: per uploaded file, render the filename header + severity summary + a flat list of findings via ``_render_one_finding`` directly. No expanders, no tool names as section headers, no per-tool page-link buttons. Tool discovery happens in the sidebar. ``render_findings_panel`` itself is unchanged — it still groups by tool and remains tested via the findings-panel harness, but is no longer used on the home page. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 20:22:20 +00:00
Michael	175389219f	fix(gui): translate sidebar tool names when language changes The sidebar nav was passing ``tool.name`` (the registry's English field) to ``st.Page``, so the tool entries stayed in English even after the user picked Spanish from the language selector. Section headers were already i18n-driven; tool entries were not. Switch to ``tool_name(tool_id)`` which routes through ``t(...)`` and picks up the active language from session state. Verified: with ``ui_lang=es`` the sidebar renders Buscar duplicados / Limpiar texto / Mapear columnas / etc. instead of the English fallbacks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 20:19:15 +00:00
Michael	c568aec8a7	feat(gui): one-click Close in its own bottom sidebar section Close is now a direct shutdown trigger: visiting the Close page (the sidebar entry) fires shutdown_app() immediately — no confirm step, no intermediate body. The farewell overlay paints and os._exit(0) lands ~1s later from a daemon thread. Layout: Close moved into its own bottom-of-sidebar section so the destructive action is visually separated from Account/Activate. - New shutdown_app() in components/_legacy.py replaces quit_button. os._exit thread is skipped when "pytest" is in sys.modules so the test suite doesn't suicide on rendering 99_Close. - pages/99_Close.py shrinks to set_page_config + chrome + shutdown_app. - app.py nav grows a new "Close" section header (new nav.section_close key in en/es packs) pinned at the bottom of the navigation dict. Tests updated: - TestQuitButtonRenders → TestClosePageShutsDownImmediately. Assert the shutdown caption renders + no confirm button exists. - test_smoke EXPECTED_SUBSTRINGS["99_Close"] now pins "Shutting down" / "Cerrando" (the visible page body) instead of the removed page title. 2008 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 20:17:14 +00:00
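A rough sketch of the pytest guard around the delayed exit in c568aec8a7, under stated assumptions: ``schedule_hard_exit`` and its parameters are hypothetical names, and the real ``shutdown_app()`` also paints the farewell overlay, which is omitted here.

```python
import os
import sys
import threading


def schedule_hard_exit(delay_s=1.0, modules=None, exit_fn=os._exit):
    """Start a daemon timer that ends the process, unless pytest is loaded."""
    modules = sys.modules if modules is None else modules
    if "pytest" in modules:
        # Under the test suite, rendering the Close page must not kill the run.
        return None
    timer = threading.Timer(delay_s, exit_fn, args=(0,))
    timer.daemon = True
    timer.start()
    return timer
```

Passing ``modules`` and ``exit_fn`` explicitly is only a convenience for exercising both branches without ending the interpreter.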
Michael	ff2eaeb6c4	feat(home): multi-file upload + per-file analysis, drop tool grid Home is now upload + analysis only. The page accepts multiple files in one go, analyzes each independently, and renders findings grouped by filename in bordered containers. The 3-section tool-card grid is gone — discovery happens via the sidebar now. Mechanics: - file_uploader uses accept_multiple_files=True. Each file's findings cache in session_state["home_findings_by_file"] keyed by filename so removing a file via Streamlit's "x" button drops its findings too, and re-clicking Run only re-analyzes pending files. - The first uploaded file is mirrored into the singular home_uploaded_{name,bytes,size} keys so tool pages continue to pick up an "active" upload through pickup_or_upload — no tool-page changes. - New i18n keys: upload.intro_multi, upload.uploader_label_multi, upload.clear_results, upload.empty_state. upload.heading text is updated to "Upload one or more files to start" (EN + ES). Dropped tests pinning the tool grid: - TestHomeToolGridLocalization (test_chrome.py) - test_home_tool_card_uses_es_name (test_smoke.py) - TestLiteHomeGridBadges (test_lite_tier.py — locked-card lock-badge assertions; locking is still enforced per-tool-page via require_feature_or_render_upgrade) 2009 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 20:12:48 +00:00
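The per-file cache behaviour in ff2eaeb6c4 (removing a file drops its findings, re-running only analyzes pending files) might look like the sketch below. A plain dict stands in for ``session_state["home_findings_by_file"]``; ``sync_findings`` and the ``analyze`` callable are hypothetical.

```python
def sync_findings(cache, uploaded, analyze):
    """Bring the findings cache in line with the current upload list.

    cache    -- dict of filename -> findings (mutated in place)
    uploaded -- dict of filename -> file bytes currently in the uploader
    analyze  -- callable taking bytes and returning findings
    Returns the filenames that were analyzed on this pass.
    """
    # Drop findings for files the user removed from the uploader.
    for name in list(cache):
        if name not in uploaded:
            del cache[name]

    # Analyze only files that have no cached findings yet.
    analyzed = []
    for name, data in uploaded.items():
        if name not in cache:
            cache[name] = analyze(data)
            analyzed.append(name)
    return analyzed
```

Keying by filename keeps the bookkeeping simple, at the cost that two different files with the same name share one cache entry.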
Michael	dad744f17f	refactor(gui): drop Review page + normalization gate Home is now the only entry point: the "Run analysis" button on the upload section IS the review step (findings render inline via render_findings_panel). Tool pages no longer gate on a passed normalization — running the analyzer is sufficient context. Removed: - src/gui/pages/0_Review.py - src/gui/components/gate.py (re-export seam) - require_normalization_gate() in src/gui/components/_legacy.py - "review" section enum in tools_registry.py - Data Review entry in app.py navigation - require_normalization_gate() calls + imports in all nine tool pages - tests/gui/test_gate.py (whole file) - TestReviewWorkflow in tests/gui/test_workflows.py - 0_Review entry in tests/gui/test_smoke.py PAGE_SLUGS - stash_upload's normalization_result+normalization_for stashing - stash_upload_without_gate (was the gate's negative-path helper) 2017 tests pass (16 retired with the gate flow). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 20:04:33 +00:00
Michael	fc6c22c6a7	feat(review): inline file uploader instead of redirect home When a user lands on Review without an upload, show a file uploader on the page itself and auto-run the analyzer once a file is picked, rather than bouncing them to the home page with a "Back to home" button. Auto-analyze is the right default here: the user is already on the Review page, so they've implicitly committed to a scan. Stashing the bytes in the same session-state keys the home page uses keeps the rest of the flow (encoding picker, gate, tool pages) unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 19:57:01 +00:00
Michael	db5ec084da	docs+code: rename tool labels everywhere Sweep follow-up to `93e43fc`. Display labels now consistent across docs, landing pages, CLI output, code comments, docstrings, and test prose. Five parallel surfaces touched: - docs (EN + ES): README, USER-GUIDE, CLI-REFERENCE, and 11 internal design/planning docs - landing pages: index + bookkeeper/revops/shopify-pet - src: CLI module docstrings, _TOOL_DISPLAY dicts in cli_analyze.py and gui/components/_legacy.py, core module headers, every tool page's module docstring - tests: class/method/module docstrings and section-header comments - test-cases READMEs Page slugs (1_Deduplicator etc.), tool_id strings (01_deduplicator etc.), Python class names (TestDeduplicatorWorkflow, FeatureFlag.*), URL paths, anchor IDs, CSS classes, and asset filenames were left intact since they're code identifiers / structural references. All 2033 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 19:50:09 +00:00
Michael	93e43fc0d9	feat(gui): sidebar sections + non-technical tool labels Sidebar nav now groups tools under Data Review / Data Cleaners / Transformations / Automations via st.navigation, replacing the flat auto-discovered list. Tool display names switch to action-first phrasing (Find Duplicates, Fix Missing Values, Find Unusual Values, Standardize Formats, Clean Text, Quality Check, Map Columns, Combine Files, Automated Workflows) in EN + ES packs and on each page's H1. The Data Cleaners section follows the requested order: Missing Values → Outliers → Text Cleaner → Format Standardizer → Deduplicator → Quality Check. (Text Cleaner kept inside cleaners since the request didn't list it but the tool still ships.) Registry now carries a section field; helpers added: tools_in_section(), section_label(). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 19:36:01 +00:00
Michael	624f99653e	docs(arch): end-to-end system + tech-stack diagrams New ARCHITECTURE.md pulls the desktop app (TECHNICAL.md) and the license server (LICENSE-SERVER.md) into a single picture — the two were never reconciled into an end-to-end view before. Contents: §1. System diagram (ASCII) showing operator laptop, license server stack (nginx → FastAPI → Postgres), Postmark, Gumroad, and the buyer's machine — with the three primary flows (sale, manual mint, offline activation) traced through it. §2. Tech stack diagram, layered: desktop / server / operator / external SaaS, with version pins. §3. Trust + isolation boundaries table — what crosses each one and what the threat model is. §4. "Where things are stored" — paths, tables, files. §5. Pointers to the deeper per-component docs. ASCII over Mermaid since the repo's Gitea version is unknown and plain text renders in every viewer / IDE / raw `cat`. LICENSE-SERVER.md status flipped from "design proposal, not built" to "deployed (PR 1 + PR 2 code merged)" — that was stale since the PR 1 deploy yesterday. TECHNICAL.md and ADMIN.md gain one-line pointers to ARCHITECTURE.md so people land at the unified view when looking for "how does it all fit together". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 01:59:05 +00:00
Michael	86ad21db79	docs(license): PR 2 deploy + operator instructions ADMIN.md gains a "Running a Gumroad webhook" section: how the URL secret works, how to add a SKU to products.yaml, how to inspect gumroad_events (recent activity + failures-only queries), how to replay a failed delivery, and how to test without buyers via Gumroad's "Send Test Ping" button. The deployed-vs-queued matrix flips Gumroad + Postmark to "code merged, deploy pending" so it's clear the bits exist on main but the live box still runs PR 1. SETUP-LICENSE-SERVER.md §3 commits the eventual compose.yml shape with PR 2 environment + secrets lines included but commented out, ready to uncomment at deploy time. The §3 chown step already covers the new secret files because it uses `chmod 400 secrets/` / `chown 10001:10001 secrets/`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 01:33:53 +00:00
Michael	2bbaba954b	feat(server): Gumroad webhook receiver + Postmark email (PR 2) Wires the second source-adapter (Gumroad) plus the email delivery that lets the server fulfill a sale end-to-end without operator intervention. Auth model: Gumroad doesn't HMAC the body, so we use their recommended URL-secret pattern (?secret=...). Wrong/missing secret returns 404 — no signal to a prober that the endpoint exists. Webhook flow (server/app/routes/webhooks.py): 1. audit-log the raw payload (gumroad_events row) BEFORE anything else, so a later failure leaves us replayable 2. parse via GumroadAdapter (server/app/adapters/gumroad.py) 3. mint_from_sale — UNIQUE(source, source_order_id) dedups duplicate webhook retries 4. send the license email 5. mark gumroad_events.processed = true Always returns 200 once auth passes. Non-2xx would trigger Gumroad's 3-day retry storm; we'd rather record the failure on the audit row and replay manually after fixing whatever surfaced. Product → tier mapping is per-source YAML at server/config/products.yaml (lru_cached). Adding a SKU = edit yaml, restart api. Unmapped product_id is an error on the audit row, not a crash. EmailService (server/app/email.py): provider-agnostic interface with Postmark as the first implementation. When POSTMARK_TOKEN is unset the factory returns LoggingEmailService instead, so the webhook exercises end-to-end before Postmark is provisioned. 
48 unit tests (was 21) including: - Gumroad secret verify with constant-time compare - Sale parsing: amount-in-cents, name fallback from email, test=true tagging, missing-required fields, offer codes - Product mapping lookups - Email rendering text + HTML, HTML-escapes user input - Postmark client via httpx.MockTransport (success and 4xx) - Webhook end-to-end: secret check, audit log, idempotency on retry, unmapped product, email failure keeps license Smoke test (server/scripts/smoke.sh) extended to POST a synthetic Ping payload, verify the row + audit log, prove wrong-secret is rejected, prove duplicate sale_id stays one row. SQLite-test compatibility: - BigInteger primary key uses with_variant(Integer, "sqlite") since SQLite only autoincrements INTEGER PRIMARY KEY. - python-multipart pulled in for FastAPI Form parsing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 01:33:43 +00:00
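The webhook ordering in 2bbaba954b (auth, audit row first, idempotent mint, always 200 after auth) can be illustrated with an in-memory sketch. Everything here is a stand-in: ``InMemoryLicenses``, ``handle_gumroad_webhook`` and the ``sale_id`` / ``email`` payload fields are assumptions for illustration, and the email step is left out.

```python
import hmac


def secret_matches(provided, expected):
    """Constant-time comparison; a missing secret never matches."""
    if provided is None:
        return False
    return hmac.compare_digest(provided.encode(), expected.encode())


class InMemoryLicenses:
    """Stand-in for a table with UNIQUE(source, source_order_id)."""

    def __init__(self):
        self.rows = {}

    def mint_from_sale(self, source, order_id, email):
        key = (source, order_id)
        if key not in self.rows:          # a retry maps to the existing row
            self.rows[key] = {"email": email}
        return self.rows[key]


def handle_gumroad_webhook(store, audit_log, secret, expected_secret, payload):
    if not secret_matches(secret, expected_secret):
        return 404                        # do not reveal that the endpoint exists
    # Record the raw payload before any parsing so failures stay replayable.
    event = {"payload": dict(payload), "processed": False, "error": None}
    audit_log.append(event)
    try:
        store.mint_from_sale("gumroad", payload["sale_id"], payload["email"])
        event["processed"] = True
    except Exception as exc:              # keep the failure on the audit row
        event["error"] = repr(exc)
    return 200                            # avoid triggering sender-side retries
```

The broad ``except`` is deliberate in this sketch: once auth has passed, any failure is stored on the audit entry instead of surfacing as a non-2xx response.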
Michael	b5cd74d474	docs(admin): live deployment section for the running license server Documents the post-deploy state of PR 1: live URLs (datatools and licenses subdomains on unalogix.com), the on-box filesystem layout under /srv/datatools-license/, where the admin token lives and how to retrieve / rotate it, the laptop-side SSH-tunnel + admin_cli mint workflow, inspection commands (logs, psql, container status), restart / rebuild procedures, manual backup commands until cron lands, the production-key rotation outline, and a deployed-vs-queued capability matrix. Secrets are NEVER pasted into this doc — the admin token's literal value lives only on disk (mode 400, UID 10001). Committing it to git would mean permanent leakage via history even after rotation; documenting its location + rotation procedure achieves the same operational outcome without the residual exposure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 01:19:57 +00:00
Michael	1cf69dd23b	docs(license): runbook fixes from PR 1 self-host deploy Two real-world footguns surfaced during the first live deploy: 1. docker-compose's uid/gid/mode long-form on file-based secrets is silently ignored — that's a swarm-mode-only feature. The container app user (UID 10001 from the Dockerfile) cannot read a mode-400 file whose host UID it doesn't match. Fix is to chown the secret files to 10001 directly; host-side access control stays gated by the parent dir's mode 750. 2. nginx 1.24 (Ubuntu 24.04 default) rejects the standalone "http2 on;" directive (that arrived in 1.25). Use the legacy "listen 443 ssl http2;" combined form. Noted prominently so the next deploy doesn't trip on it. Also realigned §3's compose example to what actually got deployed for PR 1 — only pg_password + admin_token secrets, postmark / gumroad / license_privkey commented out as PR 2 / production-key follow-ups. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 01:17:05 +00:00
Michael	673b902377	feat(license): datatools-admin CLI for the mint API New operator CLI at src/admin_cli.py with mint, list, revoke and ping commands; it talks to the server's /internal/* endpoints over a local SSH tunnel. The desktop side needs only stdlib urllib plus typer, so no new top-level deps. Auth via $DATATOOLS_ADMIN_TOKEN. scripts/generate_license.py is now annotated as a break-glass tool for when the server is unreachable; routine work goes through the new CLI so the authoritative `licenses` row is created. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 00:47:01 +00:00
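For orientation, a stdlib-only request against the mint endpoint from 673b902377 could be built along these lines. This is a sketch under assumptions: the ``build_mint_request`` helper, the JSON body and its field names are hypothetical, and the tunnel's local port is whatever the operator forwards.

```python
import json
import urllib.request


def build_mint_request(base_url, token, payload):
    """Build (but do not send) a Bearer-authenticated POST to /internal/mint."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/internal/mint",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it is then a matter of ``urllib.request.urlopen(request)`` once the tunnel is up, with the token read from ``$DATATOOLS_ADMIN_TOKEN``.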
Michael	bab2c9468c	feat(server): mint API + Postgres schema + manual adapter (PR 1) Source-agnostic license issuance service. FastAPI app fronts a Postgres `licenses` table; the only currently-wired source is `manual` (operator mints via /internal/mint). Gumroad webhook adapter lands in PR 2. Key design points: - Signing reuses src/license/crypto.py via a COPY into the image (single source of truth — blobs minted server-side verify against the same embedded pubkey on the buyer's machine). - Source adapter Protocol (app/adapters/base.py) is the seam for Gumroad / Lemon Squeezy / Stripe in later PRs; Mint API speaks only SaleEvent / RefundEvent. - (source, source_order_id) UNIQUE composite gives idempotent webhook retries without double-mint. - JSONB type uses with_variant(JSON, 'sqlite') so the same models drive both Postgres prod and SQLite tests (no testcontainers dep). - Bearer-token auth on /internal/; the IP-loopback guard was removed after the docker bridge made it fight legitimate prod traffic (nginx defense + Bearer remain). - Secrets resolved via _FILE env vars pointing at /run/secrets/<name>, so passwords never appear in `docker inspect`. 21 unit tests (SQLite in-memory, StaticPool) plus a real-Postgres docker-compose smoke test in server/scripts/smoke.sh that builds the image, runs the alembic migration, mints a license, verifies the signature against the host dev pubkey, and checks the DB row. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 00:46:54 +00:00
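The ``_FILE`` convention mentioned in bab2c9468c can be sketched as below. ``resolve_secret`` is a hypothetical helper, and the fallback to a plain environment variable is an assumption of this sketch, not something the commit states.

```python
import os
from pathlib import Path


def resolve_secret(name, env=None):
    """Read a secret from the file named by <NAME>_FILE, else from <NAME>."""
    env = os.environ if env is None else env
    file_path = env.get(f"{name}_FILE")
    if file_path:
        # Only the path lives in the environment; the value stays on disk.
        return Path(file_path).read_text(encoding="utf-8").strip()
    if name in env:
        return env[name]  # assumed fallback, e.g. for local development
    raise KeyError(f"neither {name}_FILE nor {name} is set")
```

With this shape, ``docker inspect`` shows a path such as ``/run/secrets/<name>`` and never the secret itself.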
Michael	4179cb5156	docs(license): self-hosted server runbook + multi-tenancy plan Adds SETUP-LICENSE-SERVER.md — end-to-end install runbook for the license server on the existing invixiom box (Ubuntu 24.04). Covers DNS, system packages, Postgres + API in Docker, dedicated system user, secrets layout under /srv/datatools-license/secrets (mode 400), nginx config in a separate sites-available/unalogix file, Let's Encrypt cert issuance, smoke tests, backups, monitoring, key rotation, and rollback. Multi-tenancy is explicit at every layer: separate DNS zone (unalogix.com vs invixiom.com), separate nginx file, separate TLS cert, dedicated backend ports (8090 for the API, 5433 for Postgres, both localhost-only), separate docker compose project and volume. No invixiom service is touched. LICENSE-SERVER.md updated: hosting choice moved from "Fly.io / Render" (rejected) to self-hosted (decided). Points at the new runbook for ops specifics. ADMIN.md pointer table updated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 22:57:53 +00:00
Michael	52e04f63a9	docs(license): design proposal for online issuance & record-keeping Forward-looking design doc — not implemented. Describes the smallest useful server that replaces the manual mint-and-paste workflow: Gumroad webhook → Mint API (KMS-held private key) → Postgres licenses table, plus a self-service renewal/re-delivery portal. The desktop app is deliberately untouched across all three migration phases: activation stays fully offline and continues to verify blobs against the embedded pubkey, preserving the DECISIONS.md §9b promise that buyer machines never phone home. Schema is intentionally a superset of the local issuance JSONL log (ADMIN.md), so Phase 1 migration is a flat INSERT per row. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 22:26:24 +00:00
Michael	23c51fd759	feat(license): local issuance log for minted blobs generate_license.py now appends every minted license to ~/.datatools-creator/issued.jsonl (overridable via env). This is the creator-side system of record until the server-side flow lands. The full blob is stored alongside name/email/tier/expiry so buyers who lose their delivery email can be re-served without re-minting. File is created mode 600 and lives outside the buyer-facing ~/.datatools/ dir so it never gets bundled into a shipped install. Log failures are non-fatal (warning to stderr) — the mint already succeeded by the time we try to log, and forcing a re-mint after a log error would invalidate any device the buyer had activated. Pass --no-log for test mints. ADMIN.md adds a "Customer record-keeping" section with the path, schema, jq one-liners, and migration note pointing at the upcoming LICENSE-SERVER.md design doc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 22:25:19 +00:00
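The non-fatal append described in 23c51fd759 might be structured like this. It is a sketch only: ``append_issuance`` is a hypothetical name, the record fields follow the commit message (name, email, tier, expiry, blob), and the 0o600 mode is still subject to the process umask.

```python
import json
import os
import sys


def append_issuance(log_path, record):
    """Append one JSON line to the issuance log; never raise on I/O errors."""
    try:
        log_path.parent.mkdir(parents=True, exist_ok=True)
        # Create the file owner-read/write only if it does not exist yet.
        fd = os.open(log_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
        with os.fdopen(fd, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record, sort_keys=True) + "\n")
        return True
    except OSError as exc:
        # The mint already succeeded, so a logging failure is only a warning.
        print(f"warning: could not write issuance log: {exc}", file=sys.stderr)
        return False
```

Returning a boolean instead of raising keeps the caller from re-minting after a log error, which the commit calls out as the outcome to avoid.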
Michael	65e17e0a70	docs(admin): internal license operations reference Creator-only ADMIN.md covering keypair generation, blob minting, dev vs. production key model, tier matrix, and recovery if the private key is lost. Includes a TL;DR for minting a dev license against the in-tree keypair. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 22:10:16 +00:00
Michael	e534fb4989	sec(license): Ed25519 sigs + production-safe tripwire Two coupled hardening upgrades. 1. Asymmetric signatures (HMAC → Ed25519) The previous HMAC scheme used a symmetric secret that any motivated reverse engineer could pull out of the shipped binary and use to mint blobs for any tier / name / email. With Ed25519, the binary ships only the public verification key; the signing key never leaves the seller's environment, so binary compromise no longer yields forgery. - src/license/crypto.py rewritten around cryptography.hazmat.primitives.asymmetric.ed25519. Same public API surface (sign/verify/encode_blob/decode_blob), same canonical JSON encoding — drop-in for the manager / cli / GUI layers. - DATATOOLS_LICENSE_PRIVKEY (seller-side) and DATATOOLS_LICENSE_PUBKEY (build-time) env vars supply the keys; the in-source dev keypair (src/license/_dev_keypair.py) deterministically derives from a seed phrase for repro builds and tests. - Blob prefix bumped DTLIC1: → DTLIC2:. Decoding a DTLIC1 blob surfaces a clear "old format" error rather than a confusing signature mismatch. - scripts/generate_keypair.py mints fresh production keypairs for the seller (run once, stash the private key offline). Adds cryptography>=41,<46 to requirements.txt (was an undeclared transitive dep). 2. Production-safe tripwire assert_production_safe() refuses to boot a frozen / shipped build when either: - DATATOOLS_DEV_MODE=1 is set (would unconditionally bypass every license check — fine in source/test but catastrophic in a buyer install). - The active verification key is still the embedded dev key (the build pipeline forgot to set DATATOOLS_LICENSE_PUBKEY). No-op in source / pytest runs (sys.frozen is unset) so test fixtures and dev workflows keep working without ceremony. Called from src/cli_license_guard.guard() and from hide_streamlit_chrome — so it fires on every CLI invocation and every GUI page load. 
Tests: 49 license-layer unit tests (was 40); added Ed25519 wrong-key rejection, dev-keypair seed pin, blob v2 prefix, v1 rejection with clear message, and four production-safe scenarios (no-op in source, fires on DEV_MODE in frozen, fires on dev key in frozen, passes in frozen with prod pubkey). Total: 2024 → 2033. Docs (REQUIREMENTS §17a, DEVELOPER licensing recipe, DECISIONS §9b + decision log) updated with the new threat-model write-up, key-storage workflow, and tripwire behaviour. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 17:34:48 +00:00
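The two tripwire conditions in e534fb4989 can be illustrated as follows. The real ``assert_production_safe()`` is called without arguments; the explicit parameters and the ``UnsafeBuildError`` type below are hypothetical, added so the sketch is self-contained.

```python
import sys


class UnsafeBuildError(RuntimeError):
    """Raised when a frozen build would run with dev-only settings."""


def assert_production_safe(env, active_pubkey, dev_pubkey, frozen=None):
    if frozen is None:
        frozen = bool(getattr(sys, "frozen", False))
    if not frozen:
        return  # source checkouts and pytest runs are left alone
    if env.get("DATATOOLS_DEV_MODE") == "1":
        raise UnsafeBuildError("DATATOOLS_DEV_MODE=1 in a shipped build")
    if active_pubkey == dev_pubkey:
        raise UnsafeBuildError("shipped build still verifies with the dev key")
```

The four test scenarios listed in the commit map directly onto these branches: no-op in source, dev mode in frozen, dev key in frozen, and a clean frozen build.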
Michael	d32b58e61a	feat(license): add Lite SKU; remove user-facing free trial Two coupled changes: 1. Lite tier - New Tier.LITE in src/license/schema.py. - FEATURES_BY_TIER[Tier.LITE] = {Deduplicator, Text Cleaner, Format Standardizer}. The three universally-useful tools that cover the most common bookkeeping / RevOps / Klaviyo prep workflows. Other six tools require Core. - i18n: license.tier_lite, license.feature_locked_title, license.feature_locked_body, license.upgrade_link, license.status_locked (en + es). - Per-tool feature gate at every GUI tool page (require_feature_or_render_upgrade) and every tool CLI (guard(feature=...)). A locked tool renders an upgrade prompt + Manage-license button (GUI) or exits with code 2 (CLI). - Home grid: tool cards the user's tier doesn't unlock get a red 🔒 Locked badge in place of green Ready. 2. Trial removed - Activation form's "Start 1-year trial" button removed. - license_cli's `trial` subcommand removed. - activation.trial_button / activation.trial_help i18n keys dropped (pack parity test stays green). - Tier.TRIAL stays in the enum (back-compat with any field- tested trial licenses); LicenseManager._mint stays internal for tests and the seller's key generator. - Decision logged in DECISIONS §9b: a 1-year all-features trial undercuts paid Lite; paid-only keeps tier economics clean. Tests (+29 net): +17 Lite-tier unit/guard tests + 13 Lite-tier GUI tests + 1 trial-absent assertion - 2 trial CLI tests - 1 trial GUI button test. Total: 1995 → 2024. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 17:19:30 +00:00
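A compact illustration of the tier-to-feature mapping from d32b58e61a. The enum member names are hypothetical (the commit only gives display names), and the Core set is assumed to contain all nine tools.

```python
from enum import Enum


class Tier(Enum):
    LITE = "lite"
    CORE = "core"


class FeatureFlag(Enum):  # member names are placeholders for the nine tools
    DEDUPLICATOR = 1
    MISSING_VALUES = 2
    OUTLIERS = 3
    FORMAT_STANDARDIZER = 4
    TEXT_CLEANER = 5
    QUALITY_CHECK = 6
    COLUMN_MAPPER = 7
    COMBINE_FILES = 8
    AUTOMATED_WORKFLOWS = 9


FEATURES_BY_TIER = {
    Tier.LITE: frozenset({
        FeatureFlag.DEDUPLICATOR,
        FeatureFlag.TEXT_CLEANER,
        FeatureFlag.FORMAT_STANDARDIZER,
    }),
    Tier.CORE: frozenset(FeatureFlag),  # assumption: Core unlocks every tool
}


def is_unlocked(tier, feature):
    """True when the given tier includes the feature."""
    return feature in FEATURES_BY_TIER[tier]
```

A per-page gate would call something like ``is_unlocked`` and render the upgrade prompt when it returns False; the CLI equivalent exits with code 2.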
Michael	e612c751a8	docs(license): document activation flow, tier system, dev bypass - USER-GUIDE EN + ES gain a §0 "First launch — activation" section covering paid blob activation, 1-year trial, renewal, file location, and device-swap. - REQUIREMENTS §17a "Licensing" — storage path, activation model, lifetime, tier list, dev bypass env var. Test count: 1995. - DEVELOPER gains a "Licensing" recipe in the Extension recipes section: public API, feature-flag add, tier add, minting via the creator-only script. - DECISIONS §9b — log the offline-HMAC choice with the threat-model trade-off (motivated piracy not stopped; honor-system + 30-day refund covers casual sharing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 16:54:30 +00:00
Michael	e435103113	feat(license): registration + 1-year licenses + tier scaffolding A complete offline licensing layer (no internet at any step): Core - src/license/ — schema (License, Tier, FeatureFlag), HMAC crypto, JSON storage, LicenseManager singleton with activate/renew/ deactivate/issue_trial. Tier-scaffolded so future SKUs can carve per-tool feature sets without consumer-code edits. - scripts/generate_license.py — creator-only key generator. Mints a DTLIC1: blob the buyer pastes into the activation page. GUI - New activation form component (src/gui/components/activation.py). - hide_streamlit_chrome() now inline-renders the activation form when no valid license is present (every page short-circuits to the form until activated). - Sidebar shows tier + days remaining; renewal warning under 30 days. - New pages/_Activate.py for revisiting the form after activation. CLI - src/license_cli.py — activate / renew / status / trial / deactivate commands. Exempt from the guard. - src/cli_license_guard.py — drop-in guard call added to every tool CLI's main(). Lets --help through; respects DATATOOLS_DEV_MODE. i18n - New activation.* and license.* keys in en.json + es.json (page title, form labels, status badges, renewal warnings, error messages). Pack parity test stays green. Test infrastructure - tests/conftest.py autouse fixture sets DATATOOLS_DEV_MODE=1 so the existing 1916 tests continue to pass. - isolated_license_path / activated_license_manager / unactivated_license_manager fixtures for tests that want to drive the real check. Tests (+79) - tests/test_license.py (40): schema, crypto roundtrip, blob encode/decode, tier→feature mapping, activation flow, name/email mismatch rejection, tamper detection, expiration, renewal, dev-mode bypass. - tests/test_license_cli.py (26): every license_cli command + subprocess tests confirming every tool CLI refuses to run without a license, --help always works, DEV_MODE bypasses. 
- tests/gui/test_activation.py (13): gate blocks without license, passes with trial, activation form submission unlocks the gate, sidebar status, renewal warning, i18n. Total: 1916 → 1995 tests. All pass under the strict warning filter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 16:54:23 +00:00
Michael	b2c7b94fe9	fix: clear all latent deprecation + resource warnings Three real issues surfaced when running the suite with strict warnings: 1. src/core/format_standardize.py: ``datetime.utcfromtimestamp`` is deprecated in CPython 3.12 and slated for removal. Replace with ``datetime.fromtimestamp(ts, tz=timezone.utc)``. Output for the date-only format codes we use is byte-identical. 2. src/core/io.py: ``list_sheets`` leaked the openpyxl file handle by returning ``xl.sheet_names`` from an unclosed ``pd.ExcelFile``. Wrap in a ``with`` block so the FD closes deterministically — also prevents the Windows-only "file is locked" repro path. 3. tests/test_corpus.py: ``TestXlsxPollution.workbook`` fixture returned the bare ``pd.ExcelFile`` instead of yielding + closing. Convert to a yield-and-finally pattern so the class-scoped handle isn't leaked across the whole test file. Also harden pytest.ini's warning policy: escalate ``ResourceWarning`` from ``src`` to an error, alongside the existing ``DeprecationWarning`` rule. Third-party warnings stay filtered — we can't fix pandas/openpyxl/streamlit churn from here. All 1916 tests pass under the strict filter; full and split runs (``pytest``, ``pytest -m 'not gui'``, ``pytest -m gui``) all clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 16:28:48 +00:00
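The ``utcfromtimestamp`` replacement from b2c7b94fe9, shown in isolation. ``format_epoch_date`` is a hypothetical wrapper; only the ``datetime.fromtimestamp(ts, tz=timezone.utc)`` call comes from the commit.

```python
from datetime import datetime, timezone


def format_epoch_date(ts, fmt="%Y-%m-%d"):
    """Format a Unix timestamp as a UTC date without the deprecated call."""
    # Timezone-aware equivalent of datetime.utcfromtimestamp(ts).
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime(fmt)
```

For date-only format codes the aware and naive variants produce the same string, which is why the change is output-compatible.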
Michael	070e3c9f06	docs(gui): document the new GUI test layer REQUIREMENTS §16 updates the test count (1777 → 1916) and breaks out the GUI subset. DEVELOPER's Tests section gains the 'gui' marker recipes and the new tests/gui/ tree under test layout, plus a short 'GUI test layer' explainer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 16:13:40 +00:00
Michael	35d46a0c1a	test(gui): add Streamlit AppTest layer (139 tests) Until now every test ran against core or the CLI; the Streamlit GUI was verified by hand. This commit adds tests/gui/: 139 AppTest-driven tests behind a 'gui' marker so the quick loop (``pytest -m 'not gui'``) stays at 1777 tests / ~10s while ``pytest`` runs everything (1916 / ~14s). Coverage: - test_smoke.py (59): every page renders in EN and ES, expected substring present, sidebar selector mounted. - test_chrome.py (18): language selector flips session state and re-renders; quit button + farewell strings localize; tool-card names use the active language. - test_gate.py (9): require_normalization_gate no-op / warning / short-circuit / hash-mismatch invariants; warning + button localized. - test_workflows.py (14): happy path per Ready tool: stash upload, render, find primary action, verify result lands in session state. - test_dedup_review.py (8): Accept All / Reject All / Clear Decisions wire through to review_decisions; apply_review_decisions semantics (keep-all, merge, column override). - test_advanced_panels.py (15): config_panel widget defaults and options (algorithm, threshold, survivor rule, merge, multiselects, config save/load). - test_errors.py (4): garbage / empty / single-column uploads don't crash; duplicate-target mapping raises InputValidationError. - test_findings_panel.py (12): driven via a small standalone harness page so we test the component without faking a file_uploader. EN + ES strings, per-tool grouping, open-tool button label, untargeted expander, severity summary. Shared infrastructure in tests/gui/conftest.py: - ``stash_upload`` / ``stash_upload_without_gate``: populate session_state to pre-pass or block the gate. - ``with_language``: set ``ui_lang`` before run(). - ``collected_text``: flatten title/caption/markdown/etc. into one string for substring assertions. - Auto-marking: every test in tests/gui/ gets ``@pytest.mark.gui`` via ``pytest_collection_modifyitems``, so the marker isn't per-test boilerplate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 16:13:40 +00:00
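Note on 35d46a0c1a: the auto-marking it describes can be sketched roughly as below. This is an illustrative guess at the shape of the hook, not the repository's actual ``tests/gui/conftest.py``; the helper name ``mark_gui_items`` is hypothetical, and ``item.path`` assumes pytest 7 or newer.

```python
from pathlib import Path

import pytest


def mark_gui_items(items, gui_root):
    """Attach the 'gui' marker to every collected item that lives under gui_root.

    Hypothetical helper: split out of the hook so the path logic can be
    exercised without running a full pytest session.
    """
    gui_root = Path(gui_root)
    for item in items:
        # item.path is the pathlib.Path of the test file (pytest >= 7).
        if gui_root in Path(item.path).parents:
            item.add_marker(pytest.mark.gui)


def pytest_collection_modifyitems(config, items):
    # Standard pytest hook, called once after collection. Assumes this file
    # sits at the root of the GUI test tree, so its own directory is the root.
    mark_gui_items(items, Path(__file__).resolve().parent)
```

With a hook like this in place, ``pytest -m 'not gui'`` deselects the whole tree without any per-test decorator. The ``gui`` marker would still need to be registered under ``markers`` in the pytest configuration to avoid unknown-marker warnings.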
Michael	d0423a8912	docs(perf): publish the dedup/parallel/lazy-copy wins and limits REQUIREMENTS §10 carries the new measured numbers and the dedup blocking trade-off note. DEVELOPER known-limitations is rewritten to reflect that exact-only dedup is now O(n), fuzzy-blocking is opt-in, and column-parallelism is scaffolding for free-threaded Python. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 15:54:25 +00:00
Michael	64452dd783	perf: dedup blocking, column-parallel scaffolding, lazy-copy pipelines Three follow-on wins from the audit, each with shape-pinning tests. 1. Dedup blocking - Exact-only strategies (every column EXACT @ 100 — covers strong- key dedup like email/phone, the drop-duplicates fallback, and explicit "match on this exact column" calls) now route through an O(n) groupby fast path. Lossless; no API change required. Measured: 10k-row email-exact dedup → 73 ms (was ~30 minutes via the O(n²) pair compare). - Fuzzy strategies still pair-compare, with opt-in prefix blocking via deduplicate(..., blocking_columns=[...], blocking_prefix_len=1). Measured: 5k-row fuzzy-name → 25.6s with blocking vs 179s without (7x). Trade-off: cross-block matches missed. 2. Column-parallel standardize - StandardizeOptions.parallel_columns (default 1) lands a ThreadPoolExecutor over the column loop. Output order and audit-record order are preserved deterministically via a merge step keyed off column_types order. Honest doc: under CPython 3.12's GIL the win is roughly neutral (phonenumbers/dateutil hold the GIL); the API is ready for free-threaded Python 3.13+. 3. Lazy-copy in missing / column_mapper - _standardize_sentinels now builds per-column changes in a dict and only materialises the output frame when at least one column actually changed. On a clean 1 GB file this skips a 1 GB allocation. - handle_missing carries an out_is_owned flag, copying on demand before any mutating step. No-op runs return the input frame. - map_columns drops the unconditional upfront df.copy(); rename and drop both return fresh frames already, and schema-add / coerce trigger _ensure_owned() lazily. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 15:54:25 +00:00
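Note on 64452dd783: the two dedup paths it describes can be illustrated with a small standalone sketch. This is not the project's ``deduplicate()`` implementation. The function names, the use of ``difflib.SequenceMatcher`` as the fuzzy scorer, and the 0.85 threshold are assumptions chosen for illustration; only the ideas (a single grouping pass for exact keys, prefix blocking for fuzzy keys) come from the commit message.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def exact_duplicate_groups(rows, key):
    """Exact-match path: one pass, grouping row indices by key value (O(n))."""
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[row[key]].append(idx)
    return [ids for ids in groups.values() if len(ids) > 1]


def blocked_fuzzy_pairs(rows, column, prefix_len=1, threshold=0.85):
    """Fuzzy path: compare pairs only inside blocks sharing a value prefix.

    prefix_len=0 puts every row in one block, i.e. the full pair compare.
    """
    blocks = defaultdict(list)
    for idx, row in enumerate(rows):
        blocks[row[column][:prefix_len].lower()].append(idx)

    pairs = []
    for ids in blocks.values():
        for i, j in combinations(ids, 2):
            a, b = rows[i][column].lower(), rows[j][column].lower()
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                pairs.append((i, j))
    return pairs


rows = [
    {"email": "a@x.com", "name": "Jon Smith"},
    {"email": "b@x.com", "name": "John Smith"},
    {"email": "a@x.com", "name": "Jon Smyth"},
    {"email": "c@x.com", "name": "Kon Smith"},
]
print(exact_duplicate_groups(rows, "email"))
print(blocked_fuzzy_pairs(rows, "name", prefix_len=1))
print(blocked_fuzzy_pairs(rows, "name", prefix_len=0))
```

The last two calls show the trade-off the commit names: with ``prefix_len=1`` the rows "Jon Smith" and "Kon Smith" land in different blocks and are never compared, while ``prefix_len=0`` compares every pair and finds them at the cost of quadratic work.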
Michael	e5f632bcd6	docs(perf): publish 1.5 GB target and the new measured throughputs REQUIREMENTS §10 reflects the post-optimisation numbers and the known O(n²) dedup match step (flagged for a future blocking pass). en/es upload-limit copy and uploader help now say 1.5 GB. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 15:37:26 +00:00


163 Commits