datatools-dev

Author	SHA1	Message	Date
Michael	b3ae913bb9	feat(audit): daily filename + 7-day retention sweep Replaces the per-session ``datatools-<ts>-<sid>.jsonl`` filename with a single daily file ``datatools-YYYY-MM-DD.jsonl`` (local date). Sessions on the same calendar day share a file via the writer thread's per-batch open+append; multiple DataTools instances running concurrently on the same day fan into the same file (append-mode small writes are atomic on POSIX, safe enough on Windows under realistic load). Drops the ``_LOG_PATH`` module global and the lock around it: ``audit_log_path()`` is now pure date math, recomputed on every call so a session that crosses midnight follows the rollover into the next day's file. Adds ``_sweep_old_logs()``, invoked once per process at writer-thread start, which deletes any ``datatools-*.jsonl`` whose mtime is older than 7 days. The glob deliberately matches the legacy per-session filename too, so users upgrading from the previous build don't keep a permanent backlog of pre-retention files. Event ``ts`` fields stay UTC; only the filename uses local date, because users go looking for "today's log" on their wall clock. Tests cover: daily filename shape, sweep removes stale files, sweep keeps fresh files, sweep also clears legacy filenames. Rollback: ``git revert HEAD`` restores the per-session filename and removes the sweep. No data migration needed either way; existing files keep working as JSONL. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 21:22:47 +00:00
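The daily filename and retention sweep described in `b3ae913bb9` can be sketched roughly as follows. This is a minimal illustration, not the code in ``src/audit.py``: the function signatures, the ``today`` / ``now`` parameters (added so the logic is testable), and the ``RETENTION_DAYS`` constant name are assumptions.

```python
import os
import time
from datetime import date, datetime
from pathlib import Path
from typing import List, Optional

RETENTION_DAYS = 7  # retention window stated in the commit message


def audit_log_dir() -> Path:
    # Default directory and env-var override as described in c73d716.
    return Path(os.environ.get("DATATOOLS_AUDIT_DIR", Path.home() / ".datatools" / "logs"))


def audit_log_path(today: Optional[date] = None) -> Path:
    """Pure date math: no mkdir, no open, nothing cached between calls."""
    day = today or datetime.now().date()  # local wall-clock date, not UTC
    return audit_log_dir() / f"datatools-{day:%Y-%m-%d}.jsonl"


def sweep_old_logs(log_dir: Path, now: Optional[float] = None) -> List[Path]:
    """Delete datatools-*.jsonl files whose mtime is older than the window."""
    cutoff = (time.time() if now is None else now) - RETENTION_DAYS * 86400
    removed = []
    # The glob also matches legacy per-session names (datatools-<ts>-<sid>.jsonl).
    for path in log_dir.glob("datatools-*.jsonl"):
        try:
            if path.stat().st_mtime < cutoff:
                path.unlink()
                removed.append(path)
        except OSError:
            continue  # best-effort: a locked or vanished file must not abort the sweep
    return removed
```

Because the path is recomputed on every call, a caller that crosses midnight simply gets the next day's filename; nothing needs to be invalidated.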
Michael	ba07dcb6c7	feat(audit): re-enable audit log (kill switch off by default) Phase 1 diagnostic build validated end-to-end on the user's machine: session cf2ebbd5 (2026-05-19) produced session/upload/analyze/nav/session-end events with no blank-pages regression. Root cause of the original symptom was the audit_log_path/_session_id deadlock fixed in `a8ff8f4`; the kill switch is no longer load-bearing. Flips ``_DISABLED: True`` → ``False`` so the default install writes a log. The three env-var overrides (``DATATOOLS_AUDIT_ENABLED``, ``DATATOOLS_AUDIT_TRACE``, ``DATATOOLS_AUDIT_PROBE``) and the writer-thread BaseException guard from `76c9f5a` stay in place as escape hatches if the symptom ever recurs. TestKillSwitchContract continues to pass: it monkeypatches ``_DISABLED = True`` explicitly and doesn't rely on the module default. Rollback: ``git revert HEAD`` flips the switch back without removing the diagnostic instrumentation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 17:50:28 +00:00
Michael	76c9f5a679	feat(audit): diagnostic instrumentation env vars + writer-thread guard Phase 1 of the audit-log re-enablement plan. Adds three opt-in env vars that let us ship one instrumented build for the user to run, without flipping the kill switch on for everybody. Default behaviour is byte-identical to today: with no env vars set the kill switch wins, no writer thread starts, no file is written, no stderr line is printed. Env vars (do NOT set in prod): - ``DATATOOLS_AUDIT_ENABLED=1`` — bypass ``_DISABLED`` for one session. ``_DISABLED = True`` stays in the source so an upgrade with no env var is still safe. - ``DATATOOLS_AUDIT_TRACE=1`` — print ``[audit] ...`` lines to stderr at module import, every writer-thread state change, and every producer entry point. Lets the user share a small log instead of attaching a debugger. - ``DATATOOLS_AUDIT_PROBE=<value>`` — bisect the producer path for Phase 2. Values: ``full`` (default), ``noop``, ``no-events``, ``no-page-open``, ``no-session-start``. The named variants return early from the corresponding ``log_*`` function so we can isolate which call is implicated in the blank-pages symptom. Also: - ``_writer_loop`` gets an outer ``try/except BaseException`` so silent thread death now surfaces a ``"writer thread died: ..."`` line in the launcher terminal instead of looking like a hang. - Existing first-write-failure stderr print gets ``flush=True`` so the user actually sees it before the process is killed. - Test fixture switches from the previous-commit ``_DISABLED = False`` override to ``_ENABLE_OVERRIDE = True`` so tests exercise the same bypass path the diagnostic build uses. - Two new tests pin the safety contract: with the kill switch on and no override, every producer is a true no-op (no writer thread, no file). And ``DATATOOLS_AUDIT_PROBE=no-events`` bypasses ``log_event`` even when the override is on — guards the bisect. Rollback: ``git revert HEAD`` removes Phase 1 cleanly. 
The deadlock fix from the previous commit stays in place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:46:27 +00:00
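The interaction between the kill switch, the ``DATATOOLS_AUDIT_ENABLED`` override, and the ``DATATOOLS_AUDIT_PROBE`` variants in `76c9f5a` might look something like the sketch below. The helper name ``producer_enabled``, the ``PROBE_SKIPS`` table, and passing the environment as a mapping are all hypothetical. The commit does not say exactly what ``noop`` skips or how unknown values are handled; here ``noop`` is assumed to skip every producer and unknown values are treated as ``full``.

```python
from typing import Mapping

# Which producers each probe value short-circuits (assumed mapping).
PROBE_SKIPS = {
    "full": set(),
    "noop": {"log_event", "log_page_open", "log_session_start"},
    "no-events": {"log_event"},
    "no-page-open": {"log_page_open"},
    "no-session-start": {"log_session_start"},
}


def producer_enabled(name: str, env: Mapping[str, str], disabled: bool = True) -> bool:
    """Return True if the producer `name` should do any work at all."""
    # Kill switch wins unless the one-session override is set.
    if disabled and env.get("DATATOOLS_AUDIT_ENABLED") != "1":
        return False
    # The probe can still bypass individual producers when the override is on.
    probe = env.get("DATATOOLS_AUDIT_PROBE", "full")
    return name not in PROBE_SKIPS.get(probe, set())
```

With an empty environment every producer is a no-op, which is the safety contract the two new tests pin down.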
Michael	a8ff8f4bd0	fix(audit): break audit_log_path/_session_id deadlock Pre-existing latent bug since `d9e32e5`: ``audit_log_path()`` acquires the non-reentrant ``_LOCK`` and, while holding it, calls ``_session_id()`` which also takes ``_LOCK``. On a clean module state (both ``_LOG_PATH`` and ``_SESSION_ID`` unset) the first caller deadlocks. ``log_session_start`` triggers it in practice — it's the first GUI call after import and the ``log_file=str(audit_log_path())`` arg is evaluated before any ``log_event`` has had a chance to lazy-init the session id. Strong candidate contributor to the blank-pages symptom the kill switch was put back to mask: the writer thread (and any producer reaching ``audit_log_path``) would freeze forever, and Ctrl+C would not free the GIL — matches the launcher-can't-be-killed behaviour reported in `1caedbb`. Fix: resolve the session id BEFORE acquiring ``_LOCK`` in ``audit_log_path``. ``_session_id`` already double-checks under its own lock, so the call is safe and self-synchronising. Test fixture in ``tests/test_audit.py`` now bypasses the kill switch via ``monkeypatch.setattr(audit, "_DISABLED", False)`` — env vars are captured at import time and ``monkeypatch.setenv`` won't reach the module-level flag. With the fix in place, all 6 tests pass in 0.15s; without it, ``test_session_start_renders`` (and any test exercising the log_session_start path) hangs indefinitely. Kill switch behaviour is unchanged in production (`_DISABLED = True` in the shipped module); this is purely a correctness fix for the code path that gets exercised when the switch is off. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:45:08 +00:00
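The lock-ordering fix in `a8ff8f4` can be illustrated with a small stand-alone sketch. The module-level names mirror those in the commit message, but the bodies are assumptions: the 8-character session id, the simplified filename, and the ``nested_acquire_fails`` helper exist only for illustration.

```python
import threading
import uuid

_LOCK = threading.Lock()  # non-reentrant: a second acquire by the same thread blocks
_SESSION_ID = None
_LOG_PATH = None


def _session_id() -> str:
    """Lazy-init the session id, double-checked under _LOCK."""
    global _SESSION_ID
    if _SESSION_ID is None:
        with _LOCK:
            if _SESSION_ID is None:
                _SESSION_ID = uuid.uuid4().hex[:8]  # assumed id shape
    return _SESSION_ID


def audit_log_path() -> str:
    """Fixed ordering: resolve the session id BEFORE taking _LOCK."""
    global _LOG_PATH
    sid = _session_id()  # takes and releases _LOCK on its own
    with _LOCK:
        if _LOG_PATH is None:
            _LOG_PATH = f"datatools-{sid}.jsonl"  # simplified filename
        return _LOG_PATH


def nested_acquire_fails() -> bool:
    """Show the bug shape without hanging: re-acquiring a held Lock does not succeed."""
    with _LOCK:
        got = _LOCK.acquire(blocking=False)
        if got:
            _LOCK.release()
        return not got
```

Calling ``_session_id()`` inside the ``with _LOCK:`` block instead would reproduce the original hang on a clean module state. Switching to ``threading.RLock`` would also avoid it, but the commit keeps the plain lock and reorders the calls.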
Michael	65c85107b6	revert: restore audit-log kill switch — async redesign didn't help User pulled `d9e32e5` (async-writer audit log + re-enabled diagnostics sidebar) and still sees blank pages. The synchronous-write theory from the previous round was at most a partial explanation; something ELSE in the audit-log code path is also taking the page render down on the user's machine. Restore the kill switch so the user has a working app while we diagnose: - ``src/audit.py``: ``_DISABLED = True`` re-introduced at module top, each of ``log_event`` / ``log_session_start`` / ``log_page_open`` / ``flush_audit_log`` early-returns. The async writer thread is never started. - ``hide_streamlit_chrome``: ``_render_diagnostics_sidebar()`` call re-gated behind ``if False:``. The async writer code stays in place — easier to flip the flag back when we identify the real cause than to rewrite a third time. The shutdown-flush call in ``shutdown_app`` also stays; it early-returns on the kill switch and is harmless. Diagnostic plan for the next session: ask the user for the launcher terminal output (the new stderr "DataTools audit: writes failing..." message would tell us if the writer thread DID start and DID fail), and whether ``~/.datatools/logs/`` is being created at all. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 02:44:23 +00:00
Michael	d9e32e578b	feat(audit): async writer thread, safe to re-enable Reported earlier: synchronous file writes in ``log_event`` blocked the GUI render thread on hostile filesystems (Windows antivirus on ``~/.datatools/logs/`` is the prime suspect). A blocking ``open`` call doesn't raise, so try/except can't recover from it, and the only safe re-enable is to take file I/O off the render path. Refactor: - ``log_event`` and friends push events onto a ``deque(maxlen=5000)`` via ``put_nowait`` and return in microseconds. - A single daemon thread (``datatools-audit-writer``) drains the queue and writes batches. Holds the queue lock only long enough to snapshot + clear, then does I/O outside the lock so producers can keep enqueueing. - ``audit_log_path()`` is now pure path arithmetic: no ``mkdir``, no ``open``. The writer thread does the directory creation off the request path, so any hang there only affects the writer. - Bounded queue means an unwritable disk can't grow memory without bound; the queue caps at 5000 and overflow drops the OLDEST events so the most-recent (most-diagnostic) ones survive. - First write failure prints once to stderr; subsequent failures are silent so logs don't drown the launcher terminal. - ``flush_audit_log(timeout_s=0.5)`` drains the queue and signals the writer to exit; bounded so a stuck disk can't delay shutdown. Other changes in this commit: - ``shutdown_app`` now emits a "Session ending" event and calls ``flush_audit_log`` before kicking the os._exit timer, so the closing session's events make it to disk. - The Diagnostics sidebar in ``hide_streamlit_chrome`` is re-enabled (the ``if False:`` gate is removed). Wrapped in try/except defensively: render errors print to stderr, never blank the page. - ``_DISABLED`` kill-switch is gone. The async design IS the safety mechanism now. Tests in ``tests/test_audit.py``: - log_event burst of 1000 events completes in well under 1s (proves non-blocking).
- Events queued before flush land on disk with the expected JSON shape; session_start renders; idempotent. - Pointing the audit dir at a file (so mkdir fails) doesn't hang or crash the producer. - Non-JSON extras are str()-coerced rather than dropped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 02:39:48 +00:00
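The producer/writer split described in `d9e32e5` can be sketched as a small class. This is an illustrative reduction under stated assumptions, not the module's code: the ``AuditWriter`` class, its method names, and the 0.1 s poll interval are hypothetical, and the once-only stderr message on first write failure is omitted. Note that ``collections.deque`` has no ``put_nowait``; the sketch uses ``append``, which on a ``maxlen`` deque silently discards the oldest item when full.

```python
import json
import threading
from collections import deque
from pathlib import Path


class AuditWriter:
    def __init__(self, path, maxlen: int = 5000):
        self._path = Path(path)
        self._queue = deque(maxlen=maxlen)  # overflow drops the OLDEST event
        self._lock = threading.Lock()
        self._wake = threading.Event()
        self._stop = threading.Event()
        self._thread = threading.Thread(
            target=self._loop, name="datatools-audit-writer", daemon=True
        )
        self._thread.start()

    def log_event(self, event: dict) -> None:
        """Producer side: enqueue and return; never touches the filesystem."""
        with self._lock:
            self._queue.append(event)
        self._wake.set()

    def _drain(self) -> None:
        # Hold the lock only long enough to snapshot + clear.
        with self._lock:
            batch = list(self._queue)
            self._queue.clear()
        if not batch:
            return
        try:
            # All I/O, including directory creation, happens off the producer path.
            self._path.parent.mkdir(parents=True, exist_ok=True)
            with self._path.open("a", encoding="utf-8") as fh:
                fh.writelines(json.dumps(e, default=str) + "\n" for e in batch)
        except OSError:
            pass  # best-effort: an unwritable disk must not crash the writer

    def _loop(self) -> None:
        while not self._stop.is_set():
            self._wake.wait(timeout=0.1)
            self._wake.clear()
            self._drain()
        self._drain()  # pick up anything enqueued just before the stop signal

    def flush(self, timeout_s: float = 0.5) -> bool:
        """Signal the writer to drain and exit; bounded so shutdown can't stall."""
        self._stop.set()
        self._wake.set()
        self._thread.join(timeout_s)
        return not self._thread.is_alive()
```

A blocked ``open`` inside ``_drain`` would stall only the daemon thread; ``log_event`` keeps returning immediately and ``flush`` gives up after its timeout.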
Michael	1caedbbbc7	bisect: kill-switch every audit-log write Reported: bisection commit `c0bfd4d` that disabled the sticky footer, diagnostics sidebar, and compact-CSS didn't fix the blank-page symptom. User adds that Ctrl+C also can't kill the launcher. Ctrl+C-doesn't-work + every-page-blank together points at a hang in the Python process, not an exception. The most likely hang point in the chrome path is the audit log's file I/O — ``open()`` inside the ``with`` block in ``log_event`` blocks on a stuck filesystem (Windows antivirus quarantining ``~/.datatools/logs/datatools-*.jsonl`` on every write is a plausible culprit on the user's machine). A blocking ``open`` call does NOT raise — try/except can't recover from it — which is why our prior defensive wrap didn't help. Add a module-level ``_DISABLED = True`` kill switch. ``log_event``, ``log_session_start``, and ``log_page_open`` each early-return at the very top of the function when the flag is set, before any file-system call. Path resolution (``audit_log_path``) still works since it's needed for the diagnostics sidebar (still disabled in `c0bfd4d`, but kept harmless). If pages render after this commit, file I/O from the audit log is confirmed as the culprit; we'll redesign with an async writer queue and a tighter timeout. If they still don't, the cause is somewhere we haven't bisected yet and we move to a hard revert. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 02:14:29 +00:00
Michael	59c6d0f914	fix(audit): defensive wrap so audit failures can never blank the GUI Reported: after pulling commit `c73d716` (audit log) the main body of every page showed empty. Sidebar nav still worked. Diagnosis: the most likely path is that something inside the audit calls — ``_render_diagnostics_sidebar()`` calling ``audit_log_path()``, or ``log_session_start()`` itself — raises during ``hide_streamlit_chrome`` on the user's environment (Python 3.14 on Windows, a less-tested combo than the test environment). Streamlit's script runner sees the exception, and on some chrome paths it eats it without surfacing an error block, leaving the page body empty. The audit log is best-effort by design. Make that contract real: 1. ``hide_streamlit_chrome`` now wraps both ``log_session_start()`` and ``_render_diagnostics_sidebar()`` in try/except. Errors print to stderr (so the developer running ``python -m src.gui`` sees them in the launcher's console) but never bubble up to kill the page render. 2. ``audit_log_path()`` already had a tempdir fallback for the primary mkdir failure, but the SECOND mkdir wasn't protected either. Restructured to a two-level fallback: configured dir → tempdir → ``/dev/null`` (or ``NUL`` on Windows). The last fallback ensures the function never raises; ``log_event``'s own try/except handles the eventual unwritable-file case. 3. ``log_page_open(slug)`` now has an outer try/except so it cannot raise either — protecting every tool page's render path. If a user reports the same symptom again, the launcher terminal will now show a real traceback explaining what's wrong, and the GUI will still render normally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 02:00:31 +00:00
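The two-level fallback from item 2 of `59c6d0f` (configured dir → tempdir → null device) might be structured like this. The function name ``resolve_log_path``, the ``datatools-logs`` tempdir subfolder, and the default filename are hypothetical; only the fallback order comes from the commit message.

```python
import os
import tempfile
from pathlib import Path


def resolve_log_path(configured_dir: str, filename: str = "datatools.jsonl") -> Path:
    """Return a log path, falling back rather than raising on OSError."""
    # Level 0: the configured directory.
    try:
        primary = Path(configured_dir)
        primary.mkdir(parents=True, exist_ok=True)
        return primary / filename
    except OSError:
        pass
    # Level 1: the system temp directory (gettempdir itself can fail, so it sits inside the try).
    try:
        fallback = Path(tempfile.gettempdir()) / "datatools-logs"
        fallback.mkdir(parents=True, exist_ok=True)
        return fallback / filename
    except OSError:
        pass
    # Level 2: the null device ("/dev/null" on POSIX, "nul" on Windows).
    return Path(os.devnull)
```

The caller still has to tolerate a failing write, since a returned path only means the directory could be created, not that the file is writable.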
Michael	c73d716d06	feat(audit): JSONL audit log for support diagnostics New ``src/audit.py`` module records GUI actions to a per-session JSONL file under ``~/.datatools/logs/`` (overrideable via ``DATATOOLS_AUDIT_DIR``). The file is human-readable (one JSON object per line, each with a ``message`` field) AND trivially machine-parseable — the support flow is "client mails the file, we read it and explain what went wrong." Format example:: {"ts":"2026-05-17T05:30:00.123+00:00","level":"info","category":"session", "session":"a1b2c3d4","message":"Session started", "platform":"Windows 11","python":"3.14.0","user":"Michael Dombaugh", "log_file":"C:\\Users\\Michael Dombaugh\\.datatools\\logs\\datatools-...jsonl"} {"ts":"...","category":"upload","message":"Uploaded customers.csv", "filename":"customers.csv","bytes":24813} {"ts":"...","category":"analyze","message":"Analyzed customers.csv (3 findings)", "filename":"customers.csv","findings":3,"rows":120,"cols":8} {"ts":"...","category":"tool_run","message":"Clean Text run", "page":"2_Text_Cleaner"} {"ts":"...","category":"error","level":"error", "message":"analyze(weird.csv): EmptyDataError: No columns to parse", "filename":"weird.csv","outcome":"empty_after_repair"} Public API: - ``log_event(category, message, extra)`` - ``log_session_start()`` — idempotent banner with platform info - ``log_page_open(slug)`` — emit a ``nav`` event, deduplicated per Streamlit session so reruns don't spam the log - ``log_exception(where, exc, extra)`` — convenience wrapper - ``audit_log_path()`` / ``audit_log_dir()`` — for the UI Wired in at: - ``hide_streamlit_chrome``: stamps session start, mounts a small "🩺 Diagnostics" expander in the sidebar with the log path and an "Open log folder" button so the user can grab the file to attach to a support email. - Home page: ``upload`` event on every new file, ``upload`` event on per-file remove, ``analyze`` event with file count when Run-analysis fires. 
- ``_run_analysis_on_upload``: ``analyze`` event with rows / cols / findings count per file, plus ``error`` events on every caught exception (empty upload, empty after repair, pandas EmptyDataError, generic Exception). - Every Ready tool page (1, 2, 3, 4, 5, 9): ``tool_run`` event immediately after the primary action stashes its result. - Every tool page (1-9): ``log_page_open(slug)`` on render — deduped via session state so we don't get one event per Streamlit rerun. Safety: - ``log_event`` wraps every write in try/except. A broken audit log must NOT crash the GUI. - Non-JSON-serializable extras are ``str()``-coerced before writing. - File CONTENTS are never logged. We capture filename, byte count, and (in the analyzer) a 12-char sha1 fingerprint of the bytes so the same file re-uploaded gets the same trace. - License keys, session cookies, etc. are not logged. - ``DATATOOLS_AUDIT_DIR`` env var lets tests redirect writes into a tmp dir. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 01:36:35 +00:00
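As a rough illustration of the record shape and safety rules in `c73d716`, the sketch below builds one JSONL line and a 12-character sha1 fingerprint. The helper names ``build_event`` and ``fingerprint`` are hypothetical (the commit's public API is ``log_event(category, message, extra)``), and the rule that extras cannot overwrite core fields is an assumption.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def build_event(category: str, message: str, session: str,
                level: str = "info", extra: Optional[dict] = None) -> str:
    """Serialize one audit event as a single JSON line (no trailing newline)."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
        "level": level,
        "category": category,
        "session": session,
        "message": message,
    }
    for key, value in (extra or {}).items():
        record.setdefault(key, value)  # assumed: extras never clobber core fields
    # default=str coerces non-JSON-serializable values instead of raising.
    return json.dumps(record, default=str, ensure_ascii=False)


def fingerprint(data: bytes) -> str:
    """12-char sha1 prefix, so the same bytes re-uploaded get the same trace id."""
    return hashlib.sha1(data).hexdigest()[:12]
```

For example, ``build_event("upload", "Uploaded customers.csv", "a1b2c3d4", extra={"filename": "customers.csv", "bytes": 24813})`` yields a line with the same fields as the ``upload`` record in the format example, without ever touching the file contents.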

9 Commits