Commit Graph

9 Commits

Author SHA1 Message Date
b3ae913bb9 feat(audit): daily filename + 7-day retention sweep
Replaces the per-session ``datatools-<ts>-<sid>.jsonl`` filename
with a single daily file ``datatools-YYYY-MM-DD.jsonl`` (local
date). Sessions on the same calendar day share a file via the
writer thread's per-batch open+append; multiple DataTools
instances running concurrently on the same day fan into the same
file (append-mode small writes are atomic on POSIX, safe-enough on
Windows under realistic load).

Drops the ``_LOG_PATH`` module global and the lock around it —
``audit_log_path()`` is now pure date math, recomputed on every
call so a session that crosses midnight follows the rollover into
the next day's file.

Adds ``_sweep_old_logs()`` invoked once per process at writer-
thread start. Deletes any ``datatools-*.jsonl`` whose mtime is
older than 7 days. The glob deliberately matches the legacy
per-session filename too, so users upgrading from the previous
build don't keep a permanent backlog of pre-retention files.

Event ``ts`` fields stay UTC; only the filename uses local date,
because users go looking for "today's log" on their wall clock.

Tests cover: daily filename shape, sweep removes stale files,
sweep keeps fresh files, sweep also clears legacy filenames.

Rollback: ``git revert HEAD`` restores the per-session filename
and removes the sweep. No data migration needed either way —
existing files keep working as JSONL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 21:22:47 +00:00
ba07dcb6c7 feat(audit): re-enable audit log (kill switch off by default)
Phase 1 diagnostic build validated end-to-end on the user's machine:
session cf2ebbd5 (2026-05-19) produced session/upload/analyze/nav/
session-end events with no blank-pages regression. Root cause of the
original symptom was the audit_log_path/_session_id deadlock fixed in
a8ff8f4 — the kill switch is no longer load-bearing.

Flips ``_DISABLED: True`` → ``False`` so the default install writes a
log. The three env-var overrides (``DATATOOLS_AUDIT_ENABLED``,
``DATATOOLS_AUDIT_TRACE``, ``DATATOOLS_AUDIT_PROBE``) and the writer-
thread BaseException guard from 76c9f5a stay in place as escape
hatches if the symptom ever recurs.

TestKillSwitchContract continues to pass — it monkeypatches
``_DISABLED = True`` explicitly and doesn't rely on the module default.

Rollback: ``git revert HEAD`` flips the switch back without removing
the diagnostic instrumentation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 17:50:28 +00:00
76c9f5a679 feat(audit): diagnostic instrumentation env vars + writer-thread guard
Phase 1 of the audit-log re-enablement plan. Adds three opt-in env
vars that let us ship one instrumented build for the user to run,
without flipping the kill switch on for everybody. **Default
behaviour is byte-identical to today**: with no env vars set the
kill switch wins, no writer thread starts, no file is written, no
stderr line is printed.

Env vars (do NOT set in prod):

- ``DATATOOLS_AUDIT_ENABLED=1`` — bypass ``_DISABLED`` for one
  session. ``_DISABLED = True`` stays in the source so an upgrade
  with no env var is still safe.
- ``DATATOOLS_AUDIT_TRACE=1`` — print ``[audit] ...`` lines to
  stderr at module import, every writer-thread state change, and
  every producer entry point. Lets the user share a small log
  instead of attaching a debugger.
- ``DATATOOLS_AUDIT_PROBE=<value>`` — bisect the producer path
  for Phase 2. Values: ``full`` (default), ``noop``, ``no-events``,
  ``no-page-open``, ``no-session-start``. The named variants
  return early from the corresponding ``log_*`` function so we can
  isolate which call is implicated in the blank-pages symptom.

Also:

- ``_writer_loop`` gets an outer ``try/except BaseException`` so
  silent thread death now surfaces a ``"writer thread died: ..."``
  line in the launcher terminal instead of looking like a hang.
- Existing first-write-failure stderr print gets ``flush=True`` so
  the user actually sees it before the process is killed.
- Test fixture switches from the previous-commit ``_DISABLED = False``
  override to ``_ENABLE_OVERRIDE = True`` so tests exercise the same
  bypass path the diagnostic build uses.
- Two new tests pin the safety contract: with the kill switch on
  and no override, every producer is a true no-op (no writer
  thread, no file). And ``DATATOOLS_AUDIT_PROBE=no-events`` bypasses
  ``log_event`` even when the override is on — guards the bisect.

Rollback: ``git revert HEAD`` removes Phase 1 cleanly. The deadlock
fix from the previous commit stays in place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 14:46:27 +00:00
a8ff8f4bd0 fix(audit): break audit_log_path/_session_id deadlock
Pre-existing latent bug since d9e32e5: ``audit_log_path()`` acquires
the non-reentrant ``_LOCK`` and, while holding it, calls
``_session_id()`` which also takes ``_LOCK``. On a clean module state
(both ``_LOG_PATH`` and ``_SESSION_ID`` unset) the first caller
deadlocks.

``log_session_start`` triggers it in practice — it's the first GUI
call after import and the ``log_file=str(audit_log_path())`` arg is
evaluated before any ``log_event`` has had a chance to lazy-init the
session id. Strong candidate contributor to the blank-pages symptom
the kill switch was put back to mask: the writer thread (and any
producer reaching ``audit_log_path``) would freeze forever, and
Ctrl+C would not free the GIL — matches the launcher-can't-be-killed
behaviour reported in 1caedbb.

Fix: resolve the session id BEFORE acquiring ``_LOCK`` in
``audit_log_path``. ``_session_id`` already double-checks under its
own lock, so the call is safe and self-synchronising.

Test fixture in ``tests/test_audit.py`` now bypasses the kill switch
via ``monkeypatch.setattr(audit, "_DISABLED", False)`` — env vars are
captured at import time and ``monkeypatch.setenv`` won't reach the
module-level flag. With the fix in place, all 6 tests pass in 0.15s;
without it, ``test_session_start_renders`` (and any test exercising
the log_session_start path) hangs indefinitely.

Kill switch behaviour is unchanged in production (`_DISABLED = True`
in the shipped module); this is purely a correctness fix for the
code path that gets exercised when the switch is off.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 14:45:08 +00:00
65c85107b6 revert: restore audit-log kill switch — async redesign didn't help
User pulled d9e32e5 (async-writer audit log + re-enabled diagnostics
sidebar) and still sees blank pages. The synchronous-write theory
from the previous round was at most a partial explanation; something
ELSE in the audit-log code path is also taking the page render down
on the user's machine.

Restore the kill switch so the user has a working app while we
diagnose:

- ``src/audit.py``: ``_DISABLED = True`` re-introduced at module
  top, each of ``log_event`` / ``log_session_start`` /
  ``log_page_open`` / ``flush_audit_log`` early-returns. The async
  writer thread is never started.
- ``hide_streamlit_chrome``: ``_render_diagnostics_sidebar()`` call
  re-gated behind ``if False:``.

The async writer code stays in place — easier to flip the flag back
when we identify the real cause than to rewrite a third time. The
shutdown-flush call in ``shutdown_app`` also stays; it early-returns
on the kill switch and is harmless.

Diagnostic plan for the next session: ask the user for the launcher
terminal output (the new stderr "DataTools audit: writes failing..."
message would tell us if the writer thread DID start and DID fail),
and whether ``~/.datatools/logs/`` is being created at all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 02:44:23 +00:00
d9e32e578b feat(audit): async writer thread — safe to re-enable
Reported earlier: synchronous file writes in ``log_event`` blocked
the GUI render thread on hostile filesystems (Windows antivirus on
``~/.datatools/logs/`` is the prime suspect). A blocking ``open``
call doesn't raise — try/except can't recover from it — so the
only safe re-enable is to take file I/O off the render path.

Refactor:

- ``log_event`` and friends push events onto a ``deque(maxlen=5000)``
  via ``put_nowait`` and return in microseconds.
- A single daemon thread (``datatools-audit-writer``) drains the
  queue and writes batches. Holds the queue lock only long enough to
  snapshot + clear, then does I/O outside the lock so producers can
  keep enqueueing.
- ``audit_log_path()`` is now pure path arithmetic — no ``mkdir``
  no ``open``. The writer thread does the directory creation off
  the request path, so any hang there only affects the writer.
- Bounded queue means an unwritable disk doesn't unbounded-grow
  memory; the queue caps at 5000 and overflow drops OLDEST events
  so the most-recent (most-diagnostic) ones survive.
- First write failure prints once to stderr; subsequent failures
  are silent so logs don't drown the launcher terminal.
- ``flush_audit_log(timeout_s=0.5)`` drains the queue and signals
  the writer to exit; bounded so a stuck disk can't delay shutdown.

Other changes in this commit:

- ``shutdown_app`` now emits a "Session ending" event and calls
  ``flush_audit_log`` before kicking the os._exit timer, so the
  closing session's events make it to disk.
- The Diagnostics sidebar in ``hide_streamlit_chrome`` is
  re-enabled (the ``if False:`` gate is removed). Wrapped in
  try/except defensively — render errors print to stderr, never
  blank the page.
- ``_DISABLED`` kill-switch is gone. The async design IS the
  safety mechanism now.

Tests in ``tests/test_audit.py``:

- log_event burst of 1000 events completes in well under 1s
  (proves non-blocking).
- Events queued before flush land on disk with the expected JSON
  shape; session_start renders; idempotent.
- Pointing the audit dir at a file (so mkdir fails) doesn't hang
  or crash the producer.
- Non-JSON extras are str()-coerced rather than dropped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 02:39:48 +00:00
1caedbbbc7 bisect: kill-switch every audit-log write
Reported: bisection commit c0bfd4d that disabled the sticky footer,
diagnostics sidebar, and compact-CSS didn't fix the blank-page
symptom. User adds that Ctrl+C also can't kill the launcher.

Ctrl+C-doesn't-work + every-page-blank together points at a hang in
the Python process, not an exception. The most likely hang point in
the chrome path is the audit log's file I/O — ``open()`` inside the
``with`` block in ``log_event`` blocks on a stuck filesystem (Windows
antivirus quarantining ``~/.datatools/logs/datatools-*.jsonl`` on
every write is a plausible culprit on the user's machine). A blocking
``open`` call does NOT raise — try/except can't recover from it —
which is why our prior defensive wrap didn't help.

Add a module-level ``_DISABLED = True`` kill switch. ``log_event``,
``log_session_start``, and ``log_page_open`` each early-return at
the very top of the function when the flag is set, before any
file-system call. Path resolution (``audit_log_path``) still works
since it's needed for the diagnostics sidebar (still disabled in
c0bfd4d, but kept harmless).

If pages render after this commit, file I/O from the audit log is
confirmed as the culprit; we'll redesign with an async writer
queue and a tighter timeout. If they still don't, the cause is
somewhere we haven't bisected yet and we move to a hard revert.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 02:14:29 +00:00
59c6d0f914 fix(audit): defensive wrap so audit failures can never blank the GUI
Reported: after pulling commit c73d716 (audit log) the main body of
every page showed empty. Sidebar nav still worked.

Diagnosis: the most likely path is that something inside the audit
calls — ``_render_diagnostics_sidebar()`` calling ``audit_log_path()``,
or ``log_session_start()`` itself — raises during ``hide_streamlit_chrome``
on the user's environment (Python 3.14 on Windows, a less-tested
combo than the test environment). Streamlit's script runner sees the
exception, and on some chrome paths it eats it without surfacing an
error block, leaving the page body empty.

The audit log is best-effort by design. Make that contract real:

1. ``hide_streamlit_chrome`` now wraps both ``log_session_start()``
   and ``_render_diagnostics_sidebar()`` in try/except. Errors print
   to stderr (so the developer running ``python -m src.gui`` sees
   them in the launcher's console) but never bubble up to kill the
   page render.

2. ``audit_log_path()`` already had a tempdir fallback for the
   primary mkdir failure, but the SECOND mkdir wasn't protected
   either. Restructured to a two-level fallback: configured dir →
   tempdir → ``/dev/null`` (or ``NUL`` on Windows). The last fallback
   ensures the function never raises; ``log_event``'s own try/except
   handles the eventual unwritable-file case.

3. ``log_page_open(slug)`` now has an outer try/except so it cannot
   raise either — protecting every tool page's render path.

If a user reports the same symptom again, the launcher terminal will
now show a real traceback explaining what's wrong, and the GUI will
still render normally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 02:00:31 +00:00
c73d716d06 feat(audit): JSONL audit log for support diagnostics
New ``src/audit.py`` module records GUI actions to a per-session
JSONL file under ``~/.datatools/logs/`` (overrideable via
``DATATOOLS_AUDIT_DIR``). The file is human-readable (one JSON
object per line, each with a ``message`` field) AND trivially
machine-parseable — the support flow is "client mails the file,
we read it and explain what went wrong."

Format example::

    {"ts":"2026-05-17T05:30:00.123+00:00","level":"info","category":"session",
     "session":"a1b2c3d4","message":"Session started",
     "platform":"Windows 11","python":"3.14.0","user":"Michael Dombaugh",
     "log_file":"C:\\Users\\Michael Dombaugh\\.datatools\\logs\\datatools-...jsonl"}
    {"ts":"...","category":"upload","message":"Uploaded customers.csv",
     "filename":"customers.csv","bytes":24813}
    {"ts":"...","category":"analyze","message":"Analyzed customers.csv (3 findings)",
     "filename":"customers.csv","findings":3,"rows":120,"cols":8}
    {"ts":"...","category":"tool_run","message":"Clean Text run",
     "page":"2_Text_Cleaner"}
    {"ts":"...","category":"error","level":"error",
     "message":"analyze(weird.csv): EmptyDataError: No columns to parse",
     "filename":"weird.csv","outcome":"empty_after_repair"}

Public API:

- ``log_event(category, message, **extra)``
- ``log_session_start()`` — idempotent banner with platform info
- ``log_page_open(slug)`` — emit a ``nav`` event, deduplicated per
  Streamlit session so reruns don't spam the log
- ``log_exception(where, exc, **extra)`` — convenience wrapper
- ``audit_log_path()`` / ``audit_log_dir()`` — for the UI

Wired in at:

- ``hide_streamlit_chrome``: stamps session start, mounts a small
  "🩺  Diagnostics" expander in the sidebar with the log path and
  an "Open log folder" button so the user can grab the file to
  attach to a support email.
- Home page: ``upload`` event on every new file, ``upload`` event
  on per-file remove, ``analyze`` event with file count when
  Run-analysis fires.
- ``_run_analysis_on_upload``: ``analyze`` event with rows / cols /
  findings count per file, plus ``error`` events on every caught
  exception (empty upload, empty after repair, pandas EmptyDataError,
  generic Exception).
- Every Ready tool page (1, 2, 3, 4, 5, 9): ``tool_run`` event
  immediately after the primary action stashes its result.
- Every tool page (1-9): ``log_page_open(slug)`` on render — deduped
  via session state so we don't get one event per Streamlit rerun.

Safety:

- ``log_event`` wraps every write in try/except. A broken audit
  log must NOT crash the GUI.
- Non-JSON-serializable extras are ``str()``-coerced before writing.
- File CONTENTS are never logged. We capture filename, byte count,
  and (in the analyzer) a 12-char sha1 fingerprint of the bytes so
  the same file re-uploaded gets the same trace.
- License keys, session cookies, etc. are not logged.
- ``DATATOOLS_AUDIT_DIR`` env var lets tests redirect writes into a
  tmp dir.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 01:36:35 +00:00