feat(audit): async writer thread — safe to re-enable
Reported earlier: synchronous file writes in ``log_event`` blocked the GUI render thread on hostile filesystems (Windows antivirus on ``~/.datatools/logs/`` is the prime suspect). A blocking ``open`` call doesn't raise — try/except can't recover from it — so the only safe re-enable is to take file I/O off the render path. Refactor: - ``log_event`` and friends push events onto a ``deque(maxlen=5000)`` via ``put_nowait`` and return in microseconds. - A single daemon thread (``datatools-audit-writer``) drains the queue and writes batches. Holds the queue lock only long enough to snapshot + clear, then does I/O outside the lock so producers can keep enqueueing. - ``audit_log_path()`` is now pure path arithmetic — no ``mkdir`` no ``open``. The writer thread does the directory creation off the request path, so any hang there only affects the writer. - Bounded queue means an unwritable disk doesn't unbounded-grow memory; the queue caps at 5000 and overflow drops OLDEST events so the most-recent (most-diagnostic) ones survive. - First write failure prints once to stderr; subsequent failures are silent so logs don't drown the launcher terminal. - ``flush_audit_log(timeout_s=0.5)`` drains the queue and signals the writer to exit; bounded so a stuck disk can't delay shutdown. Other changes in this commit: - ``shutdown_app`` now emits a "Session ending" event and calls ``flush_audit_log`` before kicking the os._exit timer, so the closing session's events make it to disk. - The Diagnostics sidebar in ``hide_streamlit_chrome`` is re-enabled (the ``if False:`` gate is removed). Wrapped in try/except defensively — render errors print to stderr, never blank the page. - ``_DISABLED`` kill-switch is gone. The async design IS the safety mechanism now. Tests in ``tests/test_audit.py``: - log_event burst of 1000 events completes in well under 1s (proves non-blocking). - Events queued before flush land on disk with the expected JSON shape; session_start renders; idempotent. - Pointing the audit dir at a file (so mkdir fails) doesn't hang or crash the producer. - Non-JSON extras are str()-coerced rather than dropped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
340
src/audit.py
340
src/audit.py
@@ -1,38 +1,35 @@
|
||||
"""Audit log — records GUI actions for support diagnostics.
|
||||
"""Audit log — async, never-blocks-the-GUI file writer.
|
||||
|
||||
A client running DataTools who hits a bug should be able to grab one
|
||||
file off disk, mail it to support, and have us reconstruct what they
|
||||
were doing when things broke. That file is the audit log written by
|
||||
this module.
|
||||
|
||||
Design choices:
|
||||
Hard contract: ``log_event`` and friends are **non-blocking**. They
|
||||
push events onto an in-memory queue in microseconds and return. A
|
||||
single background daemon thread drains the queue and writes batches
|
||||
to disk. If the disk is hostile (Windows antivirus locks, slow
|
||||
network drive, permission failures), only the writer thread waits —
|
||||
the GUI render thread keeps moving. This is the contract that lets
|
||||
us re-enable the audit log after the earlier blank-pages incident:
|
||||
a stuck ``open(...)`` call can't hang the page anymore because the
|
||||
``open`` happens off the request path.
|
||||
|
||||
- **JSONL**, one event per line. Each line is a valid JSON object; the
|
||||
whole file is grep-friendly, ``jq``-friendly, and still readable in
|
||||
Notepad / TextEdit if no tooling is available. Each event carries a
|
||||
human-readable ``message`` field so the file is useful even without
|
||||
any tooling.
|
||||
- **One file per session**, named ``datatools-<utc-timestamp>-<id>.jsonl``.
|
||||
Multiple sessions on the same machine don't clobber each other, and
|
||||
the filename sorts chronologically.
|
||||
- **Default location**: ``~/.datatools/logs/`` on every platform.
|
||||
Overrideable via the ``DATATOOLS_AUDIT_DIR`` environment variable —
|
||||
used by tests to redirect writes into a tmp dir.
|
||||
- **Never crashes the app**. Every write is wrapped in a try/except;
|
||||
a broken audit log must not take down the GUI.
|
||||
- **No PII bytes**: file CONTENTS are never logged. We log the
|
||||
filename, byte size, and a short content hash so the same file
|
||||
re-uploaded gets the same fingerprint, but the actual bytes stay
|
||||
local.
|
||||
JSONL on-disk format (one event per line):
|
||||
|
||||
{"ts": "...", "level": "info", "category": "upload",
|
||||
"session": "a1b2c3d4", "message": "Uploaded customers.csv",
|
||||
"filename": "customers.csv", "bytes": 24813}
|
||||
|
||||
Public API:
|
||||
|
||||
- ``log_event(category, message, **extra)`` — write one event.
|
||||
- ``log_session_start()`` — emit a session-start record with platform
|
||||
info. Idempotent within a single session.
|
||||
- ``audit_log_path()`` — return the path to the current session's file
|
||||
so the GUI can show it to the user.
|
||||
- ``audit_log_dir()`` — return the directory holding all session logs.
|
||||
- ``log_event(category, message, **extra)`` — enqueue one event.
|
||||
- ``log_session_start()`` — idempotent session banner.
|
||||
- ``log_page_open(slug)`` — nav event, deduplicated per session.
|
||||
- ``log_exception(where, exc, **extra)`` — convenience wrapper.
|
||||
- ``audit_log_path()`` — pure path computation, no disk touch.
|
||||
- ``audit_log_dir()`` — directory holding all session files.
|
||||
- ``flush_audit_log(timeout_s=0.5)`` — drain queue (shutdown hook).
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
@@ -43,100 +40,158 @@ import os
|
||||
import platform
|
||||
import sys
|
||||
import threading
|
||||
import time
|
||||
import uuid
|
||||
from collections import deque
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
|
||||
# Module-level cache for per-session state. Streamlit reruns the script
|
||||
# many times per session but the module is imported once, so these
|
||||
# survive across reruns within the same Python process.
|
||||
# Module-level state. All access either lock-free (immutable after
|
||||
# init) or guarded by ``_LOCK`` / ``_QUEUE_COND``.
|
||||
_LOCK = threading.Lock()
|
||||
_LOG_PATH: Path | None = None
|
||||
_SESSION_ID: str | None = None
|
||||
_SESSION_STARTED: bool = False
|
||||
|
||||
# Kill switch — when True, every log_* function is a no-op. Set to
|
||||
# True while bisecting a "blank pages" report where ``open()`` inside
|
||||
# the log writer was suspected of blocking on the user's filesystem
|
||||
# (Windows + antivirus + ``~/.datatools/logs/``). A blocking ``open``
|
||||
# call doesn't raise so try/except can't recover it; the only safe
|
||||
# bisect is "don't touch the disk at all." Toggle back to False once
|
||||
# the user confirms pages render.
|
||||
_DISABLED: bool = True
|
||||
# Bounded in-memory queue. ``deque(maxlen=N)`` drops the OLDEST entry
|
||||
# when full so the most-recent events are always the ones that
|
||||
# survive on disk — diagnostic-most-valuable at the moment of a
|
||||
# crash. The number is generous; typical sessions produce well under
|
||||
# a hundred events.
|
||||
_MAX_QUEUED = 5000
|
||||
_QUEUE: deque = deque(maxlen=_MAX_QUEUED)
|
||||
_QUEUE_COND = threading.Condition()
|
||||
|
||||
# Writer-thread lifecycle.
|
||||
_WRITER_THREAD: threading.Thread | None = None
|
||||
_SHUTDOWN_REQUESTED: bool = False
|
||||
_WRITE_FAILED_REPORTED: bool = False
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Path helpers (pure — no I/O)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def audit_log_dir() -> Path:
|
||||
"""Return the directory where audit logs are written.
|
||||
|
||||
Defaults to ``~/.datatools/logs/``. Overrideable via the
|
||||
``DATATOOLS_AUDIT_DIR`` environment variable so tests can redirect
|
||||
writes into ``tmp_path``.
|
||||
``DATATOOLS_AUDIT_DIR`` env var (used by tests to point at a
|
||||
tmp dir). No filesystem I/O happens here — caller is responsible
|
||||
for ``mkdir`` if they want the dir to exist.
|
||||
"""
|
||||
override = os.environ.get("DATATOOLS_AUDIT_DIR")
|
||||
if override:
|
||||
return Path(override)
|
||||
return Path.home() / ".datatools" / "logs"
|
||||
try:
|
||||
return Path.home() / ".datatools" / "logs"
|
||||
except Exception:
|
||||
# In some sandboxed environments Path.home() can raise; fall
|
||||
# back to a temp dir without touching the disk.
|
||||
import tempfile
|
||||
return Path(tempfile.gettempdir()) / "datatools-logs"
|
||||
|
||||
|
||||
def _session_id() -> str:
|
||||
global _SESSION_ID
|
||||
with _LOCK:
|
||||
if _SESSION_ID is None:
|
||||
_SESSION_ID = uuid.uuid4().hex
|
||||
return _SESSION_ID
|
||||
if _SESSION_ID is None:
|
||||
with _LOCK:
|
||||
if _SESSION_ID is None:
|
||||
_SESSION_ID = uuid.uuid4().hex
|
||||
return _SESSION_ID
|
||||
|
||||
|
||||
def audit_log_path() -> Path:
|
||||
"""Return this session's log file path.
|
||||
"""Return this session's log file path. **No I/O performed.**
|
||||
|
||||
The path is created the first time it's queried so each Python
|
||||
process gets a single file regardless of how many Streamlit
|
||||
reruns happen.
|
||||
|
||||
Two-level fallback so this NEVER raises: first try the configured
|
||||
audit dir, then a tempdir, then ``/dev/null``-equivalent (just a
|
||||
nonexistent path) as a last resort. Callers downstream skip
|
||||
writing when the file isn't writable.
|
||||
Caller does not need to worry about hangs — this is pure path
|
||||
arithmetic. The writer thread does the actual ``mkdir`` /
|
||||
``open`` work off the request path.
|
||||
"""
|
||||
global _LOG_PATH
|
||||
with _LOCK:
|
||||
if _LOG_PATH is None:
|
||||
try:
|
||||
ts = datetime.now(tz=timezone.utc).strftime("%Y%m%dT%H%M%SZ")
|
||||
sid = _session_id()[:8]
|
||||
except Exception:
|
||||
ts, sid = "unknown", "unknown"
|
||||
name = f"datatools-{ts}-{sid}.jsonl"
|
||||
|
||||
# First choice: configured audit dir.
|
||||
d = None
|
||||
try:
|
||||
d = audit_log_dir()
|
||||
d.mkdir(parents=True, exist_ok=True)
|
||||
except Exception:
|
||||
d = None
|
||||
|
||||
# Fallback: tempdir.
|
||||
if d is None:
|
||||
if _LOG_PATH is None:
|
||||
with _LOCK:
|
||||
if _LOG_PATH is None:
|
||||
try:
|
||||
import tempfile
|
||||
d = Path(tempfile.gettempdir()) / "datatools-logs"
|
||||
d.mkdir(parents=True, exist_ok=True)
|
||||
ts = datetime.now(tz=timezone.utc).strftime("%Y%m%dT%H%M%SZ")
|
||||
except Exception:
|
||||
d = None
|
||||
ts = "unknown"
|
||||
sid = _session_id()[:8]
|
||||
_LOG_PATH = audit_log_dir() / f"datatools-{ts}-{sid}.jsonl"
|
||||
return _LOG_PATH
|
||||
|
||||
# Last resort: point at a path we won't try to write to —
|
||||
# ``log_event``'s own try/except handles the eventual
|
||||
# write failure cleanly.
|
||||
if d is None:
|
||||
d = Path("/dev/null") if os.name != "nt" else Path("NUL")
|
||||
_LOG_PATH = d
|
||||
else:
|
||||
_LOG_PATH = d / name
|
||||
return _LOG_PATH
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Writer thread
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def _writer_loop() -> None:
|
||||
"""Drain ``_QUEUE`` to disk forever.
|
||||
|
||||
Runs in a daemon thread. Wakes on new events, batches everything
|
||||
currently in the queue into a single ``open``/``write``/``close``
|
||||
cycle (cheaper than per-event opens), then goes back to waiting.
|
||||
File-system errors are reported once per session to stderr and
|
||||
swallowed — the audit log is best-effort.
|
||||
"""
|
||||
global _WRITE_FAILED_REPORTED
|
||||
while True:
|
||||
# Wait for something to do.
|
||||
with _QUEUE_COND:
|
||||
while not _QUEUE and not _SHUTDOWN_REQUESTED:
|
||||
_QUEUE_COND.wait(timeout=5.0)
|
||||
if _SHUTDOWN_REQUESTED and not _QUEUE:
|
||||
return
|
||||
# Snapshot + clear under the lock; release before doing
|
||||
# I/O so producers can keep enqueueing.
|
||||
batch = list(_QUEUE)
|
||||
_QUEUE.clear()
|
||||
|
||||
try:
|
||||
path = audit_log_path()
|
||||
path.parent.mkdir(parents=True, exist_ok=True)
|
||||
with path.open("a", encoding="utf-8") as f:
|
||||
for ev in batch:
|
||||
f.write(json.dumps(ev, ensure_ascii=False) + "\n")
|
||||
except Exception as e:
|
||||
if not _WRITE_FAILED_REPORTED:
|
||||
_WRITE_FAILED_REPORTED = True
|
||||
try:
|
||||
print(
|
||||
f"DataTools audit: writes failing — log degraded "
|
||||
f"for the rest of this session. Error: "
|
||||
f"{type(e).__name__}: {e}",
|
||||
file=sys.stderr,
|
||||
)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
|
||||
def _ensure_writer_started() -> None:
|
||||
"""Lazy-start the writer thread on first event.
|
||||
|
||||
Idempotent. Cheap enough to call on every event (a sentinel
|
||||
check + an occasional lock acquisition).
|
||||
"""
|
||||
global _WRITER_THREAD
|
||||
if _WRITER_THREAD is not None and _WRITER_THREAD.is_alive():
|
||||
return
|
||||
with _LOCK:
|
||||
if _WRITER_THREAD is None or not _WRITER_THREAD.is_alive():
|
||||
t = threading.Thread(
|
||||
target=_writer_loop,
|
||||
daemon=True,
|
||||
name="datatools-audit-writer",
|
||||
)
|
||||
t.start()
|
||||
_WRITER_THREAD = t
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Public producer API
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def log_event(
|
||||
category: str,
|
||||
@@ -145,52 +200,45 @@ def log_event(
|
||||
level: str = "info",
|
||||
**extra: Any,
|
||||
) -> None:
|
||||
"""Append one event to the session log.
|
||||
"""Enqueue one event. Non-blocking; returns in microseconds.
|
||||
|
||||
``category`` groups related events (e.g. ``upload``, ``analyze``,
|
||||
``tool_run``, ``error``, ``nav``). ``message`` is the human
|
||||
sentence that lands in the file. ``extra`` keys are passed through
|
||||
to the JSON object verbatim, so callers can attach structured
|
||||
context (filename, byte counts, finding counts, timings).
|
||||
|
||||
Failures are swallowed silently — a broken audit log must not
|
||||
take the GUI down.
|
||||
Failures inside this function are swallowed; a broken audit log
|
||||
must never take the GUI down.
|
||||
"""
|
||||
if _DISABLED:
|
||||
return
|
||||
try:
|
||||
event = {
|
||||
"ts": datetime.now(tz=timezone.utc).isoformat(timespec="milliseconds"),
|
||||
try:
|
||||
ts = datetime.now(tz=timezone.utc).isoformat(timespec="milliseconds")
|
||||
except Exception:
|
||||
ts = ""
|
||||
event: dict[str, Any] = {
|
||||
"ts": ts,
|
||||
"level": level,
|
||||
"category": category,
|
||||
"session": _session_id()[:8],
|
||||
"message": message,
|
||||
}
|
||||
# Attach extras with serialization safety: non-JSON values get
|
||||
# str()'d so a bad caller can't poison the whole entry.
|
||||
for k, v in extra.items():
|
||||
try:
|
||||
json.dumps(v)
|
||||
event[k] = v
|
||||
except (TypeError, ValueError):
|
||||
event[k] = str(v)
|
||||
with audit_log_path().open("a", encoding="utf-8") as f:
|
||||
f.write(json.dumps(event, ensure_ascii=False) + "\n")
|
||||
|
||||
_ensure_writer_started()
|
||||
with _QUEUE_COND:
|
||||
_QUEUE.append(event)
|
||||
_QUEUE_COND.notify()
|
||||
except Exception:
|
||||
# Last-ditch silent swallow. Diagnostics is best-effort.
|
||||
pass
|
||||
|
||||
|
||||
def log_session_start() -> None:
|
||||
"""Write the session-start banner. Idempotent within one process."""
|
||||
if _DISABLED:
|
||||
return
|
||||
"""Idempotent session-start banner with platform info."""
|
||||
global _SESSION_STARTED
|
||||
with _LOCK:
|
||||
if _SESSION_STARTED:
|
||||
return
|
||||
_SESSION_STARTED = True
|
||||
# Best-effort metadata. Failures don't propagate.
|
||||
try:
|
||||
user = getpass.getuser()
|
||||
except Exception:
|
||||
@@ -210,33 +258,8 @@ def log_session_start() -> None:
|
||||
)
|
||||
|
||||
|
||||
def log_exception(where: str, exc: BaseException, **extra: Any) -> None:
|
||||
"""Convenience wrapper for caught exceptions."""
|
||||
log_event(
|
||||
"error",
|
||||
f"{where}: {type(exc).__name__}: {exc}",
|
||||
level="error",
|
||||
exc_type=type(exc).__name__,
|
||||
exc_message=str(exc),
|
||||
**extra,
|
||||
)
|
||||
|
||||
|
||||
def log_page_open(slug: str) -> None:
|
||||
"""Emit a "page open" event, deduplicated within a session.
|
||||
|
||||
Streamlit reruns the script many times per page (every widget
|
||||
interaction triggers a rerun). Tracking the last page the user
|
||||
visited in session state lets us emit a single ``nav`` event when
|
||||
they actually switch pages, not one per rerun. Falls back to
|
||||
always-emit when session state is unreachable (running outside
|
||||
Streamlit, e.g. in tests).
|
||||
|
||||
Whole body is wrapped in try/except — the audit log is best-
|
||||
effort and MUST NOT crash the page that called it.
|
||||
"""
|
||||
if _DISABLED:
|
||||
return
|
||||
"""Emit a deduplicated 'page open' nav event."""
|
||||
try:
|
||||
try:
|
||||
import streamlit as st
|
||||
@@ -251,19 +274,68 @@ def log_page_open(slug: str) -> None:
|
||||
pass
|
||||
|
||||
|
||||
def log_exception(where: str, exc: BaseException, **extra: Any) -> None:
|
||||
"""Convenience wrapper for caught exceptions."""
|
||||
log_event(
|
||||
"error",
|
||||
f"{where}: {type(exc).__name__}: {exc}",
|
||||
level="error",
|
||||
exc_type=type(exc).__name__,
|
||||
exc_message=str(exc),
|
||||
**extra,
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Shutdown
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def flush_audit_log(timeout_s: float = 0.5) -> None:
|
||||
"""Drain queued events to disk, then signal the writer to exit.
|
||||
|
||||
Called from ``shutdown_app`` so the closing session's events make
|
||||
it to disk before the process dies. Bounded by ``timeout_s`` so a
|
||||
stuck disk can never delay shutdown — events still in the queue
|
||||
when the timer expires are dropped.
|
||||
"""
|
||||
global _SHUTDOWN_REQUESTED
|
||||
deadline = time.monotonic() + max(0.0, timeout_s)
|
||||
with _QUEUE_COND:
|
||||
_SHUTDOWN_REQUESTED = True
|
||||
_QUEUE_COND.notify_all()
|
||||
while time.monotonic() < deadline:
|
||||
with _QUEUE_COND:
|
||||
if not _QUEUE:
|
||||
return
|
||||
time.sleep(0.02)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Test helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def reset_for_tests() -> None:
|
||||
"""Reset module-level state. Test-only — call from a pytest fixture
|
||||
when isolation between tests matters."""
|
||||
"""Reset module-level state. Call from a pytest fixture for
|
||||
isolation between tests."""
|
||||
global _LOG_PATH, _SESSION_ID, _SESSION_STARTED
|
||||
global _WRITER_THREAD, _SHUTDOWN_REQUESTED, _WRITE_FAILED_REPORTED
|
||||
with _LOCK:
|
||||
_LOG_PATH = None
|
||||
_SESSION_ID = None
|
||||
_SESSION_STARTED = False
|
||||
_SHUTDOWN_REQUESTED = False
|
||||
_WRITE_FAILED_REPORTED = False
|
||||
# The previous writer thread (if any) will exit on its own
|
||||
# the next time it wakes — daemon thread, no need to join.
|
||||
_WRITER_THREAD = None
|
||||
with _QUEUE_COND:
|
||||
_QUEUE.clear()
|
||||
|
||||
|
||||
__all__ = [
|
||||
"audit_log_dir",
|
||||
"audit_log_path",
|
||||
"flush_audit_log",
|
||||
"log_event",
|
||||
"log_exception",
|
||||
"log_page_open",
|
||||
|
||||
Reference in New Issue
Block a user