feat(audit): JSONL audit log for support diagnostics
New ``src/audit.py`` module records GUI actions to a per-session
JSONL file under ``~/.datatools/logs/`` (overridable via
``DATATOOLS_AUDIT_DIR``). The file is human-readable (one JSON
object per line, each with a ``message`` field) and trivially
machine-parseable. The support flow is "client mails the file,
we read it and explain what went wrong."
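As a rough illustration of the location and naming rules described above, the sketch below resolves the log directory and builds a per-session file name. The helper names ``resolve_audit_dir`` and ``session_log_name`` are hypothetical and exist only for this example; the module itself exposes ``audit_log_dir()`` and ``audit_log_path()``.

```python
import os
from datetime import datetime, timezone
from pathlib import Path


def resolve_audit_dir(environ=None) -> Path:
    """Return the override directory if set, else ~/.datatools/logs."""
    env = os.environ if environ is None else environ
    override = env.get("DATATOOLS_AUDIT_DIR")
    if override:
        return Path(override)
    return Path.home() / ".datatools" / "logs"


def session_log_name(now: datetime, session_id: str) -> str:
    """Build datatools-<utc-timestamp>-<id>.jsonl from an aware datetime."""
    ts = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # Only the first 8 hex characters of the session id go into the name.
    return f"datatools-{ts}-{session_id[:8]}.jsonl"
```

Because the timestamp leads the name and uses a fixed-width UTC format, a plain alphabetical sort of the folder is also a chronological sort.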
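Format example (events are wrapped and abbreviated here; in the file
each event is a single line and always carries ``ts``, ``level``,
``category``, ``session`` and ``message``)::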
{"ts":"2026-05-17T05:30:00.123+00:00","level":"info","category":"session",
"session":"a1b2c3d4","message":"Session started",
"platform":"Windows 11","python":"3.14.0","user":"Michael Dombaugh",
"log_file":"C:\\Users\\Michael Dombaugh\\.datatools\\logs\\datatools-...jsonl"}
{"ts":"...","category":"upload","message":"Uploaded customers.csv",
"filename":"customers.csv","bytes":24813}
{"ts":"...","category":"analyze","message":"Analyzed customers.csv (3 findings)",
"filename":"customers.csv","findings":3,"rows":120,"cols":8}
{"ts":"...","category":"tool_run","message":"Clean Text run",
"page":"2_Text_Cleaner"}
{"ts":"...","category":"error","level":"error",
"message":"analyze(weird.csv): EmptyDataError: No columns to parse",
"filename":"weird.csv","outcome":"empty_after_repair"}
Public API:
- ``log_event(category, message, **extra)``
- ``log_session_start()`` — idempotent banner with platform info
- ``log_page_open(slug)`` — emit a ``nav`` event, deduplicated per
Streamlit session so reruns don't spam the log
- ``log_exception(where, exc, **extra)`` — convenience wrapper
- ``audit_log_path()`` / ``audit_log_dir()`` — for the UI
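On the support side, a received log can be read with a few lines of Python. This is a minimal sketch, not part of the commit; ``read_audit_log`` and ``summarize`` are hypothetical helper names, and the only assumption about the file is what is stated above (one JSON object per line, each with ``category`` and ``message``).

```python
import json
from pathlib import Path


def read_audit_log(path: Path) -> list[dict]:
    """Parse a JSONL audit log, skipping blank or truncated lines."""
    events = []
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            # A crash mid-write can leave a partial final line; skip it.
            continue
        if isinstance(obj, dict):
            events.append(obj)
    return events


def summarize(events: list[dict]) -> list[str]:
    """Return one 'category: message' string per event, in file order."""
    return [f"{e.get('category', '?')}: {e.get('message', '')}" for e in events]
```

Filtering the parsed events on ``level`` or ``category`` is usually enough to jump straight to the failing step.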
Wired in at:
- ``hide_streamlit_chrome``: stamps session start, mounts a small
"🩺 Diagnostics" expander in the sidebar with the log path and
an "Open log folder" button so the user can grab the file to
attach to a support email.
- Home page: ``upload`` event on every new file, ``upload`` event
on per-file remove, ``analyze`` event with file count when
Run-analysis fires.
- ``_run_analysis_on_upload``: ``analyze`` event with rows / cols /
findings count per file; ``analyze`` events at level ``warn`` for
skipped files (empty upload, empty after repair); ``error`` events
via ``log_exception`` for caught exceptions (pandas EmptyDataError,
generic Exception).
- Every Ready tool page (1, 2, 3, 4, 5, 9): ``tool_run`` event
immediately after the primary action stashes its result.
- Every tool page (1-9): ``log_page_open(slug)`` on render — deduped
via session state so we don't get one event per Streamlit rerun.
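The per-session deduplication of ``nav`` events can be pictured with a plain dict standing in for Streamlit's session state. This is a sketch under that assumption; ``should_log_page_open`` is a hypothetical name, and the real ``log_page_open`` reads ``st.session_state`` directly.

```python
def should_log_page_open(session_state: dict, slug: str) -> bool:
    """Return True only when the user actually switched to a new page."""
    if session_state.get("_audit_current_page") == slug:
        # Same page as last time: this is a rerun, not a navigation.
        return False
    session_state["_audit_current_page"] = slug
    return True


state: dict = {}
first = should_log_page_open(state, "2_Text_Cleaner")   # navigation
rerun = should_log_page_open(state, "2_Text_Cleaner")   # widget rerun
moved = should_log_page_open(state, "1_Deduplicator")   # navigation
```

Only the last visited page is remembered, so returning to an earlier page counts as a new navigation and is logged again.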
Safety:
- ``log_event`` wraps every write in try/except. A broken audit
log must NOT crash the GUI.
- Non-JSON-serializable extras are ``str()``-coerced before writing.
- File CONTENTS are never logged. We capture filename, byte count,
and (in the analyzer) a 12-char sha1 fingerprint of the bytes so
the same file re-uploaded gets the same trace.
- License keys, session cookies, etc. are not logged.
- ``DATATOOLS_AUDIT_DIR`` env var lets tests redirect writes into a
tmp dir.
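Two of the safety rules above are easy to show in isolation: coercing extras that JSON cannot encode, and fingerprinting file bytes without storing them. The sketch below mirrors the logic in the diff, but ``coerce_extras`` and ``fingerprint`` are hypothetical names used only here.

```python
import hashlib
import json
from pathlib import Path
from typing import Any


def coerce_extras(extra: dict[str, Any]) -> dict[str, Any]:
    """Keep JSON-encodable values as-is and str() everything else."""
    safe: dict[str, Any] = {}
    for key, value in extra.items():
        try:
            json.dumps(value)
            safe[key] = value
        except (TypeError, ValueError):
            safe[key] = str(value)
    return safe


def fingerprint(data: bytes) -> str:
    """Return a 12-character SHA-1 prefix, or 'empty' for zero bytes."""
    if not data:
        return "empty"
    # usedforsecurity=False marks this as a non-cryptographic use.
    return hashlib.sha1(data, usedforsecurity=False).hexdigest()[:12]


extras = coerce_extras({"rows": 120, "source": Path("customers.csv")})
```

The fingerprint is for correlating re-uploads of the same bytes in a trace, not for integrity or security checks.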
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
227
src/audit.py
Normal file
@@ -0,0 +1,227 @@
+"""Audit log — records GUI actions for support diagnostics.
+
+A client running DataTools who hits a bug should be able to grab one
+file off disk, mail it to support, and have us reconstruct what they
+were doing when things broke. That file is the audit log written by
+this module.
+
+Design choices:
+
+- **JSONL**, one event per line. Each line is a valid JSON object; the
+  whole file is grep-friendly, ``jq``-friendly, and still readable in
+  Notepad / TextEdit if no tooling is available. Each event carries a
+  human-readable ``message`` field so the file is useful even without
+  any tooling.
+- **One file per session**, named ``datatools-<utc-timestamp>-<id>.jsonl``.
+  Multiple sessions on the same machine don't clobber each other, and
+  the filename sorts chronologically.
+- **Default location**: ``~/.datatools/logs/`` on every platform.
+  Overrideable via the ``DATATOOLS_AUDIT_DIR`` environment variable —
+  used by tests to redirect writes into a tmp dir.
+- **Never crashes the app**. Every write is wrapped in a try/except;
+  a broken audit log must not take down the GUI.
+- **No PII bytes**: file CONTENTS are never logged. We log the
+  filename, byte size, and a short content hash so the same file
+  re-uploaded gets the same fingerprint, but the actual bytes stay
+  local.
+
+Public API:
+
+- ``log_event(category, message, **extra)`` — write one event.
+- ``log_session_start()`` — emit a session-start record with platform
+  info. Idempotent within a single session.
+- ``audit_log_path()`` — return the path to the current session's file
+  so the GUI can show it to the user.
+- ``audit_log_dir()`` — return the directory holding all session logs.
+"""
+
+from __future__ import annotations
+
+import getpass
+import json
+import os
+import platform
+import sys
+import threading
+import uuid
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Any
+
+
+# Module-level cache for per-session state. Streamlit reruns the script
+# many times per session but the module is imported once, so these
+# survive across reruns within the same Python process.
+_LOCK = threading.RLock()  # re-entrant: audit_log_path() calls _session_id() while holding it
+_LOG_PATH: Path | None = None
+_SESSION_ID: str | None = None
+_SESSION_STARTED: bool = False
+
+
+def audit_log_dir() -> Path:
+    """Return the directory where audit logs are written.
+
+    Defaults to ``~/.datatools/logs/``. Overrideable via the
+    ``DATATOOLS_AUDIT_DIR`` environment variable so tests can redirect
+    writes into ``tmp_path``.
+    """
+    override = os.environ.get("DATATOOLS_AUDIT_DIR")
+    if override:
+        return Path(override)
+    return Path.home() / ".datatools" / "logs"
+
+
+def _session_id() -> str:
+    global _SESSION_ID
+    with _LOCK:
+        if _SESSION_ID is None:
+            _SESSION_ID = uuid.uuid4().hex
+        return _SESSION_ID
+
+
+def audit_log_path() -> Path:
+    """Return this session's log file path.
+
+    The path is created the first time it's queried so each Python
+    process gets a single file regardless of how many Streamlit
+    reruns happen.
+    """
+    global _LOG_PATH
+    with _LOCK:
+        if _LOG_PATH is None:
+            ts = datetime.now(tz=timezone.utc).strftime("%Y%m%dT%H%M%SZ")
+            sid = _session_id()[:8]
+            d = audit_log_dir()
+            try:
+                d.mkdir(parents=True, exist_ok=True)
+            except Exception:
+                # If we can't create the dir, fall back to a tmpdir
+                # location so we never crash the app for the audit
+                # log's sake.
+                import tempfile
+                d = Path(tempfile.gettempdir()) / "datatools-logs"
+                d.mkdir(parents=True, exist_ok=True)
+            _LOG_PATH = d / f"datatools-{ts}-{sid}.jsonl"
+        return _LOG_PATH
+
+
+def log_event(
+    category: str,
+    message: str,
+    *,
+    level: str = "info",
+    **extra: Any,
+) -> None:
+    """Append one event to the session log.
+
+    ``category`` groups related events (e.g. ``upload``, ``analyze``,
+    ``tool_run``, ``error``, ``nav``). ``message`` is the human
+    sentence that lands in the file. ``extra`` keys are passed through
+    to the JSON object verbatim, so callers can attach structured
+    context (filename, byte counts, finding counts, timings).
+
+    Failures are swallowed silently — a broken audit log must not
+    take the GUI down.
+    """
+    try:
+        event = {
+            "ts": datetime.now(tz=timezone.utc).isoformat(timespec="milliseconds"),
+            "level": level,
+            "category": category,
+            "session": _session_id()[:8],
+            "message": message,
+        }
+        # Attach extras with serialization safety: non-JSON values get
+        # str()'d so a bad caller can't poison the whole entry.
+        for k, v in extra.items():
+            try:
+                json.dumps(v)
+                event[k] = v
+            except (TypeError, ValueError):
+                event[k] = str(v)
+        with audit_log_path().open("a", encoding="utf-8") as f:
+            f.write(json.dumps(event, ensure_ascii=False) + "\n")
+    except Exception:
+        # Last-ditch silent swallow. Diagnostics is best-effort.
+        pass
+
+
+def log_session_start() -> None:
+    """Write the session-start banner. Idempotent within one process."""
+    global _SESSION_STARTED
+    with _LOCK:
+        if _SESSION_STARTED:
+            return
+        _SESSION_STARTED = True
+    # Best-effort metadata. Failures don't propagate.
+    try:
+        user = getpass.getuser()
+    except Exception:
+        user = "?"
+    try:
+        cwd = str(Path.cwd())
+    except Exception:
+        cwd = "?"
+    log_event(
+        "session",
+        "Session started",
+        platform=f"{platform.system()} {platform.release()}",
+        python=sys.version.split()[0],
+        user=user,
+        cwd=cwd,
+        log_file=str(audit_log_path()),
+    )
+
+
+def log_exception(where: str, exc: BaseException, **extra: Any) -> None:
+    """Convenience wrapper for caught exceptions."""
+    log_event(
+        "error",
+        f"{where}: {type(exc).__name__}: {exc}",
+        level="error",
+        exc_type=type(exc).__name__,
+        exc_message=str(exc),
+        **extra,
+    )
+
+
+def log_page_open(slug: str) -> None:
+    """Emit a "page open" event, deduplicated within a session.
+
+    Streamlit reruns the script many times per page (every widget
+    interaction triggers a rerun). Tracking the last page the user
+    visited in session state lets us emit a single ``nav`` event when
+    they actually switch pages, not one per rerun. Falls back to
+    always-emit when session state is unreachable (running outside
+    Streamlit, e.g. in tests).
+    """
+    try:
+        import streamlit as st
+        prev = st.session_state.get("_audit_current_page")
+        if prev == slug:
+            return
+        st.session_state["_audit_current_page"] = slug
+    except Exception:
+        pass
+    log_event("nav", f"Opened {slug}", page=slug)
+
+
+def reset_for_tests() -> None:
+    """Reset module-level state. Test-only — call from a pytest fixture
+    when isolation between tests matters."""
+    global _LOG_PATH, _SESSION_ID, _SESSION_STARTED
+    with _LOCK:
+        _LOG_PATH = None
+        _SESSION_ID = None
+        _SESSION_STARTED = False
+
+
+__all__ = [
+    "audit_log_dir",
+    "audit_log_path",
+    "log_event",
+    "log_exception",
+    "log_page_open",
+    "log_session_start",
+    "reset_for_tests",
+]
@@ -86,6 +86,7 @@ def _home_page() -> None:
         help=t("upload.uploader_help"),
     )
     if new_files:
+        from src.audit import log_event
         changed = False
         for f in new_files:
             if f.name not in home_uploads:
@@ -94,6 +95,12 @@ def _home_page() -> None:
                     "size": f.size,
                 }
                 changed = True
+                log_event(
+                    "upload",
+                    f"Uploaded {f.name}",
+                    filename=f.name,
+                    bytes=f.size,
+                )
         if changed:
             st.session_state["home_uploads"] = home_uploads
 
@@ -139,6 +146,12 @@ def _home_page() -> None:
             to_remove = name
 
     if to_remove is not None:
+        from src.audit import log_event
+        log_event(
+            "upload",
+            f"Removed {to_remove}",
+            filename=to_remove,
+        )
         del home_uploads[to_remove]
         # Drop any findings/results tied to the removed file.
         findings_by_file_drop = st.session_state.get(
@@ -209,6 +222,12 @@ def _home_page() -> None:
         st.rerun()
 
     if run_clicked:
+        from src.audit import log_event
+        log_event(
+            "analyze",
+            f"Run analysis clicked on {len(pending)} file(s)",
+            files=list(pending),
+        )
         progress = st.progress(0.0, text=t("upload.scanning"))
         for i, name in enumerate(pending, start=1):
             stashed = _StashedUpload(name, home_uploads[name]["bytes"])
@@ -155,6 +155,10 @@ def hide_streamlit_chrome(*, gate_license: bool = True) -> None:
     can render its own form without recursion.
     """
     st.markdown(_HIDE_CHROME_CSS, unsafe_allow_html=True)
+    # Stamp a session-start record into the audit log the first time
+    # any page renders. Idempotent — subsequent calls are no-ops.
+    from src.audit import log_session_start
+    log_session_start()
     # Production-safe check runs first so a misconfigured shipped
     # build refuses to render anything (rather than rendering a
     # broken activation form that doesn't accept real blobs).
@@ -172,10 +176,39 @@ def hide_streamlit_chrome(*, gate_license: bool = True) -> None:
         require_license_or_render_activation,
     )
     render_license_status_sidebar()
+    _render_diagnostics_sidebar()
     if gate_license:
         require_license_or_render_activation()
 
 
+def _render_diagnostics_sidebar() -> None:
+    """Render a small Diagnostics expander in the sidebar.
+
+    Shows the path to the current session's audit log and an "Open
+    folder" button. Lives behind an expander so it doesn't take
+    screen space until the user opens it; the support flow is
+    "client mails us the file, we tell them what went wrong."
+    """
+    from src.audit import audit_log_dir, audit_log_path
+    log_path = audit_log_path()
+    with st.sidebar:
+        with st.expander("🩺 Diagnostics", expanded=False):
+            st.caption("Audit log for this session:")
+            st.code(str(log_path), language=None)
+            if st.button(
+                "📂 Open log folder",
+                key="_diag_open_logs",
+                type="secondary",
+                use_container_width=True,
+            ):
+                opened = _open_in_file_manager(audit_log_dir(), select=log_path)
+                if not opened:
+                    st.warning(
+                        "Could not open the file manager from here. "
+                        "Path is above — paste it into your file manager."
+                    )
+
+
 # ---------------------------------------------------------------------------
 # Clean shutdown
 # ---------------------------------------------------------------------------
@@ -1669,6 +1702,8 @@ def _run_analysis_on_upload(uploaded):
     one of several uploaded files) should yield a clean red banner for
     that file, not kill the whole multi-file analysis run.
     """
+    import hashlib
+    from src.audit import log_event, log_exception
     from src.core.analyze import Finding, analyze
     from src.core.errors import format_for_user
     from src.core.io import repair_bytes
@@ -1676,6 +1711,18 @@ def _run_analysis_on_upload(uploaded):
     name = uploaded.name
     data = uploaded.getvalue()
     suffix = name.rsplit(".", 1)[-1].lower() if "." in name else ""
+    digest = hashlib.sha1(
+        data, usedforsecurity=False,
+    ).hexdigest()[:12] if data else "empty"
+
+    log_event(
+        "analyze",
+        f"Analyzing {name}",
+        filename=name,
+        bytes=len(data),
+        sha1_12=digest,
+        suffix=suffix,
+    )
 
     def _error_finding(description: str, fid: str = "analysis_failed") -> list[Finding]:
         return [Finding(
@@ -1689,6 +1736,13 @@ def _run_analysis_on_upload(uploaded):
         )]
 
     if not data:
+        log_event(
+            "analyze",
+            f"Skipping {name} — 0 bytes",
+            level="warn",
+            filename=name,
+            outcome="empty_upload",
+        )
         return _error_finding(
             f"`{name}` is empty (0 bytes). Please re-upload — the bytes "
             f"may not have transferred correctly from your browser.",
@@ -1698,7 +1752,17 @@ def _run_analysis_on_upload(uploaded):
     try:
         if suffix in ("xlsx", "xls"):
             df = pd.read_excel(io.BytesIO(data), dtype=str, keep_default_na=False)
-            return analyze(df)
+            findings = analyze(df)
+            log_event(
+                "analyze",
+                f"Analyzed {name} ({len(findings)} findings)",
+                filename=name,
+                bytes=len(data),
+                sha1_12=digest,
+                findings=len(findings),
+                rows=len(df), cols=len(df.columns),
+            )
+            return findings
 
         # CSV / TSV: run repair_bytes so the user sees csv_* findings.
         text_head = data[:4096].decode("utf-8", errors="replace")
@@ -1710,6 +1774,13 @@ def _run_analysis_on_upload(uploaded):
                 break
         repair = repair_bytes(data, encoding="utf-8", delimiter=delim)
         if not repair.repaired_bytes:
+            log_event(
+                "analyze",
+                f"Skipping {name} — empty after repair",
+                level="warn",
+                filename=name,
+                outcome="empty_after_repair",
+            )
             return _error_finding(
                 f"`{name}` is empty after pre-parse repair "
                 f"(original was {len(data)} bytes — likely all NUL "
@@ -1723,8 +1794,25 @@ def _run_analysis_on_upload(uploaded):
             encoding="utf-8", delimiter=delim,
             dtype=str, keep_default_na=False, on_bad_lines="warn",
         )
-        return analyze(df, repair_result=repair)
-    except pd.errors.EmptyDataError:
+        findings = analyze(df, repair_result=repair)
+        log_event(
+            "analyze",
+            f"Analyzed {name} ({len(findings)} findings)",
+            filename=name,
+            bytes=len(data),
+            sha1_12=digest,
+            findings=len(findings),
+            rows=len(df), cols=len(df.columns),
+            delimiter=repr(delim),
+        )
+        return findings
+    except pd.errors.EmptyDataError as e:
+        log_exception(
+            f"analyze({name})",
+            e,
+            filename=name,
+            outcome="empty_after_repair",
+        )
         return _error_finding(
             f"`{name}` could not be parsed — pandas reports no columns "
             f"in the file. Original size was {len(data)} bytes. Open "
@@ -1733,6 +1821,12 @@ def _run_analysis_on_upload(uploaded):
             fid="empty_after_repair",
         )
     except Exception as e:
+        log_exception(
+            f"analyze({name})",
+            e,
+            filename=name,
+            outcome="analysis_failed",
+        )
         return _error_finding(
             f"`{name}` could not be analyzed: {format_for_user(e)}",
         )
@@ -33,6 +33,8 @@ from src.license import FeatureFlag
 
 hide_streamlit_chrome()
 render_sticky_footer()
+from src.audit import log_page_open
+log_page_open("1_Deduplicator")
 require_feature_or_render_upgrade(FeatureFlag.DEDUPLICATOR)
 
 # ---------------------------------------------------------------------------
@@ -231,6 +233,8 @@ if uploaded is not None:
 
         progress_bar.empty()
         st.session_state["result"] = result
+        from src.audit import log_event
+        log_event("tool_run", "Find Duplicates run", page="1_Deduplicator")
         st.session_state["review_decisions"] = {}
         # One-shot flag for the scroll snippet at the bottom of the
         # page. Force a rerun so the Preview / Options expanders see
@@ -35,6 +35,8 @@ from src.core.text_clean import (
 
 hide_streamlit_chrome()
 render_sticky_footer()
+from src.audit import log_page_open
+log_page_open("2_Text_Cleaner")
 require_feature_or_render_upgrade(FeatureFlag.TEXT_CLEANER)
 
 
@@ -218,6 +220,8 @@ if st.button("Clean Text", type="primary", use_container_width=True):
         st.error(str(e))
         st.stop()
     st.session_state["textclean_result"] = result
+    from src.audit import log_event
+    log_event("tool_run", "Clean Text run", page="2_Text_Cleaner")
     st.session_state["textclean_input_name"] = uploaded.name
     # One-shot flag picked up on the next pass to scroll the parent
     # document to the Results anchor (see scroll snippet below).
@@ -33,6 +33,8 @@ from src.license import FeatureFlag
 
 hide_streamlit_chrome()
 render_sticky_footer()
+from src.audit import log_page_open
+log_page_open("3_Format_Standardizer")
 require_feature_or_render_upgrade(FeatureFlag.FORMAT_STANDARDIZER)
 
 
@@ -537,6 +539,8 @@ if st.button(
         st.error(str(e))
         st.stop()
     st.session_state["fmtstd_result"] = result
+    from src.audit import log_event
+    log_event("tool_run", "Standardize Formats run", page="3_Format_Standardizer")
     st.session_state["fmtstd_input_name"] = uploaded.name
     # One-shot flag picked up on the next pass to scroll the parent
     # document to the Results anchor (see scroll snippet below).
@@ -34,6 +34,8 @@ from src.license import FeatureFlag
 
 hide_streamlit_chrome()
 render_sticky_footer()
+from src.audit import log_page_open
+log_page_open("4_Missing_Values")
 require_feature_or_render_upgrade(FeatureFlag.MISSING_HANDLER)
 
 
@@ -291,6 +293,8 @@ if st.button("Handle Missing Values", type="primary", use_container_width=True):
         st.error(format_for_user(e))
         st.stop()
     st.session_state["missing_result"] = result
+    from src.audit import log_event
+    log_event("tool_run", "Fix Missing Values run", page="4_Missing_Values")
     st.session_state["missing_input_name"] = uploaded.name
     st.session_state["missing_options"] = options.to_dict()
     # One-shot flag picked up on the next pass to scroll the parent
@@ -35,6 +35,8 @@ from src.license import FeatureFlag
 
 hide_streamlit_chrome()
 render_sticky_footer()
+from src.audit import log_page_open
+log_page_open("5_Column_Mapper")
 require_feature_or_render_upgrade(FeatureFlag.COLUMN_MAPPER)
 
 
@@ -338,6 +340,8 @@ if st.button("Apply Column Mapping", type="primary", use_container_width=True):
         st.error(format_for_user(e))
         st.stop()
     st.session_state["colmap_result"] = result
+    from src.audit import log_event
+    log_event("tool_run", "Map Columns run", page="5_Column_Mapper")
     st.session_state["colmap_input_name"] = uploaded.name
     st.session_state["colmap_options"] = options.to_dict()
     # One-shot flag picked up on the next pass to scroll the parent
@@ -22,6 +22,8 @@ from src.license import FeatureFlag
 
 hide_streamlit_chrome()
 render_sticky_footer()
+from src.audit import log_page_open
+log_page_open("6_Outlier_Detector")
 require_feature_or_render_upgrade(FeatureFlag.OUTLIER_DETECTOR)
 
 # ---------------------------------------------------------------------------
@@ -22,6 +22,8 @@ from src.license import FeatureFlag
 
 hide_streamlit_chrome()
 render_sticky_footer()
+from src.audit import log_page_open
+log_page_open("7_Multi_File_Merger")
 require_feature_or_render_upgrade(FeatureFlag.MULTI_FILE_MERGER)
 
 # ---------------------------------------------------------------------------
@@ -22,6 +22,8 @@ from src.license import FeatureFlag
 
 hide_streamlit_chrome()
 render_sticky_footer()
+from src.audit import log_page_open
+log_page_open("8_Validator_Reporter")
 require_feature_or_render_upgrade(FeatureFlag.VALIDATOR_REPORTER)
 
 # ---------------------------------------------------------------------------
@@ -36,6 +36,8 @@ from src.license import FeatureFlag
 
 hide_streamlit_chrome()
 render_sticky_footer()
+from src.audit import log_page_open
+log_page_open("9_Pipeline_Runner")
 require_feature_or_render_upgrade(FeatureFlag.PIPELINE_RUNNER)
 
 
@@ -283,6 +285,8 @@ if st.button(
 
     progress.progress(1.0, text="Done")
     st.session_state["pipeline_result"] = result
+    from src.audit import log_event
+    log_event("tool_run", "Automated Workflows run", page="9_Pipeline_Runner")
     st.session_state["pipeline_input_name"] = uploaded.name
     # One-shot flag picked up on the next pass to scroll the parent
     # document to the Results anchor (see scroll snippet at end of file).