Files
datatools-dev/src/gui/pages/6_Outlier_Detector.py
Michael 82d7fef21e feat(gate): CSV-normalization gate with confidence-tiered findings
Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.

Core (src/core/):
  - analyze.py: Finding gains confidence, fix_action, pre_applied; new
    detectors for encoding_uncertain, encoding_decode_failed; new top-
    level encoding_override parameter.
  - fixes.py: registry of fix algorithms keyed by fix_action id.
  - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
    the NormalizationResult / Decision dataclasses the gate consumes.
  - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
    transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
    and normalizes line endings (fixes bare-CR parser crash); empty file
    handled gracefully instead of EmptyDataError traceback.

GUI (src/gui/):
  - pages/0_Review.py: gate page with per-finding decision controls,
    encoding override picker (16 codepages + custom), and Advanced output
    options (encoding, delimiter, line terminator) on the download.
  - components.py: require_normalization_gate() helper.
  - pages/1-9: gate guard wired on every tool page.

Test corpora:
  - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
    UTF-8 files + manifest, synced from Business/DataTools.
  - test-cases/text-cleaner-corpus/test_data/17: synced malformed input
    (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
  - test_normalize.py (48): finding fields, fix registry, auto_fix scope,
    decision paths, gate idempotency, output-options helper.
  - test_encodings_corpus.py (90, 16 xfailed): parametric detection +
    decode + analyzer-no-crash sweep against the manifest.
  - test_analyze.py: encoding override + encoding_uncertain detectors.
  - test_corpus.py: pre-parse repair in the strict reader.

run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:35:27 +00:00

94 lines
3.0 KiB
Python

"""DataTools Outlier Detector — stub page."""
from __future__ import annotations
import sys
from pathlib import Path
import streamlit as st
_project_root = Path(__file__).resolve().parent.parent.parent.parent
if str(_project_root) not in sys.path:
sys.path.insert(0, str(_project_root))
from src.gui.components import hide_streamlit_chrome, require_normalization_gate
hide_streamlit_chrome()
require_normalization_gate()
# ---------------------------------------------------------------------------
# Header
# ---------------------------------------------------------------------------
st.title("📊 Outlier Detector")
st.caption("Detect and handle outliers in numeric columns.")
st.info("This tool is under development.")
# ---------------------------------------------------------------------------
# What this tool will do
# ---------------------------------------------------------------------------
st.markdown("""
**Features:**
- Z-score detection (configurable threshold)
- IQR (interquartile range) detection
- MAD (median absolute deviation) detection
- Domain-rule violations (e.g., age < 0, price > $1M)
- Visual outlier highlighting in data preview
- Handling: flag only, remove, cap/winsorize to bounds
""")
st.divider()
# ---------------------------------------------------------------------------
# File upload (functional)
# ---------------------------------------------------------------------------
uploaded = st.file_uploader(
"Upload CSV or Excel file",
type=["csv", "tsv", "xlsx", "xls"],
help="Upload a file to preview. Processing is not yet available.",
key="outlier_file_upload",
)
if uploaded is not None:
import pandas as pd
try:
if uploaded.name.endswith((".xlsx", ".xls")):
df = pd.read_excel(uploaded)
else:
df = pd.read_csv(uploaded)
st.subheader(f"Preview: {uploaded.name}")
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
st.dataframe(df.head(10), use_container_width=True)
except Exception as e:
st.error(f"Failed to read file: {e}")
# ---------------------------------------------------------------------------
# Placeholder options
# ---------------------------------------------------------------------------
st.subheader("Detection Method")
st.selectbox("Method", ["Z-Score", "IQR (Interquartile Range)", "MAD (Median Absolute Deviation)"], disabled=True)
st.slider("Z-Score threshold", 1.0, 5.0, 3.0, 0.1, disabled=True)
st.slider("IQR multiplier", 1.0, 3.0, 1.5, 0.1, disabled=True)
st.subheader("Handling")
st.selectbox("Action", ["Flag only (add column)", "Remove outlier rows", "Cap / Winsorize to bounds"], disabled=True)
st.divider()
st.button("Detect Outliers", type="primary", use_container_width=True, disabled=True)
# ---------------------------------------------------------------------------
# Footer
# ---------------------------------------------------------------------------
st.divider()
st.caption(
"Runs locally. Your data never leaves this computer. "
"| DataTools v3.0"
)