Introduces src/core/errors.py with a small structured error hierarchy
that every public entry point now uses. Each error carries the
context a user needs to fix it and the context a maintainer needs to
trace it.
The hierarchy:
  DataToolsError (base; formats path, column, operation, suggestion)
    InputValidationError (extends ValueError; bad arg / wrong type)
    ConfigError (extends ValueError; bad config / options)
    FileFormatError (extends ValueError; file is not what we expected)
    FileAccessError (extends OSError; file I/O failure)
Subclassing the stdlib bases means existing `except OSError` /
`except ValueError` handlers still catch them — no breaking change.
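
A minimal sketch of how such a hierarchy could be laid out, to make the "no breaking change" claim concrete. The class names and the four context fields come from the list above. The keyword-only constructor and the "label: value" message layout are assumptions for illustration, not the actual contents of src/core/errors.py.

```python
from __future__ import annotations

from typing import Optional


class DataToolsError(Exception):
    """Base error; folds optional context fields into the message."""

    def __init__(
        self,
        message: str,
        *,
        path: Optional[str] = None,
        column: Optional[str] = None,
        operation: Optional[str] = None,
        suggestion: Optional[str] = None,
    ) -> None:
        self.message = message
        self.path = path
        self.column = column
        self.operation = operation
        self.suggestion = suggestion
        # Pass one pre-formatted string up so str(exc) is readable everywhere.
        super().__init__(self._format())

    def _format(self) -> str:
        # Assumed layout: message first, then one "label: value" line for
        # each field that was actually supplied.
        lines = [self.message]
        for label in ("path", "column", "operation", "suggestion"):
            value = getattr(self, label)
            if value is not None:
                lines.append(f"  {label}: {value}")
        return "\n".join(lines)


class InputValidationError(DataToolsError, ValueError):
    """Bad argument or wrong type."""


class ConfigError(DataToolsError, ValueError):
    """Bad config or options."""


class FileFormatError(DataToolsError, ValueError):
    """File is not what we expected."""


class FileAccessError(DataToolsError, OSError):
    """File I/O failure."""


# A handler written before this change still matches the new type.
try:
    raise FileAccessError("cannot open input", path="data/in.csv")
except OSError as exc:
    caught = exc
```

Listing the stdlib class as a second base is what keeps old `except` clauses working while new code can catch DataToolsError for everything.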
Helpers:
- ensure_dataframe(value, function=...) — uniform DataFrame guard
- ensure_choice(value, name=, choices=) — uniform enum/literal guard
- wrap_file_read(path, op, exc) — tag OSError with hint + path
- wrap_file_write(path, op, exc) — same, with Windows-aware tip
- format_for_user(exc, context=) — user-facing string for st.error / stderr
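
To illustrate the helper contracts listed above, here is a hedged sketch of ensure_choice and format_for_user. Only the parameter names (value, name, choices, exc, context) and the "ClassName: message" fallback are taken from this message. The stand-in error class, the exact wording, and the "Suggestion:" line are assumptions.

```python
from __future__ import annotations

from typing import Iterable, Optional


class InputValidationError(ValueError):
    """Stand-in for the real class; carries only a suggestion here."""

    def __init__(self, message: str, *, suggestion: Optional[str] = None) -> None:
        super().__init__(message)
        self.suggestion = suggestion


def ensure_choice(value, *, name: str, choices: Iterable):
    """Return value unchanged if it is one of choices, else raise."""
    allowed = list(choices)
    if value not in allowed:
        raise InputValidationError(
            f"{name}={value!r} is not a valid choice",
            suggestion="use one of: " + ", ".join(repr(c) for c in allowed),
        )
    return value


def format_for_user(exc: BaseException, context: Optional[str] = None) -> str:
    """Build a user-facing string suitable for st.error or stderr."""
    if isinstance(exc, InputValidationError):
        # Recognized error: show the message plus its suggestion, if any.
        text = str(exc)
        if exc.suggestion:
            text += f"\nSuggestion: {exc.suggestion}"
    else:
        # Unrecognized error: fall back to "ClassName: message".
        text = f"{type(exc).__name__}: {exc}"
    return f"{context}: {text}" if context else text
```

A caller would typically wrap the guard in a try block and pass the caught exception to format_for_user, optionally with a context string such as the operation being attempted.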
Library hardening:
- io.read_file: missing files raise FileAccessError stating whether
the parent directory exists, with a suggestion to check the path.
- io.read_file: chunk_size <= 0 now raises InputValidationError with
a positive-integer suggestion.
- io._read_excel: openpyxl BadZipFile / InvalidFileException / pandas
ValueError ("sheet not found") wrapped as FileFormatError listing
the path and a "list sheets with list_sheets()" hint.
- io._detect_excel_header_row: bare except narrowed to specific
openpyxl exceptions; falls back gracefully and logs at debug so
the real error surfaces from pd.read_excel.
- io.write_file: OSError / PermissionError on to_csv/to_excel wrapped
with file path and Windows-aware "file may be open in another
program" hint.
- dedup._parse_date: bare `except Exception` narrowed to
(TypeError, ValueError, OutOfBoundsDatetime); failed values
logged at debug for survivor-selection forensics.
- dedup._select_survivor: KEEP_MOST_RECENT now raises
InputValidationError instead of silently falling back to keep_first.
- dedup.deduplicate: input validation errors are InputValidationError
with operation/column/suggestion fields.
- format_standardize.from_dict: invalid FieldType for a column raises
ConfigError naming the column AND the bad value AND listing valid
values; same for date_order / phone_format / etc.
- format_standardize.from_file: OSError / JSON decode wrapped with
path AND line/column where parsing failed.
- format_standardize.to_file: TypeError on json.dumps wrapped as
ConfigError with the suspected source (extra_abbreviations).
- format_standardize._apply_field_type: dispatcher's "unknown field
type" branch now raises AssertionError (it's an internal invariant,
not user error — a new enum value was added without a branch).
- format_standardize._resolve_column_types: missing-column error now
InputValidationError with a "check for typos / unparsed header"
suggestion.
- format_standardize.standardize_dataframe: ensure_dataframe at entry.
- text_clean.clean_dataframe: ensure_dataframe at entry.
- config.to_strategies: invalid Algorithm/NormalizerType wrapped as
ConfigError naming the strategy index AND the column.
- config.to_survivor_rule: invalid SurvivorRule wrapped as ConfigError
listing valid values.
- config.from_file: OSError / JSON decode wrapped (mirror of
StandardizeOptions.from_file).
- fixes.repair_mojibake: ImportError on ftfy now logged at info level
with the underlying ImportError so a corrupt-package vs not-installed
distinction is visible in the logs.
- normalizers.normalize_phone: phonenumbers.NumberParseException now
logged at debug when the digits-only fallback drops extension /
country-code information — gives a trail when matching results
look wrong.
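
The pattern behind the dedup._parse_date change (narrow the except clause, log the failed value at debug) can be sketched as follows. This is a simplified stand-in, not the real function: it uses the stdlib datetime.fromisoformat instead of the project's parser, so it catches only TypeError and ValueError and has no OutOfBoundsDatetime case. The function name is hypothetical.

```python
from __future__ import annotations

import logging
from datetime import datetime
from typing import Optional

logger = logging.getLogger(__name__)


def parse_date_or_none(value: object) -> Optional[datetime]:
    """Return a datetime, or None when the value cannot be parsed."""
    try:
        return datetime.fromisoformat(value)  # type: ignore[arg-type]
    except (TypeError, ValueError) as exc:
        # Only the failures we expect from bad input are swallowed; anything
        # else (for example KeyboardInterrupt or a programming error) still
        # propagates to the caller.
        logger.debug("could not parse %r as a date: %s", value, exc)
        return None
```

Compared with a bare `except Exception`, the narrowed tuple keeps unexpected bugs visible while the debug log records which values were dropped.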
GUI / CLI surfaces:
- All 9 page handlers (`except Exception as e: st.error(...)`) now
use format_for_user(), which renders DataToolsError fields nicely
and falls back to "ClassName: message" for unrecognized errors.
- 2_Text_Cleaner and 3_Format_Standardizer additionally distinguish
UnicodeDecodeError with a "re-save as UTF-8" suggestion before
the generic handler.
- cli.py's "Error reading file" handler now uses format_for_user()
and includes the input path in the prefix.
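
The ordering used in the Text Cleaner and Format Standardizer pages (specific UnicodeDecodeError branch first, generic handler second) might look roughly like this. The function name and message wording are hypothetical, and the sketch returns a string rather than calling st.error so it runs without Streamlit.

```python
from __future__ import annotations

from typing import Optional


def read_failure_message(raw: bytes, name: str) -> Optional[str]:
    """Return a user-facing message if raw cannot be decoded, else None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Specific case first: the user can fix this by re-saving the file.
        return (
            f"Could not read {name}: not valid UTF-8 at byte {exc.start}. "
            "Suggestion: re-save the file as UTF-8."
        )
    except Exception as exc:
        # Generic fallback, mirroring the "ClassName: message" format.
        return f"Could not read {name}: {type(exc).__name__}: {exc}"
    return None
```

Because UnicodeDecodeError is a subclass of ValueError (and so of Exception), its branch has to come before the generic one or it would never be reached.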
Tests:
- tests/test_errors.py — 22 new tests covering: base class formatting,
stdlib inheritance, ensure_dataframe / ensure_choice helpers,
wrap_file_read / wrap_file_write, format_for_user behavior, and
end-to-end integration (missing file, missing dir, bad JSON, bad
algorithm, bad enum, missing column).
- tests/test_audit_fixes.py + tests/test_io.py — updated 4 tests for
the new exception types (InputValidationError replaces TypeError,
FileAccessError extends OSError).
Full project suite: 1230 passed, 4 skipped, 17 xfailed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
103 lines
3.2 KiB
Python
"""DataTools Validator & Reporter — stub page."""

from __future__ import annotations

import sys
from pathlib import Path

import streamlit as st

_project_root = Path(__file__).resolve().parent.parent.parent.parent
if str(_project_root) not in sys.path:
    sys.path.insert(0, str(_project_root))

from src.gui.components import hide_streamlit_chrome, require_normalization_gate

hide_streamlit_chrome()
require_normalization_gate()

# ---------------------------------------------------------------------------
# Header
# ---------------------------------------------------------------------------

st.title("✅ Validator & Reporter")
st.caption("Validate data against rules and generate quality reports.")

st.info("This tool is under development.")

# ---------------------------------------------------------------------------
# What this tool will do
# ---------------------------------------------------------------------------

st.markdown("""
**Features:**
- Column-level validation rules (not null, unique, regex pattern, range, enum)
- Cross-column validation (e.g., start_date < end_date)
- Data quality score per column and overall
- Generate PDF quality report
- Generate Excel report with flagged rows highlighted
- Summary dashboard: pass/fail counts, severity breakdown
""")

st.divider()

# ---------------------------------------------------------------------------
# File upload (functional)
# ---------------------------------------------------------------------------

uploaded = st.file_uploader(
    "Upload CSV or Excel file",
    type=["csv", "tsv", "xlsx", "xls"],
    help="Upload a file to preview. Processing is not yet available.",
    key="validator_file_upload",
)

if uploaded is not None:
    import pandas as pd
    try:
        if uploaded.name.endswith((".xlsx", ".xls")):
            df = pd.read_excel(uploaded)
        else:
            df = pd.read_csv(uploaded)
        st.subheader(f"Preview: {uploaded.name}")
        st.caption(f"{len(df)} rows, {len(df.columns)} columns")
        st.dataframe(df.head(10), use_container_width=True)
    except Exception as e:
        from src.core.errors import format_for_user
        st.error(
            f"**Could not read `{uploaded.name}`**\n\n"
            f"```\n{format_for_user(e)}\n```"
        )

# ---------------------------------------------------------------------------
# Placeholder options
# ---------------------------------------------------------------------------

st.subheader("Validation Rules")

st.file_uploader("Load rules file (JSON)", type=["json"], disabled=True, key="validator_rules")
st.multiselect("Quick checks", [
    "No null values",
    "No duplicate rows",
    "All emails valid",
    "All dates parseable",
    "Numeric columns in range",
], disabled=True)

st.subheader("Report Format")

st.selectbox("Output format", ["Excel (flagged rows)", "PDF summary", "Both"], disabled=True)

st.divider()
st.button("Validate & Generate Report", type="primary", use_container_width=True, disabled=True)

# ---------------------------------------------------------------------------
# Footer
# ---------------------------------------------------------------------------

st.divider()
st.caption(
    "Runs locally. Your data never leaves this computer. "
    "| DataTools v3.0"
)