feat(errors): structured error hierarchy + helpful messages everywhere
Introduces src/core/errors.py with a small structured error hierarchy
that every public entry point now uses. Each error carries the
context a user needs to fix it and the context a maintainer needs to
trace it.
The hierarchy:
  DataToolsError (base; formats path, column, operation, suggestion)
    InputValidationError (extends ValueError; bad arg / wrong type)
    ConfigError (extends ValueError; bad config / options)
    FileFormatError (extends ValueError; file is not what we expected)
    FileAccessError (extends OSError; file I/O failure)
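Subclassing the stdlib bases means existing `except OSError` /
`except ValueError` handlers still catch them, so those call sites do
not break. Where a guard previously raised TypeError (see the updated
tests), it now raises InputValidationError, a ValueError, so an
`except TypeError` handler no longer matches there.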
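The compatibility claim can be checked with a condensed stand-in for two of the classes. This is only a sketch: the class bodies are trimmed to the message and suggestion, and `legacy_loader` is a hypothetical pre-existing caller, not code from this repository.

```python
class DataToolsError(Exception):
    """Condensed stand-in: keeps only message and suggestion."""

    def __init__(self, message, *, suggestion=None):
        self.message = message
        self.suggestion = suggestion
        super().__init__(message)


class InputValidationError(DataToolsError, ValueError):
    """Bad argument passed by the caller."""


class FileAccessError(DataToolsError, OSError):
    """File could not be read or written."""


def legacy_loader(error):
    """Hypothetical caller written before the new hierarchy existed."""
    try:
        raise error
    except ValueError:
        return "caught as ValueError"
    except OSError:
        return "caught as OSError"


validation_result = legacy_loader(InputValidationError("chunk_size must be > 0"))
access_result = legacy_loader(FileAccessError("Could not read file"))
```

Both calls land in the old stdlib handlers, while new code can catch `DataToolsError` to handle every library error in one place.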
Helpers:
- ensure_dataframe(value, function=...) — uniform DataFrame guard
- ensure_choice(value, name=, choices=) — uniform enum/literal guard
- wrap_file_read(path, op, exc) — tag OSError with hint + path
- wrap_file_write(path, op, exc) — same, with Windows-aware tip
- format_for_user(exc, context=) — user-facing string for st.error / stderr
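A call site for the write wrapper might look like the sketch below. The error class and `wrap_file_write` are condensed stand-ins with shortened suggestion text, and `save_report` is a hypothetical caller used only for illustration.

```python
import tempfile
from pathlib import Path


class FileAccessError(OSError):
    """Condensed stand-in for the class in src/core/errors.py."""

    def __init__(self, message, *, path=None, operation=None, cause=None, suggestion=None):
        self.path, self.operation = path, operation
        self.cause, self.suggestion = cause, suggestion
        super().__init__(message)


def wrap_file_write(path, operation, exc):
    """Build (not raise) the error, with a PermissionError-specific tip."""
    suggestion = "Check that the parent directory exists and has free space."
    if isinstance(exc, PermissionError):
        suggestion = "On Windows, also ensure the file is not open in another program."
    return FileAccessError(
        f"Could not write file ({type(exc).__name__})",
        path=path, operation=operation, cause=exc, suggestion=suggestion,
    )


def save_report(path, text):
    """Hypothetical caller: wrap any OSError raised by the write."""
    try:
        Path(path).write_text(text, encoding="utf-8")
    except OSError as exc:
        raise wrap_file_write(path, "save_report", exc) from exc


with tempfile.TemporaryDirectory() as tmp:
    # The parent directory does not exist, so the write fails.
    target = Path(tmp) / "missing_dir" / "report.csv"
    try:
        save_report(target, "a,b\n1,2\n")
        caught = None
    except FileAccessError as err:
        caught = err
```

Because the wrappers return the exception instead of raising it, the `raise ... from exc` stays at the call site and the traceback keeps the original error as `__cause__`.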
Library hardening:
- io.read_file: a missing file now raises FileAccessError, which states
whether the parent directory exists and suggests checking the path.
- io.read_file: chunk_size <= 0 now raises InputValidationError with
a positive-integer suggestion.
- io._read_excel: openpyxl BadZipFile / InvalidFileException / pandas
ValueError ("sheet not found") wrapped as FileFormatError listing
the path and a "list sheets with list_sheets()" hint.
- io._detect_excel_header_row: bare except narrowed to specific
openpyxl exceptions; falls back gracefully and logs at debug so
the real error surfaces from pd.read_excel.
- io.write_file: OSError / PermissionError on to_csv/to_excel wrapped
with file path and Windows-aware "file may be open in another
program" hint.
- dedup._parse_date: bare `except Exception` narrowed to
(TypeError, ValueError, OutOfBoundsDatetime); failed values
logged at debug for survivor-selection forensics.
- dedup._select_survivor: KEEP_MOST_RECENT now raises
InputValidationError instead of silently falling back to keep_first.
- dedup.deduplicate: input validation errors are InputValidationError
with operation/column/suggestion fields.
- format_standardize.from_dict: invalid FieldType for a column raises
ConfigError naming the column AND the bad value AND listing valid
values; same for date_order / phone_format / etc.
- format_standardize.from_file: OSError / JSON decode wrapped with
path AND line/column where parsing failed.
- format_standardize.to_file: TypeError on json.dumps wrapped as
ConfigError with the suspected source (extra_abbreviations).
- format_standardize._apply_field_type: dispatcher's "unknown field
type" branch now raises AssertionError (it's an internal invariant,
not user error — a new enum value was added without a branch).
- format_standardize._resolve_column_types: missing-column error now
InputValidationError with a "check for typos / unparsed header"
suggestion.
- format_standardize.standardize_dataframe: ensure_dataframe at entry.
- text_clean.clean_dataframe: ensure_dataframe at entry.
- config.to_strategies: invalid Algorithm/NormalizerType wrapped as
ConfigError naming the strategy index AND the column.
- config.to_survivor_rule: invalid SurvivorRule wrapped as ConfigError
listing valid values.
- config.from_file: OSError / JSON decode wrapped (mirror of
StandardizeOptions.from_file).
- fixes.repair_mojibake: ImportError on ftfy now logged at info level
with the underlying ImportError so a corrupt-package vs not-installed
distinction is visible in the logs.
- normalizers.normalize_phone: phonenumbers.NumberParseException now
logged at debug when the digits-only fallback drops extension /
country-code information — gives a trail when matching results
look wrong.
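The JSON wrapping described for from_file can be sketched as follows. `parse_options` and the plain `ConfigError` stand-in are hypothetical and simplified; only `json.JSONDecodeError` and its `msg`, `lineno` and `colno` attributes come from the standard library.

```python
import json


class ConfigError(ValueError):
    """Condensed stand-in for ConfigError in src/core/errors.py."""


def parse_options(text, *, source="<string>"):
    """Hypothetical loader: re-raise JSON syntax errors with their position."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # JSONDecodeError exposes msg, lineno and colno (both 1-based).
        raise ConfigError(
            f"Invalid JSON in {source}: {exc.msg} "
            f"(line {exc.lineno}, column {exc.colno})"
        ) from exc


# A value is missing after "phone_format", so parsing fails.
broken = '{\n  "date_order": "DMY",\n  "phone_format": \n}'
try:
    parse_options(broken, source="options.json")
    config_error = None
except ConfigError as err:
    config_error = err
```

Catching only `json.JSONDecodeError` leaves unrelated failures untouched, in keeping with the narrowed `except` clauses elsewhere in this commit.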
GUI / CLI surfaces:
- All 9 page handlers (`except Exception as e: st.error(...)`) now
use format_for_user(), which renders DataToolsError fields nicely
and falls back to "ClassName: message" for unrecognized errors.
- 2_Text_Cleaner and 3_Format_Standardizer additionally catch
UnicodeDecodeError ahead of the generic handler and show a
"re-save as UTF-8" suggestion.
- cli.py's "Error reading file" handler now uses format_for_user()
and includes the input path in the prefix.
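The handler shape described above could look roughly like this. `run_page` and `show_error` are hypothetical (the latter stands in for `st.error`), and the error class is condensed; `format_for_user` follows the same branching as the helper added in this commit.

```python
class DataToolsError(Exception):
    """Condensed stand-in: message plus an optional suggestion line."""

    def __init__(self, message, *, suggestion=None):
        self.message = message
        self.suggestion = suggestion
        super().__init__(self.format())

    def format(self):
        lines = [self.message]
        if self.suggestion:
            lines.append(f" suggestion: {self.suggestion}")
        return "\n".join(lines)


def format_for_user(exc, *, context=None):
    """Structured errors render their fields; others fall back to 'ClassName: message'."""
    body = exc.format() if isinstance(exc, DataToolsError) else f"{type(exc).__name__}: {exc}"
    return f"{context}\n\n{body}" if context else body


def run_page(action, show_error):
    """Hypothetical page handler; show_error stands in for st.error."""
    try:
        action()
    except UnicodeDecodeError:
        # Must come before the generic handler to get the specific hint.
        show_error("Could not decode the file. Re-save it as UTF-8 and try again.")
    except Exception as exc:
        show_error(format_for_user(exc, context="Failed to process upload"))


def fail_generic():
    raise RuntimeError("boom")


def fail_structured():
    raise DataToolsError("Column not found", suggestion="Check for typos.")


messages = []
run_page(lambda: b"\xff".decode("utf-8"), messages.append)
run_page(fail_generic, messages.append)
run_page(fail_structured, messages.append)
```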
Tests:
- tests/test_errors.py — 22 new tests covering: base class formatting,
stdlib inheritance, ensure_dataframe / ensure_choice helpers,
wrap_file_read / wrap_file_write, format_for_user behavior, and
end-to-end integration (missing file, missing dir, bad JSON, bad
algorithm, bad enum, missing column).
- tests/test_audit_fixes.py + tests/test_io.py — updated 4 tests for
the new exception types (InputValidationError replaces TypeError,
FileAccessError extends OSError).
Full project suite: 1230 passed, 4 skipped, 17 xfailed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
185
src/core/errors.py
Normal file
@@ -0,0 +1,185 @@
"""Shared error-formatting helpers.

These keep error messages uniform across modules: same "what failed,
where, and what to try next" structure regardless of which layer
raises. Public CLIs / GUIs can rely on the message format being
consistent enough to surface to end users without further wrapping.

Usage patterns:

    raise DataToolsError(
        "Could not read input file",
        path=path,
        suggestion="Check that the file exists and is readable.",
    )

    # Wrapping a library error:
    try:
        wb = load_workbook(path)
    except (BadZipFile, InvalidFileException) as e:
        raise FileFormatError(
            "Excel file is corrupted or not a valid .xlsx",
            path=path,
            cause=e,
        ) from e
"""

from __future__ import annotations

from pathlib import Path
from typing import Any, Iterable, Optional


class DataToolsError(Exception):
    """Base class for all DataTools-raised errors.

    Carries optional structured fields so GUIs / logs can render them
    consistently rather than re-parsing free-form messages.
    """

    def __init__(
        self,
        message: str,
        *,
        path: Optional[Path | str] = None,
        column: Optional[str] = None,
        operation: Optional[str] = None,
        suggestion: Optional[str] = None,
        cause: Optional[BaseException] = None,
    ):
        self.message = message
        self.path = Path(path) if path is not None else None
        self.column = column
        self.operation = operation
        self.suggestion = suggestion
        self.cause = cause
        super().__init__(self.format())

    def format(self) -> str:
        """Render a human-friendly multi-line message."""
        lines = [self.message]
        if self.operation:
            lines.append(f" while: {self.operation}")
        if self.path:
            lines.append(f" file: {self.path}")
        if self.column:
            lines.append(f" column: {self.column!r}")
        if self.cause:
            lines.append(f" underlying: {type(self.cause).__name__}: {self.cause}")
        if self.suggestion:
            lines.append(f" suggestion: {self.suggestion}")
        return "\n".join(lines)


class InputValidationError(DataToolsError, ValueError):
    """Caller passed a bad argument — e.g., non-DataFrame, bad enum value."""


class ConfigError(DataToolsError, ValueError):
    """Configuration file or options object is invalid."""


class FileFormatError(DataToolsError, ValueError):
    """File exists but is not in the expected format (corrupted, wrong schema)."""


class FileAccessError(DataToolsError, OSError):
    """File could not be read or written — permissions, missing parent, full disk."""


# ---------------------------------------------------------------------------
# Convenience constructors
# ---------------------------------------------------------------------------


def ensure_dataframe(value: Any, *, function: str, parameter: str = "df") -> None:
    """Raise InputValidationError if *value* isn't a pandas DataFrame.

    Centralizes the repetitive guard so every public entry point gives
    the same message shape.
    """
    import pandas as pd  # lazy — keeps this module dependency-light
    if not isinstance(value, pd.DataFrame):
        raise InputValidationError(
            f"{function}() requires a pandas DataFrame for {parameter!r}",
            operation=function,
            suggestion=(
                f"Got {type(value).__name__}. "
                "Pass a DataFrame loaded via src.core.io.read_file() "
                "or constructed with pd.DataFrame(...)."
            ),
        )


def ensure_choice(
    value: Any,
    *,
    name: str,
    choices: Iterable[Any],
    function: Optional[str] = None,
) -> None:
    """Raise InputValidationError if *value* isn't in *choices*."""
    choices = list(choices)
    if value in choices:
        return
    raise InputValidationError(
        f"Invalid {name}={value!r}",
        operation=function,
        suggestion=f"Valid: {sorted(map(str, choices))}",
    )


def wrap_file_read(path: Path | str, operation: str, exc: BaseException) -> FileAccessError:
    """Build a FileAccessError describing a read failure with helpful context."""
    return FileAccessError(
        f"Could not read file ({type(exc).__name__})",
        path=path,
        operation=operation,
        cause=exc,
        suggestion=(
            "Check that the file exists, you have read permission, and the "
            "path isn't on a network mount that may have disconnected."
        ),
    )


def wrap_file_write(path: Path | str, operation: str, exc: BaseException) -> FileAccessError:
    """Build a FileAccessError describing a write failure with helpful context."""
    suggestion = (
        "Check that the parent directory exists, you have write permission, "
        "and there is enough free disk space."
    )
    if isinstance(exc, PermissionError):
        suggestion = (
            "Check write permissions on the parent directory. "
            "On Windows, also ensure the file is not open in another program."
        )
    return FileAccessError(
        f"Could not write file ({type(exc).__name__})",
        path=path,
        operation=operation,
        cause=exc,
        suggestion=suggestion,
    )


# ---------------------------------------------------------------------------
# Friendly formatter for end-user surfaces (CLI stderr, GUI st.error)
# ---------------------------------------------------------------------------


def format_for_user(exc: BaseException, *, context: Optional[str] = None) -> str:
    """Render an exception for end-user display.

    Recognizes :class:`DataToolsError` and uses its structured fields;
    falls back to a generic message + class name for unrecognized
    exceptions. ``context`` is an optional one-line prefix describing
    what the user was trying to do (e.g., ``"Failed to read upload"``).
    """
    if isinstance(exc, DataToolsError):
        body = exc.format()
    else:
        body = f"{type(exc).__name__}: {exc}"
    if context:
        return f"{context}\n\n{body}"
    return body