feat(errors): structured error hierarchy + helpful messages everywhere

Introduces src/core/errors.py with a small structured error hierarchy
that every public entry point now uses. Each error carries the
context a user needs to fix it and the context a maintainer needs to
trace it.

The hierarchy:

  DataToolsError (base; formats path, column, operation, suggestion)
    InputValidationError (extends ValueError; bad arg / wrong type)
    ConfigError (extends ValueError; bad config / options)
    FileFormatError (extends ValueError; file is not what we expected)
    FileAccessError (extends OSError; file I/O failure)
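Subclassing the stdlib bases means existing `except OSError` /
`except ValueError` handlers still catch them, so those handlers need
no changes. The exception is call sites that previously raised
TypeError: they now raise InputValidationError (a ValueError), so an
`except TypeError` handler no longer matches them.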
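
A minimal sketch of how such a hierarchy could be laid out, assuming a base class that stores the context fields and renders them in `__str__`. The class names and the keyword names (`path`, `column`, `operation`, `suggestion`, `cause`) come from this commit; the message layout and the order of the base classes are assumptions, not the actual contents of src/core/errors.py.

```python
class DataToolsError(Exception):
    """Base error: keeps the context fields and renders them in str()."""

    def __init__(self, message, *, path=None, column=None,
                 operation=None, suggestion=None, cause=None):
        super().__init__(message)
        self.message = message
        self.path = path
        self.column = column
        self.operation = operation
        self.suggestion = suggestion
        self.cause = cause

    def __str__(self):
        # Assumed layout: message first, then one line per populated field.
        parts = [self.message]
        for label, value in (("path", self.path),
                             ("column", self.column),
                             ("operation", self.operation),
                             ("suggestion", self.suggestion)):
            if value is not None:
                parts.append(f"  {label}: {value}")
        return "\n".join(parts)


class InputValidationError(DataToolsError, ValueError):
    """Bad argument or wrong type."""


class ConfigError(DataToolsError, ValueError):
    """Bad config or options."""


class FileFormatError(DataToolsError, ValueError):
    """File is not what we expected."""


class FileAccessError(DataToolsError, OSError):
    """File I/O failure."""


# A handler written against the stdlib base still catches the new type.
try:
    raise ConfigError("bad option", operation="demo")  # hypothetical values
except ValueError as exc:
    caught = exc
```

Listing the stdlib class as the second base is what keeps `isinstance(err, ValueError)` and `isinstance(err, OSError)` true while the shared formatting lives in one place.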
Helpers:
- ensure_dataframe(value, function=...) — uniform DataFrame guard
- ensure_choice(value, name=, choices=) — uniform enum/literal guard
- wrap_file_read(path, op, exc) — tag OSError with hint + path
- wrap_file_write(path, op, exc) — same, with Windows-aware tip
- format_for_user(exc, context=) — user-facing string for st.error / stderr
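
As an illustration of the guard helpers, here is a rough sketch of what `ensure_choice` might look like. The signature (`value`, `name=`, `choices=`) is taken from the list above; returning the value unchanged, the message wording, and the simplified stand-in for `InputValidationError` are assumptions made for this example.

```python
class InputValidationError(ValueError):
    """Simplified stand-in for the commit's InputValidationError."""

    def __init__(self, message, *, operation=None, suggestion=None):
        super().__init__(message)
        self.operation = operation
        self.suggestion = suggestion


def ensure_choice(value, *, name, choices):
    """Return value unchanged if it is one of choices, otherwise raise."""
    allowed = list(choices)
    if value not in allowed:
        raise InputValidationError(
            f"Invalid {name}={value!r}",
            suggestion=f"Valid values: {sorted(map(str, allowed))}",
        )
    return value


# Hypothetical call site: "keep" and its choices are made up for the demo.
keep = ensure_choice("first", name="keep", choices=("first", "last"))
```

Centralizing the check means every enum-like argument fails with the same message shape and always lists the accepted values.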
Library hardening:
- io.read_file: a missing file now raises FileAccessError that states
whether the parent directory exists and suggests checking the path.
- io.read_file: chunk_size <= 0 now raises InputValidationError with
a positive-integer suggestion.
- io._read_excel: openpyxl BadZipFile / InvalidFileException / pandas
ValueError ("sheet not found") wrapped as FileFormatError listing
the path and a "list sheets with list_sheets()" hint.
- io._detect_excel_header_row: bare except narrowed to specific
openpyxl exceptions; falls back gracefully and logs at debug so
the real error surfaces from pd.read_excel.
- io.write_file: OSError / PermissionError on to_csv/to_excel wrapped
with file path and Windows-aware "file may be open in another
program" hint.
- dedup._parse_date: bare `except Exception` narrowed to
(TypeError, ValueError, OutOfBoundsDatetime); failed values
logged at debug for survivor-selection forensics.
- dedup._select_survivor: KEEP_MOST_RECENT now raises
InputValidationError instead of silently falling back to keep_first.
- dedup.deduplicate: input validation errors are InputValidationError
with operation/column/suggestion fields.
- format_standardize.from_dict: invalid FieldType for a column raises
ConfigError naming the column AND the bad value AND listing valid
values; same for date_order / phone_format / etc.
- format_standardize.from_file: OSError / JSON decode wrapped with
path AND line/column where parsing failed.
- format_standardize.to_file: TypeError on json.dumps wrapped as
ConfigError naming the suspected sources (extra_abbreviations or
column_types).
- format_standardize._apply_field_type: dispatcher's "unknown field
type" branch now raises AssertionError (it's an internal invariant,
not user error — a new enum value was added without a branch).
- format_standardize._resolve_column_types: missing-column error now
InputValidationError with a "check for typos / unparsed header"
suggestion.
- format_standardize.standardize_dataframe: ensure_dataframe at entry.
- text_clean.clean_dataframe: ensure_dataframe at entry.
- config.to_strategies: invalid Algorithm/NormalizerType wrapped as
ConfigError naming the strategy index AND the column.
- config.to_survivor_rule: invalid SurvivorRule wrapped as ConfigError
listing valid values.
- config.from_file: OSError / JSON decode wrapped (mirror of
StandardizeOptions.from_file).
- fixes.repair_mojibake: ImportError on ftfy now logged at info level
with the underlying ImportError so a corrupt-package vs not-installed
distinction is visible in the logs.
- normalizers.normalize_phone: phonenumbers.NumberParseException now
logged at debug when the digits-only fallback drops extension /
country-code information — gives a trail when matching results
look wrong.
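
The file-read wrapping described in the list above could be sketched roughly as follows. The call shape `wrap_file_read(path, op, exc)` and the fact that it returns an exception for the caller to raise both follow the diff in this commit; the hint wording, the branch structure, and the simplified `FileAccessError` stand-in are assumptions.

```python
from pathlib import Path


class FileAccessError(OSError):
    """Simplified stand-in for the commit's FileAccessError."""

    def __init__(self, message, *, path=None, operation=None,
                 suggestion=None, cause=None):
        super().__init__(message)
        self.path = path
        self.operation = operation
        self.suggestion = suggestion
        self.cause = cause


def wrap_file_read(path, op, exc):
    """Build (not raise) a FileAccessError describing a failed read."""
    p = Path(path)
    if isinstance(exc, FileNotFoundError):
        # Tell the user whether the directory or only the file is missing.
        state = "exists" if p.parent.is_dir() else "does not exist"
        hint = f"Parent directory {p.parent} {state}. Check the path for typos."
    elif isinstance(exc, PermissionError):
        hint = "Check read permissions on the file."
    else:
        hint = "Check that the file is reachable and not locked."
    return FileAccessError(f"Could not read {p}: {exc}", path=p,
                           operation=op, suggestion=hint, cause=exc)


def read_text_or_explain(path):
    """Hypothetical caller showing the raise ... from e pattern."""
    try:
        return Path(path).read_text()
    except OSError as e:
        raise wrap_file_read(path, "read_text_or_explain", e) from e
```

Returning the exception instead of raising it inside the helper keeps the `raise ... from e` at the call site, so the traceback points at the function that actually failed.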
GUI / CLI surfaces:
- All 9 page handlers (`except Exception as e: st.error(...)`) now
use format_for_user(), which renders DataToolsError fields nicely
and falls back to "ClassName: message" for unrecognized errors.
- 2_Text_Cleaner and 3_Format_Standardizer additionally distinguish
UnicodeDecodeError with a "re-save as UTF-8" suggestion before
the generic handler.
- cli.py's "Error reading file" handler now uses format_for_user()
and includes the input path in the prefix.
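
To make the handler behavior concrete, here is one possible shape for `format_for_user`. The signature and the "ClassName: message" fallback come from the description above; the rendered field labels and the simplified `DataToolsError` stand-in are assumptions for illustration only.

```python
class DataToolsError(Exception):
    """Simplified stand-in for the commit's base error."""

    def __init__(self, message, *, path=None, suggestion=None):
        super().__init__(message)
        self.message = message
        self.path = path
        self.suggestion = suggestion


def format_for_user(exc, context=None):
    """Return a display string suitable for st.error() or stderr."""
    prefix = f"{context}: " if context else ""
    if isinstance(exc, DataToolsError):
        # Structured errors: message first, then the populated fields.
        lines = [prefix + exc.message]
        if exc.path is not None:
            lines.append(f"File: {exc.path}")
        if exc.suggestion:
            lines.append(f"Suggestion: {exc.suggestion}")
        return "\n".join(lines)
    # Unrecognized errors fall back to "ClassName: message".
    return f"{prefix}{type(exc).__name__}: {exc}"
```

A page handler would then reduce to `except Exception as e: st.error(format_for_user(e))`, and the CLI can pass the input path through `context`.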
Tests:
- tests/test_errors.py — 22 new tests covering: base class formatting,
stdlib inheritance, ensure_dataframe / ensure_choice helpers,
wrap_file_read / wrap_file_write, format_for_user behavior, and
end-to-end integration (missing file, missing dir, bad JSON, bad
algorithm, bad enum, missing column).
- tests/test_audit_fixes.py + tests/test_io.py — updated 4 tests for
the new exception types (InputValidationError replaces TypeError,
FileAccessError extends OSError).
Full project suite: 1230 passed, 4 skipped, 17 xfailed.
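
For a sense of what the end-to-end "bad JSON" case checks, a test along these lines would do. This is a sketch, not a copy of tests/test_errors.py: `load_options` is a hypothetical stand-in for `StandardizeOptions.from_file`, and `ConfigError` is reduced to the fields the test needs.

```python
import json
import tempfile
from pathlib import Path


class ConfigError(ValueError):
    """Simplified stand-in for the commit's ConfigError."""

    def __init__(self, message, *, path=None, suggestion=None):
        super().__init__(message)
        self.path = path
        self.suggestion = suggestion


def load_options(path):
    """Hypothetical loader mirroring the JSON handling in from_file."""
    try:
        return json.loads(Path(path).read_text())
    except json.JSONDecodeError as e:
        raise ConfigError(
            "Invalid JSON in config",
            path=path,
            suggestion=f"JSON parser failed at line {e.lineno}, column {e.colno}.",
        ) from e


def test_bad_json_reports_position():
    with tempfile.TemporaryDirectory() as tmp:
        bad = Path(tmp) / "options.json"
        bad.write_text('{"a": 1,\n "b": }')  # value missing on line 2
        try:
            load_options(bad)
        except ValueError as exc:  # old-style handlers still match
            assert isinstance(exc, ConfigError)
            assert "line 2" in exc.suggestion
            assert isinstance(exc.__cause__, json.JSONDecodeError)
        else:
            raise AssertionError("expected ConfigError")
```

Catching `ValueError` in the test is deliberate: it exercises the stdlib-inheritance guarantee and the new type in the same assertion block.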
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -1527,19 +1527,26 @@ class StandardizeOptions:
 
     @classmethod
     def from_dict(cls, data: dict) -> StandardizeOptions:
+        from .errors import ConfigError
         known = {f for f in cls.__dataclass_fields__}
         kwargs = {k: v for k, v in data.items() if k in known}
         column_types = kwargs.get("column_types") or {}
-        try:
-            kwargs["column_types"] = {
-                c: FieldType(t) if not isinstance(t, FieldType) else t
-                for c, t in column_types.items()
-            }
-        except ValueError as e:
-            valid = ", ".join(sorted(t.value for t in FieldType))
-            raise ValueError(
-                f"Invalid field type in column_types: {e}. Valid: {valid}"
-            ) from e
+        resolved: dict[str, FieldType] = {}
+        for col, raw in column_types.items():
+            try:
+                resolved[col] = (
+                    FieldType(raw) if not isinstance(raw, FieldType) else raw
+                )
+            except ValueError as e:
+                valid = sorted(t.value for t in FieldType)
+                raise ConfigError(
+                    f"Invalid field type {raw!r} for column {col!r}",
+                    column=col,
+                    operation="StandardizeOptions.from_dict",
+                    cause=e,
+                    suggestion=f"Valid field types: {valid}",
+                ) from e
+        kwargs["column_types"] = resolved
         # Surface enum-string mismatches early — bad date_order ("xyz")
         # would otherwise crash deep inside standardize_date.
         for field_name, valid in (
@@ -1555,8 +1562,10 @@ class StandardizeOptions:
         ):
             value = kwargs.get(field_name)
             if value is not None and value not in valid:
-                raise ValueError(
-                    f"Invalid {field_name}={value!r}. Valid: {sorted(valid)}"
+                raise ConfigError(
+                    f"Invalid {field_name}={value!r}",
+                    operation="StandardizeOptions.from_dict",
+                    suggestion=f"Valid values: {sorted(valid)}",
                 )
         return cls(**kwargs)
 
@@ -1567,24 +1576,47 @@ class StandardizeOptions:
         return d
 
     def to_file(self, path: str | Path) -> Path:
+        from .errors import ConfigError, wrap_file_write
         out = Path(path)
-        out.write_text(json.dumps(self.to_dict(), indent=2))
+        try:
+            payload = json.dumps(self.to_dict(), indent=2)
+        except TypeError as e:
+            raise ConfigError(
+                "Could not serialize StandardizeOptions to JSON",
+                operation="StandardizeOptions.to_file",
+                cause=e,
+                suggestion=(
+                    "extra_abbreviations or column_types likely contains a "
+                    "non-string/non-enum value. Inspect with .to_dict() and "
+                    "remove the offending entry."
+                ),
+            ) from e
+        try:
+            out.write_text(payload)
+        except (OSError, PermissionError) as e:
+            raise wrap_file_write(out, "StandardizeOptions.to_file", e) from e
         return out
 
     @classmethod
     def from_file(cls, path: str | Path) -> StandardizeOptions:
+        from .errors import ConfigError, wrap_file_read
         path = Path(path)
         try:
             text = path.read_text()
         except OSError as e:
-            raise OSError(
-                f"Could not read StandardizeOptions config from {path}: {e}"
-            ) from e
+            raise wrap_file_read(path, "StandardizeOptions.from_file", e) from e
         try:
             data = json.loads(text)
         except json.JSONDecodeError as e:
-            raise ValueError(
-                f"Invalid JSON in StandardizeOptions config {path}: {e}"
+            raise ConfigError(
+                "Invalid JSON in StandardizeOptions config",
+                path=path,
+                operation="StandardizeOptions.from_file",
+                cause=e,
+                suggestion=(
+                    f"JSON parser failed at line {e.lineno}, column {e.colno}. "
+                    "Validate the file with `python -m json.tool < file.json`."
+                ),
             ) from e
         return cls.from_dict(data)
 
@@ -1679,7 +1711,14 @@ def _apply_field_type(
     elif field_type == FieldType.BOOLEAN:
         new, changed = standardize_boolean(value, style=options.boolean_style)
     else:
-        raise ValueError(f"Unknown field type: {field_type}")
+        # Unreachable for well-formed input — _resolve_column_types
+        # would have rejected the bad enum at the entry point. Hitting
+        # this means an internal invariant was broken, not user error.
+        raise AssertionError(
+            f"Unhandled FieldType in dispatcher: {field_type!r}. "
+            "This indicates a code bug — a new FieldType was added to "
+            "the enum without a matching branch here."
+        )
 
     # ``changed=False`` on a non-empty cell means the standardizer either
     # accepted the input as already-canonical OR couldn't parse it. The
@@ -1760,9 +1799,14 @@ def _resolve_column_types(
             continue
         resolved[col] = ft if isinstance(ft, FieldType) else FieldType(ft)
     if missing:
-        raise ValueError(
-            f"Columns not found in input: {missing}. "
-            f"Available: {list(df_columns)}"
+        from .errors import InputValidationError
+        raise InputValidationError(
+            f"Columns referenced by column_types not found in input: {missing}",
+            operation="standardize_dataframe",
+            suggestion=(
+                f"Available columns: {list(df_columns)}. "
+                "Check for typos and for header rows that didn't get parsed."
+            ),
         )
     return resolved
 
@@ -1776,6 +1820,8 @@ def standardize_dataframe(
     Columns absent from ``options.column_types`` pass through unchanged.
     The input DataFrame is not mutated.
     """
+    from .errors import ensure_dataframe
+    ensure_dataframe(df, function="standardize_dataframe")
     options = options or StandardizeOptions()
     out = df.copy()
     column_types = _resolve_column_types(options, out.columns)