Introduces src/core/errors.py with a small structured error hierarchy
that every public entry point now uses. Each error carries the
context a user needs to fix it and the context a maintainer needs to
trace it.
The hierarchy:
  DataToolsError (base: formats path, column, operation, suggestion)
    InputValidationError (extends ValueError: bad arg / wrong type)
    ConfigError (extends ValueError: bad config / options)
    FileFormatError (extends ValueError: file is not what we expected)
    FileAccessError (extends OSError: file I/O failure)
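Subclassing the stdlib bases means existing `except OSError` /
`except ValueError` handlers still catch them, so those handlers need
no changes. Where a TypeError was raised before, it is now
InputValidationError (see Tests), so `except TypeError` handlers need
updating.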
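A minimal sketch of how such a hierarchy can combine a shared base with the stdlib bases. It is illustrative only: the field names follow the list above, but the message layout and constructor details are assumptions, not the actual contents of src/core/errors.py.

```python
class DataToolsError(Exception):
    """Base error: appends optional context fields to the message."""

    def __init__(self, message, *, path=None, column=None,
                 operation=None, suggestion=None, cause=None):
        super().__init__(message)
        self.message = message
        self.path = path
        self.column = column
        self.operation = operation
        self.suggestion = suggestion
        self.cause = cause

    def __str__(self):
        # Assumed layout: one "label: value" line per field that was given.
        parts = [self.message]
        if self.operation is not None:
            parts.append(f"operation: {self.operation}")
        if self.path is not None:
            parts.append(f"path: {self.path}")
        if self.column is not None:
            parts.append(f"column: {self.column!r}")
        if self.cause is not None:
            parts.append(f"cause: {type(self.cause).__name__}: {self.cause}")
        if self.suggestion is not None:
            parts.append(f"suggestion: {self.suggestion}")
        return "\n".join(parts)


class InputValidationError(DataToolsError, ValueError):
    """Bad argument or wrong type."""


class FileAccessError(DataToolsError, OSError):
    """File I/O failure."""


def demo():
    try:
        raise FileAccessError("nope", path="/tmp/x", suggestion="check the path")
    except OSError as exc:  # a pre-existing stdlib handler still matches
        return str(exc)
```

Because each subclass lists the project base first and the stdlib base second, both `except DataToolsError` and the older stdlib handlers match the same exception object.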
Helpers:
- ensure_dataframe(value, function=...) — uniform DataFrame guard
- ensure_choice(value, name=, choices=) — uniform enum/literal guard
- wrap_file_read(path, op, exc) — tag OSError with hint + path
- wrap_file_write(path, op, exc) — same, with Windows-aware tip
- format_for_user(exc, context=) — user-facing string for st.error / stderr
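As a rough illustration of the guard style, an ensure_choice-like helper could look like the sketch below. ValueError stands in for InputValidationError, and both the exact message wording and the return value are assumptions rather than the real helper's behavior.

```python
def ensure_choice(value, *, name, choices):
    """Return value unchanged if it is an allowed choice, else raise.

    ValueError is a stand-in for the project's InputValidationError.
    """
    allowed = list(choices)
    if value not in allowed:
        # List every valid value so the caller can fix the call directly.
        listed = ", ".join(repr(c) for c in allowed)
        raise ValueError(f"Invalid {name}: {value!r}. Valid values: {listed}")
    return value
```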
Library hardening:
- io.read_file: missing files raise FileAccessError stating whether
the parent directory exists, with a suggestion to check the path.
- io.read_file: chunk_size <= 0 now raises InputValidationError with
a positive-integer suggestion.
- io._read_excel: openpyxl BadZipFile / InvalidFileException / pandas
ValueError ("sheet not found") wrapped as FileFormatError listing
the path and a "list sheets with list_sheets()" hint.
- io._detect_excel_header_row: bare except narrowed to specific
openpyxl exceptions; falls back gracefully and logs at debug so
the real error surfaces from pd.read_excel.
- io.write_file: OSError / PermissionError on to_csv/to_excel wrapped
with file path and Windows-aware "file may be open in another
program" hint.
- dedup._parse_date: bare `except Exception` narrowed to
(TypeError, ValueError, OutOfBoundsDatetime); failed values
logged at debug for survivor-selection forensics.
- dedup._select_survivor: KEEP_MOST_RECENT now raises
InputValidationError instead of silently falling back to keep_first.
- dedup.deduplicate: input validation errors are InputValidationError
with operation/column/suggestion fields.
- format_standardize.from_dict: invalid FieldType for a column raises
ConfigError naming the column AND the bad value AND listing valid
values; same for date_order / phone_format / etc.
- format_standardize.from_file: OSError / JSON decode wrapped with
path AND line/column where parsing failed.
- format_standardize.to_file: TypeError on json.dumps wrapped as
ConfigError with the suspected source (extra_abbreviations).
- format_standardize._apply_field_type: dispatcher's "unknown field
type" branch now raises AssertionError (it's an internal invariant,
not user error — a new enum value was added without a branch).
- format_standardize._resolve_column_types: missing-column error now
InputValidationError with a "check for typos / unparsed header"
suggestion.
- format_standardize.standardize_dataframe: ensure_dataframe at entry.
- text_clean.clean_dataframe: ensure_dataframe at entry.
- config.to_strategies: invalid Algorithm/NormalizerType wrapped as
ConfigError naming the strategy index AND the column.
- config.to_survivor_rule: invalid SurvivorRule wrapped as ConfigError
listing valid values.
- config.from_file: OSError / JSON decode wrapped (mirror of
StandardizeOptions.from_file).
- fixes.repair_mojibake: ImportError on ftfy now logged at info level
with the underlying ImportError so a corrupt-package vs not-installed
distinction is visible in the logs.
- normalizers.normalize_phone: phonenumbers.NumberParseException now
logged at debug when the digits-only fallback drops extension /
country-code information — gives a trail when matching results
look wrong.
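The from_file wrapping described above (path plus line/column on a JSON failure) can be sketched with the stdlib alone. The function name load_json_config is hypothetical, and plain OSError / ValueError stand in for FileAccessError / ConfigError.

```python
import json
from pathlib import Path


def load_json_config(path):
    """Read a JSON config, re-raising failures with the path attached."""
    path = Path(path)
    try:
        text = path.read_text(encoding="utf-8")
    except OSError as exc:
        # Stand-in for FileAccessError: keep the original error as the cause.
        raise OSError(f"Could not read config file {path}: {exc}") from exc
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # JSONDecodeError exposes lineno, colno, and msg for the failure point.
        raise ValueError(
            f"Invalid JSON in {path} at line {exc.lineno}, "
            f"column {exc.colno}: {exc.msg}"
        ) from exc
```

Chaining with `from exc` keeps the underlying exception available to maintainers while the message itself carries what the user needs.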
GUI / CLI surfaces:
- All 9 page handlers (`except Exception as e: st.error(...)`) now
use format_for_user(), which renders DataToolsError fields nicely
and falls back to "ClassName: message" for unrecognized errors.
- 2_Text_Cleaner and 3_Format_Standardizer additionally distinguish
UnicodeDecodeError with a "re-save as UTF-8" suggestion before
the generic handler.
- cli.py's "Error reading file" handler now uses format_for_user()
and includes the input path in the prefix.
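The fallback behavior of format_for_user() can be pictured roughly as follows. StructuredError is a hypothetical stand-in for DataToolsError, and the separator and suggestion layout are assumptions.

```python
class StructuredError(Exception):
    """Hypothetical stand-in for DataToolsError: message plus suggestion."""

    def __init__(self, message, *, suggestion=None):
        super().__init__(message)
        self.suggestion = suggestion

    def __str__(self):
        base = self.args[0]
        if self.suggestion:
            return f"{base} (suggestion: {self.suggestion})"
        return base


def format_for_user(exc, context=None):
    """Build a one-line, user-facing string for st.error or stderr."""
    if isinstance(exc, StructuredError):
        body = str(exc)  # structured errors already format their own fields
    else:
        body = f"{type(exc).__name__}: {exc}"  # "ClassName: message" fallback
    return f"{context}: {body}" if context else body
```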
Tests:
- tests/test_errors.py — 22 new tests covering: base class formatting,
stdlib inheritance, ensure_dataframe / ensure_choice helpers,
wrap_file_read / wrap_file_write, format_for_user behavior, and
end-to-end integration (missing file, missing dir, bad JSON, bad
algorithm, bad enum, missing column).
- tests/test_audit_fixes.py + tests/test_io.py — updated 4 tests for
the new exception types (InputValidationError replaces TypeError,
FileAccessError extends OSError).
Full project suite: 1230 passed, 4 skipped, 17 xfailed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
231 lines
8.2 KiB
Python
"""Tests for the structured error-handling infrastructure.

Covers:
- DataToolsError base class formatting (path, column, operation, suggestion).
- Specialized subclasses inherit from the right stdlib bases so existing
  ``except OSError`` / ``except ValueError`` handlers still catch them.
- ensure_dataframe / ensure_choice raise the right structured errors.
- format_for_user produces readable output for both DataTools and
  unrecognized exceptions.
- Per-module integration: bad config / bad file / bad input each
  surface a helpful error rather than a deep library traceback.
"""

from __future__ import annotations

import json
from pathlib import Path

import pandas as pd
import pytest

from src.core.errors import (
    ConfigError,
    DataToolsError,
    FileAccessError,
    FileFormatError,
    InputValidationError,
    ensure_choice,
    ensure_dataframe,
    format_for_user,
    wrap_file_read,
    wrap_file_write,
)


# ---------------------------------------------------------------------------
# Base class
# ---------------------------------------------------------------------------

class TestDataToolsError:
    def test_message_only(self):
        err = DataToolsError("something failed")
        assert "something failed" in str(err)

    def test_full_context(self):
        err = DataToolsError(
            "could not parse",
            path="/tmp/foo.csv",
            column="email",
            operation="read_file",
            suggestion="check encoding",
            cause=ValueError("inner"),
        )
        text = str(err)
        assert "could not parse" in text
        assert "read_file" in text
        assert "/tmp/foo.csv" in text
        assert "'email'" in text
        assert "ValueError" in text
        assert "check encoding" in text

    def test_inheritance_for_oserror_handlers(self):
        # FileAccessError must be catchable as OSError so callers using
        # the stdlib hierarchy continue to work.
        with pytest.raises(OSError):
            raise FileAccessError("nope", path="/tmp/x")

    def test_inheritance_for_valueerror_handlers(self):
        for cls in (InputValidationError, ConfigError, FileFormatError):
            with pytest.raises(ValueError):
                raise cls("nope")


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

class TestEnsureDataframe:
    def test_passes_real_df(self):
        ensure_dataframe(pd.DataFrame({"a": [1]}), function="x")

    def test_rejects_dict(self):
        with pytest.raises(InputValidationError, match="DataFrame"):
            ensure_dataframe({"a": 1}, function="my_func")

    def test_includes_function_name(self):
        with pytest.raises(InputValidationError, match="my_func"):
            ensure_dataframe(None, function="my_func")

    def test_includes_actual_type(self):
        # pytest.raises fails the test when nothing is raised; a bare
        # try/except would pass silently in that case.
        with pytest.raises(InputValidationError, match="list"):
            ensure_dataframe([1, 2, 3], function="x")


class TestEnsureChoice:
    def test_passes_valid(self):
        ensure_choice("a", name="mode", choices=["a", "b"])

    def test_rejects_invalid(self):
        with pytest.raises(InputValidationError, match="Invalid mode"):
            ensure_choice("c", name="mode", choices=["a", "b"])

    def test_lists_choices_in_message(self):
        # pytest.raises ensures the test fails if no error is raised.
        with pytest.raises(InputValidationError) as exc_info:
            ensure_choice("c", name="mode", choices=["a", "b"])
        msg = str(exc_info.value)
        assert "'a'" in msg and "'b'" in msg


class TestWrapFileHelpers:
    def test_wrap_read_keeps_cause(self):
        inner = OSError("disk error")
        wrapped = wrap_file_read("/tmp/x", "read_file", inner)
        assert wrapped.cause is inner
        assert "/tmp/x" in str(wrapped)

    def test_wrap_write_permission_hint(self):
        inner = PermissionError("no perm")
        wrapped = wrap_file_write("/tmp/x", "save", inner)
        # Permission failures get a Windows-aware suggestion
        assert "Windows" in str(wrapped) or "permission" in str(wrapped).lower()


# ---------------------------------------------------------------------------
# format_for_user
# ---------------------------------------------------------------------------

class TestFormatForUser:
    def test_datatools_error(self):
        err = InputValidationError(
            "bad date_order", suggestion="use MDY or DMY",
        )
        out = format_for_user(err)
        assert "bad date_order" in out
        assert "use MDY or DMY" in out

    def test_with_context_prefix(self):
        err = ValueError("inner")
        out = format_for_user(err, context="Failed to read upload")
        assert out.startswith("Failed to read upload")
        assert "ValueError" in out

    def test_unrecognized_exception(self):
        err = RuntimeError("oops")
        out = format_for_user(err)
        assert "RuntimeError" in out
        assert "oops" in out


# ---------------------------------------------------------------------------
# Integration: every public entry point surfaces structured errors
# ---------------------------------------------------------------------------

class TestIntegration:
    def test_io_read_missing_file_is_structured(self, tmp_path):
        from src.core.io import read_file
        with pytest.raises(FileAccessError) as exc_info:
            read_file(tmp_path / "missing.csv")
        msg = str(exc_info.value)
        assert "Input file not found" in msg
        assert str(tmp_path) in msg
        assert "exists" in msg or "does NOT exist" in msg

    def test_io_write_to_missing_dir(self, tmp_path):
        from src.core.io import write_file
        # Writing into a non-existent directory raises a wrapped
        # FileAccessError rather than a raw FileNotFoundError, so the
        # user sees the path and a recovery hint.
        df = pd.DataFrame({"a": [1]})
        with pytest.raises(FileAccessError) as exc_info:
            write_file(df, tmp_path / "no_such_dir" / "out.csv")
        msg = str(exc_info.value)
        assert "Could not write" in msg
        assert "no_such_dir" in msg

    def test_config_bad_json(self, tmp_path):
        from src.core.config import DeduplicationConfig
        path = tmp_path / "bad.json"
        path.write_text("{not json")
        with pytest.raises(ConfigError) as exc_info:
            DeduplicationConfig.from_file(path)
        assert "Invalid JSON" in str(exc_info.value)
        assert "line" in str(exc_info.value)

    def test_config_bad_algorithm_includes_strategy_index(self, tmp_path):
        from src.core.config import DeduplicationConfig
        path = tmp_path / "cfg.json"
        path.write_text(json.dumps({
            "strategies": [{
                "columns": [{
                    "column": "name",
                    "algorithm": "not_a_real_algo",
                    "threshold": 90.0,
                }],
            }],
        }))
        loaded = DeduplicationConfig.from_file(path)
        with pytest.raises(ConfigError) as exc_info:
            loaded.to_strategies()
        msg = str(exc_info.value)
        assert "not_a_real_algo" in msg
        assert "name" in msg  # column name
        assert "strategy[0]" in msg  # strategy index

    def test_standardize_options_bad_field_type_includes_column(self):
        from src.core.format_standardize import StandardizeOptions
        with pytest.raises(ConfigError) as exc_info:
            StandardizeOptions.from_dict({
                "column_types": {"my_col": "made_up"},
            })
        msg = str(exc_info.value)
        assert "my_col" in msg
        assert "made_up" in msg

    def test_standardize_dataframe_unknown_column(self):
        from src.core.format_standardize import (
            FieldType, StandardizeOptions, standardize_dataframe,
        )
        df = pd.DataFrame({"name": ["a"]})
        opts = StandardizeOptions(column_types={"missing": FieldType.DATE})
        with pytest.raises(InputValidationError) as exc_info:
            standardize_dataframe(df, opts)
        assert "missing" in str(exc_info.value)
        assert "['name']" in str(exc_info.value)