feat(errors): structured error hierarchy + helpful messages everywhere

Introduces src/core/errors.py with a small structured error hierarchy
that every public entry point now uses. Each error carries the context
a user needs to fix it and the context a maintainer needs to trace it.

The hierarchy:

  DataToolsError        (base: formats path, column, operation, suggestion)
  InputValidationError  (extends ValueError: bad arg / wrong type)
  ConfigError           (extends ValueError: bad config / options)
  FileFormatError       (extends ValueError: file is not what we expected)
  FileAccessError       (extends OSError: file I/O failure)

Subclassing the stdlib bases means existing `except OSError` /
`except ValueError` handlers still catch them, so there is no breaking
change.

Helpers:
- ensure_dataframe(value, function=...): uniform DataFrame guard
- ensure_choice(value, name=, choices=): uniform enum/literal guard
- wrap_file_read(path, op, exc): tag OSError with hint + path
- wrap_file_write(path, op, exc): same, with Windows-aware tip
- format_for_user(exc, context=): user-facing string for st.error /
  stderr

Library hardening:
- io.read_file: missing files raise FileAccessError, which reports
  whether the parent directory exists and suggests checking the path.
- io.read_file: chunk_size <= 0 now raises InputValidationError with a
  positive-integer suggestion.
- io._read_excel: openpyxl BadZipFile / InvalidFileException / pandas
  ValueError ("sheet not found") wrapped as FileFormatError listing the
  path and a "list sheets with list_sheets()" hint.
- io._detect_excel_header_row: bare except narrowed to specific
  openpyxl exceptions; falls back gracefully and logs at debug so the
  real error surfaces from pd.read_excel.
- io.write_file: OSError / PermissionError on to_csv/to_excel wrapped
  with file path and Windows-aware "file may be open in another
  program" hint.
- dedup._parse_date: bare `except Exception` narrowed to (TypeError,
  ValueError, OutOfBoundsDatetime); failed values logged at debug for
  survivor-selection forensics.
- dedup._select_survivor: KEEP_MOST_RECENT now raises
  InputValidationError instead of silently falling back to keep_first.
- dedup.deduplicate: input validation errors are InputValidationError
  with operation/column/suggestion fields.
- format_standardize.from_dict: invalid FieldType for a column raises
  ConfigError naming the column AND the bad value AND listing valid
  values; same for date_order / phone_format / etc.
- format_standardize.from_file: OSError / JSON decode wrapped with path
  AND line/column where parsing failed.
- format_standardize.to_file: TypeError on json.dumps wrapped as
  ConfigError with the suspected source (extra_abbreviations).
- format_standardize._apply_field_type: dispatcher's "unknown field
  type" branch now raises AssertionError (it's an internal invariant,
  not user error: a new enum value was added without a branch).
- format_standardize._resolve_column_types: missing-column error is now
  InputValidationError with a "check for typos / unparsed header"
  suggestion.
- format_standardize.standardize_dataframe: ensure_dataframe at entry.
- text_clean.clean_dataframe: ensure_dataframe at entry.
- config.to_strategies: invalid Algorithm/NormalizerType wrapped as
  ConfigError naming the strategy index AND the column.
- config.to_survivor_rule: invalid SurvivorRule wrapped as ConfigError
  listing valid values.
- config.from_file: OSError / JSON decode wrapped (mirror of
  StandardizeOptions.from_file).
- fixes.repair_mojibake: ImportError on ftfy now logged at info level
  with the underlying ImportError so a corrupt-package vs not-installed
  distinction is visible in the logs.
- normalizers.normalize_phone: phonenumbers.NumberParseException now
  logged at debug when the digits-only fallback drops extension /
  country-code information, which gives a trail when matching results
  look wrong.

GUI / CLI surfaces:
- All 9 page handlers (`except Exception as e: st.error(...)`) now use
  format_for_user(), which renders DataToolsError fields nicely and
  falls back to "ClassName: message" for unrecognized errors.
- 2_Text_Cleaner and 3_Format_Standardizer additionally distinguish
  UnicodeDecodeError with a "re-save as UTF-8" suggestion before the
  generic handler.
- cli.py's "Error reading file" handler now uses format_for_user() and
  includes the input path in the prefix.

Tests:
- tests/test_errors.py: 22 new tests covering base class formatting,
  stdlib inheritance, ensure_dataframe / ensure_choice helpers,
  wrap_file_read / wrap_file_write, format_for_user behavior, and
  end-to-end integration (missing file, missing dir, bad JSON, bad
  algorithm, bad enum, missing column).
- tests/test_audit_fixes.py + tests/test_io.py: updated 4 tests for the
  new exception types (InputValidationError replaces TypeError,
  FileAccessError extends OSError).

Full project suite: 1230 passed, 4 skipped, 17 xfailed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
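
The stdlib-compatible hierarchy described in the commit message can be sketched roughly as follows. This is an illustration only: src/core/errors.py is not part of this excerpt, so the constructor signature, the `_format` helper, and the message layout are assumptions inferred from the keyword arguments used in the diff (`path`, `column`, `operation`, `suggestion`, `cause`). The sample column name and field types are made up.

```python
from __future__ import annotations


class DataToolsError(Exception):
    """Sketch of a base error that stores context fields and formats them."""

    def __init__(self, message, *, path=None, column=None,
                 operation=None, suggestion=None, cause=None):
        # Store the context before building the final message.
        self.message = message
        self.path = path
        self.column = column
        self.operation = operation
        self.suggestion = suggestion
        self.cause = cause
        super().__init__(self._format())

    def _format(self) -> str:
        # Assumed layout: the message, then one indented line per set field.
        lines = [self.message]
        for label in ("path", "column", "operation", "suggestion"):
            value = getattr(self, label)
            if value is not None:
                lines.append(f"  {label}: {value}")
        return "\n".join(lines)


class ConfigError(DataToolsError, ValueError):
    """Bad config / options; still caught by `except ValueError`."""


class FileAccessError(DataToolsError, OSError):
    """File I/O failure; still caught by `except OSError`."""


# A pre-existing handler written against ValueError keeps working.
try:
    raise ConfigError(
        "Invalid field type 'emial' for column 'contact'",  # made-up values
        column="contact",
        operation="StandardizeOptions.from_dict",
        suggestion="Valid field types: ['email', 'phone']",  # hypothetical list
    )
except ValueError as exc:
    caught = exc

print(caught)
```

Mixing the project base class with a stdlib base through multiple inheritance is one way to get both a shared `DataToolsError` handler and backward compatibility with callers that only know about `ValueError` or `OSError`.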
2026-05-01 02:35:42 +00:00
parent 2eece6467d
commit 26b9771625
21 changed files with 751 additions and 104 deletions
--- a/src/core/format_standardize.py
+++ b/src/core/format_standardize.py
@@ -1527,19 +1527,26 @@ class StandardizeOptions:

    @classmethod
    def from_dict(cls, data: dict) -> StandardizeOptions:
+        from .errors import ConfigError
        known = {f for f in cls.__dataclass_fields__}
        kwargs = {k: v for k, v in data.items() if k in known}
        column_types = kwargs.get("column_types") or {}
-        try:
-            kwargs["column_types"] = {
-                c: FieldType(t) if not isinstance(t, FieldType) else t
-                for c, t in column_types.items()
-            }
-        except ValueError as e:
-            valid = ", ".join(sorted(t.value for t in FieldType))
-            raise ValueError(
-                f"Invalid field type in column_types: {e}. Valid: {valid}"
-            ) from e
+        resolved: dict[str, FieldType] = {}
+        for col, raw in column_types.items():
+            try:
+                resolved[col] = (
+                    FieldType(raw) if not isinstance(raw, FieldType) else raw
+                )
+            except ValueError as e:
+                valid = sorted(t.value for t in FieldType)
+                raise ConfigError(
+                    f"Invalid field type {raw!r} for column {col!r}",
+                    column=col,
+                    operation="StandardizeOptions.from_dict",
+                    cause=e,
+                    suggestion=f"Valid field types: {valid}",
+                ) from e
+        kwargs["column_types"] = resolved
        # Surface enum-string mismatches early — bad date_order ("xyz")
        # would otherwise crash deep inside standardize_date.
        for field_name, valid in (
@@ -1555,8 +1562,10 @@ class StandardizeOptions:
        ):
            value = kwargs.get(field_name)
            if value is not None and value not in valid:
-                raise ValueError(
-                    f"Invalid {field_name}={value!r}. Valid: {sorted(valid)}"
+                raise ConfigError(
+                    f"Invalid {field_name}={value!r}",
+                    operation="StandardizeOptions.from_dict",
+                    suggestion=f"Valid values: {sorted(valid)}",
                )
        return cls(**kwargs)

@@ -1567,24 +1576,47 @@ class StandardizeOptions:
        return d

    def to_file(self, path: str | Path) -> Path:
+        from .errors import ConfigError, wrap_file_write
        out = Path(path)
-        out.write_text(json.dumps(self.to_dict(), indent=2))
+        try:
+            payload = json.dumps(self.to_dict(), indent=2)
+        except TypeError as e:
+            raise ConfigError(
+                "Could not serialize StandardizeOptions to JSON",
+                operation="StandardizeOptions.to_file",
+                cause=e,
+                suggestion=(
+                    "extra_abbreviations or column_types likely contains a "
+                    "non-string/non-enum value. Inspect with .to_dict() and "
+                    "remove the offending entry."
+                ),
+            ) from e
+        try:
+            out.write_text(payload)
+        except (OSError, PermissionError) as e:
+            raise wrap_file_write(out, "StandardizeOptions.to_file", e) from e
        return out

    @classmethod
    def from_file(cls, path: str | Path) -> StandardizeOptions:
+        from .errors import ConfigError, wrap_file_read
        path = Path(path)
        try:
            text = path.read_text()
        except OSError as e:
-            raise OSError(
-                f"Could not read StandardizeOptions config from {path}: {e}"
-            ) from e
+            raise wrap_file_read(path, "StandardizeOptions.from_file", e) from e
        try:
            data = json.loads(text)
        except json.JSONDecodeError as e:
-            raise ValueError(
-                f"Invalid JSON in StandardizeOptions config {path}: {e}"
+            raise ConfigError(
+                "Invalid JSON in StandardizeOptions config",
+                path=path,
+                operation="StandardizeOptions.from_file",
+                cause=e,
+                suggestion=(
+                    f"JSON parser failed at line {e.lineno}, column {e.colno}. "
+                    "Validate the file with `python -m json.tool < file.json`."
+                ),
            ) from e
        return cls.from_dict(data)

@@ -1679,7 +1711,14 @@ def _apply_field_type(
    elif field_type == FieldType.BOOLEAN:
        new, changed = standardize_boolean(value, style=options.boolean_style)
    else:
-        raise ValueError(f"Unknown field type: {field_type}")
+        # Unreachable for well-formed input — _resolve_column_types
+        # would have rejected the bad enum at the entry point. Hitting
+        # this means an internal invariant was broken, not user error.
+        raise AssertionError(
+            f"Unhandled FieldType in dispatcher: {field_type!r}. "
+            "This indicates a code bug — a new FieldType was added to "
+            "the enum without a matching branch here."
+        )

    # ``changed=False`` on a non-empty cell means the standardizer either
    # accepted the input as already-canonical OR couldn't parse it. The
@@ -1760,9 +1799,14 @@ def _resolve_column_types(
            continue
        resolved[col] = ft if isinstance(ft, FieldType) else FieldType(ft)
    if missing:
-        raise ValueError(
-            f"Columns not found in input: {missing}. "
-            f"Available: {list(df_columns)}"
+        from .errors import InputValidationError
+        raise InputValidationError(
+            f"Columns referenced by column_types not found in input: {missing}",
+            operation="standardize_dataframe",
+            suggestion=(
+                f"Available columns: {list(df_columns)}. "
+                "Check for typos and for header rows that didn't get parsed."
+            ),
        )
    return resolved

@@ -1776,6 +1820,8 @@ def standardize_dataframe(
    Columns absent from ``options.column_types`` pass through unchanged.
    The input DataFrame is not mutated.
    """
+    from .errors import ensure_dataframe
+    ensure_dataframe(df, function="standardize_dataframe")
    options = options or StandardizeOptions()
    out = df.copy()
    column_types = _resolve_column_types(options, out.columns)
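
The `from_file` hunk above builds its suggestion from the `lineno` and `colno` attributes of `json.JSONDecodeError`. A small standalone sketch of that behavior, using only the standard library; the helper name `describe_json_error` and the sample config text are made up for illustration and are not part of the commit:

```python
from __future__ import annotations

import json


def describe_json_error(text: str) -> str | None:
    """Return a line/column hint if `text` is not valid JSON, else None."""
    try:
        json.loads(text)
    except json.JSONDecodeError as e:
        # lineno and colno are 1-based positions of the parse failure;
        # msg is the parser's short description without the position.
        return (
            f"JSON parser failed at line {e.lineno}, column {e.colno}: {e.msg}"
        )
    return None


# Hypothetical config with a missing value on the third line.
broken = '{\n  "date_order": "mdy",\n  "column_types": {"email": }\n}'
print(describe_json_error(broken))
print(describe_json_error('{"date_order": "mdy"}'))  # valid JSON, so None
```

Reporting the position separately from the message lets the caller put it in a dedicated suggestion field, as the diff does, instead of relying on the user to spot it inside the raw parser text.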