feat(errors): structured error hierarchy + helpful messages everywhere

Introduces src/core/errors.py with a small structured error hierarchy
that every public entry point now uses. Each error carries the context
a user needs to fix it and the context a maintainer needs to trace it.

The hierarchy:

  DataToolsError        (base: formats path, column, operation, suggestion)
  InputValidationError  (extends ValueError: bad arg / wrong type)
  ConfigError           (extends ValueError: bad config / options)
  FileFormatError       (extends ValueError: file is not what we expected)
  FileAccessError       (extends OSError: file I/O failure)

Subclassing the stdlib bases means existing `except OSError` /
`except ValueError` handlers still catch them, so this is not a
breaking change for those handlers.

Helpers:

- ensure_dataframe(value, function=...): uniform DataFrame guard
- ensure_choice(value, name=, choices=): uniform enum/literal guard
- wrap_file_read(path, op, exc): tag OSError with hint + path
- wrap_file_write(path, op, exc): same, with Windows-aware tip
- format_for_user(exc, context=): user-facing string for st.error /
  stderr

Library hardening:

- io.read_file: missing files surface FileAccessError stating whether
  the parent directory exists, with a suggestion to check the path.
- io.read_file: chunk_size <= 0 now raises InputValidationError with a
  positive-integer suggestion.
- io._read_excel: pandas ValueError ("sheet not found") wrapped as
  FileFormatError with the path and a "list sheets with list_sheets()"
  hint; BadZipFile / InvalidFileException and other parse failures
  wrapped as FileFormatError with the path and a "confirm the file is a
  valid .xlsx workbook" hint.
- io._detect_excel_header_row: bare except narrowed to
  (InvalidFileException, KeyError, IndexError, OSError); falls back
  gracefully and logs at debug so the real error surfaces from
  pd.read_excel.
- io.write_file: OSError / PermissionError on to_csv/to_excel wrapped
  with file path and Windows-aware "file may be open in another
  program" hint.
- dedup._parse_date: bare `except Exception` narrowed to (TypeError,
  ValueError, OutOfBoundsDatetime); failed values logged at debug for
  survivor-selection forensics.
- dedup._select_survivor: KEEP_MOST_RECENT now raises
  InputValidationError instead of silently falling back to keep_first.
- dedup.deduplicate: input validation errors are InputValidationError
  with operation/column/suggestion fields.
- format_standardize.from_dict: invalid FieldType for a column raises
  ConfigError naming the column AND the bad value AND listing valid
  values; same for date_order / phone_format / etc.
- format_standardize.from_file: OSError / JSON decode wrapped with path
  AND line/column where parsing failed.
- format_standardize.to_file: TypeError on json.dumps wrapped as
  ConfigError with the suspected source (extra_abbreviations).
- format_standardize._apply_field_type: dispatcher's "unknown field
  type" branch now raises AssertionError (it's an internal invariant,
  not user error: it means a new enum value was added without a
  branch).
- format_standardize._resolve_column_types: missing-column error is now
  InputValidationError with a "check for typos / unparsed header"
  suggestion.
- format_standardize.standardize_dataframe: ensure_dataframe at entry.
- text_clean.clean_dataframe: ensure_dataframe at entry.
- config.to_strategies: invalid Algorithm/NormalizerType wrapped as
  ConfigError naming the strategy index AND the column.
- config.to_survivor_rule: invalid SurvivorRule wrapped as ConfigError
  listing valid values.
- config.from_file: OSError / JSON decode wrapped (mirror of
  StandardizeOptions.from_file).
- fixes.repair_mojibake: ImportError on ftfy now logged at info level
  with the underlying ImportError so a corrupt-package vs not-installed
  distinction is visible in the logs.
- normalizers.normalize_phone: phonenumbers.NumberParseException now
  logged at debug when the digits-only fallback drops extension /
  country-code information, which gives a trail when matching results
  look wrong.

GUI / CLI surfaces:

- All 9 page handlers (`except Exception as e: st.error(...)`) now use
  format_for_user(), which renders DataToolsError fields nicely and
  falls back to "ClassName: message" for unrecognized errors.
- 2_Text_Cleaner and 3_Format_Standardizer additionally distinguish
  UnicodeDecodeError with a "re-save as UTF-8" suggestion before the
  generic handler.
- cli.py's "Error reading file" handler now uses format_for_user() and
  includes the input path in the prefix.

Tests:

- tests/test_errors.py: 22 new tests covering base class formatting,
  stdlib inheritance, ensure_dataframe / ensure_choice helpers,
  wrap_file_read / wrap_file_write, format_for_user behavior, and
  end-to-end integration (missing file, missing dir, bad JSON, bad
  algorithm, bad enum, missing column).
- tests/test_audit_fixes.py + tests/test_io.py: updated 4 tests for the
  new exception types (InputValidationError replaces TypeError,
  FileAccessError extends OSError).

Full project suite: 1230 passed, 4 skipped, 17 xfailed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
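
The hierarchy and format_for_user() described in the message can be sketched roughly as follows. This is an illustrative sketch, not the contents of src/core/errors.py: the constructor keywords (path, column, operation, suggestion, cause) mirror the call sites in the diff, but the multiple-inheritance layout, the field formatting, and the exact output strings are assumptions.

```python
class DataToolsError(Exception):
    """Base error that stores context fields and renders them in str()."""

    def __init__(self, message, *, path=None, column=None,
                 operation=None, suggestion=None, cause=None):
        # Only the message is passed upward so the stdlib base
        # (ValueError or OSError) sees a single positional argument.
        super().__init__(message)
        self.message = message
        self.path = path
        self.column = column
        self.operation = operation
        self.suggestion = suggestion
        self.cause = cause

    def __str__(self):
        # Assumed layout: message first, then one line per populated field.
        parts = [self.message]
        if self.path is not None:
            parts.append(f"  path: {self.path}")
        if self.column is not None:
            parts.append(f"  column: {self.column}")
        if self.operation is not None:
            parts.append(f"  operation: {self.operation}")
        if self.cause is not None:
            parts.append(f"  cause: {type(self.cause).__name__}: {self.cause}")
        if self.suggestion is not None:
            parts.append(f"  suggestion: {self.suggestion}")
        return "\n".join(parts)


# Each subclass also inherits from a stdlib base, so pre-existing
# `except ValueError` / `except OSError` handlers keep catching it.
class InputValidationError(DataToolsError, ValueError):
    pass


class ConfigError(DataToolsError, ValueError):
    pass


class FileFormatError(DataToolsError, ValueError):
    pass


class FileAccessError(DataToolsError, OSError):
    pass


def format_for_user(exc, context=""):
    """Render an exception for st.error / stderr (assumed behavior)."""
    prefix = f"{context}: " if context else ""
    if isinstance(exc, DataToolsError):
        return prefix + str(exc)
    # Fallback for unrecognized errors: "ClassName: message".
    return f"{prefix}{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    try:
        raise FileAccessError(
            "Input file not found",
            path="missing.csv",
            operation="read_file",
            suggestion="Check the path is correct.",
        )
    except OSError as exc:  # a legacy handler still catches the new type
        print(format_for_user(exc, context="Error reading file"))
```

Mixing in the stdlib base rather than replacing it is what keeps broad legacy handlers working; handlers that caught a narrower type than the new base would still need updating.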
2026-05-01 02:35:42 +00:00
parent 2eece6467d
commit 26b9771625
21 changed files with 751 additions and 104 deletions
--- a/src/core/io.py
+++ b/src/core/io.py
@@ -182,14 +182,25 @@ def read_file(

    Returns a DataFrame (or generator when *chunk_size* is set).
    """
+    from .errors import FileAccessError, InputValidationError
    filepath = Path(path)
    if not filepath.exists():
-        raise FileNotFoundError(
-            f"Input file not found: {filepath} "
-            f"(required for encoding/delimiter detection and reading)"
+        raise FileAccessError(
+            "Input file not found",
+            path=filepath,
+            operation="read_file",
+            suggestion=(
+                f"Check the path is correct. Parent directory "
+                f"{filepath.parent} "
+                f"{'exists' if filepath.parent.exists() else 'does NOT exist'}."
+            ),
        )
    if chunk_size is not None and chunk_size <= 0:
-        raise ValueError(f"chunk_size must be positive; got {chunk_size}")
+        raise InputValidationError(
+            f"chunk_size must be positive; got {chunk_size}",
+            operation="read_file",
+            suggestion="Pass a positive integer (e.g., chunk_size=10000) or omit for non-streaming reads.",
+        )

    suffix = filepath.suffix.lower()
    logger.info(
@@ -288,14 +299,42 @@ def _read_excel(
        else _detect_excel_header_row(path, sheet_name)
    )
    logger.debug("Reading Excel {} (sheet={}, header_row={})", path.name, sheet_name, hdr)
-    return pd.read_excel(
-        path,
-        sheet_name=sheet_name,
-        header=hdr,
-        dtype=str,
-        keep_default_na=False,
-        engine="openpyxl",
-    )
+    try:
+        return pd.read_excel(
+            path,
+            sheet_name=sheet_name,
+            header=hdr,
+            dtype=str,
+            keep_default_na=False,
+            engine="openpyxl",
+        )
+    except ValueError as e:
+        # pandas raises ValueError for "Worksheet named 'X' not found".
+        from .errors import FileFormatError
+        raise FileFormatError(
+            "Could not read Excel sheet",
+            path=path,
+            operation=f"open sheet {sheet_name!r}",
+            cause=e,
+            suggestion=(
+                "Check the sheet name exists. List available sheets with "
+                "`from src.core.io import list_sheets; list_sheets(path)`."
+            ),
+        ) from e
+    except Exception as e:
+        # openpyxl can raise BadZipFile, InvalidFileException for
+        # corrupt / non-xlsx inputs. Wrap with file context.
+        from .errors import FileFormatError
+        raise FileFormatError(
+            "Excel file could not be parsed",
+            path=path,
+            operation="pd.read_excel",
+            cause=e,
+            suggestion=(
+                "Confirm the file is a valid .xlsx workbook and not "
+                "renamed/corrupted. Try opening it in Excel to verify."
+            ),
+        ) from e


 def _detect_excel_header_row(
@@ -308,18 +347,20 @@ def _detect_excel_header_row(
    Scans the first *max_scan* rows of *sheet_name* in read-only mode
    (so a 100 MB workbook doesn't get fully materialized) and returns
    the index of the first row where every non-empty cell looks like a
-    column header. Falls back to 0.
+    column header. Falls back to 0 on parse failure (logged at debug —
+    the caller's ``pd.read_excel`` will raise a useful FileFormatError
+    with full context).
    """
    try:
        from openpyxl import load_workbook
-    except ImportError:
+        from openpyxl.utils.exceptions import InvalidFileException
+    except ImportError as e:
+        logger.debug("openpyxl unavailable for header detection: {}", e)
        return 0

+    wb = None
    try:
        wb = load_workbook(path, read_only=True, data_only=True)
-    except Exception:
-        return 0
-    try:
        if isinstance(sheet_name, int):
            names = wb.sheetnames
            target = names[sheet_name] if 0 <= sheet_name < len(names) else names[0]
@@ -340,8 +381,18 @@ def _detect_excel_header_row(
            ):
                return idx
        return 0
+    except (InvalidFileException, KeyError, IndexError, OSError) as e:
+        # Corrupt workbook, missing sheet name, or read failure — fall
+        # back to row 0 and let pd.read_excel raise the user-facing error
+        # with full context.
+        logger.debug(
+            "Excel header detection failed for {} (sheet={}): {}",
+            path, sheet_name, e,
+        )
+        return 0
    finally:
-        wb.close()
+        if wb is not None:
+            wb.close()


 # ---------------------------------------------------------------------------
@@ -371,20 +422,22 @@ def write_file(

    Returns the resolved output Path.
    """
-    if not isinstance(df, pd.DataFrame):
-        raise TypeError(
-            f"write_file() requires a pandas DataFrame; got {type(df).__name__}"
-        )
+    from .errors import ensure_dataframe, wrap_file_write
+    ensure_dataframe(df, function="write_file")
+
    out = Path(path)
    fmt = file_format or out.suffix.lstrip(".").lower()
-    if fmt in ("xlsx", "xls"):
-        df.to_excel(out, index=False, engine="openpyxl")
-    else:
-        sep = delimiter if delimiter is not None else (
-            "\t" if fmt == "tsv" else ","
-        )
-        df.to_csv(out, index=False, encoding=encoding, sep=sep)
-    logger.info("Wrote {} rows to {}", len(df), out)
+    try:
+        if fmt in ("xlsx", "xls"):
+            df.to_excel(out, index=False, engine="openpyxl")
+        else:
+            sep = delimiter if delimiter is not None else (
+                "\t" if fmt == "tsv" else ","
+            )
+            df.to_csv(out, index=False, encoding=encoding, sep=sep)
+    except OSError as e:  # PermissionError is a subclass of OSError
+        raise wrap_file_write(out, f"write_file (format={fmt})", e) from e
+    logger.info("Wrote {} rows × {} cols to {}", len(df), len(df.columns), out)
    return out
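
The write_file hunk relies on wrap_file_write() returning an exception that the caller raises with `from e`. A minimal sketch of that pattern follows, under stated assumptions: the stand-in FileAccessError class, the hint wording, the `os.name == "nt"` check, and the hypothetical `write_text` caller are all illustrative and not taken from src/core/errors.py.

```python
import os
import tempfile
from pathlib import Path


class FileAccessError(OSError):
    """Minimal stand-in for the FileAccessError named in the commit."""

    def __init__(self, message, *, path=None, operation=None, suggestion=None):
        super().__init__(message)
        self.message = message
        self.path = path
        self.operation = operation
        self.suggestion = suggestion

    def __str__(self):
        return (
            f"{self.message} (path={self.path}, "
            f"operation={self.operation}). {self.suggestion}"
        )


def wrap_file_write(path, op, exc):
    """Build (not raise) a FileAccessError describing a failed write."""
    if isinstance(exc, PermissionError) and os.name == "nt":
        # Assumed Windows-aware tip: a locked file is the usual culprit.
        hint = "The file may be open in another program. Close it and retry."
    elif isinstance(exc, PermissionError):
        hint = "Check write permissions on the target directory."
    else:
        hint = "Check that the output directory exists and the disk has space."
    return FileAccessError(
        f"Could not write file: {type(exc).__name__}: {exc}",
        path=path,
        operation=op,
        suggestion=hint,
    )


def write_text(path, text):
    """Hypothetical caller mirroring the try/except shape in write_file."""
    out = Path(path)
    try:
        out.write_text(text, encoding="utf-8")
    except OSError as e:
        # `from e` keeps the original OSError reachable via __cause__.
        raise wrap_file_write(out, "write_text", e) from e
    return out


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        try:
            write_text(Path(tmp) / "no_such_dir" / "out.txt", "hello")
        except OSError as exc:
            print(exc)
```

Returning the exception instead of raising it inside the helper keeps the `raise ... from e` at the call site, so the traceback points at the function that actually attempted the write.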