feat(errors): structured error hierarchy + helpful messages everywhere

Introduces src/core/errors.py with a small structured error hierarchy
that every public entry point now uses. Each error carries the
context a user needs to fix it and the context a maintainer needs to
trace it.

The hierarchy:
  DataToolsError (base; formats path, column, operation, suggestion)
    InputValidationError (extends ValueError; bad arg / wrong type)
    ConfigError (extends ValueError; bad config / options)
    FileFormatError (extends ValueError; file is not what we expected)
    FileAccessError (extends OSError; file I/O failure)

Subclassing the stdlib bases means existing `except OSError` /
`except ValueError` handlers still catch them, so those handlers need
no changes. Handlers that catch TypeError or FileNotFoundError
specifically may need updating: InputValidationError replaces
TypeError in some checks, and read_file now raises FileAccessError
for a missing file.
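
A minimal sketch of how such a hierarchy could be laid out, for orientation only. The real src/core/errors.py is not part of this diff, so the constructor signature, the attribute names, and the message layout below are assumptions; only the class names, their stdlib bases, and the keyword arguments path/column/operation/suggestion/cause come from this commit.

```python
from pathlib import Path
from typing import Optional, Union


class DataToolsError(Exception):
    """Base error that folds optional context fields into the message."""

    def __init__(
        self,
        message: str,
        *,
        path: Optional[Union[str, Path]] = None,
        column: Optional[str] = None,
        operation: Optional[str] = None,
        suggestion: Optional[str] = None,
        cause: Optional[BaseException] = None,
    ) -> None:
        self.message = message
        self.path = path
        self.column = column
        self.operation = operation
        self.suggestion = suggestion
        self.cause = cause
        super().__init__(self._format())

    def _format(self) -> str:
        # Only fields that were actually supplied appear in the output.
        lines = [self.message]
        for label, value in (
            ("path", self.path),
            ("column", self.column),
            ("operation", self.operation),
            ("suggestion", self.suggestion),
        ):
            if value is not None:
                lines.append(f"  {label}: {value}")
        return "\n".join(lines)


class InputValidationError(DataToolsError, ValueError):
    """Bad argument or wrong type."""


class ConfigError(DataToolsError, ValueError):
    """Bad config or options."""


class FileFormatError(DataToolsError, ValueError):
    """File is not what we expected."""


class FileAccessError(DataToolsError, OSError):
    """File I/O failure."""
```

With classes shaped like this, `except OSError` catches FileAccessError and `except ValueError` catches the other three, while `except FileNotFoundError` does not catch FileAccessError.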

Helpers:
- ensure_dataframe(value, function=...): uniform DataFrame guard
- ensure_choice(value, name=, choices=): uniform enum/literal guard
- wrap_file_read(path, op, exc): tag OSError with hint + path
- wrap_file_write(path, op, exc): same, with Windows-aware tip
- format_for_user(exc, context=): user-facing string for st.error / stderr
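
To make two of the helpers above concrete, here is a hedged sketch of ensure_choice and format_for_user. The parameter names follow the list above; the return values, message wording, and the stand-in error classes are assumptions made so the example runs on its own.

```python
from typing import Iterable, Optional


class DataToolsError(Exception):
    """Stand-in base class; the real one carries more context fields."""

    def __init__(self, message: str, *, suggestion: Optional[str] = None) -> None:
        super().__init__(message)
        self.message = message
        self.suggestion = suggestion


class InputValidationError(DataToolsError, ValueError):
    """Stand-in for the commit's InputValidationError."""


def ensure_choice(value: object, *, name: str, choices: Iterable[object]) -> object:
    """Return value unchanged if it is one of choices, otherwise raise."""
    allowed = list(choices)
    if value not in allowed:
        raise InputValidationError(
            f"Invalid {name}: {value!r}",
            suggestion="Use one of: " + ", ".join(repr(c) for c in allowed),
        )
    return value


def format_for_user(exc: BaseException, context: Optional[str] = None) -> str:
    """Build a single string suitable for st.error or stderr."""
    if isinstance(exc, DataToolsError):
        text = exc.message
        if exc.suggestion:
            text += f"\nSuggestion: {exc.suggestion}"
    else:
        # Fallback named in this commit: "ClassName: message".
        text = f"{type(exc).__name__}: {exc}"
    return f"{context}: {text}" if context else text
```

A caller would typically validate at the top of a public function and let a GUI or CLI handler pass any caught exception through format_for_user.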

Library hardening:
- io.read_file: a missing file now raises FileAccessError that states
  whether the parent directory exists and suggests checking the path.
- io.read_file: chunk_size <= 0 now raises InputValidationError with
  a positive-integer suggestion.
- io._read_excel: pandas ValueError ("sheet not found") is wrapped as
  FileFormatError with the path and a "list sheets with list_sheets()"
  hint; BadZipFile / InvalidFileException and other parse failures are
  wrapped as FileFormatError with the path and a "confirm the file is
  a valid .xlsx workbook" hint.
- io._detect_excel_header_row: broad `except Exception` narrowed to
  (InvalidFileException, KeyError, IndexError, OSError); falls back to
  row 0 and logs at debug so the real error surfaces from
  pd.read_excel.
- io.write_file: OSError / PermissionError on to_csv/to_excel wrapped
  with file path and Windows-aware "file may be open in another
  program" hint.
- dedup._parse_date: broad `except Exception` narrowed to
  (TypeError, ValueError, OutOfBoundsDatetime); failed values
  logged at debug for survivor-selection forensics.
- dedup._select_survivor: KEEP_MOST_RECENT now raises
  InputValidationError instead of silently falling back to keep_first.
- dedup.deduplicate: input validation errors are InputValidationError
  with operation/column/suggestion fields.
- format_standardize.from_dict: an invalid FieldType for a column
  raises ConfigError naming the column and the bad value and listing
  the valid values; same for date_order / phone_format / etc.
- format_standardize.from_file: OSError / JSON decode errors wrapped
  with the path and the line/column where parsing failed.
- format_standardize.to_file: TypeError from json.dumps wrapped as
  ConfigError with the suspected source (extra_abbreviations).
- format_standardize._apply_field_type: dispatcher's "unknown field
  type" branch now raises AssertionError (it's an internal invariant,
  not user error: a new enum value was added without a branch).
- format_standardize._resolve_column_types: missing-column error is
  now InputValidationError with a "check for typos / unparsed header"
  suggestion.
- format_standardize.standardize_dataframe: ensure_dataframe at entry.
- text_clean.clean_dataframe: ensure_dataframe at entry.
- config.to_strategies: invalid Algorithm/NormalizerType wrapped as
  ConfigError naming the strategy index and the column.
- config.to_survivor_rule: invalid SurvivorRule wrapped as ConfigError
  listing valid values.
- config.from_file: OSError / JSON decode errors wrapped (mirror of
  StandardizeOptions.from_file).
- fixes.repair_mojibake: ImportError on ftfy now logged at info level
  with the underlying ImportError so a corrupt-package vs not-installed
  distinction is visible in the logs.
- normalizers.normalize_phone: phonenumbers.NumberParseException now
  logged at debug when the digits-only fallback drops extension /
  country-code information, which leaves a trail when matching results
  look wrong.
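
The from_file items above wrap OSError and JSON decode failures with the path and the parse position. A minimal sketch of that pattern follows, assuming stand-in error classes and a hypothetical function name (load_json_options); the real from_file methods are not shown in this diff and may differ.

```python
import json
from pathlib import Path
from typing import Any, Union


class FileAccessError(OSError):
    """Stand-in for the commit's FileAccessError."""


class ConfigError(ValueError):
    """Stand-in for the commit's ConfigError."""


def load_json_options(path: Union[str, Path]) -> Any:
    """Read a JSON options file, attaching path and position to failures."""
    p = Path(path)
    try:
        text = p.read_text(encoding="utf-8")
    except OSError as e:
        raise FileAccessError(f"Could not read options file {p}: {e}") from e
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        # JSONDecodeError exposes msg, lineno and colno (both 1-based).
        raise ConfigError(
            f"Invalid JSON in {p} at line {e.lineno}, column {e.colno}: {e.msg}"
        ) from e
```

Chaining with `from e` keeps the original traceback available to maintainers while the message itself stays readable for users.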

GUI / CLI surfaces:
- All 9 page handlers (`except Exception as e: st.error(...)`) now
  use format_for_user(), which renders the DataToolsError fields
  (path, column, operation, suggestion) and falls back to
  "ClassName: message" for unrecognized errors.
- 2_Text_Cleaner and 3_Format_Standardizer additionally handle
  UnicodeDecodeError with a "re-save as UTF-8" suggestion before
  the generic handler.
- cli.py's "Error reading file" handler now uses format_for_user()
  and includes the input path in the prefix.

Tests:
- tests/test_errors.py: 22 new tests covering base class formatting,
  stdlib inheritance, ensure_dataframe / ensure_choice helpers,
  wrap_file_read / wrap_file_write, format_for_user behavior, and
  end-to-end integration (missing file, missing dir, bad JSON, bad
  algorithm, bad enum, missing column).
- tests/test_audit_fixes.py + tests/test_io.py: updated 4 tests for
  the new exception types (InputValidationError replaces TypeError,
  FileAccessError extends OSError).

Full project suite: 1230 passed, 4 skipped, 17 xfailed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
src/core/io.py
@@ -182,14 +182,25 @@ def read_file(

     Returns a DataFrame (or generator when *chunk_size* is set).
     """
+    from .errors import FileAccessError, InputValidationError
     filepath = Path(path)
     if not filepath.exists():
-        raise FileNotFoundError(
-            f"Input file not found: {filepath} "
-            f"(required for encoding/delimiter detection and reading)"
+        raise FileAccessError(
+            "Input file not found",
+            path=filepath,
+            operation="read_file",
+            suggestion=(
+                f"Check the path is correct. Parent directory "
+                f"{filepath.parent} "
+                f"{'exists' if filepath.parent.exists() else 'does NOT exist'}."
+            ),
         )
     if chunk_size is not None and chunk_size <= 0:
-        raise ValueError(f"chunk_size must be positive; got {chunk_size}")
+        raise InputValidationError(
+            f"chunk_size must be positive; got {chunk_size}",
+            operation="read_file",
+            suggestion="Pass a positive integer (e.g., chunk_size=10000) or omit for non-streaming reads.",
+        )

     suffix = filepath.suffix.lower()
     logger.info(
@@ -288,14 +299,42 @@ def _read_excel(
         else _detect_excel_header_row(path, sheet_name)
     )
     logger.debug("Reading Excel {} (sheet={}, header_row={})", path.name, sheet_name, hdr)
-    return pd.read_excel(
-        path,
-        sheet_name=sheet_name,
-        header=hdr,
-        dtype=str,
-        keep_default_na=False,
-        engine="openpyxl",
-    )
+    try:
+        return pd.read_excel(
+            path,
+            sheet_name=sheet_name,
+            header=hdr,
+            dtype=str,
+            keep_default_na=False,
+            engine="openpyxl",
+        )
+    except ValueError as e:
+        # pandas raises ValueError for "Worksheet named 'X' not found".
+        from .errors import FileFormatError
+        raise FileFormatError(
+            "Could not read Excel sheet",
+            path=path,
+            operation=f"open sheet {sheet_name!r}",
+            cause=e,
+            suggestion=(
+                "Check the sheet name exists. List available sheets with "
+                "`from src.core.io import list_sheets; list_sheets(path)`."
+            ),
+        ) from e
+    except Exception as e:
+        # openpyxl can raise BadZipFile, InvalidFileException for
+        # corrupt / non-xlsx inputs. Wrap with file context.
+        from .errors import FileFormatError
+        raise FileFormatError(
+            "Excel file could not be parsed",
+            path=path,
+            operation="pd.read_excel",
+            cause=e,
+            suggestion=(
+                "Confirm the file is a valid .xlsx workbook and not "
+                "renamed/corrupted. Try opening it in Excel to verify."
+            ),
+        ) from e


 def _detect_excel_header_row(
@@ -308,18 +347,20 @@ def _detect_excel_header_row(
     Scans the first *max_scan* rows of *sheet_name* in read-only mode
     (so a 100 MB workbook doesn't get fully materialized) and returns
     the index of the first row where every non-empty cell looks like a
-    column header. Falls back to 0.
+    column header. Falls back to 0 on parse failure (logged at debug —
+    the caller's ``pd.read_excel`` will raise a useful FileFormatError
+    with full context).
     """
     try:
         from openpyxl import load_workbook
-    except ImportError:
+        from openpyxl.utils.exceptions import InvalidFileException
+    except ImportError as e:
+        logger.debug("openpyxl unavailable for header detection: {}", e)
         return 0

+    wb = None
     try:
         wb = load_workbook(path, read_only=True, data_only=True)
-    except Exception:
-        return 0
-    try:
         if isinstance(sheet_name, int):
             names = wb.sheetnames
             target = names[sheet_name] if 0 <= sheet_name < len(names) else names[0]
@@ -340,8 +381,18 @@ def _detect_excel_header_row(
             ):
                 return idx
         return 0
+    except (InvalidFileException, KeyError, IndexError, OSError) as e:
+        # Corrupt workbook, missing sheet name, or read failure — fall
+        # back to row 0 and let pd.read_excel raise the user-facing error
+        # with full context.
+        logger.debug(
+            "Excel header detection failed for {} (sheet={}): {}",
+            path, sheet_name, e,
+        )
+        return 0
     finally:
-        wb.close()
+        if wb is not None:
+            wb.close()


 # ---------------------------------------------------------------------------
@@ -371,20 +422,22 @@ def write_file(

     Returns the resolved output Path.
     """
-    if not isinstance(df, pd.DataFrame):
-        raise TypeError(
-            f"write_file() requires a pandas DataFrame; got {type(df).__name__}"
-        )
+    from .errors import ensure_dataframe, wrap_file_write
+    ensure_dataframe(df, function="write_file")
+
     out = Path(path)
     fmt = file_format or out.suffix.lstrip(".").lower()
-    if fmt in ("xlsx", "xls"):
-        df.to_excel(out, index=False, engine="openpyxl")
-    else:
-        sep = delimiter if delimiter is not None else (
-            "\t" if fmt == "tsv" else ","
-        )
-        df.to_csv(out, index=False, encoding=encoding, sep=sep)
-    logger.info("Wrote {} rows to {}", len(df), out)
+    try:
+        if fmt in ("xlsx", "xls"):
+            df.to_excel(out, index=False, engine="openpyxl")
+        else:
+            sep = delimiter if delimiter is not None else (
+                "\t" if fmt == "tsv" else ","
+            )
+            df.to_csv(out, index=False, encoding=encoding, sep=sep)
+    except (OSError, PermissionError) as e:
+        raise wrap_file_write(out, f"write_file (format={fmt})", e) from e
+    logger.info("Wrote {} rows × {} cols to {}", len(df), len(df.columns), out)
     return out

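
The `raise wrap_file_write(out, ..., e) from e` call in write_file implies the helper returns an exception instance rather than raising one itself. A hedged sketch of a helper with that shape is below; the hint wording, the platform check, and the stand-in FileAccessError class are assumptions, since src/core/errors.py is not included in this diff.

```python
import sys
from pathlib import Path
from typing import Union


class FileAccessError(OSError):
    """Stand-in for the commit's FileAccessError."""


def wrap_file_write(
    path: Union[str, Path], op: str, exc: OSError
) -> FileAccessError:
    """Return (not raise) an error describing a failed write."""
    hint = "Check that the target directory exists and is writable."
    if isinstance(exc, PermissionError) and sys.platform.startswith("win"):
        # Assumed heuristic: on Windows this often means the file is
        # open in another program such as a spreadsheet application.
        hint = "The file may be open in another program; close it and retry."
    return FileAccessError(f"{op} failed for {path}: {exc}. {hint}")
```

Returning the exception lets the call site keep its own `raise ... from e`, so the traceback points at write_file rather than at the helper.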