refactor: dedup, consolidate, harden public APIs across core modules

Closes 16 high-value findings from a parallel cross-module review. Refactors: - New src/core/_constants.py centralizes USPS street-suffix abbreviations, US state names, and 2-letter postal codes — one source of truth for both normalize_address (matching keys) and standardize_address (display formatting). Eliminates ~80 lines of duplicated dicts across normalizers.py and format_standardize.py. - format_standardize.py: collapse 4 identical nested _err() helpers into one shared _err_or_passthrough() module function; drop a dead duplicate `return _err("not a phone number")` branch in standardize_phone. - format_standardize.py: precompile per-locale month-name regexes (_MONTH_LOCALE_PATTERNS) and per-state-name regexes (_STATE_NAME_PATTERNS) at import time — they were rebuilt on every cell, a measurable hot path on million-row inputs. - dedup.py: extract _is_missing(value) helper; one definition of "this cell is None / NaN / pd.NA" instead of two. - fixes.py: extract _is_string_column(ser) helper; one dtype check instead of three duplicates across _apply_to_strings, _vectorized_translate, _vectorized_regex_sub. Production-readiness: - format_standardize.standardize_dataframe now logs a warning when more than 10% of typed cells are unparseable — surfaces the silently-broken-pipeline failure mode. - StandardizeOptions.from_dict validates date_order / phone_format / currency_decimal / name_case / boolean_style / *_error_policy enum values up front, with a clear error message instead of a deep crash inside the per-cell function. - StandardizeOptions.from_file and DeduplicationConfig.from_file wrap read + json.loads with descriptive OSError / ValueError messages including the file path. - standardize_date(month_locales=...) validates locale codes against the available set instead of silently passing through unknown ones. - io.read_file rejects chunk_size <= 0 (was silently failing inside pandas) and logs the resolved suffix + chunk_size at info level so data-pipeline runs are debuggable. - io.read_file's FileNotFoundError gains explanatory context. - io.write_file, text_clean.clean_dataframe, and dedup.deduplicate now reject non-DataFrame inputs with clear TypeError instead of cryptic pandas tracebacks downstream. - dedup.deduplicate validates that survivor_rule=KEEP_MOST_RECENT has a usable date_column up front; the helper _select_survivor now raises (instead of silently falling back to keep_first) when called directly with bad arguments. - dedup.deduplicate gains a structured no-op return when strategies is empty after auto-detection — preserves schema instead of crashing. - analyze._detect_inconsistent_date_format narrows its bare except to (TypeError, ValueError) and logs a debug line so genuine bugs don't hide behind silent skip. Tests: - tests/test_audit_fixes.py grows by 11 cases covering the new validation paths (chunk_size, DataFrame guards, KEEP_MOST_RECENT date_column, enum validation, locale validation, JSON error wrapping). Full project suite: 1208 passed, 4 skipped, 17 xfailed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:23:09 +00:00
parent b23a27d4e3
commit 2eece6467d
10 changed files with 457 additions and 231 deletions
--- a/src/core/io.py
+++ b/src/core/io.py
@@ -184,9 +184,18 @@ def read_file(
    """
    filepath = Path(path)
    if not filepath.exists():
-        raise FileNotFoundError(f"File not found: {filepath}")
+        raise FileNotFoundError(
+            f"Input file not found: {filepath} "
+            f"(required for encoding/delimiter detection and reading)"
+        )
+    if chunk_size is not None and chunk_size <= 0:
+        raise ValueError(f"chunk_size must be positive; got {chunk_size}")

    suffix = filepath.suffix.lower()
+    logger.info(
+        "read_file: {} (suffix={}, chunk_size={})",
+        filepath, suffix, chunk_size,
+    )
    if suffix in (".xlsx", ".xls"):
        return _read_excel(filepath, header_row=header_row, sheet_name=sheet_name)
    else:
@@ -362,6 +371,10 @@ def write_file(

    Returns the resolved output Path.
    """
+    if not isinstance(df, pd.DataFrame):
+        raise TypeError(
+            f"write_file() requires a pandas DataFrame; got {type(df).__name__}"
+        )
    out = Path(path)
    fmt = file_format or out.suffix.lstrip(".").lower()
    if fmt in ("xlsx", "xls"):