datatools-dev

Author	SHA1	Message	Date
Michael	26b9771625	feat(errors): structured error hierarchy + helpful messages everywhere
Introduces src/core/errors.py with a small structured error hierarchy that every public entry point now uses. Each error carries the context a user needs to fix it and the context a maintainer needs to trace it.
The hierarchy:
  DataToolsError (base: formats path, column, operation, suggestion)
  InputValidationError (extends ValueError: bad arg / wrong type)
  ConfigError (extends ValueError: bad config / options)
  FileFormatError (extends ValueError: file is not what we expected)
  FileAccessError (extends OSError: file I/O failure)
Subclassing the stdlib bases means existing `except OSError` / `except ValueError` handlers still catch them, so this is not a breaking change.
Helpers:
- ensure_dataframe(value, function=...): uniform DataFrame guard
- ensure_choice(value, name=, choices=): uniform enum/literal guard
- wrap_file_read(path, op, exc): tag OSError with hint + path
- wrap_file_write(path, op, exc): same, with Windows-aware tip
- format_for_user(exc, context=): user-facing string for st.error / stderr
Library hardening:
- io.read_file: missing files surface FileAccessError listing whether the parent directory exists, and the suggestion to check the path.
- io.read_file: chunk_size <= 0 now raises InputValidationError with a positive-integer suggestion.
- io._read_excel: openpyxl BadZipFile / InvalidFileException / pandas ValueError ("sheet not found") wrapped as FileFormatError listing the path and a "list sheets with list_sheets()" hint.
- io._detect_excel_header_row: bare except narrowed to specific openpyxl exceptions; falls back gracefully and logs at debug so the real error surfaces from pd.read_excel.
- io.write_file: OSError / PermissionError on to_csv/to_excel wrapped with file path and Windows-aware "file may be open in another program" hint.
- dedup._parse_date: bare `except Exception` narrowed to (TypeError, ValueError, OutOfBoundsDatetime); failed values logged at debug for survivor-selection forensics.
- dedup._select_survivor: KEEP_MOST_RECENT now raises InputValidationError instead of silently falling back to keep_first.
- dedup.deduplicate: input validation errors are InputValidationError with operation/column/suggestion fields.
- format_standardize.from_dict: invalid FieldType for a column raises ConfigError naming the column AND the bad value AND listing valid values; same for date_order / phone_format / etc.
- format_standardize.from_file: OSError / JSON decode wrapped with path AND line/column where parsing failed.
- format_standardize.to_file: TypeError on json.dumps wrapped as ConfigError with the suspected source (extra_abbreviations).
- format_standardize._apply_field_type: dispatcher's "unknown field type" branch now raises AssertionError (it's an internal invariant, not user error; a new enum value was added without a branch).
- format_standardize._resolve_column_types: missing-column error now InputValidationError with a "check for typos / unparsed header" suggestion.
- format_standardize.standardize_dataframe: ensure_dataframe at entry.
- text_clean.clean_dataframe: ensure_dataframe at entry.
- config.to_strategies: invalid Algorithm/NormalizerType wrapped as ConfigError naming the strategy index AND the column.
- config.to_survivor_rule: invalid SurvivorRule wrapped as ConfigError listing valid values.
- config.from_file: OSError / JSON decode wrapped (mirror of StandardizeOptions.from_file).
- fixes.repair_mojibake: ImportError on ftfy now logged at info level with the underlying ImportError so a corrupt-package vs not-installed distinction is visible in the logs.
- normalizers.normalize_phone: phonenumbers.NumberParseException now logged at debug when the digits-only fallback drops extension / country-code information, which gives a trail when matching results look wrong.
GUI / CLI surfaces:
- All 9 page handlers (`except Exception as e: st.error(...)`) now use format_for_user(), which renders DataToolsError fields nicely and falls back to "ClassName: message" for unrecognized errors.
- 2_Text_Cleaner and 3_Format_Standardizer additionally distinguish UnicodeDecodeError with a "re-save as UTF-8" suggestion before the generic handler.
- cli.py's "Error reading file" handler now uses format_for_user() and includes the input path in the prefix.
Tests:
- tests/test_errors.py: 22 new tests covering base class formatting, stdlib inheritance, ensure_dataframe / ensure_choice helpers, wrap_file_read / wrap_file_write, format_for_user behavior, and end-to-end integration (missing file, missing dir, bad JSON, bad algorithm, bad enum, missing column).
- tests/test_audit_fixes.py + tests/test_io.py: updated 4 tests for the new exception types (InputValidationError replaces TypeError, FileAccessError extends OSError).
Full project suite: 1230 passed, 4 skipped, 17 xfailed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 02:35:42 +00:00
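The hierarchy described in commit 26b9771625 can be illustrated with a minimal sketch. This is not the contents of src/core/errors.py: the constructor signature, the message layout, and the body of format_for_user below are assumptions made for illustration. Only the class names, the stdlib bases, the four context fields, and the "ClassName: message" fallback come from the commit message.

```python
class DataToolsError(Exception):
    """Base error carrying optional context fields (assumed signature)."""

    def __init__(self, message, *, path=None, column=None,
                 operation=None, suggestion=None):
        self.message = message
        self.path = path
        self.column = column
        self.operation = operation
        self.suggestion = suggestion
        super().__init__(self._format())

    def _format(self):
        # Append only the context fields that were actually supplied.
        parts = [self.message]
        for label in ("operation", "path", "column", "suggestion"):
            value = getattr(self, label)
            if value is not None:
                parts.append(f"{label}: {value}")
        return " | ".join(parts)


# Each subclass also inherits from a stdlib base, so handlers written
# as `except ValueError` or `except OSError` keep catching them.
class InputValidationError(DataToolsError, ValueError):
    pass


class FileAccessError(DataToolsError, OSError):
    pass


def format_for_user(exc, context=None):
    """User-facing string; falls back to 'ClassName: message'."""
    if isinstance(exc, DataToolsError):
        text = str(exc)
    else:
        text = f"{type(exc).__name__}: {exc}"
    return f"{context}: {text}" if context else text
```

With this layout, a caller that predates the hierarchy and only knows about `OSError` still catches a `FileAccessError`, which is the compatibility property the commit message relies on.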
Michael	b23a27d4e3	fix: cross-tool audit findings + alignment with format standardizer
Closes 12 bugs and 8 gaps surfaced by parallel audits across all core modules, plus aligns the dedup-side normalizers with the new format_standardize behavior where they had silently diverged.
Bugs (data integrity / correctness):
- dedup: NaN/None values matched as duplicates because str(None)='None'. Two rows with missing email silently merged.
- dedup: removed_df had 0 columns when nothing was removed; downstream code expecting matching schema broke. Now preserves column shape.
- dedup: ColumnMatchStrategy threshold accepted any value; out-of-range silently broke matching. Validated to [0, 100] in __post_init__.
- dedup: strategy referencing a missing column was silently skipped. Now raises ValueError listing available columns.
- fixes: replace_null_sentinels crashed on non-string sentinels (int/None from JSON payload). Coerced to str.
- fixes: _vectorized_regex_sub raised raw re.error on bad patterns. Now wraps as ValueError with clear message.
- io: detect_header_row mis-identified all-empty and metadata-only rows as headers (all([]) is True). Now requires ≥2 non-empty cells.
- config: from_dict crashed when JSON had unknown fields, breaking forward compat. Now filters to known fields.
- analyze: mixed-case email detector flagged all-None columns because str(None)='None' contains both an uppercase "N" and lowercase "one". Now drops NaN before stringify.
New features and gap closures:
- io: _detect_excel_header_row mirrors detect_header_row for Excel via openpyxl read-only; _read_excel uses it when header_row=None.
- io: write_file gains delimiter + encoding params; .tsv extension defaults to tab.
- normalizers: normalize_phone preserves extensions as ;ext=N suffix.
- normalizers: normalize_address folds spelled-out US state names to 2-letter codes (California ≡ CA).
- normalizers: normalize_name drops surname particles (van, de, von) so "Charles de Gaulle" ≡ "Charles Gaulle" for matching.
- analyze: new _detect_inconsistent_date_format detector flags columns with mixed ISO/US/EU date shapes; routes to format standardizer.
- analyze: _NULL_LIKE recognizes "<na>" (pd.NA repr).
- analyze: duplicate-row finding renamed count → n_extra (rows that would actually be removed) with clarified description.
- dedup: group_confidence no longer falsely 100.0 when transitive group members lack a recorded direct pair; falls back to 100.0 only when truly no pairs were observed.
- dedup: MatchResult / DeduplicationResult docstrings clarify that row_indices refer to the input frame's positional index (output index is reset).
- text_clean: visualize_hidden_html(None) now returns None (matches visualize_hidden_text); strip_bom strips at most one BOM per call; sentence_case dead elif branch removed.
Tests:
- tests/test_audit_fixes.py: 28 regression tests, one or more per numbered finding, named after BUG/GAP/NIT tags so future readers can trace each test back to its audit.
- tests/test_fixes_unit.py: 26 isolated unit tests for previously integration-only fix functions (trim_whitespace, strip_nbsp, strip_zero_width, normalize_line_endings, clean_headers, repair_mojibake; the last is skipped if ftfy is unavailable).
- tests/test_io.py: adds CSV / TSV / semicolon / UTF-8-BOM round-trip tests + Excel auto-header-detection tests.
- tests/test_normalizers.py: adds 8 tests for the alignment work above (phone extension, state names, particles).
Adds .claude/ to .gitignore (agent worktrees + local settings).
Full project suite: 1197 passed, 4 skipped, 17 xfailed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 02:11:57 +00:00
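The detect_header_row bug in commit b23a27d4e3 comes from vacuous truth: `all()` over an empty sequence returns True, so a row with no usable cells passes any "every cell looks like a header" check. The sketch below illustrates the failure and the ≥2 non-empty-cells guard. The function names and the "header cells are non-numeric text" rule are assumptions for illustration, not the project's actual detector.

```python
def _is_number(text):
    """True if the cell text parses as a float."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def looks_like_header_buggy(cells):
    # Empty cells are filtered out first, so an all-empty row leaves
    # nothing to check and all([]) is vacuously True.
    non_empty = [str(c).strip() for c in cells
                 if c is not None and str(c).strip()]
    return all(not _is_number(c) for c in non_empty)


def looks_like_header(cells, min_non_empty=2):
    # Same rule, but a row needs at least `min_non_empty` populated
    # cells before it can be considered a header at all.
    non_empty = [str(c).strip() for c in cells
                 if c is not None and str(c).strip()]
    if len(non_empty) < min_non_empty:
        return False
    return all(not _is_number(c) for c in non_empty)
```

Under this rule a single-cell metadata row such as a report title is rejected along with blank rows, while an ordinary header row with two or more text cells still qualifies.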
Michael	0671ef277e	feat(io): route read_file through pre-parse repair by default
Previously only analyze() and direct read_csv_repaired() callers got the byte-level repair pass (BOM strip, NUL strip, smart-double-quote fold, unquoted-delimiter merge). The dedup CLI and any other read_file consumer silently missed it.
read_file gains a repair=True default. CSV/TSV inputs run through repair_bytes before pandas sees them; Excel inputs still pass through unchanged. Chunked reads (chunk_size set) bypass repair because the pre-parse pass loads the whole file, which preserves streaming behavior on huge files. Repair actions and unrepairable lines are logged at INFO/WARNING.
cli_text_clean opts out (repair=False): the cleaner offers fine-grained control via --preset and per-op flags, and a byte-level smart-quote fold under the user's "minimal" preset would violate that contract. The cell-level cleaner does the equivalent work itself when its options ask for it.
Tests: read_file default strips BOM and folds curly double quotes; repair=False preserves smart quotes; chunked reads still work and skip repair as documented.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:09:35 +00:00
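The routing rule from commit 0671ef277e (repair on by default, CSV/TSV only, skipped for chunked reads) can be written as a small predicate. The helper name `should_repair` and the suffix-based format check are hypothetical; the commit message does not say how read_file decides the input format.

```python
from pathlib import Path

# Assumption: format is inferred from the file extension.
REPAIRABLE_SUFFIXES = {".csv", ".tsv"}


def should_repair(path, *, repair=True, chunk_size=None):
    """Decide whether the byte-level pre-parse repair pass should run."""
    if not repair:
        # Explicit opt-out, as cli_text_clean does.
        return False
    if chunk_size is not None:
        # The repair pass loads the whole file, so chunked (streaming)
        # reads skip it.
        return False
    # Excel and other formats pass through unchanged.
    return Path(path).suffix.lower() in REPAIRABLE_SUFFIXES
```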
Michael	b8a9fa1b09	feat(io): pre-parse CSV repair (BOM/NUL/smart-quotes/unquoted-delim)
Some pollution patterns block pandas before the cell-level cleaner can run. Add a pre-parse pass on raw bytes that fixes only what breaks parsing, and returns a structured action log the GUI/CLI can surface to the user.
repair_bytes(raw, *, encoding, delimiter, fold_quotes, strip_nul, repair_delims):
1. Strip leading UTF-8 BOM.
2. Strip embedded NUL bytes (the C parser truncates fields at NUL).
3. Fold smart double quotes (curly, guillemet, double-prime) to ASCII '"'. Curly singles are NOT folded here; they don't conflict with CSV and the cell-level cleaner handles them more accurately.
4. Per-row repair when one rogue delimiter is embedded in a field that looks like currency or thousands-grouped digits. Tiered scoring keeps " $1,500.00 ,7" unambiguous: the strict currency regex match wins over the loose digit/sigil heuristic.
read_csv_repaired(path) -> (DataFrame, RepairResult). RepairResult exposes .actions, .unrepairable_lines, and a summary() grouped by kind.
Out of scope for this pass: encoding repair, delimiter conversion, multi-delimiter merges (k>1). These are logged as unrepairable so callers can see what was left alone instead of silently parsing wrong.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:37:49 +00:00
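Steps 1 to 3 of the repair pass in commit b8a9fa1b09 are simple enough to sketch. The function below is an illustration, not the project's repair_bytes: its name, its return shape (a plain list of action strings rather than a RepairResult), and the exact set of code points folded are assumptions. Step 4 (the per-row rogue-delimiter merge) is omitted because its scoring rules are not spelled out in the commit message.

```python
import codecs

# Assumed mapping for "curly, guillemet, double-prime" double quotes.
_FOLD_DOUBLE_QUOTES = str.maketrans({
    "\u201c": '"',  # left curly double quote
    "\u201d": '"',  # right curly double quote
    "\u00ab": '"',  # left guillemet
    "\u00bb": '"',  # right guillemet
    "\u2033": '"',  # double prime
})


def repair_bytes_sketch(raw, *, encoding="utf-8"):
    """Return (repaired_bytes, actions) for steps 1 to 3 only."""
    actions = []

    # 1. Strip a leading UTF-8 BOM.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
        actions.append("strip_bom")

    # 2. Strip embedded NUL bytes.
    if b"\x00" in raw:
        raw = raw.replace(b"\x00", b"")
        actions.append("strip_nul")

    # 3. Fold smart double quotes to ASCII '"'. Curly single quotes
    #    are deliberately left alone.
    text = raw.decode(encoding)
    folded = text.translate(_FOLD_DOUBLE_QUOTES)
    if folded != text:
        actions.append("fold_quotes")

    return folded.encode(encoding), actions
```

Returning the list of actions alongside the bytes mirrors the idea of a structured action log that a GUI or CLI can show to the user.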
Michael	b871ab24fc	feat: add documentation, Streamlit GUI, and full source tree
- Rewrite README.md with project overview, quick-start, and CLI summary
- Add docs/CLI-REFERENCE.md with full flag reference and 8 recipe sections
- Add docs/DEVELOPER.md with architecture, data flow, and extension guides
- Rewrite src/core/__init__.py with public API exports and module docstring
- Add Streamlit GUI (src/gui/) with file upload, advanced options, interactive match group review with side-by-side diff, and download buttons
- Add .gitignore, requirements.txt, all source code, tests, and sample data
- Add streamlit to requirements.txt
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-28 23:06:39 +00:00

5 Commits