Pure, advisory scan over an uploaded file or DataFrame that returns a list of
Finding objects naming each issue, the affected count, and which downstream
tool can fix it. The GUI uses this to badge tool nav items at upload; the CLI
will print findings as a table or JSON.
src/core/analyze.py:
Finding dataclass (id, severity, tool, count, description, column, samples)
analyze(source, *, sample_rows=1000, repair_result=None) -> list[Finding]
- source: DataFrame, path, or str. Path scans first 1000 rows.
- When source is a path, runs the same pre-parse repair the tool pages
will use; the resulting RepairResult is auto-surfaced as csv_*
findings. A caller-supplied repair_result wins so non-default repair
flags are respected.
Detectors (each independent, samples capped at 5):
- smart_punctuation_in_data -> 02
- nbsp_or_unicode_whitespace -> 02
- zero_width_or_invisible -> 02
- dirty_column_headers -> 02
- whitespace_padding -> 02
- null_like_sentinels -> 04
- suspected_mojibake -> 02 (Tier 2)
- mixed_case_email_column -> 02 case op
- leading_zero_ids -> informational, no tool
Helpers: findings_by_tool() for sidebar grouping, to_dict() for JSON.
Detectors are decoupled from the GUI display layer — they emit stable tool
ids ("02_text_cleaner") and the GUI maps those to display names.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9.9 KiB
9.9 KiB