Two more detectors close the analyzer gap list:
mixed_line_endings (warn, tool=02): scans the raw bytes for a mix of
CRLF, LF, and bare CR line endings. This typically shows up after
concatenating exports from several sources (Windows + macOS + Linux
files stitched together). The detector needs raw bytes, so
DataFrame-mode analyze() skips it because none are available.
_load_for_analysis now returns the raw bytes alongside the DataFrame
and repair result so the detector has them.
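
For reference, a minimal sketch of the byte-level counting idea. The
function names are hypothetical and this is not necessarily how the
detector in analyze.py is written. It counts every line ending in the
buffer, including any that sit inside quoted fields.

```python
def count_line_endings(raw: bytes) -> dict[str, int]:
    """Count CRLF, lone LF, and lone CR sequences in a raw byte buffer."""
    crlf = raw.count(b"\r\n")
    # Every CRLF contributes one "\n" and one "\r", so subtract those
    # to get the line endings that stand alone.
    lf = raw.count(b"\n") - crlf
    cr = raw.count(b"\r") - crlf
    return {"crlf": crlf, "lf": lf, "cr": cr}


def has_mixed_line_endings(raw: bytes) -> bool:
    """True when more than one line-ending style appears in the buffer."""
    counts = count_line_endings(raw)
    return sum(1 for n in counts.values() if n > 0) > 1
```

A file with only LF endings reports a single style and is not flagged;
any second style flips the result to True.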
near_duplicate_rows (info, tool=01): cheap dedup signal. Strip and
lowercase every string column, then count df.duplicated(). Catches the
most common case (same customer entered twice, differing only in case
or surrounding whitespace) without paying for fuzzy matching. Anything
more sophisticated stays in tool 01.
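
A rough sketch of the normalize-then-count approach, assuming pandas
and unique column names. The function name is hypothetical and the
real detector may differ in detail.

```python
import pandas as pd


def count_near_duplicate_rows(df: pd.DataFrame) -> int:
    """Count rows that repeat an earlier row once strings are normalized."""
    normalized = df.copy()
    for col in normalized.columns:
        series = normalized[col]
        is_text = (pd.api.types.is_object_dtype(series)
                   or pd.api.types.is_string_dtype(series))
        if is_text:
            # Only touch real strings; NaN/None and other objects pass through.
            normalized[col] = series.map(
                lambda v: v.strip().lower() if isinstance(v, str) else v
            )
    # duplicated() marks every repeat after the first occurrence.
    return int(normalized.duplicated().sum())
```

Rows such as ("Ann Lee", 1) and (" ann lee ", 1) collapse to the same
normalized row, so the second one is counted.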
Six new tests cover both detectors plus the DataFrame-mode skip path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure, advisory scan over an uploaded file or DataFrame that returns a list of
Finding objects naming each issue, the affected count, and which downstream
tool can fix it. The GUI uses this to badge tool nav items at upload; the CLI
will print findings as a table or JSON.
src/core/analyze.py:
Finding dataclass (id, severity, tool, count, description, column, samples)
analyze(source, *, sample_rows=1000, repair_result=None) -> list[Finding]
- source: DataFrame, path, or str. A path source scans only the first
sample_rows rows (default 1000).
- When source is a path, runs the same pre-parse repair the tool pages
will use; the resulting RepairResult is auto-surfaced as csv_*
findings. A caller-supplied repair_result wins so non-default repair
flags are respected.
Detectors (each independent, samples capped at 5):
- smart_punctuation_in_data -> 02
- nbsp_or_unicode_whitespace -> 02
- zero_width_or_invisible -> 02
- dirty_column_headers -> 02
- whitespace_padding -> 02
- null_like_sentinels -> 04
- suspected_mojibake -> 02 (Tier 2)
- mixed_case_email_column -> 02 case op
- leading_zero_ids -> informational, no tool
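
As an illustration of the detector shape (a count plus at most 5
samples), here is a minimal sketch for whitespace_padding over a plain
list of cell values. The function name, the MAX_SAMPLES constant, and
the tuple return value are assumptions made for this example; the real
detectors work on DataFrames and emit Finding objects.

```python
MAX_SAMPLES = 5  # assumed name for the "samples capped at 5" rule


def detect_whitespace_padding(values):
    """Return (count, samples) for strings with leading/trailing whitespace."""
    count = 0
    samples = []
    for v in values:
        if isinstance(v, str) and v != v.strip():
            count += 1
            # Keep counting every hit, but stop collecting after the cap.
            if len(samples) < MAX_SAMPLES:
                samples.append(v)
    return count, samples
```

The count reflects every affected value, while the samples list only
carries enough examples to show the user what the problem looks like.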
Helpers: findings_by_tool() for sidebar grouping, to_dict() for JSON.
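
A sketch of how the Finding dataclass and the two helpers could fit
together. The field names come from the list above; the field types,
the defaults, and to_dict() being a method rather than a module-level
function are assumptions.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Finding:
    id: str
    severity: str
    tool: Optional[str]          # None for informational findings (assumed)
    count: int
    description: str
    column: Optional[str] = None
    samples: list = field(default_factory=list)

    def to_dict(self) -> dict:
        """Plain dict that json.dumps() can serialize."""
        return asdict(self)


def findings_by_tool(findings):
    """Group findings by tool id, preserving first-seen order."""
    grouped = defaultdict(list)
    for finding in findings:
        grouped[finding.tool].append(finding)
    return dict(grouped)


example = Finding(
    id="whitespace_padding",
    severity="warn",
    tool="02_text_cleaner",
    count=3,
    description="Values with leading or trailing whitespace",
    column="name",
    samples=[" Ann", "Bob "],
)
payload = json.dumps(example.to_dict())
```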
Detectors are decoupled from the GUI display layer — they emit stable tool
ids ("02_text_cleaner") and the GUI maps those to display names.
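
One way the GUI side of that mapping could look, as a sketch only. The
"Text Cleaner" label and both names below are hypothetical; only the
"02_text_cleaner" id comes from the detectors.

```python
# Hypothetical GUI-side lookup; detectors never see these labels.
TOOL_DISPLAY_NAMES = {
    "02_text_cleaner": "Text Cleaner",
}


def display_name(tool_id: str) -> str:
    """Map a stable tool id to its label, falling back to the id itself."""
    return TOOL_DISPLAY_NAMES.get(tool_id, tool_id)
```

Falling back to the raw id keeps the sidebar usable if a detector
starts emitting an id the GUI does not know yet.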
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>