feat(io): route read_file through pre-parse repair by default

Previously only analyze() and direct read_csv_repaired() callers got the
byte-level repair pass (BOM strip, NUL strip, smart-double-quote fold,
unquoted-delimiter merge). The dedup CLI and any other read_file consumer
silently missed it.

read_file gains a repair=True default. CSV/TSV inputs run through
repair_bytes before pandas sees them; Excel inputs still pass through
unchanged. Chunked reads (chunk_size set) bypass repair because the pre-
parse pass loads the whole file — preserving streaming behavior on huge
files. Repair actions and unrepairable lines are logged at INFO/WARNING.

cli_text_clean opts out (repair=False): the cleaner offers fine-grained
control via --preset and per-op flags, and a byte-level smart-quote fold
under the user's "minimal" preset would violate that contract. The
cell-level cleaner does the equivalent work itself when its options ask
for it.

Tests: read_file default strips BOM and folds curly double quotes;
repair=False preserves smart quotes; chunked reads still work and skip
repair as documented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-29 16:09:35 +00:00
parent 0b959dee93
commit 0671ef277e
3 changed files with 91 additions and 9 deletions

View File

@@ -280,6 +280,10 @@ def clean(
encoding=encoding_override,
header_row=header_row,
sheet_name=sheet_arg if sheet_arg is not None else 0,
# Bypass byte-level repair so the user's preset/flag choices
# remain authoritative. The cell-level cleaner does the
# smart-quote / NUL / BOM work itself.
repair=False,
)
if not isinstance(df, pd.DataFrame):
df = pd.concat(list(df), ignore_index=True)