datatools-dev

Author	SHA1	Message	Date
Michael	0671ef277e	feat(io): route read_file through pre-parse repair by default Previously only analyze() and direct read_csv_repaired() callers got the byte-level repair pass (BOM strip, NUL strip, smart-double-quote fold, unquoted-delimiter merge). The dedup CLI and any other read_file consumer silently missed it. read_file gains a repair=True default. CSV/TSV inputs run through repair_bytes before pandas sees them; Excel inputs still pass through unchanged. Chunked reads (chunk_size set) bypass repair because the pre- parse pass loads the whole file — preserving streaming behavior on huge files. Repair actions and unrepairable lines are logged at INFO/WARNING. cli_text_clean opts out (repair=False): the cleaner offers fine-grained control via --preset and per-op flags, and a byte-level smart-quote fold under the user's "minimal" preset would violate that contract. The cell-level cleaner does the equivalent work itself when its options ask for it. Tests: read_file default strips BOM and folds curly double quotes; repair=False preserves smart quotes; chunked reads still work and skip repair as documented. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:09:35 +00:00
Michael	0b959dee93	feat(text_clean): preserve internal whitespace in numeric/date/phone cells Closes the §4.17 spec gap that test_gap_coverage.py was tracking via xfail: collapse_whitespace must NOT touch cells whose shape carries meaningful internal whitespace. Adds _looks_structured(s) — returns True when s matches: - numeric (currency optional, thousand-grouping by , . or single space) - date (ISO/slash/dot separator, or 'Mon DD YYYY' / 'DD Mon YYYY') - phone (digits + parens/dots/dashes/+/spaces, >= 7 digits, no letters) The pipeline uses a new _smart_collapse_whitespace wrapper that defers to collapse_whitespace only when _looks_structured returns False. The raw collapse_whitespace function is unchanged so direct callers and existing unit tests remain valid. Five new positive tests replace the xfail: - "(555) 123-4567" preserved (phone, double space inside) - "1 234" preserved (European thousands) - "2024-01-15" preserved (ISO date) - "Jan 15 2024" preserved (textual date) - "hello world" still collapsed to "hello world" (free-text negative case) Conservative on purpose: a false negative just collapses (existing behavior); a false positive leaves intentional double spaces in prose. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:09:25 +00:00
Michael	4687cf87b4	test: single-command runner, cross-platform automation, fixture auto-discovery Adds a top-level test infrastructure layer addressing four needs at once: a single command to run anything, cross-platform automation, install/e2e sanity, and zero-config pickup of new fixtures dropped into test-cases/. Top-level runner — run_tests.py python run_tests.py # everything (default) python run_tests.py --tool dedup # one tool's tests python run_tests.py --unit # category scopes python run_tests.py --e2e # end-to-end CLI python run_tests.py --install # import / dependency sanity python run_tests.py --fixtures # corpus + dropped-file sweep python run_tests.py --coverage # term-missing report python run_tests.py --quick # skip @pytest.mark.slow Tools: analyze, cli, config, dedup, io, normalizers, text_clean. Cross-platform — tox.ini Envs for py310-py313 plus install / e2e / fixtures / coverage / lint. Forces UTF-8 (PYTHONUTF8=1, PYTHONIOENCODING=utf-8) so identical fixture bytes parse the same on Linux/macOS/Windows. Shared config — pytest.ini testpaths, python_files conventions, custom markers (slow, e2e, install, fixture_sweep), warning filters that fail on our own DeprecationWarnings while tolerating third-party ones. New test layers tests/test_install.py — required deps import; project modules import; src.core public API surface; CLI --help exits 0; streamlit app.py parses as valid Python; run_tests.py --help works. tests/test_e2e.py — CLI roundtrips: cli_analyze table + JSON, cli_text_clean --apply writes a real file with NBSP/smart-quote folded, dedup CLI removes duplicates, run_tests.py self-tests. tests/test_fixtures_sweep.py — parametrizes over every CSV/TSV/XLSX inside test-cases/ (excluding text-cleaner-corpus/, which has its own suite). Each fixture must: load through repair_bytes, run analyze() cleanly, and survive clean_dataframe() with row/col counts unchanged plus idempotency. Drop a CSV in, re-run — no test code changes needed. tests/test_gap_coverage.py — closes audit gaps: clean_headers=False toggle, repair_bytes with tab/semicolon delimiters, BOM+NUL+smart- quote combined-fix scenario, analyze() over an XLSX path, sample_rows larger than the DataFrame, mid-cell BOM, findings_by_tool edges, plus a strict xfail documenting the known §4.17 numeric/phone whitespace heuristic gap. Test count Before: 288 passed + 1 xfailed After: 475 passed + 2 xfailed (the second xfail is the documented collapse_whitespace gap on phone-shaped cells; spec §4.17 calls for a heuristic that hasn't been implemented yet). Functional gaps surfaced (not fixed in this commit): - Text cleaner: collapse_whitespace runs unconditionally on every string cell, including phone/numeric/date-shaped ones. Spec §4.17 requires a skip heuristic. Captured as strict xfail so the gap stays visible. - io.read_file does not run pre-parse repair; only analyze() and direct callers of read_csv_repaired() get it. CLI tool pages and the dedup CLI miss the safety net. - Analyzer has no mixed_line_endings detector or near_duplicate_rows detector; both planned but require additional plumbing. - GUI tool pages each have their own uploader instead of picking up the home-page upload through session_state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:01:06 +00:00
Michael	5c62fb6117	feat(cli): src.cli_analyze — Typer CLI for the analyzer python -m src.cli_analyze input.csv # rich table per tool python -m src.cli_analyze input.csv --json # array of finding dicts python -m src.cli_analyze input.csv --strict # exit 1 on warn/error python -m src.cli_analyze input.csv -n 50000 # cap rows scanned Findings are grouped by destination tool so the user can see at a glance which tool to open next. Read-only; exit code 0 unless --strict is set. The CLI keeps its own tool-id -> display-name map so it doesn't depend on the GUI module. 7 tests cover: clean-file passthrough, dirty-file table, --json round-trip, missing-file (exit 2), --strict exit code, --sample-rows cap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:53:11 +00:00
Michael	edf6ccf90b	feat(analyze): upload-time data quality analyzer Pure, advisory scan over an uploaded file or DataFrame that returns a list of Finding objects naming each issue, the affected count, and which downstream tool can fix it. The GUI uses this to badge tool nav items at upload; the CLI will print findings as a table or JSON. src/core/analyze.py: Finding dataclass (id, severity, tool, count, description, column, samples) analyze(source, , sample_rows=1000, repair_result=None) -> list[Finding] - source: DataFrame, path, or str. Path scans first 1000 rows. - When source is a path, runs the same pre-parse repair the tool pages will use; the resulting RepairResult is auto-surfaced as csv_ findings. A caller-supplied repair_result wins so non-default repair flags are respected. Detectors (each independent, samples capped at 5): - smart_punctuation_in_data -> 02 - nbsp_or_unicode_whitespace -> 02 - zero_width_or_invisible -> 02 - dirty_column_headers -> 02 - whitespace_padding -> 02 - null_like_sentinels -> 04 - suspected_mojibake -> 02 (Tier 2) - mixed_case_email_column -> 02 case op - leading_zero_ids -> informational, no tool Helpers: findings_by_tool() for sidebar grouping, to_dict() for JSON. Detectors are decoupled from the GUI display layer — they emit stable tool ids ("02_text_cleaner") and the GUI maps those to display names. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:41:36 +00:00
Michael	b8a9fa1b09	feat(io): pre-parse CSV repair (BOM/NUL/smart-quotes/unquoted-delim) Some pollution patterns block pandas before the cell-level cleaner can run. Add a pre-parse pass on raw bytes that fixes only what breaks parsing, and returns a structured action log the GUI/CLI can surface to the user. repair_bytes(raw, *, encoding, delimiter, fold_quotes, strip_nul, repair_delims): 1. Strip leading UTF-8 BOM. 2. Strip embedded NUL bytes (the C parser truncates fields at NUL). 3. Fold smart double quotes (curly, guillemet, double-prime) to ASCII '"'. Curly singles are NOT folded here; they don't conflict with CSV and the cell-level cleaner handles them more accurately. 4. Per-row repair when one rogue delimiter is embedded in a field that looks like currency or thousands-grouped digits. Tiered scoring keeps " $1,500.00 ,7" unambiguous: the strict currency regex match wins over the loose digit/sigil heuristic. read_csv_repaired(path) -> (DataFrame, RepairResult). RepairResult exposes .actions, .unrepairable_lines, and a summary() grouped by kind. Out of scope for this pass: encoding repair, delimiter conversion, multi- delimiter merges (k>1) — logged as unrepairable so callers can see what was left alone instead of silently parsing wrong. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:37:49 +00:00
Michael	c349a90e18	test: add text-cleaner corpus and close gaps surfaced by it The 21-fixture corpus (test-cases/text-cleaner-corpus/) exercises the cleaner end-to-end against the spec in TEST-CASES.md. Closing the failing cases drove five small cleaner fixes plus two fixture-generation fixes: - _SMART_CHARS: add prime, double prime, guillemets (case 03) - _ZERO_WIDTH: add soft hyphen U+00AD (case 05) - clean_dataframe: clean column headers via the same pipeline (cases 16/19/20), with a clean_headers toggle on CleanOptions - smart_title_case: title-case full-shout strings ("ALICE SMITH" -> "Alice Smith") while still preserving embedded acronyms; preserve uppercase after apostrophe in names ("O'CONNOR" -> "O'Connor", "o'neil" -> "O'neil") - test_corpus.py reader: pre-strip NUL bytes (C parser truncates at NUL, python engine is too strict about embedded literal "), per spec case 06 - generate_test_data.py: properly CSV-escape literal-quote cells in case 03 expected; quote the rogue-comma price field in case 17 input Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:37:35 +00:00
Michael	54f92ae47e	feat: implement text cleaner (script 02) with CLI, GUI, and tests Builds 02_text_cleaner.py from stub to working: character-level hygiene for CSV/Excel inputs covering trim, whitespace collapse, smart-character folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char strip, line-ending normalization, and per-column case conversion. Three presets (minimal/excel-hygiene/paranoid) keep the buyer surface small. - src/core/text_clean.py: pure helpers + CleanOptions/CleanResult + clean_dataframe with dtype-safe column selection - src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape (dry-run by default, --apply writes cleaned + changes audit, JSON config save/load) - src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset picker, advanced toggles, preview, before/after metrics, and three download buttons - tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests covering edge cases E1-E50 from the spec - samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10 in 10 rows - test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case fixtures Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7 entry locking the spec, CLI-REFERENCE.md gains the text cleaner section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md status row 02 promoted Skeleton -> Working. 200/200 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:14:15 +00:00
Michael	b871ab24fc	feat: add documentation, Streamlit GUI, and full source tree - Rewrite README.md with project overview, quick-start, and CLI summary - Add docs/CLI-REFERENCE.md with full flag reference and 8 recipe sections - Add docs/DEVELOPER.md with architecture, data flow, and extension guides - Rewrite src/core/__init__.py with public API exports and module docstring - Add Streamlit GUI (src/gui/) with file upload, advanced options, interactive match group review with side-by-side diff, and download buttons - Add .gitignore, requirements.txt, all source code, tests, and sample data - Add streamlit to requirements.txt Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-28 23:06:39 +00:00

9 Commits