datatools-dev

Author	SHA1	Message	Date
Michael	b23a27d4e3	fix: cross-tool audit findings + alignment with format standardizer Closes 12 bugs and 8 gaps surfaced by parallel audits across all core modules, plus aligns the dedup-side normalizers with the new format_standardize behavior where they had silently diverged. Bugs (data integrity / correctness): - dedup: NaN/None values matched as duplicates because str(None)='None'. Two rows with missing email silently merged. - dedup: removed_df had 0 columns when nothing was removed; downstream code expecting matching schema broke. Now preserves column shape. - dedup: ColumnMatchStrategy threshold accepted any value; out-of-range silently broke matching. Validated to [0, 100] in __post_init__. - dedup: strategy referencing a missing column was silently skipped. Now raises ValueError listing available columns. - fixes: replace_null_sentinels crashed on non-string sentinels (int/None from JSON payload). Coerced to str. - fixes: _vectorized_regex_sub raised raw re.error on bad patterns. Now wraps as ValueError with clear message. - io: detect_header_row mis-identified all-empty and metadata-only rows as headers (all([]) is True). Now requires ≥2 non-empty cells. - config: from_dict crashed when JSON had unknown fields, breaking forward compat. Now filters to known fields. - analyze: mixed-case email detector flagged all-None columns because str(None)='None' contains both N and one. Now drops NaN before stringify. New features and gap closures: - io: _detect_excel_header_row mirrors detect_header_row for Excel via openpyxl read-only; _read_excel uses it when header_row=None. - io: write_file gains delimiter + encoding params; .tsv extension defaults to tab. - normalizers: normalize_phone preserves extensions as ;ext=N suffix. - normalizers: normalize_address folds spelled-out US state names to 2-letter codes (California ≡ CA). - normalizers: normalize_name drops surname particles (van, de, von) so "Charles de Gaulle" ≡ "Charles Gaulle" for matching. 
- analyze: new _detect_inconsistent_date_format detector flags columns with mixed ISO/US/EU date shapes; routes to format standardizer. - analyze: _NULL_LIKE recognizes "<na>" (pd.NA repr). - analyze: duplicate-row finding renamed count → n_extra (rows that would actually be removed) with clarified description. - dedup: group_confidence no longer falsely 100.0 when transitive group members lack a recorded direct pair; falls back to 100.0 only when truly no pairs were observed. - dedup: MatchResult / DeduplicationResult docstrings clarify that row_indices refer to the input frame's positional index (output index is reset). - text_clean: visualize_hidden_html(None) now returns None (matches visualize_hidden_text); strip_bom strips at most one BOM per call; sentence_case dead elif branch removed. Tests: - tests/test_audit_fixes.py — 28 regression tests, one or more per numbered finding, named after BUG/GAP/NIT tags so future readers can trace each test back to its audit. - tests/test_fixes_unit.py — 26 isolated unit tests for previously integration-only fix functions (trim_whitespace, strip_nbsp, strip_zero_width, normalize_line_endings, clean_headers, repair_mojibake — last skipped if ftfy unavailable). - tests/test_io.py — adds CSV / TSV / semicolon / UTF-8-BOM round-trip tests + Excel auto-header-detection tests. - tests/test_normalizers.py — adds 8 tests for the alignment work above (phone extension, state names, particles). Adds .claude/ to .gitignore (agent worktrees + local settings). Full project suite: 1197 passed, 4 skipped, 17 xfailed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 02:11:57 +00:00
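The detect_header_row fix in the commit above comes down to a Python detail: all() over an empty iterable is True, so a row with no non-empty cells passes any "every cell looks like a header" check. A minimal sketch of the guard follows. The helper name looks_like_header and the "not purely numeric" rule are assumptions made for illustration, not the project's actual code; only the requirement of at least 2 non-empty cells comes from the commit message.

```python
def looks_like_header(row, min_cells=2):
    """Hypothetical helper: True if the row plausibly holds column names."""
    # Keep only cells that contain something other than whitespace.
    cells = [c.strip() for c in row if c is not None and c.strip()]
    # Without this guard, all([]) below would be True for an all-empty row.
    if len(cells) < min_cells:
        return False
    # Assumed rule for illustration: header cells are not purely numeric.
    return all(not c.replace(".", "", 1).isdigit() for c in cells)
```

With this guard an all-empty row and a single-cell metadata row (for example a report title) are both rejected before the all() check runs.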
Michael	4adeb5c7f3	feat(format): per-cell standardizers + 199-row buyer corpus Adds src/core/format_standardize.py — a per-cell standardizer for dates, phones, emails, addresses, names, currencies, booleans — wired through StandardizeOptions / standardize_dataframe with FieldType registry. Includes: - Date parser handles ISO/US/EU/longform/excel-serial/unix-timestamp/ partial-precision/quarter notation; opt-in French/German/Spanish month dictionaries via month_locales. - Phone via libphonenumber with extension preservation (;ext=N), 001 international prefix handling, error sentinels for placeholders / multi-number cells. - Email lowercase/trim/mailto/angle-bracket strip with optional --gmail-canonical mode. - Address USPS abbreviation expansion or compression (expand=False per corpus § 6.3), state-name → 2-letter conversion, multi-line collapse, PO Box normalization, state-code preservation regardless of input case. - Name handler: Mc/Mac/O'/D' inner caps, hyphen segments, particle lowercasing (von/van/de/da), comma-format reversal, period stripping for titles/suffixes/initials, PhD/MD acronym preservation, conservative mode for mixed-case input. - Currency: auto-detect EU vs US separators, space-thousands, Swiss apostrophe, accounting parens, optional ISO code preservation, error sentinels for percentages/ranges/word-values/ambiguous separators. - Per-domain error_policy ("passthrough" \| "sentinel") for surfacing malformed values as <error: reason> per corpus § 0.3. Test corpus from Business/DataTools/test-cases-format-cleaner copied to test-cases/format-cleaner-corpus/ — 7 fixtures plus FORMATS-CASES.md. tests/test_format_standardize_corpus.py drives all 199 rows through the per-cell standardizers; 0 xfailed. Wires the GUI page (3_Format_Standardizer.py) to "Ready" status. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 02:11:24 +00:00
Michael	82d7fef21e	feat(gate): CSV-normalization gate with confidence-tiered findings Adds a Review & Normalize page that sits between upload and every tool page. The analyzer now tags each finding with confidence (high/medium/low) and a fix_action; the gate auto-applies high-confidence fixes, surfaces medium/low ones for user review, and blocks tool pages on error-level findings until resolved or waived. Core (src/core/): - analyze.py: Finding gains confidence, fix_action, pre_applied; new detectors for encoding_uncertain, encoding_decode_failed; new top- level encoding_override parameter. - fixes.py: registry of fix algorithms keyed by fix_action id. - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and the NormalizationResult / Decision dataclasses the gate consumes. - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption) and normalizes line endings (fixes bare-CR parser crash); empty file handled gracefully instead of EmptyDataError traceback. GUI (src/gui/): - pages/0_Review.py: gate page with per-finding decision controls, encoding override picker (16 codepages + custom), and Advanced output options (encoding, delimiter, line terminator) on the download. - components.py: require_normalization_gate() helper. - pages/1-9: gate guard wired on every tool page. Test corpora: - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference UTF-8 files + manifest, synced from Business/DataTools. - test-cases/text-cleaner-corpus/test_data/17: synced malformed input (unquoted $1,500.00) for the unquoted-delimiter detector. Tests (94 new): - test_normalize.py (48): finding fields, fix registry, auto_fix scope, decision paths, gate idempotency, output-options helper. - test_encodings_corpus.py (90, 16 xfailed): parametric detection + decode + analyzer-no-crash sweep against the manifest. - test_analyze.py: encoding override + encoding_uncertain detectors. 
- test_corpus.py: pre-parse repair in the strict reader. run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate; encodings corpus added to --fixtures category. Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema, gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds the analyzer JSON schema with the new fields; README links to all of it. Suite: 765 passed, 17 xfailed (was 458 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 20:35:27 +00:00
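The gate policy described in the commit above (auto-apply high-confidence fixes, surface the rest for review, block on unresolved error-level findings) can be sketched roughly as below. GateFinding is a simplified stand-in for the real Finding dataclass, and triage() and its waived parameter are hypothetical names; the actual API in normalize.py (auto_fix, apply_decisions, is_normalized) may be shaped quite differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GateFinding:
    """Simplified stand-in for the analyzer's Finding (assumed fields only)."""
    id: str
    severity: str                     # "info", "warn", or "error"
    confidence: str                   # "high", "medium", or "low"
    fix_action: Optional[str] = None  # id of a registered fix, if any


def triage(findings, waived=frozenset()):
    """Split findings into auto-applied and needs-review; report blocking."""
    auto, review = [], []
    for f in findings:
        # Only high-confidence findings with a known fix are auto-applied.
        if f.confidence == "high" and f.fix_action:
            auto.append(f)
        else:
            review.append(f)
    # Tool pages stay blocked while an unwaived error-level finding remains.
    blocked = any(f.severity == "error" and f.id not in waived for f in review)
    return auto, review, blocked
```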
Michael	1049c033cb	feat(gui): visualize leading/trailing whitespace in analyzer findings The analyzer's "Run Analysis" panel rendered sample cells via st.dataframe, which (a) silently collapses leading/trailing ASCII whitespace and (b) displays NBSP/ZWSP/control chars as nothing. The user couldn't see the exact pollution they were being told about. visualize_hidden_html gains a mark_outer_whitespace=True option that wraps each leading and trailing ASCII space/tab in its own badge with a "SP LEAD" / "SP TRAIL" tooltip. The badges are per-character so the user can count exactly how much padding the cleaner will strip. components.render_findings_panel now: - injects hidden_char_css() once at the top of the panel - replaces st.dataframe(samples) with a custom HTML table - renders the value column with mark_outer_whitespace=True - applies white-space: pre-wrap on value cells so any internal ASCII whitespace also stays visible (browsers collapse runs by default) Four new tests cover: leading+trailing badge counts, default-off behaviour, leading tab badge, all-whitespace string treated entirely as leading. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:21:39 +00:00
Michael	90ceada2d1	feat(text_clean): visualize hidden characters in the cleaner GUI The whole point of the cleaner is to remove characters the user can't see — which makes the "before / after" preview nearly useless by default. A cell with NBSP padding looks identical to a cell with regular spaces. Two new helpers in src.core.text_clean: visualize_hidden_text(s) Plain-text rendering: each invisible/control/smart character is replaced by a glyph + [LABEL] (e.g. "·[NBSP]", "→[TAB]", "∅[ZWSP]", """[L DQUOTE]"). Suitable for terminal output, CSV exports, anywhere HTML is wrong. Unmapped C0 controls render as [U+XXXX]. visualize_hidden_html(s) + hidden_char_css() HTML rendering: every flagged character is wrapped in a <span> with a CSS class and a tooltip showing the codepoint and label. Pair with hidden_char_css() to inject the matching styles. Three colour bands (whitespace, special, control) so the user can scan an audit table and spot what's being changed at a glance. Mapping covers: ASCII tab/LF/CR, every NBSP variant (U+00A0, U+202F, U+2009, …), zero-width family (ZWSP/ZWNJ/ZWJ/WJ/BOM/SHY), bidi marks (LRM/RLM), all smart quotes, en/em dashes, ellipsis, prime/double-prime, and guillemets. ASCII printable text passes through; HTML output also escapes &/</> . GUI wiring (src/gui/pages/2_Text_Cleaner.py) The "Examples" changes table now defaults to a hidden-char-rendered HTML view: every NBSP/ZWSP/smart-quote/control char is shown with its badge and codepoint tooltip. A "Show hidden characters" toggle lets the user fall back to the raw st.dataframe view if they prefer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:14:14 +00:00
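A rough sketch of the plain-text rendering described in the commit above. The mapping is a three-entry subset whose glyphs and labels are taken from the examples in the message; show_hidden is a hypothetical name, and the real visualize_hidden_text covers many more characters (including LF and CR, which this sketch would render as [U+XXXX]).

```python
# Assumed subset of the mapping; glyph + [LABEL] pairs from the commit message.
HIDDEN = {
    "\u00a0": "·[NBSP]",
    "\t": "→[TAB]",
    "\u200b": "∅[ZWSP]",
}


def show_hidden(s):
    """Replace invisible characters with a visible glyph and label."""
    if s is None:
        return None
    out = []
    for ch in s:
        if ch in HIDDEN:
            out.append(HIDDEN[ch])
        elif ord(ch) < 0x20:
            # Unmapped C0 control: fall back to its codepoint.
            out.append(f"[U+{ord(ch):04X}]")
        else:
            # Printable text passes through unchanged.
            out.append(ch)
    return "".join(out)
```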
Michael	8dfc6ad8ae	feat(analyze): add mixed_line_endings + near_duplicate_rows detectors Two more detectors close the analyzer gap list: mixed_line_endings (warn, tool=02): scans raw bytes for combinations of CRLF / LF / bare CR. Disaster pattern after multi-source concat (Windows + macOS + Linux exports stitched together). Operates on raw bytes only — DataFrame-mode analyze() skips it because raw bytes aren't available. _load_for_analysis now returns the raw bytes alongside the DataFrame and repair result so the detector has them. near_duplicate_rows (info, tool=01): cheap dedup signal — strip and lowercase every string column, then count df.duplicated(). Catches the most common case (same customer entered twice with subtle formatting differences) without paying for fuzzy matching. Anything more sophisticated stays in tool 01. Six new tests cover both detectors plus the dataframe-mode skip path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:09:42 +00:00
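The mixed_line_endings check in the commit above works on raw bytes. One way to sketch it, assuming hypothetical helper names and ignoring the complication of newlines inside quoted CSV fields, is to count CRLF pairs first and subtract them from the bare LF and CR counts:

```python
def line_ending_counts(raw: bytes):
    """Count CRLF, bare LF, and bare CR terminators in raw bytes."""
    crlf = raw.count(b"\r\n")
    # Every CRLF contains one LF and one CR, so subtract them out.
    lf = raw.count(b"\n") - crlf
    cr = raw.count(b"\r") - crlf
    return {"CRLF": crlf, "LF": lf, "CR": cr}


def has_mixed_line_endings(raw: bytes) -> bool:
    """True when more than one terminator style appears in the input."""
    return sum(1 for n in line_ending_counts(raw).values() if n) > 1
```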
Michael	0671ef277e	feat(io): route read_file through pre-parse repair by default Previously only analyze() and direct read_csv_repaired() callers got the byte-level repair pass (BOM strip, NUL strip, smart-double-quote fold, unquoted-delimiter merge). The dedup CLI and any other read_file consumer silently missed it. read_file gains a repair=True default. CSV/TSV inputs run through repair_bytes before pandas sees them; Excel inputs still pass through unchanged. Chunked reads (chunk_size set) bypass repair because the pre- parse pass loads the whole file — preserving streaming behavior on huge files. Repair actions and unrepairable lines are logged at INFO/WARNING. cli_text_clean opts out (repair=False): the cleaner offers fine-grained control via --preset and per-op flags, and a byte-level smart-quote fold under the user's "minimal" preset would violate that contract. The cell-level cleaner does the equivalent work itself when its options ask for it. Tests: read_file default strips BOM and folds curly double quotes; repair=False preserves smart quotes; chunked reads still work and skip repair as documented. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:09:35 +00:00
Michael	0b959dee93	feat(text_clean): preserve internal whitespace in numeric/date/phone cells Closes the §4.17 spec gap that test_gap_coverage.py was tracking via xfail: collapse_whitespace must NOT touch cells whose shape carries meaningful internal whitespace. Adds _looks_structured(s) — returns True when s matches: - numeric (currency optional, thousand-grouping by , . or single space) - date (ISO/slash/dot separator, or 'Mon DD YYYY' / 'DD Mon YYYY') - phone (digits + parens/dots/dashes/+/spaces, >= 7 digits, no letters) The pipeline uses a new _smart_collapse_whitespace wrapper that defers to collapse_whitespace only when _looks_structured returns False. The raw collapse_whitespace function is unchanged so direct callers and existing unit tests remain valid. Five new positive tests replace the xfail: - "(555)  123-4567" preserved (phone, double space inside) - "1 234" preserved (European thousands) - "2024-01-15" preserved (ISO date) - "Jan 15 2024" preserved (textual date) - "hello  world" still collapsed to "hello world" (free-text negative case) Conservative on purpose: a false negative just collapses (existing behavior); a false positive leaves intentional double spaces in prose. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:09:25 +00:00
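The shape heuristic in the commit above could look something like the sketch below. The three shapes (numeric, date, phone) and the 7-digit phone threshold come from the commit message; the regexes themselves, and the names looks_structured and smart_collapse, are assumptions written for illustration and are almost certainly looser or stricter than the project's _looks_structured in places.

```python
import re

# Numeric: optional currency sigil, digits grouped by space/comma/dot,
# or a plain number with an optional decimal part.
_NUMERIC = re.compile(
    r"^[$€£]?\s*-?\d{1,3}(?:[ ,.]\d{3})*(?:[.,]\d+)?$"
    r"|^[$€£]?\s*-?\d+(?:[.,]\d+)?$"
)
# Date: ISO/slash/dot separated, or 'Mon DD YYYY' / 'DD Mon YYYY'.
_DATE = re.compile(
    r"^\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}$"
    r"|^[A-Za-z]{3,9}\s+\d{1,2},?\s+\d{4}$"
    r"|^\d{1,2}\s+[A-Za-z]{3,9}\s+\d{4}$"
)
# Phone: only digits, whitespace, parens, dots, dashes, plus (no letters).
_PHONE_CHARS = re.compile(r"^[\d\s().+\-]+$")


def looks_structured(s: str) -> bool:
    """True when the cell looks numeric, date-like, or phone-like."""
    s = s.strip()
    if _NUMERIC.match(s) or _DATE.match(s):
        return True
    return bool(_PHONE_CHARS.match(s)) and sum(c.isdigit() for c in s) >= 7


def smart_collapse(s: str) -> str:
    """Collapse runs of spaces/tabs only in cells that are not structured."""
    return s if looks_structured(s) else re.sub(r"[ \t]{2,}", " ", s)
```

Keeping the raw collapse as a separate step mirrors the commit's design choice: the wrapper decides whether to collapse, and the collapse itself stays unchanged.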
Michael	4687cf87b4	test: single-command runner, cross-platform automation, fixture auto-discovery Adds a top-level test infrastructure layer addressing four needs at once: a single command to run anything, cross-platform automation, install/e2e sanity, and zero-config pickup of new fixtures dropped into test-cases/. Top-level runner — run_tests.py python run_tests.py # everything (default) python run_tests.py --tool dedup # one tool's tests python run_tests.py --unit # category scopes python run_tests.py --e2e # end-to-end CLI python run_tests.py --install # import / dependency sanity python run_tests.py --fixtures # corpus + dropped-file sweep python run_tests.py --coverage # term-missing report python run_tests.py --quick # skip @pytest.mark.slow Tools: analyze, cli, config, dedup, io, normalizers, text_clean. Cross-platform — tox.ini Envs for py310-py313 plus install / e2e / fixtures / coverage / lint. Forces UTF-8 (PYTHONUTF8=1, PYTHONIOENCODING=utf-8) so identical fixture bytes parse the same on Linux/macOS/Windows. Shared config — pytest.ini testpaths, python_files conventions, custom markers (slow, e2e, install, fixture_sweep), warning filters that fail on our own DeprecationWarnings while tolerating third-party ones. New test layers tests/test_install.py — required deps import; project modules import; src.core public API surface; CLI --help exits 0; streamlit app.py parses as valid Python; run_tests.py --help works. tests/test_e2e.py — CLI roundtrips: cli_analyze table + JSON, cli_text_clean --apply writes a real file with NBSP/smart-quote folded, dedup CLI removes duplicates, run_tests.py self-tests. tests/test_fixtures_sweep.py — parametrizes over every CSV/TSV/XLSX inside test-cases/ (excluding text-cleaner-corpus/, which has its own suite). Each fixture must: load through repair_bytes, run analyze() cleanly, and survive clean_dataframe() with row/col counts unchanged plus idempotency. Drop a CSV in, re-run — no test code changes needed. 
tests/test_gap_coverage.py — closes audit gaps: clean_headers=False toggle, repair_bytes with tab/semicolon delimiters, BOM+NUL+smart- quote combined-fix scenario, analyze() over an XLSX path, sample_rows larger than the DataFrame, mid-cell BOM, findings_by_tool edges, plus a strict xfail documenting the known §4.17 numeric/phone whitespace heuristic gap. Test count Before: 288 passed + 1 xfailed After: 475 passed + 2 xfailed (the second xfail is the documented collapse_whitespace gap on phone-shaped cells; spec §4.17 calls for a heuristic that hasn't been implemented yet). Functional gaps surfaced (not fixed in this commit): - Text cleaner: collapse_whitespace runs unconditionally on every string cell, including phone/numeric/date-shaped ones. Spec §4.17 requires a skip heuristic. Captured as strict xfail so the gap stays visible. - io.read_file does not run pre-parse repair; only analyze() and direct callers of read_csv_repaired() get it. CLI tool pages and the dedup CLI miss the safety net. - Analyzer has no mixed_line_endings detector or near_duplicate_rows detector; both planned but require additional plumbing. - GUI tool pages each have their own uploader instead of picking up the home-page upload through session_state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:01:06 +00:00
Michael	5c62fb6117	feat(cli): src.cli_analyze — Typer CLI for the analyzer python -m src.cli_analyze input.csv # rich table per tool python -m src.cli_analyze input.csv --json # array of finding dicts python -m src.cli_analyze input.csv --strict # exit 1 on warn/error python -m src.cli_analyze input.csv -n 50000 # cap rows scanned Findings are grouped by destination tool so the user can see at a glance which tool to open next. Read-only; exit code 0 unless --strict is set. The CLI keeps its own tool-id -> display-name map so it doesn't depend on the GUI module. 7 tests cover: clean-file passthrough, dirty-file table, --json round-trip, missing-file (exit 2), --strict exit code, --sample-rows cap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:53:11 +00:00
Michael	edf6ccf90b	feat(analyze): upload-time data quality analyzer Pure, advisory scan over an uploaded file or DataFrame that returns a list of Finding objects naming each issue, the affected count, and which downstream tool can fix it. The GUI uses this to badge tool nav items at upload; the CLI will print findings as a table or JSON. src/core/analyze.py: Finding dataclass (id, severity, tool, count, description, column, samples) analyze(source, *, sample_rows=1000, repair_result=None) -> list[Finding] - source: DataFrame, path, or str. Path scans first 1000 rows. - When source is a path, runs the same pre-parse repair the tool pages will use; the resulting RepairResult is auto-surfaced as csv_* findings. A caller-supplied repair_result wins so non-default repair flags are respected. Detectors (each independent, samples capped at 5): - smart_punctuation_in_data -> 02 - nbsp_or_unicode_whitespace -> 02 - zero_width_or_invisible -> 02 - dirty_column_headers -> 02 - whitespace_padding -> 02 - null_like_sentinels -> 04 - suspected_mojibake -> 02 (Tier 2) - mixed_case_email_column -> 02 case op - leading_zero_ids -> informational, no tool Helpers: findings_by_tool() for sidebar grouping, to_dict() for JSON. Detectors are decoupled from the GUI display layer — they emit stable tool ids ("02_text_cleaner") and the GUI maps those to display names. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:41:36 +00:00
Michael	b8a9fa1b09	feat(io): pre-parse CSV repair (BOM/NUL/smart-quotes/unquoted-delim) Some pollution patterns block pandas before the cell-level cleaner can run. Add a pre-parse pass on raw bytes that fixes only what breaks parsing, and returns a structured action log the GUI/CLI can surface to the user. repair_bytes(raw, *, encoding, delimiter, fold_quotes, strip_nul, repair_delims): 1. Strip leading UTF-8 BOM. 2. Strip embedded NUL bytes (the C parser truncates fields at NUL). 3. Fold smart double quotes (curly, guillemet, double-prime) to ASCII '"'. Curly singles are NOT folded here; they don't conflict with CSV and the cell-level cleaner handles them more accurately. 4. Per-row repair when one rogue delimiter is embedded in a field that looks like currency or thousands-grouped digits. Tiered scoring keeps " $1,500.00 ,7" unambiguous: the strict currency regex match wins over the loose digit/sigil heuristic. read_csv_repaired(path) -> (DataFrame, RepairResult). RepairResult exposes .actions, .unrepairable_lines, and a summary() grouped by kind. Out of scope for this pass: encoding repair, delimiter conversion, multi- delimiter merges (k>1) — logged as unrepairable so callers can see what was left alone instead of silently parsing wrong. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:37:49 +00:00
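Steps 1 to 3 of the pre-parse pass in the commit above (BOM strip, NUL strip, smart double-quote fold) can be sketched as follows. This is not the project's repair_bytes: the function name, the action labels, and the return shape are assumptions, and the per-row delimiter repair in step 4 is left out entirely.

```python
import codecs

# Smart double quotes named in the commit: curly, guillemet, double-prime.
# Curly single quotes are deliberately absent, matching the commit's note.
_SMART_DOUBLE = {
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u00ab": '"', "\u00bb": '"',   # guillemets
    "\u2033": '"',                  # double prime
}
_TABLE = str.maketrans(_SMART_DOUBLE)


def repair_bytes_sketch(raw: bytes, encoding: str = "utf-8"):
    """Return (repaired_bytes, actions) for the three byte-level fixes."""
    actions = []  # hypothetical action labels, for illustration only
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
        actions.append("strip_bom")
    if b"\x00" in raw:
        raw = raw.replace(b"\x00", b"")
        actions.append("strip_nul")
    text = raw.decode(encoding)
    folded = text.translate(_TABLE)
    if folded != text:
        actions.append("fold_smart_quotes")
    return folded.encode(encoding), actions
```

Returning the action list alongside the bytes is what lets a caller surface what was changed, in the spirit of the structured action log the commit describes.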
Michael	c349a90e18	test: add text-cleaner corpus and close gaps surfaced by it The 21-fixture corpus (test-cases/text-cleaner-corpus/) exercises the cleaner end-to-end against the spec in TEST-CASES.md. Closing the failing cases drove five small cleaner fixes plus two fixture-generation fixes: - _SMART_CHARS: add prime, double prime, guillemets (case 03) - _ZERO_WIDTH: add soft hyphen U+00AD (case 05) - clean_dataframe: clean column headers via the same pipeline (cases 16/19/20), with a clean_headers toggle on CleanOptions - smart_title_case: title-case full-shout strings ("ALICE SMITH" -> "Alice Smith") while still preserving embedded acronyms; preserve uppercase after apostrophe in names ("O'CONNOR" -> "O'Connor", "o'neil" -> "O'neil") - test_corpus.py reader: pre-strip NUL bytes (C parser truncates at NUL, python engine is too strict about embedded literal "), per spec case 06 - generate_test_data.py: properly CSV-escape literal-quote cells in case 03 expected; quote the rogue-comma price field in case 17 input Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:37:35 +00:00
Michael	54f92ae47e	feat: implement text cleaner (script 02) with CLI, GUI, and tests Builds 02_text_cleaner.py from stub to working: character-level hygiene for CSV/Excel inputs covering trim, whitespace collapse, smart-character folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char strip, line-ending normalization, and per-column case conversion. Three presets (minimal/excel-hygiene/paranoid) keep the buyer surface small. - src/core/text_clean.py: pure helpers + CleanOptions/CleanResult + clean_dataframe with dtype-safe column selection - src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape (dry-run by default, --apply writes cleaned + changes audit, JSON config save/load) - src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset picker, advanced toggles, preview, before/after metrics, and three download buttons - tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests covering edge cases E1-E50 from the spec - samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10 in 10 rows - test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case fixtures Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7 entry locking the spec, CLI-REFERENCE.md gains the text cleaner section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md status row 02 promoted Skeleton -> Working. 200/200 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:14:15 +00:00
Michael	b871ab24fc	feat: add documentation, Streamlit GUI, and full source tree - Rewrite README.md with project overview, quick-start, and CLI summary - Add docs/CLI-REFERENCE.md with full flag reference and 8 recipe sections - Add docs/DEVELOPER.md with architecture, data flow, and extension guides - Rewrite src/core/__init__.py with public API exports and module docstring - Add Streamlit GUI (src/gui/) with file upload, advanced options, interactive match group review with side-by-side diff, and download buttons - Add .gitignore, requirements.txt, all source code, tests, and sample data - Add streamlit to requirements.txt Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-28 23:06:39 +00:00

65 Commits