datatools-dev

Author	SHA1	Message	Date
Michael	82d7fef21e	feat(gate): CSV-normalization gate with confidence-tiered findings Adds a Review & Normalize page that sits between upload and every tool page. The analyzer now tags each finding with confidence (high/medium/low) and a fix_action; the gate auto-applies high-confidence fixes, surfaces medium/low ones for user review, and blocks tool pages on error-level findings until resolved or waived. Core (src/core/): - analyze.py: Finding gains confidence, fix_action, pre_applied; new detectors for encoding_uncertain, encoding_decode_failed; new top- level encoding_override parameter. - fixes.py: registry of fix algorithms keyed by fix_action id. - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and the NormalizationResult / Decision dataclasses the gate consumes. - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption) and normalizes line endings (fixes bare-CR parser crash); empty file handled gracefully instead of EmptyDataError traceback. GUI (src/gui/): - pages/0_Review.py: gate page with per-finding decision controls, encoding override picker (16 codepages + custom), and Advanced output options (encoding, delimiter, line terminator) on the download. - components.py: require_normalization_gate() helper. - pages/1-9: gate guard wired on every tool page. Test corpora: - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference UTF-8 files + manifest, synced from Business/DataTools. - test-cases/text-cleaner-corpus/test_data/17: synced malformed input (unquoted $1,500.00) for the unquoted-delimiter detector. Tests (94 new): - test_normalize.py (48): finding fields, fix registry, auto_fix scope, decision paths, gate idempotency, output-options helper. - test_encodings_corpus.py (90, 16 xfailed): parametric detection + decode + analyzer-no-crash sweep against the manifest. - test_analyze.py: encoding override + encoding_uncertain detectors. - test_corpus.py: pre-parse repair in the strict reader. run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate; encodings corpus added to --fixtures category. Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema, gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds the analyzer JSON schema with the new fields; README links to all of it. Suite: 765 passed, 17 xfailed (was 458 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 20:35:27 +00:00
Michael	e9c490ae1b	feat(gui): hidden-char-aware preview tables in Text Cleaner The Text Cleaner had two st.dataframe previews — the initial upload preview ("Preview: filename") and the post-clean "Cleaned preview" table — that both rendered cells with the same browser-collapses- whitespace, hides-invisibles problem the analyzer findings panel had before commit `1049c03`. components.render_hidden_aware_preview(df, n_rows, caption) renders a DataFrame as an HTML table where: - every cell uses visualize_hidden_html(mark_outer_whitespace=True), so leading/trailing ASCII spaces appear as per-character "·" badges - white-space: pre-wrap on every cell preserves internal multi-space runs and embedded newlines visually - headers route through the same visualizer so dirty column names (NBSP padding, ZWSP, smart quotes) show their badges too - NaN cells render as a faint "NaN" placeholder - rows are sticky-headed and scrollable inside a 26rem capped container so a 10-row preview doesn't push the rest of the UI off screen 2_Text_Cleaner.py wires it into both previews: - The upload preview gains its own "Show hidden characters in preview" toggle (default on). - The cleaned preview reuses the existing show_hidden toggle that already governs the Examples changes table, so one switch controls the whole results section. Either toggle off falls back to the original st.dataframe view. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:26:30 +00:00
Michael	90ceada2d1	feat(text_clean): visualize hidden characters in the cleaner GUI The whole point of the cleaner is to remove characters the user can't see — which makes the "before / after" preview nearly useless by default. A cell with NBSP padding looks identical to a cell with regular spaces. Two new helpers in src.core.text_clean: visualize_hidden_text(s) Plain-text rendering: each invisible/control/smart character is replaced by a glyph + [LABEL] (e.g. "·[NBSP]", "→[TAB]", "∅[ZWSP]", """[L DQUOTE]"). Suitable for terminal output, CSV exports, anywhere HTML is wrong. Unmapped C0 controls render as [U+XXXX]. visualize_hidden_html(s) + hidden_char_css() HTML rendering: every flagged character is wrapped in a <span> with a CSS class and a tooltip showing the codepoint and label. Pair with hidden_char_css() to inject the matching styles. Three colour bands (whitespace, special, control) so the user can scan an audit table and spot what's being changed at a glance. Mapping covers: ASCII tab/LF/CR, every NBSP variant (U+00A0, U+202F, U+2009, …), zero-width family (ZWSP/ZWNJ/ZWJ/WJ/BOM/SHY), bidi marks (LRM/RLM), all smart quotes, en/em dashes, ellipsis, prime/double-prime, and guillemets. ASCII printable text passes through; HTML output also escapes &/</> . GUI wiring (src/gui/pages/2_Text_Cleaner.py) The "Examples" changes table now defaults to a hidden-char-rendered HTML view: every NBSP/ZWSP/smart-quote/control char is shown with its badge and codepoint tooltip. A "Show hidden characters" toggle lets the user fall back to the raw st.dataframe view if they prefer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:14:14 +00:00
Michael	794d4cda94	feat(gui): tool pages pick up the home-page upload via session_state Closes the last UX gap from the analyzer review: each tool page had its own st.file_uploader, so users had to upload the same file twice (once on the home page for analysis, once on each tool page). components.pickup_or_upload(label, key, types) returns either: - a _StashedUpload shim wrapping the home-page bytes (when present and the user hasn't asked for a different file on this page), or - the standard st.file_uploader (when nothing is stashed or the user clicked "Use a different file"). _StashedUpload duck-types Streamlit's UploadedFile (.name, .size, .getvalue(), .read()) so existing tool-page code consumes it without changes. A "Use a different file" button per page sets a session-state override flag; a "Switch back to upload-screen file" button clears it. Wired into 2_Text_Cleaner.py and 1_Deduplicator.py — the two pages with working uploaders today. The remaining stub pages adopt it when they're implemented; the helper is the public surface they'll use. Verified by smoke-launching streamlit headless and curling the home, text-cleaner, and deduplicator routes — all return 200 with no errors in the server log. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:09:51 +00:00
Michael	54f92ae47e	feat: implement text cleaner (script 02) with CLI, GUI, and tests Builds 02_text_cleaner.py from stub to working: character-level hygiene for CSV/Excel inputs covering trim, whitespace collapse, smart-character folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char strip, line-ending normalization, and per-column case conversion. Three presets (minimal/excel-hygiene/paranoid) keep the buyer surface small. - src/core/text_clean.py: pure helpers + CleanOptions/CleanResult + clean_dataframe with dtype-safe column selection - src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape (dry-run by default, --apply writes cleaned + changes audit, JSON config save/load) - src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset picker, advanced toggles, preview, before/after metrics, and three download buttons - tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests covering edge cases E1-E50 from the spec - samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10 in 10 rows - test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case fixtures Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7 entry locking the spec, CLI-REFERENCE.md gains the text cleaner section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md status row 02 promoted Skeleton -> Working. 200/200 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:14:15 +00:00
Michael	35ea21ad33	feat: hide Streamlit chrome for app-like appearance Add shared hide_streamlit_chrome() helper that removes header bar, hamburger menu, footer, and deploy button via CSS injection. Called on every page. Add .streamlit/config.toml with minimal toolbar mode. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 01:20:54 +00:00
Michael	f2fdc10af7	feat: refactor GUI to multi-page Streamlit app with 9 tool pages Convert single-page deduplicator into a multi-page suite. Home page shows tool card grid. Deduplicator extracted to its own page (fully working). 8 stub pages added for Text Cleaner, Format Standardizer, Missing Values, Column Mapper, Outlier Detector, Multi-File Merger, Validator & Reporter, and Pipeline Runner — each with functional file upload and coming-soon UI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 01:16:12 +00:00

7 Commits