datatools-dev

Author	SHA1	Message	Date
Michael	3dd924474c	docs: short-form numbered requirements list New docs/REQUIREMENTS.md catalogs every shipped capability in 17 numbered categories — file handling, input/output encodings, delimiters, line endings, detectors, finding schema, confidence tiers, decisions, performance targets (1 GB), tools, gate behavior, interfaces, platforms, deps, test coverage, privacy. Linked from README and USER-GUIDE so a buyer / integrator can scan compliance in under a minute. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 21:19:21 +00:00
Michael	ec56b1994b	chore: remove one-time 1.25GB stress harness The stress benchmark served its purpose — perf findings shipped in `438bc0f` (1 GB-class file efficiency for the analyzer + gate pipeline). Removing the script and the (already auto-deleted) test fixture so the repo doesn't carry one-time scaffolding. Future ad-hoc benchmarks can resurrect this from git history. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 21:15:58 +00:00
Michael	438bc0f84d	perf: 1 GB-class file efficiency for the analyzer + gate pipeline Six targeted changes that drop the user-visible analyzer scan time from "go for coffee" to sub-second on 1 GB inputs and reduce peak RSS by ~10×. src/core/io.py - detect_encoding: open + read sample bytes instead of read_bytes()[:N]. Was allocating the full file in memory just to slice the head; on a 1 GB input this saves a 1 GB intermediate allocation. - repair_bytes: byte-level smart-quote fold via bytes.replace when the input is UTF-8. The probe (b"\\xe2\\x80" / b"\\xc2\\xab" / b"\\xc2\\xbb") is a single C-implemented contains check that skips the entire fold stage on files with no smart quotes — most of them. - repair_bytes: skip the per-row csv.reader walk unless a cheap byte scan finds a currency sigil ($/€/£), the delimiter is non-comma, the decoder substituted U+FFFD, or _has_field_count_mismatch detects an unquoted-delimiter row. csv.reader was the dominant cost in repair_bytes on big files (materializes a list of every row). - _has_field_count_mismatch: hand-rolled quote-state walker; one pass, no allocation, returns True at first mismatch. False positives just fall through to the slower _repair_rows pass. src/core/analyze.py - _load_for_analysis: read only ~max(4KB, sample_rows × 256B × 2) head bytes for the analyzer's sample-mode scan. Drops analyze(sample_rows=1000) from "read + repair full file" to "read + repair 500KB" — 150× faster on a 1.25 GB file. Falls back to a single full-file retry if pandas reports fewer rows than the cap. - Compiled regex character classes for hot-path detectors and a _vec_match_count helper that runs Series.str.contains in C instead of Python per-cell loops. Detectors converted: smart_punctuation, invisible_chars (NBSP + zero-width), whitespace_padding, null_like_sentinels, mojibake, encoding_uncertainty, mixed_case_email, leading_zero_ids. 
src/core/fixes.py - _vectorized_translate / _vectorized_regex_sub: pandas-native string transforms for the fixes that are pure character maps (strip_nbsp, fold_smart_punctuation, strip_zero_width). Series.str.translate runs in C — 10-50× faster than per-cell Python. - _apply_to_strings: replaced inner per-cell loops with Series.map + boolean-mask diff for the count. - All fix entry points read an "inplace" flag from payload and thread it through the helpers. src/core/normalize.py - apply_decisions: takes a single working copy at the top, then sets payload["inplace"] = True so each chained fix mutates that copy. Previously every fix did df.copy(); N fixes × 6 GB DataFrame = 30+ GB peak. Now: one 6 GB allocation. Validation: 765 passed, 17 xfailed (no regressions). 100 MB benchmark:
stage                            before        after
------------------------------   -----------   -----------
detect_encoding                  0.97s+1.3GB   ~0s + 0 MB
analyze (sample_rows=1000)       235.76s       0.08s
_load_for_analysis (1000 rows)   148.17s       0.01s
repair_bytes (full file)         150s/1.25GB   2.91s/100MB
The user-visible analyzer scan dropped from minutes to sub-second on 1 GB-class files. Full-DataFrame analyze + auto_fix improvements are more modest (~25%) because trim_whitespace and replace_null_sentinels still need per-cell Python for the structural-shape checks, but the hot path through these is now bounded by pandas' .map rather than a manual for loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 21:13:47 +00:00
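The quote-state walker described in the commit above (one pass, early exit at the first mismatched row) could look roughly like the sketch below. This is not the project's _has_field_count_mismatch; the function name, the single-byte delimiter, and the choice to ignore bare-CR line endings are assumptions made for illustration.

```python
def has_field_count_mismatch(raw: bytes, delimiter: bytes = b",") -> bool:
    """Return True as soon as a row's field count differs from the first row's.

    Hypothetical sketch: assumes a single-byte delimiter and LF or CRLF rows.
    """
    delim = delimiter[0]
    quote, lf, cr = 0x22, 0x0A, 0x0D
    expected = None        # field count of the first non-empty row
    fields = 1             # fields seen so far in the current row
    in_quotes = False
    row_has_data = False
    for b in raw:          # iterating bytes yields ints, no per-row lists
        if b == quote:
            # A doubled "" inside a quoted field toggles twice, so the
            # walker ends up back in the quoted state.
            in_quotes = not in_quotes
            row_has_data = True
        elif in_quotes:
            continue       # delimiters and newlines inside quotes are data
        elif b == delim:
            fields += 1
            row_has_data = True
        elif b == lf:
            if row_has_data:
                if expected is None:
                    expected = fields
                elif fields != expected:
                    return True
            fields = 1
            row_has_data = False
        elif b != cr:
            row_has_data = True
    # Last row when the file has no trailing newline.
    return row_has_data and expected is not None and fields != expected


print(has_field_count_mismatch(b"name,price\nwidget,$1,500.00\n"))
```

A True result only means "worth running the slower per-row repair", which matches the commit's note that false positives simply fall through to the heavier pass.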
Michael	f891c6116d	refactor(gui): tool registry + components package for per-tool builds Two low-risk seam moves to enable selling per-tool subsets without breaking the existing all-in-one bundle. Behaviour identical; every existing import still resolves; full pytest suite + every page returns HTTP 200. 1. Tool registry (src/gui/tools_registry.py) — replaces the inline dict-of-dicts in app.py with a Tool dataclass and a TOOLS list. Adds a tier field ("core" today, "pro" / "enterprise" later) and tools_for_tier() / tool_by_id() / display_name() helpers. A per-tool build slices TOOLS at import time without code changes. 2. components package (src/gui/components/) — converts the former single components.py into a package with: _legacy.py — original file, unchanged. __init__.py — re-exports the legacy surface; existing "from src.gui.components import …" calls continue to work. shared.py — hide_streamlit_chrome, pickup_or_upload (every build needs these). gate.py — require_normalization_gate (Pro / Suite SKUs). findings.py — analyzer-finding widgets (drops out of a standalone-Dedup build). dedup_review.py — match-group cards + apply pipeline (drops out of a non-dedup build). The seam modules are narrow re-exports today. As code migrates out of _legacy.py into the focused modules, the public import path stays stable via the shim. E2E: 765 passed, 17 xfailed (unchanged); home page + all 9 tool pages + Review page render HTTP 200; full pipeline (analyze → auto_fix → apply_decisions → output bytes) round-trips on the kitchen-sink fixture with zero high-confidence findings remaining post-fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 20:56:21 +00:00
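A minimal sketch of the registry shape this commit describes: a Tool dataclass, a TOOLS list, and the three helpers. Only the names Tool, TOOLS, tier, tools_for_tier, tool_by_id and display_name come from the commit message. The other fields, the example entries, the "pro" entry, and the cumulative tier ordering are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed ordering: a higher tier includes every tool of the tiers below it.
_TIER_RANK = {"core": 0, "pro": 1, "enterprise": 2}


@dataclass(frozen=True)
class Tool:
    id: str              # stable tool id, e.g. "02_text_cleaner"
    name: str            # label shown in the GUI
    page: str            # page path relative to the Streamlit entrypoint
    tier: str = "core"


TOOLS: List[Tool] = [
    Tool("01_deduplicator", "Deduplicator", "pages/1_Deduplicator.py"),
    Tool("02_text_cleaner", "Text Cleaner", "pages/2_Text_Cleaner.py"),
    # Hypothetical non-core entry, only here to exercise the tier filter.
    Tool("99_example_pro_tool", "Example Pro Tool", "pages/99_Example.py", tier="pro"),
]


def tools_for_tier(tier: str) -> List[Tool]:
    """Tools available at the given tier (cumulative by assumption)."""
    limit = _TIER_RANK[tier]
    return [t for t in TOOLS if _TIER_RANK[t.tier] <= limit]


def tool_by_id(tool_id: str) -> Optional[Tool]:
    return next((t for t in TOOLS if t.id == tool_id), None)


def display_name(tool_id: str) -> str:
    """Fall back to the raw id so unknown tools still render something."""
    tool = tool_by_id(tool_id)
    return tool.name if tool else tool_id


print([t.name for t in tools_for_tier("core")])
```

With a list like this, a per-tool build only needs to filter TOOLS once at import time, which is the seam the commit is after.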
Michael	70ed695027	test(scripts): one-shot 1.25GB stress harness for the gate pipeline Generates a synthetic messy CSV at the target size, then runs every pipeline stage end-to-end (detect_encoding, repair_bytes, analyze, auto_fix on sample + full file) capturing wall-clock and peak RSS at each stage. Not part of the automated suite — invoke directly via ``python scripts/stress_1_25gb.py``. ``--keep`` to preserve the file between runs, ``--target-gb`` to tune the size. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 20:52:27 +00:00
Michael	82d7fef21e	feat(gate): CSV-normalization gate with confidence-tiered findings Adds a Review & Normalize page that sits between upload and every tool page. The analyzer now tags each finding with confidence (high/medium/low) and a fix_action; the gate auto-applies high-confidence fixes, surfaces medium/low ones for user review, and blocks tool pages on error-level findings until resolved or waived. Core (src/core/): - analyze.py: Finding gains confidence, fix_action, pre_applied; new detectors for encoding_uncertain, encoding_decode_failed; new top- level encoding_override parameter. - fixes.py: registry of fix algorithms keyed by fix_action id. - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and the NormalizationResult / Decision dataclasses the gate consumes. - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption) and normalizes line endings (fixes bare-CR parser crash); empty file handled gracefully instead of EmptyDataError traceback. GUI (src/gui/): - pages/0_Review.py: gate page with per-finding decision controls, encoding override picker (16 codepages + custom), and Advanced output options (encoding, delimiter, line terminator) on the download. - components.py: require_normalization_gate() helper. - pages/1-9: gate guard wired on every tool page. Test corpora: - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference UTF-8 files + manifest, synced from Business/DataTools. - test-cases/text-cleaner-corpus/test_data/17: synced malformed input (unquoted $1,500.00) for the unquoted-delimiter detector. Tests (94 new): - test_normalize.py (48): finding fields, fix registry, auto_fix scope, decision paths, gate idempotency, output-options helper. - test_encodings_corpus.py (90, 16 xfailed): parametric detection + decode + analyzer-no-crash sweep against the manifest. - test_analyze.py: encoding override + encoding_uncertain detectors. 
- test_corpus.py: pre-parse repair in the strict reader. run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate; encodings corpus added to --fixtures category. Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema, gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds the analyzer JSON schema with the new fields; README links to all of it. Suite: 765 passed, 17 xfailed (was 458 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 20:35:27 +00:00
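The gate behaviour in this commit (auto-apply high-confidence fixes, surface the rest for review, block on error-level findings until resolved or waived) can be sketched as a small triage function. The Finding fields below are a subset of those named in the commit. The triage() function, its return shape, and the waived set are hypothetical and are not the project's auto_fix() / apply_decisions() API.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Finding:
    id: str
    severity: str                      # "info" | "warn" | "error"
    confidence: str                    # "high" | "medium" | "low"
    fix_action: Optional[str] = None   # id of a registered fix, if any


def triage(
    findings: List[Finding], waived: FrozenSet[str] = frozenset()
) -> Tuple[List[Finding], List[Finding], bool]:
    """Split findings into auto-fixable and review-needed, and decide blocking."""
    # High confidence plus a known fix: applied without asking the user.
    auto_fix = [f for f in findings if f.confidence == "high" and f.fix_action]
    # Everything else is shown to the user for a decision.
    needs_review = [f for f in findings if f not in auto_fix]
    # Tool pages stay blocked while an error-level finding is neither
    # resolved nor explicitly waived.
    blocked = any(
        f.severity == "error" and f.id not in waived for f in needs_review
    )
    return auto_fix, needs_review, blocked


findings = [
    Finding("whitespace_padding", "warn", "high", "trim_whitespace"),
    Finding("encoding_decode_failed", "error", "low"),
]
auto, review, blocked = triage(findings)
print(len(auto), len(review), blocked)
```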
Michael	e9c490ae1b	feat(gui): hidden-char-aware preview tables in Text Cleaner The Text Cleaner had two st.dataframe previews — the initial upload preview ("Preview: filename") and the post-clean "Cleaned preview" table — that both rendered cells with the same browser-collapses-whitespace, hides-invisibles problem the analyzer findings panel had before commit `1049c03`. components.render_hidden_aware_preview(df, n_rows, caption) renders a DataFrame as an HTML table where: - every cell uses visualize_hidden_html(mark_outer_whitespace=True), so leading/trailing ASCII spaces appear as per-character "·" badges - white-space: pre-wrap on every cell preserves internal multi-space runs and embedded newlines visually - headers route through the same visualizer so dirty column names (NBSP padding, ZWSP, smart quotes) show their badges too - NaN cells render as a faint "NaN" placeholder - rows are sticky-headed and scrollable inside a 26rem capped container so a 10-row preview doesn't push the rest of the UI off screen 2_Text_Cleaner.py wires it into both previews: - The upload preview gains its own "Show hidden characters in preview" toggle (default on). - The cleaned preview reuses the existing show_hidden toggle that already governs the Examples changes table, so one switch controls the whole results section. Either toggle off falls back to the original st.dataframe view. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:26:30 +00:00
Michael	1049c033cb	feat(gui): visualize leading/trailing whitespace in analyzer findings The analyzer's "Run Analysis" panel rendered sample cells via st.dataframe, which (a) silently collapses leading/trailing ASCII whitespace and (b) displays NBSP/ZWSP/control chars as nothing. The user couldn't see the exact pollution they were being told about. visualize_hidden_html gains a mark_outer_whitespace=True option that wraps each leading and trailing ASCII space/tab in its own badge with a "SP LEAD" / "SP TRAIL" tooltip. The badges are per-character so the user can count exactly how much padding the cleaner will strip. components.render_findings_panel now: - injects hidden_char_css() once at the top of the panel - replaces st.dataframe(samples) with a custom HTML table - renders the value column with mark_outer_whitespace=True - applies white-space: pre-wrap on value cells so any internal ASCII whitespace also stays visible (browsers collapse runs by default) Four new tests cover: leading+trailing badge counts, default-off behaviour, leading tab badge, all-whitespace string treated entirely as leading. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:21:39 +00:00
Michael	e12615357d	fix(gui): use page paths relative to streamlit entrypoint st.page_link resolves paths from the directory of the entrypoint file (src/gui/app.py), so the existing "src/gui/{page_slug}" prefix doubled up and produced StreamlitPageNotFoundError on first upload + analysis (reproducible on Windows; the stack trace from a Windows install surfaced the bug). The _TOOL_PAGE_PATHS map already stores the correct relative form ("pages/2_Text_Cleaner.py"); just pass the slug straight to st.page_link. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:17:50 +00:00
Michael	90ceada2d1	feat(text_clean): visualize hidden characters in the cleaner GUI The whole point of the cleaner is to remove characters the user can't see — which makes the "before / after" preview nearly useless by default. A cell with NBSP padding looks identical to a cell with regular spaces. Two new helpers in src.core.text_clean: visualize_hidden_text(s) Plain-text rendering: each invisible/control/smart character is replaced by a glyph + [LABEL] (e.g. "·[NBSP]", "→[TAB]", "∅[ZWSP]", """[L DQUOTE]"). Suitable for terminal output, CSV exports, anywhere HTML is wrong. Unmapped C0 controls render as [U+XXXX]. visualize_hidden_html(s) + hidden_char_css() HTML rendering: every flagged character is wrapped in a <span> with a CSS class and a tooltip showing the codepoint and label. Pair with hidden_char_css() to inject the matching styles. Three colour bands (whitespace, special, control) so the user can scan an audit table and spot what's being changed at a glance. Mapping covers: ASCII tab/LF/CR, every NBSP variant (U+00A0, U+202F, U+2009, …), zero-width family (ZWSP/ZWNJ/ZWJ/WJ/BOM/SHY), bidi marks (LRM/RLM), all smart quotes, en/em dashes, ellipsis, prime/double-prime, and guillemets. ASCII printable text passes through; HTML output also escapes &/</> . GUI wiring (src/gui/pages/2_Text_Cleaner.py) The "Examples" changes table now defaults to a hidden-char-rendered HTML view: every NBSP/ZWSP/smart-quote/control char is shown with its badge and codepoint tooltip. A "Show hidden characters" toggle lets the user fall back to the raw st.dataframe view if they prefer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:14:14 +00:00
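As a rough illustration of the plain-text renderer described in this commit, the sketch below maps three of the listed characters to a glyph plus [LABEL] and renders other C0 controls as [U+XXXX]. The three mappings are taken from the examples in the commit message. The table is deliberately incomplete (the real mapping also covers LF/CR, other NBSP variants, bidi marks, smart quotes and more), and the function body is an assumption.

```python
# Subset of the mapping, using the glyph + label pairs quoted in the commit.
_HIDDEN = {
    "\u00a0": "·[NBSP]",
    "\t": "→[TAB]",
    "\u200b": "∅[ZWSP]",
}


def visualize_hidden_text(s: str) -> str:
    """Replace invisible characters with a visible glyph and label."""
    out = []
    for ch in s:
        if ch in _HIDDEN:
            out.append(_HIDDEN[ch])
        elif ord(ch) < 0x20:
            # Unmapped C0 control: show its codepoint instead of nothing.
            out.append(f"[U+{ord(ch):04X}]")
        else:
            out.append(ch)   # printable text passes through unchanged
    return "".join(out)


print(visualize_hidden_text("Acme\u00a0Corp\tLtd"))
```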
Michael	794d4cda94	feat(gui): tool pages pick up the home-page upload via session_state Closes the last UX gap from the analyzer review: each tool page had its own st.file_uploader, so users had to upload the same file twice (once on the home page for analysis, once on each tool page). components.pickup_or_upload(label, key, types) returns either: - a _StashedUpload shim wrapping the home-page bytes (when present and the user hasn't asked for a different file on this page), or - the standard st.file_uploader (when nothing is stashed or the user clicked "Use a different file"). _StashedUpload duck-types Streamlit's UploadedFile (.name, .size, .getvalue(), .read()) so existing tool-page code consumes it without changes. A "Use a different file" button per page sets a session-state override flag; a "Switch back to upload-screen file" button clears it. Wired into 2_Text_Cleaner.py and 1_Deduplicator.py — the two pages with working uploaders today. The remaining stub pages adopt it when they're implemented; the helper is the public surface they'll use. Verified by smoke-launching streamlit headless and curling the home, text-cleaner, and deduplicator routes — all return 200 with no errors in the server log. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:09:51 +00:00
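The duck-typing idea behind _StashedUpload can be sketched without Streamlit: wrap the stashed bytes in an object exposing the same four members the commit lists (.name, .size, .getvalue(), .read()). The class below is an illustrative stand-in, not the project's implementation. Backing it with io.BytesIO is an assumption.

```python
import io


class StashedUpload:
    """Minimal stand-in for an uploaded file, backed by bytes kept in session state."""

    def __init__(self, name: str, data: bytes) -> None:
        self.name = name
        self.size = len(data)
        self._buf = io.BytesIO(data)

    def getvalue(self) -> bytes:
        # Whole payload, independent of the current read position.
        return self._buf.getvalue()

    def read(self, n: int = -1) -> bytes:
        # File-like read that advances the position, as callers expect.
        return self._buf.read(n)


upload = StashedUpload("customers.csv", b"id,name\n1,Alice\n")
print(upload.name, upload.size)
```

Tool-page code that only touches those four members can then take either this shim or a real uploader result without branching.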
Michael	8dfc6ad8ae	feat(analyze): add mixed_line_endings + near_duplicate_rows detectors Two more detectors close the analyzer gap list: mixed_line_endings (warn, tool=02): scans raw bytes for combinations of CRLF / LF / bare CR. Disaster pattern after multi-source concat (Windows + macOS + Linux exports stitched together). Operates on raw bytes only — DataFrame-mode analyze() skips it because raw bytes aren't available. _load_for_analysis now returns the raw bytes alongside the DataFrame and repair result so the detector has them. near_duplicate_rows (info, tool=01): cheap dedup signal — strip and lowercase every string column, then count df.duplicated(). Catches the most common case (same customer entered twice with subtle formatting differences) without paying for fuzzy matching. Anything more sophisticated stays in tool 01. Six new tests cover both detectors plus the dataframe-mode skip path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:09:42 +00:00
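Both detectors in this commit are simple enough to sketch. The helpers below are illustrative only: the function names are hypothetical, and the real detectors return Finding objects rather than counts or booleans. The near-duplicate check follows the recipe in the message (strip and lowercase string columns, then count df.duplicated()).

```python
import pandas as pd


def count_line_endings(raw: bytes) -> dict:
    """Count CRLF, lone LF and lone CR terminators in raw bytes."""
    crlf = raw.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF": raw.count(b"\n") - crlf,   # LFs not preceded by CR
        "CR": raw.count(b"\r") - crlf,   # CRs not followed by LF
    }


def has_mixed_line_endings(raw: bytes) -> bool:
    return sum(1 for n in count_line_endings(raw).values() if n) > 1


def near_duplicate_count(df: pd.DataFrame) -> int:
    """Rows that become exact duplicates after strip + lowercase."""
    norm = df.copy()
    for col in norm.columns:
        if pd.api.types.is_string_dtype(norm[col]):
            norm[col] = norm[col].str.strip().str.lower()
    return int(norm.duplicated().sum())


df = pd.DataFrame(
    {
        "name": [" Alice Smith", "alice smith ", "Bob Jones"],
        "email": ["A@X.COM", "a@x.com", "bob@y.com"],
    }
)
print(near_duplicate_count(df), has_mixed_line_endings(b"a\r\nb\nc\r"))
```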
Michael	0671ef277e	feat(io): route read_file through pre-parse repair by default Previously only analyze() and direct read_csv_repaired() callers got the byte-level repair pass (BOM strip, NUL strip, smart-double-quote fold, unquoted-delimiter merge). The dedup CLI and any other read_file consumer silently missed it. read_file gains a repair=True default. CSV/TSV inputs run through repair_bytes before pandas sees them; Excel inputs still pass through unchanged. Chunked reads (chunk_size set) bypass repair because the pre- parse pass loads the whole file — preserving streaming behavior on huge files. Repair actions and unrepairable lines are logged at INFO/WARNING. cli_text_clean opts out (repair=False): the cleaner offers fine-grained control via --preset and per-op flags, and a byte-level smart-quote fold under the user's "minimal" preset would violate that contract. The cell-level cleaner does the equivalent work itself when its options ask for it. Tests: read_file default strips BOM and folds curly double quotes; repair=False preserves smart quotes; chunked reads still work and skip repair as documented. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:09:35 +00:00
Michael	0b959dee93	feat(text_clean): preserve internal whitespace in numeric/date/phone cells Closes the §4.17 spec gap that test_gap_coverage.py was tracking via xfail: collapse_whitespace must NOT touch cells whose shape carries meaningful internal whitespace. Adds _looks_structured(s) — returns True when s matches: - numeric (currency optional, thousand-grouping by , . or single space) - date (ISO/slash/dot separator, or 'Mon DD YYYY' / 'DD Mon YYYY') - phone (digits + parens/dots/dashes/+/spaces, >= 7 digits, no letters) The pipeline uses a new _smart_collapse_whitespace wrapper that defers to collapse_whitespace only when _looks_structured returns False. The raw collapse_whitespace function is unchanged so direct callers and existing unit tests remain valid. Five new positive tests replace the xfail: - "(555)  123-4567" preserved (phone, double space inside) - "1 234" preserved (European thousands) - "2024-01-15" preserved (ISO date) - "Jan 15 2024" preserved (textual date) - "hello  world" still collapsed to "hello world" (free-text negative case) Conservative on purpose: a false negative just collapses (existing behavior); a false positive leaves intentional double spaces in prose. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:09:25 +00:00
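The shape check could be approximated with three regular expressions, one per category named in the commit. The patterns below are assumptions written to satisfy the five listed test cases. They are not the project's _looks_structured, and collapse_whitespace is assumed here to reduce any whitespace run to a single space.

```python
import re

# Numeric: optional sign and currency sigil, thousands grouped by , . or one space.
_NUMERIC = re.compile(
    r"[+-]?[$€£]?\s?(?:\d{1,3}(?:[,. ]\d{3})+|\d+)(?:[.,]\d+)?"
)
# Date: ISO / slash / dot separated, or 'Mon DD YYYY' / 'DD Mon YYYY'.
_DATE = re.compile(
    r"\d{4}[-/.]\d{1,2}[-/.]\d{1,2}"
    r"|\d{1,2}[-/.]\d{1,2}[-/.]\d{2,4}"
    r"|[A-Za-z]{3,9}\.?\s+\d{1,2},?\s+\d{4}"
    r"|\d{1,2}\s+[A-Za-z]{3,9}\.?,?\s+\d{4}"
)
# Phone: only digits and phone punctuation; digit count checked separately.
_PHONE_CHARS = re.compile(r"[\d()+.\-\s]+")


def looks_structured(s: str) -> bool:
    s = s.strip()
    if not s:
        return False
    if _NUMERIC.fullmatch(s) or _DATE.fullmatch(s):
        return True
    digits = sum(ch.isdigit() for ch in s)
    return bool(_PHONE_CHARS.fullmatch(s)) and digits >= 7


def smart_collapse_whitespace(s: str) -> str:
    """Collapse whitespace runs only in cells that do not look structured."""
    if looks_structured(s):
        return s
    return re.sub(r"\s+", " ", s)


print(repr(smart_collapse_whitespace("hello   world")))
```

As in the commit, the sketch errs toward collapsing: anything the patterns fail to recognise is treated as free text.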
Michael	4687cf87b4	test: single-command runner, cross-platform automation, fixture auto-discovery Adds a top-level test infrastructure layer addressing four needs at once: a single command to run anything, cross-platform automation, install/e2e sanity, and zero-config pickup of new fixtures dropped into test-cases/. Top-level runner — run_tests.py python run_tests.py # everything (default) python run_tests.py --tool dedup # one tool's tests python run_tests.py --unit # category scopes python run_tests.py --e2e # end-to-end CLI python run_tests.py --install # import / dependency sanity python run_tests.py --fixtures # corpus + dropped-file sweep python run_tests.py --coverage # term-missing report python run_tests.py --quick # skip @pytest.mark.slow Tools: analyze, cli, config, dedup, io, normalizers, text_clean. Cross-platform — tox.ini Envs for py310-py313 plus install / e2e / fixtures / coverage / lint. Forces UTF-8 (PYTHONUTF8=1, PYTHONIOENCODING=utf-8) so identical fixture bytes parse the same on Linux/macOS/Windows. Shared config — pytest.ini testpaths, python_files conventions, custom markers (slow, e2e, install, fixture_sweep), warning filters that fail on our own DeprecationWarnings while tolerating third-party ones. New test layers tests/test_install.py — required deps import; project modules import; src.core public API surface; CLI --help exits 0; streamlit app.py parses as valid Python; run_tests.py --help works. tests/test_e2e.py — CLI roundtrips: cli_analyze table + JSON, cli_text_clean --apply writes a real file with NBSP/smart-quote folded, dedup CLI removes duplicates, run_tests.py self-tests. tests/test_fixtures_sweep.py — parametrizes over every CSV/TSV/XLSX inside test-cases/ (excluding text-cleaner-corpus/, which has its own suite). Each fixture must: load through repair_bytes, run analyze() cleanly, and survive clean_dataframe() with row/col counts unchanged plus idempotency. Drop a CSV in, re-run — no test code changes needed. 
tests/test_gap_coverage.py — closes audit gaps: clean_headers=False toggle, repair_bytes with tab/semicolon delimiters, BOM+NUL+smart- quote combined-fix scenario, analyze() over an XLSX path, sample_rows larger than the DataFrame, mid-cell BOM, findings_by_tool edges, plus a strict xfail documenting the known §4.17 numeric/phone whitespace heuristic gap. Test count Before: 288 passed + 1 xfailed After: 475 passed + 2 xfailed (the second xfail is the documented collapse_whitespace gap on phone-shaped cells; spec §4.17 calls for a heuristic that hasn't been implemented yet). Functional gaps surfaced (not fixed in this commit): - Text cleaner: collapse_whitespace runs unconditionally on every string cell, including phone/numeric/date-shaped ones. Spec §4.17 requires a skip heuristic. Captured as strict xfail so the gap stays visible. - io.read_file does not run pre-parse repair; only analyze() and direct callers of read_csv_repaired() get it. CLI tool pages and the dedup CLI miss the safety net. - Analyzer has no mixed_line_endings detector or near_duplicate_rows detector; both planned but require additional plumbing. - GUI tool pages each have their own uploader instead of picking up the home-page upload through session_state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:01:06 +00:00
Michael	a8943f29eb	feat(gui): wire analyzer into home page with findings panel and tool badges Home page (src/gui/app.py) gains an upload + analyze section above the tool grid: file uploader, "Run analysis" / "Skip" buttons, and a findings panel grouped by destination tool. Tool cards now carry a "N findings" badge when the active session's findings reference that tool, so the user sees at a glance which tools their just-uploaded file would benefit from. src/gui/components.py adds the shared GUI surface: - TOOL_DISPLAY_NAMES + tool_display_name() — single source of truth for GUI labels, keeping detector tool ids decoupled from the UI. - render_findings_panel(findings) — severity icons, expander per tool, open-tool page link, sample-cells dataframe. - upload_and_analyze_section() — the home-page widget; stashes file bytes and findings in session_state so future tool pages can pick up the existing upload instead of re-prompting. - findings_count_for_tool(tool_id) — used by app.py to badge cards. CSV/TSV uploads run through repair_bytes() before analysis, so the user also sees csv_bom_stripped / csv_smart_quotes_folded findings synthesized from the pre-parse repair pass. Excel uploads skip that step. The Text Cleaner tool card flips from "Coming Soon" to "Ready" — that has been true since the v3.0 implementation and the home page just hadn't been updated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:53:22 +00:00
Michael	5c62fb6117	feat(cli): src.cli_analyze — Typer CLI for the analyzer python -m src.cli_analyze input.csv # rich table per tool python -m src.cli_analyze input.csv --json # array of finding dicts python -m src.cli_analyze input.csv --strict # exit 1 on warn/error python -m src.cli_analyze input.csv -n 50000 # cap rows scanned Findings are grouped by destination tool so the user can see at a glance which tool to open next. Read-only; exit code 0 unless --strict is set. The CLI keeps its own tool-id -> display-name map so it doesn't depend on the GUI module. 7 tests cover: clean-file passthrough, dirty-file table, --json round-trip, missing-file (exit 2), --strict exit code, --sample-rows cap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:53:11 +00:00
Michael	edf6ccf90b	feat(analyze): upload-time data quality analyzer Pure, advisory scan over an uploaded file or DataFrame that returns a list of Finding objects naming each issue, the affected count, and which downstream tool can fix it. The GUI uses this to badge tool nav items at upload; the CLI will print findings as a table or JSON. src/core/analyze.py: Finding dataclass (id, severity, tool, count, description, column, samples) analyze(source, *, sample_rows=1000, repair_result=None) -> list[Finding] - source: DataFrame, path, or str. Path scans first 1000 rows. - When source is a path, runs the same pre-parse repair the tool pages will use; the resulting RepairResult is auto-surfaced as csv_* findings. A caller-supplied repair_result wins so non-default repair flags are respected. Detectors (each independent, samples capped at 5): - smart_punctuation_in_data -> 02 - nbsp_or_unicode_whitespace -> 02 - zero_width_or_invisible -> 02 - dirty_column_headers -> 02 - whitespace_padding -> 02 - null_like_sentinels -> 04 - suspected_mojibake -> 02 (Tier 2) - mixed_case_email_column -> 02 case op - leading_zero_ids -> informational, no tool Helpers: findings_by_tool() for sidebar grouping, to_dict() for JSON. Detectors are decoupled from the GUI display layer — they emit stable tool ids ("02_text_cleaner") and the GUI maps those to display names. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:41:36 +00:00
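A sketch of the data shape this commit introduces, using the seven Finding fields it lists. Field types, defaults, and the choice to make to_dict() a method are assumptions. The commit only names the helpers, not their signatures.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class Finding:
    id: str
    severity: str
    tool: Optional[str]          # stable tool id, or None for informational findings
    count: int
    description: str
    column: Optional[str] = None
    samples: List[str] = field(default_factory=list)

    def to_dict(self) -> dict:
        return asdict(self)


def findings_by_tool(findings: List[Finding]) -> Dict[Optional[str], List[Finding]]:
    """Group findings by destination tool, preserving input order."""
    grouped: Dict[Optional[str], List[Finding]] = defaultdict(list)
    for f in findings:
        grouped[f.tool].append(f)
    return dict(grouped)


findings = [
    Finding("whitespace_padding", "warn", "02_text_cleaner", 12, "Padded cells", "name", [" Bob "]),
    Finding("leading_zero_ids", "info", None, 3, "IDs with leading zeros", "zip"),
]
print(json.dumps([f.to_dict() for f in findings], indent=2))
```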
Michael	b8a9fa1b09	feat(io): pre-parse CSV repair (BOM/NUL/smart-quotes/unquoted-delim) Some pollution patterns block pandas before the cell-level cleaner can run. Add a pre-parse pass on raw bytes that fixes only what breaks parsing, and returns a structured action log the GUI/CLI can surface to the user. repair_bytes(raw, *, encoding, delimiter, fold_quotes, strip_nul, repair_delims): 1. Strip leading UTF-8 BOM. 2. Strip embedded NUL bytes (the C parser truncates fields at NUL). 3. Fold smart double quotes (curly, guillemet, double-prime) to ASCII '"'. Curly singles are NOT folded here; they don't conflict with CSV and the cell-level cleaner handles them more accurately. 4. Per-row repair when one rogue delimiter is embedded in a field that looks like currency or thousands-grouped digits. Tiered scoring keeps " $1,500.00 ,7" unambiguous: the strict currency regex match wins over the loose digit/sigil heuristic. read_csv_repaired(path) -> (DataFrame, RepairResult). RepairResult exposes .actions, .unrepairable_lines, and a summary() grouped by kind. Out of scope for this pass: encoding repair, delimiter conversion, multi- delimiter merges (k>1) — logged as unrepairable so callers can see what was left alone instead of silently parsing wrong. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:37:49 +00:00
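Steps 1 to 3 of the pass described above operate on raw bytes and are easy to illustrate. The sketch below assumes UTF-8 input, uses hypothetical action names, and leaves out step 4 (the per-row delimiter repair). It is not the project's repair_bytes.

```python
from typing import List, Tuple

UTF8_BOM = b"\xef\xbb\xbf"

# Smart double quotes named in the commit, as UTF-8 byte sequences:
# curly left/right, guillemets, double prime.
_SMART_DOUBLE = [
    "\u201c".encode("utf-8"),
    "\u201d".encode("utf-8"),
    "\u00ab".encode("utf-8"),
    "\u00bb".encode("utf-8"),
    "\u2033".encode("utf-8"),
]


def repair_head_steps(raw: bytes) -> Tuple[bytes, List[str]]:
    """Apply BOM strip, NUL strip and smart-double-quote fold; log each action."""
    actions: List[str] = []
    if raw.startswith(UTF8_BOM):
        raw = raw[len(UTF8_BOM):]
        actions.append("bom_stripped")
    if b"\x00" in raw:
        raw = raw.replace(b"\x00", b"")
        actions.append("nul_stripped")
    folded = raw
    for seq in _SMART_DOUBLE:
        folded = folded.replace(seq, b'"')
    if folded != raw:
        raw = folded
        actions.append("smart_quotes_folded")
    # Curly single quotes are deliberately left for the cell-level cleaner.
    return raw, actions


dirty = UTF8_BOM + "name,note\nab\x00c,\u201chi\u201d it\u2019s\n".encode("utf-8")
clean, log = repair_head_steps(dirty)
print(log)
```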
Michael	c349a90e18	test: add text-cleaner corpus and close gaps surfaced by it The 21-fixture corpus (test-cases/text-cleaner-corpus/) exercises the cleaner end-to-end against the spec in TEST-CASES.md. Closing the failing cases drove five small cleaner fixes plus two fixture-generation fixes: - _SMART_CHARS: add prime, double prime, guillemets (case 03) - _ZERO_WIDTH: add soft hyphen U+00AD (case 05) - clean_dataframe: clean column headers via the same pipeline (cases 16/19/20), with a clean_headers toggle on CleanOptions - smart_title_case: title-case full-shout strings ("ALICE SMITH" -> "Alice Smith") while still preserving embedded acronyms; preserve uppercase after apostrophe in names ("O'CONNOR" -> "O'Connor", "o'neil" -> "O'neil") - test_corpus.py reader: pre-strip NUL bytes (C parser truncates at NUL, python engine is too strict about embedded literal "), per spec case 06 - generate_test_data.py: properly CSV-escape literal-quote cells in case 03 expected; quote the rogue-comma price field in case 17 input Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:37:35 +00:00
Michael	54f92ae47e	feat: implement text cleaner (script 02) with CLI, GUI, and tests Builds 02_text_cleaner.py from stub to working: character-level hygiene for CSV/Excel inputs covering trim, whitespace collapse, smart-character folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char strip, line-ending normalization, and per-column case conversion. Three presets (minimal/excel-hygiene/paranoid) keep the buyer surface small. - src/core/text_clean.py: pure helpers + CleanOptions/CleanResult + clean_dataframe with dtype-safe column selection - src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape (dry-run by default, --apply writes cleaned + changes audit, JSON config save/load) - src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset picker, advanced toggles, preview, before/after metrics, and three download buttons - tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests covering edge cases E1-E50 from the spec - samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10 in 10 rows - test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case fixtures Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7 entry locking the spec, CLI-REFERENCE.md gains the text cleaner section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md status row 02 promoted Skeleton -> Working. 200/200 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:14:15 +00:00
Michael	b2ca04e6f4	fix: scale app content to 85% zoom Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 01:30:58 +00:00
Michael	223148283d	revert: remove 75% zoom, 100% fits correctly with chrome hidden Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 01:29:42 +00:00
Michael	1c609214b0	fix: scale app content to 75% to fit window Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 01:28:12 +00:00
Michael	dc48578c7e	feat: launch Chrome in app mode for chromeless window python -m src.gui now opens Chrome with --app flag, hiding the address bar, tabs, and bookmarks bar. Falls back to default browser if Chrome is not found. Headless flag passed via CLI so streamlit run directly still auto-opens normally. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 01:24:54 +00:00
Michael	28bda8d624	fix: remove headless=true so browser opens on launch Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 01:23:07 +00:00
Michael	35ea21ad33	feat: hide Streamlit chrome for app-like appearance Add shared hide_streamlit_chrome() helper that removes header bar, hamburger menu, footer, and deploy button via CSS injection. Called on every page. Add .streamlit/config.toml with minimal toolbar mode. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 01:20:54 +00:00
Michael	f2fdc10af7	feat: refactor GUI to multi-page Streamlit app with 9 tool pages Convert single-page deduplicator into a multi-page suite. Home page shows tool card grid. Deduplicator extracted to its own page (fully working). 8 stub pages added for Text Cleaner, Format Standardizer, Missing Values, Column Mapper, Outlier Detector, Multi-File Merger, Validator & Reporter, and Pipeline Runner — each with functional file upload and coming-soon UI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 01:16:12 +00:00
Michael	9ec371a85f	docs: update all documentation to reflect v3.0 functionality Update README, CLI reference, and developer guide to cover delimiter selector, inline checkboxes/dropdowns, live surviving rows preview, multi-row survivors, and apply_review_decisions(). Remove dead link. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 00:58:38 +00:00
Michael	27fe87c4fe	fix: simplify upload placeholder text Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 00:56:32 +00:00
Michael	8f1fb690ae	chore: bump version to v3.0 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 00:54:37 +00:00
Michael	ec9f100e67	feat: add custom delimiter input and update subtitle text Delimiter dropdown now includes "Other" option with a text input for custom delimiter characters. Subtitle updated to mention delimited text. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 00:46:12 +00:00
Michael	310bea08bf	feat: add delimiter selector for CSV/TSV files in GUI Auto-detects delimiter on upload and shows a selectbox with comma, tab, semicolon, and pipe options. Changing re-reads the file immediately. Line terminators (Windows/Unix/Mac) already handled by universal newlines. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 00:30:50 +00:00
Michael	24ae566ec4	fix: hide Deploy button from Streamlit toolbar Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 00:25:26 +00:00
Michael	f97b633d4c	feat: add live surviving rows preview in match group editor Shows a read-only preview of the output rows below the editor, updating as checkboxes and column dropdowns are changed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 00:17:34 +00:00
Michael	e672488d50	fix: default Keep checkbox to algorithm-selected survivor only Only the row chosen by the survivor rule (first, last, most-recent, etc.) is checked by default. Other rows start unchecked. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 00:15:13 +00:00
Michael	d368cad89d	feat: inline checkboxes and column dropdowns in match group editor Replace separate checkbox row and "Customize columns" toggle with a unified st.data_editor grid — Keep checkboxes at the start of each row, differing columns render as inline selectbox dropdowns. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-29 00:10:16 +00:00
Michael	863fe89f2c	feat: multi-row survivor support in match group review Replace radio + Merge/Keep Both buttons with per-row checkboxes and a single Confirm button. Users can now: - Keep all rows (not duplicates) — check all, confirm - Merge to one row — uncheck all but one, optionally customize columns - Split a group — keep some rows, remove others (new capability) Decision format changed from {action, survivor_idx, overrides} to {keep_indices, overrides}. apply_review_decisions() updated to handle all three modes. Batch actions updated accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-28 23:52:45 +00:00
Michael	debb0cb516	feat: per-group survivor selection and column cherry-picking in GUI Each match group card now has: - Radio button to pick which row to keep as the base survivor - "Customize columns" toggle showing only columns that differ - Per-column selectbox to pick values from any row in the group - Decisions stored as {action, survivor_idx, overrides} dicts Added apply_review_decisions() that builds the final DataFrame by applying survivor selection + column overrides without re-running the dedup engine. Batch actions also use the new dict format. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-28 23:47:25 +00:00
Michael	39e139d777	fix: prevent match group expanders from collapsing on button click Replace st.rerun() with on_click callbacks so decisions write to session state before the natural rerun. Decided groups auto-collapse with status in the label; undecided groups stay expanded. Added undo button on decided groups. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-28 23:25:12 +00:00
Michael	b871ab24fc	feat: add documentation, Streamlit GUI, and full source tree - Rewrite README.md with project overview, quick-start, and CLI summary - Add docs/CLI-REFERENCE.md with full flag reference and 8 recipe sections - Add docs/DEVELOPER.md with architecture, data flow, and extension guides - Rewrite src/core/__init__.py with public API exports and module docstring - Add Streamlit GUI (src/gui/) with file upload, advanced options, interactive match group review with side-by-side diff, and download buttons - Add .gitignore, requirements.txt, all source code, tests, and sample data - Add streamlit to requirements.txt Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-28 23:06:39 +00:00
Michael	0613dc420c	docs: add project documentation files Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-28 22:02:07 +00:00
giteadmin	a23f7a9b6f	Initial commit	2026-04-28 21:59:55 +00:00

243 Commits