Commit Graph

127 Commits

Author SHA1 Message Date
5c62fb6117 feat(cli): src.cli_analyze — Typer CLI for the analyzer
python -m src.cli_analyze input.csv             # rich table per tool
python -m src.cli_analyze input.csv --json      # array of finding dicts
python -m src.cli_analyze input.csv --strict    # exit 1 on warn/error
python -m src.cli_analyze input.csv -n 50000    # cap rows scanned

Findings are grouped by destination tool so the user can see at a glance
which tool to open next. Read-only; exit code 0 unless --strict is set.
The CLI keeps its own tool-id -> display-name map so it doesn't depend on
the GUI module.

7 tests cover: clean-file passthrough, dirty-file table, --json round-trip,
missing-file (exit 2), --strict exit code, --sample-rows cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:53:11 +00:00
edf6ccf90b feat(analyze): upload-time data quality analyzer
Pure, advisory scan over an uploaded file or DataFrame that returns a list of
Finding objects naming each issue, the affected count, and which downstream
tool can fix it. The GUI uses this to badge tool nav items at upload; the CLI
will print findings as a table or JSON.

src/core/analyze.py:
  Finding dataclass (id, severity, tool, count, description, column, samples)
  analyze(source, *, sample_rows=1000, repair_result=None) -> list[Finding]
    - source: DataFrame, path, or str. Path scans first 1000 rows.
    - When source is a path, runs the same pre-parse repair the tool pages
      will use; the resulting RepairResult is auto-surfaced as csv_*
      findings. A caller-supplied repair_result wins so non-default repair
      flags are respected.
  Detectors (each independent, samples capped at 5):
    - smart_punctuation_in_data        -> 02
    - nbsp_or_unicode_whitespace       -> 02
    - zero_width_or_invisible          -> 02
    - dirty_column_headers             -> 02
    - whitespace_padding               -> 02
    - null_like_sentinels              -> 04
    - suspected_mojibake               -> 02 (Tier 2)
    - mixed_case_email_column          -> 02 case op
    - leading_zero_ids                 -> informational, no tool
  Helpers: findings_by_tool() for sidebar grouping, to_dict() for JSON.

Detectors are decoupled from the GUI display layer — they emit stable tool
ids ("02_text_cleaner") and the GUI maps those to display names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:41:36 +00:00
b8a9fa1b09 feat(io): pre-parse CSV repair (BOM/NUL/smart-quotes/unquoted-delim)
Some pollution patterns block pandas before the cell-level cleaner can run.
Add a pre-parse pass on raw bytes that fixes only what breaks parsing, and
returns a structured action log the GUI/CLI can surface to the user.

repair_bytes(raw, *, encoding, delimiter, fold_quotes, strip_nul, repair_delims):
  1. Strip leading UTF-8 BOM.
  2. Strip embedded NUL bytes (the C parser truncates fields at NUL).
  3. Fold smart double quotes (curly, guillemet, double-prime) to ASCII '"'.
     Curly singles are NOT folded here; they don't conflict with CSV and the
     cell-level cleaner handles them more accurately.
  4. Per-row repair when one rogue delimiter is embedded in a field that
     looks like currency or thousands-grouped digits. Tiered scoring keeps
     "  $1,500.00  ,7" unambiguous: the strict currency regex match wins
     over the loose digit/sigil heuristic.

read_csv_repaired(path) -> (DataFrame, RepairResult). RepairResult exposes
.actions, .unrepairable_lines, and a summary() grouped by kind.

Out of scope for this pass: encoding repair, delimiter conversion, multi-
delimiter merges (k>1) — logged as unrepairable so callers can see what was
left alone instead of silently parsing wrong.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:37:49 +00:00
c349a90e18 test: add text-cleaner corpus and close gaps surfaced by it
The 21-fixture corpus (test-cases/text-cleaner-corpus/) exercises the cleaner
end-to-end against the spec in TEST-CASES.md. Closing the failing cases drove
five small cleaner fixes plus two fixture-generation fixes:

- _SMART_CHARS: add prime, double prime, guillemets (case 03)
- _ZERO_WIDTH: add soft hyphen U+00AD (case 05)
- clean_dataframe: clean column headers via the same pipeline (cases 16/19/20),
  with a clean_headers toggle on CleanOptions
- smart_title_case: title-case full-shout strings ("ALICE SMITH" -> "Alice
  Smith") while still preserving embedded acronyms; preserve uppercase after
  apostrophe in names ("O'CONNOR" -> "O'Connor", "o'neil" -> "O'neil")
- test_corpus.py reader: pre-strip NUL bytes (C parser truncates at NUL,
  python engine is too strict about embedded literal "), per spec case 06
- generate_test_data.py: properly CSV-escape literal-quote cells in case 03
  expected; quote the rogue-comma price field in case 17 input

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:37:35 +00:00
54f92ae47e feat: implement text cleaner (script 02) with CLI, GUI, and tests
Builds 02_text_cleaner.py from stub to working: character-level hygiene
for CSV/Excel inputs covering trim, whitespace collapse, smart-character
folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char
strip, line-ending normalization, and per-column case conversion. Three
presets (minimal/excel-hygiene/paranoid) keep the buyer surface small.

- src/core/text_clean.py: pure helpers + CleanOptions/CleanResult +
  clean_dataframe with dtype-safe column selection
- src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape
  (dry-run by default, --apply writes cleaned + changes audit, JSON
  config save/load)
- src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset
  picker, advanced toggles, preview, before/after metrics, and three
  download buttons
- tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests
  covering edge cases E1-E50 from the spec
- samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10
  in 10 rows
- test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case
  fixtures

Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7
entry locking the spec, CLI-REFERENCE.md gains the text cleaner
section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md
status row 02 promoted Skeleton -> Working.

200/200 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:14:15 +00:00
b2ca04e6f4 fix: scale app content to 85% zoom
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 01:30:58 +00:00
223148283d revert: remove 75% zoom, 100% fits correctly with chrome hidden
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 01:29:42 +00:00
1c609214b0 fix: scale app content to 75% to fit window
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 01:28:12 +00:00
dc48578c7e feat: launch Chrome in app mode for chromeless window
python -m src.gui now opens Chrome with --app flag, hiding the address
bar, tabs, and bookmarks bar. Falls back to default browser if Chrome
is not found. Headless flag passed via CLI so streamlit run directly
still auto-opens normally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 01:24:54 +00:00
28bda8d624 fix: remove headless=true so browser opens on launch
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 01:23:07 +00:00
35ea21ad33 feat: hide Streamlit chrome for app-like appearance
Add shared hide_streamlit_chrome() helper that removes header bar,
hamburger menu, footer, and deploy button via CSS injection. Called
on every page. Add .streamlit/config.toml with minimal toolbar mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 01:20:54 +00:00
f2fdc10af7 feat: refactor GUI to multi-page Streamlit app with 9 tool pages
Convert single-page deduplicator into a multi-page suite. Home page shows
tool card grid. Deduplicator extracted to its own page (fully working).
8 stub pages added for Text Cleaner, Format Standardizer, Missing Values,
Column Mapper, Outlier Detector, Multi-File Merger, Validator & Reporter,
and Pipeline Runner — each with functional file upload and coming-soon UI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 01:16:12 +00:00
9ec371a85f docs: update all documentation to reflect v3.0 functionality
Update README, CLI reference, and developer guide to cover delimiter
selector, inline checkboxes/dropdowns, live surviving rows preview,
multi-row survivors, and apply_review_decisions(). Remove dead link.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 00:58:38 +00:00
27fe87c4fe fix: simplify upload placeholder text
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 00:56:32 +00:00
8f1fb690ae chore: bump version to v3.0
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 00:54:37 +00:00
ec9f100e67 feat: add custom delimiter input and update subtitle text
Delimiter dropdown now includes "Other" option with a text input for
custom delimiter characters. Subtitle updated to mention delimited text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 00:46:12 +00:00
310bea08bf feat: add delimiter selector for CSV/TSV files in GUI
Auto-detects delimiter on upload and shows a selectbox with comma, tab,
semicolon, and pipe options. Changing re-reads the file immediately.
Line terminators (Windows/Unix/Mac) already handled by universal newlines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 00:30:50 +00:00
24ae566ec4 fix: hide Deploy button from Streamlit toolbar
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 00:25:26 +00:00
f97b633d4c feat: add live surviving rows preview in match group editor
Shows a read-only preview of the output rows below the editor,
updating as checkboxes and column dropdowns are changed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 00:17:34 +00:00
e672488d50 fix: default Keep checkbox to algorithm-selected survivor only
Only the row chosen by the survivor rule (first, last, most-recent, etc.)
is checked by default. Other rows start unchecked.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 00:15:13 +00:00
d368cad89d feat: inline checkboxes and column dropdowns in match group editor
Replace separate checkbox row and "Customize columns" toggle with a
unified st.data_editor grid — Keep checkboxes at the start of each row,
differing columns render as inline selectbox dropdowns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 00:10:16 +00:00
863fe89f2c feat: multi-row survivor support in match group review
Replace radio + Merge/Keep Both buttons with per-row checkboxes
and a single Confirm button. Users can now:

- Keep all rows (not duplicates) — check all, confirm
- Merge to one row — uncheck all but one, optionally customize columns
- Split a group — keep some rows, remove others (new capability)

Decision format changed from {action, survivor_idx, overrides} to
{keep_indices, overrides}. apply_review_decisions() updated to handle
all three modes. Batch actions updated accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 23:52:45 +00:00
debb0cb516 feat: per-group survivor selection and column cherry-picking in GUI
Each match group card now has:
- Radio button to pick which row to keep as the base survivor
- "Customize columns" toggle showing only columns that differ
- Per-column selectbox to pick values from any row in the group
- Decisions stored as {action, survivor_idx, overrides} dicts

Added apply_review_decisions() that builds the final DataFrame by
applying survivor selection + column overrides without re-running
the dedup engine. Batch actions also use the new dict format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 23:47:25 +00:00
39e139d777 fix: prevent match group expanders from collapsing on button click
Replace st.rerun() with on_click callbacks so decisions write to
session state before the natural rerun. Decided groups auto-collapse
with status in the label; undecided groups stay expanded. Added undo
button on decided groups.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 23:25:12 +00:00
b871ab24fc feat: add documentation, Streamlit GUI, and full source tree
- Rewrite README.md with project overview, quick-start, and CLI summary
- Add docs/CLI-REFERENCE.md with full flag reference and 8 recipe sections
- Add docs/DEVELOPER.md with architecture, data flow, and extension guides
- Rewrite src/core/__init__.py with public API exports and module docstring
- Add Streamlit GUI (src/gui/) with file upload, advanced options, interactive
  match group review with side-by-side diff, and download buttons
- Add .gitignore, requirements.txt, all source code, tests, and sample data
- Add streamlit to requirements.txt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 23:06:39 +00:00
0613dc420c docs: add project documentation files
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 22:02:07 +00:00
a23f7a9b6f Initial commit 2026-04-28 21:59:55 +00:00