Commit Graph

5 Commits

Author SHA1 Message Date
edf6ccf90b feat(analyze): upload-time data quality analyzer
Pure, advisory scan over an uploaded file or DataFrame that returns a list of
Finding objects naming each issue, the affected count, and which downstream
tool can fix it. The GUI uses this to badge tool nav items at upload; the CLI
will print findings as a table or JSON.

src/core/analyze.py:
  Finding dataclass (id, severity, tool, count, description, column, samples)
  analyze(source, *, sample_rows=1000, repair_result=None) -> list[Finding]
    - source: DataFrame, path, or str. Path scans first 1000 rows.
    - When source is a path, runs the same pre-parse repair the tool pages
      will use; the resulting RepairResult is auto-surfaced as csv_*
      findings. A caller-supplied repair_result wins so non-default repair
      flags are respected.
  Detectors (each independent, samples capped at 5):
    - smart_punctuation_in_data        -> 02
    - nbsp_or_unicode_whitespace       -> 02
    - zero_width_or_invisible          -> 02
    - dirty_column_headers             -> 02
    - whitespace_padding               -> 02
    - null_like_sentinels              -> 04
    - suspected_mojibake               -> 02 (Tier 2)
    - mixed_case_email_column          -> 02 case op
    - leading_zero_ids                 -> informational, no tool
  Helpers: findings_by_tool() for sidebar grouping, to_dict() for JSON.

Detectors are decoupled from the GUI display layer — they emit stable tool
ids ("02_text_cleaner") and the GUI maps those to display names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:41:36 +00:00
b8a9fa1b09 feat(io): pre-parse CSV repair (BOM/NUL/smart-quotes/unquoted-delim)
Some pollution patterns block pandas before the cell-level cleaner can run.
Add a pre-parse pass on raw bytes that fixes only what breaks parsing, and
returns a structured action log the GUI/CLI can surface to the user.

repair_bytes(raw, *, encoding, delimiter, fold_quotes, strip_nul, repair_delims):
  1. Strip leading UTF-8 BOM.
  2. Strip embedded NUL bytes (the C parser truncates fields at NUL).
  3. Fold smart double quotes (curly, guillemet, double-prime) to ASCII '"'.
     Curly singles are NOT folded here; they don't conflict with CSV and the
     cell-level cleaner handles them more accurately.
  4. Per-row repair when one rogue delimiter is embedded in a field that
     looks like currency or thousands-grouped digits. Tiered scoring keeps
     "  $1,500.00  ,7" unambiguous: the strict currency regex match wins
     over the loose digit/sigil heuristic.

read_csv_repaired(path) -> (DataFrame, RepairResult). RepairResult exposes
.actions, .unrepairable_lines, and a summary() grouped by kind.

Out of scope for this pass: encoding repair, delimiter conversion, multi-
delimiter merges (k>1) — logged as unrepairable so callers can see what was
left alone instead of silently parsing wrong.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:37:49 +00:00
c349a90e18 test: add text-cleaner corpus and close gaps surfaced by it
The 21-fixture corpus (test-cases/text-cleaner-corpus/) exercises the cleaner
end-to-end against the spec in TEST-CASES.md. Closing the failing cases drove
five small cleaner fixes plus two fixture-generation fixes:

- _SMART_CHARS: add prime, double prime, guillemets (case 03)
- _ZERO_WIDTH: add soft hyphen U+00AD (case 05)
- clean_dataframe: clean column headers via the same pipeline (cases 16/19/20),
  with a clean_headers toggle on CleanOptions
- smart_title_case: title-case full-shout strings ("ALICE SMITH" -> "Alice
  Smith") while still preserving embedded acronyms; preserve uppercase after
  apostrophe in names ("O'CONNOR" -> "O'Connor", "o'neil" -> "O'neil")
- test_corpus.py reader: pre-strip NUL bytes (C parser truncates at NUL,
  python engine is too strict about embedded literal "), per spec case 06
- generate_test_data.py: properly CSV-escape literal-quote cells in case 03
  expected; quote the rogue-comma price field in case 17 input

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:37:35 +00:00
54f92ae47e feat: implement text cleaner (script 02) with CLI, GUI, and tests
Builds 02_text_cleaner.py from stub to working: character-level hygiene
for CSV/Excel inputs covering trim, whitespace collapse, smart-character
folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char
strip, line-ending normalization, and per-column case conversion. Three
presets (minimal/excel-hygiene/paranoid) keep the buyer surface small.

- src/core/text_clean.py: pure helpers + CleanOptions/CleanResult +
  clean_dataframe with dtype-safe column selection
- src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape
  (dry-run by default, --apply writes cleaned + changes audit, JSON
  config save/load)
- src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset
  picker, advanced toggles, preview, before/after metrics, and three
  download buttons
- tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests
  covering edge cases E1-E50 from the spec
- samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10
  in 10 rows
- test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case
  fixtures

Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7
entry locking the spec, CLI-REFERENCE.md gains the text cleaner
section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md
status row 02 promoted Skeleton -> Working.

200/200 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:14:15 +00:00
b871ab24fc feat: add documentation, Streamlit GUI, and full source tree
- Rewrite README.md with project overview, quick-start, and CLI summary
- Add docs/CLI-REFERENCE.md with full flag reference and 8 recipe sections
- Add docs/DEVELOPER.md with architecture, data flow, and extension guides
- Rewrite src/core/__init__.py with public API exports and module docstring
- Add Streamlit GUI (src/gui/) with file upload, advanced options, interactive
  match group review with side-by-side diff, and download buttons
- Add .gitignore, requirements.txt, all source code, tests, and sample data
- Add streamlit to requirements.txt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 23:06:39 +00:00