
Michael 26b9771625 feat(errors): structured error hierarchy + helpful messages everywhere

Introduces src/core/errors.py with a small structured error hierarchy
that every public entry point now uses. Each error carries the
context a user needs to fix it and the context a maintainer needs to
trace it.

The hierarchy:
  DataToolsError  (base — formats path, column, operation, suggestion)
    InputValidationError  (extends ValueError — bad arg / wrong type)
    ConfigError           (extends ValueError — bad config / options)
    FileFormatError       (extends ValueError — file is not what we expected)
    FileAccessError       (extends OSError   — file I/O failure)

Subclassing the stdlib bases means existing `except OSError` /
`except ValueError` handlers still catch them — no breaking change.
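
A minimal sketch of how a hierarchy like this might be laid out. The constructor signature, the keyword-only context fields, and the message layout are assumptions for illustration; they are not taken from src/core/errors.py.

```python
class DataToolsError(Exception):
    """Base error that carries optional context fields (assumed names)."""

    def __init__(self, message, *, path=None, column=None,
                 operation=None, suggestion=None):
        super().__init__(message)
        self.message = message
        self.path = path
        self.column = column
        self.operation = operation
        self.suggestion = suggestion

    def __str__(self):
        # Append only the context fields that were actually supplied.
        lines = [self.message]
        for label in ("path", "column", "operation", "suggestion"):
            value = getattr(self, label)
            if value is not None:
                lines.append(f"  {label}: {value}")
        return "\n".join(lines)


# Each subclass also inherits a stdlib base, so older handlers keep working.
class InputValidationError(DataToolsError, ValueError):
    pass


class ConfigError(DataToolsError, ValueError):
    pass


class FileFormatError(DataToolsError, ValueError):
    pass


class FileAccessError(DataToolsError, OSError):
    pass
```

Because the stdlib base sits in each subclass's MRO, a caller that already wrote `except OSError` or `except ValueError` catches the new types without modification.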

Helpers:
- ensure_dataframe(value, function=...)  — uniform DataFrame guard
- ensure_choice(value, name=, choices=)  — uniform enum/literal guard
- wrap_file_read(path, op, exc)          — tag OSError with hint + path
- wrap_file_write(path, op, exc)         — same, with Windows-aware tip
- format_for_user(exc, context=)         — user-facing string for st.error / stderr
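
As a rough illustration, two of these helpers could be sketched as follows. The parameter names come from the list above; the bodies, the error type raised by `ensure_choice`, and the exact output format are assumptions. The stand-in error classes are trimmed down so the example runs on its own.

```python
class DataToolsError(Exception):
    """Trimmed stand-in for the real base class (only one context field)."""

    def __init__(self, message, *, suggestion=None):
        super().__init__(message)
        self.message = message
        self.suggestion = suggestion


class InputValidationError(DataToolsError, ValueError):
    pass


def ensure_choice(value, *, name, choices):
    """Return value unchanged if allowed; otherwise raise, listing valid options."""
    if value not in choices:
        valid = ", ".join(repr(choice) for choice in choices)
        raise InputValidationError(
            f"{name}={value!r} is not a valid choice",
            suggestion=f"use one of: {valid}",
        )
    return value


def format_for_user(exc, *, context=None):
    """Build a single user-facing string for st.error or stderr."""
    prefix = f"{context}: " if context else ""
    if isinstance(exc, DataToolsError):
        text = exc.message
        if exc.suggestion:
            text += f" (suggestion: {exc.suggestion})"
        return prefix + text
    # Unrecognized errors fall back to "ClassName: message".
    return f"{prefix}{type(exc).__name__}: {exc}"
```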

Library hardening:
- io.read_file: missing files surface FileAccessError listing whether
  the parent directory exists, and the suggestion to check the path.
- io.read_file: chunk_size <= 0 now raises InputValidationError with
  a positive-integer suggestion.
- io._read_excel: openpyxl BadZipFile / InvalidFileException / pandas
  ValueError ("sheet not found") wrapped as FileFormatError listing
  the path and a "list sheets with list_sheets()" hint.
- io._detect_excel_header_row: bare except narrowed to specific
  openpyxl exceptions; falls back gracefully and logs at debug so
  the real error surfaces from pd.read_excel.
- io.write_file: OSError / PermissionError on to_csv/to_excel wrapped
  with file path and Windows-aware "file may be open in another
  program" hint.
- dedup._parse_date: bare `except Exception` narrowed to
  (TypeError, ValueError, OutOfBoundsDatetime); failed values
  logged at debug for survivor-selection forensics.
- dedup._select_survivor: KEEP_MOST_RECENT now raises
  InputValidationError instead of silently falling back to keep_first.
- dedup.deduplicate: input validation errors are InputValidationError
  with operation/column/suggestion fields.
- format_standardize.from_dict: invalid FieldType for a column raises
  ConfigError naming the column AND the bad value AND listing valid
  values; same for date_order / phone_format / etc.
- format_standardize.from_file: OSError / JSON decode wrapped with
  path AND line/column where parsing failed.
- format_standardize.to_file: TypeError on json.dumps wrapped as
  ConfigError with the suspected source (extra_abbreviations).
- format_standardize._apply_field_type: dispatcher's "unknown field
  type" branch now raises AssertionError (it's an internal invariant,
  not user error — a new enum value was added without a branch).
- format_standardize._resolve_column_types: missing-column error now
  InputValidationError with a "check for typos / unparsed header"
  suggestion.
- format_standardize.standardize_dataframe: ensure_dataframe at entry.
- text_clean.clean_dataframe: ensure_dataframe at entry.
- config.to_strategies: invalid Algorithm/NormalizerType wrapped as
  ConfigError naming the strategy index AND the column.
- config.to_survivor_rule: invalid SurvivorRule wrapped as ConfigError
  listing valid values.
- config.from_file: OSError / JSON decode wrapped (mirror of
  StandardizeOptions.from_file).
- fixes.repair_mojibake: ImportError on ftfy now logged at info level
  with the underlying ImportError so a corrupt-package vs not-installed
  distinction is visible in the logs.
- normalizers.normalize_phone: phonenumbers.NumberParseException now
  logged at debug when the digits-only fallback drops extension /
  country-code information — gives a trail when matching results
  look wrong.
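
The `from_file` items above describe wrapping a JSON decode failure with the path and the line/column where parsing stopped. A hedged sketch of that pattern, using only the stdlib: `load_json_config` is a hypothetical name, the `ConfigError` here is a simplified stand-in, and the OSError wrapping is left out for brevity.

```python
import json


class ConfigError(ValueError):
    """Simplified stand-in for the ConfigError described in this commit."""

    def __init__(self, message, *, path=None, suggestion=None):
        super().__init__(message)
        self.path = path
        self.suggestion = suggestion


def load_json_config(path):
    """Hypothetical loader: re-raise JSON parse failures with their location."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # JSONDecodeError exposes msg, lineno, and colno for the failure point.
        raise ConfigError(
            f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}",
            path=str(path),
            suggestion="fix the syntax at the reported position",
        ) from exc
```

Chaining with `from exc` keeps the original traceback available to maintainers while the user sees the shorter message.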

GUI / CLI surfaces:
- All 9 page handlers (`except Exception as e: st.error(...)`) now
  use format_for_user(), which renders DataToolsError fields nicely
  and falls back to "ClassName: message" for unrecognized errors.
- 2_Text_Cleaner and 3_Format_Standardizer additionally distinguish
  UnicodeDecodeError with a "re-save as UTF-8" suggestion before
  the generic handler.
- cli.py's "Error reading file" handler now uses format_for_user()
  and includes the input path in the prefix.

Tests:
- tests/test_errors.py — 22 new tests covering: base class formatting,
  stdlib inheritance, ensure_dataframe / ensure_choice helpers,
  wrap_file_read / wrap_file_write, format_for_user behavior, and
  end-to-end integration (missing file, missing dir, bad JSON, bad
  algorithm, bad enum, missing column).
- tests/test_audit_fixes.py + tests/test_io.py — updated 4 tests for
  the new exception types (InputValidationError replaces TypeError,
  FileAccessError extends OSError).

Full project suite: 1230 passed, 4 skipped, 17 xfailed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-01 02:35:42 +00:00

.streamlit

feat(gui): 1 GB upload cap + delimiter / encoding diversity caption

2026-04-29 21:23:21 +00:00

docs

docs: short-form numbered requirements list

2026-04-29 21:19:21 +00:00

samples

feat: implement text cleaner (script 02) with CLI, GUI, and tests

2026-04-29 15:14:15 +00:00

src

feat(errors): structured error hierarchy + helpful messages everywhere

2026-05-01 02:35:42 +00:00

test-cases

feat(format): per-cell standardizers + 199-row buyer corpus

2026-05-01 02:11:24 +00:00

tests

feat(errors): structured error hierarchy + helpful messages everywhere

2026-05-01 02:35:42 +00:00

.gitignore

fix: cross-tool audit findings + alignment with format standardizer

2026-05-01 02:11:57 +00:00

pytest.ini

test: single-command runner, cross-platform automation, fixture auto-discovery

2026-04-29 16:01:06 +00:00

README.md

docs: short-form numbered requirements list

2026-04-29 21:19:21 +00:00

requirements-dev.txt

feat: add documentation, Streamlit GUI, and full source tree

2026-04-28 23:06:39 +00:00

requirements.txt

feat: add documentation, Streamlit GUI, and full source tree

2026-04-28 23:06:39 +00:00

run_tests.py

feat(gate): CSV-normalization gate with confidence-tiered findings

2026-04-29 20:35:27 +00:00

tox.ini

test: single-command runner, cross-platform automation, fixture auto-discovery

2026-04-29 16:01:06 +00:00

README.md

DataTools

A bundle of Python data-cleaning tools for CSV and Excel files. Two scripts ship today; more are in development.

| # | Tool | What it does |
|---|------|--------------|
| 01 | Deduplicator | Find and remove duplicate rows with exact + fuzzy matching, smart normalization, and interactive review. |
| 02 | Text Cleaner | Trim whitespace, fold smart quotes, strip invisible / control characters, normalize Unicode, normalize line endings, optional case conversion. |

Deduplicator

Features

Zero-config start — auto-detects encoding, delimiters, headers, and match columns
Fuzzy matching — Jaro-Winkler, Levenshtein, and token set ratio algorithms
5 built-in normalizers — email (Gmail dot/plus), phone (E.164), name (titles/suffixes), address (USPS), string (whitespace/case)
Merge mode — fill missing fields in the surviving row from removed duplicates
4 survivor rules — keep first, last, most complete, or most recent row per group
Interactive review — inspect match groups with inline checkboxes and column dropdowns, cherry-pick values, preview surviving rows live
Config profiles — save and reload your settings as JSON for repeatable runs
Dual interface — full CLI for automation, Streamlit GUI for visual review
Dry-run by default — preview what would change before writing anything
Audit trail — every run produces a match groups report and timestamped log
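
To illustrate what the email normalizer's "Gmail dot/plus" handling might involve, here is a minimal sketch. `normalize_email` is a hypothetical function written for this example, not the bundle's own implementation, and it only treats `gmail.com` specially.

```python
def normalize_email(value):
    """Lowercase an address; for Gmail, drop dots and any +tag in the local part."""
    email = value.strip().lower()
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        # Not a well-formed address; return it lowercased and trimmed only.
        return email
    if domain == "gmail.com":
        # Gmail ignores dots and anything after "+" in the local part.
        local = local.split("+", 1)[0].replace(".", "")
    return f"{local}@{domain}"
```

With this, `John.Doe+news@Gmail.com` and `johndoe@gmail.com` normalize to the same key, so an exact match on the normalized column groups them as duplicates.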

Quick Start

Install

pip install -r requirements.txt

CLI

# Preview duplicates (dry run — no files written)
python -m src.cli customers.csv

# Remove duplicates and save the result
python -m src.cli customers.csv --apply

# Fuzzy-match names at 80% similarity, merge missing fields
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge --apply

# Interactively review each match group
python -m src.cli customers.csv --review --apply

GUI

streamlit run src/gui/app.py

Upload a file, click Find Duplicates, review match groups side-by-side, then download the cleaned result.

CLI Usage Summary

python -m src.cli INPUT_FILE [OPTIONS]

Options:
  --apply                  Write output files (default: preview only)
  --output, -o PATH        Output file path
  --subset, -s COLS        Columns to match on (comma-separated)
  --key, -k COLS           Strong-key columns for exact matching
  --fuzzy COLS             Columns to fuzzy-match
  --algorithm, -a ALG      levenshtein | jaro_winkler | token_set_ratio
  --threshold, -t N        Similarity threshold 0-100 (default: 85)
  --normalize COL:TYPE     Per-column normalizers (e.g., email:email,phone:phone)
  --survivor RULE          first | last | most-complete | most-recent
  --merge                  Fill missing fields from removed duplicates
  --review                 Interactively review each match group
  --config PATH            Load settings from a JSON config file
  --save-config PATH       Save current settings to JSON
  --sheet NAME             Excel sheet name or 0-based index
  --encoding ENC           Override auto-detected encoding
  --header-row N           0-based header row index
  --help                   Show full help
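
The `--normalize COL:TYPE` value is a comma-separated list of column/normalizer pairs. One way such a value could be parsed is sketched below; `parse_normalize_spec` is a hypothetical helper, and the set of accepted type names is an assumption based on the five normalizers listed under Features.

```python
# Assumed type names, taken from the Features list (not from the source code).
VALID_TYPES = ("email", "phone", "name", "address", "string")


def parse_normalize_spec(spec):
    """Turn 'email:email,phone:phone' into {'email': 'email', 'phone': 'phone'}."""
    mapping = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        # Split on the last colon so column names may themselves contain colons.
        column, sep, kind = item.rpartition(":")
        column, kind = column.strip(), kind.strip()
        if not sep or not column or kind not in VALID_TYPES:
            valid = ", ".join(VALID_TYPES)
            raise ValueError(
                f"bad --normalize entry {item!r}; expected COL:TYPE with TYPE in: {valid}"
            )
        mapping[column] = kind
    return mapping
```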

Sample Output

$ python -m src.cli samples/messy_sales.csv

Reading messy_sales.csv...
  50 rows, 8 columns
Finding duplicates...

──────────────────────────────────────────────────
  File:      messy_sales.csv
  Rows in:   50
  Rows out:  28
  Removed:   22
  Groups:    22
──────────────────────────────────────────────────

Match groups:
  Group 1: rows [1, 2] → keep row 1 (confidence: 100.0%, matched on: email)
  Group 2: rows [3, 4] → keep row 3 (confidence: 92.3%, matched on: name, phone)
  ...

This was a preview. Add --apply to write the output files.

Output Files

When --apply is used, three files are produced:

| File | Contents |
|------|----------|
| `{input}_deduplicated.csv` | Cleaned data with duplicates removed |
| `{input}_removed.csv` | Rows that were removed |
| `{input}_match_groups.csv` | Audit trail: group ID, confidence, matched columns, survivor flag |

Text Cleaner

Character-level hygiene for messy CSV / Excel input. Solves the dirty-data failure modes that silently break VLOOKUPs, dedup runs, and downstream imports:

Trailing / leading whitespace and tabs in cells
Non-breaking spaces (U+00A0) hiding inside text where regular spaces should be
Smart quotes pasted from Word (“ ” ‘ ’ → " " ' ')
Em / en dashes, ellipsis, other typographic Unicode
Zero-width and bidi-mark characters (U+200B, U+200C, U+200D, etc.)
BOMs from Excel "Save As CSV UTF-8"
Mixed line endings (\r\n, bare \r) inside multi-line cells
Control characters (U+0000-U+001F minus \t \n \r)
Optional Unicode NFC / NFKC normalization
Optional per-column case conversion (UPPER / lower / smart Title / Sentence)
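
Several of the items above can be handled with the standard library alone. The following is a rough sketch of a per-cell cleaning pass, not the Text Cleaner's actual code: `clean_cell`, the order of operations, and the small fold table are assumptions chosen for illustration.

```python
import re
import unicodedata

# Typographic characters folded to ASCII (a small illustrative subset).
PUNCTUATION_FOLDS = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # ellipsis
    "\u00a0": " ",                  # non-breaking space
})

# Zero-width characters, bidi marks, and the BOM (U+FEFF).
INVISIBLE = re.compile(r"[\u200b\u200c\u200d\u200e\u200f\ufeff]")
# C0 control characters except tab, newline, and carriage return.
CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_cell(text):
    """Apply NFC, fold punctuation, strip invisibles, unify line endings, trim."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(PUNCTUATION_FOLDS)
    text = INVISIBLE.sub("", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = CONTROL.sub("", text)
    return text.strip()
```

NFC is used here because it is lossless for most text; NFKC also folds compatibility characters such as ligatures and full-width forms, which is why the paranoid preset describes it as lossy.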

# Preview what would change (dry-run)
python -m src.cli_text_clean samples/messy_text.csv

# Apply the safe defaults
python -m src.cli_text_clean samples/messy_text.csv --apply

# Title-case the name column, upper-case the SKU column
python -m src.cli_text_clean products.csv --case title:name,upper:sku --apply

# Just trim and collapse — nothing fancy
python -m src.cli_text_clean messy.csv --preset minimal --apply

Three presets: minimal (trim + collapse only), excel-hygiene (default; everything safe ON), paranoid (adds lossy NFKC fold).

Outputs {input}_cleaned.csv plus a per-cell {input}_changes.csv audit (row, column, old, new, ops applied).

See docs/CLI-REFERENCE.md for every flag.

Review & Normalize gate

Every uploaded file passes through a CSV-normalization gate before any tool page sees it. The analyzer scans for ~15 issue types — whitespace pollution, NBSP / zero-width chars, mixed line endings, BOM artifacts, encoding misdetections, smart punctuation, dirty headers, null sentinels, mojibake, and more — and tags each finding by confidence (high / medium / low) and fix action (the algorithm in src/core/fixes.py that resolves it).
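
To make the idea of confidence-tagged findings concrete, here is a small sketch of what an analyzer pass could produce. Everything in it is hypothetical: the `Finding` fields, the issue names, the confidence assigned to each check, and the fix-action names are assumptions and do not reflect the schema in docs/TECHNICAL.md or the functions in src/core/fixes.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    issue: str        # short issue identifier, e.g. "nbsp"
    confidence: str   # "high", "medium", or "low"
    fix_action: str   # hypothetical name of the fix that would resolve it
    count: int        # number of affected cells


def scan_cells(cells):
    """Hypothetical analyzer pass: count a few issue types across cell strings."""
    checks = [
        ("bom", "high", "strip_bom",
         lambda s: s.startswith("\ufeff")),
        ("nbsp", "high", "replace_nbsp",
         lambda s: "\u00a0" in s),
        ("edge_whitespace", "high", "trim_whitespace",
         lambda s: s != s.strip()),
        ("null_sentinel", "medium", "blank_null_sentinels",
         lambda s: s.strip().lower() in {"n/a", "null", "none"}),
    ]
    findings = []
    for issue, confidence, fix_action, predicate in checks:
        count = sum(1 for cell in cells if predicate(cell))
        if count:
            findings.append(Finding(issue, confidence, fix_action, count))
    return findings
```

A review page could then render one card per `Finding` and let the user choose whether to apply the named fix.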

In the GUI, the Review & Normalize page renders one expandable card per finding with a decision control (Auto-fix / Skip / Customize), a live before-and-after preview, an encoding-override picker for misdetected codepages, and an Advanced output options block (encoding, delimiter, line terminator) for the download. Tool pages refuse to load until the gate passes.

See docs/USER-GUIDE.md §3.3 for the user-facing walkthrough and docs/TECHNICAL.md §10.2.1–10.2.4 for the developer-facing API.

Documentation

Requirements — short-form numbered list: file size, codepages, delimiters, detectors, performance targets
User Guide — installation, GUI workflow, the Review & Normalize gate
CLI Reference — every flag with examples and recipe sections
Technical — architecture, gate internals, finding schema, fix registry
Developer Guide — extending the bundle, adding fixes / detectors

Requirements

Python 3.10+
Dependencies: pandas, openpyxl, rapidfuzz, typer, phonenumbers, loguru, tqdm, charset-normalizer

License

